How to ingest XML or MP4 data from Hadoop into Apache Kafka as a stream using an HDFS source connector

Hello ladies and gentlemen, I'm new to the Confluent Platform and also to Kafka.

I'm currently finishing my degree in the field of distributed data analysis and data streaming. Let me briefly explain my setup: I work with Apache Hadoop, where data (Xsense data, mp4) from multiple computers is stored in configured form. I would like to use Apache Kafka and the Kafka Connect API to continuously ingest data from Hadoop into Kafka as events. I work on my local computer on the Hadoop cluster and run the Apache Kafka platform there. I've tried a lot with connectors, and Confluent comes up most often, which is why I'm asking the experts here for help. I am wondering if I have to completely install the Confluent Platform on my local computer so that the HDFS source connector can work. On the Kafka platform, after a connector has been created with org.apache.kafka.connect.file.FileStreamSourceConnector, no data is received when reading with kafka-console-consumer.sh.
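
For reference, the FileStream test was set up roughly like this (file and topic names are just the stock quickstart placeholders, not my real data):

# connect-file-source.properties (placeholder file and topic)
name=local-file-source
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/tmp/test.txt
topic=connect-test

# run the standalone worker with that config, then read the topic back
bin/connect-standalone.sh config/connect-standalone.properties connect-file-source.properties
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic connect-test --from-beginning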

Nevertheless, the following error occurs with io.confluent.connect.hdfs3.Hdfs3SourceConnector on the local computer, running plain Kafka without Docker:

curl -i -X POST -H "Accept: application/json" -H "Content-Type: application/json" localhost:8083/connectors/ -d '{"name": "test_hdfs", "config": {"connector.class": "io.confluent.connect.hdfs3.Hdfs3SourceConnector", "kerberos.ticket.renew.period.ms": "3600000", "topic.prefix": "testt_hdfs", "hdfs.url": "hdfs://localhost:9000", "hdfs.authentication.kerberos": "false", "hadoop.conf.dir": "/usr/local/hadoop/etc/hadoop", "storage.class": "io.confluent.connect.s3.storage.S3Storage", "hadoop.home": "/usr/local/hadoop"}}'

HTTP/1.1 500 Internal Server Error
Date: Mon, 20 Sep 2021 22:49:02 GMT
Content-Type: application/json
Content-Length: 2716
Server: Jetty (9.4.24.v20191120)

{"error_code": 500, "message": "Failed to find any class that implements Connector and which name matches io.confluent.connect.hdfs3.Hdfs3SourceConnector, available connectors are: PluginDesc{klass=class org.apache.kafka.connect.file.FileStreamSinkConnector, name='org.apache.kafka.connect.file.FileStreamSinkConnector', version='2.6.0', encodedVersion=2.6.0, type=sink, typeName='sink', location='classpath'}, PluginDesc{klass=class org.apache.kafka.connect.file.FileStreamSourceConnector, name='org.apache.kafka.connect.file.FileStreamSourceConnector', version='2.6.0', encodedVersion=2.6.0, type=source, typeName='source', location='classpath'}, PluginDesc{klass=class org.apache.kafka.connect.mirror.MirrorCheckpointConnector, name='org.apache.kafka.connect.mirror.MirrorCheckpointConnector', version='1', encodedVersion=1, type=source, typeName='source', location='classpath'}, PluginDesc{klass=class org.apache.kafka.connect.mirror.MirrorHeartbeatConnector, name='org.apache.kafka.connect.mirror.MirrorHeartbeatConnector', version='1', encodedVersion=1, type=source, typeName='source', location='classpath'}, PluginDesc{klass=class org.apache.kafka.connect.mirror.MirrorSourceConnector, name='org.apache.kafka.connect.mirror.MirrorSourceConnector', version='1', encodedVersion=1, type=source, typeName='source', location='classpath'}, PluginDesc{klass=class org.apache.kafka.connect.tools.MockConnector, name='org.apache.kafka.connect.tools.MockConnector', version='2.6.0', encodedVersion=2.6.0, type=connector, typeName='connector', location='classpath'}, PluginDesc{klass=class org.apache.kafka.connect.tools.MockSinkConnector, name='org.apache.kafka.connect.tools.MockSinkConnector', version='2.6.0', encodedVersion=2.6.0, type=sink, typeName='sink', location='classpath'}, PluginDesc{klass=class org.apache.kafka.connect.tools.MockSourceConnector, name='org.apache.kafka.connect.tools.MockSourceConnector', version='2.6.0', encodedVersion=2.6.0, type=source, typeName='source', location='classpath'}, PluginDesc{klass=class org.apache.kafka.connect.tools.SchemaSourceConnector, name='org.apache.kafka.connect.tools.SchemaSourceConnector', version='2.6.0', encodedVersion=2.6.0, type=source, typeName='source', location='classpath'}, PluginDesc{klass=class org.apache.kafka.connect.tools.VerifiableSinkConnector, name='org.apache.kafka.connect.tools.VerifiableSinkConnector', version='2.6.0', encodedVersion=2.6.0, type=source, typeName='source', location='classpath'}, PluginDesc{klass=class org.apache.kafka.connect.tools.VerifiableSourceConnector, name='org.apache.kafka.connect.tools.VerifiableSourceConnector', version='2.6.0', encodedVersion=2.6.0, type=source, typeName='source', location='classpath'}"}

Any help would be greatly appreciated.
Thanks

Hi,

could you provide some details about your environment?

how did you install confluent platform?
which version is in place?
which components are installed?

Good morning again,

To be honest, I only work with Ubuntu 20.04.2; I downloaded the Apache Kafka 2.13-2.6.0 package and tried these various Confluent connectors without the Confluent Platform. What I'm trying to say is that the Confluent Platform has not yet been installed on my Ubuntu machine. Hence my question whether the Confluent Platform has to be installed completely for the connector class io.confluent.connect.hdfs3.Hdfs3SourceConnector to work. I'm sorry, I'm new and a little confused about what to do next.
Thanks for your help again

hi,

I see.
so I think there are two possibilities:

  1. use your current installation and configure kafka connect manually
    → more manual and a bit harder way :wink:

  2. use the Confluent quickstart to get your Kafka environment running https://docs.confluent.io/platform/current/quickstart/ce-quickstart.html#ce-quickstart and then configure Kafka Connect accordingly
    → easier to set up the connect plugins with confluent-hub and so on

I think both options are worth trying. If you'd like to dig into the Kafka and Kafka Connect internals (and have some time left :wink: ) I would go for option 1.

If you’re aiming for a quick win go for option 2.
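
Very roughly, option 1 looks like this (paths and the plugin archive name are placeholders, adjust them to whatever you download from Confluent Hub):

# unzip the connector downloaded from Confluent Hub into a dedicated plugin directory
mkdir -p /usr/local/share/kafka/plugins
unzip confluentinc-kafka-connect-hdfs3-source-<version>.zip -d /usr/local/share/kafka/plugins/

# point the worker at that directory in config/connect-distributed.properties
plugin.path=/usr/local/share/kafka/plugins

# restart the worker so it rescans the plugin path
bin/connect-distributed.sh config/connect-distributed.properties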


Thank you for your assistance. I will look at both options from tomorrow onwards, but I think I will start with the Confluent Platform. Thank you very much for that.


sounds great
if there are further questions let me know :slight_smile:

Good evening again,

As discussed yesterday, today I dealt with the confluent platform as described in this link: https://docs.confluent.io/platform/current/quickstart/ce-quickstart.html#ce-quickstart

After that, everything started without a problem:
Starting zookeeper
Zookeeper is [UP]
Starting Kafka
Kafka is [UP]
Starting Schema Registry
Schema Registry is [UP]
Starting Kafka REST
Kafka REST is [UP]
Starting Connect
Connect is [UP]
I also performed this configuration on my second computer. In the end I could start the distributed Connect workers on both. The path to the plugins was added to the configuration file of the distributed workers on both computers, and the HDFS datanodes were also started. When I tried to create my HDFS connector I got an error: it says the connector was created, but my worker shows an error. I don't know what it could be.
Below you can see the output of the distributed workers after the connector was created via the REST API.

At the very bottom you can see how I tried to create the HDFS connector with the REST API so that I can read the data with the console consumer. It appears that the connector was created, but it is not visible among the existing topics, and when you run the console consumer it simply remains stuck without a response.

so could you please provide some details

how did you install the Hdfs3SourceConnector? (it seems something is still missing)

did you install with

“confluent-hub install…”

or did you follow the manual installation path?

would you please share the config file of your kafka connect workers?
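
It would also help to check what the worker itself reports via the REST API, e.g. (connector name taken from your earlier call):

# list registered connectors and the status of the failing one
curl -s localhost:8083/connectors | jq
curl -s localhost:8083/connectors/test_hdfs/status | jq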

Good day again,

So I installed the HDFS source connector manually via the download from HDFS 3 Source Connector | Confluent Hub, and then the path was added to the Kafka Connect worker config. Here in my case:
plugin.path = /usr/local/share/kafka/plugins/kafka-connect-hdfs3-source-1.4.6/lib

But I have to admit that the confluent-hub command

confluent-hub install --no-prompt confluentinc/kafka-connect-datagen:latest

as in step 7 of the link https://docs.confluent.io/platform/current/quickstart/ce-quickstart.html#ce-quickstart was also carried out.

Below you can see my current Kafka Connect worker config in distributed mode.
At the bottom you can see the path to the connector after the manual download and extraction.
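
Trimmed down, the worker config looks roughly like this (apart from plugin.path, the values are assumed defaults):

# connect-distributed.properties (trimmed)
bootstrap.servers=localhost:9092
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.topic=connect-offsets
config.storage.topic=connect-configs
status.storage.topic=connect-status
plugin.path=/usr/local/share/kafka/plugins/kafka-connect-hdfs3-source-1.4.6/lib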

Hi,

plugin path should be
/usr/local/share/kafka/plugins/

no worries about the datagen
it’s also possible to install the hdfs source plugin like this
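
for example (plugin id as listed on Confluent Hub, assuming the HDFS 3 source connector):

confluent-hub install confluentinc/kafka-connect-hdfs3-source:latest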

Hi,
I changed the path: the jar files from the lib directory were copied into the plugins directory.
plugin.path = /usr/local/share/kafka/plugins/
Although all jar files were copied into the plugins directory as you said, as soon as the worker is started and the REST API tries to create the HDFS connector, the error still appears.

And below are the jar files from the Confluent HDFS directory, which have now been copied into the newly created plugins directory.

My guess is that the problem is in the REST API call with the curl command.

My distributed config file was also updated with
plugin.path = /usr/local/share/kafka/plugins/

hey,

would you please provide

  • your connector configuration
  • the REST call you used

Here is my connector configuration, a JSON file:

'{"name": "hd_second", "config": {"connector.class": "io.confluent.connect.hdfs3.Hdfs3SourceConnector", "kerberos.ticket.renew.period.ms": "3600000", "topic": "hd_first", "hdfs.url": "hdfs://hadoop-master:9000", "hdfs.authentication.kerberos": "false", "hadoop.conf.dir": "/usr/local/hadoop/etc/hadoop/", "format.class": "io.confluent.connect.s3.format.json.JsonFormat", "hdfs.namesode.principal": "", "connect.hdfs.keytab": "", "connect.hdfs.principal": "", "storage.class": "io.confluent.connect.s3.storage.S3Storage", "hadoop.home": "/usr/local/hadoop"}}'

I used curl as a REST client, entering the following command in the terminal:

curl -i -X POST -H "Accept: application/json" -H "Content-Type: application/json" localhost:8083/connectors/ -d '{"name": "hd_second", "config": {"connector.class": "io.confluent.connect.hdfs3.Hdfs3SourceConnector", "kerberos.ticket.renew.period.ms": "3600000", "topic": "hd_first", "hdfs.url": "hdfs://hadoop-master:9000", "hdfs.authentication.kerberos": "false", "hadoop.conf.dir": "/usr/local/hadoop/etc/hadoop/", "format.class": "io.confluent.connect.s3.format.json.JsonFormat", "hdfs.namesode.principal": "", "connect.hdfs.keytab": "", "connect.hdfs.principal": "", "storage.class": "io.confluent.connect.s3.storage.S3Storage", "hadoop.home": "/usr/local/hadoop"}}'

and the connector is created on the Confluent Platform, but the topics are neither listed nor readable with the console consumer.
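
The checks were along these lines (topic name taken from the config above):

bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic hd_first --from-beginning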

Thanks for your help!

is the error still the same?

what does

curl localhost:8083/connector-plugins | jq

give you as output?

it might be worth starting over with a new env or VM to have a clean starting point.

The command curl localhost:8083/connector-plugins | jq gives:

developer@hadoop-master:~/Downloads/confluent-2.0.0$ curl localhost:8083/connector-plugins | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   295  100   295    0     0  98333      0 --:--:-- --:--:-- --:--:-- 98333
parse error: Invalid numeric literal at line 2, column 0
developer@hadoop-master:~/Downloads/confluent-2.0.0$

and it is returned with 200 OK, but the distributed worker still shows the error:

ERROR Couldn’t instantiate connector hd_second because it has an invalid connector configuration. This connector will not execute until reconfigured. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:642)
org.apache.kafka.common.config.ConfigException: Invalid value io.confluent.connect.hdfs3.Hdfs3SourceConnector for configuration connector.class: Class io.confluent.connect.hdfs3.Hdfs3SourceConnector could not be found.

With a new environment or VM, what do you mean specifically?

basically I was thinking about “a new clean start”:
delete all the Confluent/Kafka related settings you have made and start over to have a proper starting point.
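
And once you are back to a clean state, it's worth double-checking that the worker actually sees the plugin before creating the connector, e.g. (plugin path taken from your config; the jq filter just lists the reported class names):

ls /usr/local/share/kafka/plugins/
curl -s localhost:8083/connector-plugins | jq '.[].class'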

One question:
As your folder in the output above is named confluent-2.0.0:

Which Confluent version did you download?

Good Morning,

I was busy with a fresh start yesterday: the Confluent Platform had been downloaded into the Downloads directory, so I deleted all of it and downloaded a new copy, but the same error occurred. The connector class io.confluent.connect.hdfs.HdfsSourceConnector was still not found after the distributed worker was started.

I actually downloaded Confluent 2.0.0 from the beginning, and yesterday the same version was downloaded again.

Could it be due to the version? Maybe I should try a different one?

hey,

where did you download the confluent platform from?
current release is confluent 6.2…

see
https://www.confluent.io/installation


I downloaded Confluent 2.0.0 from a Confluent site, but as I can see now, the current version is much more recent, namely 6.2. I'll look into that.
thank you again for your availability and all this help


Hello again

First of all I wanted to thank you for all the tips you gave me. Unfortunately, after all possible attempts there were always errors, and I came to the conclusion that my system does not meet Confluent's system requirements.
https://docs.confluent.io/platform/current/installation/system-requirements.html#system-requirements
According to the system requirements, the versions Confluent 3.3 to Confluent 6.2 are only supported up to Ubuntu 18.04, yet Ubuntu 20.04 is installed on my computer. I am desperate because the Hadoop cluster runs on Ubuntu 20.04 on my master node, and I'm afraid that if I switch to Ubuntu 18.04, Hadoop will no longer work. Would you have a few tips on how to handle this? I thank you again very much.