How to ingest XML or MP4 data from Hadoop into Apache Kafka as a stream using an HDFS source connector

Hi,

I think it should also work with Ubuntu 20.04.
I’m currently running my dev/demo environments on Ubuntu 20.04 with several
Confluent and open-source Kafka versions.

Nevertheless I would suggest the following:
Configure Kafka Connect and start with a simpler connector like the FileStreamSource connector (no dependencies such as a running Hadoop cluster and so on); a minimal example config is sketched below.
If everything works as expected, go a step further and try the HDFS connector.

See the quickstart:
https://docs.confluent.io/platform/current/connect/quickstart.html
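
For example, a FileStreamSource config could look roughly like this (connector name, file path and topic are just placeholders, adjust them to your setup):

curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '{
  "name": "file-source-test",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "/tmp/test.txt",
    "topic": "test"
  }
}'

Every line you append to /tmp/test.txt should then show up as a message in the test topic.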

Hello again, I’m sorry to bother you.

I followed your advice and stayed with Ubuntu 20.04. Yesterday I started from scratch, and this time I worked with the local Confluent CLI. All services are started. I was even able to install the HDFS source connector via Confluent Hub, and it recognizes the connector class Hdfs3SourceConnector, but not FileStream, although it is listed, as you can see on the screen.
Creating the HDFS connector returns 200, but the command
./bin/kafka-console-consumer --topic test --bootstrap-server localhost:9092
just gets stuck. There is no Hadoop running here, only the Confluent services. Maybe there is another command to read the data in Hadoop, or a connector configuration, so that the data in the Kafka topic can be read. The HDFS source connector works, but you don’t get any data.

See the pictures below


Here is the output with the HDFS source connector:
developer@hadoop-master:~/Downloads/confluent-6.2.1$ curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '{ "name": "1hdfs3-source", "config": { "connector.class": "io.confluent.connect.hdfs3.Hdfs3SourceConnector", "tasks.max": "1", "hdfs.url": "hdfs://localhost:9000/", "format.class": "io.confluent.connect.hdfs3.format.json.JsonFormat", "confluent.topic.bootstrap.servers": "localhost:9092", "confluent.topic.replication.factor": "1", "topic": "test" }}'
HTTP/1.1 201 Created
Date: Wed, 29 Sep 2021 16:14:45 GMT
Location: http://localhost:8083/connectors/1hdfs3-source
Content-Type: application/json
Content-Length: 381
Server: Jetty(9.4.43.v20210629)

{"name":"1hdfs3-source","config":{"connector.class":"io.confluent.connect.hdfs3.Hdfs3SourceConnector","tasks.max":"1","hdfs.url":"hdfs://localhost:9000/","format.class":"io.confluent.connect.hdfs3.format.json.JsonFormat","confluent.topic.bootstrap.servers":"localhost:9092","confluent.topic.replication.factor":"1","topic":"test","name":"1hdfs3-source"},"tasks":[],"type":"source"}

With the command ./bin/kafka-console-consumer --topic test --bootstrap-server localhost:9092
it remains without any result the whole time.

Hmm, according to the docs you also need the HDFS sink connector:

The Kafka Connect HDFS 3 Source connector provides the capability to read data exported to HDFS 3 by the Kafka Connect HDFS 3 Sink connector and publish it back to a Kafka topic

https://docs.confluent.io/kafka-connect-hdfs3-source/current/overview.html#hdfs-3-source-connector-for-cp
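
So the source connector only picks up data that was previously written to HDFS by the sink connector. A rough sketch of such a sink config, assuming the same topic and HDFS URL as in your source config (the connector name and flush.size are just placeholders, and you’d also add a format.class matching the format you want to write; please check the sink connector docs for the exact property names):

curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '{
  "name": "hdfs3-sink",
  "config": {
    "connector.class": "io.confluent.connect.hdfs3.Hdfs3SinkConnector",
    "tasks.max": "1",
    "topics": "test",
    "hdfs.url": "hdfs://localhost:9000/",
    "flush.size": "3",
    "confluent.topic.bootstrap.servers": "localhost:9092",
    "confluent.topic.replication.factor": "1"
  }
}'

It’s also worth checking whether your source connector task is actually running, e.g. with
curl localhost:8083/connectors/1hdfs3-source/status
If the task is in FAILED state, the exception should show up there.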

Hello again,

Thanks again for the tip.

As far as I understand it, the Hdfs3SinkConnector should first be used to write the data to HDFS in Hadoop, and then the Hdfs3SourceConnector is used to read the written data back from Hadoop.

My problem now is writing the data with the producer console, and I have data types like MP4 and XML in my Hadoop.

I am researching how to write the MP4 data to Hadoop through the producer console. Most of what I have seen so far are just plain messages or a self-written schema, as shown below:

./bin/kafka-avro-console-producer --broker-list localhost:9092 --topic parquet_field_hdfs \
--property value.schema='{"type": "record", "name": "myrecord", "fields": [{"name": "name", "type": "string"}, {"name": "address", "type": "string"}, {"name": "age", "type": "int"}, {"name": "is_customer", "type": "boolean"}]}'

and then paste each of these messages:

{"name": "Peter", "address": "Mountain View", "age": 27, "is_customer": true}
{"name": "David", "address": "Mountain View", "age": 37, "is_customer": false}
{"name": "Kat", "address": "Palo Alto", "age": 30, "is_customer": true}
{"name": "David", "address": "San Francisco", "age": 35, "is_customer": false}
{"name": "Leslie", "address": "San Jose", "age": 26, "is_customer": true}
{"name": "Dani", "address": "Seattle", "age": 32, "is_customer": false}
{"name": "Kim", "address": "San Jose", "age": 30, "is_customer": true}
{"name": "Steph", "address": "Seattle", "age": 31, "is_customer": false}

or like this:

./bin/kafka-avro-console-producer --broker-list localhost:9092 --topic test_hdfs \
--property value.schema='{"type": "record", "name": "myrecord", "fields": [{"name": "f1", "type": "string"}]}'

and then paste each of these messages:

{"f1": "value1"}
{"f1": "value2"}
{"f1": "value3"}

I don’t know if you have experience with other data like MP4 or XML, which are also in my local Hadoop.

Thanks for everything so far.

Just to get this right:

you are trying to push your MP4 data (stored in Hadoop) to your Kafka cluster, right?

Yes, exactly.
I now have a Python client that can stream data from the local filesystem, but I would like to push the data from Hadoop into Kafka for streaming.

Here is the Python source code:

ok I see

I’m not a developer, but if my understanding is correct, you won’t necessarily need Kafka Connect.
The code snippets would allow writing directly to Kafka, correct?

That’s right. Since I had difficulties with the HDFS source connector and had no idea how to write the video data and the MVNX sensor data with the producer, I then found this Python script, but it can only stream data from the local computer. What I’m trying to do is write the data stored in Hadoop to Kafka and read it from there.

ok, I see.
I need to try Hadoop → Kafka with Kafka Connect myself.
I’ll keep you posted, but I need to start a small Hadoop setup first :wink:
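
In the meantime, one idea you could try without Kafka Connect, at least for text data like XML (I haven’t tested this myself, and the HDFS path is just a placeholder): pipe a file straight out of HDFS into the console producer, e.g.

hdfs dfs -cat /user/developer/data/sample.xml | ./bin/kafka-console-producer --topic test --bootstrap-server localhost:9092

Keep in mind that the console producer turns every line into a separate message, and for binary data like MP4 this approach won’t work.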

ahh ok all right

Hadoop multi-node was a bit difficult for me, but at the moment I’m also trying to save text files or Excel files to Hadoop and load them from Kafka. If you need anything about Hadoop, feel free to ask.