How to ingest XML or MP4 data from Hadoop into Apache Kafka as a stream using an HDFS source connector

Hi,

I think it should also work with Ubuntu 20.04.
I’m currently running my dev/demo environments on Ubuntu 20.04 with several
Confluent and open-source Kafka versions.

Nevertheless I would suggest the following:
Configure Kafka Connect and start with a simpler connector like the FileStreamSource connector (no dependencies such as a running Hadoop cluster and so on); a minimal example config is sketched below.
If everything works as expected, go a step further and try the HDFS connector.

See the quickstart:
https://docs.confluent.io/platform/current/connect/quickstart.html
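
For example, a FileStreamSource config could look roughly like this (connector name, file path and topic are just placeholders, adjust them to your setup):

curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '{
  "name": "file-source-test",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "/tmp/test.txt",
    "topic": "test"
  }
}'

Every line you append to /tmp/test.txt should then show up as a message in the test topic.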

Hello again, I’m sorry to bother you.

I followed your advice and stayed with Ubuntu 20.04. Yesterday I started from scratch, and this time I worked with the local Confluent CLI. All services are started. I was even able to install the HDFS source connector via Confluent Hub, and it recognizes the connector class Hdfs3SourceConnector, but not FileStream, although it is listed, as you can see on the screen.
Creating the HDFS connector returns 200, but the command
./bin/kafka-console-consumer --topic test --bootstrap-server localhost:9092
just gets stuck. There is no Hadoop running here, only the Confluent services. Maybe there is another command to read the data in Hadoop, or a connector configuration, so that the data in the Kafka topic can be read. The HDFS source connector works, but you don’t get any data.

See the pictures below


Here is the output with the HDFS source connector:
developer@hadoop-master:~/Downloads/confluent-6.2.1$ curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '{ "name": "1hdfs3-source", "config": { "connector.class": "io.confluent.connect.hdfs3.Hdfs3SourceConnector", "tasks.max": "1", "hdfs.url": "hdfs://localhost:9000/", "format.class": "io.confluent.connect.hdfs3.format.json.JsonFormat", "confluent.topic.bootstrap.servers": "localhost:9092", "confluent.topic.replication.factor": "1", "topic": "test" }}'
HTTP/1.1 201 Created
Date: Wed, 29 Sep 2021 16:14:45 GMT
Location: http://localhost:8083/connectors/1hdfs3-source
Content-Type: application/json
Content-Length: 381
Server: Jetty(9.4.43.v20210629)

{"name":"1hdfs3-source","config":{"connector.class":"io.confluent.connect.hdfs3.Hdfs3SourceConnector","tasks.max":"1","hdfs.url":"hdfs://localhost:9000/","format.class":"io.confluent.connect.hdfs3.format.json.JsonFormat","confluent.topic.bootstrap.servers":"localhost:9092","confluent.topic.replication.factor":"1","topic":"test","name":"1hdfs3-source"},"tasks":[],"type":"source"}

With the command ./bin/kafka-console-consumer --topic test --bootstrap-server localhost:9092
it remains without any result the whole time.

Hmm, according to the docs you also need the HDFS sink connector:

The Kafka Connect HDFS 3 Source connector provides the capability to read data exported to HDFS 3 by the Kafka Connect HDFS 3 Sink connector and publish it back to a Kafka topic

https://docs.confluent.io/kafka-connect-hdfs3-source/current/overview.html#hdfs-3-source-connector-for-cp
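
So the source connector only picks up data that was previously written to HDFS by the sink connector. A rough sketch of such a sink config, assuming the same topic and HDFS URL as in your source config (the connector name and flush.size are just placeholders, and you’d also add a format.class matching the format you want to write; please check the sink connector docs for the exact property names):

curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" localhost:8083/connectors/ -d '{
  "name": "hdfs3-sink",
  "config": {
    "connector.class": "io.confluent.connect.hdfs3.Hdfs3SinkConnector",
    "tasks.max": "1",
    "topics": "test",
    "hdfs.url": "hdfs://localhost:9000/",
    "flush.size": "3",
    "confluent.topic.bootstrap.servers": "localhost:9092",
    "confluent.topic.replication.factor": "1"
  }
}'

It’s also worth checking whether your source connector task is actually running, e.g. with
curl localhost:8083/connectors/1hdfs3-source/status
If the task is in FAILED state, the exception should show up there.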

Hello again,

Thanks again for the tip.

As far as I understand it, the Hdfs3SinkConnector should first be used to write the data to HDFS in Hadoop, and then the Hdfs3SourceConnector is used to read the written data back from Hadoop.

My problem now is writing the data with the producer console, and I have data types like MP4 and XML in my Hadoop.

I am researching how to write the MP4 data to Hadoop through the producer console. Most of what I have seen so far are just plain messages or a self-written schema, as shown below:

./bin/kafka-avro-console-producer --broker-list localhost:9092 --topic parquet_field_hdfs \
--property value.schema='{"type": "record", "name": "myrecord", "fields": [{"name": "name", "type": "string"}, {"name": "address", "type": "string"}, {"name": "age", "type": "int"}, {"name": "is_customer", "type": "boolean"}]}'

and then paste each of these messages:

{"name": "Peter", "address": "Mountain View", "age": 27, "is_customer": true}
{"name": "David", "address": "Mountain View", "age": 37, "is_customer": false}
{"name": "Kat", "address": "Palo Alto", "age": 30, "is_customer": true}
{"name": "David", "address": "San Francisco", "age": 35, "is_customer": false}
{"name": "Leslie", "address": "San Jose", "age": 26, "is_customer": true}
{"name": "Dani", "address": "Seattle", "age": 32, "is_customer": false}
{"name": "Kim", "address": "San Jose", "age": 30, "is_customer": true}
{"name": "Steph", "address": "Seattle", "age": 31, "is_customer": false}

or like this:

./bin/kafka-avro-console-producer --broker-list localhost:9092 --topic test_hdfs \
--property value.schema='{"type": "record", "name": "myrecord", "fields": [{"name": "f1", "type": "string"}]}'

and then paste each of these messages:

{"f1": "value1"}
{"f1": "value2"}
{"f1": "value3"}

I don’t know if you have experience with other data like MP4 or XML, which are also in my local Hadoop.

Thanks for everything so far.

Just to get this right:

you are trying to push your MP4 data (stored in Hadoop) to your Kafka cluster, right?

Yes, exactly.
I now have a Python client that can stream data from the local filesystem, but I would like to push the data from Hadoop into Kafka for streaming.

Here is the Python source code:

ok I see

I’m not a developer, but if my understanding is correct, you won’t necessarily need Kafka Connect.
The code snippets would allow writing directly to Kafka, correct?

That’s right. Since I had difficulties with the HDFS source connector and had no idea how to write the video data and the MVNX sensor data with the producer, I then found this Python script, but it can only stream data from the local computer. What I’m trying to do is write the data stored in Hadoop to Kafka and read it from there.

ok, I see.
I need to try Hadoop → Kafka with Kafka Connect myself.
I’ll keep you posted, but I need to start a small Hadoop setup first :wink:
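
In the meantime, one idea you could try without Kafka Connect, at least for text data like XML (I haven’t tested this myself, and the HDFS path is just a placeholder): pipe a file straight out of HDFS into the console producer, e.g.

hdfs dfs -cat /user/developer/data/sample.xml | ./bin/kafka-console-producer --topic test --bootstrap-server localhost:9092

Keep in mind that the console producer turns every line into a separate message, and for binary data like MP4 this approach won’t work.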

ahh ok all right

Hadoop multi-node was a bit difficult for me, but at the moment I’m also trying to save text files or Excel files to Hadoop and load them from Kafka. If you need anything about Hadoop, feel free to ask.