I would like to load data from Kafka to files. For example, say we have 1 million messages in a Kafka topic. We would like to load those 1 million messages from Kafka into files, where each file holds 100 thousand records, so we would end up with 10 files containing the 1 million messages.
Is it possible to achieve this using a connector? Also, I noticed that the FileSinkConnector docs page says: "Confluent does not recommend the FileStream Connector for production use. If you want a production connector to read from and write to files, use a Spool Dir connector." But in the Spool Dir connector docs, I don't see any examples for a sink. Is it only for reading?
Hi, the Spool Dir connector is a source connector, so you can only use it to ingest data into Kafka.
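To make the direction concrete, here is a minimal sketch of a Spool Dir *source* configuration (files → Kafka), assuming the jcustenborder kafka-connect-spooldir plugin is installed; paths and the topic name are illustrative:

```json
{
  "name": "spooldir-csv-source",
  "config": {
    "connector.class": "com.github.jcustenborder.kafka.connect.spooldir.SpoolDirCsvSourceConnector",
    "topic": "ingested-topic",
    "input.path": "/data/input",
    "finished.path": "/data/finished",
    "error.path": "/data/error",
    "input.file.pattern": ".*\\.csv",
    "csv.first.row.as.header": "true",
    "schema.generation.enabled": "true"
  }
}
```

Note that every property here describes where to *read* files from, which is why there is no sink variant.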
The embedded FileStream connector is intended only for development and testing, and it does not split your sink output into multiple files.
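For reference, this is the standard standalone-mode config for the FileStream sink (the same shape as the Kafka quickstart example); note there is a single `file` property, so everything lands in one file:

```properties
name=local-file-sink
connector.class=org.apache.kafka.connect.file.FileStreamSinkConnector
tasks.max=1
topics=my-topic
file=/tmp/my-topic.txt
```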
It is definitely possible to do this with a connector; you can find several already implemented on GitHub (just search for "kafka connect file sink"). That said, writing to files is not a typical production use case, and it has some caveats: for example, you may want to split the files by partition rather than by record count, to preserve record ordering. May I ask what your use case is? If you want to use files as an intermediate step before writing to another system, there may be a connector that fits better.
I would support this point many times over. Landing data to files and reading data from files is often an anti-pattern when systems can produce to and consume from Kafka topics directly using the numerous client language libraries, the REST proxy, connectors, etc.
We were trying to use an ETL tool called Informatica PowerCenter with the Informatica connector to read the data, process it, and load it into Netezza targets. But we could only get binary data, so we parsed the binary data with Java code inside a Java transformation. However, the processing seems to be taking too much time.
We could also try the JDBC connector; that is another option we have. But I am not sure about the performance of JDBC for loading large volumes of data.
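If you do try the JDBC route, a sketch of a Confluent JDBC sink config might look like the following; the connection URL, credentials, and topic are placeholders, and you would need a Netezza JDBC driver on the plugin path:

```json
{
  "name": "jdbc-sink-netezza",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "my-topic",
    "connection.url": "jdbc:netezza://HOST:5480/MYDB",
    "connection.user": "etl_user",
    "connection.password": "********",
    "insert.mode": "insert",
    "batch.size": "3000",
    "auto.create": "false"
  }
}
```

Throughput is mostly governed by `batch.size`, `tasks.max`, and the number of topic partitions, so it is worth benchmarking against your actual volumes before committing to this path.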
Any suggestion on how to go about this is welcome!
In the infrastructure you illustrated, is the binary data you need to process produced by the "Source"? In that case, is your need to transform the data in Kafka before it is ingested by the ETL?
The source sends serialized data to Kafka. When we read from Kafka, the connector provided by the existing ETL tool is not able to unnest/flatten the JSON data, and the data does not have an embedded schema field. So we read it as binary, use Java code to deserialize it to a string, parse the JSON objects, and then do the flattening/unnesting operations. After that we write the data to the target tables.
By unnesting, I mean flattening the arrays to generate multiple rows. So, if we get data like the example below:
Data in Kafka topic:
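Since the sample payload did not make it into the thread, here is a small illustration of the unnesting step with made-up data (the `order_id`/`items` fields are hypothetical, not the actual schema): one message whose array field is exploded into one row per element.

```python
import json

def unnest(record: dict, array_field: str) -> list[dict]:
    """Expand one record into one row per element of record[array_field]."""
    rows = []
    for item in record.get(array_field, []):
        # Copy the scalar fields into every output row
        row = {k: v for k, v in record.items() if k != array_field}
        # Prefix nested keys so column names stay unambiguous after flattening
        for k, v in item.items():
            row[f"{array_field}_{k}"] = v
        rows.append(row)
    return rows

# Hypothetical message; the actual payload was not shown in the thread.
message = json.dumps({
    "order_id": 42,
    "items": [
        {"sku": "A1", "qty": 2},
        {"sku": "B7", "qty": 1},
    ],
})

rows = unnest(json.loads(message), "items")
# One input record with a 2-element array yields 2 flat rows
```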
If your ETL is focused only on deserialization and flattening, and there is a native Kafka connector for your destination, you may be able to use SMTs (Single Message Transforms) to do the job.
You will need to apply SetSchemaMetadata to your incoming JSON, then use Flatten or whatever other transformations you need before passing the records to the destination.
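A hedged sketch of the two SMTs chained in a sink connector config (the stock Kafka Connect transform classes are real; the schema name and delimiter are illustrative):

```json
{
  "transforms": "setSchema,flatten",
  "transforms.setSchema.type": "org.apache.kafka.connect.transforms.SetSchemaMetadata$Value",
  "transforms.setSchema.schema.name": "com.example.Order",
  "transforms.flatten.type": "org.apache.kafka.connect.transforms.Flatten$Value",
  "transforms.flatten.delimiter": "_"
}
```

One caveat worth checking for this use case: the stock `Flatten` transform flattens nested structs and maps into dotted/delimited field names, but it does not explode arrays into multiple records, so the array-unnesting step described above may still need custom code or a different transform.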