Hello,
Currently I am testing a Kafka / Confluent use case for IoT. The prototype is simple: I am collecting data from a pump (50 attributes) and producing this data every second into a topic.
Each message is run through an ML algorithm which tells me the probability that the pump will fail. This works like a charm. Currently I am using faust-streaming for this; I am going to take a look at Apache Flink in the coming weeks.
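For context, the producer side looks roughly like this (heavily simplified sketch; broker address, topic name and the payload fields are placeholders, the real message has ~50 attributes):

```python
import json
import time

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def read_pump():
    # Placeholder for the actual sensor read; the real reading has ~50 attributes.
    return {"timestamp": time.time(), "pressure": 4.2, "temperature": 71.3}

while True:
    reading = read_pump()
    # One JSON message per second into the topic
    producer.produce("pump-telemetry", value=json.dumps(reading).encode("utf-8"))
    producer.poll(0)  # serve delivery callbacks
    time.sleep(1)
```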
Besides real-time analytics, I want to use Kafka as some sort of SCADA system / data historian. SCADA systems allow users to view data for a specific period of time (last hour, last 24 hours, last week). I would like to use Kafka as a data historian, so for example if somebody wants to see the data from the past 6 hours, I can calculate the corresponding offset and consume all data from that offset to the end.
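The offset lookup I have in mind would be something like this, using offsets_for_times() from confluent-kafka (sketch only; topic name is a placeholder and I assume a single partition):

```python
import time

from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "historian-reader",
    "enable.auto.commit": False,
})

# Timestamp (in ms) for "6 hours ago"
six_hours_ago_ms = int((time.time() - 6 * 3600) * 1000)
tp = TopicPartition("pump-telemetry", 0, six_hours_ago_ms)

# offsets_for_times() returns, per partition, the first offset whose
# timestamp is at or after the given timestamp (offset is -1 if none exists).
start = consumer.offsets_for_times([tp])[0]
consumer.assign([TopicPartition("pump-telemetry", 0, start.offset)])
```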
Currently 600,000+ messages are stored in one Kafka topic as JSON, and the topic is 800+ MB in size. When I try to consume all the messages at once using the Python API, I only receive about 500 messages per poll, which is probably because of max.partition.fetch.bytes.
Simply increasing the fetch size doesn't seem right to me, but if I don't, I end up making about 1,200 polls (to get the data for the last week), which seems wrong to me as well. Is it a good idea to use Kafka for things like that, or should I store that data in some sort of table/database?
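The consuming loop is roughly of this shape (simplified; it continues from the consumer assigned in the snippet above and keeps polling batches until the high-water mark is reached):

```python
from confluent_kafka import TopicPartition

# High-water mark = offset of the next message to be produced
low, high = consumer.get_watermark_offsets(TopicPartition("pump-telemetry", 0))

raw_messages = []
while True:
    # Each fetch currently comes back with ~500 messages
    batch = consumer.consume(num_messages=500, timeout=1.0)
    if not batch:
        break
    for msg in batch:
        if msg.error():
            continue
        raw_messages.append(msg.value())
    if batch[-1].offset() >= high - 1:
        break
```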
Doing the 1,200 polls is quite fast: it takes about 3.5 seconds. Converting the JSON strings takes 10 seconds, and putting the data into a pandas DataFrame takes another 2 seconds. So in total it takes about 15.5 seconds until I can start visualizing the data, which is okay. But I am still wondering whether I am doing the right thing. Thanks for any advice.
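The conversion step is essentially: parse each JSON payload and build one DataFrame for visualization, e.g. (sketch; raw_messages is the list of JSON byte strings collected by the loop above):

```python
import json

import pandas as pd

records = [json.loads(value) for value in raw_messages]  # ~10 s for 600k messages
df = pd.DataFrame.from_records(records)                  # ~2 s
```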