Kafka as SCADA / Historian

Hello,

Currently I am testing the use case of Kafka / Confluent for IoT. The prototype is simple: I am collecting data from a pump (50 attributes) and producing this data every second into a topic.
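Roughly what the producer side looks like, as a minimal sketch (broker address, topic name, and the `read_pump()` helper are placeholders, not my real code):

```python
import json
import time
from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': 'localhost:9092'})  # placeholder broker

def read_pump() -> dict:
    # Placeholder for the real data acquisition; returns the ~50 pump attributes.
    return {'timestamp': time.time(), 'pressure': 4.2, 'temperature': 61.0}

while True:
    sample = read_pump()
    producer.produce('pump-readings', value=json.dumps(sample).encode('utf-8'))
    producer.poll(0)   # serve delivery callbacks
    time.sleep(1)      # one sample per second
```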

Each message is run through an ML algorithm which tells me the probability that the pump will fail. This works like a charm. Currently I am using faust-streaming for this; I am taking a look at Apache Flink in the next weeks.
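Conceptually the scoring agent looks something like this (a sketch only; the topic names and the `failure_probability()` stand-in for the ML model are assumptions):

```python
import faust

app = faust.App('pump-monitor', broker='kafka://localhost:9092')  # placeholder broker

raw_topic = app.topic('pump-readings')           # JSON samples from the producer
scores_topic = app.topic('pump-failure-scores')  # hypothetical output topic

def failure_probability(sample: dict) -> float:
    # Stand-in for the actual ML model; returns P(failure) for one sample.
    return 0.0

@app.agent(raw_topic)
async def score(samples):
    async for sample in samples:
        score = failure_probability(sample)
        await scores_topic.send(value={**sample, 'failure_probability': score})

if __name__ == '__main__':
    app.main()
```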

Besides real-time analytics, I want to use Kafka streaming as some sort of SCADA / data historian. SCADA systems allow users to view data for a specific period of time (last hour, last 24 hours, last week). I would like to use Kafka as a data historian, so for example if somebody wants to see the data from the past 6 hours, I can calculate the offset and consume all data from that offset to the end.
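The offset calculation can be done with `offsets_for_times()`. A minimal sketch with confluent-kafka, assuming a single-partition topic called `pump-readings` (names and broker are placeholders):

```python
import time
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',   # placeholder broker
    'group.id': 'historian-reader',          # hypothetical group id
    'enable.auto.commit': False,
})

topic, partition = 'pump-readings', 0
start_ts_ms = int((time.time() - 6 * 3600) * 1000)   # "last 6 hours"

# Translate the timestamp into the first offset at/after it, then read to the end.
start = consumer.offsets_for_times(
    [TopicPartition(topic, partition, start_ts_ms)], timeout=10.0)[0]
_low, high = consumer.get_watermark_offsets(TopicPartition(topic, partition))
consumer.assign([start])

records = []
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue                      # sketch only; no real error handling
    records.append(msg.value())
    if msg.offset() >= high - 1:      # reached the end of the partition
        break
consumer.close()
```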

Currently 600,000+ messages are stored in one Kafka topic as JSON; the size of the topic is 800+ MB. When I try to consume all the messages at once using the Python API, I only receive 500 messages per poll, which is probably because of max.partition.fetch.bytes.
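If the client is kafka-python (an assumption on my part), the per-poll batch is capped by `max_poll_records` (default 500) as well as the fetch byte limits; a sketch of raising them, with placeholder values:

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'pump-readings',                              # placeholder topic name
    bootstrap_servers='localhost:9092',           # placeholder broker
    max_poll_records=5000,                        # default is 500 records per poll
    max_partition_fetch_bytes=16 * 1024 * 1024,   # per-partition fetch limit (default 1 MB)
    fetch_max_bytes=64 * 1024 * 1024,             # total fetch limit per request
    auto_offset_reset='earliest',
    enable_auto_commit=False,
)
```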

Simply increasing the size of the fetches doesn’t seem right to me, but if I don’t, then I would make 1,200 polls (for getting the data for the last week), which seems wrong to me as well. Is it a good idea to use Kafka for things like that, or should I store this data in some sort of table/database?

Doing 1,200 polls is quite fast: it takes 3.5 seconds. Converting the JSON strings takes 10 seconds, and putting the data into a pandas DataFrame takes another 2 seconds. So in total it takes 15.5 seconds until I can start visualizing the data, which is okay. But I am still wondering if I am doing the right thing. Thanks for any advice here.
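For completeness, the conversion step is essentially this (given the list of raw message payloads collected above; the `timestamp` field name is an assumption):

```python
import json
import pandas as pd

def to_dataframe(raw_values: list[bytes]) -> pd.DataFrame:
    rows = [json.loads(v) for v in raw_values]   # the ~10 s step in the numbers above
    df = pd.DataFrame(rows)                      # the ~2 s step
    if 'timestamp' in df.columns:                # assumed field name
        df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
        df = df.set_index('timestamp').sort_index()
    return df
```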


@reyemb Love this use case. I know it’s been over a year since you posted this question, and apologies that it doesn’t look like anyone answered. I’m sure you have already figured out a solution by now, but for the benefit of other people who might read this: your outlined initial approach of calculating these temporal aggregates on demand is more aligned with how you would do this with a database. In the streaming world, the way you would handle those data-historian-oriented roll-ups is to continuously calculate them with a stream processor.

In your case you are using Faust, so you would probably have one or more processors calculating the windowed aggregates you want into a “table” and emitting the results to a new topic (see Tables and Windowing — Faust 1.9.0 documentation). Pretty much all stream processing frameworks can do this: Flink, Kafka Streams, ksqlDB.
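A minimal Faust sketch of that pattern, assuming a `pump_id` key and a `pressure` attribute (both hypothetical names), keeping hourly roll-ups for a week so that “last 24 h” / “last week” views read precomputed aggregates instead of replaying raw messages:

```python
import faust
from datetime import timedelta

app = faust.App('pump-rollups', broker='kafka://localhost:9092')  # placeholder broker
readings = app.topic('pump-readings')                             # raw JSON samples

hourly_sum = app.Table('pressure_sum', default=float).tumbling(
    timedelta(hours=1), expires=timedelta(days=7))
hourly_count = app.Table('pressure_count', default=int).tumbling(
    timedelta(hours=1), expires=timedelta(days=7))

@app.agent(readings)
async def aggregate(samples):
    async for sample in samples:
        key = sample.get('pump_id', 'pump-1')   # hypothetical key field
        hourly_sum[key] += sample['pressure']   # assumed attribute name
        hourly_count[key] += 1
        # hourly_sum[key].current() / hourly_count[key].current() gives the
        # running average for the current one-hour window.
```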

The other option is to introduce a time series database into the mix and send the samples from Kafka to it. Whether or not you need to do this depends on how much slicing and dicing of the windows and time ranges you want to do; at some point it becomes inefficient to precompute and store every possible windowed aggregate you might want for reporting.
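A Kafka Connect sink connector for whichever database you pick is usually the least-code way to do that; a hand-rolled version is just a plain consumer forwarding each sample, sketched here with a hypothetical `tsdb_write()` helper standing in for the real database client:

```python
import json
from confluent_kafka import Consumer

def tsdb_write(measurement: str, fields: dict, timestamp: float) -> None:
    # Hypothetical helper wrapping your time series database client
    # (InfluxDB, TimescaleDB, ...); not a real library call.
    ...

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',   # placeholder broker
    'group.id': 'tsdb-sink',                 # hypothetical group id
    'auto.offset.reset': 'earliest',
})
consumer.subscribe(['pump-readings'])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue                             # sketch only; no real error handling
    sample = json.loads(msg.value())
    tsdb_write('pump', sample, sample.get('timestamp'))
```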

Windowed aggregates have a lot of uses aside from just simple reporting of course. A great example in your space can be seen in this blog post from Keith Resar: Delivering Real-Time Manufacturing Predictive Maintenance