Solution for Persistent Storage for Subsequent Data Analysis

ElToro · 23 May 2023 12:12

We have an air-gapped Kafka-cluster that is used to gather data from about 100 sensors, each sending a few bytes of data per minute. We now need to structure the data in a permanent storage of some sort. The data will be used to build ML models.

What solution for permanent storage would you choose in this situation? We’re a small outfit with limited resources, so long and steep learning curves are best avoided. We were thinking of just building a PostgreSQL database and getting data from Kafka with a consumer (written in Python using the kafka library). Other possibilities (that we are aware of) are Apache Cassandra or Apache HBase, but we would have to read up on those…

Any input would be much appreciated.

OneCricketeer · 23 May 2023 21:35

Postgres (or TimescaleDB) should work fine for sensor data.

Depending on what queries you will run, you may want to try Apache Pinot or Druid before moving to Cassandra or HBase.

ElToro · 24 May 2023 08:17

Thank you, @OneCricketeer.

Looked further at TimescaleDB, Apache Pinot, and Apache Druid, and they all look promising. The focus on immutable data and the OLAP features of Pinot, and the fact that it has a Python client library, makes it particularly interesting for us. Furthermore, according to the docs, Pinot also has near real-time ingestion with Apache Kafka.

We do not know exactly what queries we will run just yet, so flexibility is important.

system · 31 May 2023 08:17

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Recording ready to view: SPEAKER Q&A THREAD: 14 September 2023 - From Streams to Insights: Apache Kafka® x Apache Pinot™ Events	0	1558	15 September 2023
Enabling Product Personalisation Using Apache Kafka, Apache Pinot and Trino [Kafka Summit 2022] Summit	0	3053	22 April 2022
Eternal data retention, pitfalls if any and KIP-405 Architecture and Design	6	4407	20 September 2021
Kafka as Scada / Historian Architecture and Design	1	3053	27 February 2024
Urgent: Mitigating Slow Consumer Impact and Seeking Open-Source Solutions in Apache Kafka Consumers Architecture and Design	0	1496	15 September 2023

Solution for Persistent Storage for Subsequent Data Analysis

Related topics