Solution for Persistent Storage for Subsequent Data Analysis

We have an air-gapped Kafka cluster that is used to gather data from about 100 sensors, each sending a few bytes of data per minute. We now need to move the data into some form of permanent storage. The data will be used to build ML models.

What solution for permanent storage would you choose in this situation? We’re a small outfit with limited resources, so long and steep learning curves are best avoided. We were thinking of just building a PostgreSQL database and pulling data from Kafka with a consumer written in Python using the kafka-python library. Other possibilities (that we are aware of) are Apache Cassandra and Apache HBase, but we would have to read up on those…
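Roughly what we had in mind, as a sketch — the topic name, table layout, and message format (JSON with sensor_id/ts/value) are placeholders, not anything we have settled on:

```python
# Sketch of a Kafka -> Postgres sink using kafka-python and psycopg2.
# Topic name, table schema, and message format are assumptions.
import json


def record_to_row(raw: bytes):
    """Parse one Kafka message value (assumed to be JSON) into an insertable row."""
    msg = json.loads(raw)
    return (msg["sensor_id"], msg["ts"], float(msg["value"]))


if __name__ == "__main__":
    from kafka import KafkaConsumer  # pip install kafka-python
    import psycopg2                  # pip install psycopg2-binary

    consumer = KafkaConsumer(
        "sensor-readings",                 # assumed topic name
        bootstrap_servers="localhost:9092",
        group_id="postgres-sink",
        enable_auto_commit=False,          # commit offsets only after the DB write
    )
    conn = psycopg2.connect(dbname="sensors", user="ingest")
    with conn.cursor() as cur:
        for record in consumer:
            cur.execute(
                "INSERT INTO readings (sensor_id, ts, value) VALUES (%s, %s, %s)",
                record_to_row(record.value),
            )
            conn.commit()
            consumer.commit()
```

At our data rate (100 sensors, one small message per minute each) a single consumer committing per message should be more than enough.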

Any input would be much appreciated.

Postgres (or TimescaleDB) should work fine for sensor data.

Depending on what queries you will run, you may want to try Apache Pinot or Druid before moving to Cassandra or HBase.


Thank you, @OneCricketeer.

Looked further at TimescaleDB, Apache Pinot, and Apache Druid, and they all look promising. The focus on immutable data, the OLAP features of Pinot, and the fact that it has a Python client library make it particularly interesting for us. Furthermore, according to the docs, Pinot also supports near real-time ingestion from Apache Kafka.

We do not know exactly what queries we will run just yet, so flexibility is important.
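For reference, a sketch of what querying Pinot from Python might look like with the pinotdb client — the table and column names (sensor_readings, sensor_id, reading, ts) are made up for illustration:

```python
# Sketch of an OLAP-style aggregation against Pinot via pinotdb (DB-API client).
# Table and column names are assumptions, not from any real schema.

def build_hourly_avg_query(table: str, hours: int) -> str:
    """Build a per-sensor average over the last `hours` hours.

    ago() is Pinot's built-in function for 'now minus an ISO-8601 duration'.
    """
    return (
        f"SELECT sensor_id, AVG(reading) AS avg_reading "
        f"FROM {table} "
        f"WHERE ts > ago('PT{hours}H') "
        f"GROUP BY sensor_id"
    )


if __name__ == "__main__":
    from pinotdb import connect  # pip install pinotdb

    conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
    cur = conn.cursor()
    cur.execute(build_hourly_avg_query("sensor_readings", 24))
    for sensor_id, avg_reading in cur:
        print(sensor_id, avg_reading)
```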
