We have an air-gapped Kafka cluster that gathers data from about 100 sensors, each sending a few bytes of data per minute. We now need to land that data in permanent storage of some kind. The data will be used to build ML models.
What solution for permanent storage would you choose in this situation? We're a small outfit with limited resources, so long and steep learning curves are best avoided. We were thinking of simply setting up a PostgreSQL database and feeding it from Kafka with a consumer written in Python using the kafka-python library. Other possibilities we are aware of are Apache Cassandra and Apache HBase, but we would have to read up on those…
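For reference, the Kafka-to-PostgreSQL bridge we have in mind could look roughly like the sketch below. The topic name, message layout (JSON with `sensor_id`, `ts`, `value`), table schema, and connection strings are all assumptions, not something we have settled on:

```python
# Minimal sketch of a Kafka -> PostgreSQL consumer for the sensor feed.
# Topic name, message format, and table schema are assumptions; adjust
# to whatever the sensors actually send.
import json
from datetime import datetime, timezone

# Hypothetical table, created once up front:
#   CREATE TABLE sensor_readings (
#       sensor_id  TEXT             NOT NULL,
#       ts         TIMESTAMPTZ      NOT NULL,
#       value      DOUBLE PRECISION,
#       PRIMARY KEY (sensor_id, ts)
#   );

def to_row(raw: bytes) -> tuple:
    """Turn one Kafka message (assumed JSON) into a sensor_readings row."""
    msg = json.loads(raw)
    ts = datetime.fromtimestamp(msg["ts"], tz=timezone.utc)
    return (msg["sensor_id"], ts, float(msg["value"]))

def run(bootstrap: str = "localhost:9092", dsn: str = "dbname=sensors") -> None:
    # Imports are local so the parsing helper above works even where
    # kafka-python / psycopg2 are not installed.
    from kafka import KafkaConsumer  # pip install kafka-python
    import psycopg2                  # pip install psycopg2-binary

    consumer = KafkaConsumer(
        "sensor-readings",           # assumed topic name
        bootstrap_servers=bootstrap,
        enable_auto_commit=False,    # commit offsets only after the DB write
    )
    conn = psycopg2.connect(dsn)
    cur = conn.cursor()
    for message in consumer:
        cur.execute(
            "INSERT INTO sensor_readings (sensor_id, ts, value) "
            "VALUES (%s, %s, %s) ON CONFLICT DO NOTHING",
            to_row(message.value),
        )
        conn.commit()
        consumer.commit()
```

At ~100 messages per minute this single-row-insert loop is nowhere near any throughput limit, which is part of why plain PostgreSQL seems adequate; the `ON CONFLICT DO NOTHING` plus manual offset commits make redelivered messages harmless.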
We have since looked further at TimescaleDB, Apache Pinot, and Apache Druid, and they all look promising. Pinot's focus on immutable data and its OLAP features, together with the fact that it has a Python client library, make it particularly interesting for us. Furthermore, according to the docs, Pinot supports near-real-time ingestion from Apache Kafka.
We do not know exactly what queries we will run just yet, so flexibility is important.