Long-term event storage: forever in Kafka vs. storing separately (S3, Mongo, etc.)

Conversation from Confluent Community Slack. Copied here to make it available to all.

Pedro Lozano:
Hi all. I’d like to know what people are using (or what is recommended) for long-term event storage:
Do you keep your events forever in Kafka and not store them in a separate database?
Do you store them in any kind of NoSQL database? (assuming JSON format; not sure what DBs exist for other formats like Avro)
What would you use in case you want to replay (re-stream) the events stored in long-term storage? Connect?
Thanks!

Neil Buesing @nbuesing:
Every use case is different, but generally I have stored events in Kafka “forever” in certain cases. You do need a strategy to handle data corruption, though (the “I updated a Kafka Streams application and a bug messed up my data” scenario).
Trying to figure that out after it happens will be a mad scramble.

Storing data in S3 buckets for pure backup purposes is something I have discussed with clients, and it is usually the way they go.

Using Connect is usually the way to sink to a store (S3, Mongo, etc.) and then source it back in. You just have to know/understand what it means to restore data, as many applications will treat the restored records as new events. Again, a lot depends on your use case.
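
For concreteness, here is a rough editorial sketch (not from the original thread) of registering an S3 sink connector through the Kafka Connect REST API. It assumes the Confluent S3 sink connector is installed on the worker; the worker address, connector name, topic, bucket, and region are placeholders, and exact property names depend on the connector and version you run.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterS3SinkConnector {
    public static void main(String[] args) throws Exception {
        // Connector config as JSON (Java 15+ text block); names and values are placeholders.
        String payload = """
            {
              "name": "orders-s3-backup",
              "config": {
                "connector.class": "io.confluent.connect.s3.S3SinkConnector",
                "topics": "orders",
                "s3.bucket.name": "my-kafka-backup-bucket",
                "s3.region": "us-east-1",
                "storage.class": "io.confluent.connect.s3.storage.S3Storage",
                "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
                "flush.size": "1000",
                "tasks.max": "1"
              }
            }
            """;

        // POST the config to a Connect worker (assumed to be listening on localhost:8083).
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Restoring is the mirror image: a source connector (or a custom reader of the bucket) streams the stored objects back in, and, as noted above, downstream applications will generally see those records as brand-new events.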

So if you are looking to store events as a way to audit and/or handle a data corruption issue, I would look to a bucket store. If you are looking for the data to be accessible for other use cases, then a more available document store might work for you (e.g. Mongo). However, if you have a document store and you only keep the latest copy, how do you handle/prevent data corruption (e.g. a programming error causes your streaming applications to generate bad data)?

Some will find ways to replay from the source of record, some will look to their backup. If you can replay from the source of truth and your downstream systems are idempotent, things get a lot easier to work through (IMHO).
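
To make the idempotency point concrete, here is a tiny editorial sketch (hypothetical class and method names, not from the thread): the handler remembers which event IDs it has already applied, so replaying events from the source of truth or from a backup does not double-apply anything.

```java
import java.util.HashSet;
import java.util.Set;

public class IdempotentHandler {
    // In a real system this set would live in durable storage, or be replaced by
    // natural idempotency such as upserts keyed by event ID.
    private final Set<String> processedEventIds = new HashSet<>();

    public void handle(String eventId, String payload) {
        // add() returns false if the ID was already present, i.e. this event was seen before.
        if (!processedEventIds.add(eventId)) {
            return; // replayed duplicate: safe to ignore
        }
        apply(payload);
    }

    private void apply(String payload) {
        System.out.println("applying " + payload); // placeholder for the real side effect
    }
}
```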

Mitch Henderson @mitchell-h:
What Neil said, but for me it comes down to “how will I query this historical data?”
And will I need sub-second access? Sub-minute? No SLA?
Once you’ve answered those questions, your storage becomes fairly self-evident.

Pedro Lozano:
Thanks @nbuesing for the comprehensive answer. I’ll try to better define my use cases. This is my first Kafka project, and this concern came up early because I’m using AWS MSK and, as far as I know, the clusters can’t be stopped, only deleted, and there are no snapshots or backups, so if I shut down the cluster I lose all the data that’s in Kafka.

Vijay Nadkarni @vnadkarni:
@mitchell-h @nbuesing, it would be interesting to know how KIP-405 (Kafka Tiered Storage) changes things. Do you still see situations where users may have to sink historical data to S3, etc.?

Neil Buesing:
KIP-405 doesn’t support compacted topics, so the scenarios I’m mostly thinking about (backing up state used by Kafka Streams) wouldn’t be automatic; you would have to copy from your compacted topic to another, non-compacted topic for backup (a sketch of such a copy job is shown below). That would also be a way to handle versioning.

So the short answer is yes, this could be leveraged, but I don’t think it changes the amount of work needed to build the backup/restore that a Kafka Streams scenario may require.
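
Here is an editorial sketch (not from the thread) of that copy-to-a-backup-topic idea: a plain consume-and-forward loop that mirrors a compacted topic into a non-compacted topic with long retention, so every version of a key is preserved rather than only the latest. Topic names are placeholders; in practice, tools such as MirrorMaker 2 or a Connect pipeline could do the same job.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class CompactedTopicBackup {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "compacted-topic-backup");
        consumerProps.put("auto.offset.reset", "earliest");
        consumerProps.put("key.deserializer", ByteArrayDeserializer.class.getName());
        consumerProps.put("value.deserializer", ByteArrayDeserializer.class.getName());

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", ByteArraySerializer.class.getName());
        producerProps.put("value.serializer", ByteArraySerializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("app-state-compacted")); // compacted source topic (placeholder name)
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    // Forward each key/value unchanged to a non-compacted backup topic
                    // configured with long (or unlimited) retention.
                    producer.send(new ProducerRecord<>("app-state-backup", record.key(), record.value()));
                }
            }
        }
    }
}
```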

Mitch Henderson:
KIP-405 only changes things if the Kafka query pattern is what you need long term.
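
For reference (an editorial addition, not part of the thread): on clusters where KIP-405 tiered storage is actually available, which depends on the broker version and, for managed services, on the provider, turning it on is per-topic configuration. A sketch using the AdminClient, with an assumed topic name and retention values:

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class EnableTieredStorage {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
            Collection<AlterConfigOp> ops = List.of(
                // Ship closed log segments to the remote store the brokers are configured with (e.g. S3)...
                new AlterConfigOp(new ConfigEntry("remote.storage.enable", "true"), AlterConfigOp.OpType.SET),
                // ...keep roughly one day of data on local broker disks...
                new AlterConfigOp(new ConfigEntry("local.retention.ms", "86400000"), AlterConfigOp.OpType.SET),
                // ...and retain the topic overall without a time limit.
                new AlterConfigOp(new ConfigEntry("retention.ms", "-1"), AlterConfigOp.OpType.SET)
            );
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
```

As noted above, this applies to regular topics only; compacted topics still need their own backup approach.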

whatsupbros:
Trying to understand @mitchell-h’s comment better:

“And will I need sub-second access? Sub-minute? No SLA?”

Let’s say we need sub-minute performance for querying historical data; what could the options be here?


Also, speaking of data corruption, could anyone elaborate more here? Is it about the case where a Kafka Streams application reads and writes to the same topic? Or what exactly is meant by “corruption” here?

Mitch Henderson:
@whatsupbros The reason I asked that question is that, like a database, Kafka is a rather expensive place to keep and query your data. If you only need sub-minute queries, there are places with a more flexible query model that are cheaper to host your data in.

It’s also important to think about HOW you’ll be querying as well as for WHAT. When you say sub-minute query, are you picking out a single key-value pair out of 10 billion? Reading the entire topic and performing some sort of aggregation?

If you’re thinking of hosting all your historical data in Kafka and then going back with a sub-minute query for a single k:v pair in a topic of a couple billion events, there are cheaper, faster, and easier ways to do this.

whatsupbros:
Let me add more details to my comment and explain why it is interesting to me in general whether it is “a good idea” to store historical data in Kafka forever.

And that is the right question; I intentionally did not mention it in my previous comment, to better understand what exactly you meant.

And yes, the issue is that nobody usually knows “what queries” they are going to run against their historical data in the future. This is basically why we have had relational databases and normal forms for so long: by normalizing, the data is optimized for “any” query by default (and you improve it further by adding indexes and so on).

And yes, I totally understand that it is suboptimal to use Kafka for such “unknown” queries on historical data (and I am not talking about a trivial query for a particular single k:v pair).

But let me explain myself.

The original question was “is it a good idea to store historical data in Kafka forever?”, and to me this question is not the same as “how will I query this historical data?”.

To be able to run my queries, I can still sink part of my Kafka-stored data to another place that is more efficient for such querying (a database/Elasticsearch/etc.).

However, I may still need to store data in Kafka forever (an event topic with unlimited retention or a compacted topic, depending on the use case), because I may need to be ready for data consumers coming and going.
When a new consumer appears in the ecosystem, that application should be able to “restore” the correct system state, which is sometimes possible only by replaying all necessary events of a particular type from the very beginning (a replay sketch is shown below).
Otherwise, some data can be lost and the consumer’s state may end up inconsistent.

Or am I missing something?
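
To illustrate the replay point, here is a minimal editorial sketch (not from the thread) of a new consumer rebuilding a projection by reading an event topic from the earliest retained offset. The topic name and the in-memory map standing in for the consumer’s state store are placeholders.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayFromBeginning {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "new-consumer-" + System.currentTimeMillis()); // fresh group: no committed offsets
        props.put("auto.offset.reset", "earliest");                          // start from the oldest retained event
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        Map<String, String> state = new HashMap<>(); // toy projection rebuilt from the full event history

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("customer-events"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    state.put(record.key(), record.value()); // last event per key wins in this toy projection
                }
            }
        }
    }
}
```

This only works if the topic still contains the full history (unlimited retention, or compaction when only the latest value per key matters), which is exactly the trade-off being discussed.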

I don’t think you’re missing anything. You’ve summarized why event sourcing is such an attractive architecture: the ability to store data in Kafka, then replay it into the query runner of choice for what you’re trying to achieve.
