Conversation from Confluent Community Slack. Copied here to make it available to all.
Pedro Lozano:
Hi all. I’d like to know what people are using (or what is recommended) for long-term event storage:
Do you keep your events forever in Kafka and not store them in a separate database?
Do you store them in any kind of NoSQL database? (assuming a JSON format; I’m not sure what DBs exist for other formats like Avro)
What would you use if you want to replay (re-stream) the events stored in long-term storage? Connect?
Thanks!
Neil Buesing @nbuesing:
Every use case is different, but in certain cases I have stored events in Kafka “forever”. You do need a strategy to handle data corruption, though (I’ve hit the scenario where I updated a Kafka Streams application and a bug in it messed up my data).
Trying to figure that out after it happens will be a mad scramble.
Storing data in S3 buckets purely for backup purposes is something I’ve discussed with clients, and it’s usually the way they go.
Using Connect is usually the way to sink to a store (S3, Mongo, etc.) and then source it back in. You just have to know/understand what it means to restore data, as many applications will treat restored records as new events. Again, a lot depends on your use case.
So if you are looking to store events as a way to audit and/or handle a data-corruption issue, I would look to a bucket store. If you are looking for data to be accessible for other use cases, then a more available document store might work for you (e.g. Mongo). However, if you have a document store and you only keep the latest copy, how do you handle/prevent data corruption (e.g. a programming error causes your streaming applications to generate bad data)?
Some will find ways to replay from the source of record, others will look to a backup. If you can replay from the source of truth and your downstream systems are idempotent, things get a lot easier to work through (IMHO).
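As a rough illustration of the sink-to-S3 backup Neil describes, the sketch below registers Confluent’s S3 sink connector through the Kafka Connect REST API. The connector name, topic, bucket, and flush size are placeholders, and it assumes a Connect worker on localhost:8083 with the S3 connector plugin installed.

```python
import requests

# Hypothetical names -- adjust the topic, bucket, and region for your environment.
connector = {
    "name": "events-s3-backup",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "tasks.max": "1",
        "topics": "events",                          # topic(s) to back up
        "s3.bucket.name": "my-event-backup-bucket",  # destination bucket
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "1000",                        # records per S3 object
    },
}

# Register the connector with the Connect REST API (default port 8083).
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())
```

Restoring is the inverse path: a source connector (or a custom consumer of the S3 objects) produces the backed-up records onto a topic again, and, as Neil notes, downstream applications will generally treat them as new events.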
Mitch Henderson @mitchell-h:
What Neil said, but for me it comes down to “how will I query this historical data?”
And will I need sub-second access? Sub-minute? No SLA?
Once you’ve answered those questions, your storage choice becomes fairly self-evident.
Pedro Lozano:
Thanks @nbuesing for the comprehensive answer. I’ll try to define my use cases better. This is my first Kafka project, and this concern came up early because I’m using AWS MSK and, as far as I know, the clusters can’t be stopped, you can only delete them, and there are no snapshots or backups, so if I shut down the cluster I just lose all the data that’s in Kafka.
Vijay Nadkarni @vnadkarni:
@mitchell-h @nbuesing, it would be interesting to know how KIP-405 (Kafka Tiered Storage) changes things. Do you still see situations where users may have to sink historical data to S3, etc.?
Neil Buesing:
KIP-405 doesn’t support compacted topics, so the scenarios I’m thinking about more (backing up state used by Kafka Streams) wouldn’t be automatic; you would have to copy from your compacted topic to another, non-compacted topic for backup. That copy would also be a way to handle versioning.
So the short answer is yes, this could be leveraged, but I don’t think it changes the amount of work needed to build the backup/restore that a Kafka Streams scenario may require.
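To make the copy Neil mentions concrete, here is a minimal sketch that mirrors a compacted topic into a separate, non-compacted backup topic using the confluent-kafka Python client. The topic names, broker address, and consumer group are placeholders; a real copy job would also carry over headers and timestamps and decide how retention is configured on the backup topic.

```python
from confluent_kafka import Consumer, Producer

BOOTSTRAP = "localhost:9092"              # placeholder broker address
SOURCE_TOPIC = "state-changelog"          # compacted topic to back up
BACKUP_TOPIC = "state-changelog-backup"   # non-compacted, retention-based copy

consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": "changelog-backup-copier",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
producer = Producer({"bootstrap.servers": BOOTSTRAP})

consumer.subscribe([SOURCE_TOPIC])
try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue                      # nothing new yet; keep polling
        if msg.error():
            raise RuntimeError(msg.error())
        # Preserve key and value so the backup can later be replayed or re-compacted.
        producer.produce(BACKUP_TOPIC, key=msg.key(), value=msg.value())
        producer.poll(0)                  # serve delivery callbacks
        consumer.commit(msg)              # commit only after the record is handed to the producer
finally:
    producer.flush()
    consumer.close()
```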
Mitch Henderson:
KIP-405 only changes things if the Kafka query pattern is what you need long term.