We’re currently looking for the most efficient way (in both storage and throughput) to save (i.e. back up) and load (i.e. restore) all the messages in a Kafka topic. The goal is to save the messages to S3 for extra durability and restore them from there when needed, for instance in a disaster recovery scenario. Some of the workloads in our architecture use Kafka to persist data, so this scenario has to be covered.
Ideally we do not want to deal with serialization/deserialization of the messages at all; that means completely ignoring the format of the message (text, JSON, Avro, Parquet, or anything similar) and any need for a schema. We want to save the messages in “raw” format and restore them the same way.
This does not seem to be possible with Kafka Connect, as it requires the use of SerDes, per this blog post and our own testing.
We basically want a way to back up and restore topic messages on behalf of the user without caring about the message format or schema, similar to taking a database backup, where you do not care about the tables’ columns.
We’re considering several options:
- Using kcat (formerly kafkacat)
- Assuming it’s possible, writing our own consumer and producer code that reads the messages from Kafka as raw bytes, saves them to S3, and produces them back on restore (a rough sketch of what we have in mind follows this list)
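
To make the second option concrete, here is roughly what we have in mind. This is only a minimal sketch, assuming the Python confluent-kafka client and boto3; the broker, topic, and bucket names are placeholders. No deserializer is configured, so keys and values stay raw bytes; they are base64-encoded into JSON Lines and uploaded to S3, and the restore path simply re-produces those bytes.

```python
# Sketch only: back up a topic's raw bytes to S3 and restore them later.
# Assumes confluent-kafka and boto3; names below are placeholders.
import base64
import io
import json

import boto3
from confluent_kafka import Consumer, Producer

BOOTSTRAP = "broker:9092"        # placeholder
TOPIC = "my-topic"               # placeholder
BUCKET = "my-backup-bucket"      # placeholder


def backup_topic_to_s3():
    # No value/key deserializer: messages are handled as raw bytes.
    consumer = Consumer({
        "bootstrap.servers": BOOTSTRAP,
        "group.id": "topic-backup",
        "auto.offset.reset": "earliest",
        "enable.auto.commit": False,
    })
    consumer.subscribe([TOPIC])
    buf = io.StringIO()
    count = 0
    try:
        while True:
            msg = consumer.poll(5.0)
            if msg is None:       # nothing new within the timeout: assume we're done
                break
            if msg.error():
                raise RuntimeError(msg.error())
            record = {
                "partition": msg.partition(),
                "offset": msg.offset(),
                "timestamp": msg.timestamp()[1],
                "key": base64.b64encode(msg.key()).decode() if msg.key() else None,
                "value": base64.b64encode(msg.value()).decode() if msg.value() else None,
            }
            buf.write(json.dumps(record) + "\n")
            count += 1
    finally:
        consumer.close()
    boto3.client("s3").put_object(
        Bucket=BUCKET,
        Key=f"kafka-backups/{TOPIC}.jsonl",
        Body=buf.getvalue().encode(),
    )
    return count


def restore_topic_from_s3(target_topic):
    # Re-produces the raw bytes; original offsets cannot be preserved, only
    # per-partition ordering (if the target topic has the same partition count).
    obj = boto3.client("s3").get_object(
        Bucket=BUCKET, Key=f"kafka-backups/{TOPIC}.jsonl"
    )
    producer = Producer({"bootstrap.servers": BOOTSTRAP})
    for line in obj["Body"].iter_lines():
        record = json.loads(line)
        producer.produce(
            target_topic,
            key=base64.b64decode(record["key"]) if record["key"] else None,
            value=base64.b64decode(record["value"]) if record["value"] else None,
            partition=record["partition"],
        )
    producer.flush()
```

This keeps the whole pipeline format- and schema-agnostic, but we would still need to handle things like headers, very large topics (streaming/multipart uploads instead of one in-memory buffer), and incremental backups.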
As an alternative we could use MirrorMaker 2 to replicate the topics to another cluster, but this does not satisfy our need to save the messages to S3 for increased durability in case of accidental deletion of the replicated topic. At the end of the day, a replica is not really a backup.
What is the best way to achieve this? What other options should we consider? Any suggestions or comments?