What's the most efficient way to copy Kafka messages to S3 and back

We’re currently evaluating ways to save (i.e. back up) and load (i.e. restore) all the messages in a Kafka topic as efficiently as possible, both in terms of storage and throughput. The goal is to save the messages in S3 for extra durability and restore them from there when needed, for instance in a disaster recovery scenario. Some of the workloads in our architecture use Kafka to persist data, so this scenario has to be covered.

Ideally we do not want to deal with serialization/deserialization of the messages at all, which means completely ignoring the message format (text, JSON, Avro, Parquet, or anything similar) and any need for a schema. We want to save the messages in “raw” format and restore them the same way.

This does not seem to be possible using Kafka Connect, as it requires the use of SerDes, per this blog post and our own testing.

We basically want a way to back up and restore topic messages on behalf of the user without caring about the message format or schema, similar to taking a database backup, where you do not care about the tables’ columns.

We’re considering several options:
- Using kcat (formerly kafkacat)
- Assuming it’s possible, writing our own producer and consumer code that reads the messages from Kafka as raw bytes and saves them to S3 (see the sketch after this list)
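
For the second option, here is a minimal sketch of what the backup side could look like, assuming a plain Apache Kafka cluster and using confluent-kafka and boto3. The broker address, topic name, bucket name, consumer group and 64 MiB chunk size are placeholders for illustration, not a recommendation. The key point is that `msg.key()` and `msg.value()` come back as raw bytes, so no SerDes or schema is involved:

```python
# Minimal sketch, not production code. Assumes: pip install confluent-kafka boto3,
# AWS credentials configured, and the placeholder names below adjusted to your setup.
import struct
import boto3
from confluent_kafka import Consumer

BOOTSTRAP = "kafka:9092"        # placeholder: your broker address
TOPIC = "orders"                # placeholder: topic to back up
BUCKET = "my-kafka-backups"     # placeholder: target S3 bucket

def frame(key, value):
    """Length-prefix key and value so the raw bytes can be replayed later.
    A length of -1 marks a null key/value, so tombstones keep their meaning."""
    def enc(b):
        if b is None:
            return struct.pack(">i", -1)
        return struct.pack(">i", len(b)) + b
    return enc(key) + enc(value)

consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": "s3-backup",        # placeholder: dedicated backup group
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,    # commit only after the S3 upload succeeds
})
consumer.subscribe([TOPIC])

s3 = boto3.client("s3")
chunk, first = bytearray(), None

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            raise RuntimeError(msg.error())
        if first is None:
            first = (msg.partition(), msg.offset())
        # key/value are raw bytes straight from the broker: no deserialization
        chunk += frame(msg.key(), msg.value())
        if len(chunk) >= 64 * 1024 * 1024:  # flush ~64 MiB objects to S3
            key = f"{TOPIC}/p{first[0]}/o{first[1]}.bin"
            s3.put_object(Bucket=BUCKET, Key=key, Body=bytes(chunk))
            consumer.commit(asynchronous=False)
            chunk, first = bytearray(), None
finally:
    consumer.close()
```

Restore would be the inverse: read each object back from S3, decode the frames, and `produce()` the raw key/value bytes with a plain Producer. Record headers would need the same framing treatment if you rely on them.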

As an alternative we could use MirrorMaker 2 to replicate onto another cluster, but that does not satisfy our need to save the messages to S3 for increased durability in case of accidental deletion of the replicated topic. At the end of the day, a replica is not really a backup.

What is the best way to achieve this? What other options should we consider? Any suggestions or comments?