Hello everyone. I asked this question yesterday at Kafka Summit but wasn't able to see whether Gunnar or @rmoff answered it. On top of that, the "Ask the Experts" sessions can't be watched afterwards, so I'm asking here now.
Considering a small 3-node Kafka cluster, what would be the recommended approach to retrieve all data from a database (1,000+ tables) using Debezium, land it in S3 via the S3 Sink connector, and then keep receiving changes at a frequency of 1 minute or with a flush.size of 50?
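For context, a sketch of the S3 Sink settings in question, based on the Confluent S3 Sink connector's documented properties (the connector name, topic, bucket, and region are placeholder assumptions; rotate.schedule.interval.ms gives the wall-clock "every 1 minute" flush and requires timezone to be set):

```json
{
  "name": "s3-sink-example",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "dbserver1.inventory.customers",
    "s3.bucket.name": "my-example-bucket",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "50",
    "rotate.schedule.interval.ms": "60000",
    "timezone": "UTC"
  }
}
```

With both set, an S3 object is written when either 50 records accumulate or a minute of wall-clock time passes, whichever comes first.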
1 - Create one connector per table with snapshot.mode='initial' and an S3 Sink with a low flush.size (even though this would take a lot of time).
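Option 1's source side could look like the following Debezium config (a minimal sketch assuming MySQL; hostnames, credentials, and table names are placeholders, and property names follow the Debezium 1.x naming, which may differ in newer releases):

```json
{
  "name": "inventory-source",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.server.name": "dbserver1",
    "table.include.list": "inventory.customers",
    "snapshot.mode": "initial",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory"
  }
}
```

With snapshot.mode='initial', the connector snapshots the table once on first startup and then streams changes from the log, so a single connector covers both phases, at the cost of the slow snapshot going through the low-flush.size sink.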
2 - Separate the load into S3 into two steps:
First: create one connector per table with snapshot.mode='initial_only' and an S3 Sink with a high flush.size for these topics. (This would require a temporary increase of the nodes' RAM, considering Connect's heap usage.)
Second: create one connector per table with snapshot.mode='schema_only' and an S3 Sink with a low flush.size for these topics.
This approach would require some additional steps to guarantee that all events from the database were actually read.
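The two-step approach boils down to swapping two settings between phases; everything else in each connector config stays the same (the flush.size values below are illustrative assumptions, not recommendations):

```
# Phase 1: bulk snapshot into large S3 objects
snapshot.mode=initial_only    # Debezium source: snapshot existing data, then stop
flush.size=10000              # S3 Sink: fewer, larger objects per topic

# Phase 2: continuous streaming into small, frequent objects
snapshot.mode=schema_only     # Debezium source: capture only the schema, then stream changes
flush.size=50                 # S3 Sink: flush every 50 records
```

The gap to watch is between phase 1 finishing and phase 2 starting: changes made in that window are only picked up if the phase 2 connector begins reading from a log position at or before the end of the snapshot.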
3 - Any other ideas from you.
Some additional information: the database doesn't have a high number of events per day compared to other situations where Kafka is used, so the cluster can be small. But retrieving all the historical data and then continuing to receive changes continuously is the part that seems tricky to me.