Hello everyone. I asked this question yesterday at Kafka Summit but wasn't able to see whether Gunnar or @rmoff answered it. On top of that, the "Ask the Experts" sessions can't be watched afterwards, so I'm asking here now.
Considering a small 3-node Kafka cluster, what would be the recommended approach to retrieve all data from a database (1,000+ tables) using Debezium, land it in S3 via the S3 Sink connector, and then keep receiving changes at a frequency of 1 minute or with a flush.size of 50?
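For context, a sketch of the S3 Sink settings in question, based on the Confluent S3 Sink connector's documented properties (the connector name, topic, bucket, and region are placeholder assumptions; rotate.schedule.interval.ms gives the wall-clock "every 1 minute" flush and requires timezone to be set):

```json
{
  "name": "s3-sink-example",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "dbserver1.inventory.customers",
    "s3.bucket.name": "my-example-bucket",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "50",
    "rotate.schedule.interval.ms": "60000",
    "timezone": "UTC"
  }
}
```

With both set, an S3 object is written when either 50 records accumulate or a minute of wall-clock time passes, whichever comes first.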
1 - Create one connector per table with snapshot.mode='initial' and an S3 Sink with a low flush.size (even though this would take a lot of time).
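Option 1's source side could look like the following Debezium config (a minimal sketch assuming MySQL; hostnames, credentials, and table names are placeholders, and property names follow the Debezium 1.x naming, which may differ in newer releases):

```json
{
  "name": "inventory-source",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.server.name": "dbserver1",
    "table.include.list": "inventory.customers",
    "snapshot.mode": "initial",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory"
  }
}
```

With snapshot.mode='initial', the connector snapshots the table once on first startup and then streams changes from the log, so a single connector covers both phases, at the cost of the slow snapshot going through the low-flush.size sink.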
2 - Separate the load into S3 into two steps:
First: create one connector per table with snapshot.mode='initial_only' and an S3 Sink with a high flush.size for these topics. (This would require a temporary increase of the nodes' RAM, considering Connect's heap usage.)
Second: create one connector per table with snapshot.mode='schema_only' and an S3 Sink with a low flush.size for these topics.
This approach would require some additional steps to guarantee that all events from the database were actually read.
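The two-step approach boils down to swapping two settings between phases; everything else in each connector config stays the same (the flush.size values below are illustrative assumptions, not recommendations):

```
# Phase 1: bulk snapshot into large S3 objects
snapshot.mode=initial_only    # Debezium source: snapshot existing data, then stop
flush.size=10000              # S3 Sink: fewer, larger objects per topic

# Phase 2: continuous streaming into small, frequent objects
snapshot.mode=schema_only     # Debezium source: capture only the schema, then stream changes
flush.size=50                 # S3 Sink: flush every 50 records
```

The gap to watch is between phase 1 finishing and phase 2 starting: changes made in that window are only picked up if the phase 2 connector begins reading from a log position at or before the end of the snapshot.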
3 - Any other ideas from you.
Some additional information: the database doesn't have a high number of events per day compared to other situations where Kafka is used, so the cluster can be small. But retrieving all the historical data and then continuing to receive changes continuously is the part that seems tricky to me.