Initial Load + Continuous Load with Debezium and S3 Sink

Hello everyone. I asked this question yesterday at Kafka Summit but wasn't able to see whether Gunnar or @rmoff answered it, and the "Ask the Experts" sessions can't be watched later. So I'm asking here now.

Considering a small 3-node Kafka cluster, what would be the recommended approach to retrieve all data from a database (1000+ tables) using Debezium, land it in S3 with the S3 Sink connector, and then keep receiving changes at a frequency of 1 minute or with a flush.size of 50?

  1. Create 1 connector for each table with snapshot.mode='initial' and an S3 Sink with a low flush.size (even though this would take a lot of time).

  2. Split the load to S3 into two steps:

    First: Create 1 connector for each table with snapshot.mode='initial_only' and an S3 Sink with a high flush.size for these topics (this would require a temporary increase of the nodes' RAM, given the heap usage of Connect).

    Second: Create 1 connector for each table with snapshot.mode='schema_only' and an S3 Sink with a low flush.size for these topics.

    This approach would require some additional steps to guarantee that all events from the database were actually read. (A rough sketch of the two phases is shown after this list.)

  3. Any other idea you might have.
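To make options 1 and 2 a bit more concrete, here is a rough sketch of what the snapshot phase of option 2 could look like, assuming a MySQL source (Debezium 1.x property names) and the Confluent S3 Sink connector; the hostnames, credentials, bucket and topic names are placeholders, and the two configs below would of course be registered as two separate connectors:

```properties
# Source for the initial load: snapshot one table and then stop
# (one connector per table, as described in option 2)
name=customers-initial-load
connector.class=io.debezium.connector.mysql.MySqlConnector
database.hostname=mysql.internal
database.port=3306
database.user=debezium
database.password=dbz
database.server.id=184054
database.server.name=dbserver1
database.include.list=inventory
table.include.list=inventory.customers
snapshot.mode=initial_only
database.history.kafka.bootstrap.servers=kafka:9092
database.history.kafka.topic=schema-changes.inventory

# Matching S3 sink for the snapshot topics: a high flush.size so the
# snapshot lands in few, large files
name=s3-sink-initial-load
connector.class=io.confluent.connect.s3.S3SinkConnector
topics.regex=dbserver1.*
s3.bucket.name=my-cdc-bucket
s3.region=us-east-1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
flush.size=100000
```

The second phase would reuse the same pair with snapshot.mode='schema_only' on the source (plus a new connector name and database.server.id) and a much lower flush.size on the sink.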

Some additional information: the database doesn't have a high number of events per day compared to other situations where Kafka is used, so the cluster can be small. But retrieving all the historical data and then continuing to capture changes is the part that seems tricky to me.

Hey everyone. An update on this issue in case someone else faces it.

We've come up with the following approach: separating the initial load from the continuous load. These are the steps:

  1. Start Debezium with snapshot.mode='schema_only' for all tables (a config sketch follows these steps).

  2. Take a snapshot of the database some time after Debezium is up and running.

  3. The initial load to S3 is done with Sqoop (or a similar tool) using that snapshot, directly into a “historic” layer or even into the so-called “bronze” layer; it's up to you, depending on the format used in each layer.
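For reference, step 1 boils down to something like the sketch below (again with Debezium 1.x MySQL property names; the connection details and names are placeholders). With snapshot.mode='schema_only', Debezium only snapshots the table schemas and then streams changes from the current position of the log:

```properties
# Step 1: schema-only snapshot, then continuous CDC
name=prod-cdc
connector.class=io.debezium.connector.mysql.MySqlConnector
database.hostname=prod-db.internal
database.port=3306
database.user=debezium
database.password=dbz
database.server.id=184060
database.server.name=proddb
# no table.include.list here, so all 1000+ tables are captured
snapshot.mode=schema_only
database.history.kafka.bootstrap.servers=kafka:9092
database.history.kafka.topic=schema-changes.proddb
```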

Notes: As we have the CDC logs from before the snapshot, we can reconstruct the up-to-date state of the data. Also, some actions must be taken to guarantee that you only use the log events received after each table's snapshot.

This way the Kafka cluster is not going to be overwhelmed by the initial load, and the S3 Sink configs can be sized with just the continuous load in mind.
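As an example of that sizing, a continuous-load S3 sink matching the "1 minute or 50 records" idea from the original question could look roughly like this (placeholder names again, and assuming the Confluent S3 Sink connector):

```properties
# Continuous CDC sink: commit a file after 50 records per partition,
# or on a 1-minute wall-clock schedule
name=s3-sink-cdc
connector.class=io.confluent.connect.s3.S3SinkConnector
topics.regex=proddb.*
s3.bucket.name=my-cdc-bucket
s3.region=us-east-1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
flush.size=50
# rotate.schedule.interval.ms requires the timezone property to be set
rotate.schedule.interval.ms=60000
timezone=UTC
```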

