I have a question about JdbcSourceConnector bulk mode

Hi all, this is a beginner question.
I am trying to load 100 million records from MSSQL Server into a Kafka topic through the Kafka JdbcSourceConnector.

I set the incremental query mode to “bulk”,
but when I run the connector in my local environment, it only gets 100 records.

My guess is that it’s a JVM memory issue (current local KAFKA_HEAP_OPTS: -Xms8G -Xmx8G).
I plan to test it by modifying the KAFKA_HEAP_OPTS option on the development server.

So I have a question regarding this.
In “bulk” mode, is there a way to split and import 100 million rows?
I was wondering if there is a way to fetch the data in smaller polls while staying in “bulk” mode.

Here’s my JdbcSourceConnector config:

"tasks.max": "1",
            "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
            "connection.url": "jdbc:sqlserver:// ${datasource}",
            "connection.user": "${user}",
            "connection.password": "${password}",
            "topic.prefix": "mssql-source-",
            "mode": "bulk",
            "table.whitelist": "${table name}",
            "poll.interval.ms": "86400000"

Is there a way to split and import 100 million rows?

I think batch.max.rows is the configuration option you’re looking for here.
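
For example (the value below is just an illustration, not a tuned recommendation), you could add it alongside the config you posted, and the connector will then pull the table in chunks of that size per batch:

"mode": "bulk",
"batch.max.rows": "10000"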


@rmoff Thanks for your reply.

We are also considering the option you suggested.
However, the amount of data could grow to more than 100 million rows.

If there is no other way, we will proceed with the setting below.

batch.max.rows: 1000000000

I’m not sure, is that the right approach?

If I understand it correctly, the option controls how many rows you pull from the database in each batch, until all rows have been fetched. So if you have 1000 rows and set batch.max.rows to 100, it’s going to take 10 iterations of the batch fetch to complete. If there are then 2000 rows, the connector will still work, it’ll just take 20 iterations.
The point is that instead of trying to eat a whole elephant at once (and blowing the JVM memory by trying to hold all the DB records in one go), you eat it one bite (one batch of records) at a time (and thus hold fewer records in JVM memory at once).

That is my understanding of it anyway - I have not looked at the code to verify it.


@rmoff
Yes. That’s what I thought at first.

As you mentioned, if the mode is bulk, can the fetch really happen in iterations?

I tested with the options below, but I couldn’t get the result I wanted.

Am I mistaken?? :sob:

"mode": "bulk",
"poll.interval.ms": "10000",
"batch.max.rows" : "100",
"table.poll.interval.ms": "5000"
  • The result I want (e.g. total messages in the topic: 8000):
  1. add 100 messages
  2. wait poll.interval.ms
  3. add 100 more messages
  4. and so on until it reaches 8000
  • The actual result:
  1. add 8000 messages
  2. wait poll.interval.ms
  3. add 8000 messages again
  4. and it keeps adding messages

As you mentioned, if the mode is bulk, can the fetch really happen in iterations?

So in bulk mode, there is a table querier thread that will continuously read the next set of records from the table. When this fills a batch, we commit those records to Kafka. So we do not wait until the entire result set is loaded into memory before we commit to Kafka; we buffer (at the application level) at most one batch of records.

  1. add 100 messages
  2. wait poll.interval.ms
  3. add 100 more messages
  4. and so on until it reaches 8000

This understanding is incorrect. poll.interval.ms affects when we run the subsequent query, i.e. if we’re still working through a result set, we don’t really care about poll.interval.ms. After we have finished a query, we will wait at most poll.interval.ms before running the next query.
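
So with the settings you tested (same values as above), the expected behaviour is: a single poll cycle reads the whole table, producing it to Kafka in batches of 100 as it goes, and only after the full result set has been read does the connector wait up to 10 seconds before querying the table again.

"mode": "bulk",
"poll.interval.ms": "10000",
"batch.max.rows": "100"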


@SajanaW Thank you. I’ll study Kafka more.
