We have a Kafka Connect cluster running in Kubernetes, with multiple connectors to deploy covering both snapshot and streaming data. We are using the Debezium MySQL CDC connector, and the data volume is going to be very large. Currently we run Kafka Connect with 3 replicas. What is the best way to autoscale the pipeline jobs: at the Debezium connector level or at the Kafka Connect cluster level? Please suggest.
In a very generalized use case, CPU will be your limiting factor when running Debezium on Connect. So you'll scale when you've been at N% CPU for a sustained period of time, and add a node or two.
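Since the cluster is on Kubernetes, one way to express that CPU-based rule is a HorizontalPodAutoscaler on the Connect Deployment. This is a hedged sketch: the deployment name `kafka-connect`, the 80% threshold, and the replica bounds are all assumptions you'd tune for your own cluster.

```shell
# Sketch only: assumes the Connect workers run as a Deployment named
# "kafka-connect" and that metrics-server is installed. Scales between
# 3 and 6 replicas when average CPU exceeds 80% of the requested CPU.
kubectl autoscale deployment kafka-connect \
  --cpu-percent=80 \
  --min=3 \
  --max=6

# Inspect the resulting HPA and current CPU utilization:
kubectl get hpa kafka-connect
```

Note that, as the reply below explains, adding a worker this way does not by itself redistribute the existing tasks, so an HPA alone won't give you the benefit of the new capacity.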
Here’s the fine print that’s more important than when to autoscale: adding a node does not by itself cause the Connect tasks to rebalance. You’ll need to manually call the Connect REST interface (Connect REST Interface | Confluent Documentation) for the tasks to rebalance and get the benefit of the additional resources. This will cause the tasks to pause until the rebalance is complete, somewhere between 1 second and 3 minutes; during this time no work will be done and you’ll fall further behind.
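For reference, the manual step above looks something like the following. This is a sketch, not a prescription: the worker hostname `connect-worker`, port 8083, and the connector name `mysql-cdc` are placeholders for your own values, and the `includeTasks` query parameter requires Kafka Connect 3.0 or later.

```shell
# List the connectors currently deployed on the cluster.
curl -s http://connect-worker:8083/connectors

# Restart a connector and its tasks so they can be redistributed
# across the workers (includeTasks=true restarts the tasks as well;
# onlyFailed=false restarts them regardless of state).
curl -s -X POST \
  "http://connect-worker:8083/connectors/mysql-cdc/restart?includeTasks=true&onlyFailed=false"
```

While the restart and the resulting rebalance are in flight, those tasks are not moving data, which is exactly the pause described above.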
It’s highly suggested that you run a steady-state Connect cluster that can handle your peak load.
Can you please elaborate on this a bit further?
In order to scale, a Connect cluster has to pause and rebalance. This pause will cause you to get further and further behind. You can also easily get into a state where you’re “flapping”: spending all of your time paused while you constantly rebalance, and never catching up.