Hi, we are running Kafka Connect on Kubernetes for a CDC use case. We have a 60-second end-to-end freshness SLO, and we would like some guidance on the setup.
Rebalance delay
There is a 5-minute rebalance delay controlled by scheduled.rebalance.max.delay.ms, which means we can see around 5 minutes of freshness lag whenever a pod gets recycled by K8s. We tried setting it to 30 seconds. With incremental rebalancing, it seems we then take a few seconds of lag every time a rebalance is triggered. Is there a reason the default is set to 5 minutes that we might be missing?
Rolling restart
We use rolling restarts for deployments, and we want to make sure the worker that was just restarted has started taking tasks before the next pod is restarted. That way, only the tasks assigned to a single worker are rebalanced at any time. Is there an endpoint on Connect that can be integrated with a health probe? Or are there any other suggestions for handling rolling restarts like this so that the impact on end-to-end freshness is minimized?
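To make the probe idea concrete, here is a rough sketch of what we have in mind. It polls the Connect REST API's `GET /connectors?expand=status` endpoint and reports ready only when every connector and every listed task is RUNNING. The URL (localhost on port 8083), the function names, and the "everything must be RUNNING" policy are our own assumptions for illustration, not something Connect prescribes. Note that a paused or failed task would also block the rollout under this policy.

```python
import json
import urllib.request

# Assumption: the worker's REST listener is reachable at this address from inside the pod.
CONNECT_URL = "http://localhost:8083"


def all_running(statuses):
    """Return True only if every connector and every listed task reports RUNNING.

    `statuses` is the parsed JSON body of GET /connectors?expand=status,
    i.e. a dict keyed by connector name.
    """
    for entry in statuses.values():
        status = entry.get("status", {})
        if status.get("connector", {}).get("state") != "RUNNING":
            return False
        if any(task.get("state") != "RUNNING" for task in status.get("tasks", [])):
            return False
    return True


def fetch_statuses(base_url=CONNECT_URL, timeout=5.0):
    """Fetch the status of all connectors from the Connect REST API."""
    url = f"{base_url}/connectors?expand=status"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def probe_exit_code(base_url=CONNECT_URL):
    """Return 0 when ready and 1 otherwise, matching exec-probe conventions."""
    try:
        return 0 if all_running(fetch_statuses(base_url)) else 1
    except (OSError, ValueError):
        # Worker not reachable yet, or the response was not valid JSON.
        return 1
```

A readiness probe of type exec could run a small script that ends with `sys.exit(probe_exit_code())`, so the rollout would only move on to the next pod once the tasks have been picked up again. We are not sure whether checking cluster-wide status from each pod is the recommended approach, hence the question.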
Happy to share - is there anything in particular that you are looking for?
I don’t think we have any special setup on K8s. There are two cases in which a pod gets recycled: 1/ the pod is unhealthy 2/ we periodically recycle pods that have been running for more than X days