We are running Kafka Streams in a Kubernetes deployment with 10 instances of our pod. Each pod is configured for static group membership and runs 4 stream threads. The input topic has 60 partitions.
We are seeing a rebalance storm: messages are not getting processed and lag is building up. The problem is easily reproducible.
Our relevant Kafka properties:
max.poll.interval.ms is set to Integer.MAX_VALUE
session.timeout.ms is set to 60 seconds
heartbeat.interval.ms is set to 20 seconds
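For reference, the configuration described above corresponds to roughly the following sketch (plain string config keys; the class name is just illustrative):

```java
import java.util.Properties;

public class StreamsConfigSketch {

    // Sketch of the consumer-related settings from the question.
    public static Properties baseConfig() {
        Properties props = new Properties();
        // Effectively disables the poll-interval based liveness check.
        props.put("max.poll.interval.ms", String.valueOf(Integer.MAX_VALUE));
        props.put("session.timeout.ms", "60000");    // 60 seconds
        props.put("heartbeat.interval.ms", "20000"); // 20 seconds
        props.put("num.stream.threads", "4");        // 4 stream threads per pod
        return props;
    }

    public static void main(String[] args) {
        System.out.println(baseConfig().getProperty("session.timeout.ms"));
    }
}
```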
A rebalance by itself may be fine, but we are seeing that messages are not processed for a long time and huge lag builds up. Please help us understand how to overcome this issue.
For static group membership, it is usually recommended to increase the session timeout to a very large value, like 5 minutes (or even larger), to ensure that no rebalance is triggered if a pod is moved within the Kubernetes cluster. The session timeout should be larger than the maximum expected downtime of a pod.
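A minimal sketch of that change, assuming a 5-minute ceiling on expected pod downtime (the specific values are examples, not prescriptions):

```java
import java.util.Properties;

public class SessionTimeoutSketch {

    // Raise the session timeout above the maximum expected pod downtime
    // (assumed here to be 5 minutes) so a bounced pod can re-join as the
    // same static member without triggering a rebalance.
    public static Properties withLargeSessionTimeout(Properties props) {
        props.put("session.timeout.ms", "300000");   // 5 minutes
        props.put("heartbeat.interval.ms", "20000"); // keep well below the session timeout
        return props;
    }

    public static void main(String[] args) {
        Properties p = withLargeSessionTimeout(new Properties());
        System.out.println(p.getProperty("session.timeout.ms"));
    }
}
```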
Also, if you have a stateful application, you should use a StatefulSet to make sure that persistent volumes are re-attached to the same pods, avoiding expensive state store recovery.
I guess you could inspect the logs (client and broker side) to investigate why a rebalance is triggered. If a static member re-joins the group, no rebalance should be triggered (i.e., as long as the static member was not previously removed from the group). It could also be a config issue (i.e., the static member ID is not configured correctly)?
Your config does not contain a static group instance ID (i.e., group.instance.id), which must be unique within the consumer group, i.e., different for each pod. Check out the talk I linked above for more details.
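One common way to get a unique-per-pod value is to derive group.instance.id from the pod's hostname. A sketch, assuming Kubernetes sets the HOSTNAME environment variable to the pod name (stable across restarts with a StatefulSet, e.g. "my-app-0"); the "my-streams-app-" prefix and the fallback value are hypothetical:

```java
import java.util.Properties;

public class StaticMembershipSketch {

    // Build a group.instance.id that is unique per pod but stable across
    // restarts of the same pod. The prefix is an illustrative choice.
    public static String instanceId(String hostname) {
        return "my-streams-app-" + hostname;
    }

    public static void main(String[] args) {
        // On Kubernetes, HOSTNAME holds the pod name; "pod-0" is a fallback
        // for running this sketch outside a cluster.
        String host = System.getenv().getOrDefault("HOSTNAME", "pod-0");
        Properties props = new Properties();
        props.put("group.instance.id", instanceId(host));
        System.out.println(props.getProperty("group.instance.id"));
    }
}
```

With a StatefulSet, the ordinal suffix in the pod name (my-app-0, my-app-1, …) guarantees uniqueness within the group while staying constant when a pod is rescheduled.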