Kafka Streams cooperative rebalance storm and processing blocked

We are using Kafka Streams in a Kubernetes deployment, running 10 instances of our pod. All pods have a static consumer group ID configured. Each pod has 4 stream threads. The topic has 60 partitions.

We are seeing a rebalance storm: messages are not getting processed and lag builds up. The problem is reliably reproducible.

From the Kafka properties point of view:
max.poll.interval.ms is set to Integer.MAX_VALUE
session.timeout.ms is set to 60 seconds
heartbeat.interval.ms is set to 20 seconds

A rebalance by itself may be fine, but we are seeing that messages are not processed for a long time and huge lag builds up. Please help us figure out how to overcome this issue.

For static group membership, it’s usually recommended to increase the session timeout to a very large value, like 5 minutes (or even larger), to ensure that no rebalance is triggered if a POD is moved within the Kubernetes cluster. The session timeout should be larger than the maximum expected downtime of a POD.
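The recommendation above can be sketched as a small config helper. This is a minimal illustration using plain `java.util.Properties` with the real Kafka property names; the 5-minute timeout is the example value from this thread, not a universal default:

```java
import java.util.Properties;

public class StaticMembershipConfig {
    // Sketch of the recommendation above: for static group membership,
    // raise session.timeout.ms well above the maximum expected pod downtime.
    public static Properties sessionProps() {
        Properties props = new Properties();
        // 5 minutes: larger than the expected downtime of a pod during a
        // rolling restart, so the broker does not evict the static member.
        props.put("session.timeout.ms", Integer.toString(5 * 60 * 1000));
        // heartbeat.interval.ms must stay well below the session timeout
        // (a common rule of thumb is roughly one third of it or less).
        props.put("heartbeat.interval.ms", Integer.toString(20 * 1000));
        return props;
    }
}
```

These properties would then be merged into the `StreamsConfig` your application already builds.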

Also, if you have a stateful application, you want to use a StatefulSet to make sure that persistent volumes are re-attached to the same PODs, to avoid expensive state store recovery.

Cf. Deploying Kafka Streams Applications with Docker and Kubernetes - Confluent

Thanks. I tried session.timeout.ms with 5 minutes. Still seeing the same issue.

I guess you could inspect the logs (client and broker side) to investigate why a rebalance is triggered… If a static member re-joins the group, no rebalance should be triggered (ie, if the static member was not removed from the group previously). – It could also be a config issue (ie, the static member ID is not configured correctly)?

Thanks for the response. Below are the properties configured for our Streams application:

serviceName = k8s appName (fixed string)

application.id = "serviceName_Kafka-Stream-Application"
client.id = "serviceName_Kafka-Stream-Application"
group.id = "serviceName_Kafka-Stream-Application"
max.poll.interval.ms = Integer.MAX_VALUE
session.timeout.ms = 60 seconds
heartbeat.interval.ms = 20 seconds
replication.factor = 3
num.stream.threads = 4
processing.guarantee = at_least_once

Do let me know if any of them need to be tuned or changed. The issue is always seen when we do a rolling restart of our k8s application.

You mentioned it could be that the static member ID is not configured correctly? Can you elaborate more?

Your config does not contain a static group ID (ie, group.instance.id). It must be unique within the consumer group, ie, different for each POD. Check out the talk I linked above for more details.