Kafka Streams rebalance storm

Hello, we are running 10 Kafka Streams pods in a Kubernetes StatefulSet, with a group.instance.id attached to each Kafka Streams app.
Each instance runs 7 stream threads, and our topic has 180 partitions.
The rebalance protocol is cooperative.
Sometimes we encounter 4-5 minutes of rebalancing until it resolves, and during that time we build up lag on our topic.

The config is as follows:
producer-compression = "lz4"
producer-max-request-size = 5000000
batch-size-bytes = 16384
producer-linger-ms = 100
deserialization-exception-handler = "org.apache.kafka.streams.errors.LogAndContinueExceptionHandler"
retries = 1000
producer-retry-backoff-ms = 250
consumer-retry-backoff-ms = 250
replication-factor = 3
max-poll-interval-ms = 60000
max-poll-records = 1000
fetch-max-bytes = 52428800
session-timeout-ms = 90000
heartbeat-ms = 2000
fetch-min-bytes = 1
group-instance-id = ${?KAFKA_GROUP_INSTANCE_ID}
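
(In Java terms, the rebalance-relevant part of this maps roughly to the following Streams properties. This is only a sketch of how we wire it up, assuming the standard StreamsConfig/ConsumerConfig keys; the class name is illustrative.)

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public final class StreamsProps {

    // Rebalance-relevant subset of the configuration listed above.
    public static Properties rebalanceRelevant() {
        Properties props = new Properties();
        // 7 stream threads per pod, 10 pods -> 70 consumers in the group.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 7);
        // Static membership: one stable id per pod (e.g. the StatefulSet ordinal).
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG),
                  System.getenv("KAFKA_GROUP_INSTANCE_ID"));
        // Rebalance-related timeouts from the list above.
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 90_000);
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG), 2_000);
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 60_000);
        return props;
    }
}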

We wish to solve this rebalance storm, or at least better understand why it might be happening. Is it related to the thread count? Is it something in the configuration? The thread-to-partition ratio?
Any help will be appreciated, even just a pointer to where we should look ;) Thanks!

Hi @timurg,

Do you know what triggers the rebalance storms? Have you tried increasing the session.timeout.ms config to a higher value and verified whether the rebalance storm disappears?

Best,
Bruno

hi @Bruno ,
We increased it from 30 s to 90 s and it didn't help. This only occurs when a pod crashes (instance revoked by AWS) and is trying to get back into the group.

Hi @timurg,

What is your expectation? Do you expect that there will not be any rebalances at all due to the static membership or do you expect that the rebalances take less time?

For the former, are you sure the pod is able to come back within 90 s?

For the latter, I noticed that you have quite a low value for max.poll.interval.ms. The default is 5 minutes and you decreased it to 1 minute. During rebalances, max.poll.interval.ms is used to time out the rebalance. That means if not all of your 70 (10 pods x 7 stream threads) Kafka consumers join within this timeout but only a little later, the current rebalance is aborted and a new one is started. This might happen a couple of times. So, I would try to set max.poll.interval.ms to its default or higher.
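
For example, a minimal sketch of that change, assuming you build the configuration as Java Properties (300000 ms is the default):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// ... other Streams settings ...
// Raise the timeout used for processing and for joining a rebalance back to
// the 5-minute default (or higher).
props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 300_000);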

Best,
Bruno

Hi @Bruno, thanks for your answer.
Our processing time is fast, so I don't think max.poll.interval is a concern.
Sometimes it takes the services longer than 1.5 minutes to come up, and I understand that this would trigger a rebalance. I am fine with that.
Even if all my pods are up and they all go down and come back up one by one, no more than 20 rebalances should happen: one after the leave group and one after the join group per pod.
However, we sometimes encounter a few minutes of constant rebalances, which we were able to reproduce.

Steps to reproduce:

  1. Generate big lag
  2. Make sure you have 7 threads per pod
  3. Make sure all pods are up and processing
  4. Trigger rebalance

This scenario creates the rebalance storm we encounter and wish to solve, and we are not sure how.
We suspect it is the overall thread count in the consumer group, as this started happening after we increased the thread count to 7 per pod (70 in total).
Our topic has 180 partitions.
Thanks!

Hi @timurg,

I think we are misunderstanding each other. The max.poll.interval.ms is not only used to time out processing but also to time out rebalances. If not all 70 Kafka consumers join within max.poll.interval.ms, additional rebalances are triggered for the consumers that join after the timeout. With 70 Kafka consumers that need to join, it might happen that some of them miss a rebalance. So I would try to increase max.poll.interval.ms to the default or higher.

How do you count the rebalances?
With cooperative rebalancing, each logical rebalance might consist of multiple actual rebalances, during which records from partitions that are not migrated to other clients are still processed. So the actual number of rebalances might be more than 20 due to the cooperative protocol.

If you monitor the KafkaStreams state, be aware that the REBALANCING state also includes the restoration of local state, which can last a while.
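
If it helps with counting, a minimal sketch of logging those state transitions, assuming topology and props are your existing Topology and Properties (the logging itself is illustrative):

import org.apache.kafka.streams.KafkaStreams;

KafkaStreams streams = new KafkaStreams(topology, props);
// Log every transition with a timestamp, so the time spent in REBALANCING
// (which also covers state restoration) can be measured.
streams.setStateListener((newState, oldState) ->
    System.out.printf("%d: %s -> %s%n", System.currentTimeMillis(), oldState, newState));
streams.start();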

Best,
Bruno

Hi Bruno
We are also seeing exactly the same issue with the Kafka Streams cooperative rebalance algorithm. It creates a rebalance storm / huge lag, and the behavior is not consistent. We are using 10 pods and 4 stream threads per pod, with 60 partitions for the given topic. In our case, max.poll.interval.ms is set to Int.MaxValue. Please suggest how to overcome this issue.

Due to the inconsistent rebalance behaviour with the cooperative rebalance algorithm, we plan to go back to the rebalance algorithm used in version 2.3.

@timurg Did this issue get resolved for you? If yes, can you please share the resolution?

@Bruno Any other suggestions or thoughts?