Rebalancing Loop when updating Kafka streams lib

Hey,

I was wondering if someone could help with an issue we are seeing when updating our Kafka Streams Java client from 2.5.1 to 2.6.1. When the pods start up, the state loops between Running and Rebalancing. It appears that the state stays in rebalancing for 10, then goes to running for about 10 seconds. In some cases the service eventually stays running and behaves normally afterwards, but in others it has been in that loop for days.

Any help with how to investigate this would be amazing

Thanks

Chris

You might hit https://issues.apache.org/jira/browse/KAFKA-10678 – it’s fixed in 2.6.2, 2.7.1, and 2.8.0 releases.

Hi @mjsax,

Thank you for your speedy response. Will give 2.6.2 a try just now and let you know how it goes

Hi @mjsax

I have updated the lib to 2.6.2, but I am still seeing the rebalance/running loop. When I look at the broker logs, I can see 2 messages that tie in with the rebalance:

  • Member internal_jaws_journey_change_of_tenancy_streamx-v3-test-cot-service-0-1-8b9ddbbb-3dd4-4de6-96d1-e18420466828 in group internal_jaws_journey_change_of_tenancy_streamx-v3-test has failed, removing it from the group
    and
  • Preparing to rebalance group internal_jaws_journey_change_of_tenancy_streamx-v3-test in state PreparingRebalance with old generation 517 (__consumer_offsets-6) (reason: removing member internal_jaws_journey_change_of_tenancy_streamx-v3-test-cot-service-0-1-8b9ddbbb-3dd4-4de6-96d1-e18420466828 on heartbeat expiration)

Can you give any advice on debugging this? For the heartbeat expiration I am trying to increase the session timeout and the heartbeat interval. As an aside, I also watched your video “everything you always wanted to know about kafka rebalance protocol but were afraid to ask”; it was very informative.

The first message indicates that the heartbeat failed. Thus, increasing session.timeout.ms should help (not sure why you would need to change the config in 2.6.x compared to the 2.5.x release, though).

You might not want to increase the heartbeat interval, though: it would mean sending fewer heartbeats, which makes it more likely to drop out of the group (i.e., it would be counterproductive compared to increasing the session timeout). You can either keep it as-is or decrease it, but I guess increasing the session timeout should be sufficient.
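As a rough sketch of that advice, the snippet below raises session.timeout.ms while keeping heartbeat.interval.ms at or below one third of it. The class and method names are hypothetical, the 30000 ms value is only an illustrative assumption, and the 3000 ms cap is an assumed upper bound for the heartbeat interval rather than a recommendation from this thread. In a real application these entries would be added to the Properties passed to Kafka Streams.

```java
import java.util.Properties;

// Hypothetical helper, not part of Kafka: builds the two consumer settings
// discussed above as plain string keys.
final class RebalanceTimeouts {

    static Properties consumerTimeouts(int sessionTimeoutMs) {
        // Keep the heartbeat interval at or below one third of the session
        // timeout, and never above 3000 ms (an assumed cap for this sketch).
        int heartbeatIntervalMs = Math.min(3000, sessionTimeoutMs / 3);

        Properties props = new Properties();
        props.setProperty("session.timeout.ms", Integer.toString(sessionTimeoutMs));
        props.setProperty("heartbeat.interval.ms", Integer.toString(heartbeatIntervalMs));
        return props;
    }

    public static void main(String[] args) {
        // 30000 ms is an illustrative value, not a recommendation.
        Properties props = consumerTimeouts(30000);
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Note that the broker bounds the allowed session timeout via group.min.session.timeout.ms and group.max.session.timeout.ms, so a value outside that range would be rejected.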

It is also possible to log heartbeat errors client side, which might help to dig deeper if necessary. Those should be logged by the consumer client at log4j DEBUG level.
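A possible way to turn that on, assuming the application uses a log4j 1.x properties-style configuration file (the file name and the rest of its contents are not known from this thread), is to raise the level for the consumer's internal package, which contains the group coordinator and heartbeat code:

```
# Assumed log4j.properties fragment: DEBUG for the consumer internals only,
# so the rest of the application keeps its existing log level.
log4j.logger.org.apache.kafka.clients.consumer.internals=DEBUG
```

This can be quite verbose, so it is probably best enabled only while reproducing the rebalance loop.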

Turns out using exactly_once_beta resolved the rebalancing/running loop. Looking into whether there was an issue with one of the nodes.

Thanks again for all your help @mjsax
