Rebalancing Loop when updating Kafka streams lib

ChrisFord · 5 May 2021 15:18

Hey,

I was wondering if someone could help with an issue we are seeing after updating our Kafka Streams Java client from 2.5.1 to 2.6.1. When the pods start up, the state loops between Running and Rebalancing. It appears that the state stays in Rebalancing for 10, then goes to Running for about 10 seconds. In some cases the service eventually stays in Running and behaves normally afterwards, but in others it has been looping like this for days.

Any help with how to investigate this would be amazing

Thanks

Chris

mjsax · 5 May 2021 19:15

You might be hitting KAFKA-10678 ("Re-deploying Streams app causes rebalance and task migration", ASF JIRA); it’s fixed in the 2.6.2, 2.7.1, and 2.8.0 releases.

ChrisFord · 5 May 2021 19:35

Hi @mjsax,

Thank you for your speedy response. Will give 2.6.2 a try just now and let you know how it goes

ChrisFord · 6 May 2021 12:07

Hi @mjsax

I have updated the lib to 2.6.2, but I am still seeing the rebalance/running loop. In the broker logs I can see two messages that tie in with the rebalance:

Member internal_jaws_journey_change_of_tenancy_streamx-v3-test-cot-service-0-1-8b9ddbbb-3dd4-4de6-96d1-e18420466828 in group internal_jaws_journey_change_of_tenancy_streamx-v3-test has failed, removing it from the group
and
Preparing to rebalance group internal_jaws_journey_change_of_tenancy_streamx-v3-test in state PreparingRebalance with old generation 517 (__consumer_offsets-6) (reason: removing member internal_jaws_journey_change_of_tenancy_streamx-v3-test-cot-service-0-1-8b9ddbbb-3dd4-4de6-96d1-e18420466828 on heartbeat expiration)

Can you give any advice on debugging this? For the heartbeat expiration, I am trying to increase the session timeout and the heartbeat interval. As an aside, I also watched your video “everything you always wanted to know about kafka rebalance protocol but were afraid to ask”; it was very informative.

mjsax · 10 May 2021 23:41

The first message indicates that the heartbeat failed. Thus, increasing session.timeout.ms should help (not sure why you would need to change the config in 2.6.x compared to the 2.5.x release, though).

You might not want to increase the heartbeat interval, though: it would mean sending fewer heartbeats, which makes it more likely that the member drops out of the group (i.e., it would be counterproductive compared to increasing the session timeout). You can either keep it as-is or decrease it, but I guess increasing the session timeout should be sufficient.
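To make the advice above concrete, here is a minimal sketch of raising the session timeout while leaving the heartbeat interval alone. It uses only plain `java.util.Properties` and string keys; the `consumer.` prefix is how Kafka Streams scopes a setting to its embedded consumers. The class name, helper name, and the 30000 ms value are assumptions for illustration, not values recommended in this thread. The check that the heartbeat interval stays at or below one third of the session timeout follows the general guidance in the Kafka consumer configuration docs.

```java
import java.util.Properties;

public class RebalanceTimeoutConfig {

    // Hypothetical helper: builds the two consumer settings discussed above.
    // The values are passed in by the caller; nothing here is a recommended default.
    static Properties consumerTimeouts(int sessionTimeoutMs, int heartbeatIntervalMs) {
        // Guidance from the Kafka docs: keep the heartbeat interval at or below
        // one third of the session timeout, so several heartbeats fit in one session.
        if (heartbeatIntervalMs * 3L > sessionTimeoutMs) {
            throw new IllegalArgumentException(
                "heartbeat.interval.ms should be at most 1/3 of session.timeout.ms");
        }
        Properties props = new Properties();
        // "consumer." prefix: apply the setting to the consumers created by Streams.
        props.put("consumer.session.timeout.ms", Integer.toString(sessionTimeoutMs));
        props.put("consumer.heartbeat.interval.ms", Integer.toString(heartbeatIntervalMs));
        return props;
    }

    public static void main(String[] args) {
        // Assumed example values: a longer session timeout, heartbeat interval unchanged at 3000 ms.
        Properties props = consumerTimeouts(30_000, 3_000);
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

These entries would be merged into the properties passed to the Streams application. Note that the broker rejects a session timeout outside its `group.min.session.timeout.ms` / `group.max.session.timeout.ms` range, so the broker settings may need checking before raising the client value.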

It is also possible to log heartbeat errors on the client side, which might help to dig deeper if necessary. Those should be logged by the consumer client at log4j DEBUG level.
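As a rough illustration of the DEBUG logging mentioned above, the fragment below assumes the application logs through log4j 1.x with a `log4j.properties` file; other logging backends need the equivalent setting in their own format. The logger name is the consumer's `AbstractCoordinator` class, which is where the client handles group membership and heartbeats. Scoping DEBUG to that one logger avoids flooding the logs with output from the rest of the client.

```properties
# Assumption: log4j 1.x properties file; adjust for your logging backend.
# Enable DEBUG only for the consumer group coordinator (heartbeats, join/sync group).
log4j.logger.org.apache.kafka.clients.consumer.internals.AbstractCoordinator=DEBUG
```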

ChrisFord · 18 May 2021 13:06

Turns out using exactly_once_beta resolved the rebalancing/running loop. Looking into whether there was an issue with one of the nodes.

Thanks again for all your help @mjsax
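For readers landing here later, a minimal sketch of the setting mentioned above: `exactly_once_beta` is a value of the Streams `processing.guarantee` config, available from 2.6 and requiring brokers on 2.5 or newer. The class name, helper name, application id, and bootstrap address below are placeholders assumed for illustration. The thread does not establish why this change stopped the loop, so treat it as what worked in this case and not as a general fix.

```java
import java.util.Properties;

public class EosBetaConfig {

    // Hypothetical helper: copies the given base config and switches the
    // processing guarantee to exactly_once_beta.
    static Properties withExactlyOnceBeta(Properties base) {
        Properties props = new Properties();
        props.putAll(base);
        props.put("processing.guarantee", "exactly_once_beta");
        return props;
    }

    public static void main(String[] args) {
        Properties base = new Properties();
        // Placeholder values, not taken from the thread.
        base.put("application.id", "example-streams-app");
        base.put("bootstrap.servers", "localhost:9092");

        Properties props = withExactlyOnceBeta(base);
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

In later releases (3.0 and up) this value was deprecated in favor of `exactly_once_v2`.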

system · 25 May 2021 13:06

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.
