🎧 Smooth Scaling and Uninterrupted Processing with Apache Kafka ft. Sophie Blee-Goldman

There’s a new Streaming Audio episode - check it out!

Availability in Kafka Streams is hard to maintain, especially in the face of change: any update to topic metadata or group membership triggers a rebalance. But Kafka Streams struggles even after this stop-the-world rebalance has finished. According to Apache Kafka® Committer and Confluent Software Engineer Sophie Blee-Goldman, this is because a Streams app will generally have some state associated with a given partition, and moving this state from one consumer instance to another requires rebuilding it from a special backing topic called a changelog, the source of truth for a partition’s state.
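
To see where that changelog comes from, here is a minimal sketch of a stateful Streams app in Java. The application id, broker address, and topic/store names are hypothetical; the point is that the `count()` creates a local state store, and Streams automatically backs it with a changelog topic that can be replayed to rebuild the store on another instance.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class WordCountChangelogExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-example");  // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Counting per key creates a local state store named "counts". Streams backs it
        // with a changelog topic ("<application.id>-counts-changelog"), the source of
        // truth used to rebuild the store if its partition moves to another instance.
        builder.<String, String>stream("words")  // hypothetical input topic
               .groupByKey()
               .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```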

Restoring state from the changelog can take hours, and until the state is ready, Streams can’t do any further processing on that partition. Furthermore, it can’t serve any requests for local state until the local state is “caught up” with the changelog. So scaling out your Streams application results in pretty significant downtime, which is a bummer, especially if the reason for scaling out in the first place was to handle a particularly heavy workload.
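
If you want to see how long restoration is actually taking in your own app, Kafka Streams lets you register a `StateRestoreListener` before starting the app. The sketch below just logs progress; the class name is made up for illustration, but the callback interface is part of the public Streams API.

```java
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.streams.processor.StateRestoreListener;

// A simple listener that logs restoration progress for every state store.
public class LoggingRestoreListener implements StateRestoreListener {

    @Override
    public void onRestoreStart(TopicPartition partition, String storeName,
                               long startingOffset, long endingOffset) {
        System.out.printf("Started restoring store %s for %s: offsets %d to %d%n",
                storeName, partition, startingOffset, endingOffset);
    }

    @Override
    public void onBatchRestored(TopicPartition partition, String storeName,
                                long batchEndOffset, long numRestored) {
        System.out.printf("Restored a batch of %d records for store %s (up to offset %d)%n",
                numRestored, storeName, batchEndOffset);
    }

    @Override
    public void onRestoreEnd(TopicPartition partition, String storeName,
                             long totalRestored) {
        System.out.printf("Finished restoring store %s for %s: %d records total%n",
                storeName, partition, totalRestored);
    }
}

// Register it before calling streams.start():
// streams.setGlobalStateRestoreListener(new LoggingRestoreListener());
```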

To solve the stop-the-world rebalance problem, we have to find a way to safely assign partitions so we can be confident that they’ve been revoked from their previous owner before being given to a new consumer. To solve the scaling-out problem in Kafka Streams, we go a step further. When you add a new instance to your Streams application, we won’t immediately assign any stateful partitions to it. Instead, we’ll leave them assigned to their current owner, which continues processing and serving queries as usual. During this time, the new instance starts to “warm up” the local state in the background: it consumes from the changelog and builds up the local state. We then follow a pattern similar to cooperative rebalancing and issue a follow-up rebalance.
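
How many of these background warm-up replicas may exist at once is bounded by ordinary Streams configuration. The sketch below assumes Kafka Streams 2.6 or later, where KIP-441 shipped; the application id and broker address are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class WarmupConfigSketch {
    public static Properties warmupProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");     // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address

        // Maximum number of extra "warm-up" task replicas that may be restoring
        // state in the background at the same time (default: 2).
        props.put(StreamsConfig.MAX_WARMUP_REPLICAS_CONFIG, 2);

        // Keeping hot standby replicas shrinks the window in which a warm-up
        // is needed at all, at the cost of extra resources.
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);

        return props;
    }
}
```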

In KIP-441, we call these probing rebalances. Every so often (every 10 minutes by default), we trigger a rebalance. Each member encodes the current status of its local state in the subscription metadata it sends to the group leader, and we use the changelog lag as a measure of how “caught up” a partition is. During a rebalance, only instances that are completely caught up are allowed to own stateful tasks; everything else must first warm up the state. So long as there is some task still warming up on a node, we will “probe” with rebalances until it’s ready.
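
Both the probing cadence and the definition of “caught up” are configurable. Again assuming Kafka Streams 2.6 or later, a sketch of the relevant properties, shown here at their documented defaults:

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class ProbingRebalanceConfigSketch {
    public static Properties probingProps() {
        Properties props = new Properties();

        // How often the group leader triggers a probing rebalance while warm-up
        // tasks are still catching up (default: 600000 ms, i.e. 10 minutes).
        props.put(StreamsConfig.PROBING_REBALANCE_INTERVAL_MS_CONFIG, 10 * 60 * 1000L);

        // Maximum changelog lag, in records, at which an instance still counts as
        // "caught up" and may be assigned the active stateful task (default: 10000).
        props.put(StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG, 10_000L);

        return props;
    }
}
```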

EPISODE LINKS

🎧 Listen to the episode