Drop in throughput when a broker is being replaced (data is replicated to the new broker)

Evan Williams @ewilliams
Hi all!

We are experiencing something rather odd. When we replace a broker (same broker ID) and replicate its data from scratch, we get an instant drop in total messages through the cluster as soon as replication starts. Auto leader election is turned off, so the broker isn’t the leader of any partitions; all it should be doing is replicating data from other replicas. The simple fix is applying a follower replication throttle of something like 80 MB/s.
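
For concreteness, this is roughly how we apply it (just a sketch - broker ID 1001 is a placeholder, and the rate is in bytes/sec):

# hypothetical broker ID; 80 MB/s expressed as bytes/sec
kafka-configs --bootstrap-server localhost:9092 --entity-type brokers --entity-name 1001 --alter --add-config 'follower.replication.throttled.rate=80000000'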

I’m just struggling to understand how a replicating broker at full throughput (around 250 MB/s), with no partition leadership, can possibly cause this?

Looking at the network and disk throughput on the other brokers during replication, they aren’t maxing out at all.

I just hate to think I’m missing something really obvious here :sweat_smile:

Mitch Henderson @mitchell-h
You’ve hit the classic rebuild problem.

The problem is that, while it’s not the leader, it’s pulling data from all the other leaders of the partitions it will host, in order to rebuild the replaced broker’s local storage.
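
One way to watch that catch-up from the outside (a sketch with the stock tooling):

# lists partitions whose followers are still behind; during a rebuild these
# should mostly be partitions hosted on the replaced broker
kafka-topics --bootstrap-server localhost:9092 --describe --under-replicated-partitions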

For fun history, see KIP-73 Replication Quotas - Apache Kafka - Apache Software Foundation: one of the reasons KIP-73 got put in was to prevent a rebuild from hurting the rest of the cluster.

And that same KIP shows examples of how to throttle the rebuild to minimize the impact on the rest of the cluster.
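
One gotcha worth flagging (as I understand KIP-73 - double-check for your version): the rate alone only bites for replicas enumerated in the topic-level configs, where '*' means all replicas of the topic:

# 'my-topic' is a placeholder; '*' marks every replica of it as throttled
kafka-configs --bootstrap-server localhost:9092 --entity-type topics --entity-name my-topic --alter --add-config 'leader.replication.throttled.replicas=*,follower.replication.throttled.replicas=*'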

Evan Williams
Right, but looking at the disk throughput and network metrics for the other brokers, they aren’t breaking a sweat while replication is happening. So it’s very weird…

Network-wise, the leaders are at around 50 MB/s max, out of the 120 MB/s they can sustain as a baseline (talking AWS EC2 here).

Evan Williams
Also, I’ve noticed that setting a cluster-wide default (below) for the broker throttle has no effect in my testing, while broker-specific ones do.

kafka-configs --bootstrap-server localhost:9092 --entity-type brokers --entity-default --alter --add-config 'follower.replication.throttled.rate=30000000'
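
For comparison, the broker-specific form that does take effect for me (broker ID 1 is a placeholder; presumably the rate throttles are per-broker dynamic configs, though I haven’t confirmed that):

# same config, but targeted at a single broker ID instead of the default
kafka-configs --bootstrap-server localhost:9092 --entity-type brokers --entity-name 1 --alter --add-config 'follower.replication.throttled.rate=30000000'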

Mitch Henderson
50 MB/s per broker will cause cluster-wide throughput to drop. Each one of those replication requests also takes up a network thread and a request thread, which causes contention on the brokers, slowing them down further.

Replication is a per-broker quota. Each broker controls its own replication throttle.
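
You can see what’s actually in force on a given broker like this (a sketch; broker ID 1 is a placeholder):

# shows the dynamic configs applied to that one broker, including any throttle
kafka-configs --bootstrap-server localhost:9092 --entity-type brokers --entity-name 1 --describe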

Evan Williams
Right, so during replication - even if:

  1. Network bandwidth on all brokers looks fine

  2. Available network/IO threads are fine

  3. num.replica.fetchers = 1

It can still cause a throughput drop. That’s, well… interesting :slightly_smiling_face:

Mitch Henderson
Yes. Thread contention is a thing. If you have CPU and network to spare, you can up the replica fetcher threads to 2 or 3. But I would wait for the rebuild to complete.
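
If you do go that route, it looks roughly like this (a sketch; assumes a Kafka version where num.replica.fetchers is dynamically updatable per KIP-226, and broker ID 1 is a placeholder):

# raise the fetcher thread count on one broker; no restart needed on recent versions
kafka-configs --bootstrap-server localhost:9092 --entity-type brokers --entity-name 1 --alter --add-config 'num.replica.fetchers=2'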

Mitch Henderson
Longer term, sizing your cluster to support the rebuild is generally the best capacity-planning advice. How long can you tolerate a node being down? Weigh that against how much impact a rebuild can have, and you’ll have your quota and cluster size.
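
As a purely illustrative back-of-the-envelope (made-up numbers): if the replaced broker holds 2 TB and the rebuild is throttled to 80 MB/s, it takes roughly 2,000,000 MB / 80 MB/s = 25,000 s, or about 7 hours. If that’s longer than you can tolerate, raise the throttle (more impact) or add capacity.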

Evan Williams
But how does upping the replica fetcher threads on the replicating broker affect contention on the other brokers? Is there something I can tweak on the sending (non-replicating) brokers to ease thread contention?

The cluster is over-provisioned size-wise, and we can live with a broker “down” for as long as we need. The rest of the brokers (replicas) can easily handle the load.

Mitch Henderson
That’s exactly what it does. Replica fetcher threads are two-way (leader and follower replica fetch requests).

Evan Williams
Ahh wow. I always thought that num.replica.fetchers was purely threads for “fetching” data from replicas, and not used when serving data to followers. I guess it’s the naming/description that is a little confusing, as it only mentions that it can increase parallelism on the follower :wink:

Number of fetcher threads used to replicate messages from a source broker. Increasing this value can increase the degree of I/O parallelism in the follower broker.

So upping the threads on each broker (CPU allowing) should give me headroom for replicating at higher throughputs, before contention possibly hits?
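
If so, I’d presumably set it cluster-wide, something like this (a sketch; assumes the setting is dynamically updatable on our version):

# default for all brokers, rather than a single --entity-name
kafka-configs --bootstrap-server localhost:9092 --entity-type brokers --entity-default --alter --add-config 'num.replica.fetchers=3'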

Evan Williams
Thanks a lot for the explanations here. It makes total sense.

Mitch Henderson

“should give me headroom for replicating at higher throughputs, before contention possibly hits”

Yes.
