Broker performance issues caused by too many locks on log.append

mvanyushkin · 11 April 2023 06:56

Hello folks!
Could someone please explain what we’re doing wrong in operating our Kafka cluster?
We have a cluster of 5 Kafka brokers (version 2.7) running on 5 powerful bare-metal hosts backed by SSDs.
The bytes-in workload ranges from 150 to 500 megabytes per second at peak.
We noticed that one of our brokers gets stuck when the workload exceeds some threshold. In those cases, the sick broker’s request queue fills up and its throughput drops dramatically. The other brokers also can’t catch up on the replication lag, and the cluster starts shrinking ISRs. This happens because the sick broker, while stuck, can’t handle all the follower fetch requests.
We tried lots of tweaks (IO, network, buffers, etc.), but nothing helped.
After some research (including reading the Kafka sources and profiling the brokers), we figured out that the IO thread pool blocks on the “Log.append” line.
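
For anyone who wants to put a number on this kind of contention without a profiler, here is a minimal sketch that counts BLOCKED threads in a `jstack`-style thread dump whose stack contains a given frame, grouped by the monitor they are waiting on. The sample dump is hypothetical (thread names, line numbers, and lock addresses are made up), and `blocked_on` is just an illustrative helper, not part of any Kafka tooling.

```python
import re
from collections import Counter

# Hypothetical excerpt in jstack format; names, line numbers and addresses are made up.
SAMPLE_DUMP = '''\
"data-plane-kafka-request-handler-0" #52 daemon prio=5 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
\tat kafka.log.Log.append(Log.scala:1050)
\t- waiting to lock <0x00000000c0a1b2c8> (a java.lang.Object)

"data-plane-kafka-request-handler-1" #53 daemon prio=5 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
\tat kafka.log.Log.append(Log.scala:1050)
\t- waiting to lock <0x00000000c0a1b2c8> (a java.lang.Object)

"data-plane-kafka-request-handler-2" #54 daemon prio=5 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
\tat kafka.log.Log.append(Log.scala:1050)
\t- waiting to lock <0x00000000c0ffee10> (a java.lang.Object)

"data-plane-kafka-request-handler-3" #55 daemon prio=5 runnable
   java.lang.Thread.State: RUNNABLE
\tat kafka.log.Log.append(Log.scala:1100)
'''


def blocked_on(dump, frame="kafka.log.Log.append"):
    """Count BLOCKED threads with `frame` in their stack, keyed by monitor address."""
    counts = Counter()
    # Thread entries in a dump are separated by blank lines.
    for block in re.split(r"\n\s*\n", dump):
        if "java.lang.Thread.State: BLOCKED" not in block or frame not in block:
            continue
        match = re.search(r"waiting to lock <(0x[0-9a-f]+)>", block)
        counts[match.group(1) if match else "unknown"] += 1
    return counts


if __name__ == "__main__":
    for monitor, n in blocked_on(SAMPLE_DUMP).most_common():
        print(f"{n} thread(s) blocked on monitor {monitor}")
```

If most blocked threads wait on the same monitor address, that may hint that the contention is concentrated on a single log rather than spread across the broker. Taking several dumps a few seconds apart gives a more reliable picture than a single one.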

Moreover, by switching partition leaders between brokers we can move the issue from one broker to another. So it is a kind of floating issue.
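
Since the problem follows partition leadership, it can help to see which broker leads which partitions and where the ISR has shrunk. The sketch below parses text in the format printed by `kafka-topics.sh --describe`; the sample output, topic name, and broker ids are hypothetical, and `summarize` is an illustrative helper name.

```python
import re
from collections import Counter

# Hypothetical output in the format of `kafka-topics.sh --describe`;
# the topic name and broker ids are made up.
SAMPLE_DESCRIBE = """\
Topic: events\tPartitionCount: 3\tReplicationFactor: 3\tConfigs:
\tTopic: events\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2,3
\tTopic: events\tPartition: 1\tLeader: 1\tReplicas: 1,3,4\tIsr: 1
\tTopic: events\tPartition: 2\tLeader: 2\tReplicas: 2,4,5\tIsr: 2,4,5
"""

# Matches per-partition lines only; the topic header line has no "Partition: " field.
LINE = re.compile(
    r"Topic: (\S+)\s+Partition: (\d+)\s+Leader: (-?\d+)\s+Replicas: (\S+)\s+Isr: (\S*)"
)


def summarize(describe_output):
    """Return (leader count per broker, partitions whose ISR is smaller than the replica set)."""
    leaders = Counter()
    shrunk = []
    for m in LINE.finditer(describe_output):
        topic, partition, leader, replicas, isr = m.groups()
        leaders[int(leader)] += 1
        isr_ids = [b for b in isr.split(",") if b]
        if len(isr_ids) < len(replicas.split(",")):
            shrunk.append((topic, int(partition), int(leader)))
    return leaders, shrunk


if __name__ == "__main__":
    leaders, shrunk = summarize(SAMPLE_DESCRIBE)
    print("leaders per broker:", dict(leaders))
    print("partitions with shrunk ISR (topic, partition, leader):", shrunk)
```

Comparing this summary before and after a leader switch shows which partitions moved together with the problem, which narrows down the candidates for a hot partition.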

Has anyone ever run into this issue?
How did you solve it?
What else could we check to get the full picture?

We would check our SSDs, but given that the issue floats between brokers, the IO subsystem doesn’t seem to be the root cause.

mvanyushkin · 11 April 2023 06:57

Sick broker lock contention (JFR record, 1 minute of profiling)

mvanyushkin · 11 April 2023 06:58

Healthy broker lock contention (JFR record, 1 minute of profiling)

mvanyushkin · 11 April 2023 06:59

Slow throughput degradation

mvanyushkin · 21 March 2024 13:47

Thank you very much, guys!
The solution and explanation are here: https://www.youtube.com/watch?v=S1q4MfEvLFg

system · 28 March 2024 13:48

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.
