Could some please give an explanation of what we’re doing wrong operating our kafka cluster.
So, we’ve got a cluster of 5 kafka brokers (2.7 ver) running on 5 strong bare-metal hosts backed by SSD.
The byte in workload sits between 150meg and 500 megabytes per second on the top.
We noticed that one of our brokers stucks when the workload overcomes some threshold. In that cases, the sick broker’s request queue is full and the broker throughput dramatically goes down. Also, in that cases the other brokers aren’t able to catch up the replication lag and cluster makes “shrinking ISR operation”. It happens because of the sick broker can’t handle all the FetchFollower requests (due to some stuckness).
We did lots of tweaks (io, network, buffers, etc), nothing helped us to get it worked.
Having done some research (including reading kafka sources and profiling brokers) we’ve figured out that the IO thread pool blocks on the “Log.Append” line
Moreover, switching partition leaders between brokers we can move that issue from one broker to the other one. So it kind of floating issue.
- Has anyone ever met this issue?
- How did you solve it?
- What we could additionally check to get the full picture?
We would check our SSD disks, but taking into consideration that the issue is floating, the IO subsystem doesn’t seem to be the root cause of issue