Broker resync: some partitions catch up very slowly

Hi,

I haven’t posted here before, so if this post isn’t in the right place, could someone point me to where I should be asking?

I’ve been running a 3-node Kafka cluster (on-prem) for several years now and it’s been working relatively well. I recently upgraded the cluster to Kafka 3.6.1 in an attempt to migrate from ZooKeeper to KRaft (that migration hasn’t gone well, but that’s for another thread).

Anyway, I’ve made two attempts at the migration and backed off both times. When I back off, I restore server.properties to a ZooKeeper config (no migration), wipe the Kafka log dir, and restart Kafka to resume normal operations.

The reason for this post is that after these restarts, some (but not all) topic partitions take a very long time to resync: an hour or two, while other partitions take only minutes. Could anyone give me some pointers on what I should be checking to diagnose this? Here is what my broker config looks like (I’ve removed the values for the top three lines):

broker.id=xxx
log.dirs=xxx
zookeeper.connect=xxx
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
num.partitions=16
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
log.retention.hours=4
log.retention.check.interval.ms=300000
zookeeper.connection.timeout.ms=18000
group.initial.rebalance.delay.ms=0
delete.topic.enable=true
inter.broker.protocol.version=3.6
log.message.format.version=3.6
replica.socket.receive.buffer.bytes=1024000
replica.fetch.max.bytes=20480000

I’m using this command to monitor the partition sync status:

kafka-topics.sh --describe --bootstrap-server localhost:9092 --under-replicated-partitions

If I ‘du -sh’ the partition directories, the under-replicated ones grow only slowly (maybe a few megabytes a minute), while the other partitions grow by hundreds of megabytes a minute while re-syncing.
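In case it helps, this is roughly what I run to watch the growth. The log dir path below is a placeholder; substitute whatever your log.dirs is set to, and re-run it every minute or so (cron or watch) to compare snapshots:

```shell
# Snapshot per-partition directory sizes under the Kafka log dir.
# Slow-growing (under-replicated) partitions stand out across runs.
# LOG_DIR is a placeholder; point it at your broker's log.dirs value.
LOG_DIR="${LOG_DIR:-/var/lib/kafka/data}"
date
du -sh "$LOG_DIR"/* 2>/dev/null | sort -k2
```

Sorting by the directory name (second column) keeps the output stable between runs, so it's easy to diff two snapshots.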

Disk I/O contention doesn’t appear to be the issue: iostat reports utilization for the log volume at under 10%.

I haven’t poked at this much yet, but sometimes restarting the broker a second time seems to help.

Any thoughts would be appreciated!
Thanks,

  • Daniel

A few additional details:

  • I’m using the Apache Kafka distribution, not one of the Confluent Kafka releases
  • I’m running Kafka under Ubuntu 22.04
  • I have fewer than 200 partitions across all topics
  • My busier topic partitions tend to be 2-3 GB in size