Broker resync: some partitions catch up very slowly

Hi,

I haven’t posted here before, so if this post isn’t in the right place, could someone point me to where I should be asking?

I’ve been running a 3-node Kafka cluster (on-prem) for several years now and it’s been working relatively well. I recently upgraded the cluster to Kafka 3.6.1 in an attempt to migrate from ZooKeeper to KRaft (that migration hasn’t gone well, but that’s for another thread).

Anyway, I’ve made two attempts at the migration and backed off both times. When I back off, I restore server.properties to a ZooKeeper config (no migration), wipe the Kafka log dir, and restart Kafka to resume normal operations.

The reason for this post is that after these restarts, some (but not all) topic partitions take a very long time to resync: an hour or two, while other partitions take only minutes. Could anyone give me some pointers on what I should be checking to diagnose this? Here is what my broker config looks like (I’ve removed the values for the top three lines):

broker.id=xxx
log.dirs=xxx
zookeeper.connect=xxx
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
num.partitions=16
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1
log.retention.hours=4
log.retention.check.interval.ms=300000
zookeeper.connection.timeout.ms=18000
group.initial.rebalance.delay.ms=0
delete.topic.enable=true
inter.broker.protocol.version=3.6
log.message.format.version=3.6
replica.socket.receive.buffer.bytes=1024000
replica.fetch.max.bytes=20480000

I’m using this command to monitor the partition sync status:

kafka-topics.sh --describe --bootstrap-server localhost:9092 --under-replicated-partitions

If I ‘du -sh’ the partition directories, the under-replicated ones grow only slowly (maybe a few megabytes a minute), while the other partitions grow by hundreds of megabytes a minute while re-syncing.
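In case it helps, this is roughly what I run to watch the growth. The log dir path below is a placeholder; substitute whatever your log.dirs is set to, and re-run it every minute or so (cron or watch) to compare snapshots:

```shell
# Snapshot per-partition directory sizes under the Kafka log dir.
# Slow-growing (under-replicated) partitions stand out across runs.
# LOG_DIR is a placeholder; point it at your broker's log.dirs value.
LOG_DIR="${LOG_DIR:-/var/lib/kafka/data}"
date
du -sh "$LOG_DIR"/* 2>/dev/null | sort -k2
```

Sorting by the directory name (second column) keeps the output stable between runs, so it's easy to diff two snapshots.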

Disk I/O contention doesn’t appear to be the issue: iostat reports utilization for the log volume at under 10%.

I haven’t poked at this much yet, but sometimes restarting the broker a second time seems to help.

Any thoughts would be appreciated!
Thanks,

  • Daniel

A few additional details:

  • I’m using the Apache Kafka distribution, not one of the Confluent Kafka releases
  • I’m running Kafka under Ubuntu 22.04
  • I have fewer than 200 partitions across all topics
  • My busier topic partitions tend to be 2-3 GB in size