I’m playing around with a simple 3-node cluster. I only have 7 topics, most with 10 partitions (one with 1, and another with 25), all with a replication factor of 2. I have ensured these partitions are spread amongst the Brokers. Most topics are set to “delete” and one to “compact”. Self-balancing is on and set to improve balance “anytime”. Like I say, simple.
Over time I can see in Control Center that the disk usage goes out of balance. Brokers 1 and 2 are at 4.8GB and 4.4GB respectively, but Broker 3 sits at 2.2GB. So something is up despite the configuration being identical.
I have tried searching for information on how to trace the problem, but it’s like looking for a needle in a haystack. I suspect it’s a problem with the Partition balance (Broker 3 seems to have about a third of the partitions of the others in QA), but is this not what self-balancing is supposed to address?
In theory they should, but that is a very basic thing I should certainly go and check.
I guess if one or two are getting battered, there’s not a lot that balancing can do about it!
Edit: A quick mark one eyeball doesn’t show anything amiss; I shall do some proper digging in a bit. Cheers for the tip. (And yes, I’ll come back and mark this as “Solved” if you’re on the money.)
Thanks for the tip, @rmoff. It doesn’t appear to be data volume in my topics, but while checking that out I spotted that a lot of the topics for Control Center (e.g. _confluent-controlcenter-6-0-1-1-cluster-rekey) are only replicated on the first two nodes, not evenly across all the brokers. Maybe I goosed some config somewhere.
That may not be the actual problem, but it’s the only thing that is popping out at me.
Yes, it was the Control Center topics causing the imbalance. I am not quite sure why the self-balancing isn’t kicking in to move the topics/partitions around and restore balance, but at least I know the root cause now and can see that our own topics are not the source of the problem.
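For anyone trying to trace the same thing, here is a rough sketch of the check in Python. It assumes you have saved the text output of `kafka-topics --describe` and that the per-partition lines look like `Topic: x  Partition: 0  Leader: 1  Replicas: 1,2  Isr: 1,2`; the exact fields vary a little between versions. The function names are my own, not part of any Kafka tooling. It counts replicas per broker and lists topics that have no replica at all on some broker, which is how the Control Center topics showed up for me.

```python
import re
from collections import Counter, defaultdict

# Matches per-partition lines only; the topic summary line has
# "PartitionCount:" rather than "Partition:", so it is skipped.
LINE_RE = re.compile(
    r"Topic:\s*(\S+)\s+Partition:\s*(\d+)\s+.*?Replicas:\s*([\d,]+)"
)

def parse_describe(text):
    """Yield (topic, partition, [broker ids]) for each partition line."""
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            brokers = [int(b) for b in m.group(3).split(",")]
            yield m.group(1), int(m.group(2)), brokers

def replicas_per_broker(text):
    """Count how many partition replicas each broker hosts."""
    counts = Counter()
    for _, _, replicas in parse_describe(text):
        counts.update(replicas)
    return counts

def topics_missing_brokers(text, all_brokers):
    """Map each topic to the brokers that hold none of its replicas."""
    seen = defaultdict(set)
    for topic, _, replicas in parse_describe(text):
        seen[topic].update(replicas)
    wanted = set(all_brokers)
    return {t: sorted(wanted - b) for t, b in seen.items() if wanted - b}
```

To use it, capture the describe output to a file, read it into a string, and call `replicas_per_broker(text)` and `topics_missing_brokers(text, [1, 2, 3])`. Note that replica counts only tell you about placement, not bytes on disk, so a topic with few but very large partitions can still skew disk usage.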