A maximum of around 4,000 partitions per broker is often recommended (Kafka: The Definitive Guide, the Apache Kafka/Confluent blog). My understanding is that more partitions per broker require more RAM and probably also more CPU, because of the additional work that has to be done per partition.
However, I can easily scale CPU and RAM vertically in the cloud, and I wonder whether this recommendation still applies to newer Kafka versions (v2.6.0+). If it does, what exactly is the problem with more than 4k partitions? And how can I tell whether my broker is suffering from problems caused by the number of partitions?
Background: I run one Kafka cluster where each broker hosts ~6-7k partitions. I noticed a lot of open file descriptors and a high mmap count, but once I bump those limits I would only expect higher RAM usage.
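For reference, this is how I count partitions per broker. A minimal sketch using the Java AdminClient, assuming a reachable bootstrap server at localhost:9092 (replace with your own cluster address):

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.TreeMap;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class PartitionsPerBroker {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Hypothetical bootstrap address; point this at your own cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // List the topics visible to the client and describe them.
            Set<String> topics = admin.listTopics().names().get();
            Map<String, TopicDescription> descriptions = admin.describeTopics(topics).all().get();

            Map<Integer, Integer> leadersPerBroker = new TreeMap<>();
            Map<Integer, Integer> replicasPerBroker = new TreeMap<>();

            // Count leader partitions and replica partitions hosted by each broker id.
            descriptions.values().forEach(td -> td.partitions().forEach(p -> {
                if (p.leader() != null) {
                    leadersPerBroker.merge(p.leader().id(), 1, Integer::sum);
                }
                p.replicas().forEach(r -> replicasPerBroker.merge(r.id(), 1, Integer::sum));
            }));

            System.out.println("Leader partitions per broker:  " + leadersPerBroker);
            System.out.println("Replica partitions per broker: " + replicasPerBroker);
        }
    }
}
```

Cross-checking this against the broker JMX metrics kafka.server:type=ReplicaManager,name=PartitionCount and name=LeaderCount gives a quick picture of whether the partitions are evenly spread.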
@weeco is referencing this blog post, but the post makes no mention of the underlying hardware on which 4K partitions run seamlessly. One should also keep in mind the cost of unplanned downtime when loading up a broker with 4K partitions: a hard shutdown of a broker serving a large number of partitions takes a very long time to recover from.
@weeco, could you give some reasoning for your large number of partitions? Is it volume driven or design driven? Maximums aside, more partitions lead to more overhead, not only on all brokers but also on all producing and consuming clients, because publishes and consumer commits happen per partition (see the sketch below). With partitions, the general rule of thumb is that fewer is better. There are special cases, as with anything. Understanding your reason for the number of partitions would be valuable.
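To illustrate the "per partition" point: a consumer commit is really a map of offsets keyed by partition, so the bookkeeping on clients and brokers grows with the partition count. A minimal sketch, assuming a Java consumer, a hypothetical topic my-topic and a bootstrap server at localhost:9092:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PerPartitionCommit {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");              // hypothetical
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic")); // hypothetical topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));

            // Offsets are tracked per TopicPartition: one entry for every partition we read from.
            Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
            for (ConsumerRecord<String, String> record : records) {
                offsets.put(new TopicPartition(record.topic(), record.partition()),
                            new OffsetAndMetadata(record.offset() + 1));
            }
            consumer.commitSync(offsets);
        }
    }
}
```

The producer side works the same way: batches and acknowledgements are managed per partition, so client memory and request overhead also scale with the number of partitions being written to.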
@atulyab9 Thanks for the hint about longer shutdown (and probably also startup) times. That makes sense, but honestly, at 4k partitions this is still fine for us, and I think the major factor there would be the number of segments and the size of the data.
@Charlla Yes, I’m aware that partitions aren’t free, but, well, they are already there. The teams that created the topics overestimated how many partitions they actually need, I’d say. They wanted to stay flexible: “What if I need to consume the topic very quickly? I can spin up 60 pods to consume it, so I should also have 60 partitions.” I agree we have too many partitions for the throughput we have. At the same time we have quite a few topics, and it isn’t easy to reduce the partition count of existing topics now (that will take time).
Nevertheless, my question remains the same: what is the major problem with having so many partitions when I can scale the hardware vertically? These articles never took hardware into consideration at all; they just said no more than 4k partitions without giving further reasons.
@weeco, the short answer is cost. If you are scaling to service lots of partitions, many of which you do not need, then you are simply throwing money away.
How many partitions you need should have been determined during performance testing, as you attempt to hit the required throughput whilst maintaining resilience, etc.
The longer answer is, I suspect, that it all depends. Hardware, load, message size, geography and so on will all play a part in what your upper limit is. And, unfortunately, that limit will change over time as those factors change.
Over 10k partitions, latency definitely becomes an issue. There is a fix going in that ensures only active partitions (i.e. ones with active writes) contribute to this. The combination of that fix and ZooKeeper’s removal, which takes the cluster-level upper bound into the millions, should help a lot.
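If you want to check whether you are already in that territory, one rough probe is to time acknowledged produces yourself. A minimal sketch, assuming a Java producer, a hypothetical topic latency-probe, acks=all and a bootstrap server at localhost:9092:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProduceLatencyProbe {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100; i++) {
                long start = System.nanoTime();
                // Block on the ack so we measure the full produce round trip.
                producer.send(new ProducerRecord<>("latency-probe", "key-" + i, "value-" + i)).get();
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("record %d acked in %d ms%n", i, elapsedMs);
            }
        }
    }
}
```

Blocking on every send defeats batching, so treat the numbers as a rough upper bound rather than steady-state latency; the broker- and client-side latency metrics are the better long-term signal.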