What factors do you use in deciding how many partitions to configure for a topic?
The first thing you need to worry about when choosing how many partitions you will configure in your topic is mainly performance.
Remember that partitions inside a topic is the unit of parallelism of that topic. Consumer groups usually contain a number of consumers at maximum equal to the number of partitions of the topic. So, if you have a topic of five partitions, at most five consumers inside a group will be able to process the messages of this topic. If inside a single group you start more consumers than the number of partitions, they will be inactive.
So how many consumers do you want ? This can be derived from your use case. Your use case will tell you what your optimal throughput is according to the requirements. If you are building an order management system and you are designing the order topic, then you need to know how many orders you want to process per second - how you want to scale. So let’s hypothetically say you send 1 MB of orders data per seconds in the topic. You want your consumer to be able to process 1 MB / s of orders data at a minimum (and even more), or it will never catch up and start falling behind, which is something you must avoid at all cost.
So your target throughput for your consumer is 1 MB / s. What you need to test is the actual throughput of your consumer for a single partition - what’s the amount of data the consumer can read per second with no parallelism. This will include the message processing time which is related to the type of consumer you are building - an email microservice will have different message processing performance than an ERP.
Once you identify this, you can derive the number of partitions by comparing your target throughput and your consumer throughput. So you know that if your consumer throughput is 0.10 MB / s, and your target throughput is 1 MB / s, you know you would need at least 10 consumers in parallel to handle the target throughput and not fall behind. So, 10 consumers means at least 10 partitions.
That’s a good starting point! But they are some extra stuff you need to take into consideration. Your throughput is dependent on a lot of characteristics that may not remain steady in time : your volumetry / frequency might increase, the average message size might increase, the message processing time might increase… The world is changing fast.
You also have to take into account that in the future, other consumer groups might be added to the system and start consuming your data. They will have specific throughput, and they will require different parallelism capabilities.
The way you can handle this flexibility is by using a highly divisible partition count. If you have only one partition in your topic, then only one consumer in a group will be able to process your messages - no parallelism. If you have two partitions, then you could have either one consumer processing two partitions or two consumers processing one partition each. If you have three partitions, you could have one or three consumers in parallel. But two consumers would be awkward, because with three partitions, this would mean one consumer would consume the messages of two partitions, and the other consumer would consume the messages of one partition, so the load is not equal between them, and that would be bad for the system. You want the load of your consumers to be shared equally between them.
The magic numbers are 4 partitions (1 consumer, 2 consumers, or 4 consumers would divide the load equally between them), 6 partitions (1, 2, 3, 6), 8 partitions (1, 2, 4, 8), 10 partitions (1, 2, 5, 10), 12 partitions (1, 2, 3, 4, 6, 12), 18 partitions (1, 2, 3, 6, 9, 18), 24 partitions (1,2,3,4,6,8,12,24) and 30 partitions (1,2,3,5,6,10,15,30).
Usually, the upper limit is 30 partitions and you won’t need to go higher than that. More than 30 is rarely useful in practice. Pick a partition count which is reasonable enough for your use case and allow you to be flexible in the future by adopting many possible consumer group configurations. Use one partition only if you use compacted topics with small volumetry (for instance configuration data).
A good policy is to over-partition - remember also that the number of partitions is important if your messages have keys and ordering matters to you. So in doubt, pick a higher partition count.
See also the section How to partition your events in Part 2: Topics, Partitions, and Storage Fundamentals of my 4-part blog series on Kafka fundamentals.
This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.