What should I use as the key for my Kafka message?

Can the key just be left blank? What should I set it to? What difference does it make?

That’s an interesting question. It depends on what you want to do with the message, of course! :wink:

If the ordering of the messages matters to you, then the key is important. Let’s say you have a Kafka topic where you send order statuses. If you receive several status updates about the same order, like “prepared”, “shipped”, and “delivered”, you want to make sure that the application consumes these statuses in the right order.

In Kafka, messages are guaranteed to be consumed in the order they were produced only if they share the same key (and you use the default partitioner, but let’s come back to that later). So in our example, using the order id as the message key makes perfect sense.

However, this solution puts some constraints on the design of your Kafka solution. To understand them, you need to understand how Kafka achieves parallelism.

Each Kafka topic is divided into a number of partitions chosen by the user. Each partition holds a subset of the messages received by the topic. The key point to remember is that each partition keeps its messages in order. Kafka allows consumers to read partitions in parallel: it is common practice in the Kafka world to allocate one thread per partition for a given consumer. And since a consumer that reads all the messages in a partition reads them in order, you can have ordering when you need it and parallelism at the same time, which is pretty neat.

To guarantee the order, the producer needs to make sure that messages with the same key are sent to the same partition, in order. This is the role of the default partitioner used by the producer in Kafka. The producer first calculates a numeric hash of the key (using murmur2 if you use the Java client), and then selects the partition number with the following formula: murmur2(key) % number_of_partitions.

So, for instance, the murmur2 of the string “azerty” is 2710803828. If you have a topic with 20 partitions, 2710803828 mod 20 makes 8. So you know that if the producer uses the default partitioner, every message that has an “azerty” key will land in the same partition, partition number 8.
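The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not the Kafka client code: `stand_in_hash` and `partition_for` are hypothetical names, and `zlib.crc32` is used as a stand-in for murmur2 so that the snippet runs with the standard library alone. The hash value for “azerty” is taken as quoted above rather than recomputed.

```python
import zlib

def stand_in_hash(key: str) -> int:
    # Stand-in for murmur2 (assumption): any deterministic,
    # non-negative hash is enough to show the idea.
    return zlib.crc32(key.encode("utf-8"))

def partition_for(key: str, num_partitions: int) -> int:
    # Same key + same partition count always gives the same partition.
    return stand_in_hash(key) % num_partitions

# Arithmetic from the example above, using the hash value quoted there.
quoted_hash = 2710803828
azerty_partition = quoted_hash % 20  # 8

# Simulate a topic with 20 partitions, each one an append-only list.
topic = [[] for _ in range(20)]
events = [
    ("order-1", "prepared"), ("order-2", "prepared"),
    ("order-1", "shipped"), ("order-2", "shipped"),
    ("order-1", "delivered"),
]
for key, status in events:
    topic[partition_for(key, 20)].append((key, status))

# Every update for order-1 sits in one partition, in the order it was sent.
order_1_partition = topic[partition_for("order-1", 20)]
order_1_statuses = [s for k, s in order_1_partition if k == "order-1"]
print(order_1_statuses)
```

Updates for different orders may be interleaved in the same partition or spread over several, but the updates for any single order always stay in sequence.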

This ordering-by-key property is also useful as soon as you use Kafka Streams. Often, business cases arise where you need to join messages (for example a product and its price), and it’s helpful if you can do this with a predetermined key.

If you choose a specific key attribute for your messages, be careful. First, don’t use general attributes, like a color or a code shared by many messages. You want to get an even distribution of messages between partitions if you can. If you use a pool of ten different keys but have 20 partitions, then at least 10 partitions will never be used. Unique or randomly generated keys (e.g. serial numbers and UUIDs) are the best candidates for message keys.
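To see the effect of key cardinality, here is a rough sketch that counts how many partitions actually receive messages. It assumes a hypothetical `partition_for` helper with `zlib.crc32` standing in for the client's real hash function, and the key names (`color-…`, `order-…`) are made up for the illustration.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # zlib.crc32 is a stand-in for the client's real hash (assumption).
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partitions_used(keys, num_partitions: int) -> set:
    # Set of partition numbers that receive at least one message.
    return {partition_for(k, num_partitions) for k in keys}

NUM_PARTITIONS = 20
few_keys = [f"color-{i}" for i in range(10)]       # low-cardinality attribute
many_keys = [f"order-{i}" for i in range(10_000)]  # one key per order

used_by_few = partitions_used(few_keys, NUM_PARTITIONS)
used_by_many = partitions_used(many_keys, NUM_PARTITIONS)

print(len(used_by_few), "partitions used with 10 distinct keys")
print(len(used_by_many), "partitions used with 10,000 distinct keys")
```

Ten distinct keys can never occupy more than ten partitions, and fewer if two keys happen to hash to the same one.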

Second, you want to make sure the partition count of the topic is chosen carefully. Remember, you can parallelize message reads only up to the number of partitions, so a low partition count can become a performance bottleneck. The bad news is, if you are using a message key because ordering is important to you, you will run into trouble if you need to scale up the number of partitions in the topic. Can you spot why? So make sure you over-partition rather than under-partition when choosing the partition count.
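A hint for the question above: the partition number depends on the partition count, so changing the count changes where a key lands. The sketch below reuses the hash value quoted earlier for “azerty” and, for a larger sample, a hypothetical `partition_for` helper with `zlib.crc32` as a stand-in hash. The jump from 20 to 30 partitions is just an example.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # zlib.crc32 is a stand-in for the client's real hash (assumption).
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Hash value quoted earlier in the thread for the key "azerty".
quoted_hash = 2710803828
before = quoted_hash % 20  # 8
after = quoted_hash % 30   # 18

# Same check on a batch of made-up keys.
keys = [f"order-{i}" for i in range(1_000)]
moved = [k for k in keys if partition_for(k, 20) != partition_for(k, 30)]
print(f"{len(moved)} of {len(keys)} keys change partition going from 20 to 30")
```

After the resize, new messages for a moved key go to a different partition than the older ones, so a consumer can no longer rely on reading all updates for that key in order.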

Null keys are the easiest option. With null keys, the default partitioner automatically load-balances the messages across all the available partitions. Since no message depends on landing in a particular partition, you won’t have to worry about resizing your partitions, ever.
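As a rough picture of what load balancing means for keyless messages, here is a simplified round-robin sketch. `RoundRobinPartitioner` is a hypothetical class written for this post, not the client's implementation: recent Java clients use a “sticky” strategy that fills a batch for one partition before moving to the next, but the end result is similar, with messages spread over all partitions.

```python
from itertools import count

class RoundRobinPartitioner:
    """Simplified stand-in for how keyless messages get spread out."""

    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        self._counter = count()

    def partition(self) -> int:
        # Each call moves on to the next partition, wrapping around.
        return next(self._counter) % self.num_partitions

partitioner = RoundRobinPartitioner(4)
topic = [[] for _ in range(4)]
for i in range(8):
    topic[partitioner.partition()].append(f"msg-{i}")

sizes = [len(p) for p in topic]
print(sizes)
```

Consecutive messages end up in different partitions, which is why this only fits cases where ordering between messages does not matter.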
