Here are some pitfalls to be aware of so you don’t accidentally break your key-based partitioning.
Avoid using complex serialization formats (Avro, JSON, Protobuf) for the key. For example, if you reorder fields in the schema, the serialized bytes change and partitioning breaks. Protobuf and JSON also don’t serialize deterministically, so using those formats for keys will break partitioning on its own.
Schema Registry is meant to support schema evolution for Avro-, JSON-, or Protobuf-serialized data. But keys aren’t supposed to evolve: keys determine partition placement, so if a key’s serialized form changes, partitioning breaks. Avoid Avro, JSON, and Protobuf for keys!
Stick to simple data types. String uuid. Long id. Things like that.
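Something like this, roughly, is what I mean. It’s only a sketch (the topic name and value are placeholders, and in practice the value would usually go through Schema Registry; it’s just the key that should stay simple):

```java
import java.util.Properties;
import java.util.UUID;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleKeyProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Plain string serializer for the key: the bytes are stable, so the
        // partition a given key hashes to never changes out from under you.
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String key = UUID.randomUUID().toString();
            producer.send(new ProducerRecord<>("orders", key, "some value"));
        }
    }
}
```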
If you do decide to use a complex format for the key, make sure to set auto.register.schemas=false so you don’t accidentally register a new (completely valid and compatible) schema that will break your partitioning.
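If you do go down that road, the relevant bit of producer config looks roughly like this (property and class names are from the Confluent serializers; the URLs are placeholders):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;

public class AvroKeyProducerConfig {
    // Producer config sketch for the (discouraged) case of an Avro-serialized key.
    static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");
        // Don't let the producer auto-register schemas: even a "compatible"
        // change to the key schema changes the key bytes and breaks partitioning,
        // so any schema change has to be a deliberate, reviewed registration.
        props.put("auto.register.schemas", false);
        return props;
    }
}
```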
Different producer clients can also use different hashing algorithms to determine key-based partitioning (hash(key) modulo the number of partitions). librdkafka-based clients default to the consistent_random partitioner, whereas Java-based clients use murmur2. So if you have different kinds of clients producing to the same topic, your partitioning will be inconsistent. You need to make sure all producers use the same hashing algorithm for key-based partitioning to work.
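To make that concrete, here’s a rough sketch of what the Java client’s default partitioner computes for a non-null key, using the murmur2/toPositive helpers that ship with kafka-clients (exact internals can vary between versions):

```java
import java.nio.charset.StandardCharsets;

import org.apache.kafka.common.utils.Utils;

// Sketch of the Java client's default behavior for a non-null key:
// murmur2 over the serialized key bytes, modulo the partition count.
// librdkafka's default (consistent_random, CRC32-based) gives different
// results for the same key, which is why mixed clients can scatter one
// key across several partitions.
public class PartitionCheck {
    static int javaDefaultPartition(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(javaDefaultPartition("order-42", 12));
    }
}
```

If I remember correctly, librdkafka can be configured with partitioner=murmur2_random to match the Java default.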
edit: I misread your post. Yes, it seems like a good idea for Schema Registry to simply refuse complex schemas for keys, at least by default. However, I think the Kafka community tends to avoid opinionated behavior like that. This is why it’s sometimes hard to learn Kafka stuff: “Look how flexible this is! You can do whatever you want!” is great in some ways, but opinionated defaults are more learner-friendly, IMO.
Hey gklijs,
To clarify, Schema Registry does support schemas for keys. But what we as a community aren’t necessarily good at pointing out to new users is that you shouldn’t use them, because of the pitfalls I listed. It’s a bad practice because it’s so difficult to do without breaking partitioning, and it offers little practical benefit. There’s seldom a good reason to use a schema for the key, and there are lots of reasons not to.
I get all that, but that wasn’t my question. If it’s a pitfall, and there isn’t really a good reason to have the pitfall, then why is the pitfall there?
For example, the Kafka Cloud Event serializer simply throws an error when used for the key. Why isn’t that the default? You could set a property like allow.use.for.keys if you really want to use them for keys. That would at least make it harder to do by accident.
Correct me if I’m wrong, but isn’t the ability to use a schema for the key just a consequence of Kafka’s design of accepting bytes?
I think a schema for the key is usually used to provide more information than necessary (information related to the key’s identifier), and maybe we could just use a string (a UUID or whatever) and use the headers to give the extra information…
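Something like this, for example (topic and header names are just made up):

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyWithHeaders {
    // Keep the key a plain UUID string; put the "extra" information you might
    // otherwise be tempted to model in a key schema into record headers instead.
    static ProducerRecord<String, String> build(String value) {
        String key = UUID.randomUUID().toString();
        ProducerRecord<String, String> record = new ProducerRecord<>("orders", key, value);
        record.headers().add("tenant-id", "acme".getBytes(StandardCharsets.UTF_8));
        record.headers().add("region", "eu-west-1".getBytes(StandardCharsets.UTF_8));
        return record;
    }
}
```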
Yes and no, since serializers ‘know’ whether they are being used for the key or the value. This is also how the Schema Registry serializers know by default whether the schema should be registered under the -key or -value subject, and how the Kafka Cloud Event Serializer can throw an error when used for the key.
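For example, in the Java clients the isKey flag comes in through configure(), so a serializer can fail fast. A minimal sketch (class name and error message are made up):

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

import org.apache.kafka.common.serialization.Serializer;

// Sketch of how a serializer can refuse to be used for keys: Kafka passes the
// isKey flag into configure(), so the serializer can throw immediately. This is
// the kind of check the Cloud Event serializer does.
public class ValueOnlySerializer implements Serializer<String> {

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        if (isKey) {
            throw new IllegalStateException("This serializer must not be used for keys");
        }
    }

    @Override
    public byte[] serialize(String topic, String data) {
        return data == null ? null : data.getBytes(StandardCharsets.UTF_8);
    }
}
```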
With Rust, for example, it’s a bit different, since there is no concept of serializers. They don’t ‘know’ whether they’re called for the key or the value; it’s just a function from data to bytes that then needs to be included in a Kafka record. (But to mimic the Java behavior, in some cases you do need to supply whether it’s being used for a key or a value.)