I would like confirmation that the approach we want to take makes sense.
We want to use kafka also as configuration storage.
We have topics such as
images-png with many partitions: consumers
subscribe to these topics and let kafka decide how to share the load.
Then we have a
configuration topic, where messages (keyed by extension) contain configuration params that tell us how to handle a particular image format.
We want each worker process to consume all
configuration messages from all partitions, without ever storing offsets/committing them.
We then want each consumer to
subscribe to all possible
images-* topics, based on the list of accumulated configuration messages, and let kafka assign partitions.
The approach we want to take is the following:
- we have one
configuration consumer that queries kafka (using
list_topics) to know how many partitions the
configuration topic is made of
- we then call
configuration_consumer.assign passing the list of all topic partitions
- once all configuration messages are consumed, we build the list of
- we then call
images_consumer.subscribe with the list of images topics
configuration consumer that is manually
assigned all topic partitions of the
images consumer that
subscribes to the resulting list of
This would be a major change for us, as we currently do this stuff based on configuration files, which are getting harder and harder to maintain.
So, before we hit prod, do you see anything wrong with this approach?
There’s nothing inherently wrong with your approach. In fact this is how things like
connect-configs works for control the Kafka connect configurations. I would suggest you take a look at the code used there as an example for how to do this.
Do you have a link to share? I must admit I’m pretty new, and I also use kafka in a python shop, so I miss lots of java-only cool stuff
Also, do you think this mixed usage of dynamic assignment (via subscribe) and manual assignment (via assign) could be troublesome?
I faced a similar scenario in one of my previous projects. There we had some central configuration which contained about ~100 different “types/categories” of something and behind that there was a map with flags and other parameters that steered the behavior of a specific “type/category” at multiple systems across a huge business process. Basically the requirement was to have ONE central configuration for those parameters, that new “types/categories” can get introduced without deployment and that parameters behind that “types/categories” can be changed at runtime.
The solution was to have a configuration topic with 1 Partition (why more when you expect only ~100 records) configured as a compaction topic and of course replicated. The app that owned the config produced updates per key to this topic so that the latest config for one key was there as soon it got “activated”.
On the consumer side we had different possibilities:
→ Some apps used Kafka Connect to get those configs into their RDBMs and so they were directly able to use it in their stored procedures and so on (You need to upsert here)
→ Java Apps just built a GlobalKTable for this topic and therefore had a local Materialized view of it which can be looked up quite fast by key.
→ KStream apps also were able to join in that topic as GlobalTable.
In our case I think that worked out pretty well, enabled easy testing of different configurations in our integration tests and also allowed straight-forward distribution of config updates in a distributed system.
I would prefer GlobalKTables over manual consumers as you described.
It is a just a few lines of code and you have a ready2use state-store which can be looked up by key and KStreams keeps care that the data is up to date (In-Memory Store might be enough).
For me this approach has also the advantage that you can provide the same information to a normal Java App and to KStream Topologies (e.g. looking up some values and map it into another record) with the same tooling and without some extra pieces of technology needed.