Hello! I have some experience with Kafka, a little with Kafka Streams, and I have read/watched the related tutorials, videos and developer-guide chapters, so I think I understand the differences and the pros and cons. In the textbook examples it's also mostly clear whether a KStream, a KTable or a GlobalKTable is the right choice/abstraction. In our real-world situation, however, it's still quite unclear to me which one (or which combination) would be the right choice.
Our situation could be summarized as follows:
- There’s one topic, let’s call it E, where we get versions of some logical entities. The key of such an event corresponds to the entity’s ID.
- In another topic, let's call it P, we get publish events for particular versions of these entities. (Note: by "publish" I don't mean the technical publishing of a Kafka event, but publishing in the domain sense, like sending an email to the customer.) The key of such an event corresponds to the ID of the entity to be published.
- For a particular version of an entity, any number (including 0) of publish events may be consumed, and in any order. That is, either the event on P or the corresponding event on E might arrive first.
The straightforward approach would probably be to
a) read the events from E into a KTable, since consecutive events for an entity with a particular ID represent new versions and thus updates, and
b) read the events from P into a KStream, since every event asks to publish the corresponding entity, then
c) join the KStream with the KTable, and finally
d) write the events resulting from the join to the outgoing result topic R:
```
E --> KTable ----\
                  join >--> KStream --> R
P --> KStream ---/
```
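In DSL terms, that draft would look roughly like this; it's only a minimal sketch with String serdes, simplified value types and placeholder names, not our actual code:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class DraftTopology {

    public static void build(StreamsBuilder builder) {
        // a) Topic E as a changelog: the latest version per entity ID.
        KTable<String, String> entities = builder.table(
                "E", Consumed.with(Serdes.String(), Serdes.String()));

        // b) Topic P as a stream of publish requests, keyed by entity ID.
        KStream<String, String> publishRequests = builder.stream(
                "P", Consumed.with(Serdes.String(), Serdes.String()));

        // c) + d) Enrich each publish request with the current entity version
        // and write the result to the outgoing topic R.
        publishRequests
                .join(entities, (publishEvent, entityVersion) -> entityVersion)
                .to("R", Produced.with(Serdes.String(), Serdes.String()));
    }
}
```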
Unfortunately, our world is not that simple. For a particular entity:
- we might initially get multiple versions on E without any publish events on P, and these should not result in any events on the result topic R. The solution drafted above handles this fine.
- but once the entity has been published on R (and is thus known to the world consuming from R), we need to publish new versions of the entity even without an explicit publish event on P. We wouldn't get that with the solution drafted above.
- and to make it worse, there's an exception to the preceding point: we know (from the entity's state) whether a publish event on P is to be expected, and if such an event is expected we have to wait for it before publishing on R, even if we published that entity on R in the past. We wouldn't get that either.
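Just to restate these rules compactly, this is the per-entity information I think would have to be tracked somewhere. The names are purely hypothetical, it's not existing code:

```java
// Hypothetical per-entity state, only meant to restate the publishing rules above.
public class EntityPublishState {
    String latestVersion;          // latest version consumed from E
    boolean publishedOnR;          // has this entity ever been published to R?
    boolean awaitingPublishEvent;  // does the entity's state say an event on P is still expected?

    // New version on E: forward it to R right away only if the entity was already
    // published and no explicit publish event on P is expected (second and third rule).
    boolean publishNewVersionImmediately() {
        return publishedOnR && !awaitingPublishEvent;
    }

    // Publish event on P: publish the latest version to R and clear awaitingPublishEvent.
}
```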
The second requirement (that we need to publish on R without having an event on P) suggests that we should read the events from E into a KStream instead of a KTable. However, joining the two KStreams afterwards would require a join window, and the events on P may be consumed long after the last version was consumed from E (even months later).
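For completeness, this is roughly what the KStream-KStream variant would look like (Kafka Streams 3.x API, again with placeholder serdes and names), and it shows exactly where it hurts: the join window, here an arbitrary 90 days, would have to cover the maximum gap between the last version on E and the publish event on P, which can be months:

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.StreamJoined;

public class WindowedJoinSketch {

    public static void build(StreamsBuilder builder) {
        KStream<String, String> entityVersions = builder.stream(
                "E", Consumed.with(Serdes.String(), Serdes.String()));
        KStream<String, String> publishRequests = builder.stream(
                "P", Consumed.with(Serdes.String(), Serdes.String()));

        // The window must span the longest possible delay between an entity
        // version on E and its publish event on P -- potentially months --
        // which means the join has to retain months of state.
        publishRequests
                .join(entityVersions,
                        (publishEvent, entityVersion) -> entityVersion,
                        JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofDays(90)),
                        StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()))
                .to("R", Produced.with(Serdes.String(), Serdes.String()));
    }
}
```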
I hope I've described the problem clearly enough to ignite some discussion!