KStream vs KTable vs GlobalKTable

Hello! I have some experience with Kafka, a little with Kafka Streams, and I’ve read/watched the related tutorials, videos and dev guide chapters, so I think I’ve got the idea of the differences and the pros and cons. Also, in the textbook examples it’s mostly pretty clear whether a KStream, a KTable or a GlobalKTable is the right choice/abstraction. In our real-world situation, however, it’s still quite unclear to me which one (or which combination) would be the right choice.

Our situation could be summarized as follows:

  1. There’s one topic, let’s call it E, where we get versions of some logical entities. The key of such an event corresponds to the entity’s ID.
  2. In another topic, let’s call it P, we get publish events for particular versions of these entities. (Note, with publish I don’t mean the technical publishing of a Kafka event but publishing in terms of the domain - like sending an email or so to the customer.) The key of such an event corresponds to the ID of the entity to be published.
  3. For a particular version of an entity, any number (including zero) of publish events may be consumed, and in any order. That is, either the event on P or the corresponding event on E might arrive first.

The straightforward approach would probably be to
a) read the events from E into a KTable, as consecutive events for an entity with a particular ID represent new versions and thus updates,
b) read the events from P into a KStream, as every event asks to publish the corresponding entity, then
c) join the KStream with the KTable, and finally
d) publish the events from the join to the outgoing result topic R

E---> KTable  ---\
             Join >---> KStream ---> R
P --> KStream ---/
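
In code, that straightforward topology would look roughly like this (a minimal sketch in Java; String serdes and the literal topic names E, P and R are assumptions based on the description above):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

Topology buildTopology() {
    StreamsBuilder builder = new StreamsBuilder();

    // a) E as a changelog: later events for the same entity ID are updates.
    KTable<String, String> entities =
            builder.table("E", Consumed.with(Serdes.String(), Serdes.String()));

    // b) P as a stream: every publish event should trigger an action.
    KStream<String, String> publishRequests =
            builder.stream("P", Consumed.with(Serdes.String(), Serdes.String()));

    // c) inner join on the entity ID: only emits when a version of the
    //    entity already exists in the table, and
    // d) forward the joined result to R.
    publishRequests
            .join(entities, (publishEvent, entity) -> entity)
            .to("R", Produced.with(Serdes.String(), Serdes.String()));

    return builder.build();
}
```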

Unfortunately our world is not that simple. For a particular entity

  • we might initially get multiple versions on E without any publish events on P, and these should not result in any events on the result topic R. :+1: We’re fine with the above drafted solution in this regard.
  • but once the entity is published on R (and thus known to the world consuming from R), we need to publish new versions of the entity even without an explicit publish event on P. :-1: We wouldn’t get that with the above drafted solution.
  • and to make it worse, there’s an exception to the preceding point: we know (from the entity’s state) whether a publish event on P is to be expected, and if such an event is expected we have to wait for it before publishing on R, even if we published that entity on R in the past. :-1: We wouldn’t get that either.

The second requirement (that we need to publish on R without having an event on P) suggests that we should read the events from E into a KStream instead of a KTable. However, joining the two KStreams afterwards would mean that we need a join window, and the events on P may be consumed long after the last version was consumed from E (even months later).
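
For illustration, this is roughly what the stream-stream join would force on us (a sketch; `entityStream` and `publishStream` are hypothetical KStreams read from E and P, the 90-day window is an arbitrary placeholder, and `JoinWindows.ofTimeDifferenceWithNoGrace` assumes a recent Kafka Streams version). The join state for the whole window has to be kept in the application’s local state stores, which is why a months-long window gets unwieldy:

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.StreamJoined;

// A KStream-KStream join always needs a JoinWindows; to tolerate publish
// events arriving months after the entity version, the window (and thus
// the locally retained join state) would have to span that whole period.
KStream<String, String> joined = entityStream.join(
        publishStream,
        (entity, publishEvent) -> entity,
        JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofDays(90)),
        StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()));
```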

I hope I could describe the problem clearly enough to ignite some discussion!

It seems as though you have rather specific requirements but want some stuff to be lenient as well.

First things first:

  1. In a KStream-KTable join, only the KStream side triggers an emit. If you use an inner join, a result is emitted only IF there is a matching value in the KTable; otherwise nothing is emitted. If you use leftJoin, a result is emitted even when there is no value in E corresponding to the key; the table side of the join is simply null (see the sketch after this list).

  2. Since R is just a result topic, you COULD produce to it directly from a regular producer to bypass the Kafka Streams join which publishes to R. However, whether you SHOULD is not clear, because you’d break the determinism of your Kafka Streams topology if you ever replayed the events of E and P. Also, we tend to favour the single-writer principle for a given topic/partition. Maybe what you need to do is produce “publish events” to a separate topic (e.g. R-explicit). A consumer interested in publish events can then listen to both R and R-explicit in the same consumer.subscribe() call (sketched below) to do its next action.
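
To make point 1 concrete, a minimal sketch of the leftJoin variant, continuing the hypothetical `publishRequests` stream and `entities` table from the topology sketch above:

```java
// leftJoin: every record on P produces a join result; `entity` is null
// when no version for that key has been seen on E yet.
publishRequests
        .leftJoin(entities, (publishEvent, entity) -> entity)
        .filter((entityId, entity) -> entity != null) // or branch/route the nulls instead of dropping them
        .to("R", Produced.with(Serdes.String(), Serdes.String()));
```

Whether you drop, park, or re-route those null results is exactly where your requirements 2 and 3 would come into play.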
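
And for the consumer side of point 2 (a sketch; `consumerProps` stands for your usual consumer configuration, and R-explicit is just the hypothetical topic name from above):

```java
import java.util.List;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// One consumer, one subscription covering both publish-event topics.
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
consumer.subscribe(List.of("R", "R-explicit"));
```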

I don’t fully understand your third requirement. P and R are not tables that you can do lookups on to determine “if not yet produced to R, produce to P first”. However, you could have another part of the system which maintains that logic (and potentially some in-memory state) and then does a “produce to P, wait for the ack, then produce to R”. Again, this would probably have to be a plain consumer/producer app.
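
A sketch of that “produce to P, wait for ack, then produce to R” sequence with the plain producer API (the method and variable names are made up for illustration; error handling omitted):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

void publishWithAck(KafkaProducer<String, String> producer,
                    String entityId, String publishEvent, String entityPayload)
        throws Exception {
    // Produce the explicit publish event to P and block until the broker acks it...
    producer.send(new ProducerRecord<>("P", entityId, publishEvent)).get();
    // ...and only then make the entity visible to the outside world on R.
    producer.send(new ProducerRecord<>("R", entityId, entityPayload)).get();
}
```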

Perhaps it would help to elaborate on the business requirements rather than this theoretical, mathematical notation, which is very hard to decipher and design for, or even to judge whether Kafka Streams is a good fit.

Kafka Streams may be a good fit for use cases where stream processing (especially the creation of materialized views/state by replaying events deterministically) is appropriate. But it is not a tool for all use cases (for example BPM-like workflows), since you’d have to hack around quite a bit and even then might not get the behaviour you want. The plain producer and consumer APIs give you much more flexibility there, and you can build hybrid solutions where some parts are done with Kafka Streams and some parts with plain consumers/producers.