Hello! I’m new to Kafka and Kafka Streams and I’m looking for some educated advice regarding my scenario.
I have a requirement to combine events that arrive at quite distinct moments in time, i.e. I might receive event X today and event Y five days later. Only after I’ve received both X and Y should I produce a new Z event. There’s no way for me to predict a timeline for those events, only the rules on how they should be combined.
Does anyone know if this is something relatively easy to achieve using tools from the Kafka ecosystem? My research suggests I might have to build custom logic for all of that, but I wonder if I’m missing something, since I’m new and might not be approaching the problem from the right angle.
Using Kafka Streams, this should be possible. You would need to implement a custom Processor or Transformer (depending on whether you use the Processor API or the DSL). You also need to ensure that events that should be merged are written into the same partition; if you cannot control the upstream pipeline, you will need to repartition the data accordingly.
Inside the Processor/Transformer you can use a state store to buffer partial results and update them until they are complete, at which point you can delete them from the state store and emit the result downstream.
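For illustration, here is a minimal sketch of such a processor, assuming a recent Kafka Streams version (3.3+) with the new Processor API, String serdes, and made-up topic and store names ("merge-input", "merged-output", "pending-events" are all placeholders, not anything specific to your setup):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class MergeExample {

    static final String STORE_NAME = "pending-events";

    // Buffers the first event per key; once the matching second event
    // arrives, emits the combined result and deletes the buffered entry.
    static class MergeProcessor implements Processor<String, String, String, String> {

        private ProcessorContext<String, String> context;
        private KeyValueStore<String, String> store;

        @Override
        public void init(final ProcessorContext<String, String> context) {
            this.context = context;
            this.store = context.getStateStore(STORE_NAME);
        }

        @Override
        public void process(final Record<String, String> record) {
            final String buffered = store.get(record.key());
            if (buffered == null) {
                // first event (X or Y) for this key: buffer it and wait
                store.put(record.key(), record.value());
            } else {
                // counterpart arrived: emit combined event Z and clean up
                store.delete(record.key());
                context.forward(record.withValue(buffered + "|" + record.value()));
            }
        }
    }

    static Topology buildTopology() {
        final StreamsBuilder builder = new StreamsBuilder();
        builder.addStateStore(Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore(STORE_NAME),
                Serdes.String(), Serdes.String()));
        builder.stream("merge-input", Consumed.with(Serdes.String(), Serdes.String()))
               .process(MergeProcessor::new, STORE_NAME)
               .to("merged-output", Produced.with(Serdes.String(), Serdes.String()));
        return builder.build();
    }
}
```

Note that this assumes exactly one X and one Y per key; if your merge rules are richer, the value in the store would hold whatever partial state you need.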
I’m a bit confused by what you mean when you say I have to repartition the data accordingly. I understand that a partition holds a subset of the data inside a topic, but I’m not sure what repartitioning means in this scenario…
Thank you for the other info, it puts me on the right path for further research. I didn’t think about using state stores (probably backed by RocksDB if there are multiple services?) for this use case.
Let’s say you have record A in topic partition t1-p0 and record B in topic partition t1-p1, and you need to merge A and B. In this case, you need to create a new topic and write both A and B into the same partition of that new topic. (If A and B are not in the same partition, Kafka Streams would process them in two different tasks using two different shards of the state store, and thus you could not merge them. Kafka Streams does data-parallel processing per input partition, so data that needs to be processed together must be in the same partition.) Next, you read the data back from the second topic for processing. This is called data repartitioning, because the partitioning of the data in the second topic differs from the input topic. For example, the DSL has an operator KStream#repartition() that does this write and read-back for you under the hood (based on key-based hashing).
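In the DSL that could look like the following sketch. It assumes the events carry some correlation id you can extract from the value; extractCorrelationId is a hypothetical helper, and the topic and operator names are made up:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Repartitioned;

StreamsBuilder builder = new StreamsBuilder();

KStream<String, String> events =
    builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()));

// re-key by a shared correlation id so matching records share a partition;
// extractCorrelationId is a hypothetical helper for your event format
KStream<String, String> rekeyed = events
    .selectKey((oldKey, value) -> extractCorrelationId(value))
    // writes to the internal topic "<application.id>-merge-input-repartition"
    // and reads it back, completing the repartitioning described above
    .repartition(Repartitioned.<String, String>with(Serdes.String(), Serdes.String())
        .withName("merge-input"));
```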
Yes, by default Kafka Streams uses RocksDB (plus changelog topics to back up the state reliably in the Kafka cluster), but in-memory stores are also available (those are also fully fault-tolerant via changelog topics).
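For example, switching the store sketched above from RocksDB to in-memory is just a matter of the store supplier (again using the hypothetical "pending-events" store name):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

// in-memory store instead of the default RocksDB-backed one; the changelog
// topic (enabled by default) still makes it fault-tolerant
StoreBuilder<KeyValueStore<String, String>> storeBuilder =
    Stores.keyValueStoreBuilder(
        Stores.inMemoryKeyValueStore("pending-events"),
        Serdes.String(),
        Serdes.String());
// storeBuilder.withLoggingDisabled(); // would turn the changelog off
```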