Strategies to handle emit behavior during app resets vs. normal processing

Hey everyone,

We are using Kafka Streams (in the context of an insurance company) to build data products. Such a data product could be the current state of a contract or a claim. These data products are published within the company as an API topic (compacted), which can be consumed by many consumers.

We have an issue whenever we have to reset a streams app. Such a reset can mean re-processing around 100 million input events to rebuild the state of 500k contracts, and such a load can take several hours to complete.

Because of eventual consistency, a specific contract is published multiple times during this period until it reaches its current state. During that time, the streams app emits many times more events than will actually be relevant to consumers in the end. The downstream consumers must process this load as well and trigger further actions themselves (or even publish new events for others). We have seen something like a “data avalanche” rolling down the Kafka landscape during such application resets, causing quite some stress for the overall organization.

It feels like we need two operation modes for Kafka Streams:

  • Normal mode: Events are emitted within seconds (latency matters)
  • Reset/Load mode: Events are only emitted once they are eventually consistent – or as seldom as possible to prevent downstream pressure (throughput and compaction matter)

As a current workaround, we use a processor as a “pressure valve” at the end of the topology to control the emit rate. Whenever the processing rate exceeds a certain threshold (in other words: a reset/load is detected), we stop downstream propagation and store the keys of not-yet-published events in a state store. Once the processing rate drops again, all stored keys (respectively the corresponding state from the state store) are published via a punctuator. That approach works, but it has some hard limits and does not really feel compliant with Kafka Streams concepts :slight_smile:
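For anyone curious what such a pressure valve can look like, here is a minimal, Kafka-free sketch of the decision logic. All names and the threshold are made up for illustration; in the real topology this logic would live inside a `Processor`, with the `pending` map replaced by a key-value state store and `flushIfCalm` driven by a punctuator:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Sketch: forward records while the observed rate is below a threshold;
// above it (reset/load detected), buffer only the latest value per key
// and flush once the rate has calmed down again.
class PressureValve {
    private final double maxEventsPerWindow;           // "load detected" threshold
    private final BiConsumer<String, String> downstream;
    // stands in for the state store: latest unpublished value per key
    private final Map<String, String> pending = new LinkedHashMap<>();
    private long windowStartMs = -1;
    private long eventsInWindow = 0;

    PressureValve(double maxEventsPerWindow, BiConsumer<String, String> downstream) {
        this.maxEventsPerWindow = maxEventsPerWindow;
        this.downstream = downstream;
    }

    // Called for every record; nowMs is injected to keep the sketch testable.
    void process(String key, String value, long nowMs) {
        if (windowStartMs < 0) windowStartMs = nowMs;
        if (nowMs - windowStartMs >= 1000) {           // start a new 1s rate window
            eventsInWindow = 0;
            windowStartMs = nowMs;
        }
        eventsInWindow++;
        if (eventsInWindow > maxEventsPerWindow) {     // overloaded: buffer instead of emit
            pending.put(key, value);                   // newer value overwrites older one
        } else {
            downstream.accept(key, value);
        }
    }

    // In a real Processor this would be scheduled as a punctuation.
    void flushIfCalm(long nowMs) {
        if (nowMs - windowStartMs >= 1000 && !pending.isEmpty()) {
            pending.forEach(downstream);               // publish latest state per key
            pending.clear();
        }
    }
}
```

Because the buffer keeps only the latest value per key, a contract touched by thousands of replayed events is published once after the flush, which is the compaction-like behavior the reset mode wants.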

It would be interesting to hear whether others also face such issues with emit control. What strategies do you use to solve them? Are there any known best practices? Is a proper solution even possible with Kafka Streams?

Any input is highly welcome!

Thanks a lot


Overall, the approach you picked sounds ok to me.

I guess another thing you could try is to get the input topic’s end offsets (or end offset minus some delta) before you restart the application, and use them in your pre-sink processor to drop results as long as those end offsets have not been reached yet. It would require some additional logic to track when you reach those offsets and thus start to enable sending results. But it could help to avoid the need to buffer up results and use a punctuation to publish them later. (Just an idea.)
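A minimal, Kafka-free sketch of that gating idea (class and method names are made up; in practice you would capture the end offsets with an `AdminClient` before the restart and compare them against the record metadata inside the processor):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: drop results per partition until the consumed offset reaches
// the end offset captured before the restart; after that, pass everything
// through. Passing "end offset minus some delta" as the target trades a
// few extra emits for not swallowing the last pre-restart updates.
class EndOffsetGate {
    private final Map<Integer, Long> targetEndOffsets; // partition -> captured end offset

    EndOffsetGate(Map<Integer, Long> targetEndOffsets) {
        this.targetEndOffsets = new HashMap<>(targetEndOffsets);
    }

    // Returns true if the record should be forwarded downstream.
    boolean shouldForward(int partition, long recordOffset) {
        Long target = targetEndOffsets.get(partition);
        if (target == null || recordOffset >= target) {
            targetEndOffsets.remove(partition);        // caught up: stop gating
            return true;
        }
        return false;                                  // still replaying: drop
    }
}
```

Compared to the pressure valve, this needs no buffering at all, but a key whose last update lies before the target offset is never re-emitted, so compacted downstream topics stay correct only because they already hold that key’s state from before the reset.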

The only other thing I can think of is to land the data in a KTable and apply suppress() to reduce the output data rate. But it might not be semantically strict enough for your use case, and you also could not just turn it off.
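To illustrate the semantics of that suggestion, here is a small Kafka-free model of time-limit suppression, roughly what `suppress(Suppressed.untilTimeLimit(...))` does on a KTable: hold only the newest value per key and emit it once the time limit has elapsed (names and the punctuation-style `maybeEmit` call are made up for the sketch):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Model of time-limit suppression: per key, keep only the newest value and
// emit it once limitMs has elapsed since the key was first buffered.
class TimeLimitSuppress {
    private final long limitMs;
    private final BiConsumer<String, String> downstream;
    private final Map<String, String> latest = new LinkedHashMap<>();
    private final Map<String, Long> firstSeenMs = new LinkedHashMap<>();

    TimeLimitSuppress(long limitMs, BiConsumer<String, String> downstream) {
        this.limitMs = limitMs;
        this.downstream = downstream;
    }

    void update(String key, String value, long nowMs) {
        latest.put(key, value);                        // newer value replaces older one
        firstSeenMs.putIfAbsent(key, nowMs);           // limit counts from first buffering
        maybeEmit(nowMs);
    }

    void maybeEmit(long nowMs) {
        firstSeenMs.entrySet().removeIf(e -> {
            if (nowMs - e.getValue() >= limitMs) {     // time limit reached for this key
                downstream.accept(e.getKey(), latest.remove(e.getKey()));
                return true;
            }
            return false;
        });
    }
}
```

This shows why it is only rate reduction, not reset detection: intermediate updates within the window are collapsed, but every key is still emitted once per window, and the delay also applies during normal low-latency operation.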


Hi Matthias
Thanks a lot for your feedback!
