How should you implement a retry/delay queue using Kafka topics?

TL;DR: How did you implement a delay/retry queue in Kafka?

Has anyone solved the problem of implementing retry/delay functionality in Kafka?

Originally I looked at Uber's example (Uber retry topic), but I do not want to create n topics and consumers, as the overhead would be too much. I essentially want one topic to act as a retry queue and one topic to be a dead letter queue. I would also like to avoid using a database.

This is what I designed, but I quickly ran into some issues.

Blocking consumers.
To begin with, I created the flow above and had the retry consumer wait a calculated amount of time based on the number of attempts (exponential backoff). If the retry failed, it would produce the event back onto the topic. The main flaw was that waiting out the delay blocked the consumer, which also blocked consumption of the other events behind it and increased their observed delay, so events ended up waiting much longer than intended. (i.e. event 1 is forced to wait 16 seconds, but event 2 is forced to wait for the processing of event 1 even though it’s only meant to wait 2 seconds. As you can tell, this gets nasty with multiple queued events.)
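To make the flaw concrete, the first version looked roughly like this (a simplified sketch; the topic name, header handling and backoff calculation are just illustrative):

```java
import java.time.Duration;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Simplified sketch of the flawed retry consumer: the sleep blocks the whole
// partition, so every queued event behind this one waits too.
public class BlockingRetryConsumer {

    void pollOnce(KafkaConsumer<String, String> consumer,
                  KafkaProducer<String, String> producer) throws InterruptedException {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            long delayMs = (long) Math.pow(2, attemptsFrom(record)) * 1000; // exponential backoff

            Thread.sleep(delayMs); // <-- the problem: blocks consumption of all later events

            if (!process(record)) {
                // Failed again: produce the event back onto the retry topic
                // (with an incremented attempt count carried in a header).
                producer.send(new ProducerRecord<>("retry-topic", record.key(), record.value()));
            }
        }
    }

    private int attemptsFrom(ConsumerRecord<String, String> record) { return 0; /* read attempts header */ }

    private boolean process(ConsumerRecord<String, String> record) { return true; /* business logic */ }
}
```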

Window processing in Streams.
The next option I thought of was creating a stream that processes events in a time window: when the consumer puts the event on the topic it also calculates the time at which the event should be processed, and the stream then works out which events fall due within the current window and processes them.

I have not yet finished implementing the window processing in the stream as it feels like an overly complicated way to do something simple.
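For what it’s worth, the rough shape I had in mind with the Processor API was something like this (untested; the store name, key encoding and fixed delay are placeholders, and the state store would still need to be registered on the topology):

```java
import java.time.Duration;

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

// Park events in a state store keyed by "zero-padded due time | original key", then
// use a wall-clock punctuator to forward whatever has fallen due. The "parked-events"
// store must also be added to the topology.
public class DelayedForwardProcessor implements Processor<String, String, String, String> {

    private ProcessorContext<String, String> context;
    private KeyValueStore<String, String> parked;

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;
        this.parked = context.getStateStore("parked-events");
        context.schedule(Duration.ofSeconds(1), PunctuationType.WALL_CLOCK_TIME, this::releaseDueEvents);
    }

    @Override
    public void process(Record<String, String> record) {
        long dueAt = System.currentTimeMillis() + 60_000; // the delay would really come from the event / attempt count
        parked.put(String.format("%020d|%s", dueAt, record.key()), record.value());
    }

    private void releaseDueEvents(long now) {
        try (KeyValueIterator<String, String> due = parked.range("", String.format("%020d|", now))) {
            while (due.hasNext()) {
                KeyValue<String, String> entry = due.next();
                String originalKey = entry.key.substring(21); // strip the due-time prefix
                context.forward(new Record<>(originalKey, entry.value, now));
                parked.delete(entry.key);
            }
        }
    }
}
```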

Has anyone dealt with this in the past? If so, how did you go about it? What technologies did you use, and what did you learn when implementing it?

Thanks for taking the time to read and I look forward to reading your comments!

Kind regards

Jordan

In our case there was no concern with the topic partition being blocked due to a retrying message, as subsequent messages would only result in the same retry, e.g. when the event processing makes a REST call to a temporarily unavailable service.

The bigger concern for us was to ensure that any retry delays did not result in consumer group rebalancing due to the consumer poll timing out. To that end we took advantage of Spring Kafka’s stateful retry, enabling retry via a re-poll from the topic. (Note that Spring no longer calls this ‘stateful’ retry; it is now their standard behaviour.)
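In recent Spring Kafka versions the wiring is roughly along these lines (a minimal sketch rather than the code from the post; the delays, bean name and dead-letter recoverer are illustrative):

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.ExponentialBackOff;

// On a retryable exception the error handler seeks the partition back, the record is
// re-polled, and the poll timer restarts. Retry state lives in the consumer instance,
// and once retries are exhausted the recoverer dead-letters the record.
// (Assumes Spring Boot wires this error handler bean into the listener container factory.)
@Configuration
public class RetryConfig {

    @Bean
    public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
        ExponentialBackOff backOff = new ExponentialBackOff(1_000L, 2.0); // start at 1s, double each time
        backOff.setMaxInterval(30_000L);     // keep any single delay well under max.poll.interval.ms
        backOff.setMaxElapsedTime(600_000L); // give up (and dead-letter) after ~10 minutes overall
        return new DefaultErrorHandler(new DeadLetterPublishingRecoverer(template), backOff);
    }
}
```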

Here is my blog post on this:

Rob.


Hi Rob!

Thank you for getting back to me, I have read your blog post and it is well written.

What you have looks like it will work for retrying with a small delay (a few seconds) and where blocking is acceptable, but what if you wanted to retry after 10 minutes, or perhaps even longer? Wouldn’t the polling window then be far too long, or wouldn’t you end up creating a large number of blocked consumers?

With the idea of having a consumer group rebalance and a new consumer take ownership of the retrying consumer’s partition, it seems like you could end up with either no consumers on any partitions if they are all retrying, or, if you have auto scaling, a very large number of consumers.

Another issue with both the solution I had thought of and the one you have given is ordering. How can we guarantee ordering when the topic is non-blocking? In my example it is clear to see: if event 1 fails and enters a retry topic, and event 2 is then successfully processed before event 1, we have lost ordering. In your example, if event 1 is with consumer 1 and the partition is rebalanced, then consumer 2 takes over the partition and the second event succeeds, and ordering is lost.


Hi Jordan,

With the Spring Kafka ‘stateful’ retry, the overall retry period can safely be as long as you like, so long as the maximum delay for any single retry is less than the consumer poll timeout. When the retryable exception is thrown the processing ends and the event is re-polled from the topic, so the consumer poll timer starts again. The ‘stateful’ part means that the same consumer instance that is polling this topic partition knows how many retries it has done for this message, and hence when the retries are exceeded it can dead letter the message.
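The listener itself then stays simple; as a rough sketch (the topic, group and service names are just for illustration):

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

// Throwing from the listener ends the current attempt; the error handler seeks the
// partition back, the record comes around again on the next poll, and the poll timer
// restarts - so the overall retry period can be long, provided no single back-off
// delay exceeds max.poll.interval.ms.
@Component
public class PaymentListener {

    private final PaymentService paymentService;

    public PaymentListener(PaymentService paymentService) {
        this.paymentService = paymentService;
    }

    @KafkaListener(topics = "payments", groupId = "payment-service")
    public void onPayment(String event) {
        // Any exception thrown here is treated as retryable by the configured error
        // handler (unless classified as not-retryable), e.g. a failed REST call.
        paymentService.process(event);
    }

    // Hypothetical downstream call that can fail transiently.
    public interface PaymentService {
        void process(String event);
    }
}
```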

So the intention is very much that there is NO consumer group rebalance, as this delays processing and is to be avoided.

The topic partition is blocked until the message either retries successfully or retries are exceeded. In our case, we don’t want subsequent messages to be processed while there is a transient error, as that would affect them too.

We wanted to avoid the overhead of retry topics and the extra complexity they bring, although this is another pattern that could indeed be used to ensure your topic partition is not blocked by a retrying message. But, as you mentioned, one repercussion with that approach is that you likely lose message ordering.

Rob.


Hi Rob,

Ah I see, your article makes sense now, so the concern for you was around poll timeouts and not blocking.

You make a good point around not being concerned with blocking as other events are likely to fail, though I am mostly trying to find a solution that does not block the partition.

Thank you for replying and clearing things up, and apologies for the late reply from me!

-Jordan


Hi Jordan, I’m running into a similar issue. Did you end up solving this so that you are not blocking the partition?

Hi Matt,

Unfortunately I didn’t find a simple enough solution to cover the whole thing.

I ended up deciding to scale the problem back and have simple retries in the consumer (only retrying a couple of times with a small wait, so blocking was acceptable).

For the scenarios where blocking is desired (i.e. you know for certain the next event will fail for the same reason, e.g. the consumer cannot connect to the downstream system), I implemented two categories of exceptions, “blocking exceptions” and “non-blocking exceptions” (roughly as sketched below the list).

  • Blocking exceptions allow the consumer to infinitely retry the event and block the topic so other events will not be consumed and order is maintained.
  • Non-blocking exceptions cause the event to be produced to a ‘failure’ topic where another consumer handles it, and the original topic is un-blocked.
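From memory, the shape of it was roughly this (a sketch rather than the real code; the exception classes, delay and topic name are made up):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Rough sketch from memory - the exception classes, delay and topic name are made up.
public class CategorisedRetryConsumer {

    static class BlockingException extends RuntimeException {}    // e.g. downstream system unreachable
    static class NonBlockingException extends RuntimeException {} // e.g. this one event cannot be processed

    private final KafkaProducer<String, String> producer;

    CategorisedRetryConsumer(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    void handle(ConsumerRecord<String, String> record) throws InterruptedException {
        while (true) {
            try {
                process(record);
                return;                  // success: move on to the next event
            } catch (BlockingException e) {
                // Retry forever, deliberately holding up the partition so ordering is kept.
                // (A real consumer would need to stay inside the poll timeout, e.g. by pausing the partition.)
                Thread.sleep(5_000);
            } catch (NonBlockingException e) {
                // Hand the event off to a separate consumer and unblock the partition.
                producer.send(new ProducerRecord<>("failure-topic", record.key(), record.value()));
                return;
            }
        }
    }

    private void process(ConsumerRecord<String, String> record) { /* business logic */ }
}
```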

There were a couple of ideas around using KSQL to push events that had reached their retry time and retry them, but the KSQL joining got too complex and felt unmaintainable.

However, since then Flink has been released and it may be able to solve this problem?
(Otherwise a traditional database would do the trick, but it does feel a little wrong for an event-driven system to be underpinned by a database.)

Good luck! I hope you manage to find a solution (I’d be interested to read it if you do!)

Cheers
Jordan