Kafka Streams EOS - Producer fenced

I am currently trying to get rid of the following error in our EOS-configured Kafka Streams Spring Boot application:

org.apache.kafka.common.errors.InvalidProducerEpochException: Producer attempted to produce with an old epoch.

There is a newer producer with the same transactionalId which fences the current one.
Written offsets would not be recorded and no more records would be sent since the producer is fenced, indicating the task may be migrated out

I know this error is recoverable but I still feel like it happens too often. See the following error chart over 7 days to get an idea:

This usually happens during or after a rebalance, because our service runs on AWS spot instances in Kubernetes, which are shut down frequently. Our services do receive a graceful shutdown, though.

At the end of the day these errors are “ruining” our monitoring: the team is starting to go numb to error logs because of this one error, which is never a good place to be.

As it’s logged at ERROR level, I was wondering how I could prevent it from happening this often; once a week or month would be acceptable. On that note, why is it ERROR and not WARN if it’s “expected” and recoverable?

I tried to do some digging but sadly found nothing that could help, besides maybe enabling a leave-group request for Kafka Streams on shutdown, but I am just grasping at straws here.

Any help is appreciated. I really don’t want to do error log filtering for our monitoring :smiley:

Further context:
Spring Boot application with Kafka Streams using exactly_once_v2, with state stores to deduplicate output messages. Our state store topics are quite small (16k entries per partition, each at most a couple of KB in size) and properly compacted.
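For completeness, the EOS-relevant part of the Streams configuration boils down to something like this (simplified illustration; the commit interval and transaction timeout shown here are the Streams EOS defaults, not custom tuning):

```properties
# Kafka Streams EOS configuration (simplified illustration)
processing.guarantee=exactly_once_v2
# Under EOS, Streams lowers its default commit interval to 100 ms
commit.interval.ms=100
# Streams sets the embedded producer's transaction timeout to 10 s by default
transaction.timeout.ms=10000
```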

If you indeed do a clean shutdown, no such error should happen on the happy path. Maybe the instance does not get enough time to really shut down cleanly? For a clean shutdown, the final “state” of the application should be NOT_RUNNING. Thread and client state transitions are logged at INFO level, so that would be a good first thing to verify.
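To make that concrete, a minimal sketch of the kind of shutdown handling I mean (assuming you manage the `KafkaStreams` instance yourself; with spring-kafka’s `StreamsBuilderFactoryBean` the close happens in the bean lifecycle, but the idea is the same):

```java
import java.time.Duration;
import org.apache.kafka.streams.KafkaStreams;

public final class GracefulShutdown {

    // Register a JVM shutdown hook that closes Kafka Streams cleanly.
    // The timeout should stay well below the pod's terminationGracePeriodSeconds,
    // so close() can finish before Kubernetes sends SIGKILL.
    public static void register(KafkaStreams streams) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            boolean closed = streams.close(Duration.ofSeconds(30));
            // On a truly clean shutdown the final state is NOT_RUNNING;
            // anything else suggests the instance was cut off mid-close.
            System.out.println("closed cleanly=" + closed
                    + ", final state=" + streams.state());
        }));
    }
}
```

If `close()` returns false or the logged state is not NOT_RUNNING, the pod is likely being killed before the shutdown completes, which would explain the fencing on restart.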

Sending a leave-group request could maybe help, too; I would give it a try. Btw: since Kafka Streams 4.2.0, there is a new close(CloseOptions) overload that allows you to control whether a leave-group request is sent.
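For the older `KafkaStreams.CloseOptions` variant (from KIP-812) a sketch could look like this; note that in that variant `leaveGroup(true)` only takes effect with static membership (i.e. `group.instance.id` set), and the newer 4.2.0 API may look different, so check the Javadoc of the version you run:

```java
import java.time.Duration;
import org.apache.kafka.streams.KafkaStreams;

public final class LeaveGroupClose {

    // Close and explicitly ask the broker to drop this member from the group,
    // instead of waiting for the session timeout to expire. This lets the
    // group rebalance immediately rather than stalling on a dead member.
    public static void closeLeavingGroup(KafkaStreams streams) {
        KafkaStreams.CloseOptions options = new KafkaStreams.CloseOptions()
                .timeout(Duration.ofSeconds(30))
                .leaveGroup(true); // send a leave-group request on close
        streams.close(options);
    }
}
```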

I will take a look at our graceful shutdown handling

Here are some logs (in order of occurrence) right before the error occurs:

[INFO] [Producer clientId=redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-StreamThread-3-producer, transactionalId=redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-3]
Disconnecting from node 9 due to request timeout.

[INFO] [Producer clientId=redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-StreamThread-3-producer, transactionalId=redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-3]
Cancelled in-flight METADATA request with correlation id 28 due to node 9 being disconnected (elapsed time since creation: 30029ms, elapsed time since send: 30029ms, throttle time: 0ms, request timeout: 30000ms)

[INFO] [Producer clientId=redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-StreamThread-1-producer, transactionalId=redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-1]
Disconnecting from node 3 due to request timeout.

[INFO] [Producer clientId=redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-StreamThread-1-producer, transactionalId=redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-1]
Cancelled in-flight METADATA request with correlation id 16 due to node 3 being disconnected (elapsed time since creation: 30151ms, elapsed time since send: 30151ms, throttle time: 0ms, request timeout: 30000ms)

[INFO] [Producer clientId=redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-StreamThread-1-producer, transactionalId=redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-1]
Transiting to fatal error state due to org.apache.kafka.common.errors.ProducerFencedException: There is a newer producer with the same transactionalId which fences the current one.

[ERROR] [Producer clientId=redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-StreamThread-1-producer, transactionalId=redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-1]
Aborting producer batches due to fatal error

[ERROR] stream-thread [redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-StreamThread-1] stream-task [0_6] Error encountered sending record to topic redactedOutputTopic for task 0_6 due to:
org.apache.kafka.common.errors.ProducerFencedException: There is a newer producer with the same transactionalId which fences the current one.
Written offsets would not be recorded and no more records would be sent since the producer is fenced, indicating the task may be migrated out

These timeouts seem to appear frequently right before the errors occur. Is there any correlation? I talked to our ops team and they are not aware of any node failures or errors in the time frames of our errors.