Kafka Streams EOS - Producer fenced

I am currently trying to get rid of the following error in our EOS-configured Kafka Streams Spring Boot application:

org.apache.kafka.common.errors.InvalidProducerEpochException: Producer attempted to produce with an old epoch.

There is a newer producer with the same transactionalId which fences the current one.
Written offsets would not be recorded and no more records would be sent since the producer is fenced, indicating the task may be migrated out

I know this error is recoverable but I still feel like it happens too often. See the following error chart over 7 days to get an idea:

This usually happens during or after a rebalance, because our service runs on AWS spot instances in Kubernetes, which are shut down frequently. Our services do receive a graceful shutdown, though.

At the end of the day these errors are “ruining” our monitoring: the team is starting to go numb to error logs because of this one error, which is never a good place to be.

As it’s logged at ERROR level, I was wondering how I could prevent it from happening this often; once a week or month would be acceptable. On that note, why is it ERROR and not WARN if it’s “expected” and recoverable?

I tried to do some digging but sadly found nothing that could help, besides maybe enabling a leave-group request for Kafka Streams on shutdown, but I am just grasping at straws here.

Any help is appreciated. I really don’t want to do error log filtering for our monitoring :smiley:

Further context:
Spring Boot application with Kafka Streams using exactly_once_v2, with state stores to deduplicate output messages. Our state store topics are quite small (16k entries per partition, each at most a couple of KB in size) and properly compacted.
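For completeness, the EOS-relevant part of the Streams configuration boils down to something like this (simplified illustration; the commit interval and transaction timeout shown here are the Streams EOS defaults, not custom tuning):

```properties
# Kafka Streams EOS configuration (simplified illustration)
processing.guarantee=exactly_once_v2
# Under EOS, Streams lowers its default commit interval to 100 ms
commit.interval.ms=100
# Streams sets the embedded producer's transaction timeout to 10 s by default
transaction.timeout.ms=10000
```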

If you indeed do a clean shutdown, no such error should happen on the happy path. Maybe the instance does not get enough time to really shut down cleanly? For a clean shutdown, the final “state” of the application should be NOT_RUNNING. Thread and client state transitions are logged at INFO level, so that would be a good first thing to verify.
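To make that concrete, a minimal sketch of the kind of shutdown handling I mean (assuming you manage the `KafkaStreams` instance yourself; with spring-kafka’s `StreamsBuilderFactoryBean` the close happens in the bean lifecycle, but the idea is the same):

```java
import java.time.Duration;
import org.apache.kafka.streams.KafkaStreams;

public final class GracefulShutdown {

    // Register a JVM shutdown hook that closes Kafka Streams cleanly.
    // The timeout should stay well below the pod's terminationGracePeriodSeconds,
    // so close() can finish before Kubernetes sends SIGKILL.
    public static void register(KafkaStreams streams) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            boolean closed = streams.close(Duration.ofSeconds(30));
            // On a truly clean shutdown the final state is NOT_RUNNING;
            // anything else suggests the instance was cut off mid-close.
            System.out.println("closed cleanly=" + closed
                    + ", final state=" + streams.state());
        }));
    }
}
```

If `close()` returns false or the logged state is not NOT_RUNNING, the pod is likely being killed before the shutdown completes, which would explain the fencing on restart.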

Sending a leave-group request could maybe help, too; I would give it a try. Btw: since Kafka Streams 4.2.0, there is a new close(CloseOptions) overload that allows you to control whether a leave-group request is sent.
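For the older `KafkaStreams.CloseOptions` variant (from KIP-812) a sketch could look like this; note that in that variant `leaveGroup(true)` only takes effect with static membership (i.e. `group.instance.id` set), and the newer 4.2.0 API may look different, so check the Javadoc of the version you run:

```java
import java.time.Duration;
import org.apache.kafka.streams.KafkaStreams;

public final class LeaveGroupClose {

    // Close and explicitly ask the broker to drop this member from the group,
    // instead of waiting for the session timeout to expire. This lets the
    // group rebalance immediately rather than stalling on a dead member.
    public static void closeLeavingGroup(KafkaStreams streams) {
        KafkaStreams.CloseOptions options = new KafkaStreams.CloseOptions()
                .timeout(Duration.ofSeconds(30))
                .leaveGroup(true); // send a leave-group request on close
        streams.close(options);
    }
}
```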

I will take a look at our graceful shutdown handling

Here are some logs (in order of occurrence) right before the error occurs:

[INFO] [Producer clientId=redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-StreamThread-3-producer, transactionalId=redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-3]
Disconnecting from node 9 due to request timeout.

[INFO] [Producer clientId=redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-StreamThread-3-producer, transactionalId=redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-3]
Cancelled in-flight METADATA request with correlation id 28 due to node 9 being disconnected (elapsed time since creation: 30029ms, elapsed time since send: 30029ms, throttle time: 0ms, request timeout: 30000ms)

[INFO] [Producer clientId=redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-StreamThread-1-producer, transactionalId=redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-1]
Disconnecting from node 3 due to request timeout.

[INFO] [Producer clientId=redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-StreamThread-1-producer, transactionalId=redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-1]
Cancelled in-flight METADATA request with correlation id 16 due to node 3 being disconnected (elapsed time since creation: 30151ms, elapsed time since send: 30151ms, throttle time: 0ms, request timeout: 30000ms)

[INFO] [Producer clientId=redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-StreamThread-1-producer, transactionalId=redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-1]
Transiting to fatal error state due to org.apache.kafka.common.errors.ProducerFencedException: There is a newer producer with the same transactionalId which fences the current one.

[ERROR] [Producer clientId=redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-StreamThread-1-producer, transactionalId=redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-1]
Aborting producer batches due to fatal error

[ERROR] stream-thread [redactedTopic-b8050bf5-c4e6-4ec9-9a50-de2955dcb381-StreamThread-1] stream-task [0_6] Error encountered sending record to topic redactedOutputTopic for task 0_6 due to:
org.apache.kafka.common.errors.ProducerFencedException: There is a newer producer with the same transactionalId which fences the current one.
Written offsets would not be recorded and no more records would be sent since the producer is fenced, indicating the task may be migrated out

These timeouts seem to appear frequently right before the errors occur. Is there any correlation? I talked to our ops team and they are not aware of any node failures or errors in the time frames of our errors.