Kafka Streams stuck partition

We have a set of Kafka Streams / Spring Boot / spring-kafka applications that had an incident over the past couple of days: we noticed that a single partition of a topic had many thousands of messages stuck in it. On further investigation we found that our normal Streams consumers, which use isolation.level=read_committed, could not read these stuck messages, but a consumer with read_uncommitted could. We verified that it wasn’t stuck on a single message; it appears that every message in the topic after the first stuck one can only be consumed with read_uncommitted. I believe this means these thousands of messages were produced inside transactions that were never committed or aborted, so the last stable offset is pinned at the first stuck message and read_committed consumers cannot advance past it.
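For reference, a minimal diagnostic sketch of one way to confirm this kind of hang (not necessarily the exact check we ran): under read_committed, KafkaConsumer#endOffsets() returns the last stable offset, while under read_uncommitted it returns the high watermark, so a persistent gap between the two points at an open transaction. The topic/partition and bootstrap address below are placeholders loosely taken from the logs further down.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class StuckPartitionCheck {

    // endOffsets() returns the last stable offset under read_committed and the
    // high watermark under read_uncommitted, so comparing the two shows how far
    // behind an open transaction is holding read_committed readers.
    static long endOffset(String isolationLevel, TopicPartition tp) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, isolationLevel);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            return consumer.endOffsets(Collections.singletonList(tp)).get(tp);
        }
    }

    public static void main(String[] args) {
        // placeholder partition, taken from the log lines below
        TopicPartition tp = new TopicPartition("wagers-united-transactions-recovery", 6);
        long lso = endOffset("read_committed", tp);
        long hwm = endOffset("read_uncommitted", tp);
        System.out.printf("last stable offset=%d, high watermark=%d, stuck messages~%d%n",
                lso, hwm, hwm - lso);
    }
}
```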

Versions: kafka-clients and kafka-streams 3.2.3, Spring Boot 2.7.9, spring-kafka 2.9.1, broker 3.4.0.

Configuration properties (a sketch of how these map onto Streams, consumer, and producer keys follows the list):

poll.ms = 10
commit.interval.ms = 0
topology.optimization = all
processing.guarantee = exactly_once_v2
num.standby.replicas = 1
metrics.recording.level = INFO
num.stream.threads = 64
replication.factor = 1
acks = all
linger.ms = 0
retries = 2147483647
max.block.ms = 60000
metadata.max.age.ms = 300000
max.poll.interval.ms = 60000
max.poll.records = 30
session.timeout.ms = 10000
fetch.max.wait.ms = 500
fetch.min.bytes = 1
heartbeat.interval.ms = 3000
allow.auto.create.topics = false
auto.offset.reset = latest
max.task.idle.ms = 10
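The list above mixes Streams-level settings with consumer- and producer-level overrides. As a sketch only (placeholder application id and bootstrap servers, not our actual Spring wiring), this is how they map onto plain StreamsConfig properties:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class ExampleStreamsConfig {
    static Properties streamsProperties() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-app");        // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder

        // Streams-level settings
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
        props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 0L);
        props.put(StreamsConfig.POLL_MS_CONFIG, 10L);
        props.put(StreamsConfig.MAX_TASK_IDLE_MS_CONFIG, 10L);
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 64);
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
        props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 1);
        props.put(StreamsConfig.TOPOLOGY_OPTIMIZATION_CONFIG, StreamsConfig.OPTIMIZE); // "all"

        // Consumer-level overrides
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), 30);
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 60000);
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 10000);
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG), "latest");

        // Producer-level overrides (acks=all and idempotence are implied by EOS anyway)
        props.put(StreamsConfig.producerPrefix(ProducerConfig.ACKS_CONFIG), "all");
        props.put(StreamsConfig.producerPrefix(ProducerConfig.LINGER_MS_CONFIG), 0);
        return props;
    }
}
```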

I checked the broker logs and found nothing, though the broker log level was set to WARN. The node that the producer was running on was in the process of terminating when the incident started (the Kubernetes node was being drained). I found these producer exceptions, which seem related, from the pods on the terminating node:

2023-08-22 01:13:34.026  WARN [gateway-united-kafka-service,,] 1 --- [ead-64-producer] o.a.k.clients.producer.internals.Sender  : [Producer clientId=wagers-united-transactions-stream-c064ad3f-ba6a-4371-9e21-33e14af4a87a-StreamThread-64-producer, transactionalId=wagers-united-transactions-stream-502b91fe-fe5e-4af1-be54-bd1b25181882-64] Got error produce response with correlation id 10156 on topic-partition wagers-united-transactions-recovery-6, retrying (2147480373 attempts left). Error: NETWORK_EXCEPTION. Error Message: Disconnected from node 3

2023-08-22 01:13:34.026  WARN [gateway-united-kafka-service,,] 1 --- [ead-64-producer] o.a.k.clients.producer.internals.Sender  : [Producer clientId=wagers-united-transactions-stream-c064ad3f-ba6a-4371-9e21-33e14af4a87a-StreamThread-64-producer, transactionalId=wagers-united-transactions-stream-502b91fe-fe5e-4af1-be54-bd1b25181882-64] Received invalid metadata error in produce request on partition wagers-united-transactions-recovery-6 due to org.apache.kafka.common.errors.NetworkException: Disconnected from node 3. Going to request metadata update now

2023-08-22 01:14:04.142  WARN [gateway-united-kafka-service,,] 1 --- [ead-64-producer] o.a.k.clients.producer.internals.Sender  : [Producer clientId=wagers-united-transactions-stream-c064ad3f-ba6a-4371-9e21-33e14af4a87a-StreamThread-64-producer, transactionalId=wagers-united-transactions-stream-502b91fe-fe5e-4af1-be54-bd1b25181882-64] Got error produce response with correlation id 10159 on topic-partition wagers-united-transactions-recovery-6, retrying (2147480372 attempts left). Error: NETWORK_EXCEPTION. Error Message: Disconnected from node 3

2023-08-22 01:14:04.142  WARN [gateway-united-kafka-service,,] 1 --- [ead-64-producer] o.a.k.clients.producer.internals.Sender  : [Producer clientId=wagers-united-transactions-stream-c064ad3f-ba6a-4371-9e21-33e14af4a87a-StreamThread-64-producer, transactionalId=wagers-united-transactions-stream-502b91fe-fe5e-4af1-be54-bd1b25181882-64] Received invalid metadata error in produce request on partition wagers-united-transactions-recovery-6 due to org.apache.kafka.common.errors.NetworkException: Disconnected from node 3. Going to request metadata update now

2023-08-22 01:14:18.614 ERROR [gateway-united-kafka-service,64e2cdf1f01ed187e85e2a18a3d3615e,2f422c46d406a8cd] 1 --- [read-1-producer] o.a.k.s.p.internals.RecordCollectorImpl  : stream-thread [wagers-united-transactions-recovery-stream-gateway-united-kafka-66db75bbbf-ccgzv-StreamThread-1] task [0_4] Error encountered sending record to topic wagers-united-transactions-done for task 0_4 due to:
org.apache.kafka.common.errors.InvalidProducerEpochException: Producer attempted to produce with an old epoch.
Written offsets would not be recorded and no more records would be sent since the producer is fenced, indicating the task may be migrated out

org.apache.kafka.common.errors.InvalidProducerEpochException: Producer attempted to produce with an old epoch.

2023-08-22 01:14:18.614 ERROR [gateway-united-kafka-service,64e2cdf1f01ed187e85e2a18a3d3615e,e44a9f539cb474cb] 1 --- [read-2-producer] o.a.k.s.p.internals.RecordCollectorImpl  : stream-thread [wagers-united-transactions-recovery-stream-gateway-united-kafka-66db75bbbf-ccgzv-StreamThread-2] task [0_4] Error encountered sending record to topic wagers-united-transactions-done for task 0_4 due to:
org.apache.kafka.common.errors.InvalidProducerEpochException: Producer attempted to produce with an old epoch.
Written offsets would not be recorded and no more records would be sent since the producer is fenced, indicating the task may be migrated out

org.apache.kafka.common.errors.InvalidProducerEpochException: Producer attempted to produce with an old epoch.

2023-08-22 01:14:18.624  INFO [gateway-united-kafka-service,,] 1 --- [ead-64-producer] o.a.k.c.p.internals.TransactionManager   : [Producer clientId=wagers-united-transactions-stream-c064ad3f-ba6a-4371-9e21-33e14af4a87a-StreamThread-64-producer, transactionalId=wagers-united-transactions-stream-502b91fe-fe5e-4af1-be54-bd1b25181882-64] Transiting to fatal error state due to org.apache.kafka.common.errors.ProducerFencedException: There is a newer producer with the same transactionalId which fences the current one.

I can’t find any error-level producer logging on the replacement pods that started at this time to take over from the pods on the drained node.

Recovering from this required us to create a new topic, move the messages over to it from the poisoned partition, and update our apps to use the new topic. This was a very costly problem to deal with, so we’d really like to figure out what caused it so we can prevent it in the future.
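For completeness, a rough sketch of what the copy step can look like (placeholder topic names and bootstrap address, byte-array passthrough; in practice you would seek to the first stuck offset rather than 0). One caveat with this approach: a read_uncommitted consumer will also return records from aborted transactions, so depending on the data you may need to filter or deduplicate on the way across.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class CopyPoisonedPartition {
    public static void main(String[] args) {
        TopicPartition source = new TopicPartition("old-topic", 6); // placeholders
        String targetTopic = "new-topic";

        Properties cProps = new Properties();
        cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        cProps.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_uncommitted"); // sees past the hang
        cProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
        cProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);

        Properties pProps = new Properties();
        pProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        pProps.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        pProps.put(ProducerConfig.ACKS_CONFIG, "all");
        pProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
        pProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(pProps)) {
            consumer.assign(Collections.singletonList(source));
            consumer.seek(source, 0L); // in practice: seek to the first stuck offset
            long end = consumer.endOffsets(Collections.singletonList(source)).get(source);
            long copied = 0;
            while (consumer.position(source) < end) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<byte[], byte[]> r : records) {
                    // keep the original key so per-key partitioning still holds downstream
                    producer.send(new ProducerRecord<>(targetTopic, r.key(), r.value()));
                    copied++;
                }
            }
            producer.flush();
            System.out.println("copied " + copied + " records to " + targetTopic);
        }
    }
}
```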

Check out KIP-664: Provide tooling to detect and abort hanging transactions.
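The symptom you describe (read_committed consumers pinned behind an open transaction after a producer got fenced) is exactly what that tooling is for. The kafka-transactions.sh CLI that came out of the KIP can find and abort hanging transactions, and the same operations are exposed on the Admin client. A rough, hedged sketch of the Admin route (partition taken from your logs, broker address is a placeholder; double-check the exact signatures against your kafka-clients version):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AbortTransactionSpec;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.DescribeProducersResult;
import org.apache.kafka.clients.admin.ProducerState;
import org.apache.kafka.common.TopicPartition;

public class FindAndAbortHangingTxn {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        TopicPartition tp = new TopicPartition("wagers-united-transactions-recovery", 6);

        try (Admin admin = Admin.create(props)) {
            DescribeProducersResult producers =
                    admin.describeProducers(Collections.singletonList(tp));
            for (ProducerState state : producers.partitionResult(tp).get().activeProducers()) {
                // A producer with a currentTransactionStartOffset far in the past,
                // whose owning application instance no longer exists, is the hang.
                System.out.printf("producerId=%d epoch=%d txnStartOffset=%s lastTimestamp=%d%n",
                        state.producerId(), state.producerEpoch(),
                        state.currentTransactionStartOffset(), state.lastTimestamp());

                if (state.currentTransactionStartOffset().isPresent()) {
                    // Destructive: aborts the open transaction on this partition.
                    admin.abortTransaction(new AbortTransactionSpec(
                            tp,
                            state.producerId(),
                            (short) state.producerEpoch(),
                            state.coordinatorEpoch().orElse(0)))
                         .all()
                         .get();
                }
            }
        }
    }
}
```

Aborting is destructive, so only do it once you are sure the original producer is gone, but it should unblock read_committed consumers without having to rebuild the topic.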

HTH