Hello, I wanted to report an issue I encountered with my Kafka Streams application after attempting to upgrade its version from 2.8.1 to 3.4.0. My application has been running multiple Kafka Streams instances using exactly-once semantics for years, and we’ve never had any problems related to exactly-once guarantees, even during high data volume or instability/rebalancing scenarios.
The issue began after we upgraded the Kafka Streams client to 3.4.0 (we performed the upgrade by completely shutting down the system and restarting it after updating the code). Everything ran smoothly for several days until we experienced some producer timeouts in the application, which triggered a few rebalances.
However, the real problem was that some messages that should have been discarded were instead committed by Kafka Streams, resulting in duplicate messages that were processed (and committed) more than once.
In short: my application records operations that must have a unique ID provided by the client. We have an input topic (let's call it TopicA). After a message enters TopicA, it is forwarded to TopicB, where a uniqueness check is performed using only data stored in a Kafka Streams state store. If the ID is not unique, the message is sent back to TopicA marked as "rejected." If the ID is unique, the message proceeds to TopicC, where application-specific processing occurs, and then the message returns to TopicA marked as "executed."
The important part is that if X messages enter TopicA, I expect exactly X messages to return to it, each with a single output carrying the same id_cmd (an internal ID related to the command), marked either as "executed" or "rejected."
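To make the flow concrete, here is a minimal, simplified sketch of that topology. Apart from the topic names, everything in it (the namespace, the store name, the String serdes, and treating the value as just the id_cmd) is a placeholder for illustration rather than our actual code:

```clojure
;; Minimal, simplified sketch of the flow described above (not our production
;; code). Topic names come from the description; the namespace, the store name
;; "id-cmd-store", the String serdes, and the "value is just the id_cmd"
;; shortcut are placeholders for illustration.
(ns dedupe.topology
  (:import (org.apache.kafka.common.serialization Serdes)
           (org.apache.kafka.streams StreamsBuilder)
           (org.apache.kafka.streams.kstream Consumed Predicate Produced)
           (org.apache.kafka.streams.processor.api Processor ProcessorSupplier)
           (org.apache.kafka.streams.state Stores)))

(def store-name "id-cmd-store")

(defn uniqueness-processor
  "Forwards the record as \"rejected\" if its id is already in the state
   store; otherwise stores the id and forwards the record as \"accepted\"."
  []
  (let [ctx   (volatile! nil)
        store (volatile! nil)]
    (reify Processor
      (init [_ context]
        (vreset! ctx context)
        (vreset! store (.getStateStore context store-name)))
      (process [_ record]
        (let [id (.value record)]
          (if (.get @store id)
            (.forward @ctx (.withValue record "rejected"))
            (do (.put @store id id)
                (.forward @ctx (.withValue record "accepted"))))))
      (close [_]))))

(defn build-topology []
  (let [builder (StreamsBuilder.)
        in      (Consumed/with (Serdes/String) (Serdes/String))
        out     (Produced/with (Serdes/String) (Serdes/String))]
    (.addStateStore builder
                    (Stores/keyValueStoreBuilder
                     (Stores/persistentKeyValueStore store-name)
                     (Serdes/String) (Serdes/String)))
    ;; TopicA feeds TopicB (application details omitted).
    (-> (.stream builder "TopicA" in)
        (.to "TopicB" out))
    ;; TopicB runs the uniqueness check against the state store only.
    (let [checked (.process (.stream builder "TopicB" in)
                            (reify ProcessorSupplier
                              (get [_] (uniqueness-processor)))
                            (into-array String [store-name]))]
      ;; Duplicates go straight back to TopicA marked "rejected".
      (-> checked
          (.filter (reify Predicate (test [_ _k v] (= v "rejected"))))
          (.to "TopicA" out))
      ;; Unique ids go on to TopicC, where the application-specific step
      ;; (omitted here) runs before the result returns to TopicA as "executed".
      (-> checked
          (.filter (reify Predicate (test [_ _k v] (= v "accepted"))))
          (.to "TopicC" out)))
    (.build builder)))
```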
What happened is that after some rebalances we observed certain messages with two outputs, either two "executed" or one "executed" and one "rejected", which suggests that records from transactions that should have been aborted were instead counted and processed as valid. I think the message is either being processed and committed twice in TopicC, resulting in two "executed", or in TopicA or TopicB, resulting in one of each, since the second check finds that the ID already exists in the state store and rejects it (I am not using a database, only a state store for this validation, so it should roll back any record if the transaction is aborted).
All our consumers are configured with read_committed, and I have already verified that the extra messages are, in fact, committed. We were able to reproduce this issue in a local environment, although not consistently. Whatever is causing it seems very specific and hard to reproduce, but I can confidently say it only occurs from Kafka Streams version 3.4.0 onward. I tested up to version 3.7.0 and encountered the same problem.
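For reference, that check can be done with a plain standalone consumer along these lines (the namespace, broker address, group id, and topic below are placeholders); with isolation.level set to read_committed it only ever returns records from committed transactions:

```clojure
;; Sketch of a standalone consumer for the duplicate check (namespace, broker
;; address, group id, and the topic being inspected are placeholders).
(ns dedupe.verify
  (:import (java.time Duration)
           (org.apache.kafka.clients.consumer ConsumerConfig KafkaConsumer)
           (org.apache.kafka.common.serialization StringDeserializer)))

(defn duplicate-check-consumer []
  (let [props {ConsumerConfig/BOOTSTRAP_SERVERS_CONFIG        "localhost:9092"
               ConsumerConfig/GROUP_ID_CONFIG                 "dup-check"
               ConsumerConfig/ISOLATION_LEVEL_CONFIG          "read_committed"
               ConsumerConfig/AUTO_OFFSET_RESET_CONFIG        "earliest"
               ConsumerConfig/KEY_DESERIALIZER_CLASS_CONFIG   (.getName StringDeserializer)
               ConsumerConfig/VALUE_DESERIALIZER_CLASS_CONFIG (.getName StringDeserializer)}]
    (doto (KafkaConsumer. props)
      (.subscribe ["TopicA"]))))

(defn committed-outputs
  "With read_committed, poll only returns records from committed transactions,
   so anything counted here (e.g. outputs per id_cmd) really was committed."
  [consumer]
  (seq (.poll consumer (Duration/ofSeconds 5))))
```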
To reinforce: with the exact same code, this issue does not occur in any environment (local or production) when using Kafka Streams version 2.8.1, even under scenarios with high instability and frequent rebalances. I tested locally with both EXACTLY_ONCE and EXACTLY_ONCE_V2 and was able to reproduce it, even with just one Kafka broker.
I am using this configuration for the Kafka Streams instances:
{NUM_STREAM_THREADS_CONFIG 4
PROCESSING_GUARANTEE_CONFIG StreamsConfig/EXACTLY_ONCE
REPLICATION_FACTOR_CONFIG 3
SECURITY_PROTOCOL_CONFIG "PLAINTEXT"
AUTO_OFFSET_RESET_CONFIG "earliest"
CLIENT_DNS_LOOKUP_CONFIG "use_all_dns_ips"
COMPRESSION_TYPE_CONFIG "zstd"
MAX_BLOCK_MS_CONFIG "300000"
TRANSACTION_TIMEOUT_CONFIG "60000"
MAX_REQUEST_SIZE_CONFIG "10400000"}
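For completeness, that map is handed to KafkaStreams roughly like this (the application id and bootstrap servers below are placeholders, and the remaining client overrides from the map above are omitted):

```clojure
;; Roughly how the map above is turned into the Properties handed to
;; KafkaStreams. The application id and bootstrap servers are placeholders,
;; and the remaining client overrides are omitted; the numeric values are
;; passed as ints because that is the type StreamsConfig expects for them.
(ns dedupe.app
  (:import (java.util Properties)
           (org.apache.kafka.streams KafkaStreams StreamsConfig)))

(def streams-config
  {StreamsConfig/APPLICATION_ID_CONFIG       "command-processor"
   StreamsConfig/BOOTSTRAP_SERVERS_CONFIG    "localhost:9092"
   StreamsConfig/NUM_STREAM_THREADS_CONFIG   (int 4)
   StreamsConfig/PROCESSING_GUARANTEE_CONFIG StreamsConfig/EXACTLY_ONCE
   StreamsConfig/REPLICATION_FACTOR_CONFIG   (int 3)})

(defn start! [topology]
  (let [props (doto (Properties.) (.putAll streams-config))]
    (doto (KafkaStreams. topology props)
      (.start))))
```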
I didn’t see any unusual logs in the Kafka broker or in the application in the scenarios where the issue was identified, apart from the usual consumer group rebalancing logs.
Has anyone ever seen a similar issue?
Thank you in advance for your help.