Hello, I wanted to report an issue I encountered with my Kafka Streams application after attempting to upgrade its version from 2.8.1 to 3.4.0. My application has been running multiple Kafka Streams instances using exactly-once semantics for years, and we’ve never had any problems related to exactly-once guarantees, even during high data volume or instability/rebalancing scenarios.
The issue began after we upgraded the Kafka Streams client to 3.4.0 (we performed the upgrade by completely shutting down the system and restarting it after updating the code). Everything ran smoothly for several days until we experienced some producer timeouts in the application, which triggered a few rebalances.
However, the real problem was that some messages that should have been discarded were instead committed by Kafka Streams, resulting in duplicate messages that were processed (and committed) more than once.
In short: my application records operations that must have a unique ID provided by the client. We have an input topic (let's call it TopicA). After a message enters TopicA, it is forwarded to TopicB, where a uniqueness check is performed using only data stored in a Kafka Streams state store. If the ID is not unique, the message is sent back to TopicA marked as "rejected." If the ID is unique, the message proceeds to TopicC, where application-specific processing occurs, and then the message returns to TopicA marked as "executed."
The important part is that if X messages enter TopicA, I expect exactly X messages to return to it, each with a single output carrying the same id_cmd (an internal ID related to the command), marked either as "executed" or "rejected."
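To make the flow concrete, here is a minimal, simplified sketch of that topology. Apart from the topic names, everything in it (the namespace, the store name, the String serdes, and treating the value as just the id_cmd) is a placeholder for illustration rather than our actual code:

```clojure
;; Minimal, simplified sketch of the flow described above (not our production
;; code). Topic names come from the description; the namespace, the store name
;; "id-cmd-store", the String serdes, and the "value is just the id_cmd"
;; shortcut are placeholders for illustration.
(ns dedupe.topology
  (:import (org.apache.kafka.common.serialization Serdes)
           (org.apache.kafka.streams StreamsBuilder)
           (org.apache.kafka.streams.kstream Consumed Predicate Produced)
           (org.apache.kafka.streams.processor.api Processor ProcessorSupplier)
           (org.apache.kafka.streams.state Stores)))

(def store-name "id-cmd-store")

(defn uniqueness-processor
  "Forwards the record as \"rejected\" if its id is already in the state
   store; otherwise stores the id and forwards the record as \"accepted\"."
  []
  (let [ctx   (volatile! nil)
        store (volatile! nil)]
    (reify Processor
      (init [_ context]
        (vreset! ctx context)
        (vreset! store (.getStateStore context store-name)))
      (process [_ record]
        (let [id (.value record)]
          (if (.get @store id)
            (.forward @ctx (.withValue record "rejected"))
            (do (.put @store id id)
                (.forward @ctx (.withValue record "accepted"))))))
      (close [_]))))

(defn build-topology []
  (let [builder (StreamsBuilder.)
        in      (Consumed/with (Serdes/String) (Serdes/String))
        out     (Produced/with (Serdes/String) (Serdes/String))]
    (.addStateStore builder
                    (Stores/keyValueStoreBuilder
                     (Stores/persistentKeyValueStore store-name)
                     (Serdes/String) (Serdes/String)))
    ;; TopicA feeds TopicB (application details omitted).
    (-> (.stream builder "TopicA" in)
        (.to "TopicB" out))
    ;; TopicB runs the uniqueness check against the state store only.
    (let [checked (.process (.stream builder "TopicB" in)
                            (reify ProcessorSupplier
                              (get [_] (uniqueness-processor)))
                            (into-array String [store-name]))]
      ;; Duplicates go straight back to TopicA marked "rejected".
      (-> checked
          (.filter (reify Predicate (test [_ _k v] (= v "rejected"))))
          (.to "TopicA" out))
      ;; Unique ids go on to TopicC, where the application-specific step
      ;; (omitted here) runs before the result returns to TopicA as "executed".
      (-> checked
          (.filter (reify Predicate (test [_ _k v] (= v "accepted"))))
          (.to "TopicC" out)))
    (.build builder)))
```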
What happened is that after some rebalances we observed certain messages with two outputs, either two "executed" or one "executed" and one "rejected", which suggests that records from transactions that should have been aborted were instead counted and processed as valid. I think the message is either being processed and committed twice in TopicC, resulting in two "executed", or in TopicA or TopicB, resulting in one of each, since the second check finds that the ID already exists in the state store and rejects it (I am not using a database, only a state store for this validation, so it should roll back any record if the transaction is aborted).
All our consumers are configured with read_committed, and I have already verified that the extra messages are, in fact, committed. We were able to reproduce this issue in a local environment, although not consistently. Whatever is causing it seems very specific and hard to reproduce, but I can confidently say it only occurs from Kafka Streams version 3.4.0 onward. I tested up to version 3.7.0 and encountered the same problem.
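For reference, that check can be done with a plain standalone consumer along these lines (the namespace, broker address, group id, and topic below are placeholders); with isolation.level set to read_committed it only ever returns records from committed transactions:

```clojure
;; Sketch of a standalone consumer for the duplicate check (namespace, broker
;; address, group id, and the topic being inspected are placeholders).
(ns dedupe.verify
  (:import (java.time Duration)
           (org.apache.kafka.clients.consumer ConsumerConfig KafkaConsumer)
           (org.apache.kafka.common.serialization StringDeserializer)))

(defn duplicate-check-consumer []
  (let [props {ConsumerConfig/BOOTSTRAP_SERVERS_CONFIG        "localhost:9092"
               ConsumerConfig/GROUP_ID_CONFIG                 "dup-check"
               ConsumerConfig/ISOLATION_LEVEL_CONFIG          "read_committed"
               ConsumerConfig/AUTO_OFFSET_RESET_CONFIG        "earliest"
               ConsumerConfig/KEY_DESERIALIZER_CLASS_CONFIG   (.getName StringDeserializer)
               ConsumerConfig/VALUE_DESERIALIZER_CLASS_CONFIG (.getName StringDeserializer)}]
    (doto (KafkaConsumer. props)
      (.subscribe ["TopicA"]))))

(defn committed-outputs
  "With read_committed, poll only returns records from committed transactions,
   so anything counted here (e.g. outputs per id_cmd) really was committed."
  [consumer]
  (seq (.poll consumer (Duration/ofSeconds 5))))
```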
To reinforce: with the exact same code, this issue does not occur in any environment (local or production) when using Kafka Streams version 2.8.1, even under scenarios with high instability and frequent rebalances. I tested locally with both EXACTLY_ONCE and EXACTLY_ONCE_V2 and was able to reproduce it, even with just one Kafka broker.
I am using this configuration for the Kafka Streams instances:
{NUM_STREAM_THREADS_CONFIG 4
PROCESSING_GUARANTEE_CONFIG StreamsConfig/EXACTLY_ONCE
REPLICATION_FACTOR_CONFIG 3
SECURITY_PROTOCOL_CONFIG "PLAINTEXT"
AUTO_OFFSET_RESET_CONFIG "earliest"
CLIENT_DNS_LOOKUP_CONFIG "use_all_dns_ips"
COMPRESSION_TYPE_CONFIG "zstd"
MAX_BLOCK_MS_CONFIG "300000"
TRANSACTION_TIMEOUT_CONFIG "60000"
MAX_REQUEST_SIZE_CONFIG "10400000"}
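For completeness, that map is handed to KafkaStreams roughly like this (the application id and bootstrap servers below are placeholders, and the remaining client overrides from the map above are omitted):

```clojure
;; Roughly how the map above is turned into the Properties handed to
;; KafkaStreams. The application id and bootstrap servers are placeholders,
;; and the remaining client overrides are omitted; the numeric values are
;; passed as ints because that is the type StreamsConfig expects for them.
(ns dedupe.app
  (:import (java.util Properties)
           (org.apache.kafka.streams KafkaStreams StreamsConfig)))

(def streams-config
  {StreamsConfig/APPLICATION_ID_CONFIG       "command-processor"
   StreamsConfig/BOOTSTRAP_SERVERS_CONFIG    "localhost:9092"
   StreamsConfig/NUM_STREAM_THREADS_CONFIG   (int 4)
   StreamsConfig/PROCESSING_GUARANTEE_CONFIG StreamsConfig/EXACTLY_ONCE
   StreamsConfig/REPLICATION_FACTOR_CONFIG   (int 3)})

(defn start! [topology]
  (let [props (doto (Properties.) (.putAll streams-config))]
    (doto (KafkaStreams. topology props)
      (.start))))
```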
I didn’t see any unusual logs in the Kafka broker or in the application in the scenarios where the issue was identified, apart from the usual consumer group rebalancing logs.
Has anyone ever seen a similar issue?
Thank you in advance for your help.