Deduplication layer

Hi,
I have seen an example of using a window to detect duplicates in a Streams consumer. Our data flows through Connectors from MQ and databases, and I understand it is delivered ‘at least once’. Please correct me if this is wrong.
So it looks like we need deduplication topics both at the Sink and at the consumer that pushes this data out of the system.
What is the drawback of this? Does every topic now need a deduplicated topic? I believe we can fix a window for this.

Thanks

Hi Mohanr

I’m not quite sure what you mean by a deduplication topic. My understanding of your connector setup is that you replicate events from MQ to a matching Kafka topic, and that this single topic may contain duplicate events due to things like an intermittent network or connector failure.

There is a lot to consider when it comes to deduplication. In some instances, such as using change-data capture to get data from a database, duplicates don’t matter too much. Most consumers are materializing a read-only copy of the database state, and a duplicate just ends up being an extra event that doesn’t really change anything.
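
As a rough illustration of why that is: a CDC consumer is usually doing an upsert keyed on the primary key, so replaying the same change event just rewrites the same row. The table and column names below are invented, and the syntax is PostgreSQL-style; other databases have MERGE or INSERT … ON DUPLICATE KEY UPDATE equivalents.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

public class IdempotentUpsert {
    // Hypothetical sketch: materialize a CDC event into a local "accounts"
    // table. Applying the same event twice leaves the row unchanged, so
    // duplicates are harmless for this kind of consumer.
    public static void apply(Connection conn, long id, long balance) throws Exception {
        String sql = "INSERT INTO accounts (id, balance) VALUES (?, ?) "
                   + "ON CONFLICT (id) DO UPDATE SET balance = EXCLUDED.balance";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, id);
            ps.setLong(2, balance);
            ps.executeUpdate();
        }
    }
}
```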

Alternatively, if you are processing payments, a duplicate event may result in issuing a duplicate charge to a customer’s credit card. This is certainly a case where you would want to try to fence out duplicate events, either on the producer side using transactional semantics, or on the consumer side by maintaining state (such as a window or state store) recording the events you have already processed.
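
On the producer side, that fencing typically means enabling idempotence and transactions in the client. A minimal sketch with the Java producer, where the topic name, key/value, and transactional.id are placeholders (this is not your connector’s code):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalPaymentProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        // Idempotence lets the broker discard retried batches from this producer;
        // the transactional.id lets the broker fence zombie instances after a restart.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "payments-producer-1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("payments", "order-42", "charge:19.99"));
            producer.commitTransaction();
        }
    }
}
```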

Windowing can help a consumer avoid duplicates, but as you observed, it must be implemented at each consumer. This is usually an acceptable option, as it leaves it up to the consumer system to decide whether it cares about duplicates, or whether it is idempotent.
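
For reference, a consumer-side dedup step in Kafka Streams can look roughly like the sketch below. It assumes records are keyed by a unique event id, the topic and store names are made up, and it uses the older transform()/Transformer API. A real implementation would also want a windowed or TTL’d store (or a punctuator) so the “seen ids” state doesn’t grow forever.

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class DedupTopology {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();
        // State store remembering when each event id was last seen.
        builder.addStateStore(Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("seen-ids"),
                Serdes.String(), Serdes.Long()));

        KStream<String, String> input = builder.stream("alerts");
        input.transform(() -> new Transformer<String, String, KeyValue<String, String>>() {
            private KeyValueStore<String, Long> seen;

            @Override
            public void init(ProcessorContext context) {
                seen = (KeyValueStore<String, Long>) context.getStateStore("seen-ids");
            }

            @Override
            public KeyValue<String, String> transform(String eventId, String value) {
                Long lastSeen = seen.get(eventId);
                long now = System.currentTimeMillis();
                if (lastSeen != null && now - lastSeen < Duration.ofMinutes(30).toMillis()) {
                    return null;                      // seen within the window: drop it
                }
                seen.put(eventId, now);
                return KeyValue.pair(eventId, value); // first sighting: forward it
            }

            @Override
            public void close() { }
        }, "seen-ids").to("alerts-deduped");

        return builder;
    }
}
```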

There are also some exactly-once semantics for Kafka Connectors in the works (KIP-618), but that has yet to be released.

Hope some of this helps. Let me know if you have more questions.

Thanks for pointing out KIP-618. Even though I haven’t read it fully, I was looking for something like this.
Our messages are not transactions. They are emails, SMSs, and cloud messaging push messages. So for messages like security alerts (database CDC), I can open a window in the Streams library and check for duplicates for about 30 minutes or so.
We also have MQ SMSs triggered by payment transactions, but they are still messages. It is not easy for me to understand how the connector can fail. I understand the connector can subscribe to MQ, so that shouldn’t be a problem? We don’t get duplicates there.
When we deliver back to Kafka we may have duplicates. Is that right?

Thanks.

If you are using Kafka Connect to consume from an MQ and write to Kafka, you can get duplicates. It is very rare, but it can happen:

  1. Consume messages from MQ
  2. Package into Kafka events
  3. Write to Kafka
  4. CONNECTOR FAILS, RESTART FROM 1

In this case, the messages read from the MQ were never acknowledged, but the events have already been written to Kafka, so the connector will read and write them again after it restarts.

The systems you have that write events into the MQ may also produce duplicates in this very same way.

  1. Write payment message to MQ
  2. MQ accepts the message
  3. MQ tries to respond to the client that the message was received
  4. CLIENT FAILS - it must restart and try again, possibly publishing the same message. The client has no idea whether the message was successfully posted, so it retries.

In this second case, Kafka Connect reading from the MQ will publish the duplicate as well. It has no idea it is a duplicate.

Duplicates can be introduced at any boundary of client to MQ/Kafka, or MQ/Kafka to client, unless you have exactly-once production/consumption supported by the client and the MQ/Kafka broker.
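
For completeness, the consuming side of that exactly-once story is mostly configuration: if the producer writes transactionally (as in the earlier sketch), a consumer with isolation.level=read_committed will never see aborted or in-flight writes. A minimal sketch, where the group id and topic are placeholders:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReadCommittedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-consumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        // Only read records from committed transactions; uncommitted or aborted
        // writes from a transactional producer are never returned.
        props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payments"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("%s -> %s%n", record.key(), record.value());
            }
        }
    }
}
```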