Deduplication layer

Hi,
I have seen an example of using a window to detect duplicates in a Streams consumer. Our data flows through Connectors from MQ and databases, and I understand it is delivered ‘at least once’. Please correct me if this is wrong.
So it looks like we need deduplication topics both at the Sink and at the consumer that pushes this data out of the system.
What is the drawback of this? Does every topic now need a deduplicated topic? I believe we can fix a window for this.

Thanks

Hi Mohanr

I’m not quite sure what you mean by a deduplication topic. My understanding of your connector setup is that you replicate events from MQ to a matching Kafka topic, and that this single topic may contain duplicate events due to things like an intermittent network or connector failure.

There is a lot to consider when it comes to deduplication. In some instances, such as using change-data capture to get data from a database, duplicates don’t matter too much. Most consumers are materializing a read-only copy of the database state, and a duplicate just ends up being an extra event that doesn’t really change anything.
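
As a rough illustration of why that is: a CDC consumer is usually doing an upsert keyed on the primary key, so replaying the same change event just rewrites the same row. The table and column names below are invented, and the syntax is PostgreSQL-style; other databases have MERGE or INSERT … ON DUPLICATE KEY UPDATE equivalents.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

public class IdempotentUpsert {
    // Hypothetical sketch: materialize a CDC event into a local "accounts"
    // table. Applying the same event twice leaves the row unchanged, so
    // duplicates are harmless for this kind of consumer.
    public static void apply(Connection conn, long id, long balance) throws Exception {
        String sql = "INSERT INTO accounts (id, balance) VALUES (?, ?) "
                   + "ON CONFLICT (id) DO UPDATE SET balance = EXCLUDED.balance";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, id);
            ps.setLong(2, balance);
            ps.executeUpdate();
        }
    }
}
```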

Alternatively, if you are processing payments, a duplicate event may result in issuing a duplicate charge to a customer’s credit card. This is certainly a case where you would want to try to fence out duplicate events, either on the producer side using transactional semantics, or on the consumer side by maintaining state (such as a window or state store) recording the events you have already processed.
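
On the producer side, that fencing typically means enabling idempotence and transactions in the client. A minimal sketch with the Java producer, where the topic name, key/value, and transactional.id are placeholders (this is not your connector’s code):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalPaymentProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        // Idempotence lets the broker discard retried batches from this producer;
        // the transactional.id lets the broker fence zombie instances after a restart.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "payments-producer-1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("payments", "order-42", "charge:19.99"));
            producer.commitTransaction();
        }
    }
}
```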

Windowing can help a consumer avoid duplicates, but as you observed, it must be implemented at each consumer. This is usually an acceptable option, as it leaves it up to the consumer system to decide whether it cares about duplicates, or whether it is idempotent.
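
For reference, a consumer-side dedup step in Kafka Streams can look roughly like the sketch below. It assumes records are keyed by a unique event id, the topic and store names are made up, and it uses the older transform()/Transformer API. A real implementation would also want a windowed or TTL’d store (or a punctuator) so the “seen ids” state doesn’t grow forever.

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class DedupTopology {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();
        // State store remembering when each event id was last seen.
        builder.addStateStore(Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("seen-ids"),
                Serdes.String(), Serdes.Long()));

        KStream<String, String> input = builder.stream("alerts");
        input.transform(() -> new Transformer<String, String, KeyValue<String, String>>() {
            private KeyValueStore<String, Long> seen;

            @Override
            public void init(ProcessorContext context) {
                seen = (KeyValueStore<String, Long>) context.getStateStore("seen-ids");
            }

            @Override
            public KeyValue<String, String> transform(String eventId, String value) {
                Long lastSeen = seen.get(eventId);
                long now = System.currentTimeMillis();
                if (lastSeen != null && now - lastSeen < Duration.ofMinutes(30).toMillis()) {
                    return null;                      // seen within the window: drop it
                }
                seen.put(eventId, now);
                return KeyValue.pair(eventId, value); // first sighting: forward it
            }

            @Override
            public void close() { }
        }, "seen-ids").to("alerts-deduped");

        return builder;
    }
}
```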

There are also some exactly-once semantics for Kafka Connectors in the works (KIP-618), but that has yet to be released.

Hope some of this helps. Let me know if you have more questions.

Thanks for pointing out KIP-618. Even though I haven’t read it fully, I was looking for something like this.
Our messages are not transactions. They are emails, SMSs, and cloud messaging push messages. So for messages like security alerts (database CDC), I can open a window in the Streams library and check for duplicates for about 30 minutes or so.
We also have MQ SMSs triggered by payment transactions, but they are still messages. It is not easy for me to understand how the connector can fail. I understand the connector can subscribe to MQ, so that shouldn’t be a problem? We don’t get duplicates there.
When we deliver back to Kafka we may have duplicates. Is that right?

Thanks.

If you are using Kafka Connect to consume from an MQ and write to Kafka, you can get duplicates. It is very rare, but it can happen:

  1. Consume messages from MQ
  2. Package into Kafka events
  3. Write to Kafka
  4. CONNECTOR FAILS, RESTART FROM 1

In this case, the messages read from the MQ were never acknowledged, but the events have already been written to Kafka, so the connector will read and write them again after it restarts.

The systems you have that write events into the MQ may also produce duplicates in this very same way.

  1. Write payment message to MQ
  2. MQ accepts the message
  3. MQ tries to respond to the client that the message was received
  4. CLIENT FAILS - it must restart and try again, possibly publishing the same message. The client has no idea whether the message was successfully posted, so it retries.

In this second case, Kafka Connect reading from the MQ will publish the duplicate as well. It has no idea it is a duplicate.

Duplicates can be introduced at any boundary of client to MQ/Kafka, or MQ/Kafka to client, unless you have exactly-once production/consumption supported by the client and the MQ/Kafka broker.
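
For completeness, the consuming side of that exactly-once story is mostly configuration: if the producer writes transactionally (as in the earlier sketch), a consumer with isolation.level=read_committed will never see aborted or in-flight writes. A minimal sketch, where the group id and topic are placeholders:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReadCommittedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-consumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        // Only read records from committed transactions; uncommitted or aborted
        // writes from a transactional producer are never returned.
        props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payments"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("%s -> %s%n", record.key(), record.value());
            }
        }
    }
}
```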