How to deduplicate records in a kstream or ktable

Hi, How can I remove duplicate records in a kstream or ktable based on the values of two fields in the value portion of key, value pair. The value is a deserialized json record. I want to use two fields from this record to do a group by operation. I did try it with one of the fields but it does not seem to work. The resulting ktable still has all the duplicate rows.

Do I get it right.
Messages:
msg1 - value: {“fld1”:“val1”, “fld2”:“val2”,“fld3”:“foo”}
msg2 - value: {“fld1”:“val1”, “fld2”:“val2”,“fld3”:“bar”}

Goal: ignore msg2, because fld1 and fld2 are identical to msg1?
Did I understand that correct?
Do both records share the same key?

No something like this,

key1, value1
key1, value2
key2, value1
key1, value2
key3, value1
key2, value1

and I want

key1, value1
key1, value2
key2, value1
key3, value1

Figured it out. I implemented a key, value state store to do the deduplication. Works like a charm.

3 Likes

There is actually an example for de-duplication on GitHub: kafka-streams-examples/EventDeduplicationLambdaIntegrationTest.java at a2962a8e9d01477390c60ec6ba5cd1ee9ae8d880 · confluentinc/kafka-streams-examples · GitHub

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.