How to deduplicate records in a kstream or ktable

danoomistmatiste · 6 September 2021 05:48

Hi, How can I remove duplicate records in a kstream or ktable based on the values of two fields in the value portion of key, value pair. The value is a deserialized json record. I want to use two fields from this record to do a group by operation. I did try it with one of the fields but it does not seem to work. The resulting ktable still has all the duplicate rows.

An-Si · 7 September 2021 14:11

Do I get it right.
Messages:
msg1 - value: {“fld1”:“val1”, “fld2”:“val2”,“fld3”:“foo”}
msg2 - value: {“fld1”:“val1”, “fld2”:“val2”,“fld3”:“bar”}

Goal: ignore msg2, because fld1 and fld2 are identical to msg1?
Did I understand that correct?
Do both records share the same key?

danoomistmatiste · 12 September 2021 15:57

No something like this,

key1, value1
key1, value2
key2, value1
key1, value2
key3, value1
key2, value1

and I want

key1, value1
key1, value2
key2, value1
key3, value1

danoomistmatiste · 12 September 2021 22:27

Figured it out. I implemented a key, value state store to do the deduplication. Works like a charm.

mjsax · 13 September 2021 20:52

There is actually an example for de-duplication on GitHub: kafka-streams-examples/EventDeduplicationLambdaIntegrationTest.java at a2962a8e9d01477390c60ec6ba5cd1ee9ae8d880 · confluentinc/kafka-streams-examples · GitHub

system · 20 September 2021 20:52

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
How to remove deduplication upon struct in ksqldb ksqlDB	1	3103	18 February 2022
Batch processing in kstreams Kafka Streams	9	5858	19 September 2021
Kafka duplicate documents Kafka Streams	1	3400	30 November 2022
Duplicate key in KSQL Table ksqlDB	5	5604	12 May 2021
KTable: transforming one record into multiple records Kafka Streams	1	2375	21 February 2023

How to deduplicate records in a kstream or ktable

Related topics