Duplicate key in KSQL Table

I’m running some tests with Kafka and ksqlDB, and I want to know how to deal with duplicate keys. Let me show you my test case:

First, I created a topic named ACCOUNT, keyed by "id", and pushed some test data with the same key to the topic.
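
For reference, here is an equivalent way to set this up entirely in ksqlDB (the stream name and sample values are illustrative, not my actual data):

-- creating a stream over a topic that doesn't exist yet also creates the topic
CREATE STREAM account_raw
  (id VARCHAR KEY, number VARCHAR, branch VARCHAR, bank VARCHAR, balance DOUBLE, owner VARCHAR)
  WITH (KAFKA_TOPIC='account', PARTITIONS=1, VALUE_FORMAT='JSON');

-- push two records that share the same key, id = '7'
INSERT INTO account_raw (id, number, branch, bank, balance, owner)
  VALUES ('7', '0001', '001', 'ACME', 100.0, 'alice');
INSERT INTO account_raw (id, number, branch, bank, balance, owner)
  VALUES ('7', '0001', '001', 'ACME', 250.0, 'alice');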

After that, I created a table with this command:

CREATE TABLE accounts
   (id VARCHAR PRIMARY KEY, number VARCHAR, branch VARCHAR, bank VARCHAR, balance DOUBLE, owner VARCHAR)
   WITH (KAFKA_TOPIC='account', VALUE_FORMAT='JSON');

The problem is: a select with "SELECT * FROM accounts WHERE id = '7' EMIT CHANGES;" shows repeated rows.

What do I need to do to keep only the latest row for each primary key?

Hey, maiconramones, welcome to the Forum!

That table really does contain only one row for each unique key value. When you’re inspecting the table by doing that SELECT at the ksqlDB CLI (which I assume is what you’re doing—set me straight if that’s not the case), you will see updates to that row (say, where id=7) as the row in the table changes. Those updates are not additional rows having that same key, but the same row with non-key values having changed.
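
To make that concrete with the sketch data from the question (two inserts for id = '7', balances 100.0 and then 250.0), a push query replays the table's changelog, assuming the CLI is set to read from the earliest offset:

-- make the push query replay the changelog from the beginning
SET 'auto.offset.reset'='earliest';

SELECT * FROM accounts WHERE id = '7' EMIT CHANGES;
-- emits one event per change to the row: first balance 100.0, then 250.0.
-- Two changelog events, but still only one current row for key '7'.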

Does this help?

Hi @tlberglund, thanks for the answer,

I understand the explanation. Yes, I’m using the ksqlDB CLI and Control Center to test my queries. In the ksqlDB CLI, I closed the terminal and opened it again, and this time the select returned the row twice, even though no update had been pushed to the topic.

It appears to me that the "primary key" isn’t behaving as expected. See my screenshot: during this select no update occurred, yet two rows appear.

As Tim said, when you use the EMIT CHANGES clause, you do not query a table snapshot; instead, you get the changelog stream of the table from its initial empty state up to now (you issue a so-called push query). Thus, the result of your query is a stream of updates to the table!

ksqlDB also supports key lookups against table snapshots: for this case, you would need to materialize the input topic by reading it as a STREAM and applying the LATEST_BY_OFFSET aggregation function. You can then issue so-called pull queries against the result table of the stream aggregation.

-- create an input STREAM from the topic
-- note: for a STREAM you use KEY instead of PRIMARY KEY
CREATE STREAM accountStream
  (id VARCHAR KEY, number VARCHAR, branch VARCHAR, bank VARCHAR, balance DOUBLE, owner VARCHAR)
  WITH (KAFKA_TOPIC='account', VALUE_FORMAT='JSON');

-- aggregate the stream into a table
-- note: drop the earlier 'accounts' table first, or use a different name here
CREATE TABLE accounts AS
  SELECT
    id,
    LATEST_BY_OFFSET(number) AS number,
    LATEST_BY_OFFSET(branch) AS branch,
    LATEST_BY_OFFSET(bank) AS bank,
    LATEST_BY_OFFSET(balance) AS balance,
    LATEST_BY_OFFSET(owner) AS owner
  FROM accountStream
  GROUP BY id;

-- issue a pull query,
-- i.e., a single-row lookup against the latest table snapshot
SELECT * FROM accounts WHERE id = '7';
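
One caveat worth spelling out: pull queries need materialized state, which is exactly why the detour through the stream aggregation is required; a table declared directly over a topic (like the accounts table from the question) is not materialized, so a pull query against it is rejected. With the sample data from the question, this pull query should return exactly one row for id = '7', carrying the latest values (e.g., balance 250.0).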

Very good, thanks for the clarification, guys!

Yes, this approach worked, @mjsax.

Thanks for the support!
