Eliminating all but last CDC record for a topic within 2 sec window

tim.jones.001 · 5 February 2024 11:18

We have a legacy source SQL Server DB that produces multiple CDC records for each table update due to post-insert/update triggers.

We are using debezium to capture the SQL Server transactions and can see how we could filter all but the first within a couple of seconds in a SMT, but not the last.

Rather than processing each record, we would like to generate a filtered topic with only the last record of the series, on which to do further processing.

We can see how we can do this manually by reading the topic with its duplicates, and caching records for a second and only producing a filtered record to another topic containing the last value within the 2 second window, but can’t help thinking that ksqlDB was created to do such things more easily.

Can anyone point us in the right direction please?

dtroiano · 5 February 2024 16:07

You could use a tumbling window with EMIT FINAL to accomplish this. E.g., if you have a stream with deduplication key column id, and another column that might vary called val, then you can get the last val per id over 2-second windows like this:

SELECT id,
    LATEST_BY_OFFSET(val),
    WINDOWSTART,
    WINDOWEND
FROM test
WINDOW TUMBLING (SIZE 2 SECONDS, GRACE PERIOD 0 SECOND)
GROUP BY id
EMIT FINAL;

Note that data has to keep flowing in order for windows to close (see here).

Flink SQL’s window deduplication is also a good fit for this. By ordering by timestamp DESC and returning the first row, you get the last row per window, e.g.:

SELECT id, val FROM (
  SELECT id,
    val,
    window_start,
    window_end,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) AS rownum
  FROM TABLE(TUMBLE(TABLE test, DESCRIPTOR(ts), INTERVAL '2' SECONDS))
) WHERE rownum = 1;

tim.jones.001 · 19 February 2024 22:10

Thanks very much for your reply; I’ll give it a try

system · 6 March 2024 11:19

This topic was automatically closed after 30 days. New replies are no longer allowed.

Topic		Replies	Views
CDC delete row issue ksqlDB	2	4449	31 July 2022
Sync processing of multiple simultaneous messages ksqlDB	1	1509	24 September 2023
KSQLDB Table with tumbling window and emit final ksqlDB	2	3820	27 January 2023
Select events from a topic between two dates without emitting changes ksqlDB	2	3449	9 March 2021
Batch processing in kstreams Kafka Streams	9	5856	19 September 2021

Eliminating all but last CDC record for a topic within 2 sec window

Related topics