Eliminating all but last CDC record for a topic within 2 sec window

We have a legacy source SQL Server DB that produces multiple CDC records for each table update due to post-insert/update triggers.

We are using debezium to capture the SQL Server transactions and can see how we could filter all but the first within a couple of seconds in a SMT, but not the last.

Rather than processing each record, we would like to generate a filtered topic with only the last record of the series, on which to do further processing.

We can see how we can do this manually by reading the topic with its duplicates, and caching records for a second and only producing a filtered record to another topic containing the last value within the 2 second window, but can’t help thinking that ksqlDB was created to do such things more easily.

Can anyone point us in the right direction please?

You could use a tumbling window with EMIT FINAL to accomplish this. E.g., if you have a stream with deduplication key column id, and another column that might vary called val, then you can get the last val per id over 2-second windows like this:

SELECT id,
    LATEST_BY_OFFSET(val),
    WINDOWSTART,
    WINDOWEND
FROM test
WINDOW TUMBLING (SIZE 2 SECONDS, GRACE PERIOD 0 SECOND)
GROUP BY id
EMIT FINAL;

Note that data has to keep flowing in order for windows to close (see here).

Flink SQL’s window deduplication is also a good fit for this. By ordering by timestamp DESC and returning the first row, you get the last row per window, e.g.:

SELECT id, val FROM (
  SELECT id,
    val,
    window_start,
    window_end,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) AS rownum
  FROM TABLE(TUMBLE(TABLE test, DESCRIPTOR(ts), INTERVAL '2' SECONDS))
) WHERE rownum = 1;

Thanks very much for your reply; I’ll give it a try :+1:t3:

This topic was automatically closed after 30 days. New replies are no longer allowed.