We have a legacy source SQL Server database that produces multiple CDC records for each table update, due to post-insert/update triggers.
We are using Debezium to capture the SQL Server transactions. We can see how an SMT could filter out all but the first record arriving within a couple of seconds, but not how it could keep only the last.
Rather than processing each record, we would like to generate a filtered topic with only the last record of the series, on which to do further processing.
We can see how to do this manually: read the topic with its duplicates, cache records per key, and produce to a filtered topic only the last value seen within a 2-second window. But we can't help thinking that ksqlDB was created to do such things more easily.
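For reference, the manual approach we have in mind can be sketched roughly as follows (a simplified in-memory illustration, not our actual consumer code; the function name and record shape are hypothetical, and real code would read from and write to Kafka and handle window expiry on a timer):

```python
from collections import defaultdict

WINDOW_SECONDS = 2.0  # assumed dedup window

def dedup_last(records, window=WINDOW_SECONDS):
    """records: iterable of (timestamp, key, value) tuples.

    Buckets records into tumbling windows of `window` seconds and keeps
    only the last value seen per key in each window, mimicking the
    cache-and-emit-last behaviour described above.
    """
    windows = defaultdict(dict)  # window_start -> {key: last value}
    for ts, key, value in records:
        window_start = int(ts // window) * window
        windows[window_start][key] = value  # later records overwrite earlier ones
    out = []
    for window_start in sorted(windows):
        for key, value in windows[window_start].items():
            out.append((window_start, key, value))
    return out
```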
Can anyone point us in the right direction please?
You could use a tumbling window with EMIT FINAL to accomplish this. E.g., if you have a stream MY_STREAM with deduplication key column id, and another column that might vary called val, then you can get the last val per id over 2-second windows like this (the stream name here is just a placeholder):

SELECT id, LATEST_BY_OFFSET(val) AS val
FROM MY_STREAM
WINDOW TUMBLING (SIZE 2 SECONDS, GRACE PERIOD 0 SECONDS)
GROUP BY id
EMIT FINAL;
Note that data has to keep flowing in order for windows to close (see here).
Flink SQL's window deduplication is also a good fit for this. By ordering by timestamp DESC and keeping the first row, you get the last row per key per window, e.g.:

SELECT id, val FROM (
  SELECT id, val,
    ROW_NUMBER() OVER (
      PARTITION BY window_start, window_end, id
      ORDER BY ts DESC) AS rownum
  FROM TABLE(TUMBLE(TABLE test, DESCRIPTOR(ts), INTERVAL '2' SECONDS))
) WHERE rownum = 1;
Thanks very much for your reply; I’ll give it a try