I am working with several types of denormalized events that carry information about computer processes and network connections.
I create a unified stream from the process-start events and INSERT INTO it the process-end events. The stream is the source for a LATEST_BY_OFFSET table that keeps the latest non-null value per column (I really like that behavior!). In the end I have the process information from start and end in one row. The key is the process UUID.
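For reference, the aggregation looks roughly like this (all names are illustrative, not my real schema; LATEST_BY_OFFSET ignores NULLs by default, which is the behavior I rely on):

```sql
-- Sketch with hypothetical names; process_events is the unified
-- stream of start and end events, keyed by the process UUID.
CREATE TABLE process_table AS
  SELECT uuid,
         LATEST_BY_OFFSET(start_ts) AS start_ts,  -- ignores NULLs by default
         LATEST_BY_OFFSET(end_ts)   AS end_ts
  FROM process_events
  GROUP BY uuid;
```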
I do the same (including the LATEST_BY_OFFSET aggregation) for the network connections (start + end), but here the key is process UUID + start timestamp. I use a table because I can't be sure the messages arrive in the correct order relative to the process start/end events.
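The connection side looks roughly like this (again with made-up column names; grouping by two expressions gives the table the composite key I described):

```sql
-- Hypothetical names; the key is process UUID + connection start timestamp.
CREATE TABLE connection_table AS
  SELECT process_uuid,
         conn_start_ts,
         LATEST_BY_OFFSET(remote_addr) AS remote_addr,  -- placeholder payload column
         LATEST_BY_OFFSET(end_ts)      AS end_ts
  FROM connection_events
  GROUP BY process_uuid, conn_start_ts;
```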
Then I perform a FK join on the UUID with the filter condition process_table.uuid IS NOT NULL AND process_table.start_ts IS NOT NULL AND connection_table.start_ts IS NOT NULL AND connection_table.end_ts IS NOT NULL, so I can be sure that all the important information is available when the join fires.
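In sketch form (hypothetical names; a many-to-one table-table foreign-key join from the connection table to the process table, plus the completeness filter):

```sql
-- Only connections whose own start/end and the process start are all present.
CREATE TABLE completed_connections AS
  SELECT c.process_uuid,
         c.conn_start_ts,
         c.end_ts,
         p.start_ts AS process_start_ts
  FROM connection_table c
  JOIN process_table p ON c.process_uuid = p.uuid
  WHERE p.start_ts IS NOT NULL
    AND c.conn_start_ts IS NOT NULL
    AND c.end_ts IS NOT NULL;
```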
Now I am asking myself what performance implications this solution has, and whether it is possible to delete the completed connections from the table the join is performed on. Does it make a difference whether the table holds hundreds of millions of mostly completed connections or just a couple of million (per week)?
I tried to create a tombstone stream in the KAFKA format on the same topic the connection stream uses in order to ingest tombstones, but the table rows didn't change, although this "workaround" apparently used to work at some point: "Kafka Connect, ksqlDB, and Kafka Tombstone messages".
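What I attempted looked roughly like this (illustrative names and key value; the idea of the workaround is that inserting a row whose value columns are all NULL into a KAFKA-format stream writes a null-valued record, i.e. a tombstone, to the topic):

```sql
-- Hypothetical sketch of the tombstone workaround I tried.
CREATE STREAM connection_tombstones (id STRING KEY, dummy STRING)
  WITH (KAFKA_TOPIC='connection-events', VALUE_FORMAT='KAFKA');

-- Leaving the value column NULL should serialize to a null record value.
INSERT INTO connection_tombstones (id) VALUES ('example-key');
```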
I tried the same with a source table that also used the same underlying topic and it worked.
I guess the aggregation table is not using the KAFKA format and consequently just ignores the tombstones, or the LATEST_BY_OFFSET aggregation can't handle them.
I am really grateful for any guidance! Thank you!