I am working with several types of denormalized events that carry information about computer processes and network connections.
I create a unified stream from the process-start events and INSERT INTO it the process-end events. The stream is the source for a LATEST_BY_OFFSET table that keeps the latest non-null value per column (I really like that behavior!). In the end I have the process information from start and end in one row. The key is the process UUID.
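For reference, the aggregation looks roughly like this (all names are illustrative, not my real schema; LATEST_BY_OFFSET ignores NULLs by default, which is the behavior I rely on):

```sql
-- Sketch with hypothetical names; process_events is the unified
-- stream of start and end events, keyed by the process UUID.
CREATE TABLE process_table AS
  SELECT uuid,
         LATEST_BY_OFFSET(start_ts) AS start_ts,  -- ignores NULLs by default
         LATEST_BY_OFFSET(end_ts)   AS end_ts
  FROM process_events
  GROUP BY uuid;
```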
I do the same (including the LATEST_BY_OFFSET aggregation) for the network connections (start + end), but here the key is process UUID + start timestamp. I use a table because I can't be sure the messages arrive in the correct order relative to the process start/end events.
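The connection side looks roughly like this (again with made-up column names; grouping by two expressions gives the table the composite key I described):

```sql
-- Hypothetical names; the key is process UUID + connection start timestamp.
CREATE TABLE connection_table AS
  SELECT process_uuid,
         conn_start_ts,
         LATEST_BY_OFFSET(remote_addr) AS remote_addr,  -- placeholder payload column
         LATEST_BY_OFFSET(end_ts)      AS end_ts
  FROM connection_events
  GROUP BY process_uuid, conn_start_ts;
```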
Then I perform a FK join on the UUID with the filter condition process_table.uuid IS NOT NULL AND process_table.start_ts IS NOT NULL AND connection_table.start_ts IS NOT NULL AND connection_table.end_ts IS NOT NULL, so I can be sure that all the important information is available when the join fires.
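In sketch form (hypothetical names; a many-to-one table-table foreign-key join from the connection table to the process table, plus the completeness filter):

```sql
-- Only connections whose own start/end and the process start are all present.
CREATE TABLE completed_connections AS
  SELECT c.process_uuid,
         c.conn_start_ts,
         c.end_ts,
         p.start_ts AS process_start_ts
  FROM connection_table c
  JOIN process_table p ON c.process_uuid = p.uuid
  WHERE p.start_ts IS NOT NULL
    AND c.conn_start_ts IS NOT NULL
    AND c.end_ts IS NOT NULL;
```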
Now I am asking myself what performance implications this solution has, and whether it is possible to delete the completed connections from the table the join is performed on. Does it make a difference whether the table holds hundreds of millions of mostly completed connections or just a couple of million (per week)?
I tried to create a tombstone stream in the KAFKA format on the same topic the connection stream uses in order to ingest tombstones, but the table rows didn't change, although this "workaround" apparently used to work at some point: "Kafka Connect, ksqlDB, and Kafka Tombstone messages".
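What I attempted looked roughly like this (illustrative names and key value; the idea of the workaround is that inserting a row whose value columns are all NULL into a KAFKA-format stream writes a null-valued record, i.e. a tombstone, to the topic):

```sql
-- Hypothetical sketch of the tombstone workaround I tried.
CREATE STREAM connection_tombstones (id STRING KEY, dummy STRING)
  WITH (KAFKA_TOPIC='connection-events', VALUE_FORMAT='KAFKA');

-- Leaving the value column NULL should serialize to a null record value.
INSERT INTO connection_tombstones (id) VALUES ('example-key');
```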
I tried the same with a source table that also used the same underlying topic and it worked.
I guess the aggregation table is not using the KAFKA format and consequently just ignores the tombstones, or the LATEST_BY_OFFSET aggregation can't handle them.
I am really grateful for any guidance! Thank you!