Suggestions on how best to use KStream for real-time aggregation until end of day and then write the data to files at scale

pk2022 · 19 April 2022 16:27

Use case
The process has two steps. Step 1: Throughout the day there is an inflow of messages, where each message has a numeric field that represents the version of the record. I need to continuously find the latest version based on that numeric field and keep only the latest version. This continues until end of day (EOD).
Step 2: This happens after Step 1 finishes at EOD. It takes only the messages that were identified as latest and writes them to a file.
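To make the "keep only the latest version" rule concrete, here is a minimal sketch in plain Java with no Kafka dependency. The class names, the message fields, and the tie-breaking rule (an equal version keeps the record already stored) are my own assumptions for illustration and not part of the actual application.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LatestVersionTracker {

    // Hypothetical message shape: key, numeric version, and the offset it was read at.
    static final class VersionedMessage {
        final String key;
        final long version;
        final long offset;

        VersionedMessage(String key, long version, long offset) {
            this.key = key;
            this.version = version;
            this.offset = offset;
        }
    }

    // Stands in for the KV store: key -> latest message seen so far.
    private final Map<String, VersionedMessage> store = new LinkedHashMap<>();

    // Replace the stored message only when the incoming version is strictly higher.
    // Assumption: on equal versions the message already stored wins.
    void accept(VersionedMessage incoming) {
        store.merge(incoming.key, incoming,
                (current, candidate) -> candidate.version > current.version ? candidate : current);
    }

    // Copy of the current state, e.g. for the EOD dump.
    Map<String, VersionedMessage> snapshot() {
        return new LinkedHashMap<>(store);
    }

    public static void main(String[] args) {
        LatestVersionTracker tracker = new LatestVersionTracker();
        List<VersionedMessage> inflow = List.of(
                new VersionedMessage("order-1", 1, 0),
                new VersionedMessage("order-2", 3, 1),
                new VersionedMessage("order-1", 2, 2),
                new VersionedMessage("order-2", 2, 3)); // stale version, ignored
        inflow.forEach(tracker::accept);
        tracker.snapshot().forEach((key, msg) ->
                System.out.println(key + " -> version " + msg.version + " at offset " + msg.offset));
    }
}
```

The comparison is a simple two-argument "pick the newer one" function, which is the same shape that a reducer passed to `KGroupedStream#reduce` would take, so it may carry over to a Kafka Streams topology with little change.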

How I am solving
Step 1: I define a topology that reads messages from topic A and uses a KV store to keep only the latest version. The logic compares the incoming record's version with the version stored in the KV store for that key. This continues until EOD, and then I write everything from the KV store to txt files, where each file maps one-to-one to a partition number, e.g. file-0.txt. Each file holds the offset of the latest version of each record. This output becomes the input for Step 2.
Step 2: I load the output file-0.txt into a HashMap in the onPartitionsAssigned callback and then start consuming the messages. As each message comes in, I compare the offset previously stored in the file with the offset of the current record. If the offsets match, I write the record to the file; otherwise I simply ignore it and increment the counter of the offset file so that it points to the next offset to compare.
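The Step 2 filter can be sketched roughly as below, again in plain Java without the consumer plumbing. The file format (one offset per line) and all names are assumptions for illustration. For brevity the sketch uses a set lookup instead of a counter that walks through the offsets in order, which gives the same yes/no answer per record.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class LatestOffsetFilter {

    // Offsets of the records that Step 1 identified as latest, for one partition.
    private final Set<Long> latestOffsets = new HashSet<>();

    // Assumption: the per-partition file (e.g. file-0.txt) has one offset per line.
    LatestOffsetFilter(List<String> offsetFileLines) {
        for (String line : offsetFileLines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                latestOffsets.add(Long.parseLong(trimmed));
            }
        }
    }

    // True when the record at this offset is the latest version and should be written out.
    boolean isLatest(long offset) {
        return latestOffsets.contains(offset);
    }

    public static void main(String[] args) {
        // In the real consumer these lines would be read from the file when partitions are assigned.
        LatestOffsetFilter filter = new LatestOffsetFilter(List.of("1", "2"));
        for (long offset = 0; offset < 4; offset++) {
            System.out.println("offset " + offset + (filter.isLatest(offset) ? " -> write" : " -> skip"));
        }
    }
}
```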

Question
I have been reading about KStream joins and what KTables offer, but I cannot figure out how the second half of Step 1, where the latest versions identified up to EOD are written out, could instead be stored as a lookup table, so that Step 2 can simply join topic A with that lookup table and fetch only the records whose offsets match. I am currently solving this with a mostly plain-Java solution and seem to be missing out on what KStream/KTable provide. I would appreciate any direction or guidance on how to turn these values into a store that can be used for the comparison.
