We have a kafka application that processes CDC events from Debezium, and cleanses the data (10-200 events across 10+ topics) before assigning a common key to all records from the same transaction and pushing the events (using multiple schemas) to a single partitioned topic. Another application needs to process all events by transaction. The end result is that each partition can have a stream of events from interleaved transactions, so once-only processing becomes complex if the application attempts to manage the offsets itself.
Would a viable solution be to convert to a KTable grouping by transaction and returning an array of Avro GenericRecords be a viable solution? Given that there will be billions of records during initial load, will this be performant? Source events are created in kafka transactions so if read-committed is enabled, events for transactions should not appear until a commit is received.
Can anyone describe a better “black box” solution that does not require manually coding an offset tracker?