Limits of topic log compaction

Hi,

When I think of topic log compaction, the first use cases that come to mind concern master data or configuration data.

In my current project I have to find a stopgap solution for the next six months, until we have a proper CDC solution in place, and I was wondering whether enabling topic log compaction could be an approach for that:

I have records in an Oracle (19c) database which I extract to Kafka once an hour. These records include a creation date and a closing date, but no update date, and we are not allowed to solve this on the Oracle side with, for example, before-update triggers. So I have to grab all records whose closing date is NULL in order to catch possible updates, which leads to duplicates in the topic.

So I was thinking about configuring the topics concerned with ‘compact,delete’ and a ‘retention.ms’ in order to keep the topics small, e.g. a segment time of approx. 1 hour and a retention time of a few days. I assume working with multiple partitions doesn’t cause any issues, because all data with the same key is always written to the same partition.
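Roughly, the topic setup I have in mind would look something like this with the Java AdminClient (topic name, partition/replica counts and the exact segment/retention values are just placeholders for illustration):

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("oracle-snapshots", 3, (short) 3)
                    .configs(Map.of(
                            // keep at least the latest record per key AND delete anything older than the retention window
                            TopicConfig.CLEANUP_POLICY_CONFIG, "compact,delete",
                            // roll segments roughly hourly so that closed segments become eligible for cleaning
                            TopicConfig.SEGMENT_MS_CONFIG, String.valueOf(60L * 60 * 1000),
                            // drop records older than ~3 days ("a few days")
                            TopicConfig.RETENTION_MS_CONFIG, String.valueOf(3L * 24 * 60 * 60 * 1000)
                    ));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}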

Question: Is this a misuse of compaction? Are there limits on the amount of data (e.g. a few hundred to a few thousand records are loaded per hour)? Which metrics should I track/monitor in order to make sure that the Kafka cluster stays healthy even with bigger loads? Are there any rules of thumb or limits? Any ideas and thoughts would be a great input.


@ke.rstin You might want to be sure you understand specifically what log compaction means in Kafka: Kafka Design | Confluent Documentation. In particular, log compaction will retain at least the last value per message key.

I’m not certain exactly what you’re asking, but there are fundamental differences between the delete and compact policies that you should understand thoroughly.

I see, I’ll have to provide an example:

Source data in Oracle:

ID   CREATE_TS             STATUS     END_TS
1    22.06.2021 15:21:00   new        NULL
1    22.06.2021 15:21:00   declined   NULL
1    22.06.2021 15:21:00   accepted   22.06.2021 18:12:03

So you see, no UPDATE_TS unfortunately.

Ideal solution: In an ideal world, I would have a CDC technology that gets its information from the redo logs. But I don’t have that so far.

Topic log compaction solution: I am allowed to connect to the Oracle database once an hour and grab all new and updated data. As the consumer is only interested in the status quo of the tables (“snapshots”), I thought about topic log compaction. The consumer then gets at least the newest data for each key (I know some older records may also still be left in Kafka). Why am I talking about delete? Because records older than, e.g., a week don’t interest the consumer any more. I am aware that I lose the history of the data with this approach, and that would be okay.

So in the best case the consumer sees this data (key:message)
1: 1 22.06.2021 15:21:00 accepted 22.06.2021 18:12:03

In the worst case something like this (key:message)
1: 1 22.06.2021 15:21:00 declined NULL
1: 1 22.06.2021 15:21:00 accepted 22.06.2021 18:12:03

Wouldn’t you simply want to create an SCD2-style representation of the data somehow (irrespective of whether it lives inside or outside Kafka), to have a data structure that satisfies your needs?

That would be a feasible approach, but the valid_to info or flags are not really necessary. Seen like that, the topic log compaction approach would be a kind of SCD1, and that would be enough. Another approach could be to store the load key/load date as the Kafka message key; when the producer starts the next time, the newly loaded data is compared to the data of the last load (identified by the last load key/load date) in order to prevent duplicates, and only new or updated data gets written. With this approach you would keep the history. But I am sure there are more elegant ways.
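A very rough sketch of that compare-against-the-last-load idea (all class and method names are made up; for simplicity it keys both the in-memory map and the Kafka message by ID plus CREATE_TS and keeps the previous load in memory as pipe-joined strings):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class SnapshotDiffProducer {

    // Previous hourly load, keyed by ID + CREATE_TS; the value is the serialized row.
    private final Map<String, String> lastLoad = new HashMap<>();

    // Sends only rows that are new or have changed since the previous load.
    public void publishChanged(List<String[]> snapshot, KafkaProducer<String, String> producer, String topic) {
        Map<String, String> currentLoad = new HashMap<>();
        for (String[] row : snapshot) {
            String key = row[0] + "|" + row[1];      // ID + CREATE_TS
            String value = String.join("|", row);    // whole row as the message value
            currentLoad.put(key, value);
            if (!Objects.equals(lastLoad.get(key), value)) {
                producer.send(new ProducerRecord<>(topic, key, value));
            }
        }
        lastLoad.clear();
        lastLoad.putAll(currentLoad);
    }
}

The obvious drawback is that the in-memory map is lost on a producer restart, so the first run afterwards would re-send everything.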

If I understood what you are trying to do, and how the data is structured, you would want the Message Key=ID+CREATE_TS – is that correct? (This way, in the end, only 1 message for that Key will remain in the Kafka Topic – depending on how the Compaction is configured).

You could possibly solve the problem of duplicate records by using the same Kafka Topic as the data store for your Producer app. If your Producer app is also a Consumer of the same Topic, it will be able to cache in memory the latest version of each record it sent to the Topic, and only send a new record for that ID+CREATE_TS when it has actually changed. (Or you can implement some alternate way of caching.)
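A minimal sketch of that caching idea, assuming String keys/values and that the Producer app rebuilds its cache by reading the Topic from the beginning on startup (topic name is a placeholder; consumerProps needs bootstrap.servers, String deserializers and enable.auto.commit=false, and no group.id is required when using assign()):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class TopicBackedCache {

    // Rebuilds the "latest value per key" view by reading the (compacted) topic from the start.
    public static Map<String, String> rebuild(String topic, Properties consumerProps) {
        Map<String, String> cache = new HashMap<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            List<TopicPartition> partitions = consumer.partitionsFor(topic).stream()
                    .map(p -> new TopicPartition(topic, p.partition()))
                    .toList();
            consumer.assign(partitions);
            consumer.seekToBeginning(partitions);

            // Read until the end offsets captured at startup; later messages for the same key overwrite earlier ones.
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
            while (partitions.stream().anyMatch(tp -> consumer.position(tp) < endOffsets.get(tp))) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                    cache.put(rec.key(), rec.value());
                }
            }
        }
        return cache;
    }
}

Before producing, the app would then compare the freshly extracted row with cache.get(key) and only send it if something has changed (and put the new value back into the cache).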

It is fine to use DELETE together with COMPACTION, as long as you don’t need the older records any longer.