Using CDC-fed Kafka topics for replay with new consumers

Hello

I’ve been reading about Kafka and now have too many tabs open. I have one (maybe still unanswered) question that will determine how Kafka might work in a potential first use case.

Looking at MySQL Change Data Capture (CDC) Source (Debezium) Connector for Confluent Cloud Quick Start | Confluent Documentation

It seems like an interesting idea to use a CDC-fed topic to bootstrap a new consumer with all the events/records in a database table.

Setting the Kafka topic to use log compaction and tombstones sounds like the perfect way to minimise the resources used for this, while still giving new consumers an up-to-date picture.
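
Something like this is roughly what I had in mind for the topic itself — a minimal sketch using the Java AdminClient, where the topic name, partition/replica counts and tombstone retention are just assumptions (Debezium/Confluent Cloud would normally create and name the topic for you):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical CDC topic name; Debezium normally derives it from prefix.database.table.
            NewTopic topic = new NewTopic("mydb.inventory.customers", 3, (short) 3)
                    .configs(Map.of(
                            // Keep only the latest record per key instead of deleting by age.
                            TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT,
                            // How long a tombstone (null value) survives after compaction (1 day here).
                            TopicConfig.DELETE_RETENTION_MS_CONFIG, "86400000"));
            admin.createTopics(Set.of(topic)).all().get();
        }
    }
}
```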

However I had a (probably very naive) question:

Log compaction will leave only the latest event per key (which could be a create or an update, I guess…) in the topic.

But if your consumer acts differently on creates vs updates, isn’t this an issue when you are replaying?

Is there a way around this with log compaction, or is the answer simply not to use log compaction?

Cheers for any advice

n99

I am by no means an expert, but if you use log compaction, you will miss some actions on the database. If, for example, the value for a given ID is A and is then changed to B (for the same key), then with log compaction the stream will only show B. That is… a live consumer will get both events, but in a replay scenario a new consumer will only see B.
I think it all depends on how you want to use the data. If you just want a replica of the database, then log compaction is a good fit; you don’t need to know that the value changed from A to B.
If, however, you need to know that an attribute changed (e.g. an address changed) and you have logic depending on that change, then log compaction is not an option if you want to replay the event stream.
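
To make that concrete: a replaying consumer pretty much has to treat every non-tombstone record as an upsert and every tombstone as a delete, instead of trusting a create/update distinction. A minimal sketch of that idea (the topic name, group id and the handlers are all made up):

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "replay-bootstrap");       // hypothetical group id
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");      // new consumer starts from the beginning
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("mydb.inventory.customers"));         // hypothetical CDC topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    if (record.value() == null) {
                        delete(record.key());                // tombstone: the row no longer exists
                    } else {
                        upsert(record.key(), record.value()); // latest state per key: create OR update
                    }
                }
            }
        }
    }

    // Placeholder handlers standing in for whatever the downstream system needs.
    static void upsert(String key, String value) { System.out.println("upsert " + key); }
    static void delete(String key) { System.out.println("delete " + key); }
}
```

(Debezium change events do carry an “op” field with values like “c”, “u” and “d”, but after compaction the one surviving event for a key may well be an update, so a replaying consumer still can’t rely on it to decide create vs update.)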


Thanks for that - I do wonder if there is a way to work with multiple Kafka topics in some way. Not sure how.

The consumer use case I was thinking of was one that calls a downstream REST API (different calls for create vs update).

I also suppose the consumer could first call the REST API to determine whether the call should be a POST or a PATCH/a different update URL…
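
Something like this is what I was picturing; the base URL and the “GET returns 404 if it doesn’t exist yet” convention are pure assumptions about the downstream API:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateOrUpdateCaller {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();
    private static final String BASE_URL = "https://api.example.com/customers/";  // hypothetical downstream API

    // Decide between POST (create) and PATCH (update) by probing for the resource first.
    static void send(String id, String json) throws Exception {
        HttpRequest probe = HttpRequest.newBuilder(URI.create(BASE_URL + id)).GET().build();
        int status = CLIENT.send(probe, HttpResponse.BodyHandlers.discarding()).statusCode();

        HttpRequest write = (status == 404)
                ? HttpRequest.newBuilder(URI.create(BASE_URL))          // not there yet: create
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString(json))
                        .build()
                : HttpRequest.newBuilder(URI.create(BASE_URL + id))     // already there: update
                        .header("Content-Type", "application/json")
                        .method("PATCH", HttpRequest.BodyPublishers.ofString(json))
                        .build();

        CLIENT.send(write, HttpResponse.BodyHandlers.ofString());
    }

    public static void main(String[] args) throws Exception {
        send("42", "{\"name\":\"example\"}");                           // hypothetical id and payload
    }
}
```

The extra GET per event obviously adds a round trip, so if the downstream API offered an idempotent PUT/upsert that would probably be simpler.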

I wonder how useful Kafka is for message replay for new consumers…

Kafka is really useful for message replay for new consumers… that is, if you need it to be (No shit Sherlock :wink: ). The project I am currently assigned to uses Kafka “replay” heavily (or, to be more precise, is planning to). We store the data on the topic for a long period of time (3 months), and the idea is that if we ever need a different database model (the original services persist data from Kafka), we can just spin up a new database and a “new” consumer with a new schema. That way we can have two separate models running in parallel until we can confirm that the new model is correct, and then we can delete the old database and again work with just one model. Goodbye SQL migration scripts and late nights… Working in parallel has saved us so much time and grief. My only concern is that 3 months is not long enough… why not forever?
Of course this way of working is not exactly how you presented your issue, but I will still argue that it shows why not “deleting” data is a good thing. The trade-off is disk space/storage… no question about it…
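
For what it’s worth, the retention side of this is just topic config. A rough sketch with the Java AdminClient, using example values rather than our real ones (roughly 90 days here, or "-1" to keep records forever):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import org.apache.kafka.common.config.TopicConfig;

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class SetRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical topic name.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "mydb.inventory.customers");

            // Roughly 90 days in milliseconds; use "-1" instead to keep records forever.
            ConfigEntry retention = new ConfigEntry(TopicConfig.RETENTION_MS_CONFIG, "7776000000");

            Map<ConfigResource, Collection<AlterConfigOp>> updates =
                    Map.of(topic, List.of(new AlterConfigOp(retention, AlterConfigOp.OpType.SET)));
            admin.incrementalAlterConfigs(updates).all().get();
        }
    }
}
```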

Hi
Thanks for that.
Yes, maybe it’s not so crazy to keep data forever!

cheers