I’ve been reading about Kafka and now have too many tabs open. I have one (maybe still unanswered) question to work out how Kafka might fit a potential first use case.
It seems like an interesting idea to use a CDC-fed topic to bootstrap a new consumer with all events/records in a database table.
Setting the Kafka topic to use log compaction and tombstones sounds like the perfect way to minimise the resources used for this, while still delivering an up-to-date picture.
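For reference, something like the sketch below is what I had in mind for the topic itself (the topic name, partition and replication counts, and the tombstone retention value are just placeholders I made up):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class CreateCompactedCdcTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // "customers.cdc" is a made-up name for the CDC-fed topic;
            // the table's primary key would be used as the record key.
            NewTopic topic = new NewTopic("customers.cdc", 3, (short) 3)
                    .configs(Map.of(
                            // keep only the latest record per key
                            TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT,
                            // keep tombstones (null values for deleted rows) around for 24h
                            // so consumers that are still catching up see the delete
                            // before it is compacted away
                            TopicConfig.DELETE_RETENTION_MS_CONFIG, "86400000"));
            admin.createTopics(Set.of(topic)).all().get();
        }
    }
}
```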
However, I have a (probably very naive) question:
Log compaction will leave only the latest event per key (which could be a create or an update, I guess…) in the topic.
But if your consumer acts differently on creates vs. updates, isn't that an issue when you are replaying?
Is there a way around this with log compaction, or is the answer simply not to use log compaction?
I am by no means an expert, but if you use log compaction you will miss some actions on the database. If, for example, the value for a given ID is A and is then changed to B (same key), then after compaction the stream will only show B. That is… a consumer that was already running will get both events, but in a replay scenario a new consumer will only see B.
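To make that concrete, here is a rough sketch of the kind of consumer logic I mean. I am assuming a Debezium-style CDC value with an "op" field ("c" = create, "u" = update) and the row ID as the record key; the topic name and the crude string check are placeholders. A consumer that was already subscribed sees both the create and the later update for that ID, but a new consumer replaying the compacted topic only gets whatever record survived compaction, which for an updated row is the update, never the original create:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class OpBranchingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "op-branching-demo");
        props.put("auto.offset.reset", "earliest"); // replay from the beginning of the topic
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("customers.cdc")); // placeholder topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    if (record.value() == null) {
                        // tombstone: the row was deleted upstream
                        onDelete(record.key());
                    } else if (record.value().contains("\"op\":\"c\"")) {
                        // crude check instead of real JSON parsing; after compaction this
                        // branch may never fire on a replay, because only the latest
                        // update for the key survives
                        onCreate(record.key(), record.value());
                    } else {
                        onUpdate(record.key(), record.value());
                    }
                }
            }
        }
    }

    static void onCreate(String id, String value) { /* e.g. insert a row */ }
    static void onUpdate(String id, String value) { /* e.g. update a row */ }
    static void onDelete(String id)               { /* e.g. delete a row */ }
}
```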
I think it all depends on how you want to use the data. If you just want a replica of the database, then log compaction is a good fit; you don't need to know that the value changed from A to B.
If, however, you need to know that an attribute changed (e.g. an address changed) and you have logic depending on that change, then log compaction is not an option if you want to replay the event stream.
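To illustrate the first (replica) case: the handler does not care whether a record was originally a create or an update; it treats every non-null record as the latest state for that key and upserts it, and treats a tombstone as a delete. Because that is idempotent, replaying a compacted topic ends up in the same state. A rough sketch (the names are made up):

```java
public class ReplicaHandler {
    /** Every non-null record is simply "the latest state for this key"; a tombstone is a delete. */
    static void apply(String id, String latestStateOrNull) {
        if (latestStateOrNull == null) {
            deleteRow(id);                      // tombstone: the row no longer exists upstream
        } else {
            upsertRow(id, latestStateOrNull);   // e.g. INSERT ... ON CONFLICT (id) DO UPDATE
        }
    }

    static void upsertRow(String id, String state) { /* write the latest state to the replica */ }
    static void deleteRow(String id)               { /* remove the row from the replica */ }
}
```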
Kafka is really useful for message replay for new consumers… that is, if you need it to be (no shit, Sherlock). The project I am currently assigned to uses Kafka “replay” heavily (or, to be more precise, is planning to). We store the data on the topic for a long period of time (3 months), and the idea is that if we somehow need a different database model (the original service persists data from Kafka), we can just spin up a new database and a “new” consumer with a new schema. That way we can have two separate models running in parallel until we can confirm that the new model is correct, and then we can delete the old database and again work with just one model. Goodbye SQL migration scripts and late nights… Working in parallel has saved us so much time and grief. My only concern is that 3 months is not long enough… why not forever?
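In practice, “spinning up a new consumer” is mostly just giving it a fresh group.id, so it has no committed offsets and (with auto.offset.reset=earliest) starts from the oldest record still retained, rebuilding its own database while the old consumer keeps running. A rough sketch of what I mean (the group ID, topic name and mapping step are made up):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class NewModelConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // A brand-new group id means no committed offsets, so with
        // auto.offset.reset=earliest the consumer replays everything still retained.
        props.put("group.id", "orders-model-v2");
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // placeholder topic name
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // Map the event into the *new* schema and write it to the new database,
                    // while the old consumer group keeps feeding the old database in parallel.
                    writeToNewModel(record.key(), record.value());
                }
            }
        }
    }

    static void writeToNewModel(String key, String value) { /* persist into the v2 schema */ }
}
```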
Of course this way of working is not exactly how you presented your issue, but I will still argue that it shows why not “deleting” data is a good thing. The tradeoff is disk space/storage… no question about it…