I’ve been experimenting with larger data sets for a couple of weeks now, feeding from Debezium snapshots and materializing some of that data into tables. I’m trying to understand how it behaves in different situations:
1- Starting from scratch:
- Easiest scenario. Just snapshot all the tables you want to process in the correct order (reference tables first, then streaming tables). Everything is smooth.
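For context, my setup in this scenario looks roughly like the following (topic, table, and column names are placeholders, not my real schema):

```sql
-- Reference data first: a table materialized from a Debezium snapshot topic
CREATE TABLE customers (
  id INT PRIMARY KEY,
  name STRING
) WITH (
  KAFKA_TOPIC = 'dbserver1.public.customers',
  VALUE_FORMAT = 'AVRO'
);

-- Then the streaming side, joined against the reference table
CREATE STREAM orders WITH (
  KAFKA_TOPIC = 'dbserver1.public.orders',
  VALUE_FORMAT = 'AVRO'
);

CREATE TABLE orders_enriched AS
  SELECT o.customer_id, c.name, COUNT(*) AS order_count
  FROM orders o
  JOIN customers c ON o.customer_id = c.id
  GROUP BY o.customer_id, c.name;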
2- Restarting the server:
- ksqlDB reads from its internal (confluent-ksql-*) topics to reload materialized tables into memory.
- It might produce topics with null values from reference tables, or log “Skipping record due to null join key or value.”, if it tries to process messages that arrive before it has had time to rebuild its in-memory state.
- Overall, it works pretty well once initialization is finished.
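One thing that helps me tell when the restore in this scenario is done is watching the consumer lag for the query’s consumer group (the group name pattern below is what I see in my cluster; the service id `default_` and the query id are placeholders):

```shell
# List ksqlDB's consumer groups (one per persistent query)
kafka-consumer-groups --bootstrap-server localhost:9092 --list \
  | grep confluent-ksql

# Describe one group: when LAG is close to 0 on its topics,
# the state rebuild is essentially finished
kafka-consumer-groups --bootstrap-server localhost:9092 \
  --describe --group _confluent-ksql-default_query_CTAS_ORDERS_ENRICHED_0
```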
3- Deploying query changes to an existing running project:
- On restart, KsqlDB will transfer messages from its old internal (confluent-ksql) topics to new ones.
- However, once that’s done, it doesn’t seem to remember any of the data in the materialized tables, even the ones whose schema or query didn’t change.
- I waited for it to transfer 43 million messages from one internal topic to another, but once it finished, none of that data seemed to be retained.
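My working assumption is that each new version of a query gets a new query ID, and therefore fresh internal topics, which would explain why the old state isn’t reused. I’ve been checking this before and after a deploy with (query id below is a placeholder):

```sql
-- Compare query IDs before and after a deploy; a changed ID would mean
-- new internal topics and a rebuild from scratch
SHOW QUERIES;

-- Shows the internal topics and state a specific query uses
EXPLAIN CTAS_ORDERS_ENRICHED_0;
```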
So basically, I’m trying to find the best strategies for managing ksqlDB servers, and for handling changes in production without losing any data, as smoothly as possible. Or at the very least, I’d like to know what to expect in each scenario, and for that I need to understand how ksqlDB behaves on restart.
Any ideas where I could find documentation about it?