Preparing to move away from ksql queries file (= headless mode)…
Our ksql queries file consists of a number of CTAS and CSAS commands, as well as INSERT INTOs.
Because headless mode seems to be phased out at some point and we’re unable to actually remove anything from the queries file that we don’t need anymore, we’re trying to plan our migation.
The generated consumer groups contain a prefix, a short form of the command and an increasing integer that represesents which n-th statement the command is in the queries.file. This is also the reason why we can’t just remove something from the middle of the file (the query number in the generated group name will break).
So to move to interactive mode, we either need to have a way of specifying the consumer group name in a command or we’ll need to “rename” the existing group names to the group names that will be autogenerated from the interactive commands, correct? (before actually creating the queries)
Otherwise, everything will start from earliest or latest again, I guess?
What’s the best way to do this? Trying to rename the groups or wait for semantics to allow the group name to be specified in the KSQL command?
There is no currently support for this migration path yet. But it is a known issue we plan to do in the future. There is a github ticket (Write execution plans to the internal topic in headless mode · Issue #6282 · confluentinc/ksql · GitHub) where we plan to write all headless statements in a command topic. You will be able to use the headless command topic in your interactive app so all queries are executed with the same query ID.
That would be a cool feature. So this would mean with a future version – upon start –, the query file commands would be persisted into the command topic and after that we could continue in interactive mode?
If we were to migrate before that we’re thinking about this method:
Stop all data coming in
Recreate everyting from the query file with auto.offset.reset = latest
This would have some effects on the tumbling window aggregation queries that we are currently running due to the new key structure.
Do you think, “simulating” the old way with GROUP BY CAST(K1 AS STRING) + '|+|' + CAST(K2 AS STRING)
would continue the aggregations as before?
After trying out quite a couple of things, my conclusion for the 0.15 migration part is:
You cannot migrate CTAS with multiple GROUP BY by recreating the query.
Option 1: Use new key format with same output topic
This won’t work because aggregations are by key and the key changes. Windowed aggregation sums/counts will not be correct this way.
Option 2: Replicate old behavior using the concat/string-cast hack
This won’t work neither in our case because we create a special column in the output that combines group by columns into our own custom string for the database. So when creating the KSQL_COL_0 the query loses the information about the grouped columns are doesn’t permit such an operation.
Conclusion
We’ll have to do a stop the world migration close to the window boundary. This means, we’ll lose some aggregation information because exactly matching the window boundary (hour in our case) is impossible due to late arrivals etc etc.
We will combine this with the query file migration:
Stop inputs (mirrormaker/connect) close to the hour boundary and outputs (connect)
Wait for 0 lag on everything
Stop headless KSQLs
Restart with 0.15 and no queries file resulting in an empty KSQL Server
Create queries in interactive mode with offset at latest