Python - How to properly handle ILLEGAL_GENERATION errors with transactions with on_assign/on_revoke

Hey there everyone! For a little background on myself and project, I made a post on the confluent-kafka-python discussion board around a streams-like python client library my team is building on top of confluent-kafka-python that we hope to open source soon

As my description may imply, we have implemented transactions, along with tabling via SQLITE + changelog topics in a similar fashion to how RocksDB is utilized. Additionally, we have it handling messages via batch commits, which has been huge for throughput!

Everything works extremely well…most of the time! I have an issue around rebalancing that only happens…sometimes. Specifically, sometimes we get an ILLEGAL_GENERATION , which causes messages to be skipped (due to the consumer continuing after a transaction commit is interrupted and not resetting state); there is not a lot of documentation around how to handle it correctly, especially since I’m not dealing with the lower level C library.

For context…the table files + changelog topic assigning is all handled via the consumer.subscribe 's on_assign and on_revoke , where I additionally assign/revoke changelog topics (as-needed, depending on whether the table offset doesn’t match the changelog…otherwise I do not subscribe to them), along the accompanying partition-based table files.

I have a feeling that the methods passed to these args need to operate in a specific way that I’m not doing, as our non-table implementation (where we dont call on_assign/revoke ) seems to handle rebalances just fine.

So, to start…my general question is, when a rebalance happens with a transactional consumer and you are usingon_assign and on_revoke …what do you need to do besides assign/unassign partitions to ensure proper handling of a rebalance? Right now I am handling a lot of stuff manually, like throwing exceptions to interrupt current processing, and aborting any transactions and seeking the consumer to a previous offset (since we process more than 1 msg at a time), but I have no idea if this is correct or not (clearly not since it has issues sometimes).

I am happy to provide any details as needed, and happy to move this discussion elsewhere if there’s a more appropriate location. Thanks!