I was wondering whether the Connect cluster can benefit from the Streams framework, changelog topics and ‘exactly-once’ delivery. Is there any buffering in the worker cluster?
I read the ‘backpressure’ section of the Streams Architecture page in the Confluent documentation.
The two types of CDC mechanisms that could impact the worker cluster in our case are log tailing and JDBC query polling.
The Connect framework has some in-memory buffering, but only for the short batches it is consuming from or producing to topics. With source connectors, we can’t assume that the source system can be re-read after an error. If it can (say, a SQL database with a table that has an id and/or timestamp column), the Connect worker keeps track of the processed logical offsets in a connect-offsets topic. Sink connectors, on the other hand, read from Kafka topics as regular consumers, so they can simply try again. This way, for the most part, you achieve at-least-once semantics. There are some edge cases, such as non-retriable source systems or permanent errors from sink systems, which could require skipping records and continuing.
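To make the at-least-once behaviour a bit more concrete, here is a toy Python simulation of a source that can be re-read by id. It is only a sketch and does not use the Connect API: `SOURCE_TABLE`, `poll`, `run_worker`, `offset_store` and `SimulatedCrash` are all made-up names, a plain list stands in for the Kafka topic, and a dict stands in for the offsets topic. The assumption it illustrates is that the offset is stored only after the batch has been produced, so a crash in between causes the batch to be produced again on restart.

```python
# Toy simulation, not the Kafka Connect API: every name here is hypothetical.

SOURCE_TABLE = [{"id": i, "value": f"row-{i}"} for i in range(1, 7)]


class SimulatedCrash(Exception):
    """Stands in for a worker dying between producing and storing the offset."""


def poll(last_id, batch_size=2):
    """Re-readable source: return the next rows with id greater than last_id."""
    return [row for row in SOURCE_TABLE if row["id"] > last_id][:batch_size]


def run_worker(topic, offset_store, crash_after_batch=None):
    """Produce batches to `topic`, storing the source offset after each batch."""
    batches = 0
    while True:
        batch = poll(offset_store.get("last_id", 0))
        if not batch:
            return
        topic.extend(batch)  # step 1: records reach the topic
        batches += 1
        if batches == crash_after_batch:
            raise SimulatedCrash()  # dies before the offset is stored
        offset_store["last_id"] = batch[-1]["id"]  # step 2: offset is stored


topic, offset_store = [], {}
try:
    run_worker(topic, offset_store, crash_after_batch=2)
except SimulatedCrash:
    pass
run_worker(topic, offset_store)  # restart resumes from the last stored offset

produced_ids = [row["id"] for row in topic]
print(produced_ids)  # ids 3 and 4 appear twice: duplicates, but nothing is lost
```

No row is skipped, but the batch that was in flight during the crash shows up twice, which is why downstream consumers generally need to tolerate duplicates.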
The ideas behind changelog topics in Streams might be interesting here, but applying them would require a complete rethinking of the Connect architecture. Also, in Kafka Streams you usually deal only with Kafka-to-Kafka topologies, which can use transactions to ensure exactly-once processing. As soon as you make external calls such as REST APIs or SQL queries, that guarantee can no longer be made, so we’re back to at-least-once processing (unless you skip over permanent exceptions).
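The difference between the two cases can be sketched with another toy Python model. Again, this is not the Kafka Streams or Connect API, and all names (`process_kafka_to_kafka`, `process_with_external_call`, `run_with_one_crash`, `SimulatedCrash`) are invented for illustration. The first function assumes the output and the input offset are committed together, the way a transaction would allow. The second makes an external side effect that cannot be rolled back before the offset advances.

```python
# Toy model, not the Kafka Streams or Connect API: every name is hypothetical.

INPUT = ["a", "b", "c"]


class SimulatedCrash(Exception):
    """Stands in for the process dying partway through a record."""


def process_kafka_to_kafka(state, crash_at=None):
    """Output and input offset are committed together, as a transaction allows."""
    while state["offset"] < len(INPUT):
        record = INPUT[state["offset"]]
        pending_output = state["output"] + [record.upper()]
        if state["offset"] == crash_at:
            raise SimulatedCrash()  # nothing for this record was committed
        # Atomic commit: both changes become visible, or neither does.
        state["output"], state["offset"] = pending_output, state["offset"] + 1


def process_with_external_call(state, external_system, crash_at=None):
    """The external write happens first and cannot be undone after a crash."""
    while state["offset"] < len(INPUT):
        record = INPUT[state["offset"]]
        external_system.append(record.upper())  # e.g. a REST call or SQL insert
        if state["offset"] == crash_at:
            raise SimulatedCrash()  # offset not advanced, side effect remains
        state["offset"] += 1


def run_with_one_crash(step, *args):
    """Run `step`, crash while handling the second record, then restart."""
    try:
        step(*args, crash_at=1)
    except SimulatedCrash:
        pass
    step(*args)


kafka_state = {"offset": 0, "output": []}
run_with_one_crash(process_kafka_to_kafka, kafka_state)

external_state, external_system = {"offset": 0}, []
run_with_one_crash(process_with_external_call, external_state, external_system)

print(kafka_state["output"])  # each record appears once
print(external_system)        # the record in flight during the crash is repeated
```

In the second case the only ways out are to make the external write idempotent (for example, an upsert keyed on the record id) or to accept duplicates.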
I hope this gives some colour on this topic.