Hi,
I have a question regarding “exactly once” delivery guarantee in Kafka Connect Amazon S3 Sink Connector.
A picture in “Exactly-once delivery on top of eventual consistency” section of the documentation, says that if late data is possible, then such a configuration invalidates exactly-once guarantees.
I’m trying to understand why is this so.
I understand the idea S3 Sink Connector is using to achieve exactly-once: if partitioning (both folder-wise and file-wise) are stable between retries (i.e. subsequent reruns cut files at exactly same offsets as previous runs, and put these files to exactly the same folders as previous runs), then from the S3 reader point of view, nothing will change: old files will simply be overwritten by new files, which have the same content.
So I’m trying to understand how existence of late data can invalidate the above?
I use deterministic RecordFieldTimestampExtractor, TimeBasedPartitioner ('year'=YYYY/'month'=MM/'day'=dd/'hour'=HH
), rotate interval and not using scheduled rotate interval. Let’s take the following sequence of records in a single Kafka partition:
- Record1 (offset=0, timestamp=(hour=1, minute=1))
- Record2 (offset=1, timestamp=(hour=2, minute=3))
- Record3 (offset=2, timestamp=(hour=1, minute=20))
- Record4 (offset=3, timestamp=(hour=2, minute=25))
Note that Record3 has timestamp lower than Record2, so Record3 is essentially a “late data”.
When S3 Sink Connector process that sequence, the following files will be generated in S3:
- /hour=01/topic+partition+0.extention (containing Record1)
- /hour=01/topic+partition+2.extention (containing Record3)
- /hour=02/topic+partition+1.extention (containing Record2)
- /hour=02/topic+partition+3.extention (containing Record4)
If I rerun the Connector, given the same source records, I expect the same files to be generated (overwritten). Even despite timestamp is not always increase.
I don’t see a case when files after rerun would be different and invalidate exactly-once guarantee. Am I missing something?
So my question is: what was the reason of adding that last “Late Data Possible?” branch to the picture in S3 Sink Connector documentation? What is the case to invalidate exactly-once if we have late data, but use deterministic RecordFieldTimestampExtractor?
Thanks.