S3 Sink Connector: "exactly once" when late data is possible

lion-simba · 24 February 2022 09:03

Hi,

I have a question regarding “exactly once” delivery guarantee in Kafka Connect Amazon S3 Sink Connector.

A picture in “Exactly-once delivery on top of eventual consistency” section of the documentation, says that if late data is possible, then such a configuration invalidates exactly-once guarantees.

I’m trying to understand why is this so.

I understand the idea S3 Sink Connector is using to achieve exactly-once: if partitioning (both folder-wise and file-wise) are stable between retries (i.e. subsequent reruns cut files at exactly same offsets as previous runs, and put these files to exactly the same folders as previous runs), then from the S3 reader point of view, nothing will change: old files will simply be overwritten by new files, which have the same content.

So I’m trying to understand how existence of late data can invalidate the above?

I use deterministic RecordFieldTimestampExtractor, TimeBasedPartitioner ('year'=YYYY/'month'=MM/'day'=dd/'hour'=HH), rotate interval and not using scheduled rotate interval. Let’s take the following sequence of records in a single Kafka partition:

Record1 (offset=0, timestamp=(hour=1, minute=1))
Record2 (offset=1, timestamp=(hour=2, minute=3))
Record3 (offset=2, timestamp=(hour=1, minute=20))
Record4 (offset=3, timestamp=(hour=2, minute=25))

Note that Record3 has timestamp lower than Record2, so Record3 is essentially a “late data”.

When S3 Sink Connector process that sequence, the following files will be generated in S3:

/hour=01/topic+partition+0.extention (containing Record1)
/hour=01/topic+partition+2.extention (containing Record3)
/hour=02/topic+partition+1.extention (containing Record2)
/hour=02/topic+partition+3.extention (containing Record4)

If I rerun the Connector, given the same source records, I expect the same files to be generated (overwritten). Even despite timestamp is not always increase.

I don’t see a case when files after rerun would be different and invalidate exactly-once guarantee. Am I missing something?

So my question is: what was the reason of adding that last “Late Data Possible?” branch to the picture in S3 Sink Connector documentation? What is the case to invalidate exactly-once if we have late data, but use deterministic RecordFieldTimestampExtractor?

Thanks.

lion-simba · 14 March 2022 05:24

Just to follow up on this.

In absence of official answer to this, we did our own tests. These tests shown that having “late data” doesn’t invalidate “exactly-once” guarantee (if using deterministic timestamp extractor and partitioner).

Nevertheless, it would be great to get an official confirmation to this.

Topic		Replies	Views
Exactly once - S3 Sink - documentation questions Kafka Connect	4	3851	18 June 2021
Amazon S3 Sink Connector Self-Managed Connectors	0	1838	11 June 2023
S3 sink connector Kafka Connect	2	4109	10 June 2021
Tasks and Partitions Rebalancing Mechanism Self-Managed Connectors	0	1239	19 October 2023
Control filename in s3 sink connector (repeated query) Kafka Connect	2	956	1 March 2024

S3 Sink Connector: "exactly once" when late data is possible

Related topics