S3 sink connector generates files twice a day

We have a JDBC source connector that extracts data from Oracle into a Kafka topic, and an S3 sink connector that generates files from the data available in that topic. We also have a cron job scheduled to delete and recreate the connectors every day.

Error: We are getting those S3 files twice a day, apparently because of a 24-hour polling interval. We changed the polling interval to 6 days, but it still behaves as if the interval were 24 hours.

Is there any metadata location where we can see when the connectors last ran and when the topics were last written to?

Can you clarify what you mean by “polling interval”?

The S3 sink will always consume the records immediately. It buffers them in memory until a new partition is reached, based on your chosen partitioner class, but it will always write a file out once “flush.size” records are buffered or the “scheduled rotation” interval elapses. There may be a JMX metric that shows the size of the buffer, but otherwise a drop in consumer group lag will indicate when records actually get committed to S3.
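
For illustration, here is a minimal sketch of what such an S3 sink configuration looks like, showing the settings mentioned above. The property names come from the Confluent S3 sink connector; the topic, bucket, and values are placeholders, not the configuration attached later in this thread:

```properties
# Hypothetical S3 sink config -- property names per the Confluent S3 sink connector,
# values are illustrative placeholders.
name=s3-sink-example
connector.class=io.confluent.connect.s3.S3SinkConnector
tasks.max=1
# The sink only consumes the topics listed here
topics=orders
s3.bucket.name=my-example-bucket
s3.region=us-east-1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
# A file is committed when either threshold below is reached, whichever comes first.
# flush.size = number of buffered records per output file
flush.size=1000
# rotate.schedule.interval.ms = wall-clock rotation (10 minutes here);
# scheduled rotation is evaluated in the timezone below
rotate.schedule.interval.ms=600000
timezone=UTC
```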

Would you mind sharing your sink configuration?

PFA sink connector properties.

Actually, we have a cron job scheduled to delete and create the connectors every day at 3 am, and it runs perfectly fine all the way from the source to the S3 sink file generation. But somehow, at around 10 am, the same S3 files are generated again, even though we see no spike in the production and consumption graphs and no data loaded into the Kafka topic.

Is there somewhere we can check how many times the connector has run, or whether the S3 sink is taking data from some other source as well?

PFA source connector properties as well

You shouldn’t need a cron job to delete and recreate the connectors; that may cause more issues than it solves. As mentioned, there is no “polling interval” on the sink. You can reduce the scheduled rotation interval and flush size if you’d like data written to S3 more frequently. And the S3 sink only “takes data” from the topics you’ve configured it for.

“know from somewhere how many times the connector is running”

I would start with the logs.

We have a polling interval of 24 hrs specified in the source connector file.

The problem is that no data arrives in the topic a second time, yet S3 files are generated a second time, and we don’t know from where. As you can see, the S3 sink connector is associated with one single topic.

Have you checked the source topic yourself?

The S3 sink should have exactly once delivery - From Apache Kafka to Amazon S3: Exactly Once | Confluent

However, if you’re frequently deleting and recreating the connector, then that might no longer be guaranteed, especially if you create the connector with a new name each time.

Yes, I have checked the topic in Confluent Control Center, and it didn’t produce or consume any data during the second run.

Also, what would be the maximum value for the polling interval, and why?

As mentioned, there is no polling “interval” on the sink. The consumer API fetches the records as soon as they are available in the topic. The S3 connector then buffers the records in memory until the flush size, the scheduled rotation, or a new partition is reached.

Just for clarification, I am not asking about the polling interval for the S3 sink connector.

For source connectors, we usually specify polling intervals, and that is what I am asking about.

The polling interval relates to the source connector only. Please go through my source properties file and, if possible, let me know about the source connector’s polling interval.

You have poll.interval.ms defined. That is the source polling interval.

That has nothing to do with your original question of “S3 files, twice a day”. That is not controlled by your source connector, only by the sink connector config, in which there isn’t any poll interval, only a flush size, rotation, and partitions.
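
For context, this is roughly where poll.interval.ms sits in a JDBC source connector configuration. This is only a sketch: the connection details, table, and values are placeholders, not the properties file attached above.

```properties
# Hypothetical JDBC source config -- placeholders only, not the attached file.
name=oracle-source-example
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:oracle:thin:@//db-host:1521/ORCLPDB1
connection.user=connect_user
connection.password=********
# Track new/updated rows using a timestamp column plus an incrementing id column
mode=timestamp+incrementing
timestamp.column.name=UPDATED_AT
incrementing.column.name=ID
table.whitelist=ORDERS
topic.prefix=oracle-
# How often the SOURCE connector queries the database for new rows.
# 86400000 ms = 24 hours. This does not control when the S3 sink writes files.
poll.interval.ms=86400000
```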

How can we fine-tune the flush size, rotation interval (2 mins), GCS part size (8 MB), and partition duration (2 mins) to write files of around 50 MB when data is available in the Kafka topics? I am using the above properties in a GCS sink connector, but it still writes files of only a few KB rather than larger MB-sized files, even when there is data in the Kafka topic.

I am also using the same properties to consume 30 topics and write each topic to its respective GCS bucket objects, using a single GCS sink connector config.

I use a JDBC source connector to write the data into the 30 Kafka topics, with a 24-hour polling interval.

@OneCricketeer @rmoff

Can you offer guidance based on your experience? I am looking to make each target file at least 50 MB, instead of the smaller files produced within the 2-minute rotation schedule, using the GCS sink connector. I increased the GCS part size, but it still did not work.
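
For reference, here is a sketch of a GCS sink configuration along the lines described above. The property names come from the Confluent GCS sink connector; the bucket, topics, and exact values are reconstructed from the description and are only placeholders, not the actual config:

```properties
# Hypothetical GCS sink config reconstructed from the description above --
# bucket, topics, and values are placeholders.
name=gcs-sink-example
connector.class=io.confluent.connect.gcs.GcsSinkConnector
tasks.max=1
topics=topic1,topic2,topic3
gcs.bucket.name=my-example-bucket
storage.class=io.confluent.connect.gcs.storage.GcsStorage
format.class=io.confluent.connect.gcs.format.json.JsonFormat
# 8 MB multipart-upload part size, as described above
gcs.part.size=8388608
partitioner.class=io.confluent.connect.storage.partitioner.TimeBasedPartitioner
# 2-minute partition windows and 2-minute scheduled rotation, as described above:
# every open file is closed after at most 2 minutes, whatever its size.
partition.duration.ms=120000
rotate.schedule.interval.ms=120000
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm
locale=en-US
timezone=UTC
flush.size=100000
```

With this kind of setup, gcs.part.size only controls the multipart-upload chunk size, not the output file size. Files are written per topic partition and are closed when flush.size records accumulate or the 2-minute rotation/partition window ends, so 50 MB objects are only possible if that much data arrives per partition within each window; otherwise the rotation interval and partition duration are the settings to increase.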