S3 sink connector generates files twice a day

We have one JDBC source connector that extracts data from Oracle into a Kafka topic, and an S3 sink connector that generates files from the data available in that topic. We have a cron job scheduled to delete and recreate the connectors every day.

Problem: We are getting the S3 files twice a day, apparently because of a 24-hour polling interval. We changed the polling interval to 6 days, but it still behaves as if the interval were 24 hours.

Is there any metadata location where we can see when the connectors and topics last ran?

Can you clarify what you mean by “polling interval”?

The S3 sink will always consume the records immediately. It will buffer them in memory until a partition boundary is reached, based on your chosen partitioner class, and will write a file out whenever "flush.size" records have accumulated or the scheduled rotation interval elapses. There may be a JMX metric that shows the size of the buffer, but otherwise a drop in consumer group lag will indicate when records actually get committed to S3.
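For reference, the settings that control when the sink writes files live in the connector config. A minimal sketch (the connector name, topic, bucket, and values below are illustrative, not taken from the attached file):

```properties
name=s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
topics=my-topic
s3.bucket.name=my-bucket
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
# write a file once this many records are buffered per topic partition...
flush.size=1000
# ...or every 10 minutes of wall-clock time, whichever comes first
rotate.schedule.interval.ms=600000
# the connector requires a timezone when scheduled rotation is used
timezone=UTC
```

Note that, per the exactly-once post linked below, using wall-clock scheduled rotation weakens the sink's exactly-once guarantee, since file boundaries then depend on when the task runs rather than only on the data.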

Would you mind sharing your sink configuration?

PFA sink connector properties.

Actually, we have a cron job scheduled to delete and create the connectors every day at 3 AM, and it runs perfectly fine, from the source through to the S3 file generation. But at around 10 AM the same S3 files are generated again; we see no spike in the production and consumption graphs and no data loaded into the Kafka topic.

Is there somewhere we can see how many times the connector has run, or whether the S3 sink is taking data from some other source as well?

PFA source connector properties as well

You shouldn’t need a cron job to delete and recreate the connectors; that may cause more issues than it solves. As mentioned, there is no “polling interval” on the sink. You can reduce the scheduled rotation interval and the flush size if you’d like data written to S3 more frequently. And the S3 sink only “takes data” from the topics you’ve configured it for.

know from somewhere how many times the connector is running

I would start with the logs.
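Besides the worker logs, the Kafka Connect REST API exposes a `/connectors/<name>/status` endpoint whose JSON response shows the connector state, each task's state, and which worker is running it. A rough sketch of reading it (the connector name, worker address, and sample payload below are illustrative, not from your cluster):

```python
import json

def summarize_status(status: dict) -> str:
    """Summarize a Kafka Connect /connectors/<name>/status response."""
    connector_state = status["connector"]["state"]
    task_states = [t["state"] for t in status["tasks"]]
    return f"{status['name']}: connector={connector_state}, tasks={task_states}"

# Illustrative payload, shaped like a real /status response.
# Fetch the live one with, e.g.:
#   curl http://localhost:8083/connectors/s3-sink/status
sample = json.loads("""
{
  "name": "s3-sink",
  "connector": {"state": "RUNNING", "worker_id": "10.0.0.1:8083"},
  "tasks": [{"id": 0, "state": "RUNNING", "worker_id": "10.0.0.1:8083"}]
}
""")

print(summarize_status(sample))
```

This only tells you the current state, not a run history; for "how many times did it run", the worker logs (task start/stop messages) and the consumer group lag over time are still the places to look.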

We have a polling interval of 24 hours configured in the source connector file.

The problem is that no data arrives in the topic a second time, but S3 generates the files a second time, and we don’t know from where. As you can see, the S3 sink connector is associated with a single topic.

Have you checked the source topic yourself?

The S3 sink should have exactly once delivery - https://www.confluent.io/blog/apache-kafka-to-amazon-s3-exactly-once/

However, if you’re frequently deleting and recreating the connector, that guarantee may no longer hold, especially if you create the connector with a new name each time.

Yes, I have checked the topic in Confluent Control Center, and it did not produce or consume any data at the time of the second run.

Also, what would be the maximum value for the polling interval, and why?

As mentioned, there is no polling “interval” on the sink. The consumer API will fetch records as soon as they are available in the topic. The S3 connector will then buffer the records in memory until the flush size, the scheduled rotation, or a new partition boundary is reached.

Just to clarify, I am not asking about a polling interval for the S3 sink connector.

For source connectors, we usually configure a polling interval, and that is what I am asking about.

The polling interval relates to the source connector only. Please go through my source properties file and, if possible, let me know about the source connector’s polling interval.
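Assuming the attached file is for the Confluent JDBC source connector, the setting in question is `poll.interval.ms`, which is specified in milliseconds. A minimal sketch (the name, connection details, and column names below are illustrative):

```properties
name=oracle-jdbc-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:oracle:thin:@//db-host:1521/ORCL
mode=timestamp+incrementing
timestamp.column.name=UPDATED_AT
incrementing.column.name=ID
topic.prefix=oracle-
# how often the connector queries the table for new data, in milliseconds;
# 86400000 ms = 24 hours
poll.interval.ms=86400000
```

Since the unit is milliseconds, a 6-day interval would be 518400000; a value entered in hours or days would be interpreted as milliseconds and poll far more often than intended.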