We have a Confluent-hosted Kafka Connect cluster serving traffic in production. In our setup, we run a custom sink connector modeled on the standard JDBC sink connector (with a few extra steps to retrieve data from another system before uploading data to the destination).
We typically commit offsets from each task every 100 seconds. Recently, we made a change on our end: instead of flushing all records to the destination and committing offsets at the same time, we now flush records to the destination in fixed-size batches (to have better control over flush times during peak traffic), while still committing offsets every 100 seconds. Since the change, we've noticed offset commit times climbing during peak traffic, followed by asynchronous offset commit timeouts (our current offset commit timeout is 60s). Together, these cause elevated consumer lag on specific tasks.
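For reference, here's a minimal sketch of how the commit cadence and timeout described above would look in the distributed worker config. These are the stock Kafka Connect worker properties; I'm assuming the 100-second interval and 60-second timeout map to these two settings in our deployment:

```properties
# Commit sink task offsets every 100 seconds
offset.flush.interval.ms=100000
# Give up on an in-flight offset commit after 60 seconds
offset.flush.timeout.ms=60000
```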
The weirdest part: during one of these episodes, if we pause our connectors, restart the Confluent Platform on our workers, and then resume the connectors, the issue doesn't recur until the next day or so. The built-up consumer lag catches up quickly, with no offset commit problems after the restart. There's no evidence of a memory leak on the application side, so I'm curious whether anyone else has seen something similar.