We are having issues with our Kafka Connect cluster running in Docker on AWS ECS. We currently have three tasks running Kafka Connect and 110 connectors in total.
Kafka Connect often just freezes. Nothing looks wrong in the logs, except that no new entries are generated. ECS sees the tasks as healthy, but the Kafka Connect UI times out trying to retrieve connectors, as do API calls, e.g. 0.0.0.0:8083/connectors.
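Since ECS reports the tasks as healthy while the REST API hangs, one thing I am considering is a container health check that actually calls the REST API instead of only checking that the process is alive. A minimal sketch in Python, assuming the worker's REST API is reachable on localhost:8083 inside the container; the function name `connect_is_responsive` and the 5-second timeout are placeholders of mine, not anything from our current setup:

```python
import http.client
import json
import urllib.error
import urllib.request


def connect_is_responsive(base_url: str = "http://localhost:8083",
                          timeout: float = 5.0) -> bool:
    """Return True only if GET /connectors answers with a JSON list.

    base_url and timeout are assumed values; adjust them to the real
    listener address and to how long a healthy worker takes to respond.
    """
    try:
        # The timeout applies to blocking socket operations, so a frozen
        # worker that accepts the connection but never replies still fails.
        with urllib.request.urlopen(f"{base_url}/connectors", timeout=timeout) as resp:
            body = json.load(resp)
            return resp.status == 200 and isinstance(body, list)
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Refused connections, timeouts, HTTP errors, and malformed bodies
        # are all treated as "not responsive".
        return False
```

A health check script would call this and exit non-zero when it returns False, so that ECS marks a frozen task as unhealthy and replaces it. That would not explain the freeze, but it would at least stop the tasks from looking healthy while the API is hanging.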
I am looking for advice, as I cannot get to the bottom of this. I’m wondering whether we would be better off running two tasks and putting more resources (memory/CPU) towards each task. Or should we add a fourth task?