Hello colleagues. I'm pretty new to Kafka, as well as to this forum.
We have an application (Go + sarama) consuming data from 8 Kafka brokers (0-7), from a single topic split into 3600 partitions with 2 replicas. Both Kafka and the application run in the same k8s cluster.
Our Kafka is the Confluent-based image 5.0.1 (Kafka 2.0.0); the sarama version is v1.24.1.
The issue is that at a certain moment kafka-0 started flapping (intermittently unreachable), and then the same happened with kafka-7. After that, even once Kafka was healthy again, the sarama application was still unable to connect. The only thing that helped was completely restarting all Kafka consumers; after that everything went back to normal and stayed stable.
Maybe someone has seen similar behavior and can advise on how to dig further.
We would appreciate any hint on where to look and what to look for in order to fix this, or on what additional debug information to gather if something like this happens again.
I have some log extracts here. The issue is quite old and has not reproduced so far, but we would finally like to get to the bottom of it. Thanks a lot in advance for any advice.