Hello team, I'm sorry if this isn't the correct channel to post this question, but I'm seeing some odd behavior in my Kafka dev cluster and I would like the community's help troubleshooting this situation:

I have a microservice acting as my Kafka consumer, built in Python and Django and using the faust-streaming library. My Kafka cluster in the cloud is the GCP Kafka instance with replication; this product provides 3 Kafka broker instances and 3 ZooKeeper instances.
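For context, the consumer side is wired roughly like this (a minimal sketch of a faust-streaming app; the app name, broker address and topic name are placeholders, not our real configuration):

import faust

# Minimal sketch of how the consumer is wired (app name, broker address and
# topic name are placeholders, not our real config).
app = faust.App(
    "my-consumer-app",              # faust uses this app id as the consumer group id
    broker="kafka://broker-0:9092", # one of the three GCP Kafka brokers
)

topic_1 = app.topic("topic_1", value_type=bytes)

@app.agent(topic_1)
async def process(stream):
    # The real agent hands each message to Django-side logic; a plain
    # pass-through is enough to show the shape of the consumer.
    async for message in stream:
        print("consumed:", message)

The worker runs via the usual faust -A <module> worker command; as far as I understand, faust uses the app id as the Kafka consumer group id.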
Some time ago we started noticing a behavior that (fortunately) has so far only occurred in our dev environment: the Kafka broker container that was the elected leader started entering a CrashLoopBackOff state. After some investigation we found that this happened because of the following error:
kafka.common.InconsistentClusterIdException: The Cluster ID m1Ze6AjGRwqarkcxJscgyQ doesn’t match stored clusterId Some(1TGYcbFuRXa4Lqojs4B9Hw) in meta.properties. The broker is trying to join the wrong cluster. Configured zookeeper.connect may be wrong.
After weeks of investigation we concluded that this error occurred because all of our ZooKeepers went down at the same time in our dev environment when our machines were rotated. Since this Google instance doesn't mount a persistent volume for the ZooKeeper data, if all the ZooKeepers go down at the same time a new Cluster ID has to be created, which conflicts with the cluster id stored in the meta.properties file on each of my brokers.

After we solved this problem and restarted both the Kafka and ZooKeeper containers, everything seemed fine on my Kafka cluster. However, since the error above occurred, my consumer app establishes a connection to the broker but cannot consume any more messages. We tried creating new consumer groups with different group ids, but the problem still seemed to persist, and the only thing that solved it was deleting the __consumer_offsets directory from my broker.

Furthermore, we noticed that if we launch a console producer and consumer from a broker's bash we can only produce messages to some of our topics, which is very odd behavior. We checked our broker logs and didn't find any errors regarding the "corrupted" topics. We also described our topic info and everything seemed correct.

Example topic_1
Messages are produced, and when we run the describe command we get the following info:
Topic: topic_1  TopicId: K8aPagpBTd-EItR8ygYm_A  PartitionCount: 1  ReplicationFactor: 1  Configs:
    Topic: topic_1  Partition: 0  Leader: 0  Replicas: 1  Isr: 1
Example topic_2
We can't produce any messages, and the following error occurs:
[2024-01-17 10:42:34,433] WARN [Producer clientId=console-producer] Got error produce response with correlation id 6 on topic-partition topic_2-0, retrying (2 attempts left). Error: NOT_LEADER_OR_FOLLOWER (org.apache.kafka.clients.producer.internals.Sender)
Topic: topic_2  TopicId: 6MD0UXuRS2O0HUnNRcspng  PartitionCount: 1  ReplicationFactor: 1  Configs:
    Topic: topic_2  Partition: 0  Leader: 0  Replicas: 0  Isr: 0
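In case it helps to reproduce, this is the kind of bare-bones standalone check (independent of faust) that can be run against an affected topic with a brand-new group id; it's only a sketch using kafka-python, and the broker address, topic name and group id are placeholders:

from kafka import KafkaConsumer

# Throwaway consumer with a fresh group id: does a brand-new group read
# anything at all from the affected topic? (Sketch only; broker address,
# topic and group id are placeholders.)
consumer = KafkaConsumer(
    "topic_2",
    bootstrap_servers="broker-0:9092",
    group_id="debug-fresh-group",
    auto_offset_reset="earliest",  # a new group starts from the beginning
    consumer_timeout_ms=10000,     # stop after 10s of silence instead of blocking
)

for record in consumer:
    print(record.topic, record.partition, record.offset, record.value)

consumer.close()

In our case even fresh group ids like this kept behaving the same way until we deleted the __consumer_offsets directory, as described above.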
So the questions I still couldn't find any valid answer to are:
1 → Why doesn't my consumer app rebalance the information itself? Following Kafka Streams logic, shouldn't that in theory happen "out of the box"?
2 → If I extend this GCP Kafka solution and mount a volume for the ZooKeeper data, the Cluster ID error stops occurring even if I delete all the ZooKeepers at the same time. Does creating this volume have any impact on other things that I may be missing? Why didn't Google mount a volume for the ZooKeeper data in this solution in the first place?
3 → Why are only some topics "corrupted"? Is there any valid reason for this that could help us?

We tried searching the internet for similar cases, and the one that seemed closest to what we are going through is this one. We tried every suggestion written there, but nothing helped… Can anyone help me, please?