Kafka multi-datacenter solution

I have 2 services (1 producer that writes 15,000 messages to a Kafka topic, and 1 consumer that reads those messages from that topic), and I have a stretched 3-DC Kafka cluster (the 3 DCs are located within the same city, so latency is low).

To imitate a 2-DC failure I simultaneously shut down 2 Kafkas (systemctl kill through Ansible), so I have only 1 Kafka up and running. I have acks=all, replication factor 3, and min.insync.replicas=3, so in theory all writes to Kafka should stop as soon as even 1 Kafka is down, but in my case my service keeps writing to Kafka with only 1 node alive!

Why does this happen?

Here is my /etc/kafka/server.properties:

zookeeper.connect=192.168.1.11:2181,192.168.1.12:2181,192.168.1.13:2181
log.dirs=/var/lib/kafka/data
broker.id=0
group.initial.rebalance.delay.ms=0
log.retention.check.interval.ms=30000
log.retention.hours=3
log.roll.hours=1
log.segment.bytes=1073741824
num.io.threads=16
num.network.threads=8
num.partitions=1
num.recovery.threads.per.data.dir=2
offsets.topic.replication.factor=3
socket.receive.buffer.bytes=1024000
socket.request.max.bytes=104857600
socket.send.buffer.bytes=1024000
transaction.state.log.min.isr=3
transaction.state.log.replication.factor=3
zookeeper.connection.timeout.ms=10000
delete.topic.enable=True
replica.fetch.max.bytes=5242880
max.message.bytes=5242880
message.max.bytes=5242880
default.replication.factor=3
min.insync.replicas=3
replica.fetch.wait.max.ms=200
replica.lag.time.max.ms=1000
advertised.listeners=PLAINTEXT://192.168.1.11:9092
unclean.leader.election=false
acks=all

hey @afyon

In your scenario,
min.insync.replicas=3
can't be satisfied with only one broker, so writes should be rejected.

From the docs:

A typical scenario would be to create a topic with a replication factor of 3, set min.insync.replicas to 2, and produce with acks of “all”. This will ensure that the producer raises an exception if a majority of replicas do not receive a write.
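If it helps, here is a minimal Go sketch of that setup using sarama's admin client (the broker address and topic name are just placeholders):

// import "github.com/Shopify/sarama"
conf := sarama.NewConfig()
conf.Version = sarama.V2_0_1_0 // admin requests need a recent protocol version
admin, err := sarama.NewClusterAdmin([]string{"192.168.1.11:9092"}, conf)
if err != nil {
	return err
}
defer admin.Close()

// Create a topic with replication factor 3 and min.insync.replicas=2,
// so acks=all still succeeds with one broker down but fails with two down.
minIsr := "2"
err = admin.CreateTopic("my-topic", &sarama.TopicDetail{
	NumPartitions:     1,
	ReplicationFactor: 3,
	ConfigEntries:     map[string]*string{"min.insync.replicas": &minIsr},
}, false)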

Hey @mmuehlbeyer
Yeah! But in my case, when I turn off 2 nodes (1 node alive), new messages still get written to Kafka! Why does this happen, given min.insync.replicas=3 and acks=all on the Kafka side?

ah got it now
did you check the topic configuration?

@mmuehlbeyer

kafka-topics --describe --zookeeper localhost:2181 --topic command
Topic: command    TopicId: -9JXMy1zTa-B_uy5PfOVxg PartitionCount: 1       ReplicationFactor: 3    Configs:
        Topic: command    Partition: 0    Leader: 2       Replicas: 2,1,0 Isr: 0,1,2

Here is the producer config:

conf := sarama.NewConfig()
conf.Version = sarama.V2_0_1_0                              // protocol version used to talk to the brokers
conf.Producer.Partitioner = sarama.NewRoundRobinPartitioner // spread messages across partitions
conf.Producer.RequiredAcks = sarama.WaitForAll              // acks=all: wait for all in-sync replicas
clientProducer, err := sarama.NewAsyncProducer(cfg.Kafka.Brokers, conf)
if err != nil {
	return nil, err
}
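One thing worth double-checking with the AsyncProducer: it only surfaces failed sends on its Errors() channel, so if that channel isn't drained the service can look like it is writing fine even when the broker rejects messages for not having enough in-sync replicas. A sketch of draining both feedback channels (the logging is just an example):

// Note: Return.Successes must be set before NewAsyncProducer is called;
// Return.Errors is true by default.
conf.Producer.Return.Successes = true
go func() {
	for perr := range clientProducer.Errors() {
		log.Printf("produce failed: %v", perr.Err) // e.g. not enough in-sync replicas
	}
}()
go func() {
	for range clientProducer.Successes() {
		// per-message acks arrive here; count them if needed
	}
}()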

@afyon

ok I see
though IMHO the two offline replicas should be listed as “Offline”:

kafka-topics --describe --bootstrap-server localhost:29091 --topic isr-test
Topic: isr-test	TopicId: 6--kLfhsTDSPq_MEMlEDTA	PartitionCount: 1	ReplicationFactor: 3	Configs: min.insync.replicas=3
	Topic: isr-test	Partition: 0	Leader: 3	Replicas: 3,1,2	Isr: 3,1,2	Offline:

After killing 2 brokers:

kafka-topics --describe --bootstrap-server localhost:29091 --topic isr-test
Topic: isr-test	TopicId: 6--kLfhsTDSPq_MEMlEDTA	PartitionCount: 1	ReplicationFactor: 3	Configs: min.insync.replicas=3
	Topic: isr-test	Partition: 0	Leader: 1	Replicas: 3,1,2	Isr: 1	Offline: 3,2
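(And with min.insync.replicas=3 but only one replica left in the ISR, a producer using acks=all should then get a NOT_ENOUGH_REPLICAS error back instead of successful writes.)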

best,
michael

@mmuehlbeyer
Sorry, I ran describe while the brokers were healthy.
Here it is with 2 dead brokers:

kafka-topics --describe --zookeeper localhost:2181 --topic command
Exception in thread "main" kafka.zookeeper.ZooKeeperClientTimeoutException: Timed out waiting for connection while in state: CONNECTING
	at kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:271)

But the service keeps writing messages to the 1 alive broker, and all 15,000 messages are there!

kafka-run-class kafka.tools.GetOffsetShell --broker-list 192.168.2.11:9092 --topic command | awk -F ":" '{sum += $3} END {print "Result: "sum}'
Result: 15000
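(The same check from Go with sarama, assuming the single-partition topic from above; the newest offset equals the message count when the log starts at offset 0:)

// Ask the surviving broker for the newest offset of partition 0.
client, err := sarama.NewClient([]string{"192.168.2.11:9092"}, sarama.NewConfig())
if err != nil {
	return err
}
defer client.Close()

newest, err := client.GetOffset("command", 0, sarama.OffsetNewest)
if err != nil {
	return err
}
fmt.Println("Result:", newest) // 15000 here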

hey @afyon
hmm, ZooKeeper should at least respond, I guess?
Are you shutting down everything?

Could you provide some details of your setup?
3 datacenters, right?
3 x Kafka?
How many ZooKeepers?

@mmuehlbeyer hey!

I have 3 datacenters located in the same city (within 20 km).
I have 3 VMs (Oracle Linux), with Kafka and ZooKeeper installed on each VM.

For the test case I run 1 producer and 1 consumer (the producer writes 15,000 messages to Kafka, and the consumer reads the messages from the topic and echoes them to its logs). To test a DC failure I start these services (the producer connects to Kafka and starts writing), then I kill ZooKeeper and Kafka on 2 servers so only 1 node stays alive. With acks=all and min.insync.replicas=3 this should stop writes to the topic, but that does not happen and the records are successfully put into Kafka.

ok I see
So basically I would recommend not running the ZooKeeper services on the Kafka nodes;
the ZooKeepers should then survive an outage of a Kafka node.

What would happen if you don't kill the ZooKeepers?

Nevertheless, could you check the following setting:
unclean.leader.election.enable

see “Kafka Unclean Leader Election” by Rob Golder | Lydtech Consulting | Medium.
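If it's easier, the effective value can also be read programmatically; a sketch with sarama's admin client (the address and topic name are placeholders):

// Look up unclean.leader.election.enable in the topic's effective config.
conf := sarama.NewConfig()
conf.Version = sarama.V2_0_1_0
admin, err := sarama.NewClusterAdmin([]string{"192.168.1.11:9092"}, conf)
if err != nil {
	return err
}
defer admin.Close()

entries, err := admin.DescribeConfig(sarama.ConfigResource{
	Type: sarama.TopicResource,
	Name: "command",
})
if err != nil {
	return err
}
for _, e := range entries {
	if e.Name == "unclean.leader.election.enable" {
		fmt.Println(e.Name, "=", e.Value)
	}
}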

best,
michael

Yeah! When 2 ZooKeepers stay alive, the messages won't go into the topic!
So I should run the ZooKeepers on separate VMs? Are 3 Kafka brokers and 3 ZooKeepers enough?
@mmuehlbeyer

hey @afyon
I would highly recommend running ZooKeeper on separate VMs :slight_smile:

There is a nice presentation available regarding the different options;
see slide 20 onward, especially the rack awareness might be useful :wink:
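For rack awareness, each broker gets a rack label in its server.properties, e.g. broker.rack=dc1 on the broker in datacenter 1 (and dc2 / dc3 accordingly; the labels are just placeholders); Kafka's rack-aware assignment then spreads a partition's replicas across the racks.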

Thank you so much, my friend!