Kafka - What is the right value for replica.lag.time.max.ms?

We are receiving a lot of Kafka alerts about topics being under-replicated. We think these are not real issues because the alerts keep flapping on and off. This may be caused by having too tight a value for replica.lag.time.max.ms. This setting controls when a replica is considered out of sync and is therefore removed from the in-sync replica (ISR) list.
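
For context, replica.lag.time.max.ms is a broker-level setting in server.properties. A minimal sketch (the 30000 ms shown is just the shipped default, not our current value):

```
# server.properties (broker config)
# If a follower has not sent a fetch request, or has not caught up to
# the leader's log end offset, within this window, the leader removes
# it from the ISR. The shipped default is 30000 ms (30 s).
replica.lag.time.max.ms=30000
```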

We could relax this value and receive fewer alerts, but how do we make sure this doesn't end up hiding real problems?

Is there an expected normal number of these alerts we can target? Or are there other metrics we can use to assess the health of our replicas after relaxing the setting?

hey @gsolano777

what is the current value of replica.lag.time.max.ms?

what does your setup look like?
any firewalls etc. in between?

in my experience the default (30s) worked quite well in most cases.

best,
michael


Welcome @gsolano777.

replica.lag.time.max.ms can affect durability, since it extends the window between "we think this replica is good" and "it's not good, remove it from the ISR."
So when folks start thinking about increasing the value, I always ask: "Why do you believe that will solve your issue?"

Most of the time when I've had to increase the value, it was because the cluster was heavily used and/or had slow network links between the nodes (VMware comes into play here a lot, for some reason). Could you describe your cluster deployment a bit? How many nodes? Stats for kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent and kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent would be great too.
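
If it helps, one way to sample those MBeans is the JmxTool that ships with Kafka. A sketch, assuming the broker exposes JMX on port 9999 (set via the JMX_PORT env var at startup) and is reachable as broker-host:

```
bin/kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi \
  --object-name 'kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent' \
  --reporting-interval 5000
```

Run it again with the kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent object name for the second metric. Both idle percentages sitting near 0 usually points at an overloaded broker rather than a lag-threshold problem.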

Ideally the number of ISR shrinks and expands should be 0. Anything greater than 0 and you're risking durability and creating spiky cluster load that can affect SLAs.
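
The churn itself shows up per broker in kafka.server:type=ReplicaManager,name=IsrShrinksPerSec and IsrExpandsPerSec, which you can query the same way as the idle-percent metrics above. You can also spot-check the alerts from the CLI. A sketch, assuming a broker reachable at localhost:9092:

```
# Lists partitions whose ISR is currently smaller than the replica set,
# i.e. exactly what the under-replicated alerts fire on.
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions
```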


Our team was informed that we may be using too few replica fetcher threads, which could be slowing down our replication. We're going to try increasing num.replica.fetchers first.
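
For reference, that is another broker-level setting. A sketch of what we plan to try (the value 4 is illustrative, not a tested recommendation):

```
# server.properties (broker config)
# Number of fetcher threads each broker uses to replicate from each
# source broker; the shipped default is 1. More threads can help when
# follower fetching, rather than the network, is the bottleneck.
num.replica.fetchers=4
```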