We are receiving a lot of Kafka alerts about topics being under-replicated. We think these are not real issues because the alerts keep flapping on and off. This may be caused by a tight value for replica.lag.time.max.ms. This setting controls when a replica is considered out of sync and is therefore removed from the in-sync replicas (ISR) list.
We could relax this value and receive fewer alerts, but how do we make sure that doesn't turn into hiding real problems?
Is there an expected normal number of these alerts we can target? Or are there other metrics we can use to assess the health of our replicas after relaxing the setting?
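In case the current value matters for the discussion, this is roughly how we check it on a broker, a minimal sketch with the Kafka Java AdminClient (the bootstrap address and the broker id "1" are placeholders for our setup):

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class CheckReplicaLagConfig {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address; point this at one of your brokers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Broker id "1" is a placeholder; repeat for each broker you care about.
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "1");
            Config config = admin.describeConfigs(Collections.singleton(broker))
                    .all().get().get(broker);
            // Prints the effective value, whether it comes from server.properties,
            // a dynamic override, or the default.
            System.out.println("replica.lag.time.max.ms = "
                    + config.get("replica.lag.time.max.ms").value());
        }
    }
}
```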
hey @gsolano777
what is the current value of replica.lag.time.max.ms?
what does your setup look like?
any firewalls etc. in between?
in my experience the default (30s) worked quite well in most cases.
best,
michael
Welcome @gsolano777.
replica.lag.time.max.ms can affect durability, since it extends the window between "we think this replica is good" and "it's not good, remove it from the ISR."
So when folks start thinking about increasing the value, I always ask, "Why do you believe that will solve your issue?"
Most of the times I've had to increase the value, it was because the cluster was heavily used and/or had slow network links between the nodes (VMware comes into play here a lot, for some reason). Could you describe your cluster deployment a bit? How many nodes? Stats for kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent and kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent would be great too.
Ideally the number of ISR shrinks and expands should be 0. Anything greater than 0 and you're risking durability and creating spiky cluster load that can affect SLAs.
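If it helps, here's a rough sketch of how those numbers can be pulled over JMX with plain Java. It assumes the brokers expose JMX (e.g. started with JMX_PORT=9999) and that the ISR churn meters live under kafka.server:type=ReplicaManager; the attribute names (Value vs. OneMinuteRate) depend on whether the metric is a gauge or a meter, so adjust as needed:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerHealthSnapshot {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port; assumes the broker was started with JMX enabled.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();

            // Request handler idle ratio is a meter; the network processor idle
            // ratio is a gauge. Values close to 1.0 mean the pool is mostly idle.
            Object handlerIdle = mbsc.getAttribute(new ObjectName(
                    "kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent"),
                    "OneMinuteRate");
            Object networkIdle = mbsc.getAttribute(new ObjectName(
                    "kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent"),
                    "Value");

            // ISR churn: ideally both rates sit at 0.
            Object isrShrinks = mbsc.getAttribute(new ObjectName(
                    "kafka.server:type=ReplicaManager,name=IsrShrinksPerSec"),
                    "OneMinuteRate");
            Object isrExpands = mbsc.getAttribute(new ObjectName(
                    "kafka.server:type=ReplicaManager,name=IsrExpandsPerSec"),
                    "OneMinuteRate");

            System.out.println("RequestHandlerAvgIdlePercent (1m rate): " + handlerIdle);
            System.out.println("NetworkProcessorAvgIdlePercent: " + networkIdle);
            System.out.println("IsrShrinksPerSec (1m rate): " + isrShrinks);
            System.out.println("IsrExpandsPerSec (1m rate): " + isrExpands);
        }
    }
}
```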
Our team was told that we may be using too few replica fetcher threads, which could be slowing down our replication. We're going to start by increasing num.replica.fetchers.
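For reference, this is roughly what we have in mind, a sketch with the Java AdminClient (the fetcher count of 4 and the bootstrap address are placeholders; as far as we can tell, num.replica.fetchers is a cluster-wide dynamic broker config, so the empty broker id targets the cluster default):

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class BumpReplicaFetchers {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address; point this at one of your brokers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Empty broker id = cluster-wide default (same idea as --entity-default
            // with kafka-configs.sh). The value 4 is just an example.
            ConfigResource cluster = new ConfigResource(ConfigResource.Type.BROKER, "");
            AlterConfigOp bump = new AlterConfigOp(
                    new ConfigEntry("num.replica.fetchers", "4"),
                    AlterConfigOp.OpType.SET);
            Map<ConfigResource, Collection<AlterConfigOp>> updates =
                    Collections.singletonMap(cluster, Collections.singletonList(bump));
            admin.incrementalAlterConfigs(updates).all().get();
            System.out.println("num.replica.fetchers updated cluster-wide");
        }
    }
}
```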