What are reasonable SLOs for Kafka?

Charlla · 3 February 2021 21:08

Opinions are my own…
These depend on the SLAs you are supporting and the SLIs you measure them with. But here are a few core ones:

Controller count - must equal 1 else something is wrong

Under replicated partitions - should be zero. Any value greater than zero is normally an early warning that something is about to go pear shaped. Depending on your producer acks setting, some publishers might also stop: with acks=all, writes to a partition are rejected once its in-sync replica count drops below min.insync.replicas.

Leader elections - Some of these are expected, for example due to rebalances, but a lot of them might be an indication that you have network packet loss between brokers or a skewed distribution in partitioning.

Offline partitions - publishers and consumers will be offline for said partitions at this time, normally because no in-sync replicas (ISR) are available. When you see Offline Partitions, things have already gone wrong, and message loss is imminent if no action is taken.

Bytes in - if you have constant, semi-predictable traffic, you can monitor band ranges here. A substantial spike or dip could indicate publisher failures.

Bytes out - same as above, a spike or dip could indicate consumer failure, consumer offset resets or even additional consumers connecting.

The last two metrics can also be compared against your network bandwidth to indicate whether you need to start looking at quotas.
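To make the band range idea concrete, here is a minimal sketch in Python. The names (Band, classify_throughput, link_utilisation) and all the numbers are made up for illustration, and it assumes a full-duplex link where inbound and outbound traffic do not share capacity. Your own bands should come from your observed traffic.

```python
from dataclasses import dataclass


@dataclass
class Band:
    """Expected throughput range in bytes per second (hypothetical values)."""
    low: float
    high: float


def classify_throughput(bytes_per_sec: float, band: Band) -> str:
    """Return 'dip', 'spike' or 'ok' for one bytes in / bytes out sample."""
    if bytes_per_sec < band.low:
        return "dip"
    if bytes_per_sec > band.high:
        return "spike"
    return "ok"


def link_utilisation(bytes_in_per_sec: float,
                     bytes_out_per_sec: float,
                     link_bits_per_sec: float) -> float:
    """Fraction of the link used by the busier direction.

    Assumes a full-duplex link, so each direction is compared
    against the full link speed separately.
    """
    busiest = max(bytes_in_per_sec, bytes_out_per_sec)
    return busiest * 8 / link_bits_per_sec


if __name__ == "__main__":
    band = Band(low=10e6, high=50e6)           # assumed normal range
    print(classify_throughput(5e6, band))      # below the band
    print(link_utilisation(50e6, 100e6, 1e9))  # against an assumed 1 Gbit/s link
```

If the utilisation figure keeps creeping towards whatever threshold you are comfortable with, that is the point where quotas are worth a look.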

There are bunches more to monitor, and the Confluent docs have some neat sections on available metrics as well.

All of these are JMX metrics. Hope this helps.
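For reference, a rough sketch of how the hard checks above could be wired up. The MBean names are the standard broker ones. Everything else (the dictionary keys, the check_cluster function, the shape of the samples) is my own assumption, and it presumes you already scrape the values per broker with whatever JMX exporter you use.

```python
from typing import Dict, List

# Standard Kafka broker MBeans for the metrics discussed in this post.
CORE_MBEANS = {
    "active_controller_count": "kafka.controller:type=KafkaController,name=ActiveControllerCount",
    "offline_partitions": "kafka.controller:type=KafkaController,name=OfflinePartitionsCount",
    "under_replicated_partitions": "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions",
    "leader_election_rate": "kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs",
    "bytes_in": "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec",
    "bytes_out": "kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec",
}


def check_cluster(samples: Dict[str, Dict[str, float]]) -> List[str]:
    """Check the hard invariants across the whole cluster.

    `samples` maps a broker id to {metric key: value}, using the
    keys of CORE_MBEANS (a hypothetical layout for this sketch).
    Returns a list of human readable problems, empty if all is well.
    """
    problems = []

    # Exactly one broker should report itself as the active controller.
    controllers = sum(s.get("active_controller_count", 0) for s in samples.values())
    if controllers != 1:
        problems.append(f"active controller count is {controllers}, expected 1")

    # Any under replicated partition is an early warning.
    urp = sum(s.get("under_replicated_partitions", 0) for s in samples.values())
    if urp > 0:
        problems.append(f"{urp} under replicated partitions")

    # Offline partitions mean things have already gone wrong.
    offline = sum(s.get("offline_partitions", 0) for s in samples.values())
    if offline > 0:
        problems.append(f"{offline} offline partitions")

    return problems
```

Leader election rate and bytes in/out are left out of the function on purpose, since those need a baseline for your own cluster rather than a fixed threshold.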
