What are reasonable SLOs for Kafka?

weeco · 3 February 2021 20:49

As a platform team, I wonder what reasonable SLOs/SLIs for Apache Kafka are and, in case they are not directly exposed as JMX metrics, how I can monitor them?

Charlla · 3 February 2021 21:08

Opinions are my own…
These depend on the SLAs you are supporting with your SLIs. But here are a couple of core ones:

Active controller count - must equal 1 across the cluster, else something is wrong

Under replicated partitions - a count greater than zero is normally an early warning that something is about to go pear shaped. Depending on your setting for publish acks, this might mean that some publishers also stop, if the number of in-sync replicas drops below the configured min ISR.

Leader elections - these might happen due to rebalances, but a lot of them might indicate network packet loss between brokers or a skewed partition distribution.

Offline partitions - publishers and consumers will be offline for said partitions at this time, normally because no ISR is available. When you see offline partitions, things have already gone wrong, and message loss is imminent if no action is taken.

Bytes in - if you have constant semi predictable traffic, you can monitor band ranges here. A substantial spike or dip could indicate publisher failures.

Bytes out - same as above, a spike or dip could indicate consumer failure, consumer offset resets or even additional consumers connecting.

The last two metrics can be measured together with your network bandwidth to indicate if you need to start looking at quotas.
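To make the band idea concrete, here is a rough sketch of how a bytes in or bytes out sample could be compared against recent history and against link capacity. The function names, the three-standard-deviation band and the full-duplex assumption are all illustrative choices, not anything Kafka prescribes; the rates are assumed to be in bytes per second and the link capacity in bits per second.

```python
from statistics import mean, pstdev
from typing import Sequence


def outside_band(current: float, history: Sequence[float], k: float = 3.0) -> bool:
    """True if `current` is more than k standard deviations from the mean of `history`.

    `history` must contain at least one sample. The band width k is an
    assumption; tune it to how predictable your traffic really is.
    """
    centre = mean(history)
    spread = pstdev(history)
    return abs(current - centre) > k * spread


def link_utilisation(bytes_in_per_sec: float, bytes_out_per_sec: float,
                     link_bits_per_sec: float) -> float:
    """Fraction of link capacity used by the busier direction.

    Assumes a full-duplex link, so inbound and outbound traffic are
    compared against the capacity separately. Bytes are converted to bits.
    """
    busiest = max(bytes_in_per_sec, bytes_out_per_sec)
    return busiest * 8 / link_bits_per_sec
```

A spike or a dip both fall outside the band, which matches the point that either one can signal a problem. Sustained high utilisation is the hint to start thinking about quotas.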

Bunches more to monitor and the Confluent docs have some neat sections on available metrics as well.

All these are JMX. Hope this helps
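As a rough illustration of the first few checks, the sketch below evaluates a snapshot of values that has already been scraped from each broker. How you scrape JMX is out of scope here; the input shape (broker id mapped to MBean name mapped to value) and the function name are assumptions for the example. The MBean names are the ones I understand the Apache Kafka monitoring docs to list, so verify them against your broker version.

```python
from typing import Dict, List

# MBean names as listed in the Apache Kafka monitoring docs (verify for your version).
ACTIVE_CONTROLLER = "kafka.controller:type=KafkaController,name=ActiveControllerCount"
OFFLINE_PARTITIONS = "kafka.controller:type=KafkaController,name=OfflinePartitionsCount"
UNDER_REPLICATED = "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"

# Hypothetical input shape: {broker_id: {mbean_name: value}}
Snapshot = Dict[str, Dict[str, int]]


def check_cluster(snapshots: Snapshot) -> List[str]:
    """Return alert strings for one cluster-wide snapshot of broker metrics."""
    alerts: List[str] = []

    # Exactly one broker in the cluster should report itself as active controller.
    controllers = sum(m.get(ACTIVE_CONTROLLER, 0) for m in snapshots.values())
    if controllers != 1:
        alerts.append(f"CRITICAL: {controllers} active controllers, expected exactly 1")

    # Take the largest value any broker reports for offline partitions.
    offline = max((m.get(OFFLINE_PARTITIONS, 0) for m in snapshots.values()), default=0)
    if offline > 0:
        alerts.append(f"CRITICAL: {offline} offline partitions")

    # Under replicated partitions are summed across brokers as an early warning.
    under_replicated = sum(m.get(UNDER_REPLICATED, 0) for m in snapshots.values())
    if under_replicated > 0:
        alerts.append(f"WARNING: {under_replicated} under replicated partitions")

    return alerts
```

An empty list means none of the three checks fired. The severity labels are only a suggestion that mirrors the ordering above: offline partitions mean things have already gone wrong, while under replicated partitions are the early warning.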

roadSurfer · 4 February 2021 12:05

What do you find are the best tools for keeping an eye on the Brokers?

For example, Influx & Grafana. Or does some other combination of tools work better?

(I realise that answers are likely to be rather subjective and affected by pre-existing infrastructure.)

Charlla · 9 February 2021 21:45

@roadSurfer This is one of those opinionated ones. There are various good combos out there, but as you asked, Grafana and Influx work beautifully. It is also easy to monitor multiple clusters once you have your core metrics set up. If you have support, obviously Control Center is awesome to monitor for lag, check broker health, check distribution, topic sizes, partition skew etc. With Grafana, it’s also easy to combine with your other metrics to have a combined view on host health, networking etc.
