What are reasonable SLOs for Kafka?

Opinions are my own…
These depend on the SLAs you are supporting with your SLIs. But here are a couple of core ones:

Active controller count - summed across all brokers this must equal 1. Anything else (0 or more than 1) means something is wrong.

Under replicated partitions - any value above zero is normally an early warning that something is about to go pear shaped. Depending on your setting for publish acks, publishers might also stop: with acks=all, writes are rejected once a partition's in-sync replica count drops below min.insync.replicas.

Leader elections - some of these are expected, for example during leader rebalances, but a sustained high rate might be an indication that you have network packet loss between brokers or a skewed distribution in partitioning.

Offline partitions - publishers and consumers cannot use these partitions while they are offline, normally because no in-sync replica is available to take over as leader. When you see offline partitions, things have already gone wrong, and message loss is imminent if no action is taken.

Bytes in - if you have constant semi predictable traffic, you can monitor band ranges here. A substantial spike or dip could indicate publisher failures.

Bytes out - same as above, a spike or dip could indicate consumer failure, consumer offset resets or even additional consumers connecting.

Both of the last two metrics can be tracked against your network bandwidth to indicate whether you need to start looking at quotas.
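To make the band idea concrete, here is a rough Python sketch. It assumes you already collect bytes in/out samples per broker somewhere; the function names, the 3-sigma band and the full-duplex assumption are all mine, not anything Kafka ships.

```python
from statistics import mean, stdev

def traffic_band(history, k=3.0):
    """Return a (low, high) band of k standard deviations around the mean.

    history: recent bytes-per-second samples (at least two values).
    k: band width in standard deviations; 3.0 is an arbitrary starting point.
    """
    mu = mean(history)
    sigma = stdev(history)
    return max(0.0, mu - k * sigma), mu + k * sigma

def outside_band(sample, history, k=3.0):
    """True if the latest sample is a spike or dip relative to the band."""
    low, high = traffic_band(history, k)
    return sample < low or sample > high

def link_utilisation(bytes_in_per_sec, bytes_out_per_sec, link_bits_per_sec):
    """Fraction of the link used by the busier direction.

    Assumes a full-duplex link and ignores replication traffic,
    so treat the result as a lower bound.
    """
    busiest = max(bytes_in_per_sec, bytes_out_per_sec)
    return busiest * 8 / link_bits_per_sec
```

If `link_utilisation` sits close to 1.0 during normal traffic, that is roughly the point where quotas become worth a look. The band only makes sense if your traffic really is fairly steady; bursty workloads would need something smarter.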

There are bunches more to monitor, and the Confluent docs have some neat sections on available metrics as well.

All these are JMX. Hope this helps
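For reference, a minimal Python sketch of the controller, under replicated and offline partition checks from the list above. The MBean names are the ones Kafka exposes over JMX; the `check_cluster` function and the shape of its input are hypothetical, and how you scrape JMX into that shape is left out on purpose.

```python
ACTIVE_CONTROLLER = "kafka.controller:type=KafkaController,name=ActiveControllerCount"
UNDER_REPLICATED = "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"
OFFLINE_PARTITIONS = "kafka.controller:type=KafkaController,name=OfflinePartitionsCount"

def check_cluster(per_broker):
    """Evaluate the core checks over already-scraped metric values.

    per_broker: {broker_id: {mbean_name: value}} (hypothetical input shape).
    Returns a list of human-readable alert strings, empty if all is well.
    """
    alerts = []
    brokers = list(per_broker.values())

    # Exactly one broker in the cluster should report itself as controller.
    controllers = sum(m.get(ACTIVE_CONTROLLER, 0) for m in brokers)
    if controllers != 1:
        alerts.append(f"active controllers = {controllers}, expected 1")

    # Each broker reports the under replicated partitions it leads.
    under_replicated = sum(m.get(UNDER_REPLICATED, 0) for m in brokers)
    if under_replicated > 0:
        alerts.append(f"under replicated partitions = {under_replicated}")

    # Take the max rather than the sum to avoid double counting
    # if more than one broker reports the same cluster-wide figure.
    offline = max((m.get(OFFLINE_PARTITIONS, 0) for m in brokers), default=0)
    if offline > 0:
        alerts.append(f"offline partitions = {offline}")

    return alerts
```

Something like `check_cluster({1: {ACTIVE_CONTROLLER: 1}, 2: {UNDER_REPLICATED: 4}})` would return a single alert about the four under replicated partitions. Leader election rate and bytes in/out are rates rather than states, so they need thresholds tuned to your own cluster and are not covered here.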
