As a platform team, I wonder what reasonable SLOs/SLIs for Apache Kafka are and, in case they are not directly exposed as JMX metrics, how can I monitor them?
Opinions are my own…
These depend on the SLAs you are supporting with your SLIs. But here are a couple of core ones:
Controller count - the active controller count across the cluster must equal exactly 1. Zero means nothing is acting as controller; more than one means something is wrong.
Under replicated partitions - a count greater than zero is normally an early warning that something is about to go pear shaped. Depending on your setting for publish acks, this might mean that some publishers also stop: with acks=all, writes are rejected once a partition's in-sync replica count drops below min.insync.replicas.
Leader elections - these can happen during normal leader rebalancing, but a lot of them might be an indication that you have network packet loss between brokers or a skewed partition distribution.
Offline partitions - publishers and consumers will be offline for said partitions at this time, normally because no in-sync replicas are available. When you see offline partitions, things have already gone wrong, and message loss is imminent if no action is taken.
Bytes in - if you have constant semi predictable traffic, you can monitor band ranges here. A substantial spike or dip could indicate publisher failures.
Bytes out - same as above, a spike or dip could indicate consumer failure, consumer offset resets or even additional consumers connecting.
Both of the last two metrics can be measured together with your network bandwidth to indicate whether you need to start looking at quotas.
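To make the "band ranges" idea a bit more concrete, here is a minimal Python sketch. It assumes you already have recent bytes-in or bytes-out samples from somewhere (your metrics store, for example). The function names, the mean plus or minus k standard deviations band, and the full-duplex link assumption are all illustrative choices of mine, not anything Kafka provides, and the utilisation figure ignores replication traffic.

```python
from statistics import mean, stdev


def throughput_band(history, k=3.0):
    """Return a (low, high) band around recent throughput samples.

    Assumption: the band is mean +/- k sample standard deviations.
    `history` needs at least two samples for stdev() to work.
    """
    mu = mean(history)
    sigma = stdev(history)
    return max(0.0, mu - k * sigma), mu + k * sigma


def out_of_band(sample, history, k=3.0):
    """True if the latest sample is a spike or dip relative to the band."""
    low, high = throughput_band(history, k)
    return sample < low or sample > high


def link_utilisation(bytes_in_per_sec, bytes_out_per_sec, link_bytes_per_sec):
    """Fraction of link capacity used by the busier direction.

    Assumption: a full-duplex link, so inbound and outbound traffic
    do not compete with each other. Replication traffic is not counted.
    """
    return max(bytes_in_per_sec, bytes_out_per_sec) / link_bytes_per_sec
```

If utilisation sits persistently close to 1.0 while the samples stay inside their band, that is the sort of signal that would make me start looking at quotas.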
Bunches more to monitor and the Confluent docs have some neat sections on available metrics as well.
All of these are exposed as JMX metrics. Hope this helps.
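For reference, a rough Python sketch of how the checks above could be wired together. The MBean names are the broker metrics as I know them from the Kafka documentation; everything else (the dictionary keys, the `evaluate_cluster` function, the alert wording) is hypothetical, and how you scrape the values out of JMX is left to whatever collector you already use.

```python
# Broker MBeans behind the metrics discussed above.
MBEANS = {
    "active_controllers": "kafka.controller:type=KafkaController,name=ActiveControllerCount",
    "under_replicated": "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions",
    "offline_partitions": "kafka.controller:type=KafkaController,name=OfflinePartitionsCount",
    "leader_elections": "kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs",
    "bytes_in": "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec",
    "bytes_out": "kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec",
}


def evaluate_cluster(per_broker):
    """Turn per-broker gauge values into a list of alert strings.

    `per_broker` maps a broker id to a dict with the keys
    "active_controllers", "under_replicated" and "offline_partitions"
    (hypothetical shape; adapt it to your collector's output).
    """
    alerts = []

    # Exactly one broker in the whole cluster should report itself as controller.
    controllers = sum(m["active_controllers"] for m in per_broker.values())
    if controllers != 1:
        alerts.append(f"critical: active controller count is {controllers}, expected 1")

    # Offline partitions mean producers and consumers are already affected.
    offline = sum(m["offline_partitions"] for m in per_broker.values())
    if offline > 0:
        alerts.append(f"critical: {offline} offline partitions")

    # Under-replicated partitions are the early warning, reported per broker.
    for broker_id, m in sorted(per_broker.items()):
        if m["under_replicated"] > 0:
            alerts.append(
                f"warning: broker {broker_id} has {m['under_replicated']} under-replicated partitions"
            )

    return alerts
```

Calling `evaluate_cluster` with one entry per broker returns an empty list when the cluster looks healthy, which makes it easy to hang off whatever alerting you already have.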
What do you find are the best tools for keeping an eye on the Brokers?
For example, Influx & Grafana. Or does some other combination of tools work better?
(I realise that answers are likely to be rather subjective and affected by pre-existing infrastructure.)
@roadSurfer This is one of those opinionated ones. There are various good combos out there, but as you asked, Grafana and Influx work beautifully. It is also easy to monitor multiple clusters once you have your core metrics set up. If you have support, obviously Control Center is awesome to monitor for lag, check broker health, check distribution, topic sizes, partition skew etc. With Grafana, it’s also easy to combine with your other metrics to have a combined view on host health, networking etc.