Monitoring Strategy

Hi,

We are relatively new to Confluent Cloud.
What was your approach to achieving observability?
Have you connected it to your Ops monitoring tools (we don’t have DataDog, Slack, PagerDuty)?
How do you monitor replication between clusters and consumer lag?

Regards,

Sergey

The list of metrics we expose is here.
It’s a relatively small list because Confluent Cloud is a fully managed service, so you shouldn’t need more than that. If there’s a problem with replication, that’s our problem to solve, not yours.
I think lag is a client-side metric, but I may be wrong.

Hi Benjamin,

Thank you for your answer!

The Metrics API is nice, but it is not complete. I’d like to get an alert when consumer lag exceeds a threshold.

As far as I know, replication between clusters in different regions is not yet part of the Confluent Cloud offering. I hope it will be available soon. So I need to monitor my replicators. Some connectors are not available as a managed offering either, so I need to monitor their health as well. And I need to deploy C3 to manage the Connect clusters, so I need to monitor C3 too.

Regards,

Sergey

Ah, I see.
Replication (which uses cluster linking) is in preview. Getting C3 etc. should be pretty simple if you email info@confluent dot io. If you hit any friction, let me know.
Update: it looks like if you have a cloud commit you get this option automatically.

Hi Benjamin,

We have C3 installed and configured with alerts such as consumer lag. However, it turned out that C3 does not retain alert history, so we cannot query it through the C3 API in order to integrate alerts into our monitoring system.
You suggested monitoring consumer lag from the client, but in that case we would need to implement a Java client that uses the kafka.admin.ConsumerGroupCommand Scala library and deploy this client on some reliable Java service hosting platform.
This would be a development project in its own right, and our research suggests it is not good practice for monitoring because it does not scale well.
There is also a security concern, because the monitoring Java client would need read access not just to the consumer group but also to all topics.
Any other ideas on monitoring consumer lag and CCloud availability? We need to monitor cluster availability because we need to implement the DR switch between regions. Confluent will probably notify us if the whole region is down, but that notification is by email only and can’t be used in our automated pipelines.
Maybe we can construct some alerts from the Metrics API with a smart query?

Regards,

Igor

There are a few options; I’d recommend taking a look at this post.

Please also note that consumer lag is not exactly a metric per se on the server side in Kafka generally (at the moment). That means it needs a little bit of special treatment. There’s some discussion about this in the Metrics API FAQ.
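
To make the "special treatment" concrete: lag is usually derived by comparing each partition's log end offset with the group's committed offset. The following is only a minimal sketch of that arithmetic in Python, not a client for any particular API. The function names and the sample offsets are made up for illustration, and it assumes you have already fetched both sets of offsets with whatever Kafka client or admin tool you use.

```python
from typing import Dict, List, Tuple

Partition = Tuple[str, int]  # (topic name, partition number)


def compute_lag(end_offsets: Dict[Partition, int],
                committed: Dict[Partition, int]) -> Dict[Partition, int]:
    """Per-partition lag = log end offset - committed offset."""
    lag = {}
    for tp, end in end_offsets.items():
        # Simplifying assumption: a partition with no committed offset is
        # treated as if the group had consumed nothing (offset 0).
        done = committed.get(tp, 0)
        # Clamp at zero in case the two snapshots were taken at slightly
        # different times and the commit appears ahead of the end offset.
        lag[tp] = max(end - done, 0)
    return lag


def partitions_over_threshold(lag: Dict[Partition, int],
                              threshold: int) -> List[Partition]:
    """Return the partitions whose lag is strictly above the threshold."""
    return sorted(tp for tp, n in lag.items() if n > threshold)


# Hypothetical offsets for a topic called "orders".
end_offsets = {("orders", 0): 1500, ("orders", 1): 900}
committed = {("orders", 0): 1480, ("orders", 1): 200}

lag = compute_lag(end_offsets, committed)
print(partitions_over_threshold(lag, threshold=100))  # [('orders', 1)]
```

Whatever collects the offsets needs permission to describe the consumer group and the topics it reads, which is the access concern raised earlier in the thread.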

To implement a DR switch you’ll need something that monitors liveness of the system. Any producer/consumer could do that; it doesn’t need to be C3. So you might create a liveness topic that you write to and read from periodically.
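
A rough sketch of that liveness-topic idea in Python follows. It is an illustration under assumptions, not a finished design: the LivenessProbe class, its parameters, and the three-failure rule are hypothetical, and send and receive stand in for a real producer and consumer bound to the liveness topic. An in-memory queue plays the role of the topic so the example runs on its own.

```python
import collections
import time
import uuid
from typing import Callable, Optional


class LivenessProbe:
    """Writes a heartbeat, reads it back, and counts consecutive failures."""

    def __init__(self,
                 send: Callable[[str], None],
                 receive: Callable[[float], Optional[str]],
                 timeout_s: float = 5.0,
                 max_failures: int = 3) -> None:
        self.send = send              # hypothetical: produce to the liveness topic
        self.receive = receive        # hypothetical: poll the topic, None if nothing
        self.timeout_s = timeout_s
        self.max_failures = max_failures
        self.failures = 0

    def check_once(self) -> bool:
        """Return True if this round's heartbeat made the full round trip."""
        token = uuid.uuid4().hex
        ok = False
        try:
            self.send(token)
            deadline = time.monotonic() + self.timeout_s
            while True:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                if self.receive(remaining) == token:
                    ok = True
                    break
                # Anything else (None or a stale heartbeat) is ignored;
                # keep polling until the deadline.
        except Exception:
            ok = False  # any client error counts as a failed round
        self.failures = 0 if ok else self.failures + 1
        return ok

    def should_fail_over(self) -> bool:
        """True once max_failures consecutive rounds have failed."""
        return self.failures >= self.max_failures


# In-memory stand-in for the liveness topic.
topic = collections.deque()
probe = LivenessProbe(send=topic.append,
                      receive=lambda timeout: topic.popleft() if topic else None,
                      timeout_s=0.05)
print(probe.check_once(), probe.should_fail_over())
```

Requiring several consecutive failures before switching is one way to avoid failing over on a single slow round trip; the right count and timeout depend on how quickly the DR switch has to react.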

Thank you, Dustin. We looked at all this material, and because we want automated monitoring (not just visual dashboards), it looks like client interceptors are potentially our best bet.
We still need to test whether alerts can be created in C3 from these metrics and whether we can integrate with them via the C3 alert history API.
Does anybody have experience with getting alert history from C3 that is used with CCloud? So far we have observed that triggered alerts are not retained in C3 history.

I’ve documented in this table what you can expect to work, or not, with C3 connected to Confluent Cloud. Let me know if that helps.

Thank you for this info! This table definitely helps us understand which alerts stay in history and which do not. That was not clear from the documentation.
We are planning to add monitoring interceptors and query the alert history API for the four metrics that are retained in C3 alert history according to your table.
If this works, it will be a good start for our monitoring strategy.

We tried monitoring interceptors for creating alerts on consumer lag, as Vincent suggested. Unfortunately, it didn’t work with CCloud for us.
Interceptors worked for consumers connected to a self-managed cluster: alerts on consumer latency were triggered in C3 and were retained in alert history.
However, the same consumer did not trigger any alerts when connected to CCloud. In this case C3 was connected to CCloud as well.
Are we missing something that needs to be set up in CCloud for the interceptors to work?

Since Confluent Cloud is still Kafka as a service in the end, you could use any Prometheus exporter that exports Kafka metrics (e.g. consumer group lags). I wrote KMinion, which talks to Kafka clusters via the API, and I’m pretty sure it’s compatible with Confluent Cloud too. It is on GitHub as redpanda-data/kminion: a feature-rich Prometheus exporter for Apache Kafka written in Go, lightweight and highly configurable.