🎧 Handling 2 Million Apache Kafka Messages Per Second at Honeycomb

There’s a new Streaming Audio episode - check it out!

How many messages can Apache Kafka® process per second? At Honeycomb, it's easily over one million messages per second.

In this episode, get a taste of how Honeycomb uses Kafka at massive scale. Liz Fong-Jones (Principal Developer Advocate, Honeycomb) explains how Honeycomb manages its Kafka-based telemetry ingestion pipelines and scales its Kafka clusters.

And what is Honeycomb? Honeycomb is an observability platform that helps you visualize, analyze, and improve cloud application quality and performance. Their data volume has grown by a factor of 10 throughout the pandemic, while the total cost of ownership has only gone up by 20%.

But how, you ask? As a developer advocate for site reliability engineering (SRE) and observability, Liz works alongside the platform engineering team to optimize infrastructure for reliability and cost. Two years ago, the team faced the prospect of growing from 20 Kafka brokers to 200 as data volume increased. The challenge was to scale the cluster, and reshuffle data across a growing number of brokers, without losing cost efficiency.

The Honeycomb engineering team experimented with storing the bulk of longer-term archives on sc1 or st1 EBS hard disks and keeping only the latest hours of data on NVMe instance storage. This approach to cost reduction fell short, however, and the team ended up keeping data older than 24 hours on SSD after all. They then adopted Zstandard compression to reduce network bandwidth and disk footprint, but the clusters still struggled to keep up.
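For a concrete sense of what the compression change involves: zstd is a producer-side (or topic-level) setting available in Kafka 2.1 and later. Here is a minimal sketch of a zstd-configured producer; the bootstrap address, topic name, and batching values are illustrative placeholders, not configuration from the episode.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

import java.util.Properties;

public class ZstdProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder address -- not Honeycomb's brokers.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

        // Zstandard compression (Kafka 2.1+): batches are compressed once on the
        // producer, cutting both network bandwidth and broker disk usage.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "zstd");

        // Larger, slightly delayed batches give the compressor more to work with.
        // These numbers are illustrative, not tuned values from the episode.
        props.put(ProducerConfig.LINGER_MS_CONFIG, "20");
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "262144"); // 256 KiB

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            // "telemetry-events" is a hypothetical topic name.
            producer.send(new ProducerRecord<>("telemetry-events", "sample event".getBytes()));
        }
    }
}
```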

When Confluent Platform 6.0 rolled out Tiered Storage, the team saw it as a way to break free of being storage bound. Before bringing the feature into production, they ran a proof of concept, which built confidence as they watched Kafka tolerate broker death and saw reduced latencies when fetching historical data. Tiered Storage now shrinks their clusters significantly: only the freshest data lives on local NVMe SSD, and tiered data is stored once on Amazon S3 rather than consuming SSD on every replica. In combination with AWS Im4gn instances, Tiered Storage lets the team scale for long-term growth.
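To make the Tiered Storage setup concrete, here is a minimal sketch of creating a topic with Confluent Tiered Storage enabled via the Kafka admin client. It assumes brokers already running Confluent Platform 6.0+ with Tiered Storage and an S3 backend configured in server.properties; the topic name, sizing, and 24-hour hotset value are illustrative, not Honeycomb's actual settings.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class TieredTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder address -- not Honeycomb's brokers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Hypothetical topic name, partition count, and replication factor.
            NewTopic topic = new NewTopic("telemetry-events", 64, (short) 3)
                    .configs(Map.of(
                            // Opt this topic into Tiered Storage (requires
                            // confluent.tier.feature=true and an S3 backend on the brokers).
                            "confluent.tier.enable", "true",
                            // Keep roughly the last 24 hours "hot" on local NVMe SSD;
                            // older segments are served from S3.
                            "confluent.tier.local.hotset.ms", "86400000",
                            // Store segments zstd-compressed on disk and in S3.
                            "compression.type", "zstd"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```

With a topic configured this way, consumers reading historical data transparently pull older segments from S3, which is the behavior the team validated during their proof of concept.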

Honeycomb also cut the cost per megabyte of Kafka throughput by 87% through these cluster optimizations.

EPISODE LINKS


🎧 Listen to the episode