🎧 Optimizing Cloud-Native Apache Kafka Performance ft. Alok Nikhil and Adithya Chandra

alice.richardson · 20 January 2022 08:21

There’s a new Streaming Audio episode - check it out!

Maximizing cloud Apache Kafka® performance isn’t just about running data processes on cloud instances. There is a lot of engineering work required to set and maintain a high-performance standard for speed and availability.

Alok Nikhil (Senior Software Engineer, Confluent) and Adithya Chandra (Staff Software Engineer II, Confluent) share how they optimize Kafka on Confluent Cloud and the three guiding principles they follow, which apply whether you are self-managing Kafka or working on a cloud-native system:

Know your users and plan for their workloads
Infrastructure matters for performance as well as cost efficiency
Effective observability—you can’t improve what you don’t see

A large part of setting and achieving performance standards is understanding that workloads vary and come with unique requirements. Performance has several dimensions, such as the number of partitions and the number of connections. Alok and Adithya suggest starting by identifying the workload patterns that matter most to your business objectives, then simulating and reproducing those patterns and using the results to optimize the software.

When identifying workloads, it’s essential to determine the infrastructure that you’ll need to support a given workload economically. Infrastructure optimization is as important as performance optimization. It’s best practice to know the infrastructure available to you and to choose the appropriate hardware, operating system, and JVM for your processes so that workloads run efficiently.

With the necessary infrastructure patterns in place, it’s crucial to monitor metrics to ensure that your application consistently runs as expected with every release. Having the right observability metrics and logs allows you to identify and troubleshoot issues relatively quickly. Profiling and request sampling also help you dive deeper into performance issues, particularly during incidents. Alok and Adithya’s team uses tooling such as async-profiler to profile CPU cycles, heap allocations, and lock contention.
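As a rough illustration of checking that latency stays consistent from one release to the next, here is a minimal Python sketch. It is not the method described in the episode: the function names, the nearest-rank percentile, and the 10% tolerance are all assumptions chosen for the example.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Nearest-rank method: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def regressed(baseline_ms, candidate_ms, pct=99, tolerance=0.10):
    """Return True if the candidate release's percentile latency exceeds
    the baseline's by more than `tolerance` (hypothetical threshold)."""
    base = percentile(baseline_ms, pct)
    cand = percentile(candidate_ms, pct)
    return cand > base * (1 + tolerance)


if __name__ == "__main__":
    # Made-up request latencies in milliseconds, for illustration only.
    baseline = [10] * 99 + [20]
    candidate = [10] * 98 + [30, 30]
    print("p99 baseline:", percentile(baseline, 99))
    print("p99 candidate:", percentile(candidate, 99))
    print("regressed:", regressed(baseline, candidate))
```

In practice the samples would come from whatever metrics or request-sampling pipeline you already run, and tail percentiles are usually more informative than averages for this kind of comparison.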

Alok and Adithya summarize the lessons and processes they use to optimize Kafka as a managed service, which can also apply to your own cloud-native applications. You can also read more about their journey on the Confluent blog.

EPISODE LINKS

Listen to the episode
