Download the paper here: Rethinking Distributed Stream Processing in Apache Kafka
An increasingly important requirement for distributed stream processing applications is to provide strong correctness guarantees under unexpected failures and out-of-order data, so that their results can be authoritative (not needing complementary batch results). Although existing systems have put considerable effort into addressing specific issues such as consistency and completeness, enabling users to make flexible and transparent trade-off decisions among correctness, performance, and cost remains a practical challenge. In particular, similar mechanisms are usually applied to tackle both consistency and completeness, which can result in unnecessary performance penalties.
This paper presents Apache Kafka’s core design for stream processing, which relies on its persistent log architecture as both the storage and the inter-processor communication layer to achieve correctness guarantees. Kafka Streams, a scalable stream processing client library in Apache Kafka, defines processing logic as read-process-write cycles in which all processing state updates and result outputs are captured as log appends. Idempotent and transactional write protocols are used to guarantee exactly-once semantics, and revision-based speculative processing emits results as soon as possible while handling out-of-order data. The paper also shows how Kafka Streams behaves in practice in large-scale deployments, with performance insights demonstrating its flexible and low-overhead trade-offs.
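To make the read-process-write model concrete, below is a minimal Kafka Streams sketch (topic names, the application id, and the broker address are illustrative, not from the paper). Enabling the exactly-once processing guarantee causes each cycle's input offset commits, state-changelog appends, and output writes to be committed atomically through Kafka's idempotent and transactional write protocols.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.state.KeyValueStore;

import java.util.Arrays;
import java.util.Properties;

public class WordCountExactlyOnce {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-eos-demo"); // illustrative app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // illustrative broker
        // Exactly-once semantics: each read-process-write cycle (consumed offsets,
        // state-changelog appends, output records) is committed in one Kafka transaction.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines =
                builder.stream("text-input", Consumed.with(Serdes.String(), Serdes.String()));

        // Processing state (the count store) is captured as appends to a compacted
        // changelog topic, so it can be restored after a failure.
        KTable<String, Long> counts = lines
                .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
                .groupBy((key, word) -> word, Grouped.with(Serdes.String(), Serdes.String()))
                .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("word-counts"));

        // Result outputs are likewise log appends, written to an output topic.
        counts.toStream().to("word-count-output", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Because both the state updates and the outputs are expressed as log appends, the same transactional commit covers them, which is what lets the library trade off correctness, performance, and cost by configuration rather than by re-architecting the application.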
This paper was authored by Guozhang Wang, Lei Chen, Ayusman Dikshit, Jason Gustafson, Boyang Chen, Matthias J. Sax, John Roesler, Sophie Blee-Goldman, Bruno Cadonna, Apurva Mehta, Varun Madan, and Jun Rao.