What are the components of the Apache Kafka eco-system?

The central piece is of course a Kafka cluster, i.e., a cluster of broker nodes (servers) that store data in partitioned topics. Everything else, is “outside” of the broker cluster. The most basic APIs are the producer and consumer API (and admin client if you want to add it), that allows you to write and read data into/from topics.

Kafka Connect:

  • It is a framework that allows you to import/export data to/from Kafka. Internally, it uses the producer/consumer APIs, but it provides a higher level abstraction over producer/consumer so you don’t have to deal with the nitty-gritty details.
  • Connect itself is again deployed as a cluster of so-called “connect worker nodes”. In the connect cluster, you deploy so called connectors (source connectors to read data from an external system to write into the Kafka cluster, and sink connectors to “export” data).
  • If you implement your own connector (usually not required, as there is a huge set of existing connectors), you basically need to implement the code that connects to the external system, while the actually data publishing to a Kafka topic (or reading from a Kafka topic) is done by the connect framework without your need to deal with consumers/producers.
  • Kafka Connect also allows you to define SMTs, that are simple (and stateless) transformation on each record while you copy the data.
  • There is also MirrorMaker2 (and Confluent’s Replicator), that are both Kafka cluster replication tools: both are built as connectors on top of Kafka Connect.
  • Another example of a popular connector is Debezium: it is specialized to import CDC data from databases into Kafka topics.

Kafka Streams:

  • It’s a stream processing library that allows you to do sophisticated stream processing. Internally (again), it uses the consumer/producer APIs to read/write data from a Kafka cluster. Hence, Streams API is a higher level APIs that hides consumer/producer details. However, Streams will always work against a single cluster. It reads input topics and produces output topic both stored in the same cluster. In contrast to SMTs, it also allows you to do complex stream processing, not just simple stateless record-by-record. You can for example aggregate and join data from multiple topics, i.e., you can do stateful stream processing. Using Kafka Streams you build a regular Java application and you can deploy it as any other Java app – you can also deploy the application in a distributed manner (i.e., starting multiple instances) and the library will take care of load balancing / scaling in/out etc.

Apache Samza / Storm / Flink et al.:

  • The are quite similar to Kafka Streams, however, they are not as deeply integrated into Kafka. They also allow you to process data from other systems than Kafka. The main difference is, the for those stream processing systems, you need to deploy a cluster of “worker nodes” and you write you stream processing logic as a job that you submit to the cluster. Hence, the deployment mode is quite different to Kafka Streams. Also, the APIs are of course slightly different to the Kafka Streams API.

ksqlDB:

  • It’s a streaming database for Apache Kafka. You deploy a cluster of ksqlDB servers and you can create STREAMs/TABLEs using DDL, and query both using DML. ksqlDB uses a Kafka cluster as it’s “main storage layer”, i.e., each time you create a STREAM or TABLE, there will be a corresponding Kafka topic that stores the data. In addition, for (some) TABLEs, ksqlDB also holds a copy of the data from the topic in a ksqlDB server local RocksDB instance. Hence, for (some) TABLEs, data is stored in the Kafka cluster and in the ksqlDB server (for this case, the data in the Kafka cluster is the source of truth and you can consider the ksqlDB server local RocksDB copy an “optimization”, i.e., some kind of a “local cache”).
  • Internally, ksqlDB compiles your SQL queries into Kafka Streams programs that are executed within the ksqlDB servers.

Schema Registry:

  • If you are using Avro, Protocol Buffers, or Json as data format, you can use Schema Registry to store the corresponding schemas. Producer/Consumer configured with the corresponding (de)serializers will only store a schema ID in the actually payload (key or value) of the message while the actually schema is stored in schema registry.
  • ksqlDB integrates which schema registry and it’s also possible to use it in connection with Kafka Connect and Kafka Streams.
6 Likes

Love this! Thanks for the breakdown, @mjsax!

1 Like