Total Noob re: Kafka

Hello - I have a long history as a developer but haven’t developed professionally for 10 years or so, as I’ve spent that time in IT management, QA and test automation. I’ve only very recently started reading about Kafka and watching some good videos, especially by Tim Berglund.

I’m sure I’ll get some eyerolls here, but I don’t get it! Maybe I just haven’t read/watched enough yet, but I don’t see the advantage of Kafka over a traditional DB based system. Seems like a topic is a table, publishers are services that do “inserts”, consumers are services that do “selects”, and enhancing data in one topic with data from another is a “join”.

I keep reading that Kafka is “real time”. Well, if I’m inserting event records in a database as they happen, and I’m querying the database often enough, won’t that approach real time - or at least as close as makes no meaningful difference anyway?

Please enlighten me, I must be missing something(s) :slight_smile:

3 Likes

A lot of databases don’t have a streaming API. So, like you said, you can poll frequently to make sure you notice new things in near real time. As with most of these things, you can always kind of make it work in some other way, but that often comes at a price.

In this case, traditional databases are going to have a tough time when a lot of apps are each polling them all the time. Then there is the problem for each app of knowing which data is new. Another concern with traditional databases is that all data is treated equally, so over time they become slower. And each app that needs some data needs credentials to read it, and often also needs to know a bit about how the data is structured.

Kafka has a much better story for all these things. Because of topics and partitions, it scales better. With consumer groups and offsets stored in Kafka itself, it’s easy for apps to know what the new information is. With tiered storage, you can get the latest data quickly but are also able to get all the data when needed. By using topics and ACLs, it’s easy for a consumer to get access to additional data. And by using Schema Registry, it’s easy for the consumer to process the data.
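To make the offset part concrete, here is a minimal sketch of a plain Java consumer (the topic name, group id and broker address are just made-up examples). Because the committed offset is stored in Kafka per consumer group, the app can be restarted at any time and continue exactly where it left off; another app would simply use a different group.id and get its own independent position in the same topic.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderEventsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Each app uses its own group id, so Kafka tracks a separate committed
        // offset per app -- that is how every consumer knows what is "new" for it.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "billing-service");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("order-events"));
            while (true) {
                // poll() returns records from the consumer's current position,
                // which starts at the last committed offset for this group.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                // Commit what we just processed; on restart we resume from here.
                consumer.commitSync();
            }
        }
    }
}
```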

For real apps you often need both something streaming and some database-like functionality. But building streaming functionality on top of a traditional database is, in my experience, a lot harder than using Kafka.

However, with ksqlDB, Kafka is becoming more like a database, and more and more databases offer some kind of streaming API. So the real story is not all that black and white.

3 Likes

I don’t think that Kafka is a database – I would prefer the term “data store”. However, the whole idea of Kafka is to build your apps differently; ie, move away from a single central DB as the source of truth and build your app in an event-driven manner, to decentralize and decouple the overall architecture and to allow for better scaling.

Directly comparing Kafka to a (relational) database may be misleading. Instead, you should compare the overall / end-to-end system architecture, ie, how you use Kafka and how you design your apps around Kafka, to how you use a database. It’s really a mental shift, because you won’t use the same patterns in Kafka apps that you use in your DB apps.

In other words: Kafka is not a “drop-in replacement” for a database. You can use Kafka instead of a database, but you also need to build your apps around Kafka differently. And it’s always a question of pros/cons. Database-centric apps have their advantages, too, and I don’t think they will disappear completely.

Neha did a great keynote at Berlin Buzzwords on this topic: Berlin Buzzwords 2016: Neha Narkhede - Application development and data in the emerging world ... - YouTube

This Kafka Summit talk may help a little bit, too: Query the Application, Not a Database: “Interactive Queries” in Kafka’s Streams API

3 Likes

On some levels, the line between Kafka and a database is thin. I like this article, which explains some of this nicely.

Kafka’s primary purpose is letting its clients (for example, apps in your system) publish and consume events. The fact that it’s a distributed log allows you to build streaming around these events, for example correlating and analyzing them, at scale and in real time.

In his book Enterprise Integration Patterns, Gregor Hohpe describes using a database this way as the Shared Database integration style. If you replaced Kafka with a database, that’s basically what you would get: many services / components / apps reading from and writing to a centralized, source-of-truth database.

Multiple problems are usually associated with this approach. The first problem is that in a database, data is normalized for your queries and inserts/updates. This means data in a database is usually modeled around the application that works with it. This can be problematic for external systems that have to adapt to this optimized format, and it creates coupling between data producers and consumers. Imagine, for example, some customer tables (customer, customer_address, customer_purchases) being used to propagate real-time customer-related events. Consumers would have to use triggers and know the exact implementation of the customer tables. But this implementation is heavily tied to the producing application. Let’s say the producer needs to evolve to accommodate new business requirements (for example, the handling of VIP customers); every consumer of the customer tables would then need to evolve to support the changing schema. Imagine the coordination required to put a feature like this into production!

Another problem is data consistency. In a shared database scenario, it’s very hard for producers and consumers to agree on transactional data changes in a performant, scalable way. Eventual consistency appeared precisely because coordinating distributed systems so that they share a consistent view of data (for example via distributed locking) is really hard to do at scale. One solution to this problem is using append-only logs. They are immutable, and they have the nice property that a reader can consume the log independently from how it’s produced and still have a consistent view of the data (no locks required). Unfortunately, RDBMSs were not designed to handle immutable logs. It’s just not in their DNA.
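To illustrate that “no locks required” point from the client side, here is a small sketch (the topic name is hypothetical, and it assumes a single partition): a reader can rewind and re-read the log at its own pace, and because the log is append-only and immutable it always sees the same records, no matter what producers are doing at the head of the log.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayFromBeginning {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("customer-events", 0);
            consumer.assign(List.of(partition));
            // Rewind to the start of the log. Re-reading an append-only,
            // immutable log yields the same records every time -- no locking
            // or coordination with the producers is needed.
            consumer.seekToBeginning(List.of(partition));
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(2))) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```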

Designing data schemas is an important part of working with Kafka. The events the producers publish into Kafka must be well designed in order to reduce coupling between producers and consumers. This is referred to as the Canonical Data Model in Enterprise Integration Patterns. When you begin to design your data schemas independently of any single application, business value emerges. In Kafka, your topics contain business events. These events are not tied to any specific implementation; they are meaningful in their own right.
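As a rough sketch of what such a business-level event could look like (the topic, event type and field names are invented here, and in practice you would likely use Avro or Protobuf with Schema Registry rather than a raw JSON string):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CustomerEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // The event describes a business fact ("this customer became a VIP"),
        // not the producer's internal customer/customer_address tables.
        String event = "{\"type\":\"CustomerBecameVip\",\"customerId\":\"42\",\"since\":\"2021-06-01\"}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed by customer id, so all events for one customer land in the same partition.
            producer.send(new ProducerRecord<>("customer-events", "42", event));
            producer.flush();
        }
    }
}
```

Consumers only need to understand the event, not the producer’s tables, which is exactly the decoupling the Canonical Data Model is after.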

This unbounded sequence of events is the foundation of streaming technology. The goal of streaming is to get immediate feedback and business value when reading these events. For example, if you have a topic containing credit card authorization events, you could consume these events to do fraud detection (for instance, by looking for repeated attempts with the same credit card in a short time window). You could also consume these events to calculate transaction totals (customers bought a total of XYZ € of merchandise). When you use Kafka and a lot of business events are streaming through those pipes, you are sitting on a gold mine of business information.
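For instance, the fraud-detection idea could look roughly like this with Kafka Streams. The topic name and the threshold are invented, and it assumes a recent Kafka Streams version (3.0+) for TimeWindows.ofSizeWithNoGrace:

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class FraudDetectionSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "fraud-detection-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Authorization events keyed by card number (topic name is hypothetical).
        KStream<String, String> authorizations = builder.stream("card-authorizations");

        authorizations
            .groupByKey()
            // Count attempts per card in 5-minute windows.
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
            .count()
            .toStream()
            // Flag cards with suspiciously many attempts in one window.
            .filter((windowedCard, count) -> count > 3)
            .foreach((windowedCard, count) ->
                    System.out.printf("Possible fraud: card %s, %d attempts%n",
                            windowedCard.key(), count));

        new KafkaStreams(builder.build(), props).start();
    }
}
```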

In a traditional database world, this is harder to accomplish. Usually, data in a system (such as an ERP, or, let’s say, SAP) is hard to use because it is heavily tied to the producing system. If you had new requirements, you would typically have to extract the data in batches (a few times a day), use an ETL job to map / filter / cleanse / enrich the data and send it to a data warehouse. The BI team would then analyze the data and get back to you in a couple of days. This is hardly real time.

I hope I could illustrate some fundamental differences between Kafka and a database and make it a little bit clearer for you.

3 Likes

Just a tiny remark, which might otherwise confuse people: Kafka’s main purpose is to publish and consume messages. These messages could be events, but they could just as easily be other things.

2 Likes

I’m not sure it’s correct to compare the two.

Each tool has its application area.

Kafka/Kafka Connect/Kafka Streams/ksqlDB flourish when it comes to integrating systems of different sorts and exchanging data between them.

But we should not forget, I suppose, that each system in this big ecosystem has its own context. And this context, or storage, is very often organized as a traditional transactional or analytical relational database. They are not going to be replaced with something else in any foreseeable future, in my opinion, because they do their job.

As I currently understand it, Kafka is not about accessing data. It’s about exchanging events, which are written to the central cluster. But then these events (or messages) are written to conventional data stores and are accessed much more easily from there.

At least, I cannot really see how you would efficiently write a query in Kafka that operates on a multi-terabyte dataset to construct, say, some report, without proper indexing and joining of the data sources. You would write that data to a database first and run your query there.

The same applies in the opposite direction. Why bother building a reliable message bus and integration system on top of your conventional database when it is much easier and more efficient to do it in Kafka?

So, have them both to rule them all!
Or am I missing something? :face_with_raised_eyebrow:

1 Like

That’s certainly a way to handle it. I worked at a place where each service had its ‘own’ MongoDB for storage, and collections relevant for other systems were streamed to Kafka using Debezium. With such an architecture each service can get async ‘updates’ from other services. One of the nice things with Kafka here is that the committed offset is kept for each service separately, so a service can go offline and still be sure to get all the updates. This works nicely for reading and for decoupling services, but sometimes we still needed direct service-to-service HTTP calls to change/query things. As was mentioned earlier in the thread, with Kafka in place you design services differently than you would in a traditional way.

While technically you could use just Kafka, or just a DB, this often makes some kinds of services much more complex than they would be if you had both.

I am definitely missing something important here, but how do I design a transactional system exclusively with Kafka, without conventional storage such as a relational database?

Let’s consider a classic example - an online shop. There we have buyers, sellers, items and orders. This simple set of entities covers a lot of use cases - today I want to see “all unprocessed orders by a particular set of sellers”, tomorrow it’s a report on “number of orders by buyer and item type in the last 180 days”. I can go on for a long time. Now let’s imagine that this database is really huge, terabytes and terabytes in size, and it’s just unreasonable to consume all the data from it as a continuous stream of events every time I need some small subset of it. But with good old SQL and proper indexing/partitioning, and an adequate RDBMS query optimizer of course, this is done very easily, because that is exactly what it was created for. And data normalization was also invented for this (for keeping data correct, consistent and reusable).

This is what I was trying to say - it is not about the system design, it is about the use case, if I understand it right. It is not convenient to use a microscope as a hammer and vice versa. Why should I bother thinking hard about how to design my system exclusively with Kafka, or exclusively with a conventional RDBMS, when there are classical use-case patterns for both of them and I can have both in my ecosystem?

I was not talking about a transactional system, just a service - a smaller app that tries to do only one thing. You can build the state from the messages in Kafka, either with the help of Kafka Streams/RocksDB or not, and do anything you could do with a traditional database, if you really want. The main problem is rebuilding the state each time, but you could snapshot it. Still, it then becomes a lot more complex than it would be if you just used a database.
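As a sketch of that “build the state from the messages” idea (topic and store names are invented): with Kafka Streams you can materialize a topic into a local RocksDB-backed store that the service queries directly, and on restart the store is rebuilt from Kafka rather than from a database.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class OrderStateService {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-state-service");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Latest value per order id, materialized into a local RocksDB store.
        KTable<String, String> orders = builder.table("orders", Materialized.as("orders-store"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Crude wait for the app to start and the store to be (re)built from Kafka;
        // a real service would check the streams state instead of sleeping.
        Thread.sleep(10_000);
        ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType("orders-store", QueryableStoreTypes.keyValueStore()));
        System.out.println("order 42 = " + store.get("42"));
    }
}
```

This is roughly the “Interactive Queries” approach from the Kafka Summit talk linked earlier in the thread, and it shows both the appeal (no external database) and the complexity mentioned above.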

1 Like