On some levels, the frontier between Kafka and a database is thin. I like this article that explains some of this nicely.
Kafka’s primary purpose is letting its clients (for example, the apps in your system) publish and consume events. The fact that it is a distributed log allows you to develop streaming applications around these events: correlating them and analyzing them, at scale and in real time.
In his book Enterprise Integration Patterns, Gregor Hohpe describes using a database for integration as the Shared Database style. If you replaced Kafka with a database, that’s basically what you would get: many services / components / apps reading from and writing to a centralized, source-of-truth database.
Multiple problems are usually associated with this approach. The first is that data in a database is normalized and optimized for the queries, inserts, and updates of the application that owns it. This is problematic for external systems that have to adapt to this optimized format, and it creates coupling between data producers and consumers. Imagine, for example, some customer tables (customer, customer_address, customer_purchases) used to propagate real-time customer-related events. Consumers would have to use triggers and know the exact implementation of the customer tables. But this implementation is heavily tied to the producing application. Say the producer needs to evolve to accommodate new business requirements (for example, the handling of VIP customers); every consumer of the customer tables would need to evolve to support the changing schema. Imagine the coordination required to put a feature like this into production!
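To make the contrast concrete, here is a minimal sketch (the event shapes and the `vip` field are made up for illustration) of why events decouple better than shared tables: a consumer that reads only the fields it cares about keeps working when the producer adds a new field, whereas a consumer wired to the exact table layout would break.

```python
# Hypothetical sketch: the producer adds a "vip" field to its customer events.
# A consumer that tolerates unknown fields keeps working, unlike a consumer
# coupled to the exact layout of shared customer tables.

def handle_customer_updated(event: dict) -> str:
    # Read only the fields this consumer cares about; ignore extras.
    name = event["name"]
    city = event.get("city", "unknown")
    return f"{name} ({city})"

v1_event = {"name": "Alice", "city": "Paris"}
v2_event = {"name": "Alice", "city": "Paris", "vip": True}  # new producer field

# The consumer's behavior is unchanged by the producer's evolution.
assert handle_customer_updated(v1_event) == handle_customer_updated(v2_event)
```

In practice this tolerance is what schema-compatibility rules (e.g. adding only optional fields) give you in Kafka-based systems.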
Another problem is data consistency. In a shared database scenario, it’s very hard for producers and consumers to agree on transactional data changes in a performant, scalable way. Eventual consistency appeared precisely because coordinating distributed systems to have a consistent view of data (traditionally done with distributed locking) is really hard to do at scale. One solution to this problem is using append-only logs. They are immutable, and they have the nice property that a reader can consume the log independently of how it’s produced and still have a consistent view of the data (no locks required). Unfortunately, RDBMSs were not designed to handle immutable logs. It’s just not in their DNA.
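The property described above can be sketched in a few lines. This is not Kafka’s API, just a toy append-only log with per-consumer offsets, to show why readers need no locks: records are never mutated, so each consumer can read at its own pace and still see a consistent, ordered view.

```python
# Toy append-only log with independent consumer offsets (illustrative only).

class Log:
    def __init__(self):
        self._records = []               # append-only: records never change

    def append(self, record) -> int:
        self._records.append(record)
        return len(self._records) - 1    # offset of the new record

    def read(self, offset):
        return self._records[offset]     # reads never block appends

class Consumer:
    def __init__(self, log):
        self.log, self.offset = log, 0   # each consumer tracks its own offset

    def poll(self):
        out = []
        while self.offset < len(self.log._records):
            out.append(self.log.read(self.offset))
            self.offset += 1
        return out

log = Log()
log.append("order-1")
slow, fast = Consumer(log), Consumer(log)
assert fast.poll() == ["order-1"]
log.append("order-2")
# The slow consumer still sees every record, in order, at its own pace.
assert slow.poll() == ["order-1", "order-2"]
assert fast.poll() == ["order-2"]
```

Kafka’s partitions work on the same principle, with offsets tracked per consumer group rather than per object.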
Designing data schemas is an important part of working with Kafka. The events that producers publish into Kafka must be well designed in order to reduce coupling between producers and consumers. This is referred to as the Canonical Data Model in Enterprise Integration Patterns. When you begin to design your data schemas independently of any single application, business value emerges. In Kafka, your topics contain business events. These events are not tied to any specific implementation but are meaningful on their own.
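As a small illustration, here is what such a business event might look like (the event name and fields are hypothetical): it describes something that happened in business terms, not the row layout of any one application’s tables.

```python
# A hypothetical business event, expressed independently of any application's
# internal tables -- the shape producers and consumers agree on.
import json

order_placed = {
    "type": "OrderPlaced",            # business meaning, not a table name
    "order_id": "o-42",
    "customer_id": "c-7",
    "total_eur": 99.90,
    "placed_at": "2024-05-01T10:15:00Z",
}

# Serialized, this is what would travel through a Kafka topic.
payload = json.dumps(order_placed)
```

Any consumer (billing, analytics, fraud detection) can interpret this event without knowing how the ordering application stores its data.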
This unbounded sequence of events is the foundation of streaming technology. The goal of streaming is to get immediate feedback and business value when reading these events. For example, if you have a topic containing credit card authorization events, you could consume these events to do fraud detection (for instance, by looking for repeated attempts with the same credit card in a short time window). You could also consume these events to aggregate transaction amounts (customers bought a total of XYZ € of merchandise). When you use Kafka and a lot of business events are streaming through those pipes, you are sitting on a gold mine of business information.
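The fraud-detection example above can be sketched as a simple sliding-window check. The threshold, window size, and event shape here are invented for illustration; a real deployment would use a stream-processing library such as Kafka Streams rather than hand-rolled state.

```python
# Illustrative sketch: flag a card that appears in 3 or more authorization
# attempts within a 60-second window (thresholds are made up for the example).
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_ATTEMPTS = 3

recent = defaultdict(deque)   # card id -> timestamps of recent attempts

def is_suspicious(card_id: str, ts: float) -> bool:
    attempts = recent[card_id]
    attempts.append(ts)
    while attempts and ts - attempts[0] > WINDOW_SECONDS:
        attempts.popleft()    # drop attempts that fell out of the window
    return len(attempts) >= MAX_ATTEMPTS

# Simulated stream of (card, timestamp) authorization events.
events = [("card-1", 0), ("card-1", 10), ("card-2", 15), ("card-1", 20)]
flags = [card for card, ts in events if is_suspicious(card, ts)]
assert flags == ["card-1"]    # the third attempt within 60s triggers the flag
```

The same per-key windowed state is what a streaming engine maintains for you, partitioned and fault-tolerant, as it consumes the topic.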
In a traditional database world, this is harder to accomplish. Usually, data in a system (such as an ERP, or, let’s say, SAP) is hard to use because it is heavily tied to the producer system. When new requirements come up, you typically have to extract the data in batches (a few times a day), use an ETL job to map / filter / cleanse / enrich it, and load it into a data warehouse. The BI team would then analyze the data and get back to you in a couple of days. This is hardly real-time.
I hope this illustrates some fundamental differences between Kafka and a database and makes the distinction a little clearer for you.