Data Mesh course material question

I’ve gone through the great data mesh material.

There is a paragraph in the “Data Available Everywhere, Self-Service / A global mesh” part which I don’t understand:

“While the data product can send data to or receive data from the mesh, if it’s playing the role of a destination port, it typically can’t use the underlying Kafka implementation, since that sits under central control. Instead, it should use its own Kafka instance.”

Why can’t the underlying Kafka implementation be used? “Sits under central control” is not very clear to me.


Great question, Natan! I think as a community we are still working this stuff out, so let me think out loud a bit.

I have seen some discussion of the tension between centralization and decentralization in Data Mesh. Clearly decentralization is a key priority, but I think we should see this as decentralization of product ownership, not necessarily infrastructure. (NB: not everyone agrees on this point!) Teams have to own the schema of the analytics products they create, but those products have to be published in some way that is, subject to governance, globally accessible. This could be a database, some species of HTTP interface, or, as we generally assume 'round these parts, a Kafka topic.
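To make the “teams own the schema” point a bit more concrete, here’s a purely hypothetical sketch (the record and field names are invented) of an owning team defining the Avro schema of its analytics product in Java. The same team-owned definition could sit behind a Kafka topic, an HTTP interface, or a table:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class OrderAnalyticsSchema {
    // The owning team defines and versions this schema itself; no central data
    // team dictates its shape. All names here are purely illustrative.
    public static final Schema SCHEMA = SchemaBuilder
            .record("OrderAnalytics").namespace("com.example.orders")
            .fields()
                .requiredString("orderId")
                .requiredDouble("total")
                .requiredLong("eventTimeMs")
            .endRecord();
}
```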

This leads me to resolve the centralization tension as follows: infrastructure can still be centralized (an IP network over which I can make HTTP calls, a database that will let me connect to it, a topic in a shared Kafka cluster), but ownership of data products must be decentralized. No central data team gets to tell me what my data outputs look like, but I still have to put them somewhere accessible, and—I think this is finally getting close to the heart of your question—I still need to document their schema in some centralized repository.
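Here’s a rough sketch of what “centralized infrastructure, decentralized ownership” can look like with Kafka (hostnames, topic names, and field names are all invented for illustration): the producer points at the shared mesh cluster and registers its schema with a central Schema Registry, but the schema itself is the one the owning team defined.

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderAnalyticsPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Centralized infrastructure: the shared mesh cluster and the central
        // Schema Registry (addresses are invented for this example).
        props.put("bootstrap.servers", "mesh-kafka.example.com:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "https://schema-registry.example.com");

        // Decentralized ownership: the owning team decides the shape of its product.
        Schema schema = SchemaBuilder.record("OrderAnalytics").namespace("com.example.orders")
                .fields().requiredString("orderId").requiredDouble("total").endRecord();

        GenericRecord value = new GenericData.Record(schema);
        value.put("orderId", "o-123");
        value.put("total", 42.0);

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // The data product's output port: a topic on the shared cluster. The Avro
            // serializer registers the schema with the central registry on first send.
            producer.send(new ProducerRecord<>("orders.analytics.v1", "o-123", value));
        }
    }
}
```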

Having said all that, I’m still not sure I’ve answered your question. Push me a bit if not. :slight_smile:


Good question. The passage could actually be a bit clearer with the bolded text added:
“While the data product can send data to or receive data from the mesh, if it’s **using a stream processing application to** play the role of a destination port, it typically can’t use the underlying Kafka implementation, since that sits under central control. Instead, it should use its own Kafka instance.”

Streaming data into MongoDB or something similar wouldn’t require a separate Kafka cluster. With stream processing applications you typically want a Kafka instance that is under your control, so you can manage your release lifecycle independently of the central cluster.
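Here’s a rough sketch of what I mean (cluster addresses, application id, and topic names are all invented): a Kafka Streams app whose bootstrap.servers points at the team’s own cluster, so its internal repartition/changelog topics, consumer groups, and upgrade or reset cycles don’t touch the centrally controlled mesh cluster. The input topic is assumed to have been replicated in from the mesh, e.g. via MirrorMaker 2 or Cluster Linking.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class OrderEnrichmentApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-enrichment");
        // The team's own cluster: internal repartition/changelog topics, consumer
        // groups, and redeployments are all under the data product team's control.
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "team-kafka.internal:9092");

        StreamsBuilder builder = new StreamsBuilder();
        // Input assumed to be replicated from the central mesh cluster into a local topic.
        builder.stream("mesh.orders.replica", Consumed.with(Serdes.String(), Serdes.String()))
               .mapValues(value -> value.toUpperCase())  // placeholder for real enrichment logic
               .to("orders.enriched", Produced.with(Serdes.String(), Serdes.String()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Resetting or redeploying that application only touches topics on the team’s own cluster, which is the release-lifecycle independence I was describing.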