Data Mesh is the new “hype” these days, and I wanted to ask if anyone has any thoughts on using Kafka as part of the infrastructure in a data mesh. In short, what is Kafka’s “role” in the data mesh architecture?
Now, Kafka along with Kafka Connect sources and sinks is a perfect fit for transferring data from/to the data product “storage”… Is it, however, also the correct tool to “ETL” between the operational plane and the analytical plane? It sounds almost “too easy” to do ETL with e.g. ksqlDB from one (operational) model to an analytical model.
Does anyone have any experience with this at a large scale? (big!! data and at an enterprise level)
Inside one domain it clearly helps drive the pipeline, but can/should one use Kafka as the “base” for all products that are created? The examples I have seen so far use Kafka to stream data from a source over to e.g. Google BigQuery, where reports and products are created…
Thoughts and links to relevant articles are highly appreciated
I’ve been thinking about this too. Since Data Mesh is technology agnostic, I would think that using the technology stack that you are most comfortable with makes sense. So, for a lot of people, this is going to include Kafka.
Thanks a lot! I have taken the Data Mesh course and it was great, but it is still a bit “basic”. Of course, there is a limit to what a course this short can cover. I will read the blog post you refer to
Looking forward to the next Kafka summit in London
Exciting times! Definitely!
I used to work at Shopify, prior to Confluent, and we were working hard on setting up change data capture topics to power both analytics (created much like you described with Google BigQuery) and operational systems (as we had a number of event-driven use cases that were too difficult to exploit given the monolithic Rails architecture). One way to look at it is that the analytics stuff is powered by BigQuery data products, while the operational stuff is powered by the event stream. However, if you zoom out a bit, you can see that there are several things you could name as a “data product” - the source DB, the event stream itself, and the BigQuery data. This is where the other principles, of domain ownership, federated governance, and self-serve, come into play - we need to formalize what is a data product and what isn’t. There are many ways to do this, but it really ends up depending on the org.
Shopify also chose to do CDC on the internal model of their main relational database instead of abstracting it away into a “public” data product model. This has led to some breakages in the past, where the domain owners changed their model without consulting the downstream consumers, causing some fairly painful outages. This problem is fairly common, however, and is an example of what differentiates Data Mesh from plain ol’ CDC ETL - the foresight to delineate ownership and ensure that the model the consumers couple on is treated as a product (release cycles, consultation, guarantees of uptime).
To me this answer is yes - and I would have given you that answer even before joining Confluent (my book should be proof of that actually - started it back in 2018, joined Confluent in 2021).
A lot of this has to do with the maturity of Apache Kafka, and a lot to do with cloud computing and the ease of making and managing new services (including their state stores). There was a time not so long ago when event/message brokers/queues were unable to maintain indefinite storage (all events had a TTL) and were also unable to do all of the things you need for large scale distributed materialized state: partitions, infinite retention, multiple consumers consuming independently.
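The properties listed above - durable retention, partitioned logs, and multiple consumers reading independently - are what separate Kafka from the older brokers. A toy sketch (purely illustrative, not Kafka’s actual implementation) of a log with per-consumer offsets shows the core idea:

```python
# Toy append-only log with per-consumer offsets, illustrating why a durable,
# independently-consumable stream differs from a TTL'd message queue.
# This is a conceptual sketch only, not how Kafka is implemented.
class Log:
    def __init__(self):
        self.records = []   # retained indefinitely, no TTL
        self.offsets = {}   # consumer_id -> next offset to read

    def append(self, record):
        self.records.append(record)

    def poll(self, consumer_id):
        # Each consumer tracks its own position; reading never removes records.
        off = self.offsets.get(consumer_id, 0)
        batch = self.records[off:]
        self.offsets[consumer_id] = len(self.records)
        return batch

log = Log()
log.append("a"); log.append("b")
print(log.poll("analytics"))    # ['a', 'b']
log.append("c")
print(log.poll("analytics"))    # ['c']
print(log.poll("operational"))  # ['a', 'b', 'c'] - unaffected by the other consumer
```

The key contrast with the old queue model: consuming a record does not destroy it, so an analytical consumer and an operational consumer can each materialize the full history independently.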
That lack of good event broker options really influenced people’s thinking on “message brokers” - as a simple means of routing temporary data from one very specific source to one very specific destination. This sort of stale thinking persists today, and you commonly see it in many design patterns.
You’ve correctly identified that it is indeed that easy (the vast majority of times). However, the complexity lies in the remainder of the problems to be solved:
How do we:
Ensure that the producers (domain owners) can isolate their internal data model from that published to the event stream (data product).
Promote event streams to full-fledged products (governance comes into play here): SLAs, quality levels, on-call rotation, release cycles, evolution, and handling breaking changes
Enable consumers to find the relevant data products they need. Use a data catalog to track metadata about each product.
Manage dependencies: which consumers are using my data products? I need to notify them about an upcoming breaking change.
Make it self-serve: I want my producers to be able to create new streams, publish data, and register them as products. I want my consumers to be able to use these products effortlessly, either by using CDC to put them into cloud storage (for BigQuery, let’s say), consuming with ksqlDB, building a Kafka Streams app, or using CDC to sink to MySQL. Lots of options, but this list also depends on what the governance team wants to support, as it requires streamlining infrastructure to support both the operational and analytical side.
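The first point above - isolating the internal data model from what gets published to the event stream - amounts to a small translation layer at the domain boundary. A minimal sketch, where the function name, tables, and fields are all hypothetical:

```python
# Sketch of a translation layer between an internal DB row and the public
# event schema published to the data product stream. All names and fields
# here are hypothetical, for illustration only.

def to_order_event(row: dict) -> dict:
    """Map an internal `orders` table row to a public OrderPlaced v1 event.

    Internal column names and types stay private; consumers couple only on
    this published shape, which is versioned and evolved deliberately.
    """
    return {
        "event_type": "OrderPlaced",
        "schema_version": 1,
        "order_id": str(row["id"]),         # internal int PK -> public string id
        "total_cents": row["amt"],          # cryptic internal name -> clear public name
        "currency": row.get("ccy", "USD"),  # default applied at the boundary
    }

internal_row = {"id": 42, "amt": 1999, "ccy": "EUR", "shard_hint": 7}
event = to_order_event(internal_row)
print(event)  # note: the internal-only `shard_hint` column never leaks out
```

The payoff is that the domain team can rename `amt`, reshard, or restructure their tables freely, as long as this boundary keeps emitting the published schema.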
I hope that helps answer some questions. Please let me know if you have any follow-ups (just @abellemare me so I don’t miss it)
@abellemare Thank you so much for such an extensive and complete answer to my question. This is way more than I ever anticipated
I find it really interesting the way you explain how CDC of the internal model causes issues. Not that it surprises me, but it confirms the fact that you need a hard “line” between the operational plane and the analytical plane.
Now, from a “product” perspective it might be irrelevant, but would you recommend defining the product in the DB as e.g. an outbox pattern or a view along with CDC to Kafka, or would you recommend using schemas and microservices/ksqlDB to do the ETL to a “proper” product?
When I think of it, it will always depend, I guess, on the model and the products you want to create.
Anyway, I guess one would want a separate “mesh” Kafka cluster where the products live, which gets its data from a cluster link feeding from the “operational” cluster? Or is that maybe over-engineering?
Again, thanks for your thoughts and ideas. They are truly great input to an interesting discussion.
This is pretty much it. CDC is extremely useful, but at the end of the day it is a coupling on the schema and internal model of whatever tables it is pulling data from.
I have seen both used successfully. The outbox pattern works pretty well, but puts more load on the source database. Sometimes you have to fight with the DBA to get them to approve the additional load, though in the world of cloud computing it’s often just a matter of scaling up the instance and spending a bit more money. MySQL blackhole engine works well for this too, if you want to use CDC to capture the outbox writes to the blackhole table (the data isn’t written to disk, only to the binlog, making it easy to capture via CDC without putting much/any load on the database).
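The outbox pattern described above can be sketched in a few lines: the business write and the outbox write commit in one transaction, and CDC then tails the outbox table. This sketch uses SQLite as a stand-in for the real database; table and column names are illustrative:

```python
# Minimal transactional-outbox sketch. SQLite stands in for the production
# DB; in practice CDC (e.g. Debezium) would tail the `outbox` table's writes.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         topic TEXT, payload TEXT);
""")

def place_order(order_id: int, total_cents: int) -> None:
    # Both inserts commit atomically, so the published event can never
    # disagree with the stored state (no dual-write problem).
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total_cents))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders.v1",
             json.dumps({"order_id": order_id, "total_cents": total_cents})),
        )

place_order(1, 1999)
print(conn.execute("SELECT topic, payload FROM outbox").fetchall())
```

The extra load mentioned above comes from that second insert per business write (plus the CDC read); the blackhole-engine variant avoids the disk write while keeping the same atomicity via the binlog.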
You can also use microservices/ksqlDB to transform the internal model of the CDC data into a well-formatted data product. In this case, you usually need to handle foreign-key streaming table joins in the microservice, since CDC data tends to be from normalized relational database tables. I have seen this also used successfully (and I have dubbed it eventification: turning highly relational CDC data into denormalized and easy to use events, so that other consumers don’t have to jump through a bunch of data cleaning hoops).
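The foreign-key join at the heart of eventification can be sketched in-memory: materialize one CDC stream as a table, then enrich the other stream against it, the way a Kafka Streams or ksqlDB foreign-key join would. All names here are hypothetical:

```python
# Sketch of "eventification": joining normalized CDC records into one
# denormalized event. In-memory stand-in for a stream-table join; a real
# deployment would use Kafka Streams / ksqlDB with proper state stores.

customers: dict[int, dict] = {}  # materialized table: customer_id -> latest row

def on_customer_change(row: dict) -> None:
    # Upsert from the `customers` CDC topic into the materialized table.
    customers[row["customer_id"]] = row

def on_order_change(order: dict) -> dict:
    # Foreign-key join: enrich the order with the current customer state so
    # downstream consumers don't have to re-join normalized tables themselves.
    customer = customers.get(order["customer_id"], {})
    return {
        "order_id": order["order_id"],
        "total_cents": order["total_cents"],
        "customer_name": customer.get("name"),
        "customer_country": customer.get("country"),
    }

on_customer_change({"customer_id": 7, "name": "Ada", "country": "NO"})
event = on_order_change({"order_id": 1, "customer_id": 7, "total_cents": 1999})
print(event)
```

Note what this toy version glosses over: real stream-table joins must handle ordering, late-arriving customer updates, and re-emitting enriched events when the table side changes - which is exactly the complexity the domain team takes on by owning this logic.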
Both options require careful consideration of the domain ownership principle. Who is going to be responsible for ensuring the outbox table is compatible with changes to the internal domain model? Who is going to be responsible for ensuring the microservice join logic and stream formats remain compatible? Oftentimes I see an ownership seam awkwardly placed right in the middle. Some poor “data team” has to manage the outbox table logic and CDC, but they’re also not kept in the loop on the internal model changes. Or they have to manage the microservices, but also have no ownership of the internal model, resulting in more outages. By adhering to the data mesh principles, the ownership and the work of making data easy to use would fall completely on the domain team. This is a social shift that organizations have to make to be successful with data mesh (or even just in using CDC data, unless the data model never changes…).
So I don’t see the same operational / analytical divide. I spent a lot of time as a data engineer prior to Confluent, and most of what I see is the same needs (access to clean, consistent, and usable data) in both the operational and analytical plane. In the ideal world, both operational and analytical systems should draw from precisely the same data source - the event stream. It enables both planes to have the same data, including the same errors, corrections, and idiosyncrasies.
One of the more common failure modes of the operational/analytical divide is that the operational systems claim one thing (e.g. 1000 sales were made) while analytical claims something almost the same but different (e.g. 992 sales). By unifying both operational and analytical on the same data source, you remove a massive burden for both planes, and make it really easy to stop looking for data and actually start using it.
But to your original question of multiple clusters, I would say your best bet is to keep it simple. Use a single cluster for operational and analytical needs. You can always create artificial divides / namespacing to conceal data, hide topics, etc. If you find you need multiple clusters for other reasons (data locality rules, latency, etc) then you can start expanding out, but by and large, I think it’s important to consider data as data without having to artificially label it as “analytical” or “operational”.