What is the best approach to denormalising an N:N table in ksqlDB?
For example, let's explore a scenario where I have the following pipeline: SQL DB → Debezium → Kafka → ksqlDB, and the following tables exist on the SQL side:
CREATE TABLE dogs (
id INT(11) NOT NULL AUTO_INCREMENT
,name VARCHAR(255)
,PRIMARY KEY (id)
);
CREATE TABLE owners (
id INT(11) NOT NULL AUTO_INCREMENT
,name VARCHAR(255)
,PRIMARY KEY (id)
);
CREATE TABLE dogs_owners (
ownerId INT(11)
,dogId INT(11)
,mainOwner TINYINT(1)
,PRIMARY KEY (ownerId, dogId)
);
As you can see, the dogs_owners table is an N:N (many-to-many) table: a dog can have multiple owners, and an owner can have multiple dogs. Through the CDC pipeline, each of these tables becomes a Kafka topic that receives every change made to it (I'm using the JSON encoder rather than Avro, in case it makes a difference). At this point I would like to create tables in ksqlDB on top of those topics, like:
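To make the message shapes concrete, a record on the dogs_owners topic ends up looking roughly like this (a simplified sketch, assuming schemas are disabled on the JSON converters and Debezium's ExtractNewRecordState SMT flattens the change-event envelope into a plain row):

key:   {"ownerId": 1, "dogId": 2}
value: {"ownerId": 1, "dogId": 2, "mainOwner": 1}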
DROP TABLE IF EXISTS dogs;
CREATE TABLE dogs (
key STRUCT<id INT> PRIMARY KEY
,name VARCHAR
) WITH (
kafka_topic='dogs',
key_format='json',
value_format='json'
);
DROP TABLE IF EXISTS owners;
CREATE TABLE owners (
key STRUCT<id INT> PRIMARY KEY
,name VARCHAR
) WITH (
kafka_topic='owners',
key_format='json',
value_format='json'
);
DROP TABLE IF EXISTS dogs_owners;
CREATE TABLE dogs_owners (
key STRUCT<ownerId INT, dogId INT> PRIMARY KEY
,mainOwner INT
) WITH (
kafka_topic='dogs_owners',
key_format='json',
value_format='json'
);
Now I would like to create a denormalised view of those three tables, so that I end up with a single table holding all the information, like:
CREATE TABLE dogs_owners_denormalised AS
SELECT
owners.key->id AS ownerId
,owners.name AS ownerName
,dogs.key->id AS dogId
,dogs.name AS dogName
,dogs_owners.mainOwner
FROM dogs_owners
JOIN owners ON (
owners.key->id = dogs_owners.key->ownerId
)
JOIN dogs ON (
dogs.key->id = dogs_owners.key->dogId
);
I understand that I cannot do this kind of join as written, since a table-table join needs to match on the full primary key of the tables involved. What is the recommended ksqlDB approach for N:N tables in this scenario?
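For what it's worth, I know that newer ksqlDB versions (0.19+, as far as I can tell) support foreign-key table-table joins of the form leftTable.fkColumn = rightTable.primaryKey, so one many-to-one hop could in theory look like the sketch below. This is hypothetical: owners_flat and dogs_owners_flat are re-keyed versions of the tables above with flat INT key columns instead of the STRUCT keys Debezium produces, and I'm not sure the foreign-key join accepts a column that is part of the left table's composite primary key:

CREATE TABLE dogs_owners_enriched AS
SELECT
dogs_owners_flat.ownerId AS ownerId
,dogs_owners_flat.dogId AS dogId
,owners_flat.name AS ownerName
,dogs_owners_flat.mainOwner AS mainOwner
FROM dogs_owners_flat
JOIN owners_flat ON (
dogs_owners_flat.ownerId = owners_flat.id
);

Even if that works, I would still need a second hop to join dogs, so I'd like to know whether chaining foreign-key joins is the intended pattern here.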
I would like to avoid using streams because in my real-world scenario (which has many more tables and relations) I would like to run aggregations on this final denormalised table, and the related data from each table can arrive at different times, which makes it hard to determine a suitable window for a stream-stream join.
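To make the end goal concrete, this is the kind of aggregation I would eventually want to run on top of the denormalised table (hypothetical, assuming dogs_owners_denormalised existed), e.g. counting owners per dog:

CREATE TABLE owners_per_dog AS
SELECT
dogId
,COUNT(*) AS ownerCount
FROM dogs_owners_denormalised
GROUP BY dogId;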
Thank you in advance!