I would like to know the best approach to denormalise an N:N (many-to-many) table in ksqlDB.

For example, consider a scenario with the following pipeline: SQL DB → Debezium → Kafka → ksqlDB, where these tables exist on the SQL side:
```sql
CREATE TABLE dogs (
  id INT(11) NOT NULL AUTO_INCREMENT,
  name VARCHAR(255),
  PRIMARY KEY (id)
);

CREATE TABLE owners (
  id INT(11) NOT NULL AUTO_INCREMENT,
  name VARCHAR(255),
  PRIMARY KEY (id)
);

CREATE TABLE dogs_owners (
  ownerId INT(11),
  dogId INT(11),
  mainOwner TINYINT(1),
  PRIMARY KEY (ownerId, dogId)
);
```
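For reference, my Debezium source connector looks roughly like the sketch below (connector name, hosts, and credentials are placeholders; I'm assuming the `ExtractNewRecordState` SMT to flatten the change events into plain rows):

```json
{
  "name": "petshop-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "1",
    "topic.prefix": "petshop",
    "database.include.list": "petshop",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState"
  }
}
```

(Debezium prefixes topic names with `topic.prefix` and the database name, so in practice a `RegexRouter` transform, or matching the full topic names in the ksqlDB statements, is also needed; I've omitted that for brevity.)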
As you can see, dogs_owners is an N:N table: dogs can have multiple owners, and owners can have multiple dogs. Through the CDC pipeline, each of these tables becomes a Kafka topic that receives every change made to it (I'm using the JSON converter rather than Avro, if that makes any difference). I would then like to create tables in ksqlDB referencing those topics, like:
```sql
DROP TABLE IF EXISTS dogs;
CREATE TABLE dogs (
  key STRUCT<id INT> PRIMARY KEY,
  name VARCHAR
) WITH (
  kafka_topic='dogs',
  key_format='json',
  value_format='json'
);

DROP TABLE IF EXISTS owners;
CREATE TABLE owners (
  key STRUCT<id INT> PRIMARY KEY,
  name VARCHAR
) WITH (
  kafka_topic='owners',
  key_format='json',
  value_format='json'
);

DROP TABLE IF EXISTS dogs_owners;
CREATE TABLE dogs_owners (
  key STRUCT<ownerId INT, dogId INT> PRIMARY KEY,
  mainOwner INT
) WITH (
  kafka_topic='dogs_owners',
  key_format='json',
  value_format='json'
);
```
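The STRUCT primary keys above reflect how the records look on the wire. For example, a record on the dogs_owners topic should look roughly like this (my assumption, based on Debezium's JSON converter defaults with schemas disabled):

```json
{"ownerId": 1, "dogId": 2}
```

as the Kafka message key, and

```json
{"ownerId": 1, "dogId": 2, "mainOwner": 1}
```

as the message value.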
Now I would like to create a denormalised view of these three tables, so that a single table holds all the info, like:
```sql
CREATE TABLE dogs_owners_denormalised AS
  SELECT
    owners.id AS ownerId,
    owners.name AS ownerName,
    dogs.id AS dogId,
    dogs.name AS dogName,
    dogs_owners.mainOwner
  FROM dogs_owners
  JOIN owners ON owners.key->id = dogs_owners.key->ownerId
  JOIN dogs ON dogs.key->id = dogs_owners.key->dogId;
```
I understand that I cannot do this kind of join, since table-table joins require a match on the full primary key. What is the ksqlDB approach for N:N tables in this scenario?
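For what it's worth, I'm aware that newer ksqlDB versions (0.19+) support foreign-key table-table joins, so I imagine the first hop might look something like the sketch below. This is untested; whether FK joins work against a STRUCT key like mine, and whether two of them can be chained to cover the N:N case, is exactly what I'm unsure about:

```sql
-- first FK join: enrich the link table with owner data
CREATE TABLE dogs_owners_with_owner AS
  SELECT
    dogs_owners.key AS key,
    dogs_owners.mainOwner AS mainOwner,
    owners.name AS ownerName
  FROM dogs_owners
  JOIN owners ON dogs_owners.key->ownerId = owners.key->id;
```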
I would like to avoid using streams, because in my real-world scenario (which has many more tables and relations) I want to run aggregations on this final denormalised table, and the related data from each table can arrive at different times, so it's hard to determine a suitable window for a stream-stream join.
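To illustrate the kind of aggregation I have in mind (a hypothetical example over the denormalised table above):

```sql
-- count how many dogs each owner has, updating as CDC events arrive
SELECT ownerId, ownerName, COUNT(*) AS dogCount
FROM dogs_owners_denormalised
GROUP BY ownerId, ownerName
EMIT CHANGES;
```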
Thank you in advance!