Hi,
Current doing exploring the use of the Confluent Community Platform V.7.0.0 and have come across the following oddity when joining datatsets in KSQL. We are using a 3 node cluster that is running on hardened VM’s so things like the /tmp has been repointed using the ENV variables.
The issue that we have come across is that when using a Join, the joined data does not always return, unfortuantely I am not in a position to provide any example datasets as all the data has been torn down, I will attempt to explain the issue as clearly as I can though. The issue has been seen on at least two different queries and both where slightly different.
This example may be pushing the use case of KSQL beyond what it is intended for;
stream = stream dataset with a data struct field
materialised_table = create table as select with configuration data
Select a.field, b.field
From stream as a
join materialised_table as b on b.key = coalesce(a.data[‘nofield’],‘key_value’) ;
The join here is effectively creating a 1=1 join so should always return the field from the table, however what I noticed when running this query that we would either get 100%, 99% or 50% of the records returned, changing the join to be a left outer join would as expected return 100% of the a records but with either 50% or 99% o of the joined b records. Things get even weirder when changing the query as follows;
stream = stream dataset with a data struct field
materialised_table_object = create table as select with the same object type as the stream data
materialised_table_config = create table as select with configuration data
Select a.field, c.field
From stream as a
join materialised_table_object as b on a.id = b.id
join materialised_table_config as c on c.key = coalesce(a.data[‘nofield’],‘key_value’) ;
adding an additional join in first, would mean that 100% of the records joined, even without using b in the returned dataset. If however swapping the order of teh joins around, the same issue was experienced as above where not all expected records are returned.
As mentioned the config join is probably pushing the boundries of the intention of the joins, however the same issue has also been encountered when using more standard joins, when adding an additional non referenced join in to the query seems to ensure that all records are returned.
i.e.
select
from stream
join table_a on stream-id = table_a.id
join table_b on stream.fk_if = tableb.id
which would return 100% of table_b and 0% of table_a
vs
select
from stream
join table_c on stream-id = table_c.id
join table_a on stream-id = table_a.id
join table_b on stream.fk_if = tableb.id
which would return 100% of bother a and b. where tables a and c hold different objects.
Any idea what could be causing this problem? as the issue has been relatively starighforward to spot with a small test dataset but if we where to move forward with something like this in production, it would prove very difficult to track an issue like this down. Are we just pushing the boundries of joins inside of KSQL?
Thanks in advance for any help,
Leander