LEFT JOIN with INNER JOIN

I’m trying to understand how doing a combination of LEFT JOIN and INNER JOIN works together. (as well as OUTER JOIN and INNER JOIN)

From the documentation it seems the join order is important? Is the semantic logic such that the WITHIN of the 1st join is taken into account and only the results of that are then processed with the next join?

This is especially confusing when it comes to OUTER JOIN being first in the list. E.g.

SELECT *
FROM stream1
LEFT JOIN stream2
WITHIN 10 MINUTES GRACE PERIOD 0 SECONDS
ON stream1.some_field = stream2.some_field
INNER JOIN stream3
WITHIN 20 MINUTES GRACE PERIOD 0 SECONDS
ON stream1.other_field = stream3.other_field
EMIT CHANGES;

In this case for example, how will the windows execute if the left join “fail” and gets a null value? This is the use case we’re trying to build to look for “gaps in the data”, but we also need to rule out some other logical case which is what the inner join would be for.

My understanding is that the 10 minutes from the left join will “time out” and then the inner join will get applied on that record (with the +10m timestamp now) and try to find anythig within 20m of it. Is this correct?

Is the semantic logic such that the WITHIN of the 1st join is taken into account and only the results of that are then processed with the next join?

That’s right.

In this case for example, how will the windows execute if the left join “fail” and gets a null value?

If there is a left/outer join result, and some columns will be NULL after the first join and you want to use those columns in the second join the record would actually be dropped and there won’t be any join result (even if you use left/outer in the second join) as you cannot join on NULL. The record entering the second join with a NULL column on the join attribute will be dropped as invalid.

My understanding is that the 10 minutes from the left join will “time out” and then the inner join will get applied on that record (with the +10m timestamp now) and try to find anythig within 20m of it. Is this correct?

No. Join results are emitted right away. There is no reason to wait for the end of the window – if two records joins, the result will be emitted without delay. Furthermore the result record of the join will get max(leftRecord.ts, rightRecord.ts) as its timestamp.

Last but not least, for your code example, a left/outer-join will trigger eagerly, too – you should add a GRACE clause to avoid potential “spurious” results and to tell the system to only emit left/outer join result after the grace-period passed (you can also use a GRACE of zero, meaning the left/outer join result will be emitted when window-end time is reached). Cf Announcing ksqlDB 0.23.1 - New Features and Updates for more details.

1 Like

Ah ok, that’s different than I thought. So for the “no record found on the right side” cases.

If WITHIN X MINUTES GRACE PERIOD 0 then the “left side only” record will be emitted at the end of the within window. Will it however contain the original left side record rowtime or the after window rowtime?

And if GRACE PERIOD > 0 is this added on top of WITHIN or “alongside”? So e.g. WITHIN 5 MINUTES GRACE PERIOD 1 MINUTE will emit a “not found” record after 1 minute or 6 minutes of left side arriving?

Will it however contain the original left side record rowtime or the after window rowtime?

The result will have the left side record’s timestamp.

If you use grace period, emit time of left/outer join result will be at window-end + grace (if it would be record-ts + grace it could still be spurious and would not really address the issue, as it’s only save to emit left/outer join result after window-end).

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.