Hi,
Thanks for the video, it was an interesting watch.
However I dont think this is the cause, the explanation given in the video matches how I would expect the joins to work, yet the results I am seeing are non-determanistic.
I will try and provide a clearer example, however without a load of work, I cant provide a generic dataset for testing. I am hoping I can be pointed in the right direction with something I am doing wrong before it gets that far.
event_stream == event stream of all changes from a CI system. All events are lumped together with a category that denotes the type of object that is changed. the majority of the data is passed in as a JSON/struct object which is defined in the AVRO schema as a string for both the key and the value.
lifecycle_table = create table as select data[‘id’], data[‘name’] as lifecycle from event_stream where category = ‘lifecycle’ emit changes;
type_table = create table as select data[‘id’], data[‘name’] as type from event_stream where category = ‘type’ emit changes;
Now with the taps turned off (no new events flowing through) running the following query retrurns non-determanistic results;
select a.data['id#], b.name as lifecycle, c.name as type
from event_stream a
left outer join lifecycle_table b on a.data[‘lifecycle_id’] = b.id
left outer join type_table c on a.data[‘type_id’] = c.id
where category = ‘CI’;
This can return the following results
1, retired, physical
2, retired, virtual
3, active, null
4, retired, null
5, maintenance, null
or just by immediately re-running the query with no new events;
1, retired, null
2, retired, null
3, active, null
4, retired, null
5, maintenance, null
If I change the order of the joins around, I have even seen the following;
select a.data['id#], b.name as lifecycle, c.name as type
from event_stream a
left outer join type_table c on a.data[‘type_id’] = c.id
left outer join lifecycle_table b on a.data[‘lifecycle_id’] = b.id
where category = ‘CI’;
1, null, null
2, null, null
3, null, null
4, null, null
5, null, null
it gets weirder yet, by defining using an extra materialised table in the equasion;
another_table = create table as select data['id], data[‘name’] from event_stream where category = ‘someothervalidcategory’ emit changes;
then by running
select a.data['id#], b.name as lifecycle, c.name as type
from event_stream a
left outer join another_table d on a.data[‘id’] = d.id
left outer join lifecycle_table b on a.data[‘lifecycle_id’] = b.id
left outer join type_table c on a.data[‘type_id’] = c.id
where category = ‘CI’;
1, retired, physical
2, retired, virtual
3, active, virtual
4, retired, virtual
5, maintenance, physical
is returned.
However again changing the query to the following;
select a.data['id#], b.name as lifecycle, c.name as type
from event_stream a
left outer join another_table d on a.data[‘id’] = d.id
left outer join type_table c on a.data[‘type_id’] = c.id
left outer join lifecycle_table b on a.data[‘lifecycle_id’] = b.id
where category = ‘CI’;
swapping b & c around in the order of the joins would consistently return;
1, null, null
2, null, null
3, null, null
4, null, null
5, null, null
Based on this result behaviour, the results produced are not deterministic.
The tables that have been defined where done so by setting the offset to earlist so that all event data gets processed.
When running the join query again, this is being done with the offset having been set to earliest.
Is this then whats causing the oddity in the behaviour? Would overwritting and setting the timestamp to 0 on the materialised tables then potentially fix this problem as suggested in the video? as those records would always be older then any of the streamed events? would that make sense as all the data here is effectively coming from the same source topic, just sliced in different ways?
Thanks again for the help,
Leander