Hello,
we are running confluentinc/cp-ksqldb-server:7.6.0
and are facing a following problem with regard to daily windowing due to UTC timestamps.
We are running aggregations on data that is essentially similar to the following:
{
"id": "<uuid>"
"timestamp": "2024-02-27:02:56:00.000Z",
"quantity": 3.25
}
I want to create a table that groups data with the same id
using windowing, as follows:
CREATE OR REPLACE TABLE aggregated WITH (KEY_FORMAT='AVRO', VALUE_FORMAT='AVRO')
AS SELECT
ID, AS_VALUE(ID) AS "dataId",
SUM(QUANTITY) AS "sumQuantity",
AS_VALUE(windowstart) as "day"
FROM INPUT_DATA
WINDOW TUMBLING ( SIZE 1 DAY, GRACE PERIOD 72 HOURS )
GROUP BY ID;
However, we want to aggregate not by calendar days but by business days, which for us are defined as the interval from 04:00:00.000Z
, inclusive, until 03:59:59.999Z
, exclusive.
The problem is essentially what is described in the docs:
ksqlDB is based on the Unix epoch time in the UTC timezone, and this can affect time windows. For example, if you define a 24-hour tumbling time window, it will be in the UTC timezone, which may not be appropriate if you want to have daily windows in your timezone.
Is it possible to use a custom timestamp extractor so that WINDOW
uses my definition of intervals for a day (i.e. [04:00:00.000Z, 03:59:59.999)
instead of the default [00:00:00.000Z, 23:59:59.999Z)
?
If not, is there a recommended way to deal with this issue?
One way that I can think of would be to have a separate table of “days” that I join my stream on based on which day interval the timestamp of the data lands on, but seeing how this table and the stream would (by design) have different time semantics, it feels like the wrong approach.
Another way would be to simply offset all of my data by 4 hours
. If nothing else, this approach should work but requires either:
- adding a custom timestamp field containing the offset timstamp, or
- a copy of the data, or
- manipulating the raw timestamp
The first of which seems like the least offensive and invasive approach. It would be ideal, though, if this logic were encapsulated within the business-case depicted by the specific ksql query.