COUNT() and COUNT_DISTINCT() on ksqlDB

rununrath · 11 April 2023 06:55

Hello,

I was confuse about my results of COUNT() and COUNT_DISTINCT().
Could everyone helping me in this case?

My Kafka topic has Key and Value of my topic as AVRO.
Structure of my key is contains multiple field as:
"key":{"COL1":{"string":"11"},"COL2":{"string":"222"},"COL3":{"string":"333"},"COL4":{"string":"444"}}

On ksqlDB, I had try to create stream from my Kafka topic and specify key & value format as AVRO like:
CREATE STREAM MY_STREAM_FROM_TOPIC ( KEY_STRUCT STRUCT<COL1 VARCHAR, COL2 VARCHAR, COL3 VARCHAR, COL4 VARCHAR KEY) WITH (kafka_topic='MY_TOPIC',VALUE_FORMAT='avro',KEY_FORMAT='avro');

I would like to check total records and the results that I got as:
1 Kafka topic had total records as 100 that checked by end offset - start offset

2 Using count() on ksqlDB then I got 100 that be same as previous
ksqldb command:
SELECT COUNT(KEY_STRUCT), 1 FROM MY_STREAM_FROM_TOPIC GROUP BY 1 EMIT CHANGES;

3 Using COUNT_DISTINCT() on ksqlDB then I got 85 that lower than results from COUNT()
ksqldb command:
SELECT COUNT_DISTINCT(KEY_STRUCT), 1 FROM MY_STREAM_FROM_TOPIC GROUP BY 1 EMIT CHANGES;

After that I had try to find out which KEY_STRUCT are duplicated on topic that cause the results of COUNT_DISTINCT(KEY_STRUCT) = 85 by using count with having like

#1 : SELECT KEY_STRUCT , COUNT(KEY_STRUCT) FROM MY_STREAM_FROM_TOPIC GROUP BY KEY_STRUCT HAVING  COUNT(KEY_STRUCT) > 1 EMIT CHANGES;
#2 : SELECT KEY_STRUCT , COUNT(*) FROM MY_STREAM_FROM_TOPIC GROUP BY KEY_STRUCT HAVING  COUNT(*) > 1 EMIT CHANGES;

But I didn’t get any record that had COUNT(*) > 1 from the both query.
So I would like to find out that cause of COUNT_DISTINCT(KEY_STRUCT)=85 but COUNT(KEY_STRUCT)=100.

Please help to give any suggestion to me.

Thanks
May

dtroiano · 11 April 2023 10:56

Hi May,

COUNT_DISTINCT returns an approximate count:

Returns the approximate number of unique values of col1 in a group.

The function implementation uses HyperLogLog to estimate cardinalities of 10^9 with a typical standard error of 2 percent.

Approximation is fairly common for counting distinct values in order to achieve lower latency when there are many distinct values. Sometimes approximation is optional but with ksqlDB it’s the only option.

HTH,
Dave

system · 11 May 2023 06:56

This topic was automatically closed after 30 days. New replies are no longer allowed.

Topic		Replies	Views
Duplicate key in KSQL Table ksqlDB	5	5608	12 May 2021
Trouble with STREAM STREAM JOIN ksqlDB	4	3849	22 April 2021
Aggregation for complex structures ksqlDB	3	4041	20 June 2021
KsqlDB Table-Table join on STRUCT Type Primary Key ksqlDB	1	1103	29 February 2024
How to remove deduplication upon struct in ksqldb ksqlDB	1	3103	18 February 2022

COUNT() and COUNT_DISTINCT() on ksqlDB

Related topics