COUNT() and COUNT_DISTINCT() on ksqlDB

Hello,

I was confuse about my results of COUNT() and COUNT_DISTINCT().
Could everyone helping me in this case?

My Kafka topic has Key and Value of my topic as AVRO.
Structure of my key is contains multiple field as:
"key":{"COL1":{"string":"11"},"COL2":{"string":"222"},"COL3":{"string":"333"},"COL4":{"string":"444"}}

On ksqlDB, I had try to create stream from my Kafka topic and specify key & value format as AVRO like:
CREATE STREAM MY_STREAM_FROM_TOPIC ( KEY_STRUCT STRUCT<COL1 VARCHAR, COL2 VARCHAR, COL3 VARCHAR, COL4 VARCHAR KEY) WITH (kafka_topic='MY_TOPIC',VALUE_FORMAT='avro',KEY_FORMAT='avro');

I would like to check total records and the results that I got as:
1 Kafka topic had total records as 100 that checked by end offset - start offset

2 Using count() on ksqlDB then I got 100 that be same as previous
ksqldb command:
SELECT COUNT(KEY_STRUCT), 1 FROM MY_STREAM_FROM_TOPIC GROUP BY 1 EMIT CHANGES;

3 Using COUNT_DISTINCT() on ksqlDB then I got 85 that lower than results from COUNT()
ksqldb command:
SELECT COUNT_DISTINCT(KEY_STRUCT), 1 FROM MY_STREAM_FROM_TOPIC GROUP BY 1 EMIT CHANGES;

After that I had try to find out which KEY_STRUCT are duplicated on topic that cause the results of COUNT_DISTINCT(KEY_STRUCT) = 85 by using count with having like

#1 : SELECT KEY_STRUCT , COUNT(KEY_STRUCT) FROM MY_STREAM_FROM_TOPIC GROUP BY KEY_STRUCT HAVING  COUNT(KEY_STRUCT) > 1 EMIT CHANGES;
#2 : SELECT KEY_STRUCT , COUNT(*) FROM MY_STREAM_FROM_TOPIC GROUP BY KEY_STRUCT HAVING  COUNT(*) > 1 EMIT CHANGES;

But I didn’t get any record that had COUNT(*) > 1 from the both query.
So I would like to find out that cause of COUNT_DISTINCT(KEY_STRUCT)=85 but COUNT(KEY_STRUCT)=100.

Please help to give any suggestion to me.

Thanks
May

Hi May,

COUNT_DISTINCT returns an approximate count:

Returns the approximate number of unique values of col1 in a group.

The function implementation uses HyperLogLog to estimate cardinalities of 10^9 with a typical standard error of 2 percent.

Approximation is fairly common for counting distinct values in order to achieve lower latency when there are many distinct values. Sometimes approximation is optional but with ksqlDB it’s the only option.

HTH,
Dave

This topic was automatically closed after 30 days. New replies are no longer allowed.