Hello,
I was confuse about my results of COUNT() and COUNT_DISTINCT().
Could everyone helping me in this case?
My Kafka topic has Key and Value of my topic as AVRO.
Structure of my key is contains multiple field as:
"key":{"COL1":{"string":"11"},"COL2":{"string":"222"},"COL3":{"string":"333"},"COL4":{"string":"444"}}
On ksqlDB, I had try to create stream from my Kafka topic and specify key & value format as AVRO like:
CREATE STREAM MY_STREAM_FROM_TOPIC ( KEY_STRUCT STRUCT<COL1 VARCHAR, COL2 VARCHAR, COL3 VARCHAR, COL4 VARCHAR KEY) WITH (kafka_topic='MY_TOPIC',VALUE_FORMAT='avro',KEY_FORMAT='avro');
I would like to check total records and the results that I got as:
1 Kafka topic had total records as 100 that checked by end offset - start offset
2 Using count() on ksqlDB then I got 100 that be same as previous
ksqldb command:
SELECT COUNT(KEY_STRUCT), 1 FROM MY_STREAM_FROM_TOPIC GROUP BY 1 EMIT CHANGES;
3 Using COUNT_DISTINCT() on ksqlDB then I got 85 that lower than results from COUNT()
ksqldb command:
SELECT COUNT_DISTINCT(KEY_STRUCT), 1 FROM MY_STREAM_FROM_TOPIC GROUP BY 1 EMIT CHANGES;
After that I had try to find out which KEY_STRUCT are duplicated on topic that cause the results of COUNT_DISTINCT(KEY_STRUCT) = 85 by using count with having like
#1 : SELECT KEY_STRUCT , COUNT(KEY_STRUCT) FROM MY_STREAM_FROM_TOPIC GROUP BY KEY_STRUCT HAVING COUNT(KEY_STRUCT) > 1 EMIT CHANGES;
#2 : SELECT KEY_STRUCT , COUNT(*) FROM MY_STREAM_FROM_TOPIC GROUP BY KEY_STRUCT HAVING COUNT(*) > 1 EMIT CHANGES;
But I didn’t get any record that had COUNT(*) > 1 from the both query.
So I would like to find out that cause of COUNT_DISTINCT(KEY_STRUCT)=85 but COUNT(KEY_STRUCT)=100.
Please help to give any suggestion to me.
Thanks
May