Why does ksqlDB register an identical logical schema when providing a schema ID in CSAS/CTAS?

filpano · 1 March 2024 17:38

I have the following schema registered in my schema registry:

my_table-key
my_table-value
with schema ID 2.

When I run the following CTAS statement:

CREATE OR REPLACE TABLE my_table 
WITH (KEY_FORMAT='AVRO', VALUE_FORMAT='AVRO', VALUE_SCHEMA_ID=2) 
AS SELECT
...

ksqlDB automatically registers another set of schemas under the subjects MY_TABLE-key and MY_TABLE-value. These are identical to the ones I provided.

I would like to control the schema evolution of my output tables and streams myself. However, I don’t understand why ksqlDB creates another set of schemas that pollute the registry.

If I name my schemas MY_TABLE-value then it seems that ksqlDB does not create extra schemas.

In the docs, I can only find the following w.r.t. physical schemas:

The schema in Schema Registry is a “physical schema”, and the schema in ksqlDB is a “logical schema”. The physical schema, not the logical schema, is registered under the subject <topic-name>-key or <topic-name>-value if corresponding KEY_SCHEMA_ID or VALUE_SCHEMA_ID values are provided.

But this doesn’t seem to really explain this behaviour. What exactly is happening here?

filpano · 27 March 2024 11:05

I believe I’ve figured out this behaviour after coming back to it after a week.

The key takeaway from my above CTAS statement:

CREATE OR REPLACE TABLE my_table 
WITH (KEY_FORMAT='AVRO', VALUE_FORMAT='AVRO', VALUE_SCHEMA_ID=2) 
AS SELECT
...

is that this table is created without a named topic, that is, without KAFKA_TOPIC='topic-name' in the WITH clause.

By default, ksqlDB creates a backing topic whose name is the stream/table name in upper case. From the docs:

KAFKA_TOPIC
The name of the Kafka topic that backs the stream.

If KAFKA_TOPIC isn’t set, the name of the stream in upper case is used as the topic name.

The default subject naming strategy for the Confluent Schema Registry Client is the TopicNameStrategy. Hence, when a new stream is created without KAFKA_TOPIC being set, a new topic called UPPERCASE(stream-or-table-name) is created. By TopicNameStrategy, new subjects called UPPERCASE(stream-or-table-name)-key and UPPERCASE(stream-or-table-name)-value are created. So no big mystery here.
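To check which backing topic a table actually ended up with, statements like the ones below can be run from the ksqlDB CLI. This is only a sketch: MY_TABLE is the table from this thread, and the exact output layout depends on the ksqlDB version.

```sql
-- List the topics visible to ksqlDB. Without KAFKA_TOPIC set,
-- an uppercase MY_TABLE topic is expected to show up here.
SHOW TOPICS;

-- Show the table's details, including the Kafka topic that backs it.
DESCRIBE MY_TABLE EXTENDED;
```

The subjects derived from that topic name under TopicNameStrategy (MY_TABLE-key and MY_TABLE-value) can then be looked up in Schema Registry.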

However:

1. If a schema ID is provided but no Kafka topic is provided, new subjects named after UPPERCASE(stream-or-table-name) are created.
2. If a schema ID is provided as well as a Kafka topic, no new subject is created.
3. If a schema ID is provided for a subject that has the same name as the new subject would have, no new subject is created.

To me, point 1 above seems a bit counter-intuitive, especially since schemas created by ksqlDB carry extra metadata related to Connect.

Currently, I have configured a backing topic for every one of my streams. This allows me to control the schema creation a bit better. The side effect of improving the organization of topics and schemas is a nice little bonus.
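As a minimal sketch of that setup, the statement below names the backing topic explicitly so that it lines up with the existing my_table-key and my_table-value subjects. The source stream source_stream and the columns customer_id and total are hypothetical placeholders, not taken from my real query, and the sketch assumes that the schema with ID 2 is compatible with the query's output columns.

```sql
-- Hypothetical query: source_stream, customer_id and total are placeholders.
-- KAFKA_TOPIC is set explicitly so the backing topic is my_table rather than
-- the uppercase default MY_TABLE. Schema ID 2 is the one mentioned earlier.
CREATE OR REPLACE TABLE my_table
WITH (
  KAFKA_TOPIC='my_table',
  KEY_FORMAT='AVRO',
  VALUE_FORMAT='AVRO',
  VALUE_SCHEMA_ID=2
) AS
SELECT customer_id, COUNT(*) AS total
FROM source_stream
GROUP BY customer_id
EMIT CHANGES;
```

Going by the observations above, this should not produce additional MY_TABLE-key or MY_TABLE-value subjects, although I have only verified this against my own setup.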

It would be great if anyone could share their experiences or best practices for configuring streams and tables in cases like this!

