Why does ksqlDB register an identical logical schema when providing a schema ID in CSAS/CTAS?

I have the following subjects registered in my Schema Registry:

my_table-key
my_table-value
with schema ID 2.

When I run the following CTAS statement:

CREATE OR REPLACE TABLE my_table 
WITH (KEY_FORMAT='AVRO', VALUE_FORMAT='AVRO', VALUE_SCHEMA_ID=2) 
AS SELECT
...

ksqlDB automatically registers another set of subjects, MY_TABLE-key and MY_TABLE-value, whose schemas are identical to the ones I provided.

I would like to control the schema evolution of my output tables and streams myself. However, I don’t understand why ksqlDB creates another set of schemas that pollute the registry.

If I name my schemas MY_TABLE-value then it seems that ksqlDB does not create extra schemas.

In the docs, I can only find the following w.r.t. physical schemas:

The schema in Schema Registry is a “physical schema”, and the schema in ksqlDB is a “logical schema”. The physical schema, not the logical schema, is registered under the subject <topic-name>-key or <topic-name>-value if corresponding KEY_SCHEMA_ID or VALUE_SCHEMA_ID values are provided.

But this doesn’t seem to really explain this behaviour. What exactly is happening here?


After coming back to this a week later, I believe I’ve figured out the behaviour.

The key takeaway from my above CTAS statement:

CREATE OR REPLACE TABLE my_table 
WITH (KEY_FORMAT='AVRO', VALUE_FORMAT='AVRO', VALUE_SCHEMA_ID=2) 
AS SELECT
...

is that the table is created without a named topic, i.e. without KAFKA_TOPIC='topic-name' in the WITH clause.

By default, ksqlDB creates a backing topic whose name is the stream/table name in upper case. From the docs:

KAFKA_TOPIC
The name of the Kafka topic that backs the stream.

If KAFKA_TOPIC isn’t set, the name of the stream in upper case is used as the topic name.

The default subject naming strategy for the Confluent Schema Registry Client is the TopicNameStrategy. Hence, when a new stream is created without KAFKA_TOPIC being set, a new topic called UPPERCASE(stream-or-table-name) is created. By TopicNameStrategy, new subjects called UPPERCASE(stream-or-table-name)-key and UPPERCASE(stream-or-table-name)-value are created. So no big mystery here.

However:

  1. If a schema ID is provided but no Kafka topic is provided, new subjects named UPPERCASE(stream-or-table-name)-key and UPPERCASE(stream-or-table-name)-value are created.
  2. If a schema ID is provided as well as a Kafka topic, no new subject is created.
  3. If a schema ID is provided for a subject that has the same name as the new subject would have, no new subject is created.

To me, point 1 above seems a bit counter-intuitive, especially since schemas created by ksqlDB have extra metadata w.r.t. Connect.

Currently, I have configured a backing topic for every one of my streams. This allows me to control the schema creation a bit better. The side effect of improving the organization of topics and schemas is a nice little bonus.
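
For anyone landing here, this is roughly what that looks like. Treat it as a minimal sketch rather than a confirmed recommendation: the topic name 'my_table' is picked so that, under TopicNameStrategy, the subjects resolve to the my_table-key and my_table-value I registered myself, and the source stream and columns (source_stream, id, cnt) are made-up placeholders standing in for the elided SELECT.

```sql
-- Assumption: schema ID 2 is the value schema registered under my_table-value,
-- and its fields match the columns produced by the SELECT below.
-- source_stream, id and cnt are hypothetical placeholders.
CREATE OR REPLACE TABLE my_table
WITH (
  KAFKA_TOPIC='my_table',   -- explicit backing topic instead of the default MY_TABLE
  KEY_FORMAT='AVRO',
  VALUE_FORMAT='AVRO',
  VALUE_SCHEMA_ID=2
)
AS SELECT
  id,
  COUNT(*) AS cnt
FROM source_stream
GROUP BY id
EMIT CHANGES;

-- Check which Kafka topic actually backs the table.
DESCRIBE my_table EXTENDED;
```

With the topic named explicitly, this falls under case 2 above, so based on what I observed I would not expect ksqlDB to register extra MY_TABLE-key and MY_TABLE-value subjects.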

It would be great if anyone could share their experiences and best practices for configuring streams and tables like this!
