Confluent Cloud Schema Registry, JSON schemas, and Spark

Hey there!

I’m attempting to consume from a Confluent Cloud Kafka topic using Spark, and I’d like to use the built-in Schema Registry, where the schema has already been created/registered, to define the schema on the Spark side. I’m running into a few hiccups trying to make this work. (It should be noted I’ve scoured the internet for 6+ hours looking for answers, including time spent in this forum, before deciding to post here.)

Schema Registry Used: Yes, Confluent Cloud
Secured: Yes
Topic Schema Format: JSON schema
Consumer: Spark readStream (in Databricks)
References: consume-avro-data-from-kafka-topics-and-secured-schema-registry-with-databricks-confluent-cloud
Databricks Spark with Schema Registry

The above article assumes the topic messages are Avro serialized with an Avro schema. That is not my case; they are JSON serialized with a JSON schema. Can I still make use of the Schema Registry for defining the schema in Spark, or will I need to create a custom process for retrieving the schema?

I’m happy to provide more context as needed in order to find an optimal solution.

Regards,
Brandon

You can always consume the raw byte arrays from Kafka with Spark (which is the default, anyway) and wrap the JSONSchemaDeserializer in a Spark UDF that you’d invoke through a DataFrame select.
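To make that concrete, here’s a minimal PySpark sketch of the byte-array approach. Rather than instantiating the Confluent deserializer inside the UDF, it simply strips the 5-byte wire-format header (magic byte + 4-byte schema ID) and parses the remaining JSON with from_json; the topic name, credentials, and the fields in the StructType are placeholders you’d swap for your own:

```python
from pyspark.sql.functions import col, from_json, udf
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Confluent-framed messages carry a magic byte plus a 4-byte schema ID
# in front of the payload, so strip the first 5 bytes and decode the JSON.
@udf(StringType())
def strip_wire_format(value):
    if value is None:
        return None
    return bytes(value[5:]).decode("utf-8")

# Placeholder StructType -- replace with the fields from your registered JSON schema.
payload_schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "<bootstrap-server>:9092")
       .option("subscribe", "<topic>")
       .option("kafka.security.protocol", "SASL_SSL")
       .option("kafka.sasl.mechanism", "PLAIN")
       # Databricks ships a shaded Kafka client, hence the kafkashaded prefix.
       .option("kafka.sasl.jaas.config",
               'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule '
               'required username="<api-key>" password="<api-secret>";')
       .load())

parsed = (raw
          .withColumn("json_str", strip_wire_format(col("value")))
          .withColumn("data", from_json(col("json_str"), payload_schema))
          .select("data.*"))
```

If you’d rather have the actual JSONSchemaDeserializer do the work, you’d construct it inside the UDF instead of the slice above, but since the bytes after the header are plain JSON, the header strip is usually all you need.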

This blog shows Avro, yes, but at least the steps to extract and download the schema will be useful, since the wire format for the Registry payloads is similar. I’m not sure how well Spark works with JSON Schema, however, so you might still need to manually define a Spark StructType.
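For pulling the registered schema itself, something like this against the Schema Registry REST API (hypothetical subject name and credentials) gives you the JSON Schema document, which you can then translate into a StructType by hand:

```python
import json

import requests

# Hypothetical Confluent Cloud Schema Registry endpoint and API credentials.
SR_URL = "https://<schema-registry-endpoint>"
SR_AUTH = ("<sr-api-key>", "<sr-api-secret>")

# Latest registered schema for the topic's value subject (default TopicNameStrategy).
resp = requests.get(f"{SR_URL}/subjects/<topic>-value/versions/latest", auth=SR_AUTH)
resp.raise_for_status()

# The "schema" field holds the JSON Schema document as a string.
json_schema = json.loads(resp.json()["schema"])
print(json_schema.get("properties", {}))  # field names/types to map onto a StructType
```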