Optimal Avro Outbox

Goal

I want to produce Kafka messages in the desired Avro schema using the outbox pattern.

Background
The outbox pattern avoids inconsistencies when writing both to a database and to Kafka.
You write the messages into an “outbox” table in the same database transaction as your other changes.
Then you use a tool like Debezium or the JDBC Source Connector to move the messages from the outbox table to Kafka.

Article: Reliable Microservices Data Exchange With the Outbox Pattern
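
For illustration, a minimal sketch of that same-transaction write, assuming hypothetical orders and outbox tables (all table and column names here are made up):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

// Sketch: write the business change and the outbox message in one transaction,
// so either both rows are committed or neither is.
public class OutboxWriter {

    public void placeOrder(Connection conn, String orderId, byte[] eventPayload) throws Exception {
        conn.setAutoCommit(false);
        try (PreparedStatement order = conn.prepareStatement(
                 "INSERT INTO orders (id, status) VALUES (?, 'PLACED')");
             PreparedStatement outbox = conn.prepareStatement(
                 "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, 'OrderPlaced', ?)")) {

            order.setString(1, orderId);
            order.executeUpdate();

            outbox.setString(1, orderId);
            outbox.setBytes(2, eventPayload);
            outbox.executeUpdate();

            conn.commit(); // both rows become visible atomically
        } catch (Exception e) {
            conn.rollback();
            throw e;
        }
    }
}
```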

Challenge
What is the best way to write Avro messages with this pattern?

You could directly write Avro messages to the outbox table.
But serializing the Avro messages requires some communication with the Schema Registry.
Even if the Schema Registry data is then cached, you would never want to communicate with the Schema Registry during a database transaction.

How might we avoid a runtime dependency on the Schema Registry?

Options?

You could write JSON and convert it to Avro later.
Maybe with ksqlDB – but that would be another moving piece in your system.
Maybe with a Connect converter – but that may not be trivial; and could you guarantee that the JSON matches the schema?

What other options are there?
How do you do this?

1 Like

If you go down the “store as JSON” approach, I highly recommend looking at GitHub - allegro/json-avro-converter (a JSON to Avro conversion tool designed to make migration to Avro easier) to do the conversion from JSON to Avro. It cannot handle unions of multiple record types, but I haven’t had issues with it otherwise. You will need somewhere to specify the .avsc so that the library knows how to convert the data types correctly.
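
A minimal sketch of how that conversion typically looks, assuming the converter’s overloads that accept an org.apache.avro.Schema; the schema and JSON below are illustrative only:

```java
import java.nio.charset.StandardCharsets;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import tech.allegro.schema.json2avro.converter.JsonAvroConverter;

// Sketch: convert an outbox JSON payload to Avro using allegro/json-avro-converter.
public class JsonToAvroExample {

    public static void main(String[] args) {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"OrderPlaced\",\"fields\":["
            + "{\"name\":\"orderId\",\"type\":\"string\"},"
            + "{\"name\":\"amount\",\"type\":\"double\"}]}");

        byte[] json = "{\"orderId\":\"o-42\",\"amount\":19.99}".getBytes(StandardCharsets.UTF_8);

        JsonAvroConverter converter = new JsonAvroConverter();

        // Binary Avro bytes matching the schema...
        byte[] avroBytes = converter.convertToAvro(json, schema);

        // ...or a GenericData.Record, if you want to hand it to an Avro serializer.
        GenericData.Record record = converter.convertToGenericDataRecord(json, schema);
        System.out.println(record);
    }
}
```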

2 Likes

Yes, converting from JSON to Avro seems the best way. I once worked with a similar setup, where Debezium would produce a ‘raw’ topic, and using a schema we would create a ‘neat’ topic. In that case it was Protobuf without a schema registry, but the same thing is possible with Avro.

You do have to think about how you want to evolve the schema. In our case we first updated the shared Protobuf definition, and sometimes we would regenerate the ‘neat’ topic with the new fields.

2 Likes

The key issue here is that Avro and other IDLs like Protobuf are not aware of their own schema IDs and do not ship with them by default. With Confluent’s default Serdes, the schema ID is fetched at runtime. As you have noted, this is a point of failure and in some sense wasteful: schema IDs are immutable after creation in the Schema Registry, which makes them the perfect candidate for permanent caching. We can call this the traditional integration approach supplied by Confluent, where your applications need to fetch schema IDs at runtime.

Another approach is to take advantage of the fact that schema IDs are immutable. It works well if you fully control the producer/consumer and can modify your schema artifact build processes.

The approach can be summarized as “baking in” the schema IDs into the schema artifacts that you build from your respective IDL (Avro, Protobuf, etc.). Most applications already include a dependency containing the code generated from your schemas for developers to code against. By piggybacking on this existing artifact setup, a custom build pipeline can also project the necessary schema IDs into the artifact. You will need to think about which subject naming strategy you are using, as this determines which schema IDs you need to place into the “enhanced” artifacts. I find that the RecordNameStrategy works well with this “baking in” approach, as you do not need to worry about which topics the schema is sent to.
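
As a rough sketch of what such a “baked in” mapping could look like, assuming the build pipeline emits a properties file keyed by fully qualified record name (the resource name and format here are hypothetical):

```java
import java.io.InputStream;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

// Sketch: load a build-time generated mapping of record name -> schema ID
// that was "baked" into the schema artifact.
public final class BakedInSchemaIds {

    // e.g. com.example.events.OrderPlaced=42
    private static final String RESOURCE = "/schema-ids.properties";

    public static Map<String, Integer> load() {
        Properties props = new Properties();
        try (InputStream in = BakedInSchemaIds.class.getResourceAsStream(RESOURCE)) {
            props.load(in);
        } catch (Exception e) {
            throw new IllegalStateException("Missing baked-in schema IDs", e);
        }
        return props.stringPropertyNames().stream()
                .collect(Collectors.toMap(name -> name,
                        name -> Integer.valueOf(props.getProperty(name))));
    }
}
```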

This projection can be a file added to the artifact, or a generated class, from which a custom Serde aware of the “bake in” approach reads the schema IDs locally instead of needing to reach out to the Schema Registry. For the Java stack, one can implement the SchemaRegistryClient interface with an implementation that reads from a local file or in-memory map (call it LocalRegistryClient).
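
A serializer built that way needs no registry at runtime; it only has to emit the Confluent wire format (magic byte 0, then the 4-byte big-endian schema ID, then the Avro binary encoding). A minimal sketch, reusing the hypothetical baked-in map from above:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Map;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

// Sketch: serialize to the Confluent wire format using a baked-in schema ID,
// so no Schema Registry call is needed on the hot path.
public class BakedInAvroSerializer {

    private final Map<String, Integer> schemaIds = BakedInSchemaIds.load();

    public byte[] serialize(GenericRecord record) throws Exception {
        // RecordNameStrategy-style lookup: key is the fully qualified record name.
        int schemaId = schemaIds.get(record.getSchema().getFullName());

        ByteArrayOutputStream payload = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(payload, null);
        new GenericDatumWriter<GenericRecord>(record.getSchema()).write(record, encoder);
        encoder.flush();

        return ByteBuffer.allocate(1 + 4 + payload.size())
                .put((byte) 0)              // magic byte
                .putInt(schemaId)           // schema ID from the baked-in map
                .put(payload.toByteArray()) // Avro binary body
                .array();
    }
}
```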

I implemented this approach for mobile devices: instead of exposing the Schema Registry as in the traditional method, I utilized the “bake in” approach.

Now millions of mobile devices do not need to reach out to the Schema Registry, which removes a point of failure, reduces costs in terms of requests and money, and avoids all the ops/security work of exposing the Schema Registry in an HA setup.

You can view more about this approach in a talk I did recently (“Schemas Beyond The Edge” if the link does not work).

Note that for your outbox use case you could avoid producing the Confluent-encoded Avro until the last possible moment; that is, keep any schema-ID-aware Serde out of the path until the outbox poller processes the event, and just persist vanilla Avro bytes to your outbox table.

This would require a custom app with some configuration, plus changes to your outbox table to track which schema the raw Avro bytes were written with (e.g. the record class name), so the schema ID can be looked up in the Schema Registry later. By doing so, the critical path involves only your business write and the outbox write with vanilla Avro based on your event definition (no Schema Registry needed). The actual fetching of the schema ID for the serialized vanilla bytes is then done by the custom process, decoupling it from the critical request path.
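
As a sketch of that poller step, assuming RecordNameStrategy (so the subject equals the record name stored in the outbox row) and hypothetical table, column, and topic names:

```java
import java.nio.ByteBuffer;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch: the outbox row stores vanilla Avro bytes plus the record name; the
// poller resolves the schema ID outside the business transaction and wraps the
// bytes in the Confluent wire format before producing.
public class OutboxPoller {

    private final SchemaRegistryClient registry; // or a local/baked-in lookup
    private final KafkaProducer<byte[], byte[]> producer;

    public OutboxPoller(SchemaRegistryClient registry, KafkaProducer<byte[], byte[]> producer) {
        this.registry = registry;
        this.producer = producer;
    }

    public void pollOnce(Connection conn) throws Exception {
        try (PreparedStatement stmt = conn.prepareStatement(
                 "SELECT id, aggregate_id, event_type, payload FROM outbox ORDER BY id LIMIT 100");
             ResultSet rs = stmt.executeQuery()) {

            while (rs.next()) {
                // RecordNameStrategy: subject == fully qualified record name.
                String subject = rs.getString("event_type");
                int schemaId = registry.getLatestSchemaMetadata(subject).getId();

                byte[] avro = rs.getBytes("payload");
                byte[] wireFormat = ByteBuffer.allocate(1 + 4 + avro.length)
                        .put((byte) 0).putInt(schemaId).put(avro).array();

                producer.send(new ProducerRecord<>(
                        "orders", rs.getString("aggregate_id").getBytes(), wireFormat));
                // marking/deleting the processed outbox row is omitted here
            }
        }
    }
}
```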

2 Likes