Snapshotting with the outbox pattern

We’re in the planning phase of building tooling to help teams expose data products that other teams can consume, creating materialized views in their own databases. Many engineers are interested in doing so by writing business events to an outbox table, following the outbox pattern. If teams write events to an outbox table, they make updates to their data available, but not data that has never been updated and therefore never written to the outbox. I can think of a number of solutions to the problem, but I’d love to hear what others have done in a similar situation and how it’s gone.

I’m currently considering:

  • A one-time snapshot job that writes snapshot events for all the current data to the outbox, maintaining full retention of the topic. Replaying a full event history seems tractable for some time given our current data size. I estimate ~1h at most for the largest data sets, though this would need testing.
  • A periodic snapshot job, allowing reduced retention of the topic. The snapshot job would need to be kept up to date with schema changes. It would also make it faster to replay events in the event of something like a consumer bug.
  • Move event data out of the outbox table, and attach Debezium to the domain tables. Build a kstreams app that groups by transaction and reconstructs event data from the domain tables’ CDC data. Having to reconstruct events outside of the application layer feels awkward, but we could take advantage of Debezium’s built-in snapshotting.

I’m not sure I understand the question. Data in domain tables that has not been updated should not result in new events in the outbox. It may be worth restating this part so it’s clearer what problem you’re facing.

1 Like

I think they may be asking: “how do I get domain data that hasn’t been updated, and therefore hasn’t been put into the outbox, into a Kafka topic?”

If this is the case, then let’s start with the options you presented:
1) one-time snapshot:
This is my preferred route. Full retention is far more effective than using topics with reduced retention (option 2) - you may spend a bit more on storage costs, but you can tame this with compaction. Additionally, you save the money and time you’d spend on the engineering effort of trying to support option 2. Remember, it’s up to the consumer to specify whether they want to read the whole topic or not - they’re absolutely free to start reading much closer to “now” - it’s just that they have the option, without any extra work on your part.

2) Periodic Snapshots / reduced retention:
I have to admit I’m not really sure what the periodic snapshots mean - is this a snapshot on demand? e.g. a consumer requests a snapshot of history because it has expired out of the topic? If so, the reason this is a poor choice is that you don’t want snapshotted history going to all of your existing consumers. I’ve seen a lot of attempts to solve this in other ways, such as dedicated topics per consumer, but it all ends up being complicated and wasteful, and incurs high engineering costs overall. I suggest you stick with infinite retention and let the consumers manage where they want to start consuming from.

3) Debezium + Domain + Stream Processors:
I have implemented this pattern myself several times. It’s fairly heavyweight in the sense that each table becomes a Kafka topic, but you do get the benefit of reconstructing the data how you want, outside of the app. It can be highly effective, but it also requires you to set up and run Debezium, Kafka Connect, and a Kafka Streams app (unless you already have all of this running?)
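
To make this concrete, here’s roughly the shape such a Kafka Streams app could take. This is only a sketch: the topic names, the transaction-id extraction, and the merge logic are all placeholders, not something from your setup.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Produced;

// Rough shape of option 3: read a Debezium change topic for a domain table,
// re-key each change by its transaction id, and fold the changes for one
// transaction back into a single business event. Topic names and the merge
// logic are placeholders.
public class EventReconstructor {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        builder.stream("cdc.public.orders", Consumed.with(Serdes.String(), Serdes.String()))
               .selectKey((key, change) -> extractTxId(change))
               .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               // naive merge: in reality you'd map the row changes back into your event schema
               .reduce((merged, change) -> merged + "\n" + change)
               .toStream()
               .to("order-events", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-reconstructor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }

    // Placeholder: pull the transaction id out of the Debezium envelope
    // (the Postgres connector exposes one under source.txId).
    private static String extractTxId(String changeJson) {
        return changeJson; // stub only - a real version would parse the envelope
    }
}
```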

Here’s what I suggest:

Go with option 1. One way to snapshot existing data is to select all of the unmodified data from the relevant tables, and run it through the very same business logic that populates the outbox. After all, this is precisely what you’ll be doing in production.

This will ensure that the outbox data created from the domain snapshots is generated with the same code that will generate it at runtime. I assume you also know how you’re going to get the data out of the outbox and into Kafka already, so I’ll forgo that.
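
As a sketch of what I mean, with made-up names (Customer, CustomerRepository, OutboxWriter standing in for whatever you already have): the snapshot job just pages through the table and hands each row to the same outbox-writing code the application uses on a normal update.

```java
import java.util.List;

// Sketch of the snapshot job for option 1. Customer, CustomerRepository and
// OutboxWriter are stand-ins; the key point is that writeCustomerEvent(...)
// is the very same method the application calls on a normal create/update,
// so snapshot events and live events are built by identical code.
public class CustomerSnapshotJob {

    interface Customer { long getId(); }
    interface CustomerRepository { List<Customer> findBatchAfterId(long afterId, int limit); }
    interface OutboxWriter { void writeCustomerEvent(Customer customer); }

    private final CustomerRepository customerRepository;
    private final OutboxWriter outboxWriter;

    public CustomerSnapshotJob(CustomerRepository customerRepository, OutboxWriter outboxWriter) {
        this.customerRepository = customerRepository;
        this.outboxWriter = outboxWriter;
    }

    public void run(int batchSize) {
        long lastSeenId = 0;
        List<Customer> batch;
        // Page through the table in primary-key order so the job can be resumed if it dies.
        while (!(batch = customerRepository.findBatchAfterId(lastSeenId, batchSize)).isEmpty()) {
            for (Customer customer : batch) {
                outboxWriter.writeCustomerEvent(customer); // same call as a normal update
                lastSeenId = customer.getId();
            }
        }
    }
}
```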

Now you may encounter some race conditions depending on whether you attempt to do this live or not. A row may be updated after you do the initial select, such that the data you push through to the outbox is stale. If you can pause inserts/updates to the database while you snapshot data to the outbox, then it’s fairly easy. If you can’t, then things get more complicated. Hope this helps a bit.

5 Likes

Thanks, this is exactly what I meant, and thank you so much for your detailed response!

If we use compaction, we would need to ensure the latest event on the topic for any given entity has its full state, right? If I understand correctly, that would mean either doing some kstreams post-processing or requiring that events contain the full entity state.

There are a number of use cases where services want to build their own local cache of another service’s state, so it will be important to have at least one snapshot to build from.

I was thinking of something like taking a snapshot once a week if retention is limited to two weeks, with the idea that the full state is always available on the topic for new consumers. This may be necessary if we cannot get full retention, or if replaying the full event log is too expensive.

We do have all of these running in the company currently, but only a limited subset of engineers is using them. My main concern is that most engineers would need to learn and operate these tools to create a data product for their service, increasing the barrier to entry. I’m also concerned that it would be frustrating to reconstruct a business event via Kafka Streams when it was readily available at the application layer.

This seems like a good place to start. It’s nice that it can be evolved into option 2 where necessary if retention or replay become problematic. I believe I can address race conditions by taking a lock on rows that are currently being snapshotted (and deadlocks by choosing an appropriate batch size).
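
Roughly what I have in mind for the locking, with placeholder table and column names, and leaving out the event construction (which would still go through the same application code as you suggested):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Rough idea of the batched snapshot with row locks. Table/column names and the
// outbox payload are placeholders; each batch locks the rows being snapshotted
// (SELECT ... FOR UPDATE) so a concurrent writer can't slip in a change between
// reading a row and writing its snapshot event to the outbox.
public class LockedSnapshotBatches {

    public static void main(String[] args) throws Exception {
        int batchSize = 500;
        long lastSeenId = 0;

        try (Connection conn = DriverManager.getConnection(System.getenv("JDBC_URL"))) {
            conn.setAutoCommit(false);
            while (true) {
                int rowsInBatch = 0;
                try (PreparedStatement select = conn.prepareStatement(
                         "SELECT id, payload FROM customer WHERE id > ? ORDER BY id LIMIT ? FOR UPDATE");
                     PreparedStatement insert = conn.prepareStatement(
                         "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)")) {
                    select.setLong(1, lastSeenId);
                    select.setInt(2, batchSize);
                    try (ResultSet rs = select.executeQuery()) {
                        while (rs.next()) {
                            insert.setLong(1, rs.getLong("id"));
                            insert.setString(2, "CustomerSnapshotted");
                            insert.setString(3, rs.getString("payload"));
                            insert.addBatch();
                            lastSeenId = rs.getLong("id");
                            rowsInBatch++;
                        }
                    }
                    insert.executeBatch();
                }
                conn.commit(); // releases the locks held for this batch
                if (rowsInBatch < batchSize) {
                    break; // last batch
                }
            }
        }
    }
}
```

Locking in primary-key order within small batches should also keep lock waits short for regular writers.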

Thank you again for your response!

Hi codonnell, sorry for the delay on your follow-ups:

If we use compaction, we would need to ensure the latest event on the topic for any given entity has its full state, right?

Yes, correct.
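
As a small illustration of what that means in practice (names made up): the topic is compacted, the record key is the entity id, and every record value carries the full entity state, so whatever record compaction retains per key is still a complete picture.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.config.TopicConfig;
import org.apache.kafka.common.serialization.StringSerializer;

// Illustration only: a compacted topic keyed by entity id, where every record
// value is the full entity state. Topic name, key and payload are made up, and
// the replication factor of 1 is just for a local sketch.
public class CompactedFullStateExample {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            NewTopic topic = new NewTopic("customer-state", 6, (short) 1)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(List.of(topic)).all().get();
        }

        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key = entity id, value = full state; compaction keeps the latest value per key.
            producer.send(new ProducerRecord<>("customer-state", "customer-42",
                    "{\"id\":\"customer-42\",\"name\":\"Ada\",\"tier\":\"gold\"}"));
            producer.flush();
        }
    }
}
```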

take a snapshot once a week if retention is limited to 2 weeks

This is the Lambda architecture, and it’s one I’m personally not that fond of. The main difficulty is that you now need to maintain two sources for the data: the snapshot and the event stream. You also need to ensure that the two sources are consistent with one another. For example, a thought experiment: will two separate instances of precisely the same application end up in exactly the same state if one has been consuming the event stream since t=0, and the other bootstrapped from the snapshot at t=now? The answer should be yes, but in practice I find that the two code paths (one to maintain the stream, one to maintain taking + exposing the snapshot) don’t always end up with the same results, and chasing down the inconsistencies can be painful.

A further complication arises when you’re trying to mix multiple snapshots + streams together. Say you have two sources upstream, each with a snapshot + stream component. If you’re doing any sort of time-based or ordering-sensitive calculations, your code may end up looking quite complicated, as you need to manage the “seams” between the snapshots and the streams. And this isn’t just the seam between the data in snapshotA and streamA, but between snapshotA and streamB, snapshotA and snapshotB, streamA and streamB, etc.

It really depends on what you’re trying to do with the data in the event stream. If you only care about transferring state and not driving any business logic, then snapshot+stream can work fine. But usually we’re using a stream because we want to process business logic, and you can get into some hairy situations when using Lambda-style architectures.

My main concern is that most engineers would need to learn and operate these tools to create a data product for their service, increasing the barrier to entry. I’m also concerned that it would be frustrating to reconstruct a business event via kafka streams that was readily available at the application layer.

You may want to look further into the outbox pattern and see if you can get the application developers on board with denormalizing data at write time into the outbox. This can reduce some of the overhead of recomposing data outside the application, while still isolating the internal model of the relational database.
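
A sketch of what I mean by denormalizing at write time, with invented names: when the service handles an order update it already has the related data in memory, so it can compose one self-contained outbox payload right there, inside the same transaction, instead of leaving consumers to re-join the tables.

```java
// Sketch of write-time denormalization (all names invented). The service already
// has the order and its customer in memory when it handles the update, so it can
// write one self-contained outbox event instead of exposing the normalized tables.
public class OrderService {

    private final OutboxWriter outboxWriter; // whatever already writes rows to the outbox table

    public OrderService(OutboxWriter outboxWriter) {
        this.outboxWriter = outboxWriter;
    }

    // Called inside the same database transaction as the order update itself.
    public void onOrderUpdated(Order order, Customer customer) {
        String payload = """
                {
                  "orderId": "%s",
                  "status": "%s",
                  "customer": { "id": "%s", "name": "%s" }
                }""".formatted(order.id(), order.status(), customer.id(), customer.name());
        outboxWriter.write("OrderUpdated", order.id(), payload);
    }

    // Minimal stand-ins so the sketch compiles.
    public record Order(String id, String status) {}
    public record Customer(String id, String name) {}
    public interface OutboxWriter { void write(String eventType, String aggregateId, String payload); }
}
```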

As for running Kafka Connect as a service, yeah, I hear you. I think your comment does illustrate the difficult-to-quantify accumulation of barriers. I’ve been partially responsible for running and maintaining Kafka and Kafka Connect in the past - it’s one of the reasons I (biasedly, of course - incoming sales pitch) think that SaaS solutions like Confluent Cloud offer a lot of value. It can be hard to quantify the overhead spent fixing connectors, scaling the cluster, etc etc etc, but when that burden is lifted it really opens up what can be done with event streams and provides a lot of operational mobility.

Thank you again for your response!

You’re welcome. Again, sorry for the delay on getting back to you. I need to double check my forum notification settings.

2 Likes

No worries at all on the delay; I appreciate however much time and energy you can devote to helping me, a random person on the internet.

Thank you for sharing your perspective about the Lambda architecture idea. Maybe it is the kind of thing that seems simple when you’re implementing a PoC, but turns hairy when you graduate to maintaining a larger variety of data that is changing quickly. The kind of bug you describe does sound awful to track down and fix.

Due to compliance requirements and non-technical factors restricting my hosting options, I believe infinite retention is unlikely to be an option, at least at the start. If we find a demonstrated need for it as work progresses, the option may open up down the line. Given those constraints, it seems like a good compromise may be to use the outbox pattern and require messages to contain the full aggregate state, allowing publication to a compacted topic. We will need to manage the potential for lost updates due to concurrent writers, but at least the mitigation for that problem involves technology (Postgres) that most engineers already have some experience with.
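
For the lost-update side, what I have in mind is an optimistic version check in Postgres, something like the following (table and column names invented): only apply a write if the version we read is still current, and re-read and retry otherwise, so the full-state event we put in the outbox is never built from stale data.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// One way to guard against lost updates in Postgres (names invented): bump a
// version column and only apply the write if the version we read is still
// current. If no row matches, another writer got there first; the caller
// re-reads the row and retries before writing its full-state outbox event.
public class VersionedCustomerUpdate {

    public static boolean updateName(String customerId, long expectedVersion, String newName) throws Exception {
        try (Connection conn = DriverManager.getConnection(System.getenv("JDBC_URL"));
             PreparedStatement update = conn.prepareStatement(
                 "UPDATE customer SET name = ?, version = version + 1 WHERE id = ? AND version = ?")) {
            update.setString(1, newName);
            update.setString(2, customerId);
            update.setLong(3, expectedVersion);
            return update.executeUpdate() == 1; // false means a concurrent writer won
        }
    }
}
```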

Thank you again for your insight! Your point about the Lambda architecture is particularly well taken. In my head, Lambda involves two separate data pipelines, but you’re spot on that it is exactly the design pattern I was proposing.

1 Like