We are building a system in the area of financial data integrity. The data is around trades or transactions, and daily volumes keep going up - especially on bad days, which is exactly when the system is most needed. So we have a pretty high rate of new entities entering the system - for some customer deployments this could be as high as 1.5 billion new entities per day.
These entities are normally active for only a few days, although sometimes there may be a later correction - so the main activity is over within 10 days, but they remain potentially active out to 200 days. We then need to retain all of this for audit search out to 10 years. Typically there are only a tiny handful of commands/events per entity.
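To put rough numbers on the ingest rate (the 3-events-per-entity figure below is just my stand-in for "a tiny handful"; everything else is from the figures above):

```python
# Back-of-envelope sustained event rate at the largest deployments.
# EVENTS_PER_ENTITY is an assumption standing in for "a tiny handful".
NEW_ENTITIES_PER_DAY = 1_500_000_000
EVENTS_PER_ENTITY = 3
SECONDS_PER_DAY = 86_400

events_per_day = NEW_ENTITIES_PER_DAY * EVENTS_PER_ENTITY
events_per_second = events_per_day / SECONDS_PER_DAY
print(f"{events_per_second:,.0f} events/sec on average")  # ~52,083/sec, before bad-day peaks
```

Averages hide the bad-day spikes, of course, but it gives a sense of the sustained write load any event store here has to absorb.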
We will be consuming all events and building a long-term data store for audit enquiries, since the search requires it. Ideally we would then archive entities out of the part of the system that deals with active data once they pass the 200-day point. Event sourcing looks like a good fit for the part of the system handling active data, but not for the very long haul side of things.
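To make the lifecycle split concrete, here is a small sketch of the phases as I understand them (the day thresholds are the ones above; the function and names are purely illustrative):

```python
from datetime import datetime, timedelta

# Lifecycle thresholds from the description: main activity within ~10 days,
# corrections possible out to 200 days, then audit retention for 10 years.
ACTIVE_DAYS = 10
CORRECTION_DAYS = 200
AUDIT_RETENTION_DAYS = 10 * 365

def lifecycle_phase(first_event_time: datetime, now: datetime) -> str:
    """Classify an entity by age: active store, correction window,
    long-term audit store, or eligible for deletion."""
    age = now - first_event_time
    if age <= timedelta(days=ACTIVE_DAYS):
        return "active"
    if age <= timedelta(days=CORRECTION_DAYS):
        return "correctable"
    if age <= timedelta(days=AUDIT_RETENTION_DAYS):
        return "archived"
    return "expired"
```

The awkward part is the "correctable" band: an entity can look dormant for months and then receive a late correction, so the active side can't archive purely on inactivity.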
Unfortunately, as far as I can see, it will be difficult to use Kafka for event sourcing here. In general it looks like it would work well for us, but the only approach we can see for handling the archiving is a fixed retention period on the event topics. That introduces complexity around reprocessing, because some events for a given entity will already have been dropped. It also seems much harder to deal with across the multiple topics owned by the various services handling those entities.
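For reference, the fixed-retention approach I mean is the standard per-topic setting, e.g. a 200-day window via `kafka-configs.sh` (topic name and broker address here are placeholders):

```shell
# 200 days in milliseconds: 200 * 24 * 60 * 60 * 1000 = 17,280,000,000
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name trade-events \
  --alter --add-config retention.ms=17280000000
```

The broker then drops closed log segments past that age unconditionally - it has no notion of "archived successfully first", which is where the reprocessing complexity comes from.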
I was wondering if anyone here has experience dealing with these kinds of archiving issues around event sourcing on Kafka.