In the last two posts, I discussed a particularly problematic read model that took longer than 2 days to rebuild from scratch. Why would you want to rebuild a read model from scratch? Making a schema change, consuming a new type of data, fixing bugs or even deploying to a new environment are all legitimate reasons to warrant a rebuild. A read model, in my understanding, is not the source of truth; it is just a customized and specialized view of the data. With this reduced responsibility comes the freedom to delete, recreate or manipulate it.
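To make that concrete, here is a minimal sketch of what a rebuild involves, assuming an event store that can replay its full stream and a projector with per-event handlers. The names and interfaces below are illustrative, not taken from the actual system discussed in these posts.

```python
# A minimal sketch of a read model rebuild: drop the disposable view, then
# replay the full event stream through the projector. The event store,
# read model store and projector interfaces are assumed for illustration.

def rebuild_read_model(event_store, read_model_store, projector):
    read_model_store.drop()           # safe: the read model is not the source of truth
    read_model_store.create_schema()  # recreate with the new or changed schema
    for event in event_store.replay_all():
        projector.handle(event)       # re-project every historical event into the fresh schema
```

The whole cost of a rebuild is in that replay loop, which is why the number of events a read model consumes matters so much later in this post.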

The “Reporting” read model

Under the pressure of the deadlines that developers face each day, reuse is a common tool in our arsenal to reduce the amount of code we need to write and test. Arguably one of the cornerstones of the software industry is our ability to reuse and share libraries to accelerate productivity. A library that serializes to and from a wire format, one that executes HTTP requests and processes responses, or one that connects to a data store all reduce the burden on an application developer so that they may focus on delivering business value. But with reuse comes coupling: you create a dependency on that library. Often this cost is well understood and acceptable. However, sometimes it is not understood, not accounted for and therefore difficult to reject.

The read model under scrutiny was a “reporting” read model. I put reporting in quotes because I find the name vague. Who were the reports for? Which business domain? I personally think that naming and ownership are important.

This reporting read model appeared to start its life as a relational database view of the transaction log for a given business, a good way to store the 20,000,000 or so transactions that made up the ledger. Some basic reporting, like getting the balance for an account or seeing cash flows for a period, could be serviced by this read model. But then I imagine the lure of reuse was too strong. A requirement like “get me the account summaries for a customer” meant that you could just add some customer event handlers to the read model populator and satisfy that requirement quickly. What about cash flow by branch office? All missed payments? Pending Direct Debit payments? Accounts in hardship? Written-off debt? Why not track some KPIs in here?
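As a purely hypothetical illustration of that accretion (the class, table and event names below are invented, not taken from the actual system), each new requirement bolts another handler onto the same populator, and every one of them then shares the same schema and the same rebuild:

```python
# Hypothetical sketch of a "reporting" populator growing by accretion: every
# team's requirement lands as another handler on the same projector, coupled
# to the same database and the same rebuild cycle.

class ReportingReadModelPopulator:
    def __init__(self, db):
        self.db = db

    # the original purpose: the ledger
    def on_transaction_posted(self, event):
        self.db.insert("ledger_lines", account=event.account_id,
                       amount=event.amount, posted_at=event.timestamp)

    # "get me the account summaries for a customer"
    def on_customer_registered(self, event):
        self.db.insert("customers", id=event.customer_id, name=event.name)

    # cash flow by branch office
    def on_branch_assigned(self, event):
        self.db.update("accounts", event.account_id, branch=event.branch_id)

    # collections concerns creep in too
    def on_payment_missed(self, event):
        self.db.update("accounts", event.account_id, missed_payments="+1")
```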

The attraction is clear: this data store has the 20M ledger transactions that are the basis for most of the other queries. However, what is not clear is who now owns this data. If the banking team wants to add more columns, do we need to ask the collections team? What about the front office team? What if we want to re-purpose a column, or delete one? Who gives the OK?

In this case we have reuse of both software and data. I would argue, though, that it has lost its usefulness. Something can’t be reusable if it isn’t first usable.

A software system that can’t be safely modified due to unclear roles, responsibilities and ownership has lost its utility.

Performance cost

The real cost of this reuse is two-fold:

  • development cost
  • deployment cost

both of which ultimately reduce business agility.

The development cost is increased due to the coupling of code. If you want to add to the code base, you need to understand the ramifications of the change you will make. If any part of the system requires a rebuild of the read model, then all other parts of the system also have to pay that cost. This makes developers less willing to make changes. As a developer you pay this cost when running the system locally. As a tester, you can never be sure which parts of the system warrant regression testing. When you have a network effect of dependencies, you end up with a complex system with emergent behaviour.

Then at deployment time the extra cost of these dependencies is very noticeable. As each dependency brings along its own requirements, it also brings along the need for its own data. The read model in question needs to consume 90 million of the 500 million events in the event store, yet each individual part needs at most 25 million of those events. If that reporting read model were deconstructed into its constituent parts as independent read models, each would need only about a quarter of the data, and would be far less likely to require rebuilds due to unrelated requirements.
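A sketch of the deconstructed alternative might look like the following, where each read model declares only the event types it actually needs, so none of them has to consume the full 90-million-event slice. The event names and the subscription mechanism are assumptions for illustration.

```python
# Each independent read model subscribes only to the event types it needs;
# a simple dispatcher routes events so unrelated models never see them.

class MissedPaymentsReadModel:
    subscribes_to = {"PaymentMissed", "PaymentReceived"}
    def handle(self, event): ...  # updates only the missed-payments view

class BranchCashFlowReadModel:
    subscribes_to = {"TransactionPosted", "BranchAssigned"}
    def handle(self, event): ...  # updates only the branch cash-flow view

def dispatch(event, read_models):
    """Route an event only to the read models that declared interest in its type."""
    for read_model in read_models:
        if event["type"] in read_model.subscribes_to:
            read_model.handle(event)
```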

Granted, 20 million of those 25 million events are the ledger events, and these independent read models would probably all need to consume those same 20 million events. However, what becomes interesting is how little of each event needs to be persisted when your read model is specialized. It is also worth noting that disk space is usually cheap when compared to business agility and development costs. Yes, you, the developer, are a very expensive resource!
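As an illustration of how little of a rich ledger event a specialized read model may need to persist (the event shape and field names here are hypothetical):

```python
# A balance view keeps only the account id and the signed amount; the rest of
# the ledger event (counterparty, channel, metadata, ...) is simply dropped.

def project_for_balance(event, balances):
    account = event["account_id"]
    balances[account] = balances.get(account, 0) + event["amount_cents"]
```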

A final cost that should be pointed out is query cost. If your read models are highly specialized, then they can be highly tuned to the single goal of that read model. The generalized reporting read model will always face tradeoffs that benefit one consumer at the cost of another.
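For example, a read model that exists only to report cash flow by branch can pre-aggregate at projection time, so the report becomes a single lookup rather than a scan and group-by over 20M generalized rows. This is a sketch under assumed field names and a hypothetical month format, not the actual system:

```python
# Pre-aggregate cash flow per (branch, month) while projecting, so the query
# side is a single dictionary lookup tuned for exactly this one report.

from collections import defaultdict

branch_monthly_cash_flow = defaultdict(int)

def on_transaction_posted(event):
    key = (event["branch_id"], event["posted_at"][:7])  # e.g. ("BR-042", "2016-03")
    branch_monthly_cash_flow[key] += event["amount_cents"]

def cash_flow_report(branch_id, month):
    return branch_monthly_cash_flow[(branch_id, month)]  # O(1), no tradeoffs with other consumers
```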

Ownership of data

My understanding of DDD, SOA, CQRS and even Microservices is that requirements come from a product owner, and that the software facilitating those requirements can and should encapsulate its logic and data into a deployment unit. This creation of clear boundaries allows the system to change, grow, scale, version, fail and die independently of other systems. IMO, data such as read models is included in that sphere. I personally like the Credit Card model, where a service boundary is funded by a credit card. When the owner of the service sees no further use for it, the credit card can be cut, and the service with it.

By clearly identifying the ownership and boundaries of our reporting monolith read model, we could create isolated read models that each require 3-4x fewer events than the single read model did. That alone can provide a 3-4x improvement in rebuild times. A 4x improvement is good, but it is always faster to do nothing. So to be clear, the major win here is the decoupling of deployments. If the “Reporting” read model served 25 separate requirements that each had only 2 change requests a year, that is 50 rebuilds a year, or roughly one every week. If the read models were segregated, each would need only its own 2 rebuilds per year; a requirements change for one report would have no impact on the other reports, so they would never need to be rebuilt because of it.

Smaller, faster, lower risk deployments.