In the CQRS design pattern, a read model is a specialised view of a given set of data. Ideally the state within the read model is pre-calculated and ready to be served in its final form. This might mean the reporting rows are already reduced in a relational database so that you can issue a ‘select from xxx where yyy’. It might mean the JSON returned from your web API is stored directly in a document datastore. More aggressively, the actual document served might be pre-generated. A PDF bank statement is a good example of this pre-generated content.
I see three dominant patterns for processing and persisting read models with the CQRS pattern over Event Sourcing:
- Read/Write
- Read once, write many
- Write only
Here we discuss the merits of each, continuing with the theme of performance from the previous two posts.
To help illustrate the examples, we will use a scenario with a series of events that represent the opening of a bank account, some transactions for that account and potentially the closing of the bank account. The read model in question is tasked with answering the question of when fraudulent transactions occurred for accounts created online.
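The code samples below assume event types along these lines. This is a minimal sketch for illustration only; the property names and types (including the use of Guid for identifiers and the IsOnline flag on the created event) are assumptions, not taken from a real system.

public class AccountCreatedEvent
{
    public Guid AccountId { get; set; }
    public bool IsOnline { get; set; }      // was the account opened online?
    public DateTime Timestamp { get; set; }
}

public class DebitTransactionEvent
{
    public Guid AccountId { get; set; }
    public Guid TransactionId { get; set; }
    public decimal Amount { get; set; }
    public DateTime Timestamp { get; set; }
}

public class FraudulentDebitTransactionEvent
{
    public Guid AccountId { get; set; }
    public Guid TransactionId { get; set; }
    public DateTime Timestamp { get; set; }
}

public class AccountClosedEvent
{
    public Guid AccountId { get; set; }
    public DateTime Timestamp { get; set; }
}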
Read/Write
The read/write pattern is probably the simplest for an imperative developer to adopt when processing a read model. In my experience, most developers lean towards an imperative style of coding over, say, a declarative or functional one. Even a developer who favours a functional style may throw their hands up and revert to imperative as the problem space or time pressures overwhelm them. I am not saying this is right or wrong, just what I have seen.
With this pattern, relevant state is loaded for each event as it is handled.
Here is a simple example of the processing logic you could write:
public class ReadWrite_ReadModelProcessor
{
    private IRepository _repo; // Injected or assigned elsewhere.

    public void Handle(AccountCreatedEvent accountCreatedEvent)
    {
        var account = new Account(accountCreatedEvent.AccountId);
        account.Create(accountCreatedEvent);
        _repo.Save(account);
    }

    public void Handle(DebitTransactionEvent debitEvent)
    {
        var account = _repo.Get(debitEvent.AccountId);
        if (account.IsOnline)
        {
            account.Debit(debitEvent);
            _repo.Save(account);
        }
    }

    public void Handle(FraudulentDebitTransactionEvent fraudEvent)
    {
        var account = _repo.Get(fraudEvent.AccountId);
        if (account.IsOnline)
        {
            account.RegisterFraudulentActivity(fraudEvent);
            _repo.Save(account);
        }
    }

    public void Handle(AccountClosedEvent accountClosedEvent)
    {
        var account = _repo.Get(accountClosedEvent.AccountId);
        if (account.HasAnyFraudulentActivity)
        {
            account.Close(accountClosedEvent);
            _repo.Save(account);
        }
        else
        {
            // If it hasn't had any fraudulent activity, then we won't report on it.
            _repo.Delete(account);
        }
    }
}
Note that no state is maintained in memory. Any time we need data, we go back to the persisted read model to get it. This allows the read model to be easy to read, easy to reason about, and transactionally consistent for each individual event.
The other benefit of this pattern is that it is probably familiar. I imagine similar code may be adopted in a Web API for accepting an HTTP POST request.
Read Once, Write Many
As Greg Young often expresses, “current state is a left fold of previous behaviours”. The “left fold” concept is often skimmed past when learning the CQRS pattern with Event Sourcing. It can be a difficult concept to grasp initially, but a simple running sum makes a good first example.
The sum of the numbers [0,1,2,3,4] is 10. However, if I add a new number to the set to make it [0,1,2,3,4,5], the way I get the new sum can have a big impact on the design and performance of my system. I could add each number again (i.e. 0+1+2+3+4+5), which would be five operations, or I could just add to my previous sum of 10 (i.e. 10+5), which is one operation.
Now imagine that the set was dozens or even hundreds of items in length. Now imagine that there are millions of these sets. That design decision can be important now.
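In C#, the same idea can be sketched with LINQ's Aggregate (the fold) versus a single incremental addition. The variable names here are purely illustrative:

using System;
using System.Linq;

var numbers = new[] { 0, 1, 2, 3, 4 };

// Recompute from scratch: a left fold over the whole sequence (5 additions).
var sum = numbers.Aggregate(0, (runningTotal, n) => runningTotal + n);   // 10

// Incremental update: fold only the new value into the previous state (1 addition).
var newSum = sum + 5;                                                    // 15

Console.WriteLine($"{sum}, {newSum}");

The incremental version only needs the previous state and the new value, which is exactly the property a read model processor wants to exploit.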
While the first example using the Read/Write pattern still achieves the outcome of a functional left fold, it does so at what might be an extremely high cost to commit each step of the way. On handling each event, we need to connect to the data store, read the correct data, load it into memory, process the event and then commit it back to the data store. Even if we can do all of that in the impressive time of 1ms, that still limits us to 1,000 messages per second. If the number of events we need to consume is 25 million, then that is nearly 7 hours. Not too bad, but if 1ms was a best effort and the mean was more likely to be 10ms, then we are looking at nearly 3 days. At 100ms per event, that blows out to roughly 29 days.
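As a quick back-of-envelope check of those numbers (the per-event costs are illustrative assumptions, not measurements):

using System;

const int eventCount = 25_000_000;

foreach (var perEventMs in new[] { 1, 10, 100 })
{
    var total = TimeSpan.FromMilliseconds((double)eventCount * perEventMs);
    Console.WriteLine($"{perEventMs}ms per event => {total.TotalHours:F1} hours ({total.TotalDays:F1} days)");
}
// 1ms   => ~6.9 hours
// 10ms  => ~2.9 days
// 100ms => ~28.9 days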
Something that can be forgotten is that a read model in an event-sourced CQRS system is part of an eventually consistent design. I question why you would opt to pay such a high cost in the read/write pattern to try to feign some level of consistency that was never there in the first place. Instead, embrace the design and consider how stale the data can be while remaining acceptable.
Also consider that there are probably two states your read model can be in: rebuilding from old events, or processing live events. This means you can take the opportunity to make some more design decisions. When you are rebuilding your read model and the process is interrupted, how far back can you afford to go to continue reading? Can you afford to reprocess 1,000 events? 1 million events? Or is time a better measurement? Can you happily lose the last 2 minutes of processed events?
Ideally, we don’t fail, and if we did, we wouldn’t have to go back and reprocess any events. However, what if I could tell you that we could save 300 changes in the same 1ms that we were saving 1 event in the read/write pattern? What if it was 16,000, but the cost was 40ms? That gives us a trade-off to consider. Are you OK giving away some of this imaginary consistency if you can get a 300x throughput improvement?
We can do this by adopting a bulk change design.
Instead of saving each time we handle an event, we just store the result of the change in an uncommitted list in memory. After some period, measured either in number of events processed or in time elapsed, we commit those changes to the data store. This could be done with a minor change to the Read/Write pattern above: the implementation of the repository could be modified to only flush the accumulated save requests at a given point. However, another key part of making this fast is reducing the number of reads required by the system. So the other change to the program is to add an explicit `Load` phase. In the Load phase, the existing data is read from the data store and loaded into memory. This is done once, before any events are processed, and it is the one and only read that the system will perform from the data store. As the program is the only thing that modifies the data store, it has no need to ask it again what the state is.
public class ReadOnceWriteMany_ReadModelProcessor
{
    private IRepository _repo; // Injected or assigned elsewhere.
    private ISet<Guid> _onlineAccountIds;

    public void Load()
    {
        var ids = _repo.Query("GetActiveOnlineAccount").Select(a => a.Id);
        _onlineAccountIds = new HashSet<Guid>(ids);
    }

    public void Handle(AccountCreatedEvent accountCreatedEvent)
    {
        var account = new Account(accountCreatedEvent.AccountId);
        account.Create(accountCreatedEvent);
        if (account.IsOnline)
        {
            _onlineAccountIds.Add(account.Id);
            _repo.Save(account);
        }
    }

    public void Handle(DebitTransactionEvent debitEvent)
    {
        if (_onlineAccountIds.Contains(debitEvent.AccountId))
        {
            var debitTransactionRecord = new AccountTransactionRecord
            {
                AccountId = debitEvent.AccountId,
                TransactionId = debitEvent.TransactionId,
                Timestamp = debitEvent.Timestamp,
                Amount = debitEvent.Amount
            };
            _repo.Save(debitTransactionRecord);
        }
    }

    public void Handle(FraudulentDebitTransactionEvent fraudEvent)
    {
        if (_onlineAccountIds.Contains(fraudEvent.AccountId))
        {
            var fraudulentTransactionRecord = new FraudulentTransactionRecord
            {
                AccountId = fraudEvent.AccountId,
                TransactionId = fraudEvent.TransactionId
            };
            _repo.Save(fraudulentTransactionRecord);
        }
    }

    public void Handle(AccountClosedEvent accountClosedEvent)
    {
        if (_onlineAccountIds.Contains(accountClosedEvent.AccountId))
        {
            // NOTE: This is a logical DELETE, but represented as an INSERT.
            var accountClosedRecord = new AccountClosedRecord
            {
                AccountId = accountClosedEvent.AccountId,
                Timestamp = accountClosedEvent.Timestamp
            };
            _repo.Save(accountClosedRecord);
            _onlineAccountIds.Remove(accountClosedEvent.AccountId);
        }
    }
}
If you also have an insert-only data model, this leaves you open to use bulk insert features which can perform phenomenally well (compared to updates, “upserts” or merges).
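As a rough sketch of the batching behind that processor, the repository's Save could delegate to something like the class below. This is an illustrative assumption, not a real library API: the BatchingWriter name, the bulkInsert delegate (which might wrap a bulk insert feature of your data store) and the flush thresholds are all made up for the example.

using System;
using System.Collections.Generic;

// Hypothetical batching writer: saves accumulate in memory and are committed in bulk.
public class BatchingWriter<TRecord>
{
    private readonly Action<IReadOnlyList<TRecord>> _bulkInsert; // e.g. wraps a bulk insert
    private readonly List<TRecord> _uncommitted = new List<TRecord>();
    private readonly int _maxBatchSize;
    private readonly TimeSpan _maxBatchAge;
    private DateTime _lastFlushUtc = DateTime.UtcNow;

    public BatchingWriter(Action<IReadOnlyList<TRecord>> bulkInsert, int maxBatchSize, TimeSpan maxBatchAge)
    {
        _bulkInsert = bulkInsert;
        _maxBatchSize = maxBatchSize;
        _maxBatchAge = maxBatchAge;
    }

    public void Save(TRecord record)
    {
        _uncommitted.Add(record);

        // Commit either after enough events or after enough time, whichever comes first.
        if (_uncommitted.Count >= _maxBatchSize || DateTime.UtcNow - _lastFlushUtc >= _maxBatchAge)
        {
            Flush();
        }
    }

    public void Flush()
    {
        if (_uncommitted.Count == 0) return;

        _bulkInsert(_uncommitted);   // one round trip for many records
        _uncommitted.Clear();
        _lastFlushUtc = DateTime.UtcNow;
    }
}

The flush thresholds are where the "how much can I afford to lose?" question from earlier gets answered: a larger batch size or age buys more throughput at the cost of more reprocessing after a failure.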
Write only
The other pattern I wanted to touch on is the write only pattern. This is the simplest of read models. The processor has no concern for previous state and is effectively persisting the output of a message translator. This is basically a “left-right copy”. While this may sound rather useless, the pattern is popular for numerous reasons, such as:
- loading text from events for searching (names, addresses, product codes)
- loading transaction data
- loading diagnostic logs
As this pattern requires no reads and no in-memory state, it can be extremely fast to process data. The pattern can also be used (abused?) if you want to leverage the data store to perform calculations at query time. This is a write vs read trade-off. Normally a read model is highly specialised for serving a single query, and as such is normally extremely fast. If the read model happened to be a more generalised model, perhaps for self-service querying, you might want to give back some of that speed at query time in exchange for a faster and simpler code base to process and persist the read model data.
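A write-only processor is little more than a translation from event to record. A minimal sketch, reusing the record and event types assumed earlier:

// Hypothetical write-only processor: no reads, no in-memory state, just translate and persist.
public class WriteOnly_ReadModelProcessor
{
    private IRepository _repo; // Injected or assigned elsewhere.

    public void Handle(DebitTransactionEvent debitEvent)
    {
        // Straight "left-right copy" from the event to the read model record.
        var record = new AccountTransactionRecord
        {
            AccountId = debitEvent.AccountId,
            TransactionId = debitEvent.TransactionId,
            Timestamp = debitEvent.Timestamp,
            Amount = debitEvent.Amount
        };
        _repo.Save(record);
    }
}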
I have found this last trade-off to be acceptable in several SQL Server read models where the query was effectively a sum or a count by partition. The queries were only issued a dozen or so times per day and would run in under a second, so the trade-off was unnoticeable to the user, but our code base was simpler for it.
Not just about performance
When considering which persistence pattern to adopt, consider the developer cost as well as the performance cost. Write-only may be the easiest to develop and the fastest to run, but will querying the data be significantly more complex? Alternatively, is pandering to the imperative read/write pattern killing the throughput of your system?