Borrowing from ACID patterns to tame eventual consistency - Written by Matt Howard
I recently had a discussion on a message board with someone who was looking for ways to implement ACID-like guarantees (Atomic, Consistent, Isolated, Durable) where using an ACID-compliant database wasn't possible. I think that conversation has a much broader appeal - I see the same general question being asked over and over in many different ways and in different contexts. It has more to do with a shift in thinking about data than it does with the details of database behavior. So I thought I’d move the topic over here for the benefit of anyone else who finds themselves in our shoes.
Developers who are used to building apps around traditional relational databases are struggling to adjust to some of the latest tools taking over cloud platforms. Several years ago I was trying to make sense of building a financial system on top of Cassandra - an eventually-consistent database. My question then, and the topic from this recent poster, was “how do I make things behave atomically and consistently without a database to guarantee it?” The example scenario was something like:
- Insert $100 transfer from person A to person B into a ledger
- Update balance of person A = balance-$100
- Update balance of person B = balance+$100
These things have to happen atomically - if any step fails the entire series must be aborted. This used to be something developers didn’t have to think about… wrap the statements in a transaction, send it off to the database and wash our hands of the complexities that make it work. Now many are finding they have to figure it out for themselves with databases that have abandoned ACID in favor of speed and availability. So how do you maintain data consistency in a distributed or disconnected system? The answers I have found are:
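For reference, here is what the old world looks like - the steps above collapsed into a single all-or-nothing transaction. This is a minimal sketch using Python’s built-in sqlite3 module (an ACID-compliant database) with hypothetical table names, not a production banking schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE balances (person TEXT PRIMARY KEY, balance INTEGER);
    CREATE TABLE ledger (src TEXT, dst TEXT, amount INTEGER);
    INSERT INTO balances VALUES ('A', 500), ('B', 500);
""")

try:
    # "with conn" opens a transaction: commit on clean exit, rollback on error
    with conn:
        conn.execute("INSERT INTO ledger VALUES ('A', 'B', 100)")
        conn.execute("UPDATE balances SET balance = balance - 100 WHERE person = 'A'")
        conn.execute("UPDATE balances SET balance = balance + 100 WHERE person = 'B'")
except sqlite3.Error:
    pass  # the whole transfer was aborted; no partial state is visible

print(dict(conn.execute("SELECT person, balance FROM balances")))
```

If any statement inside the `with` block raises, all three changes disappear together - the database, not the application, guarantees the all-or-nothing behavior.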
- You can’t, stop trying… and
- Follow the patterns they used to implement ACID guarantees in the first place
Unfortunately most of the answers you find online stop at #1… and leave the conversation at “consistency is overrated - you don’t need it”. While this is true in many circumstances, there are critical times when it is necessary. The good news is that there is much more to the story. We’ve just become accustomed to doing things this way because we could get away with it - not necessarily because it is a good way, or the only way, of maintaining data. In many cases there are better alternatives that come right out of the ACID playbook.
To go forward with these new notions of consistency we have to go back a bit to remember how the old model works. This is a big topic that I'm going to distill down to two concepts that I believe make relational databases so reliable:
- Write-ahead-logs (WAL) - before you actually do something you log the fact that you are about to do it
- Idempotency - you can take the same action twice (by replaying the logs) without any unwanted side-effects… this is often defined without the word “unwanted” but that is a critical difference
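That “unwanted” qualifier matters as soon as logs get replayed. A minimal sketch (hypothetical function and variable names): a transfer applied by blindly incrementing a balance is not idempotent, but one that remembers which transaction ids it has already applied is:

```python
balances = {"A": 500, "B": 500}
applied = set()  # transaction ids we have already processed

def apply_transfer(tx_id, src, dst, amount):
    """Apply a transfer exactly once, even if the log entry is replayed."""
    if tx_id in applied:
        return  # replay: the side-effect already happened, do nothing
    applied.add(tx_id)
    balances[src] -= amount
    balances[dst] += amount

apply_transfer("tx-1", "A", "B", 100)
apply_transfer("tx-1", "A", "B", 100)  # same entry replayed during recovery
print(balances)  # {'A': 400, 'B': 600}, not {'A': 300, 'B': 700}
```

The replay does produce a side-effect (a lookup, a log line, some CPU time) - it just produces no *unwanted* effect on the data.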
So let’s return to that atomic transaction above using Oracle as an example. I want to insert a record, then update 2 other records as a single all-or-nothing step. The first thing Oracle will do is generate REDO and UNDO logs before it touches my data. The REDO log is a description of the changes being applied so if the system crashes it can replay those logs to get back to a consistent state once it comes back up. The UNDO log is the opposite - a description of what the data was like before the change, so if the system crashes in the middle of a transaction it can then apply the UNDO logs to make it appear to me that a partial change never happened. UNDO logs are also used to effectively hide uncommitted transactions from concurrent connections.
- Insert $100 transfer from person A to B into ledger
- Update balance of person A = balance-$100
- Crash / failure / whatever you like
Since we didn’t commit the transaction Oracle will apply our undo logs on recovery to clean up the $100 ledger entry and the $100 debit to person A’s balance.
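The recovery logic can be sketched in a few lines. This is a toy model with hypothetical structures - not Oracle’s actual implementation - showing the essential move: record the before-image of every change before making it, and on restart undo anything that never committed:

```python
balances = {"A": 500, "B": 500}
undo_log = []       # (key, before_value) pairs, written before each change
committed = False   # we never reach COMMIT in this scenario

def update(key, new_value):
    undo_log.append((key, balances[key]))  # write UNDO before touching data
    balances[key] = new_value

# the transfer begins...
update("A", balances["A"] - 100)
# ...crash before person B is credited and before COMMIT

def recover():
    """On restart, reverse every change from the uncommitted transaction."""
    if not committed:
        for key, before in reversed(undo_log):
            balances[key] = before

recover()
print(balances)  # back to {'A': 500, 'B': 500}, as if nothing happened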
Note that atomicity is a bit of an illusion. “All-or-nothing” is a display trick that the database does to make it look like nothing happened. From https://docs.oracle.com/cd/B28359_01/server.111/b28310/undo001.htm:
When a ROLLBACK statement is issued, undo records are used to undo changes that were made to the database by the uncommitted transaction.
The changes “were made” for steps 1 and 2 above. It’s just that Oracle’s read model uses the undo log to hide those changes from other concurrent sessions and it takes compensating action to remove the unwanted effects of those changes in the event something goes wrong.
The same UNDO logs are used to control the consistency of reads for data that is either in the middle of being changed (but not committed) or has been committed but only after a read request was issued.
As a concrete example, suppose Oracle needs to hide updates to 2 blocks of data that were committed at time SCN 10024 from an earlier SELECT statement issued at time SCN 10023. It reads the 2 blocks, sees they are newer than the snapshot SCN of the query, and applies the UNDO records, which revert those 2 blocks of data back to their prior versions at SCN 10006 and SCN 10021.
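The idea can be sketched as a toy multi-version store - hypothetical and loosely modeled on Oracle’s SCN mechanism, not its real block format. Each write is stamped with a commit number, and a read at a given snapshot SCN “applies undo” by walking back to the newest version at or before its snapshot:

```python
# each key maps to a list of (commit_scn, value) versions, oldest first
versions = {"X": [(10006, "old"), (10024, "new")]}

def read(key, snapshot_scn):
    """Return the value as of snapshot_scn, hiding later commits."""
    visible = None
    for scn, value in versions[key]:
        if scn <= snapshot_scn:
            visible = value  # newest version not after the snapshot
    return visible

print(read("X", 10023))  # 'old' - the SCN 10024 commit is hidden
print(read("X", 10024))  # 'new'
```

Note that the newer data physically exists the whole time; consistency is achieved entirely by controlling what the read path exposes.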
Note that if you don't fully control the read model then you also lose control over your consistency boundary.
This is a key tenet of microservices that pays off greatly when trying to manage eventual consistency - control your read model by not allowing outside actors to peer into your internal state.
Developers are so used to things appearing as if they never happened on rollback that we make it an absolute requirement; however, by adjusting our definition of “never happened” to match Oracle’s definition, we gain a lot of flexibility to work in a world of relaxed consistency. We often write details of attempted transactions to our application logs and don’t expect a database rollback to erase those logs, because application logs aren’t visible to the users. In fact we like to keep those logs so we know something was attempted. The existence of “invalid” data isn’t an issue if the application doesn’t expose the incorrect effects of that data.

Relying on structured logs is a decades-old pattern in computing, and many argue it is rooted in practices dating back millennia. Replaying time-series logs to reconstruct a view of the world is almost inherently idempotent and gives us back much of the consistency control developers are seeking. In Part 2 we’ll look further at what this means and how adjusting our definition of “data” can help us to apply these concepts in our own applications.
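To close the loop, replaying a log to rebuild state can be sketched like this (hypothetical event shape, combining the log-replay and idempotency ideas from above). Rebuilding from the same log always yields the same view, even when entries are delivered more than once:

```python
log = [
    {"tx": "tx-1", "src": "A", "dst": "B", "amount": 100},
    {"tx": "tx-2", "src": "B", "dst": "A", "amount": 25},
    {"tx": "tx-1", "src": "A", "dst": "B", "amount": 100},  # duplicate delivery
]

def rebuild(events):
    """Fold the event log into a balance view, skipping replayed events."""
    balances = {"A": 500, "B": 500}
    seen = set()
    for e in events:
        if e["tx"] in seen:
            continue  # idempotent: duplicates have no effect on the view
        seen.add(e["tx"])
        balances[e["src"]] -= e["amount"]
        balances[e["dst"]] += e["amount"]
    return balances

print(rebuild(log))  # {'A': 425, 'B': 575}
```

The log is the durable record of what happened; the balances are just a derived read model we can throw away and reconstruct at any time - the same division of labor the REDO log gives Oracle.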