The Oracle-SimpleDB Hybrid Part 2 : Solving the Eventual Consistency Problem
Preamble: Read Part 1: Pulling Data out of Oracle Efficiently
Creating the Oracle-SimpleDB Hybrid system is a challenge. For one, it is a multihomed system, accepting writes in both the cloud (i.e. SimpleDB) and in our data center (i.e. Oracle). Secondly, the data center and the AWS reqion (i.e. US-east) are on opposite coasts of the US with network latency ~50-100ms. Thirdly, clocks are not synchronized via network protocols like NTP across WANs. NTP across WANs introduce tens of milliseconds of inaccuracy, which may not be good enough to resolve all forms of conflict.
To build a multi-homed system, we needed to keep our Oracle DB in our data center in sync with our SimpleDB domains in the East Coast region.
This is a tough problem to solve. How consistent do we want the data? What if we shoot for strong-consistency?
In order to build a strongly-consistent link between Oracle and SimpleDB, we could use dual-writes via a 2 Phase Commit (a.k.a 2PC) protocol. However, 2PC over a 50-100ms link would be an availability bottleneck, and hence 2PC is not a viable option. Any consensus protocol would suffer the same short-coming.
Since we cannot achieve Strong Consistency between Oracle and SimpleDB, can we achieve Eventual Consistency?
In an Eventually-consistent system, writes that occur in one place (i.e. SimpleDB or Oracle) are replicated asynchronously by IR processes. This is unlike the synchronous writes that occur in a 2PC system.
The biggest problem then becomes resolving data conflicts. In other words, when parallel writes are occurring to the same row in both Oracle and SimpleDB, how do you settle on a winner?
Our business rules allowed us to adopt the Consistency, Not Accuracy Principle. For most data (i.e. non-financial data and other business critical data), we don’t need to be accurate in terms of a global clock or vector clocks. In effect, we don’t need to ensure that the most recent write wins! Clocks are not synchronized, hence we cannot rely on local times to determine when the most recent change occurred.
All we need to do is to pick the same winner in both Oracle and SimpleDB. We need the 2 data sources to be in sync.
If there is a network partition such that Oracle and SimpleDB cannot communicate with one another, then the 2 data sources will diverge as long as the partition exists. This is unavoidable. However, as soon as the communicate link is reestablished, the system s expected to heal and the two data sources are expected to reach agreement.
One night a few weeks ago, I came up with a solution to this problem in the form of an invariant — a condition that, if maintained, would ensure that SimpleDB and Oracle remained in sync.
Introducing the Consistency Invariant
Consistency Invariant: candidate_row.version > stored_row.version
Any candidate_row being written to either SimpleDB or an RDBMS must have a version number greater than that of the existing stored record. If the candidate row does not have a higher version number, the writer must take exactly one of the following actions:
- Abandon the write
- Pick a higher version and retry the write
As long as this invariant is maintained in the system, the Oracle and SimpleDB components of this multi-homed system will not diverge. One special condition is imposed on replication : replication must abandon writes if the data it wants to replicate (i.e. candidate_row) has a lower version than the targeted stored_row in the replication destination.
- Start with a record in both Oracle and SimpleDB: ID=1, Some_field=foo, version=01/01/09 12:00:00:000
- Imagine a write to Oracle that results in: ID=1, Some_field=bar, version=01/01/09 12:10:00:000
- Imagine a write to SimpleDB that results in: ID=1, Some_field=kar, version=01/01/09 12:10:00:001
- Oracle to SimpleDB replication is abandoned as the candidate_row.version is smaller than the stored_row.version (i.e. 01/01/09 12:10:00:000 < 01/01/09 12:10:00:001 )
- SimpleDB to Oracle replication succeeds so that Oracle now has: ID=1, Some_field=kar, version=01/01/09 12:10:00:001
At the end of step 5, both SimpleDB and Oracle have the same data based on local clock values. Due to clock-skew, the winner may not have been the most recent in terms of an omniscient observer’s eyes.
In order to implement this system, we need some functionality in SimpleDB. Amazon’s SimpleDB team is working on making this a reality. Once it comes out, we will be able to build our Eventually Consistent system.
Without the anticipated Amazon API, we cannot build an eventually-consistent Hybrid system optimized for AP (i.e. from CAP theorem). We would have had to rely on dual-writes, defeating our goal to be highly-available. This pattern and Invariant will likely be the standard solution in time to come as more companies try to move existing RDBMS-based applications to the cloud.