The Oracle-SimpleDB Hybrid Part 1 : Pulling data out of Oracle Efficiently

Preamble : See my previous post titled "Introducing the Oracle-MySQL Hybrid"

In my previous post, I provided an overview of an Oracle-SimpleDB Hybrid system that I am building. It supports writes to multiple masters, replicates data between masters in single-digit seconds (i.e. in the absence of long-term network partitions), is eventually-consistent, and is designed for optimal AP — it survives Network Partitions and is highly available.

My company already relies on Oracle databases. In order to transition to SimpleDB, we will need to move one application at a time into the cloud while keeping our service running. As this cannot happen overnight, we need to keep both SimpleDB and Oracle in sync.

In Part 1 : Pulling data out of Oracle Efficiently, I’m going to discuss one of 3 methods we have devised to replicate data out of an RDBMS. This method is called Trigger-oriented Incremental Replication (a.k.a TIR) and is depicted below in the bottom gray-box.

Before the Oracle-SimpleDB Hybrid system could go live, we needed to copy a lot of data from Oracle to SimpleDB. There were 2 distinct goals:

  1. Copy historical data from Oracle to SimpleDB - i.e. a one-time data fork-lift
  2. Replicate incremental changes as they occur in the live system

One-time Fork-lift : Phase 1

When you run a production OLTP Oracle table that has millions or billions of rows, a one-time fork-lift could mean capturing a large snapshot of the data and then using SimpleDB’s BatchPutAttributes API to load that large snapshot into SimpleDB. The upload could take several weeks. Also, you might need to involve DBAs and SAs to set the snapshot up on a separate host machine so as to not impact your OLTP machine.

Incremental Replication : Phase 2

Once the fork-lift phase of the data replication is complete, you can start streaming new changes to SImpleDB as they happen in Oracle.

Introducing Trickle-lifting, a novel architecture

Our team used a single architecture to solve both of these goals concurrently. In other words, rather than have 2 distinct phases, our architecture met the fork-lift and incremental replication requirements in a single phase with a single architecture.

Essentially, the gray box on the bottom supported incremental replication (i.e. our second requirement). When we piggy-backed the Trickle-lift Architecture (i.e. the gray box on the top) on top of the TIR architecture, we were able to meet our first requirement as well. The Trickle-lift Architecture is a novel architectural alternative to traditional fork-lifting architectures.

Pros and Cons of Entire Architecture

  • Pros
    • No need for DBAs and SAs to set up a snapshot on a separate host
    • Less development work is needed
    • Ability to easily tune the data uploading rate
    • No need to switch off between separate fork-lift and incremental replication phases
  • Cons
    • Use of trigger for all CRUD operations means that writes to the main table will now take longer and be more expensive in terms of system resources like CPU

Trickle-lifting was used in the billion record upload task (see earlier post)

  1. curs-bnr reblogged this from rooksfury
  2. rooksfury posted this
blog comments powered by Disqus
About Me
A blog describing my work in building websites that hundreds of millions of people visit. I'm a senior member of LinkedIn Search Infrastructure. I previously held technical and leadership roles at Netflix, Etsy, eBay & Siebel Systems. In addition to the nerdy stuff, I've included some stunning photography for your pure enjoyment!
Tumblelogs I follow: