Cloud Tips: How to Efficiently Forklift 1 Billion Rows into SimpleDB

About 9 months ago, I was tasked with forklifting a massive amount of data into Amazon’s SimpleDB in a short amount of time. I pulled it off. Here’s what you need to know.

If you read on, I’ll show you how to achieve data upload rates of around 10K items/second.

SimpleDB Basics

First of all, if you have 1 billion rows to upload, you will need more than one domain, because Amazon SDB limits how much data you can store in a single domain: see limits

Without digressing too much, figure out your optimal domain sharding scheme for your data growth by keeping the following formula in mind:

Storage Usage = ItemNamesSizeBytes + AttributeValuesSizeBytes + AttributeNamesSizeBytes

This is how Amazon computes your storage usage against their 10 GB per-domain limit.

Note: You might need to ask Amazon to raise your domains-per-account limit beyond 100 if 100 domains is too few for your data growth.
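
To make that concrete, here is a rough sizing sketch in Python. The numbers are placeholders for my data shape, the “mydata_” domain-name prefix is made up, and the estimate ignores whatever per-item overhead Amazon adds on top of the raw bytes, so treat it as a lower bound:

```python
import hashlib

# --- Rough sizing estimate (placeholder numbers; plug in your own data shape) ---
ITEM_NAME_BYTES  = 20    # e.g. a 20-byte item id
ATTRS_PER_ITEM   = 10
ATTR_NAME_BYTES  = 10    # average attribute-name size
ATTR_VALUE_BYTES = 10    # average attribute-value size

# Storage Usage = ItemNamesSizeBytes + AttributeValuesSizeBytes + AttributeNamesSizeBytes
bytes_per_item = (ITEM_NAME_BYTES
                  + ATTRS_PER_ITEM * ATTR_VALUE_BYTES
                  + ATTRS_PER_ITEM * ATTR_NAME_BYTES)

DOMAIN_LIMIT_BYTES = 10 * 1024 ** 3      # SimpleDB's 10 GB-per-domain cap
TOTAL_ITEMS        = 10 ** 9

items_per_domain = DOMAIN_LIMIT_BYTES // bytes_per_item
domains_needed   = -(-TOTAL_ITEMS // items_per_domain)   # ceiling division

print('bytes/item: %d, items/domain: %d, domains needed: %d'
      % (bytes_per_item, items_per_domain, domains_needed))

# --- Simple hash-based sharding: map an item id to one of N domains ---
def domain_for_item(item_id, num_domains):
    digest = hashlib.md5(item_id.encode('utf-8')).hexdigest()
    return 'mydata_%03d' % (int(digest, 16) % num_domains)   # e.g. mydata_042
```

Keep in mind that storage is only half the story; as you'll see below, write throughput per domain may push you to more domains than storage alone would suggest.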

SimpleDB Throttling

Now that you have created the requisite number of domains (say 100), you might think you can just pump all of this data into SimpleDB in a few hours.

Wrong!

Amazon imposes a concept of fairness on all users of the system. If you try to execute too many writes, SDB throttles you: Amazon starts returning fast-fail 503 Service Unavailable responses.
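
The standard way to cope is to wrap every put in exponential backoff with jitter. Here is a minimal, SDK-agnostic sketch; the string check on the exception is a simplification, so swap in whatever typed throttling error your SDK actually raises:

```python
import random
import time

def call_with_backoff(put_fn, max_retries=8):
    """Invoke put_fn() (your SDK's put/batch-put call); on a 503
    ServiceUnavailable throttle, sleep with exponential backoff plus
    jitter and retry."""
    for attempt in range(max_retries):
        try:
            return put_fn()
        except Exception as e:
            msg = str(e)
            if '503' not in msg and 'ServiceUnavailable' not in msg:
                raise                      # not a throttle -- don't retry
            # 0.1s, 0.2s, 0.4s, ... capped at 30s, scaled by random jitter
            time.sleep(min(30, 0.1 * (2 ** attempt)) * (0.5 + random.random()))
    raise RuntimeError('still throttled after %d retries' % max_retries)
```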

How many puts can you execute per second per domain?

My experience has been: 70 singleton puts/domain/sec

If you had 100 domains and could achieve 70 singleton puts/domain/sec, how long would it take you to forklift 1 billion records? It would take about 1.6 days.
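
The back-of-the-envelope math, assuming you could sustain that rate across all 100 domains in parallel:

```python
domains = 100
puts_per_domain_per_sec = 70
total_items = 10 ** 9

aggregate_rate = domains * puts_per_domain_per_sec   # 7,000 items/sec
seconds = total_items / aggregate_rate                # ~142,857 seconds
print('%.2f days' % (seconds / 86400))                # ~1.65 days
```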

In reality, the singleton put rate slows down as your domains fill up, so you don’t get this rate throughout.

Earlier this year, Amazon unveiled Batch Put. This gave me a 20x throughput improvement (I was inserting 20 items per batch put call) on nearly-empty domains. I did not track how the batch put performed as my domains filled up, but my impression was that it was still much faster than the singleton put.

Now, if you have used PutAttributes or BatchPutAttributes, you will know that you can specify “replace=true” or leave the default “replace=false” option on each item. Leaving the default is much, much more efficient. If you know that you are forklifting brand-new data, this is the way to go.
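
For illustration, here is what a single batch put looks like with boto 2’s SimpleDB module. The domain name, item names, and attribute values are made up, and I’m assuming your AWS credentials are already configured in your environment or boto config:

```python
import boto

# A minimal sketch using boto 2's SimpleDB support (boto.sdb).
conn = boto.connect_sdb()
conn.create_domain('mydata_000')   # CreateDomain is idempotent, so this is safe to repeat

# BatchPutAttributes accepts up to 25 items per call; I was sending 20.
items = {
    'item-%d' % i: {'attr1': 'value1', 'attr2': 'value2'}
    for i in range(20)
}

# replace=False (the SDB default) is the cheap path -- use it when the data is brand new.
conn.batch_put_attributes('mydata_000', items, replace=False)
```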

SimpleDB Forklift Experiment

I ran the following experiment:

Data Shape and Domain Count

  • ~100 domains
  • ~10 attributes per item
  • ~10 bytes per attribute
  • ~20-byte key (the item id)
  • ~35 Million records

Other Info

  • Default Replace=false on all items
  • Domains were empty to start with
  • 20 items per BatchPutAttributes call
  • Source data from Oracle instance on west coast, SDB target on east coast
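
Putting the pieces together, the forklift itself ran many uploads in parallel. The sketch below is only an illustration of that pattern (one thread and one bounded queue per domain), not the exact code from the experiment; it reuses the hypothetical call_with_backoff helper and domain_for_item naming from the earlier sketches:

```python
import queue
import threading

import boto

NUM_DOMAINS  = 100
BATCH_SIZE   = 20
DOMAIN_NAMES = ['mydata_%03d' % i for i in range(NUM_DOMAINS)]   # hypothetical names

def uploader(domain_name, work_queue):
    """One worker per domain: drain that domain's queue and batch-put with replace=False."""
    conn = boto.connect_sdb()            # one connection per thread
    while True:
        batch = work_queue.get()         # dict: item_name -> {attr_name: value}
        if batch is None:                # sentinel: no more work for this domain
            return
        call_with_backoff(               # backoff helper from the throttling sketch above
            lambda: conn.batch_put_attributes(domain_name, batch, replace=False))

# One bounded queue and one thread per domain keeps every domain busy in parallel.
queues  = {name: queue.Queue(maxsize=50) for name in DOMAIN_NAMES}
threads = [threading.Thread(target=uploader, args=(name, queues[name]))
           for name in DOMAIN_NAMES]
for t in threads:
    t.start()

# The reader side (paging through the source Oracle table) would group rows into
# BATCH_SIZE-item dicts, route each batch onto the queue for its domain (e.g. via
# domain_for_item), then enqueue a None sentinel per domain and join the threads.
```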

Resulting Performance

  • All data forklifted in 55 minutes (i.e. Average Rate = 10-11K items/sec)

Conclusion

You can achieve a pretty good rate of data upload into SimpleDB — 35M records in 55 minutes across the US (west coast to east coast).

For my 1-billion-record task, I ran into a series of issues that kept me from sustaining the 10K items/sec upload rate. These included:

  • Not being able to use replace=false (I had to use replace=true)
  • Source Oracle DB was a bottleneck

If you have any questions, feel free to DM me at @r39132 on Twitter.
