Cloud Tips: How to Efficiently Forklift 1 Billion Rows into SimpleDB
About 9 months ago, I was tasked with fork-lifting a massive amount of data into Amazon’s SimpleDB in a short amount of time. I achieved it. Here’s what you need to know.
If you read on, I’ll show you how to achieve data upload rates of around 10K items/second.
First of all, if you have 1 billion rows to upload, you will need more than one domain, because Amazon SimpleDB imposes limits on how much data you can store in a single domain: see limits.
Without digressing too much, figure out your optimal domain-sharding scheme for your data growth by keeping the following formula in mind:
Storage Usage = (ItemNamesSizeBytes + AttributeValuesSizeBytes + AttributeNamesSizeBytes)
This is how Amazon computes your Storage Usage vis-a-vis their 10GB limit.
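To make that sizing concrete, here is a rough Python sketch of the math using the formula above and the data shape from the experiment later in this post (20-byte keys, 10 attributes, 10-byte values). The 10-byte attribute-name size is my own assumption, since the post doesn't state it:

```python
# Rough domain-count estimate against the 10 GB SimpleDB storage limit,
# using the post's formula. Sizes below mirror the experiment's data shape;
# the attribute-name size (10 bytes) is an assumption.

DOMAIN_LIMIT_BYTES = 10 * 1024**3  # 10 GB per domain

def item_storage_bytes(item_name_bytes, attrs):
    """Per-item storage: ItemNamesSizeBytes + AttributeValuesSizeBytes
    + AttributeNamesSizeBytes. `attrs` is a list of
    (name_bytes, value_bytes) pairs."""
    return item_name_bytes + sum(n + v for n, v in attrs)

def domains_needed(total_items, bytes_per_item):
    """Ceiling of total storage over the per-domain limit."""
    total = total_items * bytes_per_item
    return -(-total // DOMAIN_LIMIT_BYTES)  # ceiling division

# 20-byte key, 10 attributes of 10-byte names and 10-byte values
per_item = item_storage_bytes(20, [(10, 10)] * 10)  # 220 bytes/item
print(domains_needed(1_000_000_000, per_item))      # domains for 1B items
```

For this particular shape the storage math alone calls for far fewer than 100 domains; the reason to shard wider is throughput, as the next section shows.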
Note: You might need to ask Amazon to raise your per-account domain limit beyond 100 if you find 100 domains is too few for your data growth.
Now that you have created the requisite number of domains (say 100), you might think you can just pump all this data into SimpleDB in a few hours. Not so fast:
Amazon imposes a concept of fairness on all users of the system. If you try to execute too many writes, Amazon starts returning 503 - Service Unavailable fast-fail responses to you. SDB throttles you.
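Since the throttle thresholds aren't published, the usual defense is to back off and retry when a 503 comes back. A minimal sketch of a jittered exponential-backoff delay; the base and cap constants are my own illustrative choices, not from the post:

```python
import random

def backoff_delay(attempt, base=0.05, cap=5.0):
    """Delay (seconds) to sleep after the `attempt`-th consecutive 503.
    Full jitter: uniform over [0, min(cap, base * 2**attempt)].
    base/cap values are illustrative, not tuned for SDB."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

On each 503 you would sleep for `backoff_delay(attempt)` and retry, resetting `attempt` to zero after a successful put.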
How many puts can you execute per second per domain?
My experience has been: 70 singleton puts/domain/sec
If you had 100 domains and could achieve 70 singleton puts/domain/sec, how long would it take you to forklift 1 Billion records? It would take about 1.6 days.
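The arithmetic behind that estimate:

```python
# Back-of-the-envelope time to forklift 1B records with singleton puts.
domains = 100
puts_per_domain_per_sec = 70
total_rate = domains * puts_per_domain_per_sec  # 7,000 items/sec
seconds = 1_000_000_000 / total_rate
print(seconds / 86_400)  # ≈ 1.65 days
```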
In reality, the single put rate slows down the more full your domains become, so you don’t get this rate throughout.
Earlier this year, Amazon unveiled Batch Put. This gave me a 20x throughput improvement (I was inserting 20 items per batch-put call) on nearly-empty domains. I did not measure how batch put performed as my domains filled up, but my impression was that it remained much faster than singleton puts.
Now, if you have used PutAttributes or BatchPutAttributes, you will know that you can specify “replace=true” or leave the default “replace=false” on each item. Leaving the default is much, much more efficient. If you know you are fork-lifting brand-new data, this is the way to go.
SimpleDB Forklift Experiment
I ran the following experiment:
Data Shape and Domain Count
- ~100 domains
- ~10 attributes per item
- ~10 bytes per attribute
- ~20 byte key (20 byte item id)
- ~35 Million records
- Default Replace=false on all items
- Domains were empty to start with
- 20 items per BatchPutAttributes call
- Source data from Oracle instance on west coast, SDB target on east coast
- All data forklifted in 55 minutes (i.e. Average Rate = 10-11K items/sec)
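The quoted average rate checks out:

```python
# Average upload rate achieved in the experiment.
records = 35_000_000
minutes = 55
rate = records / (minutes * 60)
print(round(rate))  # ≈ 10,606 items/sec, i.e. the 10-11K figure above
```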
You can achieve a pretty good rate of data upload into SimpleDB — 35M records in 55 minutes across the US (west coast to east coast).
For my 1 Billion record task, I ran into a series of issues that kept me from sustaining the 10K items/sec upload rate. These include:
- Not being able to use replace=false (I had to use replace=true)
- Source Oracle DB was a bottleneck
If you have any questions feel free to DM me at @r39132 (Twitter)