Cloud Tips: How to Efficiently Forklift 1 Billion Rows into SimpleDB

About 9 months ago, I was tasked with forklifting a massive amount of data into Amazon’s SimpleDB in a short amount of time. I achieved it. Here’s what you need to know.

If you read on, I’ll show you how to achieve data upload rates of around 10K items/second.

SimpleDB Basics

First of all, if you have 1 billion rows to upload, you will need more than 1 domain. This is because Amazon SDB imposes limits on how much data you can store in one domain (see limits).

Without digressing too much, figure out the optimal domain sharding scheme for your data growth by keeping the following formula in mind:

Storage Usage = ItemNamesSizeBytes + AttributeValuesSizeBytes + AttributeNamesSizeBytes

This is how Amazon computes your Storage Usage against their 10GB per-domain limit.
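
As a rough sketch of how the formula turns into a domain count, the Python below sums the three byte totals and divides by the per-domain limit. The function names, the `headroom` parameter (a fraction of the limit you are willing to fill, leaving room for growth), and the choice to treat 10GB as 10 * 1024**3 bytes are my own assumptions for illustration, not anything Amazon specifies.

```python
import math

# Assumption: the 10GB per-domain limit is treated as 10 * 1024**3 bytes.
DOMAIN_LIMIT_BYTES = 10 * 1024**3


def storage_usage(item_names_size_bytes,
                  attribute_values_size_bytes,
                  attribute_names_size_bytes):
    """Storage Usage as the sum of the three byte totals in the formula."""
    return (item_names_size_bytes
            + attribute_values_size_bytes
            + attribute_names_size_bytes)


def domains_needed(total_bytes, limit_bytes=DOMAIN_LIMIT_BYTES, headroom=0.5):
    """Smallest number of domains such that an even split keeps each
    domain at or below headroom * limit_bytes.

    headroom is a hypothetical knob: 1.0 fills domains to the limit,
    0.5 leaves half of each domain free for future growth.
    """
    usable_bytes = limit_bytes * headroom
    return max(1, math.ceil(total_bytes / usable_bytes))


if __name__ == "__main__":
    # Made-up totals for illustration only.
    total = storage_usage(20_000_000_000, 100_000_000_000, 1_000)
    print(domains_needed(total, headroom=1.0))
    print(domains_needed(total, headroom=0.5))
```

Running the estimate with a headroom below 1.0 is a cheap way to avoid re-sharding soon after the initial load.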

Note: You might need to ask them to raise your domains-per-account limit beyond 100 if you find 100 domains is too few for your data growth.
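
Once you know how many domains you need, every item has to map to exactly one of them. One common approach, shown here as a minimal sketch and not necessarily the scheme I used, is to hash the item name and take the result modulo the domain count. The `domain_for_item` helper and the `mydata` prefix are hypothetical names chosen for this example.

```python
import hashlib


def domain_for_item(item_name, num_domains, prefix="mydata"):
    """Map an item name to a domain name such as 'mydata_007'.

    Uses MD5 only as a stable, evenly spread hash (not for security),
    so the same item name always lands in the same domain.
    """
    digest = hashlib.md5(item_name.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % num_domains
    return f"{prefix}_{shard:03d}"


if __name__ == "__main__":
    for name in ("user-1", "user-2", "user-3"):
        print(name, "->", domain_for_item(name, num_domains=100))
```

Note that changing `num_domains` later remaps most items, so it pays to size the domain count for expected growth up front.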

About Me

A blog describing my work in building websites that millions of people visit. I'm a senior member of LinkedIn's Distributed Data Systems team. I previously held technical and leadership roles at Netflix, Etsy, eBay & Siebel Systems.