Netflix @ DataStax SF 2011 - Monday, July 11

Netflix will be sponsoring DataStax SF 2011 on Monday, July 11. Though Netflix does not often sponsor events or recruit at conferences, we hope to engage with others active in the Cassandra and Distributed Systems community.
Stop by our booth to learn about what we are working on! Bring your resume as well!
Sid, Cloud Systems, Netflix
Cloud Tips: How to Efficiently Forklift 1 Billion Rows into SimpleDB

About 9 months ago, I was tasked with fork-lifting a massive amount of data into Amazon’s SimpleDB in a short amount of time. I achieved it. Here’s what you need to know.
If you read-on, I’ll show you how to achieve data upload rates of around 10K items/second
SimpleDB Basics
First of all, if you have 1 billion rows to upload, you will need more than 1 domain. This is because Amazon SDB imposes certain limits on how much data you can store in one domain : see limits
Without digressing too much, figure out your optimal domain sharding scheme for you data growth by keeping the following formula in mind:
Storage Usage = (ItemNamesSizeBytes + AttributeValuesSizeBytes + AttributeNameSizebytes)
This is how Amazon computes your Storage Usage vis-a-vis their 10GB limit.
Note: You might need to ask them to raise your domains per account beyond 100 if you find 100 domains is too few for your data growth.
Read More
RDBMS vs SimpleDB Overview
Enter the key-value store, exit the RDBMS
Anyone who has worked directly or indirectly with a relational database will tell you that it would be foolish to build a business that didn’t use one to store your business’s data.
One may argue whether MySQL or Oracle is the better choice, but would someone actually argue that an RDBMS (a.k.a. relation database) was not the best choice for storing your data?

Yes! There is a movement, the NoSQL movement, that is challenging the supremacy of RDBMSs for storing your data!
Some are listed here
Now 10 years after Eric Brewer’s game-changing introduction of the CAP theorem, a mass exodus is starting towards AP (availability + partition tolerance) and away from CP (strong consistency + partition tolerance).

Amazon’s SimpleDB is one such alternative to an RDBMS. Simply put, it is a distributed, replicated, eventually-consistent, always-available, key-value store owned and operated (i.e. hosted) by Amazon’s Web Services division.
Read More
Eventual Consistency Explained for Techies
Preamble: Have a look at my previous article titled “Eventually Consistency Explained for non-Techies”
Eventual and Weak Consistency
SimpleDB is eventually consistent. Eventual consistency is a version of weak consistency — you may not see the latest writes committed to the system.
Imagine that you have a system of N nodes. Of these, W nodes are involved in any write sent to the system and R nodes are contacted on any read from the system. Strong consistency can be achieved if R+W > N. In other words, if the read sets and write sets overlap, the read can discover the most recent write to the system.
Read More
The CAP Theorem Distilled
Anyone who has studied Distributed Computing in a graduate school computer science course appreciates that one of the hardest problems to solve is that of keeping writes in multiple locations synchronized. Are the writes synchronous or asynchronous? If the latter, what is the replication delay in the average and worst cases? How are write-inconsistencies resolved in the face of parallel asynchronous writes? Can you ensure that the last write wins? How?
If synchronous, then why don’t you worry about availability, write-throughput, and latency?
The CAP theorem states that it is possible to optimize at most 2 of ‘C’, ‘A’, and ‘P’. ‘C’ refers to strong consistency, ‘A’ refers to availability, and ‘P’ refers to network partition tolerance.
Read More