The State of NoSQL in 2012

Preamble Ramble
If you’ve been working in the online (e.g. internet) space over the past 3 years, you are no stranger to terms like “the cloud” and “NoSQL”.
In 2007, Amazon published a paper on Dynamo. The paper detailed how Dynamo, employing a collection of techniques to solve several problems in fault-tolerance, provided a resilient solution to the on-line shopping cart problem. A few years go by while engineers at AWS toil in relative obscurity at standing up their public cloud.
It’s December 2008 and I am a member of Netflix’s Software Infrastructure team. We’ve just been told that there is something called the “CAP theorem” and because of it, we are to abandon our datacenter in hopes of leveraging Cloud Computing.
Huh?
A month into the investigation, we start wondering about our Oracle database. How are we are going to move it into the cloud? That’s when we are told that we are to abandon our RDBMS too. Instead, we are going to focus on “AP”-optimized systems. “AP” or high-availability systems is our focus in 2008-2010 as Netflix launches its video streaming business. What’s more available than TV right? Hence, no downtime is allowed, no excuses!
Fast forward to the end of 2011: the past 3 years have been an amazing ride. I helped Netflix with two migrations in 2010: the first was from Netflix’s datacenter to AWS’ cloud, the second from Oracle to SimpleDB. 2011 was no less exciting: with a move from SimpleDB to Cassandra, Netflix was able to expand overseas to the UK and Ireland. Since Cassandra offers configurable routing, a cluster can be spread across multiple continents.
Fast forward to today: I’ve completed my first month at LinkedIn. I’ve spent this month getting familiar with the various NoSQL systems that LinkedIn has built in-house. These include Voldemort (another Dynamo-based system), Krati (a single-machine data store), Espresso (a new system being actively developed), etc… LinkedIn is now facing similar challenges to the Netflix of 3 years ago. Traffic is growing and both Oracle and the datacenter face potential obsolescence.
NoSQL and Cloud Computing to the rescue?

The State of NoSQL Today
Ignoring the datacenter vs. public cloud question for the time-being, what would I pick today regarding a NoSQL alternative to Oracle? Not many people get a chance to solve the same tough problem twice.
For one, there are still quite a few NoSQL alternatives in the market. Some are supported by startups (e.g. Cassandra, Riak, MongoDB, etc..), some are supported indirectly by companies (e.g. LinkedIn’s support of Voldemort, Facebook & StumbleUpon’s support of HBase), and some are supported directly by companies (e.g. AWS’s S3, SimpleDB, and DynamoDB, etc… ).
In making a decision, I’ll consider the following learnings:
- Any system that you pick will require 24-7 operational support. If it is not hosted (e.g. by AWS), be prepared to hire a fleet of ops folks to support it yourself. If you don’t have the manpower, I recommend AWS’ DynamoDB
- Just because the company got by with one big machine for Oracle, don’t be surprised if the equivalent NoSQL option results in 36 smaller machines. All complete solutions to fault-tolerance support “rebalancing”. Rebalancing speed is determined by data size of a shard. Hence, it’s better to keep the size per shard reasonable to minimize MTTR in times of disaster.
- Understand the limitations of your choice:
- MongoDB, at the time of this writing, has a global write-lock. This means that only one write can proceed at a time in a node. If you require high write-throughput, consider something else
- Cassandra (similar to other Dynamo systems) offers great primary key-based access operations (e.g. get, put, delete), but doesn’t scale well for secondary-index lookups
- Cassandra, like some other systems, has a lot of tunables and a lot of internal processes. You are better off turning off some features (e.g. Anti-entropy repair, row cache, etc…) in production to safeguard consistent performance.

Many of the NoSQL vendors view the “battle of NoSQL” to be akin to the RDBMS battle of the 80s, a winner-take-all battle. In the NoSQL world, it is by no means a winner-take-all battle. Distributed Systems are about compromises.
A distributed system picks specific design elements in order to perform well at certain operations. These design elements comprise the DNA of the system. As a result, the system will perform poorly at other operations. In the rush to win mindshare, some NoSQL vendors have added features that don’t make sense for the DNA of the system.
I’ll cite an example here. Systems that shard data based on a primary key will do well when routed by that key. When routed by a secondary key, the system will need to “spray” a query across all shards. If one of the shards is experiencing high latency, the system will return either no results or incomplete (i.e. inconsistent) results. For this reason, it would make sense to store the secondary index on an unsharded (but replicated) system. This concept has been utilized internally at Netflix to support internal use-cases. Secondary indexes are stored in Lucene to point to data in Cassandra.
LinkedIn is following the same pattern in the design of its new system, Espresso. Secondary indexes will be served by Lucene. The secondary index will return rowids in the primary store.

This brings me to another observation. In reviewing Voldemort code recently, I was impressed by the clarity and quality of the code. In core systems, whether distributed or not, code quality and clarity goes a long way. Although Voldemort is an open source project and has been for years, a high degree of discipline has been maintained. I was also impressed by the simplicity of its contract - get(key), put(key,value), and delete(value). The implementors understood the DNA of the system and did not add functionality to the system that is ill-suited to the DNA.
In a similar vein, Krati, Kafka, Zookeeper, and a few other notable open-source projects stick to clear design principles and simple contracts. As such, they become reusable infrastructure pieces that can be used to build an Distributed System that you need. Hence, the system we end up building might be composed of several specialty component systems that can be independently tuned, in some ways similar to HBase. As a counter example, to achieve predictable performance in Cassandra without a significant investment in tuning, it may be easier to turn off features. This is because multiple features in a single machine contend for resources — since each feature has a different DNA (or resource consumption profile), performance diagnosis and tuning can be a pain point.
Till next time, stay tuned!
SimpleDB Essentials for High Performance Users : Part 3

This is Part 3 of SimpleDB Essentials for High Performance Users. Check out Part 2
- Work around Attribute Value Length Limits
- If you need to store data that is vastly larger than 1024 bytes in a SimpleDB attribute, consider storing that data in S3 and putting a pointer (i.e. bucket name + object key) to the data in the simpleDB attribute. However, the drawback from this approach is that you will require 2 round-trips (i.e. one to SimpleDB and one to S3) to compose one logical row. Beyond the obvious performance hit, this approach is not transactionally sound.
- A better approach is to split that data over several SimpleDB attributes. You will need to control the splitting and joining logic of these SimpleDB attributes, but you will only need one roundtrip and you can leverage conditional puts for concurrency control. This approach is ideal if your data can fit in 10 or fewer attributes.
- Just remember that subsequent updates to these split attributes might be of different length
- Getting tripped up by the Default Select Query pagination limit of 100
- You must be aware that the SDB Select query supports the “limit N” expression. This allows the developer to specify N up to a max of 2500. If the developer chooses N=200 for example and 1000 items match the WHERE clause conditions, then the results would be returned in chunks of 200 at a time. 5 subsequent round trips would be required to fetch the 1000 items. For customer facing functionality, you are risking end-user timeouts. To avoid this, always specify “limit 2500”. Note: if you don’t specify it, the default value of 100 is assumed by SimpleDB
- Avoid any client code that auto-follows tokens returned by SimpleDB. SimpleDB Query timeouts could result in an unpredictably long-cycle of next-pointers. Auto-following these can not only result in an infinite loop on your servers, but customer-browser timeouts as well. Instead, follow these next pointers judiciously.
- Avoid carrying multi-table relationships into the cloud in the form of multi-domain relationships. Try to denormalize these relationships into single items. Doing joins in the application tier might require multiple round-trips to SDB and open customer-facing functionality to time-outs
- Remember that there are no sequences, locks, constraints (except for the uniqueness constraint on the item name), triggers, etc.. in SimpleDB. Don’t expect them
SimpleDB Essentials for High Performance Users : Part 2

This Part 2 of SimpleDB Essentials for High Performance Users. Check out Part 1
Read More
SimpleDB Essentials for High Performance Users : Part 1

Preamble
I’ve been a heavy-user of SimpleDB since January 2009, storing, writing, and reading billions of items. Based on my experience, I’ve compiled a list of best practices and conventions to simplify working with SimpleDB. I’ve divided this into multiple parts to ease readability.

Read More
“The cloud” Explained for Normal People
If you are like most people in software, you have heard the term “The cloud” but have no idea what it means. If you are industrious enough to buy a book, google the term, or troll twitter for related tweets, you are likely exasperated by the shear marketing buzzword blast you encounter.
To make it easier on you, I am going to tell you what it means to me with a very specific example:
I am helping my company move into the cloud. Specifically, we are going to use most of Amazon’s AWS services.
Definitions
First, some abbreviations and definitions:
-
AWS
- Amazon Web Services, a division of Amazon focusing on hosting our applications and data
-
SimpleDB
- AWS’s always-available replacement for RDBMSs. Specifically SimpleDB is their hosted, replicated key-value store that is always available and accessible as a web service
-
S3
- (a.k.a Simple Storage Service) AWS’s always-available file storage solution accessible as a web service
-
SQS
- (a.k.a Simple Queue Service) AWS’s always-available queueing service accessible as a web service
-
ELB
- (a.k.a Elastic Load Balancer) AWS’s always-available load balancing service accessible as a web service
-
EC2
- (a.k.a Elastic Compute Cloud) AWS’s on-demand server offering accessible as a web service
-
CloudFront
- AWS’s CDN (a.k.a Content Delivery Network) offering accessible as a web service
All of the services above are pay-as-go (and are very reasonable at that) and are accessible as web services. They are also always-available.
So why does one go about using these services?
Read More
Website Performance - Why you should care and what you can do!

Why Does Performance Matter?
Oftentimes, people speak interchangeably about web site performance, scalability, and availability. Although these 3 terms are related, they are distinct and unique. Here are their definitions:
-
availability - what is the total length of time that [some part of] a web site is available during a hour/day/year?
-
scalability - what is the largest number of concurrent users that a system can handle?
-
performance - what is the [worst] perceived response time for a single user?
It should be pretty clear that for web sites that charge money (either through ads or via subscription-based services), lesser availability translates into lesser revenue. In the same way that you can lose revenue via an outage, you can lose revenue during traffic peaks if your website cannot handle those peaks.
Hence, just as scalability and availability can reduce your top-line (revenue), can performance have a similar affect?
- In 2006 Google’s tests showed that increasing load time by 0.5 seconds resulted in a 20% drop in traffic.
- In 2007 Amazon’s tests showed that for every 100ms increase in load time, sales would decrease 1%.
- This year (2009) Akamai (a CDN leader) revealed in a study that 2 seconds is the new threshold for eCommerce web page response times.
Hence, it does.
Where Should You Look for Performance Issues?
Ask anyone who has worked on a web site or an enterprise system and they will say “Look at your database!”. Although that is true, the mistake people often make is stopping there. After tuning queries to their satisfaction, engineers seem to ignore 5-7 second website page load times. 80-90% of this time really comes from assembling the web page.
Read More
The Oracle-SimpleDB Hybrid Part 2 : Solving the Eventual Consistency Problem
Preamble: Read Part 1: Pulling Data out of Oracle Efficiently
Creating the Oracle-SimpleDB Hybrid system is a challenge. For one, it is a multihomed system, accepting writes in both the cloud (i.e. SimpleDB) and in our data center (i.e. Oracle). Secondly, the data center and the AWS reqion (i.e. US-east) are on opposite coasts of the US with network latency ~50-100ms. Thirdly, clocks are not synchronized via network protocols like NTP across WANs. NTP across WANs introduce tens of milliseconds of inaccuracy, which may not be good enough to resolve all forms of conflict.
To build a multi-homed system, we needed to keep our Oracle DB in our data center in sync with our SimpleDB domains in the East Coast region.
This is a tough problem to solve. How consistent do we want the data? What if we shoot for strong-consistency?
In order to build a strongly-consistent link between Oracle and SimpleDB, we could use dual-writes via a 2 Phase Commit (a.k.a 2PC) protocol. However, 2PC over a 50-100ms link would be an availability bottleneck, and hence 2PC is not a viable option. Any consensus protocol would suffer the same short-coming.
Since we cannot achieve Strong Consistency between Oracle and SimpleDB, can we achieve Eventual Consistency?

Read More
The Oracle-SimpleDB Hybrid Part 1 : Pulling data out of Oracle Efficiently
Preamble : See my previous post titled “Introducing the Oracle-MySQL Hybrid”
In my previous post, I provided an overview of an Oracle-SimpleDB Hybrid system that I am building. It supports writes to multiple masters, replicates data between masters in single-digit seconds (i.e. in the absence of long-term network partitions), is eventually-consistent, and is designed for optimal AP — it survives Network Partitions and is highly available.
My company already relies on Oracle databases. In order to transition to SimpleDB, we will need to move one application at a time into the cloud while keeping our service running. As this cannot happen overnight, we need to keep both SimpleDB and Oracle in sync.
In Part 1 : Pulling data out of Oracle Efficiently, I’m going to discuss one of 3 methods we have devised to replicate data out of an RDBMS. This method is called Trigger-oriented Incremental Replication (a.k.a TIR) and is depicted below in the bottom gray-box.

Before the Oracle-SimpleDB Hybrid system could go live, we needed to copy a lot of data from Oracle to SimpleDB. There were 2 distinct goals:
- Copy historical data from Oracle to SimpleDB - i.e. a one-time data fork-lift
- Replicate incremental changes as they occur in the live system
Read More
Introducing the Oracle-SimpleDB Hybrid
My company would like to migrate its systems to the cloud. As this will take several months, the engineering team needs to support data access in both the cloud and its data center in the interim. Also, the RDBMS system might be maintained until some functionality (e.g. Backup-Restore) is created in SimpleDB.
To this aim, for the past 9 months, I have been building an eventually-consistent, multi-master data store. This system is comprised of an Oracle replica and several SimpleDB replicas. As I near completion of this system, I’d like to share its design.
Here’s the system:

We plan on accepting reads and writes in our data center (Oracle) and in our AWS region (SimpleDB). There are 2 Incremental Replicators (IRs) that transmit the changes between Oracle and SimpleDB. One replicates data from Oracle to SimpleDB, the other replicates data back from SimpleDB to Oracle.
Read More
RDBMS vs SimpleDB Overview
Enter the key-value store, exit the RDBMS
Anyone who has worked directly or indirectly with a relational database will tell you that it would be foolish to build a business that didn’t use one to store your business’s data.
One may argue whether MySQL or Oracle is the better choice, but would someone actually argue that an RDBMS (a.k.a. relation database) was not the best choice for storing your data?

Yes! There is a movement, the NoSQL movement, that is challenging the supremacy of RDBMSs for storing your data!
Some are listed here
Now 10 years after Eric Brewer’s game-changing introduction of the CAP theorem, a mass exodus is starting towards AP (availability + partition tolerance) and away from CP (strong consistency + partition tolerance).

Amazon’s SimpleDB is one such alternative to an RDBMS. Simply put, it is a distributed, replicated, eventually-consistent, always-available, key-value store owned and operated (i.e. hosted) by Amazon’s Web Services division.
Read More