January 2012
1 post
6 tags
The State of NoSQL in 2012
Preamble Ramble
If you’ve been working in the online (e.g. internet) space over the past 3 years, you are no stranger to terms like “the cloud” and “NoSQL”.
In 2007, Amazon published a paper on Dynamo. The paper detailed how Dynamo, employing a collection of techniques to solve several problems in fault-tolerance, provided a resilient solution to the on-line...
December 2011
1 post
2 tags
Netflix & LinkedIn
After 4+ years at Netflix, I am about to embark on a new challenge. On Monday, I start at LinkedIn as a senior member of the Service Infrastructure team. It has been my privilege and honor to work at Netflix these many years. To date, Netflix is the best company that I have worked at.
Netflix
Central to the company’s success is its unique culture of “Freedom and...
November 2011
2 posts
Slides from my QCon SF 2011 talk on Fault Tolerant...
Abstract : http://qconsf.com/sf2011/speaker/Siddharth+Anand
Keeping Movies Running Amid Thunderstorms!
View more presentations from Sid Anand
2 tags
"Head in the Cloud" Article about Netflix in the...
Check out the latest Cornell Engineering Magazine Issue for an article on how Netflix leverages the cloud! The story starts on page 10.
September 2011
1 post
July 2011
3 posts
3 tags
"NoSQL @ Netflix, Part 2" Talk at OSCON 2011
I’ve been having a great week so far at OSCON. Excellent talks from Riak, Facebook, 10gen, DataStax, Acunu, StumbleUpon, etc….
Uploading slides from my talk here..
OSCON Data 2011 — NoSQL @ Netflix, Part 2 View more presentations from Sid Anand
NoSQL @ Netfix @ OSCON Data, July 25
During July 25-27, O’Reilly will be dedicating a special section of their OSCON open source conference to big data use-cases and technologies. Tracks cover a range of topics, from NoSQL to relational technologies. Additionally tracks dedicated to Hadoop and Visualization are also included. This section is aptly named OSCON Data.
I’ll be speaking during the Monday, 10:40 am slot...
June 2011
5 posts
3 tags
Acunu - Cassandra optimized (SSDs, new data...
]
For some workloads, they achieve 2 orders of magnitude better latency results relative to a vanilla Cassandra distribution. This is achieved through a combination of SSDs and a new data structure (Stratified B-tree).
Acunu White paper. http://bit.ly/jQc423
Short Blog entries on Cassandra performance
http://bit.ly/mUqWLO Cassandra under heavy write load, Part...
Recent GigaOM Structure 2011 Cloud Guru Panel Video (6/22/2011)
4 tags
Netflix @ DataStax SF 2011 - Monday, July 11
Netflix will be sponsoring DataStax SF 2011 on Monday, July 11. Though Netflix does not often sponsor events or recruit at conferences, we hope to engage with others active in the Cassandra and Distributed Systems community.
Stop by our booth to learn about what we are working on! Bring your resume as well!
Sid, Cloud Systems, Netflix
4 tags
Some upcoming conferences that I will be speaking...
4 tags
Netflix for Mobile in the Cloud
Last week, I attended Intuit’s annual CTOF (Create The OFfering) conference in San Diego. The focus this year was on mobile best practices. I was happy to represent Netflix at this venue and learn from other companies with a foot in either mobile, the cloud, or both.
Netflix is an example of both with support on several mobile platforms (e.g. iOS, Android, Windows Mobile) and traffic...
May 2011
1 post
3 tags
Distributed Systems & Dev Ops Engineers, Netflix...
Hi Folks!
As many of you know, Netflix has been expanding its video streaming offering throughout the US and the world. We have been successful thanks to our move to Amazon Web Services’ cloud and our adoption of NoSQL solutions.
Senior Software Engineer - Cloud Database Performance
Senior Operations Engineer - Cloud Database
-s
March 2011
1 post
3 tags
Talks, Videos, and White Papers
I thought that I would put up a list of previous and upcoming talks that I will be giving. These talks will tend to focus on my work in Distributed “Cloud” Computing and NoSQL.
Recent & Upcoming Speaking Appearances
QCon SF 2010
Cornell Silicon Valley 2010 Cloud Panel
Silicon Valley Cloud Computing Group (Meetup) — “NoSQL @ Netflix”
QCon London 2011 (2...
February 2011
2 posts
5 tags
NoSQL @ Netflix : Part 1
Hi Folks!
I had a great time this past Thursday (Feb 17, 2011) speaking at the Silicon Valley Cloud Computing Group Meetup at Facebook. The talk covered Netflix’s move from RDBMS to NoSQL, specifically SimpleDB. Subsequent parts will provide our experiences with Cassandra, HBase, and other technologies.
Video is now available. The first 10 minutes are from sponsors, VMWare, RackSpace,...
3 tags
Silicon Valley Meetup & QCon London
Hi Folks!
I’ll be giving a 2-part lecture on NoSQL @ Netflix at the Silicon Valley Cloud Computing Meetup in Mountain View on Feb 17. The lectures will be a month apart.
I will detail the challenges involved in going from an RDBMS in our Data Center to AWS’s SimpleDB and S3 in the Cloud. I was intimately involved in this transition. Now, my team at Netflix is investing in Cassandra...
January 2011
1 post
Red Black Trees vs Skip Lists →
December 2010
1 post
3 tags
How Process Priority Inversion Can Burn CPU via...
For the past few weeks, we have been wrestling with an interesting bug in Oracle 11g at Netflix.
We are seeing high CPU attributed with a high number of wait events for the following:
cursor: mutex S
latch: shared pool
I was perplexed as to why waits would result in high CPU so I hopped on to Google. I didn’t find an answer for 11g, but I did find something reported in 10.2 and...
November 2010
2 posts
4 tags
Cassandra : Row Cache & Memtable Q&A
Below, you will find 2 sets of questions and answers regarding Cassandra’s (now v0.7) Row Cache and Memtable. These were answered by Matthew Dennis at Riptano, a company that is actively developing both Cassandra and a management suite known as RipCord.
Questions
Hi! I have a super column family. Writes either modify a column within a super column or add a super column to the...
5 tags
Netflix's Transition to High-Availability Storage...
I presented at QCon SF (2010) yesterday on Netlflix’s transition to high-availability storage. The slides are on slideshare.
I expect that the presentation will be available on the QCon SF site in a few days. It’s (loosely) based on my previous white paper of the same title - see this post.
October 2010
1 post
4 tags
Netflix's Transition to High-Availability Storage...
I just published a white paper titled Netflix’s Transition to High-Availability Storage Systems
Feel free to email me your thoughts at siddharthanand@yahoo.com
To download this paper as PDF, click on this .
June 2010
3 posts
3 tags
SimpleDB Essentials for High Performance Users :...
This is Part 3 of SimpleDB Essentials for High Performance Users. Check out Part 2
Work around Attribute Value Length Limits If you need to store data that is vastly larger than 1024 bytes in a SimpleDB attribute, consider storing that data in S3 and putting a pointer (i.e. bucket name + object key) to the data in the simpleDB attribute. However, the drawback from this approach is that you...
3 tags
SimpleDB Essentials for High Performance Users :...
This Part 2 of SimpleDB Essentials for High Performance Users. Check out Part 1
Beware of Case-senstivity Since domain names and attribute names are case-sensitive, for all domain and attribute names, use uppercase lettering and separate words with “_”
When sharding domains, adopt zero-based index numbering and separate it from the root name with “_” e.g....
4 tags
SimpleDB Essentials for High Performance Users :...
Preamble
I’ve been a heavy-user of SimpleDB since January 2009, storing, writing, and reading billions of items. Based on my experience, I’ve compiled a list of best practices and conventions to simplify working with SimpleDB. I’ve divided this into multiple parts to ease readability.
Details
Since sorting is...
March 2010
1 post
6 tags
A Java Out-of-Memory Error involving GZIP, Typica,...
UPDATE
I am providing an update here to the root cause.
Overview I ran into an interesting Out of Memory bug this week. It occurs if you use gzip to send/receive data and under-utilize your Java Heap memory. This land-mine has existed since 2004, though hopefully you will not be bitten by it. Problem Stack A Java process was throwing the following Out-of-Memory Error.
JVMDUMP013I Processed Dump...
January 2010
3 posts
2 tags
SimpleDB Performance : 5 Steps to Achieving High...
I was recently tasked with fork-lifting ~1 billion rows from Oracle into SimpleDB. I completed this forklift in November 2009 after many attempts. To make this as efficient as possible, I worked closely with Amazon’s SimpleDB folks to troubleshoot performance problems and create new APIs. I’d like to share some recommendations and observations.
Although I have covered these...
2 tags
"The cloud" Explained for Normal People
If you are like most people in software, you have heard the term “The cloud” but have no idea what it means. If you are industrious enough to buy a book, google the term, or troll twitter for related tweets, you are likely exasperated by the shear marketing buzzword blast you encounter.
To make it easier on you, I am going to tell you what it means to me with a very specific example:
...
4 tags
HTML 5's Web Sockets explained
I’ve been reading a bit about HTML 5’s WebSockets lately.
First, here are some definitions:
Comet - an umbrella term referring to techniques that provide “server push” using standard browser functionality — i.e. without the aid of specialty browser plug-ins. In practice, Comet in most-often implemented via Ajax with long polling.
Long polling - (from Wikipedia)...
December 2009
13 posts
4 tags
Website Performance - Why you should care and what...
Why Does Performance Matter?
Oftentimes, people speak interchangeably about web site performance, scalability, and availability. Although these 3 terms are related, they are distinct and unique. Here are their definitions:
availability - what is the total length of time that [some part of] a web site is available during a hour/day/year?
scalability - what is the largest number of concurrent...
3 tags
Denial of Service (DoS) : Some Thoughts
About a year ago, I had the opportunity to solve a class of Denial-of-Service attacks that were compromising our availability and scalability. During that investigation, I happened upon a revelation. That revelation led to a solution. I’ve since seen that learning applied to other systems, including Amazon’s SimpleDB, so I wanted to share it here.
Consider the following scenario (also...
5 tags
SimpleDB Recommended Reading List (12/23/09)
Below is a list of recommended reading to understand SimpleDB and other cloud-related topics. The reading list starts with distributed computing basics and ends with in-depth SimpleDB best-practices.
The CAP Theorem Distilled
The “Consistency, Not Accuracy” Principle
Eventual Consistency Explained for Non-techies
Eventual Consistency Explained for Techies
RDBMS vs. SimpleDB...
6 tags
The Oracle-SimpleDB Hybrid Part 3 : Defining the...
Preamble : See Part 2 : Solving the Eventual-Consistency Problem
When building a SimpleDB-Oracle (i.e. any key_value_store-RDBMS) hybrid system, translating between two very different data models presents a challenge. The challenge expands beyond the obvious ACID vs. BASE differences.
Most RDBMSs support the following features:
Triggers
Stored Procedures
Constraints (e.g. integrity,...
6 tags
The Oracle-SimpleDB Hybrid Part 2 : Solving the...
Preamble: Read Part 1: Pulling Data out of Oracle Efficiently
Creating the Oracle-SimpleDB Hybrid system is a challenge. For one, it is a multihomed system, accepting writes in both the cloud (i.e. SimpleDB) and in our data center (i.e. Oracle). Secondly, the data center and the AWS reqion (i.e. US-east) are on opposite coasts of the US with network latency ~50-100ms. Thirdly, clocks are not...
5 tags
The Oracle-SimpleDB Hybrid Part 1 : Pulling data...
Preamble : See my previous post titled “Introducing the Oracle-MySQL Hybrid”
In my previous post, I provided an overview of an Oracle-SimpleDB Hybrid system that I am building. It supports writes to multiple masters, replicates data between masters in single-digit seconds (i.e. in the absence of long-term network partitions), is eventually-consistent, and is designed for optimal...
4 tags
Introducing the Oracle-SimpleDB Hybrid
My company would like to migrate its systems to the cloud. As this will take several months, the engineering team needs to support data access in both the cloud and its data center in the interim. Also, the RDBMS system might be maintained until some functionality (e.g. Backup-Restore) is created in SimpleDB.
To this aim, for the past 9 months, I have been building an eventually-consistent,...
7 tags
Cloud Tips: How to Efficiently Forklift 1 Billion...
About 9 months ago, I was tasked with fork-lifting a massive amount of data into Amazon’s SimpleDB in a short amount of time. I achieved it. Here’s what you need to know.
If you read-on, I’ll show you how to achieve data upload rates of around 10K items/second
SimpleDB Basics
First of all, if you have 1 billion rows to upload, you will need more than 1 domain. This is...
5 tags
RDBMS vs SimpleDB Overview
Enter the key-value store, exit the RDBMS
Anyone who has worked directly or indirectly with a relational database will tell you that it would be foolish to build a business that didn’t use one to store your business’s data.
One may argue whether MySQL or Oracle is the better choice, but would someone actually argue that an RDBMS (a.k.a. relation database) was not the best choice for...
6 tags
Eventual Consistency Explained for Techies
Preamble: Have a look at my previous article titled “Eventually Consistency Explained for non-Techies”
Eventual and Weak Consistency
SimpleDB is eventually consistent. Eventual consistency is a version of weak consistency — you may not see the latest writes committed to the system.
Imagine that you have a system of N nodes. Of these, W nodes are involved in any write sent to...
3 tags
Eventual Consistency Explained for Non-techies
If you work in the Computer industry, especially the Internet industry, chances are good that you have encountered an eventually-consistent system.
For example, when managing an internet or IT business, you might have considered one of all of the following DB architectures:
Use a single DB host e.g. MyHost
Use a single DB host for your writes, but several for your reads e.g. MyWriteHost...
5 tags
The "Consistency, Not Accuracy" Principle
Preamble: Read my post “The CAP Theorem distilled”
In my previous post, I started talking about the “Consistency, Not Accuracy” Principle (a.k.a. The CNA Principle)
Essentially, in order to scale your web site and to keep running amidst unpredictable network and system outages, you need to have a replicated, fault-tolerant data store that accepts reads and writes in...
4 tags
The CAP Theorem Distilled
Anyone who has studied Distributed Computing in a graduate school computer science course appreciates that one of the hardest problems to solve is that of keeping writes in multiple locations synchronized. Are the writes synchronous or asynchronous? If the latter, what is the replication delay in the average and worst cases? How are write-inconsistencies resolved in the face of parallel...