-
Senior Software Engineer – Cloud Performance - Netflix
Los Gatos, CA
The Culture
Netflix hires extraordinary performers and gives them the freedom to make an impact. You may be aware of our booming streaming business, but are you aware of our advanced usage of cloud computing? In anticipation of our business’s rapid growth, Netflix, an early cloud adopter, now ranks among the top users of cloud-based infrastructure-as-a-service (a.k.a. IaaS).
The Position
We are looking for best-of-breed, performance-minded software engineers with a passion and talent for scaling high-traffic distributed systems. You should have experience building systems with similar traffic and a track record of improving them. Your improvements should be backed by hard numbers and grounded in engineering principles.
Responsibilities include
• Drive cloud performance & scalability optimization at Netflix and at our cloud partners
• Proactively define and expose metrics that can improve our services’ performance, scalability, and availability
• As a member of this team, you will also work on parts of our core software
• Define and evangelize best practices for cloud usage at Netflix
Minimum Job Qualifications
• 10 years of relevant software engineering experience
• 6 years of experience with high-traffic, large-scale distributed systems and client-server architectures
• Experience with Cloud Computing platforms (e.g. Amazon AWS, Microsoft Azure, Google App Engine)
• Object-oriented programming experience with Java
• BS/MS in computer science (or equivalent)
Winning Qualities
• Understands complex systems from a performance perspective
• Works well in teams
• Shows leadership
• Is meticulous and numbers-driven
• Employs unambiguous, crystal-clear communication
Contact me at siddharthanand@yahoo.com with your resume. Principals only (i.e. no recruiters). Locals preferred. Also, implement a method in Java that, given input like AAABBaaccC, writes A3B2a2c2C1. Attach the source code to your email. What are the runtime and memory complexities of your solution in Big O notation?
-
SimpleDB Essentials for High Performance Users : Part 3

This is Part 3 of SimpleDB Essentials for High Performance Users. Check out Part 2
- Work around Attribute Value Length Limits
- If you need to store data that is vastly larger than 1024 bytes in a SimpleDB attribute, consider storing that data in S3 and putting a pointer (i.e. bucket name + object key) to the data in the SimpleDB attribute. The drawback of this approach is that you need 2 round trips (i.e. one to SimpleDB and one to S3) to compose one logical row. Beyond the obvious performance hit, this approach is not transactionally sound.
- A better approach is to split that data over several SimpleDB attributes. You will need to control the splitting and joining logic for these attributes, but you will need only one round trip and you can leverage conditional puts for concurrency control. This approach is ideal if your data can fit in 10 or fewer attributes.
- Just remember that a subsequent update may write a value of a different length, so the number of split attributes in use can change from one write to the next.
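The splitting and joining logic described above can be sketched roughly as follows. This is a minimal illustration, not the author's code: the class name `AttributeChunker`, the `prefix_NN` attribute naming, and the extra `prefix_count` attribute are all assumptions made for this example. The count attribute is one way to cope with updates of different length, since the reader then ignores stale chunks left over from an earlier, longer value.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; names and attribute layout are assumptions for illustration.
class AttributeChunker {
    // Per-attribute value limit cited in the text above.
    static final int MAX_BYTES = 1024;

    // Splits value into attributes prefix_00, prefix_01, ... plus prefix_count.
    // Chunks are cut on code point boundaries so no character is broken in two.
    static Map<String, String> split(String prefix, String value) {
        Map<String, String> attrs = new LinkedHashMap<>();
        StringBuilder chunk = new StringBuilder();
        int chunkBytes = 0;
        int chunks = 0;
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            String ch = new String(Character.toChars(cp));
            int len = ch.getBytes(StandardCharsets.UTF_8).length;
            if (chunkBytes + len > MAX_BYTES) {
                attrs.put(name(prefix, chunks++), chunk.toString());
                chunk.setLength(0);
                chunkBytes = 0;
            }
            chunk.append(ch);
            chunkBytes += len;
            i += ch.length();
        }
        if (chunk.length() > 0) {
            attrs.put(name(prefix, chunks++), chunk.toString());
        }
        // Record how many chunks are live so stale ones can be ignored on read.
        attrs.put(prefix + "_count", Integer.toString(chunks));
        return attrs;
    }

    // Reassembles the value from exactly prefix_count chunks.
    static String join(String prefix, Map<String, String> attrs) {
        int chunks = Integer.parseInt(attrs.get(prefix + "_count"));
        StringBuilder value = new StringBuilder();
        for (int n = 0; n < chunks; n++) {
            value.append(attrs.get(name(prefix, n)));
        }
        return value.toString();
    }

    private static String name(String prefix, int n) {
        return String.format("%s_%02d", prefix, n);
    }
}
```

All of the resulting attributes belong to one item, so they can be written in a single put and read back in a single get.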
- Getting tripped up by the Default Select Query pagination limit of 100
- Be aware that the SDB Select query supports the “limit N” expression, which lets the developer specify N up to a maximum of 2500. If the developer chooses N=200, for example, and 1000 items match the WHERE clause conditions, the results are returned in chunks of 200, so 5 round trips in total are required to fetch the 1000 items. For customer-facing functionality, you are risking end-user timeouts. To avoid this, always specify “limit 2500”. Note: if you don’t specify a limit, SimpleDB assumes the default value of 100.
- Avoid any client code that auto-follows tokens returned by SimpleDB. SimpleDB query timeouts can result in an unpredictably long chain of next pointers. Auto-following these can result not only in an infinite loop on your servers, but in customer-browser timeouts as well. Instead, follow these next pointers judiciously.
- Avoid carrying multi-table relationships into the cloud in the form of multi-domain relationships. Try to denormalize these relationships into single items. Doing joins in the application tier might require multiple round trips to SDB and expose customer-facing functionality to timeouts.
- Remember that there are no sequences, locks, constraints (except for the uniqueness constraint on the item name), triggers, etc. in SimpleDB. Don’t expect them.
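One way to follow next pointers "judiciously" is to cap both the number of round trips and the total time spent. The sketch below assumes a hypothetical `SelectClient` interface and `Page` type standing in for whatever SimpleDB client library is in use; neither is a real SDK type, and the caller is assumed to have already put “limit 2500” in the query.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; SelectClient and Page are hypothetical stand-ins.
class BoundedSelect {
    // One page of Select results plus the token for the next page.
    static final class Page {
        final List<String> items;
        final String nextToken; // null when there are no more pages

        Page(List<String> items, String nextToken) {
            this.items = items;
            this.nextToken = nextToken;
        }
    }

    // Hypothetical client interface, not a real SDK type.
    interface SelectClient {
        Page select(String query, String nextToken);
    }

    // Follows next tokens, but never for more than maxPages round trips
    // and never after budgetMillis has elapsed.
    static List<String> fetch(SelectClient client, String query,
                              int maxPages, long budgetMillis) {
        List<String> items = new ArrayList<>();
        long start = System.nanoTime();
        String token = null;
        for (int page = 0; page < maxPages; page++) {
            Page result = client.select(query, token);
            items.addAll(result.items);
            token = result.nextToken;
            boolean outOfTime =
                (System.nanoTime() - start) / 1_000_000L >= budgetMillis;
            if (token == null || outOfTime) {
                break;
            }
        }
        return items;
    }
}
```

With these caps, a query that keeps returning tokens yields a partial result after a bounded number of calls instead of looping indefinitely; the caller decides whether a partial result is acceptable.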
-
SimpleDB Essentials for High Performance Users : Part 2
-
SimpleDB Essentials for High Performance Users : Part 1

Preamble
I’ve been a heavy user of SimpleDB since January 2009, storing, writing, and reading billions of items. Based on my experience, I’ve compiled a list of best practices and conventions to simplify working with SimpleDB. I’ve divided this into multiple parts to make it easier to read.

-
A Java Out-of-Memory Error involving GZIP, Typica, and SimpleDB
UPDATE
I am providing an update here to the root cause.
Overview
I ran into an interesting Out of Memory bug this week. It occurs if you use gzip to send/receive data and under-utilize your Java Heap memory. This land-mine has existed since 2004, though hopefully you will not be bitten by it.
Problem Stack
A Java process was throwing the following Out-of-Memory Error.
JVMDUMP013I Processed Dump Event "uncaught", detail "java/lang/OutOfMemoryError".
Exception in thread "SDB WriterPool_4_rentalusers_incremental-thread-1" java.lang.OutOfMemoryError: ZIP004:OutOfMemoryError, MEM_ERROR in inflateInit2
at java.util.zip.Inflater.init(Native Method)
at java.util.zip.Inflater.<init>(Inflater.java:105)
at java.util.zip.ZipFile.getInflater(ZipFile.java:416)
at java.util.zip.ZipFile.getInputStream(ZipFile.java:359)
at java.util.zip.ZipFile.getInputStream(ZipFile.java:324)
at java.util.jar.JarFile.getInputStream(JarFile.java:467)
at sun.net.www.protocol.jar.JarURLConnection.getInputStream(JarURLConnection.java:165)
at java.net.URL.openStream(URL.java:1041)
at java.lang.ClassLoader.getResourceAsStream(ClassLoader.java:455)
at com.xerox.amazonws.common.AWSQueryConnection.<init>(AWSQueryConnection.java:102)
at com.xerox.amazonws.sdb.Domain.<init>(Domain.java:72)
at com.xerox.amazonws.sdb.SimpleDB.getDomain(SimpleDB.java:202)
....
at java.lang.Thread.run(Thread.java:803)
-
SimpleDB Performance : 5 Steps to Achieving High Write Throughput
I was recently tasked with fork-lifting ~1 billion rows from Oracle into SimpleDB. I completed this forklift in November 2009 after many attempts. To make this as efficient as possible, I worked closely with Amazon’s SimpleDB folks to troubleshoot performance problems and create new APIs. I’d like to share some recommendations and observations.
Although I have covered these recommendations in depth in a previous post (i.e. link above), I’d like to present a more succinct list of recommendations and observations here to maximize knowledge transfer.
Architecture
The architecture consists of a daemon (i.e. IR, for Item Replicator) that reads records out of Oracle and puts them into multiple SimpleDB domains. I’ve actually shown a second IR process that reads data out of SimpleDB for insertion into Oracle, but you should ignore it for the purpose of this discussion. When I refer to IR in this article, I mean the process replicating from Oracle to SimpleDB.

Recommendations
- Shard your data
- You can achieve much higher data access rates to multiple domains than to a single domain. Hence, rather than using a single domain, use multiple. This is because write traffic acts as if throttled or rate-limited at a domain level.
- Use slow-ramp up for writing
- AWS (SimpleDB) doesn’t like bursty writes and will often respond by throttling IR. When your data uploader starts up, have it slowly increase the write rate.
- Use some sort of back-off strategy
- I’ve adopted Amazon’s recommendation for retry intervals (i.e. 250ms, 500ms, 1s, 2s). Essentially, wait 250 milliseconds after the first failure before retrying, 500 milliseconds after the second failure, and so on. After the 3rd retry attempt, stick to 2-second idle intervals.
- Use BatchPutAttributes instead of the singleton PutAttributes
- This will get you an order-of-magnitude improvement in throughput.
- Set replace=false on puts
- This is the default. If you know that you are strictly always inserting unique records, puts with replace=false will run much faster than replace=true
- Also, since this is the default, Amazon recommends that users not set replace=false at all
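Two of the recommendations above, sharding and back-off, lend themselves to a small sketch. The class name `WriteTuning`, the `prefix_N` domain naming, and the hash-based shard choice are assumptions for illustration; the post does not say how items were assigned to domains. The back-off schedule follows the intervals quoted above.

```java
// Illustrative sketch; the names and the hashing scheme are assumptions.
class WriteTuning {
    // Retry intervals quoted in the text: 250ms, 500ms, 1s, then 2s thereafter.
    private static final long[] BACKOFF_MS = {250, 500, 1000, 2000};

    // Returns how long to wait after the given number of consecutive failures
    // (1 = first failure). Stays at 2 seconds once the schedule is exhausted.
    static long backoffMillis(int failures) {
        if (failures < 1) {
            throw new IllegalArgumentException("failures must be >= 1");
        }
        return BACKOFF_MS[Math.min(failures, BACKOFF_MS.length) - 1];
    }

    // Picks one of domainCount domains for an item by hashing its name,
    // so writes spread across domains instead of hitting a single one.
    static String domainFor(String itemName, String domainPrefix, int domainCount) {
        int shard = Math.floorMod(itemName.hashCode(), domainCount);
        return domainPrefix + "_" + shard;
    }
}
```

`Math.floorMod` is used instead of `%` because `String.hashCode()` can be negative. Hashing the item name also means a reader can recompute the domain for any item without a lookup table.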
Feel free to follow me on Twitter (@r39132).
-
“The cloud” Explained for Normal People
If you are like most people in software, you have heard the term “The cloud” but have no idea what it means. If you are industrious enough to buy a book, google the term, or troll Twitter for related tweets, you are likely exasperated by the sheer marketing buzzword blast you encounter.
To make it easier on you, I am going to tell you what it means to me with a very specific example:
I am helping my company move into the cloud. Specifically, we are going to use most of Amazon’s AWS services.
Definitions
First, some abbreviations and definitions:
-
AWS
- Amazon Web Services, a division of Amazon focusing on hosting our applications and data
-
SimpleDB
- AWS’s always-available replacement for RDBMSs. Specifically SimpleDB is their hosted, replicated key-value store that is always available and accessible as a web service
-
S3
- (a.k.a Simple Storage Service) AWS’s always-available file storage solution accessible as a web service
-
SQS
- (a.k.a Simple Queue Service) AWS’s always-available queueing service accessible as a web service
-
ELB
- (a.k.a Elastic Load Balancer) AWS’s always-available load balancing service accessible as a web service
-
EC2
- (a.k.a Elastic Compute Cloud) AWS’s on-demand server offering accessible as a web service
-
CloudFront
- AWS’s CDN (a.k.a Content Delivery Network) offering accessible as a web service
All of the services above are pay-as-you-go (and very reasonably priced at that) and are accessible as web services. They are also always available.
So why does one go about using these services?
-
HTML 5’s Web Sockets explained

I’ve been reading a bit about HTML 5’s WebSockets lately.
First, here are some definitions:
- Comet - an umbrella term referring to techniques that provide “server push” using standard browser functionality, i.e. without the aid of specialty browser plug-ins. In practice, Comet is most often implemented via Ajax with long polling.
- Long polling - (from Wikipedia) “With long polling, the client requests information from the server in a similar way to a normal poll. However, if the server does not have any information available for the client, instead of sending an empty response, the server holds the request and waits for some information to be available. Once the information becomes available (or after a suitable timeout), a complete response is sent to the client. ”
- Gateway - like a proxy, except that gateways don’t alter the requests or responses that they ferry between browser and server.
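The server side of the long polling definition above can be sketched in a few lines. This is a simplified, single-client illustration using an in-memory queue; the class name `LongPollChannel` and its methods are hypothetical, and a real server would tie `handlePoll` to an HTTP request handler.

```java
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the server half of long polling.
class LongPollChannel {
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();

    // Called when new information becomes available for the client.
    void publish(String message) {
        pending.add(message);
    }

    // Handles one poll request: instead of answering immediately with an
    // empty response, hold the request until a message arrives or the
    // timeout elapses. An empty Optional means the poll timed out.
    Optional<String> handlePoll(long timeoutMillis) {
        try {
            return Optional.ofNullable(
                pending.poll(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return Optional.empty();
        }
    }
}
```

The client simply issues the next poll as soon as one returns, which is why each waiting client ties up a held request on the server.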
With these definitions out of the way, what are WebSockets?
WebSockets is a new proposal under HTML 5 to provide full-duplex, bi-directional client-server interaction over a single TCP connection. The goals of this proposal include:
- Increasing web server connection scalability - web applications that leverage WebSockets use half as many web server connections as Comet-based applications do. Comet requires separate upstream and downstream connections.
- Simplifying the coding task - the WebSocket API is much simpler to code with than XMLHttpRequest
Sounds great! So, how do we use them?
Unfortunately, WebSockets require browser support, and currently only Chrome supports them (i.e. since version 4.0.249.0). As a stopgap, a company named Kaazing provides a gateway for your existing browsers.
To learn more about WebSockets, check out the links below:
- A good how-to using WebSockets from Tornado’s developer:
- A short introduction to Web Sockets from Kaazing:
-
Website Performance - Why you should care and what you can do!

Why Does Performance Matter?
Oftentimes, people speak interchangeably about web site performance, scalability, and availability. Although these 3 terms are related, they are distinct. Here are their definitions:
- availability - what is the total length of time that [some part of] a web site is available during an hour/day/year?
- scalability - what is the largest number of concurrent users that a system can handle?
- performance - what is the [worst] perceived response time for a single user?
It should be pretty clear that for web sites that make money (either through ads or via subscription-based services), lower availability translates into lower revenue. In the same way that you can lose revenue via an outage, you can lose revenue during traffic peaks if your website cannot handle those peaks.
Hence, just as poor scalability and availability can reduce your top line (revenue), can poor performance have a similar effect?
- In 2006 Google’s tests showed that increasing load time by 0.5 seconds resulted in a 20% drop in traffic.
- In 2007 Amazon’s tests showed that for every 100ms increase in load time, sales would decrease 1%.
- This year (2009) Akamai (a CDN leader) revealed in a study that 2 seconds is the new threshold for eCommerce web page response times.
So yes, it can.
Where Should You Look for Performance Issues?
Ask anyone who has worked on a web site or an enterprise system and they will say “Look at your database!”. Although that is true, the mistake people often make is stopping there. After tuning queries to their satisfaction, engineers seem to ignore 5-7 second website page load times. 80-90% of this time really comes from assembling the web page.
-
Denial of Service (DoS) : Some Thoughts
About a year ago, I had the opportunity to solve a class of Denial-of-Service attacks that were compromising our availability and scalability. During that investigation, I happened upon a revelation. That revelation led to a solution. I’ve since seen that learning applied to other systems, including Amazon’s SimpleDB, so I wanted to share it here.
Consider the following scenario (also depicted below):
- A web client issues an HTTP request to a web site
- The web site, upon receiving the request, attempts to determine if the current request is part of a larger DoS attack
- If so, a defense is executed
- If not, the web request follows a normal execution of business logic
- The web server returns a response to the web client
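The request flow in this scenario can be sketched as a small gate in front of the business logic. The post does not say how an attack is detected, so the per-client fixed-window request counter below is purely an illustrative assumption, as are the class name `DosGate` and the placeholder responses.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch; the detection rule is an assumption, not the author's.
class DosGate {
    private final int maxPerWindow;
    private final long windowMillis;
    // clientId -> {window start time, requests seen in that window}
    private final Map<String, long[]> windows = new HashMap<>();

    DosGate(int maxPerWindow, long windowMillis) {
        this.maxPerWindow = maxPerWindow;
        this.windowMillis = windowMillis;
    }

    // Assumed detection rule: more than maxPerWindow requests from one
    // client within a single fixed time window.
    synchronized boolean looksLikeAttack(String clientId, long nowMillis) {
        long[] w = windows.get(clientId);
        if (w == null || nowMillis - w[0] >= windowMillis) {
            w = new long[] {nowMillis, 0};
            windows.put(clientId, w);
        }
        w[1]++;
        return w[1] > maxPerWindow;
    }

    // Mirrors the steps above: check the request, then either run a
    // defense or fall through to normal business logic.
    String handle(String clientId, long nowMillis) {
        if (looksLikeAttack(clientId, nowMillis)) {
            return "503 Service Unavailable"; // placeholder defense
        }
        return "200 OK"; // placeholder for normal business logic
    }
}
```

Passing the current time in as a parameter keeps the sketch deterministic; a real handler would read a clock. Note that the detection step itself runs on every request, so its cost is part of what an attacker can exploit.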

