
How Is Netflix SO GOOD at DevOps?

Sooo… How does Netflix think about DevOps? Easy! They don’t. The end.

Just kidding, we have prepared an entire article on this topic for you. However, there is a grain of truth in that statement – Netflix doesn’t prioritize DevOps. They don’t get caught up in metrics and goals such as having zero downtime; instead, they prioritize innovation.

Few companies innovate at a higher velocity than Netflix, and this approach pays off when it comes to the quality of their service.

“The rate at which this entertainment game-changer has adopted new technologies and implemented them into its DevOps approach is setting new standards in IT.” – Coman Hamilton, Editor of JAXenter.com

So, how are they so good at DevOps if they don’t think about DevOps? And more importantly, how can you implement the same strategies in your organization? Read on to find out.

How Netflix Thinks about DevOps

We have already established that Netflix doesn’t really think about DevOps. So, what do they do then?

They don’t prevent engineers from accessing the production environment in any way (through systems, policies, or procedures) – every Netflix engineer has full access to the production environment from day 1.

This might seem scary for some organizations – giving people full access to everything means they could shut down the service. Yet this has never happened at Netflix. Engineers have the freedom to solve problems in the way they think is best and take responsibility for the decisions they make.

They don’t prioritize uptime at all costs, especially if, to achieve 100% uptime, they need to sacrifice innovation.

In industries such as healthcare or banking, zero downtime is mandatory, but not for Netflix. If their engineers can come up with new features and ideas, they have the freedom to implement them even if they affect uptime. In the end, what they gain will far surpass a few minutes of downtime.

They don’t focus on processes and procedures. That’s because it’s difficult for such a large organization to move quickly if engineers are tied down by specific policies they need to follow. Also, it’s impossible to come up with new approaches if a system dictates both the expected outcome and the steps you need to follow to achieve it.

They don’t enforce specific programming languages and frameworks. Instead, they give engineers the freedom to choose the best tool for the job, as long as the code is optimized and users get a better experience.

They don’t believe in gut instincts and traditional thinking but focus on data instead. The majority of the decisions Netflix makes depend upon data.

You can find out more about how Netflix thinks about DevOps in this DevOpsDays Rockies keynote speech.

Short Intro to DevOps

DevOps’ goal is to shorten the development lifecycle and provide consistent delivery of high-quality software by bridging development and IT operations.

The DevOps philosophy builds upon the Agile Principles. You can look at it as a combination of cultural philosophies, practices, and tools that increases a company’s ability to deliver applications and services faster. At the same time, DevOps lets teams evolve and improve products more quickly than traditional software development and infrastructure management processes allow.

The Advantages of DevOps

The advantages of this new approach are:

  1. Speed – DevOps speeds up the release cycle by increasing the frequency of releases
  2. Efficiency – DevOps seeks to automate workflows wherever possible
  3. Reliability – DevOps ensures the quality of application updates and infrastructure changes so organizations can reliably deliver continuous updates while maintaining a positive experience for their customers
  4. Improved collaboration – because DevOps encourages communication and collaboration between development and operations, teams lose less time to handoffs and misunderstandings.

Netflix DevOps Data & Numbers

From a technical point of view, Netflix has 3 main components:

  • compute and storage, managed through Amazon Web Services
  • UI & small assets delivered through Akamai
  • Netflix Open Connect – their purpose-built video CDN.

You can find out more details about their CDN, as well as all their open-source projects and software, on Netflix’s GitHub.

Now let’s get into data. Netflix has:

  • 100s of microservices
  • 1,000s of daily production changes
  • 10,000s of virtual instances inside Amazon
  • 100,000s of customer interactions per minute
  • 1,000,000s of customers
  • 1,000,000,000s of time series metrics

And they manage all this with ~70 operations engineers and 0 network ops centers. If that’s not impressive, we don’t know what is.


How Netflix Does DevOps

When the entertainment giant switched from delivering DVDs to streaming videos over the internet, there weren’t many tools available that could help the company’s massive cloud infrastructure to run smoothly.

So how does Netflix manage to serve millions of users all over the world with near-perfect uptime? Here’s Netflix’s approach to DevOps:

First, they moved their infrastructure from on-prem to cloud to be able to scale their service, a process that took several years to complete.

“Our journey to the cloud at Netflix began in August of 2008, when we experienced a major database corruption and for three days could not ship DVDs to our members. That is when we realized that we had to move away from vertically-scaled single points of failure, like relational databases in our datacentre, towards highly reliable, horizontally-scalable, distributed systems in the cloud.” – Yury Izrailevsky, VP, Cloud Computing and Platform Engineering, Netflix

To all the companies that think building their own infrastructure and tools from scratch is the best approach because no one can do it as well as they can: one of the main reasons Netflix is so successful today is that they realized the scalability advantages of the cloud early on and let Amazon handle the heavy lifting of building the best datacenters, while they focused on their product. This is something we at Bunnyshell also encourage and help organizations do through our DevOps automation platform.

In their endless pursuit of scalability, Netflix also implemented containerization. Two key advantages of containerization are:

  • consistency between environments
  • the fact that containers can be destroyed and created very quickly, which helps with scaling, reliability, and efficient rollbacks.

To further streamline this process, Netflix developed its own container management tool called Titus that could handle their unique requirements.

“Titus is Netflix’s infrastructural foundation for container-based applications. Titus provides Netflix scale cluster and resource management as well as container execution with deep Amazon EC2 integration and common Netflix infrastructure enablement.” – Andrew Spyker, Andrew Leung, Tim Bozarth, Netflix Technology Blog

Last but not least, Netflix builds for failure. Outages are quite common and, in recent years, we’ve seen many major websites taken down.

On Christmas Eve 2012, Netflix experienced a partial outage of their service (caused by a fault with AWS) that lasted for a few hours. Nowadays, the company can easily cope with these kinds of issues.

How? By accepting that, at some point, parts of their applications won’t work as expected and preparing for these eventualities. For example, they have a tool they call ‘Chaos Monkey,’ which helps them to test the stability of their production applications.

“(Chaos Monkey is) A tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact. The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables — all the while we continue serving our customers without interruption.” – Netflix Technology Blog
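To make the idea concrete, here is a minimal sketch of the principle behind this kind of tool. It is not Netflix’s Chaos Monkey code: the Instance and ServiceGroup classes, the unleash_monkey function, and the instance names are all hypothetical, and the "instances" are plain in-memory objects rather than real cloud servers. The sketch only shows the core loop: pick a healthy instance at random, disable it, and check that the service can still respond.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Instance:
    """Hypothetical stand-in for one virtual machine."""
    name: str
    healthy: bool = True


@dataclass
class ServiceGroup:
    """Hypothetical group of redundant instances behind one service."""
    name: str
    instances: List[Instance] = field(default_factory=list)

    def healthy_instances(self) -> List[Instance]:
        return [i for i in self.instances if i.healthy]

    def can_serve(self) -> bool:
        # Assumption: the service survives while at least one instance is up.
        return len(self.healthy_instances()) > 0


def unleash_monkey(group: ServiceGroup, rng: random.Random) -> Optional[Instance]:
    """Disable one randomly chosen healthy instance and return it.

    Returns None when there is nothing left to disable.
    """
    candidates = group.healthy_instances()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.healthy = False
    return victim


if __name__ == "__main__":
    rng = random.Random()
    group = ServiceGroup("example-service", [Instance(f"i-{n}") for n in range(3)])
    victim = unleash_monkey(group, rng)
    print(f"disabled {victim.name}; still serving: {group.can_serve()}")
```

With three redundant instances, losing one at random leaves the service up. Run the same experiment against a group of one and it fails, which is exactly the kind of weakness this practice is meant to expose before a real outage does.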

Why the Netflix Example Is an Exception

Although Netflix doesn’t deliberately try to be good at DevOps, thanks to their company culture, they still manage to achieve this. However, this situation is unique to their work environment and doesn’t necessarily apply to all organizations.

As we’ve previously mentioned, they intentionally sacrifice some amount of uptime if that means they can provide their customers with a better product in the long run (yet, even so, they have near-perfect uptime). This is not a trade-off all companies can make.

Although they are a very data-driven company, they don’t have a single monitor in their offices that shows them their metrics in real time. Instead, they let algorithms take care of analyzing the data and notify them only if something is wrong. This enables their engineers to focus on what they want to build rather than waste time trying to make sense of data. Again, this might not be something all companies could do, especially if they have specific metrics at the core of their product.
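As a rough illustration of "notify only if something is wrong," here is a minimal sketch of an automated check that stays silent for normal values and raises an alert for outliers. Netflix’s actual alerting algorithms are not described here; the check_metric function, the z-score rule, and the threshold of 3 standard deviations are all assumptions chosen for simplicity.

```python
from statistics import mean, stdev
from typing import List, Optional


def check_metric(history: List[float], latest: float,
                 threshold: float = 3.0) -> Optional[str]:
    """Return an alert message if `latest` is an outlier, else None.

    Assumption: a value counts as "wrong" when it lies more than
    `threshold` sample standard deviations from the historical mean.
    """
    if len(history) < 2:
        return None  # not enough data to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all is unusual.
        if latest == mu:
            return None
        return f"ALERT: value {latest} differs from constant baseline {mu}"
    z = (latest - mu) / sigma
    if abs(z) > threshold:
        return f"ALERT: value {latest} is {z:.1f} standard deviations from the mean"
    return None


if __name__ == "__main__":
    requests_per_minute = [100, 102, 98, 101, 99]  # made-up sample data
    for value in (101, 150):
        alert = check_metric(requests_per_minute, value)
        print(alert if alert else f"{value}: nothing to report")
```

The point of the design is that a human only hears about the metric when the function returns a message. Nobody has to watch a dashboard to notice the spike.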

All in all, Netflix’s approach can work for organizations that give their employees the freedom to do what they’re best at, but not for those that rely on heavy processes and structure to get work done. Netflix believes in context, not control.

The Netflix Culture

“In the DevOps world, Netflix has been the gold standard for many years; just about as many years as we’ve been using the term ‘DevOps.’ Netflix is different because they don’t just talk DevOps like many companies do while still being too frightened to change. Instead, Netflix embraces changes and constant improvement. 

For example, many companies would be petrified to release something into their production environment that purposely causes systems to break. Not Netflix. Netflix’s ‘Chaos Monkey’ is just one project that proves they’re not afraid to continuously improve. 

Netflix is willing to put its production environment on the line and risk downtime in the short term for a more reliable environment in the long term. 

It takes a certain personality to embrace their ‘no obstacles to production’ approach. Some organizations are simply too scared or lack the expertise to design systems for this approach. Netflix’s strategy isn’t for everyone. To make changes in production as fast as they do requires a lot of upfront automation and systems planning. 

I believe many organizations fail at the Netflix approach because they simply don’t have the engineering muscle that Netflix does. Even though they may want to deliver faster and more efficiently and are even OK with taking on a little more risk, they can’t. They need the team to make it happen.” – Adam Bertram, tech blogger at adamtheautomator.com.

The Bunnyshell Solution

As Coman Hamilton says, Netflix’s approach to DevOps sets a great example of how it can contribute to the growth and development of a business and raises the bar for IT companies everywhere. So, if you’re looking to achieve the same results Netflix has, build a positive culture, and encourage your team members to contribute, you should give Bunnyshell a try.

To help your team:

  • we simplify processes and standardize workflows
  • we motivate them to follow a systematic approach to the entire infrastructure and other related activities
  • we enable your engineers to focus on building a great product.

Get in touch with us to learn more about how Bunnyshell makes your work easier.