Resiliency Through Failure

Ariel Tseitlin (Scale Venture Partners)
Java & JVM
Location: E147
Average rating: 4.83 (12 ratings)

Netflix created a suite of tools, collectively called the Simian Army, to improve resiliency and maintain the cloud environment.

Failure modes are typically corner cases that are poorly tested, if they are tested at all. Only by failing often can we ensure that we are resilient to failure.

We look for ways to induce failure in our production environment to better prepare us for the inevitable failures that will occur.

We’ve open-sourced some of the monkeys (Chaos, Janitor, Conformity) and are working on releasing the rest. The presentation will cover the reason for creating each monkey, what we’ve been able to learn from running them, and tips for those interested in adopting the approach.

Chaos Monkey randomly terminates virtual machines to ensure that services are resilient to node failure.
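
As a rough illustration of the idea only (not the Simian Army code or its API), random termination can be sketched as picking at most one instance per group with some probability. The function name `pick_victims`, the group names, and the instance IDs are all hypothetical, and the sketch only selects instances; it does not call any cloud API.

```python
import random


def pick_victims(groups, probability, rng):
    """Pick at most one instance per group, each group with the given probability.

    `groups` maps a (hypothetical) group name to a list of instance IDs.
    """
    victims = []
    for _name, instances in sorted(groups.items()):
        # Skip empty groups; otherwise roll the dice for this group.
        if instances and rng.random() < probability:
            victims.append(rng.choice(instances))
    return victims


# Hypothetical inventory, for illustration only.
groups = {
    "api": ["i-api-1", "i-api-2", "i-api-3"],
    "web": ["i-web-1", "i-web-2"],
    "batch": [],
}

victims = pick_victims(groups, probability=1.0, rng=random.Random(42))
print(victims)
```

Passing the random generator in explicitly keeps a run reproducible, which helps when a termination needs to be traced back to a specific run.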

Chaos Gorilla is a more powerful version of Chaos Monkey: it takes down an entire AWS Availability Zone (data center) to ensure resiliency to a single-zone failure.
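
One way to reason about zone-level failure before inducing it is to ask which services would have no instances left if a zone disappeared. The following is a minimal sketch under that assumption; the fleet layout, field names, and the `services_at_risk` helper are hypothetical and are not part of Chaos Gorilla.

```python
def survivors_after_zone_loss(instances, failed_zone):
    """Return the instances that are not in the failed zone."""
    return [inst for inst in instances if inst["zone"] != failed_zone]


def services_at_risk(instances, failed_zone):
    """Return services that would have no instances left after the zone fails."""
    all_services = {inst["service"] for inst in instances}
    surviving = {
        inst["service"]
        for inst in survivors_after_zone_loss(instances, failed_zone)
    }
    return sorted(all_services - surviving)


# Hypothetical fleet, for illustration only.
fleet = [
    {"id": "i-1", "service": "api", "zone": "us-east-1a"},
    {"id": "i-2", "service": "api", "zone": "us-east-1b"},
    {"id": "i-3", "service": "billing", "zone": "us-east-1a"},
]

at_risk = services_at_risk(fleet, "us-east-1a")
print(at_risk)
```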

Latency Monkey induces random network delays and errors to ensure that services are resilient to degradation in their dependencies.
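
The same idea can be tried in miniature by wrapping a dependency call so that it sometimes fails or slows down, then checking that the caller degrades gracefully. This is a sketch under stated assumptions, not Latency Monkey itself: `with_faults`, `InjectedFault`, `fetch_title`, and the fallback value are all hypothetical names.

```python
import random
import time


class InjectedFault(Exception):
    """Raised by the wrapper to simulate a failing dependency."""


def with_faults(func, rng, delay_probability, max_delay_seconds,
                error_probability, sleep=time.sleep):
    """Wrap `func` so calls randomly raise an error or are delayed."""
    def wrapper(*args, **kwargs):
        if rng.random() < error_probability:
            raise InjectedFault("simulated dependency error")
        if rng.random() < delay_probability:
            sleep(rng.uniform(0.0, max_delay_seconds))
        return func(*args, **kwargs)
    return wrapper


def fetch_title(movie_id):
    """Stand-in for a call to a downstream service."""
    return f"title-{movie_id}"


def fetch_title_with_fallback(fetch, movie_id):
    """Caller under test: falls back instead of propagating the failure."""
    try:
        return fetch(movie_id)
    except InjectedFault:
        return "unavailable"


flaky = with_faults(fetch_title, random.Random(7), delay_probability=0.5,
                    max_delay_seconds=0.01, error_probability=0.3)
results = [fetch_title_with_fallback(flaky, n) for n in range(20)]
print(results)
```

The `sleep` parameter is injectable so that tests can replace it with a no-op and run without real delays.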

Janitor Monkey is the cloud cleaning crew. It prevents clutter by cleaning up old and unused resources.
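
A cleanup rule of this kind can be sketched as a filter over a resource inventory: flag anything that is not in use and has been idle longer than a retention window. The field names, the 30-day window, and `find_cleanup_candidates` are assumptions made for illustration and do not reflect Janitor Monkey's actual rules.

```python
from datetime import datetime, timedelta


def find_cleanup_candidates(resources, now, max_idle):
    """Return IDs of resources that are unused and idle longer than `max_idle`."""
    return [
        res["id"]
        for res in resources
        if not res["in_use"] and now - res["last_used"] > max_idle
    ]


# Hypothetical inventory with a fixed reference time, for illustration only.
now = datetime(2020, 1, 1, 12, 0)
resources = [
    {"id": "vol-old", "in_use": False, "last_used": now - timedelta(days=45)},
    {"id": "vol-new", "in_use": False, "last_used": now - timedelta(days=2)},
    {"id": "vol-attached", "in_use": True, "last_used": now - timedelta(days=90)},
]

candidates = find_cleanup_candidates(resources, now, timedelta(days=30))
print(candidates)
```

Separating "find candidates" from "delete" leaves room for a review or notification step before anything is removed.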

Ariel Tseitlin

Scale Venture Partners

Ariel Tseitlin is a Venture Partner at Scale Venture Partners where he focuses on enterprise software investments in cloud, big data, security, and mobile.

Ariel joined ScaleVP from Netflix, where he was Director of Cloud Solutions and responsible for creating and operating one of the most modern cloud infrastructures in the industry, one that accounts for a full third of all US downstream internet traffic at peak. Ariel’s team built many of the Netflix OSS components, such as Asgard and the Simian Army (including Chaos Monkey), making the Netflix streaming service more resilient, reliable, and manageable.

Prior to Netflix, Ariel was VP of Technology and Products at Sungevity and before that was the Founder & CEO of CTOWorks, a software consultancy helping early-stage entrepreneurs deliver their first product to market. Earlier in his career, Ariel held senior management positions at Siebel Systems and Oracle. Ariel holds a bachelor’s degree in Computer Science from UC Berkeley and an MBA with honors from the Wharton School of Business at the University of Pennsylvania.
