@mayralois August, 2016 The Mushroom Cloud Effect or What Happens When Containers Fail? Alois Mayr Technology Lead Cloud & Containers Microservices and Containers Meetup Austin

When containers fail

about:me
• Austrian
• Never seen Sound of Music
• Often seen much more modern technology stuff
• Seen even more technology stuff now with Dynatrace
• Technology Lead for Cloud & Containers: CloudFoundry, Docker, AWS, etc.

about:dynatrace
• APM market leader that helps companies with digital transformation
• Founded in Austria back in 2005
• ~ 1600 employees worldwide
• > 8000 customers across all industries
• Seen many performance and stability problems and patterns out there

about:you
• Who of you runs or manages containers in production?
• Whose life has become easier since then?
• What’s needed to make it easy?
• Thanks!

Source: http://www.schoonoart.de/

…there’s been the mushroom cloud effect

oh yeah, everything screwed up

Biggest LatAm E-Commerce Company

• ~ US$ 2.5 billion revenue
• 4 sites: Americanas, Shoptime, Submarino, Soubarato
• ~ 150 hosts across 4 regions
• 5k-15k containers
• 1k-3k services

TL;DR

About Cloud-Scale Systems

Important Aspects…

• Lots of (micro-)services

• Lots of communication between services

• Service dependencies

• Versioning and API compatibilities

• Zero downtime

Platform-related Aspects

• Most often container-based

• Clustered for scalability

• Ephemeral containers

• Resilient architecture

• Cross AZ fail-overs

• SDN for communication

Deployments are no Longer Static

7:00 a.m. Low load, service running with minimum redundancy

12:00 p.m. Scaled-up service during peak load, with failover of problematic node

7:00 p.m. Scaled back down to lower load, move to different geolocation

Anatomy of dynamic environments

https://www.dynatrace.com/en/ruxit/

All About (Service) Dependencies

Failing containers…

…may or may not have an (immediate) impact on service performance

Cascading Failures Lead to a Mushroom Cloud Effect
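
One way to picture how a single failure fans out is to walk the service dependency graph upwards from the failed service. The sketch below is only an illustration under assumptions: the service names and the `depends_on` mapping are hypothetical, and it treats every dependency as hard, whereas a failing container may or may not affect its callers in practice.

```python
from collections import deque


def affected_services(depends_on: dict[str, set[str]], failed: str) -> set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges: for each service, who calls it?
    dependents: dict[str, set[str]] = {}
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, set()).add(service)

    # Breadth-first walk from the failed service towards the top of the graph.
    affected: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, ()):
            if caller not in affected and caller != failed:
                affected.add(caller)
                queue.append(caller)
    return affected


# Hypothetical graph: "web" calls "cart" and "catalog", both of which call "db".
EXAMPLE_GRAPH = {
    "web": {"cart", "catalog"},
    "cart": {"db"},
    "catalog": {"db"},
    "db": set(),
}
```

With this example graph, `affected_services(EXAMPLE_GRAPH, "db")` returns all three upstream services, while a failure of `"web"` affects nothing else.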

The Hungry Container Breakdown

What was the problem?
• Shared /logs partition on host
• No log rotation, no archiving for app logs
• No proper log management used for Docker environment
• Shared /logs partition ran out of space
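
As an illustration of what log rotation buys you, here is a minimal sketch using Python's standard `logging.handlers.RotatingFileHandler`. It is an assumption for demonstration only: the services in this case study were not necessarily written in Python, and the helper name `make_bounded_logger` and its default sizes are hypothetical.

```python
import logging
from logging.handlers import RotatingFileHandler


def make_bounded_logger(path: str, max_bytes: int = 1_000_000, backups: int = 3) -> logging.Logger:
    """Create a logger whose files on disk are capped at (backups + 1) files."""
    logger = logging.getLogger(f"bounded:{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of the root logger
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

Once the active file would exceed `max_bytes`, the handler rolls it over and deletes the oldest backup, so an app that logs heavily cannot fill a shared partition on its own.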

How did the problem evolve over time?
• Container health checks failed
• Orchestration killed container and rescheduled new one
• Still no free space on /logs
• Termination and rescheduling
• /var/lib/docker ran out of space
• Cluster nodes were no longer able to run any containers
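
The kill-and-reschedule loop on this slide can be sketched as a toy simulation. Everything numeric here is hypothetical: the `Node` fields, the assumption that each killed container leaves a fixed amount of data behind under /var/lib/docker, and the sizes in the usage note are invented to show the mechanism, not measured from the incident.

```python
from dataclasses import dataclass


@dataclass
class Node:
    logs_free_mb: int                # free space on the shared /logs partition
    docker_free_mb: int              # free space on /var/lib/docker
    leftover_per_container_mb: int   # assumed space each killed container leaves behind


def simulate_restart_loop(node: Node, max_restarts: int = 1000) -> int:
    """Count reschedules until the health check passes or the node fills up."""
    restarts = 0
    while restarts < max_restarts:
        if node.logs_free_mb > 0:
            break  # the app can write logs again, so the health check passes
        if node.docker_free_mb < node.leftover_per_container_mb:
            break  # /var/lib/docker is full: the node cannot run any container
        # Health check fails, the orchestrator kills the container and
        # schedules a new one; the old one's data stays on disk.
        node.docker_free_mb -= node.leftover_per_container_mb
        restarts += 1
    return restarts
```

For example, `simulate_restart_loop(Node(logs_free_mb=0, docker_free_mb=1000, leftover_per_container_mb=100))` returns 10 and leaves `docker_free_mb` at 0: the orchestrator's own recovery attempts are what finally take the node down.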

How did the problem affect services?
• Services at the top of the graph
• Increased failure rates
• Lots of dependent Tomcat and DB services affected

How could the problem have been avoided?
• Log management tools for app logs: --log-driver=none|syslog
• Remove container / clean-up jobs: --rm=true
• /var/lib/docker deserves its own partition
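
The two flags named on this slide can be combined into a single `docker run` invocation. The helper below is a hedged sketch: the function name `docker_run_args` and the image name in the usage note are hypothetical, and it only accepts the two log drivers mentioned on the slide even though Docker supports others.

```python
def docker_run_args(image: str, log_driver: str = "syslog", remove_on_exit: bool = True) -> list[str]:
    """Build a `docker run` argument list using the flags from the slide."""
    if log_driver not in {"none", "syslog"}:
        # Restricted to the slide's options on purpose; not a Docker limitation.
        raise ValueError(f"unsupported log driver for this sketch: {log_driver}")
    args = ["docker", "run"]
    if remove_on_exit:
        args.append("--rm")  # remove the container's filesystem when it exits
    args.append(f"--log-driver={log_driver}")  # keep app output out of local container logs
    args.append(image)
    return args
```

For a hypothetical image, `docker_run_args("example/app:latest")` yields `["docker", "run", "--rm", "--log-driver=syslog", "example/app:latest"]`, which can be passed to `subprocess.run` on a host that has Docker installed.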

Buggy Containers May Kill Your Nodes

Try to Break Your Clusters Early (And Be Prepared for Black Friday)

Break Your Clusters Early

Massive load testing!

Survive three days of pain

Include everything: services, containers, orchestration, EC2 instances

Testing everything

13.3k containers (+ nodes)

3,451 services

Automation Needed to Pinpoint the Root Cause of Cascading Failures!

Questions? Or Beer? Or Both?

How do you know if a failing container breaks your apps?