@mayralois, August 2016
The Mushroom Cloud Effect
or What Happens When Containers Fail?
Alois Mayr, Technology Lead Cloud & Containers
Microservices and Containers Meetup Austin
about:me
• Austrian
• Never seen Sound of Music
• Often seen much more modern technology stuff
• Seen even more technology stuff now with Dynatrace
• Technology Lead for Cloud & Containers: CloudFoundry, Docker, AWS, etc.
about:dynatrace
• APM market leader that helps companies with digital transformation
• Founded in Austria back in 2005
• ~ 1600 employees worldwide
• > 8000 customers across all industries
• Has seen many performance and stability problems and patterns out there
about:you
• Which of you run or manage containers in production?
• Whose life has become easier since then?
• What's needed to make it easy?
• Thanks!
Source: http://www.schoonoart.de/
…there's been the mushroom cloud effect
oh yeah, everything screwed up
Biggest LatAm E-Commerce Company
• ~ US$ 2.5 billion revenue
• 4 sites: Americanas, Shoptime, Submarino, Soubarato
• ~ 150 hosts across 4 regions
• 5k-15k containers
• 1k-3k services
Important Aspects…
• Lots of (micro-)services
• Lots of communication between services
• Service dependencies
• Versioning and API compatibilities
• Zero downtime
Platform-related Aspects
• Most often container-based
• Clustered for scalability
• Ephemeral containers
• Resilient architecture
• Cross AZ fail-overs
• SDN for communication
Deployments are no Longer Static
• 7:00 a.m.: Low load, service running with minimum redundancy
• 12:00 p.m.: Scaled up service during peak load, with failover of problematic node
• 7:00 p.m.: Scaled back down to lower load, move to different geolocation
The Hungry Container Breakdown
What was the problem?
• Shared /logs partition on host
• No log rotation, no archiving for app logs
• No proper log management used for Docker environment
• Shared /logs partition ran out of space
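A partition that fills up unnoticed, as the shared /logs partition did here, could in principle be caught by a basic free-space check. The following is a minimal sketch, not the tooling used in this environment: the function names and the 90% threshold are assumptions chosen for illustration, and only the standard-library call `shutil.disk_usage` is real.

```python
import shutil


def used_fraction(path):
    """Return the used share (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo filesystems report a size of zero; treat them as empty.
        return 0.0
    return usage.used / usage.total


def is_running_low(path, threshold=0.9):
    """True when the filesystem holding `path` is at or above `threshold`.

    The 0.9 default is a hypothetical alerting threshold, not a value
    taken from the talk.
    """
    return used_fraction(path) >= threshold


if __name__ == "__main__":
    # On the affected hosts the path to watch would have been /logs;
    # "." is used here so the sketch runs anywhere.
    print(f"used: {used_fraction('.'):.1%}, low: {is_running_low('.')}")
```

Run periodically on each host, a check like this would flag the partition before container health checks start to fail.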
The Hungry Container Breakdown
How did the problem evolve over time?
• Container health checks failed
• Orchestration killed container and rescheduled new one
• Still no free space on /logs
• Termination and rescheduling
• /var/lib/docker ran out of space
• Cluster nodes were no longer able to run any containers
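The kill-and-reschedule loop can be illustrated with a toy model. This is a sketch under stated assumptions, not a reconstruction of the real incident: it assumes every killed container leaves a fixed amount of data behind in /var/lib/docker, and all numbers and names are hypothetical.

```python
def simulate_restart_loop(docker_free_mb, leftover_per_container_mb,
                          logs_free_mb=0, max_restarts=1000):
    """Count reschedules until a node can no longer start containers.

    Assumptions (illustrative only):
    - a container passes its health check only if /logs has free space,
    - each killed container leaves `leftover_per_container_mb` behind
      in /var/lib/docker because nothing cleans it up.
    """
    restarts = 0
    while restarts < max_restarts:
        if logs_free_mb > 0:
            # Health check passes, the orchestrator stops rescheduling.
            return restarts, "healthy"
        if docker_free_mb < leftover_per_container_mb:
            # /var/lib/docker is exhausted, the node cannot run containers.
            return restarts, "node unusable"
        # Health check fails: kill the container, keep its leftovers,
        # and schedule a replacement.
        docker_free_mb -= leftover_per_container_mb
        restarts += 1
    return restarts, "restart limit reached"


if __name__ == "__main__":
    print(simulate_restart_loop(docker_free_mb=1000,
                                leftover_per_container_mb=100))
```

The point of the model is that rescheduling cannot fix a host-level condition: as long as /logs stays full, each restart only consumes more of /var/lib/docker.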
The Hungry Container Breakdown
How did the problem affect services?
• Services at the top of the graph
• Increased failure rates
• Lots of dependent Tomcat and DB services affected
The Hungry Container Breakdown
How could the problem have been avoided?
• Log management tools for app logs: --log-driver=none|syslog
• Remove container / clean-up jobs: --rm=true
• /var/lib/docker deserves its own partition
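The two Docker flags named above can be combined in a single `docker run` invocation. The helper below is a minimal sketch that only builds the argument list. The function name and the image name `example/app:latest` are hypothetical, while `--log-driver` and `--rm` are the flags from the slide.

```python
def docker_run_args(image, log_driver="syslog", remove_on_exit=True):
    """Build a `docker run` argument list using the flags from the slide.

    `image` is whatever image should be started; the name used in the
    example below is a placeholder, not one from the talk.
    """
    if log_driver not in ("none", "syslog"):
        # Only the two drivers mentioned on the slide are accepted here.
        raise ValueError(f"unsupported log driver: {log_driver}")
    args = ["docker", "run", f"--log-driver={log_driver}"]
    if remove_on_exit:
        # Remove the container's filesystem when it exits, so stopped
        # containers do not pile up under /var/lib/docker.
        args.append("--rm=true")
    args.append(image)
    return args


if __name__ == "__main__":
    print(" ".join(docker_run_args("example/app:latest")))
```

The resulting list can be passed to `subprocess.run` on a host with Docker installed. In an orchestrated cluster the same settings would normally be applied through the orchestrator's configuration instead.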
Break Your Clusters Early
• Massive load testing!
• Survive three days of pain
• Include everything: Services, Containers, Orchestration, EC2 instances