DevOps Nirvana: Seven Steps to a Peaceful Life on AWS (ARC210) | AWS re:Invent 2013

Seven Steps to a Peaceful Life on AWS

Andrew Shieh, SmugMug, @shandrew

Philip Jacob, Stackdriver, @whirlycott

Friday, November 15, 2013



Stuff we have in common

✓ Years of AWS experience
✓ Success and failure with many lessons learned
✓ Both using Stackdriver for infrastructure monitoring
✓ Lots of data
✓ Philosophically aligned on how to run on AWS
‣ Superheroes



[Figure: cloud hype (vertical axis) plotted against time (horizontal axis). Labeled points on the curve: Transition to Distributed Systems, Lure of Elasticity, Peak of Expectations, DevOps Nirvana, Operational Enlightenment]


STEPS



1: Apply lean production principles


Release all the time: continuous improvement


Make it frictionless


$ stack deploy



2: Choose the right instance type


Factors to Consider

CPU
Network
Disk I/O
Workload
Cost

Tools to help you decide

vmstat
iostat
sar
R
Excel
Stackdriver + agent
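One way to combine the workload and cost factors above is to measure throughput on each candidate instance type and compare cost per unit of work. The sketch below is illustrative only: the type names, hourly prices, and throughput figures are made-up placeholders, not real AWS rates or measurements from the talk.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str                # placeholder label, not a real instance type
    hourly_cost: float       # hypothetical price in dollars per hour
    requests_per_sec: float  # throughput measured under your own workload


def cost_per_million_requests(c: Candidate) -> float:
    """Dollars spent to serve one million requests at the measured rate."""
    requests_per_hour = c.requests_per_sec * 3600
    return c.hourly_cost / requests_per_hour * 1_000_000


def cheapest(candidates):
    """Pick the candidate with the lowest cost per unit of work."""
    return min(candidates, key=cost_per_million_requests)


# Hypothetical numbers purely for illustration.
candidates = [
    Candidate("type-a", hourly_cost=0.10, requests_per_sec=50),
    Candidate("type-b", hourly_cost=0.20, requests_per_sec=120),
    Candidate("type-c", hourly_cost=0.40, requests_per_sec=180),
]
best = cheapest(candidates)
```

The throughput numbers would come from the tools listed above (for example, load testing while watching vmstat, iostat, or sar to see which resource saturates first).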


[Figure: Distribution of EC2 Instance Usage (bar chart). Instance types on the axis, in extracted order: m1.large, m1.small, m1.medium, c1.medium, c1.xlarge, t1.micro, m1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, m3.xlarge, m3.2xlarge, cc2.8xlarge, hi1.4xlarge, cg1.4xlarge, hs1.8xlarge, cc1.4xlarge, cr1.8xlarge. Bar labels, in extracted order: 21%, 20%, 12%, 11%, 9%, 7%, 7%, 3%, 2%, 2%, 2%, 1%, 1%, 0%, 0%, 0%, 0%, 0%]


+ EC2


3: Use configuration management



4: Choose the right monitoring solution



Rapid Setup
Full-stack
AWS Integration
Intelligent
Cluster-aware


5: Design effective alerting policies


Simple rules for confidently waking up ops@ at 3am

1. Something had better be broken (or close to it) for the customer

2. The broken thing should be as obvious as possible

3. It should be clear what action I can take to make the situation better

Customers seeing huge spike in 5XX errors

Code deploy to web cluster one hour ago

Revert!
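The three rules and the 5XX example above can be sketched as a small policy function. This is a minimal illustration, not the speakers' implementation: the `Snapshot` fields, the thresholds, and the wording of the suggested actions are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    error_rate_5xx: float                  # fraction of customer requests returning 5XX
    baseline_5xx: float                    # typical fraction under normal conditions
    minutes_since_deploy: Optional[float]  # None if no recent deploy is known


def evaluate(s, spike_factor=5.0, min_rate=0.01, deploy_window_min=120):
    """Return a page dict if ops should be woken up, else None.

    All thresholds are hypothetical defaults chosen for the example.
    """
    # Rule 1: page only when something is broken for the customer.
    spiking = (s.error_rate_5xx >= min_rate
               and s.error_rate_5xx >= spike_factor * s.baseline_5xx)
    if not spiking:
        return None

    # Rule 2: say plainly what is broken.
    summary = f"5XX rate {s.error_rate_5xx:.1%} vs baseline {s.baseline_5xx:.1%}"

    # Rule 3: attach context that points at an action.
    if (s.minutes_since_deploy is not None
            and s.minutes_since_deploy <= deploy_window_min):
        summary += f"; code deployed {s.minutes_since_deploy:.0f} min ago"
        action = "revert the most recent deploy"
    else:
        action = "no recent deploy found; check load balancer and backend health"

    return {"page": True, "summary": summary, "action": action}


page = evaluate(Snapshot(error_rate_5xx=0.12, baseline_5xx=0.002,
                         minutes_since_deploy=60))
quiet = evaluate(Snapshot(error_rate_5xx=0.003, baseline_5xx=0.002,
                          minutes_since_deploy=60))
```

Here `page` carries the "Revert!" suggestion from the slide, while `quiet` is `None` because a slightly elevated error rate does not justify a 3am page.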


6: Architect for high availability


Elastic Load Balancing
Amazon RDS
Apache ZooKeeper


[Figure: architecture diagram. Recoverable labels: Cloud Integration, System Agents, Custom Metrics; DNS; Load Balancing 1, 2, ... n; Data Ingestion; cells Cell-1, Cell-2, ... Cell-n, each with the labels GW, MQ, AI, F; Q (1, 2, 3, ... n); Workers; Agents; API; Cassandra; S3; Archival; Online Analysis; Batch (Aggregation, Correlation, Trending); Anomaly; Health; Serving; Web/Mobile UI; Elastic Load Balancing w/ haproxy]

Localized failure
Identical dimensions
Easy to reason about
Network partitions ok
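The "localized failure" property of a cell design can be illustrated with a deterministic customer-to-cell mapping. The slides do not say how work is assigned to cells, so the hashing scheme and the function names below are assumptions made purely for this sketch.

```python
import hashlib


def cell_for(customer_id: str, num_cells: int) -> int:
    """Map a customer to one of num_cells identical cells (hypothetical scheme)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    # Use the first 8 bytes as a stable integer, independent of process or host.
    return int.from_bytes(digest[:8], "big") % num_cells


def affected_customers(customers, failed_cell: int, num_cells: int):
    """Customers whose traffic lands in the failed cell; everyone else is untouched."""
    return [c for c in customers if cell_for(c, num_cells) == failed_cell]


customers = [f"customer-{i}" for i in range(100)]
num_cells = 4
impacted = affected_customers(customers, failed_cell=0, num_cells=num_cells)
```

With this kind of mapping, losing one cell affects only the customers assigned to it. A plain modulo scheme reshuffles most customers when the number of cells changes, so a real system would likely need a lookup table or consistent hashing instead.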


Handling failure

Avoid it

Mask it

Minimize it

Recover quickly

Cluster AZ Region

Resilience

Tolerance


7: Think holistically about quality assurance


AUTOSCALING + AUTOMATION + CONTINUOUS INTEGRATION + DEVOPS GOVERNANCE + ELASTICITY + PROGRAMMABLE INFRASTRUCTURE = CONSTANT CHANGE


You cannot pre-test every change

So

You need to be really good at detecting issues

Very quickly


Monitoring is a key part of quality assurance for dynamic systems

But monitoring tools need to be intelligent

Distributed sensors
Cloud-aware
Anomaly detection
Synthetic transactions
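As a rough illustration of anomaly detection on a metric stream, the sketch below flags values that sit far from the recent rolling mean. The slides do not describe a specific algorithm, so the rolling z-score approach, the class name, and the default window and threshold are assumptions chosen for the example.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flag values more than `threshold` standard deviations from the recent mean."""

    def __init__(self, window=30, threshold=3.0, min_samples=5):
        self.window = deque(maxlen=window)  # most recent observations only
        self.threshold = threshold
        self.min_samples = min_samples

    def observe(self, value: float) -> bool:
        is_anomaly = False
        # Stay quiet until there is enough history to estimate a baseline.
        if len(self.window) >= self.min_samples:
            mu = mean(self.window)
            sigma = pstdev(self.window)
            if sigma > 0:
                is_anomaly = abs(value - mu) / sigma > self.threshold
            else:
                # Perfectly flat history: any different value is unusual.
                is_anomaly = value != mu
        self.window.append(value)
        return is_anomaly


detector = RollingAnomalyDetector(window=10)
latencies_ms = [100, 101, 99, 100, 102, 98, 100, 101, 500]
flags = [detector.observe(v) for v in latencies_ms]
```

In this run only the final 500 ms sample is flagged. Because the baseline adapts to recent data, this kind of check suits dynamic systems where fixed thresholds go stale, though it will also adapt to a slow degradation.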


• Training
• Recommended reading:

• Systemantics (aka The Systems Bible)

• High Scalability (http://highscalability.com/)

• James Hamilton’s blog (http://perspectives.mvdirona.com/)



Visit us at http://www.smugmug.com/



Visit us at booth 315!