Seven Steps to a Peaceful Life on AWS
Andrew Shieh, SmugMug
@shandrew
Philip Jacob, Stackdriver
@whirlycott
Friday, November 15, 13
Stuff we have in common
✓ Years of AWS experience
✓ Success and failure with many lessons learned
✓ Both using Stackdriver for infrastructure monitoring
✓ Lots of data
✓ Philosophically aligned on how to run on AWS
‣ Superheroes
[Figure: cloud hype curve. Y-axis: CLOUD HYPE. X-axis: TIME. Labels on the curve: Transition to Distributed Systems, Lure of Elasticity, Peak of Expectations, DevOps Nirvana, Operational Enlightenment.]
STEPS
1: Apply lean production principles
Release all the time: continuous improvement
Make it frictionless
$ stack deploy
2: Choose the right instance type
Factors to Consider
CPU
Network
Disk I/O
Workload
Cost
Tools to help you decide
vmstat
iostat
sar
R
Excel
Stackdriver + agent
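One way to combine the workload and cost factors above is to compare candidates by cost per unit of work. The following is a minimal sketch under stated assumptions: the `Candidate` type, the instance names, the hourly prices, and the throughput figures are all hypothetical placeholders, and the throughput would come from your own load tests (for example, measured while watching vmstat, iostat, and sar).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    hourly_cost: float       # USD per hour (hypothetical figure)
    requests_per_sec: float  # sustained throughput from your own load test


def cost_per_million_requests(c: Candidate) -> float:
    """Cost in USD to serve one million requests at the measured throughput."""
    requests_per_hour = c.requests_per_sec * 3600
    return c.hourly_cost / requests_per_hour * 1_000_000


def rank(candidates):
    """Return candidates ordered from cheapest to most expensive per request."""
    return sorted(candidates, key=cost_per_million_requests)


# Hypothetical numbers, for illustration only. These are not real prices.
CANDIDATES = [
    Candidate("type-a", hourly_cost=0.10, requests_per_sec=100.0),
    Candidate("type-b", hourly_cost=0.40, requests_per_sec=500.0),
    Candidate("type-c", hourly_cost=0.20, requests_per_sec=150.0),
]

if __name__ == "__main__":
    for c in rank(CANDIDATES):
        print(f"{c.name}: ${cost_per_million_requests(c):.3f} per million requests")
```

A larger instance can be cheaper per request than a smaller one if the workload actually uses the extra capacity, which is why measuring before choosing matters.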
[Figure: Distribution of EC2 Instance Usage (bar chart). Instance types in axis order: m1.large, m1.small, m1.medium, c1.medium, c1.xlarge, t1.micro, m1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, m3.xlarge, m3.2xlarge, cc2.8xlarge, hi1.4xlarge, cg1.4xlarge, hs1.8xlarge, cc1.4xlarge, cr1.8xlarge. Bar values in order: 21%, 20%, 12%, 11%, 9%, 7%, 7%, 3%, 2%, 2%, 2%, 1%, 1%, 0%, 0%, 0%, 0%, 0%.]
+ EC2
3: Use configuration management
4: Choose the right monitoring solution
Rapid Setup
Full-stack
AWS Integration
Intelligent
Cluster-aware
5: Design effective alerting policies
Simple rules for confidently waking up ops@ at 3am
1. Something had better be broken (or close to it) for the customer
2. The broken thing should be as obvious as possible
3. It should be clear what action I can take to make the situation better
Customers seeing huge spike in 5XX errors
Code deploy to web cluster one hour ago
Revert!
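The three rules and the 5XX example above could be sketched as a paging decision like the one below. This is an illustration under stated assumptions, not any monitoring product's API: the `Snapshot` type, the thresholds (`spike_factor`, `min_rate`, `deploy_window_min`), and the fallback action text are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    error_rate_5xx: float  # fraction of customer requests returning 5XX
    baseline_5xx: float    # typical fraction for this service
    minutes_since_last_deploy: Optional[float]  # None if no deploy is known


def should_page(s: Snapshot, spike_factor: float = 5.0,
                min_rate: float = 0.01) -> bool:
    """Rule 1: page only when customers are actually affected."""
    return (s.error_rate_5xx >= min_rate
            and s.error_rate_5xx >= spike_factor * s.baseline_5xx)


def build_page(s: Snapshot, deploy_window_min: float = 120.0) -> Optional[str]:
    """Rules 2 and 3: say what is broken and what action to take."""
    if not should_page(s):
        return None
    msg = (f"Customers seeing 5XX on {s.error_rate_5xx:.1%} of requests "
           f"(baseline {s.baseline_5xx:.1%}).")
    recent = (s.minutes_since_last_deploy is not None
              and s.minutes_since_last_deploy <= deploy_window_min)
    if recent:
        msg += (f" Last deploy was {s.minutes_since_last_deploy:.0f} min ago."
                " Suggested action: revert.")
    else:
        # Hypothetical fallback; a real policy would link to a runbook.
        msg += " No recent deploy found. Suggested action: check cluster health."
    return msg


if __name__ == "__main__":
    print(build_page(Snapshot(0.20, 0.002, 60)))
```

Putting the suspected cause and the suggested action in the page itself is what lets the person woken at 3am act without first reconstructing context.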
6: Architect for high availability
Elastic Load Balancing
Amazon RDS
Apache ZooKeeper
[Figure: architecture diagram. Inputs: Cloud Integration, System Agents, Custom Metrics. Front end: DNS, Load Balancing 1 through n, Data Ingestion. Cells: Cell-1, Cell-2, ... Cell-n, each labeled GW, MQ, AI, F. Other labels: Workers, Agents, API, Q (1, 2, 3, n), Cassandra, S3, Archival, Online Analysis, Batch, Aggregation, Correlation, Trending, Anomaly, Health, Serving, Web/Mobile, UI.]
Elastic Load Balancing w/ haproxy
Localized failure
Identical dimensions
Easy to reason
Network partitions ok
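The "localized failure" property of a cell-based layout can be illustrated with a small routing sketch. The slides do not say how customers are assigned to cells, so the hashing scheme, the `cell_for` and `affected_customers` functions, and the customer IDs below are assumptions made purely for illustration.

```python
import hashlib


def cell_for(customer_id: str, num_cells: int) -> int:
    """Map a customer to a cell index deterministically (assumed scheme)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


def affected_customers(customers, failed_cell: int, num_cells: int):
    """Customers whose traffic lands on the failed cell; others are untouched."""
    return [c for c in customers if cell_for(c, num_cells) == failed_cell]


if __name__ == "__main__":
    customers = [f"customer-{i}" for i in range(12)]  # hypothetical IDs
    num_cells = 3
    down = affected_customers(customers, failed_cell=0, num_cells=num_cells)
    print(f"{len(down)} of {len(customers)} customers affected by cell 0 failing")
```

Because every cell has identical dimensions, losing one removes only that cell's share of customers. Note that plain modulo hashing reshuffles assignments when the cell count changes, so a real system would likely need a lookup table or consistent hashing instead.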
Handling failure
Avoid it
Mask it
Minimize it
Recover quickly
Cluster, AZ, Region
Resilience
Tolerance
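As one illustration of "mask it" and "recover quickly", a caller can retry briefly and then fail over to a replica in another cluster, AZ, or region. This is a minimal sketch, assuming each replica is exposed as a zero-argument callable; the function name `call_with_failover` and its parameters are hypothetical.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(
    replicas: Sequence[Callable[[], T]],
    attempts_per_replica: int = 2,
    base_delay: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try each replica in order, retrying with exponential backoff.

    The caller sees a failure only if every replica fails, so a single
    cluster or AZ outage is masked.
    """
    last_error: Optional[Exception] = None
    for replica in replicas:
        for attempt in range(attempts_per_replica):
            try:
                return replica()
            except Exception as exc:  # broad on purpose for this sketch
                last_error = exc
                sleep(base_delay * (2 ** attempt))
    raise RuntimeError("all replicas failed") from last_error


if __name__ == "__main__":
    def az1():
        raise ConnectionError("az-1 unreachable")  # simulated outage

    def az2():
        return "served from az-2"

    print(call_with_failover([az1, az2]))
```

The `sleep` parameter is injectable so the backoff can be skipped in tests.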
7: Think holistically about quality assurance
AUTOSCALING +
AUTOMATION +
CONTINUOUS INTEGRATION +
DEVOPS GOVERNANCE +
ELASTICITY +
PROGRAMMABLE INFRASTRUCTURE =
CONSTANT CHANGE
You cannot pre-test every change
So
You need to be really good at detecting issues
Very quickly
Monitoring is a key part of quality assurance for dynamic systems
But monitoring tools need to be intelligent
Distributed sensors
Cloud-aware
Anomaly detection
Synthetic transactions
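To make "anomaly detection" concrete, here is a minimal sketch of one simple technique: flagging a metric sample that deviates from a rolling window by more than a set number of standard deviations. It is not how any particular monitoring product works; the class name, window size, and threshold are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flag samples far from the recent rolling mean (simple z-score test)."""

    def __init__(self, window: int = 30, threshold: float = 3.0,
                 min_samples: int = 10):
        self.values = deque(maxlen=window)  # most recent samples only
        self.threshold = threshold
        self.min_samples = min_samples

    def observe(self, value: float) -> bool:
        """Record a sample and return True if it looks anomalous."""
        anomalous = False
        if len(self.values) >= self.min_samples:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            if sigma > 0:
                anomalous = abs(value - mu) > self.threshold * sigma
            else:
                # A perfectly flat history: any change counts as a deviation.
                anomalous = value != mu
        self.values.append(value)
        return anomalous


if __name__ == "__main__":
    detector = RollingAnomalyDetector()
    latencies_ms = [10.0 + (i % 2) for i in range(20)] + [50.0]
    for sample in latencies_ms:
        if detector.observe(sample):
            print(f"anomaly: {sample} ms")
```

Because the baseline is learned from recent data, nothing has to be re-tuned by hand each time the system changes.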
• Training
• Recommended reading:
• Systemantics (aka The Systems Bible)
• High Scalability (http://highscalability.com/)
• James Hamilton’s blog (http://perspectives.mvdirona.com/)
Visit us at http://www.smugmug.com/
Visit us at booth 315!