© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
They Don't Hug Back! Or Why You Need To Stop Worrying About prodweb001 And Start Loving i-98fb9856
Chris Munns, Amazon Web Services
November 13, 2013
Why are we here? Old-school IT practices continue to weigh us down in the cloud. We need a way out.
“Everything now is a programmable resource. There are no physical things anymore. Things that you needed to do by walking to the datacenter, by hugging your servers, and believe me I’ve hugged servers enough in my life. They DO NOT hug you back.” -
Dr. Werner Vogels (Re:Invent 2012)
“But I love my servers!” - You (now)
https://secure.flickr.com/photos/schluesselbein/4157426778/
“They hate you, actually, I honestly believe that they hate you. At least that is how they behaved towards me.” –
Dr. Werner Vogels (Re:Invent 2012)
“But I love my servers!” “Well now I’m kind of sad.”
- You (now)
https://secure.flickr.com/photos/bensonkua/2687804310/
So where does server hugging come from?
NAMING THEM
https://secure.flickr.com/photos/quinnanya/4464205726
Why do we name them? Because we have to know where to find them. Where do we need to find them?
Here
https://secure.flickr.com/photos/arthur-caranta/2925352521
Or here?
IF THIS THING IS OUT OF TAPE, YOU HAD A REALLY BAD DAY.
https://secure.flickr.com/photos/stephendotcarter/6587082437
Why did we need to find them in person? Because we HAD to fix them. WHY?
We fixed them because:
• Dead servers == dead space
• Dead space == wasted $$$
• Dead servers == worse performance
• Worse performance == lost $$$
So where else does server hugging come from?
SERVERS != OUR PETS
https://secure.flickr.com/photos/thegirlsny/3877243166/
What we name our pets
• Greek gods: Zeus, Thor, Hercules…
• Elements: Hydrogen, Helium, Lithium…
• Comic book heroes: Superman, Ironman…
• Musicians, Cities, Countries, Movies
• Prodweb01, Prodapi01…
• Web01.prod, Web01.test…
• Tacotruck01
• P1cfw01v03
P1cfw01v03 https://secure.flickr.com/photos/75898532@N00/3243666946/
P1cfw01v03 https://secure.flickr.com/photos/verylastexcitingmoment/3118396767/
Waking when they cry:
*** Nagios ***
Notification Type: PROBLEM
Service: Web CPU
Host: web03.example.com
Address: 10.167.10.51
State: CRITICAL
Date/Time: Thu Oct 24 08:14:13 UTC 2013
Additional Info: CRITICAL – CPU LOAD 29
Hugging server babies and you
• Is the site performing worse?
• Are your customers impacted?
• How impacted are they?
• What are the other 20 web instances doing?
• Did I really need to wake up at 4am for this?
• If a server uses 100% of its CPU, should I care?
• If this server is bad, how much work is there in fixing it?
• Is there something custom about this server?
Server hugging bad practices
• “Pet-ting” – caring about a server’s “name,” its well-being, its individual status
• “Snowflakes” – unique hosts in a common pool
• “Model T-ing” – hand-built one-off servers
• “Names In Stone” – overuse of host names as a source of truth
In short, a lot of old-school, dated habits are being carried over to cloud infrastructure. Once you’ve brought them to the cloud, you lose out on many of its benefits, such as:
• Dynamic scale up/down
• Self-healing infrastructures
• Increased flexibility
• Automation
https://secure.flickr.com/photos/tolomea/5113266973/
Letting go means moving forward with some of the best services AWS can offer you, and with the ways you can combine them.
Letting go and loving the new way
• Using Auto Scaling for everything
• ENIs and EIPs
• Tags are the new DNS
• Deployment tools
• Host-based configuration
• Service registries
Sleeping through Infrastructure Recovery
https://secure.flickr.com/photos/dominiqs/331702231
The things that should never wake you up
• High CPU usage on anything
• High memory usage on anything
• Thread/process exhaustion
• Filled disks
• Not running software
• Failed instances
Common actions taken when paged
1. Look at logs (looking at past data)
2. Look at graphs (looking at past data)
3. Reboot/restart related application/instance
Why do this manually?
Traffic to our site vs. provisioned capacity, provisioned manually (chart; labels: 76%, 24%)
Traffic to our site vs. provisioned capacity with Auto Scaling (chart)
STONITH: “Shoot the other node in the head”
Don’t be afraid to kill a node that has something wrong with it as a resolution to failure!
With Auto Scaling it’s fine!
STONITH (diagram sequence, components in order of appearance)
• AWS Cloud, Virtual Private Cloud, three Availability Zones, Internet Gateway, ELB
• Auto Scaling group (min=3) with three Web Instances
• CloudWatch, then an Alarm
• Amazon SNS, then Amazon SQS
• Watcher Instance, then the EC2 API
• The group shows two Web Instances, then three again
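The components named in the STONITH diagrams (CloudWatch Alarm, Amazon SNS, Amazon SQS, Watcher Instance, EC2 API) suggest a loop along these lines. This is a minimal sketch, not the speaker's watcher: the function names are hypothetical, the message layout (an SNS envelope whose "Message" field holds the alarm JSON, with the instance under Trigger.Dimensions) is an assumption to verify against real alarm messages, and the terminate call is passed in so the logic can run without AWS credentials.

```python
import json


def instance_ids_in_alarm(sqs_body):
    """Return the instance IDs named in one SNS-wrapped CloudWatch alarm.

    Returns an empty list unless the alarm reports the ALARM state.
    """
    envelope = json.loads(sqs_body)          # SNS envelope as stored in SQS
    alarm = json.loads(envelope["Message"])  # the alarm itself is a JSON string
    if alarm.get("NewStateValue") != "ALARM":
        return []
    dimensions = alarm.get("Trigger", {}).get("Dimensions", [])
    return [d["value"] for d in dimensions if d.get("name") == "InstanceId"]


def shoot_bad_nodes(sqs_bodies, terminate):
    """Call terminate(instance_id) for every instance in an alarming message."""
    shot = []
    for body in sqs_bodies:
        for instance_id in instance_ids_in_alarm(body):
            terminate(instance_id)  # Auto Scaling is expected to replace it
            shot.append(instance_id)
    return shot


# Example with a hand-written message and a fake terminate function.
alarm = {
    "AlarmName": "web-cpu-high",  # hypothetical alarm name
    "NewStateValue": "ALARM",
    "Trigger": {"Dimensions": [{"name": "InstanceId", "value": "i-98fb9856"}]},
}
body = json.dumps({"Type": "Notification", "Message": json.dumps(alarm)})
terminated = []
print(shoot_bad_nodes([body], terminated.append))
```

In a real watcher, `terminate` would wrap an EC2 API call and the bodies would come from polling the queue; both are left out here on purpose.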
Auto Scaling for everything!
• You can use Auto Scaling for singular instances that don’t scale up or down
  – min = 1, max = 1
• Auto Scaling gives you the ability to specify multiple Availability Zones, even if you only need a single host
  – gives you multi-AZ failover
• Auto Scaling supports notifications on instance creation/termination
  – Useful for configuring other resources, bootstrapping, and provisioning
• Auto Scaling is free!
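The min = 1, max = 1 idea can be sketched as a small parameter builder. This is an illustration under assumptions: the dictionary keys follow the keyword names of boto3's `create_auto_scaling_group` call (a newer SDK than the boto library used later in this deck), and the group, launch configuration, and zone names are hypothetical.

```python
def single_instance_group_params(name, launch_config, zones):
    """Build request parameters for a one-instance, self-healing group."""
    if not zones:
        raise ValueError("at least one Availability Zone is required")
    return {
        "AutoScalingGroupName": name,
        "LaunchConfigurationName": launch_config,
        "MinSize": 1,          # never fewer than one instance...
        "MaxSize": 1,          # ...and never more than one
        "DesiredCapacity": 1,
        "AvailabilityZones": list(zones),  # a replacement may start in any of these
    }


# Hypothetical names, for illustration only.
params = single_instance_group_params(
    "blog-app-asg", "blog-app-lc", ["us-west-2a", "us-west-2b"]
)
print(params["MinSize"], params["MaxSize"], params["AvailabilityZones"])
```

Listing more than one zone is what gives the single host somewhere else to come back up.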
Auto Scaling for everything!
• Make use of the user data or configuration management tools to do things like:
  – Re-attaching an Amazon Elastic Block Store (EBS) volume with application data
  – Re-attaching an Elastic Network Interface (ENI)
  – Updating service registries
  – Updating DNS
  – Telling other dependent applications about the new host
Elastic Network Interfaces/Elastic IPs
ENI:
• Add additional interfaces to an instance
• One or more secondary private IP addresses
• Has its own MAC address
• Can have Security Groups assigned
• Tag-able
• Free
EIP:
• A static public IP address
• Can be assigned to either an instance or an ENI
• Doesn’t replace private IP
• Small hourly charge when not attached to an instance
Elastic Network Interfaces
Attaching multiple network interfaces to an instance is useful when you want to:
• Create a management network.
• Use network and security appliances in your Amazon Virtual Private Cloud (VPC).
• Create dual-homed instances with workloads/roles on distinct subnets.
• Create a low-budget, high-availability solution.
Healing a single instance (diagram sequence, components in order of appearance)
• AWS Cloud, EC2 API, AWS CloudFormation
• Virtual Private Cloud, Availability Zone, Internet Gateway, NAT Instance
• App Instance
• Auto Scaling group around the App Instance
• Elastic Network Interface and EBS Volume
Healing a single instance
"myENI" : {
  "Type" : "AWS::EC2::NetworkInterface",
  "Properties" : {
    "Tags": [{"Key":"Name","Value":"AppENI"}, {"Key":"Project","Value":"Blog"}],
    "Description": "Blog One Off App Server ENI.",
    "SubnetId": "subnet-d2286cb9",
    "PrivateIpAddress": "192.168.11.100"
  }
}
Healing a single instance
import boto.ec2
import boto.utils

# Connect to the API
conn = boto.ec2.connect_to_region('us-west-2')

# Find the right ENI by its tags
myfilters = {'tag:Name': 'AppENI', 'tag:Project': 'Blog'}
myEni = conn.get_all_network_interfaces(filters=myfilters)

# Attach the ENI to this instance as its second interface
myInstance = boto.utils.get_instance_metadata()['instance-id']
conn.attach_network_interface(myEni[0].id, myInstance, device_index=1, dry_run=False)
https://secure.flickr.com/photos/cambodia4kidsorg/260004685
Use tags as a source of “truth” in your infrastructure
DNS bad. Tags good.
DNS
• 30-year-old technology
• Only tells us a single thing about a host: a hostname-to-IP mapping
• Potential for split brain/broken replicas
• Caching issues, caching issues, caching issues
Tags
• Set by you, the user; held in AWS and available via APIs
• Key:Value is totally up to you
• Can have several per resource
• Free to implement and query
DNS: Web03.example.com
– 10.167.10.51
Tags: i-933f81a4
– Name:Web
– Env:Prod
– Project:Blog
– Owner:BobSmith
– aws:autoscaling:groupName: ProdBlogWebsASG
– aws:cloudformation:stack-name: BlogSiteProd
Tags as a source of truth
• Tie various resources together
• Billing reports
• IAM resource-level permissions
• Build automation
• Deploy automation
• Security resource grouping
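Selecting resources by tag rather than by hostname can be sketched in a few lines. This is a toy illustration, not an AWS API: `find_by_tags` is a hypothetical helper and the inventory is a hand-written dictionary standing in for the result of a tag query. The tag keys and the first instance ID come from the example above; the other entries are made up.

```python
def find_by_tags(resources, **wanted):
    """Return IDs of resources whose tags include every wanted key/value."""
    return sorted(
        resource_id
        for resource_id, tags in resources.items()
        if all(tags.get(key) == value for key, value in wanted.items())
    )


# Hypothetical inventory: resource ID -> tags.
inventory = {
    "i-933f81a4": {"Name": "Web", "Env": "Prod", "Project": "Blog"},
    "i-98fb9856": {"Name": "Web", "Env": "Test", "Project": "Blog"},
    "i-0abcdef": {"Name": "Api", "Env": "Prod", "Project": "Blog"},
}

print(find_by_tags(inventory, Name="Web", Env="Prod"))
print(find_by_tags(inventory, Project="Blog", Env="Prod"))
```

The same query answers "which hosts do I deploy to" and "which hosts belong on this bill", which a hostname-to-IP mapping cannot do.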
Stop hand-crafting servers!
https://secure.flickr.com/photos/ndrwfgg/115898387
Use automation!
https://secure.flickr.com/photos/genewolf/147722350
AWS management tools
AWS Elastic Beanstalk, AWS OpsWorks, AWS CloudFormation
From higher-level services (convenience) to do it yourself (control)
Host-based configuration management
Fabric
• All more or less accomplish the same things
  – File configuration, package/software installation, user management, run commands, interface with OS, process management
• All have their own syntax that isn’t too dissimilar
• Some rely on agents, some are agentless
• Use HBCM alongside one of the tools from the previous slide
• Spend the time required to learn them
• Can’t scale easily without HBCM
“I don’t have time to learn Chef!?”
“I wrote custom shell scripts instead!”
https://secure.flickr.com/photos/45909111@N00/9374169461/
Go visit the AWS & Partner exhibits and ask for more info!
Making Use of Service Registries
https://secure.flickr.com/photos/fringedbenefit/9178086713
https://secure.flickr.com/photos/smartfinn/2651755337/
NOT THAT KINDA REGISTRY!
“A service registry is one of the fundamental pieces of service-oriented architecture (SOA) for achieving reuse. It refers to a place in which service providers can impart information about their offered services and potential clients can search for services.” - www.architecturejournal.net, Sept 2009
Service registry workflow
1. A new instance boots.
2. It registers itself with our “service registry.”
3. Changes to the service registry kick off changes on other systems related to the new instance.
4. Other instances now know about our new instance.
5. On instance termination, instance is deregistered, and other instances remove it from use.
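The workflow above can be sketched with a toy in-memory registry. This is only an illustration of the register/notify/deregister cycle: the class and method names are hypothetical, the addresses are made up, and a real registry such as the ones listed in this deck adds persistence, health checks, and consensus.

```python
class ServiceRegistry:
    """Toy in-memory registry mapping services to instance addresses."""

    def __init__(self):
        self._members = {}    # service name -> {instance_id: address}
        self._listeners = []  # callables notified on every change

    def subscribe(self, listener):
        """Register a callable invoked as listener(service, backends)."""
        self._listeners.append(listener)

    def register(self, service, instance_id, address):
        """Step 2: a booted instance announces itself."""
        self._members.setdefault(service, {})[instance_id] = address
        self._notify(service)

    def deregister(self, service, instance_id):
        """Step 5: a terminated instance is removed from use."""
        self._members.get(service, {}).pop(instance_id, None)
        self._notify(service)

    def lookup(self, service):
        """Step 4: other instances ask who currently serves a service."""
        return sorted(self._members.get(service, {}).values())

    def _notify(self, service):
        # Step 3: every change kicks off updates on interested systems.
        for listener in self._listeners:
            listener(service, self.lookup(service))


registry = ServiceRegistry()
events = []
registry.subscribe(lambda service, backends: events.append((service, backends)))
registry.register("web", "i-933f81a4", "10.167.10.51:80")
registry.register("web", "i-98fb9856", "10.167.10.52:80")
registry.deregister("web", "i-933f81a4")
print(registry.lookup("web"))
```

A listener here could rewrite a load balancer configuration, which is roughly the role the later SmartStack slides give to Synapse.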
Service registry examples:
• Zookeeper
• MuleSoft Anypoint Service Registry
• Netflix Eureka
• IBM WebSphere Service Registry and Repository
• Airbnb SmartStack
Zookeeper “is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.” – zookeeper.apache.org
– leader election
– group membership
– configuration maintenance
– event notification
– locking
– priority queue mechanism
Zookeeper (diagram): AWS Cloud, Virtual Private Cloud with three Availability Zones, three Zookeeper Instances, an Auto Scaling group (min=2) with two Worker Instances, and a Leader Host.
Enough from me!
Customer Story: Airbnb SmartStack Martin Rhoads
Martin Rhoads SRE @ Airbnb November 13, 2013
Airbnb SmartStack Helping you build Service Oriented Architectures
not at Re:Invent
Intros
Igor Serebryany
+ SRE at Airbnb since 2012
+ Built datacenter automation at SingleHop
+ Scientific computing at University of Chicago
+ Hobbies: welding, biking, long walks on the beach
This guy is even more bearded than the last!
Intros
Martin Rhoads
+ SRE at Airbnb
+ User of AWS since 2006
+ One of the first 10 employees at RightScale
+ Previously worked at Cloudscaling deploying OpenStack at Tier1s and Telcos
+ BioInformatics at UCSB
+ Obsessed with making things easier
SmartStack Helping you build SOA
What are you trying to sell me?
Why do I need SOA?
+ The definitive way to scale your architecture
+ Allow different people to work on different code without stepping on toes
+ Separate deployment schedules
+ Separate machine and data requirements
+ Fail separately -- so you can have graceful degradation
How SOA happens
When customers love a service very, very much...
Here’s how it ends up
A certain kind of fun
To sum up
1. Services help you scale
2. SOA is an architecture style designed around services
3. A SOA is hard to manage
4. SmartStack makes managing SOA a breeze
What is SmartStack? And how does it help?
1. Service(s) you want to deliver
2. Zookeeper registry to track everything
3. Nerve checks health and updates Zookeeper
4. Synapse routes between services
Diagram labels: SERVICE, ZOOKEEPER, NERVE, SYNAPSE; example services MONORAIL and MOBILE WEB, each shown with NERVE and SYNAPSE.
ZOOKEEPER
+ /production/monorail/services/i-1234567 => {'host': '1.2.3.4', 'port': 5678}
+ /production/mobile_web/services/i-0abcdef => {'host': '5.6.7.8', 'port': 5678}
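The key layout above can be sketched as a pair of helpers: one builds the path and payload a registering instance would write, the other reads the backends for one service back out. This is a sketch under assumptions: the function names are hypothetical, a plain dictionary stands in for Zookeeper, and the slide does not say whether Nerve and Synapse serialize data exactly this way.

```python
import json


def registration(environment, service, instance_id, host, port):
    """Return the (path, data) pair for one service instance."""
    path = "/{}/{}/services/{}".format(environment, service, instance_id)
    data = json.dumps({"host": host, "port": port}, sort_keys=True)
    return path, data


def backends(entries, environment, service):
    """Collect (host, port) pairs for a service from path -> data entries."""
    prefix = "/{}/{}/services/".format(environment, service)
    found = []
    for path, data in sorted(entries.items()):
        if path.startswith(prefix):
            record = json.loads(data)
            found.append((record["host"], record["port"]))
    return found


# A dictionary standing in for the registry contents shown above.
entries = dict([
    registration("production", "monorail", "i-1234567", "1.2.3.4", 5678),
    registration("production", "mobile_web", "i-0abcdef", "5.6.7.8", 5678),
])
print(backends(entries, "production", "monorail"))
```

Keying each entry by instance ID means a replaced instance simply appears under a new path; nothing depends on a hostname.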
haproxy
At the core of synapse
We get myriad benefits from haproxy
+ Stable and well-tested
+ Performs in-process connectivity checks
+ Great introspection and logging
+ Lots of load-balancing algorithms (RR, least-conn)
+ Somewhat dynamically reconfigurable (stats socket)
To Recap
SmartStack in action
Why SmartStack?
+ Introspection
+ Abstraction and DRY
+ Distributed by design
+ Automatic failure detection
Abstraction
Always consistent across your infrastructure
+ The same code in the same language is always doing discovery/registration
+ Your application doesn’t know about nerve/synapse -- it only knows about its dependencies
Automatic Failure Handling
You don’t have to wake up
+ Bad backends are automatically taken out of rotation
+ Useful during both problems and routine maintenance/deploys
+ Push-based => very rapid detection; avoid those little blips
+ haproxy even routes around network partitions!
Introspection
See what’s REALLY going on
Leverage the power of haproxy
+ status page that lets you see local state
+ lots of available integrations to gather global state
+ world-class logging for large-scale analysis
Distributed by Design
No central point of failure
+ Traffic flows directly between boxes -- no routing layer
+ Even if SmartStack is stopped or broken, haproxy keeps traffic flowing
+ Zookeeper helps to avoid common pitfalls (like different backends in different network segments)
How SmartStack has changed Airbnb
The Impact
100+
Services using
SmartStack
Requests per second
LOC deleted
Engineers using
SmartStack
2K 3K 30
Our engineers love SmartStack
Ben: “SmartStack is great! It helped me to discover services – and quit smoking”
Phillippe: “Distributed computing? And all this time I thought everything was running on one machine”
Spike: “Nerve and Synapse have greatly simplified my life as an application developer, and have enabled me to launch our first Node.js services with very little ops overhead.”
Barbara: “I love it!”
Sean: “Smart Stack has made deployment of new java services a matter of beer and 20 lines of ruby”
Future Direction
Is this project, like, done...?
1. Better resiliency: more graceful handling of zookeeper edge cases
2. Better testing: improve on the current integration test suite
3. Dynamic registration: for services running on Mesos et al.
4. A push API for nerve: allow services to communicate coming downtime
5. An auto-scaling layer: use nerve information to determine load levels
Getting Started
I’m sold! How do I get started?
1. install Vagrant
2. git clone https://github.com/airbnb/smartstack-cookbook.git
3. vagrant up
Where is the code?
https://github.com/airbnb/nerve.git
https://github.com/airbnb/synapse.git