siddharth-anand
Keynote at the SAP Cloud Conference, February 2012
The AWS Cloud: Leveraging the State of the Art
Sid Anand (@r39132)
SAP Cloud Inside Track 2012
Thursday, February 16, 2012
What is the AWS Cloud? A Real World Scenario
Question
If you were to build your own website today, what would you need?
Answer
You need a machine!
For simplicity, we will assume that your web server and application server code run on the same box!
AWS offers EC2 instances (i.e. virtual instances) to host your code
- Various sizes (e.g. IOPS, # of spindles, CPUs, memory, network bandwidth)
- Various configurations (e.g. Virtual Private Cloud, High Performance Cluster)
- Various pricing schemes (e.g. on-demand, reserved, spot, etc.)
Question
Is one machine enough to handle traffic from all of your users?
What if that machine were to fall over or need maintenance (i.e. a restart)?
Answer
Add many machines!
Question
This handles more traffic, but what if your servers were to fall over or need maintenance?
Answer
AWS offers Auto Scaling Groups (a.k.a. ASGs)!
You can deploy your servers under the protection of an ASG with a min and max pool size set.
The ASG replaces machines when they die, which guarantees your “min” pool size.
ASGs monitor the health of your machines by polling an HTTP port on each machine.
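The replace-on-death behaviour can be sketched as a small reconciliation loop. This is a toy model, not the AWS API: the class name `AutoScaleGroup`, the `reconcile` method, and the `i-N` instance ids are all made up for illustration, and the health check is reduced to a boolean flag.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class AutoScaleGroup:
    """Toy model of an ASG: keeps at least min_size healthy instances."""
    min_size: int
    max_size: int
    instances: dict = field(default_factory=dict)  # instance id -> healthy?
    _ids: count = field(default_factory=count)     # hypothetical id generator

    def launch(self):
        if len(self.instances) >= self.max_size:
            raise RuntimeError("max pool size reached")
        instance_id = f"i-{next(self._ids)}"
        self.instances[instance_id] = True
        return instance_id

    def reconcile(self):
        """One polling pass: drop unhealthy instances, then top up to min_size."""
        for instance_id in [i for i, ok in self.instances.items() if not ok]:
            del self.instances[instance_id]
        while len(self.instances) < self.min_size:
            self.launch()


asg = AutoScaleGroup(min_size=3, max_size=6)
asg.reconcile()                # initial fill to the min pool size
asg.instances["i-0"] = False   # pretend the HTTP health check failed
asg.reconcile()                # the dead instance is dropped and replaced
```

After the second pass the group is back at three instances, with `i-0` gone and a fresh instance in its place.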
Question
How do you distribute traffic to all of your machines evenly?
Answer
Deploy your favorite software load balancer!
And write some custom code to register/deregister your machine instances with the load balancer
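As a rough idea of what that custom code involves, here is a minimal round-robin balancer with register/deregister hooks. The class `RoundRobinBalancer` and the IP addresses are hypothetical; a real software load balancer would also handle health checks, connection draining, and concurrency.

```python
class RoundRobinBalancer:
    """Hands out registered backends in rotation."""

    def __init__(self):
        self.backends = []
        self._next = 0

    def register(self, backend):
        if backend not in self.backends:
            self.backends.append(backend)

    def deregister(self, backend):
        if backend in self.backends:
            self.backends.remove(backend)

    def pick(self):
        if not self.backends:
            raise RuntimeError("no backends registered")
        backend = self.backends[self._next % len(self.backends)]
        self._next += 1
        return backend


lb = RoundRobinBalancer()
for host in ("10.0.0.1", "10.0.0.2", "10.0.0.3"):
    lb.register(host)

first_round = [lb.pick() for _ in range(3)]    # each backend gets one request
lb.deregister("10.0.0.2")                      # e.g. the instance died
second_round = [lb.pick() for _ in range(4)]   # traffic only reaches the survivors
```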
Question
What if the load balancer were to fall over or to need maintenance or to become a traffic choke point?
Answer
Add multiple servers and deploy them under an ASG!
This is not ideal for a few reasons
- Need to register/deregister your load balancer instances with DNS
- Need to sync with the ASG's view of what is alive and dead, being added or removed, etc.
Answer
AWS offers Elastic Load Balancers (a.k.a. ELBs)
- Conceptually similar to having many LBs in an ASG, with some additional features:
- Provides DNS hostname (e.g. mysite-11111111.us-east-1.elb.amazonaws.com)
- Maps all of the load balancer instances to this hostname
- Takes care of maintenance of the load balancer machines and the requisite DNS registrations/deregistrations
- Syncs with the ASG -- if the ASG replaces one of your instances, the ELB will also remove that instance
- Let's see how it works in action!
Question
What about a DB to persist my data?
Answer
Multiple AWS hosted/managed options!
- DynamoDB (the new SimpleDB replacement) offers key-value semantics
- Netflix replaced Oracle with SimpleDB and ran on it in 2010-2011
- 4.5 billion user-facing requests a day
- S3 offers key-value semantics for very large files (e.g. 5 TB). Typically used for Map-Reduce files, media files, or Oracle BLOBs/CLOBs
- RDS: hosted Oracle or MySQL if you need relations and complex queries
Question
What if I have high-volume writes but don't care when they are written (e.g. event streams)?
Answer
Simple Queue Service (SQS)
- Think Enterprise Message Bus
- Highly available, infinitely scalable
- Handles application/system monitoring event traffic and social graph events at Netflix
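The "write now, process whenever" pattern can be sketched with an in-process queue. This uses Python's standard `queue` module as a stand-in and is not the SQS API; the function names `record_event` and `drain`, the batch size, and the `play_started` event shape are assumptions for illustration.

```python
import queue

events = queue.Queue()  # in-process stand-in for a hosted message queue


def record_event(event):
    """Fast path: enqueue and return without waiting for processing."""
    events.put(event)


def drain(handler, batch_size=10):
    """Slow path: a worker pulls up to batch_size events whenever it gets to them."""
    handled = 0
    while handled < batch_size:
        try:
            event = events.get_nowait()
        except queue.Empty:
            break
        handler(event)
        handled += 1
    return handled


# A burst of writes is absorbed immediately...
for n in range(25):
    record_event({"type": "play_started", "seq": n})

# ...and processed later, in batches, at the consumer's own pace.
processed = []
while drain(processed.append):
    pass
```

The producer never blocks on the consumer, which is the property that matters for event streams.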
Question
What if the whole Data Center goes down? How do I keep my service available?
Answer
Amazon Data Center = Availability Zone (AZ)
Always deploy your code in multiple Availability Zones!
- Netflix deploys in 3 AZs in Virginia
- Best practice: always deploy enough capacity in each AZ to handle losing one AZ during peak
- Netflix follows this best practice!
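The best practice above comes down to simple arithmetic: if one of N AZs disappears, the remaining N-1 must carry the whole peak. A small sketch, where the peak figure of 300 instances is a made-up number and not a Netflix statistic:

```python
import math


def instances_per_az(peak_instances, az_count):
    """Instances to run in each AZ so the remaining AZs still cover peak if one is lost."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive losing one")
    return math.ceil(peak_instances / (az_count - 1))


# Hypothetical numbers: peak traffic needs 300 instances, deployed across 3 AZs.
per_az = instances_per_az(peak_instances=300, az_count=3)   # 150 per AZ
total = per_az * 3                                          # 450 in total
```

With 3 AZs this means running 50% more capacity than the peak strictly requires, which is the price of surviving an AZ outage.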
Question
What if your Asian and European customers complain of slow response times?
Recall: higher response times mean lower scalability
Answer
AWS has 8 global regions! Each region has between 3 and 4 AZs
- Netflix's launch in the UK and Ireland was run out of the AWS EU-West region
Other AWS Services:
- Elastic MapReduce: Map-Reduce as a Service for analytics. Supports Pig and Hive
- ElastiCache: a hosted cache service (think Memcached as a Service)
What's missing (or coming soon)?
- Discovery & load balancing for N-tier applications!
- In effect, we'd like ELB for internal traffic
- Crypto as a Service
- Currently, none of the services are cross-region! It's left to the user to transfer data or proxy requests between regions
Who Uses AWS? Netflix’s Cloud Architecture
Netflix’s Cloud Architecture
Components
Many (~100) applications, organized in clusters (a.k.a. ASGs)
Clusters can be at different levels in the call stack
Clusters can call each other
[Diagram: call stack levels, top to bottom: ELB, NES, Discovery, NMTS, NBES, IAAS]
Netflix’s Cloud Architecture Levels
NES : Netflix Edge Services
NMTS : Netflix Mid-tier Services
NBES : Netflix Back-end Services
IAAS : AWS IAAS Services
Discovery : Helps services discover NMTS and NBES services
Netflix’s Cloud Architecture Components (NES)
Overview
Any service that browsers and streaming devices connect to over the internet
They sit behind AWS Elastic Load Balancers (a.k.a. ELB)
They call clusters at lower levels
Netflix’s Cloud Architecture Components (NES)
Examples
API Servers
Support the video browsing experience
Also allow users to modify their queue
Serve 1.4 billion calls/day
Streaming Control Servers
Support streaming video playback
Authenticate your Wii, PS3, etc...
Download DRM to the Wii, PS3, etc...
Return a list of CDN urls to the Wii, PS3, etc...
Netflix’s Cloud Architecture
Components (NMTS)
Overview
Can call services at the same or lower levels
Other NMTS
NBES, IAAS
Not NES
Exposed through our Discovery service
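To make "exposed through Discovery" concrete, here is a toy in-memory registry. It is not Netflix's actual Discovery service: the class `DiscoveryRegistry`, its methods, the service names, and the addresses are all hypothetical.

```python
class DiscoveryRegistry:
    """Toy in-memory registry mapping a service name to its live instance addresses."""

    def __init__(self):
        self._instances = {}

    def register(self, service, address):
        self._instances.setdefault(service, set()).add(address)

    def deregister(self, service, address):
        self._instances.get(service, set()).discard(address)

    def lookup(self, service):
        """Return the known addresses for a service (empty list if none)."""
        return sorted(self._instances.get(service, ()))


registry = DiscoveryRegistry()
registry.register("queue", "10.1.0.5:8080")
registry.register("queue", "10.1.0.6:8080")
registry.register("viewing-history", "10.1.1.9:8080")
registry.deregister("queue", "10.1.0.5:8080")   # instance died or was replaced

queue_hosts = registry.lookup("queue")          # what a caller would see
```

Callers look up a service by name instead of hard-coding hosts, which plays the role for internal traffic that the ELB hostname plays for edge traffic.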
Netflix’s Cloud Architecture Components (NMTS)
Examples
Netflix Queue Servers
Modify items in the users' movie queue
Viewing History Servers
Record and track all streaming movie watching
SIMS Servers
Compute and serve user-to-user and movie-to-movie similarities
Netflix’s Cloud Architecture Components (NBES)
Overview
A back-end, usually 3rd party, open-source service
Leaf in the call tree. Cannot call anything else
Netflix’s Cloud Architecture
Components (NBES)
Examples
Cassandra Clusters
Cassandra is our new cloud database; it stores all sorts of data to support application needs
Zookeeper Clusters
Our distributed lock service and sequence generator
Memcached Clusters
Typically caches things that we store in S3 but need to access quickly or often
Netflix’s Cloud Architecture Components (IAAS)
Examples
AWS S3
Large-sized data (e.g. video encodes, application logs, etc...) is stored here, not Cassandra
AWS SQS
Amazon's message queue, used to send events (e.g. Facebook network updates are processed asynchronously over SQS)
Netflix’s Cloud Architecture
Architecture Pros
Horizontally scalable at every level
Should give us maximum availability
Architecture Cons
A user-issued call will pass through multiple levels (a.k.a. hops) during normal operation
Latency can be a concern
EC2 instances in AWS can die at any time!
A lot of moving parts
Dealing with the Cons!
We have a little help
Simian Army
Prevention (& Early Detection) is the best medicine
• Chaos Monkey
• Simulates hard failures in AWS by killing a few instances per ASG (i.e. Auto Scaling Group)
• Similar to how EC2 instances can be killed by AWS with little warning
• Tests Netflix's ability to gracefully deal with broken connections, interrupted calls, etc.
• Verifies that all services are running within the protection of AWS Auto Scaling Groups, which reincarnate killed instances
• If not, the Chaos Monkey will win!
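The kill-and-replace cycle can be sketched in a few lines. This is an illustration only, not the real Chaos Monkey: the functions `chaos_monkey` and `asg_replace`, the group names, and the instance ids are hypothetical, and "killing" is just removing an id from a set.

```python
import random


def chaos_monkey(asgs, rng, kills_per_asg=1):
    """Terminate a few random instances in every group; return what was killed."""
    killed = {}
    for name, instances in asgs.items():
        victims = rng.sample(sorted(instances), min(kills_per_asg, len(instances)))
        instances.difference_update(victims)
        killed[name] = victims
    return killed


def asg_replace(asgs, min_size):
    """Stand-in for the ASG: top every group back up to min_size."""
    for name, instances in asgs.items():
        n = 0
        while len(instances) < min_size:
            instances.add(f"{name}-new-{n}")
            n += 1


asgs = {
    "api": {"api-1", "api-2", "api-3"},
    "queue": {"queue-1", "queue-2", "queue-3"},
}
killed = chaos_monkey(asgs, random.Random(42))
sizes_after_kill = {name: len(instances) for name, instances in asgs.items()}
asg_replace(asgs, min_size=3)   # services inside an ASG recover on their own
sizes_after_replace = {name: len(instances) for name, instances in asgs.items()}
```

A service running outside an ASG would simply stay at the reduced size, which is how the monkey "wins".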
• Latency Monkey
• Simulates soft failures -- i.e. a service gets slower
• Injects random delays in servers!
• Tests the ability of applications to detect and recover (i.e. Graceful Degradation) from the harder problem of delays
• Delays cause Thundering Herds (outside of the scope of this talk!)
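One common way for a caller to cope with a slow dependency is a timeout plus a fallback value. The sketch below shows that idea under stated assumptions: `with_latency`, `call_with_fallback`, and `similar_titles` are hypothetical names, and the injected delays are fixed rather than random so the example is repeatable.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def with_latency(func, delay_s):
    """Latency Monkey stand-in: wrap func so every call is delayed by delay_s."""
    def slowed(*args, **kwargs):
        time.sleep(delay_s)
        return func(*args, **kwargs)
    return slowed


def call_with_fallback(pool, func, timeout_s, fallback):
    """Give up waiting after timeout_s and degrade gracefully to a fallback value."""
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback


def similar_titles():   # hypothetical mid-tier call
    return ["title-a", "title-b"]


with ThreadPoolExecutor(max_workers=2) as pool:
    healthy = call_with_fallback(pool, with_latency(similar_titles, 0.0), 0.5, [])
    degraded = call_with_fallback(pool, with_latency(similar_titles, 1.5), 0.5, [])
```

The slow call still runs to completion in the background; the caller just stops waiting for it and serves an empty list instead.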
Does this solve all of our issues?
The infinite cloud is infinite when your needs are moderate!
To ensure fairness among tenants, AWS meters or limits every resource
Hence, we hit limits quite often. Our “velocity” is limited by how long it takes for AWS to turn around and raise the limit -- a few hours!
• Limits Monkey
• Checks once an hour whether we are approaching one of our limits and triggers alerts for us to proactively reach out to AWS!
• Conformity & Janitor Monkeys
• Find and clean up orphaned resources (e.g. EC2 instances that are not in an ASG, unreferenced security groups, ELBs, ASGs, etc.) to increase head-room
• Buys us more time before we run out of resources and also saves us $$$$
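Both checks reduce to comparing inventories. A minimal sketch, assuming the usage counts, limits, and instance lists have already been fetched from somewhere; the 80% alert threshold, the resource names, and all numbers are made up for illustration.

```python
def approaching_limits(usage, limits, threshold=0.8):
    """Return {resource: fraction used} for resources at or above the threshold."""
    alerts = {}
    for resource, limit in limits.items():
        used = usage.get(resource, 0)
        if used >= threshold * limit:
            alerts[resource] = used / limit
    return alerts


def find_orphaned_instances(all_instances, asgs):
    """Return instances that do not belong to any ASG."""
    in_asg = set().union(*asgs.values())
    return sorted(set(all_instances) - in_asg)


# Hypothetical account limits and current usage.
limits = {"ec2_instances": 1000, "elbs": 20, "security_groups": 500}
usage = {"ec2_instances": 870, "elbs": 9, "security_groups": 120}
alerts = approaching_limits(usage, limits)   # only ec2_instances is over 80%

# Hypothetical inventory: i-9 was launched by hand and never cleaned up.
asgs = {"api": {"i-1", "i-2"}, "queue": {"i-3"}}
orphans = find_orphaned_instances(["i-1", "i-2", "i-3", "i-9"], asgs)
```

Run on a schedule, the first function drives the alert and the second produces the clean-up list.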
Questions?
Sid Anand
@r39132
http://www.linkedin.com/in/siddharthanand