75
Why Cloud Architecture is Different! Michael Stiefel www.reliablesoftware.com [email protected] Architecting For Failure

Why Cloud Architecture is Different! Michael Stiefel [email protected] Architecting For Failure

Embed Size (px)

Citation preview

Page 1: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Why Cloud Architecture is Different!

Michael Stiefel

www.reliablesoftware.com

[email protected]

Architecting For Failure

Page 2: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Outsource Infrastructure?

Page 3: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Traditional Web Application

Web Site

Virtual Machine / Directly on Hardware

100 MB Relational Database

Inbound Transactions

Output Transactions

File System

Page 4: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Hosting Provider Costs

Provider $ / Monthly Cost

Host Gator 9.95

Go Daddy 10

ORCS Web 69

Amazon 83+ BYOS

Windows Azure 97

Note: traditional hosting, no custom colocation, virtualized data centers.

Page 5: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Cloud is Not Cheaper for Hosting

Page 6: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Perhaps, Higher Availability?

Page 7: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

SLA is Not Radically Different

Provider Compute SLA (%)

Go Daddy 99.9

ORCS Web 99.9

Host Gator 99.9

Amazon 99.95

Azure 99.95

Difference is seven minutes a day; 1.75 days a year.

Page 8: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Higher Rate Since You Pay for Flexibility

Page 9: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Hosting is Not Cloud Computing

Page 10: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Why Utility Computing?

Scalability: do not have to pay for peak scenarios.

Availability: can approach 100% if you want to pay.

Page 11: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Architecturally, they are the same problem

Page 12: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

You must design to accommodate missing computing resources.

Page 13: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Designing for Failure is Cloud Computing

Page 14: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

What’s wrong with this Code Fragment?

ClientProxy client = new ClientProxy();

Response response = client.Do (request);

 

Page 15: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Never assume that any interface between two components always succeeds.

Page 16: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

So You Put in a Catch Handler

try

{

ClientProxy client = new ClientProxy();

int result = client.Do (a, b, c);

}

catch (Exception ex)

{

}

 

Page 17: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

What if…

a timeout, how many retries?

the result is a complete failure?

the underlying hardware crashed?

you need to save the user’s data?

you are in the middle of a transaction?

Page 18: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

What Do You Put in the Catch Handler?

try

{

ClientProxy client = new ClientProxy();

int result = client.Do (a, b, c);

}

catch (Exception ex)

{

????

}

 

Page 19: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

You can’t program yourself out of a failure.

Page 20: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Failure is a first-class design citizen.

Page 21: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

The critical issue is how to respond to failure. The underlying infrastructure

cannot guarantee availability.

Principle #1

Page 22: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Consequences of Failure

Multiple tiers and dependencies.If your order queue fails, you cannot do orders.

If your customer service fails, you cannot get membership information.

The more dependencies, the more consequences of a poorly handle failure.

Dependencies include your code, third parties, the Internet/Web, anything you do not control.

Page 23: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Unhandled failures propagate (like cracks) through your application.

Page 24: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Failures Cascade – an unhandled failure in one part of the system becomes a

failure of your application.

Principle #2

Page 25: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Two Types of Failure

Transient Failure

Resource Failure

Page 26: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Typical Response to a Transient Failure

Retry

How Often?

How Long Before You Give Up?

Page 27: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Delays Cascade Just Like Failures

Delays occur while you are waiting or retrying

Delays hog resources like threads, TCP/IP ports, database connections, memory.

Since delays are usually the result of resource bottlenecks, waiting or retrying for long periods adds to the bottleneck.

Page 28: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Transient failures become resource failures.

Page 29: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Transient Failures

Retry for a short time, then give up (like a circuit breaker) if unsuccessful.

Never block on I/O, timeout and assume failure.

Page 30: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

There is no such thing as a transient failure. Fail fast and treat it as a resource

failure.

Principle #3

Page 31: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Make Components Failure Resistant

Design For Beyond Largest Expected LoadUnderstand latency of adding a new resource

User load, virtual memory, CPU size, bandwidth, database

Handle all Errors

Failure affects more people than on the desktop.

Page 32: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Define your own SLA.

Page 33: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Stress test components and system.

Page 34: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

A chain is a strong as its weakest link.

Page 35: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Use a Margin of Safety when designing the resources used.

Principle #4

Page 36: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

What is the cost of availability?

Page 37: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Any component or instance can fail – eliminate single points of failure.

Page 38: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Search for Dependencies

Hardware / Virtual Machines

Third Party Libraries

Internet/Web

Interfaces to your own components

TCP/IP ports

DNS Servers

Message Queues

Database Drivers

Credit Card Processors, Geocoding services, etc.

Page 39: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Examine Queries

Only three types of result sets:

Zero, One, Many (can become large overnight)

Search Providers limit results returned

Remember those 5 way joins your ORM uses

Objects on a DCOM or RMI call

Page 40: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Eliminate single points of failure. Accept the fact that you must build a distributed

application.

Principle #5

Page 41: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

You need redundancy...

Page 42: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

but you have to manage state.

Page 43: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Solutions such as database mirroring may have unacceptable latencies, such as

over geography.

Page 44: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Reduce the parts of your application that handle state to a minimum.

Page 45: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Loss of a stateful component usually means loss of user data.

Page 46: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

State Handling Components

Does the UI layer need session state?

Business Logic, Domain Layer should be stateless

Use queues where they make sense to hold data

Design services for minimal dependenciesPay with a customer number

Keep state with the message

Don’t forget infrastructure logs, configuration files

State is in specialized stores

Page 47: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Build atomic services.

Atomic means unified, not small.

Decouple the services.

Page 48: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Stateless components allow for scalability and redundancy.

Page 49: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

What about the data tier?

Page 50: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Can you relax consistency constraints?What is acceptable data loss?

Page 51: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

What is the cost of an apology?

Page 52: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

How important is the relational model?

Page 53: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Design for Eventual Consistency

Page 54: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Consider CQRS

Page 55: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Monitor your components.

Understand why they fail.

Page 56: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Reroute traffic to existing instances or another data center or geographic area?

Page 57: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Add more instances?

Page 58: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Caching or throttling can help your application run under failure.

Page 59: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Poorer performance may be acceptable.

Page 60: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Automate…Automate….Automate

Page 61: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Degrade gracefully and predictably. Know what you can live without.

Principle #6

Page 62: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Cloud Outages Happen

Page 63: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Some Are Normal

Some Are Black Swans

Page 64: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Humans Reason About Probabilities Poorly

Page 65: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Assume the Rare Will Occur - It Will Occur

Principle #7

Page 66: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Case Study: Amazon Four Day Outage

Page 67: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Facts

April 21, 2011

One Day of Stabilization, Three Days of Recovery

Problems: EC2, EBS, Relational Database Service

Affected: Quora, Hootsite, Foursquare, Reddit

Unaffected: Netflix, Twillo

Page 68: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Why were Netflix and Twillo Unaffected?

They Designed For Failure

Page 69: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Netflix Explicitly Architected For Failure

Page 70: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Although more errors, higher latency, no increase in customer service calls or

inability to find or start movies.

Page 71: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Key Architectural Decisions

Stateless Services

Data stored across isolation zonesCould switch to hot standby

Had Excess Capacity (N + 1)Handle large spikes or transient failures

Used relational databases only where needed.Could partition data

Degraded Gracefully

Page 72: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Degraded Gracefully

Fail Fast, Aggressive Timeouts

Can degrade to lower quality serviceno personalized movie list, still can get list of available movies

Non Critical Features can be removed.

Page 73: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Chaos Monkey

Page 74: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Some Problems

Had to manually reroute traffic; use more automation in the future for failover and recovery

Round robin load balancer can overload decreased number of instances.

May have to change auto scaling algorithm and internal load balancing.

Expand to Geographic Regions

Page 75: Why Cloud Architecture is Different! Michael Stiefel  development@reliablesoftware.com Architecting For Failure

Summary

Hosting in a cloud computing environment is valid.

Cloud Computing means designing for failure.