
Evolve or Die: High-Availability Design Principles Drawn from Google's Network Infrastructure

Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley, and Amin Vahdat… and a cast of hundreds at Google

Network availability is the biggest challenge facing large content and cloud providers today


Why?


At four 9s availability
❖ Outage budget is 4 mins per month

At five 9s availability
❖ Outage budget is 24 seconds per month
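To see where these budgets come from, the arithmetic is just the unavailability fraction times the length of a month; a minimal sketch in Python (assuming a 30-day month, so the slide's figures are rounded):

```python
# Rough outage-budget arithmetic behind the figures above.
# Assumes a 30-day month; the slide's "4 mins" and "24 seconds" are rounded values.

SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds

def outage_budget_seconds(nines: int) -> float:
    """Allowed downtime per month at an availability of 1 - 10**-nines."""
    return SECONDS_PER_MONTH * (10 ** -nines)

for nines in (4, 5):
    budget = outage_budget_seconds(nines)
    print(f"{nines} nines: ~{budget / 60:.1f} min (~{budget:.0f} s) per month")
# 4 nines: ~4.3 min (~259 s) per month
# 5 nines: ~0.4 min (~26 s) per month
```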

The push towards higher 9s of availability


How do providers achieve these levels? By learning from failures

What design principles can achieve high availability?

What has Google Learnt from Failures?

Why is high network availability a challenge?

What are the characteristics of network availability failures?


Paper’s Focus

Why is high network availability a challenge?

Velocity of Evolution
Scale
Management Complexity


Evolution

[Figure: capacity vs. time for successive data center fabric generations: 4 Post, Firehose 1.0, Firehose 1.1, Watchtower, Saturn, Jupiter]

Network hardware evolves continuously

Evolution

[Figure: timeline from 2006 to 2014 of systems rolled out: Google Global Cache, Watchtower, Freedome, B4, Andromeda, QUIC, BwE, Jupiter, gRPC]

So does network software

Evolution

New hardware and software can
❖ Introduce bugs
❖ Disrupt existing software

Result: Failures!

Scale and Complexity

[Figure: B2, B4, data centers, and other ISPs, and how they interconnect]

Scale and Complexity

Design Differences

B4 and Data Centers
❖ Use merchant silicon chips
❖ Centralized control planes

B2
❖ Vendor gear
❖ Decentralized control plane

Scale and Complexity


Design Differences

These differences increase management complexity and pose availability challenges

The Management Plane

Management Plane Software

Manages network evolution

Management Plane Operations

Connect a new data center to B2 and B4

Upgrade B4 or data center control plane software

Drain or undrain links, switches, routers, services
❖ Drain: temporarily remove from service

Many operations require multiple steps and can take hours or days

The Management Plane

Low-level abstractions for management operations
❖ Command-line interfaces to high-capacity routers

A small mistake by an operator can impact a large part of the network

Why is high network availability a challenge?

What are the characteristics of network availability failures?

Duration, Severity, Prevalence
Root-cause Categorization

Key Takeaway


Content provider networks evolve rapidly

The way we manage evolution can impact availability

We must make it easy and safe to evolve the network daily

We analyzed over 100 post-mortem reports written over a 2-year period

What is a Post-mortem?

Carefully curated description of a previously unseen failure that had significant availability impact

Helps learn from failures


Blame-free process

What a Post-Mortem Contains

Description of failure, with detailed timeline

Root-cause(s) confirmed by reproducing the failure

Discussion of fixes and follow-up action items

Failure Examples and Impact

❖ Entire control plane fails ▸ Data center goes offline
❖ Upgrade causes backbone traffic shift ▸ WAN capacity falls below demand
❖ Multiple top-of-rack switches fail ▸ Several services fail concurrently

Key Quantitative Results

70% of failures occur when a management plane operation is in progress ▸ Evolution impacts availability

Failures are everywhere: all three networks and all three planes see comparable failure rates ▸ No silver bullet

80% of failure durations are between 10 and 100 minutes ▸ Need fast recovery

Root causes


Lessons learned from root causes motivate availability design principles

Why is high network availability a challenge?

What are the characteristics of network availability failures?

What design principles can achieve high availability?

Re-think the Management Plane
Avoid and Mitigate Large Failures
Evolve or Die


Re-think the Management Plane

Availability Principle


Operator types the wrong CLI command or runs the wrong script

Backbone router fails

Minimize Operator Intervention

Availability Principle


To upgrade part of a large device…
❖ Line card, block of Clos fabric

… proceed while the rest of the device carries traffic
❖ Enables higher availability

Necessary for upgrade-in-place
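A minimal sketch of the upgrade-in-place pattern described above, with `drain`, `upgrade`, `verify`, and `undrain` as hypothetical stand-ins for real management-plane calls (not Google's actual tooling):

```python
# Sketch of upgrade-in-place: upgrade one block of a Clos fabric (or one line
# card) at a time while the remaining blocks keep carrying traffic.
# drain/upgrade/verify/undrain are hypothetical stand-ins for management-plane calls.

def upgrade_in_place(blocks, drain, upgrade, verify, undrain):
    for block in blocks:
        drain(block)      # temporarily remove this block from service
        upgrade(block)    # the rest of the device keeps carrying traffic
        if not verify(block):
            raise RuntimeError(f"post-upgrade verification failed on {block}")
        undrain(block)    # return the block to service before touching the next one
```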

Availability Principle


Ensure residual capacity > demand

Early risk assessments were manual
❖ Risky: could lead to high packet loss

Assess risk continuously
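In code, the core risk check is just whether residual capacity still covers demand with some headroom; a hedged sketch with invented link capacities and demand (the real assessment also has to track interconnect differences and dynamically changing demand):

```python
# Sketch of an automated risk check before a drain: is residual capacity still
# above forecast demand, with some headroom?  Link capacities and demand here
# are invented numbers, not real topology data.

def residual_capacity(link_capacity_gbps: dict, drained: set) -> float:
    """Capacity that remains once the drained links are out of service."""
    return sum(cap for link, cap in link_capacity_gbps.items() if link not in drained)

def drain_is_safe(link_capacity_gbps: dict, drained: set,
                  demand_gbps: float, headroom: float = 1.2) -> bool:
    """Allow the operation only if residual capacity covers demand plus headroom."""
    return residual_capacity(link_capacity_gbps, drained) >= demand_gbps * headroom

links = {"dc1-b4": 400.0, "dc1-b2": 200.0, "dc1-b2-backup": 200.0}
print(drain_is_safe(links, drained={"dc1-b4"}, demand_gbps=350.0))
# False: only 400 Gb/s would remain against ~420 Gb/s of headroomed demand
```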

Re-think the Management Plane

[Diagram: the operator's "Intent" ("I want to upgrade this router") goes to the Management Plane Software, which generates Management Operations, Device Configurations, and Tests to Verify Operation]

Re-think the Management Plane

[Diagram: the Management Plane Run-time takes the generated Management Operations, Device Configurations, and Tests to Verify Operation and executes them: Apply Configuration, Perform management operation, Verify operation; Assess Risk Continuously; Minimize Operator Intervention]
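Read as code, the run-time above is a loop over derived operations with risk checks and verification at every step. This is only a sketch under that reading; `plan`, `assess_risk`, `apply`, `perform`, and `verify` are hypothetical stand-ins, not the actual Google system:

```python
# Sketch of the run-time loop in the diagram above: from an operator's intent,
# derive operations, configurations, and tests, then execute with continuous
# risk assessment and minimal operator intervention.

from dataclasses import dataclass

@dataclass
class Plan:
    operations: list       # management operations to perform
    configurations: list   # device configurations to push
    tests: list            # tests that verify each operation

def execute_intent(intent: str, plan, assess_risk, apply, perform, verify) -> None:
    p: Plan = plan(intent)
    for op, config, test in zip(p.operations, p.configurations, p.tests):
        if not assess_risk(op):          # assess risk continuously
            raise RuntimeError(f"risk too high for {op!r}; aborting")
        apply(config)                    # apply configuration
        perform(op)                      # perform the management operation
        if not verify(test):             # verify the operation before moving on
            raise RuntimeError(f"verification failed for {op!r}")
```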

Automated Risk Assessment

Avoid and Mitigate Large Failures

Availability Principle


B4 and data centers have a dedicated control-plane network
❖ Failure of this network can bring down the entire control plane

Fail open
Contain failure radius

Fail Open

Centralized Control Plane

Preserve forwarding state of all switches
❖ Fail-open the entire data center

[Diagram: traffic continues to flow through the data center]

Exceedingly tricky!
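A rough sketch of the fail-open behavior: on losing its connection to the centralized control plane, a switch keeps its last programmed forwarding state instead of flushing it. The controller interface here is invented for illustration:

```python
# Sketch of fail-open: when the centralized control plane (or its dedicated
# network) is unreachable, keep the last programmed forwarding state so
# traffic keeps flowing.

import time

class FailOpenSwitch:
    def __init__(self, controller):
        self.controller = controller          # hypothetical controller client
        self.forwarding_state = {}            # last state the controller programmed
        self.last_successful_sync = None

    def sync(self):
        """Refresh forwarding state; on failure, fail open with the cached state."""
        try:
            self.forwarding_state = self.controller.fetch_forwarding_state()
            self.last_successful_sync = time.monotonic()
        except ConnectionError:
            # Keep the possibly stale state rather than flushing it.  Deciding
            # how stale is too stale is part of why this is "exceedingly tricky".
            pass
```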

Availability Principle


A bug can cause state inconsistency between control plane components ➔ Capacity reduction in WAN or data center

Design fallback strategies

Design Fallback Strategies


A large section of the WAN (B4) fails, so demand exceeds capacity

Design Fallback Strategies

Fallback to B2!
❖ Can shift large traffic volumes from many data centers

Design Fallback Strategies

When centralized traffic engineering fails...
❖ … fall back to IP routing

Big Red Buttons
❖ For every new software upgrade, design controls so the operator can initiate fallback to a "safe" version
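A hedged sketch of the big-red-button idea: the fallback to the safe code path is designed in before the upgrade ships, and the operator can trigger it with one action. All names are illustrative:

```python
# Sketch of a "big red button": every rollout carries an operator-controlled
# switch back to known-safe behavior.  The flag and both code paths are invented.

class BigRedButton:
    def __init__(self):
        self._pressed = False

    def press(self) -> None:       # operator-initiated fallback
        self._pressed = True

    @property
    def pressed(self) -> bool:
        return self._pressed

def compute_routes(topology, button, centralized_te, plain_ip_routing):
    """Use the new centralized TE path unless the big red button has been pressed."""
    if button.pressed:
        return plain_ip_routing(topology)   # fall back to the "safe" version
    return centralized_te(topology)
```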


Evolve or Die!


We cannot treat a change to the network as an exceptional event

Evolve or Die

Make change the common case

Make it easy and safe to evolve the network daily

❖ Forces management automation
❖ Permits small, verifiable changes


Conclusion


Content provider networks evolve rapidly

The way we manage evolution can impact availability

We must make it easy and safe to evolve the network daily

Evolve or Die: High-Availability Design Principles Drawn from Google's Network Infrastructure

Presentation template from SlidesCarnival


Older Slides

Popular root-cause categories

Cabling error, interface card failure, cable cut….

Popular root-cause categories

Operator types the wrong CLI command or runs the wrong script

Popular root-cause categories

Incorrect demand or capacity estimation for upgrade-in-place

Upgrade in place


Assessing Risk Correctly

Residual Capacity? ▸ Varies by interconnect
Demand? ▸ Can change dynamically

Popular root-cause categories

Hardware or link layer failures in control plane network

Popular root-cause categories

Two control plane components have inconsistent views of control plane state, caused by a bug

Popular root-cause categories

Running out of memory, CPU, OS resources (threads)...

Lessons from Failures

The role of evolution in failures ▸ Rethink the Management Plane
The prevalence of large, severe failures ▸ Prevent and mitigate large failures
Long failure durations ▸ Recover fast


High-level Management Plane Abstractions

"Intent": I want to upgrade this router

Why is this difficult? Modern high-capacity routers:
❖ Carry Tb/s of traffic
❖ Have hundreds of interfaces
❖ Interface with associated optical equipment
❖ Run a variety of control plane protocols: MPLS, IS-IS, BGP, all of which have network-wide impact
❖ Have high-capacity fabrics with complicated dynamics
❖ Have configuration files that run into hundreds of thousands of lines

High-level Management Plane Abstractions

[Diagram: the operator's "Intent" ("I want to upgrade this router") goes to the Management Plane Software, which generates Management Operations, Device Configurations, and Tests to Verify Operation]

Management Plane Automation

[Diagram: the Management Plane Software's outputs (Management Operations, Device Configurations, Tests to Verify Operation) are executed automatically: Apply Configuration, Perform management operation, Verify operation; Assess Risk Continuously; Minimize Operator Intervention]

Large Control Plane Failures

Centralized Control Plane

Contain the blast radius

Centralized Control Plane

Centralized Control Plane

Smaller failure impact, but increased complexity

Fail-Open

Centralized Control Plane

Preserve forwarding state of all switches
❖ Fail-open the entire fabric

Defensive Control-Plane Design

[Diagram: B4 control plane components: Gateway, Topology Modeler, TE Server, BwE]

Trust but Verify

[Diagram: B4 control plane components (Gateway, Topology Modeler, TE Server, BwE); a component receiving a large update thinks "One piece of this large update seems wrong!!" and responds "Let me check the correctness of the update..."]
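The verification step can be as simple as bounding how much a single update is allowed to change before it is accepted; a hypothetical sanity check (the threshold and data shape are invented for illustration):

```python
# Sketch of a "trust but verify" check: bound how much a single update from a
# peer component may change before it is accepted.  The 10% threshold and the
# dict-of-links shape are illustrative assumptions.

def update_looks_sane(current_links: dict, proposed_links: dict,
                      max_change_fraction: float = 0.10) -> bool:
    """Hold suspiciously large updates for inspection instead of applying them."""
    if not current_links:
        return True                              # nothing to compare against yet
    changed = set(current_links) ^ set(proposed_links)
    return len(changed) / len(current_links) <= max_change_fraction

current = {f"link{i}": 100 for i in range(10)}
proposed = {f"link{i}": 100 for i in range(5)}   # update silently drops half the links
print(update_looks_sane(current, proposed))      # False: flag it, don't apply it
```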

Fallback to B2

[Diagram: B4 control plane components (Gateway, Topology Modeler, TE Server, BwE); when verification fails, traffic falls back to B2]

Mitigating Large Failures

Design Fallback Strategies
▸ B4 → B2
▸ Tunneling → IP routing
▸ Big Red Buttons

Continuously Monitor Invariants

❖ SDN controller must have one functional backup
❖ Anycast route must have AS path length of 3
❖ Data center must peer with two B2 routers
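These invariants lend themselves to continuous, automated checking; a minimal sketch with hypothetical probe functions:

```python
# Sketch of continuously monitored invariants, using the three examples above.
# get_as_path_length / peer_count / has_functional_backup are hypothetical probes.

def check_invariants(get_as_path_length, peer_count, has_functional_backup) -> list:
    """Return the list of violated invariants (empty means all hold)."""
    violations = []
    if get_as_path_length("anycast-prefix") != 3:
        violations.append("anycast route must have AS path length of 3")
    if peer_count("data-center", "B2") < 2:
        violations.append("data center must peer with two B2 routers")
    if not has_functional_backup("sdn-controller"):
        violations.append("SDN controller must have one functional backup")
    return violations
```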

This Alone isn’t Enough...


We cannot treat a change to the network as an exceptional event

Evolve or Die

Make change the common case

Make it easy and safe to evolve the network daily

❖ Forces management automation
❖ Permits small, verifiable changes


Key Takeaway


Content provider networks evolve rapidly

The way we manage evolution can impact availability

We must make it easy and safe to evolve the network daily

Evolve or Die: High-Availability Design Principles Drawn from Google's Network Infrastructure

Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley, and Amin Vahdat… and a cast of hundreds at Google

Presentation template from SlidesCarnival

Impact of Availability Failures

What design principles can achieve high availability?

A Case Study: Google

Why is high network availability a challenge?

What are the characteristics of network availability failures?


The velocity of evolution is fueled by traffic growth...

… and by an increase in product and service offerings

Networks have very different designs

Different hardware
Different control planes
Different forwarding paradigms

These differences increase management and evolution complexity

Data centers (SIGCOMM 2015)
❖ Fabrics with merchant silicon chips
❖ Centralized control plane
❖ Out-of-band control plane network

[Diagram: data center fabric with its out-of-band control plane network]

B4 (SIGCOMM 2013; BwE, SIGCOMM 2015)
❖ B4 routers built using merchant silicon chips
❖ Centralized control plane within each B4 site
❖ Centralized traffic engineering
❖ Bandwidth enforcement for traffic metering

[Diagram: B4 control plane components: Gateway, Topology Modeler, TE Server, BwE]

B2
❖ B2 routers based on vendor gear
❖ Decentralized routing and MPLS TE
❖ Class of service (high/low) using MPLS priorities

[Diagram: B2 peering with other ISPs]

The Management Plane

Low-level, per-device abstractions for management operations

Where do failures happen?

No network or plane dominates

How long do the failures last?

Durations much longer than outage budgets
Shorter failures on B2

What role does evolution play?

70% of failures happen when a management operation is in progress

Where do failures happen?

[Figure: number of failures broken down by network (B2, B4, data centers) and plane, including the control plane network]

Failures are everywhere


Across networks

[Figure: each root-cause category occurs in all three networks]

Across planes

[Figure: root-cause categories span the data, control, and management planes]

Root-Cause Categorization

What are the root causes for these failures?

Rethink the Management Plane

Low-level network management cannot ensure high availability

Re-think the Management Plane

"Intent": I want to upgrade this router

Lots of complexity hidden below this statement
❖ Carry Tb/s of traffic
❖ Have hundreds of interfaces
❖ Interface with associated optical equipment
❖ Run a variety of control plane protocols: MPLS, IS-IS, BGP, all of which have network-wide impact
❖ Have high-capacity fabrics with complicated dynamics
❖ Have configuration files that run into hundreds of thousands of lines

Contain failure radius

Centralized Control Plane

Centralized Control Plane

Each partition managed by a different control plane

Adds design complexity

Even if one partition fails, others can carry traffic
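A toy sketch of the partitioning idea: each partition has its own controller, so losing one controller affects only that partition's switches. The assignment scheme is invented for illustration:

```python
# Sketch of containing the failure radius by partitioning the fabric: each
# partition gets its own controller, so a controller failure takes out only
# that partition's share of capacity.

def assign_partitions(switches: list, num_partitions: int) -> dict:
    """Map each switch to a partition (and hence to its own controller)."""
    return {sw: i % num_partitions for i, sw in enumerate(switches)}

def surviving_switches(assignment: dict, failed_partition: int) -> list:
    """Switches that keep carrying traffic when one partition's controller fails."""
    return [sw for sw, part in assignment.items() if part != failed_partition]

switches = [f"sw{i}" for i in range(8)]
assignment = assign_partitions(switches, num_partitions=4)
print(surviving_switches(assignment, failed_partition=0))  # 6 of 8 switches unaffected
```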

Key Takeaway


Content provider networks evolve rapidly

The way we manage evolution can impact availability

We must make it easy and safe to evolve the network daily

By learning from failures


What design principles can achieve high availability? ▸ Lessons learned from root-causes

What has Google Learnt from Failures?

Why is high network availability a challenge? ▸ Factors that impact availability

What are the characteristics of network failures? ▸ Severity, duration, prevalence ▸ Root-cause categorization

[Figure: data centers in a global network]

In a global network
❖ Failures are common
❖ Configuration can change

These can impact network availability

How long does it take...

… to root-cause a failure ▸ 10s of minutes to hours
… to upgrade part of the network ▸ Hours to days

[Figure: data centers in a global network]

Outage budgets...

… for four 9s availability (99.99% uptime)? ▸ 4 minutes per month
… for five 9s availability (99.999% uptime)? ▸ 24 seconds per month

To move towards higher availability targets, it is important to learn from failures