HFR - TAG High Availability

Ravi Narayanan ([email protected])

Cisco Systems, Inc. www.cisco.com

Cisco HFR Goal - High Availability

Goal: Non-Stop Availability, 5-9's or Greater Availability

What customers require:

Quick Recovery from defects,

High MTBF, Low MTTR/DPM,

Built in Redundancy
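These requirements are linked by the standard steady-state availability formula, A = MTBF / (MTBF + MTTR). The sketch below is illustrative only: the function names and the sample MTBF/MTTR figures are assumptions, not numbers taken from this deck.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected minutes of outage per year at a given availability."""
    return (1.0 - avail) * MINUTES_PER_YEAR


def nines(n: int) -> float:
    """Availability with n nines, e.g. nines(5) is 0.99999."""
    return 1.0 - 10.0 ** (-n)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nines: {downtime_minutes_per_year(nines(n)):.2f} min/year")
    # Hypothetical example: 100,000 h MTBF with a 1 h repair time.
    print(f"A = {availability(100_000, 1):.6f}")
```

Five nines leaves a budget of roughly 5.26 minutes of downtime per year, which is why both a high MTBF and a low MTTR appear on the requirements list.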

Cisco HFR: A Five Nines Capable Router

• Architecture

– Hardware

– Software

• Development Process

• Test Process

• Accounting, Logging & Alarms

• Conclusion

Hardware Architecture

• Apply Prior Experience

• No Single Points of Failure

• Hardware Non Stop Forwarding (NSF)

• Automated Fault Injection

• Verify Architecture with Modeling

Apply Prior Experience

• ATM Switch Products

– Large Customer Frame Relay Network

– Many Years Measuring Availability

• GSR

– Now resets at RP/LC level (HFR provides finer granularity at component level)

– Routing NSF Developments Started

No Single Points of Failure

• Redundancy

– Active Standby

* (D) RP, SC

– Loadsharing

* Fabric, Power, Cooling, Management Interconnect (out of band ethernet 1:1)

– Port Protection (Linecards/PLIMs)

• No outage on Upgrade of Fabric

• Graceful Degradation of Fabric

System Control Network

[Figure: System Control Network. Labels: Gig EtherSwitch (x2), GE, Optional 10G, LC Chassis (x2, each with LC, LC, RP, RP), Fabric Chassis (S2, S2, SC, SC), FE links.]

Graceful Degradation

[Figure: three panels labeled "8 of 8", "2 of 8", and "1 of 8", each showing S1, S2, and S3 fabric stages between OC192 line cards.]

Hardware Non Stop Forwarding

• Reset Strategy

– Entire Board

– Individual Components on a Board

– CAM (HW forwarding database) Not reset unless desired

• Forwarding Strategy

– Metro - 176 PPEs forwarding using CAM

LC NSF Strategy

[Figure: LC NSF Strategy. Labels: PLU, TLU, STATS, DISTRIB, MUX, PPE0, PPE2 ... PPE175, TCAM.]

Automated Fault Injection

• Designed into Hardware ASICs up front

• Makes testing easier and more complete

• Off the shelf parts must have mechanism for injection

• System Test and Reliability tests use automated fault insertion testing mechanisms

• Fault insertion testing at all stages

– Bring up, Design Verification, component test, system test

• Ability to test multiple failure scenarios - in hardware & software
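The idea of exercising multiple failure scenarios can be sketched at the model level. The example below is not the hardware injection mechanism described above; it is a minimal sketch that enumerates fault combinations against a hypothetical redundancy model. The component names, the group layout, and the "at least one healthy member" rule are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical redundancy groups: the system stays up while at least
# `required` members of every group are healthy.
GROUPS = {
    "rp": (["rp0", "rp1"], 1),
    "sc": (["sc0", "sc1"], 1),
    "fabric": ([f"plane{i}" for i in range(8)], 1),
}


def system_up(failed: set[str]) -> bool:
    """True if every group still has its required number of healthy members."""
    for members, required in GROUPS.values():
        healthy = [m for m in members if m not in failed]
        if len(healthy) < required:
            return False
    return True


def fatal_scenarios(max_faults: int) -> list[tuple[str, ...]]:
    """Inject every combination of up to max_faults faults and
    return the combinations that cause an outage."""
    components = [m for members, _ in GROUPS.values() for m in members]
    fatal = []
    for k in range(1, max_faults + 1):
        for combo in combinations(components, k):
            if not system_up(set(combo)):
                fatal.append(combo)
    return fatal


if __name__ == "__main__":
    print("single faults causing outage:", fatal_scenarios(1))
    print("double faults causing outage:", fatal_scenarios(2))
```

In this model no single fault is fatal, and the only fatal double faults are the loss of both members of an active/standby pair, which is the property a "no single points of failure" design is meant to have.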

Verify Architecture With Modeling

• Early modeling influenced architecture

– Memory soft error rates -> ECC

– Optical error rates -> FEC-Reed Solomon

– Board level MTBF >= 100,000 hours is a Cisco requirement

• Parts count model

– Telcordia TR-332 standards, close vendor interaction
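A parts count model treats the board as a series system, so part failure rates add. The sketch below shows only that core arithmetic, with failure rates in FITs (failures per 10^9 device-hours). The bill of materials and FIT values are hypothetical, and the quality, stress, and temperature factors that a real reliability prediction applies are omitted.

```python
HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per 10^9 device-hours


def board_mtbf_hours(parts: dict[str, tuple[int, float]]) -> float:
    """Parts-count estimate: sum (quantity * FIT) over all parts, then invert."""
    total_fit = sum(qty * fit for qty, fit in parts.values())
    return HOURS_PER_FIT_UNIT / total_fit


# Hypothetical bill of materials: name -> (quantity, FIT per part).
example_board = {
    "asic": (4, 500.0),
    "dram": (16, 100.0),
    "optics": (2, 1500.0),
    "passives": (1000, 1.0),
}

if __name__ == "__main__":
    mtbf = board_mtbf_hours(example_board)
    print(f"Estimated MTBF: {mtbf:,.0f} hours")
    print("Meets 100,000 h requirement:", mtbf >= 100_000)
```

Under this model, a 100,000 hour board MTBF corresponds to a total budget of 10,000 FITs across all parts on the board.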

Cisco HFR: A Five Nines Capable Router

• Architecture

– Hardware

– Software

• Development Process

• Test Process

• Accounting, Logging & Alarms

• Conclusion

Software Architecture

• Protected Memory Microkernel

• Separation of Control and Data Plane

• Software Non Stop Forwarding

• Scalable Distributed System

• Health Monitoring

• No Outage on upgrades - Packaging and Release Strategy

Protected Memory Microkernel

• Every Process Has a Private Address Space - contains faults

• Enables Process Restartability

• Enables Board Failover

• Enables Hitless Software Upgrade

1:1 Card Redundancy

[Figure: Card 1 and Card 2 each run Process A, Process B, and Process C. One card holds the "Active" processes (Active Logical Slot 1), the other the "Standby" processes (Standby Logical Slot 1), with Checkpointing between each process pair.]
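The active/standby checkpointing relationship can be sketched as follows. This is a minimal illustration, not the HFR implementation: the class names, the dictionary-based state, and the synchronous copy standing in for a checkpoint message are all assumptions.

```python
class Process:
    """A restartable process whose state is a plain dict."""

    def __init__(self, name):
        self.name = name
        self.state = {}


class RedundantPair:
    """Active process on one card, standby copy on the other."""

    def __init__(self, name):
        self.active = Process(name)
        self.standby = Process(name)

    def update(self, key, value):
        """Apply a change on the active side and checkpoint it to the standby."""
        self.active.state[key] = value
        self.standby.state[key] = value  # checkpoint message, modeled as a copy

    def switchover(self):
        """Promote the standby; the failed side restarts empty and
        resyncs from the new active."""
        self.active, self.standby = self.standby, Process(self.active.name)
        self.standby.state = dict(self.active.state)  # bulk resync


if __name__ == "__main__":
    pair = RedundantPair("process-a")
    pair.update("routes", 3)
    pair.switchover()
    print(pair.active.state)
```

Because every update is checkpointed before a failure occurs, the promoted standby already holds the state it needs, and the restarted process on the other card only has to resynchronize.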

Active / Standby Switchover

[Figure: numbered switchover sequence (steps 1 to 14). Labels: Card 1 and Card 2, each with Process A, Process B, Process C, System Mgr, and RedCon; Active SC; LR Daemon; QSM; Process B'.]

Separation of Control and Data Plane

• Redundancy in Control Plane

– All protocols support NSF over board fail over

• Port Protection in Data Plane

– SONET APS

– Link Bundling

Traffic Switchover - APS

[Figure: numbered APS switchover sequence (steps 1 to 6). Labels: Line Card A, Line Card B, DRP, Line Card (x3), APS Manager, APS Process (x2), FIB (x3), Switching Fabric. Legend: traffic before APS switch, traffic after APS switch.]

Traffic Switchover - Bundled link

[Figure: numbered link-failure sequence (steps 1 to 5). Labels: Line Card (x2), DRP, Bundled IF, FIB (x3), Link Monitor, Link Monitor Mgr, Switching Fabric. Legend: traffic before link failure, traffic after link failure.]

Software Non Stop Forwarding

• Architected with HW NSF

• Process Restartability

• Separation of Control and Data Planes

• Protocol Support (BGP, ISIS, OSPF, Multicast, MPLS), Support for HSRP, VRRP

BGP NSF

[Figure: BGP NSF. Labels: RP, LC, Fabric, BGP Component, BPM, bRIB, LPTS/TCP connections to peers, SysDB, gRIB, BGP Speaker (x3), GigE, BCDL, FIB, HW FWD, incremental updates to FIB.]

Non Stop Forwarding MPLS

• No impact on MPLS forwarding when one or more MPLS processes fail.

• No impact on MPLS forwarding when the active card of an active/standby pair fails.

• Hitless software upgrade.

MPLS - NSF in Action

[Figure: layered blocks labeled MPLS Control, MPLS Forwarding, IP Forwarding, System Services, IP Network Services.]

• If the control plane fails, the forwarding plane can continue to send traffic (headless forwarding).

• Minimize the time forwarding remains headless.

MPLS Architecture

[Figure: layered MPLS recovery model across a DRP and two LCs.]

• Application: MPLS-TE. Recovery: from system services and checkpointing

• Label signaling: RSVP, LDP. Recovery: from applications and neighbors

• Infra: Label manager. Recovery: from signaling layer

• MPLS Forwarding (on each LC). Recovery: from Label Manager

MPLS Fast Reroute

• Supports Node, Path, and Link Protections

• Controlled by the routers at ends of a failed link

– link protection is configured on a per link basis

• Uses nested LSPs (stack of labels)

– original LSP nested within link protection LSP
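The nesting of the original LSP inside a link protection LSP can be sketched as a label stack operation at the router upstream of the failed link. This is a simplified illustration of the general technique, not HFR code; the table layout, label values, and interface names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LspEntry:
    out_label: int       # label the next hop expects for this LSP
    out_interface: str   # protected link toward the next hop


@dataclass
class Bypass:
    label: int           # label of the link-protection (bypass) LSP
    interface: str       # interface the bypass LSP leaves on


def forward(stack, lfib, bypass, link_up):
    """Return the outgoing label stack (top label first) and interface."""
    entry = lfib[stack[0]]
    swapped = [entry.out_label] + stack[1:]  # normal swap of the top label
    if link_up:
        return swapped, entry.out_interface
    # Link failed: nest the original LSP inside the bypass LSP by
    # pushing the bypass label on top of the already-swapped stack.
    return [bypass.label] + swapped, bypass.interface


if __name__ == "__main__":
    lfib = {100: LspEntry(out_label=200, out_interface="link-primary")}
    bypass = Bypass(label=900, interface="link-backup")
    print(forward([100], lfib, bypass, link_up=True))
    print(forward([100], lfib, bypass, link_up=False))
```

Since the decision is purely local to the router that detects the failure, no signaling to the LSP head end is needed before traffic moves to the protection path.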

Scalable Distributed System

• Configuration and Operational Data Distributed Across System

– Allows system to scale, Logical Routers

– Fault containment and recovery (SysDB, IM, SC, dSC, d(LRSC) )

• Processing Distributed Across System

– Distributed RPs

– Enables faster convergence

Managing Configuration

• Designated SC (dSC) - An owner plane concept, Verifies Rack numbering among SCs

• Co-ordinates image management and versioning

• Co-ordinates LR membership information

• System Elected: Deterministic election through reboot

– Backup Elected as well

• d(LRSC) extends similar concept to a Logical Router configuration in LR plane.
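One way to make an election deterministic is to rank candidates by a fixed key, so that every node computes the same primary and backup from the same membership, before or after a reboot. The sketch below assumes a (rack number, slot number) key with lowest-wins ordering. That rule is an illustrative assumption; the deck does not state the actual dSC election criteria.

```python
from typing import Optional

Node = tuple[int, int]  # (rack number, slot number), hypothetical identifiers


def elect(candidates: list[Node]) -> tuple[Node, Optional[Node]]:
    """Pick the primary and the backup by sorting on a fixed key."""
    if not candidates:
        raise ValueError("no eligible SC")
    ranked = sorted(set(candidates))
    backup = ranked[1] if len(ranked) > 1 else None
    return ranked[0], backup


if __name__ == "__main__":
    primary, backup = elect([(1, 0), (0, 1), (0, 0)])
    print("dSC:", primary, "backup:", backup)
```

The result does not depend on the order in which candidates are discovered, which is what keeps the outcome stable across reboots.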

[Figure: racks containing LC, SC, and RP cards (one marked dSC), connected through GigE to the SCs of Fabric C.]

Managing Scaling/Distribution

[Figure: nodes LC, LC, DRP, RP, DRP, LC, LC, each with a "Local" store, above a single "Shared" store.]

Process Distribution

[Figure: Process Distribution within a Logical Router. Labels: RP, DRP (x2), Rack (x2), sysmgr (x4), LRd, LR config, sysdb, placeable applications A, B, C, .startup files of placeable applications, RP Cisco pre-config. Legend: shared, placed, standby replicated processes.]

Health Monitoring

• Online Diagnostics

– Minimizes double faults at switchover time

• Detect failures before they become critical

– Standby RP/DRP, Fabric plane

– Hot tested spare units

– Alarm cards, Logging & Alarm system (LED A/N display, minor, major, critical alarms)

No outage on Software Upgrades

• Packaging model – Allows modular upgrade (sub package / package) and software patches (SMU) to key components and packages without affecting others.

• Software Release Strategy

– Takes into account upgrade timings and impacts on system availability

– Progressive upgrade path defined, Compatibility requirements taken into consideration.

Process Restartability with NSF is the key enabler

Cisco HFR: A Five Nines Capable Router

• Architecture

– Hardware

– Software

• Development Process

• Test Process

• Accounting, Logging & Alarms

• Conclusion

Development Process

• ISO compliant

• Mandatory design/code reviews

• API versioning controlled by tools

• Strictly enforced package boundaries (Tools)

• Continual automated measurement/improvement

• HA culture throughout program

Cisco HFR: A Five Nines Capable Router

• Architecture

– Hardware

– Software

• Development Process

• Test Process

• Accounting, Logging & Alarms

• Conclusion

Software Test Process

• Test Hierarchy (Waterfall model)

– Q Integrated Sanity System (QISS)

– Component and Feature Test

– Regression Test

– System Integration Test

– Early Field Trial (EFT) and Beta

• Test Operations

– Test Automation and Formal Script Review

– Central Reporting (online system - TIMS, Dashboard)

– Test Planning and Formal Review

Software Test Tools

• IXIA – traffic generation & analyzer

• Agilent QA Robot – protocol conformance testing

• Agilent RouterTester – interface & protocol scalability

• REX – resource exhaustion

• CTF – component testing

• FIT – fault injection

• ATS – test scripting

• e-ARMS – test scheduler

• Pagent – packet generator

• RouteM – net emulation

• CFLOW – code coverage

• DDTS – defect tracking

• TIMS – test reporting

• Dashboard – test summary

(Slide column headings: 3rd Party Tools, Internal Tools)

Test Activities

• Up time/Longevity

• Boot time

• Interface Scalability

• Protocol Scalability

• Throughput

• Latency

• APS Protection

• Security Audit

• Fault Detection Time

• Fail over time

• Process restart/resync

• Online Insertion Removal (OIR)

• Hitless Software Upgrade (HSU)

• Hot Standby Router Protocol (HSRP)

• Fault Manager (FM)

• SW/HW Fault Injection

• Process Deadlock Simulation

• Process Restartability w/NSF

• SONET APS, DPT

• Reliability & Availability

• Standard Conformance

• Interop w. IOS/JunOS

(Slide column headings: Test Measurements, MTTR, Test Validation)

Cisco HFR: A Five Nines Capable Router

• Architecture

– Hardware

– Software

• Development Process

• Test Process

• Accounting, Logging & Alarms

• Conclusion

ACCOUNTING & HA

• Netflow support

– Multiple / Distributed collectors

• Persistent storage of accounting data

– Across failovers

– Checkpointed continually

LOGGING & ALARM SYSTEM

• HA Attributes

– All bistate alarms checkpointed

– Alarms are sequenced and can be retrieved at any time

• Alarm Cards

– Alarm lights lit on failure conditions

– System wide storage of data

HFR - High Availability (Bird's Eye View)

Goal: Non-Stop Availability

Result: Quick Recovery (low MTTR/DPM)

• Physical redundancy: Dual processors, Power, Fabric, Cooling, OIR

• Logical redundancy/protection: SONET APS, DPT, HSRP/VRRP, MPLS FRR, Layer 3 load balancing, link bundling

• Hitless Software/Hardware Upgrades: Upgrade software/hardware while router is in service

• Non Stop Forwarding: No line card reboot upon processor fail over; forward user data during RP fail over

• Process Restartability/upgrade and NSF

Conclusion

• Target: 99.999% availability

• Availability modeling, availability design and fault injection testing incorporated as part of the development process

• Cisco uses HA analysis and modeling to identify areas for improvement in future designs

• High availability (in some operational areas) will need close cooperation with customers and the required support process is being developed.

© 1998, Cisco Systems, Inc.

Backup Slides

Cisco’s HA Products

Cisco is certifying a variety of its products for HA compliance.

• MSSBU: (PXM1, PXM45, AXSM)

• IP: GSR, ESR 10000, DSL (Austin), Fermi, HFR

• Optical: Monterey

• Cisco’s IOS has been certified for 99.999% Availability in many service provider environments

Cisco’s efforts for achieving High Availability are both platform oriented and cross-platform oriented.

IOS HA Initiatives

• RPR: Partial initialization of IOS in standby RP

• RPR+: Improves standby readiness over RPR (recognizes line cards and does not reset them on switchover)

• Single Line Card Reload: Problems in one VIP do not require an entire router reboot

• Fast reboot: Improves reboot time by 5 minutes

• Fast upgrade: Improves upgrade time by 5 minutes by pre-loading software onto standby

• Stateful switchover: Instant switchover to standby RP (includes non-stop forwarding routing protocol changes)

• In-service upgrade: Software upgrade without user impact

HFR System

• Fabric Shelves: contain fabric cards and system controllers

• Line Card Shelves: contain route processors, line cards, and system controllers

• EMS (full system view)

• Out of band GE control bus to all shelf controllers

[Figure labels: Shelf controller (x3), 100m]

Software Test Process

• Tools for HA Testing

– REX (Resource Exhaustion Tool) and CTF (Component Test Framework) measure how HFR HA features respond to the different test conditions these tools simulate.

• Test Restartability with Faults simulation

– memory failures, thread create failures, dependent process failures, multiple related processes failures, recovery on check point process failure, restartability under high CPU usage

• Test Hitless Software Upgrade

– Test under high resource/CPU utilization conditions

• Fault Manager Testing

– Check to see FM works properly under fault conditions

• MTTR Measurements

– Measure time to repair for most process/component failures

Specific Availability Requirements

Here is what I ask a BU to do (chronological):

• Create an availability model to gain perspective

• Reduce/remove single points of failure

• Design for over 100,000 hours MTBF

• Automate measurement of DPM

• Write online diagnostics on active and standby

• Write and execute network level availability test plan

• Perform fault insertion testing

• Write and test a troubleshooting guide

(Phase labels alongside the list: Arch, Design, Test, Field)

Limit Headless Forwarding Time

• Checkpoint data that cannot be recovered otherwise

• Dedicate MPLS process resources to the recovery of LSPs that are already established. Processing of any new configured LSP tunnels is temporarily suspended.

• Processing of new LSPs resumes when recovery completes.
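The policy above, recover established LSPs first and defer new tunnel setup until recovery completes, can be sketched as a small scheduler. The class and method names are hypothetical and the "unit of work" is a stand-in for real signaling; this illustrates the ordering only.

```python
from collections import deque


class LspController:
    """Defers new LSP setup until all established LSPs are recovered."""

    def __init__(self, checkpointed_lsps):
        self.to_recover = deque(checkpointed_lsps)  # rebuilt from checkpoint data
        self.pending_new = deque()                  # suspended during recovery
        self.active = []

    @property
    def recovering(self):
        return bool(self.to_recover)

    def request_new(self, lsp):
        """Queue new tunnels while recovery or a backlog is outstanding."""
        if self.recovering or self.pending_new:
            self.pending_new.append(lsp)
        else:
            self.active.append(lsp)

    def step(self):
        """Do one unit of work: recover first, then drain suspended requests."""
        if self.to_recover:
            self.active.append(self.to_recover.popleft())
        elif self.pending_new:
            self.active.append(self.pending_new.popleft())


if __name__ == "__main__":
    ctl = LspController(["tunnel1", "tunnel2"])
    ctl.request_new("tunnel3")  # arrives mid-recovery, so it is suspended
    for _ in range(3):
        ctl.step()
    print(ctl.active)
```

Dedicating every step to recovery until the checkpointed set is empty is what bounds the time the forwarding plane spends headless.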

TIMING GOALS

• Boot from Flash / TFTP (~3 min)

• Total Single Rack Bring up time (~5min)

• OIR Recovery Time (~30 to 60 secs)

• Uptime = 14 days before ship

• BGP Aggregation Convergence ~ 60 sec

• BGP Backbone Convergence ~ 3 min

• OSPF Convergence ~ 25 secs

• IS-IS Convergence ~ 350 secs

Redundant Cards & Links

[Figure: Fabric Chassis (SC0, SC1) and Line Card Chassis (DRP/SC0, DRP/SC1) connected by SC0 GE Links and SC1 GE Links to External GE Switch 0 and External GE Switch 1, with Inter-SC FE Links.]

1:1 Card Redundancy

[Figure: Card 1 and Card 2 each run Process A, Process B, and Process C. One card holds the "Active" processes (Active Logical Slot 1), the other the "Standby" processes (Standby Logical Slot 1), with Checkpointing between each process pair.]

Active / Standby Switchover

[Figure: numbered switchover sequence (steps 1 to 14). Labels: Card 1 and Card 2, each with Process A, Process B, Process C, System Mgr, and RedCon; Active SC; LR Daemon; QSM; Process B'.]

SC/DRP Combo Switchover

[Figure: numbered switchover sequence (steps 1 to 11) between SC/DRP Combo 1 and SC/DRP Combo 2. Labels on each combo: LR Daemon, RedCon (x2), SC1, DRP1.]

Traffic Switchover - Bundled link

[Figure: numbered link-failure sequence (steps 1 to 5). Labels: Line Card (x2), DRP, Bundled IF, FIB (x3), Link Monitor, Link Monitor Mgr, Switching Fabric. Legend: traffic before link failure, traffic after link failure.]

Traffic Switchover - APS

[Figure: numbered APS switchover sequence (steps 1 to 6). Labels: Line Card A, Line Card B, DRP, Line Card (x3), APS Manager, APS Process (x2), FIB (x3), Switching Fabric. Legend: traffic before APS switch, traffic after APS switch.]

SC/RP Upgrade (Initial Config)

[Figure: Card 1 and Card 2, each running Process A, Process B, Process C, and a Checkpt. Server. Labels: "Active" Processes, "Standby" Processes, Active Logical Slot 1, Standby Logical Slot 1, Checkpointing.]

HFR HA Roadmap

Milestones: QFT-1, QFT-2, QFT-3, Beta/FOA

Target: GSR

• Demonstrate limited HSU

• NSF for ISIS, OSPF

• Multiple Verifier Support

• CheckPointing and Mirroring

• RP and DRP standby and failover

Target: GSR

• All processes restartable

• Restartability non-service-affecting to Routing and Forwarding plane apps

• RP and DRP standby

• Limited SC functionality and SC HA features

• NSF support with upgrade of config data

• Support for checkpoint data with version differences between releases

Target: HFR test hardware

• Full functionality of SC, RP, DRP, SP and Fabric SC will be demonstrated with high availability and failover features.

• Process Redundancy mechanism across DRPs demonstrated

• All apps support HSU - forwarding, multicast, security and base.

• Multiple LRs support and fault isolation between LRs

• Software downgrade to at least 1 previous level

Target: HFR platform

• All QFT1 to QFT3 goals met

• Meet product requirements in HA PRD.

• Minimum .9999 standalone availability and .99999 network availability

• FCS/Post FCS: HA support and assurance programs, HA test support framework implementation