J. Gray, Dependability in the Internet Era (acknowledgement: slides from J.Gray, E.Brewer)


The Last 10 Years: Availability Dark Ages

Ready for a Renaissance?

• Things got better, then things got a lot worse!

[Figure: Availability (y-axis, ticks 9%, 99%, 99.9%, 99.99%, 99.999%) versus year (x-axis, 1950 to 2010), with curves for Computer Systems, Telephone Systems, Cellphones, and Internet.]

DEPENDABILITY: The 3 ITIES

• RELIABILITY / INTEGRITY: Does the right thing. (also MTTF >> 1)

• AVAILABILITY: Does it now. (also 1 >> MTTR) Availability = MTTF / (MTTF + MTTR)

• System Availability: If 90% of terminals up & 99% of DB up? (=> 89% of transactions are serviced on time.)

• Holistic vs. Reductionist view

[Diagram labels: Security, Integrity, Reliability, Availability]
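The 89% figure on this slide can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and it assumes that terminals and the database fail independently and that a transaction needs both to be up.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one module: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def series_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up.

    Assumes the components fail independently, so the individual
    availabilities simply multiply.
    """
    result = 1.0
    for component in components:
        result *= component
    return result


# Figures from the slide: 90% of terminals up, 99% of the database up.
terminals_up = 0.90
database_up = 0.99
system_up = series_availability(terminals_up, database_up)
print(f"System availability: {system_up:.1%}")  # 89.1%
```

The product is always lower than the weakest component, which is one reading of the holistic versus reductionist point: each part can look acceptable while the end-to-end service does not.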

Fail-Fast is Good, Repair is Needed

Lifecycle of a module: Fault, Detect, Repair, Return

• Fail-fast gives short fault latency.

• High Availability is low UN-Availability: Unavailability ~ MTTR / MTTF

• Improving either MTTR or MTTF gives benefit.
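A small numeric sketch of the approximation Unavailability ~ MTTR / MTTF. The MTTF and MTTR values are made-up illustrations, not figures from the talk, and the function names are hypothetical.

```python
def exact_unavailability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a module is down: MTTR / (MTTF + MTTR)."""
    return mttr_hours / (mttf_hours + mttr_hours)


def approx_unavailability(mttf_hours: float, mttr_hours: float) -> float:
    """The slide's approximation, valid when MTTF is much larger than MTTR."""
    return mttr_hours / mttf_hours


# Hypothetical baseline: fails every 10,000 hours, takes 10 hours to repair.
baseline = approx_unavailability(10_000, 10)
faster_repair = approx_unavailability(10_000, 5)   # halve MTTR
longer_life = approx_unavailability(20_000, 10)    # double MTTF

print(f"baseline:      {baseline:.6f}")
print(f"faster repair: {faster_repair:.6f}")
print(f"longer life:   {longer_life:.6f}")
print(f"exact value for the baseline: {exact_unavailability(10_000, 10):.6f}")
```

Under this approximation, halving MTTR and doubling MTTF reduce unavailability by the same factor, which is why improving either one helps.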

Disks (RAID) the BIG Success Story

• Duplex or Parity: masks faults

• Disks @ 1M hours (~100 years)

• But
  – controllers fail and
  – have 1,000s of disks.

• Duplexing or parity, and dual path gives “perfect disks”

• Wal-Mart never lost a byte (thousands of disks, hundreds of failures).

• Only software/operations mistakes are left.
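A rough calculation shows why a 1M-hour disk MTTF still means regular failures once a site has thousands of disks. This is a sketch that assumes a constant failure rate and independent disks; the disk counts are illustrative, not taken from the Wal-Mart example.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def expected_failures_per_year(disk_count: int, mttf_hours: float) -> float:
    """Expected number of disk failures per year across a population.

    Assumes a constant failure rate of 1 / MTTF per disk-hour and
    independent failures.
    """
    return disk_count * HOURS_PER_YEAR / mttf_hours


DISK_MTTF_HOURS = 1_000_000  # the slide's "1M hours" figure

for disks in (1, 1_000, 10_000):
    failures = expected_failures_per_year(disks, DISK_MTTF_HOURS)
    print(f"{disks:>6} disks: about {failures:.2f} failures per year")
```

With 1,000 disks this comes to roughly nine failures a year, so duplexing or parity has to mask failures routinely rather than occasionally.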

Fault Tolerance vs Disaster Tolerance

• Fault-Tolerance: masks local faults
  – RAID disks
  – Uninterruptible Power Supplies
  – Cluster Failover

• Disaster Tolerance: masks site failures
  – Protects against fire, flood, sabotage, ...
  – Also, software changes, site moves, ...
  – Redundant system and service at remote site.

[Figure: availability ladder, with the availability axis rising from un-managed systems to a well-managed GeoPlex]

• Un-managed

• Well-managed nodes: masks some hardware failures

• Well-managed packs & clones: masks hardware failures and operations tasks (e.g. software upgrades); masks some software failures

• Well-managed GeoPlex: masks site failures (power, network, fire, move, ...); masks some operations failures

Case Studies - Tandem Trends

• MTTF improved

• Shift from Hardware & Maintenance (from 50% to 10%) to Software (62%) & Operations (15%)

• NOTE: Systematic under-reporting of
  – Environment
  – Operations errors
  – Application Software

[Figures: "Outages / 1000 System Years by Primary Cause" and "% of Outages by Primary Cause", each for 1985, 1987, and 1989; categories: unknown, environment, operations, maintenance, hardware, software.]

Dependability Status circa 1995

• ~4-year MTTF

• 5 9s for well-managed systems. Fault Tolerance Works.

• Hardware is GREAT (maintenance and MTTF).

• Software masks most hardware faults.

• Many hidden software outages in operations:
  – New Software.
  – Utilities.

• Need to make all hardware/software changes ONLINE.

Progress?

• MTTF improved from 1950-1995

• MTTR incremental improvements 1970 --- failover

• Hardware and Software online change (pNp) is now standard

• Then the Internet arrived:
  – No project can take more than 3 months.
  – Time to market is everything.
  – Change is good.

[Figure: availability trend chart, with curves for Computer Systems, Telephone Systems, Cellphones, and Internet.]

The Internet Changed Expectations

1990

• Phones delivered 99.999%

• ATMs delivered 99.99%

• Failures were front-page news.

• Few hackers.

• Outages last an “hour”

2005

• Cell phones deliver 90%

• Web sites deliver 99%

• Failures are business-page news.

• Many hackers.

• Outages last a “day”

This is progress?
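To make the percentages on this slide concrete, the sketch below converts an availability figure into downtime per year. It assumes a 365-day year and treats each percentage as a long-run average; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year at a given availability (0 to 1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Availability figures quoted on the slide.
services = [
    ("Phones, 1990", 0.99999),
    ("ATMs, 1990", 0.9999),
    ("Web sites, 2005", 0.99),
    ("Cell phones, 2005", 0.90),
]

for name, availability in services:
    minutes = downtime_minutes_per_year(availability)
    print(f"{name:<18} {availability:.3%} -> {minutes / 60:9.2f} hours down per year")
```

On these assumptions, 99.999% allows about five minutes of downtime a year, while 90% allows about 36.5 days.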

2006

Eric Brewer said it best:

ACID vs BASE: the internet litmus test

• ACID: Atomicity, Consistency, Isolation, Durability
  – Availability?
  – Strong consistency
  – Isolation
  – Focus on commit
  – Conservative (pessimistic)
  – Difficult evolution (e.g. schema)
  – Nested transactions

• BASE: Basic Availability, Soft State, Eventual Consistency
  – Availability FIRST
  – Weak consistency: stale data is OK
  – Approximate answers OK
  – Best effort
  – Aggressive (optimistic)
  – Easier evolution
  – Simpler!
  – Faster

I think it is a spectrum.
