J. Gray, Dependability in the Internet Era
(acknowledgement: slides from J. Gray, E. Brewer)
The Last 10 Years: Availability Dark Ages
Ready for a Renaissance?
• Things got better, then things got a lot worse!
[Figure: availability (9% to 99.999%) by year, 1950 to 2010, for computer systems, telephone systems, cell phones, and the Internet]
DEPENDABILITY: The 3 ITIES
• RELIABILITY / INTEGRITY: Does the right thing. (also MTTF >> 1)
• AVAILABILITY: Does it now. (also 1 >> MTTR / (MTTF + MTTR))
• System Availability: If 90% of terminals up & 99% of DB up? (=> 89% of transactions are serviced on time).
• Holistic vs. Reductionist view
[Figure: Security, Integrity, Reliability, Availability]
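The 89% figure on this slide can be checked with a small sketch. It assumes that a transaction needs both a terminal and the DB to be up, and that the two fail independently; the function name `series_availability` is illustrative and not from the slides.

```python
from math import prod


def series_availability(component_availabilities):
    """Availability of a system that needs every component up at once.

    Assumes component failures are independent (an assumption of this sketch).
    """
    return prod(component_availabilities)


# Terminals up 90% of the time, DB up 99% of the time (numbers from the slide).
system = series_availability([0.90, 0.99])
print(f"System availability: {system:.1%}")
```

The product is about 0.89, which matches the slide's "89% of transactions are serviced on time".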
Fail-Fast is Good, Repair is Needed
• Improving either MTTR or MTTF gives benefit
• Lifecycle of a module: Fault, Detect, Repair, Return
• Fail-fast gives short fault latency
• High Availability is low UN-Availability
• Unavailability ~ MTTR / MTTF
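A minimal sketch of the relationship on this slide, using the standard definition availability = MTTF / (MTTF + MTTR). The MTTF and MTTR values below are hypothetical and chosen only to illustrate that halving MTTR helps about as much as doubling MTTF.

```python
def availability(mttf_hours, mttr_hours):
    """Fraction of time the module is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def unavailability(mttf_hours, mttr_hours):
    """Exact fraction of time the module is down."""
    return mttr_hours / (mttf_hours + mttr_hours)


def approx_unavailability(mttf_hours, mttr_hours):
    """The slide's approximation, valid when MTTF >> MTTR."""
    return mttr_hours / mttf_hours


# Hypothetical baseline: fails every 10,000 hours, takes 1 hour to repair.
baseline = approx_unavailability(10_000, 1.0)
halved_mttr = approx_unavailability(10_000, 0.5)
doubled_mttf = approx_unavailability(20_000, 1.0)

print(f"baseline:     {baseline:.2e}")
print(f"halved MTTR:  {halved_mttr:.2e}")
print(f"doubled MTTF: {doubled_mttf:.2e}")
```

Under the approximation, both changes cut unavailability in half, which is the sense in which improving either quantity gives benefit.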
Disks (RAID) the BIG Success Story
• Duplex or Parity: masks faults
• Disks @ 1M hours (~100 years)
• But
  – controllers fail and
  – have 1,000s of disks.
• Duplexing or parity, and dual path gives “perfect disks”
• Wal-Mart never lost a byte (thousands of disks, hundreds of failures).
• Only software/operations mistakes are left.
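To see why a 1M-hour disk MTTF still means regular failures at scale, here is a rough sketch. It assumes a constant failure rate and independent disks; the fleet size of 1,000 is a hypothetical stand-in for the slide's "1,000s of disks".

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def expected_failures_per_year(num_disks, mttf_hours):
    """Expected disk failures per year for a fleet.

    Assumes a constant failure rate of 1/MTTF per disk-hour
    (an assumption of this sketch, not a claim from the slides).
    """
    return num_disks * HOURS_PER_YEAR / mttf_hours


# Hypothetical fleet of 1,000 disks, each rated at 1M hours MTTF.
print(expected_failures_per_year(1_000, 1_000_000))
```

Under these assumptions a 1,000-disk site sees several disk failures a year, so masking them with duplexing or parity is what makes the disks look "perfect".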
Fault Tolerance vs Disaster Tolerance
• Fault-Tolerance: mask local faults
  – RAID disks
  – Uninterruptible Power Supplies
  – Cluster Failover
• Disaster Tolerance: masks site failures
  – Protects against fire, flood, sabotage,..
  – Also, software changes, site moves,…
  – Redundant system and service at remote site.
Availability
• Un-managed
• Well-managed nodes: masks some hardware failures
• Well-managed packs & clones: masks hardware failures and operations tasks (e.g. software upgrades); masks some software failures
• Well-managed GeoPlex: masks site failures (power, network, fire, move,…); masks some operations failures
[Figure: availability levels, from un-managed systems up to a well-managed GeoPlex]
Case Studies - Tandem Trends
• MTTF improved
• Shift from Hardware & Maintenance (from 50% to 10%) to Software (62%) & Operations (15%)
• NOTE: Systematic under-reporting of Environment, Operations errors, Application Software
[Figure: outages per 1000 system years by primary cause, and % of outages by primary cause, for 1985, 1987, 1989; causes: unknown, environment, operations, maintenance, hardware, software]
Dependability Status circa 1995
• ~4-year MTTF
• 5 9s for well-managed systems. Fault Tolerance Works.
• Hardware is GREAT (maintenance and MTTF).
• Software masks most hardware faults.
• Many hidden software outages in operations:
  – New Software.
  – Utilities.
• Need to make all hardware/software changes ONLINE.
Progress?
• MTTF improved from 1950-1995
• MTTR incremental improvements 1970 --- failover
• Hardware and Software online change (pNp) is now standard
• Then the Internet arrived:
  – No project can take more than 3 months.
  – Time to market is everything
  – Change is good.
[Figure: availability over time for computer systems, telephone systems, cell phones, and the Internet]
The Internet Changed Expectations
1990
• Phones delivered 99.999%
• ATMs delivered 99.99%
• Failures were front-page news.
• Few hackers
• Outages last an “hour”
2005
• Cell phones deliver 90%
• Web sites deliver 99%
• Failures are business-page news
• Many hackers.
• Outages last a “day”
This is progress?
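The availability percentages above are easier to compare as downtime per year. The sketch below does that conversion, assuming a 365-day year; the function name `downtime_hours_per_year` is illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year at a given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Availability figures quoted on the slide.
for label, a in [
    ("Phones, 1990 (99.999%)", 0.99999),
    ("ATMs, 1990 (99.99%)", 0.9999),
    ("Web sites, 2005 (99%)", 0.99),
    ("Cell phones, 2005 (90%)", 0.90),
]:
    hours = downtime_hours_per_year(a)
    print(f"{label}: {hours:.2f} hours/year ({hours * 60:.0f} minutes)")
```

At 99.999% the budget is roughly five minutes a year, while 99% allows more than three and a half days and 90% more than a month.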
2006
Eric Brewer said it best:
ACID vs BASE: the internet litmus test
ACID
• Atomicity, Consistency, Isolation, Durability
• Availability?
• Strong consistency, Isolation
• Focus on commit
• Conservative (Pessimistic)
• Difficult evolution (e.g. schema)
• Nested transactions
BASE
• Basic Availability, Soft State, Eventual Consistency
• Availability FIRST
• Weak consistency: stale data is OK
• Approximate answers OK
• Best effort
• Aggressive (optimistic)
• Easier Evolution.
• Simpler!
• Faster
I think it is a spectrum