CAUSES OF FAILURE IN WEB APPLICATIONS
Feroz Zahid, Simula Research Laboratory & UiO
Report Details
Authors: Soila Pertet and Priya Narasimhan
Published by: Parallel Data Laboratory, Carnegie Mellon University
Year of Publication: 2005
Type of Contribution: New Analysis
Purpose: The report investigates the causes and prevalence of failures in web applications, based on case studies and actual website outage data collected from different sources.
Report Overview
Summary of Findings
• Failure Types
  • Software failures and human errors account for 80% of total failures
• Causes of Software Failures
  • Maintenance, upgrades
  • System overload, resource exhaustion and complex fault-recovery mechanisms
• Downtime
  • Ranges from a few minutes to weeks
  • Fault chains increase downtime
  • Planned downtime is about 80% of total downtime
  • Planned downtime may also cause unplanned downtime
Findings: What are the causes of failures?
• Software Failures
• Human/Operator Errors
• Hardware/Environmental Failures
• Security Violations/Breaches
Software Failures
Software Error Type | Examples
Resource Exhaustion | Memory leaks, buffer overflows
Logical Errors | Corrupt pointers, race conditions, deadlocks
System Overload | Flash crowd, Slashdot effect
Recovery Code | Complex fault recovery, backup restores
Failed Software Update | Upgrade dependencies, configuration errors
Software Failure – Example Incidents
• System Software
  • PlanetLab – Bug in updated kernel module – Detected by user reports
  • America Online – Server upgrade – Intermittent outages – Several weeks
  • Symantec – Mar 2005 – Patch for DNS cache poisoning with redirection vulnerability – Attackers redirected traffic to malware websites
  • Zopewiki.org – Jul 2004 – Memory leaks – Workaround was to reboot the web server daily – Detected by performance slowdown
Software Failure – Example Incidents
• Application Failures
  • Resource Exhaustion – PlanetLab – Nodes hang due to an application bug that exhausted file descriptors
  • Logical Error – AOL – Dec 2004 – Deactivated a number of AIM accounts during a regular maintenance cycle – Several days of downtime for the users
  • Logical Error – Pricing error on Amazon’s UK site lists iPaq Pocket PC under $12 (regular price: $449) – 2.5 hours affected – Detected by abnormally high sales volumes
  • Site Overload – Comair airlines – Cancels over 1000 flights when a surge in crew flight re-assignments knocked down its flight reservation system
  • Integration – HP and Compaq implementations of SAP software – Loss of $400 million in revenue – 6 weeks (3 weeks planned)
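The PlanetLab file-descriptor incident above is the kind of resource exhaustion that can be watched for before nodes hang. The following is a minimal sketch, not anything taken from the report: the function names `fd_usage` and `near_exhaustion` and the 80% threshold are assumptions made for illustration, and the directory-based count works only on Unix-like systems.

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit) for the current process."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    # /proc/self/fd is Linux-specific; /dev/fd is the usual macOS/BSD equivalent.
    fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    # The count includes the descriptor opened to list the directory itself.
    open_fds = len(os.listdir(fd_dir))
    return open_fds, soft_limit


def near_exhaustion(open_fds, soft_limit, threshold=0.8):
    """True when descriptor usage reaches `threshold` of the soft limit.

    The 0.8 default is an arbitrary illustrative value.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return False  # no finite limit to run into
    return open_fds >= threshold * soft_limit


if __name__ == "__main__":
    used, limit = fd_usage()
    print(f"{used} of {limit} file descriptors in use")
    if near_exhaustion(used, limit):
        print("warning: close to descriptor exhaustion")
```

A check like this would typically run periodically inside the application or a monitoring agent, so that a descriptor leak shows up as a warning rather than as a hung node.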
Software Failure – Example Incidents
• Databases
  • Basecamphq.com – Feb 2005 – DB flagged table as read-only – 30 minutes downtime
  • Walmart.com – Apr 2001 – Database glitches – 9 hours downtime
  • Sony – June 2003 – Star Wars Galaxies game – Overwhelming traffic – Intermittent database errors for one day
  • RECENTLY – London airport – Dec 2014 – Inconsistency – NATS – Transition between the two states caused a failure in the system – NOT in PAPER
Human/Operator Errors
Human Error Type | Examples
Configuration Errors | Sysadmin mistakes
Procedural Errors | Failure to back up, typos
Miscellaneous Accidents | Accidentally disconnecting the power supply
Human Errors – Example Incidents
• Configuration Error – Microsoft – Incorrect configuration change in edge routers caused downtime of Microsoft websites ranging from several hours to 1 day
• Configuration Error – MSN – Mistakenly marked messages from Earthlink and RoadRunner as spam – Operator error
• Procedural Error – Gforge3 – Failure to restart database daemon after applying database patch – Several hours of downtime
• Miscellaneous – eBay – Electrician accidentally knocked out a plug – Battery ran out 30 minutes later – System outage
Hardware/Environment Failures
Failure Type | Examples
Hardware Failures | Crashed hard disks, burnt circuits
Environmental Failures | Power outages, overheating
Hardware Failures – Example Incidents
• Equipment Failure – Wall Street Journal website – Mar 2004 – Hardware failure – 1 hour downtime
• Equipment Failure – Yahoo Groups – Mar 2002 – Hardware problems – Several hours downtime
• Power Outage – eBay – Power outage in web hosting facility – 3 hours downtime
• Hardware Upgrades – iWon – Installation of new hardware worth $2 million – Several days of intermittent failures
Network Failures – Example Incidents
• PlanetLab – Experiment overloads university’s internet connection – Detected by bandwidth spikes
• Bank of America – Network connection slowed banking service – Several days of intermittent outage
• Sprint – ISP passes bad routing information – 2 hours of downtime
Security Violations
• Unauthorized accesses
• Password disclosures
• Denial of Service attacks (DoS / DDoS)
• Software vulnerabilities
• Viruses, worms
Security Violations – Example Incidents
• Microsoft – Aug 2003 – DoS attack causes website downtime of 1 hour
• Akamai – Jun 2004 – DoS attack on DNS servers caused 2 hours of downtime for Google, Yahoo, Apple and Microsoft
• Google – Jul 2004 – MyDoom worm causes partial outage for several hours
• Verizon – May 2004 – Theft of network cards caused customers to lose their internet access for one day
• Many recent events – Sony Pictures – Google
Manifestation of Errors
Type | Examples
Partial or Entire Website Unavailable | File not found, web server crashed
System Exceptions / Access Violations | Runtime exceptions
Incorrect Results | Wrong page served, invalid cache used
Data Loss or Corruption | Disk block failures
Performance Slowdowns | Network congestion, system overload
Fault Chains
• Series of component failures
• Uncoupled Fault Chains
  • Independent failures occur one after another
• Tightly-coupled Fault Chains
  • Correlated failures
  • For example, a power outage caused air-conditioning to fail
  • Software dependencies
• 60% of the failures have a fault chain of length two
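A figure such as the share of failures with a fault chain of length two can be computed once each incident is recorded as an ordered list of faults. The sketch below shows one way to do that tally. The function name `chain_length_shares` and the sample incidents are hypothetical and are not data from the report.

```python
from collections import Counter


def chain_length_shares(incidents):
    """Map each fault-chain length to the fraction of incidents with that length."""
    counts = Counter(len(chain) for chain in incidents)
    total = sum(counts.values())
    return {length: count / total for length, count in sorted(counts.items())}


# Hypothetical incidents for illustration only; each list is one fault chain,
# ordered from the first fault to the last.
incidents = [
    ["power outage", "air-conditioning failure"],
    ["software upgrade", "configuration error"],
    ["disk crash"],
    ["traffic surge", "database errors"],
    ["operator typo", "router misconfiguration", "site outage"],
]

shares = chain_length_shares(incidents)
print(shares)
```

Recording the chain as an ordered list also keeps the information needed to tell tightly-coupled chains (each fault triggers the next) from uncoupled ones, although that distinction still has to be judged per incident.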
Prevalence of Failures
• 89% of customers have experienced issues when completing transactions
• 72.5% of sites experienced failures in the holiday season
[Chart: Causes of Site Failures – Application Failures, Human Error, Others]
[Chart: Downtime – Planned Downtime, Unplanned Downtime]
Critique
• Restricted to web applications
• Large websites like AOL, Microsoft, Walmart etc.
Evaluation – Comprehensive?
• With 40 real-world test cases
• Not connected with the causes of software failure in general
• Small subset – evaluation could be biased
Application Domain
Critique
• Causes of failure are related to the web application type
• For example, a news website is more likely to fail from a flash crowd than an online CMS
Types of Failure – Taxonomy
• The four-category fault taxonomy is quite primitive
• What type is a device driver failure? – Software or hardware?
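One way around the "software or hardware?" ambiguity is to let an incident carry more than one category. The sketch below illustrates this with Python's standard `enum.Flag`. The class name `Cause` and the tagging of a driver failure are assumptions made for illustration and are not part of the report's taxonomy.

```python
from enum import Flag, auto


class Cause(Flag):
    """The four top-level categories, modelled as combinable flags."""
    SOFTWARE = auto()
    HUMAN = auto()
    HARDWARE = auto()
    SECURITY = auto()


# A device driver fault can be tagged with both categories instead of forcing one.
driver_failure = Cause.SOFTWARE | Cause.HARDWARE

print(Cause.SOFTWARE in driver_failure)
print(Cause.HUMAN in driver_failure)
```

Multi-label tagging avoids forcing a single choice, at the cost that category percentages no longer sum to 100%.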
Didn’t consider the type of web applications
A few more important causes of failure
• Website not tested on different platforms
  • e.g. smartphones, tablets
• DNS problems
• Bandwidth – Web host decides to take your site offline because you consumed too much bandwidth
• Police raid – What happened with The Pirate Bay :)
Critique
Web application failures can generally be traced back to the development phase
• Lack of well-defined scope
• Lack of professional project management
• Poor version control
• Trying to reinvent the wheel
• No functional testing
• “Freelance Syndrome” :)
Some Thoughts
Summary
• Website failures are prevalent
  • Loss of revenue
  • Long-term losses like customer dissatisfaction
• Important Study
  • Failure taxonomy
  • Causes
  • General causes
  • Real-world case studies
• Can be improved, extended and updated!