Why do data centres fail?
Barry Elliott RCDD
Capitoline [email protected]
www.capitoline.eu www.capitoline.me www.capitoline.org
November 2010
Why do data centres fail?
China Media Digest 0903 (week 7), by Wei HE • February 15, 2009
TVCC of CCTV on fire
The northern building of the new CCTV complex caught fire on Feb. 9, at around 8:00pm. The fire spread quickly and soon the entire structure was in flames.
The 44-storey building, about 200 meters from the iconic CCTV tower, houses the Television Culture Center (TVCC), the luxury Mandarin Oriental Hotel and an electronic data processing center
Causes of failure are reported differently
[Chart: reported causes of failure. Categories: Human error (57.3), Improper failover, Overheating, Power loss. The remaining plotted values are 26, 22 and 44.]
Source: Avocent 2008
Capitoline's roundup of published data centre failures
User | Location | Date | Failure mode | Consequences
Spotify | London | 2010 | Unexplained power failure | CRAC unit failed to restart. DC overheated and out for 2 hours
World Press | San Francisco | 2010 | Change to router | All comms failed to work for 2 hours
Rapid switch | London | 2010 | Thieves stole fibre cable | All comms severed for 24 hours
Internet Solutions | S Africa | 2010 | Leaking fire suppression gas | DC evacuated. Systems shut down
Peer 1 | Toronto | 2009 | UPS caught fire | All power out for 12 hours
The Planet | Houston | 2008 | Transformer fire | DC out of action all weekend
Silver Top taxis | Melbourne | 2009 | Building fire | Business out of action for 24 hours
Authorize.net | Seattle | 2009 | Building fire in adjacent shopping centre | Sprinklers destroyed power equipment
Green Bay data centre | Wisconsin | 2008 | Building fire | DC destroyed
Amazon | USA West coast | 2009 | Lightning strike on building | DC out of action for 6 hours
Vodafone | Istanbul | 2009 | Flood caused by rainstorm | DC destroyed
T-Mobile | Washington state | 2008 | Flood caused by rainstorm | DC destroyed
Australian tax service | Melbourne | 2008 | Zinc whiskers | Mass server failure
Level 3 | London | 2009 | High external temperatures | CRAC units undersized. Overheat caused shutdown for 24 hours
Internode | Adelaide | 2009 | Storm caused power failure but multiple standby generators could not synchronise | DC out of service for 8 hours
Centerlink | Canberra | 2009 | Power surge knocked out UPS. ATS failed to start generators | DC out of service for a weekend
Queensland Health dept | Australia | 2009 | Low voltage 'brownout' tripped out chiller system | DC overheated as system wasn't monitored
Neilsen | Florida | 2009 | Unexplained power loss | DC out of action overnight
HBOS | England | 2009 | Flood caused power loss | Bank ATM system out of action over weekend
Air New Zealand | Auckland | 2009 | Faulty generator would not start after mains power failure | Airline unable to take bookings for 6 hours
TATA | London | 2009 | UPS failed and then generator would not start | Email servers out of action for 2 hours
Harvard University | USA | 2009 | Unexplained power loss | All university systems unavailable for 5 hours
Amazon | USA east coast | 2009 | Storm caused complete power loss | DC unavailable for 6 hours
Legal & General | London | 2009 | Gas leak in road caused building to be evacuated | DC out of action for 48 hours
Rackspace | Dallas | 2009 | Unexplained power loss | Hosted servers went down for 5 hours
Twitter | San Francisco | 2009 | Denial of service attack | Twitter out for 3 hours
BT | London | 2009 | Flood | Communications links lost
Amazon | USA | 2010 | Complete UPS failure after maintenance | DC unavailable for 7 hours
Amazon | USA | 2010 | Short circuit in PDU | DC out for 8 hours
Amazon | USA | 2010 | Power outage and then faulty ATS | DC out for 30 minutes
Teremark | Miami, USA | 2010 | Overloaded network server failed | Principal services out for 7 hours
Equinix | California, USA | 2010 | Storage device problem | Main customer out of action for 1 hour
Paypal | USA | 2010 | Networking equipment | ?
FibreNet | W Virginia, USA | 2010 | DC power plant failure | DC out for 4 hours
IBM | Singapore | 2010 | Disk storage failure | DC out for 7 hours
O2 | London | 2010 | Hot weather overloaded HVAC | DC out of action for 3 hours
EMIS | UK | 2010 | Not identified | DC out of action for 4 hours
Barclaycard | UK | 2010 | Software error | DC out of action for 20 minutes
Facebook | USA | 2010 | Software error | DC out of action for 2.5 hours
ORCON | USA | 2010 | PDU failure | DC out of action for 2.5 hours
Virgin Airlines | Sydney, Australia | 2010 | Server failure | DC out of action for 21 hours
Wellington Hospital | New Zealand | 2010 | UPS failure | DC out of action for 4 hours
American Eagle | USA | 2010 | Disk storage failure | Out of action or impaired for 192 hours
Barclays Bank | UK | 2010 | Not identified | DC out of action for 1 hour
Northrop Grumman | Virginia, USA | 2010 | SAN failure | DC out of action for 24 hours
Wikipedia | USA | 2010 | External power failure | DC out of action for 1 hour
DBS Bank | Singapore | 2010 | Not identified | DC out of action for 7 hours
Dept of Education | Australia | 2010 | HVAC failure | DC out of action for 2 hours
Twitter | USA | 2010 | Network overloaded | DC out of action for 5.5 hours
Centerlink | Canberra, Australia | 2010 | External power failure | DC out of action for 20 minutes
Dallas County | USA | 2010 | Burst water main destroyed power system | DC out of action for 48 hours
Hosting.com | Philadelphia, USA | 2010 | Network switch failure | DC out of action for 14 hours
Reserved for your data centre…………
Mean time to failure
• 52 major data centre failures in 36 months, and that's just the ones made public
• If we presume this is at best half of all failures then a data centre goes down somewhere every 2 weeks
• And that's excluding individual equipment failures
• Average downtime 16.2 hours per major incident
– From 20 minutes to 8 days
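The estimate in the second bullet can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of the original slides: the variable names are illustrative, and the factor of 2 simply encodes the slide's own assumption that published failures are at best half of all failures.

```python
# Figures taken from the slide; all names here are illustrative only.
published_failures = 52      # major failures made public
period_months = 36           # observation window
unreported_factor = 2        # assumption: published cases are at best half of the total

WEEKS_PER_MONTH = 52 / 12    # approximate number of weeks in a calendar month

estimated_failures = published_failures * unreported_factor
period_weeks = period_months * WEEKS_PER_MONTH
weeks_between_failures = period_weeks / estimated_failures

print(f"Estimated failures: {estimated_failures}")
print(f"Roughly one failure every {weeks_between_failures:.1f} weeks")
```

With these inputs the interval works out at about one and a half weeks, which the slide rounds to "every 2 weeks".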
Failure mechanisms
Failure mode | Sites | Share
Power failures | 15 | 29%
Major IT problem | 12 | 23%
Storm & flood | 8 | 16%
Fire | 5 | 10%
HVAC | 4 | 8%
Other | 4 | 8%
Malicious attack | 2 | 4%
Other external issues | 1 | 2%
Source: Capitoline from published sources 2008-2010
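The percentage shares can be recomputed from the site counts. This is a rough Python sketch under the assumption that each share is the site count divided by the sum of all listed counts; the dictionary and variable names are illustrative and do not come from the slides.

```python
# Site counts per failure mode, as listed on the slide.
sites_by_mode = {
    "Power failures": 15,
    "Major IT problem": 12,
    "Storm & flood": 8,
    "Fire": 5,
    "HVAC": 4,
    "Other": 4,
    "Malicious attack": 2,
    "Other external issues": 1,
}

total_sites = sum(sites_by_mode.values())

# Share of each mode as a whole-number percentage of all counted sites.
share_by_mode = {
    mode: round(100 * count / total_sites)
    for mode, count in sites_by_mode.items()
}

for mode, share in share_by_mode.items():
    print(f"{mode:<22} {sites_by_mode[mode]:>3} {share:>3}%")
```

The listed counts sum to 51. Plain rounding reproduces seven of the eight published shares and gives 24% for "Major IT problem" where the chart shows 23%, so the chart figures may have been adjusted to total 100%.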
Almost every major failure could have been avoided with
• Better design
• More thought about location
• Proper maintenance plans
• Testing of all systems, not just components
• Adequate fire suppression methods
• Monitoring
• Business processes
Avoiding failure
• Design and build it to work
• Audit what you've got
• Do a business continuity risk assessment
• Have ongoing operational policies and procedures in place
• Have a Disaster Recovery plan
• Audit the whole process
Design and build it to work
• Meet standards
• N, N+1, 2N models
• TIA 942
• BICSI 002
• The UpTime Institute
• EN 50173-5
• ISO 24764
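The N, N+1 and 2N models are commonly read as follows: N is the number of units needed to carry the load, N+1 adds one spare unit, and 2N provides two complete sets. The Python sketch below illustrates that reading only. The function name, the load and the module rating are hypothetical and are not taken from the slides or from any of the standards listed.

```python
import math

def units_required(load_kw: float, unit_capacity_kw: float) -> dict[str, int]:
    """Return how many units each redundancy model calls for.

    N   : just enough units to carry the load
    N+1 : N plus one spare unit
    2N  : two complete sets of N units
    """
    n = math.ceil(load_kw / unit_capacity_kw)
    return {"N": n, "N+1": n + 1, "2N": 2 * n}

# Hypothetical example: 400 kW of load served by 150 kW modules.
print(units_required(400, 150))
```

The same calculation applies to any plant sized in discrete units, such as UPS modules, generators or CRAC units.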
Business continuity starts with a risk assessment
• National scale
• Local scale
• Internal to the Data Centre
Local scale
• Flooding, hurricanes, lightning
• Security, criminality, issues
• Strikes, blockades, pickets
• Power and telecommunication links
• Nearby EMC source
• Local storage of oil, chemicals etc
Internal risks
• Loss of power
• Loss of cooling
• Fire
• Cyber attack
• Major IT equipment failure
• Sabotage
[Photo: Data centre fire at a Dutch University]
Risk assessment
• Conduct a risk assessment
• What is the risk?
• What/who is at risk?
• What can be done to mitigate the risk?
• What do we do if there is a catastrophic failure?
Disaster recovery
• What is your recovery time objective?
• What will you back up?
• Where will you back up to?
– Another data centre in your own company?
– Commercial DR backup space
• What hardware will you back up to?
Data centre auditing
• Our experience comes from auditing over 40 data centres in the UK, Ireland, Netherlands and the Middle East
• No two customers have the same expectation from a data centre audit
What are the motives to obtain a DC audit?
• Their customers require it
• Need to understand 'Tier' rating
• Know they have problems but need an external consultant to confirm that to free up funding
• Have current and severe operational problems and need to start the overhaul/replacement process
• Need ISMS audits such as ISO 27000
• Need to know their green/CO2/PUE position
• Want compliance with H&S and other legislation
It's the simple things that often go wrong, e.g. not putting the generator starter in 'Auto'
About 50 separate standards that could be applied to a data centre plus many national requirements
'Tier' Standards
• TUI is a design philosophy
– Tier 1, basic requirements
– Tier 2, redundant components
– Tier 3, concurrently maintainable
– Tier 4, autonomous fault tolerance
• TIA 942, a prescriptive design guide
• BICSI 002, some different ideas
ISO 27002 code of practice
Information technology — Security techniques — Code of practice for information security management
Big on questions but proposes no answers
1. Introduction and scope
2. Terms & definitions
3. Structure of the Standard
4. Risk assessment and treatment
5. Security policy
6. Organisation of information security
7. Asset management
8. Human resources security
9. Physical and environmental security
10. Communications and operational management
11. Access control
12. Information systems, acquisition, development & maintenance
13. Information security incident management
14. Business continuity management
15. Compliance
Data centre auditing
• What does the customer want to achieve?
• Use the right audit package to answer the customer's questions/requirements
• Select from the range of appropriate standards available. There is no one standard that fits all requirements
• An audit includes business processes, not just physical attributes
• Tune it to your business
Thank you
Barry Elliott RCDD
Capitoline LLP [email protected]
www.capitoline.eu