Why do data centres fail? - Bicsi

Why do data centres fail?

Barry Elliott RCDD

Capitoline [email protected]

www.capitoline.eu www.capitoline.me www.capitoline.org

November 2010

Why do data centres fail?

China Media Digest 0903 (week 7), by Wei HE • February 15, 2009

TVCC of CCTV on fire

The northern building of the new CCTV complex caught fire on Feb. 9, at around 8:00pm. The fire spread quickly and soon the entire structure was in flames. The 44-storey building, about 200 meters from the iconic CCTV tower, houses the Television Culture Center (TVCC), the luxury Mandarin Oriental Hotel and an electronic data processing center

Causes of failure are reported differently

[Chart: causes of failure. Categories: human error, improper failover, overheating, power loss. Values shown: 57.3, 26, 22, 44]

Source: Avocent 2008

Distribution of failures at Google

Capitoline’s roundup of published data centre failures

User | Location | Date | Failure mode | Consequences
Spotify | London | 2010 | Unexplained power failure | CRAC unit failed to restart. DC overheated and out for 2 hours
WordPress | San Francisco | 2010 | Change to router | All comms failed to work for 2 hours
Rapid switch | London | 2010 | Thieves stole fibre cable | All comms severed for 24 hours
Internet Solutions | S Africa | 2010 | Leaking fire suppression gas | DC evacuated. Systems shut down
Peer 1 | Toronto | 2009 | UPS caught fire | All power out for 12 hours
The Planet | Houston | 2008 | Transformer fire | DC out of action all weekend
Silver Top taxis | Melbourne | 2009 | Building fire | Business out of action for 24 hours

User | Location | Date | Failure mode | Consequences
Authorize.net | Seattle | 2009 | Building fire in adjacent shopping centre | Sprinklers destroyed power equipment
Green Bay data centre | Wisconsin | 2008 | Building fire | DC destroyed
Amazon | USA West coast | 2009 | Lightning strike on building | DC out of action for 6 hours
Vodafone | Istanbul | 2009 | Flood caused by rainstorm | DC destroyed
T-Mobile | Washington state | 2008 | Flood caused by rainstorm | DC destroyed
Australian tax service | Melbourne | 2008 | Zinc whiskers | Mass server failure
Level 3 | London | 2009 | High external temperatures | CRAC units undersized. Overheat caused shutdown for 24 hours

User | Location | Date | Failure mode | Consequences
Internode | Adelaide | 2009 | Storm caused power failure but multiple standby generators could not synchronise | DC out of service for 8 hours
Centrelink | Canberra | 2009 | Power surge knocked out UPS. ATS failed to start generators | DC out of service for a weekend
Queensland Health dept | Australia | 2009 | Low voltage ‘brownout’ tripped out chiller system | DC overheated as system wasn’t monitored
Nielsen | Florida | 2009 | Unexplained power loss | DC out of action overnight
HBOS | England | 2009 | Flood caused power loss | Bank ATM system out of action over weekend
Air New Zealand | Auckland | 2009 | Faulty generator would not start after mains power failure | Airline unable to take bookings for 6 hours
TATA | London | 2009 | UPS failed and then generator would not start | Email servers out of action for 2 hours

User | Location | Date | Failure mode | Consequences
Harvard University | USA | 2009 | Unexplained power loss | All university systems unavailable for 5 hours
Amazon | USA east coast | 2009 | Storm caused complete power loss | DC unavailable for 6 hours
Legal & General | London | 2009 | Gas leak in road caused building to be evacuated | DC out of action for 48 hours
Rackspace | Dallas | 2009 | Unexplained power loss | Hosted servers went down for 5 hours
Twitter | San Francisco | 2009 | Denial of service attack | Twitter out for 3 hours
BT | London | 2009 | Flood | Communications links lost
Amazon | USA | 2010 | Complete UPS failure after maintenance | DC unavailable for 7 hours

User | Location | Date | Failure mode | Consequences
Amazon | USA | 2010 | Short circuit in PDU | DC out for 8 hours
Amazon | USA | 2010 | Power outage and then faulty ATS | DC out for 30 minutes
Terremark | Miami, USA | 2010 | Overloaded network server failed | Principal services out for 7 hours
Equinix | California, USA | 2010 | Storage device problem | Main customer out of action for 1 hour
Paypal | USA | 2010 | Networking equipment | ?
FibreNet | W Virginia, USA | 2010 | DC power plant failure | DC out for 4 hours
IBM | Singapore | 2010 | Disk storage failure | DC out for 7 hours

User | Location | Date | Failure mode | Consequences
O2 | London | 2010 | Hot weather overloaded HVAC | DC out of action for 3 hours
EMIS | UK | 2010 | Not identified | DC out of action for 4 hours
Barclaycard | UK | 2010 | Software error | DC out of action for 20 minutes
Facebook | USA | 2010 | Software error | DC out of action for 2.5 hours
ORCON | USA | 2010 | PDU failure | DC out of action for 2.5 hours
Virgin Airlines | Sydney, Australia | 2010 | Server failure | DC out of action for 21 hours
Wellington Hospital | New Zealand | 2010 | UPS failure | DC out of action for 4 hours
American Eagle | USA | 2010 | Disk storage failure | Out of action or impaired for 192 hours
Barclays Bank | UK | 2010 | Not identified | DC out of action for 1 hour

User | Location | Date | Failure mode | Consequences
Northrop Grumman | Virginia, USA | 2010 | SAN failure | DC out of action for 24 hours
Wikipedia | USA | 2010 | External power failure | DC out of action for 1 hour
DBS Bank | Singapore | 2010 | Not identified | DC out of action for 7 hours
Dept of Education | Australia | 2010 | HVAC failure | DC out of action for 2 hours
Twitter | USA | 2010 | Network overloaded | DC out of action for 5.5 hours
Centrelink | Canberra, Australia | 2010 | External power failure | DC out of action for 20 minutes
Dallas County | USA | 2010 | Burst water main destroyed power system | DC out of action for 48 hours
Hosting.com | Philadelphia, USA | 2010 | Network switch failure | DC out of action for 14 hours

Reserved for your data centre…………

Mean time to failure

• 52 major data centre failures in 36 months, and that’s just the ones made public

• If we presume this is at best half of all failures then a data centre goes down somewhere every 2 weeks

• And that’s excluding individual equipment failures

• Average downtime 16.2 hours per major incident
– From 20 minutes to 8 days
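The interval quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the 52 failures, the 36 months and the "half are made public" presumption come from the slide, while the function name and the weeks-per-month conversion are assumptions made for the example.

```python
# Figures taken from the slide; the reporting fraction is the slide's own presumption.
REPORTED_FAILURES = 52
PERIOD_MONTHS = 36
WEEKS_PER_MONTH = 52.0 / 12.0  # assumption: average month length in weeks


def weeks_between_failures(failures, months, reporting_fraction=1.0):
    """Average interval in weeks between failures.

    reporting_fraction is the share of all failures assumed to be made public;
    the reported count is scaled up by it to estimate the true total.
    """
    estimated_total = failures / reporting_fraction
    return months * WEEKS_PER_MONTH / estimated_total


public_only = weeks_between_failures(REPORTED_FAILURES, PERIOD_MONTHS)
half_reported = weeks_between_failures(REPORTED_FAILURES, PERIOD_MONTHS, reporting_fraction=0.5)

print(f"Published failures only: one every {public_only:.1f} weeks")
print(f"If only half are published: one every {half_reported:.1f} weeks")
```

On these inputs the published failures alone work out to about one every three weeks, and doubling them gives about one every week and a half, in line with the slide's rounded "every 2 weeks".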

Failure mechanisms

Failure mode | Sites | Share
Power failures | 15 | 29%
Fire | 5 | 10%
Storm & flood | 8 | 16%
Other external issues | 1 | 2%
Malicious attack | 2 | 4%
HVAC | 4 | 8%
Major IT problem | 12 | 23%
Other | 4 | 8%

Source: Capitoline from published sources 2008-2010
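As a rough cross-check, the percentage shares can be recomputed from the site counts. This is a minimal sketch, not the method used to draw the original chart: the counts are copied from the failure-mode figures above, and the helper name `shares` is made up for the example. The counts as listed sum to 51, and simple rounding reproduces the charted shares to within one percentage point.

```python
# Site counts per failure mode, copied from the failure-mode figures above.
FAILURE_COUNTS = {
    "Power failures": 15,
    "Fire": 5,
    "Storm & flood": 8,
    "Other external issues": 1,
    "Malicious attack": 2,
    "HVAC": 4,
    "Major IT problem": 12,
    "Other": 4,
}


def shares(counts):
    """Return each mode's share of the total as a whole-number percentage."""
    total = sum(counts.values())
    return {mode: round(100 * n / total) for mode, n in counts.items()}


percentages = shares(FAILURE_COUNTS)

# Print the modes from most to least common.
for mode, pct in sorted(percentages.items(), key=lambda item: -item[1]):
    print(f"{mode:<22} {FAILURE_COUNTS[mode]:>3} sites {pct:>3}%")
```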

Almost every major failure could have been avoided with
• Better design
• More thought about location
• Proper maintenance plans
• Testing of all systems, not just components
• Adequate fire suppression methods
• Monitoring
• Business processes

Avoiding failure
• Design and build it to work
• Audit what you’ve got
• Do a business continuity risk assessment
• Have ongoing operational policies and procedures in place
• Have a Disaster Recovery plan
• Audit the whole process

Design and build it to work
• Meet standards
• N, N+1, 2N models
• TIA 942
• BICSI 002
• The UpTime Institute
• EN 50173-5
• ISO 24764
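The N, N+1 and 2N models can be illustrated with a small sizing sketch. It assumes the usual reading of the terms: N is the number of units needed to carry the load, N+1 adds one spare unit, and 2N duplicates the full set. The load and unit capacity are hypothetical figures, and the function names are invented for the example.

```python
import math


def units_required(load_kw, unit_capacity_kw, model):
    """Installed unit count for a redundancy model: 'N', 'N+1' or '2N'."""
    n = math.ceil(load_kw / unit_capacity_kw)  # N: just enough units to carry the load
    if model == "N":
        return n
    if model == "N+1":
        return n + 1  # one spare unit
    if model == "2N":
        return 2 * n  # a complete duplicate set
    raise ValueError(f"unknown redundancy model: {model}")


def carries_load(installed, failed, load_kw, unit_capacity_kw):
    """True if the units still running can carry the full load."""
    return (installed - failed) * unit_capacity_kw >= load_kw


LOAD_KW = 400.0           # hypothetical IT load
UNIT_CAPACITY_KW = 200.0  # hypothetical capacity of one UPS or cooling unit

for model in ("N", "N+1", "2N"):
    installed = units_required(LOAD_KW, UNIT_CAPACITY_KW, model)
    ok = carries_load(installed, 1, LOAD_KW, UNIT_CAPACITY_KW)
    print(f"{model:<4} installs {installed} units; survives one unit failure: {ok}")
```

With these figures the N design has no margin at all, which is why a single failed unit takes the load down.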

Business continuity starts with a risk assessment

• National scale
• Local scale
• Internal to the Data Centre

National scale

National Risk Register, UK 2010

TUI Disaster risk locations

UN Asian natural risk profile

Local scale
• Flooding, hurricanes, lightning
• Security, criminality, issues
• Strikes, blockades, pickets
• Power and telecommunication links
• Nearby EMC source
• Local storage of oil, chemicals etc

Use free resources e.g. UK Environment agency flood risk by postcode

Dutch risk location register

Internal risks

• Loss of power
• Loss of cooling
• Fire
• Cyber attack
• Major IT equipment failure
• Sabotage

[Photo: Data centre fire at a Dutch University]

Risk assessment
• Conduct a risk assessment
• What is the risk?
• What/who is at risk?
• What can be done to mitigate the risk?
• What do we do if there is a catastrophic failure?
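One way to record the answers to these questions is a simple risk register. The sketch below is an assumption rather than the presenter's method: the likelihood and impact scales, the likelihood × impact score and all the example entries are invented for illustration. Only the four questions come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str             # What is the risk?
    at_risk: str          # What/who is at risk?
    mitigation: str       # What can be done to mitigate the risk?
    if_catastrophic: str  # What do we do if there is a catastrophic failure?
    likelihood: int       # hypothetical scale, 1 (rare) to 5 (frequent)
    impact: int           # hypothetical scale, 1 (minor) to 5 (severe)

    @property
    def score(self):
        """Simple likelihood x impact ranking score (an assumed convention)."""
        return self.likelihood * self.impact


# Illustrative entries only; a real register comes from the site's own assessment.
register = [
    Risk("Fire", "Building and staff", "Adequate fire suppression",
         "Evacuate, then invoke the disaster recovery plan", 2, 5),
    Risk("Loss of power", "All IT services", "Test UPS and generators as a system",
         "Fail over to the backup site", 4, 5),
    Risk("Loss of cooling", "Servers and storage", "Monitor chillers and CRAC units",
         "Controlled shutdown of non-critical load", 3, 4),
]

ranked = sorted(register, key=lambda risk: risk.score, reverse=True)
for risk in ranked:
    print(f"{risk.score:>2}  {risk.name}: {risk.mitigation}")
```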

Disaster recovery
• What is your recovery time objective?
• What will you back up?
• Where will you back up to?
– Another data centre in your own company?
– Commercial DR backup space
• What hardware will you back up to?

Data centre auditing
• Our experience comes from auditing over 40 data centres in the UK, Ireland, Netherlands and the Middle East

• No two customers have the same expectation from a data centre audit

What are the motives to obtain a DC audit?
• Their customers require it
• Need to understand ‘Tier’ rating
• Know they have problems but need an external consultant to confirm that to free up funding
• Have current and severe operational problems and need to start the overhaul/replacement process
• Need ISMS audits such as ISO 27000
• Need to know their green/CO2/PUE position
• Want compliance with H&S and other legislation
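For the PUE position mentioned in the list above, the standard definition is total facility energy divided by IT equipment energy, so a value closer to 1.0 means less overhead. A minimal sketch follows; the annual energy figures are hypothetical and the function name is made up for the example.

```python
def pue(total_facility_kwh, it_equipment_kwh):
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


# Hypothetical annual meter readings, for illustration only.
annual_total_kwh = 1_800_000.0  # everything: IT load, cooling, UPS losses, lighting
annual_it_kwh = 1_000_000.0     # IT equipment only

value = pue(annual_total_kwh, annual_it_kwh)
print(f"PUE = {value:.2f}")
```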

It’s the simple things that often go wrong, e.g. not putting the generator starter in ‘Auto’

About 50 separate standards that could be applied to a data centre, plus many national requirements

Tier rating

The UpTime Institute

=TIA 942

=BICSI 002=

‘Tier’ Standards
• TUI is a design philosophy
– Tier 1, basic requirements
– Tier 2, redundant components
– Tier 3, concurrently maintainable
– Tier 4, Autonomous fault tolerance
• TIA 942, a prescriptive design guide
• BICSI 002, some different ideas

ISMS

Information Security Management standards

ISO 27000 series

ISO 27002 code of practice

Information technology — Security techniques — Code of practice for information security management

1. Introduction and scope
2. Terms & definitions
3. Structure of the Standard
4. Risk assessment and treatment
5. Security policy
6. Organisation of information security
7. Asset management
8. Human resources security
9. Physical and environmental security
10. Communications and operational management
11. Access control
12. Information systems, acquisition, development & maintenance
13. Information security incident management
14. Business continuity management
15. Compliance

Big on questions but proposes no answers

Do you handle credit/debit card transactions or keep financial data?

Data centre auditing
• What does the customer want to achieve?
• Use the right audit package to answer the customer’s questions/requirements
• Select from the range of appropriate standards available. There is no one standard that fits all requirements
• An audit includes business processes, not just physical attributes
• Tune it to your business

Thank you

Barry Elliott RCDD

Capitoline LLP [email protected]

www.capitoline.eu
