Going Beyond Recovery to Continuity: Lessons Learned

Page 1: Going Beyond Recovery to Continuity: Lessons Learned

Going Beyond Recovery to Continuity: Lessons Learned

Dave Swartz, Vice President & CIO

The George Washington University

Page 2: Going Beyond Recovery to Continuity: Lessons Learned

Brief Background on GW

• Main campus
  – Washington, DC
  – ~100 buildings
  – Blocks from the White House, IMF/World Bank, State Dept.
• 27,000 people
  – 20K students (50% UG and 50% graduate and professional students)
  – 7K faculty and staff
  – Of the 20K students, 8K are resident students
• Major medical center – the ER for the leadership of our government
• Two other smaller campuses in the region
• 2.5 Gb into Internet and Internet-2
• 15K voice connections and 17K data connections
• Two major data centers, 34 miles apart

[Map: GW in relation to the White House, the Pentagon, and the IMF/WB and State Dept.]

Page 3: Going Beyond Recovery to Continuity: Lessons Learned

Some Drivers for Business Continuity at GW

• Explosions in manholes in the street
  – Recurring unexplained accumulations of flammable liquids in the storm drains explode, shutting off power to a few buildings for days.
• Flood hits Academic Center with data center
  – A backed-up city sewer system causes a flood in a building not designed for a data center.
• Change management issues
  – Our Facilities group is prone to taking significant actions without much notice, including cutting off power or cooling to a building.
• Email systems failure
  – Lost the SAN; basic email was down for 24 hours, and it was 3 days until the archive could be restored.
• Cybersecurity incidents
  – After a major worm infestation and a hack on a trusted host in 2000, GW creates its Information Security Program.
• 9/11
  – “The tragic events of Sept. 11 and their aftermath have resulted in changes in the way all of us conduct our lives,” said President Stephen Joel Trachtenberg. “Just as GW strives for academic excellence, we also want to take all appropriate steps to ensure the safety and well being of our community and the continued operation of the university.”
  – GW was close to ground zero that day, and all land-based phones and cell phones were congested for much of the day.
• Sarbanes-Oxley
  – A risk-conscious Board of Trustees has led to a number of initiatives to address BC at GW.

Page 4: Going Beyond Recovery to Continuity: Lessons Learned

Who Owns BC at GW?

• John Petrie, AVP for Public Safety & Emergency Mgmt., holds the AB degree from Villanova University and a master’s and doctorate from The Fletcher School of Law and Diplomacy.

• A career Naval officer, he was the head of the Naval Station at Norfolk, the world’s largest Naval complex, and also professor and head of research at the War College.

• The AVP position was created after 9/11 and was designed to broaden, coordinate, and execute the University’s crisis management, business continuity, emergency preparedness and public safety plans and activities.

• “We need to have people at the local level comfortable with what’s expected of them and what they have the authority to do,” Petrie says. “If they are confident and comfortable, then the chances of their being able to prepare, respond, or recover are easier.”

• John’s number one priority is the safety and welfare of people.

• He sits on regional and national emergency management response groups and represents the regional universities in exercises.

• References:

– BC Plan - http://www.gwu.edu/~response/contents.cfm

– Advisories and Alerts - http://www.gwu.edu/~gwalert/

John Petrie, AVP for Public Safety & Emergency Mgmt.

John has helped to lead the development and administration of BC plans and testing, and an integrated system of advisories, alerts and real-time communications.

Page 5: Going Beyond Recovery to Continuity: Lessons Learned

Role of IT in Campus BC

• Address the risks of IT failures
• IT has helped to coordinate and fund the development of the main 19 core office departmental plans
  – Many core departments had to be assisted to get their BC plans done since they felt IT had things under control, so why should they have to plan?
  – They also had difficulty freeing themselves from other priorities; they needed their VP to make BC a priority!
• IT has also helped to deliver:
  – Campus Alerts (web page, portal, email, 3rd party call service)
  – Backup web site
  – Redundant email system and broadcast server (reflector and Listserv)
  – Alternate routing to a different area code for our main incoming and outgoing phone lines
  – Emergency intercom broadcasts over speaker phones
  – A network of Blackberries and support for management
  – Online directories and BC response plans
  – A fully configured and supported command center.

Page 6: Going Beyond Recovery to Continuity: Lessons Learned

The Planning Process

• Identify sources of risks and plan accordingly
• Provide assistance
  – Standard templates and questions to facilitate preparation of plans (available on request)
  – Expert assistance to develop plan
  – Review of plans
• Enlist support
  – Of senior management, the Board and all core offices
• Prioritize efforts
  – Not every department needs a comprehensive plan. At GW we identified 19 core offices that needed detailed plans.
• Make the plan easily available
• Regularly test the plan and the ability to think on your feet
• Keep plans current
  – All plans require periodic review, validation and update.

The online plan for GW is called the Incident Planning, Response, and Recovery Manual; it includes the individual BC Plans.

Page 7: Going Beyond Recovery to Continuity: Lessons Learned

The GW IT Recovery Profile

• Rebuild & Replace Disaster Recovery
  – Tape backup and priority shipment of equipment
  – Weeks to recovery
• Hot-Site Disaster Recovery
  – Off-site arrangements with a hot-site provider
  – Several days to recovery
• High Availability Operations
  – Redundant data centers, networks and telecom
  – Less than one day, and ideally less than a couple of hours, to recovery.

[Chart: Hours to Recovery, 2000 to 2006 (y-axis 0 to 450). Data labels: 420 (projected), 84, 12, < 2. Legend: Rebuild & Replace, Hot-Site, High-Availability.]

Page 8: Going Beyond Recovery to Continuity: Lessons Learned

Dealing with Risk: Continuity rather than Recovery

• Common areas of IT risk were addressed with a focus on major risks and points of failure:
  – Data Center
  – Telecommunications
  – Network and ISP
  – Data
  – Security
  – Power and Cooling
  – Change and Service Management
  – Classrooms

1. Continuity of operations needs to be built into the architecture and culture from the bottom up.

2. If you live with it and use it day to day, then it is less of a big deal when a disaster hits.

3. BC at a comprehensive local level is essential to enable IT to deliver the sustainability of data and information services.

Page 9: Going Beyond Recovery to Continuity: Lessons Learned

Data Center Redundancy

• We have created dual data centers
  – separated by 34 miles
  – connected by a DWDM link over a redundant dark fiber ring

• We split Test/Dev from the Prod instances.

• We also deploy VMware and virtualize servers across centers.

• Not all of production is at one site, but split on a 35-65% basis.

• We mirror data between data centers.

• We have staff split between centers.

• We routinely test failover during maintenance and upgrades.

• This design enables continuity of operations without the need to recover from most disasters.

[Diagram: Data Center Backup Architecture. Sites: Loudoun County Data Center and Foggy Bottom Data Center. Labels: DWDM, Ethernet Connection, Dark Fiber, SAN Fiber, L700, EMC SYM-0 and EMC SYM-1 (M1, M2, M3, BCV), WAN Attached Host, SAN Attached Host, Media Manager, Back-up Manager, WAN, SAN, ATA Disks 10Tb.]

Page 10: Going Beyond Recovery to Continuity: Lessons Learned

Telecommunications Redundancy

• We have several PBX switches (Avaya S8700s) interconnected, load balanced, and spatially distributed.
  – Two are on the main campus and separated. The third is on a remote campus 34 miles away in a different area code.

• We have the ability to re-route incoming and outgoing calls through different campuses and area codes.

• There are redundant emergency 911 and analog lines as a back up to our main trunks.

• Some specific phone numbers are protected and given regional priority for accessibility and sustainability during a major incident.

• We maintain copper connections for voice to permit inline power off of diesel generators to 15,000 phones.

Page 11: Going Beyond Recovery to Continuity: Lessons Learned

Data Redundancy

• All enterprise data is mirrored between data centers, including ERP, data marts, email, one-card, portal, and web systems.

• The main campus file servers are automatically backed up. Legacy departmental systems are slowly transitioning to central support and sustainability – a difficult political process.

• Desktops in many core offices have a standard image and automatically store to a central suite of file servers.

• Critical documents are being stored online in an enterprise document management system and archived to tape.

• We regularly test data backups to make sure we can restore from them.

• One of the most critical aspects of continuity is rapid access to the data!

On-site fire rated vault in addition to off-site storage
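One way to make "test that we can restore" concrete is to compare checksums of the original files against a restored copy. The sketch below is illustrative only: the function names, the directory layout, and the sample files are assumptions, and it does not describe the tooling GW actually uses.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir: Path, restored_dir: Path) -> list:
    """Return relative paths that are missing or different in the restored copy."""
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        dst = restored_dir / rel
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            problems.append(str(rel))
    return problems


# Small self-contained demonstration with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    src, restored = Path(tmp, "src"), Path(tmp, "restored")
    src.mkdir()
    restored.mkdir()
    (src / "a.txt").write_text("payroll")
    (src / "b.txt").write_text("grades")
    (restored / "a.txt").write_text("payroll")  # restored correctly
    (restored / "b.txt").write_text("grade")    # simulated corruption
    problems = verify_restore(src, restored)

print(problems)  # ['b.txt']
```

An empty list means every source file came back intact; anything else is a restore that would have failed when it mattered.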

Page 12: Going Beyond Recovery to Continuity: Lessons Learned

Information Security

• Protecting the university from security risks that can interrupt operations and cost millions of dollars in lost productivity and liability is an important priority in BC.
• Like an onion, the best approach is defense in depth.
• One of our newest efforts after securing campus file servers is our desktop initiative.
  – We now use Novell Patchlinks, Cisco Clean Access and IPS to automate updates and to verify conformance to standards and non-infection.
  – As a result, desktop infection problems have declined to a trickle.
• Creating a focused Information Security program, setting standards, and centralizing services are critical to success.

“Rounding Up Rogue Servers”, article in July 2005 Chronicle.

Page 13: Going Beyond Recovery to Continuity: Lessons Learned

Power and Cooling

• Power Redundancy
  – Conditioned Commercial Power
  – 450KW Diesel Generator w/Maintenance Tap
  – Automatic Transfer Switch
  – Uninterruptible Power Supplies (UPS)
  – Multiple power supplies in each computer system
  – 48-hour supply of diesel (going to 96 hrs), with priority shipments from three regional vendors possible
• Redundant Air Conditioning Systems
  – Chilled Water Plant & Two 60 Ton Dry Coolers
  – Glycol & Chilled Water Air Handlers

Page 14: Going Beyond Recovery to Continuity: Lessons Learned

Change & Service Management

[Diagram: Change Control via Integration. Functions shown: Work Requests, Prob Tickets & Service Orders, Asset Management, S/W License Mgmt, App. Change Control. Tools shown: C3, Remedy, Kintana, Upside, Aperture, TBD.]

Adoption of integrated change control is one of the major factors in the improvement and reliability of operations.

Page 15: Going Beyond Recovery to Continuity: Lessons Learned

Classrooms

• What happens if we lose some classroom space? How could we continue to conduct classes?

1. Using R25i (Resource25 3.3) to complement Schedule25, we can identify and reallocate any available university space to classrooms.
2. Using Bb and Elluminate, we can conduct classes virtually from home.
   a. We are piloting this approach now for snow days and other unscheduled ad hoc gatherings such as study sessions.
   b. We are also suggesting that faculty teach one virtual class every month so they have practice.
3. Podcasting = Apreso + iPods
   a. GW is supporting podcasting of its non-credit lecture series to provide access to recorded presentations.
   b. Could this be expanded for credit classes? Depends on support from faculty.

Page 16: Going Beyond Recovery to Continuity: Lessons Learned

Selling BC: not the WHAT, but the HOW

• Rational Approach
  – The risk or probability of the event multiplied by the potential loss provides a suggested magnitude for the investment in protecting a university from disaster. Not many use this approach.
• Peer Group Benchmarks
  – A very common and accepted approach is to compare the university against the market basket of peer institutions to see what they are doing.
• Leverage the Crisis
  – The emotional side of living through a crisis tends to ease the flow of funds, so capture the opportunity when it arises.
• Partnering with the Board and Audit Team
  – The Board has the ability to drive improvements. The External and Internal Audit Teams are agents of the Board and should be viewed as partners, not as the threat they are often taken to be.
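The Rational Approach (probability of the event multiplied by the potential loss) can be sketched in a few lines. All risk names, probabilities, and dollar figures below are hypothetical placeholders chosen for illustration; none of them come from GW.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    annual_probability: float  # assumed chance the event occurs in a given year
    potential_loss: float      # assumed loss in dollars if it does occur

    @property
    def expected_annual_loss(self) -> float:
        """Probability times loss: a rough ceiling on what protection is worth per year."""
        return self.annual_probability * self.potential_loss


# Hypothetical figures, for illustration only.
risks = [
    Risk("data center flood", 0.05, 2_000_000),
    Risk("email SAN failure", 0.20, 250_000),
    Risk("regional power outage", 0.10, 500_000),
]

# Rank risks so the largest expected loss gets attention first.
ranked = sorted(risks, key=lambda r: r.expected_annual_loss, reverse=True)
for risk in ranked:
    print(f"{risk.name:<24} ${risk.expected_annual_loss:>10,.0f}")
```

Spending more per year than the expected annual loss on a single risk is hard to justify on these terms alone, which is one reason the other three approaches on this slide are often used alongside it.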

Page 17: Going Beyond Recovery to Continuity: Lessons Learned

Risks of Complexity

Standardization, documentation, and tight change control help to reduce risks from complexity.

Virtualization, distant centers, and split operations add complexity, which has its own attendant risks.

Page 18: Going Beyond Recovery to Continuity: Lessons Learned

Factors Related to Distance

• How far away is far enough for a second center?
  – GW has selected 34 miles
  – USC has designated a “bunker” just a few miles away
  – Others are saying 70+ miles.
• It really depends
  – You need to consider the types of risks in your region.
• The greater the distance
  – The greater the cost, or the lesser the functionality and immediacy of response.
• You may want to
  – Have a secondary high-availability or hot-site nearby and a tertiary cold-site much farther away.
• You need to consider
  – The impacts on your staff and their ability to make it to the different sites, both for routine maintenance and during a disaster
  – Some types of clustering do not work at a distance
  – Real-time mirroring is also adversely affected by distance.
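A rough sense of why distance hurts real-time mirroring comes from propagation delay alone. The sketch below assumes light travels through fiber at about 200,000 km/s (roughly two thirds of its speed in a vacuum) and uses straight-line distances; the 5-mile figure is a stand-in for "a few miles". Real fiber routes are longer and equipment adds delay, so these numbers are lower bounds.

```python
KM_PER_MILE = 1.609344
FIBER_KM_PER_SEC = 200_000.0  # assumed approximate speed of light in optical fiber


def round_trip_ms(miles: float) -> float:
    """Minimum round-trip propagation delay in milliseconds over a fiber of this length."""
    km = miles * KM_PER_MILE
    return 2 * km / FIBER_KM_PER_SEC * 1000


# 5 miles is an assumed stand-in for "a few miles"; 34 and 70 come from the slide.
for miles in (5, 34, 70):
    print(f"{miles:>3} miles: {round_trip_ms(miles):.3f} ms round trip (minimum)")
```

A synchronously mirrored write cannot be acknowledged until the remote site confirms it, so every such write pays at least one round trip. At 34 miles that floor is a little over half a millisecond, and it doubles at roughly 70 miles.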

Page 19: Going Beyond Recovery to Continuity: Lessons Learned

Support those Blackberries

• A critical element of the GW BC program is a network of Blackberries. All senior management at GW have them and use them every day.

• Blackberries are more like a laptop than a phone and require expert assistance

• They have cell phone and radio capability

• They can send and receive email and instant text messages

• They have the ability to surf the web and access calendars, directories and online documents that can be used to support BC

• We have a dedicated expert with backup to provide support to the Blackberries and the command centers.

Page 20: Going Beyond Recovery to Continuity: Lessons Learned

Doesn’t it cost a great deal?

• GW had a hot-site
  – Costing several hundred thousand dollars per year.
• Went to a high-availability 2nd site.
  – One-time cost about $1 million
  – The ongoing costs were not more than the previous base budget due to the reallocation of the funds from the hot-site contract.
• Increase in base needed was:
  – $136K/yr: $1 million loaned at 6% over 10 years
• To offset costs we are leasing excess space:
  – We are recovering the incremental operating costs of the 2nd site.
• More reliable service without large additional costs - A NO-BRAINER!
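The $136K/yr figure is consistent with a standard level-payment loan calculation. The sketch below assumes annual compounding with one payment at the end of each year; the slide does not state the actual loan terms, so treat this as a plausibility check rather than GW's own arithmetic.

```python
def annual_payment(principal: float, rate: float, years: int) -> float:
    """Level annual payment that fully repays `principal` at `rate` over `years`."""
    if rate == 0:
        return principal / years
    return principal * rate / (1 - (1 + rate) ** -years)


# $1 million at 6% over 10 years, as quoted on the slide.
payment = annual_payment(1_000_000, 0.06, 10)
print(f"${payment:,.0f} per year")  # $135,868 per year
```

Rounded to the nearest thousand, that is the $136K per year of additional base funding quoted above.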

[Chart: Investment (cost) versus Time to Restoration of Operations. X-axis: 2 Weeks, 1 Week, 72 Hours, 48 Hours, 24 Hours, Minutes. Regions: Rebuild & Replace, Hot Site / Mobile Recovery, High Availability. Curves: Expected Cost Curve, GW Cost Curve.]

A myth propagated by hot-site vendors is that the cost of customer-owned high-availability is prohibitive.

Page 21: Going Beyond Recovery to Continuity: Lessons Learned

Partnerships

• National Capital Regional Emergency Response Partnership
  – Emergency Response groups across the region coordinate efforts and share experiences
  – First Responder Access Card (FRAC)
  – Regional exercises
  – Information sharing with key groups
• University Partnerships:
  – Cost and resource sharing or exchange programs
  – Georgetown University & GW back one another up
  – MAX (Mid-Atlantic Crossroads gigapop)
• Vendor Partnerships:
  – Have helped GW identify best practices and utilize new technology useful to BC.
  – Their support in a disaster can be critical

The FRAC helps to get approved personnel across road-blocks and barriers.

Page 22: Going Beyond Recovery to Continuity: Lessons Learned

Questions?

Dave Swartz