© Amor Group 2012
Virtualizing Oracle Unlocked
Enterprise Wide Benefits
Dr Alastair Rennie
Principal IT Management Consultant
MBCS CITP CENG
May 2012
Introduction
• There’s Water (Flooding) in Scotland
• Where SEPA Started
• Criticality of Service Delivery
• Virtualisation
• Slowly but Surely we changed
• Business Change and DR Requirements
• First Design
• Second Design
• Final Design
• Where to now?
There’s water in Scotland, and a whole lot more
• SEPA
– helps regulate and protect the environment: Water, Air, Land, Noise
– Provides flood warning and flood warning dissemination service
– provides environmental information www.environment.scotland.gov.uk www.sepa.org.uk
– provides advice and guidance on environmental management
– educates
• Formed from 65 bodies in 1996
• Centralised systems over 5 years
• Expanded from 600 to 1300 staff
• Turnover increased £34m to £85m
• Developed complex licensing processes and systems
• Implemented complex science and monitoring systems
We started small, but
• In 2000, we had:
– 10 servers
– 800 desktops
– stand-alone laboratory equipment
– manually polling river monitoring systems
– no flood warning or management services
• By 2010, we had:
– 140 servers
– 1600 desktops
– fully integrated laboratory equipment
– automatically managing monitoring stations
– sophisticated flood monitoring and prediction systems
– active flood warning services
– flood warning dissemination services
– Server and application growth continuing
[Chart: servers and applications, 2000 to 2012]
Changes to the Criticality of the service?
• Increased business critical services
• Category 1 responder under the Civil Contingencies Act
• Flooding life risk notification services
• SEPA awake 24x7
Virtualisation – Utopia for SEPA
• How many engineers does it take to change a light bulb (OK, manage a server farm)?
– 1:100?
– 1:300?
– 1:1000?
– 1:10,000?
– All of the above?
– At least 2!
• How much power to make the light bulb work?
– Less, I hope, and less to throw away
• How much space does the light bulb need?
– Not much, I hope
• We were looking for reduced space, power & management
[Chart: cooling, power and space, 2000 to 2010]
ca’ canny but ca’ awa
• In 2007, we built a test virtualisation suite on two spare HP DL380s using VMware ESXi (free!) for the team to play with. Success!
• In 2008, we “acquired” an HP MSA1000 storage array and a couple of HP DL580s
• Installed VMware ESX Enterprise, as we now needed management of the virtual servers
• Started building virtual machines in earnest
• By 2010, out of 140 production servers, 40 had been virtualised
• Delivering improved resilience and disaster recovery
Disaster Recovery and Resilience
[Diagram: primary site and resilient site, each with dynamic management and shared storage, connected by 2 x 1 Gb fibre; replication process with manual transfer]
Recovery Time Objectives
Business/Infrastructure Unit (columns: ALARM, NETWORK, DRIVES, EMAIL, ARC)
Procurement Facilities and Estates 2 2 2
HR 2 2 2
EPI 2 1
Finance
Environmental Science - Hydrology 4 1 1
SCC 1 3 1
Environmental Science - Non-Hydrology 3
Environmental & Organisational Strategy 3 1
National Administration 1 1
1 = within 2 hrs
2 = within 4 hrs
3 = within 24 hrs
4 = within 48 hrs
5 = within 1 week
Recovery Point Objective
Business/Infrastructure Unit (columns: ALARM, NETWORK, DRIVES, EMAIL, ARC)
Procurement Facilities and Estates 4 2 4
HR 2
EPI 1 1
Finance
Environmental Science - Hydrology 4 1 1
SCC 1 3 1
Environmental Science - Non-Hydrology
Environmental & Organisational Strategy 3 3
National Administration 1 1
1 = no loss of data
2 = up to 4 hrs
3 = up to 24 hrs
4 = up to 48 hrs
5 = up to 1 week
• External review determined required RTO & RPO
• Present solutions were not able to meet these
• Some were unattainable
• Realism was applied: RPO of 2 hours and RTO of 24 hours agreed with business operations
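As a rough illustration of how the agreed targets sit against the category scales listed on this slide, the following Python sketch encodes the two legends and lists which categories a 24-hour RTO and a 2-hour RPO would satisfy. The dictionary and function names are hypothetical, and treating "1 week" as 168 hours is an assumption; none of this comes from the external review itself.

```python
# Category legends copied from the slide; "1 week" is taken as 168 hours (assumption).
RTO_HOURS = {1: 2, 2: 4, 3: 24, 4: 48, 5: 168}  # recover "within N hours"
RPO_HOURS = {1: 0, 2: 4, 3: 24, 4: 48, 5: 168}  # lose "up to N hours" of data

AGREED_RTO_HOURS = 24  # agreed with business operations
AGREED_RPO_HOURS = 2   # agreed with business operations


def categories_met(limits, agreed_hours):
    """Return the category codes whose limit the agreed target stays within."""
    return [code for code, hours in sorted(limits.items()) if agreed_hours <= hours]


rto_met = categories_met(RTO_HOURS, AGREED_RTO_HOURS)
rpo_met = categories_met(RPO_HOURS, AGREED_RPO_HOURS)

print("RTO categories covered by a 24 h target:", rto_met)
print("RPO categories covered by a 2 h target:", rpo_met)
```

On these assumptions the agreed figures cover RTO categories 3 to 5 and RPO categories 2 to 5, which shows where the compromise was made: the tightest requests (recovery within 2 or 4 hours, no loss of data) were the ones judged unattainable at the time.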
All change for continuity, next stop disaster
How? Design, design, design to succeed!
• Specification created with expert help
• Solutions received
• Options appraised and tenders assessed
• To deliver DR, we were initially looking at critical services only (~60% service failover)
• Meanwhile, virtualisation needed to continue
– 40 business services to be virtualised
– 30 systems services to be retained as physical
– 12 TB of data, 4 TB of which is user files from other sites
– main virtualisation issues:
• Oracle database
• Citrix application delivery
From Detailed Analysis of Designs
• The options were a hybrid of virtual and physical technologies
• Not integrated and not fit for purpose
– Too complex, not a robust solution
– Too expensive to implement
– Too expensive to run, savings consumed
– Needed more people to manage
– Needed more network resources
[Diagram: virtual servers, physical servers, storage and file storage, with virtual-to-virtual, physical-to-physical and file-to-file replication]
Back to the Drawing Board
• Test Oracle virtualisation – core business systems
• Test Citrix – science system, quality system
• Re-design for an integrated fully virtualised solution
• Plan 100% virtual for business applications
• Plan for 90% virtual for systems – retain physical DC
• vSphere Site Recovery Manager to automate fail over
[Diagram: virtual servers, enterprise storage, file storage and physical servers, with virtual-to-virtual and file-to-file replication and mirroring]
Should we virtualise Oracle, because ……
• Oracle was identified as the single biggest risk to full virtualisation
– No certification from Oracle support
– Business perceived it as a risk through unknowns:
• Critical systems
• Performance
• Availability
• Support
• Licensing
• Confidence
Major business lack of confidence
Oracle goes forth and has multiplied
• Solutions:
– Find other organisations using virtualised Oracle
• Pressure on Oracle to certify support
– Agreed licensing with Oracle support
– Discuss with SEPA business user groups, agree test plans
– Demonstrate performance with automated load testing
– Review architecture & design for supportability and availability
• Blade architecture with redundancy
• Disk arrays with redundancy
• vMotion for availability and performance
• Test, test, test ……..
Technical challenges in design
• Oracle – needs fast disks
– Had to replace standard SATA RAID 6 with SAS RAID 10
• Citrix – needs plenty of resources and management
– Lower server density, resource management servers are physical
• Synchronisation – fast, low latency communications
– Dedicated high speed link with acceleration
• Volume of storage – needed 30 TB
– Original calculations on VM size were low, final calculations gave a 50% increase
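The sizing correction can be worked through in a few lines of Python. The slide does not say whether 30 TB is the figure before or after the 50% increase; this sketch assumes it is the final requirement, and the variable names are hypothetical.

```python
# Assumption: 30 TB is the final requirement, i.e. the original estimate plus 50%.
FINAL_TB = 30.0
INCREASE = 0.50

# Work backwards from the final figure to the implied original estimate.
original_tb = FINAL_TB / (1 + INCREASE)
shortfall_tb = FINAL_TB - original_tb

print(f"Implied original estimate: {original_tb:.1f} TB")
print(f"Extra capacity needed:     {shortfall_tb:.1f} TB")
```

If 30 TB were instead the original estimate, the final requirement would be 45 TB, so the assumption matters when reusing these numbers.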
This is where we arrived at
• Fully managed DR & Resilience service with vSphere
• Fully virtualised applications
• Fully virtualised storage
• Full backup of VMs
• Full backup of data
What else did we get?
• Increased scope of failover and recovery
• Improved RTO & RPO
• Additional secondary backup, tertiary DR
Here’s what we got out of this
• Costs
– Reduced implementation budget by c. £180,000 in £500,000
– Reduced annual revenue (running) costs by c. £80,000 in £260,000
• Space
– Reduced space by 5 cabinets (x2 for DR solution)
• Power
– Reduced power by c20kW x 2
• Reduced Support
– Reduced server engineers by 1FTE
• SLA – SRM has helped deliver
– RTO target was 24 hours, achieved ~2 hours
– RPO target was 2 hours, achieved ~5 minutes
– Improved resilience and reduced downtime of existing service
• Achieving better than 99.98% availability (<2 hours pa unplanned outage)
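The headline figures on this slide can be sanity-checked with simple arithmetic. The Python sketch below is illustrative only: it reads "by c£180,000 in £500,000" as a saving of about £180,000 against a £500,000 baseline (an assumption about the slide's wording), uses a 365-day year, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def max_outage_hours(availability_pct):
    """Hours of outage per year permitted by a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def saving_pct(saving, baseline):
    """Saving expressed as a percentage of the baseline figure."""
    return 100 * saving / baseline


outage = max_outage_hours(99.98)
implementation = saving_pct(180_000, 500_000)
annual = saving_pct(80_000, 260_000)
rto_factor = 24 / 2         # target hours / achieved hours
rpo_factor = (2 * 60) / 5   # target minutes / achieved minutes

print(f"99.98% availability allows about {outage:.2f} h outage per year")
print(f"Implementation saving: {implementation:.0f}%")
print(f"Annual running-cost saving: {annual:.1f}%")
print(f"RTO improved {rto_factor:.0f}x, RPO improved {rpo_factor:.0f}x over target")
```

The availability figure works out at roughly 1.75 hours a year, which is consistent with the "<2 hours pa" stated on the slide.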
What did we get right?
• Reduced Power & Space
• Reduced Management
• Correct Virtualisation technology
• Correct storage technology
• Design, design, design
• Test, test, test
What was challenging?
• Storage management – speed and capacity
• Original design not bold enough
• Storage Integration and sizing
• Data flow to distant site
• No fail back
And where to now?
• Full DR testing is required
– No simple return from DR activation on most platforms
– Cannot be achieved easily with vSphere 4
• vSphere 5 is the answer
– Automated failover and failback
– Improved storage management and replication
– Improved management of environment
Thank you