Building a business critical system – technology, architecture, process
Larry Mead – CTO Platform Modernization Team – Microsoft
Rob Shiveley – Data Center – Intel
Scott Rosenbloom – Platform Strategy – Microsoft
SESSION CODE: WSV202
Session Objectives and Takeaways
Session Objectives:
• Definition of Mission Critical
• Windows Server 2008 R2 support for Mission Critical in conjunction with Intel technology
• Considerations at the Application and Tools for Mission Critical
Key Takeaways:
• Positioning Windows Server and SQL Server for mission critical
• Guidance on mission critical considerations
Mission Critical - Defined
Mission Critical*: Vital function (such as production and sales) without which a firm cannot operate or remain viable. If a critical business function is interrupted, a firm could suffer serious financial, legal, or other damages or penalties.
“We’ve enjoyed faultless performance with Windows Server and SQL Server.” “We haven’t had any unscheduled downtime.”
- Dr. Elnaggar, IT Director at Bavarian Auto Group.
* Business Dictionary: http://www.businessdictionary.com/definition/critical-business-function.html
System attributes to support Mission Critical
Reliable – system is tolerant of various component failures
Available – application is accessible across system outages
Serviceable – systems are monitored, self-correct, and notify when necessary
Scalability / Performance – systems can scale to the needs of the business while maintaining consistent and
Mission Critical Application
DEMO
Hardware and Operating System
Combined Power of Windows Server 2008 R2, SQL Server 2008 R2 and Intel Xeon 7500 is Mission Critical
Intel and Microsoft delivering together
Scale-up and scale-out capabilities
• Windows Server clustering
• Hyper-V Virtualization
Business continuity and manageability
• Multi-site management
• Enterprise class error checking and recovery
And …
Synergy of Windows Server 2008 R2 + Intel Xeon 7500
Power Management
• Timer coalescing
• Tick skipping
• Core parking
• Report power consumption to OS via ACPI
• Accessible via WMI (reading/writing of power plans – active plan can be changed remotely)
Virtualization
• SLAT
• VMQ
• Jumbo Frames
• Intel VT
Scalability
• 256 Logical Processors
• Turbo Boost
• QuickPath
• 16 MB L3 Cache (7400)
• Multi-site manageability
RAS
• Memory Mirroring – writes to 2 locations to compensate for DRAM failure
• Memory Sparing – predicts a failing DIMM and copies data to a spare DIMM
• I/O Hot plug
• MCA Recovery
• WHEA – root cause
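The memory mirroring entry in the RAS list can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not a description of how the Xeon 7500 memory controller works: the class name `MirroredMemory`, the `fail_primary` helper, and the per-address failure model are all hypothetical and exist only to show the idea of writing to two locations and reading from the surviving copy.

```python
class MirroredMemory:
    """Toy model of memory mirroring: every write lands in two banks."""

    def __init__(self, size):
        self.primary = [0] * size
        self.mirror = [0] * size
        # Addresses whose primary copy is unreadable (hypothetical failure model).
        self.failed = set()

    def write(self, addr, value):
        # A mirrored write updates both copies.
        self.primary[addr] = value
        self.mirror[addr] = value

    def fail_primary(self, addr):
        # Simulate a DRAM failure affecting the primary copy of one address.
        self.failed.add(addr)

    def read(self, addr):
        # Fall back to the mirror when the primary copy has failed.
        if addr in self.failed:
            return self.mirror[addr]
        return self.primary[addr]


mem = MirroredMemory(size=8)
mem.write(3, 1234)
mem.fail_primary(3)
value_after_failure = mem.read(3)  # served from the mirror copy
```

The cost of this scheme is capacity: in the sketch, as in mirroring generally, every stored value occupies two locations.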
Machine Check Architecture - Recovery
Video
Built-In Redundancy & Failover Throughout the Platform
Socket Redundancy & Failover
• Dynamic OS Assisted Processor Socket Migration*
• Electronically Isolated (Static) Partitioning
Memory Redundancy & Failover
• Inter-socket Memory Mirroring
• Intra-socket Memory Mirroring
• Intel® SMI Lane Failover
• Intel® SMI Clock Fail Over
• Intel® SMI Packet Retry
• Memory DIMM and Rank Sparing
• Dynamic Memory Migration
• Fail Over from Single DRAM Device Failure (SDDC)
• Recovery from Single DRAM Device Failure (SDDC) plus random bit error
Intel® QPI Redundancy & Failover
• QPI Self-Healing
• QPI Clock Fail Over
• Intel QPI Packet Retry
[Diagram labels: Intel® QPI; four NHM-EX processors; four Memory blocks; two IOHs; two PCI Express* 2.0 links; ICH10]
Intel® QPI = Intel® QuickPath Interconnect
Intel® SMI = Intel® Scalable Memory Interconnect
Machine Check Architecture Recovery
Allows Recovery From Otherwise Fatal System Errors
Normal Status
With Error
Prevention
First Machine Check Recovery in Xeon®-based Systems
*Errors detected using Patrol Scrub or Explicit Write-back from cache
Previously seen only in RISC, mainframe, and Itanium-based systems
[Diagram labels: eight registered DDR3 DIMMs (REG, DRAM), two SMBs, two SMI links]
HW Correctable Errors
• Error Corrected
Un-correctable Error
• Error Detected*: Patrol Scrubber scans memory for errors
• Error Contained: Bad memory location flagged so data will not be used by OS or applications
• Error information passed to OS / VMM
• System Recovery with OS: System works in conjunction with OS or VMM to recover or restart processes and continue normal operation
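The recovery flow on this slide (detect, contain, report, recover) can be sketched as a small simulation. This is an illustrative model only, assuming a hypothetical `Page` record with an `owner` process name and a made-up three-value error state; it does not reflect real Windows or Intel interfaces.

```python
from dataclasses import dataclass


@dataclass
class Page:
    owner: str              # process using this page (hypothetical name)
    state: str = "ok"       # "ok", "correctable", or "uncorrectable"
    poisoned: bool = False  # set when the location must no longer be used


def patrol_scrub(pages):
    """Scan all pages; fix correctable errors, flag uncorrectable ones."""
    reports = []
    for number, page in enumerate(pages):
        if page.state == "correctable":
            page.state = "ok"        # hardware corrects the error
        elif page.state == "uncorrectable":
            page.poisoned = True     # contain: flag the bad location
            reports.append(number)   # error information passed to OS / VMM
    return reports


def os_recover(pages, reports, running):
    """Restart only the processes that own a flagged page."""
    restarted = {pages[number].owner for number in reports}
    untouched = set(running) - restarted
    return sorted(restarted), sorted(untouched)


pages = [Page("sql"), Page("web", "correctable"), Page("batch", "uncorrectable")]
reports = patrol_scrub(pages)
restarted, untouched = os_recover(pages, reports, {"sql", "web", "batch"})
```

The point of the sketch is the containment property: only the owner of the flagged page is restarted, while the other processes keep running.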
Windows Hardware Error Architecture
Introduced in Windows Server 2008*
• Better root cause analysis
– Error reporting via common error record format, richer data content (e.g. FRU info)
– Platform and the OS flows are well integrated which allows both to contribute information to the log
• Better support for hardware error recovery
– Built in infrastructure for error injection
– Platform Specific Hardware Error Driver (PSHED) Plugins allow for platform participation in error recovery
• Error avoidance with health monitoring
– Allows for applications to register for hardware error event notification
– PFA apps can be used to monitor platform health
• WHEA enhancements on Intel® Architecture in Windows Server 2008* R2
– Support for Nehalem-EX MCA recoverable errors
– Corrected Machine Check Interrupt (CMCI) error handling support
Intel® server processors codename Nehalem-EX
MCA Recovery: Explicit Write Back Error
[Diagram labels: CPU with Cores (Core0, Core7) and UnCore (LLC, Link, Memory Controller); WB Data with Poison Tag; Memory; New Data]
Log the error
• MCi_Status.Valid = 1
• MCi_Status.EN = 1 (Error enabled)
• MCi_Status.UC (uncorrected error) = 1
• MCi_Status.PCC (Process context corrupt) = 0
• MCi_Status.OVER (overflow) = 0
• MCi_Status.MCA_error_codes indicates which error is detected
• MCG_Status.RIPV = 1
• MCi_Status.ADDRV = 1
• MCi_Status.MISCV = 1
• MCi_Status.MSCOD = poison (model specific)
EWB Error detected
Data stored with poison bit
System Software recovers the error
Broadcast MCE to all threads
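The status fields logged on this slide can be decoded from a raw register value. The sketch below is a rough illustration: the bit positions are taken from the IA32_MCi_STATUS and IA32_MCG_STATUS layouts in the Intel Software Developer's Manual as best recalled here and should be verified against the manual, and the function names and example value are hypothetical. The check mirrors only the flag pattern listed on the slide, not the full recoverability rules.

```python
# Assumed bit positions in IA32_MCi_STATUS (verify against the Intel SDM).
STATUS_FLAGS = {
    "VAL": 63,    # register contains valid information
    "OVER": 62,   # overflow: an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting enabled
    "MISCV": 59,  # MCi_MISC holds valid data
    "ADDRV": 58,  # MCi_ADDR holds a valid address
    "PCC": 57,    # processor context corrupt
}
MCG_STATUS_RIPV = 0  # assumed bit position: restart IP valid


def decode_mci_status(value):
    """Split a 64-bit status value into named flags and error codes."""
    fields = {name: (value >> bit) & 1 for name, bit in STATUS_FLAGS.items()}
    fields["MSCOD"] = (value >> 16) & 0xFFFF  # model-specific error code
    fields["MCACOD"] = value & 0xFFFF         # architectural MCA error code
    return fields


def matches_slide_pattern(status, mcg_status):
    """True when the flags match the combination listed on the slide."""
    f = decode_mci_status(status)
    ripv = (mcg_status >> MCG_STATUS_RIPV) & 1
    return bool(
        f["VAL"] and f["EN"] and f["UC"] and f["ADDRV"] and f["MISCV"]
        and not f["PCC"] and not f["OVER"] and ripv
    )


# Hypothetical example value with VAL, UC, EN, MISCV and ADDRV set.
example_status = (1 << 63) | (1 << 61) | (1 << 60) | (1 << 59) | (1 << 58)
example_matches = matches_slide_pattern(example_status, mcg_status=0b1)
```

With PCC = 0 and RIPV = 1, the processor context is intact and execution can resume, which is what leaves room for system software to recover instead of halting.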
MCA Predictive Failure Notification
[Diagram labels: CPU with Cores (Core0, Core7) and UnCore (LLC, Link, Memory Controller); Memory; New Data]
1. Memory Error is Detected And Corrected
2. Corrected Error Count is Incremented
3. Error Count Exceeds Threshold
4. Uncore Issues CMCI to the OS Handler
Example: OS Initiates Fail-over to Spares
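The four steps of predictive failure notification amount to a thresholded counter with a callback. The following is a simplified sketch: the class name `CorrectedErrorCounter`, the `on_cmci` callback, and the DIMM label are hypothetical, and the "exceeds threshold" comparison follows the slide's wording rather than any specific hardware register definition.

```python
class CorrectedErrorCounter:
    """Counts corrected errors and signals once when a threshold is passed."""

    def __init__(self, threshold, on_cmci):
        self.threshold = threshold
        self.on_cmci = on_cmci  # stands in for the OS CMCI handler
        self.count = 0
        self.signaled = False

    def record_corrected_error(self, dimm):
        # Steps 1 and 2: the error was corrected, so only the count changes.
        self.count += 1
        # Step 3: compare against the threshold.
        if self.count > self.threshold and not self.signaled:
            self.signaled = True
            # Step 4: notify the OS handler.
            self.on_cmci(dimm)


failed_over = []


def fail_over_to_spare(dimm):
    # Example OS reaction from the slide: fail over to spares.
    failed_over.append(dimm)


counter = CorrectedErrorCounter(threshold=3, on_cmci=fail_over_to_spare)
for _ in range(5):
    counter.record_corrected_error("DIMM_A1")
```

Because corrected errors do not disturb running software, the OS can act on the trend (for example by moving data to a spare) before the device produces an uncorrectable error.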
Software Error Recovery Motivation
Ability to isolate uncorrectable errors and achieve containment
– Allows for OS to terminate/restart an application mapped to that address or the VMM to terminate the guest OS
– System remains active running other applications or guest OSs
– Increase the system up time (RAS); important requirement for servers
These errors are detected ahead of software consumption
– Provide software an opportunity to attempt to recover from an uncorrected error before the error brings down the machine
Potential error recovery cases
– Uncorrected errors detected outside of program execution have potential for error recovery
- e.g. DRAM patrol scrubber
Potential to extend architecture capability in future to cover cases where software consumes erroneous data
Application & System Tools
Microsoft Virtualization for Server Applications
Virtualization Platform
Mission Critical Applications Management Platform
Enterprise Applications
Line Of Business (LOB) Custom Applications
Database
Communication
Business Applications
Microsoft Server Applications
Collaboration
Hyper-V™
Microsoft Virtualization = Windows Server 2008 R2 Hyper-V + System Center
Virtual Memory & Second-Level Translation
• With Virtualization an additional level of mapping is required
• Second Level Address Translation (SLAT) provides the extra translation into Virtual Machine address spaces
• Performance advantage over non-enabled CPUs
[Diagram labels: The Virtual / Process view; The Physical / real view; Physical Memory Pages; Virtual Machine 1; Virtual Machine 1; Virtual Machine 3; Hypervisor; Operating System]
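The additional level of mapping that virtualization requires can be shown with two lookup tables applied in sequence. This is a conceptual sketch only: real page tables are multi-level structures walked by hardware, and the dictionary-based tables, the 4 KB page size, and the function names here are assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size for the example


def translate(table, address):
    """Map an address through one page table (page number -> frame number)."""
    page, offset = divmod(address, PAGE_SIZE)
    if page not in table:
        raise KeyError(f"page {page} is not mapped")
    return table[page] * PAGE_SIZE + offset


def guest_virtual_to_host_physical(guest_table, second_level_table, address):
    """Apply the guest OS mapping, then the second-level (hypervisor) mapping."""
    guest_physical = translate(guest_table, address)
    return translate(second_level_table, guest_physical)


# Hypothetical mappings: guest page 0 -> guest frame 5 -> host frame 42.
guest_table = {0: 5}
second_level_table = {5: 42}
host_address = guest_virtual_to_host_physical(guest_table, second_level_table, 0x123)
```

On CPUs without SLAT, the hypervisor has to maintain the combined mapping in software; SLAT-capable hardware performs the second lookup itself, which is the source of the performance advantage the slide mentions.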
Second Level Address Translation
DEMO
What Makes a System Mission Critical?
Scale Up Configuration
• SAN for SQL and Files
• SAN
• LOB Apps, Windows Server App Fabric
• SQL Server 2008 R2, Windows Server 2008 R2
• Fiber Optic channel to SAN
Scale Out Configuration
• SAN based for SQL and Files
• SAN
• Dual Gigabit Ethernet on PCIe bus
• HP BL465
• CICS COBOL apps, Micro Focus Server EE, Windows Server 2008
• SQL Server 2008 R2
• Windows Server 2008 R2
• HP BL465
• CICS COBOL apps, Micro Focus Server EE, Windows Server 2008
• LOB Apps
• Windows Server App Fabric, Windows Server 2008 R2
• Fiber Optic channel to SAN
Scale Out Virtualized
• Dual Gigabit Ethernet on PCIe bus
• SQL Server 2008 R2
• Windows Server 2008 R2
• LOB Apps, App Server
• Windows Server 2008 R2
• Windows Server 2008 R2, Hyper-V Virtualization Server
• LOB Apps, App Server
• Windows Server 2008 R2
• LOB Apps, App Server
• Windows Server 2008 R2
• Fiber Optic channel to SAN
• SAN based SQL and files
• SAN
Core Parking & SQL Scale Up
DEMO
Operational Best Practices
Operations practices based on Information Technology Infrastructure Library (ITIL) / Microsoft® Operations Framework (MOF)
• Change management
• Incident management
• Problem management
Dedicated Service Operations Center (SOC)
• Focused on BPO
• Experts in online collaboration services
Dedicated service administration team
ISO 27001 aligned operational procedures
Hardware Provisioning
Deployment, Patching and State Mgmt
Virtual Workload Provisioning
Mobile Device Management
Performance and Health Monitoring
Backup & Disaster Recovery
The Microsoft Platforms are Mission Critical Today
SunGard
• 1024-Core Computing Grid running Windows Server 2008 and SQL Server 2008
• Asset Liability management (ALM) - Near Linear scalability
bwin
• 30,000 Transactions per Second at peak
• 1 Million bets per day
• 100 Terabytes of data
Siemens
• PLM system supports 5,000 concurrent users
• Gained 50% of space through compression
SunGard - http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?casestudyid=4000006391
bwin - http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?casestudyid=4000004138
Siemens - http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?casestudyid=4000004826
Mission Critical Wrap-up
• Windows Server 2008 R2 and SQL Server 2008 R2 are mission critical
• Hardware partners provide scale-up and resilient platform
• Windows Server + Intel Xeon 7500 can detect and recover from hardware errors
• Democratizing Mission Critical
Related Content
Breakout Sessions
• Deploying, Virtualizing, and Managing Linux and UNIX with Hyper-V
• Manage Your Enterprise from a Single Seat: Windows PowerShell Remoting
• Living in a Mixed Environment: Integrating Your Heterogeneous Infrastructure
• Building a Business Critical System: Technology, Architecture, and Process
Interactive Sessions
• Next Generation VDI with Microsoft RemoteFX
• Lighting Up Nehalem EX with Windows Server 2008 R2
Hands-on Labs
• Implementing High Availability
Product Demo Stations
• Windows Server 2008 R2 Failover Clustering
Resources
• Sessions On-Demand & Community: www.microsoft.com/teched
• Microsoft Certification & Training Resources: www.microsoft.com/learning
• Resources for IT Professionals: http://microsoft.com/technet
• Resources for Developers: http://microsoft.com/msdn
Complete an evaluation on CommNet and enter to win!
Sign up for Tech·Ed 2011 and save $500 starting June 8 – June 31st
http://northamerica.msteched.com/registration
You can also register at the North America 2011 kiosk located at registration
Join us in Atlanta next year
© 2010 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
JUNE 7-10, 2010 | NEW ORLEANS, LA