Failover Clustering Pro Troubleshooting with Windows Server 2008 R2

Steven Ekren, Senior Program Manager, Microsoft Corporation

SESSION CODE: WSV314


Page 2: Agenda

• Qualifying Clusters
• Cluster Validation
• Cluster Event Logging
• Cluster Debug Logging
• Troubleshooting Tips


Page 4: Win2008+ Cluster Support Policy

What defines Microsoft supportability:

• Purchase components which all have a “Certified for Windows Server 2008” logo
  • Servers, Storage, HBAs, MPIO DSMs, etc…
• Connect your hardware
• Run Validate to verify interoperability
• If Validate passes, it’s supported!
• If you make a change… just re-run Validate
  • It’s that simple!
• No changes to this policy in Win2008 R2
• See this doc for more details: http://go.microsoft.com/fwlink/?LinkID=119949

Page 5: Frequently Asked Questions

• Can I create Guest Clusters with VMs? Yes!
• Mix physical and virtual in the same cluster? Yes!
• Do the servers have to be identical? No
• If it passes Validate, it’s supported!


Page 7: Cluster Validation Tool

• Built into the product
• Tests a collection of servers and storage that is intended to be a cluster
• Run Validate each and every time you…
  • Create a new cluster
  • Add a node, disk, or network
  • Update system software (drivers, firmware, service packs)
  • Configure hardware (HBA, MPIO, NIC teaming)
  • Change any component in your solution
• It’s the very first thing you do!
• Run on configured clusters as a diagnostic tool

Page 8: Enhanced Validation in R2

• Scriptable with Test-Cluster PowerShell cmdlet
• Collects configuration information
• Includes additional “best practices” tests
  • GUI Dependencies
  • Quorum configuration
  • Status of cluster resources
• Offers prescriptive guidance to achieve higher availability

Page 9: New Validation Tests in R2

Cluster Configuration
• List Information (Core Group, Networks, Resources, Storage, Services and Applications)
• Validate Quorum Configuration
• Validate Resource Status
• Validate Service Principal Name
• Validate Volume Consistency

Network
• List Network Binding Order
• Validate Multiple Subnet Properties

System Configuration
• Validate Cluster Service and Driver Settings
• Validate Memory Dump Settings
• Validate OS Installation Options
  • Replaced Validate Operating Systems
• Validate System Driver Variable

Page 10: Using Validate

DEMO

Page 11: Troubleshooting Tips

It’s best to use Validate first when:
• Problem creating a cluster…
• Problem with storage…

Considerations…
• Validating storage requires disks to be Offline, which means you need to schedule a maintenance window
• Running Validate with only a single node won’t help you much…
• You don’t always need to run a FULL validate
  • http://technet.microsoft.com/en-us/library/cc732035(WS.10).aspx
• Don’t “assume” the cluster will work and skip Validate

Page 12: Validate Logging

Client side – Holistic Overview
• Client refers to the system running CluAdmin.msc which is invoking Validate (could be running RSAT)
• Global view
• Log file in C:\Windows\Cluster\Reports
  • “Validation Report YYYY.MM.DD at HH.MM.SS.MHT”

Server side – Verbose Debug
• For storage tests there is a verbose log which can contain additional information
• Log file in C:\Windows\Cluster\Reports
  • ValidateStorage.log
• Similar in granularity to a Cluster.log
• Each node has a unique log
  • Need to collect logs from all nodes to get a holistic view
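The client-side report name embeds its creation time, so the most recent report in a folder can be picked out by parsing the file name. A minimal Python sketch, assuming the names follow the “Validation Report YYYY.MM.DD at HH.MM.SS” pattern shown above; the sample file names and the helper names are made up for illustration:

```python
from datetime import datetime

# Pattern taken from the slide: "Validation Report YYYY.MM.DD at HH.MM.SS.MHT"
REPORT_PREFIX = "Validation Report "
REPORT_TIME_FORMAT = "%Y.%m.%d at %H.%M.%S"


def report_timestamp(file_name):
    """Return the datetime embedded in a Validate report name, or None."""
    stem, dot, extension = file_name.rpartition(".")
    if not dot or extension.lower() != "mht" or not stem.startswith(REPORT_PREFIX):
        return None
    try:
        return datetime.strptime(stem[len(REPORT_PREFIX):], REPORT_TIME_FORMAT)
    except ValueError:
        return None


def newest_report(file_names):
    """Pick the most recent Validate report from a list of file names."""
    dated = [(report_timestamp(name), name) for name in file_names]
    dated = [(stamp, name) for stamp, name in dated if stamp is not None]
    return max(dated)[1] if dated else None


# Hypothetical listing of the Reports directory (sample names only)
listing = [
    "Validation Report 2010.06.07 at 09.15.00.mht",
    "Validation Report 2010.06.09 at 14.30.45.mht",
    "ValidateStorage.log",
]
print(newest_report(listing))
```

Files that do not match the pattern, such as ValidateStorage.log, are simply ignored, so the same listing can be reused when collecting the per-node storage logs separately.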


Page 14: Event Viewer

Where to find Cluster events in Event Viewer

Cluster Events
Level           System Channel   Operational Channel
Critical        ✓
Error           ✓
Warning         ✓
Informational                    ✓

• Operational channel found under:
  – Applications and Services Logs \ Microsoft \ Windows \ FailoverClustering

Page 15: Viewing Events Cluster Wide

• Failover Cluster Manager (CluAdmin.msc) provides an aggregated view of cluster events from all nodes
• Click “Recent Cluster Events” to see all Errors and Warnings cluster wide in the last 24 hours
• Build your own event queries

Page 16: Built-In Event Queries

On the right-hand ‘Actions’ pane in Failover Cluster Management there are links to open filtered events

Application Level
• Events associated with all resources in the group

Resource Level
• Events related to that specific resource

Page 17: New Diagnostic Logging with R2

• Capture snap-in pop-ups
  • Even before cluster creation
• New debug logging channels
  • Disabled by default
  • Enabled for advanced troubleshooting
• Cluster.log converted to an ETW channel, now appears in Event Viewer as well

Tip: Be sure to click on View / Show Analytic and Debug Logs

Page 18: Capture Event Queries

• Save Failover Cluster Manager event query results as EVTX files for future analysis
• Enables you to build an aggregated / filtered collection of the events needed and send them to someone else

Page 19: Viewing Cluster Events

DEMO

Page 20: Understanding Cluster Events

• Every cluster event edited with improved descriptive text and error codes in Win2008
• Online troubleshooting steps for all cluster events:
  http://technet2.microsoft.com/windowsserver2008/en/library/19adfd9a-6688-455c-8c33-4fc4b0da6e251033.mspx?mfr=true

Page 21: Monitoring Cluster Events

Fully featured Failover Cluster Management Packs for:
• System Center Operations Manager 2007
• Microsoft Operations Manager 2005

Page 22: Troubleshooting Tips

• When you encounter a problem, always, always, always start with cluster Events
  • Look at a cluster wide view of the cluster events
  • Dig into all events in the System Event log
  • Check the Application Event log
• Don’t be distracted by symptoms - focus on root cause
  • For example, if you see cluster IP Address failures, don’t waste lots of time looking at cluster events
  • Instead look for other networking related errors
• There may be multiple retries after a failure, producing more events. Look for what caused the first failure
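To illustrate “look for what caused the first failure”: when events from several logs are merged and sorted by time, the earliest error or warning is usually closer to the root cause than the retries that follow. A minimal Python sketch; the event records, log names, and messages below are invented sample data, not real cluster events:

```python
from datetime import datetime

# Invented sample events: (timestamp, log, level, message)
events = [
    (datetime(2010, 6, 7, 10, 0, 42), "System", "Error", "Cluster IP Address resource failed"),
    (datetime(2010, 6, 7, 10, 0, 5), "System", "Warning", "Network adapter link down"),
    (datetime(2010, 6, 7, 10, 1, 30), "System", "Error", "Cluster IP Address resource failed"),
    (datetime(2010, 6, 7, 9, 58, 0), "Application", "Information", "Backup completed"),
    (datetime(2010, 6, 7, 10, 2, 15), "System", "Error", "Cluster IP Address resource failed"),
]

PROBLEM_LEVELS = {"Critical", "Error", "Warning"}


def first_failure(all_events):
    """Return the earliest Critical/Error/Warning event, or None."""
    problems = [event for event in all_events if event[2] in PROBLEM_LEVELS]
    return min(problems, key=lambda event: event[0]) if problems else None


def count_repeats(all_events):
    """Count how often each problem message occurs (retries inflate this)."""
    counts = {}
    for _, _, level, message in all_events:
        if level in PROBLEM_LEVELS:
            counts[message] = counts.get(message, 0) + 1
    return counts


root = first_failure(events)
print("First failure:", root[0].isoformat(), root[3])
print("Repeats:", count_repeats(events))
```

In this made-up data the repeated IP Address failures are the noisy symptom, while the single earlier networking warning is the entry worth investigating first.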


Page 24: Cluster Debug Logging

• Migration from legacy cluster debug logging (cluster.log) to Event Tracing for Windows (ETW)
  • Legacy text based Cluster.log no longer exists
• All cluster debug logging done to an event trace session: Microsoft-Windows-FailoverClustering

Page 25: Configuring the Log

• Logging enabled by default
• Log files stored as .ETL in:
  %WinDir%\System32\winevt\logs\Microsoft-Windows-FailoverClustering
• Default log size is 100 MB
  • Stored as a cluster property (ClusterLogSize)
  • Change via: Set-ClusterLog -Size 100
• Each time a node is rebooted, file suffix is incremented
  • ~Diagnostic.etl.001
• Up to three log files
  • This means log history can be kept for up to three reboots
  • The number of logs can be modified via the registry:
    HKLM\Software\Microsoft\Windows\CurrentVersion\WINEVT\Channels\Microsoft-Windows-FailoverClustering/Diagnostic\FileMax

Page 26: ETL Logging

• The cluster debug logging goes to ETL files in the %systemroot%\system32\winevt\Logs subdirectory
  • Microsoft-Windows-FailoverClustering%4Diagnostic.etl.00x
• Each individual log is circular
• Reboot will cause logging to start a new log
• By default, there are 3 logs and each log has a maximum size of 100 MB

[Diagram: Etl.001, Etl.002, Etl.003, with a Reboot between each file; each of these logs is circular]

Page 27: Understanding Log Gaps

• An ETL file lasts for the uptime of a node
• A new ETL file is used each time you restart the node
  • When you restart, you move on to the next file. After you have restarted 3 times you return to the first file.
• Each ETL file has a log size of 100 MB and wraps on itself, but only within its own log
• The cmdlet merges all the .ETL logging data into a single contiguous text file
  • This is extremely confusing, and a common question is where the data went
  • In reality, it is OK… you didn’t need it anyway

[Diagram: Etl.001, Reboot, Etl.002, Reboot, Etl.003]
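The rotation described above can be modelled in a few lines. A minimal Python sketch, assuming the default of three files and that logging starts in .001 and advances by one file on each restart; the function names are illustrative, and this models the behavior as the slides describe it rather than the actual ETW implementation:

```python
def active_etl_suffix(reboot_count, file_max=3):
    """Suffix of the ETL file in use after a given number of restarts.

    Assumes logging starts in .001 and moves to the next file on every
    restart, wrapping back to .001 after file_max files.
    """
    if reboot_count < 0 or file_max < 1:
        raise ValueError("reboot_count must be >= 0 and file_max >= 1")
    return "{:03d}".format(reboot_count % file_max + 1)


def overwritten_on_next_reboot(reboot_count, file_max=3):
    """Suffix of the file whose history is lost at the next restart."""
    return active_etl_suffix(reboot_count + 1, file_max)


for reboots in range(5):
    print(reboots, "restarts -> Diagnostic.etl." + active_etl_suffix(reboots))

# Upper bound on retained debug data with the defaults from the slides
print("Max retained:", 3 * 100, "MB")
```

Because each file also wraps within itself, a long uptime followed by several quick restarts can leave gaps in the merged text log, which is the source of the “where did the data go” question.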

Page 28: Producing the Cluster.log

• Cluster trace session can be dumped to a text file that looks very similar to the legacy Cluster.log
• Dumps the Cluster ETW channel to a text log located at:
  %WinDir%\Cluster\Reports\Cluster.log

Get-ClusterLog cmdlet

Switch         Effect
-Destination   Dump the logs on all nodes in the entire cluster and copy them to a single directory at this location
-TimeSpan      Just dump the last X minutes of the log
-Node          Useful when the ClusSvc is down to dump a specific node’s logs
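As a rough illustration of how the switches combine, the sketch below assembles a Get-ClusterLog command line in Python. Only the three switches listed above are used; the helper name, the sample path, and the node name are hypothetical, and the string is printed rather than executed:

```python
def build_get_clusterlog(destination=None, time_span_minutes=None, node=None):
    """Assemble a Get-ClusterLog command string from optional switches."""
    parts = ["Get-ClusterLog"]
    if node is not None:
        parts += ["-Node", node]
    if destination is not None:
        # Quote the path so directories containing spaces survive
        parts += ["-Destination", '"{}"'.format(destination)]
    if time_span_minutes is not None:
        if time_span_minutes <= 0:
            raise ValueError("time_span_minutes must be positive")
        parts += ["-TimeSpan", str(int(time_span_minutes))]
    return " ".join(parts)


# Hypothetical examples
print(build_get_clusterlog())
print(build_get_clusterlog(destination=r"C:\Temp\Logs", time_span_minutes=15))
print(build_get_clusterlog(node="NODE1"))
```

Limiting the time span keeps the text file small when the problem is recent, which makes the bottom-up search for errors much quicker.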

Page 29: Viewing Log

Tracerpt.exe can be used to dump the trace session to:
• .EVTX and view in Event Viewer (eventvwr.msc)
• .XML and apply a script to parse the XML log data into any format you please

Page 30: Cluster Logging Levels

• Logging level is configurable cluster wide
  • Set-ClusterLog -Level 3
• Logging levels to control Cluster.log granularity:

Cluster Log Output Levels
Level          Error   Warning   Info   Verbose   Debug
0 (disabled)
1              ✓
2              ✓       ✓
3              ✓       ✓         ✓
4              ✓       ✓         ✓      ✓
5              ✓       ✓         ✓      ✓         ✓

Default: level 3. Higher levels can have performance impact.
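The level table above can be read as a cumulative filter: each level adds one more category on top of the previous one. A minimal Python sketch of that rule; the names are illustrative:

```python
# Categories in the order the table adds them, level 1 through 5
CATEGORIES = ["Error", "Warning", "Info", "Verbose", "Debug"]


def categories_for_level(level):
    """Return the message categories written at a given logging level."""
    if not 0 <= level <= len(CATEGORIES):
        raise ValueError("level must be between 0 and 5")
    return CATEGORIES[:level]


def is_logged(category, level):
    """True if a message of this category appears at this level."""
    return category in categories_for_level(level)


for level in range(6):
    print(level, categories_for_level(level) or "(disabled)")
```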

Page 31: Cluster Logging

DEMO

Page 32: Logging Tips

All cluster logs are captured in a single directory: C:\Windows\Cluster\Reports

Includes:
• Validation Report .MHT logs
• ValidateStorage.log
• CreateCluster .MHT logs
• Add Node .MHT logs
• Log files for configuring an HA role
• Cluster.log
• Enabling Disks for CSV logs

A great directory to zip up and send off when needing help

Page 33: Troubleshooting Tips

• The cluster log is verbose and complex!
  • It should be the last place you go, not the first
• Make sure your cluster.log captures at least 72 hours of data
  • Mileage will vary depending on how noisy apps are
• Cluster log timestamps are in GMT, while event log timestamps are in local time
• Start at the bottom of the log and work your way backwards searching for “ERR” lines
• Use NET HELPMSG to decipher error codes
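Two of these tips, reading from the bottom for “ERR” lines and reconciling GMT cluster log timestamps with local-time event logs, can be sketched in Python. The log lines below are invented, and the line layout (process.thread::date-time LEVEL message) and the UTC-5 offset are assumptions for illustration:

```python
import re
from datetime import datetime, timedelta, timezone

# Invented sample lines; assumed layout: pid.tid::YYYY/MM/DD-HH:MM:SS.mmm LEVEL text
SAMPLE_LOG = """\
000004d0.00000a10::2010/06/07-15:00:01.100 INFO  [RES] Network Name: online
000004d0.00000a10::2010/06/07-15:02:10.250 ERR   [RES] IP Address: failed
000004d0.00000a14::2010/06/07-15:02:11.000 WARN  [RCM] restarting resource
000004d0.00000a14::2010/06/07-15:03:40.500 ERR   [RES] IP Address: failed
"""

LINE_PATTERN = re.compile(
    r"^[0-9a-f]+\.[0-9a-f]+::(\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+(\w+)\s+(.*)$"
)


def err_lines_newest_first(log_text, local_zone):
    """Yield (local_time, message) for ERR lines, starting at the bottom."""
    for line in reversed(log_text.splitlines()):
        match = LINE_PATTERN.match(line)
        if not match or match.group(2) != "ERR":
            continue
        gmt = datetime.strptime(match.group(1), "%Y/%m/%d-%H:%M:%S.%f")
        gmt = gmt.replace(tzinfo=timezone.utc)
        yield gmt.astimezone(local_zone), match.group(3)


assumed_local_zone = timezone(timedelta(hours=-5))  # assumption: UTC-5
for local_time, message in err_lines_newest_first(SAMPLE_LOG, assumed_local_zone):
    print(local_time.strftime("%Y-%m-%d %H:%M:%S"), message)
```

Converting the timestamps up front makes it easier to line up an “ERR” entry with the System event log entries recorded around the same moment.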

Page 34: Performance Counters

• Monitor cluster API, resource failures, communication, I/O patterns
  • Useful for monitoring CSV I/O redirection
• See this blog for more details: http://blogs.msdn.com/clustering/archive/2009/09/04/9891266.aspx

Page 35: PowerShell Support

• All Cluster cmdlets now have full help online (i.e. it’s searchable!)
  • http://technet.microsoft.com/en-us/library/ee461009.aspx
• Generate and configure cluster debug log
• Support for “read-only” access
  • Enables help desk to view (and not modify) the state of the cluster


Page 37: CSV Troubleshooting Tips

• CSV will redirect I/O over multiple fabrics
  • Direct I/O over block storage: Fibre Channel / SAS / iSCSI
  • Redirected I/O over file storage: Network over SMB
• When troubleshooting a CSV “storage” problem, it could really be a network problem
  • Check network connectivity between nodes
  • Ability to authenticate with a domain controller
• Don’t make assumptions, things are different!

Page 38: Troubleshooting RHS Terminations

How clustering deals with unresponsive resources
a) RHS makes calls to resources (IsAlive, LooksAlive, Online, Offline, Terminate, etc…)
b) If that resource doesn’t respond, cluster health detection attempts to recover
c) The RHS process is restarted, so the resource can be restarted

Generates an Event 1230:
Cluster resource 'Resource Name' (resource type '', DLL 'blah.dll') either crashed or deadlocked. The Resource Hosting Subsystem (RHS) process will now attempt to terminate, and the resource will be marked to run in a separate monitor.

Do normal troubleshooting!
• What was the resource trying to do? See http://support.microsoft.com/kb/914458
• Look for underlying core failures / events
  • Physical Disk… look for storage issues
  • Network Name… look for networking issues

See these blogs for more details:
• http://blogs.technet.com/askcore/archive/2009/11/23/resource-hosting-subsystem-rhs-in-windows-server-2008-failover-clusters.aspx
• http://blogs.msdn.com/clustering/archive/2009/06/27/9806160.aspx
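When triaging many Event 1230 entries, it helps to pull out which resource and DLL were involved so that the matching storage or networking events can be looked up. A minimal Python sketch based on the message text quoted above; the sample resource name, type, and DLL are hypothetical, and real event text may differ in punctuation:

```python
import re

# Based on the Event 1230 wording quoted on the slide
EVENT_1230 = re.compile(
    r"Cluster resource '(?P<name>[^']*)' "
    r"\(resource type '(?P<type>[^']*)', DLL '(?P<dll>[^']*)'\) "
    r"either crashed or deadlocked"
)


def parse_event_1230(message):
    """Return resource name, type, and DLL from an Event 1230 message."""
    match = EVENT_1230.search(message)
    return match.groupdict() if match else None


# Hypothetical message for illustration
sample = (
    "Cluster resource 'Cluster Disk 1' (resource type 'Physical Disk', "
    "DLL 'clusres.dll') either crashed or deadlocked. The Resource Hosting "
    "Subsystem (RHS) process will now attempt to terminate, and the resource "
    "will be marked to run in a separate monitor."
)
print(parse_event_1230(sample))
```

The extracted resource type points at where to look next: a Physical Disk resource suggests storage events, a Network Name resource suggests networking events.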

Page 39: Ensuring Graceful Shutdown Tip

When you click Start / Shutdown…
• Service Control Manager issues a STOP to all services
• Services are given 20 seconds to stop
• Services are terminated if they do not complete in time

Stopping a large number of VMs could take longer than 20 seconds…
• ClusSvc will notify SCM in R2
• If Win2008, either:
  a) Offline / Move all groups prior to shutdown
  b) Manually modify SCM’s timeout
     HKLM\SYSTEM\CurrentControlSet\Control\WaitToKillServiceTimeout
     http://technet.microsoft.com/en-us/library/cc976045.aspx

Page 40: Cluster AD Considerations

• All Cluster Network Name resources now have associated computer objects
  • RequireKerberos is now on by default (and required)
• All nodes must be members of the same domain
  • Must be an Active Directory based domain
• Computer objects for cluster names are handled just like objects for normal machines
  • Passwords now rotated

Page 41: Cluster Name Object (CNO)

• CNO – The computer object associated with the Cluster Name
  • It’s special…
• When creating a cluster, you must…
  • Be logged on locally with a domain user account
  • Have administrative privileges on all nodes
  • Have “Create Computer Objects” privileges in the default Computers container
    • Or have Full Control permission to an existing disabled computer object

Page 42: Virtual Computer Object (VCO)

• VCO – The computer object associated with Network Name resources for services and applications
• Once a cluster is created, the CNO is the security context used moving forward
  • The CNO creates the VCO
  • Service is completely self managing
• CNO requires privileges to the default Computers container
  • Create Computer Objects
  • Add Workstations to a domain
    • Watch out, there is a default quota of 10!

Page 43: Creating Computer Objects

• Installer creates CNO
• CNO creates VCOs

[Diagram labels: Installer, Active Directory, Computers, Custom OU, CNO, VCO, VCO]

Page 44: Session Summary

• Validate is a pre-deployment verifier, criterion for supportability, and also a diagnostic tool
• New support policy for Win2008 and beyond
• Failover Cluster Management tool provides best mechanism to view cluster events
• Cluster logging infrastructure is new, but it delivers what you are used to
• Windows Server 2008 R2 enhances Validate tool, adds Events and perf counters, and introduces PowerShell cmdlets
• Follow best practices to keep your cluster running smoothly


Page 46: Related Content

Breakout Sessions
• WSV313 | Failover Clustering Deployment Success
• WSV314 | Failover Clustering Pro Troubleshooting with Windows Server 2008 R2
• VIR303 | Disaster Recovery by Stretching Hyper-V Clusters across Sites
• ARC308 | High Availability: A Contrarian View
• DAT207 | SQL Server High Availability: Overview, Considerations, and Solution Guidance
• DAT303 | Architecting and Using Microsoft SQL Server Availability Technologies in a Virtualized World
• DAT305 | See the Largest Mission Critical Deployment of Microsoft SQL Server around the World
• DAT401 | High Availability and Disaster Recovery: Best Practices for Customer Deployments
• DAT407 | Windows Server 2008 R2 and Microsoft SQL Server 2008: Failover Clustering Implementations
• UNC304 | Microsoft Exchange Server 2010: High Availability Deep Dive
• UNC305 | Microsoft Exchange Server 2010 High Availability Design Considerations

Interactive Sessions
• VIR06-INT | Failover Clustering with Hyper-V Unleashed with Windows Server 2008 R2
• UNC01-INT | Real-World Database Availability Group (DAG) Design
• VIR02-INT | Hyper-V Live Migration over Distance: A Multi-Datacenter Approach
• BOF34-IT | Microsoft Exchange Server High Availability and Disaster Recovery: Are You Prepared?

Hands-on Labs
• WSV01-HOL | Failover Clustering in Windows Server 2008 R2
• DAT01-HOL | Create a Two-Node Windows Server 2008 R2 Failover Cluster
• DAT02-HOL | Create a Windows Server 2008 R2 MSDTC Cluster
• DAT09-HOL | Installing a Microsoft SQL Server 2008 + SP1 Clustered Instance
• DAT12-HOL | Maintaining a Microsoft SQL Server 2008 Failover Cluster
• UNC02-HOL | Microsoft Exchange Server 2010 High Availability and Storage Scenarios
• VIR06-HOL | Implementing High Availability and Live Migration with Windows Server 2008 R2 Hyper-V

Visit the Cluster Team in the TLC: Failover Clustering Booth WSV-7

Page 47: Failover Clustering Resources

Cluster Team Blog: http://blogs.msdn.com/clustering/

Cluster Resources: http://blogs.msdn.com/clustering/archive/2009/08/21/9878286.aspx

Cluster Information Portal: http://www.microsoft.com/windowsserver2008/en/us/clustering-home.aspx

Clustering Technical Resources: http://www.microsoft.com/windowsserver2008/en/us/clustering-resources.aspx

Clustering Forum (2008): http://forums.technet.microsoft.com/en-US/winserverClustering/threads/

Clustering Forum (2008 R2): http://social.technet.microsoft.com/Forums/en-US/windowsserver2008r2highavailability/threads/

R2 Cluster Features: http://technet.microsoft.com/en-us/library/dd443539.aspx

Page 48: Resources

• Sessions On-Demand & Community: www.microsoft.com/teched
• Microsoft Certification & Training Resources: www.microsoft.com/learning
• Resources for IT Professionals: http://microsoft.com/technet
• Resources for Developers: http://microsoft.com/msdn



© 2010 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.

The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.


JUNE 7-10, 2010 | NEW ORLEANS, LA