Monitoring and Tuning Microsoft Exchange Server 2013 Performance
Jeff Mealiffe
Principal Program Manager
Microsoft Corporation
OFC-B321
Agenda
Optimize: sizing, platform recommendations, storage best practices, balance
Troubleshoot: high-level plan, high CPU scenarios, excessive memory utilization, storage perf analysis
Trend: collecting perf data, trending using BI & analysis tools, building a scorecard, rhythm of planning
Q&A
3 goals for this session:
Present best practices for optimizing Exchange performance
Provide tools/methods for diagnosing Exchange perf issues
Suggest methods for trending utilization, predicting capacity requirements
http://aka.ms/tena2014experf
Optimize
Size matters. Really.
Most critical perf optimization is sizing
Lots of guidance available from Microsoft & partners
If in doubt, round up
Plan for growth in all dimensions
Avoid reliance on customization to achieve desired perf targets – set up the deployment for success
http://aka.ms/E2013Sizing-MEC
http://aka.ms/E2013Sizing
Luke, use the calculator
No really, use it.
Role Requirements Calculator turns published sizing guidance into a modeling tool
Try out various failure scenarios
Understand the impact of different hardware & storage choices
Provides scripts for DAG, database & copy creation
Many new features:
CAS sizing
Transport storage sizing
Multiple databases per volume (JBOD) support
High availability architecture improvements
http://aka.ms/E2013Calc
Note: Baseline platform for CPU guidance changed in 2013. Don’t directly compare results from 2010 & 2013 calculators.
Power considerations
Balance between energy efficiency and throughput
Power used by CPUs, memory, disks, etc.
Large slow drives help to conserve power
Avoid overdeploying CPU & memory, but don’t skimp
In general, use “High Performance” power plan for Exchange (highest perf state, all cores unparked), or use “Balanced” if required for power efficiency
Make sure BIOS is set to allow the OS to manage power
Hyperthreading & Exchange 2013
Turn off hyperthreading (SMT)!
SMT provides a gain in processor throughput, but overall the gain is not worth the “cost” based on our lab measurements
Significant impact to some Exchange service memory footprints
Tuning .NET for store
Best practice to install KB 2803754 or KB 2803755:
http://support.microsoft.com/kb/2803754 (Windows Server 2008 R2)
http://support.microsoft.com/kb/2803755 (Windows Server 2012)
Set regkey after installing hotfix to enable: HKLM\Software\Microsoft\.NETFramework\DisableRetStructPinning (REG_DWORD) = 1
Also install .NET 4.5.1 for other performance improvements
No further updates required for Windows Server 2012 R2 (fix included in 4.5.1 on 2012 R2)
Reduces memory consumption in each store worker
No impact to sizing guidance
Memory is available for use by other processes
Decreases CPU spent in .NET garbage collector
Benefits Mailbox & multi-role
Network optimizations
Simplicity is key – multiple NICs generally don’t make sense for Exchange 2013
NIC teaming doesn’t apply
http://aka.ms/preferred
Use offload features, particularly including RSS
RSS helps to scale CPU utilization, particularly on 10GbE ports
Don’t bother adjusting specific offload settings, use defaults
Use latest (supported) version of Windows
Tuning IIS… or not
Various tuning parameters available for HTTP.sys and IIS usermode components
In general, Exchange doesn’t benefit
Exchange does tune a few connection limits at install
Keep the defaults in place
Particularly true on latest OS – Windows continues to improve OOB tuning
Pagefile
Guidance for Exchange 2013 is to use a fixed-size pagefile of RAM + 10MB, capped at 32778MB
Use a fixed-size pagefile to avoid having to grow the file under load
Likely won’t be large enough to capture a full kernel dump
Minidumps may be enough to diagnose problems
A dedicated dump file can be used – KB 969028, see DedicatedDumpFile regkey settings
MIN(RAM+10MB,32778MB)
\Memory\% Committed Bytes In Use < 80
\Memory\Available Mbytes > 5% of RAM
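The pagefile rule above can be expressed as a quick calculation; a minimal sketch in Python (the function name and sample RAM sizes are illustrative, while the RAM + 10MB formula and 32778MB cap come from the slide):

```python
def pagefile_size_mb(ram_mb: int) -> int:
    """Fixed pagefile size per the Exchange 2013 guidance:
    RAM + 10 MB, capped at 32778 MB."""
    return min(ram_mb + 10, 32778)

# A 16 GB server gets RAM + 10 MB; a 96 GB server hits the cap.
print(pagefile_size_mb(16 * 1024))  # 16394
print(pagefile_size_mb(96 * 1024))  # 32778
```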
Storage tiering
2 types of tiering to consider: manual & automatic
Manual tiering = databases placed on specific types of different storage
Automatic tiering = storage controller moves files or hottest blocks to “faster” storage type
Not recommended for Exchange
Jetstress false-positives can be very misleading
Make sure you are testing all capacity simultaneously w/Jetstress
Very hard to predict true hot blocks for Exchange
Consider the goal of simplicity, and think about slow/large disks for all Exchange storage
DAS storage cache policies
On multi-role or Mailbox role servers, configure DAS storage controllers for 100% write cache
Mailbox I/O patterns include lots of random reads, resulting in low cache hit rates
Where predictions can be made, Exchange reads ahead into the ESE cache
Other roles (CAS, AD) may benefit from some read cache on the storage controller – consider 25% write/75% read
SSL offload
SSL offload is back in Exchange 2013 SP1
Can be used to offload SSL-related processing to a load balancer or reverse proxy
Some devices have specialized hardware for SSL processing; makes sense to take advantage
Expect some reduction in CAS CPU consumption which will scale w/request rate
http://aka.ms/e2013ssloffload
Virtualization best practices
Generally, never overcommit any resource
Absolutely do not use memory overcommit
Major cause of perf problems on virtualized Exchange
Hyperthreading is OK, but size based on physical processor cores
Never oversubscribe CPU; CPU constraints cause pain:
Delivery throughput reduction = queue growth
Content indexing throughput reduction = increased IOPS
Store ROP processing throughput reduction = RPC latency & end-user pain
Size using guidance for physical, add CPU overhead
Exchange isn’t NUMA aware, use NUMA defaults
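The “size for physical, then add CPU overhead” step above can be sketched as follows; the 10% hypervisor overhead figure is an illustrative assumption, not a number from this session:

```python
import math

def vcpu_requirement(physical_core_need: float,
                     hypervisor_overhead: float = 0.10) -> int:
    """Size the VM as if it were physical, then add hypervisor CPU overhead.
    The default 10% overhead is an illustrative assumption."""
    return math.ceil(physical_core_need * (1 + hypervisor_overhead))

# A workload needing 11 physical cores would be sized at 13 vCPUs here.
print(vcpu_requirement(11))  # 13
```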
AD query performance
AD bottlenecks are a frequent cause of Exchange perf issues
Watch for high query latencies on Exchange servers & high CPU on AD
Use the built-in “Active Directory Diagnostics” data collector set to troubleshoot
Plan to deploy enough RAM on AD servers to cache the entire database file
Follow deployment guidance for core ratio (also in calc)
\MSExchange ADAccess Domain Controllers(*)\LDAP Search Time
\Processor(_Total)\% Processor Time
NTLM auth and MaxConcurrentAPI
NTLM auth has a limited set of worker threads that service requests
Can result in MaxConcurrentAPI bottlenecks
Symptoms are typically client logon delays & timeouts
Often related to other issues which cause high auth velocity
http://aka.ms/maxconcurrentapi
\Netlogon\Semaphore Waiters
\Netlogon\Semaphore Holders
\Netlogon\Semaphore Acquires
\Netlogon\Semaphore Timeouts
\Netlogon\Average Semaphore Hold Time
Achieving balance for deployment health
Optimal hardware utilization
Better response to failures
More predictable user experience
Balance at CAS layer
Load balancing handles traffic distribution across CAS components
Select a solution that is application aware for high availability
Select a traffic distribution policy carefully – least connections may cause issues during outages or maintenance; consider round-robin
Goal is approximately equal spread of inbound client requests across CAS role servers
\Web Service(Default Web Site)\Current Connections
Balance within DAG
Always aim for well-balanced, equal distribution of active copies
Utilize the DB copy ActivationPreference and MaximumPreferredActiveDatabases parameters to ensure *-over maintains balance
Consider regular utilization of RedistributeActiveDatabases.ps1 to maintain balance
During localized high load events, redistribute active copies to migrate load
\MSExchange Active Manager(_total)\Database Mounted
Balance within databases
Always aim for equal utilization of databases (space & activity)
Spread out heavy and light users across databases
Can be based on heuristics around job role, or statistics like message send/receive rates
Monitor space utilization and rebalance databases regularly via mailbox moves
Don’t forget about whitespace when evaluating available DB space
Harder to balance based on usage
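One simple way to quantify the balance goal above is the ratio of the busiest to quietest database; a sketch using hypothetical per-database message rates (names and numbers are illustrative):

```python
# Hypothetical daily message rates gathered from send/receive statistics.
db_message_rates = {"DB01": 95_000, "DB02": 61_000, "DB03": 140_000}

def imbalance_ratio(rates: dict) -> float:
    """Busiest database divided by quietest; a value near 1.0 means load
    is well spread, a large value suggests rebalancing via mailbox moves."""
    values = list(rates.values())
    return max(values) / min(values)

print(round(imbalance_ratio(db_message_rates), 2))  # 2.3
```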
Balance within mailboxes
Exchange 2013 improved server-side perf for high item count folders
Legacy clients still have issues with high item counts, which can lead to significant performance impact
Look at Office 365 limits as best-practice maximums; generally aim for more folders with fewer messages
http://aka.ms/ExOnlineLimits
Set-Mailbox can enforce mailbox shaping quotas in the same way as Office 365
Multi-role: just do it
Very few reasons not to consider multi-role (Mailbox+CAS) deployment
Multi-role simplifies deployment, can reduce server count
Benefit of increased availability at the CAS layer
Issues remain with Windows NLB + DAG (WSFC)
Certificate management may be a concern
Considerations of scale: up or out?
How big is too big?
Design for scale-out, not scale-up
Better alignment with intentions & design points of PG
Ideally focus on “commodity” 2U servers as a platform to help minimize deployment risk
We don’t push the “top end” today – and don’t want you to either
Troubleshoot
Where do we begin?
Two things to define:
What are the specific symptoms (so we understand success criteria)
What resources are impacted
Once we understand resource impact, try to isolate cause
Start high level, work our way down
Prioritize service availability – collect relevant data and quickly attempt to restart/rebalance/reboot
High CPU issues
Where is CPU going?
Kernel vs. usermode (driver or HW issues?)
Specific process(es)
Generally high utilization
Processor counters will show mix of kernel & usermode consumption
If most consumption is in kernel (privileged), likely due to driver or HW problems
For usermode CPU, look at Process object counters to figure out process(es) consuming time
\Processor(_Total)\% Privileged Time
\Processor(_Total)\% User Time
\Process(*)\% Privileged Time
High CPU issues
Determine if high CPU is load related
Helps to have a baseline
Look at RPC ops/sec and message traffic counters to indicate overall workload transaction rate
If load related, consider moving active copies to temporarily rebalance load
Prioritize availability
Take multiple process dumps of impacted processes at intervals, then restart impacted services.
Workload Management may run “background” tasks during off-peak times
Can result in higher than expected utilization
\MSExchange WorkloadManagement Workloads(*)\ActiveTasks
\MSExchange WorkloadManagement Workloads(*)\CompletedTasks
\MSExchange WorkloadManagement Workloads(*)\QueuedTasks
Memory utilization issues
Vast majority of Exchange components allocate via the .NET CLR (managed memory)
Pay attention to key CLR garbage collection metrics
Monitor working sets
Working set trimming can cause significant pain
\.NET CLR Memory(*)\% Time in GC < 10
\Memory\% Committed Bytes In Use < 80
\Memory\Available Mbytes > 5% of RAM
\.NET CLR Memory(*)\# Bytes in all Heaps
\Process(*)\Private Bytes
\Process(*)\Working Set
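The memory thresholds above lend themselves to a simple automated check; a minimal sketch (the counter values passed in are illustrative samples, not live counter reads):

```python
def memory_health_violations(committed_pct: float, available_mb: float,
                             ram_mb: float) -> list:
    """Apply the thresholds above: % Committed Bytes In Use should stay
    below 80, and Available MBytes should exceed 5% of installed RAM."""
    problems = []
    if committed_pct >= 80:
        problems.append("% Committed Bytes In Use >= 80")
    if available_mb <= 0.05 * ram_mb:
        problems.append("Available MBytes <= 5% of RAM")
    return problems

# Healthy sample: 70% committed, 8 GB free on a 96 GB server.
print(memory_health_violations(70.0, 8192, 96 * 1024))  # []
```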
Storage performance analysis
Latencies are key for Exchange
Can monitor via PhysicalDisk/LogicalDisk counters, but better to monitor via ESE counters
\MSExchange Database ==> Instances(*)\I/O Database Reads (Attached) Average Latency < 20ms
\MSExchange Database ==> Instances(*)\I/O Database Writes (Attached) Average Latency < 50ms
\MSExchange Database ==> Instances(*)\I/O Log Writes Average Latency < 10ms
\MSExchange Database ==> Instances(*)\I/O Database Reads (Recovery) Average Latency < 200ms
\MSExchange Database ==> Instances(*)\I/O Database Writes (Recovery) Average Latency < read latency for the same instance
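The latency targets above can be captured as a small table and checked against observed samples; a sketch where the key names are invented shorthand for the counters, and the relative recovery-write target is omitted for simplicity:

```python
# Targets in milliseconds, from the ESE counter list above.
LATENCY_TARGETS_MS = {
    "db_read_attached": 20,
    "db_write_attached": 50,
    "log_write": 10,
    "db_read_recovery": 200,
}

def latency_violations(samples_ms: dict) -> list:
    """Return the counters whose observed average latency exceeds its target."""
    return [name for name, limit in LATENCY_TARGETS_MS.items()
            if samples_ms.get(name, 0) > limit]

print(latency_violations({"db_read_attached": 35, "log_write": 5}))  # ['db_read_attached']
```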
Storage performance analysis
High latencies can have many potential causes:
Disk health
Array health
High volume of random I/O (seek times)
High volume of sequential I/O (bandwidth)
Controller performance (cache behavior?)
Bus/connectivity issues
Storage stack, incl. filter drivers
Look for higher than expected I/O
\MSExchange Database ==> Instances(*)\I/O Database Reads (Attached)/sec
\MSExchange Database ==> Instances(*)\I/O Database Writes (Attached)/sec
\MSExchange Database ==> Instances(*)\I/O Log Writes/sec
Storage performance analysis
Use Sysinternals Process Monitor to watch for unexpected I/O to Exchange volumes
Filter on Exchange volume paths, add the Duration column
http://aka.ms/procmon
Check array & controller health (RAID rebuild?)
Check components that include storage filter drivers (like AV utilities)
One counter to rule them all
Vast majority of user-facing perf issues are reflected in RPC Average Latency
Good “canary” to determine if a particular server is having problems
\MSExchangeIS Store(*)\RPC Average Latency < 100ms
\MSExchangeIS Client Type(*)\RPC Average Latency < 100ms
\MSExchangeIS Store(*)\RPC Operations/sec
\MSExchangeIS Client Type(*)\RPC Operations/sec
If in doubt…
Open a support case with Microsoft
Well-established escalation path into the product group
Very deep perf analysis skills, access to internal tools
Trend
Collecting perf data
Exchange Diagnostics Service (EDS) captures relevant perf counters to the DailyPerformanceLogs directory
Collects up to 5GB of logs
Automatically purges old logs to stay under the space quota
Quota can be adjusted by changing a line in Microsoft.Exchange.Diagnostics.Service.exe.config (parameter defines MB of space quota):
<add Name="DailyPerformanceLogs" LogDataLoss="True" MaxSize="5120" MaxSizeDatacenter="2048" />
Logs can be easily loaded into SQL to build a data warehouse of perf history
May need to consider adding indexes, or using Analysis Services, to help with query perf on large datasets
Loading data into SQL
On each machine, regularly copy logs off and keep track of what has been loaded into SQL
The Windows relog tool handles the upload process:
Create an ODBC DSN that points to a SQL instance
Call relog against each log to upload:
Relog.exe filename -f SQL -o SQL:dsn_name!log_set_name
http://aka.ms/relog for details on relog options
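The copy-and-upload loop described above might be scripted like this; a sketch where the log directory, DSN name, and ledger file are hypothetical, and only the command construction is fixed by the relog syntax on the slide:

```python
import subprocess
from pathlib import Path

# Hypothetical locations; adjust for your environment.
LOG_DIR = Path(r"D:\DailyPerformanceLogs")
LEDGER = Path("uploaded_logs.txt")  # tracks which logs are already in SQL

def relog_command(log_path, dsn: str, log_set: str) -> list:
    """Build the upload command from the slide:
    Relog.exe filename -f SQL -o SQL:dsn_name!log_set_name"""
    return ["relog", str(log_path), "-f", "SQL", "-o", f"SQL:{dsn}!{log_set}"]

def upload_new_logs(dsn: str = "ExchangePerf", log_set: str = "EXCH01") -> None:
    """Upload any .blg files not yet recorded in the ledger."""
    done = set(LEDGER.read_text().splitlines()) if LEDGER.exists() else set()
    for blg in sorted(LOG_DIR.glob("*.blg")):
        if blg.name in done:
            continue
        subprocess.run(relog_command(blg, dsn, log_set), check=True)
        with LEDGER.open("a") as f:
            f.write(blg.name + "\n")
```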
Basic analysis
Once data is in SQL, perfmon can utilize the SQL data source
Additionally, Excel and SQL Reporting Services can be used to query or report
Excel is fantastic for building trend reports against perf data
Build a scorecard
A perf/capacity scorecard provides focus on a high-level view of the metrics that matter
Pick key health indicators, for example:
CPU (average at peak)
Storage utilization
Mailflow
Client connections for key client types
Set targets – both for health and for capacity triggers
Report out on a regular schedule – operational rigor
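For a metric like “CPU (average at peak)”, the scorecard calculation might look like this sketch (the peak-hour window and sample data are illustrative assumptions):

```python
def peak_average(samples, peak_hours=range(9, 18)) -> float:
    """Average a counter over peak business hours only.
    `samples` is a list of (hour_of_day, value) pairs."""
    peak = [value for hour, value in samples if hour in peak_hours]
    return sum(peak) / len(peak) if peak else 0.0

# Illustrative hourly CPU averages for one server.
cpu_samples = [(8, 35.0), (10, 62.0), (14, 70.0), (20, 30.0)]
print(peak_average(cpu_samples))  # 66.0
```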
Summary
Proper sizing is critical for a healthy deployment
Seek balance for optimal utilization
Multi-role is the goal
Look at key resource utilization counters to troubleshoot perf
Baselines are very helpful
Collect perf data & produce trend-based reporting: no surprises!
Q & A
[email protected]
http://aka.ms/tena2014experf
Breakout Sessions
DCIM-B415 – Microsoft System Center 2012 R2 Operations Manager: Mastering Historical Monitoring Data
DEV-B335 – Using the Cloud-Based Load Testing Service and Application Insights to Find Scale and Performance Bottlenecks in Your Applications
WIN-B413 – Windows Performance Deep Dive Troubleshooting
DCIM-B360 – Key Metric, Performance, and Capacity Monitoring Using Microsoft System Center 2012 R2 Operations Manager
Related content
Labs
WIN-IL301 – Resolving Windows Performance Issues without Opening a Support Case
WIN-IL302 – Windows 8.1 Managed Memory Debugging Using WinDBG
Microsoft Solutions Experience Location (MSE)Office Servers and Services (Go Deploy)
Find Me Later At…
MSE Office Servers and Services (Go Deploy): Thursday 10:45 AM – 12:15 PM
Resources
Learning
Microsoft Certification & Training Resources
www.microsoft.com/learning
msdn
Resources for Developers
http://microsoft.com/msdn
TechNet
Resources for IT Professionals
http://microsoft.com/technet
Sessions on Demand
http://channel9.msdn.com/Events/TechEd
© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.