Troubleshooting Gray Agent States in System Center Operations Manager

This article describes how to troubleshoot problems in which an agent, a management server, or a gateway is unavailable or "grayed out" in System Center Operations Manager.

An agent, a management server, or a gateway can have one of the following states, as indicated by the color of the agent name and icon in the Monitoring pane.

| State | Appearance | Description |
| --- | --- | --- |
| Healthy | Green check mark | The agent or management server is running normally. |
| Critical | Red check mark | There is a problem on the agent or management server. |
| Unknown | Gray agent name, gray check mark | The health service watcher on the root management server (RMS) that is watching the health service on the monitored computer is no longer receiving heartbeats from the agent. (The health service watcher had been receiving heartbeats previously, and the health service was reported as healthy.) This also means that the management servers are no longer receiving any information from the agent. This issue may occur because the computer that is running the agent is not running or there are connectivity issues. You can find more information about the Health Service Watcher view. |
| Unknown | Green circle, no check mark | The status of the discovered item is unknown. There is no monitor available for this specific discovered item. |

Troubleshooting gray agent states in System Center Operations Manager https://support.microsoft.com/en-us/kb/2288515

1 of 27 10/12/2015 11:27 PM

An agent, a management server, or a gateway may become unavailable for any of the following reasons:

Heartbeat failure

Invalid configuration

System workflows failure

OpsMgr Database or data warehouse performance issues

RMS or primary MS or gateway performance issues

Network or authentication issues

Health service issues (service is not running)

Before you begin to troubleshoot the agent "grayed out" issue, you should first understand the Operations Manager topology, and then define the scope of the issue. The following questions may help you to define the scope of the issue:

How many agents are affected?

Are the agents experiencing the issue in the same network segment?

Do the agents report to the same management server?

How often do the agents enter and remain in a gray state?

How do you typically recover from this situation (for example, restart the agent health service, clear the cache, rely upon automatic recovery)?

Are the Heartbeat failure alerts generated for these agents?

Does this issue occur during a specific time of the day?

Does this issue persist if you fail over these agents to another management server or gateway?

When did this problem start?

Were any changes made to the agents, the management servers, or the gateway or management group?

Are the affected agents Windows clustered systems?

Is the Health Service State folder excluded from antivirus scanning?

In which environment does this issue occur (OpsMgr SP1, R2, or 2012)?

Your troubleshooting strategy will be dictated by which component is inactive, where that component falls within the topology, and how widespread the problem is. Consider the following conditions:

If the agents that report to a particular management server or gateway are unavailable, troubleshooting should start at the management server or gateway level.

If the gateways that report to a particular management server are unavailable, troubleshooting should start at the management server level.

For agentless systems, for network devices, and for Unix/Linux servers, troubleshooting should start at the agent, management server, or gateway that is monitoring these objects.

If all the systems are unavailable, troubleshooting should start at the root management server.

Troubleshooting typically starts at the level immediately above the unavailable component.
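The conditions above amount to a small lookup from the scope of the outage to the component where troubleshooting begins. The following Python sketch illustrates that mapping only; the scope labels and the function name `where_to_start` are hypothetical and are not Operations Manager identifiers.

```python
# Hypothetical lookup that mirrors the conditions listed above.
# The keys are illustrative labels, not Operations Manager identifiers.
START_POINTS = {
    "agents_of_one_server": "the management server or gateway that the agents report to",
    "gateways_of_one_server": "the management server that the gateways report to",
    "agentless_network_or_unix": "the agent, management server, or gateway that monitors the objects",
    "all_systems": "the root management server",
}


def where_to_start(scope: str) -> str:
    """Return the component at which troubleshooting should begin."""
    try:
        return START_POINTS[scope]
    except KeyError:
        raise ValueError(f"unknown scope: {scope!r}") from None


if __name__ == "__main__":
    for scope in START_POINTS:
        print(f"{scope}: start at {where_to_start(scope)}")
```

In each case the starting point is the level immediately above the unavailable component.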

Consider the following scenarios.

Scenario 1

Only a few agents are affected by the issue. These agents report to different management servers. Agents remain unavailable on a regular basis. Although you are able to clear the agent cache to help resolve the issue temporarily, the problem recurs after a few days.

Resolution 1

To resolve the issue in this scenario, follow these steps:

1. Apply the appropriate hotfix to the affected operating systems.

   Windows 2008 R2 and Windows 7: This fix is included in Service Pack 1 (SP1).

   Windows 2008 and Windows Vista: Install 2553708.

   Windows 2003: Install 981263.

2. Exclude the Agent cache from antivirus scanning.

3. Stop the Health service.

4. Clear the Agent cache.

5. Start the Health service.

Note We recommend that you proactively apply the hotfixes that are listed in step 1 to all monitored systems. This includes the management servers. Additionally, exclude the agent or management cache from antivirus scanning to prevent this issue from spreading to other systems.
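Steps 3 through 5 of the resolution above (stop the Health service, clear the agent cache, start the Health service) can be sketched as an ordered command plan. This Python sketch only builds the command strings and executes nothing. It assumes that the service is named `HealthService`, that the cache is the Health Service State folder whose path you supply, and that renaming the folder is an acceptable way to clear it. None of these details are stated in this article, so verify them for your installation.

```python
import ntpath
from typing import List


def cache_reset_plan(state_dir: str, service: str = "HealthService") -> List[str]:
    """Build (but do not run) the Windows commands for steps 3 to 5.

    state_dir is the caller-supplied path of the agent cache folder.
    The folder is renamed rather than deleted so that it can be restored;
    this choice is an assumption, not something the article prescribes.
    """
    state_dir = state_dir.rstrip("\\/")
    folder_name = ntpath.basename(state_dir)
    return [
        f"net stop {service}",                          # step 3
        f'ren "{state_dir}" "{folder_name}.old"',       # step 4
        f"net start {service}",                         # step 5
    ]


if __name__ == "__main__":
    # The path below is a placeholder, not a documented default.
    for command in cache_reset_plan(r"C:\OpsMgr\Health Service State"):
        print(command)
```

Returning the plan as data makes it easy to review the commands before anyone runs them on an affected agent.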

For more information about these procedures, click the following article numbers to view the articles in the Microsoft Knowledge Base:



982018 An update that improves the compatibility of Windows 7 and Windows Server 2008 R2 with Advanced Format Disks is available

2553708 A hotfix rollup that improves Windows Vista and Windows Server 2008 compatibility with Advanced Format disks

981263 Management servers or assigned agents unexpectedly appear as unavailable in the Operations Manager console in Windows Server 2003 or Windows Server 2008

975931 Recommendations for antivirus exclusions that relate to MOM 2005 and to Operations Manager 2007

Scenario 2

Only a few agents are affected by the issue. These agents report to different management servers. Agents remain inactive constantly. Although you are able to clear the agent cache, this does not resolve the issue.


1. Determine whether the Health Service is turned on and is currently running on the management server or gateway. If the Health Service has stopped responding, generate an Adplus dump in a service hang mode to help determine the cause of the problem. For more information, click the following article number to view the article in the Microsoft Knowledge Base:

286350 How to use ADPlus.vbs to troubleshoot "hangs" and "crashes"

2. Examine the Operations Manager Event log on the agent to locate any of the following events:

Event ID: 1102
Event Source: HealthService
Event Description: Rule/Monitor "%4" running for instance "%3" with id:"%2" cannot be initialized and will not be loaded. Management group "%1"

Event ID: 1103
Event Source: HealthService
Event Description: Summary: %2 rule(s)/monitor(s) failed and got unloaded, %3 of them reached the failure limit that prevents automatic reload. Management group "%1". This is summary only event, please see other events with descriptions of unloaded rule(s)/monitor(s).

Event ID: 1104
Event Source: HealthService
Event Description: RunAs profile in workflow "%4", running for instance "%3" with id:"%2" cannot be resolved. Workflow will not be loaded. Management group "%1"

Event ID: 1105
Event Source: HealthService
Event Description: Type mismatch for RunAs profile in workflow "%4", running for instance "%3" with id:"%2". Workflow will not be loaded. Management group "%1"

Event ID: 1106
Event Source: HealthService
Event Description: Cannot access plain text RunAs profile in workflow "%4", running for instance "%3" with id:"%2". Workflow will not be loaded. Management group "%1"

Event ID: 1107
Event Source: HealthService
Event Description: Account for RunAs profile in workflow "%4", running for instance "%3" with id:"%2" is not defined. Workflow will not be loaded. Please associate an account with the profile. Management group "%1"

Event ID: 1108
Event Source: HealthService
Event Description: An Account specified in the Run As Profile "%7" cannot be resolved. Specifically, the account is used in the Secure Reference Override "%6". %n%n This condition may have occurred because the Account is not configured to be distributed to this computer. To resolve this problem, you need to open the Run As Profile specified below, locate the Account entry as specified by its SSID, and either choose to distribute the Account to this computer if appropriate, or change the setting in the Profile so that the target object does not use the specified Account. %n%nManagement Group: %1 %nRun As Profile: %7 %nSecureReferenceOverride name: %6 %nSecureReferenceOverride ID: %4 %nObject name: %3 %nObject ID: %2 %nAccount SSID: %5

Event ID: 4000
Event Source: HealthService
Event Description: A monitoring host is unresponsive or has crashed. The status code for the host failure was %1.

Event ID: 21016
Event Source: OpsMgr Connector
Event Description: OpsMgr was unable to set up a communications channel to %1 and there are no failover hosts. Communication will resume when %1 is available and communication from this computer is allowed.

Event ID: 21006
Event Source: OpsMgr Connector
Event Description: The OpsMgr Connector could not connect to %1:%2. The error code is %3(%4). Please verify there is network connectivity, the server is running and has registered its listening port, and there are no firewalls blocking traffic to the destination.

Event ID: 20070
Event Source: OpsMgr Connector
Event Description: The OpsMgr Connector connected to %1, but the connection was closed immediately after authentication occurred. The most likely cause of this error is that the agent is not authorized to communicate with the server, or the server has not received configuration. Check the event log on the server for the presence of 20000 events, indicating that agents which are not approved are attempting to connect.

Event ID: 20051
Event Source: OpsMgr Connector
Event Description: The specified certificate could not be loaded because the certificate is not currently valid. Verify that the system time is correct and re-issue the certificate if necessary%n Certificate Valid Start Time : %1%n Certificate Valid End Time : %2

Event Source: ESE
Event Category: Transaction Manager
Event ID: 623
Description: HealthService () The version store for instance ("") has reached its maximum size of Mb. It is likely that a long-running transaction is preventing cleanup of the version store and causing it to build up in size. Updates will be rejected until the long-running transaction has been completely committed or rolled back. Possible long-running transaction: SessionId: Session-context: Session-context ThreadId: . Cleanup:



3. If you locate the following specific events, follow these guidelines:

Events 1102 and 1103: These events indicate that some of the workflows failed to load. If these are the core system workflows, these events could cause the issue. In this case, focus on resolving these events.

Events 1104, 1105, 1106, 1107, and 1108: These events may cause Events 1102 and 1103 to occur. Typically, this occurs because of misconfigured "Run as" accounts. In OpsMgr R2, this typically occurs because the "Run as" accounts are configured to be used with the wrong class or are not configured to be distributed to the agent.

Event 4000: This event indicates that the MonitoringHost.exe process crashed. If this problem is caused by a DLL mismatch or by missing registry keys, you may be able to resolve the problem by reinstalling the agent. If the problem persists, try to resolve it by using the following methods:

Run a Process Monitor capture until the point the process crashes. For more information, visit the following Microsoft Sysinternals website: Process Monitor v2.96

Generate an Adplus dump in crash mode. For more information, click the following article number to view the article in the Microsoft Knowledge Base:

286350 How to use ADPlus.vbs to troubleshoot "hangs" and "crashes"

If the agent is monitoring network devices, and the agent is running on Windows Server 2003, apply the hotfix in KB 982501. For more information, click the following article number to view the article in the Microsoft Knowledge Base:

982501 The monitoring of SNMP devices may stop intermittently in System Center Operations Manager or in System Center Essentials

Event ID 21006: This event indicates that communication issues exist between the agent and the management server. If the agent uses a certificate for mutual authentication, verify that the certificate is not expired and that the agent is using the correct certificate. If Kerberos is being used, verify that the agent can communicate with Active Directory. If authentication is working correctly, this may mean that the packets from the agent are not reaching the management server or gateway. Try to establish a simple telnet connection to port 5723 from the agent to the management server. Additionally, run a simultaneous network trace between the agent and the management server while you reproduce the communication failures. This can help you to determine whether the packets are reaching the management server, and whether any device between the two components is trying to optimize the traffic or is dropping some packets. For more information, click the following article number to view the article in the Microsoft Knowledge Base:

812953 How to use Network Monitor to capture network traffic

Event ID 623: This event typically occurs in a large Operations Manager environment in which a management server or an agent computer manages many workflows. For more information, click the following article number to view the article in the Microsoft Knowledge Base:

975057 One or more management servers and their managed devices are dimmed in the Operations Manager Console of Operations Manager
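The telnet test to port 5723 that is suggested for Event ID 21006 can also be scripted. The following minimal Python sketch checks only whether a TCP connection can be opened from the agent to the management server. It does not test certificates or Kerberos authentication. The function name `can_reach` and the host name in the usage line are hypothetical.

```python
import socket


def can_reach(host: str, port: int = 5723, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    # "ms01.example.com" is a placeholder management server name.
    print(can_reach("ms01.example.com"))
```

A result of False from the agent, together with a network trace, helps show whether packets are being dropped before they reach the management server or gateway.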

Scenario 3

All the agents that report to a particular management server or gateway are unavailable.


1. Try to determine what kind of workloads the management server or gateway is monitoring. Such workloads might include network devices, cross-platform agents, synthetic transactions, Windows agents, and agentless computers.

2. Determine whether the Health Service is running on the management server or gateway.

3. Determine whether the management server is running in maintenance mode. If it is necessary, remove the server from maintenance mode.

4. Examine the Operations Manager Event log on the agent for any of the events that are listed in Scenario 2. In the case of Event ID 21006, follow the same guidelines that are mentioned in Scenario 2. Additionally, in this case, this event indicates that the management server or gateway cannot communicate with its parent server. In Operations Manager 2007 and R2, the parent server for a management server is the root management server (RMS). For a gateway, the parent server may be any management server. (Refer to step 3 in the Scenario 2 resolution.)

5. If the health service is monitoring network devices, and the management server is running on a Windows Server 2003 system, you may also want to apply the following KB 982501 hotfix. For more information, click the following article number to view the article in the Microsoft Knowledge Base:

982501 The monitoring of SNMP devices may stop intermittently in System Center Operations Manager or in System Center Essentials

6. Examine the Operations Manager Event log for the following events. These events typically indicate that performance issues exist on the management server or Microsoft SQL Server that is hosting the OperationsManager or OperationsManagerDW database:

Event ID: 2115
Event Source: HealthService
Event Description: A Bind Data Source in Management Group %1 has posted items to the workflow, but has not received a response in %5 seconds. This indicates a performance or functional problem with the workflow.%n Workflow Id : %2%n Instance : %3%n Instance Id : %4%n

Event ID: 5300
Event Source: HealthService
Event Description: Local health service is not healthy. Entity state change flow is stalled with pending acknowledgement. %n%nManagement Group: %2 %nManagement Group ID: %1

Event ID: 4506
Event Source: HealthService
Event Description: Operations Manager Data was dropped due to too much outstanding data in rule "%2" running for instance "%3" with id:"%4" in management group "%1".

Event ID: 31551
Event Source: Health Service Modules
Event Description: Failed to store data in the Data Warehouse. The operation will be retried.%rException '%5': %6 %n%nOne or more workflows were affected by this. %n%nWorkflow name: %2 %nInstance name: %3 %nInstance ID: %4 %nManagement group: %1

Event ID: 31552
Event Source: Health Service Modules
Event Description: Failed to store data in the Data Warehouse.%rException '%5': %6 %n%nOne or more workflows were affected by this. %n%nWorkflow name: %2 %nInstance name: %3 %nInstance ID: %4 %nManagement group: %1

Event ID: 31553
Event Source: Health Service Modules
Event Description: Data was written to the Data Warehouse staging area but processing failed on one of the subsequent operations.%rException '%5': %6 %n%nOne or more workflows were affected by this. %n%nWorkflow name: %2 %nInstance name: %3 %nInstance ID: %4 %nManagement group: %1

Event ID: 31557
Event Source: Health Service Modules
Event Description: Failed to obtain synchronization process state information from Data Warehouse database. The operation will be retried.%rException '%5': %6 %n%nOne or more workflows were affected by this. %n%nWorkflow name: %2 %nInstance name: %3 %nInstance ID: %4 %nManagement group: %1

7. Event ID 3155X may also be logged because of incorrect "Run as" account configurations or missing permissions for the "Run as" accounts. For more information, see the following Microsoft TechNet blog, which includes a Microsoft Office Excel worksheet that lists the permissions for various accounts that are used by OpsMgr:

OpsMgr security account rights mapping - what accounts need what privileges?

Note To troubleshoot management server or gateway performance and SQL Server performance, see the "Resolutions" section for the next scenarios.
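When an event log contains several of the events from Scenarios 2 and 3, it can help to group them by the guidance that this article gives for them. The following Python sketch is one possible grouping. The category labels and the `triage` function are illustrative and are not part of Operations Manager. Events that this article lists without specific guidance (for example, 21016, 20070, and 20051) are left uncategorized.

```python
from collections import defaultdict

# Event IDs grouped by the guidance given in this article.
# The category names are illustrative labels only.
GUIDANCE = {
    "workflow load failure": {1102, 1103},
    "Run As account configuration": {1104, 1105, 1106, 1107, 1108},
    "MonitoringHost.exe crash": {4000},
    "agent/server communication": {21006},
    "version store exhaustion": {623},
    # The 3155x events may also point to Run As account problems (step 7).
    "server or SQL Server performance": {2115, 5300, 4506, 31551, 31552, 31553, 31557},
}


def triage(event_ids):
    """Group the given event IDs by guidance category."""
    grouped = defaultdict(list)
    for event_id in sorted(set(event_ids)):
        for category, ids in GUIDANCE.items():
            if event_id in ids:
                grouped[category].append(event_id)
                break
        else:
            grouped["no specific guidance here"].append(event_id)
    return dict(grouped)


if __name__ == "__main__":
    for category, ids in triage([1103, 1107, 2115, 31552, 20070]).items():
        print(f"{category}: {ids}")
```

The grouping is a starting point for deciding which guideline to follow first; it does not replace reading the event descriptions.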

Scenarios 4 and 5

Scenario 4

All the agents that report to a specific management server alternate intermittently between healthy and gray states.

Scenario 5

All the agents in the environment alternate intermittently between healthy and gray states.

Resolutions 4 and 5

To resolve the issue in either of these scenarios, first determine the cause of the issue. Common causes of temporary server unavailability include the following:

The parent server of the agents is temporarily offline.



Agents are flooding the management server with operational data, such as alerts, states, discoveries, and so on. This may cause an increased use of system resources on the OpsMgr database and on the OpsMgr servers.

Network outages caused a temporary communication failure between the parent server and the agents.

Management pack (MP) changes occurred. In OpsMgr Console, these changes require an OpsMgr configuration and an MP redistribution to the agents. If the change affects a large agent base, this may cause increased use of system resources on the OpsMgr database and OpsMgr servers.

The key to troubleshooting in these scenarios is to understand the duration of the server unavailability and the time of day during which it occurred. This will help you to quickly narrow the scope of the problem.

For OpsMgr 2007 and R2 - Root management server (RMS)

Configuration update bursts are caused by management pack imports and by discovery data. When system performance is slow, the most likely bottlenecks are, first, the CPU and, second, the OpsMgr installation disk I/O.

The RMS is responsible for generating and sending configuration files to all affected Health Services.

For workflow reloading (which is caused by new configuration on the RMS), the most likely bottlenecks are the same: the CPU first, and OpsMgr installation disk I/O second. The RMS is responsible for reading the configuration file, for loading and initializing all workflows that run on it, and for updating the RMS HealthService store when the configuration file is updated on the RMS.

For local workflow activity bursts (which occur when agents change their availability), the most likely bottleneck is the CPU. If you find that the CPU is not working at maximum capacity, the next most likely bottleneck is the hard disk. The RMS is responsible for monitoring the availability of all agents that are using RMS local workflows. The RMS also hosts distributed dependency monitors that use the disk.

Management server

During a configuration update burst (that is caused by MP import and discovery), the typical bottlenecks are, first, the CPU and, second, the OpsMgr installation disk I/O. The management server is responsible for forwarding configuration files from the RMS to the target agents.

For operational data collection, bottlenecks are typically caused by the CPU. The disk I/O may also be at maximum capacity, but that is not as likely. The management server is responsible for decompressing and decrypting incoming operational data, and inserting it into the Operational Database. It also sends acknowledgements (ACKs) back to the agents or gateways after it receives operational data, and uses disk queuing to temporarily store these outgoing ACKs. Lastly, the management server will also forward monitor state changes (by using a disk queue) to the RMS for distributed dependency monitors.

Gateway

The gateway is both CPU-bound and I/O-bound. When the gateway is relaying a large amount of data, both the CPU and I/O operations may show high usage. Most of the CPU usage is caused by the decompression, compression, encryption, and decryption of the incoming data, and also by the transfer of that data. All data that is received by the gateway from the agents is stored in a persistent queue on disk, to be read and forwarded to the management server by the gateway Health service. This can cause heavy disk usage. This usage can be significant when the gateway is taken temporarily offline and must then handle accumulated agent data that the agents generated and tried to send while the gateway was still offline.

To troubleshoot the issue in this situation, collect the following information for each affected management server or gateway:

Exact Windows version, edition, and build number (for example, Windows Server 2003 Enterprise x64 SP2)

Number of processors

Amount of RAM

Drive that contains the Health Service State folder

Whether the antivirus software is configured to exclude the Health Service store

Note For more information, click the following article number to view the article in the Microsoft Knowledge Base:

975931 Recommendations for antivirus exclusions that relate to Operations Manager

RAID level (0, 1, 5, 0+1 or 1+0) for the drive that is used by the Health Service State

Number of disks used for the RAID



Whether battery-backed write cache is enabled on the array controller

Operational Database (OperationsManager)

For the OperationsManager database, the most likely bottleneck is the disk array. If the disk array is not at maximum I/O capacity, the next most likely bottleneck is the CPU. The database will experience occasional slowdowns and operational "data storms" (very high incidences of events, alerts, and performance data or state changes that persist for a relatively long time). A short burst typically does not cause any significant delay for an extended period of time.

During operational data insertion, the database disks are primarily used for writes. CPU use is usually caused by SQL Server churn. This may occur when you have large and complex queries, heavy data insertion, and the grooming of large tables (which, by default, occurs at midnight). Typically, the grooming of even large Events and Performance Data tables does not consume excessive CPU or disk resources. However, the grooming of the Alert and State Change tables can be CPU-intensive for large tables.

The database is also CPU-bound when it handles configuration redistribution bursts, which are caused by MP imports or by a large instance space change. In these cases, the Config service queries the database for new agent configuration. This usually causes CPU spikes to occur on the database before the service sends the configuration updates to the agents.

Data Warehouse (OperationsManagerDW)

For the OperationsManagerDW database, the most likely bottleneck is the disk array. This usually occurs because of very large operational data insertions. In these cases, the disks are mostly busy performing writes. Usually, the disks are performing few reads, except to handle manually generated Reporting views because these run queries on the data warehouse.

CPU usage is usually caused by SQL Server churn. CPU spikes may occur during heavy partitioning activity (when tables become very large and then get partitioned), the generation of complex reports, and large amounts of alerts in the database, with which the data warehouse must constantly sync up.

General troubleshooting

To troubleshoot the issue in this situation, collect the following information for each affected management server or gateway:

Exact Windows version, edition, and build number (for example, Windows Server 2003 Enterprise x64 SP2)



Number of processors

Amount of RAM

Amount of memory that is allocated to SQL Server

Whether SQL Server is 32-bit and is AWE enabled

Note You can find most of this information in SQL Server Management Studio or in SQL Server Enterprise Manager. To do this, open the Properties window of the server, and then click the General and Memory tabs. The General tab includes the SQL Server version, the Windows version, the platform, the amount of RAM, and the number of processors. The Memory tab includes the memory that is allocated to SQL Server. In Microsoft SQL Server 2008 and in Microsoft SQL Server 2005, the Memory tab also includes the AWE option. To determine whether AWE is enabled in Microsoft SQL Server 2000, run the following command in the Microsoft SQL Query Analyzer:

sp_configure 'show advanced options', 1
RECONFIGURE
GO
sp_configure 'awe enabled'

The returned values for config_value and for run_value will be 1 if AWE is enabled.

If the OS is 32-bit and RAM is 4 GB or greater, check whether the /pae or /3gb switches exist in the Boot.ini file. These options could be configured incorrectly if the server was originally installed with 4 GB or less of RAM, and if the RAM was later upgraded.

For 32-bit servers that have 4 GB of RAM, the /3gb switch in Boot.ini increases the amount of memory that SQL Server can address (from 2 to 3 GB). For 32-bit servers that have more than 4 GB of RAM, the /3gb switch in Boot.ini could actually limit the amount of memory that SQL Server can address. For these systems, add the /pae switch to Boot.ini, and then enable AWE in SQL Server.

On a multi-processor system, check the Max Degree of Parallelism (MAXDOP) setting. In SQL Server 2008 and in SQL Server 2005, this option is on the Advanced tab in the Properties dialog box for the server. To determine this setting in SQL Server 2000, run the following command in SQL Query Analyzer:

sp_configure 'show advanced options', 1
RECONFIGURE
GO
sp_configure 'max degree of parallelism'



The default value is 0, which means that all available processors will be used. A setting of 0 is fine for servers that have eight or fewer processors. For servers that have more than eight processors, the time that it takes SQL Server to coordinate the use of all processors may be counterproductive. Therefore, for servers that have more than eight processors, you generally should set Max Degree of Parallelism to a value of 8. To do this, run the following command in SQL Query Analyzer:

sp_configure 'show advanced options', 1
GO
RECONFIGURE WITH OVERRIDE
GO
sp_configure 'max degree of parallelism', 8
GO
RECONFIGURE WITH OVERRIDE
GO

Drive letters that contain data warehouse or Ops and Tempdb files

Whether the antivirus software is configured to exclude SQL data and log files (Antivirus software cannot scan SQL database files. Trying to do this can degrade performance.)

Amount of free space on drives that contain data warehouse or Ops and Tempdb files

Storage type (SAN or local)

RAID level (0, 1, 5, 0+1 or 1+0) for drives that are used by SQL Server

If SAN storage is used: number of spindles on each LUN that is used by SQL Server

In OpsMgr 2007 SP1: whether hotfix 969130 (data warehouse event grooming) or SP1 hotfix rollup 971541 is applied

If the converted Exchange 2007 management pack is being used or has ever been used: number of rows in the LocalizedText table in the Ops DB and in the EventPublisher table in the data warehouse database

Note To determine the row counts, run the following commands:

USE OperationsManager
SELECT COUNT(*) FROM LocalizedText

USE OperationsManagerDW
SELECT COUNT(*) FROM EventPublisher
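Two of the SQL Server checks in the list above are simple decision rules: which Boot.ini switch fits a 32-bit server, and which Max Degree of Parallelism value to use. The following Python sketch restates those rules. The function names are hypothetical, and the article gives no guidance for 32-bit servers that have less than 4 GB of RAM, so that case returns no recommendation.

```python
from typing import Optional, Tuple


def boot_ini_advice(ram_gb: float) -> Tuple[Optional[str], bool]:
    """Return (Boot.ini switch, enable AWE?) for a 32-bit server."""
    if ram_gb > 4:
        # /3gb could limit addressable memory here; use /pae and enable AWE.
        return "/pae", True
    if ram_gb == 4:
        # /3gb raises the memory SQL Server can address from 2 GB to 3 GB.
        return "/3gb", False
    # The article states no rule for less than 4 GB of RAM.
    return None, False


def recommended_maxdop(processors: int) -> int:
    """Return the suggested 'max degree of parallelism' value."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    # 0 (use all processors) is fine up to eight; above that, cap at 8.
    return 8 if processors > 8 else 0


if __name__ == "__main__":
    print(boot_ini_advice(8), recommended_maxdop(16))
```

These functions only restate the thresholds given in this article. Confirm any change against the actual server configuration before you edit Boot.ini or run sp_configure.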

Counters to identify memory pressure



MSSQL$: Buffer Manager: Page Life expectancy: How long pages persist in the buffer pool. If this value is below 300 seconds, it may indicate that the server could use more memory. It could also result from index fragmentation.

MSSQL$: Buffer Manager: Lazy Writes/sec: Lazy writer frees space in the buffer by moving pages to disk. Generally, the value should not consistently exceed 20 writes per second. Ideally, it would be close to zero.

Memory: Available Mbytes: Values below 100 MB may indicate memory pressure. Memory pressure is clearly present when this amount is less than 10 MB.

Process: Private Bytes: _Total: This is the amount of memory (physical and page) being used by all processes combined.

Process: Working Set: _Total: This is the amount of physical memory being used by all processes combined. If the value for this counter is significantly below the value for Process: Private Bytes: _Total, it indicates that processes are paging too heavily. A difference of more than 10% is probably significant.
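The memory counter thresholds above can be checked together once you have sampled values. The following Python sketch applies them to numbers that you supply; it does not read performance counters itself. The function name is hypothetical, and it assumes that the 10% difference is measured relative to Process: Private Bytes: _Total, which the article does not state explicitly.

```python
def memory_pressure_findings(page_life_expectancy_s, lazy_writes_per_s,
                             available_mb, private_bytes_total,
                             working_set_total):
    """Return a list of findings based on the thresholds in this article."""
    findings = []
    if page_life_expectancy_s < 300:
        findings.append("Page life expectancy is below 300 seconds")
    if lazy_writes_per_s > 20:
        findings.append("Lazy writes exceed 20 per second")
    if available_mb < 10:
        findings.append("Available memory is below 10 MB: clear memory pressure")
    elif available_mb < 100:
        findings.append("Available memory is below 100 MB: possible memory pressure")
    if private_bytes_total > 0:
        # Assumption: the 10% gap is relative to Private Bytes.
        gap = (private_bytes_total - working_set_total) / private_bytes_total
        if gap > 0.10:
            findings.append("Working set is more than 10% below private bytes: heavy paging")
    return findings


if __name__ == "__main__":
    for finding in memory_pressure_findings(120, 35, 8, 10_000, 7_000):
        print(finding)
```

Because the Lazy Writes/sec guidance refers to values that are consistently high, pass an average over a sustained sampling period rather than a single reading.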

Counters to identify disk pressure

Capture these Physical Disk counters for all drives that contain SQL data or log files:

% Idle Time: How much disk idle time is being reported. Anything below 50 percent could indicate a disk bottleneck.

Avg. Disk Queue Length: This value should not exceed 2 times the number of spindles on a LUN. For example, if a LUN has 25 spindles, a value of 50 is acceptable. However, if a LUN has 10 spindles, a value of 25 is too high. You could use the following formulas based on the RAID level and number of disks in the RAID configuration:

RAID 0: All of the disks are doing work in a RAID 0 set.

Average Disk Queue Length <= 2 x (number of disks in the array)

RAID 5: All of the disks are doing work in a RAID 5 set.

Average Disk Queue Length <= 2 x (number of disks in the array)
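The rule of thumb stated above for Avg. Disk Queue Length (no more than 2 times the number of spindles on a LUN) is easy to check in code. The following Python sketch covers only the case in which every disk in the set is doing work. The function names are hypothetical.

```python
def max_avg_disk_queue(spindles: int) -> int:
    """Return the highest acceptable Avg. Disk Queue Length for a LUN."""
    if spindles < 1:
        raise ValueError("spindles must be at least 1")
    return 2 * spindles


def queue_length_ok(avg_queue_length: float, spindles: int) -> bool:
    """Return True if the measured queue length is within the limit."""
    return avg_queue_length <= max_avg_disk_queue(spindles)


if __name__ == "__main__":
    print(queue_length_ok(50, 25))  # 25 spindles, queue length 50
    print(queue_length_ok(25, 10))  # 10 spindles, queue length 25
```

The two calls reproduce the examples in the text: a value of 50 is acceptable for 25 spindles, and a value of 25 is too high for 10 spindles.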

OpsMgr Performance counters

The following sections describe the performance counters that you can use to monitor and troubleshoot OpsMgr performance.

Gateway server role

Overall performance counters: These counters indicate the overall performance of the gateway:

Processor(_Total)\% Processor Time

Memory\% Committed Bytes In Use

Network Interface(*)\Bytes Total/sec

LogicalDisk(*)\% Idle Time

LogicalDisk(*)\Avg. Disk Queue Length

OpsMgr process generic performance counters: These counters indicate the overall performance of OpsMgr processes on the gateway:

Process(HealthService)\% Processor Time

Process(HealthService)\Private Bytes (depending on how many agents this gateway is managing, this number may vary and could be several hundred megabytes)

Process(HealthService)\Thread Count

Process(HealthService)\Virtual Bytes

Process(HealthService)\Working Set

Process(MonitoringHost*)\% Processor Time

Process(MonitoringHost*)\Private Bytes

Process(MonitoringHost*)\Thread Count

Process(MonitoringHost*)\Virtual Bytes

Process(MonitoringHost*)\Working Set

OpsMgr specific performance counters: These counters are OpsMgr specific counters that indicate the performance of specific aspects of OpsMgr on the gateway:

Health Service\Workflow Count

Health Service Management Groups(*)\Active File Uploads: The number of file transfers that this gateway is handling. This represents the number of management pack files that are being uploaded to agents. If this value remains at a high level for a long time, and there is not much management pack importing at a given moment, this may indicate a problem that affects file transfer.

Health Service Management Groups(*)\Send Queue % Used: The size of the persistent queue. If this value remains higher than 10 for a long time and does not drop, the queue is backed up. This condition is caused by an overloaded OpsMgr system in which the management server or database is too busy or is offline.

OpsMgr Connector\Bytes Received: The number of network bytes received by the gateway, that is, the number of incoming bytes before decompression.

OpsMgr Connector\Bytes Transmitted: The number of network bytes sent by the gateway, that is, the number of outgoing bytes after compression.

OpsMgr Connector\Data Bytes Received: The number of data bytes received by the gateway, that is, the amount of incoming data after decompression.

OpsMgr Connector\Data Bytes Transmitted: The number of data bytes sent by the gateway, that is, the amount of outgoing data before compression.

OpsMgr Connector\Open Connections: The number of connections that are open on the gateway. This number should be the same as the number of agents or management servers that are directly connected to the gateway.
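The Send Queue % Used guidance above (a value that stays higher than 10 for a long time and does not drop) can be sketched as a check over a series of collected samples. The text does not define "a long time", so the `min_consecutive` parameter below is an assumption to be tuned to your sampling interval, and the function name is hypothetical.

```python
from typing import Sequence

# Percent threshold taken from the Send Queue % Used description.
SEND_QUEUE_THRESHOLD = 10.0


def queue_is_backed_up(samples: Sequence[float], min_consecutive: int = 6) -> bool:
    """Return True if the most recent `min_consecutive` samples all exceed the threshold.

    `min_consecutive` is an assumed stand-in for "a long time"; it is not a
    value defined by OpsMgr.
    """
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    if len(samples) < min_consecutive:
        # Not enough history to call the condition sustained.
        return False
    recent = samples[-min_consecutive:]
    return all(value > SEND_QUEUE_THRESHOLD for value in recent)
```

For example, with samples taken every 10 minutes, the default of 6 treats one hour above the threshold as sustained. A single recent sample at or below 10 resets the condition.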

Management server role

Overall performance counters: These counters indicate the overall performance of the management server:





LogicalDisk(*)\Avg. Disk Queue Length

OpsMgr process generic performance counters: These counters indicate the overall performance of OpsMgr processes on the management server:

Process(HealthService)\% Processor Time



Process(HealthService)\Private Bytes (depending on how many agents this management server is managing, this number may vary and could be several hundred megabytes)








Process(MonitoringHost*)\Working Set

OpsMgr specific performance counters: These counters are OpsMgr-specific counters that indicate the performance of specific aspects of OpsMgr on the management server:

Health Service\Workflow Count: The number of workflows that are running on this management server.

Health Service Management Groups(*)\Active File Uploads: The number of file transfers that this management server is handling. This represents the number of management pack files that are being uploaded to agents. If this value remains at a high level for a long time while few management packs are being imported, there may be a problem that affects file transfer.

Health Service Management Groups(*)\Send Queue % Used: The size of the persistent queue. If this value remains higher than 10 for a long time and does not drop, the queue is backed up. This condition is caused by an overloaded OpsMgr system in which a component (for example, the root management server) is too busy or is offline.

Health Service Management Groups(*)\Bind Data Source Item Drop Rate: The number of data items that are dropped by the management server for database or data warehouse data collection write actions. When this counter value is not 0, the management server or database is overloaded because it cannot handle the incoming data items fast enough or because a data item burst is occurring. The dropped data items will be resent by agents. After the overload or burst situation is finished, these data items will be inserted into the database or into the data warehouse.

Health Service Management Groups(*)\Bind Data Source Item Incoming Rate: The number of data items received by the management server for database or data warehouse data collection write actions.

Health Service Management Groups(*)\Bind Data Source Item Post Rate: The number of data items that the management server wrote to the database or data warehouse for data collection write actions.

OpsMgr Connector\Bytes Received: The number of network bytes received by the management server, that is, the size of incoming bytes before decompression.

OpsMgr Connector\Bytes Transmitted: The number of network bytes sent by the management server, that is, the size of outgoing bytes after compression.

OpsMgr Connector\Data Bytes Received: The number of data bytes received by the management server, that is, the size of incoming data after decompression.

OpsMgr Connector\Data Bytes Transmitted: The number of data bytes sent by the management server, that is, the size of outgoing data before compression.

OpsMgr Connector\Open Connections: The number of connections that are open on the management server. This number should be the same as the number of agents or root management servers that are directly connected to it.

OpsMgr DB Write Action Modules(*)\Avg. Batch Size: The number of data items or batches that are received by database write action modules. If this number is 5,000, a data item burst is occurring.

OpsMgr DB Write Action Modules(*)\Avg. Processing Time: The number of seconds that a database write action module takes to insert a batch into the database. If this number is often greater than 60, a database insertion performance issue is occurring.

OpsMgr DW Writer Module(*)\Avg. Batch Processing Time, ms: The number of milliseconds that it takes for a data warehouse write action to insert a batch of data items into a data warehouse.

OpsMgr DW Writer Module(*)\Avg. Batch Size: The average number of data items or batches received by data warehouse write action modules.

OpsMgr DW Writer Module(*)\Batches/sec: The number of batches received by data warehouse write action modules per second.

OpsMgr DW Writer Module(*)\Data Items/sec: The number of data items received by data warehouse write action modules per second.



OpsMgr DW Writer Module(*)\Dropped Data Item Count: The number of data items dropped by data warehouse write action modules.

OpsMgr DW Writer Module(*)\Total Error Count: The number of errors that occurred in a data warehouse write action module.
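The thresholds quoted above for Bind Data Source Item Drop Rate (not 0), DB Write Action Modules Avg. Batch Size (5,000), and Avg. Processing Time (often greater than 60 seconds) can be combined into one rough triage pass over collected samples. The sketch below is illustrative only: `CounterSamples`, `diagnose`, and `often_fraction` are hypothetical names, and interpreting "often" as "more than half of the samples" is an assumption.

```python
from dataclasses import dataclass
from typing import List, Sequence

BURST_BATCH_SIZE = 5000      # Avg. Batch Size value that signals a data item burst
SLOW_INSERT_SECONDS = 60.0   # Avg. Processing Time ceiling, in seconds


@dataclass
class CounterSamples:
    """Samples for three counters, collected by whatever tool you already use."""
    drop_rate: Sequence[float]               # Bind Data Source Item Drop Rate
    db_avg_batch_size: Sequence[float]       # DB Write Action Modules Avg. Batch Size
    db_avg_processing_time: Sequence[float]  # DB Write Action Modules Avg. Processing Time


def diagnose(samples: CounterSamples, often_fraction: float = 0.5) -> List[str]:
    """Return findings based on the thresholds described in the text."""
    findings: List[str] = []
    # Any nonzero drop rate means data items were dropped.
    if any(value != 0 for value in samples.drop_rate):
        findings.append("data items dropped: overload or data item burst")
    # Assumption: a sample that reaches 5,000 is treated as the burst signal.
    if any(value >= BURST_BATCH_SIZE for value in samples.db_avg_batch_size):
        findings.append("data item burst")
    times = samples.db_avg_processing_time
    if times:
        slow = sum(1 for value in times if value > SLOW_INSERT_SECONDS)
        # Assumption: "often" means more than `often_fraction` of the samples.
        if slow / len(times) > often_fraction:
            findings.append("database insertion performance issue")
    return findings
```

An empty list means that none of the three conditions was observed in the supplied samples. It does not rule out other causes listed in this article.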

Root management server role

Overall performance counters: These counters indicate the overall performance of the root management server:





LogicalDisk(*)\Avg. Disk Queue Length

OpsMgr process generic performance counters: These counters indicate the overall performance of OpsMgr processes on the root management server:

Process(HealthService)\% Processor Time

Process(HealthService)\Private Bytes (depending on how many agents this root management server is managing, this number may vary and could be several hundred megabytes)








Process(MonitoringHost*)\Working Set

Process(Microsoft.Mom.ConfigServiceHost)\% Processor Time

Process(Microsoft.Mom.ConfigServiceHost)\Private Bytes

Process(Microsoft.Mom.ConfigServiceHost)\Thread Count



Process(Microsoft.Mom.ConfigServiceHost)\Virtual Bytes

Process(Microsoft.Mom.ConfigServiceHost)\Working Set

Process(Microsoft.Mom.Sdk.ServiceHost)\% Processor Time

Process(Microsoft.Mom.Sdk.ServiceHost)\Private Bytes

Process(Microsoft.Mom.Sdk.ServiceHost)\Thread Count

Process(Microsoft.Mom.Sdk.ServiceHost)\Virtual Bytes

Process(Microsoft.Mom.Sdk.ServiceHost)\Working Set

OpsMgr specific performance counters: These counters are OpsMgr-specific counters that indicate the performance of specific aspects of OpsMgr on the root management server:

Health Service\Workflow Count: The number of workflows that are running on this root management server.

Health Service Management Groups(*)\Active File Uploads: The number of file transfers that this root management server is handling, that is, configuration and management pack uploads to agents. If this value remains high for a long time and does not drop while little discovery or management pack importing is taking place, there could be a problem with file transfer.

Health Service Management Groups(*)\Send Queue % Used: The size of the persistent queue.

Health Service Management Groups(*)\Bind Data Source Item Drop Rate: The number of data items dropped by the root management server for database or data warehouse data collection write actions. When this counter value is not 0, the root management server or database is overloaded because it cannot handle the incoming data items fast enough or because a data item burst is occurring. The dropped data items will be resent by agents. After the overload or burst situation is finished, these data items will be inserted into the database or into the data warehouse.

Health Service Management Groups(*)\Bind Data Source Item Incoming Rate: The number of data items received by the root management server for database or data warehouse data collection write actions.

Health Service Management Groups(*)\Bind Data Source Item Post Rate: The number of data items that the root management server wrote to the database or to the data warehouse for data collection write actions.

OpsMgr Connector\Bytes Received: The number of network bytes received by the root management server, that is, the size of incoming bytes before decompression.

OpsMgr Connector\Bytes Transmitted: The number of network bytes sent by the root management server, that is, the size of outgoing bytes after compression.

OpsMgr Connector\Data Bytes Received: The number of data bytes received by the root management server, that is, the size of incoming data after decompression.

OpsMgr Connector\Data Bytes Transmitted: The number of data bytes sent by the root management server, that is, the size of outgoing data before compression.

OpsMgr Connector\Open Connections: The number of connections that are open on the root management server. This number should be the same as the number of agents or management servers that are directly connected to it.

OpsMgr Config Service\Number Of Active Requests: The number of configuration or management pack requests that are being processed by the Config service.

OpsMgr Config Service\Number Of Queued Requests: The number of queued configuration or management pack requests sent to the Config service. If this value is high for a long time, the instance space or management pack space is changing too frequently.

OpsMgr SDK Service\Client Connections: The number of SDK connections.

OpsMgr DB Write Action Modules(*)\Avg. Batch Size: The number of data items or batches that are received by database write action modules. If this number is 5,000, a data item burst is occurring.

OpsMgr DB Write Action Modules(*)\Avg. Processing Time: The number of seconds that a database write action module takes to insert a batch into a database. If this number is often greater than 60, a database insertion performance issue is occurring.

OpsMgr DW Writer Module(*)\Avg. Batch Processing Time, ms: The number of milliseconds that it takes for a data warehouse write action to insert a batch of data items into a data warehouse.

OpsMgr DW Writer Module(*)\Avg. Batch Size: The average number of data items or batches that are received by data warehouse write action modules.

OpsMgr DW Writer Module(*)\Batches/sec: The number of batches received by data warehouse write action modules per second.

OpsMgr DW Writer Module(*)\Data Items/sec: The number of data items received by data warehouse write action modules per second.

OpsMgr DW Writer Module(*)\Dropped Data Item Count: The number of data items that are dropped by data warehouse write action modules.

OpsMgr DW Writer Module(*)\Total Error Count: The number of errors that occurred in data warehouse write action modules.
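Two of the OpsMgr Connector descriptions above lend themselves to simple derived checks: Data Bytes counters measure data before compression while Bytes counters measure it on the wire, and Open Connections should match the number of directly connected agents or management servers. The sketch below is a hedged illustration. The function names are hypothetical, the compression ratio is a derived figure rather than a counter that OpsMgr exposes, and the expected peer count must come from your own inventory.

```python
def compression_ratio(data_bytes: int, network_bytes: int) -> float:
    """Uncompressed data bytes divided by the bytes actually sent or received.

    Pair Data Bytes Transmitted with Bytes Transmitted, or
    Data Bytes Received with Bytes Received.
    """
    if network_bytes <= 0:
        raise ValueError("network_bytes must be positive")
    return data_bytes / network_bytes


def connections_match(open_connections: int, expected_direct_peers: int) -> bool:
    """True when Open Connections equals the number of directly connected peers.

    `expected_direct_peers` is an assumed input from your own records of
    agents or management servers that report directly to this server.
    """
    return open_connections == expected_direct_peers
```

A mismatch in `connections_match` is only a prompt to look further, for example at network or authentication issues, and not a diagnosis on its own.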

Article ID: 2288515 - Last Review: 07/09/2012 20:47:00 - Revision: 3.0

Applies to

Microsoft System Center Operations Manager 2007 R2

Microsoft System Center Operations Manager 2007

Microsoft System Center Operations Manager 2007 Service Pack 1

Microsoft System Center 2012 Operations Manager

Keywords: kbtshoot KB2288515
