L&I - Systems Management Plan Problem/Incident Management
Version 1.14
Prepared for
Commonwealth of Pennsylvania
Department of Labor and Industry
December 2010
Department of Labor and Industry – Office of Information Technology
Systems Management Plan – Problem/Incident Management
T:\All (Common area for all OIT Staff)\Enterprise Systems Management Documents (ITIL)
Revision History
Release/Version Information | Revision Date | Author/Editor | Summary of Changes
1.0 | 02/01/07 | Mike Smith | Creation of document
1.2 | 05/15/08 | Mary Hill-Hartman, John Tamosaitis | Updates per L&I comments; update document
1.3 | 07/22/08 | John Tamosaitis | Updates per L&I comments
1.4 | 8/13/08 | John Tamosaitis | Updates per L&I comments
1.5 | 8/25 | John Tamosaitis | Updates per L&I comments
1.6 | 10/31 | John Tamosaitis | Updates per L&I comments
1.7 | 12/22/08 | John Tamosaitis | Updates per L&I comments
1.8 | 3/2/09 | John Tamosaitis | Updates per L&I comments
1.9 | 4/27/09 | John Tamosaitis | Updates per L&I comments
1.10 | 5/15/09 | John Tamosaitis | Updates per L&I comments
1.11 | 5/27/09 | John Tamosaitis | Updates per L&I comments
1.12 | 10/16/09 | John Tamosaitis | Updates per L&I comments
1.13 | 10/30 | John Tamosaitis | Updates per L&I comments
1.14 | 12/9/09 | John Tamosaitis | Updates per L&I comments
1.15 | 11/15/10 | John Tamosaitis | Update for Prod outages – use of incident reports
Reviewed By:
Name | Team/Role | Reviewer Comments | Date Reviewed
Myrna Barnes | Chief, Customer Relations Division | Myrna Barnes | 12/7/2009
Anita Steinmeier | Chief, Enterprise Software and Information Division | Anita Steinmeier | 12/7/2009
Karen Fausnacht | Chief, Project Mgmt Division | Karen Fausnacht | 11/16/2009
Steve Yurich | Chief, Security Division | Steve Yurich | 11/10/2009
Jacki Hagmayer | Chief, Engineering and Research Division | Jacki Hagmayer | 12/7/2009
Ed Bowlen | Chief, Standards Development & Compliance Division | Ed Bowlen | 12/8/2009
Joe Sheridan | Chief, Data Mgmt & Database Operations Division | Joe Sheridan | 12/8/2009
Bryan Reed | Chief, Compensation & Insurance Division | Bryan Reed | 11/13/2009
Mary Lynn Kowalski | Chief, Unemployment Compensation Division | Mary Lynn Kowalski | 12/8/2009
John Shontz | Chief, Vocational Rehabilitation – Safety & Labor Mgmt Relations Division | John Shontz | 12/8/2009
Phil Day | Chief, Workforce Development Division | Phil Day | 12/8/2009
John Auchey | Chief, Server Farm Operations Division | John Auchey | 12/8/2009
David Vogelsong | Chief, Infrastructure Division | David Vogelsong | 11/10/2009
Bill Glatz | Chief, Network Support Services Division | Bill Glatz | 11/12/2009
Marty Thomas | Chief, Mainframe Operations Division | Marty Thomas | 11/12/2009
Approved By:
Name | Team/Role | Sign-off Date
Michele Sinko | Director, BES | 12/18/2009
John Malinoski | Director, BIO | 12/18/2009
Neil Ross | Director, BEA | 12/18/2009
David Andrews | Director, BBAD | 12/18/2009
Table of Contents
1.0 Preface
2.0 Owner/Responsible
3.0 IT Process Integration
4.0 Problem/Incident Management
    4.1 Problem/Incident Management Introduction
    4.2 Problem/Incident Management Purpose
    4.3 Problem/Incident Management Definitions
        4.3.1 Problem/Incident Management Definitions
    4.4 Problem/Incident Management Objectives
    4.5 Problem/Incident Management Inter-relationships
    4.6 Problem/Incident Management Guiding Principles
    4.7 Problem Management Roles and Responsibilities
    4.8 Problem/Incident Management Process
        4.8.1 Problem Management Activities
        4.8.2 Process Components
        4.8.3 Description of Process Components
    4.9 Problem/Incident Management Procedures
        4.9.1 Problem Identified
        4.9.2 Record, Assess, Classify Problem/Incident
        4.9.3 Diagnose and Escalate Problem/Incident
        4.9.4 Resolution/Bypass and Verification of Problem/Incident
        4.9.5 Survey and Follow-up Problem/Incident
    4.10 Problem Management Guidelines
        4.10.1 Remedy Problem/Incident Priority Levels Matrix
        4.10.2 Help Desk Priority/Event Management Matrix for Servers
        4.10.3 Tivoli Response Framework Matrix
        4.10.4 Help Desk Brain Knowledgebase Entries
    4.11 Problem Management Metrics
    4.12 Problem/Incident Management Tool Capabilities
5.0 Appendix A – Sample Problems
    5.1 L&I Enterprise Problem/Incident Examples
6.0 Appendix B – Acronyms
    6.1 L&I Acronyms
1.0 Preface
A cross-IBM team signed a 6 1/2-year Service Oriented Architecture (SOA)-based application development contract with the Commonwealth of Pennsylvania for a new unemployment compensation modernization system (UCMS). The system is intended to provide a platform for growth and innovation that will serve the Commonwealth for the foreseeable future. As part of the Agreement, IBM was required to prepare a Systems Management Plan for the UCMS project. This document represents the evolution of the Enterprise Systems Management (ESM) Plan work product. This document, L&I - Systems Management Plan - Problem/Incident Management, can be found at T:\All (Common area for all OIT Staff)\Enterprise Systems Management Documents (ITIL), along with other Systems Management Plan documents based on the Information Technology Infrastructure Library (ITIL).
See Appendix B for a complete list of Acronyms used in this document.
2.0 Owner/Responsible
As of July 21, 2008, the Office of Information Technology (OIT), Bureau of Enterprise Services Customer Relations Division (BES-CRD) is the owner of this document. The owner is expected to update the Systems Management Plan on a quarterly basis.
3.0 IT Process Integration
Multiple integration points exist among the IT Operational Management processes. The following figure gives a high-level overview of those integration points.
3.0 Figure 1: ESM Process Integration (Refer to section 4.5 for description of this process)
[Figure: ESM Process Integration. Processes shown: Change Management Process; Help Desk/Problem Process; Backup/Recovery Process; Asset/Configuration Management Process; Service Level Management Process; Event Management System; Performance/Capacity Management Process; Security Process; Release/Software Distribution Process; Availability Management Process.]
4.0 Problem/Incident Management
4.1 Problem/Incident Management Introduction
Problem/Incident Management is a formal, structured process that identifies and addresses service anomalies and restores application or system functions as quickly as possible. Its goals are to mitigate the impact on the Department of Labor and Industry (L&I) business and to bring services back up to the levels outlined in the Service Level Agreements (SLAs). L&I’s Problem/Incident Management Plan includes a Problem Process Owner, an Operations Manager Team Lead, a Help Desk Manager, a Help Desk Coordinator, LINKS Help Desk Agents, and Level 2/Level 3 Subject Matter Experts (SMEs) for diagnosing and resolving Problem tickets. The entire process will be managed by the Help Desk Manager. The process will record the Problem and its root cause, record the results of the resolution of the Problem, and provide information required by other processes, such as Change.
4.2 Problem/Incident Management Purpose
The L&I Problem/Incident Management Plan covers all problems and incidents that occur in all of the L&I custom application software, commercial-off-the-shelf software, and infrastructure/network support services hardware and software components that impact (or may impact) the L&I business and technology environments. Examples are listed in Appendix A – this list is a working document and will be modified over time.
The L&I Problem/Incident Management Plan will also serve as the “starting” point for problems that need to be forwarded to L&I’s overall Problem Management process.
4.3 Problem/Incident Management Definitions
The following diagram illustrates the relationships between Event Management, Problem/Incident Management, and Configuration Management.
[Figure elements: Information Technology Infrastructure Library (ITIL) Configuration Management Database (contains relationships; federated databases), currently Remedy and spreadsheets at DLI. Event Management: events tracked in TEC at L&I. Problem/Incident Management (records, tracks, documents problems): problems tracked in Remedy at L&I.]
4.3 Figure 2: Problem, Event and Configuration Management
4.3.1 Problem/Incident Management Definitions
An incident is any event that is not part of the standard operation of a service and causes, or may cause, an interruption to or reduction in the quality of that service.
A problem is an unknown, underlying cause of one or more incidents. A single problem may generate several incidents.
o For the purpose of this document, the following are examples of problems:
    Events that are detected by L&I Tivoli Infrastructure and escalated to the Tivoli Enterprise Console (TEC) and Remedy.
    Incidents that are reported by end users through the LINKS Help Desk.
    Incidents that are identified by OIT staff and reported through the LINKS Help Desk.
A large scale problem or outage is defined as one or more applications or services that become inoperable and cause a major impact on the availability or function of systems. Examples of such systems include, but are not limited to:
o Wide Area Network (WAN) links or Metropolitan Area Network (MAN) links that affect a large number of users
o Enterprise applications
o Public facing applications
o Enterprise servers that service a large number of users
o Enterprise shared applications
o Mainframe applications
o Voice services affecting a large number of users or multiple sites
o Desktop services that affect a large number of users or sites
o Facility issues that affect a large number of users or multiple sites
o Business Applications
4.4 Problem/Incident Management Objectives
The objective of the L&I Problem/Incident Management Plan is to provide a set of unambiguous and repeatable processes and procedures for:
Providing a model for recording and resolving Problems and Incidents that may occur within the L&I environment. (Please see Appendix A – Sample Problems)
Providing initial support and classification of received Incidents and Problems
Ensuring that Problems and Incidents are assigned to the proper support team with an assigned priority
Ensuring that all Problems and Incidents are resolved within established time frames (according to priority) and/or escalated to the next level of support
Effectively tracking and managing Problems and Incidents once they occur
Providing information to other processes, such as Change and Service Level Management
Leveraging knowledge bases to increase problem resolution effectiveness
Reviewing and validating closed problems to ensure customer satisfaction
Performing trend analysis and proactive problem prevention
Leveraging Help Desk tools to increase problem resolution effectiveness
4.5 Problem/Incident Management Inter-relationships
Following are some specific examples of how Problem/Incident Management interacts with other IT Operational processes. (Please note that not all processes listed are depicted in 3.0 Figure 1.)
Change Management
o Fixes for problems will generate changes to all environments and will require change requests to install the tested and approved fixes.
o The implementation of a change may trigger problems in all environments that need to be logged and managed by the Problem Management process.
o A description and schedule of changes planned for systems is needed for problem analysis.
o Help Desk training requirements associated with technical changes.
Event Management
o Future Management Plan
o When an event is classified as a potential or current problem, a Problem Ticket should be opened and handled within the Problem Management process.
Configuration Management
o Future Management Plan
o The Problem Management process will obtain configuration information via the Configuration Management Process, when required.
o The Problem Management process uses Configuration Management information during monitoring and troubleshooting problems.
Asset Management
o Future Management Plan
o The Problem/Incident Management Process requires updated Asset Management information for use during monitoring and troubleshooting problems.
Service Level Management
o Future Management Plan
o Problem/Incident Management provides data to Service Level Management for use in preparing measurement reports.
o Problem/Incident Management must also detect and identify problem trends that impact the attainment of service targets as a result of repetitive problems.
Performance Management
o Future Management Plan
o System and performance problems are reported through Problem Management for analysis and resolution by the responsible technical support staff.
Backup and Recovery Management
o Future Management Plan
o Problem Management is linked to Backup and Recovery Management to ensure that all component problems have been identified and properly recorded.
o Documented and/or automated recovery procedures are essential for fast problem resolution or service restoration.
4.6 Problem/Incident Management Guiding Principles
Guiding principles are fundamental rules or guidelines that establish design and implementation constraints and align with management’s vision of L&I service delivery:
The LINKS Help Desk provides a single point of contact (SPOC) for L&I employees and business partners needing technology support during agreed upon coverage hours. All problems raised are entered into Remedy.
The long term objective of Problem/Incident Management is to have all problems/incidents called into the LINKS Help Desk as a SPOC.
Events are generated, forwarded to the Tivoli Enterprise Console (TEC) and entered into Remedy depending on a pre-assigned priority and risk level. Priority and risk levels are described in T:\All (Common area for all OIT Staff)\Tivoli.
Soft skills such as customer service orientation, communication and analytic ability are a priority at the LINKS Help Desk.
The Help Desk is proactive rather than reactive wherever possible.
Service level targets for the LINKS Help Desk are defined, measured and reported on a regular basis.
Interfaces to other organizations are through a defined set of escalation processes and support agreements and enabled by a Help Desk management system.
Support acceptance criteria for applications and systems include timely review and acceptance rights for the Help Desk at the pre-implementation stages and all required changes are made in accordance with the Change Management process requirements.
The Help Desk is automated wherever possible.
There are defined second and third level support groups or SMEs, depending upon level of expertise, and associated procedures for routing all problems or service requests that cannot be addressed by the LINKS Help Desk (LINKS Help Desk = Level 1; Support Groups = Level 2 or Level 3).
Help Desk, Level 2 and Level 3 Support Groups have access to all appropriate resource tools and information databases to assist in servicing the customer request or addressing problems.
Any problems that cause an outage are entered into Remedy and an Incident Report is developed.
Failure to conform to the Problem Management process will result in appropriate management action.
4.7 Problem Management Roles and Responsibilities
Role: L&I Customer/Employee
Members: L&I Customer/Employee
Responsibilities:
o Initiates the need for a Help Desk Ticket and opens a call with the LINKS Help Desk Analyst (all calls come to the LINKS Help Desk)

Role: L&I/OIT Employee
Members: L&I/OIT/BBAD Staff, L&I/OIT/BEA Staff, L&I/OIT/BES Staff, L&I/OIT/BIO Staff
Responsibilities:
o Reports problems in response to L&I employee concern by entering into Remedy or by reporting problems to LINKS to enter into Remedy
o Provides responsive, timely support to all support requests escalated from the LINKS Help Desk Agents or OIT self-reported Help Desk tickets
o Resolves the problem, documents the solution in the database or ensures follow-through if the call is passed to another Level 2, Level 3 SME
o Maintains service level agreements on response turnaround

Role: Problem Process Owner
Members: L&I/OIT/BES/CRD Chief
Responsibilities:
o Acts as the overall “evangelist” for process work
o Prioritizes investment, as the responsible individual for the cost and investment overall in process work
o Resolves or escalates cross-process issues
o Approves new process definitions and approves or rejects process deviation requests
o Assigns or designates ownership and roles and responsibilities for each Operational process
o Evaluates process performance against standards and control criteria

Role: LINKS Help Desk Agents
Members: LINKS Help Desk Agents
Responsibilities:
o Provides telephone assistance to customers and maintains accurate records
o Makes the first attempt to resolve the service issue reported by the end user
o Acts as end-user advocate to ensure that service issues are resolved in a timely fashion
o Ensures that the ticket contains an accurate and properly detailed description of the problem
o Ensures that the priority classification is correct
o Recognizes patterns of symptoms, applies search tools to identify previously developed solutions, and helps end-users implement the solution
o Assumes responsibility for problem tickets until resolved
o Escalates problems to the Level 2 support group if unable to satisfactorily resolve them

Role: Lead Help Desk Agent
Members: LINKS Help Desk Lead Agent
Responsibilities:
o Provides telephone assistance to customers and maintains accurate records
o Makes the first attempt to resolve the service issue reported by the end user
o Acts as end-user advocate to ensure that service issues are resolved in a timely fashion
o Ensures tickets contain an accurate and properly detailed description of the problem
o Ensures ticket priority classification is correct
o Recognizes patterns of symptoms, applies search tools to identify previously developed solutions, and helps end-users implement the solution
o Assumes responsibility for problem tickets until resolved
o Escalates problems to the Level 2 support group if unable to satisfactorily resolve them
o Verifies customer satisfaction of problem resolutions (Level 1 and Level 2) by performing customer follow-ups
o Develops department-specific reports in Remedy and for the Automatic Call Distribution (ACD) System

Role: Help Desk Coordinator
Members: L&I Help Desk Coordinator
Responsibilities:
o Communicates problem status and unresolved problems to customers and Help Desk management
o Verifies customer satisfaction of problem resolutions (Level 1 and Level 2). Remedy sends a survey to customers when a ticket is resolved
o Maintains and improves communication and escalation lists
o Develops department-specific reports and procedures
o Participates in the problem review process
o Ensures assigned priority level for tickets follows the agreed-upon guidelines and that problems are resolved or escalated within service level targets

Role: Help Desk Manager
Members: L&I Help Desk Manager
Responsibilities:
o Ensures that a well defined, consistently executed and effective PM/IM process is established and maintained
o As owner of the IM process, ensures that the process and capabilities are adequate, and are improved when necessary
o Reviews and understands the Problem Management process and tools
o Evaluates the effectiveness of the PM/IM process and supporting mechanisms such as reports, communication formats/messages, and escalation procedures
o Makes recommendations to the Problem Process Owner on ways to improve the process

Role: Level 2, Level 3 Subject Matter Experts (SMEs)
Members: L&I/OIT/BBAD Staff, L&I/OIT/BEA Staff, L&I/OIT/BES Staff, L&I/OIT/BIO Staff
Responsibilities:
o Provides responsive, timely support to all support requests escalated from the LINKS Help Desk Agents
o Resolves the problem, documents the solution in the database and ensures follow-through if the call is passed to another SME
o Maintains service level agreements on response turnaround
o Works as a team to resolve outstanding support problems and/or requests to even workload, establish priorities, and meet deadlines
o Escalates and works with appropriate vendor support to resolve issues where appropriate

Role: Level 2, Level 3 SME Managers
Members: L&I/OIT/BBAD Management, L&I/OIT/BEA Management, L&I/OIT/BES Management, L&I/OIT/BIO Management
Responsibilities:
o Leads and manages Level 2 and Level 3 SMEs throughout the problem resolution process
o Provides communication and notification to users, OIT Bureau Directors and CIO as necessary

Role: Problem/Incident Coordinator
Members: L&I OIT BEA Bureau Director
Responsibilities:
o Assembles and manages the Level 2 and Level 3 SME teams and sub-teams
o Coordinates with other Commonwealth agencies, OIT managers, business process managers and agency executives
o Establishes team leads as needed
o Leads and manages sub-teams to ensure close coordination and communications with each of the sub-teams
o Takes ownership of business critical IT problems and delivers effective workaround implementation, accurate root cause analysis and problem resolution
o Ensures complete and accurate documentation is completed at all stages of the Problem Management process
o Details responsibilities and specific tasks for emergency response activities and business resumption operations based upon pre-defined timeframes

Role: Operations Manager Team Lead
Members: BIO Technical Operations Lead
Responsibilities:
o Attends required meetings and effectively communicates the status of problems with high visibility to senior management, when required
o Conducts Incident Report meetings for analyzing outages
4.7 Figure 3: Problem/Incident Management Roles and Responsibilities
4.8 Problem/Incident Management Process
The Problem/Incident Control Process is a structured, step-by-step approach to controlling and managing Problem activity. This process will focus on restoring interrupted service as soon as possible.
4.8.1 Problem Management Activities
Five documented activities make up the Problem Control Process:
1. Problem Identified
    a. Receive notification of Problem/Incident through LINKS Help Desk
    b. Tivoli generates an event
2. Record, Assess, and Classify Problem/Incident
    a. Log call details in Remedy
    b. Assess & classify Problem/Incident or Event (priority) & communicate
    c. Assign priority level
    d. Identify and execute incident bypass or resolution, if possible
3. Diagnose and Escalate Problem/Incident
    a. Diagnose or escalate problem (Level 2, Level 3, vendor or Problem/Incident Coordinator)
4. Resolution/Bypass and Verify
    a. Recover from problem, if necessary, apply bypass or temporary fix
    b. Resolve problem (correction at root cause)
    c. Update customer and verify resolution
5. Survey and Follow-up Problem/Incident
    a. Survey end user
    b. Conduct Problem/Incident review meeting, produce an Incident Report, and analyze reports
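The five activities above can be read as an ordered workflow. The following is a minimal sketch in Python, not part of the L&I tooling: the names `Activity`, `next_activity`, and `walk` are hypothetical, and the shortcut from activity 2 straight to activity 4 when a bypass or resolution is found at step 2d is an assumption made for illustration.

```python
from enum import Enum
from typing import List, Optional


class Activity(Enum):
    """The five activities of the Problem Control Process, in order."""
    PROBLEM_IDENTIFIED = 1
    RECORD_ASSESS_CLASSIFY = 2
    DIAGNOSE_ESCALATE = 3
    RESOLUTION_BYPASS_VERIFY = 4
    SURVEY_FOLLOW_UP = 5


def next_activity(current: Activity, resolved_early: bool = False) -> Optional[Activity]:
    """Return the activity that follows `current`, or None when the process ends.

    Assumption: if a bypass or resolution is found while recording the
    ticket (step 2d), diagnosis and escalation are skipped.
    """
    if current is Activity.SURVEY_FOLLOW_UP:
        return None
    if current is Activity.RECORD_ASSESS_CLASSIFY and resolved_early:
        return Activity.RESOLUTION_BYPASS_VERIFY
    return Activity(current.value + 1)


def walk(resolved_early: bool = False) -> List[str]:
    """List the names of the activities a ticket passes through, in order."""
    path: List[str] = []
    step: Optional[Activity] = Activity.PROBLEM_IDENTIFIED
    while step is not None:
        path.append(step.name)
        step = next_activity(step, resolved_early)
    return path
```

Calling `walk()` lists all five activities in order, while `walk(resolved_early=True)` omits the diagnose-and-escalate step.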
The following figure, 4.8 Figure 4, illustrates the proposed L&I Problem/Incident Management process flow.
4.8 Figure 4: Process Flow
4.8.2 Process Components
The Problem/Incident Management process focuses on restoring interrupted service as soon as possible. Process Components are the Inputs, Tools/Techniques, and Outputs required for effective and comprehensive Problem/Incident control management.
The following figure maps each Process Component to the appropriate Process Flow Activity.
Activity
    Problem Identified
    Record, Assess and Classify
    Diagnose and Escalate
    Resolution, Bypass and Verify
    Survey and Follow-up
Input
    Problem as reported by user
    Event identified in Tivoli and forwarded to TEC and Remedy
    Problem/Incident Escalation Schedule
    Communications
    Problem/Incident Priority Levels
    Contacts
    Remedy Problem Ticket
    Remedy Problem Ticket
    Remedy Problem Ticket
Tools and Techniques
    Telephone system
    Tivoli Monitors
    Remedy
    Brain Knowledgebase
    Procedures
    Level 2 Support team tools (various)
    Level 3 Support team tools (various)
    Remedy
    Level 2 Support team tools (various)
    Level 3 Support team tools (various)
    Remedy
Output
    Remedy Problem Ticket
    Remedy Problem Ticket
    Communication to user
    Updated Remedy Problem Ticket
    Remedy Work Log
    Conference Call
    Updated Remedy Problem Ticket
    Remedy Work Log and Solution
    Incident Report(s)
    Completed Problem Ticket
    Communicate result
    Survey
    Incident Log
4.8 Figure 6: Process Flow Activity
4.8.3 Description of Process Components
Process Component | Description
Purpose | Restore interrupted service as soon as possible
Owner | LINKS Help Desk/OIT Level 2/Level 3/Problem/Incident Coordinator
Input | Problem identified; problem recorded in Remedy
Output | Service restored; end user notified; recorded in Remedy; updated Brain entry, if required
Measurement | Quantity of tickets presently open; quantity of incidents (tickets) by time (monthly, quarterly); quantity of tickets resolved by each support group; average time tickets were assigned to each group; average time to resolve incident; percentage of incidents resolved by LINKS Help Desk; percentage of incidents escalated to support groups; customer surveys
4.8 Figure 7: Process Components
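As an illustration of the Measurement row, a few of the listed figures can be derived from ticket records. This is a sketch under stated assumptions only: the `Ticket` fields and function names are hypothetical and do not reflect the actual Remedy schema or reports.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Ticket:
    """Hypothetical ticket record; field names are illustrative only."""
    opened: datetime
    resolved: Optional[datetime]   # None while the ticket is still open
    resolved_by: Optional[str]     # e.g. "LINKS Help Desk" or a support group


def open_count(tickets: List[Ticket]) -> int:
    """Quantity of tickets presently open."""
    return sum(1 for t in tickets if t.resolved is None)


def average_hours_to_resolve(tickets: List[Ticket]) -> Optional[float]:
    """Average time to resolve, in hours, over resolved tickets only."""
    hours = [
        (t.resolved - t.opened).total_seconds() / 3600
        for t in tickets
        if t.resolved is not None
    ]
    return sum(hours) / len(hours) if hours else None


def percent_resolved_by(tickets: List[Ticket], group: str) -> float:
    """Percentage of resolved tickets that were resolved by `group`."""
    resolved = [t for t in tickets if t.resolved is not None]
    if not resolved:
        return 0.0
    matching = sum(1 for t in resolved if t.resolved_by == group)
    return 100.0 * matching / len(resolved)
```

The percentage escalated to support groups could be computed the same way, given a field that records whether a ticket left Level 1.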
4.9 Problem/Incident Management Procedures
A procedure integrates a Process Flow Activity with one or more Process Components to create a series of step-by-step instructions that facilitate effective Problem/Incident Management. Below are the Problem/Incident Management Procedures.
4.9.1 Problem Identified
4.9.1.1 Receive call or notice of the Problem/Incident
Problems are identified in one of two ways:
o An incident occurs and is reported to the LINKS Help Desk via telephone call.
o Tivoli identifies a problem and TEC generates an event.
4.9.2 Record, Assess, Classify Problem/Incident
4.9.2.1 Log call details in Remedy
The problem is recorded automatically in Remedy by TEC or manually by the LINKS Help Desk or an L&I/OIT Employee. Multiple tickets for associated problems/incidents will be related to one parent Remedy ticket as necessary.
The problem is analyzed, properly classified and assigned a priority level. Common or previously identified problems will be resolved at this level when possible.
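Relating multiple tickets to one parent ticket can be pictured as a simple parent reference on each record. The sketch below is an assumption-laden illustration in Python: `ProblemTicket`, `relate`, and `children_of` are hypothetical names, and nothing here reflects how Remedy actually stores ticket relationships.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ProblemTicket:
    """Hypothetical ticket record; not the Remedy data model."""
    ticket_id: str
    summary: str
    source: str                      # "TEC" (automatic) or "Help Desk" (manual)
    parent_id: Optional[str] = None  # set when related to a parent ticket


def relate(tickets: Dict[str, ProblemTicket], child_id: str, parent_id: str) -> None:
    """Relate an existing ticket to an existing parent ticket."""
    if child_id == parent_id:
        raise ValueError("a ticket cannot be its own parent")
    if child_id not in tickets or parent_id not in tickets:
        raise KeyError("both tickets must already be recorded")
    tickets[child_id].parent_id = parent_id


def children_of(tickets: Dict[str, ProblemTicket], parent_id: str) -> List[str]:
    """Return the sorted IDs of all tickets related to `parent_id`."""
    return sorted(t.ticket_id for t in tickets.values() if t.parent_id == parent_id)
```

With this shape, several incident tickets reported for the same outage can each point at one parent ticket, so that work on the underlying problem is tracked in a single place.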
4.9.3 Diagnose and Escalate Problem/Incident
4.9.3.1 Diagnose the Problem/Incident
Problems not immediately resolved or those that appear to be part of a larger problem will then be escalated to Level 2 SMEs.
The Level 2 SMEs will perform problem diagnosis activities using the appropriate technical tools.
In the event of a major problem/incident or an outage, the Level 2 SME Manager will notify the appropriate people using the L&I – Problem/Incident Communication Plan.
The Level 2 SME Manager will work with the BES-CRD Chief or the Help Desk Manager to determine if enterprise-wide notification is necessary.
4.9.3.2 Escalate the Problem/Incident
If the Level 2 SME is unable to identify and resolve the problem, Level 2 will escalate the problem to Level 3.
In the event of a major problem/incident or an outage, the Level 3 SME Manager will notify the appropriate people using the L&I – Problem/Incident Communication Plan.
The Level 3 SME Manager will work with the BES-CRD Chief or the Help Desk Manager to determine if enterprise-wide notification is necessary.
Level 3 works to resolve the problem and/or involves the Vendor as required. If the Level 3 SME is unable to identify or resolve the problem, or if the problem appears to be part of an outage or larger problem, Level 3 will escalate the problem to the Problem/Incident Coordinator.
The Problem/Incident Coordinator will form ad hoc teams to resolve the problem or deliver an effective workaround, and will ensure that complete and accurate documentation is maintained at all stages of the problem resolution process.
If the problem/incident or outage affects the Production environment and/or a Production System, the Problem/Incident Coordinator will initiate a conference call within the first 30 minutes with all appropriate staff involved (see Problem/Incident Communication Plan Section 5 for teleconference phone number).
In the event of a major problem/incident or an outage, the Level 3 SME Manager will notify the appropriate people using the L&I – Problem/Incident Communication Plan.
The Problem/Incident Coordinator will work with the BES-CRD Chief or the Help Desk Manager to determine if enterprise-wide notification is necessary. The Problem/Incident Coordinator will work with Level 2 and Level 3 SMEs and vendors as necessary on diagnosing the problem/incident.
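The escalation order described in Sections 4.9.2 through 4.9.3.2 can be sketched as a simple ordered list. This is an illustrative sketch only: the names `ESCALATION_PATH` and `next_tier` are hypothetical and are not part of Remedy or any L&I tool.

```python
# Escalation tiers in the order described in Sections 4.9.2 through 4.9.3.2.
# The tier labels are taken from this plan; the data structure is an assumption.
ESCALATION_PATH = [
    "LINKS Help Desk",
    "Level 2 SME",
    "Level 3 SME",
    "Problem/Incident Coordinator",
]


def next_tier(current):
    """Return the tier a ticket escalates to, or None at the end of the path."""
    index = ESCALATION_PATH.index(current)  # raises ValueError for unknown tiers
    if index + 1 < len(ESCALATION_PATH):
        return ESCALATION_PATH[index + 1]
    return None
```

For example, `next_tier("Level 2 SME")` returns `"Level 3 SME"`, and the Problem/Incident Coordinator has no further tier.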
4.9.4 Resolution/Bypass and Verification of Problem/Incident
4.9.4.1 Recover from problem/incident: Bypass or temporary fix
Once the Problem/Incident has been correctly identified, a bypass or temporary fix can be implemented if a permanent fix is not available in the required time frame as defined in Section 4.10 Figure 8.
Before any temporary fix or bypass can be implemented, it will need to be tested and scheduled through the change control process.
The Remedy ticket should be updated and remain open until a permanent resolution can be developed and implemented.
4.9.4.2 Resolve problem (correction at root cause)
Work will continue on the problem to ensure the root cause is identified and the problem resolved.
The Level 2 and Level 3 SMEs will document the resolution in Remedy for quicker diagnosis of similar Problem tickets in the future. If applicable, the resolution will be incorporated into the BRAIN Knowledgebase for future calls. Multiple tickets for associated problems/incidents will be related to one parent Remedy ticket as necessary.
The resolution will be tested, scheduled through the change control process, and implemented.
4.9.4.3 Update Customer and Verify Resolution
Once the Problem/Incident is resolved, the resolving Help Desk agent or Level 2 or Level 3 SME will verify with the L&I Customer/Employee that it has been resolved. The resolving agent or technician will then mark the ticket as resolved in Remedy.
4.9.5 Survey and Follow-up Problem/Incident
4.9.5.1 Survey End User
When a Problem/Incident is successfully resolved in Remedy, a Remedy survey will be electronically sent to the L&I Customer/Employee as a follow-up to the Problem/Incident resolution.
The L&I Customer/Employee has the option of filling out the survey and commenting. The results are returned to the Help Desk Manager, Help Desk Coordinator and LINKS Lead Help Desk agent to be reviewed for possible follow-up. The appropriate Level 2 or Level 3 SME Supervisor may be contacted when the survey warrants further follow-up action.
4.9.5.2 Conduct Problem/Incident Review meetings and analyze reports
Produce the OIT Incident Report and review the report with management. Monthly reports are generated through Remedy and the LINKS ACD System:
o Top Ten Issues – Category, Type and Item
o Links - HD Services Rpt
o Call and Ticket statistics
o Survey Report
o Close Ratio
o Tickets per Agent
The L&I Help Desk Manager will conduct monthly review meetings to review all Problem tickets and the reports generated from Remedy.
4.10 Problem Management Guidelines
4.10.1 Remedy Problem/Incident Priority Levels Matrix
The problem/incident priority levels are set depending on their source:
o An incident occurs for the end user and is reported to the LINKS Help Desk via telephone call. These problems/incidents are assigned a priority level by the LINKS Help Desk agent following the table in 4.10 Figure 8, Remedy Problem/Incident Priority Levels Matrix.
o OIT staff is alerted to a Problem/Incident. These problems/incidents are assigned a priority level by the Level 2 or Level 3 Subject Matter Experts (SMEs) following the table in 4.10 Figure 8, Remedy Problem/Incident Priority Levels Matrix.
o Tivoli identifies a problem and TEC generates an event. Based on the server’s risk assessment and the severity of the event in TEC, a ticket may or may not be opened in Remedy. The source of an event from Tivoli may determine its initial severity. If the source does not set the severity, it will be determined by the default settings for the event class in the Tivoli Enterprise Console (TEC). For tickets opened in Remedy, Remedy sets the trouble ticket priority based on the Help Desk Priority/Event Management Matrix for Servers, 4.10 Figure 9.
Priority | Scope of Impact | Impact | Resolution Time
Urgent | Location or System Down | Critical (Entire office or location; impacts a large number of users) | 0-2 Hours
High | Component Down or Degraded | Severe (Impacts a number of users) | 2-4 Hours
Medium | Component Down or Degraded | Minimal (Impact to a single user) | +4 Hours - Day
Low | None, component is functional | None (Impact viewed as an inconvenience to a single user) | Day(s)
4.10 Figure 8: Remedy Problem/Incident Priority Levels Matrix
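Section 4.9.4.1 ties the decision to apply a bypass to the resolution times in Figure 8. The check can be sketched as follows. This is a minimal illustration, not a Remedy feature: the function name is hypothetical, and treating "Day" as 24 hours and "Day(s)" as having no fixed limit are assumptions made only for the example.

```python
# Upper bound of each resolution window in hours, read from 4.10 Figure 8.
# Assumption: "Day" is treated as 24 hours; "Day(s)" has no fixed bound (None).
RESOLUTION_LIMIT_HOURS = {"Urgent": 2, "High": 4, "Medium": 24, "Low": None}


def bypass_needed(priority, hours_until_permanent_fix):
    """Return True when a permanent fix would miss the Figure 8 time frame."""
    limit = RESOLUTION_LIMIT_HOURS[priority]
    if limit is None:
        # No fixed time frame, so no bypass is forced by the matrix.
        return False
    return hours_until_permanent_fix > limit
```

For instance, an Urgent ticket whose permanent fix is three hours away exceeds the 0-2 hour window, so a bypass or temporary fix would be considered.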
4.10.2 Help Desk Priority/Event Management Matrix for Servers
The Help Desk Priority/Event Management Matrix defines how events generated through TEC are mapped to the Remedy Priority Level. In most cases, this is done by checking the “risk level” assigned to the server in the Remedy Asset Record.
For example, a TEC Severity Level of ‘Critical’ and a Risk Level of ‘Medium’ will produce a Remedy Help Desk priority level of ‘High’. In cases where a Help Desk ticket needs to be entered manually, only the Risk Level assignment of the server will be used to set the Help Desk Priority Level.
Help Desk Priority Matrix (set Help Desk Priority as shown below)
TEC Severity | Risk Level High | Risk Level Medium | Risk Level Low | Risk Level Blank
Warning | High | Medium | Low | Medium
Critical | Urgent | High | Medium | High
Fatal | Urgent | Urgent | High | Urgent
None (Manually Created DH Ticket) | Urgent | High | Medium | N/A
4.10 Figure 9: Help Desk Priority/Event Management Matrix for Servers
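The mapping in Figure 9 amounts to a two-key lookup. The sketch below transcribes the matrix into a nested dictionary; the names `HELP_DESK_PRIORITY` and `help_desk_priority` are hypothetical, and this is not how Remedy itself stores the mapping.

```python
# Figure 9 transcribed as: TEC severity -> server risk level -> Help Desk priority.
# "None" is the row for manually created tickets; "Blank" means no risk level set.
HELP_DESK_PRIORITY = {
    "Warning":  {"High": "High",   "Medium": "Medium", "Low": "Low",    "Blank": "Medium"},
    "Critical": {"High": "Urgent", "Medium": "High",   "Low": "Medium", "Blank": "High"},
    "Fatal":    {"High": "Urgent", "Medium": "Urgent", "Low": "High",   "Blank": "Urgent"},
    "None":     {"High": "Urgent", "Medium": "High",   "Low": "Medium", "Blank": "N/A"},
}


def help_desk_priority(tec_severity, risk_level):
    """Look up the Remedy Help Desk priority for a severity and risk level."""
    return HELP_DESK_PRIORITY[tec_severity][risk_level]
```

With this table, `help_desk_priority("Critical", "Medium")` gives `"High"`, which agrees with the example in Section 4.10.2.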
4.10.3 Tivoli Response Framework Matrix
Problems/incidents that are generated through TEC will be escalated based on the server’s risk assessment and the severity of the event, as defined in the Tivoli Response Framework Matrix below.
For example, if a server risk level is set to “High” and the TEC event is determined to be either critical or fatal, the following actions will be taken:
1. An Alarm Point Call will be generated 24/7 AND
2. A Remedy Ticket will be created using the Help Desk Priority Matrix AND
3. A Text Page will be generated AND
4. An Email will be generated AND
5. The event will be displayed on the TEC console
In a second example, if a server risk level is set to “High” and the TEC event is determined to be either warning or minor, the following actions will be taken:
1. A Remedy Ticket will be created using the Help Desk Priority Matrix AND
2. An Email will be generated AND
3. The event will be displayed on the TEC console
Server Risk Level / TEC Event Severity | High Priority | Medium Priority | Low Priority
Critical/Fatal | Alarm Point Call/Operator Call (24/7), Remedy Ticket, Text Page, E-Mail, TEC Console | Alarm Point Call/Operator Call (Work Hours), Remedy Ticket, Text Page, E-Mail, TEC Console | Alarm Point Call/Operator Call (Work Hours), Remedy Ticket, Text Page, E-Mail, TEC Console
Warning/Minor | Remedy Ticket, E-Mail, TEC Console | TEC Console | TEC Console
Harmless/Unknown | TEC Console | TEC Console | TEC Console
4.10 Figure 10: Tivoli Response Framework Matrix
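The response framework in Figure 10 can likewise be sketched as a lookup from severity band and server risk level to a list of actions. The names below are hypothetical and the code only illustrates the matrix; it does not describe the actual TEC rule configuration.

```python
# Map each TEC event severity to its row (band) in Figure 10.
SEVERITY_BAND = {
    "critical": "Critical/Fatal", "fatal": "Critical/Fatal",
    "warning": "Warning/Minor", "minor": "Warning/Minor",
    "harmless": "Harmless/Unknown", "unknown": "Harmless/Unknown",
}

# Actions shared by every Critical/Fatal cell after the Alarm Point call.
_FULL = ["Remedy Ticket", "Text Page", "E-Mail", "TEC Console"]

# Figure 10 transcribed as: severity band -> server risk level -> actions.
RESPONSE_ACTIONS = {
    "Critical/Fatal": {
        "High": ["Alarm Point Call/Operator Call (24/7)"] + _FULL,
        "Medium": ["Alarm Point Call/Operator Call (Work Hours)"] + _FULL,
        "Low": ["Alarm Point Call/Operator Call (Work Hours)"] + _FULL,
    },
    "Warning/Minor": {
        "High": ["Remedy Ticket", "E-Mail", "TEC Console"],
        "Medium": ["TEC Console"],
        "Low": ["TEC Console"],
    },
    "Harmless/Unknown": {
        "High": ["TEC Console"],
        "Medium": ["TEC Console"],
        "Low": ["TEC Console"],
    },
}


def response_actions(risk_level, tec_severity):
    """Return the list of actions for a server risk level and TEC severity."""
    band = SEVERITY_BAND[tec_severity.lower()]
    return RESPONSE_ACTIONS[band][risk_level.capitalize()]
```

Calling `response_actions("High", "fatal")` returns the five actions listed in the first example above, and `response_actions("High", "warning")` returns the three actions of the second example.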
4.10.4 Help Desk Brain Knowledgebase Entries
When the need for a new Brain Knowledgebase entry is identified, the following steps are necessary:
o Notify the Help Desk Manager and Help Desk Coordinator
o The Help Desk Manager and Coordinator will email the Brain Knowledgebase Entry Template to the requestor
o Create the new Entry using the template as a guide
o Email the new Entry to the Help Desk Manager and Coordinator
o The new Entry will be reviewed by the Help Desk Manager, Coordinator and the LINKS Help Desk Lead Agent
o If no corrections or additions are necessary, the new Entry will be scheduled, then added to the Brain Knowledgebase
o If corrections or additions were made to the new Entry, it will be sent back to the requestor for review and approval
o The new Entry is then emailed back to the Help Desk Manager and Coordinator
o It is reviewed once again by the Help Desk Manager, Coordinator and LINKS Lead Help Desk Agent
o The LINKS Lead Agent schedules and adds the new Entry to the Brain Knowledgebase
4.11 Problem Management Metrics
Following are current reports of Problem/Incident Management metrics, which can measure the effectiveness of the process:
Key Performance Indicators
o Number of Tickets generated per Day, Week, and Month
o Number of Tickets resolved Monthly at Level 1
o Monthly Top 10 Category of tickets created using CTIs
o Number of Tickets processed per agent Monthly
o Number of Tickets escalated to Level 2 Monthly
o Monthly Satisfaction Surveys
o Monthly LINKS Help Desk Services Report
o Number of Tickets generated by Tivoli daily
Remedy Problem Tickets by Category
o Applications
o Hardware
o Network
o Remote Access
o Restore
o Security
o Server
o Voice
o Web Services Event
Problem Priority
o 1 - Urgent
o 2 - High
o 3 - Medium
o 4 - Low
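Several of these indicators are plain counts over ticket records. The sketch below shows one way they could be tallied from an export. The record fields (`priority`, `category`, `agent`, `escalated`, `resolved_level`) are hypothetical and do not reflect the actual Remedy schema or report definitions.

```python
from collections import Counter


def summarize_tickets(tickets):
    """Tally a few of the listed KPIs from a list of ticket dictionaries.

    Each ticket is assumed (hypothetically) to carry: priority, category,
    agent, escalated (bool), and resolved_level (int, or None while open).
    """
    by_category = Counter(t["category"] for t in tickets)
    return {
        "total": len(tickets),
        "by_priority": dict(Counter(t["priority"] for t in tickets)),
        "top_categories": by_category.most_common(10),  # Top 10 categories
        "per_agent": dict(Counter(t["agent"] for t in tickets)),
        "resolved_at_level_1": sum(1 for t in tickets if t["resolved_level"] == 1),
        "escalated_to_level_2": sum(1 for t in tickets if t["escalated"]),
    }
```

Running this over one month of tickets would yield the per-agent, per-priority, and Top 10 category counts in a single pass.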
4.12 Problem/Incident Management Tool Capabilities
The following are capabilities that have been evaluated and implemented:
o Request/problem status communication
o Ability to assign priority to problems
o Interface with Tivoli Event Management Tool to create Problem tickets from Events
o Ability to provide current status of all tickets
o Ability to forward ticket based on escalation status matrix
o Ability to provide status and analysis reports
o Logging of Problem data in a database
o Access to Asset Management information
o Automated notification when problems are transferred from queue to queue
o Simple and quick entry and update of problem tickets
o High availability for the Remedy application and data
o Solicit and retrieve customer satisfaction information via Satisfaction Survey emailed to user after resolution of Problem/Incident
o Ability to design custom reports to extract desired data
5.0 Appendix A– Sample Problems
5.1 L&I Enterprise Problem/Incident Examples
The following are Problem/Incident examples given the use of current CTIs within Remedy:
o Security/Password/CWOPA - CWOPA security password resets and/or account unlocks
o Voice/Voice/Dial Tone - Network problems experienced by a site concerning phone issues
o Network/Connection/Router - Network problems experienced by a site related to connectivity issues and/or performance problems
o Hardware/Network Printer/IBM-Lexmark - Local and/or network printer issues
o Applications/CWDS-BWDP/Staff Access - Application problems concerning CWDS
o Applications/UCMS-BBAD/Staff Access - Application problems concerning UCMS
o Applications/DeskTop/Adobe - Application problems encountered locally on user’s PC
o Applications/Operating Systems/Windows 2000 - Application problems encountered concerning operating system errors or performance issues
o Applications/Email/Outlook-CWOPA - Application problems encountered concerning email operation
o Hardware/PC/Hard Disk Drive - Computer hardware problems encountered by users related to hard disk driver errors
o Hardware/PC/Network Card - Computer hardware problems encountered by users specifically related to network connectivity
o Server/Hardware/Hard Disk Drive - Server Hardware problems experienced by users generated by faulty hard disk drives
o Hardware/MainFrame/CPU - Mainframe system problems
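Each example above follows the pattern Category/Type/Item followed by a description. A minimal sketch of splitting such a line into its CTI parts is shown here; the `CTI` type and `parse_cti` function are hypothetical helpers for illustration and are not part of Remedy.

```python
from typing import NamedTuple


class CTI(NamedTuple):
    """Category, Type, Item classification of a ticket."""
    category: str
    type_: str
    item: str


def parse_cti(text):
    """Split 'Category/Type/Item - description' into its three CTI parts."""
    # Drop the description: everything after the first " - " separator.
    cti_part = text.split(" - ", 1)[0]
    # Split on the first two slashes only, so an item may itself contain "/".
    category, type_, item = (part.strip() for part in cti_part.split("/", 2))
    return CTI(category, type_, item)
```

Splitting on `" - "` (with surrounding spaces) rather than a bare hyphen keeps hyphenated items such as `IBM-Lexmark` and `Outlook-CWOPA` intact.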
6.0 Appendix B – Acronyms
6.1 L&I Acronyms
Acronym | Definition
ACD System | Automatic Call Distribution System
BBAD | Bureau of Business Application Development
BBAD/CI | Bureau of Business Application Development/Compensation and Insurance Division
BBAD/WFD | Bureau of Business Application Development/Workforce Development Division
BBAD/UC | Bureau of Business Application Development/Unemployment Compensation Division
BBAD/OVR | Bureau of Business Application Development/Occupational and Vocational Rehabilitation Division
BBAD/SLMR | Bureau of Business Application Development/Safety and Labor-Management Relations Division
BEA | Bureau of Enterprise Architecture
BEA/DMDB | Bureau of Enterprise Architecture/Data Management and Database Management Division
BEA/ERD | Bureau of Enterprise Architecture/Engineering and Research Division
BEA/SDCD | Bureau of Enterprise Architecture/Standards Development and Compliance Division
BES | Bureau of Enterprise Services
BES/CoE | Bureau of Enterprise Services/Business Center of Excellence Division
BES/CRD | Bureau of Enterprise Services/Customer Relations Division
BES/PMD | Bureau of Enterprise Services/Project Management Division
BES/SD | Bureau of Enterprise Services/Security Division
BIO | Bureau Infrastructure and Operations
BIO/ID | Bureau Infrastructure and Operations/Infrastructure Division
BIO/NSS | Bureau Infrastructure and Operations/Network Support Services Division
BIO/SFO | Bureau Infrastructure and Operations/Server Farm Operations Division
BIO/MFO | Bureau Infrastructure and Operations/Mainframe Operations Division
BWDP | Bureau of Workforce Development Partnership
CIO | Chief Information Officer
CTI | Category, Type, Item
CWDS | Commonwealth Workforce Development System
ESM | Enterprise System Management
IS/IT | Information Systems/Information Technology
IT | Information Technology
ITIL | Information Technology Infrastructure Library
MAN | Metropolitan Area Network
OIT | Office of Information Technology
PIC | Problem/Incident Coordinator
PM/IM | Problem Management/Incident Management
SLA | Service Level Agreement
SME | Subject Matter Experts
SOA | Service Oriented Architecture
SPOC | Single Point of Contact
TEC | Tivoli Enterprise Console
UCMS | Unemployment Compensation Modernization System
WAN | Wide Area Network