http://www.flickr.com/photos/64123293@N00/5985619750/sizes/l/in/photostream/!
Bringing Back the Love: How Situational Awareness Improves User Experience!
Mr. White leads a team of software developers focused on creating tools that collect and analyze health information from Nationwide's IT systems. These tools have a wide variety of applications, from fault detection and problem investigation to trend reporting and capacity planning.
Andrew has over ten years of experience designing and managing the deployment of systems management software. Prior to joining Nationwide, Andrew developed solutions for a wide variety of organizations, including the Mexican Secretaría de Hacienda y Crédito Público, Telmex, Wal-Mart of Mexico, JP Morgan Chase, and the US Navy Facilities and Engineering Command.
Andrew White, Manager of Systems and Event Management at Nationwide Insurance
Follow Us: #ITSMSummit!
GROUND RULES FOR THIS SESSION…
1. If you can’t tell if I am trying to be funny… GO AHEAD AND LAUGH!
2. Feel free to text, tweet, yammer, or whatever. People gotta hear this!
3. If you have a question, no need to wait until the end. Just interrupt me. Seriously… I don’t mind.
My name is Andrew White
I lead a Systems and Event Management team
I am here today to talk about Situational Awareness
Definitions:
SIT·U·A·TION – [SI-CHƏ-WĀ-SHƏN], noun
1. manner of being situated; location or position with reference to environment: The situation of the house allowed for a beautiful view.
2. condition; case; plight: He is in a desperate situation.
3. the state of affairs; combination of circumstances: The present international situation is dangerous.
4. a state of affairs of special or critical significance in the course of a play, novel, etc.
Not this Situation…!
Think this situation…!
A·WARE·NESS – [UH-WAIR-NIS], noun
1. having knowledge; conscious; cognizant: aware of danger.
2. informed; alert; knowledgeable; sophisticated: She is one of the most politically aware young women around.
http://dc-cdn.virtacore.com/holding_door.jpg!
When you put them together we get:!
The perception of and reaction to a set of changing events in terms of what can be done, instead of merely the recollection of a stimulus. [1]
Most outages are the result of the lack of situational awareness!
1. Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.!!!
http://www.picfaz.com/pic/1390/02/12/4.jpg!
I am going to talk about some new capabilities that will help you.
Why do we lose situational awareness?!
This is Magenta…!
It doesn’t exist. :(
Magenta???!
Yellow = 510nm - 530nm!
Cyan = 600nm - 620nm!
The two wavelengths that produce it are not side by side in the spectrum
Squares A and B are the same color!
We cannot control the way our brain processes information!!
So… why do we lose situational awareness?!
SOMETIMES WE MISS WHAT IS GOING ON!
Say… what’s a mountain goat doing all the way up here in a cloud bank?!
Technology Areas!
WHICH DO YOU USE WHEN?!
Tool! Tool! Tool!
We don’t have a tooling problem…!
we have an understanding problem!!
1. Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.!!!
Our systems are capable of producing a huge amount of data, both on the status of their own components and on the status of the environment. The problem with today’s systems is not a lack of information, but finding what is needed when it is needed.!
I would like to show why this happens…!
BOYD’S OODA “LOOP”
Observe, Orient, Decide, Act
[Diagram labels: Observation (Outside Information, Unfolding Circumstances, Unfolding Interaction With Environment); Orientation (Cultural Norms, Cognitive Abilities, Knowledge Life Cycle, Prior Wisdom, New Information); Decision (Hypothesis); Action (Test); Feed Forward between stages; Feedback; Implicit Guidance & Control]
• Note how observation shapes orientation, shapes decision, shapes action, and in turn is shaped by the feedback and other phenomena coming into our sensing or observing window.
• Also note how the entire “loop” (not just orientation) is an ongoing many-sided implicit cross-referencing process of projection, empathy, correlation, and rejection.
From “The Essence of Winning and Losing,” John R. Boyd, January 1996.
WHERE THE BREAKDOWN OCCURS
Observe, Orient, Decide, Act
[Diagram labels: Current State, Situational Awareness, Decision, Performance of Actions, Feedback]
Situational Awareness
Level 1: Perception of Elements in Current Situation
Level 2: Comprehension of Current Situation
Level 3: Projection of Future Status
Systemic Influences
• System Capability
• Interface Design
• Stress & Workload
• Complexity
• Automation
Individual Influences
• Goals & Objectives
• Preconceptions
• Expectations
• Abilities
• Experience
• Training
• Long Term Memory
• Automaticity
• Cognitive Processes
Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.
Maybe.
Let me show you why this is important…
WE (IT) SELL PROMISES…
The value of these promises depends on the customer’s perception that we are willing and capable of making good on the promise when the time comes. This perception is affected by the interactions they have with us. !
http://www.flickr.com/photos/anneacaso/3693155059/sizes/l/in/photostream/!
Objective #1: Users Love Our IT Systems…
WHAT THIS MEANS TO US…
There are a few inescapable facts we face:
1. We need reliable systems to store the promises we make to our customers
2. Our systems mirror the complexity of the businesses they support
3. Our environments must be massive to scale to handle the workload
4. There is too much activity for a single person to be totally situationally aware
5. If the users can’t use it, it doesn’t work
EVENT MANAGEMENT FOCUS…
In addition to monitoring for performance, we are here to help manage availability.
Our Formula:
1. Continually collect, categorize, and analyze all events from as many sources as possible
2. Correlate events and analyze them using previous outages as patterns to identify situations worth investigating
3. Notify a support team so the situation can be mitigated before becoming an outage
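The three-step formula above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used at Nationwide: the `Event` fields, the category names, and the `KNOWN_PATTERNS` library are all hypothetical, and a pattern is assumed to match when every category it lists has been seen.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str    # tool that reported the event (hypothetical field)
    category: str  # normalized event category (hypothetical field)


# Hypothetical pattern library: event categories that were seen
# together before a previous outage.
KNOWN_PATTERNS = {
    "back-end transaction failure": {"CICS_ABEND", "MQ_FLOW_INTERRUPTED"},
}


def categorize(events):
    """Step 1: count the collected events per category."""
    return Counter(event.category for event in events)


def detect_situations(events, patterns=KNOWN_PATTERNS):
    """Step 2: return the patterns whose categories all appear in the events."""
    seen = set(categorize(events))
    return [name for name, needed in patterns.items() if needed <= seen]


def notify(situations):
    """Step 3: build one notification message per detected situation."""
    return [f"Investigate possible situation: {name}" for name in situations]
```

A caller would feed the collected events to `detect_situations` and pass the result to `notify`; a real implementation would also need time windows and thresholds, which are left out here.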
When all of these happen at the same time…!
Ug…!
http://www.flickr.com/photos/gregphoto/4881356366/sizes/l/in/photostream/!
Bad Experience!!!!
OK. So now what?
CLEANING UP THE LANDSCAPE!
Adapted from: Akella, Janaki. “IT Architecture: Cutting costs and complexity.” McKinsey Quarterly 13 Nov 2009 https://www.mckinseyquarterly.com/IT_architecture_Cutting_costs_and_complexity_2391!
Silo
Monolithic Framework
Niche
Launch Pad
Information Bus
ONE INTEGRATED ENVIRONMENT!
[Diagram labels]
Distributed
Database
Mainframe
Network
Middleware
Storage
Event Pool
Operational Data Warehouse
Predictive
Enrichment & Correlation
Service Desk
Paging
CMDB
Knowledge
Asset Mgmt
Event Catalog
Event API
Business Telemetry
3rd Party Providers
Presentation Framework
CONCEPTUALIZING SITUATIONAL AWARENESS!
Situational Awareness Engine
Adapted from http://www.slideshare.net/TimBassCEP/getting-started-in-cep-how-to-build-an-event-processing-application-presentation-717795
Real-Time Event Streams
Detected and Predicted Situations
Patterns from Historical Data
Causal Relationship from Past RCAs
SITUATIONAL AWARENESS MODEL DESIGN
Event Pipeline
Event Sources
Level 1: Event Taxonomy and Enrichment
Level 2: Event Tracking
Level 3: Situation Detection
Level 4: Predictive Analysis
Level 5: Runbook Automation
Data, Information, Knowledge, Intelligence
Solicitations for User Interaction via the Visualization Framework
Patterns from Historical Data
Causal Relationship from Past RCAs
Historical Event Archive
Adapted from the JDL: Steinberg, A., & Bowman, C., Handbook of Multisensor Data Fusion, CRC Press, 2001
REQUIREMENTS FOR UNITY OF EFFORT!
1. Command and Control!
2. Shared Experience!
3. Situational Awareness!
Symptoms of Missing Elements
• Command and Control (No Leadership)
  • The team lacks a clear direction
  • Lots of activity, lack of progress
• Shared Experience (Poor Relationships)
  • Us vs. Them mentality
  • Unhealthy competition
• Situational Awareness (Poor Communication)
  • Focused on cooperation, not collaboration
  • Blame culture
  • Infrequent or non-existent communication
Our success in any endeavor depends directly on our ability to solve problems!
What do we need to do that?!
You Gotta Have Skillz…!
WHAT MATTERS MOST?
Dr. Lee Goldman
Cook County Hospital, Chicago, IL
The Goldman Algorithm
§ Is the patient feeling unstable angina?
§ Is there fluid in the patient’s lungs?
§ Is the patient’s systolic blood pressure below 100?
[Chart: Prediction of Patients Who Will Have a Heart Attack Within 72 Hours, Traditional Techniques vs. Goldman Algorithm, on a scale of 0 to 100]
By paying attention to what really matters, Dr. Goldman improved the “false negatives” by 20 percentage points and eliminated the “false positives” altogether.
THE GOLDMAN ALGORITHM
Patient enters ED with suspected Acute Cardiac Ischemia
Perform Electrocardiogram (EKG)
ECG Evidence of Acute Myocardial Infarction (MI)? (Yes / No)
  ST-Segment Elevation ≥ 1mm in ≥ 2 Contiguous Leads (New or Unknown Age) or
  Pathologic Q Waves in ≥ 2 Contiguous Leads (New or Unknown Age)
ECG Evidence of Acute Ischemia? (Yes / No)
  ST-Segment Depression ≥ 1mm in ≥ 2 Contiguous Leads (New or Unknown Age) or
  T-Wave Inversion in ≥ 2 Contiguous Leads (New or Unknown Age) or
  Left Bundle-Branch Block (New or Unknown Age)
Urgent Factors Present? (asked on both the Yes and No branches of the ischemia question)
  Rales Above Both Lung Bases
  Systolic Blood Pressure <100 mm Hg
  Unstable Ischemic Heart Disease
Branch labels: 2 or 3 Factors, 0 or 1 Factors, 2 or 3 Factors, 1 Factor, 0 Factors
Risk levels: High Risk, Moderate Risk, Low Risk, Very Low Risk
Destinations: Coronary Care Unit, Inpatient Telemetry Unit, Observation Unit
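The decision tree above is small enough to write as a single function, which is the point of the slide: three ECG questions and a count of urgent factors. The sketch below is an illustration of the structure only. The slide text lists the branch labels and risk levels but not which branch leads to which level, so the mapping in this function is an assumption, and the function name and parameters are hypothetical.

```python
def goldman_risk(mi_on_ecg: bool, ischemia_on_ecg: bool, urgent_factors: int) -> str:
    """Return a risk level from ECG findings and the number of urgent factors.

    The branch-to-risk mapping below is an assumed reading of the slide,
    not a clinical reference.
    """
    if not 0 <= urgent_factors <= 3:
        raise ValueError("urgent_factors must be between 0 and 3")
    if mi_on_ecg:
        # Assumed: ECG evidence of acute MI goes straight to the top level.
        return "High Risk"
    if ischemia_on_ecg:
        # Assumed: branches labeled "2 or 3 Factors" and "0 or 1 Factors".
        return "High Risk" if urgent_factors >= 2 else "Moderate Risk"
    # Assumed: branches labeled "2 or 3 Factors", "1 Factor", "0 Factors".
    if urgent_factors >= 2:
        return "Moderate Risk"
    return "Low Risk" if urgent_factors == 1 else "Very Low Risk"
```

The design choice worth noticing is how little input the function takes; everything else a clinician might observe is deliberately ignored.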
NICE. What does this look like in our world?
WHAT GOOD MONITORING LOOKS LIKE!
Corporate!LANs & VPNs!
Load Balancer!
Load Balancer!
Firewall!
Switch!
Web Server Farm!
Database!
Data Power!Mainframe!
Middleware!
Load Balancer!
Elements of Good Monitoring
1. System Availability
2. Operating System Performance
3. Hardware Monitoring
4. Service/Daemon and Process Availability
5. Error Logs
6. Application Resource KPIs
7. End-to-End Transactions
8. Point of Failure Transactions
9. Fail-Over Success
10. “Activity Monitors” and “Reverse Hockey Stick”
[Diagram: the numbers 1 through 10 mark where each element applies on the architecture shown above]
FINDING METRICS THAT MATTER
Evaluating the Effectiveness of a Metric
§ Will the metric be used in a report? If so, which one? How is it used in the report?
§ Will the metric be used in a dashboard? If so, which one? How will it be used?
§ What action(s) will be taken if an alert is generated? Who are the actors? Will a ticket be generated? If so, what severity?
§ How often is this event likely to occur? What is the impact if the event occurs? What is the likelihood it can be detected by monitoring?
§ Will the metric help identify the source of a problem? Is it a coincident / symptomatic indicator?
§ Is the metric always associated with a single problem? Could this metric become a false indicator?
§ What is the impact if this goes undetected?
§ What is the lifespan for this metric? What is the potential for changes that may reduce the efficacy of the metric?
ANATOMY OF AN OUTAGE!
Corporate!LANs & VPNs!
Load Balancer!
Firewall!
Web!Servers!
Message!Queue!
zOS!CICS!
WAS!
Database!
WAS!Database!
zOS!MQ!
DB2!
IM01109089: P0 - Affecting Multiple apps & Internet Sales West
1. 5:45-ish pm: CICS ABENDs start flooding MainView, but not at a rate high enough to ticket
2. 6:00-ish pm: MQ flows are interrupted and are alerting in Flow Diagnostics
3. 6:04pm: Synthetic transactions fail, and at 6:14pm the Ops Center confirms the issue and creates a P0 Incident
4. 6:54pm: Support teams investigate the interrupted flows and determine it is a “back-end” problem
5. 10:29pm: Support teams investigate MQ, ultimately rule it out, and decide to reset CICS to resolve the issue
DRIVING THE RIGHT KIND OF ACTION!
Application
End User Experience
  Gainesville: Transaction 1, Transaction 2, Transaction N
  San Antonio: Transaction 1, Transaction 2, Transaction N
  Des Moines: Transaction 1, Transaction 2, Transaction N
  Columbus: Transaction 1, Transaction 2, Transaction N
Infrastructure
  Network: KPI 1, KPI 2, KPI N
  Mainframe: KPI 1, KPI 2, KPI N
  Storage: KPI 1, KPI 2, KPI N
  Linux: KPI 1, KPI 2, KPI N
  Middleware: KPI 1, KPI 2, KPI N
  Database: KPI 1, KPI 2, KPI N
COMMON PROBLEM TYPES!
§ Design Problems
§ Creative Problems
§ Daily Problems
§ People Problems
Rule-Based Approach
Event-Based Approach
EVENT-BASED PROBLEM SOLVING!
§ Appreciative Understanding
§ Know What We Are Solving
§ Create A Common Reality
§ Solutions Based on Causes
CAUSAL RELATIONSHIPS
① Causes are effects, and effects are causes
Logs Not Truncated (Cause) → Drive Full (Cause/Effect) → Database Down (Effect)
CAUSAL RELATIONSHIPS
② You can keep identifying causes – there is no limit
Beginning of Time (Cause) → Logs Not Truncated (Cause/Effect) → Drive Full (Cause/Effect) → Database Down (Primary Effect) → End of the Universe (Effect)
TWO IMPORTANT QUESTIONS
Ask “Why?”
Ask “What?”
Beginning of Time (Cause) → Logs Not Truncated (Cause/Effect) → Drive Full (Cause/Effect) → Database Down (Primary Effect) → End of the Universe (Effect)
RULES FOR CAUSAL RELATIONSHIPS
③ An Effect is often the result of multiple causes
SQL Server was not processing queries (Effect)
Transaction log was unable to grow
T: Drive at 0 Bytes free
Logs were not truncated
DBA on honeymoon vacation in Fiji
Logs are truncated manually
Company has only 1 DBA
“Backup” DBA was not aware the logs require truncation
Space allocations are fixed
Lack of Control
[Diagram: causes that feed the same effect are joined by -AND- gates]
RULES FOR CAUSAL RELATIONSHIPS
④ Causes need to be both necessary and sufficient
SQL Server was not processing queries (Effect)
Transaction log was unable to grow (Transitory Cause)
T: Drive at 0 Bytes free (Non-Transitory Cause & Effect)
Logs were not truncated (Transitory Cause & Effect)
DBA on honeymoon vacation in Fiji (Transitory Cause)
Logs are truncated manually (Non-Transitory Cause)
Company has only 1 DBA (Non-Transitory Cause)
“Backup” DBA was not aware the logs require truncation (Non-Transitory Cause)
Space allocations are fixed (Non-Transitory Cause)
Lack of Control
[Diagram: causes that feed the same effect are joined by -AND- gates]
HOW FIRE WORKS
[Timeline diagram: Oxygen, Heat, and Fuel are present over Time; the Match Strike occurs and Fire follows]
Fire
  Oxygen (Non-Transitory) -AND- Heat (Non-Transitory) -AND- Fuel (Non-Transitory) -AND- Match Strike (Transitory)
• Transitory Causes act as catalysts to bring about change (think Transition)
• Non-Transitory Causes are objects, properties/attributes, and status
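The fire example maps naturally onto a small tree structure in which an effect occurs only when all of its causes are present. The sketch below assumes a hypothetical `Cause` class and treats every group of causes as an -AND- gate; it is one way to model the idea, not the format of any particular RCA tool.

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    label: str
    kind: str = "unclassified"                  # "transitory" or "non-transitory"
    causes: list = field(default_factory=list)  # all required together (-AND-)


def occurs(node, present):
    """Return True if the effect at `node` occurs, given the set of leaf causes present."""
    if not node.causes:
        return node.label in present
    return all(occurs(cause, present) for cause in node.causes)


# The fire example from the slide, with the slide's classifications.
FIRE = Cause("Fire", causes=[
    Cause("Oxygen", "non-transitory"),
    Cause("Heat", "non-transitory"),
    Cause("Fuel", "non-transitory"),
    Cause("Match Strike", "transitory"),
])
```

Calling `occurs(FIRE, {"Oxygen", "Heat", "Fuel"})` returns `False` because the transitory cause is missing, which illustrates why removing any single necessary cause is enough to prevent the effect.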
TAKE A SOLOGIC RCA DIAGRAM
Customers Complaining
Web Server returning 500 errors
The application server was timing out
SQL Server was not processing queries
Transaction log was unable to grow
T: Drive at 0 Bytes free
Logs were not truncated
DBA on honeymoon vacation in Fiji
Logs are truncated manually
Company has only 1 DBA
“Backup” DBA was not aware the logs require truncation
Space allocations are fixed
Lack of Control
Only one database cluster in use
DR SQL Cluster
DR Cluster being used for UAT testing
More Information Needed
Only one application server exists
More Information Needed
Trying to do business on the website
Desired Condition
[Diagram: causes that feed the same effect are joined by -AND- gates]
ADD THE EVIDENCE
[RCA diagram repeated from the previous slide, with evidence labels added: Statistical Data, Situational, Observation]
FAILURE MODES AND EFFECT ANALYSIS
SQL Server Not Available
Transaction log is unable to grow
T: Drive at 0 Bytes free
Logs were not truncated
DBA on honeymoon vacation in Fiji
Logs are truncated manually
Company has only 1 DBA
“Backup” DBA was not aware the logs require truncation (Condition Cause)
Space allocations are fixed (Condition Cause)
Lack of Control
SQL is unable to cache query results
Available RAM at 0 Bytes Free
C: Drive at 0 Bytes free
Minidump is configured to write to C: Drive
Server was ASRing frequently
Software distributions were leaving files in the TEMP folder
%TEMP% configured to C:\Temp
Kernel able to write to page file
[Diagram: causes are joined by -AND- and -OR- gates]
GETTING TO OUR REQUIREMENTS
[Failure mode diagram repeated from the previous slide, with two annotations:]
• Monitor the intersections at the “OR’s”
• At least one point along each branch after the “OR”
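The two placement rules can be applied mechanically once the failure mode diagram is held as a tree. The sketch below uses a hypothetical `Node` class and a simplified tree built from labels on the slide; the gate assignments are assumptions, since the extracted text does not preserve the diagram's shape. For "at least one point along each branch" it picks the first node of each branch, which is one valid choice among several.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    gate: str = "AND"                             # how children combine: "AND" or "OR"
    children: list = field(default_factory=list)


def monitoring_points(node):
    """Return labels to monitor: every OR intersection plus the first node of each branch below it."""
    points = []
    if node.children and node.gate == "OR":
        points.append(node.label)
        points.extend(child.label for child in node.children)
    for child in node.children:
        for label in monitoring_points(child):
            if label not in points:
                points.append(label)
    return points


# Simplified, assumed shape of the slide's diagram.
TREE = Node("SQL Server Not Available", "OR", [
    Node("Transaction log is unable to grow", "AND", [Node("T: Drive at 0 Bytes free")]),
    Node("SQL is unable to cache query results", "AND", [Node("Available RAM at 0 Bytes Free")]),
])
```

With this tree, `monitoring_points(TREE)` yields the OR intersection and one point on each of its two branches, so an alert on the intersection can be attributed to a branch straight away.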
FMEA MATRIX (IMPACT CALCULATION)
Severity
• Negligible (1-2): no loss in functionality, mostly cosmetic
• Marginal (3-4): temporary interruptions or the degradation lasts for a brief period of time
• Critical (5-6): the problem will not resolve itself but a work around exists allowing the problem to be bypassed
• Serious (7-8): the problem will not resolve itself and no work around is possible. Functionality is impaired or lost but the system is usable to some extent
• Catastrophic (9-10): the system is completely unusable
Occurrence
• Improbable (1-2): less than 1 time per year
• Remote (3-4): 1 time per year
• Occasional (5-6): 1 time per month
• Probable (7-8): 1 time per day
• Chronic (9-10): 1 or more times per day
Detection
• Very high (1-2): during the design phase
• High (3-4): during peer review or unit testing
• Moderate (5-6): during system testing or acceptance testing
• Remote (7-8): during or immediately after production deployment
• Very Remote (9-10): only after heavy usage by users
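The slide does not state how the three 1-10 scores are combined. A common FMEA convention is to multiply them into a risk priority number, and the sketch below assumes that convention; the function names are hypothetical, and the band names come from the first scale on the slide.

```python
SEVERITY_BANDS = ["Negligible", "Marginal", "Critical", "Serious", "Catastrophic"]


def severity_band(score):
    """Map a 1-10 severity score to its band name (two scores per band)."""
    if not 1 <= score <= 10:
        raise ValueError(f"score must be between 1 and 10, got {score}")
    return SEVERITY_BANDS[(score - 1) // 2]


def risk_priority_number(severity, occurrence, detection):
    """Multiply the three scores (assumed convention); higher means more urgent."""
    for name, score in (("severity", severity),
                        ("occurrence", occurrence),
                        ("detection", detection)):
        if not 1 <= score <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {score}")
    return severity * occurrence * detection
```

For example, a failure mode scored 8 for severity, 5 for occurrence, and 9 for detection gets 8 × 5 × 9 = 360 out of a possible 1000, which would rank it above a catastrophic but improbable and easily detected mode scored 10, 1, 1.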
FMEA MATRIX (EVIDENCE)!
These are the events that help us to RULE IN a failure mode as a possible cause!
These are the events that help us RULE OUT the failure mode as not relevant!
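A rule-in / rule-out matrix can be represented as two sets of event names per failure mode. The sketch below is an assumption about how such a matrix might be evaluated: the event names are hypothetical, and letting rule-out evidence take precedence over rule-in evidence is a design choice made here, not something the slide specifies.

```python
def triage_failure_modes(failure_modes, observed):
    """Classify each failure mode as 'ruled out', 'ruled in', or 'undetermined'."""
    result = {}
    for name, evidence in failure_modes.items():
        if evidence["rule_out"] & observed:
            result[name] = "ruled out"      # assumed: contraindications win
        elif evidence["rule_in"] & observed:
            result[name] = "ruled in"
        else:
            result[name] = "undetermined"   # still needs investigation
    return result


# Hypothetical matrix using failure modes named on earlier slides.
FAILURE_MODES = {
    "T: Drive at 0 Bytes free": {
        "rule_in": {"t_drive_free_space_zero"},
        "rule_out": {"transaction_log_growth_ok"},
    },
    "Available RAM at 0 Bytes Free": {
        "rule_in": {"available_ram_zero"},
        "rule_out": {"available_ram_normal"},
    },
}
```

Given the observed events `{"t_drive_free_space_zero", "available_ram_normal"}`, the first mode is ruled in and the second is ruled out, which is the kind of accounting an incident bridge can start from.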
HOW TO DETERMINE EVENT SEVERITY
Six Levels of Severity
Critical: The component has completely failed
Major: The component is operating but is in a degraded or crippled state
Minor: The component is functioning normally but is at risk of a more serious failure
Informational: The component is functioning normally but is reporting a change in state
Unknown: The component has changed its operating state but the effect is not known
Clear: The component is operating normally or a higher severity event has been resolved
• The event severity is determined with respect to the component generating the event
• The event severity does not consider impact or urgency
• The incident priority is not determined by event severity
• The event severity helps drive an effective triage when multiple events arrive at approximately the same time
• Only after the affected components and their relationships to each other have been determined can impact and urgency be determined
[Diagram: component hierarchy. Logical Server: Virtual Machine 1, Virtual Machine 2. Physical Server: Server 1, Server 2. Logical Volumes: Volume Group 1, Volume Group 2. Physical Volumes: Hard Drive 1, Hard Drive 2, Hard Drive 3]
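The six severity levels and the triage bullet can be sketched with an ordered enumeration. The numeric ranking below follows the order of the table, with Unknown placed between Informational and Clear as the table lists it; that ranking and the tuple layout of an event are assumptions for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Assumed ranking: higher value means more severe, following the table order.
    CLEAR = 0
    UNKNOWN = 1
    INFORMATIONAL = 2
    MINOR = 3
    MAJOR = 4
    CRITICAL = 5


def triage(events):
    """Order (component, severity) pairs most severe first; ties keep arrival order."""
    return sorted(events, key=lambda event: event[1], reverse=True)
```

Note that `triage` only orders events by the state of the component that raised them. It says nothing about impact or urgency, which, as the bullets above point out, can only be judged once the affected components and their relationships are known.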
MONITORING BASED ON PATTERNS
Layers of Pre-Defined Monitoring Patterns
• The OS template is deployed when the server is provisioned
• As a server is customized to fit its role, additional templates are deployed
• Templates are stacked on top of each other until no gaps remain
• This approach provides a high degree of standardization without sacrificing the ability to develop a custom solution
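Template stacking can be pictured as merging dictionaries in deployment order and then checking what is still uncovered. In this sketch a template is assumed to be a mapping from a metric name to a threshold; the metric names, thresholds, and function names are hypothetical.

```python
def stack_templates(*templates):
    """Merge templates in deployment order; later templates add to or override earlier ones."""
    merged = {}
    for template in templates:
        merged.update(template)
    return merged


def coverage_gaps(required_metrics, deployed):
    """Return the required metrics that no deployed template covers yet."""
    return sorted(set(required_metrics) - set(deployed))


# Hypothetical templates: the OS layer first, then a role-specific layer.
OS_TEMPLATE = {"cpu_busy_pct": 90, "disk_free_pct": 10}
WEB_SERVER_TEMPLATE = {"http_5xx_per_min": 5, "disk_free_pct": 15}
```

Stacking `OS_TEMPLATE` and then `WEB_SERVER_TEMPLATE` keeps the standard CPU check, tightens the disk threshold for the web role, and `coverage_gaps` shows which templates still need to be added before "no gaps remain".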
APPLICATION-TECHNOLOGY MATRIX
Maps services, applications and technologies enabling:
• Monitoring investment prioritization
• Monitoring maturity
• Which templates need to be deployed when new hardware is acquired
• Whether a service has sufficient monitoring coverage based on its application components
• This approach allows for anticipating changes to a customer’s monitoring needs
Scores indicate:
0 – No Strategy
1 – Limited Monitoring
2 – Fully Integrated Strategy
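A matrix like this can be held as a nested mapping from service to technology to score. The sketch below uses made-up services and scores, and it assumes a "weakest link" reading in which a service is only as well covered as its lowest-scoring technology; the slide does not say how coverage is rolled up.

```python
SCORE_MEANING = {0: "No Strategy", 1: "Limited Monitoring", 2: "Fully Integrated Strategy"}


def service_coverage(matrix, service):
    """Return the lowest score among a service's technologies (assumed weakest-link rule)."""
    return min(matrix[service].values())


def investment_priorities(matrix):
    """List (service, technology) pairs scored below 2, lowest score first."""
    gaps = [
        (score, service, technology)
        for service, technologies in matrix.items()
        for technology, score in technologies.items()
        if score < 2
    ]
    return [(service, technology) for score, service, technology in sorted(gaps)]


# Hypothetical matrix; the service names and scores are illustrative only.
MATRIX = {
    "Internet Sales": {"WAS": 2, "MQ": 1, "CICS": 0},
    "Claims": {"DB2": 2},
}
```

Here `investment_priorities(MATRIX)` puts the unmonitored CICS component ahead of the partially monitored MQ component, which is the kind of ordering the matrix is meant to support.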
EVENT LIFECYCLE EXAMPLE!
Legend: Element Manager, Distributed Collectors, Object Server Triggers, Impact Policies, ITNM RCA Engine, Gateway Replication, Webtop Event List
Software-Operating System!
Data Collection!
Anomaly Detection!
Event Generation!
Integration!
Event Processing!
Enrichment!
Event Suppression!
Correlation!
Root Cause Analysis!
Business Impact Analysis!
Automation!
Notification & Escalation!
Presentation!
User Interaction Tools!
Archiving!
Reporting!
Activity! Responsible Tool!
Trigger Ticket Request!
Create Ticket!
Update Event with IM#!
Trigger Courtesy Pages!
Send Pages!
Activity! Responsible Tool!
AGGREGATION AND ANALYSIS OVERVIEW!
[Diagram labels]
Automated Action
Notification and Escalation
Business Impact Analysis
Root Cause Analysis
Correlation and Event Suppression
Enrichment
Meta-Data Integration Bus
Distributed Collectors
Distributed Collectors
LOB Managed Monitoring System
Service Provider Managed Monitoring System
Vendor Managed Monitoring System
Element Manager
Element Manager
Element Manager
Other Enterprise Data
Document Sharing
Service Desk
CMDB
Batch Scheduling
Knowledge Database
Online Run Book
PBX/Call Manager
Visualization Framework
Common Event Format
Topology And Relationship Database
Automated Action Tools
Distributed Collectors
Automated Provisioning System
Predictive Analysis
Automated Change Reconciliation
Security Management
Archive and Report
Business Telemetry Data
Service Center and Enterprise Notification Tool
How do we keep it evolving?!
FACILITATING PRODUCTION ASSURANCE
§ CritSits
  § Start the CritSit meeting and provide an accounting of all the potential failure modes, which have been successfully ruled out, and which need to be investigated
  § Include other potential failure modes into the KT matrix
§ Problem Management
  § Document the causal elements as new failure modes
  § Disseminate new failure modes to Architecture, ESM and the Command Center
§ Reporting
  § Produce a monthly newsletter to application owners with the list of failure modes they should discuss with their architects
  § Incorporate failure modes into “Fault Line” analysis
DURING THE DESIGN PROCESS!
• Architects
  • Certify that designs do not contain the known failure modes or document that the failure mode does not present an unacceptable risk
  • Document the requirements for Solution Architects to follow to ensure the mitigation strategies are implemented completely
• Developers
  • Certify that designs do not contain the known failure modes or document that the failure mode does not present an unacceptable risk
  • Certify the designs implement the mitigation strategies correctly
IMPROVING ENTERPRISE TOOLS!
§ Systems Management
  § Develop new monitoring requirements using the documented indications and contraindications
§ Event Management
  § Develop new correlations tying indications and contraindications to failure modes to assist in ruling out or ruling in those “in play” more efficiently
§ Configuration Management
  § Develop new discovery patterns using the documented indications and contraindications
  § Develop automations to detect the presence of failure mode conditions and generate an event to the Event Management System
DURING SERVICE SUPPORT!
• Command Centers and Support Teams
  – Use the failure modes to rule out potential failure modes
  – Each failure mode will have a documented process to follow to mitigate the impact once a failure mode is identified
• Incident Managers
  – Start bridge calls and provide an accounting of all the potential failure modes, which have been successfully ruled out, and which need to be investigated
  – Coordinate the investigation assignments and consolidate the investigation results
LET’S KEEP THE CONVERSATION GOING…!
ReverendDrew!
SystemsManagementZen.Wordpress.com!
systemsmanagementzen.wordpress.com/feed/!
@SystemsMgmtZen!
ReverendDrew!
614-306-3434!