http://www.flickr.com/photos/64123293@N00/5985619750/sizes/l/in/photostream/ Bringing Back the Love: How Situational Awareness Improves User Experience

Bright talk bringing back the love - final

Page 1: Bright talk   bringing back the love - final

http://www.flickr.com/photos/64123293@N00/5985619750/sizes/l/in/photostream/

Bringing Back the Love: How Situational Awareness Improves User Experience

Page 2: Bright talk   bringing back the love - final

Andrew White
Manager of Systems and Event Management at Nationwide Insurance

Mr. White leads a team of software developers focused on creating tools that collect and analyze health information from Nationwide's IT systems. These tools have a wide variety of applications, from fault detection and problem investigation to trend reporting and capacity planning.

Andrew has over ten years of experience designing and managing the deployment of systems management software. Prior to joining Nationwide, Andrew developed solutions for a wide variety of organizations, including the Mexican Secretaría de Hacienda y Crédito Público, Telmex, Wal-Mart of Mexico, JP Morgan Chase, and the US Navy Facilities and Engineering Command.

Page 3: Bright talk   bringing back the love - final

Follow Us: #ITSMSummit!

GROUND RULES FOR THIS SESSION…

1.  If you can’t tell if I am trying to be funny… GO AHEAD AND LAUGH!

2.  Feel free to text, tweet, yammer, or whatever. People gotta hear this!

3.  If you have a question, no need to wait until the end. Just interrupt me. Seriously… I don’t mind.

Page 4: Bright talk   bringing back the love - final

My name is Andrew White

I lead a Systems and Event Management team

Page 5: Bright talk   bringing back the love - final

I am here today to talk about!

Situational Awareness!

Page 6: Bright talk   bringing back the love - final

Definitions:!

Page 7: Bright talk   bringing back the love - final

SIT·U·A·TION – [SI-CHƏ-WĀ-SHƏN]
-noun

1. manner of being situated; location or position with reference to environment: The situation of the house allowed for a beautiful view.

2. condition; case; plight: He is in a desperate situation.

3. the state of affairs; combination of circumstances: The present international situation is dangerous.

4. a state of affairs of special or critical significance in the course of a play, novel, etc.

Page 8: Bright talk   bringing back the love - final

Not this Situation…!

Think this situation…!

Page 9: Bright talk   bringing back the love - final

A·WARE·NESS – [UH-WAIR-NIS]
-noun

1. having knowledge; conscious; cognizant: aware of danger.

2. informed; alert; knowledgeable; sophisticated: She is one of the most politically aware young women around.

Page 10: Bright talk   bringing back the love - final

http://dc-cdn.virtacore.com/holding_door.jpg!

Page 11: Bright talk   bringing back the love - final

When you put them together we get:

The perception of and reaction to a set of changing events in terms of what can be done, instead of merely the recollection of a stimulus. [1]

Most outages are the result of a lack of situational awareness.

1. Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.

Page 12: Bright talk   bringing back the love - final

http://www.picfaz.com/pic/1390/02/12/4.jpg!

Page 13: Bright talk   bringing back the love - final

I am going to talk about some new capabilities that will help you.

Page 14: Bright talk   bringing back the love - final

Why do we lose situational awareness?!

Page 15: Bright talk   bringing back the love - final

This is Magenta…

It doesn’t exist. ☹

Page 16: Bright talk   bringing back the love - final

Magenta???!

Yellow = 510nm - 530nm!

Cyan = 600nm - 620nm!

Page 17: Bright talk   bringing back the love - final

The two color wavelengths that produce it are not side by side in the spectrum

Page 18: Bright talk   bringing back the love - final

Squares A and B are the same color!

Page 19: Bright talk   bringing back the love - final

We cannot control the way our brain processes information!!

Page 20: Bright talk   bringing back the love - final

So… why do we lose situational awareness?!

Page 21: Bright talk   bringing back the love - final

SOMETIMES WE MISS WHAT IS GOING ON

Say… what’s a mountain goat doing all the way up here in a cloud bank?

Page 22: Bright talk   bringing back the love - final

WHICH DO YOU USE WHEN?

Technology Areas: Tool, Tool, Tool

We don’t have a tooling problem… we have an understanding problem!

Page 23: Bright talk   bringing back the love - final

Our systems are capable of producing a huge amount of data, both on the status of their own components and on the status of the environment. The problem with today’s systems is not a lack of information, but finding what is needed when it is needed. [1]

1. Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.

Page 24: Bright talk   bringing back the love - final

I would like to show why this happens…!

Page 25: Bright talk   bringing back the love - final

BOYD’S OODA “LOOP”

Observe → Orient → Decide → Act

Observe: Observation (inputs: Unfolding Circumstances, Outside Information, Unfolding Interaction With Environment)
Orient: Cultural Norms, Cognitive Abilities, Knowledge Life Cycle, Prior Wisdom, New Information
Decide: Decision (Hypothesis)
Act: Action (Test)
Links between the stages: Feed Forward, Feedback, Implicit Guidance & Control

•  Note how observation shapes orientation, shapes decision, shapes action, and in turn is shaped by the feedback and other phenomena coming into our sensing or observing window.

•  Also note how the entire “loop” (not just orientation) is an ongoing many-sided implicit cross-referencing process of projection, empathy, correlation, and rejection.

From “The Essence of Winning and Losing,” John R. Boyd, January 1996.

Page 26: Bright talk   bringing back the love - final

WHERE THE BREAKDOWN OCCURS

Observe → Orient → Decide → Act

Current State → Situational Awareness → Decision → Performance of Actions → Feedback

Situational Awareness
Level 1: Perception of Elements in Current Situation
Level 2: Comprehension of Current Situation
Level 3: Projection of Future Status

Systemic Influences
• System Capability
• Interface Design
• Stress & Workload
• Complexity
• Automation

Individual Influences
• Goals & Objectives
• Preconceptions
• Expectations
• Abilities
• Experience
• Training
Cognitive Processes: Long Term Memory, Automaticity

Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.

Page 27: Bright talk   bringing back the love - final
Page 28: Bright talk   bringing back the love - final

Maybe.
Let me show you why this is important…

Page 29: Bright talk   bringing back the love - final

WE (IT) SELL PROMISES…

The value of these promises depends on the customer’s perception that we are willing and able to make good on the promise when the time comes. This perception is affected by the interactions they have with us.

Page 30: Bright talk   bringing back the love - final

http://www.flickr.com/photos/anneacaso/3693155059/sizes/l/in/photostream/!

Objective #1: Users Love Our IT Systems…

Page 31: Bright talk   bringing back the love - final

WHAT THIS MEANS TO US…

There are a few inescapable facts we face:
1.  We need reliable systems to store the promises we make to our customers
2.  Our systems mirror the complexity of the businesses they support
3.  Our environments must be massive to scale to handle the workload
4.  There is too much activity for a single person to be totally situationally aware
5.  If the users can’t use it, it doesn’t work

Page 32: Bright talk   bringing back the love - final

EVENT MANAGEMENT FOCUS…

In addition to monitoring for performance, we are here to help manage availability.

Our Formula:
1.  Continually collect, categorize, and analyze all events from as many sources as possible
2.  Correlate events and analyze them using previous outages as patterns to identify situations worth investigating
3.  Notify a support team so the situation can be mitigated before becoming an outage
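The three steps of the formula can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Event` and `OutagePattern` types, the event kinds, the team names, and the 30-minute window are all hypothetical, and a real event manager would do far more than set matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str   # where the event came from, e.g. "mainframe"
    kind: str     # hypothetical category assigned when the event is classified
    minute: int   # minutes since midnight, to keep the arithmetic simple


@dataclass(frozen=True)
class OutagePattern:
    name: str
    required_kinds: frozenset  # event kinds seen together in a previous outage
    window_minutes: int        # how close together they must occur
    support_team: str


def detect_situations(events, patterns):
    """Step 2: return patterns whose required kinds all occur within the window."""
    ordered = sorted(events, key=lambda e: e.minute)
    matches = []
    for pattern in patterns:
        for i, first in enumerate(ordered):
            in_window = {
                e.kind for e in ordered[i:]
                if e.minute - first.minute <= pattern.window_minutes
            }
            if pattern.required_kinds <= in_window:
                matches.append(pattern)
                break
    return matches


def notify(pattern):
    """Step 3: build the message a paging tool might send (hypothetical format)."""
    return f"{pattern.support_team}: investigate '{pattern.name}'"


# Step 1: events collected from several sources (sample data, not real logs)
events = [
    Event("mainframe", "cics_abend", 17 * 60 + 45),
    Event("middleware", "mq_flow_interrupted", 18 * 60),
    Event("network", "link_flap", 9 * 60),
]
patterns = [
    OutagePattern("CICS backlog stalls MQ",
                  frozenset({"cics_abend", "mq_flow_interrupted"}), 30,
                  "Mainframe Support"),
    OutagePattern("Disk exhaustion",
                  frozenset({"disk_full", "db_down"}), 60, "DBA Team"),
]
alerts = [notify(p) for p in detect_situations(events, patterns)]
```

Only the first pattern fires here, because its two event kinds arrive within 30 minutes of each other; the unrelated network event and the unmatched disk pattern produce no notification.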

Page 33: Bright talk   bringing back the love - final

When all of these happen at the same time…!

Ug…!

Page 34: Bright talk   bringing back the love - final

http://www.flickr.com/photos/gregphoto/4881356366/sizes/l/in/photostream/!

Bad Experience!!!!

Page 35: Bright talk   bringing back the love - final

OK.!So now what?!

Page 36: Bright talk   bringing back the love - final

CLEANING UP THE LANDSCAPE

Silo
Monolithic Framework
Niche
Launch Pad
Information Bus

Adapted from: Akella, Janaki. “IT Architecture: Cutting costs and complexity.” McKinsey Quarterly 13 Nov 2009 https://www.mckinseyquarterly.com/IT_architecture_Cutting_costs_and_complexity_2391

Page 37: Bright talk   bringing back the love - final

ONE INTEGRATED ENVIRONMENT

[Architecture diagram with the following components]
Distributed, Database, Mainframe, Network, Middleware, Storage
Event Pool
Operational Data Warehouse
Predictive
Enrichment & Correlation
Service Desk, Paging
CMDB
Knowledge
Asset Mgmt
Event Catalog
Event API
Business Telemetry
3rd Party Providers
Presentation Framework

Page 38: Bright talk   bringing back the love - final

CONCEPTUALIZING SITUATIONAL AWARENESS

[Diagram centered on the Situational Awareness Engine, with the following labels]
Real-Time Event Streams
Patterns from Historical Data
Causal Relationship from Past RCAs
Detected and Predicted Situations

Adapted from http://www.slideshare.net/TimBassCEP/getting-started-in-cep-how-to-build-an-event-processing-application-presentation-717795

Page 39: Bright talk   bringing back the love - final

SITUATIONAL AWARENESS MODEL DESIGN

Event Sources → Event Pipeline

Level 1: Event Taxonomy and Enrichment
Level 2: Event Tracking
Level 3: Situation Detection
Level 4: Predictive Analysis
Level 5: Runbook Automation

Data → Information → Knowledge → Intelligence

Supporting elements:
Patterns from Historical Data
Causal Relationship from Past RCAs
Historical Event Archive
Solicitations for User Interaction via the Visualization Framework

Adapted from the JDL: Steinberg, A., & Bowman, C., Handbook of Multisensor Data Fusion, CRC Press, 2001

Page 40: Bright talk   bringing back the love - final

REQUIREMENTS FOR UNITY OF EFFORT

1. Command and Control

2. Shared Experience

3. Situational Awareness

Symptoms of Missing Elements

•  Command and Control (No Leadership)
   •  The team lacks a clear direction
   •  Lots of activity, lack of progress
•  Shared Experience (Poor Relationships)
   •  Us vs. Them mentality
   •  Unhealthy competition
•  Situational Awareness (Poor Communication)
   •  Focused on cooperation, not collaboration
   •  Blame culture
   •  Infrequent or non-existent communication

Page 41: Bright talk   bringing back the love - final

Our success in any endeavor depends directly on our ability to solve problems!

What do we need to do that?!

Page 42: Bright talk   bringing back the love - final

You Gotta Have Skillz…!

Page 43: Bright talk   bringing back the love - final

WHAT MATTERS MOST?

Dr. Lee Goldman
Cook County Hospital, Chicago, IL

The Goldman Algorithm
§  Is the patient feeling unstable angina?
§  Is there fluid in the patient’s lungs?
§  Is the patient’s systolic blood pressure below 100?

[Chart: Prediction of Patients Who Will Have a Heart Attack Within 72 Hours; Traditional Techniques vs. Goldman Algorithm; scale 0 to 100]

By paying attention to what really matters, Dr. Goldman improved the “false negatives” by 20 percentage points and eliminated the “false positives” altogether.

Page 44: Bright talk   bringing back the love - final

THE GOLDMAN ALGORITHM

Patient enters ED with suspected Acute Cardiac Ischemia
Perform Electrocardiogram (EKG)

ECG Evidence of Acute Myocardial Infarction (MI)? (Yes / No)
ST-Segment Elevation ≥ 1mm in ≥ 2 Contiguous Leads (New or Unknown Age) or
Pathologic Q Waves in ≥ 2 Contiguous Leads (New or Unknown Age)

ECG Evidence of Acute Ischemia? (Yes / No)
ST-Segment Depression ≥ 1mm in ≥ 2 Contiguous Leads (New or Unknown Age) or
T-Wave Inversion in ≥ 2 Contiguous Leads (New or Unknown Age) or
Left Bundle-Branch Block (New or Unknown Age)

Urgent Factors Present? (asked on both the “Yes” and “No” ischemia branches)
Rales Above Both Lung Bases
Systolic Blood Pressure <100 mm Hg
Unstable Ischemic Heart Disease

Branch labels: 2 or 3 Factors, 0 or 1 Factors, 2 or 3 Factors, 1 Factor, 0 Factors
Risk levels: High Risk, Moderate Risk, Low Risk, Very Low Risk
Destinations: Coronary Care Unit, Inpatient Telemetry Unit, Observation Unit
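The flowchart above is a small decision tree, which makes it easy to express in code. The sketch below is an illustration only: the extracted slide text does not preserve which branch leads to which risk level, so the mapping used here (ECG evidence of MI is always high risk; ischemia with 2 or 3 factors is high, otherwise moderate; no ECG evidence with 2 or 3 factors is moderate, 1 factor is low, 0 factors is very low) is an assumed reading of the diagram. The function and parameter names are hypothetical.

```python
def count_urgent_factors(rales_above_both_lung_bases,
                         systolic_bp_below_100,
                         unstable_ischemic_heart_disease):
    """Count how many of the three urgent factors from the slide are present."""
    return sum([
        bool(rales_above_both_lung_bases),
        bool(systolic_bp_below_100),
        bool(unstable_ischemic_heart_disease),
    ])


def goldman_risk(ecg_shows_mi, ecg_shows_ischemia, urgent_factors):
    """Walk the decision tree and return a risk level.

    The branch-to-risk mapping is an assumption (see the lead-in);
    it is not recoverable from the flattened slide text.
    """
    if not 0 <= urgent_factors <= 3:
        raise ValueError("urgent_factors must be between 0 and 3")
    if ecg_shows_mi:
        return "High Risk"
    if ecg_shows_ischemia:
        return "High Risk" if urgent_factors >= 2 else "Moderate Risk"
    if urgent_factors >= 2:
        return "Moderate Risk"
    return "Low Risk" if urgent_factors == 1 else "Very Low Risk"


# Example: no ECG evidence, but low blood pressure is present (one factor)
factors = count_urgent_factors(False, True, False)
risk = goldman_risk(ecg_shows_mi=False, ecg_shows_ischemia=False,
                    urgent_factors=factors)
```

The point for monitoring is the shape of the logic rather than the medicine: a handful of well-chosen signals, evaluated in a fixed order, is enough to sort cases into a few actionable categories.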

Page 45: Bright talk   bringing back the love - final

NICE.!What does this look like in our world?!

Page 46: Bright talk   bringing back the love - final

WHAT GOOD MONITORING LOOKS LIKE

[Diagram: Corporate LANs & VPNs, Load Balancer, Firewall, Switch, Web Server Farm, Middleware, Data Power, Mainframe, Database; annotated with the numbers below]

Elements of Good Monitoring
1.  System Availability
2.  Operating System Performance
3.  Hardware Monitoring
4.  Service/Daemon and Process Availability
5.  Error Logs
6.  Application Resource KPIs
7.  End-to-End Transactions
8.  Point of Failure Transactions
9.  Fail-Over Success
10. “Activity Monitors” and “Reverse Hockey Stick”

Page 47: Bright talk   bringing back the love - final

FINDING METRICS THAT MATTER

Evaluating the Effectiveness of a Metric

§  Will the metric be used in a report? If so, which one? How is it used in the report?

§  Will the metric be used in a dashboard? If so, which one? How will it be used?

§  What action(s) will be taken if an alert is generated? Who are the actors? Will a ticket be generated? If so, what severity?

§  How often is this event likely to occur? What is the impact if the event occurs? What is the likelihood it can be detected by monitoring?

§  Will the metric help identify the source of a problem? Is it a coincident / symptomatic indicator?

§  Is the metric always associated with a single problem? Could this metric become a false indicator?

§  What is the impact if this goes undetected?

§  What is the lifespan for this metric? What is the potential for changes that may reduce the efficacy of the metric?

Page 48: Bright talk   bringing back the love - final

ANATOMY OF AN OUTAGE

IM01109089: P0 - Affecting Multiple apps & Internet Sales West

[Diagram: Corporate LANs & VPNs, Load Balancer, Firewall, Web Servers, WAS, Message Queue, Database, zOS CICS, zOS MQ, DB2; annotated with the numbers below]

1.  5:45-ish pm: CICS ABENDs start flooding MainView, but not at a high enough rate to ticket

2.  6:00-ish pm: MQ flows are interrupted and start alerting in Flow Diagnostics

3.  6:04pm: Synthetic transactions fail; at 6:14 the Ops Center confirms the issue and creates a P0 Incident

4.  6:54pm: Support teams investigate the interrupted flows and determine it is a “back-end” problem

5.  10:29pm: Support teams investigate MQ, ultimately rule it out, and decide to reset CICS to resolve the issue
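The cost of missing the early signal is easier to see once the gaps in this timeline are computed. The sketch below treats the "-ish" times as exact and takes 10:29pm as the moment of resolution, both of which are simplifying assumptions; the dictionary keys are hypothetical labels for the steps above.

```python
from datetime import datetime


def clock(hhmm):
    """Parse a 24-hour time; the date part is irrelevant for same-day gaps."""
    return datetime.strptime(hhmm, "%H:%M")


# Times taken from the outage timeline, converted to 24-hour form
timeline = {
    "cics_abends_begin": clock("17:45"),
    "mq_flows_interrupted": clock("18:00"),
    "synthetic_transactions_fail": clock("18:04"),
    "p0_incident_created": clock("18:14"),
    "backend_problem_identified": clock("18:54"),
    "cics_reset": clock("22:29"),
}


def minutes_between(start, end):
    """Whole minutes elapsed between two named points on the timeline."""
    return int((timeline[end] - timeline[start]).total_seconds() // 60)


unused_warning = minutes_between("cics_abends_begin", "p0_incident_created")
incident_duration = minutes_between("p0_incident_created", "cics_reset")
total_duration = minutes_between("cics_abends_begin", "cics_reset")
```

Under these assumptions the first symptom appeared 29 minutes before the incident was opened, and the eventual fix targeted the very component that raised that first symptom, 4 hours 44 minutes after it began.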

Page 49: Bright talk   bringing back the love - final

DRIVING THE RIGHT KIND OF ACTION

Application
   End User Experience
      Gainesville: Transaction 1, Transaction 2, … Transaction N
      San Antonio: Transaction 1, Transaction 2, … Transaction N
      Des Moines: Transaction 1, Transaction 2, … Transaction N
      Columbus: Transaction 1, Transaction 2, … Transaction N
   Infrastructure
      Network: KPI 1, KPI 2, … KPI N
      Mainframe: KPI 1, KPI 2, … KPI N
      Storage: KPI 1, KPI 2, … KPI N
      Linux: KPI 1, KPI 2, … KPI N
      Middleware: KPI 1, KPI 2, … KPI N
      Database: KPI 1, KPI 2, … KPI N

Page 54: Bright talk   bringing back the love - final
Page 55: Bright talk   bringing back the love - final

COMMON PROBLEM TYPES

§  Design Problems
§  Creative Problems
§  Daily Problems
§  People Problems

Rule-Based Approach

Event-Based Approach

Page 56: Bright talk   bringing back the love - final

EVENT-BASED PROBLEM SOLVING

§  Appreciative Understanding
§  Know What We Are Solving
§  Create A Common Reality
§  Solutions Based on Causes

Page 57: Bright talk   bringing back the love - final

CAUSAL RELATIONSHIPS

①  Causes are effects, and effects are causes

Database Down (Effect)
Drive Full (Cause/Effect)
Logs Not Truncated (Cause)

Page 58: Bright talk   bringing back the love - final

CAUSAL RELATIONSHIPS

②  You can keep identifying causes – there is no limit

End of the Universe (Effect)
Database Down (Primary Effect)
Drive Full (Cause/Effect)
Logs Not Truncated (Cause/Effect)
Beginning of Time (Cause)

Page 59: Bright talk   bringing back the love - final

TWO IMPORTANT QUESTIONS

Ask “Why?”

Ask “What?”

End of the Universe (Effect)
Database Down (Primary Effect)
Drive Full (Cause/Effect)
Logs Not Truncated (Cause/Effect)
Beginning of Time (Cause)

Page 60: Bright talk   bringing back the love - final

RULES FOR CAUSAL RELATIONSHIPS

③  An Effect is often the result of multiple causes

SQL Server was not processing queries (Effect)
Transaction log was unable to grow
T: Drive at 0 Bytes free
Logs were not truncated
DBA on honeymoon vacation in Fiji
Logs are truncated manually
Company has only 1 DBA
“Backup” DBA was not aware the logs require truncation
Space allocations are fixed
Lack of Control

[In the diagram, causes that combine to produce an effect are joined by -AND- connectors.]

Page 61: Bright talk   bringing back the love - final

RULES FOR CAUSAL RELATIONSHIPS

④  Causes need to be both necessary and sufficient

SQL Server was not processing queries (Effect)
Transaction log was unable to grow (Transitory Cause)
T: Drive at 0 Bytes free (Non-Transitory Cause & Effect)
Logs were not truncated (Transitory Cause & Effect)
DBA on honeymoon vacation in Fiji (Transitory Cause)
Logs are truncated manually (Non-Transitory Cause)
Company has only 1 DBA (Non-Transitory Cause)
“Backup” DBA was not aware the logs require truncation (Non-Transitory Cause)
Space allocations are fixed (Non-Transitory Cause)
Lack of Control

[In the diagram, causes that combine to produce an effect are joined by -AND- connectors.]

Page 62: Bright talk   bringing back the love - final

HOW FIRE WORKS

[Timeline diagram over Time: Oxygen, Heat, Fuel, Match Strike, Fire. Legend: Transitory, Non-Transitory]

Fire
   Oxygen -AND- Heat -AND- Fuel -AND- Match Strike

•  Transitory Causes act as catalysts to bring about change (think Transition)

•  Non-Transitory Causes are objects, properties/attributes, and status

Page 63: Bright talk   bringing back the love - final

TAKE A SOLOGIC RCA DIAGRAM

Customers Complaining
Trying to do business on the website (Desired Condition)
Web Server returning 500 errors
The application server was timing out
Only one application server exists
More Information Needed
SQL Server was not processing queries
Only one database cluster in use
DR SQL Cluster
DR Cluster being used for UAT testing
More Information Needed
Transaction log was unable to grow
T: Drive at 0 Bytes free
Logs were not truncated
DBA on honeymoon vacation in Fiji
Logs are truncated manually
Company has only 1 DBA
“Backup” DBA was not aware the logs require truncation
Space allocations are fixed
Lack of Control

[In the diagram, causes that combine to produce an effect are joined by -AND- connectors.]

Page 64: Bright talk   bringing back the love - final

ADD THE EVIDENCE

[Same RCA diagram as the previous slide, with evidence labels added: Statistical Data, Situational, Observation]

Page 65: Bright talk   bringing back the love - final

FAILURE MODES AND EFFECT ANALYSIS

SQL Server Not Available
Transaction log is unable to grow
T: Drive at 0 Bytes free
Logs were not truncated
DBA on honeymoon vacation in Fiji
Logs are truncated manually
Company has only 1 DBA
“Backup” DBA was not aware the logs require truncation (Condition Cause)
Space allocations are fixed (Condition Cause)
Lack of Control
SQL is unable to cache query results
Available RAM at 0 Bytes Free
C: Drive at 0 Bytes free
Minidump is configured to write to C: Drive
Server was ASRing frequently
Software distributions were leaving files in the TEMP folder
%TEMP% configured to C:\Temp
Kernel able to write to page file

[In the diagram, the causes are joined by -AND- and -OR- connectors.]

Page 66: Bright talk   bringing back the love - final

GETTING TO OUR REQUIREMENTS

[Same failure mode diagram as the previous slide, with two annotations]
•  Monitor the intersections at the “OR’s”
•  At least one point along each branch after the “OR”
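The rule on this slide can be applied mechanically once the cause tree is held as data. The sketch below is an illustration under assumptions: the `Cause` class is hypothetical, and the tree is a simplified reading of the diagram in which "SQL Server Not Available" has two alternative branches joined by an OR. It picks the OR node itself plus the first cause on each branch beneath it as monitoring points.

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    label: str
    gate: str = "AND"   # how the child causes combine: "AND" or "OR"
    children: list = field(default_factory=list)


def monitoring_points(node):
    """Collect each OR intersection plus one point on every branch below it."""
    points = []
    if node.gate == "OR" and node.children:
        points.append(node.label)                                # the intersection
        points.extend(child.label for child in node.children)   # one per branch
    for child in node.children:
        for label in monitoring_points(child):
            if label not in points:
                points.append(label)
    return points


# Simplified, assumed version of the failure mode tree from the slide
tree = Cause("SQL Server Not Available", gate="OR", children=[
    Cause("Transaction log is unable to grow", children=[
        Cause("T: Drive at 0 Bytes free"),
    ]),
    Cause("SQL is unable to cache query results", children=[
        Cause("Available RAM at 0 Bytes Free"),
    ]),
])

to_monitor = monitoring_points(tree)
```

Monitoring one point per branch is what lets a support team tell which alternative is in play when the shared effect appears, instead of only learning that the effect occurred.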

Page 67: Bright talk   bringing back the love - final

FMEA MATRIX (IMPACT CALCULATION)

Severity of the effect
•  Negligible (1-2): no loss in functionality, mostly cosmetic
•  Marginal (3-4): temporary interruptions or the degradation lasts for a brief period of time
•  Critical (5-6): the problem will not resolve itself but a workaround exists allowing the problem to be bypassed
•  Serious (7-8): the problem will not resolve itself and no workaround is possible. Functionality is impaired or lost but the system is usable to some extent
•  Catastrophic (9-10): the system is completely unusable

Frequency of occurrence
•  Improbable (1-2): less than 1 time per year
•  Remote (3-4): 1 time per year
•  Occasional (5-6): 1 time per month
•  Probable (7-8): 1 time per day
•  Chronic (9-10): 1 or more times per day

Likelihood of detection (when the problem would be found)
•  Very high (1-2): during the design phase
•  High (3-4): during peer review or unit testing
•  Moderate (5-6): during system testing or acceptance testing
•  Remote (7-8): during or immediately after production deployment
•  Very Remote (9-10): only after heavy usage by users
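FMEA conventionally combines three 1-10 scores like these into a Risk Priority Number by multiplying them. The slide does not spell out its formula, so using the standard product here is an assumption, as are the function name and the sample scores.

```python
def risk_priority_number(severity, occurrence, detection):
    """Standard FMEA Risk Priority Number: the product of three 1-10 scores.

    Assumes the talk uses the conventional formula; higher means riskier.
    """
    for name, score in (("severity", severity),
                        ("occurrence", occurrence),
                        ("detection", detection)):
        if not 1 <= score <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {score}")
    return severity * occurrence * detection


# Hypothetical scoring of two failure modes against the scales above
failure_modes = {
    # Catastrophic (9), about monthly (5), found only after deployment (7)
    "Transaction log drive fills up": risk_priority_number(9, 5, 7),
    # Marginal (3), about yearly (3), caught in unit testing (3)
    "Dashboard label misspelled": risk_priority_number(3, 3, 3),
}
ranked = sorted(failure_modes, key=failure_modes.get, reverse=True)
```

Ranking by the product gives a simple way to decide which failure modes deserve monitoring investment first, though teams often also review any item with a very high severity score regardless of its product.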

Page 68: Bright talk   bringing back the love - final

FMEA MATRIX (EVIDENCE)

These are the events that help us to RULE IN a failure mode as a possible cause

These are the events that help us RULE OUT the failure mode as not relevant
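The rule-in and rule-out evidence lends itself to a simple lookup. In the sketch below, each failure mode carries a set of indications (events that rule it in) and contraindications (events that rule it out); the event names, the dictionary layout, and the choice to let a contraindication override an indication are all assumptions made for illustration.

```python
def triage(failure_modes, observed_events):
    """Sort failure modes into ruled in, ruled out, or still undetermined."""
    observed = set(observed_events)
    result = {"ruled_in": [], "ruled_out": [], "undetermined": []}
    for name, evidence in failure_modes.items():
        if observed & evidence["contraindications"]:
            result["ruled_out"].append(name)     # evidence against wins (assumed)
        elif observed & evidence["indications"]:
            result["ruled_in"].append(name)
        else:
            result["undetermined"].append(name)  # still needs investigation
    return result


# Hypothetical evidence matrix for three failure modes
failure_modes = {
    "Transaction log drive full": {
        "indications": {"t_drive_0_bytes_free", "log_growth_failed"},
        "contraindications": {"t_drive_space_ok"},
    },
    "Memory exhausted": {
        "indications": {"ram_0_bytes_free"},
        "contraindications": {"ram_ok"},
    },
    "Network partition": {
        "indications": {"heartbeat_lost"},
        "contraindications": {"heartbeat_ok"},
    },
}

status = triage(failure_modes, ["log_growth_failed", "ram_ok"])
```

An accounting like `status` is the kind of summary that can open an incident bridge call: what has been ruled in, what has been ruled out, and what still needs someone assigned to it.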

Page 69: Bright talk   bringing back the love - final

HOW TO DETERMINE EVENT SEVERITY

Six Levels of Severity

Severity        Description
Critical        The component has completely failed
Major           The component is operating but is in a degraded or crippled state
Minor           The component is functioning normally but is at risk of a more serious failure
Informational   The component is functioning normally but is reporting a change in state
Unknown         The component has changed its operating state but the effect is not known
Clear           The component is operating normally or a higher severity event has been resolved

•  The event severity is determined with respect to the component generating the event

•  The event severity does not consider impact or urgency

•  The incident priority is not determined by event severity

•  The event severity helps drive an effective triage when multiple events arrive at approximately the same time

•  Only after the affected components and their relationships to each other have been determined can impact and urgency be determined

[Diagram: Logical Server (Virtual Machine 1, Virtual Machine 2); Physical Server (Server 1, Server 2); Logical Volumes (Volume Group 1, Volume Group 2); Physical Volumes (Hard Drive 1, Hard Drive 2, Hard Drive 3)]
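One way to use the six levels for triage is to give them an explicit order and sort simultaneous events by it. The sketch below assumes the order in which the slide lists the levels is also the triage order (so Unknown ranks below Informational), which the slide does not state; the component names are hypothetical.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Six severity levels; lower value is triaged first (assumed ordering)."""
    CRITICAL = 1
    MAJOR = 2
    MINOR = 3
    INFORMATIONAL = 4
    UNKNOWN = 5
    CLEAR = 6


def triage_order(events):
    """Sort (component, severity, arrival_sequence) tuples for triage.

    Severity decides first; arrival order breaks ties. Impact and urgency
    are deliberately not considered here, matching the slide.
    """
    return sorted(events, key=lambda e: (e[1], e[2]))


# Events arriving at approximately the same time (sample data)
burst = [
    ("Volume Group 1", Severity.MINOR, 1),
    ("Hard Drive 2", Severity.CRITICAL, 2),
    ("Virtual Machine 1", Severity.MAJOR, 3),
    ("Server 1", Severity.CLEAR, 4),
    ("Hard Drive 3", Severity.CRITICAL, 5),
]
queue = [component for component, _, _ in triage_order(burst)]
```

The sort only decides what an operator looks at first; incident priority still has to come from impact and urgency once the affected components and their relationships are known.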

Page 70: Bright talk   bringing back the love - final

MONITORING BASED ON PATTERNS

Layers of Pre-Defined Monitoring Patterns

•  The OS template is deployed when the server is provisioned

•  As a server is customized to fit its role, additional templates are deployed

•  Templates are stacked on top of each other until no gaps remain

•  This approach provides a high degree of standardization without sacrificing the ability to develop a custom solution
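Stacking templates can be pictured as merging dictionaries of checks, with later layers adding to or overriding earlier ones. This is a minimal sketch: the template contents, check names, and thresholds are hypothetical, and real monitoring tools each have their own template format.

```python
def stack_templates(*layers):
    """Merge monitoring templates in order; later layers override earlier ones."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


def remaining_gaps(required_checks, deployed):
    """Return required checks that no stacked template provides yet."""
    return sorted(set(required_checks) - set(deployed))


# Hypothetical templates: check name -> threshold description
os_template = {"cpu_utilization": "warn > 90%", "filesystem_free": "warn < 10%"}
web_template = {"http_response_time": "warn > 2s", "httpd_process": "must run"}
custom_template = {"filesystem_free": "warn < 20%"}  # tighter than the OS default

required = ["cpu_utilization", "filesystem_free", "http_response_time",
            "httpd_process", "ssl_certificate_expiry"]

deployed = stack_templates(os_template, web_template, custom_template)
gaps = remaining_gaps(required, deployed)
```

Here the custom layer tightens one threshold without touching the standard templates, and the gap check shows that one more template is still needed before the stack is complete.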

Page 71: Bright talk   bringing back the love - final

APPLICATION-TECHNOLOGY MATRIX

Maps services, applications and technologies enabling:
• Monitoring investment prioritization
• Monitoring maturity
• Which templates need to be deployed when new hardware is acquired
• Whether a service has sufficient monitoring coverage based on its application components
• This approach allows for anticipating changes to a customer’s monitoring needs

Scores indicate:
0 – No Strategy
1 – Limited Monitoring
2 – Fully Integrated Strategy
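A matrix like this can answer the coverage question directly. The sketch below assumes one score per application and technology pair and treats anything below 2 as insufficient; both choices, along with the application and technology names, are assumptions made for illustration.

```python
def under_monitored(matrix, service_apps, minimum=2):
    """List (application, technology, score) cells below the minimum score."""
    return sorted(
        (app, tech, score)
        for (app, tech), score in matrix.items()
        if app in service_apps and score < minimum
    )


# Hypothetical matrix: (application, technology) -> 0, 1, or 2
matrix = {
    ("Quoting", "Web Server"): 2,
    ("Quoting", "Database"): 1,
    ("Quoting", "Middleware"): 0,
    ("Billing", "Mainframe"): 2,
    ("Billing", "Database"): 2,
}

# A service is described by the applications it is built from
internet_sales = {"Quoting", "Billing"}
gaps = under_monitored(matrix, internet_sales)
has_sufficient_coverage = not gaps
```

The returned cells double as a prioritized to-do list: a score of 0 means no strategy exists for that pairing, while a score of 1 means monitoring exists but is not yet integrated.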

Page 72: Bright talk   bringing back the love - final

EVENT LIFECYCLE EXAMPLE

Legend (responsible tools): Element Manager, Distributed Collectors, Object Server Triggers, Impact Policies, ITNM RCA Engine, Gateway Replication, Webtop Event List

Software-Operating System

Activity (each mapped to a Responsible Tool)
Data Collection
Anomaly Detection
Event Generation
Integration
Event Processing
Enrichment
Event Suppression
Correlation
Root Cause Analysis
Business Impact Analysis
Automation
Notification & Escalation
Presentation
User Interaction Tools
Archiving
Reporting

Activity (each mapped to a Responsible Tool)
Trigger Ticket Request
Create Ticket
Update Event with IM#
Trigger Courtesy Pages
Send Pages

Page 73: Bright talk   bringing back the love - final

AGGREGATION AND ANALYSIS OVERVIEW

[Architecture diagram with the following components]
Automated Action
Notification and Escalation
Business Impact Analysis
Root Cause Analysis
Correlation and Event Suppression
Enrichment
Meta-Data Integration Bus
Distributed Collectors (three instances)
LOB Managed Monitoring System
Service Provider Managed Monitoring System
Vendor Managed Monitoring System
Element Manager (three instances)
Other Enterprise Data
Document Sharing
Service Desk
CMDB
Batch Scheduling
Knowledge Database
Online Run Book
PBX/Call Manager
Visualization Framework
Common Event Format
Topology And Relationship Database
Automated Action Tools
Automated Provisioning System
Predictive Analysis
Automated Change Reconciliation
Security Management
Archive and Report
Business Telemetry Data
Service Center and Enterprise Notification Tool

Page 74: Bright talk   bringing back the love - final

How do we keep it evolving?!

Page 75: Bright talk   bringing back the love - final

FACILITATING PRODUCTION ASSURANCE

§  CritSits
   §  Start the CritSit meeting and provide an accounting of all the potential failure modes, which have been successfully ruled out, and which need to be investigated
   §  Include other potential failure modes into the KT matrix
§  Problem Management
   §  Document the causal elements as new failure modes
   §  Disseminate new failure modes to Architecture, ESM and the Command Center
§  Reporting
   §  Produce a monthly newsletter to application owners with the list of failure modes they should discuss with their architects
   §  Incorporate failure modes into “Fault Line” analysis

Page 76: Bright talk   bringing back the love - final

DURING THE DESIGN PROCESS

•  Architects
   •  Certify that designs do not contain the known failure modes or document that the failure mode does not present an unacceptable risk
   •  Document the requirements for Solution Architects to follow to ensure the mitigation strategies are implemented completely
•  Developers
   •  Certify that designs do not contain the known failure modes or document that the failure mode does not present an unacceptable risk
   •  Certify the designs implement the mitigation strategies correctly

Page 77: Bright talk   bringing back the love - final

IMPROVING ENTERPRISE TOOLS

§  Systems Management
   §  Develop new monitoring requirements using the documented indications and contraindications
§  Event Management
   §  Develop new correlations tying indications and contraindications to failure modes to assist in ruling out or ruling in those “in play” more efficiently
§  Configuration Management
   §  Develop new discovery patterns using the documented indications and contraindications
   §  Develop automations to detect the presence of failure mode conditions and generate an event to the Event Management System

Page 78: Bright talk   bringing back the love - final

DURING SERVICE SUPPORT

•  Command Centers and Support Teams
   –  Use the failure modes to rule out potential failure modes
   –  Each failure mode will have a documented process to follow to mitigate the impact once a failure mode is identified
•  Incident Managers
   –  Start bridge calls and provide an accounting of all the potential failure modes, which have been successfully ruled out, and which need to be investigated
   –  Coordinate the investigation assignments and consolidate the investigation results

Page 79: Bright talk   bringing back the love - final

LET’S KEEP THE CONVERSATION GOING…

[email protected]

ReverendDrew

SystemsManagementZen.Wordpress.com

systemsmanagementzen.wordpress.com/feed/

@SystemsMgmtZen

ReverendDrew

[email protected]

614-306-3434