Modern IT Operations Management with Moogsoft AIOps
Real-World Use Cases For Algorithms And Machine Learning
Dominic Wellington | Moogsoft
Confidential and Proprietary Information for Moogsoft Inc.
• ~200 Apps
• Key input to MIM
• Front-line Support
Incident Management Workflow Today: Reactive
Customer
I have an issue!
• Major Incident Mgmt
• Root Cause Analysis
All-hands Bridge
• KPI Tracking
• Root Cause Analysis
• Multiple People
• Research, document issue further to prevent recurrence
ECC: Eyes on Glass
EMIM: Escalation/Restoration
NOC: Network-centric
~10-15 per day
SO: Process & Problem Mgmt
L1: Major Incident Mgmt
Avg. MTTB: 8min
60% of Issues
Severity & area of issue determined/guessed
Biz Infra L2
• No access to ECC data
• Different view of the universe
• Blamed 70-80% of the time
• Routers, switches, ISP, VPN, Firewall, etc.
Proactive Bridge
Avg. MTTD: 15-20 mins
Avg. MTTR 107 mins
Monitoring
Reactive, rules-based event management
GSAM IBM: Run bridge
Internet
40% of Issues
Your Incident Workflow Today
TOOLS
2 Million Events
WORKFLOW
TEAMS
xMatters
NOTIFY
New Relic
Riverbed
Splunk
Nagios XI
Oracle
AWS
DBA, Storage, Net
L3 App Dev
L2 App Support
SWAT Team
Exec / App Owner
TROUBLESHOOT
Nagios, Dynatrace, Extrahop
…
L1 Operators
ServiceNow
Customer Reported 65%
Monitoring Reported 35%
TICKET
Duplicate 50%
CORRELATE
L1 Operators
ServiceNow
RESOLVE
No Action 75%
L1 Operators
Webex
BRIDGE
Mean-Time-To-Detect 60 mins
Mean-Time-To-Resolve 120 mins
ANALYZE
Current vs. Future State with Moogsoft AIOps
TROUBLESHOOT
DETECT SITUATION
AUTO-TICKET
AUTO-NOTIFY
Current State
With Moogsoft
ANALYZE NOTIFY RESOLVE TICKET CORRELATE BRIDGE
MTTD: 15 mins MTTR: 104 mins
MTTD: secs MTTR: < 60 mins
Moogsoft Business Value $$$
Min. reduction in downtime by 25%
Min. reduction in MTTD by 25%
Min. reduction in MTTR by 25%
Min. reduction in tickets by 25%
RESOLVE
LEARN
KNOWLEDGE CYCLE
TROUBLESHOOT
Algorithms
Humans
SITUATION ROOM
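The business-value targets on this slide are plain percentage reductions, so the implied ceilings can be worked out from the current-state figures. A minimal sketch, assuming the 25% minimum applies to the stated MTTD of 15 minutes and MTTR of 104 minutes; the function and variable names are illustrative and not part of any Moogsoft API:

```python
def after_reduction(baseline_minutes: float, reduction: float) -> float:
    """Return a metric's value after a fractional reduction (0.25 means 25%)."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return baseline_minutes * (1.0 - reduction)


# Current-state figures quoted on the slide, in minutes.
current = {"MTTD": 15.0, "MTTR": 104.0}

# Assumption: the "min. reduction by 25%" applies to each metric independently.
MIN_REDUCTION = 0.25

targets = {name: after_reduction(value, MIN_REDUCTION)
           for name, value in current.items()}

for name, value in targets.items():
    print(f"{name}: {current[name]:.0f} min -> at most {value:.2f} min")
```

Under these assumptions the 25% floor gives an MTTD ceiling of 11.25 minutes and an MTTR ceiling of 78 minutes. The slide's "MTTR: < 60 mins" figure would need a larger cut, a little over 42% from 104 minutes.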
Moogsoft AIOps at RBC
Before Moogsoft
“Operators could take hours to realize that they were investigating the same tickets”
–Adam Frank, Director of Alarm & Event Management Systems, RBC
With Moogsoft
▪ 50% reduction in operational noise
▪ 35% reduction in Mean-Time-To-Detect
▪ 43% reduction in Mean-Time-To-Restore
▪ 4x ROI in first year
Moogsoft AIOps in Deutsche Telekom Group
Moogsoft AIOps detected early warnings 10 hours before Netcool
▪ Progression across multiple NOCs, domains and countries
▪ All stakeholders aware and push-notified
Machine learning reduced 360,000 raw events to thousands of Situations
▪ 99.7% noise reduction
▪ No service-affecting incidents missed
▪ CMDB not required
Moogsoft AIOps rapidly ingests modern event feeds
▪ SDN and NFV ready
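The noise-reduction figure quoted above is a ratio of raw events to Situations, so it can be checked with simple arithmetic. A minimal sketch, assuming noise reduction is defined as 1 minus Situations divided by raw events (the slide does not state the formula); the function names are illustrative:

```python
def noise_reduction(raw_events: int, situations: int) -> float:
    """Fraction of raw events that operators no longer have to look at."""
    if raw_events <= 0:
        raise ValueError("raw_events must be positive")
    return 1.0 - situations / raw_events


def situations_remaining(raw_events: int, reduction: float) -> int:
    """Number of Situations implied by a given noise-reduction fraction."""
    return round(raw_events * (1.0 - reduction))


# Figures from the slide: 360,000 raw events and 99.7% noise reduction.
remaining = situations_remaining(360_000, 0.997)
print(f"Situations remaining: {remaining}")
print(f"Check: {noise_reduction(360_000, remaining):.1%} noise reduction")
```

Under that assumed definition, 99.7% of 360,000 events leaves roughly a thousand Situations, which is the order of magnitude the slide describes.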
“We needed to automate our ‘catch and dispatch’ process without the need for rules”
–Navin Sabharwal, Fellow & Chief Architect, HCL Technologies.
• 62% Fewer Tickets
• 33% Shorter MTTR
• Integrated with existing and future tools
Operational Efficiency
Agility & Flexibility
HCL – Benefits of AIOps with Moogsoft
Moogsoft AIOps Process Overview
Application Outage Occurs
Millions of Events and Alerts
Real-Time Situation Insight
Situation
Algorithmic Noise Reduction & Clustering
Algorithmic Knowledge
Ecosystem Integration
Collaborative Team-Based Workflow
Network
L1 DBA
L2
Storage
Sys Admins
Dev
Workflow Automation
Moogsoft AIOps Algorithms
Adapted ML for Real-Time IT Ops & DevOps
• Time: Detects patterns in timestamps
• Linguistic: Detects linguistic relationships in events
• Topology: Detects patterns in network proximity
• Ops Template: Blueprints past faults for future detection & remediation
• Neural Feedback: Automatically learns behavior from IT Ops users
• Knowledge: Knowledge reuse for past Situations
• ACE: Algorithmic Clustering Engine
• Teams: Intelligent team notifications
• Entropy: Alert ranking & noise reduction
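As an illustration of what detecting patterns in timestamps can mean in practice, the sketch below groups alerts that arrive close together in time. It is a generic gap-based grouping, not Moogsoft's actual Time algorithm; the function name, the five-minute gap, and the sample timestamps are all assumptions made for the example:

```python
from datetime import datetime, timedelta


def cluster_by_time(timestamps, max_gap=timedelta(minutes=5)):
    """Group timestamps; a gap longer than max_gap starts a new cluster."""
    clusters = []
    for ts in sorted(timestamps):
        # Extend the latest cluster if this alert is close enough to its last member.
        if clusters and ts - clusters[-1][-1] <= max_gap:
            clusters[-1].append(ts)
        else:
            clusters.append([ts])
    return clusters


# Hypothetical alert arrival times: one burst after 10:14, another after 11:19.
alerts = [
    datetime(2017, 1, 7, 10, 14, 21),
    datetime(2017, 1, 7, 10, 15, 2),
    datetime(2017, 1, 7, 10, 17, 40),
    datetime(2017, 1, 7, 11, 19, 37),
    datetime(2017, 1, 7, 11, 20, 5),
]

clusters = cluster_by_time(alerts)
for i, cluster in enumerate(clusters, start=1):
    print(f"cluster {i}: {len(cluster)} alerts starting at {cluster[0].time()}")
```

The gap threshold is the main tuning knob here: a shorter gap splits bursts into more clusters, a longer one merges unrelated bursts.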
Algorithmic Clustering Engine (ACE)
Graph Entropy
Time Occurrence
Whitelisting
Blacklisting
Informational Entropy
Network Proximity
Textual Similarity
Soft Fuzzy Matching
AppDynamics
Splunk
Solarwinds
Nagios
Oracle
ACE
Streaming Events
Situations
Monitoring Tools
Lightweight ACE Definitions route data to appropriate algorithms
Firewall Incident
01/07/17 10:14:21 AM
CRM, Website and Order Services Impacted
Database Incident
01/07/17 11:19:37 AM
BI Service Impacted
Storage Incident
01/07/17 12:14:06 AM
Payment Service Impacted
Real-Time Algorithms Cluster Events
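To make the idea of clustering events by textual similarity concrete, the sketch below groups alert messages whose word sets overlap. It uses Jaccard similarity with a greedy single pass, which is a stand-in chosen for brevity and not the method ACE itself uses; the function names, the 0.5 threshold, and the sample messages are hypothetical:

```python
def tokens(message: str) -> set:
    """Lower-cased, whitespace-separated words of an alert message."""
    return set(message.lower().split())


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_text(messages, threshold=0.5):
    """Assign each message to the first cluster whose first member is similar enough."""
    clusters = []  # list of (token set of first member, list of member messages)
    for message in messages:
        words = tokens(message)
        for representative, members in clusters:
            if jaccard(words, representative) >= threshold:
                members.append(message)
                break
        else:
            # No existing cluster matched, so this message starts a new one.
            clusters.append((words, [message]))
    return [members for _, members in clusters]


# Hypothetical alert texts from different monitoring tools.
sample = [
    "db01 connection pool exhausted",
    "db02 connection pool exhausted",
    "firewall fw-edge dropped packets on eth0",
    "firewall fw-edge dropped packets on eth1",
    "disk latency high on san-7",
]

groups = cluster_by_text(sample)
for group in groups:
    print(len(group), group)
```

Comparing only against a cluster's first member keeps the example short; a fuller version would likely compare against all members or a running centroid.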
Clustering Techniques – Precision vs. Recall
[Chart: Recall (Quantity) plotted against Precision (Quality), each axis running from Low to High. Techniques plotted: Rules, ACE, Time, Linguistic, Topology. Legend: *Low Effort, *High Effort.]
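Precision and recall for a clustering technique can be measured in several ways. The sketch below uses a pairwise definition: a pair of alerts counts as a hit when the technique puts them in the same cluster and they really belong to the same incident. This is one common convention, assumed here for illustration and not necessarily how the chart was produced; the alert IDs and cluster labels are invented:

```python
from itertools import combinations


def same_cluster_pairs(assignment: dict) -> set:
    """Unordered pairs of alert IDs that share a cluster label."""
    return {
        frozenset((a, b))
        for a, b in combinations(assignment, 2)
        if assignment[a] == assignment[b]
    }


def pairwise_precision_recall(predicted: dict, truth: dict):
    """Precision: share of predicted pairs that are correct.
    Recall: share of true pairs that were found."""
    predicted_pairs = same_cluster_pairs(predicted)
    true_pairs = same_cluster_pairs(truth)
    hits = len(predicted_pairs & true_pairs)
    precision = hits / len(predicted_pairs) if predicted_pairs else 1.0
    recall = hits / len(true_pairs) if true_pairs else 1.0
    return precision, recall


# Ground truth: alerts a1-a3 belong to incident X, a4-a5 to incident Y.
truth = {"a1": "X", "a2": "X", "a3": "X", "a4": "Y", "a5": "Y"}

# A narrow rule that only links a1 and a2: few pairs, but all of them correct.
strict = {"a1": "r1", "a2": "r1", "a3": "r2", "a4": "r3", "a5": "r4"}

# A loose grouping that lumps everything together: finds every pair, many wrong.
loose = {alert: "all" for alert in truth}

print("strict:", pairwise_precision_recall(strict, truth))
print("loose: ", pairwise_precision_recall(loose, truth))
```

The two toy groupings show the trade-off the chart describes: the narrow rule scores high on precision and low on recall, and the catch-all grouping does the reverse.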
Moogsoft AIOps Architecture
MooBots: Workflow, Notifications & Remediation
LAMs: Event Ingestion
Log Events
Monitoring Events
Change Events
IT Service Desk
Event Feeds
CMDB
Events
Alerts
Situations
SNMP, Netcool, BMC BEM, CA Spectrum, HP NNM/OM
Splunk, Log Files, syslog
Jira, Chef, Puppet
Extrahop, Dynatrace, Nagios
ServiceNow, Cherwell, BMC Remedy, JIRA Service Desk.
BMC Atrium, HP/IBM/CA CMDB, AMDOCS, File, any database, etc.
Slack, Skype, Google+, etc.
CLI, Java, JavaScript, C++, ObjC, SQL, PERL, etc.
Sigalizers: Machine Learning
Situation Room: UI & Collaboration
MOOG
Knowledge
Real-time Bus
External
Knowledge
Script and Process etc.
IRC/Chat/Chatbots
Notifications: PagerDuty, OpsGenie, xMatters