Network Modeling, Analytics and Practical Data Science for NGN and EPN Networks
Matt Birkner, Distinguished Engineer, CCIE #3719
Rob Piasecki, Solutions Architect, CCIE #23765
BRKSPG-2231
© 2018 Cisco and/or its affiliates. All rights reserved. Cisco Public
Cisco Spark
Questions? Use Cisco Spark to communicate with the speaker after the session
1. Find this session in the Cisco Live Mobile App
2. Click “Join the Discussion”
3. Install Spark or go directly to the space
4. Enter messages/questions in the space
How
cs.co/ciscolivebot#BRKSPG-2231
Agenda
• Industry Trends and the Need for Analytics
• Analytics Framework and Toolsets
• Capacity Estimation Building Blocks
• Modeling Methodology and Topology Analytics
• Design Case Study
• Streaming Telemetry
• Analytics Case Studies and Visualizations
• The Road Ahead: WAN Orchestration, Analytics & SDN
Industry Trends
• Real-time traffic such as Voice and Video, Virtualization, and Storage have had a profound impact on Network Planning, Engineering and Operations Teams in all segments (SP, ENT, Commercial, Public Sector, etc.)
• Well-defined Architecture and Best Practices have been critical for Converged, Scalable & Fault Tolerant Next Generation Networks (NGN)
Wave #1 – Converged NGN – ~2000-2010
Consolidation for OPEX Reduction
Industry Trends
• Device-centric views such as CPU, Memory, Buffers, Link utilization are not enough to create an end to end topological view of true network capacity
• Customers need Topological Visibility to Optimize the capacity of network resources, including the state of both internal (IGP) and external (BGP) link Metrics
Wave #2 – Visibility & Optimization – ~2010-2015
Understand and Optimize Capacity bottlenecks due to traffic shifts/failures
Better utilization of Network Resources in steady state
Reduce OPEX and CAPEX Through Analytics
Reactive → Proactive → Predictive
Growth, Scale and Customer Experience
Industry Trends
• Single Data Source Visibility and reporting is good, but not adequate for complex problems
• Compute and Storage limits are no longer a bottleneck
• Oceans of Data, in near real time
• Customers need to correlate multiple data sources to make business decisions, reduce OPEX and make forecasts & predictions better than their competitors…
• Customers need to find root cause of problems faster to take action and drive automation
Wave #3 – Data Correlation, Prediction, Automation – Now
Understand and Optimize Capacity bottlenecks due to traffic shifts/failures
Better utilization of Network Resources in steady state
Reduce OPEX and CAPEX Through Analytics
Reactive → Proactive → Predictive
How do I compete and survive?
Multiple Data Sources
New Topologies, CLOS Fabrics
The Evolving Operational Landscape
• Analytics Platforms, Capacity Planning and Traffic modeling have matured
• Telemetry is key focus area, streaming data, data at rest, structured, unstructured
• Orchestration, network programming and collection standards are here (NETCONF/YANG)
• SDN Controllers are being deployed in DC, Campus, Core
• Predictive Analytics, Data Correlation are happening and accelerating – opening the door to Self Learning Networks (SLN)
• Traditional Operational Tools are not sufficient for non-stop networking
Lots of new Data Sources!
Service Provider Key Challenges
• Traffic Growth: Cost scaling faster than revenue
• Complexity: Training & operational expenses
• Agility: Time to market is slow due to lack of automation
• Competition: Market transitions; new agile, nimble players
• TCO: Cost of operations on the rise, profitability under pressure
• Speed of Innovation: Unable to catch market transitions
Goal = Lean SP + Rapid / Rich Innovation
SP Compass Designs are simple, scalable, automatable and repeatable
• Building Blocks: SR/EVPN, YANG data models, IPv6, Telemetry and Analytics
• CLOS Fabric Designs: Compass Metro Fabric, Compass Core Fabric, Compass Peering Fabric
• Buying Centers: Metro & Access, Core, Peering
So, what is the challenge?
• The Business Side is running Blind!
• There is an inability to see how Customers/Peers are using the network
• Network data is inherently difficult to visualize effectively for many use cases
• No visibility into Traffic Trends per Customer/Peer
• Limited Visibility into Peering Violations
• Limited Visibility into OPEX cost
• Limited ability to automate, and inability to predict issues
• Many NetFlow, Syslog, and performance tools exist, but customers cannot easily “blend” the data in a way that makes sense for the business
• Traditional Network Reporting needs to transform to Business needs
• We need not only to Visualize the data, but also to Predict Future State
Business Challenges
Growth, Scale, Customer Experience, and Single View
Analytics Initiative
Performance & Capacity Analytics
Topology Analytics
Fault Analytics
How do I measure how well my business is running?
Device Level Analytics
Business Needs
Cost Reduction
Business Continuity
Service Assurance
The Need for Modeling and Analytics
Engineering/Architecture
Balancing Traffic
Optimal Topology Design
RSVP, QoS, Multicast Design
The need for MPLS TE or not?
Poorly defined Metrics
Design Validation
Asymmetric Metrics & Loads
Inadequate QoS & Fragmented QoS
Planning
Growth Forecasts
Upgrade Analysis
New Service Impact
SLA planning
Areas of Over-Capacity
Areas of Under-Capacity
Best Place to add a new Link
Multi chassis vs. Single Chassis
Operations
Network Health and Traffic Trends
Maintenance Planning
Troubleshooting
Congestion Mitigation
Failure Analysis (RCA)
Analytics is a Requirement for SDN & Automation
Visibility, Analytics, Control
Visualize the Network
• Explore and understand infrastructure (filter, sort, drill down)
• Visualize hotspots in global context
• Report and analyze trends
Optimize the Network
• Evaluate traffic in conjunction with topology
• Predict ramifications of traffic changes
• Use risk assessment in planning
• Reclaim unused bandwidth
Control the Network
• Fulfill customer demands with automation
• Enable high value applications to tune network
• Rapidly adjust network configuration to current-state demand matrix
Analytics Framework and Toolsets
Analytics Software Landscape
Third-Party Analytics Software Ties Together Multivariate Data Sources
Many choices are available in the market today to extract knowledge. Just a few…
What is Big Data?
• Volume
• Internet Traffic, IOE
• Flow Data
• Mobile Devices
• Telemetry
• Velocity
• Data Streaming/Firehose of data
• Variety
• Structured vs Unstructured
• Monolithic Approach is not scalable
• Veracity (lots of sources !)
What is Data Science?
• Emerging Field
• Unique Intersection of Mathematics, Domain Expertise, and Computer Science
• Goal is to create actionable insights from Data Sources
• All elements together create new value
• Great opportunity for Network Engineers to use domain expertise to drive outcomes and use cases
[Figure: Venn diagram of Computer Science, Mathematics, and Domain Expertise]
What is Network Analytics?
Empower Engineers and Customers to drive customer outcomes, and add Value through Secure Data Access and a Highly Interactive Analytics Platform
• What is happening? (visibility)
• Why is something happening? (root cause and correlation)
• What will happen next? (prediction)
• How can I optimize current state?
• Custom Reporting
[Figure: Data → Information → Knowledge, supported by Data Visibility and Data Interactivity]
Topology & Capacity Estimation Building Blocks
Network Abstraction
• An abstracted network model examines how traffic enters, traverses and exits the network according to
• The topology of the network
• The operational state of network elements
• The protocols in the network (example – MPLS-RSVP)
• Examine the impact based on the amount of traffic
• This is the foundation of a Predictive Model because you can simulate countless “What if…” scenarios
Traffic Matrix Concept
• Traffic demands define the amount of data transmitted between each pair of network nodes
• Typically per Class
• Typically peak traffic or a very high percentile
• Measured, anticipated, or estimated/deduced
• A network's traffic matrix is a list of demands
• The traffic matrix has two functions
• Indicate why a network’s traffic distribution looks the way it looks
• Help predict what would happen in the network if something were to change (topology/traffic)
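A minimal sketch of a traffic matrix as a list of demands, in Python. The node names echo the ER routers in the slide's figure; the traffic classes and Mb/s values are invented purely for illustration.

```python
from collections import defaultdict

# Hypothetical demands: (source, destination, class, peak Mb/s). Values are made up.
demands = [
    ("ER1", "ER4", "VOICE", 120.0),
    ("ER1", "ER4", "DATA", 900.0),
    ("ER2", "ER6", "DATA", 450.0),
    ("ER4", "ER1", "DATA", 300.0),
]

def build_traffic_matrix(demand_list):
    """Sum demands of all classes into a source -> destination -> Mb/s mapping."""
    matrix = defaultdict(lambda: defaultdict(float))
    for src, dst, _cls, mbps in demand_list:
        matrix[src][dst] += mbps
    return matrix

traffic_matrix = build_traffic_matrix(demands)
print(traffic_matrix["ER1"]["ER4"])  # total ER1 -> ER4 demand across classes
```

Keeping the per-class list and deriving the matrix from it preserves the "typically per class" detail while still giving one number per node pair.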
[Figure: example topology with edge routers ER1-ER6 and core routers CR1-CR6, with a Demand marked]
Demand Deduction
• When the traffic matrix and network topology are known
• You can determine how demands (elements of the traffic matrix) will be routed through the network
• But you still need to know how much traffic the demand represents
• Collecting measurements on interfaces, LSPs, and other paths is straightforward
• But mapping this information to create a reliable traffic matrix is not.
• Using various statistics (interface, node, LSP and flow data) you can create a reliable and accurate traffic matrix with a process called Demand Deduction
• Estimates traffic that would produce a set of measurements near observed levels
• Demand Deduction can continue to refine its results by applying successive regression analyses based on a variety of secondary sources of measurement data
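The idea of estimating demands that reproduce the observed measurements can be illustrated with a toy non-negative least-squares fit. This is only a sketch under stated assumptions: the routing matrix, the measurements, and the function name `deduce_demands` are hypothetical, and the projected gradient method shown here is a stand-in, not the algorithm any particular planning tool uses.

```python
def deduce_demands(routing, measured, steps=2000, rate=0.2):
    """Projected gradient descent for min ||routing * x - measured||^2 with x >= 0.

    routing[l][d] is 1 if demand d crosses link l, else 0 (known from topology).
    `rate` must be small enough for the given routing matrix; 0.2 suits this toy case.
    """
    n_links, n_demands = len(routing), len(routing[0])
    x = [0.0] * n_demands
    for _ in range(steps):
        # Residual per link: simulated load minus measured load.
        resid = [sum(routing[l][d] * x[d] for d in range(n_demands)) - measured[l]
                 for l in range(n_links)]
        # Gradient step, then clip at zero because demands cannot be negative.
        x = [max(0.0, x[d] - rate * sum(routing[l][d] * resid[l] for l in range(n_links)))
             for d in range(n_demands)]
    return x

# Hypothetical example: 4 measured links, 3 unknown demands.
routing = [[1, 1, 0],   # link 1 carries demands 0 and 1
           [0, 1, 1],   # link 2 carries demands 1 and 2
           [1, 0, 0],   # link 3 carries demand 0
           [0, 0, 1]]   # link 4 carries demand 2
measured = [150.0, 250.0, 100.0, 200.0]  # Mb/s observed on each link
estimated = deduce_demands(routing, measured)
```

In real networks there are usually far more demands than measurements, which is why secondary data sources (LSP, flow, node statistics) are needed to narrow the estimate.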
Assessing Risk
• By simulating failures, you can examine
• Where traffic will go (and what impact this traffic will have)
• By simulating failures over a set of objects, you can examine risk network-wide. This includes
• The impact a failure will have
• The worst-case utilization an interface will have
• Example - Examine a set of circuit failures (one-by-one)
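The one-by-one circuit failure analysis can be sketched as follows. The four-node ring, metrics, capacities, and demands are hypothetical, and routing is simplified to a single shortest path per demand (no ECMP, no TE tunnels), so this is an illustration of the idea rather than a full simulator.

```python
import heapq

# Hypothetical four-node ring: circuit -> (IGP metric, capacity in Mb/s).
circuits = {("A", "B"): (10, 1000), ("B", "C"): (10, 1000),
            ("A", "D"): (10, 1000), ("D", "C"): (15, 1000)}
demands = [("A", "C", 600), ("D", "C", 300)]  # (source, destination, Mb/s)

def shortest_path(up, src, dst):
    """Dijkstra over the circuits that are up; returns a list of directed hops or None."""
    adj = {}
    for (a, b), (metric, _cap) in up.items():
        adj.setdefault(a, []).append((b, metric))
        adj.setdefault(b, []).append((a, metric))
    heap, seen = [(0, src, [])], set()
    while heap:
        cost, node, hops = heapq.heappop(heap)
        if node == dst:
            return hops
        if node in seen:
            continue
        seen.add(node)
        for nxt, metric in adj.get(node, []):
            if nxt not in seen:
                heapq.heappush(heap, (cost + metric, nxt, hops + [(node, nxt)]))
    return None

def utilization(up):
    """Route every demand and return utilization per directed interface."""
    load = {}
    for src, dst, mbps in demands:
        for hop in shortest_path(up, src, dst) or []:
            load[hop] = load.get(hop, 0) + mbps
    cap = {}
    for (a, b), (_metric, c) in circuits.items():
        cap[(a, b)] = cap[(b, a)] = c
    return {hop: mbps / cap[hop] for hop, mbps in load.items()}

# Fail circuits one by one and keep the worst utilization seen on each interface.
worst = {}
for failed in circuits:
    up = {c: v for c, v in circuits.items() if c != failed}
    for hop, util in utilization(up).items():
        if util > worst.get(hop, (0.0, None))[0]:
            worst[hop] = (util, failed)

for hop, (util, failed) in worst.items():
    print(hop, f"{util:.0%}", "worst case when", failed, "fails")
```

Each entry in `worst` pairs an interface with its worst-case utilization and the circuit failure that causes it, which is the same pairing the Worst-Case and Failure Impact results report.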
Worst-Case Result
The Worst-Case impact of any network failure on the Interface from Site A to Site C is 86% utilization.
Failure Impact Result
The Failure Impact of the circuit between Site A and Site C is a spike to 91% Utilization between D and B
Measuring a Traffic Matrix
NetFlow Approach: Needed for Inter-AS
• NetFlow v5
• Older format, not used much any more
• Resource intensive for collection and processing
• Non-trivial to convert to Traffic Matrix
• NetFlow v9 and IPFIX
• BGP NextHop Aggregation provides almost direct measurement of Traffic Matrix
• Class of Service Aware
• Both are becoming more prevalent
• Inaccuracies
• Stats can clip at crucial times
• NetFlow and SNMP timescale mismatch
• Over time, minor variations will balance out
• Segment Routing introduces its own Traffic Matrix
Example NetFlow Router Configuration (Flexible NetFlow syntax, as used on IOS / IOS XE)
!
! Flow record: key fields that define one traffic-matrix entry
flow record traffic-matrix-record
 match routing destination as
 match routing source as peer
 match routing next-hop address ipv4 bgp
 match ipv4 dscp
 match interface input
 collect ipv4 tos
 collect interface output
 collect counter bytes long
!
! Exporter: where flow records are sent
flow exporter MATE
 description * * * TO MATE COLLECTOR * * *
 destination 192.168.239.33
 source Loopback0
 transport udp 2100
!
! Monitor: binds the record to the exporter
flow monitor traffic-matrix-monitor
 exporter MATE
 cache entries 20
 record traffic-matrix-record
!
! Apply the monitor to ingress traffic on the interface
interface TenGigabitEthernet0/0/0
 ip flow monitor traffic-matrix-monitor input
!
Modeling Methodology and Topology Analysis
Creating the Model
• What steps are involved in collection?
• Network discovery – discover the topology from a single router's link-state table
• Must be able to support multi-area (OSPF) / multi-level (IS-IS)
• Network Polling – collect traffic stats
• Sounds scary, but only if you don’t know what you’re looking for.
• By polling specific MIBs and OIDs, you’re not doing an exhaustive interrogation, but a direct Q&A
• Collect traffic stats on BGP peers in this step as well
• Flow data collection – gather Netflow data
• Simulation and analysis tasks – combine the topology and traffic information and stitch them together into a single model
• Archive insert – historical record of data over time
Modeling Philosophy & Principles
• Do not assume anything!
• Ensure Traffic profiles are accurate
• Ensure Metrics are accurate
• Ensure Tunnel Policies are understood
• Have a good understanding of End to End Architecture
• Remember it is a simulated environment, never 100% accurate
• If all else fails, don’t forget about The Principles of Routing and Switching
Modeling Philosophy & Principles
1. Build the model from configuration and topology data
2. Start in Steady State with Base Assumptions
3. Observe and document congested links and latency impacts
4. What capacity is required to keep utilization at 45% and worst-case utilization at 99% or lower? (no oversubscription)
5. Analyze, Validate, Recommend
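The capacity question in step 4 reduces to simple arithmetic once steady-state and worst-case loads are known from simulation. The sketch below assumes the 45% target applies to steady state and the 99% target to the worst single failure; the function names, the example loads, and the 10 Gb/s link size are hypothetical.

```python
import math

def required_capacity_gbps(steady_gbps, worst_case_gbps,
                           steady_target=0.45, worst_target=0.99):
    """Smallest capacity that satisfies both utilization targets."""
    return max(steady_gbps / steady_target, worst_case_gbps / worst_target)

def links_needed(capacity_gbps, link_size_gbps=10):
    """Round the required capacity up to whole links of an assumed size."""
    return math.ceil(capacity_gbps / link_size_gbps)

# Hypothetical loads: 6 Gb/s in steady state, 14 Gb/s under the worst single failure.
needed = required_capacity_gbps(6.0, 14.0)
print(round(needed, 2), "Gb/s ->", links_needed(needed), "x 10GE")
```

Whichever target produces the larger figure drives the build; here the worst-case constraint dominates.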
Creating a Plan File Layout
• Invest time in geographically mapping Sites to Latitude/Longitude
• Geographical Mapping is key for:
• Filtering
• Classifying
• Better Visuals
• Will help later with Analytics layouts for creative graphics
Network Topology of ACME (Example)
Description:
• Global Provider of Voice and Data Solutions
• Points of Presence in Americas, Europe and Africa.
• Mixture of 10GE and 1GE interfaces
• Traffic Matrix for Voice and Data
Goal is to Identify:
• Single Points of Failure
• Capacity Hot Spots
• QoS Analysis for VOICE Class
• Identify Suboptimal and High Latency Routing
• Load Balancing Optimization
• Topology Potential Improvements including circuit Upgrade Recommendations
Modeling Objective
Single Points of Failure
Observations:
• Three Locations highlighted with red arrows are singly homed in the Topology: San Francisco, Toronto, and Warsaw
• In failure mode, Traffic Loss can be substantial
Analysis:
• Dual homing these sites is highly recommended
• Estimated Traffic Lost is 300 Mb/s of VOICE and Data based on today’s data
Single Failures
Summary Dashboard
Capacity Cold Spots -- Normal Operation
Observations:
• Some Links not being utilized effectively resulting in “Cold Spots”
• 0% utilization between Lima and Bogota, Rome and Tunis, Cairo and Johannesburg
Analysis:
• Metrics should be analyzed and tweaked to ensure better use of bandwidth
• Area Designations should be checked as well
• Each location should be checked separately
Under Utilized Link
Summary Dashboard
NO TRAFFIC!
High Latency Paths
Observations:
• High Latency Paths exist due to Intercontinental Load Balancing between Paris and Santiago and between Boston and Santiago
• This is under normal conditions
Analysis:
• This will cause high latency and a high potential for jitter, which will likely be unacceptable for voice and other real-time traffic
• Consider Metric Tuning/ TE
High Latency Paths
Summary Dashboard
Note: “A” indicates the demand source node and “Z” indicates the destination
ECMP Path Coverage & Cold Spots
Observations:
• For Intra EMEA Traffic, we see limited Coverage of ECMP Routing under normal conditions within Theatre—even though the paths exist
• Most North-South Traffic is using the Rome to Casablanca Circuit while there is no traffic on the ROME to Tunis and ROME to Cairo links.
Analysis:
• This is due to the metric and circuit design in place today.
• This results in little load balancing
• Consider an in-region model vs. a global model for load balancing, plus a metric change (proposed on next slide)
• MPLS TE is an option as well
Regional ECMP
Summary Dashboard
[Figure: topology map with demand source and destination nodes labeled; unused links marked “NO TRAFFIC!”]
ECMP Path Coverage
Outcome:
• By modifying the metrics from Rome to Tunis and Rome to Cairo we can achieve better load sharing (using What If Scenario)
Analysis:
• A few small metric changes result in better load balancing and better overall utilization
• Before/ After Traffic Reports provided
• MPLS TE is an option as well for other services that require low latency such as voice--Deep Dive Required
Regional ECMP
Potential Solution
[Figure: the same topology map after the modified metric and re-simulation, with demand source and destination nodes labeled]
QoS Violation Hotspots – VOICE Degradation
Observations:
• VOICE Traffic is Dropped under a single failure from DFW to Minneapolis.
• There are other failure cases as well that will be included.
Analysis:
• A single link failure will allow for Data to survive but the QOS policy for VOICE will be violated per the configuration.
• Building a redundant link or increasing queues is highly recommended
• Place, Model and Iterate for next session.
Failure Impact
Summary Dashboard
DATA SERVICE IMPACT (Acceptable)
VOICE SERVICE IMPACT (Unacceptable)
Design Case Study
Case Background
• Assumptions:
• Customer is planning to build out a Next Gen Backbone
• They Plan to follow Cisco Best Practices for Design (IGP, BFD, Tuning, etc.)
• They have an estimated Traffic Matrix & Rough Topology
• Percentages based on Real Data and Projected Data (80% local 20% remote)
• Partial Mesh Topology
• Challenges:
• How much port and link capacity do they need now and in future?
• Is the topology optimal for the projected Demands?
• Where best to put the Data Centers? (Centralized vs. Distributed Model)
• What is the best topology for the DC connectivity ?
• Is Single Chassis or Multi Chassis solution better?
• Should I use Single Area vs. Multiple Areas (OSPF)?
• What other observations can be made?
• Validate Metric Choices
• Ultimately we need a List of Circuit Capacities to build!
• 9 Tier 1 Sites, Approximately 50 Tier 2 Sites
• Demand Types
• Internet Traffic
• Local Peering / Public Content Data
• Enterprise VPN Traffic
• Used Circuit Latency Data from Customer
Outcomes and Observations
• A high number of Hops between Source and Destination is very expensive
• Analyze/Minimize the number of Hops between Source and Destination (hops script)
• 1 Hop Rule: Data Center & Internet Centralization w/ Hub Spoke connectivity turned out to be best topology
• Cross Links may not be always utilized due to higher metric path cost (may be planned for Polarization, but is a side effect that needs to be understood)
• Too Much Redundancy: in some cases the redundant links are used only under triple failures; this may be too much redundancy for the cost.
• Multi Chassis Options can save 8-13% of Circuit Capacity
• Push and Pull of Circuits is interesting study (looking at traffic each way)
• Our Modeling was able to save Substantial Amount of Port Capacity Based on our Findings (~20-30%)
Streaming Telemetry
Traditional Monitoring Is Showing Its Age
Not suited for Cloud-Scale Network Operations
• Where data is created (sensing & measurement): SNMP, CLI, Syslog on each device
• Where data is useful (storage & analysis): SNMP server, Syslog collector, scripts
• Not real time
• Strong burden on back-end: normalize different encodings, transports, data models, timestamps
Telemetry is a Game Changer
Network monitoring becomes a big data problem
• Push paradigm
• One consistent way to access Statistics, Oper state & Events @ all layers
• High Performance: 10 sec
• Multiple encodings & Transport
• Removing limitations and complexity
• Real time
Big Data Challenge
• Volume – Scale of Data
• Velocity – Analysis of Streaming Data
• Variety – Different Forms of Data
Real-Time Use Cases
• Network Health
• Troubleshooting / Remediation
• SLAs, Performance Tuning
• Capacity Planning
• Security
Trends
• Centralized / Software-defined
• Speed
• Scale
Why This Matters Now
Capabilities
Telemetry: Different Things to Different People
[Figure: telemetry sources plotted by resolution (frequency of data collection: On-Change, <= 1 sec, ~10s sec, ~minutes-hours) and by layer (System / Control plane vs. Hardware / Data plane), split into Pull and Push. Sources shown: CLI, XML, SNMP, syslog, traps, BMP, NetFlow, sFlow, MDT (initial focus), direct ASIC stats. Use cases shown: Microburst Detection, Traffic Engineering, Capacity Planning, Troubleshooting, Network Health.]
Different Collection Models
[Figure: collection stack options: BYO (custom); open source, customizable; commercial stack (proprietary or OS-based). Components shown: Kafka, Logstash, ElasticSearch, Kibana, Panda, Prometheus / InfluxDB, Grafana.]
Pipeline: An Open Source Collector
• Ingest, transform, filter
• Output to file, TSDB, Kafka…
• Self-monitoring, horizontally scalable
[Figure components shown: Pipeline, Kapacitor]
Analytics Case Studies and Visualizations
What is Data Visualization?
• Creating a visual representation of the data model
• The objective is to precisely communicate the information in a meaningful way to the end user
• Often there is a lot of information in a limited space
https://public.tableau.com/s/gallery
Plan Files Format
Plan files are binary, but can be converted to tabular text files (tab-separated)
• Usage: mate_sql [-file <file>] [-out-file <file>] [-sql <SQL statement>]
• Executes the SQL statement (or sequence of statements). The database is optionally initialized with the contents of a tab file <file>. If a query is performed, the results are collected into a table and printed to the standard output or a file. If an output file is specified, the database is exported to the specified tab file after the SQL statements are executed.
• table_extract
• Extracts a table or a set of tables from a file. Optionally fills in simulation and worst case values if the input file is a plan file.
Which Raw Data Do We Start With?
• Example Syslog Data
• Aggregated NetFlow Data
• Example JSON Device Data
Telemetry Basic Concept: Encoding
Encoding (or “serialization”) translates data (objects, state) into a format that can be transmitted across the network. When the receiver decodes (“de-serializes”) the data, it has a semantically identical copy of the original data.
{"timest":1510249270601,"content":[{"timest":1510249270601,"content":{"timest":1510249270601,"message-id":8564}},{"timest":1510249270601,"content":[{"timest":1510249270601,"card-type":"RP"},{"timest":1510249270601,"node-name":"0/RP0/CPU0"},{"timest":1510249270601,"time-stamp":1510249270000},{"timest":1510249270601,"time-of-day":"Nov 9 17:41:10.596 : "},{"timest":1510249270601,"time-zone":"UTC"},{"timest":1510249270601,"process-name":"pam_manager"},{"timest":1510249270601,"category":""},{"timest":1510249270601
DATA → “Encode” → “Decode” → DATA
Common Text-Based Encodings
• JSON
• XML
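A minimal encode/decode round trip with JSON, using Python's standard `json` module. The field names are borrowed from the sample telemetry message on this slide; the flat dictionary is a simplification and not the actual message layout.

```python
import json

# Field names borrowed from the sample message; the flat structure is a simplification.
state = {"node-name": "0/RP0/CPU0", "card-type": "RP",
         "time-zone": "UTC", "process-name": "pam_manager"}

encoded = json.dumps(state).encode("utf-8")    # "Encode": object -> bytes on the wire
decoded = json.loads(encoded.decode("utf-8"))  # "Decode": bytes -> object at the receiver

print(decoded == state)  # the receiver holds a semantically identical copy
```

Text encodings such as JSON are easy to inspect, at the cost of larger messages than a binary encoding would produce.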
What is a Data Model and why is this important?
• Provides more meaningful and standardized representation of underlying data
• Make it easy to share knowledge
• Make it easy to integrate with other tools (visualization tools, etc)
• Common naming conventions
• Very important when combining multiple data sources
• Think about how you can pass variable names to indexes or search terms such as hostname, router name, IP address
Example data model fields: Circuit Name, Node Name, Vendor Id, Capacity, Host Name, Interface Cost, Node Cost, Location
Reactive: Root Cause Analytics
Stakeholders: Support Teams
• Situation
• Something “bad” happened to a device, but why? What were the leading indicators prior to the failure? Typically Cisco has to call the customer for more data.
• Solution
• Cisco uses a data-driven approach to identify syslog messages, field notices, PSIRTs, and best practice alerts prior to the event using our RCA app.
• Outcome
• Suggestions of potential root causes around and leading up to the time of failure, which reduces MTTR. Saves time when creating an RFO and getting the customer network restored.
High priority syslog messages, HW/SW version changes, last reset, PSIRT, Field Notices, EOX, best practices, vulnerabilities in the timeframe desired
Proactive: Topology Analytics
Identifying Outlier Metrics
Metric=10
Asymmetry may lead to Potential Incorrect Routing
And can help reduce OPEX
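A sketch of one way to flag asymmetric metrics: compare the metric configured in each direction of a circuit. The node names come from the ACME example earlier in the deck; the metric values and the function name `asymmetric_links` are hypothetical.

```python
# Hypothetical directed IGP metrics: (from_node, to_node) -> metric.
metrics = {("Rome", "Tunis"): 10, ("Tunis", "Rome"): 100,
           ("Rome", "Cairo"): 10, ("Cairo", "Rome"): 10}

def asymmetric_links(directed_metrics):
    """Return circuits whose two directions carry different metrics."""
    findings = []
    for (a, b), metric in directed_metrics.items():
        reverse = directed_metrics.get((b, a))
        # `a < b` reports each circuit once instead of once per direction.
        if a < b and reverse is not None and reverse != metric:
            findings.append((a, b, metric, reverse))
    return findings

print(asymmetric_links(metrics))
```

Some asymmetry is deliberate, so the output is a review list rather than a list of errors.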
Proactive Engineer: What Changed?
Stakeholder: Cisco and Customer Team
• Situation
• What changed since yesterday?
• What should I notify my customer about?
• Solution
• AS team created “good morning” and “KPI” apps that show data trends around faults in great detail and highlight KPIs that need attention right away, with a recommended course of action
• Outcome
• Better NCE visibility into the data in a more meaningful way
• By changing config registers, standardizing code versions, and implementing best practices, we prevented an outage
Proactive: Anomaly Detection
• A deviation in expected behavior (time series sensitive)
• Worst case utilization shown as a “Spike” in normal pattern
• Alerts can be generated for each anomaly detected
• Standard deviations can be tweaked
• Could also be:
• Utilization spike
• Number of syslogs
• Number of reboots
• QoS policy or config Change
Detect numeric outliers and find values that differ significantly from
previous values.
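A minimal sketch of standard-deviation-based outlier detection on a time series. It compares each sample with the mean and standard deviation of a trailing window; the window size, the threshold of 3 standard deviations, and the utilization samples are assumptions chosen for illustration.

```python
import statistics

def find_anomalies(series, window=6, threshold=3.0):
    """Flag indices whose value lies more than `threshold` standard deviations
    from the mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)
        if stdev > 0 and abs(series[i] - mean) > threshold * stdev:
            anomalies.append(i)
    return anomalies

# Hypothetical interface utilization samples (%), with one spike.
utilization = [41, 43, 42, 40, 44, 42, 43, 41, 86, 42, 43]
print(find_anomalies(utilization))
```

Lowering `threshold` raises sensitivity and the alert volume; this is the "standard deviations can be tweaked" knob. The same function applies to syslog counts or reboot counts per interval.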
Trend Forecasting
• Forecasting : estimation of future values
• Trend: A pattern of events
• Great for Traffic Prediction for capacity planning
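As an illustration of trend forecasting for capacity planning, the sketch below fits a straight line to past peak traffic by ordinary least squares and extrapolates it. A linear trend is an assumption made for simplicity (real traffic often shows seasonality and non-linear growth), and the monthly values are invented.

```python
def linear_forecast(values, periods_ahead):
    """Fit y = a + b*t by ordinary least squares and extrapolate."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
             / sum((t - t_mean) ** 2 for t in range(n)))
    intercept = y_mean - slope * t_mean
    return intercept + slope * (n - 1 + periods_ahead)

# Hypothetical monthly peak traffic in Gb/s.
monthly_peak_gbps = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5]
print(linear_forecast(monthly_peak_gbps, 6))  # projected peak six months out
```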
Customer Peering Dashboard
• Per Customer Traffic Matrix
• Total Traffic To/From Customer based on Ingress and Egress PE nodes
• Local/Regional/National Traffic percentages
• Input/Output traffic ratio to detect imbalances in traffic flow or peering contract exceptions
Prospect Analytics Dashboard
“BGP AS Path Analytics”
• Who are my top Destinations?
• Who are my Top Potential Peers?
• Who are my Top source ASNs?
• Business Opportunities: new customer targets, new peering targets
P95 Traffic Analytics
• The 95th percentile is a good number to use for planning so you can ensure you have the needed bandwidth at least 95% of the time.
• http://www.init7.net/en/backbone/95-percent-rule
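A sketch of a 95th percentile calculation using the nearest-rank method: sort the samples, discard the top 5%, and take the highest remaining value. Other percentile definitions interpolate between samples and give slightly different results; the sample values are hypothetical.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]

# Hypothetical 5-minute utilization samples in Mb/s: steady traffic plus one burst.
samples = [100] * 19 + [900]
print(p95(samples))  # the single burst falls in the discarded top 5%
```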
Adding Cost Data to Model
Network OPEX Analytics Dashboard
Bit Mile Analytics
• Simple Bit-Mile Costing
model establishes a network
component price per bit
mile.
• The end result would be a
$/bit mile cost number.
• Then gross traffic analysis
would be applied to see
what a customer actually
uses of the network bit mile
resources.
• This can help for
determining customer
profitability and cost
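The bit-mile arithmetic can be sketched as follows. Units are scaled to Gb/s-miles for readability, and the circuit costs, capacities, lengths, and customer paths are all hypothetical.

```python
# Hypothetical circuits: (monthly cost in $, capacity in Gb/s, length in miles).
circuits = [(21000.0, 10.0, 500.0), (14000.0, 10.0, 200.0)]

def cost_per_gbit_mile(circuit_list):
    """Network-wide monthly cost divided by total capacity-miles."""
    total_cost = sum(cost for cost, _cap, _miles in circuit_list)
    total_gbit_miles = sum(cap * miles for _cost, cap, miles in circuit_list)
    return total_cost / total_gbit_miles

def customer_cost(paths, unit_cost):
    """paths: list of (Gb/s carried, miles traversed) for one customer's traffic."""
    return unit_cost * sum(gbps * miles for gbps, miles in paths)

unit = cost_per_gbit_mile(circuits)
# A customer sending 2 Gb/s over the 500-mile circuit and 1 Gb/s over the 200-mile one.
print(unit, customer_cost([(2.0, 500.0), (1.0, 200.0)], unit))
```

Comparing the resulting cost with what the customer pays gives a first-order view of profitability per customer.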
BGP Monitoring Protocol (BMP)
• BGP RIB Telemetry over TCP to a BMP monitoring station
• Previously only available through screen scraping
• BMP Message Types: Route Monitoring (RM), Peer Up/Down Notifications, Statistics Reports (SR), Initiation, Termination of sessions
• Supported in XR and IOS XE
RFC 7854
https://www.openbmp.org/#!docs/EXAMPLES.md
Network Dashboard
• Quick View of the last 180 days
• Determine which hosts are missing
• Identify Outliers leveraging Host/Syslog Trend
• Identify Network Issues through Syslog Severity Heat map
• Alerting for Anomalies
• Severity Heat map
Understanding “Steady State”, Creating a Baseline
Syslog Trend by Host
• Identify which Syslogs and Hosts have largest potential problems
• Top Syslogs by # of Hosts
• Drill-down to get host heat map
• Sparkline to show quick glance trend
• Possible to correlate to capacity Issues
Where to focus “Level 3 Support”
Rare Syslog Trend by Host
• Identify Rare Syslogs and Hosts based on host impact
• Top Syslogs by # of Hosts
• Drill-down to get host heatmap
Finding “The Needle in The Haystack”
Event Driven Telemetry
• Situation:
• When problems occur in the network, it is difficult to pinpoint the source of instability. Is there a certain router that is most unstable?
• Solution:
• Dashboard based on Event-based Telemetry that shows in real time (10 sec) how often routes are changing, by node and geolocation
• Outcome:
• Faster pinpointing of problem spots
• Correlation with logs to increase uptime and focus for support
Event Driven Telemetry
• Situation:
• Need finer granularity: which specific processes are most problematic?
• Solution:
• Dashboard based on event-driven telemetry that shows in real time (10 sec) which system processes are most problematic, by node
• Outcome:
• Faster pinpointing of problem spots
• Correlation with logs to increase uptime and focus for support
Possible use cases for ML in telemetry
Device Health (CPU/Mem/Fan/Power)
• Unsupervised ML: find the boundaries of the parameters under normal conditions
• Supervised ML:
- Classification: reactions to different events (like new neighbors, etc.)
- Regression: study reactions to continuous changes (temperature, etc.), for better budget planning, etc.
Link Stability
• Unsupervised ML: find the current level of balancing in LAG/ECMP, queue levels
• Supervised ML:
- Classification: reactions to different events (like fiber cuts, optics issues, etc.)
- Regression: study how queue levels change with changes in traffic profile (during failures). This will help with planning and HA design.
Bandwidth / Capacity
• Unsupervised ML: find the current levels of load in the network
• Supervised ML:
- Classification: reactions to adding / removing devices
- Regression: study reactions to changes in traffic. This will help with better planning (e.g. where best to place equipment closer to the customers).
Growth / Development
• Unsupervised ML: find the current levels of load in the network with respect to clients and servers
• Supervised ML:
- Classification: reactions to different events (like new servers, etc.)
- Regression: study reactions to continuous changes (new customers, servers, etc.). This will help with understanding how traffic profile changes within the DC influence the network.
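As a minimal illustration of the regression use cases above, the sketch below fits a straight line to link utilization over time with ordinary least squares and estimates when a planning threshold would be reached. The helper names, the 80% threshold, and the monthly samples are assumptions for the example; real traffic growth is rarely this linear.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def x_at_threshold(slope, intercept, threshold):
    """Return the x value at which the fitted line reaches `threshold`."""
    if slope <= 0:
        return None  # utilization is flat or falling; threshold never reached
    return (threshold - intercept) / slope


# Hypothetical peak utilization (%) of one link, sampled monthly.
months = [0, 1, 2, 3]
utilization = [40.0, 45.0, 50.0, 55.0]
slope, intercept = fit_line(months, utilization)
print(slope, intercept, x_at_threshold(slope, intercept, 80.0))
```

The same fit can be applied to other continuous inputs mentioned above, such as temperature or queue depth, by swapping the series.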
A Glimpse into The Future: WAN Orchestration and SDN
WAN Automation Engine: Delivering Optimization and Automation
[Figure: WAE Cycle]
Predictive Model + Time Series Visibility + Model-Based Control and Configuration = Optimization and Automation
• Predictive Model: modeling; what-if/predictive analysis; global optimization
• Time Series Visibility: assess historical and real-time data; find and manage hot spots; network efficiency analysis
• Model-Based Control and Configuration: programmatic network control; extensible, open data models
• Optimization and Automation: real-time traffic balancing; intelligent bandwidth scheduling; automated service delivery
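The what-if analysis named on this slide can be illustrated with a toy failure simulation: route each demand on its lowest-metric path, fail a link, and compare link loads. This is only a sketch under simplifying assumptions (directed links, a single shortest path per demand, no ECMP or traffic engineering); it is not how WAE itself computes its models, and all names and numbers are hypothetical.

```python
import heapq


def shortest_path(links, src, dst):
    """Dijkstra over directed links {(a, b): metric}; returns hops or None."""
    adj = {}
    for (a, b), metric in links.items():
        adj.setdefault(a, []).append((b, metric))
    best = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == dst:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, metric in adj.get(node, []):
            new_cost = cost + metric
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    if dst not in best:
        return None
    path, node = [], dst
    while node != src:
        path.append((prev[node], node))
        node = prev[node]
    return path[::-1]


def link_loads(links, demands, failed=frozenset()):
    """Sum demand volume per surviving link after removing `failed` links."""
    active = {link: m for link, m in links.items() if link not in failed}
    loads = dict.fromkeys(active, 0.0)
    for (src, dst), mbps in demands.items():
        path = shortest_path(active, src, dst)
        if path is None:
            continue  # demand cannot be routed in this failure scenario
        for hop in path:
            loads[hop] += mbps
    return loads


# Toy topology: A-B-C is preferred (metric 20) over the direct A-C link (30).
links = {("A", "B"): 10, ("B", "C"): 10, ("A", "C"): 30}
demands = {("A", "C"): 400.0}
print(link_loads(links, demands))
print(link_loads(links, demands, failed={("A", "B")}))
```

Dividing each load by link capacity would give the utilization figures needed to spot hot spots under each failure scenario.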
Cisco SP NAC High Level Architecture and Applications
Putting it Together
[Figure: Cisco SP NAC layered architecture diagram]
• App Layer: Not Embedded (Cisco AS, BU, 3rd party) and Native Services
• Applications: Plan and Optimization; Inventory & Reporting; ML Service; Intent Optimizer; MOP & Change Automation (Morph); ML Topology & Visualization; KPI Monitoring (Pulse); Conf. Compliance (CCM); Time Series Reporting; BNG Autom. (Nucleos); Cable RPD (Sereno); Fault Dashboard; Software Image Management (SWIM)
• Connection Layer: Multi-layer Configuration Deployment Service (NSO, NSO NED, TL1 NED); Multi-Protocol Data Collection Service (Telemetry, SNMP, BGP-LS, CLI, SNMP Trap, SNMP/CLI polling, Syslog, TL1, EPNM Inventory proxy)
• Infrastructure Layer: Admin Web UI; Admin UI; Microservices Orchestration; Microservices Monitoring; Alerting Service; Messaging: Pub/Sub, Request/response (Kafka, NATS); Logging, Packaging; NoSQL DB (Mongo); TSDB (Influx); Node DB; API GW
Network Analytics Drive Customer Outcomes
• Traffic modeling has matured
• Predictive Analytics and Simulation
• Rise of Open Source Software
• SDN controllers have become network application development platforms
• Focus on Applications and APIs (Java, REST)
• Resource elasticity and virtualization enabled by NFV; Bandwidth on Demand
• Interoperable network programming and collection standards arriving
• Finally doing something with all of that collected network data!
• Positive customer experience
• Deeper network visibility
• Cost savings
• Optimization opportunities
• Automation, Correlation
Topology Analytics
Capacity and Performance Analytics
Fault Analytics
Device Analytics
Thank you