Rapid Cause Analysis, Key First Indicators™ and Tiered Service Assurance
www.doradosoftware.com
October 2012
Dorado Software
This document explores in detail Dorado Software’s unique
approach to rapid cause analysis for today’s emerging and
growing NGN+ services, using its Key First
Indicators™ and Tiered Assurance methodology.
Dorado Software Whitepaper: Rapid Cause Analysis, Key First Indicators™ and Tiered Service Assurance October 2012 1
Rapid Cause Analysis, Key First Indicators™ and Tiered Service Assurance
Getting Practical about Service Assurance in a Fast Moving World
The evolution of smartphones, tablets, broadband technologies, virtualized processing and distributed
applications has allowed businesses, entrepreneurs, enterprises and individuals to rethink how they use
technology to simplify and enhance their endeavors.
These New Services leverage both private and public network capabilities, and create substantial
business opportunities for service providers. These New Services also present significant challenges as
providers need to meet the demands of these new customer business models and the unpredictability
of the network loads they create in an increasingly competitive telecommunication market.
To the service provider’s operational staff, one of the most significant challenges is the much
shorter service delivery lifecycle and the multi-organizational, distributed nature of these services.
Operational processes will need to address the following realities of this new market in order to make
operational support a differentiating factor for their organizations.
For instance:
New services are distributed, potentially involving multiple servers, multiple networks,
multiple organizations and even consumer owned devices. Each element in the delivery
chain offers another point of potential degradation. Each presents connectivity and
authorization issues. And, each may represent an additional party of interest that will
want visibility to the service’s performance.
The promise of New Services is dependent on near real-time operations. Whether it is
immediate download of an application, bandwidth on demand, or the utilization of new
virtual capabilities, the expectation is that these processes will be automated. As
increasingly mission-critical services are delivered over these distributed ecosystems, the
need for immediate performance visibility will become imperative.
Performance of these new services will require visibility into the specific
service/application network behavior not just overall network throughput. In addition,
the relationship with other applications on networks or servers will become critical
areas of potential performance contention. Finally, the nature of network and
processing load generated by these new services can be highly unpredictable and
require more proactive assurance processes.
Given the open nature and short delivery cycles of these New Services,
specific operational best practices will be inherently immature and nonstandard.
Further, the distributed nature of the service will make direct assurance processes hard
to administer. Collaborative processes across diverse operational communities will be
required.
Finally, all these new operational issues will have to be addressed in a fraction of the
time that traditional assurance processes were implemented.
Assurance has evolved from the NOC management of dedicated network component failures and
degradation to end customer awareness of specific application performance in a multi-tenancy
virtualized environment. Traditional support solutions were not designed for these conditions;
therefore, a new assurance methodology is required.
Traditional Assurance Solutions Inhibitors
Many of the most prevalent assurance solutions are built on architectures that are more
than 15 years old and often are loosely integrated suites of products. 15 years ago
there were only 18,000 websites on the internet and no such thing as smart phones or a
virtual machine! These dated architectures make significant feature changes expensive
and time consuming to implement. Scaling can also be an issue.
Carrier grade assurance systems manage network devices or EMSs for traditional service
provider networks mostly with a NOC (network operations center) perspective. The
focus is generally network performance specific with minimal visibility to the underlying
application traffic. These architectures are often not well suited to manage equipment
outside the provider domain or for correlating usage information from probes
(IPTV/VOIP) or optimization appliances.
Solution development, production installation and ongoing improvements are generally
bottom-up processes focused on the capabilities of new network devices and the
availability of fault and performance metrics from an element level. These processes can
be extremely data intensive. Without targeted efforts to isolate key data the results can
increase the amounts of data collection exponentially. Often multiple systems may
collect data from these managed devices.
Advanced correlation and root cause determination capabilities are then engineered
over the managed data and devices. These analytics are often the automation of
operational experience either of the solution vendor or the service provider themselves,
acquired from extensive experience managing similar networks and services. This level
of specific experience is unlikely to exist for New Services.
Assurance systems are generally funded and implemented within a specific
organization, and most of the visibility and capabilities are used within that organization,
a pattern otherwise known as ‘the silo effect’. Limited access by other organizations or systems is
sometimes provided, but this visibility is rarely real time and almost never collaborative.
Cross organizational development of value added capabilities is limited, time consuming
and often very bureaucratic.
Most assurance systems create a northbound data collection relationship with the
managed resources. Southbound transactions are generally confined to the solicitation
of data from these devices. Closed loop corrective actions associated with the assurance
indicators are very rare.
These restrictions in current assurance systems make them unsuitable to deal with the fundamental
requirements of New Services: short delivery times, resources and operational groups from multiple
organizations, and limited operational experience. A new assurance process must evolve to meet these
challenges.
A modern service assurance methodology must give each functional group immediate visibility to the
most relevant operational data and a method to collaborate with other groups on operational best
practices.
Key Components of Modern Service Assurance Methodology
The first requirement is to have a flexible architecture that allows the collection and distribution of
performance information. This capability must facilitate the collection of metrics and management of
diverse service resources.
The solution must also have the ability to securely expose these operational metrics across varied
operational communities.
And finally the solution must enable these processes across organization boundaries.
Secure Collaboration Platform
Operational collaboration enables all the interested parties in the distributed service delivery ecosystem
to have secure access to the data and tools necessary to maximize the value from their New Services. A
collaborative platform must provide integrated capabilities for users to interact and intuitively share
experience and data. An integrated knowledge base will help capture operational experience for future
reference and potential resolution automation.
Comprehensive Resource Management
Comprehensive resource control is the cornerstone of any assurance system. The ability to extend
resource management beyond traditional provider boundaries is the key to New Service assurance. Bi-
directional control is essential to provide both assurance escalation processes and to automate closed
loop corrective processes.
For a more detailed discussion on how to use a collaborative environment to manage cloud and next
generation services, please refer to our Using a Community-Based Approach to Simplify Cloud
Management whitepaper and accompanying webinar.
With the metric distribution process in place the next step is to define and process the metrics vital to
the effective management of New Services. Dorado Software’s methodology for this is Key First
Indicators™ (KFI).
Key First Indicators™ Methodology
Dorado’s KFI methodology enables the deployment of differentiated support processes in the time
frames required by New Service business models. It provides real-time visibility of network events
provided by the resource management layer to the distinct operational communities managed by the
collaboration platform.
The foundation of the KFI concept is to focus on three principles:
Identify the 10-20 major indicators a managed resource can provide
Identify the individual indicators required by each operational support persona
Create a reusable infrastructure to quickly associate these indicators to
personas
Dorado Software’s Service Assurance platform provides a number of KFI templates that can be quickly
assigned to specific monitored event data or as summarizations of other portal functions or resource
attributes. Common KFIs would include Thresholds, Availability, Top 10 List and Administrative
(Resolution Tickets).
These KFI structures can be collected from a variety of sources.
KFIs from the originating managed resource are known as native KFIs. They can also be associated with one or
more related managed resources, either physical or logical, where they become associated KFIs.
Managed elements can have associated KFIs from multiple sources. A KFI pool is maintained for each
managed resource.
KFI pools are populated in a number of ways. All devices that produce KFIs have their native KFIs
immediately associated. Managed Objects that are created with defined physical or logical relationship
to other managed objects can inherit specific KFIs. KFIs can be manually assigned to associated managed
objects. In addition, summarization KFIs can allow the propagation of consolidated metrics upward to an
associated managed object. The Top Ten KFI allows a summation of alarms or other measured metrics
that are not managed as KFIs to be propagated as a KFI and provides an immediate drilldown
mechanism. Summary states for a managed object, reflecting current state and operational
experience, can also be propagated.
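The pool mechanics described above can be illustrated with a minimal sketch. It is not Dorado's implementation: the class names, method names, resource ids and indicator names below are all hypothetical, and it covers only native registration, inheritance from a related object and manual assignment.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass(frozen=True)
class KFI:
    name: str    # indicator name, e.g. "latency" (hypothetical)
    source: str  # id of the managed resource that produced the indicator


@dataclass
class ManagedResource:
    resource_id: str
    pool: List[KFI] = field(default_factory=list)  # this resource's KFI pool

    def add_native(self, name: str) -> KFI:
        """Register a KFI produced by this resource itself."""
        kfi = KFI(name=name, source=self.resource_id)
        self.associate(kfi)
        return kfi

    def associate(self, kfi: KFI) -> None:
        """Manually assign a KFI to this resource's pool, skipping duplicates."""
        if kfi not in self.pool:
            self.pool.append(kfi)

    def inherit_from(self, other: "ManagedResource", names: Iterable[str]) -> None:
        """Inherit selected native KFIs from a related managed object."""
        wanted = set(names)
        for kfi in other.native_kfis():
            if kfi.name in wanted:
                self.associate(kfi)

    def native_kfis(self) -> List[KFI]:
        return [k for k in self.pool if k.source == self.resource_id]

    def associated_kfis(self) -> List[KFI]:
        return [k for k in self.pool if k.source != self.resource_id]


# Hypothetical example: a service inherits one KFI from a router it depends on.
router = ManagedResource("router-1")
service = ManagedResource("vpn-service")
router.add_native("latency")
router.add_native("cpu_load")
service.inherit_from(router, ["latency"])
print([k.name for k in service.associated_kfis()])
```

In this sketch a KFI is native or associated purely by comparing its source with the resource that holds it, so the same indicator can sit in several pools without being copied.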
The KFI Portal displays these KFI pools and allows the user to get immediate visibility to the most
operationally significant information for a managed resource. Native KFIs (produced from this resource
itself) are always displayed to indicate current operational levels even when within acceptable range.
Associated KFIs (produced by other resources that impact this resource) are displayed when the
assigned severity level is reached. Display order is determined by severity first and then by KFI ranking
within the native resources. Navigation aids allow intuitive access to additional KFIs.
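The display rules in the preceding paragraph can be sketched as a filter followed by a sort. This is an illustration under stated assumptions only: the four severity levels, the `display_at` field and the sample indicator names are hypothetical and are not taken from the Dorado product.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical severity scale; a lower number means more severe.
SEVERITY_ORDER = {"critical": 0, "major": 1, "minor": 2, "normal": 3}


@dataclass
class PoolEntry:
    name: str
    native: bool                # True if produced by the resource being viewed
    severity: str               # current severity of the indicator
    rank: int                   # KFI ranking within its native resource (1 = highest)
    display_at: str = "major"   # severity at which an associated KFI is shown


def visible_entries(pool: List[PoolEntry]) -> List[PoolEntry]:
    """Native KFIs are always shown; associated KFIs only at their assigned severity or worse."""
    shown = [
        e for e in pool
        if e.native or SEVERITY_ORDER[e.severity] <= SEVERITY_ORDER[e.display_at]
    ]
    # Order by severity first, then by KFI ranking.
    return sorted(shown, key=lambda e: (SEVERITY_ORDER[e.severity], e.rank))


pool = [
    PoolEntry("cpu_load", native=True, severity="normal", rank=2),
    PoolEntry("latency", native=True, severity="major", rank=1),
    PoolEntry("core_link_loss", native=False, severity="critical", rank=1),
    PoolEntry("peer_jitter", native=False, severity="minor", rank=3),
]
print([e.name for e in visible_entries(pool)])
```

Here `peer_jitter` stays hidden because it is an associated KFI that has not reached its display severity, while `cpu_load` is shown even though it is within its acceptable range.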
In this manner probable cause information is effectively propagated up the management infrastructure
from the device, to the service, to a location and to the customers themselves.
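One way to picture this upward propagation is a walk from the originating device through its parents, keeping the most severe probable cause seen at each level. The sketch below is an assumption about how such a roll-up could work, not a description of the product; the hierarchy, the resource names and the severity scale are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical severity scale; a higher number means more severe.
SEVERITY_RANK = {"normal": 0, "minor": 1, "major": 2, "critical": 3}

# Hypothetical hierarchy (child -> parent): device, service, location, customer.
PARENT: Dict[str, Optional[str]] = {
    "router-7": "vpn-service",
    "router-9": "vpn-service",
    "vpn-service": "dallas-site",
    "dallas-site": "customer-acme",
    "customer-acme": None,
}

# Current roll-up per resource: (severity, probable cause).
State = Dict[str, Tuple[str, str]]


def propagate(state: State, origin: str, kfi_name: str, severity: str) -> None:
    """Push a probable cause from the origin up to every ancestor.

    Each level keeps only the most severe cause it has seen so far.
    """
    cause = f"{kfi_name} on {origin}"
    node: Optional[str] = origin
    while node is not None:
        current = state.get(node)
        if current is None or SEVERITY_RANK[severity] > SEVERITY_RANK[current[0]]:
            state[node] = (severity, cause)
        node = PARENT.get(node)


state: State = {}
propagate(state, "router-7", "latency", "major")
propagate(state, "router-9", "packet_loss", "minor")
print(state["customer-acme"])
```

After both events the customer level still reports the latency issue on `router-7`, because the later packet loss event is less severe.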
The KFIs then utilize the resource management and collaborative infrastructures to provide near real-
time operational status to each support group. Each group will have visibility to specific KFIs pertinent to
their operational roles. The delivery style may vary by operational persona: network topology for the
NOC; customer dashboards for customer support; service dashboards for sales and execs.
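The persona-based visibility described above amounts to a lookup from persona to delivery style and to the subset of KFIs that persona should see. The following sketch assumes a simple static mapping. The three delivery styles come from the text; the persona keys and KFI names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Delivery styles as described in the text, keyed by hypothetical persona ids.
DELIVERY_STYLE: Dict[str, str] = {
    "noc": "network topology",
    "customer_support": "customer dashboard",
    "sales_exec": "service dashboard",
}

# Hypothetical assignment of KFIs to personas.
PERSONA_KFIS: Dict[str, Set[str]] = {
    "noc": {"latency", "availability", "threshold"},
    "customer_support": {"availability", "resolution_tickets"},
    "sales_exec": {"availability", "sla_status"},
}


def view_for(persona: str, pool_names: List[str]) -> Tuple[str, List[str]]:
    """Return the delivery style and the KFIs from a pool that this persona may see."""
    allowed = PERSONA_KFIS[persona]
    return DELIVERY_STYLE[persona], [n for n in pool_names if n in allowed]


pool_names = ["latency", "availability", "resolution_tickets", "sla_status"]
print(view_for("customer_support", pool_names))
```

Keeping the mapping separate from the KFI pools is what makes it reusable: adding a new persona means adding one entry to each table rather than touching any managed resource.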
KFI Example
In the following view the operator, while in a service level view, can select a specific router based
on the color coded operational summaries. These summaries represent the standardized
severity color alerts, with the left pattern representing the object’s current status and the right
representing the historical experience over a defined period.
From a single view the operator can discern that this service has one current issue and that
the other resources, including the MPLS core, are all fully operational and have met
expected operational experience levels.
The KFI Portal for the affected router indicates that the probable cause is a latency issue
affecting a specific router associated with this service, that it has persisted for the last 20
minutes, and that it has put the overall SLA for this service in jeopardy.
The NOC is working on the issue and a resolution is expected in 20 minutes.
Tiered Assurance Process
KFIs allow providers to meet the demands of New Services by allowing rapid introduction of new
operational support processes and providing multiple organizations with secure, current status on their
resources and services, while optimizing the flow of operational data from distributed resources.
The need for deeper levels of analysis does not go away. It becomes part of the broader engineering
function that analyzes longer term trends, or it is part of an escalation process driven by the KFI
indicators.
A tiered service assurance process bridges the need for both real time operational support and
continuous operational improvement. It enables low volume real-time visibility, on-demand deep data
analysis and long term operational studies.
The Tier 1 Process is the general real-time KFI process that we have described above. It monitors service
performance through the specified KFIs and can indicate service impacting events quickly to the
appropriate support organization.
When a KFI indicates a service impacting event it can initiate a number of actions to enhance probable
cause analysis. This is Tier 2 processing. The KFI can initiate the calculation of a more in-depth Tier 2 KFI.
For example, a critical threshold alarm may trigger the display of the history of the problem
over a definable period based on the priority and severity of the event. The system would immediately
replace the simple threshold exceeded KFI with a threshold pattern KFI so that the operator would
immediately see extended information on the problem.
The KFI could also utilize the bi-directional control of the resources to ask for additional information for
deeper analysis of the problem and/or increase polling rates. The KFI may query other OSSs or BSSs for
more information and can create new tier 2 KFIs or provide additional data for reports or real time
portals. These escalated analysis processes can be automated or integrated manual operations initiated
from the collaboration processes. They may be transparent to the end user but save the solution
significant network and processing time by being invoked only by a Tier 1 trigger.
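The threshold example above, in which a simple threshold exceeded KFI is replaced by a threshold pattern KFI, can be sketched as follows. This is a minimal illustration under stated assumptions: the class names, the two limits and the history window per severity are hypothetical, and a real Tier 2 process could also query devices or other OSSs as described.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Union

# Hypothetical history length (in samples) shown for each severity.
HISTORY_WINDOW = {"major": 3, "critical": 5}


@dataclass
class ThresholdKFI:
    """Tier 1 view: the current value against its limit."""
    name: str
    value: float
    limit: float


@dataclass
class ThresholdPatternKFI:
    """Tier 2 view: recent history of the problem, sized by severity."""
    name: str
    limit: float
    severity: str
    history: List[float]


class Tier1Monitor:
    def __init__(self, name: str, limit: float, critical_limit: float, keep: int = 20):
        self.name = name
        self.limit = limit
        self.critical_limit = critical_limit
        self.samples: Deque[float] = deque(maxlen=keep)  # bounded recent history

    def observe(self, value: float) -> Union[ThresholdKFI, ThresholdPatternKFI]:
        self.samples.append(value)
        if value <= self.limit:
            return ThresholdKFI(self.name, value, self.limit)
        # Tier 1 trigger: only now is the richer Tier 2 KFI calculated.
        severity = "critical" if value > self.critical_limit else "major"
        window = HISTORY_WINDOW[severity]
        return ThresholdPatternKFI(self.name, self.limit, severity, list(self.samples)[-window:])


monitor = Tier1Monitor("latency_ms", limit=100, critical_limit=200)
for sample in (40, 60, 90):
    monitor.observe(sample)
print(monitor.observe(120))
```

The pattern KFI is built only inside the branch taken after a threshold breach, which reflects the point made above that Tier 2 work costs nothing until a Tier 1 trigger invokes it.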
Finally, there may be situations where the Tier 2 process or operational staff recognizes that additional
off-line analysis needs to be applied. A Tier 3 process can then be invoked. This may involve handing the
incident off to another specialized OSS or to a specific Engineering group.
At all levels of the tiered process the collaborative platform allows the varied operational support and
engineering groups to communicate in real-time in a secure manner and collect their operational
experiences into an operational knowledge base. This expertise can then be used to improve the KFI and
tiered assurance processes.
Tiered Assurance Example
The following IPTV assurance scenario depicts many of the key aspects and benefits of a tiered
assurance process.
A Tier 1 KFI is established to collect a specific key video quality metric from each set-top
box (STB) in the service provider’s IPTV deployment on a 5 minute basis. This allows the
provider to limit the amount of data they are polling across their network while getting
full customer device coverage.
If this KFI reaches a defined severity level then a Tier 2 process is invoked. In this
case the process now begins to poll the specific STB on a 30 second basis and also
collects 3 additional KFIs from the STB. This escalated monitoring remains in place until the
extended KFI monitoring indicates the STB video quality has been restored. While the
escalated monitoring is in place the specific STB KFIs are routed to the IPTV engineering
group for Tier 3 review.
As the support teams collaborate on solutions to these STB quality issues, they record
their experience into the knowledge base. Over time engineering qualifies that a certain
process corrects a specific degradation scenario. The engineering group develops
automated fixes for the problem that can be presented to the NOC or SOC groups as a
collaborative KFI that will propagate automatically with the operational KFI and allow
the support staff to manually initiate the recommended solution. Over time the manual
process will be validated and the corrective process will be launched automatically as
part of the Tier 2 escalation process.
These capabilities allow the service provider’s support groups to quickly manage new
services and capabilities, control their data volumes, get maximum service visibility and
allow the continuous improvement of the operational support processes.
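The polling escalation in this IPTV scenario can be sketched as a small per-STB state holder. The 5 minute and 30 second intervals and the count of three additional KFIs come from the scenario; everything else is an assumption made for illustration, including the metric names, the quality threshold and the in-memory list standing in for routing to the IPTV engineering group.

```python
from dataclasses import dataclass, field
from typing import Dict, List

BASE_INTERVAL_S = 300       # Tier 1: poll every 5 minutes
ESCALATED_INTERVAL_S = 30   # Tier 2: poll every 30 seconds

BASE_KFIS = ["video_quality"]                                # hypothetical metric name
EXTRA_KFIS = ["packet_loss", "jitter", "buffer_underruns"]   # hypothetical; the scenario only says three


@dataclass
class StbMonitor:
    stb_id: str
    threshold: float                 # hypothetical quality score below which Tier 2 starts
    escalated: bool = False
    routed_to_engineering: List[Dict[str, float]] = field(default_factory=list)

    @property
    def interval_s(self) -> int:
        return ESCALATED_INTERVAL_S if self.escalated else BASE_INTERVAL_S

    @property
    def kfis(self) -> List[str]:
        return BASE_KFIS + EXTRA_KFIS if self.escalated else list(BASE_KFIS)

    def on_sample(self, sample: Dict[str, float]) -> None:
        """Escalate while quality is degraded and return to Tier 1 once it is restored."""
        self.escalated = sample["video_quality"] < self.threshold
        if self.escalated:
            # Stand-in for routing escalated STB KFIs to engineering for Tier 3 review.
            self.routed_to_engineering.append(sample)


stb = StbMonitor("stb-42", threshold=3.5)
stb.on_sample({"video_quality": 4.2})
print(stb.interval_s, stb.kfis)
stb.on_sample({"video_quality": 2.9})
print(stb.interval_s, stb.kfis)
```

With these intervals an escalated STB is polled ten times as often and reports four KFIs instead of one, which is why confining the escalation to the affected device keeps overall data volumes under control.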
Conclusions
Current assurance processes rely on: a certain level of predictability in the application and load patterns
of their managed services; a reasonable time to implement new support processes; direct access to
dedicated network resources; and a level of operational experience with the specific supported service.
All of these premises are disrupted as new cloud or NGN IP services are deployed. Additionally, the
number of interested operational communities has increased and will demand current service status
and potentially the ability to reconfigure their own devices and the resources provided by the carrier in
near real-time.
This new service ecosystem will require new methodologies for both the deployment and the execution
of support processes, across a wide range of managed resources, for multiple operational support
groups.
Contact
Dorado Software
100 Woodmere Road
Folsom, CA 95630
www.doradosoftware.com
© 2012 Dorado Software, Inc. All rights reserved. Dorado, Redcell, and Key First Indicators are registered trademarks of Dorado
Software, Inc. All other marks are the property of their respective owners.