
Rapid Cause Analysis, Key First Indicators™ and Tiered Service Assurance


www.doradosoftware.com

October 2012

Dorado Software

This document explores in detail Dorado Software’s unique

approach to rapid cause analysis for today’s emerging and

growing NGN+ services, using its Key First

Indicators™ and Tiered Assurance methodology.

Dorado Software Whitepaper: Rapid Cause Analysis, Key First Indicators™ and Tiered Service Assurance October 2012 1


Getting Practical about Service Assurance in a Fast Moving World

The evolution of smartphones, tablets, broadband technologies, virtualized processing and distributed

applications has allowed businesses, entrepreneurs, enterprises and individuals to rethink how they use

technology to simplify and enhance their endeavors.

These New Services leverage both private and public network capabilities, and create substantial

business opportunities for service providers. These New Services also present significant challenges as

providers need to meet the demands of these new customer business models and the unpredictability

of the network loads they create in an increasingly competitive telecommunication market.


For the service provider’s operational staff, two of the most significant challenges are the much

shorter service delivery lifecycle and the multi-organizational, distributed nature of these services.

Operational processes will need to address the following realities of this new market in order to make

operational support a differentiating factor for their organizations.

For instance:

New services are distributed, potentially involving multiple servers, multiple networks,

multiple organizations and even consumer owned devices. Each element in the delivery

chain offers another point of potential degradation. Each presents connectivity and

authorization issues. And, each may represent an additional party of interest that will

want visibility to the service’s performance.

The promise of New Services is dependent on near real-time operations. Whether it is

immediate download of an application, bandwidth on demand, or the utilization of new

virtual capabilities, the expectations are that these processes will be automated. As

more mission-critical services are delivered over these distributed ecosystems, the

need for immediate performance visibility will become imperative.

Performance of these new services will require visibility into the specific

service/application network behavior, not just overall network throughput. In addition,

the relationship with other applications on networks or servers will become critical

areas of potential performance contention. Finally, the nature of network and

processing load generated by these new services can be highly unpredictable and

require more proactive assurance processes.

Given the open nature and short delivery cycles of these New Services,

specific operational best practices will be inherently immature and nonstandard.

Further, the distributed nature of the service will make direct assurance processes hard

to administer. Collaborative processes across diverse operational communities will be

required.

Finally, all these new operational issues will have to be addressed in a fraction of the

time that traditional assurance processes were implemented.

Assurance has evolved from the NOC management of dedicated network component failures and

degradation to end-customer awareness of specific application performance in a multi-tenant

virtualized environment. Traditional support solutions were not designed for these conditions;

therefore, a new assurance methodology is required.

Traditional Assurance Solutions Inhibitors

Many of the most prevalent assurance solutions are built on architectures that are more

than 15 years old and often are loosely integrated suites of products. Fifteen years ago

there were only 18,000 websites on the internet and no such thing as smartphones or a

virtual machine! These dated architectures make significant feature changes expensive

and time consuming to implement. Scaling can also be an issue.

Carrier grade assurance systems manage network devices or EMSs for traditional service

provider networks mostly with a NOC (network operations center) perspective. The

focus is generally network performance specific with minimal visibility to the underlying

application traffic. These architectures are often not well suited to manage equipment

outside the provider domain or for correlating usage information from probes

(IPTV/VoIP) or optimization appliances.

Solution development, production installation and ongoing improvements are generally

bottom-up processes focused on the capabilities of new network devices and the

availability of fault and performance metrics from an element level. These processes can

be extremely data intensive. Without targeted efforts to isolate key data, the volume of

collected data can grow exponentially. Often, multiple systems

collect data from these managed devices.

Advanced correlation and root cause determination capabilities are then engineered

over the managed data and devices. These analytics are often the automation of

operational experience either of the solution vendor or the service provider themselves,

acquired from extensive experience managing similar networks and services. This level

of specific experience is unlikely to exist for New Services.

Assurance systems are generally funded and implemented within a specific

organization. Most of the visibility and capabilities are used within that organization,

a pattern otherwise known as the ‘silo effect’. Limited access by other organizations or systems is

sometimes provided, but this visibility is rarely real time and almost never collaborative.

Cross-organizational development of value-added capabilities is limited, time consuming

and often very bureaucratic.

Most assurance systems create a northbound data collection relationship with the

managed resources. Southbound transactions are generally confined to the solicitation

of data from these devices. Closed-loop corrective actions associated with the assurance

indicators are very rare.

These restrictions in current assurance systems make them unsuitable for the fundamental

requirements of New Services: short delivery times, resources and operational groups from multiple

organizations, and limited operational experience. A new assurance process must evolve to meet these

challenges.

A modern service assurance methodology must give each functional group immediate visibility to the

most relevant operational data and a method to collaborate with other groups on operational best

practices.

Key Components of Modern Service Assurance Methodology


The first requirement is to have a flexible architecture that allows the collection and distribution of

performance information. This capability must facilitate the collection of metrics and management of

diverse service resources.

The solution must also have the ability to securely expose these operational metrics across varied

operational communities.

Finally, the solution must enable these processes across organizational boundaries.

Secure Collaboration Platform

Operational collaboration enables all the interested parties in the distributed service delivery ecosystem

to have secure access to the data and tools necessary to maximize the value from their New Services. A

collaborative platform must provide integrated capabilities for users to interact and intuitively share

experience and data. An integrated knowledge base will help capture operational experience for future

reference and potential resolution automation.


Comprehensive Resource Management

Comprehensive resource control is the cornerstone of any assurance system. The ability to extend

resource management beyond traditional provider boundaries is the key to New Service assurance.

Bi-directional control is essential both to provide assurance escalation processes and to automate

closed-loop corrective processes.

For a more detailed discussion on how to use a collaborative environment to manage cloud and next

generation services, please refer to our Using a Community-Based Approach to Simplify Cloud

Management whitepaper and accompanying webinar.

With the metric distribution process in place, the next step is to define and process the metrics vital to

the effective management of New Services. Dorado Software’s methodology for this is Key First

Indicators™ (KFI).

http://www.doradosoftware.com/products/registration/wp-cloud-management.html


http://www.lightreading.com/webinar.asp?webinar_id=30017&webinar_promo=28270


Key First Indicators™ Methodology

Dorado’s KFI methodology enables the deployment of differentiated support processes in the time

frames required by New Service business models. It provides real-time visibility of network events

provided by the resource management layer to the distinct operational communities managed by the

collaboration platform.

The foundation of the KFI concept is to focus on three principles:

Identify the 10-20 major indicators a managed resource can provide

Identify the individual indicators required by each operational support persona

Create a reusable infrastructure to quickly associate these indicators to

personas


Dorado Software’s Service Assurance platform provides a number of KFI templates that can be quickly

assigned to specific monitored event data or as summarizations of other portal functions or resource

attributes. Common KFIs would include Thresholds, Availability, Top 10 List and Administrative

(Resolution Tickets).
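As an illustration only, the template idea can be sketched as a small data model in Python. The four template kinds come from the text above; the class names, field names and the `define_kfis` helper are hypothetical and do not describe Dorado's actual product API.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class KfiTemplate(Enum):
    """Common KFI kinds named in the text."""
    THRESHOLD = "threshold"
    AVAILABILITY = "availability"
    TOP_10_LIST = "top_10_list"
    ADMINISTRATIVE = "administrative"  # e.g. resolution tickets


@dataclass(frozen=True)
class KfiDefinition:
    """A template bound to one monitored metric (hypothetical shape)."""
    resource: str
    metric: str
    template: KfiTemplate
    rank: int  # 1 = most important indicator for this resource


def define_kfis(resource: str, metrics: dict[str, KfiTemplate]) -> list[KfiDefinition]:
    """Bind templates to metrics, ranking them in the order they are listed."""
    return [
        KfiDefinition(resource, metric, template, rank)
        for rank, (metric, template) in enumerate(metrics.items(), start=1)
    ]


router_kfis = define_kfis(
    "router-1",
    {"latency_ms": KfiTemplate.THRESHOLD, "uptime": KfiTemplate.AVAILABILITY},
)
```

Keeping the definition separate from any collected value is one way to make the same template reusable across many resources.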

These KFI structures can be collected from a variety of sources:


KFIs from the originating managed resource are known as native KFIs. They can be associated with one or

more related managed resources, either physical or logical, where they become associated KFIs.

Managed elements can have associated KFIs from multiple sources. A KFI pool is maintained for each

managed resource.

KFI pools are populated in a number of ways. All devices that produce KFIs have their native KFIs

immediately associated. Managed objects that are created with a defined physical or logical relationship

to other managed objects can inherit specific KFIs. KFIs can be manually assigned to associated managed

objects. In addition, summarization KFIs can allow the propagation of consolidated metrics upward to an

associated managed object. The Top Ten KFI allows a summation of alarms or other measured metrics

that are not managed as KFIs to be propagated as a KFI and provides an immediate drilldown

mechanism. In addition, summary states for a managed object reflecting current state and operational

experience can also be propagated.
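The ways a KFI pool is populated (native association, inheritance through a defined relationship, manual assignment and Top Ten summarization) can be sketched roughly as follows. This is a minimal Python model under assumed names: `Kfi`, `ManagedResource`, `relate`, `assign` and `top_ten_kfi` are all hypothetical, and the choice to store the summary as an associated KFI with source "summary" is an assumption, not something the text specifies.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Kfi:
    name: str
    source: str                                  # resource that produced the indicator
    value: float = 0.0
    detail: list = field(default_factory=list)   # drilldown rows, if any


@dataclass
class ManagedResource:
    name: str
    native: dict[str, Kfi] = field(default_factory=dict)
    associated: dict[str, Kfi] = field(default_factory=dict)

    def add_native(self, kfi_name: str, value: float) -> Kfi:
        """Native KFIs are attached to their own resource immediately."""
        kfi = Kfi(kfi_name, self.name, value)
        self.native[kfi_name] = kfi
        return kfi

    def assign(self, kfi: Kfi) -> None:
        """Manually associate a KFI produced elsewhere."""
        self.associated[f"{kfi.source}/{kfi.name}"] = kfi

    def relate(self, other: ManagedResource, inherit: tuple[str, ...] = ()) -> None:
        """Inherit selected native KFIs of a related resource."""
        for kfi_name in inherit:
            self.assign(other.native[kfi_name])

    def pool(self) -> list[Kfi]:
        """The KFI pool: native indicators first, then associated ones."""
        return list(self.native.values()) + list(self.associated.values())


def top_ten_kfi(owner: ManagedResource, counts: dict[str, int]) -> Kfi:
    """Roll non-KFI counts (e.g. alarms per port) up into one summary KFI."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
    kfi = Kfi("top_ten", "summary", float(sum(counts.values())), ranked)
    owner.assign(kfi)
    return kfi


router = ManagedResource("router-1")
router.add_native("latency_ms", 42.0)
service = ManagedResource("vpn-service")
service.relate(router, inherit=("latency_ms",))
summary = top_ten_kfi(service, {f"port-{i}": i for i in range(12)})
```

Here the service holds no native KFIs of its own; its pool consists of the router's latency indicator plus the summary, whose `detail` list serves as the drilldown.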

The KFI Portal displays these KFI pools and allows the user to get immediate visibility to the most

operationally significant information for a managed resource. Native KFIs (produced from this resource

itself) are always displayed to indicate current operational levels even when within acceptable range.

Associated KFIs (produced by other resources that impact this resource) are displayed when the

assigned severity level is reached. Display order is determined by severity first and then by KFI ranking

within the native resources. Navigation aids allow intuitive access to additional KFIs.
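The display rules just described can be expressed as a filter followed by a two-key sort. The sketch below is an assumption-laden illustration: the five-level `Severity` scale, the `PoolEntry` fields and the `portal_view` function are hypothetical names chosen for the example, not the product's own.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Assumed severity scale; the text does not enumerate the levels."""
    NORMAL = 0
    WARNING = 1
    MINOR = 2
    MAJOR = 3
    CRITICAL = 4


@dataclass(frozen=True)
class PoolEntry:
    name: str
    severity: Severity
    rank: int                               # ranking within the producing resource, 1 = highest
    native: bool
    display_at: Severity = Severity.MAJOR   # assigned level; used for associated KFIs only


def portal_view(pool: list[PoolEntry]) -> list[PoolEntry]:
    """Native KFIs always show; associated KFIs show once their level is reached."""
    visible = [e for e in pool if e.native or e.severity >= e.display_at]
    # Most severe first, then by KFI ranking.
    return sorted(visible, key=lambda e: (-e.severity, e.rank))


pool = [
    PoolEntry("cpu", Severity.NORMAL, 2, native=True),
    PoolEntry("latency", Severity.CRITICAL, 1, native=True),
    PoolEntry("peer-loss", Severity.MAJOR, 1, native=False),
    PoolEntry("peer-jitter", Severity.WARNING, 2, native=False),
]
view = portal_view(pool)
```

In this example the associated `peer-jitter` indicator is hidden because it has not reached its assigned level, while the native `cpu` indicator stays visible even though it is within the acceptable range.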

In this manner probable cause information is effectively propagated up the management infrastructure

from the device, to the service, to a location and to the customers themselves.


The KFIs then utilize the resource management and collaborative infrastructures to provide near real-

time operational status to each support group. Each group will have visibility to specific KFIs pertinent to

their operational roles. The delivery style may vary by operational persona: network topology for the

NOC; customer dashboards for customer support; service dashboards for sales and execs.
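One simple way to picture the persona association is a lookup table that pairs each persona with a delivery style and the set of KFIs it may see. The delivery styles below follow the text; the persona keys, the KFI names and the `kfis_for` helper are hypothetical.

```python
# Hypothetical persona profiles: delivery style plus the KFIs each role may see.
PERSONAS = {
    "noc": {
        "view": "network topology",
        "kfis": {"latency_ms", "packet_loss", "availability"},
    },
    "customer_support": {
        "view": "customer dashboard",
        "kfis": {"availability", "open_tickets"},
    },
    "sales_exec": {
        "view": "service dashboard",
        "kfis": {"sla_status"},
    },
}


def kfis_for(persona: str, current: dict) -> tuple:
    """Return the delivery style and only the KFIs this persona is entitled to."""
    profile = PERSONAS[persona]
    visible = {name: value for name, value in current.items() if name in profile["kfis"]}
    return profile["view"], visible


current_values = {
    "latency_ms": 180,
    "availability": 99.2,
    "open_tickets": 1,
    "sla_status": "jeopardy",
}
noc_view = kfis_for("noc", current_values)
sales_view = kfis_for("sales_exec", current_values)
```

Because the table is data rather than code, adding a persona or changing what it sees does not require touching the collection side.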

KFI Example

In the following view, the operator, while in a service-level view, can select a specific router based

on the color-coded operational summaries. These summaries represent the standardized

severity color alerts, with the left pattern representing the object’s current status and the right

representing the historical experience over a defined period.


From a single view the operator can discern that this service has one current issue and that

the other resources, including the MPLS core, are all fully operational and have met

expected operational experience levels.

The KFI Portal for the affected router indicates that the probable cause is a latency issue

affecting a specific router associated with this service, that it has persisted for the last 20

minutes, and that it has put the overall SLA for this service in jeopardy.

The NOC is working on the issue and a resolution is expected in 20 minutes.

Tiered Assurance Process

KFIs allow providers to meet the demands of New Services by enabling rapid introduction of new

operational support processes and providing multiple organizations with secure, current status on their

resources and services, while optimizing the flow of operational data from distributed resources.

The need for deeper levels of analysis does not go away. It becomes part of the broader engineering

function that analyzes longer term trends, or it is part of an escalation process driven by the KFI

indicators.

A tiered service assurance process bridges the need for both real time operational support and

continuous operational improvement. It enables low volume real-time visibility, on-demand deep data

analysis and long term operational studies.


The Tier 1 Process is the general real-time KFI process that we have described above. It monitors service

performance through the specified KFIs and can indicate service impacting events quickly to the

appropriate support organization.

When a KFI indicates a service impacting event it can initiate a number of actions to enhance probable

cause analysis. This is Tier 2 processing. The KFI can initiate the calculation of a more in-depth Tier 2 KFI.

For example, a critical threshold alarm may trigger the display of the history of the problem

over a definable period based on the priority and severity of the event. The system would immediately

replace the simple threshold-exceeded KFI with a threshold pattern KFI so that the operator would

immediately see extended information on the problem.
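The replacement of a simple threshold KFI by a threshold pattern KFI might look like the following sketch. The class names, the `escalate` function and the window lengths per severity are assumptions made for illustration. For brevity the window here depends on severity only, whereas the text also mentions priority.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ThresholdKfi:
    """Tier 1: only the current value against a limit."""
    name: str
    limit: float
    value: float

    def exceeded(self) -> bool:
        return self.value > self.limit


@dataclass
class ThresholdPatternKfi:
    """Tier 2: the same indicator, extended with recent history."""
    name: str
    limit: float
    history: list  # (timestamp in seconds, value) pairs


# Hypothetical look-back windows in seconds, keyed by severity.
HISTORY_WINDOW_S = {"critical": 3600, "major": 1800}
DEFAULT_WINDOW_S = 600


def escalate(kfi: ThresholdKfi, samples: list, severity: str, now_s: int):
    """Swap an exceeded threshold KFI for a pattern KFI; otherwise leave it as is."""
    if not kfi.exceeded():
        return kfi
    window = HISTORY_WINDOW_S.get(severity, DEFAULT_WINDOW_S)
    recent = [(t, v) for t, v in samples if now_s - t <= window]
    return ThresholdPatternKfi(kfi.name, kfi.limit, recent)


samples = [(0, 10.0), (3000, 80.0), (3500, 95.0)]
breached = ThresholdKfi("latency_ms", limit=50.0, value=95.0)
healthy = ThresholdKfi("latency_ms", limit=50.0, value=20.0)
escalated = escalate(breached, samples, "major", now_s=3600)
unchanged = escalate(healthy, samples, "major", now_s=3600)
```

The history is assembled only when the Tier 1 indicator trips, which matches the goal of keeping routine data volumes low.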

The KFI could also utilize the bi-directional control of the resources to ask for additional information for

deeper analysis of the problem and/or increase polling rates. The KFI may query other OSSs or BSSs for

more information and can create new Tier 2 KFIs or provide additional data for reports or real time

portals. These escalated analysis processes can be automated or integrated manual operations initiated

from the collaboration processes. They may be transparent to the end user but save the solution

significant network and processing time by only being invoked by a Tier 1 trigger.


Finally, there may be situations where the Tier 2 process or operational staff recognizes that additional

off-line analysis needs to be applied. A Tier 3 process can then be invoked. This may involve handing the

incident off to another specialized OSS or to a specific Engineering group.

At all levels of the tiered process, the collaborative platform allows the varied operational support and

engineering groups to communicate in real-time in a secure manner and collect their operational

experiences into an operational knowledge base. This expertise can then be used to improve the KFI and

tiered assurance processes.

Tiered Assurance Example

The following IPTV assurance scenario depicts many of the key aspects and benefits of a tiered

assurance process.

A Tier 1 KFI is established to collect a specific key video quality metric from each set-top

box (STB) in the service provider’s IPTV deployment on a 5-minute basis. This allows the

provider to limit the amount of data they are polling across their network while getting

full customer device coverage.

If this KFI reaches a defined severity level, then a Tier 2 process is invoked. In this

case the process now begins to poll the specific STB on a 30-second basis and also

collect three additional KFIs from the STB. This escalated monitoring remains in place until the

extended KFI monitoring indicates the STB video quality has been restored. While the

escalated monitoring is in place the specific STB KFIs are routed to the IPTV engineering

group for Tier 3 review.

As the support teams collaborate on solutions to these STB quality issues, they record

their experience into the knowledge base. Over time, engineering verifies that a certain

process corrects a specific degradation scenario. The engineering group develops

automated fixes for the problem that can be presented to the NOC or SOC groups as a

collaborative KFI that will propagate automatically with the operational KFI and allow

the support staff to manually initiate the recommended solution. Over time the manual

process will be validated and the corrective process will be launched automatically as

part of the Tier 2 escalation process.


These capabilities allow the service provider’s support groups to quickly manage new

services and capabilities, control their data volumes, get maximum service visibility and

allow the continuous improvement of the operational support processes.
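The IPTV scenario above can be sketched as a small per-STB state machine, together with the arithmetic that shows why escalating only the affected device keeps data volumes down. The 5-minute and 30-second intervals and the count of three extra KFIs come from the text. The KFI names, the quality threshold and the fleet size of 100,000 STBs are hypothetical values chosen for the example.

```python
from dataclasses import dataclass, field

TIER1_INTERVAL_S = 300   # 5-minute baseline poll (from the text)
TIER2_INTERVAL_S = 30    # 30-second escalated poll (from the text)
TIER1_KFIS = ("video_quality",)                                      # hypothetical name
TIER2_EXTRA_KFIS = ("packet_loss", "jitter_ms", "buffer_underruns")  # hypothetical names


@dataclass
class StbMonitor:
    stb_id: str
    quality_floor: float                 # hypothetical severity threshold
    escalated: bool = False
    routed_to_engineering: list = field(default_factory=list)

    @property
    def interval_s(self) -> int:
        return TIER2_INTERVAL_S if self.escalated else TIER1_INTERVAL_S

    @property
    def kfis(self) -> tuple:
        return TIER1_KFIS + TIER2_EXTRA_KFIS if self.escalated else TIER1_KFIS

    def observe(self, sample: dict) -> None:
        """Escalate while quality is degraded; fall back once it is restored."""
        self.escalated = sample["video_quality"] < self.quality_floor
        if self.escalated:
            # While escalated, samples are routed for Tier 3 review.
            self.routed_to_engineering.append(sample)


def samples_per_hour(n_stbs: int, interval_s: int, n_kfis: int) -> int:
    """Data points collected per hour for a group of STBs."""
    return n_stbs * (3600 // interval_s) * n_kfis


monitor = StbMonitor("stb-001", quality_floor=3.5)
monitor.observe({"video_quality": 4.2})
baseline = (monitor.interval_s, monitor.kfis)
monitor.observe({"video_quality": 2.9})
during = (monitor.interval_s, len(monitor.kfis))
monitor.observe({"video_quality": 4.0})
after = monitor.interval_s
```

Under these assumptions, polling 100,000 STBs for one KFI every 5 minutes yields 1,200,000 samples per hour, while polling all of them for four KFIs every 30 seconds would yield 48,000,000. Escalating a single STB adds only 480 samples per hour for that device.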

Conclusions

Current assurance processes rely on: a certain level of predictability in the application and load patterns

of their managed services; a reasonable time to implement new support processes; direct access to

dedicated network resources; and a level of operational experience with the specific supported service.

All of these premises are disrupted as new cloud or NGN IP services are deployed. Additionally, the

number of interested operational communities has increased and will demand current service status

and potentially the ability to reconfigure their own devices and the resources provided by the carrier in

near real-time.

This new service ecosystem will require new methodologies for both the deployment and the execution

of support processes, across a wide range of managed resources, for multiple operational support

groups.

Contact

Dorado Software

100 Woodmere Road

Folsom, CA 95630

www.doradosoftware.com

© 2012 Dorado Software, Inc. All rights reserved. Dorado, Redcell, and Key First Indicators are registered trademarks of Dorado

Software, Inc. All other marks are the property of their respective owners.

