
EWSD Reliability

Telecommunication systems have to work 24 hours a day, 365 days a year, with negligible downtime. They therefore have to provide an extremely high level of availability and reliability. EWSD has been designed with stringent requirements for reliability and quality of service (QoS).

Contents

– Reliability concept
– Reliability analysis
  – Component reliability
  – Spare parts
  – Hardware reliability analysis
  – Software reliability estimation
– Reliability data
  – Total system downtime

[Figure: Reliability assurance program across the life-cycle phases concept & design, design & development, test, and field. Reliability evaluation takes place in every phase: "rough" analysis and design decisions (redundancy) against the system specification and requirements (e.g. performance), "detailed" reliability analysis (reliability, availability, maintainability) with design reviews, function and stress tests and maintainability tests during function & system test, and verification of parameters and data collection (repair, performance) via the Quality Information System during operation & maintenance. Total system downtime comprises unscheduled and scheduled maintenance.]


Reliability assurance program

Reliability plays an important role throughout development. A reliability assurance program was put in place in order to achieve and maintain such a high level of reliability. This program ensures quality and reliability by continuous monitoring from the beginning of development, through modifications, and during field operation.

Rough reliability analyses are performed in the concept and design phase to verify the system design and to support system design decisions. During design and development, detailed reliability analyses are performed in order to verify the system design, to optimize the maintenance strategies, and to provide customers with reliability models and results.

During the test phase, reliability data are collected and reliability-related tests are performed. These tests are used to check the related functions and to verify the model parameters used. A Quality Information System (QUIS) is established in order to monitor, measure and understand the reliability and quality performance in operation. For this purpose, quality and reliability data are collected from systems worldwide and evaluated continuously.

EWSD network nodes are in service with more than 300 telephone administrations in over 100 countries, e.g. the USA, Brazil, China, Indonesia, Egypt, South Africa, Portugal, and Germany. EWSD is well known worldwide for its high reliability and high availability. In the USA, for example, EWSD is the network node with the best total downtime performance, based on 1998 FCC ARMIS data.

Reliability objectives

Reliability measurements are defined to provide a high standard of service reliability. The reliability measurements and their objectives consider two aspects:

– from the subscriber's/user's point of view, services must be highly reliable and available
– from the service provider's point of view, maintenance and repair work must remain limited

Major reliability measurements are, for example, the total system downtime, the line or trunk downtime, the premature release probability, the incorrect charging probability, but also the number of maintenance actions or the circuit pack return rate. The different reliability measurements are calculated theoretically by means of reliability modeling, and they are also recorded and evaluated in operation. Typical in-service objectives (all causes of failure):

– Total system downtime: 3 minutes per year
– Single termination downtime: 30 minutes per year
– SS7 link downtime: 82 minutes per year
– Premature release probability: 2 x 10^-5
– Repair actions (hardware): 5 per 1,000 ports and year

The theoretically calculated reliability measurements are basically used to evaluate the expected reliability of the system and to give the service provider an idea of this reliability. We have defined essential requirements that cover most of the known requirements from Telcordia/Bellcore, ITU, and specific customers.

Reliability concept

The reliability concept comprises:

• Hardware
• Software

Hardware

In designing the system, particular attention was paid to achieving the highest possible reliability and availability. This is achieved by full redundancy of all central hardware components of the system.

The coordination processor (CP) works as an n+1 redundant multiprocessor, and all other central units are duplicated:

– Common memory (CMY)
– Input/output processors (IOP)
– Disks
– Message buffer (MB)
– Switching network (SN)
– Signaling system network control (SSNC)
– Central clock generator (CCG)

The digital line unit (DLU) is internally duplicated and operates according to the load-sharing principle.



All units are monitored by safeguarding programs to ensure that errors are dealt with immediately without impairing operation of the system. Hardware faults are detected and localized automatically. When a faulty unit has been localized, it is disconnected and an alarm is sent. The standby unit of redundant equipment is put into service by fast service recovery functions.

Coordination processor 113E (CP113E)

The CP113E is a multiprocessor system. All critical units are duplicated to improve system availability. This ensures that an outage of a single unit (a single error) will never cause an outage of the CP113E. To achieve this, the faulty component is immediately localized, isolated, and replaced by a redundant component. In many cases, the outage of more than one component can also be tolerated without causing an outage of the CP. For example, the following devices are duplicated:

– Base processor (BAP)
– Input/output control (IOC)
– Common memory (CMY)
– ATM bridge processor (AMP)

Pool redundancy guarantees sufficient availability of the call processors (CAP). Critical peripheral devices (e.g. the magnetic disk device, MDD) are also duplicated. They are connected to different IOCs by different input/output processors (IOP). In the same manner, the periphery is connected to the CP by one or more pairs of IOPs for the message buffer (MB).

All BAPs, CAPs, IOCs and AMPs are connected to the two CMYs.

One of the two BAPs is the BAP-master and the other is the BAP-spare. If the BAP-master becomes unavailable, the current BAP-spare automatically becomes BAP-master. The BAP-spare usually functions as a CAP and is part of the n+1 pool redundancy of the CAPs. IOC and IOP pairs work in load-sharing mode. If one component of these pairs becomes unavailable, its function is performed by the remaining redundant component without any effect on service. Two AMPs are provided as ATM bridge processors (AMPC) to the SSNC via optical fiber interfaces. They work in active/standby redundancy. Central parts of the processor boards and of the CMY are internally duplicated in order to detect failures safely and immediately.

[Figure: CP113E redundancy architecture — BAP0 (master) and BAP1 (slave) in master/slave redundancy, CAP0 to CAPn in pool redundancy (n+1), CMY0/CMY1 and IOC0/IOC1 duplicated, IOP:MB0 to IOP:MB7 in load-sharing redundancy, AMP0/AMP1 in active/standby redundancy towards the SSNC; the CP connects via MB and SN to LTG, DLU, RSU, HTI and the NetManager, with the CCG providing the clock.]

Message buffer (MB) and switching network (SN)

The message buffer (MB) is fully duplicated. The channels are through-connected on both switching network (SN) sides by the associated message buffers MB0 and MB1. As regards the control information to the SN, both MBs operate in active/active mode. The message channels to the LTGs are connected via both SN halves. Although each MB can process the entire data flow to the LTGs, the MBs operate in load-sharing mode. If one MB becomes unavailable, the MB that is still active takes over all the traffic to the LTGs.

The SN is also fully duplicated. One SN half is active and the other is on hot standby. All connections are set up in parallel and on the same path in both SN halves. If an error occurs in the active SN half, the CP initiates a changeover to the standby SN half.

Digital line unit (DLU) and line/trunk group (LTG)

The DLU provides the interface for all subscriber lines to the EWSD system. To meet a high standard of reliability, all central parts of the DLU (DLUC) are duplicated, except the subscriber line modules (SLM) and circuits used for external alarms and line testing.

Both DLUCs operate in load-sharing mode. If one DLUC or a connected LTG fails, calls established via the failed DLUC are lost and the call handling capacity is reduced. The subscribers have to re-establish their calls, which are then routed via the DLUC that is still in operation.

Because of the high reliability of the LTG hardware, the reliability requirements for trunk terminations can be met without redundancy. Nevertheless, it is advisable to distribute the trunks of a trunk group over several LTGs in order to provide maximum service availability.

[Figure: SN/MB redundancy structure — MB0 and MB1 in active/active load-sharing redundancy, each connected via IOP:MB and SNMUX 0 to 15 (SNMAT) to the SN halves SN0 and SN1, which work in active/hot-standby redundancy; LTGs and the SSNC are connected via both SN halves.]


Signaling system network control (SSNC)

The redundant structure of the SSNC ensures that no single failure in the central parts will cause a loss of signaling traffic.

The SSNC consists of two duplicated ATM switching planes, ASN0 and ASN1, operating in active/active mode. Connected to the ATM switching network (ASN), and communicating via it, are:

– Main processor for administration and maintenance (MP:OAM): a duplicated MP:OAM running in micro-synchronous active/hot-standby redundancy, with two redundant magnetic disk devices MDD0 and MDD1 (active/active redundancy), a magneto-optical disk MOD (not redundant), a local alarm interface (ALI) and a LAN interface
– Main processor for the signaling manager (MP:SM): a duplicated MP:SM for the SS7 signaling manager, running in micro-synchronous active/hot-standby redundancy
– Main processors for signaling link termination (MP:SLT): several duplicated MP:SLTs running in micro-synchronous active/hot-standby redundancy
– Main processors for global title translation (MP:GTT): several duplicated MP:GTTs running in micro-synchronous active/hot-standby redundancy
– Main processors for number portability (MP:NP): several duplicated MP:NPs running in micro-synchronous active/hot-standby redundancy
– Main processors for the IP interface (MP:IP): several duplicated MP:IPs running in micro-synchronous active/hot-standby redundancy
– Up to two cross-linked fiber-optic links to the MBD
– One cross-linked fiber-optic link to the CP (AMP)
– Line interface cards (LIC): several pairs of E1 LICs working in active/standby redundancy

[Figure: DLU/LTG redundancy — subscriber lines terminate on non-redundant SLMs; the two DLUCs operate in load-sharing redundancy and are connected to several LTGs, which in turn connect to SN0 and SN1; trunks terminate directly on LTGs.]

[Figure: SSNC architecture — MP:OAM, MP:SM, MP:SLT, MP:GTT, MP:NP and MP:IP, as well as the LICs for the SS7 links and the links to the CP (AMP) and MBD, are connected to the ATM switching plane ASN0 (duplicated by ASN1).]


The MP:OAM is the central operation, administration and maintenance (OAM) processor of the system, with redundant active/active MDDs for software versions and semi-permanent data. An MOD is provided without redundancy for upgrade purposes and for storing snapshots of the signaling system network control (SSNC) system.

The alarm interface module (ALI) indicates up to 16 external alarms locally. Signaling system No. 7 (SS7) is managed via the MP:SM. Several MP:SLTs, MP:NPs and MP:GTTs are provided. For SEP traffic, they communicate with the LTGs connected to DLUs directly via the optical fiber interfaces and the message buffer (MB).

Clock distribution

The central clock generator (CCG) generates the clock for the EWSD system, synchronizes it to the externally applied reference frequencies, and distributes it to the subsequent equipment.

The clock pulse is distributed in four levels:

I. MB:GCG
II. SN:GCG
III. LTG:GCG
IV. DLU:GCG

Each level consists of one or more parallel clock generators, which receive the clock of the next higher level as a reference clock.

The CCG is duplicated. One CCG is active and the other CCG is on standby, in synchronism with the clock of the active CCG. Each CCG is supplied with two external reference clocks. If the active CCG or both reference clocks fail, changeover to the standby CCG takes place immediately without loss of synchronization.

The SSNC has a separate clock distribution system which is also completely redundant. Its inputs are connected to the primary CCG.

[Figure: Clock distribution — CCG0 and CCG1 in active/standby redundancy with external reference inputs R1 to R4, distributing the clock via GCGs to MB0/MB1, SN0/SN1, the LTGs and the DLUs; the SSNC (ASN0/ASN1, MPU0/MPU1, LIC0/LIC1) has its own redundant clock distribution (ACCGs) whose inputs are connected to the CCG.]


Software

In terms of adaptability, system modification and system expansion, the EWSD software is modular in structure and functional in organization.

An essential quality attribute of the software is software reliability. Software reliability comprises the following aspects:

– technical correctness, completeness
– consistency, integrity, error prevention
– protection against failure, minimization of error propagation
– error neutralization mechanism (recovery)
– analysis and correction of software errors
– robustness against overload

Technical correctness and completeness are achieved by means of inspections, reviews and tests. The whole development process, from system definition to field operation, is controlled by a quality management system (ISO 9000).

Consistency, integrity and error prevention are achieved, for example, by the following measures:

– file protection against unauthorized access
– periodic consistency checks for data
– special security measures to prevent multiple access to data
– special security measures for data modification
– validity and consistency checks for data transferred at the interfaces
– checksum procedures for monitoring data and program code
– corrective audits of critical data

Protection against failure and minimization of error propagation are achieved, for example, by the following measures:

– division of program code and data into separate link modules, which are likewise stored in separate memory areas
– memory protection for program code and for semi-permanent data
– duplication of system files and user files
– monitoring the real-time response of programs
– monitoring system performance

The aim of recovery is to neutralize an error in such a way that switching operation is either not impaired at all or only slightly impaired. Central and peripheral recovery are divided up into recovery levels for this purpose. The individual recovery levels initiate specific recovery actions, which quickly and effectively restore the system to service.

CP recovery levels (system function and customer effect):

– New start: initialization of processes, stack, process data and heap on the CP. Customer effect: calls and connections maintained; some calls in setup released; call-charge data retained.
– Initial start: system reset and CP load from disk; LTGs loaded. Customer effect: switched connections released; nailed-up connections re-established; call-charge data retained.
– Basic operation: initialization of vital processes, stack and data; non-essential processes and non-essential OA&M functions inhibited. Customer effect: depends on the recovery level initiating basic operation (new start or initial start).
– Initial start with last generation: reload of the last generation from disk; LTGs loaded. Customer effect: switched connections released; nailed-up connections re-established; call-charge data restored.

SSNC recovery levels (system function and customer effect):

– Local recovery of an MP platform (FULLREC, code only): initialization of processes, stack, process data and heap. Customer effect: any failed MP:SM links are restored; nailed-up connections re-established.
– Local recovery of an MP platform (code & data): platform reset and MP load from disk; LIC or ACCG loaded (code & data). Customer effect: any failed MP:SM links are restored; nailed-up connections re-established.
– Basic operation: initialization of vital processes, stack and data; non-essential processes and non-essential OA&M functions inhibited. Customer effect: depends on the recovery level initiating basic operation (local recovery with code only or with code and data).
– System-wide recovery (LOADREC2): system reset and MP:SA load from disk; all ACCGs and LICs loaded (code & data). Customer effect: all links are interrupted and restored; nailed-up connections re-established.
– Initial start with last generation: reload of the last generation from disk; all ACCGs and LICs loaded (code & data). Customer effect: all links are interrupted and restored; nailed-up connections re-established.


Most of the software errors occurring on the CP or the MPs are neutralized by the affected process itself, if possible. The few remaining software errors are neutralized by the first recovery level (new start of all processes on the CP that are not relevant for call processing).

To ensure that the called recovery level clears an error completely, the system

– supervises the run time of the recovery,
– checks whether further errors occur while recovery is running, and
– checks whether further errors occur again after recovery within a supervision time.

If one of these checks indicates that the called recovery level was not successful, the next-higher recovery level is started. If the lowest level of initial start recovery is not successful either, basic call processing is started. This system status, known as "basic operation", activates a reduced process set which guarantees that the basic call processing functions are maintained. In this way, errors in areas of software that are not concerned with call processing are masked out.
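The escalation principle described above can be summarized in a minimal Python sketch. The names and the result record are hypothetical and merely illustrate the three checks; this is not the actual EWSD safeguarding code.

from dataclasses import dataclass

@dataclass
class RecoveryResult:          # hypothetical outcome record of one recovery run
    runtime_s: float           # how long the recovery level took
    errors_during: bool        # further errors occurred while recovery was running
    errors_after: bool         # further errors occurred within the supervision time

RECOVERY_LEVELS = ["new start", "initial start", "initial start with last generation"]

def escalate(run_level, max_runtime_s):
    """Run the recovery levels in order and escalate whenever a check fails."""
    for level in RECOVERY_LEVELS:
        r = run_level(level)   # execute the recovery level, returns a RecoveryResult
        if r.runtime_s <= max_runtime_s and not r.errors_during and not r.errors_after:
            return level       # this level cleared the error
    return "basic operation"   # keep only the basic call processing functions alive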

Fault symptom files for problem analysis and a remote-controlled software correction system are used for the analysis and correction of software errors.

Robustness against overload is achieved by means of an overload protection procedure. The overload protection procedures use a step-by-step load rejection strategy. The procedure is designed so that it can differentiate between short-term load peaks, which may be tolerated without any overload protection measure, and long-term overloads.
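As an illustration of such a step-by-step load rejection strategy, the sketch below raises or lowers a rejection level based on a smoothed load measurement, so that short peaks are tolerated while sustained overload sheds load in steps. The thresholds, smoothing factor and number of steps are hypothetical and not EWSD parameters.

def update_rejection_level(level, smoothed_load, new_load,
                           alpha=0.2, high=0.9, low=0.7, max_level=4):
    """One control step of a simple step-wise load-shedding scheme (illustrative only)."""
    # Exponential smoothing filters out short-term load peaks.
    smoothed_load = alpha * new_load + (1 - alpha) * smoothed_load
    if smoothed_load > high and level < max_level:
        level += 1    # sustained overload: reject a larger share of new call attempts
    elif smoothed_load < low and level > 0:
        level -= 1    # load has fallen again: admit more new call attempts
    return level, smoothed_load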

Reliability analysis

The reliability analysis comprises:

– Component reliability
– Spare parts
– Hardware reliability analysis
– Software reliability estimation

Component reliability

Overall component reliability is based on the reliability of the various items of hardware (resistors, capacitors, ICs, etc.). The failure rates of the components used are calculated on the basis of the Siemens norm SN29500. SN29500 contains failure rates of the components for reference conditions and methods for taking the dependence of the failure rates on the operating conditions into account. SN29500 complies with IEC 1709, "Electronic components – Reliability – Reference conditions for failure rates and stress models for conversion". The basis for SN29500 is the worldwide field experience with Siemens products, detailed service and repair statistics, component tests, etc.

The mean failure rate of circuit boards is in the range of 2,000 to 6,000 FIT (failures in 10^9 hours), corresponding to an MTBF of roughly 60 to 20 years.
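The quoted MTBF range follows directly from the FIT values (1 FIT = one failure per 10^9 operating hours); a quick check in Python:

HOURS_PER_YEAR = 8760

def fit_to_mtbf_years(fit):
    """Convert a failure rate in FIT (failures per 1e9 hours) into an MTBF in years."""
    return 1e9 / fit / HOURS_PER_YEAR

print(fit_to_mtbf_years(2000))   # ~57 years for a 2,000 FIT board
print(fit_to_mtbf_years(6000))   # ~19 years for a 6,000 FIT board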



Spare parts

Network nodes are systems in which component failures, and thus also device failures, can be expected and which therefore require corrective maintenance. This gives rise to a certain demand for spare parts and the need to maintain stocks of such parts (in this case spare modules), either by repairing faulty modules or by ordering new ones from time to time from the manufacturer.

The failure rates of the individual modules and the number of modules installed can be used to calculate the probability of a certain number of module failures occurring within a particular period.

The cumulative Poisson distribution is used to calculate the required number of spare modules. Essential customer-specific parameters for this calculation are the required service continuity probability and the turnaround time, which is defined as the interval between the time when a replacement is ordered and the time when the replacement is received. The spare parts requirements are calculated individually for each project.

[Figure: Required spare modules as a function of the module failure rate (1,000 to 5,000 FIT) and the number of modules installed (100 to 5,000), for a service continuity probability of 99.9 % and a turnaround time of one month.]
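The calculation itself can be sketched in a few lines: the expected number of failures during the turnaround time follows from the module failure rate and the number of modules in service, and the cumulative Poisson distribution then yields the smallest spare stock that meets the required service continuity probability. The figures in the example call are purely illustrative.

import math

def required_spares(fit, modules, turnaround_h, continuity_prob):
    """Smallest stock s with P(failures during turnaround <= s) >= continuity_prob."""
    mean = fit * 1e-9 * modules * turnaround_h      # expected failures in the turnaround time
    s, cdf = 0, math.exp(-mean)                     # cumulative Poisson probability, term by term
    while cdf < continuity_prob:
        s += 1
        cdf += math.exp(-mean) * mean ** s / math.factorial(s)
    return s

# e.g. 3,000 FIT modules, 1,000 in service, 1 month turnaround, 99.9 % continuity probability
print(required_spares(3000, 1000, 730, 0.999))      # -> 8 spare modules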

Hardware reliability analysis

Reliability analysis and modeling are an integral part of the development process. Reliability block diagrams and state transition diagrams are used for hardware reliability modeling. The models consider all aspects of the system that affect its reliability, for example the ability of the system to detect faults, the ability to identify a faulty unit and isolate it, or the frequency of periodic diagnosis. Hardware failure rates of all components are predicted at an early stage of the development process. All predictions are based on the Siemens norm SN29500 for component failure rate calculations.

The figure shows the simplified reliability block diagram relating to the total system downtime. Reliability block diagrams of this kind are created for each specified reliability measurement, such as total system downtime, single termination outage, or SS7 link downtime.

[Figure: Simplified reliability block diagram for the total system downtime — the duplicated CCG, the CP113E (incl. clock) with duplicated BAP, CMY, AMP and IOC/IOP, the SSNC, and the duplicated SN and MB in series.]


The individual subsystems are modeled by Markov modeling techniques. The reliability model of a subsystem (Markov model) shows all failure, detection, recovery and repair actions relevant for the reliability of the subsystem. Examples are:

– failures in both sides of the system
– uncovered faults
– a failure on the active system side during repair of the other side
– non-redundant operation of the system due to periodic automatic diagnosis of the redundant side, and the effect of a failure during this period

The reliability analysis finishes with the computation of the defined reliability measurements and their verification against the requirements. Additionally, the effect of the choice of parameter values on the resulting reliability measurements is analyzed in order to determine optimized system parameters. Examples are the minimum necessary fault detection probability or the optimized period for the periodic diagnosis of redundant components.
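As a minimal illustration of the Markov modeling technique (far simpler than the model in the figure below), the sketch computes the steady-state downtime of a duplicated unit with imperfect fault coverage; the failure rate, coverage and repair rate used here are hypothetical values, not EWSD model parameters.

import numpy as np

# States: 0 = both units up, 1 = simplex after a covered failure, 2 = down.
lam = 4000e-9          # failure rate per unit and hour (4,000 FIT)
c = 0.99               # coverage: probability that a failure is detected and handled
mu = 1 / 2.0           # repair rate (MTTR = 2 h)

Q = np.array([
    [-2 * lam,  2 * lam * c,  2 * lam * (1 - c)],   # duplex: covered or uncovered failure
    [ mu,      -(mu + lam),   lam             ],    # simplex: repair or second failure
    [ 0.0,      mu,          -mu              ],    # down: repair restores simplex operation
])

# Steady state: solve pi @ Q = 0 together with sum(pi) = 1.
A = np.vstack([Q.T, np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi = np.linalg.lstsq(A, b, rcond=None)[0]

print(pi[2] * 525_960)   # expected downtime of this unit in minutes per year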

During system integration testing, dedicated test steps are used to verify the reliability structure and the correctness of the reliability models. For this purpose, hardware faults are inserted to study how the system behaves if a hardware fault occurs. Parameters used in the hardware reliability models, such as switchover performance or fault detection probability, are evaluated statistically.

[Figure: Example of a typical Markov model for a redundant system — starting from normal operation of the redundant units, the states cover simplex operation after a detected failure or an undetected failure on the standby unit, travel to the site, on-site repair, routine diagnosis with switchover, and down states due to an uncovered fault, a failure during diagnosis or a second failure during repair. Transition rates and parameters: λc failure rate of common parts, λm failure rate of minor faults, µrepair repair rate, µtravel travel rate, µroutine routine rate, µdia diagnosis rate, c coverage probability, d detection probability, r remote repair probability.]


Software reliability estimation

The reliability of software used in telecommunication networks is a crucial determinant of network reliability. Software reliability estimation is an important element of software reliability management. In particular, software reliability estimations guide the system testing process and decisions on the release of software.

Software errors are errors of logic, not of equipment; it is therefore possible to achieve 100 % reliability for small programs. The average size of program modules in the software does not as a rule exceed 1,000 statements, so they can be regarded as small modules.

Software reliability increases with testing time as error corrections are made in response to failures. During the test phase, weekly progress reports on error finding and error fixing activities are provided. These reports are used as measurements of software quality during the test phase. The measurements are compared with defined software quality objectives at given milestones.

Software reliability models, which assume that the cumulative number of software errors increases exponentially towards an asymptotic value, are applied to evaluate the software reliability and quality. With the aid of the software reliability model it is possible to estimate the number of errors in a software product and to estimate the testing time and testing resources required to reach a predefined quality level.
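Such an exponential growth model can be written as mu(t) = a * (1 - exp(-b * t)) for the expected cumulative number of failures after test time t, with failure intensity a * b * exp(-b * t). The sketch below estimates the residual errors and the additional test time needed to reach a failure intensity objective; the parameter values are purely illustrative, not figures from EWSD testing.

import math

def cumulative_failures(t, a, b):
    """Expected cumulative failures after test time t (exponential growth model)."""
    return a * (1 - math.exp(-b * t))

def extra_test_time(t_now, a, b, intensity_objective):
    """Additional test time until the failure intensity a*b*exp(-b*t) meets the objective."""
    t_target = math.log(a * b / intensity_objective) / b
    return max(0.0, t_target - t_now)

a, b = 500.0, 0.01                             # total expected errors, detection rate per test hour
print(a - cumulative_failures(300, a, b))      # ~25 errors estimated to remain after 300 test hours
print(extra_test_time(300, a, b, 0.05))        # ~161 more hours to reach 0.05 failures per hour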

A software error is defined as a departure of program operation from the specification, caused by a software problem. Because of the different characteristics and effects of software errors, and because of the error-tolerant software architecture, only a very small proportion of software errors affect the reliability of the system.

The downtime due to software errors basically depends on the frequency and duration of the recovery levels affecting the service capability of the switch.

For the estimation of the software reliability of a new system version, detailed recovery statistics from versions in the field worldwide, recovery statistics from the test bed, and recovery runtime estimations and measurements are used.

Worldwide statistics show that the share of software-related failures in the total system downtime is approximately 1 min/year on average.

[Figure: Software quality tracking during the test phase — cumulative fixed and still unfixed failures over test time, shown for all failures and separately for priority 1 failures (quality criterion for milestone B600: no occurrence of uncorrected priority 1 errors), and the failure intensity over test time compared with the failure intensity objective, indicating the additional test time required.]


Reliability data

Total system downtime

The total system downtime amounts on average to less than 1 hour in 20 years (3 min/year) for hardware and software failures. This corresponds to an overall system availability of more than 99.9994 percent.
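As a quick check, assuming a year of 365.25 days, the availability follows directly from the downtime objective:

MINUTES_PER_YEAR = 365.25 * 24 * 60        # = 525,960 minutes

availability = 1 - 3 / MINUTES_PER_YEAR    # 3 min/year total system downtime
print(f"{availability:.6%}")               # ~99.999430 %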

Hardware: The MTBF of the system due to hardware faults has been calculated at more than 600 years. The mean accumulated downtime is calculated to be in the range of 0.01 to 0.05 min/year, depending on the assumed repair time.

Software: Field performance measurements show that less than one software error in 3 years requires a recovery level affecting the service capability of the system for longer than 30 seconds. In this case the service capability can be restored in approx. 3 minutes on average. Thus the share of software-related failures in the total system downtime is less than 1 min/year on average.

[Figure: Total outage model — the redundant subsystems CCG, CP113E, SN/MB and SSNC in series, each annotated with its MTBF and downtime contribution; MTBFtotal = 1 / Σ (1/MTBFi) and total system downtime = Σ downtimei (values per configuration and repair time in the table below).]

MTBF and downtime per subsystem and repair scenario:

                          MTTR = 0.5 h            MTTR = 2 h              MTTR = 4 h
                          (0 h travel,            (1.5 h travel,          (3 h travel,
                           0.5 h repair)           0.5 h repair)           1 h repair)
                          MTBF      Downtime      MTBF      Downtime      MTBF      Downtime
                          years     min/year      years     min/year      years     min/year
CCG                       33,848    0.0009        33,511    0.0036        33,073    0.0072
CP113E (small)             5,136    0.0030         5,059    0.0052         4,941    0.0089
CP113E (large)             4,672    0.0034         4,600    0.0059         4,488    0.0101
SN/MB (small)              3,384    0.0050         3,229    0.0093         3,035    0.0184
SN/MB (large)              6,371    0.0026         6,209    0.0045         6,006    0.0080
SSNC (small)               1,030    0.0088         1,020    0.0107         1,006    0.0155
SSNC (large)               1,044    0.0039         1,031    0.0059         1,013    0.0113
Total (small)                671    0.0177           659    0.0288           643    0.0500
Total (large)                736    0.0099           726    0.0163           711    0.0294
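The totals follow from combining the subsystems in series; a quick check of the MTTR = 0.5 h column for the small configuration:

# Subsystem values taken from the MTTR = 0.5 h column, small configuration.
subsystems = {                     # MTBF in years, downtime in min/year
    "CCG":    (33848, 0.0009),
    "CP113E": (5136,  0.0030),
    "SN/MB":  (3384,  0.0050),
    "SSNC":   (1030,  0.0088),
}

mtbf_total = 1 / sum(1 / mtbf for mtbf, _ in subsystems.values())
downtime_total = sum(downtime for _, downtime in subsystems.values())

print(round(mtbf_total))           # ~671 years
print(round(downtime_total, 4))    # 0.0177 min/year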


Trunk downtime

The unavailability encountered by an individual trunk depends on failures of the central equipment as described above and failures in the peripheral equipment. The estimates show that the mean accumulated intrinsic downtime (MAIDT) for an individual termination will be less than 15 minutes per year for hardware and software faults. Thus the relevant ITU recommendation (Q.541), which requires less than 30 minutes per year, is met comfortably.

Due to the full redundancy of all central equipment, the unavailability of an individual termination is determined by the non-redundant parts (LTG).

Hardware: The MTBF for an individual trunk due to hardware faults has been calculated at more than 23 years. The mean accumulated downtime is calculated to be in the range of 2 to 14 min/year, depending on the assumed repair time and on the LTG type used (LTGN or LTGP).

Software: Field performance measurements show that less than 0.5 software errors per year per LTG require a recovery level causing a service interruption of the directly connected trunks, with a duration of about 2 minutes on average.

[Figure: Trunk downtime model — total system outage (MTBF 671 years, 0.018 min/year), LTG access via SN/MB (1,718 years, 0.011 min/year) and the non-redundant LTGP (16 years, 1.872 min/year) in series; trunk downtime = Σ downtimei = 1.901 min/year, MTBFtotal = 1 / Σ (1/MTBFi) = 16 years (values for MTTR = 0.5 h).]

                              MTTR = 0.5 h            MTTR = 2 h              MTTR = 4 h
                              (0 h travel,            (1.5 h travel,          (3 h travel,
                               0.5 h repair)           0.5 h repair)           1 h repair)
                              MTBF      Downtime      MTBF      Downtime      MTBF      Downtime
                              years     min/year      years     min/year      years     min/year
System (large)                  671     0.018           659     0.029           643     0.050
SN/MB (LTG access)            1,718     0.011         1,384     0.032         1,091     0.095
LTG
  LTGP (4 x LTG func.)           16     1.872            16     6.444            16     12.886
  LTGN                           31     1.392            31     3.648            31     6.789
Trunk (LTGP)                     16     1.901            15     6.505            15     13.031
Trunk (LTGN)                     29     1.421            29     3.709            29     6.934


Subscriber line downtime

The unavailability encountered by an individual subscriber line depends on failures of the central equipment as described above and failures in the peripheral equipment. The estimates show that the mean accumulated intrinsic downtime (MAIDT) for an individual termination will be less than 15 minutes per year for hardware and software faults. Thus the relevant ITU recommendation (Q.541), which requires less than 30 minutes per year, is met comfortably.

Due to the full redundancy of all central equipment, the unavailability of an individual termination is determined by the non-redundant parts (SLM).

Hardware: The MTBF for an individual subscriber line has been calculated at more than 5 years. The mean accumulated downtime is calculated to be in the range of 3 to 13 min/year, depending on the assumed repair time.

Software: The probability that both DLU controls or the associated LTGs will fail at the same time is negligible.

[Figure: Subscriber line downtime model — total system outage (MTBF 671 years, 0.018 min/year), LTG access via SN/MB (1,718 years, 0.011 min/year), DLUG-LTGP (39 years, 0.766 min/year) and the non-redundant SLMA (19 years, 1.575 min/year) in series; subscriber line downtime = Σ downtimei = 2.362 min/year, MTBFtotal = 1 / Σ (1/MTBFi) = 13 years (values for MTTR = 0.5 h).]

                                       MTTR = 0.5 h            MTTR = 2 h              MTTR = 4 h
                                       (0 h travel,            (1.5 h travel,          (3 h travel,
                                        0.5 h repair)           0.5 h repair)           1 h repair)
                                       MTBF      Downtime      MTBF      Downtime      MTBF      Downtime
                                       years     min/year      years     min/year      years     min/year
System (large)                           671     0.018           659     0.029           643     0.050
SN/MB (LTG access)                     1,718     0.011         1,384     0.032         1,091     0.095
DLUG-LTGP
  (incl. load-sharing failure modes)      39     0.766            39     3.066            39     6.133
SLM
  SLMA (32 lines)                         19     1.575            19     2.912            19     5.823
  SLMD (16 lines)                         14     2.078            14     4.686            14     9.372
Analog line                               13     2.362            12     6.026            12     12.081
ISDN line                                 10     2.865            10     7.801            10     15.630

Copyright (C) Siemens AG 2001

Issued by Information and Communications Group • Hofmannstraße 51 • D-81359 München

Technical modifications possible. Technical specifications and features are binding only insofar as they are specifically and expressly agreed upon in a written contract.

Order Number: A30828-X1160-P200-1-7618
Visit our Website at: http://www.siemens.com