25
www.mobilab.unin a.it EDCC-8 28 April 2010 Valencia, Spain MobiLab [email protected] Roberto Natella, Domenico Cotroneo {roberto.natella, cotroneo}@unina.it The MobiLab Group Dipartimento di Informatica e Sistemistica, Università degli Studi di Napoli Federico II Via Claudio 21, 80125 - Napoli, Italy 28 April 2010, Valencia, Spain Emulation of Transient Emulation of Transient Software Faults for Software Faults for Dependability Assessment: A Dependability Assessment: A Case Study Case Study MobiLab The 8th European Dependable Computing Conference

Roberto Natella, Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

  • Upload
    miyoko

  • View
    33

  • Download
    1

Embed Size (px)

DESCRIPTION

28 April 2010, Valencia, Spain. MobiLab. The 8 th European Dependable Computing Conference. Emulation of Transient Software Faults for Dependability Assessment: A Case Study. Roberto Natella, Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group - PowerPoint PPT Presentation

Citation preview

Page 1: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

Roberto Natella, Domenico Cotroneo{roberto.natella, cotroneo}@unina.it

The MobiLab GroupDipartimento di Informatica e Sistemistica, Università degli Studi di Napoli Federico II

Via Claudio 21, 80125 - Napoli, Italy

28 April 2010,Valencia, Spain

Emulation of Transient Software Emulation of Transient Software Faults for Dependability Faults for Dependability

Assessment: A Case StudyAssessment: A Case Study

MobiLab

The 8th European Dependable Computing Conference

Page 2: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

2 / 22

Context and problem statementSoftware Fault InjectionBohrbugs and Mandelbugs

Case study from the ATC domain Evaluation of state-of-the-art fault injection An experiment involving concurrency faults Conclusions

::. Outline

Page 3: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

::. Rationale (1/2)

Software faults represent an important cause of system failures

Despite of efforts on Verification activities, fault avoidance, and fault removal, software systems are often delivered with residual software faults

Critical systems adopt Fault Tolerance Mechanisms (FTMs) to avoid failures at run-time

3 / 22

Page 4: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

::. Rationale (2/2)

FTMs: A few examples Spatial redundancy

• CORBA FT, TANDEM90 Process Pairs Temporal redundancy

• Checkpointing and rollbackSoftware Fault Injection (SFI) is a valuable

approach for the verification and the improvement of FTMs

To correctly emulate software faults, we need to understand their features

4 / 22

Page 5: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

::. Software FaultsBohrBugs

Faults whose activation is reproducible, i.e., it is straightforward to identify its activation pattern

Typically detected and then fixed during testing phase

5 / 22

MandelBugs Faults whose activation is transient and not

systematically reproducible Their activation conditions depend on complex

combinations of user inputs, the internal state and the external environment

Page 6: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

::. Problem statement

Mandelbugs represent the major cause of failure in mission-critical system ..up to 82 % in well-tested software [2] [5] [6]

Mandelbugs are typically tolerated by the adoption of several redundancy schemes

6 / 22

Are existing SFI techniques able to emulate Mandelbugs adequately?

How should Mandelbugs be emulated?

Page 7: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

::. Software Fault Injection (SFI)

To date, representativeness of injected faults has still not been investigated with respect to: Fault manifestation; Their effectiveness in testing FTMs (i.e.,

to emulate faults that most often occur and that they should tolerate)

7 / 22

Page 8: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

::. Contributions

We aim to investigate this issue with a simple experimental campaign but…

….in a complex and real-world software sytems

We evaluated G-SWFIT, with respect to Mandelbugs We compared the results with an experiment,

specifically designed to emulate MandelbugsCase study: a fault-tolerant system from the Air Traffic

Control (ATC) domain It is a Flight Data Processor (FDPS) based on a

CORBA-compliant middleware

8 / 22

Page 9: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

::. Case study (1/2)9 / 22

Page 10: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

::. Case study (2/2)

We modeled the FDPS as a FSM to support the analysis of faults

10 / 22

A state consists of the following internal variables:1) The number of FDP

requests queued by the Façade

2) The number of requests under processing

3) The number of requests queued by Processing Servers (PSs)

CR, FR, PSC, …, are the messages exchanged in the FDPS

Page 11: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

::. Experimental campaign using G-SWFIT (1/4)

We implemented G-SWFIT fault operators in an open-source fault injection tool

The tool analyzes a C/C++ source code file, to produce a set of faulty source files Freely available at: http://www.mobilab.unina.it/SFI.htm

11 / 22

C pre-processor

C/C++Source Files

C/C++frontend

FaultInjector

+÷ 2

6 3Abstract

Syntax TreePatch Files

(with faults)

Page 12: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

::. Experimental campaign using G-SWFIT (2/4)

533 faults have been injected in the Façade source code

1599 experiments (3 different workloads)For each experiment, we collected:

• Information about a failure (e.g., Façade crash, switch to the backup, missed FDP requests)

• The state in which a failure occurred• The state in which the fault was activated

12 / 22

Page 13: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

::. Experimental campaign using G-SWFIT (3/4)

G-SWFIT is useful to test important system states (e.g., the checkpointing mechanism)

However, faults did not emulate well Mandelbugs because: A great amount of faults

(56%) manifest themselves during Façade initialization or during the first request (state 0:0:0); but Mandelbugs usually manifest themselves during the operational phase of a system

13 / 22

faults activated failures

Page 14: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

::. Experimental campaign using G-SWFIT (4/4)

14 / 22

However, faults did not emulate well Mandelbugs because (CONTINUED): In most of cases (93%) in

which the backup Façade is activated, the backup also fails (i.e., fault activation is simple to reproduce, like Bohrbugs)

Some important states (potentially affected by Mandelbugs) are untested (e.g., when one or more requests are queued by the PSs)

State coverage: 65%

Page 15: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

::. Concurrency fault emulation (1/3)

To emulate Mandelbugs, we analyzed the scientific literature on software faults

We identified the following fault triggers: Concurrency Timing of external events Wrong memory state Faulty error handlers Complex input sequences Software aging

15 / 22

Page 16: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

::. Concurrency fault emulation (2/3)

Features of most frequent concurrency faults (from a field data study [29]): They are atomicity-violation faults (49%) Only 1 shared variable is involved (66%) At most 2 threads are needed to trigger

the fault (90%)Our fault model:

2 threads access to a shared variable without acquiring a lock (race condition)

16 / 22

Page 17: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

::. Concurrency fault emulation (3/3)

We propose a fault emulation technique in two phases:

Fault injection: Collects information about critical regions and their

memory accesses Removes lock operations before and after a pair of

conflicting critical regionsTrigger injection :

Submits an input sequence to drive the system to a target state

Schedules 2 threads such that memory accesses interfere with each other

17 / 22

Page 18: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

::. Preliminary system characterization

Focusing on fault triggering we have to profile the system to recognize (and then to drive) the operating state

An input is associated to: A sequence of

messages sent and received by the Façade

A sequence of lock and memory accesses

18 / 22

sent by the tester

Page 19: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

::. How to trigger a fault?

An algorithm identifies (i) which inputs to send and (ii) in which state to send inputs to trigger a fault

The algorithm exploits preliminary information to match messages with shared memory accesses

19 / 22

Page 20: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

::. An example of concurrency fault (1/2)

The CRQ input is sent

20 / 22

Lock operation omittedThread 1 is blockedThe CR input is sentThread 2 writes aninconsistent valueThread 1 reads the(faulty) value

An algorithm processes the FSM to find when to send inputs (next slide)

Page 21: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

::. Experimental campaign using concurrency faults

4 injected concurrency faults lead to a failure of primary Façade and not the backup one

We covered 13 out of 14 states (93%) in which faults were injected

Cumulative state coverage: 95% In particular, states *:3:1

were tested (i.e., one or more requests queued by PSs)

21 / 22

Page 22: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

::. Lessons Learned

Are existing SFI techniques able to emulate Mandelbugs adequately? No, G-SWFIT should be complemented by

taking into account MandelbugsHow should Mandelbugs be emulated?

Our solution is to identify most common fault triggers, and to try to emulate them in addition to modifying the source code

22 / 22

Page 23: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

::.

Thank you!Any questions?

Page 24: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

::.

Backup slides

Page 25: Roberto Natella,  Domenico Cotroneo { roberto.natella , cotroneo}@unina.it The MobiLab Group

www.mobilab.unina.it

EDCC-828 April 2010

Valencia, SpainMobiLab

[email protected]

::. G-SWFIT fault operators

G-SWFIT fault operators were derived from a field data study [14]

Fault activation and manifestation were neglected due to lack of data

:-D :-D :-D