POPART, Rhône-Alpes
Tolerating Communication and Processor Failures in Distributed Real-Time Systems
Hamoudi Kalla, Alain Girault and Yves Sorel
Grenoble, November 13, 2003
Outline
• Introduction
• Modeling distributed real-time systems
• The Fault model
• Related work
• Processor fault tolerance
• Communication fault tolerance
• Conclusion and future work
Introduction
[Figure: design flow. A compiler turns the high-level program into a model of the algorithm. The fault-tolerant distribution and scheduling heuristic takes this model together with the architecture specification, distribution constraints, execution times, real-time constraints and failure specification, and produces a fault-tolerant distributed static schedule. A code generator then produces the fault-tolerant distributed code.]
Modeling distributed real-time systems
a. Algorithm Model
I1 and I2 are input operations (sensors)
O is the output operation (actuator)
A, B and C are computation operations
[Figure: algorithm graph over the operations I1, I2, A, B, C and O]
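An algorithm model of this kind can be written down as a directed acyclic graph of operations. The sketch below is illustrative only: the operation names come from the slide, but the figure did not survive, so the edges in `EDGES` are an assumption and not the authors' graph.

```python
from graphlib import TopologicalSorter

# Hypothetical data dependencies (producer -> consumer). The operation names
# are from the slide; the edges are assumed for illustration.
EDGES = [("I1", "A"), ("I2", "B"), ("A", "C"), ("B", "C"), ("C", "O")]


def predecessors(edges):
    """Map each operation to the set of operations it depends on."""
    preds = {}
    for src, dst in edges:
        preds.setdefault(src, set())
        preds.setdefault(dst, set()).add(src)
    return preds


def execution_order(edges):
    """Return one order of operations that respects every data dependency."""
    return list(TopologicalSorter(predecessors(edges)).static_order())
```

Any static schedule has to respect such an order: an operation can start only after all of its predecessors have produced their data.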
Modeling distributed real-time systems
b. Architecture Model
P1, P2 and P3 are processors
B1 and B2 are communication buses
[Figure: architecture with processors P1, P2 and P3 and buses B1 and B2; inset: a processor made of a computation unit, memory and co-processors]
The Fault Model
1. Tolerating a fixed number of fail-silent processors.
2. Tolerating a fixed number of fail-silent buses, with complete and partial faults.
[Figure: three copies of the architecture (P1, P2, P3, B1, B2) illustrating processor faults, complete bus faults and partial bus faults]
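The slide does not define the two kinds of bus fault precisely. The sketch below assumes one plausible reading: a complete fault cuts every processor off from the bus, while a partial fault cuts off only some of them. The `(processor, bus)` link representation and both function names are hypothetical.

```python
def classify_bus(bus, processors, failed_links):
    """Classify one bus as 'ok', 'partial' or 'complete'.

    failed_links is a set of (processor, bus) pairs that can no longer be used.
    """
    lost = {p for p in processors if (p, bus) in failed_links}
    if not lost:
        return "ok"
    return "complete" if lost == set(processors) else "partial"


def can_communicate(src, dst, buses, failed_links):
    """True if at least one bus still connects both src and dst."""
    return any(
        (src, b) not in failed_links and (dst, b) not in failed_links
        for b in buses
    )
```

Under this reading, a partially faulty bus remains usable by the processors it still reaches, which is why it is worth telling the two cases apart.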
Problem
Find a distributed schedule of the algorithm on the architecture that is fault-tolerant to processor and communication failures.
[Figure: the algorithm graph (I1, I2, A, B, C, O) scheduled onto the architecture (P1, P2, P3, B1, B2)]
Related Work (1)
1. Time-Triggered Architecture (TTA): active replication of operations and communications. (20 years = 100 master's theses and 25 doctoral theses)
2. Forward Error Correction (FEC): passive or active replication of operations and active replication of communications.
Related Work (2)
1. Time-Triggered Architecture (TTA):
Processor fault tolerance: k replicas (copies) of each operation are actively allocated to separate processors.
Communication fault tolerance: k' replicas (copies) of each communication are actively allocated to separate buses.
Related Work (3)
2. Forward Error Correction (FEC):
Processor fault tolerance: k replicas (copies) of each operation are actively or passively allocated to separate processors.
Communication fault tolerance: first, each communication is encoded with the FEC code into k' messages carrying redundant information. Next, the k' messages are actively allocated to separate buses.
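As a rough illustration of coding one communication into several messages with redundant information, the sketch below uses a single XOR parity message. This is the simplest possible erasure code, chosen only for brevity; it is an assumption and not necessarily the code used in the FEC work cited on the slide. It repairs the loss of at most one message.

```python
def fec_encode(data: bytes, n_data: int) -> list[bytes]:
    """Split data into n_data equal chunks and append one XOR parity chunk."""
    size = -(-len(data) // n_data)  # ceiling division
    padded = data.ljust(size * n_data, b"\0")
    chunks = [padded[i * size:(i + 1) * size] for i in range(n_data)]
    parity = bytes(size)
    for chunk in chunks:
        parity = bytes(a ^ b for a, b in zip(parity, chunk))
    return chunks + [parity]


def fec_decode(messages: list, length: int) -> bytes:
    """Rebuild the data when at most one message (marked None) was lost."""
    missing = [i for i, m in enumerate(messages) if m is None]
    if len(missing) > 1:
        raise ValueError("XOR parity can repair only one lost message")
    if missing:
        size = len(next(m for m in messages if m is not None))
        repaired = bytes(size)
        for m in messages:
            if m is not None:
                repaired = bytes(a ^ b for a, b in zip(repaired, m))
        i = missing[0]
        messages = messages[:i] + [repaired] + messages[i + 1:]
    # Drop the parity chunk and the padding added by fec_encode.
    return b"".join(messages[:-1])[:length]
```

With `n_data = 3` the communication becomes four messages, each of which could be sent on a separate bus; the receiver recovers the data even if one of them never arrives.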
Processor fault tolerance
Use active software replication of operations: each operation is replicated on k + 1 different processors to tolerate k processor failures.
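A minimal sketch of active replication follows. It assumes the usual convention that surviving k fail-silent processor failures takes k + 1 replicas on distinct processors, and it places replicas round-robin, which is a stand-in for, not a reproduction of, the authors' scheduling heuristic. The names `replicate` and `survives` are hypothetical.

```python
from itertools import cycle, islice


def replicate(operations, processors, k):
    """Place k + 1 replicas of every operation on distinct processors."""
    if k + 1 > len(processors):
        raise ValueError("need at least k + 1 processors")
    placement = {}
    start = 0
    for op in operations:
        # Take k + 1 consecutive processors from the ring, starting at `start`.
        placement[op] = list(islice(cycle(processors), start, start + k + 1))
        start = (start + 1) % len(processors)
    return placement


def survives(placement, failed):
    """True if every operation keeps at least one replica on a live processor."""
    return all(set(procs) - set(failed) for procs in placement.values())
```

Because all replicas run in every cycle, a processor failure needs no detection step: the surviving replicas simply deliver their results.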
Communication fault tolerance (1)
a. Use passive software replication of communications, which needs a watchdog timer.
b. Split each data communication into k messages (data fragmentation).
Communication fault tolerance (2)
a. Use passive software replication of communications, which needs a watchdog timer.
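The role of the watchdog timer can be sketched with a small discrete-time simulation. It assumes that only one copy of a message is sent at first, and that a backup copy goes out on the next bus only after the watchdog has waited `timeout` time units in vain. The function name, the time units and the bus ordering are all illustrative assumptions.

```python
def deliver(buses, failed, timeout, send_time=1):
    """Simulate passive replication of one message over an ordered list of buses.

    The message is first sent on buses[0]. If that bus has failed, the
    watchdog expires after `timeout` time units and the next bus is tried.
    Returns (bus_used, delivery_time), or (None, elapsed) if every bus failed.
    """
    clock = 0
    for bus in buses:
        if bus not in failed:
            return bus, clock + send_time
        clock += timeout  # watchdog expires before the backup copy is sent
    return None, clock
```

In the fault-free case a single message crosses a single bus, so the communication load stays low; the price is the timeout added to the delivery time whenever a bus has failed.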
Communication fault tolerance (3)
b. Split each data communication into k messages (data fragmentation).
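Data fragmentation can be sketched as follows. The slide gives no details on fragment sizes or on how the receiver reacts, so the near-equal split and the index-based bookkeeping below are assumptions made for illustration.

```python
def fragment(data: bytes, k: int) -> list[bytes]:
    """Split data into k fragments whose sizes differ by at most one byte."""
    base, extra = divmod(len(data), k)
    fragments, start = [], 0
    for i in range(k):
        end = start + base + (1 if i < extra else 0)
        fragments.append(data[start:end])
        start = end
    return fragments


def missing_fragments(received: dict, k: int) -> list[int]:
    """Indices of the fragments that have not arrived."""
    return [i for i in range(k) if i not in received]


def reassemble(received: dict, k: int) -> bytes:
    """Concatenate the k fragments, given as {index: bytes}, in index order."""
    missing = missing_fragments(received, k)
    if missing:
        raise ValueError(f"fragments {missing} have not arrived")
    return b"".join(received[i] for i in range(k))
```

If each fragment travels on its own bus, the indices returned by `missing_fragments` point at the buses that need attention, and only those fragments have to be sent again.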
Communication fault tolerance (3)
Why data fragmentation of communications?
1. It distinguishes between complete and partial communication faults.
Communication fault tolerance (4)
Why data fragmentation of communications?
2. It enables rapid recovery from processor and bus failures.
Recovery from failures (1)
1. Processor fault
Recovery from failures (2)
2. Partial bus fault
Recovery from failures (3)
3. Complete bus fault
Example (1)
Example (2)
Conclusion and future work
Result
A new method to tolerate both communication and processor failures in distributed real-time systems, which may reduce the load and the overhead of recovery from failures.
Future work
Implementation of the proposed method in the SynDEx tool.
Simulations.
Questions?