24
Institute for Software Integrated Systems Vanderbilt University Design Environment for Fault-Adaptive Systems Ted Bapty Sandeep Neema Sweta Shetty, Steve Nordstrom, Divya Vashishtha, Jason Scott, Jason Overdorf Vanderbilt Univ.

Institute for Software Integrated Systems Vanderbilt University Design Environment for Fault- Adaptive Systems Ted Bapty Sandeep Neema Sweta Shetty, Steve

Embed Size (px)

Citation preview

Institute for Software Integrated SystemsVanderbilt University

Design Environment for Fault-Adaptive Systems

Ted BaptySandeep Neema

Sweta Shetty, Steve Nordstrom, Divya Vashishtha, Jason Scott, Jason Overdorf

Vanderbilt Univ.

BTeV RTES TeamNSF/ITR

• Fermilab– Building BTeV Trigger Hardware– Domain Experts, Define Goals, Constraints, etc.

• Vanderbilt– RTES Lead (Physics)– Design Environment, System Synthesis, System

Integration, Prototype Hardware

• UIUC– ARMOR, Fault Tolerant Middleware

• Syracuse & Pitt– Very Lightweight Agents, Diagnostics, Load Balancing

High Energy Physics

FermiLab Accelerator

BTeV Experiment

Particle Measurement

0 12 m

pp

DipoleRICH

EM

C

al

Had

ron

A

bsor

ber

Muo

n

Tor

oid

± 30

0 m

rad

Magnet

Forward tracker provides:•Momentum measurement•Pattern recognition for tracks born in decays downstream of vertex detector•Projection of tracks into particle ID devices

Detector Grids

Problem: -Massive amounts of data (Terabytes/Sec)-Determine the set of particle trajectories-Decide if it is interesting, keep or toss-Hardware => 2500 DSP’s + 2500 PC’s-Never Fail (ok to degrade)

Trigger System(20,000 ft. view)

Memory Queue, ~ms

2nd Level(PC)

Pre-Process(FPGA)

1st Level(DSP)

Store

~2000Nodes

~2000Nodes

System Constraints

• Triple-Mode Redundancy == Too Expensive– Some Over-capacity designed in

• Parallel System, Real-Time– Heterogeneous Processors– RT Constraints = Queue Length.

• No Generic Response to Faults– Based on application requirements– Based on system state– Based on available resources

Fault Mitigation

• System has excess capacity– But not much… (~10%)– Cannot pre-plan use of redundancy– Excess capacity may be used for

‘disposable tasks’

• Fault Occurs:– React quickly to regain minimal function– Rearrange Resources to make Best Use of

Remaining Resources– User-defined recovery behavior

Reflex + Healing

• ‘Reflex Action’: – Simple,– Rapid, – Real-Time, Guaranteed Response Time,– Sub-Optimal– Handle a Single Failure

• ‘Healing’– Re-Evaluate Resources & Tasks– Re-Balance/Re-Allocate Resources– Recover Failed Resources (After Testing)– Generate New Reflex Actions

ReflexMitigation Example

Primary Task

Secondary Task

1. Normal Operation

2. Processor Failure

3. Subdivide Primary Task

5. Replace Secondary Task

4. Migrate to Adjacent Processors

User-DefinedMitigation Actions

5 +. Reset/Test Failed Processor

HealingMitigation Example

Primary Task

Secondary Task

1. Normal Operation

2. Processor Failure Reflex Action

3. Update Models

5. Re-Plan System

4. Re-Evaluate Resources

Mitigation Actions

6. Rearrange tasks

Re-EvalRe-Plan

Design Issues

• Complex System– Thousands of Processors– High Data Rates– Real-Time Constraints

• User-Defined Behaviors– Domain-Specific Design Tool– System-Specific Implementation

• Run-Time Implementation– Heterogeneous Architecture– Real-Time - Execution & Mitigation– Fault-Tolerant

Analysis

Local Oper.Manager

LocalFaultMgr

TrigAlgo.

ARMOR/RTOS

TrigAlgo.Trig

Algo.Trig

Algo.

Logical C

ontrol N

etwork

L1/DSP

Local Oper.Manager

LocalFaultMgr

TrigAlgo.

ARMOR/RTOS

TrigAlgo.Trig

Algo.Trig

Algo.

Log

ical

Dat

a N

et

DSP

Local OperManager

LocalFaultMgr

TrigAlgo.

ARMOR/Linux

TrigAlgo.Trig

Algo.Trig

Algo.

Log

ical

Dat

a N

et

Logical C

ontrol Netw

ork

RISC

Local OperManager

LocalFaultMgr

TrigAlgo.

ARMOR/Linux

TrigAlgo.Trig

Algo.Trig

Algo.

L2,3/RISC

Region Operations Mgr

RegionFault Mgr

Runtime

Design and

Analysis

Reconfig Behavior

Algorithm Fault Behavior

Resource

Syn

thes

is

PerformanceSimulation

DiagnosabilityAnalysis

ReliabilityAnalysis

System Models

Soft Real-Time Hard

ExperimentInterface

Synthesis

Feed

back

Model Integrated Computing

Logical C

ontrol N

etwork

Global Operations Manager

Global Fault Manager

Modeling Language

Processing & Data Flow

Concepts:Processes, streams, data channels, Functions, data types,communication

Hardware Resources

Concepts:Processors, Memory,Topology, Reliability,Failure Modes,…

Full

Recov.Mode 1

RecovMode 3

Recov.Mode 2

Concepts:Recovery Strategies, Modesof Operation, goals/importance

Hierarchical Fault Management

Resource Models

=

Capture Hardware Resources• Nodes• Networks– Attributes– Hierarchy

Algorithm Models

• Processes

• Info Flow

° Interfaces

° Hierarchy

Fault Mitigation Models

Regional Manager

Local Manager

• Finite State Machine•Parallel, Hierarchical

• Events & Transitions• Mitigation Actions• Time Specs

State

Transition

MitigationActions

Conditions

System Generation

Algorithms

Schedules

CommMaps

SWLoads

Resources

OS Cfg

TaskAssign

BootMaps

Generation of Reflex Networks

State A

State CState B

Action AC1Action AC2Action AC3

Action AB1Action AB2Action AB3

SystemFaultState

Primary Struct.

Reflex Struct.

ON (L76 Fail) DODel P1

Conn P1,S22Map S22, C3

Kill T22Migrate T33,

Reflex Scripts

* 1 Set for Each ProcessorAnd failure type

Model-Based Healing

System Hardware

MIC Healing Controller

NominalF

ault

s

UpdateModel

Re-Balance New

Reflex

Interface

Tasks,Links

ReflexProgram

ReflexReflex

Runtime Environment

Global ManagerReflex ActionsMitigation Engine

Regional ManagerMitigation Engine

DSP Kernel

Local ManagerMitigation Engine

Actions FeedBack

Actions FeedBack

Reflex Actions

Reflex Actions

ExperimentInterface

ModelInterface

DSP Hardware

Fault Mitigation Interface

• Fault Mitigation Interface:– The FMA interfaces with the local diagnostics facility (receive local status,

clear errors, trigger rediagnosis, set diagnosis mode, etc.• Commands

– RETRY_LINK(link_id)• Function: Reset/resync a comm link,• Returns: failure or success

– REROUTE_LINK(link_id)• Function: Reroute communications through a separate link

– ADD_TASK(task_id, link_id)• Function: Adds a task to the task list, operate on data from link_id

– TEST_MEMORY(memory_bank)• Function: Intensive test on memory bank

– RELOCATE_DATA(from_bank, to_bank)• Function: Moves data, marks source memory bank as unused/unavail

– GET_LOCAL_STATUS• Function: Reports status of a resource on a local node

– SEND_MESSAGE– RECEIVE_MESSAGE– . . .

Synthesis: Analysis/Offline

• Simulation– Functional (e.g. Matlab)– Performance (Timing, Discrete Event)– Interfacing/generating to Swarms/Jackal/TAEMS

• Diagnosability– Failure Modes + Sensors– Predict ability to Detect/Isolate Failures

• Reliability Analysis– Predict MTBF, Maximum Failures– Robustness

• Stability Analysis– Reconfiguration Strategies/Control System

System Simulation

System Model

Task Model

Communication Model

Summary

• Developing Model-Based Approach– Capture Algorithm, Resource, and Mitigation

Aspects

• Generation of Software– Normal application Code– Fault Mitigation Code

• Two Fault Mitigation Approaches– Reflex: Fast, Limited Response– Healing: Slower, system ‘re-design’

• Analysis & Simulation• Runtime Infrastructure