Upload
julissa-blowers
View
215
Download
1
Embed Size (px)
Citation preview
Institute for Software Integrated SystemsVanderbilt University
Design Environment for Fault-Adaptive Systems
Ted BaptySandeep Neema
Sweta Shetty, Steve Nordstrom, Divya Vashishtha, Jason Scott, Jason Overdorf
Vanderbilt Univ.
BTeV RTES TeamNSF/ITR
• Fermilab– Building BTeV Trigger Hardware– Domain Experts, Define Goals, Constraints, etc.
• Vanderbilt– RTES Lead (Physics)– Design Environment, System Synthesis, System
Integration, Prototype Hardware
• UIUC– ARMOR, Fault Tolerant Middleware
• Syracuse & Pitt– Very Lightweight Agents, Diagnostics, Load Balancing
Particle Measurement
0 12 m
pp
DipoleRICH
EM
C
al
Had
ron
A
bsor
ber
Muo
n
Tor
oid
± 30
0 m
rad
Magnet
Forward tracker provides:•Momentum measurement•Pattern recognition for tracks born in decays downstream of vertex detector•Projection of tracks into particle ID devices
Detector Grids
Problem: -Massive amounts of data (Terabytes/Sec)-Determine the set of particle trajectories-Decide if it is interesting, keep or toss-Hardware => 2500 DSP’s + 2500 PC’s-Never Fail (ok to degrade)
Trigger System(20,000 ft. view)
Memory Queue, ~ms
2nd Level(PC)
Pre-Process(FPGA)
1st Level(DSP)
Store
~2000Nodes
~2000Nodes
System Constraints
• Triple-Mode Redundancy == Too Expensive– Some Over-capacity designed in
• Parallel System, Real-Time– Heterogeneous Processors– RT Constraints = Queue Length.
• No Generic Response to Faults– Based on application requirements– Based on system state– Based on available resources
Fault Mitigation
• System has excess capacity– But not much… (~10%)– Cannot pre-plan use of redundancy– Excess capacity may be used for
‘disposable tasks’
• Fault Occurs:– React quickly to regain minimal function– Rearrange Resources to make Best Use of
Remaining Resources– User-defined recovery behavior
Reflex + Healing
• ‘Reflex Action’: – Simple,– Rapid, – Real-Time, Guaranteed Response Time,– Sub-Optimal– Handle a Single Failure
• ‘Healing’– Re-Evaluate Resources & Tasks– Re-Balance/Re-Allocate Resources– Recover Failed Resources (After Testing)– Generate New Reflex Actions
ReflexMitigation Example
Primary Task
Secondary Task
1. Normal Operation
2. Processor Failure
3. Subdivide Primary Task
5. Replace Secondary Task
4. Migrate to Adjacent Processors
User-DefinedMitigation Actions
5 +. Reset/Test Failed Processor
HealingMitigation Example
Primary Task
Secondary Task
1. Normal Operation
2. Processor Failure Reflex Action
3. Update Models
5. Re-Plan System
4. Re-Evaluate Resources
Mitigation Actions
6. Rearrange tasks
Re-EvalRe-Plan
Design Issues
• Complex System– Thousands of Processors– High Data Rates– Real-Time Constraints
• User-Defined Behaviors– Domain-Specific Design Tool– System-Specific Implementation
• Run-Time Implementation– Heterogeneous Architecture– Real-Time - Execution & Mitigation– Fault-Tolerant
Analysis
Local Oper.Manager
LocalFaultMgr
TrigAlgo.
ARMOR/RTOS
TrigAlgo.Trig
Algo.Trig
Algo.
Logical C
ontrol N
etwork
L1/DSP
Local Oper.Manager
LocalFaultMgr
TrigAlgo.
ARMOR/RTOS
TrigAlgo.Trig
Algo.Trig
Algo.
Log
ical
Dat
a N
et
DSP
Local OperManager
LocalFaultMgr
TrigAlgo.
ARMOR/Linux
TrigAlgo.Trig
Algo.Trig
Algo.
Log
ical
Dat
a N
et
Logical C
ontrol Netw
ork
RISC
Local OperManager
LocalFaultMgr
TrigAlgo.
ARMOR/Linux
TrigAlgo.Trig
Algo.Trig
Algo.
L2,3/RISC
Region Operations Mgr
RegionFault Mgr
Runtime
Design and
Analysis
Reconfig Behavior
Algorithm Fault Behavior
Resource
Syn
thes
is
PerformanceSimulation
DiagnosabilityAnalysis
ReliabilityAnalysis
System Models
Soft Real-Time Hard
ExperimentInterface
Synthesis
Feed
back
Model Integrated Computing
Logical C
ontrol N
etwork
Global Operations Manager
Global Fault Manager
Modeling Language
Processing & Data Flow
Concepts:Processes, streams, data channels, Functions, data types,communication
Hardware Resources
Concepts:Processors, Memory,Topology, Reliability,Failure Modes,…
Full
Recov.Mode 1
RecovMode 3
Recov.Mode 2
Concepts:Recovery Strategies, Modesof Operation, goals/importance
Hierarchical Fault Management
Fault Mitigation Models
Regional Manager
Local Manager
• Finite State Machine•Parallel, Hierarchical
• Events & Transitions• Mitigation Actions• Time Specs
State
Transition
MitigationActions
Conditions
Generation of Reflex Networks
State A
State CState B
Action AC1Action AC2Action AC3
Action AB1Action AB2Action AB3
SystemFaultState
Primary Struct.
Reflex Struct.
ON (L76 Fail) DODel P1
Conn P1,S22Map S22, C3
Kill T22Migrate T33,
Reflex Scripts
* 1 Set for Each ProcessorAnd failure type
Model-Based Healing
System Hardware
MIC Healing Controller
NominalF
ault
s
UpdateModel
Re-Balance New
Reflex
Interface
Tasks,Links
ReflexProgram
ReflexReflex
Runtime Environment
Global ManagerReflex ActionsMitigation Engine
Regional ManagerMitigation Engine
DSP Kernel
Local ManagerMitigation Engine
Actions FeedBack
Actions FeedBack
Reflex Actions
Reflex Actions
ExperimentInterface
ModelInterface
DSP Hardware
Fault Mitigation Interface
• Fault Mitigation Interface:– The FMA interfaces with the local diagnostics facility (receive local status,
clear errors, trigger rediagnosis, set diagnosis mode, etc.• Commands
– RETRY_LINK(link_id)• Function: Reset/resync a comm link,• Returns: failure or success
– REROUTE_LINK(link_id)• Function: Reroute communications through a separate link
– ADD_TASK(task_id, link_id)• Function: Adds a task to the task list, operate on data from link_id
– TEST_MEMORY(memory_bank)• Function: Intensive test on memory bank
– RELOCATE_DATA(from_bank, to_bank)• Function: Moves data, marks source memory bank as unused/unavail
– GET_LOCAL_STATUS• Function: Reports status of a resource on a local node
– SEND_MESSAGE– RECEIVE_MESSAGE– . . .
Synthesis: Analysis/Offline
• Simulation– Functional (e.g. Matlab)– Performance (Timing, Discrete Event)– Interfacing/generating to Swarms/Jackal/TAEMS
• Diagnosability– Failure Modes + Sensors– Predict ability to Detect/Isolate Failures
• Reliability Analysis– Predict MTBF, Maximum Failures– Robustness
• Stability Analysis– Reconfiguration Strategies/Control System
Summary
• Developing Model-Based Approach– Capture Algorithm, Resource, and Mitigation
Aspects
• Generation of Software– Normal application Code– Fault Mitigation Code
• Two Fault Mitigation Approaches– Reflex: Fast, Limited Response– Healing: Slower, system ‘re-design’
• Analysis & Simulation• Runtime Infrastructure