7/31/2019 Self Healing Systems Lecture
Fault-tolerance Techniques for Software Systems
Dipankar Das
GM R&D, India Science Lab
March 2012, IIT Kharagpur
Automotive ECS Trends
100 million lines of code executing on a distributed embedded system of 50-70 ECUs and 7+ buses
2000 features with high interdependencies
More complex features in future
Hybrid PT, Fuel Cell, Displacement on Demand, Braking
65 nm and sub-65 nm design of memory and processors
Short development/testing time -- market pressure
Multi-core/Distributed implementations
[Figure: Electronics & software growth in the last decade: $400 to $1182 (+196%); 20 ECUs to 50 ECUs (+150%); 1M lines of code to 100M lines of code (+9900%)]
Autonomic Computing
A computer system which can recover from faults without human intervention.
Continued execution in spite of faults (availability ++)
Short time between fault-detection and correction (security and maintenance ++)
The broader topic is self-aware systems, introduced by the IBM Autonomic Computing effort
Self-healing + self-optimizing + self-protecting + self-configuring
Self-* requires: self-awareness and context-awareness
Self-awareness: aware of self-state and behaviors
Context-awareness: aware of context of operation (i.e. the environment)
Self-Aware Systems
Self-configuring: reconfigure automatically and dynamically in response to changes in installation, update, integration and composition.
Self-optimizing: adjust performance and resource allocation to satisfy users.
Main concerns: QoS, time-delay, utilization
Self-protecting: detect and recover from security breaches.
Self-healing: self-diagnosis and self-repair. Can recover from disruptions, either due to the environment or the system.
Fault-tolerance for SW Systems
Adaptation processes in a self-adaptive system [1]
Fault-tolerance for SW Systems
The goal is fault-tolerance:
Fault -> Error (can be logged) -> Failure (undesirable effects)
Prevent faults from graduating into failures!
Applicable to both random and systematic faults.
Most hardware faults are random (wear and tear)
Most software faults are systematic (bugs); some hardware faults are also systematic (HW is after all a program!)
Alternative strategies for fault-tolerance.
Replication, Diversity, Recovery-block/re-execution
Repair the current run vs. correct future runs (or both).
Will focus mostly on systematic faults
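The recovery-block strategy named above can be sketched in C as follows. This is a minimal illustration, not code from the lecture: the names `recovery_block`, `variant_fn`, `acceptance_fn` and the integer square root variants are all hypothetical. Each variant is tried in turn and the first result that passes the acceptance test is kept, so a systematic bug in the primary does not graduate into a failure.

```c
#include <stdbool.h>
#include <stddef.h>

/* A variant computes a result; it returns false if it cannot (hypothetical type). */
typedef bool (*variant_fn)(int input, int *result);
/* The acceptance test decides whether a result is acceptable for the input. */
typedef bool (*acceptance_fn)(int input, int result);

/* Try each variant in order and keep the first result that passes the
 * acceptance test. Returns the index of the variant used, or -1 if none passed. */
int recovery_block(const variant_fn *variants, size_t n,
                   acceptance_fn accept, int input, int *result)
{
    for (size_t i = 0; i < n; i++) {
        int candidate = 0;  /* fresh state for every attempt */
        if (variants[i](input, &candidate) && accept(input, candidate)) {
            *result = candidate;
            return (int)i;
        }
    }
    return -1;  /* every variant failed: the fault becomes a failure */
}

/* Primary variant with a deliberate systematic bug: x / 2 is not the square root. */
bool isqrt_primary(int x, int *r)
{
    *r = x / 2;
    return true;
}

/* Alternate variant: slow but simple linear search. */
bool isqrt_alternate(int x, int *r)
{
    if (x < 0) return false;
    int k = 0;
    while ((long long)(k + 1) * (k + 1) <= x) k++;
    *r = k;
    return true;
}

/* Acceptance test: r is the integer square root of x. */
bool isqrt_accept(int x, int r)
{
    return r >= 0 && (long long)r * r <= x && (long long)(r + 1) * (r + 1) > x;
}
```

Called with the variants `{ isqrt_primary, isqrt_alternate }` and input 17, the primary's answer is rejected by the acceptance test and the alternate's answer is used. The acceptance test is the weak point of the design: it has to be much simpler than the variants it guards.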
Examples of Systematic Faults
Systematic Faults
A design fault which is built into the system
May be triggered randomly but is not random
Nondeterministic interleaving of events
Causes of systematic faults:
Incorrect/incomplete/inconsistent requirements
Software is inconsistent with requirements
Implementation errors: memory object overrun, type faults, data races, incorrect synchronization
Configuration errors: incorrect estimation of message queue size.
Stack Faults -- buffer overflow
A buffer overflow occurs when you try to put too many bits into an allocated buffer.
When this happens, the next contiguous chunk of memory is overwritten, such as
Return address
Function pointer
Previous frame pointer, etc.
Attack code can also be injected this way.
This can lead to a serious security problem.
The Typical AUTOSAR Stack
Example Code
Overwriting Return Address
Illegal Event Sequence Example
Nondeterministic: will trigger only if Task A is interrupted by Interrupt B
Very hard to reproduce!
Handling illegal sequences:
Locks to ensure mutual exclusion
Contention-free design of software and system
Software transactional memories
Re-execute code in case of data contention
[Figure: a task (Read X, process based on value of X, Write X, complete processing) is preempted by a hardware interrupt whose Interrupt Service Routine also performs Read X, Process, Write X]
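The "re-execute on contention" idea can be sketched with a C11 compare-and-swap loop. This is a simplified stand-in for a software transactional memory, under stated assumptions: `shared_x`, `process`, `update_x` and `isr_write_x` are hypothetical names, and the preemption is simulated by a callback rather than a real interrupt.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Shared variable X, written by both the task and the interrupt handler. */
static atomic_int shared_x;

/* Hypothetical processing step: the new value depends on the value read. */
static int process(int x) { return x * 2 + 1; }

/* Read X, process, write X. If X changed between the read and the write
 * (for example because an interrupt service routine ran), the write is
 * abandoned and the whole sequence is re-executed.
 * `interference` simulates one preemption; pass NULL for none.
 * Returns the number of attempts needed. */
int update_x(void (*interference)(void))
{
    int attempts = 0;
    for (;;) {
        attempts++;
        int seen = atomic_load(&shared_x);   /* Read X */
        int next = process(seen);            /* process based on value of X */
        if (interference != NULL) {          /* simulated preemption, once */
            interference();
            interference = NULL;
        }
        /* Write X only if it still holds the value that was read. */
        if (atomic_compare_exchange_strong(&shared_x, &seen, next))
            return attempts;
    }
}

/* Simulated interrupt service routine that also writes X. */
void isr_write_x(void) { atomic_store(&shared_x, 10); }
```

Without interference the update completes in one attempt. With the simulated interrupt, the first write is rejected and the task recomputes from the value the interrupt left behind, so the stale result is never stored.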
Examples of Random Faults
Soft Errors
It is an issue if silicon is < 65 nm. State-of-the-art is 28 nm.
Effects are seen at software level; recovery/prevention strategies can be built at software level.
Impact of Neutron Strike on a Si Device
Secondary source of upsets: alpha particles from packaging
Strikes release electron & hole pairs that can be absorbed by source & drain to alter the state of the device
[Figure: neutron strike on a transistor device, releasing + and - charges between source and drain]
Strike on state bit (e.g., in register file)
[Flowchart]
Is the bit read?
  no: benign fault, no error
  yes: does the bit have error protection?
    yes, error can be corrected (e.g., ECC): no error
    yes, error is only detected (e.g., parity + no recovery): Detected, but unrecoverable error (DUE)
    no: does the bit matter?
      yes: Silent Data Corruption (SDC)
      no: benign fault, no error
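The "detected only" branch of the flowchart (parity, no recovery) can be illustrated with a short C sketch. The type `protected_word` and the functions `pw_store` and `pw_check` are hypothetical names chosen for this example; real register files implement the check in hardware.

```c
#include <stdint.h>
#include <stdbool.h>

/* Parity of a 32-bit word: 1 if the number of set bits is odd, else 0. */
static unsigned parity32(uint32_t w)
{
    w ^= w >> 16;
    w ^= w >> 8;
    w ^= w >> 4;
    w ^= w >> 2;
    w ^= w >> 1;
    return w & 1u;
}

/* A stored word together with the parity bit computed when it was written. */
typedef struct {
    uint32_t data;
    unsigned parity;
} protected_word;

protected_word pw_store(uint32_t value)
{
    protected_word p = { value, parity32(value) };
    return p;
}

/* Returns true if the word passes the parity check on read.
 * A single flipped bit is detected, but it cannot be located or corrected. */
bool pw_check(const protected_word *p)
{
    return parity32(p->data) == p->parity;
}
```

Flipping one bit of `data` after `pw_store` makes `pw_check` fail, which corresponds to a DUE. Flipping two bits restores the parity and goes unnoticed, which is one way a strike can still end in silent data corruption.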
Some Acronyms
SDC = Silent Data Corruption
DUE = Detected & unrecoverable error
SER = Soft Error Rate = Total of SDC & DUE
Evidence of Cosmic Ray Strikes
Documented strikes in large servers found in error logs
Normand, Single Event Upset at Ground Level, IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December 1996.
Sun Microsystems, 2000
Cosmic ray strikes on L2 cache with defective error protection caused Sun's flagship servers to suddenly and mysteriously crash!
Companies affected
Baby Bell (Atlanta), America Online, eBay, & dozens of others
Verisign moved to IBM Unix servers (for the most part)
Toyota Prius Recall http://www.reuters.com/article/idUSTRE6293IC20100310
Plant Faults
The plant is the, mostly mechanical, component of a cyber-physical system which is controlled by electronics and software.
Plant faults mainly result from mechanical wear and tear.
They result in deviation from expected behavior.
Hardware Faults
[Figure: S-a-1 (stuck-at-1) fault on the 5th bit of an address line]
A State-based Definition of System Behavior
The Computer System
[Figure: block diagram of a computer system node with labels I, Q-mem, Q-com, CC and P; components: instruction memory, data memory, load/store queue, communication queue, communication controller (connected to communication medium-1 ... medium-k), and processor core with register file]
Computer System State
The state of a computer system is the status of all the components for all computer system nodes
Currently executing instruction/statement
Current status of communication and load/store queues
Current status of instruction and program memory
Current status of communication controllers; this indirectly captures messages being transmitted in the system.
State $= (I_0, M_0, Q_0, P_0, CC_0, \ldots, I_k, M_k, Q_k, P_k, CC_k)$
[Figure: each instruction memory $I_i$ is illustrated by a listing such as]
0x0393 mov r1, G5
0x0394 mov r2, G256
0x0395 st G r2, r1
0x0396 mov r3, B5
0x0397 mov r4, B256
0x0398 st B r4, r3
System + Environment
The state of the environment captures
The state of the mechanical/biological/chemical components (the plant).
The input generation system (users)*
State $= (I_0, M_0, Q_0, P_0, CC_0, \ldots, I_k, M_k, Q_k, P_k, CC_k, \text{Plant}, \text{Users}^*)$
[Figure: the instruction-memory listing at addresses 0x0393-0x0398 illustrates each $I_i$]
Execution Space: Set of All Possible States
[Figure: the execution space drawn as a region of states, including a system-shutdown state]
State $= (I_0, M_0, Q_0, P_0, CC_0, \ldots, I_k, M_k, Q_k, P_k, CC_k, \text{Plant})$
System Behavior
[Figure: execution space containing states S1, S2, S3, S4 and a system-shutdown state; one kind of arrow marks a transition guided by implemented system behavior, another a transition due to the environment]
Runs are sequences of states visited by the system
Transitions indicate a state change
Transitions may be due to implemented system behavior, or due to environment conditions (user pressing keys, mechanical system's response)
Some transitions (implementation behavior + environment behavior) can be faulty, others can be fault-free or benign
Functional Properties Characterize States
[Figure: execution space with a start state (input1 = 5, input2 = 3), a state where output = 15, and a system-shutdown state]
MULT: For all runs starting from this state, if input1 = 5 and input2 = 3, then there is a state in each run where output = 15
Acceptable States
Acceptable States: Execution states which satisfy a weaker acceptability property (A). Acceptability properties are basic properties which the system designer deems should be satisfied.
Ex: System must not crash vs. functional property MULT
[Figure: execution space with the correctness envelope inside the acceptability envelope; the region between them is incorrect but acceptable]
Unacceptable Runs
[Figure: execution space with an unacceptable run leaving the acceptability envelope]
Unacceptable Run: One which goes through one or more unacceptable states
Examples:
The return address of a function is overwritten by a buffer overflow
The result of a solution varies from golden value by more than 20%.
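The difference between a correctness property such as MULT and a weaker acceptability property such as the 20% rule can be made concrete with a small C sketch. The function names and the `rel_tol` parameter are assumptions made for this illustration.

```c
#include <stdbool.h>

static double abs_d(double v) { return v < 0 ? -v : v; }

/* Correctness property in the style of MULT: the output is exactly the product. */
bool is_correct(int in1, int in2, int out)
{
    return out == in1 * in2;
}

/* Acceptability property: the result deviates from the golden value
 * by at most rel_tol (0.20 for the 20% example). */
bool is_acceptable(double result, double golden, double rel_tol)
{
    return abs_d(result - golden) <= rel_tol * abs_d(golden);
}
```

For inputs 5 and 3, an output of 16 is incorrect (`is_correct` is false) but still acceptable at 20% tolerance, while 19 falls outside the acceptability envelope.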
Unacceptable Runs
[Figure: execution space with an unacceptable run leaving the acceptability envelope; transitions due to system behavior are marked]
Unacceptable runs may be due to incorrect system implementation or due to environment-triggered transitions
Fail-Stop Execution
A mechanism which forces the execution to be stopped when the acceptability boundary is breached.
Example: Program execution is stopped when the return address (on the stack) is overwritten (in GCC 4.1 and above, using the ProPolice mechanism)
[Figure: a run is stopped (STOP) where it crosses the acceptability envelope]
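A guard-word check of the kind used for stack protection can be simulated safely in C. This is only a sketch of the idea, not the ProPolice implementation: the `frame` struct, the `CANARY` value and the function names are hypothetical, and the "overflow" is confined to the struct so the example never touches memory outside it.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define CANARY 0xDEADC0DEu  /* hypothetical guard value */

/* Simulated stack frame: a local buffer followed by a guard word that
 * stands in for the data next to the saved return address. */
typedef struct {
    char     buf[8];
    uint32_t guard;
} frame;

void frame_init(frame *f)
{
    memset(f->buf, 0, sizeof f->buf);
    f->guard = CANARY;
}

/* Unchecked copy into the frame, starting at buf. Lengths above
 * sizeof buf spill into the guard. The length is clamped to the frame
 * itself so the simulation stays within *f. */
void frame_write(frame *f, const void *src, size_t n)
{
    if (n > sizeof *f) n = sizeof *f;
    memcpy(f, src, n);
}

/* Checked before the function returns: true means the acceptability
 * boundary was breached and execution must stop. */
bool frame_smashed(const frame *f)
{
    return f->guard != CANARY;
}
```

In a fail-stop design the caller would halt (for instance with `abort()`) as soon as `frame_smashed` returns true, instead of returning through a possibly corrupted address.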
Safe Exit
A mechanism which forces the execution to be altered such that it goes to a safe exit point, before halting the program.
Example: Releasing locks before a program stops due to a detected stack smashing attack. Writing back EEPROM before shutdown in automotive ECUs.
[Figure: a run is diverted to a safe exit point (STOP) inside the acceptability envelope]
Resilient Execution
A mechanism which takes corrective action such that the execution remains within the acceptability envelope.
Two types of corrections: those which lead to the run entering the correctness envelope, and those which do not lead to correctness.
Examples: Reactive systems which execute code cyclically, with each iteration reading an input and producing the corresponding output. Resetting of automotive controllers on detection of stack overflows.
[Figure: execution space with the correctness envelope inside the acceptability envelope]
Self-healing Execution
Self-healing = Resilient execution (acceptability) + ensuring that the faulty run does not happen in the future (Continued execution + Automated Repair)
If same inputs/environment conditions are available then
How do we prove that the corrected run and the modified behavior are acceptable?
[Figure: execution space with the correctness envelope inside the acceptability envelope]
Abstraction of the State Space
Losing information to solve problems
Discretization of Time
How can we model the plant?
How about the memory? Is it not a set of transistors, each having analogue behavior?
When trying to define a reasonable state system we perform discretization on the variable time.
State $= (I_0, M_0, Q_0, P_0, CC_0, \ldots, I_k, M_k, Q_k, P_k, CC_k, \text{Plant})$
Abstraction of State
There are multiple abstractions of data forming a lattice.
Lattice = Elements + partial-order (more details)
Many nice properties follow when we add/reduce information in this manner
Do we have a lattice for the discretization of time?
[Figure: lattice of abstractions of the same memory image (bit string 001010011010100101), ordered from least information (top, T) to most information; levels annotate the memory with types (int, short, double, return address), a static segment and stack frame 1, and 64-bit cells. Many more abstractions are possible.]
Level of Abstraction = Our Interpretation of Program
double = double; short = short;
Static segment bound between [X,Y]
double = double + double; return-address = 151; (existence of a static segment)
B = 10.1; return-address = 151; (memory objects in static segment)
B = 10.1; return-address = 151; A = 2; (raw memory: 001010011010100101)
[Figure: the same memory image annotated at each level: typed cells (int, short, B (double), ret), static segment, stack frame 1]
Level of Abstraction = What Checks Can Be Performed
double = double; short = short; (type checks)
double = double + double; return-address = 151; (corruption of return address)
B = 10.1; return-address = 151; (reason about B's value: partial functional correctness)
B = 10.1; return-address = 151; A = 2; (everything in memory)
[Figure: the same memory image annotated at each level: typed cells (int, short, B (double), ret), static segment, stack frame 1]
Granularity of Time = What Checks Can Be Performed
Sequence of events:
Buffer B overflows and overwrites ret
Function corresponding to stack-frame-1 completes and returns
Task-time budget completes and context switch happens
Check at context switch
Can we detect the overwriting of the return address?
[Figure: successive memory images (static segment, stack frame 1 holding Double[] B and ret) at each step]
Key difference: data abstraction vs. time granularity
We can refine our view to observe additional faults and take necessary actions.
No scope for refinement after the completion of actions
Granularity of Time = What Healing Can Be Done
Sequence of events:
Buffer B overflows and overwrites ret.
Function corresponding to stack-frame-1 completes and returns
Program Sequence Monitor catches fault after 10,000 instructions.
Task-time budget completes and context switch happens.
Check at context switch
Can we recover from the fault?
[Figure: successive memory images (static segment, stack frame 1 holding Double[] B and ret) at each step]
Often possible to log faults and detect them later.
For recovery/healing, the amount of data which must be backed up is much larger than what is needed for detection, so selecting the correct time granularity is critical
Self-healing System Design Checklist
Level(s) of abstraction
What do we want to check and recover from (abstract check)
Decides the cost of checking + cost of recovery
May adaptively vary abstraction level.
Time granularity
Decides what faults we can recover from.
Direct bearing on performance (make checks thin; they will run most of the time)
The complexity of the SH mechanism
We are adding additional code, which should not make the system more unreliable.
Decouple SH component from native code (thin interface)
Make SH code simple and small (formal verification possible)
Which resource to spend on SH
Trade-off between Flash, RAM, cache, processing, magnetic memory, network.
What is the acceptability of the repair
Realizability: What infra support does the SH mechanism need
Continued execution vs. automated repair
Self-healing System Design Checklist
Level(s) of abstraction
?
Time granularity
?
The complexity of the SH mechanism
?
Which resource to spend on SH
?
What is the acceptability of the repair
?
Realizability: What infra support does the SH mechanism need
?
Continued execution vs. automated repair
?
Repairing Data Structure by Goal-Directed Reasoning
The Problem
[Figure: a broken data structure whose nodes hold fields F and G (F = 20, G = 5 and F = 20, G = 10)]
Broken Data Structure
Missing elements
Inappropriate sharing
Dangling references
Out of bounds array indices
Inconsistent values
The Solution
[Figure: a repair algorithm transforms the broken data structure into a consistent data structure; nodes hold fields F and G with values such as F = 10, G = 5; F = 2, G = 1; F = 20, G = 10; F = 20, G = 5]
Summary of Technique
Abstract data structure construction
Compute repaired abstract data structure
Goal-directed planning:
Actions: abstraction rules
Initial state: concrete data structure
Goal state: any data structure which is close to the original data structure in terms of the repair action taken (in the abstract data structure)
An amalgamation of abstraction and goal-directed reasoning.
File System Example: Concrete Data Structure
struct disk {
    int blockbitmap;
    entry dir[numentries];
    block block[numblocks];
}
struct entry {
    byte name[Length];
    int firstblock;
}
struct block {
    int nextblock;
    byte data[blocksize];
}
struct blockbitmap subtype block {
    int nextblock;
    bit bitmap[numblocks];
}
[Figure: example disk laid out as directory entries followed by disk blocks; cell values shown: intro, -5, 2, -1, -1, 3, -1]
The Original and Correct FS
[Figure: Original file system: directory entries and disk blocks; cell values shown: intro, -5, 2, -1, -1, 3, -1]
[Figure: A correct file system, with the blockbitmap marked: directory entries and disk blocks; cell values shown: intro, 10110, 2, -1, -1, 3, -1]
The Abstraction
Sets of objects
set Block of block : Used | Free
set Used of block : Bitmap
Relations between objects
relation Next : Used, Used
relation BlockStatus : Block, boolean
Abstraction in terms of sets + relations between sets and sub-setting
Note: This is the abstraction of the file system, while the actual file system is a bit map with multiple components
[Figure: Block is partitioned into Used and Free; Bitmap is a subset of Used; Next relates Used to Used; BlockStatus relates Block to boolean]
Rules for Abstraction
$\forall i \in [0..numentries-1],\ 0 \le d.dir[i].firstblock \Rightarrow d.block[d.dir[i].firstblock] \in Used$
For all directory entries, the first block is used
$\forall b \in Used,\ 0 \le b.nextblock \Rightarrow \langle b,\ d.block[b.nextblock] \rangle \in Next$
For each used block, if the next-block index is >= 0, then the tuple containing the said block and the block with this index is contained in the Next(-block) relation
$\forall b \in Used,\ 0 \le b.nextblock \Rightarrow d.block[b.nextblock] \in Used$
If a block has a valid next-block index, then the block pointed to by this index is Used
Quantifier + condition on concrete data structure => set inclusion
[Figure: example disk with cell values intro, -5, 2, -1, -1, 3, -1]
Rules for Abstraction
$\forall b \in [0..numblocks-1],\ d.block[b] \notin Used \Rightarrow d.block[b] \in Free$
If a block is not in Used, then it is in Free
$true \Rightarrow d.block[d.blockbitmap] \in Bitmap$
The block pointed to by blockbitmap is contained in the set Bitmap
$\forall j \in [0..numblocks-1],\ \forall b \in Bitmap,\ true \Rightarrow \langle d.block[j],\ b.bitmap[j] \rangle \in BlockStatus$
The block-status relation is contained in elements of the bitmap block
Quantifier + condition on concrete data structure => set inclusion
[Figure: example disk with cell values intro, -5, 2, -1, -1, 3, -1]
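The abstraction rules can be read as a small program that builds the model from the concrete disk. The C sketch below is an illustration under several assumptions: the disk is reduced to the fields the rules mention, `NUMENTRIES` and `NUMBLOCKS` are arbitrary sizes, an index counts as valid only if it is also below `NUMBLOCKS` (the rules state only the lower bound), and Bitmap is treated as a subset of Used because the set declaration says so. BlockStatus is left out.

```c
#include <stdbool.h>

#define NUMENTRIES 2
#define NUMBLOCKS  4

/* Simplified C rendering of the disk from the slides (names and data omitted). */
typedef struct { int firstblock; } entry;
typedef struct { int nextblock; } block;
typedef struct {
    int   blockbitmap;
    entry dir[NUMENTRIES];
    block blk[NUMBLOCKS];
} disk;

/* Abstract model: sets as membership flags, Next as a successor table. */
typedef struct {
    bool used[NUMBLOCKS];
    bool free_[NUMBLOCKS];
    bool bitmap[NUMBLOCKS];
    int  next[NUMBLOCKS];   /* next[b] = successor of b, or -1 */
} model;

static bool valid(int i) { return i >= 0 && i < NUMBLOCKS; }

void build_model(const disk *d, model *m)
{
    for (int b = 0; b < NUMBLOCKS; b++) {
        m->used[b] = m->free_[b] = m->bitmap[b] = false;
        m->next[b] = -1;
    }
    /* The first block of every directory entry is Used. */
    for (int i = 0; i < NUMENTRIES; i++)
        if (valid(d->dir[i].firstblock))
            m->used[d->dir[i].firstblock] = true;
    /* The block named by blockbitmap is in Bitmap (and therefore in Used). */
    if (valid(d->blockbitmap)) {
        m->bitmap[d->blockbitmap] = true;
        m->used[d->blockbitmap] = true;
    }
    /* Follow nextblock from Used blocks until nothing changes:
     * the pair goes into Next and the successor into Used. */
    bool changed = true;
    while (changed) {
        changed = false;
        for (int b = 0; b < NUMBLOCKS; b++) {
            int n = d->blk[b].nextblock;
            if (m->used[b] && valid(n)) {
                m->next[b] = n;
                if (!m->used[n]) { m->used[n] = true; changed = true; }
            }
        }
    }
    /* Every block that is not Used is Free. */
    for (int b = 0; b < NUMBLOCKS; b++)
        m->free_[b] = !m->used[b];
}
```

With a hypothetical disk whose only directory entry starts at block 1, whose block 1 points to block 2, and whose `blockbitmap` is -5, the model has Used = {1, 2}, Free = {0, 3} and an empty Bitmap set, which is the kind of inconsistency the constraints on the following slides are meant to catch.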
Example of the Abstraction
[Figure: the concrete disk (directory entries and disk blocks; cell values intro, -5, 2, -1, -1, 3, -1) is abstracted into a model over blocks 0-3 with the sets Used, Free and Bitmap and the relation Next]
Consistency Constraints: The Checks
$|Bitmap| = 1$
$\forall u \in Used,\ u.BlockStatus = true$
$\forall f \in Free,\ f.BlockStatus = false$
$\forall b \in Used,\ |Next.b| \le 1$
[Figure: the abstract model over blocks 0-3 with the sets Used, Free and Bitmap and the relation Next]
The Diagnosis
Evaluate consistency properties, find violations
$|Bitmap| = 1$ is violated: the Bitmap set is empty
[Figure: the abstract model over blocks 0-3 with the sets Used, Free and Bitmap (empty) and the relation Next]
Repairing Violations of Model Consistency Properties
Violation provides binding for quantified variables
Convert body of the constraint to disjunctive normal form
$(p_1 \wedge \ldots \wedge p_n) \vee \ldots \vee (q_1 \wedge \ldots \wedge q_m)$
$p_1 \ldots p_n$, $q_1 \ldots q_m$ are basic propositions
Choose a conjunction to satisfy
Repair violated basic propositions in the conjunction
Repairing Violations of Basic Propositions
Inequality constraints on values of numeric fields
$V.R = E$, $V.R < E$, $V.R \le E$, $V.R \ge E$, $V.R > E$
Compute value of expression, assign relation
Presence of required number of objects
$|S| = C$, $|S| \le C$, $|S| \ge C$
Remove or insert objects from/to set
Topology of region surrounding each object
$|V.R| = C$, $|V.R| \le C$, $|V.R| \ge C$
$|R.V| = C$, $|R.V| \le C$, $|R.V| \ge C$
Remove or insert tuples from/to relation
Inclusion constraints: $V \in S$, $V_1 \in V_2.R$, $\langle V_1, V_2 \rangle \in R$
Remove or add the object or tuple from/to set or relation
Repairing Inconsistencies
Repair the violation of ($|Bitmap| = 1$) (DNF format) by adding a block to the Bitmap set
[Figure: the abstract model over blocks 0-3 with the sets Used, Free and Bitmap and the relation Next]
Goal-Directed Reasoning
Abstract repairs add or remove objects (or tuples) to sets (or relations)
Goal: find concrete data structure updates with the same effect
1) Find model definition rules that construct the relevant set or relation
2) Basic strategy:
For removals, appropriately falsify the guards of all these model definition rules.
For additions, appropriately satisfy the guard of one of these model definition rules.
Goal-Directed Reasoning in Example
Abstract Repair: add block 0 to the Bitmap set
Abstraction Rules:
$\forall i \in [0..numentries-1],\ 0 \le d.dir[i].firstblock \Rightarrow d.block[d.dir[i].firstblock] \in Used$
$\forall b \in Used,\ 0 \le b.nextblock \Rightarrow \langle b,\ d.block[b.nextblock] \rangle \in Next$
$\forall b \in Used,\ 0 \le b.nextblock \Rightarrow d.block[b.nextblock] \in Used$
$\forall b \in [0..numblocks-1],\ d.block[b] \notin Used \Rightarrow d.block[b] \in Free$
$true \Rightarrow d.block[d.blockbitmap] \in Bitmap$
$\forall j \in [0..numblocks-1],\ \forall b \in Bitmap,\ true \Rightarrow \langle d.block[j],\ b.bitmap[j] \rangle \in BlockStatus$
Goal-Directed Reasoning in Example
Abstract Repair: add block 0 to the Bitmap set
Abstraction Rules (Action Taken):
$\forall i \in [0..numentries-1],\ 0 \le d.dir[i].firstblock \Rightarrow d.block[d.dir[i].firstblock] \in Used$
$\forall b \in Used,\ 0 \le b.nextblock \Rightarrow \langle b,\ d.block[b.nextblock] \rangle \in Next$
$\forall b \in Used,\ 0 \le b.nextblock \Rightarrow d.block[b.nextblock] \in Used$
$\forall b \in [0..numblocks-1],\ d.block[b] \notin Used \Rightarrow d.block[b] \in Free$
$true \Rightarrow d.block[d.blockbitmap] \in Bitmap$ (Guard already satisfied)
$\forall j \in [0..numblocks-1],\ \forall b \in Bitmap,\ true \Rightarrow \langle d.block[j],\ b.bitmap[j] \rangle \in BlockStatus$
Goal-Directed Reasoning in Example
Abstract Repair: add block 0 to the Bitmap set
Relevant Abstraction Rule:
$true \Rightarrow d.block[d.blockbitmap] \in Bitmap$
Action Taken: d.block[d.blockbitmap] = block 0
Corresponding Data Structure Update: d.blockbitmap = index of block 0 in the d.block array
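The check-then-repair step for the Bitmap set can be sketched in C as below. The reduced `disk` struct, `NUMBLOCKS`, the upper-bound check on the index, and the function names are assumptions for illustration; the repair itself is the concrete update named on the slide, setting `blockbitmap` to the index of the chosen block.

```c
#include <stdbool.h>

#define NUMBLOCKS 4

/* Only the field the Bitmap rule depends on; directory and blocks omitted. */
typedef struct { int blockbitmap; } disk;

/* Abstraction rule: true => d.block[d.blockbitmap] in Bitmap.
 * Returns |Bitmap| (0 or 1) and, if non-empty, the member's index. */
static int bitmap_set_size(const disk *d, int *member)
{
    if (d->blockbitmap >= 0 && d->blockbitmap < NUMBLOCKS) {
        *member = d->blockbitmap;
        return 1;
    }
    return 0;
}

/* Consistency constraint |Bitmap| = 1. */
bool bitmap_consistent(const disk *d)
{
    int member;
    return bitmap_set_size(d, &member) == 1;
}

/* Abstract repair "add block b to Bitmap", translated into the concrete
 * update d.blockbitmap = b. Returns true if a repair was performed. */
bool repair_bitmap(disk *d, int b)
{
    if (bitmap_consistent(d)) return false;
    d->blockbitmap = b;
    return true;
}
```

Starting from `blockbitmap = -5`, the constraint fails, `repair_bitmap(&d, 0)` sets the field to 0, and the constraint then holds. The repair makes the structure consistent; it does not claim to restore the value the field had before corruption.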
Repair in Example
[Figure: Original file system: directory entries and disk blocks; cell values shown: intro, -5, 2, -1, -1, 3, -1]
[Figure: Updated file system, with the blockbitmap marked: directory entries and disk blocks; cell values shown: intro, 0, 2, -1, -1, 3, -1; bitmap contents shown as xxxx0010]
Note: Bitmap details are still abstracted
Multiple Repairs
Some broken data structures may require multiple repairs
Reconstruct model
Reevaluate consistency constraints
Perform any required additional repairs
Multiple repairs are needed either due to a complex repair or due to refinement (some repair rule acts on a more refined model of the system)
Re-abstracted Model
[Figure: re-abstracted model over blocks 0-3 with the sets Used, Free and Bitmap, the relation Next, and the relation BlockStatus mapping blocks to true/false]
Note: The BlockStatus relationship has been refined in this case.
Refinement assigns arbitrary values to unassigned variables
Can be treated as an environment action
Diagnosis in New Model
Re-evaluate model constraints, find violations of
$\forall u \in Used,\ u.BlockStatus = true$ and
$\forall f \in Free,\ f.BlockStatus = false$
-- Rule violations due to refinement of BlockStatus relationship
[Figure: the re-abstracted model with the BlockStatus relation mapping blocks to true/false]
Action 2: Fix Block Status
Repair violations of
$\forall u \in Used,\ u.BlockStatus = true$ and
$\forall f \in Free,\ f.BlockStatus = false$
by modifying the BlockStatus relation
[Figure: the model after the BlockStatus relation has been modified]
Repaired File System
[Figure: Repaired file system, with the blockbitmap marked: directory entries and disk blocks; cell values shown: intro, 10110, 2, -1, -1, 3, -1]
Repair Plan Graph
[Figure: graph whose nodes are state predicates (an abstraction of the state) and whose edges are actions or environment actions taken when refining]
State predicates: |Bitmap| != 1; f.BlockStatus = true for any free block; f.BlockStatus = false for any used block
Actions: add block to Bitmap; replace a tuple in BlockStatus; b.bitmap[j] = false for j = indexof(f); remove tuples from BlockStatus by removing Bitmap
Environment action when refining: assign arbitrary tuples to f.BlockStatus
Experience
We acquired five benchmarks (written in C/C++)
AbiWord
x86 emulator
CTAS (air-traffic control tool)
Simplified Linux file system
Freeciv interactive game
We developed specifications for all five
Little development time (days, not weeks)
Most of the time was spent figuring out Freeciv and CTAS
Each benchmark has
Workload
Bug or fault insertion methodology
Ran benchmarks with and without repair
Snapshot of Results
Application    Time to Check Consistency (ms)    Time to Check and Repair (ms)
AbiWord        0.06                              0.55
CTAS           0.07                              0.15
FreeCiv        3.62                              15.66
File system    4.22                              263.14
Self-healing System Design Checklist
Level(s) of abstraction
Multiple: Set-relation abstraction of data. Object level for blocks, bit level for bitmap, type level for next. Needs traversal of about 1/Kth of memory, with K integers in a block.
Time granularity
Variable, possibly at a suitable execution-block level
The complexity of the SH mechanism
20,000 lines of code, deeply related to data, complex operations on data
Which resource to spend on SH
Mainly CPU
What is the acceptability of the repair
No guarantees, empirically observed to be successful
Realizability: What infra support does the SH mechanism need
Access to data and ability to modify it: no memory protection
Continued execution vs. automated repair
Continued execution
ClearView: Code Patching Using Online Learning
Detect, learn, repair
[Figure: pipeline. All executions pass through an attack detector; normal executions feed Learning, which produces predictive constraints; attacks (or bugs) together with the predictive constraints feed Repair, which produces a patch. The detector is pluggable and does not depend on learning.]
Learn normal behavior (constraints) from successful runs
Check constraints during attacks
Predictive constraints: true on every good run, false during every attack
Patch to re-establish constraints
A patch restores normal behavior
Evaluate and distribute patches
[Lin & Ernst 2003]
Learning normal behavior
Observe normal behavior
Clients do local inference (e.g. copy_len <= buff_size)
Clients send inference results
Generalize observed behavior
Server generalizes (merges results)
[Figure: server and community machines]
Attack detection & suppression
Detector collects information and terminates application
Detectors used in our research:
Code injection (Memory Firewall)
Memory corruption (Heap Guard)
Many other possibilities exist
[Figure: server and community machines]
Learning attack behavior
What was the effect of the attack?
Instrumentation continuously evaluates learned behavior
Clients send difference in behavior: violated constraints
Server correlates constraints to attack
[Figure: server and community machines]
Repair
Propose a set of patches for each behavior that predicts the attack
Predictive: copy_len <= buff_size
Server generates a set of patches
Candidate patches:
1. Set copy_len = buff_size
2. Set copy_len = 0
3. Set buff_size = copy_len
4. Return from procedure
[Figure: server and community machines]
Repair
Distribute patches to the community
Ranking:
Patch 1: 0
Patch 2: 0
Patch 3: 0
[Figure: server and community machines]
Repair
Evaluate patches
Success = no detector is triggered
Detector is still running on clients
When attacked, clients send outcome to server
Server ranks patches
Ranking:
Patch 3: +5
Patch 2: 0
Patch 1: -5
[Figure: server and community machines]
Repair
Redistribute the best patches
Server redistributes the most effective patches (here Patch 3)
Ranking:
Patch 3: +5
Patch 2: 0
Patch 1: -5
[Figure: server and community machines]
Dynamic invariant detection
Daikon generalizes observed program executions
Many optimizations for accuracy and speed
Data structures, code analysis, statistical tests, ...
We further enhanced the technique
Candidate constraints:
copy_len < buff_size
copy_len <= buff_size
copy_len = buff_size
copy_len >= buff_size
copy_len > buff_size
copy_len != buff_size
Observation:
copy_len: 22
buff_size: 42
Remaining candidates:
copy_len < buff_size
copy_len <= buff_size
copy_len != buff_size
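The elimination of candidate constraints by observation can be sketched in a few lines of C. This is only the core idea, not Daikon's implementation: the candidate set is a bitmask over the six comparisons, and the names `all_candidates` and `observe` are hypothetical.

```c
#include <stdbool.h>

/* The six candidate comparisons between two variables a and b. */
enum { C_LT, C_LE, C_EQ, C_GE, C_GT, C_NE, C_COUNT };

static bool holds(int c, int a, int b)
{
    switch (c) {
    case C_LT: return a <  b;
    case C_LE: return a <= b;
    case C_EQ: return a == b;
    case C_GE: return a >= b;
    case C_GT: return a >  b;
    default:   return a != b;  /* C_NE */
    }
}

/* Bit i of the mask is set while candidate i has not been falsified. */
unsigned all_candidates(void) { return (1u << C_COUNT) - 1u; }

/* Drop every candidate that the observed pair (a, b) violates. */
unsigned observe(unsigned candidates, int a, int b)
{
    for (int c = 0; c < C_COUNT; c++)
        if (!holds(c, a, b))
            candidates &= ~(1u << c);
    return candidates;
}
```

Observing `copy_len = 22`, `buff_size = 42` leaves `<`, `<=` and `!=`. A later run in which both values are 42 would leave only `<=`, which is the kind of constraint that survives many normal executions.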
Repair example
if (!(copy_len <= buff_size))
    copy_len = buff_size;
The repair checks the predictive constraint
If the constraint is not violated, no need to repair
If the constraint is violated, an attack is (probably) underway
The patch does not depend on the detector
Should fix the problem before the detector is triggered
Repair is not identical to what a human would write
Unacceptable to wait for human response
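To show where such a patch sits relative to the vulnerable operation, here is a minimal C sketch of a copy routine with the constraint-enforcing check in front of it. The function `patched_copy` and its signature are hypothetical; ClearView applies patches to binaries at run time rather than to source code.

```c
#include <stddef.h>
#include <string.h>

/* Copy copy_len bytes from src into buff, which holds buff_size bytes.
 * The first two lines are the patch: they re-establish the learned
 * constraint copy_len <= buff_size before the copy executes.
 * Returns the number of bytes actually copied. */
size_t patched_copy(char *buff, size_t buff_size,
                    const char *src, size_t copy_len)
{
    if (!(copy_len <= buff_size))
        copy_len = buff_size;
    memcpy(buff, src, copy_len);
    return copy_len;
}
```

An oversized request is silently truncated instead of overflowing the buffer. That keeps the run inside the acceptability envelope, although the truncated result may not be what a human-written fix would produce.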
Example constraints & repairs
v1 <= v2
if (!(v1 <= v2)) v1 = v2;
v >= c
if (!(v >= c)) v = c;
v in { c1, c2, c3 }
if (!(v==c1 || v==c2 || v==c3)) v = ci;
Return from enclosing procedure
if (!(...)) return;
Modify a use: convert "call *v" to
if (...) call *v;
where ... is the constraint on v (not negated)
ClearView was successful
Detected all attacks, prevented all exploits
For 7/10 vulnerabilities, generated a patch that maintained functionality
  No observable deviation from desired behavior
  After an average of 4.9 minutes and 5.4 attacks
Handled polymorphic attack variants
Handled simultaneous & intermixed attacks
No false positives
Low overhead for detection & repair
Self-healing System Design Checklist
Level(s) of abstraction
  Program invariants (predicate abstraction)
Time granularity
  At the level of granularity of the checkers, e.g. Heap Guard
Complexity of the SH mechanism
  Implementation size unknown; changes code control flow
Which resource to spend on SH
  Mainly CPU
What is the acceptability of the repair
  No guarantees; empirically observed to work. Changes code control flow. Very difficult to verify!
Realizability: what infrastructure support does the SH mechanism need
  Virtual instruction cache; restricted to JIT setups
Continued execution vs. automated repair
  Automated repair
Exterminator: Memory Fault Recovery
Diagnosing Buffer Overflows
Canonical buffer overflow:
  Allocate object too small
  Write past end: corrupts object bytes forward
  Not necessarily contiguous
[Figure: heap showing the bad object (too small) and the corrupted bytes past its end]
char * str = new char[8];
strcpy(str, "goodbye cruel world");
Isolating Buffer Overflows
[Figure: three heap images from separate runs, each holding objects 1-10 at different randomized positions. Red = possible bad object; green = not bad object.]
Canaries in freed space detect corruption
Run multiple times with DieFast allocator
Key insight: overflow must be at the same distance from the bad object in every run
  ⇒ object 9 overflowed, with high probability
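The isolation idea can be sketched by intersecting candidates across randomized heap images. This is a simplified illustration, not Exterminator's algorithm: the `HeapImage` struct and `isolate_overflow` function are hypothetical, the heap is modeled as equal-sized slots instead of bytes, and the probabilistic reasoning is reduced to an exact match on distance.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// One heap image from a single run (hypothetical representation):
// slots[i] is the id of the object placed in slot i, and corrupted_slot
// is the slot where an overwritten canary was found.
struct HeapImage {
    std::vector<int> slots;
    std::size_t corrupted_slot;
};

// Returns, in ascending order, the ids of objects that sit the same number
// of slots before the corruption in every image.
std::vector<int> isolate_overflow(const std::vector<HeapImage>& images) {
    std::map<int, std::size_t> candidates;  // object id -> distance
    bool first = true;
    for (const HeapImage& image : images) {
        assert(image.corrupted_slot < image.slots.size());
        std::map<int, std::size_t> seen;
        for (std::size_t pos = 0; pos < image.corrupted_slot; ++pos) {
            seen[image.slots[pos]] = image.corrupted_slot - pos;
        }
        if (first) {
            candidates = seen;
            first = false;
            continue;
        }
        // Keep only objects whose distance matches in this image too.
        for (auto it = candidates.begin(); it != candidates.end();) {
            auto match = seen.find(it->first);
            if (match == seen.end() || match->second != it->second) {
                it = candidates.erase(it);
            } else {
                ++it;
            }
        }
    }
    std::vector<int> result;
    for (const auto& entry : candidates) result.push_back(entry.first);
    return result;
}
```

Because object placement is randomized per run, an innocent object is unlikely to keep the same distance to the corruption in every image, so each extra run tends to remove candidates.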
Isolating Dangling Pointers
Dangling pointer error:
  Live object freed too soon
  Overwritten by some other object
int * v = new int[4];
delete [] v; // oops
char * str = new char[16];
strcpy(str, "die, pointer");
v[3] = 4;
// use of v[0]
Isolating Dangling Pointers
Unlike buffer overflow: a dangling pointer causes the same corruption in all heaps
[Figure: three heap images from separate runs, each holding objects 1-12 at different randomized positions]
Correcting Allocator
Generate runtime patches to correct errors
  Track object call sites in allocator
Prevent overflows: pad overflowed objects
  malloc(8) → malloc(8 + δ)
Prevent dangling pointers: defer frees
  free(ptr) → delay δ mallocs; free(ptr)
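The two kinds of runtime patches can be sketched as a thin wrapper around `malloc` and `free`. This is an illustration under assumptions: the class `CorrectingAllocator`, its methods, and the integer call-site ids are hypothetical, and patches are keyed by allocation site only. Exterminator derives its patches automatically from heap images, which this sketch does not attempt.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <vector>

// Allocator wrapper that applies padding and deferred-free patches,
// keyed by a hypothetical integer id for the allocation call site.
class CorrectingAllocator {
public:
    // Patch: add extra_bytes of padding to every object allocated at site.
    void pad_site(int site, std::size_t extra_bytes) { pad_[site] = extra_bytes; }

    // Patch: hold frees of objects from site for this many later mallocs.
    void defer_site(int site, int mallocs) {
        assert(mallocs >= 0);
        defer_[site] = mallocs;
    }

    void* allocate(int site, std::size_t size) {
        release_expired();
        auto pad = pad_.find(site);
        last_request_ = size + (pad == pad_.end() ? 0 : pad->second);
        void* p = std::malloc(last_request_);
        if (p != nullptr) site_of_[p] = site;
        return p;
    }

    void deallocate(void* p) {
        if (p == nullptr) return;
        int delay = 0;
        auto owner = site_of_.find(p);
        if (owner != site_of_.end()) {
            auto patch = defer_.find(owner->second);
            if (patch != defer_.end()) delay = patch->second;
            site_of_.erase(owner);
        }
        if (delay == 0) std::free(p);
        else pending_.push_back(Pending{p, delay});
    }

    std::size_t last_request() const { return last_request_; }  // bytes asked of malloc
    std::size_t pending_frees() const { return pending_.size(); }

    ~CorrectingAllocator() {
        for (const Pending& item : pending_) std::free(item.ptr);
    }

private:
    struct Pending { void* ptr; int mallocs_left; };

    // Called on every allocation: count down and free expired objects.
    void release_expired() {
        std::vector<Pending> still_waiting;
        for (Pending item : pending_) {
            if (--item.mallocs_left <= 0) std::free(item.ptr);
            else still_waiting.push_back(item);
        }
        pending_.swap(still_waiting);
    }

    std::map<int, std::size_t> pad_;
    std::map<int, int> defer_;
    std::map<void*, int> site_of_;
    std::vector<Pending> pending_;
    std::size_t last_request_ = 0;
};
```

Keying patches by call site means a fix learned from one overflowing object applies to every later object allocated at the same place in the program.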
Exterminator Runtime Overhead
[Figure: runtime overhead chart; the only surviving label is 25%]
Self-healing System Design Checklist
Level(s) of abstraction
  Memory objects
Time granularity
  At each malloc/free call
Complexity of the SH mechanism
  Not a very complex implementation
Which resource to spend on SH
  Mainly CPU and memory (but the memory would be needed anyway)
What is the acceptability of the repair
  Probabilistic guarantees that the repair will work
Realizability: what infrastructure support does the SH mechanism need
  None
Continued execution vs. automated repair
  Automated repair
Exercise 1
Write a C program in which a local array overflows (say, in a for loop) significantly (enough to overwrite the return address).
1. Compile and run this code on your Linux workstation. What do you observe? What is the explanation of your observation? Prepare a 1-page write-up on this.
2. If possible, repeat the same on a Windows machine.
3. Rewrite this code in C#. What happens in this case?
Please email your write-up for these three questions to Satya Gautam (TA): [email protected]
Exercise 2
Write your own fault-tolerant my_malloc() and my_free() functions, which do the following:
  Maintain a priority queue whose elements contain a pointer, an additional lifetime, and an additional buffer space. Set the additional lifetime (free delay) and the additional buffer space randomly.
    Additional buffer space = random number between 0 and ceil(size/10)
    Additional lifetime = random number between 0 and 5 events. Event = a malloc/free call.
  For each my_malloc(), allocate the additional memory; for each my_free(), delay the actual free operation by the additional lifetime.
Try it out on buggy code that you have not been able to fix!
Thank You
Acyclic Repair Dependences
Questions
  Isn't it possible for the repair of one constraint to invalidate another constraint?
  What about infinite repair loops?
  What about unsatisfiable specifications?
Answer
  We require specifications to have no cyclic repair dependences between constraints
  So all repair sequences terminate
  Repair can fail only because of resource limitations
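The acyclicity requirement can be checked mechanically once the repair dependences are written down as a directed graph. The sketch below uses a topological sort (Kahn's algorithm). The encoding is an assumption made for illustration: constraints are numbered from 0, and an edge (a, b) means that repairing constraint a may invalidate constraint b. The function name `repairs_are_acyclic` is hypothetical.

```cpp
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// Returns true when the "repairing a may invalidate b" relation has no
// cycle, so that every sequence of repairs must terminate.
bool repairs_are_acyclic(int constraint_count,
                         const std::vector<std::pair<int, int>>& may_invalidate) {
    assert(constraint_count >= 0);
    std::vector<std::vector<int>> successors(constraint_count);
    std::vector<int> incoming(constraint_count, 0);
    for (const auto& edge : may_invalidate) {
        assert(edge.first >= 0 && edge.first < constraint_count);
        assert(edge.second >= 0 && edge.second < constraint_count);
        successors[edge.first].push_back(edge.second);
        ++incoming[edge.second];
    }

    // Repeatedly remove constraints that no remaining repair can invalidate.
    std::queue<int> ready;
    for (int c = 0; c < constraint_count; ++c) {
        if (incoming[c] == 0) ready.push(c);
    }
    int ordered = 0;
    while (!ready.empty()) {
        int c = ready.front();
        ready.pop();
        ++ordered;
        for (int next : successors[c]) {
            if (--incoming[next] == 0) ready.push(next);
        }
    }
    // Any constraint left over lies on a cycle or behind one.
    return ordered == constraint_count;
}
```

A specification whose dependences form a chain 0 → 1 → 2 passes the check; adding the edge 2 → 0 creates a possible infinite repair loop and the check fails.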
References
"Automatically patching errors in deployed software" by Jeff H. Perkins, Sunghun Kim, Sam Larsen, Saman Amarasinghe, Jonathan Bachrach, Michael Carbin, Carlos Pacheco, Frank Sherwood, Stelios Sidiroglou, Greg Sullivan, Weng-Fai Wong, Yoav Zibin, Michael D. Ernst, and Martin Rinard. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (Big Sky, MT, USA), October 12-14, 2009, pp. 87-102.
Brian Demsky, Martin C. Rinard, "Goal-Directed Reasoning for Specification-Based Data Structure Repair," IEEE Transactions on Software Engineering, vol. 32, no. 12, pp. 931-951, Dec. 2006, doi:10.1109/TSE.2006.122
"Self-adaptive Software: Landscape and Research Challenges," ACM TAAS, March 2009
Soft errors: "Soft errors in circuits and systems," IBM Journal of R&D, Vol. 52, No. 3, 2008. http://researchweb.watson.ibm.com/journal/rd52-3.html
Martin Rinard, "Acceptability Oriented Computing," ACM SIGPLAN Notices, Vol. 38, Issue 12, December 2003
Read other related works of Martin Rinard: http://people.csail.mit.edu/rinard/acceptability_oriented_computing/
Marco Schneider, "Self-Stabilization," ACM Computing Surveys, Volume 25, Issue 1, March 1993.