Self Healing Systems Lecture


  • 7/31/2019 Self Healing Systems Lecture

    Fault-tolerance Techniques for Software Systems

    Dipankar Das

    GM R&D, India Science Lab

    March 2012, IIT Kharagpur

    Automotive ECS Trends

    100 million lines of code executing on a distributed embedded system of 50-70 ECUs and 7+ buses

    2000 features with high interdependencies

    More complex features in future

    Hybrid PT, Fuel Cell, Displacement on Demand, Braking

    65 nm and sub-65 nm design of memory and processors

    Short development/testing time (market pressure)

    Multi-core/distributed implementations

    [Figure: Electronics & software growth in the last decade: $400 to $1182 (+196%), 20 ECUs to 50 ECUs (+150%), 1M lines of code to 100M lines of code (+9900%).]

    Autonomic Computing

    A computer system which can recover from faults without human intervention.

    Continued execution in spite of faults (availability ++)

    Short time between fault detection and correction (security and maintenance ++)

    The broader topic is self-aware systems, introduced by the IBM Autonomic Computing effort

    Self-healing + self-optimizing + self-protecting + self-configuring

    Self-* requires self-awareness and context-awareness

    Self-awareness: aware of own state and behaviors

    Context-awareness: aware of the context of operation (i.e. the environment)

    Self-Aware Systems

    Self-configuring: reconfigure automatically and dynamically in response to changes in installation, update, integration and composition.

    Self-optimizing: adjust performance and resource allocation to satisfy users.

    Main concerns: QoS, time delay, utilization

    Self-protecting: detect and recover from security breaches.

    Self-healing: self-diagnosis and self-repair. Can recover from disruptions, either due to the environment or the system.

    Fault-tolerance for SW Systems

    Adaptation processes in a self-adaptive system [1]

    Fault-tolerance for SW Systems

    The goal is fault-tolerance:

    Fault -> Error (can be logged) -> Failure (undesirable effects)

    Prevent faults from graduating into failures!

    Applicable to both random and systematic faults.

    Most hardware faults are random (wear and tear)

    Most software faults are systematic (bugs); some hardware faults are also systematic (HW is after all a program!)

    Alternative strategies for fault-tolerance:

    Replication, Diversity, Recovery-block/re-execution

    Repair the current run vs. correct future runs (or both).

    Will focus mostly on systematic faults
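The recovery-block strategy named above can be sketched as follows. This is a minimal illustration, not code from the lecture: the function names, the deliberately buggy primary, and the tolerance in the acceptance test are all assumptions chosen for the example. A primary routine runs first; an acceptance test catches the error before it becomes a failure; a diverse alternate is then tried.

```python
import math

def recovery_block(alternates, acceptance_test, x):
    """Try each alternate in order; return the first result that passes the test."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue  # an exception counts as a detected error
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")

def buggy_sqrt(x):
    # Hypothetical primary with a systematic fault: wrong for x > 100.
    return x / 10 if x > 100 else math.sqrt(x)

def newton_sqrt(x):
    # Diverse alternate: an independent implementation of the same function.
    guess = x or 1.0
    for _ in range(60):
        guess = 0.5 * (guess + x / guess)
    return guess

def sqrt_acceptable(x, result):
    # Acceptance test: squaring the result must give back the input.
    return abs(result * result - x) <= 1e-6 * max(1.0, x)
```

For an input such as 400.0 the primary returns 40.0, the acceptance test rejects it, and the alternate supplies the result; for 16.0 the primary's own result is accepted.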

    Examples of Systematic Faults

    Systematic Faults

    A design fault which is built into the system

    May be triggered randomly, but is not random

    Nondeterministic interleaving of events

    Causes of systematic faults:

    Incorrect/incomplete/inconsistent requirements

    Software is inconsistent with requirements

    Implementation errors: memory object overrun, type faults, data races, incorrect synchronization

    Configuration errors: incorrect estimation of message queue size.

    Stack Faults: buffer overflow

    A buffer overflow occurs when a program writes more data into an allocated buffer than the buffer can hold.

    When this happens, the next contiguous chunk of memory is overwritten, such as:

    Return address

    Function pointer

    Previous frame pointer, etc.

    An attacker can also use the overflow to inject attack code.

    This can lead to a serious security problem.
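The effect of an overflow on adjacent memory can be simulated in a few lines. The sketch below models a stack frame as a byte array; the 8-byte buffer, the 4-byte little-endian return address, and all function names are assumptions made for illustration (real frame layouts depend on the compiler and ABI).

```python
import struct

BUF_SIZE = 8  # hypothetical size of the local buffer, in bytes

def make_frame(return_address):
    """Simulated frame layout: [ buffer (8 bytes) | return address (4 bytes) ]."""
    return bytearray(BUF_SIZE) + bytearray(struct.pack("<I", return_address))

def unchecked_copy(frame, data):
    """Copy every byte of data into the frame, ignoring BUF_SIZE (like an unchecked strcpy)."""
    for i, byte in enumerate(data):
        frame[i] = byte

def read_return_address(frame):
    """Read back the 4-byte return address stored right after the buffer."""
    return struct.unpack_from("<I", frame, BUF_SIZE)[0]
```

Copying 8 bytes leaves the return address intact; copying 12 bytes replaces it with whatever the last 4 bytes of the input encode.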

    The Typical AUTOSAR stack

    Example Code

    Overwriting Return Address

    Illegal Event Sequence Example

    Nondeterministic: will trigger only if TaskA is interrupted by Interrupt B

    Very hard to reproduce!

    Handling illegal sequences:

    Locks to ensure mutual exclusion

    Contention-free design of software and system

    Software transactional memories: re-execute code in case of data contention

    [Figure: timeline of a task (Read X, process based on value of X, Write X, complete processing) preempted by a hardware interrupt whose Interrupt Service Routine also reads and writes X.]
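The lost update caused by this interleaving, and the re-execution idea behind software transactional memory, can be reproduced deterministically. The sketch below is an illustration under stated assumptions: the `VersionedCell` class, the `interrupt` hook used to force the preemption at the critical point, and the increments are all hypothetical, and real STM implementations are considerably more involved.

```python
class VersionedCell:
    """Shared variable X with a version counter used to detect contention."""
    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        return self.value, self.version

    def try_commit(self, new_value, seen_version):
        """Write only if nobody else committed since the value was read."""
        if self.version != seen_version:
            return False
        self.value = new_value
        self.version += 1
        return True

def isr(cell):
    """Interrupt service routine: adds 10 to X."""
    value, version = cell.read()
    cell.try_commit(value + 10, version)

def unsafe_increment(cell, interrupt=None):
    """Read X, get preempted, then blindly write back a value based on the stale read."""
    value, _ = cell.read()
    if interrupt:
        interrupt(cell)
    cell.value = value + 1
    cell.version += 1

def transactional_increment(cell, interrupt=None):
    """Re-execute the read-process-write sequence until it commits without contention."""
    attempts = 0
    while True:
        attempts += 1
        value, version = cell.read()
        if interrupt and attempts == 1:
            interrupt(cell)  # preemption happens only during the first attempt
        if cell.try_commit(value + 1, version):
            return attempts
```

With the unsafe version the ISR's update is silently lost; the transactional version detects the version change, retries once, and both updates survive.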

    Examples of Random Faults

    Soft Errors

    It is an issue if silicon is < 65 nm. State of the art is 28 nm.

    Effects are seen at the software level; recovery/prevention strategies can be built at the software level.

    Impact of Neutron Strike on a Si Device

    Secondary source of upsets: alpha particles from packaging

    Strikes release electron & hole pairs that can be absorbed by source & drain to alter the state of the device

    [Figure: neutron strike on a transistor device, releasing positive and negative charges between source and drain.]

    Strike on state bit (e.g., in register file)

    [Figure: decision flow. Is the bit read? If no: benign fault, no error. If yes: does the bit have error protection? If the error can be corrected (e.g., ECC): no error. If the error is only detected (e.g., parity, no recovery): detected, but unrecoverable error (DUE). If there is no protection: does the bit matter? If yes: Silent Data Corruption (SDC). If no: benign fault, no error.]


    Some Acronyms

    SDC = Silent Data Corruption

    DUE = Detected & unrecoverable error

    SER = Soft Error Rate = Total of SDC & DUE
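The difference between detection-only protection (which turns a flipped bit into a DUE) and correcting protection (which removes the error) can be illustrated with a parity bit and a Hamming(7,4) code. This is a generic textbook sketch, not the protection scheme of any particular processor; the function names are chosen for the example.

```python
def parity_bit(bits):
    """Even parity: detects any single-bit flip but cannot locate it."""
    return sum(bits) % 2

def hamming74_encode(d):
    """Encode 4 data bits as the 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(c):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the flipped bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

A single flip changes the parity, so parity alone reports that something is wrong without saying where; the Hamming syndrome identifies the flipped position, so the original data can be restored.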

    Evidence of Cosmic Ray Strikes

    Documented strikes in large servers found in error logs

    Normand, "Single Event Upset at Ground Level," IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December 1996.

    Sun Microsystems, 2000

    Cosmic ray strikes on L2 cache with defective error protection caused Sun's flagship servers to suddenly and mysteriously crash!

    Companies affected: Baby Bell (Atlanta), America Online, eBay, & dozens of others

    Verisign moved to IBM Unix servers (for the most part)

    Toyota Prius Recall: http://www.reuters.com/article/idUSTRE6293IC20100310

    Plant Faults

    GM CONFIDENTIAL

    The plant is the, mostly mechanical, component of a cyber-physical system which is controlled by electronics and software.

    Plant faults mainly result from mechanical wear and tear.

    This results in deviation from expected behavior.

    Hardware Faults

    [Figure: S-a-1 (stuck-at-1) fault on the 5th bit of the address line.]

    A State-based Definition of System Behavior

    The Computer System

    [Figure: one computer system node: Processor Core with Register File (P), Instruction Memory (I), Data Memory, Load/store Queue (Q-mem), Communication Queue (Q-com), and Communication Controller (CC), connected to communication media 1 to k.]

    Computer System State

    The state of a computer system is the status of all the components for all computer system nodes:

    Currently executing instruction/statement

    Current status of communication and load/store queues

    Current status of instruction and program memory

    Current status of communication controllers, which indirectly captures messages being transmitted in the system.

    $\text{State} = (I_0, M_0, Q_0, P_0, CC_0, \ldots, I_k, M_k, Q_k, P_k, CC_k)$

    [Figure: each component of the tuple is drawn as a picture, for nodes 0 to k; the instruction memory is illustrated by a short assembly listing at addresses 0x0393 to 0x0398.]

    System + Environment

    The state of the environment captures:

    The state of the mechanical/biological/chemical components (the plant).

    The input generation system (users)*

    $\text{State} = (I_0, M_0, Q_0, P_0, CC_0, \ldots, I_k, M_k, Q_k, P_k, CC_k, \text{Plant}, \text{Users}^*)$

    Execution Space: Set of all possible states

    $\text{State} = (I_0, M_0, Q_0, P_0, CC_0, \ldots, I_k, M_k, Q_k, P_k, CC_k, \text{Plant})$

    [Figure: the execution space, in which each point is one such state; a "System shutdown" state is marked.]

    System Behavior

    Runs are sequences of states visited by the system

    Transitions indicate a state change

    Transitions may be due to implemented system behavior, or due to environment conditions (user pressing keys, a mechanical system's response)

    Some transitions (implementation behavior + environment behavior) can be faulty; others can be fault-free or benign

    [Figure: a run through states S1, S2, S3, S4 in the execution space, with a "System shutdown" state; some transitions are guided by implemented system behavior, others are due to the environment.]

    Functional Properties Characterize States

    MULT: For all runs starting from this state, if input1 = 5 and input2 = 3, then there is a state in each run where output = 15

    [Figure: execution space with a start state (input1 = 5, input2 = 3), runs reaching states where output = 15, and a "System shutdown" state.]

    Acceptable States

    Acceptable states: execution states which satisfy a weaker acceptability property (A). Acceptability properties are basic properties which the system designer deems should be satisfied.

    Example: "System must not crash" vs. functional property MULT

    [Figure: execution space containing the acceptability envelope, which in turn contains the correctness envelope; the region between the two is incorrect but acceptable.]

    Unacceptable Runs

    Unacceptable run: one which goes through one or more unacceptable states.

    Examples:

    The return address of a function is overwritten by a buffer overflow

    The result of a solution varies from the golden value by more than 20%.

    [Figure: an unacceptable run leaving the acceptability envelope in the execution space.]
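The distinction between correct, acceptable, and unacceptable states can be made concrete for the "within 20% of the golden value" example. The sketch below is illustrative only; the function names and the reduction of a state to a single numeric output are simplifying assumptions.

```python
def classify(output, golden, tolerance=0.20):
    """Place one state's output in the correctness or acceptability envelope."""
    if output == golden:
        return "correct"
    if abs(output - golden) <= tolerance * abs(golden):
        return "acceptable"
    return "unacceptable"

def run_is_acceptable(run, golden):
    """A run is unacceptable if it goes through one or more unacceptable states."""
    return all(classify(output, golden) != "unacceptable" for output in run)
```

With the golden value 15 from the MULT example, an output of 17 is incorrect but acceptable, while an output of 19 makes the whole run unacceptable.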

    Unacceptable Runs

    Unacceptable runs may be due to incorrect system implementation or due to environment-triggered transitions

    [Figure: an unacceptable run leaving the acceptability envelope; one transition is labeled "System behavior".]

    Fail Stop Execution

    A mechanism which forces the execution to be stopped when the acceptability boundary is breached.

    Example: program execution is stopped when the return address (in the stack) is overwritten (in GCC 4.1 and above, using the ProPolice mechanism)

    [Figure: a run halted (STOP) at the boundary of the acceptability envelope.]
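A fail-stop check in the spirit of stack-smashing protection can be simulated by placing a known guard value (a canary) between the buffer and the return address and refusing to continue if it has changed. This sketch is a simulation only: the frame layout, the canary value, and the names are assumptions, and it does not reproduce how ProPolice is implemented inside GCC.

```python
import struct

BUF_SIZE = 8                   # hypothetical buffer size in bytes
CANARY = b"\x00\x0d\x0a\xff"   # hypothetical guard value placed after the buffer

class StackSmashingDetected(Exception):
    """Raised to stop execution when the acceptability boundary is breached."""

def call_with_canary(data, return_address=0x151):
    """Simulate a call whose frame is [ buffer | canary | return address ]."""
    frame = bytearray(BUF_SIZE) + bytearray(CANARY) + bytearray(struct.pack("<I", return_address))
    for i, byte in enumerate(data):  # unchecked copy into the buffer
        frame[i] = byte
    # Check performed before the function returns: fail stop if the canary changed.
    if bytes(frame[BUF_SIZE:BUF_SIZE + len(CANARY)]) != CANARY:
        raise StackSmashingDetected("canary overwritten; halting before return")
    return struct.unpack_from("<I", frame, BUF_SIZE + len(CANARY))[0]
```

An input that fits the buffer returns normally through the original address; an overlong input is caught before the corrupted return address is ever used.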

    Safe Exit

    A mechanism which forces the execution to be altered such that it goes to a safe exit point before halting the program.

    Examples: releasing locks before a program stops due to a detected stack smashing attack; writing back EEPROM before shutdown in automotive ECUs.

    [Figure: a run redirected to a safe exit point (STOP) inside the acceptability envelope.]
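A safe exit can be sketched with a handler that writes back the data which must survive shutdown and a `finally` clause that always releases the lock. The names `FaultDetected`, `run_with_safe_exit`, and the dictionary standing in for non-volatile memory are assumptions for illustration.

```python
import threading

class FaultDetected(Exception):
    """Signals that a detector has found the acceptability boundary breached."""

def run_with_safe_exit(task, lock, nonvolatile, state):
    """Run task(state); on a detected fault, reach a safe exit point before halting."""
    lock.acquire()
    try:
        task(state)
        return "completed"
    except FaultDetected:
        # Write back data that must survive shutdown (stands in for an EEPROM write).
        nonvolatile.update(state)
        return "halted safely"
    finally:
        lock.release()  # the lock is released on every path out of the function

def faulty_task(state):
    """Hypothetical task that updates its state and then hits a detected fault."""
    state["odometer"] += 1
    raise FaultDetected()
```

Whether the state written back can be trusted after a fault is a separate design question; the sketch only shows the control-flow shape of the exit path.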

    Resilient Execution

    A mechanism which takes corrective action such that the execution remains within the acceptability envelope.

    Two types of corrections: those which lead to the run entering the correctness envelope, and those which do not lead to correctness.

    Examples: reactive systems which execute code cyclically, with each iteration reading an input and producing the corresponding output; resetting of automotive controllers on detection of stack overflows.

    [Figure: a run steered back inside the acceptability envelope, which contains the correctness envelope.]
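The cyclic reactive pattern with a reset on fault can be sketched as a loop that discards its internal state and emits a safe output whenever one iteration fails. The running-average controller, the use of `None` to model a corrupted input, and the safe output value are assumptions made for the example.

```python
def fresh_state():
    """Initial controller state, also used after a reset."""
    return {"sum": 0.0, "count": 0}

def control_step(state, reading):
    """One cycle: read an input, update the state, produce the output (a running average)."""
    if reading is None:
        raise ValueError("corrupted input")
    state["sum"] += reading
    state["count"] += 1
    return state["sum"] / state["count"]

def reactive_loop(readings, safe_output=0.0):
    """Run the controller cyclically; on a fault, reset and keep executing."""
    state = fresh_state()
    outputs = []
    for reading in readings:
        try:
            outputs.append(control_step(state, reading))
        except Exception:
            state = fresh_state()        # corrective action: reset the controller
            outputs.append(safe_output)  # stay inside the acceptability envelope
    return outputs
```

The output for the faulty cycle is acceptable rather than correct, and history from before the reset is lost, which is the second kind of correction described above.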

    Self-healing Execution

    Self-healing = Resilient execution (acceptability) + ensuring that the faulty run does not happen in the future, even if the same inputs/environment conditions recur (Continued execution + Automated Repair)

    How do we prove that the corrected run and the modified behavior are acceptable?

    [Figure: a run kept inside the acceptability envelope, which contains the correctness envelope.]

    Abstraction of the State Space

    Losing information to solve problems

    Discretization of Time

    How can we model the plant?

    How about the memory? Is it not a set of transistors, each having analogue behavior?

    When trying to define a reasonable state system, we perform discretization on the variable time.

    $\text{State} = (I_0, M_0, Q_0, P_0, CC_0, \ldots, I_k, M_k, Q_k, P_k, CC_k, \text{Plant})$

    Abstraction of State

    There are multiple abstractions of data forming a lattice.

    Lattice = Elements + partial order (more details)

    Many nice properties follow when we add/reduce information in this manner

    Do we have a lattice for the discretization of time?

    [Figure: lattice of abstractions of the same memory contents, from T (least information) at the top, through views such as a typed stack (int, short, double, return address), stack frames, and a static segment, down to the raw bits (most information). Many more abstractions are possible.]

    Level of Abstraction = Our interpretation of Program

    Types only: double = double; short = short;

    Existence of a static segment (bound between [X,Y]): double = double + double; return-address = 151;

    Memory objects in static segment: B = 10.1; return-address = 151;

    Raw bits: B = 10.1; return-address = 151; A = 2;

    [Figure: the memory layout as seen at each level of abstraction, from typed fields in stack frame 1 and the static segment down to the raw bit string.]

    Level of Abstraction = What checks can be performed

    double = double; short = short; (type checks)

    double = double + double; return-address = 151; (corruption of return address)

    B = 10.1; return-address = 151; (reason about B's value: partial functional correctness)

    B = 10.1; return-address = 151; A = 2; (everything in memory)

    [Figure: the memory layout as seen at each level of abstraction, from typed fields in stack frame 1 and the static segment down to the raw bit string.]

    Granularity of time = What checks can be performed

    Check at context switch:

    Buffer B overflows and overwrites ret

    Function corresponding to stack-frame-1 completes and returns

    Task-time budget completes and context switch happens

    Can we detect the overwriting of the return address?

    [Figure: memory snapshots (stack frame 1 with int, short, double[] and ret fields, plus the static segment) at each of these points in time.]

    Key difference: data abstraction vs. time granularity

    With data abstraction, we can refine our view to observe additional faults and take necessary actions.

    With time granularity, there is no scope for refinement after the completion of actions

    Granularity of time = What healing can be done

    Check at context switch:

    Buffer B overflows and overwrites ret.

    Function corresponding to stack-frame-1 completes and returns

    Program Sequence Monitor catches fault after 10,000 instructions.

    Task-time budget completes and context switch happens.

    Can we recover from the fault?

    [Figure: memory snapshots (stack frame 1 with int, short, double[] and ret fields, plus the static segment) at each of these points in time.]

    Often possible to log faults and detect them later.

    For recovery/healing, the amount of data which must be backed up is much larger than what is needed for detection, so selecting the correct time granularity is critical
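The asymmetry between detection and recovery can be seen in a small sketch: a checksum of a few bytes is enough to notice corruption, but rolling back requires a full copy of the state taken at the last check point. The class and method names are hypothetical, and the use of CRC32 over a textual rendering of the state is an assumption made to keep the example short.

```python
import copy
import zlib

def checksum(state):
    """Small digest of the state: sufficient for detection only."""
    return zlib.crc32(repr(sorted(state.items())).encode())

class CheckedTask:
    def __init__(self, state):
        self.state = state
        self.commit()

    def commit(self):
        """Called at each check point after legitimate updates."""
        self.digest = checksum(self.state)            # a few bytes, for detection
        self.checkpoint = copy.deepcopy(self.state)   # full copy, needed for recovery

    def check_and_heal(self):
        """Detect corruption since the last commit and roll back if needed."""
        if checksum(self.state) == self.digest:
            return "ok"
        self.state.clear()
        self.state.update(copy.deepcopy(self.checkpoint))
        return "rolled back"
```

The further apart the check points are, the more legitimate work a rollback discards; the closer together they are, the more often the full copy must be paid for.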

    Self-healing System Design Checklist

    Level(s) of abstraction

    What do we want to check and recover from (abstract check)

    Decides the cost of checking + cost of recovery

    May adaptively vary abstraction level.

    Time granularity

    Decides what faults we can recover from.

    Direct bearing on performance (make checks thin: they will run most of the time)

    The complexity of the SH mechanism

    We are adding additional code, which should not make the system more unreliable.

    Decouple SH component from native code (thin interface)

    Make SH code simple and small (formal verification possible)

    Which resource to spend on SH

    Trade-off between Flash, RAM, cache, processing, magnetic memory, network.

    What is the acceptability of the repair

    Realizability: what infra support does the SH mechanism need

    Continued execution vs. automated repair

    Self-healing System Design Checklist

    Level(s) of abstraction

    ?

    Time granularity

    ?

    The complexity of the SH mechanism

    ?

    Which resource to spend on SH

    ?

    What is the acceptability of the repair

    ?

    Realizability: what infra support does the SH mechanism need

    ?

    Continued execution vs. automated repair

    ?

    Repairing Data Structure by Goal Directed Reasoning

    The Problem

    Broken data structure:

    Missing elements

    Inappropriate sharing

    Dangling references

    Out-of-bounds array indices

    Inconsistent values

    [Figure: broken data structure with nodes labeled F = 20, G = 5 and F = 20, G = 10.]

    The Solution

    [Figure: Broken Data Structure -> Repair Algorithm -> Consistent Data Structure. Nodes are labeled with field values: F = 10, G = 5; F = 2, G = 1; F = 20, G = 10; F = 20, G = 5.]

    Summary of Technique

    Abstract data structure construction

    Compute repaired abstract data structure

    Goal-directed planning:

    Actions: abstraction rules

    Initial state: concrete data structure

    Goal state: any data structure which is close to the original data structure in terms of the repair action taken (in the abstract data structure)

    An amalgamation of abstraction and goal-directed reasoning.

    File System Example: Concrete Data Structure

    struct disk {
      int blockbitmap;
      entry dir[numentries];
      block block[numblocks];
    }

    struct entry {
      byte name[Length];
      int firstblock;
    }

    struct block {
      int nextblock;
      byte data[blocksize];
    }

    struct blockbitmap subtype block {
      int nextblock;
      bit bitmap[numblocks];
    }

    [Figure: example disk image. Directory Entries: intro, -5, 2, -1. Disk Blocks: -1, 3, -1.]

    The Original and Correct FS

    [Figure: a correct file system (values shown: intro, 10110, 2, -1 under Directory Entries and -1, 3, -1 under Disk Blocks, with blockbitmap marked) next to the original file system (values shown: intro, -5, 2, -1 and -1, 3, -1).]

    The Abstraction

    Sets of objects:

    set Block of block : Used | Free

    set Used of block : Bitmap

    Relations between objects:

    relation Next : Used, Used

    relation BlockStatus : Block, boolean

    [Figure: Block is partitioned into Used and Free; Bitmap is a subset of Used; Next relates Used to Used; BlockStatus relates Block to boolean.]

    Note: this is the abstraction of the file system, while the actual file system is a bit map with multiple components

    Abstraction in terms of sets + relations between sets and sub-setting

    Rules for Abstraction

    for i in [0..numentries-1], 0 <= d.dir[i].firstblock => d.block[d.dir[i].firstblock] in Used

    For all directory entries, the first block is used

    for b in Used, 0 <= b.nextblock => <b, d.block[b.nextblock]> in Next

    For each used block, if the next-block index >= 0, then the tuple containing the said block and the block with this index is contained in the Next(-block) relation

    for b in Used, 0 <= b.nextblock => d.block[b.nextblock] in Used

    If a block has a valid next-block index, then the block pointed to by this index is Used

    Quantifier + condition on concrete data structure => set inclusion

    Rules for Abstraction

    for b in [0..numblocks-1], !(d.block[b] in Used) => d.block[b] in Free

    If a block is not in Used, then it is in Free

    true => d.block[d.blockbitmap] in Bitmap

    The block pointed to by blockbitmap is contained in the set Bitmap

    for j in [0..numblocks-1], for b in Bitmap, true => <d.block[j], b.bitmap[j]> in BlockStatus

    The block-status relation is contained in elements of the bitmap block

    Quantifier + condition on concrete data structure => set inclusion
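The abstraction rules can be read as a small program that builds the sets Used, Free and Bitmap and the relation Next from the concrete disk. The sketch below is an interpretation, not the tool's implementation: the flat-list encoding of the disk, the treatment of out-of-range indices as naming no block, and the example values (two directory entries, four blocks, blockbitmap = -5) are assumptions. BlockStatus is left out to keep it short.

```python
def build_model(firstblock, nextblock, blockbitmap):
    """Apply the abstraction rules to a concrete disk and return the abstract model.

    firstblock[i] stands for d.dir[i].firstblock, nextblock[b] for
    d.block[b].nextblock, and blockbitmap for d.blockbitmap.
    """
    n = len(nextblock)

    def valid(index):
        # Assumption: an out-of-range index names no block, so its rule adds nothing.
        return 0 <= index < n

    # true => d.block[d.blockbitmap] in Bitmap
    bitmap = {blockbitmap} if valid(blockbitmap) else set()
    # Bitmap is declared as a subset of Used; first blocks of entries are Used.
    used = bitmap | {fb for fb in firstblock if valid(fb)}

    # Follow nextblock links until no rule adds anything new (a fixpoint).
    next_rel = set()
    worklist = list(used)
    while worklist:
        b = worklist.pop()
        nb = nextblock[b]
        if valid(nb):
            next_rel.add((b, nb))
            if nb not in used:
                used.add(nb)
                worklist.append(nb)

    # A block that is not in Used is in Free.
    free = set(range(n)) - used
    return {"Used": used, "Free": free, "Bitmap": bitmap, "Next": next_rel}
```

With the corrupted value blockbitmap = -5 the Bitmap set comes out empty, which is the kind of violation a cardinality constraint on Bitmap would flag; with blockbitmap = 0 block 0 joins both Bitmap and Used.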

    Example of the Abstraction

    [Figure: the concrete file system (Directory Entries: intro, -5, 2, -1; Disk Blocks: -1, 3, -1) is abstracted into a model in which blocks 0 to 3 are divided between Used and Free, with a Bitmap set and a Next relation.]

    Consistency Constraints: The checks

    |Bitmap| = 1

    for u in Used, u.BlockStatus = true

    for f in Free, f.BlockStatus = false

    for b in Used, |Next.b| <= 1

    [Figure: the abstract model: blocks 0 to 3 divided between Used and Free, the Bitmap set, and the Next relation.]

    The Diagnosis

    Evaluate consistency properties, find violations

    |Bitmap| = 1 is violated: the Bitmap set is empty

    [Figure: the abstract model: blocks 0 to 3 divided between Used and Free, an empty Bitmap set, and the Next relation.]

    Repairing Violations of Model Consistency Properties

    Violation provides binding for quantified variables

    Convert body of the constraint to disjunctive normal form

    $(p_1 \wedge \dots \wedge p_n) \vee \dots \vee (q_1 \wedge \dots \wedge q_m)$

    $p_1, \dots, p_n, q_1, \dots, q_m$ are basic propositions

    Choose a conjunction to satisfy

    Repair violated basic propositions in the conjunction

    Repairing Violations of Basic Propositions

    Inequality constraints on values of numeric fields

    V.R = E, V.R < E, V.R <= E, V.R >= E, V.R > E

    Compute value of expression, assign relation

    Presence of required number of objects

    |S| = C, |S| <= C, |S| >= C

    Remove or insert objects from/to set

    Topology of region surrounding each object

    |V.R| = C, |V.R| <= C, |V.R| >= C

    |R.V| = C, |R.V| <= C, |R.V| >= C

    Remove or insert tuples from/to relation

    Inclusion constraints: V in S, V1 in V2.R, <V1,V2> in R

    Remove or add the object or tuple from/to set or relation

    Repairing Inconsistencies

    Repair the violation of (|Bitmap| = 1) (DNF format) by adding a block to the Bitmap set

    [Figure: the abstract model with a block added to the Bitmap set.]

    Goal-Directed Reasoning

    Abstract repairs add or remove objects (or tuples) to sets (or relations)

    Goal: find concrete data structure updates with the same effect

    1) Find model definition rules that construct the relevant set or relation

    2) Basic strategy:

    For removals, appropriately falsify the guards of all these model definition rules.

    For additions, appropriately satisfy the guard of one of these model definition rules.

    Goal-Directed Reasoning in Example

    Abstract repair: add block 0 to the Bitmap set

    Abstraction rules:

    for i in [0..numentries-1], 0 <= d.dir[i].firstblock => d.block[d.dir[i].firstblock] in Used

    for b in Used, 0 <= b.nextblock => <b, d.block[b.nextblock]> in Next

    for b in Used, 0 <= b.nextblock => d.block[b.nextblock] in Used

    for b in [0..numblocks-1], !(d.block[b] in Used) => d.block[b] in Free

    true => d.block[d.blockbitmap] in Bitmap

    for j in [0..numblocks-1], for b in Bitmap, true => <d.block[j], b.bitmap[j]> in BlockStatus

    Goal-Directed Reasoning in Example

    Abstract repair: add block 0 to the Bitmap set

    Abstraction rules (action taken):

    for i in [0..numentries-1], 0 <= d.dir[i].firstblock => d.block[d.dir[i].firstblock] in Used

    for b in Used, 0 <= b.nextblock => <b, d.block[b.nextblock]> in Next

    for b in Used, 0 <= b.nextblock => d.block[b.nextblock] in Used

    for b in [0..numblocks-1], !(d.block[b] in Used) => d.block[b] in Free

    true => d.block[d.blockbitmap] in Bitmap (guard already satisfied)

    for j in [0..numblocks-1], for b in Bitmap, true => <d.block[j], b.bitmap[j]> in BlockStatus

    Goal-Directed Reasoning in Example

    Abstract repair: add block 0 to the Bitmap set

    Relevant abstraction rule:

    true => d.block[d.blockbitmap] in Bitmap

    Action taken: d.block[d.blockbitmap] = block 0

    Corresponding data structure update: d.blockbitmap = index of block 0 in the d.block array

    Repair in Example

    [Figure: original file system (Directory Entries: intro, -5, 2, -1; Disk Blocks: -1, 3, -1) and updated file system (Directory Entries: intro, 0, 2, -1; Disk Blocks: -1, 3, -1), with blockbitmap marked.]

    Note: bitmap details are still abstracted

    Multiple Repairs

    Some broken data structures may require multiple repairs

    Reconstruct model

    Reevaluate consistency constraints

    Perform any required additional repairs

    Multiple repairs are needed either due to a complex repair or due to refinement (some repair rule acts on a more refined model of the system)
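The reconstruct, re-evaluate, repair cycle can be sketched as a loop that stops once no constraint is violated. This is a simplified illustration and not the tool described in the lecture: the dictionary encoding of the disk, the choice of block 0 as the bitmap block, the starting bitmap bits, and the function names are all assumptions.

```python
def build_model(d):
    """Rebuild the abstract model (Bitmap, Used, Free, BlockStatus) from the disk d."""
    n = len(d["nextblock"])

    def valid(index):
        return 0 <= index < n  # assumption: out-of-range indices name no block

    bitmap = {d["blockbitmap"]} if valid(d["blockbitmap"]) else set()
    used = bitmap | {fb for fb in d["firstblock"] if valid(fb)}
    worklist = list(used)
    while worklist:
        nb = d["nextblock"][worklist.pop()]
        if valid(nb) and nb not in used:
            used.add(nb)
            worklist.append(nb)
    # BlockStatus only exists once some block is known to hold the bitmap.
    status = dict(enumerate(d["bitmap"])) if bitmap else {}
    return {"Bitmap": bitmap, "Used": used,
            "Free": set(range(n)) - used, "BlockStatus": status}

def violations(model):
    """Evaluate |Bitmap| = 1 first, then the two BlockStatus constraints."""
    if len(model["Bitmap"]) != 1:
        return [("bitmap",)]
    wanted = [(b, True) for b in model["Used"]] + [(b, False) for b in model["Free"]]
    return [("status", b, v) for b, v in wanted if model["BlockStatus"][b] is not v]

def repair(d, violation):
    """Translate one abstract repair into a concrete data structure update."""
    if violation[0] == "bitmap":
        d["blockbitmap"] = 0  # add block 0 to Bitmap
    else:
        _, block, value = violation
        d["bitmap"][block] = value

def repair_until_consistent(d, max_rounds=10):
    """Return the number of repair rounds needed to reach a consistent structure."""
    for rounds in range(max_rounds):
        found = violations(build_model(d))
        if not found:
            return rounds
        for violation in found:
            repair(d, violation)
    raise RuntimeError("no consistent structure reached")
```

Starting from blockbitmap = -5 with arbitrary bitmap bits, the first round fixes the Bitmap set; only after the model is rebuilt do the BlockStatus violations become visible, and a second round fixes those.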

    Re-abstracted Model

    [Figure: the re-abstracted model: blocks 0 to 3 divided between Used and Free, the Bitmap set, the Next relation, and a BlockStatus relation mapping blocks to true or false.]

    Note: the BlockStatus relationship has been refined in this case. Refinement assigns arbitrary values to unassigned variables. This can be treated as an environment action.

    Diagnosis in New Model

    Re-evaluate model constraints, find violations of

    for u in Used, u.BlockStatus = true and

    for f in Free, f.BlockStatus = false

    Rule violations are due to refinement of the BlockStatus relationship

    [Figure: the re-abstracted model with the BlockStatus relation mapping blocks to true or false.]

    Action 2: Fix Block Status

    Repair violations of

    for u in Used, u.BlockStatus = true and

    for f in Free, f.BlockStatus = false

    by modifying the BlockStatus relation

    [Figure: the model after the BlockStatus relation has been modified.]

    Repaired File System

    [Figure: repaired file system, with blockbitmap marked. Directory Entries: intro, 10110, 2, -1. Disk Blocks: -1, 3, -1.]

    Repair Plan Graph

    [Figure: repair plan graph. State nodes (predicate abstraction of the state): "|Bitmap| != 1"; "f.BlockStatus = true for any free block, f.BlockStatus = false for any used block". Action nodes: "Add block to Bitmap"; "Replace tuple in BlockStatus"; "b.bitmap[j] = false for j = indexof(f)"; "Remove tuples from BlockStatus by removing Bitmap". Environment action when refining: "Assign arbitrary tuples to f.BlockStatus".]

    Experience

    We acquired five benchmarks (written in C/C++):

    AbiWord

    x86 emulator

    CTAS (air-traffic control tool)

    Simplified Linux file system

    Freeciv interactive game

    We developed specifications for all five

    Little development time (days, not weeks)

    Most of the time was spent figuring out Freeciv and CTAS

    Each benchmark has:

    Workload

    Bug or fault insertion methodology

    Ran benchmarks with and without repair

    Snapshot of Results

    Application    Time to check consistency (ms)    Time to check and repair (ms)
    AbiWord        0.06                              0.55
    CTAS           0.07                              0.15
    FreeCiv        3.62                              15.66
    File system    4.22                              263.14

    Self-healing System Design Checklist

    Level(s) of abstraction

    Multiple: set-relation abstraction of data. Object level for blocks, bit level for bitmap, type level for next. Needs traversal of about 1/K of memory, with K integers in a block.

    Time granularity

    Variable, possibly at a suitable execution-block level

    The complexity of the SH mechanism

    20,000 lines of code, deeply related to data, complex operations on data

    Which resource to spend on SH

    Mainly CPU

    What is the acceptability of the repair

    No guarantees; empirically observed to be successful

    Realizability: what infra support does the SH mechanism need

    Access to data and ability to modify it; no memory protection

    Continued execution vs. automated repair

    Continued execution

    ClearView: Code Patching Using Online Learning

    Detect, learn, repair

    [Figure: all executions pass through an attack detector, which separates normal executions from attacks (or bugs). Learning turns normal executions into predictive constraints, and Repair turns those constraints into a patch.]

    Attack detector: pluggable, does not depend on learning
    Learn normal behavior (constraints) from successful runs
    Check constraints during attacks
    A predictive constraint is true on every good run and false during every attack
    Patch to re-establish constraints; the patch restores normal behavior
    Evaluate and distribute patches

    [Lin & Ernst 2003]

    Learning normal behavior

    Observe normal behavior: clients do local inference (e.g. copy_len ≤ buff_size)
    Clients send inference results to the server
    Generalize observed behavior: the server generalizes (merges results)

    [Figure: community machines reporting to the server.]

    Attack detection & suppression

    Detector collects information and terminates the application

    Detectors used in our research:
    Code injection (Memory Firewall)
    Memory corruption (Heap Guard)
    Many other possibilities exist

    [Figure: community machines and the server.]

    Learning attack behavior

    What was the effect of the attack?
    Instrumentation continuously evaluates learned behavior
    Clients send the difference in behavior (the violated constraints) to the server
    Server correlates constraints to the attack

    [Figure: community machines and the server.]

    Repair

    Propose a set of patches for each behavior that predicts the attack

    Predictive: copy_len ≤ buff_size

    Candidate patches:
    1. Set copy_len = buff_size
    2. Set copy_len = 0
    3. Set buff_size = copy_len
    4. Return from procedure

    Server generates a set of patches

    Repair

    Distribute patches to the community

    Ranking:
    Patch 1: 0
    Patch 2: 0
    Patch 3: 0

    Repair

    Evaluate patches
    Success = no detector is triggered
    The detector is still running on clients
    When attacked, clients send the outcome to the server
    Server ranks patches

    Ranking:
    Patch 3: +5
    Patch 2: 0
    Patch 1: -5
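The ranking step can be sketched in a few lines of C. This is only an illustration under stated assumptions: the slide says success means no detector is triggered, but the scoring rule used here (+1 per success, -1 per failure) and the names `patch_score`, `report_outcome`, and `best_patch` are hypothetical, not taken from ClearView.

```c
#include <stddef.h>

/* Hypothetical per-patch record kept by the server. */
struct patch_score {
    int id;     /* patch number, e.g. 1..3 */
    int score;  /* successes minus failures reported by clients */
};

/* Record one client report for a patch.
 * Assumed rule: +1 if no detector fired after the patch ran, -1 otherwise. */
static void report_outcome(struct patch_score *p, int detector_triggered) {
    p->score += detector_triggered ? -1 : 1;
}

/* Return the index of the highest-scoring patch (ties go to the earliest).
 * Requires n >= 1. */
static size_t best_patch(const struct patch_score *p, size_t n) {
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (p[i].score > p[best].score) {
            best = i;
        }
    }
    return best;
}
```

With five successful reports for patch 3 and five failed reports for patch 1, this reproduces a +5 / 0 / -5 ranking, and `best_patch` selects patch 3 for redistribution.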

    Repair

    Redistribute the best patches
    Server redistributes the most effective patches (here, Patch 3)

    Ranking:
    Patch 3: +5
    Patch 2: 0
    Patch 1: -5

    Dynamic invariant detection

    Daikon generalizes observed program executions
    Many optimizations for accuracy and speed: data structures, code analysis, statistical tests
    We further enhanced the technique

    Candidate constraints:
    copy_len < buff_size
    copy_len ≤ buff_size
    copy_len = buff_size
    copy_len ≥ buff_size
    copy_len > buff_size
    copy_len ≠ buff_size

    Observation:
    copy_len: 22
    buff_size: 42

    Remaining candidates:
    copy_len < buff_size
    copy_len ≤ buff_size
    copy_len ≠ buff_size
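The candidate-elimination idea on this slide can be sketched as follows: start with every candidate relation between two variables and drop each one that an observed execution falsifies. This is a minimal illustration, not Daikon's implementation; the enum and the function names are hypothetical.

```c
#include <stdbool.h>

/* Candidate relations between two integer variables (hypothetical names). */
enum { REL_LT, REL_LE, REL_EQ, REL_GE, REL_GT, REL_NE, REL_COUNT };

/* True if relation `rel` holds for the observed pair (a, b). */
static bool rel_holds(int rel, long a, long b) {
    switch (rel) {
    case REL_LT: return a <  b;
    case REL_LE: return a <= b;
    case REL_EQ: return a == b;
    case REL_GE: return a >= b;
    case REL_GT: return a >  b;
    case REL_NE: return a != b;
    default:     return false;
    }
}

/* Start with every candidate still alive. */
static void candidates_init(bool alive[REL_COUNT]) {
    for (int r = 0; r < REL_COUNT; r++) {
        alive[r] = true;
    }
}

/* Discard every candidate that one observation falsifies. */
static void candidates_observe(bool alive[REL_COUNT], long a, long b) {
    for (int r = 0; r < REL_COUNT; r++) {
        if (alive[r] && !rel_holds(r, a, b)) {
            alive[r] = false;
        }
    }
}
```

Feeding in the observation copy_len = 22, buff_size = 42 leaves <, ≤, and ≠ alive. A later run in which the two values are equal would leave only ≤.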

    Repair example

    if (!(copy_len <= buff_size))
        copy_len = buff_size;

    The repair checks the predictive constraint
    If the constraint is not violated, no need to repair
    If the constraint is violated, an attack is (probably) underway

    The patch does not depend on the detector
    Should fix the problem before the detector is triggered

    Repair is not identical to what a human would write
    Unacceptable to wait for human response
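To make the effect of such a patch concrete, here is a small sketch that places the constraint check in front of a copy. The function `patched_copy` and its parameters are hypothetical; ClearView applies its patches to running binaries rather than to source code, so this only illustrates the logic.

```c
#include <stddef.h>
#include <string.h>

/* Copy copy_len bytes from src into dst, a buffer of buff_size bytes.
 * The first two lines are the patch: they re-establish the learned
 * constraint copy_len <= buff_size before the copy runs. */
static size_t patched_copy(char *dst, size_t buff_size,
                           const char *src, size_t copy_len) {
    if (!(copy_len <= buff_size))
        copy_len = buff_size;
    memcpy(dst, src, copy_len);
    return copy_len;  /* number of bytes actually copied */
}
```

When the constraint already holds, the call behaves exactly like the unpatched copy. When it does not, the copy is truncated, which is not what a human would write but keeps the write inside the buffer.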

    Example constraints & repairs

    Constraint: v1 ≤ v2
    Repair: if (!(v1 <= v2)) v1 = v2;

    Constraint: v ≥ c
    Repair: if (!(v >= c)) v = c;

    Constraint: v ∈ { c1, c2, c3 }
    Repair: if (!(v==c1 || v==c2 || v==c3)) v = ci;

    Return from enclosing procedure
    Repair: if (!(...)) return;

    Modify a use: convert "call *v" to
    Repair: if (...) call *v;
    Here the guard is the constraint on v (not negated)

    ClearView was successful

    Detected all attacks, prevented all exploits
    For 7/10 vulnerabilities, generated a patch that maintained functionality
    No observable deviation from desired behavior
    After an average of 4.9 minutes and 5.4 attacks
    Handled polymorphic attack variants
    Handled simultaneous & intermixed attacks
    No false positives
    Low overhead for detection & repair

    Self-healing System Design Checklist

    Level(s) of abstraction
    Program invariants (predicate abstraction)

    Time granularity
    At the level of granularity of the checkers, e.g. Heap Guard

    Complexity of the SH mechanism
    Implementation size unknown. Changes code control flow.

    Which resource to spend on SH
    Mainly CPU

    What is the acceptability of the repair
    No guarantees, empirically observed to work. Changes code control flow. Very difficult to verify!

    Realizability: what infrastructure support does the SH mechanism need
    Virtual instruction cache; restricted to JIT setups

    Continued execution vs. automated repair
    Automated repair

    Exterminator: Memory Fault Recovery

    Diagnosing Buffer Overflows

    Canonical buffer overflow:
    Allocate object too small
    Write past end, which corrupts object bytes forward
    Not necessarily contiguous

    [Figure: a bad object (too small) followed by the bytes written past its end.]

    char *str = new char[8];
    strcpy(str, "goodbye cruel world");

    Isolating Buffer Overflows

    Canaries in freed space detect corruption
    Run multiple times with DieFast allocator
    Key insight: Overflow must be at same …

    [Figure: heap layouts from separate runs, each a different random ordering of the objects. Red = possible bad object; green = not bad object.]

    Isolating Buffer Overflows

    Canaries in freed space detect corruption
    Run multiple times with DieFast allocator
    Key insight: Overflow must be at same … ⇒ object 9 overflowed, with high probability

    [Figure: three heap layouts from separate runs. Red = possible bad object; green = not bad object. Object order in each run:]
    8 10 2 9 3 4 5 1 7
    1 8 7 5 3 2 9 10 6 4
    4 9 6 3 8 5 7 2 1
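The isolation step can be sketched in C under a simplifying assumption that the slide does not state in full: the culprit is taken to be the one object that sits the same positive number of slots before the corrupted canary in every randomized run. The slot-based heap model (0 marks a free, canary-filled slot), the function names, and the data in the usage note are all hypothetical, not Exterminator's actual data structures.

```c
#include <stddef.h>

#define NOT_FOUND (-1)

/* Slot index of object `id` in one heap layout, or NOT_FOUND. */
static int slot_of(const int *layout, size_t n, int id) {
    for (size_t i = 0; i < n; i++) {
        if (layout[i] == id) {
            return (int)i;
        }
    }
    return NOT_FOUND;
}

/* layouts[r][i] is the id of the object in slot i during run r
 * (0 = free slot). corrupt_slot[r] is the slot whose canary was
 * found corrupted in run r. Returns the id of the single object that
 * precedes the corruption by the same distance in every run, or
 * NOT_FOUND if no object or more than one object qualifies. */
static int find_overflow_culprit(const int *const *layouts,
                                 const int *corrupt_slot,
                                 size_t runs, size_t n, int max_id) {
    int culprit = NOT_FOUND;
    for (int id = 1; id <= max_id; id++) {
        int delta0 = 0;
        int consistent = 1;
        for (size_t r = 0; r < runs && consistent; r++) {
            int s = slot_of(layouts[r], n, id);
            int delta = (s == NOT_FOUND) ? 0 : corrupt_slot[r] - s;
            if (delta <= 0) {
                consistent = 0;      /* absent, or not before the corruption */
            } else if (r == 0) {
                delta0 = delta;      /* distance seen in the first run */
            } else if (delta != delta0) {
                consistent = 0;      /* distance differs between runs */
            }
        }
        if (consistent) {
            if (culprit != NOT_FOUND) {
                return NOT_FOUND;    /* ambiguous: more runs are needed */
            }
            culprit = id;
        }
    }
    return culprit;
}
```

With a single run, every object in front of the corrupted slot is a suspect and the function reports no answer. Each extra randomized run removes suspects, which is why running multiple times narrows the result to one object with high probability.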

    Isolating Dangling Pointers

    Dangling pointer error:
    Live object freed too soon
    Overwritten by some other object

    int *v = new int[4];
    delete [] v; // oops
    char *str = new char[16];
    strcpy(str, "die, pointer");
    v[3] = 4;
    // ... use of v[0]

    Isolating Dangling Pointers

    Unlike a buffer overflow, a dangling pointer causes the same corruption in all …

    [Figure: three heap layouts from separate runs, each a different random ordering of objects 1 to 12.]

    Correcting Allocator

    Generate runtime patches to correct errors
    Track object call sites in allocator

    Prevent overflows: pad overflowed objects
    malloc(8) → malloc(8 + δ)

    Prevent dangling pointers: defer frees
    free(ptr) → delay δ mallocs; free(ptr)
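Both corrections can be sketched as thin wrappers around the C allocator. This is a simplified illustration under stated assumptions: the padding is passed in by the caller, and frees are delayed by a fixed number of later free calls rather than by a number of mallocs as on the slide. `DEFER_SLOTS`, `padded_malloc`, and `deferred_free` are hypothetical names, not Exterminator's interface.

```c
#include <stdlib.h>

#define DEFER_SLOTS 4   /* hypothetical: a free is delayed by 4 later frees */

/* Ring buffer of pointers whose release has been postponed. */
static void *deferred[DEFER_SLOTS];
static size_t next_slot;

/* Allocate `size` bytes plus `pad` extra bytes, so that a small write
 * past the end lands in padding instead of in a neighbouring object. */
static void *padded_malloc(size_t size, size_t pad) {
    return malloc(size + pad);
}

/* Queue `ptr` instead of freeing it at once. The pointer queued
 * DEFER_SLOTS calls earlier is released now, so a prematurely freed
 * object stays valid for a while longer. */
static void deferred_free(void *ptr) {
    void *victim = deferred[next_slot];
    deferred[next_slot] = ptr;
    next_slot = (next_slot + 1) % DEFER_SLOTS;
    free(victim);   /* free(NULL) is a no-op while the queue fills up */
}
```

A real correcting allocator would choose the pad and the delay per allocation call site, based on the diagnosed error, rather than use one global setting.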

    Exterminator Runtime Overhead

    [Figure: runtime overhead chart. The only recoverable label is 25%.]

    Self-healing System Design Checklist

    Level(s) of abstraction
    Memory objects

    Time granularity
    At each malloc/free call

    Complexity of the SH mechanism
    Not a very complex implementation

    Which resource to spend on SH
    Mainly CPU, memory (but that would be needed anyway)

    What is the acceptability of the repair
    Probabilistic guarantees that the repair will work

    Realizability: what infrastructure support does the SH mechanism need
    None

    Continued execution vs. automated repair
    Automated repair

    Exercise 1

    Write C code in which a local array overflows (say, in a for loop) significantly (enough to overwrite the return address).

    1. Compile and run this code on your Linux workstation. What do you observe? What is the explanation of your observation? Prepare a 1-page write-up on this.
    2. If possible, repeat the same on a Windows machine.
    3. Rewrite this code in C#. What happens in this case?

    Please email your write-up for these three questions to Satya Gautam (TA): [email protected]

    Exercise 2

    Write your own fault-tolerant my_malloc() and my_free() functions which do the following:

    Maintain a priority queue with elements containing pointers, additional lifetime, and additional buffer space. Set the free-delay and additional buffer space randomly.
    Additional buffer space = random number between 0 and ceil(size/10)
    Additional life = random number between 0 and 5 events. Event = a malloc/free call.
    For each my_malloc(), allocate additional memory; for each my_free(), delay the actual free operation by the additional lifetime.

    Try it out on buggy code which you have not been able to fix!

    Thank You

    Acyclic Repair Dependences

    Questions
    Isn't it possible for the repair of one constraint to invalidate another constraint?
    What about infinite repair loops?
    What about unsatisfiable specifications?

    Answer
    We require specifications to have no cyclic repair dependences between constraints
    So all repair sequences terminate
    Repair can fail only because of resource limitations
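The acyclicity requirement can be checked mechanically with a depth-first search over the repair dependence graph. The sketch below assumes the dependences are given as a small adjacency matrix, where `dep[i][j]` means that repairing constraint i may violate constraint j. The matrix representation, `MAX_CONSTRAINTS`, and the function names are hypothetical and are not taken from the repair tool itself.

```c
#include <stdbool.h>

#define MAX_CONSTRAINTS 8   /* hypothetical bound for this sketch */

/* Depth-first visit. state[]: 0 = unvisited, 1 = on the current path,
 * 2 = fully explored. Returns true if a cycle is reachable from i. */
static bool visit(bool dep[MAX_CONSTRAINTS][MAX_CONSTRAINTS],
                  int n, int i, int *state) {
    state[i] = 1;
    for (int j = 0; j < n; j++) {
        if (!dep[i][j]) {
            continue;
        }
        if (state[j] == 1) {
            return true;    /* back edge: repairs could loop forever */
        }
        if (state[j] == 0 && visit(dep, n, j, state)) {
            return true;
        }
    }
    state[i] = 2;
    return false;
}

/* True if the first n constraints have a cyclic repair dependence. */
static bool has_cyclic_repair_dependence(
        bool dep[MAX_CONSTRAINTS][MAX_CONSTRAINTS], int n) {
    int state[MAX_CONSTRAINTS] = {0};
    for (int i = 0; i < n; i++) {
        if (state[i] == 0 && visit(dep, n, i, state)) {
            return true;
        }
    }
    return false;
}
```

A specification whose graph passes this check admits only finite repair sequences, which is the termination argument given in the answer above.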

    References

    Jeff H. Perkins, Sunghun Kim, Sam Larsen, Saman Amarasinghe, Jonathan Bachrach, Michael Carbin, Carlos Pacheco, Frank Sherwood, Stelios Sidiroglou, Greg Sullivan, Weng-Fai Wong, Yoav Zibin, Michael D. Ernst, and Martin Rinard, "Automatically patching errors in deployed software," in Proceedings of the 21st ACM Symposium on Operating Systems Principles (Big Sky, MT, USA), October 12-14, 2009, pp. 87-102.

    Brian Demsky, Martin C. Rinard, "Goal-Directed Reasoning for Specification-Based Data Structure Repair," IEEE Transactions on Software Engineering, vol. 32, no. 12, pp. 931-951, Dec. 2006, doi:10.1109/TSE.2006.122

    "Self-adaptive Software: Landscape and Research Challenges," ACM TAAS, March 2009

    Soft errors: "Soft errors in circuits and systems," IBM Journal of R&D, Vol. 52, No. 3, 2008, http://researchweb.watson.ibm.com/journal/rd52-3.html

    Martin Rinard, "Acceptability Oriented Computing," ACM SIGPLAN Notices, Vol. 38, Issue 12, December 2003
    Read other related works of Martin Rinard: http://people.csail.mit.edu/rinard/acceptability_oriented_computing/

    Marco Schneider, "Self-Stabilization," ACM Computing Surveys, Volume 25, Issue 1, March 1993.