* Work supported in part by SRC Contract 1031.001 and NSF Award 0219805

Verifying MP Executions against Itanium Orderings using SAT*

Ganesh Gopalakrishnan, Yue Yang, Hemanthkumar Sivaraj

School of Computing, University of Utah, Salt Lake City, UT 84112


2

Efficient Multiprocessors must have Efficient Shared Memory Systems

* Hide the cost of memory operations by postponing updates

* Increasingly important because CPU speeds are growing faster than memory system speeds

3

How to build Efficient Shared-memory Multiprocessor Systems?

• Employ weak memory models

– They permit global state updates to be postponed

• Employ aggressive shared memory consistency protocols

– Weak memory models permit shared memory consistency protocols to be aggressive without undue complexity (no speculation, etc.)

The focus of this talk is on weak memory models

4

Weak memory models allow multiple executions...

Program (two CPUs sharing memory):

P1: st c,1 ; st d,2
P2: ld d ; ld c

One possible execution...

P1: st c,1 ; st d,2
P2: ld d, 2 ; ld c, 1

Possible under SC and under Itanium

Another execution...

P1: st c,1 ; st d,2
P2: ld d, 2 ; ld c, 0

Impossible under SC; possible under Itanium

5

Problems with Weak Memory Models

• Hard to understand (easy to misunderstand)

P:
st [x] = 1
mf
ld r1 = [y] <0>

Q:
st.rel [y] = 1

R:
ld.acq r2 = [y] <1>
ld r3 = [x] <0>

Is this legal under Itanium? (no)

6

Post-Si verification of MP Orderings today (oversimplified)

New MP System

assembly program 1 ... assembly program n

assembly execution 1 ... assembly execution n

Run repeatedly to catch one interleaving that might reveal a bug

Check every execution against ordering rules for compliance

* This is done ad hoc

* How to make this formal and efficient?

* How to capitalize on repeated re-runs?

7

Explanation of Illegal Executions (p 31 of Itanium App Note – search 251429)

P:
us:  st [x] = 1
mf:  mf
ul1: ld r1 = [y] <0>

Q:
sr:  st.rel [y] = 1

R:
la:  ld.acq r2 = [y] <1>
ul2: ld r3 = [x] <0>

• US >> MF ; hence RVr(US) → F(MF)

• MF >> UL1 ; hence F(MF) → R(UL1)

• …many reasons… hence R(UL1) → RVp(SR)

• If RVr(SR) → R(UL1) and RVr(SR) → UL1 → RVp(SR), WB release atomicity of SR is violated, thus R(UL1) → RVr(SR)

• …five lines of reasons… hence RVr(SR) → R(LA)

• Since LA >> UL2, R(LA) → R(UL2)

• Another para of reasons LV(Sr2) R(UL2) LV(SR1) RVp(SR1) RVq(SR1) F(MF1) R(UL1) RVq(SR2) RVp(SR2). But can’t allow due to atomicity of SR.

8

Checking Executions and Providing Explanations (present approach)

P:
st [x] = 1
mf
ld r1 = [y] <0>

Q:
st.rel [y] = 1

R:
ld.acq r2 = [y] <1>
ld r3 = [x] <0>

• Published approaches are very labor-intensive paper-and-pencil proofs

• Clearly this can’t scale (a 6-instruction MP program takes a 1-page detailed mathematical proof)

• What about the combinatorics of reasoning about 200 instructions?

• Approaches actually used within the industry involve the use of “checkers”

• Details of these checkers are unknown (How complete? How scalable?)

9

Our Approach

Itanium ordering rules written in Higher Order Logic
→ Mechanical program derivation
→ Checker program

MP execution to be checked:

P: st [x] = 1 ; mf ; ld r1 = [y] <0>
Q: st.rel [y] = 1
R: ld.acq r2 = [y] <1> ; ld r3 = [x] <0>

→ Checker program
→ Satisfiability problem with clauses carrying annotations
→ Sat solver

• Sat: explanation in the form of one possible interleaving

• Unsat: Unsat core extraction using Zcore
– Find offending clauses
– Trace their annotations
– Determine “ordering cycle”

10

Largest example tried to date (courtesy S. Zeisset, Intel)

Proc 1

st8 [12ca20] = 7f869af546f2f14c
ld r25 = [45180] <87b5e547172644a8>

… 58 more instructions…

st2 [7c2a00] = 4bca

Proc 2

ld4 r24 = [733a74] <415e304>
st4.rel [175984] = 96ab4e1f

ld8 r87 = [56460] <b5c113d7ce4783b1>

• Initially the tool gave a trivial violation

• Diagnosed to be forgotten memory initialization

• Added method to incorporate memory initialization in our tool

• Our tool found the exact same cycle as pointed out by the author of the test

• Sat generation and Sat solving times need improving

Cycle found by our tool:

st.rel (line 18, P1) → ld (line 22, P2) → mf → ld (line 30, P2) → st (line 11, P1)

11

Statistics Pertaining to Case Study

Proc 1

st8 [12ca20] = 7f869af546f2f14c
ld r25 = [45180] <87b5e547172644a8>

st2 [7c2a00] = 4bca

Proc 2

ld4 r24 = [733a74] <415e304>
st4.rel [175984] = 96ab4e1f

ld8 r87 = [56460] <b5c113d7ce4783b1>

• All runs were on a 1.733 GHz Athlon with 1 GB of memory, running Red Hat Linux 9

• ~2 minutes to generate Sat instance

• 14,053,390 clauses

• 117,823 variables

• ~1 minute to solve Sat problem - found Unsat

• Unsat Core generation runs fast – gave 23 clauses!
– 23 of the 14M clauses were causing the problem to be Unsat
– Sat time for these 23 clauses … under a second

Unsat Core’s annotations were traced back to offending instructions and the memory ordering rules that situated them in a “cycle”

12

The rest of the talk

• Itanium memory model in Higher Order Logic (well, not so high actually…)

• Our HOL specs → translation → “sat-generating checker programs”

• Execution to be checked → translation by above program → Sat

• Each assembly instruction → clauses it generates + annotations

• When Sat, what interleaving explains?

• When Unsat, how to get “core” (root-cause) + annotations on core

• Translating annotations on core to cycle on original program

13


The initial focus of our presentation :

- How to model an execution ?

- Why use “split stores” in modeling ?

14


Basic problem-modeling idea:

Find a “shuffle” of the instructions that explains the observations…

P0: st [y] = 1 ; ld reg1 = [y] <1>
P1: ld reg2 = [y] <1>

Explanation…

st [y] = 1
ld reg1 = [y] <1>
ld reg2 = [y] <1>

The basic idea won’t always work…

P0: st.rel [y] = 1 ; ld.acq r3 = [y] <1> ; ld reg1 = [x] <0>
P1: st.rel [x] = 2 ; ld.acq r4 = [x] <2> ; ld reg2 = [y] <0>

On each processor, data dependence (Dat. Dep.) orders the st.rel before the ld.acq, and ld.acq order puts the ld.acq before the ld.

No shuffle of these sequences that respects these orderings satisfies the read values.

15

• Problem Modeling…

Idea: Find a shuffle after each store is split into (p+1) copies… (by the way, this idea has sort of become “standard”)

P0: st [y] = 1
P1: st [x] = 2

Split of st [y] = 1:
– Local copy for P0
– “remote” copy for P0
– “remote” copy for P1

st [x] = 2: a similar split

Now, arrange the split copies…

16

• Problem Modeling…

P0: st [y] = 1 ; ld.acq r3 = [y] <1> ; ld reg1 = [x] <0>
P1: st [x] = 2 ; ld.acq r4 = [x] <2> ; ld reg2 = [y] <0>

Now, arrange the split copies…

st [y] = 1 “l”, st [y] = 1 “rp0”, st [y] = 1 “rp1”
st [x] = 2 “l”, st [x] = 2 “rp0”, st [x] = 2 “rp1”

Explanation…

[Figure: an arrangement of the six split store copies and the four loads, with arrows marking dependencies and anti-dependencies]

17

• Back to Itanium memory model in Higher Order Logic thru an example

Informal statement:

Store-Releases to write-back memory become visible to all processors in the same order

Implementation:

All copies of a “split st.rel” are visible atomically

[Figure: the split copies of st.rel [x] = 1 form an atomic set]

18

One standard way of specifying atomicity:

All other events “e” are strictly before or strictly after the atomic set

Another standard way of specifying atomicity:

If some event “e” is between two events in the atomic set, then “e” also belongs to the atomic set

19

• atomicWBRelease rule (Section 3.3.7.1 of Intel App Note):

atomicWBRelease(ops,order) =
  Forall (i in ops).(j in ops).(k in ops).
    (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB)
    /\ (i.wrID = k.wrID)
    /\ order(i,j) /\ order(j,k)
    ==> (j.wrID = i.wrID)

[Figure: op j lies between ops i and k in the order]

We have reduced the ~36 page Intel App Note to ~3 pages of HOL rules (barring a few simple omissions…)
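As a minimal sketch of what this rule demands, the Python function below checks atomicWBRelease against one candidate total order. The dictionary-based op records, the `position` map (op id to its place in the order), and the sample data are assumptions made for illustration; only the field names op, wrType, wrID, var and the WB attribute come from the rule.

```python
def atomic_wb_release(ops, position, attr_of):
    """Return True if no foreign op sits between two copies of one WB st.rel."""
    def before(a, b):
        return position[a["id"]] < position[b["id"]]

    for i in ops:
        if not (i["op"] == "StRel" and i["wrType"] == "Remote"
                and attr_of[i["var"]] == "WB"):
            continue  # antecedent false: nothing to check for this i
        for k in ops:
            if k["wrID"] != i["wrID"]:
                continue
            for j in ops:
                # j strictly between i and k must belong to the same write
                if before(i, j) and before(j, k) and j["wrID"] != i["wrID"]:
                    return False
    return True


# Hypothetical data: two remote copies of one st.rel to y, plus one load of y.
OPS = [
    {"id": 0, "op": "StRel", "wrType": "Remote", "var": "y", "wrID": 0},
    {"id": 1, "op": "StRel", "wrType": "Remote", "var": "y", "wrID": 0},
    {"id": 2, "op": "Ld", "wrType": "DontCare", "var": "y", "wrID": -1},
]
ATTR = {"y": "WB"}

ok_order = {0: 0, 1: 1, 2: 2}   # load after both copies
bad_order = {0: 0, 2: 1, 1: 2}  # load between the two copies
```

With this data, `ok_order` passes and `bad_order` fails, because the load lands inside the atomic set.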

20

Basic idea behind Intel’s Formal Spec (which we follow in our formal spec)

legalItanium(ops) =
  Exists order.
    ( requireStrictTotalOrder ops order
    /\ requireWriteOperationOrder ops order
    /\ requireProgramOrder ops order
    /\ requireMemoryDataDependence ops order
    /\ requireDataFlowDependence ops order
    /\ requireCoherence ops order
    /\ requireAtomicWBRelease ops order
    /\ requireSequentialUC ops order
    /\ requireNoUCBypass ops order
    /\ requireReadValue ops order )

The rules between requireStrictTotalOrder and requireReadValue: call them “otherOrder”

SC(ops) =
  Exists order.
    ( requireStrictTotalOrder ops order
    /\ requireProgramOrder ops order
    /\ requireReadValue ops order )

Make it look like SC so that people have less trouble understanding!

21

But, how do we check executions against such specs?

legalItanium(ops) =
  Exists order.
    ( requireStrictTotalOrder ops order
    /\ requireWriteOperationOrder ops order
    /\ requireProgramOrder ops order
    /\ requireMemoryDataDependence ops order
    /\ requireDataFlowDependence ops order
    /\ requireCoherence ops order
    /\ requireAtomicWBRelease ops order
    /\ requireSequentialUC ops order
    /\ requireNoUCBypass ops order
    /\ requireReadValue ops order )

SC(ops) =
  Exists order.
    ( requireStrictTotalOrder ops order
    /\ requireProgramOrder ops order
    /\ requireReadValue ops order )

Execution 1:
P1: st c,1 ; st d,2
P2: ld d, 2 ; ld c, 1

Execution 2:
P1: st c,1 ; st d,2
P2: ld d, 2 ; ld c, 0

e.g., which execution is legal under which memory model?
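For the SC half of this question, a brute-force sketch is enough at this size: enumerate every strict total order of the four ops and keep those that meet program order and the read values. The op encoding as (proc, pc, kind, var, value) tuples and the initial memory value of 0 are assumptions for illustration; the tool described in these slides translates to SAT rather than enumerating orders.

```python
from itertools import permutations

# Each op: (proc, pc, kind, var, value). Loads carry the value they observed.
EXEC1 = [(1, 0, "st", "c", 1), (1, 1, "st", "d", 2),
         (2, 0, "ld", "d", 2), (2, 1, "ld", "c", 1)]
EXEC2 = [(1, 0, "st", "c", 1), (1, 1, "st", "d", 2),
         (2, 0, "ld", "d", 2), (2, 1, "ld", "c", 0)]


def respects_program_order(order):
    """Ops of one processor must appear in increasing pc order."""
    last_pc = {}
    for proc, pc, _, _, _ in order:
        if last_pc.get(proc, -1) > pc:
            return False
        last_pc[proc] = pc
    return True


def respects_read_values(order, initial=0):
    """Each load must return the most recent store to its variable."""
    memory = {}
    for _, _, kind, var, value in order:
        if kind == "st":
            memory[var] = value
        elif memory.get(var, initial) != value:
            return False
    return True


def legal_sc(ops):
    """SC(ops): some total order meets program order and read values."""
    return any(respects_program_order(order) and respects_read_values(order)
               for order in permutations(ops))
```

Here `legal_sc(EXEC1)` holds and `legal_sc(EXEC2)` does not. Checking legalItanium would need the remaining rules and the split stores, which this sketch omits.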

22



23

Transformation of HOL specs to generate constraints

Initial Spec:

atomicWBRelease(ops,order) =
  forall (i in ops).(j in ops).(k in ops).
    (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB)
    /\ (i.wrID = k.wrID)
    /\ order(i,j) /\ order(j,k)
    ==> (j.wrID = i.wrID)

Applying Contrapositive:

atomicWBRelease(ops,order) =
  forall (i in ops).(j in ops).(k in ops).
    (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB)
    /\ (i.wrID = k.wrID)
    /\ ~(j.wrID = i.wrID)
    ==> ~(order(i,j) /\ order(j,k))

After Reducing Quantifier Scopes:

atomicWBRelease(ops,order) =
  forall (i in ops).
    (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB)
    ==> forall (k in ops).
          (i.wrID = k.wrID)
          ==> forall (j in ops).
                ~(j.wrID = i.wrID)
                ==> ~(order(i,j) /\ order(j,k))

24

Functional (Ocaml) Program Derivation from HOL Specs:

Transformed Spec:

atomicWBRelease(ops,order) =
  forall (i in ops).
    (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB)
    ==> forall (k in ops).
          (i.wrID = k.wrID)
          ==> forall (j in ops).
                ~(j.wrID = i.wrID)
                ==> ~(order(i,j) /\ order(j,k))

Functional Program that generates the constraints (will be automated):

atomicWBRelease(ops) = forall(i,ops,wb(i))

wb(i) = if ~((attr_of i.var=WB) & (i.op=StRel) & (i.wrType=Remote))
        then true
        else forall(k,ops,wb1(i,k))

wb1(i,k) = if ~(i.wrID=k.wrID) then true else forall(j,ops,wb2(i,k,j))

wb2(i,k,j) = if (j.wrID=i.wrID) then true else ~(order(i,j) & order(j,k))

forall(i,S, e(i)) = for all i in S : e(i)
(* foldr( map (fn i -> e(i)) (S) (&), true) *)
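The derived program can be read as a clause generator: wherever wb2 reaches ~(order(i,j) & order(j,k)), it emits the clause (¬order(i,j) ∨ ¬order(j,k)). The Python sketch below mirrors wb, wb1, and wb2 as nested loops. The variable numbering `i * n + j + 1` for "op i before op j", the dictionary op records, and the sample data are assumptions for illustration, not the tool's actual layout.

```python
def atomic_wb_release_clauses(ops, attr_of):
    """Emit CNF clauses (lists of signed ints) for atomicWBRelease."""
    n = len(ops)

    def order_var(a, b):
        # Hypothetical numbering: variable for "op a is ordered before op b".
        return a["id"] * n + b["id"] + 1

    clauses = []
    for i in ops:                                   # wb(i)
        if not (attr_of[i["var"]] == "WB" and i["op"] == "StRel"
                and i["wrType"] == "Remote"):
            continue                                # "then true": no clause
        for k in ops:                               # wb1(i, k)
            if i["wrID"] != k["wrID"]:
                continue
            for j in ops:                           # wb2(i, k, j)
                if j["wrID"] == i["wrID"]:
                    continue
                # ~(order(i,j) & order(j,k)) as a two-literal clause
                clauses.append([-order_var(i, j), -order_var(j, k)])
    return clauses


# Hypothetical ops with ids 0..n-1: two remote st.rel copies and one load.
OPS = [
    {"id": 0, "op": "StRel", "wrType": "Remote", "var": "y", "wrID": 0},
    {"id": 1, "op": "StRel", "wrType": "Remote", "var": "y", "wrID": 0},
    {"id": 2, "op": "Ld", "wrType": "DontCare", "var": "y", "wrID": -1},
]
CLAUSES = atomic_wb_release_clauses(OPS, {"y": "WB"})
```

For the three sample ops this yields four binary clauses, each forbidding the load from sitting between two copies of the store.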

25




26

P1: St a,1; Ld r1,a <1>; St b,r1 <1>;

P2: Ld.acq r2,b <1>; Ld r3,a <0>;

Have built tool for tuple-generation that addresses many details:

(1) Expansion into tuples with variable address allocation

{id=0; proc=0; pc=0; op= St; var=0; data=1; wrID=0; wrType=Local; wrProc=0; reg=-1; useReg=false};

{id=1; proc=0; pc=0; op= St; var=0; data=1; wrID=0; wrType=Remote; wrProc=0; reg=-1; useReg=false};

{id=2; proc=0; pc=0; op= St; var=0; data=1; wrID=0; wrType=Remote; wrProc=1; reg=-1; useReg=false};

{id=3; proc=0; pc=1; op= Ld; var=0; data=1; wrID=-1; wrType=DontCare; wrProc=-1; reg=0; useReg=true};

{id=4; proc=0; pc=2; op= St; var=1; data=1; wrID=4; wrType=Local; wrProc=0; reg=0; useReg=true};

{id=5; proc=0; pc=2; op= St; var=1; data=1; wrID=4; wrType=Remote; wrProc=0; reg=0; useReg=true};

{id=6; proc=0; pc=2; op= St; var=1; data=1; wrID=4; wrType=Remote; wrProc=1; reg=0; useReg=true};

{id=7; proc=1; pc=0; op= LdAcq; var=1; data=1; wrID=-1; wrType=DontCare; wrProc=-1; reg=1; useReg=true};

{id=8; proc=1; pc=1; op= Ld; var=0; data=0; wrID=-1; wrType=DontCare; wrProc=-1; reg=2; useReg=true}

(Tuple 1 through Tuple 9, in the order listed above)
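The expansion can be sketched in a few lines of Python: every store becomes one Local copy plus one Remote copy per processor, all sharing the wrID of the local copy, and every load becomes a single tuple. The function name `gen_tuples` and the input format are assumptions for illustration, and the reg and useReg fields of the listing are left out for brevity.

```python
def gen_tuples(program, num_procs):
    """program: list of (proc, pc, op, var, data) in listing order."""
    var_ids = {}   # variable name -> allocated address index
    tuples = []
    for proc, pc, op, var, data in program:
        var_id = var_ids.setdefault(var, len(var_ids))
        base = {"proc": proc, "pc": pc, "op": op, "var": var_id, "data": data}
        if op in ("St", "StRel"):
            wr_id = len(tuples)  # all copies share the id of the local copy
            copies = [("Local", proc)]
            copies += [("Remote", p) for p in range(num_procs)]
            for wr_type, wr_proc in copies:
                tuples.append(dict(base, id=len(tuples), wrID=wr_id,
                                   wrType=wr_type, wrProc=wr_proc))
        else:
            tuples.append(dict(base, id=len(tuples), wrID=-1,
                               wrType="DontCare", wrProc=-1))
    return tuples


# The two-processor program from this slide (processors numbered from 0).
PROGRAM = [
    (0, 0, "St", "a", 1),
    (0, 1, "Ld", "a", 1),
    (0, 2, "St", "b", 1),
    (1, 0, "LdAcq", "b", 1),
    (1, 1, "Ld", "a", 0),
]
TUPLES = gen_tuples(PROGRAM, num_procs=2)
```

Run on the program above, the sketch produces nine tuples whose id, proc, pc, op, var, data, wrID, wrType, and wrProc fields agree with the listing.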

27

How the SAT encoding is achieved...

legalItanium(ops) =
  Exists order.
    ( requireStrictTotalOrder ops order
    /\ requireOtherOrderItanium ops order
    /\ requireReadValue ops order )

SC(ops) =
  Exists order.
    ( requireStrictTotalOrder ops order
    /\ requireOtherOrderSC ops order
    /\ requireReadValue ops order )

Example Execution:

P1: st c,1 ; st d,2
P2: ld d, 2 ; ld c, 0

Break it down into “tuples”:

• Store c viewed at P1 for modeling bypassing
• Store c viewed at P1 for modeling global visibility
• Store c viewed at P2 for modeling global visibility
• Store d viewed at P1 for modeling bypassing
• Store d viewed at P1 for modeling global visibility
• Store d viewed at P2 for modeling global visibility
• Ld d viewed at P2 for modeling read value
• Ld c viewed at P2 for modeling read value

8 tuples obtained

28

Constraint Encoding Approach #1

n log n approach (“small domain” encoding)

• Attach a word w_t of 2 bits to each tuple t (2 bits for the 4-tuple illustration; about log n bits in general)

• Tuple i before Tuple j --> Assert w_i < w_j

• StrictTotalOrder --> Assert that the w_t words are distinct

• Smaller # of Boolean Vars

• Much Harder SAT instances (abandoned for now)

Illustration on 4 tuples:

Bit variables: x00 x01, x10 x11, x20 x21, x30 x31

A system of constraints with primitive constraint (xi1, xi0) < (xj1, xj0), covering:
– requireStrictTotalOrder ops order
– requireOtherOrder ops order
– requireReadValue ops order

For all i != j: (xi1, xi0) != (xj1, xj0)
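A quick way to see the "smaller number of Boolean variables" claim is to count them for both encodings. The sketch below assumes ceil(log2 n) bits per tuple for the small-domain words and one variable per ordered pair for the matrix encoding; the function names are illustrative only.

```python
from math import ceil, log2


def small_domain_vars(n):
    """n words of ceil(log2 n) bits each (at least one bit)."""
    return n * max(1, ceil(log2(n)))


def eij_vars(n):
    """One Boolean variable per ordered pair of tuples."""
    return n * n
```

For the 4-tuple illustration this gives 8 bit variables (x00 through x31) against 16 matrix entries; for 128 tuples it gives 896 against 16384.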

29

Constraint Encoding Approach #2

n × n approach (“e_ij” encoding)

• Assign a matrix position mij for each pair of tuples ti and tj

• Tuple i before Tuple j --> Assert mij true

• StrictTotalOrder --> Assert Irreflexivity, Transitivity, Totality

• Larger # of Boolean Vars

• Easier SAT instances (being pursued now)

Illustration on 4 tuples:

[Figure: 4 × 4 matrix of Boolean variables with entry mij]

A system of constraints with primitive constraint mij, covering:
– requireStrictTotalOrder ops order
– requireOtherOrder ops order
– requireReadValue ops order

Forall i : ~mii

Forall i,j with i != j : mij \/ mji

Forall i,j,k : mij /\ mjk => mik
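The three strict-total-order axioms translate directly into CNF. The sketch below generates them for n tuples and, as a sanity check, counts satisfying assignments by brute force; for n = 3 the count should equal 3! = 6, one model per total order. The variable numbering `i * n + j + 1` and the function names are assumptions for illustration, and the brute-force counter stands in for a real SAT solver only at toy sizes.

```python
from itertools import product


def var(i, j, n):
    # Hypothetical numbering of the matrix entry m_ij (1-based, DIMACS style).
    return i * n + j + 1


def total_order_clauses(n):
    clauses = []
    for i in range(n):                      # irreflexivity: ~m_ii
        clauses.append([-var(i, i, n)])
    for i in range(n):                      # totality for i != j
        for j in range(i + 1, n):
            clauses.append([var(i, j, n), var(j, i, n)])
    for i, j, k in product(range(n), repeat=3):
        if i != j and j != k:               # transitivity: m_ij & m_jk => m_ik
            clauses.append([-var(i, j, n), -var(j, k, n), var(i, k, n)])
    return clauses


def count_models(clauses, num_vars):
    """Count assignments satisfying every clause (toy sizes only)."""
    count = 0
    for bits in product([False, True], repeat=num_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            count += 1
    return count
```

The transitivity clauses grow as n cubed, which is consistent with the concluding remark that transitivity is the main source of complexity.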

30

Table of Results (somewhat dated…)

SAT-instance generation time for n log n method

Tuples   Total Order   Other Order
32       0.2           1.6
64       1.2           17.1
128      5.7           179.0

SAT-instance generation time for n × n method

Tuples   Total Order   Other Order
32       0.5           0.1
64       4.3           0.9
128      34.2          9.0

SAT-checking times

         n log n                          n × n
Tuples   Monolith  TotalOrd  OtherOrd     Monolith  TotalOrd  OtherOrd
32       9.6       0.6       4.3          0.33      0.69      0.05
64       247.17    29.53     37.6         2.73      6.17      0.5
128      abort     1341      abort        164.8     145.6     351.1

31

Explaining the results of Sat




• Each assembly instruction → clauses it generates + annotations

• When Sat, what interleaving explains?

• When Unsat, how to get “core” (root-cause) + annotations on core

• Translating annotations on core to cycle on original program

32

Clause Annotations

• Each clause generated by the sat-generating checker program also generates an associated tuple.

• This tuple has information pertaining to the clause’s source.

• Each tuple has the following information
– The ops involved in generating the clause (up to a maximum of 4 ops could generate a clause)
– The proc value of the processor whose instructions were used to generate this clause (taken from the tuples generated by the gentuple program)
– The pc value of the instruction that was the source for this tuple
– The name of the memory ordering rule the application of which generated this tuple (ReadValue, ProgramOrder, Reflexive, etc)

• The clause annotation looks as follows: < proc, pc, op1, op2, op3, op4, RuleName >
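A small parser makes the annotation format concrete. The sketch below assumes the textual form used in the annotation listings of this talk (`op1 = …; op2 = …; op3 = …; op4 = …; rule = …`, with -1 marking an unused op slot); the `Annotation` type and `parse_annotation` name are hypothetical, and the proc and pc fields are omitted because those listings do not print them.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    ops: tuple   # ids of the ops involved; unused (-1) slots are dropped
    rule: str    # name of the memory ordering rule


def parse_annotation(line):
    """Parse 'op1 = 4; op2 = 5; op3 = -1; op4 = -1; rule = ProgramOrder'."""
    fields = dict(part.strip().split(" = ") for part in line.split(";"))
    ops = tuple(int(fields[name]) for name in ("op1", "op2", "op3", "op4")
                if int(fields[name]) != -1)
    return Annotation(ops, fields["rule"])
```

For example, the annotation `op1 = 4; op2 = 5; op3 = -1; op4 = -1; rule = ProgramOrder` parses to ops (4, 5) with rule ProgramOrder.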

33

Example execution (Table 18, pg. 31 of App note)

P:
st [x] = 1
mf
ld r1 = [y] <0>

Q:
st.rel [y] = 1

R:
ld.acq r2 = [y] <1>
ld r3 = [x] <0>

• The Sat instance generated for the above example is UNSAT.

• Next few slides show automated approach to detect the root cause cycle.

• We will ignore the reflexive and transitive rules in these slides (they are necessary to force unsat, but useless in building a cycle!!)

34

Clause annotations for the unsat core for example

op1 = 1; op2 = -1; op3 = -1; op4 = -1; rule = Reflexive
op1 = 4; op2 = 5; op3 = 6; op4 = -1; rule = TransitiveOrder
op1 = 4; op2 = 5; op3 = -1; op4 = -1; rule = ProgramOrder
op1 = 4; op2 = 6; op3 = 8; op4 = -1; rule = TransitiveOrder
op1 = 4; op2 = 11; op3 = 12; op4 = -1; rule = TransitiveOrder
op1 = 5; op2 = 6; op3 = -1; op4 = -1; rule = ProgramOrder
op1 = 6; op2 = 8; op3 = -1; op4 = -1; rule = TotalOrder
op1 = 10; op2 = 11; op3 = -1; op4 = -1; rule = TotalOrder
op1 = 11; op2 = 4; op3 = 8; op4 = -1; rule = TransitiveOrder
op1 = 11; op2 = 4; op3 = -1; op4 = -1; rule = TotalOrder
op1 = 11; op2 = 12; op3 = -1; op4 = -1; rule = ProgramOrder
op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = 8; op3 = -1; op4 = -1; rule = ReadValue
op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue
op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = 4; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = -1; op2 = -1; op3 = -1; op4 = -1; rule = NoRule
op1 = 10; op2 = 12; op3 = -1; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = -1; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 10; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 9; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease
op1 = 10; op2 = 11; op3 = 8; op4 = -1; rule = AtomicWBRelease

35

Op numbers for the example (a store has both local and remote ops):

P:
ops 1-4: st [x] = 1
op 5:    mf
op 6:    ld r1 = [y] <0>

Q:
ops 7-10: st.rel [y] = 1

R:
op 11: ld.acq r2 = [y] <1>
op 12: ld r3 = [x] <0>

36

[Figure: the op-numbering diagram of the example, repeated]

op1 = 4; op2 = 5; op3 = -1; op4 = -1; rule = ProgramOrder

37

[Figure: the op-numbering diagram of the example, repeated]


38

[Figure: the op-numbering diagram of the example, repeated]

op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue

op1 = 6; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue

op1 = 6; op2 = 8; op3 = -1; op4 = -1; rule = ReadValue


39

[Figure: the op-numbering diagram of the example, repeated]

op1 = 10; op2 = 12; op3 = -1; op4 = -1; rule = AtomicWBRelease

op1 = 10; op2 = 11; op3 = -1; op4 = -1; rule = AtomicWBRelease

op1 = 10; op2 = 11; op3 = 10; op4 = -1; rule = AtomicWBRelease






40

[Figure: the op-numbering diagram of the example, repeated]

op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 11; op2 = 10; op3 = -1; op4 = -1; rule = ReadValue

41

[Figure: the op-numbering diagram of the example, repeated]


42

[Figure: the op-numbering diagram of the example, repeated]

op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = 4; op3 = -1; op4 = -1; rule = ReadValue
op1 = 12; op2 = -1; op3 = -1; op4 = -1; rule = ReadValue

43

Good Case-study Illustrating Program Derivation from Formal Specs

• Initial specs: HOL

• Formal derivation of tail-recursive functional programs

• “Code generation” consists of generating Boolean clauses
– Choose Boolean encoding method
– Re-target code generation correspondingly

• Source-level optimizations
– Record known orderings (e.g., “i before j”) – these manifest as unit clauses
– Infer others (e.g., “not j before i”) – generate unit clauses for these too
– Prevent generating transitivity axioms that depend on “j before i”

• The use of incremental SAT can perhaps be directed by “functional scripts” that are automatically generated

• Use of Unsat cores to pinpoint errors

44

Concluding Remarks

• Main source of complexity: the transitivity axiom

• “Lazy” methods for handling transitivity must be investigated

• Hybrid Sat encoding (partly n × n and partly n log n) can also help, as was the experience of Lahiri, Seshia, and Bryant

• Analyzing larger programs:
– Somehow view program in terms of “basic blocks”
– Treat each basic block as super instruction
– If super-instruction unordered, no need to descend into basic block

• Exploit incremental Sat when same litmus tests are rerun

• Try modeling another weak memory model

45

Extra Slides

46

Unsat Core generation

• The CNF file generated by the sat-generating program is solved using zchaff.

• If SAT, then we get a satisfying assignment.

• First n*n variables in the assignment correspond to the n*n variables in our ordering. Can be used to output a valid ordering of the ops.

• If UNSAT, then need a way to find a “root-cause” for the illegality of the execution.

• We use unsatisfiable core generation to get to the root cause.

• An unsatisfiable core of an unsatisfiable Sat instance is a subset of clauses of the formula whose conjunction is still UNSAT.
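Recovering an ordering from a satisfying assignment can be sketched as follows: in a strict total order, the op with k predecessors sits at position k, so sorting ops by their predecessor count yields the interleaving. The layout assumed here (variable `i * n + j + 1` means "op i before op j", and `assignment[v - 1]` holds the value of variable v) is hypothetical; the actual variable layout of the tool is not given in these slides.

```python
def ordering_from_assignment(assignment, n):
    """Return op indices in the order encoded by the first n*n variables."""
    def predecessors(j):
        # number of ops i with "i before j" set to true
        return sum(assignment[i * n + j] for i in range(n))
    return sorted(range(n), key=predecessors)


# Hypothetical assignment for n = 3 encoding the order: op 2, op 0, op 1.
F, T = False, True
ASSIGNMENT = [F, T, F,   # row 0: 0 before 1
              F, F, F,   # row 1: 1 before nothing
              T, T, F]   # row 2: 2 before 0 and before 1
```

With this assignment the function returns the ordering 2, 0, 1.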

47

Generating Unsatisfiable Core

• Zchaff can be told to generate a resolution trace while checking for Sat.

• Zcore – tool that takes as input a CNF file and the resolution trace produced by zchaff and produces an unsatisfiable core.

• Zcore available as part of zchaff.

• Unsatisfiable core is another CNF file with the reduced set of clauses.

• Can be fed back into zchaff/zcore to generate a potentially smaller unsatisfiable core.

• Process repeated till fixed point reached.
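The idea of shrinking a core until nothing more can be removed can be illustrated without zchaff or zcore. The sketch below swaps in a different, much simpler technique: deletion-based minimization with a brute-force satisfiability check, which is only practical for toy formulas. It does not use resolution traces, and none of its names correspond to zchaff or zcore interfaces.

```python
from itertools import product


def is_sat(clauses):
    """Brute-force satisfiability check over the variables that occur."""
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    for bits in product([False, True], repeat=len(variables)):
        value = dict(zip(variables, bits))
        if all(any(value[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False


def shrink_core(clauses):
    """Drop clauses one at a time while the remainder stays unsatisfiable."""
    core = list(clauses)
    i = 0
    while i < len(core):
        candidate = core[:i] + core[i + 1:]
        if not is_sat(candidate):
            core = candidate   # clause i was not needed for unsatisfiability
        else:
            i += 1             # clause i is essential, keep it
    return core


# Toy instance: the first three clauses conflict, the last two are irrelevant.
CNF = [[1, 2], [-1], [-2], [3, 4], [-3, 4]]
```

On the toy instance, `shrink_core(CNF)` keeps only the three conflicting clauses, mirroring on a small scale how 23 clauses were isolated from 14M in the case study.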

48

Mapping back to root-cause

• Clauses in the unsatisfiable core contain the ordering violation information in them

• Tool to home in towards the root-cause for the violation

• If the root cause is not something trivial, then the cause is usually a cycle of instructions. Each link in the cycle corresponds to an ordering requirement between the instructions involved.

• If cycle exists, then Transitivity can be applied to show that Irreflexivity is not satisfied.

• Input to the tool to generate root cause:
– The original set of annotated machine instructions for all processors
– The default values stored in memory locations at the beginning of the execution
– Clause annotations for the clauses that form the unsatisfiable core

49

Root-cause cycle analysis algorithm

Each ReadValue rule generates a set of clauses.

From the annotations, find the tuples that come from the same ReadValue rule (two different ops will be involved in a rule)
– Extract the ops out of the annotations and get the corresponding instructions (using the proc and pc values)

From the data being used in the ld instruction and the default data value for the corresponding memory address, it can be seen if the effect of a store is being reflected in a load.

This way the dependency between a load and a store is established.

The above is done for all the ReadValue rules in the annotations

Ops (and the corresponding instructions) on both sides of a mf that form a link in the cycle are inferred based on ProgramOrder rule annotations and the pc values involved.

The other missing links in the violating cycle are also inferred based on the remaining ProgramOrder rule annotations.
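Once the links are known, finding the violating cycle is a graph search. The sketch below runs a depth-first search over (before, after, rule) links and returns the first cycle it closes. The link list is one hypothetical reading of the unsat-core annotations for the three-processor example (op numbers as on the earlier slides); in particular, the direction given to each ReadValue link and the single AtomicWBRelease link between two st.rel copies are assumptions, not output of the authors' tool.

```python
def find_cycle(edges):
    """Return a list of (before, after, rule) links forming a cycle, or None."""
    graph = {}
    for before, after, rule in edges:
        graph.setdefault(before, []).append((after, rule))
        graph.setdefault(after, [])
    ON_STACK, DONE = 1, 2
    state = {}
    stack = []  # links along the current DFS path

    def visit(node):
        state[node] = ON_STACK
        for nxt, rule in graph[node]:
            link = (node, nxt, rule)
            if state.get(nxt) == ON_STACK:
                # nxt is on the current path: cut the path where nxt starts
                start = next((idx for idx, l in enumerate(stack)
                              if l[0] == nxt), len(stack))
                return stack[start:] + [link]
            if nxt not in state:
                stack.append(link)
                found = visit(nxt)
                if found:
                    return found
                stack.pop()
        state[node] = DONE
        return None

    for node in list(graph):
        if node not in state:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical links for the example (ops 1-4: st [x], 5: mf, 6: ld r1,
# 7-10: st.rel [y], 11: ld.acq, 12: ld r3).
EDGES = [
    (4, 5, "ProgramOrder"),      # st [x] copy before mf
    (5, 6, "ProgramOrder"),      # mf before ld r1
    (6, 8, "ReadValue"),         # ld r1 saw 0: precedes a st.rel copy
    (8, 10, "AtomicWBRelease"),  # st.rel copies become visible together
    (10, 11, "ReadValue"),       # ld.acq saw 1: follows a st.rel copy
    (11, 12, "ProgramOrder"),    # ld.acq before ld r3
    (12, 4, "ReadValue"),        # ld r3 saw 0: precedes the st [x] copy
]
CYCLE = find_cycle(EDGES)
```

The cycle found visits ops 4, 5, 6, 8, 10, 11, 12 and returns to 4; removing any one link, for example the last ReadValue link, leaves no cycle.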
