Introduction to Abstract Interpretation Andy King [email protected] http://www.cs.kent.ac.uk/~amk



Pointers to the literature

SAS, POPL, ESOP, ICLP, ICFP, …

Useful review articles and books:

Patrick and Radhia Cousot, Comparing the Galois connection and Widening/Narrowing approaches to Abstract Interpretation, PLILP, LNCS 631, 269-295, 1992. Available from LIX library.

Patrick and Radhia Cousot, Abstract interpretation and Application to Logic Programs, JLP, 13(2-3):103-179, 1992

Flemming Nielson, Hanne Riis Nielson and Chris Hankin, Principles of Program Analysis, Springer, 1999.

Patrick has a database of abstract interpretation researchers and regularly writes tutorials, see, CC’02.

Applications of abstract interpretation

Verification: can a concurrent program deadlock? Is termination assured?

Parallelisation: are two or more tasks independent? What is the worst/best-case running time of a function?

Transformation: can a definition be unfolded? Will unfolding terminate?

Implementation: can an operation be specialised with knowledge of its (global) calling context?

Applications and “players” are incredibly diverse

Casting out nines algorithm

Which of the following multiplications are correct:

2173 × 38 = 81574 or 2173 × 38 = 82574

Casting out nines is a checking technique that is really a form of abstract interpretation:

  • Sum the digits in the multiplicand n1, multiplier n2 and the product n to obtain s1, s2 and s.

  • Divide s1, s2 and s by 9 to compute the remainder, that is, r1 = s1 mod 9, r2 = s2 mod 9 and r = s mod 9.

  • Calculate r’ = (r1 × r2) mod 9

  • If r’ ≠ r then the multiplication is incorrect

The algorithm returns “incorrect” or “don’t know”

Running the numbers for 2173 × 38 = 81574

Compute r1 = (2+1+7+3) mod 9 = …

Compute r2 = (3+8) mod 9 = …

Calculate r = (8+1+5+7+4) mod 9 = …

Calculate r’ = (r1 × r2) mod 9 = …

Check (r’ ≠ r) = …

Deduce that 2173 × 38 = 81574 is …
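The checking steps can be run mechanically. The following is a minimal Python sketch of the procedure described above; the function names `digit_sum` and `cast_out_nines` are illustrative and do not come from the slides.

```python
def digit_sum(n: int) -> int:
    """Sum of the decimal digits of a non-negative integer."""
    return sum(int(d) for d in str(n))


def cast_out_nines(n1: int, n2: int, n: int) -> str:
    """Check the claim n1 * n2 == n by casting out nines.

    The check can only refute a claim, so the answer is either
    "incorrect" or "don't know".
    """
    r1 = digit_sum(n1) % 9
    r2 = digit_sum(n2) % 9
    r = digit_sum(n) % 9
    r_prime = (r1 * r2) % 9
    return "incorrect" if r_prime != r else "don't know"


for claimed in (81574, 82574):
    print(claimed, cast_out_nines(2173, 38, claimed))
```

The check never computes the product itself, which is why a matching remainder only justifies "don't know".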

Abstract interpretation is a theory of relationships

The computational domain for multiplication (concrete domain): N – the set of non-negative integers

The computational domain of remainders used in the checking algorithm (abstract domain): R = {0, 1, …, 8}

The key question is what is the relationship between an element n ∈ N which is used in the real algorithm and its analogue r ∈ R in the check

What is the relationship?

When the multiplicand is n1 = 456, say, then the check uses r1 = (4+5+6) mod 9 = 6

Observe that

  456 mod 9 = (4*100 + 56) mod 9
            = (4*90 + 4*10 + 56) mod 9
            = (4*10 + 56) mod 9
            = ((4 + 5)*10 + 6) mod 9
            = ((4 + 5)*9 + (4 + 5) + 6) mod 9
            = (4 + 5 + 6) mod 9

More generally, induction can show r1 = n1 mod 9 and r2 = n2 mod 9

Correctness is the preservation of relationships

The check simulates the concrete multiplication and, in effect, is an abstract multiplication

Concrete multiplication is n = n1 × n2

Abstract multiplication is r’ = (r1 × r2) mod 9, where r1 describes n1 and r2 describes n2

For brevity, write r ∝ n iff r = n mod 9

Then abstract multiplication preserves ∝ iff whenever r1 ∝ n1 and r2 ∝ n2 it follows that r’ ∝ n

Correctness argument

Suppose r1 ∝ n1 and r2 ∝ n2

If n = n1 × n2 then n mod 9 = (n1 × n2) mod 9

hence n mod 9 = ((n1 mod 9) × (n2 mod 9)) mod 9

whence n mod 9 = (r1 × r2) mod 9 = r’

therefore r’ ∝ n

Consequently if ¬(r’ ∝ n) then n ≠ n1 × n2
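The preservation property can also be spot-checked by brute force. The sketch below tests it for all pairs below a small bound; it is an illustration rather than a proof, and the names `describes`, `abstract_mul` and `preserved` are hypothetical.

```python
def describes(r: int, n: int) -> bool:
    """The relation 'r describes n': r is the remainder of n modulo 9."""
    return r == n % 9


def abstract_mul(r1: int, r2: int) -> int:
    """Abstract multiplication on remainders."""
    return (r1 * r2) % 9


def preserved(limit: int = 200) -> bool:
    """Check that whenever r1 describes n1 and r2 describes n2,
    abstract_mul(r1, r2) describes n1 * n2, for all n1, n2 < limit."""
    return all(
        describes(abstract_mul(n1 % 9, n2 % 9), n1 * n2)
        for n1 in range(limit)
        for n2 in range(limit)
    )


print(preserved())
```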

Summary

Formalise the relationship between the data

Check that the relationship is preserved by the abstract analogues of the concrete operations

The relational framework [Acta Informatica, 30(2):103-129,1993] not only emphasises the theory of relations but is also very general

Numeric approximation and widening

Abstract interpretation does not require an abstract domain to be finite

Interval approximation

Consider the following Pascal-like program

begin
  i := 0;
  {1: i ∈ [0,0]}
  while (i < 16) do
    {2: i ∈ [0,15]}
    i := i + 1
    {3: i ∈ [1,16]}
  end
  {4: i ∈ [16,16]}

SYNTOX [PLDI’90] inferred the invariants scoped within {…}

Invariants occur between consecutive lines in the program

i ∈ [0,15] asserts 0 ≤ i ≤ 15 whereas i ∈ [0,0] means i = 0

Compilation versus (classic) interpretation

Abstract compilation – compile the concrete program into an abstract program (equation system) and execute the abstract program:

  • good separation of concerns that aids debugging

  • the particulars of the domain can be exploited to reorder operations, specialise operations, etc

Abstract interpretation – run the concrete program but on-the-fly interpret its concrete operations as abstract operations:

  • ideal for a generic framework (toolkit) which is parameterised by abstract domain plugins

Abstract domain that is used in interval analysis

Domain of intervals includes:

  • [l,u] where l ≤ u and l,u ∈ Z for bounded sets, i.e. [0,5] describes {0,1,4} since {0,1,4} ⊆ [0,5]

  • ⊥ to represent the empty set of numbers, that is, Ø

  • [l,∞] for sets which are bounded below such as {l,l+2,l+4,…}

  • [-∞,u] to represent sets which are bounded above such as {…,u-5,u-3,u}

Weakening intervals

Join (path merge) is defined: Put d1 ⊔ d2 =

  • d1 if d2 = ⊥

  • d2 else if d1 = ⊥

  • [min(l1,l2), max(u1,u2)] otherwise, whenever d1 = [l1,u1] and d2 = [l2,u2]

if … then
  …
  {1: i ∈ [0,2]}
else
  …
  {2: i ∈ [3,5]}
endif
{3: i ∈ [0,5]}

Strengthening intervals

Meet is defined: Put d1 ⊓ d2 =

  • ⊥ if (d1 = ⊥) ∨ (d2 = ⊥)

  • [max(l1,l2), min(u1,u2)] otherwise, whenever d1 = [l1,u1] and d2 = [l2,u2] (read as ⊥ when max(l1,l2) > min(u1,u2))

{3: i ∈ [0,5]}
if (2 < i) then
  {4: i ∈ [3,5]}
  …
else
  {5: i ∈ [0,2]}
  …
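The two operations are easy to prototype. The sketch below is one possible Python encoding, under the assumption that ⊥ is represented by `None` and an interval by a pair of bounds, with `math.inf` standing for ∞; none of these representation choices come from the slides.

```python
import math

INF = math.inf  # stands for the unbounded end of an interval


def join(d1, d2):
    """Least interval containing both d1 and d2 (None encodes bottom)."""
    if d2 is None:
        return d1
    if d1 is None:
        return d2
    return (min(d1[0], d2[0]), max(d1[1], d2[1]))


def meet(d1, d2):
    """Greatest interval contained in both d1 and d2."""
    if d1 is None or d2 is None:
        return None
    lo, hi = max(d1[0], d2[0]), min(d1[1], d2[1])
    return (lo, hi) if lo <= hi else None  # an empty overlap is bottom


print(join((0, 2), (3, 5)))       # merge of the two branches
print(meet((0, 5), (3, INF)))     # effect of the test 2 < i
print(meet((0, 5), (-INF, 2)))    # effect of the negated test
```

The join loses information (it admits values that lie between the two arguments), whereas the meet for intervals is exact.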

Meet and join are the basic primitives for compilation

I1 = [0,0] since program point (1) immediately follows the i := 0

I2 = (I1 ⊔ I3) ⊓ [-∞,15] since:

  • control from program points (1) and (3) flows into (2)

  • point (2) is reached only if i < 16 holds

I3 = {n+1 | n ∈ I2} since (3) is only reachable from (2) via the increment

I4 = (I1 ⊔ I3) ⊓ [16,∞] since:

  • control from (1) and (3) flows into (4)

  • point (4) is reached only if ¬(i < 16) holds

Interval iteration

I1  [0,0]  [0,0]  [0,0]  [0,0]  [0,0]  [0,0]  [0,0]
I2  ⊥      [0,0]  [0,0]  [0,1]  [0,1]  [0,2]  [0,2]
I3  ⊥      ⊥      [1,1]  [1,1]  [1,2]  [1,2]  [1,3]
I4  ⊥      ⊥      ⊥      ⊥      ⊥      ⊥      ⊥

I1  …  [0,0]   [0,0]   [0,0]    [0,0]
I2  …  [0,15]  [0,15]  [0,15]   [0,15]
I3  …  [1,15]  [1,16]  [1,16]   [1,16]
I4  …  ⊥       ⊥       [16,16]  [16,16]

Jacobi versus Gauss-Seidel iteration

I1  [0,0]  [0,0]  [0,0]  …  [0,0]   [0,0]    [0,0]
I2  [0,0]  [0,1]  [0,2]  …  [0,14]  [0,15]   [0,15]
I3  [1,1]  [1,2]  [1,3]  …  [1,15]  [1,16]   [1,16]
I4  ⊥      ⊥      ⊥      …  ⊥       [16,16]  [16,16]

With Jacobi, the new vector I1’,I2’,I3’,I4’ of intervals is calculated from the old I1,I2,I3,I4

With Gauss-Seidel iteration:

  • I1’ is calculated from I1,I2,I3,I4

  • I2’ is calculated from I1’,I2,I3,I4

  • I3’ is calculated from I1’,I2’,I3,I4

  • I4’ is calculated from I1’,I2’,I3’,I4
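The two strategies can be compared on the interval equations for the incrementing loop. The sketch below assumes ⊥ is encoded as `None` and intervals as pairs; the helper names (`shift`, `jacobi`, `gauss_seidel`) and the way sweeps are counted are choices made for this illustration only.

```python
import math

INF = math.inf


def join(a, b):
    if a is None:
        return b
    if b is None:
        return a
    return (min(a[0], b[0]), max(a[1], b[1]))


def meet(a, b):
    if a is None or b is None:
        return None
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


def shift(d, k):
    """Abstract version of n -> n + k."""
    return None if d is None else (d[0] + k, d[1] + k)


# One right-hand side per program point; I is the vector [I1, I2, I3, I4].
EQUATIONS = [
    lambda I: (0, 0),
    lambda I: meet(join(I[0], I[2]), (-INF, 15)),
    lambda I: shift(I[1], 1),
    lambda I: meet(join(I[0], I[2]), (16, INF)),
]


def jacobi(eqs):
    """Every new component is computed from the old vector."""
    state, sweeps = [None] * len(eqs), 0
    while True:
        new = [f(state) for f in eqs]
        if new == state:
            return state, sweeps
        state, sweeps = new, sweeps + 1


def gauss_seidel(eqs):
    """Each component is updated in place, so later ones see new values."""
    state, sweeps = [None] * len(eqs), 0
    while True:
        old = list(state)
        for k, f in enumerate(eqs):
            state[k] = f(state)
        if state == old:
            return state, sweeps
        sweeps += 1


print(jacobi(EQUATIONS))
print(gauss_seidel(EQUATIONS))
```

Both strategies reach the same fixpoint; Gauss-Seidel needs fewer sweeps because each sweep propagates information through I2 and I3 at once.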

Gauss-Seidel versus chaotic iteration

Observe that I4 might change if either I1 or I3 change, hence evaluate I4 after I1 and I3 stabilise

This suggests waiting until stability is achieved at one level before starting on the next

[Figure: dependency graph over I1, I2, I3 and I4, stratified into the levels {I1}, {I2, I3}, {I4}]

Gauss-Seidel versus chaotic iteration

Chaotic iteration can postpone evaluating Ii for a bounded number of iterations:

  • I1’ is calculated from I1,-,-,-

  • I2’ and I3’ are calculated Gauss-Seidel style from I1,I2,I3,-

  • I4’ is calculated from I1’,I2’,I3’,I4

Fast and (incremental) fixpoint solvers [TOPLAS 22(2):187-223,2000] apply chaotic iteration

I1  [0,0]  [0,0]  [0,0]  …  [0,0]   [0,0]   [0,0]
I2  -      [0,0]  [0,1]  …  [0,15]  [0,15]  [0,15]
I3  -      [1,1]  [1,2]  …  [1,16]  [1,16]  [1,16]
I4  -      -      -      …  -       -       [16,16]

Suppose i was decremented rather than incremented

begin
  i := 0;
  {1: i ∈ [0,0]}
  while (i < 16) do
    {2: i ∈ [-∞,0]}
    i := i - 1
    {3: i ∈ [-∞,-1]}
  end
  {4: i ∈ ⊥}

I1 = [0,0]

I2 = (I1 ⊔ I3) ⊓ [-∞,15]

I3 = {n-1 | n ∈ I2}

I4 = (I1 ⊔ I3) ⊓ [16,∞]

I1  [0,0]  [0,0]  [0,0]    [0,0]    [0,0]    [0,0]
I2  -      -      [0,0]    [-1,0]   [-2,0]   …
I3  -      -      [-1,-1]  [-2,-1]  [-3,-1]  …
I4  -      -      -        -        -        -

Ascending chain condition

A domain D is ACC iff it does not contain an infinite strictly increasing chain d1<d2<d3<… where d<d’ iff d ⊑ d’ and d ≠ d’ (see below)

The interval domain D is ordered by: ⊥ ⊑ d for all d ∈ D and [l1,u1] ⊑ [l2,u2] iff l2 ≤ l1 ≤ u1 ≤ u2

and is not ACC since [0,0]<[-1,0]<[-2,0]<…

[Figure: diagram over the integers … -4 -3 -2 -1 0 1 2 3 4 … with top element T]

Some very expressive relational domains are ACC

Common sub-expression elimination relies on detecting duplicated expression evaluation

Karr [Acta Informatica, 6, 133-151] noticed that detecting an invariance such as y = (x/2) – 6 was key to this optimisation

begin
  x := (2 * (z + *w)) - 2;
  y := (z - 7) + *w;
end

The affine domain

The domain of affine equations over n variables is: D = {⟨A,B⟩ | A is an m×n dimensional matrix and B is an m dimensional column vector}

D is ordered by: ⟨A1,B1⟩ ⊑ ⟨A2,B2⟩ iff (if A1x = B1 then A2x = B2)

An affine abstraction

Consider ⟨A,B⟩ where

A = [ 1  0   0 ]      B = [ 1 ]
    [ 0  1  -2 ]          [ 0 ]

Consider x = ⟨x1,x2,x3⟩^T where Ax = B

Then x1 = 1

Then x2 – 2x3 = 0

begin
  x1 := 1;
  x2 := 2*x3;
end
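As a quick sanity check, the following Python sketch runs the two assignments for several initial values of x3 and confirms that the resulting state satisfies Ax = B. The function names `run` and `satisfies` are hypothetical; this tests the affine description, it does not derive it.

```python
A = [[1, 0, 0],
     [0, 1, -2]]
B = [1, 0]


def run(x3: int):
    """Concrete execution of: x1 := 1; x2 := 2*x3 (x3 is left unchanged)."""
    x1 = 1
    x2 = 2 * x3
    return (x1, x2, x3)


def satisfies(A, B, x) -> bool:
    """True iff A x = B, computed row by row."""
    return all(
        sum(a * xi for a, xi in zip(row, x)) == b
        for row, b in zip(A, B)
    )


print(all(satisfies(A, B, run(x3)) for x3 in range(-5, 6)))
```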

Pre-orders versus posets

A pre-order ⟨D, ⊑⟩ is a set D ordered by a binary relation ⊑ such that:

  • d ⊑ d for all d ∈ D

  • If d1 ⊑ d2 and d2 ⊑ d3 then d1 ⊑ d3

A poset is a pre-order ⟨D, ⊑⟩ such that:

  • If d1 ⊑ d2 and d2 ⊑ d1 then d1 = d2

The affine domain is a pre-order (so it is not ACC)

Observe ⟨A1,B1⟩ ⊑ ⟨A2,B2⟩ and ⟨A2,B2⟩ ⊑ ⟨A1,B1⟩, yet ⟨A1,B1⟩ ≠ ⟨A2,B2⟩, where

A1 = [ 1  0  0 ]   B1 = [ 1 ]   A2 = [ 2  0  0 ]   B2 = [ 2 ]
     [ 0  1  0 ]        [ 0 ]        [ 0  1  0 ]        [ 0 ]
     [ 0  0  1 ]        [ 0 ]        [ 0  0  1 ]        [ 0 ]

To build a poset from a pre-order:

  • define d ≡ d’ iff d ⊑ d’ and d’ ⊑ d

  • define [d]≡ = {d’ ∈ D | d ≡ d’} and D≡ = {[d]≡ | d ∈ D}

  • define [d]≡ ⊑≡ [d’]≡ iff d ⊑ d’

The poset ⟨D≡, ⊑≡⟩ is ACC since chain length is bounded by the number of variables n

Inducing termination for non-ACC (and huge ACC) domains

Enforce convergence for intervals with a widening operator ∇: D × D → D

  • ⊥ ∇ d = d

  • d ∇ ⊥ = d

  • [l1,u1] ∇ [l2,u2] = [if l2 < l1 then -∞ else l1, if u1 < u2 then ∞ else u1]

Examples

  • [1,2] ∇ [1,2] = [1,2]

  • [1,2] ∇ [1,3] = [1,∞] but [1,3] ∇ [1,2] = [1,3]

Safe since [li,ui] ⊑ ([l1,u1] ∇ [l2,u2]) for i ∈ {1,2}
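The operator translates almost directly into code. The following Python sketch assumes ⊥ is encoded as `None`, an interval as a pair of bounds, and ∞ as `math.inf`; these encodings are illustrative assumptions.

```python
import math

INF = math.inf


def widen(d1, d2):
    """Interval widening: any bound that moved outwards jumps to infinity."""
    if d1 is None:
        return d2
    if d2 is None:
        return d1
    (l1, u1), (l2, u2) = d1, d2
    lower = -INF if l2 < l1 else l1
    upper = INF if u1 < u2 else u1
    return (lower, upper)


print(widen((1, 2), (1, 2)))
print(widen((1, 2), (1, 3)))
print(widen((1, 3), (1, 2)))
```

Note that the operator is not symmetric: the first argument plays the role of the previous iterate and the second the new one.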

Chaotic iteration with widening

To terminate it is necessary to traverse each loop a finite number of times

It is sufficient to pass through I2 or I3 a finite number of times [Bourdoncle, 1990]

Thus widen at I3 since it is simpler

[Figure: dependency graph over I1, I2, I3 and I4]

Termination for the decrement

I1 = [0,0]

I2 = (I1 ⊔ I3) ⊓ [-∞,15]

I3 = I3 ∇ {n-1 | n ∈ I2} (note the fix)

I4 = (I1 ⊔ I3) ⊓ [16,∞]

When I2 = [-1,0] and I3 = [-1,-1], then

I3 ∇ {n-1 | n ∈ I2} = [-1,-1] ∇ [-2,-1] = [-∞,-1]

I1  [0,0]  [0,0]  [0,0]    [0,0]     [0,0]    [0,0]    [0,0]
I2  -      -      [0,0]    [-1,0]    [-∞,0]   [-∞,0]   [-∞,0]
I3  -      -      [-1,-1]  [-∞,-1]*  [-∞,-1]  [-∞,-1]  [-∞,-1]
I4  -      -      -        -         -        -        -
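The effect of widening at I3 can be reproduced with a few lines of Python. This is a sketch under stated assumptions: ⊥ is `None`, intervals are pairs, and the equations are swept Gauss-Seidel style rather than with the chaotic schedule shown in the table, so the intermediate columns differ while the fixpoint is the same.

```python
import math

INF = math.inf


def join(a, b):
    if a is None:
        return b
    if b is None:
        return a
    return (min(a[0], b[0]), max(a[1], b[1]))


def meet(a, b):
    if a is None or b is None:
        return None
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


def widen(a, b):
    if a is None:
        return b
    if b is None:
        return a
    return (-INF if b[0] < a[0] else a[0], INF if a[1] < b[1] else a[1])


def shift(d, k):
    return None if d is None else (d[0] + k, d[1] + k)


def solve_decrement():
    """Iterate the equations for the decrementing loop, widening at I3."""
    i1 = i2 = i3 = i4 = None
    while True:
        old = (i1, i2, i3, i4)
        i1 = (0, 0)
        i2 = meet(join(i1, i3), (-INF, 15))
        i3 = widen(i3, shift(i2, -1))   # the widened equation
        i4 = meet(join(i1, i3), (16, INF))
        if (i1, i2, i3, i4) == old:
            return old


print(solve_decrement())
```

Without the call to `widen` the loop would not terminate, since I2 and I3 would keep growing downwards by one at each sweep.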

(Malicious) research challenge

Read a survey paper to find an abstract domain that is ACC but has a maximal chain length of O(2^n)

Construct a program with O(n) symbols that iterates through all O(2^n) abstractions

Publish the program in IPL

Are numeric domains convex?

A set S ⊆ R^n is convex iff for all x,y ∈ S it follows that {λx + (1-λ)y | 0 ≤ λ ≤ 1} ⊆ S

The 2 leftmost sets in R2 are convex but the 2 rightmost sets are not

Intervals and affine systems are convex

Arithmetic congruences are not convex

Elements of the arithmetic congruence (AC) domain take the form x – 2y = 1 (mod 3) which describes integral values of x and y

More exactly, the AC domain consists of conjunctions of equations of the form c1x1+…+cmxm = (c mod n) where ci,c ∈ Z and n ∈ N

Incredibly AC is ACC [IJCM, 30, 165--190, 1989]

[Figure: plot accompanying the arithmetic congruence example; axis ticks run from 0 to 9 and from 0 to 3.5]

Research challenge

Søndergaard [FSTTCS,95] introduced the concept of an immediate fixpoint

Consider the following (groundness) dependency equations over the domain of Boolean functions ⟨Bool, ∨, ∧⟩

  f1 = x ∧ (y ↔ z)
  f2 = ∃t(∃x(∃z((u ↔ (t ∧ x)) ∧ (v ↔ (t ∧ z)) ∧ f4)))
  f3 = ∃u(∃v((x ↔ u) ∧ (z ↔ v) ∧ f2))
  f4 = f1 ∨ f3

Where ∃x(f) = f[x ↦ true] ∨ f[x ↦ false] thus ∃x(x ∨ y) = true and ∃x(x ∧ y) = y

The alternative tactic

The standard tactic is to apply iteration:

f1  false  x ∧ (y ↔ z)  x ∧ (y ↔ z)  x ∧ (y ↔ z)  …  x ∧ (y ↔ z)
f2  false  false        false        v ↔ (y ∧ u)  …  (u ∧ y) ↔ v
f3  false  false        false        false        …  (x ∧ y) ↔ z
f4  false  false        x ∧ (y ↔ z)  x ∧ (y ↔ z)  …  (x ∧ y) ↔ z

Søndergaard found that the system can be solved symbolically (like a quadratic)

This would be very useful for infinite domains for improved precision and predictability
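The standard iterative tactic can be replayed with truth tables. The Python sketch below represents a Boolean function as the set of its satisfying assignments and iterates the four equations from false. It assumes the equations read f1 = x ∧ (y ↔ z), f2 = ∃t∃x∃z((u ↔ (t ∧ x)) ∧ (v ↔ (t ∧ z)) ∧ f4), f3 = ∃u∃v((x ↔ u) ∧ (z ↔ v) ∧ f2) and f4 = f1 ∨ f3; this reading of the connectives is an assumption, chosen because it agrees with the iteration table. All function names are hypothetical.

```python
from itertools import product

VARS = ("t", "u", "v", "x", "y", "z")
ASSIGNMENTS = list(product([False, True], repeat=len(VARS)))
FALSE = frozenset()


def table(pred):
    """Boolean function given as the set of assignments satisfying pred."""
    return frozenset(a for a in ASSIGNMENTS if pred(dict(zip(VARS, a))))


def exists(var, f):
    """Existential quantification: f[var -> true] or f[var -> false]."""
    k = VARS.index(var)

    def with_value(a, value):
        return a[:k] + (value,) + a[k + 1:]

    return frozenset(
        a for a in ASSIGNMENTS
        if with_value(a, True) in f or with_value(a, False) in f
    )


def step(f1, f2, f3, f4):
    """One Jacobi-style application of the four equations."""
    new_f1 = table(lambda e: e["x"] and (e["y"] == e["z"]))
    body2 = table(lambda e: e["u"] == (e["t"] and e["x"])
                  and e["v"] == (e["t"] and e["z"])) & f4
    new_f2 = exists("t", exists("x", exists("z", body2)))
    body3 = table(lambda e: e["x"] == e["u"] and e["z"] == e["v"]) & f2
    new_f3 = exists("u", exists("v", body3))
    new_f4 = f1 | f3
    return new_f1, new_f2, new_f3, new_f4


def solve():
    state = (FALSE, FALSE, FALSE, FALSE)
    while True:
        new = step(*state)
        if new == state:
            return state
        state = new


f1, f2, f3, f4 = solve()
print(f4 == table(lambda e: (e["x"] and e["y"]) == e["z"]))
```

Conjunction and disjunction become set intersection and union, so the iteration is ordinary Kleene iteration over a finite lattice.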

Combining analyses

Verifiers and optimisers are often multi-pass, built from several separate analyses

Should the analyses be performed in parallel or in sequence?

Analyses can interact to improve one another (problem is in the complexity of the interaction [Pratt])

Pruning combined domains

[Figure: diagram relating five combined descriptions, numbered 1 to 5, each pairing a set-sharing component such as {{x}} with a Boolean function; the edges are labelled with the bindings z = c, y = b and x = f(y, z)]

Pruning combined domains

Suppose that ∝1 ⊆ D1 × C and ∝2 ⊆ D2 × C, then how is D = D1 × D2 interpreted?

Then ⟨d1,d2⟩ ∝ c iff d1 ∝1 c ∧ d2 ∝2 c

Ideally, many ⟨d1,d2⟩ ∈ D will be redundant, that is, ¬∃c ∈ C . d1 ∝1 c ∧ d2 ∝2 c

Time versus precision from TOPLAS 17(1):28--44,1993

             Time                         Precision
             Share   ASub    Share×ASub   Share   ASub   Share×ASub
serialise    9290    839     1870         235     35     35
init-subst   569     1250    829          5       72     5
map-color    4600    1040    5760         76      74     73
grammar      170     140     269          11      11     11
browse       51860   1609    49580        196     104    104
bid          1129    1000    1429         11      0      0
deriv        2819    2630    3550         0       0      0
rdtok        5670    4450    6389         185     48     48
read         8790    8380    11069        11      1      1
boyer        11040   3949    7709         242     93     93
peephole     20760   7990    23029        386     310    310
ann          93509   16789   53269        1935    1690   1690

The Galois framework

Abstract interpretation is classically presented in terms of Galois connections

Lattices – a prelude to Galois connections

Suppose ⟨S, ⊑⟩ is a poset

A mapping ⊔: S × S → S is a join (least upper bound) iff

  • a ⊔ b is an upper bound of a and b, that is, a ⊑ a ⊔ b and b ⊑ a ⊔ b for all a,b ∈ S

  • a ⊔ b is the least upper bound, that is, if c ∈ S is an upper bound of a and b, then a ⊔ b ⊑ c

The definition of the meet ⊓: S × S → S (the greatest lower bound) is analogous

Complete lattices

A lattice ⟨S, ⊑, ⊔, ⊓⟩ is a poset ⟨S, ⊑⟩ equipped with a join ⊔ and a meet ⊓

The join concept can often be lifted to sets by defining ⊔: ℘(S) → S iff

  • t ⊑ ⊔(T) for all T ⊆ S and for all t ∈ T

  • if t ⊑ s for all t ∈ T then ⊔(T) ⊑ s

If the meet can be lifted analogously, then the lattice is complete

A lattice that contains a finite number of elements is always complete

A lattice that is not complete

A hyperplane in 2-d space is a line and in 3-d space is a plane

A hyperplane in R^n is any space that can be defined by {x ∈ R^n | c1x1+…+cnxn = c} where c1,…,cn,c ∈ R

A halfspace in R^n is any space that can be defined by {x ∈ R^n | c1x1+…+cnxn ≤ c}

A polyhedron is the intersection of a finite number of half-spaces

Examples and non-examples in planar space

Join for polyhedra

Join of polyhedra P1 and P2 in R^n coincides with (the topological closure of) the convex hull of P1 ∪ P2

The “join” of an infinite set of polyhedra

Consider the following infinite chain of regular polyhedra:

The only space that contains all these polyhedra is a circle yet this is not polyhedral

Galois connection example (2 complete lattices + …)

The concrete domain ⟨C, ⊑C, ⊔C, ⊓C⟩ is ⟨℘(Z), ⊆, ∪, ∩⟩

The abstract domain ⟨A, ⊑A, ⊔A, ⊓A⟩ where:

  • A = {⊥,+,-,T}

  • ⊥ ⊑A a ⊑A T for all a ∈ A

  • join ⊔A and meet ⊓A are defined by:

⊔A  ⊥  +  -  T
⊥   ⊥  +  -  T
+   +  +  T  T
-   -  T  -  T
T   T  T  T  T

⊓A  ⊥  +  -  T
⊥   ⊥  ⊥  ⊥  ⊥
+   ⊥  +  ⊥  +
-   ⊥  ⊥  -  -
T   ⊥  +  -  T

… + concretisation mapping + …

The concretisation mapping γ: A → C is defined:

  • γ(⊥) = Ø

  • γ(+) = {n ∈ Z | n > 0}

  • γ(-) = {n ∈ Z | n < 0}

  • γ(T) = Z

Concretisation spells out how to interpret the symbols in the abstract domain

Observe that γ(⊥) ⊆ γ(+) ⊆ γ(T) and more generally γ is required to be order-preserving

If a1 ⊑A a2 then γ(a1) ⊑C γ(a2)

… + an abstraction mapping

Since {1,2} ⊆ γ(+) and {1,2} ⊆ γ(T) either + or T can represent {1,2}.

Thus need a mechanism to map a set to the best abstract object that represents it

The abstraction mapping α: C → A is defined:

  • α(S) = ⊥ if S = Ø

  • α(S) = + else if n > 0 for all n ∈ S

  • α(S) = - else if n < 0 for all n ∈ S

  • α(S) = T otherwise

Require α to be monotonic, that is, if c1 ⊑C c2 then α(c1) ⊑A α(c2)

α can be defined from γ (and vice versa)

Observe α(S) = ⊓A{a ∈ A | S ⊆ γ(a)}

As an example consider α({1,2}):

  • {1,2} ⊆ γ(T)

  • {1,2} ⊆ γ(+)

  • {1,2} ⊄ γ(-)

  • {1,2} ⊄ γ(⊥)

Therefore α({1,2}) = ⊓A{+, T} = +

Dually γ(a) = ∪{S ⊆ Z | α(S) ⊑A a}
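The claim that the abstraction map is determined by the concretisation map can be tested on finite sets. In the Python sketch below the infinite sets γ(a) are represented by a membership predicate `in_gamma`, which is an encoding chosen for this illustration, as are the string constants used for the four abstract elements.

```python
BOT, POS, NEG, TOP = "⊥", "+", "-", "T"
ELEMS = (BOT, POS, NEG, TOP)


def leq(a, b):
    """Ordering of the sign domain: bottom below everything, T above."""
    return a == BOT or b == TOP or a == b


def meet(a, b):
    if leq(a, b):
        return a
    if leq(b, a):
        return b
    return BOT  # + and - have only bottom below them both


def in_gamma(n, a):
    """Membership test n ∈ γ(a), standing in for the infinite set γ(a)."""
    if a == BOT:
        return False
    if a == POS:
        return n > 0
    if a == NEG:
        return n < 0
    return True


def alpha(s):
    """Direct definition of the abstraction map on a finite set s."""
    if not s:
        return BOT
    if all(n > 0 for n in s):
        return POS
    if all(n < 0 for n in s):
        return NEG
    return TOP


def alpha_from_gamma(s):
    """Meet of every abstract element whose concretisation contains s."""
    result = TOP
    for a in ELEMS:
        if all(in_gamma(n, a) for n in s):
            result = meet(result, a)
    return result


for s in (set(), {1, 2}, {-3}, {0}, {-1, 1}):
    print(sorted(s), alpha(s), alpha_from_gamma(s))
```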

α requires A to be complete (dually for γ and C)

Since α(S) = ⊓A{a ∈ A | S ⊆ γ(a)}, meet needs to be defined over possibly infinite subsets of A

Observe that α: ℘(R^2) → A cannot be defined for A = set of planar polyhedra

Consider c = {⟨x, y⟩ ∈ R^2 | x^2 + y^2 ≤ 1}

But ⊓A{a1, a2, a3, … } is not defined

[Figure: the set c enclosed by successive polyhedra a1, a2, a3]

⟨A, α, C, γ⟩ is a Galois connection whenever

  • ⟨A, ⊑A⟩ and ⟨C, ⊑C⟩ are complete lattices

  • The mappings α: C → A and γ: A → C are monotonic, that is,
      If c1 ⊑C c2 then α(c1) ⊑A α(c2)
      If a1 ⊑A a2 then γ(a1) ⊑C γ(a2)

  • The compositions γ∘α: C → C and α∘γ: A → A are extensive and reductive respectively, that is,
      c ⊑C (γ∘α)(c) for all c ∈ C
      (α∘γ)(a) ⊑A a for all a ∈ A

c ⊑C (γ∘α)(c) is a statement about safe abstractions

Write c’ = (γ∘α)(c)

If c < c’ for some c ∈ C then working in the abstract setting has compromised precision

If c’ < c for some c ∈ C then working in the abstract setting has compromised correctness

Bar (γ∘α)(c) <C c for every c ∈ C

Thus stipulate c ⊑C (γ∘α)(c) for all c ∈ C to guarantee safety

[Figure: c is abstracted to a, which concretises to c’]

(α∘γ)(a) ⊑A a is a statement about best abstractions

Recall that γ(a) spells out what a ∈ A represents

Thus a is one way to describe γ(a); T is another way to describe γ(a) but a is better since a ⊑A T

Desire α(γ(a)) to be the best way to describe γ(a)

Therefore require α(γ(a)) ⊑A a

Collecting domains and semantics

Observe that C is not that concrete – programs include operations such as *: Z × Z → Z

C = ℘(Z) is a collecting domain which is easier to abstract than Z since it is already a lattice

To abstract *: Z × Z → Z, say, we synthesise a collecting version *C: ℘(Z) × ℘(Z) → ℘(Z) and then abstract that

Put S1 *C S2 = {n1*n2 | n1 ∈ S1 and n2 ∈ S2}

Safety and optimality requirements

The most precise (optimal) way to define *A: A × A → A is to define a1 *A a2 = α(γ(a1) *C γ(a2))

Not practical since γ(a1) and γ(a2) can be infinite

Handcraft computable *’A: A × A → A with a1 *A a2 ⊑A a1 *’A a2 for all a1,a2 ∈ A

Merely need to assert α(γ(a1) *C γ(a2)) ⊑A a1 *’A a2 for all a1,a2 ∈ A for correctness

Abstract multiplication

*’A  ⊥  +  -  T
⊥    ⊥  ⊥  ⊥  ⊥
+    ⊥  +  -  T
-    ⊥  -  +  T
T    ⊥  T  T  T

Consider α(γ(+) *C γ(+)) and + *’A +

Recall γ(+) = {n ∈ Z | n > 0}, hence γ(+) *C γ(+) = {n1*n2 | n1 > 0 and n2 > 0} = {n | n > 0}

Hence α(γ(+) *C γ(+)) = + = + *’A +

Since α(γ(+) *C γ(+)) ⊑A + *’A + safety follows for this case

Since + *’A + = α(γ(+) *C γ(+)) optimality follows for this case
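The safety requirement can be spot-checked for every pair of abstract elements. Because γ(a) is infinite, the Python sketch below substitutes a finite sample of each concretisation (the integers from -4 to 4), so it illustrates the check rather than proving it. The names `gamma_sample`, `collect_mul` and `abs_mul` are hypothetical, and the treatment of ⊥ in `abs_mul` is an assumption.

```python
BOT, POS, NEG, TOP = "⊥", "+", "-", "T"
ELEMS = (BOT, POS, NEG, TOP)
SAMPLE = range(-4, 5)  # finite stand-in for Z


def leq(a, b):
    return a == BOT or b == TOP or a == b


def alpha(s):
    if not s:
        return BOT
    if all(n > 0 for n in s):
        return POS
    if all(n < 0 for n in s):
        return NEG
    return TOP


def gamma_sample(a):
    """Finite sample of the concretisation of a."""
    if a == BOT:
        return set()
    if a == POS:
        return {n for n in SAMPLE if n > 0}
    if a == NEG:
        return {n for n in SAMPLE if n < 0}
    return set(SAMPLE)


def collect_mul(s1, s2):
    """Collecting multiplication on sets of integers."""
    return {n1 * n2 for n1 in s1 for n2 in s2}


def abs_mul(a1, a2):
    """Handcrafted abstract multiplication on signs."""
    if a1 == BOT or a2 == BOT:
        return BOT
    if a1 == TOP or a2 == TOP:
        return TOP
    return POS if a1 == a2 else NEG


def safe_on_sample():
    return all(
        leq(alpha(collect_mul(gamma_sample(a1), gamma_sample(a2))),
            abs_mul(a1, a2))
        for a1 in ELEMS for a2 in ELEMS
    )


print(safe_on_sample())
```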

Exotic applications of abstract interpretation

Recovering programmer intentions for understanding undocumented or third-party code

Verifying that a buffer overflow cannot occur, or pin-pointing where one might occur in a C program

Inferring the environment in which a system of synchronising agents will not deadlock

Lower-bound time-complexity analysis for granularity throttling

Binding-time analysis for inferring off-line unfolding decisions which avoid code-bloat