1
ANALYSIS OF PROG. LANG.
PROGRAM ANALYSIS
Instructors: Crista Lopes
Copyright © Instructors.
2
Motivation(s)
Where do you see PA in your everyday life?
How does PA “work”? What is PA anyway?
3
Auto-completion
4
Pre-compilation error detection
Ex: missing parenthesis
5
How do you know ...
int a;
increment_a() {
    a++;
}
while (true) {
    String a = "hello";
    increment_a();
}
This “a” is not that “a”
6
How do you remember ...
int a;
increment_a() {
    a++;
}
while (true) {
    String a = "hello";
    increment_a();
}
Wait, what’s the type of “a” again?
“a” is of type int (FYI...)
7
Outline
Introduction/motivations
Program representation
  AST
  3-address code
Control flow analysis
Data flow
8
Intermediate Representation (IR)
  Initial Point
  Abstract Syntax Tree
    Abstract vs Concrete Syntax
    Parse Tree vs Abstract Syntax Tree
  Three-address Codes
9
IR-1 Starting Point
Source code
  -> Parsing, Lexical Analysis
Intermediate representation
  -> Code Generation, Optimization
Target code
  -> Code Execution
Analyze IR
  Perform analysis on the results
  Use this information for applications
10
IR-2. Abstract Syntax Tree (AST)
Concrete vs Abstract Syntax
  Concrete syntax shows structure and is language-specific
  Abstract syntax shows structure only, without language-specific detail
Representations
  Parse Tree represents Concrete Syntax
  Abstract Syntax Tree represents Abstract Syntax
11
IR-2. Example: Grammar
Example
  a:= b+c (Language 1)
  a = b+c; (Language 2)
Grammar for 1
  stmtlist → stmt | stmt stmtlist
  stmt → assign | if-then | …
  assign → ident ":=" ident binop ident
  binop → "+" | "-" | …
Grammar for 2
  stmtlist → stmt ";" | stmt ";" stmtlist
  stmt → assign | if-then | …
  assign → ident "=" ident binop ident
  binop → "+" | "-" | …
12
IR-2. Example: Parse Tree
Parse Tree for a:=b+c
stmtlist
  stmt
    assign
      ident (a)  ":="  ident (b)  binop ("+")  ident (c)

Parse Tree for a=b+c;
stmtlist
  stmt
    assign
      ident (a)  "="  ident (b)  binop ("+")  ident (c)
  ";"
13
IR-2 Example: Abstract Syntax Tree
Example
  1. a:=b+c
  2. a=b+c;
Abstract Syntax Tree for 1 and 2
assign
  a
  add
    b
    c
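As a rough illustration of how an AST drops concrete syntax, the sketch below uses Python's standard `ast` module. Using Python instead of the two toy languages above is an assumption made purely for illustration. Two spellings of the same assignment, differing only in whitespace and redundant parentheses, parse to the same tree.

```python
import ast

def tree_shape(source: str) -> str:
    """Parse source and return a textual dump of its AST (no line/column info)."""
    return ast.dump(ast.parse(source))

# Two concrete spellings of the same assignment.
plain = tree_shape("a = b + c")
noisy = tree_shape("a   =   (b + c)")

# Spacing and parentheses belong to the concrete syntax, so the dumps agree.
same_tree = plain == noisy

# Node type names found in the tree, roughly the "assign / add" picture above.
node_kinds = [type(n).__name__ for n in ast.walk(ast.parse("a = b + c"))]
```

Here `Assign` and `BinOp` with an `Add` operator play the roles of the `assign` and `add` nodes in the slide's tree.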
14
IR-3. Three Address Code
General form: x = y op z
More generally: (operator, operand1, operand2, result) (at most 3 spots besides the operator)
May include temporary variables
Examples
  Assignment
    Binary  x := y op z   (op, y, z, x)
    Unary   x := op y     (op, y, _, x)
    Copy    x := y        (_, y, _, x)
  Jumps
    Unconditional  goto L                (goto, L, _, _)
    Conditional    if x relop y goto L   (relop, x, y, L)
  ….
15
IR-3. Example: Three Address Code
if a>10
  then x=y+z
  else x=y-z

1. if a>10 goto 4
2. x = y-z
3. goto 5
4. x = y + z
5. …..
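A minimal sketch of how an expression tree might be lowered to quadruples of the form (operator, operand1, operand2, result). It reuses Python's `ast` module as a stand-in front end, which is an assumption for illustration; the temporary names `t1`, `t2`, ... and the function name `to_three_address` are hypothetical.

```python
import ast
from itertools import count

def to_three_address(source: str):
    """Lower one assignment such as 'x = y + z * w' into quadruples
    (operator, operand1, operand2, result), introducing temporaries."""
    ops = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}
    temps = count(1)
    code = []

    def lower(node):
        # Return the name (variable, constant, or temporary) holding node's value.
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return str(node.value)
        if isinstance(node, ast.BinOp):
            left = lower(node.left)
            right = lower(node.right)
            result = f"t{next(temps)}"
            code.append((ops[type(node.op)], left, right, result))
            return result
        raise NotImplementedError(type(node).__name__)

    assign = ast.parse(source).body[0]
    value = lower(assign.value)
    # Final copy into the target, in the slide's copy form (_, y, _, x).
    code.append(("_", value, "_", assign.targets[0].id))
    return code

quads = to_three_address("x = y + z * w")
# quads == [("*", "z", "w", "t1"), ("+", "y", "t1", "t2"), ("_", "t2", "_", "x")]
```

Each quadruple has at most three operand spots besides the operator, and nested subexpressions are flattened through temporaries.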
16
Analysis Levels
Local: within a single basic block or statement
Intraprocedural: within a single procedure, function, or method
Interprocedural: across procedure boundaries, procedure call, shared globals, etc
Intraclass: within a single class
Interclass: across class boundaries
…..
17
Outline
Introduction/motivations
Program representation
Control flow analysis
  Computing Control Flow (analysis and representation)
  Search and Traversals
  Applications
Data flow
18
Computing Control flow (example)
Procedure AVG
S1    count=0;
S2    fread(fptr, n)
S3    while(not EOF) do
S4      if(n<0)
S5        return(error)
        else
S6        nums[count]=n
S7        count++
        endif
S8      fread(fptr, n);
      endwhile
S9    avg = mean(nums, count)
S10   return(avg)
[Figure: control flow graph of AVG with nodes entry, S1 to S10, and EXIT]
19
CF1: Control Flow (Basic Blocks)
A basic block is a sequence of consecutive statements in which flow of control enters at the beginning and leaves at the end, without halting or any possibility of branching except at the end
A basic block may or may not be maximal
For compiler optimizations, maximal blocks are desirable
For software engineering tasks, basic blocks that represent one source code statement are often used
21
CF1: Computing Control Flow
Input: A list of program statements in some form
Output: A list of CFG nodes and edges
Procedure:
  Construct basic blocks
  Create entry and exit nodes; create edge (entry, B1); create edge (Bk, exit) for each Bk that represents an exit from the program
  Add a CFG edge from Bi to Bj if Bj can immediately follow Bi in some execution, i.e.,
    there is a conditional or unconditional goto from the last statement of Bi to the first statement of Bj, or
    Bj immediately follows Bi in the order of the program and Bi does not end in an unconditional goto statement
  Label edges that represent conditional transfers of control
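The procedure above can be sketched roughly as follows, applied to the five-instruction three-address example from slide 15. The instruction encoding (`'stmt'`, `'goto'`, `'if'` with a 1-based target) and the function name `build_cfg` are assumptions for illustration. The sketch does not handle return statements and does not label conditional edges.

```python
def build_cfg(instrs):
    """instrs: list of (kind, target), kind in {'stmt', 'goto', 'if'};
    target is a 1-based instruction number or None.
    Returns (blocks, edges): blocks are lists of instruction numbers,
    edges are (from, to) pairs over block indices, 'entry', and 'exit'."""
    n = len(instrs)
    # Leaders: the first instruction, jump targets, and instructions after a jump.
    leaders = {1}
    for i, (kind, target) in enumerate(instrs, start=1):
        if kind in ("goto", "if"):
            leaders.add(target)
            if i + 1 <= n:
                leaders.add(i + 1)
    starts = sorted(leaders)
    blocks = [list(range(s, e)) for s, e in zip(starts, starts[1:] + [n + 1])]
    block_of = {b[0]: idx for idx, b in enumerate(blocks)}  # leader -> block index

    edges = {("entry", 0)}
    for idx, block in enumerate(blocks):
        kind, target = instrs[block[-1] - 1]
        if kind in ("goto", "if"):
            edges.add((idx, block_of[target]))      # jump edge
        if kind != "goto":                          # fall through unless unconditional
            edges.add((idx, idx + 1) if idx + 1 < len(blocks) else (idx, "exit"))
    return blocks, edges

# 1. if a>10 goto 4   2. x = y-z   3. goto 5   4. x = y+z   5. ...
instrs = [("if", 4), ("stmt", None), ("goto", 5), ("stmt", None), ("stmt", None)]
blocks, edges = build_cfg(instrs)
```

For this input the blocks are maximal: instructions 2 and 3 end up in the same block because control can only enter at 2 and leave at 3.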
22
CF2: Search and Ordering
Many ways to visit the nodes in the graph
  Depth First Search: visits the descendants of a node before visiting any of its siblings
  Breadth First Search: all of a node's immediate descendants are processed before any of their unprocessed children
  Preorder Traversal: a node is processed before its descendants
  Postorder Traversal: a node is processed after its descendants
23
CF2: Search and Ordering (cont’d) (DFS)
One DFS of the CFG: 1, 3, 4, 6, 7, 8, 10, back to 8, 9, back to 8, 7, 6, 4, 5, back to 4, 3, 1, 2, back to 1
The number assigned to a node during DFS is its depth first number
Depth first ordering of nodes is the reverse of the order in which nodes are last visited in the DFS
For the DFS, nodes are visited 1,3,4,6,7,8,10,8,9,8,7,6,4,5,4,3,1,2,1
Depth first ordering is 1,2,3,4,5,6,7,8,9,10
[Figure: example CFG with nodes numbered 1 to 10]
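A small sketch of the orderings discussed here. The successor lists are an assumption: they contain only the tree edges implied by the DFS tour above (1 to 3 and 2, 4 to 6 and 5, 8 to 10 and 9, and so on), since the full edge set of the figure is not recoverable from the text.

```python
def dfs_orders(succ, root):
    """Return (preorder, postorder, depth_first_order) for a graph given as
    {node: [successors in visiting order]}."""
    preorder, postorder, seen = [], [], set()

    def visit(node):
        seen.add(node)
        preorder.append(node)        # first visit
        for nxt in succ.get(node, []):
            if nxt not in seen:
                visit(nxt)
        postorder.append(node)       # last visit: all descendants are done

    visit(root)
    # Depth first ordering is the reverse of the last-visit (postorder) sequence.
    return preorder, postorder, postorder[::-1]

# Tree edges implied by the tour (assumed; non-tree edges are omitted).
succ = {1: [3, 2], 3: [4], 4: [6, 5], 6: [7], 7: [8], 8: [10, 9]}
pre, post, dfo = dfs_orders(succ, 1)
```

With these successor lists the last-visit order is 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, so reversing it gives the depth first ordering 1 through 10 stated on the slide.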
24
CF: Types of Edges
A depth first representation consists of the depth first spanning tree (tree edges) together with the other edges that are not part of the tree
Three kinds of edges
  Advancing (forward) edges: go from a node to one of its proper descendants in the tree; these include tree edges
  Back edges: go from a node to one of its ancestors in the tree
  Cross edges: connect nodes such that neither is an ancestor of the other
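One common way to tell these edge kinds apart is to record discovery and finish times during the DFS. The sketch below assumes that approach; the graph is a hypothetical four-node example, not the slide's CFG. It reports tree edges separately from the other advancing edges, although the slide counts tree edges as advancing edges.

```python
def classify_edges(succ, root):
    """Classify each edge reachable from root as 'tree', 'forward', 'back',
    or 'cross' using DFS discovery and finish times."""
    discover, finish, kinds = {}, {}, {}
    clock = 0

    def visit(node):
        nonlocal clock
        clock += 1
        discover[node] = clock
        for nxt in succ.get(node, []):
            if nxt not in discover:
                kinds[(node, nxt)] = "tree"
                visit(nxt)
            elif nxt not in finish:
                kinds[(node, nxt)] = "back"      # nxt is still open: an ancestor
            elif discover[node] < discover[nxt]:
                kinds[(node, nxt)] = "forward"   # nxt is a finished descendant
            else:
                kinds[(node, nxt)] = "cross"     # neither is an ancestor of the other
        clock += 1
        finish[node] = clock

    visit(root)
    return kinds

# Hypothetical graph: a diamond 1->2->4, 1->3->4, a shortcut 1->4, and a loop 4->1.
kinds = classify_edges({1: [2, 3, 4], 2: [4], 3: [4], 4: [1]}, 1)
```

The classification depends on the order in which successors are visited; a different DFS of the same graph can turn a tree edge into a forward or cross edge.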
25
Applications of Control Flow
Complexity – Pointers to refactoring
Testing
  Branch, Path, Basis Path
  Branch: Must test 1-2, 1-3, 4-5, 4-8, 5-6, 5-7
  Path: Infinite, due to loop
  Basis Path: Set of paths which covers all the edges at least once, e.g. 1,2,4,8; 1,3,4,5,6,7,4,8
Program Understanding
  Recover program structure
Impact analysis
…..
[Figure: example CFG with nodes numbered 1 to 8]
26
Outline
Introduction/motivations
Program representation
Control flow
Data flow
  Introduction
  Reaching definitions
27
Data flow - Introduction
Flow of various data throughout the program
  Obtained from AST or CFG
  Used in software engineering tasks
Exact solutions to most data flow problems are undecidable
  May depend on input
  May depend on the outcome of a conditional statement
  May depend on termination of loop
Thus we compute approximations of the exact solution
28
Data flow - Introduction
Some approximations "overestimate" the solution
  Approximations contain the actual information plus some spurious information, but do not omit any actual information
  Conservative and safe approach
Some approximations "underestimate" the solution
  Approximations may not contain all the information of the actual solution
  Unsafe
Research challenge: providing safe but precise information in an efficient way
Uses of data flow:
  Compiler optimization requires conservative analysis
  Software engineering tasks may only need unsafe info
29
Data flow – Compiler Optimization
Common subexpression elimination
c=a+b
e=a+b
d=a+b
30
Data flow – Compiler Optimization
Common subexpression elimination
Need to know available expressions: which expressions have been computed at that point before this statement
Before:
  c=a+b
  e=a+b
  d=a+b
After:
  t=a+b; c=t
  t=a+b; d=t
  e=t
31
Data Flow - Compiler Optimization
Register (de)allocation
  When assigning memory locations to registers, if a value in a register (i.e. a memory location) is not used again, there is no need to keep it in a register
  Is R2 needed after this statement?
  Need to know "live variables": which variables are still used after the current line
R1=R2+10
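A minimal sketch of live-variable computation for straight-line code, scanning backwards from the end. The two statements following `R1 = R2 + 10` are hypothetical, added only so that the example has later uses to look at; branches and loops would need the iterative treatment used for other data flow problems.

```python
def live_after(statements, live_out=frozenset()):
    """statements: list of (defined_var, [used_vars]) in program order.
    Returns, for each statement, the set of variables live immediately after it."""
    live = set(live_out)
    result = []
    for defined, used in reversed(statements):
        result.append(set(live))    # what is live right after this statement
        live.discard(defined)       # the definition overwrites the old value
        live.update(used)           # operands must be live before the statement
    result.reverse()
    return result

# Hypothetical continuation of the slide's statement:
#   R1 = R2 + 10 ;  R3 = R1 * 2 ;  return R3
stmts = [("R1", ["R2"]), ("R3", ["R1"]), (None, ["R3"])]
after = live_after(stmts)
```

In this made-up continuation R2 is not in the live set after the first statement, so its register could be reused from that point on.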
32
Data Flow - Compiler Optimization
Suppose every assignment that reaches this statement assigns 5 to c
  then the value assigned to 'a' can be replaced by the constant 15
But: Need to know reaching definitions: which definition(s) of variable c reach this statement
a=c+10   // needs 3 registers
a=15     // needs 2 registers
33
Data Flow - Sw Eng Tasks
Data-flow testing
  Suppose that a statement assigns a value but the use of that value is never executed under test
  a never used on this path
  Need to know definition-use pairs: link between definition(s) and use(s) of a variable (or a memory location)
a=c+10
d=a+y
34
Data Flow - Sw Eng Tasks
Debugging
  Suppose that 'a' has an incorrect value in the statement
    e.g. integer overflow
  Need data dependence information: some statements produce erroneous values, others are affected by those values
a=c+y
d=a+y
35
Data flow - Example
Compute the flow of data throughout the program
  Where does the assignment to i in statement 1 reach?
  Where does the expression computed in statement 2 reach?
  Which uses of a variable are reachable from the end of Block 1?
  Is the value of variable i live after statement 2?
B1:  1. i=2
     2. k=i+1
B2:  3. i=1
B3:  4. k=k+1
B4:  5. k=k-4
36
Reaching definitions analysis
Definition = statement where a variable is assigned a value (e.g. input statement, assignment statement)
A definition of ‘a’ reaches a point ‘p’ if there exists a control flow path in the CFG from the definition to ‘p’ with no other definitions of ‘a’ on the path
Such a path may exist in the graph but may not be possible – infeasible path
37
Reaching definitions analysis
What are the definitions in the program?
  Of variable i:
  Of variable k:
Which basic blocks (before block) do these definitions reach?
  Def 1 reaches:
  Def 2 reaches:
  Def 3 reaches:
  Def 4 reaches:
  Def 5 reaches:
38
Reaching definitions analysis
What are the definitions in the program?
  Of variable i: 1, 3
  Of variable k: 2, 4, 5
Which basic blocks (before block) do these definitions reach?
  Def 1 reaches: B2
  Def 2 reaches: B1, B2, B3
  Def 3 reaches: B1, B3, B4
  Def 4 reaches: B4
  Def 5 reaches: exit
39
Reaching definitions analysis
Method
  Compute two kinds of basic information (within the block)
    Gen[B]: set of definitions generated within B
    Kill[B]: set of definitions that, if they reach the point before B, won't reach the end of B
  Compute two other sets by propagation
    IN[B]: set of definitions that reach the beginning of B
    OUT[B]: set of definitions that reach the end of B
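A rough sketch of how Gen and Kill might be computed from the block contents alone. The assignment of statements to blocks (1 and 2 in B1, 3 in B2, 4 in B3, 5 in B4) is read off the example figure and is an assumption; the function name `gen_kill` is hypothetical.

```python
def gen_kill(blocks):
    """blocks: {block: [(def_number, variable), ...]} in program order.
    Returns (GEN, KILL) as dicts mapping each block to a set of definition numbers."""
    defs_of = {}                                  # variable -> all its definitions
    for stmts in blocks.values():
        for number, var in stmts:
            defs_of.setdefault(var, set()).add(number)

    gen, kill = {}, {}
    for name, stmts in blocks.items():
        # Only the last definition of each variable survives to the end of the block.
        last_def = {var: number for number, var in stmts}
        gen[name] = set(last_def.values())
        # Every other definition of a variable defined here is killed.
        kill[name] = set().union(*(defs_of[var] for var in last_def)) - gen[name]
    return gen, kill

blocks = {
    "B1": [(1, "i"), (2, "k")],
    "B2": [(3, "i")],
    "B3": [(4, "k")],
    "B4": [(5, "k")],
}
GEN, KILL = gen_kill(blocks)
```

Both sets depend only on the statements inside each block, which is why they can be computed once before any propagation of IN and OUT.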
40
Reaching definitions analysis
Block | Init GEN | Init KILL | Init IN | Init OUT | IN  | OUT
B1    | 1,2      | 3,4,5     | --      | 1,2      | 2,3 | 1,2
B2    | 3        | 1         | --      | 3        | 1,2 | 2,3
B3    | 4        | 2,5       | --      | 4        | 2,3 | 3,4
B4    | 5        | 2,4       | --      | 5        | 3,4 | 3,5
41
Iterative Data-Flow analysis algorithm
Algorithm for Reaching Definitions
Input: CFG with GEN[B], KILL[B] for all B
Output: IN[B], OUT[B] for all B
Begin RD
  IN[B]=empty, OUT[B]=GEN[B] for all B; change = true
  While change do begin
    change=false
    For each B do begin
      IN[B]=union OUT[P] (P is a predecessor of B)
      OLDOUT=OUT[B]
      OUT[B]=GEN[B] union (IN[B]-KILL[B])
      if (OUT[B]!=OLDOUT) then change = true;
    End for
  End while
End RD
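The RD pseudocode translates fairly directly into Python. The sketch below runs it on the four-block example with the GEN and KILL sets from the table on slide 40. The CFG edges (B1 to B2, B2 back to B1, B2 to B3, B3 to B4) are an assumption inferred from the IN and OUT values in that table; the slides' text does not state them.

```python
def reaching_definitions(blocks, preds, gen, kill):
    """Iterate IN/OUT to a fixed point, following the RD pseudocode."""
    IN = {b: set() for b in blocks}
    OUT = {b: set(gen[b]) for b in blocks}
    change = True
    while change:
        change = False
        for b in blocks:
            # IN[B] is the union of OUT[P] over all predecessors P of B.
            IN[b] = set().union(*(OUT[p] for p in preds[b]))
            old_out = OUT[b]
            OUT[b] = gen[b] | (IN[b] - kill[b])
            if OUT[b] != old_out:
                change = True
    return IN, OUT

blocks = ["B1", "B2", "B3", "B4"]
gen = {"B1": {1, 2}, "B2": {3}, "B3": {4}, "B4": {5}}
kill = {"B1": {3, 4, 5}, "B2": {1}, "B3": {2, 5}, "B4": {2, 4}}
# Assumed edges, inferred from the IN/OUT table: B1->B2, B2->B1, B2->B3, B3->B4.
preds = {"B1": ["B2"], "B2": ["B1"], "B3": ["B2"], "B4": ["B3"]}
IN, OUT = reaching_definitions(blocks, preds, gen, kill)
```

Under the assumed edges the loop stabilises on the second pass, and the resulting sets agree with the IN and OUT columns of the table.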
42
Tools
Eclipse JDT/AST (APIs to construct, traverse and manipulate AST)
  http://www.vogella.de/articles/EclipseJDT/article.html
Sourcerer
  http://sourcerer.ics.uci.edu/index.html
Crystal (Data Analysis Framework, mostly for academic purposes)
  http://code.google.com/p/crystalsaf/wiki/Installation
43
Mandatory Reading List
Representation and Analysis of Software – Rep-Analysis.pdf
Crystal Notes – CrystalTutorialNotes.pdf, CrystalTutorial.ppt
Eclipse JDT - AST - http://www.vogella.de/articles/EclipseJDT/article.html
44
More (optional) Reading List
Principles of Program Analysis, Nielson, Nielson, and Hankin
Invariant Detection using Daikon – daikon.pdf
More optional readings available at Program Analysis course material at CMU http://www.cs.cmu.edu/~aldrich/courses/15-819M/