The GLIMPSES Toolkit
Rapid code prototyping for SPEs
Jaswanth Sreeram, Santosh Pande
Overview of Toolkit
• GLIMPSES Toolkit: GLobal Interprocedural Memory and ParalleliSm Estimator for SPUs
– Profile instrumentation support
  • Profile parsers and interpreters
– Analyzers for memory allocation & access behavior
– Visualization Engine
GLIMPSES toolkit
• One of two tools available in the public domain
– Rapid Prototyping, Legacy Code Migration and Performance Tuning on Cell SPEs
– Second one is asmvis
• Released on SourceForge in mid-July: http://glimpses.sourceforge.net
• OSI certified open source license(s)
• Has received interest for adoption in academia and industry
– Samsung Korea, Codecs and Media Computing Group
– Sony Computer Entertainment America (SCEA)
GLIMPSES : Motivation
• Prototyping large codebases for porting to SPEs is challenging
– Find a partition (set of functions)
– Find a set of upward exposed references
– DMA transfer them and lay them out (alignment)
– After execution, store the results back
– Make sure memory requirements do not exceed capacity
Motivation – contd.
• Challenges due to architectural attributes
– Limited local store
– High branch penalty
– Suited for vectorizable code rather than scalar code
– SPE/PPE interactions
• Provide programmer with tools to
– Understand program behavior (esp. memory usage)
– Quickly construct candidate partitions for SPEs
– Evaluate/quantify partitions’ suitability for SPEs
GLIMPSES : Details
• Memory estimation tools enable the programmer to:
– Estimate static & dynamic memory usage
  • Code, Stack, Heap
– Understand program behavior
  • Detect program objects affecting dynamic memory behavior
  • Show the correlation between these program objects and memory usage
– Rank program segments
  • Criteria: memory requirements, vectorizability, branching, etc.
– Visualize results interactively
Features overview
• Dynamic call graph visualization: ability to select a call tree
• Memory Requirements
– Dynamic
– Analytical: ‘what if’ scenario calculator for memory capacity
• Memory Access Patterns
– Locality (spatial, temporal, neighbor affinity)
• Ranking
– Criteria based estimates
• Alias and safe pre-fetching information
– Multiple alias analyses available
Overview
[Figure: toolkit workflow diagram. Boxes: C/C++ program, LLVM compiler flow, Bytecode, Analysis & Instrumentation Passes, Instru. Bytecode, Link Runtime, Test Inputs, Execute, Profile Trace, Dyn. Memory Estimator, Analytical Memory Estimator, Partition Estimator, GraphML Trace, Visualization Engine]
Visualization
[Screenshot: visualizer window with a Graph Visualization Area and a Results Display Panel]
Visualization …contd
• Zoom view
• Shows dynamic call chains for a program run (in this case the program is mpeg2-decode)
Visualization …contd
Function Characteristics
• Alias analysis algorithm used
• Type of aliases displayed (“Must Alias”, “May Alias”, “No Alias”)
• Aliasing information for pairs of variables/memory regions
Analytical Memory Estimation
• Correlate dynamic memory usage with program objects
– Dynamic memory usage depends on inputs, etc.
• Compiler Analysis
– From each malloc, do a backward traversal to find instructions that influence the arguments to malloc
– Construct an arithmetic expression for the amount of memory allocated, in terms of inputs or other program objects
– Handles control flow constructs (if-then-else, loops, etc.)
Memory Behavior: Analytical Estimation
if (cc == 0)
    size = Picture_Width * Picture_Height;
else
    size = Chroma_Width * Chroma_Height;
…
for (…) {
    if (…)
        malloc(size);
    if (…)
        malloc(size);
}
__Malloc_size__1 = Picture_Width*Picture_Height
__Malloc_size__2 = Picture_Width*Picture_Height
__Malloc_size__3 = Picture_Width*Picture_Height
__Malloc_size__4 = Picture_Width*Picture_Height
__Malloc_size__5 = Chroma_Width*Chroma_Height
__Malloc_size__6 = Chroma_Width*Chroma_Height
__Malloc_size__7 = Chroma_Width*Chroma_Height
__Malloc_size__8 = Chroma_Width*Chroma_Height
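The symbolic estimates above can be evaluated for concrete inputs, in the spirit of the ‘what if’ calculator. The following is a minimal C sketch, not GLIMPSES code: the function names are hypothetical, and summing all eight allocations assumes that none of them is freed in between. The 256 KB constant is the size of an SPE local store.

```c
#include <stddef.h>

/* SPE local store capacity in bytes (256 KB). */
#define LS_CAPACITY_BYTES (256u * 1024u)

/* Hypothetical helper: total heap bytes implied by the eight symbolic
 * estimates above, assuming all eight allocations are live at once. */
size_t estimate_heap_bytes(size_t picture_width, size_t picture_height,
                           size_t chroma_width, size_t chroma_height)
{
    size_t picture = picture_width * picture_height; /* __Malloc_size__1..4 */
    size_t chroma  = chroma_width * chroma_height;   /* __Malloc_size__5..8 */
    return 4 * picture + 4 * chroma;
}

/* Returns 1 if the estimated heap alone fits in the local store, else 0.
 * Code and stack also occupy the local store; they are ignored here. */
int heap_fits_local_store(size_t heap_bytes)
{
    return heap_bytes <= LS_CAPACITY_BYTES;
}
```

Under these assumptions, a 352x288 picture with 176x144 chroma planes needs 506,880 bytes and does not fit, while a 176x144 picture with 88x72 chroma planes needs 126,720 bytes and does.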
Memory References
• Memory reference metrics
– Temporal (frequency)
– Spatial
– Neighbor affinity
• Metrics measured per memory line
• Per-function metrics or per-partition metrics
• Visually represented via a color map
– Pale Violet (low) -> Bright Red (high)
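As an illustration of a per-memory-line metric, the sketch below counts how often each line is referenced in an address trace. It is a hedged example only: the trace format (a flat array of byte addresses), the function name, and the parameters are assumptions, not the toolkit's actual interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Count references per memory line.
 * trace     : accessed byte addresses (hypothetical trace format)
 * n         : number of accesses in the trace
 * base      : lowest address of the region being mapped
 * line_size : memory line size in bytes (e.g. 1024)
 * counts    : output array with num_lines entries, zeroed by this function
 * Accesses outside [base, base + num_lines * line_size) are ignored. */
void count_refs_per_line(const uintptr_t *trace, size_t n,
                         uintptr_t base, size_t line_size,
                         unsigned long *counts, size_t num_lines)
{
    for (size_t i = 0; i < num_lines; i++)
        counts[i] = 0;
    if (line_size == 0)
        return;
    for (size_t i = 0; i < n; i++) {
        if (trace[i] < base)
            continue;
        size_t line = (size_t)((trace[i] - base) / line_size);
        if (line < num_lines)
            counts[line]++;
    }
}
```

The resulting counts could then be mapped onto a low-to-high color scale such as the one described above.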
Memory Ref. Frequency (mpeg2decode)
[Figure: memory reference map (per partition) with 1024 B memory lines]
Mpeg2decode: Load recurrence
Neighbor Affinity
• Metric to describe how well memory layout is suited to caching
• Consider a slice S of length w of the whole memory access trace and two loads L1, L2 ∈ S
• If |addr(L1) – addr(L2)| < line size, then L1 and L2 exhibit neighbor affinity for slice size w
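The definition above can be turned into a simple count. The sketch below is one possible reading, not the toolkit's implementation: it slides a window of w consecutive loads over the trace and counts each pair of loads that shares a window and lies within one line size of each other. Counting pairs this way, and the function name, are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Count pairs of loads (i, j), i < j, that fall in a common slice of
 * length w (i.e. j - i < w) and whose addresses differ by less than
 * line_size bytes. Counting pairs over a sliding window is an assumption;
 * the slides only define when two loads exhibit neighbor affinity. */
size_t count_neighbor_affinity_pairs(const uintptr_t *loads, size_t n,
                                     size_t w, size_t line_size)
{
    size_t pairs = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n && j - i < w; j++) {
            uintptr_t high = loads[i] > loads[j] ? loads[i] : loads[j];
            uintptr_t low  = loads[i] > loads[j] ? loads[j] : loads[i];
            if (high - low < line_size)
                pairs++;
        }
    }
    return pairs;
}
```

A larger w can only keep or increase the count, since every pair that shares a short slice also shares a longer one.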
Load Neighbor Affinity
Alias Analysis for libode
• Basic AA (least precise, fastest)
– Aggressive local analysis
– Not context-sensitive
– Not flow-sensitive
• Total number of queries: 119520497
• “No Alias”: 35924925
• “May Alias”: 83492482
• “Must Alias”: 103090
Alias Analysis (contd)
• Globals Mod/Ref
– Context-sensitive mod/ref and alias analysis for internal global variables
– Very fast, very precise, limited scope
• Total number of queries: 119520497
• “No Alias”: 35944215
• “May Alias”: 83473192
• “Must Alias”: 103090
Alias Analysis (contd)
• Andersen’s AA algorithm
– Subset-based, flow-insensitive, context-insensitive, and field-insensitive alias analysis
– Very precise, but slow
• Total number of queries: 119520497
• “No Alias”: 79361105
• “May Alias”: 40057171
• “Must Alias”: 102221
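To compare the three analyses, it helps to express each result as a share of the total queries. The helper below is a small illustrative sketch; its name is hypothetical.

```c
/* Percentage of alias queries answered with a given result. */
double alias_result_percent(unsigned long count, unsigned long total_queries)
{
    if (total_queries == 0)
        return 0.0;
    return 100.0 * (double)count / (double)total_queries;
}
```

With the libode query counts reported on these slides, the “No Alias” share is about 30.1% for both Basic AA and Globals Mod/Ref, and about 66.4% for Andersen's analysis.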
Ranking (MPEG2Encode)
• Criteria based
– Code Size (csize)
– Stack Size (ssize)
– Heap Size (hsize)
– Branch density (br_density)
– Autovectorizable loops (av_loops)
– Is LS memory limit likely to be hit (ls_limit)
Rank = w1*csize + w2*ssize + w3*hsize + w4*br_density + w5/(1 + av_loops) + w6*ls_limit
(wi are the weights for each criterion)
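The rank formula above translates directly into code. The following C sketch is illustrative only: the struct layout, the use of doubles, and encoding ls_limit as 0 or 1 are assumptions, since the slides do not give units or value ranges.

```c
/* Per-function measurements used by the ranking criteria. */
struct rank_inputs {
    double csize;      /* code size */
    double ssize;      /* stack size */
    double hsize;      /* heap size */
    double br_density; /* branch density */
    double av_loops;   /* number of autovectorizable loops */
    double ls_limit;   /* assumed: 1 if the LS limit is likely hit, else 0 */
};

/* Rank = w1*csize + w2*ssize + w3*hsize + w4*br_density
 *        + w5/(1 + av_loops) + w6*ls_limit
 * w[0]..w[5] hold w1..w6. */
double compute_rank(const struct rank_inputs *in, const double w[6])
{
    return w[0] * in->csize
         + w[1] * in->ssize
         + w[2] * in->hsize
         + w[3] * in->br_density
         + w[4] / (1.0 + in->av_loops)
         + w[5] * in->ls_limit;
}
```

Note that av_loops is the only criterion that lowers the rank as it grows: more autovectorizable loops shrink the w5 term.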
Partitioning
• Preprocessing: Propagate ranks upwards in the call graph
Rank(n) = Rank(n) + ∑ Rank(n→child[i])
• Input: Call graph consisting of nodes annotated with ranks
• Output: Graph partitions that are suitable for execution on the SPEs
• A partition P is deemed “suitable” if Rank(P→root) < Threshold
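The preprocessing step and the suitability test can be sketched as two short recursive functions. This is a hedged illustration under stated assumptions: the call graph is treated as a tree (no recursion or shared callees), the node layout and function names are hypothetical, and the greedy top-down selection of partition roots is one plausible strategy that the slides do not spell out.

```c
#include <stddef.h>

#define MAX_CHILDREN 8

/* Call-graph node; assumes the graph is a tree. */
struct cg_node {
    double rank;                         /* own rank, then propagated rank */
    size_t num_children;
    struct cg_node *child[MAX_CHILDREN];
};

/* Preprocessing: Rank(n) = Rank(n) + sum of Rank(n->child[i]),
 * computed bottom-up. Returns the propagated rank of n. */
double propagate_ranks(struct cg_node *n)
{
    for (size_t i = 0; i < n->num_children; i++)
        n->rank += propagate_ranks(n->child[i]);
    return n->rank;
}

/* Collect partition roots top-down: a subtree whose root rank is below
 * the threshold is taken whole; otherwise its children are examined.
 * Returns the number of roots written to out (at most max_out). */
size_t collect_partitions(struct cg_node *n, double threshold,
                          struct cg_node **out, size_t max_out)
{
    if (max_out == 0)
        return 0;
    if (n->rank < threshold) {
        out[0] = n;
        return 1;
    }
    size_t found = 0;
    for (size_t i = 0; i < n->num_children && found < max_out; i++)
        found += collect_partitions(n->child[i], threshold,
                                    out + found, max_out - found);
    return found;
}
```

In this sketch, raising the threshold yields fewer, larger partitions, and lowering it pushes the partition roots further down the call tree.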
Effect of threshold on partitions
mpeg2decode
GLIMPSES status
• Beta version available for download at: http://glimpses.sourceforge.net
• 300 MB source code package (includes visualizer)
• Lines of code (C/C++): 447,000
• Third party tools integrated: LLVM (Compiler), Prefuse (Visualization)
• Executable size: 422 MB (x86 binaries)
• Typical trace size: 900 MB (LIBODE)
• Man-hour effort: ~750
• Releases:
– v.0.8: based on LLVM version 1.8 (July 7th)
– v.1.0: based on LLVM version 2.0 (undergoing testing)
• Tested to work with large codebases:
– LIBODE (115,000 lines of code), mpeg2 (10,000 lines of code), SPEC INT 2000, etc.
Ongoing and future work
• More validation
– Compare partitions produced with those generated by expert programmers
• An inter-procedural, flow-sensitive, context-sensitive alias analysis algorithm
• Function data dependence graph
– Encapsulates data flow between functions
– Arguments, aliases, globals
– Important factor in partitioning decisions
– “affinity between pairs of functions”