Center for Domain-Specific Computing Supported by NSF “Expedition in Computing” Program
www.cdsc.ucla.edu.
CDSC-GR: a CnC-inspired Graph Representation
CnC 2013 -- September 24, 2013
Alina Sbirlea (1), Zoran Budimlic (1), Jason Cong (2), Zhuo Li (2), Louis-Noel Pouchet (2), Vivek Sarkar (1) and Mo Xu (2)
(1) Rice University; (2) University of California, Los Angeles
2
CDSC-GR: Motivations for a CnC-inspired Graph Representation
• Provide a separate, well-defined input language to the programmer
Easy to read/write for domain experts, “DSL” for dataflow
• Be an intermediate representation for parallel programs
Translation to/from CDSC-GR, analysis & optimizations, mapping
• Perform graph-level optimizations & checking
Race detection, static & static+dynamic scheduling, data locality analysis, …
• CnC-HC is not enough; extensions are desired, such as:
Support for non-graph data, step-to-step synchronization
Support for regions (sets of integer points) for collections
• CDSC-GR can be mapped on heterogeneous hardware
Distributed-CnC, CDSC-GR-to-HC, FPGA synthesis
3
Motivation Through Example: Heart Wall Tracking
• Heart Wall Tracking:
medical imaging application from the Rodinia Benchmark Suite
detects and analyzes the movement of the heart walls
4
Motivation Through Example: Heart Wall Tracking
• Heart Wall Tracking – one step further:
create a finer-grained version by splitting each step into 10 additional steps
data remains structured as in the original C code => use “fake” items to achieve step-to-step synchronization
6
Motivation Through Example: Heart Wall Tracking
• Heart Wall Tracking – fine-grained:
Accesses C code, oblivious to the data present there
Uses items for step-to-step synchronization in CnC
17 lines of code in CDSC-GR vs 38 lines of code in the CnC graph file
• CDSC-GR can operate on both graph data and non-graph data
CnC dynamic single assignment not needed for non-graph data
• CDSC-GR can be mapped to heterogeneous targets (e.g., FPGA) with synchronized access to non-graph data
FPGA mapper prototype requires explicit (static) regions
Regions simplify communication generation and management of collections
Non-graph data support requires explicit step-to-step dependence support
7
Some differences between CnC and CDSC-GR
| Relationship | Standard CnC | CDSC-GR |
|---|---|---|
| Controller – controllee | (step1) -> <t2>; <t2> :: (step2); | (step1) :: (step2); |
| Producer – consumer | (step1) -> [item] -> (step2); | (step1) -> [item] -> (step2); (step1) -> (step3); |
8
CDSC-GR Key Features and Their Purpose [1/3]
| Property | Intel CnC | CDSC-GR | Role |
|---|---|---|---|
| 1) Item, Step, Control Collections | CnC has all three | CDSC-GR only has item and step collections (no loss of generality in omitting control collections) | Modeling |
| 2) Item put-get semantics | Dynamic single assignment | Dynamic single assignment for graph data (arbitrary access to non-graph data) | Modeling |
| 3) Step prescription | Achieved via put on tag collection, and prescription of step | Achieved directly by one step prescribing (spawning) another step. Steps still have control tags as in CnC. | Modeling |
| 4) Item allocation | Dynamic | Dynamic or pre-allocated | Modeling |
| 5) Nondeterminism | Design in progress | Support for putIfAbsent() | Modeling |
| 6) User-defined reductions | Designed but not included in release | In plan for CDSC-GR | Modeling |
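Rows 2 and 5 can be made concrete with a small model. The following is a minimal Python sketch, not the Intel CnC or CDSC-GR API: the class `ItemCollection` and its method names are hypothetical, and only the single-assignment rule and the first-writer-wins behaviour of putIfAbsent() are taken from the table.

```python
class ItemCollection:
    """Hypothetical item collection enforcing dynamic single assignment."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        # A second put on the same key violates dynamic single assignment.
        if key in self._items:
            raise KeyError(f"item {key!r} already written")
        self._items[key] = value

    def put_if_absent(self, key, value):
        # First writer wins; returns True only if this call stored the value.
        if key in self._items:
            return False
        self._items[key] = value
        return True

    def get(self, key):
        return self._items[key]
```

With this model, two steps racing to write the same key through `put_if_absent` leave exactly one value in the collection, whereas two plain `put` calls raise an error.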
9
CDSC-GR Key Features and Their Purpose [2/3]
| Property | Intel CnC | CDSC-GR | Role |
|---|---|---|---|
| 7) Inter-step synchronization | Inter-step synchronization must be indirect via items | Direct inter-step synchronization permitted (used for coordinating accesses to non-graph data) | Modeling, Performance |
| 8) Put/get granularity | Single items | Single items + regions | Modeling, Performance |
| 9) Step parameters | No parameters other than control tag | Parent step can pass parameters to child step separate from items | Performance |
| 10) Tuning language | Designed but not included in release | In plan for CDSC-GR | Modeling, Performance |
| 11) Get counts and folding functions | Designed but not included in release | In plan for CDSC-GR | Performance |
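Row 11 lists get counts as a planned performance feature. As a rough illustration of the idea (an item is freed once its declared number of gets has been consumed), here is a Python sketch; `CountedItems` and its interface are assumptions made for this example, not part of CDSC-GR.

```python
class CountedItems:
    """Hypothetical item store that frees an item after its last expected get."""

    def __init__(self):
        self._items = {}  # key -> [value, remaining_gets]

    def put(self, key, value, get_count):
        if get_count < 1:
            raise ValueError("get_count must be at least 1")
        # Only detects duplicates among items that are still live.
        if key in self._items:
            raise KeyError(f"item {key!r} already written")
        self._items[key] = [value, get_count]

    def get(self, key):
        entry = self._items[key]
        entry[1] -= 1
        if entry[1] == 0:
            # The last expected consumer has read the item: release it.
            del self._items[key]
        return entry[0]

    def live_items(self):
        return len(self._items)
```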
10
CDSC-GR Key Features and Their Purpose [3/3]
| Property | Intel CnC | CDSC-GR | Role |
|---|---|---|---|
| 12) Dependent gets | Permitted | Prohibited: the tag/key for a get operation should only depend on the step's control tag, and not on the value of another get operation in the same step. | Analysis and Generation |
| 13) Tag functions | Optional | Required for all input items. Can be optionally provided for output items | Modeling, Analysis and Generation |
| 14) Static analysis and transformation of graph programs | Hard to do with API calls | In plan for CDSC-GR | Analysis and Generation |
| 15) Standardized textual representation of graphs | Present in earlier versions; current version uses API calls to specify graph | Foundation for CDSC-GR | Modeling, Analysis and Generation |
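Rows 12 and 13 together mean that a step's inputs can be computed from its control tag alone, before the step runs. A minimal Python sketch of that idea follows; the step `mystep`, its input pattern ([A: i, j] and [B: i, j+1]), and the helper names are illustrative assumptions, not CDSC-GR syntax or API.

```python
def mystep_input_keys(tag):
    """Tag function: input item keys as a function of the control tag only."""
    i, j = tag
    return [("A", (i, j)), ("B", (i, j + 1))]


def is_ready(tag, available):
    """A step instance can be scheduled once all of its inputs have been put."""
    return all(key in available for key in mystep_input_keys(tag))
```

Because no key depends on the value returned by another get, a scheduler or static analyzer can evaluate `mystep_input_keys` without executing any step code.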
11
Translating Legacy Codes to CDSC-GR
Main motivation 1: reuse existing software/compilers
• CDSC-GR-to-HC translation enables HC optimizations
OpenCL code generation
• Distributed-CnC, CnC on OCR
A step toward distributed memory code generation for CDSC-GR
• Static/dynamic analysis at the graph level
Static scheduling / partitioning, static+dynamic scheduling, placement
I/O complexity analysis
Main motivation 2: ease the user job when translating legacy programs
• Full automation out of reach for arbitrary codes, feedback needed from user
12
In-Depth Look: Regions
• In CnC-HC: tags are integer tuples, with simple bounds
[ A: i, j ], [B: i, j+1] -> (mystep: i, j);
env -> (mystepColl: { 1 .. 42 }, { 3 .. 51 });
• In practice, programs use more complex “shapes” for tags
for (i = 0; i < N; ++i)
for (j = i + 1; j < N; ++j)
A[i][j] = A[j][i];
[ A: j,i ] -> (mystep: i, j) -> [A: i,j ];
env -> <mystepColl: { 0 .. N-1 }, { ?? .. ?? }>;
• In practice, data regions need not be a single entry in the item collection
These issues can be addressed with program modifications (loop merging, careful management of item collections)
Or with a language extension!
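To see why simple per-dimension bounds fall short for the triangular loop above, the following Python sketch (helper names are hypothetical) compares the exact set of (i, j) tags with a rectangular range that keeps { 0 .. N-1 } for i and covers every j value the loop can reach.

```python
def triangular_tags(n):
    """Exact tags of: for i in [0, n), for j in [i + 1, n)."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)]


def rectangular_tags(n):
    """Rectangular over-approximation: i in 0 .. n-1, j in 1 .. n-1."""
    return [(i, j) for i in range(n) for j in range(1, n)]
```

For N = 4 the loop executes 6 iterations while the rectangle holds 12 tags, so half of the prescribed step instances would be spurious.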
13
Syntax Proposal for Regions
• Proposal by example:
for (i = 0; i < N; ++i)
for (j = i + 1; j < N; ++j)
A[i][j] = A[j][i];
[ A: j,i ] -> (mystep: i,j) -> [A: i,j];
def region1 : i, j := { 0 <= i < N, i + 1 <= j < N };
env -> <mystepColl: region1>;
• General template for a basic region:
def <region_name> : <list of names for each dim> := { inequalities }
• General template for a union of regions:
def <reg_name>[param list] : <list of names for each dim> := { inequalities1 }[, { inequalities2 }]
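One way to read these templates is as a conjunction of inequalities per basic region, and a set union across regions. The Python sketch below models that reading; the `Region` class, the bounding `box` argument used for enumeration, and the value N = 4 are all assumptions for illustration, not part of the proposal.

```python
from itertools import product


class Region:
    """A basic region: named dimensions plus a conjunction of inequalities."""

    def __init__(self, dims, constraints):
        self.dims = dims                # e.g. ("i", "j")
        self.constraints = constraints  # predicates over the named dimensions

    def contains(self, point):
        env = dict(zip(self.dims, point))
        return all(c(**env) for c in self.constraints)

    def points(self, box):
        """Enumerate integer points inside an inclusive bounding box."""
        ranges = [range(lo, hi + 1) for lo, hi in box]
        return [p for p in product(*ranges) if self.contains(p)]


def union_points(regions, box):
    """Points of a union of basic regions, deduplicated and sorted."""
    return sorted({p for r in regions for p in r.points(box)})


N = 4  # assumed value of the symbolic parameter
region1 = Region(("i", "j"), [lambda i, j: 0 <= i < N,
                              lambda i, j: i + 1 <= j < N])
```

Calling `region1.points([(0, N - 1), (0, N - 1)])` enumerates the strictly upper-triangular tags of the example loop nest.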
14
Inequalities to Describe a Region
Inequalities are used to describe a convex set of points
Ex: { i <= 42, i >= 12 + j }
Several key questions:
• Must the set really be convex?
No, but easier mapping if it is
• Must the set be exact?
No, but over-approximation must not affect correctness
• Must the set be computable at compile-time?
No, but better static/dynamic optimizations could be triggered if it is; some mappings (FPGA) require it
15
Regions for Graph Data
• Syntax allows for parameterized regions (maps) by arguments:
def region2(p,q) : i,j := { i = p, q - 1 <= j <= q + 1 };
• Use inside the graph:
[A:region2(i,j)] -> (step1: i,j) /// equivalent to [A:i,j-1], [A:i,j], [A:i,j+1] -> (step1: i,j)
def region1 : i, j := { 0 <= i < N, i + 1 <= j < N };
env -> <mystepColl: region1>;
Key questions have the same answers:
• Must the map be affine? Invertible? Likely no, but easier mapping if it is
• Must the map be exact? Likely no, but over-approximation must not affect correctness
• Must the map be computable at compile-time? Likely no, but better static/dynamic optimizations could be triggered if it is; some mappings (FPGA) require it
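The equivalence noted for region2 can be checked by expanding the parameterized region by hand. This Python sketch mirrors region2(p, q); the function names are hypothetical stand-ins for what a CDSC-GR translator might generate.

```python
def region2(p, q):
    """Points of { i = p, q - 1 <= j <= q + 1 } for given arguments p, q."""
    return [(p, j) for j in range(q - 1, q + 2)]


def step1_input_keys(i, j):
    """Items of collection A read by step instance (step1: i, j)."""
    return [("A", key) for key in region2(i, j)]
```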
16
A Step Further: Functions
• Not all programs have affine iteration domains!
• Union of regions can be too tedious for non-convex sets
• We do not want to support arbitrary expressions in CDSC-GR
A grammar for arbitrary expressions using C math functions would be needed
• Our proposal: allow the definition of functions that map CDSC-GR symbols to C expressions
for (int i = min(P, Q); i < sqrt(P*Q); ++i)
S(i, j)
def mymin(x,y) := "x < y ? x : y";
def mysqrt(x) := "sqrt(x)";
def region1 : i, j := { i >= mymin(P, Q), i < mysqrt(P*Q) };
env -> <S : region1>;
17
More on Functions
• Functions associate an identifier, and possibly some arguments, with code in the target language that evaluates the expression
def <func_name>[arguments] := "<code in host language>";
• Functions are assumed to return the same value when called with the same arguments, and to have no side effects
• Functions are implicitly typed: they return an integer value and take integers as arguments
• Functions can lead to non-convex regions
• Functions can inhibit static analysis at the graph level
• Functions make translation from C for loops to CDSC-GR easier!
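As a sketch of how such functions could be evaluated, the Python below mirrors the mymin/mysqrt example from the previous slide. It assumes that results are truncated to integers (following the implicit integer typing stated above) and picks P = 2, Q = 8 arbitrarily in the usage; the j dimension is omitted because the example places no constraint on it.

```python
import math


def mymin(x, y):
    # Mirrors the host-language expression "x < y ? x : y".
    return x if x < y else y


def mysqrt(x):
    # Mirrors "sqrt(x)"; truncation to an integer is an assumption.
    return int(math.sqrt(x))


def region1_i_values(p, q):
    """Integer i with mymin(p, q) <= i < mysqrt(p * q)."""
    return list(range(mymin(p, q), mysqrt(p * q)))
```

For P = 2 and Q = 8, `region1_i_values(2, 8)` yields i = 2 and i = 3. Under the truncation assumption the region can differ from the C loop when sqrt(P*Q) is not an integer, which is one reason a translator may want user feedback on such bounds.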
18
Easing Static Analysis: Properties on Functions
• Functions are black boxes, but some information may be known
Example: range for the input values, range for the output values
• Ditto for “parameters” (unknown constants)
• We introduce properties on functions by associating them with a region
def myfunc(i,j) := "i*i + j*j + 1";
def regprop1 : x := { x >= 1 };
prop myfunc : regprop1
def region1 : i, j := { 0 <= i <= 42, 0 <= j <= myfunc(i,i+1)};
Here we now know region1 is never empty
• Note: a one-dimensional region is associated with a function, as a function returns an integer value
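The non-emptiness claim can be reasoned about from the property alone: if myfunc never returns less than 1, every i in 0..42 admits at least j = 0 and j = 1. The Python sketch below checks this under the assumption that region bounds are inclusive as written; `min_points_from_property` is a hypothetical helper, not a CDSC-GR analysis.

```python
def myfunc(i, j):
    # Mirrors the host-language expression "i*i + j*j + 1".
    return i * i + j * j + 1


def min_points_from_property(i_lo, i_hi, f_lower_bound):
    """Lower bound on the size of region1 using only f >= f_lower_bound."""
    rows = i_hi - i_lo + 1
    # Each row holds j = 0 .. f(...), hence at least f_lower_bound + 1 points.
    return rows * (f_lower_bound + 1)


def exact_points():
    """Exact size of region1, obtained by evaluating the black box."""
    return sum(myfunc(i, i + 1) + 1 for i in range(0, 43))
```

The property yields a guaranteed minimum of 86 points without ever looking inside myfunc; evaluating the function can only give a larger count.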
19
Easing Static Analysis: Properties on Parameters
• The same mechanism can be used to bound program parameters, and to map them to a symbol in the host language
def myParam1 := "P";
def regprop1 : x := { 1 <= x <= 10000 };
prop myParam1 : regprop1
def region1 : i := { 0 <= i <= myParam1};
Here we now know region1 is never empty, and will never exceed 10001 points (0 <= i <= 10000)
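With inclusive bounds, the size of region1 is myParam1 + 1, so the property on the parameter translates directly into a bound on the number of points. A small Python check, using a hypothetical helper name:

```python
def region1_size(param):
    """Number of integer points in { 0 <= i <= param } for a bounded parameter."""
    if not 1 <= param <= 10000:
        raise ValueError("parameter violates its declared property")
    return param + 1
```

The smallest admissible parameter gives 2 points and the largest gives 10001.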
20
The Next Step: A Compiler Framework for CDSC-GR
Our objective: provide translators to/from CDSC-GR
CDSC-GR to HC prototype already available
Current work: translating a subset of C to CDSC-GR. Key challenges include:
• Analysis of C program to detect parallelism / step partitioning
Conservative analysis, feedback from the user expected
• Analysis of C program to capture dataflow
Conservative analysis, feedback expected
• Pre-transformations to ease translation to CDSC-GR
21
The Role of Compiler Analysis & Transformations
• Key challenge 1: what to put inside step(s)?
Current approach: try to put non-affine code segments inside steps
Not all outer loops have an easy-to-analyze dependence pattern -> need transformation for better efficiency?
Tiling, when applicable, allows for dynamic management of step granularity
• Key challenge 2: extract enough parallelism at the CDSC-GR level
Conservative analysis is safe, and it can provide a list of may-dependences to the user for feedback (interactive process)
Currently plan to use basic dependence distance vector computation
• Key challenge 3: compute data flow and item collection initialization
Translate program to DSA, graph vs. non-graph data selection
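Dependence distance vectors can be illustrated by brute force on a small iteration domain. The Python sketch below is not the symbolic analysis a compiler would run; the function name and the access patterns passed to it are assumptions chosen for illustration, and only flow (write-then-read) dependences are reported.

```python
from itertools import product


def distance_vectors(domain, write, read):
    """Flow-dependence distance vectors found by enumerating iterations.

    domain: list of iteration tuples in any order
    write / read: map an iteration to the array cell it writes / reads
    """
    writers = {}
    for it in domain:
        writers.setdefault(write(it), []).append(it)
    dists = set()
    for it in domain:
        for src in writers.get(read(it), []):
            if src < it:  # source precedes sink in lexicographic order
                dists.add(tuple(b - a for a, b in zip(src, it)))
    return dists


# Hypothetical statement A[i][j] = A[i-1][j+1] on a 4 x 4 domain.
domain = list(product(range(4), range(4)))
flow = distance_vectors(domain,
                        write=lambda it: (it[0], it[1]),
                        read=lambda it: (it[0] - 1, it[1] + 1))
```

Here `flow` contains the single vector (1, -1). Applied to the transposition loop A[i][j] = A[j][i] with j > i, the same routine finds no flow dependence, since that loop reads only below the diagonal and writes only above it.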
22
Where to Put the Parallelism?
• Parallelism must be exploited at multiple levels for good performance
SIMD-like parallelism for CPU and GPU
Sync-free multi-core parallelism (OpenMP)
Pipeline parallelism / sync-free parallelism for FPGAs
• Key issue: how much parallelism at the CDSC-GR level?
• Approach 1: fine-grain
• Small step granularity at the CDSC-GR level
• Group step instances into “macro-steps” and scan this set in the step code; regions are useful here, as they define a multi-dimensional space
• Issue: atomicity of a “macro-step” (analogous to tiling requirements)
• Approach 2: medium-grain
• Ensure enough workload inside a step (i.e., a step is “a tile”)
• Issue: less flexibility for the runtime and for CDSC-GR static optimizations
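Approach 1 can be pictured as tiling the space of step tags. In this Python sketch, fine-grain (i, j) tags are grouped into macro-steps by integer division with a tile size; the function name and the square-tile choice are assumptions, and the atomicity issue raised above is not addressed.

```python
from collections import defaultdict


def group_into_macro_steps(tags, tile):
    """Map each macro-step tag to the fine-grain tags it scans, in input order."""
    macro = defaultdict(list)
    for i, j in tags:
        macro[(i // tile, j // tile)].append((i, j))
    return dict(macro)
```

Each key of the returned dictionary plays the role of a macro-step control tag, and its value is the set of points the macro-step code would iterate over.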
23
Sum-up: Motivation for CDSC-GR
• Provide a separate, well-defined input language to the programmer
Easy to read/write for domain experts, “DSL” for dataflow
• Be an intermediate representation for parallel programs
Translation to/from CDSC-GR, analysis & optimizations, mapping
• Perform graph-level optimizations & checking
Race detection, static & static+dynamic scheduling, data locality analysis, …
• CnC-HC is not enough; extensions are desired, such as:
Support for non-graph data, step-to-step synchronization
Support for regions (sets of integer points) for collections
• CDSC-GR can be mapped on heterogeneous hardware
Distributed-CnC, CDSC-GR-to-HC, FPGA synthesis
24
Open Topics
• Dynamic single assignment data vs non-DSA data
Data-race detection on non-DSA data is needed to ensure correctness
Determinism: a step waits for some vs. all of its inputs (some inputs may not be defined as part of item collections)
DSA with data folding annotations: memory space reuse
• Other discussion topics
Co-optimization of step code and program graph, e.g., tiled steps
Memory management: folding, get-counts, both?
• Questions?