Upload
doanthuy
View
225
Download
0
Embed Size (px)
Citation preview
Dependence-Based Automatic Parallelization using CnC
Bo Zhao, Ali Janessari
Technische Universitat Darmstadt
[email protected], [email protected]
September 8, 2015
Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 1 / 19
Overview
1 IntroductionMotivationObjectives
2 ApproachOverview FrameworkProgram AnalysisTask parallelism ExtractionCode Generation
3 Conclusion
Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 2 / 19
Introduction Motivation
Motivation
Multicore and architecture has become popular as a result of thestagnating single core performance
Many software products are implemented sequentially
fail to tap potential of the parallel hardware
Problem : the gap between parallel hardware and sequential software
take advantage of new hardware featurespreserve the current software investmentsave human resource
Solution: automatically (semi-automatically) transform sequentialcode into parallel code
Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 3 / 19
Introduction Objectives
Objectives
Discover potential parallelism
Loop parallelismIrregular task parallelism
Detect data and control dependencies
Generate parallel code using Concurrent Collections
Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 4 / 19
Approach Overview Framework
Overview Workflow
Static Code Analysis
Dynamic Code Analysis
Code instrumentation
Front End
Task Graph Generator
Seq IR
Parallel Execution
Unit T
estin
g
Sequential
Source
Code
LLVM JIT Compiler
DiscoPoP
Correctness Feedback
CU
Graph
Co
mp
ile T
ime
Ru
ntim
e
Phase1: Program AnalysisPhase2: Coarse-Grained Task
Extraction & IR2IR TransPhase3: Code Generation
CnC-Par IRIR-to-IR transformation
Ctrl Info
Task
Graph
Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 5 / 19
Approach Program Analysis
DiscoPoP (Discovery of Potential Parallelism)
Phase 1:Static and dynamic analysesInstruments the target program and identifies control and datadependencies
Phase 2 & 3:Post-mortem analysis for parallelism discoveryBuilds Computational Units (CUs) for the target programRanking
Phase 3Phase 2Phase 1
SourceCode
Conv
ersio
n to
IR
Memory Access& Control-flow
Instrumentation
Static Control-flowAnalysis
Data Dependency Analysis
CUGraph
ControlRegion
InformationPa
ralle
lism
Disc
over
y &
Para
llel P
atte
rnDe
tect
ion
RankedParallel
Opportunities
exec
utio
n
static dynamic
Rank
ing
Dynamic Control-flow Analysis
Variable Lifetime Analysis
Runtime Dependency MergingComputational
Unit Analysis
Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 6 / 19
Approach Program Analysis
Dependence Profiling
Control dependence<FileID:LineID <Contr.ID> <Label> <Exec.Time>
1:60 BGN loop void1:74 END loop 1200
Data dependence<FileID:LineID> <Contr.ID> <Label> <Dep.> <FileID:LineID|VarName>
1:63 NOM void RAW 1:59|temp11:70 NOM void WAR 1:67|temp2
Data dependence (multi-threaded)<FileID:LineID|ThreadID> <Contr.ID> <Label> <Dep.> <FileID:LineID|VarName|ThreadID>
4:59|2 NOM void WAR 4:71|2|z real
Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 7 / 19
Approach Task parallelism Extraction
Computation Unit (CU)
A collection of instructions (LLVM-IR instruction)Follows the read-compute-write pattern
A program state is first read from memory, the new state is computed,and finally written back
A small piece of code containing no parallelism or only ILPBuilding blocks for forming parallel tasksCU graph
Dependences are mapped to CUsExposes tightly-connected CUs
1 x = 32 y = 43 a = x + rand() / x4 b = x - rand() / x5 x = a + b6 a = y + rand() / y7 b = y - rand() / y8 y = a + b
x = 3
a = x + rand() / x
b = x - rand() / x
CUx
INIT
x = a + b
y = 4
a = y + rand() / y
b = y - rand() / y
y = a + b
CUy
Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 8 / 19
Approach Task parallelism Extraction
CU Graph
CU-19 [146,152,153,154]
*in = args->in_img;for(all_pixels){
R = *in++;G = *in++;B = *in++;
}
CU-32 [152,153,154,156]
for(all_pixels){R = *in++;G = *in++;B = *in++;
Y= round(c1*R+c2*G+c3*B);}
CU-33 [152,153,154,157]
for(all_pixels){R = *in++;G = *in++;B = *in++;
U = round(c4*R+c5*G+c6*B);}
CU-34 [152,153,154,158]
for(all_pixels){R = *in++;G = *in++;B = *in++;
V = round(c7*R+c8*G+c9*B);}
CU-21 [147,156,160]
*pY = args->pY;for(all_pixels){
Y = round(c7*R+c8*G+c9*B);*pY++ = Y;
}
CU-24 [148,157,161]
*pU = args->pU;for(all_pixels){
U = round(c7*R+c8*G+c9*B);*pU++ = U;
}
CU-27 [149,158,162]
*pV = args->pV;for(all_pixels){
V = round(c7*R+c8*G+c9*B);*pV++ = V;
}
Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 9 / 19
Approach Task parallelism Extraction
Program Execution tree
A call tree combined withloop information and basicblocks
CU graph is mapped on tothe execution tree
Program 1 - 377
Basic Block11 - 19
Loop21 - 36
Basic Block22 - 27
Tree node
CU
DataDependency
Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 10 / 19
Approach Task parallelism Extraction
Task Extraction
Merge CUs contained in strongly connected components (SCCs) or inchains
A
B
C
D
E
F
G
H
I
A
B
C
D
E I
FGH
A
B
I
FGHCDE
1 2
SCCSCCchain
SCCFGH and chainCDE are two tasks
Hide complex dependences inside SSCs, exposing parallelizationopportunities outside
Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 11 / 19
Approach Task parallelism Extraction
Task Extraction
Two CUs can share common instructions
53
54
57
56
59 58
55
2
1
2
7
3
6
4
7
1 5
4
1
2
3
No. of Common Instructions
No. of Dependences
(a) CU graph with CUsas vertices and RAWdependences and commoninstructions as edges
53
54
57
56
59 58
55
0.20
0.28
0.17
0.35
0.10
0.18
0.43
0.20
0.05
Affinity
(b) CU graph withaffinities between the CUs
53
54
57
56
59 58
55
0.20
0.28
0.17
0.35
0.10
0.18
0.43
0.20
0.05
Min Cut
(c) CU graph with aminimum cut
53
54
57
56
59 58
55
0.20
0.28 0.35
0.18
0.43
0.20
0.05
(d) CU graphpartitioned to identifytasks
Figure : Demonstration of a CU graph and graph partitioning to form tasks.
Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 12 / 19
Approach Task parallelism Extraction
Task Graph
Task Extraction
Not limited to predefined language constructsCovers independent tasks and dependent tasks (coarse-grained tasks)
function: 365 - 381Parallelizable: true loop: 372 - 380
Parallelizable: falseINIT370
CU374 - 379
INIT666
if-else: 667 - 678Parallelizable: false
loop: 682 - 709Parallelizable: true
if-else: 719Parallelizable: false
RAWCU
ControlRegion
Blue
Yellow
Grey
Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 13 / 19
Approach Code Generation
Code Generation
On going work
Map the task graph to CnC graph
CnC defines two scheduling constraints in parallel execution
producer/consumer relationshipscontroller/controllee relationships
A task (coarse-grained CU) is similar to a step collection
Data dependency among tasks are known form the task graph
Detected control information is not sufficient
Users specify the controller/controllee relationships
Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 14 / 19
Approach Code Generation
Code Generation
Propose CnC-specific IR template
Transform the original IR to Cnc specific IR using task graph andusers’ control information
Generate binary code form CnC speceific IR
Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 15 / 19
Approach Code Generation
Code Generation
previous code transformation results
Source-to-source transformation using Intel TBB flow graph(semi-automatic)FaceDetection (CnC sample application)
(a) Logic of FaceDetection (b) Flow graph
Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 16 / 19
Approach Code Generation
Code Generation
Speedups on 2x8-core Intel Xeon E5-2650 2 GHz
1 2 4 8 16 320
5
10
15
20
thread
spee
dup
Official Manual CnC ParallelizationSemi-automatic TBB Parallelization
Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 17 / 19
Conclusion
Conclusion
Profile data and control dependencies
DiscoPoPUsers’ specification
Extract coarse-grained task parallelism
CU graphProgram execution treeTask graph
Generate parallel code using CnC
Define CnC-specific IRCode transformation at IR levelEmploy CnC runtime library
Bo Zhao, Ali Janessari (TU Darmstadt) CnCworkshop 2015 September 8, 2015 18 / 19