Exploiting Both Pipelining and Data Parallelism with SIMD RA
Yongjoo Kim*, Jongeun Lee**, Jinyong Lee*, Toan Mai**, Ingoo Heo* and Yunheung Paek*
*Seoul National University**UNIST (Ulsan National Institute of Science & Technology)
ARC March 21, 2012Hong Kong
Reconfigurable Architecture
Reconfigurable architecture:
- High performance
- Flexible (cf. ASIC)
- Energy efficient (cf. GPU)
Source: ChipDesignMag.com
Coarse-Grained Reconfigurable Architecture
Coarse-grained RA:
- Word-level granularity
- Dynamic reconfigurability
- Simpler to compile
[Figure: execution model — a main processor coupled with a CGRA, main memory, and a DMA controller; example CGRAs: MorphoSys, ADRES]
Application Mapping
Place and route the DFG onto the PE array mapping space. The mapping must satisfy several constraints:
- Nodes must be mapped onto PEs that have the right functionality
- Data transfer between nodes must be guaranteed
- Resource consumption should be minimized for performance
[Figure: compilation flow — the application passes through a front-end to an IR; a partitioner splits it into sequential code, handled by conventional C compilation, and loops, which go through DFG generation and place & route (mapping for CGRA, driven by the <DFG> and <CGRA> architecture parameters); an extended assembler combines the assembly and configuration into executable + configuration]
Modulo scheduling-based mapping
Software Pipelining
[Figure: a 5-node DFG (nodes 0-4 operating on A[i], B[i], C[i]) software-pipelined onto a 2x2 PE array (PE0-PE3); successive iterations start every 2 cycles, i.e. II = 2 cycles. II: initiation interval]
Problem: Scalability
Large-scale CGRAs suffer from several problems:
- Lack of parallelism: general applications have limited ILP, and unrolling inflates the configuration size
- Placement and routing must search a very large mapping space, so compilation time skyrockets
As a result, CGRAs remain at 4x4, or 8x8 at the most.
Overview
Background
SIMD Reconfigurable Architecture (SIMD RA)
Mapping on SIMD RA
Evaluation
SIMD Reconfigurable Architecture
- Consists of multiple identical parts, called cores
- Cores are identical so that configurations can be reused
- At least one load-store PE in each core
[Figure: cores 1-4 connected through a crossbar switch to memory banks 1-4]
Advantages of SIMD RA
- More iterations are executed in parallel, scaling with the PE array size
- Short compilation time, thanks to the small per-core mapping space
- Achieves a denser scheduled configuration, hence higher utilization and performance
Constraint: the loop must not have loop-carried dependences.
[Figure: timeline comparison — one large core runs iterations 0-5 one after another, while four small cores (cores 1-4) run iterations 0-11 four at a time, completing more iterations in the same time]
Overview
Background
SIMD Reconfigurable Architecture (SIMD RA)
Bank Conflict Minimization in SIMD RA
Evaluation
Problems of SIMD RA mapping
- A new mapping problem: iteration-to-core mapping
- The iteration mapping affects performance: combined with the data mapping, it determines the number of bank conflicts
for (i = 0; i < 15; i++) { B[i] = A[i] + B[i]; }
Mapping schemes
For the 15-iteration loop above, mapped onto cores 1-4:
- Iteration-to-core mapping
  - Sequential: core 1 gets iter. 0-3, core 2 iter. 4-7, core 3 iter. 8-11, core 4 iter. 12-14
  - Interleaving: core 1 gets iter. 0, 4, 8, 12; core 2 iter. 1, 5, 9, 13; core 3 iter. 2, 6, 10, 14; core 4 iter. 3, 7, 11
- Data mapping (across the four crossbar-connected banks)
  - Sequential: elements are laid out contiguously, e.g. bank 1 begins A[0], A[1], A[2], A[3], A[4], A[5], … and the banks together hold A[0]-A[14] followed by B[0]-B[14] in order
  - Interleaving: elements are distributed round-robin, e.g. bank 1 holds A[0], A[4], A[8], A[12], B[1], B[5], B[9], B[13]; bank 2 holds A[1], A[5], A[9], A[13], B[2], B[6], B[10], B[14]; bank 3 holds A[2], A[6], A[10], A[14], B[3], B[7], B[11]; bank 4 holds A[3], A[7], A[11], B[0], B[4], B[8], B[12]
for (i = 0; i < 15; i++) { B[i] = A[i] + B[i]; }
Interleaving data placement
- With interleaving data placement, interleaved iteration assignment is better than sequential iteration assignment.
- Weak with strided accesses: a strided load (e.g. A[2i] instead of A[i]) reduces the number of utilized banks and increases bank conflicts.
[Figure: interleaved iteration assignment over the interleaved bank layout; a configuration that loads A[i] spreads its accesses over all four banks, while one that loads A[2i] hits only half of them]
Sequential data placement
- By itself, it cannot work well with SIMD mapping: it causes frequent bank conflicts.
- Data tiling fixes this by (i) modifying array base addresses and (ii) rearranging data in the local memory.
- Sequential iteration assignment with data tiling suits SIMD mapping.
[Figure: after data tiling, bank 1 holds A[0]-A[3] and B[0]-B[3], bank 2 holds A[4]-A[7] and B[4]-B[7], bank 3 holds A[8]-A[11] and B[8]-B[11], and bank 4 holds A[12]-A[14] and B[12]-B[14], so with sequential iteration assignment each core's loads stay within its own bank]
Summary of Mapping Combinations Analysis
Two of the four combinations have strong advantages:
- Interleaved iteration, interleaved data mapping: simple data management, but weak for strided accesses
- Sequential iteration, sequential data mapping (with data tiling): more robust against bank conflicts, but incurs data rearranging overhead
Experimental Setup
- Sets of loop kernels from OpenCV, multimedia, and SPEC2000 benchmarks
- Target system:
  - Two CGRA sizes: 4x4 and 8x4
  - 2x2 cores, each with one load-store PE and one multiplier PE
  - Mesh + diagonal connections between PEs
  - Full crossbar switch between PEs and local memory banks
Configuration Size
- Compared with non-SIMD mapping: "Original" is the previous non-SIMD mapping; "SIMD" is our approach (interleaving-interleaving mapping)
- Configuration size is reduced by 61% on the 4x4 CGRA and by 79% on the 8x4 CGRA
Runtime
[Chart: runtime normalized to the 4x4 Original, split into stall and non-stall time, comparing Original (with unrolling factors up to U8) against SIMD on 4x4 and 8x4 CGRAs for Swim1, Swim2, Swim3, Laplace, Wavelet, CalcHarris, CvtColor, DotProduct, Gaussian, Erode, and the average; the average bars are annotated 29% (4x4) and 32% (8x4)]
Conclusion
- Presented the SIMD reconfigurable architecture, which exploits data parallelism and instruction-level parallelism at the same time
- Advantages:
  - Scales well to a large number of PEs
  - Alleviates the growth in compilation time
  - Increases performance and reduces configuration size
Thank you!
Core size
- For a large loop, a small core might not be a good match
- Merge multiple cores ⇒ macrocore
- No hardware modification required
[Figure: cores 1-4 over the crossbar switch and banks 1-4; cores are paired into Macrocore 1 and Macrocore 2]
SIMD RA mapping flow
[Flow: check the SIMD requirement (on failure, fall back to traditional mapping) → select core size → iteration mapping with implicit array placement (interleaved-interleaved or sequential with data tiling) → operation mapping via modulo scheduling → data tiling; if scheduling fails, increase II and repeat; if scheduling fails and MaxII < II, increase the core size]