Processor Architectures and Program Mapping
5KK70 TU/e
Henk Corporaal
Jef van Meerbergen
Bart Mesman
Data Memory Management
Part b: Loop transformations & Data Reuse
Thanks to the IMEC DTSE experts:
Erik Brockmeyer
IMEC, Leuven, Belgium
and also
Martin Palkovic, Sven Verdoolaege, Tanja van Achteren, Sven Wuytack, Arnout Vandecappelle, Miguel Miranda,
Cedric Ghez, Tycho van Meeuwen, Eddy Degreef, Michel Eyckmans, Francky Catthoor, e.a.
@HC 5KK70 Platform-based Design 3
DM methodology
Dataflow Transformations
Analysis/Preprocessing
Loop/control-flow transformations
Data Reuse
Storage Cycle Budget Distribution
Memory Allocation and Assignment
Memory Layout organisation
C-out
C-in
Address optimization
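Locality of Reference
Separate loops: each A[i] is consumed long after it is produced.
for (i=0; i < 8; i++) A[i] = …;
for (i=0; i < 8; i++) B[7-i] = f(A[i]);
[Figure: location vs. time, showing production and consumption of A]
Merged loop: each A[i] is consumed right after it is produced.
for (i=0; i < 8; i++) { A[i] = …; B[7-i] = f(A[i]); }
[Figure: location vs. time, showing production and consumption of A]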
Regularity
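for (i=0; i < 8; i++) A[i] = …;
for (i=0; i < 8; i++) B[i] = f(A[7-i]);
[Figure: location vs. time; A is consumed in the reverse of its production order]
for (i=0; i < 8; i++) A[i] = …;
for (i=0; i < 8; i++) B[7-i] = f(A[i]);
[Figure: location vs. time; A is consumed in its production order]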
Enabling Reuse
for (i=0; i < 8; i++) B[i] = f1(A[i]);
for (i=0; i < 8; i++) C[i] = f2(A[i]);
[Figure: location vs. time; the two consumptions of each A[i] are far apart]
for (i=0; i < 8; i++) { B[i] = f1(A[i]); C[i] = f2(A[i]); }
[Figure: location vs. time; the two consumptions of each A[i] are adjacent]
How to do these loop transformations automatically?
Requires cost function
Requires technique
Let's introduce some terminology
- iteration spaces
- polytopes
- ordering vector / execution order
Iteration space and polytopes
// assume A[][] exists
for (i=1; i<6; i++) {
  for (j=2; j<6; j++) {
    B[i][j] = g( A[i-1][j-2]);
  }
}
[Figure: (i, j) grid with axes 0..5, marking the iteration space, consumption space, production space and dependency vector]
Example with 3 polytopes
Algorithm having 3 loops:
A: for (i=1; i<=N; ++i)
     for (j=1; j<=N-i+1; ++j)
       a[i][j] = in[i][j] + a[i-1][j];
B: for (p=1; p<=N; ++p)
     b[p][1] = f( a[N-p+1][p], a[N-p][p] );
C: for (k=1; k<=N; ++k)
     for (l=1; l<=k; ++l)
       b[k][l+1] = g (b[k][l]);
[Figure: polytopes A (axes i, j), B (axis p) and C (axes k, l)]
Common iteration space
Initial solution having a common iteration space:
for (i=1; i<=(2*N+1); ++i)
  for (j=1; j<=2*N; ++j) {
    if (i>=1 && i<=N && j>=1 && j<=N-i+1)
      a[i][j] = in[i][j] + a[i-1][j];
    if (i==N+1 && j>=1 && j<=N)
      b[j][1] = f( a[N-j+1][j], a[N-j][j] );
    if (i>=N+2 && i<=2*N+1 && j>=N+1 && j<=i-1)  /* j<=N+k with k=i-N-1 */
      b[i-N-1][j-N+1] = g (b[i-N-1][j-N]);
  }
[Figure: common iteration space, i = 1..2*N+1 and j = 1..2*N, with the ordering vector]
- Bad locality
- Bad regularity
- Requires 2N memory locations
- Many dummy iterations
Cost function needed for automation
Regularity
- Equal direction for dependency vectors
- Dependency vectors should not cross each other
- Good for storage size
Temporal locality
- Equal length of all dependency vectors
- Good for storage size
- Good for data reuse
Bad regularity limits the ordering freedom
[Figure: common iteration space, i = 1..2*N+1 and j = 1..2*N]
Ordering freedom = 90 degrees
Locality estimates
[Figure: examples of one production (P) followed by several consumptions (C), connected by dependency vectors of length di]
P = production, C = consumption
Dependency vector length is a measure for locality.
Q: Which length is the best estimate?
- Sum{di}
- Max{di}
- Spanning tree
Three step approach for loop transformation tool
1. Affine loop transformations
   1. Only geometric information is available during placement
   2. Rotation, skewing, interchange, reverse
2. Polytope placement
   1. Only geometric information is available during placement
   2. Translation
3. Choose ordering vector
[Figure: a polytope with axes i, j is mapped into the common space with axes x, y, using translation T and ordering vector u, giving one combined transformation]
Three step approach for loop transformation tool
A: (i: 1..N):: (j: 1 .. N-i+1):: a[i][j] = in[i][j] + a[i-1][j];
C: (k: 1..N):: (l: 1..k):: b[N-k+1][l+1] = g( b[N-k+1][l] );
B: (p: 1..N):: b[p][1] = f( a[N-p+1][p], a[N-p][p] );
[Figure: polytopes A (axes i, j), B (axis p) and C (axes k, l)]
- Affine loop transformations
- Polytope placement
- Choose ordering vector
Three step approach for loop transformation tool
A: (i: 1..N):: (j: 1 .. N-i+1):: a[i][j] = in[i][j] + a[i-1][j];
C: (k: 1..N):: (l: 1..k):: b[N-k+1][l+1] = g( b[N-k+1][l] );
B: (p: 1..N):: b[p][1] = f( a[N-p+1][p], a[N-p][p] );
[Figure: polytopes A (axes i, j), B (axis p) and C (axes k, l), drawn twice]
- Affine loop transformations
- Polytope placement
- Choose ordering vector
Three step approach for loop transformation tool
A: (i: 1..N):: (j: 1 .. N-i+1):: a[i][j] = in[i][j] + a[i-1][j];
C: (k: 1..N):: (l: 1..k):: b[N-k+1][l+1] = g( b[N-k+1][l] );
B: (p: 1..N):: b[p][1] = f( a[N-p+1][p], a[N-p][p] );
[Figure: polytopes A (axes i, j), B (axis p) and C (axes k, l), separate and placed together]
- Affine loop transformations
- Polytope placement = merging loops
- Choose ordering vector
Choose optimal ordering vector
[Figure: two candidate orderings, Ordering Vector 1 and Ordering Vector 2]
From the Polyhedral model back to C
for (j=1; j<=N; ++j) {
  for (i=1; i<=N-j+1; ++i)
    a[i][j] = in[i][j] + a[i-1][j];
  b[j][1] = f( a[N-j+1][j], a[N-j][j] );
  for (l=1; l<=j; ++l)
    b[j][l+1] = g( b[j][l] );
}
[Figure: merged polytope with axes i, l and j]
- Affine loop transformations
- Polytope placement
- Choose ordering vector
Optimized solution having a common iteration space:
- Optimal locality
- Optimal regularity
- Requires 2 memory locations
Loop trafo - cavity detection
[Figure: scanner feeding GaussBlur x and GaussBlur y over N x M images, scanned along X and Y]
X-Y loop interchange: buffer size goes from N x M to N x (2GB+1)
Loop trafo-cavity (1)
1. Transform: interchange
2. Translate: merge
3. Order
Loop trafo-cavity (2)
1. Transform: interchange
2. Translate: merge
3. Order
[Figure: x-blur filter]
Loop trafo-cavity (3)
Comparing different translations
[Figure: step 2 with two alternative placements, Translate 1 and Translate 2, followed by step 3]
Loop trafo-cavity (4)
3. Order
Combining (merging) multiple polytopes
[Figure: two polytopes added together into one merged polytope]
Result on gauss filter
for (y=0; y<M+GB; ++y) {
  for (x=0; x<N+GB; ++x) {
    /* horizontal blur of line y */
    if (x>=GB && x<=N-1-GB && y>=GB && y<=M-1-GB) {
      gauss_x_compute = 0;
      for (k=-GB; k<=GB; ++k)
        gauss_x_compute += image_in[x+k][y]*Gauss[abs(k)];
      gauss_x_image[x][y] = gauss_x_compute/tot;
    } else if (x<N && y<M)
      gauss_x_image[x][y] = 0;
    /* vertical blur of line y-GB, which now has all its inputs */
    if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) {
      gauss_xy_compute = 0;
      for (k=-GB; k<=GB; ++k)
        gauss_xy_compute += gauss_x_image[x][y-GB+k]*Gauss[abs(k)];
      gauss_xy_image[x][y-GB] = gauss_xy_compute/tot;
    } else if (x<N && (y-GB)>=0 && (y-GB)<M)
      gauss_xy_image[x][y-GB] = 0;
  }
}
Intermezzo
Before we continue with data reuse, have a look at other loop transformations
Memory hierarchy and Data reuse
[Figure: memory hierarchy with Layer 1, Layer 2, Layer 3 and the datapaths]
1. Determine reuse candidates
2. Combine reuse candidates into reuse chains
3. If there are multiple access statements per array, combine them into reuse trees
4. Determine number of layers (if architecture is not fixed)
5. Select candidates and assign to memory layers
6. Add extra transfers between the different memory layers (for scratchpad RAM; not for caches)
TI C55@200MHz example platform
TMS320vc5510@200MHz, Vdd = 1.5 V, P = unknown
[Figure: memory organisation of the platform, with layers L0, L1 and L2]
- Register file + core: size 2x16 registers, bandwidth 4.8 Gwords/s (processor partition)
- 8x 4Kx16 dual-port blocks: total 64Kb, 2 elem in 1 cycle
- 32x 4Kx16 single-port blocks: total 256Kb, 1 elem in 1 cycle
- 16Kx16 ROM (data/program/DMA): first 3 cycles, next 2 cycles; it seems this can be in parallel with the 256Kb memory
- Off-chip, max 8MBx16: SRAM/EPROM/SDRAM/SBSRAM
- Partitions shown: ROM partition, variable size RAM partition, fixed size RAM partition, processor partition
- Sizes shown: 32kB, 320kB, 16 MB
- Bandwidths shown: 100M words/s, 400M words/s (dual port), 50M words/s (single port)
Exploiting Memory Hierarchy for reduced Power: principle
[Figure: the processor data paths and register file access array A in main memory M directly; P = 1 per access, #A = 100%]
P total (before) = 100%
Exploiting Memory Hierarchy for reduced Power: principle
[Figure: smaller copies are placed between A in main memory M and the register file. One-copy variant: A' with P = 0.3, access percentages 100% and 5%. Two-copy variant: A'' with P = 0.01 (100% of the accesses), A' with P = 0.1 (10%), A with P = 1 (1%)]
P total (before) = 100%
P total (after) = 100% x 0.01 + 10% x 0.1 + 1% x 1 = 3%
Data reuse decision and memory hierarchy: principle
[Figure: arrays A and B in main memory M, copies A' and A'', processor data paths and register file, with customized connections]
Customized connections in the memory subsystem to bypass the memory hierarchy and avoid the overhead.
Step 1: identify arrays with data reuse potential
for (i=0; i<4; i++)
  for (j=0; j<3; j++)
    for (k=0; k<6; k++)
      … = A[i*4+k];
[Figure: array index vs. time; copy1 to copy4 in time frames 1 to 4, showing intra-copy reuse and inter-copy reuse]
Importance of high level cost estimate
for (i=0; i<4; i++)
  for (j=0; j<3; j++)
    for (k=0; k<6; k++)
      … = A[i*4+k];
[Figure: array index vs. time; copy1 to copy4 in time frames 1 to 4, each copy spanning 6 elements (Mk)]
Array copies are stored in-place!
Step 1: determine gains
Intra-copy reuse factor
for (i=0; i<4; i++)
  for (j=0; j<3; j++)
    for (k=0; k<6; k++)
      … = A[i*4+k];
[Figure: array index vs. time; copy1 to copy4 in time frames 1 to 4, each copy spanning 6 elements (Mk) and read 3 times]
The j iterator is not present in the index expression, so there is intra-copy reuse.
Intra-copy reuse factor = 3
Step 1: determine gains
Inter-copy reuse factor
for (i=0; i<n; i++)
  for (j=0; j<3; j++)
    for (k=0; k<6; k++)
      … = A[i*4+k];
[Figure: array index vs. time; copy1 to copy4 in time frames 1 to 4, each copy spanning 6 elements (Mk)]
The i iterator has a smaller weight than the k range, so there is inter-copy reuse.
Inter-copy reuse factor = 1/(1-1/3) = 3/2
Possibility for multi-level hierarchy
for (i=0; i<10; i++)
  for (j=0; j<2; j++)
    for (k=0; k<3; k++)
      for (l=0; l<3; l++)
        for (m=0; m<5; m++)
          … = A[i*15+k*5+m];
[Figure: array index vs. time at two levels: copies of 15 elements (Mk) per time frame 1, 2, ..., and copies of 5 elements (Mm) per sub-frame tf 1.1 to tf 1.6, tf 2.1, ...]
Step 2: determine data reuse chains for each memory access
[Figure: candidate reuse chains for read R1(A): A only; A -> A' (two alternatives); A -> A' -> A'']
- Many reuse possibilities
- Cost estimate needed
- Prune for promising ones
Cost function needs both size and number of accesses to intermediate array
for (i=0; i<10; i++)
  for (j=0; j<2; j++)
    for (k=0; k<3; k++)
      for (l=0; l<3; l++)
        for (m=0; m<5; m++)
          … = A[i*15+k*5+m];
Estimate #misses from different levels for one iteration of i:
- R1(A): 2*3*3*5 = 90
- A' of 15 elements (Gk): 3*5 = 15
- A' of 5 elements (Gm): 2*3*5 = 30
Estimate size
[Figure: #misses (0 to 100) vs. #elements (0 to 20)]
Very simplistic power and area estimation for different data-reuse versions
[Figure: the four reuse chains for R1(A) (A only; A -> A', two alternatives; A -> A' -> A''), each level annotated with accesses, size and energy, plotted as Energy Estimate (0 to 150) against Area Estimate (140 to 180). Totals shown: energy 135, 51, 38, 35 and area 150, 155, 165, 170]
Step 3: determine data reuse trees for multiple accesses
R1(A):
for (i=0; i<10; i++)
  for (j=0; j<2; j++)
    for (k=0; k<3; k++)
      for (l=0; l<3; l++)
        for (m=0; m<5; m++)
          … = A[i*15+k*5+m];
[Figure: reuse chain A -> A' -> A'' -> R1(A)]
R2(A):
for (x=0; x<8; x++)
  for (y=0; y<5; y++)
    … = A[i*5+y];
[Figure: reuse chain A -> A' -> R2(A)]
Step 3: determine data reuse trees for multiple accesses
[Figure: the chains A -> A' -> A'' -> R1(A) and A -> A' -> R2(A) are combined into one reuse tree rooted at A]
Assign all data reuse trees (multiple arrays) to memory hierarchy
[Figure: the reuse tree of A (A' -> A'' -> R1(A) and A' -> R2(A)) and the reuse chain of B (B -> B' -> B'' -> B''' -> R1(B)) are mapped onto Layer 1, Layer 2 and Layer 3; the assigned version of the B chain contains B, B' and B''']
Step 4: Determine number of layers
[Figure: data reuse trees A (A, A', FG) and B (B, B', FG) next to the hierarchy layers Layer1, Layer2, Layer3 and the foreground memory (FG) at the datapath]
Step 5: Select and assign reuse candidates
[Figure: data reuse tree (A, A', FG) and the hierarchy layers, with the possible hierarchy assignments numbered 1 to 5, each placing A and optionally A' on layers above the foreground memory (FG)]
Step 5: All freedom in array to memory hierarchy
[Figure: data reuse trees A (A, A', FG) and B (B, B', FG) with all their possible assignments to the hierarchy layers]
Step 5: Prune reuse graph (platform independent)
[Figure: hierarchy layers with full freedom vs. hierarchy layers after pruning]
Many solutions never make sense
Step 5: Prune reuse graph further (platform dependent)
[Figure: pruned hierarchy layers vs. the final solution on a 4 layer platform, containing A, A', B, B' and FG]
Introducing 1D reuse buffer
Original (coefficient array renamed to coef, since c is also the column index):
int in[H][W+8], out[H][W];
const int coef[] = {1,0,1,2,2,1,0,1};
for (r=0; r < H; r++)
  for (c=0; c < W; c++)
    for (dc=0; dc < 8; dc++)
      out[r][c] += in[r][c+dc]*coef[dc];
With reuse buffer (Reuse Factor = 7):
int in[H][W+8], out[H][W], buf[8];          /* intermediate level decl. */
const int coef[] = {1,0,1,2,2,1,0,1};
for (r=0; r < H; r++) {
  for (i=0; i<7; i++)
    buf[i] = in[r][i];                       /* initial copy */
  for (c=0; c < W; c++) {
    buf[(c+7)%8] = in[r][c+7];               /* additional copy */
    for (dc=0; dc < 8; dc++)
      out[r][c] += buf[(c+dc)%8]*coef[dc];   /* reread from buffer */
  }
}
Data Reuse on 1D horizontal convolution
How to make explicit copies?
[Figure: image N x M, traversed in row order, with the regions marked init buffer, reuse data and new data]
Introducing line buffers for vertical filtering
[Figure: the whole image, size [N][M], next to a set of [2GB+1] lines of length [N]]
Why keep the whole image in that case?
Simplified “reuse script”
1. Identify arrays with sufficient reuse potential
2. Determine reuse chains and prune these
(for every array read)
3. Determine reuse trees and prune these
(for every array)
4. Determine reuse graph including bypasses and prune (for entire application)
5. Determine memory hierarchy layout assignment incorporating given background memory restrictions (layers) and real-time constraints
6. Introduce copies in code: init, update, use code
   - For scratchpad memories only
   - For caches we need a different approach
Data re-use trees: cavity detector
[Figure: data reuse trees for the array reads of image_in, gauss_x and gauss_xy/comp_edge and for the array write of image_out, going from N*M arrays through N*3 and N*1 line buffers to 3*3, 3*1, 1*3 and 1*1 copies at the CPU, annotated with access counts such as N*M, N*M*3 and N*M*8]
Memory hierarchy assignment: cavity detector
[Figure: the copies of image_in, gauss_x, gauss_xy, comp_edge and image_out (N*M arrays, N*3 line buffers, 3*3, 3*1 and 1*1 copies) assigned to the layers L1, L2 and L3 of a platform with 1MB SDRAM, 16KB cache and a 128 B register file, annotated with access counts such as N*M, N*M*3 and N*M*8]