Upload
dewei
View
48
Download
2
Tags:
Embed Size (px)
DESCRIPTION
Embedded Computer Architecture. Data Memory Management Overview. 5KK73 TU/e Henk Corporaal Bart Mesman. Data Memory Management Overview. Motivation Example application DMM steps Results Notes: We concentrate on Static Data structures like arrays - PowerPoint PPT Presentation
Citation preview
Embedded Computer Architecture
5KK73 TU/e
Henk Corporaal
Bart Mesman
Data Memory Management
Overview
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 2
Data Memory Management Overview
• Motivation• Example application• DMM steps• Results
Notes: - We concentrate on Static Data structures like arrays- The Data Transfer and Storage Exploration (DTSE)
methodology, on which these slides are based, has been developed at IMEC, Leuven
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 3
You want the big picture?
• Take a well written C-program, which is dominated by:– "well structured" loopnests, and by– accesses to "static" big (multi-dimensional) arrays
• Transform your code in order to – reduced number of external memory accesses by a
factor 10– drastic reduction of energy consumption– significant improvement of speed
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 4
VLIWcpu
I$
video-in
video-out
audio-in
audio-out
PCI bridge
Serial I/O
timers I2C I/O
SDRAM
D$
for (i=0;i<n;i++) for (j=0; j<3; j++) for (k=1; k<7; k++) B[j] = A[i*4+k];
SDRAM
D$
Data storage bottleneck
B[j] = A[i*4+k];
The underlying idea
B[j] = A[i*4+k];
Data transfer bottleneck
B[j] = A[i*4+k];
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 5
Platform example: TriMedia
5 out of 27 processor
FUs
128*32b16-portRegFile
Hardware accelerators
TriMedia (5-issue VLIW processor)
256M1-portSDRAM
16K2-portSRAM
256M1-portSDRAM
SWcache8KB
TriMedia cacheuse
HWcache
8/16KB
CPUCache bypassSW controlledHW controlled
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 6
Generic platform model
CPUs
HWaccel
Level-1 Level-2 Level-3 Level-4
ICache
LocalMemory
Disk
MainMemory
bus-if
on-c
hip
buss
es
LocalMemory
L2Cache
LocalMemory
bridge SCSI
DCache Disk
Disk
bus
bus
SCSI
bus
Chip
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 7
Data transfer and storage power
01
234
56
Relative energyper operation
ADD R1, R2
LW R1,[R2]
Cache miss(per word)
05
10152025303540
Powerdistribution in
CPUs
Storage
Interconnect
Clocks
Other
05
101520253035
Relative energy per operation
16b carry-select
16b multiplier
8x128x16 SRAM (R)
8x128x16 SRAM (W)
External IO access
16b Memory transfer
Power(memory)
Power(arith
metic)= 33
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 8
What about delay of memories?
Global wiring delay becomes dominant over gate delay
Gate delay vs. wire delay
0
50
100
150
200
250
300
350
400
0.5 0.35 0.25 0.18 0.13 0.1
technology (micron)
ps
wire delay (ps/mm)
gate delay (ps)
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 9
ApplicationsApplicationsArchitecture
Instance
Mapping
Applications
PerformanceAnalysis
PerformanceNumbers
Positioning in the Y-chart
Data transfer anddata storagespecificrewrites in theapplication code
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 10
If you are in a hurry, do this:• Given
– architecture e.g. TriMedia TM1000– reference C code for application
e.g. MPEG-4 Motion Estimation
• Task– map application on architecture
• But … wait a moment me@work> tmcc -o mpeg4_me mpeg4_me.c
Thank you for running TriMedia compiler.Your program uses 257321886 bytes,78 Watt, and 428798765291 clock cycles
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 11
Let’s help the compiler ...DTSE: data transfer and storage exploration
• Transforms C-code of the application– By focusing on multi-dimensional arrays – To better exploit platform capabilities
• This overview covers the major steps to improve power, area, performance trade-off in the context of platform based design
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 12
Application example• Application domain:
– Computer Tomography in medical imaging
• Algorithm: – Cavity detection in CT-scans
– Detect dark regions in successive images
– Indicate cavity in brain
Bad news for owner of brain
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 13
Application
Reference (conceptual) C code for the algorithm– all functions: image_in[N x M] -> image_out[N x M]
– new value of pixel depends on its neighbors
– neighbor pixels read from background memory
– approximately 110 lines of C code (ignoring file I/O etc)
– experiments with e.g. N x M = 640 x 400 pixels
– straightforward implementation: 6 image buffers
ComputeEdges
GaussBlur x
ReverseDetectRoots
MaxValue
GaussBlur y
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 14
Dynamic memory mgmtDynamic memory mgmt
Task concurrency mgmtTask concurrency mgmt
Static memory mgmtStatic memory mgmt
Address optimizationAddress optimization
SWSWdesigndesignflowflow
HWHWdesigndesignflowflow
SW/HW co-designSW/HW co-design
Concurrent OO specConcurrent OO spec
Remove OO overheadRemove OO overhead
Where does this fit in the whole design flow?
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 15
DMM (data mem. mgt.) principles
Processor Data Paths
L1mem/cache
L2mem/cache
CacheBank
Combine
local latch 1& bank 1
local latch N& bank N
Exploit memory hierarchy
Off-chip SDRAM
Exploit limited life-time of array elements
Avoid N-port Memories within real-time constraints
Reduce redundant transfersIntroduce Locality
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 16
DMM stepsC-in
Preprocessing
Dataflow transformations
Loop transformations
Data reuse Memory hierarchy layer assignment
Cycle budget distribution
Memory allocation and assignment
Data layout
C-out
Address optimization
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 17
The DMM steps
• Preprocessing– Rewrite code in 3 layers (parts)– Selective inlining, Single Assignment form, ....
• Data flow transformations– Eliminate redundant transfers and storage
• Loop and control flow transformations– Improve regularity of accesses and data locality
• Data re-use and memory hierarchy layer assignment– Determine when to move which data between memories to meet
the cycle budget of the application with low cost– Determine in which layer to put the arrays (and copies)
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 18
The DMM stepsPer memory layer:• Cycle budget distribution
– determine memory access constraints for given cycle budget
• Memory allocation and assignment– which memories to use, and where to put the arrays
• Data layout– determine how to combine and put arrays into memories
• Address optimization on the final C-code
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 19
Preprocessing: Dividing an application in the 3 layers
Module1a
Module1b
Module2 Module3
Synchronisation
- testbench call- dynamic event behaviour- mode selection
for (i=0;i<N; i++) for (j=0; j<M; j++) if (i == 0) B[i][j] = 1; else B[i][j] = func1(A[i][j], A[i-1][j]);
intfunc1(int a, int b){ return a*b;}
LAYER1
LAYER2
LAYER3
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 20
main(){ /* Layer 1 code */ read_image(IN_NAME, image_in); cav_detect(); write_image(image_out);}
void cav_detect() { /* Layer 2 code */ for (x=GB; x<=N-1-GB; ++x) { // filtersize 2GB+1 for (y=GB; y<=M-1-GB; ++y) { gauss_x_tmp = 0; for (k=-GB; k<=GB; ++k) { gauss_x_tmp += in_image[x+k][y] * Gauss[abs(k)]; } gauss_x_image[x][y] = foo(gauss_x_tmp); } }}
Layered code structure: level 1+2
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 22
N
M
Data-flow trafo - cavity detection
for (x=0; x<N; ++x) for (y=0; y<M; ++y) gauss_x_image[x][y]=0;
for (x=1; x<=N-2; ++x) { for (y=1; y<=M-2; ++y) { gauss_x_tmp = 0; for (k=-1; k<=1; ++k) { gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)]; } gauss_x_image[x][y] = foo(gauss_x_tmp); }}
M-2
N-2
#accesses: N * M + (N-2) * (M-2)
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 23
Data-flow trafo - cavity detection
N
M
N-2
M-2
for (x=0; x<N; ++x) for (y=0; y<M; ++y) if ((x>=1 && x<=N-2) && (y>=1 && y<=M-2)) { gauss_x_tmp = 0; for (k=-1; k<=1; ++k) { gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)]; } gauss_x_image[x][y] = foo(gauss_x_tmp); } else { gauss_x_image[x][y] = 0; } }}
#accesses: N * Mgain is almost 50 %
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 24
Data-flow transformation• In total 5 types of data-flow transformations:
– advanced signal substitution and (copy) propagation
– algebraic transformations (associativity, distributive law,
etc.)
– shifting “delay lines”
– re-computation (instead of keeping values alive too long)
– transformations to eliminate bottlenecks
for subsequent loop transformations
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 25
Data-flow transformation - result
0
20
40
60
80
100
120
140
accesses size cycles
OriginalDF trafo
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 26
• Loop transformations– improve regularity of accesses
– improve temporal locality: production consumption
• Expected influence– reduce temporary storage and (anticipated) background
storage
storage size N
Loop transformations
for (j=1; j<=M; j++) for (i=1; i<=N; i++) A[i]= foo(A[i]);
for (i=1; i<=N; i++) out[i] = A[i];
for (i=1; i<=N; i++) { for (j=1; j<=M; j++) { A[i] = foo(A[i]); } out[i] = A[i];}
storage size 1
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 27
Global loop transformation steps applied to cavity detection
• Make all loop dimensions equal
• Regularize loop traversal:possibly Y and X loop interchange– follow order of input stream
• Y loop folding and global mergingX loop folding and global merging– full, global scope regularity– nearly complete locality for main signals (arrays)
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 28
Data enters Cavity Detectorrow-wise
scandevice
Buffer
serial scan
Cavity
Detector
GaussBlur loop
= image_in
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 29
Scanner
Loop trafo - cavity detection
N x M
GaussBlur x
N x M
From double bufferto single buffer
X
YX-Y LoopInterchange
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 30
• Not always possible; check dependences
• For all loops, to maintain regularity
Loop interchange (Y X)
for (x=0;x<N;x++) for (y=0;y<M;y++) /* filtering code */
for (y=0;y<M;y++) for (x=0;x<N;x++) /* filtering code */
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 31
Loop trafo - cavity detection
ComputeEdges
GaussBlur y
N x (2GB+1)
Repeated fold and loop merge
N x 3
From N x M toN x (3) buffer size
From N x M toN x (2GB+1) buffer size
2GB+13(offset
arrays)
GaussBlur x
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 32
for (y=0;y<M;y++) for (x=0;x<N;x++) /* 1st filtering code */for (y=0;y<M;y++) for (x=0;x<N;x++) /* 2nd filtering code */
for (y=0;y<M;y++) for (x=0;x<N;x++) /* 1st filtering code */ for (x=0;x<N;x++) /* 2nd filtering code */
Improve regularity and locality Loop Merging
!! Often impossible due to dependencies!
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 33
Data dependencies between1st and 2nd loop
for (y=0;y<M;y++) for (x=0;x<N;x++) … gauss_x_image[x][y] = …
for (y=0;y<M;y++) for (x=0;x<N;x++) … for (k=-GB; k<=GB; k++) … = … gauss_x_image[x][y+k] …
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 34
Enable merging withLoop Folding (bumping)
for (y=0;y<M;y++) for (x=0;x<N;x++) … gauss_x_image[x][y] = …
for (y=0+GB;y<M+GB;y++) for (x=0;x<N;x++) … y-GB … for (k=-GB; k<=GB; k++) … gauss_x_image[x][y+k-GB] …
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 35
Y-loop merging resultof 1st and 2nd loop nest
for (y=0;y<M+GB;y++) if (y<M) for (x=0;x<N;x++) … gauss_x_image[x][y] = …
if (y>=GB) for (x=0;x<N;x++) if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) … for (k=-GB; k<=GB; k++) … gauss_x_image[x][y-GB+k] … else
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 36
Simplify conditions in merged loop
for (y=0;y<M+GB;y++) for (x=0;x<N;x++) if (y<M) … gauss_x_image[x][y] = …
for (x=0;x<N;x++) if (y>=GB && x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) … for (k=-GB; k<=GB; k++) … gauss_x_image[x][y-GB+k] … else if (y>=GB)
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 37
Global loop merging/folding steps1 x y Loop interchange (done)
2 Global y-loop folding/merging: 1st and 2nd nest (done)
3 Global y-loop folding/merging: 1st/2nd and 3rd nest
4 Global y-loop folding/merging: 1st/2nd/3rd and 4th nest
5 Global x-loop folding/merging: 1st and 2nd nest
6 Global x-loop folding/merging: 1st/2nd and 3rd nest
7 Global x-loop folding/merging: 1st/2nd/3rd and 4th nest
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 38
End result of global loop trafo
for (y=0; y<M+GB+2; ++y) { for (x=0; x<N+2; ++x) { … if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) { gauss_xy_compute[x][y-GB][0] = 0; for (k=-GB; k<=GB; ++k) gauss_xy_compute[x][y-GB][GB+k+1] = gauss_xy_compute[x][y-GB][GB+k] + gauss_x_image[x][y-GB+k] * Gauss[abs(k)]; gauss_xy_image[x][y-GB] = gauss_xy_compute[x][y-GB][(2*GB)+1]/tot; } else if (x<N && (y-GB)>=0 && (y-GB)<M) gauss_xy_image[x][y-GB] = 0; …
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 39
Loop transformations - result
0
50
100
150
200
250
300
accesses size cycles
OriginalDF trafoLoop trafo
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 40
#A = 100
P (original) = # access x power/access = 100
Processor Data Paths
RegFile
MMain
memory
P = 1
M’
P = 0.1
M’’
P = 0.01
100 110
Data re-use & memory hierarchy
• Introduce memory hierarchy – reduce number of reads from main memory
– heavily accessed arrays stored in smaller memories
P (after) = 100 x 0.01 + 10 x 0.1 + 1 x 1 = 3
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 41
Data re-use
• Data flow transformations to introduce extracopies of heavily accessed signals– Step 1: figure out data re-use possibilities
– Step 2: calculate possible gain
– Step 3: decide on data assignment to memory hierarchy
int[2][6] A;
for (h=0; h<N; h++) for (i=0; i<2; i++) for (j=0; j<3; j++) for (k=0; k<6; k++) B[j] = A[i][k]; iterations
array index (6 * i + k)
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 42
Data re-use
• Data flow transformations to introduce extracopies of heavily accessed signals– Step 1: figure out data re-use possibilities
– Step 2: calculate possible gain
– Step 3: decide on data assignment to memory hierarchy
iterationsframe1 frame2 frame3
array index6*2
6*1N*2*3*6
CPU
1*2*1*6
N*2*1*6
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 43
Data re-use tree
N*M
N*1
3*1
image_in
M*3
1*3
gauss_x
M*3
3*3
gauss_xy/comp_edge
M*3
1*1
N*M*3
N*M
N*M N*M*3
N*M*3
N*M*3 N*M
N*M
image_out
0
N*M*8 N*M*8
CPU CPU CPU CPU CPU
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 44
Memory hierarchy assignment
L2
L1
L0
N*M
3*1
image_in
M*3
gauss_x gauss_xy comp_edgeimage_out
3*3 1*1 3*3 1*1
N*M
N*M*3 N*M*3 N*M
N*M
0
N*M*3 N*M
N*M*3 N*M*8 N*M*8 N*M*8 N*M*8
M*3 M*3
1MB
SDRAM
16KB
Cache
128 B
RegFile
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 45
Data-reuse - cavity detection code
for (y=0; y<M+3; ++y) { for (x=0; x<N+2; ++x) { if (x>=1 && x<=N-2 && y>=1 && y<=M-2) { gauss_x_tmp = 0; for (k=-1; k<=1; ++k) gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)]; gauss_x_image[x][y] = foo(gauss_x_compute); } else { if (x<N && y<M) gauss_x_lines[x][y] = 0; } /* Other merged code omitted … */ }}
Code before reuse transformation
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 46
Data-reuse - cavity detection code
for (y=0; y<M+3; ++y) { for (x=0; x<N+2; ++x) { /* first in_pixels initialized */ if (x==0 && y>=1 && y<=M-2) for (k=0; k<1; ++k) in_pixels[(x+k)%3] = image_in[x+k][y];
/* copy rest of in_pixels in row */ if (x>=0 && x<=N-2 && y>=1 && y<=M-2) in_pixels[(x+1)%3] = image_in[x+1][y];
if (x>=1 && x<=N-1-1 && y>=1 && y<=M-2) { gauss_x_tmp=0; for (k=-1; k<=1; ++k) gauss_x_tmp += in_pixels[(x+k)%3]*Gauss[Abs(k)]; gauss_x_lines[x][y%3]= foo(gauss_x_tmp); } else if (x<N && y<M) gauss_x_lines[x][y%3] = 0;
Code after reuse transformation:
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 47
Data reuse & memory hierarchy
0
100
200
300
400
500
600
accesses size cycles
OriginalDF trafoLoop trafoData reuse
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 48
Data layout optimization• At this point multi-dimensional arrays
are to be assigned to physical memories
• Data layout optimization determines exactly where in each memory an array should be placed, to
– reduce memory size by “in-placing” arrays that do not overlap in time (disjoint lifetimes)
– to avoid cache misses due to conflicts– exploit spatial locality of the data in memory to improve
performance of e.g. page-mode memory access sequences
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 49
In-place mapping
B
C
D
A
C
AD
B
E
C
A
D
E
B
time
E
addresses
A
B
C
D
E
Inter in-place
Intra in-place
Both intra+inter
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 50
0x0
0x28a0B
A
In-place mapping• Implements all the “anticipated” memory size savings
obtained in previous steps• Modifies code to introduce one
array per “real” memory• Changes indices to addresses in mem. arrays
b8 A[100][100];b6 B[20][20];for (i,j,k,l; …) B[i][j] = f(B[j][i], A[i+k][j+l]);
b8 mem1[10400];for (i,j,k,l; …) mem1[10000+i+20*j] = f(mem1[10000+j+20*i], b6(mem1[i+k+100*(j+l)]);0x2710
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 51
In-place mapping• Input image is partly consumed, and not reused again,
by the time first results for output image are ready
Image_out
index
time
Image_in
time
index
time
address
Image
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 52
In-place - cavity detection code
for (y=0; y<=M+3; ++y) { for (x=0; x<N+5; ++x) { image_out[x-5][y-3] = …; /* code removed */
… = image_in[x+1][y]; }}
for (y=0; y<=M+3; ++y) { for (x=0; x<N+5; ++x) { image[x-5][y-3] = …; /* code removed */
… = image [x+1][y]; }}
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 53
In-place mapping - results
0
100
200
300
400
500
600
accesses size cycles
OriginalDF trafoLoop trafoData reuseIn-place
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 54
The last step: ADOPT (Address OPTimization)
• Increased execution time introduced by DTSE– Complicated address arithmetic (modulo!)
– Additional complex control flow
• Additional transformations needed to– Simplify control flow
– Simplify address arithmetic: common sub-expression elimination, modulo expansion, …
– Match remaining expressions on target machine
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 55
for (i=-8; i<=8; i++) { for (j=- 4; j<=3; j++) { for (k=- 4; k<=3; k++) { B[ ] = A[ ]; }} dist += A[ ]- B[ ]; }
cse1 = (33025*i+6869616)*2; cse3 = 1040+i; cse4 = j*257+1032; cse5 = k+cse4; cse5+cse1 = cse5+cse3 3096 cse1
ADOPT principles
for (i=- 8; i<=8; i++) { for (j=- 4; j<=3; j++) { for (k=- 4; k<=3; k++) A[((208+i)*257+8+j)*257+ 16+i+k] = B[(8+j)*257+16+i+k]; } dist += A[3096] - B[((208+i)*257+4)*257+ 16+i-4]; }
Example: Full-search Motion Estimation
Algebraic transformationsat word-level
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 56
Address optimization - result
0
100
200
300
400
500
600
accesses size cycles
OriginalDF trafoLoop trafoData reuseIn-placeData layoutADOPT - moduloADOPT - rest
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 57
Fixing platform parameters
• Assume configurable on-chip memory hierarchy– Trade-off power versus cycle-budget
storagecyclebudget
power [mW]
50,000 100,000 150,000
5
10
15
20
25
04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 58
Conclusion• Many applications use large (static) data structures
– Access and layout of this data can be heavily optimized
– Compilers don't do this
• Source code (C-to-C) transformations needed !!• Showed systematic approach
– Platform independent high-level transformations
– Platform dependent transformations exploit platform characteristics (optimal use of cache, …)
• Substantial energy, memory size (cost) and performance improvements
– MPEG-4, OFDM, H.263, ADSL, ...