57
Embedded Computer Architecture 5KK73 TU/e Henk Corporaal Bart Mesman Data Memory Management Overview

Embedded Computer Architecture

  • Upload
    dewei

  • View
    48

  • Download
    2

Embed Size (px)

DESCRIPTION

Embedded Computer Architecture. Data Memory Management Overview. 5KK73 TU/e Henk Corporaal Bart Mesman. Data Memory Management Overview. Motivation Example application DMM steps Results Notes: We concentrate on Static Data structures like arrays - PowerPoint PPT Presentation

Citation preview

Page 1: Embedded Computer Architecture

Embedded Computer Architecture

5KK73 TU/e

Henk Corporaal

Bart Mesman

Data Memory Management

Overview

Page 2: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 2

Data Memory Management Overview

• Motivation• Example application• DMM steps• Results

Notes: - We concentrate on Static Data structures like arrays- The Data Transfer and Storage Exploration (DTSE)

methodology, on which these slides are based, has been developed at IMEC, Leuven

Page 3: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 3

You want the big picture?

• Take a well written C-program, which is dominated by:– "well structured" loopnests, and by– accesses to "static" big (multi-dimensional) arrays

• Transform your code in order to – reduced number of external memory accesses by a

factor 10– drastic reduction of energy consumption– significant improvement of speed

Page 4: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 4

VLIWcpu

I$

video-in

video-out

audio-in

audio-out

PCI bridge

Serial I/O

timers I2C I/O

SDRAM

D$

for (i=0;i<n;i++) for (j=0; j<3; j++) for (k=1; k<7; k++) B[j] = A[i*4+k];

SDRAM

D$

Data storage bottleneck

B[j] = A[i*4+k];

The underlying idea

B[j] = A[i*4+k];

Data transfer bottleneck

B[j] = A[i*4+k];

Page 5: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 5

Platform example: TriMedia

5 out of 27 processor

FUs

128*32b16-portRegFile

Hardware accelerators

TriMedia (5-issue VLIW processor)

256M1-portSDRAM

16K2-portSRAM

256M1-portSDRAM

SWcache8KB

TriMedia cacheuse

HWcache

8/16KB

CPUCache bypassSW controlledHW controlled

Page 6: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 6

Generic platform model

CPUs

HWaccel

Level-1 Level-2 Level-3 Level-4

ICache

LocalMemory

Disk

MainMemory

bus-if

on-c

hip

buss

es

LocalMemory

L2Cache

LocalMemory

bridge SCSI

DCache Disk

Disk

bus

bus

SCSI

bus

Chip

Page 7: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 7

Data transfer and storage power

01

234

56

Relative energyper operation

ADD R1, R2

LW R1,[R2]

Cache miss(per word)

05

10152025303540

Powerdistribution in

CPUs

Storage

Interconnect

Clocks

Other

05

101520253035

Relative energy per operation

16b carry-select

16b multiplier

8x128x16 SRAM (R)

8x128x16 SRAM (W)

External IO access

16b Memory transfer

Power(memory)

Power(arith

metic)= 33

Page 8: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 8

What about delay of memories?

Global wiring delay becomes dominant over gate delay

Gate delay vs. wire delay

0

50

100

150

200

250

300

350

400

0.5 0.35 0.25 0.18 0.13 0.1

technology (micron)

ps

wire delay (ps/mm)

gate delay (ps)

Page 9: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 9

ApplicationsApplicationsArchitecture

Instance

Mapping

Applications

PerformanceAnalysis

PerformanceNumbers

Positioning in the Y-chart

Data transfer anddata storagespecificrewrites in theapplication code

Page 10: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 10

If you are in a hurry, do this:• Given

– architecture e.g. TriMedia TM1000– reference C code for application

e.g. MPEG-4 Motion Estimation

• Task– map application on architecture

• But … wait a moment me@work> tmcc -o mpeg4_me mpeg4_me.c

Thank you for running TriMedia compiler.Your program uses 257321886 bytes,78 Watt, and 428798765291 clock cycles

Page 11: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 11

Let’s help the compiler ...DTSE: data transfer and storage exploration

• Transforms C-code of the application– By focusing on multi-dimensional arrays – To better exploit platform capabilities

• This overview covers the major steps to improve power, area, performance trade-off in the context of platform based design

Page 12: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 12

Application example• Application domain:

– Computer Tomography in medical imaging

• Algorithm: – Cavity detection in CT-scans

– Detect dark regions in successive images

– Indicate cavity in brain

Bad news for owner of brain

Page 13: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 13

Application

Reference (conceptual) C code for the algorithm– all functions: image_in[N x M] -> image_out[N x M]

– new value of pixel depends on its neighbors

– neighbor pixels read from background memory

– approximately 110 lines of C code (ignoring file I/O etc)

– experiments with e.g. N x M = 640 x 400 pixels

– straightforward implementation: 6 image buffers

ComputeEdges

GaussBlur x

ReverseDetectRoots

MaxValue

GaussBlur y

Page 14: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 14

Dynamic memory mgmtDynamic memory mgmt

Task concurrency mgmtTask concurrency mgmt

Static memory mgmtStatic memory mgmt

Address optimizationAddress optimization

SWSWdesigndesignflowflow

HWHWdesigndesignflowflow

SW/HW co-designSW/HW co-design

Concurrent OO specConcurrent OO spec

Remove OO overheadRemove OO overhead

Where does this fit in the whole design flow?

Page 15: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 15

DMM (data mem. mgt.) principles

Processor Data Paths

L1mem/cache

L2mem/cache

CacheBank

Combine

local latch 1& bank 1

local latch N& bank N

Exploit memory hierarchy

Off-chip SDRAM

Exploit limited life-time of array elements

Avoid N-port Memories within real-time constraints

Reduce redundant transfersIntroduce Locality

Page 16: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 16

DMM stepsC-in

Preprocessing

Dataflow transformations

Loop transformations

Data reuse Memory hierarchy layer assignment

Cycle budget distribution

Memory allocation and assignment

Data layout

C-out

Address optimization

Page 17: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 17

The DMM steps

• Preprocessing– Rewrite code in 3 layers (parts)– Selective inlining, Single Assignment form, ....

• Data flow transformations– Eliminate redundant transfers and storage

• Loop and control flow transformations– Improve regularity of accesses and data locality

• Data re-use and memory hierarchy layer assignment– Determine when to move which data between memories to meet

the cycle budget of the application with low cost– Determine in which layer to put the arrays (and copies)

Page 18: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 18

The DMM stepsPer memory layer:• Cycle budget distribution

– determine memory access constraints for given cycle budget

• Memory allocation and assignment– which memories to use, and where to put the arrays

• Data layout– determine how to combine and put arrays into memories

• Address optimization on the final C-code

Page 19: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 19

Preprocessing: Dividing an application in the 3 layers

Module1a

Module1b

Module2 Module3

Synchronisation

- testbench call- dynamic event behaviour- mode selection

for (i=0;i<N; i++) for (j=0; j<M; j++) if (i == 0) B[i][j] = 1; else B[i][j] = func1(A[i][j], A[i-1][j]);

intfunc1(int a, int b){ return a*b;}

LAYER1

LAYER2

LAYER3

Page 20: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 20

main(){ /* Layer 1 code */ read_image(IN_NAME, image_in); cav_detect(); write_image(image_out);}

void cav_detect() { /* Layer 2 code */ for (x=GB; x<=N-1-GB; ++x) { // filtersize 2GB+1 for (y=GB; y<=M-1-GB; ++y) { gauss_x_tmp = 0; for (k=-GB; k<=GB; ++k) { gauss_x_tmp += in_image[x+k][y] * Gauss[abs(k)]; } gauss_x_image[x][y] = foo(gauss_x_tmp); } }}

Layered code structure: level 1+2

Page 21: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 22

N

M

Data-flow trafo - cavity detection

for (x=0; x<N; ++x) for (y=0; y<M; ++y) gauss_x_image[x][y]=0;

for (x=1; x<=N-2; ++x) { for (y=1; y<=M-2; ++y) { gauss_x_tmp = 0; for (k=-1; k<=1; ++k) { gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)]; } gauss_x_image[x][y] = foo(gauss_x_tmp); }}

M-2

N-2

#accesses: N * M + (N-2) * (M-2)

Page 22: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 23

Data-flow trafo - cavity detection

N

M

N-2

M-2

for (x=0; x<N; ++x) for (y=0; y<M; ++y) if ((x>=1 && x<=N-2) && (y>=1 && y<=M-2)) { gauss_x_tmp = 0; for (k=-1; k<=1; ++k) { gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)]; } gauss_x_image[x][y] = foo(gauss_x_tmp); } else { gauss_x_image[x][y] = 0; } }}

#accesses: N * Mgain is almost 50 %

Page 23: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 24

Data-flow transformation• In total 5 types of data-flow transformations:

– advanced signal substitution and (copy) propagation

– algebraic transformations (associativity, distributive law,

etc.)

– shifting “delay lines”

– re-computation (instead of keeping values alive too long)

– transformations to eliminate bottlenecks

for subsequent loop transformations

Page 24: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 25

Data-flow transformation - result

0

20

40

60

80

100

120

140

accesses size cycles

OriginalDF trafo

Page 25: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 26

• Loop transformations– improve regularity of accesses

– improve temporal locality: production consumption

• Expected influence– reduce temporary storage and (anticipated) background

storage

storage size N

Loop transformations

for (j=1; j<=M; j++) for (i=1; i<=N; i++) A[i]= foo(A[i]);

for (i=1; i<=N; i++) out[i] = A[i];

for (i=1; i<=N; i++) { for (j=1; j<=M; j++) { A[i] = foo(A[i]); } out[i] = A[i];}

storage size 1

Page 26: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 27

Global loop transformation steps applied to cavity detection

• Make all loop dimensions equal

• Regularize loop traversal:possibly Y and X loop interchange– follow order of input stream

• Y loop folding and global mergingX loop folding and global merging– full, global scope regularity– nearly complete locality for main signals (arrays)

Page 27: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 28

Data enters Cavity Detectorrow-wise

scandevice

Buffer

serial scan

Cavity

Detector

GaussBlur loop

= image_in

Page 28: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 29

Scanner

Loop trafo - cavity detection

N x M

GaussBlur x

N x M

From double bufferto single buffer

X

YX-Y LoopInterchange

Page 29: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 30

• Not always possible; check dependences

• For all loops, to maintain regularity

Loop interchange (Y X)

for (x=0;x<N;x++) for (y=0;y<M;y++) /* filtering code */

for (y=0;y<M;y++) for (x=0;x<N;x++) /* filtering code */

Page 30: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 31

Loop trafo - cavity detection

ComputeEdges

GaussBlur y

N x (2GB+1)

Repeated fold and loop merge

N x 3

From N x M toN x (3) buffer size

From N x M toN x (2GB+1) buffer size

2GB+13(offset

arrays)

GaussBlur x

Page 31: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 32

for (y=0;y<M;y++) for (x=0;x<N;x++) /* 1st filtering code */for (y=0;y<M;y++) for (x=0;x<N;x++) /* 2nd filtering code */

for (y=0;y<M;y++) for (x=0;x<N;x++) /* 1st filtering code */ for (x=0;x<N;x++) /* 2nd filtering code */

Improve regularity and locality Loop Merging

!! Often impossible due to dependencies!

Page 32: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 33

Data dependencies between1st and 2nd loop

for (y=0;y<M;y++) for (x=0;x<N;x++) … gauss_x_image[x][y] = …

for (y=0;y<M;y++) for (x=0;x<N;x++) … for (k=-GB; k<=GB; k++) … = … gauss_x_image[x][y+k] …

Page 33: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 34

Enable merging withLoop Folding (bumping)

for (y=0;y<M;y++) for (x=0;x<N;x++) … gauss_x_image[x][y] = …

for (y=0+GB;y<M+GB;y++) for (x=0;x<N;x++) … y-GB … for (k=-GB; k<=GB; k++) … gauss_x_image[x][y+k-GB] …

Page 34: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 35

Y-loop merging resultof 1st and 2nd loop nest

for (y=0;y<M+GB;y++) if (y<M) for (x=0;x<N;x++) … gauss_x_image[x][y] = …

if (y>=GB) for (x=0;x<N;x++) if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) … for (k=-GB; k<=GB; k++) … gauss_x_image[x][y-GB+k] … else

Page 35: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 36

Simplify conditions in merged loop

for (y=0;y<M+GB;y++) for (x=0;x<N;x++) if (y<M) … gauss_x_image[x][y] = …

for (x=0;x<N;x++) if (y>=GB && x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) … for (k=-GB; k<=GB; k++) … gauss_x_image[x][y-GB+k] … else if (y>=GB)

Page 36: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 37

Global loop merging/folding steps1 x y Loop interchange (done)

2 Global y-loop folding/merging: 1st and 2nd nest (done)

3 Global y-loop folding/merging: 1st/2nd and 3rd nest

4 Global y-loop folding/merging: 1st/2nd/3rd and 4th nest

5 Global x-loop folding/merging: 1st and 2nd nest

6 Global x-loop folding/merging: 1st/2nd and 3rd nest

7 Global x-loop folding/merging: 1st/2nd/3rd and 4th nest

Page 37: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 38

End result of global loop trafo

for (y=0; y<M+GB+2; ++y) { for (x=0; x<N+2; ++x) { … if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) { gauss_xy_compute[x][y-GB][0] = 0; for (k=-GB; k<=GB; ++k) gauss_xy_compute[x][y-GB][GB+k+1] = gauss_xy_compute[x][y-GB][GB+k] + gauss_x_image[x][y-GB+k] * Gauss[abs(k)]; gauss_xy_image[x][y-GB] = gauss_xy_compute[x][y-GB][(2*GB)+1]/tot; } else if (x<N && (y-GB)>=0 && (y-GB)<M) gauss_xy_image[x][y-GB] = 0; …

Page 38: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 39

Loop transformations - result

0

50

100

150

200

250

300

accesses size cycles

OriginalDF trafoLoop trafo

Page 39: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 40

#A = 100

P (original) = # access x power/access = 100

Processor Data Paths

RegFile

MMain

memory

P = 1

M’

P = 0.1

M’’

P = 0.01

100 110

Data re-use & memory hierarchy

• Introduce memory hierarchy – reduce number of reads from main memory

– heavily accessed arrays stored in smaller memories

P (after) = 100 x 0.01 + 10 x 0.1 + 1 x 1 = 3

Page 40: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 41

Data re-use

• Data flow transformations to introduce extracopies of heavily accessed signals– Step 1: figure out data re-use possibilities

– Step 2: calculate possible gain

– Step 3: decide on data assignment to memory hierarchy

int[2][6] A;

for (h=0; h<N; h++) for (i=0; i<2; i++) for (j=0; j<3; j++) for (k=0; k<6; k++) B[j] = A[i][k]; iterations

array index (6 * i + k)

Page 41: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 42

Data re-use

• Data flow transformations to introduce extracopies of heavily accessed signals– Step 1: figure out data re-use possibilities

– Step 2: calculate possible gain

– Step 3: decide on data assignment to memory hierarchy

iterationsframe1 frame2 frame3

array index6*2

6*1N*2*3*6

CPU

1*2*1*6

N*2*1*6

Page 42: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 43

Data re-use tree

N*M

N*1

3*1

image_in

M*3

1*3

gauss_x

M*3

3*3

gauss_xy/comp_edge

M*3

1*1

N*M*3

N*M

N*M N*M*3

N*M*3

N*M*3 N*M

N*M

image_out

0

N*M*8 N*M*8

CPU CPU CPU CPU CPU

Page 43: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 44

Memory hierarchy assignment

L2

L1

L0

N*M

3*1

image_in

M*3

gauss_x gauss_xy comp_edgeimage_out

3*3 1*1 3*3 1*1

N*M

N*M*3 N*M*3 N*M

N*M

0

N*M*3 N*M

N*M*3 N*M*8 N*M*8 N*M*8 N*M*8

M*3 M*3

1MB

SDRAM

16KB

Cache

128 B

RegFile

Page 44: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 45

Data-reuse - cavity detection code

for (y=0; y<M+3; ++y) { for (x=0; x<N+2; ++x) { if (x>=1 && x<=N-2 && y>=1 && y<=M-2) { gauss_x_tmp = 0; for (k=-1; k<=1; ++k) gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)]; gauss_x_image[x][y] = foo(gauss_x_compute); } else { if (x<N && y<M) gauss_x_lines[x][y] = 0; } /* Other merged code omitted … */ }}

Code before reuse transformation

Page 45: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 46

Data-reuse - cavity detection code

for (y=0; y<M+3; ++y) { for (x=0; x<N+2; ++x) { /* first in_pixels initialized */ if (x==0 && y>=1 && y<=M-2) for (k=0; k<1; ++k) in_pixels[(x+k)%3] = image_in[x+k][y];

/* copy rest of in_pixels in row */ if (x>=0 && x<=N-2 && y>=1 && y<=M-2) in_pixels[(x+1)%3] = image_in[x+1][y];

if (x>=1 && x<=N-1-1 && y>=1 && y<=M-2) { gauss_x_tmp=0; for (k=-1; k<=1; ++k) gauss_x_tmp += in_pixels[(x+k)%3]*Gauss[Abs(k)]; gauss_x_lines[x][y%3]= foo(gauss_x_tmp); } else if (x<N && y<M) gauss_x_lines[x][y%3] = 0;

Code after reuse transformation:

Page 46: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 47

Data reuse & memory hierarchy

0

100

200

300

400

500

600

accesses size cycles

OriginalDF trafoLoop trafoData reuse

Page 47: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 48

Data layout optimization• At this point multi-dimensional arrays

are to be assigned to physical memories

• Data layout optimization determines exactly where in each memory an array should be placed, to

– reduce memory size by “in-placing” arrays that do not overlap in time (disjoint lifetimes)

– to avoid cache misses due to conflicts– exploit spatial locality of the data in memory to improve

performance of e.g. page-mode memory access sequences

Page 48: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 49

In-place mapping

B

C

D

A

C

AD

B

E

C

A

D

E

B

time

E

addresses

A

B

C

D

E

Inter in-place

Intra in-place

Both intra+inter

Page 49: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 50

0x0

0x28a0B

A

In-place mapping• Implements all the “anticipated” memory size savings

obtained in previous steps• Modifies code to introduce one

array per “real” memory• Changes indices to addresses in mem. arrays

b8 A[100][100];b6 B[20][20];for (i,j,k,l; …) B[i][j] = f(B[j][i], A[i+k][j+l]);

b8 mem1[10400];for (i,j,k,l; …) mem1[10000+i+20*j] = f(mem1[10000+j+20*i], b6(mem1[i+k+100*(j+l)]);0x2710

Page 50: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 51

In-place mapping• Input image is partly consumed, and not reused again,

by the time first results for output image are ready

Image_out

index

time

Image_in

time

index

time

address

Image

Page 51: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 52

In-place - cavity detection code

for (y=0; y<=M+3; ++y) { for (x=0; x<N+5; ++x) { image_out[x-5][y-3] = …; /* code removed */

… = image_in[x+1][y]; }}

for (y=0; y<=M+3; ++y) { for (x=0; x<N+5; ++x) { image[x-5][y-3] = …; /* code removed */

… = image [x+1][y]; }}

Page 52: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 53

In-place mapping - results

0

100

200

300

400

500

600

accesses size cycles

OriginalDF trafoLoop trafoData reuseIn-place

Page 53: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 54

The last step: ADOPT (Address OPTimization)

• Increased execution time introduced by DTSE– Complicated address arithmetic (modulo!)

– Additional complex control flow

• Additional transformations needed to– Simplify control flow

– Simplify address arithmetic: common sub-expression elimination, modulo expansion, …

– Match remaining expressions on target machine

Page 54: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 55

for (i=-8; i<=8; i++) { for (j=- 4; j<=3; j++) { for (k=- 4; k<=3; k++) { B[ ] = A[ ]; }} dist += A[ ]- B[ ]; }

cse1 = (33025*i+6869616)*2; cse3 = 1040+i; cse4 = j*257+1032; cse5 = k+cse4; cse5+cse1 = cse5+cse3 3096 cse1

ADOPT principles

for (i=- 8; i<=8; i++) { for (j=- 4; j<=3; j++) { for (k=- 4; k<=3; k++) A[((208+i)*257+8+j)*257+ 16+i+k] = B[(8+j)*257+16+i+k]; } dist += A[3096] - B[((208+i)*257+4)*257+ 16+i-4]; }

Example: Full-search Motion Estimation

Algebraic transformationsat word-level

Page 55: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 56

Address optimization - result

0

100

200

300

400

500

600

accesses size cycles

OriginalDF trafoLoop trafoData reuseIn-placeData layoutADOPT - moduloADOPT - rest

Page 56: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 57

Fixing platform parameters

• Assume configurable on-chip memory hierarchy– Trade-off power versus cycle-budget

storagecyclebudget

power [mW]

50,000 100,000 150,000

5

10

15

20

25

Page 57: Embedded Computer Architecture

04/21/23 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman 58

Conclusion• Many applications use large (static) data structures

– Access and layout of this data can be heavily optimized

– Compilers don't do this

• Source code (C-to-C) transformations needed !!• Showed systematic approach

– Platform independent high-level transformations

– Platform dependent transformations exploit platform characteristics (optimal use of cache, …)

• Substantial energy, memory size (cost) and performance improvements

– MPEG-4, OFDM, H.263, ADSL, ...