5th International Conference, HiPEAC 2010

Memory-aware application mapping on coarse-grained reconfigurable arrays
Yongjoo Kim, Jongeun Lee*, Aviral Shrivastava**, Jonghee Yoon and Yunheung Paek
Software Optimization And Restructuring, Department of Electrical Engineering, Seoul National University, Seoul, South Korea
* Embedded Systems Research Lab, ECE, Ulsan Nat'l Institute of Science & Tech, Ulsan, Korea
** Compiler and Microarchitecture Lab, Center for Embedded Systems, Arizona State University, Tempe, AZ, USA
2010-01-25

Coarse-Grained Reconfigurable Array (CGRA)
- High computation throughput
- Low power consumption and scalability
- High flexibility with fast configuration

  Category     Processor        MIPS   mW     MIPS/mW
  Embedded     Xscale           1250   1600   0.78
  DSP          TI TM320C6455    9573   3300   2.9
  DSP (VLIW)   TI TM320C614T    4711   670    7
  * CGRA shows 10~100 MIPS/mW

Coarse-Grained Reconfigurable Array (CGRA)
- An array of PEs connected by a mesh-like network
- Each PE operates on the results of its neighbor PEs
- Executes computation-intensive kernels

Application mapping in CGRA
- Mapping a DFG onto the PE array (the mapping space)
- The mapping must satisfy several conditions (a minimal check is sketched below):
  - Nodes must be mapped onto PEs that have the right functionality
  - Data transfer between nodes must be guaranteed
  - Resource consumption should be minimized for performance
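To make these conditions concrete, here is a minimal sketch of such a validity check. The PE model, the single-cycle placement, and the `adjacent` helper are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the mapping conditions above; the data structures
# are illustrative, not the actual compiler's.
from dataclasses import dataclass, field

@dataclass
class PE:
    row: int
    col: int
    ops: set = field(default_factory=lambda: {"add", "mul"})  # supported opcodes

def adjacent(a: PE, b: PE) -> bool:
    """Mesh connectivity: a PE can feed its 4-neighbors (or itself via its register)."""
    return abs(a.row - b.row) + abs(a.col - b.col) <= 1

def is_valid_placement(dfg_nodes, dfg_edges, placement):
    """placement maps DFG node -> PE (all nodes in the same cycle, for simplicity).

    1) every node sits on a PE with the right functionality
    2) every data edge can be routed between neighboring PEs
    """
    for node, op in dfg_nodes.items():
        if op not in placement[node].ops:
            return False
    return all(adjacent(placement[src], placement[dst]) for src, dst in dfg_edges)

# Tiny usage example with two hypothetical PEs and a two-node DFG:
pe00, pe01 = PE(0, 0, {"load", "add"}), PE(0, 1, {"add", "mul"})
print(is_valid_placement({"n0": "load", "n1": "add"}, [("n0", "n1")],
                         {"n0": pe00, "n1": pe01}))   # True
```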

CGRA execution & data mapping
- tc: computation time, td: data transfer time
- The PE array is driven by the configuration memory and reads its data from the local memory; a DMA engine streams data between main memory and the local memory banks (Bk1..Bk4), each of which is double-buffered (buf1/buf2)
- With double buffering, computation and data transfer overlap, so total runtime = max(tc, td)

The performance bottleneck: data transfer
- Many multimedia kernels show a larger td than tc
- On average, tc is just 22% of the total (100% = tc + td)
- Most applications are memory-bound
[Figure: the ratio between tc and td across kernels]

Computation mapping & data mapping
- Duplicating an array across banks increases the data transfer time
- Example: LD S[i] and LD S[i+1] can share a single copy of S when both land in the same bank; placing them on PEs tied to different banks forces S to be duplicated (a toy model of the effect follows below)
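The effect can be put into a toy model. This is my own simplification, not from the slides: td is assumed to scale with the number of array copies that must be DMA'd into local memory per tile, and the byte counts and DMA bandwidth are made-up parameters.

```python
# Illustrative model of why duplicating an array across banks inflates td:
# every extra copy is one more DMA transfer of the same tile of data.
def transfer_time(arrays_per_bank, tile_bytes, dma_bytes_per_cycle=4):
    """arrays_per_bank: one set of array names per bank.

    td is driven by the total bytes DMA'd into local memory per tile;
    an array mapped into two banks is transferred twice.
    """
    total_copies = sum(len(bank) for bank in arrays_per_bank)
    return total_copies * tile_bytes // dma_bytes_per_cycle

# S reused by LD S[i] and LD S[i+1]:
same_bank   = [{"S", "B"}, set()]     # one copy of S  -> smaller td
split_banks = [{"S"}, {"S", "B"}]     # S duplicated   -> larger td
assert transfer_time(same_bank, 256) < transfer_time(split_banks, 256)
```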

Contributions of this work
- First approach to consider computation mapping and data mapping together:
  - Balance tc and td
  - Minimize duplicate arrays (maximize data reuse)
  - Balance bank utilization
- A simple yet effective extension: a set of cost functions that can be plugged into existing compilation frameworks, e.g., EMS (edge-centric modulo scheduling)

Application mapping flow
- DFG -> performance bottleneck analysis (produces the DCR) and data reuse analysis (produces the DRG) -> memory-aware modulo scheduling -> mapping

Preprocessing 1: Performance bottleneck analysis
- Determines whether it is computation or data transfer that limits the overall performance
- Computes the DCR (data-transfer-to-computation time ratio): DCR = td / tc
- DCR > 1 means the loop is memory-bound (see the sketch below)
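As a rough illustration, the DCR could be estimated per tile as below. The II-based tc and bandwidth-based td are assumptions of this sketch; the slides do not spell out how tc and td are measured.

```python
# Rough sketch of the bottleneck analysis (illustrative models only):
# tc from the software-pipelined schedule, td from the DMA traffic of one tile.
def estimate_dcr(iters_per_tile, ii_cycles, bytes_per_tile, dma_bytes_per_cycle):
    tc = iters_per_tile * ii_cycles             # computation time of one tile
    td = bytes_per_tile / dma_bytes_per_cycle   # data transfer time of one tile
    return td / tc

dcr = estimate_dcr(iters_per_tile=64, ii_cycles=4,
                   bytes_per_tile=4096, dma_bytes_per_cycle=4)
print(dcr, "memory-bound" if dcr > 1 else "compute-bound")   # 4.0 memory-bound
```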

Preprocessing 2: Data reuse analysis
- Finds the amount of potential data reuse
- Creates a DRG (Data Reuse Graph): nodes correspond to memory operations, and edge weights approximate the amount of reuse
- An edge weight is estimated as TS - rd, where TS is the tile size and rd is the reuse distance in iterations
[Figure: example DRG over S[i], S[i+1], S[i+5], D[i], D[i+10], R[i], R2[i]]

Application mapping flow
- The DCR and the DRG are used for the cost calculation inside memory-aware modulo scheduling

Mapping with the data reuse opportunity cost (DROC)
- New total cost = memory-unaware cost + DROC, so placements that keep an array and its reuse partners in the same bank are preferred
[Figure: two placements of LD A[i], LD A[i+1], LD B[i] across Bank1/Bank2, comparing the memory-unaware cost, the DROC, and the new total cost]

BBC (Bank Balancing Cost)
- Prevents allocating all data to just one bank
- BBC(b) = B * A(b), where B is the base balancing cost (a design parameter) and A(b) is the number of arrays already mapped onto bank b
- A sketch of how DROC and BBC might compose into the placement cost follows below
[Figure: BBC example for a candidate array placement]
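The following sketch shows one way these terms might compose into the placement cost used during memory-aware modulo scheduling. `base_cost` stands in for whatever memory-unaware cost the underlying scheduler (e.g., EMS) already computes; the DROC/BBC details and the DCR gating are an illustration of the slide's definitions, not the authors' code.

```python
# Illustrative composition of the cost terms (not the authors' implementation).
BASE_BALANCING_COST = 10   # "B", the design parameter in BBC(b) = B * A(b)

def bbc(bank, arrays_on_bank):
    """Bank Balancing Cost: proportional to the arrays already mapped onto the bank."""
    return BASE_BALANCING_COST * len(arrays_on_bank.get(bank, set()))

def droc(mem_op, bank, drg, placed_bank):
    """Data Reuse Opportunity Cost: charge the DRG edge weight (~ TS - rd) for every
    reuse partner already placed in a *different* bank, i.e. for reuse that is lost."""
    return sum(w for partner, w in drg.get(mem_op, [])
               if placed_bank.get(partner) not in (None, bank))

def placement_cost(base_cost, mem_op, bank, drg, placed_bank, arrays_on_bank, dcr):
    # Applying the memory-aware terms only when DCR > 1 (memory-bound) is one
    # plausible use of the DCR computed during preprocessing.
    if dcr <= 1.0:
        return base_cost
    return base_cost + droc(mem_op, bank, drg, placed_bank) + bbc(bank, arrays_on_bank)

# Toy example: placing LD A[i+1] on bank 1 while A[i] already sits in bank 0.
drg = {"A[i+1]": [("A[i]", 20)]}
print(placement_cost(40, "A[i+1]", 1, drg,
                     placed_bank={"A[i]": 0},
                     arrays_on_bank={0: {"A"}, 1: set()},
                     dcr=2.0))   # 40 + 20 + 0 = 60
```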

Application mapping flow (extended)
- DFG -> performance bottleneck analysis (DCR) and data reuse analysis (DRG) -> memory-aware modulo scheduling -> mapping -> partial shutdown exploration

Partial Shutdown Exploration
- For a memory-bound loop, the performance is often limited by the memory bandwidth rather than by computation, so computation resources are surplus
- Explores partially shutting down PE rows and memory banks
- Finds the configuration that gives the minimum EDP (Energy-Delay Product); a sketch of the exploration loop follows below
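A sketch of the exploration loop is given below. The runtime and energy models are placeholders standing in for whatever models the tool actually uses; only the enumerate-and-pick-minimum-EDP structure comes from the slides.

```python
# Illustrative partial shutdown exploration (placeholder runtime/energy models).
from itertools import product

def explore(max_rows, max_banks, runtime_model, energy_model):
    best = None
    for rows, banks in product(range(1, max_rows + 1), range(1, max_banks + 1)):
        r = runtime_model(rows, banks)      # fewer rows -> larger tc; fewer banks -> larger td
        e = energy_model(rows, banks, r)    # only active rows/banks consume power
        candidate = (r * e, rows, banks, r, e)   # minimize EDP = runtime * energy
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best

# Hypothetical models, just to make the sketch runnable:
runtime = lambda rows, banks: max(720 // rows, 576 // banks)
energy  = lambda rows, banks, r: (2.0 * rows + 1.0 * banks) * r * 1e-3
print(explore(max_rows=4, max_banks=4, runtime_model=runtime, energy_model=energy))
```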

Example of partial shutdown exploration

  Configuration      Tc    Td    Runtime (R)   Energy (E)   EDP (R*E)
  4 rows - 2 banks   180   288   288           10.46        3012
  2 rows - 2 banks   270   288   288           10.01        2882

[Figure: mappings of LD S[i], LD S[i+1], LD D[i], ST R[i] and the arrays S[], D[], R[] for the 4-row/2-bank and 2-row/2-bank configurations]

Experimental setup
- A set of loop kernels from MiBench, multimedia, and SPEC 2000 benchmarks
- Target architecture:
  - 4x4 heterogeneous CGRA (4 memory-accessible PEs)
  - 4 memory banks, each connected to one row
  - Each PE is connected to its four neighbors and four diagonal ones
- Compared mapping flows:
  - Ideal: memory-unaware mapping + single-bank memory architecture
  - MU: memory-unaware mapping (EMS*) + multi-bank memory architecture
  - MA: memory-aware mapping + multi-bank memory architecture
  - MA + PSE: MA + partial shutdown exploration
  * Edge-Centric Modulo Scheduling for Coarse-Grained Reconfigurable Architectures, Hyunchul Park et al., PACT '08

Runtime comparison
- Compared with MU, MA reduces the runtime by 30%

Energy consumption comparison
- MA + PSE reduces the energy consumption by 47%

Conclusion
- CGRAs provide very high power efficiency while being software-programmable
- While previous solutions have focused on computation speed, we also consider the data transfer to achieve higher performance
- We proposed an effective heuristic that takes the memory architecture into account
- It achieves a 62% reduction in the energy-delay product, which factors into a 47% reduction in energy consumption and a 28% reduction in runtime ((1 - 0.47) x (1 - 0.28) ≈ 0.38)

Thank you for your attention!