5th International Conference , HiPEAC 2010 MEMORY-AWARE APPLICATION MAPPING ON COARSE-GRAINED RECONFIGURABLE ARRAYS Yongjoo Kim, Jongeun Lee * , Aviral Shrivastava ** , Jonghee Yoon and Yunheung Paek **Compiler and Microarchitecture Lab, Center for Embedded Systems, Arizona State University, Tempe, AZ, USA. * Embedded Systems Research Lab, ECE, Ulsan Nat’l Institute of Science & Tech, Ulsan, Korea Software Optimization And Restructuring, Department of Electrical Engineering, Seoul National University, Seoul, South Korea 2010-01-25

Page 1: 5th International Conference ,  HiPEAC  2010

MEMORY-AWARE APPLICATION MAPPING ON COARSE-GRAINED RECONFIGURABLE ARRAYS

Yongjoo Kim, Jongeun Lee*, Aviral Shrivastava**, Jonghee Yoon and Yunheung Paek

Software Optimization And Restructuring, Department of Electrical Engineering, Seoul National University, Seoul, South Korea

* Embedded Systems Research Lab, ECE, Ulsan Nat’l Institute of Science & Tech, Ulsan, Korea

** Compiler and Microarchitecture Lab, Center for Embedded Systems, Arizona State University, Tempe, AZ, USA

2010-01-25

Page 2: 5th International Conference ,  HiPEAC  2010

Coarse-Grained Reconfigurable Array (CGRA)

SO&R and CML Research Group

High computation throughput

Low power consumption and scalability

High flexibility with fast configuration

Category     Processor       MIPS    mW     MIPS/mW
Embedded     Xscale          1250    1600   0.78
DSP          TI TM320C6455   9.57    3.3    2.9
DSP (VLIW)   TI TM320C614T   4.711   0.67   7

* CGRA shows 10~100 MIPS/mW

Page 3: 5th International Conference ,  HiPEAC  2010

Coarse-Grained Reconfigurable Array (CGRA)

Array of PEs (processing elements)

Mesh-like network

Each PE operates on the results of its neighboring PEs

Executes computation-intensive kernels

Page 4: 5th International Conference ,  HiPEAC  2010

Application mapping in CGRA

Map the DFG (data-flow graph) onto the PE array mapping space

The mapping should satisfy several conditions
- Each node must be mapped onto a PE that has the right functionality
- Data transfer between nodes must be guaranteed
- Resource consumption should be minimized for performance

Page 5: 5th International Conference ,  HiPEAC  2010

CGRA execution & data mapping

tc : computation time, td : data transfer time

[Figure: PE array with its configuration memory, next to a local memory of four banks (Bk1 to Bk4), each with two buffers (buf1, buf2), connected to main memory through DMA. The two buffers per bank implement double buffering.]

Total runtime = max(tc, td)
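The runtime rule above can be illustrated with a minimal Python sketch. The function name `tile_runtime` and the serialized fallback `tc + td` for the case without double buffering are assumptions made for illustration; the slide only states the `max(tc, td)` case.

```python
def tile_runtime(tc: float, td: float, double_buffered: bool = True) -> float:
    """Runtime of one tile of a loop kernel.

    With double buffering, DMA transfer of the next tile overlaps with
    computation on the current tile, so the slower of the two dominates.
    Without it (an assumption, not from the slides), the two phases are
    taken to run back to back.
    """
    if double_buffered:
        return max(tc, td)
    return tc + td


# Hypothetical numbers: computation takes 180 cycles, transfer takes 288.
overlapped = tile_runtime(180, 288)
serialized = tile_runtime(180, 288, double_buffered=False)
```

When td exceeds tc, as in this example, reducing computation time further brings no benefit, which is the motivation for looking at data mapping.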

Page 6: 5th International Conference ,  HiPEAC  2010

The performance bottleneck : Data transfer

Many multimedia kernels show a bigger td than tc

On average, tc accounts for just 22% of tc + td

Most applications are memory-bound.

[Figure: The ratio between tc and td. Stacked bars of data transfer time and computation time on a 0% to 100% axis (100% = tc + td) for swim_calc1, swim_calc2, *compress, *lowpass, laplace, form_prediction, wavelet, SOR, *GSR, sobel, and AVERAGE.]

Page 7: 5th International Conference ,  HiPEAC  2010

Computation Mapping & Data Mapping

Duplicated arrays increase data transfer time

[Figure: DFG in which LD S[i] and LD S[i+1] feed an add node, alongside the local memory holding S[i] and S[i+1].]

Page 8: 5th International Conference ,  HiPEAC  2010

Contributions of this work

First approach to consider computation mapping and data mapping together
- balance tc and td
- minimize duplicate arrays (maximize data reuse)
- balance bank utilization

Simple yet effective extension
- a set of cost functions
- can be plugged into existing compilation frameworks
- e.g., EMS (edge-centric modulo scheduling)

Page 9: 5th International Conference ,  HiPEAC  2010

Application mapping flow

[Figure: DFG -> Performance Bottleneck Analysis (produces DCR) and Data Reuse Analysis (produces DRG) -> Memory-aware Modulo Scheduling -> Mapping]

Page 10: 5th International Conference ,  HiPEAC  2010

Preprocessing 1 : Performance bottleneck analysis

Determines whether it is computation or data transfer that limits the overall performance

Calculate DCR (data-transfer-to-computation time ratio)

DCR = td / tc

DCR > 1 : the loop is memory-bound
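A minimal sketch of the bottleneck test, assuming tc and td have already been estimated for the loop. The function names `dcr` and `is_memory_bound` are illustrative and not taken from the presentation.

```python
def dcr(td: float, tc: float) -> float:
    """Data-transfer-to-computation time ratio: DCR = td / tc."""
    if tc <= 0:
        raise ValueError("tc must be positive")
    return td / tc


def is_memory_bound(td: float, tc: float) -> bool:
    """A loop is memory-bound when DCR > 1, i.e. transfer takes longer."""
    return dcr(td, tc) > 1.0


# If tc is 22% of tc + td (the average reported earlier), td is 78%.
average_dcr = dcr(78, 22)  # roughly 3.5, well above 1
```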

Page 11: 5th International Conference ,  HiPEAC  2010

Preprocessing 2 : Data reuse analysis

Find the amount of potential data reuse

Creates a DRG (Data Reuse Graph)
- nodes correspond to memory operations and edge weights approximate the amount of reuse

The edge weight is estimated to be TS - rd
- TS : the tile size
- rd : the reuse distance in iterations

[Figure: DRG with nodes S[i], S[i+1], S[i+5], D[i], D[i+10], R[i], and R2[i].]
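The edge-weight rule can be sketched as follows, using the memory references from the DRG figure. This is an illustration under stated assumptions: references are modeled as `(array, offset)` pairs, the reuse distance is taken as the absolute offset difference, edges with non-positive weight are dropped, and the tile size of 100 is a made-up value. None of these details are specified on the slide.

```python
from itertools import combinations


def build_drg(refs, tile_size):
    """Build a data reuse graph as {(ref_a, ref_b): weight}.

    refs: list of (array_name, offset) memory references, e.g. ("S", 1)
    for S[i+1]. Only references to the same array can reuse data.
    """
    edges = {}
    for a, b in combinations(refs, 2):
        if a[0] != b[0]:
            continue  # different arrays share no data
        rd = abs(a[1] - b[1])       # reuse distance in iterations (assumed)
        weight = tile_size - rd     # TS - rd, as on the slide
        if weight > 0:              # assumption: drop edges with no reuse
            edges[(a, b)] = weight
    return edges


refs = [("S", 0), ("S", 1), ("S", 5), ("D", 0), ("D", 10), ("R", 0), ("R2", 0)]
drg = build_drg(refs, tile_size=100)
```

References that are close in iteration space, such as S[i] and S[i+1], get the heaviest edges, so a mapper has the most to gain by keeping them on the same bank.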

Page 12: 5th International Conference ,  HiPEAC  2010

Application mapping flow

[Figure: DFG -> Performance Bottleneck Analysis (produces DCR) and Data Reuse Analysis (produces DRG) -> Memory-aware Modulo Scheduling -> Mapping]

DCR & DRG are used for cost calculation

Page 13: 5th International Conference ,  HiPEAC  2010

Mapping with data reuse opportunity cost

[Figure: example schedule on PE0 to PE3 over cycles 0 to 4 for a DFG whose memory operations access A[i], A[i+1], B[i], and B[i+1], with the PE array attached to a two-bank local memory (Bank1, Bank2). Three cost tables for the candidate slots are shown: memory-unaware cost, data reuse opportunity cost (DROC), and new total cost.]

New total cost = memory-unaware cost + DROC
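The slide gives only the sum, not how DROC itself is derived, so the following is a hypothetical sketch rather than the authors' cost function. It assumes that a candidate slot pays a penalty when it would access an array through a bank other than the one where that array is already placed (forcing a duplicate copy), and pays nothing otherwise. All names and numbers are illustrative.

```python
def droc(array, bank, placement, reuse_weight):
    """Hypothetical data reuse opportunity cost.

    placement: {array_name: bank} for arrays already placed.
    Returns reuse_weight if using `bank` would forfeit reuse of an array
    that already lives in a different bank, else 0.
    """
    placed = placement.get(array)
    if placed is None or placed == bank:
        return 0
    return reuse_weight


def total_cost(base_cost, array, bank, placement, reuse_weight):
    """New total cost = memory-unaware cost + DROC."""
    return base_cost + droc(array, bank, placement, reuse_weight)


# Array A already sits in bank 1. A slot on bank 2 looks cheaper to a
# memory-unaware scheduler (50 vs 60) but loses once DROC is added.
placement = {"A": 1}
cost_bank1 = total_cost(60, "A", 1, placement, reuse_weight=20)
cost_bank2 = total_cost(50, "A", 2, placement, reuse_weight=20)
```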

Page 14: 5th International Conference ,  HiPEAC  2010

BBC (Bank Balancing Cost)

To prevent allocating all data to just one bank

BBC(b) = β × A(b)
- β : the base balancing cost (a design parameter)
- A(b) : the number of arrays already mapped onto bank b

[Figure: example with β = 10 on PE0 to PE3 and a two-bank local memory (Bank1, Bank2): with A[i], A[i+1] already mapped, the two candidate slots for the next memory operation receive bank balancing costs of +10 and +0.]
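The formula BBC(b) = β × A(b) translates directly into code. In this sketch the dictionary `arrays_per_bank` and the bank labels are assumed names for illustration; β = 10 matches the value in the figure.

```python
def bbc(bank, arrays_per_bank, beta=10):
    """Bank balancing cost: beta times the number of arrays already on `bank`."""
    return beta * arrays_per_bank.get(bank, 0)


# One array is already mapped onto Bank1, none onto Bank2.
arrays_per_bank = {"Bank1": 1, "Bank2": 0}
costs = {bank: bbc(bank, arrays_per_bank) for bank in arrays_per_bank}
```

Because the cost grows linearly with the number of arrays on a bank, each additional array makes that bank less attractive, which spreads data across banks.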

Page 15: 5th International Conference ,  HiPEAC  2010

Application mapping flow

[Figure: DFG -> Performance Bottleneck Analysis (produces DCR) and Data Reuse Analysis (produces DRG) -> Memory-aware Modulo Scheduling -> Mapping, extended with a Partial Shutdown Exploration step.]

Page 16: 5th International Conference ,  HiPEAC  2010

Partial Shutdown Exploration

For a memory-bound loop, the performance is often limited by the memory bandwidth rather than by computation.
- Computation resources are therefore in surplus.

Partial shutdown exploration over the PE rows and the memory banks
- finds the configuration that gives the minimum EDP (Energy-Delay Product)

Page 17: 5th International Conference ,  HiPEAC  2010

Example of Partial shutdown exploration

Config   Tc    Td    R     E       R*E
4r-2m    180   288   288   10.46   3012
2r-2m    270   288   288   10.01   2882

[Figure: DFG with nodes 0 to 8, including LD S[i], LD S[i+1], LD D[i], and ST R[i], and its mappings onto a 4-row, 2-bank configuration and a 2-row, 2-bank configuration, with banks holding S[…] and D[…], R[…].]
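The exploration can be sketched as picking the configuration with the smallest energy-delay product. The numbers below are the two rows from the example table; reading R as max(Tc, Td) is an assumption that is consistent with both rows and with the double-buffering runtime rule, and the names `edp` and `best_config` are illustrative.

```python
# (Tc, Td, E) per configuration, taken from the example table.
CONFIGS = {
    "4r-2m": (180, 288, 10.46),
    "2r-2m": (270, 288, 10.01),
}


def edp(tc, td, energy):
    """Energy-delay product R * E, with runtime R = max(Tc, Td) (assumed)."""
    runtime = max(tc, td)
    return runtime * energy


def best_config(configs):
    """Return the name of the configuration with the minimum EDP."""
    return min(configs, key=lambda name: edp(*configs[name]))


chosen = best_config(CONFIGS)
```

Shutting down two PE rows raises Tc from 180 to 270, but since Td = 288 still dominates, the runtime is unchanged while energy drops, so the 2-row configuration wins.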

Page 18: 5th International Conference ,  HiPEAC  2010

Experimental Setup

A set of loop kernels from MiBench, multimedia, and SPEC 2000 benchmarks

Target architecture
- 4x4 heterogeneous CGRA (4 memory-accessible PEs)
- 4 memory banks, each connected to one row
- Each PE is connected to its four neighbors and four diagonal ones

Compared with other mapping flows
- Ideal : memory-unaware mapping + single-bank memory architecture
- MU : memory-unaware mapping (*EMS) + multi-bank memory architecture
- MA : memory-aware mapping + multi-bank memory architecture
- MA + PSE : MA + partial shutdown exploration

* Edge-centric Modulo Scheduling for Coarse-Grained Reconfigurable Architectures, Hyunchul Park et al., PACT 08

Page 19: 5th International Conference ,  HiPEAC  2010

Runtime comparison

Compared with MU, MA reduces the runtime by 30%

[Figure: normalized runtime (0 to 1) of Ideal, MU, MA, and MA+PSE for form_pred, laplace, sobel, SOR, swim_calc1, swim_calc2, wavelet, *compress, *GSR, *lowpass, and AVERAGE.]

Page 20: 5th International Conference ,  HiPEAC  2010

Energy consumption comparison

MA + PSE reduces energy consumption by 47%.

[Figure: normalized energy (0 to 1) of MU, MA, and MA+PSE for form_pred, laplace, sobel, SOR, swim_calc1, swim_calc2, wavelet, *compress, *GSR, *lowpass, and AVERAGE.]

Page 21: 5th International Conference ,  HiPEAC  2010

Conclusion

CGRAs provide very high power efficiency while being software programmable.

While previous solutions have focused on computation speed, we also consider data transfer to achieve higher performance.

We proposed an effective heuristic that considers the memory architecture.

It achieves a 62% reduction in the energy-delay product, which factors into 47% and 28% reductions in energy consumption and runtime, respectively.
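As a quick consistency check on these figures, assuming the energy-delay product is simply energy multiplied by runtime, the two individual reductions combine as:

```latex
\[
1 - (1 - 0.47)(1 - 0.28) = 1 - 0.53 \times 0.72 = 1 - 0.3816 \approx 0.62
\]
```

The remaining EDP is about 38% of the baseline, which matches the reported 62% reduction.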

Page 22: 5th International Conference ,  HiPEAC  2010

Thank you for your attention!