CGRA Express: Accelerating Execution using Dynamic Operation Fusion


Yongjun Park, Hyunchul Park, Scott Mahlke


CCCP Research Group, University of Michigan

University of Michigan, Electrical Engineering and Computer Science


Coarse-Grained Reconfigurable Architecture (CGRA)

Array of PEs connected in a mesh-like interconnect
High throughput with a large number of resources
Distributed hardware offers low cost/power consumption
High flexibility with dynamic reconfiguration


CGRA : Attractive Alternative to ASICs

Viterbi at 80 Mbps
H.264 at 30 fps
50-60 MOps/mW



Suitable for running multimedia applications for future embedded systems

High throughput, low power consumption, high flexibility

Morphosys: 8x8 array with RISC processor
SiliconHive: hierarchical systolic array
ADRES: 4x4 array with tightly coupled VLIW


Performance Bottleneck: Acyclic Code


[Figure: execution time of an application's blocks (Block 0 through Block 5) under a normal schedule and under software pipelining. Original: loop region dominant. Software pipeline: acyclic region dominant.]

Acyclic region is substantial!

It’s time to optimize acyclic code.


Key Idea: Chaining Instructions


1. Clock period: longest operation with register file access.
2. CGRA is not VLIW. Register file access is not frequent!
3. Opportunity of instruction chaining.
4. Considerable register access time ≈ arithmetic operation delay

(3.5ns clock period @ IBM 90nm)

Group           Opcode          Delay (ns)
Multi-cycle op  MUL, LD, ST     1.65
Arith           ADD, SUB        1.74
Shift           LSL, LSR, ASR   1.36
Comp            EQ, NE, LT      0.93
Logic           AND, OR, XOR    0.73
RF access                       1.61

Critical path: slow! Non-critical path: fast!
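As a rough check of the chaining opportunity, the delays in the table can be summed against the 3.5 ns clock period. The sketch below is only an illustration: the helper names `chain_delay` and `fits_in_clock` are hypothetical, and whether a given chain pays for a register file access is an assumption of this model, not something the slides specify.

```python
# Delays in ns, copied from the table above (IBM 90nm).
DELAY_NS = {
    "MUL": 1.65, "LD": 1.65, "ST": 1.65,    # multi-cycle op group
    "ADD": 1.74, "SUB": 1.74,               # arithmetic
    "LSL": 1.36, "LSR": 1.36, "ASR": 1.36,  # shift
    "EQ": 0.93, "NE": 0.93, "LT": 0.93,     # compare
    "AND": 0.73, "OR": 0.73, "XOR": 0.73,   # logic
}
RF_ACCESS_NS = 1.61
CLOCK_NS = 3.5


def chain_delay(opcodes, rf_accesses=0):
    """Total delay of back-to-back operations plus any RF accesses (assumed model)."""
    return sum(DELAY_NS[op] for op in opcodes) + rf_accesses * RF_ACCESS_NS


def fits_in_clock(opcodes, rf_accesses=0):
    """True if the whole chain completes within one clock period."""
    return chain_delay(opcodes, rf_accesses) <= CLOCK_NS


# A single ADD with one RF access sets the critical path and still fits.
print(fits_in_clock(["ADD"], rf_accesses=1))
# Two short operations plus an RF access also fit in one period.
print(fits_in_clock(["AND", "EQ"], rf_accesses=1))
# Two ADDs fit only if the intermediate value bypasses the register file.
print(fits_in_clock(["ADD", "ADD"], rf_accesses=1), fits_in_clock(["ADD", "ADD"]))
```

The point of the sketch is that once the register file access is taken off the path, a second operation has roughly the same time budget that the access used to consume.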


[Figure: ADD, ADD, LSR chain on a 4x4 CGRA. Current: 3 cycles]

Dynamic Operation Fusion


Execute multiple dependent operations in one cycle

Key benefits
1. Minimal hardware overhead
2. Multiple subgraphs can be executed simultaneously.
3. Dynamic merging of FUs

[Figure: dataflow graph mapped onto a 4x4 CGRA. Labels: ADD, MUL, LD, ADD, 512, LSR, 10, A, B, Out, Add512r10]

Assumption: instruction time = RF read time = RF write time

Operation fusion: 1 cycle
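Under the slide's stated assumption (instruction time = RF read time = RF write time), one clock cycle has room for three such time units. The sketch below counts cycles for a dependent chain with and without fusion. The function names and the unit-time model are illustrative assumptions, and the fused count is a simple lower bound that ignores placement and routing.

```python
import math

# Assumption from the slide: RF read, one instruction, and RF write each take
# one time unit, so a conventional cycle spans three units.
UNITS_PER_CYCLE = 3


def conventional_cycles(num_ops):
    """Without fusion, each dependent operation occupies its own cycle."""
    return num_ops


def fused_cycles(num_ops, rf_reads=0, rf_writes=0):
    """With fusion, dependent operations run back to back within a cycle.

    Lower bound only: assumes the chain can be packed freely into cycles.
    """
    units = num_ops + rf_reads + rf_writes
    return math.ceil(units / UNITS_PER_CYCLE)


chain = ["ADD", "ADD", "LSR"]
print(conventional_cycles(len(chain)))  # 3
print(fused_cycles(len(chain)))         # 1
```

With no register file access on the path, the three chained operations fill exactly one cycle, which matches the 3-cycle versus 1-cycle comparison on the slide.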


Hardware Support


Simple bypass network

              baseline  modified  overhead (%)
control bit   845       877       3.8
area (mm^2)   1.447     1.48      2.3

Small overhead: 3.8% (SRAM), 2.3% (MUX)
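The overhead column follows directly from the baseline and modified figures. A short check, with `overhead_percent` as a hypothetical helper name:

```python
def overhead_percent(baseline, modified):
    """Relative increase of the modified design over the baseline, in percent."""
    return (modified - baseline) / baseline * 100.0


control_bit_overhead = overhead_percent(845, 877)
area_overhead = overhead_percent(1.447, 1.48)
print(f"control bits: {control_bit_overhead:.1f}%")
print(f"area: {area_overhead:.1f}%")
```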


Compiler Support


Tick-based scheduling
Tick: small time unit based on hardware delay information
Clock cycle = # of ticks

Clock boundary constraint checking
Resource conflict
Time conflict
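One way to picture tick-based scheduling is a placement routine that rejects any slot where an operation would straddle a clock edge (time conflict) or where the FU is already taken (resource conflict). The sketch below is a minimal illustration, not the authors' scheduler: the 0.25 ns tick size, the one-operation-per-FU-per-cycle rule, and the names `to_ticks`, `crosses_clock_boundary`, and `ScheduleTable` are all assumptions.

```python
import math

TICK_NS = 0.25   # hypothetical tick size; the slides do not give one
CLOCK_NS = 3.5   # clock period quoted earlier in the slides
TICKS_PER_CYCLE = round(CLOCK_NS / TICK_NS)  # clock cycle = # of ticks


def to_ticks(delay_ns):
    """Round a hardware delay up to a whole number of ticks."""
    return math.ceil(delay_ns / TICK_NS)


def crosses_clock_boundary(start_tick, duration_ticks):
    """Time conflict: the operation would straddle a clock edge."""
    first_cycle = start_tick // TICKS_PER_CYCLE
    last_cycle = (start_tick + duration_ticks - 1) // TICKS_PER_CYCLE
    return first_cycle != last_cycle


class ScheduleTable:
    """Records (FU, cycle) slots; assumes one operation per FU per cycle."""

    def __init__(self):
        self.busy = set()

    def place(self, fu, ready_tick, duration_ticks):
        """Return the earliest legal start tick at or after ready_tick."""
        if duration_ticks > TICKS_PER_CYCLE:
            raise ValueError("operation does not fit in one clock cycle")
        tick = ready_tick
        while True:
            cycle = tick // TICKS_PER_CYCLE
            if crosses_clock_boundary(tick, duration_ticks) or (fu, cycle) in self.busy:
                tick = (cycle + 1) * TICKS_PER_CYCLE  # retry at the next cycle
                continue
            self.busy.add((fu, cycle))
            return tick


table = ScheduleTable()
add_start = table.place("FU0", 0, to_ticks(1.74))   # ADD
lsr_start = table.place("FU1", add_start + to_ticks(1.74), to_ticks(1.36))  # chained LSR
and_start = table.place("FU2", lsr_start + to_ticks(1.36), to_ticks(0.73))  # pushed to next cycle
print(add_start, lsr_start, and_start)
```

In this run the ADD and the dependent LSR share the first cycle, while the AND would cross the clock edge and is therefore moved to the start of the next cycle.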








Dynamic Operation Fusion Example(1)


1. Conventional Scheduling

[Figure: dataflow graph, schedule table (columns FU0 to FU5), and CGRA mapping. The dataflow graph contains SUB(0), ADD(1), ADD(2), LSR(3), LSL(4), and ADD(5), with register-file and constant operands (RF[0], RF[1], RF[2], const). The animation fills the schedule table one operation at a time.]

Time  Scheduled operations
0     OP 0, OP 1
1     OP 2
2     OP 3
3     OP 4
4     OP 5

Conventional scheduling: 5 cycles
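The conventional schedule can be reproduced with a simple as-soon-as-possible pass in which every operation takes a full cycle. The dependence edges in `PREDS` are an assumption inferred from the order in which the operations appear in the schedule table; the slides' dataflow graph itself did not survive extraction.

```python
# Hypothetical dependence edges, chosen to be consistent with the schedule table:
# OP 2 depends on OP 0 and OP 1, and OP 3, OP 4, OP 5 form a chain after OP 2.
PREDS = {0: [], 1: [], 2: [0, 1], 3: [2], 4: [3], 5: [4]}


def conventional_schedule(preds):
    """ASAP schedule where each operation occupies one whole cycle."""
    cycle = {}
    for op in sorted(preds):  # op ids are assumed to be in topological order
        cycle[op] = max((cycle[p] + 1 for p in preds[op]), default=0)
    return cycle


schedule = conventional_schedule(PREDS)
total_cycles = max(schedule.values()) + 1
print(schedule)
print(total_cycles)
```

Because each dependent operation waits for a clock edge, the length of the schedule equals the length of the longest dependence chain.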


Dynamic Operation Fusion Example(2)


2. Dynamic Operation Fusion

[Figure: the same dataflow graph, schedule table (columns FU0 to FU5), and CGRA mapping as in the conventional example. The animation fills the schedule table one operation at a time.]

Time  Scheduled operations
0     RF, RF; OP 0, OP 1; OP 2
1     OP 3; OP 4
2     OP 5; RF

Dynamic operation fusion: 3 cycles
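The fused schedule can be approximated with a unit-tick model: three ticks per cycle, one tick per operation, and one tick for each register file read or write. The dependence edges, the choice of which operations touch the register file, and the rule that an operation stays in the same cycle as its own RF access are assumptions made for this sketch.

```python
TICKS_PER_CYCLE = 3  # assumption: RF read, one operation, RF write take 1 tick each

# Hypothetical dependence edges, consistent with the schedule table.
PREDS = {0: [], 1: [], 2: [0, 1], 3: [2], 4: [3], 5: [4]}
READS_RF = {0, 1}   # assumed: OP 0 and OP 1 read their operands from the RF
WRITES_RF = {5}     # assumed: OP 5 writes its result back to the RF


def fused_schedule(preds, reads_rf, writes_rf, ticks_per_cycle=TICKS_PER_CYCLE):
    """Return {op: tick at which the operation itself executes}."""
    exec_tick = {}
    for op in sorted(preds):  # op ids are assumed to be in topological order
        ready = max((exec_tick[p] + 1 for p in preds[op]), default=0)
        lead = 1 if op in reads_rf else 0   # RF read just before the operation
        tail = 1 if op in writes_rf else 0  # RF write just after the operation
        span = lead + 1 + tail
        start = ready
        # Clock boundary constraint: read + operation + write stay in one cycle.
        if start // ticks_per_cycle != (start + span - 1) // ticks_per_cycle:
            start = (start // ticks_per_cycle + 1) * ticks_per_cycle
        exec_tick[op] = start + lead
    return exec_tick


ticks = fused_schedule(PREDS, READS_RF, WRITES_RF)
cycle_of = {op: t // TICKS_PER_CYCLE for op, t in ticks.items()}
last_tick = max(t + (1 if op in WRITES_RF else 0) for op, t in ticks.items())
total_cycles = last_tick // TICKS_PER_CYCLE + 1
print(cycle_of)
print(total_cycles)
```

Under these assumptions OP 5 cannot share a cycle with OP 3 and OP 4, because its register file write would spill past the clock edge, so it moves to the third cycle.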


Experimental Setup


Benchmarks: multimedia applications for embedded systems
Audio decoding (AAC)
Video decoding (H.264)
3D graphics (3D)

Two designs
baseline: 4x4 heterogeneous CGRA
express: 4x4 heterogeneous CGRA with bypass network


Performance Enhancement


Express achieves a 7-17% reduction in execution time.
Most of the reduction comes from the acyclic code region.

Express also improves the performance of resource-constrained loops.
The bypass network gives more freedom to the compiler.


Detailed Result for 3D Graphics


Target application: 3D graphics

Power consumption: 3% higher than the baseline

Performance enhancement: 17% faster than the baseline

Energy consumption: 15% more efficient

                       baseline  express  ratio
power (mW)             298.26    306.78   102.86%
# of cycles (million)  156.81    130.22   83.04%
energy (mJ)            233.85    199.74   85.42%
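The three rows are related through energy = power × time. Assuming both designs run at the same clock frequency (an assumption; the slide does not state it), the time ratio equals the cycle ratio, so the energy ratio should be close to the product of the power and cycle ratios. A small consistency check:

```python
# Figures copied from the table above.
baseline = {"power_mw": 298.26, "cycles_m": 156.81, "energy_mj": 233.85}
express = {"power_mw": 306.78, "cycles_m": 130.22, "energy_mj": 199.74}


def ratio(key):
    """Express value relative to the baseline value."""
    return express[key] / baseline[key]


power_ratio = ratio("power_mw")
cycle_ratio = ratio("cycles_m")
energy_ratio = ratio("energy_mj")

# Energy = power x time; with an equal clock frequency, time scales with cycles.
predicted_energy_ratio = power_ratio * cycle_ratio

print(f"power {power_ratio:.2%}, cycles {cycle_ratio:.2%}")
print(f"energy reported {energy_ratio:.2%}, predicted {predicted_energy_ratio:.2%}")
```

The small power increase from the bypass network is outweighed by the cycle reduction, which is where the net energy saving comes from.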


Conclusion


Acyclic region becomes the performance bottleneck.
The run-time for loops decreases by large factors.

Dynamic operation fusion enables executing back-to-back operations in a cycle.
Bypass network
Tick-based scheduler

Up to 17% faster and 15% more energy efficient with 3% hardware overhead


Questions?
