CGRA Express: Accelerating Execution using Dynamic Operation Fusion
Yongjun Park, Hyunchul Park, Scott Mahlke
CCCP Research Group, University of Michigan

Page 1: CGRA Express: Accelerating Execution using Dynamic Operation Fusion

CGRA Express: Accelerating Execution using Dynamic Operation Fusion

Yongjun Park, Hyunchul Park, Scott Mahlke

CCCP Research Group, University of Michigan

Page 2: CGRA Express: Accelerating Execution using Dynamic Operation Fusion

University of Michigan, Electrical Engineering and Computer Science

Coarse-Grained Reconfigurable Architecture (CGRA)

  • Array of PEs connected in a mesh-like interconnect
  • High throughput with a large number of resources
  • Distributed hardware offers low cost/power consumption
  • High flexibility with dynamic reconfiguration

Page 3: CGRA Express: Accelerating Execution using Dynamic Operation Fusion

CGRA: Attractive Alternative to ASICs

Suitable for running multimedia applications for future embedded systems

  • High throughput, low power consumption, high flexibility
  • Viterbi at 80 Mbps, H.264 at 30 fps, 50-60 MOps/mW

Example architectures

  • Morphosys: 8x8 array with RISC processor
  • SiliconHive: hierarchical systolic array
  • ADRES: 4x4 array with tightly coupled VLIW

Page 4: CGRA Express: Accelerating Execution using Dynamic Operation Fusion

Performance Bottleneck: Acyclic Code

[Figure: an application made of Block 0 to Block 5, with per-block execution time shown for a normal schedule and for a software-pipelined schedule. Original: loop region dominant. Software pipeline: acyclic region dominant.]

Acyclic region is substantial!

It’s time to optimize acyclic code.

Page 5: CGRA Express: Accelerating Execution using Dynamic Operation Fusion

Key Idea: Chaining Instructions

1. Clock period: longest operation with register file access.
2. CGRA is not VLIW: register file access is not frequent!
3. Opportunity for instruction chaining.
4. Considerable register access time ≈ arithmetic operation delay

(3.5ns clock period @ IBM 90nm)

Group            Opcode           Delay (ns)
Multi cycle op   MUL, LD, ST      1.65
Arith            ADD, SUB         1.74
Shift            LSL, LSR, ASR    1.36
Comp             EQ, NE, LT       0.93
Logic            AND, OR, XOR     0.73
RF Access                         1.61

Critical path: slow! Non-critical path: fast!
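
A rough way to see the chaining opportunity in the delay table is to add up the delays of back-to-back operations and compare the sum with the clock period. The sketch below assumes that a chain fits in one cycle whenever its summed delay is at most 3.5 ns. That is a simplification: bypass and wire delays are ignored, and multi-cycle operations are left out because the slide does not say how they chain. The names `DELAY_NS`, `chain_delay`, and `fits_in_cycle` are illustrative and do not come from the slides.

```python
# Delays in ns, copied from the table above (single-cycle groups only).
DELAY_NS = {
    "ADD": 1.74, "SUB": 1.74,
    "LSL": 1.36, "LSR": 1.36, "ASR": 1.36,
    "EQ": 0.93, "NE": 0.93, "LT": 0.93,
    "AND": 0.73, "OR": 0.73, "XOR": 0.73,
}
RF_ACCESS_NS = 1.61   # register file access delay from the table
CLOCK_NS = 3.5        # clock period quoted on the slide


def chain_delay(ops, rf_accesses=0):
    """Summed delay of dependent ops run back to back, plus RF accesses."""
    return sum(DELAY_NS[op] for op in ops) + rf_accesses * RF_ACCESS_NS


def fits_in_cycle(ops, rf_accesses=0):
    """Assumption: a chain fits if its summed delay is within one clock period."""
    # Small tolerance guards against floating-point rounding in the sum.
    return chain_delay(ops, rf_accesses) <= CLOCK_NS + 1e-9


if __name__ == "__main__":
    print(fits_in_cycle(["ADD"], rf_accesses=1))  # one ADD plus one RF access
    print(fits_in_cycle(["ADD", "LSR"]))          # two chained ops, no RF access
    print(fits_in_cycle(["ADD", "ADD", "LSR"]))   # three chained ops
```

Under this model an ADD followed by an LSR takes 3.10 ns and fits in the 3.5 ns period when no register file access sits between them, while a single ADD with one register file access already takes 3.35 ns.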

Page 6: CGRA Express: Accelerating Execution using Dynamic Operation Fusion

Dynamic Operation Fusion

Execute multiple dependent operations in one cycle

Key benefits

1. Minimal hardware overhead
2. Multiple subgraphs can be executed simultaneously.
3. Dynamic merging of FUs

Assumption: instruction time = RF read time = RF write time

[Figure: a chain of dependent operations (ADD, ADD, LSR) mapped onto a 4x4 CGRA. Current: 3 cycles. Operation fusion: 1 cycle.]

Page 7: CGRA Express: Accelerating Execution using Dynamic Operation Fusion

Hardware Support

Simple bypass network

               baseline   modified   overhead (%)
control bit    845        877        3.8
area (mm^2)    1.447      1.48       2.3

Small overhead: 3.8% (SRAM), 2.3% (MUX)
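
The overhead column can be reproduced from the baseline and modified values as a relative increase. The helper name `overhead_percent` below is illustrative; the inputs are the numbers from the table.

```python
def overhead_percent(baseline, modified):
    """Relative increase of `modified` over `baseline`, in percent."""
    return (modified - baseline) / baseline * 100.0


if __name__ == "__main__":
    control_bits = overhead_percent(845, 877)   # control bits (SRAM)
    area = overhead_percent(1.447, 1.48)        # area in mm^2 (MUX)
    print(f"control bit overhead: {control_bits:.1f}%")
    print(f"area overhead: {area:.1f}%")
```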

Page 8: CGRA Express: Accelerating Execution using Dynamic Operation Fusion

Compiler Support

Tick-based scheduling

  • Tick: small time unit based on hardware delay information
  • Clock cycle = # of ticks
  • Clock boundary constraint checking
  • Resource conflict
  • Time conflict
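
The clock boundary constraint can be illustrated with a small placement routine: an operation starts as soon as its input is ready, unless it would straddle a cycle boundary, in which case it is pushed to the start of the next cycle. This is only a sketch under stated assumptions. The tick size of 0.01 ns is hypothetical (the slides do not give one), the delays reuse the 90nm table from the earlier slide, and FU or other resource conflicts are not modeled. `DELAY_TICKS`, `earliest_start`, and `schedule_chain` are illustrative names.

```python
# Hypothetical tick size: 0.01 ns, so a 3.5 ns clock period is 350 ticks.
TICKS_PER_CYCLE = 350

# Operation delays in ticks, taken from the delay table (1.74 ns -> 174 ticks).
DELAY_TICKS = {"ADD": 174, "SUB": 174, "LSL": 136, "LSR": 136,
               "ASR": 136, "EQ": 93, "NE": 93, "LT": 93,
               "AND": 73, "OR": 73, "XOR": 73}


def earliest_start(ready_tick, duration):
    """Earliest start tick at or after ready_tick that does not cross a clock boundary."""
    if duration > TICKS_PER_CYCLE:
        raise ValueError("operation does not fit in a single cycle")
    cycle_end = (ready_tick // TICKS_PER_CYCLE + 1) * TICKS_PER_CYCLE
    if ready_tick + duration <= cycle_end:
        return ready_tick       # fits in the current cycle: chain it
    return cycle_end            # would straddle the boundary: wait for next cycle


def schedule_chain(ops, start_tick=0):
    """Place a chain of dependent ops; returns (op, cycle, tick offset in cycle)."""
    placement = []
    ready = start_tick
    for op in ops:
        start = earliest_start(ready, DELAY_TICKS[op])
        placement.append((op, start // TICKS_PER_CYCLE, start % TICKS_PER_CYCLE))
        ready = start + DELAY_TICKS[op]
    return placement


if __name__ == "__main__":
    for op, cycle, offset in schedule_chain(["ADD", "LSR", "AND"]):
        print(f"{op}: cycle {cycle}, tick {offset}")
```

In this run ADD and LSR share cycle 0 (174 + 136 = 310 ticks), while AND would end at tick 383 and is therefore moved to the start of cycle 1.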

Page 9: CGRA Express: Accelerating Execution using Dynamic Operation Fusion

Dynamic Operation Fusion Example(1)

1. Conventional Scheduling

[Figure: DataFlow Graph, Schedule Table, and CGRA Mapping. DataFlow Graph nodes: SUB(0), ADD(1), ADD(2), LSR(3), LSL(4), ADD(5), with inputs RF[0], RF[1], RF[2] and constants. The CGRA Mapping places OP 0 to OP 5 on FUs around a register file. The schedule table is filled in one operation at a time.]

Schedule Table (Time vs. FU0-FU5)

Time 0: OP 0, OP 1
Time 1: OP 2
Time 2: OP 3
Time 3: OP 4
Time 4: OP 5

1. Conventional Scheduling – 5 cycles

Page 10: CGRA Express: Accelerating Execution using Dynamic Operation Fusion

Dynamic Operation Fusion Example(2)

2. Dynamic Operation Fusion

[Figure: DataFlow Graph, Schedule Table, and CGRA Mapping. DataFlow Graph nodes: SUB(0), ADD(1), ADD(2), LSR(3), LSL(4), ADD(5), with inputs RF[0], RF[1], RF[2] and constants. The CGRA Mapping places OP 0 to OP 5 on FUs around a register file. The schedule table is filled in one step at a time.]

Schedule Table (Time vs. FU0-FU5)

Time 0: RF, RF; OP 0, OP 1; OP 2
Time 1: OP 3; OP 4
Time 2: OP 5; RF

2. Dynamic Operation Fusion – 3 cycles.
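
The two cycle counts can be reproduced with a simple critical-path estimate. The sketch assumes the idealized timing from the earlier slide (instruction time = RF read time = RF write time), so one cycle holds three equal ticks. The dependence edges in `DEPS` are inferred from the schedule tables and are not read off the original graph, and the fused estimate only bounds the cycle count by the critical path; it does not reproduce the exact placement of each operation within a cycle.

```python
import math
from functools import lru_cache

# Assumed dependences, inferred from the schedule tables:
# OP 2 uses OP 0 and OP 1; OP 3, OP 4, OP 5 form a chain after OP 2.
DEPS = {0: (), 1: (), 2: (0, 1), 3: (2,), 4: (3,), 5: (4,)}


def critical_path_length(deps):
    """Number of operations on the longest dependence chain."""
    @lru_cache(maxsize=None)
    def depth(op):
        return 1 + max((depth(p) for p in deps[op]), default=0)
    return max(depth(op) for op in deps)


def conventional_cycles(deps):
    """Without fusion, every dependent operation waits for the next cycle."""
    return critical_path_length(deps)


def fused_cycles(deps, ticks_per_cycle=3):
    """With fusion: one RF read tick, one tick per chained op, one RF write tick."""
    ticks = 1 + critical_path_length(deps) + 1
    return math.ceil(ticks / ticks_per_cycle)


if __name__ == "__main__":
    print("conventional:", conventional_cycles(DEPS), "cycles")
    print("fused:", fused_cycles(DEPS), "cycles")
```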

Page 11: CGRA Express: Accelerating Execution using Dynamic Operation Fusion

Experimental Setup

Benchmarks: multimedia applications for embedded systems

  • Audio decoding (AAC)
  • Video decoding (H.264)
  • 3D graphics (3D)

Two designs

  • baseline: 4x4 heterogeneous CGRA
  • express: 4x4 heterogeneous CGRA with bypass network

Page 12: CGRA Express: Accelerating Execution using Dynamic Operation Fusion

Performance Enhancement

  • Express achieves a 7-17% reduction in execution time. Most of the reduction comes from the acyclic code region.
  • Express also improves the performance of resource-constrained loops. The bypass network gives more freedom to the compiler.

Page 13: CGRA Express: Accelerating Execution using Dynamic Operation Fusion

Detailed Result for 3D Graphics

Target application: 3D graphics

  • Power consumption: 3% higher than the baseline
  • Performance enhancement: 17% faster than the baseline
  • Energy consumption: 15% more efficient

                        baseline   express   ratio
power (mW)              298.26     306.78    102.86%
# of cycles (million)   156.81     130.22    83.04%
energy (mJ)             233.85     199.74    85.42%
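
The energy row is consistent with energy = power × cycle count × clock period. The slide does not state the clock period behind these numbers; the 5 ns value below is an assumption, chosen because it reproduces both reported energy figures. `CLOCK_PERIOD_S` and `energy_mj` are illustrative names.

```python
# Assumption: 5 ns clock period, inferred from the table rather than stated on the slide.
CLOCK_PERIOD_S = 5e-9


def energy_mj(power_mw, cycles_million, period_s=CLOCK_PERIOD_S):
    """Energy in mJ from average power (mW) and cycle count (millions)."""
    power_w = power_mw * 1e-3
    time_s = cycles_million * 1e6 * period_s
    return power_w * time_s * 1e3


if __name__ == "__main__":
    baseline = energy_mj(298.26, 156.81)
    express = energy_mj(306.78, 130.22)
    print(f"baseline: {baseline:.2f} mJ")
    print(f"express:  {express:.2f} mJ")
    print(f"ratio:    {express / baseline:.2%}")
```

The slightly higher power of the express design is outweighed by its lower cycle count, which is why its energy is lower overall.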

Page 14: CGRA Express: Accelerating Execution using Dynamic Operation Fusion

Conclusion

  • The acyclic region becomes the performance bottleneck: the run time of loops decreases by large factors.
  • Dynamic operation fusion makes it possible to execute back-to-back operations in a single cycle
      • Bypass network
      • Tick-based scheduler
  • Up to 17% faster and 15% more energy efficient with 3% hardware overhead

Page 15: CGRA Express: Accelerating Execution using Dynamic Operation Fusion

Questions?
