CGRA Express: Accelerating Execution using Dynamic Operation Fusion. CCCP Research Group, University of Michigan. Yongjun Park, Hyunchul Park, Scott Mahlke.
CGRA Express: Accelerating Execution using Dynamic Operation Fusion
Yongjun Park, Hyunchul Park, Scott Mahlke
CCCP Research Group, University of Michigan
University of Michigan, Electrical Engineering and Computer Science
Coarse-Grained Reconfigurable Architecture (CGRA)
Array of PEs connected in a mesh-like interconnect
High throughput with a large number of resources
Distributed hardware offers low cost/power consumption
High flexibility with dynamic reconfiguration
CGRA: Attractive Alternative to ASICs
Viterbi at 80 Mbps, H.264 at 30 fps, 50-60 MOps/mW
Suitable for running multimedia applications for future embedded systems
High throughput, low power consumption, high flexibility
Morphosys: 8x8 array with RISC processor
SiliconHive: hierarchical systolic array
ADRES: 4x4 array with tightly coupled VLIW
Performance Bottleneck: Acyclic Code
[Figure: application execution time split into Block 0 through Block 5, comparing the normal schedule with the software-pipelined schedule. Original: loop region dominant. Software pipelined: acyclic region dominant.]
Acyclic region is substantial!
It’s time to optimize acyclic code.
Key Idea: Chaining Instructions
1. Clock period: longest operation with register file access.
2. CGRA is not VLIW. Register file access is not frequent!
3. Opportunity for instruction chaining.
4. Considerable register access time ≈ arithmetic operation delay
(3.5 ns clock period @ IBM 90nm)
Group           Opcode          Delay (ns)
Multi-cycle op  MUL, LD, ST     1.65
Arith           ADD, SUB        1.74
Shift           LSL, LSR, ASR   1.36
Comp            EQ, NE, LT      0.93
Logic           AND, OR, XOR    0.73
RF access                       1.61
Critical path: slow! Non-critical path: fast!
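To make the chaining opportunity concrete, here is a minimal sketch that checks whether a chain of dependent operations fits in one 3.5 ns clock period, using the delays from the table. The additive delay model, the function names `chain_delay` and `fits_in_cycle`, and the `rf_accesses` parameter are assumptions made for illustration. They are not taken from the slides. Multi-cycle operations (MUL, LD, ST) are left out.

```python
# Delays in ns, copied from the slide's table (IBM 90nm, 3.5 ns clock period).
DELAY_NS = {
    "ADD": 1.74, "SUB": 1.74,
    "LSL": 1.36, "LSR": 1.36, "ASR": 1.36,
    "EQ": 0.93, "NE": 0.93, "LT": 0.93,
    "AND": 0.73, "OR": 0.73, "XOR": 0.73,
}
RF_ACCESS_NS = 1.61
CLOCK_PERIOD_NS = 3.5


def chain_delay(ops, rf_accesses=0):
    """Total delay of back-to-back dependent ops plus optional RF accesses.

    Assumption for illustration: delays simply add up along the chain.
    """
    return sum(DELAY_NS[op] for op in ops) + rf_accesses * RF_ACCESS_NS


def fits_in_cycle(ops, rf_accesses=0):
    """True if the whole chain completes within one clock period."""
    return chain_delay(ops, rf_accesses) <= CLOCK_PERIOD_NS


# A single ADD with one RF access nearly fills the cycle.
print(fits_in_cycle(["ADD"], rf_accesses=1))
# ADD followed by LSR fits only when the RF access is bypassed.
print(fits_in_cycle(["ADD", "LSR"], rf_accesses=1))
print(fits_in_cycle(["ADD", "LSR"]))
```

Under this simplified model, skipping the register file between dependent operations is what frees enough of the cycle for a second operation.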
Dynamic Operation Fusion
[Figure: dependent operations ADD, ADD, LSR mapped onto a 4x4 CGRA. Current execution: 3 cycles]
Execute multiple dependent operations in one cycle
Key benefits
1. Minimal hardware overhead
2. Multiple subgraphs can be executed simultaneously.
3. Dynamic merging of FUs
[Figure: example dataflow graph with ADD, MUL, LD, ADD and LSR nodes, constants 512 and 10, inputs A and B, and output Out, mapped onto a 4x4 CGRA. With operation fusion: 1 cycle]
Assumption: instruction time = RF read time = RF write time
Hardware Support
Simple bypass network
              baseline  modified  overhead (%)
control bit   845       877       3.8
area (mm^2)   1.447     1.48      2.3
Small overhead: 3.8% (SRAM), 2.3% (MUX)
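The overhead column can be reproduced from the baseline and modified figures. The short sketch below only redoes that arithmetic. The helper name `overhead_percent` is hypothetical.

```python
def overhead_percent(baseline, modified):
    """Relative increase of the modified design over the baseline, in percent."""
    return (modified - baseline) / baseline * 100.0


# Figures taken from the hardware support table.
control_bit_overhead = overhead_percent(845, 877)
area_overhead = overhead_percent(1.447, 1.48)

print(round(control_bit_overhead, 1))  # control bits (SRAM)
print(round(area_overhead, 1))         # area (MUX)
```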
Compiler Support
Tick-based scheduling
  Tick: small time unit based on hardware delay information
  Clock cycle = # of ticks
  Clock boundary constraint checking
Resource conflict
Time conflict
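The idea of tick-based scheduling can be sketched as follows. This is an illustration under stated assumptions, not the compiler described in the slides: the 0.25 ns tick size, the function names, and the greedy placement are all hypothetical, and resource conflicts between functional units are not modeled. Only the time constraint and the clock boundary constraint are checked.

```python
import math

TICK_NS = 0.25            # hypothetical tick size, chosen for illustration
CLOCK_PERIOD_NS = 3.5     # clock period quoted in the slides
TICKS_PER_CYCLE = round(CLOCK_PERIOD_NS / TICK_NS)


def to_ticks(delay_ns):
    """Round a hardware delay up to a whole number of ticks."""
    return math.ceil(delay_ns / TICK_NS - 1e-9)


def schedule(durations, deps):
    """Place operations on a tick timeline.

    durations: dict op -> duration in ticks, listed in topological order.
    deps: dict op -> list of predecessor ops.
    Returns dict op -> (start_tick, end_tick).
    """
    placed = {}
    for op, dur in durations.items():
        if dur > TICKS_PER_CYCLE:
            raise ValueError(f"{op} does not fit in one clock cycle")
        # Time constraint: start only after every predecessor has finished.
        start = max((placed[p][1] for p in deps.get(op, [])), default=0)
        # Clock boundary constraint: an operation may not straddle a boundary.
        cycle_end = (start // TICKS_PER_CYCLE + 1) * TICKS_PER_CYCLE
        if start + dur > cycle_end:
            start = cycle_end
        placed[op] = (start, start + dur)
    return placed


def cycles_used(placed):
    """Number of clock cycles spanned by the schedule."""
    return math.ceil(max(end for _, end in placed.values()) / TICKS_PER_CYCLE)


# A chain ADD -> ADD -> LSR using the delays from the earlier table.
durations = {"add0": to_ticks(1.74), "add1": to_ticks(1.74), "lsr": to_ticks(1.36)}
deps = {"add1": ["add0"], "lsr": ["add1"]}
placed = schedule(durations, deps)
print(placed, cycles_used(placed))
```

With these assumed numbers the two ADDs share the first cycle and the LSR moves to the second, because starting it earlier is impossible and it may not cross a clock boundary.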
Dynamic Operation Fusion Example (1)
1. Conventional scheduling: 5 cycles
[Figure: dataflow graph, schedule table, and CGRA mapping with a register file. Dataflow graph nodes: SUB(0), ADD(1), ADD(2), LSR(3), LSL(4), ADD(5); inputs RF[0], RF[1], RF[2] and constants.]
Schedule table (FU0 to FU5):
Time 0: OP 0, OP 1
Time 1: OP 2
Time 2: OP 3
Time 3: OP 4
Time 4: OP 5
Dynamic Operation Fusion Example (2)
2. Dynamic operation fusion: 3 cycles
[Figure: dataflow graph, schedule table, and CGRA mapping with a register file. Dataflow graph nodes: SUB(0), ADD(1), ADD(2), LSR(3), LSL(4), ADD(5); inputs RF[0], RF[1], RF[2] and constants.]
Schedule table (FU0 to FU5), entries within a cycle listed in order:
Time 0: RF, RF; OP 0, OP 1; OP 2
Time 1: OP 3; OP 4
Time 2: OP 5; RF
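The two cycle counts in this example can be approximated with a back-of-the-envelope sketch. Everything below is an assumption made for illustration: the dependences in `DEPS` are inferred from the order of the conventional schedule, each operation and each RF access is taken to occupy one time slot (following the slide's assumption that instruction time = RF read time = RF write time), and a cycle is assumed to hold three slots. The sketch packs slots as tightly as possible, so it reproduces the totals but not the exact slot placement of the schedule table.

```python
import math

# Assumed dependences, inferred from the order of the conventional schedule:
# OP 0 and OP 1 feed OP 2, which starts a chain OP 2 -> OP 3 -> OP 4 -> OP 5.
DEPS = {0: [], 1: [], 2: [0, 1], 3: [2], 4: [3], 5: [4]}


def asap_levels(deps):
    """Earliest dependence level of each op (op ids are in topological order)."""
    level = {}
    for op in sorted(deps):
        level[op] = 1 + max((level[p] for p in deps[op]), default=0)
    return level


def conventional_cycles(deps):
    """One dependence level per cycle: every result goes through the RF."""
    return max(asap_levels(deps).values())


def fused_cycles(deps, slots_per_cycle=3):
    """Critical path in slots, plus one RF read and one RF write slot."""
    depth = max(asap_levels(deps).values())
    return math.ceil((depth + 2) / slots_per_cycle)


print(conventional_cycles(DEPS), fused_cycles(DEPS))
```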
Experimental Setup
Benchmarks: multimedia applications for embedded systems
  Audio decoding (AAC)
  Video decoding (H.264)
  3D graphics (3D)
Two designs
  baseline: 4x4 heterogeneous CGRA
  express: 4x4 heterogeneous CGRA with bypass network
Performance Enhancement
Express achieves a 7-17% reduction in execution time.
  Most of the reduction comes from the acyclic code region.
Express also improves the performance of resource-constrained loops.
  The bypass network gives the compiler more freedom.
Detailed Result for 3D Graphics
Target application: 3D graphics
Power consumption: 3% higher than the baseline
Performance enhancement: 17% faster than the baseline
Energy consumption: 15% more efficient
                       baseline  express  ratio
power (mW)             298.26    306.78   102.86%
# of cycles (million)  156.81    130.22   83.04%
energy (mJ)            233.85    199.74   85.42%
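The ratio column can be cross-checked with a few lines of arithmetic. The sketch below recomputes each ratio and also compares the energy ratio against the product of the power and cycle ratios, which holds if both designs run at the same clock frequency. That equal-frequency condition is an assumption here, and the dictionary keys are hypothetical names.

```python
# Values copied from the 3D graphics results table.
baseline = {"power_mW": 298.26, "cycles_M": 156.81, "energy_mJ": 233.85}
express = {"power_mW": 306.78, "cycles_M": 130.22, "energy_mJ": 199.74}


def ratio_percent(key):
    """Express value relative to the baseline, in percent."""
    return express[key] / baseline[key] * 100.0


power_ratio = ratio_percent("power_mW")
cycle_ratio = ratio_percent("cycles_M")
energy_ratio = ratio_percent("energy_mJ")

# Energy = power x time, so at equal clock frequency (assumed) the energy
# ratio should be close to the product of the power and cycle ratios.
predicted_energy_ratio = power_ratio * cycle_ratio / 100.0

print(power_ratio, cycle_ratio, energy_ratio, predicted_energy_ratio)
```

The small difference between the directly computed energy ratio and the table value is within rounding of the published figures.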
Conclusion
Acyclic regions become the performance bottleneck.
  The run time of loops decreases by large factors.
Dynamic operation fusion enables back-to-back execution of dependent operations in a single cycle.
  Bypass network
  Tick-based scheduler
Up to 17% faster and 15% more energy efficient with 3% hardware overhead
Questions?