A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator
Farhad Mehdipour, Hamid Noori, Hiroaki Honda, Koji Inoue, Kazuaki Murakami
Kyushu University, Fukuoka, Japan
OUTLINE
Introduction
Problem Definition and Basic Concepts
Hybrid DSE Approach for Designing RAC
Case study: Designing RAC for a Reconfigurable Processor
Conclusion
Designing Embedded Systems
Embedded Microprocessors
Application-Specific Integrated Circuits (ASICs)
Application-Specific Instruction set Processors (ASIPs)
Extensible Processors
[Figure: extensible processor. A CPU with an instruction dispatcher and a register file feeding the functional units +, &, x, LD/ST, CFU1, and CFU2]
LD/ST: Load / Store
CFU: Custom Functional Unit
Extensible Processors
Improving the performance and energy efficiency of embedded processors while maintaining compatibility and flexibility
Using an accelerator to speed up frequently executed portions of applications
Accelerator implementations:
custom hardware (such as ASIPs or extensible processors)
reconfigurable fine-grained or coarse-grained hardware
Custom Instructions
Critical segments: the most frequently executed portions of the applications
Instruction set customization: hardware/software partitioning (identifying critical segments in applications)
Custom Instructions (CIs) are extracted from critical segments of an application and executed on a Custom Functional Unit (CFU)
A CI is represented as a data flow graph (DFG)
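1: SUBU R3, R0, R3
2: ADDU R10, R0, R0
3: SRA R8, R10, 0x3
4: SLT R2, R3, R8
5: BNE R0, 400488, R2
[Figure: DFG of a custom instruction. Nodes SUBU (1), ADDU (2), SRA (3), SLT (4), and BNE (5); edges carry R10 from ADDU to SRA, R8 from SRA to SLT, R3 from SUBU to SLT, and R2 from SLT to BNE]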
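As an illustration of how a CI's DFG and its dimensions can be derived, the following minimal Python sketch rebuilds the graph for the five-instruction example by tracking the last writer of each register. The tuple encoding, the function names, and the definition of width as the largest number of nodes on one as-soon-as-possible level are assumptions made for this sketch, not taken from the slides.

```python
from collections import defaultdict

# (node id, opcode, destination register or None, source operands).
# Operands follow the five-instruction example; BNE writes no register.
INSTRUCTIONS = [
    (1, "SUBU", "R3", ["R0", "R3"]),
    (2, "ADDU", "R10", ["R0", "R0"]),
    (3, "SRA", "R8", ["R10", "0x3"]),
    (4, "SLT", "R2", ["R3", "R8"]),
    (5, "BNE", None, ["R0", "400488", "R2"]),
]


def build_dfg(instructions):
    """Return {node: set of producer nodes} using last-writer tracking."""
    last_writer = {}
    preds = {}
    for node, _opcode, dest, sources in instructions:
        # A source is an edge only if an earlier instruction produced it.
        preds[node] = {last_writer[s] for s in sources if s in last_writer}
        if dest is not None:
            last_writer[dest] = node
    return preds


def dfg_levels(preds):
    """Assign each node the earliest level allowed by its producers."""
    levels = {}
    for node in sorted(preds):  # program order is a topological order here
        levels[node] = 1 + max((levels[p] for p in preds[node]), default=0)
    return levels


def dfg_shape(preds):
    """Return (width, height) of the DFG under the leveling above."""
    levels = dfg_levels(preds)
    per_level = defaultdict(int)
    for level in levels.values():
        per_level[level] += 1
    return max(per_level.values()), max(levels.values())


if __name__ == "__main__":
    print(dfg_shape(build_dfg(INSTRUCTIONS)))
```

Under these assumptions SUBU and ADDU share the first level, and SRA, SLT, and BNE follow one per level.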
Reconfigurable Processors
Using a reconfigurable accelerator (RAC) instead of custom functional units
Adding and generating custom instructions after fabrication
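[Figure: reconfigurable processor. A CPU with an instruction dispatcher and a register file feeding the functional units +, &, x, LD/ST, CFU1, CFU2, and an RFU with its configuration memory]
CFU: Custom Functional Unit
RFU: Reconfigurable Functional Unit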
How a Reconfigurable Processor Works
[Figure: reconfigurable processor. A baseline processor (ALU, register file) coupled with a RAC and its configuration memory]
GPP: General Purpose Processor
RAC: Reconfigurable Accelerator
Hot basic block:
400680 subiu $25,$25,1
400688 lbu $13,0($7)
400690 lbu $2,0($4)
400698 sll $2,$2,0x18
4006a0 sra $14,$2,0x18
4006a8 addiu $4,$4,1
4006b0 srl $8,$2,0x1c
4006b8 sll $2,$8,0x2
4006c0 addu $2,$2,$25
4006c8 lw $2,0($2)
4006d0 xori $13,$13,1
4006d8 addu $10,$10,$2
4006e0 bgez $10,4006f0
...
Designing a Reconfigurable Accelerator (RAC)
Multitude of design parameters, e.g. the number of functional units, the number of input/output ports, and the types of functional units
Many design parameters lead to high complexity in the RAC design, which calls for a methodical approach
A major challenge: finding the right balance between the different quality requirements (e.g. speedup, area, energy consumption)
Traditional Design Process
Describing a reference model
Verifying the model's functional correctness
Obtaining rough estimates of performance
Manual or semi-automatic generation of several alternative designs
Choosing the most suitable design based on various performance metrics
Design Space Exploration
Design Space Exploration (DSE): the process of analyzing several “functionally equivalent” implementation alternatives to identify an optimal solution
Can become too computationally expensive
Example: a design with four tasks on an architecture with three processing modules, each of which has four possible configurations, results in 500 design alternatives
Our hybrid (analytical & quantitative) DSE approach substantially reduces the design time and effort
Assumptions
RAC: a matrix of FUs with width w and height h; it is basically combinational logic
Basic elements: functional units (logic resources) and multiplexers (routing resources)
FUs in the RAC are fully connected, except that there are no connections from lower rows to upper rows
[Figure: RAC structure. A matrix of FUs of width w and height h between the RAC input ports and the RAC output ports; each FU has two inputs (1 and 2); labels Row1, Row5, and Col1]
Assumptions
CIs (DFGs) are mapped onto the RAC and executed at runtime
[Figure: the data flow graph of the five-instruction custom instruction (1: SUBU, 2: ADDU, 3: SRA, 4: SLT, 5: BNE) mapped onto the RAC initial architecture]
The Problem
Designing a RAC while optimizing speedup and area
Design parameters: the RAC’s width and height, the number of FUs, and the number of input and output ports
Speedup:
$$\text{Speedup} = \frac{\text{TotalClockCyclesOnBaseProcessor}}{\text{TotalClockCyclesOnRAC}}$$
Design Methodology- Tool Chain
Our Approach: A Hybrid (Analytical & Quantitative) DSE Approach
[Figure: tool chain. Blocks: Application, CI Extractor, CIs (DFGs), CI (DFG) Analyzer, Required Information for DSE, Design Space Exploration, Specification of Accelerator, Adjusting RAC Architecture, Accelerator Final Specification]
Problem Formulation
$\alpha_{ij}$: the fraction of DFGs (CIs) in the applications that have width i and height j.
Example: $\alpha_{43} = 7\%$ means that all DFGs with width 4 and height 3 account for 7% of the application execution time.
$$CC(CPU) = \sum_{i=1}^{W} \sum_{j=1}^{H} \alpha_{ij}\, CC(CPU_{ij})$$
[Equation: a relation between $CC(CPU_{ij})$, FU, i, and j]
[Figure: two example DFGs, one with nodes 1 to 6 and one with nodes 1 to 5]
Problem Formulation
$$CC(RAC^{wh}) = \sum_{i=1}^{W} \sum_{j=1}^{H} \alpha_{ij}\, CC(RAC^{wh}_{ij})$$
$\alpha_{ij}$: the fraction of DFGs with width i and height j
$CC(RAC^{wh}_{ij})$: the number of clock cycles for executing DFG(i,j) on a RAC(w,h)
If one or both dimensions of a DFG are greater than the RAC’s dimensions:
temporal partitioning: divides the DFG into smaller, time-exclusive DFGs
reconfiguration overhead: the time for loading subsequent partitions of the DFG from the configuration memory onto the RAC
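To make the effect of temporal partitioning on the cycle count concrete, here is a minimal Python sketch. It assumes that a DFG larger than the RAC is cut into ceil(i/w) * ceil(j/h) partitions, that each partition costs one RAC pass, and that a reconfiguration penalty is paid between consecutive partitions. This partition count, the function names, and the default values are assumptions for illustration; the slides do not specify the partitioning algorithm.

```python
import math


def cycles_on_rac(i, j, w, h, rac_delay_cycles=1, reconfig_penalty=1):
    """Clock cycles to run one DFG of width i and height j on a w x h RAC.

    ASSUMPTION: the DFG is split into ceil(i/w) * ceil(j/h) partitions.
    """
    partitions = math.ceil(i / w) * math.ceil(j / h)
    # One RAC pass per partition, one reconfiguration between partitions.
    return partitions * rac_delay_cycles + (partitions - 1) * reconfig_penalty


def total_cycles_on_rac(fractions, w, h, **kwargs):
    """Weighted sum over DFG sizes; fractions maps (i, j) -> weight."""
    return sum(weight * cycles_on_rac(i, j, w, h, **kwargs)
               for (i, j), weight in fractions.items())


if __name__ == "__main__":
    # Hypothetical mix: half the DFGs are 4x3, half are 7x3, on a 6x5 RAC.
    print(total_cycles_on_rac({(4, 3): 0.5, (7, 3): 0.5}, w=6, h=5))
```

A DFG that fits the RAC costs a single pass, while each extra partition adds both a pass and a reconfiguration, which is why undersized RACs lose speedup quickly in this model.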
Problem Formulation
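Four cases, depending on how the DFG dimensions (i, j) compare with the RAC dimensions (w, h):
1) $i \le w,\ j \le h$: $CC(RAC^{wh}_{ij}) = D(RAC^{wh})$
2) to 4): the DFG exceeds the RAC in width ($i > w$), in height ($j > h$), or in both
[Figure: three diagrams of a DFG laid over a RAC of width w and height h]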
Delay of RAC
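$$D(RAC^{wh}) = \sum_{i=1}^{h} D(FU_{ki}) + \sum_{i=1}^{h+1} D(MUX_{ki}), \quad k = 1, \ldots, w$$
$D(MUX_{kj})$ = delay of a ($2^{s_{kj}}$ to 1) mux, with $s_{kj} = \log_2 m_{kj}$
[Equation: $m_{kj}$ as a function of j and w]
[Figure: RAC’s delay path. A matrix of FUs of width w and height h between the RAC input ports and the RAC output ports; each FU has two inputs (1 and 2); labels Row1, Row5, and Col1]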
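The delay model on this slide adds the FU delays and the mux delays along one column of the RAC. The sketch below shows one way to evaluate it in Python. The number of inputs of each mux is passed in by the caller, and the delay table and all numeric values are hypothetical placeholders standing in for synthesis results. Rounding the select width up to a whole number of bits is also an assumption of this sketch.

```python
def mux_select_bits(num_inputs):
    """Select bits s of the smallest (2^s to 1) mux with at least num_inputs inputs."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without floats.
    return max(1, (num_inputs - 1).bit_length())


def column_delay(fu_delays_ps, mux_inputs, mux_delay_ps_by_bits):
    """Delay along one column: FU delays plus the delays of the muxes on the path.

    fu_delays_ps: one delay per FU row on the path (picoseconds).
    mux_inputs: number of inputs of each mux on the path.
    mux_delay_ps_by_bits: hypothetical lookup, select bits -> mux delay.
    """
    fu_part = sum(fu_delays_ps)
    mux_part = sum(mux_delay_ps_by_bits[mux_select_bits(m)] for m in mux_inputs)
    return fu_part + mux_part


if __name__ == "__main__":
    # Hypothetical 3-row column with four muxes on the path.
    table = {1: 100, 2: 200, 3: 300, 4: 400}
    print(column_delay([1000, 1000, 1000], [4, 8, 12, 6], table))
```

Integer picoseconds are used so that the sums stay exact; a real flow would take the per-size mux delays from the synthesis reports.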
Area of RAC
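$$A(RAC^{wh}) = \sum_{i=1}^{w} \sum_{j=1}^{h} A(FU_{ij}) + 2 \cdot \sum_{i=1}^{w} \sum_{j=1}^{h+1} A(MUX_{ij})$$
$A(MUX_{ij})$ = area of a ($2^{s_{ij}}$ to 1) mux
[Figure: RAC structure. A matrix of FUs of width w and height h between the RAC input ports and the RAC output ports; each FU has two inputs (1 and 2); labels Row1, Row5, and Col1]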
Optimization Problem
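Objective: maximize $S(RAC^{wh})$
$$S(RAC^{wh}) = \frac{CC(CPU)}{CC(RAC^{wh})} = \frac{\sum_{i=1}^{W} \sum_{j=1}^{H} \alpha_{ij}\, CC(CPU_{ij})}{\sum_{i=1}^{W} \sum_{j=1}^{H} \alpha_{ij}\, CC(RAC^{wh}_{ij})}$$
where $\alpha_{ij}$ is the fraction of DFGs with width i and height j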
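Once the cycle counts can be estimated analytically, the objective can be evaluated for every candidate (w, h) without mapping any DFG. The following Python sketch performs such an exhaustive sweep. The profile format, the toy cycle model (including the guess that deeper RACs need more cycles per pass), and the tie-break toward smaller w * h are all assumptions made for illustration, and the numbers are invented.

```python
import math
from itertools import product


def toy_rac_cycles(i, j, w, h, reconfig_penalty=1):
    """Hypothetical cycle model for one DFG(i, j) on a w x h RAC."""
    partitions = math.ceil(i / w) * math.ceil(j / h)
    pass_cycles = math.ceil(h / 2)  # ASSUMPTION: taller RACs are slower per pass
    return partitions * pass_cycles + (partitions - 1) * reconfig_penalty


def speedup(profile, w, h, rac_cycles):
    """profile maps (i, j) -> (weight, cpu_cycles) for DFGs of that size."""
    cpu = sum(weight * cc for weight, cc in profile.values())
    rac = sum(weight * rac_cycles(i, j, w, h)
              for (i, j), (weight, _cc) in profile.items())
    return cpu / rac


def explore(profile, max_w, max_h, rac_cycles):
    """Score every (w, h); return (best speedup, w, h), preferring smaller w*h on ties."""
    best = None
    for w, h in product(range(1, max_w + 1), range(1, max_h + 1)):
        key = (speedup(profile, w, h, rac_cycles), -(w * h))
        if best is None or key > best[0]:
            best = (key, w, h)
    return best[0][0], best[1], best[2]


if __name__ == "__main__":
    # Invented profile: 60% of DFGs are 2x2 (4 CPU cycles), 40% are 4x3 (10 cycles).
    profile = {(2, 2): (0.6, 4), (4, 3): (0.4, 10)}
    print(explore(profile, max_w=4, max_h=4, rac_cycles=toy_rac_cycles))
```

Each candidate costs only a few arithmetic operations here, which is the kind of saving an analytical model offers over mapping every DFG onto every candidate architecture.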
Designing RAC for a Reconfigurable Processor
AMBER: a reconfigurable processor targeted for embedded systems
Main components: a base processor (general RISC processor), a sequencer, and a coarse-grained reconfigurable functional unit (RFU)
[Figure: AMBER’s architecture. The GPP (register file, ID/EXE Reg, ALUs, EXE/MEM Reg) with augmented hardware: RFU, configuration memory, MUX, sequencer, and sequencer table]
Quantitative Approach
22 applications from MiBench were tried
Applications were executed on SimpleScalar and profiled
Hot segments and DFGs were extracted from the applications
No limitation was placed on the RFU’s initial architecture
DFGs were mapped onto the initial RFU architecture
Feedback from the mapping
Specification of the designed architecture for AMBER’s RFU
Hybrid DSE Approach
FUs and multiplexers of various sizes: synthesized using Hitachi 0.18 µm technology to measure delay and area
DFGs are analyzed to extract the required information quantitatively
Reconfiguration penalty time: 1 clock cycle
The base processor clock frequency: 166 MHz
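For orientation, a 166 MHz clock corresponds to a period of roughly 6.02 ns. The short Python sketch below converts a combinational delay in nanoseconds into whole clock cycles. Rounding up to the next full cycle and the function names are assumptions of this sketch, and the sample delays are invented.

```python
import math

CLOCK_HZ = 166e6  # base processor clock frequency from the slide


def clock_period_ns(clock_hz=CLOCK_HZ):
    """Clock period in nanoseconds."""
    return 1e9 / clock_hz


def delay_to_cycles(delay_ns, clock_hz=CLOCK_HZ):
    """Whole clock cycles needed to cover a delay (ASSUMPTION: round up)."""
    return math.ceil(delay_ns / clock_period_ns(clock_hz))


if __name__ == "__main__":
    print(round(clock_period_ns(), 3))
    for delay_ns in (5.0, 6.5, 12.0):  # invented RAC delays
        print(delay_ns, delay_to_cycles(delay_ns))
```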
Speedup Evaluation
[Figure: 3D surface plot of speedup (0 to 2.5, in bands of 0.5) versus RAC width and height (each from 1 to 19)]
Speedup Evaluation
Increasing the RAC’s width: more parallelism
Widths larger than 6: the growing number and size of muxes have a negative effect, so no more speedup is achievable
Increasing the RAC’s height: longer delay, so speedup declines and area increases
Comparison
CONCLUSION
Hybrid DSE approach
Uses realistic data from the applications tried
Substantially reduces design time and designer effort
Can be used for shrinking a large design space
Easily extendable to new design parameters
More suitable where new applications are added
Thanks for your attention!