Upload
others
View
6
Download
0
Embed Size (px)
Citation preview
HotChips 2006 1
An Implementation of Hardware
Accelerator using Dynamically Reconfigurable Architecture
Takashi Yoshikawa, Yutaka Yamada and Shigehiro Asano
Toshiba R&D Center
HotChips 20062
Outline
Motivation
Architecture overview
Application example (H.264 decode)
Evaluation
Performance
Area consumption
Conclusion
HotChips 20063
Motivation
Need better solution for multimedia applications
Processor solution is not efficient
Lots of data preparation before actual operations
Data alignment, Shuffle, Shift
SIMD operations don’t always fit multimedia applications
Sometimes an instruction requires two or more operations (ex. ++--++--) to maximize its efficiency
Difficult to add newly defined instructions to an existing ISA for supporting these operations
100, 200, or even more instructions needs to be added
Hardware engine is efficient enough but not flexible
Needs a new design for each application
Not easy to fix bugs
HotChips 20064
Motivation (Cont’d)
Reconfigurable logic may solve these problems but….
Utilization of logic/network must be high enough
Overhead of loading configuration must be covered with processing time
We propose dynamically reconfigurable HW engine optimized for multimedia applications
HotChips 20065
Outline
Motivation
Architecture overview
Application example (H.264 decode)
Evaluation
Performance
Area consumption
Conclusion
HotChips 20066
Architecture overview (entire block diagram)
I/O Buffer
I/O Buffer Ctrl
Inter-Unit Buffer
Formatter1AUX1
Input Ctrl
Output Ctrl
AUX0
Input Ctrl
Output Ctrl
Input Ctrl
Output Ctrl
Write Control
Formatter0
I/O Buffer Ctrl
Output Ctrl
DMAC
Host Processor
System Memory
Code Buffer
co
ntro
lbus
Input Ctrl
HotChips 20067
Architecture overview (data/code transfer)
I/O Buffer
Inter-Unit Buffer
Formatter1AUX1
Input Ctrl
Output Ctrl
AUX0
Input Ctrl
Output Ctrl
Input Ctrl
Output Ctrl
Formatter0
I/O Buffer Ctrl
Output Ctrl
DMAC
Host Processor
System Memory
Code Buffer
Re
qu
est
codes
(if not in CodeBuffer)
data
I/O Buffer Ctrl
Write Control
Input Ctrl
HotChips 20068
Architecture overview (data/code transfer)
DMAC controls the transfer of data/codes between a system memory and I/O Buffer/Code Buffer
The code consists of the configuration bits and the control code
The control code specifies which configuration bits should be applied to the reconfigurable units at any given cycle
DMAC also initiates code transfer from Code Buffer to the reconfigurable units
Reconfigurable units: Formatters, AUXs and Write Control Unit
A host processor issues a command with a system memory address to DMAC as a request
Issued through the control bus
HotChips 20069
Architecture overview (Formatter)
I/O Buffer
Inter-Unit Buffer
Formatter1AUX1
Input Ctrl
Output Ctrl
AUX0
Input Ctrl
Output Ctrl
Input Ctrl
Output Ctrl
Formatter0
I/O Buffer Ctrl
Output Ctrl
DMAC
Host Processor
System Memory
Code Buffer
co
ntro
lbus
data
I/O Buffer Ctrl
Write Control
Input Ctrl
HotChips 200610
Architecture overview (Formatter)
Formatter Units
Each consists of Input Control, Output Control,
full crossbar switches and 5 stages of processing element (PE)
Next two slides describe PE in detail
Input Control reads data from I/O buffer (or Inter Unit Buffer) and send the data along with a Context ID
Context ID is a pointer to one of the configurations of PE
Output Control receives the data from the last stage of PE and writes it into Inter Unit Buffer
HotChips 200611
Architecture overview (Formatter)
data A
Co
nfig
Me
m
data B
Shuffle
16-bit ALU * 8PE0
XB
XBXB
16x8 (8bit) crossbars available at Formatter0
8x8 (16bit) crossbar available at Formatter1
Each PE has its own configuration memory
Data, context ID and
valid (described later)
goes down the pipeline in sync
Only 4 patterns are available
Formatter
validID
PE1 (same as PE0)
PE2 (same as PE0)
PE3 (same as PE0)
PE4 (same as PE0)
HotChips 200612
data A
Architecture overview (Processing Element)
Co
nfig
Me
m
data B
From Previous PE
To Next PE
ID valid
data A data B ID valid
Processing Element (PE) Each ALU can be configured separately
2 outputs each
ALU ALU ALU ALU ALU ALU ALU ALU
HotChips 200613
Architecture overview (Inter-Unit Buffer)
MUX
bufbuf
MUX
bufbuf
MUX
bufbuf
MUX
bufbuf
MUX
bufbuf
MUX
bufbuf
MUX
bufbuf
From Fomatter0 From Formatter1 From AUX0 From AUX1
To Formatter1 To AUX0 To AUX1 To Write Ctrl
data
select
HotChips 200614
Architecture overview (Inter-Unit Buffer)
Inter-Unit Buffer (IUB)
Consists of 14 data buffers (size:128bit each)
Formatters and AUX write output data into one of these buffers
The buffer ID is associated with a context ID from those units
Conflict of buffer writes has been statically avoided by the control code
All the timing of buffer writes is deterministic
Each buffer has a Valid BitDescribed in detail in the application example
HotChips 200615
Architecture overview (AUX units)
I/O Buffer
Inter-Unit Buffer
Formatter1AUX1
Input Ctrl
Output Ctrl
AUX0
Input Ctrl
Output Ctrl
Input Ctrl
Output Ctrl
Formatter0
I/O Buffer Ctrl
Output Ctrl
DMAC
Host Processor
System Memory
Code Buffer
co
ntro
lbus
I/O Buffer Ctrl
Write Control
Input Ctrl
HotChips 200616
Architecture overview (AUX units)
AUX (Auxiliary) Units
Perform operations which can’t be handled by Formatter
Multiplication, 32bit operations, etc.
Each AUX unit consists of Input Control, Output Control and single reconfigurable SIMD unit
SIMD unit consists of eight 32-bit integer unitsMultiplier is 16-bit wide
Result is held in a 32x8 bit accumulator
All of the operation units share the same configuration
One AUX can transfer its output data to the other AUX directly
Inter-Unit Buffer may also be used if necessary
HotChips 200617
Architecture overview (Write Control Unit)
I/O Buffer
Inter-Unit Buffer
Formatter1AUX1
Input Ctrl
Output Ctrl
AUX0
Input Ctrl
Output Ctrl
Input Ctrl
Output Ctrl
Formatter0
I/O Buffer Ctrl
Output Ctrl
DMAC
Host Processor
System Memory
Code Buffer
co
ntro
lbus
I/O Buffer Ctrl
Write Control
Input Ctrl
HotChips 200618
Architecture overview (Write Control Unit)
Write Control Unit
Reads 128bit data from Inter-Unit Buffer and writes it into I/O buffer
Data shuffle available (8x8 [16bit] full crossbar)
Context IDs are issued in the same manner as Input Control of Formatter/AUX
The Context ID is associated with:The buffer containing write data
128bit-aligned write addressThe address is given in conjunction with the control code
Byte enable (for partial write)
Configuration of a crossbar switch
HotChips 200619
Architecture overview (Timing Chart)
Timing chart example
Load0Load1Formatter0 (PE0)Formatter0 (PE1)Formatter0 (PE2)Formatter0 (PE3)Formatter0 (PE4)IUB Formatter1 (A0)
Formatter1 (PE0)IUB Formatter1 (B0)
Formatter1 (PE4)IUB AUX0 (A0)IUB AUX0 (B0)IUB AUX1 (A0)IUB AUX1 (B0)IUB AUX1 (A1)IUB AUX1 (B1)
AUX1AUX0
IUB Write Control (0)
Store0Store1
IUB Write Control (1)
Formatter1 (PE1)Formatter1 (PE2)Formatter1 (PE3)
0
0 0
2
3
2
5
0
6 7
1
8
5 6
6
7
0 2
4 5 6
67
6 7 86 7 8
6 7 86 7 8
6 7 86 7
4
8
6
9 101112111211121112111211
11121112111211
111211
Starts when buffer A0 & B0
both become valid
4 context IDs are used
at the same cycle
Pattern of Context ID change is
statically specified in the control code
Context ID changes
as frequently as
every cyclecycle
Formatter0
Formatter1
AUX 0&1
Write Control
Number represents a Context ID for each unit
8
0 00000000
00
00000000 0
00000000
00 000
000000
000
0000000 00 00000000
0 00 000000000 00 00000000
0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0
2222
22
22
22
223
3
2
33
3
3
33
33
33
3
3 3
4 54 54 5454 54 54 54
54 54 54 5454 54 54 54
54 54 54 5454 54 54 54
54 54 54 54
55 5 5
99
99
99
9
1010
1010
1010
2 444
00
00
00
11
11
1
1
1 1 11 1 1
1 1 11 1 1
1 1 1
1 1 1
22
22
2
20
21 3 4 8 9
9
3
0 1 2 3
HotChips 200620
Outline
Motivation
Architecture overview
Application example (H.264 decode)
Evaluation
Performance
Area consumption
Conclusion
HotChips 200621
Application example (H.264 decode: intra16x16 prediction [plane])
int H,V,a,b,c,t,i,pt=132,p;
t=a–7*(b+c)+16;b=(5*H+32)>>6; c=(5*V+32)>>6;a=(top[15]+left[15])<<4;
V+=-left[7-b]*b+left[7+b]*b; }H+=top[15]*8; V+=left[15]*8;
{ H+=-top[7-b]*b+top[7+b]*b;for (b=1;b<8;b++)V=H=-top[7]*8;
for (i = 0; i < 256; i++){ p=t>>5; t+=b;
if ((i & 15) == 15){ t+=c–b*16; pt+=16; } }
void intra16x16_plane_prediction(uchar *left, uchar *top) {
unsigned char Org[768]Original Code
}
Org[pt++]=(p<0)?0:(p>255)?255:p;
top[0..15] left[0..15]
F0 (0) F0 (1)
F1 (0) F1 (1) F1 (2)
F1 (3) A0 (0)
A0 (1)
Org[pt..pt+7], pt+=8
A1 (0)
F1 (4)
A0 (2)
A0 (3)
A0 (4)
Org[pt+8..pt+15], pt+=24
A1 (1)
F1 (5)
forms loop
Context Dependency Graph
A0 (1), A0 (2), A0 (3)(next iteration)
HotChips 200622
Application example (operation graph)ALU & full crossbar/shuffle configuration for Formatter0 context 0&1
sub
* bold arrow forwarded to red arrow
AB
AA op B
Full Crossbar (16x8)
BA
A-BA-B
* blue arrow is subtractor* both outputs are same by default
unused operation
(unused data paths are not shown)
A=top[0..15] B=top[0..15]
A=F89ABCDEB=76543210
Shuffle pattern 1
01234567 -> 10325476
0123456789ABCDEF
Shuffle pattern 2
01234567 -> 10325476
Shuffle pattern 1
Shuffle pattern 2
subsub sub sub sub sub sub sub
fwd fwd fwd fwd
fwd fwd fwd
fwd fwdfwd fwd fwd
fwd fwd fwd fwd
x8
x2 x2
x2
add add add
add add add
add add
add add add add
HotChips 200623
Application example (Timing Chart)
Timing Chart (96 cycles in total)
33
33
33
44
44
4
4
5
5
55
55
2
44
44
4
55
55
5
45
2
44
44
4
55
55
5
44
44
4
55
55
5
44
44
4
554
4
22
54
22
Load0Load1Formatter0 (PE0)Formatter0 (PE1)Formatter0 (PE2)Formatter0 (PE3)Formatter0 (PE4)IUB Formatter1 (A0)
Formatter1 (PE0)IUB Formatter1 (B0)
Formatter1 (PE4)IUB AUX0 (A0)IUB AUX0 (A1)IUB AUX0 (B0)IUB AUX0 (B1)IUB AUX1 (A0)IUB AUX1 (A1)
AUX1AUX0
IUB Write Control (0)
Store0Store1
IUB Write Control (1)
Formatter1 (PE1)Formatter1 (PE2)Formatter1 (PE3)
5
IUB AUX1 (B0)IUB AUX1 (B1)
Loop
cycle
Formatter0
Formatter1
AUX 0&1
WriteControl
IUB AUX1 is not used
(AUX0 transfers data to AUX1 directly)
Execution images of
these cycles are shown in the next animation
0
00
00
00
0
11
1
11
11
10
00
00 1 2
1 21 2
1 21 2
0
0 10
1
2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 41 0 1 0 1 0 1 0 1 0
0 1 0 1 0 1 0 1 0 1 0
2 2 2 2 2 23 3 3 3 3 3
0
22
00
00
00
00
11
11
1
1
HotChips 200624
Application example (Execution Image)
data vData Vdata v
data A
Mem
data B ID valid
ALU * 8PE0
data A
Mem
data B ID valid
ALU * 8PE1
Data V
exec ctrl
Input Ctrl
data v data vData V Data V
VV
0
A new Context ID is
issued to PE0 when all
the input data required
for the context become
valid. The ID is issued with data and valid=1.
HotChips 200625
data A
Mem
data B ID valid
ALU * 8PE0
Application example (Execution Image)
data v
exec ctrl
Input Ctrl
data v data v data v
Data Data
Data V Data V
V0
Inst 0
data A
Mem
data B ID valid
ALU * 8PE1
When PE receives a valid
Context ID, it changes its
configuration accordingly
and performs calculation & data shuffle
HotChips 200626
data A
Mem
data B ID valid
ALU * 8PE0
data A
Mem
data B ID valid
ALU * 8PE1
Application example (Execution Image)
data v
exec ctrl
Input Ctrl
data v data v data v
VV
1
Data Data V0
DataData VV DataData VVAt the next cycle, both
data remains valid,
therefore the next Context ID is issued.
HotChips 200627
data A
Mem
data B ID valid
ALU * 8PE0
Application example (Execution Image)
data v
exec ctrl
Input Ctrl
data v data v data v
Data Data V1
Inst 1
data A
Mem
data B ID valid
ALU * 8PE1
Data Data VID
Inst 0
Data V Data V
The change of configuration
and data calculation/shuffle
are performed in a pipeline
fashion. PE0 & PE1 use different Context ID.
HotChips 200628
Outline
Motivation
Architecture overview
Application example (H.264 decode)
Evaluation
Performance
Area consumption
Conclusion
HotChips 200629
Evaluation (Performance)
Application: H.264 decode
Estimated number of cycles required for decoding one macroblock
When 2Mbps VGA stream is decoded
3660 (average)
Our Accelerator (typical)
dual G5@2Ghz recommended by Apple for 8Mbps, 24fps HD stream
16576 (average)
PowerPC G5 @2GHz
Actual cycles required for decoding the most demanding macroblock
4923 (intra)
7108 (inter)
Our Accelerator (real)
ideal cycles expected for decoding the most demanding macroblock
3792 (intra)
5705 (inter)
Our Accelerator (ideal)
CommentCyclesName
More than 4 times faster than PowerPC G5 in terms of number of cycles required
HotChips 200630
Evaluation (Performance Breakdown)
Timing chart (real, intra)
1100 more cycles are required than the ideal cycles
The reason: configuration transfer cannot be completely hidden by signal processing
It can be hidden if the number of configuration memories increases (currently two buffers)
IQ/IDCT Intra4x4(luma) Intra(chroma) System memory write (frame buffer) Deblocking filter(luma)
Deblocking filter(chroma)System memory write (pixel buffer) Store macroblockCompensation
4923 cycles
Data transferConfig. transferCode transferSig. processing
Param calc.
HotChips 200631
Evaluation (Area consumption)
Area (logic gate) consumption
--2,200- Buffer (x1 in PE)
--1,000- Shuffle (x1 in PE)
--1,700- ALU (x8 in PE)
6567,50018,700- Processing Element (x5 in Form.)
2,11222,0004,000Crossbar (Formatter0)
--20,000Inter Unit Buffer
31,584314,200319,000Total
9609,50045,000AUX (x2)
8007,6006,000I/O Buffer Interface (Read & Write)
7687,6003,000Crossbar (Formatter1, Write Ctrl Unit)
13,696136,00020,000Total Inter Unit Ctrl. (Input & Output)
1,15212,000800Micro Controller (x5)
3,13631,00087,500Formatter (x2)
Memory Size
[bit]
F/F Memory [gate size]
Logic
[gate size]
HotChips 200632
Evaluation (Chip layout)
Formatter0
Formatter1
AUX0
AUX1Write Ctrl
Inter Unit Buffer
HotChips 200633
Outline
Motivation
Architecture overview
Application example (H.264 decode)
Evaluation
Performance
Area consumption
Conclusion
HotChips 200634
Conclusion
Presented an implementation of hardware
accelerator using dynamically reconfigurable architecture
Efficient signal processing can be achieved by reconfiguring ALUs and crossbars dynamically
More efficient than a general SIMD unit because each crossbar/16bit-ALU can be reconfigured differently
Not limited to the acceleration of H.264 decode
Can also accelerate H.264 encode and other codecsby applying different configurations
HotChips 200635
Conclusion (cont’d)
Performance
Expects 30 frames/sec with 3 accelerators running at 300MHz for decoding Full HD H.264 video
More than 4 times faster than PowerPC G5 in terms of number of cycles required for decoding one macroblock
Area consumption (number of logic gates)
319K Gates (excluding memory)Small enough considering its performance
Size of buffer memory will be 24KB or less
HotChips 200636
Backup Slides
HotChips 200637
Evaluation (Comparison with PowerPC G5)
Application: H.264 decoder
Full HD frame (1920x1080 pixels) consists of 8100 macroblocks)
one macroblock = 16x16 pixels
To achieve frame rate = 24 Full HD frame/sec, one macroblock must be decoded in 5.14 sec
Requires dual 2.0GHz Power Mac G5 (Apple’s recommendation)
decoded w/ Quicktime7.1 (AltiVec used)
5.14 sec = 10288 cycles at 2.0GHz
HotChips 200638
Evaluation (Function level performance results)
69 if motion compensation (luma) excluded133Motion compensation
100Store macroblock
Worst case (filter applied maximum times)~ 784Deblocking filter chroma
Worst case (filter applied maximum times)~ 2112Deblocking filter luma
Worst case (6/4-tap filter applied max. times)~ 1579/~163Inter prediction (luma/chroma)
motion compensation (luma) included~ 448 / 75Intra prediction (luma/chroma)
204 if intra16x16 prediction is not used221IQ/IDCT
CommentCyclesName