HC18.430.S4T3.An Implementation of Hardware Accelerator ... · HotChips 2006 3 Motivation Need better solution for multimedia applications Processor solution is not efficient Lots

HotChips 2006 1

An Implementation of Hardware

Accelerator using Dynamically Reconfigurable Architecture

Takashi Yoshikawa, Yutaka Yamada and Shigehiro Asano

Toshiba R&D Center

HotChips 20062

Outline

Motivation

Architecture overview

Application example (H.264 decode)

Evaluation

Performance

Area consumption

Conclusion

HotChips 20063

Motivation

Need better solution for multimedia applications

Processor solution is not efficient

Lots of data preparation before actual operations

Data alignment, Shuffle, Shift

SIMD operations don’t always fit multimedia applications

Sometimes an instruction requires two or more operations (ex. ++--++--) to maximize its efficiency

Difficult to add newly defined instructions to an existing ISA for supporting these operations

100, 200, or even more instructions needs to be added

Hardware engine is efficient enough but not flexible

Needs a new design for each application

Not easy to fix bugs

HotChips 20064

Motivation (Cont’d)

Reconfigurable logic may solve these problems but….

Utilization of logic/network must be high enough

Overhead of loading configuration must be covered with processing time

We propose dynamically reconfigurable HW engine optimized for multimedia applications

HotChips 20065

Outline

Motivation



Evaluation

Performance

Area consumption

Conclusion

HotChips 20066

Architecture overview (entire block diagram)

I/O Buffer

I/O Buffer Ctrl

Inter-Unit Buffer

Formatter1AUX1

Input Ctrl

Output Ctrl

AUX0

Input Ctrl

Output Ctrl

Input Ctrl

Output Ctrl

Write Control

Formatter0

I/O Buffer Ctrl

Output Ctrl

DMAC

Host Processor

System Memory

Code Buffer

co

ntro

lbus

Input Ctrl

HotChips 20067

Architecture overview (data/code transfer)

I/O Buffer

Inter-Unit Buffer

Formatter1AUX1

Input Ctrl

Output Ctrl

AUX0

Input Ctrl

Output Ctrl

Input Ctrl

Output Ctrl

Formatter0

I/O Buffer Ctrl

Output Ctrl

DMAC

Host Processor

System Memory

Code Buffer

Re

qu

est

codes

(if not in CodeBuffer)

data

I/O Buffer Ctrl

Write Control

Input Ctrl

HotChips 20068

Architecture overview (data/code transfer)

DMAC controls the transfer of data/codes between a system memory and I/O Buffer/Code Buffer

The code consists of the configuration bits and the control code

The control code specifies which configuration bits should be applied to the reconfigurable units at any given cycle

DMAC also initiates code transfer from Code Buffer to the reconfigurable units

Reconfigurable units: Formatters, AUXs and Write Control Unit

A host processor issues a command with a system memory address to DMAC as a request

Issued through the control bus

HotChips 20069

Architecture overview (Formatter)

I/O Buffer

Inter-Unit Buffer

Formatter1AUX1

Input Ctrl

Output Ctrl

AUX0

Input Ctrl

Output Ctrl

Input Ctrl

Output Ctrl

Formatter0

I/O Buffer Ctrl

Output Ctrl

DMAC

Host Processor

System Memory

Code Buffer

co

ntro

lbus

data

I/O Buffer Ctrl

Write Control

Input Ctrl

HotChips 200610


Formatter Units

Each consists of Input Control, Output Control,

full crossbar switches and 5 stages of processing element (PE)

Next two slides describe PE in detail

Input Control reads data from I/O buffer (or Inter Unit Buffer) and send the data along with a Context ID

Context ID is a pointer to one of the configurations of PE

Output Control receives the data from the last stage of PE and writes it into Inter Unit Buffer

HotChips 200611


data A

Co

nfig

Me

m

data B

Shuffle

16-bit ALU * 8PE0

XB

XBXB

16x8 (8bit) crossbars available at Formatter0

8x8 (16bit) crossbar available at Formatter1

Each PE has its own configuration memory

Data, context ID and

valid (described later)

goes down the pipeline in sync

Only 4 patterns are available

Formatter

validID

PE1 (same as PE0)

PE2 (same as PE0)

PE3 (same as PE0)

PE4 (same as PE0)

HotChips 200612

data A

Architecture overview (Processing Element)

Co

nfig

Me

m

data B

From Previous PE

To Next PE

ID valid

data A data B ID valid

Processing Element (PE) Each ALU can be configured separately

2 outputs each

ALU ALU ALU ALU ALU ALU ALU ALU

HotChips 200613

Architecture overview (Inter-Unit Buffer)

MUX

bufbuf

MUX

bufbuf

MUX

bufbuf

MUX

bufbuf

MUX

bufbuf

MUX

bufbuf

MUX

bufbuf

From Fomatter0 From Formatter1 From AUX0 From AUX1

To Formatter1 To AUX0 To AUX1 To Write Ctrl

data

select

HotChips 200614

Architecture overview (Inter-Unit Buffer)

Inter-Unit Buffer (IUB)

Consists of 14 data buffers (size:128bit each)

Formatters and AUX write output data into one of these buffers

The buffer ID is associated with a context ID from those units

Conflict of buffer writes has been statically avoided by the control code

All the timing of buffer writes is deterministic

Each buffer has a Valid BitDescribed in detail in the application example

HotChips 200615

Architecture overview (AUX units)

I/O Buffer

Inter-Unit Buffer

Formatter1AUX1

Input Ctrl

Output Ctrl

AUX0

Input Ctrl

Output Ctrl

Input Ctrl

Output Ctrl

Formatter0

I/O Buffer Ctrl

Output Ctrl

DMAC

Host Processor

System Memory

Code Buffer

co

ntro

lbus

I/O Buffer Ctrl

Write Control

Input Ctrl

HotChips 200616

Architecture overview (AUX units)

AUX (Auxiliary) Units

Perform operations which can’t be handled by Formatter

Multiplication, 32bit operations, etc.

Each AUX unit consists of Input Control, Output Control and single reconfigurable SIMD unit

SIMD unit consists of eight 32-bit integer unitsMultiplier is 16-bit wide

Result is held in a 32x8 bit accumulator

All of the operation units share the same configuration

One AUX can transfer its output data to the other AUX directly

Inter-Unit Buffer may also be used if necessary

HotChips 200617

Architecture overview (Write Control Unit)

I/O Buffer

Inter-Unit Buffer

Formatter1AUX1

Input Ctrl

Output Ctrl

AUX0

Input Ctrl

Output Ctrl

Input Ctrl

Output Ctrl

Formatter0

I/O Buffer Ctrl

Output Ctrl

DMAC

Host Processor

System Memory

Code Buffer

co

ntro

lbus

I/O Buffer Ctrl

Write Control

Input Ctrl

HotChips 200618

Architecture overview (Write Control Unit)

Write Control Unit

Reads 128bit data from Inter-Unit Buffer and writes it into I/O buffer

Data shuffle available (8x8 [16bit] full crossbar)

Context IDs are issued in the same manner as Input Control of Formatter/AUX

The Context ID is associated with:The buffer containing write data

128bit-aligned write addressThe address is given in conjunction with the control code

Byte enable (for partial write)

Configuration of a crossbar switch

HotChips 200619

Architecture overview (Timing Chart)

Timing chart example

Load0Load1Formatter0 (PE0)Formatter0 (PE1)Formatter0 (PE2)Formatter0 (PE3)Formatter0 (PE4)IUB Formatter1 (A0)

Formatter1 (PE0)IUB Formatter1 (B0)

Formatter1 (PE4)IUB AUX0 (A0)IUB AUX0 (B0)IUB AUX1 (A0)IUB AUX1 (B0)IUB AUX1 (A1)IUB AUX1 (B1)

AUX1AUX0

IUB Write Control (0)

Store0Store1


Formatter1 (PE1)Formatter1 (PE2)Formatter1 (PE3)

0

0 0

2

3

2

5

0

6 7

1

8

5 6

6

7

0 2

4 5 6

67

6 7 86 7 8

6 7 86 7 8

6 7 86 7

4

8

6

9 101112111211121112111211

11121112111211

111211

Starts when buffer A0 & B0

both become valid

4 context IDs are used

at the same cycle

Pattern of Context ID change is

statically specified in the control code

Context ID changes

as frequently as

every cyclecycle

Formatter0

Formatter1

AUX 0&1

Write Control

Number represents a Context ID for each unit

8

0 00000000

00

00000000 0

00000000

00 000

000000

000

0000000 00 00000000

0 00 000000000 00 00000000

0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0

2222

22

22

22

223

3

2

33

3

3

33

33

33

3

3 3

4 54 54 5454 54 54 54

54 54 54 5454 54 54 54

54 54 54 5454 54 54 54

54 54 54 54

55 5 5

99

99

99

9

1010

1010

1010

2 444

00

00

00

11

11

1

1

1 1 11 1 1

1 1 11 1 1

1 1 1

1 1 1

22

22

2

20

21 3 4 8 9

9

3

0 1 2 3

HotChips 200620

Outline

Motivation



Evaluation

Performance

Area consumption

Conclusion

HotChips 200621

Application example (H.264 decode: intra16x16 prediction [plane])

int H,V,a,b,c,t,i,pt=132,p;

t=a–7*(b+c)+16;b=(5*H+32)>>6; c=(5*V+32)>>6;a=(top[15]+left[15])<<4;

V+=-left[7-b]*b+left[7+b]*b; }H+=top[15]*8; V+=left[15]*8;

{ H+=-top[7-b]*b+top[7+b]*b;for (b=1;b<8;b++)V=H=-top[7]*8;

for (i = 0; i < 256; i++){ p=t>>5; t+=b;

if ((i & 15) == 15){ t+=c–b*16; pt+=16; } }

void intra16x16_plane_prediction(uchar *left, uchar *top) {

unsigned char Org[768]Original Code

}

Org[pt++]=(p<0)?0:(p>255)?255:p;

top[0..15] left[0..15]

F0 (0) F0 (1)

F1 (0) F1 (1) F1 (2)

F1 (3) A0 (0)

A0 (1)

Org[pt..pt+7], pt+=8

A1 (0)

F1 (4)

A0 (2)

A0 (3)

A0 (4)

Org[pt+8..pt+15], pt+=24

A1 (1)

F1 (5)

forms loop

Context Dependency Graph

A0 (1), A0 (2), A0 (3)(next iteration)

HotChips 200622

Application example (operation graph)ALU & full crossbar/shuffle configuration for Formatter0 context 0&1

sub

* bold arrow forwarded to red arrow

AB

AA op B

Full Crossbar (16x8)

BA

A-BA-B

* blue arrow is subtractor* both outputs are same by default

unused operation

(unused data paths are not shown)

A=top[0..15] B=top[0..15]

A=F89ABCDEB=76543210

Shuffle pattern 1

01234567 -> 10325476

0123456789ABCDEF

Shuffle pattern 2

01234567 -> 10325476

Shuffle pattern 1

Shuffle pattern 2

subsub sub sub sub sub sub sub

fwd fwd fwd fwd

fwd fwd fwd

fwd fwdfwd fwd fwd

fwd fwd fwd fwd

x8

x2 x2

x2

add add add

add add add

add add

add add add add

HotChips 200623

Application example (Timing Chart)

Timing Chart (96 cycles in total)

33

33

33

44

44

4

4

5

5

55

55

2

44

44

4

55

55

5

45

2

44

44

4

55

55

5

44

44

4

55

55

5

44

44

4

554

4

22

54

22

Load0Load1Formatter0 (PE0)Formatter0 (PE1)Formatter0 (PE2)Formatter0 (PE3)Formatter0 (PE4)IUB Formatter1 (A0)

Formatter1 (PE0)IUB Formatter1 (B0)

Formatter1 (PE4)IUB AUX0 (A0)IUB AUX0 (A1)IUB AUX0 (B0)IUB AUX0 (B1)IUB AUX1 (A0)IUB AUX1 (A1)

AUX1AUX0


Store0Store1


Formatter1 (PE1)Formatter1 (PE2)Formatter1 (PE3)

5

IUB AUX1 (B0)IUB AUX1 (B1)

Loop

cycle

Formatter0

Formatter1

AUX 0&1

WriteControl

IUB AUX1 is not used

(AUX0 transfers data to AUX1 directly)

Execution images of

these cycles are shown in the next animation

0

00

00

00

0

11

1

11

11

10

00

00 1 2

1 21 2

1 21 2

0

0 10

1

2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 41 0 1 0 1 0 1 0 1 0

0 1 0 1 0 1 0 1 0 1 0

2 2 2 2 2 23 3 3 3 3 3

0

22

00

00

00

00

11

11

1

1

HotChips 200624

Application example (Execution Image)

data vData Vdata v

data A

Mem

data B ID valid

ALU * 8PE0

data A

Mem

data B ID valid

ALU * 8PE1

Data V

exec ctrl

Input Ctrl

data v data vData V Data V

VV

0

A new Context ID is

issued to PE0 when all

the input data required

for the context become

valid. The ID is issued with data and valid=1.

HotChips 200625

data A

Mem

data B ID valid

ALU * 8PE0


data v

exec ctrl

Input Ctrl

data v data v data v

Data Data

Data V Data V

V0

Inst 0

data A

Mem

data B ID valid

ALU * 8PE1

When PE receives a valid

Context ID, it changes its

configuration accordingly

and performs calculation & data shuffle

HotChips 200626

data A

Mem

data B ID valid

ALU * 8PE0

data A

Mem

data B ID valid

ALU * 8PE1


data v

exec ctrl

Input Ctrl


VV

1

Data Data V0

DataData VV DataData VVAt the next cycle, both

data remains valid,

therefore the next Context ID is issued.

HotChips 200627

data A

Mem

data B ID valid

ALU * 8PE0


data v

exec ctrl

Input Ctrl


Data Data V1

Inst 1

data A

Mem

data B ID valid

ALU * 8PE1

Data Data VID

Inst 0

Data V Data V

The change of configuration

and data calculation/shuffle

are performed in a pipeline

fashion. PE0 & PE1 use different Context ID.

HotChips 200628

Outline

Motivation



Evaluation

Performance

Area consumption

Conclusion

HotChips 200629

Evaluation (Performance)

Application: H.264 decode

Estimated number of cycles required for decoding one macroblock

When 2Mbps VGA stream is decoded

3660 (average)

Our Accelerator (typical)

dual G5@2Ghz recommended by Apple for 8Mbps, 24fps HD stream

16576 (average)

PowerPC G5 @2GHz

Actual cycles required for decoding the most demanding macroblock

4923 (intra)

7108 (inter)

Our Accelerator (real)

ideal cycles expected for decoding the most demanding macroblock

3792 (intra)

5705 (inter)

Our Accelerator (ideal)

CommentCyclesName

More than 4 times faster than PowerPC G5 in terms of number of cycles required

HotChips 200630

Evaluation (Performance Breakdown)

Timing chart (real, intra)

1100 more cycles are required than the ideal cycles

The reason: configuration transfer cannot be completely hidden by signal processing

It can be hidden if the number of configuration memories increases (currently two buffers)

IQ/IDCT Intra4x4(luma) Intra(chroma) System memory write (frame buffer) Deblocking filter(luma)

Deblocking filter(chroma)System memory write (pixel buffer) Store macroblockCompensation

4923 cycles

Data transferConfig. transferCode transferSig. processing

Param calc.

HotChips 200631

Evaluation (Area consumption)

Area (logic gate) consumption

--2,200- Buffer (x1 in PE)

--1,000- Shuffle (x1 in PE)

--1,700- ALU (x8 in PE)

6567,50018,700- Processing Element (x5 in Form.)

2,11222,0004,000Crossbar (Formatter0)

--20,000Inter Unit Buffer

31,584314,200319,000Total

9609,50045,000AUX (x2)

8007,6006,000I/O Buffer Interface (Read & Write)

7687,6003,000Crossbar (Formatter1, Write Ctrl Unit)

13,696136,00020,000Total Inter Unit Ctrl. (Input & Output)

1,15212,000800Micro Controller (x5)

3,13631,00087,500Formatter (x2)

Memory Size

[bit]

F/F Memory [gate size]

Logic

[gate size]

HotChips 200632

Evaluation (Chip layout)

Formatter0

Formatter1

AUX0

AUX1Write Ctrl

Inter Unit Buffer

HotChips 200633

Outline

Motivation



Evaluation

Performance

Area consumption

Conclusion

HotChips 200634

Conclusion

Presented an implementation of hardware

accelerator using dynamically reconfigurable architecture

Efficient signal processing can be achieved by reconfiguring ALUs and crossbars dynamically

More efficient than a general SIMD unit because each crossbar/16bit-ALU can be reconfigured differently

Not limited to the acceleration of H.264 decode

Can also accelerate H.264 encode and other codecsby applying different configurations

HotChips 200635

Conclusion (cont’d)

Performance

Expects 30 frames/sec with 3 accelerators running at 300MHz for decoding Full HD H.264 video

More than 4 times faster than PowerPC G5 in terms of number of cycles required for decoding one macroblock

Area consumption (number of logic gates)

319K Gates (excluding memory)Small enough considering its performance

Size of buffer memory will be 24KB or less

HotChips 200636

Backup Slides

HotChips 200637

Evaluation (Comparison with PowerPC G5)

Application: H.264 decoder

Full HD frame (1920x1080 pixels) consists of 8100 macroblocks)

one macroblock = 16x16 pixels

To achieve frame rate = 24 Full HD frame/sec, one macroblock must be decoded in 5.14 sec

Requires dual 2.0GHz Power Mac G5 (Apple’s recommendation)

decoded w/ Quicktime7.1 (AltiVec used)

5.14 sec = 10288 cycles at 2.0GHz

HotChips 200638

Evaluation (Function level performance results)

69 if motion compensation (luma) excluded133Motion compensation

100Store macroblock

Worst case (filter applied maximum times)~ 784Deblocking filter chroma

Worst case (filter applied maximum times)~ 2112Deblocking filter luma

Worst case (6/4-tap filter applied max. times)~ 1579/~163Inter prediction (luma/chroma)

motion compensation (luma) included~ 448 / 75Intra prediction (luma/chroma)

204 if intra16x16 prediction is not used221IQ/IDCT

CommentCyclesName

Documents

HC18.430.S4T3.An Implementation of Hardware Accelerator ... · HotChips 2006 3 Motivation Need better solution for multimedia applications Processor solution is not efficient Lots