University of Michigan, Electrical Engineering and Computer Science

Increasing Hardware Efficiency with Multifunction Loop Accelerators

Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke

Advanced Computer Architecture Laboratory, University of Michigan

October 25, 2006


Introduction
• Emerging applications have high performance, cost, and energy demands
  – H.264, wireless, software radio, signal processing
  – 10-100 Gops required
  – 200 mW power budget
• Applications dominated by tight loops processing large amounts of streaming data

[Figure: CPU and accelerators]


Loop Accelerators

• Order-of-magnitude performance and efficiency wins
  – Viterbi: 100x speedup vs. ARM9

Automated C-to-gates solution
• Correct by construction
• Close designer productivity gap
• Achieve short time-to-market


Prescribed Throughput Accelerators
• Traditional behavioral synthesis
  – Directly translate C operators into gates
• Our approach: Application-centric Architectures
  – Achieve fixed throughput
  – Maximize hardware sharing

[Figure: operation graph mapped to a datapath; application mapped to an architecture]


Outline
• Loop accelerator schema and design flow
• Cost sensitive scheduling
• Designing multifunction accelerators
  – Naïve
  – Joint scheduling
  – Datapath union
• Synthesis results


Loop Accelerator Template

• Parameterized execution resources, storage, connectivity

• Hardware realization of modulo scheduled loop


Loop Accelerator Design Flow

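[Figure: design flow from C code and a performance (throughput) target, through FU allocation (abstract architecture), modulo scheduling (scheduled ops on FUs over time), datapath construction (concrete architecture with FUs and register files), and architecture instantiation (Verilog, control signals), to a synthesized loop accelerator]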
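The FU allocation step in the flow above can be illustrated with a common lower bound: at an initiation interval of II cycles, one fully pipelined FU can issue at most II operations per loop iteration, so each operation kind needs at least ceil(ops / II) units. The sketch below assumes that model; the function name `allocate_fus` and the operation mix are hypothetical and not taken from the slides.

```python
from collections import Counter
from math import ceil

def allocate_fus(op_kinds, ii):
    """Lower bound on FUs per operation kind for a target initiation interval.

    Assumes every FU is fully pipelined and handles one operation kind.
    """
    if ii < 1:
        raise ValueError("initiation interval must be at least 1")
    counts = Counter(op_kinds)
    return {kind: ceil(n / ii) for kind, n in counts.items()}

# Hypothetical loop body: 5 adds, 2 multiplies, 3 memory operations.
loop_ops = ["add"] * 5 + ["mul"] * 2 + ["mem"] * 3

fast = allocate_fus(loop_ops, ii=2)   # higher throughput, more hardware
slow = allocate_fus(loop_ops, ii=4)   # lower throughput, fewer FUs
print(fast, slow)
```

Relaxing the throughput target (a larger II) shrinks the allocation, which is the lever a prescribed-throughput flow works with.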


Datapath Derived from Schedule

• Schedule to abstract architecture (FUs)
• Determine register and interconnect requirements from schedule

Source code:
r1 = Mem[r2]
r3 = r1 + 12

[Figure: source code, its schedule (LOAD and ADD on FU1 and FU2, at times 1 and 4), and the derived datapath (MEM unit feeding an adder with constant input 12)]
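As a rough illustration of how register and interconnect requirements fall out of a schedule, the sketch below walks the scheduled operations, records which FU feeds which, and counts how many cycles each value must be held before it is consumed. It assumes the LOAD runs on FU1 at time 1 and the ADD on FU2 at time 4; the latencies and all names (`ScheduledOp`, `derive_datapath`) are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScheduledOp:
    name: str      # operation label, e.g. "LOAD"
    fu: str        # functional unit the operation is bound to
    time: int      # issue cycle in the schedule
    latency: int   # cycles until the result is available (assumed value)
    dest: str      # register written
    srcs: tuple    # registers read

def derive_datapath(ops):
    """Return (FU-to-FU wires, cycles each value must be stored)."""
    producer = {op.dest: op for op in ops}
    wires, hold = set(), {}
    for op in ops:
        for src in op.srcs:
            p = producer.get(src)
            if p is None:
                continue  # live-in value, produced outside the loop body
            wires.add((p.fu, op.fu))
            ready = p.time + p.latency
            hold[src] = max(hold.get(src, 0), op.time - ready)
    return wires, hold

ops = [
    ScheduledOp("LOAD", "FU1", time=1, latency=2, dest="r1", srcs=("r2",)),
    ScheduledOp("ADD", "FU2", time=4, latency=1, dest="r3", srcs=("r1",)),
]
wires, hold = derive_datapath(ops)
print(wires, hold)
```

With these assumed latencies, r1 is ready at cycle 3 and read at cycle 4, so the datapath needs one FU1-to-FU2 connection and one cycle of storage for r1.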


Cost Sensitive Scheduling

• 27% cost reduction with same performance [MICRO ’05]

[Figure: two schedules of the operations +1, +2, LD1, LD2 on FU1-FU3 over time steps 0-2, each shown with its resulting FU1-FU3 datapath]

• Traditional scheduling is hardware unaware
• Intelligent scheduling needed to reduce hardware cost
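One way to see why a hardware-unaware schedule can cost more is to price each FU by the set of operation kinds bound to it. The sketch below compares two bindings that issue the same four operations on two FUs. The cost table and both bindings are made up for illustration and do not reproduce the figure or the MICRO ’05 cost model.

```python
# Hypothetical gate cost per capability an FU must implement.
CAP_COST = {"add": 1.0, "load": 3.0}

def fu_cost(binding):
    """binding maps an FU name to the list of operation kinds scheduled on it.

    Each FU pays once for every distinct kind it has to support.
    """
    return sum(
        sum(CAP_COST[kind] for kind in set(kinds))
        for kinds in binding.values()
    )

# Same operations, same number of FUs, same schedule length.
unaware = {"FU1": ["add", "load"], "FU2": ["load", "add"]}
aware = {"FU1": ["add", "add"], "FU2": ["load", "load"]}

print(fu_cost(unaware), fu_cost(aware))
```

Grouping like operations on the same FU halves the cost in this toy case without changing performance, which is the kind of choice a cost-sensitive scheduler is meant to make.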


Multifunction Accelerator
• Map multiple loops to single accelerator
• Improve hardware efficiency via reuse
• Opportunities for sharing
  – Disjoint stages (loops 2, 3)
  – Pipeline slack (loops 4, 5)

[Figure: application flow (Loop 1, Frame Type? branch to Loop 2 or Loop 3, Loop 4, Block 5, …) mapped to an accelerator pipeline of loop accelerators LA1-LA5, and to a pipeline that includes multifunction loop accelerators]


Design Strategies
• Naïve method: Design single function accelerators, place side by side
  – Misses potential hardware sharing of FUs, storage, interconnect

[Figure: Loop 1 and Loop 2 each pass through the cost sensitive modulo scheduler; the resulting FU datapaths are placed side by side to form the multifunction datapath]


Joint Scheduling

• Loops are independent: # possible schedules exponential in # of loops!

• Infeasible for modest problems

[Figure: Loop 1 and Loop 2 enter a joint cost sensitive modulo scheduler, which produces one schedule per loop (ops on FUs over time) and a single shared FU datapath]
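The blow-up can be made concrete with a toy search: if each loop has k candidate schedules, n independent loops give k^n joint combinations to evaluate. The sketch below is only a stand-in for joint scheduling. It reduces each candidate to the FU counts it needs, assumes only one loop runs at a time so shared hardware is the per-kind maximum, and uses hypothetical costs and candidates.

```python
from collections import Counter
from itertools import product

FU_COST = {"add": 1, "mul": 4, "mem": 3}  # hypothetical gate costs

def shared_cost(choice):
    """Cost of hardware that can run any one of the chosen schedules."""
    need = Counter()
    for fu_counts in choice:
        for kind, n in fu_counts.items():
            need[kind] = max(need[kind], n)
    return sum(FU_COST[kind] * n for kind, n in need.items())

def best_joint(candidates_per_loop):
    """Exhaustively try every combination of one candidate per loop."""
    combos = list(product(*candidates_per_loop))
    return min(combos, key=shared_cost), len(combos)

# Two hypothetical candidate schedules per loop, given as FU requirements.
loop1 = [{"add": 2, "mem": 1}, {"add": 1, "mul": 1, "mem": 1}]
loop2 = [{"mul": 1, "mem": 2}, {"add": 2, "mem": 2}]

best, searched = best_joint([loop1, loop2])
print(shared_cost(best), searched)
```

Here only 4 combinations exist, but 10 loops with 100 candidates each would already give 100^10 combinations, which is why exhaustive joint search does not scale.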


Multifunction Gate Costs

• 43% average savings over sum of accelerators

[Figure: stacked bar chart of normalized gate cost (y-axis 0 to 1.2), split into FU, storage, and MUX, comparing naïve and joint designs for loop combinations A-J and their average]
Image: A (sharp, sob), B (sharp, sob, fsed)
MPEG-4: C (idct, deq), D (idct, deq, dca)
Beamformer: E (bfir, bform)
Signal processing: F (vit, fft), G (vit, fft, conv), H (vit, fft, conv, fmd), I (vit, fft, conv, fmd, fmf), J (vit, fft, conv, fmd, fmf, fir)


Datapath Union

[Figure: Loop 1 and Loop 2 are scheduled separately; the two resulting FU datapaths are merged into one by datapath union]


Datapath Union

• Combine similar components → better hardware sharing → lower cost
• Trade off FU and register cost
  – Combining dissimilar FUs can enable register cost savings
• ILP formulation minimizes FU and register cost

[Figure: Accel 1 (FUs: +, -, M, M) and Accel 2 (FUs: +, +, *, M) combined into a multifunction accelerator; two pairings shown: (+, */-, M, M/+) and (+, +/-, M/*, M)]
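The pairing problem can be sketched with a brute-force search in place of the ILP: try every one-to-one pairing of FUs from the two accelerators and keep the cheapest. The FU mixes follow the figure (Accel 1: +, -, M, M; Accel 2: +, +, *, M), but the register counts, the cost constants, and the cost model itself (a merged FU pays for every distinct operation and for the larger of the two register files) are assumptions made for illustration.

```python
from itertools import permutations

OP_COST = {"+": 1, "-": 1, "*": 5, "M": 3}  # hypothetical gate cost per operation
REG_COST = 2                                # hypothetical cost per register

def fu_cost(ops, regs):
    """Cost of one FU supporting the given operations plus its registers."""
    return sum(OP_COST[o] for o in ops) + REG_COST * regs

def merged_cost(a, b):
    """Cost of a single FU that replaces FU a (loop 1) and FU b (loop 2)."""
    (ops_a, regs_a), (ops_b, regs_b) = a, b
    return fu_cost(set(ops_a) | set(ops_b), max(regs_a, regs_b))

def best_union(accel1, accel2):
    """Exhaustive stand-in for the ILP: cheapest one-to-one FU pairing."""
    best = None
    for perm in permutations(accel2):
        cost = sum(merged_cost(a, b) for a, b in zip(accel1, perm))
        if best is None or cost < best[0]:
            best = (cost, list(zip(accel1, perm)))
    return best

# Each FU is (operation symbols as a string, assumed register count).
accel1 = [("+", 4), ("-", 1), ("M", 2), ("M", 2)]
accel2 = [("+", 1), ("+", 4), ("*", 1), ("M", 2)]

naive = sum(fu_cost(set(ops), regs) for ops, regs in accel1 + accel2)
cost, pairing = best_union(accel1, accel2)
print(naive, cost)
```

Under these assumed costs, several pairings tie for the minimum, and the search is factorial in the number of FUs, so it only works for tiny examples; an ILP solver is the practical route the slide describes.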


Multifunction Gate Costs

• Smart union within 3% of joint scheduling solution

[Figure: stacked bar chart of normalized gate cost (y-axis 0 to 1.2), split into FU, storage, and MUX, comparing naïve, union, and joint designs for loop combinations A-J and their average]
Image: A (sharp, sob), B (sharp, sob, fsed)
MPEG-4: C (idct, deq), D (idct, deq, dca)
Beamformer: E (bfir, bform)
Signal processing: F (vit, fft), G (vit, fft, conv), H (vit, fft, conv, fmd), I (vit, fft, conv, fmd, fmf), J (vit, fft, conv, fmd, fmf, fir)


Conclusion
• Multifunction accelerators highly effective in exploiting coarse grained hardware sharing
• Joint scheduling achieves 43% average cost savings, but is impractical
• Smart union of independent accelerators achieves 40% average savings
• Compile times of 5 minutes to 1 hour


Questions?