Increasing Hardware Efficiency with Multifunction Loop Accelerators. Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke. Advanced Computer Architecture Laboratory, University of Michigan, Electrical Engineering and Computer Science. October 25, 2006.


Page 1: Increasing Hardware Efficiency with Multifunction Loop Accelerators

University of Michigan, Electrical Engineering and Computer Science

Increasing Hardware Efficiency with Multifunction Loop Accelerators

Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke

Advanced Computer Architecture Laboratory, University of Michigan

October 25, 2006

Page 2: Increasing Hardware Efficiency with Multifunction Loop Accelerators

Introduction

• Emerging applications have high performance, cost, and energy demands
  – H.264, wireless, software radio, signal processing
  – 10-100 Gops required
  – 200 mW power budget

• Applications dominated by tight loops processing large amounts of streaming data

[Figure: a CPU alongside accelerators]

Page 3: Increasing Hardware Efficiency with Multifunction Loop Accelerators

Loop Accelerators

• Order-of-magnitude performance and efficiency wins
  – Viterbi: 100x speedup vs. ARM9

[Figure: automated C (.C) to gates solution]

• Correct by construction
• Close designer productivity gap
• Achieve short time-to-market

Page 4: Increasing Hardware Efficiency with Multifunction Loop Accelerators

Prescribed Throughput Accelerators

• Traditional behavioral synthesis
  – Directly translate C operators into gates

[Figure labels: Operation graph, Datapath]

• Our approach: Application-centric Architectures
  – Achieve fixed throughput
  – Maximize hardware sharing

[Figure labels: Application, Architecture]

Page 5: Increasing Hardware Efficiency with Multifunction Loop Accelerators

Outline

• Loop accelerator schema and design flow
• Cost sensitive scheduling
• Designing multifunction accelerators
  – Naïve
  – Joint scheduling
  – Datapath union
• Synthesis results

Page 6: Increasing Hardware Efficiency with Multifunction Loop Accelerators


Loop Accelerator Template

• Parameterized execution resources, storage, connectivity

• Hardware realization of modulo scheduled loop

Page 7: Increasing Hardware Efficiency with Multifunction Loop Accelerators

Loop Accelerator Design Flow

[Figure: design flow, reconstructed from the diagram labels]

C code (.c) and performance (throughput) target
→ FU allocation → abstract architecture
→ Modulo schedule → scheduled ops (Op1, Op2, Op3, … placed on FUs over time)
→ Build datapath → concrete architecture (RF and FUs)
→ Instantiate architecture → Verilog (.v) and control signals
→ Synthesize → loop accelerator
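The first step of the flow, FU allocation from a throughput target, can be sketched with the standard resource bound used in modulo scheduling: if a new loop iteration must start every II cycles, a class of operations needs at least ceil(ops / II) function units. The snippet below is only an illustration under that assumption (fully pipelined FUs, one operation per FU per cycle); the function name, the operation classes, and the example loop are hypothetical and not taken from the talk.

```python
import math
from collections import Counter

def allocate_fus(ops, ii):
    """Lower bound on FUs per operation class for initiation interval `ii`.

    Assumes each FU is fully pipelined and issues one operation per cycle,
    so one FU can serve at most `ii` operations of its class per iteration.
    """
    if ii < 1:
        raise ValueError("ii must be at least 1")
    counts = Counter(ops)
    return {cls: math.ceil(n / ii) for cls, n in counts.items()}

# Hypothetical loop body: three adds, two loads, one multiply.
loop_ops = ["add", "add", "add", "load", "load", "mul"]
alloc = allocate_fus(loop_ops, ii=2)
print(alloc)
```

A smaller II (higher throughput) raises the FU counts, which is the throughput-versus-cost knob the flow starts from.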

Page 8: Increasing Hardware Efficiency with Multifunction Loop Accelerators

Datapath Derived from Schedule

• Schedule to abstract architecture (FUs)
• Determine register and interconnect requirements from schedule

Source code:
  r1 = Mem[r2]
  r3 = r1 + 12

Schedule: LOAD on FU1 at time 1, ADD on FU2 at time 4

Datapath: MEM unit feeding an adder (+) whose other input is the constant 12
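One way to picture how register and interconnect requirements fall out of a schedule: every producer/consumer pair implies a wire between the two FUs, and the gap between when a result is ready and when it is consumed implies how long it must be held. The sketch below reads the slide's example as LOAD on FU1 at time 1 and ADD on FU2 at time 4; the 2-cycle load latency, the class name, and the function name are assumptions made for illustration only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScheduledOp:
    name: str
    fu: str
    time: int          # issue cycle
    latency: int       # assumed cycles until the result is available
    srcs: tuple = ()   # names of the ops whose results this op consumes

def derive_datapath(ops):
    """Return (FU-to-FU wires, cycles each value must be held in a register)."""
    by_name = {op.name: op for op in ops}
    wires = set()
    hold = {}
    for op in ops:
        for src in op.srcs:
            producer = by_name[src]
            wires.add((producer.fu, op.fu))
            cycles = op.time - (producer.time + producer.latency)
            if cycles < 0:
                raise ValueError(f"{op.name} issues before {src} is ready")
            hold[(src, op.name)] = cycles
    return wires, hold

schedule = [
    ScheduledOp("LOAD", fu="FU1", time=1, latency=2),            # r1 = Mem[r2]
    ScheduledOp("ADD", fu="FU2", time=4, latency=1, srcs=("LOAD",)),  # r3 = r1 + 12
]
wires, hold = derive_datapath(schedule)
print(wires, hold)
```

Under these assumed latencies the loaded value is ready at cycle 3 and consumed at cycle 4, so the datapath needs one FU1-to-FU2 connection and one cycle of storage for r1.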

Page 9: Increasing Hardware Efficiency with Multifunction Loop Accelerators

Cost Sensitive Scheduling

• 27% cost reduction with same performance [MICRO ’05]

[Figure: two modulo schedules of the operations +1, +2, LD1, LD2 on FU1, FU2, FU3 over time slots 0-2, each shown with its resulting datapath]

• Traditional scheduling is hardware unaware
• Intelligent scheduling needed to reduce hardware cost
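To make the contrast concrete, here is a minimal sketch of how two resource-legal schedules with the same II can imply different hardware. Each schedule is placed into a modulo reservation table keyed by (FU, time mod II), and an FU is charged for every distinct kind of operation it must support. The relative costs, the two example schedules, and the function names are hypothetical, and data dependences are ignored for brevity; this is not the cost model from the talk.

```python
II = 2
OP_COST = {"add": 1, "load": 3}  # hypothetical relative gate costs

def place(schedule):
    """Build a modulo reservation table from (kind, fu, time) triples."""
    mrt = {}
    for kind, fu, time in schedule:
        slot = (fu, time % II)
        if slot in mrt:
            raise ValueError(f"resource conflict on {slot}")
        mrt[slot] = kind
    return mrt

def fu_cost(mrt):
    """Charge each FU for every distinct operation kind scheduled on it."""
    kinds_per_fu = {}
    for (fu, _slot), kind in mrt.items():
        kinds_per_fu.setdefault(fu, set()).add(kind)
    return sum(sum(OP_COST[k] for k in kinds) for kinds in kinds_per_fu.values())

# Hardware-unaware: adds and loads interleaved across both FUs.
unaware = [("add", "FU1", 0), ("load", "FU1", 1), ("add", "FU2", 1), ("load", "FU2", 0)]
# Cost-aware: like operations grouped so each FU stays specialized.
aware = [("add", "FU1", 0), ("add", "FU1", 1), ("load", "FU2", 0), ("load", "FU2", 1)]

unaware_cost = fu_cost(place(unaware))
aware_cost = fu_cost(place(aware))
print(unaware_cost, aware_cost)
```

Both schedules use two FUs and the same II, but the first forces both FUs to support adds and loads, while the second needs one adder and one memory unit.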

Page 10: Increasing Hardware Efficiency with Multifunction Loop Accelerators

Multifunction Accelerator

• Map multiple loops to single accelerator
• Improve hardware efficiency via reuse
• Opportunities for sharing
  – Disjoint stages (loops 2, 3)
  – Pipeline slack (loops 4, 5)

[Figure: an application consisting of Loop 1, a "Frame Type?" branch selecting Loop 2 or Loop 3, Loop 4, and Block 5. It is mapped first to an accelerator pipeline of five loop accelerators (LA1-LA5), then to a three-stage pipeline (LA1-LA3) in which two stages are multifunction loop accelerators.]

Page 11: Increasing Hardware Efficiency with Multifunction Loop Accelerators

Design Strategies

• Naïve method: Design single function accelerators, place side by side
  – Misses potential hardware sharing of FUs, storage, interconnect

[Figure: Loop 1 and Loop 2 each go through their own cost sensitive modulo scheduler; the two resulting FU datapaths are placed side by side to form the multifunction datapath]

Page 12: Increasing Hardware Efficiency with Multifunction Loop Accelerators

Joint Scheduling

• Loops are independent: # possible schedules exponential in # of loops!
• Infeasible for modest problems

[Figure: Loop 1 and Loop 2 feed a single joint cost sensitive modulo scheduler, which produces a schedule for each loop (ops placed on FUs over time) targeting one shared FU datapath]
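As a rough illustration of why the joint search space grows so quickly: if each loop on its own has some number of candidate schedules and the loops are independent, a scheduler that considers them together faces the product of those counts. The per-loop count of 1000 below is a made-up number for the sake of the arithmetic, not a measurement from the talk.

```python
import math

def joint_schedule_count(per_loop_counts):
    """Number of joint schedules when every loop's choice is independent."""
    return math.prod(per_loop_counts)

# Hypothetical: each loop alone has 1000 candidate schedules.
for num_loops in (1, 2, 4, 6):
    print(num_loops, joint_schedule_count([1000] * num_loops))
```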

Page 13: Increasing Hardware Efficiency with Multifunction Loop Accelerators

Multifunction Gate Costs

• 43% average savings over sum of accelerators

[Figure: stacked bar chart of normalized gate cost (y-axis 0 to 1.2), split into FU, storage, and MUX cost, comparing the naïve and joint designs for each benchmark combination and for the average (Avg).
Image: A = sharp,sob; B = sharp,sob,fsed
MPEG-4: C = idct,deq; D = idct,deq,dca
Beamformer: E = bfir,bform
Signal processing: F = vit,fft; G = vit,fft,conv; H = vit,fft,conv,fmd; I = vit,fft,conv,fmd,fmf; J = vit,fft,conv,fmd,fmf,fir]

Page 14: Increasing Hardware Efficiency with Multifunction Loop Accelerators

Datapath Union

[Figure: Loop 1 and Loop 2 are scheduled separately, each by its own cost sensitive modulo scheduler; the two resulting FU datapaths are then merged by a datapath union step]

Page 15: Increasing Hardware Efficiency with Multifunction Loop Accelerators

Datapath Union

• Combine similar components → better hardware sharing → lower cost
• Trade off FU and register cost
  – Combining dissimilar FUs can enable register cost savings
• ILP formulation minimizes FU and register cost

[Figure: FUs of two accelerators and two possible unions]
  Accel 1:              +    -     M     M
  Accel 2:              +    +     *     M
  Multifunction accel:  +    */-   M     M/+
  Multifunction accel:  +    +/-   M/*   M
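The FU side of the union can be sketched as an assignment problem: pair each FU of one accelerator with an FU of the other so that the merged FUs are as cheap as possible. The talk uses an ILP formulation that also accounts for register cost; the sketch below deliberately swaps in a brute-force search over pairings and looks at FU cost only. The operation classes, their relative costs, and the function names are hypothetical, and it assumes both accelerators have the same number of FUs.

```python
from itertools import permutations

# Hypothetical model: ops in the same class share hardware when merged.
CLASS_OF = {"+": "addsub", "-": "addsub", "*": "mul", "M": "mem"}
CLASS_COST = {"addsub": 1, "mul": 4, "mem": 3}

def fu_cost(ops):
    """Cost of one FU that must support every op in `ops`."""
    return sum(CLASS_COST[c] for c in {CLASS_OF[op] for op in ops})

def best_union(accel1, accel2):
    """Try every one-to-one pairing of FUs and keep the cheapest."""
    best = None
    for perm in permutations(accel2):
        pairs = list(zip(accel1, perm))
        cost = sum(fu_cost(pair) for pair in pairs)
        if best is None or cost < best[0]:
            best = (cost, pairs)
    return best

accel1 = ["+", "-", "M", "M"]
accel2 = ["+", "+", "*", "M"]

naive_cost = sum(fu_cost([op]) for op in accel1 + accel2)  # side by side
union_cost, pairing = best_union(accel1, accel2)
print(naive_cost, union_cost, pairing)
```

With these made-up costs the cheapest pairing merges + with +, - with +, M with *, and M with M, which has the same shape as the "+  +/-  M/*  M" row in the figure. Brute force is only workable for a handful of FUs, which is one reason a solver-based formulation is the more realistic choice.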

Page 16: Increasing Hardware Efficiency with Multifunction Loop Accelerators

Multifunction Gate Costs

• Smart union within 3% of joint scheduling solution

[Figure: stacked bar chart of normalized gate cost (y-axis 0 to 1.2), split into FU, storage, and MUX cost, comparing the naïve, union, and joint designs for each benchmark combination and for the average (Avg).
Image: A = sharp,sob; B = sharp,sob,fsed
MPEG-4: C = idct,deq; D = idct,deq,dca
Beamformer: E = bfir,bform
Signal processing: F = vit,fft; G = vit,fft,conv; H = vit,fft,conv,fmd; I = vit,fft,conv,fmd,fmf; J = vit,fft,conv,fmd,fmf,fir]

Page 17: Increasing Hardware Efficiency with Multifunction Loop Accelerators

Conclusion

• Multifunction accelerators highly effective in exploiting coarse grained hardware sharing
• Joint scheduling achieves 43% average cost savings, but is impractical
• Smart union of independent accelerators achieves 40% average savings
• Compile times of 5 minutes to 1 hour

Page 18: Increasing Hardware Efficiency with Multifunction Loop Accelerators


Questions?