University of Michigan, Electrical Engineering and Computer Science
Increasing Hardware Efficiency with Multifunction Loop Accelerators
Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke
Advanced Computer Architecture Laboratory, University of Michigan
October 25, 2006
Introduction
• Emerging applications have high performance, cost, and energy demands
– H.264, wireless, software radio, signal processing
– 10-100 Gops required
– 200 mW power budget
• Applications dominated by tight loops processing large amounts of streaming data
[Figure: CPU with attached accelerators]
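The throughput and power targets above imply an efficiency requirement that can be worked out directly. The sketch below is only back-of-the-envelope arithmetic on the two numbers from the slide; the function names are illustrative and not part of the original work.

```python
# Efficiency targets implied by the slide's numbers (arithmetic only).
POWER_BUDGET_W = 0.200        # 200 mW power budget (from the slide)
THROUGHPUT_GOPS = (10, 100)   # 10-100 Gops required (from the slide)


def efficiency_gops_per_watt(gops: float, watts: float) -> float:
    """Required efficiency in Gops per watt."""
    return gops / watts


def energy_per_op_pj(gops: float, watts: float) -> float:
    """Energy available per operation, in picojoules."""
    # watts / (ops per second) gives joules per op; 1 J = 1e12 pJ.
    return watts / (gops * 1e9) * 1e12


for gops in THROUGHPUT_GOPS:
    eff = efficiency_gops_per_watt(gops, POWER_BUDGET_W)
    pj = energy_per_op_pj(gops, POWER_BUDGET_W)
    print(f"{gops} Gops at {POWER_BUDGET_W} W: {eff:.0f} Gops/W, {pj:.0f} pJ/op")
```

The targets work out to roughly 50 to 500 Gops/W, or about 20 pJ down to 2 pJ per operation.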
Loop Accelerators
• Order-of-magnitude performance and efficiency wins
– Viterbi: 100x speedup vs. ARM9
Automated C-to-gates solution
• Correct by construction
• Close designer productivity gap
• Achieve short time-to-market
Prescribed Throughput Accelerators
• Traditional behavioral synthesis
– Directly translate C operators into gates
[Figure: operation graph → datapath]
• Our approach: Application-centric architectures
– Achieve fixed throughput
– Maximize hardware sharing
[Figure: application → architecture]
Outline
• Loop accelerator schema and design flow
• Cost sensitive scheduling
• Designing multifunction accelerators
– Naïve
– Joint scheduling
– Datapath union
• Synthesis results
Loop Accelerator Template
• Parameterized execution resources, storage, connectivity
• Hardware realization of modulo scheduled loop
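In a modulo scheduled loop, a new iteration starts every II cycles (the initiation interval), so an operation issued at time t reuses its FU at t + II, t + 2·II, and so on. A minimal sketch of the resulting resource check, a modulo reservation table, follows. The schedule, op names, and `build_mrt` helper are hypothetical and not taken from the slides.

```python
from typing import Dict, Tuple


def build_mrt(schedule: Dict[str, Tuple[int, int]], ii: int) -> Dict[Tuple[int, int], str]:
    """Build a modulo reservation table.

    schedule maps op name -> (issue time, FU index).
    Returns (time mod II, FU index) -> op name, raising on a conflict.
    """
    if ii <= 0:
        raise ValueError("II must be positive")
    mrt: Dict[Tuple[int, int], str] = {}
    for op, (time, fu) in schedule.items():
        slot = (time % ii, fu)
        if slot in mrt:
            # Two ops would need the same FU in the same cycle of every iteration.
            raise ValueError(f"{op} conflicts with {mrt[slot]} in slot {slot}")
        mrt[slot] = op
    return mrt


# Hypothetical three-op loop body scheduled on two FUs with II = 2.
schedule = {"LOAD": (1, 0), "ADD": (4, 1), "STORE": (6, 0)}
mrt = build_mrt(schedule, ii=2)
print(mrt)
```

Because each FU slot is fixed per `time mod II`, the accelerator's control reduces to a repeating pattern of II cycles.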
Loop Accelerator Design Flow
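[Figure: design flow. C code (.c) and performance (throughput) target → FU allocation → abstract architecture (FUs) → modulo schedule (scheduled ops Op1, Op2, Op3, … over time on FUs) → build datapath → concrete architecture (RF, FUs) → instantiate architecture → Verilog (.v) and control signals → synthesize → loop accelerator]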
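The flow starts from a throughput target, which for a modulo scheduled loop corresponds to an initiation interval (II). One plausible reading of the FU allocation step is a resource lower bound: with N operations of a given type and one issue per FU per cycle, at least ceil(N / II) FUs of that type are needed. The sketch below assumes fully pipelined FUs and uses hypothetical op counts; the actual allocation step may be more involved.

```python
from typing import Dict


def min_fus(op_counts: Dict[str, int], ii: int) -> Dict[str, int]:
    """Lower bound on FUs per type for a given initiation interval.

    Assumes each FU accepts one operation per cycle (fully pipelined).
    """
    if ii <= 0:
        raise ValueError("II must be positive")
    # -(-n // ii) is integer ceiling division.
    return {kind: -(-count // ii) for kind, count in op_counts.items()}


# Hypothetical loop body: 7 adds, 3 multiplies, 4 memory ops, target II = 4.
allocation = min_fus({"add": 7, "mul": 3, "mem": 4}, ii=4)
print(allocation)
```

Relaxing the throughput target (a larger II) lowers the bound, which is how a prescribed throughput translates into hardware cost.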
Datapath Derived from Schedule
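• Schedule to abstract architecture (FUs)
• Determine register and interconnect requirements from schedule
Source code:
r1 = Mem[r2]
r3 = r1 + 12
[Figure: the schedule places LOAD on FU1 at time 1 and ADD on FU2 at time 4; the derived datapath connects a MEM unit to an adder whose other input is the constant 12]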
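The derivation of interconnect and storage from a schedule can be sketched for the two-operation example above. Each producer-to-consumer dependence implies a wire between the two FUs, and the gap between when a value is ready and when it is consumed implies how long it must be held. The LOAD and ADD times and FUs come from the slide; the latencies, the `ScheduledOp` type, and `derive_requirements` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class ScheduledOp:
    name: str
    fu: str
    time: int                # issue time in the schedule
    latency: int             # cycles until the result is ready (assumed)
    dest: str
    srcs: Tuple[str, ...]


def derive_requirements(ops: List[ScheduledOp]) -> Tuple[Set[Tuple[str, str]], Dict[str, int]]:
    """Return (FU-to-FU wires, cycles each value must be held)."""
    producer = {op.dest: op for op in ops}
    wires: Set[Tuple[str, str]] = set()
    hold: Dict[str, int] = {}
    for op in ops:
        for src in op.srcs:
            p = producer.get(src)
            if p is None:
                continue  # live-in value: produced outside this loop body
            wires.add((p.fu, op.fu))
            cycles = op.time - (p.time + p.latency)
            if cycles < 0:
                raise ValueError(f"{op.name} reads {src} before it is ready")
            hold[src] = max(hold.get(src, 0), cycles)
    return wires, hold


ops = [
    ScheduledOp("LOAD", "FU1", time=1, latency=2, dest="r1", srcs=("r2",)),
    ScheduledOp("ADD", "FU2", time=4, latency=1, dest="r3", srcs=("r1",)),
]
wires, hold = derive_requirements(ops)
print(wires, hold)
```

With the assumed 2-cycle load latency, `r1` is ready at time 3 and consumed at time 4, so one cycle of storage sits on the FU1-to-FU2 path.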
Cost Sensitive Scheduling
• 27% cost reduction with same performance [MICRO ’05]
• Traditional scheduling is hardware unaware
• Intelligent scheduling needed to reduce hardware cost
[Figure: two alternative schedules of the operations +1, +2, LD1, LD2 on FU1-FU3 over times 0-2, each shown with the datapath it implies]
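A toy example illustrates why the assignment of operations to FUs affects cost even when performance is unchanged. It uses a deliberately simple, hypothetical cost model in which each FU pays for every kind of operation it must support. It does not reproduce the figure or the MICRO '05 cost model, and the relative costs are made up.

```python
from typing import Dict, List, Tuple

# Hypothetical relative gate costs per operation kind.
OP_KIND_COST = {"add": 1.0, "load": 3.0}


def datapath_cost(assignment: Dict[str, List[Tuple[str, str]]]) -> float:
    """Sum, over FUs, the cost of each distinct op kind the FU must support.

    assignment maps FU name -> list of (op name, op kind).
    """
    total = 0.0
    for ops in assignment.values():
        kinds = {kind for _, kind in ops}
        total += sum(OP_KIND_COST[k] for k in kinds)
    return total


# Both assignments place four ops on two FUs, two ops each (same throughput).
hardware_unaware = {
    "FU1": [("+1", "add"), ("LD2", "load")],
    "FU2": [("LD1", "load"), ("+2", "add")],
}
cost_sensitive = {
    "FU1": [("+1", "add"), ("+2", "add")],
    "FU2": [("LD1", "load"), ("LD2", "load")],
}
print(datapath_cost(hardware_unaware), datapath_cost(cost_sensitive))
```

Under this model, mixing operation kinds forces both FUs to implement both functions, while grouping like operations lets each FU stay specialized.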
Multifunction Accelerator
• Map multiple loops to single accelerator
• Improve hardware efficiency via reuse
• Opportunities for sharing
– Disjoint stages (loops 2, 3)
– Pipeline slack (loops 4, 5)
[Figure: application flow (Loop 1, a "Frame Type?" branch to Loop 2 or Loop 3, Loop 4, …, Block 5) mapped first to an accelerator pipeline of loop accelerators LA1-LA5, then to a pipeline in which some loop accelerators are replaced by multifunction loop accelerators]
Design Strategies
• Naïve method: Design single function accelerators, place side by side
– Misses potential hardware sharing of FUs, storage, interconnect
[Figure: Loop 1 and Loop 2 each pass through their own cost sensitive modulo scheduler; the two resulting datapaths are placed side by side to form the multifunction datapath]
Joint Scheduling
• Loops are independent: # possible schedules exponential in # of loops!
• Infeasible for modest problems
[Figure: Loop 1 and Loop 2 feed a single joint cost sensitive modulo scheduler, which produces one schedule per loop (ops over time on FUs) targeting a shared datapath]
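The exponential growth noted above follows from the loops being independent: any schedule of one loop can be combined with any schedule of another, so the joint search space is the product of the per-loop spaces. A minimal sketch follows; the figure of 1,000 candidate schedules per loop is a made-up number for illustration.

```python
import math
from typing import Sequence


def joint_schedule_count(per_loop_counts: Sequence[int]) -> int:
    """Size of the joint search space for independent loops."""
    return math.prod(per_loop_counts)


# Hypothetical: each loop has 1,000 candidate schedules.
for n_loops in (1, 2, 4, 6):
    print(n_loops, joint_schedule_count([1000] * n_loops))
```

With these assumed counts, six loops already give 10^18 combinations, which is why joint scheduling becomes infeasible even for modest problems.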
Multifunction Gate Costs
• 43% average savings over sum of accelerators
[Figure: bar chart of normalized gate cost (y-axis 0 to 1.2), naïve vs. joint, each bar stacked by FU, storage, and MUX cost. Benchmark combinations, grouped by domain (Image, MPEG-4, Beamformer, Signal processing): A: sharp,sob; B: sharp,sob,fsed; C: idct,deq; D: idct,deq,dca; E: bfir,bform; F: vit,fft; G: vit,fft,conv; H: vit,fft,conv,fmd; I: vit,fft,conv,fmd,fmf; J: vit,fft,conv,fmd,fmf,fir; plus Avg]
Datapath Union
[Figure: Loop 1 and Loop 2 each pass through their own cost sensitive modulo scheduler; a datapath union step merges the two resulting datapaths into one]
Datapath Union
• Combine similar components → better hardware sharing → lower cost
• Trade off FU and register cost
– Combining dissimilar FUs can enable register cost savings
• ILP formulation minimizes FU and register cost
[Figure: Accel 1 (+, -, M, M) and Accel 2 (+, +, *, M) merged into a multifunction accelerator; two candidate unions are shown: (+, */-, M, M/+) and (+, +/-, M/*, M)]
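The pairing problem behind datapath union can be sketched on the two accelerators from the figure. The version below is a brute-force stand-in for the ILP: it tries every one-to-one pairing of FUs and keeps the cheapest. It models FU cost only and ignores register and interconnect cost. The hardware-block mapping and the relative costs are hypothetical.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

# Hypothetical: which hardware block each op needs, and its relative cost.
BLOCK_OF = {"+": "addsub", "-": "addsub", "*": "mul", "M": "mem"}
BLOCK_COST = {"addsub": 1, "mul": 8, "mem": 4}


def fu_cost(ops: Sequence[str]) -> int:
    """Cost of one FU that must support all of the given ops."""
    blocks = {BLOCK_OF[op] for op in ops}
    return sum(BLOCK_COST[b] for b in blocks)


def best_union(accel1: Sequence[str], accel2: Sequence[str]) -> Tuple[int, List[Tuple[str, str]]]:
    """Exhaustively pair FUs of two equally sized accelerators at minimum FU cost."""
    if len(accel1) != len(accel2):
        raise ValueError("this sketch assumes equal FU counts")
    best = None
    for perm in permutations(accel2):
        merged = list(zip(accel1, perm))
        cost = sum(fu_cost(pair) for pair in merged)
        if best is None or cost < best[0]:
            best = (cost, merged)
    return best


accel1 = ["+", "-", "M", "M"]   # Accel 1 from the figure
accel2 = ["+", "+", "*", "M"]   # Accel 2 from the figure
naive_cost = sum(fu_cost([op]) for op in accel1 + accel2)
union_cost, pairing = best_union(accel1, accel2)
print(naive_cost, union_cost, pairing)
```

Under these assumed costs the side-by-side design costs 24 and the best union costs 18, pairing + with +, - with +, M with *, and M with M. A formulation that also counts register cost, as the ILP does, could prefer a different pairing.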
Multifunction Gate Costs
• Smart union within 3% of joint scheduling solution
[Figure: bar chart of normalized gate cost (y-axis 0 to 1.2), naïve vs. union vs. joint, each bar stacked by FU, storage, and MUX cost. Benchmark combinations, grouped by domain (Image, MPEG-4, Beamformer, Signal processing): A: sharp,sob; B: sharp,sob,fsed; C: idct,deq; D: idct,deq,dca; E: bfir,bform; F: vit,fft; G: vit,fft,conv; H: vit,fft,conv,fmd; I: vit,fft,conv,fmd,fmf; J: vit,fft,conv,fmd,fmf,fir; plus Avg]
Conclusion
• Multifunction accelerators highly effective in exploiting coarse grained hardware sharing
• Joint scheduling achieves 43% average cost savings, but is impractical
• Smart union of independent accelerators achieves 40% average savings
• Compile times of 5 minutes – 1 hour
Questions?