University of Michigan, Electrical Engineering and Computer Science

Increasing Hardware Efficiency with Multifunction Loop Accelerators
Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke
Advanced Computer Architecture Laboratory, University of Michigan
October 25, 2006

Introduction
• Emerging applications have high performance, cost, and energy demands
  – H.264, wireless, software radio, signal processing
  – 10-100 Gops required
  – 200 mW power budget
• Applications are dominated by tight loops processing large amounts of streaming data
[Figure: CPU and accelerators]

Loop Accelerators
• Order-of-magnitude performance and efficiency wins
  – Viterbi: 100x speedup vs. ARM9
• Automated C-to-gates solution
• Correct by construction
• Close designer productivity gap
• Achieve short time-to-market

Prescribed Throughput Accelerators
• Traditional behavioral synthesis
  – Directly translate C operators into gates
  [Figure: operation graph mapped to datapath]
• Our approach: application-centric architectures
  – Achieve fixed throughput
  – Maximize hardware sharing
  [Figure: application mapped to architecture]

Outline
• Loop accelerator schema and design flow
• Cost sensitive scheduling
• Designing multifunction accelerators
  – Naïve
  – Joint scheduling
  – Datapath union
• Synthesis results

Loop Accelerator Template
• Parameterized execution resources, storage, connectivity
• Hardware realization of modulo-scheduled loop
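
To make the link between a modulo schedule and the template's execution resources concrete, here is a minimal sketch. It assumes the standard modulo-scheduling notion of an initiation interval (II, the number of cycles between successive loop iterations), which the slide does not name. The operation counts, FU names, and schedule are hypothetical, not taken from the talk.

```python
from math import ceil

def min_fus(op_counts, ii):
    """Lower bound on FUs per type: each FU offers `ii` issue slots per iteration."""
    return {kind: ceil(n / ii) for kind, n in op_counts.items()}

def modulo_slots(schedule, ii):
    """Place each op in slot (fu, time mod ii); reject two ops in the same slot."""
    table = {}
    for op, (fu, time) in schedule.items():
        slot = (fu, time % ii)
        if slot in table:
            raise ValueError(f"{op} conflicts with {table[slot]} in slot {slot}")
        table[slot] = op
    return table

# Hypothetical loop body: 5 adds, 2 multiplies, 3 memory ops, with II = 2.
fu_counts = min_fus({"add": 5, "mul": 2, "mem": 3}, ii=2)

# Hypothetical schedule: op name -> (FU name, start cycle).
table = modulo_slots(
    {"ld": ("mem0", 0), "add": ("alu0", 2), "st": ("mem0", 3)}, ii=2
)
```

A tighter throughput target (smaller II) leaves fewer slots per FU, so more FUs are needed. That is the cost/performance knob a prescribed-throughput design fixes up front.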

Loop Accelerator Design Flow
[Figure: design flow. Inputs: C code (.c) and performance (throughput) target. Stages: FU allocation → abstract architecture → modulo schedule (scheduled ops over FUs and time) → build datapath → concrete architecture (FUs, RF) → instantiate architecture → synthesize. Outputs: Verilog (.v) and control signals for the loop accelerator.]

Datapath Derived from Schedule
• Schedule to abstract architecture (FUs)
• Determine register and interconnect requirements from schedule
Source code:
    r1 = Mem[r2]
    r3 = r1 + 12
[Figure: schedule placing LOAD and ADD on FU1 and FU2 (times 1 and 4), and the derived datapath connecting a MEM unit to an adder with constant input 12]
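
The derivation of register and interconnect requirements from a schedule can be sketched as below. It assumes the figure places LOAD on FU1 at time 1 and ADD on FU2 at time 4, treats the scheduled time as the cycle a result becomes available, and uses a hypothetical initiation interval of 2. The function and field names are illustrative only, not the authors' tool.

```python
from math import ceil

# Schedule for the slide's two-operation example (FU and time mapping assumed).
schedule = {
    "LOAD": {"fu": "FU1", "time": 1, "dest": "r1", "srcs": ["r2"]},
    "ADD":  {"fu": "FU2", "time": 4, "dest": "r3", "srcs": ["r1"]},
}

def derive_requirements(schedule, ii):
    """Return (FU-to-FU wires, registers needed per value) implied by a schedule."""
    producers = {op["dest"]: op for op in schedule.values()}
    wires, registers = set(), {}
    for op in schedule.values():
        for src in op["srcs"]:
            if src not in producers:
                continue  # live-in value, produced outside the loop body
            prod = producers[src]
            wires.add((prod["fu"], op["fu"]))      # producer output feeds consumer input
            lifetime = op["time"] - prod["time"]   # cycles the value must be held
            # A new copy is produced every `ii` cycles, so copies overlap in time.
            registers[src] = max(registers.get(src, 0), ceil(lifetime / ii))
    return wires, registers

wires, registers = derive_requirements(schedule, ii=2)
```

Here `r1` lives for 3 cycles, so with II = 2 two copies are in flight at once and two registers are needed between FU1 and FU2. Scheduling ADD closer to LOAD would shrink that storage.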

Cost Sensitive Scheduling
• 27% cost reduction with same performance [MICRO ’05]
[Figure: two schedules of the operations +1, +2, LD1, LD2 on FU1-FU3 over times 0-2, each with its resulting FU datapath]
• Traditional scheduling is hardware unaware
• Intelligent scheduling needed to reduce hardware cost

Multifunction Accelerator
• Map multiple loops to single accelerator
• Improve hardware efficiency via reuse
• Opportunities for sharing
  – Disjoint stages (loops 2, 3)
  – Pipeline slack (loops 4, 5)
[Figure: application flow (Loop 1, a "Frame Type?" branch selecting Loop 2 or Loop 3, Loop 4, Block 5) mapped first to an accelerator pipeline of loop accelerators LA1-LA5, then to a pipeline containing multifunction loop accelerators]

Design Strategies
• Naïve method: design single-function accelerators, place side by side
  – Misses potential hardware sharing of FUs, storage, interconnect
[Figure: Loop 1 and Loop 2 each pass through a cost sensitive modulo scheduler; the two resulting FU datapaths are placed side by side to form the multifunction datapath]

Joint Scheduling
• Loops are independent: number of possible schedules is exponential in number of loops!
• Infeasible for modest problems
[Figure: Loop 1 and Loop 2 feed a joint cost sensitive modulo scheduler, which produces one schedule per loop over a shared set of FUs]

Multifunction Gate Costs
• 43% average savings over sum of accelerators
[Figure: bar chart of normalized gate cost (axis 0 to 1.2), broken down into FU, storage, and MUX, comparing naïve and joint designs for loop combinations A-J and their average.
  Image: sharp,sob; sharp,sob,fsed
  MPEG-4: idct,deq; idct,deq,dca
  Beamformer: bfir,bform
  Signal processing: vit,fft; vit,fft,conv; vit,fft,conv,fmd; vit,fft,conv,fmd,fmf; vit,fft,conv,fmd,fmf,fir]

Datapath Union
[Figure: Loop 1 and Loop 2 are scheduled separately by cost sensitive modulo schedulers; the two resulting FU datapaths are merged by a datapath union step]
• Combine similar components → better hardware sharing → lower cost
• Trade off FU and register cost
  – Combining dissimilar FUs can enable register cost savings
• ILP formulation minimizes FU and register cost
[Figure: Accel 1 FUs: +, -, M, M. Accel 2 FUs: +, +, *, M. Two candidate multifunction accelerators: (+, */-, M, M/+) and (+, +/-, M/*, M)]
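
As a rough illustration of why the pairing of FUs matters in a union, the sketch below exhaustively searches the pairings of the two accelerators' FUs from the figure. It is a brute-force stand-in, not the ILP formulation from the talk. It ignores register and interconnect cost, assumes both accelerators have the same number of FUs, and uses made-up gate costs in which + and - share one add/subtract unit.

```python
from itertools import permutations

# Hypothetical hardware classes and relative gate costs (not from the talk).
CLASS_OF = {"+": "addsub", "-": "addsub", "*": "mul", "M": "mem"}
CLASS_COST = {"addsub": 1, "mul": 8, "mem": 4}

def fu_cost(ops):
    """Cost of one FU: sum over the distinct hardware classes it must contain."""
    return sum(CLASS_COST[c] for c in {CLASS_OF[op] for op in ops})

def best_union(accel1, accel2):
    """Try every one-to-one pairing of FUs and keep the cheapest merged datapath."""
    best = None
    for perm in permutations(accel2):
        merged = [tuple(sorted({a, b})) for a, b in zip(accel1, perm)]
        cost = sum(fu_cost(fu) for fu in merged)
        if best is None or cost < best[0]:
            best = (cost, merged)
    return best

accel1 = ["+", "-", "M", "M"]   # Accel 1 FUs from the figure
accel2 = ["+", "+", "*", "M"]   # Accel 2 FUs from the figure

naive_cost = sum(fu_cost([op]) for op in accel1 + accel2)  # side-by-side design
union_cost, union_fus = best_union(accel1, accel2)
```

Under these assumed costs the search picks the pairing (+, +/-, M/*, M), the second candidate in the figure, because it merges similar components wherever possible. Exhaustive search grows factorially with the FU count, which is one reason a solver-based formulation is the practical choice.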

Multifunction Gate Costs
• Smart union within 3% of joint scheduling solution
[Figure: bar chart of normalized gate cost (axis 0 to 1.2), broken down into FU, storage, and MUX, comparing naïve, union, and joint designs for loop combinations A-J and their average.
  Image: sharp,sob; sharp,sob,fsed
  MPEG-4: idct,deq; idct,deq,dca
  Beamformer: bfir,bform
  Signal processing: vit,fft; vit,fft,conv; vit,fft,conv,fmd; vit,fft,conv,fmd,fmf; vit,fft,conv,fmd,fmf,fir]

Conclusion
• Multifunction accelerators are highly effective in exploiting coarse-grained hardware sharing
• Joint scheduling achieves 43% average cost savings, but is impractical
• Smart union of independent accelerators achieves 40% average savings
• Compile times of 5 minutes to 1 hour

Questions?