Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System. Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke. Advanced Computer Architecture Laboratory, University of Michigan.
University of Michigan, Electrical Engineering and Computer Science
Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System
Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke
Advanced Computer Architecture Laboratory, University of Michigan

Introduction
• Emerging applications have high performance, cost, and energy demands
– H.264, wireless, software radio, signal processing
– 10-100 Gops required
– 200 mW power budget
• Applications are dominated by tight loops processing large amounts of streaming data
[Figure: mobile device block diagram with labels 3.5G (HSDPA), WiMax, stereo headset, TV out, PC / Mac, memory card, 20 GB HD. Source: ARM 2005]

Loop Accelerators
• Order-of-magnitude performance and efficiency wins
– Viterbi: 100x speedup vs. ARM9
Automated C-to-gates solution
• Correct by construction
• Close the designer productivity gap
• Achieve short time-to-market

Loop Accelerator Template
• Parameterized execution resources, storage, connectivity
• Hardware realization of modulo scheduled loop

Loop Accelerator Design Flow
[Figure: design flow. Inputs: C code (.c) and a performance (throughput) target. (1) FU allocation produces an abstract architecture; (2) modulo scheduling produces scheduled ops (ops placed on FUs over time); (3) datapath construction produces a concrete architecture (FUs and register files); (4) architecture instantiation produces Verilog (.v) and control signals; (5) synthesis produces the loop accelerator.]

Modulo Scheduling and Datapath Derivation
• Schedule to abstract architecture (FUs)
• Determine register and interconnect requirements from schedule
Source code:
r1 = Mem[r2]
r3 = r1 + 12
[Figure: schedule over FU1 and FU2 with the LOAD at time 1 and the dependent ADD at time 4, and the derived datapath connecting a MEM unit to an adder with constant input 12]

Cost Sensitive Scheduling
• Different scheduling alternatives are not equal
[Figure: two alternative schedules of the operations LD1, LD2, +1, +2 on FU1-FU3 over times 0-2, each shown with the datapath it implies]
• Traditional scheduling is hardware unaware
• Intelligent scheduling is needed to reduce hardware cost

Scheduling to Reduce Cost
• Hardware cost is a function of the final schedule
• Increased hardware sharing = reduced cost
• Reusing hardware is “free”
• Traditional metrics (register pressure) are not sufficient
[Figure: FU and storage examples numbered 1-4, annotated "No additional cost for longer lifetime"]

Initial Approach: Greedy
• Standard iterative modulo scheduler, augmented with hardware cost model
• Choose the alternative that increases cost the least
while unscheduled ops remain {
    get valid alternatives for op
    for each alternative {
        get hardware cost
    }
    schedule op using min-cost alternative
    update hardware cost model
}
Hardware cost = FU cost + Storage cost + Wire cost
[Figure: operation types +, -, *, <<]
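The greedy loop above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' scheduler: the gate costs in `OP_COST`, the per-cycle `REG_COST` storage proxy, and the rule that an FU costs the sum of the distinct op types it supports are all hypothetical, wire cost is omitted, and the eviction step of a real iterative modulo scheduler is replaced by an error.

```python
# Hypothetical gate costs per operation type (illustrative values only).
OP_COST = {"add": 100, "shift": 80, "mul": 1000, "load": 300}
REG_COST = 32  # hypothetical cost of holding one value for one extra cycle


def fu_cost(op_types):
    """Cost of one FU, assumed to be the sum over the distinct op types it supports."""
    return sum(OP_COST[t] for t in set(op_types))


def greedy_schedule(ops, deps, num_fus, ii, latency=1):
    """Place each op on the (FU, time) alternative that adds the least cost.

    ops:  {name: op_type}, listed in topological order.
    deps: {name: [predecessor names]}.
    Returns ({name: (fu, time)}, total FU cost).
    """
    occupied = set()                         # modulo reservation table: (fu, time % ii)
    fu_types = [[] for _ in range(num_fus)]  # op types mapped to each FU so far
    placed = {}
    for op, op_type in ops.items():
        earliest = max((placed[p][1] + latency for p in deps.get(op, [])), default=0)
        best = None
        # Valid alternatives: every free FU slot in one II-wide window.
        for fu in range(num_fus):
            for t in range(earliest, earliest + ii):
                if (fu, t % ii) in occupied:
                    continue
                added_fu = fu_cost(fu_types[fu] + [op_type]) - fu_cost(fu_types[fu])
                added_storage = REG_COST * (t - earliest)  # crude storage proxy
                candidate = (added_fu + added_storage, t, fu)
                if best is None or candidate < best:
                    best = candidate
        if best is None:
            # An iterative modulo scheduler would evict ops or raise II here.
            raise RuntimeError(f"no free slot for {op} at II={ii}")
        _, t, fu = best
        placed[op] = (fu, t)
        occupied.add((fu, t % ii))
        fu_types[fu].append(op_type)         # update the hardware cost model
    return placed, sum(fu_cost(types) for types in fu_types)


ops = {"a": "add", "b": "add", "m": "mul"}
deps = {"b": ["a"]}
placement, cost = greedy_schedule(ops, deps, num_fus=2, ii=2)
print(placement, cost)
```

In this toy case the two adds share one FU and the multiply gets its own, so no FU has to support both operation types. Because each op is placed without looking ahead, the result depends on the order in which ops are visited.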

Results – Greedy Scheduling
• 5% average cost savings
• Local scope leads to local minima
• Much more cost savings possible
[Figure: normalized gate cost (FU, storage, MUX) of the naïve and greedy schedulers for sobel, fir, dequant, dcac, viterbi, sharp, sha, and the average]

Optimal Modulo Scheduling
[Figure: example loop with operations +1, +2, LD3, +4, -5, and its search space, in which each of Op1, Op2, Op3, ... chooses an (FU #, time) pair such as (1,0), (1,1), (2,0), (2,1), (3,0), (3,1)]
• Optimal modulo scheduling extends [Eichenberger ’97]
Storage cost = $\sum_i \text{width}_i \times \text{depth}_i$
FU cost = $\sum_i \text{cost}(\text{FU}_i)$
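The two cost terms can be evaluated as in the short sketch below. It reads the storage cost as a sum over register structures of width times depth and the FU cost as a sum of per-FU costs, which is an interpretation of the slide's formulas. The `cost_per_bit` parameter and the `cost_of` table are hypothetical names and values introduced for illustration.

```python
def storage_cost(registers, cost_per_bit=1):
    """registers: iterable of (width_bits, depth) pairs, one per register structure."""
    return cost_per_bit * sum(width * depth for width, depth in registers)


def fu_cost(fus, cost_of):
    """fus: iterable of FU kinds; cost_of: {kind: gate cost} (hypothetical table)."""
    return sum(cost_of[kind] for kind in fus)


registers = [(32, 2), (16, 1)]              # two register structures
fus = ["adder", "adder", "multiplier"]
cost_of = {"adder": 100, "multiplier": 1000}
print(storage_cost(registers), fu_cost(fus, cost_of))  # prints: 80 1200
```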

Results – Optimal Scheduling
• 27% average cost savings
[Figure: normalized gate cost (FU, storage, MUX) of the naïve, greedy, and optimal (opt) schedulers for sobel, fir, dequant, dcac, viterbi, sharp, sha, and the average]

Problem Decomposition
• Exact solutions are not practical
– (#FU × II × stages) ^ #ops possible schedules
– 20 lines of C code: 100 hours
– Excessive runtimes even for modest-size loops
• Decompose into more manageable sub-problems
– Partitioned scheduling
– Time-space decomposition
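To get a feel for the growth of the search space, the count above can be evaluated directly. The sketch assumes the expression is the product of FU count, II, and stage count raised to the number of ops; the function name and the sample parameters (3 FUs, II of 2, 2 stages) are illustrative choices, not figures from the talk.

```python
def schedule_space_size(num_fus, ii, stages, num_ops):
    """Candidate schedules when each op independently picks an FU,
    a slot within the II, and a stage."""
    return (num_fus * ii * stages) ** num_ops


for num_ops in (5, 10, 20):
    size = schedule_space_size(num_fus=3, ii=2, stages=2, num_ops=num_ops)
    print(f"{num_ops} ops: {size:.3e} candidate schedules")
```

With 12 choices per op, adding five more ops multiplies the space by 12^5 = 248,832, which is why exhaustive search stops scaling after a few dozen operations.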

Partitioned Scheduling
• Partition the operations into small groups
• Schedule groups of operations sequentially
– Account for hardware contribution of previously scheduled groups
– Backtrack if infeasible state reached
[Figure: a five-operation dataflow graph (ops 1-5) split into groups, with each group passed in turn to the optimal modulo scheduler]

Operation Partitioning
• Traditional partitioning: minimize edge cuts
– Does not necessarily lead to good cost
• Goal: maximize hardware sharing opportunities within a group
[Figure: dataflow graph of +, LD, <<, and * operations partitioned into groups]
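Partitioned scheduling, and the effect of the grouping on cost, can be sketched as follows. This is a toy under stated assumptions: each group is scheduled by exhaustive search over free (FU, slot) pairs, data dependences are ignored, backtracking is replaced by an error, and the `OP_COST` values and the sum-of-supported-types FU cost are hypothetical.

```python
from itertools import permutations

OP_COST = {"add": 100, "mul": 1000}  # hypothetical gate costs


def datapath_cost(assign, ops):
    """Sum, over FUs, of the cost of each distinct op type the FU must support."""
    types_on_fu = {}
    for op, (fu, _slot) in assign.items():
        types_on_fu.setdefault(fu, set()).add(ops[op])
    return sum(OP_COST[t] for types in types_on_fu.values() for t in types)


def schedule_group(group, ops, fixed, num_fus, ii):
    """Exhaustively place one group, keeping earlier groups' placements fixed."""
    taken = set(fixed.values())
    free = [(fu, s) for fu in range(num_fus) for s in range(ii) if (fu, s) not in taken]
    best = None
    for choice in permutations(free, len(group)):
        trial = {**fixed, **dict(zip(group, choice))}
        if best is None or datapath_cost(trial, ops) < datapath_cost(best, ops):
            best = trial
    return best  # None if the group does not fit in the free slots


def partitioned_schedule(groups, ops, num_fus, ii):
    assign = {}
    for group in groups:
        assign = schedule_group(group, ops, assign, num_fus, ii)
        if assign is None:
            raise RuntimeError("infeasible state; a full scheduler would backtrack")
    return assign


ops = {"a": "add", "b": "add", "c": "mul", "d": "mul"}
by_type = partitioned_schedule([["a", "b"], ["c", "d"]], ops, num_fus=2, ii=2)
mixed = partitioned_schedule([["a", "c"], ["b", "d"]], ops, num_fus=2, ii=2)
print(datapath_cost(by_type, ops), datapath_cost(mixed, ops))  # prints: 1100 2200
```

Grouping ops that can share hardware lets each sub-problem see the sharing opportunity. The mixed grouping forces both FUs to support both operation types, doubling the cost in this toy model.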

Results – Partitioned Scheduling
• 8% average cost savings
• With a large number of partitions, results are similar to greedy
[Figure: normalized gate cost (FU, storage, MUX) of the naïve, greedy, partitioned (part), and optimal (opt) schedulers for sobel, fir, dequant, dcac, viterbi, sharp, sha, and the average]

Partition Size for Sharp
• Improve cost by considering more ops at a time
[Figure: cost in gates (axis 0 to 30,000) versus partition size (3, 6, 9, ..., 30, full) for sharp]

Time-Space Decomposition
[Figure: a five-operation dataflow graph (ops 1-5) scheduled two ways. "Time, space" first assigns each op a time (time 0, time 1) and then an FU. "Space, time" first assigns each op an FU (FU 1, FU 2, FU 3) and then a time. Both yield a schedule over FU1-FU3 and times 0-1.]
• Reduce scheduling complexity
• View all operations together
• Optimize for register depth during time assignment, register width and FU cost during space assignment
0
0.2
0.4
0.6
0.8
1
1.2
na
ïve
gre
ed
yp
art
op
t
na
ïve
gre
ed
yp
art
op
t
na
ïve
gre
ed
yp
art
op
t
na
ïve
gre
ed
yp
art
op
t
na
ïve
gre
ed
yp
art
op
t
na
ïve
gre
ed
yp
art
op
t
na
ïve
gre
ed
yp
art
op
t
na
ïve
gre
ed
yp
art
op
t
sobel fir dequant dcac viterbi sharp sha Average
No
rma
lize
d G
ate
Co
st
0
0.2
0.4
0.6
0.8
1
1.2
na
ïve
gre
ed
yp
art ts st
op
t
na
ïve
gre
ed
yp
art ts st
op
t
na
ïve
gre
ed
yp
art ts st
op
t
na
ïve
gre
ed
yp
art ts st
op
t
na
ïve
gre
ed
yp
art ts st
op
t
na
ïve
gre
ed
yp
art ts st
op
t
na
ïve
gre
ed
yp
art ts st
op
t
na
ïve
gre
ed
yp
art ts st
op
t
sobel fir dequant dcac viterbi sharp sha Average
No
rma
lize
d G
ate
Co
st
Results – Time-Space Scheduling
• Time, space: 19% average cost savings• Space, time: 20% average cost savings
FU Storage MUX

Real Cost Savings
Viterbi, naïve scheduler, 0.66 mm²
Viterbi, space-time decomposed scheduler, 0.37 mm²
43.2% overall area savings

Conclusion
• Automated C-to-loop-accelerator synthesis system
• Modulo scheduler must be cost aware
• Decomposition methods make the problem tractable
– 20% average cost savings with space-time decomposition
– Importance of global view of all operations
• Individual savings up to 43%
• Compile times of 1 minute to 30 minutes

Questions?
• For more information: http://cccp.eecs.umich.edu