Upload
pauline-scott
View
215
Download
0
Embed Size (px)
Citation preview
1 University of MichiganElectrical Engineering and Computer Science
Integrating Post-programmability Into the High-level Synthesis Equation*
Scott Mahlke
Advanced Computer Architecture LaboratoryUniversity of Michigan
Ann Arbor, MI USA
* This is work done by Kevin Fan and Manjunath Kudlur at UM
2 University of MichiganElectrical Engineering and Computer Science
Application Engines Differentiate Consumer SoCs
Slide Courtesy of Synfora
3 University of MichiganElectrical Engineering and Computer Science
The HLS Equation
Area
Power
Performance
What about programmability?
How to deal with application
changes?
Time to market
4 University of MichiganElectrical Engineering and Computer Science
Substrate Determines Programmability
MAC
Unit
Addr
Gen
P
Prog Mem
Embedded Processor
(lpArm)
Direct MappedHardware
EmbeddedFPGA
DSP(e.g. TI 320CXX )
Fle
xib
ility
Area or Power
Reconfigurable Processors (Maia)
Factor of 100-1000
100-1000 MOPS/mW
10-100MOPS/mW
.5-5MOPS/mW
5 University of MichiganElectrical Engineering and Computer Science
for(k=0; k<N4; k++) { ... real = Z1[k][0]; img = Z1[k][1];
Z1[k][0] = real * sincos[k][0] - img*sincos[k][1];
Z1[k][0] = Z1[k][0] << 1; if(b_scale) { Z1[k][0] = Z1[k][0] * scale; }}
Version 1.40
for(k=0; k<N4; k++) { ... uint16_t n = k << 1; ComplexMult(...);
X_out[ n] = -RE(x); X_out[N2 - 1 - n] = IM(x); X_out[N2 + n] = -IM(x); X_out[N - 1 - n] = RE(x);}
Version 1.34
for(k=0; k<N4; k++) { ... uint16_t n = k << 1; ComplexMult(...);
X_out[ n] = RE(x); X_out[N2 - 1 - n] = -IM(x); X_out[N2 + n] = IM(x); X_out[N - 1 - n] = -RE(x);}
Version 1.33
Bug fix
in faad2
How Much Programmability?
for(k=0; k<N4; k++) { ... real = Z1[k][0]; img = Z1[k][1];
Z1[k][0] = real * sincos[k][0] - img*sincos[k][1];
Z1[k][0] = Z1[k][0] << 1;}
Version 1.39
New feature
in faad2
Just Enough!
6 University of MichiganElectrical Engineering and Computer Science
StreamRoller Approach
FrameType?
Loop 2 Loop 3
Loop 1
Loop 4
Application
…
Block 5
Loop Accelerator Template Architecture
Point-to-point Connections
+
… …
&
… …
MEM
… …
LocalMem
FSM
Controlsignals
CRF
BR
7 University of MichiganElectrical Engineering and Computer Science
LA Programmability Shortcomings
Point-to-point Connections
+
… …
&
… …
MEM
… …
LocalMem
FSM
Controlsignals
CRF
BR
1. Point-to-point interconnect:
Only dataflow in the original
application is supported
2. Fixed functionality:
Only operators in the original
application are supported3. Hardwired control and
unaddressable register
storage
8 University of MichiganElectrical Engineering and Computer Science
Programmable Loop Accelerator
Point-to-point Connections
+/-
… …
&/|
… …
MEM
… …
LocalMem
ControlMemory
Controlsignals
CRF
BR
RR RRRRRR
Literals
Bus
1. Low-cost functionality generalization 2. Addressable rotating registers
3. Low bandwidth full connectivity path 4. Enable input swapping
5. Programmable literals 6. Memory for decoded control
9 University of MichiganElectrical Engineering and Computer Science
Mapping New Loops onto a PLA
MoveInsertion
SMTScheduling
RegisterAllocation
LoopControl SignalsMachine
descriptionIncrement II
• Large search space, few solutions• Op-centric approaches unable to find solutions• Satisfiability Modulo Theory (SMT) formulation to
solve linear and SAT constraints simultaneously
10 University of MichiganElectrical Engineering and Computer Science
Area Comparison – 130nm Library
0
0.2
0.4
0.6
0.8
1
1.2
1.4b
ffo
rm
dca
c
de
qu
an
t
fft
fir
fmra
dio
fse
d
he
at lu
sob
el
ave
rag
e
Are
a (m
m2)
LA
PLA
OR1K
LA = single function accelerator, PLA = programmable accelerator, OR1K = OR-1200 processor
11 University of MichiganElectrical Engineering and Computer Science
Power Comparison
1
10
100
1000b
fform
dca
c
de
qu
an
t fft fir
fmra
dio
fse
d
he
at
lu
sob
el
ave
rag
e
No
rmal
ized
Po
wer
Co
nsu
mp
tio
n
PLA
OR1K
OR1K-equiv
1.0 = power for single function LA, OR1K-equiv = performance equivalent processor
12 University of MichiganElectrical Engineering and Computer Science
Efficiency Comparison
0
200
400
600
800
1000
1200
1400
0 10 20 30 40 50 60
Power Consumption (mW)
Pe
rfo
rma
nc
e (
MIP
S)
LA
PLA
OR1K
20 MIPS/mW
2 MIPS/mW
200 MIPS/mW
13 University of MichiganElectrical Engineering and Computer Science
Programmability Assessment
0
2
4
6
8
10
12
14
16
18
20b
ffo
rm
dca
c
de
qu
an
t
fft
fir
fmra
dio
fse
d
he
at lu
sob
el
# P
ertu
rbat
ion
s
Number of algorithm perturbations tolerated while maintaining the same performance
14 University of MichiganElectrical Engineering and Computer Science
Final Thoughts
• Programmability not an all or nothing issue– Application accelerators need to be able to evolve– HLS + targeted design generalizations yield a highly customized,
but semi-programmable ASIC• Bottom line tradeoffs
– PLA vs OR-1200: 4 - 34x more power efficient, 30x smaller – PLA vs ASIC: 2 - 9x worse power, 2x larger
• Cost breakdown– Addressable register storage and generalized FUs most costly– Interconnect extensions less costly
15 University of MichiganElectrical Engineering and Computer Science
For More Information
http://cccp.eecs.umich.edu
• “Modulo Scheduling for Highly Customized Datapaths to Increase Hardware Reusability,” K. Fan, H. Park, M. Kudlur, and S. Mahlke, Proc. 2008 International Symposium on Code Generation and
Optimization, Apr. 2008, pp. 124-133. • “Orchestrating the Execution of Stream Programs on Multicore
Platforms,” M. Kudlur and S. Mahlke, Proc. ACM SIGPLAN 2008 Conference on Programming Languages Design and Implementation, Jun. 2008.