TKT-2431 SoC Design
Lec 8 - Optimization, ASIP
Erno Salminen
Department of Computer Systems, Tampere University of Technology
Fall 2012
Erno Salminen - Oct. 2012
Copyright notice
Part of the slides adapted from the slide set by Alberto Sangiovanni-Vincentelli, course EE249 at University of California, Berkeley: http://www-cad.eecs.berkeley.edu/~polis/class/lectures.shtml
Part of the figures from: J. Heikkinen, J. Sertamo, T. Rautiainen and J. Takala, "Design of Transport Triggered Architecture Processor for Discrete Cosine Transform", in Proc. 15th Ann. IEEE Int. ASIC/SOC Conf., Rochester, NY, U.S.A., Sept. 25-28 2002, pp. 87-91
#2/44 Department of Computer SystemsErno Salminen - Oct. 2012
Outline
Determine bottlenecks - Amdahl’s law
Methods
  Architectural choices
  Algorithm modifications, assembly coding
  Custom processors, e.g. ASIP
  HW accelerators
  (Parallel processing on next lecture)
At first
Make sure that simple things work before even trying more complex ones
Foreword
”Premature optimization is the root of all evil”, Donald Knuth [quoting Hoare]
Sutter, Alexandrescu
  1st rule: Don’t optimize
  2nd rule (for experts only): Don’t do it yet. Measure twice, optimize once.
Focus on making code as clear and readable as possible
Optimizations make design and code more complex
Optimize only when a performance bottleneck has been proven
System bottlenecks (1)
Determine what’s taking time
  Or area, power, memory
A bottleneck halts other parts of the system
[H. Meyr, Application Specific Instruction-Set Processors for Wireless Communications, Tampere SoC, Nov. 2004]
[Berkeley Design Technology Inc., Alternatives to DPSs: What and Why?, Tampere SoC, Nov. 2003]
System bottlenecks (2)
Concentrate optimization on bottlenecks
  Don’t optimize everything, e.g. a function taking 3% of runtime
System may be refined into smaller blocks to define the bottlenecks, e.g. in logic’s critical path or area usage
  Otherwise, it is difficult to determine the relation between an HDL source line and the schematic
Removing a single bottleneck might have only a minor effect if the second worst is almost as bad
  Consider e.g. critical paths in logic
Embarrassingly trivial Matlab example
  Removed one unnecessary #include from m files: 12x speedup
  Locating the bottleneck took a few hours, fixing took 1 minute
Amdahl’s Law
$$ t_{new} = t_{old} \cdot \left( (1 - fraction_{enhanced}) + \frac{fraction_{enhanced}}{speedup_{enhanced}} \right) $$
$$ speedup_{overall} = \frac{t_{old}}{t_{new}} = \frac{1}{(1 - fraction_{enhanced}) + \frac{fraction_{enhanced}}{speedup_{enhanced}}} $$
Note! Very important!
[H. Corporaal, course material Adv. Computer architectures, Univ. Delft, 2001]
Amdahl’s Law Example
Floating point instructions improved to run 2X, but only 10% of actual instructions are FP
$$ t_{new} = t_{old} \cdot (1.0 - 0.1 + 0.1/2) = 0.95 \cdot t_{old} $$
$$ speedup_{overall} = \frac{1}{0.95} = 1.053 $$
$$ \text{Max. } speedup_{overall} = \frac{1}{1 - fraction_{enhanced}} $$
Be careful: ”new is 5% smaller than old” means that ”old is 5.3% larger than new”
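The floating-point example can be checked numerically. The following is a minimal sketch of Amdahl’s law; the function names are chosen here for illustration and are not from the lecture.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only a fraction of the runtime is accelerated."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be in [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def max_speedup(fraction_enhanced: float) -> float:
    """Upper bound reached when the enhanced part takes no time at all."""
    if fraction_enhanced == 1.0:
        return float("inf")
    return 1.0 / (1.0 - fraction_enhanced)


if __name__ == "__main__":
    s = amdahl_speedup(0.1, 2.0)          # 10% of instructions run 2x faster
    print(f"overall speedup: {s:.3f}")
    print(f"new is {100 * (1 - 1 / s):.1f}% smaller than old")
    print(f"old is {100 * (s - 1):.1f}% larger than new")
    print(f"max speedup: {max_speedup(0.1):.3f}")
```

Even an infinitely fast FP unit would give at most about 1.11x here, which is why the bottleneck must be located first.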
Architectural choices
Architectural choices: qualitative
[Figure: log Flexibility vs. log Efficiency (increasing speed, decreasing power and area). Labelled options: general purpose microprocessor, SW programmable DSP, hardware reconfigurable processor (FPGA), std. cell ASIC, full custom ASIC (direct mapped HW). The dream solution exists only in marketing material...]
No free lunch this time either
Architectural choices: quantitative data
[Figure: comparison of general-purpose CPU, DSP, FPGA/ASIP, std-cell ASIC and full custom ASIC. Figure data by T. Noll, RWTH Aachen]
Note: flexibility and price are not included
Heinrich Meyr, Future Wireless Communication Systems…, VTC, 2005. http://www.ieeevtc.org/vtc2005spring/presentations/2020_presentations/HMeyr.pdf
Architectural choices: quantitative (2)
Area and energy efficiencies of comparable MPEG-4 encoder implementations (the bigger the better)
[Figure: area efficiency [Mpixels/s/mm2] and energy efficiency [Mpixels/s/W] of the implementations, with the dream solution marked. Values include RAM.]
[O. Silven and K. Jyrkkä, Observations on Power-Efficiency Trends in Mobile Communication Devices, EURASIP Journal on Embedded Systems, Vol 2007, Article ID 56976, 10 pages, 2007.]
Algorithmic modifications, assembly language
Example: Sorting
Simplest algorithms have O(n^2) execution time
More complex ones have O(n log n)
  Require recursion, advanced data structures, and multiple arrays
  Recursion may lead to stack overflow
  Multiple arrays require big memory
[Figure: comparison of bubble, selection, insertion, shell, heap, merge and quick sort]
Fig: http://linux.wku.edu/~lamonml/algor/sort/sort.html
P.S. Avoid light-colored lines (e.g. yellow).
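The trade-off between a simple O(n^2) algorithm and a recursive O(n log n) one can be illustrated by counting comparisons. This is only a sketch: insertion sort and merge sort are picked here as representatives, and the comparison counter is an illustrative addition, not something measured in the lecture.

```python
def insertion_sort(values):
    """Simple O(n^2) sort, no extra arrays needed (works on a copy here).

    Returns (sorted list, number of comparisons).
    """
    a = list(values)
    comparisons = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0:
            comparisons += 1
            if a[j] <= key:
                break
            a[j + 1] = a[j]      # shift larger element right
            j -= 1
        a[j + 1] = key
    return a, comparisons


def merge_sort(values):
    """Recursive O(n log n) sort that allocates temporary lists.

    Returns (sorted list, number of comparisons).
    """
    if len(values) <= 1:
        return list(values), 0
    mid = len(values) // 2
    left, c_left = merge_sort(values[:mid])
    right, c_right = merge_sort(values[mid:])
    merged = []
    comparisons = c_left + c_right
    i = j = 0
    while i < len(left) and j < len(right):
        comparisons += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, comparisons


if __name__ == "__main__":
    data = list(range(256, 0, -1))   # worst case for insertion sort
    print("insertion:", insertion_sort(data)[1], "comparisons")
    print("merge:    ", merge_sort(data)[1], "comparisons")
```

The merge sort needs far fewer comparisons but pays with recursion depth and temporary arrays, which is exactly the memory cost the slide warns about.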
Algorithm manipulation
Do not perform over-accurate calculation
  Single/double prec. floating-point vs. fixed point
  Fixed point is less accurate but may be enough
  SW emulation of floating point operations is s-l-o-w, tens to hundreds of cycles per operation (+, *, /…)
  HW FPUs are big:
    Nios II/f + periph ~2 kALUT, FPU incl. DIV 4.2 kALUT
    ~5.7 mm2 @0.35 um [Brunelli, TreSoc04], ~120 kgates (compare to RISC core ~50 kgates)
Word width optimization
  Useful especially on HW
  However, smallest is not necessarily fastest on SW
    Using type char may require additional shift/AND/OR instructions
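As an illustration of trading accuracy for cost, the sketch below multiplies two numbers using integers only. The Q15 format (1 sign bit, 15 fractional bits) and the helper names are assumptions made for this example; the lecture does not prescribe a particular fixed-point format.

```python
Q = 15  # fractional bits; Q15 is an assumption for this sketch


def to_q15(x: float) -> int:
    """Quantize a float in [-1, 1) to a 16-bit Q15 integer, saturating at the limits."""
    v = int(round(x * (1 << Q)))
    return max(-(1 << Q), min((1 << Q) - 1, v))


def from_q15(v: int) -> float:
    """Convert a Q15 integer back to a float for inspection."""
    return v / (1 << Q)


def q15_mul(a: int, b: int) -> int:
    """Integer-only multiply: 16b x 16b gives a 32b product, shifted back to Q15.

    No saturation is done here, so (-1.0) * (-1.0) would overflow the Q15 range.
    """
    return (a * b) >> Q


if __name__ == "__main__":
    x, y = 0.3, -0.7
    exact = x * y
    approx = from_q15(q15_mul(to_q15(x), to_q15(y)))
    print(f"float: {exact:.6f}  fixed: {approx:.6f}  error: {abs(exact - approx):.2e}")
```

On a CPU without an FPU the fixed-point version is one integer multiply and one shift, instead of an emulated floating-point routine.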
Example 2: Sacrificing quality
Decrease data width of HW
[Ramchan Woo, Tampere Soc, Nov. 2004]
Assembly coding (1)
Try assembly only if everything else fails
  Keep also the high-level language (HLL) version to allow portability and reuse
Sometimes required with special instructions
  Such as interrupt handling, MMX, processor mode (user/supervisor)
Speedup with RISC processors is not that great
  Usually only one execution unit
  (Few) instructions, simple addressing
  Decent compilers available
Assembly coding (2)
DSPs most likely benefit from assembly
  Tight loops
  Complex micro-architecture is difficult for compiler
“Latest Compilers fall short of hand-optimized performance substantially even for DSP Kernels”
[Naji S. Ghazal et al., Retargetable Estimation for DSP, Architecture Selection, Tampere Soc, Nov. 1999]
Optimization impact
RISC = estimated number of required basic ”RISC” operations
fm = fitting coefficient = measured_cycles / estim_RISC_ops
N.O. = no optimization
H.O. = hand optimized
It was no use to hand-optimize the codes for single-issue RISC (= ARM)
[O. Lehtoranta, PhD Thesis, TUT 2006]
Assembly example: vector copy, B[] = A[]
First version
    start_copy:
        ld   r1, [r2]      // r2 is src addr, A[i]
        st   [r3], r1      // r3 is dst addr, B[i]
        inc  r2
        inc  r3
        dec  r4            // r4 is data amount, one data copied
        cmp  r4, 0         // is enough copied?
        bneq start_copy    // loop back if needed
  Load causes a pipeline stall if the next instruction depends on the loaded value, like here
Second version
        ld   r1, [r2]
        inc  r2
        st   [r3], r1
        and so on ...
  Incrementing r2 does not depend on r1 and the stall is avoided
Load could be performed just before the branch
  Load delay happens during the pipeline stall
Some ISAs support auto-increment in load and store
A poor compiler might even load the table addresses again on every iteration
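The effect of reordering can be estimated with a very small pipeline model. The sketch below assumes one cycle per instruction and a single extra stall cycle when an instruction reads a register loaded by the instruction directly before it; branch penalties are ignored. Both the stall length and the instruction order after `st` in the second version are assumptions, since the slide only says "and so on".

```python
# Each instruction: (mnemonic, destination register or None, source registers)
FIRST = [
    ("ld",   "r1", ["r2"]),
    ("st",   None, ["r3", "r1"]),   # uses r1 right after the load -> stall
    ("inc",  "r2", ["r2"]),
    ("inc",  "r3", ["r3"]),
    ("dec",  "r4", ["r4"]),
    ("cmp",  None, ["r4"]),
    ("bneq", None, []),
]

SECOND = [
    ("ld",   "r1", ["r2"]),
    ("inc",  "r2", ["r2"]),         # independent of r1, fills the load delay
    ("st",   None, ["r3", "r1"]),
    ("inc",  "r3", ["r3"]),
    ("dec",  "r4", ["r4"]),
    ("cmp",  None, ["r4"]),
    ("bneq", None, []),
]


def cycles_per_iteration(loop, load_use_stall=1):
    """One cycle per instruction plus a stall when a load result is used immediately."""
    cycles = 0
    for idx, (_, _, sources) in enumerate(loop):
        cycles += 1
        if idx > 0:
            prev_name, prev_dest, _ = loop[idx - 1]
            if prev_name == "ld" and prev_dest in sources:
                cycles += load_use_stall
    return cycles


if __name__ == "__main__":
    print("first version: ", cycles_per_iteration(FIRST), "cycles/iteration")
    print("second version:", cycles_per_iteration(SECOND), "cycles/iteration")
```

Under these assumptions the reordering saves one cycle out of eight per copied word, with no change in the instruction count.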
Assembly example: delayed branch
Addr  Instruction
a1    i1: MR=MR+MX0*MY0 (SS);
a2    i2: IF COND JUMP aa1;
a3    i3
a4    i4
a5    i5
a6    i6
a7    i7
...   ...
aa1   ii1
Fig 2. ’Normal’ branch
  Four-cycle stall before ii1 is executed
Fig 3. Delayed branch
  Two instr. (i3 + i4) following the branch are also executed. They must be nop if others are not found
  Only two-cycle stall
[http://www.analog.com/UploadedFiles/Application_Notes/587795865ee_123.pdf]
Custom processors (ASIPs)
Custom processors
ASIP = Application Specific Instruction-set Processor
Extend CPU with application (domain) specific instructions
  MAC, sum with clipping, DCT etc.
  Extension tightly coupled with CPU pipeline
Optimize internal communication within CPU
Remove unnecessary instructions
Otherwise configure CPU (num of registers, data width...)
Allow using C/C++ compilation
Custom processor performance (1)
Tensilica Xtensa
Kernel speed-up 6x - 100x
  Depends heavily on application
Base CPU ~20 000 gates
  HW overhead 20% - 150%
Sidenote: (most likely) the largest multiprocessor chip in the world contains 192 Xtensa processors (Cisco’s CSR-1 router chip)
Fig: [Monica Lam, Compiler Technology for Configurable Processors, Tampere SoC, Nov. 2001.]
Custom processor performance (3)
Beneficial also for energy
  Note: E = P * t
[Figure: two examples, annotated (6.1x speedup) and (8.0x speedup)]
[H. Meyr, Application Specific Instruction-Set Processors for Wireless Communications, TreSoC 2004]
Transport Triggered Architecture (TTA)
Application-specific processor
  Wide range in performance vs. cost
  Can reach almost the same cycle count as ASIC
  Still allows programmability, more flexible than HW
Easily configurable
  Number and type of execution units
  Connections between units
  Number of cores (multi-threading)
Many trade-offs between area and performance
Easy way to design an accelerator
  Designer gives C and HW description
  Tools generate synthesizable VHDL
  Automated exploration is under construction
Screen caps: tce.cs.tut.fi
TTA (2)
Harvard architecture
  Separate instruction and data memories
  Supports multiple data memories
C compiler and simulator automatically configured to new micro-architecture
Only one instruction: move
  e.g. ”Add r2, r3, r3”:
    move RF[2] -> ALU.op1
    move RF[3] -> ALU.trig
    move ALU.result -> RF[3]
Instruction word has as many fields as there are internal buses
  Resembles VLIW, everything scheduled at compile-time
Larger code size than RISC
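The idea that an operation happens as a side effect of a move can be shown with a toy model. The class below is a hypothetical sketch, not the TCE simulator: it has one register file and one ALU that performs an addition whenever its trigger port is written.

```python
class TinyTTA:
    """Toy model: a register file plus one ALU whose add fires on a write to 'trig'."""

    def __init__(self, num_regs=8):
        self.rf = [0] * num_regs
        self.alu = {"op1": 0, "trig": 0, "result": 0}

    def read(self, src):
        kind, index = src
        return self.rf[index] if kind == "RF" else self.alu[index]

    def move(self, src, dst):
        """The only instruction: transport a value from src to dst."""
        value = self.read(src)
        kind, index = dst
        if kind == "RF":
            self.rf[index] = value
        else:
            self.alu[index] = value
            if index == "trig":
                # Writing the trigger port starts the operation as a side effect.
                self.alu["result"] = self.alu["op1"] + self.alu["trig"]


def add_r2_r3_r3(cpu):
    """The slide's 'Add r2, r3, r3' expressed as three moves."""
    cpu.move(("RF", 2), ("ALU", "op1"))
    cpu.move(("RF", 3), ("ALU", "trig"))
    cpu.move(("ALU", "result"), ("RF", 3))


if __name__ == "__main__":
    cpu = TinyTTA()
    cpu.rf[2], cpu.rf[3] = 5, 7
    add_r2_r3_r3(cpu)
    print("RF[3] =", cpu.rf[3])
```

In a real TTA several such moves are packed into one instruction word, one per bus, and the compiler decides the schedule.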
Move instruction is handy
Instructions control the internal buses, and operations happen as a side-effect
Resource sharing for buses
  Move result from FU’s output to next one’s input, instead of going through register file -> fewer registers and ports to register file
More freedom in code scheduling than traditional CPUs
  Move can happen later (or earlier) if the result reg (input reg) is not needed for other data -> fewer buses needed, supports different pipeline depths in FUs
Number of units and buses easily configurable
Number of inputs and outputs in an FU is easily configurable (not just 2 inputs and 1 output)
TTA performance
Better area and performance than general purpose RISC
Special function unit (SFU)
  Designed and added manually
  Arbitrary latency and num of operands (thanks to transport-triggered scheme)
  Decreases ex. time but increases area
For certain algorithms, same cycle counts as ASIC may be achieved
  ASIC has larger operating frequency, though
Currently, TTA+tools developed at TUT
  Download: http://tce.cs.tut.fi/
  Used in course TKT-3526 Processor design
  Interested students may do project work on TTA
Area vs. runtime trade-off
TTA’s cycle count smaller than RISC, close to ASIC
TTA’s area between ASIC and RISC
ASIC has highest frequency
[Figure: RC4 exploration (memory excluded)]
[Hämäläinen, Euromicro DSD, 2005]
HW accelerators
Recap: HW accelerators
Favor: highest performance, smallest area and power
Against: longest design time, narrow application domain
Do not require code memory like programmable processors (CPU, ASIP, DSP)
Accelerated function should give identical results with original
  Additional conversion functions reduce speedup
  E.g. converting 16b results to 32b integers with SW or transposing the result matrix in SW takes time
  Optimally, the next function can accept slightly different input
Example: 8x8 DCT
# | Type | Tech [um] | Cycles | Area | Speedup (in cycle count) | Freq [MHz] | Max perf [blocks/s] | Perf/area [blocks/s/gate]
A | RISC (ARM9) | 0.18 | 2660 | 190 kilogates + mem | 1.0 | 160 | 60 k | 0.32
B | ASIP (TTA+SFU) | 0.13 | 538 | 56 kilogates + 34 kilogates mem | 4.9 | 250 | 464 k | 5.16
C | HW (by student) | 0.18 | 250 | 44 kilogates | 10.6 | 182 | 728 k | 16.55
D | HW (by PhD) | 0.11 | 94 | 39 kilogates + control logic | 29.3 | 253 | 2691 k | 69.01
D: [J. Nikara, Application-Specific Parallel Structures for Discrete Cosine Transform and Variable Length Decoding, PhD thesis, TUT, June 2004]
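The derived columns of the DCT table follow from cycle count, clock frequency and gate count. The sketch below recomputes them; the assumption that max performance is frequency divided by cycles per block, and that area for B is the sum of logic and memory gates, is inferred from the numbers and not stated in the lecture.

```python
# (name, cycles per 8x8 block, clock frequency in MHz, area in kilogates)
# Area for B sums logic and memory (56 + 34). Areas for A and D leave out the
# parts the table does not quantify ("+ mem", "+ control logic").
IMPLEMENTATIONS = [
    ("A RISC (ARM9)",     2660, 160, 190),
    ("B ASIP (TTA+SFU)",   538, 250,  90),
    ("C HW (by student)",  250, 182,  44),
    ("D HW (by PhD)",       94, 253,  39),
]


def max_perf_blocks_per_s(cycles, freq_mhz):
    """Blocks per second if the unit processes blocks back to back."""
    return freq_mhz * 1e6 / cycles


def perf_per_gate(cycles, freq_mhz, area_kgates):
    """Throughput normalized by area, in blocks/s per gate."""
    return max_perf_blocks_per_s(cycles, freq_mhz) / (area_kgates * 1e3)


if __name__ == "__main__":
    for name, cycles, freq, area in IMPLEMENTATIONS:
        print(f"{name:18s} {max_perf_blocks_per_s(cycles, freq):10.0f} blocks/s  "
              f"{perf_per_gate(cycles, freq, area):6.2f} blocks/s/gate")
```

Normalizing by area makes the gap between programmable and dedicated solutions much larger than the raw cycle counts suggest.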
HW accelerators: private vs. shared
Originally, accelerators were always attached to the CPU memory bus
  Smaller SoCs, just 1 CPU, poor portability
Today, both private and shared accelerators are used
Each shared resource needs some arbitration mechanism which decides who can use it
  Leads to contention and (usually) unpredictable delay
Data ”flowing through” the accelerator (e.g. cpu1 → acc → cpu2) is better than a ”roundtrip” (cpu1 → acc → cpu1)
[Figure: CPU 1 and CPU 2, each with I+D mem, connected through network IFs to an on-chip network, with accel 1, accel 2 and accel 3 attached either locally or through the network]
Local, private acc. => Low time overhead. Large area if all CPUs have their own accel
Remote, shared acc. => Larger and more unpredictable time overhead, but also a smaller area
HW accelerators: Usage overhead
Regular, data-flow type functions most suitable for HW
Communication between CPU and HW critical
  Delay for feeding the input and getting results
  Ensuring mutual exclusion so that no other CPU uses the same HW
Pipelining reduces the idle period in CPU
[Figure: timelines of CPU only, CPU + HW v.1 and CPU + HW v.2, annotated ”4x speedup”]
CPU + HW v.1: Communication overhead reduces the overall speedup. Moreover, CPU is idle when HW is active
CPU + HW v.2: Pipeline uses CPU and HW concurrently once the first results from HW are ready. Requires a bit more bookkeeping in SW.
HW accelerators: Pipelined usage
Orig SW:
    for i=0:N loop
        load r1, [r2]
        add, sub, mul, cmp, beq, other processing
        st r1, [r3]
    end loop
  Measured SW ex. time includes loading input values and storing the results
  Even if HW does processing much faster, data transfers from CPU to HW must be taken into account
SW + HW, straightforward polling
    start_hw()
    while (hw_ready==0) {}    // polling = busy wait
    for i=0:N loop
        load r1, [r2]
    end loop
SW + HW, pipelined
    start_hw()
    other_function_x();
    while (hw_ready==0) {}
    for i=0:N loop
        load r1, [r2]
    end loop
  Function X executed in parallel with HW. Less time wasted in polling. Most efficient when HW and Function_X take nearly the same time
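The benefit of overlapping Function X with the accelerator can be expressed as a simple timing model. This sketch assumes that in the polling variant Function X runs only after the results have been read, so that both variants do the same total work; the parameter names and cycle values are made up for illustration.

```python
def polling_time(t_start, t_hw, t_read, t_func_x):
    """CPU busy-waits for HW, reads the results, and only then runs function X."""
    return t_start + t_hw + t_read + t_func_x


def pipelined_time(t_start, t_hw, t_read, t_func_x):
    """Function X overlaps with HW; the CPU polls only if HW is still running."""
    return t_start + max(t_hw, t_func_x) + t_read


def cpu_idle_pipelined(t_hw, t_func_x):
    """Cycles the CPU still spends polling in the pipelined variant."""
    return max(0, t_hw - t_func_x)


if __name__ == "__main__":
    for t_x in (20, 90, 100, 300):
        args = (10, 100, 64, t_x)   # start, HW time, result read, function X
        print(f"t_x={t_x:3d}: polling {polling_time(*args):4d}, "
              f"pipelined {pipelined_time(*args):4d}, "
              f"idle {cpu_idle_pipelined(100, t_x):3d}")
```

The idle time drops to zero once Function X is at least as long as the HW run, and neither unit waits for the other when the two take the same time, which matches the slide's remark.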
HW accelerator: Overhead (2)
Sometimes, overhead is even larger than actual computation
In the example below, both Motion Estimation (ME) and DCT-Quant-Iquant-IDCT took about 25 kcycles on Nios
Accelerators in ideal conditions (in TB) took 1/70x and 1/14x of SW time
Especially the ME requires large input data (>2.1 kcycles) and the large transfer contends for memory access with other parts of SoC (>4.3 kcycles)
Despite overheads, accelerators offered about 3.5x and 6.5x speedups
[A. Rasmus et al., "IP Integration Overhead Analysis…", DDECS, 2007]
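The ME figures can be combined into a rough estimate of the achieved speedup. The sketch assumes that the 1/70 ideal factor belongs to ME (it is listed first) and that the two overheads simply add to the ideal HW time; both are assumptions, and the overheads are only lower bounds in the source.

```python
def effective_speedup(t_sw, ideal_factor, overhead):
    """Speedup once transfer and contention overhead is added to the ideal HW time."""
    t_hw_ideal = t_sw / ideal_factor
    return t_sw / (t_hw_ideal + overhead)


if __name__ == "__main__":
    t_sw = 25_000                   # cycles on Nios, from the slide
    me_overhead = 2_100 + 4_300     # input transfer + memory contention (lower bounds)
    print(f"ME ideal:         {effective_speedup(t_sw, 70, 0):.1f}x")
    print(f"ME with overhead: {effective_speedup(t_sw, 70, me_overhead):.1f}x")
```

With these lower-bound overheads the estimate is an upper bound of the real speedup, which is in line with the reported figure of about 3.5x. The overhead is roughly 18 times longer than the ideal computation itself.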
HW accelerators: Getting the results
Interrupts allow more efficient parallel execution than polling
  Or, CPU can enter a sleep state to save energy
Most SoCs include DMA units that can efficiently transfer data between resources
CPU controlled transfers vs. DMA
a) CPU transfers all the data, time O(n), e.g. 7 cycles/word
    memcpy (&B[0], &A[0], 64*sizeof(int));
b) CPU just inits DMA controller, cpu_time O(1), dma_time O(n) but only ~1 cycle/word
    start_dma:
        st #DMA_SRC_ADDR, r1
        st #DMA_DST_ADDR, r2
        st #DMA_AMOUNT, r4
    do_other_stuff()
    ...
  CPU is free once DMA is started
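The difference between the two transfer styles can be put into numbers with the per-word costs given above. In this sketch the DMA setup cost of 3 cycles is an assumption (one cycle for each of the three register writes); the function names are illustrative.

```python
def cpu_copy_cycles(words, cycles_per_word=7):
    """CPU moves every word itself, so it is busy for the whole transfer."""
    return words * cycles_per_word


def dma_copy_cycles(words, setup_cycles=3, cycles_per_word=1):
    """Returns (cpu_busy_cycles, dma_busy_cycles) for a DMA-driven transfer."""
    return setup_cycles, words * cycles_per_word


if __name__ == "__main__":
    n = 64
    cpu_busy, dma_busy = dma_copy_cycles(n)
    print(f"CPU copy: CPU busy {cpu_copy_cycles(n)} cycles")
    print(f"DMA copy: CPU busy {cpu_busy} cycles, DMA busy {dma_busy} cycles")
```

For the 64-word copy of the example the CPU is occupied for a few cycles instead of several hundred, and the transfer itself also finishes sooner.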
HW block-level optimization (1)
Reuse benefits from configurability and many parameters
  Run-time configurability is often costly
    Good for simulation-based testing
  Convert input signals into generics for synthesis
Turn unwanted features off to save area and power
  Perhaps increases the max freq also
  if enable_g = ’1’ then <code>;
Example: config memory inside bus wrapper
  2 generics: 1. we = write enable, 2. re = read enable
  Optimize according to application
[Figure: configuration memory area [gates] for memory types rom and ram with no slots, 1 slot and 2 slots, for we=0/1 and re=0/1]
HW block-level optimization (2)
Try to design HW so that propagation delay is not (linearly) dependent on data width
  Scalable solution
  Bad example: if data < 55 then data <= data+1;
  Better: if data /= 55 then data <= data+1;
Turn on boundary optimization
  Logic in different entities optimized together
  Hand-coding might be required with more complicated boundary-optimization
[Figure: block A drives block B through a 4b signal]
  E.g. inverters can be removed
  Restricted value set in output (if output uses < 16 of all possible values): this can be optimized for smaller range
  Note: combinatorial outputs not recommended
HW block-level optimization (3)
Minimize the data width of signals
Remove unnecessary flip-flops (à 4-6 eq. gates)
  i.e. those with constant output
  DC: set compile_seqmap_propagate_constants true
  Optimizes also the logic after the flip-flop
By default, synthesis does NOT remove any registers
  All signals that are assigned in a sequential process (clk, rst_n) produce a flip-flop
[Figure: flip-flop with constant output; the propagated constant makes the following logic always 1 or always 0]
HW block-level optimization (4)
Do not ’reset’ registers when value is not needed
  e.g. if valid_in = ’0’ then data_r <= (others => ’0’);
  Unnecessary input MUX
  Good for visualization in simulation though
    if dbg_enable_g = ’1’ then reg <= dbg_value;
    Easy to see when these are valid
  Validity determined according to signal empty
[Figure: real logic with an unnecessary mux selecting between the real value and a ”debug value”]
HW optim: Aim at ”fast enough”
Do not overoptimize HW, if performance limit is known
  A 100 frames/sec encoder is not better than a 25 fps enc, if the camera restricts the frame rate anyway
Minimizing critical path causes large area
  Requires larger drive strength for gates
  They also have higher leakage currents
Minimizing cycle count needs many parallel sub-blocks (e.g. ALUs)
Consider the integration overheads also
Fig: [J. Wei, C. Rowen, “Implementing low-power configurable processors…”, DAC 2005]
Conclusion
Remember Amdahl’s law: concentrate on appropriate parts of the system
ASIPs provide great improvements (like ASIC) but allow programmability (like CPU)
Communication between components has great impact on performance
  Use interrupts and DMA controllers
  Pipeline SW and HW
Don’t overdo things, aim for fast enough and then minimize area and power
Sidenote: ASIC vs. FPGA Design Starts
[Figure: ASIC design starts (axis 0 to 6000) and PLD/FPGA design starts (axis 0 to 600000) for 2001-2004. Source: Gartner Group]
“ASIC design starts will decline 12.3 percent to 4,345 this year following the precipitous 36 percent drop in design starts in 2001” (B. Lewis, Gartner Dataquest, 10/28/02)
PLD/FPGAs are becoming more and more the driving force in microelectronics technology, CAD tools and System-on-Chip design.
Note! New ASICs are much larger (much more logic, much more personnel involved) than previously.
Custom processor performance (2)
SC140 = original Star Core DSP
GFISA = special instructions for Galois field operations added
  HW overhead ~10%
Special ISA does not help every algorithm!
[Figure: Reed-Solomon decoding cycle count (runtime). Speedup = t(sc140) / t(gfisa): 22.1, 14.5, 6.3, 1.0]
[Yasmin Oz et al., Galois Field Instruction Set Accelerator in the StarCore SC140 DSP, Tampere SoC, Nov. 2001.]
Last warning: Scheduling anomaly
Improving one part of the system may result in worse performance
This is due to changes in the schedule, i.e. order of execution
[Figure: two schedules of tasks a0, a1, a2, b0 and b1 on PE0, PE1 and PE2 over time. In the original schedule the deadline for a2 is met; with a faster NoC and a faster PE (PE0’) the deadline is violated]