MSE: Hardware Algorithms
Parallelization
Marcel Jacomet, Josef Goette
Bern University of Applied Sciences, BFH-TI, HuCE-microLab, Biel/Bienne
Marcel.Jacomet@bfh.ch
October 11, 2017
Contents
1 Introduction
2 Parallelization
3 Unfolding
4 Hardware Rules
5 OCT Example
5.1 OCT Introduction
6 Parallelization at OCT Example
6.1 Data-Path Unfolding
6.2 FiFo Unfolding
6.3 DFT Unfolding
Hardware Algorithms
© Marcel Jacomet, 2012 - 2016
All rights reserved. This work may not be translated or copied in
whole or in part without the written permission by the author, except
for brief excerpts in connection with reviews or scholarly analysis.
Use in connection with any form of information storage and retrieval,
electronic adaptation, computer software is forbidden.
Marcel Jacomet ii 2008
1 Introduction
Textbooks
• VLSI Digital Signal Processing Systems, Design and Implementation, Keshab K. Parhi, John Wiley & Sons, ISBN 0-471-24186-5, 1999, USD 135
• OCT texts discussing the lab example can be found on the web
2 Parallelization
Parallelization Principles 1
• parallelization of degree p speeds up hardware algorithms by up to a factor of p
• parallelization of hardware can basically be done in two ways:
– p identical hardware paths processing time-delayed data streams in parallel
– p interlinked hardware paths processing a stream of data vectors (p data sets each) in parallel
• the first approach is a straightforward implementation that uses p times the hardware of the non-parallelized design
• the second approach is more challenging; it uses p times the number of operators of the non-parallelized hardware, but only the same number of storage elements
Parallelization Principles: Parallel Streams
Parallelization Principles: Parallel Sets
[Figure: five data sampling channels (1 to 5) together supply one data sample, a 5-set vector; interlinked parallel processing of samples (vectors)]
Dataflow Graph Representation
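y[n] = a · x[n] + b · x[n−1] + c · x[n−2]
• block diagram of 3-tap FIR filter
[Figure: block diagram with two 1/z unit delays producing x[n−1] and x[n−2], coefficient multipliers a, b, c, and output y[n]]
• data-flow diagram of 3-tap FIR filter
[Figure: data-flow graph with input x[n], multipliers a, b, c, edge delays D and 2D, and output y[n]]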
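The difference equation above can be tried out in software before drawing any hardware. The following is a minimal Python sketch, not part of the original slides; the function name `fir3` and the coefficient values are illustrative assumptions. The two-element delay line plays the role of the two 1/z registers.

```python
from collections import deque

def fir3(x, a, b, c):
    """Direct-form 3-tap FIR: y[n] = a*x[n] + b*x[n-1] + c*x[n-2]."""
    # delay[0] holds x[n-1], delay[1] holds x[n-2]; both registers start at 0.
    delay = deque([0, 0], maxlen=2)
    y = []
    for sample in x:
        y.append(a * sample + b * delay[0] + c * delay[1])
        delay.appendleft(sample)  # shift the delay line by one sample
    return y

# Assumed example coefficients; a unit impulse returns them in order.
impulse_response = fir3([1, 0, 0, 0], a=2, b=3, c=5)
```

Feeding a unit impulse is a quick check that the taps a, b, c appear at delays 0, 1 and 2.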
Dataflow Graph: Pipelining
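• pipelining is done by introducing additional delay elements (registers)
• pipelining delay elements can only be placed in feed-forward paths
[Figure: data-flow graph of the 3-tap FIR filter (x[n], a, b, c, y[n]) with edge delays D and 2D, and a pipelined version with edge delays D and 3D plus one additional D]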
Dataflow Graph: Pipelining for Speedup
• pipelining to increase clock frequency
• retiming theory (Bellman-Ford or Floyd-Warshall algorithms)
• FIR example: with multipliers taking 2u and adders taking 1u, the clock frequency is 1/(2u) instead of 1/(4u)
[Figure: three data-flow graphs of the 3-tap FIR filter with multiplier time (2u) and adder time (1u): the original graph and two pipelined versions with additional delay elements D]
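The clock period is bounded by the critical path, i.e. the longest chain of operators that is not interrupted by a register. The sketch below is an illustration under assumptions: the node names and the register placement (one register after each multiplier) are chosen here and need not match the figure on the slide. It reproduces the 4u and 2u values of the FIR example.

```python
def critical_path(node_time, edges):
    """Longest computation time along edges that carry no delay element."""
    zero_delay_succ = {node: [] for node in node_time}
    for src, dst, delays in edges:
        if delays == 0:
            zero_delay_succ[src].append(dst)

    memo = {}

    def longest_from(node):
        # Time of the longest register-free path that starts at this node.
        # Assumes the zero-delay subgraph is acyclic, as in any realizable DFG.
        if node not in memo:
            tail = max((longest_from(s) for s in zero_delay_succ[node]), default=0)
            memo[node] = node_time[node] + tail
        return memo[node]

    return max(longest_from(node) for node in node_time)

# Multipliers take 2u, adders take 1u (values from the slide).
node_time = {"mul_a": 2, "mul_b": 2, "mul_c": 2, "add_1": 1, "add_2": 1}

# Edges are (source, destination, number of delay elements).
original = [("mul_a", "add_1", 0), ("mul_b", "add_1", 0),
            ("add_1", "add_2", 0), ("mul_c", "add_2", 0)]
# Assumed pipelining: one register on every edge leaving a multiplier.
pipelined = [("mul_a", "add_1", 1), ("mul_b", "add_1", 1),
             ("add_1", "add_2", 0), ("mul_c", "add_2", 1)]

t_original = critical_path(node_time, original)    # 2u + 1u + 1u
t_pipelined = critical_path(node_time, pipelined)  # max(2u, 1u + 1u)
```

All three multiplier outputs cross the same feed-forward cut, so each of them receives one register and the output is simply delayed by one clock cycle.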
3 Unfolding
Unfolding 1
• unfolding or loop unrolling
• example
y[n] = a · y[n−9] + x[n]
for i ← 1 to ∞ do
    y[i] ← a · y[i−9] + x[i]
• replacing index n first by 2k and then by 2k+1 gives two equations
• together, the 2 equations describe the same algorithm
y[2k] = a · y[2k−9] + x[2k]
y[2k+1] = a · y[2k−8] + x[2k+1]
Unfolding 2
• parallelization degree: J-slow
• J-slow means that for an input x[kJ+m], the output after one delay element is x[(k−1)J+m], i.e. each delay on a signal line spans J samples
• thus, for J = 2, we get:
y[2k] = a · y[2(k−5)+1] + x[2k]
y[2k+1] = a · y[2(k−4)+0] + x[2k+1]
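That the two unfolded equations describe the same algorithm can be checked numerically. The following Python sketch is an illustration only; the function names and the zero initial conditions are assumptions. Note that the two assignments inside one iteration k do not depend on each other, which is what allows them to run in parallel.

```python
def original(x, a):
    """y[n] = a*y[n-9] + x[n], with y[n] = 0 for n < 0 (assumed initial state)."""
    y = [0.0] * len(x)
    for n in range(len(x)):
        past = y[n - 9] if n >= 9 else 0.0
        y[n] = a * past + x[n]
    return y

def unfolded(x, a):
    """2-unfolded version: two output samples per iteration k (len(x) must be even)."""
    y = [0.0] * len(x)
    for k in range(len(x) // 2):
        # y[2k] = a*y[2(k-5)+1] + x[2k]: 5 delays on the odd line
        past_odd = y[2 * (k - 5) + 1] if k >= 5 else 0.0
        y[2 * k] = a * past_odd + x[2 * k]
        # y[2k+1] = a*y[2(k-4)+0] + x[2k+1]: 4 delays on the even line
        past_even = y[2 * (k - 4)] if k >= 4 else 0.0
        y[2 * k + 1] = a * past_even + x[2 * k + 1]
    return y

x = [float(n) for n in range(1, 41)]
same = original(x, 0.5) == unfolded(x, 0.5)
```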
Unfolding 3
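• data flow graph of example
[Figure: data-flow graph with input x[n], multiplier a, a 9D feedback delay and output y[n]]
• algorithm of example (2-slow)
[Figure: unfolded data-flow graph with inputs x[2k] and x[2k+1], two multipliers a, feedback delays 4D and 5D, and outputs y[2k] and y[2k+1]]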
Unfolding Design Procedure
• for each node U in the original DFG, draw the J nodes U_0, U_1, ..., U_{J−1}
• for each edge U → V with w delays in the original DFG, draw the J edges U_i → V_{(i+w) mod J} with ⌊(i+w)/J⌋ delays, for i = 0, 1, 2, ..., J−1
[Figure: DFG with nodes U, V, T and its 3-unfolded version with nodes U0, U1, U2, V0, V1, V2, T0, T1, T2; the edge labels give the delays before and after unfolding]
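The two rules of the design procedure translate almost literally into code. The sketch below is a minimal illustration; the edge-list representation `(src, dst, delays)` and the node name `Y` are assumptions made here. Applied to the 9D feedback loop of the earlier example with J = 2, it yields the 4D and 5D edges.

```python
def unfold(edges, J):
    """Unfold a DFG, given as (src, dst, delays) edges, by the factor J."""
    unfolded = []
    for src, dst, w in edges:
        for i in range(J):
            # Edge U_i -> V_((i+w) mod J) with floor((i+w)/J) delays.
            unfolded.append((f"{src}{i}", f"{dst}{(i + w) % J}", (i + w) // J))
    return unfolded

# Feedback loop y[n] = a*y[n-9] + x[n]: one node with a 9-delay self-loop.
loop = unfold([("Y", "Y", 9)], J=2)
# loop == [("Y0", "Y1", 4), ("Y1", "Y0", 5)]
```

The delays of the J unfolded edges always add up to the w delays of the original edge, so unfolding does not add storage elements.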
4 Hardware Rules
Signal Processing Hardware Rules: "No Control Path"
• a 1/z register stores a new input sample at every clock cycle
• an if clause asks for controllable registers (with enable)
• let's build it in Simulink: hardware rule
[Figure: Simulink models: the Unit Delay (1/z) corresponds to a register (D, clk, Q); the Unit Delay Enabled (inputs u and E, output y) corresponds to an enabled register (D, clk, Q, ena); the enabled register is rebuilt from a plain Unit Delay and a Switch (~= 0) driven by ena]
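The hardware rule can be mimicked in a few lines of software. The following Python sketch is a behavioral illustration, not the Simulink model from the slide; the class and method names are assumptions. The enabled register is built from a plain unit delay plus a switch that feeds the stored value back whenever the enable input is 0.

```python
class UnitDelay:
    """Plain register: stores a new input value at every clock cycle."""

    def __init__(self, initial=0):
        self.q = initial

    def clock(self, d):
        out = self.q  # value visible at the output during this cycle
        self.q = d    # value captured at the clock edge
        return out


class EnabledRegister:
    """Enabled register built from a unit delay and a switch (multiplexer)."""

    def __init__(self, initial=0):
        self.reg = UnitDelay(initial)

    def clock(self, d, ena):
        # Switch (~= 0): take the new input if ena is non-zero,
        # otherwise feed the stored value back into the register.
        selected = d if ena != 0 else self.reg.q
        return self.reg.clock(selected)


reg = EnabledRegister()
outputs = [reg.clock(d, ena) for d, ena in [(5, 1), (7, 0), (9, 1), (11, 0)]]
```

The register itself is still clocked in every cycle; only the switch decides what it stores, so no separate control path is needed.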
5 OCT Example
5.1 Introduction to OCT
Optical coherence tomography (OCT) is an optical signal acquisition and processing method. It captures micrometer-resolution, three-dimensional images from within optical scattering media (e.g., biological tissue). Optical coherence tomography is an interferometric technique, typically employing near-infrared light. The use of relatively long wavelength light allows it to penetrate into the scattering medium. Reflection is caused by refraction index changes at tissue boundaries, and scattering is a diffraction process at micro-structures in the tissue. OCT signals only contain information about the depth of scattering or reflecting structures and cannot differentiate between these two fundamental processes. A relatively recent implementation of optical coherence tomography, frequency-domain optical coherence tomography, provides advantages in signal-to-noise ratio, permitting faster signal acquisition. Optical coherence tomography systems are employed in diverse applications, including art conservation and diagnostic medicine, notably ophthalmology, where they can be used to obtain detailed images from within the retina. Advantages compared to other techniques are the achieved tissue penetration (1 to 3 mm) combined with the relatively high axial resolution (0.5 to 15 µm) at a very high measuring frequency (several 100 kS/s).
Introduction to OCT: Features
• OCT is an optical signal acquisition and processing method
• micrometer resolution in 3-D images
• optical scattering/reflecting media: biological tissues
• interferometric technique with near-infrared laser
• reflection is caused by refraction index changes at tissue boundaries
• recent OCT technology is frequency-domain OCT, which provides better SNR and high-speed signal acquisition
Introduction to OCT: Applications
• applications in medicine: ophthalmology, ...
• depth penetration of 1 to 3 mm (A-scan)
• speeds of 100 k depth scans per second at 2048 pixels each, i.e. ≥ 200 MS/s
• OCT image of pig eye at HuCE-optoLab (left), OCT setup with Gecko platform at HuCE-microLab (right)
The optical setup for frequency-domain OCT typically consists of an interferometer with a low-coherence, broad-bandwidth light source (white light) or a narrow-band sweeping light source. Light is split into and recombined from reference and sample arm, respectively.
Introduction to OCT: Principle
• low coherence source (LCS)
• beam splitter (BS)
• reference (REF) and sample arm (SMP)
• diffraction grating (DG) and full-field camera (CAM) as spectrometer (source: wiki)
The measured input samples received by the digital signal processing units are equidistant in wavelength (the x-axis is the wavelength, the y-axis is the measured OCT light intensity). A first step in the OCT processing is to remap the measured light intensity so that it is equidistant in wave number instead of wavelength. This pre-processing step is needed for the subsequent DFT. Use simple linear interpolation to calculate the remapped sample intensity.
Introduction to OCT: Signals
• top: captured Fourier-domain OCT signals of an A-scan
• middle: signals after filtering and remapping
• bottom: final A-scan image after inverse FFT
[Figure: three plots of intensity [a.u.]: versus wave length [nm] (top), versus wave number [1/um] (middle), and versus depth z [um] (bottom)]
Signal Processing in OCT: Remapping 1
• OCT input signals are captured in the λ (wave length) domain
• they have to be transformed into the k (wave number) domain
• this process is called remapping
[Figure: comparison of k (linear) and k = 2*pi/lambda(n), plotted over linear k]
Signal Processing in OCT: Remapping 2
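• λ (wave length) from 810 nm to 870 nm
• λ equidistant sampling in wave length: L_n
• λ equidistant sampling in wave number: L_m
[Figure: input signal on the grid L_{n−1}, L_n, L_{n+1}, L_{n+2} (equidistant in L, spacing L_step) and remapped signal on the grid L_{m−1}, L_m, L_{m+1} (equidistant in k); out(m) is interpolated between the input values valA and valB]
• relation is k = 2π/λ, with
$$L_{step} = \frac{\lambda_{max} - \lambda_{min}}{N}, \qquad L_n = \lambda_{min} + n \cdot L_{step}$$
$$k_{step} = \frac{\frac{2\pi}{\lambda_{min}} - \frac{2\pi}{\lambda_{max}}}{N}, \qquad L_m = \frac{2\pi}{k_{max} - m \cdot k_{step}}$$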
Signal Processing in OCT: Remapping 3
• signal processing with look-up table
– no iterative division
– no accumulated error from continuous summing
[Figure: input signal on the grid L_{n−1}, L_n, L_{n+1}, L_{n+2} (equidistant in L, spacing L_step) and remapped signal on the grid L_{m−1}, L_m, L_{m+1} (equidistant in k); out(m) is interpolated between the input values valA and valB]
$$out_m = valA + \frac{valB - valA}{L_{step}} \cdot (L_m - L_n)$$
$$out_m = valA + (valB - valA) \cdot LUT_k(addr)$$
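Comparing the two formulas shows that the table entry is the normalized distance (L_m − L_n)/L_step, which can be computed once, offline. The following Python sketch illustrates this under assumptions: the function names, the extra index table `lut_n`, the choice N = 64 and the test ramp are not from the slides, and k_max is taken as 2π/λ_min.

```python
import math

def build_luts(lam_min, lam_max, N):
    """Per output sample m: input index n and weight (L_m - L_n) / L_step."""
    l_step = (lam_max - lam_min) / N
    k_max = 2 * math.pi / lam_min  # assumed: largest wave number
    k_step = (2 * math.pi / lam_min - 2 * math.pi / lam_max) / N
    lut_n, lut_k = [], []
    for m in range(N + 1):
        l_m = 2 * math.pi / (k_max - m * k_step)
        # Left neighbour on the wavelength grid; clamped so that n + 1 <= N.
        n = min(int((l_m - lam_min) / l_step), N - 1)
        lut_n.append(n)
        # L_n is computed as lam_min + n * L_step, not by repeated summing.
        lut_k.append((l_m - (lam_min + n * l_step)) / l_step)
    return lut_n, lut_k

def remap(inp, lut_n, lut_k):
    """out_m = valA + (valB - valA) * LUT_k: one multiplication, no division."""
    return [inp[n] + (inp[n + 1] - inp[n]) * w for n, w in zip(lut_n, lut_k)]

N = 64
lut_n, lut_k = build_luts(810.0, 870.0, N)
# Test signal: intensity equal to the wavelength itself (a linear ramp),
# so the remapped output should reproduce the wavelengths L_m.
ramp = [810.0 + n * (870.0 - 810.0) / N for n in range(N + 1)]
out = remap(ramp, lut_n, lut_k)
```

All divisions happen while the tables are built; the run-time data path only needs a subtraction, a multiplication and an addition per output sample.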
Signal Processing in OCT: Control Path
• signal processing: data path and control path
– a for clause would be perfect
– an if clause in the code asks for a control path
– control can also be done by look-up tables
[Figure: input signal (equidistant sampling in wave length, grid L_{n−1} ... L_{n+2}) and remapped signal (equidistant sampling in wave number, grid L_{m−1} ... L_{m+2}), shown twice; in the second copy the intervals are marked 1x, 2x, 0x]
Signal Processing in OCT: Datapath and Control Path
i ← 1, j ← 1, m ← 1, adr ← 1
while m ≤ 1024 do
    varA ← inp[i]
    varB ← inp[i + 1]
    if lutCtr(adr − 1) ≠ 2 then
        outm(j) ← varA + (varB − varA) ∗ lutK(adr)
    if lutCtr(adr) = 0 then    {increment input and output sample index}
        m ← m + 1
        i ← i + 1
    else if lutCtr(adr) = 3 then    {keep, do not load new input sample}
        m ← m + 1
    else if lutCtr(adr) = 2 then    {skip, do not generate output sample}
        i ← i + 1
    adr ← adr + 1
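The idea of steering the data path by a control look-up table can be tried out in software. The Python sketch below is a simplified interpretation, not the exact loop above: it reuses the control codes 0 (advance), 3 (keep) and 2 (skip), but the way the tables are generated, all function names, and the choice N = 64 are assumptions made for illustration.

```python
import math

ADVANCE, KEEP, SKIP = 0, 3, 2  # control codes as named in the pseudocode

def build_tables(lam_min, lam_max, N):
    """Build one (control code, weight) pair per processing step."""
    l_step = (lam_max - lam_min) / N
    k_max = 2 * math.pi / lam_min
    k_step = (k_max - 2 * math.pi / lam_max) / N
    idx, frac = [], []
    for m in range(N + 1):
        l_m = 2 * math.pi / (k_max - m * k_step)
        n = min(int((l_m - lam_min) / l_step), N - 1)
        idx.append(n)
        frac.append((l_m - (lam_min + n * l_step)) / l_step)

    lut_ctr, lut_k = [], []
    i = 0  # current input interval
    for m in range(N + 1):
        while i < idx[m]:  # input interval without an output sample
            lut_ctr.append(SKIP)
            lut_k.append(0.0)
            i += 1
        if m == N or idx[m + 1] == i:
            lut_ctr.append(KEEP)     # next output uses the same interval
        else:
            lut_ctr.append(ADVANCE)  # next output needs a new input sample
            i += 1
        lut_k.append(frac[m])
    return lut_ctr, lut_k

def remap_with_control(inp, lut_ctr, lut_k):
    """Data path steered only by the control table, without comparisons of L values."""
    out, i = [], 0
    for ctr, k in zip(lut_ctr, lut_k):
        if ctr != SKIP:  # generate an output sample
            out.append(inp[i] + (inp[i + 1] - inp[i]) * k)
        if ctr != KEEP:  # load a new input sample
            i += 1
    return out

N = 64
lut_ctr, lut_k = build_tables(810.0, 870.0, N)
ramp = [810.0 + n * (870.0 - 810.0) / N for n in range(N + 1)]
out = remap_with_control(ramp, lut_ctr, lut_k)
```

In this variant the two conditions inside the loop depend only on the table contents, which is what makes it possible to replace the if clauses by enable signals in hardware.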
Signal Processing in OCT: Simulink
$$out_m = valA + (valB - valA) \cdot LUT_k(addr)$$
Signal Processing in OCT: "No Control Path"
$$out_m = valA + (valB - valA) \cdot LUT_k(addr)$$
Signal Processing in OCT: Simplifications in Control Path
$$out_m = valA + (valB - valA) \cdot LUT_k(addr)$$
6 Parallelization at OCT Example
6.1 Data-Path Unfolding
Unfolding: OCT Example 1
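• OCT data flow graph for interpolation
• exercise: design a 4-slow unfolding
• simulate it with Matlab/Simulink
[Figure: OCT interpolation data-flow graph: input in, two multiplexers (Mux, wr), adders (+), a subtractor (−) and a multiplier (*) producing out, delay elements D, and the look-up tables lutK and lutCTR]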
Unfolding: How to Model the FiFo?
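• OCT data flow graph for interpolation
• exercise: 4-slow unfolding including control path
• what about the FiFos?
[Figure: OCT interpolation data-flow graph including the control path: multiplexers (Mux, wr), adders, subtractor, multiplier, the comparisons "not 3" and "not 2", look-up tables LUT ctr and LUT k, delay elements D, 2D and 3D, and two FiFos (push, pop) whose internal structure is marked "??"]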
6.2 FiFo Unfolding
FiFo Model
• DFG model of a FiFo
• the FiFo has to be decomposed down to delay elements and combinational logic
[Figure: a FiFo (push, pop) modeled as a dual-port RAM (in, out, adrW, adrR) with address logic built from multiplexers (Mux, wr), constants 1 and delay elements D]
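A behavioral model makes the decomposition concrete. The following Python sketch is an assumption-laden illustration, not the model from the figure: it represents the FiFo as a RAM array plus two address registers that are incremented by 1 on push and pop. Full and empty checks are deliberately left out.

```python
class Fifo:
    """FiFo modeled as dual-port RAM plus two address registers (delay elements)."""

    def __init__(self, depth):
        self.depth = depth
        self.ram = [0] * depth
        self.adr_w = 0  # write address register
        self.adr_r = 0  # read address register

    def clock(self, data_in, push, pop):
        """One clock cycle; returns the RAM output at the current read address."""
        data_out = self.ram[self.adr_r]
        if push:  # multiplexer selects adr_w + 1 when push is active
            self.ram[self.adr_w] = data_in
            self.adr_w = (self.adr_w + 1) % self.depth
        if pop:   # multiplexer selects adr_r + 1 when pop is active
            self.adr_r = (self.adr_r + 1) % self.depth
        return data_out


fifo = Fifo(depth=4)
for value in (10, 20, 30):
    fifo.clock(value, push=1, pop=0)
popped = [fifo.clock(0, push=0, pop=1) for _ in range(3)]
```

The only state outside the RAM is the pair of address registers, so these are the delay elements that an unfolding has to deal with.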
Unfolding the FiFo Model
• DFG model of a 2-slow unfolding of FiFos
• impossible to compose FiFos again from the unfolded graph
• shall we start to re-implement all IP cores?
[Figure: 2-slow unfolded FiFo model: two dual-port RAMs (in, out, adrW, adrR) with push and pop inputs, multiplexers (Mux, wr), constants 1 and delay elements D]
6.3 DFT Unfolding
DFT (DTFS): Discrete Fourier Transform
• natural parallelization by FFT algorithms
• N-point DFT
$$X[k] = \sum_{n=0}^{N-1} x[n] \, W_N^{kn}, \qquad k = 0, 1, 2, \ldots, N-1$$
where $W_N$ ≙ Nth root of unity
$$W_N = \sqrt[N]{1} = e^{-j(2\pi/N)}$$
• inverse transform
$$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k] \, W_N^{-kn}, \qquad n = 0, 1, 2, \ldots, N-1$$
We need a note on the factor 1/N.
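Both sums can be evaluated directly, which is useful as a reference when checking faster implementations. The sketch below is a plain O(N²) Python illustration; the function names are assumptions. It also shows that, with this definition, the factor 1/N in the inverse transform is what makes the round trip return the original samples.

```python
import cmath

def dft(x):
    """X[k] = sum_n x[n] * W_N^(k*n), with W_N = exp(-j*2*pi/N)."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * W ** (k * n) for n in range(N)) for k in range(N)]

def idft(X):
    """x[n] = (1/N) * sum_k X[k] * W_N^(-k*n)."""
    N = len(X)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(X[k] * W ** (-k * n) for k in range(N)) / N for n in range(N)]

x = [1, 2, 3, 4]
X = dft(x)           # approximately [10, -2+2j, -2, -2-2j]
x_back = idft(X)     # approximately [1, 2, 3, 4]
```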
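DFT: Matrix Form
• denote the vector of input samples by
$$\mathbf{x} = \left( x[0], x[1], x[2], \ldots, x[N-1] \right)^T$$
• denote the vector of spectral samples by
$$\mathbf{X} = \left( X[0], X[1], X[2], \ldots, X[N-1] \right)^T$$
• then the DFT can be written as
$$\mathbf{X} = \mathrm{DFT}(\mathbf{x}) = \mathbf{F}\mathbf{x}$$
with
$$\mathbf{F} \mathrel{\hat{=}} \begin{pmatrix}
1 & 1 & 1 & \cdots & 1 \\
1 & W_N & W_N^{2} & \cdots & W_N^{N-1} \\
1 & W_N^{2} & W_N^{2 \cdot 2} & \cdots & W_N^{2 \cdot (N-1)} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & W_N^{N-1} & W_N^{(N-1) \cdot 2} & \cdots & W_N^{(N-1) \cdot (N-1)}
\end{pmatrix}$$
Superscript T denotes transpose.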
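DFT: Low-Order Fourier Matrix Examples
• for N = 2: $W_N = W_2 = \sqrt[2]{1} = e^{-j2\pi/2} = e^{-j\pi} = -1$
$$\mathbf{F}_2 \mathrel{\hat{=}} \begin{pmatrix} 1 & 1 \\ 1 & W_2 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$$
• for N = 4: $W_N = W_4 = \sqrt[4]{1} = e^{-j2\pi/4} = e^{-j\pi/2} = -j$
$$\mathbf{F}_4 \mathrel{\hat{=}} \begin{pmatrix}
1 & 1 & 1 & 1 \\
1 & W_4 & W_4^{2} & W_4^{3} \\
1 & W_4^{2} & W_4^{2 \cdot 2} & W_4^{2 \cdot 3} \\
1 & W_4^{3} & W_4^{3 \cdot 2} & W_4^{3 \cdot 3}
\end{pmatrix} = \begin{pmatrix}
1 & 1 & 1 & 1 \\
1 & -j & -1 & j \\
1 & -1 & 1 & -1 \\
1 & j & -1 & -j
\end{pmatrix}$$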
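The low-order matrices can be reproduced numerically from the definition of W_N. The following Python sketch is an illustration only; the helper names `fourier_matrix` and `rounded` are assumptions, and rounding is used merely to remove floating-point noise.

```python
import cmath

def fourier_matrix(N):
    """Entry in row r, column c is W_N^(r*c), with W_N = exp(-j*2*pi/N)."""
    W = cmath.exp(-2j * cmath.pi / N)
    return [[W ** (r * c) for c in range(N)] for r in range(N)]

def rounded(matrix, digits=12):
    """Round real and imaginary parts to suppress floating-point noise."""
    return [[complex(round(z.real, digits), round(z.imag, digits)) for z in row]
            for row in matrix]

F2 = rounded(fourier_matrix(2))  # [[1, 1], [1, -1]]
F4 = rounded(fourier_matrix(4))  # rows: 1 1 1 1 / 1 -j -1 j / 1 -1 1 -1 / 1 j -1 -j
```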
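DFT: Matrix Factorization → FFT
• for example N = 1024:
$$\mathbf{F}_{1024} \mathrel{\hat{=}} \begin{pmatrix} \mathbf{I}_{512} & \mathbf{D}_{512} \\ \mathbf{I}_{512} & -\mathbf{D}_{512} \end{pmatrix} \cdot \begin{pmatrix} \mathbf{F}_{512} & \mathbf{O} \\ \mathbf{O} & \mathbf{F}_{512} \end{pmatrix} \cdot \begin{pmatrix} \text{even} \\ \text{odd} \end{pmatrix}$$
where
$\mathbf{I}_{512}$ ≙ identity matrix
$\mathbf{D}_{512} \mathrel{\hat{=}} \mathrm{diag}\left\{ 1, W_{1024}, W_{1024}^{2}, \ldots, W_{1024}^{511} \right\}$
$\mathbf{F}_{512}$ ≙ 512-point Fourier matrix
permutation at end separates even and odd part:
$$(\downarrow)\,\mathbf{x} = \left( x[0], x[2], \ldots \right)$$
$$(\downarrow)(z)\,\mathbf{x} = \left( x[1], x[3], \ldots \right)$$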
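One step of this factorization can be checked against the direct DFT. The Python sketch below is an illustration under assumptions: the function names are chosen here, and N = 8 is used instead of 1024 to keep the check fast. The three factors appear in reverse order of the matrix product, because the rightmost matrix acts on the input first.

```python
import cmath

def dft(x):
    """Direct N-point DFT, used for the half-size transforms and as reference."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * W ** (k * n) for n in range(N)) for k in range(N)]

def fft_step(x):
    """One factorization step of F_N for even N."""
    N = len(x)
    half = N // 2
    # Permutation: separate even and odd samples.
    even, odd = x[0::2], x[1::2]
    # Block-diagonal matrix: two independent half-size DFTs.
    E, O = dft(even), dft(odd)
    # Diagonal matrix D = diag{1, W_N, W_N^2, ..., W_N^(N/2-1)}.
    W = cmath.exp(-2j * cmath.pi / N)
    D = [W ** k for k in range(half)]
    # Block rows (I  D) and (I  -D).
    top = [E[k] + D[k] * O[k] for k in range(half)]
    bottom = [E[k] - D[k] * O[k] for k in range(half)]
    return top + bottom

x = [1, 2, 3, 4, 5, 6, 7, 8]
max_error = max(abs(a - b) for a, b in zip(fft_step(x), dft(x)))
```

The two half-size DFTs do not share any data, which is the natural parallelization mentioned earlier; applying the same step recursively to each half leads to the radix-2 FFT.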