Mse: Hardware Algorithms

Parallelization

Marcel Jacomet, Josef Goette

Bern University of Applied Sciences, BFH-TI HuCE-microLab, Biel/Bienne

Marcel.Jacomet@bfh.ch

October 11, 2017

Contents

1 Introduction

2 Parallelization

3 Unfolding

4 Hardware Rules

5 OCT Example
5.1 OCT Introduction

6 Parallelization at OCT Example
6.1 Data-Path Unfolding
6.2 FiFo Unfolding
6.3 DFT Unfolding

References

Hardware Algorithms

© Marcel Jacomet, 2012 - 2016

whole or in part without the written permission by the author, except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software is forbidden.

Marcel Jacomet ii 2008

1 Introduction

Textbooks

• VLSI Digital Signal Processing Systems: Design and Implementation, Keshab K. Parhi, John Wiley & Sons, ISBN 0-471-24186-5, 1999, USD 135

• OCT texts discussing the lab example can be found on the web

2 Parallelization

Parallelization Principles 1

• parallelization at degree p speeds up hardware algorithms by up to a factor of p

• parallelization of hardware can basically be done in two ways:

– p identical hardware paths processing time-delayed data streams in parallel

– p interlinked hardware paths processing a stream of data vectors of length p (p data sets) in parallel

• the first approach is a straightforward implementation using p times the hardware of the non-parallelized design

• the second approach is more challenging: it uses p times the number of operators of the non-parallelized hardware, but only the identical number of storage elements

Parallelization Principles: Parallel Streams

Parallelization Principles: Parallel Sets

[Figure: data sampling channel 1; each data sample is a 5-set vector; interlinked parallel processing of samples (vectors)]

Dataflow Graph Representation

y[n] = a · x[n] + b · x[n − 1] + c · x[n − 2]

• block diagram of 3-tap FIR filter

[Figure: block diagram with the signals x[n], x[n-1], x[n-2]]

• data-flow diagram of 3-tap FIR filter
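The 3-tap FIR equation can be sketched in software as follows. This is only a behavioral model, assuming a zero initial state; the function name `fir3` and the register names `d1`, `d2` are illustrative and not taken from the slides.

```python
def fir3(x, a, b, c):
    """3-tap FIR: y[n] = a*x[n] + b*x[n-1] + c*x[n-2], zero initial state."""
    d1 = 0.0  # register holding x[n-1]
    d2 = 0.0  # register holding x[n-2]
    y = []
    for sample in x:
        y.append(a * sample + b * d1 + c * d2)
        d2, d1 = d1, sample  # shift the delay line by one clock cycle
    return y


# The impulse response returns the coefficients in order.
impulse_response = fir3([1, 0, 0, 0], a=2, b=3, c=5)
```

Each loop iteration corresponds to one clock cycle: three multiplications, two additions and one shift of the two delay elements.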

Dataflow Graph: Pipelining

• pipelining is done by introducing additional delay elements (registers)

• pipelining delay elements can only be inserted in feed-forward paths

Dataflow Graph: Pipelining for Speedup

• pipelining to increase clock frequency

• retiming theory (Bellman-Ford or Floyd-Warshall algorithms)

• FIR example: clock frequency is 1/(2u) instead of 1/(4u)

[Figure: 3-tap FIR data-flow graph annotated with computation times, multipliers (2u) and adders (1u), shown without and with pipelining delays D]
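The clock period is set by the longest chain of operators that is not interrupted by a register. The sketch below computes it for the 3-tap FIR with the computation times from the slide (multiplier 2u, adder 1u). The placement of the pipeline registers on the multiplier outputs is an assumption made for illustration; the figure may place them differently. The node names are hypothetical.

```python
from functools import lru_cache


def clock_period(nodes, edges):
    """Longest chain of operators not separated by a register.

    nodes: {name: computation time}; edges: (src, dst, delays).
    Only edges with zero delays extend a combinational path.
    """
    succ = {name: [] for name in nodes}
    for src, dst, delays in edges:
        if delays == 0:
            succ[src].append(dst)

    @lru_cache(maxsize=None)
    def longest_from(name):
        return nodes[name] + max((longest_from(s) for s in succ[name]), default=0)

    return max(longest_from(name) for name in nodes)


# multipliers take 2u, adders 1u (values from the slide)
nodes = {"mul_a": 2, "mul_b": 2, "mul_c": 2, "add1": 1, "add2": 1}


def fir_edges(pipeline_delays):
    # assumption: pipeline_delays registers sit on every multiplier output
    return [("mul_a", "add1", pipeline_delays),
            ("mul_b", "add1", pipeline_delays),
            ("mul_c", "add2", pipeline_delays),
            ("add1", "add2", 0)]


period_plain = clock_period(nodes, fir_edges(0))      # 2u + 1u + 1u
period_pipelined = clock_period(nodes, fir_edges(1))  # max(2u, 1u + 1u)
```

With this placement the period drops from 4u to 2u, which agrees with the frequencies 1/(4u) and 1/(2u) stated above.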

3 Unfolding

Unfolding 1

• unfolding or loop unrolling

• example

y[n] = a · y[n − 9] + x[n]

for i ← 1 to ∞ do
    y[i] ← a · y[i − 9] + x[i]

• replacing index n by 2k and n + 1 by 2k + 1

• together, the 2 equations describe the same algorithm

y[2k] = a · y[2k − 9] + x[2k]

y[2k + 1] = a · y[2k − 8] + x[2k + 1]
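A quick software check that the two unfolded equations describe the same algorithm as the original recursion. This is a minimal sketch assuming y[n] = 0 for n < 0 and an even number of samples; the function names are illustrative.

```python
def original(x, a, w=9):
    """y[n] = a*y[n-w] + x[n], with y[n] = 0 for n < 0."""
    y = []
    for n, xn in enumerate(x):
        past = y[n - w] if n >= w else 0.0
        y.append(a * past + xn)
    return y


def unfolded_by_2(x, a):
    """Computes y[2k] and y[2k+1] in the same iteration k."""
    y = [0.0] * len(x)

    def past(n):
        return y[n] if n >= 0 else 0.0

    for k in range(len(x) // 2):
        y[2 * k] = a * past(2 * k - 9) + x[2 * k]
        y[2 * k + 1] = a * past(2 * k - 8) + x[2 * k + 1]
    return y


x = [float(n % 7) for n in range(40)]
y_ref = original(x, 0.5)
y_unf = unfolded_by_2(x, 0.5)
```

Both outputs of one iteration depend only on values from earlier iterations, so in hardware they can be computed in parallel.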

Unfolding 2

• parallelization degree: J-slow

• J-slow means that for an input x[kJ + m], the output after a delay is x[(k − 1)J + m]

• thus we get:

y[2k] = a · y[2(k − 5) + 1] + x[2k]

y[2k + 1] = a · y[2(k − 4) + 0] + x[2k + 1]

Unfolding 3

• data flow graph of example

• algorithm of example (2-slow)

[Figure: data-flow graph of the unfolded example with input x[2k+1] and output y[2k+1]]

Unfolding Design Procedure

• for each node U in the original DFG, draw the J nodes U_0, U_1, ..., U_{J−1}

• for each edge U → V with w delays in the original DFG, draw the J edges U_i → V_{(i+w) mod J} with ⌊(i+w)/J⌋ delays for i = 0, 1, 2, ..., J − 1
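The two rules of the design procedure translate almost literally into code. The following sketch assumes the DFG is given as a list of edges (U, V, w); the naming of the unfolded nodes as "U0", "U1", ... is an illustrative choice.

```python
def unfold(edges, J):
    """Unfold a DFG given as (U, V, w) edges by the factor J.

    Each edge U -> V with w delays yields the J edges
    U_i -> V_((i+w) mod J) with floor((i+w)/J) delays.
    """
    result = []
    for u, v, w in edges:
        for i in range(J):
            result.append((f"{u}{i}", f"{v}{(i + w) % J}", (i + w) // J))
    return result


# y[n] = a*y[n-9] + x[n]: the feedback edge y -> y carries w = 9 delays
unfolded = unfold([("y", "y", 9)], J=2)
```

For the example y[n] = a · y[n − 9] + x[n] this gives the edge y0 → y1 with 4 delays and the edge y1 → y0 with 5 delays. The total number of delays (9) is preserved.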

4 Hardware Rules

Signal Processing Hardware Rules: ”No Control Path”

• a 1/z register stores a new input sample at every clock cycle

• an if clause calls for controllable registers (with enable)

• let's build it in Simulink: hardware rule

[Figure: Simulink models of a register (Unit Delay) and of an enabled register built from a Unit Delay and a Switch]
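An enabled register can be modeled behaviorally as a unit delay whose input is selected by a switch: either the new sample or the fed-back old content. The sketch below assumes this structure; the class name and the method `clock` are hypothetical and do not correspond to any Simulink API.

```python
class EnabledRegister:
    """Unit delay whose input is selected by a switch (2:1 multiplexer)."""

    def __init__(self, initial=0):
        self.q = initial  # content of the unit delay

    def clock(self, d, enable):
        out = self.q                      # value visible before the clock edge
        self.q = d if enable else self.q  # switch: new sample or old content
        return out


reg = EnabledRegister()
trace = [reg.clock(d, en)
         for d, en in [(1, True), (2, False), (3, True), (4, False)]]
```

The unit delay itself is still clocked in every cycle, so no separate control path is needed; the enable signal only steers the switch.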

5 OCT Example

5.1 Introduction to OCT

Optical coherence tomography (OCT) is an optical signal acquisition and processing method. It captures micrometer-resolution, three-dimensional images from within optical scattering media (e.g., biological tissue). Optical coherence tomography is an interferometric technique, typically employing near-infrared light. The use of relatively long wavelength light allows it to penetrate into the scattering medium. Reflection is caused by refraction index changes at tissue boundaries, and scattering is a diffraction process at micro-structures in the tissue. OCT signals only contain information about the depth of scattering or reflecting structures and cannot differentiate between these two fundamental processes. A relatively recent implementation of optical coherence tomography, frequency-domain optical coherence tomography, provides advantages in signal-to-noise ratio, permitting faster signal acquisition. Optical coherence tomography systems are employed in diverse applications, including art conservation and diagnostic medicine, notably ophthalmology, where it can be used to obtain detailed images from within the retina. Advantages compared to other techniques are the achieved tissue penetration (1 to 3 mm) combined with the relatively high axial resolution (0.5 to 15 µm) at a very high measuring frequency (several 100 kS/s).

Introduction to OCT: Features

• OCT is an optical signal acquisition and processing method

• micrometer resolution in 3-D images

• optical scattering/reflecting media: biological tissues

• interferometric technique with near-infrared laser

• reflection is caused by refraction index changes at tissue boundaries

• recent OCT technology is frequency-domain OCT, which provides improved SNR and high-speed signal acquisition

Introduction to OCT: Applications

• applications in medicine: ophthalmology, ...

• depth penetration of 1 to 3 mm (A-scan)

• speed of 100 k depth scans per second at 2048 pixels each, i.e. ≥ 200 MS/s

• OCT image of pig eye at HuCE-optoLab (left), OCT setup with Gecko platform at HuCE-microLab (right)

The optical setup for frequency-domain OCT typically consists of an interferometer with a low-coherence, broad-bandwidth light source (white light) or a narrow-band sweeping light source. Light is split into and recombined from reference and sample arm, respectively.

Introduction to OCT: Principle

• low coherence source (Lcs)

• beam splitter (Bs)

• reference (Ref) and sample arm (Smp)

• diffraction grating (Dg) and full-field camera (Cam) as spectrometer (source: wiki)

The measured input samples received by the digital signal processing units are equidistant in wavelength (x-axis is the wavelength, y-axis is the measured OCT light intensity). A first step in the OCT processing is to remap the measured light intensity so that it is equidistant in wave number instead of wavelength. This pre-processing step is needed for the succeeding DFT transformation. Simple linear interpolation is used to calculate the remapped sample intensity.

Introduction to OCT: Signals

• top: captured Fourier-domain OCT signals of an A-scan

• middle: signals after filtering and remapping

• bottom: final A-scan image after inverse FFT

[Figure: three panels with the x-axes wave length [nm] (top), wave number [1/um] (middle) and depth z [um] (bottom)]

Signal Processing in OCT: Remapping 1

• OCT input signals are captured in the λ (wave length) domain

• they have to be transformed into the k (wave number) domain

• this process is called remapping

[Figure: comparison of linear k and k = 2*pi/lambda(n)]

• λ (wave length) from 810 nm to 870 nm

• λ equidistant sampling in wave length: Ln

• λ equidistant sampling in wave number: Lm

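[Figure: input signal sampled at L_{n-1}, L_n, L_{n+1}, L_{n+2} (equidistant in λ) and remapped signal sampled at L_{m-1}, L_m, L_{m+1} (equidistant in k), with the labels valB and out(m)]

• relation is k = 2π/λ, with

$$L_{step} = \frac{\lambda_{max} - \lambda_{min}}{N}, \qquad L_n = \lambda_{min} + n \cdot L_{step}$$

$$k_{step} = \frac{\frac{2\pi}{\lambda_{min}} - \frac{2\pi}{\lambda_{max}}}{N}, \qquad L_m = \frac{2\pi}{k_{max} - m \cdot k_{step}}$$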
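The two sampling grids can be computed directly from these relations. The sketch below assumes k_max = 2π/λ_min (which follows from k = 2π/λ) and lets both indices run from 0 to N; the function name `remap_grids` is illustrative.

```python
import math


def remap_grids(lam_min, lam_max, N):
    """Return the grids L_n (equidistant in wave length) and
    L_m (equidistant in wave number), both with N + 1 points."""
    l_step = (lam_max - lam_min) / N
    k_max = 2 * math.pi / lam_min  # assumption: largest k belongs to lam_min
    k_step = (2 * math.pi / lam_min - 2 * math.pi / lam_max) / N
    L_n = [lam_min + n * l_step for n in range(N + 1)]
    L_m = [2 * math.pi / (k_max - m * k_step) for m in range(N + 1)]
    return L_n, L_m


# wave length range from the slide: 810 nm to 870 nm
L_n, L_m = remap_grids(810.0, 870.0, N=8)
```

Both grids start at λ_min and end at λ_max, but the spacing of L_m grows with m: the remapped samples are denser than the input samples near λ_min and sparser near λ_max.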

• signal processing with look-up table

– no division with iteration

– no error due to continuous summing

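[Figure: input signal and remapped signal with the sampling points L_{n-1} ... L_{n+2} and L_{m-1} ... L_{m+1} and the labels valB and out(m)]

$$out_m = valA + \frac{valB - valA}{L_{step}} \cdot (L_m - L_n)$$

$$out_m = valA + (valB - valA) \cdot LUT_k(addr)$$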
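Comparing the two interpolation formulas shows that the look-up table holds the weights (L_m − L_n)/L_step. The sketch below precomputes these weights offline, so that the run-time path needs no division. Storing the input index n next to each weight, and leaving out the last grid point, are simplifications assumed here for illustration; the names `build_lut` and `remap` are hypothetical.

```python
import math


def build_lut(lam_min, lam_max, N):
    """Offline: for each output m store the input index n and the
    weight (L_m - L_n) / L_step."""
    l_step = (lam_max - lam_min) / N
    k_max = 2 * math.pi / lam_min
    k_step = (k_max - 2 * math.pi / lam_max) / N
    lut_n, lut_k = [], []
    for m in range(N):  # last grid point left out to stay inside the input
        l_m = 2 * math.pi / (k_max - m * k_step)
        n = min(int((l_m - lam_min) / l_step), N - 1)
        lut_n.append(n)
        lut_k.append((l_m - (lam_min + n * l_step)) / l_step)
    return lut_n, lut_k


def remap(inp, lut_n, lut_k):
    """Run time: one subtraction, one multiplication, one addition per output."""
    return [inp[n] + (inp[n + 1] - inp[n]) * w for n, w in zip(lut_n, lut_k)]


N = 64
lut_n, lut_k = build_lut(810.0, 870.0, N)
# test signal f(lambda) = 2*lambda + 1, sampled equidistant in wave length
inp = [2.0 * (810.0 + n * 60.0 / N) + 1.0 for n in range(N + 1)]
out = remap(inp, lut_n, lut_k)
```

Because every L_n is computed as λ_min + n · L_step instead of by repeated addition of L_step, no summation error accumulates along the grid.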

Signal Processing in OCT: Control Path

• signal processing: data path and control path

– for clause would be perfect

– if clause in code asks for control path

– control can also be done by look-up tables

[Figure: input signal (equidistant sampling in wave length) and remapped signal (equidistant sampling in wave number) with the sampling points L_{n-1} ... L_{n+2} and L_{m-1} ... L_{m+1}; intervals labeled 1x, 2x and 0x]

Signal Processing in OCT: Datapath and Control Path

i ← 1, j ← 1, m ← 1, adr ← 1
while m ≤ 1024 do
    varA ← inp[i]
    varB ← inp[i + 1]
    if lutCtr(adr − 1) ≠ 2 then
        outm(j) ← varA + (varB − varA) ∗ lutK(adr)
    if lutCtr(adr) = 0 then          ▷ increment input and output sample index
        m ← m + 1
        i ← i + 1
    else if lutCtr(adr) = 3 then     ▷ keep, do not load new input sample
        m ← m + 1
    else if lutCtr(adr) = 2 then     ▷ skip, do not generate output sample
        i ← i + 1
    adr ← adr + 1
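The following Python sketch shows one way such a LUT-controlled loop could work. It is not a transcription of the pseudocode: it gates the output with the control code of the current entry (the pseudocode uses lutCtr(adr − 1)), it uses 0-based indices, and the LUT generation in `build_luts` is a hypothetical offline step that is not part of the slides. Only the control codes 0, 2 and 3 are taken from the pseudocode.

```python
import math

NEXT, SKIP, KEEP = 0, 2, 3  # control codes used in the pseudocode


def build_luts(lam_min, lam_max, N):
    """Hypothetical offline LUT generation; one entry per loop iteration."""
    l_step = (lam_max - lam_min) / N
    k_max = 2 * math.pi / lam_min
    k_step = (k_max - 2 * math.pi / lam_max) / N
    l_out = [2 * math.pi / (k_max - m * k_step) for m in range(N + 1)]
    lut_ctr, lut_k = [], []
    i = m = 0
    while m < N:
        upper = lam_min + (i + 1) * l_step  # right edge of the input interval
        if l_out[m] >= upper:               # no output in this interval
            lut_ctr.append(SKIP)
            lut_k.append(0.0)
            i += 1
            continue
        lut_k.append((l_out[m] - (lam_min + i * l_step)) / l_step)
        if l_out[m + 1] < upper:            # next output shares the interval
            lut_ctr.append(KEEP)
        else:
            lut_ctr.append(NEXT)
            i += 1
        m += 1
    return lut_ctr, lut_k


def remap_stream(inp, lut_ctr, lut_k):
    """Run-time loop: the LUT decides when to emit and when to advance."""
    out, i = [], 0
    for ctr, k in zip(lut_ctr, lut_k):
        var_a, var_b = inp[i], inp[i + 1]
        if ctr != SKIP:                     # skip: do not generate an output
            out.append(var_a + (var_b - var_a) * k)
        if ctr != KEEP:                     # keep: do not load a new input
            i += 1
    return out


N = 64
lut_ctr, lut_k = build_luts(810.0, 870.0, N)
inp = [2.0 * (810.0 + n * 60.0 / N) + 1.0 for n in range(N + 1)]
out = remap_stream(inp, lut_ctr, lut_k)
```

The run-time loop contains no comparison of wave lengths; all decisions are read from the control LUT, which is the idea behind replacing the control path by look-up tables.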

Signal Processing in OCT: Simulink

Signal Processing in OCT: ”No Control Path”

Signal Processing in OCT: Simplifications in Control Path

6 Parallelization at OCT Example

6.1 Data-Path Unfolding

Unfolding: OCT Example 1

• OCT data flow graph for interpolation

• exercise: design a 4-slow unfolding

• simulate it with Matlab/Simulink

[Figure: data-flow graph with the blocks in, Mux, -, *, +, out, lutK and lutCTR]

Unfolding: How to Model the FiFo?

• OCT data flow graph for interpolation

• exercise: 4-slow unfolding including control path

• what about the FiFos?

[Figure: data-flow graph with the blocks in, Mux, -, *, Mux, the comparisons "not 3" and "not 2", LUT ctr, LUT k1 and FiFos with push/pop]

6.2 FiFo Unfolding

FiFo Model

• DFG model of a FiFo

• the FiFo has to be decomposed down to delay elements and combinational logic

[Figure: FiFo with push and pop inputs, modeled as a dual-port RAM with the ports in and out and the address counters adrW and adrRD]

Unfolding the FiFo Model

• DFG model of a 2-slow unfolding of FiFos

• the unfolded model cannot be composed into FiFos again

• shall we start to re-implement all IP cores?

[Figure: unfolded FiFo model with two dual-port RAMs (ports in, out, adrW, adrR) and push/pop signals]

6.3 DFT Unfolding

DFT (DTFS): Discrete Fourier Transform

• natural parallelization by FFT algorithms

• N-point DFT

$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{kn}, \qquad k = 0, 1, 2, \ldots, N-1$$

where W_N ≙ Nth root of unity

$$W_N = \sqrt[N]{1} = e^{-j(2\pi/N)}$$

• inverse transform

$$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, W_N^{-kn}, \qquad n = 0, 1, 2, \ldots, N-1$$

We need a note on the factor 1/N.
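The definitions can be checked with a direct O(N²) implementation. This is a minimal sketch using only the standard library; the function names `dft` and `idft` are illustrative. It also shows the role of the factor 1/N: only with it does the inverse transform return the original samples.

```python
import cmath


def dft(x):
    """X[k] = sum_n x[n] * W_N^(k*n) with W_N = exp(-j*2*pi/N)."""
    N = len(x)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(x[n] * W ** (k * n) for n in range(N)) for k in range(N)]


def idft(X):
    """x[n] = (1/N) * sum_k X[k] * W_N^(-k*n)."""
    N = len(X)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(X[k] * W ** (-k * n) for k in range(N)) / N for n in range(N)]


x = [1, 2, 3, 4]
X = dft(x)     # approximately [10, -2+2j, -2, -2-2j]
back = idft(X)  # approximately [1, 2, 3, 4]
```

Without the division by N, the round trip would return N · x[n] instead of x[n].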

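DFT: Matrix Form

• denote the vector of input samples by

$$\mathbf{x} = \left(x[0],\, x[1],\, x[2],\, \ldots,\, x[N-1]\right)^T$$

• denote the vector of spectral samples by

$$\mathbf{X} = \left(X[0],\, X[1],\, X[2],\, \ldots,\, X[N-1]\right)^T$$

• then the DFT can be written as

$$\mathbf{X} = \mathrm{DFT}(\mathbf{x}) = \mathbf{F}\,\mathbf{x}$$

with

$$\mathbf{F} \,\hat{=}\, \begin{pmatrix} 1 & 1 & 1 & \cdots & 1 \\ 1 & W_N & W_N^{2} & \cdots & W_N^{N-1} \\ 1 & W_N^{2} & W_N^{2\cdot 2} & \cdots & W_N^{2\cdot(N-1)} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & W_N^{N-1} & W_N^{(N-1)\cdot 2} & \cdots & W_N^{(N-1)\cdot(N-1)} \end{pmatrix}$$

Superscript T denotes transpose.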

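DFT: Low-Order Fourier Matrix Examples

• for N = 2: $W_N = W_2 = \sqrt[2]{1} = e^{-j2\pi/2} = e^{-j\pi} = -1$

$$F_2 \,\hat{=}\, \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$$

• for N = 4: $W_N = W_4 = \sqrt[4]{1} = e^{-j2\pi/4} = e^{-j\pi/2} = -j$

$$F_4 \,\hat{=}\, \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & W_4 & W_4^{2} & W_4^{3} \\ 1 & W_4^{2} & W_4^{2\cdot 2} & W_4^{2\cdot 3} \\ 1 & W_4^{3} & W_4^{3\cdot 2} & W_4^{3\cdot 3} \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & -j & -1 & j \\ 1 & -1 & 1 & -1 \\ 1 & j & -1 & -j \end{pmatrix}$$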
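The low-order Fourier matrices can be reproduced numerically from the entry rule F[r][c] = W_N^(r·c). The sketch below is only a check of the examples; the helper names `fourier_matrix` and `rounded` are illustrative.

```python
import cmath


def fourier_matrix(N):
    """N x N Fourier matrix with entries W_N^(r*c), W_N = exp(-j*2*pi/N)."""
    W = cmath.exp(-2j * cmath.pi / N)
    return [[W ** (r * c) for c in range(N)] for r in range(N)]


def rounded(M):
    # round away floating-point noise so entries compare equal to 1, -1, j, -j
    return [[complex(round(z.real, 12), round(z.imag, 12)) for z in row]
            for row in M]


F2 = rounded(fourier_matrix(2))
F4 = rounded(fourier_matrix(4))
```

In Python the imaginary unit is written `1j`, so the entry −j appears as `-1j`.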

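DFT: Matrix Factorization → FFT

• for example N = 1024:

$$F_{1024} \,\hat{=}\, \begin{pmatrix} I_{512} & D_{512} \\ I_{512} & -D_{512} \end{pmatrix} \begin{pmatrix} F_{512} & O \\ O & F_{512} \end{pmatrix} \begin{pmatrix} \text{even-odd} \\ \text{permutation} \end{pmatrix}$$

where

$$I_{512} \,\hat{=}\, \text{identity matrix}$$

$$D_{512} \,\hat{=}\, \mathrm{diag}\left\{1,\, W_{1024},\, W_{1024}^{2},\, \ldots,\, W_{1024}^{511}\right\}$$

$$F_{512} \,\hat{=}\, \text{512-point Fourier matrix}$$

permutation at end separates even and odd part:

$$(\downarrow)\,x = \left(x[0],\, x[2],\, \ldots\right)$$

$$(\downarrow)(z)\,x = \left(x[1],\, x[3],\, \ldots\right)$$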
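Applying the factorization recursively gives a radix-2 FFT. The sketch below follows the three factors from right to left: even-odd separation, two half-size transforms, then the combination with I and ±D. It assumes that N is a power of two; the function names are illustrative, and `dft` is included only as a reference for comparison.

```python
import cmath


def dft(x):
    """Direct O(N^2) reference implementation."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]


def fft(x):
    """Radix-2 FFT built from the factorization; len(x) must be a power of 2."""
    N = len(x)
    if N == 1:
        return list(x)
    even = fft(x[0::2])  # F_{N/2} applied to x[0], x[2], ...
    odd = fft(x[1::2])   # F_{N/2} applied to x[1], x[3], ...
    half = N // 2
    d = [cmath.exp(-2j * cmath.pi * k / N) for k in range(half)]  # diagonal of D
    top = [even[k] + d[k] * odd[k] for k in range(half)]     # block row (I  D)
    bottom = [even[k] - d[k] * odd[k] for k in range(half)]  # block row (I -D)
    return top + bottom


x = [1, 2, 3, 4, 5, 6, 7, 8]
X_fast = fft(x)
X_ref = dft(x)
```

The two half-size transforms are independent of each other, which is the natural parallelism of the FFT mentioned above.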

References
