Sub-Nyquist Sampling DSP & SCD Modules Presented by: Omer Kiselov, Daniel Primor Supervised by: Ina Rivkin, Moshe Mishali Winter 2010High Speed Digital

Sub-Nyquist Sampling
DSP & SCD Modules

Presented by: Omer Kiselov, Daniel Primor
Supervised by: Ina Rivkin, Moshe Mishali

Winter 2010
High Speed Digital Systems Lab, Electrical Engineering Faculty

Technion – Israel Institute of Technology

Outline

• Overview – Goals and discussion
• Algorithm review
• Implementation in hardware
• Changes for Adaptation to hardware
• Evaluation
• Possible Optimization & Future Work

Overview

• The Goal system
• The module’s Objectives
• Interface

[Block diagram of the goal system. Blocks: Analog Back-end (Realtime), Expand 1:q, Memory, CTF (Support recovery), Detector, DELAY FIFO, SUPPORT & Matrix, DSP (Baseband)]

$Y = AZ, \quad Z_i(f) = X\big(f + (i - L_0 - 1) f_p\big)$

$Z = A^{\dagger} Y$

DSP & SUPPORT CHANGE DETECTOR

Interface signals:

• A matrix vector – 432 bits
• Support Analysis vector – 101 bits
• First Beta (for QR decomposition) – 36 bits
• Samples Bundle – 432 bits
• Support Changed – 1 bit
• Valid Supports – 1 bit
• A Matrix Address – 9 bits
• Valid samples – 1 bit



Algorithm Review

• Pseudo-Inverse
– Matrix Decomposition
– Matrix Inversion
– Matrix Multiplication

• Support Change Detection
– Support threshold evaluation attempt

[Diagram blocks: Pseudo inverse, Real Time Vector Multiplier, Support Change Detector]

Algorithm Review – Pseudo Inverse

• Matrix Decomposition
• QR Decomposition
• Using Householder Reflections

With $A$ of size $n \times m$ and $R$ the upper triangular factor:

$A^{\dagger} = (A^T A)^{-1} A^T$

$A = Q R$

$A^{\dagger} = R^{-1} Q^T$

$Q = Q_1 Q_2 \cdots Q_k$
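The decomposition above can be sketched in software as follows. This is a minimal NumPy illustration under stated assumptions, not the VHDL design: the function names `householder_qr` and `pinv_via_qr` are hypothetical, `beta` is the usual Householder scalar 2/(v^T v) (assumed here to correspond to the beta mentioned in the slides), and A is assumed real with full column rank. The pseudo-inverse uses only the top m×m block of R and the first m columns of Q.

```python
import numpy as np

def householder_qr(a):
    """Full QR of an n x m real matrix (n >= m) using Householder reflections."""
    a = np.asarray(a, dtype=float)
    n, m = a.shape
    r = a.copy()
    q = np.eye(n)
    for k in range(m):
        x = r[k:, k]
        norm_x = np.linalg.norm(x)
        if norm_x == 0.0:
            continue  # nothing to eliminate in this column
        v = x.copy()
        v[0] += np.copysign(norm_x, x[0])  # sign choice avoids cancellation
        beta = 2.0 / (v @ v)
        # Apply the reflection H_k = I - beta * v v^T to R and accumulate Q = Q H_k.
        r[k:, :] -= beta * np.outer(v, v @ r[k:, :])
        q[:, k:] -= beta * np.outer(q[:, k:] @ v, v)
    return q, r

def pinv_via_qr(a):
    """Pseudo-inverse of a full-column-rank matrix as R^{-1} Q^T."""
    a = np.asarray(a, dtype=float)
    q, r = householder_qr(a)
    m = a.shape[1]
    r1 = r[:m, :m]   # square upper triangular block
    q1 = q[:, :m]    # matching columns of Q
    return np.linalg.solve(r1, q1.T)
```

For a random 6×3 matrix the result should agree with `np.linalg.pinv` up to floating-point rounding; the hardware works in fixed point, so its error is larger.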

Algorithm Review – Pseudo Inverse

• Matrix Inversion – Gaussian Elimination

• Matrix Multiplication

[Diagram: Matrix Multiplier and Vector Multiplier, sharing the Matrix Multiplier’s Common Interface]
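For an upper triangular matrix, Gaussian elimination reduces to back substitution. The sketch below shows that special case in NumPy; the function name `invert_upper_triangular` is hypothetical, and the code assumes a square R with a nonzero diagonal. It is meant to illustrate the arithmetic, not the structure of the hardware unit.

```python
import numpy as np

def invert_upper_triangular(r):
    """Invert a square upper triangular matrix column by column."""
    r = np.asarray(r, dtype=float)
    n = r.shape[0]
    inv = np.zeros((n, n))
    for j in range(n):
        # Solve R x = e_j; entries of x below row j are zero.
        inv[j, j] = 1.0 / r[j, j]
        for i in range(j - 1, -1, -1):
            s = r[i, i + 1:j + 1] @ inv[i + 1:j + 1, j]
            inv[i, j] = -s / r[i, i]
    return inv
```

Each column needs one division per row, which is consistent with the division unit being named later as a pipeline bottleneck.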

Algorithm Review - SCD

• The support change detector is a vector multiplier: it multiplies one row of the pseudo-inverse of the A matrix by the sample vector and checks whether the result carries energy that is not noise.

• Threshold generation attempt:

– The threshold is based on the minimal sample amplitude in range and on the noise amplitude, where the noise amplitude follows from the SNR: $A_{noise} = A_{signal} \cdot 10^{-SNR_{dB}/20}$

– If there was no support change, the power of the detector output $y \cdot W$ is bounded by the threshold power $P_T$.

– If we replace W with the average $W_{avg}$, the bound is evaluated over the 24 elements of the vector product.

– The generated value does not show any false alarms, but it may miss detections in several cases where the SNR is low.

* Eventually the threshold was defined as an input by the user.

Our estimated threshold for the AM demo is 000001000110010100 (approximately 0.3).
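To relate the 18-bit threshold word to the quoted value of about 0.3, the sketch below decodes it as a two's-complement fixed-point number. The slides do not state the fixed-point format; 14 fractional bits is an assumption chosen only because it gives a value close to 0.3, and the helper name `fixed_to_float` is hypothetical.

```python
def fixed_to_float(bits: str, frac_bits: int) -> float:
    """Interpret a two's-complement bit string as a fixed-point number."""
    value = int(bits, 2)
    if bits[0] == "1":          # negative number in two's complement
        value -= 1 << len(bits)
    return value / (1 << frac_bits)

# Assumed format: 18-bit word with 14 fractional bits (not stated in the slides).
threshold = fixed_to_float("000001000110010100", frac_bits=14)
print(round(threshold, 4))  # 0.2747
```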

DSP & SCD system operation

[Block diagram of the DSP & SCD system.
Pseudo-inverse chain: A Matrix RAM; QR Decomposition (Auxiliary multiplications, Reflections creation, Reflection multiplication) producing R and Q'; Upper triangular matrix inverse producing R inversed; Matrix multiplier producing A dagger; Ping-Pong Buffer (RAM).
Real-time path: Samples From Expand; Delay FIFO; Real Time Matrix-Samples Multiplier; Reconstructed Signal.
Detection: Support Change Detector with Control Vector; Support indexes A_s.]



Implementation In Hardware

[Diagram blocks: QR Decomposition, Inverting an upper triangular matrix, Matrix Multiplier]

Block (Entities) Definition – Pseudo Inverse

[Diagram blocks: QR Decomposition, Matrix Multiplier, Matrix Inversion]

• Block (Entities) Definition – Pseudo Inverse
• QR Decomposition

[Diagram blocks: Phase 1, Phase 2, Aux 2, 24 Multipliers, Beta calculation unit]

Matrix Inversion Unit

• Block (Entities) Definition – Pseudo Inverse

[Diagram blocks: Vector Inversion Unit, Vector Inverter, FIFO for Original R Matrix]

Matrix Multiplier

[Diagram blocks: RAM, Matrix Multiplier, SCD, Real Time Mult]

Outline - Adaptation to Hardware

• Overview – Goals and discussion
• Algorithm review
• Implementation in hardware
• Adaptation to hardware
– Complex Enhance
– Normalizing the Input
– Resolution (Overflow) discussion
– SCD – running average
– Timing issues
• Evaluation
• Possible Optimization & Future Work

Complex Enhance

• To avoid all complex multiplications we changed the structure of the matrix: every complex entry is replaced by a 2×2 real block.

• The matrix is 4 times bigger. Every complex vector multiplication can still be carried out as an ordinary real vector-by-vector multiplication, and it gives the correct result.

For every $i, j > 0$ ($i$ row number and $j$ column number):

$a_{i,j} \in A \;\rightarrow\; \begin{pmatrix} \mathrm{real}(a_{i,j}) & -\mathrm{imag}(a_{i,j}) \\ \mathrm{imag}(a_{i,j}) & \mathrm{real}(a_{i,j}) \end{pmatrix}$
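The real-valued embedding can be checked numerically. In this NumPy sketch, `embed_matrix` and `embed_vector` are hypothetical helper names, and the interleaved (real, imag) ordering of the vector entries is an assumption; the slides only give the 2×2 block for each matrix entry.

```python
import numpy as np

def embed_matrix(a):
    """Replace each complex entry by the real block [[re, -im], [im, re]]."""
    a = np.asarray(a, dtype=complex)
    i2 = np.eye(2)
    j2 = np.array([[0.0, -1.0], [1.0, 0.0]])
    return np.kron(a.real, i2) + np.kron(a.imag, j2)

def embed_vector(x):
    """Interleave real and imaginary parts: [re x0, im x0, re x1, im x1, ...]."""
    x = np.asarray(x, dtype=complex)
    return np.column_stack([x.real, x.imag]).ravel()

a = np.array([[1 + 2j, 3 - 1j, 0.5j],
              [2 - 2j, -1 + 0j, 4 + 1j]])
x = np.array([1 - 1j, 2 + 3j, -0.5 + 2j])

real_result = embed_matrix(a) @ embed_vector(x)   # real arithmetic only
complex_result = embed_vector(a @ x)              # reference, complex arithmetic
```

`embed_matrix(a)` has twice the rows and twice the columns of `a`, which is the factor of 4 noted above, and `real_result` matches `complex_result`.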

Normalizing the Input

• Accuracy falls with a smaller mantissa.

• Matrices can be normalized before the inversion, and the result rescaled after it.

• Hence:

$z = y A^{\dagger}$

$z_2 = y D^{-1} A^{\dagger}$, where $D$ is diagonal

$z_2 = D^{-1} y A^{\dagger}$

$z_2 = D^{-1} z$

$z = D z_2$

• Motivation
– The real data differed from the synthetic data given, so 18 bits are not enough (we need to represent both a number and 1 divided by that number).
– Normalizing the matrix allows us to adjust the fraction to minimize error and underflow.
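The relation between the normalized and the original solution can be illustrated with a small floating-point experiment. The sketch below writes the product with column vectors and scales the columns of A to unit norm; that choice of D, the matrix sizes and the random data are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((6, 3)) * 1e-3        # poorly scaled matrix (illustrative)
y = rng.standard_normal(6)

d = np.diag(1.0 / np.linalg.norm(a, axis=0))  # diagonal scaling matrix D
a2 = a @ d                                    # normalized matrix, unit-norm columns
z2 = np.linalg.pinv(a2) @ y                   # solution with the normalized matrix
z = d @ z2                                    # undo the normalization: z = D z2

z_direct = np.linalg.pinv(a) @ y              # reference without normalization
```

For a full-column-rank A the pseudo-inverse of A D equals D^{-1} times the pseudo-inverse of A, so `z` equals `z_direct` up to rounding, while the intermediate values of the normalized problem stay closer to 1 and are easier to represent in a short fixed-point word.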

Support Change Detection – with running average

[Block diagram components: Samples, Control vector RAM, Cycle counter, MUX, Vector multiplier, registers REG1 to REG8, adder (+), comparator (>) against Threshold, Detection output]
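A running average makes the detector less sensitive to a single noisy product. The sketch below is a software model under assumptions: the window of 8 is taken from the eight registers in the diagram, the use of the magnitude of each product is a guess, and the function name `support_change_flags` is hypothetical.

```python
from collections import deque

def support_change_flags(products, threshold, window=8):
    """Flag a support change when the running average of |product| exceeds the threshold."""
    history = deque(maxlen=window)   # models the chain of registers
    flags = []
    for p in products:
        history.append(abs(p))
        average = sum(history) / len(history)
        flags.append(average > threshold)
    return flags

# An isolated spike does not trigger detection; a sustained rise does.
spike = [0.1] * 8 + [1.0] + [0.1] * 8
step = [0.1] * 8 + [1.0] * 8
print(any(support_change_flags(spike, threshold=0.3)))   # False
print(support_change_flags(step, threshold=0.3).index(True))  # 9
```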

Timing

• Deep pipeline
– We incorporated a deeper pipeline so that the module works at the high desired frequency. Quartus currently reports that the module can run only up to the given frequency. This limit can be raised by adding pipeline stages at the bottlenecks found in the design.

• Clocks
– Main clock: 20 MHz, may rise to 70 MHz
– Working clock for the pseudo inverse: 100 MHz, currently not flexible

• Hardware reuse
– The matrix multiplier and the inverse unit reuse a single vector-sized unit over many iterations, hence they form the bottlenecks.

Bottlenecks in the design

• Matrix Inverse
• Matrix Multiplier
• Beta calculation in the QR: heavy arithmetic operations take place.

• If we replace the arithmetic units within these entities with more deeply pipelined units (the division takes 23 cycles, the square root 11 cycles and the multiplier 2 cycles), the maximal frequency will rise.

• There is no real reason to run at a higher clock unless on-chip memory is insufficient for the delay FIFO or speed is an actual necessity.

Resource Consumption

• Total numbers taken from Stratix III FPGA EP3SE260F1152C2

Columns: resources used by the module alone, resources used with the architecture, total resources on the FPGA, usage percentage alone, usage percentage with the architecture, architecture consumption, and architecture share out of total usage.

Resource                    Alone     With arch.   Total on FPGA   Usage alone   Usage with arch.   Arch. consumption   Arch. out of total
combinational ALUTs         51,940    62,913       203,520         25.52%        30.91%             5.39%               17.44%
memory ALUTs                0         640          101,760         0.00%         0.63%              0.63%               100.00%
logic registers             17,788    48,820       203,520         8.74%         23.99%             15.25%              63.56%
memory bits                 100,224   1,240,808    15,040,512      0.67%         8.25%              7.58%               91.92%
DSP block 18-bit elements   752       752          768             97.92%        97.92%             0.00%               0.00%
PLLs                        0         5            8               0.00%         62.50%             62.50%              100.00%
DLLs                        0         2            4               0.00%         50.00%             50.00%              100.00%

DSP – Runtime Analysis

• Worst-case pseudo-inverse timing (for 11 support vectors) is a delay of 0.5 milliseconds. Hence an appropriately sized delay FIFO is required.

• The SCD and the reconstruction multiplier work in real time (1 cycle, 50 ns).
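The two figures above give a rough size for the delay FIFO. The arithmetic below is a sketch that assumes one 432-bit samples bundle is stored on every 50 ns cycle for the whole worst-case delay; the slides do not state the actual FIFO depth or write rate.

```python
DELAY_NS = 500_000      # worst-case pseudo-inverse delay: 0.5 ms
CYCLE_NS = 50           # one real-time cycle: 50 ns
BUNDLE_BITS = 432       # width of one samples bundle

depth_cycles = DELAY_NS // CYCLE_NS
fifo_bits = depth_cycles * BUNDLE_BITS   # assumes one bundle stored per cycle

print(depth_cycles)  # 10000
print(fifo_bits)     # 4320000
```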

Outline

• Overview – Goals and discussion
• Algorithm review
• Implementation in hardware
• Changes for Adaptation to hardware
• Evaluation
– Testing method
– Results
– Discussion
– Conclusions
• Possible Optimization & Future Work

Evaluation - Testing

Logical Testing

[Diagram: Input text files (Expanded samples, CTF output support) feed the VHDL test bench (A matrix memory, Status parser, Functional module: DSP, SCD), which produces Output text files. The Matlab (fixed point) results are checked for equality with the VHDL results.]

Evaluation - Testing

On Chip Testing

[Diagram: Input text files (Expanded samples, CTF output support) feed the Debug Environment (A matrix RAM, CTF model & FIFO ctrl, Functional module: DSP, SCD), which produces Output text files for Analysis & Comparison to Modelsim.]

Evaluation - Results

• Results of the run on FPGA with the following signals
– Fm259_252_sin824_809
– Fm259_252_am872.697
– Am_872.697_sin824

• SCD test


[Figures: spectra of the reconstructed sequences, Power/frequency (dB/Hz) versus Frequency (MHz), over 0 to 90 MHz. Panel titles: Reconstructed sequence #1; Reconstructed sequence fixed point modelsim #1; Reconstructed sequence fixed point modelsim #2; Reconstructed sequence fixed point hardware #1. Comparison panels: FPGA output and Matlab simulation.]

[Figure: Support Change experiment, with the "Support changed" indication marked.]

Evaluation - Discussion

• Correctness was inspected in comparison to Matlab using the following:
– Maximal MSE of the calculated pseudo-inverse matrix values
– Maximal and averaged values of the difference between the results of the Matlab simulation and the actual results
– Visual inspection of the differences

• The SCD experiment was composed of two sample bundles with different supports, put together to inspect correctness and to draw further conclusions about the support threshold.
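The comparison metrics listed above can be computed as in the following sketch. The function name `compare_outputs` and the example values are hypothetical; in practice the two arrays would be loaded from the Matlab and hardware output text files, whose format is not described in the slides.

```python
import numpy as np

def compare_outputs(reference, measured):
    """Return MSE, maximal and averaged absolute difference between two result sets."""
    reference = np.asarray(reference, dtype=float)
    measured = np.asarray(measured, dtype=float)
    diff = measured - reference
    return {
        "mse": float(np.mean(diff ** 2)),
        "max_abs_diff": float(np.max(np.abs(diff))),
        "mean_abs_diff": float(np.mean(np.abs(diff))),
    }

# Illustrative values only, not measured data.
metrics = compare_outputs([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```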

Evaluation – conclusions

• The MSE inspected for the inverted matrix is 10^-3.

• The MSE for the reconstructed signal:
– Maximal: 0.04
– Averaged: ~10^-6

• No firm conclusions were drawn about the support change function; its behavior is predictable only at the support changes.



Future Work

• Possible Optimizations
– Modifying the inversion algorithm for higher parallelism.
– Scaling the hardware to increase performance.

• Possibly changing the resolution of the calculations to 22 or more bits for better accuracy, at a great cost in hardware.

• Integration

Summary

• We managed to run the DSP and SCD module on the FPGA and obtained sufficient results.

• We introduced an algorithm for calculating the support threshold.

• We changed most of the architecture to support pipelining and to use minimal hardware (vector resolution).

• We changed the debug environment to support a different FPGA.
