
Increasing the Spectral Efficiency of Continuous Phase Modulation Applied to Digital Microwave Radio: A Resource Efficient FPGA Receiver Implementation

A thesis presented in partial fulfilment of the requirements for the degree of

Master of Engineering in Electronics and Computer Systems Engineering

at Massey University, Palmerston North, New Zealand

Andrew B. Bridger, B.E. (Hons 1)

November 2, 2009


Abstract

In modern point-to-point microwave radio systems used to backhaul cellular voice and data traffic, quadrature amplitude modulation (QAM) is the norm. These systems require a highly linear power amplifier which is expensive and has relatively low power efficiency. Recently, continuous phase modulation (CPM) has been deployed in this market. The CPM transmitted waveform has a constant envelope and so a non-linear RF power amplifier can be used. This significantly reduces cost and improves power efficiency.

Two important disadvantages of CPM are receiver complexity and inferior spectral efficiency compared to QAM. This thesis demonstrates a 50% spectral efficiency improvement over an existing CPM configuration without loss of detection efficiency. This is achieved by moving to coherent demodulation and extending the duration of the CPM phase pulse to 3 symbol periods.

This new CPM configuration of h=1/4, M=4, L=3, is evaluated against ETSI requirements for a 28 MHz channel carrying 24 E1 circuits. Simulation of the receiver floating point model demonstrates all requirements are met. The detection efficiency requirement is exceeded by 4.7 dB. Carrier recovery, phase and timing synchronisation are assumed to be ideal.

The 50% increased symbol rate, coherent reception and a longer, smoother phase pulse conspire to increase receiver complexity substantially. The Viterbi algorithm is used to perform maximum-likelihood detection resulting in a 128 state trellis. This application has a stringent cost requirement that limits the implementation target to a Field Programmable Gate Array (FPGA) costing less than US$30. To demonstrate this demanding cost target is met, the two most computationally expensive receiver functions, the branch metric unit and path metric processing unit, are implemented in VHDL and targeted to a Xilinx Spartan 3A-DSP 1800 FPGA. The implementation uses 67% of the available logic resources, thus meeting the cost requirement.

The branch metric unit is implemented using a distributed arithmetic technique that performs the equivalent of 27.6 giga-multiplies/s, consuming only 23% of the available FPGA logic cells. This is very efficient compared to a conventional approach using all the FPGA’s embedded multipliers which combined can only achieve 21 giga-multiplies/s.

The Viterbi path metric processing unit is implemented using a more conventional state-parallel architecture. To reduce state metric routing complexity, states are grouped into radix-4 units comprising dual add-compare-select (ACS) units. By utilising a spare cycle in the deep ACS pipeline, each ACS unit processes two output state metrics, thus halving the number of ACS units required. This implementation uses 44% of the available FPGA resources and meets timing at 204.5 MHz, exceeding the throughput requirement of 54 Mbit/s.


Copyright is owned by the Author of the thesis. Permission is given for a copy to be downloaded by an individual for the purpose of research and private study only. The thesis may not be reproduced elsewhere without the permission of the Author.


Acknowledgements

I would like to acknowledge and thank those who have helped me during my research and made it all possible.

This thesis was completed under a Technology Industry Fellowship (TIF) Education scholarship in conjunction with Harris Stratex Networks (NZ) LTD and Massey University. This support from FRST is much appreciated.

A big thanks to Dr Philip Secker, my project mentor from Harris Stratex Networks, for his role in kick starting the project and securing funding from FRST and Harris Stratex Networks. Also for taking the time out from a busy schedule to offer guidance, read my long emails and answer my many questions throughout the year, and provide feedback on this thesis.

I would like to thank Harris Stratex Networks for their commitment to the project in the form of Philip’s time and a stipend contribution.

Many thanks to Dr Xiang Gui, my academic supervisor in Palmerston North, for his time spent listening and offering guidance during our weekly progress meetings, for ensuring I had access to the university’s resources I needed, and for taking the time to review my thesis.

I would also like to thank Dr Edmund Lai, my co-supervisor based in Wellington, for his wise words, excellent thesis writing resources on his website, and for taking the time to provide feedback on my thesis.

Colin Plaw and Patrick Rynhart helped me with VPN access to the university labs and ensured I had access to a source code repository. Managing all the material created throughout the year would have been much much harder without it, thank you.

To my parents, Mum, Dad, Ma and Baba, thank you for all your encouragement and support, and for all the dinners that have been cooked for me over the last 6 months while I have been writing up this thesis. Also, thank you Mum for proof reading the whole thesis.

Finally, a very special thank you to my lovely wife Mandira, for convincing me to pursue my FPGA and digital signal processing passion in the form of a Masters in Engineering. Thank you for your constant encouragement and understanding during the thesis writeup phase.


Contents

Abstract
Acknowledgements
List of Figures
List of Tables
Glossary

1 Introduction
  1.1 Introduction
  1.2 Scope
  1.3 Summary of Thesis Contributions
  1.4 Overview

2 Background and Related Work
  2.1 Introduction
  2.2 Notation
  2.3 Continuous Phase Modulation Signal Model
  2.4 Maximum Likelihood Receiver
    2.4.1 Rational h
    2.4.2 Viterbi Trellis Decode
  2.5 Viterbi Decoder Architecture
  2.6 Literature Review
    2.6.1 CPM Receiver Implementations
      2.6.1.1 Viterbi Path Metric Processing
      2.6.1.2 Viterbi Path Metric Normalisation
    2.6.2 CPM Configurations and their Energy/Bandwidth Consumption
    2.6.3 Complexity Reduction
    2.6.4 Literature Review Summary

3 CPM Parameter Selection
  3.1 Introduction
  3.2 CPM Candidates
    3.2.1 Choice of Symbol Alphabet Size (M) and Phase Pulse Shape
    3.2.2 Modulation Index (h) and Phase Pulse Duration (L) Candidates
  3.3 ETSI Requirements
    3.3.1 Specific Application
    3.3.2 Bandwidth: Transmit Power Spectral Density (PSD)
    3.3.3 Detection Efficiency: Bit Error Rate as a function of Receive Signal Level
    3.3.4 Interference Rejection
  3.4 Simulation Results
    3.4.1 Simulation System Model
    3.4.2 Bandwidth: Transmit Power Spectral Density (PSD)
    3.4.3 Detection Efficiency and Interference Rejection Performance: h=1/4, L=3
      3.4.3.1 Detection Efficiency: Bit Error Rate as a function of SNR
      3.4.3.2 1st Adjacent Channel Interference
      3.4.3.3 Co-channel Interference
    3.4.4 Detection Efficiency and Interference Rejection Performance: h=1/5, L=2
  3.5 Conclusion

4 Fixed Point Modelling
  4.1 Introduction
  4.2 Implementation Target
  4.3 Sampling Rate
  4.4 Fixed Point Modelling
    4.4.1 Floating Point vs Fixed Point Numeric Representation
    4.4.2 Quadrature and In-phase Received Signal Word-length
    4.4.3 Branch Metric Filter Bank Coefficient Word-length
    4.4.4 Branch Metric Word-length
    4.4.5 State Metric Normalisation
  4.5 Survivor Path History
  4.6 Conclusion

5 Branch Metric Implementation
  5.1 Introduction
  5.2 Computation Complexity
  5.3 CPM and Distributed Arithmetic
    5.3.1 Phase State Symmetry
  5.4 FPGA Implementation
    5.4.1 Throughput Requirements
    5.4.2 Efficient Mapping of a DA Filter into FPGA Hardware
      5.4.2.1 DALUT
      5.4.2.2 Scaling Accumulator
      5.4.2.3 FPGA Resource Use Summary
      5.4.2.4 Additional Resource Use Required to Meet Timing
    5.4.3 Implementation Results
    5.4.4 Functional Verification
  5.5 DA vs Embedded Multipliers
  5.6 Conclusion

6 Path Metric Implementation
  6.1 Introduction
  6.2 Background
  6.3 Proposed Solution
    6.3.1 State-Parallel Radix-4 Decomposition
    6.3.2 Add-Compare-Select Unit
      6.3.2.1 Resource Use Estimate
  6.4 Implementation Results
    6.4.1 Functional Verification
  6.5 Conclusion

7 Conclusions and Future Work

Appendix

A VHDL Implementation Functional Verification
  A.1 VHDL Functional Verification Architecture

B Receive Signal Level to SNR Conversion

C Baseband I/Q Modulator Derivation

D VHDL Source Code
  D.1 Branch Metric Filter Bank
    D.1.1 Synthesis Top Level
    D.1.2 Top Level
    D.1.3 4-Tap Distributed Arithmetic Filter
    D.1.4 Filter Coefficients
    D.1.5 Testbench
    D.1.6 Test Vectors Package
  D.2 Viterbi Trellis Path Metrics
    D.2.1 Synthesis Top Level
    D.2.2 Top Level
    D.2.3 Radix-4 Add-Compare-Select Unit
    D.2.4 Viterbi Trellis Package
    D.2.5 Testbench
    D.2.6 Test Vectors Package
  D.3 Branch Metrics Filter Bank and Viterbi Trellis Path Metrics
    D.3.1 Synthesis Top Level
    D.3.2 Top Level
    D.3.3 Testbench
  D.4 Placed Primitives Modules
    D.4.1 Adder with Subtract and Clear Controls
    D.4.2 Adder with Subtract and 2 Input Operand Mux
    D.4.3 16 Deep ROM using Distributed Ram
    D.4.4 2 Input Mux
    D.4.5 Shift Register
    D.4.6 Register
    D.4.7 Relative Location Constraint (RLOC) Helper Package
  D.5 Sundry Packages
    D.5.1 Key Project Constants

E Matlab Source Code
  E.1 Analytical CPM Code Performance
    E.1.1 CPM Code Minimum Euclidean Distance Upper Bound
    E.1.2 CPM Code Baseband Double Sided Bandwidth
  E.2 Floating and Fixed Point M-file Models
  E.3 VHDL Trellis Representation and Test Vector Generation
    E.3.1 Export Matlab Data to VHDL Writer
    E.3.2 VHDL Writer
  E.4 Miscellaneous
    E.4.1 Hekstra's Method Bound Calculation

F Implementation Results
  F.1 Branch Metric Filter Bank
  F.2 Viterbi Trellis Path Metrics

G Software Tool Versions
  G.1 High Level Modelling: Matlab and Simulink
  G.2 FPGA Implementation: Xilinx ISE 10.1
  G.3 VHDL Simulation: Modelsim

Bibliography

List of Figures

1.1 CPM Receiver Functionality Studied in this Thesis
1.2 Modelling and Implementation at Complex Baseband

2.1 Raised Cosine (RC) and Rectangular (REC) Frequency Pulse and Phase Pulse
2.2 Standard Viterbi Decoder Implementation Architecture

3.1 Effect of Symbol Alphabet Size (M) on Detection Efficiency, L=1, Raised Cosine Phase Pulse
3.2 Effect of Phase Pulse Duration (L) on Detection Efficiency, M=4
3.3 Relative Detection Efficiency and Spectral Efficiency for several CPM Configurations
3.4 Simulation System Model
3.5 Simulated Transmit Power Spectral Density of Candidate CPM Configurations, Various (h, L), M=4, 27 Msymbols/s
3.6 Zoomed in version of Figure 3.5
3.7 Simulated Bit Error Probability with and without Reed Solomon FEC, AWGN Channel, No ACI Reject Filter, h=1/4, L=3RC, 27 Msymbols/s
3.8 Simulated Bit Error Rate with Adjacent Channel Interference, h=1/4, L=3, M=4, 27 Msymbols/s
3.9 Simulated Bit Error Rate Demonstrating Effect of ACI Reject Filter and Adjacent Channel Interference, h=1/4, L=3, M=4, 27 Msymbols/s
3.10 Simulated Bit Error Rate with Co-Channel Interference, h=1/4, L=3, M=4, 27 Msymbols/s
3.11 Simulated Bit Error Rate with and without Adjacent Channel Interference, h=1/5, L=2, M=4, 27 Msymbols/s

4.1 Approximations to the maximum-likelihood receiver
4.2 Effect of Sampling Rate on Detection Efficiency
4.3 Effect of Quantised Received Signal Word-length on Receiver Detection Efficiency
4.4 Effect of Quantised Branch Metric Filter Bank Coefficient Word-length on Receiver Detection Efficiency
4.5 Effect of Branch Metric Word-length on Receiver Detection Efficiency
4.6 Effect of Survivor Path History Depth on Detection Efficiency

5.1 4-Tap Distributed Arithmetic FIR Filter Block Diagram

6.1 CPM Viterbi Detection
6.2 Add-Compare-Select Processing Required per State
6.3 128 State CPM Viterbi Detector Comprising 32 Radix-4 Add-Compare-Select Units
6.4 CPM Radix-4 Trellis (h=1/4, L=3, M=4)
6.5 Add-Compare-Select (ACS) Unit Detail and Pipeline
6.6 Radix-4 Unit Comprising Dual ACS Units

A.1 VHDL Implementation Test Architecture

List of Tables

2.1 Notation

3.1 Candidate Modulation Indices (h)
3.2 Candidate CPM Configurations
3.3 ETSI Co-Channel and 1st Adjacent Channel Interference Performance [1, Table D.7]

4.1 Comparison of DSP and FPGA Multiplication Capability
4.2 Detection Efficiency Degradation due to Branch Metric Quantisation

5.1 CPM Branch Metric Filter Bank Complexity
5.2 Distributed Arithmetic Filter Fmax Requirement
5.3 Estimated Branch Metric Filter Bank FPGA Resource Use
5.4 Branch Metric Filter Bank Implementation Results

6.1 Single Add-Compare-Select Unit FPGA Resource Use Estimate
6.2 Path Metric Processing Unit Implementation Results

B.1 ETSI Received Signal Level Converted to SNR and Eb/No

G.1 Matlab and Simulink Software Versions
G.2 Xilinx Synthesis and Implementation Software Versions

Glossary

ACI     Adjacent Channel Interference
ACS     Add-Compare-Select
ACSU    Add-Compare-Select Unit
ADC     Analog to Digital Converter
AGC     Automatic Gain Control
ASIC    Application Specific Integrated Circuit
ASSP    Application Specific Standard Product
AWGN    Additive White Gaussian Noise

BER     Bit Error Rate
BMU     Branch Metric Unit
BRAM    Block Random Access Memory

CCI     Co-Channel Interference
CFO     Carrier Frequency Offset
CPFSK   Continuous Phase Frequency Shift Keying
CPM     Continuous Phase Modulation

DA      Distributed Arithmetic
DALUT   Distributed Arithmetic Look-Up Table
DSP     Digital Signal Processor
DUT     Device Under Test

ETSI    European Telecommunications Standards Institute

FEC     Forward Error Correction
FPGA    Field Programmable Gate Array

GMSK    Gaussian Minimum Shift Keying

IF      Intermediate Frequency

LC      Logic Cell
LUT     Look-Up Table

ML      Maximum-Likelihood
MLSD    Maximum-Likelihood Sequence Detector

MSK     Minimum Shift Keying

PAM     Pulse Amplitude Modulation
PSD     Power Spectral Density

QAM     Quadrature Amplitude Modulation

RC      Raised Cosine
REC     Rectangular
RF      Radio Frequency
RLOC    Relative Location
RSL     Received Signal Level

SER     Symbol Error Rate
SMU     Survivor Management Unit
SNR     Signal to Noise Ratio
SRAM    Static Random Access Memory
SRC     Spectrally Raised Cosine
STA     Static Timing Analysis

TFM     Tamed Frequency Modulation

UHF     Ultra-High Frequency

VHDL    VHSIC hardware description language
VHSIC   Very-High-Speed Integrated Circuit
VME     VERSA-module Europe

Chapter 1

Introduction

1.1 Introduction

In a data market, bit-rate is everything. Vendors in the digital microwave radio cellular backhaul market are under pressure to reduce costs and cope with increasing data rate requirements due to the rapidly increasing availability and consumer uptake of high speed mobile data services. Traditionally, microwave radio backhaul systems have used quadrature amplitude modulation (QAM). Radio spectrum is a limited resource and so the trend has been toward larger QAM constellation sizes to increase spectral efficiency. However, since QAM modulates carrier phase and amplitude, the transmitter requires a linear power amplifier. This component is a significant portion of the cost and power consumption of the microwave radio outdoor unit.

A recently released microwave radio product has significantly reduced cost and improved power efficiency by using continuous phase modulation (CPM). By modulating carrier phase only and ensuring a smooth phase transition between symbols, the transmitted CPM waveform has a constant envelope. This allows the use of a low-cost non-linear power amplifier in the transmitter.

Nevertheless, two important disadvantages of CPM are receiver complexity and inferior spectral efficiency compared with QAM. Spectrally efficient CPM configurations require at least a quaternary symbol alphabet and a smoothed symbol phase pulse lasting several symbol periods. This leads to a maximum-likelihood receiver with a large matched filter bank and a Viterbi decoder with a large number of states; the implementation cost can be prohibitive.

This thesis proposes a CPM configuration that achieves a 50% improvement in spectral efficiency over an existing microwave radio CPM product, while maintaining receiver detection efficiency. This is achieved by moving to coherent demodulation and lengthening the phase pulse duration to 3 symbol periods. The ETSI channel bandwidth of 28 MHz is constant, so this new scheme increases data throughput by 50%.

The increased symbol rate (27 MSymbols/s), coherent reception, and a longer, smoother phase pulse conspire to increase receiver complexity substantially. The maximum-likelihood receiver contains a matched filter bank of 128 filters, followed by Viterbi path metric processing of 128 Viterbi states. The matched filtering operation alone is shown to consume 27.6 giga-multiplies/s. A cost effective implementation is a challenge.

The application has a stringent cost requirement that limits the implementation target to a Field Programmable Gate Array (FPGA) costing less than US$30 at a volume price. To demonstrate the proposed CPM configuration is able to meet the cost target, the two most computationally expensive receiver functions are implemented in VHDL and targeted to a Xilinx Spartan 3A-DSP 1800 FPGA. The designs are synthesised to confirm FPGA resource use and cost. Static timing analysis on the placed and routed netlist confirms data throughput. A VHDL functional simulation verifies operation of the implemented design.
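The figure of 128 Viterbi states quoted above can be cross-checked against the standard state count for CPM with a rational modulation index h = m/p. The sketch below is illustrative only and is not taken from the thesis models: the function and argument names are hypothetical, and it assumes the usual result that there are p phase states when m is even and 2p when m is odd, each combined with the M^(L-1) possible values of the previous L-1 symbols.

```python
from math import gcd


def cpm_trellis_states(m, p, alphabet_size, pulse_length):
    """Number of Viterbi states for CPM with h = m/p (hypothetical helper).

    Assumes M-ary symbols drawn from the odd integers, so the accumulated
    phase takes p distinct values when m is even and 2p when m is odd.
    """
    # Reduce h = m/p to lowest terms before counting phase states.
    common = gcd(m, p)
    m, p = m // common, p // common
    phase_states = p if m % 2 == 0 else 2 * p
    # Correlative states: the previous L-1 symbols still shaping the phase.
    correlative_states = alphabet_size ** (pulse_length - 1)
    return phase_states * correlative_states


# Configuration proposed in this thesis: h = 1/4, M = 4, L = 3.
print(cpm_trellis_states(1, 4, 4, 3))
```

For h = 1/4 there are 8 phase states and 16 correlative states, which gives the 128 states stated in the text.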

The literature tackles the complexity problem by using a variety of algorithm level complexity reduction techniques that have been shown to give significant reductions in complexity with a range of performance degradations relative to the maximum-likelihood receiver. Many of these techniques have not been tested in the presence of adjacent channel interference (ACI), and some complexity reduction techniques have shown increased sensitivity to ACI. This thesis avoids this issue and focuses on a maximum-likelihood FPGA implementation. This allows a dollar cost to be put on any quaternary CPM configuration with a symbol rate in the region of 10-30 MSymbols/s. This fills a gap in the literature where there are very few published details of CPM receiver FPGA implementations.

A low-cost CPM filter bank FPGA implementation is proposed. It uses a distributed arithmetic technique to implement 27.6 giga-multiplies per second of filter bank multiplications consuming 23% of the available FPGA logic cells. A conventional approach would use the FPGA’s embedded multipliers but these resources provide a maximum of only 21 giga-multiplies/s. This design meets timing at 215.6 MHz, exceeding the minimum throughput requirements by 14%. The main drawback of this technique is that it adds one symbol period of processing latency. Since the branch metric filter bank is likely to be inside the phase recovery loop, this added latency degrades phase recovery performance. This is the price to be paid for a low-cost implementation.
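To illustrate why distributed arithmetic needs no multipliers, the following is a minimal software sketch of a bit-serial distributed arithmetic dot product. It is not the VHDL design described in this thesis: the names are hypothetical, integer coefficients and two's complement input samples are assumed, and left shifts are used for clarity where hardware would typically use a scaling accumulator. With 4 taps the look-up table has 16 entries, one for every combination of one bit taken from each input sample.

```python
def build_dalut(coeffs):
    """Precompute the sum of coefficients selected by each address bit pattern."""
    taps = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << taps)
    ]


def da_dot(samples, coeffs, bits):
    """Dot product of samples and coeffs using only table look-ups, shifts and adds.

    samples must fit in `bits`-bit two's complement.
    """
    lut = build_dalut(coeffs)
    mask = (1 << bits) - 1
    words = [s & mask for s in samples]  # two's complement bit patterns
    acc = 0
    for i in range(bits):
        # Form the LUT address from bit i of every input sample.
        addr = 0
        for k, w in enumerate(words):
            addr |= ((w >> i) & 1) << k
        partial = lut[addr] << i
        # The most significant (sign) bit carries negative weight.
        acc = acc - partial if i == bits - 1 else acc + partial
    return acc


samples = [3, -2, 7, -8]
coeffs = [5, -1, 2, 4]
print(da_dot(samples, coeffs, bits=4))
print(sum(s * c for s, c in zip(samples, coeffs)))
```

Both print statements give the same value, since the bit-serial form is only a reordering of the same sum of products. The cost of that reordering is one clock cycle per input bit, which is consistent with the added processing latency noted above.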

The Viterbi path metrics processing implementation follows a more standard architecture. The 128 state trellis is decomposed into 32 radix-4 units. Each radix-4 unit comprises 2 add-compare-select (ACS) units that calculate 4 path metrics every symbol period. This implementation uses 44% of the FPGA’s available logic cell resources. This design meets timing at 204.5 MHz achieving a throughput of 58.4 Mbit/s which is 8% more than the minimum requirement of 54 Mbit/s for the target application.
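The work done by one radix-4 unit in each symbol period can be sketched in software as follows. This is a behavioural illustration under stated assumptions, not the pipelined VHDL design: the function name and the layout of the branch metric table are hypothetical, and the larger metric is assumed to be the better one. The sharing of work between the two ACS units of a radix-4 unit is not modelled.

```python
def radix4_acs(prev_metrics, branch_metrics):
    """Add-compare-select for the 4 output states of one radix-4 unit.

    prev_metrics:   path metrics of the 4 predecessor states.
    branch_metrics: branch_metrics[j][i] is the metric of the branch from
                    predecessor i to output state j (hypothetical layout).
    Returns a list of (new_path_metric, surviving_predecessor) per output state.
    """
    results = []
    for row in branch_metrics:
        # Add: extend every predecessor path by its branch metric.
        candidates = [pm + bm for pm, bm in zip(prev_metrics, row)]
        # Compare and select: keep the best candidate and remember its source.
        best = max(range(len(candidates)), key=candidates.__getitem__)
        results.append((candidates[best], best))
    return results


prev = [0, 5, 2, 1]
branches = [
    [1, 0, 0, 0],
    [0, 0, 4, 0],
    [3, 3, 3, 3],
    [0, -5, 0, 9],
]
print(radix4_acs(prev, branches))
```

With 32 such units operating in parallel, all 128 path metrics are updated once per symbol.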

Symbol and phase synchronisation for CPM is an ongoing area of research and is beyond the scope of this thesis; we assume ideal symbol and phase synchronisation.

1.2 Scope

The purpose of this work is two-fold. Firstly, a new CPM configuration must be found that achieves a 50% spectral efficiency improvement over the existing product. Secondly, a practical, low-cost implementation must be demonstrated with an FPGA targeted VHDL implementation.

In the search for an efficient CPM configuration we assume single-h CPM and coherent reception. We assume perfect timing and carrier phase recovery and a zero carrier frequency offset. Although synchronisation is an active area of research, it is beyond the scope of this thesis.

Figure 1.1: CPM Receiver Functionality Studied in this Thesis

The CPM configuration’s bandwidth consumption, SNR performance, Adjacent Channel and Co-Channel Interference (ACI, CCI) performance must meet an application specific set of ETSI standard requirements [1]. We assume 239/255 Reed Solomon forward error correction. The data rate requirement is 54 Mbit/s or 27 MSymbols/s for a quaternary CPM symbol alphabet. This provides for the transport of 24 E1 circuits and an additional approximately 5 Mbit/s for framing and auxiliary channel overhead.
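The rate figures in the preceding paragraph can be reproduced with a few lines of arithmetic. The sketch below is illustrative and its names are hypothetical; the nominal E1 rate of 2.048 Mbit/s is assumed, as it is not stated in the text.

```python
from math import log2

E1_RATE_MBPS = 2.048  # nominal E1 line rate (assumed, not stated in the text)


def rate_breakdown(bit_rate_mbps=54.0, alphabet_size=4, e1_circuits=24):
    """Return (symbol rate in MSymbols/s, E1 payload in Mbit/s, remainder in Mbit/s)."""
    symbol_rate = bit_rate_mbps / log2(alphabet_size)  # 2 bits per quaternary symbol
    payload = e1_circuits * E1_RATE_MBPS
    remainder = bit_rate_mbps - payload
    return symbol_rate, payload, remainder


symbol_rate, payload, remainder = rate_breakdown()
print(f"{symbol_rate:.1f} MSymbols/s, {payload:.3f} Mbit/s payload, "
      f"{remainder:.3f} Mbit/s remaining")
```

The remainder of roughly 4.8 Mbit/s corresponds to the approximately 5 Mbit/s quoted for framing and auxiliary channel overhead.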

The VHDL implementation in this thesis focuses purely on the two most computationally expensive, and thus costly, functions of the CPM receiver (see Figure 1.1): the branch metrics filter bank and the Viterbi path metric add-compare-select functionality. Not included within the scope of this thesis are E1 data circuit interfaces, forward error correction, framing, sample rate conversion and front end band-limiting.

Survivor path management is not implemented. A traceback architecture is cost effective because it uses FPGA block memory to store the Viterbi path history. Section 4.5 shows that the memory required to implement this function is small: one or two block RAMs for this application. Meeting throughput requirements is straightforward since this function is outside the Viterbi iteration loop and so the logic can be pipelined extensively. However, traceback memory bandwidth requirements are high when tracing back once every symbol. The bandwidth requirements can be reduced significantly by tracing back only once every several symbols using the technique described in [2]. The cost is added latency, but this is negligible in the context of the latency through a complete receiver. Phase and timing recovery require early, tentative symbol decisions and these would not be generated by tracing back [3] [4].

The find-the-best metric operation has also not been implemented. If tracing back once every few symbols, bit-serial techniques are appropriate and the FPGA resources consumed can be kept to a minimum. The constraints on this latency are determined by phase and timing recovery, which is beyond the scope of this thesis. For reasonable latencies, the cost of this operation is small relative to the rest of the receiver.

The VHDL implementation is verified with synthesis results, static timing analysis and a VHDL functional simulation. This approach is justified in Appendix A.

All modelling and implementation is done at complex baseband as shown in Figure 1.2.

Figure 1.2: Modelling and Implementation at Complex Baseband

1.3 Summary of Thesis Contributions

The main contribution of this thesis is a resource efficient and low-cost FPGA implementation of the two most computationally expensive components of a CPM receiver of moderate throughput (54 Mbit/s). The CPM configuration achieves a 50% increase in spectral efficiency over an existing product.

Other contributions include:

• Demonstration of a CPM configuration with 50% improved spectral efficiency that meets a specific ETSI microwave radio standard, which specifies requirements for transmitted power spectral density and receiver performance in the presence of channel noise and adjacent and co-channel interference.

• CPM Viterbi demodulator fixed point simulation results.

• The application of Hekstra’s path metric normalisation method to a CPM detector.

• The application of a bit-serial distributed arithmetic filter algorithm to the FPGA implementation of a CPM branch metric unit.

• Application of a Viterbi radix-4 decomposition to a CPM trellis.
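The path metric normalisation contribution listed above rests on modulo arithmetic: path metrics are allowed to wrap around in a fixed two's complement word, and they are compared through the sign of their wrapped difference. The sketch below illustrates that idea in general terms only. The names and the 8-bit word length are hypothetical, and the comparison is valid only while the true difference between any two path metrics stays below half the range of the word.

```python
def wrap(value, bits):
    """Wrap an integer into `bits`-bit two's complement, as a hardware adder would."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def modulo_greater(a, b, bits):
    """True if stored metric a represents a larger true metric than stored metric b.

    Valid provided the true metrics differ by less than 2**(bits - 1).
    """
    return wrap(a - b, bits) > 0


BITS = 8
# True metrics 130 and 120: the first has overflowed the 8-bit word.
stored_a = wrap(130, BITS)
stored_b = wrap(120, BITS)
print(stored_a, stored_b)
print(modulo_greater(stored_a, stored_b, BITS))
```

Although the stored value of the first metric has wrapped to a negative number, the comparison still ranks it above the second, so no explicit rescaling of the path metrics is needed.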

1.4 Overview

• Chapter 2 presents background material relevant to the work in this thesis. The CPM signal model is developed and the key equations for the maximum likelihood receiver are stated. This chapter also contains a review of the CPM literature relevant to this thesis.

• Chapter 3 investigates the choice of CPM parameters to improve spectral efficiency by 50%. Floating point models developed in Matlab and Simulink simulate the CPM transmitter and receiver at baseband. We choose the CPM configuration with lowest complexity that still meets the ETSI standard transmit power spectrum mask and receiver detection efficiency specification.

• Chapter 4 begins the transition to a fixed point hardware implementation. Matlab simulation results demonstrate the effect on receiver detection efficiency of sampling rate, word-length and Viterbi path history depth. There is a direct tradeoff between hardware complexity and detection efficiency. This chapter concludes by selecting the most appropriate sample rate, word-length and path history depth to be used in the FPGA implementation to follow.

• Chapter 5 presents the branch metrics filter bank VHDL implementation. A distributed arithmetic algorithm implements each filter in the filter bank. Synthesis results show that the FPGA resource use meets the cost requirement, and static timing analysis results confirm throughput. Functional simulation results demonstrate that the VHDL implementation matches the Matlab fixed point model precisely.

• Chapter 6 describes the Viterbi path metric VHDL implementation. This trellis decode add-compare-select (ACS) processing is implemented using a state-parallel structure using radix-4 units comprising dual ACS units. Results from synthesis, static timing analysis and functional simulation demonstrate that this design meets requirements.

• Chapter 7 concludes the work presented in this thesis and suggests possibilities of investigation for the future.


Chapter 2

Background and Related Work

2.1 Introduction

This chapter begins by introducing the CPM signal model used throughout this thesis. A maximum-likelihood receiver is presented that comprises a filter bank followed by a Viterbi trellis search. Practical implementations use the Viterbi algorithm with an implementation comprising a branch metric unit, Viterbi path metric processing unit and a survivor management unit.

The second half of this chapter presents a review of the CPM literature relevant to this thesis.

2.2 Notation

This thesis uses the notation in Table 2.1.

$\Re(x)$: real component of $x$.

$\Im(x)$: imaginary component of $x$.

$\tilde{x}(t)$: baseband complex envelope of the passband signal $x(t)$, where $x(t) = \Re\{\tilde{x}(t)e^{j\omega_c t}\}$ and $\omega_c$ is the carrier angular frequency.

Table 2.1: Notation

2.3 Continuous Phase Modulation Signal Model

A radio frequency (RF) carrier modulated by a baseband message-carrying signal can be described by Equation (2.1). The amplitude $a(t)$ and phase $\phi(t)$ of the carrier at frequency $f_c$ are available to be modulated by the message signal [5].

$$s(t) = a(t)\cos\bigl(2\pi f_c t + \phi(t)\bigr) \tag{2.1}$$

An equivalent exponential notation is given in Equation (2.2), where $\tilde{s}(t)$ is the baseband complex envelope and $\omega_c$ is the RF carrier angular frequency. Equation (2.3) represents $\tilde{s}(t)$ in terms of its in-phase and quadrature components and, equivalently, Equation (2.4) shows the amplitude $a(t)$ and phase $\phi(t)$ components explicitly. For the purposes of this thesis, all the interesting properties of the modulation are described by its complex envelope $\tilde{s}(t)$, so the passband formulation is not considered any further.

$$s(t) = \Re\{\tilde{s}(t)e^{j\omega_c t}\} \tag{2.2}$$

$$\tilde{s}(t) = s_I(t) + j s_Q(t) \tag{2.3}$$

$$\tilde{s}(t) = a(t)e^{j\phi(t)} \tag{2.4}$$

Keeping $a(t)$ constant and using the message signal to modulate the phase $\phi(t)$ only, the transmitted RF signal has a constant envelope and consequently is robust to non-linearities in the signal path. By smoothly transitioning the phase from symbol to symbol the spectral occupancy is reduced. This is continuous phase modulation [6].

The standard way in which the message signal modulates the phase is given in Equation (2.5). The message signal is represented as digital data coded into M-ary symbols $\alpha_i$, drawn from a symbol set of size M as defined by Equation (2.6). In this thesis M is restricted to powers of 2.

$$\phi(t,\boldsymbol{\alpha}) = 2\pi h \sum_{i=-\infty}^{\infty} \alpha_i\, q(t - iT) \tag{2.5}$$

$$\alpha_i \in \{\pm 1, \pm 3, \ldots, \pm(M-1)\} \tag{2.6}$$

The modulation index, h, is a parameter that trades off bandwidth and energy performance. Although it is possible to vary h from symbol to symbol, we only consider CPM schemes with a single, fixed h [7].¹

q(t) is the all-important phase smoothing function, or phase pulse. Two examples are shown in Figure 2.1. The phase pulse is the integral of a frequency pulse g(t), as defined by Equation (2.8); for example, the raised cosine frequency pulse is described by Equation (2.7). The phase pulse starts at 0 at the beginning of a symbol duration so that the phase is continuous from one symbol period to the next. By convention it reaches 1/2 at the end of the pulse duration. An infinite number of phase smoothing functions are possible but there are several standard pulses defined in the literature.

$$g(t) = \frac{1}{2LT}\left(1 - \cos\frac{2\pi t}{LT}\right), \quad 0 \le t \le LT \tag{2.7}$$

$$q(t) = \int_{-\infty}^{t} g(\tau)\,d\tau \tag{2.8}$$

¹ If h is allowed to vary from one symbol to the next, it is called multi-h CPM. Further gains in spectral efficiency and energy consumption are possible, at the expense of further complication in the receiver [8].

Figure 2.1: Raised Cosine (RC) and Rectangular (REC) Frequency Pulse and Phase Pulse

The frequency pulse g(t) has finite duration of length LT where T is a nominal symbol period. The symbol period relates to the data bit rate and alphabet size as in Equation (2.9). L controls the degree of overlap between consecutive symbols in the modulator. CPM schemes with L = 1 are called full response and those with L > 1 are called partial response. CPM schemes with L > 1 spread the phase pulse in time and reduce the bandwidth of the transmitted signal [9].

"The use of partial response CPM systems yields a more attractive tradeoff between error probability and spectrum than does the full response systems" [9].

$$T = \frac{\log_2 M}{\text{bitrate}} \tag{2.9}$$

A CPM configuration is uniquely defined by the three parameters h, M and q(t). These parameters must be chosen to meet the requirements of the application at hand. Chapter 3 investigates the choice of these parameters with regard to meeting spectral occupancy, SNR performance and cost requirements for a specific point to point microwave radio application.
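As an illustration of Equations (2.7) to (2.9), the following is a minimal Python sketch of the raised cosine frequency pulse and its phase pulse. It is not the thesis code (which is in Matlab); the function names `g_rc`, `q_rc` and `symbol_period` are hypothetical, and the closed form used for q(t) is simply the analytical integral of Equation (2.7).

```python
import math

def g_rc(t, L, T):
    """Raised cosine frequency pulse, Equation (2.7); zero outside [0, L*T]."""
    if t < 0.0 or t > L * T:
        return 0.0
    return (1.0 - math.cos(2.0 * math.pi * t / (L * T))) / (2.0 * L * T)

def q_rc(t, L, T):
    """Phase pulse, Equation (2.8), using the closed-form integral of g_rc."""
    if t <= 0.0:
        return 0.0
    if t >= L * T:
        return 0.5  # by convention the phase pulse settles at 1/2
    return t / (2.0 * L * T) - math.sin(2.0 * math.pi * t / (L * T)) / (4.0 * math.pi)

def symbol_period(M, bitrate):
    """Symbol period T = log2(M) / bitrate, Equation (2.9)."""
    return math.log2(M) / bitrate

# Example values taken from the thesis abstract: M = 4 and 54 Mbit/s.
T = symbol_period(4, 54e6)   # seconds per symbol
q_end = q_rc(3 * T, 3, T)    # phase pulse value at the end of an L = 3 pulse
```

With these inputs the symbol rate is half the bit rate, and `q_end` evaluates to the conventional end value of 1/2.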

2.4 Maximum Likelihood Receiver

In this thesis we are interested in an implementation of the maximum-likelihood receiver. This is presented below.

The received signal $r(t)$ is a distorted version of the transmitted signal $\tilde{s}$ in the presence of additive white Gaussian noise (AWGN) $n(t)$, as shown by Equation (2.10). We assume perfect channel equalisation, and we set the phase and time offset terms to zero, implying perfect timing recovery and perfect carrier phase synchronisation with no carrier frequency offset. Appendix C shows how the transmitted baseband signal, $\tilde{s}$, is generated in practice.

$$r(t) = \tilde{s}(t,\boldsymbol{\alpha}) + n(t) \tag{2.10}$$

In [6], a maximum likelihood sequence estimating (MLSE) receiver is derived. It requires finding the symbol sequence $\boldsymbol{\alpha}$ that maximises the correlation in Equation (2.11). This is a correlation of the received signal with all possible transmitted signals.

$$J(\boldsymbol{\alpha}) = \Re\left\{\int_{-\infty}^{\infty} r(t)\,\tilde{s}^*(t,\boldsymbol{\alpha})\,dt\right\} \tag{2.11}$$

The number of symbol sequences grows exponentially with the sequence length, making a receiver using a direct implementation of this equation impractical. By taking an iterative approach as in Equation (2.12), and performing the correlation over one symbol period at a time as in Equation (2.13), the receiver becomes practical. The index n identifies a single symbol period.

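$$J_n(\boldsymbol{\alpha}) = J_{n-1}(\boldsymbol{\alpha}) + Z_n(\boldsymbol{\alpha}) \tag{2.12}$$

$$Z_n(\boldsymbol{\alpha}) = \Re\left\{\int_{nT}^{(n+1)T} r(t)\,\tilde{s}^*(t,\boldsymbol{\alpha})\,dt\right\} \tag{2.13}$$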


2.4.1 Rational h

A further requirement is for h to be rational so that $\tilde{s}^*(t,\boldsymbol{\alpha})$ lies within a finite set of waveforms over a single symbol period, making Equation (2.13) realisable. By restricting h to be rational according to Equation (2.14), the phase takes on a finite set of values modulo $2\pi$ at symbol time boundaries and a trellis with a finite number of states can be used to represent the phase transitions.

$$h = \frac{2k}{p}, \quad k, p \in \text{integers} \tag{2.14}$$

For CPM we have

$$\tilde{s}(t) = e^{j\phi(t)} \tag{2.15}$$

where

$$\phi(t,\boldsymbol{\alpha}) = 2\pi h \sum_{i=-\infty}^{\infty} \alpha_i\, q(t - iT) \tag{2.16}$$

When h is rational, the phase signal during the nth symbol interval, $nT \le t \le (n+1)T$, can be divided into two terms as shown in Equation (2.17).

$$\phi(t,\boldsymbol{\alpha}) = 2\pi h \sum_{i=n-L+1}^{n} \alpha_i\, q(t - iT) + 2\pi h \sum_{i=-\infty}^{n-L} \alpha_i\, q(LT) \tag{2.17}$$

$\theta(t,\boldsymbol{\alpha})$ in Equation (2.18) describes how the phase changes during the nth symbol interval due to the current symbol $\alpha_n$ and the previous L − 1 symbols.

$$\theta(t,\boldsymbol{\alpha}) = 2\pi h \sum_{i=n-L+1}^{n} \alpha_i\, q(t - iT) \tag{2.18}$$

$\theta_n$ in Equation (2.19) is the accumulated phase change due to all symbols prior to and including the symbol $\alpha_{n-L}$. This is called the phase state.²

$$\theta_n = \pi h \sum_{i=-\infty}^{n-L} \alpha_i \pmod{2\pi} \tag{2.19}$$

$$\phi(t,\boldsymbol{\alpha}) = \theta(t,\boldsymbol{\alpha}) + \theta_n \tag{2.20}$$

Placing Equation (2.20) into Equation (2.13) gives Equation (2.21).

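$$Z_n(\boldsymbol{\alpha}_n,\theta_n) = \Re\left\{\int_{nT}^{(n+1)T} r(t)\,e^{-j[\theta(t,\boldsymbol{\alpha}_n)+\theta_n]}\,dt\right\} \tag{2.21}$$

$$Z_n(\boldsymbol{\alpha}_n,\theta_n) = \Re\left\{e^{-j\theta_n}\int_{nT}^{(n+1)T} r(t)\,e^{-j\theta(t,\boldsymbol{\alpha}_n)}\,dt\right\} \tag{2.22}$$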

Since $\boldsymbol{\alpha}_n$ can take on $M^L$ different sequences and $\theta_n$ takes on p different values, Equation (2.22) represents $M^L$ complex matched filters followed by p phase rotations [9].

² In general, $\theta_n$ does not represent the actual signal phase at the start of a symbol period because $\theta(t,\boldsymbol{\alpha})$ is non-zero at the beginning of a symbol period. The phase state is the cumulative contribution of past symbols to the signal phase up to L − 1 symbols prior to the current time.


2.4.2 Viterbi Trellis Decode

The Viterbi algorithm is an efficient way to evaluate Equation (2.12). $J_n(\boldsymbol{\alpha})$ represents the accumulated path metrics at time nT and $Z_n(\boldsymbol{\alpha})$ is the set of branch metrics for the interval from t = nT to t = (n + 1)T. The trellis state is defined by Equation (2.23), where the phase state $\theta_n$ is defined by Equation (2.24) and the correlative state is defined by Equation (2.25). This gives a Viterbi trellis with $pM^{L-1}$ states [9].

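$$\sigma_n = (\theta_n, \alpha_{n-1}, \alpha_{n-2}, \ldots, \alpha_{n-L+1}) \tag{2.23}$$

$$\theta_n = \frac{2\pi i}{p}, \quad i \in \{0, 1, 2, \ldots, p-1\} \tag{2.24}$$

$$\text{CorrelativeState} = (\alpha_{n-1}, \alpha_{n-2}, \ldots, \alpha_{n-L+1}) \tag{2.25}$$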
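The state and filter counts above can be sketched in a few lines of Python. This is an illustrative helper only: the name `cpm_complexity` is hypothetical, and it assumes k and p in Equation (2.14) share no common factor, so that p is the denominator of h/2 in lowest terms. Table 3.2 lists twice $M^L$ matched filters, which would be consistent with counting real-valued filters; that reading is an assumption here, so the sketch reports complex filters only.

```python
from fractions import Fraction

def cpm_complexity(h, M, L):
    """Maximum-likelihood receiver size for a single-h CPM scheme.

    h is the modulation index (e.g. "1/4"), M the alphabet size and
    L the phase pulse duration in symbol periods.
    """
    half_h = Fraction(h) / 2          # h = 2k/p  =>  h/2 = k/p in lowest terms
    k, p = half_h.numerator, half_h.denominator
    return {
        "k": k,
        "p": p,                                   # number of phase states
        "complex_matched_filters": M ** L,        # Equation (2.22)
        "trellis_states": p * M ** (L - 1),       # p * M^(L-1)
    }

# The configuration named in the abstract: h = 1/4, M = 4, L = 3.
chosen = cpm_complexity("1/4", 4, 3)
```

For the configuration in the abstract this reproduces the 128 state trellis, and increasing L by one multiplies both counts by M.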

2.5 Viterbi Decoder Architecture

In the literature, Viterbi decoder implementations are typically treated by partitioning the functionality into three parts as shown in Figure 2.2.

• Branch Metric Unit (BMU) - Computes the Hamming or Euclidean distance between the received symbol and the various possible transmitted symbols. For a CPM receiver this implements Equation (2.21).

• Path Metric Processing Unit - Accumulates the path metric and selects a survivor path from each of the trellis connections inbound to each state. The path metrics are accumulated as defined by Equation (2.12).

• SMU or TBU - Survivor management unit or traceback unit. These units extract the decoded data from the ultimate survivor path.

Figure 2.2: Standard Viterbi Decoder Implementation Architecture
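The path metric processing step can be illustrated with a small software sketch of one add-compare-select iteration. This is a generic illustration, not the thesis VHDL design: the function name `acs_step` and the trellis description format are hypothetical. Because the CPM metric of Equation (2.11) is a correlation, the survivor is taken to be the candidate with the largest metric.

```python
def acs_step(path_metrics, predecessors, branch_metrics):
    """One Viterbi iteration of Equation (2.12) for every trellis state.

    path_metrics   : list of accumulated metrics J_{n-1}, one per state
    predecessors   : predecessors[s] is a list of (prev_state, branch_id)
                     pairs describing the branches inbound to state s
    branch_metrics : list of branch metrics Z_n indexed by branch_id
    Returns (new_metrics, decisions), where decisions[s] is the index of
    the surviving inbound branch for state s.
    """
    new_metrics = []
    decisions = []
    for inbound in predecessors:
        # Add: extend every inbound path by its branch metric.
        candidates = [path_metrics[prev] + branch_metrics[b] for prev, b in inbound]
        # Compare and select: keep the largest correlation metric.
        best = max(range(len(candidates)), key=candidates.__getitem__)
        new_metrics.append(candidates[best])
        decisions.append(best)
    return new_metrics, decisions

# Toy 2-state trellis with two inbound branches per state.
toy_predecessors = [[(0, 0), (1, 1)], [(0, 2), (1, 3)]]
metrics, decisions = acs_step([0.0, 1.0], toy_predecessors, [0.5, -1.0, 2.0, 0.25])
```

The `decisions` list is what a survivor management or traceback unit would store for each symbol period.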


2.6 Literature Review

2.6.1 CPM Receiver Implementations

Nova Engineering use a multi-h CPM waveform to increase spectral efficiency by 3 times compared to legacy PCM/FM telemetry waveforms at the same detection efficiency. They use modulation indices of h=1/4, 5/16, a raised cosine phase pulse of 3 symbol periods (L=3RC) and a quaternary alphabet (M=4). Viterbi trellis complexity is 512 states [10].

Nova Engineering have also designed a product called Hypermod MMD22 which uses multi-h CPM with M=4 and 128 state Viterbi trellis complexity. The complete transceiver is implemented on a board with 5 Xilinx Virtex E XCV2000E³ FPGAs. One FPGA is allocated to the Viterbi trellis update calculations, which consume 80% of the logic resources and 40% of the block RAM. The data rate is 22 Mbit/s [11].

In [12], turbo-detected coded CPM is used in a military UHF satellite communications application. A data rate of 80 kbit/s is transmitted in a 25 kHz channel. At an $E_b/N_0$ of 11 dB, the bit error rate is $10^{-5}$. They use M=8, h=1/8 and a rectangular phase pulse of 1 symbol duration (L=1REC). The modem implementation consists of 2 VERSA-module Europe (VME) cards in a VME chassis. Most of the signal processing functions are performed by a TMS320C6701 DSP but the iterative decoder is implemented in VHDL and uses 70% of an XCV2000E FPGA's resources.

Because these are commercial products, very few details of these FPGA implementations have been published.

2.6.1.1 Viterbi Path Metric Processing

There is a large body of literature describing algorithms and implementation details for Viterbi trellis decoding, mostly targeting the decoding of convolutional codes and ASIC implementation. Although few papers were found that directly address the implementation of a CPM Viterbi trellis, many of the general Viterbi results are applicable.

For low data rates or small Viterbi trellises, a fully state-serial approach uses the least hardware resources. A 64 state Viterbi decoder was implemented using a single add-compare-select (ACS) processing unit targeting a Xilinx Spartan 3 FPGA. The implementation used only 128 slices (approximately 256 logic cells) and 2 block RAMs to support a data rate of 2.4 Mbit/s [13].

In contrast, the CPM application in this thesis calls for a Viterbi decoder with a large number of states (128) and for a throughput of 54 Mbit/s, which is moderately high for a low-cost FPGA implementation target. The traditional approach to high throughput, large trellis size Viterbi decoders is a fully state-parallel approach in which each state is processed with individual add-compare-select units. This consumes a lot of hardware resources and the ACS path metric routing is complicated. A bit-serial approach to the ACS processing reduces the hardware requirements enormously, whilst also reducing the amount of ACS to ACS connectivity required since only a 1 bit wide path metric bus is required [2] [14]. However, this bit-serial approach does put an upper limit on throughput, dependent on the path metric wordlength.

³ An XCV2000E has 38400 logic cells and supports a 130 MHz clock rate with 4 LUT levels. Although this FPGA is almost 10 years old, this part is roughly equivalent in density and performance to a Spartan 3A-DSP XC3S1800ADSP, the FPGA implementation target in this thesis. Virtex FPGAs are Xilinx's premium brand so offer higher performance than the Spartan family, but at significantly higher cost.

By using multiple ACS units where each ACS unit processes multiple states, a hybrid of the fully state-serial and state-parallel approaches is possible. Shung proposes a systematic approach to allocating states to ACS units. By pipelining the ACS operation a single ACS unit processes multiple states at once. It is claimed this provides a favourable area-time tradeoff [15].

In general, a Viterbi trellis can be decomposed into radix-k sub-units, where k is a power of 2. For example, a radix-2 trellis has 2 inbound and 2 outbound paths per state and the radix-4 form of this trellis has 4 inbound and 4 outbound paths per state. Each Viterbi iteration of a radix-4 trellis is equivalent to 2 iterations of the radix-2 trellis. In this way a radix-4 trellis doubles the available time to perform the ACS operations. [16] is an ASIC implementation that nearly doubles throughput by decomposing a 32 state convolutional code radix-2 trellis into a radix-4 trellis. They achieve a decoding throughput of 140 Mbit/s in 1.2 µm CMOS. All ACS units within a radix-4 unit share input path metrics, keeping all routing within a radix-4 unit local. This thesis shows that the natural radix-4 decomposition of the CPM trellis used in our microwave radio application brings the same advantages to a CPM trellis Viterbi decoder.

Survivor path traceback has been regarded as another bottleneck to throughput. However, by increasing the survivor path memory size and tracing back less often than once per symbol, the traceback memory bandwidth requirements may be reduced significantly [2].

Sub-optimum detection based on the T-algorithm has been applied to the VLSI implementation of a coherent CPM detector [17]. Another non-maximum-likelihood technique is the adaptive Viterbi algorithm, which reduces the average amount of computation required by searching a subset of the full trellis based on channel conditions [18] [19] [20]. A systolic array approach to the branch metric and path metric processing is proposed in [21].

Both of the two main SRAM based FPGA vendors, Altera and Xilinx, have developed Viterbi decoder IP cores. The Altera core can implement a 256 state trellis with a throughput of 16 Mbit/s using 3800 logic cells and 18 9-kbit block RAMs in a Cyclone III family FPGA (EP3C10F256C6). They use 3 bit branch metrics [22]. Xilinx's serial IP core implements a 64 state trellis using 983 slices (equivalent to approximately 1966 logic cells) at a throughput of 15 Mbit/s in a Spartan 3A-DSP family FPGA (XC3SD3400A-4) [23].

2.6.1.2 Viterbi Path Metric Normalisation

The Viterbi path metrics grow without bound over time. Several techniques have been developed for scaling or normalising the metrics so they can be represented in fixed point arithmetic [24]. Since only the difference between path metrics is required for path selection in the Viterbi add-compare-select unit, and because this difference is bounded, Hekstra proposes the use of 2's complement arithmetic as an alternative to scaling or normalisation. This method eliminates the need for additional normalisation or rescaling hardware [25]. Path metric difference bounds are required to size the path metric wordlength [26]. It has not been shown that these techniques can be applied to a CPM Viterbi trellis.
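The idea of comparing wrapped path metrics can be sketched generically as follows. This is a textbook-style illustration of modular 2's complement comparison, not a result for the CPM trellis (which, as noted above, the literature had not established); the helper names `wrap` and `modular_greater` are hypothetical. The comparison is only valid under the stated assumption that the true metric difference is smaller in magnitude than 2^(bits−1).

```python
def wrap(value, bits):
    """Wrap an integer into the signed 2's complement range of width `bits`."""
    mask = (1 << bits) - 1
    value &= mask
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value

def modular_greater(a, b, bits):
    """Return True if metric a exceeds metric b.

    a and b are stored path metrics that may already have wrapped around.
    Assumes |true_a - true_b| < 2**(bits - 1), i.e. the path metric
    difference bound fits in the chosen wordlength.
    """
    return wrap(a - b, bits) > 0

# Two metrics whose true values are 130 and 120, stored in 8 bits.
stored_a = wrap(130, 8)   # has overflowed and is stored as a negative number
stored_b = wrap(120, 8)
a_wins = modular_greater(stored_a, stored_b, 8)
```

Even though `stored_a` has wrapped to a negative value, the subtraction modulo 2^8 still recovers the sign of the true difference, so no explicit rescaling step is needed.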

2.6.2 CPM Configurations and their Energy/Bandwidth Consumption

Aulin and Sundberg [27] [9] show a method for calculating a CPM code's minimum distance, which predicts detection efficiency. They also plot various CPM codes on the energy bandwidth plane. Anderson, Aulin and Sundberg [6] [7] show results for a wider variety of CPM configurations on the energy bandwidth plane, mostly using the raised cosine (RC) phase pulse. However, they do not measure performance against ETSI microwave radio standards.

Svensson [28] considers the choice of CPM configuration with regard to meeting a spectral mask requirement, seemingly from the same ETSI specification [1] as required in our application. They target 37.5 Mbit/s in a 14 MHz channel (2.68 bits/s/Hz) whereas the application in this thesis requires 54 Mbit/s in a 28 MHz channel (1.93 bits/s/Hz). They also provide results for adjacent channel interference with the same carrier to interference ratio (-5 dB) as required for the application considered in this thesis. Strangely, they locate the interferer at the 2nd adjacent channel whereas the ETSI specification calls for a 1st adjacent channel interferer.

Svensson [29] develops an empirical model for CPM and shows that for a constant effective bandwidth, M=8 is the optimum power of 2 symbol alphabet size in the range from 4 to 32, in order to maximise minimum distance squared (detection efficiency). However, compared to M=4, the advantage of M=8 in terms of $d_{\min}^2$ is only 0.55 dB. They also describe a saturation L, beyond which there is little improvement in $d_{\min}^2$. For M=8 it is 3, and for M=4 it is 7. However, for M=4 the advantage of L=7 over L=4 is only 0.57 dB.

Optimising the shape of the phase pulse has been investigated. In [30] an optimised phase pulse for M=8, L=3 and h=1/8 gave only a 0.2 dB gain over a GMSK phase pulse. For other M, gains up to 0.9 dB were found. [31] also investigates optimised phase pulses but concluded “the commonly known signal shapes are not too far from optimal performance”.

Multi-h CPM is summarised in [8], which shows that for the same bandwidth consumption, multi-h CPM has about a 2 dB $d_{\min}^2$ improvement for M=4, 3RC, across a range of h. However, the increase in receiver complexity is considered beyond the scope of this thesis.

2.6.3 Complexity Reduction

Spectrally efficient CPM configurations have a non-binary symbol alphabet and smooth phase pulses lasting multiple symbol periods, which leads to maximum-likelihood receivers with high complexity [7]. There is a significant body of literature proposing reduced complexity detectors, summarised by Perrins in [32].

The size of the receiver matched filter bank has been reduced by truncating the phase pulse and also by using a modified and reduced set of basis functions [33] [34]. The number of Viterbi trellis states has been reduced by searching only a subset of the full trellis or by using decision feedback [35] [36] [6]. Combinations of these techniques have been studied in [32] [37] [28].

Laurent proposed a pulse amplitude modulation (PAM) representation [38]; by ignoring the smaller amplitude pulses, receiver complexity is reduced. This was extended to M-ary CPM by Mengali [39]. Kaleh [40] presents a near optimum reduced complexity Viterbi receiver based on the PAM decomposition.

Other than in [28], adjacent channel interference performance of these reduced complexity schemes has not been tested. For a carrier to interference ratio of -5 dB, the 64 state detector in [28] has a loss of only approximately 0.5 dB, whereas the maximum-likelihood detector has 1280 states. Unfortunately, they place the interferer in the 2nd adjacent channel but in our application there is the far more stringent requirement of the interferer being in the 1st adjacent channel. Also, Simmons claims the reduced trellis size decoders have significantly increased susceptibility to adjacent channel interference (ACI) [41], although the carrier to interference levels used were large (-10 and -20 dB).

2.6.4 Literature Review Summary

Bandwidth and energy consumption of CPM configurations are well studied, and a few papers evaluate the CPM codes' bandwidth properties against ETSI standard spectral masks. As one would expect, CPM code performance has not been evaluated against the specific 28 MHz ETSI channel that is the focus of the microwave radio application considered by this thesis.

Although there is a vast range of Viterbi decoder literature, there are few published details for FPGA targeted CPM receiver implementations. There were no published results found for fixed point CPM receiver models.

Several techniques for complexity reduction give large reductions in complexity with minimal performance degradation; however, adjacent channel interference performance of these algorithms is largely unproven. This thesis avoids this issue by implementing a maximum-likelihood receiver and focusing on a low-cost, resource efficient FPGA implementation.

The CPM literature typically measures complexity in terms of the number of matched filters and Viterbi states. This is a limitation for the purpose of this thesis since here we must meet a specific cost requirement. FPGA cost and resource use is measured in logic cells and block RAMs rather than the number of Viterbi states or branch metric filters.

Chapter 3

Improving Spectral Efficiency: CPM Parameter Selection

3.1 Introduction

In microwave radio cellular backhaul applications, improving spectral efficiency is desirable, as long as receiver detection efficiency and interference sensitivity are not compromised. An increase in spectral efficiency means the same data rate can be transported using less bandwidth, allowing the operator to lower costs by using less costly radio spectrum licenses. Alternatively, operators can keep their use of the already limited radio spectrum constant and provide higher data rates to support, for example, the growing demand for mobile data services.

An existing CPM microwave radio product transports 16 E1 data circuits (36 Mbit/s including overhead) in an ETSI 28 MHz channel. This is a nominal spectral efficiency of 1.3 bits/s/Hz. This modem achieves a $10^{-6}$ bit error rate (BER) at an SNR of 14 dB [42].

In this chapter, we show that a 50% higher data rate, 24 E1 data circuits (54 Mbit/s including overhead), can be transported in the same 28 MHz channel without any degradation in receiver detection efficiency. This provides a margin of 4.7 dB to the ETSI receive signal level threshold specification assuming a radio noise figure of 6 dB.

The spectral efficiency is improved to 1.9 bits/s/Hz by moving to a new CPM configuration with a longer duration phase pulse and a smaller modulation index. Both these factors reduce detection efficiency; this degradation is mitigated by moving to coherent CPM demodulation. We assume perfect timing and carrier phase recovery and assume a zero carrier frequency offset.

Others have attempted more ambitious schemes such as 37.5 Mbit/s in a 14 MHz channel; this is a spectral efficiency of 2.7 bits/s/Hz. However, the optimum detector for this scheme is complicated, requiring 2048 matched filters and a Viterbi decoder of 1280 states. A reduced complexity approach was taken, resulting in a detector that is not maximum-likelihood. Also, only 2nd adjacent channel interference (ACI) sensitivity was investigated; the microwave radio application in this thesis requires compliance with a more stringent 1st adjacent channel interference sensitivity specification [28].
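The nominal spectral efficiency figures quoted in this introduction follow directly from dividing the data rate by the channel bandwidth. The short Python sketch below reproduces that arithmetic; the function name `spectral_efficiency` is hypothetical and the inputs are the rates and channel widths stated above.

```python
def spectral_efficiency(bit_rate, channel_bandwidth):
    """Nominal spectral efficiency in bits/s/Hz."""
    return bit_rate / channel_bandwidth

existing = spectral_efficiency(36e6, 28e6)     # 16 E1 circuits in 28 MHz
proposed = spectral_efficiency(54e6, 28e6)     # 24 E1 circuits in 28 MHz
reference = spectral_efficiency(37.5e6, 14e6)  # scheme reported in [28]

# Ratio of the new data rate to the existing one (the 50% improvement).
improvement = 54e6 / 36e6
```

Rounded to one decimal place these give the 1.3, 1.9 and 2.7 bits/s/Hz quoted in the text.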


There are a large number of CPM configurations that potentially meet our requirements. Symbol alphabets of 2, 4 or 8 have been used in real world implementations, phase pulse durations from 1 to 5 symbol periods, phase pulse shapes of raised cosine, rectangular, GMSK and others, and a wide range of modulation indexes are possible [10] [11] [12] [28] [32]. This chapter chooses a CPM configuration that meets the ETSI channel transmit spectral mask and interference rejection requirements, whilst providing an acceptable tradeoff between detection efficiency and receiver implementation complexity measured in terms of the Viterbi trellis and branch metric matched filter bank size.

This chapter starts by identifying 4 candidate CPM configurations by examining their theoretical detection and bandwidth efficiency. These candidate configurations are then simulated and their performance evaluated against an ETSI standard specifying limits for transmit power spectrum, interference sensitivity and receive signal threshold performance. The CPM configuration of h=1/4, L=3RC, M=4 is chosen because it meets the ETSI requirements, and has the lowest complexity of any scheme that exceeds the ETSI receive signal threshold requirements by more than 4 dB.

3.2 Analytical CPM Performance: Identifying CPM Configuration Candidates

There is a large number of possible CPM configurations, each with its own specific detection efficiency and bandwidth consumption characteristics. There are 4 parameters that specify a CPM configuration:

• Symbol Alphabet Size (M)

• Modulation Index (h)

• Phase Pulse Duration (L)

• Phase Pulse Shape

These 4 parameters are reflected in the standard equation for how the phase of a CPM modulated signal changes with time and data symbols; see Equation (3.1). q(t) determines the shape of the phase pulse and $\alpha_i$ is a data symbol from an alphabet of size M. “n” refers to the nth data symbol transmitted.

$$\theta(t,\boldsymbol{\alpha}) = 2\pi h \sum_{i=n-L+1}^{n} \alpha_i\, q(t - iT) \tag{3.1}$$

Continuous phase modulation is a coded modulation and by calculating a CPM configuration's minimum Euclidean distance, relative detection efficiency performance comparisons can be made. Symbol error rate is derived from the minimum distance squared, $d_{\min}^2$, as shown in Equation (3.2). The procedure for calculating a CPM configuration's minimum distance is well documented [6]. For the purposes of this thesis, this was implemented in Matlab; Appendix E.1.1 contains the source code.

$$P_e \approx Q\left(\sqrt{d_{\min}^2\,\frac{E_b}{N_0}}\right) \tag{3.2}$$
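Equation (3.2) is straightforward to evaluate numerically. The following Python sketch is illustrative only (the thesis code is in Matlab); the function names are hypothetical, and the Gaussian tail function Q(x) is expressed through the standard complementary error function as Q(x) = erfc(x/√2)/2.

```python
import math

def q_function(x):
    """Gaussian tail probability Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def symbol_error_estimate(d2min, ebn0_db):
    """Approximate symbol error rate from Equation (3.2).

    d2min   : normalised minimum squared Euclidean distance
    ebn0_db : Eb/N0 in decibels
    """
    ebn0 = 10.0 ** (ebn0_db / 10.0)   # convert dB to a linear ratio
    return q_function(math.sqrt(d2min * ebn0))

# MSK, with a minimum squared distance of 2, evaluated at an Eb/N0 of 10 dB.
pe_msk = symbol_error_estimate(2.0, 10.0)
```

A larger `d2min` at the same Eb/N0 gives a smaller error estimate, which is why $d_{\min}^2$ serves as the measure of relative detection efficiency.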

Figure 3.1: Effect of Symbol Alphabet Size (M) on Detection Efficiency, L=1, Raised Cosine Phase Pulse

Figure 3.1 illustrates how M and h affect detection efficiency for L=1 and a raised cosine (RC) phase pulse. Increasing M improves detection efficiency and the general trend for increasing h is an increase in detection efficiency. Increasing these parameters also causes an increase in bandwidth consumption. It is worth noting that this is an upper bound. At certain so-called “weak” modulation indices $d_{\min}^2$ no longer reflects actual symbol error rate performance. None of the CPM configurations considered in this thesis fall into this category [6].

Figure 3.2 shows the effect of phase pulse durations from 1 to 5 nominal symbol periods for a quaternary alphabet and raised cosine phase pulse. The effect of a longer, smoother phase pulse is to reduce detection efficiency; the flip side is a reduction in bandwidth consumption [6].

It is clear that in order to choose a CPM configuration it is necessary to evaluate it in a combined detection and bandwidth efficiency sense.

3.2.1 Choice of Symbol Alphabet Size (M) and Phase Pulse Shape

Figure 3.2: Effect of Phase Pulse Duration (L) on Detection Efficiency, M=4

In this thesis, only a quaternary symbol alphabet (M=4) is investigated. This is almost certainly superior to a binary symbol alphabet (M=2). Across a range of modulation indices, M=4, L=3RC has almost 3 dB better detection performance compared to M=2, L=3RC when comparing CPM configurations with the same bandwidth consumption [6]. An empirical model was developed that shows M=8 may be the best power-of-2 alphabet size [29]. However, in the example given it is only 0.55 dB better than M=4 yet has substantially higher complexity, so it is not considered further.¹

There are an infinite number of possible phase pulses although there are a standard few described in the literature: raised cosine (RC), spectrally raised cosine (SRC), rectangular (REC), Gaussian minimum shift keying (GMSK), tamed frequency modulation (TFM) and continuous phase frequency shift keying (CPFSK). Raised cosine is arguably the most popular and is often used as the baseline for comparison [6]. A phase pulse shape optimised to the bandwidth and detection efficiency requirements is possible but Asano concludes “the commonly known signal shapes are not too far from optimal performance” [31].

In any event, the implementation proposed in Chapters 5 and 6 supports alternative phase pulses. The phase pulse shape is reflected in the matched filter bank coefficients, which are stored in volatile memory. The design can be easily modified to support external reloading of this memory. For these reasons, the choice of phase pulse is not investigated; the simulations and FPGA implementations in this thesis use a raised cosine phase pulse.²

¹ M=8 has higher complexity since the number of matched filters and Viterbi states grows as a power of M ($M^L$ and $pM^{L-1}$ respectively). However, an advantage of moving to M=8 is that the symbol rate drops by one third since the number of bits per symbol increases by 50%. This eases the throughput requirement on the implementation since the amount of processing time available for each symbol is now 50% higher. This is particularly significant for the Viterbi iteration loop since it contains feedback that causes a bottleneck in pipelined FPGA implementations.

h (fraction)   h         k   p
1/3            0.33...   1   6
2/7            0.29...   1   7
1/4            0.25      1   8
1/5            0.2       1   10

Table 3.1: Candidate Modulation Indices (h)

We have chosen a quaternary alphabet (M=4) and a raised cosine phase pulse. Candidates for the modulation index (h) and phase pulse duration (L) are selected next.

3.2.2 Modulation Index (h) and Phase Pulse Duration (L) Candidates

In order to identify modulation index and phase pulse duration candidates, the CPM configuration must be evaluated in a combined detection efficiency and bandwidth efficiency sense. Bandwidth of the transmitted baseband CPM signal is calculated using a numerical method described in [7, pg 231]. We define the double-sided bandwidth as the bandwidth containing 99% of the transmitted power. This numerical method was implemented by the author using Matlab; Appendix E.1.2 contains the source code listing.
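The 99% power bandwidth definition can be illustrated with a short Python sketch. This covers only the final step, locating the bandwidth from a power spectrum that has already been computed; it is not the numerical method of [7] and not the Appendix E.1.2 code. The function name `occupied_bandwidth` is hypothetical, and the sketch assumes a spectrum that is symmetric about zero frequency, sampled on f ≥ 0 starting at f = 0.

```python
def occupied_bandwidth(freqs, psd, fraction=0.99):
    """Double-sided bandwidth containing `fraction` of the total power.

    freqs : increasing one-sided frequency samples, starting at 0
    psd   : non-negative power spectral density samples at those frequencies
    The spectrum is assumed symmetric, so the one-sided edge is doubled.
    """
    # Cumulative one-sided power by trapezoidal integration.
    cumulative = [0.0]
    for i in range(1, len(freqs)):
        area = 0.5 * (psd[i] + psd[i - 1]) * (freqs[i] - freqs[i - 1])
        cumulative.append(cumulative[-1] + area)
    target = fraction * cumulative[-1]
    for i in range(1, len(freqs)):
        if cumulative[i] >= target:
            # Linear interpolation inside the interval that crosses the target.
            span = cumulative[i] - cumulative[i - 1]
            frac = (target - cumulative[i - 1]) / span if span > 0 else 0.0
            edge = freqs[i - 1] + frac * (freqs[i] - freqs[i - 1])
            return 2.0 * edge
    return 2.0 * freqs[-1]

# Sanity check on a flat spectrum occupying -1 to +1 (arbitrary units).
flat_freqs = [i / 1000.0 for i in range(1001)]
flat_bw = occupied_bandwidth(flat_freqs, [1.0] * 1001)
```

For the flat test spectrum the result is 99% of the full width of 2, as expected; spectral efficiency then follows as the bit rate divided by this bandwidth.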

For coherent reception, the modulation index must be of the form $h = \frac{2k}{p}$ where k and p are integers. The modulation indices considered are shown in Table 3.1. Modulation indices above this range consume too much bandwidth and values below this range have too low a detection efficiency. There are other k, p combinations that sit within this range, but all require large values of p and so are considered too costly in terms of implementation. For example, $h = \frac{3}{11} = 0.27$ is interesting, but when put into the form $h = \frac{2k}{p}$, k=3 and p=22. 22 phase states is considered too costly for implementation.

Phase pulse durations from the range 1 to 5 symbols are considered. Combined with the 4 modulation index values of Table 3.1, this gives 20 CPM configurations to be evaluated.

All configurations are evaluated in a combined detection efficiency and spectral efficiency sense with the results shown in Figure 3.3. Detection efficiency is calculated in terms of $d_{\min}^2$ relative to minimum shift keying (MSK). MSK has a $d_{\min}^2$ of 2. Spectral efficiency is calculated from the bandwidth calculation described above.

The existing CPM product's performance is also plotted in Figure 3.3. The spectral efficiency of the CPM product is calculated as the data rate (36 Mbit/s) divided by the ETSI channel bandwidth (28 MHz). When comparing the existing CPM product's spectral efficiency with these new CPM configurations, there is an assumption that the transmit power spectrum determines the channel spacing. This is not necessarily true as adjacent channel interference tolerance may require the transmitted bandwidth to lie some extra distance inside the transmit channel mask.

² The current FPGA design uses a LUT4 primitive to store the matched filter coefficients. By replacing the LUT4 with an SRL16 primitive the filter coefficients can be loaded serially, yet still read and addressed as a standard memory.

Figure 3.3: Relative Detection Efficiency and Spectral Efficiency for several CPM Configurations

The existing CPM product’s detection efficiency relative to MSK is approximated by calculating its equivalent $d^2_{\min}$ using the existing product’s threshold performance of 14 dB SNR at $10^{-6}$ bit error rate and Equation (3.2) solved for $d^2_{\min}$. This is an approximation because the threshold performance is specified at a bit error rate, whereas Equation (3.2) calculates a symbol error rate.

The first point to note is that h acts to trade off detection efficiency against spectral efficiency. Only by increasing L can both detection efficiency and spectral efficiency be improved. However, the gains diminish at each higher L, while the complexity increases exponentially with L [6].

CPM configurations with L=4 or L=5 are ruled out as their complexity is considered too high. Indeed, Chapters 5 and 6 implement a CPM configuration with L=3 that meets the cost requirement for the application in this thesis. Moving to L=4 increases the complexity by 4 times, which would cause the cost requirement to be exceeded.

There are only 4 remaining CPM configurations that are in the vicinity of meeting the 50% increase in spectral efficiency goal of 1.9 bits/s/Hz. These are listed in Table 3.2. Three of the 4 schemes promise to improve detection efficiency compared with the existing CPM product. This table also shows the maximum-likelihood receiver complexity in terms of matched filters and Viterbi trellis states.


h      L    M    no. matched filters    no. Viterbi trellis states
1/5    2    4    32                     40
2/7    3    4    128                    112
1/4    3    4    128                    128
1/5    3    4    128                    160

Table 3.2: Candidate CPM Configurations
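The complexity figures in Table 3.2 can be reproduced with a short Python sketch. It assumes the usual CPM counting rules, which are not stated explicitly in this section: the number of phase states is the smallest p for which h = 2k/p with integer k, the trellis has p·M^(L-1) states, and the filter bank has 2·M^L matched filters. The factor of 2 is inferred from the table values rather than taken from the text, and the function names are hypothetical.

```python
from fractions import Fraction


def phase_state_count(h: Fraction) -> int:
    """Smallest p such that h = 2k/p with k and p integers."""
    # h/2 = k/p in lowest terms, so p is the reduced denominator of h/2.
    return (h / 2).denominator


def receiver_complexity(h: Fraction, L: int, M: int = 4):
    """Return (matched filters, Viterbi trellis states) under the assumed rules."""
    p = phase_state_count(h)
    matched_filters = 2 * M ** L          # assumed: factor 2 inferred from Table 3.2
    trellis_states = p * M ** (L - 1)     # phase states times correlative states
    return matched_filters, trellis_states


candidates = [(Fraction(1, 5), 2), (Fraction(2, 7), 3),
              (Fraction(1, 4), 3), (Fraction(1, 5), 3)]
table_3_2 = [receiver_complexity(h, L) for h, L in candidates]
```

The same helper shows why h = 3/11 was discarded earlier: `phase_state_count(Fraction(3, 11))` gives 22 phase states.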

3.3 CPM Configuration Evaluation Metrics: ETSI Compliance

A floating point Simulink model is used to carry out simulations confirming the analytical detection efficiency and bandwidth consumption results presented in the previous section and to evaluate the candidate CPM configurations against an ETSI microwave radio standard. The cellular backhaul microwave radio application considered in this thesis requires a product to meet this standard.

The ETSI specification [1, Annex D] constrains three aspects of the modulation:

1. Bandwidth - The transmitted power spectrum must fit within a low-pass spectral mask.

2. Detection Efficiency - At a specified received signal power level, the receiver bit error rate (BER) must be less than a specified value.

3. Interference Rejection - In the presence of adjacent channel interference (ACI) or co-channel interference (CCI), detection efficiency can degrade by no more than specified limits.

It is worth noting that the ETSI detection efficiency specification is the minimum performance required to achieve ETSI compliance. System gain is an important product marketing specification, and since improvements to detection efficiency (receiver sensitivity) directly improve system gain, it is desirable to maximise the margin to this specification. For example, improving the detection efficiency by 3 dB allows the use of an antenna approximately half the size and therefore significantly reduced cost. The chosen CPM configuration must meet the ETSI requirements whilst providing an acceptable tradeoff between detection efficiency and receiver implementation complexity and cost.

3.3.1 Specific Application

The specific application of interest to this thesis is covered by the ETSI standard: Fixed radio systems, Characteristics and requirements for point-to-point equipment and antennas. The frequency bands of interest are 13 GHz and 15 GHz, which are covered by Annex D of this standard. Our 50% improved spectral efficiency CPM configuration transports 24 E1 circuits within the 28 MHz channel. The ETSI standard classifies such a system as spectrum efficiency class 2, system D.1, 28 MHz channel [1, Annex D].


interference type       carrier to interference    allowed SNR          new SNR    new $E_b/N_o$ (dB)
                        ratio (dB)                 degradation (dB)     (dB)       (M=4)
co-channel              23                         1                    19.7       16.7
co-channel              19                         3                    21.7       18.7
1st adjacent channel    0                          1                    19.7       16.7
1st adjacent channel    -4                         3                    21.7       18.7

Table 3.3: ETSI Co-Channel and 1st Adjacent Channel Interference Performance [1, Table D.7]

3.3.2 Bandwidth: Transmit Power Spectral Density (PSD)

The transmitted signal must lie within the radio frequency spectrum mask shown in Figure 3.5. The channel spacing is 28 MHz. This mask is specified relative to the carrier frequency fo. The transmitted spectrum is assumed to be symmetrical and so only single sided limits are specified. The 0 dB point on the mask corresponds to the power spectral density (PSD) at the carrier frequency [1, Table D.4].

In this thesis the transmitter is modelled at baseband and so the simulated baseband transmit power spectrum is directly evaluated against the spectrum mask shown in Figure 3.5. It is assumed the up-conversion process does not alter the shape of the transmitted power spectrum.

3.3.3 Detection Efficiency: Bit Error Rate as a function of Receive Signal Level

It is widely known that in general, detection efficiency performance can be traded off against bandwidth efficiency performance. Section 3.3.2 constrains the bandwidth and here we constrain detection efficiency; at a receive signal level of -75 dBm, the bit error rate (BER) must be less than $10^{-6}$ [1, Table D.6].

For the purposes of simulation, -75 dBm receive signal level is equivalent to a signal to noise ratio (SNR) of 18.7 dB and energy per bit to noise ratio ($E_b/N_o$) of 15.7 dB. Appendix B details this calculation.
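The 3 dB gap between the SNR and $E_b/N_o$ figures follows from the number of bits per symbol. The Python sketch below is not the Appendix B calculation (which also converts the -75 dBm receive level to an SNR); it only illustrates the SNR to $E_b/N_o$ step, under the assumption that SNR is measured in a noise bandwidth equal to the symbol rate. The function names are hypothetical.

```python
import math


def ebno_to_snr_db(ebno_db: float, M: int = 4) -> float:
    """SNR per symbol in dB for an M-ary alphabet (assumes noise bandwidth = symbol rate)."""
    return ebno_db + 10 * math.log10(math.log2(M))


def snr_to_ebno_db(snr_db: float, M: int = 4) -> float:
    """Inverse of ebno_to_snr_db under the same assumption."""
    return snr_db - 10 * math.log10(math.log2(M))


snr_db = ebno_to_snr_db(15.7)      # quaternary alphabet: about 3 dB above Eb/No
ebno_db = snr_to_ebno_db(18.7)
```

With M=4 each symbol carries 2 bits, so the offset is 10·log10(2), approximately 3 dB.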

3.3.4 Interference Rejection

The radio must achieve a minimum level of detection efficiency in the presence of co-channel and adjacent channel interference. Table 3.3 specifies the strength of the interferer and the amount by which SNR may be degraded while still achieving a $10^{-6}$ bit error rate.

In practice, the candidate CPM configurations are evaluated against a more stringent specification to provide engineering margin. The adjacent channel interferer is made 1 dB stronger than specified in Table 3.3 (the carrier to interference ratio is reduced by 1 dB), and the bit error rate is measured without forward error correction (FEC) against a relaxed target of $10^{-5}$. This leaves approximately two orders of magnitude of BER margin before exceeding the error correcting capabilities of the Reed Solomon FEC. The code used has a threshold at about $10^{-3}$; a BER of $10^{-3}$ at the FEC input results in approximately $10^{-6}$ at the output.


3.4 Simulated CPM Performance: Selecting a CPM Configuration For Implementation

The simulation system model is presented first, followed by simulation results evaluating the candidate CPM configurations against the ETSI requirements. The h=2/7, L=3 configuration does not meet the spectral mask and so is rejected. The h=1/4, L=3 configuration meets all ETSI requirements and so meets the 50% increase in spectral efficiency target; the h=1/5, L=3 configuration has a smaller minimum distance so is rejected. The h=1/5, L=2 configuration does not meet the adjacent channel interference rejection requirement.

3.4.1 Simulation System Model

The simulation system model is shown in Figure 3.4. The 6 key parts of this model are:

• Transmitter - The transmitter comprises a linear feedback shift register data source of polynomial $x^{15} + x^{14} + 1$, a Reed Solomon forward error correction encoder of codeword length (n) 255 and message length (k) 239, and a CPM modulator. The symbol rate is 27 MSymbols/s with 8 samples per symbol.

• Receiver - The receiver comprises a low pass filter, a CPM demodulator with traceback depth of 32 symbols and a Reed Solomon decoder. The low pass filter provides close-in adjacent channel rejection.

• Channel - The channel is modelled with additive white Gaussian noise only. There is no channel delay. Phase and symbol synchronisation are ideal.

• Interference Generator - A second transmitter model generates co-channel and adjacent channel interference. Gain, phase and frequency are adjustable.

• Bit Error Rate (BER) Checker - Transmitted bits and symbols are compared with the received bits and symbols, both before and after FEC decoding, to determine the symbol and bit error rate.

• Transmit Power Spectral Density Measurement - Transmitter power spectral density is measured using a periodogram with an FFT length of 2048 samples and a Hanning window, averaged over 256 spectra.
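The data source named in the Transmitter item can be sketched as a 15-bit linear feedback shift register. The Python below is an illustration rather than the Simulink block used in the model: the tap positions (register bits 15 and 14), the all-ones seed and the helper names are assumptions, and the mapping between a polynomial and register taps depends on the convention chosen.

```python
def lfsr_step(state: int):
    """Advance a 15-bit LFSR one step; return (new_state, output_bit)."""
    bit = ((state >> 14) ^ (state >> 13)) & 1      # XOR of register bits 15 and 14
    return ((state << 1) | bit) & 0x7FFF, bit


def prbs15(n_bits: int, seed: int = 0x7FFF):
    """Return the first n_bits of the sequence (seed must be non-zero)."""
    if seed & 0x7FFF == 0:
        raise ValueError("an all-zero state locks up the LFSR")
    state, bits = seed & 0x7FFF, []
    for _ in range(n_bits):
        state, bit = lfsr_step(state)
        bits.append(bit)
    return bits


def lfsr_period(seed: int = 0x7FFF) -> int:
    """Count steps until the register returns to its seed."""
    state, count = lfsr_step(seed)[0], 1
    while state != seed:
        state, count = lfsr_step(state)[0], count + 1
    return count


period = lfsr_period()
ones_in_period = sum(prbs15(period))
```

A maximal-length 15-bit register repeats every 2^15 - 1 bits, which is long compared with a Reed Solomon codeword of 255 bytes, so successive codewords see different data.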

3.4.2 Bandwidth: Transmit Power Spectral Density (PSD)

The simulated baseband transmit power spectrum of the 4 candidate CPM configurations is shown in Figure 3.5 together with the relevant ETSI spectral mask. Figure 3.6 is a zoomed in view of the same data and shows that the h=2/7, L=3 CPM configuration is the only one that fails to meet the spectral mask.


Figure 3.4: Simulation System Model

Figure 3.5: Simulated Transmit Power Spectral Density of Candidate CPM Configurations (Various (h, L), M=4, 27 Msymbols/s)


Figure 3.6: Zoomed in version of Figure 3.5

3.4.3 Detection Efficiency and Interference Rejection Performance: h=1/4, L=3

3.4.3.1 Detection Efficiency: Bit Error Rate as a function of SNR

With an AWGN channel only, simulated error rate performance is shown in Figure 3.7. The symbol and bit error rate data was collected by accumulating a minimum of 100 symbol errors, or for the FEC BER graph a minimum of 100 bit errors after FEC. The receiver low-pass filter is removed for this simulation.

A BER of $10^{-6}$ is achieved with an $E_b/N_o$ of 14 dB. Furthermore, when the Reed Solomon FEC is included, $E_b/N_o$ is approximately 10.6 dB at a BER of $10^{-6}$. This is 5.1 dB better than the ETSI requirement of 15.7 dB.

The existing CPM modem product achieves a $10^{-6}$ BER at an SNR of 14 dB or 11 dB $E_b/N_o$ [42] assuming ideal timing synchronisation and no degradation due to fixed precision arithmetic.³ These results show that the new CPM configuration has the potential to be 0.4 dB better in terms of detection efficiency. However, a low-pass adjacent channel rejection filter is required to meet the ACI rejection requirements. This filter also degrades clear channel performance as described in the next section.

³ SNR is 3 dB higher than $E_b/N_o$ for a quaternary alphabet.


Figure 3.7: Simulated Bit Error Probability with and without Reed Solomon FEC, AWGN Channel, No ACI Reject Filter, h=1/4, L=3RC, 27 Msymbols/s

Figure 3.8: Simulated Bit Error Rate with Adjacent Channel Interference, h=1/4, L=3, M=4, 27 Msymbols/s

3.4.3.2 1st Adjacent Channel Interference

Figure 3.8 shows adjacent channel performance with carrier to interference (C/I) ratios of -5 dB and -1 dB. This adjacent channel interference is 1 dB stronger than specified in the ETSI specification. For the -5 dB C/I test, the bit error rate is very close to meeting the target of $10^{-5}$ at an $E_b/N_o$ of 18.7 dB. This is well below the $10^{-3}$ Reed Solomon FEC threshold and so with FEC present, the BER falls well below the ETSI specification limit of $10^{-6}$.

The less stressful -1 dB C/I test shows a bit error rate of less than $10^{-6}$ at 16.7 dB $E_b/N_o$, clearly meeting the ETSI requirement.

The radio’s intermediate frequency (IF) amplifier stages contribute significantly to the receiver’s selectivity. Nevertheless, the final IF stage in this radio is 36 MHz wide, which passes a significant amount of 1st adjacent channel signal power. The front end of the digital portion of this receiver contains decimation stages and other low pass filters which provide adjacent channel rejection. These are modelled with a single low pass filter of 96 taps.

The low pass filter cutoff frequency has a significant impact on receiver BER performance. If the cutoff is set too low then the in-band signal is distorted too much and BER performance is degraded. On the other hand, if the filter cutoff is set too high, then too much adjacent channel power passes into the demodulator and BER performance is degraded. A cutoff frequency of 14 MHz is chosen to provide a compromise between clear channel performance and adjacent channel rejection.


Figure 3.9: Simulated Bit Error Rate Demonstrating Effect of ACI Reject Filter and Adjacent Channel Interference, h=1/4, L=3, M=4, 27 Msymbols/s

Figure 3.9 shows the effect of the low pass filter on clear channel performance. At a BER of $10^{-6}$, $E_b/N_o$ is 14.4 dB, a degradation of 0.4 dB due to the low pass filter. Without the filter present the new CPM configuration was 0.4 dB better than the existing product. This means that with the filter present the new CPM configuration has the same level of detection efficiency as the existing CPM product. This is 4.7 dB better than the minimum required in the ETSI standard.

3.4.3.3 Co-channel Interference

Co-channel interference is simulated at carrier to interference (C/I) ratios of 19 dB and 23 dB and the results are shown in Figure 3.10. In both cases the bit error rate is less than $10^{-5}$ at an $E_b/N_o$ of 16 dB. The requirement is for a bit error rate of less than $10^{-5}$ at $E_b/N_o$ of 18.7 dB and 16.7 dB respectively (Table 3.3). This is a clear pass to the ETSI requirements.

3.4.4 Detection Efficiency and Interference Rejection Performance: h=1/5, L=2

The h=1/5, L=2 CPM configuration is interesting because compared to h=1/4, L=3 it has a matched filter bank 1/4 the size and a Viterbi trellis with 1/3 of the states. The simulated detection efficiency performance of this CPM configuration is shown in Figure 3.11.

Figure 3.10: Simulated Bit Error Rate with Co-Channel Interference, h=1/4, L=3, M=4, 27 Msymbols/s

Figure 3.11: Simulated Bit Error Rate with and without Adjacent Channel Interference, h=1/5, L=2, M=4, 27 Msymbols/s

At a BER of $10^{-6}$, $E_b/N_o$ is approximately 14.8 dB with the low pass ACI reject filter present. This is 0.4 dB worse than the h=1/4, L=3 CPM configuration. Moreover, the ACI rejection performance is considerably worse. At an $E_b/N_o$ of 18.7 dB the BER is $5 \times 10^{-5}$. This does not meet the stated requirement of a BER less than $10^{-5}$ at 18.7 dB $E_b/N_o$.

The ACI performance could be improved by reducing the low-pass ACI reject filter cutoff frequency. However, this also increases the amount of in-band signal removed at the band edge. The clear channel performance is already reduced by 0.6 dB due to the low-pass ACI reject filter, so further lowering the cutoff frequency will degrade the in-band performance even further.

3.5 Conclusion

The 50% spectral efficiency improvement requirement has been met by a move to a CPM configuration with a longer duration phase pulse (L=3) and coherent reception. This new CPM configuration (h=1/4, L=3, M=4) is simulated and shown to exceed the ETSI clear channel detection efficiency requirement by 4.7 dB including Reed Solomon forward error correction. This level of detection efficiency is the same as the existing CPM microwave radio product, and because the spectral efficiency is 50% higher, this is a significant improvement.


However, the longer phase pulse and move to coherent reception result in a more computationally expensive receiver. For this new CPM configuration to have practical significance its implementation cost must be low enough. The following chapters in this thesis now focus on a low cost implementation. The next chapter demonstrates the transition to a fixed point simulation model that is used to optimize word-lengths in the branch metrics and path metrics processing units of a coherent CPM receiver.


Chapter 4

Toward an FPGA Implementation: Fixed Point Modelling

4.1 Introduction

The previous chapter demonstrated that a 50% increase in spectral efficiency without loss of detection efficiency is obtained by moving to coherent reception and increasing the phase pulse length to 3 symbol periods. This new CPM configuration of h=1/4, L=3, M=4, requires a CPM Viterbi decoder of 128 matched filters and 128 trellis states, all computed at a 27 MSymbols/s rate.

Before moving to an FPGA implementation, several approximations and optimizations to the floating point maximum-likelihood receiver must be made to make the implementation realizable and cost effective. These are:

• Reduced sampling rate

• Fixed point numeric representation

• Reduced survivor path memory size

These approximations are relevant to the receiver functional blocks as shown in Figure 4.1. This chapter demonstrates the effect of these approximations on receiver detection efficiency. The choice of sampling rate, fixed point word-lengths and survivor path memory depth are optimised for implementation size (product cost) and performance.

4.2 Implementation Target: ASSP, ASIC, FPGA, DSP

An FPGA is the targeted implementation technology because it is the most cost effective implementation technology currently available. An existing CPM microwave radio [42] uses an FPGA modem. Although a standard cell based application specific integrated circuit (ASIC) offers higher performance, lower power, and consumes less silicon area [43], the non recurring engineering cost of an ASIC cannot be justified by the expected product volumes. Besides which, the design flexibility and time-to-market advantage of an FPGA cannot be ignored.

Figure 4.1: Approximations to the maximum-likelihood receiver

                                              DSP                              FPGA
Part                                          Texas Instruments TMS320C6421    Xilinx Spartan 3 XC3SD1800A
Technology Process                            90nm                             90nm
Peak 16 bit x 16 bit multiplies per second    1.6 GMULT/s                      21 GMULT/s [44]
Cost (US$)                                    $8.95 @10k                       $29.85 @25k
$ per GMULT/s                                 $5.60                            $1.40

Table 4.1: Comparison of DSP and FPGA Multiplication Capability

A digital signal processor based platform simply does not meet the high computation requirements of this CPM receiver. For example, this CPM receiver’s branch metric filter bank alone requires 27.6 giga-multiplies/s. In this thesis the targeted low-cost FPGA is a Xilinx Spartan 3A-DSP 1800. Table 4.1 provides a comparison between this FPGA and a Texas Instruments DSP. This DSP was picked for comparison purposes because it is the lowest cost per multiply TI DSP at the same technology process node. In this comparison, the FPGA peak multiplication rate is more than 10 times higher than that of the DSP and on a cost per multiply basis the FPGA is 4 times cheaper.
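The arithmetic behind these comparisons can be checked with a few lines of Python. The cost figures come from Table 4.1. The breakdown of the 27.6 giga-multiplies/s figure is an assumption: 128 matched filters, 2 samples per symbol and 4 real multiplies per complex multiply at 27 MSymbols/s is one accounting that reproduces the number, but the thesis may count the operations differently.

```python
def branch_metric_multiplies_per_second(filters=128, samples_per_symbol=2,
                                        real_mults_per_complex=4,
                                        symbol_rate=27e6):
    """Real multiplies per second for the filter bank (assumed accounting)."""
    return filters * samples_per_symbol * real_mults_per_complex * symbol_rate


def dollars_per_gmult(cost_usd, peak_gmult_per_s):
    """Cost per giga-multiply/s, as in the last row of Table 4.1."""
    return cost_usd / peak_gmult_per_s


filter_bank_gmult = branch_metric_multiplies_per_second() / 1e9
dsp_cost = dollars_per_gmult(8.95, 1.6)     # TMS320C6421
fpga_cost = dollars_per_gmult(29.85, 21.0)  # XC3SD1800A
throughput_ratio = 21.0 / 1.6
cost_ratio = dsp_cost / fpga_cost
```

Under this accounting the filter bank alone needs roughly 17 times the peak multiply rate of the DSP, which is why a single DSP is not a candidate.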

Provigent provide application specific standard product (ASSP) solutions for the cellular backhaul microwave radio market. However, these chipsets do not support CPM and so are ruled out for the purposes of this thesis [45].


Figure 4.2: Effect of Sampling Rate on Detection Efficiency

4.3 Sampling Rate

The floating point model in Chapter 3 approximated continuous time by representing each symbol with 8 samples. At a symbol rate of 27 MSymbols/s this is a sample rate of 216 MSamples/s. By the Nyquist sampling theorem, this sample rate allows frequencies up to 108 MHz to be represented without aliasing. This is an unnecessarily high sample rate for an implementation because the received signal does not contain significant energy at these high frequencies; it is band-limited by intermediate frequency (IF) filtering stages within the receiver. For example, the existing CPM receiver IF 3 dB bandwidth is 36 MHz.

For the purposes of this thesis, we assume the analog I/Q signal is sampled appropriately to meet the anti-aliasing and dynamic range requirements for the receiver. This is followed by a multi-rate filter to reduce (or increase) the sample rate to whatever is required for this coherent CPM receiver processing. The multi-rate filter, analog anti-aliasing low pass filtering and analog to digital converter sample rate choice are beyond the scope of this thesis.

Since there is a direct correlation between sampling rate and computational complexity, finding the minimum sampling rate that does not significantly degrade receiver detection efficiency is important. Figure 4.2 shows bit error rate vs $E_b/N_o$ simulation results for sampling rates from 2 to 8 samples per symbol. The Viterbi path memory size is large (32 symbols deep) and the receiver model uses 64-bit floating point arithmetic.

The results show minimal, if any, performance degradation due to the reduced sample rate. Therefore, the implementation will use the minimum sample rate required for timing recovery, which is 2 samples per symbol (54 MSamples/s). This result is not surprising when considering the transmitted power spectral density and receiver ACI reject filter cutoff frequency. At 27 MHz (54/2), the transmitted power spectral density is more than 40 dB down; besides which, the low pass ACI reject filter cutoff is 14 MHz. That is, very little useful received signal energy is present at 27 MHz, so higher sampling rates do not improve performance.
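The sample rates and Nyquist frequencies quoted in this section follow directly from the symbol rate. A small Python sketch, with hypothetical helper names, makes the relationship explicit for the oversampling factors simulated in Figure 4.2:

```python
SYMBOL_RATE_HZ = 27e6  # 27 MSymbols/s, from the system model


def sample_rate_hz(samples_per_symbol, symbol_rate_hz=SYMBOL_RATE_HZ):
    """Sample rate for a given oversampling factor."""
    return samples_per_symbol * symbol_rate_hz


def nyquist_hz(fs_hz):
    """Highest frequency representable without aliasing at sample rate fs_hz."""
    return fs_hz / 2


rates = {n: sample_rate_hz(n) for n in (2, 4, 8)}
nyquist = {n: nyquist_hz(fs) for n, fs in rates.items()}
```

At 2 samples per symbol the Nyquist frequency is 27 MHz, comfortably above the 14 MHz ACI reject filter cutoff, while the number of samples to process per symbol is a quarter of that in the Chapter 3 model.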

Furthermore, the FPGA branch metrics implementation described in Chapter 5 uses a distributed arithmetic approach which becomes particularly resource efficient for 4-input LUT (look-up table) FPGAs when the sample rate is twice the symbol rate.

4.4 Fixed Point Modelling

Fixed point modelling of the receiver is a significant step in determining the FPGA implementation size and cost. The models of Chapter 3 used large word-lengths (64 bit) and floating point arithmetic so that receiver performance is not degraded by arithmetic overflow and quantisation noise. By moving to fixed point arithmetic and using word-lengths much less than 64 bits, the FPGA resources used, and hence the cost of the implementation, are minimised. This section presents fixed point modelling results that give insight into the relationship between word-length (hardware cost) and receiver detection efficiency.

Although the literature contains fixed point modelling results for Viterbi decoding in error correction applications, there are no known fixed point simulation results for CPM Viterbi detection. This thesis performs fixed point simulations using the same system model presented in Chapter 3 but replacing key floating point data types with Matlab’s fixed point number data type (fi()). Also, specific parts of the receiver model were rewritten to match the FPGA VHDL implementation presented in Chapters 5 and 6.

Developing a fixed point model is also important because it creates a baseline against which the FPGA implementation can be tested. The VHDL verification strategy described in Appendix A uses results from the fixed point model stored in VHDL arrays to verify the VHDL implementation.

The number of samples per symbol is 2 and the Viterbi path history depth is 16. At least 100 bit errors are accumulated at each $E_b/N_o$ simulation data point. The low-pass adjacent channel interference reject filter is not present. Adjacent channel interference rejection performance of the fixed point model has not been investigated. Forward error correction is not included.

There are 4 word-lengths that must be determined:

1. In-phase and quadrature received signal word-length.

2. Branch metric filter bank coefficient word-length.

3. Output branch metrics word-length.

4. Path metric word-length.

Simulation is used to investigate the first three. Path metric word-length ischosen analytically.


4.4.1 Floating Point vs Fixed Point Numeric Representation

Fixed point addition and multiplication generally require fewer FPGA resources than the equivalent operations in floating point. For example, floating point addition requires an extra operation to normalise the smallest operand’s mantissa before adding. Also, for the same word-length, fixed point numbers provide higher precision since bits are not consumed by the exponent representation. However, the dynamic range of fixed point numbers is severely restricted when compared to a floating point representation.

The limited dynamic range of fixed point numbers is particularly relevant in a communications receiver, since received signal levels vary over a wide range. It is assumed that automatic gain control (AGC) circuitry keeps the sampled received signal within the dynamic range of the ADC. Once inside the FPGA, front end low pass filtering would be followed by further AGC. AGC is outside the scope of this thesis but it has been assumed that the received signal is scaled such that there is 1 bit of headroom for the in-phase and quadrature inputs to the branch metric unit.

4.4.2 Quadrature and In-phase Received Signal Word-length

For this experiment, the received signal is quantised in the range of word-lengths between 5 and 8 bits. In contrast, the simulation results from the previous chapter represented the received signal using 64-bit floating point numbers. Figure 4.3 presents the results.

At word-lengths of 5 and 6 bits the receiver detection efficiency is degraded by approximately 0.4 dB and 0.2 dB respectively at a BER of $10^{-4}$. There is a small degradation for a word-length of 7 bits and for 8 bits the results closely match the floating point model.

The received signal word-length is a critical factor in determining the amount of FPGA resources used to implement the filter bank. Chapter 5 proposes a bit-serial arithmetic technique which is very low cost if the received signal word-length is 7 bits or less. For larger word-lengths the amount of FPGA resources must double in order to keep the FPGA operating frequency low enough such that static timing is met. Hence the received signal word-length is chosen to be 7 bits.

The filter bank coefficient word-length is 16 bits and the matched filter arithmetic is carried out such that there is no rounding nor loss of precision. The resulting branch metrics are kept at full precision and converted to 64 bit floating point for the remainder of the model. Path metric computations use double precision (64 bit) floating point arithmetic.

4.4.3 Branch Metric Filter Bank Coefficient Word-length

The filter bank coefficients are also quantised to a word-length from 5 to 8 bits. The coefficients are quantised such that they extend across the complete dynamic range of the fixed point representation.
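The quantisation step can be sketched in Python as a rounding, saturating conversion to a signed two's complement code. This is an illustration only and is not claimed to match the behaviour of Matlab's fi() as configured in the thesis model; the function name, the rounding mode and the `full_scale` parameter are assumptions. Scaling by the largest coefficient magnitude mirrors the statement that the coefficients span the full dynamic range, and doubling `full_scale` corresponds to leaving 1 bit of headroom on the received signal.

```python
def quantise(x, word_length, full_scale):
    """Map x to a signed integer code of `word_length` bits, saturating at the limits."""
    max_code = 2 ** (word_length - 1) - 1
    min_code = -2 ** (word_length - 1)
    code = round(x / full_scale * max_code)   # Python rounds halves to even
    return max(min_code, min(max_code, code))


# Hypothetical coefficient set; the largest magnitude maps to the largest code.
coefficients = [0.12, -0.45, 0.9, -0.3]
scale = max(abs(c) for c in coefficients)
coeff_codes = [quantise(c, 7, scale) for c in coefficients]

# Received sample with 1 bit of headroom: full scale is twice the nominal peak.
sample_code = quantise(0.9, 7, full_scale=2.0)
```

With a 7-bit word the codes lie between -64 and 63, so the coefficient of largest magnitude lands on 63 and a nominal-peak received sample uses roughly half of the available range.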

Figure 4.4 presents the results. At a word-length of 5 bits the detection efficiency degrades by 0.5 dB at a BER of $10^{-4}$ and at 6 bits there is a slight degradation. At word-lengths of 7 and 8 bits there is very little degradation at a BER of $10^{-5}$. 7 bits is chosen as the minimum word-length that does not degrade performance.

Figure 4.3: Effect of Quantised Received Signal Word-length on Receiver Detection Efficiency

Figure 4.4: Effect of Quantised Branch Metric Filter Bank Coefficient Word-length on Receiver Detection Efficiency

The in-phase and quadrature received signals are quantised to a word-length of 16 bits and the branch metric arithmetic is performed at full precision. The resulting branch metrics are kept at full precision and converted to 64 bit floating point for the remainder of the model.

4.4.4 Branch Metric Word-length

This experiment investigates the effect of branch metric word-length and demonstrates the overall receiver detection efficiency degradation due to the transition to a fixed point model. The received signal word-length and matched filter bank coefficient word-lengths are set to the minimum that did not cause receiver performance degradation, i.e. 7 bits. The branch metric outputs are calculated in full precision, the top 2 bits are truncated and the result is then rounded down to a word-length of 7, 8, 9 or 10 bits.

The path metric word-length is always 2 bits more than the branch metric word-length; this is justified by the use of Hekstra’s method as discussed in section 4.4.5.

Figure 4.5 presents the results in graphical form and Table 4.2 shows the receiver detection efficiency degradation for each branch metric word-length.

Branch metric word-lengths of 7 and 8 bits cause significant degradation. At 9 and 10 bits the degradation is approximately only 0.2 dB at a BER of $2 \times 10^{-6}$. Hence, the implementation will use branch metrics rounded to 9 bits and consequently a path metric word-length of 11 bits.

Branch Metric Word-length    SNR Degradation at BER of $10^{-4}$ (dB)
7                            1.3
8                            0.4
9                            0.2
10                           0.2

Table 4.2: Detection Efficiency Degradation due to Branch Metric Quantisation

Figure 4.5: Effect of Branch Metric Word-length on Receiver Detection Efficiency

4.4.5 State Metric Normalisation

Viterbi state metrics grow unbounded with time and since an implementation must use a numeric representation with finite range, a strategy for coping with this unbounded growth is required. Several techniques have been developed for scaling or normalising the metrics so that overflow is avoided [24]. Most of these techniques cost FPGA resources and add latency inside the critical Viterbi iteration loop.

Hekstra proposes an alternative technique which relies on the overflow behaviour of 2’s complement arithmetic and the fact that it is the difference between path metrics rather than their absolute value that is important for survivor path selection. Since 2’s complement arithmetic is the norm for implementation, Hekstra’s method avoids the need for any extra hardware resources [25].
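The idea of comparing wrapped metrics can be illustrated with a short Python sketch. It is not taken from [25] or from the VHDL of Chapter 6; the helper names and the 11-bit example values are assumptions. Metrics are allowed to overflow, and two metrics are compared through the sign of their wrapped difference, which stays correct as long as the true difference fits within the word-length.

```python
def wrap(value: int, bits: int) -> int:
    """Reduce value modulo 2**bits and interpret it as signed two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def add_metric(path_metric: int, branch_metric: int, bits: int) -> int:
    """Accumulate a branch metric, letting the result overflow silently."""
    return wrap(path_metric + branch_metric, bits)


def first_is_smaller(pm_a: int, pm_b: int, bits: int) -> bool:
    """Compare two wrapped metrics via the sign of their wrapped difference."""
    return wrap(pm_a - pm_b, bits) < 0


BITS = 11                               # representable range is -1024 .. 1023
pm_a = add_metric(1000, 20, BITS)       # true value 1020, no overflow
pm_b = add_metric(1000, 30, BITS)       # true value 1030, wraps to -1018
naive = pm_a < pm_b                     # misled by the overflow
modulo = first_is_smaller(pm_a, pm_b, BITS)
```

In this example the plain comparison picks the wrong survivor after the overflow, whereas the wrapped-difference comparison still identifies the smaller true metric, with no rescaling step in the loop.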

Although Hekstra’s method has been applied to Viterbi decoders for use in forward error correction applications [2], it has not yet been applied to CPM Viterbi detectors. Hekstra’s method requires the difference between state metrics to be bounded. It can be proven the metrics are bounded by considering the 2 dimensional state transition matrix T. Source and destination states are represented by each dimension; a 1 entry indicates the state transition is possible, and a 0 entry indicates the transition is not possible. If T can be raised to a power n, such that all entries are strictly positive, then the metrics are bounded [25].

For the h=1/4, L=3, M=4 state transition matrix, n could not be found. Indeed, simulation shows the state metric range growing without bound over time. However, a direct consequence of the modulation index having an even valued p=8 (h=2k/p) is that every state transition occurs from a state with even valued phase state to odd valued phase state and vice versa. Odd to odd or even to even state transitions are not possible in the state transition matrix T. This means the trellis comprises two sub-trellises that do not share any connections and hence the difference in metrics between sub-trellises is unbounded.

If T is defined for the odd or even sub-trellis only, then n is found to be 3, and metrics within a sub-trellis are bounded. (See Appendix E.4.1 for the Matlab source code used to find n). Once a starting phase state is defined, all further transitions are within a single sub-trellis only and so Hekstra’s method is applicable. The implication for the phase recovery algorithm is that phase slips between the two sub-trellises must not be allowed.
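The search for n can be sketched in Python. This is not the Matlab code of Appendix E.4.1, and the state model is an assumption: a state is a phase state plus the L-1 most recent symbols, and at each step the oldest symbol is absorbed into the phase state. The full trellis uses 8 phase states with symbols -3, -1, +1, +3; the sub-trellis is modelled here in a tilted-phase form with 4 phase states and symbols 0 to 3, which is one plausible way to define T for a single sub-trellis. Rows of T are stored as bitmasks so that powers of T reduce to reachability updates.

```python
from itertools import product


def build_trellis(phase_states, increments, M=4, L=3):
    """Return one bitmask per state; bit t of adj[s] is set if s -> t is allowed."""
    memories = list(product(range(M), repeat=L - 1))  # (newest, ..., oldest)
    states = [(theta, mem) for theta in range(phase_states) for mem in memories]
    index = {state: i for i, state in enumerate(states)}
    adj = [0] * len(states)
    for (theta, mem), i in index.items():
        # The oldest symbol leaves the pulse memory and moves the phase state.
        theta_next = (theta + increments[mem[-1]]) % phase_states
        for new_symbol in range(M):
            adj[i] |= 1 << index[(theta_next, (new_symbol,) + mem[:-1])]
    return adj


def advance(mask, adj):
    """States reachable in one step from any state in `mask`."""
    out, i = 0, 0
    while mask:
        if mask & 1:
            out |= adj[i]
        mask >>= 1
        i += 1
    return out


def smallest_all_positive_power(adj, max_n=32):
    """Smallest n with every entry of T**n positive, or None if none up to max_n."""
    full = (1 << len(adj)) - 1
    reach = list(adj)  # row s of T**1, stored as a bitmask
    for n in range(1, max_n + 1):
        if all(row == full for row in reach):
            return n
        reach = [advance(row, adj) for row in reach]
    return None


full_trellis = build_trellis(8, [-3, -1, 1, 3])   # 128 states, odd phase steps
sub_trellis = build_trellis(4, [0, 1, 2, 3])      # 64 states, assumed tilted phase
n_full = smallest_all_positive_power(full_trellis)
n_sub = smallest_all_positive_power(sub_trellis)
```

Under these assumptions the search finds no n for the 128-state trellis, because the phase state parity alternates at every step, and finds n = 3 for the 64-state sub-trellis, in line with the result quoted above.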

Since only one sub-trellis is relevant, it is suggested that the other sub-trellis Viterbi calculations are redundant and the overall Viterbi processing requirements can be halved. This has not been investigated in the FPGA implementation of Chapter 6.

To avoid overflow, Hekstra claims the path metric word-length $B_{pm}$ must satisfy equation (4.1). $B$ is an upper bound on the absolute values of branch metrics and is given by equation (4.2). Conservatively $m = 2n$ [25]. $B_{bm}$ is the branch metric word-length.

$$B_{pm} \geq \log_2\big((m + 2)B + 1\big) + 1 \qquad (4.1)$$

$$B = 2^{B_{bm} - 1} \qquad (4.2)$$

For a branch metric word-length of 9 bits and n=3, this equation suggests the path metric word-length must be at least 11 bits. The simulation of section 4.4.4 uses 11 bit path metric word-lengths. That simulation was repeated with a path metric word-length of 10 bits with the same result. However, moving to 9 bits causes a very high error rate.

4.5 Survivor Path History Depth

The Viterbi algorithm provides maximum-likelihood estimates of the transmitted symbol when the Viterbi path history has infinite depth. In practice, near optimal performance is achievable with practical path memory sizes. For convolutional decoder applications, a well known rule of thumb is to size the path history to be 5 times the convolutional code constraint length.

Figure 4.6: Effect of Survivor Path History Depth on Detection Efficiency

For CPM Viterbi receivers, Anderson et al [6] suggest that the path history depth should be at least as large as the observation interval required to achieve the upper bound on the CPM code’s minimum squared Euclidean distance ($d^2_{\min}$). [6, Figure 3.31, pg 103] shows that for a quaternary, 3RC, h < 0.5 CPM configuration, the observation interval must be at least 12 symbols long to achieve the performance predicted by $d^2_{\min}$. This implies a minimum path history depth of 12 symbols for our h=1/4, L=3RC, M=4 CPM receiver.

Simulation shows that a path history depth of only 8 symbols does not de-grade detection efficiency. Figure 4.6 shows simulated receiver bit error rate vsEb

Nofor a variety of Viterbi path history depths. The low-pass ACI reject filter

is not present, all arithmetic is 64-bit floating point and there are 8 samples persymbol. A depth of only 2 symbols degrades the performance significantly anda depth of 4 symbols only slightly.

Although the Viterbi path history implementation is beyond the scope of this thesis, a traceback architecture using the memory management technique described in [46] is likely to be cost effective. For quaternary CPM symbols, this technique stores 2 bits per state per Viterbi iteration. An h=1/4, L=3, M=4 CPM configuration has a 128 state trellis, and for a path history depth of 8 iterations the path history memory required is 2048 bits. This easily fits within a single 18 kbit block RAM within the FPGA. The FPGA targeted in this thesis has 84 such block RAMs so the memory cost is low.

Viterbi path memory bandwidth requirements are high when tracing back once every symbol. The bandwidth requirements can be reduced significantly by tracing back only once every several symbols using the technique described by Chang [2].

4.6 Conclusion

Sampling rate, fixed point word-length and survivor path history depth have a critical effect on the CPM receiver implementation cost and detection efficiency performance. By using Matlab fixed point data types and rewriting specific portions of the CPM receiver floating point model used in the previous chapter, simulations have been performed to find an acceptable tradeoff between cost and performance for these parameters.

Simulation results show that, compared to the floating point model, a fixed point model with 2 samples per symbol and a 16 symbol deep path history degrades detection efficiency by only 0.2 dB at a BER of 2 x 10^-6. To achieve this small degradation, the in-phase and quadrature received signal inputs must be quantised to at least 7 bits, the branch metric filter bank coefficients to at least 7 bits, and the branch metrics rounded down to at least 9 bits. The path metric word-length is 11 bits.

Path metrics grow without bound over time. This is commonly overcome by periodically rescaling or normalising the metrics. A clever technique published by Hekstra avoids rescaling entirely by exploiting the fact that path metric differences are bounded and that it is the difference between metrics, not their absolute value, that is important in a Viterbi decoder. This technique is applied to a CPM receiver; as far as the author knows, this has not been published before. Hekstra’s calculations show that a path metric word-length of 11 bits is sufficient to avoid incorrect survivor path selections and this has been confirmed in simulation.


Chapter 5

CPM Branch Metric Filter Bank FPGA Implementation

5.1 Introduction

The previous chapter introduced fixed point arithmetic, finite Viterbi path memory size and a practical sampling rate. It was shown that these approximations cause a detection efficiency degradation of only 0.2 dB. This still meets the radio performance requirements and provides for a significantly lower complexity and lower cost receiver than an implementation based on a high sample rate and large word-length floating point arithmetic.

The application considered in this thesis has a stringent cost requirement, and to prove this requirement can be met an implementation is required. The two most computationally expensive parts of the receiver are the Viterbi branch metric filter bank and Viterbi path metrics processing unit. This section of the thesis considers the branch metrics filter bank implementation.

A key contribution of this thesis is the proposed application of a well known distributed arithmetic technique to the implementation of a coherent CPM receiver branch metrics filter bank. The implementation is generalised to support any reasonable h, L, M, quantised received signal word-length and filter coefficient word-length.

The technique is proven with a VHDL implementation for the CPM configuration found in Chapter 3 and further refined in Chapter 4. That is h=1/4, L=3, M=4, raised cosine phase pulse. Synthesis results are presented which show FPGA resource usage meets the cost requirements specified for this application. The comparison in section 5.5 suggests this technique is roughly 6 times cheaper than a conventional technique using the FPGA’s dedicated multipliers.

Data throughput is verified by performing static timing analysis on the placed and routed design. Timing closure is achieved for a 215.6 MHz clock allowing a symbol rate of up to 30.8 MSymbols/s, meeting the application requirement of 27 MSymbols/s. Correct functional performance is demonstrated with a VHDL simulation producing results that match the Matlab fixed point model precisely (bit for bit).

Distributed arithmetic uses the FPGA logic fabric very efficiently by performing the filter calculations in a bit-serial fashion. For the application considered in this thesis, it is most FPGA hardware cost efficient for symbol rates around 30 MSymbols/s. For 2 samples per symbol this technique applies most efficiently to the 4 input LUT architecture which is used in the current generation of low cost FPGAs from the dominant FPGA market players Xilinx and Altera.

A significant advantage of this technique is that it allows symbol rate to be traded off against implementation size (and hence cost), although that has not been demonstrated here. The proposed implementation supports any h, L, M, input or coefficient word-length.

The main disadvantage of this bit-serial processing is that it introduces a symbol period of delay. It is likely this branch metric processing is inside the carrier phase recovery loop and so this added latency will degrade phase recovery performance. This is the performance cost incurred in order to meet the demanding cost requirements.

Another challenge with this technique is in meeting timing due to the high clock rates caused by fully bit-serial processing. This was overcome by designing with low-level FPGA resource primitives directly and algorithmically floorplanning these elements; this “manual” approach beats the automatic tools’ algorithms by tens of MHz for this design.

5.2 Computation Complexity

The CPM receiver literature typically expresses branch metric complexity as the number of matched filters. The phase rotation computations are ignored [28] [32] [34]. In this thesis we focus on implementation and so a more appropriate measure of complexity is real valued multiplications (and adds) per decoded symbol. By taking account of the symbol rate an even more useful metric is multiplications per second and additions per second. These metrics do not account for filter coefficient word-length nor received signal sample word-length. We account for that in a later section by expressing complexity directly in terms of FPGA resources.

The maximum-likelihood receiver calculates branch metrics according to Equation (5.1), previously presented in section 2.4.

$$Z_n(\alpha_n, \theta_n) = \Re\left\{ e^{-j\theta_n} \int_{nT}^{(n+1)T} r(t)\, e^{-j\theta(t,\alpha_n)}\, dt \right\} \qquad (5.1)$$

To express this in terms of real operations only, substitute r(t) = rI(t) + j rQ(t) into Equation (5.1), which gives Equation (5.2) [6].

$$
\begin{aligned}
Z_n(\alpha_n, \theta_n) = {} & \cos(\theta_n)\left[\int_{nT}^{(n+1)T} r_I(t)\cos(\theta(t,\alpha_n))\,dt + \int_{nT}^{(n+1)T} r_Q(t)\sin(\theta(t,\alpha_n))\,dt\right] \\
& + \sin(\theta_n)\left[\int_{nT}^{(n+1)T} r_Q(t)\cos(\theta(t,\alpha_n))\,dt - \int_{nT}^{(n+1)T} r_I(t)\sin(\theta(t,\alpha_n))\,dt\right]
\end{aligned}
\qquad (5.2)
$$

Each integral represents a matched filter. cos(θ(t, αn)) and sin(θ(t, αn)) are the matched filter coefficients where αn is an L length symbol sequence. There are M^L such possible symbol sequences so Equation (5.2) represents 4M^L matched filters. However, every αn sequence has a corresponding negative sequence and using the trigonometric identities in Equation (5.3), the number of matched filters is halved. The maximum-likelihood branch metric filter bank is 2M^L matched filters [6].

| h | p | L | M | N | R (1/T) | Branch metric multiplies/s | Branch metric adds/s |
|---|---|---|---|---|---------|----------------------------|----------------------|
| 1/4 | 8 | 3 | 4 | 2 | 27 MSymbols/s | 6.9 GMults/s | 3.5 GAdds/s |

Table 5.1: CPM Branch Metric Filter Bank Complexity

$$\cos(A) = \cos(-A), \qquad \sin(A) = -\sin(-A) \qquad (5.3)$$

cos(θ(t, αn)) and sin(θ(t, αn)) are precomputed filter coefficients. Each integral (filter) requires N real multiplications and N-1 additions, where N is the number of samples per symbol. For a symbol rate of R = 1/T symbols/s where T is the symbol period, Equation (5.4) describes the matched filter multiplication complexity and Equation (5.5) is the matched filter add complexity. Table 5.1 shows this complexity for the CPM configuration selected in Chapter 3.

$$\text{MatchedFilters}_{\text{multiplies/s}} = R \cdot 2M^L N \qquad (5.4)$$

$$\text{MatchedFilters}_{\text{adds/s}} = R \cdot 2M^L (N - 1) \qquad (5.5)$$
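As a quick check of Equations (5.4) and (5.5), the following minimal sketch evaluates them for the configuration in Table 5.1. The function name `matched_filter_ops` is a hypothetical helper introduced here for illustration only.

```python
def matched_filter_ops(symbol_rate, M, L, N):
    """Return (multiplies/s, adds/s) for a bank of 2*M**L matched filters."""
    filters = 2 * M ** L                       # 128 filters for M=4, L=3
    multiplies = symbol_rate * filters * N     # Equation (5.4)
    adds = symbol_rate * filters * (N - 1)     # Equation (5.5)
    return multiplies, adds


mults, adds = matched_filter_ops(symbol_rate=27e6, M=4, L=3, N=2)
print(f"{mults / 1e9:.1f} GMults/s, {adds / 1e9:.1f} GAdds/s")
```

With R = 27 MSymbols/s, M = 4, L = 3 and N = 2 this reproduces the 6.9 GMults/s and 3.5 GAdds/s entries of Table 5.1.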

To achieve these high computational rates, a traditional solution might use the embedded multipliers now present in the fabric of modern low-cost FPGAs. The XC3S1800A-DSP Spartan 3A-DSP FPGA is the largest Xilinx FPGA that meets the cost requirements for this application. This part has 84 multipliers in addition to the general purpose LUT and flip-flop fabric. These multipliers can be clocked at a maximum of 250 MHz which provides for a processing capability of 21 giga-multiplies/s [44].

At a first analysis it appears that the embedded multipliers have the multiplication capacity to support the matched filter processing. However, Equation (5.4) and Table 5.1 ignore the phase rotation multiplications. Ultimately there are pM^L/2 (256) unique branch metrics to be calculated yet there are only 84 multipliers available. With 128 matched filters, the processing might be arranged such that each embedded multiplier performs the processing for 2 matched filters. This requires the two results per multiplier to be stored, adding further complexity. A solution using the majority (76%) of the available multipliers is undesirable since other parts of the CPM receiver, for example the front end adjacent channel filter and the adaptive equaliser, also require the use of the multipliers.

In some respects the Xilinx datasheet claim of 21 giga-multiplies/s overstates the device’s capability. Firstly, every clock cycle must be used to generate a result. This requires extra logic to sequence operands and store results. Secondly, the multipliers must be clocked at their maximum rate of 250 MHz and the design must meet timing. Getting data into and out of the multipliers from the general logic fabric at this rate is a design challenge. Furthermore, it is unlikely other parts of the FPGA design would operate at this frequency and so extra logic and complexity is required to transfer the data between this high speed 250 MHz clock domain and the rest of the design.

The phase state rotation complexity can be reduced and almost eliminated for some specific values of h. For example, with h=1/4 there are 8 phase states, yet the rotation factors cos(θn) and sin(θn) take on values of 0, ±1 or ±0.707 only. Multiplication by 0 and ±1 is trivial so the multiplication by 0.707 is the only one required per matched filter output. This thesis proposes an implementation that ignores this optimisation and so it is generalised for any h, L, and M.
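The small set of rotation factors can be confirmed by enumerating the phase states directly. This is an illustrative sketch only, assuming the phase states are the p = 8 equally spaced angles 2πi/p; the variable names are not from the thesis.

```python
import math

p = 8  # number of phase states for h = 1/4

# Collect the magnitudes of cos and sin over all phase states 2*pi*i/p.
magnitudes = sorted({
    round(abs(f(2 * math.pi * i / p)), 3)
    for i in range(p)
    for f in (math.cos, math.sin)
})
print(magnitudes)
```

Only three distinct magnitudes appear, so a single non-trivial constant multiplication (by 0.707) would be needed if this optimisation were used.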

5.3 Applying Distributed Arithmetic to a CPM Filter Bank

The microwave radio application considered in this thesis requires a symbol rate of 27 MSymbols/s. This is almost an order of magnitude less than clock rates achievable in current generation low cost FPGAs. For example, the Spartan 3A-DSP FPGA multipliers support operation at up to 250 MHz [44] and as shown in section 5.4.3, clock rates above 200 MHz are achievable in the FPGA logic fabric. The implementation proposed by this thesis exploits this by carrying out the processing in a bit-serial fashion using a well known technique called distributed arithmetic.

Distributed arithmetic (DA) filters are a logic resource efficient way of implementing a finite impulse response (FIR) filter [47] [48]. The signal to be filtered is passed to the filter in a bit-serial fashion, least significant bit (LSB) first. This requires a system clock at a rate several times that of the incoming sample rate. In this way, a higher system clock is traded off against lower hardware costs. In our application we have a symbol rate of 27 MSymbols/s and Spartan 3A-DSP clock rates of more than 200 MHz are achievable, making DA filters attractive for this application.

A distributed arithmetic filter implements the standard FIR filter equation precisely. Equation (5.6) describes an M tap FIR filter with filter coefficients h(k) and input signal x(m) where m is the discrete time variable [47].

$$y(m) = \sum_{k=0}^{M-1} x(m-k)\, h(k) \qquad (5.6)$$

The key to applying this to a CPM branch metric filter bank is to bring the phase rotations inside the integral and combine the in-phase and quadrature matched filters into a single filter. This is shown starting with Equation (5.7) and ending in Equation (5.8) where the branch metrics computations are expressed in a sum of products form that is equivalent to Equation (5.6).


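$$
\begin{aligned}
Z_n(\alpha_n, \theta_n) &= \Re\left\{ e^{-j\theta_n} \int_{nT}^{(n+1)T} r(t)\, e^{-j\theta(t,\alpha_n)}\, dt \right\} \\
&= \Re\left\{ \int_{nT}^{(n+1)T} r(t)\, e^{-j(\theta(t,\alpha_n)+\theta_n)}\, dt \right\} \\
&= \int_{nT}^{(n+1)T} r_I(t)\cos[\theta(t,\alpha_n)+\theta_n] + r_Q(t)\sin[\theta(t,\alpha_n)+\theta_n]\, dt
\end{aligned}
\qquad (5.7)
$$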

We now move to discrete time notation where m is the discrete time sample index, rI(m) and rQ(m) are discrete time versions of rI(t) and rQ(t) sampled evenly throughout a symbol period. N is the number of samples per symbol. Perfect timing recovery is assumed which means that the I and Q samples always occur at the same place relative to symbol boundaries.

hc(m) and hs(m) represent discrete time versions of the matched filter coefficients cos[θ(t, αn) + θn] and sin[θ(t, αn) + θn].

$$Z_n(\alpha_n, \theta_n) = \sum_{m=0}^{N-1} r_I(m)\, h_c(m) + \sum_{m=0}^{N-1} r_Q(m)\, h_s(m) \qquad (5.8)$$

By letting x(m) = {rI(0), rI(1), . . . , rI(N−1), rQ(0), rQ(1), . . . , rQ(N−1)}, h(m) = {hc(0), hc(1), . . . , hc(N−1), hs(0), hs(1), . . . , hs(N−1)} and M = 2N, we see that Equation (5.8) is equivalent to Equation (5.6) and therefore a single distributed arithmetic filter calculates a single branch metric.
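The equivalence between the rotated complex correlation and a single real 2N-tap sum of products can be checked numerically. The sketch below is illustrative only: the received samples and the sampled phase trajectory `theta` are random placeholder values rather than a real CPM waveform, and all variable names are hypothetical.

```python
import cmath
import math
import random

random.seed(1)
N = 2                                   # samples per symbol
theta_n = 3 * math.pi / 4               # one of the 8 phase states for h = 1/4

# Placeholder data: complex received samples and sampled phase theta(t, alpha_n).
r = [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(N)]
theta = [random.uniform(0, 2 * math.pi) for _ in range(N)]

# Complex form: rotate the matched filter output by the phase state, take the real part.
z_complex = (cmath.exp(-1j * theta_n)
             * sum(r[m] * cmath.exp(-1j * theta[m]) for m in range(N))).real

# Real form: concatenate I and Q samples and the cos/sin coefficients into one 2N-tap filter.
x = [s.real for s in r] + [s.imag for s in r]
h = ([math.cos(theta[m] + theta_n) for m in range(N)]
     + [math.sin(theta[m] + theta_n) for m in range(N)])
z_real = sum(xi * hi for xi, hi in zip(x, h))

print(z_complex, z_real)
```

Both forms give the same branch metric to within floating point rounding, and the real form has exactly 2N = 4 taps for N = 2.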

A distributed arithmetic filter requires the filter coefficients to be constant [47].¹ θ(t, α) is given by Equation (5.9); for a fixed h and a specified fixed phase pulse q(t), the coefficients are a function of the symbol sequence α only, and α is constant for each matched filter. A practical CPM microwave radio link has a constant CPM configuration (h, L, M, phase pulse shape).

$$\theta(t, \boldsymbol{\alpha}) = 2\pi h \sum_{i=n-L+1}^{n} \alpha_i\, q(t - iT) \qquad (5.9)$$

5.3.1 Phase State Symmetry

A CPM receiver requires pM^L branch metrics to be calculated and fed into the Viterbi trellis decoder. Since each branch metric is implemented with a separate DA filter the receiver has a branch metric filter bank containing pM^L filters operating in parallel. For h=1/4, M=4, L=3 this is 512 filters. The number of filters can be halved by exploiting symmetry in the phase state.

As previously defined in Chapter 2, Equation (5.10) defines the phase state and the number of phase states p is related to h as in Equation (5.11). When p is even then (θn + π) mod 2π is also a phase state. And because cos(A + π) = −cos(A) and sin(A + π) = −sin(A), Equation (5.7) only needs to be calculated for θn < π. The remaining branch metrics for θn ≥ π are simply the negative of the branch metrics calculated for θn < π.

$$\theta_n = \frac{2\pi i}{p}, \qquad i \in \{0, 1, 2, \ldots, p-1\} \qquad (5.10)$$

$$h = \frac{2k}{p}, \qquad k, p \in \text{integers} \qquad (5.11)$$
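The symmetry argument can be illustrated by evaluating the sum of products form for every phase state and comparing the two halves. This is a sketch with arbitrary placeholder samples; `phase_states` and `branch_metric` are hypothetical helper names, not functions from the thesis source code.

```python
import math


def phase_states(p):
    """Phase states 2*pi*i/p for i = 0..p-1."""
    return [2 * math.pi * i / p for i in range(p)]


def branch_metric(r_i, r_q, theta, theta_n):
    """Sum of products form of the branch metric for one phase state."""
    return sum(ri * math.cos(t + theta_n) + rq * math.sin(t + theta_n)
               for ri, rq, t in zip(r_i, r_q, theta))


p = 8
r_i, r_q = [0.3, -0.8], [0.5, 0.1]      # arbitrary I and Q samples (N = 2)
theta = [0.4, 1.1]                      # arbitrary sampled phase trajectory

metrics = [branch_metric(r_i, r_q, theta, s) for s in phase_states(p)]
lower, upper = metrics[:p // 2], metrics[p // 2:]   # theta_n < pi and theta_n >= pi
for lo, up in zip(lower, upper):
    print(f"{lo:+.4f} {up:+.4f}")
```

Each metric in the upper half is the negative of its partner in the lower half, so only p/2 rotations per symbol sequence need a dedicated filter.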

This negation operation can be rolled into the Viterbi add-compare-select unit without any additional LUT or flip-flop cost. However, there is still an FPGA routing cost because each branch metric filter output must now route to 2 add-compare-select units. The Viterbi trellis add-compare-select implementation is discussed in detail in Chapter 6.

¹ This is not quite true. By replacing the distributed arithmetic LUT with a Xilinx SRL shift register primitive, new filter coefficients can be loaded serially. This can be used to build adaptive distributed arithmetic filters. This is not relevant to the CPM receiver considered in this thesis.


Figure 5.1: 4-Tap Distributed Arithmetic FIR Filter Block Diagram

5.4 4-Tap Distributed Arithmetic FIR Filter Implementation

Chapter 4 showed that a sample rate of 2 samples per symbol (N=2) does not degrade receiver detection efficiency. This means a 4-tap DA filter is required to calculate each branch metric. The Xilinx Spartan 3A-DSP FPGA provides 4 input LUTs [44] which means the 4-tap DA filter maps very efficiently to this FPGA’s logic fabric [47].²

The basic structure of a 4-tap DA filter is shown in Figure 5.1. There are two parts to it. Firstly, a DA lookup table (DALUT) which takes in 1 bit from each of 4 input samples every clock cycle. This lookup table stores all 2^4 combinations of partial products given by the 4 filter coefficients. The 4 input sample bits are concatenated to form the lookup table address. These partial products are pre-computed and loaded into the lookup table during FPGA configuration. Secondly, as each new bit from the input samples is clocked through, the DALUT partial products are added with a scaling accumulator. After the input sample word-length (B_IQ) number of clock cycles, a new filter output is valid.
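The DALUT and scaling accumulator can be modelled in a few lines of software. The following is a behavioural sketch only, not the VHDL of Appendix D: `build_dalut` and `da_filter` are hypothetical names, samples are assumed to be signed two's complement integers, and the model weights each partial product by 2^j instead of right shifting the accumulator, which is arithmetically equivalent when no bits are discarded.

```python
def build_dalut(coeffs):
    """All 2**len(coeffs) partial sums; address bit k selects coeffs[k]."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]


def da_filter(samples, coeffs, bits):
    """Bit-serial sum of products, one bit of every sample per 'clock cycle'."""
    lut = build_dalut(coeffs)
    acc = 0
    for j in range(bits):                        # LSB first
        addr = 0
        for k, s in enumerate(samples):
            addr |= ((s >> j) & 1) << k          # bit j of sample k
        partial = lut[addr] << j                 # weight of bit position j
        # The MSB of a two's complement word has negative weight, so subtract.
        acc = acc - partial if j == bits - 1 else acc + partial
    return acc


samples = [-64, 63, -1, 17]      # 7 bit signed inputs (illustrative values)
coeffs = [23, -41, 60, -7]       # 7 bit signed coefficients (illustrative values)
result = da_filter(samples, coeffs, bits=7)
print(result, sum(s * c for s, c in zip(samples, coeffs)))
```

The bit-serial result matches the direct sum of products, using one table lookup and one add (or subtract) per input bit and no multipliers.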

5.4.1 Throughput Requirements

The maximum throughput (symbol rate) is determined by the maximum frequency of operation of the design. Define this frequency as fmax. This parameter is determined by running the placed and routed design through Xilinx’s static timing analysis tool called Trace. Trace uses a detailed timing model of the FPGA to find the slowest static path between registers in the design and hence the maximum operating frequency. Since every FPGA manufactured by Xilinx is factory tested against this timing model, the design is guaranteed to function in the real world across all corners of process, voltage and temperature (PVT) [49].

For a fully bit-serial DA filter the symbol rate throughput R is given by Equation (5.12). Table 5.2 shows that for our application throughput requirement of 27 MSymbols/s and a filter input sample word-length of 7 bits, the FPGA design must achieve an fmax of 189 MHz. This is a significant challenge when targeting a low-cost FPGA such as a Xilinx Spartan 3A-DSP FPGA.

$$R = \frac{f_{max}}{B_{IQ}} \qquad (5.12)$$

| R (symbol rate) | fmax | B_IQ |
|-----------------|------|------|
| 27 MSymbols/s | 189 MHz | 7 |

Table 5.2: Distributed Arithmetic Filter Fmax Requirement

² 6 input LUTs are used in Xilinx’s premium FPGA brand called Virtex-5. These FPGAs offer higher performance and density but at significantly higher cost. Spartan-6 and Virtex-6 are the next generation of Xilinx FPGAs and they use a 6 input LUT. They are not currently released for production.
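Equation (5.12) can be applied in either direction, as this small sketch shows. The function names are hypothetical and used for illustration only.

```python
def required_fmax_hz(symbol_rate, b_iq):
    """Clock frequency needed for a fully bit-serial filter (Equation 5.12 rearranged)."""
    return symbol_rate * b_iq


def achievable_symbol_rate(fmax_hz, b_iq):
    """Symbol rate supported by a given clock frequency (Equation 5.12)."""
    return fmax_hz / b_iq


fmax_needed = required_fmax_hz(symbol_rate=27e6, b_iq=7)
print(f"required fmax: {fmax_needed / 1e6:.0f} MHz")
print(f"rate at 200 MHz: {achievable_symbol_rate(200e6, 7) / 1e6:.1f} MSymbols/s")
```

The first call reproduces the 189 MHz requirement of Table 5.2; the second, using a round 200 MHz purely as an example, indicates how much symbol rate each extra MHz of clock buys.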

5.4.2 Efficient Mapping of a DA Filter into FPGA Hardware

Define B_C as the filter coefficient word-length in bits and B_IQ as the in-phase and quadrature sample word-lengths.

5.4.2.1 DALUT

The DALUT has 4 address bits so has 16 locations. Each location stores the partial product from accumulating 4 coefficients. Each time two B bit numbers are added, the resulting sum must be B+1 bits to guarantee overflow is avoided. When accumulating 4 coefficients then the resulting sum word-length must be B_C + 2 to avoid overflow. Therefore, the DALUT size is 16 x (B_C + 2).

In Xilinx Spartan 3A-DSP FPGAs there are 3 types of memory [44]:

• Block RAM (BRAM) - Large 18 kbit blocks of dual ported RAM which can be used in configurations from 16k x 1³ through to 512 x 36.

• Flip Flops - Each FPGA logic cell contains a single flip-flop which can store 1 bit of data.

• LUT RAM - Each FPGA logic cell contains a 4 input LUT. This LUT can be configured as a 16x1 ROM and half the available LUTs can be used as 16x1 RAMs (aka “distributed RAM”).

For our application, the LUT RAM is by far the most efficient way of implementing the DALUT. The LUT RAM stores 16 times more bits per logic cell than a flip-flop. The block RAM capacity of 18 kbits is many times larger than required, and with only 84 BRAMs available on the FPGA, there are simply not enough BRAMs to implement 256 filters. The most logic efficient solution is to use the LUT RAM.

Each LUT is used as a 16x1 ROM so a single filter DALUT requires B_C + 2 LUTs. It is also worth noting that the actual implementation also registers the DALUT output to improve the maximum frequency of operation. This register essentially comes for “free” since every logic cell contains 1 LUT and 1 flip-flop. However the register does add one extra clock cycle of latency.

³ Block RAM parity bits are not available in the 16k x 1 configuration hence only 16k addressable locations are available.


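| B_IQ | B_C | Logic cells per filter | h | L | M | # Filters | Total logic cells | Total logic cells (% of available)⁵ |
|------|-----|------------------------|---|---|---|-----------|-------------------|--------------------------------------|
| 7 | 7 | 25 | 1/4 | 3 | 4 | 256 | 6400 | 19.2% |

Table 5.3: Estimated Branch Metric Filter Bank FPGA Resource Use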

5.4.2.2 Scaling Accumulator

The scaling accumulator adds the current partial product to a right shifted version of the previous clock cycle’s accumulation. The partial product is always aligned to the most significant bit of the scaling accumulator result. In order to avoid overflow, the scaling accumulator word-length is the DALUT word-length plus 1 bit of growth every clock cycle, i.e. B_C + 2 + B_IQ since there are B_IQ clock cycles required per input sample.

The scaling accumulator comprises an adder followed by a register. The right shift does not consume any logic resources; a 1 bit right shift is wired into the accumulator feedback path. It is well known that an adder consumes 1 logic cell per bit, and the register uses the flip-flop in the same logic cell. This means the scaling accumulator consumes a maximum of B_C + 2 + B_IQ logic cells.

5.4.2.3 FPGA Resource Use Summary

The total resources used by a single 4-tap CPM DA filter are given in Equation (5.13). The complete CPM receiver branch metric filter bank requires many of these filters operating in parallel. Table 5.3 shows the calculated resource usage for the specific CPM configuration chosen in Chapter 3 and further refined in Chapter 4. For a single 4-tap DA filter with 7 bit coefficients and 7 bit samples, the FPGA implementation requires 25 logic cells per filter, is guaranteed to avoid overflow and incurs no loss of precision.

For h=1/4, L=3, M=4 the complete filter bank requires 256 filters which is estimated to consume 6400 logic cells. This is 19% of the available FPGA logic cell resources which meets the low-cost requirements for this application. Confirmation of throughput, static timing analysis and actual resource use is demonstrated in section 5.4.3.⁴

$$\text{DAFIR}_{LCs} = (B_C + 2) + (B_C + 2 + B_{IQ}) \qquad (5.13)$$
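The estimate in Table 5.3 follows directly from Equation (5.13). The sketch below repeats the arithmetic; `da_fir_logic_cells` is a hypothetical helper name and the figure of 33280 available logic cells is the actual logic cell count quoted for this device in the footnotes.

```python
def da_fir_logic_cells(b_c, b_iq):
    """Logic cells for one 4-tap DA filter, Equation (5.13)."""
    dalut = b_c + 2                  # one 16x1 ROM per output bit
    accumulator = b_c + 2 + b_iq     # one logic cell per accumulator bit
    return dalut + accumulator


AVAILABLE_LCS = 33280                # actual logic cells in the target FPGA
p, M, L = 8, 4, 3

per_filter = da_fir_logic_cells(b_c=7, b_iq=7)
filters = p * M ** L // 2            # phase state symmetry halves the bank
total = per_filter * filters
print(per_filter, filters, total, f"{100 * total / AVAILABLE_LCS:.1f}%")
```

For 7 bit coefficients and samples this gives 25 logic cells per filter, 256 filters and 6400 logic cells in total, about 19.2% of the device.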

5.4.2.4 Additional Resource Use Required to Meet Timing

Given the high fmax requirement of 189 MHz (see section 5.4.1), signal routing delays must be kept to a minimum. Signals with high fanout have higher routing delays because the average distance the signal travels is further; the signal passes through more switch boxes within the FPGA’s routing fabric; and the capacitive loading on the high fanout net is higher. The filter bank signal input x(m) routes to all 256 filters, and each filter DALUT is 9 LCs so the fanout is 256 x 9 = 2304. This is a very high fanout and experiments show that fmax can fall well below 100 MHz, reducing the design throughput well below the required 27 MSymbols/s. The solution is straightforward. The register storing x(m) is duplicated many times which reduces the fanout by the same ratio but increases the design cost. It was found that a good balance between cost and performance is achieved when the x(m) register is duplicated 128 times. Pairs of filters are driven by their own dedicated x(m) register thus reducing the fanout to 18 LCs. The increase in LC usage is 4 x (256/2) = 512 LCs which is an 8% increase in the filter bank size.

| flip-flops | LUTs | slices | BRAMs | DSP48s | best case achievable period | fmax |
|------------|------|--------|-------|--------|-----------------------------|------|
| 7460 | 4896 | 3863 | 0 | 0 | 4.639 ns | 215.6 MHz |

Table 5.4: Branch Metric Filter Bank Implementation Results

⁴ Xilinx datasheets specify “equivalent logic cells” which Xilinx calculate by applying an arbitrary factor to the actual number of logic cells in the device. One assumes this is done for marketing purposes. This thesis always uses actual logic cells. For the XC3S1800A-DSP the Xilinx datasheet reports 37440 “equivalent logic cells” and reports 4160 actual complex logic blocks (CLBs). Each CLB contains 8 logic cells (LCs) giving the part a total of 33280 actual logic cells, or in terms of slices this is 16640 slices.

The filter bank contains two more high fanout signals; both are scaling accumulator controls.⁶ The first control signal indicates the input signal LSB which zeros the scaling accumulator. The second control signal indicates the input signal MSB which configures the scaling accumulator to perform a subtraction. These two signals were register duplicated on a per filter basis, so the increase in logic cell usage is 2 x 256 = 512 LCs.

The total increase in design size is thus 1024 LCs bringing the total design estimate to 7424 LCs.
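The fanout and register duplication figures in this section can be reproduced with the arithmetic below. This is an illustrative sketch; the constant `X_REGISTER_BITS = 4` assumes each duplicated x(m) register holds one bit from each of the 4 input samples, which is inferred from the 4 x (256/2) term rather than stated explicitly.

```python
FILTERS = 256
DALUT_LCS = 9            # B_C + 2 with 7 bit coefficients
X_REGISTER_BITS = 4      # assumed: one bit from each of the 4 input samples
BASE_ESTIMATE_LCS = 6400


def fanout(filters_per_register):
    """Number of DALUT logic cells driven by one x(m) register."""
    return filters_per_register * DALUT_LCS


x_registers = FILTERS // 2                       # one register per pair of filters
extra_x_lcs = X_REGISTER_BITS * x_registers      # duplicated x(m) registers
extra_control_lcs = 2 * FILTERS                  # LSB and MSB controls, per filter
total_lcs = BASE_ESTIMATE_LCS + extra_x_lcs + extra_control_lcs

print(fanout(FILTERS), fanout(2))
print(extra_x_lcs, extra_control_lcs, total_lcs)
print(f"{100 * extra_x_lcs / BASE_ESTIMATE_LCS:.0f}% increase from x(m) duplication")
```

The fanout drops from 2304 to 18, at a cost of 512 logic cells for the x(m) registers and a further 512 for the two control signals, giving the 7424 logic cell estimate.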

5.4.3 Implementation Results

The DA filter bank is written in VHDL and implemented using the Xilinx ISE tool suite. Appendices D.1 and D.4 contain the VHDL source and Appendix G lists the implementation tool versions.

A period timing constraint of 216 MHz was applied to the place and route (PAR) tools. The default synthesis and PAR options were modified slightly: “keep hierarchy” is set to yes to allow relative location (RLOC) constraints to propagate correctly through the hierarchy, and “add I/O buffers” is set to false since this design does not connect to FPGA device input or output pins.⁷

The FPGA resource use and static timing analysis results are summarised in Table 5.4. Appendix F.1 contains more detailed output from the mapper and PAR tools.

⁶ Actually there is a third extremely high fanout signal - the clock. FPGAs have dedicated global clock routing networks that have very low skew. Timing analysis results show that clock skew can consume more than 300 ps from the timing budget. But there is relatively little that can be done about it. The filter bank could be split in half, thus halving the clock fanout. But this introduces a 2nd clock into the design which raises the spectre of crossing clock domains at some point further downstream in the data path.

⁷ The ISE tools mapper strips all logic not connected to input or output pins. This is avoided by manually instantiating IBUFs on the branch metric filter bank inputs and applying the mapper save attribute to the branch metric filter bank outputs.


Since every logic cell (LC) contains a LUT and flip-flop, the actual flip-flop resource usage of 7460 flip-flops closely follows the theoretical estimate of 7424 LCs. The small difference can be explained by two functionalities not included in the resource use estimate: a simple state machine and registers that serialise incoming parallel data (x(m)) for the bit-serial DA filter bank.

The slice usage result is more difficult to interpret. Each Spartan 3A-DSP slice contains two LCs and each LC contains a LUT and a flip-flop. If a single LUT or flip-flop is used then the tools still report slice usage of 1. This explains why 3863 slices x 2 = 7726 LCs is larger than the actual number of flip-flops or LUTs used in the design.

The FPGA resource usage of 3863 slices is 23% of the available FPGA slices thus comfortably meeting the application’s cost requirement. The remaining 77% of slices and 100% of block RAM and embedded multipliers are available for the Viterbi path metric calculations described in Chapter 6. This design meets timing at 215.6 MHz thus supporting a data throughput of 215.6 / 7 = 30.8 MSymbols/s. This is 14.1% higher throughput than the application’s requirement of 27 MSymbols/s, giving a considerable margin. In practice an extra timing allowance for clock jitter is required depending on the quality of the clock source. Also, the branch metric results are not yet routed and attached to the Viterbi trellis decoder. Although this routing will be kept short and local by placing each branch metric filter close to its associated Viterbi add-compare-select unit, the extra routing resources consumed will degrade timing.

5.4.4 Functional Verification

The VHDL implementation is verified by executing the design in a VHDL simulator. Test vectors are generated by using the Matlab fixed point model to write a VHDL package containing input and output from the Matlab model. The VHDL testbench applies I and Q samples to the filter bank and then checks the filter bank output branch metrics against the expected outputs also stored in the VHDL package. Any errors are counted and reported. This test strategy is justified and described in more detail in Appendix A.

The testbench VHDL source code is in Appendix D.1.5 and the test vectors package in Appendix D.1.6. 100 random symbols worth of I and Q samples were applied to the filter bank. The testbench reports 0 errors. This simulation demonstrates that the VHDL branch metrics precisely match the Matlab fixed point model branch metrics.

5.5 Comparison: Distributed Arithmetic vs Embedded FPGA Multipliers

For input sample word-lengths significantly less than 18 bits, a distributed arithmetic approach achieves a much higher throughput than is achievable with the FPGA’s embedded multipliers. The DA filter bank presented in this chapter performs the equivalent of 31.5 giga-multiplies/s while using slightly less than a quarter of the available FPGA slices. The absolute maximum throughput available using all the FPGA’s embedded multipliers is 21 giga-multiplies/s [44]. Using a quarter of the embedded multipliers (21) provides for only 21 x 250 MHz = 5.3 giga-multiplies/s, about 6 times less than the DA approach.

The DA filter multiplication complexity is calculated using the filter bank implemented in section 5.4.3. That is 256 filters running at a symbol rate of 30.8 MSymbols/s. Each filter performs 4 multiplications per symbol, i.e. 4 x 256 x 30.8 = 31.5 giga-multiplies/s.
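The comparison can be reproduced from the implementation results. The sketch below is illustrative; the function names are hypothetical, and the symbol rate is derived from the 215.6 MHz timing result and the 7 bit sample word-length rather than entered as a rounded figure.

```python
def da_equivalent_multiplies(filters, taps, symbol_rate):
    """Multiplications per second that the DA filter bank replaces."""
    return filters * taps * symbol_rate


def multiplier_capacity(multipliers, clock_hz):
    """Peak multiplications per second from embedded multipliers."""
    return multipliers * clock_hz


symbol_rate = 215.6e6 / 7                        # achieved fmax / B_IQ
da_rate = da_equivalent_multiplies(filters=256, taps=4, symbol_rate=symbol_rate)
all_multipliers = multiplier_capacity(84, 250e6)
quarter_multipliers = multiplier_capacity(21, 250e6)

print(f"DA filter bank     : {da_rate / 1e9:.2f} giga-multiplies/s")
print(f"all 84 multipliers : {all_multipliers / 1e9:.2f} giga-multiplies/s")
print(f"21 multipliers     : {quarter_multipliers / 1e9:.2f} giga-multiplies/s")
print(f"ratio              : {da_rate / quarter_multipliers:.1f}")
```

A quarter of the multipliers is used as the reference because the DA filter bank occupies slightly less than a quarter of the slices, so the ratio compares like-for-like fractions of the device.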

On the other hand, the embedded multipliers provide for significantly higher operand word-lengths: up to 18 bits compared with only 7 bits for the DA filter bank presented in this thesis. As the DA filter input sample and coefficient word-lengths increase, so does the FPGA logic resource use. A longer word-length also implies a longer carry chain for the scaling accumulator which may degrade fmax and throughput as well. Nevertheless, it was shown in Chapter 4 that input sample word-lengths greater than 7 bits provide minimal receiver detection efficiency improvements. And this is the elegance of distributed arithmetic designs; FPGA resource use is precisely matched to the application’s requirements.

5.6 Conclusion

This chapter showed that distributed arithmetic can be applied to an FPGA implementation of the branch metrics component of a CPM receiver. Although DA is a well known technique, its application to a CPM receiver has not previously been published. The key to making this technique work for CPM is to combine the branch metric matched filter and phase rotation into a single filter. Although this increases the total number of filters required, each filter is implemented bit-serially using distributed arithmetic. Each filter consumes very few FPGA logic resources; in this application each filter consumes 25 logic cells.

The complete filter bank of 256 filters occupies 23% of the low cost target FPGA. The design meets timing at 215.6 MHz, meeting the minimum requirement of 189 MHz and thus supporting a symbol rate in excess of 27 MSymbols/s. This provides a margin of 14%. The main drawback of this bit-serial processing is the added 1 symbol latency. Since the branch metric filter bank is likely to be inside a phase recovery loop, this extra processing latency degrades phase recovery performance. This is the price to be paid for a low-cost implementation meeting the strict cost requirements for this microwave radio application.


Chapter 6

CPM Path Metric FPGA Implementation

6.1 Introduction

The previous chapter presented an FPGA implementation for the branch metric unit which used 23% of the available logic resources and requires a 189 MHz system clock to meet the 27 MSymbol/s (54 Mbit/s) throughput requirement. As shown in Figure 6.1, the next stage of processing in this CPM receiver is the path metric processing unit.

Each symbol interval this unit calculates 4 path metrics inbound to each of the 128 trellis states and selects one surviving path per state to be the output state metric. Two decision bits represent the selected path and these are forwarded onto a survivor management unit which tracks the trellis path history for each state. The survivor management unit traces back the path with the largest metric to a finite depth and outputs the symbol at the tail of this path.

Figure 6.1: CPM Viterbi Detection

For such a large trellis, and moderate throughput requirement, a low-cost FPGA implementation is a challenge. The path metric computations alone require 13.8 giga-adds per second. Furthermore, this add-compare-select (ACS) processing is inside the Viterbi iteration loop so each added pipeline stage reduces the amount of time available to complete the required processing. With a system clock frequency of at least 189 MHz, deep pipelining is required to pass static timing analysis, leaving limited opportunities to time-share FPGA resources.
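The addition rate and the per-iteration cycle budget quoted above follow from the trellis size and symbol rate. A minimal sketch, with constant names chosen here for illustration:

```python
NUM_STATES = 128          # trellis states
PATHS_PER_STATE = 4       # inbound paths per state (M = 4)
SYMBOL_RATE_HZ = 27e6     # required throughput in symbols per second
SYSTEM_CLOCK_HZ = 189e6   # minimum system clock

# One addition (state metric + branch metric) per inbound path per symbol
adds_per_second = NUM_STATES * PATHS_PER_STATE * SYMBOL_RATE_HZ

# Clock cycles available to complete one Viterbi iteration
cycles_per_iteration = SYSTEM_CLOCK_HZ / SYMBOL_RATE_HZ

print(adds_per_second / 1e9)   # giga-adds per second
print(cycles_per_iteration)    # clock cycles per symbol
```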

Although the open literature contains many examples of Viterbi decoders targeting ASIC implementations for convolutional decoder applications [16] [2] [14], there are none that present FPGA implementations of CPM Viterbi detection to any significant level of detail.

The standard large trellis, high-throughput architecture is a state parallel solution in which every trellis state has dedicated hardware resources performing the ACS operations. For large trellises the state metric routing is complex [16]. The solution presented here is to partition the ACS processing into radix-4 units, each of which processes 4 states. The advantage is that the state metric read routing is short since it is local to a single radix-4 unit. The state metric write routing is still lengthy; this is mitigated by devoting 1 complete pipeline cycle to the state metric write routing.

The proposed VHDL implementation is deeply pipelined to meet the 189 MHz system clock requirement. The ACS unit uses 6 out of the 7 available clock cycles per Viterbi iteration to produce 1 survivor state metric. The unused cycle is exploited by processing two output state metrics per ACS unit, thus requiring only 2 ACS units per radix-4 unit. The FPGA resource cost is halved.

6.2 Viterbi Algorithm: Updating State Metrics

The background material presented in Chapter 2 explained that the Viterbi algorithm is an iterative approach to find the most likely transmitted symbol sequence. Equation (6.1) shows how the nth Viterbi iteration has state metrics J_n(α) which are calculated from the previous iteration state metrics J_{n−1} and branch metrics Z_n(α) [6]. This Viterbi iteration is repeated once per received symbol duration.

\[ J_n(\alpha) = J_{n-1}(\alpha) + Z_n(\alpha) \tag{6.1} \]

For an h=1/4, L=3, M=4 CPM configuration there are 128 states and hence 128 state metrics. Each state is uniquely identified by a combination of a phase state and correlative state as defined in Equation (6.2). The phase state is defined in Equation (6.3) where p = 8 for a modulation index of h = 1/4.

\[ \sigma_n = (\theta_n, \alpha_{n-1}, \alpha_{n-2}) \tag{6.2} \]

\[ \theta_n = \frac{2\pi i}{p}, \quad i \in \{0, 1, 2, \ldots, p-1\} \tag{6.3} \]
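The state count implied by Equations (6.2) and (6.3) can be confirmed by enumeration. In this sketch the symbol alphabet {-3, -1, 1, 3} is an assumption (the usual quaternary CPM alphabet); the count depends only on there being M = 4 symbols.

```python
import math
from itertools import product

P = 8                      # number of phase states for h = 1/4
SYMBOLS = (-3, -1, 1, 3)   # assumed quaternary alphabet (M = 4)


def enumerate_states():
    """Return every state (theta_n, alpha_{n-1}, alpha_{n-2})."""
    phases = [2 * math.pi * i / P for i in range(P)]
    return list(product(phases, SYMBOLS, SYMBOLS))


states = enumerate_states()
print(len(states))  # 128
```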

For quaternary symbols (M=4) each state has 4 possible inbound paths. The path metrics are calculated for all 4 paths and the path with the highest metric is selected to be the survivor path, also called the output state metric. In the next Viterbi iteration, this output state metric becomes an input state metric somewhere else in the Viterbi trellis. Figure 6.2 shows this per state processing in diagram form.
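The per-state processing just described can be sketched in a few lines. The function name and the example metric values are hypothetical; the 2-bit decision is represented here simply as the index of the selected path.

```python
def add_compare_select(source_metrics, branch_metrics):
    """Return (survivor_metric, decision) for one destination state.

    source_metrics: the 4 state metrics from the previous Viterbi iteration
    branch_metrics: the 4 branch metrics for the inbound paths
    decision: index (0..3) of the selected path, i.e. the two decision bits
    """
    if len(source_metrics) != 4 or len(branch_metrics) != 4:
        raise ValueError("expected 4 inbound paths")
    # Add: one path metric per inbound path
    path_metrics = [j + z for j, z in zip(source_metrics, branch_metrics)]
    # Compare and select: the largest path metric survives
    decision = max(range(4), key=lambda k: path_metrics[k])
    return path_metrics[decision], decision


print(add_compare_select([10, 4, 7, 1], [3, 12, -2, 9]))  # (16, 1)
```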


Figure 6.2: Add-Compare-Select Processing Required per State

6.3 Proposed Solution

6.3.1 State-Parallel Radix-4 Decomposition

The standard large trellis, high-throughput architecture is a state parallel solution in which every trellis state has dedicated hardware resources performing the ACS operations. For large trellises the state metric routing is complex [16]. The solution presented here is to partition the ACS processing into radix-4 units, each of which processes 4 states. The advantage is that the state metric read routing is short since it is local to a single radix-4 unit. The state metric write routing is still lengthy; this is mitigated by devoting 1 complete pipeline cycle to the state metric write routing.

Figure 6.3 shows how path metric processing is carried out by 32 radix-4 units. The radix-4 units have branch metrics and source state metrics as inputs. The radix-4 unit outputs a survivor state metric which becomes a source state metric on the next Viterbi iteration. Decision bits are output to indicate the selected path.

By appropriate partitioning of the states into radix-4 units comprising 4 states, the 4 source state metrics are shared by all ACS units within the radix-4 unit. By placing the state metric registers within the radix-4 unit, the state metric output (read) routing is kept completely local to the radix-4 unit, is short, and consequently fast. It is no longer a bottleneck to meeting timing.

A single radix-4 unit implements the fully-connected trellis shown in Figure 6.4 which is for h=1/4, L=3, M=4. A (θ_D, α_{n−1}) tuple uniquely identifies the source and destination state metrics for a radix-4 unit. There are 8 possible phase states for θ_D and 4 possible symbols for α_{n−1} giving 32 unique radix-4 units to carry out the processing for all 128 states in the trellis. A Matlab generated VHDL constant array contains the mapping of radix-4 output state metrics onto the state metrics bus, and from the state metric bus to the radix-4 unit state metric inputs.
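The partitioning into 32 fully-connected radix-4 units can be illustrated by grouping trellis transitions by the (θ_D, α_{n−1}) tuple. The sketch below rests on assumptions not stated in this chapter: the symbol alphabet is taken as {-3, -1, 1, 3}, and the phase state is assumed to advance by πh times the oldest symbol, so that the phase index moves by that symbol's value modulo 8. All names are hypothetical, and this is not the Matlab mapping generator referred to above.

```python
from collections import defaultdict
from itertools import product

P = 8                      # phase states for h = 1/4
SYMBOLS = (-3, -1, 1, 3)   # assumed quaternary alphabet


def next_state(state, symbol):
    """Assumed transition: the oldest symbol rotates the phase index."""
    phase_idx, a1, a2 = state              # (theta index, alpha_{n-1}, alpha_{n-2})
    return ((phase_idx + a2) % P, symbol, a1)


def radix4_units():
    """Group sources and destinations by (destination phase, alpha_{n-1})."""
    units = defaultdict(lambda: {"sources": set(), "destinations": set()})
    for state in product(range(P), SYMBOLS, SYMBOLS):
        for symbol in SYMBOLS:
            dest = next_state(state, symbol)
            key = (dest[0], dest[2])       # (theta_D index, alpha_{n-1})
            units[key]["sources"].add(state)
            units[key]["destinations"].add(dest)
    return dict(units)


units = radix4_units()
print(len(units))  # 32
```

Under these assumptions every unit has exactly 4 source states and 4 destination states, and each source connects to every destination in its own unit and to no other unit, which is what keeps the state metric read routing local.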

A standard implementation would use 4 add-compare-select units within the radix-4 unit. In this design we use two time shared ACS units to perform the equivalent processing of 4 ACS units. This is shown in Figure 6.6.

Figure 6.3: 128 State CPM Viterbi Detector Comprising 32 Radix-4 Add-Compare-Select Units

Figure 6.4: CPM Radix-4 Trellis (h=1/4, L=3, M=4)

Figure 6.5: Add-Compare-Select (ACS) Unit Detail and Pipeline

6.3.2 Add-Compare-Select Unit

The proposed ACS unit design is shown in Figure 6.5. The implementation has a deep pipeline that is 6 stages long. This keeps the logic within each stage only 1 LUT logic level deep, i.e. there is only ever a maximum of 1 LUT level between registers. This results in only 1 route per stage when the pipeline flip-flop is co-located with the LUT driving the flip-flop. This level of pipelining is required to meet timing for the 189 MHz system clock.¹

Each pipeline stage has a specific function:

1. State Metric Register - Store the previous Viterbi iteration's surviving state metrics, ready for input into the current iteration's path metric calculations. During this stage the state metrics route from an ACS state metric output to the register input. This allows one complete cycle for state metric routing.

2. Path Metric Accumulation - State metric is added to the branch metric.

3. 1st Comparison - Subtract 2 metrics in pairs in the first stage of comparison.

4. 1st Selection - 2:1 multiplexer to select the largest path metric for each pair.

5. 2nd Comparison - Subtract the 2 remaining metrics to find the largest.

6. 2nd Selection - Final 2:1 multiplexer to select the surviving path metric.

Figure 6.6: Radix-4 Unit Comprising Dual ACS Units

¹ Given the large number of branch metric connections that connect the branch metric unit and path metric unit (PMU), it is not practical to run the PMU at a clock frequency different to the BMU, because of the added cost of the clock domain crossing logic that would be required. Nevertheless, it may be possible to clock enable the ACS unit twice every Viterbi iteration and apply a 3 cycle multicycle timing constraint to relax the 189 MHz period constraint. This would allow 15.9 ns for the complete ACS processing and state metric routing. It seems unlikely this would meet timing, although this has not been investigated. Besides which, multicycle constraints are an added complexity and potentially disastrous if applied incorrectly since they relax timing on specific paths on the 189 MHz clock.

For a clock frequency of 189 MHz, there are 7 clock cycles available per Viterbi iteration. Since the pipeline is 6 deep, the spare cycle is used to send a second set of branch metrics through the pipeline and compute the path metrics and select the survivor path for a second output state. This allows a single ACS unit to perform the state metric processing for 2 states, thus halving the number of ACS units required; a radix-4 unit comprises 2 ACS units as shown in Figure 6.6.

Since there are now 2 branch metrics to pass through the pipeline, a 2:1 multiplexer is incorporated into the stage 2 adder. This is possible because the FPGA fabric uses a 4 input LUT; 3 inputs are taken by the state metric operand and 2 branch metric operands, the 4th input is the select input for the multiplexer.

6.3.2.1 Resource Use Estimate

FPGA resource use is estimated in terms of logic cells in Table 6.1. Each logic cell contains a flip-flop and a 4 input LUT for the Xilinx Spartan 3A-DSP family of FPGAs. The FPGA logic fabric has been designed such that adders and 2 to 1 multiplexers consume 1 logic cell per bit. The logic cell usage is expressed as a function of the path metric (state metric) word-length Bpm.

Pipeline Stage  Function               Logic Cell Estimate (LCs)  Notes
1               State Metric Register  4 * Bpm
2               Path Metric Add        4 * Bpm
3               1st Comparison         3 * Bpm                    Each path metric must be registered to match the pipeline delay of the subtraction
4               1st Selection          2 * Bpm + 2                2 decision bits pipelined
5               2nd Comparison         3 * Bpm + 2                2 decision bits pipelined
6               2nd Selection          1 * Bpm + 2                1xMux and 2 decision bits
All Stages                             17 * Bpm + 6

Table 6.1: Single Add-Compare-Select Unit FPGA Resource Use Estimate

A radix-4 unit comprises dual ACS units, but since there is a single set of state metric registers shared between both ACS units, the radix-4 unit estimate is double the ACS unit estimate minus a set of state metric registers. The radix-4 unit estimate is 2 * (17 * Bpm + 6) - 4 * Bpm = 30 * Bpm + 12 logic cells.

The h=1/4, L=3, M=4 CPM trellis decomposes into 32 radix-4 units, and with a path metric word-length of 11 bits, the total path metric processing FPGA resource use estimate is 32 * (30 * 11 + 12) = 10944 logic cells.
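The estimate can be expressed as a function of the path metric word-length, which makes it easy to see how the cost scales. The function names below are illustrative only.

```python
def acs_logic_cells(bpm):
    """Single ACS unit estimate, summing all six pipeline stages of Table 6.1."""
    return 17 * bpm + 6


def radix4_logic_cells(bpm):
    """Two ACS units sharing one set of state metric registers (4 * bpm)."""
    return 2 * acs_logic_cells(bpm) - 4 * bpm


def path_metric_unit_logic_cells(bpm, num_radix4_units=32):
    """Total estimate for the complete path metric processing unit."""
    return num_radix4_units * radix4_logic_cells(bpm)


print(path_metric_unit_logic_cells(11))  # 10944
```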

6.4 Implementation Results

The path metric processing unit is written in VHDL and implemented using the Xilinx ISE tool suite. Appendices D.2 and D.4 contain the VHDL source and Appendix G lists the implementation tool versions used.

A period timing constraint of 215 MHz was applied to the place and route tools. The default synthesis and PAR options were modified slightly: “keep hierarchy” is set to yes to allow RLOC (relative location) constraints to propagate correctly through the hierarchy, and “add I/O buffers” is set to false since this design does not connect to FPGA device input or output pins.²

The FPGA resource use and static timing analysis results are summarised in Table 6.2. Appendix F.2 contains more detailed output from the mapper and PAR tools.

The estimated resource use of 10944 logic cells comes close to matching the actual flip-flop usage results. The 7303 slice usage represents 43.9% of the FPGA's available 16640 slices. Static timing analysis shows the design meets the 189 MHz requirement, thus meeting the application's 54 Mbit/s throughput requirement.

² The ISE tools mapper strips all logic not connected to input or output pins. This is avoided by manually instantiating IBUFs on the path metric processing unit inputs and applying the mapper save attribute to the decision bit outputs.

flip-flops  LUTs  slices  BRAMs  DSP48s  best case achievable period  Fmax
11338       7174  7303    0      0       4.891 ns                     204.5 MHz

Table 6.2: Path Metric Processing Unit Implementation Results
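The derived figures quoted alongside Table 6.2 follow directly from the raw tool results. A short sketch, with variable names chosen here for illustration:

```python
SLICES_USED = 7303
SLICES_AVAILABLE = 16640
BEST_PERIOD_NS = 4.891
REQUIRED_CLOCK_MHZ = 189.0

utilisation_percent = 100.0 * SLICES_USED / SLICES_AVAILABLE
fmax_mhz = 1000.0 / BEST_PERIOD_NS               # period in ns to frequency in MHz
margin_percent = 100.0 * (fmax_mhz / REQUIRED_CLOCK_MHZ - 1.0)

print(round(utilisation_percent, 1))  # 43.9
print(round(fmax_mhz, 1))             # 204.5
print(round(margin_percent, 1))       # 8.2
```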

6.4.1 Functional Verification

The same verification strategy as used for the branch metrics implementation is used to verify the path metrics processing implementation. This test strategy is justified and described in more detail in Appendix A.

The path metrics implementation testbench VHDL source code is in Appendix D.2.5 and the test vectors package in Appendix D.2.6. One hundred random symbols worth of I and Q samples were applied to the filter bank to generate the branch metrics for input to the VHDL design. The Matlab fixed point model stores the resulting state metrics values in a VHDL constant array. The path metrics test bench checks these state metrics, bit for bit, with the ones generated by the VHDL model, once every Viterbi iteration. The testbench reports no errors. This simulation demonstrates that the VHDL generated state metrics precisely match the Matlab fixed point model state metrics, and hence verifies operation of the design.

6.5 Conclusion

A state-parallel architecture comprising 32 radix-4 units, each containing two deeply pipelined ACS units, performs the required path metric processing for the 128 state trellis. The proposed architecture has been implemented in VHDL and targeted toward a low-cost Xilinx Spartan 3A-DSP FPGA. The implementation consumes 43.9% of the available FPGA resources and passes static timing analysis at 204.5 MHz, providing a margin of about 8% over the 189 MHz minimum requirement. A VHDL functional simulation verifies that the VHDL model precisely matches the fixed point model.


Chapter 7

Conclusions and Future Work

Continuous phase modulation is a promising entrant to the microwave radio cellular backhaul market. Its constant envelope property enables the use of non-linear radio frequency power amplifiers in the outdoor unit. This significantly reduces cost and improves power efficiency. Nevertheless, the spectral efficiency of CPM is inferior to traditional radios using large quadrature amplitude modulation constellations. Furthermore, receiver implementation complexity and cost are an issue for spectrally efficient CPM configurations which require multiple symbol duration phase pulses to reduce spectral occupancy.

This thesis proposes a CPM configuration that achieves a 50% improvement in spectral efficiency without degrading detection efficiency or system gain compared to a CPM microwave radio recently released to the market. This is achieved by moving to coherent demodulation in the receiver and using a longer, smoother, phase pulse. The new CPM configuration has a modulation index (h) of 1/4, phase pulse symbol duration (L) of 3, raised cosine phase pulse shape and has a quaternary (M=4) symbol alphabet size. By simulating a floating point model, this CPM configuration is shown to meet specific ETSI standard requirements [1, Annex D] that constrain the transmitted power spectral density and require minimum standards of performance for receiver detection efficiency. The application is for a 28 MHz wide channel and data rate of 54 Mbit/s, sufficient to transport 24 E1 circuits plus framing, forward error correction and auxiliary channel overhead. This is a significant improvement in data rate compared to the existing product which transports only 16 E1 circuits within the same 28 MHz ETSI channel.

Nevertheless, this new CPM configuration has high complexity and for this result to have practical significance, a low cost implementation is required. The literature contains many examples of non-optimal, complexity reduced CPM receivers, some of which show increased sensitivity to adjacent channel interference and have degraded detection efficiency compared to an optimal receiver. In this thesis, these issues are avoided by implementing the maximum-likelihood receiver using the Viterbi algorithm to demodulate the baseband CPM signal. The CPM configuration described above leads to a Viterbi trellis with 128 states, 4 inbound paths per state and 512 branch metrics that must be calculated each symbol period. The application requires a throughput of 27 MSymbols/s (54 Mbit/s), making a low cost FPGA implementation a challenge.


In order to explore the tradeoff between implementation word-length (cost) and detection efficiency degradation, a fixed point model of the CPM receiver has been developed. The degradation is only 0.2 dB when the received signal is quantised to 7 bits, branch metric filter bank coefficients quantised to 7 bits, branch metrics word-length of 9 bits and path metric word-length of 11 bits. There was no observed degradation due to the use of 2 samples per symbol processing and a Viterbi path history depth of 16 symbols.

The two most computationally expensive parts of the receiver are the branch metrics unit and path metrics processing unit. This thesis presents a novel approach to the branch metrics filter bank by using the well known distributed arithmetic algorithm and applying it to a CPM receiver branch metric unit for the first time. This technique performs 27.6 giga-multiplies per second, in a bit-serial fashion, making very efficient use of the FPGA logic fabric. One undesirable consequence of the bit serial processing is an added delay of 1 symbol duration which is likely to degrade the carrier phase recovery control loop performance. This is the price to be paid for a low-cost implementation.

The branch metric unit is implemented in VHDL and targeted to a low-cost Xilinx Spartan 3A-DSP FPGA. The implementation consumes 23% of the available logic cells and the placed and routed design passes static timing analysis at 215 MHz, meeting the 189 MHz requirement to achieve a throughput of 54 Mbit/s. Functionality is verified using a VHDL testbench simulated by Modelsim. VHDL test vectors are generated by the fixed point model and are passed through the design, and results automatically checked to confirm that the VHDL implementation precisely matches the fixed point model.

Although all work in this thesis has been carried out in simulation, the same VHDL models are used in simulation and for FPGA implementation. By passing the VHDL implementation models through the FPGA vendor's synthesis and place and route software and verifying the design meets static timing, and because the design uses a single clock domain, there is a high degree of confidence that the VHDL simulation matches the real world FPGA behaviour.

Viterbi path metric processing is implemented using a traditional state-parallel design. State metric routing complexity is a recognised problem for large state Viterbi decoders; this thesis takes the approach of grouping states into radix-4 units so that all add-compare-select processing units within a radix-4 unit share the same set of state metrics. This keeps much of the state metric routing local and short. In order to operate at the high clock frequency of 189 MHz, the ACS unit is deeply pipelined. This results in 2 spare cycles within the Viterbi iteration loop, one of which is devoted to state metric output routing, to make it easier to meet timing on the state metric routing. The other spare cycle is used to send a second set of states through the ACS pipeline, thus halving the number of required ACS units to 2 per radix-4 unit, bringing about a halving of FPGA resource use.

The Viterbi path metric processing VHDL implementation consumes 44% of the available FPGA resources and passes static timing analysis at 204.5 MHz, thus achieving the throughput requirement of 54 Mbit/s. The VHDL implementation was verified in simulation by precisely matching the VHDL model output with that of the Matlab fixed point model.

A Viterbi CPM demodulator also requires a survivor path management unit and a search for the best state metric. Future work is required to implement these.


For the first time, Hekstra’s method of normalising path metrics has been applied to a CPM receiver. For this technique to work, the difference between state metrics must be bounded. The CPM configuration presented in this thesis comprises two unconnected sub-trellises and within each sub-trellis the state metric differences are bounded. Once synchronised, the transmitted symbol sequence only exists in one sub-trellis so there is an opportunity to halve the amount of processing required. This is for future study.
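The reason the bounded-difference condition matters can be seen in a small sketch of wrap-around (modulo) metric arithmetic, the idea usually associated with this kind of normalisation. This is an assumption-laden illustration, not the thesis implementation: the function names are hypothetical, and the 11 bit width is simply the path metric word-length quoted earlier.

```python
def modular_add(metric, increment, bits):
    """Add with wrap-around, as a fixed-width register would."""
    return (metric + increment) % (1 << bits)


def modular_greater(a, b, bits):
    """True if metric a is ahead of metric b.

    Only valid when the true difference between the two metrics is smaller
    in magnitude than 2**(bits - 1); this is the bounded-difference condition.
    """
    diff = (a - b) % (1 << bits)
    return 0 < diff < (1 << (bits - 1))


# An 11-bit metric that has wrapped past 2047 still compares correctly
old = 2040
new = modular_add(old, 10, bits=11)     # true value 2050, stored as 2
print(new, modular_greater(new, old, bits=11))  # 2 True
```

Because only differences are ever compared, the metrics never need to be rescaled; if the difference bound were violated, the comparison would silently give the wrong answer.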

This thesis fills a gap in the CPM literature where there are few details of FPGA targeted CPM receivers. Simulation results presented here show a new CPM configuration achieving a 50% improvement in spectral efficiency compared to a recently released CPM microwave radio product. A VHDL implementation of this CPM configuration is presented and results show that a low cost FPGA implementation is practical. This work has assumed ideal carrier phase recovery and timing recovery. Future work is required to implement carrier phase and timing recovery and integrate these functions with the receiver implementation proposed here.


Appendix


Appendix A

VHDL Implementation Functional Verification

A VHDL functional simulation and static timing analysis (STA) are used to verify the FPGA implementation. The only purpose of the VHDL functional simulation is to prove the VHDL implementation matches the fixed point model precisely. The Matlab floating point models demonstrate that the application’s spectral efficiency and BER performance requirements are met. The Matlab fixed point simulation demonstrates receiver performance after moving to reduced precision fixed point arithmetic. The fixed point model is also used to generate test vectors for the VHDL functional simulation.

Testing the VHDL design in a real world FPGA is not required in order to demonstrate that the application’s cost requirements are met. For this design, static timing analysis is straightforward. The design is fully synchronous: it uses a single clock and no asynchronous inputs or outputs. The automated Xilinx STA tool, Trace, checks timing through all paths and guarantees the design will function as simulated over process, voltage and temperature. This is because Trace checks the placed and routed design's timing against a set of speedfiles specific to the FPGA device. Each FPGA device manufactured by Xilinx is tested against this same set of FPGA speedfiles, thus guaranteeing the customer's design will function correctly in every FPGA shipped by Xilinx [49].

A.1 VHDL Functional Verification Architecture

Figure A.1 shows how the Matlab fixed point model is used to generate test vectors which are then applied to the VHDL CPM receiver implementation in a Modelsim VHDL simulation testbench. The Matlab fixed point model is the same as used in chapter 4 but with function hooks (see Appendix E.3.1) added to export results from specific points in the model. The exported results are read by a VHDL writer function written in Matlab. The VHDL writer function generates a VHDL package file containing the exported results as VHDL constant arrays. The VHDL writer source code is in Appendix E.3.2.

Figure A.1: VHDL Implementation Test Architecture

A VHDL testbench then imports this VHDL package, instantiates the VHDL device under test (DUT), applies the input test vectors to the DUT and checks the DUT output against the correct output in the test vectors. Every error is counted, and a VHDL assert prints an error message in the simulator if any errors are found. For example, Appendix D.1.5 contains the VHDL testbench for the branch metrics VHDL implementation.
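The checking step performed by the testbench amounts to an exact, element by element comparison against the golden vectors. The sketch below illustrates that logic only; the function name and the vector values are hypothetical and are not taken from the thesis test vectors.

```python
def count_mismatches(expected, actual):
    """Count positions where the DUT output differs from the golden vector."""
    if len(expected) != len(actual):
        raise ValueError("vector lengths differ")
    return sum(1 for e, a in zip(expected, actual) if e != a)


# Hypothetical golden vector (from the fixed point model) and DUT output
golden = [5, -3, 12, 0]
dut_output = [5, -3, 12, 0]

errors = count_mismatches(golden, dut_output)
# Equivalent of the VHDL assert: report only if any errors were found
assert errors == 0, f"{errors} mismatches between DUT and fixed point model"
```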

Appendix B

Receive Signal Level to SNR Conversion

The ETSI radio frequency performance specification used in this thesis specifies detection efficiency by requiring the bit error rate to be less than 10^-6 at a receive signal level (RSL) of -75 dBm [1, Table D.6].

Received signal strength is expressed as a signal to noise ratio (SNR) for the purposes of simulation and analysis. This is a more useful metric because it is independent of bandwidth. We assume two noise sources for the purposes of this calculation. The first is the thermal noise in the symbol rate bandwidth (B), defined by equation B.1 and modelled as additive white Gaussian noise. The second is noise added by the receiver itself; we conservatively assume a receiver noise figure of 6 dB. SNR is calculated from RSL using equation B.2 [50].

Energy per bit relative to noise, Eb/No, is also commonly used as a receiver figure of merit because it is independent of bandwidth and of the number of bits per symbol (log2 M). Equation B.3 converts between SNR and Eb/No.

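\[ P_{\mathrm{thermalnoise}} = -174\,\mathrm{dBm} + 10\log_{10}(B) \tag{B.1} \]

\[ \mathrm{SNR}_{\mathrm{dB}} = \mathrm{RSL} - P_{\mathrm{thermalnoise}} - \mathrm{NF}_{\mathrm{receiver}} \tag{B.2} \]

\[ \mathrm{SNR}_{\mathrm{dB}} = 10\log_{10}(\log_2(M)) + \frac{E_b}{N_o} \tag{B.3} \]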

Table B.1 summarises the conversion from -75 dBm receive signal level to SNR and Eb/No. This is equivalent to an SNR of 18.7 dB and Eb/No of 15.7 dB.

ETSI RSL [1, Table D.6]  Symbol Rate Bandwidth (B)  Pthermalnoise  NFreceiver  SNR      Eb/No (M=4)
-75 dBm                  27 MHz                     -99.7 dBm      6 dB        18.7 dB  15.7 dB

Table B.1: ETSI Received Signal Level Converted to SNR and Eb/No
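Equations B.1 to B.3 can be evaluated directly to reproduce the entries of Table B.1. In this sketch the function names are illustrative, the bandwidth is taken in Hz, and Eb/No is obtained by rearranging equation B.3.

```python
import math


def thermal_noise_dbm(bandwidth_hz):
    """Equation B.1: thermal noise power in the given bandwidth."""
    return -174.0 + 10.0 * math.log10(bandwidth_hz)


def snr_db(rsl_dbm, bandwidth_hz, noise_figure_db):
    """Equation B.2: SNR from receive signal level and receiver noise figure."""
    return rsl_dbm - thermal_noise_dbm(bandwidth_hz) - noise_figure_db


def ebno_db(snr, alphabet_size):
    """Equation B.3 rearranged: Eb/No from SNR for an M-ary alphabet."""
    return snr - 10.0 * math.log10(math.log2(alphabet_size))


noise = thermal_noise_dbm(27e6)
snr = snr_db(-75.0, 27e6, 6.0)
ebno = ebno_db(snr, 4)
print(round(noise, 1), round(snr, 1), round(ebno, 1))  # -99.7 18.7 15.7
```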


Appendix C

Baseband I/Q Modulator Derivation

Communication systems are often described in terms of their in-phase and quadrature baseband signal components. At some point these I and Q channels must modulate the actual carrier. This section derives the standard I/Q modulator.

Consider the passband signal (C.1) that might be amplitude modulated with a(t) and phase modulated with θ(t).

\[ s(t) = a(t)\cos(\omega_c t + \theta(t)) \tag{C.1} \]

\[ s(t) = \Re\{a(t)\, e^{j\theta(t)} e^{j\omega_c t}\} \tag{C.2} \]

Now define \tilde{s}(t) = a(t) e^{j\theta(t)} as the baseband complex envelope of s(t).

\[ s(t) = \Re\{\tilde{s}(t)\, e^{j\omega_c t}\} \tag{C.3} \]

Expressing \tilde{s}(t) in terms of its real s_I(t) and imaginary s_Q(t) parts gives (C.4).

\[ s(t) = \Re\{(s_I(t) + j s_Q(t))\, e^{j\omega_c t}\} \tag{C.4} \]

\[ s(t) = s_I(t)\cos(\omega_c t) - s_Q(t)\sin(\omega_c t) \tag{C.5} \]
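The equivalence of (C.1) and (C.5) can be checked numerically. The sketch below holds the amplitude and phase constant for simplicity; the function names and the test values are assumptions made for illustration.

```python
import math


def passband_direct(a, theta, wc, t):
    """Equation (C.1): amplitude and phase modulated carrier."""
    return a * math.cos(wc * t + theta)


def passband_iq(a, theta, wc, t):
    """Equation (C.5): the same signal built from I and Q components."""
    s_i = a * math.cos(theta)   # real part of the complex envelope
    s_q = a * math.sin(theta)   # imaginary part of the complex envelope
    return s_i * math.cos(wc * t) - s_q * math.sin(wc * t)


wc = 2 * math.pi * 5.0          # arbitrary carrier frequency, rad/s
max_error = max(
    abs(passband_direct(1.5, 0.7, wc, t) - passband_iq(1.5, 0.7, wc, t))
    for t in [k * 0.013 for k in range(100)]
)
print(max_error < 1e-12)  # True
```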


Appendix D

VHDL Source Code

This appendix lists the VHDL source code files developed. In a few key cases, the source code is printed here. Otherwise, please see the CD associated with this thesis for the source code in electronic format.

D.1 Branch Metric Filter Bank

D.1.1 Synthesis Top Level

/fpga/src/branch_metrics_dafir_synth.vhd

D.1.2 Top Level

--******************************************************************************
-- Title : Branch metric filter bank using distributed arithmetic fir filters
-- Author : Andrew Bridger
-- Date : 1 July 2008
-- High Level Module Description : This module implements a bank of filters with coefficients provided by
-- a matlab generated VHDL package. The input signal I and Q components are serialised and fed to all
-- filters in parallel lsb first. The filter output appears B_I = B_Q cycles + pipelining stages later.
--
-- Notes / Limitations : TODO round and saturate the filter outputs.
-- 1) XST produces this silly warning: WARNING: Xst:2677 - Node <ishiftreg10> of sequential type is unconnected in block <branch_metrics_dafir>.
-- Using the RTL viewer you can see XST uses the name x(0) instead of ishiftreg10.
--
-- Synthesizable : Yes
--
-- Testbench : branch_metrics_dafir_tb.vhd
--
-- Note : The version control system in use is the repository for information regarding bug fixes,
-- versions, feature additions etc.
--******************************************************************************

library ieee;
use ieee.std_logic_1164.all;
use IEEE.numeric_std.all;

library work;
use work.PkgStdType.all;
use work.pkg_project.all;
use work.pkg_mf_coeffs.all;
use work.pkg_rloc.all;

library unisim;
use unisim.vcomponents.all;

entity branch_metrics_dafir_primitives is
  generic (
    PLACE              : natural;                 --1 = spartan3 placed, 0 = unplaced
    B_COEFF            : positive;                --Filter coefficient wordlength
    B_IQ               : positive;                --I/Q input wordlength
    FILTER_BANK_COEFFS : filter_bank_coeffs_ta);  --filter coefficients
  port (
    clk                : in  std_logic;           --system clock
    i, q               : in  iq_ta;               --array of samples for 1 symbol period
    new_samples        : in  std_logic;           --active during first clk period in which new IQ samples are presented
    branch_metrics     : out branch_metrics_ta;   --valid for only 1 clock period
    branch_metrics_rdy : out std_logic            --active when output valid
    );
end branch_metrics_dafir_primitives;

architecture xprim of branch_metrics_dafir_primitives is

  signal iq_lsb, iq_msb, iq_lsb_buf, iq_msb_buf : std_logic;

  subtype x_t is std_logic_vector(3 downto 0);
  signal x      : x_t;
  signal bm_rdy : std_logic_vector(NUM_FILTERS downto 1);
  signal dummy_dout0, dummy_dout1, dummy_dout2, dummy_dout3 : std_logic_vector(i(1)'range);

begin
  assert false report "branch_metrics_dafir_primitives: check create_mf_da_lut_coeffs.m has INCLUDE_NEG_MF_COEFFS = 0" severity warning;
  --I'm confused. Shouldn't INCLUDE_NEG... be set to 1 because pkg_project says num_filters = size of coeff
  --array / 2.!

  --Serialise I and Q samples and generate controls for the dafir filters.
  control : process (clk)
    variable bit_count : natural range 0 to B_IQ-1;
  begin
    if rising_edge(clk) then
      iq_lsb <= '0';                    --default assignments
      iq_msb <= '0';
      if new_samples = '1' then         --load shift reg
        bit_count := 0;
        iq_lsb    <= '1';
      else                              --shift out, lsb first
        if bit_count = (B_IQ-1) then
          bit_count := 0;
        else
          bit_count := bit_count + 1;
        end if;
      end if;
      --Keep track of iq bit position and signal when msbit reached. iq_msb goes high for 1 cycle only.
      if bit_count = (B_IQ-1) then
        iq_msb <= '1';
      end if;
    end if;
  end process;

  --Form the dalut address input by picking off the lsb of the shift reg
  --i(n) serialise
  in_serialise : entity work.shift_reg(xprim)
    generic map(
      SHIFT_DIRECTION => RIGHT,  --0 right (towards lsb), 1 shift left (towards msb)
      PLACE           => PLACE)
    port map(
      clk        => clk,
      ce         => '1',
      load       => new_samples,
      din        => std_logic_vector(i(1)),
      dout       => dummy_dout0,
      serial_in  => '0',
      serial_out => x(0));

  --i(n-1) serialise
  in_1_serialise : entity work.shift_reg(xprim)
    generic map(
      SHIFT_DIRECTION => RIGHT,
      PLACE           => PLACE)
    port map(
      clk        => clk,
      ce         => '1',
      load       => new_samples,
      din        => std_logic_vector(i(2)),
      dout       => dummy_dout1,
      serial_in  => '0',
      serial_out => x(1));

  --q(n) serialise
  qn_serialise : entity work.shift_reg(xprim)
    generic map(
      SHIFT_DIRECTION => RIGHT,
      PLACE           => PLACE)
    port map(
      clk        => clk,
      ce         => '1',
      load       => new_samples,
      din        => std_logic_vector(q(1)),
      dout       => dummy_dout2,
      serial_in  => '0',
      serial_out => x(2));

  --q(n-1) serialise
  qn_1_serialise : entity work.shift_reg(xprim)
    generic map(
      SHIFT_DIRECTION => RIGHT,
      PLACE           => PLACE)
    port map(
      clk        => clk,
      ce         => '1',
      load       => new_samples,
      din        => std_logic_vector(q(2)),
      dout       => dummy_dout3,
      serial_in  => '0',
      serial_out => x(3));

  --Buffer the dafir control signals to match the 1 clk latency introduced by the x buffering
  iq_msb_flop : FDRSE
    generic map (INIT => '0') -- Initial value of register ('0' or '1')
    port map(
      C => clk,
      D => iq_msb,
      Q => iq_msb_buf,
      CE => '1',
      R => '0',
      S => '0');

  iq_lsb_flop : FDRSE
    generic map (INIT => '0') -- Initial value of register ('0' or '1')
    port map(
      C => clk,
      D => iq_lsb,
      Q => iq_lsb_buf,
      CE => '1',
      R => '0',
      S => '0');

  --bank of matched filters that generate branch metrics at their output. Inputs are serialised versions of
  --I and Q. The filter coefficients are contained in pkg_mf_coeffs.vhd.
  --Each filter calculates I(n)*C(0) + I(n-1)*C(1) + Q(n)*C(2) + Q(n-1)*C(3).
  --2 Samples per symbol period are assumed.
  generate_mf_bank : for bmid in 1 to NUM_FILTERS/2 generate
    --Placement. Arrange in a 2D array. Place 2 filters side by side at a time, then order is
    --column by column.
    --Todo generalise to a function and put into pkg_rloc.
    --define size of object being placed
    attribute RLOC : string;
    --constant PLACE : natural := 1;
    constant YSIZE_SLICES : natural := 6; --hardcoded - todo calc properly
    constant XSIZE_SLICES : natural := 4;
    constant objs_per_col : natural := 16;
    --constant objs_per_row : natural := 16;
    constant idx : natural := bmid-1; --zero referenced.
    constant x_rloc : natural := (idx/objs_per_col) * XSIZE_SLICES;
    constant y_rloc : natural := (idx mod objs_per_col) * YSIZE_SLICES;

    -- constant xy_str : string := "x" & itoa(x_rloc) & "y" & itoa(y_rloc);
    -- constant rloc_str : string := pick_string(PLACE, xy_str);
    --Strange - pkg_rloc was missing, yet synthesized fine?? pick_string should have caused error
    attribute RLOC of matched_filter_left : label is pick_string(PLACE, "x" & itoa(x_rloc) & "y" & itoa(y_rloc));
    attribute RLOC of matched_filter_right : label is pick_string(PLACE, "x" & itoa(x_rloc+2) & "y" & itoa(y_rloc));

    signal x_buf : std_logic_vector(x'range);
    constant bmid_mf1 : natural := idx*2 + 1;
    constant bmid_mf2 : natural := idx*2 + 2;
  begin
    --matched_filter : entity work.DAFir4Tap(rtl)
    matched_filter_left : entity work.DAfir4tap_all_primitives(rtl)
      generic map(
        PLACE => PLACE,
        B_COEFF => B_COEFF, --Coefficient word length
        COEFF => FILTER_BANK_COEFFS(bmid_mf1), --Coefficients
        B_X => B_IQ) --x(n) word length
      port map (
        clk => clk,
        x => x_buf,
        x_lsb => iq_lsb_buf,
        x_msb => iq_msb_buf,
        y => branch_metrics(bmid_mf1),
        y_rdy => bm_rdy(bmid_mf1));

    matched_filter_right : entity work.DAfir4tap_all_primitives(rtl)
      generic map(
        PLACE => PLACE,
        B_COEFF => B_COEFF, --Coefficient word length
        COEFF => FILTER_BANK_COEFFS(bmid_mf2), --Coefficients
        B_X => B_IQ) --x(n) word length
      port map (
        clk => clk,
        x => x_buf,
        x_lsb => iq_lsb_buf,
        x_msb => iq_msb_buf,
        y => branch_metrics(bmid_mf2),
        y_rdy => bm_rdy(bmid_mf2));

    --If the filter bank is large, these x inputs are very high fanout. Add flop buffers every so often
    --to reduce fanout, and reduce the routing length.
    gen_x_flop_buf : for j in 0 to 3 generate
      --placement. mf RPM has F row empty, so fit these flops in there.
      attribute BEL : string;
      attribute BEL of x_flop_buf : label is pick_string(PLACE, "FFX"); --?? seems to ignore this constraint?
      constant flop_buf_xy_str : string := "x" & itoa(x_rloc + j) & "y" & itoa(y_rloc);
      attribute RLOC of x_flop_buf : label is pick_string(PLACE, flop_buf_xy_str);
    begin
      x_flop_buf : FDRSE
        generic map (INIT => '0') -- Initial value of register ('0' or '1')
        port map(
          C => clk,
          D => x(j),
          Q => x_buf(j),
          CE => '1',
          R => '0',
          S => '0');
    end generate;
  end generate;

  --All bms ready at the same time
  branch_metrics_rdy <= bm_rdy(1);

end xprim;

--Serialise I and Q samples and generate controls for the dafir filters.
-- control : process (clk)
--   variable i_shift_reg, q_shift_reg : iq_ta;
--   variable bit_count : natural range 0 to B_IQ-1;
-- begin
--   if rising_edge(clk) then
--     iq_lsb <= '0'; --default assignments
--     iq_msb <= '0';
--     if new_samples = '1' then --load shift reg
--       bit_count := 0;
--       iq_lsb <= '1';
--       i_shift_reg := i;
--       q_shift_reg := q;
--     else --shift out, lsb first
--       if bit_count = (B_IQ-1) then
--         bit_count := 0;
--       else
--         bit_count := bit_count + 1;
--       end if;
--       for n in 1 to SAMPLES_PER_SYMBOL loop
--         i_shift_reg(n) := "0" & i_shift_reg(n)(i_shift_reg(n)'high downto i_shift_reg(n)'low+1);
--         q_shift_reg(n) := "0" & q_shift_reg(n)(q_shift_reg(n)'high downto q_shift_reg(n)'low+1);
--       end loop;
--     end if;
--     --Form the dalut address input by picking off the lsb of the shift reg
--     for n in 1 to SAMPLES_PER_SYMBOL loop
--       x(n-1) <= i_shift_reg(n)(i_shift_reg(n)'low);
--       x(n-1+SAMPLES_PER_SYMBOL) <= q_shift_reg(n)(q_shift_reg(n)'low);
--     end loop;
--     --Keep track of iq bit position and signal when msbit reached. iq_msb goes high for 1 cycle only.
--     if bit_count = (B_IQ-1) then
--       iq_msb <= '1';
--     end if;
--   end if;
-- end process;
--
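The placement arithmetic in the generate_mf_bank loop of the listing above can be checked quickly in software. The following Python sketch reproduces the integer division and modulo used for x_rloc and y_rloc, with the same hard-coded sizes. The function name rloc_for_pair is hypothetical, and the sketch assumes that itoa converts an integer to its decimal string and that pick_string passes the string through when placement is enabled; neither helper is shown in this appendix.

```python
# Sketch of the RLOC grid arithmetic used to place pairs of matched filters.
# Constants mirror the hard-coded values in the VHDL listing.
YSIZE_SLICES = 6    # height of one filter, in slices
XSIZE_SLICES = 4    # width of one left/right filter pair, in slices
OBJS_PER_COL = 16   # filter pairs stacked in one column


def rloc_for_pair(bmid):
    """Return (left, right) RLOC strings for the 1-based pair index bmid.

    Hypothetical helper: assumes itoa() yields a decimal string and that
    placement is enabled (PLACE = 1).
    """
    idx = bmid - 1                                   # zero referenced
    x_rloc = (idx // OBJS_PER_COL) * XSIZE_SLICES    # column of this pair
    y_rloc = (idx % OBJS_PER_COL) * YSIZE_SLICES     # row within the column
    left = f"x{x_rloc}y{y_rloc}"
    right = f"x{x_rloc + 2}y{y_rloc}"                # right filter sits 2 slices over
    return left, right


if __name__ == "__main__":
    for bmid in (1, 2, 16, 17):
        print(bmid, rloc_for_pair(bmid))
```

Pairs fill a column from the bottom up, and a new column starts every 16 pairs, which matches the "column by column" ordering described in the listing's comments.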

D.1.3 4-Tap Distributed Arithmetic Filter

--**************************************************************************************************
-- Title : Distributed Arithmetic FIR Filter (Only supports 4 taps), implemented with xilinx primitives
-- Author : Andrew Bridger
-- Date : 23 July 2008
-- High Level Module Description : Fully bit serial DA Fir filter. See pg 62 "DSP with FPGA" Uwe Meyer Baese.
-- Designed as a component for CPM branch metric filter bank. Input signal serialisation not done here.
-- This design comprises a DALUT containing precomputed partial products followed by a scaling accumulator.
-- Input signal signed in 2's complement format.
--
-- This version completely implemented using xilinx primitives and with placement attributes. This is
-- required to meet timing for high clock rates and when a large number of these filters are used. E.g. The
-- PAR tools did not place the DALUT rom and pipeline flops in the same slice, needlessly incurring
-- a route that was more than 3ns in some cases.
--
-- The subtract and reset flop controls were also being optimised away for larger filter banks. Which
-- leads to very high fanout and long, slow routes.
--
--
-- Notes/Limitations :
-- 1) Bizarre warning. WARNING:Xst:1610 - "C:/project/massey_stratex_cpm/demosvn/abr_masters_cpm/fpga/src/DAFir4Tap.vhd" line 78: Width mismatch. <partial_product> has a width of 9 bits but assigned expression is 10-bit wide.
-- In numeric_std "+" operator with signed inputs will produce a result the length of the operand
-- with the longest length. With B_COEFF = 7, both operands and the result all have 9 bits. Modelsim
-- gives no warnings.
--
-- Synthesizable : Yes
--
-- Testbench : DAfir4tap_tb.vhd
--
-- Note : The version control system in use is the repository for information regarding bug fixes,
-- versions, feature additions etc.
--**************************************************************************************************

--Abbreviations
--SA = scaling accumulator

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

library work;
use work.PkgStdType.all;
use work.pkg_mf_coeffs.all;
use work.pkg_rloc.all;

library unisim;
use unisim.vcomponents.ALL;

entity DAfir4tap_all_primitives is
  generic (
    PLACE   : natural;                 --1 = spartan3 placed, 0 = unplaced
    B_COEFF : positive;                --Coefficient word length
    COEFF   : single_filter_coeffs_ta; --Coefficients
    B_X     : positive);               --x(n) word length
  port (
    clk   : in  std_logic;                           --system clock
    x     : in  std_logic_vector(3 downto 0);
    x_lsb : in  std_logic;                           --active during first bit of x
    x_msb : in  std_logic;                           --active when the sign bit of x is input
    y     : out signed(B_X+B_COEFF+2-1 downto 0);    --valid for only 1 clk period
    y_rdy : out std_logic                            --active when y result is valid
  );
end DAfir4tap_all_primitives;

architecture rtl of DAfir4tap_all_primitives is
  --With 256 filters, the branch metric outputs far exceed real chip IO. So synth without IO pads. This
  --then requires the use of mapper save attribute to prevent the mapper from stripping the whole design
  --since there are no loads (IO pads)
  --Unfortunately, applying this attribute on the branch_metrics signal does not work. Possibly because
  --branch_metrics is an array of arrays. Hence do it here - this is only required when synthesizing without
  --IO pads.
  attribute s : string;
  attribute s of y : signal is "yes";

  constant LUT_SIZE : positive := 4; --only 4 input LUT currently supported.
  constant DALUT_BIT_GROWTH : natural := 2; --For 6 input LUT, bit growth is 3

  --subtype partial_product_t is signed(B_COEFF+DALUT_BIT_GROWTH-1 downto 0);
  --type dalut_t is array (0 to (2**LUT_SIZE)-1) of partial_product_t;

  --Generate DALUT lookup table. All possible partial products are generated and returned in an array.
  function create_dalut(
    constant INTEGER_COEFFS : single_filter_coeffs_ta) --filter coefficients
    return integer_ta is
    variable partial_product : integer;
    variable dalut : integer_ta(0 to (2**LUT_SIZE)-1);
    variable addr_bit, lut_addr : integer;
  begin
    assert INTEGER_COEFFS'length = LUT_SIZE
      report "CreateDalut() - Exactly 4 coefficients must be provided.";
    --Calculate all possible partial products for these coefficients. This is done one address at a time.
    for lut_addr in 0 to 2**LUT_SIZE-1 loop
      partial_product := 0;
      --Check each bit of the LUT address to determine which coefficients should be accumulated to
      --generate the final partial product for this address.
      for addr_bit in 0 to LUT_SIZE-1 loop
        if (to_unsigned(lut_addr, LUT_SIZE)(addr_bit) = '1') then
          partial_product := partial_product + INTEGER_COEFFS(addr_bit);
        end if;
      end loop;
      dalut(lut_addr) := partial_product;
    end loop;
    return dalut;
  end function;

  constant SIGN_EXTEND_BIT : natural := 1;
  signal sa_add, op_a, op_b : std_logic_vector(SIGN_EXTEND_BIT + DALUT_BIT_GROWTH + B_COEFF-1 downto 0);
  signal sa_shift : std_logic_vector(B_X-2 downto 0);
  signal reset_sa_n, subtract_pp : std_logic;
  signal reset_sa_n_noreg : std_logic;
  signal y_slv : std_logic_vector(y'range);

  constant dalut_init : integer_ta := create_dalut(COEFF);
  signal pp : std_logic_vector(DALUT_BIT_GROWTH + B_COEFF-1 downto 0);
  -- constant dummy_zero : std_logic_vector(pp'range) := (others => '0');

  --Placement. Arrange DALUT and scaling accumulator in adjacent columns, lsb aligned.
  attribute RLOC : string;
  --constant xy_str : string := "x0y0" & itoa((i/2));
  --constant rloc_str : string := pick_string(PLACE, xy_str);
  attribute RLOC of dalut : label is pick_string(PLACE, "x0y1");
  attribute RLOC of scaling_accumulator : label is pick_string(PLACE, "x1y1");
  --place below the lsb of dalut/sa since lsb has longest carry chain path.
  attribute RLOC of reset_sa_n_flop : label is pick_string(PLACE, "x1y0");
  attribute RLOC of subtract_flop : label is pick_string(PLACE, "x0y0");
  --constant control_flop_ypos : natural := (DALUT_BIT_GROWTH + B_COEFF+1) mod 2;
  --attribute RLOC of reset_sa_n_flop : label is pick_string(PLACE, "x0y" & itoa(control_flop_ypos));
  --attribute RLOC of subtract_flop : label is pick_string(PLACE, "x1y" & itoa(control_flop_ypos));
begin

  --x input 1 bit per clock, lookup the partial products sum based on the value of x
  --Register the partial product, and then accumulate in the scaling accumulator. The scaling accumulator
  --is reset to 0 when x lsb presented. When the x msbit (sign bit) is presented the partial product
  --is subtracted from the scaling accumulator result. (See Uwe Meyer-Baese)
  --scaling accumulator does => acc_pp = (2^-1)*previous_acc + pp*2^(B_X-1).
  --I.e. the pp is msbit aligned with (2^-1)*previous_acc and then added. The (B_X-1) zeros
  --post-pended to pp don't need to be added since A + 0 = A. This means we have two components to
  --the scaling accumulator result. Called sa_add and sa_shift.
  y_slv <= sa_add & sa_shift;
  y <= signed(y_slv);

  --dalut lookup, registered.
  --this lut ram has a problem getting through the mapper
  --Pack:679 - Unable to obey design constraints (MACRONAME=filter_bank/generate_mf_bank[1].matched_filter/dalut/hset, RLOC=X0Y4) which require the combination of the following symbols into a single SLICEM component:
  --  FLOP symbol "filter_bank/generate_mf_bank[1].matched_filter/dalut/bit_loop[8].yes_dout_flop.dout_flop" (Output Signal = filter_bank/generate_mf_bank[1].matched_filter/pp<8>)
  --  RAM symbol "filter_bank/generate_mf_bank[1].matched_filter/dalut/bit_loop[8].ram_bit" (Output Signal = filter_bank/generate_mf_bank[1].matched_filter/dalut/dout_noreg<8>)
  --Function generator filter_bank/generate_mf_bank[1].matched_filter/dalut/bit_loop[8].ram_bit has a site constraint other than "G". Please correct the design constraints accordingly.
  -- dalut : entity work.RAM16XnS(xprim)
  --   generic map(
  --     REG => 1,            -- 0 no reg, 1 register RAM output
  --     INIT => dalut_init,  -- initial ram contents
  --     PLACE => 1)          -- 0 unplaced, 1 spartan3
  --   port map(
  --     dout => pp,
  --     din => dummy_zero,
  --     addr => x,
  --     clk => clk,
  --     we => '0');

  --dalut lookup, registered.
  dalut : entity work.lut_rom(xprim)
    generic map(
      REG => 1,            -- 0 no reg, 1 register RAM output
      INIT => dalut_init,  -- initial ram contents
      PLACE => PLACE)      -- 0 unplaced, 1 spartan3
    port map(
      dout => pp,
      addr => x,
      clk => clk);

  --Setup, and register the controls for the scaling accumulator adder
  --x_msb, x_lsb are very high fanout signals when this filter is used in a large
  --filter bank. These flops also serve to buffer these signals and reduce fanout.
  reset_sa_n_noreg <= not(x_lsb);

  reset_sa_n_flop : FDRSE
    generic map (INIT => '0') -- Initial value of register ('0' or '1')
    port map(
      C => clk,
      D => reset_sa_n_noreg,
      Q => reset_sa_n,
      CE => '1',
      R => '0',
      S => '0');

  subtract_flop : FDRSE
    generic map (INIT => '0') -- Initial value of register ('0' or '1')
    port map(
      C => clk,
      D => x_msb,
      Q => subtract_pp,
      CE => '1',
      R => '0',
      S => '0');

  --sign extend the operands into the scaling accumulator
  op_b <= std_logic_vector(pp(pp'high) & pp);
  op_a <= sa_add(sa_add'high) & sa_add(sa_add'high downto (sa_add'low+1)); -- 2^-1

  --Scaling accumulator. ans = a +/- b
  scaling_accumulator : entity work.adder_sub_clr(rtl)
    generic map(PLACE => PLACE)
    port map(
      clk => clk,
      subtract_b => subtract_pp,
      clear_a_n => reset_sa_n,
      a => op_a,
      b => op_b,
      ans => sa_add);

  --The shift register component of the scaling accumulator simply shifts right, taking the lsb of
  --the scaling accumulator adder portion. Except when the lsb of a new word is applied, in which
  --case the shift reg must clear.
  -- shift_component : entity work.shift_reg(xprim)
  --   generic map(
  --     SHIFT_DIRECTION => RIGHT, --0 right (towards lsb), 1 shift left (towards msb)
  --     PLACE => 1)
  --   port map(
  --     clk => clk,
  --     ce => '1',
  --     load => not(reset_sa_n),
  --     din => init to zeros...,
  --     shift_in => sa_add(sa_add'low),
  --     shift_out => sa_shift);

  shift_component : process (clk)
  begin
    if rising_edge(clk) then
      --The shift register component of the scaling accumulator simply shifts right, taking the lsb of
      --the scaling accumulator adder portion. Except when the lsb of a new word is applied, in which
      --case the shift reg must clear.
      if reset_sa_n = '0' then
        sa_shift <= (others => '0');
      else
        sa_shift <= sa_add(sa_add'low) & sa_shift(sa_shift'high downto (sa_shift'low+1));
      end if;
      --The subtract cycle is the last for the current word, i.e. result ready
      --at next clock edge.
      --If this module used in a large filter bank, this flop becomes repeated many times. Hopefully
      --tools will remove most of these redundant flops.
      y_rdy <= subtract_pp;
    end if;
  end process;

end rtl;
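As a cross-check on the arithmetic in the listing above, the DALUT and scaling accumulator can be modelled in a few lines of Python. This is only a behavioural sketch, not the thesis's reference model: the function names and the example coefficients are hypothetical, and pipeline registers and control timing are ignored. It follows the recurrence stated in the listing's comments, acc = (2^-1)*previous_acc + pp*2^(B_X-1), with the partial product subtracted on the sign bit.

```python
# Behavioural sketch of a 4-tap, fully bit-serial distributed arithmetic FIR.
# Function names and example values are illustrative only.

def create_dalut(coeffs):
    """Return all 16 partial products.

    Entry `addr` is the sum of coeffs[k] for every bit k that is set in addr.
    """
    assert len(coeffs) == 4, "exactly 4 coefficients must be provided"
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(16)]


def da_fir_4tap(coeffs, samples, b_x):
    """Compute sum(coeffs[k] * samples[k]) one input bit per step, LSB first.

    samples are signed integers that fit in b_x bits (two's complement).
    """
    dalut = create_dalut(coeffs)
    mask = (1 << b_x) - 1
    words = [s & mask for s in samples]        # two's complement bit patterns
    acc = 0                                    # cleared when the LSB is presented
    for bit in range(b_x):
        # LUT address: the current bit of each of the four serialised inputs
        addr = sum(((w >> bit) & 1) << k for k, w in enumerate(words))
        pp = dalut[addr] << (b_x - 1)          # MSB-align the partial product
        if bit == b_x - 1:
            acc = (acc >> 1) - pp              # sign bit: subtract
        else:
            acc = (acc >> 1) + pp              # halve the accumulator, then add
    return acc


if __name__ == "__main__":
    coeffs = [5, -3, 7, -2]                    # hypothetical coefficients
    samples = [3, -4, 1, -1]                   # hypothetical 7-bit inputs
    direct = sum(c * s for c, s in zip(coeffs, samples))
    assert da_fir_4tap(coeffs, samples, 7) == direct
    print(direct)
```

The halving is exact in this model because every term held in the accumulator still carries at least one factor of two at the moment it is shifted. The hardware keeps the bits shifted out in sa_shift, and the model keeps them inside a single Python integer.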

D.1.4 Filter Coefficients

/matlab/fpga/vhdl/pkg_mf_coeffs.vhd

D.1.5 Testbench

--**************************************************************************************************
-- Title : Testbench for dafir branch metric filter bank
-- Author : Andrew Bridger
-- Date : 3 July 2008
-- High Level Module Description : Instantiates the filter bank, inputs matlab generated stimulus and
-- checks result against matlab generated results.
-- Notes/Limitations :
--
-- Synthesizable : No
--
-- Testbench : n/a
--
-- Note : The version control system in use is the repository for information regarding bug fixes,
-- versions, feature additions etc.
--**************************************************************************************************


library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

library work;
use work.PkgStdType.all;
use work.pkg_math.all;
use work.pkg_project.all;
use work.pkg_mf_coeffs.all;
--use work.pkg_mf_coeffs_h025_L3_M4.all;
use work.pkg_matlab_test_vectors.all;

entity branch_metrics_dafir_tb is
end branch_metrics_dafir_tb;

architecture branch_metrics_dafir_tb_arch of branch_metrics_dafir_tb is

  signal clk, reset : std_logic;            --system clock
  signal i, q : iq_ta;                      --array of samples for 1 symbol period
  signal new_samples : std_logic;           --active during first clk period in which new IQ samples are presented
  signal branch_metrics : branch_metrics_ta; --valid for only 1 clock period
  signal branch_metrics_rdy : std_logic;    --active when output valid

  signal blah_test : std_logic;

begin

  --dut_filter_bank : entity work.branch_metrics_dafir(branch_metrics_dafir_arch)
  dut_filter_bank : entity work.branch_metrics_dafir_primitives(xprim)
    generic map(
      PLACE => 1,
      B_COEFF => B_COEFF,
      B_IQ => B_IQ,
      FILTER_BANK_COEFFS => FILTER_BANK_COEFFS)
    port map(
      clk => clk,
      i => i,
      q => q,
      new_samples => new_samples,
      branch_metrics => branch_metrics,
      branch_metrics_rdy => branch_metrics_rdy);

  main : process
    --100MHz clock
    procedure tick is
    begin
      clk <= '0';
      wait for 5 ns;
      clk <= '1';
      wait for 5 ns;
    end procedure;

    procedure GenReset(constant length : natural) is
    begin
      reset <= '1';
      for i in 0 to length loop
        tick;
      end loop;
      reset <= '0';
      tick;
    end procedure;

    --Test two arrays for equality. Returns 0 if equal element wise, otherwise returns number of differences.
    function is_equal(a, b : integer_ta) return natural is
      variable diff : natural := 0;
      constant same_length : boolean := (a'length = b'length);
      variable a_norm, b_norm : integer_ta(1 to a'length);
    begin
      assert same_length report "is_equal: operands have different array lengths." severity failure;
      --normalise range definitions
      a_norm := a;
      b_norm := b;
      for i in 1 to a'length loop
        if (a_norm(i) /= b_norm(i)) then
          diff := diff + 1;
        end if;
      end loop;
      return diff;
    end function;

    type many_branch_metrics_ta is array (natural range <>) of branch_metrics_ta;
    type many_iq_samples_ta is array (natural range <>) of iq_ta;

    --Apply i/q stimulus and check result. However, i/q applied in a fairly simple manner; a new symbols
    --worth of samples are input after the previous result has been fully calculated. I.e the filter
    --is not running at maximum throughput. (the filter is pipelined so normally will have two symbols in the
    --pipeline at once).
    procedure test_filter_bank_simple(constant i_input, q_input : integer2D_ta;
                                      constant correct_result : integer2D_ta) is
      variable errors : natural := 0;
      variable num_symbols : natural := i_input'length(1);
      --normalise ranges
      constant i_input_norm : integer2D_ta(1 to num_symbols, 1 to SAMPLES_PER_SYMBOL) := i_input;
      constant q_input_norm : integer2D_ta(1 to num_symbols, 1 to SAMPLES_PER_SYMBOL) := q_input;
      --constant i_norm : iq_ta(1 to SAMPLES_PER_SYMBOL);
      --constant q_norm : iq_ta(1 to SAMPLES_PER_SYMBOL);
    begin
      --loop through all input test vectors
      for symbol in 1 to num_symbols loop
        --apply new input
        new_samples <= '1';
        for n in 1 to SAMPLES_PER_SYMBOL loop
          i(n) <= to_signed(i_input_norm(symbol, n), B_IQ);
          q(n) <= to_signed(q_input_norm(symbol, n), B_IQ);
          --i_norm(n) := to_signed(i_input_norm(symbol, n), B_IQ);
          --q_norm(n) := to_signed(q_input_norm(symbol, n), B_IQ);
          --i <= i_norm(n);
          --q <= q_norm(n);
        end loop;
        tick;
        new_samples <= '0';
        --clock until result arrives
        while (branch_metrics_rdy = '0') loop
          tick;
        end loop;
        --check for error in result
        for bmid in branch_metrics'low to branch_metrics'high loop
          if (to_integer(branch_metrics(bmid)) /= correct_result(symbol, bmid)) then
            errors := errors + 1;
          end if;
        end loop;
        --errors := errors + is_equal(to_integer(branch_metrics), correct_result(symbol));
      end loop;

      assert errors = 0 report "test_filter_bank: branch metric errors found." severity failure;
    end procedure;

    --This test more closely represents how the filter will be used in practice. To keep clock rate to a
    --minimum, the input samples will be applied at the max rate - e.g. for 7 bit inputs then a new sample is
    --input once every 7 clocks. The result comes out with a latency of 2 cycles.
    procedure test_filter_bank(constant i_input, q_input : integer2D_ta;
                               constant correct_result : integer2D_ta;
                               constant clks_per_input : natural) is
      variable errors : natural := 0;
      variable num_symbols : natural := i_input'length(1);
      variable last_result_read : boolean := false;
      variable clk_count, symbol_input_idx, bm_check_idx : natural := 0;
      --normalise ranges
      constant i_input_norm : integer2D_ta(0 to num_symbols-1, 1 to SAMPLES_PER_SYMBOL) := i_input;
      constant q_input_norm : integer2D_ta(0 to num_symbols-1, 1 to SAMPLES_PER_SYMBOL) := q_input;
    begin
      --loop until all symbols have been input and the last result has been read
      last_result_read := false;
      clk_count := clks_per_input;
      symbol_input_idx := 0;
      bm_check_idx := 1;
      while (not last_result_read) loop
        --input symbol at max rate for this filter
        if (clk_count = clks_per_input) then
          clk_count := 0;
          --input a symbol to the dut
          new_samples <= '1';
          for n in 1 to SAMPLES_PER_SYMBOL loop
            i(n) <= to_signed(i_input_norm(symbol_input_idx, n), B_IQ);
            q(n) <= to_signed(q_input_norm(symbol_input_idx, n), B_IQ);
          end loop;
          symbol_input_idx := (symbol_input_idx + 1) mod num_symbols; --inc ptr to next symbol to input
        else
          new_samples <= '0';
        end if;

        --read design output when available, and check if its correct
        if (branch_metrics_rdy = '1') then
          --check for error in result
          for bmid in branch_metrics'low to branch_metrics'high loop
            if (to_integer(branch_metrics(bmid)) /= correct_result(bm_check_idx, bmid)) then
              errors := errors + 1;
            end if;
          end loop;
          --check if the last result has been read
          last_result_read := (bm_check_idx >= num_symbols);
          bm_check_idx := bm_check_idx + 1;
        end if;

        --clock the design
        tick;
        clk_count := clk_count + 1;
      end loop;
      assert errors = 0 report "test_filter_bank: branch metric errors found." severity failure;
    end procedure;

    constant ANS_SIMPLE : integer2D_ta(1 to 3, 1 to 4) := ((-63, 3, 63, 36), (-126, 6, 126, 72), (-189, 9, 189, 108));
    constant I_SIMPLE : integer2D_ta(1 to 3, 1 to 2) := ((1, 0), (2, 0), (3, 0));
    constant Q_SIMPLE : integer2D_ta(1 to 3, 1 to 2) := ((0, 0), (0, 0), (0, 0));

  begin
    for n in i'low to (i'low+SAMPLES_PER_SYMBOL-1) loop
      i(n) <= (others => '0');
      q(n) <= (others => '0');
    end loop;
    new_samples <= '1'; --hold filter bank in reset with this control
    GenReset(5);

    -- test_filter_bank_simple(I_SIMPLE, Q_SIMPLE, ANS_SIMPLE); --don't forget to rename pkg_mf_coeffs_simple_test

    test_filter_bank(i_into_bm, q_into_bm, bm_result_unrounded, B_IQ);
    tick;
    tick;
    tick;

    assert false
      report "sim end"
      severity failure;

  end process;
end branch_metrics_dafir_tb_arch;

D.1.6 Test Vectors Package

/fpga/src/pkg_matlab_test_vectors.vhd

D.2 Viterbi Trellis Path Metrics


/fpga/src/cpm_viterbi_decoder_synth.vhd

D.2.2 Top Level

--**************************************************************************************************
-- Title : CPM viterbi decoder
-- Author : Andrew Bridger
-- Date : 24 September 2008
-- High Level Module Description : This module performs the viterbi decoder processing for the complete
-- CPM trellis. This module instantiates several radix-4 units that carry out the state metric


-- add-compare-select processing. The state metrics between radix-4 units are interconnected based
-- on a matlab generated vhdl description of the CPM trellis.
-- This module currently expects branch metrics as inputs, and outputs decision bits every viterbi iteration.
-- These decision bits identify the survivor path for each state. A traceback unit (external to this module)
-- would store and process these bits further, before outputting decoded data.
-- (todo - likely in future the bms will be brought in here to tie placement to the radix4 unit placement)
--
-- The branch metrics must be supplied at the correct time. The current implementation requires the 1st
-- bms to be supplied on the 3/4 cycles following new_viterbi_iteration. Since a DAFIR BM implementation
-- only has the bms valid for 1 clock cycle, they must be supplied on precisely the right clock cycle,
-- indeed the cycle they are read.
--
-- Various implementations are supported using if..generate statements and a generic parameter to select
-- the desired implementation.
-- Notes/Limitations:
-- Synthesizable: Yes
--
-- Testbench: cpm_viterbi_decoder_tb.vhd and also some test code embedded in this module itself.
--
-- Note: The version control system in use is the repository for information regarding bug fixes,
-- versions, feature additions etc.
--
--**************************************************************************************************


-- Assumption: the IEEE clauses are not visible in the extracted listing; they are added here because
-- std_logic, to_signed and is_x are used below.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

library work;
use work.PkgStdType.all;
use work.pkg_math.all;
use work.pkg_rloc.all;
use work.pkg_project.all;
use work.pkg_trellis.all;
-- synthesis translate_off
use work.pkg_matlab_test_vectors_viterbi.all;
use work.pkg_to_string.all;
-- synthesis translate_on


entity cpm_viterbi_decoder is
  generic (
    PLACE               : natural := 0;     --default unplaced
    DESIGN_TYPE         : natural := 0;     --currently not supported - perhaps should be enumerated type
    CHECK_STATE_METRICS : boolean := false
  );
  port (
    clk                   : in  std_logic;
    reset                 : in  std_logic;  --synchronously resets source state metrics to 0
    new_viterbi_iteration : in  std_logic;
    branch_metrics        : in  branch_metrics_rounded_ta(NUM_FILTERS*2 downto 1);
    decision_bits         : out std_logic_vector(DECISION_BITS_BUS_WIDTH-1 downto 0); --Used during traceback to determine trellis path taken
    decision_bits_rdy     : out std_logic   --High when decision bits are first valid
  );
end cpm_viterbi_decoder;

architecture rtl of cpm_viterbi_decoder is
  --General purpose search function. Future move to a library?
  --Finds the first occurrence of the item in search_me, returns location row in 1 and column in 2
  --If item not found returns (0,0) - which may not be part of search_me of course.
  --This means the user must assume the item is in the array, which really limits the usefulness of this
  --function. Future fix up.
  function find_first_occurance (constant item      : integer;
                                 constant search_me : integer2D_ta)
    return integer_ta is
    variable found_location : integer_ta(1 to 2) := (0, 0);
  begin
    for row in search_me'range(1) loop
      for col in search_me'range(2) loop
        if search_me(row, col) = item then
          found_location := (row, col);
          return found_location;
        end if;
      end loop;
    end loop;
    assert false report "find_first_occurance: did not find item." severity failure;
    return found_location;
  end function;

  --given a state metric, find which bus cycle it is valid in.
  --This must match the radix4 acs implementation obviously.
  function state_id_to_bus_cycle (constant state_id             : positive;
                                  constant output_state_mapping : integer2D_ta)
    return natural is
    variable output_sm_port : integer := 0;
    variable location_yx    : integer_ta(1 to 2) := (0, 0);
  begin
    --find which cycle the sm is valid on by finding which cycle it was generated on by:
    --1) find out which radix4 unit sm port generated it, by searching in the output_state_mapping
    --table. (there are 4 ports per radix4 unit mapped onto 2 buses).
    location_yx    := find_first_occurance(state_id, output_state_mapping);
    output_sm_port := location_yx(2); --each column is a new port

    --2) ports 1,3 are on the first bus cycle, ports 2,4 are on the second cycle
    if (output_sm_port = 1) or (output_sm_port = 3) then
      return 0;
    else
      return 1;
    end if;
  end function;

  --Each radix4 unit latches state metrics on different bus cycles. (1 of 2 possible currently).
  --Generate the config array that specifies which sms are valid on which bus cycles for a
  --specific radix4 unit.
  function generate_sm_valid_bus_cycle_table (constant radix4_id            : positive;
                                              constant output_state_mapping : integer2D_ta;
                                              constant input_state_mapping  : integer2D_ta)
    return natural_ta is --sm_valid_bus_cycle_t is
    variable valid_bus_cycle : natural_ta(3 downto 0) := (0, 0, 0, 0);
    variable source_state    : integer := 0;
  begin
    --for each source state in the radix4 unit
    for j in 1 to 4 loop
      source_state         := input_state_mapping(radix4_id, j);
      valid_bus_cycle(j-1) := state_id_to_bus_cycle(source_state, output_state_mapping);
    end loop;
    return valid_bus_cycle;
  end function;

  --Generate a table showing which bus cycle each state metric is valid on. Used by the test code
  --at the end of this module.
  function state_id_to_bus_cycle_table (constant output_state_mapping : integer2D_ta)
    return natural_ta is
    constant NUM_STATES : positive := output_state_mapping'length(1) * output_state_mapping'length(2);
    variable table      : natural_ta(NUM_STATES downto 1) := (others => 99);
  begin
    --for each state metric
    for i in 1 to NUM_STATES loop
      --find which bus cycle it is output in
      table(i) := state_id_to_bus_cycle(i, output_state_mapping);
    end loop;
    return table;
  end function;

  --constants/signals
  constant ACS_UNITS_PER_RADIX4_UNIT      : positive := 2;
  constant STATE_METRICS_PER_RADIX4_UNIT  : positive := 4;
  constant BRANCH_METRICS_PER_RADIX4_UNIT : positive := 16;

  signal state_metrics : state_metrics_ta(NUM_VITERBI_STATES downto 1);

  --signal state_metric_rdy : std_logic_vector(NUM_RADIX4_UNITS downto 1);

begin
  --Implement the viterbi trellis processing by instantiating radix-4 units, and wiring up the output
  --state metrics to input state metrics as defined by the CPM trellis.
  place_radix4_acs_units : for radix4_id in 1 to NUM_RADIX4_UNITS generate
    --placement.
    attribute RLOC : string;

    --place R4 units in a square shape, bounded top and bottom by the dcm edges, and from the left
    --bram/mult column.
    constant Y_SLICES_PER_R4    : natural := ((B_SM+1)/2) * 4 + 1; --vhdl rounds down on /
    constant X_SLICES_PER_R4    : natural := 10;  --actually 9, but round up to even number to get L/M slice alignment right
    constant AVAILABLE_Y_SLICES : natural := 160; --XC3SD1800A from bottom dcms edge to top dcms edge.

    constant R4_PER_COLUMN : natural := AVAILABLE_Y_SLICES / Y_SLICES_PER_R4;

    constant IDX : natural := radix4_id - 1; --zero referenced.

    constant X_RLOC : natural := (IDX / R4_PER_COLUMN) * X_SLICES_PER_R4;

    constant Y_RLOC : natural := (IDX mod R4_PER_COLUMN) * Y_SLICES_PER_R4;

    attribute RLOC of radix4_acs_unit : label is pick_string(PLACE, xy_str(X_RLOC, Y_RLOC));

    signal source_state_metrics : state_metrics_ta(STATE_METRICS_PER_RADIX4_UNIT downto 1);

    signal dest_state_metric_bus : state_metrics_ta(ACS_UNITS_PER_RADIX4_UNIT downto 1);

    signal radix4_branch_metrics : branch_metrics_rounded_ta(BRANCH_METRICS_PER_RADIX4_UNIT downto 1);

    --Branch metrics are shared by two radix-4 units. One of the radix4 units is configured to subtract
    --the branch metrics. Radix4 unit Ids map to a (thetaD, a(n-1)) tuple. The radix4 unit that is
    --(thetaD + pi, a(n-1)) is the r4 unit that shares the branch metrics and is configured to subtract them.
    --(thetaD + pi, a(n-1)) maps to an r4 unit id that is NUM_RADIX4_UNITS/2 greater than the first r4 id.
    constant SUBTRACT_BRANCH_METRICS : boolean := radix4_id > (NUM_RADIX4_UNITS/2);

    function bms_radix4_id_func (radix4_id : natural range 1 to NUM_RADIX4_UNITS) return positive is
    begin
      if radix4_id <= (NUM_RADIX4_UNITS/2) then
        return radix4_id;
      else
        return radix4_id - (NUM_RADIX4_UNITS/2);
      end if;
    end function;
    constant bms_radix4_id : natural := bms_radix4_id_func(radix4_id);

  begin
    --assert false report "radix4_id=" & Image(radix4_id) severity warning;

    --radix-4 units that do the add-compare-select viterbi processing
    radix4_acs_unit : entity work.radix4_acs(rtl)
      generic map(
        PLACE                   => 0,
        DESIGN_TYPE             => 1, -- 0 is vhdl, 1 is primitives
        SUBTRACT_BRANCH_METRICS => SUBTRACT_BRANCH_METRICS,
        SM_VALID_BUS_CYCLE      => generate_sm_valid_bus_cycle_table(
                                     radix4_id,
                                     work.pkg_trellis.radix4_output_state_mapping,
                                     work.pkg_trellis.radix4_input_state_mapping))
      port map(
        clk                   => clk,
        reset                 => reset,                 --synchronously resets source state metrics to 0
        new_viterbi_iteration => new_viterbi_iteration, --high for 1 clock cycle, new source sms are ready for input
        source_state_metrics  => source_state_metrics,  --source state metrics - dest sms from previous viterbi iteration
        branch_metrics        => radix4_branch_metrics,
        dest_state_metric     => dest_state_metric_bus, --state metric bus output from this viterbi iteration
        --Used during traceback to determine trellis path taken
        decision_bits         => decision_bits(4*radix4_id-1 downto 4*(radix4_id-1))
      );

    --branch metric wiring. Hookup the branch metric hardware to the right radix-4 unit bm inputs.
    --The upper half of radix4 ids reuse the bms of the lower half
    input_bm_wiring : for radix4_bmid in 1 to BRANCH_METRICS_PER_RADIX4_UNIT generate
    begin
      radix4_branch_metrics(radix4_bmid) <=
        branch_metrics(work.pkg_trellis.radix4_bm_mapping(bms_radix4_id, radix4_bmid));
      --branch_metrics(work.pkg_trellis.radix4_bm_mapping(radix4_id, radix4_bmid));
    end generate;

    --state metric wiring. Connect radix-4 unit state metric bus outputs back to the state metric
    --inputs.
    --map state metrics ids to the radix4 acs inputs
    input_sm_wiring : for radix4_input_id in 1 to STATE_METRICS_PER_RADIX4_UNIT generate
    begin
      source_state_metrics(radix4_input_id)
        <= state_metrics(work.pkg_trellis.radix4_input_state_mapping(radix4_id, radix4_input_id));
    end generate;

    --map radix4 acs outputs to state metric ids
    output_sm_wiring : for radix4_output_id in 1 to STATE_METRICS_PER_RADIX4_UNIT generate
    begin
      state_metrics(work.pkg_trellis.radix4_output_state_mapping(radix4_id, radix4_output_id))
        <= dest_state_metric_bus((radix4_output_id+1)/ACS_UNITS_PER_RADIX4_UNIT);
    end generate;

  end generate;

  assert false report "warning: bms hooked up temporarily - set correctly! also the +ve/-ve acs config too" severity warning;

  --Control the radix-4 units and organise the viterbi iteration timing. This has been "hard coded" and
  --assumes branch metrics are input every 7 clock cycles, and the radix-4 acs unit takes 7 clock cycles
  --to complete a viterbi iteration. Also assumed the clock rate is matched to 7x the symbol rate.
  --In general these assumptions may not be true, e.g. clock faster than required, mismatch between
  --branch metric and viterbi cycles required. And so in the future it may be worthwhile re-writing this
  --control logic to work with these assumptions relaxed.
  --new_viterbi_iteration <= branch_metrics_rdy; --only correct since both viterbi and bm take 7 clocks

  --upper layer responsibility to activate reset at the right time. Future improvement may be to
  --synchronise this with a start of viterbi iteration so results of this reset are predictable. E.g.
  --resetting part way through a cycle will have no effect, since the state metrics get re-written
  --at the end of the viterbi iteration.
  --reset_source_sm <= reset_source_sm;

  --The first state metric and decision bits are ready 6 cycles after the start of the viterbi iteration.
  --The 2nd state metric is ready on the 7th cycle.
  --BETTER IDEA> just use a shift reg, insert new_vit_iter at the front - extended an extra clock,
  --then bits rdy is just the output of the shift reg, shift reg is the length of the pipeline long.
  --Conceptually very easy to understand!!

  --Generate a ready signal that indicates when decision bits are valid. Every time new_viterbi_iteration
  --goes high, a new iteration starts with bus cycles to input the state metrics to the radix-4 acs
  --units. Then the radix-4 pipeline processes the sms and produces the decision bits after some latency.
  --Use a simple shift reg to represent the pipeline: its depth is the same as the real pipeline, and
  --new_viterbi_iteration is fed in the front, and the decision bits ready signal is therefore just
  --the output of the shift reg.
  generate_decision_bits_rdy : process (clk)
    variable shift_reg       : std_logic_vector(R4_SM_INPUT_BUS_CYCLES + R4_PIPELINE_DEPTH - 2 downto 0);
    variable shift_in        : std_logic;
    variable bus_cycle_count : positive range 1 to R4_SM_INPUT_BUS_CYCLES;
  begin
    if rising_edge(clk) then
      --shift reg serial input
      if reset = '1' then
        shift_in        := '0';
        bus_cycle_count := R4_SM_INPUT_BUS_CYCLES; --set default value, and to come out of reset cleanly
      elsif new_viterbi_iteration = '1' then
        shift_in        := '1';
        bus_cycle_count := 1;
      elsif bus_cycle_count < R4_SM_INPUT_BUS_CYCLES then --extend 1's input for duration of bus cycles
        shift_in        := '1';
        bus_cycle_count := bus_cycle_count + 1;
      else
        shift_in := '0';
      end if;
      --shift
      if reset = '1' then --precludes srl16
        shift_reg := (others => '0');
      else
        shift_reg := shift_reg(shift_reg'high-1 downto shift_reg'low) & shift_in;
      end if;
      --output
      decision_bits_rdy <= shift_reg(shift_reg'high);
    end if;
  end process;


  --Verify correct viterbi decoder operation by checking the state metrics calculated at the end of
  --every viterbi iteration. They are checked against state metrics generated in the matlab fixed point
  --model. This code is ignored for synthesis, and can also be "switched off" using the CHECK_STATE_METRICS
  --generic parameter.
  -- synthesis translate_off
  chk_state_metrics : process (clk)
    variable viterbi_iteration           : natural := 1;
    variable error, good_iteration_count : natural := 0;
    variable implementation_answer, matlab_answer, difference : state_metric_t;
    variable valid_state_metrics         : state_metrics_ta(NUM_VITERBI_STATES downto 1);
    constant state_id_to_bus_cycle_tbl   : natural_ta
      := state_id_to_bus_cycle_table(work.pkg_trellis.radix4_output_state_mapping);
    variable bus_cycle      : natural := 0;
    variable got_all_sms    : boolean := false;
    variable pipeline_state : string(13 downto 1) := "xxxxxxxxxxxxx";
    variable cycle          : natural := 0;
    variable test_bus_cycle : natural := 55;
    constant PRECISE_MATCH_REQUIRED : boolean := false; --set to false if a difference of 1 is acceptable
                                                        --input test vectors occasionally have the -ve BMS out
                                                        --by 1 - think it is a rounding issue in the matlab model?
  begin
    if rising_edge(clk) then
      if CHECK_STATE_METRICS then
        if reset = '1' then
          viterbi_iteration    := 0;
          good_iteration_count := 0;
          bus_cycle            := 100;
          for state_id in valid_state_metrics'range loop
            valid_state_metrics(state_id) := (others => '1');
          end loop;
          assert PRECISE_MATCH_REQUIRED
            report "cpm_viterbi_decoder: warning, state metrics are checked to within +/- 2. Precise match not required"
            severity warning;
        elsif is_x(std_logic_vector(state_metrics(1))) then
          --don't run test if inputs not defined
          assert false report "cpm_viterbi_decoder: state metrics xxx so results not checked" severity warning;
        else
          --At the beginning of every viterbi iteration, gather a complete set of state metrics. These
          --are produced over 1 or more cycles, depending on the radix4 acs implementation. I.e. the
          --state_metrics array is only valid for specific states on specific cycles.
          --First determine which cycle we are on.
          --got_all_sms := false;
          if new_viterbi_iteration = '1' then
            bus_cycle := 0;
          else
            bus_cycle := bus_cycle + 1;
          end if;
          --Loop through all metrics, see if we are on the cycle for that sm, if yes then read it.
          if (bus_cycle = 0) or (bus_cycle = 1) then
            for state_id in 1 to NUM_VITERBI_STATES loop
              if (bus_cycle) = state_id_to_bus_cycle_tbl(state_id) then
                valid_state_metrics(state_id) := state_metrics(state_id);
              end if;

            end loop;
          end if;
          --Once got all sms from all the bus cycles, check them against the matlab data.
          got_all_sms := (bus_cycle = 2);
          if got_all_sms then
            if viterbi_iteration /= 0 then --ignore 1st iteration
              error := 0;
              for state_id in 1 to NUM_VITERBI_STATES loop
                implementation_answer := valid_state_metrics(state_id);
                matlab_answer := to_signed(
                  work.pkg_matlab_test_vectors_viterbi.viterbi_state_metrics(viterbi_iteration, state_id),
                  B_SM);
                difference := implementation_answer - matlab_answer;
                if PRECISE_MATCH_REQUIRED then
                  if difference /= 0 then
                    error := error + 1;
                  end if;
                else
                  if (difference < -2) or (difference > 2) then --diff of 1 still lots of errors?
                    error := error + 1;
                  end if;
                end if;
              end loop;
              --assert every viterbi iteration with errors.
              if error = 0 then
                good_iteration_count := good_iteration_count + 1;
              end if;
              assert error = 0
                report "cpm_viterbi_decoder: implementation results do not match matlab results. " &
                       "Iteration=" & work.pkg_to_string.Image(viterbi_iteration) & ", " &
                       "Good iterations=" & work.pkg_to_string.Image(good_iteration_count) & ", " &
                       "State metrics in error this iter=" & work.pkg_to_string.Image(error)
                severity failure;
            end if;
            viterbi_iteration := viterbi_iteration + 1;

          end if;
        end if;
        --Aid debug by providing a string labeling each cycle
        --(literals are padded to 13 characters to match the length of pipeline_state)
        if (reset = '1') or (new_viterbi_iteration = '1') then
          cycle := 0;
        else
          cycle := cycle + 1;
        end if;
        case cycle is
          when 0      => pipeline_state := "sm_bus_cycle0";
          when 1      => pipeline_state := "sm_bus_cycle1";
          when 2      => pipeline_state := "sm_plus_bm   ";
          when 3      => pipeline_state := "compare1     ";
          when 4      => pipeline_state := "mux1         ";
          when 5      => pipeline_state := "compare2     ";
          when 6      => pipeline_state := "mux2         ";
          when others => pipeline_state := "undefined    ";
        end case;
      end if;
      -- test
      --test_bus_cycle := state_id_to_bus_cycle(37, work.pkg_trellis.radix4_output_state_mapping);
    end if;
  end process;
  -- synthesis translate_on

end rtl;
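The timing of `decision_bits_rdy` in the listing above can be checked with a cycle-by-cycle model of the `generate_decision_bits_rdy` process. The Python sketch below is a behavioural approximation, not the thesis code. It assumes `R4_SM_INPUT_BUS_CYCLES = 2` and `R4_PIPELINE_DEPTH = 6`, values suggested by the comments (two bus cycles, first decision bits ready 6 cycles after the start) but defined in a package outside this listing. Reset is not modelled.

```python
def decision_bits_rdy_trace(new_iteration, bus_cycles=2, pipeline_depth=6):
    """Model the ready shift register, one list entry per rising clock edge.

    new_iteration: iterable of 0/1 samples of new_viterbi_iteration.
    bus_cycles, pipeline_depth: assumed values of R4_SM_INPUT_BUS_CYCLES
    and R4_PIPELINE_DEPTH (not given in the listing itself).
    """
    length = bus_cycles + pipeline_depth - 1   # indices (length-1 downto 0)
    shift_reg = [0] * length                   # shift_reg[0] is the newest bit
    count = bus_cycles                         # value the counter holds after reset
    trace = []
    for start in new_iteration:
        # Serial input: a 1 for each state metric input bus cycle.
        if start:
            shift_in, count = 1, 1
        elif count < bus_cycles:
            shift_in, count = 1, count + 1
        else:
            shift_in = 0
        # Shift, then drive the output from the oldest bit.
        shift_reg = [shift_in] + shift_reg[:-1]
        trace.append(shift_reg[-1])
    return trace


# One viterbi iteration every 7 clocks, as the control comments assume.
stimulus = [1 if n % 7 == 0 else 0 for n in range(16)]
print(decision_bits_rdy_trace(stimulus))
```

Under these assumptions the ready flag first goes high on the sixth edge after the edge that samples `new_viterbi_iteration`, and stays high for two clocks, one per destination state metric on each bus.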

D.2.3 Radix-4 Add-Compare-Select Unit
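The radix-4 unit in this section computes each destination state metric as the best of four sums of a source state metric and a branch metric, and it lets the fixed point metrics overflow rather than normalising them. The Python sketch below illustrates that idea with modulo (two's complement) comparison, as described for Hekstra's method in the listing comments. It is a simplified model, not the thesis implementation: the candidates are compared sequentially rather than in the two-stage compare tree of the hardware, the returned index is a hypothetical stand-in for the decision bits, and the 8 bit width is chosen only for the example. The comparison is valid only while the spread between competing metrics stays below half the number range.

```python
def wrap(value, bits):
    """Wrap an integer into the two's complement range of the given width."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


def modulo_greater(a, b, bits):
    """True if a is ahead of b on the number circle (wrapped difference > 0)."""
    return wrap(a - b, bits) > 0


def acs4(ssm, bm, bits):
    """Four-way add-compare-select: return (best metric, index of survivor)."""
    best_idx = 0
    best = wrap(ssm[0] + bm[0], bits)
    for j in range(1, 4):
        candidate = wrap(ssm[j] + bm[j], bits)
        if modulo_greater(candidate, best, bits):
            best, best_idx = candidate, j
    return best, best_idx


def radix4_acs(ssm, bm16, bits):
    """dsm1..dsm4 from four source metrics and sixteen branch metrics."""
    return [acs4(ssm, bm16[4 * k:4 * k + 4], bits) for k in range(4)]


# 120 + 10 = 130 overflows an 8 bit signed metric to -126, yet still wins.
print(acs4([120, 125, 118, 122], [10, 1, 3, 2], bits=8))
```

In the example a plain signed comparison would wrongly prefer 126 over the overflowed value -126, whereas the modulo comparison still selects the first path, which is the property that removes the need for dedicated normalisation logic.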

--**************************************************************************************************
-- Title: Radix-4 viterbi state metrics unit, contains 1 or more ACS units
-- Author: Andrew Bridger
-- Date: 18 September 2008
-- High Level Module Description: This module groups together the viterbi add-compare-select processing
-- for 4 states of a quaternary (M=4) alphabet CPM trellis. This "radix-4" state composition is chosen
-- such that all four output state metrics share the same four source state metrics input to this module.
-- This simplifies the state metric routing complexity because now the routing from any source state metric
-- register is local to this radix-4 unit only.
--
-- The output state metric routing is still an issue however, since it is non-local. That issue is partly
-- addressed at a higher level in the design hierarchy, by appropriate grouping of radix-4 units, and their
-- relative layout.
-- This radix-4 unit contains 1 or more acs units that carry out the required processing.
-- Given source state metrics ssm1-ssm4, destination state metrics dsm1-dsm4 and branch metrics bm1-bm16,
-- this unit calculates:
--   dsm1 = max{ssm1+bm1,  ..., ssm4+bm4}
--   dsm2 = max{ssm1+bm5,  ..., ssm4+bm8}
--   dsm3 = max{ssm1+bm9,  ..., ssm4+bm12}
--   dsm4 = max{ssm1+bm13, ..., ssm4+bm16}
-- For the 2x ACS implementation, dsm1 and then dsm2 are output on dest_state_metric bus 0
-- dsm3 and then 4 are output on dest_state_metric bus 1
-- The new_viterbi_iteration signal synchronises the operations that occur within this radix4 unit. This
-- signal goes active for 1 clock cycle, when the first cycle of state metrics are valid at this
-- module's inputs. The second cycle of state metrics are input on the next cycle.
-- The branch metric inputs must be input in lock step with the state metric input cycles.
-- Various implementations are supported using if..generate statements and a generic parameter to select
-- the desired implementation.
-- Notes/Limitations: 1) Consider putting decision bits in a known state at reset. Perhaps not essential
--                       but a nice thing to do for downstream modules?
--                    2) Consider moving scalarize and vectorize into a utilities package.
-- silly functions so that std_logic can be applied to unconstrained vector ports. In future move
-- these into a package. These will be required anytime you have a component defined with
-- unconstrained slv, yet there may be cases where you want to pass in just 1 bit, e.g. a std_logic.
--
-- Synthesizable: Yes
--
-- Testbench: radix4_acs_tb
--
-- Note: The version control system in use is the repository for information regarding bug fixes,
-- versions, feature additions etc.
--
--**************************************************************************************************


library work;
use work.PkgStdType.all;
use work.pkg_project.all;
use work.pkg_rloc.all;
use work.pkg_math.all;


entity radix4_acs is
  generic (
    PLACE                   : natural := 0;     --default unplaced
    DESIGN_TYPE             : natural := 0;     --0 is vhdl, 1 is xilinx primitives
    SUBTRACT_BRANCH_METRICS : boolean := false; --if true subtract rather than add bms to state metrics
    --Configure the bus cycle to which each input state metric is registered on.
    SM_VALID_BUS_CYCLE      : natural_ta(3 downto 0) := (0, 0, 0, 0)
  );
  port (
    clk                   : in  std_logic;
    reset                 : in  std_logic; --synchronously resets source state metrics to 0
    new_viterbi_iteration : in  std_logic; --high for 1 clock cycle, when first source sms are ready for input
    source_state_metrics  : in  state_metrics_ta(3 downto 0); --source state metrics - dest sms from previous viterbi iteration
    branch_metrics        : in  branch_metrics_rounded_ta(15 downto 0);
    dest_state_metric     : out state_metrics_ta(1 downto 0); --state metric bus output from this viterbi iteration
    decision_bits         : out std_logic_vector(3 downto 0)  --Used during traceback to determine trellis path taken
  );
end radix4_acs;

architecture rtl of radix4_acs is
  constant CHECK_ANSWER       : boolean := true; --set to true for debug, false may speed up simulation
  constant PLACE_SUB_INSTANCE : natural := PLACE;

  --synthesis translate_off
  signal pipeline_state : string(13 downto 1) := "xxxxxxxxxxxxx";
  --synthesis translate_on

  --Control to override PLACE control to force sub-instances to be placed.
  --constant FORCE_SUB_INSTANCE_PLACE_TRUE : boolean := true; -- <-- normal value should be false.
  --constant PLACE_SUB_INSTANCE : natural := PLACE when (not FORCE_SUB_INSTANCE_PLACE_TRUE) else 1;


begin

  --synthesis translate_off
  --debug aid that prints the pipeline cycle into a string.
  --(literals are padded to 13 characters to match the length of pipeline_state)
  pipeline_state_string : process (clk)
    variable cycle : integer := 0;
  begin
    if rising_edge(clk) then
      if (reset = '1') or (new_viterbi_iteration = '1') then
        cycle := 0;
      else
        cycle := cycle + 1;
      end if;
      case cycle is
        when 0      => pipeline_state <= "sm_bus_cycle0";
        when 1      => pipeline_state <= "sm_bus_cycle1";
        when 2      => pipeline_state <= "sm_plus_bm   ";
        when 3      => pipeline_state <= "compareA     ";
        when 4      => pipeline_state <= "muxA         ";
        when 5      => pipeline_state <= "compareB     ";
        when 6      => pipeline_state <= "muxB         ";
        when others => pipeline_state <= "xxxxxxxxxxxxx";
      end case;
    end if;
  end process;
  --synthesis translate_on

  --VHDL VERSION OF 2xACS RADIX4 UNIT. (Xilinx primitives version further below.)
  vhdl : if DESIGN_TYPE = 0 generate
    signal source_state_metrics_reg : state_metrics_ta(3 downto 0);
    signal select_first_bm          : boolean;
  begin
    --Use two ACS units to generate the 4 state metrics outputs from this radix-4 unit. Each ACS unit
    --outputs two state metrics sequentially.
    radix4_acs : for i in 0 to 1 generate
    begin
      --perform add-compare-select processing for 2 states using 1 set of hardware. Pipeline is 6 deep, plus
      --1 clock for the 2nd state, so takes 7 clocks for complete processing.
      --Path metrics are stored as fixed point 2's complement numbers and have limited range. By using
      --Hekstra's method, dedicated normalisation logic is avoided by allowing the metrics to overflow and
      --performing the find-the-best-metric operation using 2's complement arithmetic. See the Matlab model for
      --more details.
      four_way_acs_vhdl_2states : process(clk)
        variable keep_pm23, keep_pm1, keep_pm3 : boolean;
        variable keep_pm1_pipe1, keep_pm1_pipe2, keep_pm3_pipe1, keep_pm3_pipe2 : boolean;
        variable pm0_reg, pm1_reg, pm2_reg, pm3_reg : state_metric_t;
        variable pm01_reg, pm23_reg : state_metric_t;
        variable pm0, pm1, pm2, pm3, pm01, pm23 : state_metric_t;
        type branch_metric_muxed_ta is array (3 downto 0) of branch_metric_rounded_t;
        variable branch_metric : branch_metric_muxed_ta;
        --variable dest_state_metric_no_reg : state_metric_t;

      begin
        if rising_edge(clk) then
          --READ BOTTOM UP
          --using variables so read from the bottom up to get a sense of the data flow and pipelining.
          --The comments also make more sense if read from bottom up.
          --TODO DELETE 6) Register the output (destination) best metric. Gives whole flop to flop timing path for routing.
          --dest_state_metric(i) <= dest_state_metric_no_reg;

          --7) Finally, select (mux) the best metric
          if keep_pm23 then
            dest_state_metric(i) <= pm23_reg;
          else
            dest_state_metric(i) <= pm01_reg;
          end if;
          --Output the decision bits based on which path metric was kept. There are two decision bits for
          --each ACS unit. Encoded as follows.
          --pm0 - "00"
          --pm1 - "01"
          --pm2 - "10"
          --pm3 - "11"
          decision_bits(1 + 2*i downto 2*i) <= "00";
          if keep_pm23 then
            decision_bits(1 + 2*i) <= '1'; --msb
            if keep_pm3_pipe2 then
              decision_bits(0 + 2*i) <= '1'; --lsb
            end if;
          else
            if keep_pm1_pipe2 then
              decision_bits(0 + 2*i) <= '1'; --lsb
            end if;
          end if;
          --6) Perform the subtraction comparison between the best 2 path metrics
          keep_pm23 := (pm01 - pm23) < 0;
          pm01_reg := pm01;
          pm23_reg := pm23;
          keep_pm1_pipe2 := keep_pm1_pipe1;
          keep_pm3_pipe2 := keep_pm3_pipe1;
          --5) Select (mux) the best metric based on the sign bit of the subtraction
          if keep_pm1 then
            pm01 := pm1_reg;
          else
            pm01 := pm0_reg;
          end if;
          if keep_pm3 then
            pm23 := pm3_reg;
          else
            pm23 := pm2_reg;
          end if;
          keep_pm1_pipe1 := keep_pm1; --pipeline the decision bits - used later in the pipeline
          keep_pm3_pipe1 := keep_pm3;
          --4) Now start the process of finding the "best" path metric. This is done in two stages, firstly
          --finding the best of each of two pairs. And then in the second stage finding the best of the
          --remaining two path metrics.
          --The key part of the matlab model for finding the "best" path metric is:
          --  subtractAns = val - x(i+1);
          --  if subtractAns < 0 %val is smaller
          --    val = x(i+1);
          --    idx = i+1;
          --This is just a subtraction followed by the <0 check which is just looking at the sign bit of the
          --subtraction result.
          --Perform the subtraction for each of two pairs of the state metrics and pass the sign bits
          --onto the next stage. Also store the path metrics in registers ready for the next pipeline stage.
          keep_pm1 := (pm0 - pm1) < 0;
          keep_pm3 := (pm2 - pm3) < 0;
          pm0_reg := pm0;
          pm1_reg := pm1;
          pm2_reg := pm2;
          pm3_reg := pm3;

          --3) accumulate the path metrics allowing 2's complement overflow. pm = state metric + branch metric
          --Since this acs unit processes two states in serial, need to mux in the branch metrics for the
          --2nd state. (The mux and + map into a single LUT level - at least they should!)
          if select_first_bm then
            branch_metric := (branch_metrics(3+8*i), branch_metrics(2+8*i), branch_metrics(1+8*i), branch_metrics(0+8*i));
          else
            branch_metric := (branch_metrics(7+8*i), branch_metrics(6+8*i), branch_metrics(5+8*i), branch_metrics(4+8*i));
          end if;
          --branch metric computation halved by recognising that for every branch metric, there is an
          --equivalent -ve one. All 16 branch metrics input to a radix-4 unit are either added or
          --subtracted, so this add/subtract configuration is static -> compile time.
          if SUBTRACT_BRANCH_METRICS then
            pm0 := source_state_metrics_reg(0) - branch_metric(0); --(does synth tool automatically sign extend?)
            pm1 := source_state_metrics_reg(1) - branch_metric(1);
            pm2 := source_state_metrics_reg(2) - branch_metric(2);
            pm3 := source_state_metrics_reg(3) - branch_metric(3);
          else --add them
            pm0 := source_state_metrics_reg(0) + branch_metric(0);
            pm1 := source_state_metrics_reg(1) + branch_metric(1);
            pm2 := source_state_metrics_reg(2) + branch_metric(2);
            pm3 := source_state_metrics_reg(3) + branch_metric(3);
          end if;
          --2), 1) Two bus cycles to register the source state metrics. See below.
        end if;
      end process;
    end generate;

    --The source state metrics must be registered. (Stage 1 and 2 of the pipeline).
    --Share one set of these registers amongst both ACS units.
    --Since this radix4 unit outputs 2 sms on one bus, there are two bus cycles. Therefore, each sm must
    --be registered on the correct cycle. That is specified by a generic parameter.
    register_source_sms : process(clk)
      variable shift_reg : std_logic_vector(1 downto 0);
      variable shift_in  : std_logic;
      constant SELECT_FIRST_BM_CYCLE : natural := 1;
      variable load : std_logic_vector(3 downto 0); --debug
    begin
      --READ TOP DOWN
      if rising_edge(clk) then
        --apologies for following being a little "tricky". Intended to be easy to translate into
        --primitives. The acs processing pipeline is represented with a shift register. Each shift
        --register location corresponds to a different stage in the pipeline. New_viterbi_iteration is
        --the shift reg serial input, and by simply looking at the shift reg contents we can figure
        --out what pipeline stage we are at, and what pipeline control signals need to be generated.
        --Note, currently only need to know about the first 2 stages, so the shift reg doesn't need to be
        --the complete pipeline depth.
        shift_in := new_viterbi_iteration;
        if reset = '1' then
          shift_reg := (others => '0');
        else
          shift_reg := shift_reg(shift_reg'high-1 downto 0) & shift_in;
        end if;
        --Decode the pipeline state and generate the required control signals.
        --1) sm bus cycle 1, 2) sm bus cycle 2, 3) add 1st set of bms, 4) now mux in the 2nd set of bms
        --A) branch metric muxing.
        if shift_reg(SELECT_FIRST_BM_CYCLE) = '1' then --sm bus cycle 2
          select_first_bm <= true; --setup to calc with 1st set of bms
        else
          select_first_bm <= false;
        end if;
        --B) input state metric registering.
        --Loop through each input state metric, and reset or register it on the correct cycle.
        for i in 0 to 3 loop
          load(i) := '0'; --debug
          if reset = '1' then
            source_state_metrics_reg(i) <= (others => '0');
          elsif shift_reg(SM_VALID_BUS_CYCLE(i)) = '1' then
            load(i) := '1'; --debug
            source_state_metrics_reg(i) <= source_state_metrics(i);
          end if;
        end loop;
      end if;
    end process;
  end generate;

  --XILINX INSTANTIATED PRIMITIVES VERSION OF 2xACS RADIX4 UNIT
  --Read the following from here (top) down to get a feel for the data flow.
  --The data flow pipeline is: register state metrics (2 cycles) -> path metrics = state metrics + bms ->
  -- first subtract/mux (2) -> second subtract/mux level (2).
  primitives : if DESIGN_TYPE = 1 generate
    --Placement
    attribute RLOC_ORIGIN : string;
    attribute RLOC        : string;
    --DEBUG temp: RLOC origin to help with placement debug. Without it mapper seems to split the rpm.
    --Note: RLOC_ORIGIN will find the lowest left most element in the RPM to which pipeline_state belongs,
    --and map that to the rloc origin specified.
    --attribute RLOC_ORIGIN of pipeline_state : label is pick_string(PLACE, xy_str(4,4));

    --standard y size (in slices) of the components we are placing.
    constant YSTEP : natural := ((B_SM+1) / 2); --+1 required to round up to nearest whole slice
    constant XSTEP : natural := 1; --todo support -ve 1 for right to left layout...
    constant ygrid : natural_ta(3 downto 0) := (3*YSTEP+1, 2*YSTEP+1, YSTEP, 0);
    --Layout is state metric regs through to final mux left to right or right to left. The four
    --state metric regs are laid out vertically, with an empty slice row in the middle to pipeline
    --control signals, misc bits and pieces of logic.
    attribute RLOC of select_first_bm_flop0A : label is pick_string(PLACE, xy_str(XSTEP, 2*YSTEP));
    attribute RLOC of select_first_bm_flop0B : label is pick_string(PLACE, xy_str(XSTEP, 2*YSTEP));
    attribute RLOC of select_first_bm_flop1A : label is pick_string(PLACE, xy_str(3*XSTEP, 2*YSTEP));
    attribute RLOC of select_first_bm_flop1B : label is pick_string(PLACE, xy_str(3*XSTEP, 2*YSTEP));
    --It is not obvious if 4 flops is an advantage over 2 flops based on timing results. 1->2 flops
    --certainly brings a benefit.
    constant bms_plus_sms_xpos : natural_ta(1 downto 0) := (3*XSTEP, XSTEP);
    attribute RLOC of pipeline_state : label is pick_string(PLACE, xy_str(0, 2*YSTEP));
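
    --Worked example of the placement grid above. Illustrative only: it assumes B_SM = 11,
    --which is consistent with the 11 bit sm+bm adders mentioned further below but is not
    --stated in this file.
    --  YSTEP = (11+1)/2 = 6 slices per component
    --  ygrid(3) = 3*6+1 = 19, ygrid(2) = 2*6+1 = 13, ygrid(1) = 6, ygrid(0) = 0
    --Under that assumption the four rows start at slice rows 0, 6, 13 and 19, each 6 slices
    --tall, which leaves row 2*YSTEP = 12 free for the control flops placed at y = 2*YSTEP.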

    --The following type has been defined here so that it cannot be used throughout the general design.
    --(instead should use the one defined in pkg_project).
    --But we need an slv rather than signed definition for this low level primitive stuff.
    subtype state_metric_slv_t is std_logic_vector(B_SM-1 downto 0);
    type state_metric_slv_ta is array (natural range <>) of state_metric_slv_t;
    signal source_state_metrics_reg : state_metric_slv_ta(3 downto 0);
    constant ZERO_SM : bit_vector(B_SM-1 downto 0) := (others => '0');

    signal shift_reg : std_logic_vector(1 downto 0);
    signal next_pipeline_cycle : std_logic_vector(2 downto 0);
    signal select_first_bm : std_logic;
    signal load_sm : std_logic_vector(3 downto 0);
    constant SELECT_FIRST_BM_CYCLE : natural := 2;
    signal select_first_bm_pipe : std_logic_vector(3 downto 0);


  begin
    --PIPELINE CONTROL STATE MACHINE
    --apologies for following being a little "tricky". Intended to be easy to translate into
    --primitives. The acs processing pipeline is represented with a shift register. Each shift
    --register location corresponds to a different stage in the pipeline. New_viterbi_iteration is
    --the shift reg serial input, and by simply looking at the shift reg contents we can figure
    --out what pipeline stage we are at, and what pipeline control signals need to be generated.
    --Note, currently only need to output controls on the first 2 stages, so the shift reg doesn't
    --need to be the complete pipeline depth.
    pipeline_state : entity work.shift_reg(xprim)
      generic map(
        SHIFT_DIRECTION => LEFT, --towards msb)
        PLACE           => PLACE_SUB_INSTANCE)
      port map(
        clk        => clk,
        ce         => '1',
        load       => reset,
        din        => "00", --ZERO_SHIFT_REG,
        dout       => shift_reg,
        serial_in  => new_viterbi_iteration,
        serial_out => open);

    --pipeline cycle must show the next pipeline cycle so that we can decode it and produce controls ready
    --for the next cycle.
    next_pipeline_cycle <= shift_reg & new_viterbi_iteration;

    --Decode the pipeline state and generate the required control signals.
    --1) sm bus cycle one, 2) sm bus cycle two, 3) add 1st set of bms, 4) now mux in the 2nd set of bms
    --A) branch metric muxing.
    select_first_bm <= '1' when next_pipeline_cycle(SELECT_FIRST_BM_CYCLE-1) = '1' else '0';
    --This signal is high fanout. i.e. 8 sm+bm adders and at 11 bits each that's 88 loads. Duplicate
    --registers to reduce the fanout to bring this net off the timing critical path. (select_first_bm is
    --generated 1 cycle early, these buffer flops then delay by 1 cycle to align it to the pipeline
    --correctly).
    select_first_bm_flop0A : FDRSE
      generic map (INIT => '0')
      port map(
        C  => clk,
        D  => select_first_bm,
        Q  => select_first_bm_pipe(0),
        CE => '1',
        R  => '0',
        S  => '0');

    select_first_bm_flop0B : FDRSE
      generic map (INIT => '0')
      port map( C => clk,

    select_first_bm_flop1A : FDRSE
      generic map (INIT => '0')
      port map( C => clk,

    select_first_bm_flop1B : FDRSE
      generic map (INIT => '0')
      port map( C => clk,


    --B) The source state metrics must be registered. (Stage 1 and 2 of the pipeline).
    --Share one set of these registers amongst both ACS units.
    --Since this radix4 unit outputs 2 sms on one bus, there are two bus cycles. Therefore, each sm must
    --be registered on the correct cycle. That is specified by a generic parameter.
    --Loop through each input state metric, and reset or register it on the correct cycle.
    load_sm(0) <= next_pipeline_cycle(SM_VALID_BUS_CYCLE(0));
    load_sm(1) <= next_pipeline_cycle(SM_VALID_BUS_CYCLE(1));
    load_sm(2) <= next_pipeline_cycle(SM_VALID_BUS_CYCLE(2));
    load_sm(3) <= next_pipeline_cycle(SM_VALID_BUS_CYCLE(3));

    --ONE SET OF STATE METRIC REGISTERS
    sm_regs : for i in 0 to 3 generate
      attribute RLOC of sm_reg : label is pick_string(PLACE, xy_str(0, ygrid(i)));
    begin
      sm_reg : entity work.reg(xprim)
        generic map(
          INIT  => ZERO_SM,
          PLACE => PLACE_SUB_INSTANCE) --see pkg_rloc for definition
        port map(
          clk   => clk,
          ce    => load_sm(i),
          d     => std_logic_vector(source_state_metrics(i)),
          q     => source_state_metrics_reg(i),
          reset => reset,
          set   => '0');
    end generate;

    --assert PLACE = 1 report "PLACE not = to 1" severity warning;
    --assert PLACE = 0 report "PLACE not = to 0" severity warning;
    --assert PLACE_SUB_INSTANCE = 1 report "PLACE_SUB_INST not = to 1" severity warning;
    --assert PLACE_SUB_INSTANCE = 0 report "PLACE_SUB_INST not = to 0" severity warning;
    --TWO ADD-COMPARE-SELECT UNITS
    --Use two ACS units to generate the 4 state metrics outputs for this radix-4 unit. Each ACS unit
    --outputs two state metrics sequentially.
    radix4_acs : for i in 0 to 1 generate
      --placement
      --1st layer of subtract next to the sm+bm that generates one of its operands. The other operand
      -- is in a different ygrid row which is not ideal and is probably a cause of the subtract often
      --being on the timing critical path.
      attribute RLOC of subtract_A : label is pick_string(PLACE, xy_str(bms_plus_sms_xpos(i)+1, ygrid(1)));
      attribute RLOC of subtract_B : label is pick_string(PLACE, xy_str(bms_plus_sms_xpos(i)+1, ygrid(2)));
      --The first mux and all later stages are grouped such that the first ACS unit is on the bottom
      --two ygrid rows, and the second ACS unit is on the top two ygrid rows in an attempt to minimise
      --vertical routing distance and crossovers.

      --i j
      constant pm_regs_xpos : integer_2D_ta(1 downto 0, 3 downto 0)
        := ((4*XSTEP, 2*XSTEP, 6*XSTEP, 5*XSTEP), (6*XSTEP, 5*XSTEP, 4*XSTEP, 2*XSTEP));
      constant pm_regs_ypos : integer_2D_ta(1 downto 0, 3 downto 0)
        := ((ygrid(3), ygrid(3), ygrid(2), ygrid(2)), (ygrid(1), ygrid(1), ygrid(0), ygrid(0)));

      --Mux A/B and each one's associated input regs are put on the same row to minimise routing.
      attribute RLOC of mux_A : label is pick_string(PLACE, xy_str(5*XSTEP, ygrid(3*i)));
      attribute RLOC of mux_B : label is pick_string(PLACE, xy_str(7*XSTEP, ygrid(i+1)));
      attribute RLOC of subtract_final : label is pick_string(PLACE, xy_str(8*XSTEP, ygrid(i+1)));
      --Final mux put at far right hand side so the state metrics can route out here. The two regs
      --that feed that final mux are placed in the same row to minimise routing.
      attribute RLOC of reg_pm01 : label is pick_string(PLACE, xy_str(6*XSTEP, ygrid(3*i)));
      attribute RLOC of reg_pm23 : label is pick_string(PLACE, xy_str(7*XSTEP, ygrid(3*i)));
      attribute RLOC of mux_final : label is pick_string(PLACE, xy_str(8*XSTEP, ygrid(3*i)));
      --Place the miscellaneous control flops and decision bits flops in the empty slice row that
      --runs through the middle of the rpm.
      --Decision bits placed on the outside edge, opposite the state metrics. These will route out to
      --traceback.
      attribute RLOC of keep_pm1_pipe1_flop : label is pick_string(PLACE, xy_str(2*XSTEP+2*i*XSTEP, 2*YSTEP));
      attribute RLOC of keep_pm3_pipe1_flop : label is pick_string(PLACE, xy_str(2*XSTEP+2*i*XSTEP, 2*YSTEP));
      attribute RLOC of keep_pm1_pipe2_flop : label is pick_string(PLACE, xy_str(5*XSTEP+i, 2*YSTEP));
      attribute RLOC of keep_pm3_pipe2_flop : label is pick_string(PLACE, xy_str(5*XSTEP+i, 2*YSTEP));
      attribute RLOC of decision_bits_lsb : label is pick_string(PLACE, xy_str(7*XSTEP+i, 2*YSTEP));
      attribute RLOC of decision_bits_msb : label is pick_string(PLACE, xy_str(7*XSTEP+i, 2*YSTEP));

      --Silly functions so that std_logic can be applied to unconstrained vector ports. In future move
      --these into a package. These will be required anytime you have a component defined with
      --unconstrained slv, yet there may be cases where you want to pass in just 1 bit, e.g. a std_logic.
      function vectorize(s : std_logic) return std_logic_vector is
        variable v : std_logic_vector(0 downto 0);
      begin
        v(0) := s;
        return v;
      end;
      function scalarize(v : in std_logic_vector) return std_logic is
      begin
        assert v'length = 1 report "scalarize: output port must be single bit!" severity FAILURE;
        return v(v'LEFT);
      end;
      signal decision_bits_lsb_dummy : std_logic_vector(0 downto 0);
      --ACS
      signal pm, pm_reg : state_metric_slv_ta(3 downto 0);
      signal dest_state_metric_slv : state_metric_slv_t;
      signal keep_pm23, keep_pm1, keep_pm3 : std_logic;
      signal keep_pm1_pipe1, keep_pm1_pipe2 : std_logic;
      signal keep_pm3_pipe1, keep_pm3_pipe2 : std_logic;
      signal pm01, pm23, pm01_reg, pm23_reg : state_metric_slv_t;
      signal pm0_minus_pm1, pm2_minus_pm3 : state_metric_slv_t;
      signal pm01_minus_pm23 : state_metric_slv_t;
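
      --Illustrative sketch only, not part of the original design: the function name
      --decision_code and its parameter names are hypothetical. It gathers the two-bit
      --decision encoding used by this unit (pm0 - "00", pm1 - "01", pm2 - "10", pm3 - "11")
      --into one place. The msb records which pair won the final comparison, and the lsb
      --is the first stage decision of the winning pair.
      function decision_code(pair23_won, pm1_won, pm3_won : std_logic) return std_logic_vector is
        variable code : std_logic_vector(1 downto 0);
      begin
        code(1) := pair23_won;
        if pair23_won = '1' then
          code(0) := pm3_won; --winner is pm2 or pm3
        else
          code(0) := pm1_won; --winner is pm0 or pm1
        end if;
        return code;
      end function;
      --For example, decision_code('1', '0', '1') returns "11", i.e. pm3 was kept.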

    begin
      --perform add-compare-select processing for 2 states using 1 set of hardware. Pipeline is 6 deep, plus
      --1 clock for the 2nd state, so takes 7 clocks for complete processing.
      --Path metrics are stored as fixed point 2's complement numbers and have limited range. By using
      --Hekstra's method, dedicated normalisation logic is avoided by allowing the metrics to overflow and
      --performing the find-the-best-metric operation using 2's complement arithmetic. See the Matlab model for
      --more details.

      --Read top down to get the pipeline flow. Number on the left is the pipeline stage.
      --1), 2), Two bus cycles to register the source state metrics. See above.
      --3) accumulate the path metrics allowing 2's complement overflow. pm = state metric + branch metric
      --Since this acs unit processes two states in serial, need to mux in the branch metrics for the
      --2nd state. (The mux and + map into a single LUT level).
      --branch metric computation halved by recognising that for every branch metric, there is an
      --equivalent -ve one. All 16 branch metrics input to a radix-4 unit are either added or
      --subtracted, so this add/subtract configuration is static -> compile time.
      bms_plus_sms : for j in 0 to 3 generate
        signal first_bm_sign_extended, second_bm_sign_extended : std_logic_vector(source_state_metrics_reg(0)'range);
        attribute RLOC of bm_plus_sm : label is pick_string(PLACE, xy_str(bms_plus_sms_xpos(i), ygrid(j)));
      begin
        --both operands must be same length so need to sign extend the branch metrics. (assumption is
        --that branch metric wordlength always less than state metric wordlength).
        --numeric_std's function "resize" does this beautifully for you: it's just wires, no logic
        first_bm_sign_extended  <= std_logic_vector(ieee.numeric_std.resize(branch_metrics(j + 8*i), source_state_metrics_reg(0)'length));
        second_bm_sign_extended <= std_logic_vector(ieee.numeric_std.resize(branch_metrics(j+4 + 8*i), source_state_metrics_reg(0)'length));
        bm_plus_sm : entity work.adder_sub_mux(xprim)
          generic map(
            REG          => true,
            INIT         => ZERO_SM,
            SUBTRACT     => SUBTRACT_BRANCH_METRICS,
            CHECK_ANSWER => CHECK_ANSWER,
            PLACE        => PLACE_SUB_INSTANCE)
          port map(
            clk   => clk,
            sel   => select_first_bm_pipe(2*i + j/2), --0 selects b, 1 selects c.
            a     => source_state_metrics_reg(j),
            b     => second_bm_sign_extended,
            c     => first_bm_sign_extended,
            ans   => pm(j), --ans = a +/- (b or c) (sel selects b or c)
            ce    => '1', --todo, only enable for 2 cycles to help with debug and may reduce power (less switching activity)
            reset => '0',
            set   => '0');
      end generate;

      --4) Now start the process of finding the "best" path metric. This is done in two stages, firstly
      --finding the best of each of two pairs. And then in the second stage finding the best of the
      --remaining two path metrics.
      --The key part of the matlab model for finding the "best" path metric is:
      --  subtractAns = val - x(i+1);
      --  if subtractAns < 0 %val is smaller
      --    val = x(i+1);
      --    idx = i+1;
      --This is just a subtraction followed by the <0 check which is just looking at the sign bit of the
      --subtraction result.
      --Perform the subtraction for each of two pairs of the state metrics and pass the sign bits
      --onto the next stage. Also store the path metrics in registers ready for the next pipeline stage.
      subtract_A : entity work.adder_sub(xprim)
        generic map(
          REG            => false,
          REG_CARRY_ONLY => true,
          INIT           => ZERO_SM,
          CHECK_ANSWER   => CHECK_ANSWER,
          PLACE          => PLACE_SUB_INSTANCE)
        port map(
          clk      => clk,
          subtract => '1',
          a        => pm(0),
          b        => pm(1),
          ans      => pm0_minus_pm1, --ans = a - b
          ce       => '1',
          reset    => '0',
          set      => '0');

      -- If sign bit of result is 1, then -ve result and so keep pm1. Register it.
      keep_pm1 <= pm0_minus_pm1(pm0_minus_pm1'high);
      -- keep_pm1_flop : FDRSE
      -- generic map (INIT => '0')
      -- port map ( C => clk,
      --            D => pm0_minus_pm1(pm0_minus_pm1'high),
      --            Q => keep_pm1,
      --            CE => '1',
      --            R => '0',
      --            S => '0');

      --2nd subtractor, now compare pm2 and pm3
      subtract_B : entity work.adder_sub(xprim)
        generic map(
          REG            => false,
          REG_CARRY_ONLY => true,
          INIT           => ZERO_SM,
          CHECK_ANSWER   => CHECK_ANSWER,
          PLACE          => PLACE_SUB_INSTANCE)
        port map(
          clk      => clk,
          subtract => '1',
          a        => pm(2),
          b        => pm(3),
          ans      => pm2_minus_pm3, --ans = a - b
          ce       => '1',
          reset    => '0',
          set      => '0');

      keep_pm3 <= pm2_minus_pm3(pm2_minus_pm3'high);
      -- keep_pm3_flop : FDRSE
      -- generic map (INIT => '0')
      -- port map ( C => clk,
      --            D => pm2_minus_pm3(pm2_minus_pm3'high),
      --            Q => keep_pm3,
      --            CE => '1',
      --            R => '0',
      --            S => '0');

      --now register all four pms, to balance this pipeline stage.
      pm_regs : for j in 0 to 3 generate
        attribute RLOC of pm_reg_comp : label is pick_string(PLACE, xy_str(pm_regs_xpos(i,j), pm_regs_ypos(i,j)));
      begin
        pm_reg_comp : entity work.reg(xprim)
          generic map(

          port map(
            clk   => clk,
            ce    => '1',
            d     => pm(j),
            q     => pm_reg(j),
            reset => '0',
            set   => '0');

      end generate;

−−5) S e l e c t (mux) t h e b e s t m e t r i c b a s e d on t h e s i g n b i t o f t h es u b t r a c t i o n

mux A : e n t i t y work . mux2( xprim )generic map(

REG => true ,INIT => ZERO SM,PLACE => PLACE SUB INSTANCE)

port map(c l k => clk ,s e l => keep pm1 ,a => pm reg ( 0 ) ,b => pm reg ( 1 ) ,ans => pm01 ,ce => ’ 1 ’ ,r e s e t => ’ 0 ’ ,s e t => ’ 0 ’ ) ;

mux B : e n t i t y work . mux2( xprim )generic map(


port map(c l k => clk ,s e l => keep pm3 ,a => pm reg ( 2 ) ,b => pm reg ( 3 ) ,ans => pm23 ,ce => ’ 1 ’ ,r e s e t => ’ 0 ’ ,s e t => ’ 0 ’ ) ;

−−p i p e l i n e t h e d e c i s i o n b i t s − used l a t e r in t h e p i p e l i n ekeep pm1 pipe1 flop : FDRSEgeneric map ( INIT => ’ 0 ’ )port map( C => clk ,

D => keep pm1 ,Q => keep pm1 pipe1 ,CE => ’ 1 ’ ,R => ’ 0 ’ ,S => ’ 0 ’ ) ;

keep pm3 pipe1 flop : FDRSEgeneric map ( INIT => ’ 0 ’ )port map( C => clk ,

D => keep pm3 ,Q => keep pm3 pipe1 ,CE => ’ 1 ’ ,R => ’ 0 ’ ,S => ’ 0 ’ ) ;

      --6) Perform the subtraction comparison between the best 2 path metrics
      subtract_final : entity work.adder_sub(xprim)
        generic map(
          REG            => false,
          REG_CARRY_ONLY => true,
          INIT           => ZERO_SM,
          CHECK_ANSWER   => CHECK_ANSWER,
          PLACE          => PLACE_SUB_INSTANCE)
        port map(
          clk      => clk,
          subtract => '1',
          a        => pm01,
          b        => pm23,
          ans      => pm01_minus_pm23,  --ans = a - b
          ce       => '1',
          reset    => '0',
          set      => '0');

      keep_pm23 <= pm01_minus_pm23(pm01_minus_pm23'high);
      -- keep_pm23_flop : FDRSE
      -- generic map (INIT => '0')
      -- port map (C  => clk,
      --           D  => pm01_minus_pm23(pm01_minus_pm23'high),
      --           Q  => keep_pm23,
      --           CE => '1',
      --           R  => '0',
      --           S  => '0');

      --Pipeline the pms and decision bits
      --pms
      reg_pm01 : entity work.reg(xprim)
        generic map(

        port map(
          clk   => clk,
          ce    => '1',
          d     => pm01,
          q     => pm01_reg,
          reset => '0',
          set   => '0');

      reg_pm23 : entity work.reg(xprim)
        generic map(

        port map(
          clk   => clk,
          ce    => '1',
          d     => pm23,
          q     => pm23_reg,
          reset => '0',
          set   => '0');

      --decision bits
      keep_pm1_pipe2_flop : FDRSE
        generic map (INIT => '0')
        port map(
          C  => clk,
          D  => keep_pm1_pipe1,
          Q  => keep_pm1_pipe2,
          CE => '1',
          R  => '0',
          S  => '0');

      keep_pm3_pipe2_flop : FDRSE
        generic map (INIT => '0')
        port map(
          C  => clk,
          D  => keep_pm3_pipe1,
          Q  => keep_pm3_pipe2,
          CE => '1',
          R  => '0',
          S  => '0');

      --7) Finally, select (mux) the best metric, and output the decision bits indicating which of the four
      --candidate path metrics was selected.
      mux_final : entity work.mux2(xprim)
        generic map(

        port map(
          clk   => clk,
          sel   => keep_pm23,
          a     => pm01_reg,
          b     => pm23_reg,
          ans   => dest_state_metric_slv,
          ce    => '1',
          reset => '0',
          set   => '0');

      dest_state_metric(i) <= signed(dest_state_metric_slv);

      --Output the decision bits based on which path metric was kept. There are two decision bits for
      --each ACS unit. Encoded as follows.
      --pm0 - "00"
      --pm1 - "01"
      --pm2 - "10"
      --pm3 - "11"
      decision_bits_msb : FDRSE
        generic map (INIT => '0')
        port map(
          C  => clk,
          D  => keep_pm23,
          Q  => decision_bits(1 + 2*i),
          CE => '1',
          R  => '0',
          S  => '0');

      decision_bits_lsb : entity work.mux2(xprim)
        generic map(
          REG   => true,
          INIT  => "0",
          PLACE => PLACE_SUB_INSTANCE)
        port map(
          clk   => clk,
          sel   => keep_pm23,
          a     => vectorize(keep_pm1_pipe2),  --function required to map std_logic to 1 bit slv
          b     => vectorize(keep_pm3_pipe2),
          ans   => decision_bits_lsb_dummy,
          ce    => '1',
          reset => '0',
          set   => '0');

      decision_bits(0 + 2*i) <= decision_bits_lsb_dummy(0);  --needed to map 1 bit slv to std_logic

    end generate;
  end generate;

end rtl;
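The four-way compare-select performed by steps 5 to 7 of the listing above can be summarised behaviourally. The following is a minimal sketch, not the thesis implementation: the entity name `acs4_sketch`, its ports and the `WIDTH` generic are hypothetical, the placed Xilinx primitives are replaced by plain `numeric_std` arithmetic, and the multi-stage pipeline is collapsed into a single register stage. It assumes that the mux selects its `b` input when `sel` is '1', which is what the decision bit encoding in the listing's comments implies.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical behavioural model of one 4-way compare-select unit.
-- Names, ports and WIDTH are illustrative only.
entity acs4_sketch is
  generic (WIDTH : positive := 10);
  port (
    clk           : in  std_logic;
    pm0, pm1      : in  signed(WIDTH-1 downto 0);
    pm2, pm3      : in  signed(WIDTH-1 downto 0);
    best_metric   : out signed(WIDTH-1 downto 0);
    decision_bits : out std_logic_vector(1 downto 0)
  );
end entity acs4_sketch;

architecture behav of acs4_sketch is
begin
  process (clk)
    variable d01, d23, dfin : signed(WIDTH-1 downto 0);
    variable pm01, pm23     : signed(WIDTH-1 downto 0);
    variable keep_pm1, keep_pm3, keep_pm23 : std_logic;
  begin
    if rising_edge(clk) then
      -- Same-width subtraction wraps modulo 2**WIDTH; the sign bit of the
      -- difference is used as the comparison result, as in the listing.
      d01 := pm0 - pm1;
      d23 := pm2 - pm3;
      keep_pm1 := d01(d01'high);
      keep_pm3 := d23(d23'high);

      -- First level: pick one survivor from each pair (assumed: '1' keeps b).
      if keep_pm1 = '1' then pm01 := pm1; else pm01 := pm0; end if;
      if keep_pm3 = '1' then pm23 := pm3; else pm23 := pm2; end if;

      -- Second level: compare the two survivors.
      dfin := pm01 - pm23;
      keep_pm23 := dfin(dfin'high);

      -- Encoding: pm0 "00", pm1 "01", pm2 "10", pm3 "11".
      if keep_pm23 = '1' then
        best_metric      <= pm23;
        decision_bits(1) <= '1';
        decision_bits(0) <= keep_pm3;
      else
        best_metric      <= pm01;
        decision_bits(1) <= '0';
        decision_bits(0) <= keep_pm1;
      end if;
    end if;
  end process;
end architecture behav;
```

The sign-bit comparison gives the correct ordering only while the spread between the candidate metrics stays below half the wrap range of the `WIDTH`-bit word. A model of this kind is useful mainly as a simulation reference to compare against the placed-primitive version.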

D.2.4 Viterbi Trellis Package

matlab/fpga/vhdl/pkg_trellis.vhd

D.2.5 Testbench

/fpga/src/cpm_viterbi_decoder_tb.vhd


D.2.6 Test Vectors Package

/matlab/fpga/vhdl/pkg_matlab_test_vectors_viterbi.vhd

D.3 Branch Metrics Filter Bank and Viterbi Trellis Path Metrics


/fpga/src/cpm_viterbi_decoder_and_bms_synth.vhd

D.3.2 Top Level

/fpga/src/cpm_viterbi_decoder_and_bms.vhd

D.3.3 Testbench

/fpga/src/cpm_viterbi_decoder_and_bms_tb.vhd

D.4 Placed Primitives Modules

D.4.1 Adder with Subtract and Clear Controls

/fpga/src/xilinx_primitive_encapsulation/adder_sub_clr.vhd

D.4.2 Adder with Subtract and 2 Input Operand Mux

/fpga/src/xilinx_primitive_encapsulation/adder_sub_mux.vhd

D.4.3 16 Deep ROM using Distributed Ram

/fpga/src/xilinx_primitive_encapsulation/lut_rom.vhd

D.4.4 2 Input Mux

/fpga/src/xilinx_primitive_encapsulation/mux2.vhd

D.4.5 Shift Register

/fpga/src/xilinx_primitive_encapsulation/shift_reg.vhd

D.4.6 Register

/fpga/src/xilinx_primitive_encapsulation/register.vhd

D.4.7 Relative Location Constraint (RLOC) Helper Package

/fpga/src/pkg_rloc.vhd


D.5 Sundry Packages

D.5.1 Key Project Constants

/fpga/src/pkg_project.vhd


Appendix E

Matlab Source Code

This appendix lists the Matlab source code referenced in this thesis. Please see the CD associated with this thesis for the complete listing of Matlab source code and Simulink models in electronic format.

E.1 Analytical CPM Code Performance

E.1.1 CPM Code Minimum Euclidean Distance Upper Bound

/matlab/minimum_distance/dmin_upper_bound.m

E.1.2 CPM Code Baseband Double Sided Bandwidth

/matlab/minimum_distance/baseband_spectrum.m

E.2 Floating and Fixed Point M-file Models

/matlab/cpm_modulator/

/matlab/cpm_demodulator/

/matlab/cpm_trellis/

/matlab/phase_pulses/

/matlab/system_model/

/matlab/system_model_simulink/

E.3 VHDL Trellis Representation and Test Vector Generation

/matlab/fpga/


E.3.1 Export Matlab Data to VHDL Writer

/matlab/fpga/VHDL_TEST_export.m

E.3.2 VHDL Writer

/matlab/fpga/vhdl_writer.m

E.4 Miscellaneous

/matlab/misc/

E.4.1 Hekstra's Method Bound Calculation

/matlab/misc/hekstra_pm_bound.m

Appendix F

Implementation Results

F.1 Branch Metric Filter Bank

Release 10.1.03 Map K.39 (nt)
Xilinx Mapping Report File for Design 'branch_metrics_dafir_synth'

Design Information
------------------
Command Line   : map -ise
C:/project/massey_stratex_cpm/demosvn/abr_masters_cpm/fpga/ise101/branch_metrics/branch_metrics.ise
-intstyle ise -p xc3sd1800a-cs484-4 -cm area -detail -pr off -k 4 -c 100 -o
branch_metrics_dafir_synth_map.ncd branch_metrics_dafir_synth.ngd
branch_metrics_dafir_synth.pcf
Target Device  : xc3sd1800a
Target Package : cs484
Target Speed   : -4
Mapper Version : spartan3adsp -- $Revision: 1.46.12.2 $
Mapped Date    : Sat Mar 14 10:28:10 2009

Design Summary
--------------
Number of errors:      0
Number of warnings:  264
Logic Utilization:
  Number of Slice Flip Flops:         7,460 out of  33,280   22%
  Number of 4 input LUTs:             4,896 out of  33,280   14%
Logic Distribution:
  Number of occupied Slices:          3,863 out of  16,640   23%
    Number of Slices containing only related logic:   3,863 out of   3,863 100%
    Number of Slices containing unrelated logic:          0 out of   3,863   0%
      *See NOTES below for an explanation of the effects of unrelated logic.
  Total Number of 4 input LUTs:       4,896 out of  33,280   14%
  Number of bonded IOBs:                 31 out of     309   10%
  Number of BUFGMUXs:                     1 out of      24    4%
  Number of RPM macros:                   5

Peak Memory Usage:  310 MB
Total REAL time to MAP completion:  1 mins 30 secs
Total CPU time to MAP completion:   1 mins 25 secs


−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−

Release 10.1.03 par K.39 (nt)
Copyright (c) 1995-2008 Xilinx, Inc. All rights reserved.

ANDREW-K64::  Sat Mar 14 10:29:46 2009

par -w -intstyle ise -ol std -t 1 branch_metrics_dafir_synth_map.ncd
branch_metrics_dafir_synth.ncd branch_metrics_dafir_synth.pcf

Constraints file: branch_metrics_dafir_synth.pcf.
Loading device for application Rf_Device from file '3sd1800a.nph' in environment
C:\Aps\xilinx_webpack_10_1\ISE.
   "branch_metrics_dafir_synth" is an NCD, version 3.2, device xc3sd1800a,
   package cs484, speed -4

Initializing temperature to 85.000 Celsius. (default - Range: 0.000 to 85.000 Celsius)
Initializing voltage to 1.140 Volts. (default - Range: 1.140 to 1.260 Volts)

Device speed data version:  "PRODUCTION 1.33 2008-07-25".

Overall effort level (-ol):     Standard
Placer effort level (-pl):      High
Placer cost table entry (-t):   1
Router effort level (-rl):      Standard

+---------------------+--------------+--------+--------+--------------+---------------+
|      Clock Net      |   Resource   | Locked | Fanout | Net Skew(ns) | Max Delay(ns) |
+---------------------+--------------+--------+--------+--------------+---------------+
|                 clk | BUFGMUX_X2Y0 | No     |   3863 |        0.371 |         1.859 |
+---------------------+--------------+--------+--------+--------------+---------------+

Timing Score: 10

------------------------------------------------------------------------------------
  Constraint                    | Check | Worst Case | Best Case  | Timing | Timing
                                |       | Slack      | Achievable | Errors | Score
------------------------------------------------------------------------------------
* TS_clk = PERIOD TIMEGRP "clk" | SETUP | -0.010 ns  | 4.639 ns   |      1 |     10
  216 MHz HIGH 50%              | HOLD  |  0.806 ns  |            |      0 |      0


−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−

F.2 Viterbi Trellis Path Metrics

Release 10.1.03 Map K.39 (nt)
Xilinx Mapping Report File for Design 'cpm_viterbi_decoder_synth'

Design Information
------------------
Command Line   : map -ise
C:/project/massey_stratex_cpm/demosvn/abr_masters_cpm/fpga/ise101/viterbi_acs/viterbi_acs.ise
-intstyle ise -p xc3sd1800a-cs484-4 -cm area -pr off -u -k 4 -c 100 -o
cpm_viterbi_decoder_synth_map.ncd cpm_viterbi_decoder_synth.ngd
cpm_viterbi_decoder_synth.pcf
Target Device  : xc3sd1800a
Target Package : cs484
Target Speed   : -4
Mapper Version : spartan3adsp -- $Revision: 1.46.12.2 $
Mapped Date    : Fri Oct 31 13:50:16 2008

Design Summary
--------------
Number of errors:      0
Number of warnings: 1954
Logic Utilization:
  Number of Slice Flip Flops:        11,338 out of  33,280   34%
  Number of 4 input LUTs:             7,174 out of  33,280   21%
Logic Distribution:
  Number of occupied Slices:          7,303 out of  16,640   43%
    Number of Slices containing only related logic:   7,303 out of   7,303 100%
    Number of Slices containing unrelated logic:          0 out of   7,303   0%
      *See NOTES below for an explanation of the effects of unrelated logic.
  Total Number of 4 input LUTs:       7,174 out of  33,280   21%
    Number used as logic:             7,173
    Number used as Shift registers:       1
  Number of bonded IOBs:                  1 out of     309    1%
  Number of BUFGMUXs:                     1 out of      24    4%

Peak Memory Usage:  373 MB
Total REAL time to MAP completion:  1 mins 22 secs
Total CPU time to MAP completion:   1 mins 11 secs
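As a cross-check, the occupied slice counts from the two Map reports (F.1 and F.2) can be combined by hand. This is a sketch only: it assumes the two blocks share no slices when placed in the same xc3sd1800a device, which these two separate builds do not themselves confirm. The symbols $N_{\mathrm{BM}}$, $N_{\mathrm{PM}}$ and $N_{\mathrm{dev}}$ are introduced here purely for illustration.

```latex
\begin{align*}
N_{\mathrm{BM}} &= 3863  && \text{(branch metric filter bank, F.1)} \\
N_{\mathrm{PM}} &= 7303  && \text{(Viterbi trellis path metrics, F.2)} \\
N_{\mathrm{dev}} &= 16640 && \text{(slices available in the device)} \\[4pt]
\frac{N_{\mathrm{BM}} + N_{\mathrm{PM}}}{N_{\mathrm{dev}}}
  &= \frac{11166}{16640} \approx 0.671
\end{align*}
```

Under that assumption the combined figure is about 67% of the available slices, which is consistent with the logic utilisation quoted in the abstract.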

−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−


ANDREW-K64::  Fri Oct 31 13:51:47 2008

par -w -intstyle ise -ol std -t 1 cpm_viterbi_decoder_synth_map.ncd
cpm_viterbi_decoder_synth.ncd cpm_viterbi_decoder_synth.pcf

Constraints file: cpm_viterbi_decoder_synth.pcf.
Loading device for application Rf_Device from file '3sd1800a.nph' in environment
C:\Aps\xilinx_webpack_10_1\ISE.
   "cpm_viterbi_decoder_synth" is an NCD, version 3.2, device xc3sd1800a,

Overall effort level (-ol):     Standard
Placer effort level (-pl):      High
Placer cost table entry (-t):   1
Router effort level (-rl):      Standard

+---------------------+---------------+--------+--------+--------------+---------------+
|                 clk | BUFGMUX_X2Y11 | No     |   6343 |        0.348 |         1.818 |
+---------------------+---------------+--------+--------+--------------+---------------+

Timing Score: 751



------------------------------------------------------------------------------------
  Constraint                    | Check | Worst Case | Best Case  | Timing | Timing
                                |       | Slack      | Achievable | Errors | Score
------------------------------------------------------------------------------------
* TS_clk = PERIOD TIMEGRP "clk" | SETUP | -0.240 ns  | 4.891 ns   |      5 |    751
  215 MHz HIGH 50%              | HOLD  |  0.750 ns  |            |      0 |      0
------------------------------------------------------------------------------------
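The "Best Case Achievable" column in the timing tables of F.1 and F.2 follows from the constrained clock period and the worst case setup slack. The worked arithmetic below is a sketch of that relationship. The symbols $T_{\mathrm{req}}$, $s$, $T_{\mathrm{ach}}$ and $f_{\mathrm{ach}}$ are introduced here for illustration and do not appear in the tool output.

```latex
\begin{align*}
T_{\mathrm{ach}} &= T_{\mathrm{req}} - s, \qquad T_{\mathrm{req}} = \frac{1}{f_{\mathrm{req}}} \\[6pt]
\text{F.1 (216 MHz):}\quad
  T_{\mathrm{req}} &\approx 4.630\ \text{ns}, \quad s = -0.010\ \text{ns}
  \;\Rightarrow\; T_{\mathrm{ach}} \approx 4.640\ \text{ns} \\
\text{F.2 (215 MHz):}\quad
  T_{\mathrm{req}} &\approx 4.651\ \text{ns}, \quad s = -0.240\ \text{ns}
  \;\Rightarrow\; T_{\mathrm{ach}} \approx 4.891\ \text{ns} \\[6pt]
f_{\mathrm{ach}} &= \frac{1}{T_{\mathrm{ach}}}: \qquad
  \frac{1}{4.639\ \text{ns}} \approx 215.6\ \text{MHz}, \qquad
  \frac{1}{4.891\ \text{ns}} \approx 204.5\ \text{MHz}
\end{align*}
```

The F.1 value agrees with the reported 4.639 ns to within rounding, and the F.2 value matches the reported 4.891 ns. A negative slack means the constraint is missed by that margin, so the reports above correspond to achievable clock rates of roughly 215.6 MHz and 204.5 MHz against the 216 MHz and 215 MHz constraints.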


Appendix G

Software Tool Versions

G.1 High Level Modelling: Matlab and Simulink

The work in this thesis used Matlab and Simulink for high level modelling. Table G.1 contains the software versions of Matlab, Simulink and the various blocksets and toolboxes used.

G.2 FPGA Implementation: Xilinx ISE 10.1

Table G.2 contains the software version for the Xilinx synthesis and implementation tools.

G.3 VHDL Simulation: Modelsim

VHDL simulator version: Modelsim Xilinx Edition III 6.0d.

Software                      Version
MATLAB                        Version 7.5 (R2007b)
Simulink                      Version 7.0 (R2007b)
Communications Blockset       Version 3.6 (R2007b)
Communications Toolbox        Version 4.0 (R2007b)
Fixed-Point Toolbox           Version 2.1 (R2007b)
Signal Processing Blockset    Version 6.6 (R2007b)
Signal Processing Toolbox     Version 6.8 (R2007b)

Table G.1: Matlab and Simulink Software Versions


Software Tool             Version
XST                       Release 10.1.03 - xst K.39 (nt)
Translate                 Release 10.1.03 ngdbuild K.39 (nt)
Map                       Release 10.1.03 Map K.39 (nt)
Place and route           Release 10.1.03 par K.39 (nt)
Static Timing Analysis    Release 10.1.03 Trace (nt)

Table G.2: Xilinx Synthesis and Implementation Software Versions

Bibliography

[1] European Telecommunications Standards Institute, “ETSI EN 302 217-2-2 V1.1.3(2004-12) Fixed Radio Systems; Characteristics and requirementsfor point-to-point equipment and antennas,” 2004.

[2] Y.-N. Chang, H. Suzuki, and K. Parhi, “A 2-Mb/s 256-state 10-mW rate-1/3 Viterbi decoder,” Solid-State Circuits, IEEE Journal of, vol. 35, no. 6,pp. 826–834, Jun 2000.

[3] M. Morelli, U. Mengali, and G. Vitetta, “Joint phase and timing recoverywith CPM signals,” Communications, IEEE Transactions on, vol. 45, pp. 867–876, Jul 1997.

[4] G. Colavolpe and R. Raheli, “Reduced-complexity detection and phasesynchronization of CPM signals,” Communications, IEEE Transactions on,vol. 45, no. 9, pp. 1070–1079, Sep 1997.

[5] B. Sklar, Digital Communications. Prentice Hall, 2001.

[6] J. Anderson, T. Aulin, and C. Sundberg, Digital Phase Modulation. PlenumPress, 1986.

[7] J. Anderson and A. Svensson, Coded Modulation Systems. Kluwer Aca-demic/ Plenum Publishers, 2003.

[8] I. Sasase and S. Mori, “Multi-h phase-coded modulation,” CommunicationsMagazine, IEEE, vol. 29, no. 12, pp. 46–56, Dec 1991.

[9] T. Aulin, N. Rydbeck, and C.-E. Sundberg, “Continuous phasemodulation–part ii: Partial response signaling,” Communications, IEEETransactions on [legacy, pre - 1988], vol. 29, no. 3, pp. 210–225, Mar 1981.

[10] M. Geoghegan, “Description and performance results for a multi-h CPMtelemetry waveform,” MILCOM 2000. 21st Century Military Communica-tions Conference Proceedings, vol. 1, pp. 353–357 vol.1, 2000.

[11] http://www.xilinx.com/company/success/novaengr.htm. re-trieved 10th January 2008.

[12] T. Tapp and R. Mickelson, “Turbo detection of coded continuous-phasemodulations,” Military Communications Conference Proceedings, 1999. MIL-COM 1999. IEEE, vol. 1, pp. 534–537 vol.1, 1999.

135

136 BIBLIOGRAPHY

[13] M. Petrov and M. Glesner, “A state-serial Viterbi decoder architecture fordigital radio on FPGA,” Field-Programmable Technology, 2005. Proceedings.2005 IEEE International Conference on, pp. 323–324, 11-14 Dec. 2005.

[14] M. Bree, D. Dodds, R. Bolton, S. Kumar, and B. Daku, “A modularbit-serial architecture for large-constraint-length Viterbi decoding,” Solid-State Circuits, IEEE Journal of, vol. 27, no. 2, pp. 184–190, Feb 1992.

[15] C. Shung, H.-D. Lin, R. Cypher, P. Siegel, and H. Thapar, “Area-efficientarchitectures for the Viterbi algorithm. I. Theory,” Communications, IEEETransactions on, vol. 41, no. 4, pp. 636–644, Apr 1993.

[16] P. Black and T. Meng, “A 140-Mb/s, 32-state, radix-4 Viterbi decoder,”Solid-State Circuits, IEEE Journal of, vol. 27, no. 12, pp. 1877–1885, Dec 1992.

[17] T. Zhang, J. Wu, and G. Saulnier, “Efficient coherent detector VLSI de-sign for continuous phase modulation,” Signals, Systems and Computers,2003. Conference Record of the Thirty-Seventh Asilomar Conference on, vol. 2,pp. 1663–1666 Vol.2, Nov. 2003.

[18] R. Tessier, S. Swaminathan, R. Ramaswamy, D. Goeckel, and W. Burleson,“A reconfigurable, power-efficient adaptive Viterbi decoder,” Very LargeScale Integration (VLSI) Systems, IEEE Transactions on, vol. 13, no. 4,pp. 484–488, April 2005.

[19] M. Guo, M. Ahmad, M. Swamy, and C. Wang, “FPGA design and imple-mentation of a low-power systolic array-based adaptive Viterbi decoder,”Circuits and Systems I: Regular Papers, IEEE Transactions on [Circuits and Sys-tems I: Fundamental Theory and Applications, IEEE Transactions on], vol. 52,no. 2, pp. 350–365, Feb. 2005.

[20] F. Sun and T. Zhang, “Low-power state-parallel relaxed adaptive Viterbidecoder,” Circuits and Systems I: Regular Papers, IEEE Transactions on [Cir-cuits and Systems I: Fundamental Theory and Applications, IEEE Transactionson], vol. 54, no. 5, pp. 1060–1068, May 2007.

[21] C.-Y. Chang and K. Yao, “Systolic array processing of the Viterbi algo-rithm,” Information Theory, IEEE Transactions on, vol. 35, no. 1, pp. 76–86,Jan 1989.

[22] ALTERA, Megacore: Viterbi Compiler User Guide, v7.2 ed., October 2007.

[23] XILINX, Logicore: Viterbi Decoder Product Specification DS247, v6.2 ed., Oc-tober 2007.

[24] C. Shung, P. Siegel, G. Ungerboeck, and H. Thapar, “VLSI architectures formetric normalization in the Viterbi algorithm,” Communications, 1990. ICC90, Including Supercomm Technical Sessions. SUPERCOMM/ICC ’90. Confer-ence Record., IEEE International Conference on, pp. 1723–1728 vol.4, 16-19Apr 1990.

[25] A. Hekstra, “An alternative to metric rescaling in Viterbi decoders,” Com-munications, IEEE Transactions on, vol. 37, no. 11, pp. 1220–1222, Nov 1989.

BIBLIOGRAPHY 137

[26] P. Siegel, C. Shung, T. Howell, and H. Thapar, “Exact bounds for Viterbidetector path metric differences,” Acoustics, Speech, and Signal Processing,1991. ICASSP-91., 1991 International Conference on, pp. 1093–1096 vol.2, 14-17 Apr 1991.

[27] T. Aulin and C. Sundberg, “Continuous phase modulation–part i: Full re-sponse signaling,” Communications, IEEE Transactions on [legacy, pre - 1988],vol. 29, no. 3, pp. 196–209, Mar 1981.

[28] T. Svensson and A. Svensson, “Reduced complexity detection of band-width efficient partial response CPM,” Vehicular Technology Conference,1999 IEEE 49th, vol. 2, pp. 1296–1300 vol.2, Jul 1999.

[29] T. Svensson and A. Svensson, “Empirical model for spectrally efficientcontinuous phase modulation,” Vehicular Technology Conference, 2003. VTC2003-Fall. 2003 IEEE 58th, vol. 1, pp. 696–700 Vol.1, 6-9 Oct. 2003.

[30] T. Svensson and A. Svensson, “Complexity and performance of spectrallyefficient continuous phase modulation,” Proceedings Nordic Radio Sympo-sium, Nynshamn, Sweden, 2001.

[31] D. Asano, H. Leib, and S. Pasupathy, “Phase smoothing functions for con-tinuous phase modulation,” Communications, IEEE Transactions on, vol. 42,no. 234, pp. 1040–1049, Feb/Mar/Apr 1994.

[32] E. S. Perrins, “Reduced complexity detection methods for continuousphase modulation,” PHD Thesis, 2005.

[33] A. Svensson, C. Sundberg, and T. Aulin, “A class of reduced-complexityviterbi detectors for partial response continuous phase modulation,” Com-munications, IEEE Transactions on [legacy, pre - 1988], vol. 32, pp. 1079–1087,Oct 1984.

[34] W. Tang and E. Shwedyk, “A CPM receiver based on the walsh signalspace,” Communications, Computers, and Signal Processing, 1995. Proceed-ings. IEEE Pacific Rim Conference on, pp. 203–206, 17-19 May 1995.

[35] A. Svensson, “Reduced state sequence detection of full response continu-ous phase modulation,” Electronics Letters, vol. 26, no. 10, pp. 652–654, 1May 1990.

[36] A. Svensson, “Reduced state sequence detection of partial response con-tinuous phase modulation,” Communications, Speech and Vision, IEE Pro-ceedings I, vol. 138, no. 4, pp. 256–268, Aug 1991.

[37] J. Huber and W. Liu, “An alternative approach to reduced-complexityCPM-receivers,” Selected Areas in Communications, IEEE Journal on, vol. 7,no. 9, pp. 1437–1449, Dec 1989.

[38] P. Laurent, “Exact and approximate construction of digital phase modula-tions by superposition of amplitude modulated pulses (AMP),” Commu-nications, IEEE Transactions on [legacy, pre - 1988], vol. 34, pp. 150–160, Feb1986.

138 BIBLIOGRAPHY

[39] U. Mengali and M. Morelli, “Decomposition of M-ary CPM signals intoPAM waveforms,” Information Theory, IEEE Transactions on, vol. 41, no. 5,pp. 1265–1275, Sep 1995.

[40] G. Kaleh, “Simple coherent receivers for partial response continuousphase modulation,” Selected Areas in Communications, IEEE Journal on,vol. 7, pp. 1427–1436, Dec 1989.

[41] S. Simmons, “ACI susceptibility of reduced-state decoding for CPM,”Communications Letters, IEEE, vol. 3, no. 11, pp. 305–307, Nov 1999.

[42] February 2009. Proprietary Harris Stratex Documentation.

[43] I. Kuon and J. Rose, “Measuring the Gap Between FPGAs and ASICs,”Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactionson, vol. 26, pp. 203–215, Feb. 2007.

[44] XILINX, Spartan-3A DSP FPGA Family: Datasheet, v2.1 ed., June 2008.

[45] http://www.provigent.com/home/index.aspx?lang=1. re-trieved 15th May 2009.

[46] C. Rader, “Memory management in a Viterbi decoder,” Communications,IEEE Transactions on [legacy, pre - 1988], vol. 29, no. 9, pp. 1399–1401, Sep1981.

[47] U. Meyer-Baese, Digital Signal Processing with Field Programmable Gate Ar-rays. Springer, 2001.

[48] R. Andraka and A. Berkun, “FPGAs make a radar signal processor on achip a reality,” Signals, Systems, and Computers, 1999. Conference Record ofthe Thirty-Third Asilomar Conference on, vol. 1, pp. 559–563 vol.1, 1999.

[49] March 2009. Email from Jamie Pegg - Xilinx Field Applications Engineer.

[50] S. R. Bullock, Transceiver and System Design for Digital Communications.SciTech Publishing Inc, US, 2009.
