Uwe Meyer-Baese Presentation Report


Presentation report on Uwe Meyer-Baese, Digital Signal Processing with FPGAs.


FPGA SEMINAR REPORT UNIT-4

CONTENTS

1. BINARY ADDER
2. BINARY MULTIPLIER
3. BINARY DIVIDER
4. FIR FILTERS
5. IIR FILTERS
6. DECIMATION
7. INTERPOLATION
8. MULTISTAGE DECIMATION
9. POLYPHASE DECIMATION
10. FILTER BANKS
11. DIT-FFT ALGORITHM
12. DIF-FFT ALGORITHM
13. ERROR CONTROL CODING
14. CRYPTOGRAPHIC ALGORITHM
15. LMS ALGORITHM
16. DIGITAL UP CONVERTER
17. DIGITAL DOWN CONVERTER


    BINARY ADDERS

Addition is the most commonly performed arithmetic operation in digital systems. An adder is a combinational circuit that combines two arithmetic operands using the addition rules, and it is a basic building block in any DSP system. An adder can also perform subtraction by adding the two's complement of the subtrahend. The following are the various types of adders:

    Half Adders

    Full Adders

    Binary (Multi Bit) Adders

    o Ripple Adders

    o Carry Look Ahead Adders

    o Pipeline Adders

    o Modulo Adders

A basic binary N-bit adder/subtractor consists of N full adders (FA). A full adder implements the following Boolean equations:

The sum is defined by:

sk = xk XOR yk XOR ck = xk ⊕ yk ⊕ ck

The carry (out) bit is computed with:

ck+1 = (xk AND yk) OR (xk AND ck) OR (yk AND ck) = (xk ∧ yk) ∨ (xk ∧ ck) ∨ (yk ∧ ck)
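These two equations translate directly into hardware. As a small illustration (our own sketch, not from the report; the module and signal names are hypothetical), an N-bit ripple-carry adder in Verilog instantiates them once per bit position:

module ripple_add #(parameter N = 4) (
  input  [N-1:0] x, y,
  input          cin,
  output [N-1:0] s,
  output         cout
);
  wire [N:0] c;                 // carry chain, c[0] is the carry-in
  assign c[0] = cin;
  genvar k;
  generate
    for (k = 0; k < N; k = k + 1) begin : fa
      assign s[k]   = x[k] ^ y[k] ^ c[k];                            // sum equation
      assign c[k+1] = (x[k] & y[k]) | (x[k] & c[k]) | (y[k] & c[k]); // carry equation
    end
  endgenerate
  assign cout = c[N];
endmodule

The longest combinational path runs through all N carry stages; this ripple delay is exactly what the pipelined adders discussed below break up.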

    PIPELINED ADDERS:

    Pipelining is extensively used in DSP solutions due to the intrinsic dataflow regularity of

    DSP algorithms. Programmable digital signal processor MACs typically carry at least four

pipelined stages. The processor concurrently:

1) decodes the command,
2) loads the operands into registers,
3) performs the multiplication and stores the product, and
4) accumulates the products.

The pipelining principle can be applied to FPGA designs at little or no additional cost, since each logic element contains a flip-flop that would otherwise go unused. With pipelining it is possible to break an arithmetic operation into small primitive

    operations, save the carry and the intermediate values in registers, and continue the

    calculation in the next clock cycle. Such adders are sometimes called carry save adders

    (CSAs) in the literature.

  • 4

    The block diagram of pipeline adder is shown in Figure 1.

    Figure 1: Block Schematic of Pipeline adder

    VHDL CODE FOR PIPELINE ADDER:

    LIBRARY ieee;

    USE ieee.std_logic_1164.ALL;

    USE ieee.std_logic_arith.ALL;

    USE ieee.std_logic_unsigned.ALL;

    ENTITY pipeline_add IS

    GENERIC (WIDTH : INTEGER := 15;

    WIDTH1 : INTEGER := 7; WIDTH2 : INTEGER := 8);

    PORT (x,y : IN STD_LOGIC_VECTOR(WIDTH-1 DOWNTO 0);

    sum : OUT STD_LOGIC_VECTOR(WIDTH-1 DOWNTO 0);

    LSBs_Carry : OUT STD_LOGIC;

    clk : IN STD_LOGIC);

    END pipeline_add;

    ARCHITECTURE struct OF pipeline_add IS

    SIGNAL l1, l2, s1 : STD_LOGIC_VECTOR(WIDTH1-1 DOWNTO 0);

    SIGNAL r1 : STD_LOGIC_VECTOR(WIDTH1 DOWNTO 0);

    SIGNAL l3, l4, r2, s2 : STD_LOGIC_VECTOR(WIDTH2-1 DOWNTO 0);

    BEGIN

PROCESS -- Compute the sum in two pipeline stages

BEGIN

WAIT UNTIL clk = '1';

-- Stage 1: register the LSBs and MSBs of both operands
l1 <= x(WIDTH1-1 DOWNTO 0);
l2 <= y(WIDTH1-1 DOWNTO 0);
l3 <= x(WIDTH-1 DOWNTO WIDTH1);
l4 <= y(WIDTH-1 DOWNTO WIDTH1);
-- Add the LSBs (keeping the carry) and the MSB blocks
r1 <= ('0' & l1) + ('0' & l2);
r2 <= l3 + l4;
-- Stage 2: save the LSB sum; add the saved carry to the MSB sum
s1 <= r1(WIDTH1-1 DOWNTO 0);
s2 <= r2 + r1(WIDTH1);

END PROCESS;

LSBs_Carry <= r1(WIDTH1); -- Carry of the LSB block, for testing

sum <= s2 & s1; -- Build the output word from the MSB and LSB parts

END struct;


    MODULO ADDERS

    Modulo adders are the most important building blocks in RNS-DSP designs. They are used

    for both additions and, via index arithmetic, for multiplications.

    The block diagram of modulo adder is shown in Figure 2.

    Figure 2: Block Schematic of Modulo adder

    VERILOG CODE FOR MODULO-256 ADDER:

module mod_add (input [7:0] x, input [7:0] y, output [8:0] Sum);

parameter m = 256;

wire [8:0] x1, x2;

assign x1 = x + y;              // 9-bit raw sum, range 0..510

assign x2 = x1 - m;             // raw sum reduced by the modulus

assign Sum = (x1 >= m) ? x2 : x1; // subtract m once when the sum reaches m

endmodule


    MODULO-256 ADDER SIMULATION RESULTS:

    SUMMARY OF BINARY ADDERS:

Ripple-carry adders: the operands are added bit position by bit position, and the longest delay comes from the carry rippling through all stages. Carry-skip, carry-lookahead, conditional-sum, or carry-select adder techniques are employed to reduce this delay.

Adders implemented using modern FPGAs/CPLDs possess dedicated ripple-carry logic that is an order of magnitude faster than a carry path routed through regular logic.

In pipeline adders, the number of pieces into which the addition is broken depends on the number of logic elements (LEs) and flip-flops (FFs) in each LAB of the FPGA/CPLD.

For example, in Altera's Cyclone II devices a reasonable choice is a pipelined addition with a maximum block size of 15 bits, using an LAB with 16 LEs and 16 FFs for one pipeline element. The feasible breakup is shown below:

    With one additional pipeline stage we can build adders up to a length 15 + 16 = 31.

With two pipeline stages we can build adders with up to 15 + 15 + 16 = 46-bit length.

    With three pipeline stages we can build adders with up to 15+15+15+16 = 61-bit length.

The number of flip-flops in one LAB is 16, but an extra flip-flop is needed for the carry-out; therefore only the block holding the MSBs can be 16 bits wide. A rough Verilog sketch of such a two-block pipelined adder follows.
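The sketch below is our own illustration (not from the report): a 31-bit adder built as a 15-bit LSB block plus a 16-bit MSB block with one pipeline register between them; all names are hypothetical.

module pipe_add31 (
  input         clk,
  input  [30:0] x, y,
  output [30:0] sum
);
  reg [15:0] lsb_sum;           // 15-bit LSB sum plus its carry in bit 15
  reg [15:0] msb_x_d, msb_y_d;  // MSB operands delayed by one cycle
  reg [30:0] result;

  always @(posedge clk) begin
    // Stage 1: add the 15 LSBs, register the 16 MSBs of both operands
    lsb_sum <= {1'b0, x[14:0]} + {1'b0, y[14:0]};
    msb_x_d <= x[30:15];
    msb_y_d <= y[30:15];
    // Stage 2: add the MSB block plus the saved LSB carry
    result  <= {msb_x_d + msb_y_d + lsb_sum[15], lsb_sum[14:0]};
  end
  assign sum = result;
endmodule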


    BINARY MULTIPLIER

    Since we always multiply by either 0 or 1, the partial products are always either 0000 or

    the multiplicand (1101 in this example).

    There are four partial products which are added to form the result.

    We can add them in pairs, using three adders.

    Even though the product has up to 8 bits, we can use 4-bit adders if we stagger them

    leftwards, like the partial products themselves.

If the multiplicand has k bits and the multiplier has j bits, then:

o k·j AND gates are required.

o (j − 1) k-bit adders are required.

Example: to multiply 1101 by 111 we require

o 4·3 = 12 AND gates

o (3 − 1) = 2 four-bit adders.

    A 2*2 BINARY MULTIPLIER

    The AND gates produce the partial products.

For a 2-bit by 2-bit multiplier, we can just use two half adders to sum the partial products. In general, though, we'll need full adders.

Here C3-C0 are the product bits, not carries!
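A direct Verilog sketch of this 2x2 multiplier (our own illustration): four AND gates form the partial products and two half adders combine them:

module mult2x2 (input [1:0] a, b, output [3:0] c);
  wire pp0 = a[0] & b[0];   // partial products
  wire pp1 = a[1] & b[0];
  wire pp2 = a[0] & b[1];
  wire pp3 = a[1] & b[1];
  wire carry1;
  assign c[0] = pp0;
  assign c[1] = pp1 ^ pp2;     // half adder 1: sum
  assign carry1 = pp1 & pp2;   // half adder 1: carry
  assign c[2] = pp3 ^ carry1;  // half adder 2: sum
  assign c[3] = pp3 & carry1;  // half adder 2: carry
endmodule

For a = b = 3 (binary 11) this yields c = 1001 = 9, as expected.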


    A 4*4 MULTIPLIER CIRCUIT


Here the multiplier and the multiplicand are both 4 bits wide, so to implement the multiplier we need 4·4 = 16 AND gates and three 4-bit adders.

The first adder of each 4-bit adder can be a half adder, because cin is always zero at that position.

    VERILOG CODE

module HA(sout,cout,a,b); // half adder

output sout, cout;
input a, b;
assign sout = a ^ b;
assign cout = a & b;

endmodule

module FA(sout,cout,a,b,cin); // full adder

output sout, cout;
input a, b, cin;
assign sout = a ^ b ^ cin;
assign cout = (a & b) | (a & cin) | (b & cin);

endmodule

module multiply4bits(product,a,b);

output [7:0] product;
input [3:0] a;
input [3:0] b;
wire x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17;

assign product[0] = (a[0] & b[0]);

// Row for b[1]
HA HA1(product[1], x1, (a[0]&b[1]), (a[1]&b[0]));
FA FA1(x2, x3, (a[1]&b[1]), (a[2]&b[0]), x1);
FA FA2(x4, x5, (a[2]&b[1]), (a[3]&b[0]), x3);
HA HA2(x6, x7, (a[3]&b[1]), x5);

// Row for b[2]
HA HA3(product[2], x8, x2, (a[0]&b[2]));
FA FA5(x9, x10, x4, (a[1]&b[2]), x8);
FA FA4(x11, x12, x6, (a[2]&b[2]), x10);
FA FA3(x13, x14, x7, (a[3]&b[2]), x12);

// Row for b[3]
HA HA4(product[3], x15, x9, (a[0]&b[3]));
FA FA8(product[4], x16, x11, (a[1]&b[3]), x15);
FA FA7(product[5], x17, x13, (a[2]&b[3]), x16);
FA FA6(product[6], product[7], x14, (a[3]&b[3]), x17);

endmodule


    SIMULATION RESULT


    DIVIDERS

Of all four basic arithmetic operations, division is the most complex. Consequently, it is

    the most time-consuming operation and also the operation with the largest number of different

    algorithms to be implemented. For a given dividend (or numerator) N and divisor (or

    denominator) D the division produces (unlike the other basic arithmetic operations) two results:

    the quotient Q and the remainder R, i.e.,

    N/D = Q and R with |R| < D.

    However, we may think of division as the inverse process of multiplication, as

    demonstrated through the following equation,

    N = D Q + R,

It differs from multiplication in many aspects. Most importantly, in multiplication all partial products can be produced in parallel, while in division each quotient bit is determined in a sequential trial-and-error procedure.

For example, 234/50 admits Q = 5 with R = −16, or Q = 4 with R = 34. We prefer Q = 4 and R = 34, i.e., the remainder with the same sign as the numerator; hence Q = ⌊N/D⌋.

    RESTORING DIVISION:

We first align the denominator and load the numerator into the remainder register. We then subtract the aligned denominator from the remainder and store the result in the remainder register. If the new remainder is positive we set the quotient's LSB to 1; otherwise the quotient's LSB is set to zero and we must restore the previous remainder value by adding the denominator back. Finally, we realign the quotient and denominator for the next step. The recalculation of the previous remainder is why such an algorithm is called restoring division.

The main disadvantage of restoring division is that we need two steps to determine one quotient bit. We can combine the two steps using a nonperforming divider algorithm, i.e., each time the denominator is larger than the remainder, we simply do not perform the subtraction. The number of steps is then reduced by a factor of 2. A behavioral sketch of the basic restoring scheme is shown below.
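As a rough illustration (our own sketch, not from the report; names are hypothetical), restoring division for an unsigned 8-bit numerator and a nonzero 4-bit denominator, one quotient bit per loop iteration:

module restoring_div (
  input  [7:0] n,        // numerator (dividend)
  input  [3:0] d,        // denominator (divisor), must be nonzero
  output reg [7:0] q,    // quotient
  output reg [3:0] r     // remainder
);
  integer k;
  reg [11:0] rem;        // working remainder, wide enough for all shifts
  always @* begin
    rem = {4'b0, n};
    q   = 0;
    for (k = 7; k >= 0; k = k - 1) begin
      rem = rem - ({8'b0, d} << k);     // try the aligned denominator
      if (rem[11]) begin                // negative: restore, quotient bit 0
        rem = rem + ({8'b0, d} << k);
        q[k] = 1'b0;
      end else
        q[k] = 1'b1;                    // positive: keep it, quotient bit 1
    end
    r = rem[3:0];
  end
endmodule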

NONRESTORING DIVISION:

The idea behind nonrestoring division is that if in restoring division we have computed a negative remainder, i.e., rk+1 = rk − dk, then in the next step we will restore rk by adding dk and then perform a subtraction of the next aligned denominator dk+1 = dk/2. So, instead of adding dk followed by subtracting dk/2, we can just skip the restoring step and proceed with adding dk/2 whenever the remainder has (temporarily) a negative value. As a result, we now have quotient bits that can be positive or negative, i.e., qk = ±1, but not zero. We can change this signed-digit representation later to a two's complement representation. In


conclusion, every time the remainder after the iteration is positive we store a 1 and subtract the aligned denominator, while for a negative remainder we store a −1 in the quotient register and add the aligned denominator.

Both quotient and remainder are then in two's complement representation and form a valid result. If we wish to constrain the results so that both have the same sign, we must correct a negative remainder, i.e., for r < 0 we correct via r := r + D and q := q − 1.

    Such a nonrestoring divider will now run faster than the nonperforming divider, with

    about the same Registered Performance as the restoring divider.

    FAST DIVIDER DESIGN:

    The first fast divider algorithm we wish to discuss is the division through

    multiplication with the reciprocal of the denominator D. The reciprocal can, for instance, be

    computed via a look-up table for small bit width. The general technique for constructing

    iterative algorithms, however, makes use of the Newton method for finding a zero.

    ARRAY DIVIDER:

    Obviously, as with multipliers, all division algorithms can be implemented in a

    sequential, FSM-like, way or in the array form. If the array form and pipelining is desired, a

    good option will then be to use the lpm_divide block, which implements an array divider with

    the option of pipelining, for a detailed description of the lpm_divide block.

    CODE:

    module divya2(q,out,a,b);

    input [7:0]a;//dividend

    input [3:0]b;//divisor

output [3:0]out; // remainder

    output [4:0]q;//quotient

    wire [3:0]r1,r2,r3,r4;

    stage s1(q[4],r1[3:0],{1'b1},a[7:4],b[3:0]);

    stage s2(q[3],r2[3:0],q[4],{r1[2:0],a[3]},b[3:0]);

    stage s3(q[2],r3[3:0],q[3],{r2[2:0],a[2]},b[3:0]);


    stage s4(q[1],r4[3:0],q[2],{r3[2:0],a[1]},b[3:0]);

    stage s5(q[0],out[3:0],q[1],{r4[2:0],a[0]},b[3:0]);

    endmodule

    module stage(q,out,t,a,b); // submodule

    input [3:0]a;

    input [3:0]b;

    input t;

    output [3:0]out;

    output q;

    wire [3:0]c;

    cas ca1(out[0],c[0],t,b[0],a[0],t);

    cas ca2(out[1],c[1],t,b[1],a[1],c[0]);

    cas ca3(out[2],c[2],t,b[2],a[2],c[1]);

    cas ca4(out[3],c[3],t,b[3],a[3],c[2]);

    not n1(q,out[3]);

    endmodule

    module cas(out,cout,t,divisor,rin,cin);

    input t,divisor,rin,cin;

    output cout,out;

    wire x;

    xor x1(x,t,divisor);

    fadd f1(out,cout,x,rin,cin);

    endmodule

    module fadd(s,cout,a,b,c); //full adder submodule

    input a,b,c;


    output s,cout;

    wire w1,w2,w3;

    and a1(w1,a,b);

    and a2(w2,b,c);

    and a3(w3,c,a);

    xor x1(s,a,b,c);

    or o1(cout,w1,w2,w3);

    endmodule

    OUTPUT:


    FIR FILTERS

An FIR filter with constant coefficients is an LTI digital filter. The output of an FIR filter of order or length L, to an input time series x[n], is given by a finite version of the convolution sum, namely:

y[n] = Σ f[k]·x[n − k], k = 0, 1, ..., L − 1

where f[0] through f[L − 1] are the filter's L coefficients. They also correspond to the filter's impulse response.

For LTI systems it is sometimes more convenient to express this in the z-domain as Y(z) = F(z)·X(z), where F(z) is the FIR's transfer function, defined by

F(z) = Σ f[k]·z^(−k), k = 0, 1, ..., L − 1.

The Lth-order LTI FIR filter is graphically interpreted in Fig. 1. It can be seen to consist of a tapped delay line, adders, and multipliers. One of the operands presented to each multiplier is an FIR coefficient, often referred to as a tap weight for obvious reasons.

    Fig 1: Direct Form FIR filter


The roots of the polynomial F(z) define the zeros of the filter. The presence of only zeros is the reason that FIRs are sometimes called all-zero filters.

    FIR FILTER WITH TRANSPOSED STRUCTURE

    A variation of the direct FIR model is called the transposed FIR filter. It can be constructed

    from the FIR filter in Fig. 1 by:

    Exchanging the input and output

    Inverting the direction of signal flow

    Substituting an adder by a fork, and vice versa

A transposed FIR filter is shown in Fig. 2 and is, in general, the preferred implementation of an FIR filter. The benefit of this structure is that we do not need an extra shift register for x[n], and there is no need for an extra pipeline stage for the adder (tree) of the products to achieve high throughput.

    Fig 2: Filter with Transposed Structure
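A minimal Verilog sketch of a length-4 transposed-form FIR (our own illustration; the coefficients below are placeholders, not taken from the report):

module fir_transposed (
  input                clk,
  input  signed [7:0]  x,
  output signed [15:0] y
);
  // Illustrative coefficients; any length-4 impulse response works the same way.
  localparam signed [7:0] C0 = -1, C1 = 4, C2 = 4, C3 = -1;
  reg signed [15:0] z0, z1, z2;   // registers sit inside the adder chain
  always @(posedge clk) begin
    z2 <= C3 * x;                 // branch for the oldest coefficient
    z1 <= C2 * x + z2;
    z0 <= C1 * x + z1;
  end
  assign y = C0 * x + z0;         // registered chain plus the current tap
endmodule

Note that the input x is broadcast to all multipliers and the registers sit in the adder chain, which is why no extra output adder-tree pipeline stage is needed.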

    SYMMETRY IN FIR FILTERS

The center of an FIR's impulse response is an important point of symmetry. It is sometimes convenient to define this point as the 0th sample instant. Such filter descriptions are a-causal (centered notation). For an odd-length FIR, the a-causal filter model is given by:

F(z) = Σ f[k]·z^(−k), k = −(L − 1)/2, ..., (L − 1)/2.


The FIR's frequency response is computed by evaluating the filter's transfer function around the periphery of the unit circle, by setting z = e^(jωT). It then follows that:

F(ω) = Σ f[k]·e^(−jωkT)

We denote by |F(ω)| the filter's magnitude frequency response, and Φ(ω) denotes the phase response, which satisfies:

Φ(ω) = arctan( Im{F(ω)} / Re{F(ω)} )

    Digital filters are more often characterized by phase and magnitude than by the z-domain

    transfer function or the complex frequency transform.

    Table 1: Four possible linear-phase FIR filters

    LINEAR-PHASE FIR FILTERS

    Maintaining phase integrity across a range of frequencies is a desired system attribute in many

    applications such as communications and image processing. As a result, designing filters that

    establish linear-phase versus frequency is often mandatory. The standard measure of the phase

linearity of a system is the group delay, defined by:

τg(ω) = −dΦ(ω)/dω.


    A perfectly linear-phase filter has a group delay that is constant over a range of frequencies. It

    can be shown that linear-phase is achieved if the filter is symmetric or antisymmetric. A

    constant group delay can only be achieved if the frequency response F() is a purely real or

imaginary function. This implies that the filter's impulse response possesses even or odd symmetry, that is:

f[n] = ±f[L − 1 − n].

    An odd-order even-symmetry FIR filter would, for example, have a frequency response given

    by:

    which is seen to be a purely real function of frequency. Table 1 summarizes the four possible

    choices of symmetry, antisymmetry, even order and odd order. In addition, Table 1 graphically

    displays an example of each class of linear-phase FIR.

    Fig. 3: Linear-phase filter with reduced number of multipliers

    The symmetry properties intrinsic to a linear-phase FIR can also be used to reduce the necessary

    number of multipliers L, as shown in Fig. 1. Consider the linear-phase FIR shown in Fig. 3

    (even symmetry assumed), which fully exploits coefficient symmetry. Observe that the


symmetric architecture has a multiplier budget per filter cycle of exactly half of that found in the direct architecture shown in Fig. 1 (L/2 versus L multipliers), while the number of adders remains constant at L − 1.

    DESIGNING FIR FILTERS

    There are two methods for FIR Filter design:

    Direct Window Design Method

    Equiripple Design Method

    DIRECT WINDOW DESIGN METHOD

    The discrete Fourier transform (DFT) establishes a direct connection between the frequency and

    time domains. Since the frequency domain is the domain of filter definition, the DFT can be

    used to calculate a set of FIR filter coefficients that produce a filter that approximates the

    frequency response of the target filter. A filter designed in this manner is called a direct FIR

filter. A direct FIR filter is defined by the inverse DFT of the sampled target frequency response:

f[n] = (1/N)·Σ F(k)·e^(j2πnk/N), k = 0, 1, ..., N − 1.

    Consider a length-16 direct FIR filter design with a rectangular window, shown in Fig. 4a, with

    the passband ripple shown in Fig. 4b. Note that the filter provides a reasonable approximation to

    the ideal lowpass filter with the greatest mismatch occurring at the edges of the transition band.

    The observed ringing is due to the Gibbs phenomenon, which relates to the inability of a

    finite Fourier spectrum to reproduce sharp edges. The Gibbs ringing is implicit in the direct

    inverse DFT method and can be expected to be about7% over a wide range of filter orders. To

    illustrate this, consider the example filter with length 128, shown in Fig. 4c, with the passband

    ripple shown in Fig. 3.6d. Although the filter length is essentially increased (from 16 to 128) the

    ringing at the edge still has about the same quantity. The effects of ringing can only be

    suppressed with the use of a data window that tapers smoothly to zero on both sides. Data

    windows overlay the FIRs impulse response, resulting in a smoother magnitude frequency

    response with an attendant widening of the transition band. If, for instance, a Kaiser window is

    applied to the FIR, the Gibbs ringing can be reduced.


    Fig. 4: Gibbs phenomenon.(a)Impulse response of FIR lowpass with L=16. (b) Passband of

    transfer function L=16. (c)Impulse response of FIR lowpass with L= 128. (d) Passband of

    transfer function L= 128.

    The most common windows, denoted w[n], are:


    EQUIRIPPLE DESIGN METHOD

A typical filter specification not only includes the passband ωp and stopband ωs frequencies and the ideal gains, but also the allowed deviation (or ripple) from the desired transfer function. The transition band is most often assumed to be arbitrary in terms of ripple. A special

    class of FIR filter that is particularly effective in meeting such specifications is called the

    equiripple FIR. An equiripple design protocol minimizes the maximal deviations (ripple error)

    from the ideal transfer function. The equiripple algorithm applies to a number of FIR design

    instances. The most popular are:

    Lowpass filter design

Hilbert filter, i.e., a unit-magnitude filter that produces a 90° phase shift for all frequencies in the passband

Differentiator filter, which has a linearly increasing frequency magnitude proportional to ω

The equiripple or minimum-maximum (minimax) algorithm is normally implemented using the Parks–McClellan iterative method, which produces an equiripple or minimax data fit in the frequency domain.

The length of the polynomial, and therefore of the filter, can be estimated for a lowpass via

L ≈ (−10·log10(δp·δs) − 13) / (14.6·(ωs − ωp)/(2π)) + 1,

where δp is the passband ripple and δs the stopband ripple.

    CONSTANT COEFFICIENT FIR DESIGN

    The method used for implementing FIR filters in FPGAs is the Constant Coefficient FIR Design

    method. The different ways to implement this method are:

    Direct design

    Transposed form design

    Design using Distributed Arithmetic (DA) architecture


    DIRECT FIR DESIGN

    The direct FIR filter shown in Fig. 1 can be implemented in VHDL using (sequential)

    PROCESS statements or by component instantiations of the adders and multipliers. A

    PROCESS design provides more freedom to the synthesizer, while component instantiation

    gives full control to the designer. To illustrate this, a length-4 FIR will be presented as a

    PROCESS design. Although a length-4 FIR is far too short for most practical applications, it is

easily extended to higher orders and has the advantage of a short compile time. The linear-phase (and therefore symmetric) FIR's impulse response is assumed to be given by the coefficients [−1, 3.75, 3.75, −1].

    FOUR-TAP DIRECT FIR FILTER VHDL CODE:

    PACKAGE eight_bit_int IS -- User-defined types

    SUBTYPE BYTE IS INTEGER RANGE -128 TO 127;

    TYPE ARRAY_BYTE IS ARRAY (0 TO 3) OF BYTE;

    END eight_bit_int;

    LIBRARY work;

    USE work.eight_bit_int.ALL;

    LIBRARY ieee;

    USE ieee.std_logic_1164.ALL;

    USE ieee.std_logic_arith.ALL;

    ENTITY fir_srg IS ------> Interface

    PORT (clk : IN STD_LOGIC;

    x : IN BYTE;

    y : OUT BYTE);

    END fir_srg;

    ARCHITECTURE flex OF fir_srg IS

    SIGNAL tap : ARRAY_BYTE := (0,0,0,0);

    -- Tapped delay line of bytes

    BEGIN

p1: PROCESS ------> Behavioral style

BEGIN

WAIT UNTIL clk = '1';

-- Compute the output y as the sum of the weighted taps.
-- The coefficients are [-1 3.75 3.75 -1]; note 3.75 = 2 + 1 + 1/2 + 1/4,
-- since division is only allowed by powers of two.
y <= 2 * tap(1) + tap(1) + tap(1) / 2 + tap(1) / 4
   + 2 * tap(2) + tap(2) + tap(2) / 2 + tap(2) / 4
   - tap(3) - tap(0);

FOR I IN 3 DOWNTO 1 LOOP
tap(I) <= tap(I-1); -- Tapped delay line: shift by one
END LOOP;
tap(0) <= x; -- Input in register 0

END PROCESS p1;

END flex;


    IIR FILTERS

    A nonrecursive filter incorporates, as the name implies, no feedback. The impulse response of

    such a filter is finite, i.e., it is an FIR filter. A recursive filter, on the other hand has feedback,

    and is expected, in general, to have an infinite impulse response, i.e., to be an IIR filter. Figure

    4.4a shows filters with separate recursive and nonrecursive parts. A canonical filter is produced

    if these recursive and nonrecursive parts are merged together, as shown in Fig. 4.4b. The

    transfer function of the filter from Fig. 4.4 can be written as:

    The difference equation for such a system yields:

Comparing this with the difference equation for the FIR filter, we find that the difference equation for recursive systems depends not only on the L previous values of the input sequence x[n], but also on the L − 1 previous values of y[n].

If we compute the poles and zeros of F(z), we see that the nonrecursive part, i.e., the numerator of F(z), produces the zeros, while the denominator of F(z) produces the poles pℓ.


    FAST IIR FILTER

    FIR filter Registered Performance was improved using pipelining. In the case of FIR filters,

    pipelining can be achieved at essentially no cost. Pipelining IIR filters, however, is more

    sophisticated and is certainly not free. Simply introducing pipeline registers for all adders will,

    especially in the feedback path, very likely change the pole locations and therefore the transfer

    function of the IIR filter.

    The methods that improve IIR filter throughput are:

    Look-ahead interleaving in the time domain

    Parallel processing

    These methods are based on filter architecture or signal flow techniques. These techniques will

    be demonstrated with examples. To simplify the VHDL representation of each case, only a first-

    order IIR filter will be considered, but the same ideas can be applied to higher-order IIR filters.

    TIME-DOMAIN INTERLEAVING

Consider the difference equation of a first-order IIR system, namely

y[n + 1] = a·y[n] + b·x[n].

The output of the first-order system, namely y[n + 2], can be computed using a look-ahead methodology by substituting y[n + 1] into the difference equation. That is,

y[n + 2] = a·y[n + 1] + b·x[n + 1] = a²·y[n] + a·b·x[n] + b·x[n + 1].

The equivalent system is shown in Fig. 4.14. This concept can be generalized by applying the look-ahead transform for (S − 1) steps, resulting in:

y[n + S] = a^S·y[n] + Σ a^k·b·x[n + S − 1 − k], k = 0, 1, ..., S − 1.


It can be seen that the sum term defines an FIR filter having coefficients {b, ab, a²b, ..., a^(S−1)·b} that can be pipelined using the usual pipelining techniques. The recursive part can now also be implemented with an S-stage pipelined multiplier for the coefficient a^S.

    The VHDL code shown below, implements the IIR filter in look-ahead form.

    PACKAGE n_bit_int IS -- User-defined type

    SUBTYPE BITS15 IS INTEGER RANGE -2**14 TO 2**14-1;

    END n_bit_int;

    LIBRARY work;

    USE work.n_bit_int.ALL;

    LIBRARY ieee;

    USE ieee.std_logic_1164.ALL;

    USE ieee.std_logic_arith.ALL;

    ENTITY iir_pipe IS

    PORT ( x_in : IN BITS15; -- Input

    y_out : OUT BITS15; -- Result

    clk : IN STD_LOGIC);

    END iir_pipe;

    ARCHITECTURE fpga OF iir_pipe IS

    SIGNAL x, x3, sx, y, y9 : BITS15 := 0;

    BEGIN

PROCESS -- Use FFs for input, output and pipeline stages

BEGIN

WAIT UNTIL clk = '1';

x <= x_in;
x3 <= x / 2 + x / 4;   -- Compute x*3/4
sx <= x + x3;          -- Sum of x terms, i.e., output of the FIR part
y9 <= y / 2 + y / 16;  -- Compute y*9/16
y <= sx + y9;          -- Compute the output

END PROCESS;

y_out <= y; -- Connect register y to the output pins

END fpga;

    PARALLEL PROCESSING

    In a parallel-processing filter implementation [100], P parallel IIR paths are formed, each

    running at a 1/P input sampling rate. They are combined at the output using a multiplexer, as

    shown in Fig. 4.18. Because a multiplexer, in general, will be faster than a multiplier and/or

    adder, the parallel approach will be faster. Furthermore, each path P has a factor of P more time

    to compute its assigned output.

To illustrate, consider again a first-order system and P = 2. The look-ahead scheme, as in (4.11), is now split into even (n = 2k) and odd (n = 2k + 1) output sequences, obtaining

y[2k + 2] = a²·y[2k] + a·b·x[2k] + b·x[2k + 1]

y[2k + 1] = a²·y[2k − 1] + a·b·x[2k − 1] + b·x[2k]

where n, k ∈ Z. The two equations are the basis for the following parallel IIR filter FPGA implementation.


    VHDL CODE:

    PACKAGE n_bit_int IS -- User-defined type

    SUBTYPE BITS15 IS INTEGER RANGE -2**14 TO 2**14-1;

    END n_bit_int;

    LIBRARY work;

    USE work.n_bit_int.ALL;

    LIBRARY ieee;

    USE ieee.std_logic_1164.ALL;

    USE ieee.std_logic_arith.ALL;

    ENTITY iir_par IS ------> Interface

    PORT ( clk, reset : IN STD_LOGIC;

    x_in : IN BITS15;

    x_e, x_o, y_e, y_o : OUT BITS15;

    clk2 : OUT STD_LOGIC;

    y_out : OUT BITS15);

    END iir_par;

    ARCHITECTURE fpga OF iir_par IS

    TYPE STATE_TYPE IS (even, odd);

    SIGNAL state : STATE_TYPE;

    SIGNAL x_even, xd_even : BITS15 := 0;

    SIGNAL x_odd, xd_odd, x_wait : BITS15 := 0;

    SIGNAL y_even, y_odd, y_wait, y : BITS15 := 0;

    SIGNAL sum_x_even, sum_x_odd : BITS15 := 0;

    SIGNAL clk_div2 : STD_LOGIC;

    BEGIN

Multiplex: PROCESS (reset, clk) --> Split x into even and

BEGIN -- odd samples; recombine y at clk rate

IF reset = '1' THEN -- asynchronous reset
state <= even;
ELSIF rising_edge(clk) THEN
CASE state IS
WHEN even =>
x_even <= x_in;  -- Register the even-indexed sample
x_odd <= x_wait; -- Release the stored odd sample
clk_div2 <= '1';
y <= y_wait;
state <= odd;
WHEN odd =>
x_wait <= x_in;  -- Hold the odd-indexed sample
y <= y_odd;
y_wait <= y_even;
clk_div2 <= '0';
state <= even;
END CASE;
END IF;

END PROCESS Multiplex;

y_out <= y;
clk2 <= clk_div2;
x_e <= x_even; -- Test signals
x_o <= x_odd;
y_e <= y_even;
y_o <= y_odd;

Arithmetic: PROCESS -- Two parallel paths, each running at clk/2

BEGIN

WAIT UNTIL clk_div2 = '0';
-- Even path: y[2k] = x[2k] + 3/4*x[2k-1] + 9/16*y[2k-2]
xd_odd <= x_odd;
sum_x_even <= x_even + xd_odd / 2 + xd_odd / 4;
y_even <= sum_x_even + y_even / 2 + y_even / 16;
-- Odd path: y[2k+1] = x[2k+1] + 3/4*x[2k] + 9/16*y[2k-1]
sum_x_odd <= x_odd + x_even / 2 + x_even / 4;
y_odd <= sum_x_odd + y_odd / 2 + y_odd / 16;

END PROCESS Arithmetic;

END fpga;

The design is realized with two PROCESS statements. In the first, PROCESS Multiplex, x is split into even- and odd-indexed parts, and the output y is recombined at the clk rate. In addition, the first PROCESS statement generates the second clock, running at clk/2. The second PROCESS implements the filter's arithmetic according to (4.22). The design uses 268 LEs, no embedded multiplier, and has a 168.12 MHz Registered Performance.


    DECIMATION

    INTRODUCTION

    A frequent task in digital signal processing is to adjust the sampling rate according to the signal

    of interest. Systems with different sampling rates are referred to as multirate systems. Two

    typical examples in multirate DSP systems are decimation and interpolation . Multirate systems

    are sometimes used for sampling-rate conversion, which involves both decimation and

    interpolation.

    DECIMATION

    Decimation can be regarded as the discrete-time counterpart of sampling. Whereas in sampling

    we start with a continuous-time signal x(t) and convert it into a sequence of samples x[n], in

    decimation we start with a discrete-time signal x[n] and convert it into another discrete-time

    signal y[n], which consists of sub-samples of x[n]. Thus, the formal definition of M-fold

decimation, or down-sampling, is

y[n] = x[nM].

In decimation, the sampling rate is reduced from Fs to Fs/M by discarding M − 1 out of every M samples of the original sequence. A narrow filter followed by a down-sampler is usually referred to as a decimator.

    Fig 1: Block diagram notation of decimation, by a factor of M.

The block diagram notation of the decimation process is depicted in Fig. 1.


An anti-aliasing digital filter precedes the down-sampler to prevent aliasing from occurring due to the lower sampling rate. Figure 2 below illustrates the concept of 3-fold decimation, i.e., M = 3. Here the samples of x[n] corresponding to n = ..., −2, 1, 4, ... and n = ..., −1, 2, 5, ... are lost in the decimation process.

In general, the samples of x[n] corresponding to n ≠ kM, where k is an integer, are discarded in M-fold decimation. Figure 2 shows the samples of the decimated signal y[n] spaced three times wider than the samples of x[n].

    In real time, the decimated signal appears at a slower rate than that of the original signal

    by a factor of M.

    If the sampling frequency of x[n] is Fs, then that of y[n] is Fs/M.

    Fig 2: Decimation of a discrete-time signal by a factor of 3
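A minimal Verilog sketch of the down-sampling operation y[n] = x[nM] (our own illustration, with M = 3 to match the figure; the anti-aliasing filter is assumed to sit in front of this block):

module downsample #(parameter M = 3) (
  input            clk,      // input-rate clock Fs
  input            reset,
  input      [7:0] x,
  output reg [7:0] y,
  output reg       y_valid   // one pulse every M input samples
);
  reg [7:0] cnt;
  always @(posedge clk) begin
    if (reset) begin
      cnt <= 0; y_valid <= 0;
    end else if (cnt == M-1) begin
      cnt <= 0; y <= x; y_valid <= 1;   // keep every M-th sample
    end else begin
      cnt <= cnt + 1; y_valid <= 0;     // discard the other M-1 samples
    end
  end
endmodule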

    ANTI ALIASING FILTER

We can reduce the sampling rate only down to the limit set by the Nyquist rate, which requires that the sampling rate remain higher than twice the bandwidth of the signal in order to avoid aliasing. Aliasing is demonstrated in Fig. 3 for a lowpass signal. Aliasing is irreparable and should be avoided at all cost. For a bandpass signal, the frequency band of interest must fall within an integer band. If fs is the sampling rate and R is the desired down-sampling factor, then the band of interest must fall between k·fs/(2R) and (k + 1)·fs/(2R) for some integer k. If it does not, there may be aliasing due to copies from the negative frequency bands, even though the sampling rate may still be higher than the Nyquist rate.

    Fig 3: Unaliased and aliased decimation cases.

Fig 4: Decimation of a signal x[n] ↔ X(ω).

    DOWN SAMPLER

Down-sampling is the process of reducing the sampling rate of a signal. The down-sampling factor is usually an integer or a rational fraction greater than one, and the sampling rate can only be reduced down to the Nyquist-rate limit. A down-sampler with a down-sampling factor M, where M is a positive integer, develops an output sequence y[n] with a sampling rate that is (1/M)-th of that of the input sequence x[n].


    VHDL CODE

library ieee;
use ieee.std_logic_1164.all;

entity decimator_1 is

    port( inseq: in std_logic_vector( 7 downto 0);--input sequence

    clk: in std_logic;

    reset:in std_logic;

    dec_op: out std_logic_vector( 7 downto 0));-- decimated output sequence

    end decimator_1;

    architecture Behavioral of decimator_1 is

    begin

process(clk, reset)

variable count : integer := 0; -- counts the input samples

begin

if reset = '1' then
count := 0;
elsif clk'event and clk = '1' then
if (count mod 2 = 0) then -- if count is a multiple of 2, the input is passed to the output
dec_op <= inseq;
end if;
count := count + 1;
end if;

end process;

end Behavioral;

    OUTPUT

    Fig: test bench waveform

    Fig: simulated output

Input sequence = {8'hff, 8'hfe, 8'hfd, 8'hfc, 8'hfb, 8'hfa}

Output sequence = {8'hff, 8'hfd, 8'hfb}


    INTERPOLATION

A frequent task in digital signal processing is to adjust the sampling rate according to the signal of interest. Systems with different sampling rates are referred to as multirate systems.

If, after A/D conversion, the signal of interest occupies only a small frequency band (typically lowpass or bandpass), it is reasonable to filter with a lowpass or bandpass filter and to reduce the sampling rate. A narrow filter followed by a down-sampler is usually referred to as a decimator. Increasing the sampling rate can be useful in the D/A conversion process, for example. Typically, D/A converters use a first-order sample-and-hold at the output, which produces a step-like output function. This can be compensated for with an analog 1/sinc(x) compensation filter, but most often a digital solution is more efficient.

We can use, in the digital domain, an expander and an additional filter to get the desired frequency band. The zeros introduced by the expander produce an extra copy of the baseband spectrum that must first be removed before the signal can be processed by the D/A converter.

For the interpolator, the Noble relation is defined as

F(z)·(↑R) = (↑R)·F(z^R),

i.e., in an interpolation, putting the filter before the expander results in an R-times shorter filter.


    INTERPOLATION

    A process by which the output sampling rate of a signal is increased is known

    as interpolation.

It consists of an up-sampler followed by an anti-imaging filter.

The up-sampling operation simply inserts (L − 1) zeros between every two input samples:

x(n) → [↑L] → v(m) → [H(z)] → y(m)

The up-sampling produces the intermediate signal v(m) from the input signal x(n). The output signal y(m) is obtained by convolving the intermediate signal with the impulse response h(n):

y(m) = Σ v(k)·h(m − k).

The up-sampled signal can be written as

v(m) = x(m/L) for m = 0, ±L, ±2L, ..., and v(m) = 0 otherwise.

The spectral property of up-sampling is simple in the z-transform domain: V(z) = X(z^L), so up-sampling is a contraction of the frequency axis by a factor of L.

Fig.: The original spectrum X(e^(jω)) over [−π, π]; the same spectrum over [−5π, 5π]; and the up-sampled spectrum.

Interpolation example: for R = 3, x[n] ↔ X(ω) is shown below.


Interpolation in the time domain: the up-sampling factor used is 4, so three zeros are inserted between two input samples. Up-sampling is thus an expansion in the time domain.

    CODE

entity interpolator is

port (a : in bit_vector(1 to 16);
b : out bit_vector(1 to 32));

end interpolator;

architecture Behavioral of interpolator is

begin

process (a)

begin

for I in 1 to 16 loop
b(2*I - 1) <= a(I); -- pass the input bit
b(2*I) <= '0';      -- insert a zero after every input bit
end loop;

end process;

end Behavioral;


    MULTISTAGE DECIMATOR

The single decimator stage is applied repeatedly to obtain the required output of the multistage decimator (i.e., up to the Pth stage).

    Block Diagram of Multistage Decimator

If the decimation rate R is large, it can be shown that a multistage design can be realized with less effort than a single-stage converter. In particular, S stages, each having a decimation capability of Rk, are designed to have an overall down-sampling rate of R = R1·R2···RS. Unfortunately, passband imperfections, such as ripple deviation, accumulate from stage to stage. As a result, a passband deviation target of δp must normally be tightened to the order of δp/S to meet the overall system specification. This is obviously a worst-case assumption, in which all short filters have their maximum ripple at the same frequencies, and it is in general too pessimistic. It is often more reasonable to try an initial value near the given passband specification δp, and then selectively reduce it if necessary.

MULTISTAGE DECIMATOR DESIGN USING GOODMAN–CAREY HALF-BAND FILTERS:

Goodman and Carey [80] proposed to develop multistage systems based on the use of CIC and half-band filters. A half-band filter has its passband and stopband located at ωs = ωp = π/2, i.e., midway in the baseband, and can therefore be used to change the sampling rate by a factor of two. If the half-band filter has point symmetry relative to ω = π/2, then all even coefficients (except the center tap) become zero.


    CIC FILTER:

A CIC (cascaded integrator-comb) filter is an optimized class of finite impulse response (FIR) filter combined with an interpolator or decimator.

It consists of one or more integrator and comb filter pairs.

For a decimating CIC, the input signal is fed through one or more cascaded integrators, then a down-sampler, followed by one or more comb sections.

For an impulse at the input, a single-stage CIC filter produces a step response at the output, and the same logic carries over to the multistage decimator; a minimal single-stage sketch is shown below.
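This sketch is our own illustration, assuming R = 8 and a differential delay of one; names and widths are hypothetical (the widths allow for the log2(R) bits of internal growth):

module cic1 #(parameter R = 8) (
  input                    clk,
  input                    reset,
  input  signed [7:0]      x_in,
  output reg signed [10:0] y_out   // 8 bits + log2(R) bits of growth
);
  reg signed [10:0] integ;     // integrator accumulator
  reg signed [10:0] comb_d;    // comb delay register
  reg [2:0]         cnt;       // decimation counter (for R = 8)
  always @(posedge clk) begin
    if (reset) begin
      integ <= 0; comb_d <= 0; cnt <= 0; y_out <= 0;
    end else begin
      integ <= integ + x_in;        // integrator runs at the input rate
      cnt   <= cnt + 1'b1;
      if (cnt == R-1) begin         // once per R samples:
        y_out  <= integ - comb_d;   //   comb: first difference at the low rate
        comb_d <= integ;
        cnt    <= 0;
      end
    end
  end
endmodule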

    VHDL PROGRAM:

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
use ieee.std_logic_signed.all;

entity vrb is

    Port ( clk : in STD_LOGIC;

    x_in : in STD_LOGIC_VECTOR (7 downto 0);

    y_out : out STD_LOGIC_VECTOR (8 downto 0));

    end vrb;

architecture Behavioral of vrb is

-- Word sizes shrink from stage to stage (LSBs are truncated)
SUBTYPE word26 IS STD_LOGIC_VECTOR(25 DOWNTO 0);
SUBTYPE word21 IS STD_LOGIC_VECTOR(20 DOWNTO 0);
SUBTYPE word16 IS STD_LOGIC_VECTOR(15 DOWNTO 0);
SUBTYPE word14 IS STD_LOGIC_VECTOR(13 DOWNTO 0);
SUBTYPE word13 IS STD_LOGIC_VECTOR(12 DOWNTO 0);
SUBTYPE word12 IS STD_LOGIC_VECTOR(11 DOWNTO 0);

TYPE STATE_TYPE is (hold,sample);
SIGNAL state :STATE_TYPE;
SIGNAL count:integer RANGE 0 to 64;
SIGNAL clk2: STD_LOGIC;
SIGNAL x : STD_LOGIC_VECTOR( 7 DOWNTO 0);
SIGNAL sxtx: STD_LOGIC_VECTOR( 25 DOWNTO 0);
SIGNAL i0 :word26;
SIGNAL i1 :word21;
SIGNAL i2 :word16;
SIGNAL i2d1,i2d2,i2d3,i2d4,c1,c0: word14;
SIGNAL c1d1,c1d2,c1d3,c1d4,c2: word13;
SIGNAL c2d1,c2d2,c2d3,c2d4,c3: word12;

begin

FSM:PROCESS -- Generate one output sample every 32 input clocks
BEGIN
WAIT UNTIL clk='0';
CASE state is
WHEN hold =>
IF count < 31 THEN
count <= count + 1;
state <= hold;
ELSE
count <= 0;
state <= sample;
END IF;
WHEN sample =>
count <= count + 1;
state <= hold;
END CASE;
END PROCESS FSM;

clk2 <= '1' WHEN state = sample ELSE '0';

Sxt: PROCESS(x) -- Sign-extend the 8-bit input to 26 bits
BEGIN
sxtx(7 DOWNTO 0) <= x;
FOR k IN 25 DOWNTO 8 LOOP
sxtx(k) <= x(7);
END LOOP;
END PROCESS Sxt;

Int: PROCESS -- Three integrator sections at the input rate
BEGIN
WAIT UNTIL clk = '1';
x <= x_in;
i0 <= i0 + sxtx;
i1 <= i1 + i0(25 DOWNTO 5);
i2 <= i2 + i1(20 DOWNTO 5);
END PROCESS Int;

Comb: PROCESS -- Three comb sections at the decimated rate
BEGIN
WAIT UNTIL clk2 = '1';
c0 <= i2(15 DOWNTO 2);
i2d1 <= c0; i2d2 <= i2d1; i2d3 <= i2d2; i2d4 <= i2d3;
c1 <= c0 - i2d4;
c1d1 <= c1(13 DOWNTO 1); c1d2 <= c1d1; c1d3 <= c1d2; c1d4 <= c1d3;
c2 <= c1(13 DOWNTO 1) - c1d4;
c2d1 <= c2(12 DOWNTO 1); c2d2 <= c2d1; c2d3 <= c2d2; c2d4 <= c2d3;
c3 <= c2(12 DOWNTO 1) - c2d4;
END PROCESS Comb;

y_out <= c3(11 DOWNTO 3); -- Keep the 9 MSBs

end Behavioral;


    ANOTHER PROGRAM:

    entity newvrb is

    port (a : in STD_LOGIC_vector(1 to 32);

    bintr : out STD_LOGIC_vector(1 to 16);

    cintr : out STD_LOGIC_vector(1 to 8);

    d : out STD_LOGIC_vector(1 to 4));

    end newvrb;

    architecture Behavioral of newvrb is

    signal b : STD_LOGIC_vector(1 to 16);

    signal c : STD_LOGIC_vector(1 to 8);

    begin

process (a,b,c)

begin

for I in 1 to 16 loop -- first 2:1 stage: 32 -> 16
b(I) <= a(2*I);
end loop;

for I in 1 to 8 loop -- second 2:1 stage: 16 -> 8
c(I) <= b(2*I);
end loop;

for I in 1 to 4 loop -- third 2:1 stage: 8 -> 4
d(I) <= c(2*I);
end loop;

bintr <= b;
cintr <= c;

end process;

end Behavioral;

    APPLICATIONS:

During A/D conversion: oversampling to relax the stringent requirements on the analog anti-aliasing filter.

    During D/A conversion: Filter to remove spectrum images.

    Fractional sampling rate conversion.


    POLYPHASE DECOMPOSITION

Polyphase decomposition is very useful when implementing decimation or interpolation in IIR or FIR filters and filter banks. To illustrate this, consider the polyphase decomposition of an FIR decimation filter. If we add down-sampling by a factor of R to the FIR filter structure shown in Figure 1, we find that we only need to compute the outputs y[n] at the time instances

n = mR, m ∈ Z. (1)

    Figure1: Direct form of FIR.

It follows that we do not need to compute all sum-of-products f[k]·x[n − k] of the convolution. For instance, x[0] only needs to be multiplied by

    f [0], f [R], f [2R] , . . . . (2)

    Besides x[0], these coefficients only need to be multiplied by

    x [R], x [2R] , . . . . (3)

It is therefore reasonable to first split the input signal into R separate sequences according to

x_r[n] = x[nR + r], r = 0, 1, ..., R − 1,

and also to split the filter f[n] into R sequences

f_r[n] = f[nR + r], r = 0, 1, ..., R − 1.

    Figure 2 shows a decimator filter implemented using polyphase decomposition. Such a

    decimator can run R times faster than the usual FIR filter followed by a downsampler. The

filters fr[n] are called polyphase filters, because they all have the same magnitude transfer function and are separated only by a sample delay, which introduces a phase offset. A final example illustrates the polyphase decomposition.

    EXAMPLE 5.1: POLYPHASE DECIMATOR FILTER

Consider a Daubechies length-4 filter G(z) with R = 2. Quantizing the filter to 8 bits of precision (i.e., scaling by 256) gives approximately

G(z) ≈ (124 + 214·z⁻¹ + 57·z⁻² − 33·z⁻³)/256,

and it follows that the two polyphase components are

G0(z) = (124 + 57·z⁻¹)/256, G1(z) = (214 − 33·z⁻¹)/256.

    Figure2: Polyphase realization of decimation filter.


The polyphase implementation for DB4 follows directly from these two subfilters; a sketch is shown below.
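A Verilog sketch of the R = 2 polyphase decimator (our own illustration, using the 8-bit coefficient values quantized above; module and signal names are hypothetical):

module db4poly (
  input                clk,
  input                reset,
  input  signed [7:0]  x_in,    // input samples at rate fs
  output signed [16:0] y_out    // decimated output, updates every 2nd clk
);
  // 8-bit DB4 polyphase taps: G0 = {124, 57} on even samples,
  // G1 = {214, -33} on odd samples (all scaled by 256)
  reg                phase;             // 0: even sample, 1: odd sample
  reg  signed [7:0]  x_e, x_e_d, x_o;   // polyphase delay lines
  reg  signed [16:0] acc;

  always @(posedge clk) begin
    if (reset) begin
      phase <= 1'b0;
      acc   <= 0;
    end else begin
      phase <= ~phase;
      if (!phase) begin
        x_e   <= x_in;                  // even-indexed sample
        x_e_d <= x_e;
      end else begin
        x_o <= x_in;                    // odd-indexed sample
        // one output per input pair: sum of both polyphase branches
        acc <= 124*x_e + 57*x_e_d + 214*x_in - 33*x_o;
      end
    end
  end
  assign y_out = acc;   // divide by 256 (shift) externally for unity gain
endmodule

One output is produced per input pair, so each polyphase branch effectively runs at half the input rate, as described above.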


Figure 3: Output for the given code


    FILTER BANKS

    A digital filter bank is a collection of filters having a common input or output. One

    common application of the analysis filter bank is spectrum analysis, i.e., to split the input signal

    into R different so-called subband signals. The combination of several signals into a common

    output signal is called a synthesis filter bank. The analysis filter may be nonoverlapping,

    slightly overlapping, or substantially overlapping. Another important characteristic that

    distinguishes different classes of filter banks is the bandwidth and spacing of the center

    frequencies of the filters. A popular example of a non-uniform filter bank is the octave-spaced

    or wavelet filter bank.


    UNIFORM DFT FILTER BANK:

In uniform filter banks, all filters have the same bandwidth and sampling rates. In a maximally decimated, or critically sampled, filter bank the decimation rate R is equal to the number of bands K. If the rth band filter is computed from the modulation of a single prototype filter h[n], according to

hr[n] = h[n]·e^(j2πrn/R), r = 0, 1, ..., R − 1, (1)

then it is a uniform DFT filter bank.

Fig. 1: R-channel filter bank, with a small amount of overlapping

An efficient implementation of the R-channel filter bank can be generated if we use a polyphase decomposition of the filter and of the input signal x[n]. Because each of these bandpass filters is critically sampled, we use a decomposition with R polyphase components according to

xk[n] = x[nR + k] (2)

hk[n] = h[nR + k], k = 0, 1, ..., R − 1. (3)

If we now substitute (2) into (1), we find that all bandpass filters share the same polyphase filters hk[n], while the twiddle factors for each filter are different. The twiddle multiplication for the rth band corresponds to the rth DFT component, with the R filtered polyphase components as its input vector. The computation for the whole analysis bank can therefore be reduced to filtering with R polyphase filters, followed by a DFT (or FFT) of these R filtered components. This is obviously much more efficient than direct computation. The polyphase filter bank for the uniform DFT synthesis bank can be developed as an inverse operation to the analysis bank. Perfect reconstruction occurs if the convolution of the analysis and synthesis polyphase filters gives a unit sample function, i.e.,

gk[n] * hk[n] = δ[n]. (4)

    TWO CHANNEL FILTER BANKS:

    The input x[n] is split by using lowpass G(z) and highpass H(z) analysis filters.

    The resulting signal x[n] is reconstructed using lowpass and highpass synthesis

    filters.

    Between the analysis and synthesis sections are decimation and interpolation by 2 units.

The construction rule is given by H(z) = G(−z), which defines the filters to be mirrored pairs. This is a quadrature mirror filter (QMF) bank, because the two filters have mirror symmetry about π/2.

    A perfectly reconstructed signal has the same shape as the original, up to a phase (time)

    shift.


    Fig.2 Two-channel filter bank

If the signal is applied to the two-channel filter bank, the (decimated and re-expanded) lowpass path XG(z) and highpass path XH(z) become

XG(z) = (1/2)·[G(z)·X(z) + G(−z)·X(−z)] (5)

XH(z) = (1/2)·[H(z)·X(z) + H(−z)·X(−z)] (6)

After multiplication by the synthesis filters G'(z) and H'(z) and summation of the results, we get the reconstruction X̂(z) as

X̂(z) = (1/2)·[G(z)·G'(z) + H(z)·H'(z)]·X(z) + (1/2)·[G(−z)·G'(z) + H(−z)·H'(z)]·X(−z) (7)

The factor at X(−z) shows the aliasing component, while the term at X(z) shows the amplitude distortion.

PERFECT RECONSTRUCTION:

Perfect reconstruction for a two-channel filter bank is achieved if

1) G(−z)·G'(z) + H(−z)·H'(z) = 0, i.e., the reconstruction is free of aliasing, and

2) G(z)·G'(z) + H(z)·H'(z) = 2·z^(−d), i.e., the amplitude distortion has amplitude one (a pure delay).

A two-channel filter bank is aliasing-free if G'(z) = H(−z) and H'(z) = −G(−z).

    IMPLEMENTING TWO-CHANNEL FILTER BANKS:

    POLYPHASE TWO-CHANNEL FILTER BANKS:

In the general case, with two filters G(z) and H(z), we can realize each filter as a polyphase filter, as shown below:

Fig. 3: Polyphase implementation of the two-channel filter bank

G(z) = G0(z²) + z⁻¹·G1(z²), H(z) = H0(z²) + z⁻¹·H1(z²). (8)

This does not reduce the hardware effort (2L multipliers and 2(L − 1) adders are still used), but the design can be run at twice the usual sampling frequency, 2fs. The four polyphase filters have only half the length of the original filters.

    LIFTING:

Another general approach to constructing fast and efficient two-channel filter banks is the lifting scheme, introduced by Sweldens and by Herley and Vetterli. The basic idea is the use of cross-terms (called lifting and dual-lifting), as in a lattice filter, to construct a longer filter from a short filter while preserving the perfect reconstruction conditions.

Any (bi)orthogonal wavelet filter bank can be converted into a sequence of lifting and dual-lifting steps. The number of multipliers and adders required then depends on the number of lifting steps (more steps gives less complexity), and the savings can reach up to 50% compared with the direct polyphase implementation.

    QMF IMPLEMENTATION:

For QMF, we know

H(z) = G(−z), (9)

but this implies that the polyphase filters are the same (except for the sign), i.e.,

G0(z) = H0(z), G1(z) = −H1(z). (10)

Instead of four filters, for QMF we only need two filters and an additional butterfly. This saves about 50%. For the QMF filter we need L real adders and L real multipliers, and the filter can run at twice the usual input sampling rate.
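A Verilog sketch of this QMF analysis trick (our own illustration; the two-tap polyphase coefficients are placeholders): the polyphase pair G0/G1 is computed once, and a sum/difference butterfly produces both subbands:

module qmf_analysis (
  input                clk,
  input                phase,   // 0: even input sample, 1: odd
  input  signed [7:0]  x,       // input at the full rate 2*fs
  output signed [17:0] g,       // lowpass subband (valid once per pair)
  output signed [17:0] h        // highpass subband
);
  // Illustrative 2-tap polyphase components of G(z); any QMF prototype
  // split as G(z) = G0(z^2) + z^-1 G1(z^2) works the same way.
  localparam signed [8:0] G0_0 = 45, G0_1 = 90;
  localparam signed [8:0] G1_0 = 90, G1_1 = 45;
  reg signed [7:0] x0, x0d, x1, x1d;

  always @(posedge clk) begin
    if (!phase) begin
      x0 <= x;  x0d <= x0;    // even polyphase stream
    end else begin
      x1 <= x;  x1d <= x1;    // odd polyphase stream
    end
  end

  // The two polyphase filters are shared by both channels ...
  wire signed [17:0] p0 = G0_0*x0 + G0_1*x0d;   // G0 path
  wire signed [17:0] p1 = G1_0*x1 + G1_1*x1d;   // G1 path
  // ... and one butterfly forms the two subbands (the ~50% saving)
  assign g = p0 + p1;   // lowpass:  G0 + G1 branches
  assign h = p0 - p1;   // highpass: H(z) = G(-z) => H0 = G0, H1 = -G1
endmodule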

    ORTHOGONAL FILTER BANKS:

If the highpass and lowpass polynomials are mirror versions of each other, we have an orthogonal filter bank. An orthogonal filter pair obeys the conjugate quadrature filter (CQF) condition, defined by

H(z) = −z^(−(L−1))·G(−z⁻¹). (11)

If we use the transposed FIR filter structure, we need only half the number of multipliers. The disadvantage is that we cannot benefit from polyphase decomposition to double the speed.

Fig. 4: Lattice realization for the orthogonal two-channel filter bank


    VHDL CODE:

    PACKAGE n_bits_int IS -- User-defined types

    SUBTYPE BITS8 IS INTEGER RANGE -128 TO 127;

    SUBTYPE BITS9 IS INTEGER RANGE -2**8 TO 2**8-1;

    SUBTYPE BITS17 IS INTEGER RANGE -2**16 TO 2**16-1;

    TYPE ARRAY_BITS17_4 IS ARRAY (0 TO 3) OF BITS17;

    END n_bits_int;

    LIBRARY work;

    USE work.n_bits_int.ALL;

    LIBRARY ieee;

    USE ieee.std_logic_1164.ALL;

    USE ieee.std_logic_arith.ALL;

    USE ieee.std_logic_unsigned.ALL;

    ENTITY db4latti IS ------> Interface

    PORT (clk, reset : IN std_logic;

    clk2 : OUT std_logic;

    x_in : IN BITS8;

    x_e, x_o : OUT BITS17;

    g, h : OUT BITS9);

    END db4latti;

    ARCHITECTURE fpga OF db4latti IS

    TYPE STATE_TYPE IS (even, odd);

    SIGNAL state : STATE_TYPE;

    SIGNAL sx_up, sx_low, x_wait : BITS17 := 0;

    SIGNAL clk_div2 : std_logic;

    SIGNAL sxa0_up, sxa0_low : BITS17 := 0;

    SIGNAL up0, up1, low0, low1 : BITS17 := 0;

    BEGIN

Multiplex: PROCESS (reset, clk) ----> Split into even and

BEGIN -- odd samples at clk rate

IF reset = '1' THEN -- Asynchronous reset
state <= even;
ELSIF rising_edge(clk) THEN
CASE state IS
WHEN even =>
-- Multiply with 256*s = 124 (s = 0.483 is the DB4 scale factor)
sx_up <= 124 * x_in;
sx_low <= x_wait;
clk_div2 <= '1';
state <= odd;
WHEN odd =>
x_wait <= 124 * x_in;
clk_div2 <= '0';
state <= even;
END CASE;
END IF;

END PROCESS Multiplex;

-- Lattice stage: u = even + sqrt(3)*odd, v = sqrt(3)*even - odd,
-- with sqrt(3) quantized as 443/256
sxa0_up <= sx_up + (443 * sx_low) / 256;
sxa0_low <= (443 * sx_up) / 256 - sx_low;

LatticeReg: PROCESS -- one z^-1 for the lattice at the decimated rate
BEGIN
WAIT UNTIL clk_div2 = '1';
up0 <= sxa0_up;
low0 <= sxa0_low;
up1 <= up0;   -- additional delayed copies (pipeline balance)
low1 <= low0;
END PROCESS LatticeReg;

-- Outputs: g = u + (2-sqrt(3))*v[n-1], h = (2-sqrt(3))*u - v[n-1],
-- with 2-sqrt(3) quantized as 69/256; rescale by 256 to fit BITS9
clk2 <= clk_div2;
x_e <= sx_up;
x_o <= sx_low;
g <= (sxa0_up + (69 * low0) / 256) / 256;
h <= ((69 * sxa0_up) / 256 - low0) / 256;

END fpga;

Computational complexity is reduced.

QMF-based subband coders provide more natural-sounding pitch prediction and wider bandwidth than earlier subband coders.

    APPLICATIONS:

    Accurate channel selection in wireless communications.

    Faster convergence and lower complexity in adaptive equalization.

    Flexible compression of speech and music

    Lower latency and better frequency compensation in hearing aids

    More efficient short-time spectral analysis and synthesis

    Multi-resolution image compression and wavelet transformations

    Reliable automatic speech recognition.


    DIT-FFT ALGORITHM

A Fast Fourier Transform (FFT) is an efficient algorithm for calculating the discrete Fourier transform (DFT) of a set of data. A DFT decomposes a set of time-domain data into its different frequency components. The DFT is defined by the following equation:

X(k) = Σ x(n)·W_N^(nk), n = 0, 1, ..., N − 1, where W_N = e^(−j2π/N).

An FFT algorithm uses some interesting properties of the above formula to simplify the calculations.

    COOLEY-TUKEY ALGORITHM

    The CooleyTukey algorithm, named after J.W. Cooley and John Tukey, is the most common

    fast Fourier transform (FFT) algorithm. It re-expresses the discrete Fourier transform (DFT) of

    an arbitrary composite size N = N1N2 in terms of smaller DFTs of sizes N1 and N2, recursively,

    in order to reduce the computation time to O(N log N) for highly-composite N (smooth

    numbers).

Basically, the computational problem for the DFT is to compute the sequence {X(k)} of N complex-valued numbers given another sequence of data {x(n)} of length N, according to the formula

X(k) = Σ x(n)·W_N^(nk), 0 ≤ k ≤ N − 1.

In general, the data sequence x(n) is also assumed to be complex-valued. Similarly, the IDFT becomes

x(n) = (1/N)·Σ X(k)·W_N^(−nk), 0 ≤ n ≤ N − 1.

Since the DFT and IDFT involve basically the same type of computations, our discussion of efficient computational algorithms for the DFT applies as well to the efficient computation of the IDFT.

We observe that for each value of k, direct computation of X(k) involves N complex multiplications (4N real multiplications) and N − 1 complex additions (4N − 2 real additions). Consequently, to compute all N values of the DFT requires N² complex multiplications and N² − N complex additions.

Direct computation of the DFT is basically inefficient primarily because it does not exploit the symmetry and periodicity properties of the phase factor W_N. In particular, these two properties are:

W_N^(k+N/2) = −W_N^k (symmetry)

W_N^(k+N) = W_N^k (periodicity)

The computationally efficient algorithms described in this section, known collectively as fast Fourier transform (FFT) algorithms, exploit these two basic properties of the phase factor.

    RADIX-2 FFT ALGORITHM

Let us consider the computation of the N = 2^v point DFT by the divide-and-conquer approach. We split the N-point data sequence into two N/2-point data sequences f1(n) and f2(n), corresponding to the even-numbered and odd-numbered samples of x(n), respectively, that is,

f1(n) = x(2n), f2(n) = x(2n + 1), n = 0, 1, ..., N/2 − 1.

Thus f1(n) and f2(n) are obtained by decimating x(n) by a factor of 2, and hence the resulting FFT algorithm is called a decimation-in-time algorithm.

Now the N-point DFT can be expressed in terms of the DFTs of the decimated sequences as follows:

X(k) = Σ f1(m)·W_N^(2mk) + W_N^k·Σ f2(m)·W_N^(2mk), m = 0, 1, ..., N/2 − 1.

But W_N² = W_(N/2). With this substitution, the equation can be expressed as

X(k) = F1(k) + W_N^k·F2(k), k = 0, 1, ..., N − 1,

where F1(k) and F2(k) are the N/2-point DFTs of the sequences f1(m) and f2(m), respectively. Since F1(k) and F2(k) are periodic with period N/2, we have F1(k + N/2) = F1(k) and F2(k + N/2) = F2(k). In addition, the factor W_N^(k+N/2) = −W_N^k. Hence the equation may be expressed as

X(k) = F1(k) + W_N^k·F2(k)

X(k + N/2) = F1(k) − W_N^k·F2(k), k = 0, 1, ..., N/2 − 1.

We observe that the direct computation of F1(k) requires (N/2)² complex multiplications. The same applies to the computation of F2(k). Furthermore, there are N/2 additional complex multiplications required to compute W_N^k·F2(k). Hence the computation of X(k) requires 2(N/2)² + N/2 = N²/2 + N/2 complex multiplications. This first step results in a reduction of the number of multiplications from N² to N²/2 + N/2, which is about a factor of 2 for N large.

By computing N/4-point DFTs, we would obtain the N/2-point DFTs F1(k) and F2(k) from relations of the same butterfly form, e.g.,

F1(k) = P1(k) + W_(N/2)^k·Q1(k)

F1(k + N/4) = P1(k) − W_(N/2)^k·Q1(k), k = 0, 1, ..., N/4 − 1,

where P1(k) and Q1(k) are the N/4-point DFTs of the even- and odd-indexed samples of f1(n).

The decimation of the data sequence can be repeated again and again until the resulting sequences are reduced to one-point sequences. For N = 2^v, this decimation can be performed v = log2 N times. Thus the total number of complex multiplications is reduced to (N/2)·log2 N. The number of complex additions is N·log2 N.

The following figure depicts the computation of an N = 8 point DFT. We observe that the computation is performed in three stages, beginning with the computation of four two-point DFTs, then two four-point DFTs, and finally one eight-point DFT. The combination of the smaller DFTs to form the larger DFT is illustrated in the following figure for N = 8.


    Figure1 Three stages in the computation of an N = 8-point DFT.

    Figure 2 Eight-point decimation-in-time FFT algorithm.


    Figure 3 Basic butterfly computation in the decimation-in-time FFT algorithm.

An important observation concerns the order of the input data sequence after it is decimated (v − 1) times. For example, if we consider the case where N = 8, the first decimation yields the sequence x(0), x(2), x(4), x(6), x(1), x(3), x(5), x(7), and the second decimation results in the sequence x(0), x(4), x(2), x(6), x(1), x(5), x(3), x(7). This shuffling of the input data sequence has a well-defined order, as can be ascertained from the following table, which illustrates the decimation of the eight-point sequence.

INPUT DATA INDEX | INDEX BITS | REVERSED BITS | OUTPUT DATA INDEX
0 | 000 | 000 | 0
4 | 100 | 001 | 1
2 | 010 | 010 | 2
6 | 110 | 011 | 3
1 | 001 | 100 | 4
5 | 101 | 101 | 5
3 | 011 | 110 | 6
7 | 111 | 111 | 7

Table: Shuffling of the data and bit reversal.
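The reordering can be generated by simply reversing the index bits; a tiny Verilog helper for N = 8 (our own illustration):

module bitrev3 (input [2:0] i, output [2:0] r);
  assign r = {i[0], i[1], i[2]};   // swap MSB and LSB; the middle bit stays
endmodule

Feeding the addresses 0 through 7 into this mapping produces the shuffled order 0, 4, 2, 6, 1, 5, 3, 7 from the table.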

    ADVANTAGES:

    To reduce the computational complexity of the DFT algorithm.

    Energy compaction.

    Delay can be reduced.


    CODE:

    module ditfft(clk,sel,yr,yi);

    input clk;

    input [2:0]sel;

    output reg [7:0]yr,yi ;

    wire [7:0]y0r,y1r,y2r,y3r,y4r,y5r,y6r,y7r,y0i,y1i,y2i,y3i,y4i,y5i,y6i,y7i;

    wire [7:0]x20r,x20i,x21r,x21i,x22r,x22i,x23r,x23i,x24r,x24i,x25r,x25i,x26r,x26i,x27r,x27i;

    wire [7:0]x10r,x10i,x11r,x11i,x12r,x12i,x13r,x13i,x14r,x14i,x15r,x15i,x16r,x16i,x17r,x17i;

    wire [7:0]x0,x1,x2,x3,x4,x5,x6,x7;

    assign x0=8'b10;

    assign x1=8'b10;

    assign x2=8'b10;

    assign x3=8'b10;

    assign x4=8'b1;

    assign x5=8'b1;

    assign x6=8'b1;

    assign x7=8'b1;

    //stage1

    bfly1 s11(x0,x4,x10r,x10i,x11r,x11i);

    bfly1 s12(x2,x6,x12r,x12i,x13r,x13i);

    bfly1 s13(x1,x5,x14r,x14i,x15r,x15i);

    bfly1 s14(x3,x7,x16r,x16i,x17r,x17i);

    //stage2

    bfly1 s21(x10r,x12r,x20r,x20i,x22r,x22i);


    bfly2 s22(x11r,x11i,x13r,x13i,x21r,x21i,x23r,x23i);

    bfly1 s23(x14r,x16r,x24r,x24i,x26r,x26i);

    bfly2 s24(x15r,x15i,x17r,x17i,x25r,x25i,x27r,x27i);

    //stage3

    bfly1 s31(x20r,x24r,y0r,y0i,y4r,y4i);

    bfly3 s32(x21r,x21i,x25r,x25i,y1r,y1i,y5r,y5i);

    bfly2 s33(x22r,x22i,x26r,x26i,y2r,y2i,y6r,y6i);

bfly4 s34(x23r,x23i,x27r,x27i,y3r,y3i,y7r,y7i);

// Output multiplexer: select one of the eight frequency samples
always @(posedge clk)
case (sel)
3'd0: begin yr <= y0r; yi <= y0i; end
3'd1: begin yr <= y1r; yi <= y1i; end
3'd2: begin yr <= y2r; yi <= y2i; end
3'd3: begin yr <= y3r; yi <= y3i; end
3'd4: begin yr <= y4r; yi <= y4i; end
3'd5: begin yr <= y5r; yi <= y5i; end
3'd6: begin yr <= y6r; yi <= y6i; end
3'd7: begin yr <= y7r; yi <= y7i; end
endcase

endmodule

    module bfly1(x,y,x0r,x0i,x1r,x1i);

    input [7:0]x,y;

    output[7:0]x1r,x1i,x0r,x0i;

    assign x0r=x+y;

    assign x0i=8'd0;

    assign x1r=x-y;

    assign x1i=8'd0;

    endmodule

module bfly2(xr,xi,yr,yi,x0r,x0i,x1r,x1i); // butterfly with W8^2 = -j

input [7:0] xr,xi,yr,yi;
output [7:0] x0r,x0i,x1r,x1i;

assign x0r = xr + yi;
assign x0i = xi - yr;
assign x1r = xr - yi;
assign x1i = xi + yr;

endmodule


module bfly3(xr,xi,yr,yi,x0r,x0i,x1r,x1i); // butterfly with W8^1 = (1-j)/sqrt(2)

input [7:0] xr,xi,yr,yi;
output [7:0] x0r,x0i,x1r,x1i;
parameter div = 1000; // 0.707 is approximated as 707/1000
wire [7:0] p1,p2;

assign p1 = (707*yr)/div;
assign p2 = (707*yi)/div;
assign x0r = xr + p1 + p2;
assign x0i = xi - p1 + p2;
assign x1r = xr - p1 - p2;
assign x1i = xi - p2 + p1;

endmodule

module bfly4(xr,xi,yr,yi,x0r,x0i,x1r,x1i); // butterfly with W8^3 = (-1-j)/sqrt(2)

input [7:0] xr,xi,yr,yi;
output [7:0] x0r,x0i,x1r,x1i;
parameter div = 1000; // 0.707 is approximated as 707/1000
wire [7:0] p1,p2;

assign p1 = (707*yr)/div;
assign p2 = (707*yi)/div;
assign x0r = xr - p1 + p2;
assign x0i = xi - p1 - p2;
assign x1r = xr + p1 - p2;
assign x1i = xi + p2 + p1;

endmodule


    OUTPUT:

    Input : 2,1,2,1,2,1,2,1.
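As a hand check of the simulation, this input is the constant 3/2 plus the alternating sequence (1/2)(-1)^n, so only bins k = 0 and k = 4 of the eight-point DFT are nonzero:

$$x(n) = \tfrac{3}{2} + \tfrac{1}{2}(-1)^{n} \;\Rightarrow\; X(0) = 8\cdot\tfrac{3}{2} = 12, \quad X(4) = 8\cdot\tfrac{1}{2} = 4, \quad X(k) = 0 \text{ otherwise.}$$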


    DIF-FFT

    FOURIER TRANSFORM

The Fourier transform is a useful analytical tool that is important for many fields of

application in digital signal processing.

In describing the properties of the Fourier transform and inverse Fourier transform, it is

quite convenient to use the concepts of time and frequency.

It also plays a critical role in image processing applications.

    FAST FOURIER TRANSFORM

The fast Fourier transform (FFT) was proposed by Cooley and Tukey in 1965.

The FFT is a highly efficient procedure for computing the DFT of a finite series and requires far fewer computations than direct evaluation of the DFT.

The FFT is based on decomposition: breaking the transform into smaller transforms and combining them to get the total transform.

    DISCRETE FOURIER TRANSFORM

The DFT pair is given by

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j2\pi nk/N}, \qquad x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, e^{j2\pi nk/N}.$$

Baseline for computational complexity:

Each DFT coefficient requires N complex multiplications and N-1 complex additions.

All N DFT coefficients require N^2 complex multiplications and N(N-1) complex additions.


    SYMMETRY AND PERIODICITY PROPERTY

Symmetry: $W_{N}^{k(N-n)} = W_{N}^{n(N-k)} = \left(W_{N}^{kn}\right)^{*}$

Periodicity: $W_{N}^{k(n+N)} = W_{N}^{(k+N)n} = W_{N}^{kn}$

The FFT algorithm provides speed-increase factors, when compared with direct computation of the DFT, of approximately 64 and 205 for 256-point and 1024-point transforms respectively.

The numbers of multiplications and additions required to compute an N-point DFT using the radix-2 FFT are (N/2)log2 N and N log2 N respectively.

EXAMPLE: For N = 64, the number of complex multiplications required using direct computation is

N^2 = 64^2 = 4096.

The number of complex multiplications required using the FFT is

(N/2)log2 N = (64/2)log2 64 = 192.

Speed improvement factor = 4096/192 = 21.33.

NUMBER OF COMPLEX MULTIPLICATIONS REQUIRED IN DIF-FFT ALGORITHM

    No. of points in a    Complex multiplications in     Complex multiplications in    Speed improvement
    sequence x(n), N      direct computation of the      FFT algorithms,               factor A/B
                          DFT, N x N = A                 (N/2) log2 N = B

           4                       16                              4                         4
           8                       64                             12                         5.3
          16                      256                             32                         8



    FFT ALGORITHMS

    There are basically two types of FFT algorithms.

    They are:

Decimation in Time (DIT)

Decimation in Frequency (DIF)

    DECIMATION-IN-FREQUENCY

It is a popular form of FFT algorithm.

Here the output sequence X(k) is divided into smaller and smaller subsequences; that is why it is named decimation in frequency.

Initially the input sequence x(n) is divided into two sequences x1(n) and x2(n), consisting of the first N/2 samples of x(n) and the last N/2 samples of x(n) respectively.

RADIX-2 DIF-FFT ALGORITHM

The N-point sequence x(n) is divided into two N/2-point sequences, from which the even- and odd-indexed DFT samples are computed.

The former N/2-point DFT yields the even samples:

$$X(2k) = \sum_{n=0}^{N/2-1} \left[ x(n) + x(n+N/2) \right] W_{N/2}^{nk}$$

The latter N/2-point DFT yields the odd samples:

$$X(2k+1) = \sum_{n=0}^{N/2-1} \left[ x(n) - x(n+N/2) \right] W_{N}^{n}\, W_{N/2}^{nk}$$

Decimation-in-frequency algorithm of length-8 for radix-2.


THE COMPARISON OF DIT AND DIF

The order of samples:

DIT-FFT: the input is in bit-reversed order and the output is in natural order.

DIF-FFT: the input is in natural order and the output is in bit-reversed order.

The butterfly computation:

DIT-FFT: the twiddle-factor multiplication is done before the additions.

DIF-FFT: the twiddle-factor multiplication is done after the additions.

Both DIT-FFT and DIF-FFT have identical computational complexity: for N = 2^L there

are in total L stages, each with N/2 butterfly computations. Each butterfly computation has 1

complex multiplication and 2 complex additions.

A DIT-FFT flow graph can be transposed to a DIF-FFT flow graph and vice versa.

RADIX-2 DIF-FFT ALGORITHM

    VERILOG CODE

    module diff(clk,sel,yr,yi);

    input clk;

    input [2:0]sel;

    output reg [7:0]yr,yi;

    wire [7:0]y0r,y1r,y2r,y3r,y4r,y5r,y6r,y7r,y0i,y1i,y2i,y3i,y4i,y5i,y6i,y7i;

    wire [7:0]x20r,x20i,x21r,x21i,x22r,x22i,x23r,x23i,x24r,x24i,x25r,x25i,x26r,x26i,x27r,x27i;

    wire [7:0]x10r,x10i,x11r,x11i,x12r,x12i,x13r,x13i,x14r,x14i,x15r,x15i,x16r,x16i,x17r,x17i;

    wire [7:0]x0r,x0i,x1r,x1i,x2r,x2i,x3r,x3i,x4r,x4i,x5r,x5i,x6r,x6i,x7r,x7i;

parameter w0r=8'b1; // W^0 = 1 (bfly1 hardwires W = 1)

parameter w0i=8'b0;

parameter w1r=8'b10110101; // W^1: 0.707 scaled by 256 (= 181)

parameter w1i=8'b01001011; // W^1: -0.707 scaled by 256, two's complement

parameter w2r=8'b0; // W^2 = -j (bfly2 hardwires the multiplication by -j)

parameter w2i=8'b11111111; // -1

parameter w3r=8'b01001011; // W^3: -0.707 scaled by 256, two's complement

parameter w3i=8'b01001011; // W^3: -0.707 scaled by 256, two's complement


    assign x0r=8'b11111111;

    assign x0i=8'b00000000;

    assign x1r=8'b11011100;

    assign x1i=8'b00010101;

    assign x2r=8'b11001101;

    assign x2i=8'b00000000;

    assign x3r=8'b11011100;

    assign x3i=8'b00010101;

    assign x4r=8'b10101011;

    assign x4i=8'b00000000;

    assign x5r=8'b00000110;

    assign x5i=8'b11101011;

    assign x6r=8'b11001101;

    assign x6i=8'b00000000;

    assign x7r=8'b00000110;

    assign x7i=8'b11101011;

    //stage1

    bfly1 s11(x0r,x0i,x4r,x4i,w0r,w0i,x10r,x10i,x14r,x14i);

    bfly3 s12(x1r,x1i,x5r,x5i,w1r,w1i,x11r,x11i,x15r,x15i);

    bfly2 s13(x2r,x2i,x6r,x6i,w2r,w2i,x12r,x12i,x16r,x16i);

    bfly4 s14(x3r,x3i,x7r,x7i,w3r,w3i,x13r,x13i,x17r,x17i);

    //stage2

    bfly1 s21(x10r,x10i,x12r,x12i,w0r,w0i,x20r,x20i,x22r,x22i);

    bfly2 s22(x11r,x11i,x13r,x13i,w2r,w2i,x21r,x21i,x23r,x23i);

    bfly1 s23(x14r,x14i,x16r,x16i,w0r,w0i,x24r,x24i,x26r,x26i);

    bfly2 s24(x15r,x15i,x17r,x17i,w2r,w2i,x25r,x25i,x27r,x27i);

    //stage3

    bfly1 s31(x20r,x20i,x24r,x24i,w0r,w0i,y0r,y0i,y1r,y1i);

    bfly1 s32(x22r,x22i,x26r,x26i,w0r,w0i,y2r,y2i,y3r,y3i);

    bfly1 s33(x21r,x21i,x25r,x25i,w0r,w0i,y4r,y4i,y5r,y5i);

    bfly1 s34(x23r,x23i,x27r,x27i,w0r,w0i,y6r,y6i,y7r,y7i);

    always@(posedge clk)

    case(sel)

    0:begin yr=y0r; yi=y0i; end

    1:begin yr=y1r; yi=y1i; end

    2:begin yr=y2r; yi=y2i; end

    3:begin yr=y3r; yi=y3i; end

    4:begin yr=y4r; yi=y4i; end

    5:begin yr=y5r; yi=y5i; end

    6:begin yr=y6r; yi=y6i; end

    7:begin yr=y7r; yi=y7i; end

    endcase

    endmodule

module bfly1(xr,xi,yr,yi,wr,wi,x0r,x0i,x1r,x1i); // sub module: butterfly with W = 1

input [7:0]xr,xi,yr,yi,wr,wi;

output[7:0]x1r,x1i,x0r,x0i;

// W = 1, so x0 = x + y and x1 = x - y

assign x0r=xr+yr;

assign x0i=xi+yi;

assign x1r=xr-yr;

assign x1i=xi-yi;

endmodule

module bfly2(xr,xi,yr,yi,wr,wi,x0r,x0i,x1r,x1i); // sub module: butterfly with W = -j

input [7:0]xr,xi,yr,yi,wr,wi;

output [7:0]x0r,x0i,x1r,x1i;

// W = -j, so W*y = yi - j*yr; x0 = x + W*y, x1 = x - W*y

assign x0r=xr+yi;

assign x0i=xi-yr;

assign x1r=xr-yi;

assign x1i=xi+yr;

endmodule

module bfly3(xr,xi,yr,yi,wr,wi,x0r,x0i,x1r,x1i); // sub module: butterfly with W = 0.707 - j0.707

input [7:0]xr,xi,yr,yi,wr,wi;

output [7:0]x0r,x0i,x1r,x1i;

wire [15:0]p1,p2,p3,p4;

wire [7:0]win,yrn,yin;

wire [8:0]ywr,ywi;

parameter sht=8'b1000; // right-shift by 8: twiddles are scaled by 256

assign yrn=~yr+1; // -yr (two's complement)

assign yin=yi;

assign win=~wi+1; // -wi, i.e., the magnitude of the negative twiddle component

assign p1=(yrn*wr)>>sht; // scaled partial products of W*y

assign p2=(yin*win)>>sht;

assign p3=(yrn*win)>>sht;

assign p4=(yin*wr)>>sht;

assign ywr=(~p1+1)+p2; // Re(W*y) = wr*yr - wi*yi

assign ywi=p3+p4; // Im(W*y) = wr*yi + wi*yr

assign x0r=xr+ywr; // x0 = x + W*y

assign x0i=xi+ywi;

assign x1r=xr+(~ywr+1); // x1 = x - W*y

assign x1i=xi+(~ywi+1);

endmodule

module bfly4(xr,xi,yr,yi,wr,wi,x0r,x0i,x1r,x1i); // sub module: butterfly with W = -0.707 - j0.707

input [7:0]xr,xi,yr,yi,wr,wi;

output [7:0]x0r,x0i,x1r,x1i;

wire [15:0]p1,p2;

wire [7:0]win,yrn,yin;

wire [8:0]ywr,ywi;

parameter sht=8'b1000; // right-shift by 8: twiddles are scaled by 256

assign yrn=~yr+1; // -yr (two's complement)

assign yin=~yi+1; // -yi

assign win=~wi+1; // magnitude of the negative twiddle component (|wr| = |wi| here)

assign p1=(yrn*win)>>sht; // scaled partial products of W*y

assign p2=(yin*win)>>sht;

assign ywr=p1+(~p2+1); // Re(W*y) = wr*yr - wi*yi

assign ywi=p1+p2; // Im(W*y) = wr*yi + wi*yr

assign x0r=xr+ywr; // x0 = x + W*y

assign x0i=xi+ywi;

assign x1r=xr+(~ywr+1); // x1 = x - W*y

assign x1i=xi+(~ywi+1);

endmodule

    OUTPUT:


    ERROR CONTROL CODES

    ERROR DETECTION AND CORRECTION CODES:

    When a message is transmitted, it has the potential to get scrambled by noise. This is certainly true

    of voice messages, and is also true of the digital messages that are sent to and from computers. Now

    even sound and video are being transmitted in this manner. A digital message is a sequence of 0s and

1s which encodes a given message. More data can be added to a given binary message to help

detect whether an error has been made in its transmission; a scheme that adds such data is called an error-

detecting code.

    More data may also be added to the original message so that errors made in transmission may be

    detected, and also to figure out what the original message was from the possibly corrupt message that

    was received. This type of code is an error-correcting code. Error detection is the ability to detect errors.

    Error correction has an additional feature that enables identification and correction of the errors. Error

detection always precedes error correction. Both can be achieved by having extra, redundant, or check

bits in addition to the data to deduce that there is an error. The original data is encoded with the redundant bit(s);

the new data formed is known as a code word. Coding is the process of adding redundancy for error detection

or correction. It is of two types:

    Block codes

    Divides the data to be sent into a set of blocks

    Extra information attached to each block

    Memoryless

    Convolutional codes

    Treats data as a series of bits, and computes a code over a continuous series

    The code computed for a set of bits depends on the current and previous input.

    HAMMING CODES:

Hamming codes are used for detecting and correcting transmission errors. An error-correcting code is an

    algorithm for expressing a sequence of numbers such that any errors which are introduced can be

    detected and corrected (within certain limitations) based on the remaining numbers. Errors can happen in

    a variety of ways. Bits can be added, deleted, or flipped. Errors can happen in fixed or variable codes.


    Error-correcting codes are used in CD players, high speed modems, and cellular phones. Error detection

    is much simpler than error correction. For example, one or more "check" digits are commonly embedded

in credit card numbers in order to detect mistakes. Hamming codes adopt the parity concept, but have more

than one parity bit. An example of a block code is the (7,4) Hamming code. This is an error-detecting and

error-correcting binary code, which transmits N = 7 bits for every K = 4 source bits.

    GENERAL ALGORITHM:

    The following general algorithm generates a single-error correcting (SEC) code for any

    number of bits.

    1. Number the bits starting from 1: bit 1, 2, 3, 4, 5, etc.

    2. Write the bit numbers in binary: 1, 10, 11, 100, 101, etc.

    3. All bit positions that are powers of two (have only one 1 bit in the binary form of their position)

    are parity bits: 1, 2, 4, 8, etc. (1, 10, 100, 1000)

    4. All other bit positions, with two or more 1 bits in the binary form of their position, are data bits.

5. Each data bit is included in a unique set of 2 or more parity bits, as determined by the binary

form of its bit position, as tabulated below for a 7-bit code word.
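For a 7-bit code word this numbering yields the following assignment of parity (p) and data (d) positions:

Position : 1    2    3    4    5    6    7
Binary   : 001  010  011  100  101  110  111
Role     : p    p    d    p    d    d    d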

    ENCODING A (7,4) CODE:

    CONSTRUCTION OF G AND H:

The matrix G = [I_k | P] is called a (canonical) generator matrix of a linear (n,k) code, and

H = [P^T | I_(n-k)] is called a parity-check matrix. This is the construction of G and H in standard (or

systematic) form. Regardless of form, G and H for linear block codes must satisfy G H^T = 0, an all-

zeros matrix. For a Hamming code, [n, k, d] = [2^m - 1, 2^m - 1 - m, 3]; here m = 3 gives the [7,4,3] code.

The parity-check matrix H of a Hamming code is constructed by listing all columns of length m that are

pairwise independent. Thus H is a matrix whose left side consists of the nonzero m-tuples with two or more

1s, where the order of these columns does not matter, and whose right-hand side is the (n-k) x (n-k)

identity matrix. G can then be obtained from H by taking the transpose of the left-hand side of H and

placing the k x k identity matrix on the left-hand side of G.


The code generator matrix G and parity-check matrix H can be chosen (consistent with the encoding example below) as

    G = | 1 0 0 0 1 1 0 |        H = | 1 1 0 1 1 0 0 |
        | 0 1 0 0 1 0 1 |            | 1 0 1 1 0 1 0 |
        | 0 0 1 0 0 1 1 |            | 0 1 1 1 0 0 1 |
        | 0 0 0 1 1 1 1 |

From the above matrix we have 2^k = 2^4 = 16 codewords. The codewords of this binary code are

obtained from c = uG, with u in GF(2)^4, where GF(2) is the field with two elements, namely 0 and 1.

Thus the messages are all the 4-tuples (k-tuples). Therefore (1,0,1,1) gets encoded as (1,0,1,1,0,1,0).
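Writing out c = uG over GF(2) for u = (1,0,1,1), i.e., XORing rows 1, 3, and 4 of G, makes the encoding explicit:

$$c = r_1 \oplus r_3 \oplus r_4 = (1000\,110) \oplus (0010\,011) \oplus (0001\,111) = (1011\,010).$$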

    HAMMING DISTANCE:

Hamming distance = number of bit positions in which two code words differ.

E.g., 10001001 and 10110001 have a distance of 3. If the distance is d, then d single-bit errors are required to

convert any one valid code word into another, implying that such an error would not be detected. In general, to

detect k single-bit errors, the minimum Hamming distance is D(min) = k + 1. Hence we need code words that have

D(min) = 2 + 1 = 3 to detect 2-bit errors.

    DECODING OF HAMMING CODE:

The decoding task can be re-expressed as syndrome decoding:

s = H . r

where s is the syndrome,

H is the parity-check matrix, and

r is the received vector.

    The following two possibilities are:

If the syndrome is zero, that is, all three parity checks agree with the corresponding received bits,

then the received vector is a codeword, and the most probable decoding is given by reading out its first

four bits; the decoded word then equals the transmitted u. There are, however, situations in which the errors

are not detectable. This happens when the error vector is identical to a nonzero codeword: in that case r is the

sum of two codewords, and therefore the syndrome equals zero. Such errors are non-detectable

errors; since there are 2^k - 1 nonzero codewords, there are 2^k - 1 non-detectable error patterns.

If the syndrome is non-zero, then we are certain that the noise sequence for the present block was

non-zero (we have noise in our transmission). Since the received vector is given by r = G^T u + n and

H G^T = 0, the syndrome decoding problem is to find the most probable noise vector n satisfying the

equation H n = s. Once the error vector is found, the original source sequence is identified.
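As a worked example with the H above, flip the second bit of c = (1011010), giving r = (1111010). The syndrome is

$$s = H r^{T} = (1, 0, 1)^{T},$$

which matches the second column of H, so the decoder flips bit 2 back and recovers u = (1,0,1,1).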

    VHDL CODE:

    HAMMING ENCODER

    entity hamenc is

    Port ( datain : in STD_LOGIC_VECTOR (3 downto 0);

    p :inout STD_LOGIC_VECTOR(2 downto 0);

    hamout : out STD_LOGIC_VECTOR (6 downto 0));

    end hamenc;

    architecture Behavioral of hamenc is

begin

--generate check bits (parity equations follow the G matrix above;
--bit ordering assumed: hamout = datain & p, data bits first)

p(2) <= datain(3) xor datain(2) xor datain(0);

p(1) <= datain(3) xor datain(1) xor datain(0);

p(0) <= datain(2) xor datain(1) xor datain(0);

hamout <= datain & p;

end Behavioral;

HAMMING DECODER

library IEEE;

use IEEE.STD_LOGIC_1164.ALL;

entity hamdec is

Port ( rxd : in STD_LOGIC_VECTOR (6 downto 0);

dataout : out STD_LOGIC_VECTOR (3 downto 0));

end hamdec;

architecture Behavioral of hamdec is

signal syndrome : STD_LOGIC_VECTOR (2 downto 0);

begin

--recompute the parity checks on the received word rxd

syndrome(2) <= rxd(6) xor rxd(5) xor rxd(3) xor rxd(2);

syndrome(1) <= rxd(6) xor rxd(4) xor rxd(3) xor rxd(1);

syndrome(0) <= rxd(5) xor rxd(4) xor rxd(3) xor rxd(0);

process (rxd, syndrome)

begin

if syndrome = "000" then --no error: pass the data bits through

dataout <= rxd(6 downto 3);

else --flip the data bit whose H column matches the syndrome

case syndrome is

when "110" => dataout <= (not rxd(6)) & rxd(5 downto 3);

when "101" => dataout <= rxd(6) & (not rxd(5)) & rxd(4 downto 3);

when "011" => dataout <= rxd(6 downto 5) & (not rxd(4)) & rxd(3);

when "111" => dataout <= rxd(6 downto 4) & (not rxd(3));

when others => dataout <= rxd(6 downto 3); --parity-bit error only

end case;

end if;

end process;

end Behavioral;

    SIMULATION RESULTS

ENCODER OUTPUT:

    DECODER OUTPUT:


    INFERENCE:

A Hamming code is a code word of n bits with m data bits and r parity (or check) bits,

i.e. n = m + r.

It can detect up to D(min) - 1 errors and can correct up to floor((D(min) - 1)/2) errors. Hence to correct k errors we need

D(min) = 2k + 1; a minimum distance of at least 3 is needed to correct a single-bit error.


    CRYPTOGRAPHIC ALGORITHMS FOR FPGAS

    Many communication systems use data-stream ciphers to protect relevant Information.

The key sequence K is more or less a pseudorandom sequence (known to the sender and the receiver), and with the modulo-2 property of the XOR function, the plaintext P can be

reconstructed at the receiver side, because

P ⊕ K ⊕ K = P ⊕ 0 = P.
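A minimal VHDL sketch of this data-stream cipher property (entity and signal names are illustrative, not part of a standard design): the identical circuit encrypts at the sender and decrypts at the receiver, because XORing with the same key stream twice restores the plaintext.

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

ENTITY stream_xor IS
PORT ( clk : IN STD_LOGIC;
d_in : IN STD_LOGIC; -- plaintext (or ciphertext) bit
k_in : IN STD_LOGIC; -- key-stream bit, e.g., from an LFSR
d_out : OUT STD_LOGIC); -- ciphertext (or recovered plaintext)
END stream_xor;

ARCHITECTURE fpga OF stream_xor IS
BEGIN
PROCESS
BEGIN
WAIT UNTIL clk = '1';
d_out <= d_in XOR k_in; -- (P xor K) xor K = P
END PROCESS;
END fpga;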

We shall discuss two encryption algorithms, namely:

    Linear Feedback Shift Registers (LFSR) algorithm

    Data Encryption Standard (DES)

    Neither algorithm requires large tables and both are suitable for an FPGA implementation.

    LINEAR FEEDBACK SHIFT REGISTERS ALGORITHM

    LFSRs with maximal sequence length are a good approach for an ideal security key,

    because they have good statistical properties. In other words, it is difficult to analyze the

sequence in a cryptographic attack, an analysis called cryptanalysis. Because bitwise designs

are possible with FPGAs, such LFSRs are more efficiently realized with FPGAs than with PDSPs.

    Two possible realizations of a LFSR of length 8 are shown in Fig. 1.1.

    Fig 1.1. Possible realizations of LFSRs. (a) Fibonacci configuration. (b) Galois configuration.

For the XOR LFSR there is always the possibility of the all-zero word, which should

never be reached. If the cycle starts with any nonzero word, the cycle length is always 2^l - 1.

Sometimes, if the FPGA wakes up with an all-zero state, it is more convenient to use a

mirrored or inverted LFSR circuit: if the XOR is substituted with a NOT XOR, i.e., an XNOR gate, the all-zero word becomes a valid pattern and the circuit produces exactly the inverse sequence. Such LFSRs can easily be designed using a PROCESS statement in VHDL, as the following

example shows.


    The following VHDL code implements a LFSR of length 6.

    LIBRARY ieee;

    USE ieee.std_logic_1164.ALL;

    USE ieee.std_logic_arith.ALL;

    ENTITY lfsr IS ------> Interface

    PORT ( clk : IN STD_LOGIC;

    y : OUT STD_LOGIC_VECTOR(6 DOWNTO 1));

    END lfsr;

    ARCHITECTURE fpga OF lfsr IS

SIGNAL ff : STD_LOGIC_VECTOR(6 DOWNTO 1)

:= (OTHERS => '0');

BEGIN

PROCESS -- Implement length-6 LFSR with XNOR

BEGIN

WAIT UNTIL clk = '1';

ff(1) <= NOT (ff(5) XOR ff(6)); -- feedback taps at bits 5 and 6

FOR I IN 6 DOWNTO 2 LOOP -- shift register

ff(I) <= ff(I-1);

END LOOP;

END PROCESS;

y <= ff; -- connect the register to the output

END fpga;


Note that a complete cycle of an LFSR sequence fulfills the three criteria for optimal

length-(2^l - 1) pseudorandom sequences:

1) The number of 1s and 0s in a cycle differs by no more than one.

2) Runs of length k (e.g., a 111 sequence or a 000 sequence) make up a fraction 1/2^k of all runs.

3) The autocorrelation function C(τ) is constant for τ in [1, N - 1].
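These criteria can be verified directly for the length-6 LFSR above, whose cycle length is 2^6 - 1 = 63:

$$\#\{1\text{s}\} = 2^{5} = 32, \qquad \#\{0\text{s}\} = 2^{5}-1 = 31, \qquad C(\tau) = -\tfrac{1}{63} \;\text{for}\; \tau = 1,\dots,62.$$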

    DES BASED ALGORITHM:

    The data encryption standard (DES), outlined in Fig. 1.3, is typically used in a block

    cipher. By selecting the output feedback mode (OFB) it is also possible to use the modified DES in a data-stream cipher.

    Fig 1.3. State machine for a block encryption system (DES)


    PRINCIPLE:

The DES comprises a finite state machine translating plaintext blocks into ciphertext

blocks. First the block to be substituted is loaded into the state register (32 bits). Next it is

expanded (to 48 bits), combined with the key (also 48 bits), and substituted in eight S-boxes, each a 64-entry table mapping 6 input bits to 4 output bits. Finally, permutations of single bits are performed. This cycle may be applied several times (if desired,

with a changing key).

In the DES, the key is usually shifted one or two bits per round so that after 16 rounds the key is

back in its original position. Because the DES can therefore be seen as an iterative application

of the Feistel cipher (shown in Fig. 1.4), the S-boxes need not be invertible. To simplify an

FPGA realization some modifications are useful, such as a reduction of the length of the state

register to 25 bits; no expansion is used, and the final permutations are as listed in Table 1.1.

Because most FPGAs only have four- to five-input look-up tables (LUTs), S-boxes with five

inputs have been designed.

    Fig 1.4. Principle of the Feistel Network
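A minimal VHDL sketch of one Feistel round (names, width, and the placeholder round function are illustrative, not the actual DES round): the right half passes through unchanged, while the new right half is the old left half XORed with f(R, K).

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

ENTITY feistel_round IS
GENERIC (W : INTEGER := 16); -- half-block width (illustrative)
PORT ( l_in, r_in : IN STD_LOGIC_VECTOR(W-1 DOWNTO 0);
k : IN STD_LOGIC_VECTOR(W-1 DOWNTO 0); -- round key
l_out, r_out : OUT STD_LOGIC_VECTOR(W-1 DOWNTO 0));
END feistel_round;

ARCHITECTURE fpga OF feistel_round IS
SIGNAL f : STD_LOGIC_VECTOR(W-1 DOWNTO 0);
BEGIN
-- Placeholder round function: key mixing plus a one-bit rotation.
-- The real DES round uses expansion, S-boxes and permutation here.
f <= (r_in(0) & r_in(W-1 DOWNTO 1)) XOR k;
l_out <= r_in; -- new L = old R
r_out <= l_in XOR f; -- new R = old L xor f(R, K)
END fpga;

The inverse round recomputes f on the unchanged half and XORs it off again, which is why the round function itself never has to be inverted.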

    A reasonable test for S-boxes is the dependency matrix. This matrix shows, for every

    input/output combination, the probability that an output bit changes if an input bit is changed.

    With the avalanche effect the ideal probability is 1/2.

    Table 1.1 Table for Permutation


Since there are 2^5 = 32 possible input vectors for each S-box, the ideal value is 16. A

random generator was used to generate the S-boxes. The reason that some values differ much

from the ideal 16 may lie in the desired inversion.

Though DES was considered secure until the late 1990s, it was successfully cracked by brute force in 1997 and repeatedly thereafter. More complex encryption algorithms, such as AES and Triple DES, are

therefore used in practice nowadays.


    FPGA DESIGN OF LMS ALGORITHM

The Widrow–Hoff least mean square (LMS) adaptive algorithm is a practical method for

finding a close approximation to the optimum (Wiener) filter coefficients in real time.

It is a very simple algorithm and it does not require explicit measurement of the

correlation functions, nor does it involve matrix inversion.

The LMS algorithm is an implementation of the method of steepest descent.

According to this method, the next filter coefficient vector f[n+1] is equal to the

present filter coefficient vector f[n] plus a change proportional to the negative gradient:

$$f[n+1] = f[n] - \mu \nabla[n]$$

where the parameter μ is the learning factor or step size that controls the stability and the

rate of convergence of the algorithm, and the true gradient during each iteration is

represented by ∇[n].

The LMS algorithm estimates an instantaneous gradient in a crude but efficient manner

by assuming that the gradient of J = e²[n] is an estimate of the gradient of the mean-

square error E{e²[n]}. The relationship between the true gradient ∇[n] and the

estimated gradient is given by the following expression:

$$\nabla[n] = E\{\hat{\nabla}[n]\}, \qquad \hat{\nabla}[n] = \frac{\partial e^{2}[n]}{\partial f[n]} = -2\, e[n]\, x[n].$$

Therefore the coefficient update equation becomes

$$f[n+1] = f[n] + \mu\, e[n]\, x[n].$$
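A one-step numeric sketch (values chosen purely for illustration): with two taps, f[0] = (0, 0)^T, input vector x[0] = (1, 0.5)^T, desired response d[0] = 1, and μ = 0.25:

$$e[0] = d[0] - f^{T}[0]\, x[0] = 1, \qquad f[1] = f[0] + \mu\, e[0]\, x[0] = (0.25,\; 0.125)^{T}.$$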


    FIG: LMS configuration

Although the LMS algorithm makes use of gradients of the mean-square error function, it does not require explicit squaring, averaging, or differentiation. The algorithm is simple and generally

easy to implement.

The LMS algorithm is convergent in the mean square if and only if the step-size

parameter satisfies

$$0 < \mu < \frac{2}{\lambda_{\max}}$$

where λ_max is the largest eigenvalue of the correlation matrix R of the input data.

To ensure fast convergence, a more conservative, easily computed bound on the step size is

$$0 < \mu < \frac{2}{(L+1)\, r_{xx}[0]}$$

where L is the filter order and r_xx[0] is the zero-lag autocorrelation (i.e., the power) of the input; this follows from λ_max ≤ trace(R) = (L+1) r_xx[0].

For higher-order filters this upper bound can be relaxed by a factor of 3.
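For example, for a filter with L + 1 = 32 taps and unit input power, r_xx[0] = 1, this conservative bound gives μ < 2/32 = 0.0625.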

    Normalized LMS

The LMS algorithm discussed so far uses a constant step size proportional to the

stability bound. Obviously this requires knowledge of the signal

statistics, i.e., r_xx[0], and these statistics must not change over time.

It is, however, possible that the statistics change over time, in which case we wish to adjust the step size accordingly, i.e., use a time-varying step-size parameter μ[n].

The normalized step size is given by

$$\mu[n] = \frac{\tilde{\mu}}{x^{T}[n]\, x[n]} = \frac{\tilde{\mu}}{\|x[n]\|^{2}}.$$


If we are concerned that the denominator can temporarily become very small and μ[n] therefore too large, we may add a small constant δ to x^T[n]x[n], which yields

$$\mu[n] = \frac{\tilde{\mu}}{\delta + \|x[n]\|^{2}}.$$

Therefore the coefficient update equation for the NLMS is

$$f[n+1] = f[n] + \frac{\tilde{\mu}}{\delta + \|x[n]\|^{2}}\; e[n]\, x[n]$$

where ‖x[n]‖² is the squared norm of the input vector.

    PIPELINED LMS FILTER:

    This method is used to increase the throughput of the Adaptive system.

The optimal number of pipeline stages can be computed as follows. For the (b × b)-bit multiplier a total of log2(b) stages are needed; for the adder tree an additional log2(L) pipeline stages are sufficient; one more stage is needed for the computation of the

error; and the coefficient-update multiplication requires an additional log2(b) pipeline

stages.

The total number of pipeline stages for maximum throughput is therefore

$$S = 2\log_{2}(b) + \log_{2}(L) + 1$$

where we have assumed that μ is a power-of-two constant, so that the scaling with μ can be done without additional pipeline stages. If, however, the normalized LMS is

used, then μ[n] is no longer a constant and, depending on its bit width, additional pipeline stages will be required.
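For example (values illustrative), with b = 8-bit data and an L = 8-tap filter: log2(8) = 3 multiplier stages, 3 adder-tree stages, one error stage, and 3 update-multiplier stages, i.e., S = 2·3 + 3 + 1 = 10 pipeline stages in total.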

    BLOCK TRANSFORMATION USING FFTs:

    LMS algorithms that solve the filter coefficient adjustment in a transform domain have been

    proposed for two reasons,

    The goal of the fast convolution techniques is to lower the computational effort, by using block update and transforming the convolution to compute the adaptive filter output and

    the filter coefficient adjustment in the transform domain with the help of a fast cyclic

    convolution algorithm.

The second method, which uses transform-domain techniques, has as its main goal improving the adaptation rate of the LMS algorithm, because it is possible to find transforms that

allow a decoupling of the modes of the adaptive filter. The coefficient update equation for a block of B samples is

$$f[k+1] = f[k] + \mu_{B} \sum_{b=0}^{B-1} e[kB+b]\; x[kB+b]$$

and the step size can be reduced to μ_B = μ/B for a block update of B steps each.

Choice of block size: B = L is the optimal choice from the viewpoint of computational complexity; B ≠ L leads to redundant operations in the adaptation and is not optimal.

    VHDL CODE:

    LIBRARY lpm;

    USE lpm.lpm_components.ALL;

    LIBRARY ieee;

    USE ieee.std_logic_1164.ALL;

    USE ieee.std_logic_arith.ALL;

    USE ieee.std_logic_signed.ALL;

    ENTITY fir_lms IS

    GENERIC (W1 : INTEGER := 8;

    W2 : INTEGER := 16;

    L : INTEGER := 2 );

    PORT ( clk : IN STD_LOGIC;

    x_in : IN STD_LOGIC_VECTOR(W1-1 DOWNTO 0);

    d_in : IN STD_LOGIC_VECTOR(W1-1 DOWNTO 0);

    e_out, y_out : OUT STD_LOGIC_VECTOR(W2-1 DOWNTO 0);

    f0_out, f1_out : OUT STD_LOGIC_VECTOR(W1-1 DOWNTO 0));

    END fir_lms;

    ARCHITECTURE flex OF fir_lms IS

    SUBTYPE N1BIT IS STD_LOGIC_VECTOR(W1-1 DOWNTO 0);

    SUBTYPE N2BIT IS STD_LOGIC_VECTOR(W2-1 DOWNTO 0);

    TYPE ARRAY_N1BIT IS ARRAY (0 TO L-1) OF N1BIT;

    TYPE ARRAY_N2BIT IS ARRAY (0 TO L-1) OF N2BIT;

    SIGNAL d : N1BIT;

    SIGNAL emu : N1BIT;

    SIGNAL y, sxty : N2BIT;

    SIGNAL e, sxtd : N2BIT;

    SIGNAL x, f : ARRAY_N1BIT;

    SIGNAL p, xemu : ARRAY_N2BIT;

    BEGIN

dsxt: PROCESS (d) -- sign-extend d from W1 to W2 bits

BEGIN

sxtd(7 DOWNTO 0) <= d;

FOR k IN 15 DOWNTO 8 LOOP

sxtd(k) <= d(7);

END LOOP;

END PROCESS;

sxty <= y; -- filter output is already at full W2 width

Store: PROCESS -- Store the new input sample and update the coefficients

BEGIN

WAIT UNTIL clk = '1';

d <= d_in;

x(0) <= x_in;

x(1) <= x(0);

f(0) <= f(0) + xemu(0)(15 DOWNTO 8); -- implicit scaling in the

f(1) <= f(1) + xemu(1)(15 DOWNTO 8); -- coefficient update

END PROCESS;

MulGen1: FOR I IN 0 TO L-1 GENERATE

FIR: lpm_mult -- filter products x[k]*f[k]

GENERIC MAP ( LPM_WIDTHA => W1, LPM_WIDTHB => W1,

LPM_REPRESENTATION => "SIGNED",

LPM_WIDTHP => W2)

PORT MAP ( dataa => x(I), datab => f(I),

result => p(I));

END GENERATE;

y <= p(0) + p(1); -- compute the adaptive filter output

e <= sxtd - sxty; -- error: desired minus filter output

emu <= e(8 DOWNTO 1); -- e * mu, with mu realized as a right shift

MulGen2: FOR I IN 0 TO L-1 GENERATE

FUPDATE: lpm_mult -- update products e*mu*x[k]

GENERIC MAP ( LPM_WIDTHA => W1, LPM_WIDTHB => W1,

LPM_REPRESENTATION => "SIGNED",

LPM_WIDTHP => W2)

PORT MAP ( dataa => x(I), datab => emu,

result => xemu(I));

END GENERATE;

y_out <= y;

e_out <= e;

f0_out <= f(0);

f1_out <= f(1);

END flex;


    OUTPUT:

    APPLICATIONS:

Interference cancellation

Prediction

Inverse modelling


    DIGITAL UP CONVERTER

    An ideal Software Defined Radio base station would perform all signal processing tasks

    in the digital domain. However, current-generation wideband data converters cannot support the

    processing bandwidth and dynamic r