11
Parallel processing speed increase of the one-bit auto-correlation function in hardware Nicolas Huber a,, M.S. Hromalik-Pouchet b , T.D. Carozzi c , M.P. Gough c , A.M. Buckley c a Biomedical Engineering Group, University of Sussex, Brighton BN1 9QT, United Kingdom b Lab. of Atomic & Solid State Physics, Cornell University, Ithaca, NY, USA c Space Science Centre, University of Sussex, Brighton BN1 9QT, United Kingdom article info Article history: Available online 19 January 2011 Keywords: Parallelization Counterbased Auto-correlation FPGA abstract Recently, a serial implementation of the one-bit auto- and cross-correlation functions (ACF and CCF respectively) in a field programmable gate array (FPGA) has been developed, based on asynchronous delay elements and counters, known as the counterbased correlation. This paper proposes a method of parallelizing this otherwise serial process, offering significant improvements in the applicability of this approach to more types of ACF. Furthermore, the possibility of obtaining lag results from a parallel data sequence without first shifting the entire sequence has been realized, hence decreasing the number of clock cycles necessary for the calculation of the ACF. A synchronous design was preferred here for reasons of stability and portability, the technology of choice again being an FPGA. The advantages offered by the counterbased implementation in terms of device area usage and speed still apply. A practical implemen- tation in the instrumentation of an upcoming space mission is also discussed. Ó 2011 Elsevier B.V. All rights reserved. 1. Introduction A wide sense, random, stationary signal is fully characterized by its auto-correlation function (ACF). Hence the ACF in various forms enjoys extensive use in a wide range of applications. It can be found in such diverse fields as economics [1], image processing [2], multiple areas of astronomy [3] and Brownian motion studies [4], to name but a few. An ACF calculates the degree of association between the ele- ments of a single series separated by some lag. Its purpose is to de- tect non-randomness in data, and to identify an appropriate time series model if the data is found to be non-random [5]. For a peri- odic sequence ðX i Þ N1 i¼0 , it generally takes the form of R X ½i¼ X N1 k¼0 X½kX½k þ i ð1Þ Significant effort has been made to implement correlation func- tions in hardware, due to the speed-up offered by the possibility of parallel processing. These functions are currently being deployed in FPGAs, while traditionally DSP devices have been the technology of choice [6], as well as custom-made ASIC devices [7]. In [8], a CCF was implemented in a FPGA, where the function was defined using low- level schematic capture, probably due to the availability of opti- mized schematic components. The significant work by Von Herzen [9] shows the possibility of very fast (up to 250 MHz) CCF calculation in reconfigurable computing engines. Further to this, there has also been some investigation into ACFs in particular, and their imple- mentation in a FPGA. In the work by Bezerra et al. [10] an ACF devel- oped using VHDL targeting an FPGA was directly compared with the exact same algorithm as carried out by a pair of microcontrollers of the 8051 family, as found on-board the NASA 36.152 CUSP sounding rocket. The significant performance gains from the deployment of correlation functions in reconfigurable devices over purely software implementations were proven, leading in part to the work presented here. In this paper, we investigate the parallelization of a special type of ACF, the 1-bit version, in hardware. 2. 1-Bit correlation A two-level (Bernoullian) quantization can be performed on a series before it is correlated without significant loss of correlation information, assuming the series investigated is wide sense sta- tionary and bandlimited. Correlation coefficients before and after two level quantization are related by what is known as the Van Vleck correction [11], which states that the result of a measured correlation q 2 is proportional to the two-level quantized correla- tion q, as given by: q 2 ¼ 2 p sin 1 q ð2Þ 0141-9331/$ - see front matter Ó 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.micpro.2011.01.001 Corresponding author. Tel.: +44 1273872821. E-mail address: [email protected] (N. Huber). Microprocessors and Microsystems 35 (2011) 297–307 Contents lists available at ScienceDirect Microprocessors and Microsystems journal homepage: www.elsevier.com/locate/micpro

Parallel processing speed increase of the one-bit auto-correlation function in hardware

Embed Size (px)

Citation preview

Page 1: Parallel processing speed increase of the one-bit auto-correlation function in hardware

Microprocessors and Microsystems 35 (2011) 297–307

Contents lists available at ScienceDirect

Microprocessors and Microsystems

journal homepage: www.elsevier .com/locate /micpro

Parallel processing speed increase of the one-bit auto-correlationfunction in hardware

Nicolas Huber a,⇑, M.S. Hromalik-Pouchet b, T.D. Carozzi c, M.P. Gough c, A.M. Buckley c

a Biomedical Engineering Group, University of Sussex, Brighton BN1 9QT, United Kingdomb Lab. of Atomic & Solid State Physics, Cornell University, Ithaca, NY, USAc Space Science Centre, University of Sussex, Brighton BN1 9QT, United Kingdom

a r t i c l e i n f o a b s t r a c t

Article history:Available online 19 January 2011

Keywords:ParallelizationCounterbasedAuto-correlationFPGA

0141-9331/$ - see front matter � 2011 Elsevier B.V. Adoi:10.1016/j.micpro.2011.01.001

⇑ Corresponding author. Tel.: +44 1273872821.E-mail address: [email protected] (N. Huber).

Recently, a serial implementation of the one-bit auto- and cross-correlation functions (ACF and CCFrespectively) in a field programmable gate array (FPGA) has been developed, based on asynchronousdelay elements and counters, known as the counterbased correlation. This paper proposes a method ofparallelizing this otherwise serial process, offering significant improvements in the applicability of thisapproach to more types of ACF. Furthermore, the possibility of obtaining lag results from a parallel datasequence without first shifting the entire sequence has been realized, hence decreasing the number ofclock cycles necessary for the calculation of the ACF. A synchronous design was preferred here for reasonsof stability and portability, the technology of choice again being an FPGA. The advantages offered by thecounterbased implementation in terms of device area usage and speed still apply. A practical implemen-tation in the instrumentation of an upcoming space mission is also discussed.

� 2011 Elsevier B.V. All rights reserved.

1. Introduction

A wide sense, random, stationary signal is fully characterized byits auto-correlation function (ACF). Hence the ACF in various formsenjoys extensive use in a wide range of applications. It can befound in such diverse fields as economics [1], image processing[2], multiple areas of astronomy [3] and Brownian motion studies[4], to name but a few.

An ACF calculates the degree of association between the ele-ments of a single series separated by some lag. Its purpose is to de-tect non-randomness in data, and to identify an appropriate timeseries model if the data is found to be non-random [5]. For a peri-odic sequence ðXiÞN�1

i¼0 , it generally takes the form of

RX ½i� ¼XN�1

k¼0

X½k�X½kþ i� ð1Þ

Significant effort has been made to implement correlation func-tions in hardware, due to the speed-up offered by the possibility ofparallel processing. These functions are currently being deployed inFPGAs, while traditionally DSP devices have been the technology ofchoice [6], as well as custom-made ASIC devices [7]. In [8], a CCF wasimplemented in a FPGA, where the function was defined using low-level schematic capture, probably due to the availability of opti-

ll rights reserved.

mized schematic components. The significant work by Von Herzen[9] shows the possibility of very fast (up to 250 MHz) CCF calculationin reconfigurable computing engines. Further to this, there has alsobeen some investigation into ACFs in particular, and their imple-mentation in a FPGA. In the work by Bezerra et al. [10] an ACF devel-oped using VHDL targeting an FPGA was directly compared with theexact same algorithm as carried out by a pair of microcontrollers ofthe 8051 family, as found on-board the NASA 36.152 CUSP soundingrocket. The significant performance gains from the deployment ofcorrelation functions in reconfigurable devices over purely softwareimplementations were proven, leading in part to the work presentedhere. In this paper, we investigate the parallelization of a specialtype of ACF, the 1-bit version, in hardware.

2. 1-Bit correlation

A two-level (Bernoullian) quantization can be performed on aseries before it is correlated without significant loss of correlationinformation, assuming the series investigated is wide sense sta-tionary and bandlimited. Correlation coefficients before and aftertwo level quantization are related by what is known as the VanVleck correction [11], which states that the result of a measuredcorrelation q2 is proportional to the two-level quantized correla-tion q, as given by:

q2 ¼2p

sin�1 q ð2Þ

Page 2: Parallel processing speed increase of the one-bit auto-correlation function in hardware

298 N. Huber et al. / Microprocessors and Microsystems 35 (2011) 297–307

Assuming this correction factor is applied, it follows that instead ofcarrying out a multi-bit correlation, a 1-bit (or binary) equivalentmay be preferred without significant loss of accuracy. Hence, a1-bit ACF or CCF can be a much more computationally efficientand realizable statistical tool in hardware systems where thereare heavy restrictions, such as limited power, processing capabili-ties or transmission bandwidths. The most prominent example areais radio astronomy [12], where 1-bit cross-correlation functionshave been extensively used due to bandwidth restrictions. Brisken[13] made a strong case for using dedicated hardware to performcorrelations in this field, instead of software, in that the currentlyunder development Expanded Very Large Array (EVLA) radio tele-scope would require a total of 200,000 3 GHz Pentium processorsto carry out the excessively parallel computations needed. Thisapproach gains more weight in [14], postulating that the bestapproach to the processing necessary for the correlator of theSquare Kilometric Array Radio Telescope will probably be dedicatedhardware. For this project, a series of dedicated correlator chipswere always going to be the most important processing elements[15], focusing on FPGAs [16], whatever the number of quantizationlevels used.

Traditionally, 1-bit correlations in digital logic are carried out byduplicating the original set of bits. The duplicate is then shifted onebit at a time against the original, until all but the last bit have‘‘rolled-off’’, i.e. until the highest lag has been computed. The roll-ing-off is achieved by correlating the shifted-out part of the bit-stream padded with ’0’. Depending on whether coincidencesstrictly of 1’s or of both 1’s and 0’s in the bitstreams are desired,logic AND or NXOR gates respectively carry out the multiplication.The product is a bitstream of length equal to that of the original foreach lag/shift (Fig. 1). The result of the ACF for that lag is then ex-tracted by finding the number of 1’s in this new bitstream. Thisprocess is carried out by a 1’s counter, an asynchronous design ofwhich forces the sequential propagation of the bitstream through

7 6 235 4

7 6 5 4

7 6 35 4

7 6 235 4

LAG 0

LAG 1

LAG 2

LAG 3

LAG 4

LAG 5

LA

SHIFTS

ACF LAG RESULT

Fig. 1. Traditional implementation o

many levels of logic. Each lag result is available at the end of eachshift cycle through this process. The above described procedure canformally be viewed as a classical ACF of a Bernoullian sequence.

An example of the above is the ’Bit Correlator’ by Xilinx, whichis a freely distributed, optimized, drop-in module for Xilinx devices[17]. It performs a bit-by-bit comparison between input data and auser defined bit match pattern, producing an unsigned outputvalue of the number of bits that match. Limited to just single bitcorrelations, this function will not allow the setting of the matchpattern during runtime, thus preventing the core from becomingas suitable to our needs as an ACF. In a broader implementation,a 1-bit version is also discussed in [18], following the principledescribed above, carried out in an ASIC as part of an XF-typespectro-correlator. A 32-lag, 1-bit correlation is performed in thetraditional manner, returning a lag result at the end of clock cycleof 32 MHz.

3. Counterbased ACF

Recently in the work by Pouchet et al. [19] a novel method ofimplementing the 1-bit ACF, dubbed ‘‘Counterbased ACF’’ (CBACF),was proposed and explored in some detail. A chain of matched de-lay elements through which the original bitstream is serially prop-agated is used to produce the shifted coincidences, as a function ofthe position of each element in the chain. The chosen delay ele-ments were latches found in the fabric of a Xilinx FPGA. Themultiply-add-shift operation is replaced by a set of counters, whichasynchronously count the number of coincidences. A running sumis kept for each degree of shift in the counters, and with each newbit input all sums are updated in parallel. Fig. 2 displays this meth-od, along with the use of two delay lines to capture coincidences of’1’s and coincidences of ’0’s independently, at will. Similarly, acounterbased implementation of a CCF involves only the shiftingof two separate sequences through the delay elements. The

01

0123

012

7 6 01235 4

01

G 6

LAG 7

0 0 0 0 000

ROLL-OFF

f a 1-bit ACF, showing ‘‘roll-off’’.

Page 3: Parallel processing speed increase of the one-bit auto-correlation function in hardware

Delay Delay Delay Delay

Delay Delay Delay Delay

)(lX

O1 O2 O3 O4

Z4Z3Z2Z1

CounterCounterCounter Counter

O1 Z1 O2 Z2 O3 Z3 O4 Z4

COINCIDENCES OF ‘1’s

COINCIDENCES OF ‘0’s

Fig. 2. Counterbased ACF, showing the matched delay lines for coincidences of ‘1’s and ‘0’s.

N. Huber et al. / Microprocessors and Microsystems 35 (2011) 297–307 299

counterbased method, however, is limited only to classical ACFs bythe fact that no full lag result is available until the entire inputsequence has been shifted through.

In the present paper, we propose a scheme that circumventsthis limitation by partially parallelizing the counterbased ACFmethod. This new approach allows the CBACF method to be ap-plied to other types of ACF other than the classical one, makingthe CBACF more general. Moreover, counterbased ACFs up tonow have been fed with a serial input, not offering significant gainsin the event of already available parallel data. It will be shown thatit is possible through parallelization to obtain the CBACF lag resultsaccurately in a fraction of the clock cycles (shifts) required up tonow. This is due to the fact that the extensions proposed are moreapplicable to datasets that are instantaneously available in parallel,and which are not a time dependent sequence.

Since this work is a continuation of the work in [19], the samecase study, namely the Sussex Correlating Electron Spectrograph(CORES) [20], which is planned to be flown to the InternationalSpace Station, has been taken into consideration. The current re-sults are compared both against those found in [19] and those ofthe original case study therein.

4. Application to other types of ACF

Until this stage, only roll-off 1-bit ACFs have been tested withthe previously discussed delay method. There exist cases, however,where this type of ACF is not preferred. One such case is the SussexCORES instrument itself, where, due to telemetry budget limita-tions, a very specific form of ACF is to be carried out on board. Thistype of ACF will now be discussed, together with how the CBACFmethod can be applied to it.

4.1. Fixed length summation ACF

The ‘‘fixed length summation’’ ACF (FLSACF) can be defined as thecorrelation function where only a specific portion of the original bit-stream is selected to slide against the whole. The chosen portion,called the ‘window’, is duplicated and shifted against the original.The number of shifts, however, is only chosen to be to the end ofthe original bitstream, so that the ACF is termed ‘‘non roll-off’’,and that at all shifts the same fixed number of terms is correlated.ACFs of this particular nature have been suggested and used onboard space missions in the past [21], and in interferometer correla-tors [11,18]. In most cases, the window is selected to be equal to N/2,where N is the length of the original bitstream, thus, inclusive of thezero lag, a total of only N/2 + 1 lag results is returned by the FLSACF.

It is obvious that this ACF cannot be implemented using thecounterbased method, simply because just a portion of the originalbitstream is shifted through the system, and hence not all possiblelag results are present in the counters by the end of this shifting. Itis possible, however, to view this type of ACF in a different form,which will enable the use of the counterbased method.

The fixed length ACF with no weighting of a Bernoullian se-quence can be expressed directly as:

RflX ½l� ¼

PL

k¼1X½k�X½kþ l�;

0 6 l 6 N � L;

0 < L < N

ð3Þ

where l represents lag, k the summation element, L the length of thesummation element, and N the length of the original sequence. TheBernoullian nature of the sequence allows it to be easily translatedto a binary format. In the specific case when the first half of a

Page 4: Parallel processing speed increase of the one-bit auto-correlation function in hardware

300 N. Huber et al. / Microprocessors and Microsystems 35 (2011) 297–307

sequence of length N is chosen as the window (hence L = N/2), wecan express (3) as:

RflX ½l� ¼

XN2�l

k¼1

X½k�X½kþ l� þXl

k¼1

X½ðN=2Þ � kþ 1�X½N=2� kþ 1þ l�;

0 6 l 6N2

ð4Þ

where the first summation represents the correlation of the firsthalf of the bitstream with itself, and the second element representsthe correlation of the first half of the bitstream with the second half.At the zero lag, the second summation degenerates to a value ofzero, as expected.

The elements in Eq. (4) can also be viewed as an ACF movingforward in time and a CCF moving backward in time, with theirintermediate lag results being summed. Specifically:

RflX ½l� ¼

XN2�l

k¼1

X½k�X½kþ l� þXl

k¼1

X½ðN=2Þ � kþ 1�Y ½l� kþ 1�;

0 6 l 6N2

ð5Þ

where

½X�NðN=2Þþ1 ¼ ½Y�N=21

As an example, let us assume a sequence of 8 bits, this being the tar-get stream of whose ACF we require. The normal implementation ofsuch a function (4) resembles the one in Fig. 3. It is common to use

7 6 5

7 6 5

7 6

LAG 0

LAG 1

LAG 2

SHIFTS

ACF LAG RESULT

Fig. 3. Conceptual implementation

only the first half of the bitstream, or even a smaller portion of it.Fig. 4 gives a graphical representation of Eq. (5).

The FLSACF has now been expressed as two separate procedures(an ACF and a CCF) that use only half of the original bitstream each.What is more important is that they shift this half against only anequal portion of the original sequence, which would then presentall lag results at the end of these two procedures when imple-mented using the counterbased method. We point out that as lagis increased, the roll-off in the ACF increases, as the roll-off of theCCF decreases, and so the total number of useful (correlated) termsper lag remains constant.

Please note that the CCF used here and the CCF result are not ofimport here. It is only a means by which we can obtain the correctresults for the lags of the fixed length summation ACF.

4.2. Classical ACF evaluated to half-length

According to the work found in [22], the Mean-Square Error(MSE) of autocorrelograms is lag dependent, and it is more pro-nounced in the case of the FLSACF compared to a classical ACF.Thus, to minimize this error margin, it is preferable to implementan ACF that is a mixture of the classical and the fixed length ver-sions, and hence it is a prime candidate for on-board implementa-tion. It is simply a classical, ‘‘roll-off’’ ACF that is carried out only tothe N/2 lag. This half-length classical ACF (HLCACF) would producea number of lags equal to N/2 + 1 by including the zero lag, butthese lags will have a highly accurate value, equal to those of theclassical ACF. Fig. 5 displays this notion graphically.

01234

4

5 4

7 6 5 4

LAG 3

LAG 4

of the 1-bit fixed length ACF.

Page 5: Parallel processing speed increase of the one-bit auto-correlation function in hardware

013 2

457 6

57 6

7

4

ACF LAG RESULT

CCF LAG RESULT

ACF LAG

0

ACF LAG

1

ACF LAG

2

ACF LAG

3

7 6

7 6 45

CCFLAG

3

CCFLAG

2

CCFLAG

1

CCFLAG

0

Lag 1 Summed Result

Lag 2 Summed Result

Lag 3 Summed Result

Lag 0 Summed Result

Lag 4 Summed Result

457 6

456

45

Fig. 4. Summation of lag results of ACF moving forward in time and CCF backward in time.

N. Huber et al. / Microprocessors and Microsystems 35 (2011) 297–307 301

Mathematically, this method can be expressed as simply as:

RflX ½l� ¼

PL

k¼1X½k�X½kþ l�;

0 6 l 6 N=2� lð6Þ

If, however, we decide to translate this equation to one that relieson the counterbased form of an ACF, we would require a combina-tion of three elements. We would require the two elements thatmake up the fixed length auto-correlation function, along with aclassical ACF of the second half of the bitstream. This is better ex-pressed as:

RflX ½l� ¼

XN=2�L

k¼1

X½k�X½kþ l� þXl

k¼1

X½N=2� kþ 1�X½N=2� kþ 1þ l�

þXN�l

k¼N=2þ1

X½k�X½k� l�; 0 6 l 6 N � l ð7Þ

The rearrangement of Eqs. (3) and (6) are again necessary to cir-cumvent the issue of the counterbased method not producing a fulllag result before the entire sequence has been passed through thesystem.

5. Implementation in hardware

The most significant advantages of the work in [19] wereachieved using delay elements operating asynchronously. Thespecific implementation, however, is not portable, since it de-pends on the manual floor-planning of the delay line elementswithin the target device used (in this case, a Xilinx Spartan-IIEXC2S300e). This task becomes taxing and inaccurate in the caseof long delay lines, as discussed in [23]. As such, all ACF hardwareimplementations presented in this current work are of a synchro-nous nature, driven as usual by an external clock to the FPGA.To make the results of this work comparable to those in[19], an identical Xilinx device was selected as the target technol-ogy. However, the hardware designs were created using genericVHDL, similar to that in [10], for portability and ease ofimplementation.

5.1. Classical ACF in H/W

To achieve the functionality of a traditional roll-off ACF, our de-sign was based heavily on a synchronous version of the ACF found

Page 6: Parallel processing speed increase of the one-bit auto-correlation function in hardware

7 6 01235 4

7 6 235 4

7 6 1235 4

7 6 01235 4

LAG 0

LAG 1

LAG 2

LAG 3

LAG 4

SHIFTS

ACF LAG RESULT

7 6 5 4

7 6 35 4

Fig. 5. Conceptual implementation of the 1-bit half-length classical ACF.

302 N. Huber et al. / Microprocessors and Microsystems 35 (2011) 297–307

in [19]. As in this work, two identical delay lines are created, one ofwhich is fed the original bitstream, while the other is fed the in-verse of this bitstream, to allow for coincidences of ‘0’s and ‘1’sto be measured. The maximum speed of this new implementationwas 161 MHz for an 8-bit sequence, with just 1% of the fabric beingutilized, while for a 32-bit sequence, 5% of the chip was used with a

Delay Delay

Delay Delay

O1

Z1

COINCID

COINCID

1 0 1 0 0 0 0 0

1 0 1 0 1 0 1 0

0 1 0 1 0 0 0 0

Assuming 8 deAssuming windAssuming inpu

Fig. 6. FLSACF, showing independ

maximum speed of 153 MHz, a minimal trade off between scalabil-ity and speed.

5.2. Fixed length summation ACF in H/W

The FLSACF can be viewed as correlating the bitstream windowbuffered with ‘0’s with the original bitstream. It could, hence, beimplemented in H/W by using the classical ACF with a very slightmodification. This implementation requires that the input to thedelay line is separate from that of the AND gates. For the lengthof the desired window, the input to both the delay lines and theAND gates is identical. After a number of bits equal to the windowlength have been fed to the design, the input to the delay lines ismade equal to ‘0’, effectively buffering the window with ‘0’s. Anexample of this implementation is shown in Fig. 6. No further mod-ification to the classical ACF system is necessary, and hence the re-quired resource usage and speed are identical. It has, unfortunately,the disadvantage of requiring N clock cycles to push the entire buf-fered window sequence through to only acquire N/2 + 1 lag results.

There is, however, a different approach to this method. It ismore cumbersome, but proves to be critical to the transition fromthe FLSACF to the half-length classical version, and requires N/2 + 1of clock cycles to produce an equal amount of lag results. Thismethod is a direct implementation of Eq. (5).

To allow for the combination of an ACF and a CCF to produce thedesired results, it is necessary to have acquired the full bitstream tobe used in advance. This means that the method described hereoperates on a data sequence that can be input in parallel into thedesign, and hence is time independent. In both the counterbasedACF and CCF, the AND gates are fed with the window, in this casethe first half of the original sequence. Their delay chains, however,are fed the first and second half of the bitstream respectively. Thereis also a need for adders to carry out the summation of the lag re-sults of the ACF and CCF, and hence produce the intermittent lagresults of the FLSACF. The results of lag 0 and N/2 + 1 are taken

Delay Delay

Delay Delay

O2 O3 O4

Z4Z3Z2

ENCES OF ‘1’s

ENCES OF ‘0’s

lay elements per chain.ow size of 4 bits.t bitstream to be: 10101010

ent inputs to the delay lines.

Page 7: Parallel processing speed increase of the one-bit auto-correlation function in hardware

N. Huber et al. / Microprocessors and Microsystems 35 (2011) 297–307 303

directly to the outputs from the ACF and CCF counters respectively.For a graphical representation of this system please refer to Fig. 7.

This is a counter-intuitive method, that works surprisingly well.For example, if a 16-bit sequence is to be correlated thus to pro-duce N/2 + 1 = 9 lag results, only 3% of the device is required, witha max. speed of 158.6 MHz. If a 32-bit sequence is required, thenthe device usage rises to 7% and the max. speed drops to146.9 MHz. The device utilization increase complies with the factthat the number of delay elements, counters and adders neededhas doubled, but, again, the overall speed decrease is minimal.

5.3. Half-length classical ACF

As mentioned earlier, it is possible to further extend the abovedelay-line method to successfully implement a half-length classi-cal ACF (HLCACF) (Eq. (7)). The only addition to the FLSACF systemis that of a straightforward, roll-off ACF applied to the second half

Delay 0 Delay 1

Delay 0 Delay 1

A1

C1

ACF: COINCID

CCF: COINCID

1 0 1 0

1 0 1 0

1 1 1 1

Assuming 4 delaAssuming windoAssuming input b

1 0 1 0

CCF COUNTER 2

ACF COUNTER 1

ADDER LAG 1RESULT

A2

C3

Fig. 7. Combination of counterbased ACF and CCF to give

of the input bitstream, as shown in Fig. 8. This approach requiresthat the entire bitstream is known in advance to allow the feedingof the individual delay lines in parallel, as required. This impliesthat adder stages are required to add the three individual lag re-sults produced per shift, which stages can be pipelined to maxi-mize the speed of the design, without overly taxing the overallprocessing time of the system.

As a simple example, let us consider the case of a sequence ofN = 8 like the one in Fig. 8. This sequence will produce N/2 + 1 = 5lag results, as before. However, an arbitrary lag result, say lag 1,will be produced from the summation of the lag 1 result fromthe first ACF delay line, the lag 1 result of the second ACF delay lineand the lag 3 result of the CCF delay line (returning the value 3, inthis case). Following this scheme, lag 0 result is equal to the sum ofthe lag 0 results of the two ACF delay lines (=8), while the 5th lagresult will be equal to the lag 0 result of the CCF delay line only(=2). This implementation requires marginally more FPGA

Delay 2 Delay 3

Delay 2 Delay 3

A2 A3 A4

C4C3C2

ENCES OF ‘1’s

ENCES OF ‘1’s

y elements per chain.w size of 4 bits.itstream to be: 10101111

CCF COUNTER 1

ACF COUNTER 2

ADDERLAG 2

RESULT

C2

A3

FLSACF results. Only coincidences of 1’s are shown.

Page 8: Parallel processing speed increase of the one-bit auto-correlation function in hardware

Delay 0 Delay 1 Delay 2 Delay 3

Delay 0 Delay 1 Delay 2 Delay 3

A1 A2 A3 A4

C4C3C2C1

ACF FIRST HALF: COINCIDENCES OF ‘1’s

CCF: COINCIDENCES OF ‘1’s

1 0 1 0

1 0 1 0

1 1 1 1

Assuming 4 delay elements per chain.Assuming window size of 4 bits.Assuming input bitstream to be: 10101111

1 0 1 0

ACF S COUNTER 0

ACF COUNTER 0

CCF COUNTER3

ACF COUNTER 1

ADDER

ADDER

DELAY 1RESULT

DELAY 0RESULT

A2

C3

AS1

A1

Delay 0 Delay 1 Delay 2 Delay 3

AS1 AS2 AS3 AS4

ACF SECOND HALF: COINCIDENCES OF ‘1’s

1 1 1 1

1 1 1 1

ACF S COUNTER 1

AS2

Fig. 8. Combination of three delay lines to implement the half-length classical ACF. Only coincidences of 1’s are shown.

304 N. Huber et al. / Microprocessors and Microsystems 35 (2011) 297–307

resources, with an 8-bit length version taking just 4% of the devicerunning again at a max. of 158.6 MHz. On the other hand, if theimplementation is extended to 32 points, the device utilizationrises to 13%, a considerable amount, but since this method is

an extension of the FLSACF, there is no noticeable drop in max.speed.

In all the above examples, there are N/2 + 1 lag results obtained.Hence, the latency of these circuits to return a result is equal to

Page 9: Parallel processing speed increase of the one-bit auto-correlation function in hardware

1

1

432

432

BA

Original Bistream

Duplicate

1

1

3

3

1

2

ACFs Forward in Time

ACFs Forward in Time

CCFs Backward in

Time

1

3

N. Huber et al. / Microprocessors and Microsystems 35 (2011) 297–307 305

N/2 + 1 clock cycles, this being an improvement over the N numberof cycles required for any ACF in the work by [19].

6. Efficiency comparison

Most of the above work was carried out with a a practical appli-cation in mind, namely CORES. Please note that the originalrequirements of CORES have evolved to demand a 32-point ACF in-stead of an 8-point one, as stated in [19]. In Table 1, a summary ispresented of the implementation results of the extensions to thecounterbased ACF, compared to the asynchronous method in [19]and the implementation currently within the CORES instrument.

It can be seen that the synchronous methods described here dosuffer from a decrease in max. speed compared to their asynchro-nous counterpart. However, the device usage has not increaseddramatically, especially taking into consideration that 32 pointsare now required. Furthermore, the above extensions add signifi-cant flexibility to a system otherwise limited in scope, which canbe used to increase the sensitivity of the current system on CORES.In addition, it should be noted that for the systems described here,no painstaking floor-planning was needed, contrary to the asyn-chronous counterbased ACF. It is likely that if the above implemen-tations had all been floor-planned, no significant loss in speedwould have been noted. However, this would have defeated thepurpose of this research. To conclude, a number of noteworthy ad-vances to pre-existing systems have therefore been achieved. Sometrade-offs were required, but expected, as a compensation for in-creased flexibility, accuracy and ease of implementation.

7. Applying to parallel datasets

The most significant benefit of the above extensions lies in thepossibility of parallelizing an otherwise fully serial process. Asmentioned earlier, previous counterbased ACFs suffered from thefact that they were dependent on discrete time process data, orthat the input sequence had to be shifted in serially. Althoughthe speed at which this can be done may be very high due to thepossibly asynchronous design, it meant that there was a minimumnumber of clock cycles necessary before the first result was avail-able. We have shown that this limitation can be overcome. In theACFs that were investigated above, it is clear that, when the num-ber of elements is a power of 2 and all the elements are alreadyavailable in a form of a data array, it is possible to split the ACF intoa number of parallel correlating functions (ACF or CCF). These inturn produce correct results at the end of a number of cycles equalto the number of original elements, divided by a factor equal to thenumber of ACF functions running in parallel.

These, in turn, will produce correct results after a number ofclock cycles equal to the number of input array bits, divided by

Table 1Summary of implementation results of extensions to counterbased ACF.

Algorithm Number of points Deviceusage(%)

Maximumspeed(MHz)

Currentimplementation onCORES

32 10 47

AsynchronousDelay-line 8 1 250Classical 8 1 161

32 5 153FLSACF 8 3 158.6

32 7 146.9Half-length classical 8 4 158.6

32 13 146.9

the number of parallel correlation functions the bit-stream is fedto.

As an example, consider a predefined bitstream as shown inFig. 9, which is assumed to have a length of 64-bits and which isrequired to be processed by the HLCACF method. Up to now wehave only considered splitting it into two parts, say A and B, butwe will now split it into parts 1, 2, 3 and 4 instead. To get the re-sults for the 32 lags (half of the 64 elements), only 32 clock cycles/shifts are required, which was the case with the traditional imple-mentation, as explained in Section 5.3. However, this can now belimited to just 16 shifts, if the process is split into four ACFs (mov-ing forward in time) and five CCF’s (moving backward in time), asshown in the lower half of Fig. 9. All these operations will take only16 clock cycles to complete, and, if the bitstream is already known,can all run in parallel. The main advantage of the generalization ofthe CBACF into a parallel form is that it is possible to receive cor-rect lag results of full accuracy at a fraction of the shifts that wouldotherwise be necessary. Of course, there will be some device usageoverheads (extra control circuitry, extra adders/accumulators) dueto the higher parallelization. It can be shown that, for a sequencesplit into N parts, the total number of parallel functions, Ofp, neededto calculate the HLCACF of the sequence is given by:

Ofp ¼N2

4þ N

2þXN=2

i¼1

i ð8Þ

The squared term makes this splitting computationally inten-sive for a large number of splits, but this shortcoming will be morethan compensated by the speed increase of the entire system.

2

2

4

4

2

3

3

4

2

4

Fig. 9. Block diagram showing the method of splitting the dataset to acquire fullHLCACF results at a fraction of the clock cycles. Red and black arrows indicate ACFsand CCF’s respectively.

Page 10: Parallel processing speed increase of the one-bit auto-correlation function in hardware

306 N. Huber et al. / Microprocessors and Microsystems 35 (2011) 297–307

Taking this example to its extreme, assuming that a 64-bitsequence is split into its constituent 1-bit elements, a total of1584 functions would be required. However, this would mean thatthe entire sequence could be correlated in this way within a singleclock cycle. Moreover, there would not be a need for the interme-diate counters, since a single gate could act as the correlation ele-ment. Allowing for a multilevel, pipelined adder for theamalgamation of the gate results, with a propagation delay of fiveclock cycles, would mean that a HLCACF correlation, with 32 fullaccuracy results, could be achieved in a total of six clock cycles.Running at approx. 150 MHz as before, a sampling speed of theup to 1.6 GHz is possible, to feed the HLCACF. At the same time,1584 gates can very easily be fitted into a much smaller FPGA thanthe one targeted here. Moreover, we have shown that the HLCACFis the most complicated form of parallelizing an ACF, and so it fol-lows that something like the FLSACF would be more hardware eco-nomical. If newer and more powerful FPGAs, or even ASICS, areused for this method, the performance increase will be significant,without the hardware overheads weighing heavily on the device.

8. Applications

An example application for the above improvements could be insuch cases where the specifics of the investigated phenomenon arenot perfectly defined. Consider that the data series to be correlatedare already known, and that a classical, un-weighted ACF is thepreferred correlation method. As such, the first few lag results willhave the higher sensitivity. By calculating the first few lag resultsin the parallel way described above, it is possible to speed up theprocess considerably. If these few results seem promising (e.g. byreturning a value higher than the expected average) the entire pro-cess can then be allowed to run its course. If, on the other hand, theresults look poor, the ACF can be prematurely terminated, and thenext data series read in. The performance increase of such a systemcould be considerable, depending on the specifics of the applica-tion, with no accuracy loss for any lag result calculated.

The fact that the proposed algorithm requires slightly more chipfloor area than other competing algorithms does not make itsplanned adoption less attractive since FPGA total gate counts con-tinue to increase with improving technologies. The ability to pro-cess many parallel datasets simultaneously has many applicationsbesides in a space instrument like CORES with its 128 independentenergy-angle particle detection data streams. There is a clear appli-cation to multi-sensor radio telescope arrays such as the LOFAR lowfrequency telescope involving some 10,000 digitised antennasspread across Europe [25,26] and the Square Kilometre Array, SKA[27]. Both LOFAR and SKA involve simultaneous auto-correlationand cross-correlation of the many data streams produced by thelarge number of sensors and frequency bands, and both aredesigned to take advantage of future increases in parallel dataprocessing capability. If these algorithms are also to be applied todedicated HW, specially designed for the purposes of correlation,then they could also benefit the SETI project, where the serialdatasets are assembled and split between parallel HW modulesfor processing [28].

9. Conclusion

By implementing the counterbased ACF in a synchronous man-ner and by circumventing its limitation to Classical ACFs only, it ispossible to make this method the implementation of choice formost one-bit ACF and, by extension, CCF applications in program-mable logic devices. The method has also been shown to be able tobecome parallelized, a feature that can produce ACF lag results insignificantly fewer clock cycles, something not previously achiev-

able, thus further increasing the usability of this kind of algorithm.Finally, it has been shown that this form of ACF can be split ad infi-nitum, until each element is correlated individually, resulting in thepossibility of very high ACF sampling speeds. In this sense, at thislimit all counting is effectively replaced by addition, carried outby simple gates.

Work has been undertaken to incorporate a counterbased half-length classical ACF function of 32 points into the CORES instru-ment. It will be replacing the current shift-register based fixedlength summation ACF currently in the system. The advantages ofspeed and ease of use are the main reasons the new version was se-lected, and the 3% increment in device utilization compared to theexisting implementation is not a major drawback. It is expectedthat this particular ACF will return scientific results of higher sensi-tivity, and will ultimately prove of higher scientific importance.

References

[1] M.D. McKenzie, S.J. Kim, Evidence of an assymetry in the relationship betweenvolatility and autocorrelation, International Review of Financial Analysis 16 (1)(2007) 22–40.

[2] J. He, M. Li, H.J. Zhang, C. Zhang, Symmetry feature in content-based imageretrieval, in: Proceeding of the ICIP, vol. 1, 2004, pp. 417–420.

[3] A. Emrich, Autocorrelation spectrometers for space borne (sub)millimetreastronomy, in: Proceedings of the 30th ESLAB Symposium, 1996, pp. 257–260.

[4] M. Engels, B. Hoppe, H. Meuth, R. Peters, Fast digital photon correlation systemwith high dynamic range, in: Proceedings of the 13th Annual IEEEInternational ASIC/SOC Conference, 2000, pp. 18–20.

[5] G.E.P Box, G. Jenkins, Time Series Analysis: Forecasting and Control, HoldenDay, 1976.

[6] G. Danese, F. Leporati, et al., A correlator for light scattering experiment, IEEETransactions on Instrumentation and Measurement 47 (1998) 935–940.

[7] M. Engels, B. Hoope et al., A single chip 200 MHz digital correlation system forlaser spectroscopy with 512 correlation channels, in: Proceedings of the 1999IEEE ISCAS, vol. 5, 1999, pp. 160–163.

[8] J. Urena, M. Mazo, et al., Correlation detector based on a FPGA for ultrasonicsensors, Microprocessors and Microsystems 23 (1999) 25–33.

[9] B. Von Herzen, Signal processing at 250 MHz using high performance FPGAs,IEEE Transactions on VLSI Systems 6 (1998) 238–246.

[10] E.A. Bezerra and M.P. Gough and A.M. Buckley, A VHDL Implementation of anOn-board ACF Application Targeting FPGAs, Proceeding of the Military andAerospace Programmable Logic Design (MAPLD) Conference, 1999.

[11] J.H. Van Vleck, D. Middleton, The spectrum of clipped noise, Proceedings of theIEEE 54 (1) (1966) 2–19.

[12] A.R. Thompson, J.M. Moran, G.W. Swenson Jr., Interferometry and Synthesis inRadio Astronomy, 2nd ed., Wiley-Interscience, 2001. ISBN: 0471254924.

[13] W. Brisken, Cross Correlators, Lectures of the 10th Synthesis Imaging School,NRAO, June 2006.

[14] T.J. Cornwell, Memo 77: EVLA and SKA Computing Costs for Wide FieldImaging, NRAO, 2004.

[15] B. Carlson, ‘‘Memo 01: A Proposed WIDAR Correlator for the EVLA Project,NRAO, 2000.

[16] B. Carlson, Memo 04: A More Detailed Analysis of Recirculation Architecture,Algorithms, and Limitations in the Proposed WIDAR Correlator for the EVLA,NRAO, 2000.

[17] http://www.xilinx.com/ipcenter/catalog/logicore/docs/.[18] S.K. Okumura et al., 1-GHz Bandwidth Digital Spectro-Correlator System for

the Nobeyama Millimeter Array, Publ. Astron. Soc. Japan 52 (2000) 393–400.[19] M.S. Pouchet, G. Seferiadis, N. Huber, M.P. Gough, A counterbased field

programmable gate array implementation of the one-bit autocorrelation andcross-correlation functions, Review of Scientific Instruments 76 (2005).

[20] G. Seferiadis, The Correlating Electron Spectroghaph: CORES, Phd Thesis,Department of Informatics, University of Sussex, Brighton, UK, 2004.

[21] M.P. Gough, A.M. Buckley, T. Carozzi, N. Beloff, Experimental studies of wave–particle interactions in space using particle correlators: results and futuredevelopments, Advances in Space Research 32 (2003).

[22] T. Carozzi, A.M. Buckley, Sampling errors of autocorrelograms for wide-sensewhite noise with arbitrary mean, Journal of Time Series Analysis, submitted forpublication.

[23] G. Seferiadis, M.S. Pouchet, M.P. Gough, Microchannel plate position read-outsystem using field programmable gate arrays, Review of Scientific Instruments76 (2005).

[25] H.J.A. Rottgering, LOFAR, a new low frequency radio telescope, New AstronomyReviews, vol. 47, 2003, pp. 405–409.

[26] http://www.lofar.org.[27] P.E. Dewdney, P.J. Hall, R.T. Schilizzi, T.J.L.W. Lazio, The Square Kilometre

Array, Proceedings of the IEEE 97 (8) (2009) 1482–1496.[28] A. Parsons, D. Backer, et al., PetaOp/second FPGA signal processing for SETI and

radio astronomy, in: Proceedings of the Fortieth IEEE ACSSC, 2006, pp. 2031–2035.

Page 11: Parallel processing speed increase of the one-bit auto-correlation function in hardware

nd Microsystems 35 (2011) 297–307 307

Nicolas Huber received an Electronic Engineeringdegree from the University of Sussex, UK, in 2001, and

an MEng degree in Electronic Engineering from thesame institution in 2002. He completed his PhD degreein the Sussex Space Center, focusing on the SussexCorrelating Spectrograph (CORES). He is now a researchfellow at the Sussex Biomedical Engineering Group. Hisresearch interests include biomedical instrumentation,embedded systems, reconfigurable computing andadapting mathematical algorithms to hardware.

N. Huber et al. / Microprocessors a

Marianne S. Pouchet is a graduate of Electrical andComputer Engineering at the University of the WestIndies with a specialization in Electronics. She recentlycompleted her D.Phil. at Sussex University in Infor-matics where she designed an FPGA-based Neural Net-work for the classification of DNA. She currently worksin a Post-Doctoral capacity at Cornell University, Ithacain the design and implementation of a Pixel ArrayDetector (PAD) for measuring X-ray diffraction off singlemolecules at the Laser Coherent Light Source (LCLS) atStanford, California.

Tobia D. Carozzi was born in Torino, Italy in 1966. Hereceived his M.Sc. degree in engineering physics, spe-cializing in radiation sciences, from Uppsala University,Sweden, in 1993 and his Ph.D. in space physics from theSwedish Institute of Space Physics in 2002. In 2000 heco-founded Red Snake Radio Technology AB, Stockholm,Sweden, where he was a senior researcher. He is cur-rently a research associate at the Physics and Astron-omy department of the University of Glasgow.

Prof M. Paul Gough is Professor of Space Science at theUniversity of Sussex. He received his BSc(1967),MSc(1968) University of Leicester and Ph.D (1972)University of Southampton. He is a Fellow of the Insti-tution of Engineering and Technology, and Fellow of theRoyal Astronomical Society and author of some 200journal papers published covering a range of topicswithin the disciplines of Space Science, InformationSystems, Physics and Geophysics, including: Iono-spheric Physics; Magnetospheric Physics; Astrophysics;Instrument Technology; Space Plasma Physics; Micro-processor Applications; Fault Tolerance; Parallel Com-

puting; Neural Networks; Fuzzy Logic; and Data Compression. He has flowninstruments on a number of space missions including: ESA GEOS I & II; AMPTE-UKS;CRRES, ESA Cluster II, STS-46; STS-75, Double Star probe, ISS.

Andrew M. Buckley is a Senior Lecturer in the School ofEngineering and Design at the University of Sussex.Formerly he was a Research Fellow in the Space ScienceCentre working on Cluster II space mission data. Prior tothis he worked as a Research Fellow in the PhysicsDepartment at the University of Warwick. His researchinterests cover wave analysis techniques, coherencystudies and electromagnetic analysis and simulations.