
Advanced Architectures for Astrophysical Supercomputing

Benjamin R. Barsdell

Presented in fulfillment of the requirements

of the degree of Doctor of Philosophy

2012

Faculty of Information and Communication Technology

Swinburne University


Abstract

Modern astronomy has come to depend on the exponential progress of computing technology. In recent times, however, processor hardware has undergone a dramatic shift away from the traditional, sequential model of computing toward a new, massively parallel model. Due to the preponderance of unparallelised software in astronomy, this development poses significant challenges for the community. This thesis explores the substantial benefits offered by advanced ‘many-core’ computing architectures and advocates a powerful, general approach to their use; these concepts are then put into practice to achieve new science outcomes in the field of pulsar astronomy.

We begin by developing a methodology for tackling the challenges of massively-parallel computing based on the analysis of algorithms. Simple analysis techniques are shown to provide deep insight into both the suitability of particular problems for advanced architectures as well as the optimal implementation approach when targeting such hardware. The method is applied to four well-known astronomy applications, highlighting their scalability and resulting in the rapid identification of potential speed-ups from cheaply-available many-core devices. The hardware- and software-independent nature of our approach means that, like a mathematical proof, such results remain valid in perpetuity.

Building on this foundation, we then consider in more detail the process of incoherent dedispersion, a computationally-intensive problem at the heart of surveys for fast radio transients. Three different dedispersion algorithms are analysed and implemented for a particular form of many-core hardware, the graphics processing unit (GPU), and speed-ups of up to nine times are obtained when compared to an efficient multi-core CPU implementation. The GPU-based direct dedispersion code is shown to enable processing of data from the High Time Resolution Universe (HTRU) survey, currently ongoing at the CSIRO Parkes 64 m radio telescope in New South Wales, Australia, at a rate three times faster than real time.

We look toward GPU-driven scientific outcomes by developing a real-time fast-radio-transient detection pipeline capable of exploiting many-core computing architectures. Our GPU dedispersion code is combined with new data-parallel and statistically robust implementations of algorithms for radio-frequency-interference (RFI) mitigation, baseline removal, normalisation, matched filtering and event detection to form a complete system capable of sustained real-time operation. The pipeline is demonstrated using both archival data from the HTRU survey and real-time observations at Parkes Observatory, where it has been deployed as part of the Berkeley Parkes Swinburne Recorder back-end. Early results demonstrate several key abilities, including live detection of individual pulses from known pulsars and rotating radio transients (RRATs), detailed real-time monitoring of the RFI environment, and continuous quality-assurance of recorded data. The increased sensitivity and ability to rapidly re-process archival data also resulted in the discovery of a new RRAT in a 2009 pointing from the HTRU survey, which we have confirmed using the pipeline in real-time at Parkes.

We conclude that our generalised, algorithm-centric approach offers a prudent path through the challenges posed by advanced architectures, and that exploiting the power and scalability of such hardware can and does provide paradigm-shifting accelerations to computationally-limited astronomy problems.


Acknowledgements

This thesis would not exist, and I would not be where I am, without the help of many people.

I would first like to thank my supervisors David Barnes, Chris Fluke and Matthew Bailes. I am indebted to David and Chris for daring to explore this unique and exciting topic, and I thank them deeply for their guidance through both the good times and the hard; their enthusiasm and unique perspectives were invaluable. I especially thank Chris for the many weekly meetings that kept me on track even when my work fell outside of his expertise. I also owe a great deal of gratitude to Matthew for accepting the lead supervisory role mid-way through my term and for providing me with the opportunity to apply my work to a rich and exciting field of discovery. I am extremely grateful for the wisdom imparted upon me by all three supervisors during our many conversations over the past forty-five months.

My eternal thanks go to Catarina, my parents and my sister for their unwavering support and excellent advice when things didn’t go according to plan. Their refreshing perspectives always showed me the bright side of any situation and kept me motivated through to the end.

A huge thanks goes to all the members of the pulsar group, Matthew, Willem, Ramesh, Andrew, Jonathon, Sarah, Lina, Stefan and Paul, for embracing me as a member and for teaching me the many ways of the neutron star. Pulsar coffee was always one of my favourite times of the week, and I will forever be thankful to have been part of such a close group both personally and professionally.

Special thanks go to: Amr Hassan for standing beside me at the outskirts of what some would call ‘normal’ astronomy topics; Max Bernyk, Georgios Vernados, Juan Madrid, Anna Sippel, Guido Moyano Loyola and the other affiliates of the SciVis group for thought-provoking discussions and presentations; Paul Coster for many enlightening discussions and for testing my (bug-ridden) code; Jarrod Hurley for his ongoing support and for taking me along for the amazing experience of observing at Keck; Willem van Straten for his help when my supervisors were away (as well as when they weren’t); Andrew Jameson for putting up with my code and spending long hours deploying and debugging it at Parkes; Nick Bate, Alister Graham, Darren Croton, Chris Flynn and Felipe Marin for introducing me to fascinating new fields of study and potential GPU applications; all the CAS soccer players who joined me in the park for (almost) all of the 180 weeks I was here; Gin Tan and Simon Forsayeth for fixing all of my computer issues; Carolyn Cliff, Elizabeth Thackray, Mandish Webb and Sharon Raj for dealing with my often poor admin skills; and Luke Hodkinson for introducing me to Emacs, with which this thesis and virtually all of the source code that went into it were written.

Finally, I would like to thank all of the other students, postdocs and staff whose paths I crossed during my time at Swinburne. I had a huge amount of fun here thanks to you all, and I sincerely hope that our paths intersect again in the future.


Declaration

The work presented in this thesis has been carried out in the Centre for Astrophysics & Supercomputing at the Swinburne University of Technology between 2008 and 2012. This thesis contains no material that has been accepted for the award of any other degree or diploma. To the best of my knowledge, this thesis contains no material previously published or written by another author, except where due reference is made in the text of the thesis. All work presented is primarily that of the author with the exception of the two opening paragraphs of Section 2.3.1, which were written by Christopher Fluke, and the CPU-based dedispersion code benchmarked in Chapters 3 and 4, which was written by Matthew Bailes. The content of the chapters listed below has appeared in refereed journals. Alterations have been made to the published papers in order to maintain argument continuity and consistency of spelling and style.

• Chapter 2 has been published as Barsdell, Barnes & Fluke (2010)

• Chapter 3 has been published as Barsdell et al. (2012)

Benjamin Robert Barsdell

Melbourne, Australia

2012


Dedicated to my parents Mark and Susan,

and to my sister Wendy.


Contents

Abstract
Acknowledgements
Declaration
List of Figures
List of Tables

1 Introduction
  1.1 Astrophysical supercomputing
  1.2 Advanced architectures
    1.2.1 Central processing units
    1.2.2 Graphics processing units
    1.2.3 Other accelerator cards
  1.3 Advanced architectures in astronomy
  1.4 Purpose of the thesis
  1.5 Advanced architectures meet pulsar astronomy
    1.5.1 History and characteristics of pulsars
    1.5.2 Pulsar observations
    1.5.3 Pulsar astronomy and advanced architectures
  1.6 Thesis outline

2 A Generalised Approach to Many-core Architectures for Astronomy
  2.1 Introduction
  2.2 A Strategic Approach: Algorithm Analysis
    2.2.1 Principle characteristics
    2.2.2 Complexity analysis
    2.2.3 Analysis results
    2.2.4 Global analysis
  2.3 Application to Astronomy Algorithms
    2.3.1 Inverse ray-shooting gravitational lensing
    2.3.2 Högbom CLEAN
    2.3.3 Volume rendering
    2.3.4 Pulsar time-series dedispersion
  2.4 Discussion

3 Accelerating Incoherent Dedispersion
  3.1 Introduction
  3.2 Direct Dedispersion
    3.2.1 Introduction
    3.2.2 Algorithm analysis
    3.2.3 Implementation Notes
  3.3 Tree Dedispersion
    3.3.1 Introduction
    3.3.2 Algorithm analysis
    3.3.3 Implementation Notes
  3.4 Sub-band dedispersion
    3.4.1 Introduction
    3.4.2 Algorithm analysis
    3.4.3 Implementation notes
  3.5 Results
    3.5.1 Smearing
    3.5.2 Performance
  3.6 Discussion
    3.6.1 Comparison with other work
    3.6.2 Code availability
  3.7 Conclusions

4 Fast-Radio-Transient Detection in Real-Time with GPUs
  4.1 Introduction
  4.2 The pipeline
    4.2.1 RFI mitigation
    4.2.2 Incoherent dedispersion
    4.2.3 Baseline removal
    4.2.4 Normalisation
    4.2.5 Matched filtering
    4.2.6 Event detection
    4.2.7 Event merging
    4.2.8 Candidate classification and multibeam coincidence
    4.2.9 Deployment at Parkes Radio Observatory
    4.2.10 Visualisation
    4.2.11 Performance
  4.3 Results
    4.3.1 Discovery of PSR J1926–13
    4.3.2 Giant pulses
    4.3.3 RFI monitoring
    4.3.4 Quality assurance
  4.4 Discussion

5 Future Directions and Conclusions
  5.1 Future directions
    5.1.1 The future evolution of GPUs
    5.1.2 Prospects for astronomy applications
  5.2 Summary

Bibliography

A Chapter 3 Appendix
  A.1 Error analysis for the tree dedispersion algorithm
  A.2 Error analysis for the sub-band dedispersion algorithm

List of Figures

1.1 Clock-rate versus core-count phase space of Moore’s Law.
1.2 Schematic of the programming model for recent NVIDIA GPUs.
1.3 Sample of the known pulsars plotted in P–Ṗ space.
2.1 Representative memory access patterns indicating varying levels of locality of reference.
2.2 A schematic view of divergent execution within a SIMD architecture.
3.1 Illustration of a dispersion trail and its corresponding dispersion transform.
3.2 Visualisation of the tree dedispersion algorithm.
3.3 Signal degradation and performance results for the piecewise linear tree algorithm compared to the direct dedispersion algorithm.
3.4 Signal degradation and performance results for the sub-band algorithm compared to the direct dedispersion algorithm.
4.1 Flow-chart of the key processing operations in our transient detection pipeline.
4.2 Results overview plots from our transient pipeline for an archived pointing in the HTRU survey containing a new rotating radio transient candidate.
4.3 Plot showing the break-down of execution times during each gulp for different parts of the transient pipeline.
4.4 Plot showing the variation of execution times for different parts of the transient pipeline as a function of the gulp size. Here all stages of the pipeline are executed on the GPU.
4.5 Plot showing the variation of execution times for different parts of the transient pipeline as a function of the gulp size. Here all stages of the pipeline but dedispersion are executed on the CPUs using 3 cores.
4.6 Results overview plots from our transient pipeline for a confirmation pointing of the rotating radio transient candidate shown in Fig. 4.2.
4.7 Results overview plots from the pipeline during a timing observation of the millisecond pulsar PSR J1022+1001 showing the detection of a number of strong pulses.
4.8 Results overview plots from the pipeline for a pointing containing strong bursts of radio-frequency interference.
5.1 Trends in theoretical peak GPU memory bandwidth and compute performance over the last five years.
5.2 Trends in GPU core count and critical arithmetic intensity over the last five years.

List of Tables

1.1 Summary of advanced architectures. Numbers are indicative only. See main text for acronym definitions.
2.1 Analysis of four foundation algorithms.
3.1 Summary of host↔GPU memory copy times during dedispersion.
3.2 Timing comparisons for direct GPU dedispersion of the ‘toy observation’ defined in Magro et al. (2011).
4.1 Properties of the discovered RRAT.

1 Introduction

If I had asked people what they wanted, they would have said faster horses.

—Henry Ford

1.1 Astrophysical supercomputing

Computing resources are a fundamental component of modern astronomy. Computers are used in the acquisition, reduction, analysis, simulation and visualisation of virtually all astronomical data. The increase in processing power that has followed Moore’s Law (Moore, 1965) since the mid 1960s has opened up vast new avenues of research that would not otherwise have been possible. Take, for example, simulations of gravitating bodies: first performed in the 1960s with up to 100 particles (von Hoerner, 1960; Aarseth, 1963), 50 years of evolution in computing saw this number increase by more than nine orders of magnitude (Kim et al., 2011; Angulo et al., 2012). On the observational front, contemporary projects such as the Sloan Digital Sky Survey (Thakar, 2008; York et al., 2000) have been made possible only by the advances in instrumentation and computing necessary to capture and process vast quantities of data; future projects such as the Square Kilometre Array (Cornwell, 2004; Dewdney et al., 2009) will push these requirements even further, demanding nothing short of world-class supercomputing facilities. While algorithmic developments also play a crucial role in these applications, it is the unwavering trend in computing power—doubling every two years for fixed cost—that has carried computational astronomy to where it is today.

There is, however, more to this story than it first appears, the key to which lies in the term ‘computing power’. Gordon Moore’s 1965 observation actually used the much more specific ‘cost per component’, referring to the manufacturing cost of transistors. Thus, being precise, the fundamental trend that has held strong for more than 45 years is the halving of the minimum cost per transistor every two years. The importance of this pedantry is that increasing the number of transistors at fixed cost does not necessarily translate into increasing ‘computing power’.

For the majority of the past 45 years, most computer software has been able to remain blissfully ignorant of the true nature of Moore’s Law. In addition to the doubling in the number of transistors, computers’ central processing units (CPUs) also exhibited a doubling in clock-rate every two years. The beauty of this was that software that ran at a particular speed would, two years later, run at twice that speed (on a new computer of similar cost) with no extra effort. Unfortunately, this all changed around 2005. As clock rates were increased, power consumption and heat generation also rose, and eventually a point was reached where the excess heat could not be effectively dissipated. The result was that clock-rates could not be stably pushed very far beyond 3 GHz. In response, hardware manufacturers turned to another means of increasing performance: placing multiple processors (or cores) on a single chip.

Figure 1.1 plots processors from the last ∼20 years in clock-rate versus core-count phase space. In this space, the evolution of CPUs turns a ‘corner’ around 2005 when clock rates plateaued and multi-core processors emerged. It now appears that the old trend of a doubling in clock-rates has been replaced by a similar trend in the number of cores. Projecting forward, this implies a future where processors exhibit ‘many-core’ architectures containing 100s or 1000s of cores.

The replacement of clock-rate with core-count as a means of increasing processor performance has ensured that Moore’s Law will continue to be a useful driving force for hardware manufacturers. However, on the software front there are significant consequences. Most software is composed of sequential codes, which execute instructions one after the other. In a multi-core environment, such codes will experience no direct performance gain from the presence of multiple processing cores, forever remaining limited by the clock-rate. The only way to take advantage of the new paradigm in processor architectures is to (re-)write codes to exploit scalable parallelism.

The dependence of modern astronomy on high-performance computing makes adapting to these changes particularly important. While some astronomy codes have already made the transition to multi-core processing (e.g., Merz, Pen & Trac 2005; Thacker & Couchman 2006; Mudryk & Murray 2009), many legacy codes are still in use, and performance-limited software is still often written without parallelism in mind. Furthermore, it is unlikely that all approaches to multi-core parallelism will scale effectively to many-core architectures.

Figure 1.1 Clock-rate versus core-count phase space of Moore’s Law binned every two years for CPUs (circles) and GPUs (diamonds). There is a general trend for performance to increase from bottom left to top right.

It is worth noting that, in addition to core count and clock speed, there is another dimension at play in processor performance: memory bandwidth. This property defines the rate at which data can be read from or written to memory, and can play a critical role in some applications. Codes that perform only a small number of arithmetic operations per element of data accessed can become limited by the system’s memory bandwidth before they are able to saturate the arithmetic capabilities of the hardware (this concept is discussed in more detail in Chapter 2). In recent years, memory bandwidth has not kept pace with progress in arithmetic performance, and an increasing number of codes are finding data-access the ultimate bottleneck. However, many of the most time-consuming processes in computational astronomy remain heavily reliant on arithmetic performance, and for these applications memory bandwidth is often not a concern.
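As a back-of-the-envelope sketch (using only the indicative CPU figures quoted in Section 1.2.1, namely a peak of around 224 GFLOP/s and a main-memory bandwidth of around 50 GB/s), the cross-over point can be expressed as a critical arithmetic intensity,

\[
I_{\mathrm{crit}} = \frac{\text{peak arithmetic rate}}{\text{peak memory bandwidth}}
\approx \frac{224\ \mathrm{GFLOP/s}}{50\ \mathrm{GB/s}}
\approx 4.5\ \mathrm{FLOP\ per\ byte}
\approx 18\ \mathrm{FLOP\ per\ single\text{-}precision\ value};
\]

a code performing fewer arithmetic operations than this per byte of data it accesses will saturate the memory system before it saturates the arithmetic units.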

While the hardware landscape has experienced a fundamental change in the design of CPUs, the last five years has also seen the rise in popularity of completely new hardware architectures for solving computationally intensive problems. These advanced architectures offer very high performance for certain problems at lower monetary and power costs than traditional CPUs, and potentially hold the key to carrying astronomy computations through the next decade of Moore’s Law. However, they also pose significant challenges to existing software development paradigms. The next section details the history, hardware architecture and programming models of these devices.

1.2 Advanced architectures

The mainstay of modern computing has been the central processing unit, which is tasked with everything from running an operating system and loading web pages to executing complex numerical simulations. However, CPUs are only one application of Moore’s Law, and only one way of tackling scientific computations. In recent years a number of alternative hardware architectures have been released into the high-performance computing market. By trading flexibility for performance, these products are often able to out-perform CPUs at compute-intensive tasks by an order of magnitude or more. They also offer a glimpse of a possible future for all high-performance computing hardware. A summary of the architectures discussed in this section is presented in Table 1.1.

Table 1.1 Summary of advanced architectures. Numbers are indicative only. See main text for acronym definitions.

Architecture   Hardware                                       Peak speed      Price
CPU            2–8 cores, vector registers, 3-level cache     224 GFLOP/s     US$1700
GPU            500–2500 cores, 2-level cache                  2592 GFLOP/s    US$2500
GRAPE          Many cores, ‘hard-wired’ force calculations    131 GFLOP/s     US$6000
Clearspeed     192 SIMD cores, 2-level cache                  96 GFLOP/s      US$3000
Cell BE        Heterogeneous, 8 cores, vector registers       180 GFLOP/s     US$8000
Xeon Phi       50 cores, vector registers, 2-level cache      Unknown         US$2000

Architecture   Software                                       Power           Applications
CPU            TBB, OpenMP, Intel MKL                         130 W           Many
GPU            CUDA, OpenCL, OpenACC                          250 W           Many
GRAPE          Custom API                                     15 W            N-body
Clearspeed     Cn                                             9 W             Few
Cell BE        OpenMP, OpenCL, vector intrinsics              210 W           Many
Xeon Phi       OpenMP, MPI, OpenCL, TBB, Intel MKL            Unknown         Many

While many-core central processing units are not yet a reality, modern graphics processing units (GPUs) already contain 100s of cores (see Figure 1.1). In recent years, GPUs have undergone a shift from a highly specialised graphics-oriented architecture to a flexible general-purpose computing platform. The results of this evolution will be discussed further in Section 1.2.2 (see also Owens et al. 2005 for an early review).

Other processors designed to accelerate specific or certain classes of computational problems have also appeared on the market. These range from hard-wired chips dedicated to evaluating Newton’s law of gravitation, to heterogeneous architectures combining sequential and parallel processing performance to speed up a wide variety of applications. These devices are discussed in detail in Section 1.2.3. Note that we do not include a discussion of field programmable gate arrays (FPGAs), which lie outside the scope of this work due to their complex programming environment and minimal use in main-stream computing (see, e.g., Monmasson & Cirstea 2007).

Before discussing new architectures, a brief review of current CPU designs and programming models is presented in Section 1.2.1.

1.2.1 Central processing units

Hardware architecture

As described in Section 1.1, current-generation CPUs exhibit multi-core designs, typically containing between two and eight full-function cores. In addition to this form of parallelism, CPUs also contain vector registers within each core that allow multiple values to be operated-on simultaneously in a single instruction multiple data (SIMD) fashion. Previous generations used the Streaming SIMD Extensions (SSE) instruction set, which provided access to 128-bit registers allowing four (two) single-precision (double-precision) floating-point values to be operated-on simultaneously per core. Current CPUs now use the Advanced Vector Extensions (AVX), which provide twice the vector width at 256 bits. When all vector registers and cores are employed, modern CPUs can perform up to 224 billion single-precision floating-point operations per second (GFLOP/s) (Vladimirov, 2012) [1].

Most modern CPUs also exhibit three hardware-managed cache levels, allowing low-latency memory access and fast data-sharing between cores; the maximum bandwidth to main memory is around 50 GB/s. Current server-class CPUs cost around US$1700 and consume up to 130 W, giving them monetary and power efficiencies of 0.132 GFLOP/s/$ and 1.72 GFLOP/s/W respectively [2].

[1] 224 GFLOP/s = 1 operation × 8 AVX vector slots × 8 cores × 3.5 GHz
[2] http://ark.intel.com/products/64583/Intel-Xeon-Processor-E5-2680

Programming models

Development for CPUs is most commonly approached using optimising compilers, which can in some cases automatically vectorise sequential codes into SSE/AVX instructions. Alternatively, low-level SSE/AVX instructions can be used directly by the developer to ensure optimal use of the hardware, although this adds significant development complexity. Multiple cores can be exploited through the use of multi-threading libraries such as Threading Building Blocks (TBB) [3] (where parallel processing threads are managed explicitly), directive-based approaches like OpenMP [4] (where parallel processing threads are managed implicitly), or pre-optimised maths libraries such as the Intel Math Kernel Library (MKL) [5] (where parallel processing is hidden completely from the developer).

[3] http://threadingbuildingblocks.org/
[4] http://www.openmp.org/
[5] http://software.intel.com/en-us/articles/intel-mkl/
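As a minimal sketch of the directive-based approach (the loop, array sizes and values here are arbitrary and are not taken from the thesis code), a single OpenMP annotation is enough to spread a sequential loop over the available cores, with partial results combined automatically:

#include <cstdio>
#include <vector>
#include <omp.h>

int main()
{
    const long n = 1 << 24;
    std::vector<float> a(n, 1.0f), b(n, 2.0f);
    double sum = 0.0;

    // Iterations are divided among the available threads; each thread keeps
    // a private partial sum, and the partial sums are combined at the end.
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i) {
        a[i] = 2.0f * a[i] + b[i];   // simple element-wise arithmetic
        sum += a[i];
    }

    printf("max threads: %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}

An optimising compiler may additionally vectorise the inner arithmetic onto SSE/AVX registers, so that both forms of CPU parallelism described above can be exploited from otherwise sequential-looking code.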

1.2.2 Graphics processing units

History

Graphics processing units (GPUs) first appeared as physical co-processors to regular CPUs in the 1980s. Their development was driven by the rise in popularity of graphical user interfaces, which often demanded significantly more computational power than the rest of a computer’s operating system. Moving these graphics operations to a GPU promised to free up the CPU to focus on traditional computing tasks, providing a better overall user experience. However, with a fixed transistor (or dollar) budget, simply moving computations from one processor to another would provide little or no benefit. The key to the success of this approach was that graphics computations are algorithmically distinct from many traditional computations.

Non-graphics compute tasks are typically heterogeneous, branch-heavy and sequential; for example, editing a document or loading a web page involves a melange of computational tasks and a huge number of logical decisions, many of which must be made in progression. The design of CPUs reflects this workload: CPU hardware is characterised by very fast sequential performance, large, deep cache hierarchies and branch prediction capabilities. In stark contrast, graphics tasks are often highly homogeneous, branch-free and parallel. Applying an operation to an image comes down to applying the operation to each pixel independently, and rendering a 3D scene involves independently transforming the vertices of polygons and texturing the corresponding pixels that are projected onto the screen. The hardware architecture of a GPU is thus characterised by parallel, homogeneous processing capabilities.

While GPUs have differed fundamentally from CPUs since their very first incarnation, their design has itself evolved significantly over the past three decades. This evolution has been driven by the combination of Moore’s Law and the ever-increasing demands of graphics-based software such as computer aided design applications, image and video editing tools and video games. Up until the end of the 1990s, GPUs contained only fixed-function hardware for computing different parts of the rendering pipeline—most commonly rasterisation and texture-mapping of pixels in polygons. With the desire for more flexibility in the rendering process (primarily from the video-games industry), the new millennium saw the appearance of simple programmable capabilities on the most popular GPUs. This functionality allowed developers to write small shader programs that would be executed on the hardware to transform the properties of either polygon vertices (via a vertex shader) or pixels (via a pixel shader). This new flexibility was rapidly adopted by the graphics programming community, and continued demand led to further improvements throughout the 2000s, providing more shader processors per chip and enabling longer, more complex shader programs to be written.

Another important step in the evolution of GPUs came in 2006 with the release of devices exhibiting a unified shader architecture. This new design replaced the use of separate vertex and pixel shader units with unified shaders capable of performing both roles (as well as new functions entirely). One advantage of this design was the ability to maintain high efficiency even when the use of one type of shader program greatly exceeded that of another. A much more significant advantage, however, was the ability to perform a more general set of computations. In 2007 NVIDIA [6] released its Compute Unified Device Architecture (CUDA), a platform for general purpose computation on GPUs (GPGPU), and with it opened up GPUs to a new world of applications [7]. The driving force in GPU design has since undergone a shift from purely graphics applications to a combination of graphics and general-purpose demands. Today, GPUs exhibit general-purpose features such as cache hierarchies, fast double-precision floating-point support, atomic operations and dynamic parallelism, making them applicable to a wide range of parallel computing problems. Modern graphics hardware also contains thousands of computing cores (unified shader units), resulting in peak performance of over a trillion floating-point operations per second (FLOP/s). This compute performance, around an order of magnitude greater than a similarly-priced CPU, has led to huge levels of interest from the scientific and high-performance computing (HPC) communities.

[6] NVIDIA is one of two main competitors in the GPU hardware industry, the other being Advanced Micro Devices (AMD).
[7] http://www.nvidia.com/object/cuda_home_new.html

Hardware architecture

In this section the recently-announced NVIDIA Kepler K20 GPU will be used as an example of a cutting edge GPU architecture (NVIDIA Corporation, 2012). This GPU connects to the PC via the Peripheral Component Interconnect (PCI) Express bus v3.0, which provides bidirectional data transfer at rates of up to 16 GB/s. This bus allows the GPU to communicate with the main system memory as well as other devices on the PCI Express bus (e.g., other GPUs, network interface cards etc.). The GPU also has its own main memory, which can be accessed with a bandwidth of up to 288 GB/s (significantly exceeding typical CPU memory bandwidth of up to 50 GB/s). Attached to the main memory is 1.5 MB of level two (L2) cache, which serves to provide efficient data access and sharing between processing units on the device.

The primary processing unit on the Kepler K20 GPU is the Streaming Multiprocessor (SMX); a single device can contain up to 15 SMXs. On each SMX sits 64 KB of L1 cache, which is divided into a regular L1 cache and what is called ‘shared memory’ (the division being application-configurable). Shared memory is an application-managed memory space that can be used to perform efficient data sharing and manipulation operations. In addition to these caches is 48 KB of read-only cache designed for data known to remain constant throughout program execution. Rounding out the memory spaces on the Kepler GPU is the register file on each SMX, which contains 65,536 32-bit registers that are divided between processing units (NVIDIA Corporation, 2012).

As suggested by its name, a Streaming Multiprocessor is composed of many individual processors: 192 general processing cores (supporting integer and single-precision floating-point arithmetic), 64 double-precision floating-point units, 32 special-function units, 32 load/store units and 16 texture filtering units. The general processing cores provide the bulk of the computational horsepower, totalling up to 2592 GFLOP/s [8]. The double-precision units similarly provide 864 GFLOP/s of double-precision performance. The purpose of the special function units is to provide very fast implementations of common mathematical functions such as roots, exponentiation, logarithms and trigonometric functions. The load/store units simply provide access to memory. Finally, the texture filtering units provide fast interpolation functions in one, two and three dimensions.

[8] 2592 GFLOP/s = 1 operation × 192 cores × 15 SMXs × 900 MHz. This number increases by a further factor of two if one considers the hardware’s ability to fuse multiply and add operations into a single instruction.

The latest GPUs cost around US$2500 for scientific computing models and consume up to 250 W of power, giving them monetary and power efficiencies of 1.04 GFLOP/s/$ and 10.4 GFLOP/s/W respectively.

Programming models

Programming for general-purpose computation on GPUs began with real-time rendering shader languages such as the OpenGL Shading Language (GLSL) [9], C for Graphics (Cg) [10] and the High Level Shader Language (HLSL) [11]. These languages require the developer to pose their problem in terms of graphics operations such as transforming vertices and rendering textured polygons. In addition to the out-of-context thought-process demanded by this approach, it also comes with significant performance disadvantages and limitations on the types of algorithms that can be computed. For example, scattering operations, where data are written to arbitrary locations in memory, are particularly difficult to implement using shader languages [12]. Shader languages also lack the ability to use shared memory to efficiently share and communicate data between processors, a feature that proved particularly critical to the performance of a number of algorithms including N-body simulations (Belleman, Bedorf & Portegies Zwart, 2008).

[9] http://www.opengl.org/documentation/glsl/
[10] http://developer.nvidia.com/page/cg_main.html
[11] http://msdn.microsoft.com/en-us/library/windows/desktop/bb509561(v=vs.85).aspx
[12] Implementing a scatter operation in a shader language requires placing the data at the vertices of a polygon and rendering it using a vertex shader that translates each vertex to the desired location.

Auspicious early performance results led to rising interest in using GPUs for general-purpose computations and saw the appearance of new programming interfaces designed to ease the development process. One such project was the BrookGPU language, which simplified the programming of parallel applications using a ‘stream processing’ approach (Buck et al., 2004). By restricting the allowed communication between parallel streams of computation, this approach enables a variety of parallel algorithms to be executed on both traditional and graphics processing hardware. BrookGPU provided back-ends supporting OpenGL [13] and DirectX [14], as well as the low-level GPU programming interface Close to Metal [15].

[13] http://www.opengl.org/
[14] http://www.microsoft.com/en-us/download/details.aspx?id=35
[15] http://sourceforge.net/projects/amdctm/

A significant step in the rise of GPGPU programming came with the release of the CUDA platform by NVIDIA in 2007. CUDA changed the GPU computing landscape by not only removing ties to graphics operations, but by also opening up additional hardware features such as unrestricted memory reads and writes, access to shared memory and fast bi-directional data transfers between the GPU and system memory. Subsequent versions have continued to introduce new features for general-purpose processing, including atomic operations [16], parallel voting functions, double-precision floating-point arithmetic, unified memory addressing across the CPU and GPU, function recursion, full C++ support and, most recently, dynamic parallelism and remote direct memory access. CUDA programs are made up of calls to a runtime library, providing access to device and memory management functions, and special functions called kernels written in ‘C for CUDA’, a C-like language providing extensions for parallel processing functionality.

[16] In contrast to regular operations, atomic operations guarantee conflict-free parallel memory writes.

The CUDA programming model defines a hierarchy of parallel processing and memory abstractions, depicted in Fig. 1.2. The fundamental unit of parallelism is the thread, which executes one instance of a kernel. Each thread is allocated a number of registers that it uses to perform its local computations; registers provide the fastest access times of all the GPU memory spaces, but cannot be used for communication between threads. Threads are grouped together in two ways, the first being into sets of 32 called warps. On the GPU, instructions are issued on a per-warp basis; threads within a warp therefore execute instructions in a lock-step fashion. In cases where some threads within a warp wish to execute a different instruction to others (i.e., during a conditional statement), those threads must wait, idle, while the other threads execute their operations. The second grouping of threads is into blocks. Blocks can vary in size according to application requirements, but are typically created with O(100) threads. The purpose of the block abstraction is to provide a means of communicating between threads: threads within a block have access to a fast synchronisation mechanism and shared memory, allowing them to rapidly share and exchange data, while threads in different blocks can communicate only through significantly slower means. The final layer in the processing hierarchy is the grid, which is composed of blocks and contains all of the threads created to execute a given kernel; recent GPUs also allow multiple grids to execute simultaneously. Communication at the grid level (i.e., between thread blocks) can only be achieved through global memory using either kernel-wide synchronisations or atomic operations. The success of this programming model is due to its careful balance between hardware and software demands: it constrains parallel execution and communication just enough to allow for a highly efficient and scalable hardware implementation, but provides enough flexibility to enable the realisation of a huge variety of parallel algorithms in software.
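To make the hierarchy concrete, the following ‘C for CUDA’ sketch (an illustrative example only, not code from this thesis) launches a grid of 256-thread blocks that each sum a slice of an array: threads within a block cooperate through shared memory and a block-wide synchronisation, while the per-block results are combined at grid level through an atomic operation on global memory.

#include <cstdio>
#include <cuda_runtime.h>

// Each 256-thread block sums its slice of the input in shared memory;
// thread 0 of each block then adds the block's partial sum to a global
// total using an atomic operation (grid-level communication).
__global__ void block_sum(const float* in, float* total, int n)
{
    __shared__ float partial[256];           // shared by the threads of this block
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? in[i] : 0.0f;   // one element per thread
    __syncthreads();                         // block-wide synchronisation

    // Tree reduction within the block (blockDim.x must be a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        atomicAdd(total, partial[0]);
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_total, h_total = 0.0f;

    cudaMalloc((void**)&d_in, n * sizeof(float));
    cudaMalloc((void**)&d_total, sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));  // arbitrary demonstration data
    cudaMemcpy(d_total, &h_total, sizeof(float), cudaMemcpyHostToDevice);

    // Grid of n/256 blocks, each of 256 threads (i.e., 8 warps per block).
    block_sum<<<(n + 255) / 256, 256>>>(d_in, d_total, n);

    cudaMemcpy(&h_total, d_total, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", h_total);

    cudaFree(d_in);
    cudaFree(d_total);
    return 0;
}

Note that threads in different blocks never exchange data directly; only the atomic update to global memory combines their results, mirroring the grid-level restrictions described above.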

Following the success of CUDA, in 2008 the Khronos Group (a non-profit technology consortium) released the Open Compute Language (OpenCL) [17], an open standard framework targeting heterogeneous parallel computing. OpenCL provides a similar programming model to CUDA, but allows developers to target a variety of hardware back-ends, including CPUs, GPUs and other processors from any vendor that provides an implementation. Like CUDA, OpenCL has added support for new hardware features in updated versions of the specification.

[17] http://www.khronos.org/opencl/

Building on the foundations provided by CUDA and OpenCL, many higher-level programming interfaces are now available to further ease the development of GPU applications. Libraries such as cuFFT, cuBLAS, cuSPARSE, cuRAND [18], CUSP [19], CULA [20] and others provide fast GPU implementations of common mathematical operations, and can often be substituted directly into existing CPU codes. Other libraries, such as Thrust (Hoberock & Bell, 2010) and CUDPP [21], provide high-level interfaces to common parallel algorithms. Support is also available for programming GPUs using high-level languages, including Python (via PyCUDA [22]), MATLAB [23], Mathematica [24] and IDL (via GPULib [25]). Lastly, directive-based approaches such as OpenACC [26] allow for GPU execution (in the same way that OpenMP [27] allows for multi-core execution) of existing CPU codes through the use of annotations and hints to the compiler.

[18] cuFFT, cuBLAS, cuSPARSE and cuRAND are part of the CUDA Toolkit, available here: http://www.nvidia.com/content/cuda/cuda-toolkit.html
[19] http://code.google.com/p/cusp-library/
[20] http://www.culatools.com/
[21] https://code.google.com/p/cudpp/
[22] http://mathema.tician.de/software/pycuda/
[23] http://www.mathworks.com.au/discovery/matlab-gpu.html
[24] http://reference.wolfram.com/mathematica/guide/GPUComputing.html
[25] http://www.txcorp.com/products/GPULib/
[26] http://openacc.org/
[27] http://openmp.org/
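As an indication of how little code such high-level interfaces can require (a sketch only; the data here are arbitrary and the example is not taken from the thesis), the Thrust library mentioned above allows a sort and a reduction to be executed on the GPU without writing any kernels:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdio>

int main()
{
    // Arbitrary host data.
    thrust::host_vector<float> h(1 << 20);
    for (size_t i = 0; i < h.size(); ++i)
        h[i] = static_cast<float>((i * 2654435761u) % 1000u);

    // Copying to a device_vector moves the data into GPU memory.
    thrust::device_vector<float> d = h;

    // Both calls below execute as parallel algorithms on the GPU.
    thrust::sort(d.begin(), d.end());
    float sum = thrust::reduce(d.begin(), d.end(), 0.0f);

    printf("min = %g  max = %g  sum = %g\n",
           static_cast<float>(d.front()), static_cast<float>(d.back()), sum);
    return 0;
}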

Figure 1.2 Schematic of the programming model for recent NVIDIA GPUs showing the processing and memory hierarchies. Boxes labelled with the letter ‘R’ represent memory registers. Threads within a warp are depicted chained together to indicate the requirement that they execute instructions in lock-step. See main text for a description of each element.


Use in the scientific literature

The notion of using GPU hardware as a computational engine has existed for nearly as long as GPUs themselves, with the first formal analysis of the idea appearing in the late 1980s (Fournier & Fussell, 1988). At that time, research focused mainly on algorithms used in rendering 3D graphics, such as visible surface detection and shadow computation (operations that went on to become mainstays of the 3D graphics industry). However, GPU-based implementations of more general algorithms soon appeared. The ability of early graphics hardware to rasterise polygons was exploited to implement high-performance motion planning algorithms (Lengyel et al., 1990), and later work used the OpenGL graphics API to develop GPU-based implementations of artificial neural networks (Bohn, 1998), 3D convolution (Hopf & Ertl, 1999), numerous computational geometry algorithms (Hoff et al., 1999; Mustafa et al., 2001; Krishnan, Mustafa & Venkatasubramanian, 2002; Agarwal et al., 2003; Sun, Agrawal & El Abbadi, 2003), matrix multiplication (Larsen & McAllister, 2001), non-linear diffusion for image processing (Rumpf & Strzodka, 2001; Diewald et al., 2001) and cellular automata-based fluid simulations (Harris et al., 2002). With no access to programmable graphics hardware, these implementations relied on exploiting the limited operations available in fixed-function graphics pipelines. Major obstacles imposed by this approach included limited arithmetic precision, lack of support for arbitrary mathematical functions, limited support for conditional execution and restricted convolution capabilities (Trendall & Stewart, 2000).

The appearance of programmable shaders in graphics hardware (Lindholm, Kilgard & Moreton, 2001; Proudfoot et al., 2001; Mark et al., 2003) alleviated many of the issues associated with fixed-function pipelines, opening up GPU acceleration to a wider variety of applications. New algorithms included multigrid and sparse conjugate gradient matrix solvers (Goodnight et al., 2003; Bolz et al., 2003), fast Fourier transforms (Moreland & Angel, 2003), wavefront phase recovery (Rosa, Marichal-Hernandez & Rodriguez-Ramos, 2004), direct gravitational N-body simulation (Nyland, Prins & Harris, 2004), Monte Carlo simulations in statistical mechanics (Tomov et al., 2005), computer-generated holography (Masuda et al., 2006), 3D shape measurement (Zhang, Royer & Yau, 2006) and magnetic resonance imaging reconstruction (Schiwietz et al., 2006). While shader languages provided unprecedented flexibility on GPU hardware, they remained graphics-specific, restricting memory access and forcing developers to map their problems into a graphics-based context.

The release of CUDA (and later OpenCL) was the final step in liberating GPU hardware for general-purpose computations. Severing ties with graphics-specific operations and allowing arbitrary access to memory opened the flood gates to a plethora of applications. Implementations of algorithms from all areas of science appeared, some examples being: k-nearest neighbour search (Garcia, Debreuve & Barlaud, 2008; Campana-Olivo & Manian, 2011), molecular dynamics (Anderson, Lorenz & Travesset, 2008; van Meel et al., 2008; Sunarso, Tsuji & Chono, 2010), numerical weather prediction (Michalakes & Vachharajani, 2008; Govett et al., 2011; Mielikainen, Huang & Huang, 2011), radiotherapy dose calculation (de Greef et al., 2009; Jia et al., 2010; Men et al., 2009), computational fluid dynamics (Cohen & Molemake, 2009; Horvath & Liebmann, 2010; Tomczak et al., 2012), pattern formation in financial markets (Preis et al., 2009), air pollution modeling (Molnar et al., 2010), spherical harmonic transforms (Hupca et al., 2012) and ant colony optimisation (Cecilia et al., 2011). New applications, and improvements to existing ones, continue to appear as new generations of GPU hardware and software provide additional performance and flexibility.

1.2.3 Other accelerator cards

History

In addition to GPUs, the past decade has seen the release of a number of dedicated accelerator cards, typically designed to plug into PC expansion slots and provide acceleration for certain types of computations. By focusing only on particular mathematical operations, these devices aim to provide cost-effective solutions to increasing the performance of compute-intensive codes. While several products have seen short-term success, over the longer term manufacturers have often struggled to compete with commodity hardware.

In the early 1990s, a series of devices aimed at accelerating the O(N²) operations involved in direct gravitational N-body simulations was developed at the University of Tokyo (Ebisuzaki et al., 1993). The Gravity Pipe (GRAPE) hardware speeds up simulations by offloading the computationally intensive force calculations (performed between all pairs of gravitating bodies in a system) from the CPU. Successive versions of the devices, from the GRAPE-1 through to the GRAPE-5 (Kawai et al., 2000) and GRAPE-6 (Makino et al., 2003), brought increased performance and accuracy. A modified version of the GRAPE-6, the GRAPE-6A, also offered a smaller form factor that allowed it to be plugged into a PC expansion slot (Fukushige, Makino & Kawai, 2005). While the GRAPE hardware proved successful for more than a decade of N-body simulations, it was ultimately faced with strong competition from GPUs, which concluded with the development of a substitute library allowing GRAPE-based codes to exploit cheaper GPU hardware instead through a simple re-linking operation (Gaburov, Harfst & Portegies Zwart, 2009).

Products offering more general acceleration capabilities were released between 2003 and 2009 by ClearSpeed Technology, providing full floating-point and integer arithmetic capabilities. Their most recent device, the CSX700 [28], offers up to 96 GFLOP/s of double-precision performance at a typical power consumption of 9 W. The product was priced at around US$3000 in 2008. However, no subsequent models have been released, likely also due to competition from GPUs, which now provide better performance at lower cost.

[28] http://www.clearspeed.com/products/csx700.php

In 2006, Mercury Computer Systems released a PCI Express card featuring the Cell Broadband Engine Architecture (Cell BE), a heterogeneous multi-core chip providing high floating-point performance. The Cell Accelerator Board (CAB) consumes 210 W and provides theoretical peak single-precision (double-precision) performance of 180 GFLOP/s (90 GFLOP/s), priced at around US$8000 in 2007. As with the ClearSpeed device, no updated models of the Mercury CAB have been announced since its first release.

Following the success of GPUs in high-performance computing (HPC) applications, Intel has recently developed a dedicated computational accelerator card designed to compete with GPUs in the HPC market. The Xeon Phi will offer massively-parallel processing on a PCI Express board and provide a CPU-like development environment [29].

[29] http://www.intel.com/content/www/us/en/high-performance-computing/high-performance-xeon-phi-coprocessor-brief.html

Hardware architectures

The hardware architecture of the GRAPE has evolved significantly through its six versions, offering increased performance through more cores and higher clock rates. In addition, odd-numbered versions have exploited logarithmic arithmetic to avoid expensive root operations, at the cost of reduced accuracy. The GRAPE-6 exhibits a massively-parallel, hierarchical processor architecture. At the smallest level, this consists of interaction pipelines designed specifically to evaluate the equations of Newtonian gravity, including the force and its time derivative. The ‘hard-wired’ nature of the pipelines ensures maximum computational efficiency, while the massive parallelism allows multiple interactions to be computed simultaneously, providing high performance at a relatively low clock-rate (90 MHz).

The ClearSpeed CSX700 has a single instruction multiple data (SIMD) architecture

containing 192 processing elements divided between two parallel arrays. The SIMD archi-

tecture means that all of the processing elements in one parallel array execute instructions

28http://www.clearspeed.com/products/csx700.php29http://www.intel.com/content/www/us/en/high-performance-computing/

high-performance-xeon-phi-coprocessor-brief.html


in lock-step. The processing elements have access to two levels of cache as well as around

2 GB of main memory, and communicate with the host machine over a PCI Express bus

at up to 8 GB/s.

The PowerXCell 8i processor featured on the Mercury CAB exhibits a heterogeneous

architecture made up of a primary processor called a Power Processing Element (PPE) and

eight co-processors called Synergistic Processing Elements (SPEs). The PPE is similar to

a traditional CPU and provides the general-purpose processing capabilities required to run

an operating system and support the SPEs. The SPEs provide the bulk of the computing

power and feature 256 KB of local memory and 128-bit wide SIMD capabilities, allow-

ing four (two) single-precision (double-precision) floating-point values to be operated-on

simultaneously per SPE. The board also contains 1 GB of high-performance memory and

4 GB of main memory. The PPE, SPEs and memory units are connected via the Element

Interconnect Bus, a circular bus providing concurrent transactions between components

on the chip^30.

The Xeon Phi will feature at least 50 independent processing cores, with each offering

512-bit SIMD units, allowing 16 (8) single-precision (double-precision) floating-point val-

ues to be operated-on simultaneously per core. In contrast to GPUs and other accelerator

cards, the cores in the Xeon Phi will be compatible with the x86 instruction set, allowing

them to support existing CPU-based software (most of which uses x86). The device will

also exhibit two levels of cache and a ring bus connecting the processors and memory,

allowing for fast communication across the chip^29.

Programming models

Due to its single-purpose design, GRAPE hardware is accessed via a very simple appli-

cation programming interface (API). Functions are provided to initialise and shut down

the device, transfer particle data to and from its memory and to instruct it to begin com-

putation. The only flexibility afforded by the hardware is in the number of particles it is

given; however, this is sufficient to support (certain implementations of) tree-based force

evaluation algorithms, which are often used to speed up the simulation of collisionless

gravitational systems (Makino & Funato, 1993).

Development for the ClearSpeed CSX700 is done through a proprietary software de-

velopment kit^31. Direct programming is done using Cn, a C-like language with extensions

for parallel processing. The extensions allow a programmer to qualify data types as ‘poly’,

30 http://www.ibm.com/developerworks/power/library/pa-cellperf/
31 http://www.clearspeed.com/products/sdk.php


indicating to the compiler that they should be replicated across each processing element

and operated-on in parallel. A run-time function is provided to query the index of the

current processor, allowing it to act on processor-dependent data; conditional statements

depending on the processor index will, however, lead to unselected processors waiting idle,

due to the SIMD nature of the hardware architecture. A number of libraries are also

provided to accelerate common mathematical algorithms such as fast Fourier transforms,

basic linear algebra operations and pseudo-random number generation.

Several programming interfaces are available for the Cell BE processor on the Mercury

CAB. At the lowest level, writing assembler code provides complete access to all hard-

ware capabilities and allows the programmer to extract the greatest performance, at a

significant development cost. A more common approach is to write C or C++ code using

an application programming interface and vector intrinsics to target the SPEs and their

SIMD units. Optimising compilers have also been developed to automatically exploit the

parallel processing hardware and on-chip memory spaces; code sections can be marked as

parallel by the programmer using a similar model to OpenMP (Eichenberger et al., 2005).

Other options for targeting the Cell BE processor are an implementation of the OpenCL

specification, and optimised maths libraries.

The Xeon Phi’s x86 compatibility is designed to allow the use of many existing par-

allel programming tools. These include OpenMP and message passing interface (MPI)

implementations, Intel’s Array Building Blocks^32, Threading Building Blocks^33 and Math

Kernel^34 libraries, as well as Intel’s Cilk Plus^35 extensions to C and C++. OpenCL will

also be supported. While the individual processor cores will be able to execute much

existing code, use of the new 512-bit SIMD units will require additional development.

Use in the scientific literature

Due to its problem-specific nature, GRAPE hardware has not seen significant use outside

of astronomy. Applications of these devices within astronomy are discussed in Section 1.3.

ClearSpeed’s accelerator devices have seen only very limited use by the scientific com-

munity, and do not appear to have been used in astronomy. Published applications

include lattice Boltzmann methods (Heuveline & Weiß, 2009), hologram generation (Tan-

abe et al., 2009) and geographic flood inundation simulations (Neal et al., 2009).

The Cell processor’s versatility has seen it applied to a large number of problems. Ap-

32 http://intel.com/go/arbb
33 http://threadingbuildingblocks.org
34 http://software.intel.com/en-us/articles/intel-mkl/
35 http://software.intel.com/en-us/articles/intel-cilk-plus/


plications include quantum chromodynamics simulations (Belletti et al., 2007), 3D com-

puted tomography reconstruction (Scherl et al., 2007), high-energy physics reconstruction

algorithms (Gorbunov et al., 2008), self-organising maps (McConnell, 2010), molecular

dynamics simulations (Gonnet, 2010) and video encoding for large-scale surveillance (Lu

et al., 2010).

1.3 Advanced architectures in astronomy

While some advanced architectures, like the GPU, have only recently seen broad use by

the astronomical community, others, like the GRAPE, have been in use for more than

two decades. The primary application of GRAPE hardware has been to simulations of

collisional stellar environments (Makino, 1991, 1996; Klessen & Kroupa, 1998; Shara &

Hurley, 2002; Baumgardt et al., 2003), but it has also been applied to collisionless SPH

simulations (Steinmetz, 1996; Springel, Yoshida & White, 2001). While competition from

GPUs appears to have pushed GRAPE hardware out of the market, it remains in use

today (e.g., Jalali et al. 2012).

Astronomy applications of the Cell BE processor have been limited to those investi-

gated by a small number of early adopters. These include image synthesis (Varbanescu

et al., 2008) and signal correlation (van Nieuwpoort & Romein, 2009) for radio astronomy,

period searching in light curves (Cytowski, Remiszewski & Soszyski, 2010) and numerical

relativity simulations (Khanna, 2010).

Direct gravitational N-body simulations were among the first astronomy codes imple-

mented on GPUs, initially using graphics shader languages (Nyland, Prins & Harris, 2004;

Portegies Zwart, Belleman & Geldof, 2007) and later using the general-purpose GPU lan-

guages BrookGPU (Elsen et al., 2007) and CUDA (Hamada & Iitaka, 2007; Nyland, Harris

& Prins, 2007; Belleman, Bedorf & Portegies Zwart, 2008; Gaburov, Harfst & Portegies

Zwart, 2009). More recently, algorithmic advances have led to GPU implementations of

hierarchical tree-based N-body algorithms (Hamada et al., 2009; Nakasato et al., 2012;

Bedorf, Gaburov & Portegies Zwart, 2012). A review of GPU use in N-body simulations

has been published by Bedorf & Portegies Zwart (2012).

While N-body simulations have received particular attention, GPU applications in

astronomy now span a wide range of problems. Some examples are radio-telescope signal

correlation (Schaaf & Overeem, 2004; Harris, Haines & Staveley-Smith, 2008; Ord et al.,

2009; Wayth, Greenhill & Briggs, 2009; Clark, La Plante & Greenhill, 2011), the solution of

Kepler’s equation (Ford, 2009), galaxy spectral energy distribution calculations (Jonsson

& Primack, 2010; Heymann & Siebenmorgen, 2012), gravitational lensing ray-shooting


(Thompson et al., 2010; Bate et al., 2010), adaptive mesh refinement (Wang, Abel &

Kaehler, 2010; Schive, Tsai & Chiueh, 2010), volume rendering of spectral data cubes

(Hassan, Fluke & Barnes, 2012) and cosmological lattice simulations (Sainio, 2012). A

review of practical issues faced when implementing astronomy problems on GPUs has also

been published by Fluke et al. (2011).

It is important to note that many of these applications are well-known for exhibiting

large degrees of parallelism. In this sense, they may be considered ‘low-hanging fruit’

for implementation on massively-parallel architectures like GPUs. It is also evident that

the two main sources of knowledge regarding the design requirements for these implemen-

tations are hardware-specific documentation and simple trial and error [e.g., Hamada &

Iitaka (2007); Harris, Haines & Staveley-Smith (2008); Thompson et al. (2010)]. While this

‘ad-hoc’ approach has proven successful in early work, it is unclear whether such methods,

which generally require significant investments of time for development and optimisation,

will produce similar rewards for all areas of astronomy.

1.4 Purpose of the thesis

This thesis is motivated by two key observations: 1) the changing landscape of computing

hardware is threatening to leave behind astronomy research that does not adapt; and 2)

advanced architectures offer the potential to enable new science today. Consequently, its

aims are: 1) to motivate, develop and demonstrate a generalised approach to the use of

many-core architectures in astronomy; and 2) to use an advanced architecture to enable

new science.

It is crucial that astronomy be able to exploit advances in computing hardware, and

therefore critical that the software community embrace the current trend in processor

design that is placing more and more emphasis on massively-parallel processing. The

key obstacles to this are the foreign programming model and often steep learning curve

presented by advanced architectures. It is the first goal of this thesis to ameliorate this

issue by introducing a generalised approach to analysing and implementing algorithms on

such hardware and removing the risks associated with ad-hoc development. This forms

the basis of Chapter 2.

The order of magnitude more computing power offered by advanced architectures rela-

tive to CPUs today provides a unique opportunity to enable new science. Computationally-

limited fields of study stand to reap great rewards from the ability to process more data,

explore more parameter space or produce results with more accuracy. It is the second

aim of this thesis to demonstrate this possibility by applying an advanced architecture


to problems in pulsar astronomy and subsequently developing a real-time event detection

pipeline capable of unlocking unprecedented discovery opportunities. These ideas form

the basis of Chapters 3 and 4. An introduction to pulsar astronomy and a discussion of

the motivation behind the choice of this field for the application of advanced architectures

is presented in Section 1.5.

To avoid undue complication, this thesis focuses primarily on graphics processing units

as the canonical example of an advanced, many-core hardware architecture. This does not,

however, represent a reduction in scope: the ideas and methods presented in this work are

expected to apply equally well to other massively-parallel architectures, both present and

future. That said, the long-running history and established market position of GPUs give

good reason to believe that they will continue to remain a significant force in accelerated

computing for the foreseeable future.

1.5 Advanced architectures meet pulsar astronomy

The applications to which GPUs were applied in this thesis focus primarily on problems in

pulsar astronomy (e.g., Chapters 3 and 4). Pulsar astronomy has a strong dependence on

high-performance computation, and in many cases its science is computationally-limited.

Here we provide a brief introduction to pulsars, their observation at radio frequencies

and why their study is an excellent field for the application of advanced architectures like

GPUs.

1.5.1 History and characteristics of pulsars

Pulsars get their name from a portmanteau of ‘pulsating star’, which describes their ap-

pearance when observed through a telescope. As with many phenomena in astronomy,

this observationally-derived name does not correspond to their underlying physical nature.

Pulsars are in fact rotating neutron stars, remnants from supernovae. Their serendipitous

discovery in 1967 by Jocelyn Bell Burnell involved observations around 81.5 MHz of un-

explained regular pulses of emission from a consistent celestial location (Hewish et al.,

1968). These pulses were found to have remarkable periodicity, with one source repeat-

ing every ∼1337 ms to better than one part in 10^7 [and since measured to better than

one part in 10^12 by Hobbs et al. (2004)]. After the initial discovery of four such sources,

the phenomenon was quickly attributed to polar emission from a rotating neutron star,

where an intense magnetic field accelerates charged particles from the surface of the star

to relativistic speeds, resulting in the emission of synchrotron radiation from the magnetic


poles (Gold, 1968; Pacini, 1968). The observation of discrete pulses arises from a misalign-

ment between the star’s rotation and magnetic axes, which causes the emission beam to

periodically sweep across our line of sight as the star rotates.

Today more than 2000 pulsars have been discovered, and ongoing surveys continue to

add to this number. Two primary metrics used to characterise a pulsar are the rotation

period P and its derivative Ṗ. Plotting the known pulsars in this phase space (see Fig.

1.3) reveals several distinct groupings [see Bhattacharya & van den Heuvel (1991) and

Cordes et al. (2004) for reviews]. The largest group primarily occupies the range 0.25 s

≲ P ≲ 1.25 s, forming the population known as the slow, regular or canonical pulsars.

At shorter periods lies a distinct population of ‘millisecond pulsars’ (MSPs), correlated

strongly with pulsars known to be members of binary systems. MSPs are thought to be

‘recycled’ slow pulsars—since their formation, they have been spun-up by the accretion

of mass from a companion star (Alpar et al., 1982). At the longest periods and steep-

est period derivatives are a population of pulsars known as magnetars, named for their

extremely strong surface magnetic field strengths (Mereghetti, 2008). Another class of

pulsars, generally exhibiting periods similar to magnetars but spin-down rates more char-

acteristic of regular pulsars, are the rotating radio transients (RRATs). These objects are

distinguished by their sporadic pulse detection rates and are now thought to be pulsars

that experience on-and-off ‘nulling’ of their emission (McLaughlin et al., 2006); their exact

definition, however, remains uncertain.

Pulsars have a number of attributes that make them very useful objects of study.

Observations can provide insights into the physics behind neutron stars, entities that

skirt the edges of the known physical laws (Lattimer & Prakash, 2004). Their place at

the end of the stellar evolutionary path also makes pulsars amenable to studies of stellar

populations (Bhattacharya & van den Heuvel, 1991). The stability of their rotation, which

can rival the best Earth-based atomic clocks (Matsakis, Taylor & Eubanks, 1997), also

allows them to be used to refine solar system ephemerides and potentially to detect the presence of

a gravitational wave background (Foster & Backer, 1990). Furthermore, in tight binary

systems they become even more useful, providing probes of high-energy plasma physics

and gravitational radiation (Lyne et al., 2004).

1.5.2 Pulsar observations

Pulsars have been observed in the radio, optical, X-ray and gamma-ray bands (Abdo et al.,

2010; Mignani, 2011). Of the known pulsars, the majority have been detected at radio

frequencies, a large fraction of which were discovered at the Parkes Radio Observatory


[Figure 1.3 here: log–log scatter plot of period derivative (s s^-1) versus period (s) for the known pulsar population, distinguishing regular pulsars, binary members, magnetars and RRATs.]

Figure 1.3 Sample of the known pulsars plotted in P–Ṗ space. Data obtained using the ATNF Pulsar Catalogue (Manchester et al., 2005) available here: http://www.atnf.csiro.au/people/pulsar/psrcat/.


in bands centred at 436 MHz (Lyne et al., 1998) and 1382 MHz (Lorimer et al., 2006;

Keith et al., 2010). An important phenomenon at these frequencies is dispersion: the

introduction of a frequency-dependent time delay in the pulse signal as a result of refraction

by free electrons, which reside in the interstellar medium between source and observer.

The dispersion delay varies quadratically with frequency and is directly proportional to the

dispersion measure (DM), a quantity defining the column density of free electrons along

the line of sight. Left uncorrected, interstellar dispersion causes pulsar signals to appear

smeared out in time across a finite observing bandwidth. For this reason, observations of

pulsars must be corrected for the dispersion delay at each frequency prior to integrating

the band.
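To make the preceding paragraph concrete, the frequency-dependent delay can be evaluated directly. The following sketch is ours and is not part of any survey software; it assumes the standard dispersion constant of approximately 4.148808 × 10^3 s MHz^2 cm^3 pc^-1, takes frequencies in MHz and DM in pc cm^-3, and the function names and band values are purely illustrative.

    #include <cstdio>

    // Dispersion delay (in seconds) of a signal at frequency f_mhz relative to a
    // reference frequency f_ref_mhz, for a dispersion measure dm (pc cm^-3).
    // Uses the standard dispersion constant ~4.148808e3 s MHz^2 cm^3 pc^-1.
    double dispersion_delay_s(double dm, double f_mhz, double f_ref_mhz) {
        const double k_dm = 4.148808e3; // s MHz^2 cm^3 pc^-1
        return k_dm * dm * (1.0 / (f_mhz * f_mhz) - 1.0 / (f_ref_mhz * f_ref_mhz));
    }

    int main() {
        // Illustrative example: delay across a 400 MHz band near 1.4 GHz at DM = 100.
        double delay = dispersion_delay_s(100.0, 1182.0, 1582.0);
        std::printf("delay = %.3f ms\n", delay * 1e3);
        return 0;
    }

For these illustrative values the delay is roughly a tenth of a second, i.e., around two thousand time samples at 64 µs resolution, which is why the correction cannot be neglected.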

Pulsar observations using radio telescopes require a number of processing stages to

reduce the raw voltages to final data products. The signal path begins at the receiver

horn, which captures radiation as complex voltages in two orthogonal polarisations. These

signals are fed through a low-noise amplifier, which boosts weak astronomical signals to

detectable levels in a low-thermal-noise environment (often cooled cryogenically). A low-

frequency signal from a separate oscillator is then mixed with this amplified signal to

reduce the frequency to O(10 MHz), simplifying subsequent electronics and preventing

feedback into the receiver. The mixed signal is then passed through a band-pass filter to

produce the intermediate frequency (IF) feed. The IF feed attaches to what is known as

the receiver back-end.

Back-ends vary significantly depending on the intended observing mode and the tech-

nology used. For pulsar timing observations, where the dispersion measure of the source is

known a priori, a process known as coherent dedispersion can be applied to directly correct

the complex signal voltages for the effects of interstellar dispersion (Hankins & Rickett,

1975). After this procedure, the data can be folded at the pulsar period to produce an

integrated pulse profile at the native time resolution, allowing for very high-precision tim-

ing. Popular current choices of back-end hardware are the FPGA platforms developed by

the Center for Astronomy Signal Processing and Electronics Research (CASPER) (e.g.,

Langston, Rumberg & Brandt 2007; Keith et al. 2010; Sane et al. 2012). The CASPER

Parkes Swinburne Recorder (CASPSR) is a recently-developed pulsar timing back-end

that operates by digitising the IF feed and using a CASPER Interconnect Break-out

Board (IBOB) to packetise and transmit the data to a cluster of server computers; the

coherent dedispersion and folding process is then performed in software (van Straten &

Bailes, 2011).

Different techniques are used when taking survey observations, which aim to detect new


sources. Survey back-ends must record the observed data such that it can be searched

for signals across a range of dispersion measures. The computational cost of coherent

dedispersion (discussed further in Section 1.5.3) precludes applying it at this number of

DMs, and thus survey data are generally dedispersed incoherently. Modern survey back-

ends act as digital spectrometers by dividing up the IF feed into a number of independent

frequency channels, usually using a polyphase filterbank to avoid issues associated with

the straightforward discrete Fourier transform [see Harris & Haines (2011) for a review of

the use of polyphase filterbanks in astronomy]. The Berkeley-Parkes-Swinburne Recorder

(BPSR) is a survey back-end that uses an IBOB to apply a polyphase filterbank to the

digitised input signal. The FPGA then ‘detects’ each channel by squaring it, integrates

over 25 time samples to reduce the time resolution to 64 µs, scales and decimates the

samples to eight bits, and sends the data in packets to a server computer. The server is

tasked with summing the filterbanks from two polarisations and normalising each channel,

before finally rescaling to two bits per sample and writing the data to disk (Keith et al.,

2010). Subsequent incoherent dedispersion of the filterbanks is performed by artificially

delaying and summing each frequency channel to produce dedispersed time series.
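The delay-and-sum operation just described can be sketched as a simple serial routine. This is an illustration only, not the actual survey code; the names dedisperse, filterbank and delays are ours, and the per-channel delays are assumed to have been pre-computed in whole samples from the dispersion relation.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Incoherent dedispersion by delay-and-sum (serial sketch).
    // filterbank: nchans x nsamps array of detected powers, filterbank[c][t].
    // delays:     per-channel dispersion delay expressed in whole samples.
    // Returns a dedispersed time series of length nsamps - max_delay.
    std::vector<float> dedisperse(const std::vector<std::vector<float>>& filterbank,
                                  const std::vector<std::size_t>& delays) {
        const std::size_t nchans = filterbank.size();
        const std::size_t nsamps = filterbank[0].size();
        std::size_t max_delay = 0;
        for (std::size_t d : delays) max_delay = std::max(max_delay, d);

        std::vector<float> timeseries(nsamps - max_delay, 0.0f);
        for (std::size_t t = 0; t < timeseries.size(); ++t) {
            float sum = 0.0f;
            for (std::size_t c = 0; c < nchans; ++c) {
                sum += filterbank[c][t + delays[c]];  // shift channel c by its delay
            }
            timeseries[t] = sum;
        }
        return timeseries;
    }

In a survey search this routine would be repeated for every trial DM, which is the source of the large computational cost discussed later in this chapter.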

Several different techniques are used to search for pulsars, often targeting specific

classes (i.e., those shown in Fig. 1.3). The periodic nature of pulsar emission typically

makes searching for them in the Fourier domain the most sensitive option. Algorithms

for folding time series at many trial periods such as the fast folding algorithm have also

been employed in the past (Burns & Clark, 1969; Hankins & Rickett, 1975). Two main

cases exist where period-search techniques can fail: highly-accelerated pulsars in tight

binary systems, and nulling pulsars/rotating radio transients. Doppler shifting causes

accelerated pulsars to exhibit non-linear stretching and compressing of the inter-pulse

interval when observed from Earth, which can prevent the coherent addition of pulses

during a Fourier transform. A number of techniques have been developed to solve this

problem, including stretch-correction of time series and Fourier-domain matched filtering

(Johnston & Kulkarni, 1991; Ransom, 2001). Detection of rotating radio transients suffers

(by definition) from the problem of having too few pulses to produce a more significant

signal in the Fourier domain than the time domain. These objects are detected using

single-pulse search techniques that look for individual bright pulses (McLaughlin et al.,

2006).

One final issue that affects all modern radio observations is the existence of man-made

radio-frequency interference (RFI). Population growth and the explosion in the use of

wireless technologies and satellite communications has resulted in a crowded broadcast


spectrum that is increasingly difficult to escape. While radio observatories are generally

located in sparsely populated radio-quiet zones, a certain amount of RFI inevitably makes

its presence known, and has a tendency to be orders of magnitude stronger than astronom-

ical signals. In typical pulsar surveys, both periodic and impulsive RFI signals abound

in the data, overpowering all but the brightest pulsars and RRATs. Fortunately, terres-

trial signals often exhibit tell-tale signs that allow them to be identified and excised, and

many different RFI mitigation techniques have been developed over the years (see, e.g.,

Fridman & Baan 2001; Bhat et al. 2005; Kesteven et al. 2005; Floer, Winkel & Kerp 2010;

Hogden et al. 2012; Spitler et al. 2012). Two simple signs of RFI are the presence of only

narrow-band emission and the lack of a dispersion sweep across the band (in broad-band

signals). In addition to the use of these discriminators, another common approach to RFI

mitigation is to exploit coincidence information from multiple antennas or receivers, ei-

ther geographically separated and pointing at the same location on the sky (in which case

coincidence evidences an astronomical origin), or geographically co-located and pointing

at different regions on the sky^36 (in which case coincidence evidences an Earth origin).

These techniques offer very effective means of mitigating RFI at the cost of additional

computing resources, which can become significant particularly in real-time systems.
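As a rough illustration of the coincidence idea (a minimal sketch only, under the assumption of co-located beams pointing at different sky positions; real systems such as those of Kocz et al. 2012 use considerably more sophisticated criteria, and all names here are ours), a time sample might be flagged as interference when too many beams exceed a detection threshold simultaneously:

    #include <cstddef>
    #include <vector>

    // Flag time samples where at least 'min_beams' of the co-located beams exceed
    // 'threshold' at the same time; widespread coincidence suggests a terrestrial origin.
    // beams: nbeams x nsamps array of detection statistics (e.g., S/N per sample).
    std::vector<bool> coincidence_flags(const std::vector<std::vector<float>>& beams,
                                        float threshold, std::size_t min_beams) {
        const std::size_t nbeams = beams.size();
        const std::size_t nsamps = beams[0].size();
        std::vector<bool> flagged(nsamps, false);
        for (std::size_t t = 0; t < nsamps; ++t) {
            std::size_t count = 0;
            for (std::size_t b = 0; b < nbeams; ++b) {
                if (beams[b][t] > threshold) ++count;
            }
            flagged[t] = (count >= min_beams);
        }
        return flagged;
    }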

1.5.3 Pulsar astronomy and advanced architectures

Pulsar astronomy relies heavily on high-performance computation during both observa-

tions and data analysis. The process of coherent dedispersion is an example of a compu-

tationally intensive operation, involving the application of many large Fourier transforms.

This algorithm is particularly expensive at large observing bandwidths and high dispersion

measures, often making it prohibitively expensive to perform in real-time on traditional

computing hardware. However, the high performance of the fast Fourier transform (FFT)

algorithm on GPUs makes them an excellent way to accelerate this computation; a GPU-

based real-time coherent dedispersion pipeline has indeed already been deployed at the

Parkes Radio Observatory (van Straten & Bailes, 2011).

Pulsar surveys also depend on this ability to rapidly process data, being necessary

to enable the exploration of large parameter spaces. Searching for pulsars in filterbank

data requires incoherent dedispersion at many trial DMs, each of which must then be

converted into a form sensitive to the target signals (e.g., via Fourier transformation or

matched filtering) and searched independently. While this often leads to large parameter

spaces and considerable computing demands, the operations have the significant positive

36 In the case of dedicated reference antennas, they may not be pointing at the sky at all.


of exhibiting very high degrees of parallelism, from the independence between search trials

to the data-parallelism within filterbanks and time series (see Chapter 2 for discussion of

parallel algorithms). This makes them exceptionally well-suited to the massively-parallel

advanced architectures introduced in Section 1.2. This conclusion is evidenced by recent

work applying GPUs to new and ongoing pulsar surveys (Magro et al., 2011; Ait-Allal

et al., 2012).

Finally, RFI mitigation can also place heavy demands on computing resources, espe-

cially when applied in real time. Processes such as cleaning filterbanks of narrow-band and

zero-dispersion-measure signals are relatively undemanding, but methods involving coin-

cidence detection across multiple data streams can become intensive, particularly when

using non-trivial coincidence criteria (Briggs & Kocz, 2005; Kocz et al., 2012). In such

cases, the available computing power can directly influence the effectiveness of the RFI

mitigation, and, consequently, the rate of scientific progress.

The excellent match between the computational demands of pulsar astronomy and the

computational capabilities of advanced architectures motivated the selection of this field

as the target of applications investigated in the later parts of this thesis.

1.6 Thesis outline

This thesis is structured as follows. Chapter 2 motivates and presents a generalised ap-

proach to the use of advanced, many-core architectures in astronomy, describing a method-

ology based around the analysis of algorithms and demonstrating it on four well-known

applications. This methodology is then applied in Chapter 3 to guide a GPU implemen-

tation of the problem of incoherent dedispersion in pulsar astronomy. Three different

algorithms are analysed and implemented, and their performance is compared across the

CPU and GPU. Chapter 4 then builds on the results of Chapters 2 and 3 to describe

the development and deployment of a complete GPU-based fast-radio-transient detection

pipeline, concluding with early science results. Finally, Chapter 5 presents a discussion of

future directions and summarises the findings of the previous chapters.


2 Algorithm Analysis: A Generalised Approach to

Many-core Architectures for Astronomy

I know how to get four horses to pull a cart, but I don’t

know how to make 1024 chickens do it.

—Enrico Clementi

2.1 Introduction

The appearance of low-cost computational accelerators in the form of graphics processing

units (GPUs) has heralded a new era of high-performance computing (HPC) in astronomy

research, with speed-ups of an order of magnitude available even to those on the tightest

research budgets. However, while this lowering of the cost barrier to HPC represents a

significant step forward, there remains a high learning barrier accompanying the use of

these new hardware architectures. As a result of this, GPU use in astronomy to date

has largely been limited to the most computer-literate researchers working on applications

that may be considered ‘low-hanging fruit’ for parallel computing.

Inevitably, a section of the astronomy community will continue with an ad hoc ap-

proach to the adaptation of software from single-core to many-core architectures. In this

chapter, we demonstrate that there is a significant difference between current comput-

ing techniques and those required to efficiently utilise new hardware architectures such

as many-core processors, as exemplified by GPUs. These techniques will be unfamiliar

to most astronomers and will pose a challenge in terms of keeping the discipline at the

forefront of computational science. We present a practical, effective and simple methodol-

ogy for creating astronomy software whose performance scales well to present and future

many-core architectures. Our methodology is grounded in the classical computer science


field of algorithm analysis.

In Section 2.2 we introduce the key concepts in algorithm analysis, with particular

focus on the context of many-core architectures. We present four foundation algorithms,

and characterise them as we outline our algorithm analysis methodology. In Section 2.3

we demonstrate the proposed methodology by applying it to four well-known astronomy

problems, which we break down into their constituent foundation algorithms. We validate

our analysis of these problems against ad hoc many-core implementations as available in

the literature and discuss the implications of our approach for the future of computing in

astronomy in Section 2.4.

2.2 A Strategic Approach: Algorithm Analysis

Algorithm analysis, pioneered by Donald Knuth (see, e.g., Knuth 1998), is a fundamental

component of computer science—a discipline that is more about how to solve problems

than the actual implementation in code. In this work, we are not interested in the specifics

(i.e., syntax) of implementing a given astronomy algorithm with a particular programming

language or library (e.g., CUDA, OpenCL, Thrust) on a chosen computing architecture

(e.g., GPU, ClearSpeed, Cell). As Harris (2007) notes, algorithm-level optimisations are

much more important with respect to overall performance on many-core hardware (specifi-

cally GPUs) than implementation optimisations, and should be made first. We will return

to the issue of implementation in Chapter 3.

Here we present an approach to tackling the transition to many-core hardware based

on the analysis of algorithms. The purpose of this analysis is to determine the potential of

a given algorithm for a many-core architecture before any code is written. This provides

essential information about the optimal approach as well as the return on investment one

might expect for the effort of (re-)implementing a particular algorithm. Our methodology

was in part inspired by the work of Harris (2005).

Work in a similar vein has also been undertaken by Asanovic et al. (2006, 2009) who

classified parallel algorithms into 12 groups, referring to them as ‘dwarfs’. While insightful

and opportune, these dwarfs consider a wide range of parallel architectures, cover all areas

of computation (including several that are not of great relevance to astronomy) and are

limited as a resource by the coarse nature of the classification. In contrast, the approach

presented here is tailored to the parallelism offered by many-core processor architectures,

contains algorithms that appear frequently within astronomy computations, and provides

a fine-grained level of detail. Furthermore, our approach considers the fundamental con-

cerns raised by many-core architectures at a level of abstraction that avoids dealing with


hardware or software-specific details and terminology. This is in contrast to the work by

Che et al. (2008), who presented a useful but highly-targeted summary of general-purpose

programming on the NVIDIA GPU architecture.

For these reasons this work will serve as a valuable and practical resource for those

wishing to analyse the expected performance of particular astronomy algorithms on current

and future many-core architectures.

For a given astronomy problem, our methodology is as follows:

1. Outline each step in the problem.

2. Identify steps that resemble known algorithms (see below).

(a) Outlined steps may need to be further decomposed into sub-steps before a

known counterpart is recognised. Such composite steps may later be added to

the collection of known algorithms.

3. For each identified algorithm, refer to its pre-existing analysis.

(a) Where a particular step does not appear to match any known algorithm, refer

to a relevant analysis methodology to analyse the step as a custom algorithm

(see Sections 2.2.1, 2.2.2 and 2.2.3). The newly-analysed algorithm can then be

added to the collection for future reference.

4. Once analysis results have been obtained for each step, apply a global analysis to

the algorithm to obtain a complete picture of its behaviour (see Section 2.2.4).

Here we present a small collection of foundation algorithms^1 that appear in computa-

tional astronomy problems. This is motivated by the fact that complex algorithms may be

composed from simpler ones. We propose that algorithm composition provides an excellent

approach to turning the multi-core corner. Here we focus on its application to algorithm

analysis; in Chapter 4 we will show how it may also be applied to implementation method-

ologies. The algorithms are described below using a vector data structure. This is a data

structure like a Fortran or C array representing a contiguous block of memory and pro-

viding constant-time random access to individual elements2. We use the notation v[i] to

represent the ith element of a vector v.

1 Note that for these algorithms we have used naming conventions that are familiar to us but are by no means unique in the literature.

2 Here we use constant-time in the algorithmic sense, i.e., constant with respect to the size of the input data. In this context we are not concerned with hardware-specific performance factors.


Transform: Returns a vector containing the result of the application of a specified

function to every individual element of an input vector.

out[i] = f(in[i]) (2.1)

Functions of more than one variable may also be applied to multiple input vectors. Scaling

the brightness of an image (defined as a vector of pixels) is an example of a transform

operation.

Reduce: Returns the sum of every element in a vector.

out = ∑_i in[i] (2.2)

Reductions may be generalised to use any associative binary operator, e.g., product, min,

max etc. Calculating image noise is a common application of the reduce algorithm.

Gather: Retrieves values from an input vector according to a specified index mapping

and writes them to an output vector.

out[i] = in[map[i]] (2.3)

Reading a shifted or transformed subregion of an image is a common example of a gather

operation.

Interact: For each element i of an input vector, in1, sums the interaction between i

and each element j in a second input vector, in2.

out[i] = ∑_j f(in1[i], in2[j]) (2.4)

where f is a given interaction function. The best-known application of this algorithm

in astronomy is the computation of forces in a direct N-body simulation, where both

input vectors represent the system’s particles and the interaction function calculates the

gravitational force between two particles.
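For concreteness, serial reference versions of these four operations might be written as follows (a sketch only, with our own names and types; efficient many-core implementations instead assign output elements, or the nodes of a reduction tree, to parallel threads):

    #include <cstddef>
    #include <vector>

    // Transform: out[i] = f(in[i])
    template <typename T, typename F>
    std::vector<T> transform_op(const std::vector<T>& in, F f) {
        std::vector<T> out(in.size());
        for (std::size_t i = 0; i < in.size(); ++i) out[i] = f(in[i]);
        return out;
    }

    // Reduce: out = sum_i in[i] (generalises to any associative binary operator)
    template <typename T>
    T reduce_op(const std::vector<T>& in) {
        T out = T();
        for (std::size_t i = 0; i < in.size(); ++i) out = out + in[i];
        return out;
    }

    // Gather: out[i] = in[map[i]]
    template <typename T>
    std::vector<T> gather_op(const std::vector<T>& in, const std::vector<std::size_t>& map) {
        std::vector<T> out(map.size());
        for (std::size_t i = 0; i < map.size(); ++i) out[i] = in[map[i]];
        return out;
    }

    // Interact: out[i] = sum_j f(in1[i], in2[j])
    template <typename T, typename F>
    std::vector<T> interact_op(const std::vector<T>& in1, const std::vector<T>& in2, F f) {
        std::vector<T> out(in1.size(), T());
        for (std::size_t i = 0; i < in1.size(); ++i)
            for (std::size_t j = 0; j < in2.size(); ++j)
                out[i] = out[i] + f(in1[i], in2[j]);
        return out;
    }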

These four algorithms were chosen from experience with a number of computational

astronomy problems. The transform, reduce and gather operations may be referred to as

‘atoms’ in the sense that they are indivisible operations. While the interact algorithm is

technically a composition of transforms and reductions, it will be analysed as if it too was

an atom, enabling rapid analysis of problems that use the interact algorithm without the

need for further decomposition.


We now describe a number of algorithm analysis techniques that we have found to

be relevant to massively-parallel architectures. These techniques should be applied to

the individual algorithms that comprise a complete problem in order to gain a detailed

understanding of their behaviour.

2.2.1 Principle characteristics

Many-core architectures exhibit a number of characteristics that can impact strongly on

the performance of an algorithm. Here we summarise four of the most important issues

that must be considered.

Massive parallelism: To fully utilise massively-parallel architectures, algorithms

must exhibit a high level of parallel granularity, i.e., the number of required operations that

may be performed simultaneously must be large and scalable. Data-parallel algorithms,

which divide their data between parallel processors rather than (or in addition to) their

tasks, exhibit parallelism that scales with the size of their input data, making them ideal

candidates for massively-parallel architectures. However, performance may suffer when

these algorithms are executed on sets of input data that are small relative to the number

of processors in a particular many-core architecture^3.

Memory access patterns: Many-core architectures contain very high bandwidth

main memory^4 in order to ‘feed’ the large number of parallel processing units. However,

high latency (i.e., memory transfer startup) costs mean that performance depends strongly

on the pattern in which memory is accessed. In general, maintaining ‘locality of reference’

(i.e., neighbouring threads accessing similar locations in memory) is vital to achieving

good performance^5. Fig. 2.1 illustrates different levels of locality of reference.

Collisions between threads trying to read the same location in memory can also be

costly, and write-collisions must be treated using expensive atomic operations in order to

avoid conflicts between threads.

Branching: Current many-core architectures rely on single instruction multiple data

(SIMD) hardware. This means that neighbouring threads that wish to execute different

instructions must wait for each other to complete the divergent code section before ex-

ecution can continue in parallel (see Fig. 2.2). For this reason, algorithms that involve

significant branching between different threads may suffer severe performance degrada-

3 Note also that oversubscription of threads to processors is often a requirement for good performance in many-core architectures. For example, an NVIDIA GT200-class GPU may be under-utilised with an allocation of fewer than ∼10^4 parallel threads, corresponding to an oversubscription rate of around 50×.

4 Memory bandwidths on current GPUs are O(100 GB/s).

5 Locality of reference also affects performance on traditional CPU architectures, but to a lesser extent than on GPUs.


Figure 2.1 Representative memory access patterns indicating varying levels of locality of reference. Contiguous memory access is the optimal case for many-core architectures. Patterns with high locality will generally achieve good performance; those with low locality may incur severe performance penalties.


Figure 2.2 A schematic view of divergent execution within a SIMD architecture. Lines indicate the flow of instructions; white diamonds indicate branch points, where the code paths of neighbouring threads diverge. The statements on the left indicate typical corresponding source code. White space between branch points indicates a thread waiting for its neighbours to complete a divergent code section.

tion. Similar to the effects of memory access locality, performance will in general depend

on the locality of branching, i.e., the number of different code-paths taken by a group of

neighbouring threads.

Arithmetic intensity: Executing arithmetic instructions is generally much faster

than accessing memory on current many-core hardware. Algorithms performing few arith-

metic operations per memory access may become memory-bandwidth-bound; i.e., their

speed becomes limited by the rate at which memory can be accessed, rather than the

rate at which arithmetic instructions can be processed. Memory bandwidths in many-

core architectures are typically significantly higher than in CPUs, meaning that even

bandwidth-bound algorithms may exhibit strong performance; however, they will not be

able to take full advantage of the available computing power. In some cases, it may be

beneficial to re-work an algorithm entirely in order to increase its arithmetic intensity,

even at the cost of performing more numerical work in total.

For the arithmetic intensities presented in this paper, we assume an idealised cache

model in which only the first memory read of a particular piece of data is included in

the count; subsequent or parallel reads of the same data are assumed to be made from a

cache, and are not counted. The ability to achieve this behaviour in practice will depend

strongly on the memory access pattern (specifically the locality of memory accesses).


Table 2.1 Analysis of four foundation algorithms

                          Transform    Reduction      Gather      Interact
Work                      O(N)         O(N)           O(N)        O(NM)
Depth                     O(1)         O(log N)       O(1)        O(M) or O(log M)
Memory access locality    Contiguous   Contiguous     Variable    Contiguous
Arithmetic intensity      1 : 1 : α    1 : 1/N : α    1 : 1 : 0   (1 + M/N) : 1 : 2Mα

2.2.2 Complexity analysis

The complexity of an algorithm is a formal measure of its execution time given a certain

size of input. It is often used as a means of comparing the speeds of two different algorithms

that compute the same (or a similar) result. Such comparisons are critical to understanding

the relative contributions of different parts of a composite algorithm and identifying bottle-

necks.

Computational complexity is typically expressed as the total run-time, T , of an algo-

rithm as a function of the input size, N , using ‘Big O’ notation. Thus T (N) = O(N)

means a run-time that is proportional to the input size N . An algorithm with complexity

of T(N) = O(N^2) will take four times as long to run after a doubling of its input size.

While the complexity measure is traditionally used for algorithms running on serial

processors, it can be generalised to analyse parallel algorithms. One method is to introduce

a second parameter: P , the number of processors. The run-time is then expressed as a

function of both N and P . For example, an algorithm with a parallel complexity of

T(N, P) = O(N/P) will run P times faster on P processors than on a single processor

for a given input size; i.e., it exhibits perfect parallel scaling. More complex algorithms

may incur overheads when run in parallel, e.g., those requiring communication between

processors. In these cases, the parallel complexity will depend on the specifics of the target

hardware architecture.

An alternative way to express parallel complexity is using the work, W , and depth,

D, metrics first introduced formally by Blelloch (1996). Here, work measures the total

number of computational operations performed by an algorithm (or, equivalently, the

run-time on a single processor), while depth measures the longest sequence of sequentially-

dependent operations (or, equivalently, the run-time on an infinite number of processors).

The depth metric is a measure of the amount of inherent parallelism in the algorithm. A

perfectly parallel algorithm has work complexity of W (N) = O(N) and depth complexity

of D(N) = O(1), meaning all but a constant number of operations may be performed in

parallel. An algorithm with W = O(N) and D = O(logN) is highly parallel, but contains

some serial dependencies between operations that scale as a function of the input size.


Parallel algorithms with work complexities equal to those of their serial counterparts are

said to be ‘work efficient’; those that further exhibit low depth complexities are considered

to be efficient parallel algorithms. The benefit of the work/depth metrics over the parallel

run-time is that they have no dependence on the particular parallel architecture on which

the algorithm is executed, i.e., they measure properties inherent to the algorithm.

A final consideration regarding parallel algorithms is Amdahl’s law (Amdahl, 1967),

which states that the maximum possible speedup over a serial algorithm is limited by the

fraction of the parallel algorithm that cannot be (or simply is not) parallelised. Assuming

an infinite number of available processors, the run-time of the parallel part of the algorithm

will reduce to a constant, while the serial part will continue to scale with the size of the

input. In terms of the work/depth metrics, the depth of the algorithm represents the

fraction that cannot be parallelised, and the maximum theoretical speedup is given by

S_max ≈ W/D. Note the implication that the maximum speedup is actually a function of the

input size. Increasing the problem size in addition to the number of processors allows the

speedup to scale more effectively.
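As a worked example of these metrics (ours, not from the original text): a parallel reduction over N elements has W(N) = O(N) and D(N) = O(log N), so

    S_max ≈ W/D ≈ N / log2(N),

which for N = 10^6 gives a bound of roughly 10^6 / 20 ≈ 5 × 10^4. Doubling the input size raises this bound almost proportionally, in line with the remark above about scaling the problem size together with the number of processors.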

2.2.3 Analysis results

We have applied the techniques discussed in Sections 2.2.1 and 2.2.2 to the four foundation

algorithms introduced at the beginning of Section 2.2. We use the following metrics:

• Work and depth: The complexity metrics as described in Section 2.2.2.

• Memory access locality: The nature of the memory access patterns as discussed

in Section 2.2.1.

• Arithmetic intensity: Defined by the triple ratio r : w : f representing the num-

ber of read, write and function evaluation operations respectively that the algorithm

performs (normalised to the input size). The symbol α is used, where applicable, to

represent the internal arithmetic intensity of the function given to the algorithm.

The results are presented in Table 2.1. Note that this analysis is based on the most-efficient

known parallel version of each algorithm.

2.2.4 Global analysis

Once local analysis results have been obtained for each step of a problem, it is necessary

to put them together and perform a global analysis. Our methodology is as follows:


1. Determine the components of the algorithm where most of the computational work

lies by comparing work complexities. Components with similar work complexities

should receive similar attention with respect to parallelisation in order to avoid

leaving behind bottle-necks as a result of Amdahl’s Law.

2. Consider the amount of inherent parallelism in each algorithm

by observing its theoretical speedup S_max ≈ W/D.

3. Use the theoretical arithmetic intensity of each algorithm to determine the likeli-

hood of it being limited by memory bandwidth rather than instruction throughput.

The theoretical global arithmetic intensity may be obtained by comparing the total

amount of input and output data to the total amount of arithmetic work to be done

in the problem.

4. Assess the memory access patterns of each algorithm to identify the potential to

achieve peak arithmetic intensity6.

5. If particular components exhibit poor properties, consider alternative algorithms.

6. Once a set of component algorithms with good theoretical performance has been

obtained, the algorithm decomposition should provide a good starting point for an

implementation.

2.3 Application to Astronomy Algorithms

We now apply our methodology from Section 2.2 to four typical astronomy computations.

In each case, we demonstrate how to identify the steps in an outline of the problem

as foundation algorithms from our collection described at the beginning of Section 2.2.

We then use this knowledge to study the exact nature of the available parallelism and

determine the problem’s overall suitability for many-core architectures. We note that we

have deliberately chosen simple versions of the problems in order to maximise clarity and

brevity in illustrating the principles of our algorithm analysis methodology.

2.3.1 Inverse ray-shooting gravitational lensing

Introduction: Inverse ray-shooting is a numerical technique used in gravitational mi-

crolensing. Light rays are projected backwards (i.e., from the observer) through an en-

6 Studying the memory access patterns will also help to identify the optimal caching strategy if this level of optimisation is desired.


semble of lenses and on to a source-plane pixel grid. The number of rays that fall into

each pixel gives an indication of the magnification at that spatial position relative to the

case where there was no microlensing. In cosmological scenarios, the resultant maps are

used to study brightness variations in light curves of lensed quasars, providing constraints

on the physical size of the accretion disk and broad line emission regions.

The two main approaches to ray-shooting are based on either the direct calculation

of the gravitational deflection by each lens (Kayser, Refsdal & Stabell, 1986; Schneider

& Weiss, 1986, 1987) or the use of a tree hierarchy of pseudo-lenses (Wambsganss, 1990,

1999). Here, we consider the direct method.

Outline: The ray-shooting algorithm is easily divided into a number of distinct steps:

1. Obtain a collection of lenses according to a desired distribution, where each lens has

position and mass.

2. Generate a collection of rays according to a uniform distribution within a specified

2D region, where each ray is defined by its position.

3. For each ray, calculate and sum its deflection due to each lens.

4. Add each ray’s calculated deflection to its initial position to obtain its deflected

position.

5. Calculate the index of the pixel that each ray falls into.

6. Count the number of rays that fall into each pixel.

7. Output the list of pixels as the magnification map.

Analysis: To begin the analysis, we interpret the above outline as follows:

• Steps 1 and 2 may be considered transform operations that initialise the vectors of

lenses and rays.

• Step 3 is an example of the interact algorithm, where the inputs are the vectors of

rays and lenses and the interaction function calculates the deflection of a ray due to

the gravitational potential around a lens mass.

• Steps 4 and 5 apply further transforms to the collection of rays.

• Step 6 involves the generation of a histogram. As we have not already identified

this algorithm in Section 2.2, it will be necessary to analyse this step as a unique

algorithm.


According to this analysis, three basic algorithms comprise the complete technique:

transform, interact and histogram generation. Referring to Table 2.1, we see that, in the

context of a lensing simulation using N_rays rays and N_lenses lenses, the amount of work

performed by the transform and interact algorithms will be W = O(N_rays) + O(N_lenses)

and W = O(N_rays N_lenses) respectively.

We now analyse the histogram step. Considering first a serial algorithm for generating

a histogram, where each point is considered in turn and the count in its corresponding bin is

incremented, we find the work complexity to be W = O(N_rays). Without further analysis,

we compare this to those of the other component algorithms. The serial histogram and

the transform operations each perform similar work. The interact algorithm on the other

hand must, as we have seen, perform work proportional to N_rays × N_lenses. For large N_lenses

(e.g., as occurs in cosmological microlensing simulations, where N_lenses > 10^4) this step

will dominate the total work. Assuming the number of lenses is scaled with the amount

of parallel hardware, the interact step will also dominate the total run-time.

Given the dominance of the interact step, we now choose to ignore the effects of the

other steps in the problem. It should be noted, however, that in contrast to cosmological

microlensing, planetary microlensing models contain only a few lenses. In this case, the

work performed by the interact step will be similar to that of the other steps, and thus

the use of a serial histogram algorithm alongside parallel versions of all other steps would

result in a severe performance bottle-neck. Several parallel histogram algorithms exist,

but a discussion of them is beyond the scope of this work.

Returning to the analysis of the interact algorithm, we again refer to Table 2.1. Its

worst-case depth complexity indicates a maximum speedup of S_max ≈ W/D = O(N_rays), i.e.,

parallel speedup scaling perfectly up to the number of rays. The arithmetic intensity of

the algorithm scales as N_lenses and will thus be very high. Contiguous memory accesses

indicate strong potential to achieve this high arithmetic intensity. We conclude that direct

inverse ray-shooting for cosmological microlensing is an ideal candidate for an efficient

implementation on a many-core architecture.
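To illustrate how the dominant interact step (steps 3 and 4 of the outline above) might look in code, the following serial sketch sums point-lens deflections in normalised Einstein-radius units, ignoring external shear, smooth matter and softening; on a many-core device one ray would be assigned to each thread. The types and names here are ours and purely illustrative.

    #include <cstddef>
    #include <vector>

    struct Vec2 { float x, y; };
    struct Lens { Vec2 pos; float mass; };

    // Interact step of direct inverse ray-shooting: for each ray, sum the point-lens
    // deflections m * (r - r_lens) / |r - r_lens|^2 (normalised units, no softening)
    // and apply them to obtain the deflected (source-plane) position.
    void shoot_rays(std::vector<Vec2>& rays, const std::vector<Lens>& lenses) {
        for (std::size_t i = 0; i < rays.size(); ++i) {        // one thread per ray on a GPU
            float ax = 0.0f, ay = 0.0f;
            for (std::size_t j = 0; j < lenses.size(); ++j) {  // interaction function
                const float dx = rays[i].x - lenses[j].pos.x;
                const float dy = rays[i].y - lenses[j].pos.y;
                const float inv_r2 = 1.0f / (dx * dx + dy * dy);
                ax += lenses[j].mass * dx * inv_r2;
                ay += lenses[j].mass * dy * inv_r2;
            }
            rays[i].x -= ax;  // deflected position = ray position minus total deflection
            rays[i].y -= ay;
        }
    }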

2.3.2 Hogbom CLEAN

Introduction: Raw (‘dirty’) images produced by radio interferometers exhibit unwanted

artefacts as the result of the incomplete sampling of the visibility plane. These artefacts

can inhibit image analysis and should ideally be removed by deconvolution. Several dif-

ferent techniques have been developed to ‘clean’ these images. For a review, see Briggs

(1995). Here we analyse the image-based algorithm first described by Hogbom (1974). We


note that the algorithm by Clark (1980) is now the more popular choice in the astronomy

community, but point out that it is essentially an approximation to Hogbom’s algorithm

that provides increased performance at the cost of reduced accuracy.

The algorithm involves iteratively finding the brightest point in the ‘dirty image’ and

subtracting from the dirty image an image of the beam centred on and scaled by this

brightest point. The procedure continues until the brightest point in the image falls below

a prescribed threshold. While the iterative procedure must be performed sequentially, the

computations within each iteration step are performed independently for every pixel of

the images, suggesting a substantial level of parallelism. The output of the algorithm is a

series of ‘clean components’, which may be used to reconstruct a cleaned image.

Outline: The algorithm may be divided into the following steps:

1. Obtain the beam image.

2. Obtain the image to be cleaned.

3. Find the brightest point, b, the standard deviation, σ, and the mean, µ, of the image.

4. If the brightness of b is less than a prescribed threshold (e.g., |b − µ| < 3σ), go to

step 9.

5. Scale the beam image by a fraction (referred to as the ‘loop gain’) of the brightness

of b.

6. Shift the beam image to centre it over b.

7. Subtract the scaled, shifted beam image from the input image to produce a partially-

cleaned image.

8. Repeat from step 3.

9. Output the ‘clean components’.

Analysis: We decompose the outline of the Hogbom clean algorithm as follows:

• Steps 1 and 2 are simple data-loading operations, and may be thought of as trans-

forms.

• Step 3 involves a number of reduce operations over the pixels in the dirty image.

• Step 5 is a transform operation, where each pixel in the beam is multiplied by a

scale factor.


• Step 6 may be achieved in two ways, either by directly reading an offset subset

of the beam pixels, or by switching to the Fourier domain and exploiting the shift

theorem. Here we will only consider the former option, which we identify as a gather

operation.

• Step 7 is a transform operation over pixels in the dirty image.

We thus identify three basic algorithms in Hogbom clean: transform, reduce and

gather. Table 2.1 shows that the work performed by each of these algorithms will be

comparable (assuming the input and beam images are of similar pixel resolutions). This

suggests that any acceleration should be applied equally to all of the steps in order to

avoid the creation of bottle-necks.

The depth complexities of each algorithm indicate a limiting speed-up of Smax ≈ O(Npxls / log Npxls) during the reduce operations. While not quite ideal, this is still a good result.

Further, the algorithms do not exhibit high arithmetic intensity (the calculations involving

only a few subtractions and multiplies) and are thus likely to be bandwidth-bound. This

will dominate any effect the limiting speed-up may have.

The efficiency with which the algorithm will use the available memory bandwidth will

depend on the memory access patterns. The transform and reduce algorithms both make

contiguous memory accesses, and will thus achieve peak bandwidth. The gather operation

in step 6, where the beam image is shifted to centre it on a point in the input image, will

access memory in an offset but contiguous 2-dimensional block. This 2D locality suggests

the potential to achieve near-peak memory throughput.

We conclude that the Hogbom clean algorithm represents a good candidate for im-

plementation on many-core hardware, but will likely be bound by the available memory

bandwidth rather than arithmetic computing performance.
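To make the decomposition concrete, the following hedged CUDA sketch shows one iteration of the loop: a reduce to locate the brightest pixel (here via Thrust's max_element) followed by a transform that gathers the shifted beam and subtracts it. The array names, the use of Thrust, and the assumption that the beam image has the same dimensions as the dirty image (with its centre at the middle pixel) are illustrative choices, not a description of an existing code.

    #include <thrust/device_vector.h>
    #include <thrust/extrema.h>

    // Transform + gather: subtract the scaled beam, shifted so that its
    // centre lies on the current peak pixel (steps 5-7 of the outline).
    __global__ void subtract_beam(float* dirty, const float* beam,
                                  int width, int height,
                                  int peak_x, int peak_y, float scale)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;
        int bx = x - peak_x + width / 2;      // gather from the shifted beam
        int by = y - peak_y + height / 2;
        if (bx < 0 || bx >= width || by < 0 || by >= height) return;
        dirty[y * width + x] -= scale * beam[by * width + bx];
    }

    // One CLEAN iteration: reduce to find the peak, then launch the subtraction.
    void clean_iteration(thrust::device_vector<float>& dirty,
                         const thrust::device_vector<float>& beam,
                         int width, int height, float loop_gain)
    {
        // Reduce: brightest point of the dirty image (step 3 of the outline).
        thrust::device_vector<float>::iterator it =
            thrust::max_element(dirty.begin(), dirty.end());
        int   idx    = it - dirty.begin();
        float peak   = *it;                    // single device-to-host copy
        int   peak_x = idx % width;
        int   peak_y = idx / width;

        dim3 block(16, 16);
        dim3 grid((width + 15) / 16, (height + 15) / 16);
        subtract_beam<<<grid, block>>>(thrust::raw_pointer_cast(dirty.data()),
                                       thrust::raw_pointer_cast(beam.data()),
                                       width, height,
                                       peak_x, peak_y, loop_gain * peak);
    }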

2.3.3 Volume rendering

Introduction: There are a number of sources of volume data in astronomy, including

spectral cubes from radio telescopes and integral field units, as well as simulations using

adaptive mesh refinement and smoothed particle hydrodynamics techniques. Visualising

these data in physically-meaningful ways is important as an analysis tool, but even small

volumes (e.g., 2563) require large amounts of computing power to render, particularly

when real-time interactivity is desired.

Several methods exist for rendering volume data; here we analyse a direct (or brute-

force) ray-casting algorithm (Levoy, 1990). While similarities exist between ray-shooting


for microlensing (Section 2.3.1) and the volume rendering technique we describe here, they

are fundamentally different algorithms.

Outline: The algorithm may be divided into the following steps:

1. Obtain the input data cube.

2. Create a 2D grid of output pixels to be displayed.

3. Generate a corresponding grid of rays, where each is defined by a position (initially

the centre of the corresponding pixel), a direction (defined by the viewing transfor-

mation) and a colour (initially black).

4. Project each ray a small distance (the step size) along its direction.

5. Determine which volume pixel (voxel) each ray now resides in.

6. Retrieve the colour of the voxel from the data volume.

7. Use a specified transfer function to combine the voxel colour with the current ray

colour.

8. Repeat from step 4 until all rays exit the data volume.

9. Output the final ray colours as the rendered image.

Analysis: We interpret the steps in the above outline as follows:

• Steps 2 to 5 and 7 are all transform operations.

• Step 6 is a gather operation.

All steps perform work scaling with the number of output pixels, Npxls, indicating

there are no algorithmic bottle-necks and thus acceleration should be applied to the whole

algorithm equally.

Given that the number of output pixels is likely to be large and scalable, we should

expect the transforms and the gather, with their O(1) depth complexities, to parallelise

perfectly on many-core hardware.

The outer loop of the algorithm, which marches rays through the volume until they

leave its bounds, involves some branching as different rays traverse thicker or thinner parts

of the arbitrarily-oriented cube. This will have a negative impact on the performance of

the algorithm on a SIMD architecture like a GPU. However, if rays are ordered in such a

way as to maintain 2D locality between their positions, neighbouring threads will traverse


similar depths through the data cube, resulting in little divergence in their branch paths

and thus good performance on SIMD architectures.

The arithmetic intensity of each of the steps will typically be low (common trans-

fer functions can be as simple as taking the average or maximum), while the complete

algorithm requires O(NpxlsNd) memory reads, O(Npxls) memory writes and O(NpxlsNd)

function evaluations for an input data volume of side length Nd. This global arithmetic

intensity of Nd : 1 : Ndα (where α represents the arithmetic intensity of the transfer

function) indicates the algorithm is likely to remain bandwidth-bound.

The use of bandwidth will depend primarily on the memory access patterns in the

gather step (the transform operations perform ideal contiguous memory accesses). Dur-

ing each iteration of the algorithm, the rays will access an arbitrarily oriented plane of

voxels within the data volume. Such a pattern exhibits 3D spatial locality, presenting an

opportunity to cache the memory reads effectively and thus obtain near-peak bandwidth.

We conclude that the direct ray-casting volume rendering algorithm is a good candidate

for efficient implementation on many-core hardware, although, in the absence of transfer

functions with significant arithmetic intensity, the algorithm is likely to remain limited by

the available memory bandwidth.
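The following CUDA sketch illustrates the structure described above, with one thread per output pixel marching a ray through the cube. The data layout, nearest-voxel sampling, the assumption that each ray has already been advanced to its entry point, and the use of a maximum-intensity transfer function (applied to non-negative data) are all illustrative simplifications rather than a particular production renderer.

    // Gather: nearest-voxel lookup; returns a negative sentinel outside the cube.
    __device__ float fetch_voxel(const float* volume, int nd, float3 p)
    {
        int x = (int)p.x, y = (int)p.y, z = (int)p.z;
        if (x < 0 || x >= nd || y < 0 || y >= nd || z < 0 || z >= nd)
            return -1.0f;
        return volume[((size_t)z * nd + y) * nd + x];
    }

    // One thread per output pixel; steps 4-8 of the outline form the inner loop.
    __global__ void render(const float* volume, int nd,
                           const float3* ray_origins,   // entry point per pixel
                           float3 ray_dir, float step,
                           float* image, int npix)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= npix) return;
        float3 p     = ray_origins[i];
        float colour = 0.0f;                             // initially 'black'
        for (;;) {
            float v = fetch_voxel(volume, nd, p);        // gather (step 6)
            if (v < 0.0f) break;                         // ray has left the volume
            colour = fmaxf(colour, v);                   // transfer function (step 7)
            p.x += step * ray_dir.x;                     // project the ray (step 4)
            p.y += step * ray_dir.y;
            p.z += step * ray_dir.z;
        }
        image[i] = colour;
    }

Because neighbouring pixels map to neighbouring threads, preserving 2D locality in the pixel ordering is what keeps the ray-marching loop largely convergent across a SIMD group, as noted above.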

2.3.4 Pulsar time-series dedispersion

Introduction: Radio telescopes observing pulsars produce time-series data containing

the pulse signal. Due to its passage through the interstellar medium, the pulse signature

gets delayed as a function of frequency, resulting in a ‘dispersing’ of the data. The signal

can be ‘dedispersed’ by assuming a frequency-dependent delay before summing the signals

at each frequency. In the case of pulsar searching, the data are dedispersed using a number

of trial dispersion measures (DMs), from which the true DM of the signal is measured.

There are several dedispersion algorithms used in the literature, including the direct

algorithm and the tree algorithm (Taylor, 1974). Here we consider the direct method,

which simply involves delaying and summing time series for a range of DMs. The cal-

culation for each DM is entirely independent, presenting an immediate opportunity for

parallelisation. Further, each sample in the time series is operated-on individually, hinting

at additional fine-grained parallelism.

Outline: Here we describe the key steps of the algorithm:

1. Obtain a set of input time series, one per frequency channel.

2. If necessary, transpose the input data to place it into channel-major order.


3. Impose a time delay on each channel by offsetting its starting location by the number

of samples corresponding to the delay. The delay introduced into each channel is a

quadratic function of its frequency and a linear function of the dispersion measure.

4. Sum aligned samples across every channel to produce a single accumulated time

series.

5. Output the result and repeat (potentially in parallel) from step 3 for each desired

trial DM.

Analysis: We interpret the above outline of the direct dedispersion algorithm as follows:

• Step 2 involves transposing the data, which is a form of gather.

• Step 3 may be considered a set of gather operations that shift the reading location

of samples in each channel by an offset.

• Step 4 involves the summation of many time series. This is a nested operation, and

may be interpreted as either a transform, where the operation is to sum the time

sample in each channel, or a reduce, where the operation is to sum whole time series.

The algorithm therefore involves gather operations in addition to nested transforms

and reductions. For data consisting of Ns samples for each of Nc channels, each step of the

computation operates on all O(NsNc) total samples. Acceleration should thus be applied

equally to all parts of the algorithm.

According to the depth complexity listed in Table 2.1, the gather operation will paral-

lelise perfectly. The nested transform and reduce calculation may be parallelised in three

possible ways: a) by parallelising the transform, where Ns parallel threads each compute

the sum of a single time sample over every channel sequentially; b) by parallelising the

reduce, where Nc parallel threads cooperate to sum each time sample in turn; or c) by

parallelising both the transform and the reduce, where Ns×Nc parallel threads cooperate

to complete the entire computation in parallel.

Analysing these three options, we see that they have depth complexities of O(Nc),

O(Ns logNc) and O(logNc) respectively. Option (c) would appear to provide the greatest

speedup; however, it relies on using significantly more parallel processors than the other

options. It will in fact only be the better choice in the case where the number of available

parallel processors is much greater than Ns. For hardware with fewer than Ns parallel

processors, option (a) will likely prove the better choice, as it is expected to scale perfectly

up to Ns parallel threads, as opposed to the less efficient scaling of option (c). In practice,


the number of time samples Ns will generally far exceed the number of parallel processors,

and thus the algorithm can be expected to exhibit excellent parallel scaling using option

(a).

Turning now to the arithmetic intensity, we observe that the computation of a single

trial DM involves only an addition for each of the Ns×Nc total samples. This suggests the

algorithm will be limited by memory bandwidth. However, this does not take into account

the fact that we wish to compute many trial dispersion measures. The computation of

NDM trial DMs still requires only O(Ns × Nc) memory reads and writes, but performs

NDM×Ns×Nc addition operations. The theoretical global arithmetic intensity is therefore

1 : 1 : NDM. Given a typical number of trial DMs of O(100), we conclude that the

algorithm could, in theory at least, make efficient use of all available arithmetic processing

power.

The ability to achieve such a high arithmetic intensity will depend on the ability to

keep data in fast memory for the duration of many arithmetic calculations (i.e., the ability

to efficiently cache the data). This in turn will depend on the memory access patterns.

We note that in general, similar trial DMs will need to access similar areas of memory;

i.e., the problem exhibits some locality of reference. The exact memory access pattern is

non-trivial though, and a discussion of these details is outside the scope of this work.

We conclude that the pulsar dedispersion algorithm would likely perform to a high

efficiency on a many-core architecture. While it is apparent that some locality of reference

exists within the algorithm’s memory accesses, optimal arithmetic intensity is unlikely

to be observed without a thorough and problem-specific analysis of the memory access

patterns.

2.4 Discussion

The direct inverse ray-shooting method has been implemented on a GPU by Thompson

et al. (2010). They simulated systems with up to 109 lenses. Using a single GPU, they

parallelised the interaction step of the problem and obtained a speedup of O(100×) relative

to a single CPU core—a result consistent with the relative peak floating-point performance

of the two processing units7. These results validate our conclusion that the inverse ray-

shooting algorithm is very well suited to many-core architectures like GPUs.

Our conclusions regarding the pulsar dedispersion algorithm are validated by a prelim-

7 We note that Thompson et al. (2010) did not use the CPU’s Streaming SIMD Extensions, which have the potential to provide a speed increase of up to 4×. However, our conclusion regarding the efficiency of the algorithm on the GPU remains unchanged by this fact.


inary GPU implementation we have written. With only a simplistic approach to memory

caching, we have recorded a speedup of 9× over an efficient multi-core CPU code run-

ning on four cores. This result is in line with the relative peak memory bandwidth of

the two architectures, supporting the conclusions of Section 2.3.4 that, without a detailed

investigation into the memory access patterns, the problem will remain bandwidth-bound.

Some astronomy problems are well-suited to a many-core architecture, others are not.

It is important to know how to distinguish between these. In the astronomy community,

the majority of work with many-core hardware to date has focused on the implementation

or porting of specific codes perhaps best classified as ‘low-hanging fruit’. Not surprisingly,

these codes have achieved significant speed-ups, in line with the raw performance benefits

offered by their target hardware.

A more generalised use of ‘novel’ computing architectures was undertaken by Brunner,

Kindratenko & Myers (2007), who, as a case study, implemented the two-point angular cor-

relation function for cosmological galaxy clustering on two different FPGA architectures8.

While they successfully communicated the advantages offered by these new technologies,

their focus on implementation details for their FPGA hardware inhibits the ability to

generalise their findings to other architectures.

It is interesting to note that previous work has in fact identified a number of common

concerns with respect to GPU implementations of astronomy algorithms. For example,

the issues of optimal use of the memory hierarchy and underuse of available hardware for

small particle counts have been discussed in the context of the direct N-body problem

(e.g., Belleman, Bedorf & Portegies Zwart 2008). These concerns essentially correspond

to a combination of what we have referred to as memory access patterns, arithmetic

intensity and massive parallelism. While originally being discussed as implementation

issues specific to particular choices of software and hardware, our abstractions re-cast

them at the algorithm level, and allow us to consider their impact across a variety of

problems and hardware architectures.

Using algorithm analysis techniques, we now have a basis for understanding which

astronomy algorithms will benefit most from many-core processors. Those with well-

defined memory access patterns and high arithmetic intensity stand to receive the greatest

performance boost, while problems that involve a significant amount of decision-making

may struggle to take advantage of the available processing power.

For some astronomy problems, it may be important to look beyond the techniques

currently in use, as these will have been developed (and optimised) with traditional CPU

8 Field Programmable Gate Arrays are another hardware architecture exhibiting significant fine-grained parallelism, but their specific details lie outside the scope of this thesis.


architectures in mind. Avenues of research could include, for instance, using higher-order

numerical schemes (Nitadori & Makino, 2008) or choosing simplicity over efficiency by

using brute-force methods (Bate et al. submitted). Some algorithms, such as histogram

generation, do not have a single obvious parallel implementation, and may require problem-

specific input during the analysis process.

In this work, we have discussed the future of astronomy computation, highlighting the

change to many-core processing that is likely to occur in CPUs.

The shift in commodity hardware from serial to parallel processing units will funda-

mentally change the landscape of computing. While the market is already populated with

multi-core chips, it is likely that chip designs will undergo further significant changes in

the coming years. We believe that for astronomy, a generalised methodology based on the

analysis of algorithms is a prudent approach to confronting these changes—one that will

continue to be applicable across the range of hardware architectures likely to appear in

the coming years: CPUs, GPUs and beyond.

Acknowledgments

We would like to thank Amr Hassan and Matthew Bailes for useful discussions regard-

ing this chapter, and the reviewer of the corresponding paper Gilles Civario for helpful

suggestions.


3 Accelerating Incoherent Dedispersion

Any idiot can get a ten times speed-up with a GPU.

—David Barnes

3.1 Introduction

With the advent of modern telescopes and digital signal processing back-ends, the time-

resolved radio sky has become a rich source of astrophysical information. Observations

of pulsars allow us to probe the nature of neutron stars (Lattimer & Prakash 2004),

stellar populations (Bhattacharya & van den Heuvel 1991), the Galactic environment

(Gaensler et al. 2008), plasma physics and gravitational waves (Lyne et al. 2004). Of equal

significance are transient signals such as those from rotating radio transients (McLaughlin

et al., 2006) and potentially rare one-off events such as ‘Lorimer bursts’ (Lorimer et al.,

2007; Keane et al., 2011), which may correspond to previously unknown phenomena.

These observations all depend on the use of significant computing power to search for

signals within long, frequency-resolved time series.

As radiation from sources such as pulsars propagates to Earth, it is refracted by free

electrons in the interstellar medium. This interaction has the effect of delaying the signal in

a frequency-dependent manner—signals at lower frequencies are delayed more than those

at higher frequencies. Formally, the observed time delay, ∆t, between two frequencies ν1

and ν2 as a result of dispersion by the interstellar medium is given by

∆t = kDM · DM · (ν1^−2 − ν2^−2),  (3.1)

where kDM = e^2/(2π me c) = 4.148808 × 10^3 MHz^2 pc^−1 cm^3 s is the dispersion constant1 and

1 We note that the dispersion constant is commonly approximated in the literature as 1/(2.41 × 10^−4) MHz^2 pc^−1 cm^3 s.


the frequencies are in MHz. The parameter DM specifies the dispersion measure along the

line of sight in pc cm−3, and is defined as

DM ≡ ∫_0^d ne dl,  (3.2)

where ne is the electron number density (cm^−3) and d is the distance to the source (pc).
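As a quick numerical illustration of equation (3.1): for a trial DM of 100 pc cm^−3, the delay between 1500 MHz and 1400 MHz is roughly 27 ms, i.e. hundreds of native time samples at typical sampling intervals. The following minimal snippet (values chosen purely as an example) evaluates the expression directly:

    #include <cstdio>

    const double K_DM = 4.148808e3;   // MHz^2 pc^-1 cm^3 s

    // Equation (3.1): delay of the lower frequency nu1 relative to nu2 (MHz).
    double dispersion_delay(double dm, double nu1, double nu2)
    {
        return K_DM * dm * (1.0 / (nu1 * nu1) - 1.0 / (nu2 * nu2));
    }

    int main()
    {
        printf("%.4f s\n", dispersion_delay(100.0, 1400.0, 1500.0));  // ~0.0273 s
        return 0;
    }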

Once a time-varying source has been detected, its dispersion measure can be obtained

from observations of its phase as a function of frequency; this in turn allows the ap-

proximate distance to the object to be calculated via equation (3.2), assuming one has a

model for the Galactic electron density ne. When searching for new sources, however, one

does not know the distance to the object. In these cases, the dispersion measure must

be guessed prior to looking for a signal. To avoid excessive smearing of signals in the

time series, and a consequent loss of signal-to-noise, survey pipelines typically repeat the

process for many trial dispersion measures. This process is referred to as a dedispersion

transform. An example of the dedispersion transform is shown in Fig. 3.1.

Computing the dedispersion transform is a computationally expensive task: a simple

approach involves a summation across a band of, e.g., ∼ 103 frequency channels for each

of ∼ 103 (typically) dispersion measures, for each time sample. Given modern sampling

intervals of O(64µs), computing this in real-time is a challenging task, especially if the

process must be repeated for multiple beams. The prohibitive cost of real-time dedisper-

sion has traditionally necessitated that pulsar and transient survey projects use offline

processing.

In this paper we consider three ways in which computation of the dedispersion trans-

form may be accelerated, enabling real-time processing at low cost. First, in Section 3.2

we demonstrate how modern many-core computing hardware in the form of graphics pro-

cessing units [GPUs; see Chapter 1 for an introduction, also Fluke et al. (2011)] can

provide an order of magnitude more performance over a multi-core central processing

unit (CPU) when dedispersing ‘directly’. The use of GPUs for incoherent dedispersion is

not an entirely new idea. Dodson et al. (2010) introduced an implementation of such a

system as part of the CRAFT survey. Magro et al. (2011) described a similar approach

and how it may be used to construct a GPU-based real-time transient detection pipeline

for modest fractional bandwidths, demonstrating that their GPU dedisperser could outperform a generic code by two orders of magnitude. In this work we provide a thorough

analysis of both the direct incoherent dedispersion algorithm itself and the details of its



Figure 3.1 An illustration of a dispersion trail (top) and its corresponding dedispersion transform (bottom). The darkest horizontal slice in the dedispersion transform gives the correctly dedispersed time series.


implementation on GPU hardware.

In Section 3.3 we then consider the use of the ‘tree’ algorithm, a (theoretically) more

efficient means of computing the dedispersion transform. To our knowledge, this technique

has not previously been implemented on a GPU. We conclude our analysis of dedispersion

algorithms in Section 3.4 with a discussion of the ‘sub-band’ method, a derivative of the

direct method.

In section 3.5 we report accuracy and timing benchmarks for the three algorithms and

compare them to our theoretical results. Finally, we present a discussion of our results,

their implications for future pulsar and transient surveys and a comparison with previous

work in Section 3.6.

3.2 Direct Dedispersion

3.2.1 Introduction

The direct dedispersion algorithm operates by directly summing frequency channels along

a quadratic dispersion trail for each time sample and dispersion measure. In detail, the al-

gorithm computes an array of dedispersed time series D from an input dataset A according

to the following equation:

Dd,t = ∑ν^Nν Aν,t+∆t(d,ν),  (3.3)

where the subscripts d, t and ν represent dispersion measure, time sample and frequency channel respectively, and Nν is the total number of frequency channels. Note that throughout this paper we use the convention that ∑i^N means the sum over the range i = 0 to

i = N − 1. The function ∆t(d, ν) is a discretized version of equation (3.1) and gives the

time delay relative to the start of the band in whole time samples for a given dispersion

measure and frequency channel:

∆T (ν) ≡ (kDM/∆τ) [ 1/(ν0 + ν∆ν)^2 − 1/ν0^2 ],  (3.4)

∆t(d, ν) ≡ round( DM(d) ∆T (ν) ),  (3.5)

where ∆τ is the time difference in seconds between two adjacent samples (the sampling

interval), ν0 is the frequency in MHz at the start of the band, ∆ν is the frequency difference

in MHz between two adjacent channels and the function round(x) means x rounded to the

nearest integer. The function or array DM(d) is used to specify the dispersion measures to

be computed. Note that the commonly-used central frequency, νc, and bandwidth, BW,


parameters are related by BW ≡ Nν∆ν and νc ≡ ν0 + BW/2.

After dedispersion, the dedispersed time series Dd,t can be searched for periodic or

transient signals.
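As a concrete fixed point for the analysis and GPU implementation that follow, a plain host-side reference implementation of equations (3.3)-(3.5) is sketched below. It is written for clarity rather than speed; the channel-major data layout and all variable names are illustrative assumptions.

    #include <cmath>
    #include <vector>

    // in:  nchan channels of nt samples each (channel-major)
    // out: ndm dedispersed time series of nt_out samples each
    // dm_list holds DM(d); dT holds the per-channel delay factor of eq. (3.4).
    void dedisperse_direct_reference(const std::vector<float>& in,
                                     std::vector<float>&       out,
                                     const std::vector<double>& dm_list,
                                     const std::vector<double>& dT,
                                     int nchan, int nt, int nt_out)
    {
        const int ndm = static_cast<int>(dm_list.size());
        out.assign(static_cast<size_t>(ndm) * nt_out, 0.0f);
        for (int d = 0; d < ndm; ++d)
            for (int c = 0; c < nchan; ++c) {
                int delay = static_cast<int>(std::lround(dm_list[d] * dT[c]));  // eq (3.5)
                for (int t = 0; t < nt_out && t + delay < nt; ++t)
                    out[static_cast<size_t>(d) * nt_out + t] +=
                        in[static_cast<size_t>(c) * nt + t + delay];            // eq (3.3)
            }
    }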

When dedispersing at large DM, the dispersion of a signal can be such that it is

smeared significantly within a single frequency channel. Specifically, this occurs when the

gradient of a dispersion curve on the time-frequency grid is less than unity (i.e., beyond

the ‘diagonal’). Once this effect becomes significant, it becomes somewhat inefficient to

continue to dedisperse at the full native time resolution. One option is to reduce the time

resolution by a factor of two when the DM exceeds the diagonal by adding adjacent pairs

of time samples. This process is then repeated at 2× the diagonal, 4× etc. We refer to

this technique as ‘time-scrunching’. The use of time-scrunching will reduce the overall

computational cost, but can also slightly reduce the signal-to-noise ratio if the intrinsic

pulse width is comparable to that of the dispersion smear.
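A minimal sketch of the time-scrunching step, applied here to a single channel's time series for brevity (in practice every channel is scrunched before the higher-DM trials are computed); names are illustrative:

    #include <vector>

    // Halve the time resolution by summing adjacent pairs of samples.
    std::vector<float> scrunch_x2(const std::vector<float>& ts)
    {
        std::vector<float> out(ts.size() / 2);
        for (size_t i = 0; i < out.size(); ++i)
            out[i] = ts[2 * i] + ts[2 * i + 1];
        return out;
    }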

3.2.2 Algorithm analysis

The direct dedispersion algorithm’s summation over Nν frequency channels for each of Nt

time samples and NDM dispersion measures gives it a computational complexity of

Tdirect = O(NtNνNDM). (3.6)

The algorithm was analysed previously for many-core architectures in Chapter 2. The key

findings were:

1. the algorithm is best parallelised over the “embarrassingly parallel” dispersion-measure

(d) and time (t) dimensions, with the sum over frequency channels (ν) being per-

formed sequentially,

2. the algorithm has a very high theoretical arithmetic intensity, of the same magnitude

as the number of dispersion measures computed [typically O(100− 1000)], and

3. the memory access patterns generally exhibit reasonable locality, but their non-trivial

nature may make it difficult to achieve a high arithmetic intensity.

While overall the algorithm appears well-positioned to take advantage of massively parallel

hardware, we need to perform a deeper analysis to determine the optimal implementation

strategy. The pattern in which memory is accessed is often critical to performance on

massively-parallel architectures, so this is where we now turn our attention.


While the d dimension involves a potentially non-linear mapping of input indices to

output indices, the t dimension maintains a contiguous mapping from input to output.

This makes the t dimension suitable for efficient memory access operations via spatial

caching, where groups of adjacent parallel threads access memory all at once. This be-

haviour typically allows a majority of the available memory bandwidth to be exploited.

The remaining memory access issue is the potential use of temporal caching to increase

the arithmetic intensity of the algorithm. Dedispersion at similar DMs involves access-

ing similar regions of input data. By pre-loading a block of data into a shared cache,

many DMs could be computed before needing to return to main memory for more data.

This would increase the arithmetic intensity by a factor proportional to the size of the

shared cache, potentially providing a significant performance increase, assuming the al-

gorithm was otherwise limited by available memory bandwidth. The problem with the

direct dedispersion algorithm, however, is its non-linear memory access pattern in the d

dimension. This behaviour makes a caching scheme difficult to devise, as one must account

for threads at different DMs needing to access data at delayed times. Whether temporal

caching can truly be used effectively for the direct dedispersion algorithm will depend on

details of the implementation.

3.2.3 Implementation Notes

When discussing GPU implementations throughout this paper, we use the terms ‘Fermi’

and ‘pre-Fermi’ GPUs to mean GPUs of the NVIDIA Fermi architecture and those of older

architectures respectively. We consider both architectures in order to study the recent

evolution of GPU hardware and gain insight into the future direction of the technology.

We implemented the direct dedispersion algorithm using the C for CUDA platform2.

As suggested by the analysis in Section 3.2.2, the algorithm was parallelised over the

dispersion-measure and time dimensions, with each thread summing all Nν channels se-

quentially. During the analysis it was also noted that the algorithm’s memory access pat-

tern exhibits good spatial locality in the time dimension, with contiguous output indices

mapping to contiguous input indices. We therefore chose time as the fastest-changing

(i.e., x) thread dimension, such that reads from global memory would always be from

contiguous regions with a unit stride, maximising throughput. The DM dimension was

consequently mapped to the second (i.e., y) thread dimension.

While the memory access pattern is always contiguous, it is not always aligned. This

is a result of the delays, ∆t(d, ν), introduced in the time dimension. At all non-zero

2 http://developer.nvidia.com/object/gpucomputing.html


DMs, the majority of memory accesses will begin at arbitrary offsets with respect to the

internal alignment boundaries of the memory hardware. The consequence of this is that

GPUs that do not have built-in caching support may need to split the memory requests

into many smaller ones, significantly impacting throughput to the processors. In order

to avoid this situation, we made use of the GPU’s texture memory, which does support

automatic caching. On pre-Fermi GPU hardware, the use of texture memory resulted

in a speed-up of around 5× compared to using plain device memory, highlighting the

importance of understanding the details of an algorithm’s memory access patterns when

using these architectures. With the advent of Fermi-class GPUs, however, the situation

has improved significantly. These devices contain an L1 cache that provides many of the

advantages of using texture memory without having to explicitly refer to a special memory

area. Using texture memory on Fermi-class GPUs was slightly slower than using plain

device memory (with L1 cache enabled), as suggested in the CUDA programming guide3.

Input data with fewer bits per sample than the machine word size (currently assumed

to be 32 bits) were handled using bit-shifting and masking operations on the GPU. It was

found that a convenient format for working with the input data was to transpose the input

from time-major order to frequency-major order by whole words, leaving consecutive fre-

quency channels within each word. For example, for the case of two samples per word, the

data order would be: (Aν1,t1 Aν2,t1), (Aν1,t2 Aν2,t2), ..., (Aν3,t1 Aν4,t1), (Aν3,t2 Aν4,t2), ..., where

brackets denote data within a machine word. This format means that time delays are

always applied in units of whole words, avoiding the need to deal with intra-word delays.

The thread decomposition was written to allow the shape of the block (i.e., number

of DMs or time samples per block) to be tuned. We found that for a block size of 256

threads, optimal performance on a Fermi GPU was achieved when this was divided into

8 time samples × 32 DMs. We interpreted this result as a cache-related effect, where the

block shape determines the spread of memory locations accessed by a group of neighbour-

ing threads spread across time-DM space, and the optimum occurs when this spread is

minimised. On pre-Fermi GPUs, changing the block shape was found to have very little

impact on performance.

To minimise redundant computations, the functions DM(d) and ∆T (ν) were pre-

computed and stored in look-up tables for the given dispersion measures and frequency

channels respectively. Delays were then computed simply by retrieving values from the

two tables and evaluating equation (3.5), requiring only a single multiplication and a

rounding operation. On pre-Fermi GPUs, the table corresponding to ∆T (ν) was explic-

3 The current version of the CUDA programming guide is available for download at: http://www.nvidia.com/object/cuda_develop.html


itly stored in the GPU’s constant memory space, which provides extremely efficient access

when all threads read the same value (this is always the case for our implementation,

where frequency channels are traversed sequentially). On Fermi-generation cards, this

explicit use of the constant memory space is unnecessary—constant memory caching is

used automatically when the compiler determines it to be possible.

To amortize overheads within the GPU kernel such as index calculations, loop counters

and time-delay computations, we allowed each thread to store and sum multiple time

samples. Processing four samples per thread was found to significantly reduce the total

arithmetic cost without affecting memory throughput. Increasing this number required

more registers per thread (a finite resource), and led to diminishing returns; we found four

to be the optimal solution for our implementation.

Our implementation was written to support a channel “kill mask”, which specifies

which frequency channels should be included in the computation and which should be

skipped (e.g., to avoid radio frequency interference present within them). While our initial

approach was to apply this mask as a conditional statement [e.g., if( kill_mask[channel] ) { sum += data }], it was found that applying the mask arithmetically (e.g., sum += data * kill_mask[channel]) resulted in better performance. This is not particularly

surprising given the GPU hardware’s focus on arithmetic throughput rather than branch-

ing operations.
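Putting these choices together, the following much-simplified CUDA sketch shows the overall kernel structure: time as the fastest-changing (x) thread dimension, DM as y, the DM(d) and ∆T (ν) tables in constant memory, the kill mask applied arithmetically, and several time samples accumulated per thread. It operates on unpacked floating-point input and relies on the hardware cache rather than explicit word packing or texture fetches; the table sizes and all names are illustrative assumptions, not the thesis code.

    #define SAMPS_PER_THREAD 4
    #define MAX_DMS          2048
    #define MAX_CHANNELS     4096

    __constant__ float c_dm[MAX_DMS];           // DM(d) look-up table
    __constant__ float c_dT[MAX_CHANNELS];      // delta-T(nu) look-up table, eq (3.4)
    __constant__ int   c_killmask[MAX_CHANNELS];

    // in is channel-major with nt samples per channel; the caller guarantees
    // nt >= nt_out + max_delay and that nt_out is a multiple of SAMPS_PER_THREAD.
    __global__ void dedisperse_direct(const float* in, float* out,
                                      int nchan, int nt, int nt_out, int ndm)
    {
        int t0 = (blockIdx.x * blockDim.x + threadIdx.x) * SAMPS_PER_THREAD;
        int d  =  blockIdx.y * blockDim.y + threadIdx.y;
        if (t0 >= nt_out || d >= ndm) return;

        float sum[SAMPS_PER_THREAD] = {0.0f};
        for (int c = 0; c < nchan; ++c) {
            // Equation (3.5): delay = round(DM(d) * dT(nu)), via the two tables.
            int delay = __float2int_rn(c_dm[d] * c_dT[c]);
            const float* row = in + (size_t)c * nt + t0 + delay;
            for (int s = 0; s < SAMPS_PER_THREAD; ++s)
                sum[s] += row[s] * c_killmask[c];   // arithmetic kill mask
        }
        for (int s = 0; s < SAMPS_PER_THREAD; ++s)
            out[(size_t)d * nt_out + t0 + s] = sum[s];
    }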

Finally, we investigated the possibility of using temporal caching, as discussed in the

analysis in Section 3.2.2. Unlike most CPUs, GPUs provide a manually-managed cache

(known as shared memory on NVIDIA GPUs). This provides additional power and flexi-

bility at the cost of programming effort. We used shared memory to stage a rectangular

section of input data (i.e., of time-frequency space) in each thread block. Careful attention

was given to the amount of data cached, with additional time samples being loaded to al-

low for differences in delay across a block. The cost of development was significant, and it

remained unclear whether the caching mechanism could be made robust against a variety

of input parameters. Further, we found that the overall performance of the code was not

significantly altered by the addition of the temporal caching mechanism. We concluded

that the additional overheads involved in handling the non-linear memory access patterns

(i.e., the mapping of blocks of threads in time-DM space to memory in time-frequency

space) negated the performance benefit of staging data in the shared cache. We note,

however, that cacheing may prove beneficial when considering only low DMs (e.g., below

the diagonal), where delays vary slowly and memory access patterns remain relatively

compact.


In theory it is possible that, via careful re-packing of the input data, one could exploit

the bit-level parallelism available in modern computing hardware in addition to the thread-

level parallelism. For example, for 2-bit data, packing each 2-bit value into 8-bits would

allow four values to be summed in parallel with a single 32-bit addition instruction. In

this case, (2^8 − 1)/(2^2 − 1) = 85 additions could be performed before one risked integer overflow. To

dedisperse say 1024 channels, one could first sum blocks of 85 channels and then finish

the summation by re-packing the partial sums into a larger data type. This would achieve

efficient use of the available processing hardware, at the cost of additional implementation

complexity and overheads for re-packing and data management. We did not use this

technique in our GPU dedispersion codes, although our reference CPU code does exploit

this extra parallelism by packing four 2-bit samples into a 64-bit word before dedispersion.
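The idea is easy to see in code. The sketch below follows the 32-bit example given in the text (four 2-bit samples expanded into the four byte lanes of a 32-bit word, so that (2^8 − 1)/(2^2 − 1) = 85 words can be accumulated per lane before overflow); it is an illustration of the principle only, not the reference CPU code, which packs into 64-bit words.

    #include <cstdint>

    // Expand one byte holding four 2-bit samples into four byte lanes.
    inline uint32_t expand_2bit(uint8_t packed)
    {
        return  (uint32_t)( packed       & 0x3)
             | ((uint32_t)((packed >> 2) & 0x3) << 8)
             | ((uint32_t)((packed >> 4) & 0x3) << 16)
             | ((uint32_t)((packed >> 6) & 0x3) << 24);
    }

    // Accumulate up to 85 packed bytes; each 32-bit add sums four lanes at once.
    inline uint32_t sum_2bit_block(const uint8_t* packed, int n /* <= 85 */)
    {
        uint32_t acc = 0;
        for (int i = 0; i < n; ++i)
            acc += expand_2bit(packed[i]);
        return acc;   // lane k of the result is byte k of acc
    }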

3.3 Tree Dedispersion

3.3.1 Introduction

The tree dedispersion algorithm, first described by Taylor (1974), attempts to reduce

the complexity of the dedispersion computation from O(NtNνNDM) to O(NtNν logNν).

This significant speed-up is obtained by first regularising the problem and then exploiting

the regularity to allow repeated calculations to be shared between different DMs. While

theoretical speed-ups of O(100) are possible, in practice a number of additional overheads

arise when working with real data. These overheads, as well as its increased complexity,

have meant that the tree algorithm is rarely used in modern search pipelines. In this work

we investigate the tree algorithm in order to assess its usefulness in the age of many-core

processors.

In its most basic form, the tree dedispersion algorithm is used to compute the following:

D′d′,t = ∑ν^Nν Aν,t+∆t′(d′,ν),  (3.7)

∆t′(d′, ν) = round( d′ ν/(Nν − 1) ),  (3.8)

for d′ in the range 0 ≤ d′ < Nν . The regularisation is such that the delay function ∆t′(d, ν)

is now a linear function of ν that ranges from 0 to exactly d′ across the band. The DMs


Figure 3.2 Visualisation of the tree dedispersion algorithm. Rectangles represent frequency channels, each containing a time series going ‘into the page’. Arrows indicate the flow of data, triangles represent addition operations and circles indicate unit time delays into the page.

computed by the tree algorithm are therefore:

DM(d′) = d′/∆T (Nν − 1),  (3.9)

where the function ∆T (ν) is that given by equation (3.4).

The tree algorithm is able to evaluate equation (3.7) for d′ in the range 0 ≤ d′ < Nν

in just log2Nν steps. It achieves this feat by using a divide and conquer approach in the

same way as the well-known fast Fourier transform (FFT) algorithm. The tree algorithm


is visualised in Fig. 3.2. We define the computation at each step i as follows:

A^0_{ν,t} ≡ A_{ν,t}  (3.10)

A^{i+1}_{2ν,t} = A^i_{Φ(i,2ν),t} + A^i_{Φ(i,2ν+1), t+Θ(i,2ν)}  (3.11)

A^{i+1}_{2ν+1,t} = A^i_{Φ(i,2ν),t} + A^i_{Φ(i,2ν+1), t+Θ(i,2ν+1)}  (3.12)

D′_{d′,t} = D′_{ν,t} = A^{log2 Nν}_{ν,t}.  (3.13)

The integer function Θ(i, ν) gives the time delay for a given iteration and frequency channel

and can be defined as

Θ(i, ν) ≡ [(ν mod 2^{i+1}) + 1]/2,  (3.14)

where mod is the modulus operator, and division is taken to be truncated integer division.

The integer function Φ(i, ν), which we refer to as the ‘shuffle’ function, re-orders the indices

ν according to a pattern defined as follows:

Φ(r, ν) ≡ (ν mod 2) × r + ν/2 + (ν/(2r)) × r,  (3.15)

where the parameter r ≡ 2^i is known as the radix.
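A direct host-side transcription of the recurrence may help to make equations (3.10)-(3.15) concrete. The channel-major layout and the zero-padding applied beyond the end of the time series are illustrative choices only.

    #include <vector>

    inline int theta(int i, int nu)           // delay function, eq (3.14)
    {
        return ((nu % (1 << (i + 1))) + 1) / 2;
    }

    inline int phi(int i, int nu)             // shuffle function, eq (3.15), r = 2^i
    {
        int r = 1 << i;
        return (nu % 2) * r + nu / 2 + (nu / (2 * r)) * r;
    }

    // One step of the tree recurrence: read A_in (step i), write A_out (step i+1).
    // Both arrays are channel-major, nchan channels of nt samples each.
    void tree_step(const std::vector<float>& A_in, std::vector<float>& A_out,
                   int nchan, int nt, int i)
    {
        for (int nu = 0; nu < nchan / 2; ++nu) {
            int c0 = phi(i, 2 * nu),    c1 = phi(i, 2 * nu + 1);
            int d0 = theta(i, 2 * nu),  d1 = theta(i, 2 * nu + 1);
            for (int t = 0; t < nt; ++t) {
                float a  = A_in[c0 * nt + t];
                float b0 = (t + d0 < nt) ? A_in[c1 * nt + t + d0] : 0.0f;
                float b1 = (t + d1 < nt) ? A_in[c1 * nt + t + d1] : 0.0f;
                A_out[(2 * nu)     * nt + t] = a + b0;   // eq (3.11)
                A_out[(2 * nu + 1) * nt + t] = a + b1;   // eq (3.12)
            }
        }
    }

Applying tree_step for i = 0, 1, ..., log2(Nν) − 1, swapping A_in and A_out between calls, yields the full set of dedispersed series of equation (3.13).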

While the tree dedispersion algorithm succeeds in dramatically reducing the compu-

tational cost of dedispersion, it has a number of constraints not present in the direct

algorithm:

1. the computed dispersion trails are linear in frequency, not quadratic as found in

nature [see equation (3.1)],

2. the computed dispersion measures are constrained to those given by equation (3.9),

and

3. the number of input frequency channels Nν (and thus also the number of DMs) must

be a power of two.

Constraint (iii) is generally not a significant concern, as it is common for the number of

frequency channels to be a power of two, and blank channels can be added when this is not

the case. Constraints (i) and (ii) are more problematic, as they prevent the computation

of accurate and efficiently-distributed dispersion trails. Fortunately there are ways of

working around these limitations.

One method is to approximate the dispersion trail with piecewise linear segments by

dividing the input data into sub-bands (Manchester et al., 1996). Another approach is


to quadratically space the input frequencies by padding with blank channels as a pre-

processing step such that the second order term in the dispersion trail is removed (Manch-

ester et al., 2001). These techniques are described in the next two sections.

The piecewise linear tree method

Approximation of the quadratic dispersion curve using piecewise linear segments involves

two stages of computation. If the input data are divided into Ns sub-bands of length

N′ν = Nν/Ns,  (3.16)

with the nth sub-band starting at frequency channel

νn = nN′ν,  (3.17)

then from equation (3.7) we see that the tree dedispersion algorithm applied to each

sub-band results in the following:

Sn,d′,t = ∑ν′^N′ν Aνn+ν′, t+∆t′(d′,νn+ν′),  (3.18)

which we refer to as stage 1 of the piecewise linear tree method.

In each sub-band, we approximate the quadratic dispersion trail with a linear one. We

compute the linear DM in the nth sub-band that approximates the true DM indexed by d

as follows:

d′n(d) = ∆t(d, νn+1) − ∆t(d, νn)  (3.19)

= round( DM(d) [∆T (νn+1) − ∆T (νn)] ).  (3.20)

Applying the constraint d′n < N ′ν and noting that the greatest dispersion delay occurs at

the end of the band, we obtain a limit on the DM that the basic piecewise linear tree

algorithm can compute. This limit is commonly referred to as the ‘diagonal’ DM, as it

corresponds to a dispersion trail in the time-frequency grid with a gradient of unity:4

DM(piecewise)diag = (N′ν − 1/2) / [∆T (Nν) − ∆T (Nν − N′ν)].  (3.21)

A technique for computing larger DMs with the tree algorithm is discussed in Section 3.3.1.

4 Note that the ‘1/2’ in equation (3.21) arises from the round-to-nearest operation in equation (3.20).


The dedispersed sub-bands can now be combined to approximate the result of equation

(3.3):

Dd,t ≈ ∑n^Ns Sn, d′n(d), t+∆t′′n(d),  (3.22)

∆t′′n(d) = round( DM(d) ∑m^n [∆T (νm+1) − ∆T (νm)] ).  (3.23)

This forms stage 2 of the piecewise linear tree computation.

The use of the tree algorithm with sub-banding introduces an additional source of

smearing into the dedispersed time series as a result of approximating the quadratic dis-

persion curve with a piecewise linear one. We derive an analytic upper limit for this

smearing in Appendix A.1.

The frequency-padded tree method

An alternative approach to adapting the tree algorithm to quadratic dispersion curves is

to linearise the input data via a change of frequency coordinates. Formally, the aim is

to ‘stretch’ ∆T (ν) [equation (3.4)] to a linear function ∆T ′(ν ′) ∝ ν ′. Expanding to first

order around ν = 0, we have:

∆T′(ν′) = ∆T (0) + ν′ · d[∆T (ν)]/dν |ν=0.  (3.24)

The change of variables ν → ν ′ is then found by equating ∆T (ν) with its linear approxi-

mation, ∆T ′(ν ′), and solving for ν ′(ν), which gives

ν′ = round( (1/2)(ν0/∆ν) [ 1 − (1 + (∆ν/ν0) ν)^−2 ] ).  (3.25)

Evaluating at ν = Nν gives the total number of frequency channels in the linearised

coordinates, which determines the additional computational overhead introduced by the

procedure. Note, however, that this number must be rounded up to a power of two before

the tree dedispersion algorithm can be applied. For observations with typical effective

bandwidths and channel counts that are already a power of two, the frequency padding

technique is unlikely to require increasing the total number of channels by more than a

factor of two.

In practice, the linearisation procedure is applied by padding the frequency dimension


with blank channels such that the real channels are spaced according to equation (3.25).

Once the dispersion trails have been linearised, the tree algorithm can be applied directly.
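A small sketch of this padding (channel 'scatter') pre-processing step, using the mapping of equation (3.25), is given below; the layout and names are illustrative assumptions.

    #include <cmath>
    #include <vector>

    inline int remap_channel(int nu, double nu0, double dnu)   // eq (3.25)
    {
        double x = 1.0 + (dnu / nu0) * nu;
        return (int)std::lround(0.5 * (nu0 / dnu) * (1.0 - 1.0 / (x * x)));
    }

    // Scatter nchan channel-major input channels (nt samples each) into a
    // zero-padded array whose channel count has been rounded up to a power of two.
    std::vector<float> pad_frequencies(const std::vector<float>& in,
                                       int nchan, int nt,
                                       double nu0, double dnu, int nchan_padded)
    {
        std::vector<float> out((size_t)nchan_padded * nt, 0.0f);   // blank channels
        for (int nu = 0; nu < nchan; ++nu) {
            int nu_p = remap_channel(nu, nu0, dnu);
            for (int t = 0; t < nt; ++t)
                out[(size_t)nu_p * nt + t] = in[(size_t)nu * nt + t];
        }
        return out;
    }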

The ‘diagonal’ DM when using the frequency padding method corresponds to

DM(padded)diag = 1/∆T (1).  (3.26)

Computing larger DMs

The basic tree dedispersion algorithm computes exactly the DMs specified by equation

(3.9). In practice, however, it is often necessary to search a much larger range of dispersion

measures. Fortunately, there are techniques by which this can be achieved without having

to resort to using the direct method. The tree algorithm can be made to compute higher

DMs by first transforming the input data and then repeating the dedispersion computation.

Formally, the following sequence of operations can be used to compute an arbitrary range

of DMs:

1. Apply the tree algorithm to obtain DMs from zero to DMdiag.

2. Impose a time delay across the band.

3. Apply the tree algorithm to obtain DMs from DMdiag to 2DMdiag.

4. Increment the imposed time delay.

5. Repeat from step 2 to obtain DMs up to 2NDMdiag.

The imposed time delay is initially a simple diagonal across the band (i.e., ∆t = ν), and is

implemented by incrementing a memory stride value rather than actually shifting memory.

While this method enables dedispersion up to arbitrarily large DMs, it does not alter the

spacing of DM trials, which remains fixed as per equation (3.9).

The ‘time-scrunching’ technique, discussed in Section 3.2.1 for the direct algorithm,

can also be applied to the tree algorithm. The procedure is as follows:

1. Apply the tree algorithm to obtain DMs from zero to DMdiag.

2. Impose a time delay across the band.

3. Apply the tree algorithm to obtain DMs from DMdiag to 2DMdiag.

4. Compress (‘scrunch’) time by a factor of 2 by summing adjacent samples.

5. Impose a time delay across the band.


6. Apply the tree algorithm to obtain DMs from 2DMdiag to 4DMdiag.

7. Repeat from step 4 to obtain DMs up to 2NDMdiag.

As with the direct algorithm, the use of time-scrunching provides a performance benefit

at the cost of a minor reduction in the signal-to-noise ratio for pulses of intrinsic width

near the dispersion measure smearing time.

3.3.2 Algorithm analysis

The tree dedispersion algorithm’s computational complexity of O(NtNν logNν) breaks

down into log2Nν sequential steps, with each step involving the computation of O(NtNν)

independent new values, as seen in equations (3.10) to (3.13). Following the analysis

methodology of Chapter 2, the algorithm therefore has a depth complexity of O(logNν),

meaning it contains this many sequentially-dependent operations. Interestingly, this result

matches that of the direct algorithm, although the tree algorithm requires significantly less

total work. From a theoretical perspective, this implies that the tree algorithm contains

less inherent parallelism than the direct algorithm. In practice, however, the number

of processors will be small relative to the size of the problem (NtNν), and this reduced

inherent parallelism is unlikely to be a concern for performance except when processing

very small data-sets.

Branching (i.e., conditional statements) within an algorithm can have a significant

effect on performance when targeting GPU-like hardware (see Chapter 2). Fortunately,

the tree algorithm is inherently branch-free, with all operations involving only memory

accesses and arithmetic operations. This issue is therefore of no concern in this instance.

The arithmetic intensity of the tree algorithm is determined from the ratio of arith-

metic operations to memory operations. To process NtNν samples, the algorithm involves

NtNν log2Nν ‘delay and add’ operations, and produces NtNν samples of output. In con-

trast to the direct algorithm, where the theoretical arithmetic intensity was proportional

to the number of DMs computed, the tree algorithm requires only O(logNν) operations

per sample. This suggests that the tree algorithm may be unable to exploit GPU-like

hardware as efficiently as the direct algorithm. However, the exact arithmetic intensity

will depend on constant factors and additional arithmetic overheads, and will only become

apparent once the algorithm has been implemented. We defer discussion of these results

to Section 3.3.3.

Achieving peak arithmetic intensity requires reading input data from ‘slow memory’

into ‘fast memory’ (e.g., from disk into main memory, from main memory into cache,


from host memory into GPU memory etc.) only once, before performing all computations

within fast memory and writing the results, again just once, back to slow memory. In the

tree dedispersion algorithm, this means performing all log2Nν steps entirely within fast

memory. The feasibility of this will depend on implementation details, the discussion of

which we defer to Section 3.3.3. However, it will be useful to assume that some sub-set of

the total computation will fit within this model. We will therefore continue the analysis

of the tree algorithm under the assumption that we are computing only a (power-of-two)

subset, or block, of Bν channels.

The memory access patterns within the tree algorithm resemble those of the direct al-

gorithm (see Section 3.2.2). Time samples are always accessed contiguously, with an offset

that is essentially arbitrary. In the frequency dimension, memory is accessed according

to the shuffle function [equation (3.15)] depicted in Fig. 3.2, where at any given step of

the algorithm the frequency channels ‘interact’ in pairs, the interaction involving their

addition with different time delays.

With respect to the goal of achieving peak arithmetic intensity, the key issue for

the memory access patterns within the tree algorithm is the extent to which they remain

‘compact’. This is important because it determines the ability to operate on isolated blocks

of data independently, which is critical to prolonging the time between successive trips to

slow memory. In the frequency dimension, the computation of some local (power-of-two)

sub-set of channels Bν involves accessing only other channels within the same subset. In

this sense we can say that the memory access patterns are ‘locally compact’ in channels.

In the time dimension, however, we note that the algorithm applies compounding delays

(equivalent to offsets in whole time samples). This means that the memory access patterns

‘leak’ forward, with any local group of time samples always requiring access to the next

group. The amount by which the necessary delays ‘leak’ in time for each channel is given

by the integrated delay in that channel after Bν steps (see Fig. 3.2). The total integrated

delay across Bν channels is Bν(Bν − 1)/2, which is the number of additional values that

must be read into fast memory by the block in order to compute all log2Bν steps without

needing to return to global memory and apply a global synchronisation.

3.3.3 Implementation Notes

As with the direct algorithm, we implemented the tree algorithm on a GPU in C for

CUDA. For our first attempt, we took a simple approach where each of the log2Nν steps

in the computation was performed by a separate call to a GPU function (or kernel).

This approach is not ideal, as it is preferable to perform more computation on the device


before returning to the host (as per the discussion of arithmetic intensity in Section 3.3.2),

but was necessary in order to guarantee global synchronisation across threads on the GPU

between steps. This is a result of the lack of global synchronisation mechanisms on current

GPUs.

Between steps, the integer delay and shuffle functions [equations (3.14) and (3.15)] were

evaluated on the host and stored in look-up tables. These were then copied to constant

memory on the device prior to executing the kernel function to compute the step. The

use of constant memory ensured retrieval of these values would not be a bottle-neck to

performance during the computation of each step of the tree algorithm.

The problem was divided between threads on the GPU by allocating one thread for

every time sample and every pair of frequency channels. This meant that each thread

would compute the delayed sums between two ‘interacting’ channels according to the

pattern depicted in Fig. 3.2 for the current step.

The tree algorithm’s iterative updating behaviour requires that computations at each

step be performed ‘out-of-place’; i.e., output must be written to a memory space separate

from that of the input to avoid modifying input values before they have been used. We

achieved this effect by using a double-buffering scheme, where input and output arrays

are swapped after each step.
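For concreteness, a hedged CUDA sketch of one such step is given below: one thread per time sample and per pair of interacting channels, with the delay and shuffle tables for the current step held in constant memory, and the host swapping the two buffers between kernel calls. Names, table sizes and launch configuration are illustrative assumptions rather than the thesis code.

    #define MAX_CHANNELS 4096

    __constant__ int c_phi[MAX_CHANNELS];     // shuffle look-up table, eq (3.15)
    __constant__ int c_theta[MAX_CHANNELS];   // delay look-up table, eq (3.14)

    // Out-of-place update for step i: read A_in, write A_out (double buffering).
    __global__ void tree_step_kernel(const float* A_in, float* A_out,
                                     int nchan, int nt)
    {
        int t    = blockIdx.x * blockDim.x + threadIdx.x;   // time sample
        int pair = blockIdx.y;                              // channel-pair index
        if (t >= nt || pair >= nchan / 2) return;

        int c0 = c_phi[2 * pair],   c1 = c_phi[2 * pair + 1];
        int d0 = c_theta[2 * pair], d1 = c_theta[2 * pair + 1];

        float a  = A_in[c0 * nt + t];
        float b0 = (t + d0 < nt) ? A_in[c1 * nt + t + d0] : 0.0f;
        float b1 = (t + d1 < nt) ? A_in[c1 * nt + t + d1] : 0.0f;

        A_out[(2 * pair)     * nt + t] = a + b0;
        A_out[(2 * pair + 1) * nt + t] = a + b1;
    }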

While the algorithms differ significantly in their details, one point of consistency be-

tween the direct and tree methods is the need to apply time delays to the input data.

Therefore, just as with our implementation of the direct algorithm, the tree algorithm

requires accessing memory locations that are not aligned with internal memory bound-

aries. As such, we took the same approach as before and mapped the input data to the

GPU’s texture memory before launching the device kernel. As noted in Section 3.2.3, this

procedure is unnecessary on Fermi-generation GPUs, as their built-in caches provide the

same behaviour automatically.

After successfully implementing the tree algorithm on a GPU using a simple one-

step-per-GPU-call approach, we investigated the possibility of computing multiple steps

of the algorithm on the GPU before returning to the CPU for synchronisation. This is

possible because current GPUs, while lacking support for global thread synchronisation,

do support synchronisation across local thread groups (or blocks). These thread blocks

typically contain O(100) threads, and provide mechanisms for synchronisation and data-

sharing, both of which are required for a more efficient tree dedispersion implementation.

As discussed in Section 3.3.2, application of the tree algorithm to a block of Bν channels

× Bt time samples requires caching additional values from the next block in time. We


used blocks of Bν × Bt = 16 × 16 threads, each loading both their corresponding data

value and required additional values into shared cache. Once all values have been stored,

computation of the log2Bν = 4 steps proceeds entirely within the shared cache. Using

larger thread blocks would allow more steps to be completed within the cache; however,

the choice is constrained by the available volume of shared memory (typically around

48kB). Once the block computation is complete, subsequent steps must be computed

using the one-step-per-GPU-call approach described earlier, due to the requirement of

global synchronisations.
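A schematic of this blocked variant is given below. The tile dimensions follow the text, while the per-step combination of channels is left as a placeholder because it is defined by the delay and shuffle tables of Section 3.3.1; the channel-major data layout is an assumption of the sketch:

    #define BNU  16     // channels per thread block (B_nu)
    #define BT   16     // time samples per thread block (B_t)
    #define HALO 16     // extra time samples cached per channel; the maximum
                        // per-channel delay within a block is B_nu - 1 = 15,
                        // padded to 16 here

    __global__ void tree_block_kernel(const float* d_in, float* d_out,
                                      int nsamps, int nchans)
    {
        // Two shared-memory tiles, double-buffered within the block.
        __shared__ float tile_a[BNU][BT + HALO];
        __shared__ float tile_b[BNU][BT + HALO];

        int c = blockIdx.y * BNU + threadIdx.y;   // channel index
        int t = blockIdx.x * BT  + threadIdx.x;   // time-sample index

        // Each thread loads its own value plus halo values from the next block
        // in time, so that all log2(BNU) = 4 steps can proceed in cache.
        for (int h = threadIdx.x; h < BT + HALO; h += BT) {
            int tt = blockIdx.x * BT + h;
            tile_a[threadIdx.y][h] = (c < nchans && tt < nsamps)
                                   ? d_in[(size_t)c * nsamps + tt]  // channel-major
                                   : 0.0f;
        }
        __syncthreads();

        float (*in)[BT + HALO]  = tile_a;
        float (*out)[BT + HALO] = tile_b;
        for (int step = 0; step < 4; ++step) {
            // Placeholder: the real combination applies the delayed pairwise
            // sums between interacting channels defined by this step's
            // delay/shuffle tables; here the tile is simply copied to
            // preserve the structure.
            for (int h = threadIdx.x; h < BT + HALO; h += BT)
                out[threadIdx.y][h] = in[threadIdx.y][h];
            __syncthreads();
            float (*tmp)[BT + HALO] = in; in = out; out = tmp;
        }

        // Steps beyond log2(BNU) are handled by the one-kernel-call-per-step
        // code path described earlier.
        if (c < nchans && t < nsamps)
            d_out[(size_t)c * nsamps + t] = in[threadIdx.y][threadIdx.x];
    }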

While theory suggests that an implementation of the tree algorithm exploiting shared

memory to perform multiple steps in cache would provide a performance benefit over

a simpler implementation, in practice we were unable to achieve a net gain using this

approach. The limitations on block size imposed by the volume of shared memory, the

need to load additional data into cache and the logarithmic scaling of steps relative to

data size significantly reduce the potential speed-up, and overheads from increased code-

complexity quickly erode what remains. For this reason we reverted to the straight-forward

implementation of the tree algorithm as our final code for testing and benchmarking.

In addition to the base tree algorithm, we also implemented the sub-band method so as

to allow the computation of arbitrary dispersion measures. This was achieved by dividing

the computation into two stages. In stage 1, the first log2N′ν steps of the tree algorithm

are applied to the input data, which produces the desired Nν/N′ν tree-dedispersed sub-

bands. Stage 2 then involves applying an algorithm to combine the dedispersed time

series in different sub-bands into approximated quadratic dispersion curves according to

equation (3.22). Stage 2 was implemented on the GPU in much the same way as the direct

algorithm, with input data mapped to texture memory (on pre-Fermi GPUs) and delays

stored in look-up tables in constant device memory.

The frequency padding approach described in Section 3.3.1 was implemented by con-

structing an array large enough to hold the stretched frequency coordinates, initialising

its elements to zero, and then copying (or scattering) the input data into this array ac-

cording to equation (3.25). The results of this procedure were then fed to the basic tree

dedispersion code to produce the final set of dedispersed time series.

Because the tree algorithm involves sequentially updating the entire data-set, the data

must remain in their final format for the duration of the computation. This means that

low bit-rate data, e.g., 2-bit, must be unpacked (in a pre-processing step) into a format

that will not overflow during accumulation. This is in contrast to the direct algorithm,

where each sum is independent, and can be stored locally to each thread.
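A simple (serial) unpacking routine of the kind required is sketched below; the least-significant-bits-first ordering assumed here is format dependent:

    #include <stdint.h>
    #include <stddef.h>

    // Pre-processing sketch: unpack 2-bit samples (four per byte) into 32-bit
    // floats so that the in-place tree accumulation cannot overflow.
    void unpack_2bit(const uint8_t* packed, float* unpacked, size_t nsamples)
    {
        for (size_t i = 0; i < nsamples; ++i) {
            size_t   byte  = i / 4;          // four 2-bit samples per byte
            unsigned shift = 2u * (i % 4);   // bit offset within the byte
            unpacked[i] = (float)((packed[byte] >> shift) & 0x3u);
        }
    }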


3.4 Sub-band dedispersion

3.4.1 Introduction

Sub-band dedispersion is the name given to another technique used to compute the dedis-

persion transform. Like the tree algorithm described in Section 3.3, the sub-band algorithm

attempts to reduce the cost of the computation relative to the direct method; however,

rather than exploiting a regularisation of the dedispersion algorithm, the sub-band method

takes a simple approximation approach.

In its simplest form, the algorithm involves two processing steps. In the first, the set of

trial DMs is approximated by a reduced set of NDMnom = NDM/N′DM ‘nominal’ DMs, each

separated by N ′DM trial dispersion measures. The direct dedispersion algorithm is applied

to sub-bands of N ′ν channels to compute a dedispersed time series for each nominal DM

and sub-band. In the second step, the DM trials near each nominal value are computed

by applying the direct algorithm to the ‘miniature filterbanks’ formed by the time series

for the sub-bands at each nominal DM. These data have a reduced frequency resolution

of NSB = Nν/N′ν channels across the band. The two steps thus operate at reduced

dispersion measure and frequency resolution respectively, resulting in an overall reduction

in the computational cost.

The sub-band algorithm is implemented in the presto software suite (Ransom, 2001)

and was recently implemented on a GPU by Magro et al. (2011) (see Section 3.6.1 for a

comparison with their work). Unlike the tree algorithm, the sub-band method is able to

compute the dedispersion transform with the same flexibility as the direct method, making

its application to real observational data significantly simpler.

The approximations made by the sub-band algorithm introduce additional smearing

into the dedispersed time series. We derive an analytic upper-bound in Appendix A.2 and

show that, to first order, the smearing time tSB is proportional to the product N ′DMN′ν

[see equation (A.8)].


3.4.2 Algorithm analysis

The computational complexity of the sub-band dedispersion algorithm can be computed

by summing that of the two steps:

    T_{SB,1} = N_{SB} \cdot T_{direct}(N_t, N'_\nu, N_{DM,nom})                                    (3.27)

    T_{SB,2} = N_{DM,nom} \cdot T_{direct}(N_t, N_{SB}, N'_{DM})                                   (3.28)

    T_{SB} = T_{SB,1} + T_{SB,2}                                                                   (3.29)

           = O\left[ N_t N_{DM} N_\nu \left( \frac{1}{N'_{DM}} + \frac{1}{N'_\nu} \right) \right]  (3.30)

This result can be combined with knowledge of the smearing introduced by the algorithm

to probe the relationship between accuracy and performance. Inserting the smearing

constraint tSB ∝ N ′DMN′ν (see Section 3.4.1) into equation (3.30), we obtain a second-

order expression that is minimised at N ′DM = N ′ν ∝√tSB, which amounts to balancing

the execution time between the two steps. This result optimises the time complexity of

the algorithm, which then takes the simple form

    T'_{SB} = O\left( \frac{N_{DM} N_\nu}{\sqrt{t_{SB}}} \right)                                   (3.31)

and represents a theoretical speed-up over the direct algorithm proportional to the square

root of the introduced smearing.
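For completeness, the minimisation behind this result can be written out explicitly. Writing the first-order smearing constraint as N'_{DM} N'_\nu = c\,t_{SB} for some constant c, and keeping only the N'-dependent factors of equation (3.30),

    T_{SB} \propto \frac{1}{N'_{DM}} + \frac{1}{N'_\nu}
           = \frac{1}{N'_{DM}} + \frac{N'_{DM}}{c\, t_{SB}} ,
    \qquad
    \frac{\partial T_{SB}}{\partial N'_{DM}}
           = -\frac{1}{N'^{2}_{DM}} + \frac{1}{c\, t_{SB}} = 0
    \;\Rightarrow\;
    N'_{DM} = N'_\nu = \sqrt{c\, t_{SB}} \propto \sqrt{t_{SB}} ,

at which point the two terms, and hence the execution times of the two stages, are equal.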

The sub-band algorithm’s dependence on the direct algorithm means that it inherits

similar algorithmic behaviour. However, as with the tree method, the decrease in computa-

tional work afforded by the sub-band approach corresponds to a decrease in the arithmetic

intensity of the algorithm. This can be expected to reduce the intrinsic performance of

the two sub-band steps relative to the direct algorithm.

One further consideration for the sub-band algorithm is the additional volume of mem-

ory required to store the intermediate results produced by the first step. These data consist

of time series for each sub-band and nominal DM, giving a space complexity of

    M_{SB} = O(N_{SB} N_{DM,nom}) .                                                                (3.32)

Assuming the time complexity is optimised as in equation (3.31), the space complexity

becomes

    M'_{SB} = O\left( \frac{1}{t_{SB}} \right) ,                                                   (3.33)


which indicates that, as the permitted smearing is reduced, the memory consumption grows much faster than the execution time, scaling as the inverse of the introduced smearing rather than as the inverse of its square root. This can be expected to place a lower limit on the smearing that can be achieved in practice.

3.4.3 Implementation notes

A significant advantage of the sub-band algorithm over the tree algorithm is that it involves

little more than repeated execution of the direct algorithm. With sufficient generalisation5

of our implementation of the direct algorithm, we were able to implement the sub-band

method with just two consecutive calls to the direct dedispersion routine and the addition

of a temporary data buffer.
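Schematically, the implementation reduces to the following, where direct_dedisperse_batch stands in for our generalised direct routine and its signature is purely illustrative:

    #include <cstddef>
    #include <vector>

    // 'direct_dedisperse_batch' stands in for the generalised (batched,
    // strided) direct routine described in the text; this signature is
    // illustrative only.
    void direct_dedisperse_batch(const void* in, unsigned in_nbits,
                                 void* out, unsigned out_nbits,
                                 int nsamps, int nchans, int ndms,
                                 int nbatch);

    void subband_dedisperse(const void* filterbank, void* output,
                            int nsamps, int nchans, int ndms,
                            int nchans_sub, int ndms_sub)
    {
        int nsubbands = nchans / nchans_sub;   // N_SB     = N_nu / N'_nu
        int ndms_nom  = ndms   / ndms_sub;     // N_DM,nom = N_DM / N'_DM

        // Intermediate time series, one per (sub-band, nominal DM), stored
        // at 32 bits per sample in a temporary buffer.
        std::vector<float> temp((size_t)nsubbands * ndms_nom * nsamps);

        // Stage 1: direct dedispersion applied independently to each
        // sub-band of nchans_sub channels, over the reduced set of nominal DMs.
        direct_dedisperse_batch(filterbank, 2, temp.data(), 32,
                                nsamps, nchans_sub, ndms_nom, nsubbands);

        // Stage 2: treat the sub-band time series at each nominal DM as a
        // 'miniature filterbank' of nsubbands channels and dedisperse it over
        // the ndms_sub local DM trials (8-bit output chosen as an example).
        direct_dedisperse_batch(temp.data(), 32, output, 8,
                                nsamps, nsubbands, ndms_sub, ndms_nom);
    }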

In our implementation, the ‘intermediate’ data (i.e., the outputs of the first step)

are stored in the temporary buffer using 32 bits per sample. The second call to the

dedispersion routine then reads these values directly before writing the final output using

a desired number of bits per sample.

Experimentation showed that optimal performance occurred at a slightly different

shape and size of the thread blocks on the GPU compared to the direct algorithm (see

Section 3.2.2). The sub-band kernels operated most efficiently with 128 threads per block

divided into 16 time samples and 8 DMs. In addition, the optimal choice of the ratio

N ′ν/N′DM was found to be close to unity, which matches the theoretical result derived in

Section 3.4.2. While these parameters minimised the execution time, the sub-band kernels

were still found to perform around 40% less efficiently than the direct kernel. This result

is likely due to the reduced arithmetic intensity of the algorithm (see Section 3.4.2).

3.5 Results

3.5.1 Smearing

Our analytic upper-bounds on the increase in smearing due to use of the piecewise linear

tree algorithm [equation (A.6)] and the sub-band algorithm [equation (A.9)] are plotted

in the upper panels of Figs. 3.3 and 3.4 respectively. The reference point [W in equa-

tions (A.6) and (A.9)] was calculated using equations for the smearing during the direct

dedispersion process6 assuming an intrinsic pulse width of 40µs.

For the piecewise linear tree algorithm, the effective signal smearing at low dispersion

5The direct dedispersion routine was modified to support 'batching' (simultaneous application to several adjacent data-sets) and arbitrary strides through the input and output arrays, trial DMs and channels.

6Levin, L. 2011, priv. comm.


measure is dominated by the intrinsic pulse width, the sampling time ∆τ and the effect of

finite DM sampling. As the DM is increased, however, the effects of finite channel width

and the sub-band technique grow, and eventually become dominant. These smearing terms

both scale linearly with the dispersion measure, and so the relative contribution of the

sub-band method, µSB, tends to a constant.

The sub-band algorithm exhibits virtually constant smearing as a function of DM

due to its dependence on the DM step, which is itself chosen to maintain a fixed frac-

tional smearing. While the general trend mirrors that of the tree algorithm, the sub-band

algorithm’s smearing is typically around two orders of magnitude worse than its tree coun-

terpart.


[Figure 3.3 appears here. Upper panels plot the fractional smearing increase (µSB − 1) for DM = 4, 62 and 1000 pc cm−3; lower panels plot speed-up and observation time / compute time against sub-band size (Nν′) for the tree code on the GTX 480, Tesla C2050 and Tesla C1060, and for the direct code on the same GPUs and on the Core i7 930 with 1, 2 and 4 threads.]

Figure 3.3 Upper: Analytic upper-bound on signal degradation of a 40 µs pulse due to the piecewise linear tree algorithm as a function of the number of channels per sub-band [see equation (A.6)]. Lower: Performance results for the direct and piecewise linear tree algorithms with (a) and without (b) 'time-scrunching' applied. Benchmarks were executed on an Intel Core i7 930 quad-core CPU and NVIDIA Tesla C1060, Tesla C2050 and GeForce GTX 480 GPUs. All results correspond to operations on one minute of input data with the following observing parameters: bits/sample = 2, νc = 1381.8 MHz, BW = 400 MHz, Nν = 1024, ∆τ = 64 µs. A total of 1196 DM trials were used, spaced non-linearly in the range 0 ≤ DM < 1000 pc cm−3 (see text for details). Error bars are too small to be seen at this scale and are not plotted. Note that performance results are projected from measurements of codes performing sub-sets of the benchmark task (see text for details).


[Figure 3.4 appears here. Upper panels plot the fractional smearing increase (µSB − 1) for DM = 4 and 400 pc cm−3; lower panels plot speed-up and observation time / compute time against sub-band size (Nν′) for the sub-band code on the GTX 480, Tesla C2050 and Tesla C1060, and for the direct code on the same GPUs and on the Core i7 930 with 1, 2 and 4 threads.]

Figure 3.4 Upper: Analytic upper-bound on signal degradation of a 40 µs pulse due to the sub-band algorithm as a function of the number of channels per sub-band [see equation (A.9)]. Lower: Performance results for the direct and sub-band methods with (a) and without (b) the use of time-scrunching. See the Fig. 3.3 caption for details. Note that it was not possible to run benchmarks of the sub-band code for N′ν < 16 due to memory constraints.


3.5.2 Performance

Our codes as implemented allowed us to directly compute the following:

• Any list of DMs using the direct or sub-band algorithm with no time-scrunching,

• DMs up to the diagonal [see equation (3.21)] using the piecewise linear tree algorithm,

and

• DMs up to the diagonal [see equation (3.26)] using the frequency-padded tree algo-

rithm.

A number of timing benchmarks were run to compare the performance of the CPU to

the GPU and the direct algorithm to the tree algorithms. Input and dispersion parameters

were chosen to reflect a typical scenario as appears in modern pulsar surveys such as

the High Time Resolution Universe (HTRU) survey currently underway at the Parkes

radio telescope (Keith et al., 2010). The benchmarks involved computing the dedispersion

transform of one minute of input data with observing parameters of bits/sample = 2,

ν0 = 1581.8MHz, ∆ν = −0.39062MHz, Nν = 1024, ∆τ = 64µs. DM trials were chosen

to match those used in the HTRU survey, which were originally derived by applying an

analytic constraint on the signal-smearing due to incorrect trial DM7. The chosen set

contained 1196 trial DMs in the range 0 ≤ DM < 1000 pc cm−3 with approximately

exponential spacing.

For comparison purposes, we benchmarked a reference CPU direct dedispersion code in

addition to our GPU codes. The CPU code (named dedisperse all) is highly optimised,

and uses multiple CPU cores to compute the dedispersion transform (parallelised over

the time dimension) in addition to bit-level parallelism as described in Section 3.2.3.

dedisperse all is approximately 60× more efficient than the generic dedisperse routine

from sigproc8, but is only applicable to a limited subset of data formats.

At the time of writing, our dedispersion code-base did not include ‘full-capability’

implementations of all of the discussed algorithms. However, we were able to perform a

number of benchmarks that were sufficient to obtain accurate estimates of the performance

of complete runs. Timing measurements for our codes were projected to produce a number

of derived results representative of the complete benchmark task. The direct/sub-band

dedispersion code was able to compute the complete list of desired DMs, but was not able to

exploit time-scrunching; results for these algorithms with time scrunching were calculated

7Levin, L. 2011, priv. comm.; see Cordes & McLaughlin 2003 for a similar derivation.
8sigproc.sourceforge.net


by assuming that the computation of DMs between 2× and 4× the diagonal would proceed

twice as fast as the computation up to 2× the diagonal (as a result of there being half as

many time samples), and similarly for 4× to 8× etc. up to the maximum desired DM. A

simple code to perform the time-scrunching operation (i.e., adding adjacent time samples

to reduce the time resolution by a factor of two) was also benchmarked and factored

into the projection. For the tree codes, which were unable to compute DMs beyond the

diagonal, timing results were projected by scaling as appropriate for the computation of

the full set of desired DMs with or without time-scrunching. Individual sections of code

were timed separately to allow for different scaling behaviours.

Benchmarks were run on a variety of hardware configurations. CPU benchmarks were

run on an Intel i7 930 quad-core CPU (Hyperthreading enabled). GPU benchmarks were

run using version 3.2 of the CUDA toolkit on the pre-Fermi generation NVIDIA Tesla

C1060 and the Fermi generation NVIDIA Tesla C2050 (error-correcting memory disabled)

and GeForce GTX 480 GPUs. Hardware specifications of the GPUs’ host machines varied,

but were not considered to significantly impact performance measurements other than

the copies between host and GPU memory. Benchmarks for these copy operations were

averaged across the different machines.

Our derived performance results for the direct and piecewise linear tree codes are

plotted in the lower panels of Fig. 3.3. The performance of the frequency-padded tree

code corresponded to almost exactly half that of the piecewise linear tree code at a sub-

band size of N ′ν = 1024; these results were omitted from the plot for clarity.

Performance results for the sub-band dedispersion code are plotted in the lower panels

of Fig. 3.4 along with the results of the direct code for comparison. Due to limits on

memory use (see Section 3.4.2), benchmarks for N ′ν < 16 were not possible.

Performance was measured by inserting calls to the Unix function gettimeofday() before

and after relevant sections of code. Calls to the CUDA function cudaThreadSynchronize()

were inserted where necessary to ensure that asynchronous GPU functions had completed

their execution prior to recording the time.
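The timing harness amounts to little more than the following sketch:

    #include <sys/time.h>
    #include <cuda_runtime.h>

    // Wall-clock time in seconds via gettimeofday().
    static double wall_time(void)
    {
        struct timeval tv;
        gettimeofday(&tv, 0);
        return tv.tv_sec + 1e-6 * tv.tv_usec;
    }

    // Pattern used around each timed section: a device synchronisation
    // ensures that asynchronous GPU work has finished before each timestamp
    // is taken.
    //
    //     cudaThreadSynchronize();
    //     double t0 = wall_time();
    //     /* ...kernel launches, host<->device copies... */
    //     cudaThreadSynchronize();
    //     double elapsed = wall_time() - t0;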

Several different sections of code were timed independently. These included pre- and

post-processing steps (e.g., unpacking, transposing, scaling) and copies between host and

GPU memory (in both directions), as well as the dedispersion kernels themselves. Disk

I/O and portions of code whose execution time does not scale with the size of the input

were not timed (see Section 3.6 for a discussion of the impact of disk I/O). Timing results

represent the total execution time of all timed sections, including memory copies between

the host and the device in the case of the GPU codes.


Each benchmark was run 101 times, from which the median execution time was

chosen as the final measurement. Recorded uncertainties corresponded to the 5th and 95th

percentiles; the error bars are too small to be seen in Figs. 3.3 and 3.4 and were not

plotted.
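The reduction of the repeated timings to the reported statistics is straightforward; with 101 sorted measurements, the middle element is the median and elements 5 and 95 approximate the 5th and 95th percentiles (a small illustrative sketch, not the benchmarking script itself):

    #include <algorithm>
    #include <vector>

    // Reduce repeated timings (here 101 runs) to the reported statistics.
    void summarise_runs(std::vector<double> times,
                        double* median, double* lo, double* hi)
    {
        std::sort(times.begin(), times.end());
        size_t n = times.size();
        *median = times[n / 2];
        *lo     = times[(size_t)(0.05 * (n - 1))];   // ~5th percentile
        *hi     = times[(size_t)(0.95 * (n - 1))];   // ~95th percentile
    }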

3.6 Discussion

The lower panel of Fig. 3.3(a) shows a number of interesting performance trends. As

expected, the slowest computation speeds come from the direct dedispersion code running

on the CPU. Here, some scaling is achieved via the use of multiple cores, but the speed-up

is limited to around 2.5× when using all four. This is likely due to saturation of the

available memory bandwidth.

Looking at the corresponding results on a GPU, a large performance advantage is

clear. The GTX 480 achieves a 9× speed-up over the quad-core CPU, and even the last-

generation Tesla C1060 manages a factor of 5×. The fact that a single GPU is able to

compute the dedispersion transform in less than a third of the real observation time makes

it an attractive option for real-time detection pipelines.

A further performance boost is seen in the transition to the tree algorithm. Compu-

tation speed is projected to exceed that of the direct code for almost all choices of N ′ν ,

peaking at around 3× at N ′ν = 64. Performance is seen to scale approximately linearly

for N ′ν < 32, before peaking and then decreasing very slowly for N ′ν > 64. This behaviour

is explained by the relative contributions of the two stages of the computation. For small

N ′ν , the second, ‘sub-band combination’, stage dominates the total execution time [scaling

as O(1/N ′ν)]. At large N ′ν the execution time of the second stage becomes small relative

to the first, and scaling follows that of the basic tree algorithm [i.e., O(logN ′ν)].

The results of the sub-band algorithm in Fig. 3.4(a) also show a significant performance

advantage over the direct algorithm. The computable benchmarks start at N ′ν=16 with

around the same performance as the tree code. From there, performance rapidly increases

as the size of the sub-bands is increased, eventually tailing off around N ′ν=256 with a

speed-up of approximately 20× over the direct code. At such high speeds, the time spent

in the GPU kernel is less than the time spent transferring the data into and out of the

GPU. The significance of this effect for each of the three algorithms is given in Table 3.1.

The results discussed so far have assumed the use of the time-scrunching technique

during the dedispersion computation. If time-scrunching is not used, the projected per-

formance results change significantly [see lower panels Figs. 3.3(b) and 3.4(b)]. Without

the use of time-scrunching, the direct dedispersion codes perform around 1.6× slower, and


Table 3.1 Summary of host↔GPU memory copy times

    Code       Copy time   Fraction of total time
    Direct     0.62 s      < 5%
    Tree       1.05 s      < 30%
    Sub-band   0.62 s      10% – 65%

similar results are seen for the sub-band code. The tree codes, however, are much more

severely affected, and perform 5× slower when time-scrunching is not employed. This

striking result can be explained by the inflexibilities of the tree algorithm discussed in

Section 3.3.1. At large dispersion measure, the direct algorithm allows one to sample DM

space very thinly. The tree algorithms, however, do not—they will always compute DM

trials at a fixed spacing [see equation (3.9)]. This means that the tree algorithms are effec-

tively over-computing the problem, which leads to the erosion of their original theoretical

performance advantage. The use of time-scrunching emulates the thin DM-space sampling

of the direct code, and allows the tree codes to maintain an advantage.

While the piecewise linear tree code and the sub-band code are seen to provide signifi-

cant speed-ups over the direct code, their performance leads come at the cost of introducing

additional smearing into the dedispersed signal. Our analytic results for the magnitude

of the smearing due to the tree code (upper panels Fig. 3.3) show that for the chosen

observing parameters, the total smear is expected to increase by less than 10% for all

N ′ν ≤ 64 at a DM of 1000 pc cm−3. Given that peak performance of the tree code also

corresponded to N ′ν = 64, we conclude that this is the optimal choice of sub-band size for

such observations.

The smearing introduced by the sub-band code (upper panels Fig. 3.4) is significantly

worse, increasing the signal degradation by three orders of magnitude more than the tree

code. Here, the total smear is expected to increase by around 40% at N ′ν=16, and at

N ′ν=32 the increase in smearing reaches 300%. While these results are upper limits, it

is unlikely that sub-band sizes of more than N ′ν=32 will produce acceptable results in

practical scenarios.

In contrast to the piecewise linear code, the frequency-padded tree code showed only

a modest speed-up of around 1.5× over the direct approach due to its doubling of the

number of frequency channels. Given that the sub-band algorithm has a minimal impact

on signal quality, we conclude that the frequency-padding technique is an inferior option.

It is also important to consider the development cost of the algorithms we have dis-

cussed. While the tree code has shown both high performance and accuracy, it is also

considerably more complex than the other algorithms. The tree algorithm in its base


form, as discussed in Section 3.3.1, is much less intuitive than the direct algorithm (e.g.,

the memory access patterns in Fig. 3.2). This fact alone makes implementation more

difficult. The situation gets significantly worse when one must adapt the tree algorithm

to work in practical scenarios, with quadratic dispersion curves and arbitrary DM tri-

als. Here, the algorithm’s inflexibility makes implementation a daunting task. We note

that our own implementations are as yet incomplete. By comparison, implementation of

the direct code is relatively straightforward, and the sub-band code requires only mini-

mal changes. Development time must play a role in any decision to use one dedispersion

algorithm over another.

The three algorithms we have discussed each show relative strengths and weaknesses.

The direct algorithm makes for a relatively straightforward move to the GPU architecture

with no concerns regarding accuracy, and offers a speed-up of up to 10× over an efficient

CPU code. However, its performance is convincingly beaten by the tree and sub-band

methods. The tree method is able to provide significantly better performance with only a

minimal loss of signal quality; however, it comes with a high cost of development that may

outweigh its advantages. Finally, the sub-band method combines excellent performance

with an easy implementation, but is let down by the substantial smearing it introduces

into the dedispersed signal. The optimal choice of algorithm will therefore depend on

which factors are most important to a particular project. While there is no clear best

choice among the three different algorithms, we emphasize that between the two hardware

architectures the GPU clearly outperforms the CPU.

When comparing the use of a GPU to a CPU, it is interesting to note that our fi-

nal GPU implementation of the direct dedispersion algorithm on a Fermi-class device is,

relatively speaking, a simple code. While it was necessary in both the pre-Fermi GPU

and multi-core CPU implementations to use non-trivial optimisation techniques (e.g., tex-

ture memory, bit-packing etc.), the optimal implementation on current-generation, Fermi,

GPU hardware was also the simplest or ‘obvious’ implementation. This demonstrates how

far the (now rather misnamed) graphics processing unit has come in its ability to act as a

general-purpose processor.

In addition to the performance advantage offered by GPUs today, we expect our imple-

mentations of the dedispersion problem to scale well to future architectures with little to no

code modification. The introduction of the current generation of GPU hardware brought

with it both a significant performance increase and an equally significant reduction in

programming complexity. We expect these trends to continue when the next generation of

GPUs is released, and see a promising future for these architectures and the applications


that make use of them.

While we have only discussed single-GPU implementations of dedispersion, it would in

theory be a simple matter to make use of multiple GPUs, e.g., via time-division multiplex-

ing of the input data or allocation of a sub-set of beams to each GPU. As long as the total

execution time is dominated by the GPU dedispersion kernel, the effects of multiple GPUs

within a machine sharing resources such as CPU cycles and PCI-Express bandwidth are

expected to be negligible. However, as shown in Table 3.1, the tree and sub-band codes

are in some circumstances so efficient that host↔device memory copy times become a sig-

nificant fraction of the total run time. In these situations, the use of multiple GPUs within

a single host machine may influence the overall performance due to reduced PCI-Express

bandwidth.

Disk I/O is another factor that can contribute to the total execution time of a dedis-

persion process. Typical server-class machines have disk read/write speeds of only around

100 MB/s, while our GPU dedispersion codes are capable of producing 8-bit time series

at well over twice this rate. If dedispersion is performed in an offline fashion, where time

series are read from and written to disk before and after dedispersion, then it is likely that

disk performance will become the bottle-neck. The use of multiple GPUs within a machine

may exacerbate this effect. However, for real-time processing pipelines where data are kept

in memory between operations, the dedispersion kernel can be expected to dominate the

execution time. This is particularly important for transient search pipelines, where accel-

eration searching is not necessary and dedispersion is typically the most time-consuming

operation.

The potential impact of limited PCI-Express bandwidth or disk I/O performance high-

lights the need to remember Amdahl’s Law when considering further speed-ups in the

dedispersion codes: the achievable speed-up is limited by the largest bottle-neck. The

tree and sub-band codes are already on the verge of being dominated by the host↔device

memory copies, meaning that further optimisation of their kernels will provide diminish-

ing returns. While disk and memory bandwidths will no doubt continue to increase in

the future, we expect the ratio of arithmetic performance to memory performance to get

worse rather than better.

The application of GPUs to the problem of dedispersion has produced speed-ups of

an order of magnitude. The implications of this result for current and future surveys are

significant. Current projects often execute pulsar and transient search pipelines offline

due to limited computational resources. This results in event detections being made long

after the time of the events themselves, limiting analysis and confirmation power to what


can be gleaned from archived data alone. A real-time detection pipeline, made possible

by a GPU-powered dedispersion code, could instead trigger systems to record invaluable

baseband data during significant events, or alert other observatories to perform follow-

up observations over a range of wavelengths. Real-time detection capabilities will also

be crucial for next-generation telescopes such as the Square Kilometre Array pathfinder

programs ASKAP and MeerKAT. The use of GPUs promises significant reductions in the

set-up and running costs of real-time pulsar and transient processing pipelines, and could

be the enabling factor in the construction of ever-larger systems in the future.

3.6.1 Comparison with other work

Magro et al. (2011) recently reported on a GPU code that could achieve very high (> 100×)

speed-ups over the dedispersion routines in sigproc and presto (Ransom, 2001) whereas

our work only finds improvements of factors of 10–30 over dedisperse all. There are

two key reasons for the apparent discrepancy in speed. Firstly, the sigproc routine was

never written to optimise performance but rather to produce reliable dedispersed data

streams from a very large number of different backends. Inspection of the innermost loop

reveals a conditional test that prohibits parallelisation, and a two dimensional array that

is computationally expensive. Secondly, sigproc only produces one DM per file read,

which is very inefficient. We believe that these factors explain the very large speed-ups

reported by Magro et al. In our own benchmarks, we have found our CPU comparison

code dedisperse all to be ∼ 60× faster than sigproc. For comparison, this puts our

direct GPU code at ∼ 300× faster than sigproc when using the same Tesla C1060 model

GPU as Magro et al.

Direct comparison of our GPU results with those of Magro et al. is difficult, as the

details of the CPU code, the method of counting FLOP/s and the observing parameters

used in their performance plots are not clear. However, we have benchmarked our GPU

code on the ‘toy observation’ presented in section 5 of their paper. The execution times are

compared in Table 3.2. Magro et al. did not specify the number of bits per sample used in

their benchmark; we chose to use 8 bits/sample, but found no significant difference when

using 32 bits/sample. We found our implementation of the direct dedispersion algorithm

to be ∼ 2.3× faster than that reported in their work. Possible factors contributing to

this difference include our use of texture memory, two-dimensional thread blocks and

allocation of multiple samples per thread. The performance results of our implementation

of the sub-band dedispersion algorithm generally agree with those of Magro et al., although

the impact of the additional smearing is not quantified in their work.


Table 3.2 Timing comparisons for direct GPU dedispersion of the 'toy observation' defined in Magro et al. (2011) (νc = 610 MHz, BW = 20 MHz, Nν = 256, ∆τ = 12.8 µs, NDM = 500, 0 ≤ DM < 60 pc cm−3). All benchmarks were executed on a Tesla C1060 GPU.

    Stage            Magro et al. (2011)   This work   Ratio
    Corner turn      112 ms                7 ms        16×
    De-dispersion    4500 ms               1959 ms     2.29×
    GPU→CPU copy     220 ms                144 ms      1.52×
    Total            4832 ms               2110 ms     2.29×

In summary, we agree with Magro et al. that GPUs offer great promise in incoherent

dedispersion. The benefit over that of CPUs is, however, closer to the ratio of their

memory bandwidths (∼ 10×) than the factor of 100 reported in their paper, which relied

on comparison with a non-optimised single-threaded CPU code.

3.6.2 Code availability

We have packaged our GPU implementation of the direct incoherent dedispersion algo-

rithm into a C library that we make available to the community9. The application pro-

gramming interface (API) was modeled on that of the FFTW library10, which was found to

be a convenient fit. The library requires the NVIDIA CUDA Toolkit, but places no require-

ments on the host application, allowing easy integration into existing C/C++/Fortran etc.

codes. While the library currently uses the direct dedispersion algorithm, we may consider

adding support for a tree or sub-band algorithm in future.
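To illustrate the intended style of use, a typical call sequence creates a plan from the observing parameters, attaches a list of trial DMs and executes the transform. The identifiers and argument orders below are indicative only and may not match the library's interface exactly:

    /* Indicative sketch of an FFTW-style plan/execute sequence; names and
     * signatures are illustrative. */
    #include "dedisp.h"

    void run_dedispersion(const unsigned char* input,   /* packed filterbank  */
                          unsigned char*       output,  /* dedispersed series */
                          const float* dm_trials, int ndms, int nsamps)
    {
        dedisp_plan plan;

        /* Create a plan from the observing parameters: number of channels,
           sampling time (s), frequency of the first channel (MHz) and channel
           width (MHz); values here match the benchmark of Section 3.5.2. */
        dedisp_create_plan(&plan, 1024, 64e-6, 1581.8, -0.39062);

        /* Attach the list of trial DMs (pc cm^-3). */
        dedisp_set_dm_list(plan, dm_trials, ndms);

        /* Execute: read 2-bit input samples, write 8-bit dedispersed samples. */
        dedisp_execute(plan, nsamps, input, 2, output, 8, 0);

        dedisp_destroy_plan(plan);
    }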

3.7 Conclusions

We have analysed the direct, tree and sub-band dedispersion algorithms and found all

three to be good matches for massively-parallel computing architectures such as GPUs.

Implementations of the three algorithms were written for the current and previous gen-

erations of GPU hardware, with the more recent devices providing benefits in terms of

both performance and ease of development. Timing results showed a 9× speed-up over a

multi-core CPU when executing the direct dedispersion algorithm on a GPU. Using the

tree algorithm with a piecewise linear approximation technique results in some additional

smearing of the input signal, but was projected to provide a further 3× speed-up at a

very modest level of signal-loss. The sub-band method provides a means of obtaining

even greater speed-ups, but imposes significant additional smearing on the dedispersed

9Our library and its source code are available at: http://dedisp.googlecode.com/
10http://www.fftw.org


signal. These results have significant implications for current and future radio pulsar and

transient surveys, and promise to dramatically lower the cost barrier to the deployment

of real-time detection pipelines.

Acknowledgments

We would like to thank Lina Levin and Willem van Straten for very helpful discussions

relating to pulsar searching, Mike Keith for valuable information regarding the tree dedis-

persion algorithm, and Paul Coster for help in testing our dedispersion code. We would

also like to thank the referee Scott Ransom for his very helpful comments and suggestions

for the paper corresponding to this chapter.


4  Fast-Radio-Transient Detection in Real-Time with GPUs

The machine does not isolate man from the great problems

of nature but plunges him more deeply into them.

—Antoine de Saint-Exupery

4.1 Introduction

The sub-second transient radio sky is a poorly understood yet potentially fruitful source of

astrophysical phenomena (Cordes & McLaughlin, 2003). Over the past decade a number of

surveys have made inroads into characterising the sources that populate this domain. Re-

processing of the Parkes Multibeam Survey resulted in the discovery of new sources forming

a class of pulsars known as the rotating radio transients (RRATs) (McLaughlin et al., 2006;

Keane et al., 2010; Keane et al., 2011; Burke-Spolaor & Bailes, 2010)1. The apparent

detection of an extragalactic burst (Lorimer et al., 2007) sparked significant excitement in

the field, although it was not followed by similar success; a possible second such event was

eventually found (Keane et al., 2011), but the identification of terrestrial signals (given the

name perytons) mimicking the frequency-swept appearance of astronomical sources added

doubt to the true origin of these events (Burke-Spolaor et al., 2011; Bagchi, Cortes Nieves

& McLaughlin, 2012). While uncertainty remains, these discoveries have prompted a new

generation of wide-field surveys across nearly all of the major radio astronomy facilities.

These include Parkes Observatory (Keith et al., 2010), the Australian Square Kilometre

Array Pathfinder (ASKAP) (Macquart et al., 2010), the Effelsberg radio telescope (Barr,

2011), the Low Frequency Array (LOFAR) (Stappers et al., 2011), the Allen Telescope

1A catalogue of the known RRATs is available at http://www.as.wvu.edu/~pulsar/rratalog/


Array (Siemion et al., 2012) and the Green Bank Telescope (Boyles et al., 2012). A fast

transient survey has also been conducted at Arecibo Observatory (Deneva et al., 2009), and

studies have already been made of the potential for transient-detection at next-generation

facilities like the Square Kilometre Array (Macquart, 2011; Colegate & Clarke, 2011).

Chapter 1 introduced pulsar astronomy as a field that stands to benefit significantly

from the use of advanced computing architectures. In this chapter, we demonstrate this

potential by harnessing the power of graphics processing units (GPUs) to develop a full-

featured real-time ‘fast radio transient’ detection pipeline for the 20 cm Multibeam Re-

ceiver (Staveley-Smith et al., 1996) at Parkes Observatory. This work will demonstrate

how the use of advanced hardware architectures can enable new scientific opportunities

that reach beyond what is practical with traditional CPU-based computing to unlock new

paradigms of observation and discovery. Real-time data reduction systems have been an-

nounced for two of the above-mentioned survey projects (ASKAP: Macquart et al. 2010;

and LOFAR: Armour et al. 2011; Serylak et al. 2012), and we expect such systems to

become the standard for cutting-edge surveys in the future (see also Jones et al. 2012).

The ability to detect transient radio events as they are observed provides a number of

advantages over traditional offline processing. These include:

1. access to uncompressed data — offline processing typically requires reduction of

dynamic range and/or time resolution prior to writing data to disk or tape;

2. instant feedback on radio-frequency interference (RFI) environment — offline pro-

cessing leaves little information available during observing about the RFI environ-

ment;

3. immediate follow-up on the order of seconds — offline processing imposes a delay

between observation and detection that can be days to weeks; and

4. triggered baseband dumps — offline detections provide limited information about

events, with no opportunity to capture the corresponding high-resolution baseband

data2.

The ability to precisely characterise the effects of RFI in the search space of transient and

pulsar surveys provides significant benefits over generic metrics such as visualisations of

the bandpass and zero-dispersion-measure time series. Knowing the current ‘RFI weather’

allows observers to adapt their observing schedule based on the quality of data they are

2Baseband data are the digitised but otherwise-unprocessed voltage signals from the receiver system.


obtaining and the current observing mode. Detailed information on the properties of

incoming RFI can also aid the identification of local terrestrial sources of emission.

Point 3 is particularly important in the context of observing RRATs. These sources

are known to be very intermittent in nature, and are in some cases detectable for a total of

less than one second per day (McLaughlin et al., 2006). Hence if they are not re-observed

immediately it can be very time-consuming to find them again in their ‘on’ state. Real-

time detection provides the opportunity to continue observing potential RRAT sources

and confirm or rule out their existence during the same observing session.

The ability to raise an alert for a significant detection only seconds after it is observed

also makes possible immediate follow-up observations of the same event at lower frequen-

cies by taking advantage of the dispersion delay. This concept is discussed further in

Section 4.4.

Possibly the greatest scientific potential for real-time transient observations comes from

their ability to reactively record baseband data upon the detection of highly significant

events. Recording of Nyquist-sampled baseband information over long periods of time is

typically prohibitively expensive due to the excessive data rate (at Parkes Observatory,

this eclipses the survey data rate by almost three orders of magnitude); however, short

timespans of data can be saved to disk if they are known to contain signals of interest. If

captured, such data would provide unprecedented insight into the nature of unique events

and would likely reveal the true origins of tantalising Lorimer-burst-like detections.

The primary scientific goals of this work are a) to enable the detection and confirma-

tion of new RRATs in real-time, b) to enable characterisation and reporting of the RFI

environment during live survey observations and c) to provide the opportunity to cap-

ture baseband recordings of significant events such as giant pulses, extragalactic pulses or

Lorimer bursts.

There are two key obstacles to achieving these goals. First, the pipeline must exhibit

sufficient performance so as to maintain real-time processing using the available hardware,

ideally with a short duty cycle. And second, the pipeline must include effective RFI

mitigation in order to maintain a manageable number of false positives. This chapter

presents our approaches to these challenges and discusses their effectiveness.

The details of our software pipeline, including its implementation on GPUs, deploy-

ment at Parkes Observatory and performance measurements, are described in Section 4.2.

Section 4.3 then presents early results obtained with the system, including the detection

of a new RRAT. Finally, we discuss the system, our results and future work in Section

4.4.

Page 100: Benjamin Barsdell Thesis - Swinburne University of Technology

84 Chapter 4. Fast-Radio-Transient Detection in Real-Time with GPUs

4.2 The pipeline

The general design of our pipeline is based on the work of Burke-Spolaor et al. (2011), but

was developed specifically to exploit the power of GPUs. Dedispersion is performed using

the GPU-based code presented in Chapter 3, while data-parallel implementations of the

other algorithms comprising the pipeline were guided by the work presented in Chapter 2.

The key components of the system are depicted in Fig. 4.1. The detection pipeline

(given the name heimdall3) receives data from one receiver beam in the form of a filter-

bank containing a time series for each frequency channel. These observations are buffered

and broken into discrete sections of time to be processed in a single pass of the pipeline;

the size of the sections is chosen to balance memory constraints, GPU loading and the

delay between the generation and reporting of results. Once a complete section of data is

available, the pipeline begins by ‘cleaning’ the filterbank to remove the effects of strong

radio-frequency interference. The data are then incoherently dedispersed, baselined, nor-

malised and filtered. Following this, the processed time series are searched for strong

signals, and detections are grouped together into a list of candidate events. In the final

stage of computation, candidates from each beam are combined and checked for coinci-

dence before being recorded and displayed to the observer.

Apart from dedispersion, which is performed using an external library (see Section

4.2.2), all other stages of the pipeline are implemented on the GPU using the Thrust C++

template library (Hoberock & Bell, 2010), which is supplied as part of NVIDIA’s Compute

Unified Device Architecture (CUDA) Toolkit4. The library provides generic implementa-

tions of a number of data-parallel algorithms and allows them to be customized and com-

bined within a framework similar to the C++ Standard Template Library5. Thrust also

provides multiple back-ends, allowing code to be compiled to use the GPU (via CUDA) or

the CPU (via OpenMP or Intel’s Threading Building Blocks6). The library was chosen for

its excellent fit to the data-parallel, algorithm-centric approach to advanced architectures

motivated in Chapter 2.

4.2.1 RFI mitigation

Radio-frequency interference is the undesired (and generally unavoidable) detection of

terrestrial radio emissions by the telescope. While radio telescopes are typically located

3After the Marvel Comics character of the same name, who acts as guardian of Asgard and is known for having spotted an army of giants from a great distance.

4http://developer.nvidia.com/cuda-downloads
5See, e.g., http://www.sgi.com/tech/stl/
6http://threadingbuildingblocks.org/


[Figure 4.1 appears here: the flow-chart traces data from the receiver beam (polyphase filterbank, digitisation, addition of polarisations) into heimdall, whose stages (clean RFI, dedisperse, extract time series, remove baseline, normalise, matched filter, detect events, merge events) loop over all DM and filter trials before multibeam coincidence with candidates from the other beams, candidate classification and candidate display; the legend distinguishes FPGA, CPU and GPU operations.]

Figure 4.1 Flow-chart of the key processing operations in the pipeline. heimdall is the name of the main GPU-based pipeline implementation.


in radio-quiet zones, the inevitable existence of RFI and its tendency to be many times

stronger than astronomical sources mean that techniques must be employed to mitigate

its effects. In this section we describe our approach to RFI mitigation in the context of a

fast-transient detection pipeline.

RFI signals can generally be divided into two classes: narrow-band signals that extend

over only a small fraction of the frequency band, and broad-band signals that appear in

all channels. The two types must be detected and excised using different techniques.

While narrow-band signals are significantly diluted by integration over the band during

dedispersion (see Section 4.2.2), extremely bright samples in the filterbank can maintain a

strong presence in the dedispersed time series. Detection of these bright samples requires

an estimate of the RMS noise level in each channel as well as the underlying shape of the

bandpass, which varies as a result of filtering processes in the telescope receiver system.

In our implementation, the mean bandpass has already been removed prior to the

entry-point of the transient pipeline. We estimate the RMS noise level by randomly

selecting data from different points in time (within a two-second window) and computing

the median absolute deviation, from which the RMS is derived (see Section 4.2.4 for more

details). For performance and simplicity reasons, the code computes the ‘recursive median’

rather than the true median—each group of five consecutive values is replaced with its

median recursively until only one value remains. Tests showed that this approach retained

robust statistical behaviour while avoiding the need to use a full sort or selection algorithm

as required to compute the true median. Following this procedure, the (true) median RMS

is quickly selected from those of the random samples.

With the RMS measured, narrow-band RFI is identified as individual samples exceed-

ing a five standard deviation threshold. In our implementation, these samples are then

‘cleaned’ by replacing them with randomly chosen good samples from neighbouring fre-

quency channels. All steps of the algorithm are implemented using Thrust’s for each,

transform and reduce functions. Random sampling is performed using the default

pseudo-random number generator provided by Thrust, on a per-thread basis where nec-

essary. Note that when the number of bits per sample is small (e.g., nbits≤4), the limited

dynamic range acts to considerably reduce the potency of narrow-band signals. In such

cases there is no need to perform explicit narrow-band RFI excision.

Broad-band signals of terrestrial origin are most-easily identified through their lack of

dispersion delay across the band; i.e., broad-band RFI generally appears at a dispersion

measure (DM) of zero (see Section 4.2.2 for an introduction to dispersion). To detect these

signals, the filterbank data are integrated over the band (with no dispersion delay) and


the resulting time series is searched for peaks exceeding 5σ. An RFI mask is then derived

from the peaks and used to clean the original filterbank data by replacing bad samples

with good ones randomly chosen from nearby in time (< ±0.25 s). The band integration

is performed using the dedisp library, while the remainder of the process relies on simple

Thrust functions as with the narrow-band RFI mitigation.

To avoid losing sensitivity to sources of astronomical origin with low dispersion mea-

sures, the zero-DM cleaning procedure is limited to detections at the native time resolution

of 64 µs. Zero-DM pulses wider than this may not be excised during cleaning, allowing

them to pass through the pipeline. For this reason, a low-DM cut at 1.5 pc cm−3 [fol-

lowing Burke-Spolaor et al. (2011)] is applied during candidate classification at the end of

processing (see Section 4.2.8).

In addition to frequency characteristics, RFI can also be identified through its coinci-

dent presence in multiple beams of the Parkes Multibeam receiver, as often occurs when

a signal is observed via a side-lobe rather than boresight to a single beam (Burke-Spolaor

et al., 2011). Correlations between beams therefore provide a strong discriminator between

astronomical and terrestrial sources. The current implementation of our pipeline uses this

information at the end of processing to classify candidate events (see Section 4.2.8); how-

ever, this information could also be used prior to dedispersion to provide more confidence

in the filterbank cleaning process. We plan to integrate the use of such information in the

future. Other, more involved methods of RFI mitigation are also possible (Briggs & Kocz,

2005; Kocz et al., 2012; Spitler et al., 2012); these will also be considered in future work.

4.2.2 Incoherent dedispersion

As described in Chapter 3, interactions with free electrons in the interstellar medium cause

radio-frequency signals from astronomical sources to be delayed in time as a quadratic func-

tion of frequency. Broad-band signals thus appear as quadratic sweeps in the frequency-

time space of recorded filterbank data. The scale of the delay depends linearly on the

number of free electrons in the line of sight to the source, and is referred to as the disper-

sion measure (DM). As the distance to an undetected source is unknown, it is necessary

to integrate over the band along a number of trial dispersion measures (and subsequently

search each of the resultant time-series for signals). This process is known as dedispersion.

Due to the large number of trial dispersion measures required to comprehensively cover

the expected range [typically O(100− 1000)] and the computational expense of each inte-

gration over the band, the process of dedispersion is generally the most time-consuming

stage of a transient-detection pipeline.


Our pipeline, targeting a centre observing frequency of 1381 MHz, samples the dis-

persion measure space 0 ≤ DM ≤ 1000 pc cm−3 using 1196 non-linearly distributed trials

chosen to maintain a constant fraction of finite-sampling-induced smearing as a function

of DM. In addition, the input data are reduced in time resolution by successive factors of

two when the smearing cost of doing so falls below 15%. This improves the overall speed

of the dedispersion process by ∼ 2× without significant loss of signal.

To perform dedispersion on the GPU we used our (publicly available) software library

described in Chapter 3, requiring no more than calls to an application programming inter-

face to create and execute a dedispersion plan. The performance of the direct dedispersion

algorithm implemented by the library was found to be sufficient for the current version

of the pipeline. Use of the ‘tree’ or ‘sub-band’ dedispersion algorithms remains an option

for the future should additional performance be required at the expense of increased code

complexity or signal smearing (see Chapter 3 of this work; Magro et al. 2011).
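For reference, the core of the direct algorithm is compact enough to sketch in full. The following naive CPU version is not the library's implementation; the names and the assumption that channels are ordered from highest to lowest frequency are illustrative. It applies the quadratic delay per channel and sums over the band for a single DM trial.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Naive CPU reference of direct incoherent dedispersion for one DM trial.
    // 'data' is time-major filterbank data, data[t * nchan + c]; 'freq' holds
    // the channel centre frequencies in MHz, assumed ordered from highest to
    // lowest; 'dt' is the sampling time in seconds. Sketch only.
    std::vector<float> dedisperse_trial(const std::vector<float>& data,
                                        const std::vector<double>& freq,
                                        std::size_t nchan, std::size_t nsamp,
                                        double dt, double dm)
    {
        const double kdm = 4.148808e3;        // dispersion constant [s MHz^2 pc^-1 cm^3]
        const double fref = freq[0];          // highest frequency as the reference
        std::vector<std::size_t> shift(nchan);
        for (std::size_t c = 0; c < nchan; ++c)
            shift[c] = (std::size_t)std::llround(
                kdm * dm * (1.0 / (freq[c] * freq[c]) - 1.0 / (fref * fref)) / dt);

        std::size_t nout = nsamp - shift[nchan - 1];   // largest delay is in the last channel
        std::vector<float> ts(nout, 0.0f);
        for (std::size_t t = 0; t < nout; ++t)
            for (std::size_t c = 0; c < nchan; ++c)
                ts[t] += data[(t + shift[c]) * nchan + c];
        return ts;
    }

The GPU code of Chapter 3 restructures these loops to reuse loaded data across many DM trials; the sketch is only meant to show the operation being accelerated.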

4.2.3 Baseline removal

Due to instrumental effects in the telescope receiver system, the mean signal level can vary

slowly as a function of time. This baseline typically varies over a timescale of seconds,

and must be subtracted prior to event detection.

The baseline is easily measured by smoothing the dedispersed time series with a wide

window function. However, the process is complicated by the presence of bright impulses,

which can severely bias the baseline estimate. It is therefore necessary to use robust

statistical methods. The running median is one such method, but comes with a high

computational and/or implementation cost. In particular, the large window size (2 s

≈ 3 × 104 samples at the sampling time of 64 µs used in the HTRU survey) and the

constraint of executing in real-time meant that the running median was not a practical

solution for our pipeline.

An alternative method that proved more suitable is the clipped mean, in which the

baseline is first estimated by computing the running mean and then iteratively made more

robust by clipping outliers and re-smoothing. Tests on real data showed that a three-pass

algorithm that clipped first at 10σ and then at 3σ was sufficient to produce a robust

measurement of the baseline. However, the multi-pass nature of this algorithm resulted

in relatively slow performance compared to other parts of the pipeline. A pre-processing

step involving reduction of the time-resolution prior to baselining was attempted, but the

resulting code became complex and difficult to maintain and perfect; the algorithm also

exhibited a strong dependence on the choice of clipping parameters.


Subsequent investigation led to an alternative approach based on the recursive median

(see Section 4.2.1 for more details on this algorithm). The method involves applying the

recursive median to reduce the time resolution of the data to a value representative of

the desired smoothing length. These data are then simply linearly interpolated back to

the original time resolution to form the complete baseline. This provides a robust and

parameter-free approach to dealing with outliers as well as a very simple implementa-

tion. All operations in the baselining process were implemented within the data-parallel

paradigm using calls to Thrust’s transform function.
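A minimal host-side sketch of the reduce-then-interpolate idea follows. It substitutes a simple block-median reduction for the recursive median of Section 4.2.1 (an assumption made purely for brevity) and uses illustrative names; the pipeline performs the equivalent steps on the GPU.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Robust baseline sketch: take the median of contiguous blocks (insensitive
    // to bright impulses), then linearly interpolate the block medians back to
    // the full time resolution.
    std::vector<float> estimate_baseline(const std::vector<float>& ts, std::size_t block)
    {
        const std::size_t n = ts.size();
        const std::size_t nblocks = (n + block - 1) / block;
        std::vector<float> med(nblocks);
        for (std::size_t b = 0; b < nblocks; ++b) {
            std::size_t lo = b * block, hi = std::min(n, lo + block);
            std::vector<float> tmp(ts.begin() + lo, ts.begin() + hi);
            std::nth_element(tmp.begin(), tmp.begin() + tmp.size() / 2, tmp.end());
            med[b] = tmp[tmp.size() / 2];
        }
        // Interpolate between block centres to recover a smooth, full-length baseline.
        std::vector<float> baseline(n);
        for (std::size_t i = 0; i < n; ++i) {
            double x = (double)i / (double)block - 0.5;
            std::size_t b0 = (x <= 0.0) ? 0 : std::min(nblocks - 1, (std::size_t)x);
            std::size_t b1 = std::min(nblocks - 1, b0 + 1);
            double frac = std::max(0.0, std::min(1.0, x - (double)b0));
            baseline[i] = (float)((1.0 - frac) * med[b0] + frac * med[b1]);
        }
        return baseline;
    }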

4.2.4 Normalisation

Accurate thresholding of time series by the pipeline requires robust measurements of their

root mean square (RMS) noise level. The RMS as computed by its definition for a time

series fᵢ of N samples (with zero mean),

\[ \mathrm{RMS} \equiv \sqrt{\frac{1}{N} \sum_{i=0}^{N} f_i^2}, \qquad (4.1) \]

can be biased by the presence of non-Gaussian signals due to RFI and strong astronomical

sources. We investigated two methods of measuring the RMS that are robust against

outliers: 1) truncation of the distribution, and 2) use of (approximate) median statistics.

The first method operates by pre-truncating the distribution of values such that outliers

in the tails of the distribution are not included in the computation. The resulting RMS

estimate can be corrected for the bias this introduces by assuming it follows a truncated

normal distribution. The correction factor for this case is given by:

\[ \mathrm{RMS} = \frac{\mathrm{RMS}_{\mathrm{trunc}}}{\sqrt{1 - \gamma(t)}}, \qquad (4.2) \]

\[ \gamma(t) \equiv \frac{2t\,\phi(t)}{2\Phi(t) - 1}, \qquad (4.3) \]

where t is the signal-to-noise ratio at which the distribution was (symmetrically) trun-

cated and φ(x) and Φ(x) are the normal distribution’s probability density and cumulative

distribution functions respectively. By choosing a small value of t (e.g., t ≈ 1σ), extreme

values will have no impact on the estimated RMS. Note that the quality of the corrected

measurement was found to remain high even for t ≪ 1σ.
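A small helper implementing the correction of Eqs. (4.2) and (4.3), with Φ evaluated via the error function, might look as follows (a sketch only; names are illustrative):

    #include <cmath>

    // Correct an RMS estimated from a distribution symmetrically truncated at
    // +/- t sigma, following Eqs. (4.2) and (4.3). phi and Phi are the standard
    // normal PDF and CDF.
    double corrected_rms(double rms_trunc, double t)
    {
        const double pi  = 3.14159265358979323846;
        double phi   = std::exp(-0.5 * t * t) / std::sqrt(2.0 * pi);   // PDF
        double Phi   = 0.5 * (1.0 + std::erf(t / std::sqrt(2.0)));     // CDF
        double gamma = 2.0 * t * phi / (2.0 * Phi - 1.0);              // Eq. (4.3)
        return rms_trunc / std::sqrt(1.0 - gamma);                     // Eq. (4.2)
    }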

Truncation of the distribution is most conveniently achieved by sorting the samples

and ignoring those at the beginning and end of the sorted array. Because sorting is a


computationally expensive operation, our implementation first sub-sampled the data by a

factor of 100 to reduce the workload.

The second method employs median statistics to mitigate the effects of strong out-

liers. The median of the absolute deviations from the median [or simply median absolute

deviation (MAD)] can be used to estimate the RMS according to the efficiency factor

RMS = 1.4826 MAD. (4.4)

This constant is the reciprocal of the 75th percentile of the standard normal distribution (Φ⁻¹(0.75) ≈ 0.6745), to which the population MAD corresponds for Gaussian noise.

To avoid the cost of computing the full median over each time series, our implementa-

tion uses the recursive median (see Section 4.2.1) to approximate the MAD.

Data-parallel implementations of these techniques were constructed using a selec-

tion of operations from the Thrust library. Equation (4.1) was implemented using the

transform_reduce function, while sorting was done using Thrust's sort. The sub-sampling

step was parallelised with the use of a gather operation, and repeated application of a

transform sufficed to compute the recursive median. After testing both algorithms, the

median-based statistic was chosen for the final implementation due to its parameter-free

nature and its significantly simpler code.
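For illustration, the RMS of Eq. (4.1) reduces to a single transform_reduce call, and the MAD-based estimate to a rescaling of the (approximate) MAD. The sketch below assumes a zero-mean, baselined device array and uses illustrative names.

    #include <cmath>
    #include <thrust/device_vector.h>
    #include <thrust/functional.h>
    #include <thrust/transform_reduce.h>

    struct square_op {
        __host__ __device__ float operator()(float x) const { return x * x; }
    };

    // Eq. (4.1) as a single data-parallel call: square each (zero-mean) sample,
    // sum on the device, then take the square root on the host.
    float compute_rms(const thrust::device_vector<float>& d_ts)
    {
        float sumsq = thrust::transform_reduce(d_ts.begin(), d_ts.end(), square_op(),
                                               0.0f, thrust::plus<float>());
        return std::sqrt(sumsq / d_ts.size());
    }

    // The robust alternative simply rescales an (approximate) median absolute
    // deviation, as in Eq. (4.4).
    inline float rms_from_mad(float mad) { return 1.4826f * mad; }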

4.2.5 Matched filtering

In order to detect signals with durations longer than the sampling time dt, a collection of

matched filters is applied to each time series. The data are convolved with top-hat profiles

of duration wₙ = 2ⁿ samples for 0 ≤ n ≤ 12, and normalised by √wₙ. The filtered time

series then continue through the remainder of the pipeline.

The use of simple top-hat profiles makes a data-parallel implementation of the matched

filtering process surprisingly simple. For a time series fᵢ, the filtering operation F(fᵢ; w) can be defined as follows:

\[ F(f_i; w) = \frac{1}{\sqrt{w}} \sum_{j=i-\lfloor w/2 \rfloor}^{i+\lceil w/2 \rceil} f_j\,; \qquad \lfloor w/2 \rfloor \le i < N + 1 - \lceil w/2 \rceil \qquad (4.5) \]

\[ = \frac{1}{\sqrt{w}} \left[ \sum_{j=0}^{i+\lceil w/2 \rceil} f_j \;-\; \sum_{j=0}^{i-\lfloor w/2 \rfloor} f_j \right] \qquad (4.6) \]

\[ = \frac{1}{\sqrt{w}} \left[ \Phi_{i+\lceil w/2 \rceil} - \Phi_{i-\lfloor w/2 \rfloor} \right], \qquad (4.7) \]


where

\[ \Phi_i \equiv \sum_{j=0}^{i} f_j \qquad (4.8) \]

is the prefix-sum of the time series. The top-hat convolution can therefore be expressed

solely in terms of prefix-sum and transform operations, making a data-parallel implemen-

tation straightforward. An additional feature of the algorithm is that once Φi has been

computed [an O(N) operation], the time series can be filtered at any width in constant

time per element. This allows n filters to be applied to the time series in O(nN) time,

regardless of the width of the filters.

To further speed up this algorithm and the detection process that follows it, time

series to be filtered with wide filters are first reduced in resolution by adding adjacent

samples. The filter width at which this resolution-reduction begins is a parameter to the

pipeline, and allows trading sensitivity for performance; the current setting is to reduce the

resolution prior to applying filters greater than 2³ samples wide (meaning that 2ⁿ-sample filtering, where n > 3, is applied by adding 2ⁿ⁻³ adjacent samples and then applying a 2³-element filter).

The data-parallel implementation of the matched filtering process was constructed

using Thrust’s exclusive scan for the prefix sum computation and transform for the

differencing and normalisation operations. Resolution-reduction was performed simply

by striding through the prefix-sum array by the number of samples to combine, and was

implemented using Thrust’s ‘fancy iterators’7.

4.2.6 Event detection

The final step in processing the collection of time series is to identify and extract significant

signals. Signals are considered significant if they exceed a certain threshold, which in our

pipeline is set to six times the RMS noise level (6σ). Identifying samples that exceed this

threshold is a trivial matter; however, the process is complicated by the fact that for a

single event of finite duration, many neighbouring samples may exceed the threshold. In

order to correctly identify such a case as a single extended event, rather than a myriad of

single-sample signals, threshold-exceeding samples separated by only a small number of

bins (currently three) are classified as belonging to the same event.

Once groups, or ‘islands’, of threshold-exceeding samples have been identified, they

are converted to individual events by finding the value and time of their maximum. In

7http://thrust.github.com/doc/group__fancyiterator.html


addition to these properties, the time of the first and last samples comprising the event

are extracted and stored.

A data-parallel implementation of the event-detection process involves algorithms that

are more complicated than those that have been mentioned thus far. The first step is to

extract samples from the time series that exceed the detection threshold. This operation

is known as a ‘stream compaction’, and can be performed using a data-parallel implemen-

tation of an algorithm such as 'copy_if'. Specifically, when a sample exceeds the threshold,

we copy both it and its array index to a new memory location. This allows the rest of the

event detection process to operate only on samples that exceeded the threshold, without

losing information on the arrival time of each sample.

Once the significant samples have been isolated, we wish to identify those samples

comprising temporally-isolated events. As the array index of each sample was retained,

identifying significant gaps between events is simply a matter of looking for jumps between

successive sample indices that exceed the bin separation criterion. For example, if the

significant samples had array indices ‘0 2 5 6 8 11’ and one considered samples separated

by more than two bins to belong to independent events, then the difference between

successive indices would identify the gaps between events as ‘0 2|5 6 8|11’. Examination of

the difference between successive values in this way can be performed using a transform

function.

Having identified the boundaries between individual events, the next step is to locate

when each event’s maximum signal-to-noise occurs. This requires the application of a

reduce operation to the samples comprising each event. Given the potentially large number

of events, it is highly desirable to perform these reductions in parallel. Fortunately, this

can be achieved using Thrust’s reduce by key function, which reduces contiguous values

belonging to the same segment; in this case segments correspond to temporally-isolated

events. The result of this operation is an array of the maximum signal-to-noise of each

event along with a separate array containing the corresponding offsets into the original

time series.

The final step of the event detection process is to record the starting and ending time

of each signal. This information is extracted directly from the indices of the samples

that exceeded the detection threshold by using the scatter_if function along with the

locations of the gaps between events.
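The sketch below strings these steps together for a single dedispersed time series: stream compaction of the significant samples and their indices, labelling of temporally isolated events from the index gaps, and a segmented maximum per event. It is an illustrative reconstruction rather than the pipeline's code; the names, and the omission of the start/end-time extraction via scatter_if, are simplifications.

    #include <thrust/adjacent_difference.h>
    #include <thrust/copy.h>
    #include <thrust/count.h>
    #include <thrust/device_vector.h>
    #include <thrust/functional.h>
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/iterator/zip_iterator.h>
    #include <thrust/reduce.h>
    #include <thrust/scan.h>
    #include <thrust/transform.h>
    #include <thrust/tuple.h>

    struct above_thresh {
        float thresh;
        explicit above_thresh(float t) : thresh(t) {}
        __host__ __device__ bool operator()(float x) const { return x > thresh; }
    };

    struct starts_new_event {
        int max_gap;
        explicit starts_new_event(int g) : max_gap(g) {}
        __host__ __device__ int operator()(int gap) const { return gap > max_gap ? 1 : 0; }
    };

    // Returns the number of events and fills d_peak with each event's peak S/N.
    int detect_events(const thrust::device_vector<float>& d_snr, float thresh,
                      int max_gap, thrust::device_vector<float>& d_peak)
    {
        int n = (int)d_snr.size();
        int nsig = (int)thrust::count_if(d_snr.begin(), d_snr.end(), above_thresh(thresh));
        if (nsig == 0) return 0;

        // 1. Stream compaction: keep (time index, value) pairs of significant samples.
        thrust::device_vector<int>   d_idx(nsig);
        thrust::device_vector<float> d_val(nsig);
        thrust::counting_iterator<int> count(0);
        thrust::copy_if(
            thrust::make_zip_iterator(thrust::make_tuple(count, d_snr.begin())),
            thrust::make_zip_iterator(thrust::make_tuple(count + n, d_snr.end())),
            d_snr.begin(),                                            // stencil
            thrust::make_zip_iterator(thrust::make_tuple(d_idx.begin(), d_val.begin())),
            above_thresh(thresh));

        // 2. Flag a new event wherever the gap between successive indices is too large.
        thrust::device_vector<int> d_flag(nsig);
        thrust::adjacent_difference(d_idx.begin(), d_idx.end(), d_flag.begin());
        thrust::transform(d_flag.begin(), d_flag.end(), d_flag.begin(),
                          starts_new_event(max_gap));
        d_flag[0] = 0;                                                // first sample opens event 0

        // 3. Event labels via an inclusive scan of the start flags.
        thrust::device_vector<int> d_label(nsig);
        thrust::inclusive_scan(d_flag.begin(), d_flag.end(), d_label.begin());

        // 4. Segmented maximum: one peak S/N per contiguous label.
        thrust::device_vector<int> d_event(nsig);
        d_peak.resize(nsig);
        int nevents = (int)(thrust::reduce_by_key(
            d_label.begin(), d_label.end(), d_val.begin(),
            d_event.begin(), d_peak.begin(),
            thrust::equal_to<int>(), thrust::maximum<float>()).first - d_event.begin());
        d_peak.resize(nevents);
        return nevents;
    }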

In contrast to the previous stages of the pipeline, the execution time of the event

detection procedure is not simply a function of the length of the time series, depending

instead on the number of detected events. This has implications for the ability to main-


tain real-time performance. When the number of events remains low (as is typical), the

detection stage consumes only a small fraction of the total execution time of the pipeline.

However, in the event of a large burst of RFI (as was found during testing), this stage of

the pipeline can become swamped with events and cause a catastrophic slow-down. To

avoid this situation, a hard limit was placed on the rate of detections—reaching this limit

causes the pipeline to stop the DM-trial and filter search early and to return only those

candidates found up to that point. In this case, an error code is returned by the pipeline

warning the system that processing of the gulp of data was incomplete. This ‘bail condi-

tion’ provides an effective (although inelegant) means of ensuring real-time performance

regardless of observing conditions.

4.2.7 Event merging

While detected events will appear most strongly at their best-matching dispersion measure

trial and filter width, bright signals will typically be detected across a number of DM trials

and filters. To avoid reporting these secondary candidates as individual events, temporally-

coincident signals are first grouped together. The process takes the form of a connected

component labelling algorithm: pairs of candidates occurring at times within three filter

widths of each other are considered connected; candidates connected directly or indirectly

to each other are then merged to form a single event.

The connected component labelling algorithm is implemented as a three-step process.

First, each candidate’s label is initialised to the candidate’s index in the total list of

candidates. Next, a loop over all pairs of candidates detects coincidences and replaces the

corresponding labels with the minimum of the two original labels. Finally, each candidate

traces its label back along the ‘equivalency chain’ to find its lowest equivalent; e.g., if a

candidate has label 8, and candidate number 8 has label 5, and candidate number 5 has

label 5, then the initial candidate will have its label set to 5. The end result of this process

is a list of labels where matching values indicate connected candidates.

Once the connected component labelling process is complete, merging the candidates

is simply a matter of sorting them by their labels and then using Thrust's reduce_by_key

function to merge those with matching labels. The merging function is defined to return

the parameters of the member candidate with the greatest signal-to-noise ratio.
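The labelling itself is small enough to sketch on the host (the candidate lists are short); the types and the coincidence test below are illustrative stand-ins for the pipeline's own.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Host-side sketch of the three-step connected-component labelling.
    struct Candidate { float time, width, snr; int label; };

    bool connected(const Candidate& a, const Candidate& b)
    {
        // connected if the two detections lie within three filter widths in time
        return std::fabs(a.time - b.time) < 3.0f * std::max(a.width, b.width);
    }

    void label_candidates(std::vector<Candidate>& c)
    {
        // Step 1: initialise every label to the candidate's own index.
        for (std::size_t i = 0; i < c.size(); ++i) c[i].label = (int)i;
        // Step 2: for every connected pair, keep the smaller of the two labels.
        for (std::size_t i = 0; i < c.size(); ++i)
            for (std::size_t j = i + 1; j < c.size(); ++j)
                if (connected(c[i], c[j]))
                    c[i].label = c[j].label = std::min(c[i].label, c[j].label);
        // Step 3: trace each label along the equivalency chain to its lowest value.
        for (std::size_t i = 0; i < c.size(); ++i) {
            int l = c[i].label;
            while (c[l].label != l) l = c[l].label;
            c[i].label = l;
        }
    }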

While the loop over all pairs of candidates8 makes the computational complexity of this process O(N²), the total number of candidates being operated on is small enough

8Note that more efficient search algorithms are possible—e.g., a binning procedure could reduce the complexity to O(N); our approach was chosen for its simplicity rather than its performance.


(relative to the work performed by the rest of the pipeline) that the overall cost is not

significant.

4.2.8 Candidate classification and multibeam coincidence

Once the main pipeline is complete, the lists of candidates from each beam are gathered

on a single machine and a classification process is performed. Following the procedure

of Burke-Spolaor et al. (2011), candidates having been produced from the merging of

fewer than three individual events in DM-trial/filter space (i.e., having fewer than three

members) are classified as noise spikes. In practice, this effectively raises the detection

threshold by requiring events to be either strong enough or broad enough to be detected

in three successive DM or filter trials. A cut in dispersion measure at 1.5 pc cm−3 then

identifies low-DM signals likely to be of terrestrial origin.

The final stage of classification is a multibeam coincidence analysis. This process op-

erates on the expectation that most signals of terrestrial origin will appear simultaneously

in multiple beams (having been detected through a far side-lobe of the receiver), while as-

tronomical sources will remain localised to a single beam. While this assumption generally

holds very well, there are two ways in which astronomical events may be found to appear

in multiple beams. The first is when exceptionally bright events are picked up by the

finite response of neighbouring beams. The other is when astronomical signals coincide

with RFI signals in other beams.

The Parkes 20 cm multibeam receiver contains 13 beams pointing at locations on the

sky separated by approximately 30 arcmins, with a response pattern that falls to 50%

of peak sensitivity one quarter of the way between beams, and by more than two orders

of magnitude at neighbouring beams (Staveley-Smith et al., 1996). Astronomical point-

sources lying directly in the centre of a beam would, therefore, need to exceed ∼750σ

to be detected above 6σ in neighbouring beams, while those lying mid-way between two

beams would need to exceed around 400σ to be detected above 6σ in both beams9. As a

conservative measure, we require candidates to appear in more than three beams before

classifying them as RFI. Possibilities exist for decreasing this threshold without losing

sensitivity to bright sources, such as checking that the coincident beams are adjacent or

even computing how well a given coincident event is fitted by the known beam response

pattern; however, the situation is complicated by the existence of false-positives in the

coincidence information (see below). For this reason, our current implementation relies on

9These thresholds are approximate values only, as the response pattern becomes highly asymmetrical in the outer beams of the receiver.


just the simple threshold criterion.

The other issue with using coincidence information to identify RFI is the production

of false-positives during coincidence detection. This can occur when astronomical events

occur coincidentally with RFI bursts appearing in other beams, which becomes more likely

with broad signals. To minimise the likelihood of this situation, our pipeline checks event-

pairs for coincidence not only in time (with a tolerance of three times the greater filter

width), but also in the detection filter (tolerance of four filters) and the signal-to-noise

ratio (tolerance of 30%). These criteria were found to strike a reasonable balance between

identification of RFI and mis-identification of astronomical sources.
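Expressed as a predicate, the pair-wise test combines the three tolerances. The sketch below is illustrative: the Event type is a hypothetical stand-in, and the choice of normalising the 30 per cent signal-to-noise tolerance by the brighter of the two detections is an assumption.

    #include <algorithm>
    #include <cmath>
    #include <cstdlib>

    struct Event { double time; double width; int filter; float snr; };

    // Pair-wise coincidence test: match in time (within three times the greater
    // filter width), in detection filter (within four trials) and in S/N
    // (within 30 per cent of the brighter detection).
    bool coincident(const Event& a, const Event& b)
    {
        bool time_ok   = std::fabs(a.time - b.time) <= 3.0 * std::max(a.width, b.width);
        bool filter_ok = std::abs(a.filter - b.filter) <= 4;
        bool snr_ok    = std::fabs(a.snr - b.snr) <= 0.3f * std::max(a.snr, b.snr);
        return time_ok && filter_ok && snr_ok;
    }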

4.2.9 Deployment at Parkes Radio Observatory

The software pipeline was deployed at the Parkes Radio Observatory as part of the Berkeley

Parkes Swinburne Recorder (BPSR) back-end. This currently consists of 13 Reconfigurable

Open Architecture Computing Hardware (ROACH) boards10 connected via an Infiniband

network switch to 8 server computers, each of which contains two 6-core Nehalem-class

Intel Xeon 5650 CPUs and two Fermi-class NVIDIA Tesla C2070 GPUs. The 26 inter-

mediate frequency (IF) signals from the multibeam receiver (2 polarisations × 13 beams)

are fed via analogue-to-digital converters into the ROACH boards, where the signals are

passed through a polyphase filter bank and broken into 1024 frequency channels cover-

ing the 400 MHz bandwidth centered at 1382 MHz. The ROACH then places the data

(integrated to 64 µs samples) into packets and forwards them to the server machines,

each machine receiving the dual-polarisation signal from two beams. Here, the data are

captured by software daemons, which proceed to sum the polarisations and write the to-

tal intensity information to a ring buffer in memory. It is this ring buffer to which the

transient detection pipeline attaches, and at this point that the detection process begins.

The transient pipeline is run as a number of separate instances, each instance processing

the data from one beam using a single GPU. Upon completion of the pipeline for a gulp

of data (typically ∼10 s worth), the pipeline instances output their lists of candidates. A

‘multibeam monitor’ code, running in another process, then collates these lists, performs

the RFI coincidence check between beams and produces the results overview plots as

an image file. Finally, this image file is transferred to a web server and presented to

the observer on a web-based graphical user interface (see Section 4.2.10 for details on

visualisation of results).

While the primary use of the real-time pipeline is during pulsar and fast-transient

10https://casper.berkeley.edu/wiki/ROACH


survey observations, it is also possible to run it simultaneously with other observations that

use the Parkes Multibeam receiver, providing varying levels of usefulness. During timing

and other follow-up studies of pulsars and RRATs the pipeline remains a powerful indicator

of the current RFI environment, as well as providing detailed feedback on the quality

of data (for sufficiently bright sources). Other observing modes (e.g., quasar pointings

or studies requiring the use of a calibrator signal) can pose problems for the pipeline,

producing unintuitive output. However, meaningful results can often still be obtained in

such cases via the twelve off-centre receiver beams. When possible, the use of these beams

can provide useful RFI information as well as the opportunity to capture serendipitous

transient events.

In its current form, the output of the transient pipeline is presented only to the ob-

server(s). While this already represents a significant change in the observing paradigm,

there exists even greater scope for discovery through the use of a fully-automated machine

interface capable of further reducing the detection→reaction delay to the order of seconds.

The implementation of such a system is beyond the scope of this work, but the idea is

discussed in more detail in Section 4.4.


Figure 4.2 Results overview plots from the pipeline for an archived pointing in the HTRU survey. The pointing contains a new rotating radio transient candidate, which appears as the three pink spots labelled with (beam) '1' at a DM of around 45 pc cm−3 (occurring at times ∼70 s, 270 s and 320 s). See main text for details of the visualisation. A cut in SNR of 6.5 was applied for clarity.


4.2.10 Visualisation

The primary way of visualising the results of the pipeline is through a set of plots created to

provide an overview of the complete collection of candidates (see Fig. 4.2). Here, the main

plot displays detections in the time–DM plane and allows for immediate characterisation

of the RFI environment and the presence of bright dispersed signals. Candidates below

the cut in dispersion measure indicating low-DM RFI are shown as hollow circles, while

detections with high multibeam coincidence are shown as stars. Grey crosses indicate

candidates flagged as noise (due to a small number of component members in the detection

space), and candidates that are strongest at the largest filter width are not shown. All

other signals are then displayed as filled circles. Further information is conveyed in the

size (representing peak signal-to-noise ratio), colour (representing pulse width) and label

(representing strongest beam) of each of the points. A second plot shows a histogram of

the total number of candidates in each beam as a function of dispersion measure. This is

useful for identifying periodic sources such as pulsars and some rotating radio transients,

which show up as a narrow spike at the corresponding dispersion measure (assuming a

sufficient number of pulses are detected). Finally, a third plot displays the signal-to-noise

ratio as a function of dispersion measure to provide more detailed information.

Currently only the final candidates are visualised in the overview plots—the individual

member events comprising each candidate are not shown. This is in contrast to previous

work where events appear as extended ‘trails’ of detections across DM space, peaking in

signal-to-noise ratio at the true DM and providing some additional insight into the shape

of the signal (Cordes & McLaughlin, 2003; McLaughlin et al., 2006; Burke-Spolaor et al.,

2011). The decision to plot only the final events in our work was due to the desire to

simultaneously display results from all 13 beams of the Parkes Multibeam receiver and

the risk of overcrowding, potentially hiding interesting signals behind the trails of others.

Improvement of our visualisation methodology is ongoing, and we may return to plotting

full candidate trails in the future.

During live observing, the overview plots are updated and displayed to the observer

around every 10 s. The web-based interface also provides a set of controls allowing the

observer to interactively modify visualisation parameters such as cuts in SNR, filter and

DM, and to toggle the inclusion of individual beams. The thirty strongest candidates from

the pipeline are also displayed, along with lists of the known pulsars located within the

field of view of each beam.


4.2.11 Performance

This section presents performance results for the execution of the pipeline, demonstrating

scaling of the different processing stages and comparing total computing time to real-time.

All benchmarks were run on a server node containing two six-core Intel Xeon X5650 CPUs

and two NVIDIA Tesla C2070 GPUs.

Fig. 4.3 shows the execution time of each part of the pipeline when processing the

central beam from the 9.4 minute observation shown in Fig. 4.2. As expected, dedisper-

sion consumes the majority of the execution time, and remains constant throughout the

observation. Filterbank cleaning, memory copying, baseline removal, normalisation and

matched filtering all consume only a small fraction of the total computation time. The

only data-dependent stage of the pipeline is event detection, which can be seen to fluc-

tuate significantly throughout the observation as different numbers of events are detected

(corresponding in this case to RFI). The potential for large increases in the execution time

of this stage during periods of strong RFI motivated the addition of a hard limit to the

event rate in order to guarantee sustained real-time performance (see Section 4.2.6).

Fig. 4.4 shows the total execution times of the different pipeline stages as a function

of the gulp size when processing the observation shown in Fig. 4.2. The computation is

most efficient when processing large lengths of data at a time (e.g., processed time per

gulp ∼ 30 s) and the RFI cleaning and dedispersion processes remain very efficient down

to gulp lengths of ∼ 2 s. However, the later stages of the pipeline become extremely inef-

ficient at small gulp sizes. This issue is due to under-utilisation of the GPU’s computing

resources: at small gulp sizes there are insufficient time samples to exploit all of the avail-

able processing threads, and many simply remain idle. The RFI cleaning and dedispersion

algorithms are more resilient to this problem because they operate simultaneously on all

1024 channels of the filterbank data; in contrast, our current implementation applies the

later parts of the pipeline to each dedispersed time series sequentially. The solution is

clearly to exploit the additional parallelism between independent DM trials. Limitations

of the Thrust library currently prevent this from being a straightforward task; however,

we expect upcoming versions of Thrust to allow the use of separate streams of GPU com-

putation. This feature should allow the simultaneous processing of multiple DM trials on

a single GPU, providing much greater efficiency at short gulp sizes.

While our pipeline was designed to execute all processes on the GPU, a hybrid GPU-

CPU approach is also possible. Using Thrust’s ability to target multiple back-ends, we

trivially recompiled the pipeline to use OpenMP-based implementations of all algorithms

except dedispersion (which remained on the GPU using our external library). CPU-based


[Figure 4.3 plot area: execution time [s] versus gulp number, with per-gulp bars broken down by stage (legend: Detect events, Matched filter, Normalise, Baseline, Mem copy, Dedisperse, Clean RFI, Mem alloc).]

Figure 4.3 Plot showing the break-down of execution times during each gulp for different parts of our transient pipeline when processing the central beam of the 565 s pointing shown in Fig. 4.2. Here each gulp (except the last) processes 16.8 s of data and all stages of the pipeline are executed on the GPU (an NVIDIA Tesla C2070).


[Figure 4.4 plot area ('All GPU'): execution time [s] versus processed time per gulp [s], broken down by stage (legend: Hybrid total, Detect events, Matched filter, Normalise, Baseline, Mem copy, Dedisperse, Clean RFI, Mem alloc).]

Figure 4.4 Plot showing the variation of execution times for different parts of our transient pipeline as a function of the gulp size when processing the central beam of the 565 s pointing shown in Fig. 4.2. Here all stages of the pipeline are executed on the GPU (an NVIDIA Tesla C2070). The dashed region shows the total time when using the hybrid GPU-CPU (3 cores) approach for comparison (see Fig. 4.5).


dedispersion was not considered due to its approximately six times slower performance

(see Chapter 3). Fig. 4.5 shows the results of the same benchmarks as in Fig. 4.4 but for

the hybrid GPU-CPU code using three11 cores of the CPU. At large gulp sizes, the CPU is

around three times slower than the GPU. However, with the current implementation, the

CPU is able to scale more effectively to smaller chunks of data, and becomes faster than

the GPU below gulps of around four seconds. We note that we observed scaling efficiencies

of approximately 80 per cent when using different numbers of CPU cores between one and

twelve, indicating that the algorithms are well-suited to both GPUs and multi-core CPUs.

The speed of the GPU dedispersion code and the data-parallel implementations of

the other parts of the pipeline have proved sufficient to comfortably maintain real-time

execution under the current back-end configuration with 8 s gulps. The code has been in

operation at the telescope since mid July 2012 and has not exhibited any performance-

related issues during this time besides the automated bail-outs during periods of excessive

RFI (corresponding to event-rates exceeding 1.5 × 10⁵ detection peaks per minute across

the search space).

4.3 Results

This section presents preliminary testing and science results from our pipeline using

archived data as well as real-time observations. Further work applying the system to

specific science applications is ongoing.

4.3.1 Discovery of PSR J1926–13

Here we report the serendipitous discovery of a new rotating radio transient (RRAT) source

found in existing data (observed in April 2009) from the High Time Resolution Universe

(HTRU) survey (Keith et al., 2010) during testing of our pipeline. Manual inspection of

the overview plots from this pointing (see Fig. 4.2) prompted further study of a small

number of strong pulses appearing in the central beam at consistent DM (∼45 pc cm−3)

and filter (∼8 ms) trials. Our confidence in the origin of the signal was sufficient to

schedule a follow-up observation during HTRU observing time.

A fifteen minute confirmation observation was made in July 2012, in which several

strong pulses were again detected at similar DM and filter trials (see Fig. 4.6). Manual

inspection of the dedispersed time series from both the detection and confirmation obser-

vations found the eleven observed pulses to match a 4.864±0.002 s period, confirming the

11Only three cores out of six on the CPU were used because the remaining cores are needed for other processing tasks during observations.


[Figure 4.5 plot area ('GPU/CPU (3 cores) hybrid'): execution time [s] versus processed time per gulp [s], broken down by stage (legend: All GPU total, Detect events, Matched filter, Normalise, Baseline, Mem copy, Dedisperse, Clean RFI, Mem alloc).]

Figure 4.5 Plot showing the variation of execution times for different parts of our transient pipeline as a function of the gulp size when processing the central beam of the pointing shown in Fig. 4.2. Here all stages of the pipeline but dedispersion are executed on the CPU (an Intel Xeon X5650) using 3 cores. The dashed region shows the total time when using the all-GPU approach for comparison (see Fig. 4.4).


Table 4.1 Properties of the newly discovered RRAT. Columns are: (1) name derived from J2000 coordinates; (2,3) right ascension and declination of the beam centre; (4) best-fitting period; (5) observed pulsation rate in pulses per hour; (6) best-fitting DM; and (7) observed pulse width at half maximum of the brightest single pulse. Uncertainties in the last digit are given in brackets.

PSRJ        RAJ        DecJ        P (s)      χ (h−1)    DM (pc cm−3)    w_eff (ms)
J1926–13    19:26:38   −13:13:37   4.864(2)   25         45(5)           8(2)

rotating transient nature of the source. A Fourier search was also performed, but did not

result in a significant detection, making the object a potential member of the RRAT class

of pulsars. Measured properties of the source are listed in Table 4.1. We note that this

source was found after inspecting approximately ten per cent of the mid-latitude portion

of the HTRU survey, and we therefore expect processing and inspection of the remaining

data to yield additional discoveries. Further studies such as measurements of the detection

rate for known pulsars and RRATs will form the basis of future work.

After the completion of this work we became aware that this source had also been

discovered independently by Rosen et al. (2012), whose work includes a timing solution.


Figure 4.6 Results overview plots from the pipeline for a confirmation pointing of the rotating radio transient candidate shown in Fig. 4.2. Only results from the central beam are shown. The candidate (re-)appears as the pink and purple spots at a DM of around 45 pc cm−3.


4.3.2 Giant pulses

In addition to survey observations, the transient pipeline is also able to operate during

pulsar timing sessions. One use of this ability is to detect the emission of particularly bright

individual pulses from known pulsars in real-time and trigger dumps of the corresponding

baseband data to disk for later study. Some pulsars are known to have extended tails in the

luminosity distribution of their individual pulses and emit what are known as ‘giant pulses’

that can exceed the mean pulse strength by two to three orders of magnitude (Cognard

et al., 1996; Cordes & McLaughlin, 2003). While baseband recording facilities are typically

limited in their capacity due to the extreme data rate (see Section 4.1), recordings can

be kept manageable if they are restricted only to signals of interest. By connecting the

output of our pipeline (i.e., the list of candidates from the current observation) to the

baseband recording hardware via some form of decision-maker, significant events could be

captured within strict recording constraints. The decision-maker could simply be a human

observer; however, a more robust solution would be to use a machine program to analyse

the list of candidates and decide on which to record based on, e.g., their significance, their

likelihood of being RFI and the acceptable event rate. This idea will form the basis of

future work.

As a proof of concept, our transient pipeline was run during several pulsar timing

observations with the aim of detecting bright pulses from the tail of the distribution. The

results from an observation of PSR J1022+1001, a millisecond pulsar with a period of

16.45 ms (Camilo et al., 1996; Verbiest et al., 2009), are shown in Fig. 4.7. While this

pulsar is not known to emit giant pulses (Kramer et al., 1999), it was sufficiently bright to

detect a number of individual pulses using the transient pipeline. Of the 3852 s / 0.01645 s

≈ 2.3 × 10⁵ stellar rotations during the observation, the strongest detected emission was

only around eight times the mean pulse SNR (derived from the integrated timing SNR),

consistent with the lack of giant pulse emission from this pulsar.


Figure 4.7 Results overview plots from the pipeline during a timing observation of the millisecond pulsar PSR J1022+1001 showing the detection of a number of strong narrow pulses at the pulsar's DM of 10.25 pc cm−3.


4.3.3 RFI monitoring

The ability of the pipeline to search for signals across a wide range of parameter space in

real-time allows it to provide unprecedented feedback on the radio-frequency interference

(RFI) environment during observations. Fig. 4.8 shows the overview plots from a pointing

in which several strong bursts of RFI occurred. Visible are narrow events at zero-DM

appearing in single beams (hollow green circles), isolated broad events at a variety of DMs

appearing in multiple beams (orange stars) and intermediate-width events spanning many

DMs also appearing in multiple beams (purple stars).

Such information can be used by observers to guide their observing schedule. For

example, narrow zero-DM RFI may be acceptable for certain observations, but strong

RFI spanning many DMs may render the data useless. In the latter case, the observer

could respond by moving to another target, or by attempting to identify the source of the

RFI. Results from the real-time pipeline may also be useful for long-term monitoring of

the RFI environment at the observatory.


Figure 4.8 Results overview plots from the pipeline for a pointing containing strong bursts of RFI, including narrow zero-DM signals (hollow green circles), isolated broad events appearing in multiple beams (orange stars) and medium-width events spanning many DMs and beams (purple/blue stars). Pulses from the known pulsar PSR J1046–5813 can also be seen in beam nine around its DM of 240 pc cm−3 (Newton, Manchester & Cooke, 1981).


4.3.4 Quality assurance

The real-time pipeline also serves as a means of monitoring the quality of observation data.

While existing monitoring tools such as plots of the integrated band-pass and zero-DM

time series allow the observer to identify many problems in the observing system, in some

cases issues can go unnoticed due to their subtle impact on these diagnostics. Due to

its large search space, we expect the real-time transient pipeline to provide much greater

diagnostic power.

A case in point was the identification, during initial deployment of the pipeline at

Parkes Observatory, of problems with several beams of the Parkes Multibeam Receiver.

During observing, this problem was immediately visible in the transient overview plots—

an overwhelming presence of bright events from beam six forced the beam to be manually

hidden in order to see the results from other beams. This behaviour was observed in-

termittently over a period of weeks and was also seen to shift into neighbouring beams

during this time. Further investigation suggested an origin inside the focus cabin, and

maintenance work is planned to track down the problem. The strong visibility of this

issue in the overview plots as well the ability to assess its impact on data quality made the

real-time transient pipeline a valuable addition to the set of quality assurance diagnostics

presented to observers.

4.4 Discussion

Our transient detection pipeline can comfortably operate in real-time under the current

survey observing configuration at Parkes Observatory thanks primarily to the use of GPUs.

For comparison, based on additional timing benchmarks of dedispersion and the other

stages of the pipeline, we estimate that an equivalent CPU-only system capable of execut-

ing the pipeline in real-time would require around five times as many nodes, multiplying

the total monetary, power and rack-space costs considerably. These metrics would have

put such a system well beyond the available budgets. Furthermore, the need to partition

the problem between additional nodes would have added considerable complexity to the

software implementation and networking requirements. For these reasons we consider the

use of GPUs to have been the enabling factor in the development of this system.

The ability to observe transient event detections in real-time has dramatically changed

the observing paradigm at Parkes Observatory. Immediate feedback on astronomical

sources, terrestrial interference and instrumentation issues now allows observers to assess

the contents of their data and proactively adapt their observing schedules based on what


they see. As HTRU survey observations continue, we expect the pipeline to produce the

first reactive confirmations of new RRAT and pulsar sources in the near future, bypassing

the issues and delays associated with offline processing. In addition, work is underway to

investigate (automatically-)triggered baseband recording and to add additional features

to the real-time observing system, including the ability to produce plots of frequency and

SNR versus time and DM for individual candidate events. These diagnostics will allow

observers to better discriminate between astronomical and terrestrial events, and could

be linked to a manual trigger for recording baseband data. Ongoing real-time monitoring

of the effects of RFI and instrumental problems are also expected to result in improved

observing quality as such issues are progressively resolved over the long term.

The detection of significant unique events is also a high-reward possibility. One op-

tion for reacting to such detections would be to release an alert to the community such

that immediate follow-up observations could be taken at suitable observatories. The VO-

Event standard from the International Virtual Observatory Alliance is designed for such

purposes, and provides the means for an ‘author’ (e.g., a human or machine analysing

the output of our pipeline) to send structured information about a new event to a ‘pub-

lisher’, which then forwards the information to attached ‘subscribers’ according to event

filtering criteria (Williams & Seaman, 2006). Significant unique events with DM ≫ 0

detected using the 20 cm Parkes Multibeam Receiver may, with sufficient coordination,

subsequently be detectable at low frequency facilities such as the Murchison Widefield

Array (Tingay et al., 2012), aided substantially by the ability to ‘steer’ such telescopes in

software. For example, a burst with a DM of 100 pc cm−3 detected at Parkes at 1381 MHz

would appear at 200 MHz around 10 s later, albeit with significantly increased disper-

sion smearing. If successful, the results from such an effort would provide unprecedented

multi-wavelength data on one-off short-duration events.
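That delay follows directly from the usual cold-plasma dispersion relation; a short sketch using the standard dispersion constant (≈4.149 × 10³ s MHz² pc⁻¹ cm³) reproduces the quoted figure. The function and its names are illustrative only.

    #include <cstdio>

    // Dispersion delay (s) between two observing frequencies (MHz) at a given
    // DM (pc cm^-3), using the standard dispersion constant.
    double dispersion_delay(double dm, double f_lo_mhz, double f_hi_mhz)
    {
        const double kdm = 4.148808e3;   // s MHz^2 pc^-1 cm^3
        return kdm * dm * (1.0 / (f_lo_mhz * f_lo_mhz) - 1.0 / (f_hi_mhz * f_hi_mhz));
    }

    int main()
    {
        // DM = 100 pc cm^-3, Parkes at 1381 MHz versus a 200 MHz facility:
        // prints roughly 10 s, the value quoted in the text.
        std::printf("%.1f s\n", dispersion_delay(100.0, 200.0, 1381.0));
        return 0;
    }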

While the speed of the pipeline is currently sufficient to satisfy the requirements of real-

time processing under the existing back-end configuration, possibilities exist for upgrading

the system given further improvements in performance. Increasing the time resolution

captured by the back-end is one such change that could be made if the pipeline exhibited

the necessary processing power; such an upgrade would improve the pipeline’s sensitivity

to short-duration signals. Another opportunity is the capture of polarisation information,

which is currently ignored by operating only on the total intensity data. Reducing the

gulp size is also of interest—minimising the delay between event and alert is critical to the

notion of immediate follow-up observations at external observatories.

Two obvious avenues exist to obtain the performance needed to support these upgrades.


The first is to further optimise the dedispersion process, which remains the computational

bottle-neck. Use of the tree or sub-band algorithms, or a hybrid approach, is one option

(see Chapter 3 of this work; Magro et al. 2011); alternatively, there is the possibility of

further improving the efficiency of the direct dedispersion algorithm on GPUs (Armour

et al., 2011).

The second avenue for increasing performance is simply to use faster GPUs. Since the

purchase of our computing nodes, the next generation of hardware has been announced

and is expected to provide at least two times the performance of the current devices12.

This is by far the simplest means of speeding-up the application. In addition, it would

be a relatively straightforward task to divide the workload of each receiver beam between

multiple GPUs, e.g., by partitioning across dispersion trials. With the existing hardware,

a seven-beam observing mode could provide two GPUs per beam, doubling the processing

power. A multi-GPU approach would also allow the use of GPUs containing two discrete

chips on a single board.

Increasing the overall sensitivity of the pipeline is highly desirable; the choice of detec-

tion threshold is, however, currently constrained by the number of false positives produced

and the rate at which candidates can be assessed by a human observer. A solution to this

problem is instead to use a machine to analyse the events; e.g., by training an artifi-

cial neural network (Eatough et al., 2010). Machine-based candidate analysis also has

the advantage of providing ceaseless attention—something that cannot be guaranteed by

a human observer. Such an approach would allow for much more robust and confident

real-time alerts and baseband recording following positive astronomical detections.

Further improvements in sensitivity could also be achieved through the use of more ad-

vanced RFI mitigation techniques. While the current approach of performing multibeam

coincidence analysis only at the end of the pipeline has the benefit of reducing imple-

mentation complexity, a more powerful approach would undoubtedly be to leverage the

discriminatory power of multibeam coincidence information during the initial filterbank

cleaning process. We plan to investigate this option in future work by allowing pipeline

instances to communicate data between each other during processing.

While the code developed in this work was targeted specifically at GPU hardware, in

the spirit of the ideas put forth in Chapter 2 the algorithms chosen for each stage of the

pipeline are entirely general, remaining suitable for virtually any parallel shared-memory

computing architecture. Thus, while the immediate significance of the work has been

demonstrated, we also believe that the algorithmic ideas it presents will be of long-term

12http://www.nvidia.com/object/nvidia-kepler.html


value, extending effortlessly to embrace future architectures. Given the current volatility

in the landscape of computing hardware, this is a welcome thought.

Our pipeline software heimdall is freely available, currently as part of the open source

psrdada package13.

Acknowledgments

We would like to thank everyone at the Max-Planck-Institut für Radioastronomie for

hosting us during the early stages of this work, Aris Karastergiou for a useful discussion

about transient pipelines and RFI mitigation, and Andrew Jameson for his tireless efforts

integrating our pipeline into the back-end systems at Parkes Observatory and for his

subsequent help during testing.

13http://psrdada.sourceforge.net/


5 Future Directions and Conclusions

You’ve got to think about big things while you’re doing small

things, so that all the small things go in the right direction.

—Alvin Toffler

5.1 Future directions

Chapter 2 of this thesis advocated a generalised approach to many-core hardware based

on the analysis of algorithms. While this was shown to provide significant insight into the

optimal implementation approach for a given problem, it was later found in Chapter 3

that platform-specific issues can become important during the final stages of optimisation.

Tuning of code and parameters to best take advantage of a particular architecture can

require significant effort, and can often be guided only by benchmarking and trial and error

(see Volkov & Demmel 2008 for an example of the complexities involved in tuning matrix

multiplication on GPUs). One possibility for future work is an investigation into methods

of auto-tuning. Automatic optimisation techniques allow algorithms to be developed once,

in a very generic way, and subsequently deployed efficiently to different hardware by letting

a machine perform the final, platform-specific tuning of the code. This technique, used

in the Fourier transform library FFTW1 and recently applied to the problem of matrix

multiplication on GPUs (Li, Dongarra & Tomov, 2009; Cui et al., 2010), could prove very

valuable for performance-critical astronomy applications needing to extract peak efficiency

from current and future computing architectures.

While the applications studied in this thesis cover a variety of different algorithms, they

may generally be classified as problems involving ‘dense’ data structures (e.g., densely-

sampled particle lists, pixel arrays, time series etc.). These contrast with problems in-

1http://www.fftw.org


volving ‘sparse’ data structures, where data and computations can be irregular; common

examples are tree-based methods [e.g., the Barnes-Hut force-calculation algorithm (Barnes

& Hut, 1986)] and sparse matrix calculations [e.g., those used in solving the Poisson equa-

tion in multiple dimensions (Stone & Norman, 1992)]. The irregularity of these algorithms

can pose problems on highly-vectorised architectures like GPUs, where they often require

significantly more complex implementations than on traditional sequential processors (see,

e.g., Bedorf, Gaburov & Portegies Zwart 2012). A more detailed investigation into gener-

alised approaches to the analysis and implementation of algorithms involving sparse data

structures would be an excellent avenue for future work, with the potential outcome of

opening up new application areas to acceleration by advanced architectures.

One final direction for future work is the development of new software tools and li-

braries optimised for many-core architectures. While this thesis has presented and demon-

strated a powerful and general approach to such hardware, the adoption of GPUs (or other

accelerators) by the wider astronomy community will only come once sufficient utilities

and applications are in place. The Thrust library is a good example of how well-made

tools with a focus on algorithms can dramatically lower the entry barrier and increase

productivity when targeting complex computing architectures, even allowing code to be

effortlessly switched between different hardware. Porting or redeveloping widely-used as-

tronomy software for GPUs remains an important ongoing area of work.

5.1.1 The future evolution of GPUs

Since the arrival of true general-purpose GPU computing platforms in 2007/08, GPUs have

continued to increase dramatically in both computational power and flexibility. November

2008 marked the first appearance of a GPU-accelerated cluster in the Top500 supercom-

puter ranking2, and as of July 2012 the list contains 57 machines exhibiting accelerator

or co-processor cards3. To examine the future directions of this hardware, we will focus

on the recent evolution of products from one vendor chosen as being representative of

the market. Fig. 5.1 plots the peak memory bandwidth and compute4 performance over

the last five years for NVIDIA GeForce GPUs costing around USD$400 on release. An

exponential fit shows compute performance doubling every 1.6 years—architectural im-

provements actually allow faster growth than that defined by Moore’s Law. Given the

tight fit displayed by these five years of data and the unwavering success of Moore’s Law

2http://www.nvidia.com/object/io_1226945999108.html
3http://www.top500.org/lists/2012/06/highlights
4Here we use the term compute performance to mean arithmetic performance calculated as: core count × shader clock rate × 2 operations per clock cycle.


[Figure 5.1 plot area: memory bandwidth (GB/s) and single-precision FP performance (GFLOP/s) versus time (years since 2000) for the GeForce 8800 GTS, GTX 260, GTX 470, GTX 570 and GTX 670; fitted doubling times: memory bandwidth 3.7 (+1.3/−0.3) years, compute performance 1.6 (±0.1) years.]

Figure 5.1 Trends in theoretical peak memory bandwidth (+) and compute performance (×) over the last five years for NVIDIA GeForce GPUs costing around USD$400 on release. Dashed and dotted lines show exponential fits.

over the last fifty years, it is quite reasonable to expect that this trend will continue for

(at least) another five years5, at which point GPUs could be expected to provide an order

of magnitude more performance than today. However, this represents arithmetic perfor-

mance only; memory bandwidth is seen to increase on a much longer timescale, doubling

approximately every four years. If memory technology also continues at its current pace,

five years of evolution will provide only ∼2.5 times the data access speed of today.

The ratio of compute performance to memory bandwidth, which we define as the criti-

cal arithmetic intensity, is plotted in Fig. 5.2, and is seen to double around every 2.8 years.

This metric gives an indication of the number of floating-point operations required per

byte of memory access to balance the compute and bandwidth capabilities of the hard-

5 Moore's Law must ultimately come to an end, but technology roadmaps defining the near term (up to 2018) and long term (up to 2026) prospects for its continuation continue to drive the industry (http://www.itrs.net/Links/2011ITRS/Home2011.htm)


[Figure 5.2: log-log plot of core count and arithmetic intensity (FLOP/B) versus time (years since 2000) for the GeForce 8800 GTS, GTX 260, GTX 470, GTX 570 and GTX 670.]

Figure 5.2 Trends in core count (+) and critical arithmetic intensity (×) over the last five years for NVIDIA GeForce GPUs costing around USD$400 on release. Dashed and dotted lines show exponential fits, with doubling times of 1.5 (+0.2/−0.1) years for core count and 2.8 (+0.2/−0.4) years for critical arithmetic intensity.

ware, and provides insight into the scalability of different applications. Problems with

arithmetic intensities below the critical value of the target hardware will be bound by

memory performance, while those exceeding it will be bound by arithmetic capabilities.

The continued increase in the critical arithmetic intensity of GPU hardware threatens to

leave more and more algorithms in the bandwidth-limited regime, where they are con-

strained by the slower growth-rate of memory speed. The effect of this phenomenon on

astronomy applications will be discussed in Section 5.1.2.
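This comparison can be made explicit with a simple roofline-style estimate. The sketch below is illustrative only: the hardware numbers are chosen to be roughly representative of a GTX 670-class device rather than taken from any specific measurement, and the helper function is hypothetical.

#include <algorithm>
#include <cstdio>

// Attainable performance (GFLOP/s) for an algorithm of given arithmetic
// intensity (FLOP/byte) on hardware with the given peak compute rate
// (GFLOP/s) and peak memory bandwidth (GB/s).
double attainable_gflops(double intensity, double peak_gflops, double peak_gbps)
{
    return std::min(peak_gflops, intensity * peak_gbps);
}

int main()
{
    // Illustrative numbers only, roughly representative of a GTX 670-class GPU.
    const double peak_gflops = 2500.0;  // peak single-precision compute
    const double peak_gbps   = 190.0;   // peak memory bandwidth
    const double critical_ai = peak_gflops / peak_gbps;  // critical arithmetic intensity

    const double ai = 2.0;  // e.g., a simple transform- or reduction-style kernel
    std::printf("critical intensity = %.1f FLOP/byte\n", critical_ai);
    std::printf("attainable = %.0f GFLOP/s (%s-bound)\n",
                attainable_gflops(ai, peak_gflops, peak_gbps),
                ai < critical_ai ? "bandwidth" : "compute");
    return 0;
}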

Fig. 5.2 also plots the evolution in the number of cores exhibited by recent GPUs,

which is seen to double every 1.5 years (n.b., this is faster than the increase in compute

performance due to a negative trend in shader clock rate). This metric, the fastest-growing

of those discussed in this section, places a lower-bound on the amount of parallelism re-

quired to fully-utilise the GPU hardware. Applications are therefore required to exhibit

substantial and scalable division of work in order to remain efficient on future GPUs.


However, with the advent of dynamic parallelism functionality and multiple kernel execu-

tion in the Kepler generation of GPUs, this is expected to become an easier goal for many

algorithms. Furthermore, the large number of pixels/voxels/samples/particles/rays typi-

cally appearing in astronomy applications means that in many cases the relevant quantity

far exceeds the number of cores; in such cases the issue is easily resolved through the use

of a data-parallel approach to algorithm design as described in Chapter 2.
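A common CUDA idiom for exposing such per-element parallelism is the grid-stride loop; the sketch below applies a hypothetical per-sample scaling operation to a time series whose length far exceeds any plausible core count, and is intended only as an illustration of the pattern.

#include <cuda_runtime.h>

// Hypothetical per-sample operation: scale every sample of a time series.
__global__ void scale_samples(float* data, int n, float factor)
{
    // Grid-stride loop: each thread strides through the array, so any n much
    // larger than the core count is covered regardless of launch configuration.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
    {
        data[i] *= factor;
    }
}

int main()
{
    const int n = 1 << 24;  // ~16 million samples, far more than any GPU has cores
    float* d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    scale_samples<<<256, 256>>>(d_data, n, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}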

In addition to increases in theoretical performance, GPU technology has also exhibited

significant increases in flexibility over the last five years, which has allowed a wider range

of applications to achieve higher practical performance with less development effort. The

addition of two levels of automatically-managed cache space in NVIDIA’s ‘Fermi’ gener-

ation of GPUs alleviated the need to consider the precise alignment of memory accesses,

allowing a more CPU-like approach to code design (see Chapter 3). Similarly, the ability

of the next-generation ‘Kepler’6 architecture to support GPU-managed kernel launches

(also known as dynamic parallelism) will provide a much more natural and efficient means

of implementing algorithms that must adapt dynamically to their data during execution

[e.g., adaptive mesh refinement methods (Schive, Tsai & Chiueh, 2010)]. Looking forward,

we expect this trend of increasing flexibility to continue strongly into the future, provid-

ing even more freedom from the constraints of traditional GPU programming models and

opening up the raw power of the hardware to an ever-broader range of algorithms. The

release of Intel’s Xeon Phi accelerator card (expected in late 2012) may be a significant

step in this direction: its compatibility with the x86 instruction set used by most modern CPUs allows it to exploit legacy software tools7.

One final trend of interest in the GPU market is the divergence of the hardware tar-

geting the graphics/games industry and that targeting the scientific/high-performance

computing sector. NVIDIA’s Tesla series of devices, which targets the scientific com-

puting market, began as simple derivatives of the company’s game-oriented GeForce line,

differentiated only by their increased memory volume and quality guarantee. However, suc-

cessive generations have introduced additional Tesla-exclusive features, such as increased

double-precision floating-point arithmetic performance and error-correcting memory. The

next-generation Tesla K20 will further increase this divide by providing dynamic paral-

lelism and virtualisation features that will not be available in the corresponding GeForce

models8. While this divergence of features is a natural result of the different applica-

6 http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
7 http://www.intel.com/content/www/us/en/high-performance-computing/high-performance-xeon-phi-coprocessor-brief.html
8 http://www.nvidia.com/content/tesla/pdf/NV_DS_TeslaK_Family_May_2012_LR.pdf


tions being targeted by the science and graphics markets, it is a trend that we expect

to require careful management: hardware like the Tesla series owes its existence to the

well-established video gaming market, and it is not yet clear whether the scientific and

high-performance computing markets are large enough to support the development of

custom hardware on their own. For this reason, we expect manufacturers to continue

to share their architectures between the two markets until such time as the demand for

high-performance computing justifies a complete separation. The future impact of this

situation is difficult to predict, but it is possible that a point will be reached where scien-

tific computing hardware is held back by the needs of graphics applications, and progress

beyond this point will be restricted as a result. On the other hand, it is also possible that

the high-performance computing market will grow (or has in fact already grown) large

enough to support its own hardware. In this case, graphics and compute architectures can

be expected to continue to diverge as demanded by their respective application areas.

5.1.2 Prospects for astronomy applications

While the application of advanced architectures to certain computational problems in as-

tronomy has already proven very successful, most of the applications targeted to date

are well-known for exhibiting high degrees of parallelism and for being key performance

bottle-necks [the O(N²) direct gravitational N-body force calculation being the archetypal

example]. This is a result of the non-trivial nature of development for advanced architec-

tures and the restrictions imposed by hardware limitations, as well as the computational

needs of astronomy research. However, the work in this thesis has paved the way for

simplifying future adoption of such technologies, and we expect the number and variety

of accelerated applications to increase significantly in the coming years.

As seen in Section 5.1.1, a key trend in the evolution of GPU hardware is the slow

increase in memory bandwidth relative to compute performance, which poses a concern

for the ability of some algorithms to scale effectively to future hardware. The algorithm

analysis approach presented in this thesis provides useful insight into this issue. Appli-

cations whose computational complexity grows significantly faster than their input and

output data [e.g., the direct gravitational N-body problem, radio-telescope signal correla-

tion and matrix multiplication all have arithmetic intensities of O(N)] are unlikely to ever

be constrained by memory performance, and will continue to push the limits of advanced

architectures with little additional implementation effort. Problems with slowly-growing

arithmetic intensities [e.g., the fast Fourier transform and tree-based algorithms, which

have arithmetic intensities of O(log N)] will tread the line between bandwidth and compute


limitations, requiring careful optimisation to avoid being held back by memory hardware.

In the worst position are algorithms with constant arithmetic intensities (e.g., transforms

and reductions), which, in practical applications, are often used with only small constant

compute factors. In such cases, performance will remain limited by the available memory

bandwidth. Furthermore, the number of algorithms in this category can be expected to

increase as more and more sink below the (projected) rising critical arithmetic intensity

of future hardware architectures.
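To make the first case concrete: multiplying two $N \times N$ single-precision matrices requires $\sim 2N^3$ floating-point operations while, in the ideal case of perfect on-chip reuse, moving only $3N^2$ elements ($12N^2$ bytes), giving an arithmetic intensity of
$$
\frac{2N^3}{12N^2} = \frac{N}{6}\ \mathrm{FLOP/byte},
$$
which grows without bound and so comfortably outpaces any plausible rise in the critical arithmetic intensity.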

Memory bandwidth is not the only bottle-neck faced by algorithms with low arith-

metic intensity. PCI-Express bandwidth is typically an order of magnitude lower than

that of GPU memory, and network bandwidth can be an order of magnitude lower still.

Avoiding these bottle-necks often requires the consideration of a more coarse-grained form

of arithmetic intensity: the number of tasks that can be performed for each data transfer.

The PCI-Express or network communication cost may be significant with respect to any

single task in a pipeline, but its impact can often be reduced by performing multiple tasks

in one place (e.g., moving multiple processing steps onto the GPU rather than just one).

The ability to overlap communication and computation also helps in these situations.
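A minimal sketch of such overlap using CUDA streams is given below; the process_block kernel is a hypothetical stand-in for a pipeline stage, the buffer sizes are arbitrary, and pinned (page-locked) host memory is assumed for the asynchronous copies to be effective.

#include <cuda_runtime.h>

// Hypothetical pipeline stage: placeholder per-sample work on one data block.
__global__ void process_block(float* data, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
    {
        data[i] = data[i] * data[i];
    }
}

// Stream over n_blocks blocks of block_len samples, overlapping the PCI-Express
// copy of each block with the processing of the previous one.
void run_pipeline(const float* h_blocks, int n_blocks, int block_len)
{
    // Two streams and two device buffers provide simple double-buffering.
    cudaStream_t stream[2];
    float* d_buf[2];
    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&stream[i]);
        cudaMalloc((void**)&d_buf[i], block_len * sizeof(float));
    }

    for (int b = 0; b < n_blocks; ++b) {
        const int s = b % 2;
        // Asynchronous host-to-device copy (h_blocks must be page-locked).
        cudaMemcpyAsync(d_buf[s], h_blocks + (size_t)b * block_len,
                        block_len * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        // The kernel in the same stream runs once its copy completes and
        // overlaps with the copy issued for the next block in the other stream.
        process_block<<<256, 256, 0, stream[s]>>>(d_buf[s], block_len);
    }

    for (int i = 0; i < 2; ++i) {
        cudaStreamSynchronize(stream[i]);
        cudaFree(d_buf[i]);
        cudaStreamDestroy(stream[i]);
    }
}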

While this analysis paints a somewhat dark picture of the future for a large number

of algorithms in use today, to assume that this is representative of the overall prospects

for the future of computational astronomy would be very pessimistic. It is important

to note that, due to their low computational complexity, bandwidth-bound applications

only rarely form the overall bottle-neck in scientific applications: moving and transforming

data may be limited by memory performance9, but its computational cost quickly becomes

insignificant when faced with an O(N²) or even O(N log N) algorithm in the same pipeline.

One possible result of the increasing critical arithmetic intensity of GPU hardware is

that we will witness a shift in the balance of algorithm design: computationally intensive

processes will become (relatively) cheaper to employ, and will consequently be used more

liberally than they are today. Given the ability of these processes to fully exploit the

rapid growth in computing power and their position as the key performance bottle-necks

in many scientific applications, we see the overall future for computational astronomy as

being very bright.

Future research may even necessitate the use of more computationally-intensive al-

gorithms as a result of growing data rates and the increasing infeasibility of human in-

tervention during processing. Machine-learning and data-mining algorithms have already

become important tools in some areas of astronomy research, and their importance is ex-

9 Among other hardware limitations such as PCI-Express, network and disk I/O bandwidth.


pected to increase significantly over the next decade (e.g., Ball et al. 2006, 2007, 2008;

Mahabal et al. 2008; Borne 2008; Richards et al. 2011). It is widely believed that the next

generation of telescopes will bring about a new era of “Big Data” astronomy research,

where traditional analysis, distribution and archiving methods will fail to cope with the

rate of data generation (Hey, Tansley & Tolle, 2009; Jones et al., 2012)10. While the

computational requirements of new surveys in the optical and infrared may be relatively

undemanding (Schlegel, 2012), future radio surveys such as those at the Square Kilo-

metre Array are expected to require the use of world-class high-performance computing

facilities in conjunction with new data-processing techniques (Cornwell, 2004; Lonsdale,

Doeleman & Oberoi, 2004; Smits et al., 2009). Many of the processes involved in these

surveys (e.g., synthesis imaging, pulsar and transient detection algorithms) exhibit high

arithmetic intensities, making them ideal for deployment on advanced architectures.

The increasing reliance on computationally-intensive algorithms in astronomy makes

the use of rapidly-progressing advanced architectures such as those discussed in this thesis

a very attractive prospect for enabling the next generation of research. However, the shift

away from traditional sequential computing models continues to pose significant challenges.

We believe that the work presented in this thesis offers a prudent path through these

obstacles and into a new decade of discovery.

5.2 Summary

Motivated by recent advances in computing hardware, this thesis began with an investi-

gation into new approaches to the problem of applying advanced massively-parallel com-

puting architectures to applications in astronomy. Chapter 2 eschewed ad-hoc approaches

in favour of a generalised methodology based on algorithm analysis. Simple analysis tech-

niques were shown to provide deep insight into the suitability of particular problems for

advanced architectures now and into the future, answering the questions of both whether

to invest in a many-core solution for a given problem and where to begin such an imple-

mentation. The application of this methodology to four well-known astronomy problems

resulted in the rapid identification of potential speed-ups from cheaply-available hardware

such as GPUs and ultimately led to the work presented in Chapter 3. Due to the gen-

eral nature of algorithm analysis, these results are expected to stand the test of time and

remain relevant for virtually all future parallel architectures.

Incoherent dedispersion is a computationally intensive problem at the heart of surveys

10 We note that the Big Data paradigm is expected to affect many areas of science and is not restricted to astronomy.


for fast radio transients. Commonly positioned as the primary performance bottle-neck

in these applications, the speed of this algorithm can place direct constraints on the

rate of scientific discovery. This fact, combined with the results of Chapter 2 showing

a strong potential for GPU-acceleration, made it a logical choice for further study and

implementation. Chapter 3 presented a detailed analysis of three different incoherent

dedispersion algorithms and described their implementations using the CUDA platform.

Building on the results of the analysis in Chapter 2, this chapter presented a more detailed

investigation of the memory access patterns exhibited by the algorithms. Also discussed

were implementation-specific details, which were found to play an important role in the

optimisation for older generations of GPU hardware, but a less-significant one for more

recent devices exhibiting automatic cache spaces. The GPU implementation of the direct

dedispersion algorithm was found to out-perform an optimised multi-core CPU code by

up to a factor of 9× using high-end hardware available at the time. The sub-band and tree

algorithms on the GPU provided further speed-ups of 3–20×, but were found to introduce

significant smearing into the output time series due to their use of approximations. The

ability of even the direct dedispersion code to execute in one third of real-time on the

GPU suggested the possibility of using this implementation as the basis for a real-time

transient detection pipeline, which led to the work presented in Chapter 4.

Looking toward GPU-driven scientific outcomes, Chapter 4 described the development

of a complete real-time fast-radio-transient detection pipeline capable of exploiting the

power of advanced many-core computing architectures. Performance and radio-frequency

interference (RFI) mitigation were noted as the key issues to be solved, and the GPU-

implementation of the direct dedispersion algorithm from Chapter 3 was used as the basis

for the solution of the former. The algorithms comprising the remaining stages of the

pipeline were chosen by taking into consideration the need for both robust statistical meth-

ods and efficient data-parallel implementations, with the use of foundation algorithms and

algorithm-composition techniques introduced in Chapter 2 proving crucial to the work.

The additional processing performance also allowed the use of high-resolution matched fil-

tering, providing increased sensitivity. Implementations for GPU hardware were expedited

through use of the Thrust library of algorithms, which also allowed trivial retargeting of

the codebase for multi-core CPUs. RFI mitigation algorithms employed in the pipeline

included pre-processing of filterbank data to remove both narrow- and broad-band sig-

nals likely to be of terrestrial origin, and spatial discrimination of candidates based on

coincidence information from independent receiver beams.

The pipeline was demonstrated using both archival data from the High Time Resolu-


tion Universe survey and real-time observations at Parkes Observatory. Using NVIDIA

Tesla C2050 GPUs, execution time was found to remain comfortably below real-time (e.g.,

8 s of data processed in ∼4 s) for the vast majority of pointings. The exceptions, corre-

sponding to periods of extreme RFI, resulted in the triggering of a ‘bail’ condition and

returned incomplete results for the given time-segment. The system was deployed as part

of the Berkeley Parkes Swinburne Recorder back-end connected to the 20 cm Multibeam

Receiver on the 64 m Parkes radio telescope, with a web-based interface providing control

of the pipeline and visualisation of results to observers. Early results demonstrated sev-

eral powerful abilities, including live detection of individual pulses from known pulsars and

RRATs, detailed real-time monitoring of the RFI environment, and continuous quality-

assurance of recorded data. The increased sensitivity and ability to rapidly re-process

archival data also resulted in the serendipitous discovery of a new RRAT candidate in a

2009 pointing from the High Time Resolution Universe survey, which was subsequently

confirmed using the same pipeline in real-time at Parkes. A number of future projects

are now being planned for the system, including long-term RFI monitoring, base-band

capture of giant pulses and triggered inter-observatory observations of significant unique

events. The ability to detect dispersed transient events in real-time is expected to be

critical to next-generation facilities such as the Square Kilometre Array, and it is likely

that the feasibility of such endeavours will depend on the ability to exploit advanced,

massively-parallel computing architectures, making this work particularly timely.

In conclusion, this thesis has shown how a generalised approach to exploiting the power

and scalability offered by advanced computing architectures can provide paradigm-shifting

accelerations to computationally-limited astronomy problems today while also promising

to carry these same problems effortlessly through the foreseeable future of developments in

hardware technology.


Bibliography

Aarseth S. J., 1963, M.N.R.A.S., 126, 223

Abdo A. A. et al., 2010, ApJ. Supp., 187, 460

Agarwal P. K., Krishnan S., Mustafa N. H., Suresh, 2003, in In Proc. 11th European

Sympos. Algorithms, Lect. Notes Comput. Sci, Springer-Verlag, pp. 544–555

Ait-Allal D., Weber R., Dumez-Viou C., Cognard I., Theureau G., 2012, Comptes Rendus

Physique, 13, 80

Alpar M. A., Cheng A. F., Ruderman M. A., Shaham J., 1982, Nature, 300, 728

Amdahl G. M., 1967, in AFIPS ’67: Proceedings of the American Federation of Informa-

tion Processing Societies Conference, pp. 483–485

Anderson J. A., Lorenz C. D., Travesset A., 2008, Journal of Computational Physics, 227,

5342

Angulo R. E., Springel V., White S. D. M., Jenkins A., Baugh C. M., Frenk C. S., 2012,

Scaling relations for galaxy clusters in the Millennium-XXL simulation, arXiv:1203.3216

[astro-ph.CO]

Armour W. et al., 2011, A GPU-based survey for millisecond radio transients using

ARTEMIS, arXiv:1111.6399 [astro-ph.IM]

Asanovic K. et al., 2006, The landscape of parallel computing research: A view from

berkeley. Tech. Rep. UCB/EECS-2006-183, EECS Department, University of Califor-

nia, Berkeley, available at: http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/

EECS-2006-183.html

Asanovic K. et al., 2009, Communications of the ACM, 52, 56

Bagchi M., Cortes Nieves A., McLaughlin M., 2012, A search for dispersed radio bursts in

archival Parkes Multibeam Pulsar Survey data, arXiv:1207.2992 [astro-ph.HE]

Ball N. M., Brunner R. J., Myers A. D., Strand N. E., Alberts S. L., Tcheng D., 2008,

ApJ, 683, 12

Ball N. M., Brunner R. J., Myers A. D., Strand N. E., Alberts S. L., Tcheng D., Llora X.,

2007, ApJ, 663, 774

Ball N. M., Brunner R. J., Myers A. D., Tcheng D., 2006, ApJ, 650, 497


Barnes J., Hut P., 1986, Nature, 324, 446

Barr E., 2011, in American Institute of Physics Conference Series, Vol. 1357, American

Institute of Physics Conference Series, Burgay M., D’Amico N., Esposito P., Pellizzoni

A., Possenti A., eds., pp. 52–53

Barsdell B. R., Bailes M., Barnes D. G., Fluke C. J., 2012, M.N.R.A.S., 2599

Barsdell B. R., Barnes D. G., Fluke C. J., 2010, M.N.R.A.S., 408, 1936

Bate N. F., Fluke C. J., Barsdell B. R., Garsden H., Lewis G. F., 2010, New Astronomy,

15, 726

Baumgardt H., Hut P., Makino J., McMillan S., Portegies Zwart S., 2003, ApJL, 582, L21

Bedorf J., Gaburov E., Portegies Zwart S., 2012, Journal of Computational Physics, 231,

2825

Bedorf J., Portegies Zwart S., 2012, A pilgrimage to gravity on GPUs, arXiv:1204.3106

[astro-ph.IM]

Belleman R. G., Bedorf J., Portegies Zwart S. F., 2008, New Astronomy, 13, 103

Belletti F. et al., 2007, QCD on the Cell Broadband Engine, arXiv:0710.2442 [hep-lat]

Bhat N. D. R., Cordes J. M., Chatterjee S., Lazio T. J. W., 2005, Radio Science, 40, 5

Bhattacharya D., van den Heuvel E. P. J., 1991, Physics Reports, 203, 1

Blelloch G. E., 1996, Commun. ACM, 39, 85

Bohn C.-A., 1998, in In Proceedings of Int. Conf. on Compu. Intelligence and Neuro-

sciences, pp. 64–67

Bolz J., Farmer I., Grinspun E., Schrooder P., 2003, ACM Trans. Graph., 22, 917

Borne K. D., 2008, Astronomische Nachrichten, 329, 255

Boyles J. et al., 2012, The Green Bank Telescope 350 MHz Drift-scan Survey I: Survey

Observations and the Discovery of 13 Pulsars, arXiv:1209.4293

Briggs D. S., 1995, PhD thesis, New Mexico Institute of Mining and Technology

Briggs F. H., Kocz J., 2005, Radio Science, 40, 5


Brunner R. J., Kindratenko V. V., Myers A. D., 2007, in NSTC ’07: Proceedings of the

NASA Science Technology Conference

Buck I., Foley T., Horn D., Sugerman J., Fatahalian K., Houston M., Hanrahan P., 2004,

ACM TRANSACTIONS ON GRAPHICS, 23, 777

Burke-Spolaor S., Bailes M., 2010, M.N.R.A.S., 402, 855

Burke-Spolaor S., Bailes M., Ekers R., Macquart J.-P., Crawford, III F., 2011, ApJ, 727,

18

Burke-Spolaor S. et al., 2011, M.N.R.A.S., 416, 2465

Burns W. R., Clark B. G., 1969, A&A, 2, 280

Camilo F., Nice D. J., Shrauner J. A., Taylor J. H., 1996, ApJ, 469, 819

Campana-Olivo R., Manian V., 2011, in Society of Photo-Optical Instrumentation En-

gineers (SPIE) Conference Series, Vol. 8048, Society of Photo-Optical Instrumentation

Engineers (SPIE) Conference Series

Cecilia J. M., Garcia J. M., Ujaldon M., Nisbet A., Amos M., 2011, in Proceedings of the

2011 IEEE International Symposium on Parallel and Distributed Processing Workshops

and PhD Forum, IPDPSW ’11, IEEE Computer Society, Washington, DC, USA, pp.

339–346

Che S., Boyer M., Meng J., Tarjan D., Sheaffer J., Skadron K., 2008, Journal of Parallel

and Distributed Computing, 68, 1370

Clark B. G., 1980, A&A, 89, 377

Clark M. A., La Plante P. C., Greenhill L. J., 2011, Accelerating Radio Astronomy Cross-

Correlation with Graphics Processing Units, arXiv:1107.4264 [astro-ph.IM]

Cognard I., Shrauner J. A., Taylor J. H., Thorsett S. E., 1996, ApJL, 457, L81

Cohen J. M., Molemake J., 2009, in 21st International Conference on Parallel Computa-

tional Fluid Dynamics (ParCFD2009)

Colegate T. M., Clarke N., 2011, Pub. Astron. Soc. Australia, 28, 299

Cordes J. M., Kramer M., Lazio T. J. W., Stappers B. W., Backer D. C., Johnston S.,

2004, New Astronomy Reviews, 48, 1413


Cordes J. M., McLaughlin M. A., 2003, ApJ, 596, 1142

Cornwell T. J., 2004, Experimental Astronomy, 17, 329

Cui X., Chen Y., Zhang C., Mei H., 2010, in Proceedings of the 2010 IEEE 16th Interna-

tional Conference on Parallel and Distributed Systems, ICPADS ’10, IEEE Computer

Society, Washington, DC, USA, pp. 237–242

Cytowski M., Remiszewski M., Soszyski I., 2010, in Lecture Notes in Computer Science,

Vol. 6067, Parallel Processing and Applied Mathematics, Wyrzykowski R., Dongarra J.,

Karczewski K., Wasniewski J., eds., Springer Berlin Heidelberg, pp. 507–516

de Greef M., Crezee J., van Eijk J. C., Pool R., Bel A., 2009, Medical Physics, 36, 4095

Deneva J. S. et al., 2009, ApJ, 703, 2259

Dewdney P. E., Hall P. J., Schilizzi R. T., Lazio T. J. L. W., 2009, IEEE Proceedings, 97,

1482

Diewald U., Preußer T., Rumpf M., Strzodka R., 2001, Acta Mathematica Universitatis

Comenianae (AMUC), LXX, 15

Dodson R., Harris C., Pal S., Wayth R., 2010, in ISKAF2010 Science Meeting

Eatough R. P., Molkenthin N., Kramer M., Noutsos A., Keith M. J., Stappers B. W.,

Lyne A. G., 2010, M.N.R.A.S., 407, 2443

Ebisuzaki T., Makino J., Fukushige T., Taiji M., Sugimoto D., Ito T., Okumura S. K.,

1993, Proc. Astron. Soc. Japan, 45, 269

Eichenberger A. E. et al., 2005, in Proceedings of the 14th International Conference on

Parallel Architectures and Compilation Techniques, PACT ’05, IEEE Computer Society,

Washington, DC, USA, pp. 161–172

Elsen E., Vishal V., Houston M., Pande V., Hanrahan P., Darve E., 2007, N-Body Simu-

lations on GPUs, arXiv:0706.3060

Floer L., Winkel B., Kerp J., 2010, in RFI Mitigation Workshop

Fluke C. J., Barnes D. G., Barsdell B. R., Hassan A. H., 2011, Pub. Astron. Soc. Australia,

28, 15

Ford E. B., 2009, New Astronomy, 14, 406


Foster R. S., Backer D. C., 1990, ApJ, 361, 300

Fournier A., Fussell D., 1988, ACM Trans. Graph., 7, 103

Fridman P. A., Baan W. A., 2001, A&A, 378, 327

Fukushige T., Makino J., Kawai A., 2005, Proc. Astron. Soc. Japan, 57, 1009

Gaburov E., Harfst S., Portegies Zwart S., 2009, New Astronomy, 14, 630

Gaensler B. M., Madsen G. J., Chatterjee S., Mao S. A., 2008, Pub. Astron. Soc. Australia,

25, 184

Garcia V., Debreuve E., Barlaud M., 2008, in Computer Vision and Pattern Recognition

Workshops, 2008. CVPRW ’08. IEEE Computer Society Conference on, pp. 1 –6

Gold T., 1968, Nature, 218, 731

Gonnet P., 2010, in American Institute of Physics Conference Series, Vol. 1281, American

Institute of Physics Conference Series, Simos T. E., Psihoyios G., Tsitouras C., eds.,

pp. 1305–1308

Goodnight N., Woolley C., Lewin G., Luebke D., Humphreys G., 2003, in Proceedings of

the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, HWWS

’03, Eurographics Association, Aire-la-Ville, Switzerland, Switzerland, pp. 102–111

Gorbunov S., Kebschull U., Kisel I., Lindenstruth V., Muller W. F. J., 2008, Computer

Physics Communications, 178, 374

Govett M. W., Middlecoff J., Henderson T. B., Rosinski J., Madden P., 2011, AGU Fall

Meeting Abstracts, A2

Hamada T., Iitaka T., 2007, The Chamomile Scheme: An Optimized Algorithm for N-body

simulations on Programmable Graphics Processing Units, arXiv:astro-ph/0703100

Hamada T. et al., 2009, Computer Science - Research and Development, 24, 21

Hankins T. H., Rickett B. J., 1975, in Methods in Computational Physics. Volume 14 -

Radio astronomy, Alder B., Fernbach S., Rotenberg M., eds., Vol. 14, pp. 55–129

Harris C., Haines K., 2011, Pub. Astron. Soc. Australia, 28, 317

Harris C., Haines K., Staveley-Smith L., 2008, Experimental Astronomy, 22, 129


Harris M., 2005, GPU Gems 2 - Mapping Computational Concepts to GPUs, Pharr M.,

ed., Addison-Wesley Professional, pp. 493–508

Harris M., 2007, Optimizing parallel reduction in cuda. Tech. rep., avail-

able at: http://developer.download.nvidia.com/compute/cuda/1_1/Website/

projects/reduction/doc/reduction.pdf

Harris M. J., Coombe G., Scheuermann T., Lastra A., 2002, in Proceedings of the ACM

SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, HWWS ’02, Euro-

graphics Association, Aire-la-Ville, Switzerland, Switzerland, pp. 109–118

Hassan A. H., Fluke C. J., Barnes D. G., 2012, A Distributed GPU-based Framework for

real-time 3D Volume Rendering of Large Astronomical Data Cubes, arXiv:1205.0282

[astro-ph.IM]

Heuveline V., Weiß J.-P., 2009, European Physical Journal Special Topics, 171, 31

Hewish A., Bell S. J., Pilkington J. D. H., Scott P. F., Collins R. A., 1968, Nature, 217, 709

Hey T., Tansley S., Tolle K., eds., 2009, The Fourth Paradigm, Microsoft Research

Heymann F., Siebenmorgen R., 2012, ApJ, 751, 27

Hobbs G., Lyne A. G., Kramer M., Martin C. E., Jordan C., 2004, M.N.R.A.S., 353, 1311

Hoberock J., Bell N., 2010, Thrust: A parallel template library. http://www.

meganewtons.com/, version 1.6.0

Hoff, III K. E., Keyser J., Lin M., Manocha D., Culver T., 1999, in Proceedings of the

26th annual conference on Computer graphics and interactive techniques, SIGGRAPH

’99, ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, pp. 277–286

Hogbom J. A., 1974, A&AS, 15, 417

Hogden J., Vander Wiel S., Bower G. C., Michalak S., Siemion A., Werthimer D., 2012,

ApJ, 747, 141

Hopf M., Ertl T., 1999, in Proceedings of the conference on Visualization ’99: celebrating

ten years, VIS ’99, IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 471–474

Horvath Z., Liebmann M., 2010, in American Institute of Physics Conference Series, Vol.

1281, American Institute of Physics Conference Series, Simos T. E., Psihoyios G., Tsi-

touras C., eds., pp. 1789–1792


Hupca I. O., Falcou J., Grigori L., Stompor R., 2012, in Proceedings of the 2011 interna-

tional conference on Parallel Processing, Euro-Par’11, Springer-Verlag, Berlin, Heidel-

berg, pp. 355–366

Jalali B., Baumgardt H., Kissler-Patig M., Gebhardt K., Noyola E., Lutzgendorf N., de

Zeeuw P. T., 2012, A&A, 538, A19

Jia X., Gu X., Sempau J., Choi D., Majumdar A., Jiang S. B., 2010, Physics in Medicine

and Biology, 55, 3077

Johnston H. M., Kulkarni S. R., 1991, ApJ, 368, 504

Jones D. L. et al., 2012, in IAU Symposium, Vol. 285, IAU Symposium, Griffin R. E. M.,

Hanisch R. J., Seaman R., eds., pp. 340–341

Jonsson P., Primack J. R., 2010, New Astronomy, 15, 509

Kawai A., Fukushige T., Makino J., Taiji M., 2000, Proc. Astron. Soc. Japan, 52, 659

Kayser R., Refsdal S., Stabell R., 1986, A&A, 166, 36

Keane E. F., Kramer M., Lyne A. G., Stappers B. W., McLaughlin M. A., 2011,

M.N.R.A.S., 838

Keane E. F., Ludovici D. A., Eatough R. P., Kramer M., Lyne A. G., McLaughlin M. A.,

Stappers B. W., 2010, M.N.R.A.S., 401, 1057

Keith M. J. et al., 2010, M.N.R.A.S., 409, 619

Kesteven M., Hobbs G., Clement R., Dawson B., Manchester R., Uppal T., 2005, Radio

Science, 40, 5

Khanna G., 2010, International Journal of Modeling, Simulation and Scientific Computing,

01, 147

Kim J., Park C., Rossi G., Lee S. M., Gott, III J. R., 2011, Journal of Korean Astronomical

Society, 44, 217

Klessen R. S., Kroupa P., 1998, ApJ, 498, 143

Knuth D. E., 1998, The art of computer programming, 2nd edn., Vol. 3. Addison-Wesley

Longman Publishing Co., Boston, MA, USA

Kocz J., Bailes M., Barnes D., Burke-Spolaor S., Levin L., 2012, M.N.R.A.S., 420, 271


Kramer M. et al., 1999, ApJ, 520, 324

Krishnan S., Mustafa N. H., Venkatasubramanian S., 2002, in Proceedings of the thirteenth

annual ACM-SIAM symposium on Discrete algorithms, SODA ’02, Society for Industrial

and Applied Mathematics, Philadelphia, PA, USA, pp. 558–567

Langston G., Rumberg B., Brandt P., 2007, in Bulletin of the American Astronomical

Society, Vol. 39, American Astronomical Society Meeting Abstracts, p. 745

Larsen E. S., McAllister D., 2001, in Proceedings of the 2001 ACM/IEEE conference

on Supercomputing (CDROM), Supercomputing ’01, ACM, New York, NY, USA, pp.

55–55

Lattimer J. M., Prakash M., 2004, Science, 304, 536

Lengyel J., Reichert M., Donald B. R., Greenberg D. P., 1990, SIGGRAPH Comput.

Graph., 24, 327

Levoy M., 1990, ACM Trans. Graph., 9, 245

Li Y., Dongarra J., Tomov S., 2009, in Proceedings of the 9th International Conference

on Computational Science: Part I, ICCS ’09, Springer-Verlag, Berlin, Heidelberg, pp.

884–892

Lindholm E., Kilgard M. J., Moreton H., 2001, in Proceedings of the 28th annual con-

ference on Computer graphics and interactive techniques, SIGGRAPH ’01, ACM, New

York, NY, USA, pp. 149–158

Lonsdale C. J., Doeleman S. S., Oberoi D., 2004, Experimental Astronomy, 17, 345

Lorimer D. R., Bailes M., McLaughlin M. A., Narkevic D. J., Crawford F., 2007, Science,

318, 777

Lorimer D. R. et al., 2006, M.N.R.A.S., 372, 777

Lu L., Paulovicks B., Sheinin V., Perrone M., 2010, in Society of Photo-Optical Instru-

mentation Engineers (SPIE) Conference Series, Vol. 7744, Society of Photo-Optical

Instrumentation Engineers (SPIE) Conference Series

Lyne A. G. et al., 2004, Science, 303, 1153

Lyne A. G. et al., 1998, M.N.R.A.S., 295, 743


Macquart J.-P., 2011, ApJ, 734, 20

Macquart J.-P. et al., 2010, Pub. Astron. Soc. Australia, 27, 272

Magro A., Karastergiou A., Salvini S., Mort B., Dulwich F., Zarb Adami K., 2011,

M.N.R.A.S., 417, 2642

Mahabal A. et al., 2008, in American Institute of Physics Conference Series, Vol. 1082,

American Institute of Physics Conference Series, Bailer-Jones C. A. L., ed., pp. 287–293

Makino J., 1991, Proc. Astron. Soc. Japan, 43, 859

Makino J., 1996, ApJ, 471, 796

Makino J., Fukushige T., Koga M., Namura K., 2003, Proc. Astron. Soc. Japan, 55, 1163

Makino J., Funato Y., 1993, Proc. Astron. Soc. Japan, 45, 279

Manchester R. et al., 2001, M.N.R.A.S., 328, 17

Manchester R. et al., 1996, M.N.R.A.S., 279, 1235

Manchester R. N., Hobbs G. B., Teoh A., Hobbs M., 2005, AJ, 129, 1993

Mark W. R., Glanville R. S., Akeley K., Kilgard M. J., 2003, ACM Trans. Graph., 22, 896

Masuda N., Ito T., Tanaka T., Shiraki A., Sugie T., 2006, Optics Express, 14, 603

Matsakis D. N., Taylor J. H., Eubanks T. M., 1997, A&A, 326, 924

McConnell S. M., 2010, Journal of Physics Conference Series, 256, 012013

McLaughlin M. A. et al., 2006, Nature, 439, 817

Men C., Gu X., Choi D., Majumdar A., Zheng Z., Mueller K., Jiang S. B., 2009, Physics

in Medicine and Biology, 54, 6565

Mereghetti S., 2008, A&A Rev., 15, 225

Merz H., Pen U.-L., Trac H., 2005, New Astronomy, 10, 393

Michalakes J., Vachharajani M., 2008, in Parallel and Distributed Processing, 2008. IPDPS

2008. IEEE International Symposium on, pp. 1 –7

Mielikainen J., Huang B., Huang A., 2011, AGU Fall Meeting Abstracts, B6


Mignani R. P., 2011, Advances in Space Research, 47, 1281

Molnar F., Szakaly T., Meszaros R., Lagzi I., 2010, Computer Physics Communications,

181, 105

Monmasson E., Cirstea M., 2007, Industrial Electronics, IEEE Transactions on, 54, 1824

Moore G. E., 1965, Electronics, 38, 4

Moreland K., Angel E., 2003, in Proceedings of the ACM SIGGRAPH/EUROGRAPHICS

conference on Graphics hardware, HWWS ’03, Eurographics Association, Aire-la-Ville,

Switzerland, Switzerland, pp. 112–119

Mudryk L. R., Murray N. W., 2009, New Astronomy, 14, 71

Mustafa N., Koutsofios E., Krishnan S., Venkatasubramanian S., 2001, in Proceedings of

the seventeenth annual symposium on Computational geometry, SCG ’01, ACM, New

York, NY, USA, pp. 50–59

Nakasato N., Ogiya G., Miki Y., Mori M., Nomoto K., 2012, Astrophysical Particle Sim-

ulations on Heterogeneous CPU-GPU Systems, arXiv:1206.1199 [astro-ph.IM]

Neal J., Fewtrell T., Trigg M., Bates P., 2009, in EGU General Assembly Conference Ab-

stracts, Vol. 11, EGU General Assembly Conference Abstracts, Arabelos D. N., Tsch-

erning C. C., eds., p. 1464

Newton L. M., Manchester R. N., Cooke D. J., 1981, M.N.R.A.S., 194, 841

Nitadori K., Makino J., 2008, New Astronomy, 13, 498

NVIDIA Corporation, 2012, NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110. Tech. rep., available at: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf

Nyland L., Harris M., Prins J. F., 2007, GPU Gems 3 - Fast N-Body Simulation with

CUDA, Nguyen H., ed., Addison-Wesley, pp. 677–695

Nyland L., Prins J. F., Harris M., 2004, The Rapid Evaluation of Potential Fields in

N-Body Problems Using Programmable Graphics Hardware (Poster)

Ord S., Greenhill L., Wayth R., Mitchell D., Dale K., Pfister H., Edgar R. G., 2009, GPUs

for data processing in the MWA, arXiv:0902.0915


Owens J. D., Luebke D., Govindaraju N., Harris M., Kruger J., Lefohn A. E., Purcell T.,

2005, in Eurographics 2005, State of the Art Reports, pp. 21–51

Pacini F., 1968, Nature, 219, 145

Portegies Zwart S. F., Belleman R. G., Geldof P. M., 2007, New Astronomy, 12, 641

Preis T., Virnau P., Paul W., Schneider J. J., 2009, New Journal of Physics, 11, 093024

Proudfoot K., Mark W. R., Tzvetkov S., Hanrahan P., 2001, in Proceedings of the 28th

annual conference on Computer graphics and interactive techniques, SIGGRAPH ’01,

ACM, New York, NY, USA, pp. 159–170

Ransom S. M., 2001, PhD thesis, Harvard University

Richards J. W. et al., 2011, ApJ, 733, 10

Rosa F. L., Marichal-Hernandez J. G., Rodriguez-Ramos J. M., 2004, in Society of Photo-

Optical Instrumentation Engineers (SPIE) Conference Series, Vol. 5572, Society of

Photo-Optical Instrumentation Engineers (SPIE) Conference Series, Gonglewski J. D.,

Stein K., eds., pp. 262–272

Rosen R. et al., 2012, The Pulsar Search Collaboratory: Discovery and Timing of Five

New Pulsars, arXiv:1209.4108

Rumpf M., Strzodka R., 2001, in Proceedings of EG/IEEE TCVG Symposium on Visual-

ization (VisSym ’01), pp. 75–84

Sainio J., 2012, Journal of Cosmology and Astroparticle Physics, 4, 38

Sane N., Ford J., Harris A. I., Bhattacharyya S. S., 2012, Radio Science, 47, 3005

Schaaf K., Overeem R., 2004, Experimental Astronomy, 17, 287

Scherl H., Koerner M., Hofmann H., Eckert W., Kowarschik M., Hornegger J., 2007,

in Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, Vol.

6510, Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series

Schive H., Tsai Y., Chiueh T., 2010, ApJ. Supp., 186, 457

Schiwietz T., Chang T.-c., Speier P., Westermann R., 2006, in Society of Photo-Optical In-

strumentation Engineers (SPIE) Conference Series, Vol. 6142, Society of Photo-Optical

Instrumentation Engineers (SPIE) Conference Series, Flynn M. J., Hsieh J., eds., pp.

1279–1290


Schlegel D., 2012, LSST is Not ”Big Data”, arXiv:1203.0591

Schneider P., Weiss A., 1986, A&A, 164, 237

Schneider P., Weiss A., 1987, A&A, 171, 49

Serylak M., Karastergiou A., Williams C., Armour W., LOFAR Pulsar Working Group,

2012, Observations of transients and pulsars with LOFAR international stations,

arXiv:1207.0354 [astro-ph.IM]

Shara M. M., Hurley J. R., 2002, ApJ, 571, 830

Siemion A. P. V. et al., 2012, ApJ, 744, 109

Smits R., Kramer M., Stappers B., Lorimer D. R., Cordes J., Faulkner A., 2009, A&A,

493, 1161

Spitler L. G., Cordes J. M., Chatterjee S., Stone J., 2012, ApJ, 748, 73

Springel V., Yoshida N., White S. D. M., 2001, New Astronomy, 6, 79

Stappers B. W. et al., 2011, A&A, 530, A80

Staveley-Smith L. et al., 1996, Pub. Astron. Soc. Australia, 13, 243

Steinmetz M., 1996, M.N.R.A.S., 278, 1005

Stone J. M., Norman M. L., 1992, ApJ. Supp., 80, 753

Sun C., Agrawal D., El Abbadi A., 2003, in Proceedings of the 2003 ACM SIGMOD

international conference on Management of data, SIGMOD ’03, ACM, New York, NY,

USA, pp. 455–466

Sunarso A., Tsuji T., Chono S., 2010, Journal of Computational Physics, 229, 5486

Tanabe N., Ichihashi Y., Nakayama H., Masuda N., Ito T., 2009, Computer Physics

Communications, 180, 1870

Taylor J. H., 1974, A&AS, 15, 367

Thacker R. J., Couchman H. M. P., 2006, Computer Physics Communications, 174, 540

Thakar A. R., 2008, Computing in Science and Engineering, 10, 9

Thompson A. C., Fluke C. J., Barnes D. G., Barsdell B. R., 2010, New Astronomy, 15, 16


Tingay S. J. et al., 2012, The Murchison Widefield Array: the Square Kilometre Array

Precursor at low radio frequencies, arXiv:1206.6945 [astro-ph.IM]

Tomczak T., Zadarnowska K., Koza Z., Matyka M., Mirosław A., 2012, Complete PISO

and SIMPLE solvers on Graphics Processing Units, arXiv:1207.1571 [cs.DC]

Tomov S., McGuigan M., Bennett R., Smith G., Spiletic J., 2005, Computers & Graphics,

29, 71

Trendall C., Stewart A. J., 2000, in In Eurographics Workshop on Rendering, Springer,

pp. 287–298

van Meel J., Arnold A., Frenkel D., Portegies Zwart S., Belleman R., 2008, Molecular

Simulation, 34, 259–266

van Nieuwpoort R. V., Romein J. W., 2009, in ICS ’09: Proceedings of the 23rd interna-

tional conference on Supercomputing, ACM, New York, NY, USA, pp. 440–449

van Straten W., Bailes M., 2011, Pub. Astron. Soc. Australia, 28, 1

Varbanescu A., Amesfoort A., Cornwell T., Mattingly A., Elmegreen B., Nieuwpoort R.,

Diepen G., Sips H., 2008, in Lecture Notes in Computer Science, Vol. 5168, Euro-

Par 2008 Parallel Processing, Luque E., Margalef T., Bentez D., eds., Springer Berlin

Heidelberg, pp. 749–762

Verbiest J. P. W. et al., 2009, M.N.R.A.S., 400, 951

Vladimirov A., 2012, Arithmetics on Intel's Sandy Bridge and Westmere CPUs: not all FLOPs are created equal. Tech. rep., available at: http://research.colfaxinternational.com/post/2012/04/30/FLOPS.aspx

Volkov V., Demmel J. W., 2008, in Proceedings of the 2008 ACM/IEEE conference on

Supercomputing, SC ’08, IEEE Press, Piscataway, NJ, USA, pp. 31:1–31:11

von Hoerner S., 1960, Zeitschrift fur Astrophysik, 50, 184

Wambsganss J., 1990, PhD thesis, Ludwig-Maximilians-Univ., Munich (Germany, F. R.), Fakultät für Physik

Wambsganss J., 1999, Journal of Computational and Applied Mathematics, 109, 353

Wang P., Abel T., Kaehler R., 2010, New Astronomy, 15, 581


Wayth R. B., Greenhill L. J., Briggs F. H., 2009, Pub. Astron. Soc. Pacific, 121, 857

Williams R. D., Seaman R., 2006, in Astronomical Society of the Pacific Conference Series,

Vol. 351, Astronomical Data Analysis Software and Systems XV, Gabriel C., Arviset

C., Ponz D., Enrique S., eds., p. 637

York D. G. et al., 2000, AJ, 120, 1579

Zhang S., Royer D., Yau S.-T., 2006, Optics Express, 14, 9120


Appendix A: Chapter 3 Appendix

A.1 Error analysis for the tree dedispersion algorithm

Here we derive an expression for the maximum error introduced by the use of the piecewise

linear tree dedispersion algorithm.

The deviation of a function $f(x)$ from a linear approximation between $x = x_0$ and $x = x_1$ is bounded by
$$
\varepsilon_f \leq \frac{1}{8} (x_1 - x_0)^2 \max_{x_0 \leq x \leq x_1} \left| \frac{\mathrm{d}^2}{\mathrm{d}x^2} f(x) \right| , \tag{A.1}
$$

which shows that the error is proportional to the square of the step size and the second

derivative of the function. For the dedispersion problem, the second derivative of the delay

function with respect to frequency is given by

$$
\frac{\partial^2}{\partial \nu^2} \Delta t(d, \nu) = \mathrm{DM}(d)\, \frac{\mathrm{d}^2}{\mathrm{d}\nu^2} \Delta T(\nu) \tag{A.2}
$$
$$
= 6\, \mathrm{DM}(d)\, \frac{k_{\mathrm{DM}}\, \Delta\nu^2}{\nu_0^4} \left( 1 + \frac{\Delta\nu}{\nu_0}\, \nu \right)^{-4} , \tag{A.3}
$$

which has greater magnitude at lower frequencies. Evaluating at the lowest frequency in

the band, $\nu = N_\nu$, and substituting into equation (A.1) along with the sub-band size $N'_\nu$,

one finds the error to be bounded by:

$$
t_{\mathrm{tree}} \equiv \varepsilon_{\Delta t} \leq \frac{3}{4}\, \mathrm{DM}\, \frac{k_{\mathrm{DM}}}{\nu_0^2} \left( \frac{N'_\nu}{N_\nu} \right)^2 \frac{\lambda^2}{(1 + \lambda)^4} , \tag{A.4}
$$

where $\lambda \equiv (\Delta\nu / \nu_0)\, N_\nu$ is a proxy for the fractional bandwidth, a measure of the width of the

antenna band.

If the smearing as a result of using the direct algorithm is quantified as the effective


width, W , of an observed pulse, then the piecewise linear tree algorithm is expected to

produce a signal with an effective width of

$$
W_{\mathrm{tree}} = \sqrt{W^2 + t_{\mathrm{tree}}^2} , \tag{A.5}
$$

giving a relative smearing of

$$
\mu_{\mathrm{tree}} \equiv \frac{W_{\mathrm{tree}}}{W} = \frac{\sqrt{W^2 + t_{\mathrm{tree}}^2}}{W} . \tag{A.6}
$$

In contrast to the use of a piecewise linear approximation, the use of a change of frequency

coordinates (‘frequency padding’) to linearise the dispersion trails results in no additional

sources of smear.
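For illustration, the following sketch simply evaluates equations (A.4) and (A.6). All parameter values are hypothetical examples, loosely in the style of a 20 cm observing band rather than taken from any specific survey, and the dispersion constant is taken as $k_{\mathrm{DM}} \approx 4.15 \times 10^3$ s MHz$^2$ cm$^3$ pc$^{-1}$.

#include <cmath>
#include <cstdio>

int main()
{
    // All values below are hypothetical examples, loosely in the style of a
    // 20 cm observing band; they are not taken from any specific survey setup.
    const double k_dm   = 4.15e3;    // dispersion constant [s MHz^2 cm^3 pc^-1]
    const double nu0    = 1581.8;    // reference channel frequency [MHz]
    const double dnu    = 0.390625;  // channel width [MHz]
    const double n_chan = 1024.0;    // total number of channels, N_nu
    const double n_sub  = 32.0;      // channels per sub-band, N'_nu
    const double dm     = 100.0;     // trial dispersion measure [pc cm^-3]
    const double width  = 1.0e-3;    // pulse width after direct dedispersion [s]

    // Equation (A.4): bound on the smearing from the piecewise-linear tree algorithm.
    const double lambda = dnu / nu0 * n_chan;  // fractional-bandwidth proxy
    const double t_tree = 0.75 * dm * k_dm / (nu0 * nu0)
                          * std::pow(n_sub / n_chan, 2.0)
                          * lambda * lambda / std::pow(1.0 + lambda, 4.0);

    // Equation (A.6): smearing relative to the direct algorithm.
    const double mu_tree = std::sqrt(width * width + t_tree * t_tree) / width;

    std::printf("t_tree <= %.3g s, mu_tree <= %.4f\n", t_tree, mu_tree);
    return 0;
}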

A.2 Error analysis for the sub-band dedispersion algorithm

Here we derive an expression for the maximum error introduced by the use of the sub-band

dedispersion algorithm.

The smearing introduced into a dedispersed time series due to an approximation to

the dispersion curve is bounded by the maximum temporal deviation of the approximation

from the exact curve. The maximum change in delay across a sub-band is $\Delta t(\mathrm{DM}, N_\nu) - \Delta t(\mathrm{DM}, N_\nu - N'_\nu)$; the difference in this value between two nominal DMs then gives the

smearing time:

$$
t_{\mathrm{SB}} \leq \Delta\mathrm{DM}_{\mathrm{nom}} \left[ \Delta T(N_\nu) - \Delta T(N_\nu - N'_\nu) \right] \tag{A.7}
$$
$$
= N'_{\mathrm{DM}}\, \Delta\mathrm{DM}\, \frac{k_{\mathrm{DM}}}{\nu_0^2} \left[ -2\, \frac{N'_\nu}{N_\nu}\, \frac{\lambda}{(1 + \lambda)^3} + \mathcal{O}\!\left( \left( \frac{N'_\nu}{N_\nu} \right)^2 \right) \right] , \tag{A.8}
$$

where the second form is obtained through a Taylor expansion in powers of $N'_\nu / N_\nu$ around zero.

Note that this derivation assumes the dispersion curve is approximated by aligning the

’early’ edge of each sub-band. An alternative approach is to centre the sub-bands on the

curve, which reduces the smearing by ∼ 2× but adds complexity to the implementation.

As with the tree algorithm, we can define the relative smearing of the sub-band algo-

rithm with respect to the direct algorithm as

$$
\mu_{\mathrm{SB}} \equiv \frac{W_{\mathrm{SB}}}{W} = \frac{\sqrt{W^2 + t_{\mathrm{SB}}^2}}{W} , \tag{A.9}
$$


where, as before, W is the effective width of an observed pulse after direct dedispersion.
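A corresponding sketch for the sub-band bound evaluates the magnitude of the leading-order term of equation (A.8) together with equation (A.9); again, all parameter values, including the DM step and the number of DM trials per nominal DM, are hypothetical examples only.

#include <cmath>
#include <cstdio>

int main()
{
    // Same hypothetical band parameters as the tree-algorithm sketch above.
    const double k_dm    = 4.15e3;    // dispersion constant [s MHz^2 cm^3 pc^-1]
    const double nu0     = 1581.8;    // reference channel frequency [MHz]
    const double dnu     = 0.390625;  // channel width [MHz]
    const double n_chan  = 1024.0;    // N_nu
    const double n_sub   = 32.0;      // N'_nu
    const double width   = 1.0e-3;    // pulse width after direct dedispersion [s]

    // Hypothetical DM-trial parameters.
    const double d_dm     = 0.1;      // DM step between trials [pc cm^-3]
    const double n_dm_sub = 32.0;     // DM trials per nominal DM, N'_DM

    // Magnitude of the leading-order term of equation (A.8).
    const double lambda = dnu / nu0 * n_chan;
    const double t_sb = n_dm_sub * d_dm * k_dm / (nu0 * nu0)
                        * 2.0 * (n_sub / n_chan) * lambda
                        / std::pow(1.0 + lambda, 3.0);

    // Equation (A.9): smearing relative to the direct algorithm.
    const double mu_sb = std::sqrt(width * width + t_sb * t_sb) / width;

    std::printf("t_SB <= %.3g s, mu_SB <= %.4f\n", t_sb, mu_sb);
    return 0;
}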