Advanced Architectures for Astrophysical Supercomputing
Benjamin R. Barsdell
Presented in fulfillment of the requirements
of the degree of Doctor of Philosophy
2012
Faculty of Information and Communication Technology
Swinburne University
Abstract
Modern astronomy has come to depend on the exponential progress of computing
technology. In recent times, however, processor hardware has undergone a dramatic shift
away from the traditional, sequential model of computing toward a new, massively parallel
model. Due to the preponderance of unparallelised software in astronomy, this develop-
ment poses significant challenges for the community. This thesis explores the substantial
benefits offered by advanced ‘many-core’ computing architectures and advocates a pow-
erful, general approach to their use; these concepts are then put into practice to achieve
new science outcomes in the field of pulsar astronomy.
We begin by developing a methodology for tackling the challenges of massively-parallel
computing based on the analysis of algorithms. Simple analysis techniques are shown to
provide deep insight into both the suitability of particular problems for advanced architec-
tures as well as the optimal implementation approach when targeting such hardware. The
method is applied to four well-known astronomy applications, highlighting their scalabil-
ity and resulting in the rapid identification of potential speed-ups from cheaply-available
many-core devices. The hardware- and software-independent nature of our approach
means that, like a mathematical proof, such results remain valid in perpetuity.
Building on this foundation, we then consider in more detail the process of incoherent
dedispersion, a computationally-intensive problem at the heart of surveys for fast radio
transients. Three different dedispersion algorithms are analysed and implemented for a
particular form of many-core hardware, the graphics processing unit (GPU), and speed-
ups of up to nine times are obtained when compared to an efficient multi-core CPU
implementation. The GPU-based direct dedispersion code is shown to enable processing
of data from the High Time Resolution Universe (HTRU) survey, currently ongoing at the
CSIRO Parkes 64 m radio telescope in New South Wales, Australia, at a rate three times
faster than real time.
We look toward GPU-driven scientific outcomes by developing a real-time fast-radio-
transient detection pipeline capable of exploiting many-core computing architectures. Our
GPU dedispersion code is combined with new data-parallel and statistically robust im-
plementations of algorithms for radio-frequency-interference (RFI) mitigation, baseline
removal, normalisation, matched filtering and event detection to form a complete system
capable of sustained real-time operation. The pipeline is demonstrated using both archival
data from the HTRU survey and real-time observations at Parkes Observatory, where it
has been deployed as part of the Berkeley Parkes Swinburne Recorder back-end. Early
results demonstrate several key abilities, including live detection of individual pulses from
known pulsars and rotating radio transients (RRATs), detailed real-time monitoring of
the RFI environment, and continuous quality-assurance of recorded data. The increased
sensitivity and ability to rapidly re-process archival data also resulted in the discovery of
a new RRAT in a 2009 pointing from the HTRU survey, which we have confirmed using
the pipeline in real-time at Parkes.
We conclude that our generalised, algorithm-centric approach offers a prudent path
through the challenges posed by advanced architectures, and that exploiting the power
and scalability of such hardware can and does provide paradigm-shifting accelerations to
computationally-limited astronomy problems.
Acknowledgements
This thesis would not exist, and I would not be where I am, without the help of many
people.
I would first like to thank my supervisors David Barnes, Chris Fluke and Matthew
Bailes. I am indebted to David and Chris for daring to explore this unique and exciting
topic, and I thank them deeply for their guidance through both the good times and the
hard; their enthusiasm and unique perspectives were invaluable. I especially thank Chris
for the many weekly meetings that kept me on track even when my work fell outside
of his expertise. I also owe a great deal of gratitude to Matthew for accepting the lead
supervisory role mid-way through my term and for providing me with the opportunity to
apply my work to a rich and exciting field of discovery. I am extremely grateful for the
wisdom imparted upon me by all three supervisors during our many conversations over
the past forty-five months.
My eternal thanks go to Catarina, my parents and my sister for their unwavering
support and excellent advice when things didn’t go according to plan. Their refreshing
perspectives always showed me the bright side of any situation and kept me motivated
through to the end.
A huge thanks goes to all the members of the pulsar group, Matthew, Willem, Ramesh,
Andrew, Jonathon, Sarah, Lina, Stefan and Paul, for embracing me as a member and for
teaching me the many ways of the neutron star. Pulsar coffee was always one of my
favourite times of the week, and I will forever be thankful to have been part of such a
close group both personally and professionally.
Special thanks go to: Amr Hassan for standing beside me at the outskirts of what
some would call ‘normal’ astronomy topics; Max Bernyk, Georgios Vernados, Juan Madrid,
Anna Sippel, Guido Moyano Loyola and the other affiliates of the SciVis group for thought-
provoking discussions and presentations; Paul Coster for many enlightening discussions
and for testing my (bug-ridden) code; Jarrod Hurley for his ongoing support and for
taking me along for the amazing experience of observing at Keck; Willem van Straten for
his help when my supervisors were away (as well as when they weren’t); Andrew Jameson
for putting up with my code and spending long hours deploying and debugging it at Parkes;
Nick Bate, Alister Graham, Darren Croton, Chris Flynn and Felipe Marin for introducing
me to fascinating new fields of study and potential GPU applications; all the CAS soccer
players who joined me in the park for (almost) all of the 180 weeks I was here; Gin Tan and
Simon Forsayeth for fixing all of my computer issues; Carolyn Cliff, Elizabeth Thackray,
Mandish Webb and Sharon Raj for dealing with my often poor admin skills; and Luke
Hodkinson for introducing me to Emacs, with which this thesis and virtually all of the
source code that went into it were written.
Finally, I would like to thank all of the other students, postdocs and staff whose paths
I crossed during my time at Swinburne. I had a huge amount of fun here thanks to you
all, and I sincerely hope that our paths intersect again in the future.
Declaration
The work presented in this thesis has been carried out in the Centre for Astrophysics
& Supercomputing at the Swinburne University of Technology between 2008 and 2012.
This thesis contains no material that has been accepted for the award of any other degree
or diploma. To the best of my knowledge, this thesis contains no material previously
published or written by another author, except where due reference is made in the text
of the thesis. All work presented is primarily that of the author with the exception of the
two opening paragraphs of Section 2.3.1, which were written by Christopher Fluke, and
the CPU-based dedispersion code benchmarked in Chapters 3 and 4, which was written by
Matthew Bailes. The content of the chapters listed below has appeared in refereed jour-
nals. Alterations have been made to the published papers in order to maintain argument
continuity and consistency of spelling and style.
• Chapter 2 has been published as Barsdell, Barnes & Fluke (2010)
• Chapter 3 has been published as Barsdell et al. (2012)
Benjamin Robert Barsdell
Melbourne, Australia
2012
Dedicated to my parents Mark and Susan,
and to my sister Wendy.
Contents
Abstract i
Acknowledgements iii
Declaration v
List of Figures ix
List of Tables xii
1 Introduction 1
1.1 Astrophysical supercomputing . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Advanced architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Central processing units . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.2 Graphics processing units . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.3 Other accelerator cards . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3 Advanced architectures in astronomy . . . . . . . . . . . . . . . . . . . . . . 18
1.4 Purpose of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.5 Advanced architectures meet pulsar astronomy . . . . . . . . . . . . . . . . 20
1.5.1 History and characteristics of pulsars . . . . . . . . . . . . . . . . . . 20
1.5.2 Pulsar observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.5.3 Pulsar astronomy and advanced architectures . . . . . . . . . . . . . 25
1.6 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2 A Generalised Approach to Many-core Architectures for Astronomy 27
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 A Strategic Approach: Algorithm Analysis . . . . . . . . . . . . . . . . . . . 28
2.2.1 Principal characteristics . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.2 Complexity analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.3 Analysis results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.2.4 Global analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3 Application to Astronomy Algorithms . . . . . . . . . . . . . . . . . . . . . 36
2.3.1 Inverse ray-shooting gravitational lensing . . . . . . . . . . . . . . . 36
2.3.2 Högbom CLEAN . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.3.3 Volume rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.3.4 Pulsar time-series dedispersion . . . . . . . . . . . . . . . . . . . . . 42
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3 Accelerating Incoherent Dedispersion 47
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2 Direct Dedispersion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2.2 Algorithm analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.3 Implementation Notes . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3 Tree Dedispersion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.2 Algorithm analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.3.3 Implementation Notes . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.4 Sub-band dedispersion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.4.2 Algorithm analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.4.3 Implementation notes . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.5.1 Smearing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.5.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.6.1 Comparison with other work . . . . . . . . . . . . . . . . . . . . . . 77
3.6.2 Code availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4 Fast-Radio-Transient Detection in Real-Time with GPUs 81
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.2 The pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.2.1 RFI mitigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.2.2 Incoherent dedispersion . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.2.3 Baseline removal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.2.4 Normalisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.2.5 Matched filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.2.6 Event detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.2.7 Event merging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.2.8 Candidate classification and multibeam coincidence . . . . . . . . . 94
4.2.9 Deployment at Parkes Radio Observatory . . . . . . . . . . . . . . . 95
4.2.10 Visualisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.2.11 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.3.1 Discovery of PSR J1926–13 . . . . . . . . . . . . . . . . . . . . . . . 102
4.3.2 Giant pulses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.3.3 RFI monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.3.4 Quality assurance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5 Future Directions and Conclusions 115
5.1 Future directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.1.1 The future evolution of GPUs . . . . . . . . . . . . . . . . . . . . . . 116
5.1.2 Prospects for astronomy applications . . . . . . . . . . . . . . . . . . 120
5.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Bibliography 125
A Chapter 3 Appendix 139
A.1 Error analysis for the tree dedispersion algorithm . . . . . . . . . . . . . . . 139
A.2 Error analysis for the sub-band dedispersion algorithm . . . . . . . . . . . . 140
List of Figures
1.1 Clock-rate versus core-count phase space of Moore’s Law. . . . . . . . . . . 3
1.2 Schematic of the programming model for recent NVIDIA GPUs. . . . . . . 12
1.3 Sample of the known pulsars plotted in P–Ṗ space. . . . . . . . . . . . . . 22
2.1 Representative memory access patterns indicating varying levels of locality
of reference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.2 A schematic view of divergent execution within a SIMD architecture. . . . . 33
3.1 Illustration of a dispersion trail and its corresponding dispersion transform. 49
3.2 Visualisation of the tree dedispersion algorithm. . . . . . . . . . . . . . . . . 56
3.3 Signal degradation and performance results for the piecewise linear tree
algorithm compared to the direct dedispersion algorithm. . . . . . . . . . . 69
3.4 Signal degradation and performance results for the sub-band algorithm com-
pared to the direct dedispersion algorithm. . . . . . . . . . . . . . . . . . . 70
4.1 Flow-chart of the key processing operations in our transient detection pipeline. 85
4.2 Results overview plots from our transient pipeline for an archived pointing
in the HTRU survey containing a new rotating radio transient candidate. . 97
4.3 Plot showing the break-down of execution times during each gulp for dif-
ferent parts of the transient pipeline. . . . . . . . . . . . . . . . . . . . . . . 100
4.4 Plot showing the variation of execution times for different parts of the tran-
sient pipeline as a function of the gulp size. Here all stages of the pipeline
are executed on the GPU. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.5 Plot showing the variation of execution times for different parts of the tran-
sient pipeline as a function of the gulp size. Here all stages of the pipeline
but dedispersion are executed on the CPUs using 3 cores. . . . . . . . . . . 103
4.6 Results overview plots from our transient pipeline for a confirmation point-
ing of the rotating radio transient candidate shown in Fig. 4.2 . . . . . . . 105
4.7 Results overview plots from the pipeline during a timing observation of the
millisecond pulsar PSR J1022+1001 showing the detection of a number of
strong pulses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.8 Results overview plots from the pipeline for a pointing containing strong
bursts of radio-frequency interference. . . . . . . . . . . . . . . . . . . . . . 109
5.1 Trends in theoretical peak GPU memory bandwidth and compute perfor-
mance over the last five years. . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.2 Trends in GPU core count and critical arithmetic intensity over the last five
years. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
List of Tables
1.1 Summary of advanced architectures. Numbers are indicative only. See main
text for acronym definitions. . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Analysis of four foundation algorithms . . . . . . . . . . . . . . . . . . . . . 34
3.1 Summary of host↔GPU memory copy times during dedispersion. . . . . . . 74
3.2 Timing comparisons for direct GPU dedispersion of the ‘toy observation’
defined in Magro et al. (2011). . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.1 Properties of the discovered RRAT. . . . . . . . . . . . . . . . . . . . . . . 104
1 Introduction
If I had asked people what they wanted, they would have said
faster horses.
—Henry Ford
1.1 Astrophysical supercomputing
Computing resources are a fundamental component of modern astronomy. Computers
are used in the acquisition, reduction, analysis, simulation and visualisation of virtually
all astronomical data. The increase in processing power that has followed Moore’s Law
(Moore, 1965) since the mid 1960s has opened up vast new avenues of research that
would not otherwise have been possible. Take, for example, simulations of gravitating
bodies: first performed in the 1960s with up to 100 particles (von Hoerner, 1960; Aarseth,
1963), 50 years of evolution in computing saw this number increase by more than nine
orders of magnitude (Kim et al., 2011; Angulo et al., 2012). On the observational front,
contemporary projects such as the Sloan Digital Sky Survey (Thakar, 2008; York et al.,
2000) have been made possible only by the advances in instrumentation and computing
necessary to capture and process vast quantities of data; future projects such as the Square
Kilometre Array (Cornwell, 2004; Dewdney et al., 2009) will push these requirements
even further, demanding nothing short of world-class supercomputing facilities. While
algorithmic developments also play a crucial role in these applications, it is the unwavering
trend in computing power—doubling every two years for fixed cost—that has carried
computational astronomy to where it is today.
There is, however, more to this story than it first appears, the key to which lies in
the term ‘computing power’. Gordon Moore’s 1965 observation actually used the much
more specific ‘cost per component’, referring to the manufacturing cost of transistors.
Thus, being precise, the fundamental trend that has held strong for more than 45 years is
the halving of the minimum cost per transistor every two years. The importance of this
pedantry is that increasing the number of transistors at fixed cost does not necessarily
translate into increasing ‘computing power’.
For the majority of the past 45 years, most computer software has been able to re-
main blissfully ignorant of the true nature of Moore’s Law. In addition to the doubling
in the number of transistors, computers’ central processing units (CPUs) also exhibited a
doubling in clock-rate every two years. The beauty of this was that software that ran at
a particular speed would, two years later, run at twice that speed (on a new computer of
similar cost) with no extra effort. Unfortunately, this all changed around 2005. As clock
rates were increased, power consumption and heat generation also rose, and eventually a
point was reached where the excess heat could not be effectively dissipated. The result
was that clock-rates could not be stably pushed very far beyond 3 GHz. In response, hard-
ware manufacturers turned to another means of increasing performance: placing multiple
processors (or cores) on a single chip.
Figure 1.1 plots processors from the last ∼20 years in clock-rate versus core-count
phase space. In this space, the evolution of CPUs turns a ‘corner’ around 2005 when
clock rates plateaued and multi-core processors emerged. It now appears that the old
trend of a doubling in clock-rates has been replaced by a similar trend in the number
of cores. Projecting forward, this implies a future where processors exhibit ‘many-core’
architectures containing 100s or 1000s of cores.
The replacement of clock-rate with core-count as a means of increasing processor per-
formance has ensured that Moore’s Law will continue to be a useful driving force for hard-
ware manufacturers. However, on the software front there are significant consequences.
Most software is composed of sequential codes, which execute instructions one after the
other. In a multi-core environment, such codes will experience no direct performance gain
from the presence of multiple processing cores, forever remaining limited by the clock-
rate. The only way to take advantage of the new paradigm in processor architectures is
to (re-)write codes to exploit scalable parallelism.
The dependence of modern astronomy on high-performance computing makes adapting
to these changes particularly important. While some astronomy codes have already made
the transition to multi-core processing (e.g., Merz, Pen & Trac 2005; Thacker & Couchman
2006; Mudryk & Murray 2009), many legacy codes are still in use, and performance-limited
software is still often written without parallelism in mind. Furthermore, it is unlikely that
Figure 1.1 Clock-rate versus core-count phase space of Moore’s Law binned every two years for CPUs (circles) and GPUs (diamonds). There is a general trend for performance to increase from bottom left to top right.
all approaches to multi-core parallelism will scale effectively to many-core architectures.
It is worth noting that, in addition to core count and clock speed, there is another
dimension at play in processor performance: memory bandwidth. This property defines
the rate at which data can be read from or written to memory, and can play a critical role
in some applications. Codes that perform only a small number of arithmetic operations
per element of data accessed can become limited by the system’s memory bandwidth
before they are able to saturate the arithmetic capabilities of the hardware (this concept
is discussed in more detail in Chapter 2). In recent years, memory bandwidth has not
kept pace with progress in arithmetic performance, and an increasing number of codes are
finding data-access the ultimate bottleneck. However, many of the most time-consuming
processes in computational astronomy remain heavily reliant on arithmetic performance,
and for these applications memory bandwidth is often not a concern.
While the hardware landscape has experienced a fundamental change in the design of
CPUs, the last five years has also seen the rise in popularity of completely new hardware
architectures for solving computationally intensive problems. These advanced architec-
tures offer very high performance for certain problems at lower monetary and power costs
than traditional CPUs, and potentially hold the key to carrying astronomy computations
through the next decade of Moore’s Law. However, they also pose significant challenges to
existing software development paradigms. The next section details the history, hardware
architecture and programming models of these devices.
1.2 Advanced architectures
The mainstay of modern computing has been the central processing unit, which is tasked
with everything from running an operating system and loading web pages to executing
complex numerical simulations. However, CPUs are only one application of Moore’s Law,
and only one way of tackling scientific computations. In recent years a number of alterna-
tive hardware architectures have been released into the high-performance computing mar-
ket. By trading flexibility for performance, these products are often able to out-perform
CPUs at compute-intensive tasks by an order of magnitude or more. They also offer a
glimpse of a possible future for all high-performance computing hardware. A summary of
the architectures discussed in this section is presented in Table 1.1.
While many-core central processing units are not yet a reality, modern graphics pro-
cessing units (GPUs) already contain 100s of cores (see Figure 1.1). In recent years, GPUs
have undergone a shift from a highly specialised graphics-oriented architecture to a flex-
ible general-purpose computing platform. The results of this evolution will be discussed
Table 1.1 Summary of advanced architectures. Numbers are indicative only. See main text for acronym definitions.

Architecture  Hardware                                     Peak speed    Price
CPU           2–8 cores, vector registers, 3-level cache   224 GFLOP/s   US$1700
GPU           500–2500 cores, 2-level cache                2592 GFLOP/s  US$2500
GRAPE         Many cores, ‘hard-wired’ force calculations  131 GFLOP/s   US$6000
Clearspeed    192 SIMD cores, 2-level cache                96 GFLOP/s    US$3000
Cell BE       Heterogeneous, 8 cores, vector registers     180 GFLOP/s   US$8000
Xeon Phi      50 cores, vector registers, 2-level cache    Unknown       US$2000

Architecture  Software                                     Power    Applications
CPU           TBB, OpenMP, Intel MKL                       130 W    Many
GPU           CUDA, OpenCL, OpenACC                        250 W    Many
GRAPE         Custom API                                   15 W     N-body
Clearspeed    Cn                                           9 W      Few
Cell BE       OpenMP, OpenCL, vector intrinsics            210 W    Many
Xeon Phi      OpenMP, MPI, OpenCL, TBB, Intel MKL          Unknown  Many
further in Section 1.2.2 (see also Owens et al. 2005 for an early review).
Other processors designed to accelerate specific or certain classes of computational
problems have also appeared on the market. These range from hard-wired chips dedicated
to evaluating Newton’s law of gravitation, to heterogeneous architectures combining se-
quential and parallel processing performance to speed-up a wide variety of applications.
These devices are discussed in detail in Section 1.2.3. Note that we do not include a
discussion of field programmable gate arrays (FPGAs), which lie outside the scope of this
work due to their complex programming environment and minimal use in mainstream
computing (see, e.g., Monmasson & Cirstea 2007).
Before discussing new architectures, a brief review of current CPU designs and pro-
gramming models is presented in Section 1.2.1.
1.2.1 Central processing units
Hardware architecture
As described in Section 1.1, current-generation CPUs exhibit multi-core designs, typically
containing between two and eight full-function cores. In addition to this form of parallelism, CPUs also contain vector registers within each core that allow multiple values to be operated on simultaneously in a single instruction, multiple data (SIMD) fashion. Previous generations used the Streaming SIMD Extensions (SSE) instruction set, which provided access to 128-bit registers allowing four (two) single-precision (double-precision) floating-point values to be operated on simultaneously per core. Current CPUs now use
the Advanced Vector Extensions (AVX), which provide twice the vector width at 256 bits. When all vector registers and cores are employed, modern CPUs can perform up to 224 billion single-precision floating-point operations per second (GFLOP/s) (Vladimirov, 2012)¹.
Most modern CPUs also exhibit three hardware-managed cache levels, allowing low-
latency memory access and fast data-sharing between cores; the maximum bandwidth to
main memory is around 50 GB/s. Current server-class CPUs cost around US$1700 and
consume up to 130 W, giving them monetary and power efficiencies of 0.132 GFLOP/s/$
and 1.72 GFLOP/s/W respectively².
Programming models
Development for CPUs is most commonly approached using optimising compilers, which
can in some cases automatically vectorise sequential codes into SSE/AVX instructions.
Alternatively, low-level SSE/AVX instructions can be used directly by the developer to
ensure optimal use of the hardware, although this adds significant development complexity. Multiple cores can be exploited through the use of multi-threading libraries such as Threading Building Blocks (TBB)³ (where parallel processing threads are managed explicitly), directive-based approaches like OpenMP⁴ (where parallel processing threads are managed implicitly), or pre-optimised maths libraries such as the Intel Math Kernel Library (MKL)⁵ (where parallel processing is hidden completely from the developer).
1.2.2 Graphics processing units
History
Graphics processing units (GPUs) first appeared as physical co-processors to regular CPUs
in the 1980s. Their development was driven by the rise in popularity of graphical user
interfaces, which often demanded significantly more computational power than the rest of
a computer’s operating system. Moving these graphics operations to a GPU promised to
free up the CPU to focus on traditional computing tasks, providing a better overall user
experience. However, with a fixed transistor (or dollar) budget, simply moving compu-
tations from one processor to another would provide little or no benefit. The key to the
¹ 224 GFLOP/s = 1 operation × 8 AVX vector slots × 8 cores × 3.5 GHz
² http://ark.intel.com/products/64583/Intel-Xeon-Processor-E5-2680
³ http://threadingbuildingblocks.org/
⁴ http://www.openmp.org/
⁵ http://software.intel.com/en-us/articles/intel-mkl/
success of this approach was that graphics computations are algorithmically distinct from
many traditional computations.
Non-graphics compute tasks are typically heterogeneous, branch-heavy and sequential;
for example, editing a document or loading a web page involves a melange of computational
tasks and a huge number of logical decisions, many of which must be made in progression.
The design of CPUs reflects this workload: CPU hardware is characterised by very fast
sequential performance, large, deep cache hierarchies and branch prediction capabilities.
In stark contrast, graphics tasks are often highly homogeneous, branch-free and parallel.
Applying an operation to an image comes down to applying the operation to each pixel
independently, and rendering a 3D scene involves independently transforming the vertices
of polygons and texturing the corresponding pixels that are projected onto the screen. The
hardware architecture of a GPU is thus characterised by parallel, homogeneous processing
capabilities.
While GPUs have differed fundamentally from CPUs since their very first incarnation,
their design has itself evolved significantly over the past three decades. This evolution
has been driven by the combination of Moore’s Law and the ever-increasing demands
of graphics-based software such as computer aided design applications, image and video
editing tools and video games. Up until the end of the 1990s, GPUs contained only fixed-function hardware for computing different parts of the rendering pipeline—most commonly rasterisation and texture-mapping of pixels in polygons. With the desire for more flexibility
in the rendering process (primarily from the video-games industry), the new millennium
saw the appearance of simple programmable capabilities on the most popular GPUs. This
functionality allowed developers to write small shader programs that would be executed on
the hardware to transform the properties of either polygon vertices (via a vertex shader)
or pixels (via a pixel shader). This new flexibility was rapidly adopted by the graphics
programming community, and continued demand led to further improvements throughout
the 2000s, providing more shader processors per chip and enabling longer, more complex
shader programs to be written.
Another important step in the evolution of GPUs came in 2006 with the release of
devices exhibiting a unified shader architecture. This new design replaced the use of
separate vertex and pixel shader units with unified shaders capable of performing both
roles (as well as new functions entirely). One advantage of this design was the ability to
maintain high efficiency even when the use of one type of shader program greatly exceeded
that of another. A much more significant advantage, however, was the ability to perform a
more general set of computations. In 2007 NVIDIA6 released its Compute Unified Device
Architecture (CUDA), a platform for general purpose computation on GPUs (GPGPU),
and with it opened up GPUs to a new world of applications7. The driving force in GPU
design has since undergone a shift from purely graphics applications to a combination of
graphics and general-purpose demands. Today, GPUs exhibit general-purpose features
such as cache hierarchies, fast double-precision floating-point support, atomic operations
and dynamic parallelism, making them applicable to a wide range of parallel computing
problems. Modern graphics hardware also contains thousands of computing cores (unified
shader units), resulting in peak performance of over a trillion floating-point operations
per second (FLOP/s). This compute performance, around an order of magnitude greater
than a similarly-priced CPU, has led to huge levels of interest from the scientific and
high-performance computing (HPC) communities.
Hardware architecture
In this section the recently-announced NVIDIA Kepler K20 GPU will be used as an
example of a cutting-edge GPU architecture (NVIDIA Corporation, 2012). This GPU
connects to the PC via the Peripheral Component Interconnect (PCI) Express bus v3.0,
which provides bidirectional data transfer at rates of up to 16 GB/s. This bus allows the
GPU to communicate with the main system memory as well as other devices on the PCI
Express bus (e.g., other GPUs, network interface cards etc.). The GPU also has its own
main memory, which can be accessed with a bandwidth of up to 288 GB/s (significantly
exceeding typical CPU memory bandwidth of up to 50 GB/s). Attached to the main
memory is 1.5 MB of level two (L2) cache, which serves to provide efficient data access
and sharing between processing units on the device.
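A back-of-the-envelope calculation makes the implication of these bandwidth figures concrete: moving data across the PCIe bus is an order of magnitude slower than reading it from the GPU's own memory. The following Python sketch uses only the figures quoted above and is illustrative rather than a performance model:

```python
# Back-of-the-envelope transfer times for 1 GB of data, using only the
# bandwidth figures quoted in the text (PCIe 3.0: 16 GB/s; K20 device
# memory: 288 GB/s; typical CPU main memory: 50 GB/s).

def transfer_ms(gigabytes, gb_per_s):
    """Time in milliseconds to move `gigabytes` at `gb_per_s`."""
    return 1e3 * gigabytes / gb_per_s

pcie_ms   = transfer_ms(1.0, 16.0)   # 62.5 ms across the PCIe bus
device_ms = transfer_ms(1.0, 288.0)  # ~3.5 ms from GPU main memory
cpu_ms    = transfer_ms(1.0, 50.0)   # 20 ms from CPU main memory
```

The factor of roughly 18 between the first two numbers is why efficient GPU codes keep data resident in device memory and minimise transfers across the bus.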
The primary processing unit on the Kepler K20 GPU is the Streaming Multiprocessor
(SMX); a single device can contain up to 15 SMXs. On each SMX sits 64 KB of L1 cache,
which is divided into a regular L1 cache and what is called ‘shared memory’ (the division
being application-configurable). Shared memory is an application-managed memory space
that can be used to perform efficient data sharing and manipulation operations. In addition
to these caches is 48 KB of read-only cache designed for data known to remain constant
throughout program execution. Rounding out the memory spaces on the Kepler GPU
is the register file on each SMX, which contains 65,536 32-bit registers that are divided
between processing units (NVIDIA Corporation, 2012).
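To see how the register file is divided between processing units, consider a simple occupancy calculation. The Python sketch below assumes a ceiling of 2048 resident threads per SMX, a Kepler-generation limit that is not stated in the text, and is illustrative only:

```python
# How the 65,536-register file is shared among resident threads on one
# SMX. The 2048-thread ceiling is an assumed Kepler-generation limit,
# not a figure taken from the text.
REGISTERS_PER_SMX = 65536
MAX_RESIDENT_THREADS = 2048  # assumption

# At full occupancy, each thread may use at most 32 registers:
regs_at_full_occupancy = REGISTERS_PER_SMX // MAX_RESIDENT_THREADS

def resident_threads(regs_per_thread):
    """Upper bound on threads an SMX can keep resident for a kernel
    using `regs_per_thread` registers (register pressure only)."""
    return min(MAX_RESIDENT_THREADS, REGISTERS_PER_SMX // regs_per_thread)
```

A register-hungry kernel, say one using 64 registers per thread, halves the number of threads the SMX can keep in flight, one of the trade-offs GPU programmers routinely balance.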
6 NVIDIA is one of two main competitors in the GPU hardware industry, the other being Advanced Micro Devices (AMD).
7 http://www.nvidia.com/object/cuda_home_new.html
As suggested by its name, a Streaming Multiprocessor is composed of many individual
processors: 192 general processing cores (supporting integer and single-precision floating-
point arithmetic), 64 double-precision floating-point units, 32 special-function units, 32
load/store units and 16 texture filtering units. The general processing cores provide the
bulk of the computational horsepower, totalling up to 2592 GFLOP/s8. The double-
precision units similarly provide 864 GFLOP/s of double-precision performance. The
purpose of the special function units is to provide very fast implementations of common
mathematical functions such as roots, exponentiation, logarithms and trigonometric func-
tions. The load/store units simply provide access to memory. Finally, the texture filtering
units provide fast interpolation functions in one, two and three dimensions.
The latest GPUs cost around US$2500 for scientific computing models and consume
up to 250 W of power, giving them monetary and power efficiencies of 1.04 GFLOP/s/$
and 10.4 GFLOP/s/W respectively.
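The quoted performance and efficiency figures follow directly from the unit counts and clock rate; reproducing the arithmetic in Python makes the derivation explicit:

```python
# Reproducing the peak-performance and efficiency figures quoted in the
# text from the K20's unit counts and clock rate.
CORES_PER_SMX    = 192   # single-precision cores
DP_UNITS_PER_SMX = 64    # double-precision units
SMXS             = 15
CLOCK_MHZ        = 900

sp_gflops = 1 * CORES_PER_SMX * SMXS * CLOCK_MHZ / 1000.0     # 2592 GFLOP/s
dp_gflops = 1 * DP_UNITS_PER_SMX * SMXS * CLOCK_MHZ / 1000.0  # 864 GFLOP/s
sp_with_fma = 2 * sp_gflops  # fused multiply-add doubles the figure

price_usd, power_w = 2500.0, 250.0
gflops_per_dollar = sp_gflops / price_usd  # ~1.04 GFLOP/s/$
gflops_per_watt   = sp_gflops / power_w    # ~10.4 GFLOP/s/W
```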
Programming models
Programming for general-purpose computation on GPUs began with real-time rendering
shader languages such as the OpenGL Shading Language (GLSL)9, C for Graphics (Cg)10
and the High Level Shader Language (HLSL)11. These languages require the developer to
pose their problem in terms of graphics operations such as transforming vertices and rendering textured polygons. Beyond the out-of-context thought process this demands, the approach also carries significant performance disadvantages and limitations on
the types of algorithms that can be computed. For example, scattering operations, where
data are written to arbitrary locations in memory, are particularly difficult to implement
using shader languages12. Shader languages also lack the ability to use shared memory to
efficiently share and communicate data between processors, a feature that proved particu-
larly critical to the performance of a number of algorithms including N-body simulations
(Belleman, Bedorf & Portegies Zwart, 2008).
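The distinction can be made concrete in a few lines of Python: a gather, where each output reads from arbitrary inputs, maps naturally onto a pixel shader, whereas a scatter, where each input writes to an arbitrary output, does not. This is an illustrative sketch, not shader code:

```python
# Gather vs. scatter in a few lines. A pixel shader computes each output
# element independently and may *read* arbitrary inputs (a gather):
def gather(src, idx):
    return [src[i] for i in idx]          # out[j] = src[idx[j]]

# A scatter instead *writes* each input to an arbitrary output location,
# which the fixed-output shader model cannot express directly:
def scatter(src, idx, out_len, fill=0):
    out = [fill] * out_len
    for j, i in enumerate(idx):
        out[i] = src[j]                   # out[idx[j]] = src[j]
    return out
```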
Auspicious early performance results led to rising interest in using GPUs for general-
purpose computations and saw the appearance of new programming interfaces designed
to ease the development process. One such project was the BrookGPU language, which
8 2592 GFLOP/s = 1 operation × 192 cores × 15 SMXs × 900 MHz. This number increases by a further factor of two if one considers the hardware’s ability to fuse multiply and add operations into a single instruction.
9 http://www.opengl.org/documentation/glsl/
10 http://developer.nvidia.com/page/cg_main.html
11 http://msdn.microsoft.com/en-us/library/windows/desktop/bb509561(v=vs.85).aspx
12 Implementing a scatter operation in a shader language requires placing the data at the vertices of a polygon and rendering it using a vertex shader that translates each vertex to the desired location.
simplified the programming of parallel applications using a ‘stream processing’ approach
(Buck et al., 2004). By restricting the allowed communication between parallel streams of
computation, this approach enables a variety of parallel algorithms to be executed on both
traditional and graphics processing hardware. BrookGPU provided back-ends supporting
OpenGL13 and DirectX14, as well as the low-level GPU programming interface Close to
Metal15.
A significant step in the rise of GPGPU programming came with the release of the
CUDA platform by NVIDIA in 2007. CUDA changed the GPU computing landscape by
not only removing ties to graphics operations, but by also opening up additional hardware
features such as unrestricted memory reads and writes, access to shared memory and fast
bi-directional data transfers between the GPU and system memory. Subsequent versions
have continued to introduce new features for general-purpose processing, including atomic
operations16, parallel voting functions, double-precision floating-point arithmetic, unified
memory addressing across the CPU and GPU, function recursion, full C++ support and,
most recently, dynamic parallelism and remote direct memory access. CUDA programs are
made up of calls to a runtime library, providing access to device and memory management
functions, and special functions called kernels written in ‘C for CUDA’, a C-like language
providing extensions for parallel processing functionality.
The CUDA programming model defines a hierarchy of parallel processing and memory
abstractions, depicted in Fig. 1.2. The fundamental unit of parallelism is the thread, which
executes one instance of a kernel. Each thread is allocated a number of registers that it
uses to perform its local computations; registers provide the fastest access times of all the
GPU memory spaces, but cannot be used for communication between threads. Threads
are grouped together in two ways, the first being into sets of 32 called warps. On the
GPU, instructions are issued on a per-warp basis; threads within a warp therefore execute
instructions in a lock-step fashion. In cases where some threads within a warp wish to
execute a different instruction from others (e.g., during a conditional statement), those threads
must wait, idle, while the other threads execute their operations. The second grouping
of threads is into blocks. Blocks can vary in size according to application requirements,
but are typically created with O(100) threads. The purpose of the block abstraction
is to provide a means of communicating between threads: threads within a block have
access to a fast synchronisation mechanism and shared memory, allowing them to rapidly
13 http://www.opengl.org/
14 http://www.microsoft.com/en-us/download/details.aspx?id=35
15 http://sourceforge.net/projects/amdctm/
16 In contrast to regular operations, atomic operations guarantee conflict-free parallel memory writes.
share and exchange data, while threads in different blocks can communicate only through
significantly slower means. The final layer in the processing hierarchy is the grid, which is
composed of blocks and contains all of the threads created to execute a given kernel; recent
GPUs also allow multiple grids to execute simultaneously. Communication at the grid level
(i.e., between thread blocks) can only be achieved through global memory using either
kernel-wide synchronisations or atomic operations. The success of this programming model
is due to its careful balance between hardware and software demands: it constrains parallel
execution and communication just enough to allow for a highly efficient and scalable
hardware implementation, but provides enough flexibility to enable the realisation of a
huge variety of parallel algorithms in software.
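The hierarchy can be illustrated with a sequential Python mimic of a kernel launch. The kernel body (a simple saxpy-style operation, chosen purely for illustration) runs once per (block, thread) pair, computing its global index exactly as a ‘C for CUDA’ kernel would; real parallelism, warps and shared memory are not modelled in this sketch:

```python
# Sequential Python mimic of a CUDA kernel launch. The kernel body runs
# once per (block, thread) pair; actual parallelism, warp lock-step and
# shared memory are not modelled here.
BLOCK_DIM = 128  # threads per block, O(100) as in the text

def saxpy_kernel(block_idx, thread_idx, a, x, y, out):
    i = block_idx * BLOCK_DIM + thread_idx  # global thread index
    if i < len(x):                          # guard threads past the end
        out[i] = a * x[i] + y[i]

def launch(n, a, x, y):
    out = [0.0] * n
    grid_dim = (n + BLOCK_DIM - 1) // BLOCK_DIM  # blocks in the grid
    for b in range(grid_dim):       # on a GPU, blocks run concurrently...
        for t in range(BLOCK_DIM):  # ...as do the threads within each block
            saxpy_kernel(b, t, a, x, y, out)
    return out
```

The guard on `i` is the idiom for handling problem sizes that are not a multiple of the block size; the divergence at that conditional idles the out-of-range threads of the final warp, exactly the lock-step cost described above.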
Following the success of CUDA, in 2008 the Khronos Group (a non-profit technology
consortium) released the Open Compute Language (OpenCL)17, an open standard frame-
work targeting heterogeneous parallel computing. OpenCL provides a similar program-
ming model to CUDA, but allows developers to target a variety of hardware back-ends,
including CPUs, GPUs and other processors from any vendor that provides an implemen-
tation. Like CUDA, OpenCL has added support for new hardware features in updated
versions of the specification.
Building on the foundations provided by CUDA and OpenCL, many higher-level pro-
gramming interfaces are now available to further ease the development of GPU applica-
tions. Libraries such as cuFFT, cuBLAS, cuSPARSE, cuRAND18, CUSP19, CULA20 and
others provide fast GPU implementations of common mathematical operations, and can
often be substituted directly into existing CPU codes. Other libraries, such as Thrust
(Hoberock & Bell, 2010) and CUDPP21 provide high-level interfaces to common parallel
algorithms. Support is also available for programming GPUs using high-level languages,
including Python (via PyCUDA22), MATLAB23, Mathematica24 and IDL (via GPULib25).
Lastly, directive-based approaches such as OpenACC26 allow for GPU execution (in the
same way that OpenMP27 allows for multi-core execution) of existing CPU codes through
17 http://www.khronos.org/opencl/
18 cuFFT, cuBLAS, cuSPARSE and cuRAND are part of the CUDA Toolkit available here: http://www.nvidia.com/content/cuda/cuda-toolkit.html
19 http://code.google.com/p/cusp-library/
20 http://www.culatools.com/
21 https://code.google.com/p/cudpp/
22 http://mathema.tician.de/software/pycuda/
23 http://www.mathworks.com.au/discovery/matlab-gpu.html
24 http://reference.wolfram.com/mathematica/guide/GPUComputing.html
25 http://www.txcorp.com/products/GPULib/
26 http://openacc.org/
27 http://openmp.org/
Figure 1.2 Schematic of the programming model for recent NVIDIA GPUs, showing the processing and memory hierarchies: the host CPU/memory, the GPU’s global memory and L2 cache, and, within each block of the grid, the L1 cache/shared memory, registers, threads and warps. Boxes labelled with the letter ‘R’ represent memory registers. Threads within a warp are depicted chained together to indicate the requirement that they execute instructions in lock-step. See main text for a description of each element.
the use of annotations and hints to the compiler.
Use in the scientific literature
The notion of using GPU hardware as a computational engine has existed for nearly as
long as GPUs themselves, with the first formal analysis of the idea appearing in the late
1980s (Fournier & Fussell, 1988). At that time, research focused mainly on algorithms used
in rendering 3D graphics, such as visible surface detection and shadow computation (op-
erations that went on to become mainstays of the 3D graphics industry). However, GPU-
based implementations of more general algorithms soon appeared. The ability of early
graphics hardware to rasterise polygons was exploited to implement high-performance mo-
tion planning algorithms (Lengyel et al., 1990), and later work used the OpenGL graphics
API to develop GPU-based implementations of artificial neural networks (Bohn, 1998),
3D convolution (Hopf & Ertl, 1999), numerous computational geometry algorithms (Hoff
et al., 1999; Mustafa et al., 2001; Krishnan, Mustafa & Venkatasubramanian, 2002; Agar-
wal et al., 2003; Sun, Agrawal & El Abbadi, 2003), matrix multiplication (Larsen & McAl-
lister, 2001), non-linear diffusion for image processing (Rumpf & Strzodka, 2001; Diewald
et al., 2001) and cellular automata-based fluid simulations (Harris et al., 2002). With no
access to programmable graphics hardware, these implementations relied on exploiting the
limited operations available in fixed-function graphics pipelines. Major obstacles imposed
by this approach included limited arithmetic precision, lack of support for arbitrary math-
ematical functions, limited support for conditional execution and restricted convolution
capabilities (Trendall & Stewart, 2000).
The appearance of programmable shaders in graphics hardware (Lindholm, Kilgard &
Moreton, 2001; Proudfoot et al., 2001; Mark et al., 2003) alleviated many of the issues
associated with fixed-function pipelines, opening up GPU acceleration to a wider variety
of applications. New algorithms included multigrid and sparse conjugate gradient matrix
solvers (Goodnight et al., 2003; Bolz et al., 2003), fast Fourier transforms (Moreland &
Angel, 2003), wavefront phase recovery (Rosa, Marichal-Hernandez & Rodriguez-Ramos,
2004), direct gravitational N-body simulation (Nyland, Prins & Harris, 2004), Monte Carlo
simulations in statistical mechanics (Tomov et al., 2005), computer-generated holography
(Masuda et al., 2006), 3D shape measurement (Zhang, Royer & Yau, 2006) and mag-
netic resonance imaging reconstruction (Schiwietz et al., 2006). While shader languages
provided unprecedented flexibility on GPU hardware, they remained graphics-specific, re-
stricting memory access and forcing developers to map their problems into a graphics-based
context.
The release of CUDA (and later OpenCL) was the final step in liberating GPU hard-
ware for general-purpose computations. Severing ties with graphics-specific operations
and allowing arbitrary access to memory opened the flood gates to a plethora of applica-
tions. Implementations of algorithms from all areas of science appeared, some examples
being: k-nearest neighbour search (Garcia, Debreuve & Barlaud, 2008; Campana-Olivo
& Manian, 2011), molecular dynamics (Anderson, Lorenz & Travesset, 2008; van Meel
et al., 2008; Sunarso, Tsuji & Chono, 2010), numerical weather prediction (Michalakes &
Vachharajani, 2008; Govett et al., 2011; Mielikainen, Huang & Huang, 2011), radiotherapy
dose calculation (de Greef et al., 2009; Jia et al., 2010; Men et al., 2009), computational
fluid dynamics (Cohen & Molemake, 2009; Horvath & Liebmann, 2010; Tomczak et al.,
2012), pattern formation in financial markets (Preis et al., 2009), air pollution modeling
(Molnar et al., 2010), spherical harmonic transforms (Hupca et al., 2012) and ant colony
optimisation (Cecilia et al., 2011). New applications, and improvements to existing ones,
continue to appear as new generations of GPU hardware and software provide additional
performance and flexibility.
1.2.3 Other accelerator cards
History
In addition to GPUs, the past decade has seen the release of a number of dedicated accel-
erator cards, typically designed to plug into PC expansion slots and provide acceleration
for certain types of computations. By focusing only on particular mathematical opera-
tions, these devices aim to provide cost-effective solutions to increasing the performance
of compute-intensive codes. While several products have seen short-term success, over the
longer term manufacturers have often struggled to compete with commodity hardware.
In the early 1990s, a series of devices aimed at accelerating the O(N²) operations involved in direct gravitational N-body simulations was developed at the University of Tokyo
(Ebisuzaki et al., 1993). The Gravity Pipe (GRAPE) hardware speeds up simulations by
offloading the computationally intensive force calculations (performed between all pairs of
gravitating bodies in a system) from the CPU. Successive versions of the devices, from the
GRAPE-1 through to the GRAPE-5 (Kawai et al., 2000) and GRAPE-6 (Makino et al.,
2003) brought increased performance and accuracy. A modified version of the GRAPE-6,
the GRAPE-6A, also offered a smaller form factor that allowed it to be plugged into a PC
expansion slot (Fukushige, Makino & Kawai, 2005). While the GRAPE hardware proved
successful for more than a decade of N-body simulations, it was ultimately faced with
strong competition from GPUs, which concluded with the development of a substitute
library allowing GRAPE-based codes to exploit cheaper GPU hardware instead through
a simple re-linking operation (Gaburov, Harfst & Portegies Zwart, 2009).
Products offering more general acceleration capabilities were released between 2003
and 2009 by ClearSpeed Technology, providing full floating-point and integer arithmetic
capabilities. Their most recent device, the CSX70028, offers up to 96 GFLOP/s of double-precision performance at a typical power consumption of 9 W. The product was priced at
around US$3000 in 2008. However, no subsequent models have been released, likely also
due to competition from GPUs, which now provide better performance at lower cost.
In 2006, Mercury Computer Systems released a PCI Express card featuring the Cell
Broadband Engine Architecture (Cell BE), a heterogeneous multi-core chip providing high
floating-point performance. The Cell Accelerator Board (CAB) consumes 210 W and
provides theoretical peak single-precision (double-precision) performance of 180 GFLOP/s
(90 GFLOP/s), priced at around US$8000 in 2007. As with the ClearSpeed device, no
updated models of the Mercury CAB have been announced since its first release.
Following the success of GPUs in high-performance computing (HPC) applications,
Intel has recently developed a dedicated computational accelerator card designed to com-
pete with GPUs in the HPC market. The Xeon Phi will offer massively-parallel processing
on a PCI Express board and provide a CPU-like development environment29.
Hardware architectures
The hardware architecture of the GRAPE has evolved significantly through its six ver-
sions, offering increased performance through more cores and higher clock rates. In ad-
dition, odd-numbered versions have exploited logarithmic arithmetic to avoid expensive
root operations, at the cost of reduced accuracy. The GRAPE-6 exhibits a massively-
parallel, hierarchical processor architecture. At the smallest level, this consists of in-
teraction pipelines designed specifically to evaluate the equations of Newtonian gravity,
including the force and its time derivative. The ‘hard-wired’ nature of the pipelines en-
sures maximum computational efficiency, while the massive parallelism allows multiple
interactions to be computed simultaneously, providing high performance at a relatively
low clock-rate (90 MHz).
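For concreteness, the computation these pipelines hard-wire is the direct-summation evaluation of pairwise Newtonian accelerations. The minimal Python sketch below works in units with G = 1; the softening length `eps` is the customary device for taming close encounters, an assumption here rather than a detail given in the text:

```python
# The computation GRAPE hard-wires: direct-summation Newtonian
# accelerations over all particle pairs, O(N^2) in cost. Units with
# G = 1; 'eps' is an assumed softening length, not from the text.
def accelerations(pos, mass, eps=0.0):
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = dx[0]**2 + dx[1]**2 + dx[2]**2 + eps**2
            inv_r3 = r2 ** -1.5  # the expensive root the pipelines evaluate
            for k in range(3):
                acc[i][k] += mass[j] * dx[k] * inv_r3
    return acc
```

Each of the N(N−1) interactions is independent, which is precisely why the problem maps so well onto pipelined or massively-parallel hardware.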
The ClearSpeed CSX700 has a single instruction multiple data (SIMD) architecture
containing 192 processing elements divided between two parallel arrays. The SIMD archi-
tecture means that all of the processing elements in one parallel array execute instructions
28 http://www.clearspeed.com/products/csx700.php
29 http://www.intel.com/content/www/us/en/high-performance-computing/high-performance-xeon-phi-coprocessor-brief.html
in lock-step. The processing elements have access to two levels of cache as well as around
2 GB of main memory, and communicate with the host machine over a PCI Express bus
at up to 8 GB/s.
The PowerXCell 8i processor featured on the Mercury CAB exhibits a heterogeneous
architecture made up of a primary processor called a Power Processing Element (PPE) and
eight co-processors called Synergistic Processing Elements (SPEs). The PPE is similar to
a traditional CPU and provides the general-purpose processing capabilities required to run
an operating system and support the SPEs. The SPEs provide the bulk of the computing
power and feature 256 KB of local memory and 128-bit wide SIMD capabilities, allowing four (two) single-precision (double-precision) floating-point values to be operated on
simultaneously per SPE. The board also contains 1 GB of high-performance memory and
4 GB of main memory. The PPE, SPEs and memory units are connected via the Element
Interconnect Bus, a circular bus providing concurrent transactions between components
on the chip30.
The Xeon Phi will feature at least 50 independent processing cores, with each offering
512-bit SIMD units, allowing 16 (8) single-precision (double-precision) floating-point values to be operated on simultaneously per core. In contrast to GPUs and other accelerator
cards, the cores in the Xeon Phi will be compatible with the x86 instruction set, allowing
them to support existing CPU-based software (most of which uses x86). The device will
also exhibit two levels of cache and a ring bus connecting the processors and memory,
allowing for fast communication across the chip29.
Programming models
Due to its single-purpose design, GRAPE hardware is accessed via a very simple appli-
cation programming interface (API). Functions are provided to initialise and shut down
the device, transfer particle data to and from its memory and to instruct it to begin com-
putation. The only flexibility afforded by the hardware is in the number of particles it is
given; however, this is sufficient to support (certain implementations of) tree-based force
evaluation algorithms, which are often used to speed up the simulation of collisionless
gravitational systems (Makino & Funato, 1993).
Development for the ClearSpeed CSX700 is done through a proprietary software de-
velopment kit31. Direct programming is done using Cn, a C-like language with extensions
for parallel processing. The extensions allow a programmer to qualify data types as ‘poly’,
30 http://www.ibm.com/developerworks/power/library/pa-cellperf/
31 http://www.clearspeed.com/products/sdk.php
indicating to the compiler that they should be replicated across each processing element
and operated on in parallel. A run-time function is provided to query the index of the
current processor, allowing it to act on processor-dependent data; conditional statements
depending on the processor index will, however, lead to unselected processors waiting idle,
due to the SIMD nature of the hardware architecture. A number of libraries are also
provided to accelerate common mathematical algorithms such as fast Fourier transforms,
basic linear algebra operations and pseudo-random number generation.
Several programming interfaces are available for the Cell BE processor on the Mercury
CAB. At the lowest level, writing assembler code provides complete access to all hard-
ware capabilities and allows the programmer to extract the greatest performance, at a
significant development cost. A more common approach is to write C or C++ code using
an application programming interface and vector intrinsics to target the SPEs and their
SIMD units. Optimising compilers have also been developed to automatically exploit the
parallel processing hardware and on-chip memory spaces; code sections can be marked as
parallel by the programmer using a similar model to OpenMP (Eichenberger et al., 2005).
Other options for targeting the Cell BE processor are an implementation of the OpenCL
specification, and optimised maths libraries.
The Xeon Phi’s x86 compatibility is designed to allow the use of many existing par-
allel programming tools. These include OpenMP and message passing interface (MPI)
implementations, Intel’s Array Building Blocks32, Threading Building Blocks33 and Math
Kernel34 libraries, as well as Intel’s Cilk Plus35 extensions to C and C++. OpenCL will
also be supported. While the individual processor cores will be able to execute much
existing code, use of the new 512-bit SIMD units will require additional development.
Use in the scientific literature
Due to its problem-specific nature, GRAPE hardware has not seen significant use outside
of astronomy. Applications of these devices within astronomy are discussed in Section 1.3.
ClearSpeed’s accelerator devices have seen only very limited use by the scientific community, and do not appear to have been used in astronomy. Published applications
include lattice Boltzmann methods (Heuveline & Weiß, 2009), hologram generation (Tan-
abe et al., 2009) and geographic flood inundation simulations (Neal et al., 2009).
The Cell processor’s versatility has seen it applied to a large number of problems. Ap-
32 http://intel.com/go/arbb
33 http://threadingbuildingblocks.org
34 http://software.intel.com/en-us/articles/intel-mkl/
35 http://software.intel.com/en-us/articles/intel-cilk-plus/
plications include quantum chromodynamics simulations (Belletti et al., 2007), 3D com-
puted tomography reconstruction (Scherl et al., 2007), high-energy physics reconstruction
algorithms (Gorbunov et al., 2008), self-organising maps (McConnell, 2010), molecular
dynamics simulations (Gonnet, 2010) and video encoding for large-scale surveillance (Lu
et al., 2010).
1.3 Advanced architectures in astronomy
While some advanced architectures, like the GPU, have only recently seen broad use by
the astronomical community, others, like the GRAPE, have been in use for more than
two decades. The primary application of GRAPE hardware has been to simulations of
collisional stellar environments (Makino, 1991, 1996; Klessen & Kroupa, 1998; Shara &
Hurley, 2002; Baumgardt et al., 2003), but it has also been applied to collisionless SPH
simulations (Steinmetz, 1996; Springel, Yoshida & White, 2001). While competition from
GPUs appears to have pushed GRAPE hardware out of the market, it remains in use
today (e.g., Jalali et al. 2012).
Astronomy applications of the Cell BE processor have been limited to those investi-
gated by a small number of early adopters. These include image synthesis (Varbanescu
et al., 2008) and signal correlation (van Nieuwpoort & Romein, 2009) for radio astronomy,
period searching in light curves (Cytowski, Remiszewski & Soszyński, 2010) and numerical
relativity simulations (Khanna, 2010).
Direct gravitational N-body simulations were among the first astronomy codes imple-
mented on GPUs, initially using graphics shader languages (Nyland, Prins & Harris, 2004;
Portegies Zwart, Belleman & Geldof, 2007) and later using the general-purpose GPU lan-
guages BrookGPU (Elsen et al., 2007) and CUDA (Hamada & Iitaka, 2007; Nyland, Harris
& Prins, 2007; Belleman, Bedorf & Portegies Zwart, 2008; Gaburov, Harfst & Portegies
Zwart, 2009). More recently, algorithmic advances have led to GPU implementations of
hierarchical tree-based N-body algorithms (Hamada et al., 2009; Nakasato et al., 2012;
Bedorf, Gaburov & Portegies Zwart, 2012). A review of GPU use in N-body simulations
has been published by Bedorf & Portegies Zwart (2012).
While N-body simulations have received particular attention, GPU applications in
astronomy now span a wide range of problems. Some examples are radio-telescope signal
correlation (Schaaf & Overeem, 2004; Harris, Haines & Staveley-Smith, 2008; Ord et al.,
2009; Wayth, Greenhill & Briggs, 2009; Clark, La Plante & Greenhill, 2011), the solution of
Kepler’s equation (Ford, 2009), galaxy spectral energy distribution calculations (Jonsson
& Primack, 2010; Heymann & Siebenmorgen, 2012), gravitational lensing ray-shooting
(Thompson et al., 2010; Bate et al., 2010), adaptive mesh refinement (Wang, Abel &
Kaehler, 2010; Schive, Tsai & Chiueh, 2010), volume rendering of spectral data cubes
(Hassan, Fluke & Barnes, 2012) and cosmological lattice simulations (Sainio, 2012). A
review of practical issues faced when implementing astronomy problems on GPUs has also
been published by Fluke et al. (2011).
It is important to note that many of these applications are well-known for exhibiting
large degrees of parallelism. In this sense, they may be considered ‘low-hanging fruit’
for implementation on massively-parallel architectures like GPUs. It is also evident that
the two main sources of knowledge regarding the design requirements for these implemen-
tations are hardware-specific documentation and simple trial and error [e.g., Hamada &
Iitaka (2007); Harris, Haines & Staveley-Smith (2008); Thompson et al. (2010)]. While this
‘ad-hoc’ approach has proven successful in early work, it is unclear whether such methods,
which generally require significant investments of time for development and optimisation,
will produce similar rewards for all areas of astronomy.
1.4 Purpose of the thesis
This thesis is motivated by two key observations: 1) the changing landscape of computing
hardware is threatening to leave behind astronomy research that does not adapt; and 2)
advanced architectures offer the potential to enable new science today. Consequently, its
aims are: 1) to motivate, develop and demonstrate a generalised approach to the use of
many-core architectures in astronomy; and 2) to use an advanced architecture to enable
new science.
It is crucial that astronomy be able to exploit advances in computing hardware, and
therefore critical that the software community embrace the current trend in processor
design that is placing more and more emphasis on massively-parallel processing. The
key obstacles to this are the foreign programming model and often steep learning curve
presented by advanced architectures. It is the first goal of this thesis to ameliorate this
issue by introducing a generalised approach to analysing and implementing algorithms on
such hardware and removing the risks associated with ad-hoc development. This forms
the basis of Chapter 2.
The order-of-magnitude greater computing power offered by advanced architectures rela-
tive to CPUs today provides a unique opportunity to enable new science. Computationally-
limited fields of study stand to reap great rewards from the ability to process more data,
explore more parameter space or produce results with more accuracy. It is the second
aim of this thesis to demonstrate this possibility by applying an advanced architecture
to problems in pulsar astronomy and subsequently developing a real-time event detection
pipeline capable of unlocking unprecedented discovery opportunities. These ideas form
the basis of Chapters 3 and 4. An introduction to pulsar astronomy and a discussion of
the motivation behind the choice of this field for the application of advanced architectures
is presented in Section 1.5.
To avoid undue complication, this thesis focuses primarily on graphics processing units
as the canonical example of an advanced, many-core hardware architecture. This does not,
however, represent a reduction in scope: the ideas and methods presented in this work are
expected to apply equally well to other massively-parallel architectures, both present and
future. That said, the long-running history and established market position of GPUs give
good reason to believe that they will continue to remain a significant force in accelerated
computing for the foreseeable future.
1.5 Advanced architectures meet pulsar astronomy
The applications to which GPUs were applied in this thesis focus primarily on problems in
pulsar astronomy (e.g., Chapters 3 and 4). Pulsar astronomy has a strong dependence on
high-performance computation, and in many cases its science is computationally-limited.
Here we provide a brief introduction to pulsars, their observation at radio frequencies
and why their study is an excellent field for the application of advanced architectures like
GPUs.
1.5.1 History and characteristics of pulsars
Pulsars get their name from a portmanteau of ‘pulsating star’, which describes their ap-
pearance when observed through a telescope. As with many phenomena in astronomy,
this observationally-derived name does not correspond to their underlying physical nature.
Pulsars are in fact rotating neutron stars, remnants from supernovae. Their serendipitous
discovery in 1967 by Jocelyn Bell Burnell involved observations around 81.5 MHz of un-
explained regular pulses of emission from a consistent celestial location (Hewish et al.,
1968). These pulses were found to have remarkable periodicity, with one source repeat-
ing every ∼1337 ms to better than one part in 10⁷ [and since measured to better than
one part in 10¹² by Hobbs et al. (2004)]. After the initial discovery of four such sources,
the phenomenon was quickly attributed to polar emission from a rotating neutron star,
where an intense magnetic field accelerates charged particles from the surface of the star
to relativistic speeds, resulting in the emission of synchrotron radiation from the magnetic
poles (Gold, 1968; Pacini, 1968). The observation of discrete pulses arises from a misalign-
ment between the star’s rotation and magnetic axes, which causes the emission beam to
periodically sweep across our line of sight as the star rotates.
Today more than 2000 pulsars have been discovered, and ongoing surveys continue to
add to this number. Two primary metrics used to characterise a pulsar are the rotation
period P and its time derivative Ṗ. Plotting the known pulsars in this phase space (see Fig.
1.3) reveals several distinct groupings [see Bhattacharya & van den Heuvel (1991) and
Cordes et al. (2004) for reviews]. The largest group primarily occupies the range
0.25 s ≲ P ≲ 1.25 s, forming the population known as the slow, regular or canonical pulsars.
At shorter periods lies a distinct population of ‘millisecond pulsars’ (MSPs), correlated
strongly with pulsars known to be members of binary systems. MSPs are thought to be
‘recycled’ slow pulsars—since their formation, they have been spun-up by the accretion
of mass from a companion star (Alpar et al., 1982). At the longest periods and steep-
est period derivatives are a population of pulsars known as magnetars, named for their
extremely strong surface magnetic field strengths (Mereghetti, 2008). Another class of
pulsars, generally exhibiting periods similar to magnetars but spin-down rates more char-
acteristic of regular pulsars, are the rotating radio transients (RRATs). These objects are
distinguished by their sporadic pulse detection rates and are now thought to be pulsars
that experience on-and-off ‘nulling’ of their emission (McLaughlin et al., 2006); their exact
definition, however, remains uncertain.
Pulsars have a number of attributes that make them very useful objects of study.
Observations can provide insights into the physics behind neutron stars, entities that
skirt the edges of the known physical laws (Lattimer & Prakash, 2004). Their place at
the end of the stellar evolutionary path also makes pulsars amenable to studies of stellar
populations (Bhattacharya & van den Heuvel, 1991). The stability of their rotation, which
can rival the best Earth-based atomic clocks (Matsakis, Taylor & Eubanks, 1997), also
allows them to be used to refine solar system ephemerides and potentially to detect the
presence of a gravitational wave background (Foster & Backer, 1990). Furthermore, in tight binary
systems they become even more useful, providing probes of high-energy plasma physics
and gravitational radiation (Lyne et al., 2004).
1.5.2 Pulsar observations
Pulsars have been observed in the radio, optical, X-ray and gamma-ray bands (Abdo et al.,
2010; Mignani, 2011). Of the known pulsars, the majority have been detected at radio
frequencies, a large fraction of which were discovered at the Parkes Radio Observatory
Figure 1.3 Sample of the known pulsars plotted in P–Ṗ space (period derivative [s s⁻¹] against period [s]; populations shown: regular pulsars, binary members, magnetars and RRATs). Data obtained using the ATNF Pulsar Catalogue (Manchester et al., 2005), available at: http://www.atnf.csiro.au/people/pulsar/psrcat/.
in bands centred at 436 MHz (Lyne et al., 1998) and 1382 MHz (Lorimer et al., 2006;
Keith et al., 2010). An important phenomenon at these frequencies is dispersion: the
introduction of a frequency-dependent time delay in the pulse signal as a result of refraction
by free electrons, which reside in the interstellar medium between source and observer.
The dispersion delay scales as the inverse square of the observing frequency and is directly proportional to the
dispersion measure (DM), a quantity defining the column density of free electrons along
the line of sight. Left uncorrected, interstellar dispersion causes pulsar signals to appear
smeared out in time across a finite observing bandwidth. For this reason, observations of
pulsars must be corrected for the dispersion delay at each frequency prior to integrating
the band.
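Quantitatively, the delay between two observing frequencies follows Δt = k_DM × DM × (ν₁⁻² − ν₂⁻²), where k_DM ≈ 4.149 × 10³ MHz² pc⁻¹ cm³ s is the standard dispersion constant. A minimal sketch (the frequencies used in the example are arbitrary):

```python
def dispersion_delay(dm, f_lo_mhz, f_hi_mhz):
    """Arrival-time delay (in seconds) of a pulse at f_lo_mhz relative to
    f_hi_mhz, for a dispersion measure dm in pc cm^-3.

    Uses the standard dispersion constant k_DM ~ 4.149e3 MHz^2 pc^-1 cm^3 s.
    """
    k_dm = 4.1488e3  # MHz^2 pc^-1 cm^3 s
    return k_dm * dm * (f_lo_mhz ** -2 - f_hi_mhz ** -2)
```

For example, across a 400 MHz band near 1.4 GHz, a DM of 100 pc cm⁻³ smears a pulse by roughly a tenth of a second, many times the width of a typical pulse.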
Pulsar observations using radio telescopes require a number of processing stages to
reduce the raw voltages to final data products. The signal path begins at the receiver
horn, which captures radiation as complex voltages in two orthogonal polarisations. These
signals are fed through a low-noise amplifier, which boosts weak astronomical signals to
detectable levels in a low-thermal-noise environment (often cooled cryogenically). A low-
frequency signal from a separate oscillator is then mixed with this amplified signal to
reduce the frequency to O(10 MHz), simplifying subsequent electronics and preventing
feedback into the receiver. The mixed signal is then passed through a band-pass filter to
produce the intermediate frequency (IF) feed. The IF feed attaches to what is known as
the receiver back-end.
Back-ends vary significantly depending on the intended observing mode and the tech-
nology used. For pulsar timing observations, where the dispersion measure of the source is
known a priori, a process known as coherent dedispersion can be applied to directly correct
the complex signal voltages for the effects of interstellar dispersion (Hankins & Rickett,
1975). After this procedure, the data can be folded at the pulsar period to produce an
integrated pulse profile at the native time resolution, allowing for very high-precision tim-
ing. Popular current choices of back-end hardware are the FPGA platforms developed by
the Center for Astronomy Signal Processing and Electronics Research (CASPER) (e.g.,
Langston, Rumberg & Brandt 2007; Keith et al. 2010; Sane et al. 2012). The CASPER
Parkes Swinburne Recorder (CASPSR) is a recently-developed pulsar timing back-end
that operates by digitising the IF feed and using a CASPER Interconnect Break-out
Board (IBOB) to packetise and transmit the data to a cluster of server computers; the
coherent dedispersion and folding process is then performed in software (van Straten &
Bailes, 2011).
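The folding operation itself is conceptually simple: each sample is assigned to a rotational phase bin and the bins are averaged, building up an integrated pulse profile. A minimal illustration (synthetic data and an arbitrary bin count; not the CASPSR implementation):

```python
import numpy as np

def fold(series, t_samp, period, nbins=32):
    """Fold a dedispersed time series at a trial period.

    Each sample is assigned to a rotational phase bin, and the bins are
    averaged to build up an integrated pulse profile.
    """
    t = np.arange(len(series)) * t_samp
    phase_bin = ((t % period) / period * nbins).astype(int)
    profile = np.bincount(phase_bin, weights=series, minlength=nbins)
    counts = np.bincount(phase_bin, minlength=nbins)
    return profile / np.maximum(counts, 1)
```

Folding at the correct period causes the pulses to add coherently in a single phase bin, while noise averages down across all bins.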
Different techniques are used when taking survey observations, which aim to detect new
sources. Survey back-ends must record the observed data such that it can be searched
for signals across a range of dispersion measures. The computational cost of coherent
dedispersion (discussed further in Section 1.5.3) precludes applying it at this number of
DMs, and thus survey data are generally dedispersed incoherently. Modern survey back-
ends act as digital spectrometers by dividing up the IF feed into a number of independent
frequency channels, usually using a polyphase filterbank to avoid issues associated with
the straightforward discrete Fourier transform [see Harris & Haines (2011) for a review of
the use of polyphase filterbanks in astronomy]. The Berkeley-Parkes-Swinburne Recorder
(BPSR) is a survey back-end that uses an IBOB to apply a polyphase filterbank to the
digitised input signal. The FPGA then ‘detects’ each channel by squaring it, integrates
over 25 time samples to reduce the time resolution to 64 µs, scales and decimates the
samples to eight bits, and sends the data in packets to a server computer. The server is
tasked with summing the filterbanks from two polarisations and normalising each channel,
before finally rescaling to two bits per sample and writing the data to disk (Keith et al.,
2010). Subsequent incoherent dedispersion of the filterbanks is performed by artificially
delaying and summing each frequency channel to produce dedispersed time series.
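In its simplest form, this delay-and-sum operation can be sketched as follows (a direct, unoptimised illustration with integer-sample delays; the dedispersion algorithms actually analysed and implemented are the subject of Chapter 3):

```python
import numpy as np

def dedisperse(filterbank, delays):
    """Incoherently dedisperse a filterbank for one trial DM.

    filterbank: 2D array of shape (nchans, nsamps) of detected power.
    delays:     per-channel dispersion delay in whole samples.
    Returns a dedispersed time series of length nsamps - max(delays).
    """
    nchans, nsamps = filterbank.shape
    nout = nsamps - max(delays)
    out = np.zeros(nout)
    for c in range(nchans):
        # Shift each channel by its dispersion delay, then accumulate.
        out += filterbank[c, delays[c]:delays[c] + nout]
    return out
```

A search repeats this for every trial DM, which is what makes the process so computationally demanding.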
Several different techniques are used to search for pulsars, often targeting specific
classes (i.e., those shown in Fig. 1.3). The periodic nature of pulsar emission typically
makes searching for them in the Fourier domain the most sensitive option. Algorithms
for folding time series at many trial periods such as the fast folding algorithm have also
been employed in the past (Burns & Clark, 1969; Hankins & Rickett, 1975). Two main
cases exist where period-search techniques can fail: highly-accelerated pulsars in tight
binary systems, and nulling pulsars/rotating radio transients. Doppler shifting causes
accelerated pulsars to exhibit non-linear stretching and compressing of the inter-pulse
interval when observed from Earth, which can prevent the coherent addition of pulses
during a Fourier transform. A number of techniques have been developed to solve this
problem, including stretch-correction of time series and Fourier-domain matched filtering
(Johnston & Kulkarni, 1991; Ransom, 2001). Detection of rotating radio transients suffers
(by definition) from the problem of having too few pulses to produce a more significant
signal in the Fourier domain than the time domain. These objects are detected using
single-pulse search techniques that look for individual bright pulses (McLaughlin et al.,
2006).
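As a concrete (if idealised) illustration of the Fourier-domain approach, the strongest periodicity in a time series can be recovered from the peak of its power spectrum. The sketch below uses a synthetic test signal and omits the harmonic summing, acceleration correction and interference rejection that real searches require:

```python
import numpy as np

def strongest_period(series, t_samp):
    """Return the period of the strongest Fourier component in a time series."""
    # Power spectrum of the mean-subtracted series.
    spectrum = np.abs(np.fft.rfft(series - series.mean()))
    freqs = np.fft.rfftfreq(len(series), d=t_samp)
    peak = np.argmax(spectrum[1:]) + 1  # skip the DC bin
    return 1.0 / freqs[peak]
```

A pulsed signal concentrates its power at the pulse frequency and its harmonics, which is why the Fourier domain is typically the most sensitive place to search.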
One final issue that affects all modern radio observations is the existence of man-made
radio-frequency interference (RFI). Population growth and the explosion in the use of
wireless technologies and satellite communications have resulted in a crowded broadcast
spectrum that is increasingly difficult to escape. While radio observatories are generally
located in sparsely populated radio-quiet zones, a certain amount of RFI inevitably makes
its presence known, and has a tendency to be orders of magnitude stronger than astronom-
ical signals. In typical pulsar surveys, both periodic and impulsive RFI signals abound
in the data, overpowering all but the brightest pulsars and RRATs. Fortunately, terres-
trial signals often exhibit tell-tale signs that allow them to be identified and excised, and
many different RFI mitigation techniques have been developed over the years (see, e.g.,
Fridman & Baan 2001; Bhat et al. 2005; Kesteven et al. 2005; Floer, Winkel & Kerp 2010;
Hogden et al. 2012; Spitler et al. 2012). Two simple signs of RFI are the presence of only
narrow-band emission and the lack of a dispersion sweep across the band (in broad-band
signals). In addition to the use of these discriminators, another common approach to RFI
mitigation is to exploit coincidence information from multiple antennas or receivers, ei-
ther geographically separated and pointing at the same location on the sky (in which case
coincidence evidences an astronomical origin), or geographically co-located and pointing
at different regions on the sky³⁶ (in which case coincidence evidences an Earth origin).
These techniques offer very effective means of mitigating RFI at the cost of additional
computing resources, which can become significant particularly in real-time systems.
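As an illustration of the multibeam coincidence idea, one very simple (hypothetical) criterion flags any time sample in which more than a few differently-pointed, co-located beams are simultaneously bright; an astronomical source can illuminate only one or two adjacent beams, whereas terrestrial interference typically enters all of them:

```python
import numpy as np

def coincidence_mask(beams, threshold, max_beams=2):
    """Flag time samples as likely RFI when more than max_beams co-located,
    differently-pointed beams exceed a detection threshold simultaneously.

    beams: 2D array of shape (nbeams, nsamps) of detected power.
    Returns a boolean mask per time sample (True = likely terrestrial).
    """
    hits = (beams > threshold).sum(axis=0)
    return hits > max_beams
```

Real systems use considerably more sophisticated coincidence criteria (e.g., Kocz et al. 2012), but the per-sample, per-beam structure of the computation is the same.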
1.5.3 Pulsar astronomy and advanced architectures
Pulsar astronomy relies heavily on high-performance computation during both observa-
tions and data analysis. The process of coherent dedispersion is an example of a compu-
tationally intensive operation, involving the application of many large Fourier transforms.
This algorithm is particularly expensive at large observing bandwidths and high dispersion
measures, often making it prohibitively expensive to perform in real-time on traditional
computing hardware. However, the high performance of the fast Fourier transform (FFT)
algorithm on GPUs makes them an excellent way to accelerate this computation; a GPU-
based real-time coherent dedispersion pipeline has indeed already been deployed at the
Parkes Radio Observatory (van Straten & Bailes, 2011).
Pulsar surveys also depend on the ability to process data rapidly, which is necessary
to enable the exploration of large parameter spaces. Searching for pulsars in filterbank
data requires incoherent dedispersion at many trial DMs, each of which must then be
converted into a form sensitive to the target signals (e.g., via Fourier transformation or
matched filtering) and searched independently. While this often leads to large parameter
spaces and considerable computing demands, the operations have the significant advantage
³⁶ In the case of dedicated reference antennas, they may not be pointing at the sky at all.
of exhibiting very high degrees of parallelism, from the independence between search trials
to the data-parallelism within filterbanks and time series (see Chapter 2 for discussion of
parallel algorithms). This makes them exceptionally well-suited to the massively-parallel
advanced architectures introduced in Section 1.2. This conclusion is evidenced by recent
work applying GPUs to new and ongoing pulsar surveys (Magro et al., 2011; Ait-Allal
et al., 2012).
Finally, RFI mitigation can also place heavy demands on computing resources, espe-
cially when applied in real time. Processes such as cleaning filterbanks of narrow-band and
zero-dispersion-measure signals are relatively undemanding, but methods involving coin-
cidence detection across multiple data streams can become intensive, particularly when
using non-trivial coincidence criteria (Briggs & Kocz, 2005; Kocz et al., 2012). In such
cases, the available computing power can directly influence the effectiveness of the RFI
mitigation, and, consequently, the rate of scientific progress.
The excellent match between the computational demands of pulsar astronomy and the
computational capabilities of advanced architectures motivated the selection of this field
as the target of applications investigated in the later parts of this thesis.
1.6 Thesis outline
This thesis is structured as follows. Chapter 2 motivates and presents a generalised ap-
proach to the use of advanced, many-core architectures in astronomy, describing a method-
ology based around the analysis of algorithms and demonstrating it on four well-known
applications. This methodology is then applied in Chapter 3 to guide a GPU implemen-
tation of the problem of incoherent dedispersion in pulsar astronomy. Three different
algorithms are analysed and implemented, and their performance is compared across the
CPU and GPU. Chapter 4 then builds on the results of Chapters 2 and 3 to describe
the development and deployment of a complete GPU-based fast-radio-transient detection
pipeline, concluding with early science results. Finally, Chapter 5 presents a discussion of
future directions and summarises the findings of the previous chapters.
2 Algorithm Analysis: A Generalised Approach to
Many-core Architectures for Astronomy
I know how to get four horses to pull a cart, but I don’t
know how to make 1024 chickens do it.
—Enrico Clementi
2.1 Introduction
The appearance of low-cost computational accelerators in the form of graphics processing
units (GPUs) has heralded a new era of high-performance computing (HPC) in astronomy
research, with speed-ups of an order of magnitude available even to those on the tightest
research budgets. However, while this lowering of the cost barrier to HPC represents a
significant step forward, there remains a high learning barrier accompanying the use of
these new hardware architectures. As a result of this, GPU use in astronomy to date
has largely been limited to the most computer-literate researchers working on applications
that may be considered ‘low-hanging fruit’ for parallel computing.
Inevitably, a section of the astronomy community will continue with an ad hoc ap-
proach to the adaptation of software from single-core to many-core architectures. In this
chapter, we demonstrate that there is a significant difference between current comput-
ing techniques and those required to efficiently utilise new hardware architectures such
as many-core processors, as exemplified by GPUs. These techniques will be unfamiliar
to most astronomers and will pose a challenge in terms of keeping the discipline at the
forefront of computational science. We present a practical, effective and simple methodol-
ogy for creating astronomy software whose performance scales well to present and future
many-core architectures. Our methodology is grounded in the classical computer science
field of algorithm analysis.
In Section 2.2 we introduce the key concepts in algorithm analysis, with particular
focus on the context of many-core architectures. We present four foundation algorithms,
and characterise them as we outline our algorithm analysis methodology. In Section 2.3
we demonstrate the proposed methodology by applying it to four well-known astronomy
problems, which we break down into their constituent foundation algorithms. We validate
our analysis of these problems against ad hoc many-core implementations as available in
the literature and discuss the implications of our approach for the future of computing in
astronomy in Section 2.4.
2.2 A Strategic Approach: Algorithm Analysis
Algorithm analysis, pioneered by Donald Knuth (see, e.g., Knuth 1998), is a fundamental
component of computer science—a discipline that is more about how to solve problems
than the actual implementation in code. In this work, we are not interested in the specifics
(i.e., syntax) of implementing a given astronomy algorithm with a particular programming
language or library (e.g., CUDA, OpenCL, Thrust) on a chosen computing architecture
(e.g., GPU, ClearSpeed, Cell). As Harris (2007) notes, algorithm-level optimisations are
much more important with respect to overall performance on many-core hardware (specifi-
cally GPUs) than implementation optimisations, and should be made first. We will return
to the issue of implementation in Chapter 3.
Here we present an approach to tackling the transition to many-core hardware based
on the analysis of algorithms. The purpose of this analysis is to determine the potential of
a given algorithm for a many-core architecture before any code is written. This provides
essential information about the optimal approach as well as the return on investment one
might expect for the effort of (re-)implementing a particular algorithm. Our methodology
was in part inspired by the work of Harris (2005).
Work in a similar vein has also been undertaken by Asanovic et al. (2006, 2009) who
classified parallel algorithms into 12 groups, referring to them as ‘dwarfs’. While insightful
and opportune, these dwarfs consider a wide range of parallel architectures, cover all areas
of computation (including several that are not of great relevance to astronomy) and are
limited as a resource by the coarse nature of the classification. In contrast, the approach
presented here is tailored to the parallelism offered by many-core processor architectures,
contains algorithms that appear frequently within astronomy computations, and provides
a fine-grained level of detail. Furthermore, our approach considers the fundamental con-
cerns raised by many-core architectures at a level of abstraction that avoids dealing with
hardware or software-specific details and terminology. This is in contrast to the work by
Che et al. (2008), who presented a useful but highly-targeted summary of general-purpose
programming on the NVIDIA GPU architecture.
For these reasons this work will serve as a valuable and practical resource for those
wishing to analyse the expected performance of particular astronomy algorithms on current
and future many-core architectures.
For a given astronomy problem, our methodology is as follows:
1. Outline each step in the problem.
2. Identify steps that resemble known algorithms (see below).
(a) Outlined steps may need to be further decomposed into sub-steps before a
known counterpart is recognised. Such composite steps may later be added to
the collection of known algorithms.
3. For each identified algorithm, refer to its pre-existing analysis.
(a) Where a particular step does not appear to match any known algorithm, refer
to a relevant analysis methodology to analyse the step as a custom algorithm
(see Sections 2.2.1, 2.2.2 and 2.2.3). The newly-analysed algorithm can then be
added to the collection for future reference.
4. Once analysis results have been obtained for each step, apply a global analysis to
the algorithm to obtain a complete picture of its behaviour (see Section 2.2.4).
Here we present a small collection of foundation algorithms¹ that appear in computa-
tional astronomy problems. This is motivated by the fact that complex algorithms may be
composed from simpler ones. We propose that algorithm composition provides an excellent
approach to turning the multi-core corner. Here we focus on its application to algorithm
analysis; in Chapter 4 we will show how it may also be applied to implementation method-
ologies. The algorithms are described below using a vector data structure. This is a data
structure like a Fortran or C array representing a contiguous block of memory and pro-
viding constant-time random access to individual elements². We use the notation v[i] to
represent the ith element of a vector v.
¹ Note that for these algorithms we have used naming conventions that are familiar to us but are by no means unique in the literature.
² Here we use constant-time in the algorithmic sense, i.e., constant with respect to the size of the input data. In this context we are not concerned with hardware-specific performance factors.
Transform: Returns a vector containing the result of the application of a specified
function to every individual element of an input vector.
out[i] = f(in[i]) (2.1)
Functions of more than one variable may also be applied to multiple input vectors. Scaling
the brightness of an image (defined as a vector of pixels) is an example of a transform
operation.
Reduce: Returns the sum of every element in a vector.
out = ∑ᵢ in[i] (2.2)
Reductions may be generalised to use any associative binary operator, e.g., product, min,
max etc. Calculating image noise is a common application of the reduce algorithm.
Gather: Retrieves values from an input vector according to a specified index mapping
and writes them to an output vector.
out[i] = in[map[i]] (2.3)
Reading a shifted or transformed subregion of an image is a common example of a gather
operation.
Interact: For each element i of an input vector, in1, sums the interaction between i
and each element j in a second input vector, in2.
out[i] = ∑ⱼ f(in1[i], in2[j]) (2.4)
where f is a given interaction function. The best-known application of this algorithm
in astronomy is the computation of forces in a direct N-body simulation, where both
input vectors represent the system’s particles and the interaction function calculates the
gravitational force between two particles.
These four algorithms were chosen from experience with a number of computational
astronomy problems. The transform, reduce and gather operations may be referred to as
‘atoms’ in the sense that they are indivisible operations. While the interact algorithm is
technically a composition of transforms and reductions, it will be analysed as if it too was
an atom, enabling rapid analysis of problems that use the interact algorithm without the
need for further decomposition.
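In a data-parallel setting these four algorithms map directly onto primitive array operations. The following sketch, in which vectorised NumPy operations stand in for genuinely parallel ones, mirrors Equations 2.1 to 2.4 and is purely illustrative:

```python
import numpy as np

def transform(f, vec):
    """out[i] = f(in[i])  (2.1); f must accept whole arrays."""
    return f(vec)

def reduce(vec):
    """out = sum_i in[i]  (2.2); any associative operator would do."""
    return np.sum(vec)

def gather(vec, mapping):
    """out[i] = in[map[i]]  (2.3); fancy indexing performs the mapping."""
    return vec[mapping]

def interact(f, in1, in2):
    """out[i] = sum_j f(in1[i], in2[j])  (2.4); all pairs, then row sums."""
    return np.sum(f(in1[:, None], in2[None, :]), axis=1)
```

Note that interact is expressed here as a composition of a transform over all pairs followed by a reduction along one axis, exactly as described above.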
We now describe a number of algorithm analysis techniques that we have found to
be relevant to massively-parallel architectures. These techniques should be applied to
the individual algorithms that comprise a complete problem in order to gain a detailed
understanding of their behaviour.
2.2.1 Principle characteristics
Many-core architectures exhibit a number of characteristics that can impact strongly on
the performance of an algorithm. Here we summarise four of the most important issues
that must be considered.
Massive parallelism: To fully utilise massively-parallel architectures, algorithms
must exhibit a high level of parallel granularity, i.e., the number of required operations that
may be performed simultaneously must be large and scalable. Data-parallel algorithms,
which divide their data between parallel processors rather than (or in addition to) their
tasks, exhibit parallelism that scales with the size of their input data, making them ideal
candidates for massively-parallel architectures. However, performance may suffer when
these algorithms are executed on sets of input data that are small relative to the number
of processors in a particular many-core architecture³.
Memory access patterns: Many-core architectures contain very high bandwidth
main memory⁴ in order to ‘feed’ the large number of parallel processing units. However,
high latency (i.e., memory transfer startup) costs mean that performance depends strongly
on the pattern in which memory is accessed. In general, maintaining ‘locality of reference’
(i.e., neighbouring threads accessing similar locations in memory) is vital to achieving
good performance⁵. Fig. 2.1 illustrates different levels of locality of reference.
Collisions between threads trying to read the same location in memory can also be
costly, and write-collisions must be treated using expensive atomic operations in order to
avoid conflicts between threads.
Branching: Current many-core architectures rely on single instruction multiple data
(SIMD) hardware. This means that neighbouring threads that wish to execute different
instructions must wait for each other to complete the divergent code section before ex-
ecution can continue in parallel (see Fig. 2.2). For this reason, algorithms that involve
significant branching between different threads may suffer severe performance degrada-
³ Note also that oversubscription of threads to processors is often a requirement for good performance in many-core architectures. For example, an NVIDIA GT200-class GPU may be under-utilised with an allocation of fewer than ∼10⁴ parallel threads, corresponding to an oversubscription rate of around 50×.
⁴ Memory bandwidths on current GPUs are O(100 GB/s).
⁵ Locality of reference also affects performance on traditional CPU architectures, but to a lesser extent than on GPUs.
Figure 2.1 Representative memory access patterns indicating varying levels of locality of reference. Contiguous memory access is the optimal case for many-core architectures. Patterns with high locality will generally achieve good performance; those with low locality may incur severe performance penalties.
Figure 2.2 A schematic view of divergent execution within a SIMD architecture. Lines indicate the flow of instructions; white diamonds indicate branch points, where the code paths of neighbouring threads diverge. The statements on the left indicate typical corresponding source code. White space between branch points indicates a thread waiting for its neighbours to complete a divergent code section.
tion. Similar to the effects of memory access locality, performance will in general depend
on the locality of branching, i.e., the number of different code-paths taken by a group of
neighbouring threads.
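One common way to restore locality of branching is predication: replacing a data-dependent if/else with arithmetic select operations, so that every thread executes the same instruction stream regardless of its data. A generic illustration (not tied to any particular GPU language), using clamping as the example:

```python
import numpy as np

def clamp_branchy(x, lo, hi):
    """Per-element if/else: on SIMD hardware, neighbouring threads taking
    different branches here would serialise."""
    out = np.empty_like(x)
    for i, v in enumerate(x):
        if v < lo:
            out[i] = lo
        elif v > hi:
            out[i] = hi
        else:
            out[i] = v
    return out

def clamp_branchless(x, lo, hi):
    """Equivalent predicated form: every element executes the same
    min/max operations, so there is no divergence."""
    return np.minimum(np.maximum(x, lo), hi)
```

Both functions compute the same result; the difference lies purely in the control flow presented to the hardware.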
Arithmetic intensity: Executing arithmetic instructions is generally much faster
than accessing memory on current many-core hardware. Algorithms performing few arith-
metic operations per memory access may become memory-bandwidth-bound; i.e., their
speed becomes limited by the rate at which memory can be accessed, rather than the
rate at which arithmetic instructions can be processed. Memory bandwidths in many-
core architectures are typically significantly higher than in CPUs, meaning that even
bandwidth-bound algorithms may exhibit strong performance; however, they will not be
able to take full advantage of the available computing power. In some cases, it may be
beneficial to re-work an algorithm entirely in order to increase its arithmetic intensity,
even at the cost of performing more numerical work in total.
For the arithmetic intensities presented in this chapter, we assume an idealised cache
model in which only the first memory read of a particular piece of data is included in
the count; subsequent or parallel reads of the same data are assumed to be made from a
cache, and are not counted. The ability to achieve this behaviour in practice will depend
strongly on the memory access pattern (specifically the locality of memory accesses).
Table 2.1 Analysis of four foundation algorithms

                         Transform    Reduction     Gather      Interact
Work                     O(N)         O(N)          O(N)        O(NM)
Depth                    O(1)         O(log N)      O(1)        O(M) or O(log M)
Memory access locality   Contiguous   Contiguous    Variable    Contiguous
Arithmetic intensity     1 : 1 : α    1 : 1/N : α   1 : 1 : 0   1 + M/N : 1 : 2Mα
2.2.2 Complexity analysis
The complexity of an algorithm is a formal measure of its execution time given a certain
size of input. It is often used as a means of comparing the speeds of two different algorithms
that compute the same (or a similar) result. Such comparisons are critical to understanding
the relative contributions of different parts of a composite algorithm and identifying bottle-
necks.
Computational complexity is typically expressed as the total run-time, T, of an algorithm
as a function of the input size, N, using ‘Big O’ notation. Thus T(N) = O(N)
means a run-time that is proportional to the input size N. An algorithm with complexity
T(N) = O(N^2) will take four times as long to run after a doubling of its input size.
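This quadratic behaviour is easy to verify by instrumenting a toy O(N^2) routine to count its own operations (an illustrative sketch of our own, not from the text):

```python
def pairwise_sums(values):
    """A deliberately O(N^2) routine that counts its own additions."""
    total = ops = 0
    for a in values:
        for b in values:
            total += a + b
            ops += 1
    return total, ops

# Doubling the input size quadruples the operation count.
_, ops_n = pairwise_sums(range(100))
_, ops_2n = pairwise_sums(range(200))
print(ops_n, ops_2n, ops_2n // ops_n)  # 10000 40000 4
```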
While the complexity measure is traditionally used for algorithms running on serial
processors, it can be generalised to analyse parallel algorithms. One method is to introduce
a second parameter: P , the number of processors. The run-time is then expressed as a
function of both N and P. For example, an algorithm with a parallel complexity of
T(N, P) = O(N/P) will run P times faster on P processors than on a single processor
for a given input size; i.e., it exhibits perfect parallel scaling. More complex algorithms
may incur overheads when run in parallel, e.g., those requiring communication between
processors. In these cases, the parallel complexity will depend on the specifics of the target
hardware architecture.
An alternative way to express parallel complexity is using the work, W , and depth,
D, metrics first introduced formally by Blelloch (1996). Here, work measures the total
number of computational operations performed by an algorithm (or, equivalently, the
run-time on a single processor), while depth measures the longest sequence of sequentially-
dependent operations (or, equivalently, the run-time on an infinite number of processors).
The depth metric is a measure of the amount of inherent parallelism in the algorithm. A
perfectly parallel algorithm has work complexity W(N) = O(N) and depth complexity
D(N) = O(1), meaning all but a constant number of operations may be performed in
parallel. An algorithm with W = O(N) and D = O(logN) is highly parallel, but contains
some serial dependencies between operations that scale as a function of the input size.
2.2. A Strategic Approach: Algorithm Analysis 35
Parallel algorithms with work complexities equal to those of their serial counterparts are
said to be ‘work efficient’; those that further exhibit low depth complexities are considered
to be efficient parallel algorithms. The benefit of the work/depth metrics over the parallel
run-time is that they have no dependence on the particular parallel architecture on which
the algorithm is executed, i.e., they measure properties inherent to the algorithm.
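These definitions can be made concrete with a tree-based parallel reduction, which sums N values using N − 1 additions in total (work) spread over ⌈log2 N⌉ sequential rounds (depth). The following sketch, with its round-counting instrumentation, is our own illustration:

```python
import math

def tree_reduce_sum(values):
    """Pairwise (tree) reduction returning (sum, work, depth):
    work  = total additions performed      -> N - 1
    depth = sequential rounds of additions -> ceil(log2 N)."""
    level = list(values)
    work = depth = 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        work += len(level) // 2
        if len(level) % 2:          # an odd element carries to the next round
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], work, depth

total, work, depth = tree_reduce_sum(range(16))
print(total, work, depth)  # 120 15 4
```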
A final consideration regarding parallel algorithms is Amdahl’s law (Amdahl, 1967),
which states that the maximum possible speedup over a serial algorithm is limited by the
fraction of the parallel algorithm that cannot be (or simply is not) parallelised. Assuming
an infinite number of available processors, the run-time of the parallel part of the algorithm
will reduce to a constant, while the serial part will continue to scale with the size of the
input. In terms of the work/depth metrics, the depth of the algorithm represents the
fraction that cannot be parallelised, and the maximum theoretical speedup is given by
Smax ≈ W/D. Note the implication that the maximum speedup is actually a function of the
input size. Increasing the problem size in addition to the number of processors allows the
speedup to scale more effectively.
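As a numerical illustration of this bound (our own sketch, not from the text), consider a reduction-like algorithm with W = O(N) and D = O(log N); its attainable speedup grows as the problem size is increased:

```python
import math

def max_speedup(work, depth):
    """Amdahl-style bound in work/depth form: S_max ~ W / D."""
    return work / depth

# A reduction-like algorithm: W = O(N), D = O(log N).
for n in (1_000, 1_000_000):
    print(f"N = {n:>9}: S_max ~ {max_speedup(n, math.log2(n)):,.0f}")
```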
2.2.3 Analysis results
We have applied the techniques discussed in Sections 2.2.1 and 2.2.2 to the four foundation
algorithms introduced at the beginning of Section 2.2. We use the following metrics:
• Work and depth: The complexity metrics as described in Section 2.2.2.
• Memory access locality: The nature of the memory access patterns as discussed
in Section 2.2.1.
• Arithmetic intensity: Defined by the triple ratio r : w : f representing the num-
ber of read, write and function evaluation operations respectively that the algorithm
performs (normalised to the input size). The symbol α is used, where applicable, to
represent the internal arithmetic intensity of the function given to the algorithm.
The results are presented in Table 2.1. Note that this analysis is based on the most-efficient
known parallel version of each algorithm.
2.2.4 Global analysis
Once local analysis results have been obtained for each step of a problem, it is necessary
to put them together and perform a global analysis. Our methodology is as follows:
1. Determine the components of the algorithm where most of the computational work
lies by comparing work complexities. Components with similar work complexities
should receive similar attention with respect to parallelisation in order to avoid
leaving behind bottle-necks as a result of Amdahl’s Law.
2. Consider the amount of inherent parallelism in each algorithm
by observing its theoretical speedup Smax ≈ W/D.
3. Use the theoretical arithmetic intensity of each algorithm to determine the likeli-
hood of it being limited by memory bandwidth rather than instruction throughput.
The theoretical global arithmetic intensity may be obtained by comparing the total
amount of input and output data to the total amount of arithmetic work to be done
in the problem.
4. Assess the memory access patterns of each algorithm to identify the potential to
achieve peak arithmetic intensity6.
5. If particular components exhibit poor properties, consider alternative algorithms.
6. Once a set of component algorithms with good theoretical performance has been
obtained, the algorithm decomposition should provide a good starting point for an
implementation.
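This checklist can be mechanised in rough form. The sketch below (the component names, work estimates and the intensity threshold are all illustrative assumptions of ours) applies steps 1–3 to per-component (work, depth, arithmetic intensity) estimates:

```python
def global_analysis(components, low_intensity=4.0):
    """components: {name: (work, depth, arithmetic_intensity)}.
    Step 1: find where the bulk of the work lies; step 2: bound each
    component's speedup by S_max ~ W/D; step 3: flag low arithmetic
    intensity as likely memory-bandwidth-bound."""
    total = sum(w for w, _, _ in components.values())
    dominant = [n for n, (w, _, _) in components.items() if w >= 0.1 * total]
    smax = {n: w / d for n, (w, d, _) in components.items()}
    bandwidth_bound = [n for n, (_, _, ai) in components.items()
                       if ai < low_intensity]
    return dominant, smax, bandwidth_bound

# Ray-shooting with N_rays = 1e6 and N_lenses = 1e4 (illustrative numbers;
# a serial histogram has depth equal to its work).
comps = {
    "init (transform)":   (1e6 + 1e4, 1.0, 1.0),
    "deflect (interact)": (1e6 * 1e4, 1e4, 1e4),
    "histogram (serial)": (1e6, 1e6, 1.0),
}
dominant, smax, bw = global_analysis(comps)
print(dominant)   # only the interact step carries significant work
print(bw)         # low-intensity steps likely limited by bandwidth
```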
2.3 Application to Astronomy Algorithms
We now apply our methodology from Section 2.2 to four typical astronomy computations.
In each case, we demonstrate how to identify the steps in an outline of the problem
as foundation algorithms from our collection described at the beginning of Section 2.2.
We then use this knowledge to study the exact nature of the available parallelism and
determine the problem’s overall suitability for many-core architectures. We note that we
have deliberately chosen simple versions of the problems in order to maximise clarity and
brevity in illustrating the principles of our algorithm analysis methodology.
2.3.1 Inverse ray-shooting gravitational lensing
Introduction: Inverse ray-shooting is a numerical technique used in gravitational mi-
crolensing. Light rays are projected backwards (i.e., from the observer) through an en-
6 Studying the memory access patterns will also help to identify the optimal caching strategy if this level of optimisation is desired.
semble of lenses and on to a source-plane pixel grid. The number of rays that fall into
each pixel gives an indication of the magnification at that spatial position relative to the
case where there was no microlensing. In cosmological scenarios, the resultant maps are
used to study brightness variations in light curves of lensed quasars, providing constraints
on the physical size of the accretion disk and broad line emission regions.
The two main approaches to ray-shooting are based on either the direct calculation
of the gravitational deflection by each lens (Kayser, Refsdal & Stabell, 1986; Schneider
& Weiss, 1986, 1987) or the use of a tree hierarchy of pseudo-lenses (Wambsganss, 1990,
1999). Here, we consider the direct method.
Outline: The ray-shooting algorithm is easily divided into a number of distinct steps:
1. Obtain a collection of lenses according to a desired distribution, where each lens has
position and mass.
2. Generate a collection of rays according to a uniform distribution within a specified
2D region, where each ray is defined by its position.
3. For each ray, calculate and sum its deflection due to each lens.
4. Add each ray’s calculated deflection to its initial position to obtain its deflected
position.
5. Calculate the index of the pixel that each ray falls into.
6. Count the number of rays that fall into each pixel.
7. Output the list of pixels as the magnification map.
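The steps above can be sketched directly in NumPy (an illustration in normalised lensing units; the lens and ray counts are arbitrary, and external shear and smooth matter are omitted):

```python
import numpy as np

rng = np.random.default_rng(42)

# Steps 1-2: lenses (position, mass) and randomly placed rays.
n_lenses, n_rays, n_pix = 50, 10_000, 32
lens_pos = rng.uniform(-1.0, 1.0, size=(n_lenses, 2))
lens_mass = rng.uniform(0.5, 1.5, size=n_lenses)
rays = rng.uniform(-2.0, 2.0, size=(n_rays, 2))

# Step 3 (interact): sum each ray's deflection over every lens,
# alpha(x) = sum_j m_j (x - x_j) / |x - x_j|^2 in normalised units.
diff = rays[:, None, :] - lens_pos[None, :, :]        # (n_rays, n_lenses, 2)
r2 = np.sum(diff ** 2, axis=-1)                       # (n_rays, n_lenses)
alpha = np.sum(lens_mass[None, :, None] * diff / r2[..., None], axis=1)

# Steps 4-7: deflected positions, then histogram rays into source-plane
# pixels; rays deflected outside the mapped region are simply dropped.
source = rays - alpha
mag_map, _, _ = np.histogram2d(source[:, 0], source[:, 1],
                               bins=n_pix, range=[[-2, 2], [-2, 2]])
print(mag_map.shape, int(mag_map.sum()))
```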
Analysis: To begin the analysis, we interpret the above outline as follows:
• Steps 1 and 2 may be considered transform operations that initialise the vectors of
lenses and rays.
• Step 3 is an example of the interact algorithm, where the inputs are the vectors of
rays and lenses and the interaction function calculates the deflection of a ray due to
the gravitational potential around a lens mass.
• Steps 4 and 5 apply further transforms to the collection of rays.
• Step 6 involves the generation of a histogram. As we have not already identified
this algorithm in Section 2.2, it will be necessary to analyse this step as a unique
algorithm.
According to this analysis, three basic algorithms comprise the complete technique:
transform, interact and histogram generation. Referring to Table 2.1, we see that, in the
context of a lensing simulation using Nrays rays and Nlenses lenses, the amount of work
performed by the transform and interact algorithms will be W = O(Nrays) + O(Nlenses)
and W = O(NraysNlenses) respectively.
We now analyse the histogram step. Considering first a serial algorithm for generating
a histogram, where each point is considered in turn and the count in its corresponding bin is
incremented, we find the work complexity to be W = O(Nrays). Without further analysis,
we compare this to those of the other component algorithms. The serial histogram and
the transform operations each perform similar work. The interact algorithm on the other
hand must, as we have seen, perform work proportional to Nrays×Nlenses. For large Nlenses
(e.g., as occurs in cosmological microlensing simulations, where Nlenses > 10^4) this step
will dominate the total work. Assuming the number of lenses is scaled with the amount
of parallel hardware, the interact step will also dominate the total run-time.
Given the dominance of the interact step, we now choose to ignore the effects of the
other steps in the problem. It should be noted, however, that in contrast to cosmological
microlensing, planetary microlensing models contain only a few lenses. In this case, the
work performed by the interact step will be similar to that of the other steps, and thus
the use of a serial histogram algorithm alongside parallel versions of all other steps would
result in a severe performance bottle-neck. Several parallel histogram algorithms exist,
but a discussion of them is beyond the scope of this work.
Returning to the analysis of the interact algorithm, we again refer to Table 2.1. Its
worst-case depth complexity indicates a maximum speedup of Smax ≈ W/D = O(Nrays), i.e.,
parallel speedup scaling perfectly up to the number of rays. The arithmetic intensity of
the algorithm scales as Nlenses and will thus be very high. Contiguous memory accesses
indicate strong potential to achieve this high arithmetic intensity. We conclude that direct
inverse ray-shooting for cosmological microlensing is an ideal candidate for an efficient
implementation on a many-core architecture.
2.3.2 Högbom CLEAN
Introduction: Raw (‘dirty’) images produced by radio interferometers exhibit unwanted
artefacts as the result of the incomplete sampling of the visibility plane. These artefacts
can inhibit image analysis and should ideally be removed by deconvolution. Several dif-
ferent techniques have been developed to ‘clean’ these images. For a review, see Briggs
(1995). Here we analyse the image-based algorithm first described by Högbom (1974). We
note that the algorithm by Clark (1980) is now the more popular choice in the astronomy
community, but point out that it is essentially an approximation to Högbom’s algorithm
that provides increased performance at the cost of reduced accuracy.
The algorithm involves iteratively finding the brightest point in the ‘dirty image’ and
subtracting from the dirty image an image of the beam centred on and scaled by this
brightest point. The procedure continues until the brightest point in the image falls below
a prescribed threshold. While the iterative procedure must be performed sequentially, the
computations within each iteration step are performed independently for every pixel of
the images, suggesting a substantial level of parallelism. The output of the algorithm is a
series of ‘clean components’, which may be used to reconstruct a cleaned image.
Outline: The algorithm may be divided into the following steps:
1. Obtain the beam image.
2. Obtain the image to be cleaned.
3. Find the brightest point, b, the standard deviation, σ, and the mean, µ, of the image.
4. If the brightness of b is less than a prescribed threshold (e.g., |b − µ| < 3σ), go to
step 9.
5. Scale the beam image by a fraction (referred to as the ‘loop gain’) of the brightness
of b.
6. Shift the beam image to centre it over b.
7. Subtract the scaled, shifted beam image from the input image to produce a partially-
cleaned image.
8. Repeat from step 3.
9. Output the ‘clean components’.
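The loop can be sketched in NumPy as follows (an illustration, not the thesis implementation: the beam is taken to be image-sized with periodic wrap via np.roll, and a simple absolute threshold stands in for the 3σ criterion of step 4):

```python
import numpy as np

def hogbom_clean(dirty, beam, gain=0.1, threshold=0.01, max_iter=500):
    """Iteratively locate the image peak and subtract the scaled, shifted
    beam (steps 3-8), recording the clean components."""
    img = dirty.copy()
    cy, cx = np.array(beam.shape) // 2
    components = []
    for _ in range(max_iter):
        y, x = np.unravel_index(np.argmax(np.abs(img)), img.shape)  # reduce
        peak = img[y, x]
        if abs(peak) < threshold:
            break
        components.append((y, x, gain * peak))
        # Gather: shift the beam so its centre lies on (y, x); transform:
        # subtract the scaled beam from the image.
        shifted = np.roll(np.roll(beam, y - cy, axis=0), x - cx, axis=1)
        img -= gain * peak * shifted
    return components, img

# Toy test: a single point source 'observed' with a Gaussian beam.
n = 64
yy, xx = np.mgrid[:n, :n]
beam = np.exp(-((yy - n // 2) ** 2 + (xx - n // 2) ** 2) / (2 * 2.0 ** 2))
dirty = np.roll(np.roll(beam, 10, axis=0), -5, axis=1)  # source at (42, 27)
comps, residual = hogbom_clean(dirty, beam)
print(len(comps), comps[0][:2])  # components accumulate at pixel (42, 27)
```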
Analysis: We decompose the outline of the Högbom clean algorithm as follows:
• Steps 1 and 2 are simple data-loading operations, and may be thought of as trans-
forms.
• Step 3 involves a number of reduce operations over the pixels in the dirty image.
• Step 5 is a transform operation, where each pixel in the beam is multiplied by a
scale factor.
• Step 6 may be achieved in two ways, either by directly reading an offset subset
of the beam pixels, or by switching to the Fourier domain and exploiting the shift
theorem. Here we will only consider the former option, which we identify as a gather
operation.
• Step 7 is a transform operation over pixels in the dirty image.
We thus identify three basic algorithms in Högbom clean: transform, reduce and
gather. Table 2.1 shows that the work performed by each of these algorithms will be
comparable (assuming the input and beam images are of similar pixel resolutions). This
suggests that any acceleration should be applied equally to all of the steps in order to
avoid the creation of bottle-necks.
The depth complexities of each algorithm indicate a limiting speed-up of Smax ≈ O(Npxls / log Npxls) during the reduce operations. While not quite ideal, this is still a good result.
Further, the algorithms do not exhibit high arithmetic intensity (the calculations involving
only a few subtractions and multiplies) and are thus likely to be bandwidth-bound. This
will dominate any effect the limiting speed-up may have.
The efficiency with which the algorithm will use the available memory bandwidth will
depend on the memory access patterns. The transform and reduce algorithms both make
contiguous memory accesses, and will thus achieve peak bandwidth. The gather operation
in step 6, where the beam image is shifted to centre it on a point in the input image, will
access memory in an offset but contiguous 2-dimensional block. This 2D locality suggests
the potential to achieve near-peak memory throughput.
We conclude that the Högbom clean algorithm represents a good candidate for im-
plementation on many-core hardware, but will likely be bound by the available memory
bandwidth rather than arithmetic computing performance.
2.3.3 Volume rendering
Introduction: There are a number of sources of volume data in astronomy, including
spectral cubes from radio telescopes and integral field units, as well as simulations using
adaptive mesh refinement and smoothed particle hydrodynamics techniques. Visualising
these data in physically-meaningful ways is important as an analysis tool, but even small
volumes (e.g., 256^3) require large amounts of computing power to render, particularly
when real-time interactivity is desired.
Several methods exist for rendering volume data; here we analyse a direct (or brute-
force) ray-casting algorithm (Levoy, 1990). While similarities exist between ray-shooting
for microlensing (Section 2.3.1) and the volume rendering technique we describe here, they
are fundamentally different algorithms.
Outline: The algorithm may be divided into the following steps:
1. Obtain the input data cube.
2. Create a 2D grid of output pixels to be displayed.
3. Generate a corresponding grid of rays, where each is defined by a position (initially
the centre of the corresponding pixel), a direction (defined by the viewing transfor-
mation) and a colour (initially black).
4. Project each ray a small distance (the step size) along its direction.
5. Determine which volume pixel (voxel) each ray now resides in.
6. Retrieve the colour of the voxel from the data volume.
7. Use a specified transfer function to combine the voxel colour with the current ray
colour.
8. Repeat from step 4 until all rays exit the data volume.
9. Output the final ray colours as the rendered image.
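The steps above can be sketched as follows (an illustration only: the viewing transformation is fixed along the z-axis and the transfer function is a simple maximum, both simplifying assumptions of the sketch):

```python
import numpy as np

def render_mip(volume, step=1.0):
    """Direct ray-casting with a maximum-intensity transfer function.
    For brevity the viewing transformation is fixed along the z-axis,
    so the ray for output pixel (i, j) visits voxels (i, j, z)."""
    ny, nx, nd = volume.shape
    image = np.zeros((ny, nx), dtype=volume.dtype)  # step 3: rays start black
    z = 0.0
    while z < nd:                                   # steps 4 and 8: march rays
        voxel = volume[:, :, int(z)]                # steps 5-6: gather voxels
        image = np.maximum(image, voxel)            # step 7: transfer function
        z += step
    return image                                    # step 9: rendered image

vol = np.zeros((8, 8, 8))
vol[2, 3, 5] = 7.0                                  # one bright voxel
img = render_mip(vol)
print(img[2, 3])                                    # the voxel projects to pixel (2, 3)
```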
Analysis: We interpret the steps in the above outline as follows:
• Steps 2 to 5 and 7 are all transform operations.
• Step 6 is a gather operation.
All steps perform work scaling with the number of output pixels, Npxls, indicating
there are no algorithmic bottle-necks and thus acceleration should be applied to the whole
algorithm equally.
Given that the number of output pixels is likely to be large and scalable, we should
expect the transforms and the gather, with their O(1) depth complexities, to parallelise
perfectly on many-core hardware.
The outer loop of the algorithm, which marches rays through the volume until they
leave its bounds, involves some branching as different rays traverse thicker or thinner parts
of the arbitrarily-oriented cube. This will have a negative impact on the performance of
the algorithm on a SIMD architecture like a GPU. However, if rays are ordered in such a
way as to maintain 2D locality between their positions, neighbouring threads will traverse
similar depths through the data cube, resulting in little divergence in their branch paths
and thus good performance on SIMD architectures.
The arithmetic intensity of each of the steps will typically be low (common trans-
fer functions can be as simple as taking the average or maximum), while the complete
algorithm requires O(NpxlsNd) memory reads, O(Npxls) memory writes and O(NpxlsNd)
function evaluations for an input data volume of side length Nd. This global arithmetic
intensity of Nd : 1 : Ndα (where α represents the arithmetic intensity of the transfer
function) indicates the algorithm is likely to remain bandwidth-bound.
The use of bandwidth will depend primarily on the memory access patterns in the
gather step (the transform operations perform ideal contiguous memory accesses). Dur-
ing each iteration of the algorithm, the rays will access an arbitrarily oriented plane of
voxels within the data volume. Such a pattern exhibits 3D spatial locality, presenting an
opportunity to cache the memory reads effectively and thus obtain near-peak bandwidth.
We conclude that the direct ray-casting volume rendering algorithm is a good candidate
for efficient implementation on many-core hardware, although, in the absence of transfer
functions with significant arithmetic intensity, the algorithm is likely to remain limited by
the available memory bandwidth.
2.3.4 Pulsar time-series dedispersion
Introduction: Radio telescopes observing pulsars produce time-series data containing
the pulse signal. Due to its passage through the interstellar medium, the pulse signature
gets delayed as a function of frequency, resulting in a ‘dispersing’ of the data. The signal
can be ‘dedispersed’ by assuming a frequency-dependent delay before summing the signals
at each frequency. In the case of pulsar searching, the data are dedispersed using a number
of trial dispersion measures (DMs), from which the true DM of the signal is measured.
There are several dedispersion algorithms used in the literature, including the direct
algorithm and the tree algorithm (Taylor, 1974). Here we consider the direct method,
which simply involves delaying and summing time series for a range of DMs. The cal-
culation for each DM is entirely independent, presenting an immediate opportunity for
parallelisation. Further, each sample in the time series is operated on individually, hinting
at additional fine-grained parallelism.
Outline: Here we describe the key steps of the algorithm:
1. Obtain a set of input time series, one per frequency channel.
2. If necessary, transpose the input data to place it into channel-major order.
3. Impose a time delay on each channel by offsetting its starting location by the number
of samples corresponding to the delay. The delay introduced into each channel is a
quadratic function of its frequency and a linear function of the dispersion measure.
4. Sum aligned samples across every channel to produce a single accumulated time
series.
5. Output the result and repeat (potentially in parallel) from step 3 for each desired
trial DM.
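For a single trial DM, the outline can be sketched as follows (an illustration with synthetic data; the band, sampling interval and dispersion measure are arbitrary choices of ours):

```python
import numpy as np

K_DM = 4.148808e3  # dispersion constant, MHz^2 pc^-1 cm^3 s

def dedisperse(data, freqs_mhz, dm, tsamp_s):
    """Direct dedispersion of channel-major data (n_chan, n_samp) for a
    single trial DM: offset each channel by its dispersion delay (step 3),
    then sum aligned samples across channels (step 4)."""
    f_ref = freqs_mhz.max()
    delays = K_DM * dm * (freqs_mhz ** -2 - f_ref ** -2)   # seconds
    shifts = np.round(delays / tsamp_s).astype(int)        # whole samples
    n_out = data.shape[1] - shifts.max()
    out = np.zeros(n_out, dtype=data.dtype)
    for c in range(data.shape[0]):
        out += data[c, shifts[c]:shifts[c] + n_out]
    return out

# Synthetic dispersed pulse, recovered by the matching trial DM.
freqs = np.linspace(1500.0, 1200.0, 16)                    # MHz
tsamp, true_dm, t0 = 64e-6, 100.0, 50
data = np.zeros((16, 4096))
for c, f in enumerate(freqs):
    d = int(np.round(K_DM * true_dm * (f ** -2 - freqs.max() ** -2) / tsamp))
    data[c, t0 + d] = 1.0
series = dedisperse(data, freqs, true_dm, tsamp)
print(int(series.argmax()), float(series.max()))  # 50 16.0
```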
Analysis: We interpret the above outline of the direct dedispersion algorithm as follows:
• Step 2 involves transposing the data, which is a form of gather.
• Step 3 may be considered a set of gather operations that shift the reading location
of samples in each channel by an offset.
• Step 4 involves the summation of many time series. This is a nested operation, and
may be interpreted as either a transform, where the operation is to sum the time
sample in each channel, or a reduce, where the operation is to sum whole time series.
The algorithm therefore involves gather operations in addition to nested transforms
and reductions. For data consisting of Ns samples for each of Nc channels, each step of the
computation operates on all O(NsNc) total samples. Acceleration should thus be applied
equally to all parts of the algorithm.
According to the depth complexity listed in Table 2.1, the gather operation will paral-
lelise perfectly. The nested transform and reduce calculation may be parallelised in three
possible ways: a) by parallelising the transform, where Ns parallel threads each compute
the sum of a single time sample over every channel sequentially; b) by parallelising the
reduce, where Nc parallel threads cooperate to sum each time sample in turn; or c) by
parallelising both the transform and the reduce, where Ns×Nc parallel threads cooperate
to complete the entire computation in parallel.
Analysing these three options, we see that they have depth complexities of O(Nc),
O(Ns logNc) and O(logNc) respectively. Option (c) would appear to provide the greatest
speedup; however, it relies on using significantly more parallel processors than the other
options. It will in fact only be the better choice in the case where the number of available
parallel processors is much greater than Ns. For hardware with fewer than Ns parallel
processors, option (a) will likely prove the better choice, as it is expected to scale perfectly
up to Ns parallel threads, as opposed to the less efficient scaling of option (c). In practice,
the number of time samples Ns will generally far exceed the number of parallel processors,
and thus the algorithm can be expected to exhibit excellent parallel scaling using option
(a).
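These trade-offs can be tabulated for representative values of Ns and Nc using the bound Smax ≈ W/D from Section 2.2.2 (the numbers below are illustrative assumptions):

```python
import math

ns, nc = 1_000_000, 1_024          # time samples and channels (illustrative)
work = ns * nc                     # additions for one trial DM

options = {                        # name: (depth, threads needed)
    "a: parallelise transform": (nc, ns),
    "b: parallelise reduce":    (ns * math.log2(nc), nc),
    "c: parallelise both":      (math.log2(nc), ns * nc),
}
for name, (depth, threads) in options.items():
    print(f"{name}: S_max ~ {work / depth:,.0f} "
          f"(needs ~{threads:,} threads to reach it)")
```

Option (c)’s bound is far higher but presumes roughly Ns × Nc threads; with fewer than Ns processors, option (a)’s work-efficient scaling already saturates the hardware, as argued above.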
Turning now to the arithmetic intensity, we observe that the computation of a single
trial DM involves only an addition for each of the Ns×Nc total samples. This suggests the
algorithm will be limited by memory bandwidth. However, this does not take into account
the fact that we wish to compute many trial dispersion measures. The computation of
NDM trial DMs still requires only O(Ns × Nc) memory reads and writes, but performs
NDM×Ns×Nc addition operations. The theoretical global arithmetic intensity is therefore
1 : 1 : NDM. Given a typical number of trial DMs of O(100), we conclude that the
algorithm could, in theory at least, make efficient use of all available arithmetic processing
power.
The ability to achieve such a high arithmetic intensity will depend on the ability to
keep data in fast memory for the duration of many arithmetic calculations (i.e., the ability
to efficiently cache the data). This in turn will depend on the memory access patterns.
We note that in general, similar trial DMs will need to access similar areas of memory;
i.e., the problem exhibits some locality of reference. The exact memory access pattern is
non-trivial though, and a discussion of these details is outside the scope of this work.
We conclude that the pulsar dedispersion algorithm would likely perform to a high
efficiency on a many-core architecture. While it is apparent that some locality of reference
exists within the algorithm’s memory accesses, optimal arithmetic intensity is unlikely
to be observed without a thorough and problem-specific analysis of the memory access
patterns.
2.4 Discussion
The direct inverse ray-shooting method has been implemented on a GPU by Thompson
et al. (2010). They simulated systems with up to 10^9 lenses. Using a single GPU, they
parallelised the interaction step of the problem and obtained a speedup of O(100×) relative
to a single CPU core—a result consistent with the relative peak floating-point performance
of the two processing units7. These results validate our conclusion that the inverse ray-
shooting algorithm is very well suited to many-core architectures like GPUs.
Our conclusions regarding the pulsar dedispersion algorithm are validated by a prelim-
7 We note that Thompson et al. (2010) did not use the CPU’s Streaming SIMD Extensions, which have the potential to provide a speed increase of up to 4×. However, our conclusion regarding the efficiency of the algorithm on the GPU remains unchanged by this fact.
inary GPU implementation we have written. With only a simplistic approach to memory
caching, we have recorded a speedup of 9× over an efficient multi-core CPU code run-
ning on four cores. This result is in line with the relative peak memory bandwidth of
the two architectures, supporting the conclusions of Section 2.3.4 that, without a detailed
investigation into the memory access patterns, the problem will remain bandwidth-bound.
Some astronomy problems are well-suited to a many-core architecture, others are not.
It is important to know how to distinguish between these. In the astronomy community,
the majority of work with many-core hardware to date has focused on the implementation
or porting of specific codes perhaps best classified as ‘low-hanging fruit’. Not surprisingly,
these codes have achieved significant speed-ups, in line with the raw performance benefits
offered by their target hardware.
A more generalised use of ‘novel’ computing architectures was undertaken by Brunner,
Kindratenko & Myers (2007), who, as a case study, implemented the two-point angular cor-
relation function for cosmological galaxy clustering on two different FPGA architectures8.
While they successfully communicated the advantages offered by these new technologies,
their focus on implementation details for their FPGA hardware inhibits the ability to
generalise their findings to other architectures.
It is interesting to note that previous work has in fact identified a number of common
concerns with respect to GPU implementations of astronomy algorithms. For example,
the issues of optimal use of the memory hierarchy and underuse of available hardware for
small particle counts have been discussed in the context of the direct N-body problem
(e.g., Belleman, Bedorf & Portegies Zwart 2008). These concerns essentially correspond
to a combination of what we have referred to as memory access patterns, arithmetic
intensity and massive parallelism. While originally being discussed as implementation
issues specific to particular choices of software and hardware, our abstractions re-cast
them at the algorithm level, and allow us to consider their impact across a variety of
problems and hardware architectures.
Using algorithm analysis techniques, we now have a basis for understanding which
astronomy algorithms will benefit most from many-core processors. Those with well-
defined memory access patterns and high arithmetic intensity stand to receive the greatest
performance boost, while problems that involve a significant amount of decision-making
may struggle to take advantage of the available processing power.
For some astronomy problems, it may be important to look beyond the techniques
currently in use, as these will have been developed (and optimised) with traditional CPU
8 Field Programmable Gate Arrays are another hardware architecture exhibiting significant fine-grained parallelism, but their specific details lie outside the scope of this thesis.
architectures in mind. Avenues of research could include, for instance, using higher-order
numerical schemes (Nitadori & Makino, 2008) or choosing simplicity over efficiency by
using brute-force methods (Bate et al. submitted). Some algorithms, such as histogram
generation, do not have a single obvious parallel implementation, and may require problem-
specific input during the analysis process.
In this work, we have discussed the future of astronomy computation, highlighting the
change to many-core processing that is likely to occur in CPUs.
The shift in commodity hardware from serial to parallel processing units will funda-
mentally change the landscape of computing. While the market is already populated with
multi-core chips, it is likely that chip designs will undergo further significant changes in
the coming years. We believe that for astronomy, a generalised methodology based on the
analysis of algorithms is a prudent approach to confronting these changes—one that will
continue to be applicable across the range of hardware architectures likely to appear in
the coming years: CPUs, GPUs and beyond.
Acknowledgments
We would like to thank Amr Hassan and Matthew Bailes for useful discussions regarding this chapter, and the reviewer of the corresponding paper, Gilles Civario, for helpful suggestions.
3 Accelerating Incoherent Dedispersion
Any idiot can get a ten times speed-up with a GPU.
—David Barnes
3.1 Introduction
With the advent of modern telescopes and digital signal processing back-ends, the time-
resolved radio sky has become a rich source of astrophysical information. Observations
of pulsars allow us to probe the nature of neutron stars (Lattimer & Prakash 2004),
stellar populations (Bhattacharya & van den Heuvel 1991), the Galactic environment
(Gaensler et al. 2008), plasma physics and gravitational waves (Lyne et al. 2004). Of equal
significance are transient signals such as those from rotating radio transients (McLaughlin
et al., 2006) and potentially rare one-off events such as ‘Lorimer bursts’ (Lorimer et al.,
2007; Keane et al., 2011), which may correspond to previously unknown phenomena.
These observations all depend on the use of significant computing power to search for
signals within long, frequency-resolved time series.
As radiation from sources such as pulsars propagates to Earth, it is refracted by free
electrons in the interstellar medium. This interaction has the effect of delaying the signal in
a frequency-dependent manner—signals at lower frequencies are delayed more than those
at higher frequencies. Formally, the observed time delay, ∆t, between two frequencies ν1
and ν2 as a result of dispersion by the interstellar medium is given by
∆t = kDM · DM · (ν1^−2 − ν2^−2),                         (3.1)
where kDM = e^2 / (2π me c) = 4.148808 × 10^3 MHz^2 pc^−1 cm^3 s is the dispersion constant1 and
1 We note that the dispersion constant is commonly approximated in the literature as 1/(2.41 × 10^−4) MHz^2 pc^−1 cm^3 s.
the frequencies are in MHz. The parameter DM specifies the dispersion measure along the
line of sight in pc cm^−3, and is defined as
DM ≡ ∫_0^d n_e dl,    (3.2)
where n_e is the electron number density (cm^−3) and d is the distance to the source (pc).
Once a time-varying source has been detected, its dispersion measure can be obtained
from observations of its phase as a function of frequency; this in turn allows the ap-
proximate distance to the object to be calculated via equation (3.2), assuming one has a
model for the Galactic electron density ne. When searching for new sources, however, one
does not know the distance to the object. In these cases, the dispersion measure must
be guessed prior to looking for a signal. To avoid excessive smearing of signals in the
time series, and a consequent loss of signal-to-noise, survey pipelines typically repeat the
process for many trial dispersion measures. This process is referred to as a dedispersion
transform. An example of the dedispersion transform is shown in Fig. 3.1.
Computing the dedispersion transform is a computationally expensive task: a simple
approach involves a summation across a band of, e.g., ∼10^3 frequency channels for each
of ∼10^3 (typically) dispersion measures, for each time sample. Given modern sampling
intervals of O(64 µs), computing this in real-time is a challenging task, especially if the
process must be repeated for multiple beams. The prohibitive cost of real-time dedisper-
sion has traditionally necessitated that pulsar and transient survey projects use offline
processing.
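To put rough numbers on this, the quoted survey scale implies the following addition rate per beam (a back-of-envelope sketch; the channel and DM counts of 1024 are illustrative stand-ins for the ∼10^3 figures above, not tied to any particular instrument):

```python
n_chan = 1024          # frequency channels (~10^3)
n_dm = 1024            # trial dispersion measures (~10^3)
tsamp = 64e-6          # sampling interval in seconds

samples_per_sec = 1.0 / tsamp                    # new time samples arriving per second
adds_per_sec = n_chan * n_dm * samples_per_sec   # additions needed to keep up in real time
print(f"{adds_per_sec:.2e} additions per second per beam")
```

At over 10^10 additions per second per beam, before any overheads, the appeal of massively parallel hardware is clear.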
In this paper we consider three ways in which computation of the dedispersion trans-
form may be accelerated, enabling real-time processing at low cost. First, in Section 3.2
we demonstrate how modern many-core computing hardware in the form of graphics pro-
cessing units [GPUs; see Chapter 1 for an introduction, also Fluke et al. (2011)] can
provide an order of magnitude more performance over a multi-core central processing
unit (CPU) when dedispersing ‘directly’. The use of GPUs for incoherent dedispersion is
not an entirely new idea. Dodson et al. (2010) introduced an implementation of such a
system as part of the CRAFT survey. Magro et al. (2011) described a similar approach
and how it may be used to construct a GPU-based real-time transient detection pipeline
for modest fractional bandwidths, demonstrating that their GPU dedisperser could
outperform a generic code by two orders of magnitude. In this work we provide a thorough
analysis of both the direct incoherent dedispersion algorithm itself and the details of its
Figure 3.1 An illustration of a dispersion trail (top) and its corresponding dedispersion transform (bottom). The darkest horizontal slice in the dedispersion transform gives the correctly dedispersed time series.
implementation on GPU hardware.
In Section 3.3 we then consider the use of the ‘tree’ algorithm, a (theoretically) more
efficient means of computing the dedispersion transform. To our knowledge, this technique
has not previously been implemented on a GPU. We conclude our analysis of dedispersion
algorithms in Section 3.4 with a discussion of the ‘sub-band’ method, a derivative of the
direct method.
In Section 3.5 we report accuracy and timing benchmarks for the three algorithms and
compare them to our theoretical results. Finally, we present a discussion of our results,
their implications for future pulsar and transient surveys and a comparison with previous
work in Section 3.6.
3.2 Direct Dedispersion
3.2.1 Introduction
The direct dedispersion algorithm operates by directly summing frequency channels along
a quadratic dispersion trail for each time sample and dispersion measure. In detail, the al-
gorithm computes an array of dedispersed time series D from an input dataset A according
to the following equation:
D_{d,t} = Σ_ν^{Nν} A_{ν, t+∆t(d,ν)},    (3.3)
where the subscripts d, t and ν represent dispersion measure, time sample and frequency
channel respectively, and Nν is the total number of frequency channels. Note that throughout
this paper we use the convention that Σ_i^N means the sum over the range i = 0 to
i = N − 1. The function ∆t(d, ν) is a discretized version of equation (3.1) and gives the
time delay relative to the start of the band in whole time samples for a given dispersion
measure and frequency channel:
∆T(ν) ≡ (k_DM/∆τ) [1/(ν_0 + ν∆ν)^2 − 1/ν_0^2],    (3.4)

∆t(d, ν) ≡ round(DM(d) ∆T(ν)),    (3.5)
where ∆τ is the time difference in seconds between two adjacent samples (the sampling
interval), ν0 is the frequency in MHz at the start of the band, ∆ν is the frequency difference
in MHz between two adjacent channels and the function round(x) means x rounded to the
nearest integer. The function or array DM(d) is used to specify the dispersion measures to
be computed. Note that the commonly-used central frequency, νc, and bandwidth, BW,
parameters are related by BW ≡ Nν ∆ν and ν_c ≡ ν_0 + (1/2) BW.
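Equations (3.3)-(3.5) can be transcribed directly. The following NumPy sketch (unoptimised, with hypothetical function names; the band parameters in the usage check are illustrative, not taken from the text) computes the delay table and the direct dedispersion transform:

```python
import numpy as np

K_DM = 4.148808e3  # dispersion constant, MHz^2 pc^-1 cm^3 s

def delay_table(n_chan, nu0, dnu, tsamp):
    """Equation (3.4): per-channel delay gradient dT(nu), in samples per unit DM."""
    nu = np.arange(n_chan)
    return K_DM / tsamp * (1.0 / (nu0 + nu * dnu)**2 - 1.0 / nu0**2)

def direct_dedisperse(data, dms, nu0, dnu, tsamp):
    """Equations (3.3) and (3.5): data has shape (n_chan, n_time); returns one
    dedispersed time series per trial DM (only fully-summed samples are kept)."""
    n_chan, n_time = data.shape
    dT = delay_table(n_chan, nu0, dnu, tsamp)
    n_out = n_time - int(np.round(dms.max() * dT.max()))
    out = np.zeros((len(dms), n_out))
    for d, dm in enumerate(dms):
        dt = np.round(dm * dT).astype(int)   # equation (3.5)
        for nu in range(n_chan):             # sequential sum over channels
            out[d] += data[nu, dt[nu]:dt[nu] + n_out]
    return out
```

A signal injected along the dispersion trail for a given DM is recovered at full amplitude only at the matching trial DM, which is the basis of the search.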
After dedispersion, the dedispersed time series Dd,t can be searched for periodic or
transient signals.
When dedispersing at large DM, the dispersion of a signal can be such that it is
smeared significantly within a single frequency channel. Specifically, this occurs when the
gradient of a dispersion curve on the time-frequency grid is less than unity (i.e., beyond
the ‘diagonal’). Once this effect becomes significant, it becomes somewhat inefficient to
continue to dedisperse at the full native time resolution. One option is to reduce the time
resolution by a factor of two when the DM exceeds the diagonal by adding adjacent pairs
of time samples. This process is then repeated at 2× the diagonal, 4× etc. We refer to
this technique as ‘time-scrunching’. The use of time-scrunching will reduce the overall
computational cost, but can also slightly reduce the signal-to-noise ratio if the intrinsic
pulse width is comparable to that of the dispersion smear.
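The time-scrunching step itself reduces to a pairwise sum of adjacent samples; a minimal NumPy sketch (in a survey pipeline this would be fused with the dedispersion kernel rather than run as a separate pass):

```python
import numpy as np

def time_scrunch(ts):
    """Halve the time resolution of a time series by summing adjacent pairs."""
    ts = ts[: len(ts) // 2 * 2]   # drop a trailing odd sample, if any
    return ts[0::2] + ts[1::2]
```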
3.2.2 Algorithm analysis
The direct dedispersion algorithm’s summation over Nν frequency channels for each of Nt
time samples and NDM dispersion measures gives it a computational complexity of
Tdirect = O(NtNνNDM). (3.6)
The algorithm was analysed previously for many-core architectures in Chapter 2. The key
findings were:
1. the algorithm is best parallelised over the "embarrassingly parallel" dispersion-measure
(d) and time (t) dimensions, with the sum over frequency channels (ν) being per-
formed sequentially,
2. the algorithm has a very high theoretical arithmetic intensity, of the same magnitude
as the number of dispersion measures computed [typically O(100− 1000)], and
3. the memory access patterns generally exhibit reasonable locality, but their non-trivial
nature may make it difficult to achieve a high arithmetic intensity.
While overall the algorithm appears well-positioned to take advantage of massively parallel
hardware, we need to perform a deeper analysis to determine the optimal implementation
strategy. The pattern in which memory is accessed is often critical to performance on
massively-parallel architectures, so this is where we now turn our attention.
While the d dimension involves a potentially non-linear mapping of input indices to
output indices, the t dimension maintains a contiguous mapping from input to output.
This makes the t dimension suitable for efficient memory access operations via spatial
caching, where groups of adjacent parallel threads access memory all at once. This be-
haviour typically allows a majority of the available memory bandwidth to be exploited.
The remaining memory access issue is the potential use of temporal caching to increase
the arithmetic intensity of the algorithm. Dedispersion at similar DMs involves access-
ing similar regions of input data. By pre-loading a block of data into a shared cache,
many DMs could be computed before needing to return to main memory for more data.
This would increase the arithmetic intensity by a factor proportional to the size of the
shared cache, potentially providing a significant performance increase, assuming the al-
gorithm was otherwise limited by available memory bandwidth. The problem with the
direct dedispersion algorithm, however, is its non-linear memory access pattern in the d
dimension. This behaviour makes a caching scheme difficult to devise, as one must account
for threads at different DMs needing to access data at delayed times. Whether temporal
caching can truly be used effectively for the direct dedispersion algorithm will depend on
details of the implementation.
3.2.3 Implementation Notes
When discussing GPU implementations throughout this paper, we use the terms ‘Fermi’
and ‘pre-Fermi’ GPUs to mean GPUs of the NVIDIA Fermi architecture and those of older
architectures respectively. We consider both architectures in order to study the recent
evolution of GPU hardware and gain insight into the future direction of the technology.
We implemented the direct dedispersion algorithm using the C for CUDA platform2.
As suggested by the analysis in Section 3.2.2, the algorithm was parallelised over the
dispersion-measure and time dimensions, with each thread summing all Nν channels se-
quentially. During the analysis it was also noted that the algorithm’s memory access pat-
tern exhibits good spatial locality in the time dimension, with contiguous output indices
mapping to contiguous input indices. We therefore chose time as the fastest-changing
(i.e., x) thread dimension, such that reads from global memory would always be from
contiguous regions with a unit stride, maximising throughput. The DM dimension was
consequently mapped to the second (i.e., y) thread dimension.
While the memory access pattern is always contiguous, it is not always aligned. This
is a result of the delays, ∆t(d, ν), introduced in the time dimension. At all non-zero
2http://developer.nvidia.com/object/gpucomputing.html
DMs, the majority of memory accesses will begin at arbitrary offsets with respect to the
internal alignment boundaries of the memory hardware. The consequence of this is that
GPUs that do not have built-in caching support may need to split the memory requests
into many smaller ones, significantly impacting throughput to the processors. In order
to avoid this situation, we made use of the GPU’s texture memory, which does support
automatic caching. On pre-Fermi GPU hardware, the use of texture memory resulted
in a speed-up of around 5× compared to using plain device memory, highlighting the
importance of understanding the details of an algorithm’s memory access patterns when
using these architectures. With the advent of Fermi-class GPUs, however, the situation
has improved significantly. These devices contain an L1 cache that provides many of the
advantages of using texture memory without having to explicitly refer to a special memory
area. Using texture memory on Fermi-class GPUs was slightly slower than using plain
device memory (with L1 cache enabled), as suggested in the CUDA programming guide3.
Input data with fewer bits per sample than the machine word size (currently assumed
to be 32 bits) were handled using bit-shifting and masking operations on the GPU. It was
found that a convenient format for working with the input data was to transpose the input
from time-major order to frequency-major order by whole words, leaving consecutive fre-
quency channels within each word. For example, for the case of two samples per word, the
data order would be: (Aν1,t1Aν2,t1),(Aν1,t2Aν2,t2),..., (Aν3,t1Aν4,t1),(Aν3,t2Aν4,t2),..., where
brackets denote data within a machine word. This format means that time delays are
always applied in units of whole words, avoiding the need to deal with intra-word delays.
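For the two-samples-per-word case described above, the repacking might look as follows (a NumPy sketch assuming 16-bit samples in 32-bit words; the actual code handled general sub-word sample sizes with shifts and masks):

```python
import numpy as np

def transpose_pack_pairs(A):
    """Repack A[chan, time] (16-bit samples, two per 32-bit word) so that each
    word holds two adjacent channels at one time, with words running along the
    time axis for channel pair 0, then pair 1, and so on. Time delays can then
    be applied in units of whole words."""
    hi = A[0::2].astype(np.uint32) << 16   # even channels in the high half-word
    lo = A[1::2].astype(np.uint32)         # odd channels in the low half-word
    return (hi | lo).ravel()
```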
The thread decomposition was written to allow the shape of the block (i.e., number
of DMs or time samples per block) to be tuned. We found that for a block size of 256
threads, optimal performance on a Fermi GPU was achieved when this was divided into
8 time samples × 32 DMs. We interpreted this result as a cache-related effect, where the
block shape determines the spread of memory locations accessed by a group of neighbour-
ing threads spread across time-DM space, and the optimum occurs when this spread is
minimised. On pre-Fermi GPUs, changing the block shape was found to have very little
impact on performance.
To minimise redundant computations, the functions DM(d) and ∆T (ν) were pre-
computed and stored in look-up tables for the given dispersion measures and frequency
channels respectively. Delays were then computed simply by retrieving values from the
two tables and evaluating equation (3.5), requiring only a single multiplication and a
rounding operation. On pre-Fermi GPUs, the table corresponding to ∆T(ν) was explicitly
stored in the GPU's constant memory space, which provides extremely efficient access
when all threads read the same value (this is always the case for our implementation,
where frequency channels are traversed sequentially). On Fermi-generation cards, this
explicit use of the constant memory space is unnecessary—constant memory caching is
used automatically when the compiler determines it to be possible.
3The current version of the CUDA programming guide is available for download at: http://www.nvidia.com/object/cuda_develop.html
To amortize overheads within the GPU kernel such as index calculations, loop counters
and time-delay computations, we allowed each thread to store and sum multiple time
samples. Processing four samples per thread was found to significantly reduce the total
arithmetic cost without affecting memory throughput. Increasing this number required
more registers per thread (a finite resource), and led to diminishing returns; we found four
to be the optimal solution for our implementation.
Our implementation was written to support a channel “kill mask”, which specifies
which frequency channels should be included in the computation and which should be
skipped (e.g., to avoid radio frequency interference present within them). While our initial
approach was to apply this mask as a conditional statement [e.g., if (kill_mask[channel])
{ sum += data; }], it was found that applying the mask arithmetically (e.g., sum +=
data * kill_mask[channel]) resulted in better performance. This is not particularly
surprising given the GPU hardware’s focus on arithmetic throughput rather than branch-
ing operations.
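The arithmetic form of the mask can be sketched in scalar form as follows (a Python analogue of the CUDA expression above, with hypothetical names; the loop mirrors each thread's sequential sum over channels):

```python
def masked_channel_sum(samples, kill_mask):
    """Accumulate channel samples with the kill mask applied arithmetically:
    killed channels contribute zero, and every iteration does identical work,
    avoiding divergent branches."""
    acc = 0.0
    for s, keep in zip(samples, kill_mask):
        acc += s * keep   # keep is 0 or 1
    return acc
```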
Finally, we investigated the possibility of using temporal caching, as discussed in the
analysis in Section 3.2.2. Unlike most CPUs, GPUs provide a manually-managed cache
(known as shared memory on NVIDIA GPUs). This provides additional power and flexi-
bility at the cost of programming effort. We used shared memory to stage a rectangular
section of input data (i.e., of time-frequency space) in each thread block. Careful attention
was given to the amount of data cached, with additional time samples being loaded to al-
low for differences in delay across a block. The cost of development was significant, and it
remained unclear whether the caching mechanism could be made robust against a variety
of input parameters. Further, we found that the overall performance of the code was not
significantly altered by the addition of the temporal caching mechanism. We concluded
that the additional overheads involved in handling the non-linear memory access patterns
(i.e., the mapping of blocks of threads in time-DM space to memory in time-frequency
space) negated the performance benefit of staging data in the shared cache. We note,
however, that caching may prove beneficial when considering only low DMs (e.g., below
the diagonal), where delays vary slowly and memory access patterns remain relatively
compact.
In theory it is possible that, via careful re-packing of the input data, one could exploit
the bit-level parallelism available in modern computing hardware in addition to the thread-
level parallelism. For example, for 2-bit data, packing each 2-bit value into 8-bits would
allow four values to be summed in parallel with a single 32-bit addition instruction. In
this case, (2^8 − 1)/(2^2 − 1) = 85 additions could be performed before one risked integer overflow. To
dedisperse say 1024 channels, one could first sum blocks of 85 channels and then finish
the summation by re-packing the partial sums into a larger data type. This would achieve
efficient use of the available processing hardware, at the cost of additional implementation
complexity and overheads for re-packing and data management. We did not use this
technique in our GPU dedispersion codes, although our reference CPU code does exploit
this extra parallelism by packing four 2-bit samples into a 64-bit word before dedispersion.
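As a concrete illustration of the idea (which, as noted, our GPU codes did not use), the following sketch packs 2-bit samples one per byte and accumulates many words with ordinary integer additions:

```python
def pack_2bit_into_bytes(vals):
    """Pack four 2-bit samples into one 32-bit word, one sample per byte lane,
    leaving six spare bits per lane to absorb carries during accumulation."""
    word = 0
    for lane, v in enumerate(vals):
        assert 0 <= v < 4 and len(vals) == 4
        word |= v << (8 * lane)
    return word

def swar_sum(words):
    """Sum packed words with single integer additions; the byte lanes stay
    independent for up to (2**8 - 1) // (2**2 - 1) == 85 maximal 2-bit samples."""
    acc = 0
    for w in words:
        acc += w
    return [(acc >> (8 * lane)) & 0xFF for lane in range(4)]
```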
3.3 Tree Dedispersion
3.3.1 Introduction
The tree dedispersion algorithm, first described by Taylor (1974), attempts to reduce
the complexity of the dedispersion computation from O(NtNνNDM) to O(NtNν logNν).
This significant speed-up is obtained by first regularising the problem and then exploiting
the regularity to allow repeated calculations to be shared between different DMs. While
theoretical speed-ups of O(100) are possible, in practice a number of additional overheads
arise when working with real data. These overheads, as well as its increased complexity,
have meant that the tree algorithm is rarely used in modern search pipelines. In this work
we investigate the tree algorithm in order to assess its usefulness in the age of many-core
processors.
In its most basic form, the tree dedispersion algorithm is used to compute the following:
D′_{d′,t} = Σ_ν^{Nν} A_{ν, t+∆t′(d′,ν)},    (3.7)

∆t′(d′, ν) = round(d′ ν/(Nν − 1)),    (3.8)
for d′ in the range 0 ≤ d′ < Nν. The regularisation is such that the delay function ∆t′(d′, ν)
is now a linear function of ν that ranges from 0 to exactly d′ across the band. The DMs
Figure 3.2 Visualisation of the tree dedispersion algorithm. Rectangles represent frequency channels, each containing a time series going 'into the page'. Arrows indicate the flow of data, triangles represent addition operations and circles indicate unit time delays into the page.
computed by the tree algorithm are therefore:
DM(d′) = d′/∆T(Nν − 1),    (3.9)
where the function ∆T (ν) is that given by equation (3.4).
The tree algorithm is able to evaluate equation (3.7) for d′ in the range 0 ≤ d′ < Nν
in just log2Nν steps. It achieves this feat by using a divide and conquer approach in the
same way as the well-known fast Fourier transform (FFT) algorithm. The tree algorithm
is visualised in Fig. 3.2. We define the computation at each step i as follows:
A^0_{ν,t} ≡ A_{ν,t}    (3.10)

A^{i+1}_{2ν,t} = A^i_{Φ(i,2ν),t} + A^i_{Φ(i,2ν+1), t+Θ(i,2ν)}    (3.11)

A^{i+1}_{2ν+1,t} = A^i_{Φ(i,2ν),t} + A^i_{Φ(i,2ν+1), t+Θ(i,2ν+1)}    (3.12)

D′_{d′,t} = D′_{ν,t} = A^{log_2 Nν}_{ν,t}.    (3.13)
The integer function Θ(i, ν) gives the time delay for a given iteration and frequency channel
and can be defined as
Θ(i, ν) ≡ [(ν mod 2^{i+1}) + 1]/2,    (3.14)
where mod is the modulus operator, and division is taken to be truncated integer division.
The integer function Φ(i, ν), which we refer to as the ‘shuffle’ function, re-orders the indices
ν according to a pattern defined as follows:
Φ(r, ν) ≡ (ν mod 2) r + ν/2 + [ν/(2r)] r,    (3.15)
where the parameter r ≡ 2^i is known as the radix, and division is again truncated integer division.
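Equations (3.10)-(3.15) can be checked with a direct, unoptimised transcription (a NumPy sketch; np.roll stands in for the time shifts, so samples wrap at the array edge, and the function names are our own):

```python
import numpy as np

def theta(i, nu):
    """Equation (3.14): unit time delay at step i for channel index nu."""
    return ((nu % 2**(i + 1)) + 1) // 2

def phi(i, nu):
    """Equation (3.15): the 'shuffle' function, radix r = 2^i, truncated division."""
    r = 2**i
    return (nu % 2) * r + nu // 2 + (nu // (2 * r)) * r

def tree_dedisperse(A):
    """Equations (3.10)-(3.13): A has shape (n_chan, n_time) with n_chan a power
    of two; row d' of the result sums the channels along a linear trail of total
    delay d' across the band."""
    n_chan, _ = A.shape
    cur = A.astype(np.int64)
    for i in range(n_chan.bit_length() - 1):   # log2(n_chan) steps
        nxt = np.empty_like(cur)
        for nu in range(n_chan // 2):
            a, b = phi(i, 2 * nu), phi(i, 2 * nu + 1)
            nxt[2 * nu] = cur[a] + np.roll(cur[b], -theta(i, 2 * nu))
            nxt[2 * nu + 1] = cur[a] + np.roll(cur[b], -theta(i, 2 * nu + 1))
        cur = nxt                               # out-of-place, double buffered
    return cur
```

For the small channel counts we checked, the output reproduces equation (3.7), with per-channel delays matching round(d′ν/(Nν − 1)).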
While the tree dedispersion algorithm succeeds in dramatically reducing the compu-
tational cost of dedispersion, it has a number of constraints not present in the direct
algorithm:
1. the computed dispersion trails are linear in frequency, not quadratic as found in
nature [see equation (3.1)],
2. the computed dispersion measures are constrained to those given by equation (3.9),
and
3. the number of input frequency channels Nν (and thus also the number of DMs) must
be a power of two.
Constraint 3 is generally not a significant concern, as it is common for the number of
frequency channels to be a power of two, and blank channels can be added when this is not
the case. Constraints 1 and 2 are more problematic, as they prevent the computation
of accurate and efficiently-distributed dispersion trails. Fortunately there are ways of
working around these limitations.
One method is to approximate the dispersion trail with piecewise linear segments by
dividing the input data into sub-bands (Manchester et al., 1996). Another approach is
to quadratically space the input frequencies by padding with blank channels as a pre-
processing step such that the second order term in the dispersion trail is removed (Manch-
ester et al., 2001). These techniques are described in the next two sections.
The piecewise linear tree method
Approximation of the quadratic dispersion curve using piecewise linear segments involves
two stages of computation. If the input data are divided into Ns sub-bands of length
N′_ν = Nν/Ns,    (3.16)
with the nth sub-band starting at frequency channel
ν_n = n N′_ν,    (3.17)
then from equation (3.7) we see that the tree dedispersion algorithm applied to each
sub-band results in the following:
S_{n,d′,t} = Σ_{ν′}^{N′_ν} A_{ν_n+ν′, t+∆t′(d′, ν_n+ν′)},    (3.18)
which we refer to as stage 1 of the piecewise linear tree method.
In each sub-band, we approximate the quadratic dispersion trail with a linear one. We
compute the linear DM in the nth sub-band that approximates the true DM indexed by d
as follows:
d′_n(d) = ∆t(d, ν_{n+1}) − ∆t(d, ν_n)    (3.19)
        = round(DM(d) [∆T(ν_{n+1}) − ∆T(ν_n)]).    (3.20)
Applying the constraint d′n < N ′ν and noting that the greatest dispersion delay occurs at
the end of the band, we obtain a limit on the DM that the basic piecewise linear tree
algorithm can compute. This limit is commonly referred to as the ‘diagonal’ DM, as it
corresponds to a dispersion trail in the time-frequency grid with a gradient of unity:^4

DM_diag^{(piecewise)} = (N′_ν − 1/2) / [∆T(Nν) − ∆T(Nν − N′_ν)].    (3.21)
A technique for computing larger DMs with the tree algorithm is discussed in Section 3.3.1.
4Note that the '1/2' in equation (3.21) arises from the round-to-nearest operation in equation (3.20).
The dedispersed sub-bands can now be combined to approximate the result of equation
(3.3):
D_{d,t} ≈ Σ_n^{Ns} S_{n, d′_n(d), t+∆t″_n(d)},    (3.22)

∆t″_n(d) = round(DM(d) Σ_m^{n} [∆T(ν_{m+1}) − ∆T(ν_m)]).    (3.23)
This forms stage 2 of the piecewise linear tree computation.
The use of the tree algorithm with sub-banding introduces an additional source of
smearing into the dedispersed time series as a result of approximating the quadratic dis-
persion curve with a piecewise linear one. We derive an analytic upper limit for this
smearing in Appendix A.1.
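The stage-2 index maps follow directly from equations (3.19), (3.20) and (3.23); note that the sum in equation (3.23) telescopes to DM(d)[∆T(ν_n) − ∆T(ν_0)], with ∆T(0) = 0. A sketch with hypothetical helper names and illustrative band parameters:

```python
K_DM = 4.148808e3  # MHz^2 pc^-1 cm^3 s

def make_dT(nu0, dnu, tsamp):
    """Equation (3.4) as a function: delay gradient in samples per unit DM."""
    return lambda nu: K_DM / tsamp * (1.0 / (nu0 + nu * dnu)**2 - 1.0 / nu0**2)

def piecewise_mappings(dT, n_chan, n_sub, dm):
    """For one trial DM: the per-sub-band linear DM index d'_n [eqs (3.19)-(3.20)]
    and the stage-2 offset dt''_n [eq (3.23), telescoped to round(DM * dT(nu_n))]."""
    w = n_chan // n_sub                          # N'_nu, equation (3.16)
    edges = [n * w for n in range(n_sub + 1)]    # nu_n, equation (3.17)
    d_prime = [round(dm * (dT(edges[n + 1]) - dT(edges[n]))) for n in range(n_sub)]
    dt2 = [round(dm * dT(edges[n])) for n in range(n_sub)]
    return d_prime, dt2
```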
The frequency-padded tree method
An alternative approach to adapting the tree algorithm to quadratic dispersion curves is
to linearise the input data via a change of frequency coordinates. Formally, the aim is
to ‘stretch’ ∆T (ν) [equation (3.4)] to a linear function ∆T ′(ν ′) ∝ ν ′. Expanding to first
order around ν = 0, we have:
∆T′(ν′) = ∆T(0) + ν′ (d/dν)[∆T(ν)]|_{ν=0}.    (3.24)
The change of variables ν → ν ′ is then found by equating ∆T (ν) with its linear approxi-
mation, ∆T ′(ν ′), and solving for ν ′(ν), which gives
ν′ = round((1/2)(ν_0/∆ν) [1 − (1 + (∆ν/ν_0) ν)^{−2}]).    (3.25)
Evaluating at ν = Nν gives the total number of frequency channels in the linearised
coordinates, which determines the additional computational overhead introduced by the
procedure. Note, however, that this number must be rounded up to a power of two before
the tree dedispersion algorithm can be applied. For observations with typical effective
bandwidths and channel counts that are already a power of two, the frequency padding
technique is unlikely to require increasing the total number of channels by more than a
factor of two.
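Equation (3.25) and the power-of-two round-up can be sketched as follows (the band parameters in the check are illustrative, Parkes-like values, not taken from the text):

```python
import math

def padded_channel(nu, nu0, dnu):
    """Equation (3.25): position nu' of channel nu in the linearised grid."""
    return round(0.5 * (nu0 / dnu) * (1.0 - (1.0 + (dnu / nu0) * nu) ** -2))

def padded_total(n_chan, nu0, dnu):
    """Channel count of the padded grid, rounded up to the next power of two."""
    return 2 ** math.ceil(math.log2(padded_channel(n_chan, nu0, dnu)))
```

For a 1024-channel band of 400 MHz below 1582 MHz, the padded grid rounds up to 2048 channels, consistent with the factor-of-two remark above.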
In practice, the linearisation procedure is applied by padding the frequency dimension
with blank channels such that the real channels are spaced according to equation (3.25).
Once the dispersion trails have been linearised, the tree algorithm can be applied directly.
The ‘diagonal’ DM when using the frequency padding method corresponds to
DM_diag^{(padded)} = 1/∆T(1).    (3.26)
Computing larger DMs
The basic tree dedispersion algorithm computes exactly the DMs specified by equation
(3.9). In practice, however, it is often necessary to search a much larger range of dispersion
measures. Fortunately, there are techniques by which this can be achieved without having
to resort to using the direct method. The tree algorithm can be made to compute higher
DMs by first transforming the input data and then repeating the dedispersion computation.
Formally, the following sequence of operations can be used to compute an arbitrary range
of DMs:
1. Apply the tree algorithm to obtain DMs from zero to DMdiag.
2. Impose a time delay across the band.
3. Apply the tree algorithm to obtain DMs from DMdiag to 2DMdiag.
4. Increment the imposed time delay.
5. Repeat from step 2 to obtain DMs up to 2NDMdiag.
The imposed time delay is initially a simple diagonal across the band (i.e., ∆t = ν), and is
implemented by incrementing a memory stride value rather than actually shifting memory.
While this method enables dedispersion up to arbitrarily large DMs, it does not alter the
spacing of DM trials, which remains fixed as per equation (3.9).
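The repeated-tree scheme above can be sketched as follows (delays are applied with explicit rolls for clarity; as noted, a real implementation increments a memory stride instead, and the tree routine is passed in as a function):

```python
import numpy as np

def tree_over_large_dms(data, tree_fn, n_repeats):
    """Each pass imposes an extra diagonal delay of one sample per channel and
    re-runs the tree, covering the next DM_diag-wide band of trial DMs."""
    n_chan, _ = data.shape
    shifted = data.copy()
    out = []
    for k in range(n_repeats):
        out.append(tree_fn(shifted))           # covers [k, k+1) x DM_diag
        for nu in range(n_chan):               # impose delay dt = nu across the band
            shifted[nu] = np.roll(shifted[nu], -nu)
    return np.concatenate(out, axis=0)
```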
The ‘time-scrunching’ technique, discussed in Section 3.2.1 for the direct algorithm,
can also be applied to the tree algorithm. The procedure is as follows:
1. Apply the tree algorithm to obtain DMs from zero to DMdiag.
2. Impose a time delay across the band.
3. Apply the tree algorithm to obtain DMs from DMdiag to 2DMdiag.
4. Compress (‘scrunch’) time by a factor of 2 by summing adjacent samples.
5. Impose a time delay across the band.
6. Apply the tree algorithm to obtain DMs from 2DMdiag to 4DMdiag.
7. Repeat from step 4 to obtain DMs up to 2NDMdiag.
As with the direct algorithm, the use of time-scrunching provides a performance benefit
at the cost of a minor reduction in the signal-to-noise ratio for pulses of intrinsic width
near the dispersion measure smearing time.
3.3.2 Algorithm analysis
The tree dedispersion algorithm’s computational complexity of O(NtNν logNν) breaks
down into log2Nν sequential steps, with each step involving the computation of O(NtNν)
independent new values, as seen in equations (3.10) to (3.13). Following the analysis
methodology of Chapter 2, the algorithm therefore has a depth complexity of O(logNν),
meaning it contains this many sequentially-dependent operations. Interestingly, this result
matches that of the direct algorithm, although the tree algorithm requires significantly less
total work. From a theoretical perspective, this implies that the tree algorithm contains
less inherent parallelism than the direct algorithm. In practice, however, the number
of processors will be small relative to the size of the problem (NtNν), and this reduced
inherent parallelism is unlikely to be a concern for performance except when processing
very small data-sets.
Branching (i.e., conditional statements) within an algorithm can have a significant
effect on performance when targeting GPU-like hardware (see Chapter 2). Fortunately,
the tree algorithm is inherently branch-free, with all operations involving only memory
accesses and arithmetic operations. This issue is therefore of no concern in this instance.
The arithmetic intensity of the tree algorithm is determined from the ratio of arith-
metic operations to memory operations. To process NtNν samples, the algorithm involves
NtNν log2Nν ‘delay and add’ operations, and produces NtNν samples of output. In con-
trast to the direct algorithm, where the theoretical arithmetic intensity was proportional
to the number of DMs computed, the tree algorithm requires only O(logNν) operations
per sample. This suggests that the tree algorithm may be unable to exploit GPU-like
hardware as efficiently as the direct algorithm. However, the exact arithmetic intensity
will depend on constant factors and additional arithmetic overheads, and will only become
apparent once the algorithm has been implemented. We defer discussion of these results
to Section 3.3.3.
Achieving peak arithmetic intensity requires reading input data from ‘slow memory’
into ‘fast memory’ (e.g., from disk into main memory, from main memory into cache,
from host memory into GPU memory etc.) only once, before performing all computations
within fast memory and writing the results, again just once, back to slow memory. In the
tree dedispersion algorithm, this means performing all log2Nν steps entirely within fast
memory. The feasibility of this will depend on implementation details, the discussion of
which we defer to Section 3.3.3. However, it will be useful to assume that some sub-set of
the total computation will fit within this model. We will therefore continue the analysis
of the tree algorithm under the assumption that we are computing only a (power-of-two)
subset, or block, of Bν channels.
The memory access patterns within the tree algorithm resemble those of the direct al-
gorithm (see Section 3.2.2). Time samples are always accessed contiguously, with an offset
that is essentially arbitrary. In the frequency dimension, memory is accessed according
to the shuffle function [equation (3.15)] depicted in Fig. 3.2, where at any given step of
the algorithm the frequency channels ‘interact’ in pairs, the interaction involving their
addition with different time delays.
With respect to the goal of achieving peak arithmetic intensity, the key issue for
the memory access patterns within the tree algorithm is the extent to which they remain
‘compact’. This is important because it determines the ability to operate on isolated blocks
of data independently, which is critical to prolonging the time between successive trips to
slow memory. In the frequency dimension, the computation of some local (power-of-two)
sub-set of channels Bν involves accessing only other channels within the same subset. In
this sense we can say that the memory access patterns are ‘locally compact’ in channels.
In the time dimension, however, we note that the algorithm applies compounding delays
(equivalent to offsets in whole time samples). This means that the memory access patterns
‘leak’ forward, with any local group of time samples always requiring access to the next
group. The amount by which the necessary delays ‘leak’ in time for each channel is given
by the integrated delay in that channel after Bν steps (see Fig. 3.2). The total integrated
delay across Bν channels is Bν(Bν − 1)/2, which is the number of additional values that
must be read into fast memory by the block in order to compute all log2Bν steps without
needing to return to global memory and apply a global synchronisation.
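The Bν(Bν − 1)/2 figure can be checked numerically by summing the per-step delays of equation (3.14) over a block (a small Python check; the total across the block is what matters for the staging budget):

```python
def theta(i, nu):
    """Equation (3.14): unit delay at step i for channel nu."""
    return ((nu % 2**(i + 1)) + 1) // 2

def total_staged_delay(b_nu):
    """Sum, over all channels of a b_nu-channel block, of the delays applied
    across all log2(b_nu) steps: the extra samples the block must stage."""
    steps = b_nu.bit_length() - 1
    return sum(theta(i, nu) for i in range(steps) for nu in range(b_nu))

# Agrees with the closed form quoted in the text:
assert all(total_staged_delay(b) == b * (b - 1) // 2 for b in (2, 4, 8, 16, 32))
```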
3.3.3 Implementation Notes
As with the direct algorithm, we implemented the tree algorithm on a GPU in C for
CUDA. For our first attempt, we took a simple approach where each of the log2Nν steps
in the computation was performed by a separate call to a GPU function (or kernel).
This approach is not ideal, as it is preferable to perform more computation on the device
before returning to the host (as per the discussion of arithmetic intensity in Section 3.3.2),
but was necessary in order to guarantee global synchronisation across threads on the GPU
between steps. This is a result of the lack of global synchronisation mechanisms on current
GPUs.
Between steps, the integer delay and shuffle functions [equations (3.14) and (3.15)] were
evaluated on the host and stored in look-up tables. These were then copied to constant
memory on the device prior to executing the kernel function to compute the step. The
use of constant memory ensured retrieval of these values would not be a bottle-neck to
performance during the computation of each step of the tree algorithm.
The problem was divided between threads on the GPU by allocating one thread for
every time sample and every pair of frequency channels. This meant that each thread
would compute the delayed sums between two ‘interacting’ channels according to the
pattern depicted in Fig. 3.2 for the current step.
The tree algorithm’s iterative updating behaviour requires that computations at each
step be performed ‘out-of-place’; i.e., output must be written to a memory space separate
from that of the input to avoid modifying input values before they have been used. We
achieved this effect by using a double-buffering scheme, where input and output arrays
are swapped after each step.
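The out-of-place stepping and buffer swap can be sketched in NumPy as follows. The channel pairing and the one-sample delay inside `tree_step` are stand-ins for the true integer delay and shuffle functions [equations (3.14) and (3.15)], which are not reproduced here; only the double-buffering structure is the point:

```python
import numpy as np

def tree_step(src, dst, stride):
    """One out-of-place tree step: each channel is summed with a delayed
    copy of its partner. Pairing and delay rules are illustrative only."""
    n_chan = src.shape[0]
    for c in range(n_chan):
        partner = c ^ stride               # butterfly-style interacting channel
        delay = 1 if (c & stride) else 0   # stand-in for the true integer delay
        dst[c] = src[c] + np.roll(src[partner], -delay)

def tree_transform(data):
    """Apply log2(n_chan) steps with double buffering: the output array of
    each step becomes the input of the next via a pointer swap, so input
    values are never overwritten before they are used."""
    buf_a = data.astype(np.float64)
    buf_b = np.empty_like(buf_a)
    stride = 1
    while stride < buf_a.shape[0]:
        tree_step(buf_a, buf_b, stride)
        buf_a, buf_b = buf_b, buf_a        # swap input/output buffers
        stride *= 2
    return buf_a
```

On the GPU the swap is a pointer exchange between kernel launches rather than a copy, which is what makes the scheme essentially free.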
While the algorithms differ significantly in their details, one point of consistency be-
tween the direct and tree methods is the need to apply time delays to the input data.
Therefore, just as with our implementation of the direct algorithm, the tree algorithm
requires accessing memory locations that are not aligned with internal memory bound-
aries. As such, we took the same approach as before and mapped the input data to the
GPU’s texture memory before launching the device kernel. As noted in Section 3.2.3, this
procedure is unnecessary on Fermi-generation GPUs, as their built-in caches provide the
same behaviour automatically.
After successfully implementing the tree algorithm on a GPU using a simple one-
step-per-GPU-call approach, we investigated the possibility of computing multiple steps
of the algorithm on the GPU before returning to the CPU for synchronisation. This is
possible because current GPUs, while lacking support for global thread synchronisation,
do support synchronisation across local thread groups (or blocks). These thread blocks
typically contain O(100) threads, and provide mechanisms for synchronisation and data-
sharing, both of which are required for a more efficient tree dedispersion implementation.
As discussed in Section 3.3.2, application of the tree algorithm to a block of Bν channels
× Bt time samples requires caching additional values from the next block in time. We
used blocks of Bν × Bt = 16 × 16 threads, each loading both their corresponding data
value and required additional values into shared cache. Once all values have been stored,
computation of the log2Bν = 4 steps proceeds entirely within the shared cache. Using
larger thread blocks would allow more steps to be completed within the cache; however,
the choice is constrained by the available volume of shared memory (typically around
48kB). Once the block computation is complete, subsequent steps must be computed
using the one-step-per-GPU-call approach described earlier, due to the requirement of
global synchronisations.
While theory suggests that an implementation of the tree algorithm exploiting shared
memory to perform multiple steps in cache would provide a performance benefit over
a simpler implementation, in practice we were unable to achieve a net gain using this
approach. The limitations on block size imposed by the volume of shared memory, the
need to load additional data into cache and the logarithmic scaling of steps relative to
data size significantly reduce the potential speed-up, and overheads from increased code-
complexity quickly erode what remains. For this reason we reverted to the straight-forward
implementation of the tree algorithm as our final code for testing and benchmarking.
In addition to the base tree algorithm, we also implemented the sub-band method so as
to allow the computation of arbitrary dispersion measures. This was achieved by dividing
the computation into two stages. In stage 1, the first log2N′ν steps of the tree algorithm
are applied to the input data, which produces the desired Nν/N′ν tree-dedispersed sub-
bands. Stage 2 then involves applying an algorithm to combine the dedispersed time
series in different sub-bands into approximated quadratic dispersion curves according to
equation (3.22). Stage 2 was implemented on the GPU in much the same way as the direct
algorithm, with input data mapped to texture memory (on pre-Fermi GPUs) and delays
stored in look-up tables in constant device memory.
The frequency padding approach described in Section 3.3.1 was implemented by con-
structing an array large enough to hold the stretched frequency coordinates, initialising
its elements to zero, and then copying (or scattering) the input data into this array ac-
cording to equation (3.25). The results of this procedure were then fed to the basic tree
dedispersion code to produce the final set of dedispersed time series.
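The padding procedure amounts to a zero-initialised scatter. In the sketch below, the index mapping `pad_index` is a hypothetical placeholder for equation (3.25); the specific mapping shown (spreading channels across twice as many slots) is purely illustrative:

```python
import numpy as np

def pad_frequencies(data, n_padded, pad_index):
    """Scatter `data` (channels x time) into a larger zero-filled array,
    placing each input channel at its stretched coordinate pad_index(c)."""
    n_chan, n_time = data.shape
    out = np.zeros((n_padded, n_time), dtype=data.dtype)
    for c in range(n_chan):
        out[pad_index(c)] = data[c]  # copy the channel to its padded position
    return out

# Hypothetical mapping: spread 4 channels across 8 padded slots.
padded = pad_frequencies(np.ones((4, 16)), 8, lambda c: 2 * c)
```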
Because the tree algorithm involves sequentially updating the entire data-set, the data
must remain in their final format for the duration of the computation. This means that
low bit-rate data, e.g., 2-bit, must be unpacked (in a pre-processing step) into a format
that will not overflow during accumulation. This is in contrast to the direct algorithm,
where each sum is independent, and can be stored locally to each thread.
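A minimal sketch of such an unpacking step for 2-bit data is shown below. The ordering of samples within each byte is an assumption here; the real convention depends on the recording format:

```python
import numpy as np

def unpack_2bit(packed):
    """Expand 2-bit samples (4 per byte) into one uint8 per sample, so that
    accumulation during the tree computation cannot overflow the 2-bit range.
    Assumes the first sample occupies the least-significant bits of each byte."""
    packed = np.asarray(packed, dtype=np.uint8)
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    return ((packed[:, None] >> shifts) & 0b11).ravel()

samples = unpack_2bit([0b11100100])  # -> [0, 1, 2, 3]
```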
3.4 Sub-band dedispersion
3.4.1 Introduction
Sub-band dedispersion is the name given to another technique used to compute the dedis-
persion transform. Like the tree algorithm described in Section 3.3, the sub-band algorithm
attempts to reduce the cost of the computation relative to the direct method; however,
rather than exploiting a regularisation of the dedispersion algorithm, the sub-band method
takes a simple approximation approach.
In its simplest form, the algorithm involves two processing steps. In the first, the set of
trial DMs is approximated by a reduced set of NDMnom = NDM/N′DM ‘nominal’ DMs, each
separated by N′DM trial dispersion measures. The direct dedispersion algorithm is applied
to sub-bands of N′ν channels to compute a dedispersed time series for each nominal DM
and sub-band. In the second step, the DM trials near each nominal value are computed
by applying the direct algorithm to the ‘miniature filterbanks’ formed by the time series
for the sub-bands at each nominal DM. These data have a reduced frequency resolution
of NSB = Nν/N′ν channels across the band. The two steps thus operate at reduced
dispersion measure and frequency resolution respectively, resulting in an overall reduction
in the computational cost.
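The two stages above reduce to two applications of a direct-dedispersion routine. The sketch below uses a deliberately naive `direct_dedisperse` (integer sample shifts from a supplied delay table, then a channel sum) purely to show the call structure; the function names and the form of the delay tables are assumptions, not the interface of any real code:

```python
import numpy as np

def direct_dedisperse(data, delays):
    """Naive direct dedispersion: for each DM trial, shift every channel
    back by its integer delay (in samples) and sum over channels."""
    out = np.zeros((delays.shape[0], data.shape[1]))
    for t in range(delays.shape[0]):
        for c in range(data.shape[0]):
            out[t] += np.roll(data[c], -int(delays[t, c]))
    return out

def subband_dedisperse(data, nominal_delays, fine_delays, n_sub_chan):
    """Stage 1: dedisperse each sub-band of n_sub_chan channels at the
    nominal DMs. Stage 2: treat the sub-band time series at each nominal
    DM as a miniature filterbank and dedisperse at the fine DM offsets."""
    n_chan, n_time = data.shape
    n_sub = n_chan // n_sub_chan
    stage1 = np.empty((nominal_delays.shape[0], n_sub, n_time))
    for s in range(n_sub):
        lo, hi = s * n_sub_chan, (s + 1) * n_sub_chan
        stage1[:, s, :] = direct_dedisperse(data[lo:hi], nominal_delays[:, lo:hi])
    return np.concatenate([direct_dedisperse(stage1[n], fine_delays)
                           for n in range(stage1.shape[0])])
```

With all delays set to zero the procedure collapses to a plain channel sum, which makes the two-stage call structure easy to verify.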
The sub-band algorithm is implemented in the presto software suite (Ransom, 2001)
and was recently implemented on a GPU by Magro et al. (2011) (see Section 3.6.1 for a
comparison with their work). Unlike the tree algorithm, the sub-band method is able to
compute the dedispersion transform with the same flexibility as the direct method, making
its application to real observational data significantly simpler.
The approximations made by the sub-band algorithm introduce additional smearing
into the dedispersed time series. We derive an analytic upper-bound in Appendix A.2 and
show that, to first order, the smearing time tSB is proportional to the product N′DMN′ν
[see equation (A.8)].
3.4.2 Algorithm analysis
The computational complexity of the sub-band dedispersion algorithm can be computed
by summing that of the two steps:
TSB,1 = NSB · Tdirect(Nt, N′ν, NDMnom)            (3.27)
TSB,2 = NDMnom · Tdirect(Nt, NSB, N′DM)           (3.28)
TSB = TSB,1 + TSB,2                               (3.29)
    = O[NtNDMNν (1/N′DM + 1/N′ν)]                 (3.30)
This result can be combined with knowledge of the smearing introduced by the algorithm
to probe the relationship between accuracy and performance. Inserting the smearing
constraint tSB ∝ N′DMN′ν (see Section 3.4.1) into equation (3.30), we obtain a second-
order expression that is minimised at N′DM = N′ν ∝ √tSB, which amounts to balancing
the execution time between the two steps. This result optimises the time complexity of
the algorithm, which then takes the simple form
T′SB = O(NtNDMNν / √tSB)                          (3.31)
and represents a theoretical speed-up over the direct algorithm proportional to the square
root of the introduced smearing.
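The balancing result can be checked numerically: holding the smearing product N′DM N′ν fixed, the cost factor 1/N′DM + 1/N′ν from equation (3.30) is smallest when the two parameters are equal. A small sketch:

```python
def cost_factor(n_dm, n_nu):
    """Relative cost of the two sub-band stages from equation (3.30)."""
    return 1.0 / n_dm + 1.0 / n_nu

# All power-of-two factorisations of a fixed smearing budget N'_DM * N'_nu = 64.
budget = 64
options = [(a, budget // a) for a in (1, 2, 4, 8, 16, 32, 64)]
best = min(options, key=lambda p: cost_factor(*p))
assert best == (8, 8)  # the balanced split minimises the cost
```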
The sub-band algorithm’s dependence on the direct algorithm means that it inherits
similar algorithmic behaviour. However, as with the tree method, the decrease in computa-
tional work afforded by the sub-band approach corresponds to a decrease in the arithmetic
intensity of the algorithm. This can be expected to reduce the intrinsic performance of
the two sub-band steps relative to the direct algorithm.
One further consideration for the sub-band algorithm is the additional volume of mem-
ory required to store the intermediate results produced by the first step. These data consist
of time series for each sub-band and nominal DM, giving a space complexity of
MSB = O(NSBNDMnom).                               (3.32)
Assuming the time complexity is optimised as in equation (3.31), the space complexity
becomes
M′SB = O(1/tSB),                                  (3.33)
which indicates that the memory consumption increases much faster than the execution
time, in direct proportion to the introduced smearing rather than to its square root. This
can be expected to place a lower limit on the smearing that can be achieved in practice.
3.4.3 Implementation notes
A significant advantage of the sub-band algorithm over the tree algorithm is that it involves
little more than repeated execution of the direct algorithm. With sufficient generalisation5
of our implementation of the direct algorithm, we were able to implement the sub-band
method with just two consecutive calls to the direct dedispersion routine and the addition
of a temporary data buffer.
In our implementation, the ‘intermediate’ data (i.e., the outputs of the first step)
are stored in the temporary buffer using 32 bits per sample. The second call to the
dedispersion routine then reads these values directly before writing the final output using
a desired number of bits per sample.
Experimentation showed that optimal performance occurred at a slightly different
shape and size of the thread blocks on the GPU compared to the direct algorithm (see
Section 3.2.2). The sub-band kernels operated most efficiently with 128 threads per block
divided into 16 time samples and 8 DMs. In addition, the optimal choice of the ratio
N′ν/N′DM was found to be close to unity, which matches the theoretical result derived in
Section 3.4.2. While these parameters minimised the execution time, the sub-band kernels
were still found to perform around 40% less efficiently than the direct kernel. This result
is likely due to the reduced arithmetic intensity of the algorithm (see Section 3.4.2).
3.5 Results
3.5.1 Smearing
Our analytic upper-bounds on the increase in smearing due to use of the piecewise linear
tree algorithm [equation (A.6)] and the sub-band algorithm [equation (A.9)] are plotted
in the upper panels of Figs. 3.3 and 3.4 respectively. The reference point [W in equa-
tions (A.6) and (A.9)] was calculated using equations for the smearing during the direct
dedispersion process6, assuming an intrinsic pulse width of 40 µs.
For the piecewise linear tree algorithm, the effective signal smearing at low dispersion
5The direct dedispersion routine was modified to support ‘batching’ (simultaneous application to several adjacent data-sets) and arbitrary strides through the input and output arrays, trial DMs and channels.
6Levin, L. 2011, priv. comm.
measure is dominated by the intrinsic pulse width, the sampling time ∆τ and the effect of
finite DM sampling. As the DM is increased, however, the effects of finite channel width
and the sub-band technique grow, and eventually become dominant. These smearing terms
both scale linearly with the dispersion measure, and so the relative contribution of the
sub-band method, µSB, tends to a constant.
The sub-band algorithm exhibits virtually constant smearing as a function of DM
due to its dependence on the DM step, which is itself chosen to maintain a fixed frac-
tional smearing. While the general trend mirrors that of the tree algorithm, the sub-band
algorithm’s smearing is typically around two orders of magnitude worse than its tree coun-
terpart.
3.5. Results 69
[Figure 3.3: two-panel figure. Upper panels plot the fractional smear increase (µSB − 1) against sub-band size N′ν at DM = 4, 62 and 1000 pc cm−3; lower panels plot speed-up and observation time / compute time against N′ν for the tree code (GTX 480, Tesla C2050, Tesla C1060) and the direct code (same GPUs, plus Core i7 930 with 1, 2 and 4 threads), (a) with and (b) without time-scrunching.]

Figure 3.3 Upper: Analytic upper-bound on signal degradation of a 40 µs pulse due to the piecewise linear tree algorithm as a function of the number of channels per sub-band [see equation (A.6)]. Lower: Performance results for the direct and piecewise linear tree algorithms with (a) and without (b) ‘time-scrunching’ applied. Benchmarks were executed on an Intel Core i7 930 quad-core CPU and NVIDIA Tesla C1060, Tesla C2050 and GeForce GTX 480 GPUs. All results correspond to operations on one minute of input data with the following observing parameters: bits/sample = 2, νc = 1381.8 MHz, BW = 400 MHz, Nν = 1024, ∆τ = 64 µs. A total of 1196 DM trials were used, spaced non-linearly in the range 0 ≤ DM < 1000 pc cm−3 (see text for details). Error bars are too small to be seen at this scale and are not plotted. Note that performance results are projected from measurements of codes performing sub-sets of the benchmark task (see text for details).
[Figure 3.4: two-panel figure. Upper panels plot the fractional smear increase (µSB − 1) against sub-band size N′ν at DM = 4 and 400 pc cm−3; lower panels plot speed-up and observation time / compute time against N′ν for the sub-band code (GTX 480, Tesla C2050, Tesla C1060) and the direct code (same GPUs, plus Core i7 930 with 1, 2 and 4 threads), (a) with and (b) without time-scrunching.]

Figure 3.4 Upper: Analytic upper-bound on signal degradation of a 40 µs pulse due to the sub-band algorithm as a function of the number of channels per sub-band [see equation (A.9)]. Lower: Performance results for the direct and sub-band methods with (a) and without (b) the use of time-scrunching. See Fig. 3.3 caption for details. Note that it was not possible to run benchmarks of the sub-band code for N′ν < 16 due to memory constraints.
3.5.2 Performance
Our codes as implemented allowed us to directly compute the following:
• Any list of DMs using the direct or sub-band algorithm with no time-scrunching,
• DMs up to the diagonal [see equation (3.21)] using the piecewise linear tree algorithm,
and
• DMs up to the diagonal [see equation (3.26)] using the frequency-padded tree algo-
rithm.
A number of timing benchmarks were run to compare the performance of the CPU to
the GPU and the direct algorithm to the tree algorithms. Input and dispersion parameters
were chosen to reflect a typical scenario as appears in modern pulsar surveys such as
the High Time Resolution Universe (HTRU) survey currently underway at the Parkes
radio telescope (Keith et al., 2010). The benchmarks involved computing the dedispersion
transform of one minute of input data with observing parameters of bits/sample = 2,
ν0 = 1581.8 MHz, ∆ν = −0.39062 MHz, Nν = 1024, ∆τ = 64 µs. DM trials were chosen
to match those used in the HTRU survey, which were originally derived by applying an
analytic constraint on the signal-smearing due to incorrect trial DM7. The chosen set
contained 1196 trial DMs in the range 0 ≤ DM < 1000 pc cm−3 with approximately
exponential spacing.
For comparison purposes, we benchmarked a reference CPU direct dedispersion code in
addition to our GPU codes. The CPU code (named dedisperse_all) is highly optimised,
and uses multiple CPU cores to compute the dedispersion transform (parallelised over
the time dimension) in addition to bit-level parallelism as described in Section 3.2.3.
dedisperse_all is approximately 60× more efficient than the generic dedisperse routine
from sigproc8, but is only applicable to a limited subset of data formats.
At the time of writing, our dedispersion code-base did not include ‘full-capability’
implementations of all of the discussed algorithms. However, we were able to perform a
number of benchmarks that were sufficient to obtain accurate estimates of the performance
of complete runs. Timing measurements for our codes were projected to produce a number
of derived results representative of the complete benchmark task. The direct/sub-band
dedispersion code was able to compute the complete list of desired DMs, but was not able to
exploit time-scrunching; results for these algorithms with time scrunching were calculated
7Levin, L. 2011, priv. comm.; see Cordes & McLaughlin 2003 for a similar derivation.
8sigproc.sourceforge.net
by assuming that the computation of DMs between 2× and 4× the diagonal would proceed
twice as fast as the computation up to 2× the diagonal (as a result of there being half as
many time samples), and similarly for 4× to 8× etc. up to the maximum desired DM. A
simple code to perform the time-scrunching operation (i.e., adding adjacent time samples
to reduce the time resolution by a factor of two) was also benchmarked and factored
into the projection. For the tree codes, which were unable to compute DMs beyond the
diagonal, timing results were projected by scaling as appropriate for the computation of
the full set of desired DMs with or without time-scrunching. Individual sections of code
were timed separately to allow for different scaling behaviours.
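The projection above amounts to summing a geometric series: if computing DMs up to 2× the diagonal takes time t, each subsequent doubling of the DM range is assumed to run twice as fast (having half as many time samples after scrunching). A sketch of this arithmetic, using exactly the factor-of-two assumption stated above:

```python
def projected_time(t_base, n_doublings):
    """Total projected run time when reaching 2**n_doublings times the
    2x-diagonal DM, assuming each successive doubling of the DM range
    runs twice as fast as the previous one (half as many time samples)."""
    return sum(t_base / 2**k for k in range(n_doublings + 1))

# With time-scrunching, even many doublings cost less than 2x the base time.
total = projected_time(1.0, 3)  # 1 + 1/2 + 1/4 + 1/8 = 1.875
```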
Benchmarks were run on a variety of hardware configurations. CPU benchmarks were
run on an Intel i7 930 quad-core CPU (Hyperthreading enabled). GPU benchmarks were
run using version 3.2 of the CUDA toolkit on the pre-Fermi generation NVIDIA Tesla
C1060 and the Fermi generation NVIDIA Tesla C2050 (error-correcting memory disabled)
and GeForce GTX 480 GPUs. Hardware specifications of the GPUs’ host machines varied,
but were not considered to significantly impact performance measurements other than
the copies between host and GPU memory. Benchmarks for these copy operations were
averaged across the different machines.
Our derived performance results for the direct and piecewise linear tree codes are
plotted in the lower panels of Fig. 3.3. The performance of the frequency-padded tree
code corresponded to almost exactly half that of the piecewise linear tree code at a sub-
band size of N ′ν = 1024; these results were omitted from the plot for clarity.
Performance results for the sub-band dedispersion code are plotted in the lower panels
of Fig. 3.4 along with the results of the direct code for comparison. Due to limits on
memory use (see Section 3.4.2), benchmarks for N ′ν < 16 were not possible.
Performance was measured by inserting calls to the Unix function gettimeofday() before
and after relevant sections of code. Calls to the CUDA function cudaThreadSynchronize()
were inserted where necessary to ensure that asynchronous GPU functions had completed
their execution prior to recording the time.
Several different sections of code were timed independently. These included pre- and
post-processing steps (e.g., unpacking, transposing, scaling) and copies between host and
GPU memory (in both directions), as well as the dedispersion kernels themselves. Disk
I/O and portions of code whose execution time does not scale with the size of the input
were not timed (see Section 3.6 for a discussion of the impact of disk I/O). Timing results
represent the total execution time of all timed sections, including memory copies between
the host and the device in the case of the GPU codes.
Each benchmark was run 101 times, from which the median execution time was
chosen as the final measurement. Recorded uncertainties corresponded to the 5th and 95th
percentiles; the error bars are too small to be seen in Figs. 3.3 and 3.4 and were not
plotted.
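The statistic reported for each benchmark point can be reproduced in a few lines of NumPy. This sketch shows only the reduction; the timings here are synthetic:

```python
import numpy as np

def summarise_runs(times):
    """Reduce repeated timings (e.g. 101 runs) to a median measurement
    with 5th/95th-percentile error bounds."""
    times = np.asarray(times, dtype=np.float64)
    return (np.percentile(times, 50),   # reported measurement
            np.percentile(times, 5),    # lower error bound
            np.percentile(times, 95))   # upper error bound

median, lo, hi = summarise_runs(np.linspace(1.0, 2.0, 101))
```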
3.6 Discussion
The lower panel of Fig. 3.3(a) shows a number of interesting performance trends. As
expected, the slowest computation speeds come from the direct dedispersion code running
on the CPU. Here, some scaling is achieved via the use of multiple cores, but the speed-up
is limited to around 2.5× when using all four. This is likely due to saturation of the
available memory bandwidth.
Looking at the corresponding results on a GPU, a large performance advantage is
clear. The GTX 480 achieves a 9× speed-up over the quad-core CPU, and even the last-
generation Tesla C1060 manages a factor of 5×. The fact that a single GPU is able to
compute the dedispersion transform in less than a third of the real observation time makes
it an attractive option for real-time detection pipelines.
A further performance boost is seen in the transition to the tree algorithm. Computation speed is projected to exceed that of the direct code for almost all choices of N′ν, peaking at around 3× at N′ν = 64. Performance is seen to scale approximately linearly for N′ν < 32, before peaking and then decreasing very slowly for N′ν > 64. This behaviour is explained by the relative contributions of the two stages of the computation. For small N′ν, the second, ‘sub-band combination’, stage dominates the total execution time [scaling as O(1/N′ν)]. At large N′ν the execution time of the second stage becomes small relative to the first, and scaling follows that of the basic tree algorithm [i.e., O(log N′ν)].
The results of the sub-band algorithm in Fig. 3.4(a) also show a significant performance
advantage over the direct algorithm. The computable benchmarks start at N′ν = 16 with
around the same performance as the tree code. From there, performance rapidly increases
as the size of the sub-bands is increased, eventually tailing off around N′ν = 256 with a
speed-up of approximately 20× over the direct code. At such high speeds, the time spent
in the GPU kernel is less than the time spent transferring the data into and out of the
GPU. The significance of this effect for each of the three algorithms is given in Table 3.1.
The results discussed so far have assumed the use of the time-scrunching technique
during the dedispersion computation. If time-scrunching is not used, the projected per-
formance results change significantly [see lower panels Figs. 3.3(b) and 3.4(b)]. Without
the use of time-scrunching, the direct dedispersion codes perform around 1.6× slower, and
Table 3.1 Summary of host↔GPU memory copy times

Code      Copy time   Fraction of total time
Direct    0.62 s      < 5%
Tree      1.05 s      < 30%
Sub-band  0.62 s      10% – 65%
similar results are seen for the sub-band code. The tree codes, however, are much more
severely affected, and perform 5× slower when time-scrunching is not employed. This
striking result can be explained by the inflexibilities of the tree algorithm discussed in
Section 3.3.1. At large dispersion measure, the direct algorithm allows one to sample DM
space very thinly. The tree algorithms, however, do not—they will always compute DM
trials at a fixed spacing [see equation (3.9)]. This means that the tree algorithms are effec-
tively over-computing the problem, which leads to the erosion of their original theoretical
performance advantage. The use of time-scrunching emulates the thin DM-space sampling
of the direct code, and allows the tree codes to maintain an advantage.
While the piecewise linear tree code and the sub-band code are seen to provide signifi-
cant speed-ups over the direct code, their performance leads come at the cost of introducing
additional smearing into the dedispersed signal. Our analytic results for the magnitude
of the smearing due to the tree code (upper panels Fig. 3.3) show that for the chosen
observing parameters, the total smear is expected to increase by less than 10% for all
N′ν ≤ 64 at a DM of 1000 pc cm−3. Given that peak performance of the tree code also
corresponded to N′ν = 64, we conclude that this is the optimal choice of sub-band size for
such observations.
The smearing introduced by the sub-band code (upper panels Fig. 3.4) is significantly
worse, increasing the signal degradation by three orders of magnitude more than the tree
code. Here, the total smear is expected to increase by around 40% at N′ν = 16, and at
N′ν = 32 the increase in smearing reaches 300%. While these results are upper limits, it
is unlikely that sub-band sizes of more than N′ν = 32 will produce acceptable results in
practical scenarios.
In contrast to the piecewise linear code, the frequency-padded tree code showed only a modest speed-up of around 1.5× over the direct approach due to its doubling of the number of frequency channels. Given that the piecewise linear approximation has only a minimal impact on signal quality, we conclude that the frequency-padding technique is an inferior option.
It is also important to consider the development cost of the algorithms we have dis-
cussed. While the tree code has shown both high performance and accuracy, it is also
considerably more complex than the other algorithms. The tree algorithm in its base
form, as discussed in Section 3.3.1, is much less intuitive than the direct algorithm (e.g.,
the memory access patterns in Fig. 3.2). This fact alone makes implementation more
difficult. The situation gets significantly worse when one must adapt the tree algorithm
to work in practical scenarios, with quadratic dispersion curves and arbitrary DM tri-
als. Here, the algorithm’s inflexibility makes implementation a daunting task. We note
that our own implementations are as yet incomplete. By comparison, implementation of
the direct code is relatively straightforward, and the sub-band code requires only mini-
mal changes. Development time must play a role in any decision to use one dedispersion
algorithm over another.
The three algorithms we have discussed each show relative strengths and weaknesses.
The direct algorithm makes for a relatively straightforward move to the GPU architecture
with no concerns regarding accuracy, and offers a speed-up of up to 10× over an efficient
CPU code. However, its performance is convincingly beaten by the tree and sub-band
methods. The tree method is able to provide significantly better performance with only a
minimal loss of signal quality; however, it comes with a high cost of development that may
outweigh its advantages. Finally, the sub-band method combines excellent performance
with an easy implementation, but is let down by the substantial smearing it introduces
into the dedispersed signal. The optimal choice of algorithm will therefore depend on
which factors are most important to a particular project. While there is no clear best
choice among the three different algorithms, we emphasize that between the two hardware
architectures the GPU clearly outperforms the CPU.
When comparing the use of a GPU to a CPU, it is interesting to note that our fi-
nal GPU implementation of the direct dedispersion algorithm on a Fermi-class device is,
relatively speaking, a simple code. While it was necessary in both the pre-Fermi GPU
and multi-core CPU implementations to use non-trivial optimisation techniques (e.g., tex-
ture memory, bit-packing etc.), the optimal implementation on current-generation, Fermi,
GPU hardware was also the simplest or ‘obvious’ implementation. This demonstrates how
far the (now rather misnamed) graphics processing unit has come in its ability to act as a
general-purpose processor.
In addition to the performance advantage offered by GPUs today, we expect our imple-
mentations of the dedispersion problem to scale well to future architectures with little to no
code modification. The introduction of the current generation of GPU hardware brought
with it both a significant performance increase and an equally significant reduction in
programming complexity. We expect these trends to continue when the next generation of
GPUs is released, and see a promising future for these architectures and the applications
that make use of them.
While we have only discussed single-GPU implementations of dedispersion, it would in
theory be a simple matter to make use of multiple GPUs, e.g., via time-division multiplex-
ing of the input data or allocation of a sub-set of beams to each GPU. As long as the total
execution time is dominated by the GPU dedispersion kernel, the effects of multiple GPUs
within a machine sharing resources such as CPU cycles and PCI-Express bandwidth are
expected to be negligible. However, as shown in Table 3.1, the tree and sub-band codes
are in some circumstances so efficient that host↔device memory copy times become a sig-
nificant fraction of the total run time. In these situations, the use of multiple GPUs within
a single host machine may influence the overall performance due to reduced PCI-Express
bandwidth.
Disk I/O is another factor that can contribute to the total execution time of a dedis-
persion process. Typical server-class machines have disk read/write speeds of only around
100 MB/s, while our GPU dedispersion codes are capable of producing 8-bit time series
at well over twice this rate. If dedispersion is performed in an offline fashion, where time
series are read from and written to disk before and after dedispersion, then it is likely that
disk performance will become the bottle-neck. The use of multiple GPUs within a machine
may exacerbate this effect. However, for real-time processing pipelines where data are kept
in memory between operations, the dedispersion kernel can be expected to dominate the
execution time. This is particularly important for transient search pipelines, where accel-
eration searching is not necessary and dedispersion is typically the most time-consuming
operation.
The potential impact of limited PCI-Express bandwidth or disk I/O performance high-
lights the need to remember Amdahl’s Law when considering further speed-ups in the
dedispersion codes: the achievable speed-up is limited by the largest bottle-neck. The
tree and sub-band codes are already on the verge of being dominated by the host↔device
memory copies, meaning that further optimisation of their kernels will provide diminish-
ing returns. While disk and memory bandwidths will no doubt continue to increase in
the future, we expect the ratio of arithmetic performance to memory performance to get
worse rather than better.
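The arithmetic behind Amdahl's bound is worth making explicit. The following is an illustrative sketch (our own, not part of the thesis software): given the fraction p of the total run time spent in the accelerated portion and a kernel speed-up s, Amdahl's Law gives the overall speed-up and its limiting value.

```cpp
// Amdahl's Law: overall speed-up when a fraction p of the run time is
// accelerated by a factor s; the remaining (1 - p) -- e.g. host<->device
// copies or disk I/O -- is left untouched.
double amdahl_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}

// Limiting speed-up as s -> infinity: no amount of kernel optimisation
// can exceed this while the unaccelerated fraction remains.
double amdahl_limit(double p) {
    return 1.0 / (1.0 - p);
}
```

For instance, if the dedispersion kernel accounts for only 60% of the run time because memory copies dominate the rest, further kernel optimisation alone can never yield more than a 2.5× overall improvement.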
The application of GPUs to the problem of dedispersion has produced speed-ups of
an order of magnitude. The implications of this result for current and future surveys are
significant. Current projects often execute pulsar and transient search pipelines offline
due to limited computational resources. This results in event detections being made long
after the time of the events themselves, limiting analysis and confirmation power to what
3.6. Discussion 77
can be gleaned from archived data alone. A real-time detection pipeline, made possible
by a GPU-powered dedispersion code, could instead trigger systems to record invaluable
baseband data during significant events, or alert other observatories to perform follow-
up observations over a range of wavelengths. Real-time detection capabilities will also
be crucial for next-generation telescopes such as the Square Kilometre Array pathfinder
programs ASKAP and MeerKAT. The use of GPUs promises significant reductions in the
set-up and running costs of real-time pulsar and transient processing pipelines, and could
be the enabling factor in the construction of ever-larger systems in the future.
3.6.1 Comparison with other work
Magro et al. (2011) recently reported on a GPU code that could achieve very high (> 100×)
speed-ups over the dedispersion routines in sigproc and presto (Ransom, 2001) whereas
our work only finds improvements of factors of 10–30 over dedisperse all. There are
two key reasons for the apparent discrepancy in speed. Firstly, the sigproc routine was
never written to optimise performance but rather to produce reliable dedispersed data
streams from a very large number of different backends. Inspection of the innermost loop
reveals a conditional test that prohibits parallelisation, and two-dimensional array accesses
that are computationally expensive. Secondly, sigproc only produces one DM per file read,
which is very inefficient. We believe that these factors explain the very large speed-ups
reported by Magro et al. In our own benchmarks, we have found our CPU comparison
code dedisperse all to be ∼ 60× faster than sigproc. For comparison, this puts our
direct GPU code at ∼ 300× faster than sigproc when using the same Tesla C1060 model
GPU as Magro et al.
Direct comparison of our GPU results with those of Magro et al. is difficult, as the
details of the CPU code, the method of counting FLOP/s and the observing parameters
used in their performance plots are not clear. However, we have benchmarked our GPU
code on the ‘toy observation’ presented in section 5 of their paper. The execution times are
compared in Table 3.2. Magro et al. did not specify the number of bits per sample used in
their benchmark; we chose to use 8 bits/sample, but found no significant difference when
using 32 bits/sample. We found our implementation of the direct dedispersion algorithm
to be ∼ 2.3× faster than that reported in their work. Possible factors contributing to
this difference include our use of texture memory, two-dimensional thread blocks and
allocation of multiple samples per thread. The performance results of our implementation
of the sub-band dedispersion algorithm generally agree with those of Magro et al., although
the impact of the additional smearing is not quantified in their work.
Table 3.2 Timing comparisons for direct GPU dedispersion of the ‘toy observation’ defined in Magro et al. (2011) (νc = 610 MHz, BW = 20 MHz, Nν = 256, ∆τ = 12.8 µs, NDM = 500, 0 ≤ DM < 60 pc cm⁻³). All benchmarks were executed on a Tesla C1060 GPU.
Stage            Magro et al. (2011)   This work   Ratio
Corner turn      112 ms                7 ms        16×
De-dispersion    4500 ms               1959 ms     2.29×
GPU→CPU copy     220 ms                144 ms      1.52×
Total            4832 ms               2110 ms     2.29×
In summary, we agree with Magro et al. that GPUs offer great promise in incoherent
dedispersion. The benefit over CPUs is, however, closer to the ratio of their
memory bandwidths (∼ 10×) than the factor of 100 reported in their paper, which relied
on comparison with a non-optimised single-threaded CPU code.
3.6.2 Code availability
We have packaged our GPU implementation of the direct incoherent dedispersion algo-
rithm into a C library that we make available to the community9. The application pro-
gramming interface (API) was modeled on that of the FFTW library10, which was found to
be a convenient fit. The library requires the NVIDIA CUDA Toolkit, but places no require-
ments on the host application, allowing easy integration into existing C/C++/Fortran etc.
codes. While the library currently uses the direct dedispersion algorithm, we may consider
adding support for a tree or sub-band algorithm in future.
3.7 Conclusions
We have analysed the direct, tree and sub-band dedispersion algorithms and found all
three to be good matches for massively-parallel computing architectures such as GPUs.
Implementations of the three algorithms were written for the current and previous gen-
erations of GPU hardware, with the more recent devices providing benefits in terms of
both performance and ease of development. Timing results showed a 9× speed-up over a
multi-core CPU when executing the direct dedispersion algorithm on a GPU. Using the
tree algorithm with a piecewise linear approximation technique introduces some additional
smearing of the input signal, but was projected to provide a further 3× speed-up at a
very modest level of signal-loss. The sub-band method provides a means of obtaining
even greater speed-ups, but imposes significant additional smearing on the dedispersed
9 Our library and its source code are available at: http://dedisp.googlecode.com/
10 http://www.fftw.org
3.7. Conclusions 79
signal. These results have significant implications for current and future radio pulsar and
transient surveys, and promise to dramatically lower the cost barrier to the deployment
of real-time detection pipelines.
Acknowledgments
We would like to thank Lina Levin and Willem van Straten for very helpful discussions
relating to pulsar searching, Mike Keith for valuable information regarding the tree dedis-
persion algorithm, and Paul Coster for help in testing our dedispersion code. We would
also like to thank the referee Scott Ransom for his very helpful comments and suggestions
for the paper corresponding to this chapter.
4 Fast-Radio-Transient Detection in Real-Time with GPUs
The machine does not isolate man from the great problems
of nature but plunges him more deeply into them.
—Antoine de Saint-Exupéry
4.1 Introduction
The sub-second transient radio sky is a poorly understood yet potentially fruitful source of
astrophysical phenomena (Cordes & McLaughlin, 2003). Over the past decade a number of
surveys have made inroads into characterising the sources that populate this domain. Re-
processing of the Parkes Multibeam Survey resulted in the discovery of new sources forming
a class of pulsars known as the rotating radio transients (RRATs) (McLaughlin et al., 2006;
Keane et al., 2010; Keane et al., 2011; Burke-Spolaor & Bailes, 2010)1. The apparent
detection of an extragalactic burst (Lorimer et al., 2007) sparked significant excitement in
the field, although it was not followed by similar success; a possible second such event was
eventually found (Keane et al., 2011), but the identification of terrestrial signals (given the
name perytons) mimicking the frequency-swept appearance of astronomical sources added
doubt to the true origin of these events (Burke-Spolaor et al., 2011; Bagchi, Cortes Nieves
& McLaughlin, 2012). While uncertainty remains, these discoveries have prompted a new
generation of wide-field surveys across nearly all of the major radio astronomy facilities.
These include Parkes Observatory (Keith et al., 2010), the Australian Square Kilometre
Array Pathfinder (ASKAP) (Macquart et al., 2010), the Effelsberg radio telescope (Barr,
2011), the Low Frequency Array (LOFAR) (Stappers et al., 2011), the Allen Telescope
1 A catalogue of the known RRATs is available at http://www.as.wvu.edu/~pulsar/rratalog/
82 Chapter 4. Fast-Radio-Transient Detection in Real-Time with GPUs
Array (Siemion et al., 2012) and the Green Bank Telescope (Boyles et al., 2012). A fast
transient survey has also been conducted at Arecibo Observatory (Deneva et al., 2009), and
studies have already been made of the potential for transient-detection at next-generation
facilities like the Square Kilometre Array (Macquart, 2011; Colegate & Clarke, 2011).
Chapter 1 introduced pulsar astronomy as a field that stands to benefit significantly
from the use of advanced computing architectures. In this chapter, we demonstrate this
potential by harnessing the power of graphics processing units (GPUs) to develop a full-
featured real-time ‘fast radio transient’ detection pipeline for the 20 cm Multibeam Re-
ceiver (Staveley-Smith et al., 1996) at Parkes Observatory. This work will demonstrate
how the use of advanced hardware architectures can enable new scientific opportunities
that reach beyond what is practical with traditional CPU-based computing to unlock new
paradigms of observation and discovery. Real-time data reduction systems have been an-
nounced for two of the above-mentioned survey projects (ASKAP: Macquart et al. 2010;
and LOFAR: Armour et al. 2011; Serylak et al. 2012), and we expect such systems to
become the standard for cutting-edge surveys in the future (see also Jones et al. 2012).
The ability to detect transient radio events as they are observed provides a number of
advantages over traditional offline processing. These include:
1. access to uncompressed data — offline processing typically requires reduction of
dynamic range and/or time resolution prior to writing data to disk or tape;
2. instant feedback on radio-frequency interference (RFI) environment — offline pro-
cessing leaves little information available during observing about the RFI environ-
ment;
3. immediate follow-up on the order of seconds — offline processing imposes a delay
between observation and detection that can be days to weeks; and
4. triggered baseband dumps — offline detections provide limited information about
events, with no opportunity to capture the corresponding high-resolution baseband
data2.
The ability to precisely characterise the effects of RFI in the search space of transient and
pulsar surveys provides significant benefits over generic metrics such as visualisations of
the bandpass and zero-dispersion-measure time series. Knowing the current ‘RFI weather’
allows observers to adapt their observing schedule based on the quality of data they are
2 Baseband data are the digitised but otherwise-unprocessed voltage signals from the receiver system.
4.1. Introduction 83
obtaining and the current observing mode. Detailed information on the properties of
incoming RFI can also aid the identification of local terrestrial sources of emission.
Point 3 is particularly important in the context of observing RRATs. These sources
are known to be very intermittent in nature, and are in some cases detectable for a total of
less than one second per day (McLaughlin et al., 2006). Hence if they are not re-observed
immediately it can be very time-consuming to find them again in their ‘on’ state. Real-
time detection provides the opportunity to continue observing potential RRAT sources
and confirm or rule out their existence during the same observing session.
The ability to raise an alert for a significant detection only seconds after it is observed
also makes possible immediate follow-up observations of the same event at lower frequen-
cies by taking advantage of the dispersion delay. This concept is discussed further in
Section 4.4.
Possibly the greatest scientific potential for real-time transient observations comes from
their ability to reactively record baseband data upon the detection of highly significant
events. Recording of Nyquist-sampled baseband information over long periods of time is
typically prohibitively expensive due to the excessive data rate (at Parkes Observatory,
this eclipses the survey data rate by almost three orders of magnitude); however, short
timespans of data can be saved to disk if they are known to contain signals of interest. If
captured, such data would provide unprecedented insight into the nature of unique events
and would likely reveal the true origins of tantalising Lorimer-burst-like detections.
The primary scientific goals of this work are a) to enable the detection and confirma-
tion of new RRATs in real-time, b) to enable characterisation and reporting of the RFI
environment during live survey observations and c) to provide the opportunity to cap-
ture baseband recordings of significant events such as giant pulses, extragalactic pulses or
Lorimer bursts.
There are two key obstacles to achieving these goals. First, the pipeline must exhibit
sufficient performance so as to maintain real-time processing using the available hardware,
ideally with a short duty cycle. And second, the pipeline must include effective RFI
mitigation in order to maintain a manageable number of false positives. This chapter
presents our approaches to these challenges and discusses their effectiveness.
The details of our software pipeline, including its implementation on GPUs, deploy-
ment at Parkes Observatory and performance measurements, are described in Section 4.2.
Section 4.3 then presents early results obtained with the system, including the detection
of a new RRAT. Finally, we discuss the system, our results and future work in Section
4.4.
4.2 The pipeline
The general design of our pipeline is based on the work of Burke-Spolaor et al. (2011), but
was developed specifically to exploit the power of GPUs. Dedispersion is performed using
the GPU-based code presented in Chapter 3, while data-parallel implementations of the
other algorithms comprising the pipeline were guided by the work presented in Chapter 2.
The key components of the system are depicted in Fig. 4.1. The detection pipeline
(given the name heimdall3) receives data from one receiver beam in the form of a filter-
bank containing a time series for each frequency channel. These observations are buffered
and broken into discrete sections of time to be processed in a single pass of the pipeline;
the size of the sections is chosen to balance memory constraints, GPU loading and the
delay between the generation and reporting of results. Once a complete section of data is
available, the pipeline begins by ‘cleaning’ the filterbank to remove the effects of strong
radio-frequency interference. The data are then incoherently dedispersed, baselined, nor-
malised and filtered. Following this, the processed time series are searched for strong
signals, and detections are grouped together into a list of candidate events. In the final
stage of computation, candidates from each beam are combined and checked for coinci-
dence before being recorded and displayed to the observer.
Apart from dedispersion, which is performed using an external library (see Section
4.2.2), all other stages of the pipeline are implemented on the GPU using the Thrust C++
template library (Hoberock & Bell, 2010), which is supplied as part of NVIDIA’s Compute
Unified Device Architecture (CUDA) Toolkit4. The library provides generic implementa-
tions of a number of data-parallel algorithms and allows them to be customized and com-
bined within a framework similar to the C++ Standard Template Library5. Thrust also
provides multiple back-ends, allowing code to be compiled to use the GPU (via CUDA) or
the CPU (via OpenMP or Intel’s Threading Building Blocks6). The library was chosen for
its excellent fit to the data-parallel, algorithm-centric approach to advanced architectures
motivated in Chapter 2.
4.2.1 RFI mitigation
Radio-frequency interference is the undesired (and generally unavoidable) detection of
terrestrial radio emissions by the telescope. While radio telescopes are typically located
3 After the Marvel Comics character of the same name, who acts as guardian of Asgard and is known for having spotted an army of giants from a great distance.
4 http://developer.nvidia.com/cuda-downloads
5 See, e.g., http://www.sgi.com/tech/stl/
6 http://threadingbuildingblocks.org/
4.2. The pipeline 85
[Figure 4.1: flow-chart of the pipeline. Data path: receiver beam → polyphase filterbank,
digitise and add polarisations (FPGA operations) → clean RFI → dedisperse → extract
time series → remove baseline → normalise → matched filter → detect events, looping
over all DM trials and filter trials → merge events → multibeam coincidence (combining
candidates from the other beams) → candidate classification → candidate display; each
stage is marked as an FPGA, CPU or GPU operation.]

Figure 4.1 Flow-chart of the key processing operations in the pipeline. heimdall is the
name of the main GPU-based pipeline implementation.
in radio-quiet zones, the inevitable existence of RFI and its tendency to be many times
stronger than astronomical sources mean that techniques must be employed to mitigate
its effects. In this section we describe our approach to RFI mitigation in the context of a
fast-transient detection pipeline.
RFI signals can generally be divided into two classes: narrow-band signals that extend
over only a small fraction of the frequency band, and broad-band signals that appear in
all channels. The two types must be detected and excised using different techniques.
While narrow-band signals are significantly diluted by integration over the band during
dedispersion (see Section 4.2.2), extremely bright samples in the filterbank can maintain a
strong presence in the dedispersed time series. Detection of these bright samples requires
an estimate of the RMS noise level in each channel as well as the underlying shape of the
bandpass, which varies as a result of filtering processes in the telescope receiver system.
In our implementation, the mean bandpass has already been removed prior to the
entry-point of the transient pipeline. We estimate the RMS noise level by randomly
selecting data from different points in time (within a two-second window) and computing
the median absolute deviation, from which the RMS is derived (see Section 4.2.4 for more
details). For performance and simplicity reasons, the code computes the ‘recursive median’
rather than the true median—each group of five consecutive values is replaced with its
median recursively until only one value remains. Tests showed that this approach retained
robust statistical behaviour while avoiding the need to use a full sort or selection algorithm
as required to compute the true median. Following this procedure, the (true) median RMS
is quickly selected from those of the random samples.
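As a concrete illustration, the recursive median can be sketched in a few lines of serial C++ (the pipeline expresses the equivalent logic with Thrust primitives; the function names here are ours, and the input length is assumed to be a power of five):

```cpp
#include <algorithm>
#include <vector>

// Median of five values by sorting (an optimised version would use a
// sorting network instead).
float median5(float a, float b, float c, float d, float e) {
    float v[5] = {a, b, c, d, e};
    std::sort(v, v + 5);
    return v[2];
}

// 'Recursive median': replace each group of five consecutive values
// with its median, repeating until only one value remains. Robust to
// outliers like the true median, but requires no full sort or
// selection. Assumes data.size() is a power of five.
float recursive_median(std::vector<float> data) {
    while (data.size() > 1) {
        for (size_t i = 0; i < data.size() / 5; ++i)
            data[i] = median5(data[5*i], data[5*i+1], data[5*i+2],
                              data[5*i+3], data[5*i+4]);
        data.resize(data.size() / 5);
    }
    return data[0];
}
```

Note that each level of the reduction is trivially data-parallel: the median-of-five operations within a pass are independent of one another.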
With the RMS measured, narrow-band RFI is identified as individual samples exceed-
ing a five standard deviation threshold. In our implementation, these samples are then
‘cleaned’ by replacing them with randomly chosen good samples from neighbouring fre-
quency channels. All steps of the algorithm are implemented using Thrust’s for each,
transform and reduce functions. Random sampling is performed using the default
pseudo-random number generator provided by Thrust, on a per-thread basis where nec-
essary. Note that when the number of bits per sample is small (e.g., nbits≤4), the limited
dynamic range acts to considerably reduce the potency of narrow-band signals. In such
cases there is no need to perform explicit narrow-band RFI excision.
Broad-band signals of terrestrial origin are most easily identified through their lack of
dispersion delay across the band; i.e., broad-band RFI generally appears at a dispersion
measure (DM) of zero (see Section 4.2.2 for an introduction to dispersion). To detect these
signals, the filterbank data are integrated over the band (with no dispersion delay) and
the resulting time series is searched for peaks exceeding 5σ. An RFI mask is then derived
from the peaks and used to clean the original filterbank data by replacing bad samples
with good ones randomly chosen from nearby in time (< ±0.25 s). The band integration
is performed using the dedisp library, while the remainder of the process relies on simple
Thrust functions as with the narrow-band RFI mitigation.
To avoid losing sensitivity to sources of astronomical origin with low dispersion mea-
sures, the zero-DM cleaning procedure is limited to detections at the native time resolution
of 64 µs. Zero-DM pulses wider than this may not be excised during cleaning, allowing
them to pass through the pipeline. For this reason, a low-DM cut at 1.5 pc cm⁻³ [fol-
lowing Burke-Spolaor et al. (2011)] is applied during candidate classification at the end of
processing (see Section 4.2.8).
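The zero-DM masking step can be sketched as follows. This is an illustrative serial version under simplifying assumptions of our own: a plain mean/RMS estimate stands in for the pipeline's robust statistics, and the function name is ours. (The pipeline also replaces flagged samples with random nearby clean ones rather than merely flagging them.)

```cpp
#include <cmath>
#include <vector>

// Flag time samples whose zero-DM (band-integrated) power exceeds
// 'nsigma' standard deviations. filterbank is indexed [channel][time].
std::vector<bool> zero_dm_mask(const std::vector<std::vector<float> >& filterbank,
                               float nsigma) {
    size_t nchans = filterbank.size();
    size_t nsamps = filterbank[0].size();
    // Integrate over the band with no dispersion delay
    std::vector<float> zerodm(nsamps, 0.0f);
    for (size_t c = 0; c < nchans; ++c)
        for (size_t t = 0; t < nsamps; ++t)
            zerodm[t] += filterbank[c][t];
    // Mean and RMS of the band-integrated time series (non-robust here)
    double mean = 0.0;
    for (size_t t = 0; t < nsamps; ++t) mean += zerodm[t];
    mean /= nsamps;
    double var = 0.0;
    for (size_t t = 0; t < nsamps; ++t)
        var += (zerodm[t] - mean) * (zerodm[t] - mean);
    double rms = std::sqrt(var / nsamps);
    // Flag samples exceeding the threshold
    std::vector<bool> mask(nsamps, false);
    for (size_t t = 0; t < nsamps; ++t)
        mask[t] = (zerodm[t] - mean) > nsigma * rms;
    return mask;
}
```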
In addition to frequency characteristics, RFI can also be identified through its coinci-
dent presence in multiple beams of the Parkes Multibeam receiver, as often occurs when
a signal is observed via a side-lobe rather than boresight to a single beam (Burke-Spolaor
et al., 2011). Correlations between beams therefore provide a strong discriminator between
astronomical and terrestrial sources. The current implementation of our pipeline uses this
information at the end of processing to classify candidate events (see Section 4.2.8); how-
ever, this information could also be used prior to dedispersion to provide more confidence
in the filterbank cleaning process. We plan to integrate the use of such information in the
future. Other, more involved methods of RFI mitigation are also possible (Briggs & Kocz,
2005; Kocz et al., 2012; Spitler et al., 2012); these will also be considered in future work.
4.2.2 Incoherent dedispersion
As described in Chapter 3, interactions with free electrons in the interstellar medium cause
radio-frequency signals from astronomical sources to be delayed in time in proportion to
the inverse square of the frequency. Broad-band signals thus appear as quadratic sweeps in the frequency-
time space of recorded filterbank data. The scale of the delay depends linearly on the
number of free electrons in the line of sight to the source, and is referred to as the disper-
sion measure (DM). As the distance to an undetected source is unknown, it is necessary
to integrate over the band along a number of trial dispersion measures (and subsequently
search each of the resultant time-series for signals). This process is known as dedispersion.
Due to the large number of trial dispersion measures required to comprehensively cover
the expected range [typically O(100− 1000)] and the computational expense of each inte-
gration over the band, the process of dedispersion is generally the most time-consuming
stage of a transient-detection pipeline.
Our pipeline, targeting a centre observing frequency of 1381 MHz, samples the dis-
persion measure space 0 ≤ DM ≤ 1000 pc cm⁻³ using 1196 non-linearly distributed trials
chosen to maintain a constant fraction of finite-sampling-induced smearing as a function
of DM. In addition, the input data are reduced in time resolution by successive factors of
two when the smearing cost of doing so falls below 15%. This improves the overall speed
of the dedispersion process by ∼ 2× without significant loss of signal.
To perform dedispersion on the GPU we used our (publicly available) software library
described in Chapter 3, requiring no more than calls to an application programming inter-
face to create and execute a dedispersion plan. The performance of the direct dedispersion
algorithm implemented by the library was found to be sufficient for the current version
of the pipeline. Use of the ‘tree’ or ‘sub-band’ dedispersion algorithms remains an option
for the future should additional performance be required at the expense of increased code
complexity or signal smearing (see Chapter 3 of this work; Magro et al. 2011).
4.2.3 Baseline removal
Due to instrumental effects in the telescope receiver system, the mean signal level can vary
slowly as a function of time. This baseline typically varies over a timescale of seconds,
and must be subtracted prior to event detection.
The baseline is easily measured by smoothing the dedispersed time series with a wide
window function. However, the process is complicated by the presence of bright impulses,
which can severely bias the baseline estimate. It is therefore necessary to use robust
statistical methods. The running median is one such method, but comes with a high
computational and/or implementation cost. In particular, the large window size (2 s
≈ 3 × 10⁴ samples at the sampling time of 64 µs used in the HTRU survey) and the
constraint of executing in real-time meant that the running median was not a practical
solution for our pipeline.
An alternative method that proved more suitable is the clipped mean, in which the
baseline is first estimated by computing the running mean and then iteratively made more
robust by clipping outliers and re-smoothing. Tests on real data showed that a three-pass
algorithm that clipped first at 10σ and then at 3σ was sufficient to produce a robust
measurement of the baseline. However, the multi-pass nature of this algorithm resulted
in relatively slow performance compared to other parts of the pipeline. A pre-processing
step involving reduction of the time-resolution prior to baselining was attempted, but the
resulting code became complex and difficult to maintain and perfect; the algorithm also
exhibited a strong dependence on the choice of clipping parameters.
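The three-pass clipped mean can be sketched as follows. This is an illustrative serial version: the helper names are ours, a simple boxcar stands in for the pipeline's smoothing, and a single global RMS is used for the clipping thresholds.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Boxcar running mean of width 'window' (truncated at the ends).
static std::vector<float> boxcar_mean(const std::vector<float>& x, size_t window) {
    std::vector<float> out(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        size_t lo = i > window / 2 ? i - window / 2 : 0;
        size_t hi = std::min(x.size(), i + window / 2 + 1);
        double sum = 0.0;
        for (size_t j = lo; j < hi; ++j) sum += x[j];
        out[i] = static_cast<float>(sum / (hi - lo));
    }
    return out;
}

// Three-pass clipped-mean baseline: smooth, clip outliers at 10 sigma,
// re-smooth, clip at 3 sigma, re-smooth. Clipped samples are replaced
// with the current baseline estimate.
std::vector<float> clipped_mean_baseline(std::vector<float> x, size_t window) {
    const float clip_sigmas[2] = {10.0f, 3.0f};
    std::vector<float> baseline = boxcar_mean(x, window);
    for (int pass = 0; pass < 2; ++pass) {
        // Global RMS of the residual
        double var = 0.0;
        for (size_t i = 0; i < x.size(); ++i) {
            double r = x[i] - baseline[i];
            var += r * r;
        }
        float rms = static_cast<float>(std::sqrt(var / x.size()));
        // Clip outliers back to the current baseline, then re-smooth
        for (size_t i = 0; i < x.size(); ++i)
            if (std::fabs(x[i] - baseline[i]) > clip_sigmas[pass] * rms)
                x[i] = baseline[i];
        baseline = boxcar_mean(x, window);
    }
    return baseline;
}
```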
Subsequent investigation led to an alternative approach based on the recursive median
(see Section 4.2.1 for more details on this algorithm). The method involves applying the
recursive median to reduce the time resolution of the data to a value representative of
the desired smoothing length. These data are then simply linearly interpolated back to
the original time resolution to form the complete baseline. This provides a robust and
parameter-free approach to dealing with outliers as well as a very simple implementa-
tion. All operations in the baselining process were implemented within the data-parallel
paradigm using calls to Thrust’s transform function.
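The decimate-then-interpolate scheme can be sketched serially as follows (an illustrative version with names of our own; the placement of coarse values at block centres and the edge handling are our assumptions):

```cpp
#include <algorithm>
#include <vector>

// Median of five values by sorting (see the recursive-median sketch in
// Section 4.2.1).
static float median5(float a, float b, float c, float d, float e) {
    float v[5] = {a, b, c, d, e};
    std::sort(v, v + 5);
    return v[2];
}

// Baseline estimate: decimate the series 'levels' times by taking the
// median of each group of five samples, then linearly interpolate the
// coarse values (placed at block centres) back to full resolution.
// Assumes x.size() is divisible by 5^levels.
std::vector<float> recursive_median_baseline(const std::vector<float>& x, int levels) {
    std::vector<float> coarse = x;
    size_t block = 1;
    for (int l = 0; l < levels; ++l) {
        std::vector<float> next(coarse.size() / 5);
        for (size_t i = 0; i < next.size(); ++i)
            next[i] = median5(coarse[5*i], coarse[5*i+1], coarse[5*i+2],
                              coarse[5*i+3], coarse[5*i+4]);
        coarse = next;
        block *= 5;
    }
    std::vector<float> baseline(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        double pos = (double(i) - 0.5 * (block - 1)) / block; // coarse-grid coord
        if (pos <= 0.0) { baseline[i] = coarse.front(); continue; }
        if (pos >= double(coarse.size() - 1)) { baseline[i] = coarse.back(); continue; }
        size_t j = size_t(pos);
        double frac = pos - j;
        baseline[i] = float((1.0 - frac) * coarse[j] + frac * coarse[j + 1]);
    }
    return baseline;
}
```

The robustness comes for free: a bright impulse occupies a minority of any group of five, so it never propagates into the coarse values.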
4.2.4 Normalisation
Accurate thresholding of time series by the pipeline requires robust measurements of their
root mean square (RMS) noise level. The RMS as computed from its definition for a time
series $f_i$ of $N$ samples (with zero mean),

$$\mathrm{RMS} \equiv \sqrt{\frac{1}{N}\sum_{i=0}^{N} f_i^2}, \qquad (4.1)$$
can be biased by the presence of non-Gaussian signals due to RFI and strong astronomical
sources. We investigated two methods of measuring the RMS that are robust against
outliers: 1) truncation of the distribution, and 2) use of (approximate) median statistics.
The first method operates by pre-truncating the distribution of values such that outliers
in the tails of the distribution are not included in the computation. The resulting RMS
estimate can be corrected for the bias this introduces by assuming it follows a truncated
normal distribution. The correction factor for this case is given by:
$$\mathrm{RMS} = \frac{\mathrm{RMS}_{\mathrm{trunc}}}{\sqrt{1-\gamma(t)}}, \qquad (4.2)$$

$$\gamma(t) \equiv \frac{2\,t\,\phi(t)}{2\,\Phi(t)-1}, \qquad (4.3)$$
where t is the signal-to-noise ratio at which the distribution was (symmetrically) trun-
cated and φ(x) and Φ(x) are the normal distribution’s probability density and cumulative
distribution functions respectively. By choosing a small value of t (e.g., t ≈ 1σ), extreme
values will have no impact on the estimated RMS. Note that the quality of the corrected
measurement was found to remain high even for $t \ll 1\sigma$.
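The correction factor can be evaluated directly from the standard normal density and distribution functions. The sketch below is ours, not pipeline code; it uses the variance form of the truncated-normal correction, i.e. the truncated RMS is divided by the square root of $1-\gamma(t)$.

```cpp
#include <cmath>

const double PI = 3.14159265358979323846;

// Standard normal probability density and cumulative distribution
double normal_pdf(double x) { return std::exp(-0.5 * x * x) / std::sqrt(2.0 * PI); }
double normal_cdf(double x) { return 0.5 * (1.0 + std::erf(x / std::sqrt(2.0))); }

// gamma(t) for a distribution truncated symmetrically at +/- t sigma
double gamma_trunc(double t) {
    return 2.0 * t * normal_pdf(t) / (2.0 * normal_cdf(t) - 1.0);
}

// Recover the true RMS from an RMS measured on the truncated data.
// Note the square root: the truncated *variance* is (1 - gamma) times
// the true variance.
double corrected_rms(double rms_trunc, double t) {
    return rms_trunc / std::sqrt(1.0 - gamma_trunc(t));
}
```

For a unit-variance normal truncated at $t = 1\sigma$, $\gamma \approx 0.709$ and the measured truncated RMS of $\approx 0.54$ is corrected back to $\approx 1$.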
Truncation of the distribution is most conveniently achieved by sorting the samples
and ignoring those at the beginning and end of the sorted array. Because sorting is a
computationally expensive operation, our implementation first sub-sampled the data by a
factor of 100 to reduce the workload.
The second method employs median statistics to mitigate the effects of strong out-
liers. The median of the absolute deviations from the median [or simply median absolute
deviation (MAD)] can be used to estimate the RMS according to the efficiency factor
$$\mathrm{RMS} = 1.4826\,\mathrm{MAD}. \qquad (4.4)$$

This constant arises from the fact that, for a normal distribution, the population MAD
corresponds to the 75th percentile: $\mathrm{MAD} = \Phi^{-1}(3/4)\,\sigma \approx 0.6745\,\sigma$,
and $1/0.6745 \approx 1.4826$.
To avoid the cost of computing the full median over each time series, our implementa-
tion uses the recursive median (see Section 4.2.1) to approximate the MAD.
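A compact illustration of the MAD-based estimate, using the true median via selection (the pipeline substitutes the recursive-median approximation); the standard consistency constant $1.4826 \approx 1/\Phi^{-1}(3/4)$ is assumed. The function name is ours.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Estimate the RMS of a normally distributed time series via the
// median absolute deviation. Takes a copy since nth_element reorders.
// For even n this picks the upper median, which is adequate here.
float mad_rms(std::vector<float> x) {
    size_t n = x.size();
    std::nth_element(x.begin(), x.begin() + n / 2, x.end());
    float med = x[n / 2];
    for (size_t i = 0; i < n; ++i) x[i] = std::fabs(x[i] - med);
    std::nth_element(x.begin(), x.begin() + n / 2, x.end());
    return 1.4826f * x[n / 2];
}
```

Unlike equation (4.1), a single extreme outlier leaves the estimate essentially unchanged.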
Data-parallel implementations of these techniques were constructed using a selec-
tion of operations from the Thrust library. Equation (4.1) was implemented using the
transform_reduce function, while sorting was done using Thrust's sort. The sub-sampling
step was parallelised with the use of a gather operation, and repeated application of a
transform sufficed to compute the recursive median. After testing both algorithms, the
median-based statistic was chosen for the final implementation due to its parameter-free
nature and its significantly simpler code.
4.2.5 Matched filtering
In order to detect signals with durations longer than the sampling time dt, a collection of
matched filters is applied to each time series. The data are convolved with top-hat profiles
of duration $w_n = 2^n$ samples for $0 \le n \le 12$, and normalised by $\sqrt{w_n}$. The filtered time
series then continue through the remainder of the pipeline.
The use of simple top-hat profiles makes a data-parallel implementation of the matched
filtering process surprisingly simple. For a time series $f_i$, the filtering operation $F(f_i; w)$
can be defined as follows:

$$F(f_i; w) = \frac{1}{\sqrt{w}} \sum_{j=i-\lfloor w/2\rfloor}^{i+\lceil w/2\rceil} f_j\,; \qquad \lfloor w/2\rfloor \le i < N+1-\lceil w/2\rceil \qquad (4.5)$$

$$\hphantom{F(f_i; w)} = \frac{1}{\sqrt{w}} \left( \sum_{j=0}^{i+\lceil w/2\rceil} f_j \;-\; \sum_{j=0}^{i-\lfloor w/2\rfloor} f_j \right) \qquad (4.6)$$

$$\hphantom{F(f_i; w)} = \frac{1}{\sqrt{w}} \left[ \Phi_{i+\lceil w/2\rceil} - \Phi_{i-\lfloor w/2\rfloor} \right], \qquad (4.7)$$
where

$$\Phi_i \equiv \sum_{j=0}^{i} f_j \qquad (4.8)$$
is the prefix-sum of the time series. The top-hat convolution can therefore be expressed
solely in terms of prefix-sum and transform operations, making a data-parallel implemen-
tation straightforward. An additional feature of the algorithm is that once Φi has been
computed [an O(N) operation], the time series can be filtered at any width in constant
time per element. This allows n filters to be applied to the time series in O(nN) time,
regardless of the width of the filters.
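The prefix-sum formulation translates directly into code. Below is an illustrative serial C++ version of the idea in equation (4.7) (the pipeline uses Thrust's scan; the function name and the use of an exclusive prefix sum are ours):

```cpp
#include <cmath>
#include <vector>

// Top-hat matched filter of width w via a prefix sum: each output is a
// difference of two prefix-sum values, normalised by sqrt(w). Building
// the prefix sum is O(N); each output then costs O(1) for any width.
std::vector<float> tophat_filter(const std::vector<float>& f, size_t w) {
    size_t n = f.size();
    // Exclusive prefix sum: psum[i] = f[0] + ... + f[i-1]
    std::vector<double> psum(n + 1, 0.0);
    for (size_t i = 0; i < n; ++i) psum[i + 1] = psum[i] + f[i];
    float norm = 1.0f / static_cast<float>(std::sqrt(double(w)));
    // Only fully covered positions are produced (N + 1 - w outputs)
    std::vector<float> out(n + 1 - w);
    for (size_t i = 0; i + w <= n; ++i)
        out[i] = norm * static_cast<float>(psum[i + w] - psum[i]);
    return out;
}
```

Applying all n filter widths to the same prefix sum reproduces the O(nN) cost quoted above.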
To further speed up this algorithm and the detection process that follows it, time
series to be filtered with wide filters are first reduced in resolution by adding adjacent
samples. The filter width at which this resolution-reduction begins is a parameter to the
pipeline, and allows trading sensitivity for performance; the current setting is to reduce the
resolution prior to applying filters greater than $2^3$ samples wide (meaning that $2^n$-sample
filtering, where $n > 3$, is applied by adding $2^{n-3}$ adjacent samples and then applying a
$2^3$-element filter).
The data-parallel implementation of the matched filtering process was constructed
using Thrust’s exclusive scan for the prefix sum computation and transform for the
differencing and normalisation operations. Resolution-reduction was performed simply
by striding through the prefix-sum array by the number of samples to combine, and was
implemented using Thrust’s ‘fancy iterators’7.
4.2.6 Event detection
The final step in processing the collection of time series is to identify and extract significant
signals. Signals are considered significant if they exceed a certain threshold, which in our
pipeline is set to six times the RMS noise level (6σ). Identifying samples that exceed this
threshold is a trivial matter; however, the process is complicated by the fact that for a
single event of finite duration, many neighbouring samples may exceed the threshold. In
order to correctly identify such a case as a single extended event, rather than a myriad of
single-sample signals, threshold-exceeding samples separated by only a small number of
bins (currently three) are classified as belonging to the same event.
Once groups, or ‘islands’, of threshold-exceeding samples have been identified, they
are converted to individual events by finding the value and time of their maximum. In
7http://thrust.github.com/doc/group__fancyiterator.html
92 Chapter 4. Fast-Radio-Transient Detection in Real-Time with GPUs
addition to these properties, the time of the first and last samples comprising the event
are extracted and stored.
A data-parallel implementation of the event-detection process involves algorithms that
are more complicated than those that have been mentioned thus far. The first step is to
extract samples from the time series that exceed the detection threshold. This operation
is known as a ‘stream compaction’, and can be performed using a data-parallel implemen-
tation of an algorithm such as ‘copy if’. Specifically, when a sample exceeds the threshold,
we copy both it and its array index to a new memory location. This allows the rest of the
event detection process to operate only on samples that exceeded the threshold, without
losing information on the arrival time of each sample.
Once the significant samples have been isolated, we wish to identify those samples
comprising temporally-isolated events. As the array index of each sample was retained,
identifying significant gaps between events is simply a matter of looking for jumps between
successive sample indices that exceed the bin separation criterion. For example, if the
significant samples had array indices ‘0 2 5 6 8 11’ and one considered samples separated
by more than two bins to belong to independent events, then the difference between
successive indices would identify the gaps between events as ‘0 2|5 6 8|11’. Examination of
the difference between successive values in this way can be performed using a transform
function.
Having identified the boundaries between individual events, the next step is to locate
when each event’s maximum signal-to-noise occurs. This requires the application of a
reduce operation to the samples comprising each event. Given the potentially large number
of events, it is highly desirable to perform these reductions in parallel. Fortunately, this
can be achieved using Thrust’s reduce by key function, which reduces contiguous values
belonging to the same segment; in this case segments correspond to temporally-isolated
events. The result of this operation is an array of the maximum signal-to-noise of each
event along with a separate array containing the corresponding offsets into the original
time series.
The final step of the event detection process is to record the starting and ending time
of each signal. This information is extracted directly from the indices of the samples
that exceeded the detection threshold by using the scatter if function along with the
locations of the gaps between events.
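A serial Python sketch of this sequence (stream compaction, gap detection, segmented reduction) is given below; the pipeline itself performs these steps in parallel on the GPU with Thrust's copy_if, transform, reduce_by_key and scatter_if, and the tuple layout here is illustrative:

```python
def detect_events(series, threshold=6.0, max_gap=3):
    """Return (peak_value, peak_index, start_index, end_index) per event."""
    # Stream compaction: keep samples above threshold, with their indices.
    hits = [(i, x) for i, x in enumerate(series) if x > threshold]
    events = []
    for i, x in hits:
        # A jump of more than max_gap bins between successive hit indices
        # marks the boundary between two temporally-isolated events.
        if events and i - events[-1][-1][0] <= max_gap:
            events[-1].append((i, x))
        else:
            events.append([(i, x)])
    # Segmented reduction: peak S/N plus first/last sample of each event.
    return [(max(x for _, x in ev),
             max(ev, key=lambda p: p[1])[0],
             ev[0][0], ev[-1][0])
            for ev in events]
```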
In contrast to the previous stages of the pipeline, the execution time of the event
detection procedure is not simply a function of the length of the time series, depending
instead on the number of detected events. This has implications for the ability to main-
tain real-time performance. When the number of events remains low (as is typical), the
detection stage consumes only a small fraction of the total execution time of the pipeline.
However, in the event of a large burst of RFI (as was found during testing), this stage of
the pipeline can become swamped with events and cause a catastrophic slow-down. To
avoid this situation, a hard limit was placed on the rate of detections—reaching this limit
causes the pipeline to stop the DM-trial and filter search early and to return only those
candidates found up to that point. In this case, an error code is returned by the pipeline
warning the system that processing of the gulp of data was incomplete. This ‘bail condi-
tion’ provides an effective (although inelegant) means of ensuring real-time performance
regardless of observing conditions.
4.2.7 Event merging
While detected events will appear most strongly at their best-matching dispersion measure
trial and filter width, bright signals will typically be detected across a number of DM trials
and filters. To avoid reporting these secondary candidates as individual events, temporally-
coincident signals are first grouped together. The process takes the form of a connected
component labelling algorithm: pairs of candidates occurring at times within three filter
widths of each other are considered connected; candidates connected directly or indirectly
to each other are then merged to form a single event.
The connected component labelling algorithm is implemented as a three-step process.
First, each candidate’s label is initialised to the candidate’s index in the total list of
candidates. Next, a loop over all pairs of candidates detects coincidences and replaces the
corresponding labels with the minimum of the two original labels. Finally, each candidate
traces its label back along the ‘equivalency chain’ to find its lowest equivalent; e.g., if a
candidate has label 8, and candidate number 8 has label 5, and candidate number 5 has
label 5, then the initial candidate will have its label set to 5. The end result of this process
is a list of labels where matching values indicate connected candidates.
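This three-step labelling scheme can be sketched serially in Python as follows (one deliberate strengthening: the pairwise pass here is repeated until no label changes, which guarantees that indirectly-connected candidates are fully merged; the `connected` predicate stands in for the coincidence test and is an illustrative interface):

```python
def label_components(n, connected):
    """Connected-component labelling of n candidates by min-label
    propagation, following the three-step scheme described in the text."""
    labels = list(range(n))               # step 1: label = own index
    changed = True
    while changed:
        changed = False
        for a in range(n):                # step 2: loop over all pairs
            for b in range(a + 1, n):
                if connected(a, b) and labels[a] != labels[b]:
                    labels[a] = labels[b] = min(labels[a], labels[b])
                    changed = True
    for i in range(n):                    # step 3: trace the equivalency chain
        while labels[labels[i]] != labels[i]:
            labels[i] = labels[labels[i]]
    return labels
```

Candidates sharing a final label are then merged, e.g. by sorting on the labels and reducing each group to its highest-S/N member.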
Once the connected component labelling process is complete, merging the candidates
is simply a matter of sorting them by their labels and then using Thrust’s reduce by key
function to merge those with matching labels. The merging function is defined to return
the parameters of the member candidate with the greatest signal-to-noise ratio.
While the loop over all pairs of candidates8 makes the computational complexity of
this process O(N^2), the total number of candidates being operated on is small enough
8Note that more efficient search algorithms are possible—e.g., a binning procedure could reduce the complexity to O(N); our approach was chosen for its simplicity rather than its performance.
(relative to the work performed by the rest of the pipeline) that the overall cost is not
significant.
4.2.8 Candidate classification and multibeam coincidence
Once the main pipeline is complete, the lists of candidates from each beam are gathered
on a single machine and a classification process is performed. Following the procedure
of Burke-Spolaor et al. (2011), candidates having been produced from the merging of
fewer than three individual events in DM-trial/filter space (i.e., having fewer than three
members) are classified as noise spikes. In practice, this effectively raises the detection
threshold by requiring events to be either strong enough or broad enough to be detected
in three successive DM or filter trials. A cut in dispersion measure at 1.5 pc cm−3 then
identifies low-DM signals likely to be of terrestrial origin.
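The two classification cuts described above could be sketched as follows (the field names are illustrative, not the pipeline's actual data structures):

```python
def classify(candidate):
    """Apply the noise-spike and low-DM cuts to a merged candidate."""
    if candidate['members'] < 3:      # merged from <3 events: noise spike
        return 'noise'
    if candidate['dm'] < 1.5:         # DM below 1.5 pc cm^-3: terrestrial
        return 'terrestrial'
    return 'candidate'
```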
The final stage of classification is a multibeam coincidence analysis. This process op-
erates on the expectation that most signals of terrestrial origin will appear simultaneously
in multiple beams (having been detected through a far side-lobe of the receiver), while as-
tronomical sources will remain localised to a single beam. While this assumption generally
holds very well, there are two ways in which astronomical events may be found to appear
in multiple beams. The first is when exceptionally bright events are picked up by the
finite response of neighbouring beams. The other is when astronomical signals coincide
with RFI signals in other beams.
The Parkes 20 cm multibeam receiver contains 13 beams pointing at locations on the
sky separated by approximately 30 arcmins, with a response pattern that falls to 50%
of peak sensitivity one quarter of the way between beams, and by more than two orders
of magnitude at neighbouring beams (Staveley-Smith et al., 1996). Astronomical point-
sources lying directly in the centre of a beam would, therefore, need to exceed ∼750σ
to be detected above 6σ in neighbouring beams, while those lying mid-way between two
beams would need to exceed around 400σ to be detected above 6σ in both beams9. As a
conservative measure, we require candidates to appear in more than three beams before
classifying them as RFI. Possibilities exist for decreasing this threshold without losing
sensitivity to bright sources, such as checking that the coincident beams are adjacent or
even computing how well a given coincident event is fitted by the known beam response
pattern; however, the situation is complicated by the existence of false-positives in the
coincidence information (see below). For this reason, our current implementation relies on
9These thresholds are approximate values only, as the response pattern becomes highly asymmetrical in the outer beams of the receiver.
just the simple threshold criterion.
The other issue with using coincidence information to identify RFI is the production
of false-positives during coincidence detection. This can occur when astronomical events
occur coincidentally with RFI bursts appearing in other beams, which becomes more likely
with broad signals. To minimise the likelihood of this situation, our pipeline checks event-
pairs for coincidence not only in time (with a tolerance of three times the greater filter
width), but also in the detection filter (tolerance of four filters) and the signal-to-noise
ratio (tolerance of 30%). These criteria were found to strike a reasonable balance between
identification of RFI and mis-identification of astronomical sources.
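The pairwise coincidence test with these three tolerances could be sketched as below (the dictionary keys are illustrative, and the 30% S/N tolerance is assumed here to be taken relative to the larger of the two values):

```python
def coincident(a, b):
    """Test whether two events in different beams count as coincident:
    time within 3x the greater filter width, within 4 filter trials,
    and S/N within 30%."""
    return (abs(a['time'] - b['time']) <= 3 * max(a['width'], b['width'])
            and abs(a['filter'] - b['filter']) <= 4
            and abs(a['snr'] - b['snr']) <= 0.3 * max(a['snr'], b['snr']))
```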
4.2.9 Deployment at Parkes Radio Observatory
The software pipeline was deployed at the Parkes Radio Observatory as part of the Berkeley
Parkes Swinburne Recorder (BPSR) back-end. This currently consists of 13 Reconfigurable
Open Architecture Computing Hardware (ROACH) boards10 connected via an Infiniband
network switch to 8 server computers, each of which contains two 6-core Nehalem-class
Intel Xeon 5650 CPUs and two Fermi-class NVIDIA Tesla C2070 GPUs. The 26 inter-
mediate frequency (IF) signals from the multibeam receiver (2 polarisations × 13 beams)
are fed via analogue-to-digital converters into the ROACH boards, where the signals are
passed through a polyphase filter bank and broken into 1024 frequency channels cover-
ing the 400 MHz bandwidth centered at 1382 MHz. The ROACH then places the data
(integrated to 64 µs samples) into packets and forwards them to the server machines,
each machine receiving the dual-polarisation signal from two beams. Here, the data are
captured by software daemons, which proceed to sum the polarisations and write the to-
tal intensity information to a ring buffer in memory. It is this ring buffer to which the
transient detection pipeline attaches, and at this point that the detection process begins.
The transient pipeline is run as a number of separate instances, each instance processing
the data from one beam using a single GPU. Upon completion of the pipeline for a gulp
of data (typically ∼10 s worth), the pipeline instances output their lists of candidates. A
‘multibeam monitor’ code, running in another process, then collates these lists, performs
the RFI coincidence check between beams and produces the results overview plots as
an image file. Finally, this image file is transferred to a web server and presented to
the observer on a web-based graphical user interface (see Section 4.2.10 for details on
visualisation of results).
While the primary use of the real-time pipeline is during pulsar and fast-transient
10https://casper.berkeley.edu/wiki/ROACH
survey observations, it is also possible to run it simultaneously with other observations that
use the Parkes Multibeam receiver, providing varying levels of usefulness. During timing
and other follow-up studies of pulsars and RRATs the pipeline remains a powerful indicator
of the current RFI environment, as well as providing detailed feedback on the quality
of data (for sufficiently bright sources). Other observing modes (e.g., quasar pointings
or studies requiring the use of a calibrator signal) can pose problems for the pipeline,
producing unintuitive output. However, meaningful results can often still be obtained in
such cases via the twelve off-centre receiver beams. When possible, the use of these beams
can provide useful RFI information as well as the opportunity to capture serendipitous
transient events.
In its current form, the output of the transient pipeline is presented only to the ob-
server(s). While this already represents a significant change in the observing paradigm,
there exists even more avenue for discovery through the use of a fully-automated machine
interface capable of further reducing the detection→reaction delay to the order of seconds.
The implementation of such a system is beyond the scope of this work, but the idea is
discussed in more detail in Section 4.4.
Figure 4.2 Results overview plots from the pipeline for an archived pointing in the HTRU survey. The pointing contains a new rotating radio transient candidate, which appears as the three pink spots labelled with (beam) ‘1’ at a DM of around 45 pc cm−3 (occurring at times ∼70 s, 270 s and 320 s). See main text for details of the visualisation. A cut in SNR of 6.5 was applied for clarity.
4.2.10 Visualisation
The primary way of visualising the results of the pipeline is through a set of plots created to
provide an overview of the complete collection of candidates (see Fig. 4.2). Here, the main
plot displays detections in the time–DM plane and allows for immediate characterisation
of the RFI environment and the presence of bright dispersed signals. Candidates below
the cut in dispersion measure indicating low-DM RFI are shown as hollow circles, while
detections with high multibeam coincidence are shown as stars. Grey crosses indicate
candidates flagged as noise (due to a small number of component members in the detection
space), and candidates that are strongest at the largest filter width are not shown. All
other signals are then displayed as filled circles. Further information is conveyed in the
size (representing peak signal-to-noise ratio), colour (representing pulse width) and label
(representing strongest beam) of each of the points. A second plot shows a histogram of
the total number of candidates in each beam as a function of dispersion measure. This is
useful for identifying periodic sources such as pulsars and some rotating radio transients,
which show up as a narrow spike at the corresponding dispersion measure (assuming a
sufficient number of pulses are detected). Finally, a third plot displays the signal-to-noise
ratio as a function of dispersion measure to provide more detailed information.
Currently only the final candidates are visualised in the overview plots—the individual
member events comprising each candidate are not shown. This is in contrast to previous
work where events appear as extended ‘trails’ of detections across DM space, peaking in
signal-to-noise ratio at the true DM and providing some additional insight into the shape
of the signal (Cordes & McLaughlin, 2003; McLaughlin et al., 2006; Burke-Spolaor et al.,
2011). The decision to plot only the final events in our work was due to the desire to
simultaneously display results from all 13 beams of the Parkes Multibeam receiver and
the risk of overcrowding, potentially hiding interesting signals behind the trails of others.
Improvement of our visualisation methodology is ongoing, and we may return to plotting
full candidate trails in the future.
During live observing, the overview plots are updated and displayed to the observer
around every 10 s. The web-based interface also provides a set of controls allowing the
observer to interactively modify visualisation parameters such as cuts in SNR, filter and
DM, and to toggle the inclusion of individual beams. The thirty strongest candidates from
the pipeline are also displayed, along with lists of the known pulsars located within the
field of view of each beam.
4.2.11 Performance
This section presents performance results for the execution of the pipeline, demonstrating
scaling of the different processing stages and comparing total computing time to real-time.
All benchmarks were run on a server node containing two six-core Intel Xeon X5650 CPUs
and two NVIDIA Tesla C2070 GPUs.
Fig. 4.3 shows the execution time of each part of the pipeline when processing the
central beam from the 9.4 minute observation shown in Fig. 4.2. As expected, dedisper-
sion consumes the majority of the execution time, and remains constant throughout the
observation. Filterbank cleaning, memory copying, baseline removal, normalisation and
matched filtering all consume only a small fraction of the total computation time. The
only data-dependent stage of the pipeline is event detection, which can be seen to fluc-
tuate significantly throughout the observation as different numbers of events are detected
(corresponding in this case to RFI). The potential for large increases in the execution time
of this stage during periods of strong RFI motivated the addition of a hard limit to the
event rate in order to guarantee sustained real-time performance (see Section 4.2.6).
Fig. 4.4 shows the total execution times of the different pipeline stages as a function
of the gulp size when processing the observation shown in Fig. 4.2. The computation is
most efficient when processing large lengths of data at a time (e.g., processed time per
gulp ∼ 30 s) and the RFI cleaning and dedispersion processes remain very efficient down
to gulp lengths of ∼ 2 s. However, the later stages of the pipeline become extremely inef-
ficient at small gulp sizes. This issue is due to under-utilisation of the GPU’s computing
resources: at small gulp sizes there are insufficient time samples to exploit all of the avail-
able processing threads, and many simply remain idle. The RFI cleaning and dedispersion
algorithms are more resilient to this problem because they operate simultaneously on all
1024 channels of the filterbank data; in contrast, our current implementation applies the
later parts of the pipeline to each dedispersed time series sequentially. The solution is
clearly to exploit the additional parallelism between independent DM trials. Limitations
of the Thrust library currently prevent this from being a straightforward task; however,
we expect upcoming versions of Thrust to allow the use of separate streams of GPU com-
putation. This feature should allow the simultaneous processing of multiple DM trials on
a single GPU, providing much greater efficiency at short gulp sizes.
While our pipeline was designed to execute all processes on the GPU, a hybrid GPU-
CPU approach is also possible. Using Thrust’s ability to target multiple back-ends, we
trivially recompiled the pipeline to use OpenMP-based implementations of all algorithms
except dedispersion (which remained on the GPU using our external library). CPU-based
[Figure: stacked execution time [s] per gulp number (1–34), broken down by pipeline stage: Mem alloc, Clean RFI, Dedisperse, Mem copy, Baseline, Normalise, Matched filter, Detect events.]

Figure 4.3 Plot showing the break-down of execution times during each gulp for different parts of our transient pipeline when processing the central beam of the 565 s pointing shown in Fig. 4.2. Here each gulp (except the last) processes 16.8 s of data and all stages of the pipeline are executed on the GPU (an NVIDIA Tesla C2070).
[Figure: total execution time [s] vs processed time per gulp [s] (1.049–33.554 s), broken down by the same pipeline stages as in Fig. 4.3.]

Figure 4.4 Plot showing the variation of execution times for different parts of our transient pipeline as a function of the gulp size when processing the central beam of the 565 s pointing shown in Fig. 4.2. Here all stages of the pipeline are executed on the GPU (an NVIDIA Tesla C2070). The dashed region shows the total time when using the hybrid GPU-CPU (3 cores) approach for comparison (see Fig. 4.5).
dedispersion was not considered due to its approximately six times slower performance
(see Chapter 3). Fig. 4.5 shows the results of the same benchmarks as in Fig. 4.4 but for
the hybrid GPU-CPU code using three11 cores of the CPU. At large gulp sizes, the CPU is
around three times slower than the GPU. However, with the current implementation, the
CPU is able to scale more effectively to smaller chunks of data, and becomes faster than
the GPU below gulps of around four seconds. We note that we observed scaling efficiencies
of approximately 80 per cent when using different numbers of CPU cores between one and
twelve, indicating that the algorithms are well-suited to both GPUs and multi-core CPUs.
The speed of the GPU dedispersion code and the data-parallel implementations of
the other parts of the pipeline have proved sufficient to comfortably maintain real-time
execution under the current back-end configuration with 8 s gulps. The code has been in
operation at the telescope since mid July 2012 and has not exhibited any performance-
related issues during this time besides the automated bail-outs during periods of excessive
RFI (corresponding to event-rates exceeding 1.5 × 10^5 detection peaks per minute across
the search space).
4.3 Results
This section presents preliminary testing and science results from our pipeline using
archived data as well as real-time observations. Further work applying the system to
specific science applications is ongoing.
4.3.1 Discovery of PSR J1926–13
Here we report the serendipitous discovery of a new rotating radio transient (RRAT) source
found in existing data (observed in April 2009) from the High Time Resolution Universe
(HTRU) survey (Keith et al., 2010) during testing of our pipeline. Manual inspection of
the overview plots from this pointing (see Fig. 4.2) prompted further study of a small
number of strong pulses appearing in the central beam at consistent DM (∼45 pc cm−3)
and filter (∼8 ms) trials. Our confidence in the origin of the signal was sufficient to
schedule a follow-up observation during HTRU observing time.
A fifteen minute confirmation observation was made in July 2012, in which several
strong pulses were again detected at similar DM and filter trials (see Fig. 4.6). Manual
inspection of the dedispersed time series from both the detection and confirmation obser-
vations found the eleven observed pulses to match a 4.864±0.002 s period, confirming the
11Only three cores out of six on the CPU were used because the remaining cores are needed for other processing tasks during observations.
4.3. Results 103
[Figure: total execution time [s] vs processed time per gulp [s] (1.049–33.554 s), broken down by the same pipeline stages as in Fig. 4.3.]

Figure 4.5 Plot showing the variation of execution times for different parts of our transient pipeline as a function of the gulp size when processing the central beam of the pointing shown in Fig. 4.2. Here all stages of the pipeline but dedispersion are executed on the CPU (an Intel Xeon X5650) using 3 cores. The dashed region shows the total time when using the all-GPU approach for comparison (see Fig. 4.4).
Table 4.1 Properties of the newly discovered RRAT. Columns are: (1) name derived from J2000 coordinates; (2,3) right ascension and declination of the beam centre; (4) best-fitting period; (5) observed pulsation rate in pulses per hour; (6) best-fitting DM; and (7) observed pulse width at half maximum of the brightest single pulse. Uncertainties in the last digit are given in brackets.

PSRJ        RAJ        DecJ        P (s)      χ (h−1)   DM (pc cm−3)   w_eff (ms)
J1926–13    19:26:38   −13:13:37   4.864(2)   25        45(5)          8(2)
rotating transient nature of the source. A Fourier search was also performed, but did not
result in a significant detection, making the object a potential member of the RRAT class
of pulsars. Measured properties of the source are listed in Table 4.1. We note that this
source was found after inspecting approximately ten per cent of the mid-latitude portion
of the HTRU survey, and we therefore expect processing and inspection of the remaining
data to yield additional discoveries. Further studies such as measurements of the detection
rate for known pulsars and RRATs will form the basis of future work.
After the completion of this work we became aware that this source had also been
discovered independently by Rosen et al. (2012), whose work includes a timing solution.
Figure 4.6 Results overview plots from the pipeline for a confirmation pointing of the rotating radio transient candidate shown in Fig. 4.2. Only results from the central beam are shown. The candidate (re-)appears as the pink and purple spots at a DM of around 45 pc cm−3.
4.3.2 Giant pulses
In addition to survey observations, the transient pipeline is also able to operate during
pulsar timing sessions. One use of this ability is to detect the emission of particularly bright
individual pulses from known pulsars in real-time and trigger dumps of the corresponding
baseband data to disk for later study. Some pulsars are known to have extended tails in the
luminosity distribution of their individual pulses and emit what are known as ‘giant pulses’
that can exceed the mean pulse strength by two to three orders of magnitude (Cognard
et al., 1996; Cordes & McLaughlin, 2003). While baseband recording facilities are typically
limited in their capacity due to the extreme data rate (see Section 4.1), recordings can
be kept manageable if they are restricted only to signals of interest. By connecting the
output of our pipeline (i.e., the list of candidates from the current observation) to the
baseband recording hardware via some form of decision-maker, significant events could be
captured within strict recording constraints. The decision-maker could simply be a human
observer; however, a more robust solution would be to use a machine program to analyse
the list of candidates and decide on which to record based on, e.g., their significance, their
likelihood of being RFI and the acceptable event rate. This idea will form the basis of
future work.
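Such a decision-maker could be very simple; a hypothetical sketch of the rule suggested above (the significance threshold, field names and budget mechanism are all assumptions, not the deployed system) might be:

```python
def should_record(candidate, dumps_left, snr_cut=10.0):
    """Decide whether to trigger a baseband dump for a candidate, based on
    its significance, its RFI likelihood and the remaining recording budget."""
    return (candidate['snr'] >= snr_cut
            and not candidate['rfi_flag']
            and dumps_left > 0)
```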
As a proof of concept, our transient pipeline was run during several pulsar timing
observations with the aim of detecting bright pulses from the tail of the distribution. The
results from an observation of PSR J1022+1001, a millisecond pulsar with a period of
16.45 ms (Camilo et al., 1996; Verbiest et al., 2009), are shown in Fig. 4.7. While this
pulsar is not known to emit giant pulses (Kramer et al., 1999), it was sufficiently bright to
detect a number of individual pulses using the transient pipeline. Of the 3852 s / 0.01645 s
≈ 2.3 × 10^5 stellar rotations during the observation, the strongest detected emission was
only around eight times the mean pulse SNR (derived from the integrated timing SNR),
consistent with the lack of giant pulse emission from this pulsar.
Figure 4.7 Results overview plots from the pipeline during a timing observation of the millisecond pulsar PSR J1022+1001 showing the detection of a number of strong narrow pulses at the pulsar’s DM of 10.25 pc cm−3.
4.3.3 RFI monitoring
The ability of the pipeline to search for signals across a wide range of parameter space in
real-time allows it to provide unprecedented feedback on the radio-frequency interference
(RFI) environment during observations. Fig. 4.8 shows the overview plots from a pointing
in which several strong bursts of RFI occurred. Visible are narrow events at zero-DM
appearing in single beams (hollow green circles), isolated broad events at a variety of DMs
appearing in multiple beams (orange stars) and intermediate-width events spanning many
DMs also appearing in multiple beams (purple stars).
Such information can be used by observers to guide their observing schedule. For
example, narrow zero-DM RFI may be acceptable for certain observations, but strong
RFI spanning many DMs may render the data useless. In the latter case, the observer
could respond by moving to another target, or by attempting to identify the source of the
RFI. Results from the real-time pipeline may also be useful for long-term monitoring of
the RFI environment at the observatory.
Figure 4.8 Results overview plots from the pipeline for a pointing containing strong bursts of RFI, including narrow zero-DM signals (hollow green circles), isolated broad events appearing in multiple beams (orange stars) and medium-width events spanning many DMs and beams (purple/blue stars). Pulses from the known pulsar PSR J1046–5813 can also be seen in beam nine around its DM of 240 pc cm−3 (Newton, Manchester & Cooke, 1981).
4.3.4 Quality assurance
The real-time pipeline also serves as a means of monitoring the quality of observation data.
While existing monitoring tools such as plots of the integrated band-pass and zero-DM
time series allow the observer to identify many problems in the observing system, in some
cases issues can go unnoticed due to their subtle impact on these diagnostics. Due to
its large search space, we expect the real-time transient pipeline to provide much greater
diagnostic power.
A case in point was the identification, during initial deployment of the pipeline at
Parkes Observatory, of problems with several beams of the Parkes Multibeam Receiver.
During observing, this problem was immediately visible in the transient overview plots—
an overwhelming presence of bright events from beam six forced the beam to be manually
hidden in order to see the results from other beams. This behaviour was observed in-
termittently over a period of weeks and was also seen to shift into neighbouring beams
during this time. Further investigation suggested an origin inside the focus cabin, and
maintenance work is planned to track down the problem. The strong visibility of this
issue in the overview plots as well the ability to assess its impact on data quality made the
real-time transient pipeline a valuable addition to the set of quality assurance diagnostics
presented to observers.
4.4 Discussion
Our transient detection pipeline can comfortably operate in real-time under the current
survey observing configuration at Parkes Observatory thanks primarily to the use of GPUs.
For comparison, based on additional timing benchmarks of dedispersion and the other
stages of the pipeline, we estimate that an equivalent CPU-only system capable of execut-
ing the pipeline in real-time would require around five times as many nodes, multiplying
the total monetary, power and rack-space costs considerably. These metrics would have
put such a system well beyond the available budgets. Furthermore, the need to partition
the problem between additional nodes would have added considerable complexity to the
software implementation and networking requirements. For these reasons we consider the
use of GPUs to have been the enabling factor in the development of this system.
The ability to observe transient event detections in real-time has dramatically changed
the observing paradigm at Parkes Observatory. Immediate feedback on astronomical
sources, terrestrial interference and instrumentation issues now allows observers to assess
the contents of their data and proactively adapt their observing schedules based on what
they see. As HTRU survey observations continue, we expect the pipeline to produce the
first reactive confirmations of new RRAT and pulsar sources in the near future, bypassing
the issues and delays associated with offline processing. In addition, work is underway to
investigate (automatically-)triggered baseband recording and to add further features
to the real-time observing system, including the ability to produce plots of frequency and
SNR versus time and DM for individual candidate events. These diagnostics will allow
observers to better discriminate between astronomical and terrestrial events, and could
be linked to a manual trigger for recording baseband data. Ongoing real-time monitoring
of the effects of RFI and instrumental problems is also expected to result in improved
observing quality as such issues are progressively resolved over the long term.
The detection of significant unique events is also a high-reward possibility. One op-
tion for reacting to such detections would be to release an alert to the community such
that immediate follow-up observations could be taken at suitable observatories. The VO-
Event standard from the International Virtual Observatory Alliance is designed for such
purposes, and provides the means for an ‘author’ (e.g., a human or machine analysing
the output of our pipeline) to send structured information about a new event to a ‘pub-
lisher’, which then forwards the information to attached ‘subscribers’ according to event
filtering criteria (Williams & Seaman, 2006). Significant unique events with DM >> 0
detected using the 20 cm Parkes Multibeam Receiver may, with sufficient coordination,
subsequently be detectable at low frequency facilities such as the Murchison Widefield
Array (Tingay et al., 2012), aided substantially by the ability to ‘steer’ such telescopes in
software. For example, a burst with a DM of 100 pc cm−3 detected at Parkes at 1381 MHz
would appear at 200 MHz around 10 s later, albeit with significantly increased dispersion
smearing. If successful, the results from such an effort would provide unprecedented
multi-wavelength data on one-off short-duration events.
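The ∼10 s delay quoted above follows from the standard cold-plasma dispersion relation, Δt ≈ 4.15 × 10³ s × DM × (f₁⁻² − f₂⁻²) with frequencies in MHz. A minimal sketch of this check (the function is illustrative and not part of the pipeline):

```python
def dispersion_delay_s(dm, f_low_mhz, f_high_mhz):
    """Cold-plasma dispersion delay (seconds) of a pulse between a high
    and a low observing frequency, for a given DM in pc cm^-3."""
    k_dm = 4.148808e3  # dispersion constant, MHz^2 pc^-1 cm^3 s
    return k_dm * dm * (f_low_mhz**-2 - f_high_mhz**-2)

# A DM of 100 pc cm^-3, detected at 1381 MHz and re-observed at 200 MHz:
delay = dispersion_delay_s(100.0, 200.0, 1381.0)
print(round(delay, 1))  # about 10 s, consistent with the text
```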
While the speed of the pipeline is currently sufficient to satisfy the requirements of real-
time processing under the existing back-end configuration, possibilities exist for upgrading
the system given further improvements in performance. Increasing the time resolution
captured by the back-end is one such change that could be made if the pipeline exhibited
the necessary processing power; such an upgrade would improve the pipeline’s sensitivity
to short-duration signals. Another opportunity is the capture of polarisation information,
which is currently ignored by operating only on the total intensity data. Reducing the
gulp size is also of interest—minimising the delay between event and alert is critical to the
notion of immediate follow-up observations at external observatories.
Two obvious avenues exist to obtain the performance needed to support these upgrades.
The first is to further optimise the dedispersion process, which remains the computational
bottle-neck. Use of the tree or sub-band algorithms, or a hybrid approach, is one option
(see Chapter 3 of this work; Magro et al. 2011); alternatively, there is the possibility of
further improving the efficiency of the direct dedispersion algorithm on GPUs (Armour
et al., 2011).
The second avenue for increasing performance is simply to use faster GPUs. Since the
purchase of our computing nodes, the next generation of hardware has been announced
and is expected to provide at least two times the performance of the current devices12.
This is by far the simplest means of speeding-up the application. In addition, it would
be a relatively straightforward task to divide the workload of each receiver beam between
multiple GPUs, e.g., by partitioning across dispersion trials. With the existing hardware,
a seven-beam observing mode could provide two GPUs per beam, doubling the processing
power. A multi-GPU approach would also allow the use of GPUs containing two discrete
chips on a single board.
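Partitioning across dispersion trials is straightforward because each DM trial is independent; a minimal host-side sketch (the helper name is hypothetical, and real code would additionally bind each chunk to its own device context):

```python
def partition_dm_trials(dm_trials, n_gpus):
    """Split a list of dispersion-measure trials into contiguous,
    near-equal chunks, one per GPU."""
    base, rem = divmod(len(dm_trials), n_gpus)
    chunks, start = [], 0
    for i in range(n_gpus):
        size = base + (1 if i < rem else 0)  # spread the remainder
        chunks.append(dm_trials[start:start + size])
        start += size
    return chunks

# e.g., a seven-beam mode with two GPUs per beam and 1000 DM trials:
chunks = partition_dm_trials(list(range(1000)), 2)
print([len(c) for c in chunks])  # [500, 500]
```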
Increasing the overall sensitivity of the pipeline is highly desirable; the choice of detec-
tion threshold is, however, currently constrained by the number of false positives produced
and the rate at which candidates can be assessed by a human observer. A solution to this
problem is instead to use a machine to analyse the events; e.g., by training an artifi-
cial neural network (Eatough et al., 2010). Machine-based candidate analysis also has
the advantage of providing ceaseless attention—something that cannot be guaranteed by
a human observer. Such an approach would allow for much more robust and confident
real-time alerts and baseband recording following positive astronomical detections.
Further improvements in sensitivity could also be achieved through the use of more ad-
vanced RFI mitigation techniques. While the current approach of performing multibeam
coincidence analysis only at the end of the pipeline has the benefit of reducing imple-
mentation complexity, a more powerful approach would undoubtedly be to leverage the
discriminatory power of multibeam coincidence information during the initial filterbank
cleaning process. We plan to investigate this option in future work by allowing pipeline
instances to communicate data between each other during processing.
While the code developed in this work was targeted specifically at GPU hardware, in
the spirit of the ideas put forth in Chapter 2 the algorithms chosen for each stage of the
pipeline are entirely general, remaining suitable for virtually any parallel shared-memory
computing architecture. Thus, while the immediate significance of the work has been
demonstrated, we also believe that the algorithmic ideas it presents will be of long-term
12 http://www.nvidia.com/object/nvidia-kepler.html
value, extending effortlessly to embrace future architectures. Given the current volatility
in the landscape of computing hardware, this is a welcome thought.
Our pipeline software heimdall is freely available, currently as part of the open source
psrdada package13.
Acknowledgments
We would like to thank everyone at the Max-Planck-Institut für Radioastronomie for
hosting us during the early stages of this work, Aris Karastergiou for a useful discussion
about transient pipelines and RFI mitigation, and Andrew Jameson for his tireless efforts
integrating our pipeline into the back-end systems at Parkes Observatory and for his
subsequent help during testing.
13 http://psrdada.sourceforge.net/
5 Future Directions and Conclusions
You’ve got to think about big things while you’re doing small
things, so that all the small things go in the right direction.
—Alvin Toffler
5.1 Future directions
Chapter 2 of this thesis advocated a generalised approach to many-core hardware based
on the analysis of algorithms. While this was shown to provide significant insight into the
optimal implementation approach for a given problem, it was later found in Chapter 3
that platform-specific issues can become important during the final stages of optimisation.
Tuning of code and parameters to best take advantage of a particular architecture can
require significant effort, and can often be guided only by benchmarking and trial and error
(see Volkov & Demmel 2008 for an example of the complexities involved in tuning matrix
multiplication on GPUs). One possibility for future work is an investigation into methods
of auto-tuning. Automatic optimisation techniques allow algorithms to be developed once,
in a very generic way, and subsequently deployed efficiently to different hardware by letting
a machine perform the final, platform-specific tuning of the code. This technique, used
in the Fourier transform library FFTW1 and recently applied to the problem of matrix
multiplication on GPUs (Li, Dongarra & Tomov, 2009; Cui et al., 2010), could prove very
valuable for performance-critical astronomy applications needing to extract peak efficiency
from current and future computing architectures.
While the applications studied in this thesis cover a variety of different algorithms, they
may generally be classified as problems involving ‘dense’ data structures (e.g., densely-
sampled particle lists, pixel arrays, time series etc.). These contrast with problems in-
1 http://www.fftw.org
volving ‘sparse’ data structures, where data and computations can be irregular; common
examples are tree-based methods [e.g., the Barnes-Hut force-calculation algorithm (Barnes
& Hut, 1986)] and sparse matrix calculations [e.g., those used in solving the Poisson equa-
tion in multiple dimensions (Stone & Norman, 1992)]. The irregularity of these algorithms
can pose problems on highly-vectorised architectures like GPUs, where they often require
significantly more complex implementations than on traditional sequential processors (see,
e.g., Bedorf, Gaburov & Portegies Zwart 2012). A more detailed investigation into gener-
alised approaches to the analysis and implementation of algorithms involving sparse data
structures would be an excellent avenue for future work, with the potential outcome of
opening up new application areas to acceleration by advanced architectures.
One final direction for future work is the development of new software tools and li-
braries optimised for many-core architectures. While this thesis has presented and demon-
strated a powerful and general approach to such hardware, the adoption of GPUs (or other
accelerators) by the wider astronomy community will only come once sufficient utilities
and applications are in place. The Thrust library is a good example of how well-made
tools with a focus on algorithms can dramatically lower the entry barrier and increase
productivity when targeting complex computing architectures, even allowing code to be
effortlessly switched between different hardware. Porting or redeveloping widely-used as-
tronomy software for GPUs remains an important ongoing area of work.
5.1.1 The future evolution of GPUs
Since the arrival of true general-purpose GPU computing platforms in 2007/08, GPUs have
continued to increase dramatically in both computational power and flexibility. November
2008 marked the first appearance of a GPU-accelerated cluster in the Top500 supercom-
puter ranking2, and as of July 2012 the list contains 57 machines featuring accelerator
or co-processor cards3. To examine the future directions of this hardware, we will focus
on the recent evolution of products from one vendor chosen as being representative of
the market. Fig. 5.1 plots the peak memory bandwidth and compute4 performance over
the last five years for NVIDIA GeForce GPUs costing around USD$400 on release. An
exponential fit shows compute performance doubling every 1.6 years—architectural im-
provements actually allow faster growth than that defined by Moore’s Law. Given the
tight fit displayed by these five years of data and the unwavering success of Moore’s Law
2 http://www.nvidia.com/object/io_1226945999108.html
3 http://www.top500.org/lists/2012/06/highlights
4 Here we use the term compute performance to mean arithmetic performance, calculated as: core count × shader clock rate × 2 operations per clock cycle.
Figure 5.1 Trends in theoretical peak memory bandwidth (+) and compute performance (×) over the last five years for NVIDIA GeForce GPUs costing around USD$400 on release (GF 8800 GTS, GTX 260, GTX 470, GTX 570, GTX 670). Dashed and dotted lines show exponential fits: memory-bandwidth doubling time = 3.7 +1.3/−0.3 years; compute-performance doubling time = 1.6 +0.1/−0.1 years.
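The doubling times quoted in the caption come from exponential fits of the form P(t) ∝ 2^(t/τ), which reduce to a least-squares line fit in log space. A sketch of the procedure (the data below are illustrative, not the exact release specifications behind the figure):

```python
import math

def doubling_time_years(years, values):
    """Least-squares fit of log2(value) against time; the doubling time
    is the reciprocal of the fitted slope."""
    n = len(years)
    ly = [math.log2(v) for v in values]
    mx, my = sum(years) / n, sum(ly) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(years, ly))
             / sum((x - mx) ** 2 for x in years))
    return 1.0 / slope

# Synthetic performance series doubling every 1.6 years from 400 GFLOP/s:
years = [7, 8, 10, 11, 12]          # years since 2000, as in the figure
perf = [400 * 2 ** ((t - 7) / 1.6) for t in years]
print(round(doubling_time_years(years, perf), 2))  # recovers 1.6
```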
over the last fifty years, it is quite reasonable to expect that this trend will continue for
(at least) another five years5, at which point GPUs could be expected to provide an order
of magnitude more performance than today. However, this represents arithmetic perfor-
mance only; memory bandwidth is seen to increase on a much longer timescale, doubling
approximately every four years. If memory technology also continues at its current pace,
five years of evolution will provide only ∼2.5 times the data access speed of today.
The ratio of compute performance to memory bandwidth, which we define as the criti-
cal arithmetic intensity, is plotted in Fig. 5.2, and is seen to double around every 2.8 years.
This metric gives an indication of the number of floating-point operations required per
byte of memory access to balance the compute and bandwidth capabilities of the hard-
5 Moore's Law must ultimately come to an end, but technology roadmaps defining the near-term (up to 2018) and long-term (up to 2026) prospects for its continuation continue to drive the industry (http://www.itrs.net/Links/2011ITRS/Home2011.htm).
Figure 5.2 Trends in core count (+) and critical arithmetic intensity (×) over the last five years for NVIDIA GeForce GPUs costing around USD$400 on release (GF 8800 GTS, GTX 260, GTX 470, GTX 570, GTX 670). Dashed and dotted lines show exponential fits: core-count doubling time = 1.5 +0.2/−0.1 years; critical-arithmetic-intensity doubling time = 2.8 +0.2/−0.4 years.
ware, and provides insight into the scalability of different applications. Problems with
arithmetic intensities below the critical value of the target hardware will be bound by
memory performance, while those exceeding it will be bound by arithmetic capabilities.
The continued increase in the critical arithmetic intensity of GPU hardware threatens to
leave more and more algorithms in the bandwidth-limited regime, where they are con-
strained by the slower growth-rate of memory speed. The effect of this phenomenon on
astronomy applications will be discussed in Section 5.1.2.
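This balance point is the familiar roofline criterion. As a concrete sketch (the peak figures are approximate published specifications for a GTX 570-class device, used purely for illustration):

```python
def bottleneck(intensity_flop_per_byte, peak_gflops, peak_gbps):
    """Compare an algorithm's arithmetic intensity against the hardware's
    critical value (peak compute / peak bandwidth, in FLOP per byte)."""
    critical = peak_gflops / peak_gbps
    return ("compute-bound" if intensity_flop_per_byte > critical
            else "memory-bound")

# ~1405 GFLOP/s single precision and ~152 GB/s give a critical intensity
# of roughly 9 FLOP/B:
print(bottleneck(2.0, 1405.0, 152.0))   # memory-bound
print(bottleneck(50.0, 1405.0, 152.0))  # compute-bound
```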
Fig. 5.2 also plots the evolution in the number of cores exhibited by recent GPUs,
which is seen to double every 1.5 years (n.b., this is faster than the increase in compute
performance due to a negative trend in shader clock rate). This metric, the fastest-growing
of those discussed in this section, places a lower-bound on the amount of parallelism re-
quired to fully-utilise the GPU hardware. Applications are therefore required to exhibit
substantial and scalable division of work in order to remain efficient on future GPUs.
However, with the advent of dynamic parallelism functionality and multiple kernel execu-
tion in the Kepler generation of GPUs, this is expected to become an easier goal for many
algorithms. Furthermore, the large number of pixels/voxels/samples/particles/rays typi-
cally appearing in astronomy applications means that in many cases the relevant quantity
far exceeds the number of cores; in such cases the issue is easily resolved through the use
of a data-parallel approach to algorithm design as described in Chapter 2.
In addition to increases in theoretical performance, GPU technology has also exhibited
significant increases in flexibility over the last five years, which has allowed a wider range
of applications to achieve higher practical performance with less development effort. The
addition of two levels of automatically-managed cache space in NVIDIA’s ‘Fermi’ gener-
ation of GPUs alleviated the need to consider the precise alignment of memory accesses,
allowing a more CPU-like approach to code design (see Chapter 3). Similarly, the ability
of the next-generation ‘Kepler’6 architecture to support GPU-managed kernel launches
(also known as dynamic parallelism) will provide a much more natural and efficient means
of implementing algorithms that must adapt dynamically to their data during execution
[e.g., adaptive mesh refinement methods (Schive, Tsai & Chiueh, 2010)]. Looking forward,
we expect this trend of increasing flexibility to continue strongly into the future, provid-
ing even more freedom from the constraints of traditional GPU programming models and
opening up the raw power of the hardware to an ever-broader range of algorithms. The
release of Intel’s Xeon Phi accelerator card (expected in late 2012) may be a significant
step in this direction, with its compatibility with the same instruction set as used by most
modern CPUs (the x86 instruction set) allowing it to exploit legacy software tools7.
One final trend of interest in the GPU market is the divergence of the hardware tar-
geting the graphics/games industry and that targeting the scientific/high-performance
computing sector. NVIDIA’s Tesla series of devices, which targets the scientific com-
puting market, began as simple derivatives of the company’s game-oriented GeForce line,
differentiated only by their increased memory volume and quality guarantee. However, suc-
cessive generations have introduced additional Tesla-exclusive features, such as increased
double-precision floating-point arithmetic performance and error-correcting memory. The
next-generation Tesla K20 will further increase this divide by providing dynamic paral-
lelism and virtualisation features that will not be available in the corresponding GeForce
models8. While this divergence of features is a natural result of the different applica-
6 http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
7 http://www.intel.com/content/www/us/en/high-performance-computing/high-performance-xeon-phi-coprocessor-brief.html
8 http://www.nvidia.com/content/tesla/pdf/NV_DS_TeslaK_Family_May_2012_LR.pdf
tions being targeted by the science and graphics markets, it is a trend that we expect
to require careful management: hardware like the Tesla series owes its existence to the
well-established video gaming market, and it is not yet clear whether the scientific and
high-performance computing markets are large enough to support the development of
custom hardware on their own. For this reason, we expect manufacturers to continue
to share their architectures between the two markets until such time as the demand for
high-performance computing justifies a complete separation. The future impact of this
situation is difficult to predict, but it is possible that a point will be reached where scien-
tific computing hardware is held back by the needs of graphics applications, and progress
beyond this point will be restricted as a result. On the other hand, it is also possible that
the high-performance computing market will grow (or has in fact already grown) large
enough to support its own hardware. In this case, graphics and compute architectures can
be expected to continue to diverge as demanded by their respective application areas.
5.1.2 Prospects for astronomy applications
While the application of advanced architectures to certain computational problems in as-
tronomy has already proven very successful, most of the applications targeted to date
are well-known for exhibiting high degrees of parallelism and for being key performance
bottle-necks [the O(N²) direct gravitational N-body force calculation being the archetypal
example]. This is a result of the non-trivial nature of development for advanced architec-
tures and the restrictions imposed by hardware limitations, as well as the computational
needs of astronomy research. However, the work in this thesis has paved the way for
simplifying future adoption of such technologies, and we expect the number and variety
of accelerated applications to increase significantly in the coming years.
As seen in Section 5.1.1, a key trend in the evolution of GPU hardware is the slow
increase in memory bandwidth relative to compute performance, which poses a concern
for the ability of some algorithms to scale effectively to future hardware. The algorithm
analysis approach presented in this thesis provides useful insight into this issue. Appli-
cations whose computational complexity grows significantly faster than their input and
output data [e.g., the direct gravitational N-body problem, radio-telescope signal correla-
tion and matrix multiplication all have arithmetic intensities of O(N)] are unlikely to ever
be constrained by memory performance, and will continue to push the limits of advanced
architectures with little additional implementation effort. Problems with slowly-growing
arithmetic intensities [e.g., the fast Fourier transform and tree-based algorithms, which
have arithmetic intensities of O(log N)] will tread the line between bandwidth and compute
limitations, requiring careful optimisation to avoid being held back by memory hardware.
In the worst position are algorithms with constant arithmetic intensities (e.g., transforms
and reductions), which, in practical applications, are often used with only small constant
compute factors. In such cases, performance will remain limited by the available memory
bandwidth. Furthermore, the number of algorithms in this category can be expected to
increase as more and more sink below the (projected) rising critical arithmetic intensity
of future hardware architectures.
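The three scaling classes above can be compared directly against a hardware critical intensity. A schematic sketch (the constant factors `c` are arbitrary placeholders, not measured values):

```python
import math

def intensity(model, n, c=1.0):
    """Arithmetic intensity (FLOP/B) under the three scaling models in the
    text: O(N) (e.g., N-body), O(log N) (e.g., FFT), O(1) (transforms)."""
    return {"O(N)": c * n,
            "O(logN)": c * math.log2(n),
            "O(1)": c}[model]

def regime(model, n, critical, c=1.0):
    return ("compute-bound" if intensity(model, n, c) > critical
            else "memory-bound")

# With a critical intensity of ~10 FLOP/B, a simple transform (O(1), c~2)
# stays memory-bound at any problem size, while the N-body problem is
# compute-bound for any appreciable N:
print(regime("O(1)", 10**6, 10.0, c=2.0))  # memory-bound
print(regime("O(N)", 10**6, 10.0))         # compute-bound
```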
Memory bandwidth is not the only bottle-neck faced by algorithms with low arith-
metic intensity. PCI-Express bandwidth is typically an order of magnitude lower than
that of GPU memory, and network bandwidth can be an order of magnitude lower still.
Avoiding these bottle-necks often requires the consideration of a more coarse-grained form
of arithmetic intensity: the number of tasks that can be performed for each data transfer.
The PCI-Express or network communication cost may be significant with respect to any
single task in a pipeline, but its impact can often be reduced by performing multiple tasks
in one place (e.g., moving multiple processing steps onto the GPU rather than just one).
The ability to overlap communication and computation also helps in these situations.
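The benefit of performing multiple tasks per transfer can be seen in a toy cost model (all figures below, including the 8 GB/s PCI-Express rate, are illustrative assumptions only):

```python
def pipeline_time_s(data_gb, tasks_on_gpu, task_time_s, pcie_gbps=8.0,
                    overlap=False):
    """Toy cost model: one host<->device round trip serves `tasks_on_gpu`
    fused pipeline stages; with overlap, transfer hides behind compute."""
    transfer = 2 * data_gb / pcie_gbps        # in + out, paid once
    compute = tasks_on_gpu * task_time_s
    return max(transfer, compute) if overlap else transfer + compute

# Fusing four stages into one GPU-resident pipeline amortises the PCIe
# cost over four tasks instead of paying it four times:
fused = pipeline_time_s(1.0, 4, 0.05)        # 0.25 + 0.20 = 0.45 s
separate = 4 * pipeline_time_s(1.0, 1, 0.05) # 4 * 0.30   = 1.20 s
print(fused, separate)
```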
While this analysis paints a somewhat dark picture of the future for a large number
of algorithms in use today, to assume that this is representative of the overall prospects
for the future of computational astronomy would be very pessimistic. It is important
to note that, due to their low computational complexity, bandwidth-bound applications
only rarely form the overall bottle-neck in scientific applications: moving and transforming
data may be limited by memory performance9, but its computational cost quickly becomes
insignificant when faced with an O(N²) or even O(N log N) algorithm in the same pipeline.
One possible result of the increasing critical arithmetic intensity of GPU hardware is
that we will witness a shift in the balance of algorithm design: computationally intensive
processes will become (relatively) cheaper to employ, and will consequently be used more
liberally than they are today. Given the ability of these processes to fully exploit the
rapid growth in computing power and their position as the key performance bottle-necks
in many scientific applications, we see the overall future for computational astronomy as
being very bright.
Future research may even necessitate the use of more computationally-intensive al-
gorithms as a result of growing data rates and the increasing infeasibility of human in-
tervention during processing. Machine-learning and data-mining algorithms have already
become important tools in some areas of astronomy research, and their importance is ex-
9 Among other hardware limitations such as PCI-Express, network and disk IO bandwidth.
pected to increase significantly over the next decade (e.g., Ball et al. 2006, 2007, 2008;
Mahabal et al. 2008; Borne 2008; Richards et al. 2011). It is widely believed that the next
generation of telescopes will bring about a new era of “Big Data” astronomy research,
where traditional analysis, distribution and archiving methods will fail to cope with the
rate of data generation (Hey, Tansley & Tolle, 2009; Jones et al., 2012)10. While the
computational requirements of new surveys in the optical and infrared may be relatively
undemanding (Schlegel, 2012), future radio surveys such as those at the Square Kilo-
metre Array are expected to require the use of world-class high-performance computing
facilities in conjunction with new data-processing techniques (Cornwell, 2004; Lonsdale,
Doeleman & Oberoi, 2004; Smits et al., 2009). Many of the processes involved in these
surveys (e.g., synthesis imaging, pulsar and transient detection algorithms) exhibit high
arithmetic intensities, making them ideal for deployment on advanced architectures.
The increasing reliance on computationally-intensive algorithms in astronomy makes
the use of rapidly-progressing advanced architectures such as those discussed in this thesis
a very attractive prospect for enabling the next generation of research. However, the shift
away from traditional sequential computing models continues to pose significant challenges.
We believe that the work presented in this thesis offers a prudent path through these
obstacles and into a new decade of discovery.
5.2 Summary
Motivated by recent advances in computing hardware, this thesis began with an investi-
gation into new approaches to the problem of applying advanced massively-parallel com-
puting architectures to applications in astronomy. Chapter 2 eschewed ad-hoc approaches
in favour of a generalised methodology based on algorithm analysis. Simple analysis tech-
niques were shown to provide deep insight into the suitability of particular problems for
advanced architectures now and into the future, answering the questions of both whether
to invest in a many-core solution for a given problem and where to begin such an imple-
mentation. The application of this methodology to four well-known astronomy problems
resulted in the rapid identification of potential speed-ups from cheaply-available hardware
such as GPUs and ultimately led to the work presented in Chapter 3. Due to the gen-
eral nature of algorithm analysis, these results are expected to stand the test of time and
remain relevant for virtually all future parallel architectures.
Incoherent dedispersion is a computationally intensive problem at the heart of surveys
10 We note that the Big Data paradigm is expected to affect many areas of science and is not restricted to astronomy.
for fast radio transients. Commonly positioned as the primary performance bottle-neck
in these applications, the speed of this algorithm can place direct constraints on the
rate of scientific discovery. This fact, combined with the results of Chapter 2 showing
a strong potential for GPU-acceleration, made it a logical choice for further study and
implementation. Chapter 3 presented a detailed analysis of three different incoherent
dedispersion algorithms and described their implementations using the CUDA platform.
Building on the results of the analysis in Chapter 2, this chapter presented a more detailed
investigation of the memory access patterns exhibited by the algorithms. Also discussed
were implementation-specific details, which were found to play an important role in the
optimisation for older generations of GPU hardware, but a less-significant one for more
recent devices exhibiting automatic cache spaces. The GPU implementation of the direct
dedispersion algorithm was found to out-perform an optimised multi-core CPU code by
up to a factor of 9× using high-end hardware available at the time. The sub-band and tree
algorithms on the GPU provided further speed-ups of 3–20×, but were found to introduce
significant smearing into the output time series due to their use of approximations. The
ability of even the direct dedispersion code to execute in one third of real-time on the
GPU suggested the possibility of using this implementation as the basis for a real-time
transient detection pipeline, which led to the work presented in Chapter 4.
Looking toward GPU-driven scientific outcomes, Chapter 4 described the development
of a complete real-time fast-radio-transient detection pipeline capable of exploiting the
power of advanced many-core computing architectures. Performance and radio-frequency
interference (RFI) mitigation were noted as the key issues to be solved, and the GPU-
implementation of the direct dedispersion algorithm from Chapter 3 was used as the basis
for the solution of the former. The algorithms comprising the remaining stages of the
pipeline were chosen by taking into consideration the need for both robust statistical meth-
ods and efficient data-parallel implementations, with the use of foundation algorithms and
algorithm-composition techniques introduced in Chapter 2 proving crucial to the work.
The additional processing performance also allowed the use of high-resolution matched fil-
tering, providing increased sensitivity. Implementations for GPU hardware were expedited
through use of the Thrust library of algorithms, which also allowed trivial retargeting of
the codebase for multi-core CPUs. RFI mitigation algorithms employed in the pipeline
included pre-processing of filterbank data to remove both narrow- and broad-band sig-
nals likely to be of terrestrial origin, and spatial discrimination of candidates based on
coincidence information from independent receiver beams.
The pipeline was demonstrated using both archival data from the High Time Resolu-
tion Universe survey and real-time observations at Parkes Observatory. Using NVIDIA
Tesla C2050 GPUs, execution time was found to remain comfortably below real-time (e.g.,
8 s of data processed in ∼4 s) for the vast majority of pointings. The exceptions, corre-
sponding to periods of extreme RFI, resulted in the triggering of a ‘bail’ condition and
returned incomplete results for the given time-segment. The system was deployed as part
of the Berkeley Parkes Swinburne Recorder back-end connected to the 20 cm Multibeam
Receiver on the 64 m Parkes radio telescope, with a web-based interface providing control
of the pipeline and visualisation of results to observers. Early results demonstrated sev-
eral powerful abilities, including live detection of individual pulses from known pulsars and
RRATs, detailed real-time monitoring of the RFI environment, and continuous quality-
assurance of recorded data. The increased sensitivity and ability to rapidly re-process
archival data also resulted in the serendipitous discovery of a new RRAT candidate in a
2009 pointing from the High Time Resolution Universe survey, which was subsequently
confirmed using the same pipeline in real-time at Parkes. A number of future projects
are now being planned for the system, including long-term RFI monitoring, base-band
capture of giant pulses and triggered inter-observatory observations of significant unique
events. The ability to detect dispersed transient events in real-time is expected to be
critical to next-generation facilities such as the Square Kilometre Array, and it is likely
that the feasibility of such endeavours will depend on the ability to exploit advanced,
massively-parallel computing architectures, making this work particularly timely.
In conclusion, this thesis has shown how a generalised approach to exploiting the power
and scalability offered by advanced computing architectures can provide paradigm-shifting
accelerations to computationally-limited astronomy problems today while also promising
to carry these same problems effortlessly through the foreseeable future of developments in
hardware technology.
Bibliography
Aarseth S. J., 1963, M.N.R.A.S., 126, 223
Abdo A. A. et al., 2010, ApJ. Supp., 187, 460
Agarwal P. K., Krishnan S., Mustafa N. H., Venkatasubramanian S., 2003, in Proc. 11th European Sympos. Algorithms, Lect. Notes Comput. Sci., Springer-Verlag, pp. 544–555
Ait-Allal D., Weber R., Dumez-Viou C., Cognard I., Theureau G., 2012, Comptes Rendus
Physique, 13, 80
Alpar M. A., Cheng A. F., Ruderman M. A., Shaham J., 1982, Nature, 300, 728
Amdahl G. M., 1967, in AFIPS ’67: Proceedings of the American Federation of Informa-
tion Processing Societies Conference, pp. 483–485
Anderson J. A., Lorenz C. D., Travesset A., 2008, Journal of Computational Physics, 227,
5342
Angulo R. E., Springel V., White S. D. M., Jenkins A., Baugh C. M., Frenk C. S., 2012,
Scaling relations for galaxy clusters in the Millennium-XXL simulation, arXiv:1203.3216
[astro-ph.CO]
Armour W. et al., 2011, A GPU-based survey for millisecond radio transients using
ARTEMIS, arXiv:1111.6399 [astro-ph.IM]
Asanovic K. et al., 2006, The landscape of parallel computing research: A view from Berkeley. Tech. Rep. UCB/EECS-2006-183, EECS Department, University of California, Berkeley, available at: http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
Asanovic K. et al., 2009, Communications of the ACM, 52, 56
Bagchi M., Cortes Nieves A., McLaughlin M., 2012, A search for dispersed radio bursts in
archival Parkes Multibeam Pulsar Survey data, arXiv:1207.2992 [astro-ph.HE]
Ball N. M., Brunner R. J., Myers A. D., Strand N. E., Alberts S. L., Tcheng D., 2008,
ApJ, 683, 12
Ball N. M., Brunner R. J., Myers A. D., Strand N. E., Alberts S. L., Tcheng D., Llora X.,
2007, ApJ, 663, 774
Ball N. M., Brunner R. J., Myers A. D., Tcheng D., 2006, ApJ, 650, 497
Barnes J., Hut P., 1986, Nature, 324, 446
Barr E., 2011, in American Institute of Physics Conference Series, Vol. 1357, American
Institute of Physics Conference Series, Burgay M., D’Amico N., Esposito P., Pellizzoni
A., Possenti A., eds., pp. 52–53
Barsdell B. R., Bailes M., Barnes D. G., Fluke C. J., 2012, M.N.R.A.S., 2599
Barsdell B. R., Barnes D. G., Fluke C. J., 2010, M.N.R.A.S., 408, 1936
Bate N. F., Fluke C. J., Barsdell B. R., Garsden H., Lewis G. F., 2010, New Astronomy,
15, 726
Baumgardt H., Hut P., Makino J., McMillan S., Portegies Zwart S., 2003, ApJL, 582, L21
Bedorf J., Gaburov E., Portegies Zwart S., 2012, Journal of Computational Physics, 231,
2825
Bedorf J., Portegies Zwart S., 2012, A pilgrimage to gravity on GPUs, arXiv:1204.3106
[astro-ph.IM]
Belleman R. G., Bedorf J., Portegies Zwart S. F., 2008, New Astronomy, 13, 103
Belletti F. et al., 2007, QCD on the Cell Broadband Engine, arXiv:0710.2442 [hep-lat]
Bhat N. D. R., Cordes J. M., Chatterjee S., Lazio T. J. W., 2005, Radio Science, 40, 5
Bhattacharya D., van den Heuvel E. P. J., 1991, Physics Reports, 203, 1
Blelloch G. E., 1996, Commun. ACM, 39, 85
Bohn C.-A., 1998, in Proceedings of Int. Conf. on Computational Intelligence and Neurosciences, pp. 64–67
Bolz J., Farmer I., Grinspun E., Schrooder P., 2003, ACM Trans. Graph., 22, 917
Borne K. D., 2008, Astronomische Nachrichten, 329, 255
Boyles J. et al., 2012, The Green Bank Telescope 350 MHz Drift-scan Survey I: Survey
Observations and the Discovery of 13 Pulsars, arXiv:1209.4293
Briggs D. S., 1995, PhD thesis, New Mexico Institute of Mining and Technology
Briggs F. H., Kocz J., 2005, Radio Science, 40, 5
Brunner R. J., Kindratenko V. V., Myers A. D., 2007, in NSTC ’07: Proceedings of the
NASA Science Technology Conference
Buck I., Foley T., Horn D., Sugerman J., Fatahalian K., Houston M., Hanrahan P., 2004,
ACM TRANSACTIONS ON GRAPHICS, 23, 777
Burke-Spolaor S., Bailes M., 2010, M.N.R.A.S., 402, 855
Burke-Spolaor S., Bailes M., Ekers R., Macquart J.-P., Crawford, III F., 2011, ApJ, 727,
18
Burke-Spolaor S. et al., 2011, M.N.R.A.S., 416, 2465
Burns W. R., Clark B. G., 1969, A&A, 2, 280
Camilo F., Nice D. J., Shrauner J. A., Taylor J. H., 1996, ApJ, 469, 819
Campana-Olivo R., Manian V., 2011, in Society of Photo-Optical Instrumentation En-
gineers (SPIE) Conference Series, Vol. 8048, Society of Photo-Optical Instrumentation
Engineers (SPIE) Conference Series
Cecilia J. M., Garcia J. M., Ujaldon M., Nisbet A., Amos M., 2011, in Proceedings of the
2011 IEEE International Symposium on Parallel and Distributed Processing Workshops
and PhD Forum, IPDPSW ’11, IEEE Computer Society, Washington, DC, USA, pp.
339–346
Che S., Boyer M., Meng J., Tarjan D., Sheaffer J., Skadron K., 2008, Journal of Parallel
and Distributed Computing, 68, 1370
Clark B. G., 1980, A&A, 89, 377
Clark M. A., La Plante P. C., Greenhill L. J., 2011, Accelerating Radio Astronomy Cross-
Correlation with Graphics Processing Units, arXiv:1107.4264 [astro-ph.IM]
Cognard I., Shrauner J. A., Taylor J. H., Thorsett S. E., 1996, ApJL, 457, L81
Cohen J. M., Molemaker J., 2009, in 21st International Conference on Parallel Computational Fluid Dynamics (ParCFD2009)
Colegate T. M., Clarke N., 2011, Pub. Astron. Soc. Australia, 28, 299
Cordes J. M., Kramer M., Lazio T. J. W., Stappers B. W., Backer D. C., Johnston S.,
2004, New Astronomy Reviews, 48, 1413
Cordes J. M., McLaughlin M. A., 2003, ApJ, 596, 1142
Cornwell T. J., 2004, Experimental Astronomy, 17, 329
Cui X., Chen Y., Zhang C., Mei H., 2010, in Proceedings of the 2010 IEEE 16th Interna-
tional Conference on Parallel and Distributed Systems, ICPADS ’10, IEEE Computer
Society, Washington, DC, USA, pp. 237–242
Cytowski M., Remiszewski M., Soszyski I., 2010, in Lecture Notes in Computer Science,
Vol. 6067, Parallel Processing and Applied Mathematics, Wyrzykowski R., Dongarra J.,
Karczewski K., Wasniewski J., eds., Springer Berlin Heidelberg, pp. 507–516
de Greef M., Crezee J., van Eijk J. C., Pool R., Bel A., 2009, Medical Physics, 36, 4095
Deneva J. S. et al., 2009, ApJ, 703, 2259
Dewdney P. E., Hall P. J., Schilizzi R. T., Lazio T. J. L. W., 2009, IEEE Proceedings, 97,
1482
Diewald U., Preußer T., Rumpf M., Strzodka R., 2001, Acta Mathematica Universitatis
Comenianae (AMUC), LXX, 15
Dodson R., Harris C., Pal S., Wayth R., 2010, in ISKAF2010 Science Meeting
Eatough R. P., Molkenthin N., Kramer M., Noutsos A., Keith M. J., Stappers B. W.,
Lyne A. G., 2010, M.N.R.A.S., 407, 2443
Ebisuzaki T., Makino J., Fukushige T., Taiji M., Sugimoto D., Ito T., Okumura S. K.,
1993, Proc. Astron. Soc. Japan, 45, 269
Eichenberger A. E. et al., 2005, in Proceedings of the 14th International Conference on
Parallel Architectures and Compilation Techniques, PACT ’05, IEEE Computer Society,
Washington, DC, USA, pp. 161–172
Elsen E., Vishal V., Houston M., Pande V., Hanrahan P., Darve E., 2007, N-Body Simu-
lations on GPUs, arXiv:0706.3060
Floer L., Winkel B., Kerp J., 2010, in RFI Mitigation Workshop
Fluke C. J., Barnes D. G., Barsdell B. R., Hassan A. H., 2011, Pub. Astron. Soc. Australia,
28, 15
Ford E. B., 2009, New Astronomy, 14, 406
Foster R. S., Backer D. C., 1990, ApJ, 361, 300
Fournier A., Fussell D., 1988, ACM Trans. Graph., 7, 103
Fridman P. A., Baan W. A., 2001, A&A, 378, 327
Fukushige T., Makino J., Kawai A., 2005, Proc. Astron. Soc. Japan, 57, 1009
Gaburov E., Harfst S., Portegies Zwart S., 2009, New Astronomy, 14, 630
Gaensler B. M., Madsen G. J., Chatterjee S., Mao S. A., 2008, Pub. Astron. Soc. Australia,
25, 184
Garcia V., Debreuve E., Barlaud M., 2008, in Computer Vision and Pattern Recognition Workshops, 2008. CVPRW ’08. IEEE Computer Society Conference on, pp. 1–6
Gold T., 1968, Nature, 218, 731
Gonnet P., 2010, in American Institute of Physics Conference Series, Vol. 1281, American
Institute of Physics Conference Series, Simos T. E., Psihoyios G., Tsitouras C., eds.,
pp. 1305–1308
Goodnight N., Woolley C., Lewin G., Luebke D., Humphreys G., 2003, in Proceedings of
the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, HWWS
’03, Eurographics Association, Aire-la-Ville, Switzerland, Switzerland, pp. 102–111
Gorbunov S., Kebschull U., Kisel I., Lindenstruth V., Muller W. F. J., 2008, Computer
Physics Communications, 178, 374
Govett M. W., Middlecoff J., Henderson T. B., Rosinski J., Madden P., 2011, AGU Fall
Meeting Abstracts, A2
Hamada T., Iitaka T., 2007, The Chamomile Scheme: An Optimized Algorithm for N-body
simulations on Programmable Graphics Processing Units, arXiv:astro-ph/0703100
Hamada T. et al., 2009, Computer Science - Research and Development, 24, 21
Hankins T. H., Rickett B. J., 1975, in Methods in Computational Physics. Volume 14 -
Radio astronomy, Alder B., Fernbach S., Rotenberg M., eds., Vol. 14, pp. 55–129
Harris C., Haines K., 2011, Pub. Astron. Soc. Australia, 28, 317
Harris C., Haines K., Staveley-Smith L., 2008, Experimental Astronomy, 22, 129
Harris M., 2005, GPU Gems 2 - Mapping Computational Concepts to GPUs, Pharr M.,
ed., Addison-Wesley Professional, pp. 493–508
Harris M., 2007, Optimizing parallel reduction in CUDA. Tech. rep., available at: http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf
Harris M. J., Coombe G., Scheuermann T., Lastra A., 2002, in Proceedings of the ACM
SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, HWWS ’02, Euro-
graphics Association, Aire-la-Ville, Switzerland, Switzerland, pp. 109–118
Hassan A. H., Fluke C. J., Barnes D. G., 2012, A Distributed GPU-based Framework for
real-time 3D Volume Rendering of Large Astronomical Data Cubes, arXiv:1205.0282
[astro-ph.IM]
Heuveline V., Weiß J.-P., 2009, European Physical Journal Special Topics, 171, 31
Hewish A., Bell S. J., Pilkington J. D. H., Scott P. F., Collins R. A., 1968, Nature, 217, 709
Hey T., Tansley S., Tolle K., eds., 2009, The Fourth Paradigm, Microsoft Research
Heymann F., Siebenmorgen R., 2012, ApJ, 751, 27
Hobbs G., Lyne A. G., Kramer M., Martin C. E., Jordan C., 2004, M.N.R.A.S., 353, 1311
Hoberock J., Bell N., 2010, Thrust: A parallel template library, version 1.6.0, available at: http://www.meganewtons.com/
Hoff, III K. E., Keyser J., Lin M., Manocha D., Culver T., 1999, in Proceedings of the
26th annual conference on Computer graphics and interactive techniques, SIGGRAPH
’99, ACM Press/Addison-Wesley Publishing Co., New York, NY, USA, pp. 277–286
Hogbom J. A., 1974, A&AS, 15, 417
Hogden J., Vander Wiel S., Bower G. C., Michalak S., Siemion A., Werthimer D., 2012,
ApJ, 747, 141
Hopf M., Ertl T., 1999, in Proceedings of the conference on Visualization ’99: celebrating
ten years, VIS ’99, IEEE Computer Society Press, Los Alamitos, CA, USA, pp. 471–474
Horvath Z., Liebmann M., 2010, in American Institute of Physics Conference Series, Vol.
1281, American Institute of Physics Conference Series, Simos T. E., Psihoyios G., Tsi-
touras C., eds., pp. 1789–1792
Hupca I. O., Falcou J., Grigori L., Stompor R., 2012, in Proceedings of the 2011 interna-
tional conference on Parallel Processing, Euro-Par’11, Springer-Verlag, Berlin, Heidel-
berg, pp. 355–366
Jalali B., Baumgardt H., Kissler-Patig M., Gebhardt K., Noyola E., Lutzgendorf N., de
Zeeuw P. T., 2012, A&A, 538, A19
Jia X., Gu X., Sempau J., Choi D., Majumdar A., Jiang S. B., 2010, Physics in Medicine
and Biology, 55, 3077
Johnston H. M., Kulkarni S. R., 1991, ApJ, 368, 504
Jones D. L. et al., 2012, in IAU Symposium, Vol. 285, IAU Symposium, Griffin R. E. M.,
Hanisch R. J., Seaman R., eds., pp. 340–341
Jonsson P., Primack J. R., 2010, New Astronomy, 15, 509
Kawai A., Fukushige T., Makino J., Taiji M., 2000, Proc. Astron. Soc. Japan, 52, 659
Kayser R., Refsdal S., Stabell R., 1986, A&A, 166, 36
Keane E. F., Kramer M., Lyne A. G., Stappers B. W., McLaughlin M. A., 2011,
M.N.R.A.S., 838
Keane E. F., Ludovici D. A., Eatough R. P., Kramer M., Lyne A. G., McLaughlin M. A.,
Stappers B. W., 2010, M.N.R.A.S., 401, 1057
Keith M. J. et al., 2010, M.N.R.A.S., 409, 619
Kesteven M., Hobbs G., Clement R., Dawson B., Manchester R., Uppal T., 2005, Radio
Science, 40, 5
Khanna G., 2010, International Journal of Modeling, Simulation and Scientific Computing,
01, 147
Kim J., Park C., Rossi G., Lee S. M., Gott, III J. R., 2011, Journal of Korean Astronomical
Society, 44, 217
Klessen R. S., Kroupa P., 1998, ApJ, 498, 143
Knuth D. E., 1998, The art of computer programming, 2nd edn., Vol. 3. Addison-Wesley
Longman Publishing Co., Boston, MA, USA
Kocz J., Bailes M., Barnes D., Burke-Spolaor S., Levin L., 2012, M.N.R.A.S., 420, 271
Kramer M. et al., 1999, ApJ, 520, 324
Krishnan S., Mustafa N. H., Venkatasubramanian S., 2002, in Proceedings of the thirteenth
annual ACM-SIAM symposium on Discrete algorithms, SODA ’02, Society for Industrial
and Applied Mathematics, Philadelphia, PA, USA, pp. 558–567
Langston G., Rumberg B., Brandt P., 2007, in Bulletin of the American Astronomical
Society, Vol. 39, American Astronomical Society Meeting Abstracts, p. 745
Larsen E. S., McAllister D., 2001, in Proceedings of the 2001 ACM/IEEE conference
on Supercomputing (CDROM), Supercomputing ’01, ACM, New York, NY, USA, pp.
55–55
Lattimer J. M., Prakash M., 2004, Science, 304, 536
Lengyel J., Reichert M., Donald B. R., Greenberg D. P., 1990, SIGGRAPH Comput.
Graph., 24, 327
Levoy M., 1990, ACM Trans. Graph., 9, 245
Li Y., Dongarra J., Tomov S., 2009, in Proceedings of the 9th International Conference
on Computational Science: Part I, ICCS ’09, Springer-Verlag, Berlin, Heidelberg, pp.
884–892
Lindholm E., Kilgard M. J., Moreton H., 2001, in Proceedings of the 28th annual con-
ference on Computer graphics and interactive techniques, SIGGRAPH ’01, ACM, New
York, NY, USA, pp. 149–158
Lonsdale C. J., Doeleman S. S., Oberoi D., 2004, Experimental Astronomy, 17, 345
Lorimer D. R., Bailes M., McLaughlin M. A., Narkevic D. J., Crawford F., 2007, Science,
318, 777
Lorimer D. R. et al., 2006, M.N.R.A.S., 372, 777
Lu L., Paulovicks B., Sheinin V., Perrone M., 2010, in Society of Photo-Optical Instru-
mentation Engineers (SPIE) Conference Series, Vol. 7744, Society of Photo-Optical
Instrumentation Engineers (SPIE) Conference Series
Lyne A. G. et al., 2004, Science, 303, 1153
Lyne A. G. et al., 1998, M.N.R.A.S., 295, 743
Macquart J.-P., 2011, ApJ, 734, 20
Macquart J.-P. et al., 2010, Pub. Astron. Soc. Australia, 27, 272
Magro A., Karastergiou A., Salvini S., Mort B., Dulwich F., Zarb Adami K., 2011,
M.N.R.A.S., 417, 2642
Mahabal A. et al., 2008, in American Institute of Physics Conference Series, Vol. 1082,
American Institute of Physics Conference Series, Bailer-Jones C. A. L., ed., pp. 287–293
Makino J., 1991, Proc. Astron. Soc. Japan, 43, 859
Makino J., 1996, ApJ, 471, 796
Makino J., Fukushige T., Koga M., Namura K., 2003, Proc. Astron. Soc. Japan, 55, 1163
Makino J., Funato Y., 1993, Proc. Astron. Soc. Japan, 45, 279
Manchester R. et al., 2001, M.N.R.A.S., 328, 17
Manchester R. et al., 1996, M.N.R.A.S., 279, 1235
Manchester R. N., Hobbs G. B., Teoh A., Hobbs M., 2005, AJ, 129, 1993
Mark W. R., Glanville R. S., Akeley K., Kilgard M. J., 2003, ACM Trans. Graph., 22, 896
Masuda N., Ito T., Tanaka T., Shiraki A., Sugie T., 2006, Optics Express, 14, 603
Matsakis D. N., Taylor J. H., Eubanks T. M., 1997, A&A, 326, 924
McConnell S. M., 2010, Journal of Physics Conference Series, 256, 012013
McLaughlin M. A. et al., 2006, Nature, 439, 817
Men C., Gu X., Choi D., Majumdar A., Zheng Z., Mueller K., Jiang S. B., 2009, Physics
in Medicine and Biology, 54, 6565
Mereghetti S., 2008, A&A Rev., 15, 225
Merz H., Pen U.-L., Trac H., 2005, New Astronomy, 10, 393
Michalakes J., Vachharajani M., 2008, in Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, pp. 1–7
Mielikainen J., Huang B., Huang A., 2011, AGU Fall Meeting Abstracts, B6
Mignani R. P., 2011, Advances in Space Research, 47, 1281
Molnar F., Szakaly T., Meszaros R., Lagzi I., 2010, Computer Physics Communications,
181, 105
Monmasson E., Cirstea M., 2007, Industrial Electronics, IEEE Transactions on, 54, 1824
Moore G. E., 1965, Electronics, 38, 4
Moreland K., Angel E., 2003, in Proceedings of the ACM SIGGRAPH/EUROGRAPHICS
conference on Graphics hardware, HWWS ’03, Eurographics Association, Aire-la-Ville,
Switzerland, Switzerland, pp. 112–119
Mudryk L. R., Murray N. W., 2009, New Astronomy, 14, 71
Mustafa N., Koutsofios E., Krishnan S., Venkatasubramanian S., 2001, in Proceedings of
the seventeenth annual symposium on Computational geometry, SCG ’01, ACM, New
York, NY, USA, pp. 50–59
Nakasato N., Ogiya G., Miki Y., Mori M., Nomoto K., 2012, Astrophysical Particle Sim-
ulations on Heterogeneous CPU-GPU Systems, arXiv:1206.1199 [astro-ph.IM]
Neal J., Fewtrell T., Trigg M., Bates P., 2009, in EGU General Assembly Conference Ab-
stracts, Vol. 11, EGU General Assembly Conference Abstracts, Arabelos D. N., Tsch-
erning C. C., eds., p. 1464
Newton L. M., Manchester R. N., Cooke D. J., 1981, M.N.R.A.S., 194, 841
Nitadori K., Makino J., 2008, New Astronomy, 13, 498
NVIDIA Corporation, 2012, NVIDIA’s next generation CUDA compute architecture: Kepler GK110. Tech. rep., available at: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
Nyland L., Harris M., Prins J. F., 2007, GPU Gems 3 - Fast N-Body Simulation with
CUDA, Nguyen H., ed., Addison-Wesley, pp. 677–695
Nyland L., Prins J. F., Harris M., 2004, The Rapid Evaluation of Potential Fields in
N-Body Problems Using Programmable Graphics Hardware (Poster)
Ord S., Greenhill L., Wayth R., Mitchell D., Dale K., Pfister H., Edgar R. G., 2009, GPUs
for data processing in the MWA, arXiv:0902.0915
Owens J. D., Luebke D., Govindaraju N., Harris M., Kruger J., Lefohn A. E., Purcell T.,
2005, in Eurographics 2005, State of the Art Reports, pp. 21–51
Pacini F., 1968, Nature, 219, 145
Portegies Zwart S. F., Belleman R. G., Geldof P. M., 2007, New Astronomy, 12, 641
Preis T., Virnau P., Paul W., Schneider J. J., 2009, New Journal of Physics, 11, 093024
Proudfoot K., Mark W. R., Tzvetkov S., Hanrahan P., 2001, in Proceedings of the 28th
annual conference on Computer graphics and interactive techniques, SIGGRAPH ’01,
ACM, New York, NY, USA, pp. 159–170
Ransom S. M., 2001, PhD thesis, Harvard University
Richards J. W. et al., 2011, ApJ, 733, 10
Rosa F. L., Marichal-Hernandez J. G., Rodriguez-Ramos J. M., 2004, in Society of Photo-
Optical Instrumentation Engineers (SPIE) Conference Series, Vol. 5572, Society of
Photo-Optical Instrumentation Engineers (SPIE) Conference Series, Gonglewski J. D.,
Stein K., eds., pp. 262–272
Rosen R. et al., 2012, The Pulsar Search Collaboratory: Discovery and Timing of Five
New Pulsars, arXiv:1209.4108
Rumpf M., Strzodka R., 2001, in Proceedings of EG/IEEE TCVG Symposium on Visual-
ization (VisSym ’01), pp. 75–84
Sainio J., 2012, Journal of Cosmology and Astroparticle Physics, 4, 38
Sane N., Ford J., Harris A. I., Bhattacharyya S. S., 2012, Radio Science, 47, 3005
Schaaf K., Overeem R., 2004, Experimental Astronomy, 17, 287
Scherl H., Koerner M., Hofmann H., Eckert W., Kowarschik M., Hornegger J., 2007,
in Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, Vol.
6510, Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series
Schive H., Tsai Y., Chiueh T., 2010, ApJ. Supp., 186, 457
Schiwietz T., Chang T.-c., Speier P., Westermann R., 2006, in Society of Photo-Optical In-
strumentation Engineers (SPIE) Conference Series, Vol. 6142, Society of Photo-Optical
Instrumentation Engineers (SPIE) Conference Series, Flynn M. J., Hsieh J., eds., pp.
1279–1290
Schlegel D., 2012, LSST is Not “Big Data”, arXiv:1203.0591
Schneider P., Weiss A., 1986, A&A, 164, 237
Schneider P., Weiss A., 1987, A&A, 171, 49
Serylak M., Karastergiou A., Williams C., Armour W., LOFAR Pulsar Working Group,
2012, Observations of transients and pulsars with LOFAR international stations,
arXiv:1207.0354 [astro-ph.IM]
Shara M. M., Hurley J. R., 2002, ApJ, 571, 830
Siemion A. P. V. et al., 2012, ApJ, 744, 109
Smits R., Kramer M., Stappers B., Lorimer D. R., Cordes J., Faulkner A., 2009, A&A,
493, 1161
Spitler L. G., Cordes J. M., Chatterjee S., Stone J., 2012, ApJ, 748, 73
Springel V., Yoshida N., White S. D. M., 2001, New Astronomy, 6, 79
Stappers B. W. et al., 2011, A&A, 530, A80
Staveley-Smith L. et al., 1996, Pub. Astron. Soc. Australia, 13, 243
Steinmetz M., 1996, M.N.R.A.S., 278, 1005
Stone J. M., Norman M. L., 1992, ApJ. Supp., 80, 753
Sun C., Agrawal D., El Abbadi A., 2003, in Proceedings of the 2003 ACM SIGMOD
international conference on Management of data, SIGMOD ’03, ACM, New York, NY,
USA, pp. 455–466
Sunarso A., Tsuji T., Chono S., 2010, Journal of Computational Physics, 229, 5486
Tanabe N., Ichihashi Y., Nakayama H., Masuda N., Ito T., 2009, Computer Physics
Communications, 180, 1870
Taylor J. H., 1974, A&AS, 15, 367
Thacker R. J., Couchman H. M. P., 2006, Computer Physics Communications, 174, 540
Thakar A. R., 2008, Computing in Science and Engineering, 10, 9
Thompson A. C., Fluke C. J., Barnes D. G., Barsdell B. R., 2010, New Astronomy, 15, 16
Tingay S. J. et al., 2012, The Murchison Widefield Array: the Square Kilometre Array
Precursor at low radio frequencies, arXiv:1206.6945 [astro-ph.IM]
Tomczak T., Zadarnowska K., Koza Z., Matyka M., Mirosław A., 2012, Complete PISO and SIMPLE solvers on Graphics Processing Units, arXiv:1207.1571 [cs.DC]
Tomov S., McGuigan M., Bennett R., Smith G., Spiletic J., 2005, Computers & Graphics,
29, 71
Trendall C., Stewart A. J., 2000, in Eurographics Workshop on Rendering, Springer, pp. 287–298
van Meel J., Arnold A., Frenkel D., Portegies Zwart S., Belleman R., 2008, Molecular Simulation, 34, 259–266
van Nieuwpoort R. V., Romein J. W., 2009, in ICS ’09: Proceedings of the 23rd interna-
tional conference on Supercomputing, ACM, New York, NY, USA, pp. 440–449
van Straten W., Bailes M., 2011, Pub. Astron. Soc. Australia, 28, 1
Varbanescu A., Amesfoort A., Cornwell T., Mattingly A., Elmegreen B., Nieuwpoort R.,
Diepen G., Sips H., 2008, in Lecture Notes in Computer Science, Vol. 5168, Euro-
Par 2008 Parallel Processing, Luque E., Margalef T., Bentez D., eds., Springer Berlin
Heidelberg, pp. 749–762
Verbiest J. P. W. et al., 2009, M.N.R.A.S., 400, 951
Vladimirov A., 2012, Arithmetics on Intel’s Sandy Bridge and Westmere CPUs: not all FLOPs are created equal. Tech. rep., available at: http://research.colfaxinternational.com/post/2012/04/30/FLOPS.aspx
Volkov V., Demmel J. W., 2008, in Proceedings of the 2008 ACM/IEEE conference on
Supercomputing, SC ’08, IEEE Press, Piscataway, NJ, USA, pp. 31:1–31:11
von Hoerner S., 1960, Zeitschrift fur Astrophysik, 50, 184
Wambsganss J., 1990, PhD thesis, Fakultät für Physik, Ludwig-Maximilians-Universität, Munich
Wambsganss J., 1999, Journal of Computational and Applied Mathematics, 109, 353
Wang P., Abel T., Kaehler R., 2010, New Astronomy, 15, 581
Wayth R. B., Greenhill L. J., Briggs F. H., 2009, Pub. Astron. Soc. Pacific, 121, 857
Williams R. D., Seaman R., 2006, in Astronomical Society of the Pacific Conference Series,
Vol. 351, Astronomical Data Analysis Software and Systems XV, Gabriel C., Arviset
C., Ponz D., Enrique S., eds., p. 637
York D. G. et al., 2000, AJ, 120, 1579
Zhang S., Royer D., Yau S.-T., 2006, Optics Express, 14, 9120
Appendix A: Chapter 3 Appendix
A.1 Error analysis for the tree dedispersion algorithm
Here we derive an expression for the maximum error introduced by the use of the piecewise
linear tree dedispersion algorithm.
The deviation of a function f(x) from a linear approximation between x = x0 and
x = x1 is bounded by
\[
\varepsilon_f \le \frac{1}{8}\,(x_1 - x_0)^2 \max_{x_0 \le x \le x_1} \left| \frac{d^2}{dx^2} f(x) \right|, \tag{A.1}
\]
which shows that the error is proportional to the square of the step size and the second
derivative of the function. For the dedispersion problem, the second derivative of the delay
function with respect to frequency is given by
\[
\frac{\partial^2}{\partial\nu^2}\,\Delta t(d,\nu) = \mathrm{DM}(d)\,\frac{d^2}{d\nu^2}\Delta T(\nu) \tag{A.2}
\]
\[
= 6\,\mathrm{DM}(d)\,\frac{k_{\mathrm{DM}}\,\Delta\nu^2}{\nu_0^4}\left(1 + \frac{\Delta\nu}{\nu_0}\,\nu\right)^{-4}, \tag{A.3}
\]
which has greater magnitude at lower frequencies. Evaluating at the lowest frequency in
the band, ν = Nν , and substituting into equation (A.1) along with the sub-band size N ′ν ,
one finds the error to be bounded by:
\[
t_{\mathrm{tree}} \equiv \varepsilon_{\Delta t} \le \frac{3}{4}\,\mathrm{DM}\,\frac{k_{\mathrm{DM}}}{\nu_0^2}\left(\frac{N'_\nu}{N_\nu}\right)^{2} \frac{\lambda^2}{(1+\lambda)^4}, \tag{A.4}
\]
where λ ≡ (∆ν/ν0)Nν is a proxy for the fractional bandwidth, a measure of the width of the antenna band.
If the smearing as a result of using the direct algorithm is quantified as the effective
width, W , of an observed pulse, then the piecewise linear tree algorithm is expected to
produce a signal with an effective width of
\[
W_{\mathrm{tree}} = \sqrt{W^2 + t_{\mathrm{tree}}^2}, \tag{A.5}
\]
giving a relative smearing of
\[
\mu_{\mathrm{tree}} \equiv \frac{W_{\mathrm{tree}}}{W} = \frac{\sqrt{W^2 + t_{\mathrm{tree}}^2}}{W}. \tag{A.6}
\]
In contrast to the use of a piecewise linear approximation, the use of a change of frequency
coordinates (‘frequency padding’) to linearise the dispersion trails results in no additional
sources of smear.
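As a concrete illustration of these bounds, the sketch below assumes the cold-plasma delay ∆T(ν) = kDM[(ν0 + ν∆ν)⁻² − ν0⁻²] implied by the derivation (delay per unit DM relative to the reference frequency ν0); every parameter value is an assumption chosen for the example, not a figure from the thesis. It verifies (A.3) against a central finite difference and then evaluates (A.4) and (A.6).

```python
# Illustrative evaluation of the tree-algorithm smearing bounds. The delay
# model delta_T and every parameter value here are assumptions made for the
# sake of example, not figures taken from the thesis.

k_DM = 4.148808e3   # dispersion constant [MHz^2 pc^-1 cm^3 s]
nu0 = 1400.0        # reference (highest) frequency [MHz]
dnu = -0.39         # channel width [MHz]; negative: index grows toward lower freq
N_nu = 1024         # channels in the band
Np_nu = 32          # channels per sub-band (N'_nu)
DM = 100.0          # trial dispersion measure [pc cm^-3]

def delta_T(nu):
    """Assumed delay per unit DM at channel index nu, relative to nu0 [s]."""
    return k_DM * ((nu0 + nu * dnu) ** -2 - nu0 ** -2)

# Check (A.3) against a central finite difference at the lowest frequency.
fd = DM * (delta_T(N_nu + 1) - 2 * delta_T(N_nu) + delta_T(N_nu - 1))
lam = dnu * N_nu / nu0
analytic = 6 * DM * k_DM * dnu**2 / nu0**4 * (1 + lam) ** -4
assert abs(fd - analytic) / abs(analytic) < 1e-3

# Smearing bound (A.4) and relative smearing (A.6) for an assumed pulse width.
t_tree = 0.75 * DM * k_DM / nu0**2 * (Np_nu / N_nu) ** 2 * lam**2 / (1 + lam) ** 4
W = 64e-6                                    # effective pulse width [s]
mu_tree = (W**2 + t_tree**2) ** 0.5 / W
print(f"t_tree = {t_tree * 1e6:.1f} us, mu_tree = {mu_tree:.3f}")
```

With these hypothetical numbers the bound comes out at a few tens of microseconds, comparable to the assumed pulse width, which illustrates why the sub-band size N′ν must be chosen with the target pulse widths in mind.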
A.2 Error analysis for the sub-band dedispersion algorithm
Here we derive an expression for the maximum error introduced by the use of the sub-band
dedispersion algorithm.
The smearing introduced into a dedispersed time series due to an approximation to
the dispersion curve is bounded by the maximum temporal deviation of the approximation
from the exact curve. The maximum change in delay across a sub-band is ∆t(DM, Nν) − ∆t(DM, Nν − N′ν); the difference in this value between two nominal DMs then gives the smearing time:
\[
t_{\mathrm{SB}} \le \Delta\mathrm{DM}_{\mathrm{nom}}\left[\Delta T(N_\nu) - \Delta T(N_\nu - N'_\nu)\right] \tag{A.7}
\]
\[
= N'_{\mathrm{DM}}\,\Delta\mathrm{DM}\,\frac{k_{\mathrm{DM}}}{\nu_0^2}\left[-2\,\frac{N'_\nu}{N_\nu}\,\frac{\lambda}{(1+\lambda)^3} + O\!\left(\left(\frac{N'_\nu}{N_\nu}\right)^{2}\right)\right], \tag{A.8}
\]
where the second form is obtained through Taylor expansion in powers of N′ν/Nν around zero.
Note that this derivation assumes the dispersion curve is approximated by aligning the
‘early’ edge of each sub-band. An alternative approach is to centre the sub-bands on the
curve, which reduces the smearing by ∼ 2× but adds complexity to the implementation.
As with the tree algorithm, we can define the relative smearing of the sub-band algo-
rithm with respect to the direct algorithm as
\[
\mu_{\mathrm{SB}} \equiv \frac{W_{\mathrm{SB}}}{W} = \frac{\sqrt{W^2 + t_{\mathrm{SB}}^2}}{W}, \tag{A.9}
\]
where, as before, W is the effective width of an observed pulse after direct dedispersion.
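The accuracy of the leading-order form (A.8) can be checked against the exact delay difference in (A.7). As above, the delay model and all parameter values below are illustrative assumptions rather than figures from the thesis.

```python
# Comparison of the exact sub-band delay error (A.7) with the leading-order
# term of (A.8). Delay model and parameter values are illustrative assumptions.

k_DM = 4.148808e3   # dispersion constant [MHz^2 pc^-1 cm^3 s]
nu0 = 1400.0        # reference (highest) frequency [MHz]
dnu = -0.39         # channel width [MHz]; negative: index grows toward lower freq
N_nu = 1024         # channels in the band
Np_nu = 32          # channels per sub-band (N'_nu)
Np_DM = 8           # trial DMs sharing one nominal DM (N'_DM)
dDM = 0.05          # DM step [pc cm^-3]

def delta_T(nu):
    """Assumed delay per unit DM at channel index nu, relative to nu0 [s]."""
    return k_DM * ((nu0 + nu * dnu) ** -2 - nu0 ** -2)

# Exact form of (A.7): nominal-DM spacing times delay change across a sub-band.
exact = Np_DM * dDM * (delta_T(N_nu) - delta_T(N_nu - Np_nu))

# Leading-order term of (A.8); the neglected terms are O((N'_nu/N_nu)^2).
lam = dnu * N_nu / nu0
leading = (Np_DM * dDM * k_DM / nu0**2
           * -2 * (Np_nu / N_nu) * lam / (1 + lam) ** 3)

rel_err = abs(exact - leading) / abs(exact)
print(f"exact = {exact * 1e6:.1f} us, leading order = {leading * 1e6:.1f} us, "
      f"relative difference = {rel_err:.3f}")
```

For these numbers the expansion agrees with the exact difference to about two per cent, consistent with the stated O((N′ν/Nν)²) truncation at N′ν/Nν = 1/32.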