Submitted by: A. Karastergiou (UOXF), 2011-03-30
Approved by: W. Turner, Signal Processing Domain Specialist (SPDO), 2011-04-01
SKA NON-IMAGING PROCESSING CONCEPT
DESCRIPTION: GPU PROCESSING FOR REAL‐TIME
ISOLATED RADIO PULSE DETECTION
Document number .................................................................. WP2‐040.130.010‐TD‐001
Revision ........................................................................................................................... 1
Author .................................................................................................... Aris Karastergiou
Date ................................................................................................................ 2011‐04‐01
Status ............................................................................................... Approved for release
DOCUMENT HISTORY
Revision  Date of Issue  Engineering Change Number  Comments
A         -              -                          First draft release for internal review
DOCUMENT SOFTWARE
Package         Version       Filename
Wordprocessor   MS Word 2003  03j1‐wp2‐040 130 010‐td‐001‐1‐nonimaging‐concept‐description‐2003
Block diagrams  -             -
Other           -             -
ORGANISATION DETAILS
Name: SKA Program Development Office
Physical/Postal Address: Jodrell Bank Centre for Astrophysics, Alan Turing Building, The University of Manchester, Oxford Road, Manchester, UK, M13 9PL
Fax: +44 (0)161 275 4049
Website: www.skatelescope.org
TABLE OF CONTENTS
1 INTRODUCTION
1.1 Purpose of the document
2 BACKGROUND
3 THE PROTOTYPE
3.1 Hardware description
3.1.1 NVIDIA Tesla S1070
3.1.2 NVIDIA Fermi M2050 and GeForce GTX
3.2 Software description
3.2.1 GPU modules
3.2.2 Beyond the standard dedispersion algorithms
4 TESTING IN A REAL ENVIRONMENT
5 SUMMARY OF COSTS AND THE FUTURE
LIST OF FIGURES
Figure 1. Schematic diagram of the LOFAR ILS GPU backend. Components in orange make up the backend. Details of the ILS on the right.
Figure 2. A very bright, dispersed IRP from pulsar B0329+54 at 150 MHz from the UK ILS.
Figure 3. The result of the dedispersion module. Intensity, proportional to the radii of the circles, is plotted versus time (x) and DM (y). Pulses from the real pulsar B1508+55 are detected at a DM of ~20. RFI is also seen at DM 0. Events of S/N > 5 are shown. This illustrates a real-time detection of IRPs.
LIST OF TABLES
None.
LIST OF ABBREVIATIONS
AA .................................. Aperture Array
Ant. ................................ Antenna
CoDR ............................. Conceptual Design Review
DM ................................ Dispersion measure
DRM .............................. Design Reference Mission
EoR ............................... Epoch of Reionisation
EX .................................. Example
FLOPS ........................... Floating Point Operations per second
FoV ................................ Field of View
GPU ............................... general purpose Graphics Processing Unit
ILS ................................. International LOFAR station
IRP ................................ Isolated Radio Pulse
Ny .................................. Nyquist
Ov .................................. Over sampling
PAF ............................... Phased Array Feed
PrepSKA........................ Preparatory Phase for the SKA
RFI ................................. Radio Frequency Interference
rms ................................ root mean square
SKA ............................... Square Kilometre Array
SKADS .......................... SKA Design Studies
SPDO ............................ SKA Program Development Office
SSFoM .......................... Survey Speed Figure of Merit
TBD ............................... To be decided
wrt ................................. with respect to
1 Introduction
1.1 Purpose of the document
The purposes of this document are as follows:
1. Provide a description of a working prototype of a GPU-based backend for real-time, rapid-response time-domain radio astronomy and searches for Isolated Radio Pulses (IRPs).
2. Describe the design of a real scientific experiment with the prototype, to demonstrate its advantages and shortcomings.
3. Consider a pathway of application from current pathfinders to SKA-I and SKA-II.
2 Background
Pulsar science is one of the two areas on which SKA-I will concentrate, and the instrument is being optimised for successful, extensive searching and timing campaigns. In recent years, the discovery of Rotating Radio Transients (McLaughlin et al. 2006) and intermittent pulsars (Kramer et al. 2006) has highlighted the fact that some of the most interesting radio pulsars are not regular emitters of radio emission, which has an impact on the design of search strategies. All short-duration pulses of emission suffer dispersion and scattering due to propagation through the interstellar medium (ISM). For sporadic emitters, however, a periodicity search is not appropriate and other techniques need to be applied. In addition, there are specific advantages to detecting such "isolated radio pulses" (IRPs) in as close to real time as possible, so that immediate triggered follow-ups can extract maximal information about the nature of the emitters. Apart from opening up the opportunity to detect extremely intermittent pulsars, these techniques will also open a window of discovery on all short-duration radio bursts of astrophysical origin. Real-time processing imposes particular requirements, which need to be satisfied in a pragmatic design.

Between 2009 and 2011, we have put together a working prototype of a real-time processing backend for blind searches for IRPs. We have used international LOFAR stations (ILSs) as test-beds for this backend. This choice was made on scientific grounds, as ILSs score highly on the combination of raw sensitivity, high time resolution and large field of view, allowing some optimism for successful early surveys. ILSs also provide a real-world working environment for testing hardware and software.

The backend described here exploits the high performance of general purpose graphics processing units (GPUs). We have combined this many-core architecture with multi-threaded code in order to achieve the operation counts necessary for the real-time processing of our particular application in the ILS environment. Multiple tests of the backend had been conducted by early 2011, and a test survey is planned for later this year to demonstrate its advantages and shortcomings during a real scientific experiment. ILSs represent a current implementation of aperture array technology; conclusions extracted from these experiments and tests can therefore be reasonably well projected onto a path towards the SKA.
3 The prototype
3.1 Hardware description
The GPU backend we have developed for ILSs is matched to the datastreams coming out of the
LOFAR hardware. These consist of beamformed, raw, complex, 16‐bit data. The details of the LOFAR
datastreams are as follows: the analogue streams are sampled at 200 or 160 Msamples/s, and
channelized to 512 channels using a polyphase filter. The resulting raw complex subbands are 195.3
or 156.2 kHz wide, with a time resolution of 5.12 or 6.4 μs, depending on the value of the clock. A
total of 244 subbands are beamformed into anything up to 244 beams. The 244 beams, which
correspond to 3.2 Gbps, are separated into 4 streams of UDP packets, each of ~800 Mbps. This
bandwidth can be carried by 1 GbE (Gigabit Ethernet) links, which feature throughout our backend.
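As a cross-check of the numbers quoted above, the following minimal C++ sketch reproduces the subband widths, time resolutions and aggregate data rate from the two clock rates. The 16-bit complex, dual-polarisation sample format is our reading of the 3.2 Gbps figure rather than an explicitly stated parameter, and the computed payload excludes UDP packet overheads.

```cpp
// Sketch: cross-checking the ILS datastream arithmetic quoted above. The
// 16-bit complex, dual-polarisation sample format is our assumption (payload
// only; UDP overheads are excluded).
#include <cstdio>

int main() {
    const double clks[2] = {200e6, 160e6};   // ADC sample rates (Hz)
    const int nChannels = 512;               // polyphase filterbank channels
    const int nSubbands = 244;               // subband-beams kept after beamforming
    const double bitsPerSample = 16 * 2 * 2; // 16 bit x complex x 2 polarisations

    for (double clk : clks) {
        double subbandWidth = (clk / 2.0) / nChannels;          // Hz (Nyquist band / channels)
        double tRes = 1.0 / subbandWidth;                       // s per complex sample
        double rate = nSubbands * subbandWidth * bitsPerSample; // bit/s payload
        printf("clock %.0f MHz: subband %.1f kHz, t_res %.2f us, payload %.2f Gbps\n",
               clk / 1e6, subbandWidth / 1e3, tRes * 1e6, rate / 1e9);
    }
    return 0;
}
```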
Figure 1 shows a schematic diagram of the hardware. In light blue are all the LOFAR components and
in orange are the components of our backend. In particular, the processing units are dual-socket servers with six-core Intel Xeon X5650 CPUs (2.66 GHz clock, 32 nm lithography and 95 W power rating).
WP2‐040.130.010‐TD‐001
Revision : 1
2011‐04‐01 Page 8 of 14
Figure 1. Schematic diagram of the LOFAR ILS GPU backend. Components in orange make up the backend.
Details of the ILS on the right.
The GPU units are used for real-time dispersion searches for IRPs. We have built and tested a machine based on NVIDIA Tesla S1070 blades, and we are in the process of building and testing a machine based on NVIDIA Fermi M2050 cards. We are also planning to test NVIDIA GeForce GTX cards, which are not server grade but offer very high processing power for the money. Currently, a single unit of the CPU-GPU backend can effectively search for ms dispersed transients at Δν/ν~0.1, given a 1 GbE stream of data, in real time, at a cost of about 9 kEuros for machines with server-grade GPUs and 4 kEuros for non-server grade. In the LOFAR case, the 16-bit complex samples of an 800 Mbps data stream translate to 12 MHz of sky bandwidth at 150 MHz. This backend can process several thousand dispersion measures (DMs) in real time, depending on the dedispersion technique.
3.1.1 NVIDIA Tesla S1070
The first thing to note is that production of these cards has now ceased. However, they still offer a very competitive platform for GPU computing on a 24/7 operational basis. According to NVIDIA, these
blades nominally offer up to four teraflops of computing performance in a 1U configuration. Each
blade is made up of 4 NVIDIA Tesla C1060 cards, with 240 cores and 4GB of on‐board memory per
card. The PCIe connection between the S1070 blades and the CPU servers has a nominal bandwidth
of up to 6.4 GB/s, which by far exceeds the LOFAR data rate being processed. The on‐board memory
is necessary for storing an array of filterbank data (intensity as function of frequency and time) in
order to process dispersed events which spread across time delays of many tens of seconds at
LOFAR frequencies (see section on dedispersion algorithms). Since searching for IRPs involves
dedispersion at many unknown DMs, and dedispersion at one DM is independent of the results of dedispersion at other DMs, this process is entirely parallelisable and benefits from multi-threaded
architectures. The S1070s run ~1.3 GHz clocks and use approximately 700W of power when running.
The idle power for an S1070 does not drop below 200W. Cooling requirements are therefore non‐
negligible. For a description of the C1060 cards upon which the S1070 blades are based, visit:
http://www.nvidia.com/docs/IO/43395/BD‐04111‐001_v05.pdf
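To illustrate why several GB of on-board memory is needed, the sketch below estimates the dispersion sweep across the band and the resulting filterbank buffer size. Only the cold-plasma dispersion constant is standard; the band edges, maximum trial DM, channel count and 32-bit intensity format are illustrative assumptions.

```cpp
// Sketch: why a few GB of on-board memory matters. The dispersion sweep across
// the band follows dt = 4.148808e3 s * DM * (f_lo^-2 - f_hi^-2), f in MHz, DM
// in pc cm^-3 (standard constant). Band, DM and channelisation are illustrative.
#include <cstdio>

double sweepSeconds(double dm, double fLoMHz, double fHiMHz) {
    return 4.148808e3 * dm * (1.0 / (fLoMHz * fLoMHz) - 1.0 / (fHiMHz * fHiMHz));
}

int main() {
    const double fLo = 142.0, fHi = 154.0;  // 12 MHz ILS band around 150 MHz
    const double dmMax = 1000.0;            // illustrative upper trial DM
    const int nChan = 2000;                 // 12 MHz / ~6 kHz fine channels
    const double tSamp = 1.0 / 6000.0;      // s, set by the ~6 kHz channel width

    double sweep = sweepSeconds(dmMax, fLo, fHi);          // ~31 s at DM 1000
    double nSamp = sweep / tSamp;                          // samples to hold per channel
    double bytes = (double)nChan * nSamp * sizeof(float);  // 32-bit intensities
    printf("DM %.0f sweep: %.1f s -> filterbank buffer %.2f GB (4 GB per C1060)\n",
           dmMax, sweep, bytes / 1e9);
    return 0;
}
```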
3.1.2 NVIDIA Fermi M2050 and GeForce GTX
The current generation of supercomputing cards from NVIDIA is called Fermi, and these cards feature more cores and much faster double precision computing than the C1060 cards. In particular, the M2050 has 448 cores and 3 GB of on-board memory, which, although less than that of the C1060 cards, is still sufficient to process several tens of seconds of ILS beamformed data. The speed-up in double precision will make no significant difference to this application. However, the substantial increases in L1 and L2 cache memory are likely to provide substantial improvements over the C1060s. Early tests indicate at least a factor of 2 improvement in performance, which roughly cancels out the current difference in price.

On the other side of the pricing spectrum, NVIDIA's gaming cards have been steadily improving in clock speed, number of cores, on-board memory and reliability. There are implementations of the GeForce GTX 580 with 3 GB of on-board memory and 512 cores, at a fraction (~25%) of the cost of the M2050. We will be testing such cards as much as possible over the coming months. Currently, the GTX 580 nominally requires about 250 W of power during operation (closer to 350 W at full load) and about 150 W when idle. These values are similar to those of the M2050s, which share the same chips with the GTX 470 cards. It should be noted that, despite this, NVIDIA restricts double precision performance on the GeForce range.
3.2 Software description
The software that runs on the backend consists of modular pipelines, based on the PELICAN framework
developed under PrepSKA. Care has been taken to ensure that each module of the pipeline can
operate at better than real‐time rates. The modules have been developed with the specific aim of
delivering the appropriate data to the GPU module for the dispersion search for IRPs. The framework
is responsible for inter‐modular communications, via TCP, and mechanisms for accessing and
processing the content of the data‐blocks that are passed through. It also contains the appropriate
mechanisms for reading in the UDP datastreams via a software server, which passes down TCP
streams of any size to a flexible number of clients. The clients are responsible for buffering,
processing and writing out to file at the end. The framework and modules are written in C++.
Currently, our pipeline contains the following modules (a schematic sketch of the module chain follows below):
- UDP data reader and TCP server
- Buffer and datablock generator
- Polyphase channeliser – 2^N channels per subband
- Stokes generator – conversion from complex data to power
- RFI clipper – removal of narrowband interference spikes from spectra
- Integrator – addition of 2^N time bins
- Dedisperser and dispersion search – GPU module, including a second buffer
- File writer – binary data output of the chosen stream
The CPU modules are necessary to bring the data into the right form for the GPU processing.
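The sketch below illustrates the module-chain structure just described. It deliberately does not reproduce the PELICAN API; the class names and the run() interface are invented for illustration, and each module body is reduced to a comment.

```cpp
// Schematic of the module chain described above -- deliberately NOT the
// PELICAN API; class names and the run() interface are invented to illustrate
// how independent modules hand data blocks down the pipeline.
#include <memory>
#include <vector>

struct DataBlock { std::vector<float> samples; /* + timestamps, channel map, ... */ };

struct Module {
    virtual ~Module() = default;
    virtual void run(DataBlock& block) = 0;  // each stage must keep up with real time
};

struct PolyphaseChanneliser : Module { void run(DataBlock&) override { /* 2^N channels per subband */ } };
struct StokesGenerator : Module { void run(DataBlock&) override { /* complex data -> power */ } };
struct RfiClipper : Module { void run(DataBlock&) override { /* clip narrowband spikes */ } };
struct Integrator : Module { void run(DataBlock&) override { /* sum 2^N time bins */ } };
struct GpuDedisperser : Module { void run(DataBlock&) override { /* buffer + GPU DM search */ } };

int main() {
    std::vector<std::unique_ptr<Module>> pipeline;
    pipeline.emplace_back(new PolyphaseChanneliser);
    pipeline.emplace_back(new StokesGenerator);
    pipeline.emplace_back(new RfiClipper);
    pipeline.emplace_back(new Integrator);
    pipeline.emplace_back(new GpuDedisperser);

    DataBlock block;                         // filled by the UDP reader / TCP client
    for (auto& m : pipeline) m->run(block);  // a file writer would consume the result
    return 0;
}
```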
3.2.1 GPU modules
The necessity to develop GPU modules for IRP detection comes from the processing requirements
for detecting an IRP of unknown dispersion measure, such as an irregular pulse from a new Rotating
Radio Transient or a giant pulse from an as yet undiscovered pulsar.
Figure 2. A very bright, dispersed IRP from pulsar B0329+54 at 150 MHz from the UK ILS.
An example dispersed IRP observed with the CPU‐GPU backend at an ILS is shown in Figure 2. The
data have been integrated in time by a factor of 64 from the original 81.92μs to 5.24ms. The
principle of detecting a dispersed IRP of unknown DM relies on finding the appropriate curve in time-frequency space along which to integrate over the given bandwidth, in order to maximise the signal over the instrumental noise. The cold plasma dispersion law, which describes well the dispersion seen in radio pulses, states that the delay in time of arrival is proportional to the frequency to the power -2. The proportionality constant, or dispersion measure, is directly related to the number of free electrons along the line of sight of the observation.
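As a worked example of this law, the sketch below tabulates the delay curve along which a pulse like that of Figure 2 drifts. It uses the standard dispersion constant and B0329+54's catalogue DM of ~26.8 pc cm^-3; the 12 MHz band edges are illustrative assumptions.

```cpp
// Sketch: the t ~ f^-2 delay curve along which a pulse like that of Figure 2
// drifts. Uses the standard dispersion constant (4.148808e3 s MHz^2 pc^-1 cm^3)
// and B0329+54's catalogue DM of ~26.8; the 142-154 MHz band is illustrative.
#include <cstdio>

int main() {
    const double dm = 26.8;
    const double fTop = 154.0;  // MHz, top of an assumed 12 MHz band
    for (double f = 154.0; f >= 142.0; f -= 2.0) {
        double delay = 4.148808e3 * dm * (1.0 / (f * f) - 1.0 / (fTop * fTop)); // s
        printf("f = %5.1f MHz: delay = %6.1f ms\n", f, delay * 1e3);
    }
    return 0;
}
```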
Figure 2 shows a pulse that is bright and visible within individual frequency channels. Finding such a pulse is not difficult, and several techniques can be applied: once a single high-S/N point has been
found, the next point can be found by sampling the neighbouring points in time and frequency for
more significant points, until the description of the curve can be built. However, the weakest IRPs
that an instrument can detect will be well below the noise level in individual bins, and only
integration in frequency will reveal them as significant. The problem then becomes to find the path
of integration that will maximise the S/N. In the specific case of the cold plasma dispersion law,
there is a single degree of freedom, the DM.
The first approach to solving this problem is to apply a brute-force technique. This means transforming the incoming data from intensity as a function of frequency and time, I(f,t), to intensity as a function of DM and time, I(DM,t), by applying the appropriate time delay per frequency channel for each DM to be searched within a given range, and integrating over the frequency dimension. This is approximately an N² algorithm in the number of frequency channels. For typical values of 512 frequency channels and 81.92 μs sampling time, the transformation alone requires sustained processing of ~30 GFLOPS to process 5000 DMs. This is the limit of the capabilities of the hardware tested here. The GPU code (CUDA kernel) we have written is a direct translation of the standard CPU dedispersion kernels. Although this provides a substantial improvement over the available CPU codes on similarly priced hardware, the kernel leaves a large margin for optimisation. It takes advantage of the multiple cores on the GPU, but does not yet exploit operations for which the GPU hardware has been optimised, such as 3D matrix rotation.
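For illustration, a minimal CUDA kernel in the spirit of such a direct translation is sketched below. It is not the project's actual kernel; the names and the channel-major data layout are assumptions. One thread produces one (trial DM, output time) sample with nChan additions, which is what dominates the ~30 GFLOPS estimate above (512 channels x 5000 DMs / 81.92 μs ≈ 3 x 10^10 operations per second).

```cpp
// Sketch of a direct ("brute force") dedispersion kernel, in the spirit of a
// straight CPU-to-CUDA translation -- not the project's actual kernel; names
// and the channel-major layout are assumptions.
__global__ void dedisperse(const float* __restrict__ in,         // [nChan][nSamp] intensities
                           float* __restrict__ out,              // [nDm][nOut] dedispersed series
                           const float* __restrict__ unitDelay,  // [nChan] delay (s) at DM = 1
                           const float* __restrict__ dms,        // [nDm] trial DM values
                           int nChan, int nSamp, int nOut, float tSamp)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;  // output time sample
    int d = blockIdx.y;                             // trial DM index
    if (t >= nOut) return;

    float sum = 0.0f;
    for (int c = 0; c < nChan; ++c) {
        int shift = __float2int_rn(dms[d] * unitDelay[c] / tSamp); // delay in samples
        sum += in[c * nSamp + t + shift];  // caller sizes nOut so t + shift < nSamp
    }
    out[d * nOut + t] = sum;
}
// launch: dedisperse<<<dim3((nOut + 255) / 256, nDm), 256>>>(...);
```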
In order to reduce the computational demand, there exist at least two known algorithms commonly applied in CPU dedispersion, namely the Taylor tree algorithm and subband dedispersion (from the presto package by S. Ransom). The tree algorithm avoids redundant sums and effectively reduces the computational load to N log N; however, it can only be applied where the relative bandwidth is sufficiently narrow that the dispersion delay can be approximated by a linear function. In the case of subband dedispersion, the principle is to split the total bandwidth into subbands and perform a coarse dispersion search within each subband. Each coarse DM trial reduces the frequency channels in each subband to one; then, to achieve a fine DM gridding, a second stage of dedispersion is performed at a larger number of DMs, with the number of channels involved at the second stage equal to the number of subbands chosen. It can be shown that the approximation in the algorithm does not significantly affect the result, and the gain in computational effort can be significant, depending on the number of frequency channels.
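The saving can be made concrete with a rough operation count, as in the sketch below. The 32-subband split and 100 coarse DM trials are illustrative choices rather than tuned values; the point is only that the two-stage sum grows much more slowly than the direct nChan x nDm product.

```cpp
// Sketch: rough per-output-sample operation counts for direct vs. subband
// dedispersion. The subband split and coarse DM count are illustrative.
#include <cstdio>

int main() {
    const long nChan = 512, nDm = 5000;  // as in the ~30 GFLOPS estimate above
    const long nSub = 32;                // subbands (stage-2 "channels")
    const long nCoarse = 100;            // coarse DM trials, one per group of fine DMs

    long direct  = nChan * nDm;          // every channel summed for every trial DM
    long subband = nChan * nCoarse       // stage 1: coarse search within subbands
                 + nSub * nDm;           // stage 2: fine DMs over nSub collapsed channels
    printf("direct: %ld adds/sample, subband: %ld adds/sample (%.1fx fewer)\n",
           direct, subband, (double)direct / (double)subband);
    return 0;
}
```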
Figure 3. The result of the dedispersion module. Intensity, proportional to the radii of the circles, is plotted versus time (x) and DM (y). Pulses from the real pulsar B1508+55 are detected at a DM of ~20. RFI is also seen at DM 0. Events of S/N > 5 are shown. This illustrates a real-time detection of IRPs.
It is beyond the scope of this document to provide details of the known dedispersion algorithms, which can be found elsewhere (e.g. the Lorimer & Kramer pulsar handbook). It must be said, however, that all the above algorithms are examples of incoherent dedispersion, i.e. dedispersion applied to the total power as a function of time and frequency. The technique that recovers the signal closest to the original, coherent dedispersion, involves convolution of the incoming complex data (voltages with phase information) with a chirp function that represents the inverse of the effect of interstellar dispersion. Coherent dedispersion is known to work well on GPUs from the work of I. Cognard (Nançay) and P. Demorest (NRAO). However, the computational requirements of the convolution make it more suitable for dedispersion at a single, known DM rather than for dispersion searches over large ranges of DM (several thousand trials). In this respect, and for the above algorithms, our tests indicate a relative speedup between GPUs and CPUs of typically two orders of magnitude for equally priced hardware.
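For reference, a sketch of the chirp construction is given below, following the transfer-function form in the Lorimer & Kramer handbook. The sign of the phase (using H or its conjugate) depends on the baseband convention of the data, so this should be treated as illustrative rather than drop-in code.

```cpp
// Sketch: frequency-domain chirp for coherent dedispersion, following the
// transfer-function form in the Lorimer & Kramer handbook. The sign of the
// phase depends on the baseband convention, so this is illustrative only.
#include <cmath>
#include <complex>
#include <vector>

// Chirp across one channel centred on f0 (MHz) with bandwidth bw (MHz).
std::vector<std::complex<float>> makeChirp(double dm, double f0, double bw, int n) {
    const double PI = 3.141592653589793;
    const double D = 4.148808e3 * dm;  // s MHz^2: dispersion constant x DM (pc cm^-3)
    std::vector<std::complex<float>> h(n);
    for (int i = 0; i < n; ++i) {
        double f = -bw / 2.0 + bw * i / n;                         // offset from centre, MHz
        double cycles = 1.0e6 * D * f * f / (f0 * f0 * (f0 + f));  // s*MHz -> cycles of phase
        double phi = 2.0 * PI * cycles;
        h[i] = std::complex<float>((float)std::cos(phi), (float)std::sin(phi));
    }
    return h;  // multiply the FFT of the raw voltages by h (or conj(h)), then inverse FFT
}
```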
3.2.2 Beyond the standard dedispersion algorithms
One way of reducing the cost of GPU dispersion searches is to optimise the algorithms to run on effectively less hardware. We are working on this approach in two directions. The first has to do with linearisation of the problem, which makes it suitable for algorithms such as tree dedispersion. We are designing a new dedispersion kernel that takes advantage of the GPU's capability for fast matrix rotation. The effectiveness of GPUs increases dramatically when neighbouring threads access neighbouring memory addresses, minimising the number of memory transactions. The index-shifting algorithms used for dedispersion today do not take advantage of this, and we are focusing our efforts on improving that (a sketch of the underlying memory access pattern is given below).
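The memory-access pattern in question is the standard shared-memory "corner turn" (matrix transpose). The sketch below is the textbook CUDA idiom, not our kernel: a tile is staged through shared memory so that both the global read and the global write are coalesced, with one element of padding to avoid shared-memory bank conflicts.

```cpp
// Sketch: the standard CUDA shared-memory "corner turn" (transpose) that makes
// both global reads and writes coalesced. Textbook idiom, not our kernel.
// 16x16 tiles keep the block within the C1060's 512-thread limit.
#define TILE 16

__global__ void cornerTurn(const float* __restrict__ in,  // height x width matrix
                           float* __restrict__ out,       // width x height result
                           int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];  // +1 column of padding avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;    // swap block coordinates for the write
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
// launch: cornerTurn<<<dim3((width+15)/16, (height+15)/16), dim3(16,16)>>>(in, out, width, height);
```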
The second path is through developments in adaptive sampling techniques in information theory.
We are working on a feedback mechanism, which will decide on the next sample in the frequency
versus time domain based on the dedispersed intensity of the previous measurement(s). In an
environment where the noise is well characterised, such algorithms can focus “attention” on
interesting areas of the data, directing most computational power to the relevant areas. We are
developing such an algorithm within the restrictions and special attributes of GPU environments, to
be tested on our ILS backend.
4 Testing in a real environment
With the sampling rate offered by the ILS, we have identified the useful parameter space to sample for DM searches of bright IRPs. The DM range to be searched relates to the observing frequency. At low frequencies, from tens to a few hundred MHz, the maximum DM at which an IRP can be expected is set by the total scattering effect of the same electrons, which reduces the peak intensity of the pulse to below detectable levels. At higher frequencies, the DM at which a search is relevant is more closely related to the DM distribution from models of the Galactic electron density distribution. Typical values (in DM units) for a search are a maximum DM of 100 for LOFAR frequencies and potentially several thousand for high radio frequencies. The channelisation required for incoherent dedispersion and the DM step are also frequency dependent, with lower frequencies requiring narrower channels and finer DM steps than high-frequency searches. At 150 MHz, a typical search for 1 ms wide IRPs requires ~6 kHz frequency channels and a DM step of 0.05 in DM units. The chosen frequency resolution directly translates into a time resolution of ~160 μs, which is sufficient to measure the rise and fall of a millisecond IRP. This is another route to setting the maximum DM, as higher DM values require finer channelisation, which increases the time bin duration beyond what is useful for measuring millisecond IRPs.
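The numbers in the preceding paragraph can be reproduced with the standard smearing formulae, as in the sketch below. The band edges and the "residual sweep below the pulse width" criterion for the DM step are our assumptions; criteria differing by factors of a few will move the step between roughly 0.03 and 0.05.

```cpp
// Sketch: reproducing the search parameters quoted above for the 150 MHz case.
// Intra-channel smearing: dt = 8.3 us * DM * (dnu/MHz) / (nu/GHz)^3 (standard
// approximation). Band edges and the DM-step criterion are illustrative.
#include <cstdio>

int main() {
    const double nuGHz = 0.150;            // observing frequency
    const double chanMHz = 0.006;          // ~6 kHz channels
    const double dmMax = 100.0;            // maximum trial DM at LOFAR frequencies
    const double widthS = 1e-3;            // target pulse width (1 ms IRPs)
    const double fLo = 144.0, fHi = 156.0; // MHz, illustrative 12 MHz band edges

    double smear = 8.3e-6 * dmMax * chanMHz / (nuGHz * nuGHz * nuGHz);          // s
    double tRes = 1.0 / (chanMHz * 1e6);                                        // s
    double sweepPerDm = 4.148808e3 * (1.0 / (fLo * fLo) - 1.0 / (fHi * fHi));   // s per DM unit
    double dmStep = widthS / sweepPerDm;

    printf("channel smearing at DM %.0f: %.2f ms\n", dmMax, smear * 1e3);
    printf("time resolution: %.0f us\n", tRes * 1e6);
    printf("DM step keeping residual sweep < 1 ms: %.3f\n", dmStep);
    return 0;
}
```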
The backend at the LOFAR-UK station is operational and can process several thousand DMs in real time, depending on the choice of GPU dedispersion kernel. Figure 3 shows a few seconds of output as an example of the resulting data from the dedispersion module. The x and y axes are time and DM, and the diameter of the circles is proportional to the intensity. Circles are plotted whenever the intensity exceeds the noise level by 5 times the rms. These data were taken during an observation of a bright pulsar (B1508+55), and IRPs from it can be seen as persistent events at a DM just under 20. The conclusion is that, at an initial cost of ~5 kEuros per beam of 6 MHz bandwidth (Δν/ν = 4% at 150 MHz), plus 1 kW of power, a dispersion search can be conducted today using an ILS.
Initial positive testing has motivated a longer, real survey for IRPs using ILSs, which we expect to complete within the PrepSKA programme. This will provide a clearer understanding of the interesting parameter space to be searched, and long-term measurements of the sustained processing power we can expect from the GPU hardware. It will also expose other possible bottlenecks, including the smooth and continuous operation of the network distributing the data, PCIe bandwidth for the I/O between CPU and GPU, and memory bandwidth within the GPUs.
5 Summary of costs and the future
We have put together a complete hardware and software backend to perform real-time dedispersion searches for IRPs, which we are currently testing on international LOFAR stations. The cost of the machine can be broken down as follows:
‐ 12-core Intel Xeon server to perform CPU preprocessing, including buffering of 800 Mbps beamformed data: 3.5 kEuros
‐ NVIDIA GPU cards:
o S1070 (4 x C1060), capable of sustained processing of up to ~5000 DM values over a 12 MHz band at 150 MHz with the current, non-optimised dedispersion kernels: 5.5 kEuros
o M2050, capable of processing up to ~2000 DM values over 12 MHz of bandwidth at 150 MHz: 1.5 kEuros
o GeForce GTX card, non server grade but potentially equivalent to the M2050 (tests pending): 0.4 kEuros
The total power consumption is ~1 kW for each of these solutions, per 800 Mbps of raw data bandwidth or 12 MHz of LOFAR sky bandwidth.

It should be noted that these costs are estimated at a sky frequency of 150 MHz, where 12 MHz of band is ~10% of the sky frequency. Cost estimates for other frequencies should therefore be based on the bandwidth-to-frequency ratio and not on the absolute bandwidth. Also, the CPU host machines play a crucial role by buffering the data (in amounts that optimise the I/O between CPU and GPUs), which means that many tens of seconds of data (>100 s of ILS data) can be processed at once on current GPU boards with 3-6 GB of on-board memory.
The hardware of the backend is all rack-mountable, and cooling is done with standard room air conditioning. The total cost of the software effort is currently hard to estimate, because (a) we are still in the design phase for a new algorithm and do not yet know how much coding will be required, and (b) we are still conducting long (multi-day) tests to establish the shortcomings of both hardware and code under real-life conditions.

The conclusion to be drawn at this stage is that cheap many-core technology, such as that implemented on GPU chips, can be used today, at reasonable up-front and running costs, for LOFAR searches of dispersed IRPs. The hardware and running costs today scale well with the total bandwidth of data that is processed. All indications are that keeping pace with Moore's law requires a move to massively multi-core processors, and GPU processing is the first step. In the next generation of CPU processors, such as the Intel Sandy Bridge, GPU-type cores will coexist with conventional cores on the CPU chip, signalling a new era in CPU architecture. Dispersion searches for IRPs are vital for expanding the parameter space of known pulsars, and can lead to the discovery of other astrophysical events. Based on the above, persisting in the effort to characterise and optimise GPU usage in this field appears to be both useful and necessary.