

Accelerating geoscience and engineering system simulations on graphics hardware

Stuart D.C. Walsh a,*, Martin O. Saar a, Peter Bailey b, David J. Lilja b

a University of Minnesota, Department of Geology and Geophysics, 310 Pillsbury Drive S.E., Minneapolis, MN 55455-0219, USA
b University of Minnesota, Department of Electrical and Computer Engineering, USA

Article info

Article history:
Received 26 September 2008
Received in revised form 17 April 2009
Accepted 14 May 2009

Keywords:
General purpose graphics processing units
Accelerators
Parallel computing
Lattice Boltzmann
Spectral finite element method
Least-squares minimization
Geofluids
Seismology
Magnetic force microscopy

* Corresponding author. E-mail address: [email protected] (S.D. Walsh).

Computers & Geosciences 35 (2009) 2353–2364. doi:10.1016/j.cageo.2009.05.001

Abstract

Many complex natural systems studied in the geosciences are characterized by simple local-scale interactions that result in complex emergent behavior. Simulations of these systems, often implemented in parallel using standard central processing unit (CPU) clusters, may be better suited to parallel processing environments with large numbers of simple processors. Such an environment is found in graphics processing units (GPUs) on graphics cards.

This paper discusses GPU implementations of three example applications from computational fluid dynamics, seismic wave propagation, and rock magnetism. These candidate applications involve important numerical modeling techniques, widely employed in physical system simulations, that are themselves examples of distinct computing classes identified as fundamental to scientific and engineering computing. The presented numerical methods (and the respective computing classes they belong to) are: (1) a lattice-Boltzmann code for geofluid dynamics (structured grid class); (2) a spectral-finite-element code for seismic wave propagation simulations (sparse linear algebra class); and (3) a least-squares minimization code for interpreting magnetic force microscopy data (dense linear algebra class). Significant performance increases (between 10× and 30× in most cases) are seen in all three applications, demonstrating the power of GPU implementations for these types of simulations and, more generally, their associated computing classes.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Many processes of interest in both science and engineering involve complex systems characterized by the interactions of smaller component elements. This is particularly true of many phenomena observed in the geosciences, which often span an enormous range of spatial (mm to Earth) and temporal (ms to million year) scales. Such scales are commonly encountered in studies of climate systems (Fournier et al., 2004), tsunamis (Schmidt et al., 2009), seismic wave transmissions (Tsuboi et al., 2005), magma flow in the Earth's mantle or crust leading to volcanic eruptions (Gonnermann and Manga, 2007), and glacial melting (Jellinek et al., 2004), to name a few.

The need to understand the emergent behavior of these complex systems creates a constant and growing demand for more powerful computational algorithms and hardware capable of solving such physical modeling problems (e.g., the Earth Simulator, Tsuboi et al., 2005). Often, this demand is not due to the algorithmic complexity of the problem, as a hallmark of many complex systems is the simplicity of their constituent elements (Wolfram, 1983). Instead, the principal limitation is the need to repeat many simple tasks a large number of times. Traditionally, this is achieved by implementing the problem on a central processing unit (CPU) cluster. However, standard cluster CPUs are unnecessarily sophisticated for many physical simulations. Thus, the speed, size, and detail of these simulations could be dramatically increased by using clusters of significantly larger numbers of smaller, simpler, and thus cheaper processors.

Such processor clusters already exist. A graphics processing unit (GPU) is hardware that currently costs around $500. Individual GPUs may contain 128 or more simple processors, thus constituting parallel computing systems designed to perform many simple simultaneous calculations.

GPUs were first co-opted for general purpose computing by manipulating the graphics pipeline. In particular, the advent of shaders, small customized routines inserted at specific stages of the pipeline (for example the stages that dealt with vertex and fragment graphics primitives), permitted savvy users to accelerate certain applications by recasting them in terms of graphics manipulations (Lindholm et al., 2001; Purcell et al., 2002). Excellent overviews of the types of problems investigated using shaders can be found in Lastra et al. (2004) and Owens et al. (2007).

Table 1
Example applications and related numerical methods considered cover three of seven fundamental scientific/engineering computing classes discussed by Colella (2004) and the Berkeley Report (Asanovic et al., 2006).

Application                     Numerical method             Computing class
Computational fluid dynamics    Lattice Boltzmann            Structured grid
Seismic wave propagation        Spectral finite element      Sparse linear algebra
Rock magnetism                  Least squares minimization   Dense linear algebra

In particular, examples of linear algebra operations and lattice-Boltzmann techniques using shaders were both published in 2003 (Hillesland et al., 2003; Li et al., 2003).

Nevertheless, there are several drawbacks to shader programming. Shader implementations were often limited to reduced precision accuracy (e.g., Hillesland et al., 2003 employed 16 bit fixed point precision, while Li et al., 2003 used 8 bit fixed point precision). In order to use the shader approach, the programmer must be familiar with graphics primitives: pixels, textures, fragments and vertices, and how to manipulate them with graphics operations to reproduce the desired computation (Owens et al., 2007). The need to cast algorithms in terms of graphics operations not only adds to the complexity of shader programming, it also restricts the manner in which data are processed. This means, for example, that although the shader model is efficient at simple element-wise operations, more complex data interdependencies have to be evaluated through more costly "multipass loops" (Hillesland et al., 2003; Pieters et al., 2007). Programmers are also restricted by certain hardware limitations, e.g., shader instruction count and number of shader outputs (Buck et al., 2004). As a result, although GPU shaders were able to achieve impressive performance gains in some cases, the limitations and complexities of the programming model prevented widespread adoption of shader GPU computing for scientific and engineering applications.

More recently, however, compilers have been released that allow custom-made programs to run directly on GPUs without adopting the graphics pipeline approach employed by shaders (e.g., Nickolls et al., 2008). In particular, in this paper we discuss the Compute Unified Device Architecture (CUDA) programming model released by nVIDIA (2008a). Other examples of this new generation of programming tools for the GPU include BrookGPU (Buck et al., 2004) and the soon-to-be-released OpenCL.

CUDA is focused on providing general purpose GPU (GPGPU) functionality rather than graphics programming. The CUDA model provides a set of minimal extensions to the C programming language that allow the programmer to write kernels—functions executed in parallel on the GPU. This frees the programmer from having to cast the problem in terms of a graphics context and removes the cost of calling the graphics application programming interface. In addition to making GPGPU programming more accessible, CUDA also provides new functionality that distinguishes it from the shader approach. For example, CUDA includes random access byte-addressable memory, and threads that can read and write to as many locations as needed. It also supports coordination and communication among processes through thread synchronization and shared memory—thereby allowing complex data dependencies to be processed without requiring multipass loops. Finally, CUDA supports both single and double precision, and IEEE-compliant arithmetic (nVIDIA, 2008a).
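To make the kernel and thread-block terminology concrete, the following minimal CUDA sketch launches a kernel that scales an array held in GPU memory. The function and variable names are illustrative only and are not taken from the codes discussed in this paper.

```cuda
// Minimal CUDA sketch: each thread scales one entry of a device array.
// Names, sizes, and the scaling operation are illustrative assumptions.
#include <cuda_runtime.h>

__global__ void scaleField(float *field, float factor, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < n)                                      // guard against padding
        field[idx] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d_field;
    cudaMalloc(&d_field, n * sizeof(float));          // allocate GPU global memory
    // ... fill d_field with cudaMemcpy from host data ...
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleField<<<blocks, threadsPerBlock>>>(d_field, 0.5f, n);
    cudaDeviceSynchronize();                          // wait for the kernel to finish
    cudaFree(d_field);
    return 0;
}
```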

Studies have demonstrated that GPGPU implementations can deliver performance gains of 10× to 100× compared to single CPU implementations (e.g., Jeong et al., 2006; Anderson et al., 2007; Tolke and Krafczyk, 2008). Further, because GPUs are low-cost, high-volume commodity products that are constantly improved for gaming/video purposes, it may reasonably be expected that the GPU-computing speed advantage will hold pace with, if not outstrip, CPU development (Owens et al., 2007). Hence, orders of magnitude in cost savings, coupled with orders of magnitude computation speedups, could result in truly transformational computing opportunities in science/engineering that would continuously outpace what could be achieved with traditional CPU clusters.

The question becomes then: is GPU-based scientific computing a more promising approach than, for example, multi-core parallelization, and thus could GPU computing potentially revolutionize the way computational science/engineering is conducted? Or is the GPU approach merely a short-term phenomenon that is limited in scope and applicability? While several scientists have successfully employed GPU computation in fields ranging from astrophysics (Kaehler et al., 2006) to quantum mechanics (Anderson et al., 2007), currently lacking is a systematic analysis of massively parallel GPU-computing and its potential for scientific/engineering modeling, as well as implementations of such a system on a larger scale for more complex systems. To discover what potential impact GPU computation may have on scientific and engineering computing in general, it is necessary to evaluate systematically the classes of numerical methods that can be effectively implemented on massively parallel GPU systems. Here, we take the first steps of such an evaluation by investigating the potential impact of GPU implementations on a selection of example problems from the geosciences.

The technical report "The Landscape of Parallel Computing Research: A View from Berkeley" by Asanovic et al. (2006) describes 13 fundamental computing classes, of which seven (dense linear algebra, sparse linear algebra, spectral methods, N-body methods, structured grids, unstructured grids, and Monte-Carlo methods), originally identified by Colella (2004), are believed to be important for science and engineering applications. Each class comprises related computations and data movement, hence the performance of one method is indicative of the performance of others in that class. Thus, the significance of GPU-based scientific and engineering computing may be evaluated by studying the applicability of GPU computations to the original seven fundamental scientific/engineering numerical method classes. If a significant subset of these seven classes lend themselves to GPU-computing, new transformational opportunities would emerge in scientific computing, as orders of magnitude more (simple but sufficient) processors could result in substantially faster, larger, or higher-resolution simulations.

This paper focuses on the applicability of GPU computing to problems in the geosciences. GPU computations appear well suited to geoscientific problems due to the spatial and temporal scales involved, as discussed at the beginning of this section. We have selected three applications for investigation, involving three of the seven computing classes deemed fundamental for scientific and engineering applications (Table 1):

(1) a lattice-Boltzmann code for geofluid dynamics;
(2) a spectral-finite-element code for seismic wave simulations; and
(3) a least-squares minimization code for magnetic force microscopy data analysis.

The three programs involve important numerical modeling techniques routinely employed in a wide range of scientific/engineering disciplines beyond the geosciences.

A brief description of the GPU hardware used to run the simulations is given in Section 2. The lattice-Boltzmann simulation is described in Section 3, followed by the spectral-finite-element seismic-wave simulation in Section 4, and the least-squares method in Section 5. At the beginning of each section, we provide an overview of the scientific computing applications that will be investigated and identify the fundamental numerical class to which each method belongs before discussing the example geoscience applications.

2. GPU hardware

The simulations conducted in this paper are performed using an nVIDIA (2008b) GeForce 8800 GTX graphics card, which has a peak performance of 518 GFLOPs and a memory bandwidth of 86 GB/s. These graphics cards offer a particularly convenient platform for developing new GPGPU computing codes as they support code written using nVIDIA's CUDA programming environment.

The internal structure of a GeForce 8 series GPU consists of up to 16 multiprocessors (nVIDIA, 2008a), specifically designed for compute-intensive, highly parallel computations. The GPU architecture provides a parallelism advantage over single- or dual-core general-purpose CPUs, which are capable of relatively few floating point operations per clock cycle. Computation on the GPU is organized into kernels, or GPU programs to be executed on some input data by multiple threads in parallel. All threads within a thread block execute the same kernel, communicate with each other through shared on-chip memory, and synchronize their computation with built-in synchronization instructions. Typically, multiple thread blocks are employed, because the hardware places constraints on the maximum number of threads in a single block. Thread blocks cannot synchronize executions as easily as threads within a single block can, nor do thread blocks link to on-chip memory. However, the blocks have access to global memory on the GPU card. This constraint limits thread-to-thread communication and puts a restriction on the amount of work that can be done in one kernel invocation.

Although the GPU architecture places constraints on inter-thread communication, it is specifically designed to optimize the throughput of a single set of instructions operating simultaneously on a large number of data sources. This single instruction, multiple data (SIMD) framework originally arose from the need for rapid processing of graphics data, in which the same simple graphics operations are performed on multiple pixels at once. Nevertheless, this framework is equally well suited to simulating the type of geophysical systems described in the Introduction, i.e., complex composite systems consisting of many simple interacting component parts.

3. Lattice-Boltzmann simulations—example application: geofluidic flows

Lattice-Boltzmann simulations are a method for modeling fluid mechanics in which the fluid is represented by a set of discrete fluid packets moving through a regular node lattice. These simulations are capable of modeling a wide range of complex fluid flow systems that are often intractable with other modeling methods—for example, complex and changing boundary geometries (Bosl et al., 1998; Hersum et al., 2005), turbulent and laminar flows (Chen et al., 1992), multiphase-multicomponent flow of miscible and immiscible fluids (Pan et al., 2004), and buoyancy-induced convection due to solute and thermal gradients (Alexander et al., 1993). These properties make lattice-Boltzmann methods particularly attractive for a large number of geofluid and other fluid-mechanical applications. Lattice-Boltzmann simulations model porous media flows (Bosl et al., 1998), ocean circulation (Salmon, 1999), quantum fluids (Boghosian and Taylor, 1998), granular flows (Herrmann et al., 1994) and colloidal suspensions (Ladd, 1993). The price of the relative ease of implementing lattice-Boltzmann methods is the large amount of CPU time these simulations demand. These costs can be offset in part through the use of non-uniform grids and mesh refinement methods (e.g., Filippova and Hanel, 1998; Lee and Lin, 2003; Li et al., 2005). Here, an alternative, but complementary, approach is explored: accelerating lattice-Boltzmann methods with a GPU implementation.

Here, the ability of GPU implementations to accelerate lattice-Boltzmann simulations is highlighted by considering one particular application: determination of pumice permeability from fluid flow simulations through X-ray tomography images. Magmatic volatile degassing is typically impossible to measure directly within the volcano conduit, particularly during eruptions. Understanding the details of volcanic degassing, however, is widely considered critical in determining volcanic eruption dynamics, specifically the transition between effusive and explosive eruptions (Hammer et al., 1999). Although considerable effort has been made to investigate the complex flow and degassing behavior of magma (e.g., Saar et al., 2001; Saar and Manga, 2002; Jellinek et al., 2004; Walsh and Saar, 2008), the limitations of current computational tools place considerable constraints on the scale and type of problems that can be simulated, even when traditional supercomputing clusters are employed. For example, to determine volatile degassing rates at the appropriate volcanic conduit scale, eruption products such as pumice samples have to be investigated at a much smaller scale and then up-scaled. To do this, we use the Lawrence Berkeley National Laboratory Synchrotron facility to obtain three-dimensional tomography scans of cored pumice samples with resolutions of about 4 μm to capture the thin inter-bubble walls (Fig. 1a). Even before upscaling, the size of the three-dimensional tomography data precludes lattice-Boltzmann simulations of gas flow through the whole sample (one gigavoxel is equivalent to 0.064 ml at 4 μm resolution). A GPU approach has the potential to provide the computational means to solve these problems.

At each timestep in the lattice-Boltzmann model, fluid packets undergo a two-step process: (1) a streaming step in which the discrete fluid packets are propagated between neighboring nodes; and (2) a collision step in which the fluid packets converging on individual nodes are redistributed according to a set of simple rules (Wolf-Gladrow, 2000). For example, if the node is part of the fluid domain, then the fluid packets are updated according to

f_i(t + \Delta t) = (1 - \lambda) f_i(t) + \lambda f_i^{\mathrm{eq}}(t),   (1)

where f_i(t) is the fluid-packet density for the lattice velocity, i, at time, t, λ is a relaxation constant that determines the viscosity of the fluid, and f_i^eq(t) is the local-equilibrium fluid packet density. The exact form of the equilibrium fluid packet densities is determined by the type of lattice being simulated. The GPU implementation discussed here is based on the popular D3Q19 lattice (i.e., a three-dimensional nineteen-speed lattice, Fig. 2a), where

f_i^{\mathrm{eq}} = \rho w_i \left( 1 - \frac{3}{2c^2} \mathbf{u} \cdot \mathbf{u} + \frac{3}{c^2} \mathbf{c}_i \cdot \mathbf{u} + \frac{9}{2c^4} (\mathbf{c}_i \cdot \mathbf{u})^2 \right),   (2)

in which w_i are lattice velocity-specific constants, c_i are the lattice velocities, c is the lattice speed, ρ = Σ_i f_i(t) is the macroscopic fluid density, and u = Σ_i f_i(t) c_i / ρ is the macroscopic fluid velocity (Qian et al., 1992). More detailed discussion on the lattice-Boltzmann method and implementation of other material models, including boundary conditions, is given in Succi (2001) and Ginzburg et al. (2008).
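As an illustration of how Eqs. (1) and (2) translate into device code, the following sketch performs the collision step for a single D3Q19 node in lattice units (c = 1). The weight and velocity tables W and C hold the standard D3Q19 values and are assumed to be initialized from the host; this is a generic textbook-style example rather than the authors' code.

```cuda
// Sketch of a single-node BGK collision for the D3Q19 lattice, following
// Eqs. (1)-(2). W and C are assumed filled via cudaMemcpyToSymbol.
__constant__ float W[19];     // lattice weights w_i
__constant__ int   C[19][3];  // lattice velocities c_i

__device__ void collideNode(float f[19], float lambda)
{
    // Macroscopic density and velocity (definitions following Eq. (2)).
    float rho = 0.0f, ux = 0.0f, uy = 0.0f, uz = 0.0f;
    for (int i = 0; i < 19; ++i) {
        rho += f[i];
        ux  += f[i] * C[i][0];
        uy  += f[i] * C[i][1];
        uz  += f[i] * C[i][2];
    }
    ux /= rho;  uy /= rho;  uz /= rho;
    float usq = ux * ux + uy * uy + uz * uz;

    // Relax each packet toward its local equilibrium (Eqs. (1) and (2), c = 1).
    for (int i = 0; i < 19; ++i) {
        float cu  = C[i][0] * ux + C[i][1] * uy + C[i][2] * uz;   // c_i . u
        float feq = rho * W[i] * (1.0f - 1.5f * usq + 3.0f * cu + 4.5f * cu * cu);
        f[i] = (1.0f - lambda) * f[i] + lambda * feq;
    }
}
```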

Fig. 1. (a) A back-scattered scanning electron microscope image (similar to an X-ray tomography slice) from one of our pumice samples. Results from a typical simulation through a virtual pumice "core": (b) 128 × 128 × 512 voxel domain; (c) simulated flow (darker colors indicate faster moving fluid).

Fig. 2. Lattice-Boltzmann lattices: (a) D3Q19 lattice; (b) D3Q13 lattice showing independent sublattices.

Fig. 3. Each x-axis coordinate is assigned a thread within each thread block. Thread blocks are assigned to each y, z lattice coordinate.

Due to the ordered nature of the lattice and the predictable manner in which the fluid packets propagate, lattice-Boltzmann models are examples of the "Structured Grid" class (Table 1) described in the Berkeley report (Asanovic et al., 2006).

Lattice-Boltzmann simulations are particularly suited to GPU implementation. Before each simulation, the total-system CPU designates each node as either a fluid or a boundary node to reflect the simulation geometry. As a consequence of the simple nature of the rules governing the lattice-Boltzmann simulation, once the geometry is loaded, the remainder of the simulation occurs entirely on the GPUs. Further, CPU–GPU communication need only occur when GPUs return flow field "snapshots" to the system CPU.

Our GPU algorithm for the lattice Boltzmann program proceeds as follows:

(1) Initial fluid packet densities and lattice geometry are loaded into GPU memory. The fluid packets are stored in a single one-dimensional array, arranged in order according to direction and then position (z then y then x) so that fluid packets propagating in the same direction are grouped together in the array.

(2) One block of threads on the GPU is assigned to each row of lattice Boltzmann nodes along the x-axis; the threads within each block are assigned to corresponding nodes along that row (Fig. 3). In order to ensure fast (i.e., coalesced) memory access on the nVIDIA GPU, the length of the x-axis is padded to a multiple of 16.

(3) At each simulation timestep (see the code sketch below):
  (a) Individual threads load the corresponding fluid packets arriving at their assigned node at that timestep. Fluid-packet propagation in the y- and z-axis directions is performed implicitly, by selectively loading the incoming fluid packets from different locations in the array of fluid packet densities (Fig. 4) into the shared memory for the thread block. Thus, the fluid packets are stored in global memory in a Lagrangian manner (i.e., in the same configuration as at the first timestep), while the thread block operates on the fluid packets in an Eulerian fashion (representing the change in the fluid packets at the nodes).
  (b) The individual threads perform the collision step for each node using the fluid packets stored in the thread block's shared memory.
  (c) Fluid packets with x-velocity components are propagated along the rows, by copying between locations in the thread block's shared memory. This rearrangement is needed to ensure that the fluid packets are written back to global memory in a coalesced fashion in the following step.
  (d) Outgoing fluid packets are written back to the original locations in GPU memory (Fig. 4).

(4) The implicit propagation will leave the fluid packets disordered following the final timestep, unless both y and z lattice dimensions are factors of the number of timesteps. Consequently, the final step of the algorithm reorders the fluid packets so their y and z locations correspond to the correct position in the lattice-Boltzmann array. The correction step in the y direction is performed using thread blocks of the same size as before, arranged in a two-dimensional grid with dimensions GCFY by NZ, where NZ is the number of nodes along the z-axis, and GCFY is the greatest common factor of NY (the number of nodes along the y-axis) and the required offset in the y direction, itself equal to the number of timesteps modulo NY. The correction is performed by each thread repeatedly shifting fluid packets with a y-velocity component forward to the correct position in the array, replacing the old value at that location, which is then itself propagated forward to the next location in the fluid packet array. The same procedure is performed for the correction step in the z direction.

Fig. 4. In our lattice-Boltzmann implementation, fluid packets are stored in memory in their original configuration at the start of a simulation. At each timestep, fluid-packet propagation is performed implicitly by individual threads, which (1) load fluid packets scheduled to arrive at a node from GPU global memory into each thread's local memory, (2) perform a collision step, and (3) copy fluid packets back to the same location.
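A simplified sketch of such a timestep kernel is given below. It implements the fully implicit streaming variant mentioned later in this section (y-, z-, and x-propagation all handled by offset reads, omitting the shared-memory shuffle of Step 3(c)); the layout function, table names, and launch configuration are illustrative assumptions rather than the authors' code.

```cuda
// Simplified per-timestep kernel for Step (3): one thread block per (y, z)
// row, one thread per x-axis node. Streaming is handled implicitly by
// offsetting the coordinates read from (and written back to) global memory.
#define Q 19

__constant__ int C[Q][3];                          // D3Q19 lattice velocities

// Collision step for one node, as sketched after Eq. (2).
__device__ void collideNode(float f[Q], float lambda);

__device__ __forceinline__ int idx(int i, int x, int y, int z,
                                   int NX, int NY, int NZ)
{
    // Packets grouped by direction, then position (z, then y, then x), as in Step (1).
    return ((i * NZ + z) * NY + y) * NX + x;
}

// Launched, e.g., as lbmTimestep<<<dim3(NY, NZ), NXpadded>>>(f, lambda, ...),
// with t the 1-based index of the current timestep.
__global__ void lbmTimestep(float *f, float lambda,
                            int NX, int NY, int NZ, int t)
{
    int x = threadIdx.x, y = blockIdx.x, z = blockIdx.y;
    if (x >= NX) return;                           // guard for padded rows

    float fn[Q];                                   // packets converging on this node

    // (a) Implicit streaming: packets never move in memory, so the packet
    //     arriving here now still sits at the node it occupied t steps ago.
    for (int i = 0; i < Q; ++i) {
        int xs = ((x - t * C[i][0]) % NX + NX) % NX;
        int ys = ((y - t * C[i][1]) % NY + NY) % NY;
        int zs = ((z - t * C[i][2]) % NZ + NZ) % NZ;
        fn[i] = f[idx(i, xs, ys, zs, NX, NY, NZ)];
    }

    // (b) Collision step on the gathered packets (boundary nodes omitted here).
    collideNode(fn, lambda);

    // (d) Write the updated packets back to the locations they were read from,
    //     keeping the stored configuration Lagrangian as in Fig. 4.
    for (int i = 0; i < Q; ++i) {
        int xs = ((x - t * C[i][0]) % NX + NX) % NX;
        int ys = ((y - t * C[i][1]) % NY + NY) % NY;
        int zs = ((z - t * C[i][2]) % NZ + NZ) % NZ;
        f[idx(i, xs, ys, zs, NX, NY, NZ)] = fn[i];
    }
}
```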

Our GPU implementation is similar to those described in Tolke (2008) and Tolke and Krafczyk (2008), in that the key to the improved performance of all three simulations is the row-wise propagation of the fluid packets in Step 3(c). This row-wise propagation ensures coalesced memory access from the global memory in first generation CUDA devices. It should be noted, however, that this detail may be obsolete in the future, as the criteria for memory coalescence have been relaxed in later-generation CUDA devices (those with compute capability 1.2 and higher).

In particular, it may be that Step 3(c) could be replaced by implicit streaming methods in all directions, as performed in Step 3(a). Here, the use of implicit streaming steps in the y- and z-axis directions allows the fluid packets to be stored in a single array (Fig. 4), rather than separate arrays for incoming and outgoing fluid packets as in Tolke (2008) and Tolke and Krafczyk (2008). This reduces the total memory required to store the lattice, increasing the maximum number of nodes per GPU by almost a factor of two.

The GPU implementation of the lattice-Boltzmann model delivers substantial performance gains over the standard CPU implementation. For the D3Q19 lattice, we are able to achieve over 200 million lattice node updates per second (MLU/s). This figure is largely independent of lattice size (Table 2), although it is contingent on the entire lattice being stored in the GPU memory. This assumption is not unreasonable for the purpose of comparing the performance of a single CPU to a CPU–GPU system. While most GPU memories are smaller than those of their host CPU, this is not true for all GPUs—particularly those designed for general purpose computing rather than graphics operations. The nVIDIA C1060, for example, has 4 GB of onboard memory.

Table 2
Performance in lattice node updates per second versus lattice size after 10,000 timesteps.

Lattice size       Node updates per second (millions)    Speedup
                   CPU         GPU
64 × 64 × 64       7.39        217.06                    29×
128 × 128 × 128    6.98        228.81                    33×
128 × 256 × 256    6.06        228.32                    38×

CPU times are based on results from a single 2.66 GHz Intel Core 2 Quad processor. GPU times are based on an nVIDIA GeForce 8800 GTX with a clock rate of 1.188 GHz.

Moreover, multiple GPUs may be connected to a single CPU. If the amount of host CPU memory is less than, or equal to, the total GPU memory, then there is no benefit to communication between the CPU and the GPU, unless the simulation is larger than can be contained in a single CPU–GPU system.

Even greater performance gains in individual GPUs are reported by Tolke and Krafczyk (2008), who achieve speeds of up to 582 MLU/s using a multiple relaxation time D3Q13 lattice (i.e., the three-dimensional, thirteen-velocity lattice described by d'Humières et al., 2001). Part of the difference in performance is accounted for by the type of GPU employed—the simulations in Tolke and Krafczyk (2008) were conducted using a GeForce 8800 Ultra, which has a faster clock speed and greater bandwidth than the GeForce 8800 GTX used in this paper. However, we attribute the majority of the speed difference to the different number of lattice velocities (13 versus 19), which impacts performance in two ways: (1) the 19-velocity lattice requires a greater number of global memory reads and writes per node—a common bottleneck in GPU codes; and (2) the reduced number of fluid packets allows the D3Q13 code to achieve greater occupancy (i.e., more efficient use) of the processors, and is thereby capable of executing a greater number of simultaneous instructions (Bailey et al., 2009).

However, despite the potential for an additional 2.5× speedup afforded by the more sparsely connected D3Q13 lattice, there are advantages to D3Q19 simulations—in particular for porous flow simulations based on voxelized tomography images, like those considered in this section. When simulating flow through a three-dimensional Cartesian grid, the D3Q13 lattice may be decomposed into two independent lattices that intertwine in a checkerboard fashion (Fig. 2b). These two independent lattices are problematic for porous medium flows as the effective resolution of the flow domain is reduced. Alternatively, the D3Q13 model can be employed in a single rhombic-dodecahedron lattice; however, this requires remeshing of voxelized data, which may also result in some loss of resolution. In contrast, all nodes in the D3Q19 lattice are interconnected (Fig. 2a), offering greater resolution of small scale apertures.

Figs. 1b and c show simulated fluid flow through a numerical core taken from a tomography image of a pumice sample. The core has dimensions of 128 × 128 × 512 voxels, corresponding to sample dimensions of 0.5 mm × 0.5 mm × 2 mm. Approximately 100,000 lattice-Boltzmann timesteps were run in one and a half hours. In contrast, running the same simulation on the CPU alone requires approximately two days of computation time. The speed increase provided by the GPU allows multiple cores (e.g., with long axis aligned in the x, y and z directions) to be investigated for several different pumice samples in the same time that it would take to simulate a single sample on the same CPU without the GPU. In a broader context, the speedup provided by the lattice-Boltzmann GPU implementation enables thorough parameter-space investigations to be conducted for any number of different geoscience or engineering problems in circumstances where previously only a handful of simulations would have been possible.

4. Spectral-finite-element method—example application: seismic wave propagation

The spectral-finite-element method (SFEM) for modeling partial differential equations was originally developed to simulate fluid dynamics (Patera, 1984). Since then, SFEM has been widely adopted in such disparate fields as meteorology (Fournier et al., 2004), biomechanical engineering (Finol and Amon, 2001), and seismology (Tromp et al., 2008). The SFEM is so named as it demonstrates an accuracy comparable to pseudospectral methods, but is more similar to standard finite element methods (FEM) with respect to model space discretization. Thus, the SFEM examples presented here are also representative of the larger field of FEM employed frequently in a wide range of scientific/engineering computations.

Here, we discuss the implementation of a two-dimensional SFEM seismic wave propagation code on GPUs as proof of concept and as groundwork on which more advanced GPU-based SFEM codes can later be built. It should be stressed that we are primarily interested in the broader question of the GPU's performance with sparse matrix multiplication, rather than as a tool specifically for solving seismological problems. As such, our goal in this section is to compare the performances of two similar GPU and CPU SFEM implementations, rather than independent CPU and GPU codes optimized for each platform. Nevertheless, such an optimized implementation has recently been brought to our attention, and we refer interested readers to Komatitsch et al. (2009).

Our SFEM implementation is based on the two-dimensional simulation described in Komatitsch et al. (2001) (Fig. 6). SFEM has proven successful as it accommodates more complex geometries and boundary conditions than pseudospectral, finite-difference, and boundary element methods, and has a reduced computational cost compared to standard FEM codes (Komatitsch and Tromp, 1999). Despite these advantages, the computational demands of SFEM simulations are often enormous. The largest seismic SFEM simulations currently employ approximately 4000 processors to simulate 14 billion grid points and require some of the world's largest supercomputers, such as the Earth Simulator in Japan (Tsuboi et al., 2005). Even relatively small simulations still contain millions of grid points (Tromp et al., 2008). Application of SFEM to body-wave scattering (Cormier, 2000; Shearer and Earle, 2004), travel-time and amplitude finite-frequency kernels (Liu and Tromp, 2006), and long-range propagation through realistic crust, (upper) mantle, and inner core heterogeneous structures (Ge et al., 2005; Niu and Chen, 2008) is hindered by the need for massive cluster computing. A parallel-GPU implementation, because of its inherent speed and lower costs, could put the power of the spectral-finite-element method into the hands of a much larger proportion of the seismological community. The method has similar advantages in ultrasonic and engineering seismic non-destructive testing (Marklein et al., 2005), where high-order accuracy and its ability to handle unusual geometries are key.

Spectral methods employ a weak form of the equations of motion for seismic waves,

\int \rho w_i \frac{\partial^2 s_i}{\partial t^2} \, d\mathbf{x} = -\int w_{i,j} \sigma_{ij} \, d\mathbf{x} + M_{ij}(t) w_{i,j}(\mathbf{x}_s),   (3)

where σ_ij are the stress tensor components, s_i are the displacement field components, M_ij(t) are the source moment tensor components of the earthquake originating at location x_s at time, t, and w_i are the components of the vector field used to generate the weak formulation. Here, Einstein summation is employed for Roman subscripts, in which repetition implies summation over the indices, while subscripted commas represent derivatives with respect to the corresponding coordinate, i.e., A_{,i} = ∂A/∂x_i.

Fig. 5. (a) Each spectral element is mapped to a unit cube (or square in two dimensions), with nodes located on Gauss–Lobatto–Legendre integration points (shown here for a two-dimensional fourth order element). This arrangement of nodes and elements gives rise to a sparse stiffness matrix (b) comprised of locally dense blocks (c).

The modeled domain is subdivided into deformed cubic elements (see Fig. 5a for a two-dimensional representation), such that points within the unit cube, γ, are mapped to the points within each element, x, via a set of shape functions, N_a(γ). Lagrange polynomials interpolate the state variables and their spatial derivatives within the elements,

f(\mathbf{x}(\gamma)) = \sum_{a,b=0}^{n} f^{ab} L^{ab}(\gamma), \qquad f_{,i} = \sum_{a,b=0}^{n} f^{ab} L^{ab}_{,i},   (4)

where L^{ab} is a product of two Lagrange polynomials, l: L^{ab}(γ) = l^a(η_1) l^b(η_2). The integrals in the weak formulation are approximated by

\int f \, d\mathbf{x} = \int f(\mathbf{x}(\gamma)) J(\eta) \, d\gamma \approx \sum_{a,b=0}^{n} \omega_a \omega_b J^{ab} f^{ab},   (5)

where ω_a denote quadrature weights associated with the numerical integration and J^{ab} represents the value of the Jacobian at the quadrature points, calculated from the shape functions, N_a(γ). The quadrature points are selected using the Gauss–Lobatto–Legendre (GLL) integration rule, which yields a formulation favorable to parallel implementation (Tromp et al., 2008). Specifically, expressing the displacement field, s_i, and the weighting functions, w_i, in Eq. (3) in terms of Lagrange polynomials and recognizing that the resulting expressions must hold for arbitrary weighting functions give rise to equations of the form

M \ddot{U} = F - K U,   (6)

where M is the global mass matrix, U is the displacement vector, F is the vector of forcing terms, and K is the global stiffness matrix. If GLL integration is used, M is diagonal, and consequently, matrix multiplication is only required to update the displacement field at each timestep (Tromp et al., 2008). Moreover, the higher-order accuracy of SFEM means that explicit, rather than implicit, methods are employed in the timestepping scheme (Fig. 6).

SFEM has an additional advantage for GPU implementation, namely the contribution from each element leads to a block-compressed sparse-row stiffness matrix (Fig. 5). This matrix structure can be easily manipulated to ensure coalesced read/write operations, necessary to obtain maximum GPU performance. The block-compressed sparse-row stiffness matrix is an example of the "sparse matrix linear algebra" class (Table 1) identified in the Berkeley report (Asanovic et al., 2006).
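To illustrate the block-compressed sparse-row idea, the following sketch shows one possible storage layout and a kernel forming the product KU. The struct fields, kernel, and thread assignment are illustrative assumptions for a generic block-sparse matrix, not the authors' data structures (which, as described in the list below, accumulate per-element contributions in batches).

```cuda
// Illustrative block-CSR storage: each dense block couples one group of
// degrees of freedom to another. Names and layout are assumptions.
struct BlockCSR {
    int    numBlockRows;   // number of block rows
    int    blockDim;       // dense block size (e.g., degrees of freedom per node)
    int   *rowPtr;         // start index into colIdx/values for each block row
    int   *colIdx;         // block column index of each stored block
    float *values;         // dense blocks, blockDim*blockDim entries each
};

// One thread block per block row; each thread accumulates one row of KU.
__global__ void blockCsrMatVec(BlockCSR K, const float *U, float *KU)
{
    int blockRow = blockIdx.x;
    int r = threadIdx.x;                     // local row within the dense block
    if (r >= K.blockDim) return;

    float sum = 0.0f;
    for (int b = K.rowPtr[blockRow]; b < K.rowPtr[blockRow + 1]; ++b) {
        const float *blk = &K.values[(size_t)b * K.blockDim * K.blockDim];
        const float *u   = &U[K.colIdx[b] * K.blockDim];
        for (int c = 0; c < K.blockDim; ++c)
            sum += blk[r * K.blockDim + c] * u[c];   // dense block times sub-vector
    }
    KU[blockRow * K.blockDim + r] = sum;
}
```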

Fig. 6. (a) and (b) Results from the two-dimensional spectral element code, simulating a Ricker wavelet propagating across a two-dimensional mesh, reflecting at edges and causing interference patterns at 0.5 s. (c) Simulated seismograph recorded at detector location marked in (a), showing outgoing and reflected waves.

Table 3
SFEM performance in node updates per second versus GLL interpolation order.

GLL order    K recalculated (nodes/second)         K stored (nodes/second)
             CPU      GPU        Speedup           CPU          GPU          Speedup
4 × 4        5120     570,000    111×              1,300,000    9,000,000    7×
5 × 5        3790     353,000    93×               689,000      9,500,000    14×
6 × 6        2850     213,000    74×               521,000      9,670,000    19×

Simulation speeds are based on 10,000 timesteps performed on an 11 × 11 element grid. The stiffness matrix, K, is either recalculated at each timestep (K recalculated), or calculated once and stored in memory (K stored). CPU results are from a single 2.66 GHz Intel Core 2 Quad processor. GPU times are based on an nVIDIA GeForce 8800 GTX with a clock rate of 1.188 GHz.

Our GPU implementation proceeds in the following manner:

(1) Each element is assigned a thread block, with each block containing a number of threads equal to the number of nodes in each element. The elements' individual contributions to the stiffness matrix, K, are calculated and stored in a block-compressed sparse-matrix format (recording the degrees of freedom associated with the rows and columns, as well as the associated values) in global memory (i.e., the GPU's onboard memory). Each element is considered separately—no effort is made at this point to sum the contributions from nodes that appear in multiple elements. Although the entire stiffness matrix is calculated and stored in our implementation, the mesh need not be processed all at once in this manner. Instead, separate regions of the mesh can be calculated individually—thereby reducing memory requirements for larger meshes. For a perfectly elastic material, K is constant; in anelastic materials, however, K changes as a function of the strain history (Tromp et al., 2008). Consequently, we conduct two sets of simulations: in one set K is recalculated at each timestep, and in the other K is calculated once and then stored in memory.

(2) Next, the product of the stiffness matrix components from each element and the corresponding entries in the displacement vector are calculated. For this step, the number of threads per block is set equal to the number of degrees of freedom in each block. The resulting vector KU′ is stored as a list of rows and a list of values, which are then added to a single vector KU. In the CUDA environment, memory writes will fail if two threads attempt to access the same location simultaneously. Thus the product KU′ is calculated and added to KU in multiple function calls that consider separate batches of elements, such that no two elements in each batch have nodes in common.

(3) The new velocity vector is calculated from U̇_n = U̇_0 + Δt (F − KU) and the displacement vector is updated using a leapfrog rule: U = U + (U̇_n + U̇_0) Δt/2, with each thread performing the calculation for a corresponding degree of freedom in both cases. Here the number of threads is set to a constant (i.e., 256) and the total number of blocks is determined such that there is at least one thread per entry in U. (A sketch of this update step is given immediately after this list.)
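A minimal sketch of the Step (3) update, assuming one thread per degree of freedom and the fixed block size of 256 mentioned above; the diagonal mass matrix is applied explicitly here through its inverse (Minv), and all names are illustrative.

```cuda
// Step (3) update: one thread per degree of freedom. Because GLL integration
// makes the mass matrix M diagonal, its inverse is applied entry-by-entry and
// no linear solve is required. Names and the signature are illustrative.
__global__ void updateFields(float *U, float *Udot,
                             const float *Minv, const float *F, const float *KU,
                             float dt, int ndof)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= ndof) return;

    float udot0 = Udot[i];
    float udotn = udot0 + dt * Minv[i] * (F[i] - KU[i]);  // new velocity
    U[i]   += 0.5f * dt * (udotn + udot0);                // leapfrog displacement update
    Udot[i] = udotn;
}

// Host-side launch with a constant 256 threads per block and enough blocks
// to cover every entry of U:
//   int threads = 256, blocks = (ndof + threads - 1) / threads;
//   updateFields<<<blocks, threads>>>(dU, dUdot, dMinv, dF, dKU, dt, ndof);
```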

The GPU again delivers a significant performance increase over the equivalent CPU implementation. In particular, if the stiffness matrix is recalculated rather than stored, the speedup can be more than two orders of magnitude (Table 3). If instead the matrix is stored in memory, the speedup is less (7–19 times), but still substantial. We attribute the differences in speedups between the two versions to the fact that recalculating the stiffness matrix at each timestep reduces the overall proportion of computation time spent on memory access. We also note that recalculating the stiffness matrix at each timestep on the GPU is only a factor of 2–3 times slower than running the same simulation with the stiffness matrix stored on the CPU. This is interesting as the stiffness matrix can be reconstructed on the GPU in separate sections that are processed sequentially. Such an implementation would save memory, making it feasible to run larger (albeit slightly slower) simulations than would be possible with the CPU alone.

As in the lattice-Boltzmann example, the performance gains shown assume that all nodes in the mesh are stored in GPU memory. Part of the performance increase may also be due to the fact that our SFEM implementation is two dimensional and we have confined our analysis to relatively small meshes—memory access in larger, three-dimensional meshes may incur a higher cost. Nevertheless, the three-dimensional GPU implementation of an SFEM code in Komatitsch et al. (2009) mentioned earlier was able to achieve a 25× speedup compared to an optimized CPU version. The next challenge will be to demonstrate how these simulations scale across multiple CPU–GPU systems, in order to account for more complicated three-dimensional models, with the eventual goal of scaling up to whole-earth simulations.

5. Least-squares minimization—example application: magnetic scanning force microscopy

Least-squares minimization (LSM) is used throughout the geosciences, and in many other fields, as a tool for data fitting, signal processing, and inverse theoretical methods. Here, least-squares minimization techniques are used to interpret magnetic field intensity data, obtained from scanning force microscopy and related methods (e.g., SQUID and Hall probe microscopy; Kirtley and Wikswo, 1999; Chang et al., 1992). A scanning force microscope contains a micron-scale cantilever with a sharp tip, similar to a phonograph arm and stylus. The cantilever tip is scanned across the sample surface and its vertical displacement inferred from the phase shift in a laser reflected off the cantilever. Magnetic force microscopy (MFM) is conducted by coating the tip with a magnetic material. The cantilever responds to the net magnetic force experienced between its tip and the sample (Frandsen et al., 2004). The most probable distribution of magnetic dipoles is obtained from the measured magnetic forces via a least-squares fit (Weiss et al., 2007). Due to the long range forces that act between each dipole and the cantilever, the least-squares minimization problem gives rise to a dense matrix in which every element interacts with every other element, and is thus representative of the "dense linear algebra" class (Table 1) identified in the Berkeley report (Asanovic et al., 2006).

Traditional paleomagnetic studies use bulk rock samples that often contain different magnetic minerals in a multitude of grain sizes. These various mineral phases and grains often form throughout the rock's history, recording conflicting magnetizations, which confounds interpretation of the overall remanence. To address this problem, researchers are obliged to conduct a litany of time-consuming rock magnetic experiments designed to identify various mineral phases and their abundances within a given sample. MFM offers a means of overcoming many of these issues by providing grain-scale information of the rock sample. However, to date, many of the benefits of this approach remain unrealized due to the amount of time needed for data processing.

GPU implementation of this problem has the potential to reduce analysis times from several days to hours or less, without requiring external supercomputing facilities. This would greatly enhance the capabilities of magnetic scanning force microscopy, allowing rapid processing of larger, higher resolution scans and improving determination of dipole distributions by iterative analysis and scan refinement.

The set of magnetic intensities, B_i(x), detected by the magnetic force microscope, are expressed in terms of the underlying magnetic field, M_i(x), as

B_i(\mathbf{x}) = \int_V G_{ij}(\mathbf{x}, \mathbf{x}') M_j(\mathbf{x}') \, d\mathbf{x}',   (7)

where G_ij(x, x') is the Green's function relating the contribution of the magnetic field at x' to the magnetic intensity at x. The underlying dipole distribution is represented as a set of discrete point sources of variable intensities and directions, m, and/or locations, x, i.e., M_i(x') = Σ_a m^a_i δ(x' − x^a), where δ(x) is the Dirac delta function. Due to the noisy nature of the magnetic intensity data, the magnetic field predicted by the dipole distribution will only approximately match the measured data, B_i, with a misfit residual, R_i, of

R_i(\mathbf{x}^a) = \sum_b \left[ G_{ij}(\mathbf{x}^a, \mathbf{x}^b) \, m^b_j \right] - B_i(\mathbf{x}^a).   (8)

Thus, the goal is to minimize the square of R_i(x) with respect to predetermined degrees of freedom, ν, i.e.,

\frac{\partial}{\partial \nu} \sum_a R_i(\mathbf{x}^a) R_i(\mathbf{x}^a) = 0.   (9)
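For the linear case discussed in the next paragraph, carrying out the differentiation in Eq. (9) explicitly leads to the familiar normal equations. The following short derivation is a sketch added for clarity (it is not part of the original text) and uses the matrix notation A x = B introduced below.

```latex
% Added sketch (not from the original paper): for the linear case, write the
% residual of Eq. (8) as R = A x - B, where x collects the dipole degrees of
% freedom nu. Setting the derivative in Eq. (9) to zero gives
\begin{align*}
  \frac{\partial}{\partial x_k} \big[(A_{ij} x_j - B_i)(A_{il} x_l - B_i)\big] &= 0\\
  \Longrightarrow \quad 2\, A_{ik} \,(A_{ij} x_j - B_i) &= 0\\
  \Longrightarrow \quad (A^{\mathsf{T}} A)\, \mathbf{x} &= A^{\mathsf{T}} \mathbf{B},
\end{align*}
% i.e., the normal equations. The singular value decomposition A = U W V^T
% solves this system in a numerically stable way, x = V W^{-1} U^T B.
```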

For linear problems (e.g., if dipole locations are fixed and only their magnitudes allowed to vary, or if an overall dipole direction is sought), Eq. (9) is solved by finding the singular value decomposition (SVD) of a matrix equation, A_ij x_j = B_i, where A_ij expresses the relationship between the degrees of freedom and the magnetic force and B_i are the observed magnetic intensities from the MFM scan. If dipole positions are also allowed to vary, the problem becomes non-linear. In this case, the best fit is found by constructing a Hessian matrix for the problem, and then finding the minima using a non-linear method (e.g., Newton's method). The analysis given here is confined to the linear case, although we plan to apply GPGPU computation to the nonlinear inverse problem in future work.

We implement the SVD using a basic one-sided Jacobi algorithm (Drmac and Veselic, 2008a, b). Although faster SVD methods are available, the one-sided Jacobi algorithm is chosen due to its high accuracy, ease of parallel implementation, and relative simplicity. The Jacobi method calculates the SVD of a matrix, G, by constructing an orthogonal matrix, V, that diagonalizes A = G^T G. If V^T A V = W^2, then the singular value decomposition is G = U W V^T, where W is a diagonal matrix and U is orthogonal. In practice, the SVD is constructed implicitly, through the successive application of Jacobi rotations, O,

G^0 = G,   (10)

G^{a+1} = G^a O,   (11)

in which the rotations are chosen to orthogonalize column pairs from the matrix G^a. After several iterations through all possible column-pair combinations, the algorithm is halted once (G^a)^T G^a is diagonal to numerical precision, at which point G^a ≈ U W. The SVD is then obtained from W = \sqrt{(G^a)^T G^a}, V^T = (G^a)^{-1} G, and U = G^a W^{-1}.

For a matrix G, with m rows and n columns, our GPU implementation of the one-sided Jacobi method proceeds in the following manner (a condensed kernel sketch is given after this list):

(1) The matrix G is copied into the GPU's global memory.

(2) For each "sweep" through all possible column pair combinations:
  (a) The algorithm matches column pairs using the round-robin ordering described in Zhou and Brent (1997) (Fig. 7). For each distinct round-robin column-pair ordering (n − 1 in total):
    (i) Each of the column pairs (n/2 in total) is assigned to a GPU thread block. Each block is given a fixed number of threads (128 was found to be optimal for the matrix sizes examined).
    (ii) The entries of A = (G^a)^T G^a corresponding to the products of the two columns (A^a_ii, A^a_jj, and A^a_ij) are calculated, and the rotation angle, θ, for G^a_i and G^a_j is found. The column dot products are calculated in parallel, with each thread summing 1/128th of the column entries. The results are then summed across all threads, the value of θ calculated by a single thread, and then distributed to the others in its block.
    (iii) The column pairs are orthogonalized:
          G^{a+1}_i = cos(θ) G^a_i + sin(θ) G^a_j,
          G^{a+1}_j = cos(θ) G^a_j − sin(θ) G^a_i,
      with each thread again operating on 1/128th of the rows in the two columns.
  (b) The magnitudes of the off-diagonal components of (G^a)^T G^a are estimated from the column pair products at the start of each sweep. If the error estimate indicates the matrix is diagonal to machine precision, the Frobenius norm of the off-diagonal components of the matrix A is calculated directly at the end of the sweep. The algorithm is halted once the Frobenius norm is below the desired tolerance.

Fig. 7. We assign n/2 thread blocks to pairs of matrix columns, and perform Jacobi pivoting on each in parallel. Subsequent combinations of column pairs are found by rotating column indexes as shown by arrows (round robin ordering of Zhou and Brent, 1997). This process is repeated n − 1 times per "Jacobi sweep", thereby rotating all n(n − 1)/2 possible combinations of column pairs.
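A condensed sketch of the kernel implied by steps (2)(a)(i)–(iii) is given below, with 128 threads per block and a shared-memory tree reduction for the column dot products. The rotation-angle formula is the standard one-sided Jacobi expression; the variable names and pairing arrays are illustrative assumptions rather than the authors' code.

```cuda
// Sketch of one round-robin step: each thread block orthogonalizes one column
// pair (i, j) of the m-by-n matrix G (stored column-major). pairI/pairJ give
// the column pairing for the current round-robin ordering.
#define TPB 128   // threads per block (128 found optimal in the paper)

__global__ void jacobiRotatePairs(float *G, const int *pairI, const int *pairJ, int m)
{
    __shared__ float sii[TPB], sjj[TPB], sij[TPB];
    int i = pairI[blockIdx.x], j = pairJ[blockIdx.x];
    float *Gi = &G[(size_t)i * m], *Gj = &G[(size_t)j * m];
    int t = threadIdx.x;

    // (ii) Partial dot products: each thread sums a strided 1/TPB share of the rows.
    float aii = 0.f, ajj = 0.f, aij = 0.f;
    for (int r = t; r < m; r += TPB) {
        float gi = Gi[r], gj = Gj[r];
        aii += gi * gi;  ajj += gj * gj;  aij += gi * gj;
    }
    sii[t] = aii;  sjj[t] = ajj;  sij[t] = aij;
    __syncthreads();
    for (int s = TPB / 2; s > 0; s >>= 1) {          // tree reduction in shared memory
        if (t < s) { sii[t] += sii[t+s]; sjj[t] += sjj[t+s]; sij[t] += sij[t+s]; }
        __syncthreads();
    }

    // Rotation angle from the 2x2 Gram block (computed by one thread, read by all).
    __shared__ float c, s1;
    if (t == 0) {
        float theta = 0.5f * atan2f(2.f * sij[0], sii[0] - sjj[0]);
        c = cosf(theta);  s1 = sinf(theta);
    }
    __syncthreads();

    // (iii) Apply the rotation to the two columns, 1/TPB of the rows per thread.
    for (int r = t; r < m; r += TPB) {
        float gi = Gi[r], gj = Gj[r];
        Gi[r] = c * gi + s1 * gj;
        Gj[r] = c * gj - s1 * gi;
    }
}
```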

Table 4
SVD times in seconds versus matrix size. GPU performance is compared to two CPU SVD implementations: a single-processor routine based on the JAMA/C++ library, and Matlab's "svd" command. CPU processor times are based on results from a single 2.66 GHz Intel Core 2 Quad; GPU times are based on an nVIDIA GeForce 8800 GTX with a clock rate of 1.188 GHz.

Matrix size    CPU: JAMA/C++ (s)   CPU: Matlab (s)   GPU (s)   Speedup
1024 × 1024    86                  26                3         8.7×
2048 × 2048    1708                415               22        18.9×
4096 × 4096    16325               3414              200       17.1×
8192 × 8192    120479              26603             1080      24.6×

The performance of our GPU SVD implementation for random square matrices is compared to the performance of two other SVD routines running on the same machine in Table 4. The first is a single-processor implementation based on JAMA/C++, a C++ implementation of the Java Matrix Library by MathWorks and NIST. The second uses the Matlab 7 command "svd", which employs LAPACK and BLAS routines. The GPU implementation delivers performance gains of up to 24× over the Matlab command and more than 100× over the JAMA/C++ implementation. Moreover, it is worth noting that both of these routines employ algorithms based on QR decomposition, which are generally faster, though less accurate, than the basic Jacobi SVD employed on the GPU. As with the other routines discussed in this paper, these results are contingent on the matrix being stored in the GPU memory.

The maximum matrix size for the nVIDIA GeForce 8800 GTX graphics card is approximately 200,000,000 elements. Nevertheless, larger matrices can be processed using block-Jacobi methods (e.g., Hari, 2005), and we anticipate that similar performance gains will be sustained for larger matrices, as data transfer between the CPU and GPU occupies a negligible percentage of the overall calculation time (approximately 0.16% in the 1024 × 1024 matrix SVD and approximately 0.04% in the 2048 × 2048 matrix SVD). It should also be noted that our current implementation is relatively unsophisticated and has yet to incorporate many recent developments in accelerating the Jacobi SVD (e.g., Mascarenhas, 1995; Hari, 2005; Drmac and Veselic, 2008b). Admittedly, it is not yet evident how well these techniques, developed for CPU implementations, will translate to the GPU setting. Nevertheless, there is ample reason to believe that additional improvement is possible beyond what is given here.

[Fig. 8. Calculation of the overall magnetic moments of a synthetic sample, using a GPU implementation. (a) Initial Bz distribution (reproduced with permission from part of Fig. 1b in Zhu et al., 2003). (b) Predicted distribution of magnetic dipoles (darker regions indicate more likely/stronger dipole locations). (c) Close-up of (b); gray contour lines represent the Bz distribution given in (a) and show predictions of dipole locations between Bz maxima and minima, as expected.]

Fig. 8a shows part of an MFM scan, reproduced with permission from Fig. 1b of Zhu et al. (2003). We find the dipole distribution (Fig. 8b, c) by calculating the singular value decomposition for a 120 by 120 grid of observed Bz values, seeking average dipole directions at one in every four points inside the grid. The entire image is constructed by repeatedly finding the product of the SVD with overlapping regions from the original scan data and recording the average result for each location. In total, the entire process takes approximately an hour and a half. The speed of the singular value decomposition makes it feasible to conduct further operations to refine the dipole locations, for example by thresholding the final image to reveal the most likely dipole locations and repeating the singular value decomposition. Once again, the speed reported for this simulation is due to the GPU implementation alone, and it is highly likely that more sophisticated techniques will deliver even greater performance gains. Obvious methods for improvement involve the use of suitable preconditioners and/or sparse matrix techniques (for example, as employed in Weiss et al., 2007).
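Read in standard least-squares terms (our interpretation of the windowed reconstruction described above, not a formula quoted from the paper), each overlapping window supplies a vector $b_z$ of observed field values, and the precomputed SVD of the forward operator $G$ yields the dipole-moment estimate directly:
\[
  G = U W V^{T}, \qquad
  \hat{m} \;=\; \arg\min_{m}\ \lVert G\,m - b_z \rVert_2 \;=\; V\,W^{-1}U^{T} b_z ,
\]
with $W^{-1}$ replaced by the pseudoinverse $W^{+}$ if any singular values vanish, and the per-window estimates $\hat{m}$ averaged over the overlapping regions of the scan, as described above.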

6. Conclusion

Many simulations in science and engineering, and in particular in the Earth sciences, are described in terms of simple local interactions that give rise to complex emergent macroscopic behavior. Graphics processing units (GPUs), as found on standard graphics cards, are designed to implement a single set of instructions on multiple data sources simultaneously and are therefore well suited to these types of problems. While GPU computation has previously been largely confined to specialized applications that could be cast in terms of graphics manipulations, the recent release of compilers for general purpose GPU computing has made the power of these processors more widely available.


Here, we have demonstrated that GPU implementations deliver substantial performance increases to three geophysical simulations: a lattice-Boltzmann simulation of geofluid dynamics; a spectral-finite-element method for seismic wave propagation simulations; and a least-squares solver for interpreting magnetic-force-microscopy data. The example applications each involve different computing classes (structured grid, sparse linear algebra, and dense linear algebra) deemed fundamental to scientific and engineering programming, and hence may be considered indicative of the performance gains GPU computing can bring to a much broader set of problems in many research fields. The GPU implementations were found to deliver performance gains of 10×–30× compared to equivalent CPU implementations.

There are some clear limitations to GPU computing, in particular: (a) the amount of memory on most graphics cards is (currently) somewhat limited, although cards with larger memories specifically designed for GPGPU computing are available; (b) the speed of data transfer between the GPU and host CPU is (currently) low; and (c) the degree of complexity in the individual elements being simulated is restricted. These limitations may in part be overcome as these relatively nascent systems mature from their graphics processing origins. Other inefficiencies in general purpose GPU systems will also undoubtedly be overcome by the development of new algorithms specifically optimized for these types of computing environments. Finally, the simplest way to resolve potential limitations of individual GPU performance compared to existing parallel CPU systems may be through the development of GPU clusters. Indeed, our future work will look at sustaining the performance gains achieved in this paper over several parallel-linked GPUs, opening the way for efficient programs designed to take advantage of GPU clusters.

Although general purpose GPU computing should not be regarded as a panacea, it nevertheless offers several exciting advantages over traditional CPU computing for certain classes of problems. Particularly for many of the simulations encountered in geology and the wider scientific and engineering community, GPUs may present supercomputing opportunities that would otherwise be out of reach to all but the most advanced computing facilities. Such facilities would, in turn, also benefit from cost-effective multi-GPU cluster implementations with thousands of GPUs, each containing hundreds of processors. Such computing systems may then allow new types of simulations at unprecedented scales, resolutions, or computing speeds.

Acknowledgments

We would like to thank Joshua Feinberg, Justin Revenaugh, Erich Frahm, and Casey Battaglino for their assistance and helpful suggestions while writing this paper. We also gratefully acknowledge the use of resources from the University of Minnesota Supercomputing Institute (MSI), and thank nVIDIA for supplying the 8800 GTX graphics card. Martin Saar thanks the George and Orpha Gibson endowment for its generous support of the Hydrogeology and Geofluids research group. In addition, this material is based upon support by the National Science Foundation (NSF) under Grant nos. DMS-0724560 and EAR-0510723. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF. We thank the anonymous reviewers for their helpful comments that improved this paper.

References

Alexander, F.J., Chen, S., Sterling, J.D., 1993. Lattice Boltzmann thermohydrodynamics. Physical Review E 47 (4), R2249–R2252.
Anderson, A.G., Goddard, W.A., Schroder, P., 2007. Quantum Monte Carlo on graphical processing units. Computer Physics Communications 177 (3), 298–306.
Asanovic, K., Bodik, R., Catanzaro, B.C., Gebis, J.J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., Williams, S.W., Yelick, K.A., 2006. The landscape of parallel computing research: a view from Berkeley. Technical Report No. UCB/EECS-2006-183.
Bailey, P., Myre, J., Walsh, S.D., Lilja, D.J., Saar, M.O., 2009. Accelerating lattice Boltzmann fluid flow simulations using graphics processors. International Conference on Parallel Processing: Vienna, Austria (ICPP 2009).
Boghosian, B.M., Taylor, W., 1998. Quantum lattice-gas model for the many-particle Schrodinger equation in d dimensions. Physical Review E 57 (1), 54–66.
Bosl, W.J., Dvorkin, J., Nur, A., 1998. A study of porosity and permeability using a lattice Boltzmann simulation. Geophysical Research Letters 25 (9), 1475–1478.
Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., Hanrahan, P., 2004. Brook for GPUs: stream computing on graphics hardware. In: SIGGRAPH '04: ACM SIGGRAPH 2004 Papers. ACM, New York, NY, USA, pp. 777–786.
Chang, A.M., Hallen, H.D., Harriott, L., Hess, H.F., Kao, H.L., Kwo, J., Miller, R.E., Wolfe, R., van der Ziel, J., Chang, T.Y., 1992. Scanning hall probe microscopy. Applied Physics Letters 61 (16), 1974–1976. URL http://link.aip.org/link/?APL/61/1974/1.
Chen, S., Wang, Z., Shan, X., Doolen, G., 1992. Lattice Boltzmann computational fluid dynamics in three dimensions. Journal of Statistical Physics 68 (3/4), 379–400.
Colella, P., 2004. Defining software requirements for scientific computing. Presentation, as referenced in Asanovic et al. (2006).
Cormier, V., 2000. D'' as a transition in the heterogeneity spectrum of the lowermost mantle. Journal of Geophysical Research 105, 16193–16205.
d'Humières, D., Bouzidi, M., Lallemand, P., 2001. Thirteen-velocity three-dimensional lattice Boltzmann model. Physical Review E 63 (6), 066702.
Drmac, Z., Veselic, K., 2008a. New fast and accurate Jacobi SVD algorithm. I. SIAM Journal on Matrix Analysis and Applications 29 (4), 1322–1342.
Drmac, Z., Veselic, K., 2008b. New fast and accurate Jacobi SVD algorithm. II. SIAM Journal on Matrix Analysis and Applications 29 (4), 1343–1362.
Filippova, O., Hanel, D., 1998. Boundary-fitting and local grid refinement for lattice-BGK models. International Journal of Modern Physics C 9 (8), 1271–1279.
Finol, E.A., Amon, C.H., 2001. Blood flow in abdominal aortic aneurysms: pulsatile flow hemodynamics. Journal of Biomechanical Engineering 123 (5), 474–484.
Fournier, A., Taylor, M.A., Tribbia, J.J., 2004. The spectral element atmosphere model (SEAM): high-resolution parallel computation and localized resolution of regional dynamics. Monthly Weather Review 132 (3), 726–748.
Frandsen, C., Stipp, S.L.S., McEnroe, S., Madsen, M., Knudsen, J.M., 2004. Magnetic domain structures and stray fields of individual elongated magnetite grains revealed by magnetic force microscopy (MFM). Physics of the Earth and Planetary Interiors 141 (2), 121–129.
Ge, Z., Fu, L.-Y., Wu, R.-S., 2005. P-SV wave-field connection technique for regional wave propagation simulation. Bulletin of the Seismological Society of America 95, 1375–1386.
Ginzburg, I., Verhaeghe, F., d'Humières, D., 2008. Two-relaxation-time lattice Boltzmann scheme: about parametrization, velocity, pressure and mixed boundary conditions. Communications in Computational Physics 3, 427–478.
Gonnermann, H.M., Manga, M., 2007. The fluid mechanics inside a volcano. Annual Review of Fluid Mechanics 39 (1), 321–356.
Hammer, J.E., Cashman, K.V., Hoblitt, R.P., Newman, S., 1999. Degassing and microlite crystallization during pre-climactic events of the 1991 eruption of Mt. Pinatubo, Philippines. Bulletin of Volcanology 60 (5), 355–380.
Hari, V., 2005. Accelerating the SVD block-Jacobi method. Computing 75 (1), 27–53.
Herrmann, H.J., Flekkoy, E., Nagel, K., Peng, G., Ristow, G., 1994. Density waves in granular flow. Non-Linearity and Breakdown in Soft Condensed Matter, Lecture Notes in Physics, vol. 437. Springer, Berlin, Heidelberg, pp. 28–39.
Hersum, T., Hilpert, M., Marsh, B.D., 2005. Permeability and melt flow in simulated and natural partially molten basaltic magmas. Earth and Planetary Science Letters 237, 798–814.
Hillesland, K.E., Molinov, S., Grzeszczuk, R., 2003. Nonlinear optimization framework for image-based modeling on programmable graphics hardware. ACM Transactions on Graphics 22 (3), 925–934.
Jellinek, A.M., Manga, M., Saar, M.O., 2004. Did melting glaciers cause volcanic eruptions in eastern California? Probing the mechanics of dike formation. Journal of Geophysical Research: Solid Earth 109 (B9), B09206.
Jeong, W.K., Whitaker, R., Dobin, M., 2006. Interactive 3D seismic fault detection on the graphics hardware. In: Proceedings of the International Workshop on Volume Graphics, Boston, MA, USA, pp. 111–118.
Kaehler, R., Wise, J., Abel, T., Hege, H.-C., 2006. GPU-assisted raycasting for cosmological adaptive mesh refinement simulations. In: Proceedings of the International Workshop on Volume Graphics, Boston, MA, USA, pp. 103–110.
Kirtley, J.R., Wikswo, J.P., 1999. Scanning SQUID microscopy. Annual Review of Materials Science 29 (1), 117–148. URL http://arjournals.annualreviews.org/doi/abs/10.1146/annurev.matsci.29.1.117.
Komatitsch, D., Martin, R., Tromp, J., Taylor, M.A., Wingate, B.A., 2001. Wave propagation in 2-D elastic media using a spectral element method with triangles and quadrangles. Journal of Computational Acoustics 9 (2), 703–718.
Komatitsch, D., Michea, D., Erlebacher, G., 2009. Porting a high-order finite-element earthquake modeling application to NVIDIA graphics cards using CUDA. Journal of Parallel and Distributed Computing 69 (5), 451–460.
Komatitsch, D., Tromp, J., 1999. Introduction to the spectral element method for three-dimensional seismic wave propagation. Geophysical Journal International 139 (3), 806–822.
Ladd, A.J.C., 1993. Short-time motion of colloidal particles: numerical simulation via a fluctuating lattice-Boltzmann equation. Physical Review Letters 70 (9), 1339–1342.
Lastra, A., Lin, M., Manocha, S. (Eds.), 2004. ACM Workshop on General Purpose Computing on Graphics Processors. ACM, Chapel Hill, NC, USA, p. 78.
Lee, T., Lin, C.-L., 2003. An Eulerian description of the streaming process in the lattice Boltzmann equation. Journal of Computational Physics 185 (2), 445–471.
Li, W., Wei, X., Kaufman, A., 2003. Implementing lattice Boltzmann computation on graphics hardware. The Visual Computer 19, 444–456.
Li, Y., LeBoeuf, E.J., Basu, P.K., 2005. Least-squares finite-element scheme for the lattice Boltzmann method on an unstructured mesh. Physical Review E 72 (4), 046711.
Lindholm, E., Kligard, M.J., Moreton, H., 2001. A user-programmable vertex engine. In: SIGGRAPH '01: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. ACM, New York, NY, USA, pp. 149–158.
Liu, Q., Tromp, J., 2006. Finite-frequency kernels based upon adjoint methods. Bulletin of the Seismological Society of America 96, 2383–2397.
Marklein, R., Langenberg, K.J., Mayer, K., Miao, J., Shlivinski, A., Zimmer, A., Muller, W., Schmitz, V., Kohl, C., Mletzko, U., 2005. Recent applications and advances of numerical modeling and wavefield inversion in nondestructive testing. Advances in Radio Science 3, 167–174.
Mascarenhas, W.F., 1995. On the convergence of the Jacobi method for arbitrary orderings. SIAM Journal on Matrix Analysis and Applications 16 (4), 1197–1209.
Nickolls, J., Buck, I., Garland, M., Skadron, K., 2008. Can CUDA make parallel programming straightforward and scalable? ACM Queue 6, 40–53.
Niu, F., Chen, Q.-F., 2008. Seismic evidence for distinct anisotropy in the innermost inner core. Nature Geoscience 1, 692–696.
nVIDIA, 2008a. CUDA Programming Guide, first ed. URL http://developer.download.nvidia.com/compute/cuda/1_1/NVIDIA_CUDA_Programming_Guide_1.1.pdf.
nVIDIA, 2008b. NVIDIA GeForce GTX 200 GPU Architectural Overview. URL http://www.nvidia.com/docs/IO/55506/GeForce_GTX_200_GPU_Technical_Brief.pdf.
Owens, J.D., Luebke, D., Govindaraju, N., Harris, M., Krüger, J., Lefohn, A.E., Purcell, T.J., 2007. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum 26 (1), 80–113.
Pan, C., Hilpert, M., Miller, C.T., 2004. Lattice-Boltzmann simulation of two-phase flow in porous media. Water Resources Research 40 (1), W01501.
Patera, A.T., 1984. A spectral element method for fluid dynamics. Journal of Computational Physics 54, 468–488.
Pieters, B., Rijsselbergen, D.V., Neve, W.D., de Walle, R.V., 2007. Performance evaluation of H.264/AVC decoding and visualization using the GPU. In: Proceedings of SPIE on Applications of Digital Image Processing XXX, vol. 6696, pp. 669606-1–669606-13.
Purcell, T.J., Buck, I., Mark, W.R., Hanrahan, P., 2002. Ray tracing on programmable graphics hardware. ACM Transactions on Graphics 21 (3), 703–712 (Proceedings of ACM SIGGRAPH 2002).
Qian, Y.-H., d'Humières, D., Lallemand, P., 1992. Lattice BGK models for Navier–Stokes equation. Europhysics Letters 17 (6), 479–484.
Saar, M.O., Manga, M., 2002. Continuum percolation for randomly oriented soft-core prisms. Physical Review E 65 (5), 056131.
Saar, M.O., Manga, M., Cashman, K.V., Fremouw, S., 2001. Numerical models of the onset of yield strength in crystal-melt suspensions. Earth and Planetary Science Letters 187 (3–4), 367–379.
Salmon, R., 1999. The lattice Boltzmann method as a basis for ocean circulation modeling. Journal of Marine Research 57, 503–535.
Schmidt, J., Piret, C., Zhang, N., Kadlec, B., Liu, Y., Yuen, D., Wright, G.B., Sevre, E., 2009. Modeling of tsunami equations and atmospheric swirling flows with graphics accelerated hardware (GPU) and radial basis functions (RBF). Concurrency and Computation: Practice and Experience, in press.
Shearer, P., Earle, P., 2004. The global short-period wavefield modelled with a Monte Carlo seismic phonon method. Geophysical Journal International 158, 1103–1117.
Succi, S., 2001. The Lattice Boltzmann Equation for Fluid Dynamics and Beyond. Oxford University Press, Oxford, p. 288.
Tolke, J., 2008. Implementation of a lattice Boltzmann kernel using the compute unified device architecture developed by nVIDIA. Computing and Visualization in Science, 11 pp., doi:10.1007/s00791-008-0120-2.
Tolke, J., Krafczyk, M., 2008. Teraflop computing on a desktop PC with GPUs for 3D CFD. International Journal of Computational Fluid Dynamics 22, 443–456.
Tromp, J., Komatitsch, D., Liu, Q., 2008. Spectral-element and adjoint methods in seismology. Communications in Computational Physics 3, 1–32.
Tsuboi, S., Komatitsch, D., Tromp, J., 2005. Broadband modelling of global seismic wave propagation on the Earth Simulator using the spectral-element method. Journal of the Seismological Society of Japan 57, 321–329.
Walsh, S.D.C., Saar, M.O., 2008. Magma yield stress and permeability: insights from multiphase percolation theory. Journal of Volcanology and Geothermal Research 177 (4), 1011–1019.
Weiss, B.P., Lima, E.A., Fong, L.E., Baudenbacher, F.J., 2007. Paleomagnetic analysis using SQUID microscopy. Journal of Geophysical Research 112, B09105.
Wolf-Gladrow, D.A., 2000. Lattice-gas Cellular Automata and Lattice Boltzmann Models: An Introduction. Springer, Berlin, p. 308.
Wolfram, S., 1983. Statistical mechanics of cellular automata. Reviews of Modern Physics 55, 601–644.
Zhou, B.B., Brent, R.P., 1997. A parallel ring ordering algorithm for efficient one-sided Jacobi SVD computations. Journal of Parallel and Distributed Computing 42 (1), 1–10.
Zhu, X., Grutter, P., Metlushko, V., Hao, Y., Castaño, F.J., Ross, C.A., Ilic, B., Smith, H.I., 2003. Construction of hysteresis loops of single domain elements and coupled permalloy ring arrays by magnetic force microscopy. Journal of Applied Physics 93 (10), 8540–8542.