aeroCuda: The GPU-Optimized Immersed Solid Code Samir Patel Advisor: Dr. Cris Cecka June 23, 2012

aeroCuda: The 2-d CFD Code


DESCRIPTION

This code walks through Peskin's immersed solid implementation using Tryggvason's formulation of Chorin's projection method. It also describes the process of speeding up the code using PyCuda and PyFFT as bindings to Nvidia's CUDA framework.


Page 1: aeroCuda: The 2-d CFD Code

aeroCuda: The GPU-Optimized Immersed Solid Code

Samir Patel

Advisor: Dr. Cris Cecka

June 23, 2012


Abstract

Commercial fluid dynamics software is expensive and can be difficult to handle for transient problems involving moving objects. While open-source codes exist to handle such problems, the documentation and structure of such codes might be difficult to navigate for researchers not well-versed in computer science or students lacking a formal background in fluid dynamics. aeroCuda was developed to provide an efficient, accurate, and open-source method for testing fluid dynamics problems involving moving objects. The solution method for the Navier-Stokes equations was the Projection Method, and the effects of objects moving in fluid were implemented via Peskin’s Immersed Boundary Method. The code was first developed in serial and then parallelized via CUDA and MPI to optimize its speed. It generates and rotates a full 2-d point cloud to simulate the object’s shape, and also allows the user to implement full 2-d translational and rotational motion of the object. The results obtained for Reynolds numbers at 25 and 100 matched those obtained by Saiki and Biringen as well as Peskin and Lai; the expected physical phenomena are also confirmed.


Preface

This paper was submitted for the satisfaction of the thesis requirement for the Bachelor of Science

in Engineering Sciences at Harvard College on April 2, 2012.

My interest in the field of CFD was piqued in high school, when I first studied the Speedo LZR

Racer. Since then, I have come a long way in my understanding of CFD, both in its applications

and theoretical underpinnings. However, none of this would have been possible without the support

of many individuals who have supported me throughout my career as a student.

I would like to thank my parents and my sister for their continued support and trust in me. They

have been monumental in getting me to where I am today. I love you, Satish, Sneh, and Swati Patel!

I would like to thank my advisor, Cris Cecka, for his support in helping me bring this project to life.

There are some individuals who have supported my work as a student at Harvard without whom

I could not envision being where I am today. Special thanks to Professor Robert Wood and Dr.

Hiroto Tanaka for allowing me the opportunity to work on their robotics projects and learn from

their dedication to the subject, which helped develop my interests and skill as a researcher. Special

thanks to Professor Anette Hosoi and Ms. Lisa Burton for allowing me to begin exploring CFD

under their tutelage.

I would also like to thank those that influenced me in high school: Dr. Thom Morris, Mrs. Martha DeWeese, Mrs. Kemp Hoversten, Mr. Stephen Mikell, and Mr. Patrick Fisher. Their guidance allowed me to become the individual that I am today, and without their support I would not be where I am. In addition, I would like to thank the man who helped kindle my interest in mathematics, Mr. Farhad Azar.

I would also like to thank Assistant Professor Charbel Bou-Mosleh of the Notre Dame Univer-

sity of Lebanon, who over the course of one summer taught me to appreciate CFD and helped me

craft my beginnings as a researcher in this area.


I would like to thank Professor Charles Peskin of NYU for his support of my project (and of

course, for developing its theoretical basis).

I would like to thank Karl Helfrich of Woods Hole Institute and Mattheus Ueckermann of the

Massachusetts Institute of Technology for helping me navigate the world of CFD.

This project is dedicated to the memory of my grandfathers, a mechanical engineer and a physicist.


Contents

1 Motivation
1.1 Computational Fluid Dynamics
1.2 Moving Mesh and a Translating Cylinder
1.3 Governing Equations and Solutions
1.4 Why Immersed Boundary

2 Immersed Boundary Method and Solution to the Navier-Stokes equations
2.1 Modification of the Navier-Stokes
2.2 Developing the Forcing Term
2.3 Relationship between the Solid and Prescribed Points

3 Goal and Design Phase
3.1 Goal of aeroCuda
3.2 Reasons for Evaluation
3.3 Platforms Evaluated
3.3.1 Comsol
3.3.2 Ansys
3.3.3 openFoam
3.4 Language for the Module

4 Working with openFoam
4.1 Mesh Generation
4.2 Solver Utility
4.3 Building the Code for openFoam
4.4 Issues with openFoam

5 Development of aeroCuda
5.1 Influences
5.2 Structural Overview
5.2.1 Input
5.2.2 Pre-Processing
5.2.3 Solver-Loop
5.2.4 Post-Processing
5.3 Pre-Computation: Interior Point Generation and Rotation Capabilities
5.3.1 Motivation behind Interior Point Generation
5.3.2 Interpolating the Surface of the Geometry
5.3.3 Developing the Cloaking Mechanism
5.3.4 Developing the Delaunay Mechanism
5.3.5 Comparing the Delaunay and Cloaking Mechanisms
5.3.6 Implementing the Rotation Algorithm
5.4 Developing the Solver In Serial Code
5.4.1 Implementing the Projection Method: Steps 2 and 4
5.4.2 Implementing the Projection Method: Step 3
5.4.3 Implementing the Interpolation Step
5.4.4 Implementing the Forcing Field

6 Code Refinements and Optimization
6.1 The Variable-Spring Model
6.1.1 Motivation
6.1.2 Underlying Principle
6.1.3 Algorithm
6.2 Parallelization
6.2.1 Evaluation of MPI
6.2.2 Evaluation of CUDA
6.2.3 Going with CUDA
6.3 Implementing the CUDA-optimized Structure
6.3.1 Implementing the Interpolation
6.3.2 Implementing the Forcing
6.3.3 Implementing the Intermediate Velocity and Final Velocity Calculations

7 Results Obtained with aeroCuda
7.1 The Effect of Optimization
7.2 Numerical Confirmations
7.3 Expected Physical Phenomena and Further Validation
7.4 A Closer Look at the Physical Response of the Immersed Solid
7.5 Physical Location of the Immersed Solid Points

8 Test Case: Swimmer in Glide Position
8.1 Overview
8.2 Simulation Details
8.3 Simulation Results
8.4 Reynolds Number Transition

9 Conclusion
9.1 Numerical Improvements to aeroCuda
9.2 Technical Improvements to aeroCuda
9.3 Capability Enhancements to aeroCuda
9.4 Final Remarks

10 Finances
10.1 Resources Used
10.2 Upgrades

11 Appendix
11.1 Solving the Immersed Solid-influenced Navier-Stokes Equations
11.1.1 Step 1: Force Projection [7]
11.1.2 Step 2: Calculating the intermediate velocity field [12]
11.1.3 Step 3: Calculating the Pressure Field [12]
11.1.4 Step 4: Calculating the Final Velocity fields [12]
11.1.5 Step 5: Interpolation and Velocity [7]

1 Motivation

1.1 Computational Fluid Dynamics

The field of computational fluid dynamics (CFD) gradually arose as there was a demonstrated

need to evaluate aerodynamic, mechanical, biological, and/or environmental systems, either for

design or the study of naturally-occurring phenomena like vortex shedding. However, owing to

the complexity of solving the Navier-Stokes equations, the field of CFD grew to integrate three


disciplines (computer science, applied mathematics, and fluid dynamics) in order to develop efficient

and accurate solutions to the Navier-Stokes equations.

The most common CFD simulations involve three main steps: pre-processing, simulation, and post-processing. In the pre-processing step, the problem at hand (e.g. a 2-d cylinder in a wind tunnel) is decomposed into either 2-d or 3-d geometry depending on the dimensionality of the problem. This decomposition involves breaking the domain into a contiguous sequence of triangles or other simple geometrical shapes. For example, a 2-d cylinder would need to be partitioned into triangles before its solution could be developed. Moreover, this decomposition can take a long time if the geometries are complicated, since the elemental partitions must not have any overlaps, jagged edges, or displaced elements. This process covers only a steady-state problem; for transient solutions, moving meshes might be implemented. In such simulations, a mesh with a time-dependent orientation would be developed, allowing the simulation to take place but at a very high computational cost, given that the mesh would have to be updated to reflect the new orientation at each timestep. To reiterate, a moving mesh would be desired in the case of an object that changes either shape or orientation as time progresses.

In the simulation step, depending on the magnitude of the Reynolds number ($Re = \rho u d / \mu$) of the problem, different parameters and solution methods might need to be implemented to ensure stability of the solution. For example, in high Reynolds number problems where significant turbulence is expected, more sophisticated models might have to be applied to properly resolve the solutions. In other cases, the time-step and grid-size might have to be reduced to ensure accurate solutions. In the event that such reductions are implemented, the code must be as efficient as possible so that good solutions can be obtained in a reasonable amount of time.

In the post-processing step, the flow-fields at different times are observed, along with the convergence of the force or another field variable to its steady-state level. In this case, for an object with prescribed motion that is either periodic or constant, steady-state refers to the situation in which the forces experienced are either periodic or constant. Being able to track the convergence of the forces allows us to know when the simulation can be terminated with sufficient results.

1.2 Moving Mesh and a Translating Cylinder

To illustrate the complexity of implementing a moving mesh simulation, the case of a translating cylinder is considered. For the algorithm, the r-method outlined by Tao Tang of the University of Maryland is followed. In this method, gridpoints are moved in such a way that at each timestep, a high concentration of points is located where strong changes in the variable fields (such as pressure or velocity) are expected. To support the r-method, additional functions need to be implemented as well, such as interpolation of field variables to reflect field values at translated nodes.[10]

In the case of the translating cylinder, suppose that the cylinder moves with a timestep of $\delta t = 0.001$ s at a velocity of $u = 1$ m/s, at a Reynolds number of 100. This means that 1000 iterations are needed to see the cylinder translate 1 meter. For the flow to develop properly, usually 6-10 meters are needed before the Von Karman shedding phenomenon can be observed. Therefore, the nodes and variables are translated and interpolated, respectively, at least 6000 times to see the quantities develop. In addition, the mappings between the actual domain and a test domain needed for a finite element formulation must be taken into account as well. Depending on the clustering of nodes around the cylinder and the accuracy desired, the number of points that need to be interpolated and updated may range from tens to thousands.[10]
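The iteration counts quoted above follow directly from the travel distance, the speed, and the timestep. The short sketch below simply reproduces that arithmetic; the function name `iterations_needed` is a hypothetical helper introduced here for illustration and is not part of aeroCuda.

```python
def iterations_needed(distance, speed, dt):
    """Number of timesteps required to travel `distance` at constant `speed`."""
    # round() guards against floating-point error in the division.
    return round(distance / (speed * dt))

dt, u = 0.001, 1.0                             # values quoted in the text
per_meter = iterations_needed(1.0, u, dt)      # 1000 iterations per meter
shortest_run = iterations_needed(6.0, u, dt)   # 6000 iterations for 6 meters
longest_run = iterations_needed(10.0, u, dt)   # 10000 iterations for 10 meters
```

Each of these iterations would involve one mesh translation and one interpolation pass, which is where the cost of the moving-mesh approach accumulates.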

The complexities of the equations at hand as well as the coding would set a barrier to someone

who is not well-versed in computer science and fluid mechanics. For a student just beginning

to learn fluid mechanics, implementing a moving mesh simulation to observe the flow around a

translating cylinder is an unrealistic task. Moreover, the initial mesh itself has to be generated,

which may or may not be difficult depending on the complexity of the object.

In conclusion, many steps have to be executed at each timestep. For problems like flapping

wings which depend on rapid optimization of a variety of parameters, the overall cost of running

the simulations would be very high. The immersed boundary method offers a much less expensive

method, though at the cost of reduced accuracy (to be explained).

1.3 Governing Equations and Solutions

In CFD simulations, two primary equations are usually solved, the momentum and mass conservation equations; collectively they are known as the Navier-Stokes equations. In two dimensions, the primary quantities dealt with are the pressure $p$ and the velocity fields, $u$ and $v$. The quantity $\nu$ is the kinematic viscosity and $\rho$ is the density. Let the 2-d velocity field be denoted as $\mathbf{u} = (u, v)$. The equations together are:

Momentum: $\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} = -\nabla p + \nu \nabla^2 \mathbf{u}$ (1)

Mass: $\nabla \cdot \mathbf{u} = 0$ (2)

Together, these equations describe incompressible flow, where the fluid does not change density during the solution phase. This approximation is critical to the formulation of the Projection Method, the algorithm used to solve the equations in this project. The idea behind the projection method is that the velocity is propagated forward in time and then corrected to account for the incompressibility condition. The steps for solving the Navier-Stokes equations in this algorithm are [12]:

1. Solve for the intermediate velocity, $\mathbf{u}^* = \mathrm{RHS}$.

2. Solve for the pressure using the divergence of the intermediate flow field, $\nabla^2 p^{n+1} = \frac{\rho}{\delta t} \nabla \cdot \mathbf{u}^*$. This is done using a Fast Fourier Transform (FFT).

3. Project the intermediate velocity to get the divergence-free final velocity, $\mathbf{u}^{n+1} = \mathbf{u}^* - \frac{\delta t}{\rho} \nabla p^{n+1}$.

Note that aeroCuda solves problems with periodic boundary conditions in both the X- and Y-directions. There is no inlet or outlet flow; instead, objects move through a stationary fluid.
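Steps 2 and 3 of the projection method on a periodic grid can be sketched with NumPy's FFT routines as follows. This is a minimal illustration, not the aeroCuda implementation: it assumes a uniform grid spacing `h`, collocated velocity arrays indexed as `[y, x]`, and spectral (exact wavenumber) derivatives, whereas the actual code may use finite-difference operators. The function name `project` and its arguments are hypothetical.

```python
import numpy as np

def project(u_star, v_star, h, dt, rho=1.0):
    """Steps 2 and 3: FFT pressure solve, then velocity correction."""
    ny, nx = u_star.shape
    kx = 2.0 * np.pi * np.fft.fftfreq(nx, d=h)   # angular wavenumbers in x
    ky = 2.0 * np.pi * np.fft.fftfreq(ny, d=h)   # angular wavenumbers in y
    KX, KY = np.meshgrid(kx, ky)                 # arrays of shape (ny, nx)
    k2 = KX**2 + KY**2
    k2[0, 0] = 1.0                               # avoid 0/0; mean pressure is arbitrary

    u_hat = np.fft.fft2(u_star)
    v_hat = np.fft.fft2(v_star)
    div_hat = 1j * KX * u_hat + 1j * KY * v_hat  # transform of div(u*)

    # Step 2: laplacian(p) = (rho/dt) div(u*)  ->  -k^2 p_hat = (rho/dt) div_hat
    p_hat = -(rho / dt) * div_hat / k2
    p_hat[0, 0] = 0.0

    # Step 3: u^{n+1} = u* - (dt/rho) grad(p)
    u_new = np.fft.ifft2(u_hat - (dt / rho) * 1j * KX * p_hat).real
    v_new = np.fft.ifft2(v_hat - (dt / rho) * 1j * KY * p_hat).real
    return u_new, v_new, np.fft.ifft2(p_hat).real
```

As a quick check, the intermediate field $(\sin x + \cos y, \sin y)$ on a $2\pi$-periodic box has the divergence-free part $(\cos y, 0)$, which is what the projection should return.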

1.4 Why Immersed Boundary

Being able to modify the geometry and run simulations, without having to recreate a mesh, would be

a great step forward in efficiency. Similarly, being able to change parameters and rerun simulations

at fast runtimes would be very advantageous, especially for optimization. As an added benefit,

in the event that a decomposition of the domain is not needed to develop solutions, then simpler

solutions can be implemented with very high efficiency.

The immersed boundary method developed by Peskin allows us to do exactly this. In Peskin’s formulation, an extra forcing term is added in an attempt to enforce the desired boundary conditions in the fluid simulation (i.e. flow around the cylinder surface should match its prescribed velocity).[7] Since the forcing terms coincide with gridpoints, a Cartesian mesh can be used with simplified solver routines. This is one of the foundations of this design project. Such a routine can be implemented and optimized while retaining accuracy, making immersed boundary an attractive choice.

Figure 1: Point Decomposition

Figure 2: Mesh of a Similar Disk

For example, in Figure 1 the points within the boundary are marked as those that provide

forcing throughout the simulation (they follow prescribed motion as discussed later). However, in

Figure 2 the nodes that compose the mesh of the disk would provide the Dirichlet or Neumann

boundary conditions, depending on the type of simulation being run. Immersed boundary does not

require the regeneration of a mesh at every new timestep in the simulation, which would require

a mesh similar to that in Figure 2 to be translated (including all of the nodes) at every timestep.

Given the avoidance of this task by immersed boundary, a significant speedup in runtime is observed

and forms part of the motivation behind building a code to run immersed boundary simulations.


2 Immersed Boundary Method and Solution to the Navier-Stokes equations

Professor Charles Peskin of NYU developed the immersed boundary method, a method of solving the Navier-Stokes equations in a complicated or smooth domain with a structured grid. Peskin originally developed the immersed boundary formulation to model the fluid flow in the heart; however, it has been widely adapted to many flow problems. His formulation is outlined in the following subsections. For this specific project, it has been coupled with the projection method outlined by Tryggvason. The full scope of the problem is now addressed.

2.1 Modification of the Navier-Stokes

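In his formulation, Peskin modifies the momentum equation so that an extra forcing term, $\mathbf{f}$, is included.[6] For the equations used, Tryggvason’s ρ-normalized equation is adopted, where $p$ is the ρ-normalized pressure term.[12]

$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} = -\nabla p + \nu \nabla^2 \mathbf{u} + \frac{\mathbf{f}}{\rho}$ (3)

$\nabla \cdot \mathbf{u} = 0$ (4)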

The addition of the forcing term in the Navier-Stokes equation allows the fluid around a certain

point (with prescribed velocity) to be forced such that the prescribed velocity is observed by the

fluid field.[6] Ordinarily in the Navier-Stokes equations, either a no-slip or a slip boundary condition

would be prescribed via a Dirichlet or Neumann boundary condition on the object in the flowfield.

However, in a problem involving a moving boundary, the location of these conditions would be

dependent on the orientation and location of the mesh. By using the forcing term, the boundary

conditions are implicit in the formulation but do not need to have their locations respecified, as

the locations of the aforementioned points provide that capability.

2.2 Developing the Forcing Term

In Peskin’s original formulation of the immersed boundary method, the forcing term had a magnitude of $\kappa \left| \frac{d^2 b}{dx^2} \right| \tau$, where $\kappa$ is the membranous force constant, the derivative represents the curvature of the membrane, and $\tau$ represents the tangential vector. In doing so, Peskin allows for fluid-structure interaction to take place (fluid forcing the boundary as well as vice-versa).[6] However, in the case of immersed solids, both boundary and interior points matter. Therefore, a modification of Peskin’s implementation as a network of springs is applied. In this alternative formulation, Peskin simulates the object via springs; this method was used by Peskin and Lai to simulate the flow around a stationary cylinder with great accuracy. The force is then given by $\kappa(\mathbf{x}_p - \mathbf{x}_b)$, where $\mathbf{x}_p = (x_p, y_p)$ are the points prescribed by the user and $\mathbf{x}_b = (x_b, y_b)$ are those that move with and force the flow field.[7] While this setup is useful for simple geometries and motions, higher Reynolds number flow problems require a way of ensuring that immersed boundary points do not oscillate spuriously. The harmonic oscillator-forcing mechanism implemented by Saiki and Biringen in their study of the flow around a cylinder,

$\mathbf{f} = \kappa(\mathbf{x}_p - \mathbf{x}_s) + \beta(\mathbf{v}_s - \mathbf{v}_p)$

is used.[9] Here the subscript $s$ refers to the solid points, denoted with the subscript $b$ elsewhere in this section. The mechanism is very similar to the action of a damped harmonic oscillator and helps obtain convergence of the velocity while dissipating the energy exhibited by strongly-oscillating particles, as can be seen in the force plots discussed later. Peskin’s implementation of the forward Euler method is used to compute the integral in the code, where the position is updated as $\mathbf{x}_b^{n+1} = \mathbf{x}_b^n + \mathbf{u}_b^n \delta t$. To introduce the forcing terms into the Navier-Stokes equations, Peskin uses the Dirac delta function to transfer the boundary point’s force to an area of gridpoints via a stencil of coefficients. In addition, to get the velocity of the boundary points Peskin interpolates from the surrounding fluid velocity points via the same delta function stencil.[7] Henceforth, the immersed boundary shall be referred to as an immersed solid.
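The spring-damper forcing and the forward Euler position update described above can be written in a few lines. The sketch below is illustrative only: the function names and the numerical values of `kappa` and `beta` are assumptions. With the formula exactly as written, a damping effect requires a negative `beta`; that sign choice is an inference from the formula, not something stated in the text.

```python
import numpy as np

def spring_damper_force(x_p, x_s, v_p, v_s, kappa, beta):
    """f = kappa (x_p - x_s) + beta (v_s - v_p), applied per solid point."""
    return kappa * (x_p - x_s) + beta * (v_s - v_p)

def forward_euler(x, u, dt):
    """x^{n+1} = x^n + u^n dt."""
    return x + u * dt

# Hypothetical state of a single solid/prescribed pair.
x_p = np.array([1.0, 0.0])   # prescribed position
x_s = np.array([0.5, 0.0])   # solid point position
v_p = np.array([0.0, 0.0])   # prescribed velocity
v_s = np.array([2.0, 0.0])   # solid point velocity, interpolated from the fluid

f = spring_damper_force(x_p, x_s, v_p, v_s, kappa=10.0, beta=-1.0)
x_s_next = forward_euler(x_s, v_s, dt=0.01)
```

In this example the spring term pulls the solid point toward its prescribed location while the damper term opposes the velocity mismatch.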

2.3 Relationship between the Solid and Prescribed Points

To reiterate, there are two sets of points: the solid points, $(x_b, y_b)$, and the prescribed points, $(x_p, y_p)$. There is a one-to-one correspondence between the solid and prescribed points; each solid point tracks its prescribed point as the latter moves based on the motion specified by the user. The solid point derives its velocity from that of the fluid points surrounding it. In Peskin’s formulation, each solid point receives velocity from and projects force to all gridpoints within a radius of 2 gridspaces.[7] The velocity of a specific solid point is calculated by means of the Dirac delta function. This process is done twice: initially to calculate the damper force in the forcing equation and finally to advance the solid points. Projecting the force requires the velocity of each solid point and the distance between the solid point and its prescribed counterpart. These are provided as inputs into the forcing equation and a single value is obtained for each pair of solid and prescribed points. These forces are then transferred to the grid via the same Dirac delta function, except in this instance the value is spread to the surrounding fluid points to influence their motion. In sum:

Force Projection

• Obtain solid point velocities from surrounding fluid via Dirac delta function.

• Calculate forces via forcing equation with solid point velocities, prescribed point velocities,

and distances between each solid point and prescribed counterpart.

• Spread force to fluid points surrounding the solid point via Dirac delta function.

Point Update

• Obtain solid point velocities from surrounding fluid via Dirac delta function.

• Use forward Euler to progress solid points by respective interpolated velocities and prescribed

points by specified functional velocity.
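The interpolation and spreading operations summarized above can be sketched with a discrete delta function whose support is 2 gridspaces in each direction. The cosine-based kernel used here is one commonly used form and is an assumption; the text does not state which kernel aeroCuda uses. The sketch also assumes a square periodic grid with gridpoints at integer multiples of the spacing `h`, and all function names are hypothetical.

```python
import numpy as np

def phi(r):
    """Assumed 1-d kernel: (1 + cos(pi r / 2)) / 4 for |r| < 2, else 0."""
    r = np.abs(r)
    return np.where(r < 2.0, 0.25 * (1.0 + np.cos(0.5 * np.pi * r)), 0.0)

def stencil(xs, ys, h, n):
    """Periodic indices and weights of the 4 x 4 gridpoints near (xs, ys)."""
    ii = int(np.floor(xs / h)) - 1 + np.arange(4)
    jj = int(np.floor(ys / h)) - 1 + np.arange(4)
    weights = np.outer(phi(ys / h - jj), phi(xs / h - ii))
    return np.ix_(jj % n, ii % n), weights

def interpolate(field, xs, ys, h):
    """Velocity of a solid point from the surrounding fluid gridpoints."""
    idx, w = stencil(xs, ys, h, field.shape[0])
    return float(np.sum(w * field[idx]))

def spread(grid, value, xs, ys, h):
    """Add a point force `value` at (xs, ys) to the grid as a force density."""
    idx, w = stencil(xs, ys, h, grid.shape[0])
    grid[idx] += value * w / h**2
```

Because the kernel weights sum to one, interpolating a uniform field returns that uniform value, and the spread force density integrates (sum times `h**2`) back to the original point force.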

3 Goal and Design Phase

3.1 Goal of aeroCuda

The goal of developing aeroCuda is to design either an add-on component to an existing CFD

software or a standalone CFD code that is capable of handling immersed solid implementations

for transient Navier-Stokes problems. Given that the scale and types of problems could range very

extensively, certain targets for both user inputs and specifications were set. While the final design

did not match all of these, it did satisfy the design expectations that were initially set. These are

outlined in the following tables:

Table 1: Specifications

| Specification | Initial | Final |
|---|---|---|
| Dimensionality | 3-d | 2-d |
| Parallelization | MPI/CUDA | 9:1 CUDA:MPI |
| Numerical Accuracy | > 4th-order | 1st- and 2nd-Order |
| Object Discretization | User-Specified | Internally-Generated |
| Movement of Solid Points | Specify Positions for all Time | Prescribed Motion |
| Object Type | Deformable | Rigid |

The specifications were set to allow users to efficiently calculate solutions to problems involving rigid bodies. The efficiency comes from introducing parallelization into the code, whereby tasks are divided amongst multiple processing units rather than run on one processor. CUDA was chosen over MPI for the bulk of parallelization as it allowed for massive parallelization of very basic arithmetic operations. Concerning the object type, deformable bodies were the initial goal (Table 1); however, the concentration of points in important regions could decrease if the object expanded, leading to forcing problems (discussed in later sections). In addition, for this project it was simpler to prescribe consistent motion for the entire body; prescribing motion for all internal points would result in a drastic loss of efficiency and introduce a very complicated structure in point-dependent functions. Lastly, 2 dimensions were chosen instead of 3 because grid sizing and execution in 3-d would lead to memory limitations and slow-downs in runtime. The 3-d case can be developed if necessary.

Table 2: Solver Input and Output

| Input | Output |
|---|---|
| Nodes/Connectivity | Full Variable Fields |
| Situational Parameters | Solver Timings |
| Problem Parameters | Point Locations |
| Functional Motion | Total Force |

Part of the motivation behind this project was to place as much control as possible in the hands

of the user. To this end, the user can input any 2-d surface and leave it to the software to generate

the internal points. In addition, the motion can be prescribed through lambda functions, which

are functions of variables that do not require formal declarations. Of importance to the user is the

CFL condition, specifically making sure that enough time- and space-refinement is used to ensure

convergence and accuracy of the solution. In terms of output, almost all calculated variables and

analytics are outputted either at a certain frequency or every cycle. More in-depth analysis of the

software will be provided in upcoming sections.
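To make the two user-facing ideas in this paragraph concrete, the sketch below prescribes motion through Python lambda functions and evaluates a CFL number for a chosen timestep and grid spacing. The particular velocity functions, the helper names, and the numerical values are hypothetical; the text does not specify the exact form in which aeroCuda accepts its lambdas or the CFL threshold it targets.

```python
import math

# Hypothetical prescribed velocities of the body as functions of time t.
u_p = lambda t: 1.0                                  # steady translation in x
v_p = lambda t: 0.2 * math.cos(2.0 * math.pi * t)    # oscillation in y

def advance_prescribed(x, y, t, dt):
    """Forward Euler update of a prescribed point using the lambda velocities."""
    return x + u_p(t) * dt, y + v_p(t) * dt

def cfl_number(speed, dt, h):
    """CFL number: the number of grid cells traversed per timestep."""
    return speed * dt / h

x1, y1 = advance_prescribed(0.5, 0.5, t=0.0, dt=0.001)
cfl = cfl_number(speed=1.0, dt=0.001, h=0.01)
```

A CFL number well below one, as in this example, is the usual sign that the timestep is small enough relative to the grid spacing for an explicit scheme; the user would refine `dt` or `h` if it is not.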

3.2 Reasons for Evaluation

The final structure of aeroCuda, as well as the decision to construct a CFD code from scratch, were

both decided upon after evaluating and working with a number of existing CFD platforms. The

initial stage of the project focused exclusively on identifying a platform to implement the immersed

solid method and a coding language to develop the module. The target criterion for a platform


was a software whose solver routines could be directly interfaced with via external code.

3.3 Platforms Evaluated

3.3.1 Comsol

Comsol is a widely-used industrial solver utility. It has modules available for all disciplines of engineering, including a fluid-dynamics module. It interfaces directly with Matlab, which has an extensive library of tools that would provide good support to the user. However, Comsol would require a new solve at each timestep in order to update the locations of points and forces. In addition, Comsol did not allow for specific quantities to be placed on the field, which complicated the placement of forces and points.

3.3.2 Ansys

Ansys is the industry standard for fluid-dynamics problems. It comes with everything from a strong CAD capability to mesh generation and a great CFD solver routine in Fluent. However, the user interface can be very complicated to operate, even for very basic test cases. The CAD interface allows for the construction of great geometries, but operating ICEM-CFD (the mesher) and Fluent requires a high level of understanding. The Ansys suite allows for the implementation of an immersed solid functionality, which allows motion to be prescribed to an object that won’t deform, but has no capability for a deformable object. While it was ultimately chosen to pursue a rigid object, Ansys did not appeal due to the difficulty of engaging Fluent and working with it directly (as is needed for the immersed solid implementation).

3.3.3 openFoam

openFoam is an open-source CFD library available as a set of C++ modules that can run any type of problem. The motivation behind using openFoam is that all of the solvers are already coded, so one can go straight to implementing the immersed solid method. Additionally, the solver allows for the output of fields every certain number of cycles so a visual analysis of the field can take place. It gave control at a very low level, which meant implementing the immersed solid method would be considerably easier with this software than with Ansys and Comsol. Moreover, its open-source attributes assured that no copyright or license violations would be incurred in modifying the software.


3.4 Language for the Module

For developing the code infrastructure to support the project, Python was the language of choice. More than just being object-oriented, Python is quite easy to code in and can easily interface with a host of packages, from visualization to parallelization. Among those useful for this project were:

• Numpy: A Python math library that allows for the development and usage of arrays, with a host of functions that can work with these arrays. It also has an FFT package embedded within.

• Matplotlib: A Python plotting library that can very quickly generate contours, vector fields, and other plots.

• Pickle/H5: These two libraries allow for quick and efficient outputting of data. In the case of pickle, Python variables are output directly to a file. H5 allows for great data compression and its files (known as cubes) are very quick to write to and read from.

Moreover, many of the underlying functions used in Python libraries have already been optimized using C and C++. These two languages were also considered for the project, but interfacing them with other packages and developing visualization would have been difficult.

4 Working with openFoam

4.1 Mesh Generation

To generate the mesh, the blockMesh utility is called from within openFoam from the central direc-

tory. The blockMeshDict contains all of the mesh information and the blockMesh utility will find the

file and output it. Of the files that are outputted by blockMesh, all but the boundary file are

important to have a discretization of the structured grid that will be worked with. The boundary

file itself will be where the boundary conditions that need to be applied will be called.

4.2 Solver Utility

Once these files are produced, the solver utility testFoam is called to run the code. testFoam simply

needs to be called without any command-line parameters, as it will read and output files so long

as they follow the openFoam file structure. At each iteration, testFoam will output a time file with

the specific results of that iteration, until the total runtime of the problem is completed.


4.3 Building the Code for openFoam

Figure 3: Structure of openFoam Immersed Solid Code

The structure of the openFoam immersed solid code that was developed took the above structure.

All of the modules were produced via Python. The program works as follows and is detailed in

Figure 3:

1. Reading in the User’s Object: The user provides a node file and a connectivity file. This

allows for point placement on the grid.

2. Parsing the Mesh File: After the blockMesh utility is called to generate the desired

cartesian mesh, a parser is run to read in the mesh data. This consists of four files: the

nodes, faces, cells, and neighbors. A parser runs on each file and stores the data, which is

outputted via the pickle module to a data folder. Each specific value is stored as a key in

a dictionary, and the relevant info (node coordinates, connectivity) as the items relative to

that dictionary key.

3. Placing the Points on the Grid: Once the object data is obtained, a series of modules

is run to triangulate which faces are closest to the object’s points. This is done by iterating

through all of the faces and seeing which one’s centroid has the lowest distance relative to the

object point. Once the centroids and consequently, cells, are specified, a boundary condition

file is generated.


4. Developing the Boundary Condition File: Depending on what the user specified for

motion, a file with all of the boundary conditions (patch, cell number, value) is outputted.

The file has all of the boundary conditions listed in sequential order with the relevant data

and can be parsed by the csv.reader function in Python.

5. Parsing Input File: The user should have a file with a list of all of the boundary condi-

tions that need to be specified in the program; this would consist of a list of patches (or face

boundary conditions) which would have the information. In addition, the file should also con-

tain specifications of the mesh (grid size, spacing) and solver (time step, fluid specifications).

These files will be parsed by an input module and the data will then be generated into a

boundary condition file.

6. Generating the Initial Field: Once the boundary conditions are read, the mesh data is

used to generate the initial field which represents the problem at the initial time of 0, with

the boundary conditions reflected in the same manner.

7. Run the Solver Loop: The solver loop is executed for 1 iteration. At the next iteration,

the same process takes place.
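As an illustration of step 3, the following minimal sketch finds the closest face centroid for each object point. It is not the thesis code: the function name nearest_face_ids and the sample arrays are hypothetical, and the search is vectorized with Numpy instead of iterating face by face as described above.

```python
import numpy as np

def nearest_face_ids(face_centroids, object_points):
    """Return, for each object point, the index of the closest face centroid."""
    # Pairwise squared distances, shape (n_points, n_faces).
    diff = object_points[:, None, :] - face_centroids[None, :, :]
    dist2 = np.sum(diff ** 2, axis=2)
    return np.argmin(dist2, axis=1)

# Hypothetical data: four face centroids and two object points.
centroids = np.array([[0.5, 0.0], [1.0, 0.5], [0.5, 1.0], [0.0, 0.5]])
points = np.array([[0.9, 0.45], [0.1, 0.6]])
closest = nearest_face_ids(centroids, points)
```

The pairwise distance array grows as (number of points) x (number of faces), so for large meshes a spatial search structure would likely be a better fit.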

4.4 Issues with openFoam

The full immersed solid formulation was not implemented in openFoam, as it became apparent

upon running a first iteration of the code that the software was not a suitable choice. Multiple

issues arose:

1. openFoam had its own file structure formats, and consistently developing the input files was

not only costly (at every iteration) but also prone to errors (if even one letter was off or there

was an errant space, the program would crash).

2. openFoam required a structured mesh, with the above implementation. Because the forcing

term would be implemented via patches (facial boundary conditions), it soon became apparent

that every face would have to be specified with a certain value; linking all of the faces together

was not possible, unless the mesh was structured in this way.

3. The patch method was very quick to generate continuity errors; while these might have been

resolved, due to the amount of time available for this project, it was not feasible to pursue

this issue further.


4. The openFoam data fields were structured in the software’s specific file format, and while

openFoam had a graphical user interface known as ParaView to do post-processing, it simply

wasn’t feasible to use this for all analyses, as a faster development loop was desired.

5 Development of aeroCuda

5.1 Influences

Having evaluated the issues with openFoam, it seemed the best decision was to develop a code

from scratch that would be malleable, effective, and efficient for users of all backgrounds to use.

The final structure of the code drew its influence from codes developed by Peskin and openFoam.

From Peskin’s, the structure of the solver routine as well as the force projection, interpolation,

and advancement of the solid points were incorporated into the final version. With respect to

openFoam, the data input/output structure was adapted into the final version of the code. Given

that these codes were robust, and in the case of openFoam, well-established, they would serve

as good templates. A notable part of the final structure is the solving of the problem via an

Nvidia GPU, which allows for very significant parallelization and provides immense speedups in

the solution phase.

On his website, Peskin provided Matlab code that simulated the problem of an immersed elastic

membrane forcing fluid via a tensile force (forcing is proportional to $\kappa \left| \frac{d^2 b}{dx^2} \right|$).[8] The code served as

a template for how the immersed solid software might be structured. Since it was written in

Matlab, the code was translated to Python to get a feel for what Python functions and/or modules

would play a critical role in the CFD package. Among those that were useful were Pylab, Numpy, and

Scipy, which provided not only vectors for handling the data but also array-wise operations. Of particular

note were the pointer-referencing issues that arose in Python and not in Matlab. In Matlab, when

a variable is set to take on the value of another variable, it receives a value by copying, not by

direct memory reference. In Python, however, the data is transferred by direct memory referencing,

unless a copy function is called, creating a duplicate of the value itself. Therefore, in certain cases

where function calls and variable storage were dealt with, the code needed to be modified to ensure

the original variable wasn’t altered during the update process.
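The difference can be seen in a few lines. This is a minimal sketch with hypothetical variable names (u, alias, snapshot), using a Numpy array as the shared object:

```python
import numpy as np

u = np.zeros(3)

alias = u            # no copy is made: both names refer to the same array
alias[0] = 1.0       # this also changes u[0]

snapshot = u.copy()  # explicit duplicate of the data
snapshot[1] = 5.0    # u is left untouched by this assignment
```

After these statements, u holds the change made through alias but not the one made through snapshot, which is why an explicit copy is needed wherever the previous time step has to be preserved.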

From the Python version of Peskin’s code, there were a few important points that would figure

in the development of an immersed solid code:

• Peskin’s implementation was for an immersed membrane but this project’s goal was to sim-

ulate immersed solid bodies. A key difference is that the fluid inside the body surface does

not move if the delta stencils of the boundary points do not cover the full interior.

• Peskin’s code used for-loops and other runtime expensive mechanisms, which led to high

runtimes for large grids and/or a large number of immersed solid points.

• Peskin’s code provided a template where reconfiguration and adaptation (i.e. for array op-

erations) could provide serious optimization. Those areas which presented serious potential

for optimization were the choice of solution algorithm and the parallelization of the code.

• Spurious oscillations occurred within the code when different situations were implemented,

i.e. a wider membrane radius.

With respect to the last item, in a paper by Saiki and Biringen it was noted that spectrally-

discretized flow-solvers tend to result in spurious oscillations. In Peskin’s code, Fourier transforms

were widely used to solve the equations. This claim made by Saiki and Biringen motivated the

usage of Tryggvason’s formulation of the projection method.

5.2 Structural Overview

5.2.1 Input

Figure 4: Structure of Input

For the input, there are 3 main components (outlined in Figure 4). First, nodes that define

the surface of the object are needed. These will serve as a portion of the prescribed points (more

might be needed, as explained in the following section). Second, the connectivities of these points

are required as well, to help in guiding appropriate distance-checking and interpolation between


consecutive nodes. The nodes are inputted as an n x 2 array (x-coordinates in one column, y-

coordinates in the other) and the connectivities are also n x 2, where the ith row has the id’s of the

2 points that the ith node attaches to. Lastly, the parameters of the solve need to be provided.

These range from the constants to grid-spacing, as well as the specifications for the GPU (thread

configuration per block, block configuration per grid).

5.2.2 Pre-Processing

Figure 5: Structure of Pre-Processing

The pre-processing phase is broken into multiple steps, as shown in Figure 5. First, the nodes

and connectivities are checked for the spacing (tolerance prescribed by the user). This alerts

him/her to problematic spacing. Second, the nodes and connectivity are then taken into the “Com-

plete” module and wherever the spacing between two connected nodes is greater than the actual

grid spacing, enough points are generated between the two nodes via interpolation until the gap

is sufficiently small. Once the surface is closed, points inside the bounding surface are generated.

Regardless of whether the user wants to rotate the orientation of the object, the rotation module is

run to retrieve the angles and radii (relative to the specified origin of motion) of the points. These

are important if any angular velocity is prescribed. Lastly, each specific point is given a specific

spring constant to keep it as close as possible to the prescribed point’s position—the reason is that

external points have fewer points to rely on for additional forcing and therefore need a higher spring

constant.


5.2.3 Solver-Loop

Figure 6: Structure of Solver Loop

At the conclusion of pre-processing, the solver loop is engaged. It is a repeat of 6 steps that

feed data in and out, all shown in Figure 6. In the first step, the velocities of the solid points are

obtained via delta stencil interpolation from the variable fields. Once obtained, the forces on each

solid point are calculated via the forcing equation and projected to the grid. Next, the equations are

solved via the projection method: the intermediate velocity, pressure field, and velocity correction.

Lastly, the final velocities are obtained via interpolation, and both the prescribed and boundary solid points are

translated by their respective velocities. Just to note, all calculations take place via the GPU, to

optimize their runtimes.

5.2.4 Post-Processing

The post-processing takes place as the solver loop executes, detailed in Figure 7. There are two

types of outputs that take place. Those of type ’Transient’ are ones that take place with each

execution cycle. Those of type ’Frequency’ take place after a certain number (user-specified) of

cycles executes. Those values outputted of type Frequency tend to have lots of data and therefore

should only be outputted after a large number of cycles, otherwise a slowdown in runtime and

massive memory consumption will take place. The idea of Frequency outputs was taken from


Figure 7: Structure of Post-Processing

openFoam, as it seemed the most logical way to view variables without incurring the aforementioned

costs.

5.3 Pre-Computation: Interior Point Generation and Rotation Capabilities

5.3.1 Motivation behind Interior Point Generation

In the immersed solid formulation, interior points need to be specified inside the 2-d or 3-d geometry

to force the fluid internally, as suggested by Peskin in correspondence. To put this in perspective, if a

circle is moving at a velocity u, it should have to force the fluid on its outside only; the interior

points should be moving at the same velocity u. If interior points are not specified, then the velocity

in the interior of the circle will not be at u, as no force will be present to move the fluid at the

velocity u; this is the case with a moving membrane, which is not the focus of this project. To

make the task easier for the user, the code requires a 2-d surface to be passed in and develops the

interior points afterwards.

5.3.2 Interpolating the Surface of the Geometry

Since the immersed solid relies on points forcing the fluid, it is important that points completely

enclose the object at hand to prevent fluid from penetrating the intended boundary. To handle

this issue, an interpolation module is implemented to close gaps in the surface. It takes in a list of

nodes and connectivities to generate the surface of the object. Once completed, each point has its

connectivity checked to ensure that the distance between two nodes is less than a certain amount

(for best results, this should be smaller than the grid-spacing). If the distance between two nodes is too

large, a linear interpolation scheme is implemented by traversing a vector between the nodes and


Figure 8: Cloaking from Different Directions

placing a point every h units, where h is a tolerance defined by the user. In doing so, it is ensured

that the object has no compromising gaps.
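The gap-closing step can be sketched as follows. This is a minimal illustration under stated assumptions, not the module itself: the function name fill_gaps is hypothetical, and connectivity is taken here to be a list of (start, end) node-id pairs, one per edge, which simplifies the n x 2 layout described in the Input section.

```python
import numpy as np

def fill_gaps(nodes, connectivity, h):
    """Insert points every h units along each edge that is longer than h."""
    filled = [nodes]
    for i, j in connectivity:
        start, end = nodes[i], nodes[j]
        length = np.linalg.norm(end - start)
        if length > h:
            # New points at distances h, 2h, ... strictly before the end node.
            n_new = int(np.ceil(length / h)) - 1
            steps = h * np.arange(1, n_new + 1)
            direction = (end - start) / length
            filled.append(start + steps[:, None] * direction)
    return np.vstack(filled)

# Hypothetical example: a unit square closed with a tolerance of 0.25.
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
surface = fill_gaps(square, edges, 0.25)
```

The original nodes are kept first in the output, so any per-node data supplied by the user still lines up with the leading rows.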

5.3.3 Developing the Cloaking Mechanism

Cloaking is a mechanism developed to help construct a point cloud that most closely resembles

the object’s geometry. The mechanism is illustrated in the following figure: The principle behind

cloaking is to isolate all points that lie within a boundary by using normal vectors from all 4 sides.

The nodes are mapped to locations on the grid via the prescribed spacings, with a magnitude of 1

(all of the other gridpoints are initialized to 0). Cumulative sums are then executed from all 4 sides of

the grid, using the Numpy.cumsum function. Therefore, any point which lies in the normal vector

direction from a boundary point will have a value greater than 0, due to the cumulative sum. At

the end, the grids are taken and examined for those points with a nonzero value in all four runs;

these points form the point cloud that composes the object. The drawback to cloaking is that if

the tolerance for cloaking is less than the interpolation spacing, then there will be gaps in the solid,

which may reduce the effectiveness of the mechanism. This process is detailed in Figure 8. The

Dark Blue portions of the figures represent those points where the sum is 0; where there is color

(ranging from blue to red), the value of the sum is greater than 0 (the closer to red, the higher the

sum). All those points with nonzero values in all 4 cases are taken to form the body of the object.
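A minimal sketch of the cloaking idea is given below, assuming the boundary has already been mapped to a grid mask in which boundary cells are 1 and all others 0. The function name cloak and the test shape are hypothetical; only the use of cumulative sums from the four sides follows the description above.

```python
import numpy as np

def cloak(boundary_mask):
    """Return a boolean mask of cells that see a boundary cell from all 4 sides."""
    b = boundary_mask.astype(int)
    from_top = np.cumsum(b, axis=0)
    from_bottom = np.cumsum(b[::-1, :], axis=0)[::-1, :]
    from_left = np.cumsum(b, axis=1)
    from_right = np.cumsum(b[:, ::-1], axis=1)[:, ::-1]
    # A cell belongs to the body only if every directional sum is nonzero.
    return (from_top > 0) & (from_bottom > 0) & (from_left > 0) & (from_right > 0)

# Hypothetical example: a closed square outline on a 7 x 7 grid.
ring = np.zeros((7, 7), dtype=bool)
ring[1, 1:6] = True
ring[5, 1:6] = True
ring[1:6, 1] = True
ring[1:6, 5] = True
body = cloak(ring)
```

For a convex outline such as this one the result is the filled shape. For concave shapes the four-direction test can also mark cells that lie outside the outline, which is consistent with the boundary violations noted for the Cloaking mechanism in the comparison that follows.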


5.3.4 Developing the Delaunay Mechanism

An alternative method to the cloaking mechanism is the Delaunay Triangulation Method. While

this was originally developed to help form meshes, it has been adapted here to develop interior

points for an arbitrary 2-d geometry. As adapted and modified as necessary from the notes of

Tautges, the algorithm is as follows [11]:

1. Identify an interior point (find average (x, y) coordinate).

2. Initialize arrays to keep track of point ids, (x, y) locations, and those checked for neighbors.

3. Starting with the central point, check to see if any interior/boundary points exist within the up,

down, left, and right directions based on a radius r. If so, create a new point entry and log

its x,y coordinates along with a checked status of 0 (empty).

4. Repeat the previous step until no new points have been added after a certain number of times

executed.

The benefit of using the Delaunay method is that the generated points can very quickly conform to

the boundary of the object without distorting its actual surface just to fill the interior. In addition,

the tolerance can be adjusted to help ensure that the boundary is matched quite nicely. While the

implemented algorithm does not involve any adaptive point generation, such a capability can be

implemented in the code and would allow for more robust results.

5.3.5 Comparing the Delaunay and Cloaking Mechanisms

The following figure best depicts the effects of both mechanisms using different spacings on a NACA

6716 airfoil: From a first glance, the Cloaking mechanism appears to provide more than enough

points for the interior but which do not stay inside the shape, that is, cross the boundary (though

the violations are not too apparent). In the case of the Delaunay method, fewer points are provided

but they remain inside the boundary. While grid- and point-spacing certainly affect the outcome

of the immersed boundary, conforming to the body of the object is important in CFD, regardless

of the problem being solved. However, in the immersed boundary method, it is important the

points defining the boundary be supported by interior points. In essence, since points are moving in

the same general direction at the same general speed, their force contributions will be split across

the surrounding boundary points. Therefore, the point generation mechanism must be able to place


Figure 9: Interior Point Generation using Both Mechanisms

points very closely to the boundary. Since the Cloaking mechanism does this more efficiently, it is

used to generate the point clouds for the following simulations.

5.3.6 Implementing the Rotation Algorithm

To allow the user to test different angles of attack or orientations, a rotation module was imple-

mented to provide the geometry with a certain angular orientation. The general structure behind

the rotation algorithm is as follows:

1. Calculate where the central point of the geometry lies.

2. Shift the entire object to be centered over origin.

3. Get distance from origin to all points.

4. Get angles of all points relative to origin by converting them to complex vectors and using

the angle function in Python.

5. Add the theta desired to all of the angles.

6. Use the r(cosθ, sinθ) formulation to regenerate the points.

7. Shift them back to the original central point.
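The seven steps above can be sketched as follows. This is a minimal illustration: the function name rotate_points is hypothetical, and the central point is assumed here to be the mean of the point coordinates, which the text does not specify.

```python
import numpy as np

def rotate_points(points, theta):
    """Rotate an (n, 2) array of points by theta radians about their mean."""
    center = points.mean(axis=0)                # step 1 (assumed: mean of points)
    shifted = points - center                   # step 2
    radii = np.hypot(shifted[:, 0], shifted[:, 1])          # step 3
    angles = np.angle(shifted[:, 0] + 1j * shifted[:, 1])   # step 4
    new_angles = angles + theta                 # step 5
    rotated = radii[:, None] * np.column_stack(
        (np.cos(new_angles), np.sin(new_angles)))           # step 6
    return rotated + center                     # step 7

# Hypothetical example: a horizontal segment turned by 90 degrees.
segment = np.array([[1.0, 0.0], [-1.0, 0.0]])
turned = rotate_points(segment, np.pi / 2)
```

The radii and angles computed in steps 3 and 4 are the same quantities that the pre-processing stage retains for prescribing angular velocity.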


5.4 Developing the Solver In Serial Code

The projection method has 5 steps that need to be solved. The algorithm presented here summarizes

the full solution of the algorithm detailed in the Solver-Loop subsection of the Structural Overview

section:

1. An interpolation of velocities from the field and a projection of the calculated forces to the

field [8]

2. An explicit solve for the intermediate velocity [12]

3. An implicit solve for the pressure field to correct the intermediate velocity [12]

4. An explicit solve for the final velocity via pressure correction [12]

5. An interpolation of velocities from the field and an update of the prescribed and solid locations

[12]

5.4.1 Implementing the Projection Method: Steps 2 and 4

Steps 2 and 4 are the easiest to implement since they are explicit and involve shifting operations.

For the simulations of this project, it is important to note that periodic boundary conditions are

enforced, so over a domain of size [0, L]× [0, L], the conditions x(0) = x(L) and y(0) = y(L) hold

for all variables and their derivatives. It would be important to make sure that the cells on the

boundaries read their data from those on the opposite if the applied operator requires a cell past

the boundary. In Python, such an operation can be implemented via the Numpy.roll function.

This function allows for the shifting of an array of n dimensions via a specific axis and by a certain

magnitude. Therefore, the second step of the algorithm was laid out as follows.

• un= field velocity (ux, uy) at step n, f = force field, us = intermediate velocity (us, vs),

δx=x-spacing, δy = y-spacing, ρ = density, δt = timestep, ν = viscosity

• Define the function partial-first(variable, spacing, magnitude, axis): (roll(variable,-1, axis)-

roll(variable, 1, axis))/(2*spacing)

• Define the function partial-second(variable, spacing, magnitude, axis): (roll(variable,-1,axis)-

2*variable + roll(variable,1,axis))/pow(spacing,2)


• us= un + δt*(-1*(partial-first(un,δx,1,2)*ux + partial-first(un,δy,0,2)*uy) + ν*(partial-second(un,δx,1,2)

+ partial-second(un,δy,0,2)) + f/ρ)

Likewise, for the fourth step of the algorithm:

• un+1 = field velocity at step n+ 1, p = pressure

• un+1 = us- δt*partial-first(p,***,1,2)

*** denotes relevant axis (x-axis: δx, y-axis: δy)

In utilizing the Numpy.roll function, two benefits are gained. First, because the Numpy func-

tions are implemented in compiled C code and operate array-wise, the cost of iterating through the array via looping

is avoided. Secondly, the roll function implicitly accommodates periodic boundary conditions, help-

ing to avoid conditional statements to ensure the nodes on the boundary and interior are treated

properly.
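A runnable sketch of the periodic difference operators and the intermediate-velocity update is given below. It is an illustration under stated assumptions, not the thesis code: the arrays are assumed to be indexed as [y, x] (so axis 1 is the x-direction and axis 0 the y-direction), the unused magnitude argument is dropped, and the names partial_first, partial_second, and intermediate_velocity are hypothetical.

```python
import numpy as np

def partial_first(var, spacing, axis):
    """Central first difference with periodic wrap-around."""
    return (np.roll(var, -1, axis=axis) - np.roll(var, 1, axis=axis)) / (2.0 * spacing)

def partial_second(var, spacing, axis):
    """Central second difference with periodic wrap-around."""
    return (np.roll(var, -1, axis=axis) - 2.0 * var
            + np.roll(var, 1, axis=axis)) / spacing ** 2

def intermediate_velocity(ux, uy, fx, fy, dx, dy, dt, nu, rho):
    """Explicit update: advection, diffusion, and forcing applied to (ux, uy)."""
    def rhs(q, f):
        advection = ux * partial_first(q, dx, 1) + uy * partial_first(q, dy, 0)
        diffusion = nu * (partial_second(q, dx, 1) + partial_second(q, dy, 0))
        return -advection + diffusion + f / rho
    return ux + dt * rhs(ux, fx), uy + dt * rhs(uy, fy)
```

Because np.roll wraps the array ends, the same expressions apply to boundary and interior cells, so no conditional statements are needed.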

5.4.2 Implementing the Projection Method: Step 3

In the description of the algorithm used, it was outlined that the FFT was used to solve the Poisson

equation. However, this was only arrived at after considering the implementation of the matrix

solution method. The Poisson equation, given as,

$$\frac{\partial^2 p}{\partial x^2} + \frac{\partial^2 p}{\partial y^2} = f$$

takes the following form when decomposed via finite-differences:

$$\frac{p_{i,j+1} - 2p_{i,j} + p_{i,j-1}}{(\Delta x)^2} + \frac{p_{i+1,j} - 2p_{i,j} + p_{i-1,j}}{(\Delta y)^2} = f_{i,j}.$$

A matrix method like BICGSTAB can be utilized to solve this equation. The coefficient matrix

would have five bands, since there are five variables involved in each equation, shown in Figure 10.

From a computational perspective, this means that for every point on the computational grid,

there are 5 values to be stored in the matrix. Since the smallest grid used is of size (512,512),

about 10 MB is allocated for the coefficient matrix. While a method like BICGSTAB can indeed work

with a coefficient matrix of this size, it would require many iterations in addition to ensuring that

memory allocation is not a problem (creating the matrix as outlined resulted in a MemoryError

being called by Numpy). Since a speedy and accurate CFD solution is desired, and one that does


Figure 10: Coefficient Matrix Structure for Poisson Equation on 8 Node x 8 Node grid

not require a massive number of cores to run, implementing a spectral solution to the Poisson

equation is an efficient way of obtaining a good solution to the equation.
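A minimal sketch of an FFT-based Poisson solve on a periodic grid is shown below. It is one possible variant and not necessarily the one used in aeroCuda: the function name solve_poisson_periodic is hypothetical, arrays are assumed to be indexed as [y, x], and the symbol of the five-point stencil is used in Fourier space so that the result satisfies the finite-difference equation above. A purely spectral variant would divide by -(kx^2 + ky^2) instead.

```python
import numpy as np

def solve_poisson_periodic(f, dx, dy):
    """Solve the five-point discrete Poisson equation with periodic boundaries."""
    ny, nx = f.shape
    kx = 2.0 * np.pi * np.fft.fftfreq(nx)   # radians per grid cell, x-direction
    ky = 2.0 * np.pi * np.fft.fftfreq(ny)   # radians per grid cell, y-direction
    # Eigenvalues of the periodic five-point Laplacian for each Fourier mode.
    lam = ((2.0 * np.cos(kx)[None, :] - 2.0) / dx ** 2
           + (2.0 * np.cos(ky)[:, None] - 2.0) / dy ** 2)
    lam[0, 0] = 1.0                         # placeholder: avoid dividing by zero
    p_hat = np.fft.fft2(f) / lam
    p_hat[0, 0] = 0.0                       # pressure is fixed to zero mean
    return np.real(np.fft.ifft2(p_hat))
```

No coefficient matrix is stored: the memory cost is a few arrays of the grid's size, and the work is dominated by two 2-d FFTs per solve.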

5.4.3 Implementing the Interpolation Step

The delta function stencil is a 4x4 stencil but has uniform x and y values which are multiplied

together by the field values. In his code, Peskin conducted this interpolation in the following

method [8]:

1. Calculate location of point, radius, and other necessary parameters

2. Iterate through all of the points

3. Multiply stencil by field values and get the total sum

For a quantity of 1000 points, the time to execute such a loop would be very large. Therefore,

it is important to vectorize these calculations and avoid looping to produce quick iterations. To

do this, the stencil should be examined: it is a combination of 16 coefficients multiplied by 16

corresponding values from the field. Therefore, for each immersed solid point there are 4

unique delta values in the x-dimension and 4 unique delta values in the y-dimension. Therefore, for

each immersed solid point two 4×4 arrays are generated, each with the x-values uniform across the


rows and the y-values uniform across the columns. Once obtained, the x-value arrays are stacked

on top of each other and y-value arrays aligned next to each other using the Numpy.column_stack

and Numpy.row_stack functions, respectively. This gives two 4n×4 size matrices. To get the full

delta values, another set of matrices, one that has the x-values and y-values of the corresponding

points, is generated. Now the Numpy.flatten function is used on the arrays to convert them all

to 1-dimensional vectors (i.e. flatten(2 × 16 vector) = 1 × 32 vector). Numpy lends motivation

to this idea, as an array of values can be yielded by some variable if a 1-dimensional array or list

(multiple dimensions are not supported) is passed as the index. The delta values become relatively

easy to work with, as the list of relevant field values is multiplied by the x- and y- delta vectors.

The resulting values are then taken, and using the Numpy.reshape function to convert them back to

nx16 matrices, the Numpy.sum function is executed across the 1 axis (horizontal, or row-wise)

to add up all of the values and return the relevant u and v velocities for each solid point.
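The loop-free interpolation can be sketched as follows. This is a simplified illustration and not the thesis implementation: broadcasting replaces the explicit stacking and flattening described above, the cosine form of a 4-point delta kernel is assumed (the kernel actually used may differ), the grid is assumed periodic with equal spacing h in both directions, and the names phi, stencil, and interpolate are hypothetical.

```python
import numpy as np

def phi(r):
    """Cosine form of a 4-point delta kernel (assumed), zero for |r| >= 2."""
    r = np.abs(r)
    return np.where(r < 2.0, 0.25 * (1.0 + np.cos(0.5 * np.pi * r)), 0.0)

def stencil(points, h, shape):
    """Flat grid indices and delta weights, each of shape (n_points, 16)."""
    ny, nx = shape
    offsets = np.arange(-1, 3)                         # 4 nodes around each point
    sx = points[:, 0] / h                              # positions in grid units
    sy = points[:, 1] / h
    ix = np.floor(sx).astype(int)[:, None] + offsets   # (n, 4) column indices
    iy = np.floor(sy).astype(int)[:, None] + offsets   # (n, 4) row indices
    wx = phi(sx[:, None] - ix)                         # (n, 4) weights in x
    wy = phi(sy[:, None] - iy)                         # (n, 4) weights in y
    n = len(points)
    weights = (wy[:, :, None] * wx[:, None, :]).reshape(n, 16)
    flat = ((iy % ny)[:, :, None] * nx + (ix % nx)[:, None, :]).reshape(n, 16)
    return flat, weights

def interpolate(field, points, h):
    """Interpolate a [y, x]-indexed field to the solid points, without loops."""
    flat, weights = stencil(points, h, field.shape)
    return np.sum(field.ravel()[flat] * weights, axis=1)
```

Passing the (n, 16) index array to the flattened field retrieves all stencil values at once, and the row-wise sum returns one interpolated value per solid point.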

5.4.4 Implementing the Forcing Field

The projection of the forcing field onto the grid is similar to the interpolation step, except in this

case values are passed to the grid instead of taken from it. Assuming that the forces have been calculated

for all solid points, the force value at each solid point needs to be projected to the surrounding

points using the delta-function. In addition, this property is additive, meaning that other points in

the vicinity might be affecting the same gridpoint and so the forces will need to be added together.

First, a force variable of the grid’s size is initialized to 0 and converted to a 1-d vector via the

Numpy.flatten command. The same delta-stencil and global location arrays are implemented as

in the interpolation step. However, instead of retrieving values from the grid, similarly-sized field

value arrays are created by repeating our force values in 16 sets; therefore, if the array is [1,2,3...]

the new array will have the 0-15th indices corresponding to 1, the 16-31st indices corresponding to

2, and so forth. These are multiplied by the delta-matrices and the force vector, yielding the stencil

values. The global location values are then used to initialize a defaultdict dictionary pointing

to a list. A defaultdict is an object in Python that allows for one to place values with certain

keys based on an object type, like a list or a float. This satisfies the need well, since grouping by location

is desired. Therefore, a Python generator (which takes much less time than a for-loop since it does

not create the object in memory) is used to iterate through the global location vector and place the stencil

values with their appropriate locations. Then the stencil values are summed at each point using

another defaultdict, except one that’s initialized to a float (think of this as the reduce-portion of


map-reduce). Since the global locations are stored as the dictionary keys and the force magnitudes

as their values, passing these to the force grid is an easy process. Both keys and values can be

isolated as lists, and the relevant gridpoints can be augmented by passing the keys directly to the

force grid (Force-grid[keys]) and adding the values (Force-grid[keys] += values). The reshape function

is then used to reshape force-grid to the size of the domain.
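The accumulation step can be sketched with a single defaultdict. This is a minimal illustration under stated assumptions: the function name spread_forces is hypothetical, the stencil indices and weights are taken as given inputs (with any scaling by the grid spacing already folded into the weights), and the grouping and summing are merged into one pass instead of the two dictionaries described above.

```python
from collections import defaultdict

import numpy as np

def spread_forces(forces, flat_index, weights, grid_shape):
    """Add each point's weighted force to the grid nodes in its stencil."""
    # Repeat each force once per stencil entry (16 for a 4x4 stencil).
    contributions = np.repeat(forces, weights.shape[1]) * weights.ravel()
    totals = defaultdict(float)
    for location, value in zip(flat_index.ravel(), contributions):
        totals[int(location)] += value      # overlapping stencils add up
    keys = np.fromiter(totals.keys(), dtype=int)
    values = np.fromiter(totals.values(), dtype=float)
    force_grid = np.zeros(grid_shape).ravel()
    force_grid[keys] += values              # keys are unique, so += is safe
    return force_grid.reshape(grid_shape)
```

The dictionary step matters because Force-grid[keys] += values applies only one update per index when an index is repeated; reducing to unique keys first avoids losing contributions. Numpy's np.add.at function performs the same unbuffered accumulation directly on the repeated indices.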

6 Code Refinements and Optimization

6.1 The Variable-Spring Model

6.1.1 Motivation

In the immersed solid method, the outermost layer of solid points is responsible for breaking the

flow as the object moves. Consequently, these points are the ones that also happen to shift positions

the most (due to fluid forces) and thereby are most likely to begin a chain of displacement within

the layers of surrounding points. The easy solution would be to raise all the spring constants

to massive levels; however, this is not feasible since the object (at the beginning of its motion)

would be destroyed by a massive spring force from the initial motion. However, by raising the

stiffness of those points in areas with fewer solid points, more force can be effected by those points

to compensate for the compounding effect of having multiple points forcing the same gridpoint.

Raising stiffness also ensures that the solid points will closely follow the prescribed points, with

higher forces being the penalty for widening distances. Therefore, the variable spring model is

proposed.

6.1.2 Underlying Principle

In the variable-spring model, spring constants are inversely related to the number of sur-

rounding points. The reason traces back to Peskin’s delta function. Since it is a 4x4 stencil,

neighboring points are more than likely to overlap on the same gridpoints; as a result, their forces

will compound, applying a much stronger spring force than an individual point alone. However,

if a point is rather secluded in the geometry (on the surface or the point end of an airfoil), that

point must have its spring constant raised to compensate for the fewer surrounding points but also

having to deal with the boundary layer.


6.1.3 Algorithm

In the variable-spring model, the algorithm implemented is as follows:

• Produce the distance vector of one point to all of the points in the object

• Run a logic statement to find those within a specified radius

• Sum up the logic “1” values to find the total

• Repeat the above for all points in the solid.

• User prescribes a slope to apply based on the number of surrounding points and initial κo.

• Let Max denote the largest number of solid points within the specified radius of a solid point

and Surr denote the number of solid points within the specified radius of the specific solid

point being dealt with. Let m denote a slope constant prescribed by the user to specify how

much the spring constant should be raised for every point lying in the vicinity.

• Once the maximum number of surrounding points is identified, assign the spring constant:

$$\kappa_i = \kappa_o \left(1 + m\,(\mathrm{Max} - \mathrm{Surr} + 1)\right)$$
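A serial sketch of the algorithm is given below. It is an illustration only: the function name variable_spring_constants is hypothetical, the neighbor count is assumed to include the point itself, and all pairwise distances are formed at once instead of one point at a time.

```python
import numpy as np

def variable_spring_constants(points, radius, kappa0, m):
    """Assign a stiffer spring to points with fewer neighbors within radius."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=2))
    # Logic statement summed row-wise; each count includes the point itself.
    surr = np.sum(dist <= radius, axis=1)
    return kappa0 * (1.0 + m * (surr.max() - surr + 1))

# Hypothetical example: three clustered points and one isolated point.
cloud = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
kappa = variable_spring_constants(cloud, radius=1.5, kappa0=100.0, m=0.5)
```

In this example the isolated point receives the largest constant and the middle point of the cluster the smallest. The full distance matrix needs memory proportional to the square of the number of points, which is one reason to split the points into groups as discussed in the next section.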

6.2 Parallelization

While the algorithm itself is not optimized for speed, it is easily parallelizable mainly because the

operations employed in the solution involve basic arithmetic steps that involve data from multiple

points. Since the algorithm has to be repeated for all points, the process can be executed by n

processors if the points are split up into n groups. Each processor will then work on its group and

return the value. Things are made easier by the MPI scatter and gather functions, which allow

for the groups to be sent to respective processors and the same groups to be returned in the right

order, respectively. Therefore, there is no issue with synchronization and the order of retrieval.

The values are simply passed out, the function executed, and the outputs gathered and concatenated into a 1-d vector with length equal to the number of immersed solid points. The algorithm above was initially implemented in a serial code. However, it took a considerable amount of time to run, even for the most basic cases. Therefore, the focus shifted to optimizing the code via parallel processing. To this end, two options existed: using MPI or Nvidia's CUDA GPU computing platform.
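The scatter and gather pattern described above can be illustrated without MPI. The sketch below is a serial emulation only: `scatter_apply_gather` is a hypothetical helper, and the list comprehension stands in for the work that separate processors would do in a real MPI run.

```python
import numpy as np

def scatter_apply_gather(values, func, n_groups):
    """Split `values` into ordered groups, apply `func` to each, and rejoin."""
    groups = np.array_split(np.asarray(values), n_groups)  # "scatter"
    partial = [func(group) for group in groups]             # per-processor work
    return np.concatenate(partial)                          # "gather", same order

values = np.arange(10.0)
result = scatter_apply_gather(values, lambda group: group ** 2, n_groups=3)
```

Because the groups are rejoined in the order they were handed out, the output lines up element by element with the input, which is the ordering property the text relies on.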

6.2.1 Evaluation of MPI

If MPI were used, the structure of the program would be as follows. For the interpolation scheme, the immersed solid points would be scattered (broken up into n arrays for n processors to work on) amongst the different processors and the velocities gathered back (collecting the processors' computed values). For the force projection, the force grid (as an array of zeros) would be broadcast (the same copy sent) to all processors and each processor would add its projections to the grid. The outputs would be gathered and added up to obtain the grid values. The calculation of

the intermediate velocity could be implemented via a domain decomposition method with ghost

cell transfers. The most difficult step would be the Poisson equation, as this would require an

MPI-version of BICGSTAB to be implemented. The spectral Poisson solution would be a waste

to implement via MPI, as the FFT is essentially a global operation. An implementation of an

FFT algorithm might involve some sort of master-slave algorithm where one processor serves as

the distributor of data that needs to be processed. As the other processors execute jobs, the central

processor retrieves the data from the completed processors and provides new data to be processed.

This process continues until the full operation is complete. Once done, the central processor would

have to transpose the matrix and then pass out new arrays to have the FFT run again. This would require a large amount of code that might not even offer a speedup. Given the goals of this project, this would both detract from the malleability of the code and prevent it from running faster.

6.2.2 Evaluation of CUDA

If CUDA were used, the structure would be as follows. CUDA grants control of individual threads, of which millions can be launched on a GPU, enabling fine-grained control of each grid point value. Therefore, the code can be parallelized at a level which would not be possible with MPI (or would be possible, but would require a vast amount of resources and code). For the interpolation code, one point could be assigned to each thread, whose job would entail computing the full stencil for that specific point and returning the velocity, eliminating the need for vectorization. For the forcing implementation, the same process would be used but the values would be stored in an array with a corresponding global id array, so that a group of threads could run the reduction very efficiently.

The intermediate velocity calculation could also be run very quickly, as each thread needs to read

the values from the cells surrounding it and execute two lines of operations. The Poisson equation

can be solved using FFT libraries that exist with Python bindings to CUDA. The velocity correction

could be executed in a manner similar to the intermediate velocity calculation step. The source code required for CUDA, though more complicated, would be concise, and CUDA would also help in another way. With CUDA, gpuarrays (references to arrays held in device memory) are allocated and left on the device, avoiding the need to pass data back and forth between the host and device (this can be avoided in MPI but would take much longer to implement and be much more complicated than the CUDA code).

6.2.3 Going with CUDA

Having thought about both approaches, the CUDA implementation appeared to be more feasible. It would be cleaner and more effective by allowing a much more low-level approach than MPI. While it wouldn't allow for functions like numpy.roll to be used, it would provide a greater speedup by allowing for thread-based approaches.

Figure 11: Technical Structure of CUDA [3]

The technical structure of an Nvidia GPU worked quite well with the solution method employed

by aeroCuda to solve the Navier-Stokes equations. The structure is detailed in Figure 11. Each Nvidia GPU exposes 3 levels of operation: the grid, the block, and the thread. The hierarchy, as

shown in the relevant figure, functions as follows:

1. Thread: This is the lowest level of the hierarchy. It functions as a worker for executing the

functions and can access local, shared, or global memory. Local memory is private to each thread.

2. Block: This is a group of threads that function together. Blocks are important because shared memory can be accessed by all threads in a block, and it is quicker to read from and write to than global memory. On the GPU used here, a block can hold up to 1024 threads (for example, 32 × 32).

3. Grid: This is a group of blocks that forms the basis of the computational grid. Only global

memory exists on this level.

It is also important to recognize that the GPU is separate from the CPU (the host), so memory must be allocated on the GPU to hold the computed data. The PyCUDA package developed by Klöckner does exactly that and more.[5] Klöckner's gpuarray module allows for the creation of arrays on the GPU that have properties similar to NumPy arrays and also allows operations between arrays to be conducted on the GPU, providing a further speedup. The PyCUDA package allows CUDA to be engaged from a very high level while still using functions optimized for the necessary operations.

The projection method with the immersed solid formulation has 4 explicit steps and 1 implicit

step. For the explicit steps, CUDA kernels (functions) can be written to execute them. For the implicit step, FFTs are needed. Nvidia developed the cuFFT package to run FFTs using the CUDA programming structure; in Python, the pyFFT package developed by Bogdan Opanchuk works with PyCUDA so that FFTs can be run directly on gpuarray objects. The following sections will describe the programming scheme.

It is worth noting that CUDA will take an n-dimensional variable and decompose it to a 1-d vector, whereby the indexing is carried through by the block and thread level. Therefore, in the following outlines of the CUDA algorithm, all global variables/quantities (while they might be 2-dimensional) are actually 1-dimensional when transferred to the GPU.
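A minimal sketch of this flattening, assuming a row-major layout and 2-d blocks. It is plain Python standing in for the `blockIdx`, `threadIdx`, and `blockDim` values that CUDA provides inside a kernel; the function name and the tuple arguments are illustrative only.

```python
def global_index(block_idx, thread_idx, block_dim, grid_width):
    """Row-major flat index of the grid point handled by one thread.

    block_idx, thread_idx and block_dim are (x, y) tuples;
    grid_width is the number of grid points per row.
    """
    col = block_idx[0] * block_dim[0] + thread_idx[0]
    row = block_idx[1] * block_dim[1] + thread_idx[1]
    return row * grid_width + col

# Thread (2, 3) of block (1, 1), with 4 x 4 blocks on an 8-point-wide grid.
idx = global_index(block_idx=(1, 1), thread_idx=(2, 3), block_dim=(4, 4), grid_width=8)
```

Here the thread sits at column 6, row 7, so it owns flat element 7 * 8 + 6 = 62 of the 1-d array.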

6.3 Implementing the CUDA-optimized Structure

6.3.1 Implementing the Interpolation

In the case of interpolation, the vectorizing process is completely averted. Since n immersed solid points exist, n threads can carry out the interpolation scheme, one for each point. The parameters for each point ($x_r$, $y_r$, $r_x$, $r_y$, etc.) are calculated in a similar fashion. However, for the stencils, each

point has a double for-loop that iterates through all of the possible indices. In each iteration, a

new φ(x)φ(y) is calculated and multiplied by the relevant point, which the thread reads from the

field variable (this is stored in global memory, since it is available to all threads). Once the threads

have completed, they write the interpolated values to an n-length vector.
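The work done by a single thread can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the aeroCuda kernel: the helper names are hypothetical, the four weights follow the phi function listed in the appendix with the displacement expressed in units of the grid spacing, the field is stored row-major, and no boundary handling is included (the point is assumed to lie far enough inside the grid for the full 4 × 4 stencil to exist).

```python
import math

def phi_weights(r):
    """Four-point delta weights for stencil offsets -1, 0, 1, 2 (0 <= r < 1)."""
    base = (3.0 - 2.0 * r + math.sqrt(1.0 + 4.0 * r - 4.0 * r * r)) / 8.0
    return [(6.0 - 4.0 * r) / 8.0 - base,   # offset -1: phi(r + 1)
            base,                            # offset  0: phi(r)
            (4.0 * r - 2.0) / 8.0 + base,    # offset  1: phi(r - 1)
            0.5 - base]                      # offset  2: phi(r - 2)

def interpolate_point(field_flat, nx, xb, yb, dx, dy):
    """Interpolate a flattened (row-major) field at the solid point (xb, yb)."""
    xr, yr = int(math.floor(xb / dx)), int(math.floor(yb / dy))
    wx = phi_weights(xb / dx - xr)
    wy = phi_weights(yb / dy - yr)
    total = 0.0
    for i in range(-1, 3):          # stencil rows (y-direction)
        for j in range(-1, 3):      # stencil columns (x-direction)
            total += wy[i + 1] * wx[j + 1] * field_flat[(yr + i) * nx + (xr + j)]
    return total

# Test field u(x, y) = x on a 16 x 16 grid with spacing 0.1.
nx, ny, dx = 16, 16, 0.1
field = [j * dx for i in range(ny) for j in range(nx)]
value = interpolate_point(field, nx, xb=0.53, yb=0.27, dx=dx, dy=dx)
```

For a field that varies linearly in x, the interpolated value reproduces the x-coordinate of the solid point, which makes this a convenient sanity check for a kernel.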

6.3.2 Implementing the Forcing

Implementing the forcing is slightly more complicated than implementing the interpolation. Multiple arrays are needed for this implementation. In one array, global IDs of the force projections will need to be stored (the mapping of global IDs is shown in Figure 12). For n points, this array will have to contain 16n elements, to ensure that each projection is written to a different space. In addition, two arrays will have to be created, of the same 16n length, to store the magnitudes of the corresponding forces (one array per direction). In the fourth and fifth arrays, the full force grid will need to be assembled, to store the total forcing at each point (if it is actually forced, else the value is just 0).

Figure 12: Thread-to-Point Mapping Diagram


In the first step, for each immersed solid point a double for-loop is engaged. If it is the ith

solid point, it will write to the [16i, 16i + 15] indices of the global ID and the corresponding force

vectors. Therefore, all threads require 16 total iterations to get all the projected forces. The issue

now becomes writing to the grid. In CUDA, a common issue is that of thread racing, whereby multiple threads try to write to the same global memory location or shared memory location. If the accesses are not properly sequenced, multiple threads can write or read at the same time, resulting in wrong values being written or read. Therefore, the threads cannot all simply write to the grid at once. However, recall that the stencil has 16 unique points; therefore, in the global ID vector, every group of 16 consecutive values is distinct, starting from

the beginning. Therefore, if 16 threads execute a for-loop of length n, then at iteration k (running from 0 to n - 1) threads 0 through 15 read from the [16k, 16k + 15] indices of the corresponding force vector. They then take the [16k, 16k + 15] values of the global ID vector and augment the respective locations on the full forcing grid. The threads are synchronized using the __syncthreads() command to ensure that no

threads begin executing the next for-loop, since they might interfere with the reads and/or writes

of the threads still completing the previous for-loop.
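The accumulation loop can be emulated serially as below. This is a sketch of the access pattern only, written in plain Python: `accumulate_forces` and its arguments are hypothetical names, the inner loop stands in for the 16 threads, and the comment marks where the barrier would sit on the GPU.

```python
def accumulate_forces(global_ids, force_vals, grid_size):
    """Add 16 projected force values per solid point onto a flattened grid."""
    n_points = len(global_ids) // 16
    grid = [0.0] * grid_size
    for k in range(n_points):        # sequential loop over solid points
        for lane in range(16):       # the 16 threads, one per stencil entry
            slot = 16 * k + lane
            grid[global_ids[slot]] += force_vals[slot]
        # On the GPU, a __syncthreads() barrier would separate iterations here.
    return grid

# Two solid points whose stencils overlap on grid ids 8..15.
ids = list(range(0, 16)) + list(range(8, 24))
vals = [1.0] * 32
grid = accumulate_forces(ids, vals, grid_size=24)
```

Within one group of 16 the ids are all different, so the 16 lanes never touch the same grid location; overlaps only occur between groups, and those are separated by the barrier. In the example, the eight shared locations end up with a value of 2 and the rest with 1.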

6.3.3 Implementing the Intermediate Velocity and Final Velocity Calculations

Since both steps involve explicit finite differencing, the task is fairly straightforward. Referring

to the following figure depicting the layout of the threads and the computational grid, so long as

the number of gridpoints does not exceed the number of threads, every point will have a unique

thread assigned to compute its value. Since the values are being stored in new (intermediate or

final velocity) arrays, there is no issue with race conditions between threads. Therefore, the crux

of the task at hand is to compute the proper ids of those points needed for computing the relevant

center point’s value. Since the dimensions of the grids and blocks are set by the user (in addition

to the solver parameters), the index can be calculated either blockwise (as done in this code) or

row/column-wise; it depends on the user’s preference.
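As an illustration of the id computation, the sketch below evaluates a centered x-derivative with one loop iteration standing in for each thread. It assumes row-major indexing and periodic wrap-around at the grid edges (mirroring what numpy.roll does in a serial code); the function name is hypothetical and the loop is plain Python, not a CUDA kernel.

```python
import numpy as np

def central_difference_x(q_flat, nx, ny, dx):
    """Centered x-derivative at every grid point of a flattened field."""
    out = np.empty_like(q_flat)            # separate output array: no write conflicts
    for tid in range(nx * ny):             # each iteration stands in for one thread
        i, j = divmod(tid, nx)             # row and column of this thread's point
        east = i * nx + (j + 1) % nx       # flat id of q[i, j+1]
        west = i * nx + (j - 1) % nx       # flat id of q[i, j-1]
        out[tid] = (q_flat[east] - q_flat[west]) / (2.0 * dx)
    return out

rng = np.random.default_rng(0)
q = rng.random((4, 6))
dqdx = central_difference_x(q.ravel(), nx=6, ny=4, dx=0.5)
```

The result can be compared against the vectorized form `(np.roll(q, -1, axis=1) - np.roll(q, 1, axis=1)) / (2 * dx)` to confirm that the neighbour ids are computed correctly.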

7 Results Obtained with aeroCuda

7.1 The Effect of Optimization

Loading the code onto the GPU removes a considerable portion of the runtime. The speedups are

especially noticeable in the 1st, 2nd, and 4th steps, as shown in Table 3. In the 1st step, substituting

the thread-based force projection for the vectorized projection appears to have provided the bulk

of the speedup, since in the 5th step where only the interpolation takes place, there is a much

smaller speedup. The finite-differencing steps (2,4) show a very high speedup as well, especially in

the case of the 4th step. The discrepancy between these two steps might be due to the total number of global memory reads that must be made: since the 2nd step requires many more variables than the 4th step, it is possible that the variable reads are forming somewhat of a bottleneck in that step.

The actual time and speedup quantities for the simulations are listed in Table 3. For the serial

code, the 1st and 2nd steps took the longest, while for the GPU the 1st and 3rd steps took the

longest. For the serial code, the slowdown in the 2nd step likely results from the multiple roll function calls needed to evaluate the partial derivatives. For the 1st step of the serial code, the forcing function was difficult to optimize beyond the vectorization that was done. For aeroCuda, the runtime for the 1st step was large as it involved the for-loop iteration necessary to place all the forces at their respective points on the grid. The Poisson equation, the 3rd step, took the second-longest to execute, yet still provided a good speedup over the 3rd step runtime of the serial code. As an improvement to aeroCuda, a more efficient algorithm for transferring forces to the grid would help shave some time off the 1st step.

Table 3: Simulation Speedup for Re 100 Case

Simulation   1st      2nd      3rd      4th      5th
Serial100    0.87s    0.67s    0.28s    0.33s    0.035s
GPU100       0.018s   0.008s   0.014s   0.004s   0.011s
Speedup      48.2     81.1     19.9     89.1     3.13

7.2 Numerical Confirmations

The results obtained are of the expected magnitude, though they vary slightly from those obtained in other papers. The drag coefficients are shown in Table 4. Multiple sources are used

to confirm the tests conducted in this paper. In particular, Henderson’s work on studying the drag

around a cylinder shows a graph of the drag coefficient as a function of Re [4]. All of the values

fall within the expected regions according to that graph. For numerical confirmation, the coefficient of drag obtained here for the Re 100 cylinder closely matches the experimental coefficient of drag obtained by Peskin and Lai [6]. In the Re 25 case, the coefficient of drag is on the higher end of the numerical studies presented by Saiki and Biringen, but is backed by other studies [9].

Table 4: Coefficient of Drag Mean and Standard Deviation

Simulation   Mean   Std    Previous Work
GPU 1000     1.53   0.35   1.5
GPU 100      1.4    0.19   1.44-1.54
GPU 25       2.24   1.07   1.54-2.26

The cylinders were run at the same conditions except velocity and timestep (details are given in

Table 5). For the computational parameters, the spacings were $\delta x = \frac{1}{128}$, $\delta y = \frac{1}{128}$ while the density

ρ = 1. The following table outlines the time-stepping, dynamic viscosity, and velocity parameters

for the different simulations:

Table 5: Simulation Parameters

Simulation   u_x    δt       ν
Re 1000      1      0.0001   0.0003
Re 100       1      0.001    0.003
Re 25        0.25   0.001    0.003

As outlined in the notes of Tryggvason, for the projection method implemented the CFL condition was $dt < \frac{2\nu}{|u|^2}$.[12] Given these constraints, the Re 25 and 100 cases can be executed at the same parameters. For the Re 1000 case, since the dynamic viscosity is 10x lower than in the Re 25 and 100 cases, the time step must be lowered significantly in order to get the best result. A timestep of $10^{-4}$ was satisfactory to get a good coefficient of drag. In a conversation with Karl Helfrich, a CFD scientist at Woods Hole Oceanographic Institution, two issues with the present code were noticed. First, the method used to solve these problems is known as a direct numerical simulation: in effect, no turbulence-modeling approximations are used and the refinement of the spacing (spatial and temporal) is relied upon to obtain solutions. Not only is this costly but it cannot be used for all problems. Second, the

projection method employed is purely first-order in time and a more accurate advancement of the

solution would most likely help in both stability and accuracy.
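The time-step restriction mentioned above can be checked against the parameters in Table 5 with a few lines of Python. This is a sketch that reads the condition as dt < 2ν/|u|² and takes |u| to be the prescribed cylinder speed; the function and variable names are illustrative.

```python
def max_stable_timestep(nu, speed):
    """Upper bound on the time step from dt < 2 * nu / |u|^2."""
    return 2.0 * nu / speed ** 2

# (u_x, dt, nu) as listed in Table 5
cases = {
    "Re 1000": (1.0, 1e-4, 3e-4),
    "Re 100": (1.0, 1e-3, 3e-3),
    "Re 25": (0.25, 1e-3, 3e-3),
}
satisfied = {name: dt < max_stable_timestep(nu, ux)
             for name, (ux, dt, nu) in cases.items()}
```

Under these assumptions the bounds are 6 × 10⁻⁴ for Re 1000, 6 × 10⁻³ for Re 100, and 0.096 for Re 25, so all three runs satisfy the condition, while reusing the 10⁻³ time step for the Re 1000 case would not.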

7.3 Expected Physical Phenomena and Further Validation

In Figure 13, there is no vortex shedding occurring in the case of the cylinder at Re 25. This is because the phenomenon, known as Von Karman Shedding, does not occur until about Re 50. The contours obtained here are also present in the paper by Saiki and Biringen, where the authors also simulate a Re 25 flow around a cylinder. In the cases of the cylinders at Re 100 and Re 1000 (Figures 9 and 10, respectively), the Von Karman Shedding takes place. In particular, the vortices obtained in the Re 100 case take on a 'pointy' shape similar to that of the vortices in Peskin and Lai's paper for Re 150. The immersed solid method has the points traveling at the prescribed 1 m/s in both figures, as the velocity magnitude matches that specified by the color bar on the right-hand sides of both plots. The velocity magnitude was taken to be: $|u_{mag}| = (u^2 + v^2)^{1/2}$.

7.4 A Closer Look at the Physical Response of the Immersed Solid

Looking at the drag forces in Figure 15, the oscillatory motion of the immersed solid points is

evidenced by the nature of the drag forces. It should be noted that the forces listed here are in N/m, as the solution is 2-dimensional and not 3-dimensional. While this is a graph focused on the

converged portion of the drag force, initially one can expect to see a damped harmonic oscillator

response from the system, where both the drag and lift forces gradually converge to their steady state value(s). To measure the oscillations' magnitudes, the means and the standard deviations are computed to provide a better understanding of the steady-state behavior. In the case of the cylinder at Re 25, it can be assumed that the force is in the 0.02-0.025 range, while for the Re 100 and 1000 cases, the force will be in the 0.2-0.24 range. The forces should not be very different

for these last two cases, as their coefficients of drag are relatively close to each other.

7.5 Physical Location of the Immersed Solid Points

Looking at Figure 16, the colorbars indicate the magnitude of the displacement of solid points from

their prescribed counterparts. In the diagrams, the point dispersion goes from best to worst as

Re 25, 1000, and 100. This should be expected as the Re 25 case faces the lowest velocity, while

in the Re 1000 case a very high spring constant is used. In the case of the Re 100 cylinder, on

the upper and lower surfaces of the cylinder that break the flow (right side of the cylinder), the

points have shifted more than in the left half of the cylinder. While they might be moving along at the proper velocity, in moving out of position they might have slightly affected the expected drag value by applying forces to the fluid from their shifted positions. Since the coefficient of drag for the

Re 100 case matched that achieved in other papers, the effect was negligible. Looking at all cases,

no point was more than 1/2 of a gridspace width away from its intended position. This provides

more confirmation that the flow was properly matched and that the object’s structure held strong

throughout the simulations, preventing a distortion of the flow around its surface.

8 Test Case: Swimmer in Glide Position

8.1 Overview

One of the underlying motivations behind developing this code was to apply it towards problems

involving biological motion. In the study conducted by Von Loebbecke et al., the authors attempt to analyze the flow around a swimmer performing the dolphin kick in 3 dimensions.[1] The study

is adapted to fit the present capabilities of this solver: the case of a swimmer in the glide position

at constant velocity in 2-dimensions. The geometrical figure of the swimmer was obtained from

the paper, via an image-tracing mechanism in MATLAB. The outline was then provided to the aeroCuda code along with the simulation parameters.

8.2 Simulation Details

The swimmer obtained from the paper was scaled to about 1.7 m in length. The max width of the swimmer was 0.23 m. The outline consisted of 7998 points when it was taken from the paper. The final point cloud, obtained after submitting the outline to the cloaking module with a spacing of 1/128, is shown in Figure 18. The grid size was set to 512 x 4096, with the spacing in both dimensions set to 1/128. The kinematic viscosity was set to $3 \times 10^{-4}$ with the density set to ρ = 1000. The timestep was $10^{-4}$ s, well within the CFL condition range. The general spring constant was set to $5 \times 10^{7}$, with a slope of 0.5 for the variable spring model. Figure 18 details the spring constants at all of the points around the body. Concerning the motion of the swimmer, the velocity was set to 1 m/s in the positive x-direction.

8.3 Simulation Results

The flow around the body of the swimmer is very similar to that around an airfoil, perhaps due to

the streamlined nature of the body; the Reynolds number of the simulation was placed at 5666.7.

Two major confirmations of the solution are presented, the plots for which can be found in Figures

19-21. First, the magnitude of shift visible in the immersed solid points is very minimal; the largest

separation between a solid point and its prescribed point is less than one gridspace away. Second,

the forces felt by the swimmer oscillate at steady levels; the drag force is concentrated around

60-70 while the lift force shows sinusoidal oscillations at steady periods. If the points have not shifted much, then the integrity of the body is preserved and the flow around it can be deemed rather accurate. The force diagrams confirm this: if points had shifted dramatically, the force would not stabilize around a mean value. Looking at the flow around the swimmer, the shedding of

vortices is continuous at this stage, showing that steady state has been achieved.

8.4 Reynolds Number Transition

To further expand the analysis of the swimmer, the flow patterns at different Reynolds Number

regimes are examined (found in Figure 22). These simulations are done by varying the kinematic

viscosity of the problem at hand, the values of which are listed below. However, with higher

Reynolds number problems, the κ of the springs as well as the timesteps must be changed to

accommodate the changing nature of the problem. In increasing κ, it is ensured that points do not

shift at the higher Reynolds flows while decreasing the timestep allows for much more accuracy to

be obtained than at higher timesteps, in addition to satisfying the CFL condition.

Table 6: Swimmer at Different Reynolds Numbers

Flow     Reynolds Number   ν                    δt          κ
Low      566.7             $3 \times 10^{-3}$   $10^{-3}$   $10^{7}$
Medium   5666.7            $3 \times 10^{-4}$   $10^{-4}$   $5 \times 10^{7}$
High     56666.7           $3 \times 10^{-5}$   $10^{-5}$   $10^{8}$
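The Reynolds numbers in Table 6 can be reproduced from the quantities given in the simulation details. The sketch below assumes the characteristic length is the swimmer's scaled length of 1.7 m and reads the viscosities as 3 × 10⁻³, 3 × 10⁻⁴, and 3 × 10⁻⁵; the function and variable names are illustrative.

```python
def reynolds_number(speed, length, nu):
    """Re = u * L / nu for kinematic viscosity nu."""
    return speed * length / nu

swimmer_length = 1.7   # m, scaled swimmer length
swimmer_speed = 1.0    # m/s, prescribed glide velocity
reynolds = {label: reynolds_number(swimmer_speed, swimmer_length, nu)
            for label, nu in [("Low", 3e-3), ("Medium", 3e-4), ("High", 3e-5)]}
```

With these inputs the three cases round to 566.7, 5666.7, and 56666.7, matching the table, which suggests that the body length was indeed the length scale used.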

At a low Reynolds number, the boundary layer should remain intact, which it does in the

simulation run. However, at the medium Reynolds number, the shedding of vortices and a thinner

boundary layer are to be expected, as the fluid is not as viscous. At the high Reynolds number

case, the boundary layer separates and vortices are shed not just at the feet of the swimmer (as in

the medium Reynolds case) but also along the body. These diagrams were obtained at 5s for the

low and medium cases, and 4s for the high case. Due to cluster compute time limits, it was difficult to run long simulations, as holding a position for longer than 4 hours resulted in dismissal from the cluster.

9 Conclusion

9.1 Numerical Improvements to aeroCuda

While the aeroCuda code provides good accuracy and efficiency, it can be optimized in a few crit-

ical areas that can unlock its potential as a strong CFD code. The first of these is the numerical

methods, including the algorithms behind the solutions and the governing principles of fluid dy-

namics. The first numerical improvement is the updating of the projection method implemented

here to a more numerically-accurate projection method. In particular, the algorithm here is first-order in time and second-order in space; however, in his notes Tryggvason presents a fourth-order Runge-Kutta method (in time) that was developed by Weinan E. [12] The difference between the

methods should be noticeable at high Reynolds number, where the temporal discretization makes a

difference. To improve the actual spatial accuracy, implementing higher-order finite-difference ap-

proximations would be a good start. A potential issue with higher-order numerical methods might

be their stability; when the projection method was first implemented, higher-order numerical derivatives were implemented for both first and second derivatives. However, implementing anything greater than second-order accurate methods resulted in a deterioration of the solution. Expanding the delta function stencil might help, and it would also provide more accurate forcing of the

fluid around the immersed solid points. The delta function itself can also be improved in order of

accuracy, which should help develop a more accurate forcing function.

9.2 Technical Improvements to aeroCuda

The numerical improvements above may prove to be difficult to code if the efficiency and optimiza-

tion of aeroCuda’s runtime is taken into consideration. For example, in order to increase numerical

accuracy more gridpoints would have to be worked with; this means that global memory reads would increase significantly. In the force projection portion of step one, a large for-loop is executed by 16 threads to add forces to their respective points. These are just some

areas which result in slowdowns to the code—if their execution times can be reduced by 30-40

9.3 Capability Enhancements to aeroCuda

There are two areas which should be the next focus for aeroCuda development: expansion to 3-d

and video processing. In the case of the expansion to 3-d, two issues manifest: memory transfer

and thread execution. In the case of 3-d, there will be a massive increase in the amount of memory

consumed, simply because the domain is being extended to another dimension. Therefore, the

transfer times of large chunks of memory (in the case of a cubic grid with 1000 gridpoints per dimension and 8-byte floats, 8 GB would be needed per field) would be very high. Moreover, for 1 billion threads to execute, there would be a noticeable increase in time; high quantities of threads do take more time to execute. Perhaps a combination of MPI and CUDA could be used to execute the problem; however, data would have to be transferred to and from the GPU at every timestep.

For video processing, there are probably multiple ways to implement this, the most direct being

output of variables at every timestep. However, this would result in a massive and unrealistic

memory requirement, especially with a large grid. If the GPU could be worked with at a deeper level, perhaps video processing could be made part of the solver process; if it results in a slower runtime, however, then perhaps it would not be wise to pursue this feature.

9.4 Final Remarks

The immersed solid implementation developed in this paper proves to be reliable for the cases

demonstrated here. It confirmed the observations made in other papers; for situations at high Reynolds numbers, large force fluctuations were observed, but the model itself appears to have worked well. The theory developed by Peskin can truly unlock the potential for efficient CFD

simulations of transient flow problems; it has been demonstrated in others’ works just as it has in

this one. With an improvement to the numerical method and technical specifications of aeroCuda,

it is hoped that this code could be of great use to researchers and students alike. Computational

fluid dynamics is a difficult field, but one which still holds many secrets to be unlocked. Hopefully,

aeroCuda will help shed light on some of them in the future.

10 Finances

10.1 Resources Used

To develop aeroCuda, the Enthought Python distribution was used. This license is free for students

and academic organizations but requires a fee for those in industry to obtain. The Nvidia GPU

used for these computations was a Tesla C2070, which retails for 2111.85 dollars at SabrePC. This

GPU was accessed through the Resonance cluster at the Harvard School of Engineering and Applied

Sciences. Therefore, the total budget of the project was 0 dollars.

10.2 Upgrades

With an expansion to a 3-d case, the memory transfer times will drastically increase. As a result,

GPUs that are capable of transferring larger amounts of data at very low runtime cost should be

sought out. The author's knowledge of the GPU market is limited, but more technical users might be

able to target an optimal GPU for aeroCuda to operate.

11 Appendix

11.1 Solving the Immersed Solid-influenced Navier-Stokes Equations

11.1.1 Step 1: Force Projection [7]

Let us begin by defining our prescribed points as $(x_p, y_p)$. These points are given some analytical function $(u_p(t), v_p(t))$ by the user that defines their motion through space. We then define the Eulerian co-points as $(x_b, y_b)$. These points retrieve their motion $(u_b, v_b)$ from the velocities that are calculated by the grid. To obtain these velocities, Peskin's delta function method is used.

In the delta function stencil, a reference point must be chosen for each boundary point. On the

stencil, the reference point is located at the [0, 0] location, with stencil indices in both the x- and

y-directions defined over the range [-1,2]. Assuming the spacing is the same along the x-axis and

y-axis, the reference gridpoint’s actual grid position is attained by:

$$x_r = \left\lfloor \frac{x_b}{\delta x} \right\rfloor \qquad y_r = \left\lfloor \frac{y_b}{\delta y} \right\rfloor$$

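We then obtain the displacements between $(x_r, y_r)$ and $(x_b, y_b)$, normalized by the grid spacing so that each lies in [0, 1):

$$r_x = \frac{x_b - x_r\,\delta x}{\delta x} \qquad r_y = \frac{y_b - y_r\,\delta y}{\delta y}$$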

The delta function stencil is 4 × 4 in size. Each position on it is a function of $(r_x, r_y)$:

$$\delta(x, y) = \phi(r_x)\,\phi(r_y)$$

The purpose of the delta function is to blend the value of the force to the surrounding gridpoints or

derive the velocity at a certain point from the surrounding grid point velocities. This is integral to

the formulation as it allows for information (such as force and velocity) to be transferred between

the immersed solid points and the gridpoints.

The phi function is taken to be:

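• $\phi(r+1) = \frac{6-4r}{8} - \frac{3-2r+\sqrt{1+4r-4r^2}}{8}$

• $\phi(r) = \frac{3-2r+\sqrt{1+4r-4r^2}}{8}$

• $\phi(r-1) = \frac{4r-2}{8} + \frac{3-2r+\sqrt{1+4r-4r^2}}{8}$

• $\phi(r-2) = \frac{1}{2} - \frac{3-2r+\sqrt{1+4r-4r^2}}{8}$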
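As a quick consistency check on the four expressions above (a worked sum added for illustration, not part of the original derivation), adding them shows that the weights along one axis always total one:

$$
\begin{aligned}
\phi(r+1) + \phi(r) + \phi(r-1) + \phi(r-2)
&= \left(\tfrac{6-4r}{8} - \phi(r)\right) + \phi(r) + \left(\tfrac{4r-2}{8} + \phi(r)\right) + \left(\tfrac{1}{2} - \phi(r)\right) \\
&= \tfrac{6-4r}{8} + \tfrac{4r-2}{8} + \tfrac{1}{2} = 1
\end{aligned}
$$

Because each 2-d stencil entry is a product of one x-weight and one y-weight, the 16 entries of the stencil also sum to one, so a force or velocity is neither amplified nor lost when it is transferred between a solid point and the grid.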

The stencil is given in Table 7. Define $x_R = \phi(r_x)$ and $y_R = \phi(r_y)$. The top row of indices is j and the leftmost column of indices is i.

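Table 7: Delta Stencil

$$
\begin{array}{c|cccc}
i \backslash j & -1 & 0 & 1 & 2 \\ \hline
-1 & \left(\tfrac{6-4r_y}{8} - y_R\right)\left(\tfrac{6-4r_x}{8} - x_R\right) & \left(\tfrac{6-4r_y}{8} - y_R\right) x_R & \left(\tfrac{6-4r_y}{8} - y_R\right)\left(\tfrac{4r_x-2}{8} + x_R\right) & \left(\tfrac{6-4r_y}{8} - y_R\right)\left(0.5 - x_R\right) \\
0 & \left(\tfrac{6-4r_x}{8} - x_R\right) y_R & x_R\, y_R & y_R\left(\tfrac{4r_x-2}{8} + x_R\right) & y_R\left(0.5 - x_R\right) \\
1 & \left(\tfrac{6-4r_x}{8} - x_R\right)\left(\tfrac{4r_y-2}{8} + y_R\right) & x_R\left(\tfrac{4r_y-2}{8} + y_R\right) & \left(\tfrac{4r_x-2}{8} + x_R\right)\left(\tfrac{4r_y-2}{8} + y_R\right) & \left(\tfrac{4r_y-2}{8} + y_R\right)\left(0.5 - x_R\right) \\
2 & \left(\tfrac{6-4r_x}{8} - x_R\right)\left(0.5 - y_R\right) & x_R\left(0.5 - y_R\right) & \left(\tfrac{4r_x-2}{8} + x_R\right)\left(0.5 - y_R\right) & \left(0.5 - y_R\right)\left(0.5 - x_R\right)
\end{array}
$$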

We then obtain the interpolated value for ub, vb using the stencil coefficients and the u, v of the

surrounding gridpoints. Denote the above stencil as the function s(i, j, rx, ry):

$$u_b = \sum_{i,j=-1}^{2} u_{(y_r+i,\,x_r+j)}\, s(i,j,r_x,r_y) \qquad v_b = \sum_{i,j=-1}^{2} v_{(y_r+i,\,x_r+j)}\, s(i,j,r_x,r_y)$$

Once we have the velocities, we can calculate the total force. In the harmonic oscillator forcing function developed by Saiki and Biringen, κ is the spring constant and β is the damping coefficient. The motion $u_p, v_p$ is known to us as we prescribed it. Therefore, we have for the kth boundary point:

$$F_{x,k} = \kappa\,(x_{p,k} - x_{b,k}) - \beta\,(u_{b,k} - u_{p,k}) \qquad F_{y,k} = \kappa\,(y_{p,k} - y_{b,k}) - \beta\,(v_{b,k} - v_{p,k})$$
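The force expression above translates directly into array operations. The sketch below is illustrative only: the function name and argument layout (one row per boundary point, columns for x and y) are assumptions, and a single scalar κ is used for all points.

```python
import numpy as np

def spring_damper_force(pos_p, pos_b, vel_p, vel_b, kappa, beta):
    """Spring-damper force on each boundary point.

    pos_p, vel_p : prescribed positions and velocities, shape (n, 2)
    pos_b, vel_b : actual (interpolated) positions and velocities, shape (n, 2)
    """
    pos_p, pos_b = np.asarray(pos_p, float), np.asarray(pos_b, float)
    vel_p, vel_b = np.asarray(vel_p, float), np.asarray(vel_b, float)
    return kappa * (pos_p - pos_b) - beta * (vel_b - vel_p)

force = spring_damper_force(pos_p=[[1.0, 2.0]], pos_b=[[0.9, 2.1]],
                            vel_p=[[1.0, 0.0]], vel_b=[[0.8, 0.1]],
                            kappa=10.0, beta=2.0)
```

In this example the point lags behind its prescribed position in x and moves too slowly, so both the spring and damper terms push it forward; in y it has overshot and is drifting away, so both terms pull it back.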

With the force per point calculated, we now need to project it to the surrounding gridpoints. This

operation is done in a manner similar to the interpolation step. Instead of aggregating the values

through the summation step, though, the values are added to their respective locations on a grid.

Therefore, assume that fx, fy represent the force field terms in both dimensions in the modified

Navier-Stokes. We initialize them to 0 at each iteration, and then do the following for the kth

boundary point:

$$f_{x,(y_{r,k}+i,\;x_{r,k}+j)} = F_{x,k}\, s(i,j,r_x,r_y) \qquad f_{y,(y_{r,k}+i,\;x_{r,k}+j)} = F_{y,k}\, s(i,j,r_x,r_y) \qquad i,j = -1,\dots,2$$

Now we move to solving the equations via the projection method.

11.1.2 Step 2: Calculating the intermediate velocity field [12]

Previously, we established the force fields interacting with the main equations as fx, fy. Therefore,

we now need to solve those equations. Let us define the primary field quantities:

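• $\mathbf{u}^n = \langle u^n, v^n \rangle$ = Primary Velocity Fields at time step $n$

• $\mathbf{u}^* = \langle u^*, v^* \rangle$ = Intermediate Velocity Fields

• $\mathbf{u}^{n+1} = \langle u^{n+1}, v^{n+1} \rangle$ = Final Velocity Fields

• $p$ = Pressure Field

• $\mathbf{f} = \langle f_x, f_y \rangle$ = Forcing Fields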


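The fully-modified Navier-Stokes equations are given by:

• $\frac{\partial \mathbf{u}^n}{\partial t} + (\mathbf{u}^n \cdot \nabla)\mathbf{u}^n = -\nabla p + \nu \nabla^2 \mathbf{u}^n + \frac{\mathbf{f}}{\rho}$

• $\nabla \cdot \mathbf{u}^n = 0$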

We begin by decomposing the equation via finite differences. The time derivative is represented

through forward Euler, and all other derivatives are obtained through a second-order central-

difference scheme.[12] The first and second derivatives, when evaluated via centered differencing,

are given as:

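\[
\frac{\partial q_{i,j}}{\partial x} = \frac{q_{i,j+1} - q_{i,j-1}}{2\,\delta x}, \qquad
\frac{\partial^2 q_{i,j}}{\partial x^2} = \frac{q_{i,j+1} - 2q_{i,j} + q_{i,j-1}}{(\delta x)^2}
\]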

Note that the same procedure follows for the y-axis derivatives, with a change in the axis of differencing (index $i$ instead of $j$) and in the magnitude of the spacing ($\delta y$ instead of $\delta x$).
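These operators map directly onto array shifts. The sketch below is one way to write them with NumPy, assuming a periodic grid stored as `q[i, j]` with `j` running along x and `i` along y. The function names are hypothetical, and `np.roll` is used here only as a convenient stand-in for the index arithmetic a CUDA kernel would perform.

```python
import numpy as np

def ddx(q, dx):
    """Second-order central first derivative in x (column index j)."""
    return (np.roll(q, -1, axis=1) - np.roll(q, 1, axis=1)) / (2.0 * dx)

def d2dx2(q, dx):
    """Second-order central second derivative in x."""
    return (np.roll(q, -1, axis=1) - 2.0 * q + np.roll(q, 1, axis=1)) / dx**2

def ddy(q, dy):
    """Second-order central first derivative in y (row index i)."""
    return (np.roll(q, -1, axis=0) - np.roll(q, 1, axis=0)) / (2.0 * dy)

def d2dy2(q, dy):
    """Second-order central second derivative in y."""
    return (np.roll(q, -1, axis=0) - 2.0 * q + np.roll(q, 1, axis=0)) / dy**2
```

Here `np.roll(q, -1, axis=1)[i, j]` equals `q[i, j+1]`, so each function is a literal transcription of the corresponding formula.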

Applying these operators to the modified momentum equation, we obtain the following breakdown of the terms:

• Time Derivative: $\frac{u^{n+1}_{i,j} - u^n_{i,j}}{\delta t}$

• Viscous Derivative: $\nu\left(\frac{u^n_{i,j+1} - 2u^n_{i,j} + u^n_{i,j-1}}{\delta x^2} + \frac{u^n_{i+1,j} - 2u^n_{i,j} + u^n_{i-1,j}}{\delta y^2}\right)$

One term remains: the convective derivative. With a basic centered-differencing scheme it would be given as:

\[
u^n_{i,j}\,\frac{u^n_{i,j+1} - u^n_{i,j-1}}{2\,\delta x} + v^n_{i,j}\,\frac{u^n_{i+1,j} - u^n_{i-1,j}}{2\,\delta y}
\]

The above equation also applies for the v-field. Mattheus Ueckermann of MIT was consulted about oscillations observed in the code when the above centered-difference scheme was used. His explanation was that, in the advection equation, a centered-difference scheme does not properly transmit information in the direction of the flow. For example, if the flow is negative, we need the value of the flux between cells j and j+1, as opposed to a difference across cells j-1 and j+1, which does not necessarily average to the proper value. Therefore, the centered-differencing operation was replaced with the following one-sided scheme:

\[
\begin{cases}
u^n_{i,j}\,\dfrac{u^n_{i,j+1} - u^n_{i,j}}{\delta x}, & u^n_{i,j} < 0 \\[2ex]
u^n_{i,j}\,\dfrac{u^n_{i,j} - u^n_{i,j-1}}{\delta x}, & u^n_{i,j} > 0
\end{cases}
\]


The above equations also apply for the v-field. The idea is to difference on the side the flow comes from: backward (cells j-1 and j) for positive advection and forward (cells j and j+1) for negative advection. However, because this is a first-order approximation, the accuracy is not very good. To improve upon this, the CFD-Wiki website was consulted for the QUICK (Quadratic Upstream Interpolation for Convective Kinematics) formulation. The idea behind this implementation is that, instead of relying on two points to find the derivative, four are used. For positive advection these are the local point, two points at lower index (j-1, j-2), and one point at higher index (j+1); for negative advection they are the local point, two points at higher index (j+1, j+2), and one point at lower index (j-1). Applying the QUICK algorithm gives the following formula for the convective derivative [2]:

Convective Derivative:

\[
\begin{cases}
u^n_{i,j}\,\dfrac{0.375\,u^n_{i,j+1} + 0.375\,u^n_{i,j} - 0.875\,u^n_{i,j-1} + 0.125\,u^n_{i,j-2}}{\delta x}, & u^n_{i,j} > 0 \\[2ex]
u^n_{i,j}\,\dfrac{-0.125\,u^n_{i,j+2} + 0.875\,u^n_{i,j+1} - 0.375\,u^n_{i,j} - 0.375\,u^n_{i,j-1}}{\delta x}, & u^n_{i,j} < 0
\end{cases}
\]

The above equations also apply for the v-field. By incorporating more points into the analysis, a more accurate and stable solution is obtained. Therefore, the QUICK formulation was used for evaluating the convective term.
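A sketch of the QUICK derivative in the x-direction is given below, using the coefficients quoted above. It assumes a periodic grid stored as `q[i, j]` with `j` along x. The names `quick_ddx` and `convective_x` are hypothetical and are not taken from the aeroCuda source.

```python
import numpy as np

def quick_ddx(q, adv, dx):
    """QUICK estimate of dq/dx, choosing the stencil by the sign of adv.

    q   : field being differentiated, indexed q[i, j] with j along x
    adv : advecting velocity at each gridpoint (same shape as q)
    dx  : grid spacing in x
    """
    qp1 = np.roll(q, -1, axis=1)   # q[i, j+1]
    qp2 = np.roll(q, -2, axis=1)   # q[i, j+2]
    qm1 = np.roll(q, 1, axis=1)    # q[i, j-1]
    qm2 = np.roll(q, 2, axis=1)    # q[i, j-2]
    # Positive advection: stencil biased toward lower j.
    pos = (0.375 * qp1 + 0.375 * q - 0.875 * qm1 + 0.125 * qm2) / dx
    # Negative advection: mirrored stencil biased toward higher j.
    neg = (-0.125 * qp2 + 0.875 * qp1 - 0.375 * q - 0.375 * qm1) / dx
    return np.where(adv > 0, pos, neg)

def convective_x(u, dx):
    """x-part of the convective term for the u-field: u * du/dx."""
    return u * quick_ddx(u, u, dx)
```

Where the advecting velocity is exactly zero the choice of stencil does not matter, since the derivative is multiplied by zero in the convective term.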

In the projection method, an intermediate velocity is inserted into the time derivative to separate the pressure term from the rest of the momentum equation. Therefore, two equations are developed:

\[
u^*_{i,j} = u^n_{i,j} + \delta t\,\left(-\text{Convective Derivative} + \text{Viscous Derivative} + f_x\right)
\]

\[
\frac{u^{n+1}_{i,j} - u^*_{i,j}}{\delta t} = -\nabla p
\]

The same equations exist for the v-velocity field. The first equation is purely explicit and can be

solved by decomposition through finite-differences. The second equation is implicit and will yield

two more equations to be solved in the subsequent steps.

11.1.3 Step 3: Calculating the Pressure Field [12]

To solve the linking equation,

\[
\frac{\mathbf{u}^{n+1} - \mathbf{u}^*}{\delta t} = -\nabla p
\]

the continuity equation is introduced and used to generate a pressure field that imposes the divergence-free condition. The divergence operator is applied to the linking equation:

\[
\nabla \cdot \left(\frac{\mathbf{u}^{n+1} - \mathbf{u}^*}{\delta t}\right) = -\nabla^2 p
\]

By the divergence condition, $\nabla \cdot \mathbf{u}^{n+1} = 0$. Therefore, the following Poisson equation is obtained:

\[
\nabla^2 p = \frac{\nabla \cdot \mathbf{u}^*}{\delta t}
\]

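We now need to solve for the pressure, $p$. The right-hand side may be computed explicitly; call it $U$. Express $p$ and $U$ in terms of their Fourier series:

\[
p(\theta, \phi) = \sum_{n,m} p_{n,m}\, e^{i(n\theta + m\phi)}, \qquad
U(\theta, \phi) = \sum_{n,m} U_{n,m}\, e^{i(n\theta + m\phi)}
\]

where $\theta = \frac{2\pi x}{L}$ and $\phi = \frac{2\pi y}{L}$. Then, taking the second derivatives with respect to $x$ and $y$,

\[
\nabla^2 p = \sum_{n,m} \frac{4\pi^2}{L^2}\,(-n^2 - m^2)\, p_{n,m}\, e^{i(n\theta + m\phi)}
\]

and equating Fourier modes with $U_{n,m}$ yields

\[
\frac{4\pi^2}{L^2}\,(-n^2 - m^2)\, p_{n,m} = U_{n,m} \implies p_{n,m} = \frac{-L^2\, U_{n,m}}{4\pi^2 (n^2 + m^2)}
\]

which for a unit domain ($L = 1$) reduces to $p_{n,m} = -U_{n,m} / \left(4\pi^2 (n^2 + m^2)\right)$. The pressure $p$ is given by the inverse Fourier transform of the right-hand side. Thus, we simply need to compute the 2-d FFT of $U$, divide by the matrix of corresponding coefficients, and compute the 2-d iFFT (inverse FFT) to get the matrix for $p$.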
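The FFT-based solve can be sketched on the CPU with `numpy.fft`, standing in for the PyFFT calls used on the GPU. The function name `solve_poisson_periodic` is hypothetical, and a square periodic domain of side `L` is assumed. The treatment of the $n = m = 0$ mode is also an assumption of this sketch: its coefficient is zero, so the mean of $p$ is set to zero rather than divided out, which is harmless because only $\nabla p$ enters the velocity update.

```python
import numpy as np

def solve_poisson_periodic(U, L=1.0):
    """Solve laplacian(p) = U on a periodic L x L domain via the 2-d FFT."""
    ny, nx = U.shape
    # Integer mode numbers n (x-direction) and m (y-direction).
    n = np.fft.fftfreq(nx, d=1.0 / nx)
    m = np.fft.fftfreq(ny, d=1.0 / ny)
    N, M = np.meshgrid(n, m)                 # both of shape (ny, nx)
    coeff = (2.0 * np.pi / L) ** 2 * (N**2 + M**2)

    U_hat = np.fft.fft2(U)
    coeff[0, 0] = 1.0                        # placeholder to avoid 0/0
    p_hat = -U_hat / coeff
    p_hat[0, 0] = 0.0                        # fix the mean pressure to zero
    return np.real(np.fft.ifft2(p_hat))
```

For example, with `U` sampled from $-(2\pi/L)^2 (1 + 4)\,\sin(2\pi x/L)\cos(4\pi y/L)$ the function recovers $\sin(2\pi x/L)\cos(4\pi y/L)$ to round-off, since a single Fourier mode is represented exactly on the grid.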

11.1.4 Step 4: Calculating the Final Velocity fields[12]

Now that the pressure field $p$ and velocity field $\mathbf{u}^*$ have been calculated, the final velocity, as given by Tryggvason, can be obtained [12]:

\[
\mathbf{u}^{n+1} = \mathbf{u}^* - \delta t\,(\nabla p)
\]

To do this, the prior equation is decomposed via finite differencing as in Step 2:

\[
u^{n+1}_{i,j} = u^*_{i,j} - \frac{p_{i,j+1} - p_{i,j-1}}{2\,\delta x}\,\delta t, \qquad
v^{n+1}_{i,j} = v^*_{i,j} - \frac{p_{i+1,j} - p_{i-1,j}}{2\,\delta y}\,\delta t
\]
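The correction step is a direct application of the central-difference gradient. The sketch below assumes periodic fields stored as `p[i, j]` with `j` along x; the function name `correct_velocity` is hypothetical.

```python
import numpy as np

def correct_velocity(u_star, v_star, p, dx, dy, dt):
    """Apply the pressure correction u^{n+1} = u* - dt * grad(p)."""
    # Central differences of p; np.roll(p, -1, axis=1)[i, j] is p[i, j+1].
    dpdx = (np.roll(p, -1, axis=1) - np.roll(p, 1, axis=1)) / (2.0 * dx)
    dpdy = (np.roll(p, -1, axis=0) - np.roll(p, 1, axis=0)) / (2.0 * dy)
    u_new = u_star - dt * dpdx
    v_new = v_star - dt * dpdy
    return u_new, v_new
```

A uniform pressure field leaves both velocity components unchanged, which is a quick sanity check on any implementation of this step.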

11.1.5 Step 5: Interpolation and Velocity [7]

We follow the same interpolation procedure used in Step 1 to obtain the velocities:

\[
u_b = \sum_{i,j=-1}^{2} u_{(y_r+i,\,x_r+j)}\, s(i, j, r_x, r_y), \qquad
v_b = \sum_{i,j=-1}^{2} v_{(y_r+i,\,x_r+j)}\, s(i, j, r_x, r_y)
\]

Using the Peskin method of forward Euler, we update the positions of the solid and the prescribed points [8]:

• $x^{n+1}_b = x^n_b + u^n_b\,\delta t$

• $y^{n+1}_b = y^n_b + v^n_b\,\delta t$

• $x^{n+1}_p = x^n_p + u^n_p\,\delta t$

• $y^{n+1}_p = y^n_p + v^n_p\,\delta t$

We now have the new locations of the points and can proceed to the next iteration of our solution.

References

[1] Alfred von Loebbecke, Rajat Mittal, Russell Mark, and James Hahn. A computational method for analysis of underwater dolphin kick hydrodynamics in human swimming. Sports Biomechanics Journal, 8(1):60–77, March 2009.

[2] CFD-Wiki. Linear schemes - structured grids.

[3] Nvidia Corporation. Nvidia CUDA C Programming Guide, November 2011.

[4] Ronald Henderson. Details of the drag curve near the onset of vortex shedding.

[5] Andreas Klöckner, Nicolas Pinto, Yunsup Lee, Bryan C. Catanzaro, Paul Ivanov, and Ahmed Fasih. PyCUDA: GPU run-time code generation for high-performance computing. CoRR, abs/0911.3456, 2009.

[6] Ming-Chih Lai and Charles S. Peskin. An immersed boundary method with formal second-order accuracy and reduced numerical viscosity. Journal of Computational Physics, 160(2):705–719, 2000.

[7] Prof. Charles Peskin. The immersed boundary method in a simple special case.

[8] Prof. Charles Peskin. tar file of MATLAB programs.

[9] E.M. Saiki and S. Biringen. Numerical simulation of a cylinder in uniform flow: Application of a virtual boundary method. Journal of Computational Physics, 123(2):450–465, 1996.

[10] Tao Tang. Moving mesh methods for computational fluid dynamics.

[11] Prof. Timothy Tautges. Mesh generation.

[12] Prof. Gretar Tryggvason. Solving the Navier-Stokes in primitive variables I, Spring 2010.


Figure 13: Vorticity Contours at Different Reynolds Numbers


Figure 14: Velocity Magnitude at Different Reynolds Numbers


Figure 15: Forces at Different Reynolds Numbers


Figure 16: Immersed Solid Point Dispersion at Different Reynolds Numbers


Figure 17: Discretization of the Swimmer

Figure 18: Variable Spring Model of the Swimmer


Figure 19: Point Shift of the Swimmer

Figure 20: Forces on the Swimmer

Figure 21: Flow Around the Swimmer at T= 25s


Figure 22: Flow Transition Dependent on Reynolds Number
