
Page 1: N-Body Simulations

N-Body Simulations
Kenneth Owens

Page 2: N-Body Simulations

We wish to compute the interactions between particles (bodies) given their masses and positions.

Simulation is performed in time steps (sketched below):
◦ Forces between all bodies are computed, O(n²)
◦ Positions of all bodies are updated based on their current kinematics and the interactions with the other bodies, O(n)
◦ Time moves forward by one step

The N-Body Problem
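
A minimal sketch of this time-stepping structure, with hypothetical type and function names standing in for the project's own code (the real compute_forces and update_positions appear on later slides):

#include <stddef.h>

/* Hypothetical types and stand-in functions for illustration only. */
typedef struct { float x, y, z;                    } vec3_t;  /* acceleration          */
typedef struct { float x, y, z, mass, vx, vy, vz;  } body_t;  /* position, mass, velocity */

void compute_all_forces(const body_t* bodies, vec3_t* accel, size_t n);             /* O(n^2) */
void update_all_positions(body_t* bodies, const vec3_t* accel, size_t n, float dt); /* O(n)   */

/* Advance the system by num_steps time steps of length dt. */
void simulate(body_t* bodies, vec3_t* accel, size_t n, size_t num_steps, float dt)
{
    for (size_t step = 0; step < num_steps; step++) {
        compute_all_forces(bodies, accel, n);        /* forces between all pairs: O(n^2) */
        update_all_positions(bodies, accel, n, dt);  /* per-body kinematics update: O(n) */
        /* time moves forward by one step */
    }
}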

Page 3: N-Body Simulations

The force between a body i and N other bodies is approximated as shown below by computing the interaction given their masses (m), the distance vector between them (r_ij), and a softening factor (ε).

This is computed for every body against all other bodies

Force Computation
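
A standard form of this softened interaction, following the GPU Gems formulation that this project builds on (an assumption; the slide's own equation is not reproduced here, and the code later folds the constant G into the masses), is approximately:

\[
\mathbf{a}_i \;\approx\; G \sum_{j=1}^{N} \frac{m_j\, \mathbf{r}_{ij}}{\left(\lVert \mathbf{r}_{ij} \rVert^{2} + \varepsilon^{2}\right)^{3/2}},
\qquad \mathbf{r}_{ij} = \mathbf{x}_j - \mathbf{x}_i .
\]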

Page 4: N-Body Simulations

Euler Method: for each particle, a discrete time step (dt) is used to approximate the continuous kinematic equations and update its position and velocity

Position Updates

\[
a = \frac{dv}{dt} \;\Rightarrow\; dv = a\,dt \;\Rightarrow\; v_{i+1} = v_i + dv \;\Rightarrow\; v_{i+1} = v_i + a\,dt
\]
\[
v = \frac{dx}{dt} \;\Rightarrow\; dx = v\,dt \;\Rightarrow\; x_{i+1} = x_i + dx \;\Rightarrow\; x_{i+1} = x_i + v\,dt
\]

Page 5: N-Body Simulations

Execute an n-body simulation on a distributed memory architecture with multiple GPUs on each node

Project Objective

Page 6: N-Body Simulations

Sequential implementation of the n-body simulation code
◦ Written in C
◦ Compiled using gcc-4.4 with -O3

MPI implementation
◦ Written in C
◦ Compiled using mpicc.mpich with gcc-4.4 using -O3
◦ Executed using mpirun.mpich on 2, 5, and 10 nodes

GPU implementation
◦ Written in C with CUDA extensions
◦ Compiled using nvcc with gcc-4.4 using -O3
◦ Executed on Nvidia 580s

MPI-GPU implementation
◦ The MPI driver above was combined with the GPU kernel implementation
◦ Compiled but not tested for correctness

Project Accomplishments

Page 7: N-Body Simulations

The main method of the driver calls nbody

nbody calls two externally linked functions
◦ compute_forces computes the interactions
◦ update_positions updates the kinematics

Driver Source Code

void nbody(vector4d_t* positions, vector4d_t* velocities,
           vector4d_t* current_positions, vector4d_t* current_velocities,
           vector3d_t* accel, size_t size, value_t dt, value_t damping,
           value_t softening_squared)
{
    compute_forces(positions, accel, size, positions, size, softening_squared);
    update_positions(positions, velocities, current_positions, current_velocities,
                     accel, size, dt, damping);
}
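
The vector4d_t, vector3d_t, and value_t types used throughout are not shown on the slides; a plausible definition (an assumption, matching the mass-in-w convention of the CUDA float4 code shown later) would be:

/* Assumed definitions of the "home brewed" vector types used by the CPU code;
   the .w component carries the body's mass, mirroring CUDA's float4. */
typedef float value_t;
typedef struct { value_t x, y, z;    } vector3d_t;
typedef struct { value_t x, y, z, w; } vector4d_t;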

Page 8: N-Body Simulations

Computes the pair-wise interactions
◦ The second loop is hidden in the acceleration function

Sequential compute_forces

void compute_forces(vector4d_t* positions, vector3d_t* forces, size_t positions_size,
                    vector4d_t* sources, size_t sources_size, value_t softening_squared)
{
    for (size_t i = 0; i < positions_size; i++) {
        forces[i] = acceleration(positions[i], sources, sources_size,
                                 forces[i], softening_squared);
    }
}

Page 9: N-Body Simulations

Computation for an individual interaction, written in C

Sequential Interaction Computation

#include <math.h>  /* for sqrt */

vector3d_t interaction(vector3d_t acceleration, vector4d_t body1, vector4d_t body2,
                       value_t softening_squared)
{
    vector3d_t force;
    force.x = body1.x - body2.x;
    force.y = body1.y - body2.y;
    force.z = body1.z - body2.z;

    float distSqr = force.x * force.x + force.y * force.y + force.z * force.z;
    distSqr += softening_squared;

    float invDist = 1.0f / sqrt(distSqr);
    float invDistCube = invDist * invDist * invDist;
    float s = body2.w * invDistCube;   /* body2.w holds the mass */

    acceleration.x += force.x * s;
    acceleration.y += force.y * s;
    acceleration.z += force.z * s;
    return acceleration;
}

Page 10: N-Body Simulations

Updates each position based on the computed forces

Sequential Kinematics Update

void update_positions(vector4d_t* positions, vector4d_t* velocities,
                      vector4d_t* current_positions, vector4d_t* current_velocities,
                      vector3d_t* acceleration, size_t size, value_t dt, value_t damping)
{
    for (size_t i = 0; i < size; i++) {
        vector4d_t current_position = current_positions[i];
        vector3d_t accel = acceleration[i];
        vector4d_t current_velocity = current_velocities[i];
        update_position(&positions[i], &velocities[i], current_position,
                        current_velocity, accel, dt, damping);
    }
}

Page 11: N-Body Simulations

Implements the previously shown equations

Sequential Update Function

void update_position(vector4d_t* position, vector4d_t* velocity,
                     vector4d_t current_position, vector4d_t current_velocity,
                     vector3d_t acceleration, value_t dt, value_t damping)
{
    current_velocity.x += acceleration.x * dt;
    current_velocity.y += acceleration.y * dt;
    current_velocity.z += acceleration.z * dt;

    current_velocity.x *= damping;
    current_velocity.y *= damping;
    current_velocity.z *= damping;

    current_position.x += current_velocity.x * dt;
    current_position.y += current_velocity.y * dt;
    current_position.z += current_velocity.z * dt;

    *position = current_position;
    *velocity = current_velocity;
}

Page 12: N-Body Simulations

Started with the implementation from GPU Gems http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html

Modified the code to work with data sizes that are larger than 256 but that are not evenly divisible by 256

Added the kinematics update

The code no longer works for sizes less than 256
◦ Needed command-line specification to control grid and block size anyway (see the launch-configuration sketch below)

CUDA Implementation
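
A minimal sketch of such a launch-size calculation (function and variable names are hypothetical; the project reads the grid and block sizes from the command line):

/* Hypothetical helper: choose a launch configuration for n bodies with
   256-thread blocks; ceiling division covers sizes not divisible by 256,
   and the kernel's grid-stride loop handles the remainder. */
static void choose_launch_config(size_t n, dim3* grid, dim3* threads, size_t* shared_bytes)
{
    const unsigned int block_size = 256;
    const unsigned int num_blocks = (unsigned int)((n + block_size - 1) / block_size);

    *threads = dim3(block_size, 1, 1);
    *grid = dim3(num_blocks, 1, 1);
    *shared_bytes = block_size * sizeof(float4);  /* one tile of float4 positions */
}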

Page 13: N-Body Simulations

Copies data to device memory and executes the compute_forces_gpu kernel
◦ Note: the cudaMalloc calls are truncated to fit the code on the slide

CUDA compute_forces

void compute_forces(vector4d_t* positions, vector3d_t* forces, size_t positions_size,
                    vector4d_t* sources, size_t sources_size, value_t softening_squared)
{
    /* ..... (device allocation, host-to-device copies, and launch setup truncated) */

    compute_forces_gpu<<< grid, threads, sharedMemSize >>>(device_positions,
        device_forces, positions_size, device_sources, sources_size,
        softening_squared);
    cudaThreadSynchronize();

    cudaMemcpy(forces, device_forces, positions_size * sizeof(float3),
               cudaMemcpyDeviceToHost);

    cudaFree(device_positions);
    cudaFree(device_sources);
    cudaFree(device_forces);

    err = cudaGetLastError();
    if (cudaSuccess != err) {
        fprintf(stderr, "Cuda error: %s: \n", cudaGetErrorString(err));
    }
}
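
The setup hidden behind the "....." is not shown; a sketch of what it would typically look like under the signature above (buffer names mirror the launch call on the slide but are assumptions, not the project's code):

/* Hypothetical device setup: allocate device buffers and copy the host data over. */
float4* device_positions = NULL;
float4* device_sources   = NULL;
float3* device_forces    = NULL;
cudaMalloc((void**)&device_positions, positions_size * sizeof(float4));
cudaMalloc((void**)&device_sources,   sources_size   * sizeof(float4));
cudaMalloc((void**)&device_forces,    positions_size * sizeof(float3));
cudaMemcpy(device_positions, positions, positions_size * sizeof(float4), cudaMemcpyHostToDevice);
cudaMemcpy(device_sources,   sources,   sources_size   * sizeof(float4), cudaMemcpyHostToDevice);
/* grid, threads, and sharedMemSize chosen as in the launch-configuration sketch earlier */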

Page 14: N-Body Simulations

Every thread computes the acceleration for its position and then moves to the next block
◦ For our test sizes this only implemented cleanup for strides not divisible by 256

CUDA compute_forces_gpu Kernel

__global__ void
compute_forces_gpu(float4* positions, float3* forces, int size,
                   float4* sources, int sources_size, float softening_squared)
{
    /* grid-stride loop: each thread handles every (blockDim.x * gridDim.x)-th body */
    for (int index = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
         index < size;
         index += blockDim.x * gridDim.x)
    {
        float4 pos = positions[index];
        forces[index] = acceleration(pos, sources, sources_size,
                                     forces[index], softening_squared);
    }
}

Page 15: N-Body Simulations

Uses float3 and float4 instead of the home-brewed vector types

Shared memory is used: 256 positions per block

Each thread strides across the grid to update a single particle

CUDA force computation

__device__ float3
acceleration(float4 position, float4* positions, int size, float3 acc,
             float softening_squared)
{
    extern __shared__ float4 sharedPos[];

    int p = blockDim.x;
    int q = blockDim.y;
    int n = size;
    int numTiles = n / (p * q);

    for (int tile = blockIdx.y; tile < numTiles + blockIdx.y; tile++) {
        sharedPos[threadIdx.x + blockDim.x * threadIdx.y] =
            positions[WRAP(blockIdx.x + tile, gridDim.x) * p + threadIdx.x];
        __syncthreads();

        // This is the "tile_calculation" function from the GPUG3 article.
        acc = gravitation(position, acc, softening_squared);
        __syncthreads();
    }
    return acc;
}
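
WRAP and gravitation are not shown on the slides; sketches consistent with the GPU Gems / CUDA SDK n-body sample this code is based on (treat both, and the device interaction prototype, as assumptions) might look like:

/* Assumed device-side counterpart of the sequential interaction() shown earlier. */
__device__ float3 interaction(float3 acc, float4 body1, float4 body2,
                              float softening_squared);

/* Assumed WRAP macro: wrap a block index back to the start of the grid once it
   runs off the end, as in the GPU Gems sample. */
#define WRAP(x, m) (((x) < (m)) ? (x) : ((x) - (m)))

/* Assumed gravitation(): accumulate the acceleration on 'position' from the
   tile of bodies currently staged in shared memory. */
__device__ float3
gravitation(float4 position, float3 acc, float softening_squared)
{
    extern __shared__ float4 sharedPos[];
    for (int k = 0; k < blockDim.x; k++) {
        acc = interaction(acc, position,
                          sharedPos[blockDim.x * threadIdx.y + k],
                          softening_squared);
    }
    return acc;
}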

Page 16: N-Body Simulations

The kernel strides in the same way as the force computation

All threads update a single position simultaneously

CUDA update_positions

__global__ void
update_positions_gpu(float4* positions, float4* velocities,
                     float4* current_positions, float4* current_velocities,
                     float3* forces, int size, float dt, float damping)
{
    for (int index = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
         index < size;
         index += blockDim.x * gridDim.x)
    {
        float4 pos = current_positions[index];
        float3 accel = forces[index];
        float4 vel = current_velocities[index];

        vel.x += accel.x * dt;
        vel.y += accel.y * dt;
        vel.z += accel.z * dt;

        vel.x *= damping;
        vel.y *= damping;
        vel.z *= damping;

        // new position = old position + velocity * deltaTime
        pos.x += vel.x * dt;
        pos.y += vel.y * dt;
        pos.z += vel.z * dt;

        // store new position and velocity
        positions[index] = pos;
        velocities[index] = vel;
    }
}

Page 17: N-Body Simulations

O(n²)/p pipeline implementation
◦ Particles are divided among the processes
◦ Particle positions are shared in a ring communication topology
◦ Force computation occurs for all particles by sending the data around the ring
◦ After all forces are computed, each process updates the kinematics of its own particles

MPI implementation

Page 18: N-Body Simulations

Compiles with both the CPU and GPU implementations

Timings have only been collected for the CPU version

MPI Driver

for (size_t i = 0; i < time_steps; i++) {
    memcpy(sendbuf, current_positions, num_particles * sizeof(vector4d_t));

    for (pipe = 0; pipe < size; pipe++) {
        if (pipe != size - 1) {
            MPI_Isend(sendbuf, num_particles, mpi_vector4d_t, right, pipe,
                      commring, &request[0]);
            MPI_Irecv(recvbuf, num_particles, mpi_vector4d_t, left, pipe,
                      commring, &request[1]);
        }

        compute_forces(positions, accel, num_particles, positions,
                       num_particles, softening_squared);

        if (pipe != size - 1)
            MPI_Waitall(2, request, statuses);

        memcpy(sendbuf, recvbuf, num_particles * sizeof(vector4d_t));
    }

    update_positions(positions, velocities, current_positions,
                     current_velocities, accel, num_particles, dt, damping);
}
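
The commring communicator, the left/right ranks, and the mpi_vector4d_t datatype used above are set up elsewhere in the driver; a plausible version of that setup (an assumption, using standard MPI calls) is:

#include <mpi.h>

/* Hypothetical setup for the ring pipeline: a contiguous 4-float MPI datatype
   for vector4d_t and a periodic 1-D Cartesian communicator whose neighbours
   provide the left and right ranks. */
static void setup_ring(MPI_Comm* commring, MPI_Datatype* mpi_vector4d_t,
                       int* left, int* right)
{
    int size = 0;
    int periodic = 1;

    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Type_contiguous(4, MPI_FLOAT, mpi_vector4d_t);   /* x, y, z, mass */
    MPI_Type_commit(mpi_vector4d_t);

    MPI_Cart_create(MPI_COMM_WORLD, 1, &size, &periodic, 1, commring);
    MPI_Cart_shift(*commring, 0, 1, left, right);        /* receive from left, send to right */
}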

Page 19: N-Body Simulations

Taken on float for the sequential and GPU runs; taken on tux for MPI

All runs used 10 iterations for the time steps

Wall-clock time was collected for comparison

Memory allocation time was omitted
◦ Except for device memory allocation and device data transfer

Timings were not collected for the code using MPI to distribute data over multiple nodes with multiple GPUs

Timings
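
Exactly how the GFLOPS numbers on the following slides were computed is not stated; a common convention, used in the GPU Gems chapter this work follows, counts roughly 20 floating-point operations per pairwise interaction, which would give approximately:

\[
\text{GFLOPS} \;\approx\; \frac{20\, n^{2} \times \text{time steps}}{\text{wall-clock time} \times 10^{9}} .
\]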

Page 20: N-Body Simulations

[Chart: sequential wall-clock timings (y-axis 0 to 350) for problem sizes 100 to 1,200.]

Sequential Timings

Page 21: N-Body Simulations

Sequential GFlops

[Chart: sequential GFLOPS, roughly 0.002 to 0.012, across the twelve problem sizes.]

Page 22: N-Body Simulations

GPU Timings

[Chart: GPU wall-clock time (y-axis 0 to 40) for problem sizes 10,000 to 200,000.]

Page 23: N-Body Simulations

GPU GFlops

[Chart: GPU GFLOPS, 0 to 250, for problem sizes 10,000 to 200,000.]

Page 24: N-Body Simulations

MPI Timings

[Chart: MPI wall-clock timings (y-axis 0 to 180) for problem sizes 100 to 1,200, with series for n=2, n=5, and n=10 nodes.]

Page 25: N-Body Simulations

MPI GFlops

[Chart: MPI GFLOPS, 0 to 0.12, for problem sizes 100 to 1,200, with series for n=2, n=5, and n=10 nodes.]

Page 26: N-Body Simulations

We achieved several orders of magnitude speed-up going to a GPU

We achieved results similar to those reported in GPU Gems

The sequential implementation was not optimal, as it did not use SSE or multiple cores, so it ran well below the theoretical peak FLOPS of the Xeon CPU

The MPI driver showed that task level parallelism can be exploited using distributed memory computing

Conclusions

Page 27: N-Body Simulations

Run the MPI-GPU version on Draco

FMM (Fast Multipole Method) MPI implementation

Multi-device GPU implementation

To Do List