Upload
lieu
View
48
Download
0
Embed Size (px)
DESCRIPTION
N-Body Simulations. Kenneth Owens. The N-Body Problem. We wish to compute the interaction between particles (bodies) given their mass and positions Simulation is performed in time steps Forces between all bodies is computed O(n 2 ) - PowerPoint PPT Presentation
Citation preview
N-Body SimulationsKenneth Owens
We wish to compute the interaction between particles (bodies) given their mass and positions
Simulation is performed in time steps◦ Forces between all bodies is computed O(n2)◦ Positions for all bodies are updated based on their
current kinematics and the interaction with other bodies O(n)
◦ Time moves forward by one step
The N-Body Problem
The force between a body i and N other bodies is approximated as above by computing the interaction given their mass (m), the distance vector between them (r_ij), and a softening factor (ε).
This is computed for all bodies with all other bodies
Force Computation
Euler Method: For each particle, a discrete timestep (dt) is used to approximate the continuous kinematic equation and update the position and velocity of each particle
Position Updatesdtdva dtadv dvvv ii 1 dtavv iii 1
dtdxv dtvdx dxxx ii 1
dtvxx iii 1
Execute an n-body simulation on a distributed memory architecture with multiple GPUs on each node
Project Objective
Sequential implementation of n-body simulation code◦ Written in C◦ Compiled using gcc-4.4 with –O3
MPI implementation◦ Written in C◦ Compiled using mpicc.mpich with gcc-4.4 using 0-3◦ Executed using mpirun.mpich on 2,5, an 10 nodes
GPU implementation◦ Written in C with CUDA extensions◦ Compiled using nvcc with gcc-4.4 using –O3◦ Executed on Nvidia 580s
MPI-GPU implementation◦ The MPI driver above was combined with the GPU kernel
implementation◦ Compiled but not tested for correctness
Project Accomplishments
The main method of the driver calls nbody nbody calls two externally linked function
◦ compute_forces computes the interactions ◦ update_positions updates the kinematics
Driver Source Codevoid nbody(vector4d_t* positions, vector4d_t* velocities, vector4d_t* current_positions, vector4d_t* current_velocities, vector3d_t* accel, size_t size, value_t dt, value_t damping, value_t softening_squared){ compute_forces(positions,accel, size, positions, size, softening_squared); update_positions(positions, velocities, current_positions, current_velocities, accel, size, dt, damping); }
Computes the pair-wise interaction◦ Hidden second loop in acceleration function
Sequential compute_forces
void compute_forces(vector4d_t* positions, vector3d_t* forces, size_t positions_size, vector4d_t* sources, size_t sources_size, value_t softening_squared){ for (size_t i = 0; i < positions_size; i++) { forces[i] = acceleration(positions[i],sources,sources_size, forces[i], softening_squared); } }
Computation for individual interaction written using c
Sequential Interaction Computation
vector3d_t interaction(vector3d_t acceleration, vector4d_t body1, vector4d_t body2, value_t softening_squared) { vector3d_t force; force.x = body1.x - body2.x; force.y = body1.y - body2.y; force.z = body1.z - body2.z; float distSqr = force.x * force.x + force.y * force.y + force.z * force.z; distSqr += softening_squared; float invDist = 1.0f / sqrt(distSqr); float invDistCube = invDist * invDist * invDist; float s = body2.w * invDistCube; acceleration.x += force.x * s; acceleration.y += force.y * s; acceleration.z += force.z * s; return acceleration;}
Updates each position based on the computed forces
Sequential Kinematics Update
void update_positions(vector4d_t* positions, vector4d_t* velocities, vector4d_t* current_positions, vector4d_t* current_velocities, vector3d_t* acceleration, size_t size, value_t dt, value_t damping){ for(size_t i = 0; i < size; i++) { vector4d_t current_position = current_positions[i]; vector3d_t accel = acceleration[i]; vector4d_t current_velocity = current_velocities[i]; update_position(&positions[i], &velocities[i], current_position, current_velocity, accel, dt, damping); }}
Implements the previously shown equations
Sequential Update Function
void update_position(vector4d_t* position, vector4d_t* velocity, vector4d_t current_position, vector4d_t current_velocity, vector3d_t acceleration,value_t dt, value_t damping){ current_velocity.x += acceleration.x * dt; current_velocity.y += acceleration.y * dt; current_velocity.z += acceleration.z * dt; current_velocity.x *= damping; current_velocity.y *= damping; current_velocity.z *= damping; current_position.x += current_velocity.x * dt; current_position.y += current_velocity.y * dt; current_position.z += current_velocity.z * dt; *position = current_position; *velocity = current_velocity;}
Started with the implementation from GPU Gems http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html
Modified the code to work with data sizes that are larger than 256 but that are not evenly divisible by 256
Added kinematics update Code no longer works for sizes less than 256
◦ Needed command line specification to control grid and block size anyway
CUDA Implementation
Copies to device memory and execute the compute_force_gpu kernel◦ Note - cudaMemAlloc truncated to fit code
CUDA compute_forces
void compute_forces(vector4d_t* positions, vector3d_t* forces, size_t positions_size, vector4d_t* sources, size_t sources_size, value_t softening_squared){ ….. compute_forces_gpu<<< grid, threads, sharedMemSize >>>(device_positions, device_forces, positions_size, device_sources, sources_size, softening_squared ); cudaThreadSynchronize(); cudaMemcpy(forces, device_forces, positions_size * sizeof(float3), cudaMemcpyDeviceToHost); cudaFree((void**)device_positions); cudaFree((void**)device_sources); cudaFree((void**)device_forces); err = cudaGetLastError(); if( cudaSuccess != err) { fprintf(stderr, "Cuda error: %s: \n", cudaGetErrorString( err) );
Every thread computes the acceleration for its position and moves to the next block◦ For our test sizes this only implemented cleanup
for strides not divisible by 256
CUDA compute_forces_gpu Kernel
__global__ voidcompute_forces_gpu(float4* positions, float3* forces,int size, float4* sources, int sources_size, float softening_squared ){ for(int index = __mul24(blockIdx.x,blockDim.x) + threadIdx.x; index < size; index += blockDim.x * gridDim.x) { float4 pos = positions[index]; forces[index] = acceleration(pos, sources, sources_size, forces[index], softening_squared);
Uses float3 and float4 instead of home brewed vector types Shared memory is used 256 positions per block Each thread strides across the grid to update a single particle
CUDA force computation
__device__ float3acceleration(float4 position, float4* positions, int size, float3 acc, float softening_squared){ extern __shared__ float4 sharedPos[]; int p = blockDim.x; int q = blockDim.y; int n = size; int numTiles = n / (p * q); for (int tile = blockIdx.y; tile < numTiles + blockIdx.y; tile++) { sharedPos[threadIdx.x+blockDim.x*threadIdx.y] = positions[WRAP(blockIdx.x + tile,gridDim.x) * p + threadIdx.x]; __syncthreads(); // This is the "tile_calculation" function from the GPUG3 article. acc = gravitation(position, acc,softening_squared); __syncthreads(); } return acc;}
Kernel strides in the same way as the force computation All threads update a single position simulaneously
CUDA update_positions
__global__ voidupdate_positions_gpu(float4* positions, float4* velocities, float4* current_positions, float4* current_velocities, float3* forces, int size, float dt, float damping){ for(int index = __mul24(blockIdx.x,blockDim.x) + threadIdx.x; index < size; index += blockDim.x * gridDim.x) { float4 pos = current_positions[index]; float3 accel = forces[index]; float4 vel = current_velocities[index]; vel.x += accel.x * dt; vel.y += accel.y * dt; vel.z += accel.z * dt; vel.x *= damping; vel.y *= damping; vel.z *= damping; // new position = old position + velocity * deltaTime pos.x += vel.x * dt; pos.y += vel.y * dt; pos.z += vel.z * dt; // store new position and velocity positions[index] = pos; velocities[index] = vel; }}
O(n2)/p pipeline implementation◦ Particles are divided among processes◦ Particle positions are shared in a ring
communication topology◦ Force computation occurs for all particles by
sending the data around the ring◦ After all forces are computed each process
updates the kinematics of its own particles
MPI implementation
Compiles with CPU and GPU implementations
Timings have only been collected for CPU
MPI Driver
for(size_t i = 0; i < time_steps; i++) { memcpy( sendbuf, current_positions, num_particles * sizeof(vector4d_t) ); for (pipe=0; pipe<size; pipe++) { if (pipe != size-1) { MPI_Isend( sendbuf, num_particles, mpi_vector4d_t, right, pipe, commring, &request[0] ); MPI_Irecv( recvbuf, num_particles, mpi_vector4d_t, left, pipe, commring, &request[1] ); } compute_forces(positions,accel, num_particles, positions, num_particles, softening_squared); if (pipe != size-1) MPI_Waitall( 2, request, statuses ); memcpy( sendbuf, recvbuf, num_particles * sizeof(vector4d_t) ); } update_positions(positions, velocities, current_positions, current_velocities, accel, num_particles, dt, damping); }
Taken of float for sequential and gpu Taken on tux for mpi All used 10 iterations for time steps Wallclock time was collected for comparison Memory allocation time was omitted
◦ Except for device memory allocation and device data transfer
Timings where not collected for the code using MPI to distribute data over multiple nodes with multiple GPUs
Timings
100 200 300 400 500 600 700 800 900 1000 1100 12000
50
100
150
200
250
300
350
sequential
sequential
Sequential Timings
Sequential GFlops
1 2 3 4 5 6 7 8 9 10 11 120
0.002
0.004
0.006
0.008
0.01
0.012
Sequential
gflops
GPU Timings
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
110000
120000
130000
140000
150000
160000
170000
180000
190000
200000
0
5
10
15
20
25
30
35
40
GPU
time
GPU GFlops
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
110000
120000
130000
140000
150000
160000
170000
180000
190000
200000
0
50
100
150
200
250
GPU
gflops
MPI Timings
100 200 300 400 500 600 700 800 900 1000 1100 12000
20
40
60
80
100
120
140
160
180
MPI
n=2n=5n=10
MPI GFlops
100 200 300 400 500 600 700 800 900 1000 1100 12000
0.02
0.04
0.06
0.08
0.1
0.12
MPI
n=2n=5n=10
We achieved several orders of magnitude speed-up going to a GPU
We achieved similar results to what was obtained in GPU gems
The sequential implementation was not optimal as it did not use SSE or multiple cores – much lower than the theoretical possible FLOPs for the Xeon CPU
The MPI driver showed that task level parallelism can be exploited using distributed memory computing
Conclusions
Run the MPI GPU version on Draco FMM (Fast Multiple Method) MPI
implementation Multi-device GPU implementation
To Do List