Brief Overview of a Parallel N-Body Code


This is a brief overview of an N-body code, covering the sequential version, the parallel versions (OpenMP and CUDA), and their computational complexity.


Filipo Novo Mór
Graduate Program in Computer Science, UFRGS
Prof. Nicollas Maillard
December 2013

Implementation and analysis

Overview
• About the N-body problem
• The Serial Implementation
• The OpenMP Implementation
• The CUDA Implementation
• Experimental Results
• Conclusion

About the N-body problem
Features:
• Force calculation between all pairs of particles: complexity O(N²).
• Total energy should remain constant (the usual correctness check).
• The brute-force algorithm demands huge computational power.

The Serial Implementation: NAIVE!
• Clearly O(N²).
• Each pair is evaluated twice.
• Acceleration has to be adjusted at the end (see the sketch below).
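As a concrete reference, here is a minimal sketch of what such a naive pass can look like. The Particle layout, the constants, and the function name are illustrative assumptions, not the author's actual code:

    #include <math.h>

    #define G    6.674e-11f   /* gravitational constant */
    #define EPS2 1e-9f        /* softening factor squared (epsilon^2) */

    typedef struct {
        float x, y, z;        /* position */
        float vx, vy, vz;     /* velocity */
        float ax, ay, az;     /* acceleration (output) */
        float m;              /* mass */
    } Particle;

    /* Naive O(N^2) pass: every ordered pair (i, j) is visited, so each
     * unordered pair is evaluated twice. */
    void forces_naive(Particle *p, int n) {
        for (int i = 0; i < n; i++) {
            float fx = 0.0f, fy = 0.0f, fz = 0.0f;
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                float dx = p[j].x - p[i].x;
                float dy = p[j].y - p[i].y;
                float dz = p[j].z - p[i].z;
                float d2 = dx*dx + dy*dy + dz*dz + EPS2;
                float s  = G * p[i].m * p[j].m / (d2 * sqrtf(d2));
                fx += s * dx;  fy += s * dy;  fz += s * dz;
            }
            /* "adjusted at the end": divide the accumulated force by the
             * mass to obtain the acceleration. */
            p[i].ax = fx / p[i].m;
            p[i].ay = fy / p[i].m;
            p[i].az = fz / p[i].m;
        }
    }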

The Serial Implementation: smart
• It still stays within the O(N²) domain, but:
• Each pair is evaluated only once.
• The acceleration is already correct at the end (see the sketch below).
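A matching sketch of the smart pass, reusing the hypothetical Particle type and constants from the naive sketch. Starting the inner loop at j = i + 1 visits each unordered pair once, and Newton's third law supplies the symmetric contribution; because the masses cancel per side, the accumulated values are already accelerations:

    /* Smart pass: each unordered pair evaluated exactly once. */
    void forces_smart(Particle *p, int n) {
        for (int i = 0; i < n; i++)
            p[i].ax = p[i].ay = p[i].az = 0.0f;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                float dx = p[j].x - p[i].x;
                float dy = p[j].y - p[i].y;
                float dz = p[j].z - p[i].z;
                float d2 = dx*dx + dy*dy + dz*dz + EPS2;
                float s  = G / (d2 * sqrtf(d2));
                p[i].ax += s * p[j].m * dx;   /* pull on i toward j */
                p[i].ay += s * p[j].m * dy;
                p[i].az += s * p[j].m * dz;
                p[j].ax -= s * p[i].m * dx;   /* equal and opposite on j */
                p[j].ay -= s * p[i].m * dy;
                p[j].az -= s * p[i].m * dz;
            }
        }
    }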

The OpenMP Implementation

• MUST be based on the "naive" version.
• We lose the "/2", but we gain the "/p"!
• Note: the static schedule seems to be slightly faster than the dynamic schedule (see the sketch below).
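A sketch of the OpenMP pass over the same hypothetical types. It parallelizes the naive body: each thread then writes only its own p[i], so there are no conflicts, whereas the smart version's symmetric updates to p[j] would race across threads. That is why it must be based on the naive version:

    #include <omp.h>

    /* OpenMP pass: the naive outer loop split across p threads. */
    void forces_omp(Particle *p, int n) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) {
            float fx = 0.0f, fy = 0.0f, fz = 0.0f;
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                float dx = p[j].x - p[i].x;
                float dy = p[j].y - p[i].y;
                float dz = p[j].z - p[i].z;
                float d2 = dx*dx + dy*dy + dz*dz + EPS2;
                float s  = G * p[i].m * p[j].m / (d2 * sqrtf(d2));
                fx += s * dx;  fy += s * dy;  fz += s * dz;
            }
            p[i].ax = fx / p[i].m;
            p[i].ay = fy / p[i].m;
            p[i].az = fz / p[i].m;
        }
    }

Every outer iteration costs the same (a full inner loop of length N), so static scheduling avoids the bookkeeping overhead of dynamic scheduling, which is consistent with the observation above.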

Analysis

"naive" Serial: every ordered pair (i, j) is visited, so the cost is ≈ $n^2$ force evaluations.

"smart" Serial: the triangular loop visits each unordered pair exactly once. The toy loop below prints one star per evaluated pair, producing a triangle of stars:

    for (i = 0; i < N; i++) {
        for (j = i + 1; j < N; j++) {
            printf("*");
        }
        printf("\n");
    }

so the cost is

$$ \approx \frac{n(n-1)}{2} $$

OpenMP Parallel: the naive loop is split across $p$ threads, giving ≈ $n^2 / p$ evaluations per thread.

The CUDA Implementation

Basic CUDA GPU architecture

[Diagram sequence, reconstructed from slide residue: with N = 15 particles and a tile size of K = 3, each frame shows one tile (0 1 2, then 3 4 5, 6 7 8, 9 10 11, 12 13 14) being staged from Global Memory into the Shared Memory Bank as an "Active Transfer", while the "Active Tasks" compute on the tile already resident in shared memory; a BARRIER separates consecutive tiles. In the final frames all 15 results are written back to Global Memory.]

Analysis of the CUDA implementation:
• First of all, all the elements are transferred from host to device memory.
• Each thread is responsible for only one particle.
• Barrier synchronizations separate the transfers between shared and global memory: at each barrier, one tile of K elements is transferred into shared memory at a time.
• At the end, all elements are copied from shared to global memory at once, and finally copied back to CPU memory.

C: cost of the CalculateForce function.
M: transfer cost between global and shared memories.
T: transfer cost between CPU and device memories.
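The slide's own cost expression did not survive extraction. Under the assumption that M is the cost of staging one K-element tile, a plausible per-thread reading of the tiled kernel is

$$ \text{cost} \;\approx\; 2T + \frac{N}{K}\,(M + K \cdot C) \;=\; 2T + \frac{N}{K}\,M + N\,C $$

that is, one transfer in each direction between CPU and device, one shared-memory staging per tile, and one CalculateForce per interaction partner.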

Access to shared memory is around 100× faster than access to global memory.
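A minimal sketch of a tiled kernel matching the diagrams above; the tile size K, the float4 layout with the mass in .w, and the function name are assumptions, not the author's actual kernel:

    #include <cuda_runtime.h>
    #include <math.h>

    #define G    6.674e-11f   /* gravitational constant */
    #define EPS2 1e-9f        /* softening factor squared */
    #define K    128          /* tile size = threads per block (K = 3 in the diagrams) */

    /* One thread per particle. The block stages K particles at a time from
     * global memory into shared memory; __syncthreads() is the BARRIER shown
     * between tiles. Assumes n is a multiple of K. */
    __global__ void forces_cuda(const float4 *pos, float3 *acc, int n) {
        __shared__ float4 tile[K];                        /* the shared memory bank */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float4 pi = pos[i];
        float3 a = make_float3(0.0f, 0.0f, 0.0f);
        for (int base = 0; base < n; base += K) {
            tile[threadIdx.x] = pos[base + threadIdx.x];  /* active transfer */
            __syncthreads();                              /* barrier: tile loaded */
            for (int j = 0; j < K; j++) {                 /* active tasks */
                float dx = tile[j].x - pi.x;
                float dy = tile[j].y - pi.y;
                float dz = tile[j].z - pi.z;
                float d2 = dx*dx + dy*dy + dz*dz + EPS2;  /* self term adds 0 */
                float s  = G * tile[j].w / (d2 * sqrtf(d2));
                a.x += s * dx;  a.y += s * dy;  a.z += s * dz;
            }
            __syncthreads();                              /* barrier: tile reusable */
        }
        acc[i] = a;                                       /* one write to global memory */
    }

A launch such as forces_cuda<<<n / K, K>>>(d_pos, d_acc, n), preceded and followed by cudaMemcpy calls for the positions and accelerations, accounts for the host-device transfers; the diagrams' N = 15 with K = 3 satisfies the multiple-of-K assumption.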

Experimental Results

Testing Environment: Dell PowerEdge R610

• 2× Intel Xeon Quad-Core E5520 @ 2.27 GHz with Hyper-Threading: 8 physical cores, 16 threads
• 16 GB RAM
• NVIDIA Tesla S2050
• Ubuntu Server 10.04 LTS, GCC 4.4.3, CUDA 5.0

How much would it cost?

Version   Cost
Naive     $0.49
Smart     $0.33
OMP       $0.08
CUDA      $0.05

Amazon EC2: General Purpose m1.large plan (CPU versions); GPU Instances g2.2xlarge plan (CUDA version).

Conclusions
• PRAM is an adequate cost model for the sequential and OpenMP versions.
• But for CUDA, we need a better model, one that considers thread blocks, warps, and latency.

Thanks!

Additional Slides

• Calculations

$$ f_i \;\approx\; \sum_{\substack{1 \le j \le N \\ j \ne i}} \frac{G\, m_i\, m_j\, r_{ij}}{\left( \lVert r_{ij} \rVert^2 + \varepsilon^2 \right)^{3/2}} $$

Force (acceleration)

$$ E = E_k + E_p $$

$$ E_p = - \sum_{1 \le i \le N} \; \sum_{\substack{1 \le j \le N \\ j \ne i}} \frac{G\, m_i\, m_j}{\lVert r_{ij} \rVert} \qquad E_k = \sum_{1 \le i \le N} \frac{m_i\, v_i^2}{2} $$

Energy (kinetic and potential)
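Since constant total energy is the stated correctness check, here is a small host-side sketch of that check, again reusing the hypothetical Particle type from the serial sketches; the pair-once inner loop counts each potential term exactly once:

    /* Total energy E = Ek + Ep; it should stay (nearly) constant
     * across time steps. */
    float total_energy(const Particle *p, int n) {
        float ek = 0.0f, ep = 0.0f;
        for (int i = 0; i < n; i++) {
            float v2 = p[i].vx*p[i].vx + p[i].vy*p[i].vy + p[i].vz*p[i].vz;
            ek += 0.5f * p[i].m * v2;            /* Ek term: m v^2 / 2 */
            for (int j = i + 1; j < n; j++) {    /* each pair once */
                float dx = p[j].x - p[i].x;
                float dy = p[j].y - p[i].y;
                float dz = p[j].z - p[i].z;
                ep -= G * p[i].m * p[j].m / sqrtf(dx*dx + dy*dy + dz*dz);
            }
        }
        return ek + ep;
    }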

Softening Factor (ε)
• Makes the simulated system collisionless: close encounters never produce singular forces.
• Particles behave as virtual particles rather than physically colliding bodies.

