Brief overview of a parallel nbody code: implementation and analysis
Filipo Novo Mór, Graduate Program in Computer Science, UFRGS
Prof. Nicollas Maillard
December 2013


This is a brief overview of an nbody code, covering the sequential and parallel versions (OpenMP and CUDA) and their computational complexity.



Overview

• About the nbody problem
• The Serial Implementation
• The OpenMP Implementation
• The CUDA Implementation
• Experimental Results
• Conclusion


About the nbody problem

Features:
• Forces are calculated between all pairs of particles.
• Complexity is O(N^2).
• Total energy should remain constant.
• The brute-force algorithm demands huge computational power.


The Serial Implementation: NAIVE!

• Clearly N^2
• Each pair is evaluated twice
• Acceleration has to be adjusted at the end.


The Serial Implementation: smart

• It is still in the N^2 domain, but:
• Each pair is evaluated only once.
• Acceleration is OK at the end!


The OpenMP Implementation

• MUST be based on the “naive” version.
• We lose the “/2”, but we gain the “/p”!
• Note: the static schedule seems to be slightly faster than the dynamic schedule.


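Analysis

“naive” Serial

“smart” Serial

for (i = 0; i < N; i++) {
    for (j = i + 1; j < N; j++) {
        printf("*");
    }
    printf("\n");
}

[Output: a triangle of asterisks, one row per value of i, each row one asterisk shorter than the previous one.]

≈ n(n−1)/2

OpenMP Parallel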
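The interaction counts behind this analysis can be checked by counting loop iterations directly. The two helper functions below are hypothetical and only mirror the loop bounds of the naive and smart versions; they do no physics.

```c
/* Interactions visited by the naive double loop: all ordered pairs, i != j. */
unsigned long count_naive(unsigned long n)
{
    unsigned long c = 0;
    for (unsigned long i = 0; i < n; i++)
        for (unsigned long j = 0; j < n; j++)
            if (j != i)
                c++;
    return c;
}

/* Interactions visited by the smart double loop: unordered pairs, j > i. */
unsigned long count_smart(unsigned long n)
{
    unsigned long c = 0;
    for (unsigned long i = 0; i < n; i++)
        for (unsigned long j = i + 1; j < n; j++)
            c++;
    return c;
}
```

For any n, `count_smart(n)` equals n(n−1)/2 and `count_naive(n)` is exactly twice that, which is the “/2” the smart version gains and the OpenMP version gives up.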


The CUDA Implementation

Basic CUDA GPU architecture

[Animation, slides 9 to 21: N = 15 particles (indices 0 to 14) are held in Global Memory, and a Shared Memory Bank holds K = 3 particles at a time. The frames load the tiles 0 1 2, 3 4 5, 6 7 8, 9 10 11 and 12 13 14 into shared memory one after another, with BARRIER frames separating the transfers. Legend: Active Tasks, Active Transfers.]

Analysis

CUDA implementation:
• First, all elements are transferred from host to device memory.
• Each thread is responsible for only one particle.
• Barriers synchronize the transfers between global and shared memory.
• At each barrier, K elements are transferred to shared memory at a time.
• At the end, all elements are copied from local to global memory at once, and finally copied back to CPU memory.

C: cost of the CalculateForce function.
M: transfer cost between global and shared memories.
T: transfer cost between CPU and device memories.

Access to shared memory is around 100x faster than access to global memory.


Experimental Results

Testing Environment:
• Dell PowerEdge R610
• 2 Intel Xeon Quad-Core E5520, 2.27 GHz, Hyper-Threading: 8 physical cores, 16 threads
• RAM: 16 GB
• NVIDIA Tesla S2050
• Ubuntu Server 10.04 LTS
• GCC 4.4.3
• CUDA 5.0

How much would it cost?

Version   Cost
Naive     $0.49
Smart     $0.33
OMP       $0.08
CUDA      $0.05

Amazon EC2 plans: General Purpose (m1.large) and GPU Instances (g2.2xlarge).


Conclusions

• The PRAM model is adequate for the sequential and OpenMP versions.
• But for CUDA, we need a better model!
  – One that considers thread blocks, warps and latency.

Thanks!


Additional Slides


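About the nbody problem

• Calculations

Force (acceleration):

$$ f_i \approx G m_i \sum_{\substack{1 \le j \le N \\ j \ne i}} \frac{m_j \, \mathbf{r}_{ij}}{\left( \lVert \mathbf{r}_{ij} \rVert^2 + \varepsilon^2 \right)^{3/2}} $$

Energy (kinetic and potential):

$$ E = E_k + E_p $$

$$ E_p = - \sum_{\substack{1 \le j \le N \\ i \ne j}} \frac{G m_i m_j}{\lVert \mathbf{r}_{ij} \rVert} $$

$$ E_k = \sum_{1 \le i \le N} \frac{m_i v_i^2}{2} $$

Softening factor ε: collisionless system, virtual particles.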
