A brief overview of an n-body code, covering the sequential and parallel versions (OpenMP and CUDA) and their computational complexity.
Brief overview of a parallel n-body code
Implementation and analysis

Filipo Novo Mór
Graduate Program in Computer Science, UFRGS
Prof. Nicollas Maillard
December 2013
Overview
• About the n-body problem
• The serial implementation
• The OpenMP implementation
• The CUDA implementation
• Experimental results
• Conclusion
About the n-body problem
Features:
• Force calculation between all pairs of particles: complexity O(N²).
• Total energy should remain constant.
• The brute-force algorithm demands huge computational power.
The Serial Implementation: naive
• Clearly O(N²).
• Each pair is evaluated twice.
• Accelerations have to be adjusted at the end.
The Serial Implementation: smart
• Still in the O(N²) domain, but:
• Each pair is evaluated only once.
• Accelerations are already correct at the end, with no extra adjustment pass.
The OpenMP Implementation
• MUST be based on the "naive" version: in the "smart" version, different threads would write to the same acc[j] entries, creating race conditions.
• We lose the "/2", but we gain the "/p"!
• Note: the static schedule seems to be slightly faster than the dynamic schedule here.
Analysis

"Naive" serial: every ordered pair (i, j) is visited, so the cost is ≈ n².

"Smart" serial: the inner loop starts at j = i + 1. The iteration pattern can be visualized with:

    for (i = 0; i < N; i++) {
        for (j = i + 1; j < N; j++) {
            printf("*");
        }
        printf("\n");
    }

which prints a triangle of stars, one shorter row per value of i, for a total of ≈ n(n − 1)/2.

OpenMP parallel: the naive n² work is split across p threads, so the cost is ≈ n²/p.
The CUDA Implementation
Basic CUDA GPU architecture
0123456789
1011121314
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Global Memory
Shared Memory Bank
N = 15K = 3
0123456789
1011121314
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
BARR
IER
Global Memory
Shared Memory Bank
Active Tasks
Active Transfers
0123456789
1011121314
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 1 2
Global Memory
Shared Memory Bank
Active Tasks
Active Transfers
0123456789
1011121314
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
BARR
IER
Global Memory
Shared Memory Bank
Active Tasks
Active Transfers
0123456789
1011121314
3 4 5
Global Memory
Shared Memory Bank
Active Tasks
Active Transfers 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
BARR
IER
Global Memory
Shared Memory Bank
Active Tasks
Active Transfers
0123456789
1011121314
0123456789
1011121314
6 7 8
Global Memory
Shared Memory Bank
Active Tasks
Active Transfers 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
BARR
IER
Global Memory
Shared Memory Bank
Active Tasks
Active Transfers
0123456789
1011121314
0123456789
1011121314
9 10
11
Global Memory
Shared Memory Bank
Active Tasks
Active Transfers 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
BARR
IER
Global Memory
Shared Memory Bank
Active Tasks
Active Transfers
0123456789
1011121314
0123456789
1011121314
12
13
14
Global Memory
Shared Memory Bank
Active Tasks
Active Transfers 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0123456789
1011121314
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Global Memory
Shared Memory Bank
Active Tasks
Active Transfers
0123456789
1011121314
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Global Memory
Shared Memory Bank
Analysis of the CUDA implementation:
First of all, all the elements are transferred from host to device (global) memory. Each thread is responsible for exactly one particle. Barriers synchronize the work between shared and global memory: at each barrier, one tile of K elements is transferred into shared memory at a time. At the end, all elements are copied from shared back to global memory at once, and finally copied back to CPU memory.

C: cost of the CalculateForce function.
M: transfer cost between global and shared memory.
T: transfer cost between CPU and device memory.
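With these constants, one plausible per-thread cost for a single force pass is the following (my reconstruction; the slide defines the constants but does not state the formula):

```latex
\[
\text{Cost} \;\approx\; 2T \;+\; \frac{N}{K}\,M \;+\; N\,C
\]
```

That is: one host-to-device and one device-to-host transfer (2T), N/K tile loads into shared memory (M each), and N evaluations of CalculateForce per particle.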
Access to shared memory is around 100× faster than access to global memory.
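The tiling scheme can be sketched as a CUDA kernel in the style of the classic shared-memory n-body pattern (a sketch under assumptions: float4 positions with the mass in .w, G = 1, and names of my own choosing, not the author's kernel):

```cuda
#define EPS2 1e-9f

/* One thread per particle; bodies are staged through shared memory
 * in tiles of blockDim.x, with a barrier between tiles. */
__global__ void nbody_forces(const float4 *pos, float4 *acc, int n) {
    extern __shared__ float4 tile[];          /* one tile of bodies */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float4 my = (i < n) ? pos[i] : make_float4(0, 0, 0, 0);
    float3 a = make_float3(0.0f, 0.0f, 0.0f);

    for (int base = 0; base < n; base += blockDim.x) {
        int j = base + threadIdx.x;
        tile[threadIdx.x] = (j < n) ? pos[j] : make_float4(0, 0, 0, 0);
        __syncthreads();                      /* barrier: tile loaded */
        for (int k = 0; k < blockDim.x && base + k < n; k++) {
            float dx = tile[k].x - my.x;
            float dy = tile[k].y - my.y;
            float dz = tile[k].z - my.z;
            float r2 = dx*dx + dy*dy + dz*dz + EPS2;
            float inv_r3 = rsqrtf(r2 * r2 * r2);
            a.x += tile[k].w * dx * inv_r3;   /* .w holds the mass */
            a.y += tile[k].w * dy * inv_r3;
            a.z += tile[k].w * dz * inv_r3;
        }
        __syncthreads();          /* barrier before the tile is reused */
    }
    if (i < n) acc[i] = make_float4(a.x, a.y, a.z, 0.0f);
}
```

A launch would look like nbody_forces<<<(n + K - 1) / K, K, K * sizeof(float4)>>>(d_pos, d_acc, n), where K is the tile (block) size; the self-interaction term vanishes because the displacement is zero and the softening keeps the denominator finite.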
Experimental Results
Testing environment: Dell PowerEdge R610
• 2 × Intel Xeon Quad-Core E5520, 2.27 GHz, with Hyper-Threading (8 physical cores, 16 threads)
• 16 GB RAM
• NVIDIA Tesla S2050
• Ubuntu Server 10.04 LTS, GCC 4.4.3, CUDA 5.0
How much would it cost?

Version   Cost
Naive     $0.49
Smart     $0.33
OMP       $0.08
CUDA      $0.05

Amazon EC2: General Purpose (m1.large plan) and GPU Instances (g2.2xlarge plan).
Conclusions
• PRAM is OK as a cost model for the sequential and OpenMP versions.
• But for CUDA, we need a better model, one that takes thread blocks, warps, and latency into account.

Thanks!
Additional Slides
• Calculations

Force (acceleration):

\[
f_i \;\approx\; \sum_{\substack{1 \le j \le N \\ j \ne i}}
\frac{G\, m_i\, m_j\, \vec{r}_{ij}}
     {\left(\lVert \vec{r}_{ij} \rVert^{2} + \varepsilon^{2}\right)^{3/2}}
\]

Energy (kinetic and potential):

\[
E = E_k + E_p, \qquad
E_k = \sum_{1 \le i \le N} \frac{m_i v_i^{2}}{2}, \qquad
E_p = -\sum_{1 \le i < j \le N} \frac{G\, m_i\, m_j}{\lVert \vec{r}_{ij} \rVert}
\]

Softening factor ε: models a collisionless system by smoothing bodies into "virtual" particles, keeping the force finite as ‖r_ij‖ → 0.