Brief Overview of a Parallel N-Body Code


This is a brief overview of an N-body code, covering the sequential version, the parallel versions (OpenMP and CUDA), and their computational complexity.


Filipo Novo Mór
Graduate Program in Computer Science, UFRGS
Prof. Nicollas Maillard
December 2013

Implementation and analysis

Overview
• About the N-body problem
• The Serial Implementation
• The OpenMP Implementation
• The CUDA Implementation
• Experimental Results
• Conclusion

About the N-body problem
Features:
• Force calculation between all pairs of particles: complexity O(N²).
• Total energy should remain constant (the usual correctness check).
• The brute-force algorithm demands huge computational power.

The Serial Implementation: NAIVE!
• Clearly O(N²).
• Each pair is evaluated twice.
• Acceleration has to be adjusted at the end (see the sketch below).
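As a concrete reference, here is a minimal sketch of what such a naive pass can look like. The Particle layout, the constants, and the function name are illustrative assumptions, not the author's actual code:

    #include <math.h>

    #define G    6.674e-11f   /* gravitational constant */
    #define EPS2 1e-9f        /* softening factor squared (epsilon^2) */

    typedef struct {
        float x, y, z;        /* position */
        float vx, vy, vz;     /* velocity */
        float ax, ay, az;     /* acceleration (output) */
        float m;              /* mass */
    } Particle;

    /* Naive O(N^2) pass: every ordered pair (i, j) is visited, so each
     * unordered pair is evaluated twice. */
    void forces_naive(Particle *p, int n) {
        for (int i = 0; i < n; i++) {
            float fx = 0.0f, fy = 0.0f, fz = 0.0f;
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                float dx = p[j].x - p[i].x;
                float dy = p[j].y - p[i].y;
                float dz = p[j].z - p[i].z;
                float d2 = dx*dx + dy*dy + dz*dz + EPS2;
                float s  = G * p[i].m * p[j].m / (d2 * sqrtf(d2));
                fx += s * dx;  fy += s * dy;  fz += s * dz;
            }
            /* "adjusted at the end": divide the accumulated force by the
             * mass to obtain the acceleration. */
            p[i].ax = fx / p[i].m;
            p[i].ay = fy / p[i].m;
            p[i].az = fz / p[i].m;
        }
    }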

The Serial Implementation: smart
• It still stays within the O(N²) domain, but:
• Each pair is evaluated only once.
• The acceleration is already correct at the end (see the sketch below).
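A matching sketch of the smart pass, reusing the hypothetical Particle type and constants from the naive sketch. Starting the inner loop at j = i + 1 visits each unordered pair once, and Newton's third law supplies the symmetric contribution; because the masses cancel per side, the accumulated values are already accelerations:

    /* Smart pass: each unordered pair evaluated exactly once. */
    void forces_smart(Particle *p, int n) {
        for (int i = 0; i < n; i++)
            p[i].ax = p[i].ay = p[i].az = 0.0f;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                float dx = p[j].x - p[i].x;
                float dy = p[j].y - p[i].y;
                float dz = p[j].z - p[i].z;
                float d2 = dx*dx + dy*dy + dz*dz + EPS2;
                float s  = G / (d2 * sqrtf(d2));
                p[i].ax += s * p[j].m * dx;   /* pull on i toward j */
                p[i].ay += s * p[j].m * dy;
                p[i].az += s * p[j].m * dz;
                p[j].ax -= s * p[i].m * dx;   /* equal and opposite on j */
                p[j].ay -= s * p[i].m * dy;
                p[j].az -= s * p[i].m * dz;
            }
        }
    }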

The OpenMP Implementation

• MUST be based on the "naive" version.
• We lose the "/2", but we gain the "/p"!
• Note: the static schedule seems to be slightly faster than the dynamic schedule (see the sketch below).
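A sketch of the OpenMP pass over the same hypothetical types. It parallelizes the naive body: each thread then writes only its own p[i], so there are no conflicts, whereas the smart version's symmetric updates to p[j] would race across threads. That is why it must be based on the naive version:

    #include <omp.h>

    /* OpenMP pass: the naive outer loop split across p threads. */
    void forces_omp(Particle *p, int n) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) {
            float fx = 0.0f, fy = 0.0f, fz = 0.0f;
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                float dx = p[j].x - p[i].x;
                float dy = p[j].y - p[i].y;
                float dz = p[j].z - p[i].z;
                float d2 = dx*dx + dy*dy + dz*dz + EPS2;
                float s  = G * p[i].m * p[j].m / (d2 * sqrtf(d2));
                fx += s * dx;  fy += s * dy;  fz += s * dz;
            }
            p[i].ax = fx / p[i].m;
            p[i].ay = fy / p[i].m;
            p[i].az = fz / p[i].m;
        }
    }

Every outer iteration costs the same (a full inner loop of length N), so static scheduling avoids the bookkeeping overhead of dynamic scheduling, which is consistent with the observation above.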

Analysis

"naive" Serial: every ordered pair (i, j) is visited, so the cost is ≈ $n^2$ force evaluations.

"smart" Serial: the triangular loop visits each unordered pair exactly once. The toy loop below prints one star per evaluated pair, producing a triangle of stars:

    for (i = 0; i < N; i++) {
        for (j = i + 1; j < N; j++) {
            printf("*");
        }
        printf("\n");
    }

so the cost is

$$ \approx \frac{n(n-1)}{2} $$

OpenMP Parallel: the naive loop is split across $p$ threads, giving ≈ $n^2 / p$ evaluations per thread.

The CUDA Implementation

Basic CUDA GPU architecture

[Diagram sequence, reconstructed from slide residue: with N = 15 particles and a tile size of K = 3, each frame shows one tile (0 1 2, then 3 4 5, 6 7 8, 9 10 11, 12 13 14) being staged from Global Memory into the Shared Memory Bank as an "Active Transfer", while the "Active Tasks" compute on the tile already resident in shared memory; a BARRIER separates consecutive tiles. In the final frames all 15 results are written back to Global Memory.]

Analysis of the CUDA implementation:
• First of all, all the elements are transferred from host to device memory.
• Each thread is responsible for only one particle.
• Barrier synchronizations separate the transfers between shared and global memory: at each barrier, one tile of K elements is transferred into shared memory at a time.
• At the end, all elements are copied from shared to global memory at once, and finally copied back to CPU memory.

C: cost of the CalculateForce function.
M: transfer cost between global and shared memories.
T: transfer cost between CPU and device memories.
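The slide's own cost expression did not survive extraction. Under the assumption that M is the cost of staging one K-element tile, a plausible per-thread reading of the tiled kernel is

$$ \text{cost} \;\approx\; 2T + \frac{N}{K}\,(M + K \cdot C) \;=\; 2T + \frac{N}{K}\,M + N\,C $$

that is, one transfer in each direction between CPU and device, one shared-memory staging per tile, and one CalculateForce per interaction partner.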

Access to shared memory is around 100× faster than access to global memory.
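A minimal sketch of a tiled kernel matching the diagrams above; the tile size K, the float4 layout with the mass in .w, and the function name are assumptions, not the author's actual kernel:

    #include <cuda_runtime.h>
    #include <math.h>

    #define G    6.674e-11f   /* gravitational constant */
    #define EPS2 1e-9f        /* softening factor squared */
    #define K    128          /* tile size = threads per block (K = 3 in the diagrams) */

    /* One thread per particle. The block stages K particles at a time from
     * global memory into shared memory; __syncthreads() is the BARRIER shown
     * between tiles. Assumes n is a multiple of K. */
    __global__ void forces_cuda(const float4 *pos, float3 *acc, int n) {
        __shared__ float4 tile[K];                        /* the shared memory bank */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float4 pi = pos[i];
        float3 a = make_float3(0.0f, 0.0f, 0.0f);
        for (int base = 0; base < n; base += K) {
            tile[threadIdx.x] = pos[base + threadIdx.x];  /* active transfer */
            __syncthreads();                              /* barrier: tile loaded */
            for (int j = 0; j < K; j++) {                 /* active tasks */
                float dx = tile[j].x - pi.x;
                float dy = tile[j].y - pi.y;
                float dz = tile[j].z - pi.z;
                float d2 = dx*dx + dy*dy + dz*dz + EPS2;  /* self term adds 0 */
                float s  = G * tile[j].w / (d2 * sqrtf(d2));
                a.x += s * dx;  a.y += s * dy;  a.z += s * dz;
            }
            __syncthreads();                              /* barrier: tile reusable */
        }
        acc[i] = a;                                       /* one write to global memory */
    }

A launch such as forces_cuda<<<n / K, K>>>(d_pos, d_acc, n), preceded and followed by cudaMemcpy calls for the positions and accelerations, accounts for the host-device transfers; the diagrams' N = 15 with K = 3 satisfies the multiple-of-K assumption.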

Experimental Results

Testing Environment: Dell PowerEdge R610

• 2× Intel Xeon Quad-Core E5520 @ 2.27 GHz with Hyper-Threading: 8 physical cores, 16 threads
• 16 GB RAM
• NVIDIA Tesla S2050
• Ubuntu Server 10.04 LTS, GCC 4.4.3, CUDA 5.0

How much would it cost?

Version   Cost
Naive     $0.49
Smart     $0.33
OMP       $0.08
CUDA      $0.05

Amazon EC2: General Purpose m1.large plan (CPU versions); GPU Instances g2.2xlarge plan (CUDA version).

Conclusions
• PRAM is an adequate cost model for the sequential and OpenMP versions.
• But for CUDA, we need a better model, one that considers thread blocks, warps, and latency.

Thanks!

Additional Slides

• Calculations

$$ f_i \;\approx\; \sum_{\substack{1 \le j \le N \\ j \ne i}} \frac{G\, m_i\, m_j\, r_{ij}}{\left( \lVert r_{ij} \rVert^2 + \varepsilon^2 \right)^{3/2}} $$

Force (acceleration)

$$ E = E_k + E_p $$

$$ E_p = - \sum_{1 \le i \le N} \; \sum_{\substack{1 \le j \le N \\ j \ne i}} \frac{G\, m_i\, m_j}{\lVert r_{ij} \rVert} \qquad E_k = \sum_{1 \le i \le N} \frac{m_i\, v_i^2}{2} $$

Energy (kinetic and potential)
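Since constant total energy is the stated correctness check, here is a small host-side sketch of that check, again reusing the hypothetical Particle type from the serial sketches; the pair-once inner loop counts each potential term exactly once:

    /* Total energy E = Ek + Ep; it should stay (nearly) constant
     * across time steps. */
    float total_energy(const Particle *p, int n) {
        float ek = 0.0f, ep = 0.0f;
        for (int i = 0; i < n; i++) {
            float v2 = p[i].vx*p[i].vx + p[i].vy*p[i].vy + p[i].vz*p[i].vz;
            ek += 0.5f * p[i].m * v2;            /* Ek term: m v^2 / 2 */
            for (int j = i + 1; j < n; j++) {    /* each pair once */
                float dx = p[j].x - p[i].x;
                float dy = p[j].y - p[i].y;
                float dz = p[j].z - p[i].z;
                ep -= G * p[i].m * p[j].m / sqrtf(dx*dx + dy*dy + dz*dz);
            }
        }
        return ek + ep;
    }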

Softening Factor (ε)
• Makes the simulated system collisionless: close encounters never produce singular forces.
• Particles behave as virtual particles rather than physically colliding bodies.

