Basic Concepts in Parallelization

RvdP/V1 Tutorial IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010

IWOMP 2010
CCS, University of Tsukuba
Tsukuba, Japan, June 14-16, 2010

Ruud van der Pas
Senior Staff Engineer, Oracle Solaris Studio
Oracle, Menlo Park, CA, USA
Outline

❑ Introduction
❑ Parallel Architectures
❑ Parallel Programming Models
❑ Data Races
❑ Summary
Introduction
Why Parallelization?

Parallelization is another optimization technique. The goal is to reduce the execution time. To this end, multiple processors, or cores, are used.

[Figure: execution time on 1 core vs. 4 cores after parallelization]

Using 4 cores, the execution time is 1/4 of the single-core time.
What Is Parallelization?

"Something" is parallel if there is a certain level of independence in the order of operations. This holds at increasing levels of granularity:

● A sequence of machine instructions
● A collection of program statements
● An algorithm
● The problem you're trying to solve

In other words, it doesn't matter in what order those operations are performed.
What is a Thread?

Loosely said, a thread consists of a series of instructions with its own program counter ("PC") and state. A parallel program executes threads in parallel; these threads are then scheduled onto processors.

[Figure: four instruction streams, Thread 0 through Thread 3, each with its own PC, scheduled onto the processors]
Parallel overhead

❑ The total CPU time often exceeds the serial CPU time:
● The newly introduced parallel portions in your program need to be executed
● Threads need time sending data to each other and synchronizing ("communication")
✔ Often the key contributor, spoiling all the fun
❑ Typically, things also get worse when increasing the number of threads
❑ Efficient parallelization is about minimizing the communication overhead
Communication

[Figure: wallclock time of serial execution compared with parallel execution on 4 cores, with and without communication]

● Parallel, without communication - embarrassingly parallel: 4x faster; wallclock time is 1/4 of the serial wallclock time
● Parallel, with communication - the additional communication makes it less than 4x faster; it also consumes additional resources: wallclock time is more than 1/4 of the serial wallclock time, and total CPU time increases
Load balancing

[Figure: per-thread execution times under perfect load balancing vs. load imbalance, where threads 1, 2 and 3 sit idle]

● Perfect load balancing: all threads finish in the same amount of time; no thread is idle
● Load imbalance: different threads need a different amount of time to finish their task; some threads are idle, the total wallclock time increases, and the program does not scale well
About scalability

Define the speed-up S(P) as S(P) := T(1)/T(P)
The efficiency E(P) is defined as E(P) := S(P)/P
In the ideal case, S(P) = P and E(P) = P/P = 1 = 100%

Unless the application is embarrassingly parallel, S(P) eventually starts to deviate from the ideal curve. Past this point, Popt, the application sees less and less benefit from adding processors.

Note that both metrics give no information on the actual run-time. As such, they can be dangerous to use.

In some cases, S(P) exceeds P. This is called "superlinear" behaviour. Don't count on this to happen, though.

[Figure: speed-up S(P) versus P, showing the ideal curve and a typical curve that deviates past Popt]
Amdahl's Law

▶ This "law" describes the effect the non-parallelizable part of a program has on scalability
▶ Note that the additional overhead caused by parallelization, and speed-up because of cache effects, are not taken into account

Assume our program has a parallel fraction "f". This implies for the execution time:
T(1) = f*T(1) + (1-f)*T(1)
On P processors:
T(P) = (f/P)*T(1) + (1-f)*T(1)
Amdahl's law:
S(P) = T(1)/T(P) = 1 / (f/P + 1 - f)
Comments:
Amdahl's Law

It is easy to scale on a small number of processors. Scalable performance, however, requires a high degree of parallelization, i.e. f very close to 1. This implies that you need to parallelize that part of the code where the majority of the time is spent.

[Figure: speed-up versus number of processors (1-16) for f = 0.00, 0.25, 0.50, 0.75, 0.99 and 1.00]
Amdahl's Law in practice

We can estimate the parallel fraction "f". Recall:
T(P) = (f/P)*T(1) + (1-f)*T(1)

It is trivial to solve this equation for "f":
f = (1 - T(P)/T(1)) / (1 - (1/P))

Example:
T(1) = 100 and T(4) = 37 => S(4) = T(1)/T(4) = 2.70
f = (1 - 37/100)/(1 - (1/4)) = 0.63/0.75 = 0.84 = 84%

Estimated performance on 8 processors is then:
T(8) = (0.84/8)*100 + (1 - 0.84)*100 = 26.5
S(8) = T(1)/T(8) = 3.77
Threads Are Getting Cheap .....

[Figure: elapsed time and speed-up as a function of the number of cores (1-16) for f = 84%]

Using 8 cores: 100/26.5 = 3.8x faster
Numerical results

Consider: A = B + C + D + E

Serial processing:
A = B + C
A = A + D
A = A + E

Parallel processing:
Thread 0: T1 = B + C
Thread 1: T2 = D + E
          T1 = T1 + T2

☞ The roundoff behaviour is different and so the numerical results may be different too
☞ This is natural for parallel programs, but it may be hard to differentiate it from an ordinary bug ....
Cache Coherence
Typical cache based system

[Figure: a CPU with an L1 and L2 cache in front of memory]
Cache line modifications

[Figure: two caches each holding a copy of the same cache line from a page in main memory; modifying one copy makes the copies inconsistent]
Caches in a parallel computer

❑ A cache line starts in memory
❑ Over time, multiple copies of this line may exist

Cache Coherence ("cc"):
✔ Tracks changes in copies
✔ Makes sure the correct cache line is used
✔ Different implementations possible
✔ Needs hardware support to make it efficient

[Figure: CPUs 0-3 with private caches in front of memory; when one CPU modifies its copy of a line, the copies in the other caches are invalidated]
Parallel Architectures
Uniform Memory Access (UMA)

❑ Also called "SMP" (Symmetric Multi Processor)
❑ Memory access time is uniform for all CPUs
❑ CPUs can be multicore
❑ Interconnect is "cc":
● Bus
● Crossbar
❑ No fragmentation - memory and I/O are shared resources

Pro:
✔ Easy to use and to administer
✔ Efficient use of resources

Con:
✔ Said to be expensive
✔ Said to be non-scalable

[Figure: several CPUs, each with a cache, connected through a cache coherent interconnect to shared memory and I/O]
NUMA

❑ Also called "Distributed Memory" or NORMA (No Remote Memory Access)
❑ Memory access time is non-uniform; hence the name "NUMA"
❑ Interconnect is not "cc":
● Ethernet, InfiniBand, etc, ......
❑ Runs 'N' copies of the OS
❑ Memory and I/O are distributed resources

Pro:
✔ Said to be cheap
✔ Said to be scalable

Con:
✔ Difficult to use and administer
✔ Inefficient use of resources

[Figure: several nodes, each with its own CPU, cache, memory M and I/O, connected through a non-coherent interconnect]
The Hybrid Architecture

❑ Second-level interconnect is not cache coherent
● Ethernet, InfiniBand, etc, ....
❑ Hybrid architecture with all Pros and Cons:
● UMA within one SMP/multicore node
● NUMA across nodes

[Figure: two SMP/multicore nodes connected by a second-level interconnect]
cc-NUMA

❑ Two-level interconnect:
● UMA/SMP within one system
● NUMA between the systems
❑ Both interconnects support cache coherence, i.e. the system is fully cache coherent
❑ Has all the advantages ('look and feel') of an SMP
❑ Downside is the non-uniform memory access time

[Figure: SMP nodes joined by a cache coherent second-level interconnect]
Parallel Programming Models
How To Program A Parallel Computer?

❑ There are numerous parallel programming models
❑ The ones most well-known are:
● Distributed Memory (any cluster and/or a single SMP):
✔ Sockets (standardized, low level)
✔ PVM - Parallel Virtual Machine (obsolete)
✔ MPI - Message Passing Interface (de-facto standard)
● Shared Memory (SMP only):
✔ POSIX Threads (standardized, low level)
✔ OpenMP (de-facto standard)
✔ Automatic Parallelization (compiler does it for you)
Parallel Programming Models
Distributed Memory - MPI
What is MPI?

❑ MPI stands for the "Message Passing Interface"
❑ MPI is a very extensive de-facto parallel programming API for distributed memory systems (i.e. a cluster)
● An MPI program can however also be executed on a shared memory system
❑ First specification: 1994
❑ The current MPI-2 specification was released in 1997
● Major enhancements:
✔ Remote memory management
✔ Parallel I/O
✔ Dynamic process management
More about MPI

❑ MPI has its own data types (e.g. MPI_INT)
● User defined data types are supported as well
❑ MPI supports C, C++ and Fortran
● Include file <mpi.h> in C/C++ and "mpif.h" in Fortran
❑ An MPI environment typically consists of at least:
● A library implementing the API
● A compiler and linker that support the library
● A run time environment to launch an MPI program
❑ Various implementations available
● HPC ClusterTools, MPICH, MVAPICH, LAM, Voltaire MPI, Scali MPI, HP-MPI, .....
The MPI Programming Model

[Figure: a cluster of systems running MPI processes 0 through 5, each with its own memory M]
The MPI Execution Model

[Figure: all processes start, call MPI_Init, communicate during execution, call MPI_Finalize, and all processes finish]
The MPI Memory Model

[Figure: processes, each with private memory only, connected through a transfer mechanism]

✔ All threads/processes have access to their own, private, memory only
✔ Data transfer and most synchronization has to be programmed explicitly
✔ All data is private
✔ Data is shared explicitly by exchanging buffers
Basic Concepts in Parallelization
33
RvdP/V1 Tutorial IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010
Example - Send "N" Integers

#include <mpi.h>                            /* include file */

int main(int argc, char *argv[])
{
    int me, you, him, error_code;
    int data_buffer[N];       /* N is assumed defined elsewhere */

    you = 0; him = 1;

    MPI_Init(&argc, &argv);                 /* initialize MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &me);     /* get process ID */

    if ( me == 0 ) {                        /* process 0 sends */
        error_code = MPI_Send(&data_buffer, N, MPI_INT,
                              him, 1957, MPI_COMM_WORLD);
    } else if ( me == 1 ) {                 /* process 1 receives */
        error_code = MPI_Recv(&data_buffer, N, MPI_INT,
                              you, 1957, MPI_COMM_WORLD,
                              MPI_STATUS_IGNORE);
    }

    MPI_Finalize();                         /* leave the MPI environment */
    return 0;
}
Run time Behavior

[Figure: process 0 (me = 0) calls MPI_Send with N integers, destination 1 and label 1957; process 1 (me = 1) calls MPI_Recv expecting N integers from sender 0 with label 1957; source, destination and label match - yes, the connection is established]
The Pros and Cons of MPI

❑ Advantages of MPI:
● Flexibility - can use any cluster of any size
● Straightforward - just plug in the MPI calls
● Widely available - several implementations out there
● Widely used - very popular programming model

❑ Disadvantages of MPI:
● Redesign of application - could be a lot of work
● Easy to make mistakes - many details to handle
● Hard to debug - need to dig into the underlying system
● More resources - typically, more memory is needed
● Special care - input/output
Parallel Programming Models
Shared Memory - Automatic Parallelization
Automatic Parallelization (-xautopar)

❑ Compiler performs the parallelization (loop based)
❑ Different iterations of the loop executed in parallel
❑ Same binary used for any number of threads

for (i=0; i<1000; i++)
    a[i] = b[i] + c[i];

With OMP_NUM_THREADS=4, the iterations are split over the threads:
Thread 0: 0-249, Thread 1: 250-499, Thread 2: 500-749, Thread 3: 750-999
Parallel Programming Models
Shared Memory - OpenMP
http://www.openmp.org

[Figure: threads 0 through P, each with a private memory M, attached to one shared memory]
Shared Memory Model

[Figure: threads, each with a little private memory, all attached to one shared memory]

✔ All threads have access to the same, globally shared, memory
✔ Data can be shared or private
✔ Shared data is accessible by all threads
✔ Private data can only be accessed by the thread that owns it
✔ Data transfer is transparent to the programmer
✔ Synchronization takes place, but it is mostly implicit
About data

In a shared memory parallel program, variables have a "label" attached to them:

☞ Labelled "Private" ⇨ visible to one thread only
✔ A change made to local data is not seen by the others
✔ Example - local variables in a function that is executed in parallel

☞ Labelled "Shared" ⇨ visible to all threads
✔ A change made to global data is seen by all others
✔ Example - global data
Example - Matrix times vector

#pragma omp parallel for default(none) \
        private(i,j,sum) shared(m,n,a,b,c)
for (i=0; i<m; i++)
{
    sum = 0.0;
    for (j=0; j<n; j++)
        sum += b[i][j]*c[j];
    a[i] = sum;
}

[Figure: a = B * c, with the rows i of B distributed over the threads]

With two threads, for example, thread 0 (TID = 0) executes iterations i = 0,1,2,3,4 and thread 1 (TID = 1) executes i = 5,6,7,8,9. Each thread computes its own private sum per row, sum = Σj b[i][j]*c[j], and stores it in a[i].
A Black and White comparison

MPI                                    | OpenMP
---------------------------------------|--------------------------------
De-facto standard                      | De-facto standard
Endorsed by all key players            | Endorsed by all key players
Runs on any number of (cheap) systems  | Limited to one (SMP) system
"Grid Ready"                           | Not (yet?) "Grid Ready"
High and steep learning curve          | Easier to get started (but, ...)
You're on your own                     | Assistance from compiler
All or nothing model                   | Mix and match model
No data scoping (shared, private, ..)  | Requires data scoping
More widely used (but ....)            | Increasingly popular (CMT !)
Sequential version is not preserved    | Preserves sequential code
Requires a library only                | Needs a compiler
Requires a run-time environment        | No special environment
Easier to understand performance       | Performance issues implicit
The Hybrid Parallel Programming Model
The Hybrid Programming Model

[Figure: two shared memory systems, each with threads 0 through P on top of one shared memory, connected to each other through distributed memory]
Data Races
About Parallelism

Parallelism ⇨ Independence ⇨ No fixed ordering

"Something" that does not obey this rule is not parallel (at that level ...)
Shared Memory Programming

Threads communicate via shared memory

[Figure: threads, each with private memory, reading and writing shared variables X and Y in the shared memory]
What is a Data Race?

❑ Two different threads in a multi-threaded shared memory program
❑ Access the same (= shared) memory location:
● Asynchronously, and
● Without holding any common exclusive locks, and
● At least one of the accesses is a write/store
Example of a data race

#pragma omp parallel shared(n)
{
    n = omp_get_thread_num();
}

[Figure: two threads both writing (W) the shared variable n]
Another example

#pragma omp parallel shared(x)
{
    x = x + 1;
}

[Figure: two threads both reading and writing (R/W) the shared variable x]
About Data Races

❑ Loosely described, a data race means that the update of a shared variable is not well protected
❑ A data race tends to show up in a nasty way:
● Numerical results are (somewhat) different from run to run
● Especially with floating-point data, this is difficult to distinguish from a numerical side-effect
● Changing the number of threads can cause the problem to seemingly (dis)appear
✔ May also depend on the load on the system
● May only show up when using many threads
A parallel loop

for (i=0; i<8; i++)
    a[i] = a[i] + b[i];

Every iteration in this loop is independent of the other iterations, so with two threads the work can simply be split:

Thread 1          Thread 2
a[0]=a[0]+b[0]    a[4]=a[4]+b[4]
a[1]=a[1]+b[1]    a[5]=a[5]+b[5]
a[2]=a[2]+b[2]    a[6]=a[6]+b[6]
a[3]=a[3]+b[3]    a[7]=a[7]+b[7]
Basic Concepts in Parallelization
54
RvdP/V1 Tutorial IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010
Not a parallel loop

for (i=0; i<8; i++)
    a[i] = a[i+1] + b[i];

Thread 1          Thread 2
a[0]=a[1]+b[0]    a[4]=a[5]+b[4]
a[1]=a[2]+b[1]    a[5]=a[6]+b[5]
a[2]=a[3]+b[2]    a[6]=a[7]+b[6]
a[3]=a[4]+b[3]    a[7]=a[8]+b[7]

The result is not deterministic when run in parallel! Each iteration reads the element the next iteration overwrites; if that next iteration runs on another thread, the outcome depends on the timing of the threads.
About the experiment

❑ We manually parallelized the previous loop
● The compiler detects the data dependence and does not parallelize the loop
❑ Vectors a and b are of type integer
❑ We use the checksum of a as a measure for correctness:
● checksum += a[i] for i = 0, 1, 2, ...., n-2
❑ The correct, sequential, checksum result is computed as a reference
❑ We ran the program using 1, 2, 4, 32 and 48 threads
● Each of these experiments was repeated 4 times
Numerical results

threads:  1   checksums 1953, 1953, 1953, 1953   correct value 1953
threads:  2   checksums 1953, 1953, 1953, 1953   correct value 1953
threads:  4   checksums 1905, 1905, 1953, 1937   correct value 1953
threads: 32   checksums 1525, 1473, 1489, 1513   correct value 1953
threads: 48   checksums  936, 1007,  887,  822   correct value 1953

Data Race In Action !
Summary
Parallelism Is Everywhere

Multiple levels of parallelism:

Granularity       | Technology   | Programming Model
------------------|--------------|------------------------
Instruction Level | Superscalar  | Compiler
Chip Level        | Multicore    | Compiler, OpenMP, MPI
System Level      | SMP/cc-NUMA  | Compiler, OpenMP, MPI
Grid Level        | Cluster      | MPI

Threads Are Getting Cheap