“Matrix Multiply ― in parallel”
Joe Hummel, PhD
U. of Illinois, Chicago
Loyola University Chicago
[email protected]


Page 1: “Matrix Multiply ― in parallel”


Joe Hummel, PhD
U. of Illinois, Chicago

Loyola University Chicago

[email protected]

Page 2: Background…

Class: “Introduction to CS for Engineers”

Lang: C/C++

Focus: programming basics, vectors, matrices

Timing: present this after introducing 2D arrays…


Page 3: Matrix multiply

Yes, it’s boring, but…
◦ everyone understands the problem

◦ good example of triply-nested loops

◦ non-trivial computation


for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++)
      C[i][j] += (A[i][k] * B[k][j]);

1500x1500 matrix:

2.25M elements » 32 seconds…
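The loop nest above assumes N, A, B, and C already exist. A minimal self-contained sketch of the sequential version, assuming std::vector storage, std::chrono timing, and illustrative initial values (none of which are from the slides):

#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
  const int N = 1500;                                   // size used on the slides
  std::vector<std::vector<double>> A(N, std::vector<double>(N, 1.0));
  std::vector<std::vector<double>> B(N, std::vector<double>(N, 2.0));
  std::vector<std::vector<double>> C(N, std::vector<double>(N, 0.0));

  auto start = std::chrono::steady_clock::now();

  // naive (i, j, k) triple loop, exactly as on the slide
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < N; k++)
        C[i][j] += (A[i][k] * B[k][j]);

  auto stop = std::chrono::steady_clock::now();
  printf("sequential multiply: %.1f seconds, C[0][0] = %g\n",
         std::chrono::duration<double>(stop - start).count(), C[0][0]);
  return 0;
}

With these initial values every C[i][j] should come out to 1.0 * 2.0 * N = 3000, which makes a handy correctness check when the loop is parallelized later.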

Page 4: Multicore

Matrix multiply is a great candidate for multicore

◦ embarrassingly-parallel

◦ easy to parallelize via the outermost loop


#pragma omp parallel for
for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++)
      C[i][j] += (A[i][k] * B[k][j]);


1500x1500 matrix:

Quad-core CPU » 8 seconds…
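The pragma is a no-op unless OpenMP is enabled at compile time (e.g. g++ -O2 -fopenmp). A sketch of the parallel loop wrapped in a callable routine, reusing the std::vector matrices from the earlier sketch; multiply_omp and the Matrix alias are illustrative names, not from the deck:

#include <cstdio>
#include <omp.h>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Illustrative wrapper around the slide's loop nest.  OpenMP splits the
// outermost i loop across the available cores; each thread owns a distinct
// block of rows of C, so there are no conflicting writes and no locks.
void multiply_omp(const Matrix& A, const Matrix& B, Matrix& C, int N)
{
  #pragma omp parallel for
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < N; k++)
        C[i][j] += (A[i][k] * B[k][j]);
}

int main()
{
  const int N = 1500;
  Matrix A(N, std::vector<double>(N, 1.0));
  Matrix B(N, std::vector<double>(N, 2.0));
  Matrix C(N, std::vector<double>(N, 0.0));

  printf("threads available: %d\n", omp_get_max_threads());
  multiply_omp(A, B, C, N);
  printf("C[0][0] = %g\n", C[0][0]);                    // expect 1.0 * 2.0 * N = 3000
  return 0;
}

Without -fopenmp the pragma is simply ignored and the loop runs sequentially, which is part of what makes this such a low-risk classroom example.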

Page 5: Designing for HPC

Parallelism alone is not enough…


HPC == Parallelism + Memory Hierarchy − Contention

Expose parallelism

Maximize data locality:
• network
• disk
• RAM
• cache
• core

Minimize interaction:
• false sharing (see the sketch below)
• locking
• synchronization
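The false-sharing item is easy to demonstrate with a parallel sum: when each thread repeatedly updates an adjacent element of a shared array, those elements sit on the same cache line and the threads invalidate each other even though no value is logically shared. A minimal sketch, assuming OpenMP; the function names and sizes are illustrative, not from the deck:

#include <cstdio>
#include <omp.h>
#include <vector>

// "Bad" version: every thread hammers an adjacent double in one shared
// cache line (false sharing), even though the logical work is independent.
double sum_with_false_sharing(const std::vector<double>& x)
{
  std::vector<double> partial(omp_get_max_threads(), 0.0);   // adjacent slots share a line
  #pragma omp parallel
  {
    int t = omp_get_thread_num();
    #pragma omp for
    for (int i = 0; i < (int)x.size(); i++)
      partial[t] += x[i];                                    // every add contends for that line
  }
  double total = 0.0;
  for (double p : partial) total += p;
  return total;
}

// "Good" version: accumulate into a thread-local variable, then synchronize
// exactly once per thread instead of once per element.
double sum_without_false_sharing(const std::vector<double>& x)
{
  double total = 0.0;
  #pragma omp parallel
  {
    double local = 0.0;                                      // private to this thread
    #pragma omp for
    for (int i = 0; i < (int)x.size(); i++)
      local += x[i];
    #pragma omp atomic
    total += local;                                          // one synchronized add per thread
  }
  return total;
}

int main()
{
  std::vector<double> x(20000000, 0.001);
  printf("%.1f  %.1f\n", sum_with_false_sharing(x), sum_without_false_sharing(x));
  return 0;
}

Both versions compute the same answer; only the second scales, because the threads no longer fight over a shared cache line and the only synchronization is one atomic add per thread.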

Page 6: Data locality

What’s the other half of the chip?  Cache!

Implications?
◦ No one implements MM this way

◦ Rewrite to use loop interchange, and access B row-wise…




#pragma omp parallel for
for (int i = 0; i < N; i++)
  for (int k = 0; k < N; k++)
    for (int j = 0; j < N; j++)
      C[i][j] += (A[i][k] * B[k][j]);

1500x1500 matrix:

Quad-core + cache » 2 seconds…
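The interchanged loop is a drop-in replacement for the multiply_omp body sketched after Page 4; only the order of the two inner loops changes. A commented version, assuming the same Matrix alias (multiply_ikj is an illustrative name):

// Loop-interchanged (i, k, j) order: for a fixed i and k, the inner j loop
// walks C[i][*] and B[k][*] along a row, i.e. through consecutive memory,
// so each cache line fetched for B is fully used before it is evicted.
// In the original (i, j, k) order the inner loop walked B down a column,
// touching a new cache line on every iteration.
void multiply_ikj(const Matrix& A, const Matrix& B, Matrix& C, int N)
{
  #pragma omp parallel for
  for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
      for (int j = 0; j < N; j++)
        C[i][j] += (A[i][k] * B[k][j]);
}

Parallelizing over i still gives each thread its own rows of C, so the interchange introduces no races; the speedup over the Page 4 version comes purely from walking B and C row-by-row.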