“Matrix Multiply ― in parallel”
Joe Hummel, PhD
U. of Illinois, Chicago
Loyola University Chicago
[email protected]


Page 1: “Matrix Multiply ― in parallel”


Joe Hummel, PhD
U. of Illinois, Chicago

Loyola University Chicago

[email protected]

Page 2: Background…

Class: “Introduction to CS for Engineers”

Lang: C/C++

Focus: programming basics, vectors, matrices

Timing: present this after introducing 2D arrays…


Page 3: Matrix multiply

Yes, it’s boring, but…
◦ everyone understands the problem

◦ good example of triply-nested loops

◦ non-trivial computation


for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++)
      C[i][j] += (A[i][k] * B[k][j]);

1500x1500 matrix:

2.25M elements » 32 seconds…
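The loop nest above assumes N, A, B, and C already exist. A minimal self-contained sketch of the sequential version, assuming std::vector storage, std::chrono timing, and illustrative initial values (none of which are from the slides):

#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
  const int N = 1500;                                   // size used on the slides
  std::vector<std::vector<double>> A(N, std::vector<double>(N, 1.0));
  std::vector<std::vector<double>> B(N, std::vector<double>(N, 2.0));
  std::vector<std::vector<double>> C(N, std::vector<double>(N, 0.0));

  auto start = std::chrono::steady_clock::now();

  // naive (i, j, k) triple loop, exactly as on the slide
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < N; k++)
        C[i][j] += (A[i][k] * B[k][j]);

  auto stop = std::chrono::steady_clock::now();
  printf("sequential multiply: %.1f seconds, C[0][0] = %g\n",
         std::chrono::duration<double>(stop - start).count(), C[0][0]);
  return 0;
}

With these initial values every C[i][j] should come out to 1.0 * 2.0 * N = 3000, which makes a handy correctness check when the loop is parallelized later.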

Page 4: Multicore

Matrix multiply is a great candidate for multicore

◦ embarrassingly-parallel

◦ easy to parallelize via the outermost loop


#pragma omp parallel for
for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++)
      C[i][j] += (A[i][k] * B[k][j]);


1500x1500 matrix:

Quad-core CPU » 8 seconds…
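The pragma is a no-op unless OpenMP is enabled at compile time (e.g. g++ -O2 -fopenmp). A sketch of the parallel loop wrapped in a callable routine, reusing the std::vector matrices from the earlier sketch; multiply_omp and the Matrix alias are illustrative names, not from the deck:

#include <cstdio>
#include <omp.h>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Illustrative wrapper around the slide's loop nest.  OpenMP splits the
// outermost i loop across the available cores; each thread owns a distinct
// block of rows of C, so there are no conflicting writes and no locks.
void multiply_omp(const Matrix& A, const Matrix& B, Matrix& C, int N)
{
  #pragma omp parallel for
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < N; k++)
        C[i][j] += (A[i][k] * B[k][j]);
}

int main()
{
  const int N = 1500;
  Matrix A(N, std::vector<double>(N, 1.0));
  Matrix B(N, std::vector<double>(N, 2.0));
  Matrix C(N, std::vector<double>(N, 0.0));

  printf("threads available: %d\n", omp_get_max_threads());
  multiply_omp(A, B, C, N);
  printf("C[0][0] = %g\n", C[0][0]);                    // expect 1.0 * 2.0 * N = 3000
  return 0;
}

Without -fopenmp the pragma is simply ignored and the loop runs sequentially, which is part of what makes this such a low-risk classroom example.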

Page 5: Designing for HPC

Parallelism alone is not enough…


HPC == Parallelism + Memory Hierarchy − Contention

Expose parallelism

Maximize data locality:
• network
• disk
• RAM
• cache
• core

Minimize interaction:
• false sharing (see the sketch below)
• locking
• synchronization
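The false-sharing item is easy to demonstrate with a parallel sum: when each thread repeatedly updates an adjacent element of a shared array, those elements sit on the same cache line and the threads invalidate each other even though no value is logically shared. A minimal sketch, assuming OpenMP; the function names and sizes are illustrative, not from the deck:

#include <cstdio>
#include <omp.h>
#include <vector>

// "Bad" version: every thread hammers an adjacent double in one shared
// cache line (false sharing), even though the logical work is independent.
double sum_with_false_sharing(const std::vector<double>& x)
{
  std::vector<double> partial(omp_get_max_threads(), 0.0);   // adjacent slots share a line
  #pragma omp parallel
  {
    int t = omp_get_thread_num();
    #pragma omp for
    for (int i = 0; i < (int)x.size(); i++)
      partial[t] += x[i];                                    // every add contends for that line
  }
  double total = 0.0;
  for (double p : partial) total += p;
  return total;
}

// "Good" version: accumulate into a thread-local variable, then synchronize
// exactly once per thread instead of once per element.
double sum_without_false_sharing(const std::vector<double>& x)
{
  double total = 0.0;
  #pragma omp parallel
  {
    double local = 0.0;                                      // private to this thread
    #pragma omp for
    for (int i = 0; i < (int)x.size(); i++)
      local += x[i];
    #pragma omp atomic
    total += local;                                          // one synchronized add per thread
  }
  return total;
}

int main()
{
  std::vector<double> x(20000000, 0.001);
  printf("%.1f  %.1f\n", sum_with_false_sharing(x), sum_without_false_sharing(x));
  return 0;
}

Both versions compute the same answer; only the second scales, because the threads no longer fight over a shared cache line and the only synchronization is one atomic add per thread.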

Page 6: Data locality

What’s the other half of the chip?  Cache!

Implications?
◦ No one implements MM this way

◦ Rewrite to use loop interchange, and access B row-wise…




#pragma omp parallel for
for (int i = 0; i < N; i++)
  for (int k = 0; k < N; k++)
    for (int j = 0; j < N; j++)
      C[i][j] += (A[i][k] * B[k][j]);

1500x1500 matrix:

Quad-core + cache » 2 seconds…
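The interchanged loop is a drop-in replacement for the multiply_omp body sketched after Page 4; only the order of the two inner loops changes. A commented version, assuming the same Matrix alias (multiply_ikj is an illustrative name):

// Loop-interchanged (i, k, j) order: for a fixed i and k, the inner j loop
// walks C[i][*] and B[k][*] along a row, i.e. through consecutive memory,
// so each cache line fetched for B is fully used before it is evicted.
// In the original (i, j, k) order the inner loop walked B down a column,
// touching a new cache line on every iteration.
void multiply_ikj(const Matrix& A, const Matrix& B, Matrix& C, int N)
{
  #pragma omp parallel for
  for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
      for (int j = 0; j < N; j++)
        C[i][j] += (A[i][k] * B[k][j]);
}

Parallelizing over i still gives each thread its own rows of C, so the interchange introduces no races; the speedup over the Page 4 version comes purely from walking B and C row-by-row.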