1

Friday, September 22, 2006

If one ox could not do the job they did not try to grow a bigger ox, but used two oxen.

- Grace Murray Hopper (1906-1992)

2

Today

Block matrix operations

Network topologies

3

Strided access

Stride

A sequence of memory reads and writes to addresses, each separated from the last by a constant interval called the "stride length".

Unit stride

The stride length is one element, so consecutive accesses touch adjacent memory locations.
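To make the idea of a stride concrete, here is a minimal sketch in C. The function name `strided_sum` and its parameters are made up for illustration; they do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Sum every stride-th element of x[0..n-1].
   stride == 1 is the unit-stride case: consecutive addresses are read. */
double strided_sum(const double *x, size_t n, size_t stride)
{
    assert(stride > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i += stride)
        sum += x[i];
    return sum;
}
```

With a large stride each access is likely to land in a different cache block, so fewer of the elements brought in with each block are actually used.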

4

do i = 1, N
  do j = 1, N
    A[i] = A[i] + B[j]
  enddo
enddo

N is large, so B[j] cannot remain in cache until it is used again in the next iteration of the outer loop.

There is little reuse between successive touches of the same element of B.

How many cache misses for A and B?
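One way to estimate the answer, under assumptions that are not stated on the slide: a cache block holds $L$ array elements (the symbol $L$ is introduced here), B is too large to stay in cache across outer iterations, and A[i] stays resident during the inner loop.

```latex
\begin{align*}
\text{misses}(A) &\approx \frac{N}{L}, \\
\text{misses}(B) &\approx N \cdot \frac{N}{L} = \frac{N^2}{L}.
\end{align*}
```

Under these assumptions B is reloaded in full on each of the $N$ outer iterations, which is what blocking tries to avoid.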

5

Blocking

do i = 1, N
  do j = 1, N, S
    do jj = j, MIN(j+S-1, N)
      A[i] = A[i] + B[jj]
    enddo
  enddo
enddo

Original loop:

do i = 1, N
  do j = 1, N
    A[i] = A[i] + B[j]
  enddo
enddo

6

Blocking

do j = 1, N, S
  do i = 1, N
    do jj = j, MIN(j+S-1, N)
      A[i] = A[i] + B[jj]
    enddo
  enddo
enddo

Original loop:

do i = 1, N
  do j = 1, N
    A[i] = A[i] + B[j]
  enddo
enddo

S is the maximum number of elements of B that can remain in cache between two iterations of the i loop.

This technique is called blocking, or strip mining.

How many cache misses for A and B?
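The blocked loop can be sketched in C as follows. This is an illustration only: it uses 0-based indexing, the function names are hypothetical, and `block_size` stands in for S.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: a[i] += b[j] for all i and j. */
void accumulate_naive(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            a[i] += b[j];
}

/* Blocked version: the loop over blocks of b is outermost, so each block
   of block_size elements is reused for every i before moving on. */
void accumulate_blocked(double *a, const double *b, size_t n, size_t block_size)
{
    assert(block_size > 0);
    for (size_t j = 0; j < n; j += block_size) {
        /* Exclusive upper bound of this block, clipped to n. */
        size_t j_end = (j + block_size < n) ? j + block_size : n;
        for (size_t i = 0; i < n; i++)
            for (size_t jj = j; jj < j_end; jj++)
                a[i] += b[jj];
    }
}
```

As a rough estimate, assuming a cache block holds L elements (L is a symbol introduced here) and A does not fit in cache, B now misses about N/L times in total, while A is swept once per block of B, for about (N/S)(N/L) misses.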

7

Operation Count vs. Memory Operations

Example: Matrix multiplication

Previous example?
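A rough comparison of arithmetic to data volume for both examples, counting each array element once and ignoring cache effects (a simplification made here, not on the slide):

```latex
\begin{align*}
\text{Matrix multiplication } (n \times n): \quad
  & 2n^3 \text{ operations on } 3n^2 \text{ elements}
  && \Rightarrow \ \frac{2n^3}{3n^2} = \frac{2n}{3} \text{ operations per element}, \\
\text{Previous example } (A[i] = A[i] + B[j]): \quad
  & N^2 \text{ additions on } 2N \text{ elements}
  && \Rightarrow \ \frac{N^2}{2N} = \frac{N}{2} \text{ operations per element}.
\end{align*}
```

In both cases the operation count grows faster than the data, so each element can in principle be reused many times once it is in cache.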

8

Block matrix operations

9

Matrix multiplication

int i, j, k;

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}

Remember to initialize c[i][j] to zero.

10

Matrix multiplication with blocking

int i, j, k, ii, jj, kk;

for (ii = 0; ii < n; ii += S) {
    for (jj = 0; jj < n; jj += S) {
        for (kk = 0; kk < n; kk += S) {
            for (i = ii; i < min(ii + S, n); i++) {
                for (j = jj; j < min(jj + S, n); j++) {
                    for (k = kk; k < min(kk + S, n); k++) {
                        c[i][j] = c[i][j] + a[i][k] * b[k][j];
                    }
                }
            }
        }
    }
}

Remember to initialize c[i][j] to zero.
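A self-contained version of the blocked multiplication might look like the sketch below. It assumes row-major matrices stored in flat arrays, which the slide does not specify. The names `matmul_naive`, `matmul_blocked` and `min_size` are illustrative, and `block_size` plays the role of S.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* c += a * b for n-by-n row-major matrices, plain triple loop. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Same product computed tile by tile; the caller must zero c first. */
void matmul_blocked(size_t n, size_t block_size,
                    const double *a, const double *b, double *c)
{
    assert(block_size > 0);
    for (size_t ii = 0; ii < n; ii += block_size)
        for (size_t jj = 0; jj < n; jj += block_size)
            for (size_t kk = 0; kk < n; kk += block_size)
                for (size_t i = ii; i < min_size(ii + block_size, n); i++)
                    for (size_t j = jj; j < min_size(jj + block_size, n); j++)
                        for (size_t k = kk; k < min_size(kk + block_size, n); k++)
                            c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```

A common rule of thumb is to pick the block size so that three tiles of block_size by block_size elements fit in cache together. The best value depends on the machine and is usually found by measurement.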

11

Exercise

Matrix Vector Multiplication
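The slide does not spell out the task. Assuming it is to apply the same blocking idea to y = A x, an unblocked baseline to start from might look like this (the function name and row-major layout are assumptions):

```c
#include <assert.h>
#include <stddef.h>

/* y = A * x for an n-by-n row-major matrix A (unblocked baseline). */
void matvec(size_t n, const double *a, const double *x, double *y)
{
    assert(a && x && y);
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += a[i * n + j] * x[j];
        y[i] = sum;
    }
}
```

Each element of A is used only once here, so any reuse to exploit is in x and y.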

13

Cache coherence in multiprocessor systems

Suppose two processors on a shared bus have loaded the same variable.

If one processor changes the value of that variable, then either:

Invalidate other copies

Update other copies

14

15

Cache coherence in multiprocessor systems

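What if a processor reads a data item only once, initially? Under an update protocol, every later write by another processor still sends an update to that unused copy.

For this reason the invalidate protocol is more commonly used.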
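The difference between the two policies can be illustrated with a toy model in C. This is a deliberately simplified sketch, not a real coherence protocol: it tracks one shared variable, and all the type and function names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_CPUS 2

/* One cached copy of a single shared variable per processor. */
typedef struct {
    bool valid;
    int value;
} cache_line;

typedef enum { INVALIDATE, UPDATE } protocol;

/* Processor `writer` stores new_value; other valid copies are handled
   according to the chosen protocol. Returns the number of other copies
   that had to be notified over the bus. */
int write_shared(cache_line caches[NUM_CPUS], int writer, int new_value,
                 protocol p)
{
    assert(writer >= 0 && writer < NUM_CPUS);
    int bus_messages = 0;
    caches[writer].valid = true;
    caches[writer].value = new_value;
    for (int cpu = 0; cpu < NUM_CPUS; cpu++) {
        if (cpu == writer || !caches[cpu].valid)
            continue;
        if (p == INVALIDATE)
            caches[cpu].valid = false;      /* stale copy is dropped */
        else
            caches[cpu].value = new_value;  /* stale copy is refreshed */
        bus_messages++;
    }
    return bus_messages;
}
```

In this model, repeated writes by one processor cost a message each time under UPDATE, but only on the first write under INVALIDATE, since the other copy is no longer valid afterwards.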

17

False Sharing (multiprocessor)

Two processors are accessing different data items in the same cache block.

What happens if they both attempt to write to it?

Remedy: pad data structures so that each item falls in its own cache block (a space vs. time tradeoff).
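A minimal sketch of padding in C, assuming a 64-byte cache block. That size is an assumption, not something the slides state, and the struct and function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64  /* assumed block size; check the target CPU */

/* Unpadded: counters for different processors sit next to each other
   and can land in the same cache block. */
struct counters_packed {
    long count[2];
};

/* Padded: each counter occupies a full block's worth of bytes, so two
   neighbouring counters can never share a cache block. */
struct padded_counter {
    long count;
    char pad[CACHE_LINE_BYTES - sizeof(long)];
};

struct counters_padded {
    struct padded_counter per_cpu[2];
};

/* Byte distance between neighbouring counters in the padded layout. */
size_t counter_spacing(void)
{
    size_t spacing = sizeof(struct padded_counter);
    assert(spacing >= CACHE_LINE_BYTES);
    return spacing;
}
```

The cost is space: the padded layout uses a full block per counter, where the packed one uses only `sizeof(long)` bytes.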

18

Network Topologies

Bus-based, crossbar, and multistage networks

Earth Simulator: crossbar

IBM SP-2: multistage network

19

Network Topologies

A completely connected network requires a large number of links.

In a star topology, the central node is a bottleneck.
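The link counts behind these two statements can be written down directly. The sketch below uses made-up function names and counts one bidirectional link per connected pair of nodes.

```c
#include <assert.h>

/* Links needed to connect p nodes pairwise: p(p-1)/2. */
unsigned long links_completely_connected(unsigned long p)
{
    assert(p >= 1);
    return p * (p - 1) / 2;
}

/* Links in a star: every non-central node has one link to the hub. */
unsigned long links_star(unsigned long p)
{
    assert(p >= 1);
    return p - 1;
}
```

For 64 nodes this gives 2016 links when completely connected and 63 for a star, but in the star every message between two outer nodes passes through the hub.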

20

Network Topologies

1-D torus

Intel Paragon: 2-D mesh

BlueGene/L: 3-D torus

Cray T3E: 3-D cube

21

2-D and 3-D meshes are common in parallel computers.

Regularly structured computations map naturally onto a 2-D mesh.

3-D network topologies suit applications such as weather modeling and structural modeling.
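To illustrate how a regular computation maps onto these networks, the sketch below computes a node's neighbor in a 2-D mesh and in a 2-D torus. The row-major rank numbering and the function names are assumptions made for this example.

```c
#include <assert.h>

/* Rank of the neighbor of node (row, col) in a rows-by-cols 2-D torus,
   displaced by (drow, dcol) with wraparound. Ranks are row-major. */
int torus_neighbor(int rows, int cols, int row, int col, int drow, int dcol)
{
    assert(rows > 0 && cols > 0);
    assert(row >= 0 && row < rows && col >= 0 && col < cols);
    int r = ((row + drow) % rows + rows) % rows;
    int c = ((col + dcol) % cols + cols) % cols;
    return r * cols + c;
}

/* Same for a mesh, which has no wraparound links:
   returns -1 when the neighbor would fall outside the grid. */
int mesh_neighbor(int rows, int cols, int row, int col, int drow, int dcol)
{
    assert(rows > 0 && cols > 0);
    int r = row + drow;
    int c = col + dcol;
    if (r < 0 || r >= rows || c < 0 || c >= cols)
        return -1;
    return r * cols + c;
}
```

A grid-based computation that exchanges boundary data with its north, south, east and west neighbors would call such a function with displacements of plus or minus one in each dimension.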