View
213
Download
0
Tags:
Embed Size (px)
Citation preview
1
Friday, September 22, 2006
If one ox could not do the job they did not try to grow a
bigger ox, but used two oxen.
- Grace Murray Hopper
(1906-1992)
3
Strided access
Stride
Sequence of memory reads and writes to addresses, each of which is separated from the last by a constant interval called "the stride length“
Unit stride
4
do i = 1, N
do j = 1, N
A[i] =A[i] + B[j]
enddo
enddo
N is large so B[j] cannot remain in cache until it is used again in another iteration of outer loop.
Little reuse between touches
How many cache misses for A and B?
5
Blocking
do i = 1, N
do j = 1, N, S
do jj = j, MIN(j+S, N)
A[i] =A[i] + B[jj]
enddo
enddo
enddo
do i = 1, N
do j = 1, N
A[i] =A[i] + B[j]
enddo
enddo
6
Blocking
do j = 1, N, S
do i = 1, N
do jj = j, MIN(j+S, N)
A[i] =A[i] + B[jj]
enddo
enddo
enddo
do i = 1, N
do j = 1, N
A[i] =A[i] + B[j]
enddo
enddo
S is the maximum number of elements of B that can remain in cache between two iterations of the i loop
Block or strip mine
How many cache misses for A and B?
9
Matrix multiplication
int i,j,k;
for (i=0;i<n;i++) {
for(j=0;j<n;j++) {
for (k=0;k<n;k++) {
c[i][j]=c[i][j]+ a[i][k]*b[k][j];
}
}
}
Remember to initialize c[i][j] to zero
10
Matrix multiplication with blockingint i,j,k,ii,jj,kk;for (ii=0;ii<n;ii+=S) { for (jj=0;jj<n;jj+=S) { for (kk=0;kk<n;kk+=S) { for(i=ii;i<min((ii+S),n);i++) { for(j=jj;j<min((jj+S),n);j++) {
for(k=kk;k<min((kk+S),n);k++) { c[i][j]=c[i][j]+a[i][k]*b[k][j];
} } } } }}
Remember to initialize c[i][j] to zero
12
Cache coherence in multiprocessor systems
Suppose two processors on a shared bus have loaded the same variable.
If one processor changes value of that variable then:
13
Cache coherence in multiprocessor systems
Suppose two processors on a shared bus have loaded the same variable.
If one processor changes value of that variable then: Invalidate other copies Update other copies
15
Cache coherence in multiprocessor systems
What if a processor reads a data item only once initially?
Invalidate protocol is more commonly used.
16
False Sharing (multiprocessor)
Two processors are accessing different data items in the same cache block.
What happens if they both attempt to write to it?
17
False Sharing (multiprocessor)
Two processors are accessing different data items in the same cache block.
What happens if they both attempt to write to it?
Padding in data structures (tradeoff space vs. time)
18
Network Topologies
Bus based, crossbar and multistage networks
Earth simulator: crossbar IBM SP-2 Multistage network