Tile Size Selection Using Cache Organization and Data Layout

Stephanie Coleman

Intermetrics, Inc.

Kathryn S. McKinley

Computer Science, LGRC,

University of Massachusetts Amherst

10/27/01

Where to Use Tiling/Blocking?

• Registers

• TLB

• L1 cache

• L2 cache

• Any other level of the memory hierarchy

Cache Misses

• Compulsory misses

• Capacity misses

• Interference misses
  – Self-interference
  – Cross-interference

Data Reuse and Locality

• Data reuse
  – Temporal reuse
  – Spatial reuse

• Locality: reused data are still in the cache when they are used again

• Reuse does not necessarily result in locality

Without Tiling

• Matrix Multiply

for I=1 to N do
    for K=1 to N do
        R=X(K,I)
        for J=1 to N do
            Z(J,I)=Z(J,I)+R*Y(J,K)
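A minimal Python sketch of the untiled loop nest above. It assumes 0-based indices and nested lists indexed as A[row][col], so `Z[j][i]` mirrors Z(J,I); it reproduces the I-K-J loop order, not Fortran's column-major layout, so it shows the computation rather than its cache behavior. The function name `matmul_untiled` is hypothetical.

```python
def matmul_untiled(X, Y, Z):
    """Z(J,I) += sum over K of Y(J,K) * X(K,I), using the slide's I-K-J loop order."""
    n = len(Z)
    for i in range(n):
        for k in range(n):
            r = X[k][i]              # R = X(K,I): held in a scalar across the whole J loop
            for j in range(n):
                Z[j][i] += r * Y[j][k]
    return Z


if __name__ == "__main__":
    X = [[5, 6], [7, 8]]
    Y = [[1, 2], [3, 4]]
    Z = [[0, 0], [0, 0]]
    print(matmul_untiled(X, Y, Z))  # [[19, 22], [43, 50]]
```

Each Y(J,K) is used once per I iteration, but between two such uses the loop touches all N² elements of Y, so this reuse only turns into locality when the whole of Y fits in the cache.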

Reuse Pattern without tiling

Reuse Pattern after tiling

After Tiling

(tile size = TK*TJ)

for KK=1 to N by TK do
    for JJ=1 to N by TJ do
        for I=1 to N do
            for K=KK to MIN(KK+TK-1,N) do
                R=X(K,I)
                for J=JJ to MIN(JJ+TJ-1,N) do
                    Z(J,I)=Z(J,I)+R*Y(J,K)
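The tiled nest can be sketched in Python the same way (0-based indices, nested lists indexed A[row][col]; names are hypothetical). The inclusive bound MIN(KK+TK-1,N) becomes the exclusive Python bound `min(kk + tk, n)`. The sketch checks that tiling only reorders the iterations: the result matches the untiled nest for any TK and TJ.

```python
def matmul_untiled(X, Y, Z):
    n = len(Z)
    for i in range(n):
        for k in range(n):
            r = X[k][i]
            for j in range(n):
                Z[j][i] += r * Y[j][k]
    return Z


def matmul_tiled(X, Y, Z, tk, tj):
    """Tiled I-K-J nest: a TK x TJ block of Y is reused across all I iterations."""
    n = len(Z)
    for kk in range(0, n, tk):                            # KK = 1 to N by TK
        for jj in range(0, n, tj):                        # JJ = 1 to N by TJ
            for i in range(n):
                for k in range(kk, min(kk + tk, n)):      # K = KK to MIN(KK+TK-1,N)
                    r = X[k][i]
                    for j in range(jj, min(jj + tj, n)):  # J = JJ to MIN(JJ+TJ-1,N)
                        Z[j][i] += r * Y[j][k]
    return Z


if __name__ == "__main__":
    n = 7
    X = [[(3 * r + c) % 5 for c in range(n)] for r in range(n)]
    Y = [[(r - 2 * c) % 7 for c in range(n)] for r in range(n)]
    ref = matmul_untiled(X, Y, [[0] * n for _ in range(n)])
    out = matmul_tiled(X, Y, [[0] * n for _ in range(n)], tk=3, tj=2)
    print(out == ref)  # True
```

Integer inputs are used so that the different summation order cannot introduce floating-point rounding differences.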

General Formula for Tiling

• Before tiling:
    for I=lo to hi do

• Tiled into:
    for It=floor((lo-off)/ts)*ts+off to floor((hi-off)/ts)*ts+off by ts do
        for I=max(lo, It) to min(hi, It+ts-1) do

(off: offset; ts: tile size)
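A small Python sketch of the general tiling formula, returning the (first, last) element range of each tile. `tile_ranges` is a hypothetical name. Python's `//` floors toward negative infinity, which matches floor() in the formula even when lo - off is negative.

```python
def tile_ranges(lo, hi, ts, off):
    """Inclusive (first, last) index pairs for each tile of the loop I = lo..hi."""
    first = (lo - off) // ts * ts + off     # floor((lo-off)/ts)*ts+off
    last = (hi - off) // ts * ts + off      # floor((hi-off)/ts)*ts+off
    ranges = []
    for it in range(first, last + 1, ts):  # It loop, step ts
        ranges.append((max(lo, it), min(hi, it + ts - 1)))  # element loop limits
    return ranges


if __name__ == "__main__":
    print(tile_ranges(3, 17, 5, 2))  # [(3, 6), (7, 11), (12, 16), (17, 17)]
```

Tile boundaries fall at off, off+ts, off+2*ts, …; the max/min clamps trim the first and last tiles to the original bounds, so every I in lo..hi is visited exactly once.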

Loop Interchange

• Interchange an inner tile loop with an outer element loop:
    for I=max(l1,l2,…) to min(u1,u2,…) do
        for Jt=floor((k1*I+m1-off)/ts)*ts+off to floor((ku*I+mu-off)/ts)*ts+off by ts do

• The limits of the I loop do not change.

• The new lower limit of the Jt loop is the max of a set of expressions, each of which is its old lower limit with I replaced by one of l1,l2,… (if k1>0) or u1,u2,… (if k1<0). Symmetrically, the new upper limit is the min of its old upper limit with I replaced by one of u1,u2,… (if ku>0) or l1,l2,… (if ku<0).
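The interchange rule can be checked by brute force. This Python sketch assumes the Jt loop of the original nest is followed by the element loop J=max(k1*I+m1, Jt) to min(ku*I+mu, Jt+ts-1), and it uses the tile start floor((x-off)/ts)*ts+off from the general tiling formula. Its lower limit for Jt follows the stated max rule; the upper limit uses the symmetric min rule (old upper limit evaluated at u1,u2,… if ku>0, else at l1,l2,…). All function names are hypothetical.

```python
def tile_start(x, ts, off):
    """floor((x - off) / ts) * ts + off: start of the tile containing x."""
    return (x - off) // ts * ts + off


def original_nest(lows, highs, k1, m1, ku, mu, ts, off):
    """I outer, tile loop Jt inner, element loop J innermost."""
    points = []
    for i in range(max(lows), min(highs) + 1):
        lo, hi = k1 * i + m1, ku * i + mu
        for jt in range(tile_start(lo, ts, off), tile_start(hi, ts, off) + 1, ts):
            for j in range(max(lo, jt), min(hi, jt + ts - 1) + 1):
                points.append((i, j))
    return points


def interchanged_nest(lows, highs, k1, m1, ku, mu, ts, off):
    """Jt outer, I inner; Jt limits derived from the limits of I."""
    lo_subs = lows if k1 > 0 else highs   # max rule for the lower limit (k1 <= 0 uses u's)
    hi_subs = highs if ku > 0 else lows   # min rule for the upper limit (ku <= 0 uses l's)
    jt_lo = max(tile_start(k1 * x + m1, ts, off) for x in lo_subs)
    jt_hi = min(tile_start(ku * x + mu, ts, off) for x in hi_subs)
    points = []
    for jt in range(jt_lo, jt_hi + 1, ts):
        for i in range(max(lows), min(highs) + 1):        # I limits unchanged
            lo, hi = k1 * i + m1, ku * i + mu
            # Empty whenever tile jt lies outside [lo, hi] for this I.
            for j in range(max(lo, jt), min(hi, jt + ts - 1) + 1):
                points.append((i, j))
    return points


if __name__ == "__main__":
    # I = max(1,3) to min(10,12); J = I to 2*I+3; ts = 4, off = 1.
    args = ([1, 3], [10, 12], 1, 0, 2, 3, 4, 1)
    print(sorted(original_nest(*args)) == sorted(interchanged_nest(*args)))  # True
```

The comparison is on sorted lists, so it checks that both nests visit exactly the same (I, J) pairs the same number of times; only the visiting order changes.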

Tile Size Selection

Tile Size Selection

Cache layout with a tile size of 24

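Potential column dimensions

• Euclidean algorithm: G.C.D(a,b) = G.C.D(a-b,b)

• Applied to the cache size CS and the column length N:
    CS = q1*N + r1
    N  = q2*r1 + r2
    r1 = q3*r2 + r3
    …

• Example (CS = 1024, N = 200):
    1024 = 5*200 + 24
    200  = 8*24 + 8
    24   = 3*8 + 0

Potential column dimensions: 24, 8.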
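The remainder sequence above can be generated mechanically. This Python sketch collects the nonzero remainders of the Euclidean algorithm started on (CS, N), as in the example; `candidate_column_sizes` is a hypothetical name, and it assumes 0 < N < CS with both in array elements.

```python
import math


def candidate_column_sizes(cs, n):
    """Nonzero remainders r1, r2, ... of the Euclidean algorithm on (cs, n)."""
    sizes = []
    a, b = cs, n
    while True:
        r = a % b          # a = q*b + r
        if r == 0:
            break
        sizes.append(r)
        a, b = b, r
    return sizes


if __name__ == "__main__":
    print(candidate_column_sizes(1024, 200))  # [24, 8]
    print(math.gcd(1024, 200))                # 8, the last nonzero remainder
```

Each division step is the repeated-subtraction identity G.C.D(a,b) = G.C.D(a-b,b) applied q times at once. If N divides CS exactly, the sketch returns an empty list.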

Computing row size for a column size
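The figure for this slide is not recoverable, so here is a brute-force Python stand-in (not the method from the slides) for what the row size of a column size measures. It assumes a direct-mapped cache of `cs` elements, ignores cache lines, and places column c of the array at cache position (c*col_len) mod cs. It counts how many consecutive columns can contribute `col_size` elements each before two of them map to the same cache slot. `row_size` and its parameters are hypothetical.

```python
def row_size(cs, col_len, col_size, max_cols):
    """Consecutive columns whose first col_size elements fit without self-interference."""
    occupied = [False] * cs                 # one flag per cache slot (element granularity)
    for c in range(max_cols):
        start = c * col_len                 # address of the first element of column c
        slots = [(start + t) % cs for t in range(col_size)]
        if any(occupied[s] for s in slots):
            return c                        # column c would evict part of the tile
        for s in slots:
            occupied[s] = True
    return max_cols                         # every column of the array fits


if __name__ == "__main__":
    print(row_size(1024, 200, 24, 200))  # 41
    print(row_size(1024, 200, 8, 200))   # 128
```

With the CS = 1024, N = 200 example, a column size of 24 admits 41 columns and a column size of 8 admits 128, which fills the cache exactly. The flag array costs O(CS) memory but makes wrap-around at the end of the cache trivial to handle.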

Improve Spatial Locality with Cache Line Size

colSize = colSize                      if colSize mod CLS = 0, or if colSize = column length
colSize = floor(colSize/CLS)*CLS       otherwise

(CLS: cache line size)
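A direct Python transcription of the column-size adjustment, assuming colSize, CLS, and the column length are all measured in array elements. `adjust_for_line_size` is a hypothetical name.

```python
def adjust_for_line_size(col_size, cls, col_len):
    """Round col_size down to a multiple of the cache line size, unless it already
    is one or the tile spans the whole column."""
    if col_size % cls == 0 or col_size == col_len:
        return col_size
    return (col_size // cls) * cls      # floor(colSize/CLS)*CLS


if __name__ == "__main__":
    print(adjust_for_line_size(24, 4, 200))   # 24  (already a multiple of CLS)
    print(adjust_for_line_size(24, 16, 200))  # 16
    print(adjust_for_line_size(200, 16, 200)) # 200 (whole column kept)
```

As written, the formula yields 0 when colSize < CLS, so a caller would presumably discard such a candidate.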

Minimize Cross Interference

• Working set size constraint:

TJ*TK+TJ+1*CLS<CS
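A Python sketch that evaluates the working-set constraint literally, with ordinary operator precedence. One plausible reading (an assumption, not stated on the slide) is that the terms count the TJ × TK tile of Y, the TJ-element segment of the Z column, and one cache line holding R = X(K,I), all in array elements. `working_set_fits` and the helper `largest_tk`, which shrinks TK until the constraint holds, are hypothetical.

```python
def working_set_fits(tj, tk, cls, cs):
    """Literal transcription of TJ*TK + TJ + 1*CLS < CS."""
    return tj * tk + tj + 1 * cls < cs


def largest_tk(tj, tk_max, cls, cs):
    """Hypothetical helper: largest TK <= tk_max that satisfies the constraint."""
    tk = tk_max
    while tk > 0 and not working_set_fits(tj, tk, cls, cs):
        tk -= 1
    return tk


if __name__ == "__main__":
    print(working_set_fits(24, 41, 4, 1024))   # True:  984 + 24 + 4  = 1012 < 1024
    print(working_set_fits(24, 41, 16, 1024))  # False: 984 + 24 + 16 = 1024
    print(largest_tk(24, 41, 16, 1024))        # 40
```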

Tile Size Selection Algorithm (TSS)

Other Algorithms for Computing Tile Size

• LRW
  – Improves the average cache performance
  – Sensitive to the array size
  – Uses the cache ineffectively

• ESS
  – Effective only for one-dimensional tiling
  – Does not consider cross-interference

Conclusion

• TSS incorporates the effects of cache line size and of cross-interference between arrays

• Performs better than ESS and LRW on both direct-mapped and higher-associativity caches

• Sensitive to array dimensions

• Does not fully exploit temporal reuse for some matrix sizes