Tile Size Selection Using Cache Organization and Data Layout
Stephanie Coleman
Intermetrics, Inc.
Kathryn S. McKinley
Computer Science, LGRC,
University of Massachusetts Amherst
10/27/01
Where to Use Tiling/Blocking?
• Register
• TLB
• L1 cache
• L2 cache
• any other memory hierarchy
Cache Misses
• Compulsory misses
• Capacity misses
• Interference misses
  – Self-interference
  – Cross-interference
Data Reuse and locality
• Data reuse
  – Temporal reuse
  – Spatial reuse
• Locality: reused data remain in cache
• Reuse does not necessarily result in locality
Without Tiling
• Matrix Multiply
for I=1 to N do
  for K=1 to N do
    R=X(K,I)
    for J=1 to N do
      Z(J,I)=Z(J,I)+R*Y(J,K)
Reuse Pattern without tiling
Reuse Pattern after tiling
After tiling
(tile size = TK*TJ)
for KK=1 to N by TK do
  for JJ=1 to N by TJ do
    for I=1 to N do
      for K=KK to MIN(KK+TK-1,N) do
        R=X(K,I)
        for J=JJ to MIN(JJ+TJ-1,N) do
          Z(J,I)=Z(J,I)+R*Y(J,K)
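The two loop nests can be checked against each other with a small sketch. The Python below is only an illustration of the loop structure: the function names are hypothetical, indices are 0-based, and nested Python lists do not reproduce the column-major layout the slides assume, so it says nothing about cache behavior.

```python
def matmul_untiled(X, Y, n):
    """Z(J,I) += X(K,I) * Y(J,K), with loops ordered I, K, J."""
    Z = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            r = X[k][i]
            for j in range(n):
                Z[j][i] += r * Y[j][k]
    return Z


def matmul_tiled(X, Y, n, tk, tj):
    """Same computation, with the K and J loops tiled by tk and tj."""
    Z = [[0] * n for _ in range(n)]
    for kk in range(0, n, tk):              # tile loop over K
        for jj in range(0, n, tj):          # tile loop over J
            for i in range(n):
                for k in range(kk, min(kk + tk, n)):
                    r = X[k][i]
                    for j in range(jj, min(jj + tj, n)):
                        Z[j][i] += r * Y[j][k]
    return Z
```

Within one (KK, JJ) tile, every iteration of I touches the same TK*TJ block of Y, which is the reuse that tiling is meant to keep in cache.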
General Formula for tiling
• Before tiling:
    for I=lo to hi do
• Tiled into:
    for It=floor((lo-off)/ts)*ts+off to floor((hi-off)/ts)*ts+off by ts do
      for I=max(lo, It) to min(hi, It+ts-1) do
  (off: offset; ts: tile size)
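As a quick sanity check of the formula, the following sketch enumerates the indices visited by the tiled loop pair. The function name is hypothetical; Python's `//` is floor division, so it matches floor() for negative values as well.

```python
def tiled_iterations(lo, hi, ts, off=0):
    """Return the element indices visited by the tiled loop nest, in order."""
    visited = []
    first = (lo - off) // ts * ts + off   # floor((lo-off)/ts)*ts+off
    last = (hi - off) // ts * ts + off    # floor((hi-off)/ts)*ts+off
    for it in range(first, last + 1, ts):                      # tile loop
        for i in range(max(lo, it), min(hi, it + ts - 1) + 1):  # element loop
            visited.append(i)
    return visited
```

For any lo <= hi the result should equal the original sequence lo, lo+1, ..., hi; the max/min clamps only trim the first and last partial tiles.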
Loop Interchange
• Interchange an inner tile loop with an outer element loop:
    for I=max(l1,l2,…) to min(u1,u2,…) do
      for Jt=floor((k1*I+m1)/ts)*ts+off
          to floor((ku*I+mu)/ts)*ts+off by ts do
• The limits of the I loop do not change.
• The new lower limit of the Jt loop is the max of a set of expressions, where each expression is the old lower limit with I replaced by one of l1,l2,… (if k1>0) or u1,u2,… (if k1<0).
• The new upper limit of the Jt loop is, symmetrically, the min of a set of expressions, where each expression is the old upper limit with I replaced by one of u1,u2,… (if ku>0) or l1,l2,… (if ku<0).
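A small worked case may help. The sketch below assumes a triangular nest (I from 1 to n, J from I to n), so k1=1, m1=0, ku=0, mu=n, with off=0; the function names are hypothetical. After the interchange, the Jt lower limit is the old limit with I replaced by its lower bound 1, and the element loop over J keeps its max/min clamps, so tiles that do not intersect a given I simply run zero iterations.

```python
def strip_mined(n, ts):
    """Element loop I outside, tile loop Jt inside (before interchange)."""
    out = []
    for i in range(1, n + 1):
        for jt in range(i // ts * ts, n // ts * ts + 1, ts):
            for j in range(max(i, jt), min(n, jt + ts - 1) + 1):
                out.append((i, j))
    return out


def interchanged(n, ts):
    """Tile loop Jt outside, element loop I inside (after interchange)."""
    out = []
    # Lower limit: old limit floor(I/ts)*ts with I replaced by its lower bound 1.
    for jt in range(1 // ts * ts, n // ts * ts + 1, ts):
        for i in range(1, n + 1):
            for j in range(max(i, jt), min(n, jt + ts - 1) + 1):
                out.append((i, j))
    return out
```

Both versions visit the same (I, J) pairs exactly once; only the order changes.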
Tile Size Selection
Cache layout with a tile size of 24
Potential column dimensions
• Euclidean algorithm
  – GCD(a, b) = GCD(a-b, b)
    CS = q1*N + r1
    N = q2*r1 + r2
    r1 = q3*r2 + r3
    …
    1024 = 5*200 + 24
    200 = 8*24 + 8
Potential column dimensions: 24, 8.
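The remainder sequence can be generated mechanically. This is a minimal sketch, assuming CS is the cache size and N the column length, both counted in array elements; the function name is hypothetical, and only the nonzero remainders are collected, matching the example on the slide.

```python
def candidate_column_sizes(cs, n):
    """Collect the nonzero remainders r1, r2, ... of Euclid's algorithm on (cs, n)."""
    sizes = []
    a, b = cs, n
    while b != 0:
        r = a % b          # a = q*b + r
        if r != 0:
            sizes.append(r)
        a, b = b, r
    return sizes
```

With cs=1024 and n=200 this yields the two candidates listed above, 24 and 8.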
Computing row size for a column size
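One way to explore this step is a brute-force count, which is not the authors' formula but an illustrative stand-in. The sketch assumes a direct-mapped cache of cs elements, a column-major array with column length n whose first element maps to cache location 0, and a tile whose columns are col_size elements tall. It adds tile columns one at a time until a new column would land on a cache location that an earlier column of the same tile already occupies (self-interference). The function name is hypothetical.

```python
def max_columns_without_self_interference(cs, n, col_size):
    """Count how many tile columns fit before two of them collide in the cache."""
    if col_size < 1:
        raise ValueError("col_size must be at least 1")
    occupied = set()
    cols = 0
    while True:
        start = cols * n  # offset of this tile column in the array
        locations = {(start + j) % cs for j in range(col_size)}
        if occupied & locations:
            return cols   # this column would evict part of an earlier one
        occupied |= locations
        cols += 1
```

Under these assumptions, cs=1024, n=200 and col_size=24 gives 41 columns, and col_size=8 gives 128.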
Improve Spatial Locality with Cache Line Size
colSize =
    colSize                    if colSize mod CLS = 0, or if colSize = column length
    floor(colSize/CLS)*CLS     otherwise
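The rule translates directly into a few lines of code. In this sketch the function and parameter names are hypothetical, and all sizes are assumed to be in array elements.

```python
def align_column_size(col_size, cls, column_length):
    """Round col_size down to a multiple of the cache line size (cls),
    unless it is already a multiple or spans the whole column."""
    if col_size % cls == 0 or col_size == column_length:
        return col_size
    return col_size // cls * cls
```

For example, with a line size of 4 elements a column size of 10 is trimmed to 8, while 24 is left unchanged.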
Minimize Cross Interference
• Working set size constraint:
    TJ*TK + TJ + 1*CLS < CS
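The constraint can be read against the tiled matrix multiply as follows (this reading is an interpretation, not stated on the slide): TJ*TK elements for the tile of Y, TJ elements for the column segment of Z, and one cache line for the element of X. The sketch below checks the inequality and, as a hypothetical helper, finds the largest TK that satisfies it for a given TJ; all sizes are assumed to be in array elements.

```python
def working_set_fits(tj, tk, cls, cs):
    """True if TJ*TK + TJ + 1*CLS < CS."""
    return tj * tk + tj + 1 * cls < cs


def largest_tk(tj, cls, cs):
    """Largest TK with TJ*TK + TJ + CLS < CS, or 0 if no positive TK fits."""
    # Strict inequality: TJ*TK <= CS - CLS - TJ - 1.
    return max(0, (cs - cls - tj - 1) // tj)
```

With tj=24, cls=4 and cs=1024, the largest admissible TK under this reading is 41.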
Tile Size Selection Algorithm (TSS)
Other Algorithms for Computing Tile Size
• LRW
  – improves the average cache performance
  – sensitive to the array size
  – ineffective cache utilization
• ESS
  – effective only for one-dimensional tiling
  – no consideration of cross-interference
Conclusion
• TSS incorporates the effects of cache line size and of cross-interference between arrays
• Performs better than ESS and LRW on direct-mapped caches and on caches of higher associativity
• Is sensitive to the array dimension
• Does not fully exploit temporal reuse for some matrix sizes