
Tile Size Selection Using Cache Organization and Data Layout

Stephanie Coleman

Intermetrics, Inc.

Kathryn S. McKinley

Computer Science, LGRC,

University of Massachusetts Amherst


Where to Use Tiling/Blocking?

• Registers

• L1 cache

• L2 cache

• Any other level of the memory hierarchy

Cache Misses

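• Compulsory misses: the first access to a cache line

• Capacity misses: the data touched between two uses of an element exceed the cache size

• Interference misses: data that map to the same cache location evict each other, even though the cache could otherwise hold them
  – Self-interference: between elements of the same array
  – Cross-interference: between elements of different arrays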
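As a rough illustration of the three categories, the sketch below replays an address trace through a direct-mapped cache and labels each miss with the common three-C rule: a first-ever access is compulsory, a miss that a fully associative LRU cache of the same size would also take is a capacity miss, and every remaining miss is an interference miss. It assumes one array element per cache line, and `classify_misses` is a hypothetical name. It does not separate self- from cross-interference, which would require knowing which array each address belongs to.

```python
from collections import OrderedDict


def classify_misses(trace, num_lines):
    """Count compulsory, capacity, and interference misses of a direct-mapped cache.

    Assumption: one element per cache line, so address addr maps to line addr % num_lines.
    """
    direct_mapped = [None] * num_lines  # address currently held by each line
    fully_assoc = OrderedDict()         # LRU cache of the same capacity, used as reference
    seen = set()
    counts = {"compulsory": 0, "capacity": 0, "interference": 0}

    for addr in trace:
        first_access = addr not in seen
        seen.add(addr)

        # Update the fully associative LRU reference cache.
        fa_hit = addr in fully_assoc
        if fa_hit:
            fully_assoc.move_to_end(addr)
        else:
            if len(fully_assoc) >= num_lines:
                fully_assoc.popitem(last=False)  # evict the least recently used address
            fully_assoc[addr] = True

        # Look up the direct-mapped cache.
        line = addr % num_lines
        if direct_mapped[line] == addr:
            continue  # hit
        direct_mapped[line] = addr

        if first_access:
            counts["compulsory"] += 1
        elif not fa_hit:
            counts["capacity"] += 1      # even a fully associative cache would miss
        else:
            counts["interference"] += 1  # only the direct mapping caused this miss
    return counts


# Addresses 0 and 4 share line 0 of a 4-line cache and keep evicting each other.
print(classify_misses([0, 4, 0, 4], num_lines=4))
# {'compulsory': 2, 'capacity': 0, 'interference': 2}

# Sweeping 8 addresses twice through 4 lines: the second sweep misses on capacity.
print(classify_misses(list(range(8)) * 2, num_lines=4))
# {'compulsory': 8, 'capacity': 8, 'interference': 0}
```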

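Data Reuse and Locality

• Data reuse
  – Temporal reuse: the same element is accessed again
  – Spatial reuse: a nearby element in the same cache line is accessed

• Locality: the reused data are still in the cache when the reuse occurs

• Reuse does not necessarily result in locality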
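The last point can be seen with a small LRU simulation: a loop that sweeps the same data twice always has temporal reuse, but the second sweep only hits when the data fit in the cache. This is a sketch assuming a fully associative LRU cache that holds `capacity` elements (one element per line); `lru_hits` is an illustrative name.

```python
from collections import OrderedDict


def lru_hits(trace, capacity):
    """Return the number of hits for an address trace in a fully associative LRU cache."""
    cache = OrderedDict()
    hits = 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used element
            cache[addr] = True
    return hits


sweep_twice = list(range(8)) * 2           # every element is reused once
print(lru_hits(sweep_twice, capacity=8))   # 8: the reuse turns into locality
print(lru_hits(sweep_twice, capacity=4))   # 0: reuse, but no locality
```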

Without Tiling

• Matrix Multiply

    for I=1 to N do
      for K=1 to N do
        for J=1 to N do
          C(I,J) = C(I,J) + A(I,K) * B(K,J)
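A runnable sketch of the untiled I-K-J nest follows; it uses 0-based list indexing instead of the slide's 1-based loops, and the name `matmul_ikj` is illustrative. Each element B(K,J) is reused on every iteration of I, but between two uses the loop touches all N² elements of B, so for large N that reuse is unlikely to turn into locality.

```python
def matmul_ikj(A, B, C):
    """C += A * B for n x n lists of lists, in the slide's I-K-J loop order."""
    n = len(A)
    for i in range(n):
        for k in range(n):
            a_ik = A[i][k]                 # fixed across the innermost loop
            for j in range(n):
                C[i][j] += a_ik * B[k][j]  # streams row k of B
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[0, 0], [0, 0]]
print(matmul_ikj(A, B, C))  # [[19, 22], [43, 50]]
```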


[Figure: reuse pattern without tiling]

[Figure: reuse pattern after tiling]

After tiling

(tile size = TK*TJ)

    for KK=1 to N by TK do
      for JJ=1 to N by TJ do
        for I=1 to N do
          for K=KK to MIN(KK+TK-1,N) do
            for J=JJ to MIN(JJ+TJ-1,N) do
              C(I,J) = C(I,J) + A(I,K) * B(K,J)
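A matching sketch of the tiled nest, again 0-based and with illustrative names, checks itself against a plain reference product. Within one (KK, JJ) tile, the same TK×TJ block of B is reused on every I iteration, which is why the tile size is TK*TJ. Python's exclusive `range` end `min(kk + tk, n)` plays the role of the slide's inclusive MIN(KK+TK-1,N).

```python
def matmul_tiled(A, B, C, tk, tj):
    """C += A * B with the K and J loops tiled by tk and tj (0-based indices)."""
    n = len(A)
    for kk in range(0, n, tk):
        for jj in range(0, n, tj):
            # The tk x tj block of B with rows kk.. and columns jj.. is reused for every i.
            for i in range(n):
                for k in range(kk, min(kk + tk, n)):
                    a_ik = A[i][k]
                    for j in range(jj, min(jj + tj, n)):
                        C[i][j] += a_ik * B[k][j]
    return C


def matmul_reference(A, B):
    """Straightforward product, used only to check the tiled version."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


n = 5
A = [[i + 2 * j for j in range(n)] for i in range(n)]
B = [[3 * i - j for j in range(n)] for i in range(n)]
C = [[0] * n for _ in range(n)]
# Tile sizes need not divide n: min() trims the last, partial tile.
print(matmul_tiled(A, B, C, tk=2, tj=3) == matmul_reference(A, B))  # True
```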


General Formula for tiling

• Before tiling:

    for I=lo to hi do

• Tiled into:

    for It=floor((lo-off)/ts)*ts+off to floor((hi-off)/ts)*ts+off by ts do
      for I=max(lo, It) to min(hi, It+ts-1) do

(off: offset; ts: tile size)
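The formula can be checked by enumerating the iterations it produces. In this sketch (the name `tiled_indices` is illustrative), tiles start at values congruent to off modulo ts, and the max/min in the element loop clip the partial first and last tiles. Python's `//` floors toward negative infinity, which matches floor in the formula even for negative bounds.

```python
def tiled_indices(lo, hi, ts, off=0):
    """Enumerate lo..hi (inclusive) in the order produced by the tiled loop nest."""
    order = []
    first_tile = (lo - off) // ts * ts + off  # floor((lo-off)/ts)*ts+off
    last_tile = (hi - off) // ts * ts + off   # floor((hi-off)/ts)*ts+off
    for it in range(first_tile, last_tile + 1, ts):              # tile loop
        for i in range(max(lo, it), min(hi, it + ts - 1) + 1):   # element loop
            order.append(i)
    return order


# Tiles start at 1, 5, 9, 13, 17; every I in 3..17 is visited exactly once.
print(tiled_indices(3, 17, ts=4, off=1) == list(range(3, 18)))  # True
```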

Loop Interchange

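• Interchange an inner tile loop with an outer element loop:

    for I=max(l1,l2,…) to min(u1,u2,…) do
      for Jt=floor((kl*I+ml-off)/ts)*ts+off to floor((ku*I+mu-off)/ts)*ts+off by ts do

• The limits of the I loop do not change.

• The new lower limit of the Jt loop is the max of a set of expressions, each obtained from its old lower limit by replacing I with one of l1, l2, … (if kl>0) or one of u1, u2, … (if kl<0).

• Symmetrically, the new upper limit of the Jt loop is the min of the expressions obtained from its old upper limit by replacing I with one of u1, u2, … (if ku>0) or one of l1, l2, … (if ku<0).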
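The max/min rule works because each Jt limit is a non-decreasing (or non-increasing) function of I, so its extreme value over the I range is reached at I = max(l1,…) or I = min(u1,…), and substituting a max or min of bounds is the same as taking the max or min of the substituted expressions. The sketch below, with hypothetical function names, computes the new Jt limits by the rule and compares them with a brute-force scan over I; it assumes the I loop is non-empty.

```python
def tile_start(x, ts, off):
    """Start of the tile containing x: floor((x - off) / ts) * ts + off."""
    return (x - off) // ts * ts + off


def interchanged_jt_bounds(ls, us, kl, ml, ku, mu, ts, off=0):
    """New (lower, upper) limits of the Jt loop after moving it outside the I loop.

    Original nest: for I = max(ls) to min(us)
                     for Jt = tile_start(kl*I+ml) to tile_start(ku*I+mu) by ts
    """
    lower_subs = ls if kl > 0 else us  # kl == 0: the limit does not depend on I
    upper_subs = us if ku > 0 else ls
    lower = max(tile_start(kl * s + ml, ts, off) for s in lower_subs)
    upper = min(tile_start(ku * s + mu, ts, off) for s in upper_subs)
    return lower, upper


def brute_force_bounds(ls, us, kl, ml, ku, mu, ts, off=0):
    """The same limits, found by scanning every I (assumes max(ls) <= min(us))."""
    i_range = range(max(ls), min(us) + 1)
    lower = min(tile_start(kl * i + ml, ts, off) for i in i_range)
    upper = max(tile_start(ku * i + mu, ts, off) for i in i_range)
    return lower, upper


# J runs from I to 2*I+3 for I = max(0, 2) .. min(10, 7), tiled by 4.
args = dict(ls=[0, 2], us=[10, 7], kl=1, ml=0, ku=2, mu=3, ts=4)
print(interchanged_jt_bounds(**args), brute_force_bounds(**args))  # (0, 16) (0, 16)
```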

Tile Size Selection


[Figure: cache layout with a tile size of 24]

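Potential column dimensions

• Euclidean algorithm
  – gcd(a,b) = gcd(a-b,b), or equivalently gcd(a,b) = gcd(b, a mod b)
  – Applied to the cache size CS and the column dimension N, the successive remainders r1, r2, … are the potential column dimensions:

    CS = q1*N + r1
    N  = q2*r1 + r2
    r1 = q3*r2 + r3
    …

• Example with CS = 1024 and N = 200:

    1024 = 5*200 + 24
    200  = 8*24 + 8
    24   = 3*8 + 0   (remainder 0, so the algorithm stops)

Potential column dimensions: 24, 8.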
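The remainder sequence can be computed with the division form of the Euclidean algorithm, gcd(a, b) = gcd(b, a mod b), which is the subtraction step repeated q times. A sketch, assuming CS and N are measured in the same units (e.g. array elements); `potential_column_dims` is an illustrative name.

```python
def potential_column_dims(cs, n):
    """Non-zero remainders r1, r2, ... of the Euclidean algorithm applied to (CS, N)."""
    dims = []
    a, b = cs, n
    while b != 0:
        q, r = divmod(a, b)  # a = q * b + r
        if r != 0:
            dims.append(r)
        a, b = b, r
    return dims


print(potential_column_dims(1024, 200))  # [24, 8]
```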

Computing row size for a column size
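One way to find the row size that goes with a column size, without assuming any closed form, is to lay the tile's columns out in a direct-mapped cache and count how many fit before two of them overlap. The sketch below does this by brute force under stated assumptions: a column-major array with column length N whose first element maps to cache location 0, a cache of CS elements, and one element per cache line. `max_row_size` is a hypothetical name, and this is not necessarily how the paper computes the row size.

```python
def max_row_size(cs, n, col_size):
    """Largest number of tile columns (each col_size elements long, consecutive columns
    n elements apart in memory) that map to disjoint locations of a direct-mapped
    cache with cs elements."""
    if not 0 < col_size <= cs:
        raise ValueError("col_size must be between 1 and the cache size")
    occupied = [False] * cs
    cols = 0
    while True:
        start = (cols * n) % cs  # cache location of this column's first element
        slots = [(start + i) % cs for i in range(col_size)]
        if any(occupied[s] for s in slots):
            return cols          # this column would self-interfere with an earlier one
        for s in slots:
            occupied[s] = True
        cols += 1


for col_size in (200, 24, 8):
    print(col_size, max_row_size(1024, 200, col_size))
# 200 5
# 24 41
# 8 128
```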

Improve Spatial Locality with Cache Line Size


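The column size is rounded down to a multiple of the cache line size (CLS) unless it already is one or equals the full column length (colSize' denotes the adjusted column size):

$$\mathit{colSize}' = \begin{cases} \mathit{colSize} & \text{if } \mathit{colSize} \bmod \mathit{CLS} = 0 \text{ or } \mathit{colSize} = \text{column length} \\ \lfloor \mathit{colSize}/\mathit{CLS} \rfloor \cdot \mathit{CLS} & \text{otherwise} \end{cases}$$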
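A direct transcription of the rule, assuming colSize, CLS, and the column length are all measured in array elements; the function name is illustrative. Rounding down presumably lets each tile column cover whole cache lines rather than part of one. Note that the rule as written yields 0 when colSize is smaller than CLS.

```python
def adjust_col_size(col_size, cls, col_length):
    """Round col_size down to a multiple of the cache line size cls (in elements),
    unless it is already a multiple or covers the whole column."""
    if col_size % cls == 0 or col_size == col_length:
        return col_size
    return (col_size // cls) * cls  # 0 when col_size < cls


print(adjust_col_size(24, 16, 200))   # 16
print(adjust_col_size(24, 8, 200))    # 24
print(adjust_col_size(200, 16, 200))  # 200 (full column, kept as is)
```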

Minimize Cross Interference

• Working set size constraint:


Tile Size Selection Algorithm (TSS)
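The algorithm itself is not reproduced here. As a rough, hypothetical sketch of how the preceding steps could fit together: take N and the Euclidean remainders as candidate column sizes, compute the largest conflict-free row size for each, round the column size for the cache line, and keep the candidate with the largest tile area. The largest-area rule, capping the row size at N (an N×N array), and leaving out the working-set (cross-interference) constraint are all assumptions of this sketch, not details taken from TSS. Sizes are in array elements, with one element per cache line in the layout check.

```python
def potential_column_dims(cs, n):
    """Non-zero Euclidean remainders r1, r2, ... for (CS, N)."""
    dims, a, b = [], cs, n
    while b != 0:
        a, b = b, a % b
        if b != 0:
            dims.append(b)
    return dims


def max_row_size(cs, n, col_size):
    """Columns of length col_size (n apart in memory) that fit without overlapping."""
    occupied = [False] * cs
    cols = 0
    while True:
        start = (cols * n) % cs
        slots = [(start + i) % cs for i in range(col_size)]
        if any(occupied[s] for s in slots):
            return cols
        for s in slots:
            occupied[s] = True
        cols += 1


def adjust_col_size(col_size, cls, col_length):
    if col_size % cls == 0 or col_size == col_length:
        return col_size
    return (col_size // cls) * cls


def select_tile(cs, n, cls):
    """Hypothetical selection: (colSize, rowSize) of the candidate with the largest area."""
    best = None
    for col_size in [n] + potential_column_dims(cs, n):
        if col_size > cs:
            continue  # a column longer than the cache cannot be conflict-free
        row_size = min(max_row_size(cs, n, col_size), n)  # assumes an n x n array
        adj_col = adjust_col_size(col_size, cls, n)       # shorter columns still fit
        if adj_col == 0:
            continue  # shorter than one cache line after rounding
        if best is None or adj_col * row_size > best[0] * best[1]:
            best = (adj_col, row_size)
    return best


print(select_tile(1024, 200, cls=4))   # (8, 128)
print(select_tile(1024, 200, cls=16))  # (200, 5)
```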

Other Algorithms for Computing Tile Size

• LRW (Lam, Rothberg, and Wolf)
  – Improves average cache performance
  – Sensitive to the array size
  – Utilizes the cache ineffectively

• ESS (Esseghir)
  – Effective only for one-dimensional tiling
  – Does not consider cross-interference


• TSS incorporates the effects of cache line size and of cross-interference between arrays

• Performs better than ESS and LRW on both direct-mapped and higher-associativity caches

• Sensitive to the array dimension

• Does not fully exploit temporal reuse for some matrix sizes
