LAB 2: CACHE LABacase/fa14/CS201/labs/cachelab-recitation.pdfQUESTIONS TO PONDER • What would happen if instead of accessing each index in row order you alternated with jumping from

Simulations of Memory Hierarchy

LAB 2: CACHE LAB

OVERVIEW •  Objectives •  Cache Set-Up •  Command line parsing •  Least Recently Used (LRU) •  Matrix Transposition •  Cache-Friendly Code

OBJECTIVE •  There are two parts to this lab: •  Part A: Cache Simulator • Simulate a cache table using the LRU algorithm

•  Part B: Optimizing Matrix Transpose • Write “cache-friendly” code in order to optimize

cache hits/misses in the implementation of a matrix transpose function

•  When submitting your lab, please submit the handin.tar file as described in the instructions.

MEMORY HIERARCHY •  Pick your poison: smaller, faster, and costlier, or larger,

slower, and cheaper

CACHE ADDRESSING •  X-bit memory addresses (in Part A, X <= 64 bits) •  Block offset: b bits •  Set index: s bits •  Tag bits: X – b – s •  Cache is a collection of S=2^s cache sets •  Cache set is a collection of E cache lines •  E is the associativity of the cache •  If E=1, the cache is called “direct-mapped”

•  Each cache line stores a block of B=2^b bytes of data

ADDRESS ANATOMY

CACHE TABLE BASICS •  Conditions: •  Set size (S) •  Block size (B) •  Line size (E)

•  Note that the total capacity of this cache would be S*B*E •  Blocks are the fundamental units of the cache

CACHE TABLE CORRESPONDENCE WITH ADDRESS

Example for 32 bit address

CACHE SET LOOK-UP •  Determine the set index and the tag bits based on the

memory address •  Locate the corresponding cache set and determine

whether or not there exists a valid cache line with a matching tag

•  If a cache miss occurs: •  If there is an empty cache line, utilize it •  If the set is full then a cache line must be evicted

TYPES OF CACHE MISSES •  Compulsory Miss: •  First access to a block has to be a miss

•  Conflict Miss: •  Level k cache is large enough, but multiple data

objects all map to the same level k block •  Capacity Miss: •  Occurs when the working set of blocks (blocks of

memory being used) is larger than the cache

PART A: CACHE SIMULATION

YOUR OWN CACHE SIMULATOR •  NOT a real cache •  Block offsets are NOT used but are important in

understanding the concept of a cache •  s, b, and E given at runtime

FUNCTIONS TO USE FOR COMMAND LINE PARSING •  int getopt(int argc, char*const* argv, const char*

options) •  See: http://www.gnu.org/software/libc/manual/

html_node/Example-of-Getopt.html#Example-of-Getopt

•  long long int strtoll(const char* str, char** endptr, int base) •  See: http://www.cplusplus.com/reference/cstdlib/

strtoll/

LEAST RECENTLY USED (LRU) ALGORITHM

•  A least recently used algorithm should be used to determine which cache lines to evict in what order

•  Each cache line will need some sort of “time” field which should be update each time that cache line is referenced

•  If a cache miss occurs in a full cache set, the cache line with the least relevant time field should be evicted

PART B: OPTIMIZING MATRIX TRANSPOSE

WHAT IS A MATRIX TRANSPOSITION? •  The transpose of a matrix A is denoted as AT •  The rows of AT are the columns of A, and the

columns of AT are the rows of A •  Example:

GENERAL MATRIX TRANSPOSITION

CACHE-FRIENDLY CODE •  In order to have fewer cache misses, you must make

good use of: •  Temporal locality: reuse the current cache block if

possible (avoid conflict misses [thrashing]) •  Spatial locality: reference the data of close storage

locations •  Tips: •  Cache blocking •  Optimized access patterns •  Your code should look ugly if done correctly

CACHE BLOCKING •  Partition the matrix in question into sub-matrices •  Divide the larger problem into smaller sub-problems

•  Main idea: •  Iterate over blocks as you perform the transpose as

opposed to the simplistic algorithm which goes index by index, row by row

•  Determining the size of these blocks will take some amount of thought and experimentation

QUESTIONS TO PONDER •  What would happen if instead of accessing each index in row

order you alternated with jumping from row to row within the same column?

•  What would happen if you declared only 4 local variables as opposed to 12 local variables?

•  Is it possible to get rid of the local variables all together? •  What happens when accessing elements along the diagonal? •  What happens when the program is run in a different directory?

(XKCD)

Documents

LAB 2: CACHE LABacase/fa14/CS201/labs/cachelab-recitation.pdfQUESTIONS TO PONDER • What would happen if instead of accessing each index in row order you alternated with jumping from