High Performance Embedded Computing
© 2007 Elsevier
Lecture 7: Memory Systems & Code Compression
Embedded Computing Systems
Mikko Lipasti, adapted from M. Schulte
Based on slides and textbook from Wayne Wolf
Topics
- Branching in VLIW (from last time)
- Memory systems
  - Memory component models
  - Caches and alternatives
- Code compression
Branching in VLIW Processors
- Operations in the same VLIW instruction as a branch must not depend on the branch outcome.
- VLIWs may use static branch prediction, which uses profiling to predict the branch direction; different instructions can be emitted depending on whether the branch is predicted taken or not taken.
- VLIWs sometimes have specialized loop instructions that execute an instruction word, or a set of instruction words, a specified number of times.
- VLIWs sometimes have predicated instructions, which execute conditionally based on the value in a "predicate register" (see the C sketch below):
  cmpgt p1 = r1, r2 ;;
  sub (p1) r3 = r1, r2
  sub (!p1) r3 = r2, r1 ;;
Branching in VLIW Processors
- VLIWs often execute branches in two phases:
  - Prepare the branch condition (store it in a branch register).
  - Execute the branch based on the condition.
- Example, where a is in $r1, b is in $r2, and 0 is in $r0:
  ## Implements: if (a > 0 && b < 0) { [Block 1] }
  cmpgt $r3 = $r1, $r0   ## (a > 0)
  cmplt $r4 = $r2, $r0   ## (b < 0)
  ;;
  and $b0 = $r3, $r4
  ;;
  br $b0, L1             ## L1 starts Block 1
  ;;
Generic memory block
Simple memory model
- Core array is n rows x m columns.
- Total area A = Ar + Ax + Ap + Ac, where:
  - Row decoder area Ar = ar m.
  - Core area Ax = ax m n.
  - Precharge circuit area Ap = ap m.
  - Column decoder area Ac = ac m.
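A minimal C sketch of this area model, assuming ar, ax, ap, and ac are given technology coefficients (the function name is illustrative):

double memory_area(int n, int m, double ar, double ax, double ap, double ac) {
    double Ar = ar * m;        /* row decoder */
    double Ax = ax * m * n;    /* core array */
    double Ap = ap * m;        /* precharge circuits */
    double Ac = ac * m;        /* column decoder */
    return Ar + Ax + Ap + Ac;  /* total area A */
}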
Simple energy and delay models
- Total delay Δ = Δsetup + Δr + Δx + Δbit + Δc.
- Total energy E = ED + ES.
- Static energy component ES is a technology parameter.
- Dynamic energy ED = Er + Ex + Ep + Ec.
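A matching C sketch for the delay and energy models (the component values are assumed inputs, not values from the text):

double memory_delay(double d_setup, double d_r, double d_x,
                    double d_bit, double d_c) {
    return d_setup + d_r + d_x + d_bit + d_c;  /* total delay */
}

double memory_energy(double Er, double Ex, double Ep, double Ec, double Es) {
    double Ed = Er + Ex + Ep + Ec;  /* dynamic energy */
    return Ed + Es;                 /* total E = ED + ES */
}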
Multiport memories - delay
(Figure: delay vs. memory size and number of ports, for a multiport memory with read ports R_addr1/R_data1 and R_addr2/R_data2 and a write port W_addr/W_data.)
Multiport memories - area
- The area of a memory often grows with the square of the number of ports, since each added port adds both word lines and bit lines to every cell; doubling the port count therefore roughly quadruples the area.
Kamble and Ghose cache power model
Kamble/Ghose, cont’d.
- The cache is m-way set-associative, with a capacity of D bytes, T tag bits, L bytes per line, and St status bits per line.
- The bit line energy is given below (the equation printed in the book is incorrect):
E_bit = 0.5 V_DD^2 [ N_bit,pr C_bit,pr + N_bit,w C_bit,rw + N_bit,r C_bit,rw ]
where the precharge capacitance over the m(8L + T + St) bit line pairs is
C_bit,pr = m(8L + T + St)(C_g,Qpa + C_g,Qpb + C_g,Qp)
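For concreteness, the reconstructed equation as a C function; since the original formula is garbled in the source, treat this strictly as a sketch of the model's structure, not a definitive statement of the Kamble/Ghose equation:

/* parameter names mirror the model's symbols; C_rw is the read/write
   bit line capacitance, Cg_* are precharge-transistor gate capacitances */
double bit_line_energy(double Vdd, int m, int T, int L, int St,
                       double N_pr, double N_w, double N_r,
                       double C_rw,
                       double Cg_Qpa, double Cg_Qpb, double Cg_Qp) {
    double C_pr = m * (8.0 * L + T + St) * (Cg_Qpa + Cg_Qpb + Cg_Qp);
    return 0.5 * Vdd * Vdd * (N_pr * C_pr + N_w * C_rw + N_r * C_rw);
}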
Kamble/Ghose, cont’d.
The model gives analogous expressions for the word line energy, the output line energy, and the energy of the address input lines.
Shiue and Chakrabarti cache energy model
- add_bs: number of transitions on the address bus per instruction.
- data_bs: number of transitions on the data bus per instruction.
- word_line_size: number of memory cells on a word line.
- bit_line_size: number of memory cells on a bit line.
- Em: energy consumption of a main memory access.
- The remaining coefficients are technology parameters.
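The full Shiue/Chakrabarti equations are not reproduced on this slide; as a generic illustration of how such per-access parameters combine (this is not the authors' exact model):

/* every access pays the per-access cache energy; misses also pay Em,
   the main memory access energy defined above */
double cache_energy_sketch(long accesses, long misses,
                           double E_cache_access, double Em) {
    return accesses * E_cache_access + misses * Em;
}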
Shiue/Chakrabarti, cont’d.
Register files
- First stage in the memory hierarchy.
- When too many values are live, some must be spilled to the cache or main memory and read back later; spills cost time and energy.
- Register file parameters: number of words, number of ports.
Performance and energy vs. register file size.
[Weh01] © 2001 IEEE
Cache size vs. energy
[Li98]© 1998 IEEE
Cache parameters
- Cache size: larger caches hold more data, but require more static energy and take area away from other functions.
- Number of sets: with more sets, more references can proceed independently; with fewer sets, more locations are mapped onto each line.
- Cache line length: longer lines give more prefetching bandwidth, but often result in higher energy consumption.
- What impact does each of these have on energy?
Multilevel cache optimization
- Gordon-Ross et al. adjust cache parameters in order: cache size, then line size, then associativity.
- Design the cache size for the first level, then the second level; the line size for the first, then the second; the associativity for the first, then the second (see the sketch below).
- Why vary the parameters in this order?
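A minimal sketch of this ordered search; the candidate values and the evaluate() cost function (e.g., an energy or energy-delay estimate from simulation) are placeholders, not taken from Gordon-Ross et al.:

extern double evaluate(const int size[2], const int line[2], const int assoc[2]);

/* keep whichever candidate value for *param minimizes the cost */
static void sweep(int *param, const int *cands, int n,
                  const int size[2], const int line[2], const int assoc[2]) {
    double best = 1e300;
    int best_v = *param;
    for (int i = 0; i < n; i++) {
        *param = cands[i];
        double cost = evaluate(size, line, assoc);
        if (cost < best) { best = cost; best_v = cands[i]; }
    }
    *param = best_v;
}

void tune(int size[2], int line[2], int assoc[2]) {
    const int SIZES[]  = { 2048, 4096, 8192, 16384 };
    const int LINES[]  = { 16, 32, 64 };
    const int ASSOCS[] = { 1, 2, 4 };
    /* cache size first (level 1, then level 2), then line size, then associativity */
    for (int l = 0; l < 2; l++) sweep(&size[l],  SIZES,  4, size, line, assoc);
    for (int l = 0; l < 2; l++) sweep(&line[l],  LINES,  3, size, line, assoc);
    for (int l = 0; l < 2; l++) sweep(&assoc[l], ASSOCS, 3, size, line, assoc);
}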
Scratch pad memory
- Scratch pad is managed by software, not hardware.
- Provides predictable access time.
- Requires values to be explicitly allocated to it.
- Standard read/write instructions are used to access the scratch pad.
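Because the scratch pad occupies ordinary address space, it is reached with standard loads and stores; a minimal sketch, assuming a hypothetical base address and size:

#include <stdint.h>

#define SPM_BASE  ((volatile uint32_t *)0x20000000)  /* hypothetical address */
#define SPM_WORDS 1024                               /* hypothetical capacity */

static inline void     spm_store(unsigned i, uint32_t v) { SPM_BASE[i] = v; }
static inline uint32_t spm_load(unsigned i)              { return SPM_BASE[i]; }

/* deciding which values live here is the software's allocation problem */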
Code compression
- An extreme version of instruction encoding: use variable-bit instructions.
- Encodings are generated using compression algorithms.
- Generally takes longer to decode.
- Can result in performance, energy, and code size improvements. How?
- IBM CodePack (PowerPC) used Huffman encoding.
Terms
- Compression ratio: (compressed code size / uncompressed code size) * 100%.
- Must take into account all overheads.
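As a function (a trivial sketch; sizes are in bytes, and all overheads such as branch tables and padding must already be counted in the compressed size):

#include <stddef.h>

double compression_ratio(size_t compressed, size_t uncompressed) {
    return 100.0 * (double)compressed / (double)uncompressed;  /* lower is better */
}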
Wolfe/Chanin approach
- Object code is fed to a lossless compression algorithm; Wolfe/Chanin used Huffman's algorithm.
- The compressed object code becomes the program image.
- Code is decompressed on the fly during execution.
- Flow: source code → compiler → object code → compressor → compressed object code.
Wolfe/Chanin execution
- Instructions are decompressed when read from main memory; data is not compressed or decompressed.
- The cache holds uncompressed instructions.
- Latency to fetch instructions from memory is longer.
- The CPU does not require significant modifications.
- Organization: memory → decompressor → cache → CPU.
Huffman coding
- Input stream is a sequence of symbols, and each symbol's probability of occurrence is known.
- Construct a binary tree of probabilities from the bottom up.
- The path from the root to a symbol gives the code for that symbol.
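A minimal bottom-up construction in C; the four-symbol alphabet and its probabilities are an assumed example, not data from the slides:

#include <stdio.h>

#define NSYM  4
#define NNODE (2 * NSYM - 1)

struct node { double p; int left, right; };  /* left/right = -1 for leaves */

static struct node tree[NNODE];

/* index of the unused node with the smallest probability */
static int pick_min(const int *used, int count) {
    int best = -1;
    for (int i = 0; i < count; i++)
        if (!used[i] && (best < 0 || tree[i].p < tree[best].p))
            best = i;
    return best;
}

/* walk root-to-leaf paths; '0' for left, '1' for right */
static void print_codes(int n, char *buf, int depth) {
    if (tree[n].left < 0) {
        buf[depth] = '\0';
        printf("symbol %d: %s\n", n, buf);
        return;
    }
    buf[depth] = '0'; print_codes(tree[n].left, buf, depth + 1);
    buf[depth] = '1'; print_codes(tree[n].right, buf, depth + 1);
}

int main(void) {
    double probs[NSYM] = { 0.5, 0.25, 0.15, 0.10 };  /* assumed distribution */
    int used[NNODE] = { 0 };
    for (int i = 0; i < NSYM; i++)
        tree[i] = (struct node){ probs[i], -1, -1 };
    int count = NSYM;
    while (count < NNODE) {          /* merge the two smallest, bottom up */
        int a = pick_min(used, count); used[a] = 1;
        int b = pick_min(used, count); used[b] = 1;
        tree[count] = (struct node){ tree[a].p + tree[b].p, a, b };
        count++;
    }
    char buf[NSYM + 1];
    print_codes(NNODE - 1, buf, 0);  /* root is the last node built */
    return 0;
}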
Wolfe/Chanin results
[Wol92] © 1992 IEEE
Compressed vs. uncompressed code
- Because of branches, code must be decompressed starting from many different points in the program, but compression algorithms are designed to decode from the start of a stream.
- Compressed code is therefore organized into blocks; decompression starts at the beginning of a block.
- Unused bits between blocks constitute overhead.
- (Figure: a short sequence, add r1, r2, r3; mov r1, a; bne r1, foo, shown uncompressed and compressed.)
Block structure and compression
- Trade-off: compression algorithms work best on long blocks, but program branching works best with short blocks.
- Labels in the program move during compression. Two approaches:
  - Wolfe and Chanin used a branch table to translate branch targets during execution (adds code size).
  - Lefurgy et al. patched the compressed code so that branches refer to compressed locations.
Compression ratio vs. block size
[Lek99b] © 1999 IEEE
Pre-cache compression
- Decompress instructions as they come out of the cache.
- One instruction may be decompressed many times.
- The program has a smaller cache footprint.
- Why might a different type of decompression engine be needed?
Compression algorithms
- There are a large number of algorithms for compressing data.
- These algorithms were designed under different constraints: large text files, with no real-time or power constraints.
- Approach: evaluate existing algorithms under the requirements of code compression, and develop new algorithms. Several of these new algorithms are discussed in the textbook.
Code and data compression
- Unlike (non-modifiable) code, data must be compressed and decompressed dynamically.
- Data is compressed before it is written back to main memory.
- Can substantially reduce the main memory or cache footprint.
- Requires different trade-offs.
Lempel-Ziv algorithm
- Dictionary-based method.
- The decoder builds its dictionary during the decompression process.
- The LZW variant uses a fixed-size buffer.
- (Figure: source text → coder (with dictionary) → compressed text → decoder (with dictionary) → uncompressed source.)
Lempel-Ziv example
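As a concrete illustration of dictionary building, a compact LZW-style coder sketch in C; the linear-search dictionary and toy input are illustrative assumptions:

#include <stdio.h>
#include <string.h>

#define DICT_MAX  4096  /* fixed-size dictionary */
#define ENTRY_MAX 64

static char dict[DICT_MAX][ENTRY_MAX];
static int  dict_len[DICT_MAX];
static int  dict_size;

static int dict_find(const char *s, int len) {
    for (int i = 0; i < dict_size; i++)
        if (dict_len[i] == len && memcmp(dict[i], s, len) == 0)
            return i;
    return -1;
}

static void dict_add(const char *s, int len) {
    if (dict_size < DICT_MAX && len < ENTRY_MAX) {
        memcpy(dict[dict_size], s, len);
        dict_len[dict_size] = len;
        dict_size++;
    }
}

static void lzw_compress(const unsigned char *in, int n) {
    char cur[ENTRY_MAX];
    int cur_len = 0;
    /* seed the dictionary with all 256 single-byte strings */
    for (int c = 0; c < 256; c++) {
        dict[dict_size][0] = (char)c;
        dict_len[dict_size] = 1;
        dict_size++;
    }
    for (int i = 0; i < n; i++) {
        cur[cur_len] = (char)in[i];
        if (dict_find(cur, cur_len + 1) >= 0) {
            cur_len++;                               /* extend the match */
        } else {
            printf("%d ", dict_find(cur, cur_len));  /* emit code for match */
            dict_add(cur, cur_len + 1);              /* learn match + next byte */
            cur[0] = (char)in[i];                    /* restart from this byte */
            cur_len = 1;
        }
    }
    if (cur_len > 0)
        printf("%d", dict_find(cur, cur_len));
    printf("\n");
}

int main(void) {
    const char *msg = "abababababab";  /* toy input */
    lzw_compress((const unsigned char *)msg, (int)strlen(msg));
    return 0;
}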
MXT
- Tremaine et al.'s MXT has a three-level cache system; level 3 is shared among several processors and connected to main memory.
- Data and code are compressed/uncompressed as they move between main memory and the level 3 cache.
- Uses a variant of the Lempel-Ziv 1977 algorithm; all compression engines share the same dictionary.
- Typically, 1 KB blocks are divided into four 256-byte blocks for compression; these can be decompressed in parallel.