Page 1

High Performance Embedded Computing

© 2007 Elsevier

Lecture 7: Memory Systems & Code Compression

Embedded Computing Systems
Mikko Lipasti, adapted from M. Schulte

Based on slides and textbook from Wayne Wolf

Page 2

Topics

Branching in VLIW (from last time).
Memory systems:
  Memory component models.
  Caches and alternatives.
Code compression.

Page 3

Branching in VLIW Processors

Operations in the same VLIW instruction as a branch must not depend on the branch outcome.
VLIWs may use static branch prediction: profiling predicts the branch direction, and different instructions are selected depending on whether the branch is predicted taken or not taken.
VLIWs sometimes have specialized loop instructions, which execute an instruction word or set of instruction words a specified number of times.
VLIWs sometimes have predicated instructions, which execute conditionally based on the value in a "predicate register":

cmpgt p1 = r1, r2 ;;
sub (p1) r3 = r1, r2, sub (!p1) r3 = r2, r1 ;;
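As a C-level reference, here is what that predicated pair computes (a minimal sketch, not compiler output):

/* Both subtractions can issue in the same VLIW word; the predicate
   p1 selects which result is actually written to r3. In C terms: */
int abs_diff(int r1, int r2) {
    return (r1 > r2) ? r1 - r2 : r2 - r1;
}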

Page 4

Branching in VLIW Processors

VLIWs often execute branches in two phases:
  Prepare the branch condition (store it in a branch register).
  Execute the branch based on the condition.

## Implements: if (a > 0 && b < 0) { [Block 1] }
## where a is in $r1, b is in $r2, and 0 is in $r0
cmpgt $r3 = $r1, $r0  ## (a > 0)
cmplt $r4 = $r2, $r0  ## (b < 0)
;;
and $b0 = $r3, $r4
;;
br $b0, L1            ## L1 starts Block 1
;;

Page 5

Generic memory block

Page 6

Simple memory model

The core array is n rows × m columns.
Total area: A = A_r + A_x + A_p + A_c
  Row decoder area: A_r = a_r m
  Core area: A_x = a_x m n
  Precharge circuit area: A_p = a_p m
  Column decoder area: A_c = a_c m
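A minimal C sketch of this area model (the a_* coefficients and the values below are illustrative placeholders, not real process data):

#include <stdio.h>

/* Area model for a core array of n rows x m columns, following the
   slide: A = Ar + Ax + Ap + Ac. Coefficients are placeholders. */
struct area_coeffs { double ar, ax, ap, ac; };

double memory_area(int n, int m, struct area_coeffs k) {
    double Ar = k.ar * m;     /* row decoder */
    double Ax = k.ax * m * n; /* core cells */
    double Ap = k.ap * m;     /* precharge circuits */
    double Ac = k.ac * m;     /* column decoder */
    return Ar + Ax + Ap + Ac;
}

int main(void) {
    struct area_coeffs k = { 1.0, 0.5, 0.25, 0.75 }; /* made-up units */
    printf("A = %g area units\n", memory_area(1024, 64, k));
    return 0;
}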

Page 7

Simple energy and delay models

Delay: t = t_setup + t_r + t_x + t_bit + t_c (setup, row decoder, core, bit line, and column decoder components).
Total energy: E = E_D + E_S.
The static energy component E_S is a technology parameter.
Dynamic energy: E_D = E_r + E_x + E_p + E_c.
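The same models in C form (a trivial sketch; every component value would come from technology characterization):

/* Access delay and total energy per the slide's models. */
double access_delay(double t_setup, double t_r, double t_x,
                    double t_bit, double t_c) {
    return t_setup + t_r + t_x + t_bit + t_c;
}

double total_energy(double e_r, double e_x, double e_p, double e_c,
                    double e_static) {
    double e_dynamic = e_r + e_x + e_p + e_c; /* E_D */
    return e_dynamic + e_static;              /* E = E_D + E_S */
}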

Page 8

Multiport memories - delay

[Figure: delay vs. memory size and number of ports, for a multiport memory with two read ports (R_addr1/R_data1, R_addr2/R_data2) and one write port (W_addr/W_data).]

Page 9

Multiport memories - area

The area of a memory often grows with the square of the number of ports. Under this rule of thumb, for example, tripling the number of ports grows the array area roughly ninefold.

Page 10

Kamble and Ghose cache power model

Page 11

Kamble/Ghose, cont’d.

The cache is m-way set-associative, with a capacity of D bytes, T tag bits, L-byte lines, and St status bits per line.

The bit line energy is :

The equation in the book is incorrect.

E_bit = 0.5 V_DD^2 [ N_bit,pr C_bit,pr + N_bit,w C_bit,w + N_bit,rw C_bit,rw ]

where the effective capacitances scale with the m(T + 8L + St) bit-line columns and include the gate capacitances of the precharge transistors Q_pa, Q_pb, and Q_p.
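As a C sketch of the reconstructed form above (the transition counts N and effective capacitances C would come from simulation and layout extraction):

/* Bit-line energy: E_bit = 0.5*Vdd^2*(N_pr*C_pr + N_w*C_w + N_rw*C_rw). */
double bitline_energy(double vdd,
                      double n_pr, double c_pr,   /* precharge */
                      double n_w,  double c_w,    /* writes */
                      double n_rw, double c_rw) { /* read/writes */
    return 0.5 * vdd * vdd * (n_pr * c_pr + n_w * c_w + n_rw * c_rw);
}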

Page 12

Kamble/Ghose, cont’d.

Analogous expressions give the word line energy, the output line energy, and the address input line energy.

Page 13

Shiue and Chakrabarti cache energy model

add_bs: number of transitions on the address bus per instruction.
data_bs: number of transitions on the data bus per instruction.
word_line_size: number of memory cells on a word line.
bit_line_size: number of memory cells on a bit line.
Em: energy consumption of a main memory access.
Plus additional technology parameters.

Page 14

Shiue/Chakrabarti, cont’d.

Page 15

Register files

The first stage in the memory hierarchy.
When too many values are live, some values must be spilled to the cache or main memory and read back later. Spills cost time and energy.
Register file parameters: number of words, number of ports.

Page 16

Performance and energy vs. register file size.

[Weh01] © 2001 IEEE

Page 17

Cache size vs. energy

[Li98]© 1998 IEEE

Page 18

Cache parameters

Cache size: larger caches hold more data, but require more static energy and take area away from other functions.
Number of sets: more sets allow more independent references; fewer sets map more memory locations onto each line.
Cache line length: longer lines give more prefetching bandwidth, but often result in higher energy consumption.
What impact does each of these have on energy?
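A small C sanity check relating these parameters, assuming the usual geometry capacity = sets × ways × line size:

#include <stdio.h>

/* Basic cache geometry: capacity = sets * ways * line_size. */
unsigned cache_sets(unsigned capacity_bytes, unsigned ways,
                    unsigned line_bytes) {
    return capacity_bytes / (ways * line_bytes);
}

int main(void) {
    /* e.g., a 16 KB, 4-way cache with 32-byte lines has 128 sets */
    printf("%u sets\n", cache_sets(16 * 1024, 4, 32));
    return 0;
}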

Page 19

Multilevel cache optimization

Gordon-Ross et al. adjust cache parameters in order: cache size, then line size, then associativity.
Design the cache size for the first level, then the second level; then the line size for the first, then the second level; then the associativity for the first, then the second level.
Why vary the parameters in this order?
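A hypothetical sketch of that ordered search (the candidate values and the eval_cost stub are stand-ins for a real cache simulator, not Gordon-Ross et al.'s actual tool):

#include <stdio.h>
#include <stddef.h>

/* Tune one parameter at a time, L1 before L2, in the order
   size -> line size -> associativity. cfg[0]=size, cfg[1]=line,
   cfg[2]=ways. */

/* Stand-in cost function: a real flow would run a simulator and
   return, e.g., energy-delay product. This stub simply prefers
   smaller configurations so the sketch runs on its own. */
static double eval_cost(const unsigned l1[3], const unsigned l2[3]) {
    return (double)l1[0] + l2[0] + 8.0*(l1[1]+l2[1]) + 64.0*(l1[2]+l2[2]);
}

static void tune(unsigned *field, const unsigned *cands, size_t n,
                 const unsigned l1[3], const unsigned l2[3]) {
    unsigned best = *field;
    double best_cost = eval_cost(l1, l2);
    for (size_t i = 0; i < n; i++) {
        *field = cands[i];
        double c = eval_cost(l1, l2);
        if (c < best_cost) { best_cost = c; best = cands[i]; }
    }
    *field = best;
}

int main(void) {
    unsigned l1[3] = {8192, 32, 2}, l2[3] = {65536, 64, 4};
    static const unsigned sizes[] = {2048, 4096, 8192, 16384, 32768, 65536};
    static const unsigned lines[] = {16, 32, 64};
    static const unsigned ways[]  = {1, 2, 4, 8};
    tune(&l1[0], sizes, 6, l1, l2);  /* cache size: L1, then L2 */
    tune(&l2[0], sizes, 6, l1, l2);
    tune(&l1[1], lines, 3, l1, l2);  /* line size: L1, then L2 */
    tune(&l2[1], lines, 3, l1, l2);
    tune(&l1[2], ways, 4, l1, l2);   /* associativity: L1, then L2 */
    tune(&l2[2], ways, 4, l1, l2);
    printf("L1: %u B, %u B lines, %u-way\n", l1[0], l1[1], l1[2]);
    printf("L2: %u B, %u B lines, %u-way\n", l2[0], l2[1], l2[2]);
    return 0;
}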

Page 20

Scratch pad memory

The scratch pad is managed by software, not hardware.
Provides predictable access time.
Requires values to be explicitly allocated to it.
Accessed with standard read/write instructions.
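One common way to allocate data to a scratch pad is a dedicated linker section. A minimal sketch using GCC-style attributes (the section name .scratchpad and its address mapping are assumptions about the linker script, not a standard):

/* Place a buffer in a hypothetical .scratchpad section; the linker
   script must map that section onto the scratch pad's address range.
   Accesses compile to ordinary load/store instructions. */
static int coeffs[64] __attribute__((section(".scratchpad")));

int dot4(const int x[4]) {
    return coeffs[0]*x[0] + coeffs[1]*x[1]
         + coeffs[2]*x[2] + coeffs[3]*x[3];
}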

Page 21

Code compression

An extreme version of instruction encoding: use variable-bit-length instructions, with encodings generated by compression algorithms.
Generally takes longer to decode.
Can result in performance, energy, and code size improvements. How?
IBM CodePack (PowerPC) used Huffman encoding.

Page 22

Terms

Compression ratio: (compressed code size / uncompressed code size) × 100%.
Must take into account all overheads.
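A quick worked example with made-up sizes, including an overhead term:

#include <stdio.h>

/* Compression ratio including overheads; the sizes are invented. */
double compression_ratio(double compressed_kb, double overhead_kb,
                         double original_kb) {
    return (compressed_kb + overhead_kb) / original_kb * 100.0;
}

int main(void) {
    /* 512 KB program compressed to 384 KB plus a 16 KB branch table:
       (384 + 16) / 512 = 78.1% */
    printf("%.1f%%\n", compression_ratio(384, 16, 512));
    return 0;
}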

Page 23

Wolfe/Chanin approach

Object code is fed to a lossless compression algorithm; Wolfe and Chanin used Huffman's algorithm.
The compressed object code becomes the program image.
Code is decompressed on the fly during execution.

[Flow: source code → compiler → object code → compressor → compressed object code]

Page 24

Wolfe/Chanin execution

Instructions are decompressed when read from main memory; data is not compressed or decompressed.
The cache holds uncompressed instructions.
Longer latency to fetch instructions from memory.
The CPU does not require significant modifications.

[Figure: main memory → decompressor → cache → CPU]

Page 25

Huffman coding

The input stream is a sequence of symbols, each with a known probability of occurrence.
Construct a binary tree of probabilities from the bottom up.
The path from the root to a symbol gives that symbol's code.
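A toy Huffman construction in C for a made-up four-symbol alphabet (a sketch of the bottom-up merge, not CodePack's actual tables):

#include <stdio.h>

/* Repeatedly merge the two lowest-probability nodes until one root
   remains, then read codes off the root-to-leaf paths
   (left = 0, right = 1). Symbols and probabilities are made up. */
struct node { double p; int sym, left, right; }; /* sym >= 0 marks a leaf */

static struct node pool[16];
static int npool = 0, live[16], nlive = 0;

static int take_min(void) {          /* remove and return min-p node id */
    int best = 0;
    for (int i = 1; i < nlive; i++)
        if (pool[live[i]].p < pool[live[best]].p) best = i;
    int id = live[best];
    live[best] = live[--nlive];
    return id;
}

static void print_codes(int id, char *buf, int depth) {
    if (pool[id].sym >= 0) {
        buf[depth] = '\0';
        printf("symbol %c: %s\n", pool[id].sym, buf);
        return;
    }
    buf[depth] = '0'; print_codes(pool[id].left,  buf, depth + 1);
    buf[depth] = '1'; print_codes(pool[id].right, buf, depth + 1);
}

int main(void) {
    const int syms[] = {'a', 'b', 'c', 'd'};
    const double probs[] = {0.5, 0.25, 0.15, 0.10};
    for (int i = 0; i < 4; i++) {
        pool[npool] = (struct node){probs[i], syms[i], -1, -1};
        live[nlive++] = npool++;
    }
    while (nlive > 1) {              /* merge the two smallest nodes */
        int a = take_min(), b = take_min();
        pool[npool] = (struct node){pool[a].p + pool[b].p, -1, a, b};
        live[nlive++] = npool++;
    }
    char buf[16];
    print_codes(live[0], buf, 0);    /* shorter codes for likelier symbols */
    return 0;
}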

Page 26

Wolfe/Chanin results

[Wol92] © 1992 IEEE

Page 27

Compressed vs. uncompressed code

Code must be decompressed from many different starting points, because branches jump to many places in the program.
Most compression algorithms are designed to decode from the start of a stream.
Compressed code is therefore organized into blocks: decompression starts at the beginning of a block, and unused bits between blocks constitute overhead.

[Figure: uncompressed vs. compressed layout of a block (add r1, r2, r3; mov r1, a; bne r1, foo), with unused bits between compressed blocks.]

Page 28

Block structure and compression

Trade-off: compression algorithms work best on long blocks, but program branching works best with short blocks.
Labels in the program move during compression. Two approaches:
  Wolfe and Chanin used a branch table to translate branch targets during execution (adds code size); a sketch follows below.
  Lefurgy et al. patched the compressed code so that branches refer directly to compressed locations.
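A minimal sketch of the branch-table idea (the data layout and names are hypothetical illustrations, not Wolfe/Chanin's actual format): the table maps each uncompressed block's start address to the offset of its compressed image.

#include <stdint.h>

/* Hypothetical branch table: entry i holds the byte offset of
   compressed block i. A branch to uncompressed address A is
   redirected to table[A / BLOCK_BYTES] before decompression starts. */
#define BLOCK_BYTES 32  /* uncompressed block size (example value) */

struct branch_table { const uint32_t *offset; };

uint32_t translate(const struct branch_table *bt,
                   uint32_t uncompressed_addr) {
    return bt->offset[uncompressed_addr / BLOCK_BYTES];
}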

Page 29

Compression ratio vs. block size

[Lek99b] © 1999 IEEE

Page 30

Pre-cache compression

Decompress instructions as they come out of the cache.
One instruction may be decompressed many times, but the program has a smaller cache footprint.
Why might a different type of decompression engine be needed?

Page 31

Compression algorithms

There are a large number of algorithms for compressing data, but they were designed under different constraints: large text files, no real-time or power constraints.
The approach here: evaluate existing algorithms under the requirements of code compression, and develop new algorithms where needed. Several of these new algorithms are discussed in the textbook.

Page 32

Code and data compression

Unlike (non-modifiable) code, data must be compressed and decompressed dynamically: data is compressed before it is written back to main memory.
Can substantially reduce main memory or cache footprints.
Requires different trade-offs.

Page 33

Lempel-Ziv algorithm

A dictionary-based method.
The decoder builds its dictionary during the decompression process.
The LZW variant uses a fixed-size buffer.

[Flow: source text → coder (dictionary) → compressed text → decoder (dictionary) → uncompressed source]
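A compact LZW encoder sketch showing the dictionary mechanics (a toy byte-string version with a linear-search dictionary, not a production codec; the decoder rebuilds the same dictionary from the emitted codes):

#include <stdio.h>
#include <string.h>

/* Toy LZW encoder: codes 0-255 are single bytes; new entries are
   (prefix code, extension byte) pairs added as the input is scanned.
   Linear search keeps the sketch short, not fast. */
#define MAX_ENTRIES 4096

static int dict_prefix[MAX_ENTRIES];
static unsigned char dict_ext[MAX_ENTRIES];
static int dict_size = 256;

static int lookup(int prefix, unsigned char ext) {
    for (int i = 256; i < dict_size; i++)
        if (dict_prefix[i] == prefix && dict_ext[i] == ext) return i;
    return -1;
}

static void encode(const unsigned char *in, int n) {
    int cur = in[0];                       /* current match = first byte */
    for (int i = 1; i < n; i++) {
        int next = lookup(cur, in[i]);
        if (next >= 0) {                   /* extend the current match */
            cur = next;
        } else {
            printf("%d ", cur);            /* emit code for the match */
            if (dict_size < MAX_ENTRIES) { /* add match+byte to dict */
                dict_prefix[dict_size] = cur;
                dict_ext[dict_size++] = in[i];
            }
            cur = in[i];                   /* restart from this byte */
        }
    }
    printf("%d\n", cur);                   /* flush the last match */
}

int main(void) {
    const char *s = "abababab";
    encode((const unsigned char *)s, (int)strlen(s));
    return 0;
}

On "abababab" this emits 97 98 256 258 98: codes 256 and 258 name strings the dictionary has already seen, which is where the compression comes from.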

Page 34

Lempel-Ziv example

Page 35

MXT

Tremaine et al.'s MXT has a three-level cache system; level 3 is shared among several processors and connected to main memory.
Data and code are compressed/uncompressed as they move between main memory and the level 3 cache.
Uses a variant of the Lempel-Ziv 1977 algorithm; all compression engines share the same dictionary.
Typically, 1 KB blocks are divided into four 256-byte blocks for compression; these can be decompressed in parallel.