Memory eXpansion Technology. Krishan Swarup Gupta and Rabie A. Ramadan. Supervised by: Prof. El-Rewini.




  • Memory eXpansion Technology

    Krishan Swarup Gupta, Rabie A. Ramadan

    Supervised by: Prof. El-Rewini

  • Memory eXpansion Technology: Agenda
    Introduction (Krishan)
    Motivation (Krishan)
    A Breakthrough (Krishan)
    Requirements (Krishan)
    Terminology (Rabie)
    Architecture (Rabie)
    Shared cache subsystem requirements (Krishan)
    C-RAM Architecture (Krishan)
    Compression technique (Rabie)
    Main Memory subsystem (Rabie)
    Operating System Software (Rabie)
    Performance (Rabie)

  • Introduction: "Adding memory is often the most effective way to improve system performance, but it's a costly proposition,"

    Mark Dean, IBM Fellow and Vice President of Systems Research.

  • MXT is a hardware technology for compressing main memory contents.

    MXT doubles the effective size of the main memory.

    512 MB installed memory appears as 1 GB.

    This is done entirely in hardware, transparent to the CPUs, I/O devices, peripherals, and all software (applications, device drivers, and the kernel), with the exception of fewer than a hundred lines of code added to the base kernel.

    Introduction

  • Memory seems to be cheap, but it is not, especially when a system uses 512 MB or more.

    Why bother with MXT to double the size of memory?

    Simple: MXT saves money, and lots of it.

    Try the on-line price configurators of Compaq, IBM, Dell, etc., to see what doubling the system memory costs.

    Motivation

  • Large technology installations can save millions of dollars.

    The savings can be significant for both small and large customers, as memory comprises 40-70 percent of the total cost of most NT-based server configurations.

    MXT is a hardware implementation that automatically stores frequently accessed data and instructions close to a computer's microprocessors so they can be accessed immediately.

    MXT incorporates a new level of cache that is designed to efficiently handle data and instructions on a memory controller chip.

    It is real: the IBM eServer x330 with MXT was released on 11 February 2002.

    A Breakthrough

  • Very fast compression/decompression hardware is required, permitting operation at main-memory bandwidth.

    Since, with compression, the total logical main-memory size may vary dynamically, changes must be made to the operating system's memory management.

    A way must be found to efficiently store and access the variable-length objects produced by compression.

    Requirements

  • ASCI is the US Department of Energy's Accelerated Strategic Computing Initiative, a collaboration between three US national defense laboratories.

    The aim is to give researchers a five-order-of-magnitude increase in computing performance over current technology.

    A MIMD, distributed-memory, message-passing supercomputer. The architecture is scalable in:
    communication bandwidth
    main memory
    internal disk storage capacity
    I/O

    Terminology: the ASCI Machine

  • Terminology: RAM, conventional DRAM, synchronous DRAM (SDRAM), DDR SDRAM, SIMM, DIMMs, interleaving

  • A collection of processors is connected to a common SDRAM-based main memory through a memory controller chip.

    MXT incorporates a two-level architecture consisting of a large shared cache coupled with a typical main memory.

    Three ways to manage the memory M:

    Organizing M to be a linear space, where variable-length intervals are allocated and deallocated.

    Organizing M as a collection of blocks of possibly multiple sizes, where space for a variable-length object is allocated as an integral number of such blocks.

    Organizing M as a collection of blocks, but permitting a variable amount of space to be allocated within a block.

    MXT Architecture

  • Cyclic Redundancy Code (CRC): a number derived from a data block. A CRC is more complicated than a checksum; it is calculated by division, using shifts and exclusive-ORs.
    Generator polynomial: CRCs treat blocks of input bits as coefficient sets for polynomials. For example, 10100000 represents 1*x^7 + 0*x^6 + 1*x^5 + 0*x^4 + 0*x^3 + 0*x^2 + 0*x^1 + 0*x^0.
    The remainder of the division is the checksum.
    For more information, see http://www.4d.com/docs/CMU/CMU79909.HTM
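
    The shift-and-XOR division above can be sketched in a few lines. This is a generic 8-bit CRC with the common generator polynomial x^8 + x^2 + x + 1 (0x07), chosen for illustration; the slides do not specify which CRC parameters MXT's hardware uses.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bit-serial CRC: divide the message by the generator polynomial
    using shifts and exclusive-ORs; the remainder is the checksum."""
    crc = 0
    for byte in data:
        crc ^= byte                           # bring in the next 8 message bits
        for _ in range(8):
            if crc & 0x80:                    # high bit set: "subtract" (XOR) the polynomial
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc

# Appending the checksum makes the whole message divide evenly (remainder 0),
# which is how the receiving side verifies the block.
msg = b"123456789"
assert crc8(msg + bytes([crc8(msg)])) == 0
```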

  • [Figure: three processors, each with its own L1 and L2 caches, share an L3 cache; compression/decompression hardware sits between the shared L3 cache and the compressed main memory.]

  • More Details

  • The shared L3 cache provides low-latency processor and I/O-subsystem access to frequently accessed uncompressed data.

    The cache is partitioned into a number of cache lines, each line being an associative storage unit equal in size to the 1KB uncompressed data block.

    A cache directory keeps track of the real-memory tag addresses corresponding to the cached addresses that can be stored within each line.

    Shared Cache Subsystem

  • Three primary architectures.
    The independent cache-array scheme:
    A large independent data-cache memory is implemented using low-cost double-data-rate (DDR) SDRAM technology.
    The cache array sits outside the memory controller chip, while the associated cache directory is implemented on the chip.
    The cache size is limited primarily by the size of the cache directory.
    The cache interface can be optimized for lowest-latency access by the processor.

    Shared Cache Subsystem

  • Memory eXpansion Technology: Shared cache subsystem
    The compressed main-memory partition scheme:
    The cache controller and the memory controller share the same storage array via the same physical interface.
    Data is shuttled back and forth between the compressed main-memory region and the uncompressed cache through the compression hardware during cache-line replacement.
    The cache size can be readily optimized to the specific system application.
    Drawback: contention for the main-memory physical interface by the latency-sensitive cache controller.

  • Memory eXpansion Technology: Shared cache subsystem
    The distributed cache scheme:
    The cache is distributed throughout the compressed memory as a number of uncompressed lines; only the most recently used n lines are selected to make up the cache.
    Data is shuttled in and out of the compressed memory, changing compression state as it passes through the compression logic during cache-line replacement.
    The effective cache size may be dynamically optimized during system operation by simply changing the maximum number of uncompressed lines.
    Drawbacks: contention for the main-memory physical interface, and greater average latency for cache-directory references.

  • Memory eXpansion Technology: Logically, the memory M consists of a collection of randomly accessible fixed-size lines, where L is the line size.

    Internally, the ith line is stored in a compressed format as L(i) bytes, where L(i) <= L.

  • Memory eXpansion Technology: M comprises a standard random-access memory with a minimum access size (granule) of g bytes. We will generally assume that g is 32.

    Memory accesses invoke a translation between a logical line address and an internal address. This correspondence is stored in a directory D contained in M.

    Translation, fetching, and memory management within the C-RAM are carried out by the memory controller rather than by operating system (OS) software.

    C-RAM Architecture

  • [Figure: L3 and C-RAM organization. On read/write misses and stores, lines move between the L3 cache (cache lines plus L3 directory) and memory M through the decompressor/compressor; each directory entry holds block addresses (A1-A4) locating the blocks of a compressed line.]

  • Memory eXpansion Technology: Each directory entry contains:
    Flags.
    Fragment-combining information.
    Pointers for up to four blocks.

    On an L3 cache miss, the memory controller and decompression hardware find the blocks allocated to store the compressed line and dynamically decompress the line to handle the miss.

    C-RAM Architecture

  • Memory eXpansion Technology: When a new or modified line is stored, the blocks currently allocated to the line are freed, and the line is then compressed and stored in the C-RAM by allocating the required number of blocks.

    C-RAM Architecture
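
    The fetch and store paths above can be sketched as a toy model. Everything here is an illustrative assumption, not the actual controller logic: the 256-byte block size, the four-pointer directory limit, and all class and method names are invented for the sketch.

```python
import math

BLOCK = 256      # assumed block size in bytes
MAX_PTRS = 4     # a directory entry holds pointers for up to four blocks

class CRam:
    """Toy C-RAM: a directory maps line numbers to lists of block pointers."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))   # free-block list
        self.directory = {}                   # line number -> block pointers
        self.blocks = {}                      # block number -> stored bytes

    def store(self, line: int, compressed: bytes) -> None:
        # Free the blocks currently allocated to the line...
        for b in self.directory.pop(line, []):
            self.free.append(b)
            self.blocks.pop(b, None)
        # ...then allocate the number of blocks the new compressed size needs.
        need = math.ceil(len(compressed) / BLOCK)
        assert need <= MAX_PTRS, "line too large for one directory entry"
        ptrs = [self.free.pop() for _ in range(need)]
        for k, b in enumerate(ptrs):
            self.blocks[b] = compressed[k * BLOCK:(k + 1) * BLOCK]
        self.directory[line] = ptrs

    def fetch(self, line: int) -> bytes:
        # On an L3 miss: follow the directory pointers and reassemble the line.
        return b"".join(self.blocks[b] for b in self.directory[line])
```

    Storing a line whose compressed size is 700 bytes allocates three 256-byte blocks; re-storing it at 100 bytes frees those and allocates one.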

  • Memory eXpansion Technology: Example. Page size is 4KB; the L3 cache immediately above the C-RAM has a line size of 1KB; the block size is 256 bytes. Suppose each line compresses to 1, 2, 3, ..., or 1024 bytes with equal likelihood. The expected compressed line size is then 512.5 bytes, i.e., roughly 50% compression. But the problem is FRAGMENTATION: the left-over space in a line's last block is wasted.

    C-RAM Architecture
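
    A quick check of the arithmetic in this example, assuming the uniform 1-1024-byte size distribution and the 256-byte block size above, shows how fragmentation eats into the 50% figure:

```python
import math

BLOCK = 256
sizes = range(1, 1025)  # each compressed size 1..1024 bytes, equally likely

# Expected compressed size: the mean of 1..1024.
expected_size = sum(sizes) / len(sizes)    # 512.5 bytes, ~50% of 1KB

# But storage is allocated in whole 256-byte blocks, so a 257-byte line
# occupies 512 bytes; the left-over space in the last block is wasted.
expected_alloc = sum(math.ceil(s / BLOCK) * BLOCK for s in sizes) / len(sizes)

print(expected_size, expected_alloc)   # 512.5 640.0
```

    Without fragment combining, the expected space actually occupied is 640 bytes per 1KB line (62.5%), not 512.5 bytes, so the achievable expansion drops from about 2x to 1.6x.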

  • Memory eXpansion Technology: Approaches to the fragmentation problem:

    Make the block size smaller. But then the size of the directory entry increases dramatically.

    Combine two or more fragments (the left-over pieces in the last blocks used to store compressed lines) into single blocks.

    The set of lines for which fragment combining is allowed is called a cohort.

    C-RAM Architecture

  • Memory eXpansion Technology: Cohort size. To keep a small upper bound on the time required for directory scans, cohorts should ideally be small.

    There are two ways in which cohorts can be determined.

    Partitioned cohorts: lines are divided into a number of disjoint sets, where each such set is a cohort. For example, with a cohort size of 2, the first two 1KB lines in each 4KB page could form one cohort and the last two lines another.

    C-RAM Architecture

  • Memory eXpansion Technology: Sliding cohorts: cohorts are not disjoint, but overlap. For example, with a cohort size of 4, the cohort corresponding to any given line could consist of that line and the previous three lines, and similarly for other cohort sizes. This yields less fragmentation than partitioned cohorts.

    C-RAM Architecture
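
    The two cohort schemes can be contrasted with a small sketch. Lines are indexed by number, and the function names are hypothetical helpers, not anything from the MXT design:

```python
def partitioned_cohort(line: int, size: int) -> list:
    """Disjoint cohorts: lines are grouped into fixed, non-overlapping sets."""
    start = (line // size) * size
    return list(range(start, start + size))

def sliding_cohort(line: int, size: int) -> list:
    """Overlapping cohorts: a line's cohort is itself plus the previous size-1 lines."""
    return list(range(max(0, line - size + 1), line + 1))

# With cohort size 2, lines 4 and 5 fall in the same partitioned cohort;
# with sliding cohorts of size 4, line 5's cohort reaches back to line 2.
print(partitioned_cohort(5, 2))  # [4, 5]
print(sliding_cohort(5, 4))      # [2, 3, 4, 5]
```

    Because sliding cohorts overlap, each line sees more candidate partners for fragment combining, which is why they fragment less than partitioned cohorts.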

  • Memory eXpansion Technology: The method by which fragments are combined:

    The number of fragments that can be combined into a block: 2-way combining (2 fragments per block) or 3-way combining (3 fragments per block).

    Which fragment (or fragments) to choose: first fit, best fit, fragment contention, optimal fit.

    C-RAM Architecture
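
    First fit and best fit, as applied to choosing which partially filled block receives a new fragment, can be sketched as follows. The list-of-free-bytes representation and the function names are assumptions for illustration:

```python
def first_fit(free_space: list, need: int):
    """Index of the first block whose free space fits the fragment."""
    for i, free in enumerate(free_space):
        if free >= need:
            return i
    return None  # no block can host the fragment

def best_fit(free_space: list, need: int):
    """Index of the fitting block that leaves the least space over."""
    candidates = [(free, i) for i, free in enumerate(free_space) if free >= need]
    return min(candidates)[1] if candidates else None

# Free bytes left in the last blocks of three stored lines:
free_space = [100, 40, 60]
print(first_fit(free_space, 50))  # 0: the first block with >= 50 bytes free
print(best_fit(free_space, 50))   # 2: the tightest fit (60 bytes free)
```

    First fit stops at the earliest candidate, keeping directory scans short; best fit scans the whole cohort but wastes less space per combined block.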

  • Memory eXpansion Technology: Design of the directory structure:

    Static directory: the directory is configured to have the number of entries required to support a maximum compression factor of F. That is, if the C-RAM has a capacity of N uncompressed lines, the directory contains entries for FN lines.

    A possible problem with this type of design is that the maximum compression is limited to a predetermined value.

    C-RAM Architecture

  • Memory eXpansion Technology

    Dynamic directory: using a dynamic directory structure, directory entries are created (deleted) whenever real addresses are allocated (deallocated). In this case, free main-memory blocks can be allocated (deallocated) and used for the directory entries of one or more pages whenever those pages are created (deleted).

    C-RAM Architecture

  • MXT Main Memory Subsystem

  • The LZ77 compression technique: the LZ77 output is a series of byte values interspersed with (index, length) pairs. Each byte value is written as is to the output. The (index, length) pairs are written to the output as a pair of integers (index first, then length), each of which has 256 added to its value. This allows the index and length values to be distinguished from the byte values.

    LZ77 in operation
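
    A decoder for the output format just described is short enough to sketch. One assumption: the index is taken here as an absolute position in the already-decoded output (the description above does not say whether it is absolute or a backward distance):

```python
def lz77_decode(symbols: list) -> bytes:
    """Decode a stream of LZ77 symbols: values < 256 are literal bytes;
    values >= 256 come in (index, length) pairs, each offset by 256."""
    out = bytearray()
    i = 0
    while i < len(symbols):
        if symbols[i] < 256:
            out.append(symbols[i])           # literal byte, written as is
            i += 1
        else:
            index = symbols[i] - 256         # where the earlier copy starts
            length = symbols[i + 1] - 256    # how many bytes to copy
            for k in range(length):          # byte-by-byte, so the copy may
                out.append(out[index + k])   # overlap the bytes it produces
            i += 2
    return bytes(out)

# 'a', 'b', then (index=0, length=2): copy "ab" from the start of the output.
print(lz77_decode([97, 98, 256 + 0, 256 + 2]))  # b'abab'
# Overlapping copy: 'a', then (index=0, length=3) repeats the byte: b'aaaa'.
print(lz77_decode([97, 256 + 0, 256 + 3]))
```

    The byte-by-byte copy loop is what makes overlapping references work, which is how LZ77 encodes runs of a repeated byte compactly.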

  • IBM implementation of the compression technique:
    Divide the data into n partitions, with a compression engine for each part and a shared dictionary.
    Typically 4 compression engines, each handling 256 B (a quarter of the 1KB uncompressed data).
    Throughput: 1 B/cycle per engine (4 B/cycle total), or 2 B/cycle per engine (8 B/cycle total) when double-clocked.

  • Uncompressed Memory: an unsectored region is used by the SST and for additional and future needs.

  • Main Memory Subsystem:
    Comprises SDRAM on dual in-line memory modules (DIMMs); the controller supports two separate DIMMs.
    Can be configured to operate with compression disabled, enabled for specific address ranges, or completely enabled.
    Sector Translation Table (SST); sectored memory.

  • Compressed Memory

  • Cont. Each SST entry describes a 1KB data line: data that compresses to 120 bits or less is stored directly in the entry itself; otherwise the entry holds pointers to the sectors containing the compressed data. Uncompressed regions are directly accessed, without an SST reference.

  • Reliability-Availability-Serviceability (RAS):
    Sector translation table entry parity checking.
    Sector free-list parity checking.
    Sector out-of-range checking.
    Sectored memory-overrun detection.
    Sectors-used threshold detection (2).
    Compressor/decompressor validity checking.
    Compressed-memory CRC protection.

  • Commodity duplex memory: a fault-tolerance technique not found in earlier systems.

  • Operating System Software:
    The OS cannot distinguish between an MXT and a conventional memory hardware environment.
    When memory is over-utilized, the system fails.
    Unsectored memory needs paging management.
    In UNIX, this requires changing the OS kernel; in Windows, the code is not public, so external driver software is needed.

  • Performance

  • Cont.

  • References
    MXT:
    1. "High-throughput coherence control and hardware messaging in Everest," A. K. Nanda, A.-T. Nguyen, M. M. Michael, and D. J. Joseph, p. 229.
    2. "Algorithms and data structures for compressed-memory machines," P. A. Franaszek, P. Heidelberger, D. E. Poff, and J. T. Robinson, p. 245.
    3. "On internal organization in compressed random-access memories," P. A. Franaszek and J. T. Robinson, p. 259.
    4. "IBM Memory Expansion Technology (MXT)," R. B. Tremaine, P. A. Franaszek, J. T. Robinson, C. O. Schulz, T. B. Smith, M. E. Wazlowski, and P. M. Bland, p. 271.
    5. "Memory Expansion Technology (MXT): Software support and performance," B. Abali, H. Franke, D. E. Poff, R. A. Saccone, Jr., C. O. Schulz, L. M. Herger, and T. B. Smith, p. 287.
    6. "Memory Expansion Technology (MXT): Competitive impact," T. B. Smith, B. Abali, D. E. Poff, and R. B. Tremaine, p. 303.

    Memory Compression http://domino.research.ibm.com/comm/wwwr_thinkresearch.nsf/pages/memory200.html

    Memory Guide http://www.pcguide.com/ref/ram/tech.htm