Memory eXpansion Technology. Krishan Swarup Gupta and Rabie A. Ramadan. Supervised by: Prof. El-Rewini.




  • Memory eXpansion Technology

    Krishan Swarup Gupta, Rabie A. Ramadan

    Supervised by: Prof. El-Rewini

  • Memory eXpansion Technology: Agenda
    Introduction (Krishan)
    Motivation (Krishan)
    A Breakthrough (Krishan)
    Requirements (Krishan)
    Terminology (Rabie)
    Architecture (Rabie)
    Shared cache subsystem requirements (Krishan)
    C-RAM Architecture (Krishan)
    Compression technique (Rabie)
    Main Memory subsystem (Rabie)
    Operating System Software (Rabie)
    Performance (Rabie)

  • Introduction: "Adding memory is often the most effective way to improve system performance, but it's a costly proposition,"

    Mark Dean, IBM Fellow and Vice President of Systems Research.

  • MXT is a hardware technology for compressing main memory contents.

    MXT doubles the effective size of the main memory.

    512 MB installed memory appears as 1 GB.

    This is done entirely in hardware, transparent to the CPUs, I/O devices, peripherals, and all software (applications, device drivers, and the kernel), with the exception of fewer than a hundred lines of code added to the base kernel.

    Introduction

  • Memory seems to be cheap, but it is not, especially when a system uses 512 MB or more.

    Why bother with MXT to double the size of memory?

    Simple: MXT saves money, and lots of it.

    Try the on-line price configurators of Compaq, IBM, Dell, etc., to see what doubling the system memory costs.

    Motivation

  • Large technology installations can save millions of dollars.

    The savings can be significant for both small and large customers, as memory comprises 40-70 percent of the total cost of most NT-based server configurations.

    MXT is a hardware implementation that automatically stores frequently accessed data and instructions close to a computer's microprocessors so they can be accessed immediately.

    MXT incorporates a new level of cache that is designed to efficiently handle data and instructions on a memory controller chip.

    It is real: the IBM eServer x330 with MXT was released on 11 February 2002.

    A Breakthrough

  • Very fast compression/decompression hardware is required, permitting operation at main-memory bandwidth.

    Since, with compression, the total logical main-memory size may vary dynamically, changes must be made to the operating system's memory management.

    A way must be found to efficiently store and access the variable-length objects produced by compression.

    Requirements

  • ASCI is the US Department of Energy's Accelerated Strategic Computing Initiative, a collaboration between three US national defense laboratories.

    The aim is to give researchers a five-order-of-magnitude increase in computing performance over current technology.

    A MIMD, distributed-memory, message-passing supercomputer. The architecture is scalable in:
    communication bandwidth
    main memory
    internal disk storage capacity
    I/O

    Terminology: the ASCI Machine

  • Terminology: RAM, conventional DRAM, synchronous DRAM (SDRAM), DDR SDRAM, SIMM, DIMMs, interleaving

  • A collection of processors is connected to a common SDRAM-based main memory through a memory controller chip.

    MXT incorporates a two-level architecture consisting of a large shared cache coupled with a typical main memory.

    Three ways to manage the memory M:

    Organizing M to be a linear space, where variable-length intervals are allocated and deallocated.

    Organizing M as a collection of blocks of possibly multiple sizes, where space for a variable-length object is allocated as an integral number of such blocks.

    Organizing M as a collection of blocks, but permitting a variable amount of space to be allocated within a block.

    MXT Architecture

  • Cyclic Redundancy Code (CRC): a number derived from a data block. A CRC is more complicated than a checksum; it is calculated by division, using shifts and exclusive-ORs.
    Generator polynomial: CRCs treat blocks of input bits as coefficient sets for polynomials. For example, 10100000 represents 1*x^7 + 0*x^6 + 1*x^5 + 0*x^4 + 0*x^3 + 0*x^2 + 0*x^1 + 0*x^0.
    The remainder of the division is the checksum.
    For more information, see http://www.4d.com/docs/CMU/CMU79909.HTM
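
    The shift-and-XOR division above can be sketched in a few lines. This is a generic 8-bit CRC with the common generator polynomial x^8 + x^2 + x + 1 (0x07), chosen for illustration; the slides do not specify which CRC parameters MXT's hardware uses.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bit-serial CRC: divide the message by the generator polynomial
    using shifts and exclusive-ORs; the remainder is the checksum."""
    crc = 0
    for byte in data:
        crc ^= byte                           # bring in the next 8 message bits
        for _ in range(8):
            if crc & 0x80:                    # high bit set: "subtract" (XOR) the polynomial
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc

# Appending the checksum makes the whole message divide evenly (remainder 0),
# which is how the receiving side verifies the block.
msg = b"123456789"
assert crc8(msg + bytes([crc8(msg)])) == 0
```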

  • [Figure: three processors, each with its own L1 and L2 caches, share an L3 cache; compression/decompression hardware sits between the shared L3 cache and the compressed main memory.]

  • More Details

  • The shared L3 cache provides low-latency processor and I/O-subsystem access to frequently accessed uncompressed data.

    The cache is partitioned into a number of cache lines, each line being an associative storage unit equal in size to the 1KB uncompressed data block.

    A cache directory keeps track of the real-memory tag addresses corresponding to the cached addresses that can be stored within each line.

    Shared Cache Subsystem

  • Three primary architectures.
    The independent cache-array scheme:
    A large independent data-cache memory is implemented using low-cost double-data-rate (DDR) SDRAM technology.
    The cache array sits outside the memory controller chip, while the associated cache directory is implemented on the chip.
    The cache size is limited primarily by the size of the cache directory.
    The cache interface can be optimized for lowest-latency access by the processor.

    Shared Cache Subsystem

  • Memory eXpansion Technology: Shared cache subsystem
    The compressed main-memory partition scheme:
    The cache controller and the memory controller share the same storage array via the same physical interface.
    Data is shuttled back and forth between the compressed main-memory region and the uncompressed cache through the compression hardware during cache-line replacement.
    The cache size can be readily optimized to the specific system application.
    Drawback: contention for the main-memory physical interface by the latency-sensitive cache controller.

  • Memory eXpansion Technology: Shared cache subsystem
    The distributed cache scheme:
    The cache is distributed throughout the compressed memory as a number of uncompressed lines; only the most recently used n lines are selected to make up the cache.
    Data is shuttled in and out of the compressed memory, changing compression state as it passes through the compression logic during cache-line replacement.
    The effective cache size may be dynamically optimized during system operation by simply changing the maximum number of uncompressed lines.
    Drawbacks: contention for the main-memory physical interface, and greater average latency for cache-directory references.

  • Memory eXpansion Technology: Logically, the memory M consists of a collection of randomly accessible fixed-size lines, where L is the line size.

    Internally, the ith line is stored in a compressed format as L(i) bytes, where L(i) <= L.

  • Memory eXpansion Technology: M comprises a standard random-access memory with a minimum access size (granule) of g bytes. We will generally assume that g is 32.

    Memory accesses invoke a translation between a logical line address and an internal address. This correspondence is stored in a directory D contained in M.

    Translation, fetching, and memory management within the C-RAM are carried out by the memory controller rather than by operating system (OS) software.

    C-RAM Architecture

  • [Figure: L3 and C-RAM organization. On read/write misses and stores, lines move between the L3 cache (cache lines plus L3 directory) and memory M through the decompressor/compressor; each directory entry holds block addresses (A1-A4) locating the blocks of a compressed line.]

  • Memory eXpansion Technology: Each directory entry contains:
    Flags.
    Fragment-combining information.
    Pointers for up to four blocks.

    On an L3 cache miss, the memory controller and decompression hardware find the blocks allocated to store the compressed line and dynamically decompress the line to handle the miss.

    C-RAM Architecture

  • Memory eXpansion Technology: When a new or modified line is stored, the blocks currently allocated to the line are freed, and the line is then compressed and stored in the C-RAM by allocating the required number of blocks.

    C-RAM Architecture
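
    The fetch and store paths above can be sketched as a toy model. Everything here is an illustrative assumption, not the actual controller logic: the 256-byte block size, the four-pointer directory limit, and all class and method names are invented for the sketch.

```python
import math

BLOCK = 256      # assumed block size in bytes
MAX_PTRS = 4     # a directory entry holds pointers for up to four blocks

class CRam:
    """Toy C-RAM: a directory maps line numbers to lists of block pointers."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))   # free-block list
        self.directory = {}                   # line number -> block pointers
        self.blocks = {}                      # block number -> stored bytes

    def store(self, line: int, compressed: bytes) -> None:
        # Free the blocks currently allocated to the line...
        for b in self.directory.pop(line, []):
            self.free.append(b)
            self.blocks.pop(b, None)
        # ...then allocate the number of blocks the new compressed size needs.
        need = math.ceil(len(compressed) / BLOCK)
        assert need <= MAX_PTRS, "line too large for one directory entry"
        ptrs = [self.free.pop() for _ in range(need)]
        for k, b in enumerate(ptrs):
            self.blocks[b] = compressed[k * BLOCK:(k + 1) * BLOCK]
        self.directory[line] = ptrs

    def fetch(self, line: int) -> bytes:
        # On an L3 miss: follow the directory pointers and reassemble the line.
        return b"".join(self.blocks[b] for b in self.directory[line])
```

    Storing a line whose compressed size is 700 bytes allocates three 256-byte blocks; re-storing it at 100 bytes frees those and allocates one.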

  • Memory eXpansion Technology: Example. Page size is 4KB; the L3 cache immediately above the C-RAM has a line size of 1KB; the block size is 256 bytes. Suppose each line compresses to 1, 2, 3, ..., or 1024 bytes with equal likelihood. The expected compressed line size is then 512.5 bytes, i.e., roughly 50% compression. But the problem is FRAGMENTATION: the left-over space in a line's last block is wasted.

    C-RAM Architecture
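
    A quick check of the arithmetic in this example, assuming the uniform 1-1024-byte size distribution and the 256-byte block size above, shows how fragmentation eats into the 50% figure:

```python
import math

BLOCK = 256
sizes = range(1, 1025)  # each compressed size 1..1024 bytes, equally likely

# Expected compressed size: the mean of 1..1024.
expected_size = sum(sizes) / len(sizes)    # 512.5 bytes, ~50% of 1KB

# But storage is allocated in whole 256-byte blocks, so a 257-byte line
# occupies 512 bytes; the left-over space in the last block is wasted.
expected_alloc = sum(math.ceil(s / BLOCK) * BLOCK for s in sizes) / len(sizes)

print(expected_size, expected_alloc)   # 512.5 640.0
```

    Without fragment combining, the expected space actually occupied is 640 bytes per 1KB line (62.5%), not 512.5 bytes, so the achievable expansion drops from about 2x to 1.6x.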

  • Memory eXpansion Technology: Approaches to the fragmentation problem:

    Make the block size smaller. But then the size of the directory entry increases dramatically.

    Combine two or more fragments (the left-over pieces in the last blocks used to store compressed lines) into single blocks.

    The set of lines for which fragment combining is allowed is called a cohort.

    C-RAM Architecture

  • Memory eXpansion Technology: Cohort size. To keep a small upper bound on the time required for directory scans, cohorts should ideally be small.

    There are two ways in which cohorts can be determined.

    Partitioned cohorts: lines are divided into a number of disjoint sets, where each such set is a cohort. For example, with a cohort size of 2, the first two 1KB lines in each 4KB page could form one cohort and the last two lines another.

    C-RAM Architecture

  • Memory eXpansion Technology: Sliding cohorts: cohorts are not disjoint, but overlap. For example, with a cohort size of 4, the cohort corresponding to any given line could consist of that line and the previous three lines, and similarly for other cohort sizes. This yields less fragmentation than partitioned cohorts.

    C-RAM Architecture
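
    The two cohort schemes can be contrasted with a small sketch. Lines are indexed by number, and the function names are hypothetical helpers, not anything from the MXT design:

```python
def partitioned_cohort(line: int, size: int) -> list:
    """Disjoint cohorts: lines are grouped into fixed, non-overlapping sets."""
    start = (line // size) * size
    return list(range(start, start + size))

def sliding_cohort(line: int, size: int) -> list:
    """Overlapping cohorts: a line's cohort is itself plus the previous size-1 lines."""
    return list(range(max(0, line - size + 1), line + 1))

# With cohort size 2, lines 4 and 5 fall in the same partitioned cohort;
# with sliding cohorts of size 4, line 5's cohort reaches back to line 2.
print(partitioned_cohort(5, 2))  # [4, 5]
print(sliding_cohort(5, 4))      # [2, 3, 4, 5]
```

    Because sliding cohorts overlap, each line sees more candidate partners for fragment combining, which is why they fragment less than partitioned cohorts.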

  • Memory eXpansion Technology: The method by which fragments are combined:

    The number of fragments that can be combined into a block: 2-way combining (2 fragments per block) or 3-way combining (3 fragments per block).

    Which fragment (or fragments) to choose: first fit, best fit, fragment contention, optimal fit.

    C-RAM Architecture
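
    First fit and best fit, as applied to choosing which partially filled block receives a new fragment, can be sketched as follows. The list-of-free-bytes representation and the function names are assumptions for illustration:

```python
def first_fit(free_space: list, need: int):
    """Index of the first block whose free space fits the fragment."""
    for i, free in enumerate(free_space):
        if free >= need:
            return i
    return None  # no block can host the fragment

def best_fit(free_space: list, need: int):
    """Index of the fitting block that leaves the least space over."""
    candidates = [(free, i) for i, free in enumerate(free_space) if free >= need]
    return min(candidates)[1] if candidates else None

# Free bytes left in the last blocks of three stored lines:
free_space = [100, 40, 60]
print(first_fit(free_space, 50))  # 0: the first block with >= 50 bytes free
print(best_fit(free_space, 50))   # 2: the tightest fit (60 bytes free)
```

    First fit stops at the earliest candidate, keeping directory scans short; best fit scans the whole cohort but wastes less space per combined block.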

  • Memory eXpansion Technology: Design of the directory structure:

    Static directory: the directory is configured to have the number of entries required to support a maximum compression factor of F. That is, if the C-RAM has a capacity of N uncompressed lines, the directory contains entries for FN lines.

    A possible problem with this type of design is that the maximum compression is limited to a predetermined value.

    C-RAM Architecture

  • Memory eXpansion Technology

    Dynamic directory: using a dynamic directory structure, directory entries are created (deleted) whenever real addresses are allocated (deallocated). In this case, free main-memory blocks can be allocated (deallocated) and used for the directory entries of one or more pages whenever those pages are created (deleted).

    C-RAM Architecture

  • MXT Main Memory Subsystem

  • The LZ77 compression technique: the LZ77 output is a series of byte values interspersed with (index, length) pairs. Each byte value is written as is to the output. The (index, length) pairs are written to the output as a pair of integers (index first, then length), each of which has 256 added to its value. This allows the index and length values to be distinguished from the byte values.

    LZ77 in operation
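
    A decoder for the output format just described is short enough to sketch. One assumption: the index is taken here as an absolute position in the already-decoded output (the description above does not say whether it is absolute or a backward distance):

```python
def lz77_decode(symbols: list) -> bytes:
    """Decode a stream of LZ77 symbols: values < 256 are literal bytes;
    values >= 256 come in (index, length) pairs, each offset by 256."""
    out = bytearray()
    i = 0
    while i < len(symbols):
        if symbols[i] < 256:
            out.append(symbols[i])           # literal byte, written as is
            i += 1
        else:
            index = symbols[i] - 256         # where the earlier copy starts
            length = symbols[i + 1] - 256    # how many bytes to copy
            for k in range(length):          # byte-by-byte, so the copy may
                out.append(out[index + k])   # overlap the bytes it produces
            i += 2
    return bytes(out)

# 'a', 'b', then (index=0, length=2): copy "ab" from the start of the output.
print(lz77_decode([97, 98, 256 + 0, 256 + 2]))  # b'abab'
# Overlapping copy: 'a', then (index=0, length=3) repeats the byte: b'aaaa'.
print(lz77_decode([97, 256 + 0, 256 + 3]))
```

    The byte-by-byte copy loop is what makes overlapping references work, which is how LZ77 encodes runs of a repeated byte compactly.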

  • IBM implementation of the compression technique:
    Divide the data into n partitions, with a compression engine for each part and a shared dictionary.
    Typically 4 compression engines, each handling 256 B (a quarter of the 1KB uncompressed data).
    Throughput: 1 B/cycle per engine (4 B/cycle total), or 2 B/cycle per engine (8 B/cycle total) when double-clocked.

  • Uncompressed Memory: an unsectored region is used by the SST and for additional and future needs.

  • Main Memory Subsystem:
    Comprises SDRAM on dual in-line memory modules (DIMMs); the controller supports two separate DIMMs.
    Can be configured to operate with compression disabled, enabled for specific address ranges, or completely enabled.
    Sector Translation Table (SST); sectored memory.

  • Compressed Memory

  • Cont. Each SST entry describes a 1KB data line: data that compresses to 120 bits or less is stored directly in the entry itself; otherwise the entry holds pointers to the sectors containing the compressed data. Uncompressed regions are directly accessed, without an SST reference.

  • Reliability-Availability-Serviceability (RAS):
    Sector translation table entry parity checking.
    Sector free-list parity checking.
    Sector out-of-range checking.
    Sectored memory-overrun detection.
    Sectors-used threshold detection (2).
    Compressor/decompressor validity checking.
    Compressed-memory CRC protection.

  • Commodity duplex memory: a fault-tolerance technique not found in earlier systems.

  • Operating System Software:
    The OS cannot distinguish between an MXT and a conventional memory hardware environment.
    When memory is over-utilized, the system fails.
    Unsectored memory needs paging management.
    In UNIX, this requires changing the OS kernel; in Windows, the code is not public, so external driver software is needed.

  • Performance

  • Cont.

  • References
    MXT:
    1. "High-throughput coherence control and hardware messaging in Everest," A. K. Nanda, A.-T. Nguyen, M. M. Michael, and D. J. Joseph, p. 229.
    2. "Algorithms and data structures for compressed-memory machines," P. A. Franaszek, P. Heidelberger, D. E. Poff, and J. T. Robinson, p. 245.
    3. "On internal organization in compressed random-access memories," P. A. Franaszek and J. T. Robinson, p. 259.
    4. "IBM Memory Expansion Technology (MXT)," R. B. Tremaine, P. A. Franaszek, J. T. Robinson, C. O. Schulz, T. B. Smith, M. E. Wazlowski, and P. M. Bland, p. 271.
    5. "Memory Expansion Technology (MXT): Software support and performance," B. Abali, H. Franke, D. E. Poff, R. A. Saccone, Jr., C. O. Schulz, L. M. Herger, and T. B. Smith, p. 287.
    6. "Memory Expansion Technology (MXT): Competitive impact," T. B. Smith, B. Abali, D. E. Poff, and R. B. Tremaine, p. 303.

    Memory Compression http://domino.research.ibm.com/comm/wwwr_thinkresearch.nsf/pages/memory200.html

    Memory Guide http://www.pcguide.com/ref/ram/tech.htm