
A Dual-Phase Compression Mechanism for Hybrid DRAM/PCM Main Memory Architectures

Seungcheol Baek, Hyung Gyu Lee, Jongman Kim
School of Elec. and Comp. Eng., Georgia Institute of Technology
{bsc11235, hyunggyu, jkim}@gatech.edu

Chrysostomos Nicopoulos
Department of Elec. and Comp. Eng., University of Cyprus
[email protected]

ABSTRACT

Phase-Change Memory (PCM) is emerging as a promising new memory technology, due to its inherent ability to scale deeply into the nanoscale regime. However, PCM is still marred by a pair of potentially show-stopping deficiencies: poor write performance and limited durability. These weaknesses have urged designers to develop various supporting architectural techniques that aid and complement the operation of the PCM while mitigating its innate flaws. One promising solution is the deployment of hybridized memory architectures that fuse DRAM and PCM, in order to combine the best attributes of each technology. In this paper, we introduce a Dual-Phase Compression (DPC) scheme specifically optimized for DRAM/PCM hybrid environments. Extensive simulations with traces from real applications running on a full-system simulator of a multicore system demonstrate a 35.1% performance improvement and a 29.3% energy reduction, on average, as compared to a baseline DRAM/PCM hybrid implementation.

Categories and Subject Descriptors

C.0.4 [Computer Systems Organization]: General—System architecture

1. INTRODUCTION

Over the last few years, several new memory technologies have been developed in an effort to address some of the shortcomings of DRAM technology. One of the most promising new actors is Phase-Change Memory (PCM), which is gaining a foothold in the research community, predominantly due to its attractive ability to scale very deeply into the low-nanometer regime, its low power consumption, non-volatility, and fast read performance [9]. However, some crippling limitations prevent PCM from completely ousting DRAM from future systems: low write performance and limited long-term endurance. These drawbacks have led designers toward the adoption of various architectural techniques, which attempt to mitigate the deficiencies of PCM technology.

Notable examples of such schemes include the re-organization of memory buffers [14], Flip-N-Write [5], and redundant-bit write removal and row shifting [21]. A higher-level memory-system architectural approach employs a partial-write technique that flushes only a specific part of a dirty cache line from the traditional cache [14].

Similarly, a segment-swapping mechanism can be used to periodically swap memory segments with high and low write-access rates, in order to achieve a wearleveling effect (i.e., prolong the lifetime of the PCM). Although these schemes reduce the actual number of cell operations within the PCM, they not only require extensive modifications to the PCM device itself, but they also involve multi-layered modifications in the memory sub-system and require a subsequent orchestration of activities at multiple levels of the hierarchy.

Another approach to circumventing the undesired characteristics of PCM is the adoption of hybridized memory structures, which utilize both DRAM and PCM, in order to combine the best of both worlds [4, 16, 7]. The conventional way to hybridize DRAM and PCM is to use the DRAM as an off-chip cache (i.e., an additional level in the memory hierarchy). Although this DRAM cache significantly reduces the number of PCM accesses, DRAM/PCM memory hybrids are still not enough – by themselves – to adequately address all issues pertaining to the use of PCM. Thus, substantial support from other layers in the memory sub-system is still necessary.

The aforementioned observations constitute our primary motivation. The driving goal of this work is two-fold: (a) consider the performance, energy consumption, and lifetime of the PCM simultaneously, and (b) deploy an architecture that does not require any support from, or any modification to, the various layers of the cache/memory hierarchy. This latter goal implies that the new design must be transparent to the operation of the remaining components of the memory sub-system; i.e., the proposed solution must be self-contained and self-supported.

Toward this end, we hereby propose a practical Dual-Phase Compression (DPC) technique optimized for hybrid DRAM/PCM main memory architectures. The main contributions of this work are:

• A compression mechanism tailored to hybrid DRAM/PCM main memory systems, whereby the compression process is divided into two distinct phases for better performance, energy consumption, and wearleveling. The first phase optimizes DRAM cache accesses by utilizing a simplified, low-latency word-level compression algorithm. This phase dramatically increases the effective DRAM cache capacity – thereby reducing the number of PCM accesses – without adversely affecting access latencies. The second phase adopts a bit-level compression algorithm to further reduce the number of PCM accesses and to increase the lifetime of the PCM.

• A low-overhead wearleveling technique, called Compression-based Segment Rotation (CSR), which enhances the lifetime of the PCM by exploiting the memory space left over after compression. This mechanism acts on the data after compression and intelligently balances wear-out within the PCM.

• The devised architecture can be cost-effectively implemented by slightly modifying the memory controller. It does not require any architectural support from any other entity in the CMP.


Extensive simulations driven by traces extracted from real application workloads running on a full-system simulator demonstrate that the dual-phase mechanism yields a 35.1% performance improvement and a 29.3% energy reduction, on average, as compared to a baseline DRAM/PCM hybrid implementation. Additionally, we demonstrate that the co-operative use of our CSR scheme and a simple existing wearleveling technique (also implemented within the memory controller) ensures a PCM lifetime in the range of 4.42 to 40 years.

2. BACKGROUND & RELATED WORK

2.1 Fundamentals of PCM

PCM exploits the unique trait of chalcogenide glass to switch between two states – crystalline (SET, low resistance) and amorphous (RESET, high resistance) – through the passage of an electric current. The PCM write operation commences by melting the cell at high temperature, followed by fast quenching (RESET operation) or slow quenching (SET operation), depending on the value to be written. This heating-and-cooling process generally takes more than 100 ns (i.e., it is relatively slow compared to DRAM), and it also severely affects the endurance of PCM, limiting its lifetime to within 10^6 to 10^8 write cycles in current PCM technology. In addition, this power-hungry write operation prevents the exploitation of burst memory-write operations, which, in turn, decreases write bandwidth [12]. On the other hand, PCM read performance is similar to that of DRAM. PCM read latency is currently about 2 times that of DRAM, but PCM has been shown to support the DDR2 interface for read operations [11], which will soon bring its read performance up to par with DRAM. Thus, reducing the number of PCM writes and the data size of write accesses are crucial goals when dealing with PCM.

2.2 Compression techniques in PCM

Compression techniques have generally been used to reduce memory traffic and increase the effective capacity of memory systems. However, they incur additional, non-negligible latency during the compression/decompression processes. In addition, address remapping schemes, as well as somewhat complex compaction and/or reallocation processes, are mandatory, because of the variable size of the compressed data. Hence, compression techniques have not been widely adopted in real products. Despite said overhead, however, compression techniques constitute an attractive solution within the context of PCM, because the reduced data size after compression can simultaneously improve performance (especially write performance), energy consumption, and endurance. Furthermore, if the remaining (saved) space is not used to store additional data, the overhead of address remapping and data compaction and/or reallocation can be eliminated. Fortunately, one can easily give up the capacity benefits of compression, because the main attraction of PCM is to provide more space than DRAM at the same cost. Hence, the remaining space (after compression) can instead be used to further enhance the lifetime of the PCM. We demonstrate this effect later in the paper.

Compression techniques have been investigated for PCM-based main memory systems in [20]. However, compression is only one of several alternatives explored in that paper, and it is only triggered when the other alternative schemes are not expected to work. Furthermore, the compression algorithm employed was not specifically optimized for the nuances of a PCM-based system. Consequently, the contribution of data compression was found to be limited.

The authors of [13] indicate that conventional data compression techniques may not be effective for PCM, because mismatches between the sizes of new and old compressed data segments may generate additional write traffic for compaction, and may require address-space remapping. Therefore, the incorporation of efficient and effective data compression in PCM-based main memory systems necessitates optimizations at both the algorithmic and architectural levels; simply applying conventional compression techniques to PCM-based systems will not yield the desired effects.

Figure 1: A high-level conceptual (a) and architectural (b) overview of the proposed Dual-Phase Compression mechanism.

Recently, encoding-based memory compression [18] has been introduced to reduce the number of PCM writes using Frequent Pattern Compression [19]. The encoding engine was embedded inside the PCM chip and required some modification of the PCM row-array structure. This implementation has several disadvantages: resource duplication (all individual memory chips must hold the same encoding tables and logic, thus increasing the energy consumption of the PCM device), difficulty in updating the Frequent Value table, and the requirement that all tag bits be stored within the PCM device at fixed positions (meaning that the wearleveling mechanism must also consider the position of the tag bits). Moreover, the scheme of [18] focuses only on improving the energy consumption and endurance of PCM.

On the other hand, our proposed scheme can be implemented entirely within the memory controller and does not require any architectural (or other) support from the system, rendering its potential incorporation into future CMPs very straightforward. Most importantly, the technique proposed in this paper improves performance, energy consumption, and endurance simultaneously, by applying a fine-tuned compression technique specifically tailored to the characteristics of PCM technology.

3. A DUAL-PHASE COMPRESSION (DPC)

Our goal in this work is to exploit the benefits of compression techniques for PCM without the aforementioned limitations of compression. To this end, we distribute the compression process into two phases, as shown in Figure 1(a). The first phase compresses the data at the word-level using a Successive Matching Algorithm (SMA) and aims to increase the effective DRAM cache capacity with very little delay overhead. The second phase operates on the compressed data of the first phase using a Frequent Pattern Algorithm (FPA) [2] at the bit-level, in order to achieve even more data compaction. The proposed compression mechanism also employs an efficient wearleveling technique to prolong the lifetime of the PCM, without resorting to any remapping activities.

Figure 1(b) presents the accompanying memory controller architecture, which consists of a DRAM cache controller, two compression engines, DRAM and PCM device controllers, a tag memory space, and (embedded) DRAM. In DRAM/PCM hybrid memory implementations, the DRAM serves as a cache for the PCM (i.e., it forms an extra level in the memory hierarchy). Note that all our proposed components, including the tag memory, reside solely in the memory controller.

Figure 2: Phase 1 of the Dual-Phase Compression scheme employs a hardware-based Successive Matching Algorithm (SMA).

3.1 Phase 1 – Word-Level Compression with a Successive Matching Algorithm (SMA)

Figure 2 illustrates the SMA utilized in the first phase of the DPC architecture, which is very lightweight, simple, and extremely fast. In this phase, a 64B data block (assuming a last-level cache line containing 16 32-bit words) is compressed at the granularity of a single word (i.e., word-level compression). As the term "successive matching" suggests, compression is only performed between adjacent (consecutive) words in the 64B block. While compressing the original 64B data block, word i is compared with word i-1 for a possible match (i = 0 is the least significant word of the original 64B data block). The resulting compressed data block comprises two parts: the Compression Tag Area and the Compressed Data Area. The Compression Tag Area consists of 16 bits, with each bit corresponding to one word in the original 64B data block (remember, a 64B block holds 16 words). If two consecutive words are found to be identical, a '0' is written in the corresponding bit of the Compression Tag Area; otherwise, a '1' is written. The first bit of the Compression Tag Area is always '1', since there is no i-1 word to compare with. If the i-th tag bit is '1', then the i-th word from the original data block is written into the Compressed Data Area (Figure 2); otherwise, nothing is written.

Looking at the scenario presented in Figure 2, the first three tag bits are set to '1', because the first three words in the original data block are different. However, the 4th-13th tag bits are all set to '0', because the 3rd word in the original data block is repeated multiple times. Similarly, the last three words in the original data block are the same; hence, the 14th tag bit is set to '1', while the 15th and 16th bits are set to '0', to indicate compression. The storage overhead of this scheme, in terms of extra bits, is only one bit per word (i.e., 3% for a 32-bit data-path system), which is the cost of the Compression Tag Area. If there are at least two identical consecutive words in the original 16-word data block, the overhead is fully amortized and yields immediate compaction benefits.
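As a concrete illustration, the following minimal C sketch (our own; the function names and the software realization stand in for what is a hardware engine in the actual design) compresses and decompresses one 64B block under the SMA rules described above:

    #include <stdint.h>

    /* Phase-1 SMA (word-level) compression of one 64B block:
     * 16 x 32-bit words in, up to 16 words plus a 16-bit tag out.
     * Tag bit i = '1' if word i differs from word i-1 (bit 0 is
     * always '1') and is therefore stored; '0' means word i
     * repeats word i-1. */
    static int sma_compress(const uint32_t in[16], uint32_t out[16],
                            uint16_t *tag)
    {
        int n = 0;
        *tag = 0;
        for (int i = 0; i < 16; i++) {
            if (i == 0 || in[i] != in[i - 1]) {
                *tag |= (uint16_t)(1u << i);
                out[n++] = in[i];     /* store only words that differ */
            }
        }
        return n;                     /* compressed size: 1..16 words */
    }

    /* Decompression re-expands runs of identical words. Because tag
     * bit 0 is always '1', out[i - 1] is never read before it is set. */
    static void sma_decompress(const uint32_t comp[], uint16_t tag,
                               uint32_t out[16])
    {
        int n = 0;
        for (int i = 0; i < 16; i++)
            out[i] = (tag & (1u << i)) ? comp[n++] : out[i - 1];
    }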

In order to take full advantage of the compression in Phase 1, the DRAM cache must be able to accept more compressed lines than conventional uncompressed lines. In other words, it must be able to accommodate cache lines of variable size without wasting available space due to placement restrictions. To achieve this, we exploit decoupled, variable-segment caching [1], combined with our SMA-based compression. The proposed DRAM cache architecture is depicted in Figure 3.

Figure 3: Illustration of a single set of the proposed DRAM cache structure, which is similar to a decoupled variable-segment DRAM cache [1].

While each set is physically 4-way set associative, we overlay a logical 16-way set associativity by using more tags and a segmented data area. More specifically, the data array of the cache is broken into 4B (single-word) segments, with 64 segments statically allocated to each cache set.

As previously described, the SMA compresses each 64B cache line into anything between 1 (fully compressed) and 16 (uncompressed) words. Hence, each set in the proposed DRAM cache structure of Figure 3 can, technically, hold up to 64 compressed cache lines (in the extreme case where each cache line is compressed to a single 32-bit word). To support such a high number of cache lines per set, the Tag Area (see Figure 3) would have to grow dramatically. The critical question, then, is whether real applications would benefit from such an extreme capability (i.e., 64 compressed cache lines per set). Based on extensive simulations using traces from a full-system simulator, real applications hardly ever yield more than 10 compressed cache lines per set. Thus, we set the maximum number of cache lines that one set can hold to 16, which means that each set can increase its effective capacity by up to four times (when storing 16 compressed cache lines).

Each tag in the Tag Area (Figure 3) consists of a valid bit, the 16-bit Compression Tag (from Figure 2), and the address tag. The overhead, compared to a conventional cache, is the extra 16-bit Compression Tag, which is required for each of the 16 tags of a set.
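A sketch of one tag entry, as we read the description above (the field names and bit-field realization are ours; the address-tag width depends on the cache geometry):

    /* One of the 16 tag entries per set in the proposed DRAM cache. */
    struct dpc_tag_entry {
        unsigned valid           : 1;   /* valid bit                     */
        unsigned compression_tag : 16;  /* per-word SMA tag from Phase 1 */
        uint32_t address_tag;           /* width set by cache geometry   */
    };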

Data segments are stored contiguously, in Address Tag order. The offset of the first data segment of cache line k (in a particular set) is

    segment_offset(k) = Σ_{i=0}^{k-1} actual_size(i)

where k ≤ 15 and actual_size(i) is the size (in bytes) of the i-th cache line (whether compressed or not), given by

    actual_size(i) = num_tag(i) × 4 bytes

where num_tag(i) is the number of '1's in the valid Compression Tag of line i, 0 ≤ i ≤ 15. The segment_offset and actual_size parameters are used to target specific segments in the Data Area of the cache for address-tag matching.
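In software terms, both quantities reduce to population counts over the per-set Compression Tags; a minimal sketch (assuming GCC/Clang's __builtin_popcount; the array name is illustrative):

    /* actual_size(i) = num_tag(i) x 4 bytes */
    static int actual_size_bytes(uint16_t ctag)
    {
        return __builtin_popcount(ctag) * 4;
    }

    /* segment_offset(k) = sum of actual_size(i) for i = 0 .. k-1 */
    static int segment_offset_bytes(const uint16_t ctags[16], int k)
    {
        int off = 0;
        for (int i = 0; i < k; i++)
            off += actual_size_bytes(ctags[i]);
        return off;
    }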

To enable rapid access to a specific word in a cache line, the technique must be able to swiftly locate the requested word in the compressed cache line. This, in fact, is one of the potential drawbacks of compression schemes: how does one locate a specific item within the compressed data (a) without having to decompress, and (b) without having to search the entire compressed block? The proposed SMA mechanism avoids both of these complications by providing a critical-word-first capability: to access the i-th word of the original uncompressed data block, the controller simply counts the number of tag bits set to '1' in the 16-bit Compression Tag Area (see Figure 2), from tag bit 0 to tag bit i-1. This count immediately gives the location of the requested i-th word in the compressed data block. The procedure pinpoints the required word directly, with no need for decompression or searching. This is a valuable attribute that contributes to the substantial performance benefits demonstrated in Section 5. Although our DRAM cache incurs a 4% storage overhead due to the presence of the Compression Tags, this overhead is fully amortized by the large increase in average effective DRAM cache capacity.
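The lookup itself is a masked population count; a minimal sketch under the same assumptions as the snippets above:

    /* Critical-word-first read: the position of original word i inside
     * the compressed block equals the number of '1' tag bits below bit i. */
    static uint32_t sma_read_word(const uint32_t comp[], uint16_t ctag, int i)
    {
        int pos = __builtin_popcount(ctag & ((1u << i) - 1u));
        if (ctag & (1u << i))
            return comp[pos];      /* word i was stored explicitly        */
        return comp[pos - 1];      /* word i repeats the last stored word */
    }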

We have also analyzed the hardware overhead of the control logic of the proposed mechanism. Its area cost in terms of gate count is only 0.81% (specifically, 394 gates) of a commercial DDR2 SDRAM memory controller, and it incurs only 1 additional memory clock cycle per compression/decompression operation. This overhead is included in our evaluations later on.

Figure 4: Phase 2 of the Dual-Phase Compression scheme employs a hardware-based Frequent Pattern Algorithm (FPA).

3.2 Phase 2 – Bit-Level Compression with a Frequent Pattern Algorithm (FPA)

Phase 2 of our DPC architecture attempts to compact the compressed data of Phase 1 even further for storage into the PCM. In order to identify the additional redundancy that can be compressed, the mechanism has to operate at a finer granularity than the word-level approach of Phase 1.

One of the most efficient and fastest bit-level compression algorithms is FPA [2]. FPA uses a Frequent Pattern Encoding table, as illustrated in Figure 4. The table includes eight bit patterns that tend to appear frequently in real application workloads [6]. These eight patterns are known a priori and are converted to 3-bit prefix values (2^3 = 8) plus a corresponding data value in their compressed form (see Figure 4). For example, a so-called '0'-run (a word of all zeros) corresponds to the first pattern in the table of Figure 4, which compresses to a prefix of '000'. In the example of the same figure, the first word of the compressed data from Phase 1 is compressed to the '000' prefix and a 1-bit data value of '0', because the original word is a '0'-run word. As shown, after the FPA procedure of Phase 2, the compressed data block from Phase 1 is further compressed to 76 bits.
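A minimal C sketch of such a pattern classifier, following the Frequent Pattern Compression patterns of [2]; the enum names and the exact pattern ordering are our reading, and the actual table in Figure 4 may differ in detail:

    /* 3-bit prefixes for the eight frequent patterns (after [2]). */
    enum fpa_prefix {
        FPA_ZERO_RUN     = 0,  /* word of all zeros                        */
        FPA_SE_4BIT      = 1,  /* 4-bit value, sign-extended               */
        FPA_SE_8BIT      = 2,  /* one byte, sign-extended                  */
        FPA_SE_16BIT     = 3,  /* one halfword, sign-extended              */
        FPA_HALF_ZERO    = 4,  /* halfword padded with a zero halfword     */
        FPA_TWO_SE_BYTES = 5,  /* two halfwords, each a sign-extended byte */
        FPA_REP_BYTES    = 6,  /* word of four identical bytes             */
        FPA_UNCOMP       = 7   /* none of the above: stored verbatim       */
    };

    static int fits_signed(int32_t v, int bits)
    {
        return v >= -(1 << (bits - 1)) && v < (1 << (bits - 1));
    }

    static enum fpa_prefix fpa_classify(uint32_t w)
    {
        int32_t  s = (int32_t)w;
        uint32_t b = w & 0xFFu;
        if (w == 0)                          return FPA_ZERO_RUN;
        if (fits_signed(s, 4))               return FPA_SE_4BIT;
        if (fits_signed(s, 8))               return FPA_SE_8BIT;
        if (fits_signed(s, 16))              return FPA_SE_16BIT;
        if ((w & 0xFFFFu) == 0)              return FPA_HALF_ZERO;
        if (fits_signed((int16_t)(w >> 16), 8) &&
            fits_signed((int16_t)(w & 0xFFFFu), 8))
                                             return FPA_TWO_SE_BYTES;
        if (b == ((w >> 8) & 0xFFu) && b == ((w >> 16) & 0xFFu) &&
            b == (w >> 24))                  return FPA_REP_BYTES;
        return FPA_UNCOMP;
    }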

Despite the pre-compression of Phase 1, the same eight frequent patterns identified in [2, 6] also appear in the word-level-compressed cache lines produced by Phase 1, as shown in the third column of the table in Figure 4. This is because Phase 1 performs only word-level compression, in order to minimize latency. Even though the frequencies of the eight patterns are smaller than in uncompressed data, the frequent patterns of the FPA algorithm of [6] still occur in significant numbers, so we can safely use the same Frequent Pattern Encoding table as the one shown in Figure 4.

The hardware implementation cost of the second compression phase is extracted from [1, 6], and we also include this additional overhead (timing and energy) in our evaluations.
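Tying the two phases together, a usage sketch of the write-back path (again our own; fpa_payload_bits() and its per-pattern widths are assumptions based on the patterns listed above, with the 1-bit '0'-run payload taken from the Figure 4 example):

    /* Payload bits accompanying each 3-bit prefix (assumed widths). */
    static int fpa_payload_bits(enum fpa_prefix p)
    {
        static const int bits[8] = { 1, 4, 8, 16, 16, 16, 8, 32 };
        return bits[p];
    }

    /* A 64B dirty line flushed from the DRAM cache passes through both
     * phases before being written to PCM. Returns the PCM footprint in
     * bits (excluding the 16-bit Phase-1 Compression Tag). */
    static int dpc_writeback_bits(const uint32_t line[16])
    {
        uint32_t phase1[16];
        uint16_t ctag;
        int n = sma_compress(line, phase1, &ctag);   /* Phase 1: word level */

        int pcm_bits = 0;
        for (int i = 0; i < n; i++)                  /* Phase 2: bit level  */
            pcm_bits += 3 + fpa_payload_bits(fpa_classify(phase1[i]));
        return pcm_bits;
    }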

4. A COMPRESSION-BASED SEGMENT ROTATION (CSR) SCHEME

Compression techniques are useful not only for enhancing performance, but also for prolonging the lifetime of the PCM. In this section, we propose a Compression-based Segment Rotation (CSR) mechanism, an intra-line wearleveling technique added to our proposed DPC scheme with minimal overhead. The minimum unit of data flushed from the DRAM cache is a cache line, i.e., 64B in this paper. After applying our DPC mechanism, however, the physical size of a cache line is reduced, on average, to less than half of the original. Most general-purpose compression algorithms use the remaining space to store other data. Instead, our CSR scheme uses this remaining space to enhance the lifetime of the PCM (in which case address remapping is no longer necessary), under the assumption that the PCM device already provides enough memory space. This is not unrealistic, since PCM is an extremely scalable memory technology offering massive storage density.

Figure 5: The fundamental operating principle of the proposed Compression-based Segment Rotation (CSR) process for local wearleveling within a line.

Figure 5 depicts the basic principle of the proposed CSR technique, which is performed only at the line level. As shown in the figure, the first write uses only the first 7 segments of the available line space, because the compressed size of the flushed-out cache line happens to be 7 words (i.e., 28 bytes). The data of the second write is stored in the segments that follow, from the 8th to the 13th (occupying six segments of the line). If the data spills past the last segment of the line, the remaining portion wraps around to the first segments of the line and the line counter is incremented by one, as in the case of the third write in Figure 5. In this way, six write requests to the same address (the example of Figure 5) increase the write count of the line by only two. For an efficient implementation, we use 9 bits of additional space per line (a 1.8% space overhead for a 64B line, which shrinks for larger lines) to record the compression status (C status, 1 bit) and the offset and size of the valid compressed data. Since this 9-bit tag data is stored in on-chip memory and is read and updated at the same time as the tag area of the DRAM cache is accessed, there is no additional timing overhead in accessing it.
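A minimal sketch of the rotation bookkeeping (our own; the split of the 9 bits into offset and size fields, and storing size minus one to fit 1..16 segments into 4 bits, are assumptions consistent with the totals above):

    #define SEGMENTS_PER_LINE 16   /* 64B line / 4B segments */

    /* Per-line CSR metadata: 1 + 4 + 4 = 9 bits, kept on-chip. */
    struct csr_meta {
        unsigned c_status : 1;     /* line currently holds compressed data */
        unsigned offset   : 4;     /* first segment of the valid data      */
        unsigned size_m1  : 4;     /* valid segments minus one (1..16)     */
    };

    /* Place the next compressed block right after the previous one
     * (m is assumed to describe the previous write), wrapping to the
     * start of the line when it would spill past the end. */
    static void csr_write(struct csr_meta *m, int size_segments,
                          int *line_write_count)
    {
        int start = (m->offset + m->size_m1 + 1) % SEGMENTS_PER_LINE;
        if (start + size_segments > SEGMENTS_PER_LINE)
            (*line_write_count)++; /* wrap-around: one more full-line wear */
        m->c_status = 1;
        m->offset   = start;
        m->size_m1  = size_segments - 1;
    }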

Although the CSR scheme focuses only on wearleveling within a line, we have found that it co-operates efficiently with a local-counter-based page-swapping technique [13], which is implemented in the memory controller together with the CSR scheme. In this case, the lifetime of the PCM device is extended to more than 5 years, even under memory-intensive applications, as illustrated in Figure 8 in Section 5.3. This is crucial, since no additional wearleveling technique (such as partial writes [14], Flip-N-Write [5], or line-level wearleveling [16]) is required.

5. EVALUATION

5.1 Simulation Framework

We have developed a trace-driven simulator to evaluate our proposed Dual-Phase Compression architecture. The traces have been extracted from the Simics/GEMS full-system simulator [15], with timing information in CPU cycles. The details of the simulation configuration are described in Table 1. We investigate a total of four memory system configurations: (1) DRAM-only, (2) PCM-only, (3) a regular DRAM/PCM hybrid with a 64-entry write queue, and (4) a DRAM/PCM hybrid with our Dual-Phase Compression (DPC) technique.

Table 1: Simulated system configuration details

  Number of CMP Cores   16
  Processor Core Type   UltraSPARC-III+ (in-order), 2 GHz
  L1 caches (Private)   I- and D-cache: 64 KB, 4-way, 64 B block
  L2 caches (Shared)    1 MB, 4-way, 64 B block
  DRAM memory (large)   DDR2, 8 GB
  DRAM memory (small)   DDR2, 64 MB

Table 2: Memory access characteristics of the benchmark applications used (PARSEC benchmark suite [3])

  Group  Application   Description                             R/W rate  MMA rate*
  1      canneal       Simulated cache-aware annealing         1.83      12.6
  1      dedup         Compression with data deduplication     1.39      4.7
  2      fluidanimate  Fluid dynamics for animation purposes   1.38      4.6
  2      x264          H.264 video encoding                    1.78      2.3
  3      vips           Image processing                       1.32      1.6
  3      freqmine      Frequent itemset mining                 1.55      1.5
  4      ferret        Content similarity search server        1.52      0.9
  4      bodytrack     Body tracking of a person               1.71      0.8

  * MMA rate: Main Memory Accesses per 1K CPU cycles

8 GB off-chip DRAM modules and 8 GB off-chip PCM modules are used for the DRAM-only and PCM-only memory configurations, respectively, while a 64 MB DRAM module and 8 GB PCM modules are used for the DRAM/PCM hybrid memory configurations. We set the size of the DRAM cache to 64 MB, even though a 256 MB DRAM cache is the sweet spot for most applications in the baseline, because the proposed DPC technique increases the effective capacity of the DRAM cache by 2 to 3 times, on average; we therefore expect results similar to using DRAM cache modules of well over 128 MB. This 64 MB DRAM cache can be implemented as a small discrete component (memory module), or it can be embedded into the controller as embedded DRAM (eDRAM). The energy and performance models used for the PCM and DRAM devices in our evaluations are derived from state-of-the-art academic and industrial references [8, 12, 17, 10].

We selected 8 benchmarks from the PARSEC benchmark suite [3]; their memory-access characteristics are shown in Table 2. We categorized the 8 benchmarks into four groups (Groups 1 to 4), based on their memory-access rates. Group 1 consists of the two most memory-intensive applications of Table 2, whereas Group 4 consists of the two least memory-intensive ones. Groups 2 and 3 lie in between, in terms of memory-access rates.

5.2 Performance and Energy Improvements

Since the main objective of this paper is to apply an optimized dual-phase compression mechanism to DRAM/PCM hybrid main memory systems, we first evaluate the compression ability of the algorithms used in each phase. As shown in Figure 6(a), the proposed DPC always exhibits a better compression ratio than each of the individual compression algorithms. This is attributed to the fact that DPC performs combined global and local compression.

The DPC technique also increases the effective capacity of a 64 MB uncompressed DRAM cache to anywhere from 144 to 192 MB, as shown in Figure 6(b). This means that the performance of the proposed DRAM cache is always higher than that of a normal DRAM cache of the same physical size. Alternatively, one can use a smaller DRAM cache without experiencing any performance degradation, thanks to the increased effective cache capacity resulting from the DPC mechanism.

Figure 6: Compression ratios (= compressed size / original size) and the effect of SMA in the DRAM cache of the DRAM/PCM hybrid memory architecture.

Figure 7: Performance and energy simulation results, normalized to the DRAM-only configuration.

Figure 7(a) shows performance comparisons between the four memory configurations, each normalized to the DRAM-only configuration; a higher bar indicates better performance. As expected, our proposed architecture – the DRAM/PCM hybrid with DPC – shows the second-best performance for all groups of applications. This is because the DPC mechanism (1) reduces the number of PCM writes (dirty cache misses) by increasing the effective DRAM cache size and hit ratio in Phase 1, and (2) significantly reduces the PCM write-block size of dirty misses. Recall that PCM cannot support a burst-write operation, due to its heavy power requirement; therefore, reducing the data-block size even by a few words can enhance performance significantly. On average, our proposed DPC mechanism improves performance by 35.1%, as compared to the baseline hybrid configuration.

We also compare the energy consumption of each configuration, as shown in Figure 7(b). As in Figure 7(a), the energy consumption of each configuration is normalized to that of the DRAM-only configuration; here, a smaller bar indicates better energy efficiency. Since the DRAM-only configuration uses a large DRAM (8 GB), its energy consumption is substantially higher than that of the hybrid configurations. The PCM-only configuration also consumes a large amount of energy, especially in memory-intensive applications (Groups 1 and 2), because the number of energy-hungry write operations is never reduced relative to the DRAM-only configuration. Focusing on the two DRAM/PCM hybrid configurations, the DPC-augmented one always consumes less energy than the other. The reasons are similar to those in our performance analysis: storing compressed data markedly reduces the number of energy-hungry PCM write operations. On average, the proposed DRAM/PCM hybrid with DPC improves energy efficiency by 29.3%, as compared to the same hybrid configuration without the DPC technique.

5.3 Enhancing the lifetime of the PCM device

Figure 8 compares the PCM lifetime of the DRAM/PCM hybrid main memory architecture with and without our DPC scheme. We use 10^8 as the maximum number of write cycles that can be performed on any cell before the cell fails (worst-case analysis). To aid understanding, we divide the results into two graphs.

Figure 8: Assessment of the lifetime of the PCM device.

Figure 8(a), whose y-axis is in units of days, directly compares the effectiveness of the DPC technique. Without any wearleveling, the lifetime of the baseline hybrid configuration ranges from 98 to 488 days. Applying the proposed DPC scheme (but without CSR) prolongs the lifetime by up to 1.66 times, while applying DPC+CSR increases the lifetime by up to 2 times, as compared with the baseline hybrid configuration.

Even though the proposed DPC architecture significantly increases the lifetime of the PCM, it is still not enough to ensure a reasonable lifetime under memory-intensive applications (Groups 1 and 2), because the proposed CSR design focuses only on wearleveling within a line. Hence, we analyze the effect of CSR when it is combined with an existing Page-level Swapping (PS) technique [13], also implemented in the memory controller. As shown in Figure 8(b), whose y-axis is in units of years, combining our wearleveling scheme with the page-swapping technique (DRAM/PCM hybrid with DPC+PS+CSR) extends the lifetime of the PCM to between 4.42 and 40 years, whereas the DRAM/PCM hybrid with PS but without our proposed DPC mechanism achieves only 0.7 to 5.1 years.

6. CONCLUSION

In this paper, we present a novel enhancement mechanism for DRAM/PCM hybrid memory architectures. The proposed Dual-Phase Compression (DPC) scheme dramatically reduces the high-cost PCM write accesses by increasing the effective capacity of the DRAM cache and by accessing the data in compressed form. In addition, the DPC mechanism is architected as a self-contained and self-supported solution, which does not require any support from – or any modification to – the various layers of the cache/memory hierarchy. Our extensive simulations using traces extracted from a full-system simulator running real applications demonstrate that the DPC mechanism achieves a 35.1% performance improvement and a 29.3% energy reduction, on average, as compared to a baseline DRAM/PCM hybrid implementation. Additionally, a Compression-based Segment Rotation (CSR) wearleveling mechanism is introduced and combined with the DPC scheme. The combined design is shown to prolong the lifetime of the DRAM/PCM hybrid by up to 2 times. If page swapping and inter-line wearleveling are also employed, the resulting architecture exhibits endurance well beyond the typical useful lifetime of a computer system.

7. ACKNOWLEDGMENTS

This work is partially supported by KORUSTECH(KT)-2008-DC-AP-FS0-0003. It also falls under the Cyprus Research Promotion Foundation's Framework Programme for Research, Technological Development and Innovation 2009-10 (DESMI 2009-10), co-funded by the Republic of Cyprus and the European Regional Development Fund, and specifically under Grant TΠE/ΠΛHPO/06-09(BIE)/09.

8. REFERENCES

[1] A. R. Alameldeen and D. A. Wood. Adaptive cache compression for high-performance processors. In Proceedings of ISCA 2004.
[2] A. R. Alameldeen and D. A. Wood. Frequent pattern compression: A significance-based compression scheme for L2 caches. Technical Report 1500, Computer Sciences Department, University of Wisconsin-Madison, April 2004.
[3] R. Bagrodia et al. Parsec: A parallel simulation environment for complex systems. Computer, 31:77–85, October 1998.
[4] A. Bivens et al. Architectural design for next generation heterogeneous memory systems. In IEEE International Memory Workshop, pages 1–4, May 2010.
[5] S. Cho and H. Lee. Flip-N-Write: A simple deterministic technique to improve PRAM write performance, energy and endurance. In Proceedings of MICRO 2009.
[6] R. Das et al. Performance and power optimization through data compression in network-on-chip architectures. In Proceedings of HPCA 2008, pages 215–225.
[7] G. Dhiman, R. Ayoub, and T. Rosing. PDRAM: A hybrid PRAM and DRAM main memory system. In Proceedings of DAC 2009, pages 664–669.
[8] A. P. Ferreira et al. Using PCM in next-generation embedded space applications. In Proceedings of RTAS 2010.
[9] R. F. Freitas and W. W. Wilcke. Storage-class memory: The next storage system technology. IBM Journal of Research and Development, 52(4/5):439–447, 2008.
[10] HP Labs. CACTI: An integrated cache and memory access time, cycle time, area, leakage and dynamic power model. http://www.hpl.hp.com/research/cacti/.
[11] H. Jung et al. A 58nm 1.8V 1Gb PRAM with 6.4MB/s program bandwidth. In Proceedings of IEEE ISSCC 2011.
[12] S. Kang et al. A 0.1-μm 1.8V 256Mb 66MHz synchronous burst PRAM. In Proceedings of ISSCC 2006, pages 487–496.
[13] J. Kong and H. Zhou. Improving privacy and lifetime of PCM-based main memory. In Proceedings of DSN 2010.
[14] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger. Architecting phase change memory as a scalable DRAM alternative. In Proceedings of ISCA 2009, pages 2–13.
[15] P. S. Magnusson et al. Simics: A full system simulation platform. Computer, 35:50–58, February 2002.
[16] M. K. Qureshi, V. Srinivasan, and J. A. Rivers. Scalable high performance main memory system using phase-change memory technology. In Proceedings of ISCA 2009.
[17] Samsung Electronics. DDR2 registered SDRAM module, M393T5160QZA. Datasheet.
[18] G. Sun, D. Niu, J. Ouyang, and Y. Xie. A frequent-value based PRAM memory architecture. In Proceedings of ASP-DAC 2011, pages 211–216.
[19] J. Yang, Y. Zhang, and R. Gupta. Frequent value compression in data caches. In Proceedings of MICRO 2000, pages 258–265.
[20] W. Zhang and T. Li. Characterizing and mitigating the impact of process variations on phase change based memory systems. In Proceedings of MICRO 2009.
[21] P. Zhou, B. Zhao, J. Yang, and Y. Zhang. A durable and energy efficient main memory using phase change memory technology. In Proceedings of ISCA 2009, pages 14–23.
