
HAMLeT: Hardware Accelerated Memory Layout Transform within 3D-stacked DRAM

Berkin Akın, James C. Hoe, and Franz Franchetti
Electrical and Computer Engineering Department
Carnegie Mellon University, Pittsburgh, PA, USA

{bakin, jhoe, franzf}@ece.cmu.edu

Abstract: Memory layout transformations via data reorganization are very common operations, which occur as a part of the computation or as a performance optimization in data-intensive applications. These operations require inefficient memory access patterns and roundtrip data movement through the memory hierarchy, failing to utilize the performance and energy-efficiency potential of the memory subsystem. This paper proposes a high-bandwidth and energy-efficient hardware accelerated memory layout transform (HAMLeT) system integrated within a 3D-stacked DRAM. HAMLeT uses low-overhead hardware that exploits the existing infrastructure in the logic layer of 3D-stacked DRAMs and does not require any changes to the DRAM layers, yet it can fully exploit the locality and parallelism within the stack by implementing efficient layout transform algorithms. We analyze matrix layout transform operations (such as matrix transpose, matrix blocking and 3D matrix rotation) and demonstrate that HAMLeT can achieve close to peak system utilization, offering up to an order of magnitude performance improvement compared to CPU and GPU memory subsystems that do not employ HAMLeT.

I. INTRODUCTION

Main memory has been a major bottleneck in achieving high performance and energy efficiency for various computing systems. This problem, also known as the memory wall, is exacerbated by multiple cores and on-chip accelerators sharing the main memory and demanding more memory bandwidth. 3D die stacking is an emerging technology that addresses the memory wall problem by coupling multiple layers of DRAM with the processing elements via high-bandwidth, low-latency and very dense vertical interconnects, i.e. TSVs (through-silicon vias). However, in practice, the high performance and energy efficiency it offers are only achievable through efficient use of the main memory.

Exploiting the data locality and the abundant parallelism provided by multiple banks, ranks (and layers) is the key to efficiently utilizing DRAM based main memories. However, several data-intensive applications fail to utilize the available locality and parallelism due to inefficient memory access patterns and disorganized data placement in the DRAM. This leads to excessive DRAM row buffer misses and an uneven distribution of requests to the banks, ranks or layers, which yields very low bandwidth utilization and incurs significant energy overhead. Existing solutions such as memory access scheduling [1] or compiler optimizations [2], [3] provide limited improvements. Memory layout transformation via data reorganization in the memory targets the inefficient memory access pattern and the disorganized data placement issues at their origin. Yet, transforming the memory layout suffers from the high latency of the roundtrip data movement from the memory hierarchy to the processor and back to the main memory. The memory layout transformation is also associated with the bookkeeping cost of updating the address mapping tables [4].

This work was sponsored by DARPA under agreement HR0011-13-2-0007. The content, views and conclusions presented in this document do not necessarily reflect the position or the policy of DARPA.

In this paper, we present HAMLeT, a hardware accelerated memory layout transform framework that efficiently reorganizes the data within the memory by exploiting the 3D-stacked DRAM technology. HAMLeT uses the high-bandwidth, low-latency and dense TSVs, and the customized logic layer underneath the DRAM, to reorganize the data in the memory, avoiding the latency and the energy overhead of the roundtrip data movement through the memory hierarchy and the processor. HAMLeT proposes lightweight and low-power hardware in the logic layer to address the thermal issues. It introduces very simple modifications to existing 3D-stacked DRAM systems such as the hybrid memory cube (HMC) [5]; it mostly uses the already existing infrastructure in the logic layer and does not require any changes to the DRAM layers. By using SRAM based scratchpad memory blocks and the existing interconnection in the logic layer, it implements efficient algorithms to perform otherwise costly data reorganization schemes. For the reorganization schemes that we consider, HAMLeT handles the address remapping in the hardware, transparent to the processor. Our work makes the following contributions:

  • To our knowledge, HAMLeT is the first work that proposes a high-performance and energy-efficient memory layout transformation accelerator integrated within a 3D-stacked DRAM.

  • HAMLeT uses a lightweight hardware implementation that exploits existing infrastructure in the logic layer and does not require any changes to the DRAM layers, yet it can fully exploit the locality and parallelism within the 3D-stacked DRAM via efficient layout transform algorithms.

  • We evaluate the performance and energy/power consumption of HAMLeT for data reorganizations such as matrix transpose, matrix blocking, and data cube rotation.

  • For the analyzed reorganization schemes, HAMLeT handles the address remapping in the hardware, transparent to the software stack, and does not incur any bookkeeping overhead of page table updates.

  • We compare HAMLeT with CPU and GPU memory subsystems, which do not employ hardware accelerated data reorganization, and demonstrate up to 14x performance and 4x bandwidth utilization improvement.
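To make the reorganizations named above concrete, the following minimal Python sketch expresses matrix transpose and matrix blocking as pure address permutations. It is a software illustration only, not the HAMLeT hardware algorithm; the function names (`row_major_addr`, `blocked_addr`, `transposed_addr`, `relayout`) and the 4 x 4 example are assumptions introduced here.

```python
def row_major_addr(i, j, n_cols):
    """Linear element index of (i, j) in a row-major matrix."""
    return i * n_cols + j


def blocked_addr(i, j, n_cols, block):
    """Linear element index of (i, j) when the matrix is stored as
    block x block tiles: tiles in row-major order, and elements inside
    each tile in row-major order. Assumes block divides n_cols."""
    tiles_per_row = n_cols // block
    tile_id = (i // block) * tiles_per_row + (j // block)
    offset = (i % block) * block + (j % block)
    return tile_id * block * block + offset


def transposed_addr(i, j, n_rows):
    """Linear index of element (i, j) after transposition, i.e. its
    position (j, i) in the row-major n_cols x n_rows result."""
    return j * n_rows + i


def relayout(data, n_rows, n_cols, new_addr):
    """Copy a row-major matrix into a new layout given by new_addr(i, j)."""
    out = [None] * (n_rows * n_cols)
    for i in range(n_rows):
        for j in range(n_cols):
            out[new_addr(i, j)] = data[row_major_addr(i, j, n_cols)]
    return out


# Hypothetical 4 x 4 matrix holding the values 0..15 in row-major order.
matrix = list(range(16))
blocked = relayout(matrix, 4, 4, lambda i, j: blocked_addr(i, j, 4, 2))
transposed = relayout(matrix, 4, 4, lambda i, j: transposed_addr(i, j, 4))
```

In both cases the transform only changes where each element lives, which is why the address remapping can in principle be described by a fixed index function rather than by per-page bookkeeping.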

Fig. 1. Organization of an off-chip DRAM module. (The figure shows ranks 0 and 1, DRAM chips 0 to c, and banks 0 to b per chip on a shared I/O bus, together with the internals of a DRAM bank: row decoder, column decoder, data array, and row buffer, addressed by row bits and column bits.)

II. DRAM OPERATION AND 3D-STACKING

A. DRAM Operation

As shown in Figure 1, modern DRAM modules are divided hierarchically into ranks, chips, banks, rows and columns. The set of DRAM chips that are accessed in parallel to form the whole DRAM word constitutes a rank. Each DRAM chip has multiple internal banks that share the I/O pins. A bank within a DRAM chip has a row buffer, which is a fast buffer holding the most recently accessed row (page) in the bank. If the accessed bank and page pair is already active, i.e. the referenced page is already in the row buffer, then a row buffer hit occurs, reducing the access latency considerably. On the other hand, when a different page in the active bank is accessed, a row buffer miss occurs. In this case, the DRAM array is precharged and the newly referenced page is activated in the row buffer, increasing the access latency and energy consumption. Exploiting the spatial locality in the row buffer is the key to achieving high bandwidth utilization and energy efficiency.
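The effect of row buffer locality can be illustrated with a small counting model. The sketch below assumes a single bank with an open-page policy, a 2048-byte DRAM row, 8-byte elements, and a 256 x 256 row-major matrix; all of these parameters and the helper name `row_buffer_stats` are illustrative assumptions, not values taken from the paper.

```python
def row_buffer_stats(addresses, row_bytes=2048):
    """Count row buffer hits and misses for a byte-address stream that
    targets a single bank whose last accessed row stays open."""
    open_row = None
    hits = misses = 0
    for addr in addresses:
        row = addr // row_bytes
        if row == open_row:
            hits += 1          # referenced page is already in the row buffer
        else:
            misses += 1        # precharge, then activate the new page
            open_row = row
    return hits, misses


N = 256            # assumed matrix dimension (N x N, row-major)
ELEM = 8           # assumed bytes per element
ROW_BYTES = 2048   # assumed DRAM row (page) size: one matrix row per page

# Walk the same matrix along rows and along columns.
row_walk = [(i * N + j) * ELEM for i in range(N) for j in range(N)]
col_walk = [(i * N + j) * ELEM for j in range(N) for i in range(N)]

row_hits, row_misses = row_buffer_stats(row_walk, ROW_BYTES)
col_hits, col_misses = row_buffer_stats(col_walk, ROW_BYTES)
```

Under these assumptions the row-wise walk misses once per DRAM row, while the column-wise walk touches a different DRAM row on every access and never hits in the row buffer. This is the kind of access pattern that a transpose-style reorganization must deal with.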

In addition to the row buffer locality, bank/rank level parallelism has a significant impact on the DRAM bandwidth and energy utilization. Given that different banks can operate independently, one can overlap the latencies of the row precharge and activate operations with data transfer on different banks/ranks. However, frequently precharging and activating pages in different banks increases the power and total energy consumption.
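The latency overlap described above can be sketched with a toy timing model. It assumes that every request misses in the row buffer, that a bank needs `t_prep` cycles to precharge and activate before it can transfer data, that a transfer occupies the shared bus for `t_burst` cycles, and that a bank may begin preparing its next row as soon as its previous transfer ends. The cycle counts are made-up placeholders rather than real DRAM timings, and the model covers time only, not the extra energy of frequent activations.

```python
def total_cycles(bank_ids, t_prep=30, t_burst=4):
    """Cycles needed to serve a request sequence in which every request
    is a row buffer miss. bank_ids[k] is the bank targeted by request k."""
    bank_free = {}   # cycle at which each bank can start precharge/activate
    bus_free = 0     # cycle at which the shared data bus becomes idle
    for bank in bank_ids:
        ready = bank_free.get(bank, 0) + t_prep   # row open in this bank
        start = max(ready, bus_free)              # wait for the shared bus
        bus_free = start + t_burst                # data transfer
        bank_free[bank] = bus_free                # bank busy until transfer ends
    return bus_free


# Eight requests to one bank versus eight requests spread over eight banks.
one_bank = total_cycles([0] * 8)
eight_banks = total_cycles(list(range(8)))
```

With the assumed numbers, serializing all requests on one bank pays the full precharge/activate latency eight times, whereas spreading them over eight banks hides most of that latency behind the data transfers of the other banks.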

B. 3D-stacked DRAM

3D-stacked DRAM is an emerging technology where multiple DRAM dies and a logic layer are stacked on top of each other and connected by TSVs (through-silicon vias) [5], [6], [7], [8]. TSVs allow low-latency and high-bandwidth communication within the stack, eliminating I/O pin count concerns. Fine-grain rank-level stacking, which allows individual memory banks to be stacked in 3D, enables fully utilizing the internal TSV bandwidth [9], [5]. As shown in Figure 2(a), fine-grain rank-level stacked 3D-DRAM consists of multiple DRAM layers where each layer has multiple DRAM banks, and each bank has its own TSV bus. Vertically stacked banks share a TSV bus and form a vertical rank (sometimes referred to as a vault [5]). Each vault can operate independently.
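One way to picture this organization is as a decomposition of a physical address into vault, bank, row and column fields. The sketch below is purely illustrative: the field widths, the field order (column bits lowest, then vault, bank, row), and the names `DramAddr` and `decode` are assumptions made here and are not taken from the paper or from the HMC specification.

```python
from collections import namedtuple

DramAddr = namedtuple("DramAddr", "vault bank row column")

# Hypothetical field widths, listed from the least significant bits upward.
COLUMN_BITS = 5    # 32 words per contiguous chunk
VAULT_BITS = 4     # 16 independent vaults
BANK_BITS = 3      # 8 stacked banks per vault
ROW_BITS = 12      # 4096 rows per bank


def decode(addr):
    """Split a word address into (vault, bank, row, column) fields,
    using the assumed layout row | bank | vault | column."""
    column = addr & ((1 << COLUMN_BITS) - 1)
    addr >>= COLUMN_BITS
    vault = addr & ((1 << VAULT_BITS) - 1)
    addr >>= VAULT_BITS
    bank = addr & ((1 << BANK_BITS) - 1)
    addr >>= BANK_BITS
    row = addr & ((1 << ROW_BITS) - 1)
    return DramAddr(vault, bank, row, column)


# Consecutive 32-word chunks land in consecutive vaults under this mapping.
chunk_vaults = [decode(k * 32).vault for k in range(4)]
```

Placing the vault bits just above the column bits is one possible design choice: it spreads consecutive chunks over vaults that can operate independently, while accesses within a chunk stay in one row of one bank.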

The internal operation and the structure of the 3D-stacked DRAM banks are very similar to regular DRAMs (see Figure 1), except that some of the peripheral circuitry is moved down to the logic layer, which enables much better timings [9]. As shown in Figure 2(b), the logic layer also includes a memory controller, a crossbar switch, and vault and link controllers. The memory controller schedules the DRAM commands and aims to maximize the DRAM utilization while obeying the timing constraints. The crossbar switch routes the data between different vaults and I/O links. Finally, the vault and the link controllers simply transfer data to the DRAM layers and to the off-chip I/O pins, respectively. Typically, these native control units do not fully occupy the logic layer and leave real estate for a custom logic implementation [10]. However, thermal issues limit the complexity of the custom logic.

III. HAMLET ARCHITECTURE

We propose the HAMLeT architecture shown in Figure 2(c) for efficient reorganization of the data layout in the memory. The processor offloads the memory layout transform to HAMLeT, and the data reorganization is handled in the memory, avoiding the latency and energy overhead of the roundtrip data movement through the memory hierarchy. Overall, memory layout transform operations can