Exploiting 3D-Stacked Memory Devices

Rajeev Balasubramonian

School of Computing, University of Utah

Oct 2012

Power Contributions

[Figure: percentage of total server power consumed by the processor and by memory]

Example IBM Server

Source: P. Bose, WETI Workshop, 2012

Reasons for Memory Power Increase

• Innovations for the processor, but not for memory

• Harder to get to memory (buffer chips)

• New workloads that demand more memory: SAP HANA in-memory databases, SAS in-memory analytics

The Cost of Data Movement

• 64-bit double-precision FP MAC: 50 pJ (NSF CPOM Workshop report)

• 1 instruction on an ARM Cortex A5: 80 pJ (ARM datasheets)

• Fetching 256-bit block from a distant cache bank: 1.2 nJ (NSF CPOM Workshop report)

• Fetching 256-bit block from an HMC device: 2.68 nJ (Jeddeloh and Keeth, 2012 Symp. on VLSI Technology)

• Fetching 256-bit block from a DDR3 device: 16.6 nJ (Jeddeloh and Keeth, 2012 Symp. on VLSI Technology)
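To put the numbers above on a common scale, the short sketch below converts each 256-bit fetch into energy per bit and into the equivalent number of 64-bit FP MACs. It uses only the figures quoted on this slide; the dictionary keys and helper names are illustrative, not from the talk.

```python
# Energy figures quoted on this slide, converted to picojoules.
ENERGY_PJ = {
    "fp_mac_64bit": 50.0,
    "arm_cortex_a5_instruction": 80.0,
    "distant_cache_bank_256bit": 1200.0,
    "hmc_256bit": 2680.0,
    "ddr3_256bit": 16600.0,
}
BLOCK_BITS = 256  # block size used for the three fetch figures


def pj_per_bit(total_pj, bits=BLOCK_BITS):
    """Energy per bit for a block fetch."""
    return total_pj / bits


def cost_in_macs(total_pj):
    """How many 64-bit FP MACs cost the same energy."""
    return total_pj / ENERGY_PJ["fp_mac_64bit"]


hmc_pj_per_bit = pj_per_bit(ENERGY_PJ["hmc_256bit"])
ddr3_pj_per_bit = pj_per_bit(ENERGY_PJ["ddr3_256bit"])
hmc_in_macs = cost_in_macs(ENERGY_PJ["hmc_256bit"])
ddr3_in_macs = cost_in_macs(ENERGY_PJ["ddr3_256bit"])
ddr3_vs_hmc = ENERGY_PJ["ddr3_256bit"] / ENERGY_PJ["hmc_256bit"]

print(f"HMC:  {hmc_pj_per_bit:.1f} pJ/bit, {hmc_in_macs:.0f} MACs per fetch")
print(f"DDR3: {ddr3_pj_per_bit:.1f} pJ/bit, {ddr3_in_macs:.0f} MACs per fetch")
print(f"DDR3 / HMC energy ratio: {ddr3_vs_hmc:.1f}x")
```

On these figures a single DDR3 block fetch costs as much as several hundred floating-point operations, which is the motivation for reducing data movement.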

Memory Basics

[Figure labels: Host Multi-Core Processor; MC, MC, …; SMB/SMI; FB-DIMM]

Micron Hybrid Memory Cube Device

HMC Architecture

Key Points

• HMC allows logic layer to easily reach DRAM chips

• Open question: new functionalities on the logic chip – cores, routing, refresh, scheduling

• Data transfer out of the HMC is just as expensive as before

Near Data Computing … to cut off-HMC movement

Intelligent Network-of-Memories … to reduce hops

Near Data Computing (NDC)

Timely Innovation

• A low-cost way to achieve NDC

• Workloads that are embarrassingly parallel

• Workloads that are increasingly memory bound

• Mature frameworks (MapReduce) in place

Open Questions

• What workloads will benefit from this?

• What causes the benefit?

Workloads

• Initial focus on MapReduce, but any workload with localized data access patterns will be a good fit

• Map phase in MapReduce: the dataset is partitioned and each Map phase works on its “split”; embarrassingly parallel, localized data access, often the bottleneck; e.g., count word occurrences in each individual document

• Reduce phase in MapReduce: aggregates the results of many mappers; requires random access of data; but deals with less data than Mappers; e.g., summing up the occurrences for each word
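The word-count example mentioned in the two bullets above can be sketched in a few lines. This is a minimal single-process illustration, not a MapReduce framework; the function names and the sample documents are made up for the example.

```python
from collections import Counter, defaultdict


def map_phase(split):
    """Mapper: count word occurrences in one document (its split).

    Only the mapper's own split is read, which is the localized
    access pattern the slide refers to.
    """
    return Counter(split.lower().split())


def reduce_phase(mapper_outputs):
    """Reducer: sum the per-document counts for each word."""
    totals = defaultdict(int)
    for partial in mapper_outputs:
        for word, count in partial.items():
            totals[word] += count
    return dict(totals)


# Hypothetical input: one string per document.
documents = [
    "the cube stacks DRAM",
    "the logic layer reaches the DRAM",
]
partials = [map_phase(doc) for doc in documents]  # independent, parallelizable
totals = reduce_phase(partials)                   # touches every mapper's output
print(totals)
```

Each mapper call is independent of the others, while the reducer has to gather results from all of them, which is why the two phases stress the memory system differently.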

Baseline Architecture

• Mappers and Reducers both execute on the host processor

• Many simple cores are better than a few complex cores

• 2 sockets, 256 GB memory, processing power budget 260 W, 512 ARM cores (EE-Cores) per socket, each core at 876 MHz

NDC Architecture

• Mappers execute on ND Cores; Reducers execute on the host processor

• 32 cores per HMC; 2048 total ND Cores and 1024 total EE-Cores; 260 W total processing power budget
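As a quick consistency check on the core counts quoted for the baseline and NDC configurations, the arithmetic below derives the number of HMCs implied by the ND Core totals. The HMC count is an inference from the stated numbers, not a figure given on the slide.

```python
# Figures stated on the Baseline and NDC Architecture slides.
SOCKETS = 2
EE_CORES_PER_SOCKET = 512
ND_CORES_PER_HMC = 32
TOTAL_ND_CORES = 2048

total_ee_cores = SOCKETS * EE_CORES_PER_SOCKET     # matches the 1024 EE-Cores quoted
implied_hmcs = TOTAL_ND_CORES // ND_CORES_PER_HMC  # inferred, not stated on the slide
nd_to_ee_ratio = TOTAL_ND_CORES / total_ee_cores   # ND Cores per EE-Core

print(total_ee_cores, implied_hmcs, nd_to_ee_ratio)
```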

NDC Memory Hierarchy

• Memory latency excludes delay for link queuing and traversal

• Many row buffer hits

• L1 I and D caches per ND Core

• The vault has space reserved for intermediate outputs, and Mapper/Runtime code/data

Methodology

• Three workloads: Range-Aggregate (count occurrences of something), Group-By (count occurrences of everything), Equi-Join (for two databases, count the pairs that have similar attributes)

• Dataset: 1998 World Cup web server logs

• Simulations of individual mappers and reducers on EE-cores on the TRAX simulator
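The three workloads can be illustrated with a small sketch. It assumes records are dictionaries and treats "similar attributes" in Equi-Join as equal values of one field; the field names and sample records are hypothetical and are not taken from the World Cup logs.

```python
from collections import Counter


def range_aggregate(records, field, low, high):
    """Count records whose field value falls in [low, high]."""
    return sum(1 for r in records if low <= r[field] <= high)


def group_by(records, field):
    """Count occurrences of every distinct value of the field."""
    return Counter(r[field] for r in records)


def equi_join_count(left, right, field):
    """Count (left, right) record pairs with equal field values."""
    right_counts = Counter(r[field] for r in right)
    # A Counter returns 0 for values that never appear on the right.
    return sum(right_counts[l[field]] for l in left)


# Hypothetical log records.
logs_a = [
    {"client": 1, "bytes": 100},
    {"client": 2, "bytes": 900},
    {"client": 1, "bytes": 400},
]
logs_b = [
    {"client": 1, "bytes": 50},
    {"client": 3, "bytes": 70},
]

in_range = range_aggregate(logs_a, "bytes", 100, 500)
per_client = group_by(logs_a, "client")
matching_pairs = equi_join_count(logs_a, logs_b, "client")
print(in_range, dict(per_client), matching_pairs)
```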

Single Thread Performance

Effect of Bandwidth

Exec Time vs. Frequency

Maximizing the Power Budget

Scaling the Core Count

Energy Reduction

Results Summary

• Execution time reductions of 7%-89%

• NDC performance scales better with core count

• Energy reduction of 26%-91%

No bandwidth limitation; lower memory access latency; lower bit transport energy

Intelligent Network of Memories

• How should several HMCs be connected to the processor?

• How should data be placed in these HMCs?

Contributions

• Evaluation of different network topologies: route adaptivity does help

• Page placement to bring popular data to nearby HMCs: percolate-down based on page access counts

• Use of router bypassing under low load

• Use of deep sleep modes for distant HMCs

Topologies

[Figure: network topologies; surviving panel labels: (d) F-Tree, (e) T-Tree]

Network Properties

• Supports 44-64 HMC devices with 2-4 rings

• Adaptive routing (deadlock avoidance based on timers)

• An entire page resides in one ring, but cache lines are striped across the channels
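The page and cache-line placement rule in the last bullet can be sketched as an address mapping. The page size, line size, channel count, and modulo striping below are all assumptions chosen for illustration; the slide states only that a page lives in one ring and its cache lines are striped across channels.

```python
# Assumed parameters (not given on the slide).
PAGE_BYTES = 4096
LINE_BYTES = 64
CHANNELS = 4


def locate(address, page_to_ring):
    """Return (ring, channel) for a physical address.

    page_to_ring is a hypothetical OS-maintained table mapping a page
    number to the single ring that holds the whole page.
    """
    page = address // PAGE_BYTES
    line_in_page = (address % PAGE_BYTES) // LINE_BYTES
    ring = page_to_ring[page]
    channel = line_in_page % CHANNELS  # consecutive lines rotate over channels
    return ring, channel


page_to_ring = {0: 0, 1: 2}
first_line = locate(0, page_to_ring)
second_line = locate(64, page_to_ring)
other_page = locate(PAGE_BYTES + 5 * LINE_BYTES, page_to_ring)
print(first_line, second_line, other_page)
```

Under this mapping, consecutive lines of one page stay in the same ring but use different channels, so a streaming access to a page can draw bandwidth from all channels at once.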

Percolate-Down Page Placement

• New pages are placed in nearest ring

• Periodically, inactive pages are demoted to the next ring; thresholds matter because of queuing delays

• Activity is tracked with the multi-queue algorithm: hierarchical queues, each entry has a timer and an access count, demotion to lower queue if timer expires, promotion to higher queue if access count is high

• Page migration off the critical path, striped across many channels, distant links are under-utilized
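The activity tracking described above can be sketched as follows. This is a simplified reading of the multi-queue idea, assuming a fixed promotion threshold and a fixed timer lifetime; the class names, parameters, and the rule that expired pages in the lowest queue become demotion candidates are illustrative choices, not the authors' implementation.

```python
class PageEntry:
    """Per-page state: queue level, access count, and timer expiry."""

    def __init__(self, now, lifetime):
        self.level = 0
        self.accesses = 0
        self.expires = now + lifetime


class MultiQueueTracker:
    def __init__(self, levels=4, lifetime=100, promote_at=4):
        self.levels = levels          # number of hierarchical queues
        self.lifetime = lifetime      # timer length, in arbitrary time units
        self.promote_at = promote_at  # accesses needed to move up one queue
        self.pages = {}

    def access(self, page, now):
        """Record an access; promote the page if its count is high enough."""
        entry = self.pages.setdefault(page, PageEntry(now, self.lifetime))
        entry.accesses += 1
        entry.expires = now + self.lifetime  # an access restarts the timer
        if entry.accesses >= self.promote_at and entry.level < self.levels - 1:
            entry.level += 1
            entry.accesses = 0

    def tick(self, now):
        """Demote pages whose timer expired to the next lower queue."""
        for entry in self.pages.values():
            if now >= entry.expires and entry.level > 0:
                entry.level -= 1
                entry.expires = now + self.lifetime

    def demotion_candidates(self, now):
        """Pages idle in the lowest queue: candidates to move to the next ring."""
        return sorted(
            page for page, entry in self.pages.items()
            if entry.level == 0 and now >= entry.expires
        )


tracker = MultiQueueTracker(levels=3, lifetime=10, promote_at=2)
tracker.access("hot", now=0)
tracker.access("cold", now=0)
tracker.access("hot", now=1)   # second access promotes "hot" to level 1
tracker.tick(now=10)
candidates = tracker.demotion_candidates(now=10)
print(candidates)
```

A periodic migration task would move the candidate pages one ring outward, which matches the percolate-down behavior in which new pages start in the nearest ring and inactive ones drift away.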

Router Bypassing

• Topologies with more links and adaptive routing (T-Tree) are better… but distant links experience relatively low load

• While a complex router is required for the T-Tree, the router can often be bypassed

Power-Down Modes

• Activity shifts to nearby rings, leaving distant HMCs under-utilized

• Can power off the DRAM layers (PD-0) and the SerDes circuits (PD-1)

• 26% energy saving for a 5% performance penalty

Methodology

• 128-thread traces of NAS parallel benchmarks (capacity requirements of nearly 211 GB)

• Detailed simulations with 1 billion memory access traces, confirmatory page-access simulations for the entire application

• Power breakdown: 3.7 pJ/bit for DRAM access, 6.8 pJ/bit for HMC logic layer, 3.9 pJ/bit for a 5x5 router
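The power breakdown above lends itself to a simple per-access energy model. The sketch assumes that DRAM access and logic-layer energy are paid once at the destination HMC and that each traversed router adds its per-bit cost; that accounting is a simplifying assumption for illustration, not the methodology of the talk.

```python
# Per-bit energies quoted in the power breakdown (picojoules per bit).
DRAM_PJ_PER_BIT = 3.7
LOGIC_PJ_PER_BIT = 6.8
ROUTER_PJ_PER_BIT = 3.9  # for a 5x5 router


def access_energy_pj(bits, router_hops):
    """Estimated energy of one access that crosses router_hops routers.

    Assumption: DRAM and logic-layer energy are charged once, router
    energy once per hop.
    """
    per_bit = DRAM_PJ_PER_BIT + LOGIC_PJ_PER_BIT + router_hops * ROUTER_PJ_PER_BIT
    return bits * per_bit


local = access_energy_pj(bits=256, router_hops=0)
two_hops = access_energy_pj(bits=256, router_hops=2)
print(f"0 hops: {local / 1000:.2f} nJ, 2 hops: {two_hops / 1000:.2f} nJ")
```

With 256 bits and zero hops the model gives roughly 2.69 nJ, close to the 2.68 nJ quoted for an HMC block fetch under The Cost of Data Movement. Under this model, each additional hop adds about 1 nJ per 256-bit block, which is why placing popular pages in nearby HMCs matters.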

Results – Normalized Exec Time

• T-Tree P-Down reduces exec time by 50%

• 86% of flits bypass the router

• 88% of requests serviced by Ring-0

Results – Energy

Summary

• Must reduce data movement on off-chip memory links

• NDC reduces energy, improves performance by overcoming the bandwidth wall

• More work required to analyze workloads, build software frameworks, analyze thermals, etc.

• iNoM uses OS page placement to minimize hops for popular data and increase power-down opportunities

• Path diversity is useful, router overhead is small

Acknowledgements

• Co-authors: Kshitij Sudan, Seth Pugsley, Manju Shevgoor, Jeff Jestes, Al Davis, Feifei Li

• Group funded by: NSF, HP, Samsung, IBM

Backup Slide
