
Page 1

Improving Node-level MapReduce Performance using Processing-in-Memory Technologies

Mahzabeen Islam, Marko Scrbak, and Krishna M. Kavi
Computer Systems Research Laboratory, Department of Computer Science & Engineering, University of North Texas, USA

Mike Ignatowski and Nuwan Jayasena

AMD Research - Advanced Micro Devices, Inc., USA

Page 2

Overview

•  Introduction
•  Motivation
•  Proposed Model
   •  Server Architecture
   •  Programming Framework
•  Experiments
•  Results
•  Conclusion and Future Work
•  Related Work
•  References


Page 3

Introduction

•  3D-stacked DRAM consists of DRAM dies stacked on top of a logic die, and it provides
   •  higher memory bandwidth,
   •  lower access latencies, and
   •  lower energy consumption than existing DRAM technologies
Ø Hybrid Memory Cube (HMC): capacity 2-4 GB, bandwidth 160 GB/s (15x DDR3), 70% less energy per bit 1
•  The bottom logic die contains peripheral circuitry (row decoders, sense amplifiers, etc.), but there is still enough silicon for other logic


•  3D-DRAM can be used as a large last-level cache, as main memory, or as a buffer to PCM
•  SRAM can be integrated in the logic layer to aid address translation (hardware page tables)
•  A recent trend is to put processing capabilities in the logic layer


Page 4

Processing in Memory

•  Processing-In-Memory (PIM) is the concept of moving computation closer to memory

•  Advantages:
Ø Low access latency, high memory bandwidth, and a high degree of parallelism can be achieved by adding simple processing cores in memory
Ø Minimize cache pollution by not transferring some data to the main cores
Ø Data-intensive/memory-bound applications, which do not benefit from conventional cache hierarchies, could benefit from PIM

•  Concerns:
Ø Designing an appropriate system architecture
§  Too many design choices – main processor, PIM processors, memory hierarchy, communication channels, interfaces
Ø Requires changes to the operating system (memory management), the programming framework (e.g., the MapReduce library), and programming models (synchronization, coherence)


Page 5

Our Work

•  3D-stacked DRAM has generated renewed interest in PIM
•  We can use several low-power cores in the logic layer of a 3D-DRAM to execute memory-bound functions closer to memory
•  Our current research focuses on Big Data analyses based on the MapReduce programming model
Ø Map functions are good candidates for execution on PIM processors
Ø We propose and evaluate a server architecture here
Ø MapReduce is modified for shared-memory processors
§ We plan to investigate using PIM for other parts of MapReduce applications
§ And for other classes of applications (scale-out applications)
§  Contemporary research shows that emerging scale-out applications do not benefit from conventional processor architectures and cache hierarchies 2


Page 6

Proposed Server Architecture


Fig: A 3D Memory Unit – memory dies stacked on a logic die that holds the PIM cores and the PIM and DRAM controllers; the host sees an abstract load/store interface, while the memory dies are accessed over a timing-specific DRAM interface.

•  Host processor connected to multiple 3D Memory Units (3DMUs)
•  PIM cores sit in the logic layer of each 3DMU
•  Simple, in-order, single-issue, energy-efficient PIM cores with only L1 caches
•  Processes running on the host control the execution of PIM threads
•  Unified memory view, as proposed by the Heterogeneous System Architecture (HSA) Foundation
•  A number of such nodes will make up a cluster


Page 7

Proposed MapReduce Framework

•  Adapt MapReduce frameworks for shared-memory systems that exhibit NUMA
Ø We chose Phoenix++, which works with CMP and SMP systems
Ø We needed to modify Phoenix++ for our purpose


•  Map phase – overlapped with input reading and executed on the PIM cores (the host reads from files)
•  Reduce phase – special data structures (2D hash tables) allow local reduction in the 3DMUs to minimize the amount of data transferred during the final reduction
•  Merge phase – the initial stages can be performed by the PIM cores, and the rest by the host processor

•  Here we focus on single-node (intra-node) MapReduce operation, and assume that a global (inter-node) level of MapReduce operation will take place if we need a cluster of such nodes. (A sketch of this pipeline follows the figure below.)

Fig: A master process and four manager processes (Manager Process 0-3) run on the host; each manager feeds input splits to the PIM threads running in its 3DMU.
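To make the division of work concrete, below is a minimal C++ sketch of the pipeline this slide describes: host-side managers stream input splits to PIM map threads, map output is combined locally into a per-3DMU table, and the host performs the final merge. This is an illustration of the idea, not the authors' code and not the Phoenix++ API; all names (SplitQueue, pim_map_worker, the split and core counts) are hypothetical.

```cpp
// Minimal sketch (not the authors' implementation): one work queue per 3DMU is
// fed by the host, drained by "PIM" map threads that reduce locally into a
// per-3DMU hash table; the host merges the per-3DMU tables at the end.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <sstream>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

using LocalTable = std::unordered_map<std::string, long>;  // per-3DMU combiner

struct SplitQueue {                       // host -> PIM work queue for one 3DMU
    std::queue<std::string> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void push(std::string s) {
        { std::lock_guard<std::mutex> lk(m); q.push(std::move(s)); }
        cv.notify_one();
    }
    bool pop(std::string& out) {          // returns false when drained and closed
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return done || !q.empty(); });
        if (q.empty()) return false;
        out = std::move(q.front()); q.pop();
        return true;
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_all();
    }
};

// Stand-in for a map task on a PIM core: word count over one split, reduced
// locally so only a compact table needs to cross back to the host.
void pim_map_worker(SplitQueue& in, LocalTable& local, std::mutex& local_m) {
    std::string split;
    while (in.pop(split)) {
        LocalTable partial;
        std::istringstream ss(split);
        for (std::string w; ss >> w; ) ++partial[w];
        std::lock_guard<std::mutex> lk(local_m);
        for (auto& kv : partial) local[kv.first] += kv.second;
    }
}

int main() {
    const int kNum3DMUs = 4, kPimCoresPer3DMU = 4;   // 16 per 3DMU in the proposal
    std::vector<SplitQueue> queues(kNum3DMUs);
    std::vector<LocalTable> tables(kNum3DMUs);
    std::vector<std::mutex> table_locks(kNum3DMUs);
    std::vector<std::thread> pim_threads;

    for (int mu = 0; mu < kNum3DMUs; ++mu)
        for (int c = 0; c < kPimCoresPer3DMU; ++c)
            pim_threads.emplace_back(pim_map_worker, std::ref(queues[mu]),
                                     std::ref(tables[mu]), std::ref(table_locks[mu]));

    // Host "master": deal input splits out round-robin, overlapping reading
    // with map execution on the PIM threads.
    std::vector<std::string> input = {"a b a", "b c", "c c a", "a d"};
    for (std::size_t i = 0; i < input.size(); ++i)
        queues[i % kNum3DMUs].push(input[i]);
    for (auto& q : queues) q.close();
    for (auto& t : pim_threads) t.join();

    // Final reduce/merge on the host across the per-3DMU tables.
    LocalTable global;
    for (auto& t : tables)
        for (auto& kv : t) global[kv.first] += kv.second;
    for (auto& kv : global) std::cout << kv.first << ": " << kv.second << "\n";
}
```

In the real design each worker would run on a PIM core inside its 3DMU and the per-3DMU table would live in that 3DMU's memory; here ordinary host threads stand in for the PIM cores.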


Page 8

Experiment Setup

Fig: Baseline system – two Xeon E5 processors (P0 and P1) connected by QPI, each with its local memory (MEM0, MEM1).

Table 1: Baseline System Configuration

CPU: 2 x Xeon E5-2640, 6 cores per processor, 2 threads/core, out-of-order, 4-wide issue
Clock Speed: 2.5 GHz
L3 Cache: 15 MB per processor
Power: TDP = 95 W per processor, low-power = 15 W per processor
Memory BW: 42.6 GB/s per processor
Memory: 32 GB (8 x 4 GiB DDR3 DIMMs), NUMA enabled

Table 2: New System Configuration

Processing Unit – Host: 1 Xeon E5-2640, 6 cores, 2 threads/core, out-of-order, 4-wide issue; PIM: 64 (= 4 x 16) ARM Cortex-A5, in-order, single-issue
Clock Speed – Host: 2.5 GHz; PIM: 1 GHz
LL Cache – Host: 15 MB; PIM: 32 KB I and 32 KB D per core
Power – Host: TDP = 95 W; PIM: 80 mW/core (5.12 W for 64 cores)
Memory BW – Host: 42.6 GB/s; PIM: 1.33 GB/s per core
Memory: 32 GB (4 x 8 GiB 3DMUs)


•  Baseline vs. New System Configuration



Page 9

Experiments and Analysis

•  Our assumption is that we can overlap the reading of input data with the execution of map tasks
•  The input reading is performed by the host CPU and the map tasks by the PIM cores
Ø  We do not want the PIM cores to sit idle
Ø  We estimate the number of cores needed

Fig: Timelines (0–224 ms) of the host reading input splits into the 3DMUs and of PIM-core activity (idle/busy) in 3DMU0–3DMU3: (a) PIM cores mostly idle, (b) PIM core utilization is high.


Page 10

Experiments and Analysis


How many PIM cores per 3DMU do we need?

•  The time the PIM cores in a 3DMU take to process an input split should be no greater than the time the host takes to read the splits that feed that 3DMU:

   (s × t_map) / n ≤ 4 × t_read,  i.e.  n ≥ (s × t_map) / (4 × t_read)

•  s is the factor that indicates the relative slowdown of the simple PIM cores compared to the host
•  t_map is the time taken by the host to complete the map function on one input split
•  t_read is the time taken by the host to read one input split
•  There are 4 3DMUs and each contains n PIM cores


Page 11

Experiments and Analysis

•  We ran different workloads on the baseline system with Phoenix++ and measured t_map, the map time per input split (Table 3)
•  We used two different storage technologies, HDD and SSD, on the baseline system to measure t_read, the read time per input split (Table 4); an illustrative timing sketch follows Table 4 below


Table 3: Measured map time per input split (t_map)

Workload            t_map
word count          25 ms
histogram           7 ms
string match        12 ms
linear regression   7 ms

Table 4: Measured read time per input split (t_read)

Storage   t_read
HDD       10.42 ms
SSD       2.17 ms
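How such per-split times can be obtained is easy to sketch. The following is an illustrative measurement only (not the authors' harness); the file name and split size are assumptions:

```cpp
// Illustrative timing of t_read and t_map for one input split: read a
// fixed-size split from a file, then run a word-count map over it, timing
// each step separately with std::chrono.
#include <chrono>
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    const std::size_t kSplitBytes = 16 * 1024 * 1024;    // assumed split size
    std::vector<char> split(kSplitBytes);

    auto t0 = std::chrono::steady_clock::now();
    std::ifstream in("input.txt", std::ios::binary);     // hypothetical input file
    in.read(split.data(), split.size());
    auto t1 = std::chrono::steady_clock::now();

    std::unordered_map<std::string, long> counts;        // map + local combine
    std::istringstream ss(std::string(split.data(), static_cast<std::size_t>(in.gcount())));
    for (std::string w; ss >> w; ) ++counts[w];
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::printf("t_read = %.2f ms, t_map = %.2f ms\n",
                ms(t1 - t0).count(), ms(t2 - t1).count());
    return 0;
}
```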


Page 12

Experiments and Analysis

•  To achieve full utilization of the PIM cores, the following must hold: n ≥ (s × t_map) / (4 × t_read)
•  We estimated how many cores (n) we need to achieve the overlap of map tasks with input reading for different storage technologies and slowdown factors
•  For an estimated slowdown factor of 4 for the PIM cores, we need fewer than 16 cores per 3DMU – but we use 16 cores to handle stragglers (a worked calculation follows the figure below)

Fig: Required number of PIM cores for different slowdown factors
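As a worked check of this sizing argument (a sketch under the assumption that input splits are dealt round-robin to the 4 3DMUs, so each 3DMU receives a new split every 4 × t_read), the following computes the required n from the Table 3 and Table 4 measurements for a few assumed slowdown factors:

```cpp
// Required PIM cores per 3DMU so that map work keeps pace with input reading:
//   (s * t_map) / n <= 4 * t_read   =>   n >= ceil(s * t_map / (4 * t_read))
// t_map and t_read values are the measurements from Tables 3 and 4 (ms).
#include <cmath>
#include <cstdio>
#include <initializer_list>

int main() {
    const struct { const char* name; double t_map_ms; } workloads[] = {
        {"word count", 25}, {"histogram", 7}, {"string match", 12}, {"linear regression", 7}};
    const struct { const char* name; double t_read_ms; } storage[] = {
        {"HDD", 10.42}, {"SSD", 2.17}};
    const int kNum3DMUs = 4;                    // splits dealt round-robin (assumption)

    for (double s : {2.0, 4.0, 8.0}) {          // assumed PIM slowdown factors
        for (const auto& st : storage)
            for (const auto& w : workloads) {
                int n = (int)std::ceil(s * w.t_map_ms / (kNum3DMUs * st.t_read_ms));
                std::printf("s=%.0f  %-3s  %-17s  n >= %d\n", s, st.name, w.name, n);
            }
    }
    return 0;
}
```

For word count on SSD with s = 4 this gives n ≥ 12, consistent with needing fewer than 16 cores per 3DMU.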


Page 13

Feasibility Analysis

•  Embedding 16 PIM cores on the logic layer of a 3DMU is possible
Ø We use the ARM Cortex-A5 as the PIM core to estimate the silicon area needed
§  40 nm process technology
§  1 GHz clock
§  32 KB D and 32 KB I L1 caches

•  Area?
Ø Each PIM core needs an area of 0.80 mm2
Ø The area overhead for 16 PIM cores is only ~12% of the logic die 3

•  Power budget?
Ø Each PIM core has an average power consumption of 80 mW
Ø 16 PIM cores will consume 1.28 W, which is only ~13% of the 10 W TDP budget of the logic layer 4 (a quick arithmetic check follows below)
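The area and power claims follow from simple arithmetic; a small check (the logic-die size is an inference from the stated ~12% overhead, not a number given on the slide):

```cpp
// Back-of-the-envelope check of the area and power budget for 16 PIM cores.
#include <cstdio>

int main() {
    const int cores = 16;
    const double core_area_mm2 = 0.80;   // ARM Cortex-A5, 40 nm, 1 GHz (from this slide)
    const double core_power_w  = 0.080;  // 80 mW average per core
    const double logic_tdp_w   = 10.0;   // logic-layer TDP budget cited on this slide

    const double area_mm2 = cores * core_area_mm2;   // 12.8 mm^2
    const double power_w  = cores * core_power_w;    // 1.28 W

    std::printf("area:  %.1f mm^2 (a ~12%% overhead implies a logic die of roughly %.0f mm^2)\n",
                area_mm2, area_mm2 / 0.12);
    std::printf("power: %.2f W, i.e. %.0f%% of the %.0f W logic-layer budget\n",
                power_w, 100.0 * power_w / logic_tdp_w, logic_tdp_w);
    return 0;
}
```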


3 Pugsley, S. H., Jestes, J., Zhang, H.: NDC: Analyzing the Impact of 3D-Stacked Memory+Logic Devices on MapReduce Workloads. In: ISPASS (2014)
4 Zhang, D., Jayasena, N., Lyashevsky, A., et al.: TOP-PIM: Throughput-Oriented Programmable Processing in Memory. In: HPDC (2014)

Page 14

Performance Analysis


•  The performance gain using 16 PIM cores per 3DMU, assuming a slowdown factor of 4, is shown here
•  We reduce the execution time by the map time (t_map), since the map phase is now overlapped with input reading
•  The average reduction is 8%
•  We expect that overlapping some merge and reduce functions as well would lead to higher gains
•  Gains are higher if the input does not fit in memory


Page 15

Energy Analysis

•  Data is for 16 PIM cores in each 3DMU (64 in total)
•  Relative energy savings range from 10% to 23%, with absolute energy savings ranging from 80 J to 2045 J
•  The energy reduction is due to the low-power cores


Fig: Energy consumption by the processing units

Page 16

Bandwidth Utilization

•  On the baseline system, the peak bandwidth consumption is less than 15 GB/s
•  Assuming 64 PIM cores, the peak bandwidth consumption is 60 GB/s (15 GB/s at each 3DMU)
•  Each SerDes link provides 40 GB/s at 5 W power consumption 5
•  64 ARM Cortex-A5 cores (if placed on the host chip) could consume at most 88 GB/s 6, requiring at least 3 SerDes links to transfer data between the host and the 3D-DRAM
•  But when we use PIM cores, these high-bandwidth links are not required

Fig: Bandwidth consumption when running wordcount on two different systems.


Page 17

Summary and conclusions

•  We propose to use simple energy efficient cores embedded in the logic layer of 3D-DRAMs

•  We show how our architecture can be used for MapReduce workloads

•  We estimated the number of PIM cores needed to achieve a good balance of work between host and PIMs

•  We have shown that we achieve both energy savings and performance gains

•  We have found that most of the applications do not need the high bandwidth offered (or proposed) by current prototypes of 3D-stacked DRAMs


Page 18

Future Work

•  Conduct experiments using a variety of scale-out applications

•  Investigate the impact of reduced memory bus traffic on the memory hierarchy
Ø Bandwidth utilization
§ Use low-energy buses with smaller bandwidth instead of high-speed SerDes links
Ø Alternative cache organizations
§ What are the savings?

•  ARM Cortex-A5
Ø Do we need this much processor complexity?
Ø Simpler RISC cores? Even lower energy consumption
Ø GPGPUs, FPGAs

•  Estimate the performance of a cluster composed of the proposed nodes


Page 19

Related Study

•  Several studies have been conducted on using 3D-DRAM as a last-level cache or as main memory [1, 2]

•  Researchers worked on the PIM idea a decade ago [3, 4, 5, 6], but integrating DRAM and logic on the same die was not very successful at that time.

•  The Phoenix++ MapReduce framework [7] works for conventional large-scale shared-memory CMP and SMP systems. We use Phoenix++ as our basis and propose changes to adapt it to our PIM architecture.

•  The Near Data Computing (NDC) architecture [8] is similar in spirit to our study and assumes 3D-DRAMs embedded with processing cores.
Ø The NDC study works only with in-memory MapReduce workloads

•  A recent study on scale-out cloud applications shows that they do not benefit from common server-class processors with complex microarchitectures and deep cache hierarchies, and that they do not need high-bandwidth on- and off-chip links [9].


Page 20

References

1.  Black, B., Annavaram, M., Brekelbaum, N., DeVale, et al.: Die stacking (3D) microarchitecture. In: MICRO, pp. 469-479. IEEE (2006)

2.  Zhang, D. P., Jayasena, N., Lyashevsky, A., et al.: A new perspective on processing-in-memory architecture design. In: Proceedings of the ACM SIGPLAN Workshop on Memory Systems Performance and Correctness (2013)

3.  Patterson, D., Anderson, T., Cardwell, N., et al.: A case for intelligent RAM. In: IEEE Micro, 17(2), 34-44 (1997)

4.  Torrellas, J.: FlexRAM: Toward an advanced intelligent memory system: A retrospective paper. In: Intl. Conference on Computer Design, pp. 3-4. IEEE (2012)

5.  Draper, J., Chame, J., Hall, M., et al.: The architecture of the DIVA processing-in-memory chip. In: Proceedings of the International Conference on Supercomputing, pp. 14-25. ACM (2002)

6.  Rezaei, M., Kavi, K. M.: Intelligent memory manager: Reducing cache pollution due to memory management functions. In: Journal of Systems Architecture, 52(1), 41-55 (2006)

7.  Talbot, J., Yoo, R. M., Kozyrakis, C.: Phoenix++: Modular MapReduce for shared-memory systems. In: Proceedings of the International Workshop on MapReduce and its Applications, pp. 9-16. ACM (2011)

8.  Pugsley, S. H., Jestes, J., Zhang, H.: NDC: Analyzing the Impact of 3D-Stacked Memory+Logic Devices on MapReduce Workloads. In: International Symposium on Performance Analysis of Systems and Software (ISPASS) (2014)

9.  Ferdman, M., Adileh, A., Kocberber, O., et al.: A Case for Specialized Processors for Scale-Out Workloads. In: IEEE Micro, pp. 31-42 (2014)


Page 21

Thank You! Questions?
