
Pinatubo: A Processing-in-Memory Architecture for Bulk Bitwise Operations in Emerging Non-volatile Memories

Shuangchen Li1∗, Cong Xu2, Qiaosha Zou1,5, Jishen Zhao3, Yu Lu4, and Yuan Xie1

University of California, Santa Barbara1, Hewlett Packard Labs2

University of California, Santa Cruz3, Qualcomm Inc.4, Huawei Technologies Inc.5
{shuangchenli, yuanxie}@ece.ucsb.edu1

ABSTRACT
Processing-in-memory (PIM) provides high bandwidth, massive parallelism, and high energy efficiency by implementing computations in main memory, thereby eliminating the overhead of data movement between the CPU and memory. While most recent work has focused on PIM in DRAM with 3D die-stacking technology, we propose to leverage the unique features of emerging non-volatile memory (NVM), such as resistance-based storage and current sensing, to enable efficient PIM design in NVM. We propose Pinatubo1, a Processing In Non-volatile memory ArchiTecture for bUlk Bitwise Operations. Instead of integrating complex logic inside the cost-sensitive memory, Pinatubo redesigns the read circuitry so that it can compute the bitwise logic of two or more memory rows very efficiently and support one-step multi-row operations. Experimental results on data-intensive graph processing and database applications show that Pinatubo achieves a ∼500× speedup and ∼28000× energy saving on bitwise operations, and a 1.12× overall speedup and 1.11× overall energy saving over a conventional processor.

1. INTRODUCTION
In the big data era, the “memory wall” is becoming the toughest challenge as we move towards exascale computing. Moving data is much more expensive than computing itself: a DRAM access consumes 200 times more energy than a floating-point operation [14]. Memory-centric processing-in-memory (PIM) and near-data computing (NDC) architectures appear as promising approaches to address

∗This work was supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research under Award number DE-SC0013553, with disclaimer at http://seal.ece.ucsb.edu/doe/. It was also supported in part by NSF 1533933, 1461698 and 1500848, and a grant from Qualcomm. Zhao is supported by UCSC start-up funding.

1Mount Pinatubo is an active volcano that erupted in 1991. We envision our design invigorating future PIM research, similar to the rejuvenation of life after a volcanic eruption.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

DAC ’16, June 05-09, 2016, Austin, TX, USA
© 2016 ACM. ISBN 978-1-4503-4236-0/16/06…$15.00

DOI: http://dx.doi.org/10.1145/2897937.2898064

such challenges. By designing processing units inside/near main memory, PIM/NDC dramatically reduces the overhead of data movement. It also takes advantage of the large memory-internal bandwidth, and hence massive parallelism. For example, the internal bandwidth of the hybrid memory cube (HMC) [20] is 128 times larger than that of its SerDes interface [4].

Early PIM efforts [19] were unsuccessful due to practical concerns: they integrated processing units and memory on the same die. Unfortunately, designing and manufacturing performance-optimized logic together with density-optimized memory is not cost-effective. For instance, complex logic designs require extra metal layers, which is undesirable for cost-sensitive memory vendors. Recent achievements in 3D-stacked memory have revived PIM research [18] by decoupling the logic and memory circuits into different dies. For example, in the HMC stacked memory structure, an extra logic die is stacked with multiple layers of DRAM dies using a massive number of through-silicon vias [18].

Meanwhile, emerging non-volatile memories (NVMs), i.e., phase-change memory (PCM) [10], spin-transfer torque magnetic random access memory (STT-MRAM) [24], and resistive random access memory (ReRAM) [8], provide promising features such as high density, ultra-low standby power, promising scalability, and non-volatility. They have shown great potential as candidates for next-generation main memory [15, 28, 27].

The goal of this paper is to show NVM’s potential for enabling PIM architecture, while almost all existing efforts focus on DRAM systems and depend heavily on 3D integration. NVM’s unique features, such as resistance-based storage (in contrast to charge-based storage in DRAM) and its current-sensing scheme (in contrast to the voltage-sensing scheme used in DRAM), are able to provide inherent computing capabilities [13, 16]. Therefore, NVM can enable PIM without requiring 3D integration. In addition, it requires only insignificant modifications to the peripheral circuitry, resulting in a cost-efficient solution. Furthermore, NVM-enabled PIM computation operates on in-memory analog signals, which is much more energy-efficient than other work that uses digital circuits.

In this paper, we propose Pinatubo, a Processing In Non-volatile memory ArchiTecture for bUlk Bitwise Operations, including OR, AND, XOR, and INV. Pinatubo works by activating two (or more) rows simultaneously; the output of the memory is then the bitwise operation result over the open rows. The result can be sent to the I/O bus or written back to another memory row directly. The major modifications to the NVM-based memory are in the sense amplifier (SA) design. Different from a normal memory read, where the SA merely differentiates the bitline resistance between Rhigh and Rlow, Pinatubo adds more reference circuits to the SA, so that it can distinguish between {Rhigh/2 (logic “0,0”), Rhigh||Rlow (logic “0,1”), Rlow/2 (logic “1,1”)} for 2-row AND/OR operations. It also potentially supports multi-row OR operations when memory cells with a high ON/OFF ratio are provided. Although we use 1T1R PCM as an example in this paper, Pinatubo does not rely on a particular NVM technology or cell structure, as long as the technology is based on resistive cells.

Our contributions in this paper are as follows:

• We propose a low-cost processing-in-NVM architecture with insignificant circuit modifications and no requirement for 3D integration.
• We design a software/hardware interface that is visible to both the programmer and the hardware.
• We evaluate our proposed architecture on data-intensive graph processing and database applications, and compare our work with a SIMD processor, accelerator-in-memory PIM, and the state-of-the-art in-DRAM computing approach.

2. NVM BACKGROUND
Although their working mechanisms and features vary, PCM, STT-MRAM, and ReRAM share a common basis: all of them use resistive cells. To represent logic “0” and “1”, they rely on the difference in cell resistance (Rhigh or Rlow). To switch between logic “0” and “1”, a voltage/current of a certain polarity, magnitude, and duration is required. The memory cells typically adopt a 1T1R structure [10], with a wordline (WL) controlling the access transistor, a bitline (BL) for data sensing, and a source line (SL) to provide write currents of different polarities.

Architecting NVM as main memory has been well studied [28, 15]. The SA design is the major difference between NVM and DRAM designs. Unlike conventional charge-based DRAM, resistance-based NVM requires a larger SA to convert the resistance difference into a voltage/current signal. Therefore, multiple adjacent columns share one SA through a multiplexer (MUX), which results in a smaller row buffer size.

Figure 1: A current-based SA (CSA) [8]. (a) Phase 1: current-sampling. (b) Phase 2: current-ratio amplification. (c) Phase 3: 2nd-stage amplification.

Fig. 1 shows the mechanism of a state-of-the-art CSA [8]. There are three phases during sensing: current-sampling, current-ratio amplification, and 2nd-stage amplification.

3. MOTIVATION AND OVERVIEW
Bitwise operations are very important and widely used in databases [26], graph processing [5], bio-informatics [21], and image processing [6]. They are applied to replace expensive arithmetic operations. Modern processors are already aware of this strong demand and have developed accelerating solutions, such as Intel’s SSE/AVX SIMD extensions.

Figure 2: Overview. (a) The computing-centric approach moves tons of data to the CPU over the narrow DDR bus and writes it back. (b) The proposed Pinatubo architecture performs n-row bitwise operations inside the NVM in one step, with only commands and row addresses on the DDR3 bus while the ALU and caches stay idle.

We propose Pinatubo to accelerate bitwise operations inside the NVM-based main memory. Fig. 2 shows an overview of our design. The conventional computing-centric architecture in Fig. 2 (a) fetches every bit-vector from memory sequentially. The data walks through the narrow DDR bus and all the memory hierarchies, and is finally processed by the limited ALUs in the cores. Even worse, the result then needs to be written back to memory, suffering the data movement overhead again. Pinatubo, shown in Fig. 2 (b), performs the bit-vector operations inside the memory. Only commands and addresses are required on the DDR bus, while all the data remains inside the memory. To perform bitwise operations, Pinatubo activates two (or more) memory rows that store bit-vectors simultaneously, and the modified SA outputs the desired result. Thanks to in-memory calculation, the result no longer needs the memory bus; it is written to the destination address directly through the write driver (WD), bypassing the I/O and bus entirely.

Pinatubo embraces two major benefits of the PIM architecture: first, the reduction of data movement; second, the large internal bandwidth and massive parallelism. Pinatubo performs memory-row-length (typically 4Kb for NVM) bit-vector operations. Furthermore, it supports multi-row operations, which compute multi-operand operations in one step, delivering an equivalent bandwidth ∼1000× larger than that of the DDR3 bus.

4. ARCHITECTURE AND CIRCUIT DESIGN
In this section, we first show the architecture design that enables NVM main memory for PIM. Then we show the circuit modifications for the SA, LWL driver, WD, and global buffers.

4.1 From Main Memory to Pinatubo
Main memory has several physical/logical hierarchy levels. Channels run in parallel, and each channel contains several ranks that share the address/data bus. Each rank typically has 8 physical chips, and each chip typically has 8 banks, as shown in Fig. 3 (a). Banks in the same chip share the I/O, and banks in different chips work in lock-step. Each bank


Figure 3: The Pinatubo Architecture. (a) Chip (inter-bank op.); (b) Bank (inter-subarray op.); (c) Mat (intra-subarray op.). Glossary: Global WordLine (GWL), Global DataLine (GDL), Local WordLine (LWL), SelectLine (SL), BitLine (BL), Column SelectLine (CSL), Sense Amplifier (SA), Write Driver (WD).

has several subarrays. As Fig. 3 (b) shows, subarrays share the GDLs and the global row buffer. One subarray contains several MATs, as shown in Fig. 3 (c), which also work in lock-step. Each MAT has its private SAs and WDs. Since an NVM SA is much larger than a DRAM SA, several (32 in our experiment) adjacent columns share one SA through a MUX.

According to the physical addresses of the operand rows, Pinatubo performs three types of bitwise operations: intra-subarray, inter-subarray, and inter-bank operations.
Intra-subarray operations. If the operand rows are all within one subarray, Pinatubo performs intra-subarray operations in each MAT of this subarray. As shown in Fig. 3 (c), the computation is done by the modified SA. Multiple rows are activated simultaneously, and the output of the modified SA is the operation result. The operation commands (e.g., AND or OR) are sent by the controller and change the reference circuit of the SA. The LWL driver is also modified to support multi-row activation. If the operation result needs to be written back to the same subarray, it is fed directly into the WDs locally as an in-place update.
Inter-subarray operations. If the operand rows are in different subarrays but in the same bank, Pinatubo performs inter-subarray operations, as shown in Fig. 3 (b). These are based on digital circuits added to the global row buffer. The first operand row is read into the global row buffer, while the second operand row is read onto the GDL. The two operands are then combined by the add-on logic, and the final result is latched in the global row buffer.
Inter-bank operations. If the operand rows are in different banks but still in the same chip, Pinatubo performs inter-bank operations, as shown in Fig. 3 (a). These are done by the add-on logic in the I/O buffer and have a mechanism similar to inter-subarray operations.

Note that Pinatubo does not handle operations between bit-vectors that are either in the same row or in different chips. Such operations can be avoided by optimized memory mapping, as shown in Section 5.

4.2 Peripheral Circuitry Modification
SA Modification: The key idea of Pinatubo is to use the SA for intra-subarray bitwise operations. The working mechanism of the SA is shown in Fig. 5. Different from charge-based DRAM/SRAM, the SA for NVM senses the resistance on the BL. Fig. 5 shows the BL resistance distributions during read and OR operations, as well as the reference value assignment. Fig. 5 (a) shows the sensing mechanism for a normal read (though the SA actually senses currents, the figure presents the distribution of resistance for simplicity). The resistance of a single cell (either Rlow or Rhigh) is compared with the reference value (Rref-read), deciding the result between “0” and “1”. For bitwise operations, an example of a 2-row OR operation is shown in Fig. 5 (b). Since two rows are activated simultaneously, the resistance on the BL is the parallel connection of two cells. There are three possible situations: Rlow||Rlow (logic “1”,“1”), Rlow||Rhigh (“1”,“0”), and Rhigh||Rhigh (“0”,“0”)2. To perform OR, the SA should output “1” for the first two situations and “0” for the last one. To achieve this, we simply shift the reference value to the middle of Rlow||Rhigh and Rhigh||Rhigh, denoted as Rref-or. Note that we assume variation is well controlled, so that no overlap exists between the “1” and “0” regions. In summary, to compute AND and OR, we only need to change the reference value of the SA.

Figure 5: Modifying Reference Values in the SA to Enable Pinatubo. (a) The SA reads with Rref-read: Rlow falls in the “1” region, Rhigh in the “0” region. (b) The SA processes OR with Rref-or: Rlow||Rlow and Rlow||Rhigh fall in the “1” region, Rhigh||Rhigh in the “0” region.

Figure 6: Current Sense Amplifier (CSA) Modification (left) and HSPICE Validation (right).

Fig. 6 (a) shows the corresponding circuit modification based on the CSA [8] introduced in Section 2. As explained above, we add two more reference circuits to support AND/OR operations. For XOR, we need two micro-steps: first, one operand is read onto the capacitor Ch; second, the other operand is read into the latch. The output of the two add-on transistors is the XOR result. For INV, we simply output the differential value from the latch. The output is selected among the READ, AND, OR, XOR, and INV results by a MUX. Fig. 6 (b) shows the HSPICE validation of the proposed circuit. The circuit is tested with a large range of cell resistances from recent PCM, STT-MRAM, and ReRAM prototypes [23].
Multi-row Operations: Pinatubo supports multi-row operations that further accelerate bitwise operations. A multi-row operation is defined as computing the result of multiple operands in one operation. For PCM and ReRAM

2“||” denotes the product-over-sum (parallel resistance) operation.


Figure 4: Pinatubo System Support. The software stack spans the programming model (pim_malloc(); pim_op(dst, src1, src2, data_t, op_t, len)), a driver library, a C run-time library with pim-aware malloc, and an OS with pim-aware memory management that exposes PAs by sys-call, schedules operations, and extends the ISA; hardware control uses DDR commands/addresses and Mode Register 4 (MR4).

which encode Rhigh as logic “0”, Pinatubo can compute n-row OR operations3. After activating n rows simultaneously, Pinatubo needs to differentiate the bit combination with only one “1” (which results in “1”) from the combination with all “0”s (which results in “0”). This leads to a reference value between Rlow||(Rhigh/(n−1)) and Rhigh/n. This sensing margin is similar to that of TCAM designs [17]. State-of-the-art PCM-based TCAM supports a 64-bit WL with two cells per bit; therefore we assume at most 128-row operations for PCM. For STT-MRAM, since the ON/OFF ratio is already low, we conservatively assume at most 2-row operations.
LWL Driver Modification: Conventional memory activates one row at a time. However, Pinatubo requires multi-row activation, where each activation is a random access. The modifications of the LWL driver circuit and the HSPICE validation are shown in Fig. 7. Normally, the LWL driver amplifies the decoded address signal with a group of inverters. We modify each LWL driver by adding two more transistors. The first transistor feeds the signal between the inverters back and serves as a latch. The second transistor forces the driver’s input to ground. A multi-row activation first sends out the RESET signal, making sure that no WL has latched anything. Then, every time an address is decoded, the selected WL signal is latched and stays at VDD until the next RESET signal arrives. Therefore, after all the addresses have been issued, all the corresponding selected WLs are driven to the high voltage.

Figure 7: Local Wordline (LWL) Driver Modification (left) and HSPICE Validation (right).

WD Modification: Fig. 8 (a) shows the modification to a WD for STT-MRAM/ReRAM. We do not show PCM’s WD, since it is simpler, with a unidirectional write current. The write current/voltage is set on the BL or SL according to the write input data. Normally, the WD’s input comes from the data bus. We modify the WD circuit so that the SA result can be fed directly to the WD. This circuit bypasses the bus overhead when writing results back to memory.

Figure 8: (a) Modifications to the Write Driver (WD). (b) Modifications for Inter-Sub/Bank Operations.

3Multi-row AND in PCM/ReRAM is not supported, since it is unlikely that Rlow/(n−1)||Rhigh can be differentiated from Rlow/n when n > 2.

Global Buffers Modification: To support inter-subarray and inter-bank operations, we add digital circuits to the row buffers or I/O buffers. The logic circuit’s inputs are the data from the data bus and from the buffer. The output is selected by the control signals and then latched in the buffer, as shown in Fig. 8 (b).

5. SYSTEM SUPPORT
Fig. 4 shows an overview of Pinatubo’s system design. The software support comprises the programming model and run-time support. The programming model provides two functions for programmers: bit-vector allocation and bitwise operations. The run-time support includes modifications to the C/C++ run-time library and the OS, as well as a dynamically linked driver library. The C/C++ run-time library is modified to provide a PIM-aware data allocation function. It ensures that different bit-vectors are allocated to different memory rows, since Pinatubo can only process inter-row operations. The OS provides PIM-aware memory management that maximizes the opportunity to invoke intra-subarray operations. The OS also provides the bit-vector mapping information and physical addresses (PAs) to the PIM’s run-time driver library. Based on the PAs, the dynamically linked driver library first optimizes and reschedules the operation requests, and then issues extended instructions for PIM [3]. The hardware control utilizes the DDR mode register (MR) and commands. The extended instructions are translated into DDR commands and issued through the DDR bus to the main memory. The MR in the main memory is set to configure the PIM operations.

6. EXPERIMENT
In this section, we compare Pinatubo with state-of-the-art solutions and present the performance and energy results.

6.1 Experiment Setup
The three counterparts we compare against are described below:
SIMD is a 4-core, 4-issue, out-of-order x86 Haswell processor running at 3.3GHz. It contains a 128-bit SIMD unit with SSE/AVX for bitwise operation acceleration. The cache hierarchy consists of 32KB L1, 256KB L2, and 6MB L3 caches.
S-DRAM is the in-DRAM computation solution that accelerates bitwise operations [22]. The operations are executed by charge sharing in DRAM. Due to DRAM’s read-destructive nature, this solution requires copying data before calculation. Only 2-row AND and OR are supported.
AC-PIM is an accelerator-in-memory solution, where even the intra-subarray operations are implemented with digital logic gates, as shown in Fig. 8 (b).

S-DRAM works with a 65nm 4-channel DDR3-1600 DRAM. AC-PIM and Pinatubo work on a 1T1R-PCM based main memory whose tRCD-tCL-tWR is 18.3-8.9-151.1ns [9]. SIMD works with DRAM when compared with S-DRAM,


Table 1: Benchmarks and Data Sets

Vector: pure bit-vector OR operations. Data set: e.g., 19-16-1(s/r) means 2^19-length vectors, 2^16 vectors, 2^1-row OR ops (sequential/random access).
Graph: bitmap-based BFS for graph processing [5]. Data sets: dblp-2010, eswiki-2013, amazon-2008 [1].
Database: bitmap-based database (FastBit [26]) application. Data sets: 240/480/720 queries on STAR [2].

and with PCM when compared with AC-PIM and Pinatubo. Note that the experiment takes 1T1R PCM as a case study, but Pinatubo is also capable of working with other technologies and cell structures.

The parameters for S-DRAM are scaled from existing work [22]. The parameters for AC-PIM are collected from a synthesis tool at 65nm technology. As for the parameters of Pinatubo, the analog/mixed-signal part, including the SA, WD, and LWL driver, is extracted from HSPICE simulation; the digital part, including the controllers and logic for inter-subarray/bank operations, is extracted from the synthesis tool. Based on these low-level parameters, we heavily modify NVSim [11] for the NVM circuit modeling and CACTI-3DD [9] for the main memory modeling to obtain the high-level parameters. We also modify the PIN-based simulator Sniper [7] to model the SIMD processor and the NVM-based memory system, and develop an in-house simulator to evaluate AC-PIM, S-DRAM, and Pinatubo. The evaluation benchmarks and data sets are shown in Table 1, in which Vector has only OR operations while Graph and Database contain all of AND, OR, XOR, and INV.

6.2 Performance and Energy Evaluation
Fig. 9 shows Pinatubo’s OR operation throughput. We make four observations. First, the throughput increases with longer bit-vectors, because they make better use of the memory’s internal bandwidth and parallelism. Second, we observe two turning points, A and B, after which the speedup improvement slows down. Turning point A is caused by the shared SA in NVM: bit-vectors longer than 2^14 have to be mapped to columns that share an SA, and each part has to be processed serially. Turning point B is caused by the limitation of the row length: bit-vectors longer than 2^19 have to be mapped to multiple ranks that work serially. Third, Pinatubo is capable of multi-row operations (as the legend shows); for n-row OR operations, a larger n provides larger bandwidth. Fourth, the y-axis is divided into three regions: the below-DDR-bus-bandwidth region, which only includes results for short bit-vectors; the memory-internal-bandwidth region, which includes the majority of the results; and the beyond-internal-bandwidth region, reached thanks to multi-row operations. DRAM systems can never reach the beyond-internal-bandwidth region.

Figure 9: Pinatubo's Throughput (GBps) for OR operations. (Y-axis: throughput in GBps, log scale; x-axis: bit-vector length 2^n for n = 10 to 20; curves: 2- to 128-row OR; the below-DDR-bus-bandwidth and beyond-internal-bandwidth regions and turning points A and B are annotated.)

Figure 10: Speedup Normalized to SIMD Baseline. (Y-axis: speedup, log scale; x-axis: Vector, Graph, and FastBit workloads; bars: S-DRAM, AC-PIM, Pinatubo-2, Pinatubo-128.)

We compare Pinatubo with both 2-row and 128-row operations against two aggressive baselines in Fig. 10, which shows the speedup on bitwise operations. We have three observations. First, S-DRAM outperforms Pinatubo-2 in some cases with very long bit-vectors, because DRAM-based solutions benefit from larger row buffers than the NVM-based solution. However, the advantage of NVM's multi-row operations still dominates: Pinatubo-128 is 22× faster than S-DRAM. Second, the AC-PIM solution is much slower than Pinatubo in every single case. Third, multi-row operations show their superiority, especially when intra-subarray operations dominate. A counter-example is 14-16-7r, where all operations are random accesses dominated by inter-subarray/bank operations, so Pinatubo-128 is as slow as Pinatubo-2.
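The benefit of multi-row operations can be seen in a purely functional model. This is a sketch with bit-vectors represented as Python ints; it models only the operation count, not the analog sensing itself: a one-step n-row OR replaces the n−1 sequential pairwise ORs that a 2-row design must issue.

```python
from functools import reduce

def multi_row_or(rows):
    """One-step n-row OR: all operand rows activated and sensed together."""
    return reduce(lambda a, b: a | b, rows)

def pairwise_or(rows):
    """2-row design: n-1 sequential pairwise ORs; returns (result, steps)."""
    acc, steps = rows[0], 0
    for r in rows[1:]:
        acc |= r          # one sensing step per additional operand row
        steps += 1
    return acc, steps
```

Both produce the same bit-vector, but the 128-row version retires up to 128 operand rows per sensing step, which is consistent with the gap over S-DRAM in Fig. 10 when intra-subarray operations dominate.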

Figure 11: Energy Saving Normalized to SIMD. (Y-axis: energy saving, log scale; x-axis: Vector, Graph, and FastBit workloads; bars: S-DRAM, AC-PIM, Pinatubo-2, Pinatubo-128.)

Figure 12: Overall Speedup and Energy Saving Normalized to SIMD Baseline. (Left group: speedup; right group: energy; x-axis: Graph and FastBit workloads; bars: S-DRAM, AC-PIM, Pinatubo-2, Pinatubo-128, Ideal.)

Fig. 11 shows the energy saving results. The observations are similar to those for speedup: S-DRAM is better than Pinatubo-2 in some cases but worse than Pinatubo-128 on average. AC-PIM never has a chance to save more energy than any of the other three solutions, since both S-DRAM and Pinatubo rely on highly energy-efficient analog computing. On average, Pinatubo saves 28000× energy on bitwise operations compared with the SIMD processor.

Fig. 12 shows the overall speedup and energy saving of Pinatubo on the two real-world applications. The Ideal legend represents the result with zero latency and energy spent on bitwise operations. We have three observations. First, Pinatubo almost achieves the ideal acceleration. Second, limited by the proportion of bitwise operations, Pinatubo improves graph processing applications by 1.15× with 1.14× energy saving. However, the benefit is data dependent: for the eswiki and amazon data sets, the connectivity is "loose", so most of the time is spent searching for an unvisited bit-vector, while dblp achieves a 1.37× speedup. Third, for the database application, Pinatubo achieves 1.29× overall speedup and energy saving.
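As an illustration of how these applications map onto bulk bitwise operations, the sketch below runs a breadth-first search with the frontier and visited sets kept as bit-vectors. This is a hypothetical host-side view, not the paper's code; in Pinatubo the OR, AND, and INV steps would execute inside memory.

```python
def bfs_bitvector(adj, n, source):
    """adj[v] is an int whose bit u is set iff there is an edge v -> u.
    Returns the visited bit-vector and the number of frontier expansions."""
    visited = frontier = 1 << source
    expansions = 0
    while frontier:
        nxt = 0
        for v in range(n):            # OR together the rows of the frontier
            if (frontier >> v) & 1:
                nxt |= adj[v]         # candidate for an n-row in-memory OR
        frontier = nxt & ~visited     # AND with the INV of 'visited'
        visited |= frontier
        expansions += 1
    return visited, expansions
```

On loosely connected data sets, many expansions touch few vertices, so the host spends most of its time scanning for unvisited bit-vectors rather than in the accelerated OR itself, consistent with eswiki and amazon lagging dblp's 1.37×.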

6.3 Overhead Evaluation

Fig. 13 shows the area overhead results. As shown in Fig. 13 (a), Pinatubo incurs an insignificant area overhead of only 0.9%. In contrast, AC-PIM has a 6.4% area overhead, which is critical for the cost-sensitive memory industry. S-DRAM reports a ∼0.5% capacity loss, but that is a DRAM-only result and orthogonal to Pinatubo's overhead evaluation. Fig. 13 (b) shows the area overhead breakdown. The majority of the area overhead is taken by support for inter-subarray/bank operations. Among the intra-subarray operations, XOR support takes most of the area.

Figure 13: Area Overhead Comparison (left) and Breakdown (right). (Left: PCM area overhead, Pinatubo 0.9% vs. AC-PIM 6.4%; right: inter-subarray 0.72%, inter-bank 0.09%, intra-subarray 0.13% in total, of which XOR 0.06%, WL activation 0.05%, and AND/OR 0.02%.)

7. RELATED WORK

Pinatubo distinguishes itself from PIM work, in-DRAM computing work, and logic-in-memory work. First, different from other PIM work, Pinatubo does not need 3D integration [18], and does not suffer from the logic/memory coupling problem [19] either. Pinatubo benefits from NVM's resistive-cell feature and provides cost- and energy-efficient PIM. Although ProPRAM [25] also leverages NVM for PIM, it computes with the peripheral circuits originally added for lifetime enhancement rather than with the NVM cells themselves. Moreover, ProPRAM does not support bitwise AND/OR and computes with digital circuits, whereas Pinatubo takes advantage of highly energy-efficient analog computing. Second, in-DRAM bulk bitwise operations based on charge sharing have been proposed [22]. However, they suffer from the destructive-read problem, so operands must be copied before computing, incurring extra overhead; also, at most 2-row operations are supported. Third, other work uses NVM for logic-in-memory functionality, such as associative memory [17, 12]. Recent studies also use ReRAM crossbar arrays to implement IMPLY-based logic operations [16, 13]. However, none of them truly fit the PIM concept: they use memory technology to implement a processing unit, but that unit serves as either a co-processor or a stand-alone accelerator. They still pay for expensive data fetching from memory and are limited by the memory bus bandwidth.

8. CONCLUSION

In this paper, a processing-in-NVM architecture for bulk bitwise operations is proposed. The computation makes use of NVM's resistive-cell feature and achieves high performance and energy efficiency with insignificant area overhead. Experimental results show that the proposed architecture achieves ∼500× speedup and ∼28000× energy saving on bitwise operations, and 1.12× overall speedup and 1.11× overall energy saving on data-intensive graph processing and database applications with real-world data.

9. REFERENCES

[1] Laboratory for web algorithmics. http://law.di.unimi.it/.

[2] The STAR experiment. http://www.star.bnl.gov/.

[3] J. Ahn et al. PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture. In ISCA, pages 336–348. ACM, 2015.

[4] J. Ahn et al. A scalable processing-in-memory accelerator for parallel graph processing. In ISCA, pages 105–117. ACM, 2015.

[5] S. Beamer et al. Direction-optimizing breadth-first search. In SC, pages 1–10, Nov 2012.

[6] J. Bruce et al. Fast and inexpensive color image segmentation for interactive robots. In IROS, volume 3, 2000.

[7] T. E. Carlson et al. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation. In SC, pages 52:1–52:12. ACM, 2011.

[8] M.-F. Chang et al. An offset-tolerant fast-random-read current-sampling-based sense amplifier for small-cell-current nonvolatile memory. JSSC, 48(3):864–877, March 2013.

[9] K. Chen et al. CACTI-3DD: Architecture-level modeling for 3D die-stacked DRAM main memory. In DATE, pages 33–38, 2012.

[10] G. De Sandre et al. A 90nm 4Mb embedded phase-change memory with 1.2V 12ns read access time and 1MB/s write throughput. In ISSCC, pages 268–269, Feb 2010.

[11] X. Dong et al. NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory. TCAD, 31(7):994–1007, July 2012.

[12] Q. Guo et al. AC-DIMM: Associative computing with STT-MRAM. In ISCA, pages 189–200. ACM, 2013.

[13] S. Hamdioui et al. Memristor-based computation-in-memory architecture for data-intensive applications. In DATE, pages 1718–1725, 2015.

[14] S. W. Keckler et al. GPUs and the future of parallel computing. IEEE Micro, (5):7–17, 2011.

[15] B. C. Lee et al. Architecting phase change memory as a scalable DRAM alternative. In ISCA, pages 2–13. ACM, 2009.

[16] H. Li et al. A learnable parallel processing architecture towards unity of memory and computing. Scientific Reports, 5, 2015.

[17] J. Li et al. 1Mb 0.41µm² 2T-2R cell nonvolatile TCAM with two-bit encoding and clocked self-referenced sensing. JSSC, 49(4):896–907, April 2014.

[18] R. Nair et al. Active memory cube: A processing-in-memory architecture for exascale systems. IBM Journal of Research and Development, 59(2/3):17:1–17:14, March 2015.

[19] D. Patterson et al. A case for intelligent RAM. IEEE Micro, 17(2):34–44, 1997.

[20] J. T. Pawlowski. Hybrid memory cube (HMC). In Hot Chips, volume 23, 2011.

[21] M. Pedemonte et al. Bitwise operations for GPU implementation of genetic algorithms. In GECCO, pages 439–446. ACM, 2011.

[22] V. Seshadri et al. Fast bulk bitwise AND and OR in DRAM. CAL, PP(99):1–1, 2015.

[23] K. Suzuki et al. The non-volatile memory technology database (NVMDB). Technical Report CS2015-1011, UCSD, May 2015.

[24] K. Tsuchida et al. A 64Mb MRAM with clamped-reference and adequate-reference schemes. In ISSCC, pages 258–259, 2010.

[25] Y. Wang et al. ProPRAM: Exploiting the transparent logic resources in non-volatile memory for near data computing. In DAC, pages 47:1–47:6. ACM, 2015.

[26] K. Wu. FastBit: An efficient indexing technology for accelerating data-intensive science. In Journal of Physics, volume 16, page 556, 2005.

[27] J. Zhao et al. Memory and storage system design with nonvolatile memory technologies. IPSJ, 8(0):2–11, 2015.

[28] P. Zhou et al. A durable and energy efficient main memoryusing phase change memory technology. In ISCA, 2009.