The role of NVM in the future of HPC workloads
Distinguished Technologist, EG EMEA
29 May 2017
The New Normal: Compute is not keeping up with the data explosion
The end of scaling arrives at just the wrong time:
8B people + 20B mobile devices + 100B social infrastructure apps + 1T (…)
HPC and Big Data technology context and The Machine
Critical for our future infrastructures

[Figure: workloads mapped by application latency (high to low) vs. storage volume/capacity (low to high). Traditional cold storage and object storage handle large volumes of data at lower cost but high latency; traditional HPC, parallel file systems, and massively parallel, low-latency computing sit at the low-latency end. Hadoop, NoSQL databases, analytics, IoT, deep learning, and HPC for the enterprise cluster in the HPC/Big Data convergence region. Two pressures drive the convergence: increase the amount of data processed at low latency, and reduce the latency of storage while increasing efficiency.]
Disruption #1: Non-volatile memories

DRAM is reaching the physical limits of charge storage. Non-volatile memories, forms of memristor (Type 4), include 3D Flash, PCRAM, STT-RAM, RRAM, and HMC. NVM and high-speed memories are critical for extreme computing.

The memristor: predicted in 1971 by Leon Chua (U.C. Berkeley); reduced to practice in 2008 by R. Stanley Williams (HP Laboratories).
The memristor: the 4th fundamental circuit element

The four fundamental two-terminal elements relate the four circuit variables (current i, voltage v, charge q, flux φ):
- Resistor (Ohm, 1827): dv = R di
- Capacitor (von Kleist, 1745): dq = C dv
- Inductor (Faraday, 1831): dφ = L di
- Memristor (Chua, 1971): dφ = M dq
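The defining relation dφ = M dq implies the memristor's characteristic behavior. A standard one-line derivation (not on the slide, but following directly from the definitions above, since φ and q are the time integrals of v and i):

```latex
v(t) \;=\; \frac{d\varphi}{dt} \;=\; \frac{d\varphi}{dq}\,\frac{dq}{dt} \;=\; M\big(q(t)\big)\, i(t)
```

Because the memristance M depends on the total charge that has flowed through the device, its resistance state survives power-off, which is exactly what makes memristors usable as non-volatile memory cells.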
Convergence of memory and storage

[Diagram: CPU, cache, DRAM, flash SSD, magnetic HDD, ordered from speed to persistence.]

Memory: load-store data, low latency, volatile.
Storage: block-store data, long latency, non-volatile.
Persistent memory sits between the two: the speed of memory with the persistence of storage.
Disruption #2: Photonics

We need a 1000× improvement: FLOPS will cost less power than on-chip data movement. At exascale, 10^18 ops moving 1 byte/op is roughly 10^19 bits; at 1 pJ/bit, that is about 10 MW spent on data movement alone.
(Katharine Schmidtke, Finisar)

Photonics offers ultra-low energy, low uniform latency, high bandwidth, and low cost.
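The slide's 10 MW figure can be checked with back-of-the-envelope arithmetic; the constants below are the slide's assumptions, not measured values:

```python
# Energy cost of data movement at exascale, using the slide's figures.
OPS_PER_SEC = 1e18      # 10^18 operations per second (exascale)
BYTES_PER_OP = 1        # slide assumes 1 byte moved per operation
PJ_PER_BIT = 1.0        # 1 picojoule per bit moved

bits_per_sec = OPS_PER_SEC * BYTES_PER_OP * 8       # ~10^19 bits/s
watts = bits_per_sec * PJ_PER_BIT * 1e-12           # pJ/s -> W

print(f"{bits_per_sec:.1e} bits/s at 1 pJ/bit -> {watts / 1e6:.0f} MW")
```

This comes to 8 MW, i.e. on the order of the slide's 10 MW, all spent just shuffling bits rather than computing.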
Disruption #3: Optimized architectures

Special-purpose cores integrated into a single package: reduced cost, less energy, less space, less complexity. Extreme-scale compute requires ultra-cheap and ultra-efficient technologies. The main challenge today is the complexity of co-designing hardware, software, and their integration within the ecosystem.

Future processors raise open questions: how much high-speed memory (HMC, HBM, Wide IO: x GB? y GB/s?), what latencies, how many outstanding loads/stores? What L2/L1 hierarchy and characteristics, frequency, flops per cycle, ISA?

What to expect from future processors:
- Many cores plus many heterogeneous accelerators (GPU, FPGA, ASIC).
- A high-speed L4 cache of a few GB; shared L3 and private L2 will have much smaller, MB-range capacities (e.g. a 16 MB L3).
- This high-speed memory will be several times faster than the levels above it, but limited to tens of GB.
- A very large pool of NVM attached and shared through Gen-Z.
- Possibly a very large local NVM pool (TB range) to mitigate IO, latency, and cost.

[Diagram: multiple processors and cores on an interposer with shared caches and high-speed memory.]
Memory/Storage Convergence: The Media Revolution

Today: memory = DRAM; internal block storage = disk/SSD.
Emerging: memory = DRAM + OPM + SCM; internal block storage = SSD + SCM + disk/SSD.
(SCM = Storage Class Memory; OPM = On-Package Memory)

Memory semantics is becoming pervasive in volatile AND non-volatile storage as these technologies continue to converge.
Typical 2-socket architecture: 8 possible interconnect types (DDR4, IPL, PCIe, NVMe, SAS, and Ethernet/IB/FC/OPA to storage and NICs).

FACT: Every domain transition requires translation overhead, which adds latency.
FACT: Load/store memory semantics are the simplest, most efficient way for a CPU to talk to memory.
Processor Pin Increase

[Chart: processor pins and DDR channels, 2007-2020. Nehalem: 1366 pins (3 memory channels); Sandy Bridge: 2011 pins (4 channels); Sky Lake/Ice Lake: 3647 pins (6 channels); next generation (TBD): ~4223 pins (8 channels). Generation-to-generation growth figures shown: 63%, 43%, 47%, 55%. Courtesy: Ideas International]

[Chart: memory bandwidth per core (projected), y-axis 0 to 7 GB/s, declining as core counts grow: 2012 (8 cores), 2013 (12), 2014 (18), 2016 (22), 2017 (28), 2019 (64).]
Architectural Limitations

[Diagram: CPU with DDR4 (memory bandwidth: 100 GB/s), linked by PCIe x16 (32 GB/s) to a GPU with HBM (HBM memory bandwidth: 732 GB/s*), plus storage and NIC over SAS/Ethernet. * NVIDIA Tesla P100]

[Chart: PCIe bandwidth per core (projected), y-axis 0 to 6 GB/s, declining as core counts grow: 2012 (8 cores), 2013 (12), 2014 (18), 2016 (22), 2017 (28), 2019 (64).]

- Cores are starved for bandwidth as the number of cores increases
- The interconnect becomes the bottleneck for heterogeneous compute environments
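The starvation trend is simple division: a roughly fixed socket bandwidth shared by ever more cores. A sketch using the slide's 100 GB/s figure and the chart's core counts (the projected chart may assume different per-year bandwidths, so treat this as directional only):

```python
# Per-core memory bandwidth when a fixed socket bandwidth is shared by
# a growing core count (core counts taken from the chart's x-axis).
SOCKET_BW_GBPS = 100.0  # GB/s, the slide's example CPU memory bandwidth
cores_by_year = {2012: 8, 2013: 12, 2014: 18, 2016: 22, 2017: 28, 2019: 64}

per_core = {year: SOCKET_BW_GBPS / n for year, n in cores_by_year.items()}
for year, bw in per_core.items():
    print(f"{year}: {cores_by_year[year]:2d} cores -> {bw:5.2f} GB/s per core")
```

Per-core bandwidth drops from 12.5 GB/s at 8 cores to about 1.6 GB/s at 64, even though the total socket bandwidth never changed.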
Drive a New and Open Protocol: Gen-Z
- Scalable, general-purpose interconnect and protocol
- Replaces processor-local interconnects: (LP)DDR, PCIe, SAS/SATA, etc.
- Replaces global fabrics for ultra-low-latency communications at scale
- Provides a flexible load-store domain for memory operations and message passing

Today: a CPU/SoC juggles separate (LP)DDR, PCIe, SAS/SATA, Ethernet/IB, and proprietary links. By 2022: a single SoC speaks Gen-Z to all of them.
Making the memory hierarchy obsolete

Today there is a constant balance between cost and performance: on-chip SRAM cache is fastest but costs the most per bit, followed by DRAM main memory, flash mass storage, and hard disk at the capacity end. The Machine replaces this hierarchy with universal memory: a massive memory pool that is both fast and high-capacity, enabling massive data sets.

http://genzconsortium.org
Semantics of access: load/store over the Gen-Z protocol

[Diagram: SoCs with heterogeneous compute elements (x86, ARM, GPU, DSP) attach through bridges to a memory fabric. Fabric memory is organized into name spaces (1-4) protected by firewalls; a management server runs the Librarian, which allocates memory to applications. An application (e.g. payroll) and its OS run in a trust zone (TZ) behind a firewall. "Now I can access everything in nanoseconds!"]
Simplicity: Fewer data layers

Conventional data formats. Product manager: "Track the products my company sells." The application developer calls product.setCount(), and the write traverses every layer: update the product object in a cacheline; update the object in main memory; serialise the object and add a network packet header; update the object in a page in the DB cache; update the object in a page in the filesystem cache; write the page to sectors on disk.

MDS data formats. The same product.setCount() write needs only two steps: update the product object in a cacheline; update the product object in NVM. The update now persists in shared non-volatile memory.
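The difference between the two write paths can be sketched in ordinary Python. This is an illustration only, not the MDS API: the file paths are made up, a plain file write stands in for the DB/filesystem/disk layers, and a memory-mapped buffer stands in for a store to byte-addressable NVM.

```python
import mmap
import os
import pickle
import struct

# Conventional path: mutate the object, then serialize it and push it down
# through the storage stack (collapsed here to one file write + fsync).
def conventional_update(product, count, path="/tmp/products.db"):
    product["count"] = count              # update in cache / main memory
    blob = pickle.dumps(product)          # serialise; add headers
    with open(path, "wb") as f:           # DB cache / filesystem cache / disk
        f.write(blob)
        f.flush()
        os.fsync(f.fileno())              # force the page to stable media

# Persistent-memory path: the in-memory object IS the durable object, so a
# single store into a mapped region is the whole update.
def pmem_update(buf, count):
    struct.pack_into("<q", buf, 0, count)  # one store; persists in NVM

product = {"name": "widget", "count": 0}
conventional_update(product, 42)

# Stand-in for a byte-addressable NVM region (really just a mapped file).
with open("/tmp/products.pmem", "wb") as f:
    f.write(b"\x00" * 8)
with open("/tmp/products.pmem", "r+b") as f:
    region = mmap.mmap(f.fileno(), 8)
    pmem_update(region, 42)
    region.flush()                        # analogous to a cacheline flush
    region.close()
```

Both paths end with 42 durable, but the persistent-memory path needed no serialisation, no packet headers, and no block I/O: just a store and a flush.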
Application Programming Models to Persistent Memory

Three standard, open interfaces, ordered by how much the application must change:

1. Indirect I/O access (conventional I/O): existing applications run unchanged; writes to a special volume go through the filesystem APIs and block I/O to an OS driver providing block-device emulation.
2. Indirect PM access (abstract PM access): applications are partially changed; source code is rewritten to use new middleware APIs (e.g. NVML) for specific data. Paths: EXT4/XFS cached/uncached DAX on Linux; NTFS/ReFS cached/uncached SCM block/DAX on Windows.
3. Direct PM access (native PM access): application source code manipulates data structures directly in persistent memory; used by object stores, new applications, and data analytics. Paths: EXT4/XFS AppDirect DAX on Linux; NTFS/ReFS AppDirect DAX on Windows.
Graph analytics time machine: massive memory and fast fabrics to ingest all data

Fast graph databases plus the ability to look at things that change over time: "Are there any emerging new behaviors in my network?" [Diagram: network graph highlighting a critical node, remote users, and external users.]

What if we could pre-compute an almost infinite set of "what ifs"? Optimization over a large search space in real time becomes realistic. Effectively infinite memory can simulate large-scale probabilities, just as in game algorithms: solve complex problems before they happen.
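The "pre-compute and look up" idea is essentially memoization at enormous scale, as in game-tree transposition tables. A toy sketch, where the recurrence is hypothetical and stands in for an expensive simulation step with overlapping subproblems:

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)   # unbounded cache stands in for a huge memory pool
def outcome(state: int) -> int:
    """Hypothetical expensive evaluation of one 'what if' state."""
    global calls
    calls += 1
    if state < 2:
        return state
    # Overlapping subproblems, like game positions reached by many move orders.
    return (outcome(state - 1) + outcome(state - 2)) % 1_000_000

result = outcome(60)
print(f"outcome(60) = {result}, with {calls} evaluations instead of ~2^60")
```

Caching collapses an exponential tree of recomputation to 61 evaluations, one per distinct state: the lookup-instead-of-recalculate effect the slides attribute to The Machine's memory capacity.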
Performance demonstration: similarity search, from offline to decision time

Use cases: content-based image/video retrieval; near-duplicate web page detection; similar document retrieval; outlier detection for e-commerce fraud mitigation; fingerprint matching; scalable object recognition; nearest-neighbor classification.

[Chart: millions of images indexed and searches per minute for disk-based, in-memory, and simulated-Machine configurations; figures shown include 1200, 8030, 80, and 0.34.]

The capacity to store the intermediate results of simulation models allows The Machine to simply look up results instead of recalculating them. Complex models converge in minutes, not days.
Transform performance with Memory-Driven programming
- In-memory analytics: 15× faster
- Similarity search: 300× faster
- Large-scale graph inference: 100× faster
- Financial models: 8000× faster

The approaches range from modifying existing frameworks, to new algorithms, to completely rethinking the problem.
"Things" generate data, and need control

[Diagram, "shift left" from the data center to the edge: Stage 1, data is sensed and things are controlled (the edge); Stage 2, data is acquired and aggregated; Stage 3, edge IT performs early analytics and compute; Stage 4, the data center and cloud perform deep analytics and compute. Data flows toward the center; control flows back to the edge. Operations technology lives at the early stages.]

NVM will play a major role in the IoT world. There is too much data to move it all: we need to move the code instead. The data should also be stored where it is created, moving only the metadata.
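"Move the code, not the data" can be sketched as shipping a small summarising function to each edge node and returning only metadata. The node names and readings below are invented for illustration:

```python
# Hypothetical edge nodes holding locally collected sensor readings.
edge_nodes = {
    "plant-1": [21.0, 22.5, 35.1, 21.9],
    "plant-2": [19.8, 20.1, 20.4],
}

def anomaly_summary(readings, threshold=30.0):
    """The shipped 'code': runs where the data lives, returns only metadata."""
    outliers = [r for r in readings if r > threshold]
    return {"count": len(readings), "anomalies": len(outliers)}

# Instead of hauling raw streams to the center, collect tiny summaries.
metadata = {node: anomaly_summary(data) for node, data in edge_nodes.items()}
print(metadata)
```

Only a few integers per node cross the network; the raw readings stay in NVM at the edge where they were created.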
Deep Learning and Edge Computing

Center (training): collects some of the data, continuously trains models, sends models to the edge nodes, and runs large-scale simulations.
Edge node (inference): receives the trained model, uses it in real time, collects data, and sends some of it back to the center.
[Image: a photo classified as "Building"; Krizhevsky et al. (2012)]
Thank you

More resources on The Machine. Industry articles, blogs, and social media outlets:
- The Machine on the Hewlett Packard Labs webpage: http://labs.hpe.com/research/themachine/
- Videos: Story on The Machine (https://www.youtube.com/watch?v=NwWF1LSmBJY) and The Machine: Future of Computing (https://www.youtube.com/watch?list=PL0_ubpZ6vGcAm1sLOSyQWYx_WTJ_u9zNr&v=NZ_rbeBy-ms)
- @HPE_Labs, Hewlett Packard Labs, Behind the scenes @ Labs, HPE

Technical articles from The Next Platform:
- Drilling Down Into The Machine From HPE: http://www.nextplatform.com/2016/01/04/drilling-down-into-the-machine-from-hpe/
- The Intertwining Of Memory And Performance Of HPE's Machine: http://www.nextplatform.com/2016/01/11/the-intertwining-of-memory-and-performance-of-hpes-machine/
- Weaving Together The Machine's Fabric Memory: http://www.nextplatform.com/2016/01/18/weaving-together-the-machines-fabric-memory/
- The Bits And Bytes Of The Machine's Storage: http://www.nextplatform.com/2016/01/25/the-bits-and-bytes-of-the-machines-storage/
- Non-Volatile Heaps And Object Stores In The Machine: http://www.nextplatform.com/2016/02/08/non-volatile-heaps-object-stores-machine/
- Operating Systems, Virtualization, And The Machine: http://www.nextplatform.com/2016/02/01/operating-systems-virtualization-machine/
- Future Systems: How HP Will Adapt The Machine To HPC: http://www.nextplatform.com/2015/08/17/future-systems-how-hp-will-adapt-the-machine-to-hpc/
- Spark on Superdome X Previews In-Memory on The Machine: http://www.nextplatform.com/2016/04/11/spark-superdome-x-previews-memory-machine/
- Programming for Persistent Memory Takes Persistence: http://www.nextplatform.com/2016/04/25/first-steps-program-model-persistent-memory/
- First Steps in the Program Model for Persistent Memory: http://www.nextplatform.com/2016/04/25/first-steps-program-model-persistent-memory/

IEEE:
- Adapting to Thrive in a New Economy of Memory Abundance (Computer magazine special article): http://www.labs.hpe.com/pdf/IEEE_Adapting_to_Thrive_in_a_New_Economy_of_Memory_Abundance.pdf
- At IEEE's Rebooting the Computer Conference, A New Economy of Memory Abundance: http://community.hpe.com/t5/Behind-the-scenes-Labs/At-IEEE-s-Rebooting-the-Computer-Conference-A-New-Economy-of/ba-p/6818400
- Blah, blah, technology, blah: Sharing the MDC Vision with the IEEE Conference: http://community.hpe.com/t5/Behind-the-scenes-Labs/Blah-blah-technology-blah-Sharing-the-MDC-Vision-with-the-IEEE/ba-p/6875502
- Memory-Driven Computing: how will it impact the world? http://community.hpe.com/t5/Behind-the-scenes-Labs/Memory-Driven-Computing-How-will-it-impact-the-world/ba-p/6796925