The role of NVM in the future of HPC workloads
Distinguished Technologist, EG EMEA
29 May 2017
The New Normal: Compute is not keeping up with the data explosion
The end of scaling arrives at just the wrong time:
8B people + 20B mobile devices + 100B social infrastructure apps + 1T (…)
HPC and Big Data technology context and The Machine
Critical for our future infrastructures

[Figure: workloads mapped by application latency (high to low) vs. storage volume/capacity (low to high). Traditional cold storage and object storage handle large volumes of data at lower cost but high latency; traditional HPC, parallel file systems, and massively parallel, low-latency computing sit at the low-latency end. Hadoop, NoSQL databases, analytics, IoT, deep learning, and HPC for the enterprise cluster in the HPC/Big Data convergence region. Two pressures drive the convergence: increase the amount of data processed at low latency, and reduce the latency of storage while increasing efficiency.]
Disruption #1: Non-volatile memories

DRAM is reaching the physical limits of charge storage. Non-volatile memories, forms of memristor (Type 4), include 3D Flash, PCRAM, STT-RAM, RRAM, and HMC. NVM and high-speed memories are critical for extreme computing.

The memristor: predicted in 1971 by Leon Chua (U.C. Berkeley); reduced to practice in 2008 by R. Stanley Williams (HP Laboratories).
The memristor: the 4th fundamental circuit element

The four fundamental two-terminal elements relate the four circuit variables (current i, voltage v, charge q, flux φ):
- Resistor (Ohm, 1827): dv = R di
- Capacitor (von Kleist, 1745): dq = C dv
- Inductor (Faraday, 1831): dφ = L di
- Memristor (Chua, 1971): dφ = M dq
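The defining relation dφ = M dq implies the memristor's characteristic behavior. A standard one-line derivation (not on the slide, but following directly from the definitions above, since φ and q are the time integrals of v and i):

```latex
v(t) \;=\; \frac{d\varphi}{dt} \;=\; \frac{d\varphi}{dq}\,\frac{dq}{dt} \;=\; M\big(q(t)\big)\, i(t)
```

Because the memristance M depends on the total charge that has flowed through the device, its resistance state survives power-off, which is exactly what makes memristors usable as non-volatile memory cells.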
Convergence of memory and storage

[Diagram: CPU, cache, DRAM, flash SSD, magnetic HDD, ordered from speed to persistence.]

Memory: load-store data, low latency, volatile.
Storage: block-store data, long latency, non-volatile.
Persistent memory sits between the two: the speed of memory with the persistence of storage.
Disruption #2: Photonics

We need a 1000× improvement: FLOPS will cost less power than on-chip data movement. At exascale, 10^18 ops moving 1 byte/op is roughly 10^19 bits; at 1 pJ/bit, that is about 10 MW spent on data movement alone.
(Katharine Schmidtke, Finisar)

Photonics offers ultra-low energy, low uniform latency, high bandwidth, and low cost.
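The slide's 10 MW figure can be checked with back-of-the-envelope arithmetic; the constants below are the slide's assumptions, not measured values:

```python
# Energy cost of data movement at exascale, using the slide's figures.
OPS_PER_SEC = 1e18      # 10^18 operations per second (exascale)
BYTES_PER_OP = 1        # slide assumes 1 byte moved per operation
PJ_PER_BIT = 1.0        # 1 picojoule per bit moved

bits_per_sec = OPS_PER_SEC * BYTES_PER_OP * 8       # ~10^19 bits/s
watts = bits_per_sec * PJ_PER_BIT * 1e-12           # pJ/s -> W

print(f"{bits_per_sec:.1e} bits/s at 1 pJ/bit -> {watts / 1e6:.0f} MW")
```

This comes to 8 MW, i.e. on the order of the slide's 10 MW, all spent just shuffling bits rather than computing.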
Disruption #3: Optimized architectures

Special-purpose cores integrated into a single package: reduced cost, less energy, less space, less complexity. Extreme-scale compute requires ultra-cheap and ultra-efficient technologies. The main challenge today is the complexity of co-designing hardware, software, and their integration within the ecosystem.

Future processors raise open questions: how much high-speed memory (HMC, HBM, Wide IO: x GB? y GB/s?), what latencies, how many outstanding loads/stores? What L2/L1 hierarchy and characteristics, frequency, flops per cycle, ISA?

What to expect from future processors:
- Many cores plus many heterogeneous accelerators (GPU, FPGA, ASIC).
- A high-speed L4 cache of a few GB; shared L3 and private L2 will have much smaller, MB-range capacities (e.g. a 16 MB L3).
- This high-speed memory will be several times faster than the levels above it, but limited to tens of GB.
- A very large pool of NVM attached and shared through Gen-Z.
- Possibly a very large local NVM pool (TB range) to mitigate IO, latency, and cost.

[Diagram: multiple processors and cores on an interposer with shared caches and high-speed memory.]
Memory/Storage Convergence: The Media Revolution

Today: memory = DRAM; internal block storage = disk/SSD.
Emerging: memory = DRAM + OPM + SCM; internal block storage = SSD + SCM + disk/SSD.
(SCM = Storage Class Memory; OPM = On-Package Memory)

Memory semantics is becoming pervasive in volatile AND non-volatile storage as these technologies continue to converge.
Typical 2-socket architecture: 8 possible interconnect types (DDR4, IPL, PCIe, NVMe, SAS, and Ethernet/IB/FC/OPA to storage and NICs).

FACT: Every domain transition requires translation overhead, which adds latency.
FACT: Load/store memory semantics are the simplest, most efficient way for a CPU to talk to memory.
Processor Pin Increase

[Chart: processor pins and DDR channels, 2007-2020. Nehalem: 1366 pins (3 memory channels); Sandy Bridge: 2011 pins (4 channels); Sky Lake/Ice Lake: 3647 pins (6 channels); next generation (TBD): ~4223 pins (8 channels). Generation-to-generation growth figures shown: 63%, 43%, 47%, 55%. Courtesy: Ideas International]

[Chart: memory bandwidth per core (projected), y-axis 0 to 7 GB/s, declining as core counts grow: 2012 (8 cores), 2013 (12), 2014 (18), 2016 (22), 2017 (28), 2019 (64).]
Architectural Limitations

[Diagram: CPU with DDR4 (memory bandwidth: 100 GB/s), linked by PCIe x16 (32 GB/s) to a GPU with HBM (HBM memory bandwidth: 732 GB/s*), plus storage and NIC over SAS/Ethernet. * NVIDIA Tesla P100]

[Chart: PCIe bandwidth per core (projected), y-axis 0 to 6 GB/s, declining as core counts grow: 2012 (8 cores), 2013 (12), 2014 (18), 2016 (22), 2017 (28), 2019 (64).]

- Cores are starved for bandwidth as the number of cores increases
- The interconnect becomes the bottleneck for heterogeneous compute environments
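The starvation trend is simple division: a roughly fixed socket bandwidth shared by ever more cores. A sketch using the slide's 100 GB/s figure and the chart's core counts (the projected chart may assume different per-year bandwidths, so treat this as directional only):

```python
# Per-core memory bandwidth when a fixed socket bandwidth is shared by
# a growing core count (core counts taken from the chart's x-axis).
SOCKET_BW_GBPS = 100.0  # GB/s, the slide's example CPU memory bandwidth
cores_by_year = {2012: 8, 2013: 12, 2014: 18, 2016: 22, 2017: 28, 2019: 64}

per_core = {year: SOCKET_BW_GBPS / n for year, n in cores_by_year.items()}
for year, bw in per_core.items():
    print(f"{year}: {cores_by_year[year]:2d} cores -> {bw:5.2f} GB/s per core")
```

Per-core bandwidth drops from 12.5 GB/s at 8 cores to about 1.6 GB/s at 64, even though the total socket bandwidth never changed.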
Drive a New and Open Protocol: Gen-Z
- Scalable, general-purpose interconnect and protocol
- Replaces processor-local interconnects: (LP)DDR, PCIe, SAS/SATA, etc.
- Replaces global fabrics for ultra-low-latency communications at scale
- Provides a flexible load-store domain for memory operations and message passing

Today: a CPU/SoC juggles separate (LP)DDR, PCIe, SAS/SATA, Ethernet/IB, and proprietary links. By 2022: a single SoC speaks Gen-Z to all of them.
Making the memory hierarchy obsolete

Today there is a constant balance between cost and performance: on-chip SRAM cache is fastest but costs the most per bit, followed by DRAM main memory, flash mass storage, and hard disk at the capacity end. The Machine replaces this hierarchy with universal memory: a massive memory pool that is both fast and high-capacity, enabling massive data sets.

http://genzconsortium.org
Semantics of access: load/store over the Gen-Z protocol

[Diagram: SoCs with heterogeneous compute elements (x86, ARM, GPU, DSP) attach through bridges to a memory fabric. Fabric memory is organized into name spaces (1-4) protected by firewalls; a management server runs the Librarian, which allocates memory to applications. An application (e.g. payroll) and its OS run in a trust zone (TZ) behind a firewall. "Now I can access everything in nanoseconds!"]
Simplicity: Fewer data layers

Conventional data formats. Product manager: "Track the products my company sells." The application developer calls product.setCount(), and the write traverses every layer: update the product object in a cacheline; update the object in main memory; serialise the object and add a network packet header; update the object in a page in the DB cache; update the object in a page in the filesystem cache; write the page to sectors on disk.

MDS data formats. The same product.setCount() write needs only two steps: update the product object in a cacheline; update the product object in NVM. The update now persists in shared non-volatile memory.
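The difference between the two write paths can be sketched in ordinary Python. This is an illustration only, not the MDS API: the file paths are made up, a plain file write stands in for the DB/filesystem/disk layers, and a memory-mapped buffer stands in for a store to byte-addressable NVM.

```python
import mmap
import os
import pickle
import struct

# Conventional path: mutate the object, then serialize it and push it down
# through the storage stack (collapsed here to one file write + fsync).
def conventional_update(product, count, path="/tmp/products.db"):
    product["count"] = count              # update in cache / main memory
    blob = pickle.dumps(product)          # serialise; add headers
    with open(path, "wb") as f:           # DB cache / filesystem cache / disk
        f.write(blob)
        f.flush()
        os.fsync(f.fileno())              # force the page to stable media

# Persistent-memory path: the in-memory object IS the durable object, so a
# single store into a mapped region is the whole update.
def pmem_update(buf, count):
    struct.pack_into("<q", buf, 0, count)  # one store; persists in NVM

product = {"name": "widget", "count": 0}
conventional_update(product, 42)

# Stand-in for a byte-addressable NVM region (really just a mapped file).
with open("/tmp/products.pmem", "wb") as f:
    f.write(b"\x00" * 8)
with open("/tmp/products.pmem", "r+b") as f:
    region = mmap.mmap(f.fileno(), 8)
    pmem_update(region, 42)
    region.flush()                        # analogous to a cacheline flush
    region.close()
```

Both paths end with 42 durable, but the persistent-memory path needed no serialisation, no packet headers, and no block I/O: just a store and a flush.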
Application Programming Models to Persistent Memory

Three standard, open interfaces, ordered by how much the application must change:

1. Indirect I/O access (conventional I/O): existing applications run unchanged; writes to a special volume go through the filesystem APIs and block I/O to an OS driver providing block-device emulation.
2. Indirect PM access (abstract PM access): applications are partially changed; source code is rewritten to use new middleware APIs (e.g. NVML) for specific data. Paths: EXT4/XFS cached/uncached DAX on Linux; NTFS/ReFS cached/uncached SCM block/DAX on Windows.
3. Direct PM access (native PM access): application source code manipulates data structures directly in persistent memory; used by object stores, new applications, and data analytics. Paths: EXT4/XFS AppDirect DAX on Linux; NTFS/ReFS AppDirect DAX on Windows.
Graph analytics time machine: massive memory and fast fabrics to ingest all data

Fast graph databases plus the ability to look at things that change over time: "Are there any emerging new behaviors in my network?" [Diagram: network graph highlighting a critical node, remote users, and external users.]

What if we could pre-compute an almost infinite set of "what ifs"? Optimization over a large search space in real time becomes realistic. Effectively infinite memory can simulate large-scale probabilities, just as in game algorithms: solve complex problems before they happen.
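The "pre-compute and look up" idea is essentially memoization at enormous scale, as in game-tree transposition tables. A toy sketch, where the recurrence is hypothetical and stands in for an expensive simulation step with overlapping subproblems:

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)   # unbounded cache stands in for a huge memory pool
def outcome(state: int) -> int:
    """Hypothetical expensive evaluation of one 'what if' state."""
    global calls
    calls += 1
    if state < 2:
        return state
    # Overlapping subproblems, like game positions reached by many move orders.
    return (outcome(state - 1) + outcome(state - 2)) % 1_000_000

result = outcome(60)
print(f"outcome(60) = {result}, with {calls} evaluations instead of ~2^60")
```

Caching collapses an exponential tree of recomputation to 61 evaluations, one per distinct state: the lookup-instead-of-recalculate effect the slides attribute to The Machine's memory capacity.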
Performance demonstration: similarity search, from offline to decision time

Use cases: content-based image/video retrieval; near-duplicate web page detection; similar document retrieval; outlier detection for e-commerce fraud mitigation; fingerprint matching; scalable object recognition; nearest-neighbor classification.

[Chart: millions of images indexed and searches per minute for disk-based, in-memory, and simulated-Machine configurations; figures shown include 1200, 8030, 80, and 0.34.]

The capacity to store the intermediate results of simulation models allows The Machine to simply look up results instead of recalculating them. Complex models converge in minutes, not days.
Transform performance with Memory-Driven programming
- In-memory analytics: 15× faster
- Similarity search: 300× faster
- Large-scale graph inference: 100× faster
- Financial models: 8000× faster

The approaches range from modifying existing frameworks, to new algorithms, to completely rethinking the problem.
"Things" generate data, and need control

[Diagram, "shift left" from the data center to the edge: Stage 1, data is sensed and things are controlled (the edge); Stage 2, data is acquired and aggregated; Stage 3, edge IT performs early analytics and compute; Stage 4, the data center and cloud perform deep analytics and compute. Data flows toward the center; control flows back to the edge. Operations technology lives at the early stages.]

NVM will play a major role in the IoT world. There is too much data to move it all: we need to move the code instead. The data should also be stored where it is created, moving only the metadata.
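"Move the code, not the data" can be sketched as shipping a small summarising function to each edge node and returning only metadata. The node names and readings below are invented for illustration:

```python
# Hypothetical edge nodes holding locally collected sensor readings.
edge_nodes = {
    "plant-1": [21.0, 22.5, 35.1, 21.9],
    "plant-2": [19.8, 20.1, 20.4],
}

def anomaly_summary(readings, threshold=30.0):
    """The shipped 'code': runs where the data lives, returns only metadata."""
    outliers = [r for r in readings if r > threshold]
    return {"count": len(readings), "anomalies": len(outliers)}

# Instead of hauling raw streams to the center, collect tiny summaries.
metadata = {node: anomaly_summary(data) for node, data in edge_nodes.items()}
print(metadata)
```

Only a few integers per node cross the network; the raw readings stay in NVM at the edge where they were created.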
Deep Learning and Edge Computing

Center (training): collects some of the data, continuously trains models, sends models to the edge nodes, and runs large-scale simulations.
Edge node (inference): receives the trained model, uses it in real time, collects data, and sends some of it back to the center.
[Image: a photo classified as "Building"; Krizhevsky et al. (2012)]
Thank you

More resources on The Machine. Industry articles, blogs, and social media outlets:
- The Machine on the Hewlett Packard Labs webpage: http://labs.hpe.com/research/themachine/
- Videos: Story on The Machine (https://www.youtube.com/watch?v=NwWF1LSmBJY) and The Machine: Future of Computing (https://www.youtube.com/watch?list=PL0_ubpZ6vGcAm1sLOSyQWYx_WTJ_u9zNr&v=NZ_rbeBy-ms)
- @HPE_Labs, Hewlett Packard Labs, Behind the scenes @ Labs, HPE

Technical articles from The Next Platform:
- Drilling Down Into The Machine From HPE: http://www.nextplatform.com/2016/01/04/drilling-down-into-the-machine-from-hpe/
- The Intertwining Of Memory And Performance Of HPE's Machine: http://www.nextplatform.com/2016/01/11/the-intertwining-of-memory-and-performance-of-hpes-machine/
- Weaving Together The Machine's Fabric Memory: http://www.nextplatform.com/2016/01/18/weaving-together-the-machines-fabric-memory/
- The Bits And Bytes Of The Machine's Storage: http://www.nextplatform.com/2016/01/25/the-bits-and-bytes-of-the-machines-storage/
- Non-Volatile Heaps And Object Stores In The Machine: http://www.nextplatform.com/2016/02/08/non-volatile-heaps-object-stores-machine/
- Operating Systems, Virtualization, And The Machine: http://www.nextplatform.com/2016/02/01/operating-systems-virtualization-machine/
- Future Systems: How HP Will Adapt The Machine To HPC: http://www.nextplatform.com/2015/08/17/future-systems-how-hp-will-adapt-the-machine-to-hpc/
- Spark on Superdome X Previews In-Memory on The Machine: http://www.nextplatform.com/2016/04/11/spark-superdome-x-previews-memory-machine/
- Programming for Persistent Memory Takes Persistence: http://www.nextplatform.com/2016/04/25/first-steps-program-model-persistent-memory/
- First Steps in the Program Model for Persistent Memory: http://www.nextplatform.com/2016/04/25/first-steps-program-model-persistent-memory/

IEEE:
- Adapting to Thrive in a New Economy of Memory Abundance (Computer magazine special article): http://www.labs.hpe.com/pdf/IEEE_Adapting_to_Thrive_in_a_New_Economy_of_Memory_Abundance.pdf
- At IEEE's Rebooting the Computer Conference, A New Economy of Memory Abundance: http://community.hpe.com/t5/Behind-the-scenes-Labs/At-IEEE-s-Rebooting-the-Computer-Conference-A-New-Economy-of/ba-p/6818400
- Blah, blah, technology, blah: Sharing the MDC Vision with the IEEE Conference: http://community.hpe.com/t5/Behind-the-scenes-Labs/Blah-blah-technology-blah-Sharing-the-MDC-Vision-with-the-IEEE/ba-p/6875502
- Memory-Driven Computing: how will it impact the world? http://community.hpe.com/t5/Behind-the-scenes-Labs/Memory-Driven-Computing-How-will-it-impact-the-world/ba-p/6796925