
Building Systems for Big Data and Big Compute

Steve Scott, Cray CTO

Smoky Mountains Conference, September 1, 2016


We’ve Been Doing “Big Data” For a Long Time

Massive Datasets

High Performance Memory, Interconnects, and Storage



Disruptive Memory Technology


● Standard DDR memory BW has not kept pace with CPUs

● HBM:
  ● ~10x higher BW, ~10x less energy/bit
  ● Costs ~2x DDR4 per bit

[Chart: Today’s DDR4 vs. Future HBM3 — bandwidth (GB/s) and pJ/bit for 4 channels of 2.4 GHz DDR4 vs. 4 stacks of gen-3 HBM on package]

May want more, smaller nodes, with better BW and capacity per op
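
The gap in that chart follows from simple arithmetic. Below is a rough back-of-the-envelope sketch: the DDR4 numbers follow from the standard (64-bit channels at 2400 MT/s), while the HBM per-stack rate is an illustrative assumption chosen to match the ~10x claim above, not a vendor spec.

```python
# Back-of-the-envelope peak bandwidth comparison.
# DDR4: 64-bit channel, 2400 MT/s (from the DDR4 standard).
# HBM: per-stack rate is an ASSUMPTION chosen to illustrate the ~10x claim.

def ddr4_bw_gbs(channels=4, transfer_rate_mts=2400, bus_bytes=8):
    """Peak DDR4 bandwidth in GB/s: channels x MT/s x bytes per transfer."""
    return channels * transfer_rate_mts * bus_bytes / 1000.0

def hbm_bw_gbs(stacks=4, per_stack_gbs=200.0):
    """Peak HBM bandwidth in GB/s, given an assumed per-stack rate."""
    return stacks * per_stack_gbs

ddr = ddr4_bw_gbs()   # ~77 GB/s for 4 channels of DDR4-2400
hbm = hbm_bw_gbs()    # ~800 GB/s under the assumed 200 GB/s per stack
print(f"DDR4 ~{ddr:.0f} GB/s, HBM ~{hbm:.0f} GB/s, ratio ~{hbm / ddr:.0f}x")
```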


Most “Big Data” Jobs Aren’t That Big


● Aggregate data is becoming very large, but most analytic jobs are modest
  ● Typical data analytics workloads: 10 GB mean, 100 GB 95th percentile
  ● Prabhat: big HPC analytics jobs are ~10x larger than that
  ● Many data analytics jobs run on a handful of cores

● Meanwhile, the APEX procurement wants multiple PB of memory!

[Illustration: 3 PB of main memory vs. a 1 TB “Big Data” job]



● I’ll interpret “Big Data” as meaning data analytics
  ● Extracting knowledge/insight from data
  ● As opposed to simulation and modeling, which generally produce data


Convergence of HPC and Big Data


What do we need to be doing in HPC that is different from what we have done in the past?


What is an optimal design for HPC?


Node Architecture A
• Dual Haswell nodes @ 2.4 GHz
• 128 GB DDR4 @ 2.66 GHz
• 12.5 GB/s/node network bandwidth

Node Architecture B
• Dual Haswell nodes @ 2.6 GHz
• 256 GB DDR4 @ 2.66 GHz
• 25 GB/s/node network bandwidth

Which of these is better? It’s not at all clear.
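
A quick sketch of why: the two designs trade clock rate against memory capacity and network bandwidth, so which is “better” depends on which ratio your workloads are limited by. The core count and per-core flop rate below are illustrative assumptions (16 cores per socket, Haswell AVX2 FMA), not the actual procurement configurations.

```python
# Ratios for the two hypothetical node designs above.
# ASSUMPTIONS for illustration: 2 sockets x 16 cores, 16 DP flops/cycle/core
# (Haswell AVX2 FMA); real core counts and prices would change the picture.

def node_ratios(ghz, dram_gb, net_gbs, sockets=2, cores=16, flops_per_cycle=16):
    peak_gflops = sockets * cores * flops_per_cycle * ghz
    return {
        "peak_gflops": round(peak_gflops),
        "dram_gb_per_tflops": round(dram_gb / (peak_gflops / 1000), 1),
        "net_bytes_per_flop": round(net_gbs / peak_gflops, 4),
    }

a = node_ratios(ghz=2.4, dram_gb=128, net_gbs=12.5)
b = node_ratios(ghz=2.6, dram_gb=256, net_gbs=25.0)
print("A:", a)  # fewer flops, and less memory and network per flop
print("B:", b)  # roughly 2x the memory and network per flop, at higher node cost
```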


Landscape of Parallel Computing Research (Berkeley – 2006/2008):
§ Map Reduce
§ N-body methods
§ Graph traversal
§ Graphical models
§ Dense and sparse linear algebra
§ Spectral methods
§ Structured and unstructured grids
§ Combinational logic
§ Dynamic programming
§ Backtrack and branch-and-bound
§ Finite-state machines

State of Big Data: Use Cases and Ogre Patterns (NIST 2014):
§ Basic statistics – simple Map Reduce implementation
§ Generalized n-body problems
§ Graph-theoretic computations
§ Linear algebraic computations
§ Optimizations – e.g., linear programming
§ Integration/machine learning
§ Alignment problems – e.g., BLAST

Data Analytics can be considered just another set of workloads in a sea of workloads.
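
As a concrete example of that overlap, the first NIST entry above, basic statistics via Map Reduce, is the same map-then-associative-reduce pattern HPC codes use for reductions. A minimal single-process sketch (a real Hadoop or Spark job distributes exactly these two steps):

```python
from functools import reduce

# Map phase: each record contributes a (count, sum) pair.
# Reduce phase: pairwise addition, which is associative and therefore
# parallelizes across nodes exactly as Hadoop/Spark would run it.
records = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0]

mapped = map(lambda x: (1, x), records)
count, total = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), mapped)
print("mean =", total / count)  # 18.0
```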



Generalizations About Analytics Workloads


● Data-centric workloads
  ⇒ Larger memories and local SSDs are helpful

● Vertical data motion is important
  ● Hadoop and Spark effectively move computation to the data, do initial filtering of data locally
  ⇒ Don’t (usually) need much network bandwidth

● Notable exceptions: graph analytics and machine learning
  ● Graph analytics
    ● Can’t partition the data! So really hard to scale! (many get discouraged)
    ● Wants a network that can do fine-grained RDMA well (similar to some HPC)
  ● Machine learning
    ● Training can be parallelized, can use lots of data, and requires global communication (see the sketch below)
    ● Wants a very high performance network and memory system
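
The global communication in training is typically an allreduce over gradients: each worker computes a gradient on its shard of the data, and the per-step sum across all workers is exactly the kind of collective a strong HPC network accelerates. A minimal data-parallel sketch using mpi4py and NumPy (the model and gradient here are stand-ins, not a real training setup):

```python
# Data-parallel training step: every rank computes a gradient on its own data
# shard, then a single allreduce sums (and here averages) gradients globally.
# Run with, e.g.: mpirun -n 4 python train_step.py  (mpi4py + NumPy assumed)
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()

weights = np.zeros(1000)              # toy model: 1000 parameters
local_grad = np.random.rand(1000)     # stand-in for this rank's gradient

# The global communication step: one allreduce per training iteration.
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= size

weights -= 0.01 * global_grad         # identical SGD update on every rank
```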


Merging of HPC and Data Analytics


● Urika-GD: custom graph analytics engine
● Urika-XA: Hadoop, Spark, NoSQL
● Urika-GX (“Athena”): integrated system with the Cray Graph Engine, an open analytics framework, and the Aries network (Hadoop/Spark + graph analytics + HPC)
● XC40 (“Minerva”): the world’s leading supercomputer, targeting HPC + analytics workflows

Why combine HPC and analytics solutions in a single box? HPC underneath the covers.


Building an Analytics Machine


● Urika-GX approach:
  ● 48 Haswell nodes per cabinet
  ● Aries network
  ● Up to 512 GB DRAM per node
  ● Dual SATA HDDs per node
  ● Up to 4 TB SSD per node

● XC40 approach:
  ● 192 Haswell nodes per cabinet
  ● Aries network
  ● Up to 256 GB DRAM per node
  ● DataWarp 12 TB SSD blades, which can be dynamically shared across the system

But we still need to address the Lustre metadata bottleneck for codes that do lots of “local” file I/O.


Using Shifter to Accelerate Per-Node I/O


• Demonstrated > 100x speedup vs. straight Lustre on IOPS benchmark at 256 nodes

• Demonstrated Spark scaling to 50,000 cores in CUG 2016 paper

“NAS storage surprisingly close to local SSDs”

https://cug.org/proceedings/cug2016_proceedings/includes/files/pap125.pdf
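
Much of that speedup comes from pointing Spark’s node-local scratch space (shuffle spills, block-manager files) at per-node storage instead of the shared Lustre namespace, so the metadata-heavy small-file I/O never reaches the Lustre metadata server. A hedged sketch of the relevant Spark setting; the mount point is a placeholder for whatever per-node path Shifter or a local SSD provides on a given system:

```python
# Point Spark's node-local scratch space at per-node storage rather than the
# shared Lustre namespace. "/tmp/pernode_cache" is a PLACEHOLDER for whatever
# per-node mount Shifter or a local SSD provides on a given system.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("shuffle-on-node-local-storage")
        .set("spark.local.dir", "/tmp/pernode_cache"))  # shuffle spills, block files

sc = SparkContext(conf=conf)

# Shuffle-heavy work now does its small-file I/O against node-local storage,
# so it never touches the Lustre metadata server.
counts = (sc.parallelize(range(10**6))
            .map(lambda x: (x % 100, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.count())  # 100 distinct keys
```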


Resource Management and Scheduling


Picture from Malte Schwarzkopf’s blog: http://www.firmament.io/blog/scheduler-architectures.html

● Analytics workloads can have very different scheduling needs than HPC workloads
  ● May want very fine-grained scheduling (cores, not nodes)
  ● May have long-running services processing streaming data
  ● May need to dynamically expand/contract
  ● May be tied to real-time events such as experimental control or output processing
  ● May be interactive/bursty (database utilization depends on queries)


Other Analytics Implications (mostly SW)


● Greater diversity of programming languages & environments
  ● Python, R, Julia, Spark, Scala, ML frameworks, etc.
  ● MPI + OpenMP is a foreign concept to the analytics community
  ● Openness and container support are important

● Cloud interoperability
  ● E.g.: source data from cloud ➝ compute/analyze ➝ store data back in cloud

● Data movement between apps
  ● HPC tends to focus on accelerating single applications
  ● Analytics workloads usually involve pipelines
  ● Shared data formats can allow data exchange in memory
    ● E.g.: Arrow in-memory data structure specification for columnar data (see the sketch below)
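
For instance, an upstream Python stage can hand a columnar table to a downstream consumer through Arrow’s language-independent in-memory format instead of serializing rows to CSV on disk. A minimal sketch using pyarrow (the file name and columns are made up for illustration):

```python
# Producer: build an Arrow table and write it in Arrow's IPC file format.
# A consumer in another language (Java/Spark, C++, R, ...) can map the same
# bytes without a row-by-row conversion. File name and columns are made up.
import pyarrow as pa

table = pa.table({
    "event_id": [1, 2, 3],
    "energy":   [13.2, 7.8, 22.1],
})

with pa.OSFile("pipeline_stage1.arrow", "wb") as sink:
    writer = pa.ipc.new_file(sink, table.schema)
    writer.write_table(table)
    writer.close()

# Consumer side (could be a separate process or tool): zero-copy read.
with pa.memory_map("pipeline_stage1.arrow") as source:
    loaded = pa.ipc.open_file(source).read_all()
print(loaded.column("energy"))
```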


Takeaways


● Strong motivation for HPC + Big Data in a single system
  ● Growing desire for HPC + analytic workflows
  ● More efficient when data can be transferred in memory/SSD
  ● Utilization is better with systems that can be dynamically provisioned

● Big Data is just another set of workloads
  ● Not that different (we already build machines to handle big data)
  ● On average, probably want more memory per node for analytics
  ● Some workloads don’t need much network, but others need a strong network
  ● May argue for heterogeneous systems (we already do that for HPC)

● Biggest issue may be resource management/scheduling
  ● A few other software issues, but no show stoppers for converged systems


Thank You! Questions?