New Era of High Performance Computing (convergence of AI ... Petascale Cray XC Systems UNDISCLOSED SYSTEMS

  • New Era of High Performance Computing (convergence of AI, Big Data, HPC)

    Rajesh Chhabra raj@cray.com

  • Petascale Cray XC Systems

    UNDISCLOSED SYSTEMS

    Top 500 Supercomputers in the World, June 2018

                    Top 10   Top 50   Top 100
    Cray Systems    4        18       29
    Vendor Rank     #1       #1       #1

  • Cray’s Supercomputing Leadership

    Copyright 2018 Cray Inc. - Confidential and Proprietary

    Top 500 Supercomputers in the World, Nov 2017

                    Top 10   Top 50   Top 100
    Cray Systems    4        18       29
    Vendor Rank     #1       #1       #1

  • The Convergence of Big Data, AI and HPC

    Modeling The World

    Data-Intensive Processing: Hybrid workflows with a mix of simulation and analytics

    Data Models: Analysis of large datasets for knowledge discovery, insight, and prediction.

    Math Models: Simulation and modelling of the natural world via mathematical equations

  • Workloads are becoming more heterogeneous

  • The Convergence of Big Data, AI and HPC

    Systems/Container Management

    Analytics/Machine Learning Ecosystem

    Deep Learning Toolkits

    Big Data

    Today: Running software built for the cloud on HPC hardware

    Benefit: Convergence of productivity and performance

  • Machine Learning Coming to Your Science Domain

    Clustering Daya Bay Events

    Classifying LHC Events

    Oxford Nanopore Sequencing

    Detecting Extreme Weather

    FWI Subsurface Modelling

    Modeling Galaxy Shapes

    Turbine CFD Modeling

    Protein-Ligand Binding

    New and Larger Deep Learning Models Required

  • Deep Learning Use Cases

    Consumer
    • Search • Face/Object Detection • Image Segmentation • Speech Understanding • NLP • Text to Speech
    Topologies: • ResNet • SSD • LSTM • Attention • SparseNN • FCN

    Retail
    • Person and Object Detection • Image Segmentation • Scene Analytics • Support • Marketing • Supply Chain • Security
    Topologies: • ResNet • SSD • FCN • RNN

    Energy
    • Oil and Gas Exploration • Smart Grid • Operational Improvement • Conservation
    Topologies: • Deep Reinforcement

    Financial
    • Algorithmic Trading • Fraud Detection • Personal Finance • Risk Mitigation • Security
    Topologies: • Deep Reinforcement

    Health
    • Enhanced Diagnostics • 3D Medical Imaging • Drug Discovery • Sensory Aids
    Topologies: • ResNet • SSD • FCN

    Industrial
    • Factory Automation • Predictive Maintenance • Precision Agriculture • Field Automation
    Topologies: • ResNet • SSD • Deep Reinforcement Learning

    Autonomous Driving
    • Pedestrian, Vehicle, and Object Detection and Classification • Ego Motion • Sensor Fusion • Environment Modeling
    Topologies: • Deep Reinforcement Learning • LSTM • SSD

    8/31/2018

  • NERSC – Deep Learning in Science

    Opportunities to apply DL widely in support of classic HPC simulation and modelling

  • Molecular Engineering of Solar-powered Windows Jacqueline M. Cole, University of Cambridge, Argonne National Lab

    1. Extract Compound Data from Scientific Publications

    2. Enrich Data with ML and Quantum Chemical Calculations

    3. Filter Data Set to small number of candidates (ML)

    4. Validate final candidates (sim)

  • Relationship Between AI, ML, & DL

    ● “AI” is a very broad term, with no clear boundaries

    ● “AI” and “deep learning” are not synonymous

    ● Machine learning is just a part of AI

    ● Deep learning is a specialization of machine learning

    ● Cray focuses on Deep Learning

  • Neural Network Workflow

    ● NN workflows are similar in many ways to typical data science workflows

    ● Ingest/clean & transform can be major undertakings, as usual

    ● Training results in a model that can then be used for inference, which produces answers in production

    Cray’s biggest contribution is to be made in the computationally intensive training phase!

  • Deep Learning in Production

    Example of an end-to-end workflow (iterative):

    Data Acquisition → Data Preparation → Model Training → Model Testing → Model Deployment

    ● Data Preparation: Cleansing, Shaping, Enrichment, Data Annotation (Ground Truth)

    ● Data sets: Training Set, Validation Set, Test Set

    ● Model Training and Testing: Train Model, Cross-Validation, Evaluate Performance and optimize model

    ● Model Deployment: A/B testing in production

    ● Phases: Training, Inferencing

    ● Model management
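    The workflow names a training set, a validation set, a test set and cross-validation without showing how they relate. The sketch below is one common arrangement, not necessarily the one behind the slide: a test set is held out once, and the remaining samples are rotated through k folds so that each fold serves as the validation set exactly once. The function name `split_dataset`, the 20% test fraction and the choice of 4 folds are assumptions made for illustration.

```python
import random


def split_dataset(n_samples, test_fraction=0.2, k_folds=4, seed=0):
    """Return (test_indices, rounds); each round is (training, validation)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly

    # Hold out the test set; it is never used during training or tuning.
    n_test = int(n_samples * test_fraction)
    test = indices[:n_test]
    remaining = indices[n_test:]

    # Deal the remaining samples into k folds.
    folds = [remaining[i::k_folds] for i in range(k_folds)]

    # Cross-validation: each fold is the validation set once,
    # and the other folds together form the training set.
    rounds = []
    for i in range(k_folds):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        rounds.append((training, validation))
    return test, rounds


test_set, cv_rounds = split_dataset(100)
for number, (training, validation) in enumerate(cv_rounds, start=1):
    print(f"round {number}: {len(training)} training, {len(validation)} validation")
print(f"held-out test samples: {len(test_set)}")
```

    The test set is touched only for the final "Model Testing" step, so it stays independent of every choice made while optimizing the model.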

  • Deep Learning: Behind the Scenes

    Data Acquisition → Data Preparation → Model Training → Model Testing → Model Deployment (A/B testing in production)

    Training | Inferencing

    Ideal training algorithm:

    ● For every training sample:
        • run sample forward through the model
        • compute the error vs. the training data
        • back-propagate error through the NN to update the weights (gradient descent)

    ● Typically broken up into “mini-batches”
        • Exposes more intra-node parallelism; arguably reduces “noise”

    ● After all data is processed, adjust “learning rate” and repeat until desired accuracy achieved
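    The steps above can be sketched on a deliberately tiny model. This is an illustration only: a one-weight linear model stands in for the neural network, so "back-propagation" reduces to a single gradient expression. The function name, learning rate, decay factor and batch size are all assumed values, not taken from the slides.

```python
import random


def train_linear_model(samples, epochs=300, batch_size=4,
                       learning_rate=0.1, decay=0.99, seed=0):
    """Fit y ~ w*x + b with mini-batch gradient descent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)
        # Process the data set one mini-batch at a time.
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_w = grad_b = 0.0
            for x, y in batch:
                prediction = w * x + b   # run the sample forward
                error = prediction - y   # error vs. the training data
                grad_w += error * x      # gradient of 0.5 * error**2 w.r.t. w
                grad_b += error          # gradient of 0.5 * error**2 w.r.t. b
            # Update the weights with the batch-averaged gradient.
            w -= learning_rate * grad_w / len(batch)
            b -= learning_rate * grad_b / len(batch)
        # After all data is processed, adjust the learning rate and repeat.
        learning_rate *= decay
    return w, b


# Noise-free samples of y = 3x + 1 on [-1, 1].
samples = [(k / 10.0, 3.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w, b = train_linear_model(samples)
print(f"w = {w:.3f}, b = {b:.3f}")
```

    Averaging the gradient over a mini-batch is what exposes the parallelism mentioned above: the per-sample gradients inside a batch are independent and can be computed concurrently.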

    DNN model with weights on all connections. Largest models now hundreds of layers, and millions (to billions) of nodes

  • HPC Thinking: Message-size, MPI-collective, Global all-reduce modifications

    Source: Peter Mendygral and Jef Dawson, Cray PE and Performance

    90%+ scalability efficiency that can reduce training time from days to hours

    Differentiating Results: TensorFlow
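    The slide credits message-size and all-reduce tuning for the scaling result but does not show the pattern itself. As a rough sketch of data-parallel training in general (not of Cray's implementation), each worker computes a gradient on its own shard of the batch and a global all-reduce averages those gradients so every worker applies the same update. The all-reduce here is an in-process stand-in for an MPI collective; `local_gradient`, `allreduce_mean` and the toy linear model are assumptions for illustration.

```python
def local_gradient(shard, w, b):
    """Mean gradient of 0.5 * (w*x + b - y)**2 over one worker's shard."""
    grad_w = grad_b = 0.0
    for x, y in shard:
        error = (w * x + b) - y
        grad_w += error * x
        grad_b += error
    n = len(shard)
    return [grad_w / n, grad_b / n]


def allreduce_mean(per_worker):
    """In-process stand-in for a global all-reduce.

    Sums the vectors element-wise, divides by the worker count,
    and hands every worker an identical copy of the result.
    """
    n = len(per_worker)
    summed = [sum(values) for values in zip(*per_worker)]
    return [[s / n for s in summed] for _ in range(n)]


# One global batch of 8 samples split evenly across 4 simulated workers.
data = [(float(x), 2.0 * x - 1.0) for x in range(8)]
num_workers = 4
shards = [data[i::num_workers] for i in range(num_workers)]

w, b = 0.5, 0.0
local = [local_gradient(shard, w, b) for shard in shards]
reduced = allreduce_mean(local)

# With equal shard sizes, the averaged gradient matches the gradient
# a single process would compute over the whole batch.
global_grad = local_gradient(data, w, b)
print("all-reduced gradient:", reduced[0])
print("single-process gradient:", global_grad)
```

    Every training step ends in one such collective over the full set of model weights, which is why its cost grows in importance as the number of workers increases.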

  • Cray Machine Learning / AI Environment

    Cray Distributed Training Framework Delivers up to 5x Performance* over other Distributed Training Approaches

    * Actual performance depends on system, batch size and model

    Deep Learning Toolkits

    Analytics/Machine Learning Ecosystem

    NCCL

  • Not Just More Data But Also DIFFERENT I/O Patterns

    Large, streaming I/O (HDDs shine)

    Small, random I/O (SSDs shine)

    Modelling & Simulation

    Advanced Analytics

    Artificial Intelligence

  • ClusterStor Converged Building Blocks

    Embedded HA Lustre Object Storage Servers

                              L300                   L300N                        L300F
    Form Factor               5U84 12Gb/s SAS        5U84 12Gb/s SAS              2U24 12Gb/s SAS
    HDD/SSD                   82/2                   82/2 or 80/4 (with NXD SW)   0/24
    IOPS (4K random write)    4,000                  40,000                       500,000
    Throughput GB/s*          10 rd/10 wr            10 rd/10 wr                  10 rd/20 wr
    Cost/usable GB            1                      1.15                         ~ 30
    Best used for             Large, streaming I/O   Mixed I/O                    Small, random I/O

    *Conservatively derated
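    To see why the IOPS and throughput rows describe different regimes, the quoted 4K random-write IOPS figures can be converted into the bandwidth they imply. This is back-of-the-envelope arithmetic under two assumptions not stated in the slides: "4K" is taken as 4096 bytes and one GB as 10^9 bytes.

```python
BLOCK_BYTES = 4 * 1024         # assumed size of one "4K" write
STREAMING_WRITE_GB_PER_S = 10  # lowest quoted streaming write throughput


def random_write_mb_per_s(ops_per_second, block_bytes=BLOCK_BYTES):
    """Bandwidth in MB/s implied by a given small-block IOPS figure."""
    return ops_per_second * block_bytes / 1e6


for iops in (4_000, 40_000, 500_000):  # the three quoted IOPS figures
    mb_per_s = random_write_mb_per_s(iops)
    share = mb_per_s / (STREAMING_WRITE_GB_PER_S * 1000)
    print(f"{iops:>7,} IOPS x 4 KiB = {mb_per_s:8.1f} MB/s "
          f"({share:.1%} of {STREAMING_WRITE_GB_PER_S} GB/s streaming)")
```

    Under these assumptions even the highest IOPS figure moves only about 2 GB/s, so small random I/O is limited by operations per second long before it approaches the streaming throughput.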

  • Base rack

    Copyright 2017 Cray Inc.

    ● No single point of failure

    ● 2 X GigE Switches (1U each)

    ● 2 X IB Switches (1U each)

    ● SMU / System Management Unit (2U)

    ● MMU / Metadata Management Unit (2U)

    ● 6 X SSU (5U each)

  • Expansion rack

    ● No single point of failure

    ● 2 X GigE Switches (1U each)

    ● 2 X IB Switches (1U each)

    ● 7 X SSU (5U each)
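    The component lists for the base and expansion racks can be totalled to see how much rack space each layout occupies. The sketch below only adds up the heights given in the slides; it assumes one SMU and one MMU per base rack, since the slides state their height but not their count.

```python
# (component, count, height in rack units); SMU and MMU counts are assumed.
BASE_RACK = [
    ("GigE switch", 2, 1),
    ("IB switch", 2, 1),
    ("SMU", 1, 2),
    ("MMU", 1, 2),
    ("SSU", 6, 5),
]
EXPANSION_RACK = [
    ("GigE switch", 2, 1),
    ("IB switch", 2, 1),
    ("SSU", 7, 5),
]


def rack_units(components):
    """Total height in rack units of all listed components."""
    return sum(count * height_u for _, count, height_u in components)


base_u = rack_units(BASE_RACK)
expansion_u = rack_units(EXPANSION_RACK)
print(f"base rack: {base_u}U, expansion rack: {expansion_u}U")
```

    Dropping the two 2U management units frees 4U in the expansion rack, and a seventh 5U SSU takes that space plus one more unit.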

  • SSU Specs – 7.2 K RPM NL-SAS drive (Expansion rack)

    Rack      # drives (HDDs/SSDs)   8TB HDD TBs (U/R)   IOR perf GB/s*   Power kW
    SSU #6    574 / 14               3304 / 4592         63               14.9
    SSU #5    492 / 12               2832 / 3936         54               12.6
    SSU #4    410 / 10               2360 / 3280         45               10.9
    SSU #3    328 / 8                1888 / 2624         36               9.2
    SSU #2    246 / 6                1416 / 1968         27               7.4
    SSU #1    164 / 4                944 / 1312          18               5.7
    SSU #0    82 / 2                 472 / 656           9                4.0
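    The drive-count, capacity and IOR columns of the SSU table scale linearly with the number of SSUs installed, which a short calculation can confirm. The per-SSU constants below are read from the "SSU #0" row. Reading "(U/R)" as usable/raw is an assumption, supported by 82 HDDs x 8 TB giving the 656 TB in the second position. Power is left out because the quoted figures do not scale by a constant step.

```python
HDDS_PER_SSU = 82
SSDS_PER_SSU = 2
HDD_TB = 8               # 8TB drives, per the column heading
USABLE_TB_PER_SSU = 472  # from the "SSU #0" row
IOR_GBPS_PER_SSU = 9     # from the "SSU #0" row


def rack_totals(ssu_count):
    """Cumulative figures for a rack populated with ssu_count SSUs."""
    return {
        "hdds": ssu_count * HDDS_PER_SSU,
        "ssds": ssu_count * SSDS_PER_SSU,
        "usable_tb": ssu_count * USABLE_TB_PER_SSU,
        "raw_tb": ssu_count * HDDS_PER_SSU * HDD_TB,
        "ior_gbps": ssu_count * IOR_GBPS_PER_SSU,
    }


# Row "SSU #n" corresponds to n + 1 installed SSUs.
for label in range(7):
    totals = rack_totals(label + 1)
    print(f"SSU #{label}: {totals['hdds']} / {totals['ssds']} drives, "
          f"{totals['usable_tb']} / {totals['raw_tb']} TB, "
          f"{totals['ior_gbps']} GB/s")

full_rack = rack_totals(7)
```

    The usable-to-raw ratio works out to 472/656, roughly 72%, at every rack size.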

  • Runtime Variability: Real-time and historical views of metrics to understand what is impacting applications

    Problem Isolation: Unified view of system activity enables problem isolation in complex environments

    Trend Analysis: Enable data-driven decisions and