
  • New Era of High Performance Computing (convergence of AI, Big Data, HPC)

    Rajesh Chhabra

  • Petascale

    Cray XC



    Top 500 Supercomputers in the World, June 2018

                    Top 10   Top 50   Top 100
    Cray Systems    4        18       29
    Vendor Rank     #1       #1       #1


  • Cray’s Supercomputing Leadership

    Copyright 2018 Cray Inc. - Confidential and Proprietary 3

    Top 500 Supercomputers in the World, Nov 2017

                    Top 10   Top 50   Top 100
    Cray Systems    4        18       29
    Vendor Rank     #1       #1       #1

  • The Convergence of Big Data, AI and HPC

    Modeling The World

    Data-Intensive Processing

    Hybrid workflows with a mix of simulation and


    Data Models

    Analysis of large datasets for knowledge discovery, insight, and


    Math Models

    Simulation and modelling of the natural world via mathematical equations

  • Workloads are




  • The Convergence of Big Data, AI and HPC


    Systems/Container Management

    Analytics/Machine Learning Ecosystem

    Deep Learning Toolkits

    Big Data

    Today: Running software built for the cloud on HPC hardware

    Benefit: Convergence of productivity and performance

  • Machine Learning Coming to Your Science Domain

    Clustering Daya Bay


    Classifying LHC


    Oxford Nanopore Sequencing

    Detecting Extreme Weather

    FWI Subsurface


    Modeling Galaxy Shapes

    Turbine CFD Modeling

    New and Larger Deep Learning Models Required

    Protein-Ligand Binding

  • Deep Learning Use Cases

    Consumer: Search, Face/Object Detection, Image Segmentation, Speech Understanding, NLP, Text to Speech

    Retail: Person and Object Detection, Image Segmentation, Scene Analytics, Support, Marketing, Supply Chain, Security

    Energy: Oil and Gas Exploration, Smart Grid, Operational Improvement, Conservation

    Financial: Algorithmic Trading, Fraud Detection, Personal Finance, Risk Mitigation, Security

    Health: Enhanced Diagnostics, 3D Medical Imaging, Drug Discovery, Sensory Aids

    Industrial: Factory Automation, Predictive Maintenance, Precision Agriculture, Field Automation

    Autonomous Driving: Pedestrian, Vehicle, and Object Detection and Classification, Ego Motion, Sensor Fusion, Environment



    •ResNet •SSD •LSTM •Attention •SparseNN •FCN


    •ResNet •SSD •FCN •RNN


    •Deep Reinforcement


    •Deep Reinforcement


    •ResNet •SSD •FCN


    •ResNet •SSD •Deep Reinforcement Learning


    •Deep Reinforcement Learning

    •LSTM •SSD


  • NERSC – Deep Learning in Science

    Opportunities to apply DL widely in support of classic HPC simulation and modelling

  • Molecular Engineering of Solar-powered Windows

    Jacqueline M. Cole, University of Cambridge, Argonne National Lab

    1. Extract Compound Data from Scientific Publications

    2. Enrich Data with ML and Quantum Chemical Calculations

    3. Filter Data Set to small number of candidates (ML)

    4. Validate final candidates (sim)

  • Relationship Between AI, ML, & DL

    ● “AI” is a very broad term, with no clear boundaries

    ● “AI” and “deep learning” are not synonymous

    ● Machine learning is just a part of AI

    ● Deep learning is a specialization of machine learning

    ● Cray focuses on Deep Learning

  • Neural Network Workflow

    ● NN workflows are similar in many ways to typical data science workflows

    ● Ingest/clean & transform can be major undertakings, as usual

    ● Training results in a model that can then be used for inference, which produces answers in


    Cray’s biggest contribution is to be made in the computationally intensive training phase!

  • Deep Learning in Production

    Example of an end-to-end workflow, spanning training and inferencing:

    ● Data Acquisition

    ● Data Preparation: Cleansing, Shaping, Enrichment

    ● Data Annotation (Ground Truth)

    ● Training Set, Validation Set, Test Set

    ● Model Training: Train Model; Evaluate Performance and optimize model

    ● Model Testing

    ● Model Deployment

    ● A/B testing in production

    ● Model management
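
    The training, validation and test sets named in this workflow can be illustrated with a small sketch. The function name `split_dataset`, the 80/10/10 fractions and the fixed seed are assumptions chosen for illustration; the slide does not specify how the data is divided.

```python
import random


def split_dataset(samples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle samples and split them into training, validation and test sets.

    The fractions are illustrative defaults, not values taken from the slides.
    """
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty share for the test set")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]                  # used to fit the weights
    val = shuffled[n_train:n_train + n_val]     # used to tune and optimize the model
    test = shuffled[n_train + n_val:]           # held back for the final model test
    return train, val, test


train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))
```

    Keeping the test set untouched until the model-testing stage is what makes the final evaluation independent of the tuning done against the validation set.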

  • Deep Learning: Behind the Scenes

    Workflow stages: Data Acquisition, Data Preparation, Model Training, Model Testing, Model Deployment, A/B testing in production (phases: Training, Inferencing)

    Ideal training algorithm: For every training sample:

    ● run sample forward through the model

    ● compute the error vs. the training data

    ● back-propagate error through the NN to update the weights (gradient descent)

    ● Typically broken up into “mini-batches”

    ● Exposes more intra-node parallelism; arguably reduces “noise”

    ● After all data is processed, adjust “learning rate” and repeat until desired accuracy achieved

    DNN model with weights on all connections. Largest models now hundreds of layers, and millions (to billions) of nodes
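
    The training loop described on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Cray's implementation: the "model" is a single linear unit y ≈ w·x + b rather than a deep network, and the function name `train_minibatch_sgd`, the batch size, the learning rate and the decay factor are all hypothetical choices.

```python
import random


def train_minibatch_sgd(xs, ys, batch_size=4, lr=0.1, lr_decay=0.99, epochs=300, seed=0):
    """Fit y ~ w*x + b with mini-batch gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new order each pass
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                pred = w * xs[i] + b        # run the sample forward through the model
                err = pred - ys[i]          # error vs. the training data
                grad_w += 2 * err * xs[i]   # gradient of err**2 with respect to w
                grad_b += 2 * err           # gradient of err**2 with respect to b
            # one weight update per mini-batch, using the averaged gradient
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
        lr *= lr_decay  # adjust the learning rate after all data is processed
    return w, b


# Synthetic, noise-free data generated from y = 3x + 1 (illustrative only).
xs = [i / 8 - 1 for i in range(16)]
ys = [3 * x + 1 for x in xs]
w, b = train_minibatch_sgd(xs, ys)
print(round(w, 3), round(b, 3))
```

    In a real DNN the per-sample gradient comes from back-propagation through every layer, but the outer structure (shuffle, split into mini-batches, average the gradient, update, decay the learning rate) is the same.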

  • HPC Thinking: Message-size, MPI-collective, Global all-reduce modifications

    Source: Peter Mendygral and Jef Dawson, Cray PE and Performance

    90%+ scalability efficiency that can reduce training time from days to hours

    Differentiating Results: TensorFlow
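
    The role of the global all-reduce in distributed training can be shown with a single-process simulation. This is a sketch under assumptions: `allreduce_mean` is a hypothetical stand-in for an MPI all-reduce, the workers are plain Python lists, and the model is a one-parameter linear fit. It does not reproduce the Cray modifications referred to above.

```python
def local_gradient(w, shard):
    """Mean squared-error gradient for y ~ w*x over one worker's shard of data."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def allreduce_mean(values):
    """Simulated global all-reduce: every worker ends up with the same mean value."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def data_parallel_step(w, shards, lr):
    """One synchronous data-parallel update across all simulated workers."""
    grads = [local_gradient(w, shard) for shard in shards]  # computed independently
    reduced = allreduce_mean(grads)                          # one collective per step
    return w - lr * reduced[0]                               # identical update everywhere


# Illustrative data from y = 2x, split evenly across 4 simulated workers.
data = [(float(x), 2.0 * x) for x in range(1, 9)]
shards = [data[i:i + 2] for i in range(0, len(data), 2)]

w_parallel = data_parallel_step(0.5, shards, lr=0.01)
w_serial = 0.5 - 0.01 * local_gradient(0.5, data)
print(abs(w_parallel - w_serial) < 1e-9)
```

    With equally sized shards the averaged gradient matches the single-worker gradient over the whole batch, so the only extra cost of scaling out is the collective itself. That is why message size and all-reduce performance dominate scaling efficiency.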

  • Cray Machine Learning / AI Environment

    Cray Distributed Training Framework Delivers up to 5x Performance* over other Distributed Training Approaches

    * Actual performance depends on system, batch size and model

    Deep Learning Toolkits

    Analytics/Machine Learning Ecosystem


  • Not Just More Data But Also DIFFERENT I/O Patterns


    Large, streaming I/O (HDDs shine)

    Small, random I/O (SSDs shine)

    Modelling &



    Analytics Artificial


  • ClusterStor Converged Building Blocks: Embedded HA Lustre Object Storage Servers

                        L300                   L300N                        L300F
    Form Factor         5U84 12Gb/s SAS        5U84 12Gb/s SAS              2U24 12Gb/s SAS
    # drives            82/2                   82/2 or 80/4 (with NXD SW)   0/24
    IOPS 4K rand. Wr    4,000                  40,000                       500,000
    rd / wr             10 rd/10 wr            10 rd/10 wr                  10 rd/20 wr
    Cost/usable GB      1                      1.15                         ~ 30
    Best used for       Large, streaming I/O   Mixed                        Small, random I/O

    *Conservatively derated
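
    To put the 4K random-write IOPS figures listed above (4,000, 40,000 and 500,000) into bandwidth terms, a short worked conversion helps. The sketch assumes that "4K" means 4 KiB (4,096 bytes) per write and reports decimal megabytes per second; both are assumptions, since the slide does not define the units.

```python
BLOCK_BYTES = 4 * 1024  # assumption: "4K" is 4 KiB per random write


def iops_to_mb_per_s(iops, block_bytes=BLOCK_BYTES):
    """Convert an IOPS figure into sustained bandwidth in MB/s (10**6 bytes)."""
    return iops * block_bytes / 1e6


for iops in (4_000, 40_000, 500_000):
    print(f"{iops:>7} IOPS -> {iops_to_mb_per_s(iops):.1f} MB/s")
```

    Even the highest small-block figure corresponds to far less bandwidth than the multi-GB/s streaming rates quoted elsewhere in the deck, which is why small, random I/O has to be sized by IOPS rather than by throughput.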

  • Base rack

    Copyright 2017 Cray Inc.

    ● No single point of failure

    ● 2 X GigE Switches (1U each)

    ● 2 X IB Switches (1U each)

    ● SMU / System Management Unit


    ● MMU / Metadata Management Unit


    ● 6 X SSU (5U each)


  • Expansion rack

    ● No single point of failure

    ● 2 X GigE Switches (1U each)

    ● 2 X IB Switches (1U each)

    ● 7 X SSU (5U each)


  • Rack     # drives    8TB HDD TBs (U/R)   IOR perf GB/s*   Power kW

    SSU #6   574 / 14    3304 / 4592         63               14.9
    SSU #5   492 / 12    2832 / 3936         54               12.6
    SSU #4   410 / 10    2360 / 3280         45               10.9
    SSU #3   328 / 8     1888 / 2624         36               9.2
    SSU #2   246 / 6     1416 / 1968         27               7.4
    SSU #1   164 / 4     944 / 1312          18               5.7
    SSU #0   82 / 2      472 / 656           9                4.0

    SSU Specs – 7.2 K RPM NL-SAS drive (Expansion rack)
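
    The drive, capacity and IOR columns in this table scale linearly with the number of SSUs, so each row can be reproduced from the SSU #0 figures. The sketch below does that arithmetic; the function name `expansion_rack_row` and the dictionary keys are hypothetical. Power is deliberately left out because its increments in the table are not uniform.

```python
# Per-SSU figures taken from the SSU #0 row of the table.
DRIVES_PER_SSU = (82, 2)   # the two counts in the "# drives" column, as listed
USABLE_TB_PER_SSU = 472    # usable capacity with 8 TB HDDs
RAW_TB_PER_SSU = 656       # raw capacity: 82 drives x 8 TB
IOR_GB_S_PER_SSU = 9       # IOR performance per SSU


def expansion_rack_row(ssu_index):
    """Return the cumulative figures for the row labelled 'SSU #<ssu_index>'."""
    n = ssu_index + 1  # SSU #0 is the first unit, so row k holds k + 1 SSUs
    return {
        "drives": (DRIVES_PER_SSU[0] * n, DRIVES_PER_SSU[1] * n),
        "usable_tb": USABLE_TB_PER_SSU * n,
        "raw_tb": RAW_TB_PER_SSU * n,
        "ior_gb_s": IOR_GB_S_PER_SSU * n,
    }


for index in range(6, -1, -1):
    print(f"SSU #{index}", expansion_rack_row(index))
```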


  • CRAY INC - COPYRIGHT 2018

    Runtime Variability: Real time and historical views of metrics to understand what is impacting

    Problem Isolation: Unified view of system activity enables problem isolation in complex

    Trend Analysis: Enable data-driven decisions and