
HPC & AI Applications on Intel® Xeon Phi™ Processor

Tools make things better and easier when people use them effectively.

Zongyan Cao 曹宗雁

Intel Software and Services Group

June 30, 2017


Agenda

Overview

Tools introduction with cases sharing

Compilers

Libraries

Profiling & Analyzing Tools

Cluster Tools

……

More Information

For Discovery and Business Innovation

in Science, Visualization & Analytics


See featured applications: Intel® Xeon Phi™ Processor Applications Showcase

https://software.intel.com/sites/default/files/managed/b4/b0/Nov2016_Intel_Xeon_Phi_Processor_Showcase.pdf

What is Knights Landing?

[Figure: Knights Landing package with on-package MCDRAM, DDR4 channels, 36 lanes PCIe* Gen3 (x16, x16, x4), and x4 DMI2 to PCH]

Self-boot processor: binary-compatible with Xeon, 3+ TFLOPS¹ (DP)

On-package memory: 16GB, up to 490 GB/s STREAM TRIAD

Platform memory: up to 384GB (6ch DDR4-2400 MHz)

Other key features:
• 2D mesh architecture
• Out-of-order cores
• 3X single-thread performance vs. KNC
• Intel® AVX-512 instructions
• Scatter/gather engine
• Integrated fabric: OPA

Tile (up to 36): 2 cores, each with 2 VPUs, plus 1MB L2 and a hub. Enhanced Intel® Atom™ cores based on Silvermont microarchitecture.

Legend: Tile; IMC (Integrated Memory Controller); EDC (Embedded DRAM Controller); IIO (Integrated I/O Controller)

¹Theoretical peak performance


Material Science: VASP, LAMMPS, NWCHEM, GTC-P

QCD: MILC, CHROMA, CCS QCD

CFD/Mfg: OPENFOAM, CLOVERLEAF, LSTC LS-DYNA, CONVERGENT SCIENCE CONVERGE CFD

Weather/Climate/Cosmology: WRF, NEMO, WALLS

Energy: ISO3DFD

FSI: STAC A2, MONTE CARLO, BLACK SCHOLES

MD: NAMD, GROMACS, AMBER

What segments & applications are primary targets for KNL? Why?

Features driving KNL’s perf & perf/$/W

• High number of physical cores (up to 72) +
• High number of threads (up to 288) +
• Intel® AVX-512 (ER)
• 16GB MCDRAM
• High memory bandwidth (up to 490 GB/s)
• High system memory (up to 400 GB)
• Lower system price ($7300) +
• Lower system power (~400W) +

See slides in back-up for description and availability of these codes

*Other names and brands may be claimed as the property of others.

+See speaker notes for comparison data

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance Source: Intel measured or estimated as of April 2017.


Material Science: VASP, LAMMPS, NWCHEM**

QCD: MILC (WIP)

CFD/Mfg: LSTC LS-DYNA**, OPENFOAM**, CONVERGENT SCIENCE CONVERGE CFD**, CLOVERLEAF (WIP)

Weather: WRF**

FSI: STAC A2 (WIP), BLACK SCHOLES

MD: GROMACS

What segments & applications are primary targets for KNL Vs P100*?

*Other names and brands may be claimed as the property of others.

Features driving KNL’s perf & perf/$/W

• Intel® AVX-512 (ER)

• Higher attached memory (up to 400 GB)

• Lower system price ($7300) +

• Lower system power (~400W) +

**Intel assessment - Not ported/optimized for P100






ACCELERATE Hardware Capabilities
Delivering hardware optimized for deep learning
• Intel® Xeon Phi™ 7200 Series (Available Now): Start training models today using Intel® Xeon Phi™
• Knights Mill (Available Late 2017): Up to 4x performance over 7200 series for Deep Learning workloads*
• Nervana Crest (Available Soon): World class neural network performance via Nervana engine

Optimize Deep Learning Software
Optimized via framework & library enhancements
• Libraries/Languages: Intel® MKL, MKL-DNN, Intel® MLSL. Tuning delivers up to 400x performance**. Tuned for Intel® processors, current & next generation.
• Directly Optimized Frameworks: Optimizing these frameworks on Intel® Xeon® & Xeon Phi™ processors
• Optimized via nGRAPH: Above frameworks will be optimized via Nervana nGRAPH

Align Developer Ecosystem
Ensure Intel solutions are easy to use and readily available
• Training: Benefit from expert-led trainings, hands-on workshops, exclusive remote access, and more!
• Tools: Gain access to the latest libraries, frameworks, tools, and technologies from Intel to accelerate your AI project
• Community: Collaborate with industry luminaries, developers, students, and Intel engineers

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases. For more complete information visit: http://www.intel.com/performance Source: Intel measured as of Nov 2016

*Intel® Xeon Phi ™ processor Knights Mill up to 4x estimated performance improvement over Intel® Xeon Phi™ processor 7290 BASELINE: Intel® Xeon Phi™ Processor 7290 (16GB, 1.50 GHz, 72 core) with 192 GB Total Memory on Red Hat Enterprise Linux* 6.7 kernel 2.6.32-573 using MKL 11.3 Update 4, Relative performance 1.0 Knights Mill: Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.

** Configurations contained on next slide


[Figure: bar chart of normalized throughput (images/second), axis 0 to 2, for Caffe/AlexNet, TensorFlow/AlexNet-ConvNet, and TensorFlow/ConvNet VGG. Series: 2S Intel® Xeon® processor E5-2697 v4; Intel® Xeon Phi™ processor 7250.]

Intel® Xeon Phi ™ processor 7250 single-node image classification training throughput up to 1.8x better than 2S Intel® Xeon® processor E5-2697 v4

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804

Source: Intel measured as of November 2016.

[Figure y-axis: Intel® Xeon Phi™ 7250 relative performance (normalized to a 1.0 baseline of an Intel® Xeon® processor 2S E5-2697 v4). Data labels: up to 1.8x, up to 1.7x, up to 1.4x.]


What is KNM?

First Knights product designed for Intel® Scalable System Framework for Deep Learning

KNM is a KNL derivative that strives for:
• Doubling the peak performance of SP dense compute & perf/W
  - At the cost of lower DP performance
  - No or minimal platform changes

How?
• Remove DP units to make room for 2x SP units + int16 VNNI
• Drive them via Quadmadd instructions to boost efficiency

Knights Landing SoC
• 1 MB L2 per tile
• 2 cores per tile (SLM-based)
• 2 VPUs/core
• 32 SP/16 DP flops per VPU
• 32 KB D$/I$ per core
• Up to 36 tiles (72 cores), 1.7 GHz mesh
• 36 tiles connected with a 2D mesh
• 6 channels of DDR4 2400 (up to 384GB, 90 GB/s STREAM triad)
• 16GB of IPM (MCDRAM) (up to 490 GB/s STREAM triad)
• 36 lanes PCIe Gen3 (2 x16, 1 x4), x4 DMI

[Figure: package block diagram. Eight MCDRAM devices attach over OPIO to EDCs; two iMCs drive DDR4; IIO and Misc blocks provide PCIe and DMI; tiles form the 2D mesh. Tile detail: two cores, each with 2 VPUs and 32 KB D$/I$, sharing 1 MB L2 and a hub.]

Knights Mill SoC
• 1 MB L2 per tile
• 2 cores per tile (SLM-based)
• 1 VPU/core for DP, 2 VPUs/core for SP
• 64 SP/16 DP flops per VPU
• 2x SP, 0.5x DP
• New ISA for 2x SP: quadmadd
• 14nm+
• Up to 320W SKU (72 cores at 1.5 GHz)
• 32 KB D$/I$ per core
• Up to 36 tiles (72 cores), 1.6 GHz mesh
• 36 tiles connected with a 2D mesh
• 6 channels of DDR4 2400 (up to 384GB, 90 GB/s STREAM triad)
• 16GB of IPM (MCDRAM) (up to 460 GB/s STREAM triad)
• 36 lanes PCIe Gen3 (2 x16, 1 x4), x4 DMI

[Figure: package block diagram with the same layout as Knights Landing: eight MCDRAM devices on OPIO, EDCs, two iMCs for DDR4, IIO and Misc blocks, and tiles in a 2D mesh.]

Student Visit Program Introduction

Developer IA Tech Pipeline Buildup & Refresh

ISV

Developer Resume: Name # (in case

name is not proper)

R&R Experience

DRD

Developer Categorization: Resume screening, (Phone / F2F) Interview

Senior Developers

Junior Developers

New Beginner

Developer IA Tech Pipeline

Refresh every half year

Promotion Required Actions

Finish all trainings in Training Framework for this ISV

Pass DRD AE brief interview

Successfully finish Intel internship for this ISV

Intel® Xeon Phi™ Processor Family
Enables Shorter Time to Train Using General Purpose Infrastructure

Processor for HPC & enterprise customers running scale-out, highly-parallel, memory intensive apps

Removing IO and Memory Barriers
• Integrated Intel® Omni-Path fabric increases price-performance and reduces communication latency
• Direct access of up to 400 GB of memory with no PCIe performance lag (vs. GPU: 16GB)

Breakthrough Highly Parallel Performance
• Up to 400X deep learning performance on existing hardware via Intel software optimizations
• Up to 4X deep learning performance increase estimated (Knights Mill, 2017)

Easier Programmability
• Binary-compatible with Intel® Xeon® processors
• Open standards, libraries and frameworks

Configuration details on slide 30. Source: Intel measured as of November 2016.


Intel® Omni-Path Architecture
World-Class Interconnect Solution for Shorter Time to Train

Fabric interconnect for breakthrough performance on scale-out apps like deep learning training

Product line:
• HFI Adapters: single port, x8 and x16
• Edge Switches: 1U form factor, 24 and 48 port
• Director Switches: QSFP-based, 192 and 768 port
• Software: open source host software and Fabric Manager
• Cables: third party vendors, passive copper and active optical

Breakthrough Performance
Increases price performance, reduces communication latency compared to InfiniBand EDR¹:
• Up to 21% higher performance, lower latency at scale
• Up to 17% higher messaging rate
• Up to 9% higher application performance

Building on some of the industry's best technologies
• Highly leverage existing Aries & Intel True Scale fabrics
• Excellent price/performance, price/port, 48 radix
• Re-use of existing OpenFabrics Alliance software
• More than 80 Fabric Builder members

Innovative Features
Improve performance, reliability and QoS through:
• Traffic Flow Optimization to maximize QoS in mixed traffic
• Packet Integrity Protection for rapid and transparent recovery of transmission errors
• Dynamic lane scaling to maintain link continuity

1Intel® Xeon® Processor E5-2697A v4 dual-socket servers with 2133 MHz DDR4 memory. Intel® Turbo Boost Technology and Intel® Hyper Threading Technology enabled. BIOS: Early snoop disabled, Cluster on Die disabled, IOU non-posted prefetch disabled, Snoop hold-off timer=9. Red Hat Enterprise Linux Server release 7.2 (Maipo). Intel® OPA testing performed with Intel Corporation Device 24f0 – Series 100 HFI ASIC (B0 silicon). OPA Switch: Series 100 Edge Switch – 48 port (B0 silicon). Intel® OPA host software 10.1 or newer using Open MPI 1.10.x contained within host software package. EDR IB* testing performed with Mellanox EDR ConnectX-4 Single Port Rev 3 MCX455A HCA. Mellanox SB7700 - 36 Port EDR Infiniband switch. EDR tested with MLNX_OFED_Linux-3.2.x. OpenMPI 1.10.x contained within MLNX HPC-X. Message rate claim: Ohio State Micro Benchmarks v. 5.0. osu_mbw_mr, 8 B message (uni-directional), 32 MPI rank pairs. Maximum rank pair communication time used instead of average time, average timing introduced into Ohio State Micro Benchmarks as of v3.9 (2/28/13). Best of default, MXM_TLS=self,rc, and -mca pml yalla tunings. All measurements include one switch hop. Latency claim: HPCC 1.4.3 Random order ring latency using 16 nodes, 32 MPI ranks per node, 512 total MPI ranks. Application claim: GROMACS version 5.0.4 ion_channel benchmark. 16 nodes, 32 MPI ranks per node, 512 total MPI ranks. Intel® MPI Library 2017.0.064. Additional configuration details available upon request.

Libraries, frameworks & tools

Intel® MKL (Intel® Math Kernel Library)
• High level overview: High performance math primitives granting low level of control
• Primary audience: Consumed by developers of higher level libraries and applications
• Example usage: Framework developers call matrix multiplication, convolution functions

MKL-DNN
• High level overview: Free open source DNN functions for high-velocity integration with deep learning frameworks
• Primary audience: Consumed by developers of the next generation of deep learning frameworks
• Example usage: New framework with functions developers call for max CPU performance

Intel® MLSL
• High level overview: Primitive communication building blocks to scale deep learning framework performance over a cluster
• Primary audience: Deep learning framework developers and optimizers
• Example usage: Framework developer calls functions to distribute Caffe training compute across an Intel® Xeon Phi™ cluster

Intel® Data Analytics Acceleration Library (DAAL)
• High level overview: Broad data analytics acceleration object oriented library supporting distributed ML at the algorithm level
• Primary audience: Wider data analytics and ML audience, algorithm level development for all stages of data analytics
• Example usage: Call distributed alternating least squares algorithm for a recommendation system

Intel® Distribution
• High level overview: Most popular and fastest growing language for machine learning
• Primary audience: Application developers and data scientists
• Example usage: Call scikit-learn k-means function for credit card fraud detection

Open Source Frameworks
• High level overview: Toolkits driven by academia and industry for training machine learning algorithms
• Primary audience: Machine learning app developers, researchers and data scientists
• Example usage: Script and train a convolution neural network for image recognition

Intel Deep Learning SDK
• High level overview: Accelerate deep learning model design, training and deployment
• Primary audience: Application developers and data scientists
• Example usage: Deep learning training and model creation, with optimization for deployment on constrained end device

Intel® Computer Vision SDK
• High level overview: Toolkit to develop & deploy vision-oriented solutions that harness the full performance of Intel CPUs and SOC accelerators
• Primary audience: Developers who create vision-oriented solutions
• Example usage: Use deep learning to do pedestrian detection

…

Find out more at software.intel.com/ai


Xeon Phi Processor & Platform Introduction – Knights Landing (KNL)

Processor: SKUs & Key Features

SKU   | Positioning                    | Peak DP / SP      | Cores/Threads | GHz | On-package memory | Fabric | DDR4            | Power**
7290* | Best node performance          | 3.46 TF / 6.92 TF | 72/288        | 1.5 | 16GB, 7.2 GT/s    | Yes    | 384GB, 2400 MHz | 245W
7250  | Best performance per watt      | 3.05 TF / 6.1 TF  | 68/272        | 1.4 | 16GB, 7.2 GT/s    | Yes    | 384GB, 2400 MHz | 215W
7230  | Best memory bandwidth per core | 2.66 TF / 5.32 TF | 64/256        | 1.3 | 16GB, 7.2 GT/s    | Yes    | 384GB, 2400 MHz | 215W
7210  | Entry-level product            | 2.66 TF / 5.32 TF | 64/256        | 1.3 | 16GB, 6.4 GT/s    | Yes    | 384GB, 2133 MHz | 215W

*Available beginning in September **Add 15 watts for integrated fabric
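The peak figures in the SKU list are consistent with a simple product of cores, clock, VPUs per core, and flops per VPU per cycle, using the 2 VPUs/core and 32 SP/16 DP flops per VPU quoted on the Knights Landing SoC slide. The sketch below is illustrative only: the function name is hypothetical, it uses the listed clock rather than any AVX-specific frequency, and it assumes 4 hardware threads per core (which matches the cores/threads column).

```python
def peak_tflops(cores, ghz, vpus_per_core=2, flops_per_vpu=16):
    """Theoretical peak in TFLOPS: cores x GHz x VPUs/core x flops/VPU/cycle."""
    return cores * ghz * vpus_per_core * flops_per_vpu / 1000.0


# (cores, GHz) per SKU, taken from the SKU list above.
SKUS = {"7290": (72, 1.5), "7250": (68, 1.4), "7230": (64, 1.3), "7210": (64, 1.3)}

for name, (cores, ghz) in SKUS.items():
    dp = peak_tflops(cores, ghz, flops_per_vpu=16)  # 16 DP flops per VPU
    sp = peak_tflops(cores, ghz, flops_per_vpu=32)  # 32 SP flops per VPU
    threads = cores * 4                             # assumed 4 threads per core
    print(f"{name}: {dp:.2f} TF DP / {sp:.2f} TF SP, {threads} threads")
```

The DP results agree with the listed values to two decimals; the SP values on the slide appear to be the rounded DP values doubled, so they can differ in the last digit.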

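• Eliminates PCIe data-movement bottlenecks: self-booting host processor
• On-package MCDRAM improves memory bandwidth: 16GB of integrated high-speed on-package memory with up to 490 GB/s of bandwidth, about 4-5x DDR4 bandwidth
• Runs x86 applications like a CPU: binary-compatible with Intel® Xeon® applications
• Excellent system scalability: good scaling efficiency, like Intel® Xeon® CPUs
• Integrated Intel Omni-Path Fabric: lower network latency, lower cost, lower power
• Target areas: life sciences / genome sequencing / financial risk modeling / energy / weather and environment / fluid dynamics / visualization / rendering / big data / machine learning / deep learning
• Standard x86 programming model: the same programming environment, tools, and languages as Xeon
• Highly parallel and vectorized: up to 288 concurrent threads, AVX-512 instruction set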
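One way to see why the 490 GB/s on-package memory matters is a roofline-style estimate. The roofline model is not part of the slides; it is brought in here only as an illustration. The sketch assumes the 7250 peak of about 3.05 TF DP from the SKU list, the 490 GB/s MCDRAM figure, and the 90 GB/s DDR4 STREAM triad figure from the SoC slide. The function names are hypothetical.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline bound: the lower of the compute peak and bandwidth x intensity."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity (flops/byte) above which a kernel is compute-bound."""
    return peak_gflops / bandwidth_gbs


PEAK_DP_GFLOPS = 3050.0  # Xeon Phi 7250, from the SKU list
MCDRAM_GBS = 490.0       # on-package memory, STREAM triad
DDR4_GBS = 90.0          # platform memory, STREAM triad

# A DP triad a[i] = b[i] + s * c[i] does 2 flops per 24 bytes moved
# (write-allocate traffic ignored for simplicity).
TRIAD_INTENSITY = 2.0 / 24.0

for label, bw in (("MCDRAM", MCDRAM_GBS), ("DDR4", DDR4_GBS)):
    bound = attainable_gflops(PEAK_DP_GFLOPS, bw, TRIAD_INTENSITY)
    ridge = ridge_point(PEAK_DP_GFLOPS, bw)
    print(f"{label}: triad bound {bound:.1f} GFLOPS, ridge at {ridge:.1f} flops/byte")
```

Under these assumptions a bandwidth-bound kernel gains in proportion to the memory bandwidth, while a kernel with high arithmetic intensity sees little difference between the two memories.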



Platform

Intel Software Development Platform (SDP)

Intel Server Board S7200AP in Intel Server Chassis H2000XXLR2

Other OEM Systems



System: Interconnect

• Omni-Path Fabric: (2) x16 PCIe, F connector to QSFP module, 2x 100 Gbps (uni-directional), 50 GB/sec (bi-directional)
• Ethernet: (1 or 2) x16 or x8 PCIe, 10/40 GbE, QSFP connectors
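The two Omni-Path figures quoted above (2x 100 Gbps uni-directional and 50 GB/sec bi-directional) are the same quantity in different units. A small sketch of the conversion, with a hypothetical helper name and no allowance for protocol overhead:

```python
def gbits_to_gbytes(gbits_per_s):
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbits_per_s / 8.0


PORTS = 2
GBITS_PER_PORT = 100  # uni-directional line rate per port

uni_gbytes = gbits_to_gbytes(PORTS * GBITS_PER_PORT)  # both ports, one direction
bi_gbytes = 2 * uni_gbytes                            # both directions combined

print(f"uni-directional: {uni_gbytes} GB/s, bi-directional: {bi_gbytes} GB/s")
```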

[Figure: peak single-precision teraflops by generation: 1st Gen (2013), 2nd Gen (2016), Knights Mill (2017). Common Groveport platform, bootable host CPU.]

• Up to ~13.8 TF* Single Precision Peak performance
• Up to ~27.6 TOPs* Variable Precision QVNNI performance
• Surgical changes in the VPU to increase SP performance over DP performance
• Bootable Host-CPU avoids PCIe latency & bottlenecks
• Efficient Scaling with Multi-node optimizations for top ML frameworks
• High memory bandwidth for seamlessly training Complex Neural Network datasets

Knights Mill: Optimal Deep Learning Throughput

Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks. Performance estimate with respect to the KNL 7250 SKU, SGEMM. Performance calculation = AVX frequency × cores × flops per core × efficiency.
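As a worked instance of the performance calculation above, the ~13.8 TF single-precision peak can be reproduced from figures given elsewhere in the deck: 72 cores at 1.5 GHz, and 2 SP VPUs per core at 64 SP flops each. Two things are assumptions here: that the 1.5 GHz SKU clock is also the AVX frequency, and that the efficiency factor is 1.0 for a theoretical peak.

```latex
\begin{align*}
P_{\text{SP}} &= f_{\text{AVX}} \times N_{\text{cores}} \times F_{\text{core}} \times \eta \\
              &= 1.5\,\text{GHz} \times 72 \times (2 \times 64) \times 1.0 \\
              &= 13{,}824\ \text{GFLOPS} \approx 13.8\ \text{TFLOPS}
\end{align*}
```

The ~27.6 TOPs QVNNI figure is twice this value.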

Faster Time to Train Machines

*Based on estimates and subject to change

Intel® Xeon Phi ™ processor Knights Mill up to 4x estimated performance improvement over Intel® Xeon Phi™ processor 7290

[Figure: Deep Learning Performance, normalized performance, axis 0 to 6. Series: Intel® Xeon Phi™ processor 7290; Intel® Xeon Phi™ processor family - Knights Mill (up to 4x). Y-axis: estimated normalized performance on Intel® Xeon Phi™ processor 7290 compared to Intel® Xeon Phi™ Knights Mill.]

Configuration details on next slide

Knights Mill performance: Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.


Source: Intel measured baseline Intel® Xeon Phi™ processor 7290 as of November 2016.



Intel® Parallel Studio XE
Create Faster Code…Faster

Editions: Composer Edition, Professional Edition, Cluster Edition

BUILD: Compilers & Libraries
• C / C++ Compiler: Optimizing Compiler
• Fortran Compiler: Optimizing Compiler
• Intel® Distribution for Python*: High Performance Scripting
• Intel® MKL: Fast Math Kernel Library
• Intel® IPP: Image, Signal & Data Processing
• Intel® TBB: C++ Threading Library
• Intel® DAAL: Data Analytics Library

ANALYZE: Analysis Tools
• Intel® VTune™ Amplifier: Performance Profiler
• Intel® Advisor: Vectorization Optimization & Thread Prototyping
• Intel® Inspector: Memory & Thread Debugger

SCALE: Cluster Tools
• Intel® Trace Analyzer & Collector: MPI Tuning & Analysis
• Intel® MPI Library: Message Passing Interface Library
• Intel® Cluster Checker: Cluster Diagnostic Expert System

Operating System: Windows*, Linux*, MacOS1*

Intel® Architecture Platforms

More Power for Your Code - https://software.intel.com/intel-parallel-studio-xe

Components of Intel® MKL 2017

Linear Algebra
• BLAS
• LAPACK
• ScaLAPACK
• Sparse BLAS
• Sparse Solvers
• Iterative
• PARDISO*
• Cluster Sparse Solver

Fast Fourier Transforms
• Multidimensional
• FFTW interfaces
• Cluster FFT

Vector Math
• Trigonometric
• Hyperbolic
• Exponential
• Log
• Power
• Root
• Vector RNGs

Summary Statistics
• Kurtosis
• Variation coefficient
• Order statistics
• Min/max
• Variance-covariance

And More…
• Splines
• Interpolation
• Trust Region
• Fast Poisson Solver

Deep Neural Networks (New)
• Convolution
• Pooling
• Normalization
• ReLU
• Softmax

Improve GEMM performance by pack GEMM
A machine learning case

• Improve GEMM performance by Pack GEMM from MKL 2017

[Figure: SGEMM Performance Comparison (higher is better). Y-axis: peak value (Tflops), 0 to 4. X-axis: matrix size (m,n,k). Series: sgemm on KNL 7210; optimization by packed gemm on KNL.]

NOTE: Packed GEMM helps for some matrices, but the packing cost is high. That cost can be amortized over multiple GEMM calls if the input matrices (A or B) are reused between those calls.
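The amortization argument in the note can be made concrete with a small cost model. This is a sketch, not MKL code: the function names are hypothetical and the timings are invented placeholders in microseconds, not measurements from the slide.

```python
def plain_total(plain_call_us, calls):
    """Total time for `calls` ordinary GEMM calls."""
    return plain_call_us * calls


def packed_total(pack_us, packed_call_us, calls):
    """Total time when a matrix is packed once and reused by every call."""
    return pack_us + packed_call_us * calls


def break_even_calls(pack_us, plain_call_us, packed_call_us):
    """Smallest number of calls at which packing is strictly faster.

    Returns None when the packed call is not faster per call, because the
    packing cost can then never be recovered. Expects integer microseconds.
    """
    saving = plain_call_us - packed_call_us
    if saving <= 0:
        return None
    return pack_us // saving + 1


# Hypothetical timings: packing costs 3000 us, and each packed call
# saves 300 us compared with a plain call.
PACK_US, PLAIN_US, PACKED_US = 3000, 1000, 700

n = break_even_calls(PACK_US, PLAIN_US, PACKED_US)
print(f"packing pays off from {n} calls onward")
print(plain_total(PLAIN_US, n), packed_total(PACK_US, PACKED_US, n))
```

With a single call the packed path is slower in this model, which is why reuse of A or B across calls is the precondition the note emphasizes.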

Performance enhanced by

Intel® Math Kernel Library

Q&A


Legal DisclaimersINFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTYRIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTELDISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULARPURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'SPRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS,OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANYCLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WASNEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked"reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The informationhere is subject to change without notice. 
Do not finalize a design with this information.The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata areavailable on request.Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or goto: http://www.intel.com/design/literature.htmSoftware and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measuredusing specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information andperformance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sitesor others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specificbenchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported.SPEC, SPECint, SPECfp, SPECrate. 
SPECpower, SPECjAppServer, SPECjbb, SPECjvm, SPECWeb, SPECompM, SPECompL, SPEC MPI, SPECjEnterprise* are trademarks of the Standard PerformanceEvaluation Corporation. See http://www.spec.org for more information. TPC-C, TPC-H, TPC-E are trademarks of the Transaction Processing Council. See http://www.tpc.org for more information.Hyper-Threading Technology requires a computer system with a processor supporting HT Technology and an HT Technology-enabled chipset, BIOS and operating system. Performance will varydepending on the specific hardware and software you use. For more information including details on which processors support HT Technology, see hereIntel® Turbo Boost Technology requires a Platform with a processor with Intel Turbo Boost Technology capability. Intel Turbo Boost Technology performance varies depending on hardware, softwareand overall system configuration. Check with your platform manufacturer on whether your system delivers Intel Turbo Boost Technology. For more information, seehttp://www.intel.com/technology/turboboostNo computer system can provide absolute security. Requires an enabled Intel® processor and software optimized for use of the technology. Consult your system manufacturer and/or softwarevendor for more information.Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families: Go to:Learn About Intel® Processor NumbersIntel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel’s current plan of record product roadmaps.© 2017 Intel Corporation. 26

http://www.intel.com/design/literature.htm

http://www.spec.org/

http://www.tpc.org/

http://www.intel.com/info/hyperthreading/

http://www.intel.com/technology/turboboost

http://www.intel.com/products/processor_number

vasp*

The Vienna Ab initio Simulation Package (VASP) is a computer program for atomic scale materials modeling and performs electronic structure calculations and quantum-mechanical molecular dynamics from first principles.

Application: Vienna Ab initio Simulation Package (VASP)

Code: Available here Recipe: See configuration details. Also, check for future availability here

Value Proposition: VASP provides scientists with fast and precise calculation of materials properties thus it is widely used and consumes up to 25% of supercomputers time worldwide1. Intel® Xeon Phi™ processor 7250 enables VASP to outperform alternatives for some workloads and to improve energy efficiency.

Material Sciences

https://www.vasp.at/

https://software.intel.com/en-us/xeon-phi/x200-processor

Lammps Coarse-Grain Water Simulation

LAMMPS is a classical molecular dynamics code, and an acronym for Large-scale Atomic/Molecular Massively Parallel Simulator. It is used to simulate the movement of atoms to develop better therapeutics, improve alternative energy devices, develop new materials, and more.

Application: Coarse-grain water simulation with LAMMPS using Stillinger-Weber potential. More at http://lammps.sandia.gov/

Code: In main LAMMPS repository. Recipe: Available here

Value Proposition: Intel continues to advance the capabilities of HW and SW necessary for scientists to solve new and more complex problems that could not previously be achieved. The Intel® Xeon Phi™ processor improves power-efficient performance for scalable workloads.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance *Other names and brands may be claimed as the property of others

Life Sciences

Ice formation in a water droplet wetting a flat surface, simulated with a coarse-grain model in LAMMPS

Image Source: Comput. Phys. Commun., 2013, 184, 2785-2793

https://software.intel.com/en-us/articles/recipe-lammps-for-intel-xeon-phi-processors

NWChem NWPW AIMD SIMULATION

NWChem is a computational chemistry software package which also includes quantum chemical and molecular dynamics functionality. It was designed to run on high-performance parallel supercomputers as well as conventional workstation clusters. It aims to be scalable both in its ability to treat large problems efficiently, and in its usage of available parallel computing resources.

Application: Ab-initio molecular dynamics simulation (NWChem NWPW), water128 benchmark

Code: Main NWChem repository at https://svn.pnl.gov/svn/nwchem/trunk, SVN rev 28860. Recipe: Check for availability here

Value Proposition: Intel continues to advance the capabilities of HW and SW necessary for scientists to solve new and more complex problems that could not previously be achieved. The Intel® Xeon Phi™ processor improves power-efficient performance for scalable workloads.

Life Sciences


Images Source: US Govt.; NWChem


Trinity Benchmarks - Optimized cluster (GFLOPS)

Trinity is a set of benchmark programs used as part of the joint NERSC/ACES NERSC-8/Trinity system procurement.

Code: In main NERSC website. Recipe: Check for availability here

AMG: Parallel algebraic multigrid solver for linear systems

MiniFE: Finite Element mini-app

UMT: 3D, deterministic, multigroup, photon transport code for unstructured meshes

SNAP: Proxy application to model the performance of a modern discrete ordinates neutral particle transport application

GTC: Gyrokinetic Particle Simulation of Turbulent Transport in Burning Plasmas

MILC: MIMD Lattice Computation (MILC) collaboration kernel used to study quantum chromodynamics (QCD)

MiniGhost: Finite Difference mini-app

Value Proposition: Trinity benchmarks showcase the out-of-the-box performance of the Intel® Xeon Phi™ processor at both single-node and cluster scale.

Material Sciences


https://www.nersc.gov/users/computational-systems/cori/nersc-8-procurement/trinity-nersc-8-rfp/nersc-8-trinity-benchmarks/


The MILC Code is used to study quantum chromodynamics (QCD), the theory of the strong interactions of subatomic physics and is written by the MIMD Lattice Computation (MILC) collaboration.

Application: Trinity MILC, provided by NERSC as part of the NERSC-8/Trinity benchmark suite

Code: Original NERSC benchmark code is here; contact Intel for optimized code.

Recipe: Check for availability here

Value Proposition:

MILC is widely deployed on numerous supercomputers and is the second most used application at the US DOE’s National Energy Research Scientific Computing Center (NERSC).

Intel’s optimizations are being incorporated into the mainline code by the MILC collaboration.

MILC*

Physics - QCD

Image Credit: Brookhaven Lab (BNL)


http://physics.indiana.edu/~sg/milc.html

http://www.nersc.gov/users/computational-systems/cori/nersc-8-procurement/trinity-nersc-8-rfp/nersc-8-trinity-benchmarks/milc/


Chroma*

The Chroma package supports data-parallel programming constructs for lattice field theory and in particular lattice QCD. It uses the SciDAC QDP++ data-parallel programming (in C++) that presents a single high-level code image to the user, but can generate highly optimized code for many architectural systems including single node workstations, multi and many-core nodes, clusters of nodes via QMP, and classic vector computers.

Application: Chroma “hmc”

Code: Available here Recipe: Check for availability here

Value Proposition:

Chroma is deployed on numerous supercomputers and is one of the most used QCD applications/research kernels.

Intel’s optimizations are incorporated into mainline Chroma. The optimizations are made available in the QPhiX library.

Physics - QCD

Image Credit: Brookhaven Lab (BNL)


https://github.com/JeffersonLab/chroma


OpenFOAM*


OpenFOAM (for "Open source Field Operation And Manipulation") is a C++ toolbox for the development of customized numerical solvers, and pre-/post-processing utilities for the solution of continuum mechanics problems, including computational fluid dynamics (CFD).

Application: OpenFOAM

Code: Available here Recipe: Available here

Optimizations: https://github.com/OpenFOAM/OpenFOAM-Intel

Value Proposition: Provides an extensive range of features to solve complex fluid flows involving chemical reactions, turbulence and heat transfer, acoustics, solid mechanics and electromagnetics. OpenFOAM on the Intel® Xeon Phi™ processor is great for computational fluid dynamics, structured grid or unstructured mesh.

Image Source: Intel

Manufacturing


http://www.openfoam.com/

https://github.com/OpenFOAM/OpenFOAM-dev

CloverLeaf*

The CloverLeaf* code investigates the behavior of fluids under high temperatures and pressures, which potentially cause shock fronts to form. It is common for hydrocodes to be constructed using one of two formulations: Lagrangian, in which a mesh is constructed and evolved through time, or Eulerian, where material flow is calculated relative to a fixed spatial grid.

Application: CloverLeaf

Code: https://github.com/UK-MAC/CloverLeaf

Recipe: make COMPILER=INTEL MPI_COMPILER=mpiifort C_MPI_COMPILER=mpiicc OPTIONS="-xMIC-AVX512" C_OPTIONS="-xMIC-AVX512"

Value Proposition:

This application provides users with a research tool for investigating code modernization approaches for larger shock hydrodynamics applications.

With the Intel® Xeon Phi™ processor 7250, this application now significantly outperforms alternative processing solutions in time-to-solution.

Physics - Hydrodynamics

Image Source: Intel

Weather Research and Forecasting (WRF) Model* Numerical Weather Simulation

The WRF Model is a numerical weather prediction system designed to serve atmospheric research and operational forecasting needs. It is currently in operational use at NCEP, AFWA, NASA, NOAA, and elsewhere.

Application: The Weather Research and Forecasting (WRF) Model*, WRFV3.6.1, CONUS12km benchmark. The community code is managed by NCAR. The CONUS12km benchmark is an ad hoc industry-standard workload and is widely cited.

Code: Available here. (WRF 3.6 & 3.6.1) Recipe: Select Intel MIC configuration on build. Check for availability here

Value Proposition: The most widely-used weather forecasting code runs in its entirety on the Intel platform only. Speed up the WRF weather simulation code and results with Intel® architecture.

Climate & Weather

Image Source: NOAA


http://www2.mmm.ucar.edu/wrf/users/download/get_sources.html#V351


YASK HPC Stencils, iso3dfd kernel

YASK, Yet Another Stencil Kernel, is a framework to facilitate design exploration and tuning of HPC kernels. One of the stencils included in YASK is iso3dfd, a finite-difference code found in seismic imaging software used by energy-exploration companies to predict the location of oil and gas deposits.

Application: YASK, iso3dfd stencil
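To make the idea of a finite-difference stencil concrete, here is a minimal Python sketch of one time step of an isotropic 3D wave equation. It is an illustration only: it uses a simple second-order 7-point stencil on a tiny grid, which is far simpler than the real iso3dfd kernel in YASK, and the function name `wave_step`, the `coeff` parameter, and the grid size are assumptions made for this example.

```python
def wave_step(prev, curr, coeff, n):
    """Advance an n*n*n wavefield by one time step (illustrative sketch).

    prev, curr: flattened wavefields at the two most recent time steps.
    coeff: velocity^2 * dt^2 / dx^2, assumed constant (uniform medium).
    """
    plane = n * n
    nxt = list(curr)  # boundary cells simply keep their current value
    for z in range(1, n - 1):
        for y in range(1, n - 1):
            for x in range(1, n - 1):
                c = z * plane + y * n + x
                # 7-point Laplacian: six face neighbors minus 6x the center
                laplacian = (curr[c - 1] + curr[c + 1]
                             + curr[c - n] + curr[c + n]
                             + curr[c - plane] + curr[c + plane]
                             - 6.0 * curr[c])
                nxt[c] = 2.0 * curr[c] - prev[c] + coeff * laplacian
    return nxt


# Hypothetical test input: a single unit impulse in the middle of a 5x5x5 grid.
n = 5
center = (2 * n + 2) * n + 2
field = [0.0] * (n * n * n)
field[center] = 1.0
next_field = wave_step(field, field, 0.1, n)
```

Every output point reads its neighbors in all three dimensions, which is why kernels of this kind tend to be sensitive to memory bandwidth and benefit from wide SIMD.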

Code and Recipe: Available here

Value Proposition: The Intel® Xeon Phi™ processor 7250 enables this application to leverage the high-bandwidth memory and 512-bit SIMD for higher performance.

Geophysics

Image Source: US Dept. Energy


https://software.intel.com/en-us/articles/recipe-building-and-running-yask-yet-another-stencil-kernel-on-intel-processors

STAC-A2* BENCHMARK

The STAC-A2 Benchmark suite is the industry standard created by the financial community to test technology stacks used for compute-intensive analytic workloads involved in pricing and risk management

Application: Intel Composer XE STAC Pack Rev. H

Code: Available here Recipe: Available here

Value Proposition:

The Intel Xeon Phi processor-based system takes up one-eighth the space (0.5U vs. 4U) of the IBM Power8*-based system

Performance enhanced by Intel® AVX-512 and MCDRAM

Image Source: Intel

“STAC” and all STAC names are trademarks or registered trademarks of the Securities Technology Analysis Center LLC.

Financial Services


https://stacresearch.com/news/2016/06/20/INTC160428

http://www.stacresearch.com/INTC160428

Monte Carlo European options benchmark*

Industry-standard benchmark that uses the Monte Carlo method for pricing European call options. It pre-generates random numbers, then uses them in all option pricing processes. Used by all financial firms to price derivatives with multiple dimensions. It uses the stock price, strike price, and time as input streams, then creates a call output stream.

Application: Monte Carlo European Options

Value Proposition:

Foundation of financial derivatives pricing

Widely used all over financial libraries

EMU benefits transcendental functions
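The structure of this benchmark (pre-generate random numbers once, then reuse them for every option) can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark code: the function name `price_european_calls`, the shared risk-free rate and volatility, the seed, and the sample count are all assumptions made for the example.

```python
import math
import random


def price_european_calls(spots, strikes, maturities, rate, vol, normals):
    """Monte Carlo prices of European calls, reusing one set of normal draws.

    spots, strikes, maturities are the three input streams; the returned
    list is the call output stream. rate and vol are assumed to be shared
    by all options in this sketch.
    """
    count = len(normals)
    prices = []
    for spot, strike, maturity in zip(spots, strikes, maturities):
        drift = (rate - 0.5 * vol * vol) * maturity
        diffusion = vol * math.sqrt(maturity)
        payoff_sum = 0.0
        for z in normals:
            # Terminal stock price under geometric Brownian motion
            terminal = spot * math.exp(drift + diffusion * z)
            payoff_sum += max(terminal - strike, 0.0)
        # Discount the average payoff back to today
        prices.append(math.exp(-rate * maturity) * payoff_sum / count)
    return prices


rng = random.Random(2017)  # fixed seed so the sketch is repeatable
normals = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # pre-generated once
calls = price_european_calls([100.0, 100.0], [100.0, 110.0], [1.0, 1.0],
                             rate=0.05, vol=0.2, normals=normals)
```

The inner loop is dominated by `exp`, which is where fast transcendental-function hardware and wide vectors pay off.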


Financial Services

Image Source: US gov.


https://software.intel.com/en-us/articles/monte-carlo-european-option-pricing-for-intel-xeon-phi-processor

Black-Scholes benchmark*

Industry-standard benchmark that calculates call and put option prices using the Black-Scholes-Merton formula. Used by all financial firms to price derivatives with multiple dimensions. Stock price, strike price, and time are input streams that create call and put prices as output streams.

Application: Black-Scholes formula


Value Proposition:

Foundation of financial derivatives pricing

Widely used all over financial libraries

Performance enhanced by Intel® Advanced Vector Extensions 512 (Intel® AVX-512) and MCDRAM
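For reference, the Black-Scholes-Merton formula for a European call and put can be written as a short Python sketch. This is an illustration of the formula itself, not the benchmark implementation; the function names and the example inputs are assumptions.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function via erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes(spot, strike, time, rate, vol):
    """Return (call, put) prices from the Black-Scholes-Merton formula."""
    sqrt_t = math.sqrt(time)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * time) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    discount = math.exp(-rate * time)
    call = spot * norm_cdf(d1) - strike * discount * norm_cdf(d2)
    put = strike * discount * norm_cdf(-d2) - spot * norm_cdf(-d1)
    return call, put


# Hypothetical input streams: one entry per option.
spots = [100.0, 100.0]
strikes = [100.0, 110.0]
times = [1.0, 0.5]
results = [black_scholes(s, k, t, rate=0.05, vol=0.2)
           for s, k, t in zip(spots, strikes, times)]
call, put = results[0]
```

Each option is independent of the others and needs only `log`, `exp`, `sqrt`, and `erf`, so the computation vectorizes naturally across the input streams.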

Financial Services

Image Source: US gov


https://software.intel.com/en-us/articles/black-scholes-merton-formula-optimization-for-intel-xeon-phi-processor

Nanoscale Molecular Dynamics program*

Nanoscale Molecular Dynamics program (NAMD) is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. Based on Charm++ parallel objects, NAMD scales to hundreds of cores for typical simulations and beyond 200,000 cores for the largest simulations.

Application: NAMD 2.11

Code: http://www.ks.uiuc.edu/Research/namd/

Recipe: Check for availability here

Value Proposition: NAMD is the second most popular MD code. Intel® AVX-512 instructions are used heavily by the assembler code. Source code performance tuning with intrinsics demonstrates the advantages of MCDRAM and simultaneous multithreading.

Life Sciences

Image Source: Use approved by NAMD


GROMACS*


GROMACS (GROningen MAchine for Chemical Simulations) is a versatile package to perform classical molecular dynamics simulations. It is heavily optimized for most modern platforms and provides extremely high performance compared to all other MD codes.

Application: GROMACS (Intel® AVX-512 speedup)

Code: Available here

Recipe: All optimizations merged in GROMACS 2016 branch, MKL FFT

Value Proposition: This application provides users with a wide range of functionality for chemical simulations and the highest out-of-the-box performance across all MD codes. GROMACS on the Intel® Xeon Phi™ processor outperforms Intel® Xeon® processors for simulating large biochemical systems by using new Intel® Advanced Vector Extensions 512 (Intel® AVX-512) features and enhanced parallelism, and it delivers more simulation performance within the same energy envelope.

Life Sciences

Images Source: Used with permission


https://github.com/gromacs/gromacs.git

Amber16* Generalized Born (Implicit Solvent)


Amber* is a bio-related simulation code for DNA, RNA, proteins, and other biomolecules. Amber has two solvers: Particle Mesh Ewald (PME), known as Explicit, and Generalized Born (GB), known as Implicit. Amber is written in Fortran 90 and is parallelized mainly through MPI*, OpenMP*, and vectorization.

Application: Amber 16 PMEMD Implicit

Code: In main Amber GIT repository. Recipe: http://ambermd.org/intel

Value Proposition: This application provides users with a research tool for investigating code modernization approaches for biomolecular dynamics applications.

Life Sciences

Images Source: Intel

Amber16* PME (explicit solvent)

Amber* is a bio-related simulation code for DNA, RNA, proteins, and other biomolecules. Amber has two solvers: Particle Mesh Ewald (PME), known as Explicit, and Generalized Born (GB), known as Implicit. Amber is written in Fortran 90 and is parallelized mainly through MPI*, OpenMP*, and vectorization.

Application: Amber 16 PMEMD Explicit

Code: In main Amber GIT repository. Recipe: http://ambermd.org/intel

Value Proposition: This application provides users with a research tool for investigating code modernization approaches for biomolecular dynamics applications. The Intel® Xeon Phi™ processor is best suited for larger problem sizes.

Images Source: Intel

Life Sciences

Configuration details

LAMMPS

BASELINE: 2S Intel® Xeon® processor E5-2697 v3, 2.6GHz, 28 cores, Intel® Turbo Boost Technology and Intel® Hyper-Threading Technology on, BIOS 86B.01.01.1008.R00, 8x8GB 2133 MHz DDR4, CentOS Linux* 7.1.1503 kernel 3.10.0-229.

NEXT GEN: 2S Intel® Xeon® processor E5-2697 v4, 2.3GHz, 36 cores, Intel® Turbo Boost Technology and Intel® Hyper-Threading Technology on, BIOS 86B0271.R00, 8x16GB 2400MHz DDR4, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.

NEW: 2S Intel® Xeon® Gold 6148 processor, 2.4GHz, 40 cores, Intel® Turbo Boost Technology and Intel® Hyper-Threading Technology on, BIOS 86B.01.00.0412.R00, 12x16GB 2666MHz DDR4, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.

BLACK SCHOLES

FSI Black-Scholes workload. OS: Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327. Testing by Intel, March 2017.

BASELINE: 2S Intel® Xeon® processor E5-2697 v3, 2.6GHz, 28 cores, turbo and HT on, BIOS 86B.0036.R05, 64GB total memory, 8x8GB 2133 MHz DDR4, Fedora release 20 kernel 3.15.10-200.

NEXT GEN: 2S Intel® Xeon® processor E5-2697 v4, 2.3GHz, 36 cores, turbo and HT on, BIOS 86B0271.R00, 128GB total memory, 8x16GB 2400 MHz DDR4 RDIMM, 1 x 1TB SATA, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.

NEW: Intel® Xeon® Gold 6148 processor @ 2.4GHz, H0QS, 40 cores, 150W, QMS1, turbo and HT on, BIOS SE5C620.86B.01.00.0412.020920172159, 192GB total memory, 12x16GB 2666 MHz DDR4 RDIMM, 1 x 800GB INTEL SSD SC2BA80, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.

STAC-A2

BASELINE: SuperMicro* Superserver SYS-1028GR-TR, Dual Socket Intel® Xeon® processor E5-2697 v3 2.6 GHz , 14 Cores/Socket, 28 Cores, 56 Threads (HT and turbo on), DDR4 256GB, 1866 MHz, Red Hat 7.1, Wildcat Pass Motherboard, BIOS: American Megatrends, 2.0, 12/21/2015, 745GB SATA SSD

STAC-A2 version: STAC-A2 Pack for Intel Composer XE, Rev F

NEXT GENERATION: SuperMicro* Superserver SYS-1028GR-TR, Dual Socket Intel® Xeon® processor E5-2699 v4 2.2 GHz , 22 Cores/Socket, 44 Cores, 88 Threads (HT and turbo on), DDR4 256GB, 1866 MHz, Red Hat 7.1, Wildcat Pass Motherboard, BIOS: American Megatrends, 2.0, 12/21/2015, 745GB SATA SSD

STAC-A2 version: STAC-A2 Pack for Intel Composer XE, Rev G, STAC-A2 version: Internal (not audited)

NEW: 2S Intel® Xeon® Gold 6148 processor, 2.4GHz, 40 cores, turbo and HT on, BIOS 86B.01.00.0412.R00, 12x16GB 2666MHz DDR, Red Hat Enterprise Linux* 7.3 kernel 3.10.0-514

Iso3dfd

Version dev13 from January 2017. OS: CentOS 7.3. Compiler: Intel® Parallel Studio XE Cluster Edition 2017 update 2.

Runs: OpenMP only, always using the maximum number of cores. Common workload parameters were used for all runs. Better performance can be obtained with MPI+OMP or fewer threads, and by using the best parameters found with GA for every IA.

KMP_AFFINITY=compact OMP_SCHEDULE=static KMP_HW_SUBSET=Xc,1t /raid/opt/intel/share/phil/iso3dfd_for_oem/bin/iso3dfd_dev13_cpu_simd_ft_nohbm.exe 224 2125 2100 2X 50 224 48 96

AMBER

Amber: Version 16 with all patches applied as of December 2016. Workloads: PME Cellulose NVE (408K atoms), PME stmv (1M atoms), GB Nucleosome (25K atoms), GB Rubisco (75K atoms). No cut-off was used for GB workloads. Compiled with -mic2_spdp -intelmpi -openmp, with -DMIC2* defined. Tests performed in March 2017.

BASELINE: Executed with 36 MPI and 2 OpenMP. 2S Intel® Xeon® processor E5-2697 v4, 2.3GHz, 36 cores, turbo and HT on, BIOS 86B0271.R00, 8x16GB 2400MHz DDR4, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.

NEW: Executed with 40 MPI and 2 OpenMP. 2S Intel® Xeon® Gold 6148 processor, 2.4GHz, 40 cores, turbo on, HT on, BIOS 86B.01.00.0412.R00, 12x16GB 2666MHz DDR, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.

* -DMIC2 enables optimizations for AVX-512 vectorization, SPDP mixed precision, and OpenMP, but no optimizations specific to the KNL architecture.

GROMACS

GROMACS AVX2 CONFIGURATION: Version 2016.3: ftp://ftp.gromacs.org/pub/gromacs/gromacs-2016.3.tar.gz, Intel® Compiler 17.0.1.132, Intel® MPI 2017u1. Optimization flags: "-O3 -xCORE-AVX2". CMake options: "-DGMX_FFT_LIBRARY=mkl -DGMX_SIMD=AVX2_256".

GROMACS AVX512 CONFIGURATION: Version 2016.3: ftp://ftp.gromacs.org/pub/gromacs/gromacs-2016.3.tar.gz, Intel® Compiler 17.0.1.132, Intel® MPI 2017u1. Optimization flags: "-O3 -xCORE-AVX512". CMake options: "-DGMX_FFT_LIBRARY=mkl -DGMX_SIMD=AVX_512".

BASELINE INTEL XEON CONFIGURATION: GROMACS AVX2 binary, Dual Socket Intel® Xeon® processor E5-2697 v3, base 2.6 GHz, 14 Cores/Socket, 28 Cores, 56 Threads (HT on, Turbo on), DDR4 128GB, 2133 MHz, Red Hat 7.3.

NEXT GEN INTEL XEON CONFIGURATION: GROMACS AVX2 binary, Dual Socket Intel® Xeon® processor E5-2697 v4 2.3 GHz, 18 Cores/Socket, 36 Cores, 72 Threads (HT on, Turbo on), DDR4 128GB, 2400 MHz, Red Hat 7.2.

NEW INTEL XEON CONFIGURATION: GROMACS AVX512 binary, Dual Socket Intel® Xeon® processor Gold 6148 2.4 GHz, 20 Cores/Socket, 40 Cores, 80 Threads (HT on, Turbo on), DDR4 192GB, 2666 MT/s DDR4 RDIMMs, Red Hat 7.2.

VASP

VASP CONFIGURATION: Developer branch provided as “Package” included with download: https://github.com/vasp-dev/vasp-knl, Intel® Compiler 17.0.1.132, Intel® MPI 2017u1, ELPA 2016.05.004. Optimization flags: "-O3 -xCORE-AVX512".

BASELINE (INTEL XEON) CONFIGURATION: 2S Intel® Xeon processor E5-2699 v3 2.3 GHz, 18 Cores/Socket, 36 Cores, 72 Threads, HT on, turbo off, 128GB total memory, 2133 MT/s / DDR4 RDIMM, Red Hat Enterprise Linux* 7.0 kernel.

NEXT GEN (INTEL XEON) CONFIGURATION: 2S Intel® Xeon® processor E5-2697 v4 2.3 GHz , 18 Cores/Socket, 36 Cores, 72 Threads, HT on, turbo off, BIOS 86B0271.R00, 128GB total memory, 2400 MT/s DDR4 RDIMM, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.

NEW (INTEL XEON) CONFIGURATION: Dual Socket Intel® Xeon® processor Gold 6148 2.4 GHz , 20 Cores/Socket, 40 Cores, 80 Threads, HT on, turbo off, BIOS 86B.01.00.0412, 192GB total memory, 2666 MT/s / DDR4 RDIMM, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.

NAMD

NAMD: Version 2.12, December 2016. Workloads: apoa1 (92K atoms), stmv (1M atoms). Compiled with -DNAMD_KNL* defined. Tests performed in March 2017.

BASELINE: Performance apoa1- 5.62, stmv - 0.44 ns/day. Executed with 72 charm threads. 2S Intel® Xeon® processor E5-2697 v4, 2.3GHz, 36 cores, turbo and HT on, BIOS 86B0271.R00, 8x16GB 2400MHz DDR4, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.

NEW: Performance apoa1- 8.67, stmv - 0.73 ns/day. Executed with 40 charm threads. 2S Intel® Xeon® Gold 6148 processor, 2.4GHz, 40 cores, turbo on, HT on, BIOS 86B.01.00.0412.R00, 12x16GB 2666MHz DDR, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.

* The -DNAMD_KNL define enables optimizations for AVX-512 vectorization, but none specific to the KNL architecture.
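The relative gain implied by the two NAMD results above can be worked out directly from the quoted ns/day figures. The short Python sketch below does only that arithmetic; the dictionary and variable names are assumptions chosen for the example.

```python
# NAMD throughput in ns/day, as quoted in the BASELINE and NEW entries above.
baseline_ns_per_day = {"apoa1": 5.62, "stmv": 0.44}
new_ns_per_day = {"apoa1": 8.67, "stmv": 0.73}

# Speedup of NEW over BASELINE for each workload (higher ns/day is better).
speedup = {
    workload: new_ns_per_day[workload] / baseline_ns_per_day[workload]
    for workload in baseline_ns_per_day
}

for workload, ratio in speedup.items():
    print(f"{workload}: {ratio:.2f}x")
```

Note that the two entries report different thread counts (72 and 40 charm threads), so the ratio compares the configurations as run rather than a like-for-like thread count.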

WRF

BASELINE: 2S Intel® Xeon® processor E5-2697 v4, 2.3GHz, 36 cores, turbo and HT on, BIOS 86B0271.R00, 128GB total memory, 8 slots / 16GB / 2400 MT/s / DDR4 RDIMM, 1 x 1TB SATA, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.

Software: WRF version 3.6.1, compiled using the Intel config option with "-O3 -fp-model fast=1 -xCORE-AVX2". Executed with 36 MPI ranks and OMP_NUM_THREADS=1.

NEW: Intel® Xeon® Gold 6148 processor, 2.4GHz, 40 cores, turbo and HT on, BIOS 86B.01.00.0412, 192GB total memory, 12 slots / 16 GB / 2666 MT/s / DDR4 RDIMM, 1 x 800GB INTEL SSD SC2BA80, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.

Software: WRF version 3.6.1, compiled using the Intel config option with "-O3 -fp-model fast=1 -xCORE-AVX512". Executed with 40 MPI ranks and OMP_NUM_THREADS=1.

FSI

P100: 2S Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz, 56 cores, turbo and HT on, BIOS 86B0271.R00, 16x16GB 2133MHz RDIMM, Red Hat Enterprise Linux Server release 7.3 (Maipo) kernel 3.10.0-514.6.1.el7.x86_64, CUDA SDK 8.0

VASP

NVIDIA* CONFIGURATION: Dual Socket Intel® Xeon® processor E5-2697 v4 2.3 GHz, 18 Cores/Socket, 36 Cores, 72 Threads (HT and turbo on), DDR4 128GB, 2400 MHz, Red Hat 7.2, Super Micro* SuperServer 1028GR-TR, BIOS Version 2.0a, Super Micro* X10DRG-H Motherboard, CSE-118GHTS-R1K66BP FRU, 500GB SATA Seagate* ST9500423AS System Disk, NVIDIA Tesla* P100 GPU, NVIDIA CUDA* 8.0 (375.20).

LAMMPS

LAMMPS Version: 13 Oct 2016, Parallel Studio 2016 update 3, CUDA Driver: 367.48, CUDA Version: 8.0, OS: RHEL 7.2 Kernel 3.10.0-327. Host Processor: E5-2697v4, Turbo Enabled, HT Enabled. Host Memory: 8x16GB 2400 MHz DDR4. GPU: Tesla P100-PCIE-16GB, Boost Enabled. CUDA MPS: The best run was taken by varying the number of MPI tasks on the host from 1 to 36 and using CUDA MPS with fewer than 17 MPI tasks.

GROMACS

BDW+1xNVIDIA P100: 2S Intel® Xeon® processor E5-2697 v4 (Turbo On) + 1xNVIDIA Tesla P100, Default Configuration, CUDA 8.0

Cost and power used in charts are based on estimated node cost and estimated node wall power during operation. All application performance is measured internally by Intel. GPU performance was measured with a single P100 per node. When comparing against 2 P100s per node, this assessment assumes linear scaling, with the 2-GPU node having twice the application performance of a measured 1-GPU node result.

Configuration                                    Estimated System Power   Estimated System Cost
1S Intel® Xeon Phi™ Processor 7250               392 W                    $7250
2S Intel® Xeon® Processor Gold 6148              498 W                    $10000
2S Intel® Xeon® Processor Gold 6148 + 2 P100s    1102 W                   $21000
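As a rough illustration of how these estimates feed into perf/W and perf/$ comparisons, the Python sketch below normalizes each node's estimated power and cost to the Intel® Xeon Phi™ processor 7250 node and applies the linear-scaling assumption for the 2-GPU node. The power and cost figures come from the estimates above; the dictionary keys, the helper `two_gpu_estimate`, and the sample 1-GPU performance value are hypothetical.

```python
# Estimated node wall power (W) and node cost ($) from the estimates above.
nodes = {
    "xeon_phi_7250": {"watts": 392, "dollars": 7250},
    "gold_6148_2s": {"watts": 498, "dollars": 10000},
    "gold_6148_2s_plus_2_p100": {"watts": 1102, "dollars": 21000},
}

reference = nodes["xeon_phi_7250"]

# Power and cost of each node relative to the Xeon Phi 7250 node.
relative = {
    name: {
        "power": spec["watts"] / reference["watts"],
        "cost": spec["dollars"] / reference["dollars"],
    }
    for name, spec in nodes.items()
}


def two_gpu_estimate(one_gpu_performance):
    """Linear-scaling assumption: a 2-GPU node is credited with twice
    the application performance measured on a 1-GPU node."""
    return 2.0 * one_gpu_performance


# Hypothetical measured 1-GPU result, in arbitrary performance units.
estimated_2gpu = two_gpu_estimate(1.0)
perf_per_watt_2gpu = estimated_2gpu / nodes["gold_6148_2s_plus_2_p100"]["watts"]
```

Dividing a measured application result by the node's watts or dollars in this way gives the perf/W and perf/$ figures; linear scaling is generous to the GPU node, since real multi-GPU scaling is often below 2x.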
