HPC & AI Applications on Intel® Xeon Phi™ Processor
Tools make things better and easier when people use them effectively.
Zongyan Cao 曹宗雁
Intel Software and Services Group
June 30, 2017
Agenda
Overview
Tools introduction with cases sharing
Compilers
Libraries
Profiling & Analyzing Tools
Cluster Tools
……
More Information
For Discovery and Business Innovation
in Science, Visualization & Analytics
See featured applications: Intel® Xeon Phi™ Processor Applications Showcase
What is Knights Landing?
Knights Landing package
• On-package MCDRAM
• DDR4
• 36 lanes PCIe* Gen3 (x16, x16, x4)
• x4 DMI2 to PCH
Tile (up to 36)
• 2 cores, 2 VPUs per core
• 1MB L2
• Hub
• Enhanced Intel® Atom™ cores based on Silvermont microarchitecture
Key features
• Self-boot processor: binary-compatibility with Xeon, 3+ TFLOPS (DP, theoretical peak performance)
• On-package memory: 16GB, up to 490 GB/s STREAM TRIAD
• Platform memory: up to 384GB (6ch DDR4-2400 MHz)
Other key features
• 2D mesh architecture
• Out-of-order cores
• 3X single-thread vs. KNC
• Intel® AVX-512 instructions
• Scatter/gather engine
• Integrated fabric: OPA
Diagram legend: IMC (Integrated Memory Controller), EDC (Embedded DRAM Controller), IIO (Integrated I/O Controller)
Material Science: VASP, LAMMPS, NWCHEM, GTC-P
QCD: MILC, CHROMA, CCS QCD
CFD/Mfg: OPENFOAM, CLOVERLEAF, LSTC LS-DYNA, CONVERGENT SCIENCE CONVERGE CFD
Weather/Climate/Cosmology: WRF, NEMO, WALLS
Energy: ISO3DFD
FSI: STAC A2, MONTE CARLO, BLACK SCHOLES
MD: NAMD, GROMACS, AMBER
What segments & applications are primary targets for KNL? Why?
Features driving KNL’s perf & perf/$/W
• High number of physical cores (up to 72) +
• High number of threads (up to 288) +
• Intel® AVX-512 (ER)
• 16GB MCDRAM
• High memory BW (up to 490 GB/s)
• High system memory (up to 400 GB)
• Lower system price ($7300) +
• Lower system power (~400W) +
See slides in back-up for description and availability of these codes
*Other names and brands may be claimed as the property of others.
+See speaker notes for comparison data
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance Source: Intel measured or estimated as of April 2017.
Material Science: VASP, LAMMPS, NWCHEM**
QCD: MILC (WIP)
CFD/Mfg: LSTC LS-DYNA**, OPENFOAM**, CONVERGENT
SCIENCE CONVERGE CFD**, CLOVERLEAF (WIP)
Weather: WRF**
FSI: STAC A2 (WIP), BLACK SCHOLES
MD: GROMACS
What segments & applications are primary targets for KNL Vs P100*?
*Other names and brands may be claimed as the property of others.
Features driving KNL’s perf & perf/$/W
• Intel® AVX-512 (ER)
• Higher attached memory (up to 400 GB)
• Lower system price ($7300) +
• Lower system power (~400W) +
**Intel assessment - Not ported/optimized for P100
See slides in back-up for description and availability of these codes
+See speaker notes for comparison data
#IntelAI
ACCELERATE Hardware Capabilities
Delivering hardware optimized for deep learning
• Available Now: Intel® Xeon Phi™ 7200 Series. Start training models today using Intel® Xeon Phi™
• Available Late 2017: Knights Mill. Up to 4x performance over 7200 series for Deep Learning workloads*
• Available Soon: Nervana Crest. World class neural network performance via Nervana engine
OPTIMIZE Deep Learning Software
Optimized via framework & library enhancements
• Libraries/Languages: Intel® MKL, MKL-DNN, Intel® MLSL. Tuned for Intel® processors - current & next generation
• Directly Optimized Frameworks: optimizing these frameworks on Intel® Xeon® & Xeon Phi™ processors. Tuning delivers up to 400x performance**
• Optimized via nGRAPH: above frameworks will be optimized via Nervana nGRAPH
ALIGN Developer Ecosystem
Ensure Intel solutions are easy to use and readily available
• Training: Benefit from expert-led trainings, hands-on workshops, exclusive remote access, and more!
• Tools: Gain access to the latest libraries, frameworks, tools, and technologies from Intel to accelerate your AI project
• Community: Collaborate with industry luminaries, developers, students, and Intel engineers
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases. For more complete information visit: http://www.intel.com/performance Source: Intel measured as of Nov 2016
*Intel® Xeon Phi ™ processor Knights Mill up to 4x estimated performance improvement over Intel® Xeon Phi™ processor 7290 BASELINE: Intel® Xeon Phi™ Processor 7290 (16GB, 1.50 GHz, 72 core) with 192 GB Total Memory on Red Hat Enterprise Linux* 6.7 kernel 2.6.32-573 using MKL 11.3 Update 4, Relative performance 1.0 Knights Mill: Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.
** Configurations contained on next slide
Chart: Normalized Throughput (Images/Second). Workloads: Caffe/AlexNet, TensorFlow/AlexNet-ConvNet, TensorFlow/ConvNet VGG. Series: 2S Intel® Xeon® processor E5-2697 v4; Intel® Xeon Phi™ processor 7250.
Intel® Xeon Phi ™ processor 7250 single-node image classification training throughput up to 1.8x better than 2S Intel® Xeon® processor E5-2697 v4
Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision #20110804
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance Source: Intel measured as of November 2016 . Click here for configuration
Y-axis: Intel® Xeon Phi™ 7250 relative performance (normalized to 1.0 baseline of a 2S Intel® Xeon® processor E5-2697 v4)
Bar labels: up to 1.8x, up to 1.7x, up to 1.4x
What is KNM?
First Knights product designed for Intel® Scalable System Framework for Deep Learning
KNM is a KNL derivative that strives for:
• Doubling the peak performance of SP dense compute & perf/W
  - At the cost of lower DP performance
  - No or minimal platform changes
How?
• Remove DP units to make room for 2x SP units + int16 VNNI
• Drive them via Quadmadd instructions to boost efficiency
Knights Landing SOC
Tile
• 1 MB L2 per tile
• 2 cores per tile (SLM-based), 32 KB D$/I$ per core
• 2 VPUs/core
• 32 SP/16 DP flops per VPU
• Hub
Die
• Up to 36 tiles (72 cores), connected with a 2D mesh
• 1.7 GHz mesh
• Other blocks: EDC, iMC, IIO, Misc
Package
• 8 MCDRAM devices attached via OPIO: 16GB of IPM (MCDRAM), up to 490 GB/s STREAM triad
• 6 channels of DDR4 2400 (up to 384GB, 90 GB/s STREAM triad)
• 36 lanes PCIe Gen3 (2 x16, 1 x4)
• x4 DMI
Knights Mill SOC
Tile
• 1 MB L2 per tile
• 2 cores per tile (SLM-based), 32 KB D$/I$ per core
• 1 VPU/core for DP, 2 VPUs/core for SP
• 64 SP/16 DP flops per VPU
• Hub
Die
• Up to 36 tiles (72 cores), connected with a 2D mesh
• 1.6 GHz mesh
• Other blocks: EDC, iMC, IIO, Misc
• 2x SP, 0.5x DP
• New ISA for 2x SP: quadmadd
• 14nm+
• Up to 320W SKU (72 cores at 1.5 GHz)
Package
• 8 MCDRAM devices attached via OPIO: 16GB of IPM (MCDRAM), up to 460 GB/s STREAM triad
• 6 channels of DDR4 2400 (up to 384GB, 90 GB/s STREAM triad)
• 36 lanes PCIe Gen3 (2 x16, 1 x4)
• x4 DMI
Student Visit Program Introduction
Developer IA Tech Pipeline Buildup & Refresh
ISV
Developer Resume: Name # (in case name is not proper)
R&R Experience
DRD
Developer Categorization: Resume screening, (Phone / F2F) Interview
• Senior Developers
• Junior Developers
• New Beginner
Developer IA Tech Pipeline
Refresh every half year
Promotion Required Actions
• Finish all trainings in Training Framework for this ISV
• Pass DRD AE brief interview
• Successfully finish Intel Internship for this ISV
Intel® Xeon Phi™ Processor Family
Enables Shorter Time to Train Using General Purpose Infrastructure
Processor for HPC & enterprise customers running scale-out, highly-parallel, memory intensive apps
Removing IO and Memory Barriers
• Integrated Intel® Omni-Path fabric increases price-performance and reduces communication latency
• Direct access of up to 400 GB of memory with no PCIe performance lag (vs. GPU: 16GB)
Breakthrough Highly Parallel Performance
• Up to 400X deep learning performance on existing hardware via Intel software optimizations
• Up to 4X deep learning performance increase estimated (Knights Mill, 2017)
Easier Programmability
• Binary-compatible with Intel® Xeon® processors
• Open standards, libraries and frameworks
Configuration details on slide 30. Source: Intel measured as of November 2016.
Intel® Omni-Path Architecture
World-Class Interconnect Solution for Shorter Time to Train
Fabric interconnect for breakthrough performance on scale-out apps like deep learning training
Breakthrough Performance
Increases price performance, reduces communication latency compared to InfiniBand EDR¹:
• Up to 21% higher performance, lower latency at scale
• Up to 17% higher messaging rate
• Up to 9% higher application performance
Building on some of the industry's best technologies
• Highly leverage existing Aries & Intel True Scale fabrics
• Excellent price/performance, price/port, 48 radix
• Re-use of existing OpenFabrics Alliance software
• 80+ Fabric Builder Members
Innovative Features
Improve performance, reliability and QoS through:
• Traffic Flow Optimization to maximize QoS in mixed traffic
• Packet Integrity Protection for rapid and transparent recovery of transmission errors
• Dynamic lane scaling to maintain link continuity
Components
• HFI Adapters: single port, x8 and x16
• Edge Switches: 1U form factor, 24 and 48 port
• Director Switches: QSFP-based, 192 and 768 port
• Software: open source host software and Fabric Manager
• Cables: third party vendors, passive copper, active optical
1Intel® Xeon® Processor E5-2697A v4 dual-socket servers with 2133 MHz DDR4 memory. Intel® Turbo Boost Technology and Intel® Hyper Threading Technology enabled. BIOS: Early snoop disabled, Cluster on Die disabled, IOU non-posted prefetch disabled, Snoop hold-off timer=9. Red Hat Enterprise Linux Server release 7.2 (Maipo). Intel® OPA testing performed with Intel Corporation Device 24f0 – Series 100 HFI ASIC (B0 silicon). OPA Switch: Series 100 Edge Switch – 48 port (B0 silicon). Intel® OPA host software 10.1 or newer using Open MPI 1.10.x contained within host software package. EDR IB* testing performed with Mellanox EDR ConnectX-4 Single Port Rev 3 MCX455A HCA. Mellanox SB7700 - 36 Port EDR Infiniband switch. EDR tested with MLNX_OFED_Linux-3.2.x. OpenMPI 1.10.x contained within MLNX HPC-X. Message rate claim: Ohio State Micro Benchmarks v. 5.0. osu_mbw_mr, 8 B message (uni-directional), 32 MPI rank pairs. Maximum rank pair communication time used instead of average time, average timing introduced into Ohio State Micro Benchmarks as of v3.9 (2/28/13). Best of default, MXM_TLS=self,rc, and -mca pml yalla tunings. All measurements include one switch hop. Latency claim: HPCC 1.4.3 Random order ring latency using 16 nodes, 32 MPI ranks per node, 512 total MPI ranks. Application claim: GROMACS version 5.0.4 ion_channel benchmark. 16 nodes, 32 MPI ranks per node, 512 total MPI ranks. Intel® MPI Library 2017.0.064. Additional configuration details available upon request.
Libraries, frameworks & tools

Intel® Math Kernel Library (Intel® MKL)
  High-level overview: High performance math primitives granting low level of control
  Primary audience: Consumed by developers of higher level libraries and applications
  Example usage: Framework developers call matrix multiplication, convolution functions

MKL-DNN
  High-level overview: Free open source DNN functions for high-velocity integration with deep learning frameworks
  Primary audience: Consumed by developers of the next generation of deep learning frameworks
  Example usage: New framework with functions developers call for max CPU performance

Intel® MLSL
  High-level overview: Primitive communication building blocks to scale deep learning framework performance over a cluster
  Primary audience: Deep learning framework developers and optimizers
  Example usage: Framework developer calls functions to distribute Caffe training compute across an Intel® Xeon Phi™ cluster

Intel® Data Analytics Acceleration Library (DAAL)
  High-level overview: Broad data analytics acceleration object oriented library supporting distributed ML at the algorithm level
  Primary audience: Wider data analytics and ML audience, algorithm level development for all stages of data analytics
  Example usage: Call distributed alternating least squares algorithm for a recommendation system

Intel® Distribution for Python*
  High-level overview: Most popular and fastest growing language for machine learning
  Primary audience: Application developers and data scientists
  Example usage: Call scikit-learn k-means function for credit card fraud detection

Open Source Frameworks
  High-level overview: Toolkits driven by academia and industry for training machine learning algorithms
  Primary audience: Machine learning app developers, researchers and data scientists
  Example usage: Script and train a convolution neural network for image recognition

Intel Deep Learning SDK
  High-level overview: Accelerate deep learning model design, training and deployment
  Primary audience: Application developers and data scientists
  Example usage: Deep learning training and model creation, with optimization for deployment on constrained end device

Intel® Computer Vision SDK
  High-level overview: Toolkit to develop & deploy vision-oriented solutions that harness the full performance of Intel CPUs and SOC accelerators
  Primary audience: Developers who create vision-oriented solutions
  Example usage: Use deep learning to do pedestrian detection
Find out more at software.intel.com/ai
Xeon Phi Processor & Platform Introduction – Knights Landing (KNL)
Processor: SKUs & Key Features
SKU    Positioning                        Peak performance        Cores/Threads  GHz  On-package memory  Fabric  DDR4             Power**
7290*  Best node performance              3.46TF DP / 6.92TF SP   72/288         1.5  16GB, 7.2 GT/s     Yes     384GB, 2400 MHz  245W
7250   Best performance per watt          3.05TF DP / 6.1TF SP    68/272         1.4  16GB, 7.2 GT/s     Yes     384GB, 2400 MHz  215W
7230   Best memory bandwidth per core     2.66TF DP / 5.32TF SP   64/256         1.3  16GB, 7.2 GT/s     Yes     384GB, 2400 MHz  215W
7210   Entry-level product                2.66TF DP / 5.32TF SP   64/256         1.3  16GB, 6.4 GT/s     Yes     384GB, 2133 MHz  215W
*Available beginning in September **Add 15 watts for integrated fabric
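The peak figures in the SKU table are consistent with a simple product: cores x clock (GHz) x flops per core per cycle. The sketch below is a minimal check, assuming 2 VPUs per core with 16 DP / 32 SP flops per VPU per cycle (as stated on the Knights Landing SOC slide) and 4 hardware threads per core. The names `SKUS` and `peak_tflops` are illustrative and do not come from the slides.

```python
# Peak-performance check for the KNL SKU table.
# Assumptions (from the Knights Landing SOC slide): 2 VPUs per core,
# 16 DP / 32 SP flops per VPU per cycle, 4 hardware threads per core.
VPUS_PER_CORE = 2
DP_FLOPS_PER_VPU = 16
SP_FLOPS_PER_VPU = 32
THREADS_PER_CORE = 4

# SKU -> (cores, clock in GHz), copied from the SKU table.
SKUS = {
    "7290": (72, 1.5),
    "7250": (68, 1.4),
    "7230": (64, 1.3),
    "7210": (64, 1.3),
}


def peak_tflops(cores, ghz, flops_per_vpu):
    """Theoretical peak in TFLOPS: cores x GHz x VPUs x flops per VPU per cycle."""
    return cores * ghz * VPUS_PER_CORE * flops_per_vpu / 1000.0


for sku, (cores, ghz) in SKUS.items():
    dp = peak_tflops(cores, ghz, DP_FLOPS_PER_VPU)
    sp = peak_tflops(cores, ghz, SP_FLOPS_PER_VPU)
    threads = cores * THREADS_PER_CORE
    print(f"{sku}: {threads} threads, {dp:.2f} TF DP, {sp:.2f} TF SP")
```

SP values computed this way can differ in the last digit from the table, which appears to double the already rounded DP figure.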
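Eliminates the PCIe data-transfer bottleneck: self-booting host processor
On-package MCDRAM improves memory bandwidth: 16GB of integrated high-speed on-package memory with up to 490 GB/s of bandwidth, about 4-5x DDR4 bandwidth
Runs x86 applications just like a CPU: binary-compatible with Intel® Xeon® applications
Excellent system scalability: good scaling efficiency, like an Intel® Xeon® CPU
Integrated Intel Omni-Path Fabric: lower network latency, lower cost, lower power
Life sciences / gene sequencing / financial risk modeling / energy / weather and environment / fluid dynamics / visualization / rendering / big data / machine learning / deep learning
Standard x86 programming model: same programming environment, tools, and languages as Xeon
Highly parallel and vectorized: up to 288 concurrent threads, AVX-512 instruction set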
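The "4-5x DDR4 bandwidth" figure can be sanity-checked with a little arithmetic. The sketch below compares the 490 GB/s MCDRAM STREAM triad number against both the theoretical DDR4 peak and the 90 GB/s DDR4 STREAM triad number from the Knights Landing SOC slide. It assumes a 64-bit (8-byte) data path per DDR4 channel; the function name `ddr4_peak_gbs` is illustrative.

```python
# Rough check of the "about 4-5x DDR4 bandwidth" claim for MCDRAM.
# Assumption: each DDR4 channel moves 8 bytes (64 bits) per transfer.
MCDRAM_STREAM_GBS = 490.0  # STREAM triad, from the slides
DDR4_STREAM_GBS = 90.0     # STREAM triad, from the Knights Landing SOC slide
DDR4_CHANNELS = 6
DDR4_MTS = 2400            # mega-transfers per second per channel
BYTES_PER_TRANSFER = 8


def ddr4_peak_gbs(channels, mts, bytes_per_transfer=BYTES_PER_TRANSFER):
    """Theoretical DDR4 bandwidth in GB/s: channels x MT/s x bytes per transfer."""
    return channels * mts * bytes_per_transfer / 1000.0


peak = ddr4_peak_gbs(DDR4_CHANNELS, DDR4_MTS)
print(f"DDR4 theoretical peak: {peak:.1f} GB/s")
print(f"MCDRAM STREAM vs DDR4 theoretical peak: {MCDRAM_STREAM_GBS / peak:.1f}x")
print(f"MCDRAM STREAM vs DDR4 STREAM: {MCDRAM_STREAM_GBS / DDR4_STREAM_GBS:.1f}x")
```

Against the theoretical DDR4 peak the ratio falls inside the quoted range; against measured DDR4 STREAM it comes out somewhat above 5x.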
Xeon Phi Processor & Platform Introduction – Knights Landing (KNL)
Platform
Intel Software Development Platform
(SDP)
Intel Server Board S7200APin Intel Server Chassis H2000XXLR2
Other OEM Systems
Xeon Phi Processor & Platform Introduction – Knights Landing (KNL)
System: Interconnect
Ethernet
• (1 or 2) x16 or x8 PCIe
• 10/40 GbE
Omni-Path Fabric
• (2) x16 PCIe
• F connector, two QSFP connectors, QSFP module
• 2x 100Gbps (uni-directional); 50GB/sec (bi-directional)
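The two Omni-Path figures on this slide are the same quantity in different units. A minimal sketch of the conversion, assuming two ports, 8 bits per byte, and no protocol overhead; the function name `aggregate_gbytes_per_s` is illustrative.

```python
# Unit conversion for the Omni-Path bandwidth figures on this slide.
# Assumptions: 2 ports, 8 bits per byte, protocol overhead ignored.
PORTS = 2
GBIT_PER_PORT = 100  # Gb/s per port, one direction


def aggregate_gbytes_per_s(ports, gbit_per_port, bidirectional=False):
    """Aggregate link bandwidth in GB/s across all ports."""
    one_way = ports * gbit_per_port / 8.0
    return 2 * one_way if bidirectional else one_way


print(aggregate_gbytes_per_s(PORTS, GBIT_PER_PORT))                      # one direction
print(aggregate_gbytes_per_s(PORTS, GBIT_PER_PORT, bidirectional=True))  # both directions
```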
Chart generations: 1st Gen (2013), 2nd Gen (2016), Knights Mill (2017)
• Up to ~13.8 TF* Single Precision Peak performance
• Up to ~27.6 TOPs* Variable Precision QVNNI performance
• Surgical changes in the VPU to increase SP performance over DP performance
• Bootable Host-CPU avoids PCIe latency & bottlenecks
• Efficient Scaling with Multi-node optimizations for top ML frameworks
• High memory bandwidth for seamlessly training Complex Neural Network datasets
Chart y-axis: Single-Precision Teraflops (Peak)
Common Groveport Platform
Bootable Host CPU
Knights Mill: Optimal Deep Learning Throughput
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks. Performance estimate wrt KNL 7250 SKU SGEMM. Performance Calculation= AVX freq X Cores X Flops per Core X Efficiency
Faster Time to Train Machines
*Based on estimates and Subject to Change
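The performance calculation quoted on this slide (AVX freq x cores x flops per core x efficiency) can be written out directly. The sketch below is illustrative only: it assumes an AVX frequency of 1.5 GHz and 72 cores (the SKU on the Knights Mill SOC slide), 2 VPUs x 64 SP flops per core per cycle, and a QVNNI rate of twice the SP rate. The 0.8 efficiency is a made-up value to show the effect of the last factor, not a measured one.

```python
# Sketch of the slide's formula:
#   performance = AVX freq x cores x flops per core x efficiency
# All inputs below are assumptions taken from or inferred from the slides.
KNM_CORES = 72
KNM_AVX_GHZ = 1.5                   # assumed equal to the 1.5 GHz SKU clock
KNM_SP_FLOPS_PER_CORE = 2 * 64      # 2 VPUs/core for SP, 64 SP flops per VPU
KNM_QVNNI_OPS_PER_CORE = 2 * KNM_SP_FLOPS_PER_CORE  # assumed 2x the SP rate


def estimated_tflops(avx_ghz, cores, flops_per_core_per_cycle, efficiency=1.0):
    """Estimated throughput in tera-operations per second."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return avx_ghz * cores * flops_per_core_per_cycle * efficiency / 1000.0


sp_peak = estimated_tflops(KNM_AVX_GHZ, KNM_CORES, KNM_SP_FLOPS_PER_CORE)
qvnni_peak = estimated_tflops(KNM_AVX_GHZ, KNM_CORES, KNM_QVNNI_OPS_PER_CORE)
# Hypothetical 80% efficiency, chosen only for illustration.
sp_at_80 = estimated_tflops(KNM_AVX_GHZ, KNM_CORES, KNM_SP_FLOPS_PER_CORE, 0.8)

print(f"SP peak: {sp_peak:.1f} TF")
print(f"QVNNI peak: {qvnni_peak:.1f} TOPs")
print(f"SP at 80% efficiency (hypothetical): {sp_at_80:.1f} TF")
```

With efficiency set to 1.0, these assumptions give values in line with the "~13.8 TF" and "~27.6 TOPs" bullets.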
Intel® Xeon Phi ™ processor Knights Mill up to 4x estimated performance improvement over Intel® Xeon Phi™ processor 7290
Chart: Deep Learning Performance (normalized performance). Series: Intel® Xeon Phi™ processor 7290; Intel® Xeon Phi™ processor family - Knights Mill. Y-axis: estimated normalized performance on Intel® Xeon Phi™ processor 7290 compared to Intel® Xeon Phi™ Knights Mill. Bar label: up to 4x.
Configuration details on next slide
Knights Mill performance: Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance Source: Intel measured baseline Intel® Xeon Phi™ processor 7290 as of November 2016
Intel® Parallel Studio XE
Create Faster Code…Faster
More Power for Your Code - software.intel.com/intel-parallel-studio-xe
Editions: Composer Edition, Professional Edition, Cluster Edition
BUILD: Compilers & Libraries
• C / C++ Compiler: Optimizing Compiler
• Fortran Compiler: Optimizing Compiler
• Intel® Distribution for Python*: High Performance Scripting
• Intel® MKL: Fast Math Kernel Library
• Intel® IPP: Image, Signal & Data Processing
• Intel® TBB: C++ Threading Library
• Intel® DAAL: Data Analytics Library
ANALYZE: Analysis Tools
• Intel® VTune™ Amplifier: Performance Profiler
• Intel® Advisor: Vectorization Optimization & Thread Prototyping
• Intel® Inspector: Memory & Thread Debugger
SCALE: Cluster Tools
• Intel® Trace Analyzer & Collector: MPI Tuning & Analysis
• Intel® MPI Library: Message Passing Interface Library
• Intel® Cluster Checker: Cluster Diagnostic Expert System
Operating System: Windows*, Linux*, MacOS1*
Intel® Architecture Platforms
Components of Intel® MKL 2017
Linear Algebra
• BLAS • LAPACK • ScaLAPACK • Sparse BLAS • Sparse Solvers (Iterative, PARDISO*, Cluster Sparse Solver)
Fast Fourier Transforms
• Multidimensional • FFTW interfaces • Cluster FFT
Vector Math
• Trigonometric • Hyperbolic • Exponential • Log • Power • Root • Vector RNGs
Summary Statistics
• Kurtosis • Variation coefficient • Order statistics • Min/max • Variance-covariance
And More…
• Splines • Interpolation • Trust Region • Fast Poisson Solver
Deep Neural Networks (New)
• Convolution • Pooling • Normalization • ReLU • Softmax
Improve GEMM performance with packed GEMM
A machine learning case
• Improve GEMM performance with packed GEMM, available from MKL 2017
Chart: SGEMM Performance Comparison (higher is better). Y-axis: peak value (Tflops). X-axis: matrix size (m,n,k). Series: sgemm on KNL 7210; optimization by packed gemm on KNL.
NOTE: Packed GEMM helps for some matrices, but the packing cost is high. That cost can be amortized over multiple GEMM calls if the input matrices (A or B) are reused between these calls.
Performance enhanced by
Intel® Math Kernel Library
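The amortization argument in the note can be made concrete with a small cost model: packing pays off once the one-time packing cost is recovered by the per-call saving. The sketch below is a model only and does not call MKL. The function name and the timings are hypothetical. To my understanding the corresponding MKL 2017 routines are the `cblas_sgemm_alloc` / `cblas_sgemm_pack` / `cblas_sgemm_compute` / `cblas_sgemm_free` family, but check the MKL reference before relying on that.

```python
import math


def break_even_calls(t_pack, t_plain, t_packed):
    """Smallest number of GEMM calls for which packing once is no slower overall.

    t_pack   : one-time cost of packing the reused matrix
    t_plain  : cost of one ordinary GEMM call
    t_packed : cost of one GEMM call on the pre-packed matrix
    Returns None when the packed call is not faster, so packing never pays off.
    """
    saving = t_plain - t_packed
    if saving <= 0:
        return None
    # n calls: plain costs n * t_plain, packed costs t_pack + n * t_packed.
    return max(1, math.ceil(t_pack / saving))


# Hypothetical timings in microseconds, chosen only to illustrate the model.
print(break_even_calls(t_pack=3000, t_plain=1000, t_packed=700))
```

If the same A or B matrix is reused for fewer calls than the break-even count, the ordinary GEMM call is the better choice under this model.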
Legal Disclaimers
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.
Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported.
SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjbb, SPECjvm, SPECWeb, SPECompM, SPECompL, SPEC MPI, SPECjEnterprise* are trademarks of the Standard Performance Evaluation Corporation. See http://www.spec.org for more information. TPC-C, TPC-H, TPC-E are trademarks of the Transaction Processing Council. See http://www.tpc.org for more information.
Hyper-Threading Technology requires a computer system with a processor supporting HT Technology and an HT Technology-enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and software you use. For more information including details on which processors support HT Technology, see here.
Intel® Turbo Boost Technology requires a Platform with a processor with Intel Turbo Boost Technology capability. Intel Turbo Boost Technology performance varies depending on hardware, software and overall system configuration. Check with your platform manufacturer on whether your system delivers Intel Turbo Boost Technology. For more information, see http://www.intel.com/technology/turboboost
No computer system can provide absolute security. Requires an enabled Intel® processor and software optimized for use of the technology. Consult your system manufacturer and/or software vendor for more information.
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: Learn About Intel® Processor Numbers
Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel's current plan of record product roadmaps.
© 2017 Intel Corporation.
VASP*
The Vienna Ab initio Simulation Package (VASP) is a computer program for atomic scale materials modeling and performs electronic structure calculations and quantum-mechanical molecular dynamics from first principles.
Application: Vienna Ab initio Simulation Package (VASP)
Code: Available here Recipe: See configuration details. Also, check for future availability here
Value Proposition: VASP provides scientists with fast and precise calculation of materials properties. It is therefore widely used and consumes up to 25% of supercomputer time worldwide¹. Intel® Xeon Phi™ processor 7250 enables VASP to outperform alternatives for some workloads and to improve energy efficiency.
Material Sciences
LAMMPS Coarse-Grain Water Simulation
LAMMPS is a classical molecular dynamics code, and an acronym for Large-scale Atomic/Molecular Massively Parallel Simulator. It is used to simulate the movement of atoms to develop better therapeutics, improve alternative energy devices, develop new materials, and more.
Application: Coarse-grain water simulation with LAMMPS using Stillinger-Weber potential. More at http://lammps.sandia.gov/
Code: In main LAMMPS repository. Recipe: Available here
Value Proposition: Intel continues to advance the capabilities of HW and SW necessary for scientists to solve new and more complex problems that could not previously be achieved. The Intel® Xeon Phi™ processor improves power-efficient performance for scalable workloads.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance *Other names and brands may be claimed as the property of others
Life Sciences
Ice formation in water droplet wetting a flat surface with coarse-grain model in LAMMPS
Image Source: Comput. Phys. Commun., 2013, 184, 2785-2793
NWChem NWPW AIMD SIMULATION
NWChem is a computational chemistry software package which also includes quantum chemical and molecular dynamics functionality. It was designed to run on high-performance parallel supercomputers as well as conventional workstation clusters. It aims to be scalable both in its ability to treat large problems efficiently, and in its usage of available parallel computing resources.
Application: Ab-initio molecular dynamics simulation (NWChem NWPW), water128 benchmark
Code: Main NWChem repository at https://svn.pnl.gov/svn/nwchem/trunk, SVN rev 28860. Recipe: Check for availability here
Value Proposition: Intel continues to advance the capabilities of HW and SW necessary for scientists to solve new and more complex problems that could not previously be achieved. The Intel® Xeon Phi™ processor improves power-efficient performance for scalable workloads.
Life Sciences
Images Source: US Govt.; NWChem
Trinity Benchmarks - Optimized cluster (GFLOPS)
Trinity is a set of benchmark programs used as part of the joint NERSC/ACES NERSC-8/Trinity system procurement.
Code: In main NERSC website. Recipe: Check for availability here
AMG: Parallel algebraic multigrid solver for linear systems
MiniFE: Finite Element mini-app
UMT: 3D, deterministic, multigroup, photon transport code for unstructured meshes
SNAP: Proxy application to model the performance of a modern discrete ordinates neutral particle transport application
GTC: Gyrokinetic Particle Simulation of Turbulent Transport in Burning Plasmas
MILC: MIMD Lattice Computation (MILC) collaboration kernel used to study quantum chromodynamics (QCD)
MiniGhost: Finite Difference mini-app
Value Proposition: Trinity benchmarks showcase the out-of-the-box performance of the Intel® Xeon Phi™ processor on a single node and at cluster scale.
Material Sciences
The MILC Code is used to study quantum chromodynamics (QCD), the theory of the strong interactions of subatomic physics and is written by the MIMD Lattice Computation (MILC) collaboration.
Application: Trinity MILC provided by NERSC as part of Trinity8 suite
Code: Original NERSC Benchmark code is here; contact Intel for Optimized Code Recipe: Check for availability here
Value Proposition:
MILC is widely deployed on numerous supercomputers and is the 2nd most used application at US DOE’s National Energy Research Scientific Computing Center (NERSC)
Intel’s optimizations are being incorporated into mainline by MILC collaboration
MILC*
Physics - QCD
Image Credit: Brookhaven Lab (BNL)
Chroma*
The Chroma package supports data-parallel programming constructs for lattice field theory and in particular lattice QCD. It uses the SciDAC QDP++ data-parallel programming (in C++) that presents a single high-level code image to the user, but can generate highly optimized code for many architectural systems including single node workstations, multi and many-core nodes, clusters of nodes via QMP, and classic vector computers.
Application: Chroma “hmc”
Code: Available here Recipe: Check for availability here
Value Proposition:
Chroma is deployed on numerous supercomputers and is one of the most used QCD applications/research kernels.
Intel’s optimizations are incorporated into mainline Chroma. The optimizations are made available in the QPhiX library.
Physics - QCD
Image Credit: Brookhaven Lab (BNL)
OpenFOAM*
OpenFOAM (for "Open source Field Operation And Manipulation") is a C++ toolbox for the development of customized numerical solvers, and pre-/post-processing utilities for the solution of continuum mechanics problems, including computational fluid dynamics (CFD).
Application: OpenFOAM
Code: Available here Recipe: Available here
Optimizations: https://github.com/OpenFOAM/OpenFOAM-Intel
Value Proposition: Provides an extensive range of features to solve complex fluid flows involving chemical reactions, turbulence and heat transfer, acoustics, solid mechanics and electromagnetics. OpenFOAM on the Intel® Xeon Phi™ processor is well suited to computational fluid dynamics on structured grids or unstructured meshes.
Image Source: Intel
Manufacturing
CloverLeaf*
The CloverLeaf* code investigates the behavior of fluids under high temperatures and pressures, which potentially cause shock fronts to form. It is common for hydrocodes to be constructed using one of two formulations: Lagrangian, in which a mesh is constructed and evolved through time, or Eulerian, where material flow is calculated relative to a fixed spatial grid.
Application: CloverLeaf
Code: https://github.com/UK-MAC/CloverLeaf
Recipe: make COMPILER=INTEL MPI_COMPILER=mpiifort C_MPI_COMPILER=mpiicc OPTIONS="-xMIC-AVX512" C_OPTIONS="-xMIC-AVX512"
Value Proposition:
This application provides users with a research tool for investigating code modernization approaches for larger shock hydrodynamics applications.
This application now significantly outperforms (time-to-solution) alternative processing solutions with the Intel® Xeon Phi™ processor 7250.
Physics - Hydrodynamics
Image Source: Intel
Weather Research and Forecasting (WRF) Model* Numerical Weather Simulation
The WRF Model is a numerical weather prediction system designed to serve atmospheric research and operational forecasting needs. It is currently in operational use at NCEP, AFWA, NASA, NOAA, etc.
Application: The Weather Research and Forecasting (WRF) Model*, WRFV3.6.1, Conus12km. Community code is managed by NCAR. The CONUS12KM benchmark is an ad hoc industry-standard workload and is widely cited.
Code: Available here. (WRF 3.6 & 3.6.1) Recipe: Select Intel MIC configuration on build. Check for availability here
Value Proposition: The most widely-used weather forecasting code runs in its entirety on the Intel platform only. Speed up the WRF weather simulation code and results with Intel® architecture.
Climate & Weather
Image Source: NOAA
YASK HPC Stencils, iso3dfd kernel
YASK, Yet Another Stencil Kernel, is a framework to facilitate design exploration and tuning of HPC kernels. One of the stencils included in YASK is iso3dfd, a finite-difference code found in seismic imaging software used by energy-exploration companies to predict the location of oil and gas deposits.
Application: YASK, iso3dfd stencil
Code and Recipe: Available here
Value Proposition: The Intel® Xeon Phi™ processor 7250 enables this application to leverage the high-bandwidth memory and 512-bit SIMD for higher performance.
Geophysics
Image Source: US Dept. Energy
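To illustrate the kind of computation an isotropic 3D finite-difference stencil performs, here is a minimal sketch in plain Python. It is not YASK's code. The function name `iso3dfd_step`, the second-order spatial stencil (real kernels generally use wider, higher-order stencils), the fixed zero boundary, and the combined coefficient `vel2_dt2` are all assumptions chosen for brevity.

```python
def iso3dfd_step(prev, curr, vel2_dt2):
    """Advance a scalar wavefield by one time step on a cubic grid.

    prev, curr: nested lists [x][y][z] holding the field at t-1 and t.
    vel2_dt2:   hypothetical constant (velocity^2 * dt^2 / dx^2).
    Boundary cells are left at zero; only interior cells are updated.
    """
    n = len(curr)
    nxt = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for x in range(1, n - 1):
        for y in range(1, n - 1):
            for z in range(1, n - 1):
                center = curr[x][y][z]
                # 7-point Laplacian: six face neighbours minus 6x the centre.
                lap = (curr[x - 1][y][z] + curr[x + 1][y][z]
                       + curr[x][y - 1][z] + curr[x][y + 1][z]
                       + curr[x][y][z - 1] + curr[x][y][z + 1]
                       - 6.0 * center)
                # Second-order-in-time update of the wave equation.
                nxt[x][y][z] = 2.0 * center - prev[x][y][z] + vel2_dt2 * lap
    return nxt
```

Each output cell reads several neighbouring cells and writes one, which is why this class of kernel tends to be sensitive to memory bandwidth and benefits from wide SIMD.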
STAC-A2* BENCHMARK
The STAC-A2 Benchmark suite is the industry standard created by the financial community to test technology stacks used for compute-intensive analytic workloads involved in pricing and risk management
Application: Intel Composer XE STAC Pack Rev. H
Code: Available here Recipe: Available here
Value Proposition:
The Intel® Xeon Phi™ processor-based system takes up 1/8th the space (0.5U vs. 4U) of the IBM Power8*-based system
Performance enhanced by Intel® AVX-512 and MCDRAM
Image Source: Intel
“STAC” and all STAC names are trademarks or registered trademarks of the Securities Technology Analysis Center LLC.
Financial Services
Industry-standard benchmark that uses the Monte Carlo method for pricing European call options. It pre-generates random numbers, then uses them in all option pricing processes. Used by all financial firms to price derivatives with multiple dimensions. Uses the stock price, strike price and time as input streams, then creates a call output stream.
Application: Monte Carlo European Options
Code and Recipe: Available here
Value Proposition:
Foundation of financial derivatives pricing
Widely used all over financial libraries
EMU benefits transcendental functions
Monte Carlo European options benchmark*
Financial Services
Image Source: US gov.
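The Monte Carlo benchmark described on this slide pre-generates random numbers and reuses them for every option. A minimal sketch of that pattern follows. It is not the benchmark's code: the function names, the geometric Brownian motion model, and the `rate` and `vol` parameters are assumptions (the slide lists only stock price, strike price and time as inputs).

```python
import math
import random


def pregenerate_normals(n_paths, seed=0):
    """Draw the standard normal samples once, up front."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_paths)]


def mc_european_call(spot, strike, years, rate, vol, normals):
    """Price one European call by reusing the pre-generated samples.

    rate and vol are hypothetical model parameters (risk-free rate and
    volatility); the terminal price follows geometric Brownian motion.
    """
    drift = (rate - 0.5 * vol * vol) * years
    diffusion = vol * math.sqrt(years)
    payoff_sum = 0.0
    for z in normals:
        terminal = spot * math.exp(drift + diffusion * z)
        payoff_sum += max(terminal - strike, 0.0)
    # Discount the average payoff back to today.
    return math.exp(-rate * years) * payoff_sum / len(normals)


# The same sample set is shared across a stream of options.
normals = pregenerate_normals(100_000)
calls = [mc_european_call(100.0, k, 1.0, 0.05, 0.2, normals)
         for k in (90.0, 100.0, 110.0)]
```

The inner loop is dominated by `exp`, which is consistent with the slide's remark about transcendental functions.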
Black-Scholes benchmark*
Industry-standard benchmark that calculates call and put option prices using the Black-Scholes-Merton formula. Used by all financial firms to price derivatives with multiple dimensions. Stock price, strike price and time are input streams that create call and put as output streams.
Application: Black-Scholes formula
Code and Recipe: Available here
Value Proposition:
Foundation of financial derivatives pricing
Widely used all over financial libraries
Performance enhanced by Intel® Advanced Vector Extensions 512 (Intel® AVX-512) and MCDRAM
Financial Services
Image Source: US gov
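For reference, the Black-Scholes-Merton closed form that this benchmark evaluates can be sketched as below. This is a plain illustration, not the benchmark's code: the function names are hypothetical, and `rate` and `vol` (risk-free rate and volatility) are assumed to be fixed parameters, since the slide names only stock price, strike price and time as input streams.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes(spot, strike, years, rate, vol):
    """Return (call, put) prices for a European option without dividends."""
    sqrt_t = math.sqrt(years)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * years) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    discounted_strike = strike * math.exp(-rate * years)
    call = spot * norm_cdf(d1) - discounted_strike * norm_cdf(d2)
    put = discounted_strike * norm_cdf(-d2) - spot * norm_cdf(-d1)
    return call, put


call, put = black_scholes(100.0, 100.0, 1.0, 0.05, 0.2)
```

Every option is independent of the others, so a stream of inputs maps naturally onto SIMD lanes.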
Nanoscale Molecular Dynamics program*
Nanoscale Molecular Dynamics program (NAMD) is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. Based on Charm++ parallel objects, NAMD scales to hundreds of cores for typical simulations and beyond 200,000 cores for the largest simulations.
Application: NAMD 2.11
Code: http://www.ks.uiuc.edu/Research/namd/
Recipe: Check for availability here
Value Proposition: NAMD is the 2nd most popular MD code. Intel® AVX-512 instructions are used heavily by the assembler code. Source code performance tuning with intrinsics demonstrates MCDRAM and simultaneous multithreading advantages.
Life Sciences
Image Source: Use approved by NAMD
GROMACS*
GROMACS (GROningen MAchine for Chemical Simulations) is a versatile package to perform classical Molecular Dynamics simulations. It is heavily optimized for most modern platforms and provides extremely high performance compared to all other MD codes.
Application: GROMACS (Intel® AVX-512 speedup)
Code: Available here
Recipe: All optimizations merged in GROMACS 2016 branch, MKL FFT
Value Proposition: This application provides users with a wide range of functionality for chemical simulations and the highest out-of-the-box performance across all MD codes. GROMACS on the Intel® Xeon Phi™ processor outperforms Intel® Xeon® processors for simulating large biochemical systems by enabling new Intel® Advanced Vector Extensions 512 (Intel® AVX-512) features and enhanced parallelism, and it delivers more simulation performance within the same energy envelope.
Life Sciences
Images Source: Used with permission
Amber16* Generalized Born (Implicit Solvent)
Amber* is a bio-related simulation code for DNA, RNA, protein, and other bio-molecules. Amber has two solvers: Particle Mesh Ewald (PME), known as Explicit, and Generalized Born (GB), known as Implicit. Amber is written in Fortran 90 and is parallelized mainly through MPI*, OpenMP* and vectorization.
Application: Amber 16 PMEMD Implicit
Code: In main Amber GIT repository. Recipe: http://ambermd.org/intel
Value Proposition: This application provides users with a research tool for investigating code modernization approaches for bio-molecular dynamics applications.
Life Sciences
Images Source: Intel
Amber* is a bio-related simulation code for DNA, RNA, protein, and other bio-molecules. Amber has two solvers: Particle Mesh Ewald (PME), known as Explicit, and Generalized Born (GB), known as Implicit. Amber is written in Fortran 90 and is parallelized mainly through MPI*, OpenMP* and vectorization.
Application: Amber 16 PMEMD Explicit
Code: In main Amber GIT repository. Recipe: http://ambermd.org/intel
Value Proposition: This application provides users with a research tool for investigating code modernization approaches for bio-molecular dynamics applications. The Intel® Xeon Phi™ processor is best suited for larger problem sizes.
Amber16* PME (explicit solvent)
Images Source: Intel
Life Sciences
LAMMPS
BASELINE: 2S Intel® Xeon® processor E5-2697 v3, 2.6GHz, 28 cores, Intel® Turbo Boost Technology and Intel® Hyper-Threading Technology on, BIOS 86B.01.01.1008.R00, 8x8GB 2133 MHz DDR4, CentOS Linux* 7.1.1503 kernel 3.10.0-229.
NEXT GEN: 2S Intel® Xeon® processor E5-2697 v4, 2.3GHz, 36 cores, Intel® Turbo Boost Technology and Intel® Hyper-Threading Technology on, BIOS 86B0271.R00, 8x16GB 2400MHz DDR4, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.
NEW: 2S Intel® Xeon® Gold 6148 processor, 2.4GHz, 40 cores, Intel® Turbo Boost Technology and Intel® Hyper-Threading Technology on, BIOS 86B.01.00.0412.R00, 12x16GB 2666MHz DDR4, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.
BLACK SCHOLES
FSI Black-Scholes workload. OS: Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327. Testing by Intel, March 2017.
BASELINE: 2S Intel® Xeon® processor E5-2697 v3, 2.6GHz, 28 cores, turbo and HT on, BIOS 86B.0036.R05, 64GB total memory, 8x8GB 2133 MHz DDR4, Fedora release 20 kernel 3.15.10-200.
NEXT GEN: 2S Intel® Xeon® processor E5-2697 v4, 2.3GHz, 36 cores, turbo and HT on, BIOS 86B0271.R00, 128GB total memory, 8x16GB 2400 MHz DDR4 RDIMM, 1 x 1TB SATA, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.
NEW: Intel® Xeon® Gold 6148 processor @ 2.4GHz, H0QS, 40 cores 150W, QMS1, turbo and HT on, BIOS SE5C620.86B.01.00.0412.020920172159, 192GB total memory, 12x16GB 2666 MHz DDR4 RDIMM, 1 x 800GB INTEL SSD SC2BA80, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.
STAC-A2
BASELINE: SuperMicro* Superserver SYS-1028GR-TR, Dual Socket Intel® Xeon® processor E5-2697 v3 2.6 GHz , 14 Cores/Socket, 28 Cores, 56 Threads (HT and turbo on), DDR4 256GB, 1866 MHz, Red Hat 7.1, Wildcat Pass Motherboard, BIOS: American Megatrends, 2.0, 12/21/2015, 745GB SATA SSD
STAC-A2 version: STAC-A2 Pack for Intel Composer XE, Rev F
NEXT GENERATION: SuperMicro* Superserver SYS-1028GR-TR, Dual Socket Intel® Xeon® processor E5-2699 v4 2.2 GHz , 22 Cores/Socket, 44 Cores, 88 Threads (HT and turbo on), DDR4 256GB, 1866 MHz, Red Hat 7.1, Wildcat Pass Motherboard, BIOS: American Megatrends, 2.0, 12/21/2015, 745GB SATA SSD
STAC-A2 version: STAC-A2 Pack for Intel Composer XE, Rev G, STAC-A2 version: Internal (not audited)
NEW: 2S Intel® Xeon® Gold 6148 processor, 2.4GHz, 40 cores, turbo and HT on, BIOS 86B.01.00.0412.R00, 12x16GB 2666MHz DDR, Red Hat Enterprise Linux* 7.3 kernel 3.10.0-514
Iso3dfd
Version dev13 from Jan 2017. OS: CentOS 7.3. Compiler: Intel® Parallel Studio XE Cluster Edition 2017 update 2.
Runs: OpenMP only, always using the maximum number of cores. Common workload parameters for ALL runs. Better performance can be obtained with MPI+OMP or fewer threads, and by using the best parameters found with GA for every IA.
KMP_AFFINITY=compact OMP_SCHEDULE=static KMP_HW_SUBSET=Xc,1t /raid/opt/intel/share/phil/iso3dfd_for_oem/bin/iso3dfd_dev13_cpu_simd_ft_nohbm.exe 224 2125 2100 2X 50 224 48 96
Configuration Details
AMBER
Amber: Version 16 with all patches applied as of December 2016. Workloads: PME Cellulose NVE (408K atoms), PME stmv (1M atoms), GB Nucleosome (25K), GB Rubisco (75K). No cut-off was used for GB workloads. Compiled with -mic2_spdp -intelmpi -openmp, -DMIC2* defined. Tests performed in March 2017.
BASELINE: Executed with 36 MPI, 2 OpenMP. 2S Intel® Xeon® processor E5-2697 v4, 2.3GHz, 36 cores, turbo and HT on, BIOS 86B0271.R00, 8x16GB 2400MHz DDR4, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.
NEW: Executed with 40 MPI and 2 OpenMP. 2S Intel® Xeon® Gold 6148 processor, 2.4GHz, 40 cores, turbo on, HT on, BIOS 86B.01.00.0412.R00, 12x16GB 2666MHz DDR, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.
* -DMIC2 enables optimizations for AVX-512 vectorization, SPDP mixed precision, and OpenMP, but no optimizations specific to the KNL architecture.
GROMACS
GROMACS AVX2 CONFIGURATION: Version 2016.3: ftp://ftp.gromacs.org/pub/gromacs/gromacs-2016.3.tar.gz, Intel® Compiler 17.0.1.132, Intel® MPI 2017u1. Optimization Flags: "-O3 -xCORE-AVX2". CMake options: "-DGMX_FFT_LIBRARY=mkl -DGMX_SIMD=AVX2_256".
GROMACS AVX512 CONFIGURATION: Version 2016.3: ftp://ftp.gromacs.org/pub/gromacs/gromacs-2016.3.tar.gz, Intel® Compiler 17.0.1.132, Intel® MPI 2017u1. Optimization Flags: "-O3 -xCORE-AVX512". CMake options: "-DGMX_FFT_LIBRARY=mkl -DGMX_SIMD=AVX_512".
BASELINE INTEL XEON CONFIGURATION: GROMACS AVX2 binary, Dual Socket Intel® Xeon® processor E5-2697 v3 base 2.6 GHz, 14 Cores/Socket, 28 Cores, 56 Threads (HT on, Turbo on), DDR4 128GB, 2133 MHz, Red Hat 7.3.
NEXT GEN INTEL XEON CONFIGURATION: GROMACS AVX2 binary, Dual Socket Intel® Xeon® processor E5-2697 v4 2.3 GHz, 18 Cores/Socket, 36 Cores, 72 Threads (HT on, Turbo on), DDR4 128GB, 2400 MHz, Red Hat 7.2.
NEW INTEL XEON CONFIGURATION: GROMACS AVX512 binary, Dual Socket Intel® Xeon® processor Gold 6148 2.4 GHz, 20 Cores/Socket, 40 Cores, 80 Threads (HT on, Turbo on), DDR4 192GB, 2666 MT/s DDR4 RDIMMs, Red Hat 7.2.
VASP
VASP CONFIGURATION: Developer branch provided as “Package” included with download: https://github.com/vasp-dev/vasp-knl, Intel® Compiler 17.0.1.132, Intel® MPI 2017u1, ELPA 2016.05.004. Optimization Flags: "-O3 -xCORE-AVX512".
BASELINE (INTEL XEON) CONFIGURATION: 2S Intel® Xeon processor E5-2699 v3 2.3 GHz, 18 Cores/Socket, 36 Cores, 72 Threads, HT on, turbo off, 128GB total memory, 2133 MT/s / DDR4 RDIMM, Red Hat Enterprise Linux* 7.0 kernel.
NEXT GEN (INTEL XEON) CONFIGURATION: 2S Intel® Xeon® processor E5-2697 v4 2.3 GHz , 18 Cores/Socket, 36 Cores, 72 Threads, HT on, turbo off, BIOS 86B0271.R00, 128GB total memory, 2400 MT/s DDR4 RDIMM, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.
NEW (INTEL XEON) CONFIGURATION: Dual Socket Intel® Xeon® processor Gold 6148 2.4 GHz , 20 Cores/Socket, 40 Cores, 80 Threads, HT on, turbo off, BIOS 86B.01.00.0412, 192GB total memory, 2666 MT/s / DDR4 RDIMM, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.
NAMD
NAMD: Version 2.12, Dec 2016. Workloads: apoa1 (92K atoms), stmv (1M atoms). Compiled with -DNAMD_KNL* defined. Tests performed in March 2017.
BASELINE: Performance apoa1- 5.62, stmv - 0.44 ns/day. Executed with 72 charm threads. 2S Intel® Xeon® processor E5-2697 v4, 2.3GHz, 36 cores, turbo and HT on, BIOS 86B0271.R00, 8x16GB 2400MHz DDR4, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.
NEW: Performance apoa1- 8.67, stmv - 0.73 ns/day. Executed with 40 charm threads. 2S Intel® Xeon® Gold 6148 processor, 2.4GHz, 40 cores, turbo on, HT on, BIOS 86B.01.00.0412.R00, 12x16GB 2666MHz DDR, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.
* -DNAMD_KNL enables optimizations for AVX-512 vectorization, but none specific to the KNL architecture.
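The NAMD throughput figures quoted above (in ns/day) can be turned into speedup ratios with a few lines of Python. This is only a convenience sketch: the dictionary layout and the `speedup` helper are hypothetical, while the four numbers are copied from the BASELINE and NEW entries.

```python
# ns/day taken from the BASELINE and NEW configuration entries above.
NAMD_NS_PER_DAY = {
    "apoa1": {"baseline": 5.62, "new": 8.67},
    "stmv": {"baseline": 0.44, "new": 0.73},
}


def speedup(baseline, new):
    """Ratio of new to baseline throughput (higher is better)."""
    return new / baseline


for workload, perf in NAMD_NS_PER_DAY.items():
    ratio = speedup(perf["baseline"], perf["new"])
    print(f"{workload}: {ratio:.2f}x")
```

With these inputs the ratios come out to roughly 1.54x for apoa1 and 1.66x for stmv.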
WRF
BASELINE: 2S Intel® Xeon® processor E5-2697 v4, 2.3GHz, 36 cores, turbo and HT on, BIOS 86B0271.R00, 128GB total memory, 8 slots / 16GB / 2400 MT/s / DDR4 RDIMM, 1 x 1TB SATA, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.
Software: WRF version 3.6.1, compiled using Intel config option with "-O3 -fp-model fast=1 -xCORE-AVX2". Executed with 36 MPI ranks and OMP_NUM_THREADS=1.
NEW: Intel® Xeon® Gold processor 6148, 2.4GHz, 40 cores, turbo and HT on, BIOS 86B.01.00.0412, 192GB total memory, 12 slots / 16 GB / 2666 MT/s / DDR4 RDIMM, 1 x 800GB INTEL SSD SC2BA80, Red Hat Enterprise Linux* 7.2 kernel 3.10.0-327.
Software: WRF version 3.6.1, compiled using Intel config option with "-O3 -fp-model fast=1 -xCORE-AVX512". Executed with 40 MPI ranks and OMP_NUM_THREADS=1.
Configuration Details
Configuration Details
FSI
P100: 2S Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz, 56 cores, turbo and HT on, BIOS 86B0271.R00, 16x16GB 2133MHz RDIMM, Red Hat Enterprise Linux Server release 7.3 (Maipo) kernel 3.10.0-514.6.1.el7.x86_64, CUDA SDK 8.0
VASP
NVIDIA* CONFIGURATION: Dual Socket Intel® Xeon® processor E5-2697 v4 2.3 GHz, 18 Cores/Socket, 36 Cores, 72 Threads (HT and turbo on), DDR4 128GB, 2400 MHz, Red Hat 7.2, Super Micro* SuperServer 1028GR-TR, BIOS Version 2.0a, Super Micro* X10DRG-H Motherboard, CSE-118GHTS-R1K66BP FRU, 500GB SATA Seagate* ST9500423AS System Disk, NVIDIA Tesla* P100 GPU, NVIDIA CUDA* 8.0 (375.20).
LAMMPS
LAMMPS Version: 13 Oct 2016, Parallel Studio 2016 update 3, CUDA Driver: 367.48, CUDA Version: 8.0, OS: RHEL 7.2 Kernel 3.10.0-327. Host Processor: E5-2697v4, Turbo Enabled, HT Enabled. Host Memory: 8x16GB 2400 MHz DDR4. GPU: Tesla P100-PCIE-16GB, Boost Enabled. CUDA MPS: The best run was obtained by varying the number of MPI tasks on the host from 1 to 36 and using CUDA MPS with fewer than 17 MPI tasks.
GROMACS
BDW+1xNVIDIA P100: 2S Intel® Xeon® processor E5-2697 v4 (Turbo On) + 1xNVIDIA Tesla P100, Default Configuration, CUDA 8.0
Cost and power used in charts are based on estimated node cost and estimated node wall power during operation. All application performance is measured internally by Intel. GPU performance was measured with a single P100 per node. When comparing against 2 P100s per node, this assessment assumes linear scaling, with the 2 GPU node having twice the application performance of a measured 1 GPU node result.
Configuration                                    Estimated System Power    Estimated System Cost
1s Intel® Xeon Phi™ Processor 7250               392 W                     $7250
2s Intel® Xeon® Processor Gold 6148              498 W                     $10000
2s Intel® Xeon® Processor Gold 6148 + 2 P100s    1102 W                    $21000
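The power and cost estimates above feed performance-per-watt and performance-per-dollar comparisons. A minimal sketch of that arithmetic follows, including the stated linear-scaling assumption for two GPUs. The dictionary keys, the function names and any performance value passed in are hypothetical; only the watt and dollar figures come from the table.

```python
# Estimated node power (W) and cost (USD) copied from the table above.
NODES = {
    "xeon_phi_7250": {"watts": 392, "usd": 7250},
    "gold_6148": {"watts": 498, "usd": 10000},
    "gold_6148_2xp100": {"watts": 1102, "usd": 21000},
}


def two_gpu_estimate(one_gpu_perf):
    """Linear-scaling assumption: two P100s give twice the 1-GPU result."""
    return 2.0 * one_gpu_perf


def perf_per_watt(perf, node):
    """Application performance divided by estimated node wall power."""
    return perf / NODES[node]["watts"]


def perf_per_kusd(perf, node):
    """Application performance per 1000 USD of estimated node cost."""
    return perf / (NODES[node]["usd"] / 1000.0)
```

For example, a measured single-GPU result `p` would be compared as `perf_per_watt(two_gpu_estimate(p), "gold_6148_2xp100")` against the CPU-only nodes.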