
Visualizing and Finding Optimization Opportunities with Intel® Advisor Roofline feature

Intel Software Developer Conference – Frankfurt, 2017

Klaus-Dieter Oertel, Intel

© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice.

Agenda

• Intel® Advisor for vectorization optimization

• What is the theoretical roofline model?

• How is it implemented in Advisor?

• Some examples

• Resources

Optimization Notice

Intel® Xeon® Processor generations:

Processor    64-bit  5100 series  5500 series  5600 series  E5-2600  E5-2600 V2  E5-2600 V3  E5-2600 V4  Platinum
Core(s)      1       2            4            6            8        12          18          22          28
Threads      2       2            8            12           16       24          36          44          56
SIMD Width   128     128          128          128          256      256         256         256         512

Intel® Xeon Phi™ processor: Knights Landing

*Product specification for launched and shipped products available on ark.intel.com.

Changing Hardware Impacts Software: More Cores, More Threads, Wider Vectors

High performance software must be both:

• Parallel (multi-thread, multi-process)

• Vectorized


Vectorize and Thread for Dramatic Performance Gains

Together they are more effective than either one alone

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. Configurations are listed at the end of this presentation.

[Figure: performance of Serial, Vectorized, Threaded, and Vectorized & Threaded code across Intel® Xeon® processor generations: 2007 X5472 (Harpertown), 2009 X5570 (Nehalem), 2010 X5680 (Westmere), 2012 E5-2600 (Sandy Bridge), 2013 E5-2600 v2 (Ivy Bridge), 2014 E5-2600 v3 (Haswell), 2016 E5-2600 v4 (Broadwell)]

The Difference Is Growing With Each New Generation of Hardware

“Automatic” Vectorization Not Enough

Explicit pragmas and optimization often required


Intel® Advisor – Vectorization Advisor

Get breakthrough vectorization performance

Faster Vectorization Optimization:

• Vectorize where it will pay off most

• Quickly ID what is blocking vectorization

• Tips for effective vectorization

• Safely force compiler vectorization

• Optimize memory stride

The data and guidance you need:

• Compiler diagnostics + Performance Data + SIMD efficiency

• Detect problems & recommend fixes

• Loop-Carried Dependency Analysis

• Memory Access Patterns Analysis

Optimize for AVX-512 with or without access to AVX-512 hardware

http://intel.ly/advisor-xe

Part of Intel® Parallel Studio XE


The Right Data At Your Fingertips

Get all the data you need for high impact vectorization

Filter by which loops are vectorized!

Focus on hot loops

What vectorization issues do I have?

How efficient is the code?

What prevents vectorization?

Which Vector instructions are being used?

Trip Counts

Get Faster Code Faster!

Find Effective Optimization Strategies

Intel Advisor: Cache-aware roofline analysis

Roofs Show Platform Limits

Memory, cache & compute limits

Dots Are Loops

Bigger, red dots take more time so optimization has a bigger impact

Dots farther from a roof have more room for improvement

Higher Dot = Higher GFLOPs/sec

Optimization moves dots up

Algorithmic changes move dots horizontally

Which loops should we optimize?

• A and G are the best candidates

• B has room to improve, but will have less impact

• E, C, D, and H are poor candidates

Roofline tutorial video


What is the roofline model?

Do you know how fast you should run?

• Comes from Berkeley

• Performance is limited by equations/implementation & code generation/hardware

• 2 hardware limitations

• PEAK Flops

• PEAK Bandwidth

• The application performance is bounded by hardware specifications

$$\text{Gflop/s} = \min\left(\text{Platform PEAK},\ \text{Platform BW} \times \text{AI}\right)$$

where AI is the Arithmetic Intensity (Flops/Bytes).
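The bound above can be sketched as a small C helper. This is only an illustration: the function name `roofline_bound` and its parameters are assumptions made for this example and do not belong to any Intel tool.

```c
#include <assert.h>

/* Attainable performance (Gflop/s) under the roofline model.
 *   peak_gflops: platform peak compute rate in Gflop/s
 *   peak_bw_gbs: platform peak memory bandwidth in GB/s
 *   ai:          arithmetic intensity of the kernel in flop/byte
 * Name and parameters are illustrative only. */
double roofline_bound(double peak_gflops, double peak_bw_gbs, double ai)
{
    assert(peak_gflops > 0.0 && peak_bw_gbs > 0.0 && ai >= 0.0);
    double memory_roof = peak_bw_gbs * ai;   /* bandwidth-limited rate */
    return memory_roof < peak_gflops ? memory_roof : peak_gflops;
}
```

For a hypothetical platform with 1000 Gflop/s peak and 100 GB/s bandwidth, a kernel with AI = 1 is capped by the memory roof, while a kernel with AI = 50 is capped by the compute peak.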

Platform PEAK Flops

How many floating point operations per second

• Theoretical value can be computed by specification. Example with 2 sockets Intel® Xeon® Processor E5-2697 v2:

PEAK FLOP = 2 x 2.7 x 12 x 8 x 2 = 1036.8 Gflop/s

The factors are, in order: number of sockets (2), core frequency (2.7 GHz), number of cores per socket (12), number of single precision elements in a SIMD register (8), and floating-point ports (2: 1 port for addition, 1 for multiplication).

• More realistic value can be obtained by running Linpack: ~930 Gflop/s on a 2 sockets Intel® Xeon® Processor E5-2697 v2
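The product of factors can be written as a one-line C function. This is a sketch under the slide's assumptions; the name `peak_gflops` and the parameter names are chosen for this example only.

```c
#include <assert.h>

/* Theoretical peak in Gflop/s:
 *   sockets x core frequency (GHz) x cores per socket
 *   x SIMD lanes x floating-point ports.
 * All names are illustrative. */
double peak_gflops(int sockets, double core_ghz, int cores_per_socket,
                   int simd_lanes, int fp_ports)
{
    assert(sockets > 0 && cores_per_socket > 0);
    assert(simd_lanes > 0 && fp_ports > 0 && core_ghz > 0.0);
    return sockets * core_ghz * cores_per_socket * simd_lanes * fp_ports;
}
```

Calling `peak_gflops(2, 2.7, 12, 8, 2)` reproduces the 1036.8 Gflop/s figure for the 2 sockets E5-2697 v2 example.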

Platform PEAK bandwidth

How many bytes can be transferred per second

• Theoretical value can be computed by specification. Example with 2 sockets Intel® Xeon® Processor E5-2697 v2:

PEAK BW = 2 x 1.866 x 8 x 4 = 119 GB/s

The factors are, in order: number of sockets (2), memory frequency (1.866 GHz), bytes per channel (8), and number of memory channels (4).

• More realistic value can be obtained by running Stream: ~100 GB/s on a 2 sockets Intel® Xeon® Processor E5-2697 v2
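The same arithmetic for bandwidth, as a minimal C sketch. The function name `peak_bandwidth_gbs` and its parameters are assumptions for illustration.

```c
#include <assert.h>

/* Theoretical peak memory bandwidth in GB/s:
 *   sockets x memory frequency (GHz) x bytes per channel x channels.
 * All names are illustrative. */
double peak_bandwidth_gbs(int sockets, double mem_ghz,
                          int bytes_per_channel, int channels)
{
    assert(sockets > 0 && bytes_per_channel > 0 && channels > 0);
    assert(mem_ghz > 0.0);
    return sockets * mem_ghz * bytes_per_channel * channels;
}
```

Calling `peak_bandwidth_gbs(2, 1.866, 8, 4)` gives about 119.4 GB/s, which the slide rounds to 119 GB/s.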

Drawing the Roofline

Defining the speed of light

2 sockets Intel® Xeon® Processor E5-2697 v2: Peak Flop = 1036 Gflop/s, Peak BW = 119 GB/s

[Figure: roofline plot of Gflops/s versus AI [Flop/B], built up in three steps; the bandwidth roof meets the compute roof at AI = 8.7]
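The value 8.7 on the AI axis is the point where the bandwidth roof reaches the compute roof. As a worked equation, using the two peak values quoted for this platform (the subscript "ridge" is a label introduced here):

```latex
\[
\text{Platform BW} \times \text{AI}_{\text{ridge}} = \text{Platform PEAK}
\quad\Longrightarrow\quad
\text{AI}_{\text{ridge}} = \frac{1036\ \text{Gflop/s}}{119\ \text{GB/s}} \approx 8.7\ \text{Flop/B}
\]
```

Kernels with AI below this value are limited by bandwidth, kernels above it by peak compute.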

What is the performance boundary?

Manual way to do it

• Manual counting on matrix/matrix multiplication

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];

• # add = N * N * N, # Read = 3 * N * N * 4 bytes

• # mul = N * N * N, # Write = N * N * 4 bytes

• $AI = \frac{2N^3}{16N^2} = \frac{N}{8}$

Compute the maximum performance

BW * Arithmetic Intensity

For sgemm, AI = N/8. If N = 8, AI = 1.

If N = 8, sgemm should not be able to perform better than 119 GFlop/s on a 2 sockets Ivy Bridge.

[Figure: roofline plot of Gflops/s versus AI [Flop/B], with the ridge point at AI = 8.7]
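The manual count for sgemm can be checked with a short C sketch. The function names are hypothetical, and the counting convention is the one from the slide: each matrix element is counted once.

```c
#include <assert.h>

/* Theoretical arithmetic intensity of single-precision N x N matrix
 * multiplication, counting each matrix element once:
 *   flops = N^3 adds + N^3 muls
 *   bytes = 3*N*N*4 read + N*N*4 written */
double sgemm_theoretical_ai(int n)
{
    assert(n > 0);
    double flops = 2.0 * n * n * n;
    double bytes = (3.0 * n * n + 1.0 * n * n) * 4.0;
    return flops / bytes;               /* simplifies to n / 8 */
}

/* Upper bound in Gflop/s: BW * AI, capped by the compute peak. */
double sgemm_bound_gflops(int n, double peak_gflops, double peak_bw_gbs)
{
    double bound = peak_bw_gbs * sgemm_theoretical_ai(n);
    return bound < peak_gflops ? bound : peak_gflops;
}
```

With N = 8, a peak of 1036.8 Gflop/s and 119 GB/s, the bound is 119 Gflop/s. For large N the theoretical AI exceeds the ridge point and the bound becomes the compute peak.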

And NOW?

How to get better performance?

[Figure: roofline plot (Gflops/s) annotated with two optimization directions: optimize memory access, and vectorization + threading]


Roofline in Intel® Advisor

The cache aware roofline model

Intel® Advisor implements a Cache Aware Roofline Model (CARM)

- “Algorithmic”, “Cumulative (L1+L2+LLC+DRAM)” traffic-based

- Invariant for the given code / platform combination

How does it work?

- Counts every memory movement

- Bytes and Flops -> Instrumentation

- Time -> Sampling

CARM: Cache aware Roofline Model

DRAM: DRAM aware Roofline Model

TRAM: Theoretical Roofline Model

Typically AI_CARM < AI_DRAM < AI_TRAM
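To see why counting every memory movement lowers the arithmetic intensity, here is a hand-instrumented version of the matrix multiplication. It only mimics the idea of instrumentation: the type `traffic_t` and both function names are invented for this sketch, and this is not how Intel Advisor is implemented. Accesses are counted as written in the source statement, which a real compiler may optimize differently.

```c
#include <assert.h>
#include <stddef.h>

/* Counters filled in by the hand-instrumented kernel below
 * (illustrative type, not an Advisor data structure). */
typedef struct {
    unsigned long long flops;
    unsigned long long bytes;
} traffic_t;

/* n x n single-precision matrix multiply, row-major, c += a * b.
 * Every load and store in the source statement is counted, whether it
 * would be served from a register, a cache or DRAM. */
traffic_t sgemm_counted(const float *a, const float *b, float *c, size_t n)
{
    traffic_t t = {0, 0};
    assert(a && b && c);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++) {
                c[i*n + j] = c[i*n + j] + a[i*n + k] * b[k*n + j];
                t.flops += 2;                   /* one add, one multiply */
                t.bytes += 4 * sizeof(float);   /* 3 loads + 1 store */
            }
    return t;
}

/* Flops per byte from the collected counters. */
double arithmetic_intensity(traffic_t t)
{
    return t.bytes ? (double)t.flops / (double)t.bytes : 0.0;
}
```

Counted this way, the kernel performs 2 flops per 16 bytes, so the intensity is 0.125 flop/byte for any N, well below the theoretical N/8 obtained when each matrix element is counted once.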


Understanding the roofline in Intel® Advisor

[Figure: Advisor roofline chart with two annotated regions: purely cache/DRAM-bound and purely compute bound]


Roofline model and compiler optimizations

Roofline model and optimizations

• Matrix/matrix addition

• Let’s have a look at the roofline model

    void addition(float* a, float* b, float* c, int size){
        int i, j;
        for(j=0; j<size; j++){
            for(i=0; i<size; i++){
                c[i*size + j] = a[i*size + j]+b[i*size + j];
            }
        }
    }

• Compilation with -O1

Very poor performance, far from the DRAM roofline!

• Let's look at the Memory Access Pattern Analysis

Constant stride found: the inner loop over i steps through memory with a stride of size elements, so the loops should be interchanged.
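A sketch of the interchanged version follows. The function name `addition_interchanged` is chosen for this example; only the loop order differs from the routine shown earlier.

```c
#include <assert.h>

/* Same element-wise matrix addition, with the loops interchanged so that
 * the inner loop runs over j and touches consecutive addresses
 * (unit stride) instead of jumping by `size` elements. */
void addition_interchanged(const float *a, const float *b, float *c, int size)
{
    assert(a && b && c && size >= 0);
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            c[i*size + j] = a[i*size + j] + b[i*size + j];
        }
    }
}
```

The result is identical to the original routine; only the memory access order changes, which is what makes the loop friendlier to caches and to the vectorizer.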

• Compilation with -O3

Vectorization of Loop carried dependency

Vectorization of loop carried dependency

• Loop carried dependency

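    int i, j;
    /* i is assumed to be set by an enclosing loop over rows (not shown) */
    for(j=pad; j<size; j++){
        c[i*size + j] = a[i*size + j]+c[i*size + j-pad];
    }

With the OpenMP SIMD pragma:

    int i, j;
    /* i is assumed to be set by an enclosing loop over rows (not shown) */
    #pragma omp simd safelen(4)
    for(j=pad; j<size; j++){
        c[i*size + j] = a[i*size + j]+c[i*size + j-pad];
    }

In this case, we assume that pad >= 4; safelen was set to 4.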
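A self-contained sketch of the same pattern, with the row loop written out. The function name `row_recurrence` is hypothetical, and the constant `PAD` is set to 8 here, an arbitrary value that satisfies the assumption pad >= 4. Without OpenMP support the pragma is ignored and the loop runs serially with the same result.

```c
#include <assert.h>

#define PAD 8   /* dependency distance, chosen larger than the safelen value */

/* Each element depends on the element PAD positions earlier in the same
 * row. safelen(4) tells the compiler that SIMD chunks of up to 4
 * consecutive iterations are safe to execute together. */
void row_recurrence(const float *a, float *c, int size)
{
    assert(a && c && size >= 0);
    for (int i = 0; i < size; i++) {
        #pragma omp simd safelen(4)
        for (int j = PAD; j < size; j++) {
            c[i*size + j] = a[i*size + j] + c[i*size + j - PAD];
        }
    }
}
```

With GCC or Clang the pragma typically takes effect when compiling with `-fopenmp` or `-fopenmp-simd`; the serial and vectorized versions must produce the same values as long as the dependency distance is not smaller than the vector length in use.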

Vectorization of function call

Vectorization of a function call with OMP

• A function call inside a loop can prevent vectorization

    for(int i=0; i<SIZE; i++){
        for(int j=0; j<SIZE; j++){
            single_line_addition(a, c, i*SIZE + j);
        }
    }

    //function is defined in another compilation unit
    void single_line_addition(float* a, float* c, int ind){
        c[ind] = a[ind]+c[ind];
    }

Advisor tells you that this pattern can be a problem and proposes a solution

• omp declare simd

    for(int i=0; i<SIZE; i++){
        #pragma omp simd
        for(int j=0; j<SIZE; j++){
            single_line_addition(a, c, i*SIZE + j);
        }
    }

    #pragma omp declare simd uniform(a, c) linear(ind)
    void single_line_addition(float* a, float* c, int ind);
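For reference, a single-file sketch of the same pattern. In the slides the function lives in another compilation unit; here it is defined in the same file only to keep the example self-contained. `SIZE` is given an arbitrary small value and the wrapper `add_all` is a name invented for this example.

```c
#include <assert.h>

#define SIZE 8   /* illustrative matrix dimension */

/* declare simd asks the compiler to also generate a vector variant of the
 * function: a and c are the same for all SIMD lanes (uniform), and ind
 * increases by 1 from one lane to the next (linear). */
#pragma omp declare simd uniform(a, c) linear(ind)
void single_line_addition(const float *a, float *c, int ind)
{
    c[ind] = a[ind] + c[ind];
}

/* Adds a to c element by element, calling the function from a SIMD loop. */
void add_all(const float *a, float *c)
{
    assert(a && c);
    for (int i = 0; i < SIZE; i++) {
        #pragma omp simd
        for (int j = 0; j < SIZE; j++) {
            single_line_addition(a, c, i*SIZE + j);
        }
    }
}
```

When the definition is visible in the same file the compiler may simply inline the call; the declaration matters most in the situation shown on the slide, where the caller only sees a prototype.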



References

Roofline model proposed by Williams, Waterman, Patterson: http://www.eecs.berkeley.edu/~waterman/papers/roofline.pdf

“Cache-aware Roofline model: Upgrading the loft” (Ilic, Pratas, Sousa, INESC-ID/IST, Technical University of Lisbon): http://www.inesc-id.pt/ficheiros/publicacoes/9068.pdf


Resources

Intel® Advisor – Threading Design & Prototyping:

Product page – overview, features, FAQs, support…

Training materials – movies, tech briefs, documentation…

Evaluation guides – step by step walk through

Reviews

Additional Analysis Tools:

Intel® VTune Amplifier – performance profiler

Intel® Inspector - memory and thread checker / debugger

Additional Development Products:

Intel® Software Development Products

Intel® Distribution for Python* – accelerated Python distribution

Download a free, 30-day trial of Intel® Parallel Studio XE 2018 today: software.intel.com/en-us/intel-parallel-studio-xe

And Don’t Forget…

Code that performs and outperforms

To fill out the evaluation survey via a URL that will be provided at the end of the day

Watch your email for a link to the survey

P.S. Everyone who fills out the survey will receive a personalized certificate indicating completion of the training!


Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Advisor using GCC, Microsoft or Intel Compiler:

Finds un-vectorized loops

Analyze SIMD, AVX, AVX2, AVX-512

Dependency Analysis – safely force vectorization with a pragma

Memory Access Pattern Analysis – optimize stride and caching

Trip Counts

FLOPS metrics with masking

Roofline Analysis – balance memory vs. compute optimization

Intel Compiler Adds:

Usually better optimized vectorization

Better compiler optimization messages

Intel Advisor with Intel Compiler Adds:

Finds inefficiently vectorized loops and estimates performance gain

Compiler optimization report messages displayed on the source

More tips for improving vectorization

Optimize for AVX-512 even without AVX-512 hardware

Advisor works with GCC and Microsoft Compilers

Adds bonus capabilities with the Intel Compiler


Configurations for 2007-2016 Benchmarks

Platform Hardware and Software Configuration

Hardware:

Platform                  Unscaled Core Frequency  Cores/Socket  Num Sockets  L1 Data Cache  L2 Cache  L3 Cache  Memory  Memory Frequency  Memory Access
Intel® Xeon™ 5472         3.0 GHz                  4             2            32K            6 MB      None      32 GB   800 MHz           UMA
Intel® Xeon™ X5570        2.9 GHz                  4             2            32K            256K      8 MB      48 GB   1333 MHz          NUMA
Intel® Xeon™ X5680        3.33 GHz                 6             2            32K            256K      12 MB     48 MB   1333 MHz          NUMA
Intel® Xeon™ E5-2690      2.9 GHz                  8             2            32K            256K      20 MB     64 GB   1600 MHz          NUMA
Intel® Xeon™ E5-2697 v2   2.7 GHz                  12            2            32K            256K      30 MB     64 GB   1867 MHz          NUMA
Intel® Xeon™ E5-2600 v3   2.2 GHz                  18            2            32K            256K      46 MB     128 GB  2133 MHz          NUMA
(unnamed)                 2.3 GHz                  18            2            32K            256K      46 MB     256 GB  2400 MHz          NUMA
(unnamed)                 2.2 GHz                  22            2            32K            256K      56 MB     128 GB  2133 MHz          NUMA

Settings and software (rows in the same order):

Platform                  H/W Prefetchers Enabled  HT Enabled  Turbo Enabled  C States  O/S Name    Operating System       Compiler Version
Intel® Xeon™ 5472         Y                        N           N              Disabled  Fedora 20   3.11.10-301.fc20       icc version 14.0.1
Intel® Xeon™ X5570        Y                        Y           Y              Disabled  Fedora 20   3.11.10-301.fc20       icc version 14.0.1
Intel® Xeon™ X5680        Y                        Y           Y              Disabled  Fedora 20   3.11.10-301.fc20       icc version 14.0.1
Intel® Xeon™ E5-2690      Y                        Y           Y              Disabled  Fedora 20   3.11.10-301.fc20       icc version 14.0.1
Intel® Xeon™ E5-2697 v2   Y                        Y           Y              Disabled  RHEL 7.1    3.10.0-229.el7.x86_64  icc version 14.0.1
Intel® Xeon™ E5-2600 v3   Y                        Y           Y              Disabled  Fedora 20   3.13.5-202.fc20        icc version 14.0.1
(unnamed)                 Y                        Y           Y              Disabled  RHEL 7.0    3.10.0-123.el7.x86_64  icc version 14.0.1
(unnamed)                 Y                        Y           Y              Disabled  CentOS 7.2  3.10.0-327.el7.x86_64  icc version 14.0.1
