Visualizing and Finding Optimization Opportunities with Intel® Advisor Roofline feature

Intel Software Developer Conference – Frankfurt, 2017

Klaus-Dieter Oertel, Intel


© 2017 Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. For more complete information about compiler optimizations, see our Optimization Notice.

Agenda

• Intel® Advisor for vectorization optimization

• What is the theoretical roofline model?

• How is it implemented in Advisor?

• Some examples

• Resources


Changing Hardware Impacts Software

More Cores, More Threads, Wider Vectors

Intel® Xeon® Processor        Cores  Threads  SIMD width (bits)
64-bit                            1        2                128
5100 series                       2        2                128
5500 series                       4        8                128
5600 series                       6       12                128
E5-2600                           8       16                256
E5-2600 v2                       12       24                256
E5-2600 v3                       18       36                256
E5-2600 v4                       22       44                256
Platinum 8180                    28       56                512

Intel® Xeon Phi™ processor    Cores  Threads  SIMD width (bits)
Knights Landing                  72      288                512

*Product specification for launched and shipped products available on ark.intel.com.

High performance software must be both:

• Parallel (multi-thread, multi-process)

• Vectorized


Vectorize and Thread for Dramatic Performance Gains

Together they are more effective than either one alone

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance Configurations at the end of this presentation.

[Chart: performance by Intel® Xeon® processor generation, with four series: Serial, Vectorized, Threaded, and Vectorized & Threaded. The chart carries the annotation "187x".]

Processors shown (year, model, codename):

• 2007: X5472 (Harpertown)
• 2009: X5570 (Nehalem)
• 2010: X5680 (Westmere)
• 2012: E5-2600 (Sandy Bridge)
• 2013: E5-2600 v2 (Ivy Bridge)
• 2014: E5-2600 v3 (Haswell)
• 2016: E5-2600 v4 (Broadwell)

The difference is growing with each new generation of hardware.

“Automatic” Vectorization Not Enough

Explicit pragmas and optimization often required


Intel® Advisor – Vectorization Advisor

Get breakthrough vectorization performance

Faster Vectorization Optimization:

• Vectorize where it will pay off most
• Quickly ID what is blocking vectorization
• Tips for effective vectorization
• Safely force compiler vectorization
• Optimize memory stride

The data and guidance you need:

• Compiler diagnostics + Performance Data + SIMD efficiency
• Detect problems & recommend fixes
• Loop-Carried Dependency Analysis
• Memory Access Patterns Analysis

Optimize for AVX-512 with or without access to AVX-512 hardware

http://intel.ly/advisor-xe (part of Intel® Parallel Studio XE)


The Right Data At Your Fingertips

Get all the data you need for high impact vectorization

Filter by which loops are vectorized!

Focus on hot loops

What vectorization issues do I have?

How efficient is the code?

What prevents vectorization?

Which Vector instructions are being used?

Trip Counts

Get Faster Code Faster!


Find Effective Optimization Strategies

Intel Advisor: Cache-aware roofline analysis

Roofs Show Platform Limits

• Memory, cache & compute limits

Dots Are Loops

• Bigger, red dots take more time so optimization has a bigger impact
• Dots farther from a roof have more room for improvement

Higher Dot = Higher GFLOPs/sec

• Optimization moves dots up
• Algorithmic changes move dots horizontally

Which loops should we optimize?

• A and G are the best candidates
• B has room to improve, but will have less impact
• E, C, D, and H are poor candidates

Roofline tutorial video


What is the roofline model?

Do you know how fast you should run?

• Comes from Berkeley

• Performance is limited by equations/implementation & code generation/hardware

• 2 hardware limitations:
    • PEAK Flops
    • PEAK Bandwidth

• The application performance is bounded by hardware specifications

$$\text{Gflop/s} = \min\left(\text{Platform PEAK},\ \text{Platform BW} \times \text{AI}\right)$$

where AI is the Arithmetic Intensity (Flops/Bytes).
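The bound above can be sketched as a small C helper. This is an illustration rather than code from the slides, and the name `roofline_gflops` and its signature are assumptions made for the example.

```c
/* Roofline bound: attainable Gflop/s = min(peak, bandwidth * AI).
 * peak_gflops : platform peak compute, in Gflop/s
 * bw_gbytes   : platform peak memory bandwidth, in GB/s
 * ai          : arithmetic intensity of the kernel, in Flop/Byte
 * The function name and signature are illustrative assumptions. */
double roofline_gflops(double peak_gflops, double bw_gbytes, double ai)
{
    double memory_bound = bw_gbytes * ai;  /* Gflop/s the memory system can feed */
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

For instance, with a peak of 1000 Gflop/s and a bandwidth of 100 GB/s, a kernel with AI = 2 Flop/B is limited to 200 Gflop/s by memory, while a kernel with AI = 20 Flop/B is limited by the 1000 Gflop/s compute roof.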


Platform PEAK Flops

How many floating point operations per second

• Theoretical value can be computed from the specification. Example with 2 sockets Intel® Xeon® Processor E5-2697 v2:

PEAK FLOP = 2 x 2.7 x 12 x 8 x 2 = 1036.8 Gflop/s

    2     Number of sockets
    2.7   Core frequency (GHz)
    12    Number of cores (per socket)
    8     Number of single-precision elements in a SIMD register
    2     1 port for addition, 1 for multiplication

• A more realistic value can be obtained by running Linpack: ≈ 930 Gflop/s on a 2-socket Intel® Xeon® Processor E5-2697 v2
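The peak-Flops product can be written as a one-line function. This is a sketch only: the function and parameter names are hypothetical, and it assumes every core can issue one SIMD add and one SIMD multiply per cycle, as on the slide.

```c
/* Theoretical peak in Gflop/s:
 * sockets * frequency (GHz) * cores per socket * SIMD lanes * FP ops per lane per cycle.
 * All names are illustrative assumptions, not part of any Intel tool. */
double peak_gflops(int sockets, double ghz, int cores_per_socket,
                   int simd_lanes, int ops_per_cycle)
{
    return sockets * ghz * cores_per_socket * simd_lanes * ops_per_cycle;
}
```

Calling `peak_gflops(2, 2.7, 12, 8, 2)` reproduces the 1036.8 Gflop/s of the example, up to floating-point rounding.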


Platform PEAK bandwidth

How many bytes can be transferred per second

• Theoretical value can be computed from the specification. Example with 2 sockets Intel® Xeon® Processor E5-2697 v2:

PEAK BW = 2 x 1.866 x 8 x 4 = 119 GB/s

    2      Number of sockets
    1.866  Memory frequency (GT/s)
    8      Bytes per channel
    4      Number of memory channels (per socket)

• A more realistic value can be obtained by running Stream: ≈ 100 GB/s on a 2-socket Intel® Xeon® Processor E5-2697 v2
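The same arithmetic for bandwidth, as a minimal C sketch. The function and parameter names are hypothetical and chosen only to mirror the four factors listed on the slide.

```c
/* Theoretical peak memory bandwidth in GB/s:
 * sockets * memory transfer rate (GT/s) * bytes per channel per transfer * channels per socket.
 * All names are illustrative assumptions. */
double peak_bandwidth_gbs(int sockets, double mem_gtps,
                          int bytes_per_channel, int channels_per_socket)
{
    return sockets * mem_gtps * bytes_per_channel * channels_per_socket;
}
```

Calling `peak_bandwidth_gbs(2, 1.866, 8, 4)` gives about 119.4 GB/s, which the slide rounds to 119 GB/s.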


Drawing the Roofline

Defining the speed of light

2 sockets Intel® Xeon® Processor E5-2697 v2: Peak Flop = 1036 Gflop/s, Peak BW = 119 GB/s

[Figure: roofline plot, built up over three slides. x-axis: AI [Flop/B]; y-axis: Gflop/s. The compute roof is at 1036 Gflop/s, and the value 8.7 is marked on the AI axis.]
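The value 8.7 on the AI axis can be recovered as the point where the bandwidth roof meets the compute roof. A short derivation, writing $AI_{\text{ridge}}$ for that intersection (a symbol introduced here, not used on the slides):

```latex
\text{Platform BW} \times AI_{\text{ridge}} = \text{Platform PEAK}
\quad\Longrightarrow\quad
AI_{\text{ridge}} = \frac{1036\ \text{Gflop/s}}{119\ \text{GB/s}} \approx 8.7\ \text{Flop/B}
```

Kernels with an arithmetic intensity below this value are bounded by memory bandwidth; kernels above it are bounded by peak compute.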


What is the performance boundary?

Manual way to do it

• Manual counting on matrix/matrix multiplication

    for(i=0; i<N; i++)
        for(j=0; j<N; j++)
            for(k=0; k<N; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];

• # add = N * N * N, # mul = N * N * N

• # Read = 3 * N * N * 4 bytes, # Write = N * N * 4 bytes

$$AI = \frac{2N^3}{16N^2} = \frac{1}{8}N$$
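The manual count can be checked with a few lines of C. This is a sketch that simply encodes the slide's counts (N³ adds, N³ multiplies, 3N² reads and N² writes of 4-byte floats); the function name `sgemm_ai` is hypothetical.

```c
/* Arithmetic intensity of the naive N x N single-precision matrix multiply,
 * using the manual counts from the slide. The function name is an
 * illustrative assumption. */
double sgemm_ai(int n)
{
    double flops         = 2.0 * n * n * n;   /* N^3 additions + N^3 multiplications */
    double bytes_read    = 3.0 * n * n * 4.0; /* 3 * N * N floats of 4 bytes */
    double bytes_written = 1.0 * n * n * 4.0; /* N * N floats of 4 bytes */
    return flops / (bytes_read + bytes_written);
}
```

With these counts, `sgemm_ai(8)` evaluates to 1.0 and `sgemm_ai(16)` to 2.0, matching AI = N/8.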


Compute the maximum performance

BW * Arithmetic Intensity

2 sockets Intel® Xeon® Processor E5-2697 v2: Peak Flop = 1036 Gflop/s, Peak BW = 119 GB/s

For sgemm, AI = 1/8 N. If N = 8, AI = 1.

[Figure: roofline plot. x-axis: AI [Flop/B], with 1 and 8.7 marked; y-axis: Gflop/s, with 119 and 1036 marked.]

If N = 8, sgemm should not be able to perform better than 119 Gflop/s on a 2-socket Ivy Bridge system.


And NOW?

How to get better performance?

[Figure: the same roofline plot (AI axis with 1 and 8.7 marked; Gflop/s axis with 119 and 1036 marked), annotated with:]

• Optimize memory access

• Vectorization + threading


Roofline in Intel® Advisor

The cache-aware roofline model

Intel® Advisor implements a Cache-Aware Roofline Model (CARM)

- “Algorithmic”, “Cumulative (L1+L2+LLC+DRAM)” traffic-based

- Invariant for the given code / platform combination

How does it work?

- Counts every memory movement

- Bytes and Flops -> Instrumentation

- Time -> Sampling

CARM: Cache-Aware Roofline Model
DRAM: DRAM-aware Roofline Model
TRAM: Theoretical Roofline Model

Typically AI_CARM < AI_DRAM < AI_TRAM
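To make the three measured quantities concrete, here is a minimal sketch of how one loop could be placed on the chart from its Flop count, byte count, and time. The struct and function names are hypothetical and do not correspond to any Intel® Advisor API; the sketch only assumes that AI is Flops divided by bytes and that performance is Flops divided by time.

```c
/* Hypothetical per-loop measurements (not an Advisor data structure). */
typedef struct {
    double flops;    /* floating-point operations, counted by instrumentation */
    double bytes;    /* bytes moved by the loop's memory accesses, counted by instrumentation */
    double seconds;  /* self time of the loop, obtained by sampling */
} loop_profile;

/* Coordinates of the loop's dot on the roofline chart. */
typedef struct {
    double ai;       /* x-axis: Flop/Byte */
    double gflops;   /* y-axis: Gflop/s */
} roofline_point;

roofline_point place_on_roofline(loop_profile p)
{
    roofline_point pt;
    pt.ai = p.flops / p.bytes;
    pt.gflops = p.flops / p.seconds / 1e9;
    return pt;
}
```

A loop that executes 2e9 Flops, moves 8e9 bytes, and takes 0.5 s would be drawn at AI = 0.25 Flop/B and 4 Gflop/s.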


Understanding the roofline in Intel® Advisor

[Figure: Advisor roofline chart with two regions labeled "Purely Cache/DRAM-bound" and "Purely compute bound".]


Roofline model and compiler optimizations


Roofline model and optimizations

• Matrix/matrix addition

• Let’s have a look at the roofline model

    void addition(float* a, float* b, float* c, int size){
        int i, j;
        for(j=0; j<size; j++){
            for(i=0; i<size; i++){
                c[i*size + j] = a[i*size + j]+b[i*size + j];
            }
        }
    }


• Compilation with -O1

Very poor performance, far from the DRAM roofline!


• Let's look at the Memory Access Patterns analysis

Constant stride found! It looks like the loop order should be reversed (loop interchange).
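A minimal sketch of the interchanged version, assuming the matrices are stored row-major as in the `addition` function shown earlier. The name `addition_interchanged` and the `const` qualifiers are additions made for this example.

```c
/* Matrix addition with the loop order interchanged: i is now the outer loop
 * and j the inner loop, so consecutive inner iterations touch adjacent
 * elements (unit stride) instead of jumping by `size` elements. */
void addition_interchanged(const float *a, const float *b, float *c, int size)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            c[i * size + j] = a[i * size + j] + b[i * size + j];
        }
    }
}
```

The result is identical to the original; only the order in which memory is visited changes, which is what gives the compiler a unit-stride inner loop to vectorize.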


• Compilation with -O3


Vectorization of Loop carried dependency


• Loop carried dependency

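    // pad is not declared on the slide; it is assumed to be defined elsewhere (for example as a global constant)
    void addition(float* a, float* b, float* c, int size){
        int i, j;
        for(i=0; i<size; i++){
            for(j=pad; j<size; j++){
                c[i*size + j] = a[i*size + j]+c[i*size + j-pad];
            }
        }
    }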


    void addition(float* a, float* b, float* c, int size){
        int i, j;
        for(i=0; i<size; i++){
            #pragma omp simd safelen(4)
            for(j=pad; j<size; j++){
                c[i*size + j] = a[i*size + j]+c[i*size + j-pad];
            }
        }
    }

In this case, we assume that pad >= 4.
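The reason `safelen(4)` is safe under this assumption is that each iteration reads the element written `pad` iterations earlier, and `safelen(4)` only allows iterations fewer than 4 apart to execute in the same SIMD chunk. The sketch below makes the assumption explicit and adds a scalar reference for comparison. Passing `pad` as a parameter and the function names are choices made for this example, not the slide's code; without OpenMP SIMD support enabled in the compiler, the pragma is simply ignored.

```c
#include <assert.h>

/* Variant of the slide's kernel with `pad` passed explicitly (an assumption
 * for this sketch). The dependence distance is `pad`, so any SIMD chunk of
 * at most 4 consecutive iterations never contains both the writer and the
 * reader of the same element when pad >= 4. */
void shifted_add_simd(const float *a, float *c, int size, int pad)
{
    assert(pad >= 4);  /* the precondition stated on the slide */
    for (int i = 0; i < size; i++) {
        #pragma omp simd safelen(4)
        for (int j = pad; j < size; j++) {
            c[i * size + j] = a[i * size + j] + c[i * size + j - pad];
        }
    }
}

/* Scalar reference with no SIMD directive, used to check the result. */
void shifted_add_scalar(const float *a, float *c, int size, int pad)
{
    for (int i = 0; i < size; i++) {
        for (int j = pad; j < size; j++) {
            c[i * size + j] = a[i * size + j] + c[i * size + j - pad];
        }
    }
}
```

Running both functions on identical inputs with `pad = 4` should give the same output array. With `pad < 4`, the `safelen(4)` promise would be false and the vectorized loop could read stale values.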


Safelen was 4


Vectorization of function call


Vectorization of a function call with OMP

• A function call inside a loop can prevent vectorization

    for(int i=0; i<SIZE; i++){
        for(int j=0; j<SIZE; j++){
            single_line_addition(a, c, i*SIZE + j);
        }
    }

    // function is defined in another compilation unit
    void single_line_addition(float* a, float* c, int ind){
        c[ind] = a[ind]+c[ind];
    }


Advisor tells you that this pattern can be a problem and proposes a solution


• omp declare simd

    for(int i=0; i<SIZE; i++){
        #pragma omp simd
        for(int j=0; j<SIZE; j++){
            single_line_addition(a, c, i*SIZE + j);
        }
    }

    #pragma omp declare simd uniform(a, c) linear(ind)
    void single_line_addition(float* a, float* c, int ind);
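For completeness, a single-file sketch of how the pieces fit together. On the slides the function lives in another compilation unit; here the declaration and the definition are shown in one file for brevity, both carrying the same directive so that a vector variant of the function can be generated. The value of `SIZE` and the wrapper `add_all` are assumptions made for this example.

```c
#define SIZE 8  /* illustrative size, not taken from the slides */

/* Declaration, as it would appear in a shared header. uniform(a, c) says the
 * pointers are the same for all SIMD lanes; linear(ind) says ind increases
 * by 1 from one lane to the next. */
#pragma omp declare simd uniform(a, c) linear(ind)
void single_line_addition(float *a, float *c, int ind);

/* Definition, carrying the same directive as the declaration. */
#pragma omp declare simd uniform(a, c) linear(ind)
void single_line_addition(float *a, float *c, int ind)
{
    c[ind] = a[ind] + c[ind];
}

/* Caller: the inner loop is marked for SIMD execution and calls the
 * function once per element. */
void add_all(float *a, float *c)
{
    for (int i = 0; i < SIZE; i++) {
        #pragma omp simd
        for (int j = 0; j < SIZE; j++) {
            single_line_addition(a, c, i * SIZE + j);
        }
    }
}
```

Both directives only take effect when the compiler's OpenMP SIMD support is enabled; otherwise the code still compiles and runs as plain scalar C.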

[Figure: Before / After comparison]


References

Roofline model proposed by Williams, Waterman, Patterson: http://www.eecs.berkeley.edu/~waterman/papers/roofline.pdf

“Cache-aware Roofline model: Upgrading the loft” (Ilic, Pratas, Sousa, INESC-ID/IST, Technical University of Lisbon): http://www.inesc-id.pt/ficheiros/publicacoes/9068.pdf


Resources

Intel® Advisor – Threading Design & Prototyping:

Product page – overview, features, FAQs, support…

Training materials – movies, tech briefs, documentation…

Evaluation guides – step by step walk through

Reviews

Additional Analysis Tools:

Intel® VTune Amplifier – performance profiler

Intel® Inspector - memory and thread checker / debugger

Additional Development Products:

Intel® Software Development Products

Intel® Distribution for Python* – accelerated Python distribution


Download a free, 30-day trial of Intel® Parallel Studio XE 2018 today: software.intel.com/en-us/intel-parallel-studio-xe

Code that performs and outperforms

And don’t forget to fill out the evaluation survey via a URL that will be provided at the end of the day, or watch your email for a link to the survey.

P.S. Everyone who fills out the survey will receive a personalized certificate indicating completion of the training!

Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2016, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Advisor using GCC, Microsoft or Intel Compiler:

Finds un-vectorized loops

Analyzes SIMD, AVX, AVX2, AVX-512

Dependency Analysis – safely force vectorization with a pragma

Memory Access Pattern Analysis – optimize stride and caching

Trip Counts

FLOPS metrics with masking

Roofline Analysis – balance memory vs. compute optimization

Intel Compiler Adds:

Usually better optimized vectorization

Better compiler optimization messages

Intel Advisor with Intel Compiler Adds:

Finds inefficiently vectorized loops and estimates performance gain

Compiler optimization report messages displayed on the source

More tips for improving vectorization

Optimize for AVX-512 even without AVX-512 hardware

Advisor works with GCC and Microsoft Compilers. Adds bonus capabilities with the Intel Compiler.
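
As a reminder of what the Roofline Analysis item in the list above balances, here is a minimal sketch of the basic roofline bound: attainable performance is the smaller of the compute peak and the product of arithmetic intensity and memory bandwidth. The function names and the machine numbers are hypothetical and chosen only for illustration; this is not how Advisor measures or reports its roofs.

```python
def attainable_gflops(arithmetic_intensity, peak_gflops, bandwidth_gbs):
    """Basic roofline bound in GFLOP/s.

    arithmetic_intensity: FLOPs per byte moved (FLOP/byte)
    peak_gflops:          compute roof (GFLOP/s)
    bandwidth_gbs:        memory roof (GB/s)
    """
    return min(peak_gflops, arithmetic_intensity * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which the memory roof meets the compute roof."""
    return peak_gflops / bandwidth_gbs


# Hypothetical machine, for illustration only (not measured values).
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 100.0

if __name__ == "__main__":
    ridge = ridge_point(PEAK_GFLOPS, BANDWIDTH_GBS)
    for ai in (0.25, 1.0, 10.0, 40.0):
        bound = attainable_gflops(ai, PEAK_GFLOPS, BANDWIDTH_GBS)
        # Left of the ridge point the memory roof is the limiter,
        # at or right of it the compute roof is.
        limiter = "memory" if ai < ridge else "compute"
        print(f"AI = {ai:5.2f} FLOP/byte: bound = {bound:7.1f} GFLOP/s ({limiter} bound)")
```

A loop whose arithmetic intensity lies left of the ridge point gains most from memory-side work (stride, caching), while one to the right gains most from compute-side work such as better vectorization.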

Configurations for 2007-2016 Benchmarks

Platform Hardware and Software Configuration

| Platform | Unscaled Core Frequency | Cores/Socket | Num Sockets | L1 Data Cache | L2 Cache | L3 Cache | Memory | Memory Frequency | Memory Access | H/W Prefetchers Enabled | HT Enabled | Turbo Enabled | C States | O/S Name | Operating System | Compiler Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Intel® Xeon™ 5472 Processor | 3.0 GHz | 4 | 2 | 32K | 6 MB | None | 32 GB | 800 MHz | UMA | Y | N | N | Disabled | Fedora 20 | 3.11.10-301.fc20 | icc version 14.0.1 |
| Intel® Xeon™ X5570 Processor | 2.9 GHz | 4 | 2 | 32K | 256K | 8 MB | 48 GB | 1333 MHz | NUMA | Y | Y | Y | Disabled | Fedora 20 | 3.11.10-301.fc20 | icc version 14.0.1 |
| Intel® Xeon™ X5680 Processor | 3.33 GHz | 6 | 2 | 32K | 256K | 12 MB | 48 MB | 1333 MHz | NUMA | Y | Y | Y | Disabled | Fedora 20 | 3.11.10-301.fc20 | icc version 14.0.1 |
| Intel® Xeon™ E5 2690 Processor | 2.9 GHz | 8 | 2 | 32K | 256K | 20 MB | 64 GB | 1600 MHz | NUMA | Y | Y | Y | Disabled | Fedora 20 | 3.11.10-301.fc20 | icc version 14.0.1 |
| Intel® Xeon™ E5 2697v2 Processor | 2.7 GHz | 12 | 2 | 32K | 256K | 30 MB | 64 GB | 1867 MHz | NUMA | Y | Y | Y | Disabled | RHEL 7.1 | 3.10.0-229.el7.x86_64 | icc version 14.0.1 |
| Intel® Xeon™ E5 2600v3 Processor | 2.2 GHz | 18 | 2 | 32K | 256K | 46 MB | 128 GB | 2133 MHz | NUMA | Y | Y | Y | Disabled | Fedora 20 | 3.13.5-202.fc20 | icc version 14.0.1 |
| Intel® Xeon™ E5 2600v4 Processor | 2.3 GHz | 18 | 2 | 32K | 256K | 46 MB | 256 GB | 2400 MHz | NUMA | Y | Y | Y | Disabled | RHEL 7.0 | 3.10.0-123.el7.x86_64 | icc version 14.0.1 |
| Intel® Xeon™ E5 2600v4 Processor | 2.2 GHz | 22 | 2 | 32K | 256K | 56 MB | 128 GB | 2133 MHz | NUMA | Y | Y | Y | Disabled | CentOS 7.2 | 3.10.0-327.el7.x86_64 | icc version 14.0.1 |