Bhanu Shankar, Ph.D. Architect, 3D XPoint™ Performance Analysis Intel Corporation May 17, 2016

Performance Analysis - Intel

Bhanu Shankar, Ph.D.

Architect, 3D XPoint™ Performance Analysis

Intel Corporation

May 17, 2016

Copyright © 2016, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Legal Notices and Disclaimers

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.

No computer system can be absolutely secure.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

Intel, Xeon, Xeon Phi, Core, VTune, Atom, Quark and the Intel logo are trademarks of Intel Corporation in the United States and other countries.

*Other names and brands may be claimed as the property of others.

© 2016 Intel Corporation.

Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Intel® VTune™ Amplifier: Get Faster Code Faster

In a single phrase: “VTune is the best oscilloscope for Intel® Platforms”

If there is something to measure on the platform, VTune can do it

Learn a single tool

Use it on multiple Operating Systems

– Windows / Linux / FreeBSD / Android / VxWorks

Use it on multiple Platforms

– Quark, Atom family, Core family, Xeon family, Xeon Phi family

Updated often with new analysis modes for better insight

Audience Knowledge

Familiarity with the basics of Intel® VTune™ Amplifier

Create Projects

Starting a profiling run

– Choose Target and Analysis Type

Types of Analyses available in VTune™ Amplifier

VTune Panes

– Role of the Grid

– Timeline Views

– Grouping Toolbar

Familiarity with the basics of:

Parallel programming using OpenMP

Intel x86 assembly language

Basics of compiler optimizations

Cache and Memory hierarchies

Synopsis of this webinar

I will start with an application and work through the process of analyzing its performance.

The focus of this process is to help you, the user, find out whether your application is memory bound.

If so, is the memory-bound behavior caused by NUMA effects?

The application is a modified version of the stream benchmark

Freely available at: http://www.cs.virginia.edu/stream

A simple, synthetic benchmark designed to measure sustainable memory bandwidth
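For orientation, here is a minimal sketch of the kind of loop the stream benchmark times, together with the arithmetic that turns a timing into a bandwidth figure. It is not the benchmark's source, nor the modified version used in this webinar. The names `triad`, `triad_bytes`, and `bandwidth_gbs` are hypothetical, and the byte count assumes 8-byte doubles with two arrays read and one written per element.

```c
#include <assert.h>
#include <stddef.h>

/* STREAM-style triad kernel: a[i] = b[i] + scalar * c[i].
 * The pragma is simply ignored when the code is built without OpenMP. */
void triad(double *a, const double *b, const double *c,
           double scalar, size_t n)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        a[i] = b[i] + scalar * c[i];
}

/* Bytes moved by one triad pass: two arrays read, one array written. */
double triad_bytes(size_t n)
{
    return 3.0 * (double)sizeof(double) * (double)n;
}

/* Sustained bandwidth in GB/s for a pass that took `seconds`. */
double bandwidth_gbs(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds / 1e9;
}
```

With 10 million elements per array, one pass moves 240 MB; if that pass took 0.02 s, the sustained bandwidth would be 12 GB/s. These figures are illustrative, not measurements from the webinar.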

General Methodology for using VTune

First - Run Advanced Hotspots Analysis

Identify the hotspots

Characterize the application behavior

Secondly - Run General Exploration Analysis

Identify areas to explore after the basic algorithm / hotspot

Lastly - Run Specialized Analysis

For this example - Memory Analysis

– Memory Analysis without objects

– Memory Analysis with objects (Linux only)

Let’s get started

Step 1: Run Advanced Hotspots

Identify the hotspots

Characterize the application behavior

Application – Hotspot – Bottom Up Tab

Application Source/Object

Source code is simple

Object code is straightforward

Why the large CPI?

Not caused by the algorithm

Must be machine-specific
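As a reminder of what the metric measures, CPI is unhalted clock cycles divided by instructions retired. The sketch below is only that arithmetic. The function name `cpi` and the counter values in the note after it are hypothetical, not numbers taken from this profile.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles per instruction from two raw counter totals. */
double cpi(uint64_t clockticks, uint64_t instructions_retired)
{
    assert(instructions_retired > 0);
    return (double)clockticks / (double)instructions_retired;
}
```

For example, 8,000,000 clockticks over 2,000,000 retired instructions gives a CPI of 4.0. When a simple loop shows a CPI that high, the cycles are most likely being spent waiting rather than computing, which is why the next step looks at where they go.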

Step 2: Run General Exploration

Identify areas to explore after the basic algorithm / hotspot

Summary Page

General Exploration – Bottom Up Tab

Same Loops as earlier

Let’s Explore – Source level

Yes, indeed: we have a bottleneck in the memory hierarchy

Step 3: Find the memory bandwidth

Run Specialized Analysis: Memory Analysis without objects

Do we have a bandwidth problem?

How do I run Memory Access Analysis?

Make sure this box is unchecked.

Memory Access - Summary

Looks like a problem accessing remote DRAM

Memory Access – Bottom-Up View

Imbalance in memory access across both sockets

Average latency is fairly large

Step 4: Identify the memory object(s)

Run Specialized Analysis: Memory Analysis with objects (Linux only)

Identify the Memory Objects - Configuration

Make sure this box is checked.

Minimal size of object to track.

Identify the Memory Objects

Location of the heap object

Average Latency is large

Dive into an object

Access to the object in a parallel region - Good

Access to the object in a serial region – Hmmm… Investigate

Serial Access to memory object

This is where memory is first touched.

BINGO!!! Linux places each page in the local memory of the socket whose thread first touches it!!!
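A common remedy for this pattern, sketched below, is to initialize the array in a parallel loop that uses the same static schedule as the compute loops, so that each thread first touches the pages it will later work on. This is an illustration under that assumption, not the fix applied in this webinar. The names `alloc_untouched`, `init_serial`, and `init_first_touch` are hypothetical. The pragma is ignored if the code is built without OpenMP, in which case the two initializers behave identically.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Allocate without writing to the memory. For large requests on Linux,
 * physical pages are typically assigned only when first written. */
double *alloc_untouched(size_t n)
{
    assert(n > 0);
    return malloc(n * sizeof(double));
}

/* Problematic pattern: a single thread writes every element first, so
 * under a first-touch policy every page lands on that thread's socket. */
void init_serial(double *x, size_t n, double value)
{
    for (size_t i = 0; i < n; i++)
        x[i] = value;
}

/* NUMA-aware pattern: each thread writes the chunk that a static
 * schedule will also hand it in later compute loops of the same shape. */
void init_first_touch(double *x, size_t n, double value)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        x[i] = value;
}
```

Calling `init_serial` right after allocation reproduces the situation on this slide. Swapping in `init_first_touch` spreads the pages across the sockets that run the threads, provided the threads stay on their sockets for the whole run (for example by setting `OMP_PROC_BIND`, which is an assumption about the run environment).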

Did it work? Analyze the fixed application

Run Memory Access on fixed code

Fixed Code: Summary Page

Effects of NUMA completely disappeared

Remote DRAM accesses are minimal

Lab exercise: Try out what you just learned

The stream benchmark has 5 loops that are parallelized

Locate the loops tagged with “#pragma omp parallel for”

Remove the “#pragma omp parallel for” from one or more of the loops

Run Intel® VTune™ Amplifier

See the effects of memory placement and parallel execution

Try the compare results feature on your runs of VTune using the icon

What other problems can I diagnose this way?

Memory and Inter-socket Bandwidth

Memory Latency

Memory Hierarchy

False Sharing

True Sharing

Effectiveness of Lockless Algorithms

Summary

Intel® VTune™ Amplifier continues to add tools to the toolbox to diagnose system performance problems

Memory Access Analysis is one such powerful tool

Stay tuned for more such tools in the future

Call to Action

Download and Evaluate Intel® VTune™ Amplifier

https://software.intel.com/en-us/intel-vtune-amplifier-xe

Intel® VTune™ Amplifier Support

https://software.intel.com/en-us/intel-vtune-amplifier-xe-support

Get Help: Ask the Community

https://software.intel.com/en-us/forums/intel-vtune-amplifier-xe

NUMA Architecture

https://software.intel.com/en-us/articles/a-brief-survey-of-numa-non-uniform-memory-architecture-literature

Stream Benchmark

http://www.cs.Virginia.edu/stream

or type “stream benchmark” into your favorite search engine

Questions?
