
Software & Services Group

Developer Products Division Copyright© 2011, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Advanced Features of Intel® C++ Composer XE for Linux

Jeff Arnold

Intel Corporation

18 February 2011

http://software.intel.com/en-us/articles/optimization-notice/



Agenda

• Preliminaries

• Intel® Parallel Studio XE 2011

• Intel® C++ Composer XE

• Intel® Parallel Building Blocks

– Intel® Cilk™ Plus

– Intel® Array Building Blocks

• Performance Libraries

• Intel® VTune™ Amplifier XE 2011

• Intel® Inspector XE 2011





Legal Disclaimer


INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Cilk, Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries.*Other names and brands may be claimed as the property of others.

Copyright © 2011. Intel Corporation.

http://intel.com/software/products




Optimization Notice


Intel® compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the “Intel® Compiler User and Reference Guides” under “Compiler Options." Many library routines that are part of Intel® compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.

Intel® compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.

While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel® and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not.

Notice revision #20101101









Hickory, Dickory, Dock – The ISV Development Clock

[Figure: Intel microarchitecture timeline: Enhanced Intel® Pentium® M Microarchitecture (Yonah), Intel® Core™ Microarchitecture (Merom), Intel® Microarchitecture codename Nehalem, Future Intel® Microarchitecture, …]

Architectural and micro-architectural changes require software changes to realize the full benefit.

All dates, product descriptions, availability, and plans are forecasts and subject to change without notice.




More and Moore Cores

• The trend toward multi-core mobile, desktop, and server processors is expected to continue into the foreseeable future.

• Software must be ready to take full advantage of it.

Dual core
• Symmetric multithreading

Multi-core array
• CMP with ~10 cores

Many-core array
• CMP with 10s-100s of low-power cores
• Scalar cores
• Capable of TFLOPS+
• Full System-on-Chip
• Servers, workstations, embedded…

Large, scalar cores for high single-thread performance

Scalar plus many-core for highly threaded workloads

CMP = Chip Multi-Processing

All dates, product descriptions, availability, and plans are forecasts and subject to change without notice.








Intel® Parallel Studio XE 2011: Powerful tools to create fast, reliable and secure code

Phase: Advanced Build & Debug
Productivity tool: Intel® Composer XE
Feature: C/C++ and Fortran compilers, performance libraries, and parallel models
Benefit: Application performance, scalability and quality for current multicore and future many-core systems

Phase: Advanced Verify
Productivity tool: Intel® Inspector XE
Feature: Memory and threading error-checking tool for higher code reliability and quality
Benefit: Increases productivity and lowers cost by catching memory and threading defects early

Phase: Advanced Tune
Productivity tool: Intel® VTune™ Amplifier XE
Feature: Performance profiler to optimize performance and scalability
Benefit: Removes guesswork, saves time, and makes it easier to find performance and scalability bottlenecks. Combines ease of use with deeper insights.

Today’s Focus: Intel® Composer XE




Get Outstanding Application Performance from Intel Compiler Suite Products

New Names, Same Great Tradition of Compilers & Library Performance

Old name → New name

• Intel® C++ Compiler, Professional Edition for Windows* → Intel® C++ Composer XE for Windows*

• Intel® Visual Fortran Compiler, Professional Edition for Windows* → Intel® Visual Fortran Composer XE for Windows*

• Intel® Visual Fortran Compiler, Professional Edition for Windows* with IMSL* → Intel® Visual Fortran Composer XE for Windows* with IMSL*

• Intel® Compiler Suite, Professional Edition for Windows* → Intel® Composer XE for Windows*

• Intel® C++ Compiler, Professional Edition for Linux* → Intel® C++ Composer XE for Linux*

• Intel® Fortran Compiler, Professional Edition for Linux* → Intel® Fortran Composer XE for Linux*

• Intel® Compiler Suite, Professional Edition for Linux* → Intel® Composer XE for Linux*

• Intel® C++ Compiler, Professional Edition for Mac OS X* → Intel® C++ Composer XE for Mac OS X*

• Intel® Fortran Compiler, Professional Edition for Mac OS X* → Intel® Fortran Composer XE for Mac OS X*




Intel Performance-Oriented Compiler Suites
Compilers, Performance Libraries, Debugging Tools: C++, Fortran on Windows, Linux and Mac OS X

Intel® C++ Composer XE 2011
• Intel® C++ Compiler XE 12.0
• Intel® Parallel Debugger Extension
• Intel® Math Kernel Library
• Intel® Integrated Performance Primitives

Intel® Fortran Composer XE 2011
• Intel® Fortran Compiler XE 12.0
• Intel® Parallel Debugger
• Intel® Math Kernel Library
• Intel® Integrated Performance Primitives

Intel® Composer XE 2011
• Combines Intel® C++ Composer XE and Intel® Fortran Composer XE
• For Fortran developers who also want Intel C++
• Windows, Linux only

Performance, Compatibility, Support
• Windows: Integrates into Microsoft* Visual Studio*; Intel C++/Visual C++ compatibility
• Linux: Integrates into Eclipse CDT; Intel C++ is compatible with GCC
• Mac OS: Integrates into the Xcode environment; compatible with GCC
• All: 1 year of Premier Support, renewable annually








What’s New: Intel® Composer XE

• Major release of C/C++ and Fortran compilers v12.0

• Advanced C/C++ parallelism with Intel® Parallel Building Blocks

• Advanced vectorization with SIMD pragmas

• Co-array Fortran and more Fortran 2008 support

• Updated versions of Intel® MKL & Intel® IPP

[Figure labels: SIMD pragma; Parallel Program Debugging]




Intel® C++ Composer XE

What’s New!

• Improved performance

• Subset of C++0x in support of Visual C++ compatibility

• Support for Visual Studio* 2010 (continuing 2005 & 2008 support)

• Enhanced vectorization capabilities: GAP and SIMD pragmas

• Parallelism Discovery Assistant: enhanced loop profiler

• Expanded parallelism-dev features: Intel® Parallel Building Blocks

• Fortran 2003 support and many Fortran 2008 features, including Co-Array Fortran

• Improved Intel performance libraries integration: Intel® Math Kernel Library, Intel® Integrated Performance Primitives

• New hardware support: Intel® Sandy Bridge

• Intel® Many Integrated Core (MIC) Architecture extensions (beta)

• 32-bit and 64-bit support

• Windows*, Linux* and Mac OS* X

Ongoing Commitment to Innovation & Standards




Intel® C++: Compatibility and Performance Leadership

• Intel® Cilk™ Plus: Easy to use language extensions for array syntax that deliver great performance through parallelism and more readable syntax

• Staying on top of the performance heap

– Enhanced vectorization and auto-parallelization that apply to more situations in code. Developers love seeing this in their build logs.

– Low overhead loop and function profiling shows hotspots and where to introduce threads

• Guided Auto Parallelism suggests code changes to get the compiler to auto-vectorize and/or auto-parallelize, a great productivity tool that delivers great performance

• More C++0x and C99 standards support for enhanced compatibility with Visual C++

• Even more performance from optimized string intrinsics that use Intel® SSE 4.2 instructions





Compatibility to Standards

The Intel C++ Compiler provides the following language conformances

- ANSI/ISO standard for C language compilation (ISO/IEC9899:1990)

- ANSI/ISO standard (ISO/IEC 14882:1998) for the C++ language

The Fortran Compiler provides the following language conformances

- Fortran 95 language standard



- Fortran IV

- Also includes many features from the Fortran 2003 language standard, as well as numerous popular language extensions




Compatibility with GNU/Linux

Source (mostly) and binary compatible

• Mixing and matching binary files created by g++, including third-party libraries

• Generating C++ code compatible with gcc/g++ 3.2 or higher (up to 4.3)

• Improved support for command-line options offered in the GNU compilers

• Support of most GNU C and C++ language extensions

• Glibc 2.3.2, 2.3.4, 2.3.5 or 2.8

• Linux Kernel 2.4.x or 2.6.x

Limitations

• Intel Fortran Compiler for Linux is not binary compatible with the GNU g77 or GNU gfortran compilers




Interprocedural Optimization (IPO)

• Cross-module optimization

• IPO is a seamless process; most of the optimization actually happens during the link phase

• Benefits of IPO
– Optimization of a large number of frequently used small and medium functions, especially those called in loops
– Function inlining
– Eliminates the need for argument setup and call branch/return overhead
– Enables opportunities for other optimizations (constant propagation, dead code elimination, etc.)
– Dead code elimination, better register usage
– Improved alias analysis for better auto-vectorization and loop transformations

• May increase build time and binary size




Interprocedural Optimizations (IPO)

• -ip (Linux*), /Qip (Windows*): enables interprocedural optimizations for the current source file compilation

• -ipo[n] (Linux*), /Qipo[n] (Windows*): enables interprocedural optimizations across files
– Can inline functions in separate files
– Permits inlining and other interprocedural optimizations among multiple source files. The optional value n controls the maximum number of link-time compilations (or number of object files) spawned. The default for n is 0 (the compiler chooses).
– Enhances optimization when used in combination with other compiler features




Other Techniques for Inlining Functions

• Compiler Switches

– Increase information provided to the compiler

-ipo, -prof-use (Linux), /Qipo, /Qprof-use (Windows)

– Change Compiler Heuristics

-inline-factor=n (default=100), /Qinline-factor=n

-inline-level=0|1|2, /Ob0|1|2

• Inlining source code features

– GCC C/C++

__attribute__((always_inline))

__attribute__((noinline))

– Microsoft* C/C++

Keywords: inline, __inline, __forceinline
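As a minimal sketch of the GCC-style attributes listed above, the following marks one small helper as always inlined and keeps its caller out of line. The function names `square` and `sum_of_squares` are hypothetical and chosen only for illustration; the attributes are inlining requests to a GNU-compatible compiler and do not change the computed result.

```cpp
// Hypothetical helper: ask a GNU-compatible compiler to inline it at every call site.
__attribute__((always_inline)) inline int square(int x) {
    return x * x;
}

// Hypothetical caller: kept as a real out-of-line function so that it stays
// visible as its own entry in a profile or disassembly.
__attribute__((noinline)) int sum_of_squares(const int* values, int count) {
    int total = 0;
    for (int i = 0; i < count; ++i)
        total += square(values[i]);  // expected to be expanded in place
    return total;
}
```

Calling `sum_of_squares` on the array {1, 2, 3} returns 14 whether or not the compiler honors the inlining request.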





Auto-Vectorization

• Auto-vectorizer exploits SIMD/DLP opportunities
– Auto-vectorizes sequential operations using SSE and AVX instructions
– No significant changes to source code
– Much easier to learn, debug, maintain
– Forward looking with respect to compilers and processors

• Optimized code for targeted processor(s)
– Both Intel and AMD*
– Mixed-processor environments supported as well

• Processor-specific optimization
– Targeting specific Intel processor(s)
– e.g. for Intel® Core™ i7 use -xSSE4.2

• Auto-dispatch: processor-optimized code paths
– Includes both optimized and generic (SSE2) code paths
– e.g. for Intel® Core™ i7 use -axSSE4.2




Vectorization Switches

Group 1: -m<extension>, such as -msse3
• Optimizes for both Intel® and compatible, non-Intel processors

Group 2: -x<extension>, such as -xAVX
• Targets Intel® processors only
• Application will not start on non-Intel processors or if the instruction set is not available

Group 3: -ax<extension>, such as -axSSE4.2
• Creates default and additional processor-specific code paths
• Processor-specific path(s), for Intel® processors only, defined by <extension>
• Default code path is -msse2 unless explicitly modified
• Default code path can be changed using an additional switch from group 1 or 2
• Multiple processor-specific paths can be specified




Key Intel® Advanced Vector Extensions (Intel® AVX) Features

Key feature: Wider vectors, increased from 128 bit to 256 bit
Benefit: Up to 2x peak FLOPs (floating point operations per second) output with good power efficiency

Key feature: Enhanced data rearrangement; use the new 256-bit primitives to broadcast, mask loads and permute data
Benefit: Organize, access and pull only necessary data more quickly and efficiently

Key feature: Three and four operands, non-destructive syntax; designed for efficiency and future extensibility
Benefit: Fewer register copies, better register use for both vector and scalar code

Key feature: Flexible unaligned memory access support
Benefit: More opportunities to fuse load and compute operations

Intel® AVX is a general-purpose architecture, expected to supplant SSE in all applications used today.




Intel® Advanced Vector Extensions (Intel® AVX): 2X Vector Width
A 256-bit vector extension to SSE

• Intel® AVX extends all 16 XMM registers to 256 bits

• Intel AVX works on either
– The whole 256 bits, for FP instructions
– The lower 128 bits (like existing SSE instructions)
– A drop-in replacement for all existing scalar/128-bit SSE instructions
– The upper part of the register is zeroed out

• Intel AVX targets a high-performance first implementation
– 256-bit Multiply, Add and Shuffle engines (2X today)
– 2nd load port

[Figure: register YMM0 is 256 bits wide (2010); its lower half, XMM0, is 128 bits wide (1999)]




SIMD: Single Instruction, Multiple Data

• Scalar mode
– one instruction produces one result

• SIMD processing
– with SSE or AVX instructions
– one instruction can produce multiple results

[Figure: scalar addition X + Y compared with SIMD addition, where a single instruction computes x7+y7, x6+y6, …, x0+y0 from the vectors x7…x0 and y7…y0]




SSE and AVX-128 Data Types

SSE
• 4x floats

SSE-2
• 2x doubles
• 16x bytes
• 8x 16-bit shorts
• 4x 32-bit integers
• 2x 64-bit integers
• 1x 128-bit(!) integer




AVX-256 Data Types on “Sandy Bridge”

Now
• 8x floats
• 4x doubles

Possible future implementations?
• 32x bytes
• 16x 16-bit shorts
• 8x 32-bit integers
• 4x 64-bit integers
• 2x 128-bit(!) integers




Compiling for Intel® AVX (high level)

• Compile with -xavx

– Intel processors only

– Vectorization works just as for SSE

– Main speedups are for floating point

– No integer 256 bit instructions in first generation

– Up to ~1.8x performance for Linpack

– Best if 32 byte aligned

– More loops can be vectorized than with SSE

– Individually masked data elements

– More powerful data rearrangement instructions

• -axavx gives both SSE and AVX code paths

– use -x or -m switches to modify the default SSE code path

– E.g. -axavx -xsse4.2 to target Nehalem and AVX

• Math libraries may target AVX automatically at runtime




Intel® AVX Intrinsics

• Found in immintrin.h

• Names typically begin with _mm256_

– E.g. _mm256_add_pd()

– SSE intrinsics typically begin with _mm_

• New data types:

– __m256 holds 8 32-bit floats

– __m256d holds 4 64-bit doubles

– __m256i holds integers:

32 8-bit, 16 16-bit, 8 32-bit or 4 64-bit

– Intrinsics may also use SSE data types __m128i etc

• Manual cpu dispatch (temporary names; Intel processors only)

– __declspec(cpu_specific(future_cpu_16))

– __declspec(cpu_dispatch(future_cpu_16,…))




Automatic Vectorization by Compiler
Translates Loops into SIMD Parallelism

The loop is strip-mined (unrolled): strip length of 8 for floats with AVX, cf. 4 for floats with SSE.

for (i=0;i<=MAX;i++)
    c[i]=a[i]+b[i];

[Figure: 128-bit registers; elements A[7]…A[0] and B[7]…B[0] are added pairwise to give C[7]…C[0]]
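The strip-mining that the vectorizer performs can be written out by hand as a rough illustration. The sketch below is plain scalar C++, not the code the compiler actually emits; the name `add_arrays` and the constant `kStrip` are hypothetical. It processes whole strips of 8 floats (the amount that fills one 256-bit register) and then finishes the leftover elements in a remainder loop.

```cpp
#include <cstddef>

// Hypothetical strip length: 8 floats occupy one 256-bit AVX register.
constexpr std::size_t kStrip = 8;

// Element-wise c = a + b, written as a strip-mined loop plus a remainder loop.
void add_arrays(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;

    // Main loop: each outer iteration covers one full strip. A vectorizing
    // compiler would turn the inner loop into a single SIMD add.
    for (; i + kStrip <= n; i += kStrip) {
        for (std::size_t j = 0; j < kStrip; ++j)
            c[i + j] = a[i + j] + b[i + j];
    }

    // Remainder loop: the last n % kStrip elements, handled one at a time.
    for (; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

With n = 19, the main loop handles elements 0 to 15 in two strips and the remainder loop handles elements 16 to 18.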




Features of AVX loads on Sandy Bridge

• Performance of vmovupd is as good as vmovapd when the data is 32-byte aligned
– Therefore, the compiler never generates vmovapd, only vmovupd
– No alignment faults if data is not always aligned

• Performance of 32-byte aligned loads is better than that of unaligned loads
– Try to align your data

• Performance of two 16-byte loads may be better than one unaligned 32-byte load
– The compiler may split 32-byte loads into two 16-byte loads if the data is known to be unaligned, or if 32-byte alignment is unknown

• Performance of 16-byte unaligned loads is not much worse than that of aligned 16-byte loads (similar to Nehalem)
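One portable way to follow the "align your data" advice is the standard C++11 `alignas` specifier. The sketch below is an assumption about how a program might request 32-byte alignment, not something the slides prescribe; the type `Block` and the helper `is_aligned` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical data block: four doubles (32 bytes), aligned to a 32-byte
// boundary so that it matches one 256-bit register load.
struct alignas(32) Block {
    double v[4];
};

// Returns true if the address p is a multiple of the given alignment.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Because `sizeof(Block)` is also 32, every element of an array of `Block` starts on a 32-byte boundary, which `is_aligned(&blocks[i], 32)` can confirm at run time.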




Mixing AVX-256 and SSE instructions

• Legacy Intel® SSE instructions preserve the value of the upper 128 bits of a YMM register

– 128-bit Intel® AVX instructions will zero the upper 128 bits

• There is a performance penalty when switching between 256-bit Intel AVX and SSE

– due to save/restore of the upper 128 bits

– With -xavx, compiler will prefer AVX-128 to SSE

• User advice: avoid mixing functions with AVX-256 and functions with SSE that call each other.

– Where possible, recompile with -xavx

– Automatically converts SSE intrinsics to AVX-128

– Automatically converts SSE inline assembly to AVX-128




3rd Party Support for Intel® AVX

• GNU* tools

– gcc 4.4.1 (AVX-128 bit) and later

– binutils 2.20.51.0.1 and later

– objdump for disassembly

– gdb 6.8.50.20090915 and later

• Microsoft* Visual Studio* 2010

– Compiler and optimizer support

– /arch:AVX

– Intrinsics

– MASM

– Disassembly

– Debugger support for YMM registers





Performance Libraries

• Intel® MKL has had Intel® AVX tunings since MKL 10.2.0

– mkl_enable_instructions() activated it (64-bit only)

• Further Intel AVX optimizations in MKL 10.3

– DGEMM & SGEMM optimizations

– All BLAS level 3 functions

– LU/Cholesky/QR & eigensolvers in LAPACK

– FFTs of lengths 2^n

– VML/VSL

– no special activation needed

• Intel® IPP has supported Intel AVX since IPP 6.1

– “g9” code for IA-32, “e9” code for Intel 64

– Automatic optimization using the compiler switch /Qxavx (-xavx)

– Certain functions have been hand-optimized for AVX. http://software.intel.com/en-us/articles/intel-ipp-functions-optimized-for-intel-avx-intel-advanced-vector-extensions/

• Further Intel AVX optimizations in IPP 7.0

– Hand-optimized functions for Image Compression

Many routines in the Intel® MKL and IPP libraries are more highly optimized for Intel microprocessors than for non-Intel microprocessors.


























Auto-Parallelization

• Serial portions of code are automatically translated into multi-threaded code when possible
– Performs dataflow analysis to verify correct parallel execution
– Partitions data for threaded code

• Parallel runtime support offers the same features as in OpenMP*
– Handling details of loop iteration modification
– Thread scheduling
– Synchronization

• Enabled by the -parallel switch
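To make the role of dataflow analysis concrete, the sketch below contrasts two loops. It is illustrative only: the function names are hypothetical, and whether a given compiler actually parallelizes the first loop depends on its own heuristics. In `scale`, every iteration is independent, so iterations could be divided among threads. In `running_sum`, each iteration reads the value written by the previous one, a loop-carried dependence that prevents straightforward parallel execution.

```cpp
#include <cstddef>
#include <vector>

// Independent iterations: out[i] depends only on in[i].
// A loop of this shape is a candidate for auto-parallelization.
void scale(std::vector<double>& out, const std::vector<double>& in, double k) {
    out.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = k * in[i];
}

// Loop-carried dependence: x[i] needs the updated x[i - 1], so the
// iterations cannot simply be run in parallel in this form.
void running_sum(std::vector<double>& x) {
    for (std::size_t i = 1; i < x.size(); ++i)
        x[i] += x[i - 1];
}
```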




Intel® Guided Auto Parallelism
Let the Compiler Tell You What it Needs

• Motivation
– Effective, simplified way to add parallelism to applications
– Use built-in compiler technology to speed parallelism development

• What is GAP?
– Compiler-based analyzer that guides developers in changing code so that the compiler can automatically optimize it through vectorization, parallelization, or data transformation
– Built upon existing auto-vectorization and auto-parallelization technology

• GAP does not
– Analyze code and find hotspots for threading (see Advisor)
– Verify threading correctness (use Inspector)
– Do any performance/hotspot analysis (use VTune Amplifier)

Developer Must Verify Semantics of GAP Recommendations




Using GAP

• Requires optimization level set to -O2 or higher

– Works both from the command line and in the Eclipse IDE

– Neither IPO nor PGO is required, but the advice may change if they are used

– User may apply all or a subset of the advice provided by GAP

– When multiple messages apply to a given loop, ALL suggestions for that loop must be applied to get the desired optimization

• User can specify regions of a file or routine that are considered important to optimize
– Advice will be restricted to that region
– Default is to provide advice on the entire compilation unit

• Advice may involve
– Suggestions for source changes that assert new properties
– Adding pragmas for a loop if semantics are satisfied
– Adding new compilation options

• GAP output is a set of GAP messages, not an executable




Intel® Guided Auto Parallelism
GAP Workflow

[Figure: GAP workflow. Application source (C/C++/Fortran) → Compiler → Application binary → Performance tools identify hotspots and problems (traditional hotspot analysis) → Application source + hotspots → Compiler in advice mode → Advice messages → Modified application source → Compiler (extra options) → Improved application binary]

Compiler suggests source modifications to enable vectorization and parallelization.

Feed the modified source back to the compiler for optimization.




Optimization Reports

• GAP – Guided Auto Parallelism
– New in Intel® Parallel Composer 2011
– -guide switch
– Provides advice on source changes that could enable parallelism
– Provides analysis and suggestions; doesn’t actually generate code

• Other reports
– -vec-report
  – Which loops were vectorized, which were not
  – Why they weren’t vectorized
– -par-report
– -opt-report
  – Reports available for a variety of optimizations
  – icc -help reports for more details




High-Level Optimizations (HLO)

• Enabled with -O3
– With auto-vectorization, it does more aggressive data dependency analysis than at -O2
– Exploits properties of source code (loops & arrays)
– Best chance for performing loop transformations

• Performs loop transformations
– Loop distribution
– Loop interchange
– Loop fusion
– Loop unrolling
– Data pre-fetching
– PGO-based loop unrolling
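As a hand-written illustration of one of these transformations, loop interchange, the sketch below doubles a row-major matrix twice: once with the column index in the outer loop (stride of `cols` elements between consecutive accesses) and once after swapping the loops (unit stride). The function names and the flat `std::vector` layout are assumptions made for the example; the compiler applies the equivalent change internally when it can prove it is legal.

```cpp
#include <cstddef>
#include <vector>

// Before interchange: the inner loop walks down a column of a row-major
// matrix, so consecutive accesses are cols elements apart.
void scale_column_order(const std::vector<double>& a, std::vector<double>& b,
                        std::size_t rows, std::size_t cols) {
    for (std::size_t j = 0; j < cols; ++j)
        for (std::size_t i = 0; i < rows; ++i)
            b[i * cols + j] = 2.0 * a[i * cols + j];
}

// After interchange: the inner loop walks along a row, so consecutive
// accesses are adjacent in memory. The results are identical because
// every element is computed independently.
void scale_row_order(const std::vector<double>& a, std::vector<double>& b,
                     std::size_t rows, std::size_t cols) {
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            b[i * cols + j] = 2.0 * a[i * cols + j];
}
```

The interchange is safe here because no iteration reads a value written by another; a loop with cross-iteration dependences would need closer analysis.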




New functionality in 12.0

• Choice of precision for math functions
– Lower precision may give better performance
– In 11.1 and earlier:
  – high precision for libm (~0.55 ulp)
  – lower precision for libsvml (< 4 ulp)

• Bitwise reproducible libraries
– Identical results on different processors, e.g. Intel® Core™2 Duo, Intel® Core™ i7, AMD* processors
– In prior versions, cpu dispatch could cause differences
– There may still be differences between
  – IA-32 and Intel® 64
  – different compiler versions
– Achieved by calling high accuracy function versions that use instructions available to all processors
– There will be some cost in performance




What you need to know

• -fimf-precision=<high|medium|low>
– Default is off (compiler chooses)
  – Typically high for scalar code, medium for vector code
– low typically halves the number of mantissa bits
– high ~0.55 ulp
– medium < 4 ulp (typically 2)

• -fimf-arch-consistency=<true|false>
– Will produce consistent results on all microarchitectures or processors within the same architecture
– Run-time performance may decrease
– Default is false (even with -fp-model precise!)




More detail

• Can specify at the function level
– -fimf-precision=<high|medium|low>[:fnlist]
  – e.g. -fimf-precision=low:func1,func2,func3
– -fimf-arch-consistency=<true|false>[:fnlist]

• Can specify desired accuracy in different ways:

– -fimf-max-error=ulps‡[:fnlist]

– Maximum relative error

– E.g. -fimf-max-error=0.6 for high accuracy

– -fimf-absolute-error=value[:fnlist]

– Max absolute error specified as a floating-point number

– -fimf-accuracy-bits=bits[:fnlist]

– Required accuracy specified as a number of mantissa bits

‡ulps = Units in the Last Place
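To give the ulp figures a concrete meaning, the following sketch measures how far a single-precision result is from a double-precision reference, expressed in units in the last place of the reference. The helper names `ulp_of` and `error_in_ulps` are hypothetical, and the ulp is taken here as the gap to the next larger float, which is one common convention and only an assumption about how the compiler documentation defines it.

```cpp
#include <cmath>
#include <limits>

// Size of one ulp at x: the distance from |x| to the next larger float.
inline double ulp_of(float x) {
    const float ax = std::fabs(x);
    const float next = std::nextafter(ax, std::numeric_limits<float>::infinity());
    return static_cast<double>(next) - static_cast<double>(ax);
}

// Error of a float result against a double reference, in ulps of the reference.
inline double error_in_ulps(float approx, double reference) {
    return std::fabs(static_cast<double>(approx) - reference)
           / ulp_of(static_cast<float>(reference));
}
```

For example, a result that is the float immediately above 1.0f, compared against a reference of exactly 1.0, has an error of 1 ulp under this definition; a bound such as 0.6 ulp therefore requires the result to be one of the floats nearest the true value.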




Implementation

• No new run-time libraries, but new entry points
– High accuracy functions typically have names ending in _ha
– Low accuracy functions typically have names ending in _ep
  – “extra performance”
  – About half the number of bits of the high accuracy version
– Bit-wise reproducible functions typically have names starting with __bwr or ending in _br
  – In some cases, the _ha functions are bit-wise reproducible and so no _br version is needed

• New compile-time library libiml_attr
– Tells the compiler which libm entry point to call
– .so or .dll located in bin directory




Loop Profiler
Identify Time Consuming Loops/Functions

Enables targeting parallelization and optimization efforts to the most significant code areas (hotspot identification)

• Simple to use

– Add compiler option to command line to instrument application

– Compiler adds instrumentation calls to loops and function entry and exit points

– Run application to get profile report file

– Both a human-readable text file (a table) and an XML-file are generated

– Analyze data by looking at raw text file or use GUI viewer shipped with compiler

• Report file contains for example

– Call count of routines

– Self-time of functions / loops

– Total-time of functions / loops

– Average, minimum, maximum iteration count of loops
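The kind of data such instrumentation gathers can be imitated by hand. The sketch below is not the compiler's instrumentation and does not reproduce its report format; `ProfileEntry`, `ScopedProbe`, `g_work_profile` and `work` are hypothetical names. A small RAII probe placed at function entry records a call count and the accumulated total time when the function exits.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical per-function record: number of calls and accumulated time.
struct ProfileEntry {
    std::uint64_t calls = 0;
    std::chrono::steady_clock::duration total{};
};

// Probe constructed at function entry; its destructor runs at function exit
// and adds one call plus the elapsed time to the record.
class ScopedProbe {
public:
    explicit ScopedProbe(ProfileEntry& entry)
        : entry_(entry), start_(std::chrono::steady_clock::now()) {}
    ~ScopedProbe() {
        ++entry_.calls;
        entry_.total += std::chrono::steady_clock::now() - start_;
    }
private:
    ProfileEntry& entry_;
    std::chrono::steady_clock::time_point start_;
};

ProfileEntry g_work_profile;  // record for the function below

// Hypothetical instrumented function: sums the integers 0 .. n-1.
long work(int n) {
    ScopedProbe probe(g_work_profile);
    long sum = 0;
    for (int i = 0; i < n; ++i)
        sum += i;
    return sum;
}
```

After three calls to `work`, `g_work_profile.calls` is 3 and `g_work_profile.total` holds the combined time of those calls. Self-time and loop iteration counts would need additional probes.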




Loop Profile Data Viewer GUI
Alternative to Plain Text Output

[Figure: viewer showing the Function Profile View and the Loop Profile View]

• Column headers allow selection to control sort criteria independently for the function and loop tables

• Menu to allow the user to enable filtering or display the source code








Intel® Parallel Building Blocks
Comprehensive Tools to Deliver Outstanding App Performance

Mix & Match to Optimize Your Application’s Performance

Intel® Cilk™ Plus
What: Language extensions to simplify task/data parallelism
Features / Why:
• 3 simple keywords & array notations for parallelism
• Support for task and data parallelism
• Semantics similar to serial code
• Simple way to parallelize your code
• Sequentially consistent, low overhead, powerful solution
• Supports C, C++, Windows and Linux

Intel® Threading Building Blocks
What: Widely used C++ template library for task parallelism
Features / Why:
• Parallel algorithms and data structures
• Scalable memory allocation and task scheduling
• Synchronization primitives
• Rich feature set for general purpose parallelism
• Available as open source or commercial license
• Supports C++, Windows, Linux, Mac OS X, other OSs

Intel® Array Building Blocks
What: Sophisticated C++ template library for vector parallelism
Features / Why:
• Automatically scales to future Intel platforms
• Use of cores, threads, SIMD, determined at runtime
• Use for flexible vector parallelism
• JIT & VM technology: flexible and powerful
• Supports C++, Windows & Linux




Parallel Building Blocks

• Intel® Cilk™ Plus (language extension to C/C++)
– New keywords: cilk_for, cilk_spawn, cilk_sync
– Vector notation for arrays
– Elemental functions (vector functions)

• Intel® Threading Building Blocks (Intel® TBB)
– C++ template library
– C++ Containers for tasks
– Flexible grain size
– Sophisticated built-in task scheduling
– Open source

• Intel® Array Building Blocks (Intel® ArBB)
– C++ library for parallel array operations
– Dynamic compilation allows architectural customization
– Array notation enabling parallel array operations
– ABI published




Intel® Cilk™ Plus

Intel Cilk Plus

Key Benefits

• Simple syntax which is very easy to learn and use

• Array notation guarantees fast vector code

• Pragmas guarantee vectorization of loops over arbitrary user code

• Fork/join tasking system is simple to understand and offers safety from errors

• Low overhead tasks offer scalability to high core counts

• Hyper objects enable reductions which give the same answers as serial code

• Mixes with Intel TBB and Intel ArBB for a complete task and vector parallel solution

Intel Cilk Plus

What is it?

• Compiler assisted solution offering a tasking system via 3 simple keywords

• Includes array notation to specify vector code

• Has a hyper objects library which offers powerful parallel data structures

• Based on 15 years of research at MIT

• Pragmas to force vectorization of loops and specify functions that can be applied to all elements of arrays





Intel® Cilk Plus: Examples

• Intel Cilk Plus: Keyword & Hyperobjects Example

cilk::reducer_list<float> pos;

void findnum(int *MAX, float *array, float val) {
    cilk_for (int i = 0; i < *MAX; i++)
        if (array[i] == val)
            pos.push_back(i);
}

• Intel Cilk Plus: CEAN & Elemental Functions Example

__declspec(vector) double ef_add(double x, double y) {
    return x + y;
}

a[:] = ef_add(b[:], c[:]);
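The property that a list reducer gives the same answer as serial code can be imitated in standard C++ without Cilk Plus. The sketch below is a single-threaded emulation under stated assumptions, not the Cilk Plus runtime: `find_serial` and `find_chunked` are hypothetical names, each chunk of the index range gets its own private list (standing in for a reducer view), the chunks are deliberately visited in reverse to mimic an arbitrary schedule, and the private lists are then concatenated in chunk order.

```cpp
#include <cstddef>
#include <vector>

// Serial reference: indices of all elements equal to val, in increasing order.
std::vector<int> find_serial(const std::vector<float>& a, float val) {
    std::vector<int> pos;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] == val)
            pos.push_back(static_cast<int>(i));
    return pos;
}

// Emulated reducer: one private list per chunk, merged in chunk order.
std::vector<int> find_chunked(const std::vector<float>& a, float val,
                              std::size_t chunks) {
    if (chunks == 0)
        chunks = 1;
    const std::size_t n = a.size();
    std::vector<std::vector<int>> views(chunks);

    // Visit the chunks in reverse order to mimic an arbitrary schedule.
    for (std::size_t c = chunks; c-- > 0;) {
        const std::size_t begin = c * n / chunks;
        const std::size_t end = (c + 1) * n / chunks;
        for (std::size_t i = begin; i < end; ++i)
            if (a[i] == val)
                views[c].push_back(static_cast<int>(i));
    }

    // Merge step: concatenating the views in chunk order restores the
    // order a serial loop would have produced.
    std::vector<int> result;
    for (const std::vector<int>& view : views)
        result.insert(result.end(), view.begin(), view.end());
    return result;
}
```

Whatever number of chunks is chosen, `find_chunked` returns the same list as `find_serial`, which is the behavior the slides attribute to hyperobject reductions as the number of cores varies.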




Intel® Cilk™ Plus – when to use

• Seeking task or vector parallelism

• Serial semantics task based parallelism is required

• Reduction operations need consistent answers as number of cores vary

• Need a compiled language with no JIT/VM capability

• A fork/join tasking model is sufficient

• Need to guarantee array notation or loops run as high performance vector code

• Vectorize loops over arbitrary user functions applied to entire arrays

Cilk Plus

A powerful yet simple & easy to learn compiler assisted capability offering low-overhead, high-performance task & vector parallelism





Intel® Array Building Blocks: In Beta

Intel ArBB

Key Benefits

• High performance and flexible vector parallelism

• Built-in data types for commonly used data

• Compile once/run everywhere

• Future proof – accommodates changing vector lengths

• No special compiler – easy to integrate incrementally into existing environments

• Mixes with Intel Cilk Plus and Intel TBB for a complete task and vector parallel solution

Intel ArBB

What is it?

• A C++ template library for flexible vector parallelism

• Utilizes a JIT and VM to offer high performance

• Runs vectors on multiple cores





Intel® Array Building Blocks: example

void findnum(dense<f32> array, f32 val,
             dense<usize>& results) {
  dense<boolean> locations = (array == val);
  dense<usize> matching_indices =
      indices(0, array.length());
  results = pack(matching_indices, locations);
}
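For readers without ArBB installed, the three whole-array steps in the example (compare, generate indices, pack) can be written out in standard C++. This is a serial sketch of what the operations compute, with a hypothetical function name. It does not use the ArBB API, and ArBB would execute each step in parallel over the whole container.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

std::vector<std::size_t> findnum_serial(const std::vector<float>& array,
                                        float val) {
    const std::size_t n = array.size();
    // locations = (array == val): element-wise comparison mask.
    std::vector<bool> locations(n);
    for (std::size_t i = 0; i < n; ++i) locations[i] = (array[i] == val);
    // matching_indices = indices(0, n): 0, 1, ..., n-1.
    std::vector<std::size_t> matching_indices(n);
    std::iota(matching_indices.begin(), matching_indices.end(), std::size_t{0});
    // results = pack(matching_indices, locations): keep masked elements only.
    std::vector<std::size_t> results;
    for (std::size_t i = 0; i < n; ++i)
        if (locations[i]) results.push_back(matching_indices[i]);
    assert(results.size() <= n);
    return results;
}
```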





Intel® Array Building Blocks – when to use

• Seeking a library based vector parallelism solution for C++

• Have array or vector rich calculations

• Seeking a compile once/run everywhere deployment model, based on JIT compilation

• Need deterministic execution

ArBB

A sophisticated C++ template library based capability offering vector parallelism using JIT technology for flexible performance & deployment





Guidance: Parallel Building Blocks

Select from a Variety of Powerful Tools to Aid Parallelism









Intel® Math Kernel Library

• Scaling performance on Intel processors

• Parallel implementations – shared and distributed memory

– Extensively threaded math functions with excellent scaling

– Threading in Vector Math Functions

– OpenMP* compatibility library supports Microsoft and GNU OpenMP implementations

• Maximize application performance

– Automatic runtime processor detection ensures great performance on whatever processor your application is running on

– Optimizations for recent Intel processors

– Cluster functionality is standard

• Function Domains

– Linear Algebra: BLAS, LAPACK

– Linear Algebra: Sparse Solvers

– Fast Fourier Transforms

– Vector Math Library

– Vector Statistical Library


Highly Optimized Math Library for Scientific, Engineering, Financial, and Energy Applications





Intel® Integrated Performance Primitives 7.0

Applications: Digital Media | Web/Enterprise Data | Embedded Communications | Scientific/Technical

Intel® Integrated Performance Primitives 16 Function Domains

Optimized 32-bit and 64-bit Multicore Performance

Multimedia

• Image Processing

• Color Conversion

• JPEG/JPEG2000

• Video Coding

• Computer Vision

• Realistic Rendering

High level APIs and Codecs Interfaces and Code Samples

Cross-platform C/C++ API for Code Re-use

Signal Processing

• Signal Processing

• Audio Coding

• Speech Coding

• Speech Recognition

• Vector Operations

Data Processing

• Data Compression

• Data Integrity

• Cryptography

• String Processing

• Matrix Operations





Intel® Integrated Performance Primitives What’s New In Version 7.0?

• New performance optimizations for the latest Intel processors

– Advanced Encryption Standard (AES) and CRC32C new instructions for dramatic performance increases in cryptography and data compression algorithms on Intel® Core i7 and later processors

• Windows Imaging Component (WIC) API support for faster and easier adoption of IPP image codecs by Windows developers.

• Improved JPEG codec multicore performance scaling (6x on 8 core machines)

• New JPEG-XR codec (a.k.a. HD Photo), a new image compression standard

– 2x the compression level for the same image quality, without needing more memory or computing resources

– Supports lossless and lossy compression as well as incremental decompression of specific image regions

– Supports higher dynamic range and color depth than existing image codecs

• Improved ease of use for Deferred Mode Image Processing (DMIP) via Visual Studio* Domain Specific Language graphical programming user interface









Where is my application…

Spending Time?

• Focus tuning on functions taking time

• See call stacks

• See time on source

Wasting Time?

• See cache misses on your source

• See functions sorted by # of cache misses

Waiting Too Long?

• See locks by wait time

• Red/Green for CPU utilization during wait

Intel® VTune™ Amplifier XE Performance Profiler

• Windows & Linux

• Low overhead

• No special recompiles

Advanced Profiling For Scalable Multicore Performance





Intel® VTune™ Amplifier XE

Tune Applications for Scalable Multicore Performance

• Fast, Accurate Performance Profiles

– Hotspot (Statistical call tree)

– Hardware-Event Based Sampling

• Thread Profiling

– Visualize thread interactions on timeline

– Balance workloads

• Easy set-up

– Pre-defined performance profiles

– Use a normal production build

• Compatible

– Microsoft, GCC, Intel compilers

– C/C++, Fortran, Assembly, .NET

– Latest Intel® processors and compatible processors (1)

• Find Answers Fast

– Filter extraneous data

– View results on the source / assembly

– Event multiplexing

• Windows or Linux

– Visual Studio Integration (Windows)

– Standalone user interface and command line

– 32 and 64-bit

(1) IA32 and Intel® 64 architectures. Many features work with compatible processors. Event based sampling requires a genuine Intel® Processor.





Kinds of Collection

• User Mode Sampling and Tracing Analysis

– Dynamically instruments the binary (configurable APIs)

– Uses an OS interrupt for each thread to collect samples, and keeps a sample only if the thread was active since the last sample

– Collects call stack

• Hardware Event-based Sampling Analysis

– Uses installed driver to configure and collect interrupts from the Performance Monitoring Unit of each Intel CPU Core.





Double Click from Grid or Timeline

See Profile Data On Source / Asm

Time on Source / Asm

Quickly scroll to hot spots. Scroll Bar “Heat Map” is an overview of hot spots

Click to jump and scroll the Asm

Quick Asm navigation: Select source to highlight Asm

Right click for instruction reference manual

Intel® VTune™ Amplifier XE






Timeline Visualizes Thread Behavior

• Optional: Use API to mark frames and user tasks

• Optional: Add a mark during collection

CPU Time

Hovers:

Transitions

Hotspots | Lightweight Hotspots | Locks & Waits





Profile a Running Application

No need to stop and re-launch the app when profiling

Two Techniques:

• Attach to Process:

– Hotspot

– Concurrency

– Locks & Waits

• Profile System:

– Lightweight Hotspots

– Advanced & Custom EBS

– Optional: Filter by process after collection

(Attach to process is currently only available for Windows)





Command Line Interface

• amplxe-cl is the command line.

• Linux: /opt/intel/vtune_amplifier_xe/bin[32|64]/amplxe-cl

• Windows: C:\Program Files\Intel\VTune Amplifier XE\bin[32|64]\amplxe-cl.exe

• To get detailed help:

• amplxe-cl -help

• Get Command Line from GUI

– Command examples:

1. amplxe-cl -collect-list

2. amplxe-cl -knob-list=hotspots

3. amplxe-cl -collect hotspots -- myapp.exe [MyParams]

4. amplxe-cl -report hotspots





Remote Data Collection

Conveniently analyze data collected on remote systems

1. Set up the experiment using GUI locally

2. Copy command line instructions to paste buffer

3. Open remote shell on target machine

4. Paste command line, run collection

5. Copy result file to your local system

6. Open file using local GUI

Local System: VTune™ Amplifier XE full user interface

Remote System: Lightweight command line collector

Local to remote: copy command line

Remote to local: copy results file

• Minimal “performance footprint” during collection

• Easy setup using GUI

• Easy analysis of results





Regression Testing

• Create Baseline:

$> amplxe-cl -collect hotspots -r BaseLinePerf -- myapp.exe

$> amplxe-cl -report hotspots -r BaseLinePerf

• Nightly Performance Regression Testing:

$> amplxe-cl -collect hotspots -r NightlyResults -- myapp.exe

$> amplxe-cl -report hotspots -r BaseLinePerf -r NightlyResults

[…stuff Deleted …]

Module Process Result 1:CPU Time Result 2:CPU Time Difference:CPU Time

myapp.exe myapp.exe 23.141 61.531 -38.391

…
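A nightly job still has to decide whether the comparison report indicates a regression. The sketch below is one hypothetical way to do that in C++: it parses rows shaped like the sample line above (module, process, baseline CPU time, nightly CPU time, difference) and flags modules whose nightly time exceeds the baseline by more than a tolerance. The struct, the function names, the tolerance parameter, and the assumption that fields are whitespace-separated are all illustrative. The real report layout should be checked against the tool's output.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

struct Row {
    std::string module;
    std::string process;
    double baseline = 0.0;  // Result 1: CPU Time
    double nightly = 0.0;   // Result 2: CPU Time
};

// Keeps only lines that parse as "module process number number number";
// header and filler lines fail the numeric parse and are skipped.
std::vector<Row> parse_rows(const std::string& text) {
    std::vector<Row> rows;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        Row r;
        double difference = 0.0;
        if (fields >> r.module >> r.process >> r.baseline >> r.nightly >> difference)
            rows.push_back(r);
    }
    return rows;
}

// Modules whose nightly time is more than `tolerance` (a fraction) slower.
std::vector<std::string> regressions(const std::vector<Row>& rows,
                                     double tolerance) {
    assert(tolerance >= 0.0);
    std::vector<std::string> slower;
    for (const Row& r : rows)
        if (r.nightly > r.baseline * (1.0 + tolerance)) slower.push_back(r.module);
    return slower;
}
```

With the sample row from the slide and a 10% tolerance, myapp.exe is reported, because 61.531 s is well above 23.141 s.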






Compare Results Quickly - Sort By Difference

• Quickly identify cause of regressions.

– Run a command line analysis daily

– Identify the function responsible so you know who to alert

• Compare 2 optimizations – What improved?

• Compare 2 systems – What didn’t speed up as much?





Readying Your Application for Intel® VTune™ Amplifier XE

• You should run Amplifier XE on a “Released/Optimized” build.

• Symbols allow you to view the Source (not just the assembly)

– Linux: -g

• Intel Threading Runtimes need instrumented runtimes

– TBB: Define TBB_USE_THREADING_TOOLS

– OpenMP: Use Intel Dynamic Version of OpenMP

• Call Stack Mode: requires the dynamic version of the C runtime library to properly attribute system calls

– Linux: do not use -static









Intel® Inspector XE 2011

Advancing Application Reliability, Code Quality and Security

• Powerful, Robust Dynamic Analysis

– Memory errors: Invalid Memory Accesses, Memory Leaks, Uninitialized Memory Accesses, Improper usage of Memory API(s), Resource Leaks (Windows only)

– Threading errors: Data Races, Deadlock/Lock Hierarchy Violation, Cross Stack Memory Accesses

• Productivity Features

– View Context of Problem (Stack, Multiple Contributing Source Lines)

– Bug does not have to occur to find it!

– Suppression, Filtering, and Workflow Management

– Time Line visualization

– Visual Studio Integration (Windows)

– Memory Leak Snapshots (Linux)

– Break into Debugger on Error (Linux)
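As an example of the "Data Races" category, consider several threads incrementing one shared counter. The sketch below, in standard C++ with a hypothetical function name, shows the corrected form. The comment marks the single line whose removal would leave an unsynchronized read-modify-write, which is the kind of access a threading-error analysis is meant to flag even on runs where the final count happens to come out right.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Increments a shared counter from several threads, with the increment
// protected by a mutex.
long count_with_lock(int threads, int increments_per_thread) {
    assert(threads > 0 && increments_per_thread >= 0);
    long counter = 0;
    std::mutex guard;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i) {
                // Removing this lock turns ++counter into a data race.
                std::lock_guard<std::mutex> lock(guard);
                ++counter;
            }
        });
    }
    for (std::thread& th : pool) th.join();
    return counter;
}
```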





Intel® Inspector XE: Process Flow

[Figure: process flow diagram. Application.cpp (+ .dll(s)) → Compile/Link → Application.exe (+ .dll(s)) → Execution/JIT Instrumentation → Runtime Data Collector → r###[t|m]#.insp (results) → Filter / Change State / Suppress → Reduced Data File. Configuration, Suppression and Filter settings feed into the analysis.]





Readying Your Sources

• Intel Inspector XE can analyze any native binary, but some switches (such as symbols and debug) make the results easier to read

– Linux: -O0 -g

• Threading Error Analysis: using the dynamic version of the C runtime library avoids false positives

– Linux: do not use -static

• Use of switches that implement similar functionality in the binary is not recommended

• Intel Threading Runtimes require switches to reduce false positives in Threading Error Analysis

– TBB: Define TBB_USE_THREADING_TOOLS

– Use the Dynamic Version of OpenMP Compatibility library supplied by the Intel® Compiler





Command Line Interface

• inspxe-cl is the command line.

– Windows: C:\Program Files\Intel\Inspector XE\bin[32|64]\inspxe-cl.exe

– Linux: /opt/intel/inspector_xe/bin[32|64]/inspxe-cl

• To get detailed help:

inspxe-cl -help

• Get Command Line from GUI

• Command examples:

1. inspxe-cl -collect-list

2. inspxe-cl -collect ti2 -- MyApp.exe

3. inspxe-cl -report problems

More Help is available with the Online Documentation





Remote Data Collection

Conveniently analyze data collected on remote systems

1. Set up the experiment using GUI locally

2. Copy command line instructions to paste buffer

3. Open remote shell on target machine

4. Paste command line, run collection

5. Copy result file to your local system

6. Open file using local GUI

Local System: Inspector XE full user interface

Remote System: Lightweight command line collector

Local to remote: copy command line

Remote to local: copy results file





Regression Testing

• Create Baseline Suppression File:

$> inspxe-cl -collect ti2 -r BaseLineResults -- App.exe

$> inspxe-cl -create-suppression-file myThread.sup -result-dir BaseLineResults

• Nightly Performance Regression Testing:

$> inspxe-cl -collect ti2 -suppression-file myThread.sup -r NightlyTestResults -- App.exe

[…Stuff Deleted…]

0 new problem(s) found
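To turn that summary line into a pass/fail signal for a nightly job, the captured output can be scanned for the count. The following C++ sketch assumes the summary has exactly the form shown above ("N new problem(s) found"). The function names are hypothetical, and the wording of the real tool output may differ between versions.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Returns N from a line of the form "N new problem(s) found",
// or -1 if no such line is present.
int new_problem_count(const std::string& output) {
    const std::string marker = " new problem(s) found";
    std::istringstream lines(output);
    std::string line;
    while (std::getline(lines, line)) {
        const std::string::size_type pos = line.find(marker);
        if (pos == std::string::npos) continue;
        std::istringstream number(line.substr(0, pos));
        int n = 0;
        if (number >> n && n >= 0) return n;
    }
    return -1;
}

// A run passes only if the summary line exists and reports zero problems.
bool nightly_passes(const std::string& output) {
    const int n = new_problem_count(output);
    assert(n >= -1);  // new_problem_count never returns less than -1
    return n == 0;
}
```

Treating a missing summary line as a failure is a deliberate choice here: a crashed or truncated run should not be mistaken for a clean one.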





Summary

Intel Performance-Oriented Compiler Suites

Performance Compatibility Support

Intel® C++ Composer XE 2011

• Intel® C++ Compiler XE 12.0

• Intel® Parallel Debugger Extension

• Intel® Math Kernel Library

• Intel® Integrated Performance Primitives

Intel® Fortran Composer XE 2011

• Intel® Fortran Compiler XE 12.0

• Intel® Parallel Debugger

• Intel® Math Kernel Library

• Intel® Integrated Performance Primitives

Intel Composer XE 2011

• Combines Intel C++ Composer XE and Intel® Fortran Composer XE

• For Fortran developers who also want Intel C++

• Windows, Linux only

• Windows: Integrates into Microsoft* Visual Studio*, Intel C++/Visual C++* Compatibility

• Linux: Integrates into Eclipse* CDT, Intel C++ Compatible with GCC

• Mac OS: Integrates into XCode* Environment, Compatible with GCC

• All: 1 Year Premier Support Renewable Annually




Questions?





