Software & Services Group
Developer Products Division Copyright© 2011, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Advanced Features of Intel® C++ Composer XE for Linux
Jeff Arnold
Intel Corporation
18 February 2011
Agenda
• Preliminaries
• Intel® Parallel Studio XE 2011
• Intel® C++ Composer XE
• Intel® Parallel Building Blocks
– Intel® Cilk™ Plus
– Intel® Array Building Blocks
• Performance Libraries
• Intel® VTune™ Amplifier XE 2011
• Intel® Inspector XE 2011
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.
BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Cilk, Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries.*Other names and brands may be claimed as the property of others.
Copyright © 2011. Intel Corporation.
http://intel.com/software/products
Optimization Notice
Intel® compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the “Intel® Compiler User and Reference Guides” under “Compiler Options." Many library routines that are part of Intel® compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel® compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.
Intel® compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.
While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel® and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not.
Notice revision #20101101
Hickory, Dickory, Dock – The ISV Development Clock
[Figure: microarchitecture timeline: Enhanced Intel® Pentium® M Microarchitecture (Yonah), Intel® Core™ Microarchitecture (Merom), Intel® Microarchitecture codename Nehalem, Future Intel® Microarchitecture, …]
Architectural and microarchitectural changes require software changes to realize the full benefit.
All dates, product descriptions, availability, and plans are forecasts and subject to change without notice.
More and Moore Cores
• The trend toward multi-core mobile, desktop, and server processors is expected to continue into the foreseeable future.
• Software must be ready to take full advantage of it.
Dual core
• Symmetric multithreading
Multi-core array
• CMP with ~10 cores
Many-core array
• CMP with 10s-100s low power cores
• Scalar cores
• Capable of TFLOPS+
• Full System-on-Chip
• Servers, workstations, embedded…
Large, scalar cores for high single-thread performance
Scalar plus many core for highly threaded workloads
CMP = Chip Multi-Processing
All dates, product descriptions, availability, and plans are forecasts and subject to change without notice.
Intel® Parallel Studio XE 2011: Powerful tools to create fast, reliable and secure code
Phase | Productivity Tool | Feature | Benefit
Advanced Build & Debug | Intel® Composer XE | C/C++ and Fortran compilers, performance libraries, and parallel models | Application performance, scalability and quality for current multicore and future many-core systems
Advanced Verify | Intel® Inspector XE | Memory and threading error checking tool for higher code reliability and quality | Increases productivity and lowers cost by catching memory and threading defects early
Advanced Tune | Intel® VTune™ Amplifier XE | Performance profiler to optimize performance and scalability | Removes guesswork, saves time, and makes it easier to find performance and scalability bottlenecks; combines ease of use with deeper insights
Today’s Focus: Intel® Composer XE
Get Outstanding Application Performance from Intel Compiler Suite Products
New Names, Same Great Tradition of Compilers & Library Performance
Old | New
Intel® C++ Compiler, Professional Edition for Windows* | Intel® C++ Composer XE for Windows*
Intel® Visual Fortran Compiler, Professional Edition for Windows* with IMSL* | Intel® Visual Fortran Composer XE for Windows*
Intel® Visual Fortran Compiler, Professional Edition for Windows* with IMSL* | Intel® Visual Fortran Composer XE for Windows* with IMSL*
Intel® Compiler Suite, Professional Edition for Windows* | Intel® Composer XE for Windows*
Intel® C++ Compiler, Professional Edition for Linux* | Intel® C++ Composer XE for Linux*
Intel® Fortran Compiler, Professional Edition for Linux* | Intel® Fortran Composer XE for Linux*
Intel® Compiler Suite, Professional Edition for Linux* | Intel® Composer XE for Linux*
Intel® C++ Compiler, Professional Edition for Mac OS X* | Intel® C++ Composer XE for Mac OS X*
Intel® Fortran Compiler, Professional Edition for Mac OS X* | Intel® Fortran Composer XE for Mac OS X*
Intel Performance-Oriented Compiler Suites
Compilers, Performance Libraries, Debugging Tools: C++, Fortran on Windows, Linux and Mac OS X
Performance, Compatibility, Support
Intel® C++ Composer XE 2011
• Intel® C++ Compiler XE 12.0
• Intel® Parallel Debugger Extension
• Intel® Parallel Building Blocks
• Intel® Math Kernel Library
• Intel® Integrated Performance Primitives
Intel® Fortran Composer XE 2011
• Intel® Fortran Compiler XE 12.0
• Intel® Parallel Debugger
• Intel® Math Kernel Library
• Intel® Integrated Performance Primitives
Intel® Composer XE 2011
• Combines Intel® C++ Composer XE and Intel® Fortran Composer XE
• For Fortran developers who also want Intel C++
• Windows, Linux only
Compatibility and support by platform
• Windows: Integrates into Microsoft* Visual Studio*, Intel C++/Visual C++ Compatibility
• Linux: Integrates into Eclipse CDT, Intel C++ Compatible with GCC
• Mac OS: Integrates into XCode Environment, Compatible with GCC
• All: 1 Year Premier Support Renewable Annually
What’s New: Intel® Composer XE
• Major release of C/C++ and Fortran compilers v12.0
• Advanced C/C++ parallelism with Intel® Parallel Building Blocks
• Advanced vectorization with SIMD pragmas
• Co-array Fortran and more Fortran 2008 support
• Updated versions of Intel® MKL & Intel® IPP
[Figures: SIMD pragma; parallel program debugging]
Intel® C++ Composer XE
What’s New!
• Improved performance
• Subset of C++0x in support of Visual C++ compatibility
• Support for Visual Studio* 2010 (continuing 2005 & 2008 support)
• Enhanced vectorization capabilities: GAP and SIMD pragmas
• Parallelism Discovery Assistant – enhanced loop profiler
• Expanded parallelism-dev features: Intel® Parallel Building Blocks
• Fortran 2003 support and many Fortran 2008 features, including Co-Array Fortran
• Improved Intel performance libraries integration: Intel® Math Kernel Library, Intel® Integrated Performance Primitives
• New hardware support: Intel® Sandy Bridge
• Many Intel Core Architecture – MICA – extensions (beta)
• 32-bit and 64-bit support
• Windows*, Linux* and Mac* OS X
Ongoing Commitment to Innovation & Standards
Intel® C++: Compatibility and Performance Leadership
• Intel® Cilk™ Plus: Easy to use language extensions for array syntax that deliver great performance through parallelism and more readable syntax
• Staying on top of the performance heap
– Enhanced vectorization and auto-parallelization that apply to more situations in code. Developers love seeing this in their build logs.
– Low overhead loop and function profiling shows hotspots and where to introduce threads
• Guided Auto Parallelism suggests code changes to get the compiler to auto-vectorize and/or auto-parallelize, a great productivity tool that delivers great performance
• More C++0x and C99 standards support for enhanced compatibility with Visual C++
• Even more performance from optimized string intrinsics that use Intel® SSE 4.2 instructions
Compatibility to Standards
The Intel C++ Compiler provides the following language conformances
- ANSI/ISO standard for C language compilation (ISO/IEC 9899:1990)
- ANSI/ISO standard (ISO/IEC 14882:1998) for the C++ language
The Fortran Compiler provides the following language conformances
- Fortran 95 language standard
- Fortran 90 language standard
- Fortran 77 language standard
- Fortran IV
- Includes also many features from the Fortran 2003 language standard, as well as numerous popular language extensions.
Compatibility with GNU/Linux
Source (mostly) and binary compatible
• Mixing and matching binary files created by g++, including third-party libraries
• Generating C++ code compatible with gcc/g++ 3.2 or higher (up to 4.3)
• Improved support for command-line options offered in the GNU compilers
• Support of most GNU C and C++ language extensions
• Glibc 2.3.2, 2.3.4, 2.3.5 or 2.8
• Linux Kernel 2.4.x or 2.6.x
Limitations
• Intel Fortran Compiler for Linux is not binary compatible with GNU g77 or GNU gfortran compiler
Interprocedural Optimization (IPO)
• Cross-module optimization
• IPO is a seamless process; most of the optimization actually happens during the link phase
• Benefits of IPO
– Optimization of a large number of frequently used small and medium functions, especially those called in loops
– Function inlining
– Eliminates the need for argument setup and call branch/return overhead
– Enables opportunities for other optimizations (constant propagation, dead code elimination, etc.)
– Dead code elimination, better register usage
– Improved alias analysis for better auto-vectorization & loop transformations
• May increase build-time/binary size
Interprocedural Optimizations (IPO)
• ip: Enables inter-procedural optimizations for current source file compilation
• ipo: Enables inter-procedural optimizations across files
– Can inline functions in separate files
– Permits inlining and other inter-procedural optimizations among multiple source files. The optional value argument controls the maximum number of link-time compilations (or number of object files) spawned. Default for value is 0 (the compiler chooses).
– Enhances optimization when used in combination with other compiler features
Linux* | Windows*
-ip | /Qip
-ipo | /Qipo
Other Techniques for Inlining Functions
• Compiler Switches
– Increase information provided to the compiler
-ipo, -prof_use (Linux), /Qipo, /Qprof-use (Windows)
– Change Compiler Heuristics
-inline-factor=n (default=100), /Qinline-factor=n
-inline-level=0|1|2, /Ob0|1|2
• Inlining source code features
– GCC C/C++
__attribute__((always_inline))
__attribute__((noinline))
– Microsoft* C/C++
Keywords: inline, __inline, __forceinline
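The GCC-style attributes listed above can be sketched as follows. This is a minimal illustration, assuming a GCC-compatible compiler (such as the Intel compiler on Linux); the function names scale, checked_scale and sum_scaled are hypothetical and not taken from the slides.

```cpp
#include <cstddef>

// Request that this small helper is always inlined into its callers.
__attribute__((always_inline)) inline double scale(double x, double factor) {
    return x * factor;
}

// Request that this function is never inlined (for example, a rarely used path).
__attribute__((noinline)) double checked_scale(double x, double factor) {
    return factor == 0.0 ? 0.0 : scale(x, factor);
}

// A small function called in a loop: the kind of call site where inlining
// removes argument setup and call/return overhead.
double sum_scaled(const double* v, std::size_t n, double factor) {
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        total += scale(v[i], factor);
    }
    return total;
}
```

The attributes only express intent for individual functions; the switches above change the heuristics for the whole compilation.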
Auto-Vectorization
• Auto-vectorizer exploits SIMD/DLP opportunities
– Auto-vectorizes sequential operations using SSE and AVX instructions
– No significant changes to source-code
– Much easier to learn, debug, maintain
– Forward looking with respect to compilers and processors
• Optimized code for targeted processor(s)
– Both Intel and AMD*
– Mixed processors environment supported as well
• Processor Specific Optimization
– Targeting specific Intel Processor(s)
– e.g. for Intel® Core i7 use -xSSE4.2
• Auto-dispatch: Processor Optimized Optimization
– Includes both optimized and generic (SSE2) code-paths
– e.g. for Intel® Core i7 use -axSSE4.2
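As an illustration of the kind of loop an auto-vectorizer typically handles, here is a minimal sketch. The function name saxpy is an assumption for this example, and __restrict is a compiler extension (accepted by GCC, Clang and the Intel compiler) used to promise that the two arrays do not overlap. Whether the loop is actually vectorized depends on the compiler and the switches used.

```cpp
#include <cstddef>

// y[i] = a * x[i] + y[i] over contiguous arrays.
// Unit-stride accesses, no calls, and no dependence between iterations:
// properties that usually let the compiler emit SSE or AVX instructions.
void saxpy(std::size_t n, float a, const float* __restrict x, float* __restrict y) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The source stays plain C++; only the compile-time switches select the instruction set.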
Vectorization Switches
Group 1: -m<extension> such as -msse3
• Optimizes for both Intel® and compatible, non-Intel processors
Group 2: -x<extension> such as -xAVX
• Targets Intel® processors only
• Application will not start on non-Intel processors or if instruction set is not available
Group 3: -ax<extension> such as -axSSE4.2
• Creates default and additional processor-specific paths
• Processor-specific path(s), for Intel® processors only, defined by <extension>
• Default code path is -msse2 unless explicitly modified
• Default code path can be changed using an additional switch from group 1 or 2
• Multiple processor-specific paths can be specified
Key Intel® Advanced Vector Extensions (Intel® AVX) Features
KEY FEATURES | BENEFITS
Wider vectors: increased from 128 bit to 256 bit | Up to 2x peak FLOPs (floating point operations per second) output with good power efficiency
Enhanced data rearrangement: use the new 256 bit primitives to broadcast, mask loads and permute data | Organize, access and pull only necessary data more quickly and efficiently
Three and four operands, non-destructive syntax: designed for efficiency and future extensibility | Fewer register copies, better register use for both vector and scalar code
Flexible unaligned memory access support | More opportunities to fuse load and compute operations
Intel® AVX is a general purpose architecture, expected to supplant SSE in all applications used today
Intel® Advanced Vector Extensions (Intel® AVX): 2X Vector Width
A 256-bit vector extension to SSE
• Intel® AVX extends all 16 XMM registers to 256 bits
• Intel AVX works on either
– The whole 256 bits, for FP instructions
– The lower 128 bits (like existing SSE instructions)
  – A drop-in replacement for all existing scalar/128-bit SSE instructions
  – The upper part of the register is zeroed out
• Intel AVX targets a high-performance first implementation
– 256-bit Multiply, Add and Shuffle engines (2X today)
– 2nd load port
[Figure: register YMM0 (256 bits, 2010) with XMM0 (128 bits, 1999) as its lower half]
SIMD: Single Instruction, Multiple Data
• Scalar mode
– one instruction produces one result
• SIMD processing
– with SSE or AVX instructions
– one instruction can produce multiple results
[Figure: scalar addition of single values X and Y compared with SIMD addition of eight-element vectors]
X: x7 x6 x5 x4 x3 x2 x1 x0
Y: y7 y6 y5 y4 y3 y2 y1 y0
X + Y: x7+y7 x6+y6 x5+y5 x4+y4 x3+y3 x2+y2 x1+y1 x0+y0
SSE and AVX-128 Data Types
SSE
• 4x floats
SSE-2
• 16x bytes
• 8x 16-bit shorts
• 4x 32-bit integers
• 2x 64-bit integers
• 1x 128-bit(!) integer
• 2x doubles
AVX-256 Data Types on “Sandy Bridge”
Now
• 8x floats
• 4x doubles
Possible future implementations?
• 32x bytes
• 16x 16-bit shorts
• 8x 32-bit integers
• 4x 64-bit integers
• 2x 128-bit(!) integers
Compiling for Intel® AVX (high level)
• Compile with -xavx
– Intel processors only
– Vectorization works just as for SSE
– Main speedups are for floating point
– No integer 256 bit instructions in first generation
– Up to ~1.8x performance for Linpack
– Best if 32 byte aligned
– More loops can be vectorized than with SSE
– Individually masked data elements
– More powerful data rearrangement instructions
• -axavx gives both SSE and AVX code paths
– use -x or -m switches to modify the default SSE code path
– e.g. -axavx -xsse4.2 to target Nehalem and AVX
• Math libraries may target AVX automatically at runtime
Intel® AVX Intrinsics
• Found in immintrin.h
• Names typically begin with _mm256_
– E.g. _mm256_add_pd()
– SSE intrinsics typically begin with _mm_
• New data types:
– __m256 holds 8 32-bit floats
– __m256d holds 4 64-bit doubles
– __m256i holds integers:
32 8-bit, 16 16-bit, 8 32-bit or 4 64-bit
– Intrinsics may also use SSE data types __m128i etc
• Manual cpu dispatch (temporary names; Intel processors only)
– __declspec(cpu_specific(future_cpu_16))
– __declspec(cpu_dispatch(future_cpu_16,…))
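A minimal sketch of the _mm256_ intrinsics and the __m256d type in use. The function name add_doubles is an assumption for this example. The AVX path is compiled only when the compiler defines __AVX__ (for example with -xavx or -mavx); otherwise the scalar loop does all the work, so the sketch builds on any target.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>  // AVX intrinsics and the __m256d data type
#endif

// c[i] = a[i] + b[i] for n doubles.
void add_doubles(const double* a, const double* b, double* c, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX__)
    // Process four doubles (256 bits) per iteration.
    for (; i + 4 <= n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);  // unaligned load of 4 doubles
        __m256d vb = _mm256_loadu_pd(b + i);
        _mm256_storeu_pd(c + i, _mm256_add_pd(va, vb));
    }
#endif
    // Scalar remainder, or the whole array when AVX is not enabled.
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

Unaligned loads and stores are used here so that the caller does not have to guarantee 32-byte alignment.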
Automatic Vectorization by Compiler
Translates Loops into SIMD Parallelism
• Loop is strip-mined (unrolled): strip length of 8 for floats with AVX, cf. 4 for floats with SSE
for (i=0;i<=MAX;i++)
    c[i]=a[i]+b[i];
[Figure, labeled “128-bit Registers”: elements A[7]..A[0] and B[7]..B[0] are added lane by lane to give C[7]..C[0]]
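Strip-mining can be written out by hand to show what the compiler does conceptually. This is only a sketch: the names add_stripmined and kStrip are assumptions, and the inner loop stands in for a single SIMD add on one strip; a real compiler emits vector instructions instead.

```cpp
#include <cstddef>

// Assumed strip length: 8 floats fill one 256-bit AVX register.
constexpr std::size_t kStrip = 8;

void add_stripmined(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
    // Main loop: whole strips of kStrip elements.
    for (; i + kStrip <= n; i += kStrip) {
        // Conceptually one SIMD instruction over the strip.
        for (std::size_t lane = 0; lane < kStrip; ++lane) {
            c[i + lane] = a[i + lane] + b[i + lane];
        }
    }
    // Remainder loop: elements left over when n is not a multiple of kStrip.
    for (; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

With SSE the same structure applies with a strip length of 4 floats.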
Features of AVX loads on Sandy Bridge
• Performance of vmovupd is as good as vmovapd when the data is 32 byte aligned
– Therefore, compiler never generates vmovapd, only vmovupd
– No alignment faults if data is not always aligned
• Performance of 32 byte aligned loads is better than unaligned loads
– Try to align your data
• Performance of two 16 byte loads may be better than one unaligned 32 byte load
– Compiler may split 32 byte loads into two 16 byte loads if known to be unaligned, or if 32 byte alignment unknown
• Performance of 16 byte unaligned loads not much worse than aligned 16 byte loads (similar to Nehalem)
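One way to follow the advice to align data is the standard C++11 alignas specifier. The sketch below is an assumption-laden illustration: the names is_aligned32, g_buffer and aligned_buffer are hypothetical, and heap allocations would need a different mechanism.

```cpp
#include <cstddef>
#include <cstdint>

// True if p lies on a 32-byte boundary (the AVX register width).
bool is_aligned32(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}

// Statically allocated buffer with 32-byte alignment requested via alignas.
alignas(32) static double g_buffer[1024];

double* aligned_buffer() {
    return g_buffer;
}
```

A pointer offset by one double (8 bytes) from the start of the buffer is no longer 32-byte aligned, which is the case where the compiler may split a load into two 16 byte loads.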
Mixing AVX-256 and SSE instructions
• Legacy Intel® SSE instructions preserve the value of the upper 128 bits of a YMM register
– 128-bit Intel® AVX instructions will zero the upper 128 bits
• There is a performance penalty when switching between 256-bit Intel AVX and SSE
– due to save/restore of the upper 128 bits
– With -xavx, compiler will prefer AVX-128 to SSE
• User advice: avoid mixing functions with AVX-256 and functions with SSE that call each other
– Where possible, recompile with -xavx
  – Automatically converts SSE intrinsics to AVX-128
  – Automatically converts SSE inline assembly to AVX-128
3rd Party Support for Intel® AVX
• GNU* tools
– gcc 4.4.1 (AVX-128 bit) and later
– binutils 2.20.51.0.1 and later
– objdump for disassembly
– gdb 6.8.50.20090915 and later
• Microsoft* Visual Studio* 2010
– Compiler and optimizer support
– /arch:AVX
– Intrinsics
– MASM
– Disassembly
– Debugger support for YMM registers
Performance Libraries
• Intel® MKL has had Intel® AVX tunings since MKL 10.2.0
– mkl_enable_instructions() activated it (64-bit only)
• Further Intel AVX optimizations in MKL 10.3
– DGEMM & SGEMM optimizations
– All BLAS level 3 functions
– LU/Cholesky/QR & eigensolvers in LAPACK
– FFTs of lengths 2^n
– VML/VSL
– no special activation needed
• Intel® IPP has supported Intel AVX since IPP 6.1
– “g9” code for IA-32, “e9” code for Intel 64
– Automatic optimization using the compiler switch /Qxavx (-xavx)
– Certain functions have been hand-optimized for AVX. http://software.intel.com/en-us/articles/intel-ipp-functions-optimized-for-intel-avx-intel-advanced-vector-extensions/
• Further Intel AVX optimizations in IPP 7.0
– Hand-optimized functions for Image Compression
Many routines in the Intel® MKL and IPP libraries are more highly optimized for Intel microprocessors than for non-Intel microprocessors.
Auto-Parallelization
• Serial portion of code is automatically translated into multi-threaded code when possible
– Performs dataflow analysis to verify correct parallel execution
– Partitions data for threaded code
• Parallel runtime support offers same features as in OpenMP*
– Handling details of loop iteration modification
– Thread scheduling
– Synchronization
• Enabled by -parallel switch
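To illustrate what the dataflow analysis has to distinguish, here is a sketch of two loops. The function names are hypothetical, and whether a given compiler actually parallelizes the first loop depends on its analysis and cost heuristics.

```cpp
#include <cstddef>
#include <vector>

// Independent iterations: out[i] depends only on in[i], so the iterations
// could in principle be divided among threads.
void square_all(const std::vector<double>& in, std::vector<double>& out) {
    out.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[i] * in[i];
    }
}

// Loop-carried dependence: iteration i reads the value written by
// iteration i - 1, so splitting the iterations naively would be incorrect.
void prefix_sum(std::vector<double>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        v[i] += v[i - 1];
    }
}
```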
Intel® Guided Auto Parallelism
Let the Compiler Tell You What it Needs
• Motivation
– Effective, simplified way to add parallelism to applications
– Use built-in compiler technology to speed parallelism development
• What is GAP?
– Compiler-based analyzer that provides guidance to developers to change code so it can be compiled to automatically optimize code through vectorization, parallelization, or data transformation
– Built upon existing auto-vectorization and auto-parallelization technology
• GAP does not
– Analyze code and find hotspots for threading (see Advisor)
– Verify threading correctness (use Inspector)
– Do any performance/hotspot analysis (use VTune Amplifier)
Developer Must Verify Semantics of GAP Recommendations
Using GAP
• Requires optimization level set to -O2 or higher
– Works both from the command line and in the Eclipse IDE
– Neither IPO nor PGO is required, but advice may change if used
– User may apply all or a subset of the advice provided by GAP
– When multiple messages apply to a given loop, ALL suggestions for that loop must be applied to get the desired optimization
• User can specify regions of a file or routine that are considered important to optimize
– Advice will be restricted to that region
– Default is to provide advice on entire compilation-unit
• Advice may involve
– Suggestions for source changes that assert new properties
– Adding pragmas for loop if semantics are satisfied
– Adding new compilation options
• GAP output is a set of GAP messages, not .exe
Intel® Guided Auto Parallelism: GAP Workflow
[Figure: GAP workflow. Traditional hotspot analysis: application source (C/C++/Fortran) is compiled into an application binary, and performance tools identify hotspots and problems. GAP: application source plus hotspots goes to the compiler in advice mode, which produces advice messages; the modified application source is then compiled (with extra options) into an improved application binary.]
Compiler suggests source modifications to enable vectorization and parallelization
Feed modified source back to compiler for optimization
Optimization Reports
• GAP – Guided Auto Parallelism
– New in Intel® Parallel Composer 2011
– -guide switch
– Provides advice on source changes that could enable parallelism
– Provides analysis and suggestions; doesn’t actually generate code
• Other reports
– -vec-report
  – Which loops were vectorized, which were not
  – Why they weren’t vectorized
– -par-report
– -opt-report
  – Reports available for a variety of optimizations
  – icc -help reports for more details
High-Level Optimizations (HLO)
• Enabled with -O3
– With auto-vectorization, it does more aggressive data dependency analysis than at -O2
– Exploits properties of source code (loops & arrays)
– Best chance for performing loop transformations
• Performs loop transformations
– Loop distribution
– Loop interchange
– Loop fusion
– Loop unrolling
– Data pre-fetching
– PGO based loop unrolling
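Loop interchange, one of the transformations listed above, can be shown by writing both loop orders by hand. This is a sketch with hypothetical names and array sizes; at -O3 the compiler may perform the interchange itself when it can prove it is legal and profitable.

```cpp
#include <cstddef>

constexpr std::size_t kRows = 64;
constexpr std::size_t kCols = 48;

// Before interchange: the inner loop walks down a column of a row-major
// array, so consecutive accesses are kCols elements apart.
void scale_column_order(const double (&a)[kRows][kCols], double (&b)[kRows][kCols]) {
    for (std::size_t j = 0; j < kCols; ++j) {
        for (std::size_t i = 0; i < kRows; ++i) {
            b[i][j] = 2.0 * a[i][j];
        }
    }
}

// After interchange: the inner loop walks along a row with unit stride,
// which is friendlier to caches and to vectorization.
void scale_row_order(const double (&a)[kRows][kCols], double (&b)[kRows][kCols]) {
    for (std::size_t i = 0; i < kRows; ++i) {
        for (std::size_t j = 0; j < kCols; ++j) {
            b[i][j] = 2.0 * a[i][j];
        }
    }
}
```

Both versions compute the same result; only the memory access order differs.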
New functionality in 12.0
• Choice of precision for math functions
– Lower precision may give better performance
– In 11.1 and earlier:
  – high precision for libm (~0.55 ulp)
  – lower precision for libsvml (< 4 ulp)
• Bitwise reproducible libraries
– Identical results on different processors
  – E.g. Intel® Core™2 Duo, Intel® Core™ i7, AMD* processors
– In prior versions, CPU dispatch could cause differences
– There may still be differences between:
  – IA-32 and Intel® 64
  – different compiler versions
– Achieved by calling high-accuracy function versions that use instructions available to all processors
– There will be some cost in performance
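To make "bitwise reproducible" concrete, here is a small sketch in standard C++ of how results from two runs (for example on two different processors) could be compared bit for bit. The helper names are hypothetical; the sketch assumes a 64-bit IEEE-style `double` and is not part of the Intel libraries.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "sketch assumes a 64-bit double");

// Reinterpret the bytes of a double as an integer without undefined behavior.
std::uint64_t bits_of(double x) {
    std::uint64_t u;
    std::memcpy(&u, &x, sizeof u);
    return u;
}

// True only if every bit matches; stricter than operator==.
bool bitwise_equal(double a, double b) {
    return bits_of(a) == bits_of(b);
}

// Count how many of n results differ bitwise between two runs.
std::size_t count_bitwise_differences(const double* run_a,
                                      const double* run_b, std::size_t n) {
    assert(run_a != nullptr && run_b != nullptr);
    std::size_t differences = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (!bitwise_equal(run_a[i], run_b[i])) ++differences;
    return differences;
}
```

Note that `0.0 == -0.0` is true while the two values are not bitwise equal, so a bitwise check is the stricter of the two tests.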
What you need to know
• -fimf-precision=<high|medium|low>
– Default is off (compiler chooses)
  – Typically high for scalar code, medium for vector code
– low typically halves the number of mantissa bits
– high ~0.55 ulp
– medium < 4 ulp (typically 2)
• -fimf-arch-consistency=<true|false>
– Will produce consistent results on all microarchitectures or processors within the same architecture
– Run-time performance may decrease
– Default is false (even with -fp-model precise!)
More detail
• Can specify at the function level
– -fimf-precision=<high|medium|low>[:fnlist]
  – e.g. -fimf-precision=low:func1,func2,func3
– -fimf-arch-consistency=<true|false>[:fnlist]
• Can specify desired accuracy in different ways:
– -fimf-max-error=ulps‡[:fnlist]
  – Maximum relative error
  – E.g. -fimf-max-error=0.6 for high accuracy
– -fimf-absolute-error=value[:fnlist]
  – Max absolute error specified as a floating-point number
– -fimf-accuracy-bits=bits[:fnlist]
  – Required accuracy specified as a number of mantissa bits
‡ulps = Units in the Last Place
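Since several of these switches are expressed in ulps, a short sketch of how an error in ulps can be measured may help. This is standard C++ with hypothetical helper names, assuming a single-precision result is checked against a double-precision reference; it is one common way to estimate ulp error, not the compiler's own definition.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Spacing between |x| and the next larger float: one unit in the last place.
// Intended for finite values below the largest float.
float ulp_of(float x) {
    assert(std::isfinite(x));
    x = std::fabs(x);
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

// Error of a float result against a more precise reference, expressed in
// ulps of the reference rounded to float.
double error_in_ulps(float computed, double reference) {
    const float ref_f = static_cast<float>(reference);
    return std::fabs(static_cast<double>(computed) - reference) / ulp_of(ref_f);
}
```

For example, a correctly rounded result is at most 0.5 ulp from the exact value, which is why the high-accuracy figure quoted earlier (~0.55 ulp) is close to the best achievable.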
Implementation
• No new run-time libraries, but new entry points
– High accuracy functions typically have names ending in _ha
– Low accuracy functions typically have names ending in _ep
  – “extra performance”
  – About half the number of bits of the high accuracy version
– Bit-wise reproducible functions typically have names starting with __bwr or terminating in _br
  – In some cases, the _ha functions are bit-wise reproducible and so no _br version is needed
• New compile-time library libiml_attr
– Tells the compiler which libm entry point to call
– .so or .dll located in bin directory
Loop Profiler: Identify Time-Consuming Loops/Functions
Enables targeting parallelization and optimization efforts at the most significant code areas (hotspot identification)
• Simple to use
– Add compiler option to command line to instrument application
– Compiler adds instrumentation calls to loops and function entry and exit points
– Run application to get profile report file
– Both a human-readable text file (a table) and an XML-file are generated
– Analyze data by looking at raw text file or use GUI viewer shipped with compiler
• Report file contains for example
– Call count of routines
– Self-time of functions / loops
– Total-time of functions / loops
– Average, minimum, maximum iteration count of loops
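The iteration statistics in the report can be pictured with a hand-instrumented loop. The sketch below is standard C++ with hypothetical names (`LoopStats`, `sum_to`); it only imitates the kind of bookkeeping described above and makes no claim about how the compiler's instrumentation is actually implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Per-loop counters similar in spirit to the report columns.
struct LoopStats {
    std::uint64_t entries = 0;           // how many times the loop was entered
    std::uint64_t total_iterations = 0;  // iterations summed over all entries
    std::uint64_t min_iterations = std::numeric_limits<std::uint64_t>::max();
    std::uint64_t max_iterations = 0;

    void record(std::uint64_t iterations) {
        ++entries;
        total_iterations += iterations;
        min_iterations = std::min(min_iterations, iterations);
        max_iterations = std::max(max_iterations, iterations);
    }

    double average_iterations() const {
        assert(entries > 0);
        return static_cast<double>(total_iterations) /
               static_cast<double>(entries);
    }
};

// A loop instrumented by hand: count iterations, then record them on exit.
std::uint64_t sum_to(std::uint64_t n, LoopStats& stats) {
    std::uint64_t sum = 0;
    std::uint64_t iterations = 0;
    for (std::uint64_t i = 1; i <= n; ++i) {
        sum += i;
        ++iterations;
    }
    stats.record(iterations);
    return sum;
}
```

Calling `sum_to` several times with different `n` and then reading the `LoopStats` fields gives the minimum, maximum and average iteration counts for that loop.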
Loop Profile Data Viewer GUI: Alternative to Plain Text Output
[Screenshot: Function Profile View and Loop Profile View]
• Column headers allow selection to control sort criteria independently for function and loop table
• Menu to allow user to enable filtering or displaying the source code
Intel® Parallel Building Blocks: Comprehensive Tools to Deliver Outstanding App Performance
Mix & Match to Optimize Your Application’s Performance
Intel® Cilk™ Plus
• 3 simple keywords & array notations for parallelism
• Support for task and data parallelism
• Semantics similar to serial code
• Simple way to parallelize your code
• Sequentially consistent, low overhead, powerful solution
• Supports C, C++, Windows and Linux
Intel® Threading Building Blocks
• Parallel algorithms and data structures
• Scalable memory allocation and task scheduling
• Synchronization primitives
• Rich feature set for general purpose parallelism
• Available as open source or commercial license
• Supports C++, Windows, Linux, Mac OS X, other OSs
Intel® Array Building Blocks
• Automatically scales to future Intel platforms
• Use of cores, threads, SIMD, determined at runtime
• Use for flexible vector parallelism
• JIT & VM technology: flexible and powerful
• Supports C++, Windows & Linux
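What
• Intel® Cilk™ Plus: language extensions to simplify task/data parallelism
• Intel® Threading Building Blocks: widely used C++ template library for task parallelism
• Intel® Array Building Blocks: sophisticated C++ template library for vector parallelism
Features / Why
• Listed under each product above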
Parallel Building Blocks
• Intel® Cilk™ Plus (language extension to C/C++)
– New keywords: cilk_for, cilk_spawn, cilk_sync
– Vector notation for arrays
– Elemental functions (vector functions)
• Intel® Threading Building Blocks (Intel® TBB)
– C++ template library
– C++ Containers for tasks
– Flexible grain size
– Sophisticated built-in task scheduling
– Open source
• Intel® Array Building Blocks (Intel® ArBB)
– C++ library for parallel array operations
– Dynamic compilation allows architectural customization
– Array notation enabling parallel array operations
– ABI published
Intel® Cilk™ Plus
Intel Cilk Plus
Key Benefits
• Simple syntax which is very easy to learn and use
• Array notation guarantees fast vector code
• Pragmas guarantee vectorization of loops over arbitrary user code
• Fork/join tasking system is simple to understand and offers safety from errors
• Low overhead tasks offer scalability to high core counts
• Hyperobjects enable reductions which give the same answers as serial code
• Mixes with Intel TBB and Intel ArBB for a complete task and vector parallel solution
Intel Cilk Plus
What is it?
• Compiler assisted solution offering a tasking system via 3 simple keywords
• Includes array notation to specify vector code
• Has a hyperobjects library which offers powerful parallel data structures
• Based on 15 years of research at MIT
• Pragmas to force vectorization of loops and specify functions that can be applied to all elements of arrays
Intel® Cilk Plus: Examples
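• Intel Cilk Plus: Keyword & Hyperobjects Example
#include <cilk/cilk.h>
#include <cilk/reducer_list.h>
// Hyperobject: each strand appends to its own view; views merge in serial order
cilk::reducer_list_append<int> pos;
void findnum(int *MAX, float *array, float val) {
  cilk_for (int i = 0; i < *MAX; i++)
    if (array[i] == val)
      pos.push_back(i);
}
• Intel Cilk Plus: CEAN & Elemental Functions Example
// Elemental function: the compiler generates a vector version
__declspec (vector) double ef_add (double x, double y) {
  return x + y;
}
// Array notation: apply ef_add to every element of b and c
a[:] = ef_add(b[:], c[:]);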
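For readers without a Cilk Plus compiler, the serial meaning of the two examples can be sketched in standard C++. The names `find_indices` and `add_arrays` are hypothetical, and the code is a plain serial equivalent, not Cilk Plus; it shows the result the parallel versions are expected to reproduce, since the hyperobject preserves serial order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial equivalent of the cilk_for + reducer example: collect, in order,
// the indices at which `array` equals `val`.
std::vector<std::size_t> find_indices(const std::vector<float>& array,
                                      float val) {
    std::vector<std::size_t> pos;
    for (std::size_t i = 0; i < array.size(); ++i)
        if (array[i] == val)
            pos.push_back(i);
    return pos;
}

// Scalar version of the elemental function.
double ef_add(double x, double y) { return x + y; }

// Serial equivalent of a[:] = ef_add(b[:], c[:]).
void add_arrays(std::vector<double>& a, const std::vector<double>& b,
                const std::vector<double>& c) {
    assert(b.size() == c.size());
    a.resize(b.size());
    for (std::size_t i = 0; i < b.size(); ++i)
        a[i] = ef_add(b[i], c[i]);
}
```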
Intel® Cilk™ Plus – when to use
• Seeking task or vector parallelism
• Task-based parallelism with serial semantics is required
• Reduction operations need consistent answers as the number of cores varies
• Need a compiled language with no JIT/VM capability
• A fork/join tasking model is sufficient
• Need to guarantee array notation or loops run as high performance vector code
• Vectorize loops over arbitrary user functions applied to entire arrays
Cilk Plus
A powerful yet simple & easy to learn compiler assisted capability offering low-overhead, high-performance task & vector parallelism
Intel® Array Building Blocks: In Beta
Intel ArBB
Key Benefits
• High performance and flexible vector parallelism
• Built-in data types for commonly used data
• Compile once/run everywhere
• Future proof – accommodates changing vector lengths
• No special compiler – easy to integrate incrementally into existing environments
• Mixes with Intel Cilk Plus and Intel TBB for a complete task and vector parallel solution
Intel ArBB
What is it?
• A C++ template library for flexible vector parallelism
• Utilizes a JIT and VM to offer high performance
• Runs vectors on multiple cores
Intel® Array Building Blocks: example
void findnum(dense<f32> array, f32 val,
dense<usize>& results) {
dense<boolean> locations = (array == val);
dense<usize> matching_indices =
indices(0, array.length());
results = pack(matching_indices, locations);
}
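The example above follows a mask, index, pack pattern. A rough serial sketch of that pattern in standard C++ is given below; `equal_mask`, `index_range`, `pack_by_mask` and `findnum_serial` are hypothetical names chosen for illustration and are not ArBB functions, so this shows only the intended data flow.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Step 1: element-wise comparison producing a boolean mask.
std::vector<bool> equal_mask(const std::vector<float>& array, float val) {
    std::vector<bool> mask(array.size());
    for (std::size_t i = 0; i < array.size(); ++i)
        mask[i] = (array[i] == val);
    return mask;
}

// Step 2: the index sequence start, start + 1, ..., start + count - 1.
std::vector<std::size_t> index_range(std::size_t start, std::size_t count) {
    std::vector<std::size_t> idx(count);
    for (std::size_t i = 0; i < count; ++i)
        idx[i] = start + i;
    return idx;
}

// Step 3: keep only the values whose mask entry is true, preserving order.
std::vector<std::size_t> pack_by_mask(const std::vector<std::size_t>& values,
                                      const std::vector<bool>& mask) {
    assert(values.size() == mask.size());
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < values.size(); ++i)
        if (mask[i])
            out.push_back(values[i]);
    return out;
}

// The three steps combined, mirroring the structure of the slide's findnum.
std::vector<std::size_t> findnum_serial(const std::vector<float>& array,
                                        float val) {
    return pack_by_mask(index_range(0, array.size()), equal_mask(array, val));
}
```

Each step is a whole-array operation with no dependency between elements, which is what lets a runtime spread the work across SIMD lanes and cores.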
Intel® Array Building Blocks – when to use
• Seeking a library based vector parallelism solution for C++
• Have array or vector rich calculations
• Seeking a compile once/run everywhere deployment model, based on JIT compilation
• Need deterministic execution
ArBB
A sophisticated C++ template library based capability offering vector parallelism using JIT technology for flexible performance & deployment
Guidance: Parallel Building Blocks
Select from a Variety of Powerful Tools to Aid Parallelism
Intel® Math Kernel Library
• Scaling performance on Intel processors
• Parallel implementations – shared and distributed memory
– Extensively threaded math functions with excellent scaling
– Threading in Vector Math Functions
– OpenMP* compatibility library supports Microsoft and GNU OpenMP implementations
• Maximize application performance
– Automatic runtime processor detection ensures great performance on whatever processor your application is running on
– Optimizations for recent Intel processors
– Cluster functionality is standard
• Function Domains
– Linear Algebra: BLAS, LAPACK
– Linear Algebra: Sparse Solvers
– Fast Fourier Transforms
– Vector Math Library
– Vector Statistical Library
Highly Optimized Math Library for Scientific, Engineering, Financial, and Energy Applications
Intel® Integrated Performance Primitives 7.0
Applications: Digital Media | Web/Enterprise Data | Embedded Communications | Scientific/Technical
Intel® Integrated Performance Primitives: 16 Function Domains
• High-level APIs and codecs; interfaces and code samples
• Cross-platform C/C++ API for code re-use
• Optimized 32-bit and 64-bit multicore performance
Multimedia
• Image Processing
• Color Conversion
• JPEG/JPEG2000
• Video Coding
• Computer Vision
• Realistic Rendering
Signal Processing
• Signal Processing
• Audio Coding
• Speech Coding
• Speech Recognition
• Vector Operations
Data Processing
• Data Compression
• Data Integrity
• Cryptography
• String Processing
• Matrix Operations
Intel® Integrated Performance Primitives: What’s New in Version 7.0?
• New performance optimizations for the latest Intel processors
– Advanced Encryption Standard (AES) and CRC32C new instructions for dramatic performance increases in cryptography and data compression algorithms for Intel® Core™ i7 and later processors
• Windows Imaging Component (WIC) API support for faster and easier adoption of IPP image codecs by Windows developers
• Improved JPEG codec multicore performance scaling (6x on 8 core machines)
• New JPEG-XR codec (aka HD Photo), a new image compression standard
– 2x the compression level for the same image quality, without need for greater memory or computing resources
– Supports lossless and lossy compression as well as incremental decompression of specific image regions
– Supports higher dynamic range and color depth than existing image codecs
• Improved ease of use for Deferred Mode Image Processing (DMIP) via Visual Studio* Domain Specific Language graphical programming user interface
Where is my application…
Spending Time? Wasting Time? Waiting Too Long?
• Focus tuning on functions taking time
• See call stacks
• See time on source
• See cache misses on your source
• See functions sorted by # of cache misses
• See locks by wait time
• Red/Green for CPU utilization during wait
Intel® VTune™ Amplifier XE Performance Profiler
• Windows & Linux
• Low overhead
• No special recompiles
Advanced Profiling For Scalable Multicore Performance
Intel® VTune™ Amplifier XE: Tune Applications for Scalable Multicore Performance
• Fast, Accurate Performance Profiles
– Hotspot (Statistical call tree)
– Hardware-Event Based Sampling
• Thread Profiling
– Visualize thread interactions on timeline
– Balance workloads
• Easy set-up
– Pre-defined performance profiles
– Use a normal production build
• Compatible
– Microsoft, GCC, Intel compilers
– C/C++, Fortran, Assembly, .NET
– Latest Intel® processors and compatible processors¹
• Find Answers Fast
– Filter extraneous data
– View results on the source / assembly
– Event multiplexing
• Windows or Linux
– Visual Studio Integration (Windows)
– Standalone user interface and command line
– 32 and 64-bit
¹ IA32 and Intel® 64 architectures. Many features work with compatible processors. Event based sampling requires a genuine Intel® Processor.
Kinds of Collection
• User Mode Sampling and Tracing Analysis
– Dynamically instruments binary (configurable APIs)
– Uses OS interrupt for each thread to collect samples and keeps sample if thread was active since last sample
– Collects call stack
• Hardware Event-based Sampling Analysis
– Uses installed driver to configure and collect interrupts from the Performance Monitoring Unit of each Intel CPU Core.
Intel® VTune™ Amplifier XE: See Profile Data On Source / Asm
• Double-click from grid or timeline
• Time on source / Asm
• Quickly scroll to hot spots. Scroll bar “Heat Map” is an overview of hot spots
• Click jump to scroll Asm
• Quick Asm navigation: select source to highlight Asm
• Right-click for instruction reference manual
Intel® VTune™ Amplifier XE
Timeline Visualizes Thread Behavior
• Optional: Use API to mark frames and user tasks
• Optional: Add a mark during collection
[Figure labels: CPU Time; Hovers; Transitions; timelines shown for Hotspots, Lightweight Hotspots, Locks & Waits]
Profile a Running Application
No need to stop and re-launch the app when profiling
Two Techniques:
• Attach to Process:
– Hotspot
– Concurrency
– Locks & Waits
• Profile System:
– Lightweight Hotspots
– Advanced & Custom EBS
– Optional: Filter by process after collection
(Attach to process is currently only available for Windows)
Command Line Interface
• amplxe-cl is the command line.
– Linux: /opt/intel/vtune_amplifier_xe/bin[32|64]/amplxe-cl
– Windows: C:\Program Files\Intel\VTune Amplifier XE\bin[32|64]\amplxe-cl.exe
• To get detailed help:
– amplxe-cl -help
• Get Command Line from GUI
• Command examples:
1. amplxe-cl -collect-list
2. amplxe-cl -knob-list=hotspots
3. amplxe-cl -collect=hotspots -- myapp.exe [MyParams]
4. amplxe-cl -report hotspots
Remote Data Collection
Conveniently analyze data collected on remote systems
1. Set up the experiment using GUI locally
2. Copy command line instructions to paste buffer
3. Open remote shell on target machine
4. Paste command line, run collection
5. Copy result file to your local system
6. Open file using local GUI
Local system: VTune™ Amplifier XE, full user interface
Remote system: lightweight command line collector
• Minimal “performance footprint” during collection
• Easy setup using GUI
• Easy analysis of results
Regression Testing
• Create Baseline:
$> amplxe-cl -collect hotspots -r BaseLinePerf -- myapp.exe
$> amplxe-cl -report hotspots -r BaseLinePerf
• Nightly Performance Regression Testing:
$> amplxe-cl -collect hotspots -r NightlyResults -- myapp.exe
$> amplxe-cl -report hotspots -r BaseLinePerf -r NightlyResults
[…stuff deleted…]
Module     Process    Result 1:CPU Time  Result 2:CPU Time  Difference:CPU Time
myapp.exe  myapp.exe  23.141             61.531             -38.391
…
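A nightly script would typically turn the comparison report into a pass/fail decision. The following standard C++ sketch shows one possible check; `CpuTimeComparison` and the tolerance parameter are hypothetical, and the sign convention (Result 1 minus Result 2) is an assumption read off the sample report, not documented tool behavior.

```cpp
#include <cassert>

// CPU times for one module taken from a baseline and a nightly result.
struct CpuTimeComparison {
    double baseline_seconds;
    double nightly_seconds;

    // Assumed to match the report's Difference column: baseline minus
    // nightly, so a negative value means the nightly build is slower.
    double difference() const { return baseline_seconds - nightly_seconds; }

    // Flag a regression when the nightly time exceeds the baseline by more
    // than the given fraction (for example 0.10 for 10%).
    bool is_regression(double tolerance_fraction) const {
        assert(baseline_seconds > 0.0 && tolerance_fraction >= 0.0);
        return nightly_seconds > baseline_seconds * (1.0 + tolerance_fraction);
    }
};
```

With the figures from the sample report (23.141 s baseline, 61.531 s nightly) the difference is about -38.39 s and any reasonable tolerance flags a regression.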
Intel® VTune™ Amplifier XE
Compare Results Quickly - Sort By Difference
• Quickly identify cause of regressions.
– Run a command line analysis daily
– Identify the function responsible so you know who to alert
• Compare 2 optimizations – What improved?
• Compare 2 systems – What didn’t speed up as much?
Readying Your Application for Intel® VTune™ Amplifier XE
• You should run Amplifier XE on a “Released/Optimized” build.
• Symbols allow you to view the source (not just the assembly)
– Linux: -g
• Intel Threading Runtimes need instrumented runtimes
– TBB: Define TBB_USE_THREADING_TOOLS
– OpenMP: Use the Intel dynamic version of OpenMP
• Call Stack Mode requires use of the dynamic version of the C Runtime library to properly attribute system calls
– Linux: do not use -static
Intel® Inspector XE 2011: Advancing Application Reliability, Code Quality and Security
• Powerful Robust Dynamic Analysis
– Memory errors: invalid memory accesses, memory leaks, uninitialized memory accesses, improper usage of memory API(s), resource leaks (Windows only)
– Threading errors: data races, deadlock/lock hierarchy violation, cross-stack memory accesses
• Productivity Features
– View context of problem (stack, multiple contributing source lines)
– Bug does not have to occur to find it!
– Suppression, filtering, and workflow management
– Time line visualization
– Visual Studio integration (Windows)
– Memory leak snapshots (Linux)
– Break into debugger on error (Linux)
Intel® Inspector XE: Process Flow
[Figure: process flow diagram]
• Compile/Link: Application.cpp + .dll(s) → Application.exe + .dll(s)
• Execution/JIT instrumentation: runtime data collector → r###[t|m]#.insp (results)
• Configuration: suppression, filter
• Filter / Change State / Suppress → reduced data file
Readying Your Sources
• Intel Inspector XE can analyze any native binary, but some switches (such as symbols and debug) make the results easier to read
– Linux: -O0 -g
• Threading Error Analysis: use of the dynamic version of the C Runtime library will avoid false positives
– Linux: do not use -static
• Use of switches that implement similar functionality in the binary is not recommended
• Intel Threading Runtimes require switches to reduce false positives in Threading Error Analysis
– TBB: Define TBB_USE_THREADING_TOOLS
– Use the dynamic version of the OpenMP compatibility library supplied by the Intel® Compiler
Command Line Interface
• inspxe-cl is the command line.
– Windows: C:\Program Files\Intel\Inspector XE\bin[32|64]\inspxe-cl.exe
– Linux: /opt/intel/inspector_xe/bin[32|64]/inspxe-cl
• To get detailed help:
– inspxe-cl -help
• Get Command Line from GUI
• Command examples:
1. inspxe-cl -collect-list
2. inspxe-cl -collect ti2 -- MyApp.exe
3. inspxe-cl -report problems
More help is available in the online documentation
Remote Data Collection
Conveniently analyze data collected on remote systems
1. Set up the experiment using GUI locally
2. Copy command line instructions to paste buffer
3. Open remote shell on target machine
4. Paste command line, run collection
5. Copy result file to your local system
6. Open file using local GUI
Local system: Inspector XE, full user interface
Remote system: lightweight command line collector
Regression Testing
• Create Baseline Suppression File:
$> inspxe-cl -collect ti2 -r BaseLineResults -- App.exe
$> inspxe-cl -create-suppression-file MyThread.sup -result-dir BaseLineResults
• Nightly Regression Testing:
$> inspxe-cl -collect ti2 -suppression-file MyThread.sup -r NightlyTestResults -- App.exe
[…stuff deleted…]
0 new problem(s) found
Summary: Intel Performance-Oriented Compiler Suites
Performance
Intel® C++ Composer XE 2011
• Intel® C++ Compiler XE 12.0
• Intel® Parallel Debugger Extension
• Intel® Parallel Building Blocks
• Intel® Math Kernel Library
• Intel® Integrated Performance Primitives
Intel® Fortran Composer XE 2011
• Intel® Fortran Compiler XE 12.0
• Intel® Parallel Debugger
• Intel® Math Kernel Library
• Intel® Integrated Performance Primitives
Intel® Composer XE 2011
• Combines Intel® C++ Composer XE and Intel® Fortran Composer XE
• For Fortran developers who also want Intel C++
• Windows, Linux only
Compatibility
• Windows: Integrates into Microsoft* Visual Studio*, Intel C++/Visual C++* compatibility
• Linux: Integrates into Eclipse* CDT, Intel C++ compatible with GCC
• Mac OS: Integrates into Xcode* environment, compatible with GCC
Support
• All: 1 year Premier Support, renewable annually
Questions?