- 1 -

Copyright © 2004 Intel Corporation. All Rights Reserved.

Maximizing Application Performance by Threading, SIMD, and Micro-architecture Tuning

Koby Gottlieb

Intel Corporation

Feb 27 2007

- 2 -


Agenda

Threading gains and challenges

Optimization methodology, project milestones
– Developing a benchmark
– VTune™ Performance Analyzer
– Threading: overview of approaches
– Intel® Thread Checker
– Intel® Thread Profiler
– Streaming SIMD Extensions (SSE) and micro-architectural issues

Project example

[Mark] is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States or other countries

- 3 -


Dual-Core Systems

Efficiently Utilize Dual Cores

One package with 2 cores

Software impact
– 2 cores, 2 processors
– 2 cores, 2x resources

Use threads to exploit the full resources of dual-core processors

- 4 -


Threads Defined

OS creates a process for each program loaded
– Each process executes as a separate thread

Additional threads can be created within the process
– Each thread has its own stack and instruction pointer (IP)
– All threads share code and data

[Figure: a process containing shared Code and Data, plus thread1(), thread2(), …, threadN(), each with its own Stack and IP]
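The split between private stacks and shared data can be sketched with POSIX threads. This is a minimal illustration, not code from the presentation: `worker`, `run_workers`, `shared_data`, and `MAX_THREADS` are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 8   /* hypothetical limit for this sketch */

/* Shared data: one copy, visible to every thread in the process. */
static int shared_data[MAX_THREADS];

static void *worker(void *arg)
{
    int id = *(int *)arg;   /* local variable: lives on this thread's own stack */
    int local = id * id;    /* private to this thread */
    shared_data[id] = local; /* each thread writes only its own slot, so no race */
    return NULL;
}

/* Creates n threads, waits for them, and returns the sum of their results.
   Returns -1 if a thread could not be created. */
int run_workers(int n)
{
    pthread_t tid[MAX_THREADS];
    int ids[MAX_THREADS];
    int created = 0, sum = 0;

    assert(n >= 1 && n <= MAX_THREADS);
    for (int i = 0; i < n; i++) {
        ids[i] = i;
        if (pthread_create(&tid[i], NULL, worker, &ids[i]) != 0)
            break;
        created++;
    }
    for (int i = 0; i < created; i++)
        pthread_join(tid[i], NULL);
    if (created != n)
        return -1;
    for (int i = 0; i < n; i++)
        sum += shared_data[i];
    return sum;
}
```

Calling `run_workers(4)` sums the squares 0, 1, 4, and 9 computed by four threads. Because every thread writes a distinct array element, no synchronization is needed here.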


- 5 -


Threading Software

OpenMP* threads
– http://www.openmp.org/

Windows* threads
– http://msdn.microsoft.com/

POSIX* threads (pthreads)
– http://www.ieee.org/


*Other names and brands may be claimed as the property of others.

If both cores are fully busy, then a 2x speedup is possible

- 6 -


Challenges Unique to Threading

Correctness Bug: Data Races

Suppose: a=1, b=2

Thread 1: x = a + b
Thread 2: b = 42

What is the value of x if:
– Thread 1 runs before Thread 2? x = 3
– Thread 2 runs before Thread 1? x = 43

Data race: concurrent read, modify, write of the same address

Outcome depends on thread execution order

- 7 -


Solving Data Races: Synchronization

Thread 1:
    Acquire(L)
    a = 1
    b = 2
    x = a + b
    Release(L)

Thread 2:
    Acquire(L)
    b = 42
    Release(L)

Acquisition of mutex L ensures atomic access
– Only one thread can hold the lock at a time

Example APIs:
– EnterCriticalSection(), LeaveCriticalSection()
– pthread_mutex_lock(), pthread_mutex_unlock()
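The Acquire/Release pseudocode maps directly onto the pthreads calls named on this slide. The following is a minimal sketch under that assumption; `thread1`, `thread2`, and `run_once` are hypothetical names, and the variables mirror the slide.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;
static int a, b, x;

/* Thread 1 from the slide: sets a and b, then computes x, all under lock L. */
static void *thread1(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&L);      /* Acquire(L) */
    a = 1;
    b = 2;
    x = a + b;
    pthread_mutex_unlock(&L);    /* Release(L) */
    return NULL;
}

/* Thread 2 from the slide: overwrites b under the same lock. */
static void *thread2(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&L);      /* Acquire(L) */
    b = 42;
    pthread_mutex_unlock(&L);    /* Release(L) */
    return NULL;
}

/* Runs both threads once and returns x, or -1 if thread creation fails. */
int run_once(void)
{
    pthread_t t1, t2;

    a = b = x = 0;
    if (pthread_create(&t1, NULL, thread1, NULL) != 0)
        return -1;
    if (pthread_create(&t2, NULL, thread2, NULL) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return x;
}
```

Because Thread 1 writes `b` and reads it back inside one critical section, Thread 2 cannot slip in between, so `x` is 3 in either execution order. The final value of `b` (2 or 42) still depends on which thread takes the lock last.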


- 8 -


Amdahl's Law

If only 1/2 of the code is parallel, 2X speedup is unlikely

$$T_{\text{Parallel}} = \left\{ (1 - P) + \frac{P}{N} \right\} T_{\text{Total}} + O$$

P = parallel portion of process
N = number of processors (cores)
O = parallel overhead

[Figure: time bar of total length T_Total, split into a serial portion (1-P) and a parallel portion P]
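As a worked example of the formula on this slide, the sketch below evaluates the parallel time and the resulting speedup. The function names `parallel_time` and `speedup` are hypothetical; the parameters follow the slide's symbols.

```c
#include <assert.h>

/* T_Parallel = {(1 - P) + P / N} * T_Total + O */
double parallel_time(double t_total, double p, int n, double overhead)
{
    assert(p >= 0.0 && p <= 1.0);
    assert(n >= 1);
    return ((1.0 - p) + p / n) * t_total + overhead;
}

/* Speedup relative to the original single-threaded run time. */
double speedup(double t_total, double p, int n, double overhead)
{
    return t_total / parallel_time(t_total, p, n, overhead);
}
```

With P = 0.5, N = 2, and zero overhead, the parallel time is 0.75 of the total, so the speedup is only about 1.33x. Any nonzero overhead O lowers it further.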

- 9 -


Threads Introduce a New Class of Problems

Correctness bugs
• Data races
• Deadlock
• and more…

Performance bottlenecks
• Overhead
• Load balance
• and more…

Intel® Threading Tools can help!

Intel® Thread Checker finds correctness bugs

Thread Profiler feature pinpoints bottlenecks


- 10 -


Methodology & Milestones: Getting Started

– Most of the world's apps are not threaded:
  • There are 106,177 registered projects on SourceForge (http://sourceforge.net/).
  • Almost all of these applications are not performance sensitive.
  • Some performance-sensitive apps are too small, too big, or too complex.
  – Is this a representative picture of the real software world?
  – If so, we have a problem in our multi-core strategy.

– Learning the app.
  • No need to understand every algorithm, but an overall understanding is a must.
  • The call graph of VTune™ analyzer is a great tool for this task.

– Develop a benchmark.
  • A representative benchmark must be defined before optimizing.
  • A good benchmark must be automatic (VTune™ analyzer tuning assistant), not too short (above 30 seconds), and not too long.
  • Surprisingly, selecting a good benchmark is time consuming and difficult.

- 11 -


Using VTune™ Performance Analyzer

Sampling is surprisingly easy to use:
– Easy to get good results from sampling without any training.
– Time breakdown is the first step for the threading decision-making process.
– Hot spots might be vectorized.

Call graph as a tool to understand the code and select threading direction.
– Set the /fixed:no flag for the linker.
– Call graph provides a hierarchical view and overall timing.
– Call graph overhead makes it too inaccurate for timing; must use sampling for correct time estimates.

- 12 -


Threading

The most challenging part of the project: how to thread.

– Added difficulty: shared resources like the FSB or L2 may eliminate the speedup potential.
– Functional or data decomposition?
– In many cases you can find mostly functional parallelism, which only scales to 2-3 threads.
– Examples:
  • Identify the stages and let thread 0 work on the front end of data element N+1 while thread 1 works on the back end of data element N.
  • Assign a thread per channel in stereo.
– For good data decomposition, the code should be designed in advance to be threaded.
  • A desirable goal is to maintain the exact results in order to simplify testing. So breaking the input into chunks does not work if there is any history between data elements.
– If data decomposition is applied to only a relatively small part of the project, there is almost no speedup because of the synchronization overhead.

OpenMP is very convenient for data decomposition experimentation.
• Supported by the Intel® compiler.
• It became more legitimate with its introduction in the MS .NET 2005 compiler*.

* Other names and brands may be claimed as the property of others.
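A minimal sketch of a data decomposition experiment with OpenMP, assuming a loop whose iterations carry no history between data elements. The function `scale_samples` and its parameters are hypothetical and not taken from any of the projects discussed.

```c
#include <assert.h>
#include <stddef.h>

/* Applies a gain to every sample. Each iteration touches only samples[i],
   so the loop can be split across threads and still produce exactly the
   same results as the serial version. */
void scale_samples(float *samples, size_t n, float gain)
{
    assert(samples != NULL || n == 0);

    /* Ignored by compilers built without OpenMP support, in which case
       the loop simply runs serially. */
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) {
        samples[i] *= gain;
    }
}
```

A single pragma is the whole experiment, which is what makes OpenMP convenient for trying out a decomposition before committing to it. If each iteration depended on the previous element's result, the pragma would no longer be safe.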

- 13 -


Debugging the Threaded App

Convert the app to serial code and debug first while running thread 0 before thread 1, and then in reverse order.
– This methodology is good for 75% of the bugs and does not require any tricky debugging technique.
– Try running in parallel and start looking for shared data elements.

Intel® Thread Checker to the rescue.
– “No, it is not broken, just build a very small example and be patient.” It takes a long time.
– Intel® Thread Checker gives excellent analysis capabilities:
  • The location of the faulty data element allocation
  • The read location
  • The write location
  • The call stack that brings us to this location

- 14 -


Intel® Thread Checker 2.0 Features

Locates threading bugs:
– Data races (storage conflicts)
– Deadlocks (potential and actual)
– Win32 threading API usage problems
– Memory leaks and overwrites

Isolates bugs to source code line

Describes possible causes of errors and suggests resolutions

Categorizes errors by severity level

- 15 -


Diagnostics List

[Screenshot: Intel® Thread Checker, with callouts:]
– Diagnostics List in Terse mode
– Summary and legend
– Verbose diagnostics

- 16 -


Source Code View

[Screenshot: Intel® Thread Checker]

Each diagnostic in the list links to its source code line(s)

- 17 -


Help with Diagnostics

[Screenshot: Intel® Thread Checker, with callouts:]
1) Right-click here . . .
2) More help!

- 18 -


Intel® Thread Checker

Example: From Sphinx final report.

It shows two errors, one in cmn_prior.c and one in lm.c. After examining the code of cmn_prior.c, it is quite obvious that the source of the problem is the use of static variables in the cmn_prior function. Code sample 13 shows the original code; please note the first 4 static variables. Needless to say, the use of such a technique is not thread safe.

    void cmn_prior(float32 **incep, int32 varnorm, int32 nfr,
                   int32 ceplen, int32 endutt)
    {
        static float32 *cur_mean = NULL; /* the mean subtracted from input frames */
        static float32 *sum = NULL;      /* the sum over input frames */
        static int32 nframe;             /* the total number of input frames */
        static int32 initialize=1;
        float32 sf;
        int32 i, j;

        assert(incep!=NULL);
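One common way to remove this kind of hazard is to move the static variables into a state object that the caller owns, so each thread can work on its own copy. The sketch below illustrates that idea only. It is not the Sphinx fix, the running-mean logic is a simplified stand-in for the real cmn_prior body (which is not shown in the excerpt), and `cmn_state_t`, `cmn_state_new`, `cmn_state_apply`, and `cmn_state_free` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Per-caller state replacing the function-level static variables. */
typedef struct {
    float *cur_mean;  /* the mean subtracted from input frames */
    float *sum;       /* the sum over input frames */
    int nframe;       /* the total number of input frames */
    int ceplen;       /* number of coefficients per frame */
} cmn_state_t;

cmn_state_t *cmn_state_new(int ceplen)
{
    cmn_state_t *s;

    assert(ceplen > 0);
    s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->cur_mean = calloc((size_t)ceplen, sizeof(float));
    s->sum = calloc((size_t)ceplen, sizeof(float));
    if (s->cur_mean == NULL || s->sum == NULL) {
        free(s->cur_mean);
        free(s->sum);
        free(s);
        return NULL;
    }
    s->nframe = 0;
    s->ceplen = ceplen;
    return s;
}

void cmn_state_free(cmn_state_t *s)
{
    if (s != NULL) {
        free(s->cur_mean);
        free(s->sum);
        free(s);
    }
}

/* Simplified stand-in: accumulate one frame, update the running mean,
   and subtract it from the frame in place. */
void cmn_state_apply(cmn_state_t *s, float *frame)
{
    assert(s != NULL && frame != NULL);
    s->nframe++;
    for (int i = 0; i < s->ceplen; i++) {
        s->sum[i] += frame[i];
        s->cur_mean[i] = s->sum[i] / (float)s->nframe;
        frame[i] -= s->cur_mean[i];
    }
}
```

Two threads that each create their own `cmn_state_t` no longer share any hidden data, so the accesses a checker would flag in the static version disappear. The cost is an extra parameter on every call.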

- 19 -


Threading, Performance

Check what percentage of the code is threaded.
– This sets the upper bound for potential performance.
– Can use VTune™ analyzer to see how much time each thread runs.
– Check if the total instruction count of the threaded app is equal to the instruction count of the original app.
  • In many cases there is a huge overhead for threading, or just a bug (doing some work twice).

Evaluate the amount of parallel work.
– Even if both threads spend the same amount of time, they may not be doing it at the same time.
– If an (already debugged) threaded app runs much slower than the serial app, look for false sharing issues:
  • “No, converting each local variable to an array of 2 variables is not a good idea for threading efficiency.” From one of my meetings, trying to explain how come the threaded app is 14X slower than the original app.

Check the critical path.
– Intel® Thread Profiler is great for the job after you figure out how to use it and its cryptic terminology.
– Note that the Win32 API Thread Profiler is not the same tool as the OpenMP Thread Profiler.
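The false sharing problem quoted above arises when per-thread variables sit next to each other in memory and end up in one cache line. A minimal sketch of one standard remedy, padding each thread's slot, follows. The 64-byte line size is an assumption about the target CPU, and `padded_counter_t`, `count_up`, and `run_counters` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */
#define NUM_THREADS 2

/* An "array of 2 variables" done carefully: each thread's counter is padded
   so the two count fields are at least CACHE_LINE bytes apart and therefore
   cannot fall in the same cache line. */
typedef struct {
    long count;
    char pad[CACHE_LINE - sizeof(long)];
} padded_counter_t;

static padded_counter_t counters[NUM_THREADS];

typedef struct {
    int id;
    long iterations;
} work_t;

static void *count_up(void *arg)
{
    work_t *w = arg;

    for (long i = 0; i < w->iterations; i++)
        counters[w->id].count++;   /* touches only this thread's slot */
    return NULL;
}

/* Runs NUM_THREADS counting threads; returns the combined count or -1. */
long run_counters(long iterations)
{
    pthread_t tid[NUM_THREADS];
    work_t work[NUM_THREADS];
    int created = 0;
    long total = 0;

    assert(iterations >= 0);
    for (int i = 0; i < NUM_THREADS; i++) {
        counters[i].count = 0;
        work[i].id = i;
        work[i].iterations = iterations;
        if (pthread_create(&tid[i], NULL, count_up, &work[i]) != 0)
            break;
        created++;
    }
    for (int i = 0; i < created; i++)
        pthread_join(tid[i], NULL);
    if (created != NUM_THREADS)
        return -1;
    for (int i = 0; i < NUM_THREADS; i++)
        total += counters[i].count;
    return total;
}
```

Without the `pad` member the two counters would typically be adjacent, and every increment on one core could invalidate the line cached by the other. Keeping the counter in a true local variable and writing it out once at the end is an alternative that avoids the shared array altogether.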

- 20 -


The Thread Profiler Feature

Intel® Threading Tools

Pinpoints threading performance bottlenecks in apps threaded with:
– Microsoft* Windows* threads on Microsoft* Windows* systems
– POSIX* pthreads on Linux* systems
– OpenMP* on Microsoft* Windows* and Linux* systems

Plugs into VTune™ environment
– Microsoft* Windows* for IA-32 systems
– Linux* for IA-32 systems

*Other names and brands may be claimed as the property of others.

- 21 -


Thread Profiler Feature Analysis

Monitors execution flows to find the Critical Path
– Longest execution flow is the Critical Path

Analyzes the Critical Path
– System utilization
  • Over-subscribed vs. under-subscribed
– Thread state transitions
  • Blocked -> Running

Captures threads timeline
– Visualize threading structure


- 22 -


Thread Profiler Critical Path

[Figure: timeline of Thread 1, Thread 2, and Thread 3 from T0 to T15, annotated with the events "Acquire lock L", "Wait for L", "Release L", and "Wait for Threads 2 & 3", next to a Critical Path View bar with a time axis from 0 to 15]

Building the Critical Path View:
– Start with the critical path
– Separate according to system utilization (Idle, Serial, Under-subscribed, Parallel, Over-subscribed)
– Add overhead
– Further analyze by thread state (Cruise time, Overhead, Blocking time, Impact time)

Analysis shown for 2-way system

- 23 -


Intel® Thread Profiler (OpenMP)

Example: From FAAD final report.

- 24 -


Intel® Thread Profiler (Win32 API)

From FAAD

From GainMPEG:

So what's wrong with this picture?

- 25 -


Streaming SIMD Extensions Coding & Micro-architecture

Intel® Streaming SIMD Extensions
– Optimize the slow thread first in case of functional decomposition.
– In C++, use the class libraries.
– In C, use intrinsics.
– Use inline assembly if the compiler does not behave as expected.
– For integer code or code with many shuffle instructions, inline assembly might be the only solution.
  • But will it be accepted back into the open source tree?

Micro-architectural issues
– Use VTune™ analyzer tuning assistant
  • It's simpler than trying to learn all the ugly stuff.
  • It actually works and finds big issues in some cases.

[Chart: ClockTicks (ms)]
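The "in C, use intrinsics" advice can be illustrated with a short sketch using the SSE intrinsics from `<xmmintrin.h>`. It assumes an x86 target with SSE enabled; the function `add_arrays_sse` is a hypothetical example and not code from any of the projects.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics; requires an x86 target */

/* Adds two float arrays element by element, four floats per instruction. */
void add_arrays_sse(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;

    assert(a != NULL && b != NULL && out != NULL);

    /* Main loop: unaligned loads and stores, so no alignment is required. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }

    /* Scalar tail for the remaining 0 to 3 elements. */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Intrinsics keep register allocation and scheduling in the compiler's hands, which is why they are usually tried before inline assembly. When the arrays are known to be 16-byte aligned, the aligned variants `_mm_load_ps` and `_mm_store_ps` can be used instead.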

- 26 -


Micro-architecture Tuning: VTune™ Tuning Assistant

Phase 1 – identify the main slow-down reasons

[Screenshot callouts:]
– The CPI is high
– High branch mispredictions impact
– Many L2 demand misses

Use precise events to focus on instructions of interest.

- 27 -


Example: Phase 2 – focus on problem sources

Branch mispredictions

L2 load misses

- 28 -


Impact: Web Publications

From http://techreport.com/reviews/2005q2/pentium-xe-840/index.x?pg=11

LAME audio encoding

LAME MT is, as you might have guessed, a multithreaded version of the LAME MP3 encoder. LAME MT was created as a demonstration of the benefits of multithreading specifically on a Hyper-Threaded CPU like the Pentium 4. You can even download a paper (in Word format) describing the programming effort. Rather than run multiple parallel threads, LAME MT runs the MP3 encoder's psycho-acoustic analysis function on a separate thread from the rest of the encoder using simple linear pipelining. That is, the psycho-acoustic analysis happens one frame ahead of everything else, and its results are buffered for later use by the second thread. The author notes, "In general, this approach is highly recommended, for it is exponentially harder to debug a parallel application than a linear one." We have results for two different 64-bit versions of LAME MT from different compilers, one from Microsoft and one from Intel, doing two different types of encoding, variable bit rate and constant bit rate. We are encoding a massive 10-minute, 6-second 101MB WAV file here, as we have done in our previous CPU reviews.

The successful projects have high impact.

- 29 -


The LAME Example: What is the LAME Project?

An educational tool used for learning about MP3 encoding. Its goal is to improve:
– Psycho-acoustics quality.
– The speed of MP3 encoding.

LAME is the most popular state-of-the-art MP3 encoder/decoder used by today's leading products.

Project goals:
– Speeding up the encoding of an audio stream.
– Turning LAME into a Multi-Threaded (MT) engine.
– Be 1:1 bit compatible with the original version.
– Optimize specifically for SMT platforms.
– 64-bit port and CMP-related optimizations.

FOR MORE INFO... http://lame.sourceforge.net

- 30 -


MP3 Encoding Overview

Break up the audio stream into frames (uniform chunks, typically ~1K)

[Figure: an Audio Stream split into Frame 1, Frame 2, Frame 3, Frame 4, feeding a block diagram with the stages Perceptual Model, Analysis Filterbank, MDCT, Quantization, and Bitstream Encode]

Specifically in LAME, the stages are: Read Frame, Psycho-Acoustic, Analysis Filterbank, MDCT, Quantization, Huffman Encoding

- 31 -


LAME MT – Intuitive Approach

The intuitive approach: split the frames between the threads. This is actually data decomposition.

[Figure: Frame 1 through Frame 6 distributed between Thread 1 and Thread 2]

An unbreakable dependence due to Huffman Encoding

- 32 -


LAME MT – Functional Decomposition

[Figure: Frame 1 through Frame 6 flowing through two threads, T1 and T2. The stages shown are Read Frame and Psycho-Acoustic on one thread, and Analysis Filterbank, MDCT, Quantization, and Huffman Encoding on the other. The stages are labeled "Floating Point Intensive" and "Integer Intensive".]
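The two-thread pipeline can be sketched as a producer and a consumer connected by a small bounded queue. This is an illustrative sketch only, not LAME MT code: the stage bodies are arithmetic stand-ins, and `queue_t`, `front_end`, `encode_pipelined`, and `QUEUE_CAP` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAP 4   /* how many frames the front end may run ahead */

typedef struct {
    int items[QUEUE_CAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} queue_t;

static void queue_push(queue_t *q, int v)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAP)            /* wait for a free slot */
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[q->tail] = v;
    q->tail = (q->tail + 1) % QUEUE_CAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

static int queue_pop(queue_t *q)
{
    int v;

    pthread_mutex_lock(&q->lock);
    while (q->count == 0)                    /* wait for a buffered result */
        pthread_cond_wait(&q->not_empty, &q->lock);
    v = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return v;
}

typedef struct {
    queue_t *q;
    const int *frames;
    int n;
} front_args_t;

/* Front-end thread: stand-in for per-frame analysis, buffered in order. */
static void *front_end(void *arg)
{
    front_args_t *a = arg;

    for (int i = 0; i < a->n; i++)
        queue_push(a->q, a->frames[i] * 2);
    return NULL;
}

/* Back end runs on the calling thread and consumes frames in order.
   Returns 0 on success, -1 if the front-end thread cannot be created. */
int encode_pipelined(const int *frames, int n, int *out)
{
    queue_t q = { {0}, 0, 0, 0, PTHREAD_MUTEX_INITIALIZER,
                  PTHREAD_COND_INITIALIZER, PTHREAD_COND_INITIALIZER };
    front_args_t args = { &q, frames, n };
    pthread_t t1;

    assert(frames != NULL && out != NULL && n >= 0);
    if (pthread_create(&t1, NULL, front_end, &args) != 0)
        return -1;
    for (int i = 0; i < n; i++)
        out[i] = queue_pop(&q) + 1;          /* stand-in for the back-end stages */
    pthread_join(t1, NULL);
    return 0;
}
```

Because frames enter and leave the queue in order, the output matches what a serial loop would produce, which is the property needed for a bit-compatible encoder. The queue depth bounds how far the front end can run ahead of the back end.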

- 33 -


Results

- 34 -


Results due to Multi-Threading

                              SMT Platform    SMP Platform
                              (CBR / VBR)     (CBR / VBR)
Using Microsoft's Compiler*   22% / 32%       38% / 62%
Using Intel® Compiler 8.1     20% / 29%       44% / 59%

* Other names and brands may be claimed as the property of others.

- 35 -


Overall Performance Results

                                            HT Platform    CMP Platform
                                            (CBR / VBR)    (CBR / VBR)
LAME MT code + Using Intel® Compiler 8.1    52% / 70%      78% / 109%

The LAME example: a high-quality threading job.

- 36 -


Some Observations

What can be accepted:
– Threading. There is always something to thread, but not always with significant gain.
– Differentiation via micro-architecture.
  • Must be done on the same micro-architecture. If not, we may find that we helped some competitor instead of Intel.
– Streaming SIMD Extensions opportunities.
– 64-bit porting.
  • A huge opportunity. Can be used if the student can't find other options.
  • Porting the assembly code will definitely show benefit. It is a big task waiting to be done.

Things that didn't go as expected:
– Finding good and influential candidates. It becomes more difficult every semester.
– One semester is too short for many apps.
– Returning code to the moderators:
  • Only some parts of some projects were accepted by the open source moderator.
  • None of the projects were fully accepted.

- 37 -


Backup