- 1 -
Copyright © 2004 Intel Corporation. All Rights Reserved.
Maximizing Application’s Performance by Threading, SIMD and Micro-architecture Tuning
Koby Gottlieb
Intel Corporation
Feb 27 2007
- 2 -
Agenda
Threading gains and challenges
Optimization methodology, project milestones
– Developing Benchmark
– VTune™ Performance Analyzer
– Threading: Overview of approaches
– Intel® Thread Checker
– Intel® Thread Profiler
– Streaming SIMD Extensions (SSE) and micro-architectural issues
Project example
[Mark] is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States or other countries
- 3 -
Dual-Core Systems
One package with 2 cores
Software impact
– 2 cores → 2 processors
– 2 cores → 2x resources
Use threads to exploit the full resources of dual-core processors
Efficiently Utilize Dual Cores
- 4 -
Threads Defined
OS creates a process for each program loaded
– Each process executes as a separate thread
Additional threads can be created within the process
– Each thread has its own stack and instruction pointer (IP)
– All threads share code and data
[Figure: a process containing shared Code and Data, plus thread1(), thread2(), … threadN(), each with its own Stack and IP]
- 5 -
Threading Software
OpenMP* threads
– http://www.openmp.org/
Windows* threads
– http://msdn.microsoft.com/
POSIX* threads (pthreads)
– http://www.ieee.org/
If both cores are fully busy, then a 2x speedup is possible
*Other names and brands may be claimed as the property of others.
- 6 -
Correctness Bug: Data Races
Suppose: a=1, b=2
Thread1: x = a + b
Thread2: b = 42
What is the value of x if:
– Thread1 runs before Thread2? x = 3
– Thread2 runs before Thread1? x = 43
Data race: concurrent read, modify, write of the same address
Outcome depends on thread execution order
Challenges Unique to Threading
- 7 -
Solving Data Races: Synchronization
Thread1:
Acquire(L)
a = 1
b = 2
x = a + b
Release(L)
Thread2:
Acquire(L)
b = 42
Release(L)
Acquisition of mutex L ensures atomic access
– Only one thread can hold the lock at a time
Example APIs:
– EnterCriticalSection(), LeaveCriticalSection()
– pthread_mutex_lock(), pthread_mutex_unlock()
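The Acquire(L)/Release(L) pattern can be sketched with the pthreads calls named on the slide. This is a minimal illustration, not code from the project: the names `lock_L`, `thread1`, `thread2` and `run_once` are assumptions made for the example.

```c
#include <assert.h>
#include <pthread.h>

/* Shared data from the slide; lock_L plays the role of mutex L. */
static int a, b, x;
static pthread_mutex_t lock_L = PTHREAD_MUTEX_INITIALIZER;

/* Thread1: sets a and b and computes x inside one critical section. */
static void *thread1(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock_L);    /* Acquire(L) */
    a = 1;
    b = 2;
    x = a + b;
    pthread_mutex_unlock(&lock_L);  /* Release(L) */
    return NULL;
}

/* Thread2: may only modify b while holding the same lock. */
static void *thread2(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock_L);    /* Acquire(L) */
    b = 42;
    pthread_mutex_unlock(&lock_L);  /* Release(L) */
    return NULL;
}

/* Runs both threads to completion and returns x, or -1 on error. */
int run_once(void) {
    pthread_t t1, t2;
    if (pthread_create(&t1, NULL, thread1, NULL) != 0) return -1;
    if (pthread_create(&t2, NULL, thread2, NULL) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return x;
}
```

Because Thread1 writes b and reads it back inside the same critical section, Thread2 cannot change b between the two steps, so x no longer depends on execution order. The final value of b still does.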
- 8 -
Amdahl’s Law
$$T_{Parallel} = \left\{ (1-P) + \frac{P}{N} + O \right\} T_{Total}$$
P = parallel portion of process
N = number of processors (cores)
O = parallel overhead
If only 1/2 of the code is parallel, a 2X speedup is unlikely
[Figure: time bar of length T_Total split into a serial portion (1-P) and a parallel portion P]
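As a worked example, the speedup implied by Amdahl's Law can be computed directly. The sketch below reads the slide's formula as T_Parallel = {(1-P) + P/N + O} T_Total and assumes O is expressed as a fraction of T_Total; the function name `amdahl_speedup` is made up for illustration.

```c
#include <assert.h>

/* Speedup = T_Total / T_Parallel.
 * p: parallel fraction of the run time (0..1)
 * n: number of cores
 * o: parallel overhead, assumed to be a fraction of T_Total */
double amdahl_speedup(double p, int n, double o) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1 && o >= 0.0);
    return 1.0 / ((1.0 - p) + p / (double)n + o);
}
```

With P = 0.5, N = 2 and no overhead, T_Parallel is 0.75 T_Total, so the speedup is about 1.33x rather than 2x. Any overhead lowers it further.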
- 9 -
Threads Introduce a New Class of Problems
Correctness bugs
• Data races
• Deadlock
• and more…
Performance bottlenecks
• Overhead
• Load balance
• and more…
Intel® Threading Tools can help!
Intel® Thread Checker finds correctness bugs
Thread Profiler feature pinpoints bottlenecks
- 10 -
Methodology & Milestones: Getting Started
– Most of the world's apps are not threaded:
• There are 106,177 registered projects on SourceForge (http://sourceforge.net/)
• Almost all of the applications are not performance sensitive.
• Some performance-sensitive apps are too small, too big, or too complex.
– Is this a representative picture of the real software world?
– If so, we have a problem in our multi-core strategy.
– Learning the app.
• No need to understand every algorithm, but an overall understanding is a must.
• The call graph of the VTune™ analyzer is a great tool for this task.
– Develop a benchmark
• A representative benchmark must be defined before optimizing.
• A good benchmark must be automatic (VTune™ analyzer tuning assistant), not too short (above 30 seconds) and not too long.
• Surprisingly, selecting a good benchmark is time consuming and difficult.
- 11 -
Using VTune™ Performance Analyzer
Sampling is surprisingly easy to use:
– Easy to get good results from sampling without any training.
– Time breakdown is the first step for the threading decision-making process.
– Hot spots might be vectorized.
Call graph as a tool to understand the code and select threading direction.
– Setting the /fixed:no flag for the linker.
– Call graph provides a hierarchical view and overall timing.
– Call graph overhead makes it too inaccurate for timing; must use Sampling for correct time estimates.
- 12 -
Threading
The most challenging part of the project: how to thread.
– Added difficulty: shared resources like the FSB or L2 may eliminate the speedup potential.
– Functional or data decomposition?
– In many cases you can find mostly functional parallelism, which only scales to 2-3 threads.
– Examples:
• Identify the stages and let thread 0 work on the front end of data element N+1 while thread 1 works on the back end of data element N.
• Assign a thread per channel in stereo.
– For good data decomposition, the code should be designed in advance to be threaded.
• A desirable goal is to maintain the exact results in order to simplify the testing, so breaking the input into chunks does not work if there is any history between data elements.
– If data decomposition worked on only a relatively small part of the project, there is almost no speedup because of the synchronization overhead.
OpenMP is very convenient for data decomposition experimentation.
• Supported by the Intel® compiler.
• It became more legitimate with its introduction in the MS .NET 2005 compiler*.
* Other names and brands may be claimed as the property of others.
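A data decomposition experiment with OpenMP can be as small as one pragma on a loop whose iterations are independent. The sketch below is illustrative only: `scale_samples` is a hypothetical function, not part of any project discussed here. When the compiler's OpenMP option is off, the pragma is ignored and the loop runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Multiplies every sample by gain. Each element is independent of the
 * others (no history between data elements), so the loop can be split
 * into chunks across threads without changing the result. */
void scale_samples(float *samples, size_t n, float gain) {
    long i;
    assert(samples != NULL || n == 0);
    #pragma omp parallel for
    for (i = 0; i < (long)n; i++) {
        samples[i] *= gain;
    }
}
```

A loop that carries state from one iteration to the next, such as a running filter, would not qualify for this treatment without restructuring.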
- 13 -
Debugging the Threaded App
Convert the app to serial code and debug it first, running thread 0 before thread 1 and then in reverse order.
– This methodology is good for 75% of the bugs and does not require any tricky debugging technique.
– Then try running in parallel and start looking for shared data elements.
Intel® Thread Checker to the rescue.
– "No, it is not broken, just build a very small example and be patient." It takes a long time.
– Intel® Thread Checker gives excellent analysis capabilities:
• The location of the faulty data element allocation
• The read location
• The write location
• The call stack that brings us to this location
- 14 -
Intel® Thread Checker 2.0 Features
Locates threading bugs:
– Data races (storage conflicts)
– Deadlocks (potential and actual)
– Win32 threading API usage problems
– Memory leaks and overwrites
Isolates bugs to source code line
Describes possible causes of errors and suggests resolutions
Categorizes errors by severity level
- 15 -
Diagnostics List
[Screenshot: Intel® Thread Checker, showing the Diagnostics List in terse mode, the summary and legend, and verbose diagnostics]
- 16 -
Source Code View
[Screenshot: Intel® Thread Checker]
Each diagnostic in the list links to its source code line(s)
- 17 -
Help with Diagnostics
[Screenshot: Intel® Thread Checker]
1) Right-click here . . .
2) More help!
- 18 -
Intel® Thread Checker
Example: From Sphinx final report.
It shows two errors, one in cmn_prior.c and one in lm.c. After examining the code of cmn_prior.c, it is quite obvious that the source of the problem is the use of static variables in the cmn_prior function. Code sample 13 shows the original code; please note the first 4 static variables. Needless to say, the use of such a technique is not thread safe.
void cmn_prior(float32 **incep, int32 varnorm, int32 nfr, int32 ceplen, int32 endutt)
{
    static float32 *cur_mean = NULL; /* the mean subtracted from input frames */
    static float32 *sum = NULL;      /* the sum over input frames */
    static int32 nframe;             /* the total number of input frames */
    static int32 initialize = 1;
    float32 sf;
    int32 i, j;

    assert(incep != NULL);
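One common way to remove this kind of hazard is to move the function-level statics into a state object that each thread or stream owns. The sketch below only illustrates that idea. The type `cmn_state_t` and the functions `cmn_state_new`, `cmn_state_free` and `cmn_accumulate` are hypothetical, and the running-mean update is a stand-in rather than the Sphinx algorithm or the fix the project actually applied.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-stream state replacing the static variables. */
typedef struct {
    float *cur_mean;  /* the mean subtracted from input frames */
    float *sum;       /* the sum over input frames */
    int nframe;       /* the total number of input frames */
    int ceplen;       /* number of coefficients per frame */
} cmn_state_t;

/* Allocates zero-initialized state; returns NULL on allocation failure. */
cmn_state_t *cmn_state_new(int ceplen) {
    cmn_state_t *st;
    assert(ceplen > 0);
    st = malloc(sizeof *st);
    if (st == NULL) return NULL;
    st->cur_mean = calloc((size_t)ceplen, sizeof(float));
    st->sum = calloc((size_t)ceplen, sizeof(float));
    st->nframe = 0;
    st->ceplen = ceplen;
    if (st->cur_mean == NULL || st->sum == NULL) {
        free(st->cur_mean);
        free(st->sum);
        free(st);
        return NULL;
    }
    return st;
}

void cmn_state_free(cmn_state_t *st) {
    if (st != NULL) {
        free(st->cur_mean);
        free(st->sum);
        free(st);
    }
}

/* Adds one frame to the running sum and refreshes the running mean.
 * All mutable data lives in *st, so two threads using two different
 * states never touch the same memory. */
void cmn_accumulate(cmn_state_t *st, const float *frame) {
    int i;
    assert(st != NULL && frame != NULL);
    st->nframe++;
    for (i = 0; i < st->ceplen; i++) {
        st->sum[i] += frame[i];
        st->cur_mean[i] = st->sum[i] / (float)st->nframe;
    }
}
```

The caller creates one state per stream and passes it explicitly, which also makes the dependency on history visible in the function signature.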
- 19 -
Threading, Performance
Check what percentage of the code is threaded.
– This sets the upper bound for potential performance.
– Can use VTune™ analyzer to see how much time each thread runs.
– Check if the total instruction count of the threaded app is equal to the instruction count of the original app.
• In many cases there is a huge overhead for threading, or just a bug (doing some work twice).
Evaluate the amount of parallel work.
– Even if both threads spend the same amount of time, they may not be doing it at the same time.
– If an (already debugged) threaded app runs much slower than the scalar app, look for false sharing issues:
• "No, converting each local variable to an array of 2 variables is not a good idea for threading efficiency." From one of my meetings, trying to explain why the threaded app was 14X slower than the original app.
Check the critical path.
– Intel® Thread Profiler is great for the job after you figure out how to use it and its cryptic terminology.
– Note that the Win32 API Thread Profiler is not the same tool as the OpenMP Thread Profiler.
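False sharing arises when per-thread variables sit next to each other in memory, as in an array of 2 variables, so that both cores keep invalidating the same cache line. A sketch of one mitigation follows, assuming a 64-byte cache line (an assumption, not a figure from the slides). The names `padded_counter_t`, `worker` and `run_counters` are illustrative.

```c
#include <assert.h>
#include <pthread.h>

#define CACHE_LINE 64  /* assumed cache line size in bytes */
#define NTHREADS 2
#define ITERATIONS 1000000L

/* Each counter is padded to a full cache line, so the two count fields
 * are at least CACHE_LINE bytes apart and cannot share a line. */
typedef struct {
    long count;
    char pad[CACHE_LINE - sizeof(long)];
} padded_counter_t;

static padded_counter_t counters[NTHREADS];

/* Each thread increments only its own counter. */
static void *worker(void *arg) {
    padded_counter_t *c = arg;
    long i;
    for (i = 0; i < ITERATIONS; i++) {
        c->count++;
    }
    return NULL;
}

/* Runs NTHREADS workers and returns the sum of their counters,
 * or -1 if a thread could not be created. */
long run_counters(void) {
    pthread_t t[NTHREADS];
    long total = 0;
    int i, started = 0, failed = 0;
    for (i = 0; i < NTHREADS; i++) {
        counters[i].count = 0;
        if (pthread_create(&t[i], NULL, worker, &counters[i]) != 0) {
            failed = 1;
            break;
        }
        started++;
    }
    for (i = 0; i < started; i++) {
        pthread_join(t[i], NULL);
        total += counters[i].count;
    }
    return failed ? -1 : total;
}
```

Keeping the hot variable as a true local inside each thread function and writing the result out once at the end is usually simpler still, since locals live on each thread's own stack.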
- 20 -
The Thread Profiler Feature
Pinpoints threading performance bottlenecks in apps threaded with:
– Microsoft* Windows* threads on Microsoft* Windows* systems
– POSIX* pthreads on Linux* systems
– OpenMP* on Microsoft* Windows* and Linux* systems
Plugs into VTune™ environment
– Microsoft* Windows* for IA-32 systems
– Linux* for IA-32 systems
Intel® Threading Tools
*Other names and brands may be claimed as the property of others.
- 21 -
Thread Profiler Feature Analysis
Monitors execution flows to find the Critical Path
– Longest execution flow is the Critical Path
Analyzes Critical Path
– System utilization
• Over-subscribed vs. under-subscribed
– Thread state transitions
• Blocked -> Running
Captures threads timeline
– Visualize threading structure
- 22 -
Thread Profiler Critical Path
[Figure: Critical Path View. Timelines for Thread 1, Thread 2 and Thread 3 over T0 to T15 (time axis 0 to 15), annotated with lock events: acquire lock L, wait for L, release L, wait for Threads 2 & 3. Legend: Idle, Serial, Parallel, Under-subscribed, Over-subscribed, Cruise time, Overhead, Blocking time, Impact time]
Start with the critical path
Separate according to system utilization
Add overhead
Further analyze by thread state
Analysis shown for 2-way system
- 23 -
Intel® Thread Profiler (OpenMP)
Example: From FAAD final report.
- 24 -
Intel® Thread Profiler (Win32 API)
From FAAD
From GainMPEG:
So what’s wrong with this picture?
- 25 -
Streaming SIMD Extensions Coding & Micro-architecture
Intel® Streaming SIMD Extensions
– Optimize the slow thread first in case of functional decomposition.
– In C++, use the class libraries.
– In C, use intrinsics.
– Use inline assembly if the compiler does not behave as expected.
– For integer code or code with many shuffle instructions, inline assembly might be the only solution.
• But will it be accepted back into the open-source tree?
Micro-architectural issues
– Use the VTune™ analyzer tuning assistant
• It's simpler than trying to learn all the ugly stuff.
• It actually works and finds big issues in some cases.
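For the "in C, use intrinsics" advice, a small sketch of an SSE loop is shown below. It multiplies two float arrays four elements at a time with the standard `xmmintrin.h` intrinsics and falls back to scalar code for the tail, or for the whole array when SSE is not available at compile time. The function name `mul_arrays` and the `HAVE_SSE` macro are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Computes dst[i] = a[i] * b[i] for i in [0, n). */
void mul_arrays(float *dst, const float *a, const float *b, size_t n) {
    size_t i = 0;
    assert(dst != NULL && a != NULL && b != NULL);
#if HAVE_SSE
    /* Process 4 floats per iteration; unaligned loads and stores are
     * used so the caller does not need 16-byte aligned buffers. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_mul_ps(va, vb));
    }
#endif
    /* Scalar tail (or the full loop when SSE is unavailable). */
    for (; i < n; i++) {
        dst[i] = a[i] * b[i];
    }
}
```

If the buffers are known to be 16-byte aligned, the aligned variants `_mm_load_ps` and `_mm_store_ps` could be used instead.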
- 26 -
Micro-arch tuning: VTune Tuning Assistant
Phase 1 – identify main slow-down reasons
– The CPI is high
– High branch mispredictions impact
– Many L2 demand misses
Use precise events to focus on instructions of interest.
- 27 -
Example: Phase 2 – focus on problem sources
Branch mispredictions
L2 load misses
- 28 -
Impact: Web Publications
From http://techreport.com/reviews/2005q2/pentium-xe-840/index.x?pg=11
LAME audio encoding
LAME MT is, as you might have guessed, a multithreaded version of the LAME MP3 encoder. LAME MT was created as a demonstration of the benefits of multithreading specifically on a Hyper-Threaded CPU like the Pentium 4. You can even download a paper (in Word format) describing the programming effort. Rather than run multiple parallel threads, LAME MT runs the MP3 encoder's psycho-acoustic analysis function on a separate thread from the rest of the encoder using simple linear pipelining. That is, the psycho-acoustic analysis happens one frame ahead of everything else, and its results are buffered for later use by the second thread. The author notes, "In general, this approach is highly recommended, for it is exponentially harder to debug a parallel application than a linear one." We have results for two different 64-bit versions of LAME MT from different compilers, one from Microsoft and one from Intel, doing two different types of encoding, variable bit rate and constant bit rate. We are encoding a massive 10-minute, 6-second 101MB WAV file here, as we have done in our previous CPU reviews.
The successful projects have high impact.
- 29 -
The LAME example: What is the LAME Project?
An educational tool used for learning about MP3 encoding. Its goal is to improve:
– Psycho-acoustics quality.
– The speed of MP3 encoding.
LAME is the most popular state-of-the-art MP3 encoder/decoder used by today’s leading products.
Project goals:
– Speeding up the encoding of an audio stream.
– Turning LAME into a Multi-Threaded (MT) engine.
– Be 1:1 bit compatible with the original version.
– Optimize specifically for SMT platforms.
– 64-bit port and CMP-related optimizations.
FOR MORE INFO... http://lame.sourceforge.net
- 30 -
MP3 Encoding Overview
Break up the audio stream into frames (uniform chunks, typically ~1K)
[Figure: Audio Stream split into Frame 1, Frame 2, Frame 3, Frame 4. Encoder stages shown: Perceptual Model, Analysis Filterbank, MDCT, Quantization, Bitstream Encode. Specifically in LAME: Read Frame, Psycho-Acoustic, Huffman Encoding]
- 31 -
LAME MT – Intuitive approach
The intuitive approach:
[Figure: Frame 1 through Frame 6 distributed between Thread 1 and Thread 2]
This is actually Data Decomposition
An unbreakable dependence due to Huffman Encoding
- 32 -
LAME MT – Functional Decomposition
[Figure: Frame 1 through Frame 6 pipelined across two threads, T1 and T2. Stages: Read Frame, Psycho-Acoustic, Analysis Filterbank, MDCT, Quantization, Huffman Encoding. Labels: Floating Point Intensive, Integer Intensive]
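The functional decomposition can be sketched as a two-stage pipeline in which one thread analyzes frame N+1 while the other encodes frame N, with results handed over through a small buffer. The code below is a generic illustration with a one-slot buffer and placeholder arithmetic in place of real analysis and encoding. It is not LAME MT's implementation, and every name in it (`analysis_stage`, `encode_stage`, `run_pipeline`, `encoded_frame`) is hypothetical.

```c
#include <assert.h>
#include <pthread.h>

#define NFRAMES 6

/* One-slot hand-off buffer between the two stages. */
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int slot_full = 0;
static int slot_value;        /* stand-in for buffered analysis results */
static int encoded[NFRAMES];  /* stand-in for the output bitstream */

/* Stage 1: "analyzes" each frame, then waits for a free slot. */
static void *analysis_stage(void *arg) {
    int f;
    (void)arg;
    for (f = 0; f < NFRAMES; f++) {
        int result = f * 10;              /* placeholder analysis work */
        pthread_mutex_lock(&mtx);
        while (slot_full) pthread_cond_wait(&cv, &mtx);
        slot_value = result;
        slot_full = 1;
        pthread_cond_broadcast(&cv);
        pthread_mutex_unlock(&mtx);
    }
    return NULL;
}

/* Stage 2: takes results in frame order and "encodes" them. */
static void *encode_stage(void *arg) {
    int f;
    (void)arg;
    for (f = 0; f < NFRAMES; f++) {
        int result;
        pthread_mutex_lock(&mtx);
        while (!slot_full) pthread_cond_wait(&cv, &mtx);
        result = slot_value;
        slot_full = 0;
        pthread_cond_broadcast(&cv);
        pthread_mutex_unlock(&mtx);
        encoded[f] = result + 1;          /* placeholder encoding work */
    }
    return NULL;
}

/* Runs both stages to completion; returns 0 on success, -1 on error. */
int run_pipeline(void) {
    pthread_t t1, t2;
    if (pthread_create(&t1, NULL, analysis_stage, NULL) != 0) return -1;
    if (pthread_create(&t2, NULL, encode_stage, NULL) != 0) {
        pthread_detach(t1);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

/* Accessor for the result of frame f after run_pipeline() returns. */
int encoded_frame(int f) {
    assert(f >= 0 && f < NFRAMES);
    return encoded[f];
}
```

Because the second stage consumes frames strictly in order, the output matches a serial run, which fits the goal of staying bit compatible with the original. The work of each stage happens outside the lock, so the two stages overlap.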
- 33 -
Results
- 34 -
Results due to Multi-Threading
Compiler | SMT Platform (CBR / VBR) | SMP Platform (CBR / VBR)
Using Microsoft’s Compiler* | 22% / 32% | 38% / 62%
Using Intel® Compiler 8.1 | 20% / 29% | 44% / 59%
* Other names and brands may be claimed as the property of others.
- 35 -
Overall Performance Results
Configuration | HT Platform (CBR / VBR) | CMP Platform (CBR / VBR)
LAME MT code + Using Intel® Compiler 8.1 | 52% / 70% | 78% / 109%
The LAME example: high quality threading job.
- 36 -
Some Observations
What can be accepted:
– Threading. There is always something to thread, but not always with significant gain.
– Differentiation via micro-architecture.
• Must be done on the same micro-architecture. If not, we may find that we helped some competitor instead of Intel.
– Streaming SIMD Extensions opportunities.
– 64-bit porting.
• A huge opportunity. Can be used if the student can’t find other options.
• Porting the assembly code will definitely show benefit. It is a big task waiting to be done.
Things that didn't go as expected:
– Finding good and influential candidates. It becomes more difficult every semester.
– One semester is too short for many apps.
– Returning code to the moderators:
• Only some parts of some projects were accepted by the open source moderator.
• None of the projects were fully accepted.
- 37 -
Backup