Hyper-Threading
Intel Compilers
Andrey Naraikin
Senior Software Engineer
Software Products Division
Intel Nizhny Novgorod Lab
November 29, 2002
Agenda
Hyper-Threading Technology Overview
Introduction: Intel SW Development Tools
– Motivation
– Challenges
– Intel SW Tools
Intel Compilers Overview
– Technologies supported
– SPEC and other benchmarks
– Some features supported by Intel Compilers
Hyper-Threading Overview
Today’s Processors
Single Processor Systems
– Instruction Level Parallelism (ILP)
– Performance improved with more CPU resources
Multiprocessor Systems
– Thread Level Parallelism (TLP)
– Performance improved by adding more CPUs
Hyper-Threading Technology brings TLP to a single-processor system.
Today’s Software
Sequential tasks: Open File, Edit, Spell Check
Parallel tasks: Open DB’s, Address Book, InBox, Meeting
Multi-Processing
Multi-tasking workload + processor resources => Improves MT Performance
Run parallel tasks using multiple processors
Figure: CPU 1, CPU 2, CPU 3
Dual-Core Architecture
Figure: two block diagrams. Hyper-Threading: one set of Processor Execution Resources with two AS blocks. Multiprocessor: two sets of Processor Execution Resources, each with one AS block.
AS = Architecture State (eax, ebx, control registers, etc.), xAPIC
Hyper-Threading Technology looks like two processors to software
Hyper-Threading Technology
Hyper-Threading Architecture Overview
Pentium, VTune and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries.
Hyper-Threading Architecture Details
Resource Utilization
Figure: execution-unit occupancy over time (proc. cycles) for four configurations: Superscalar, Multiprocessing, Hyper-Threading, and Multiprocessing With Hyper-Threading.
Note: Each box represents a processor execution unit
Performance Benefit
Figure: relative speedup (axis from 0 to 2) of applications A1 to A9, with series Serial, HTT and SMP.
Code    Description
A1      Engineering
A2      Genetics
A3      Chemistry
A4      Engineering
A5      Weather
A6      Genetics
A7      CFD
A8      FEA
A9      FEA
Source: “Hyper-Threading Technology: Impact on Compute-Intensive Workloads,” Intel Technology Journal, Vol. 6, 2002.
Key Point
Hyper-Threading Technology gives better utilization of processor resources
Hyper-Threading Technology gives more computing power for multithreaded applications
Collateral
Web Sites
– http://developer.intel.com/technology/hyperthread/
– http://developer.intel.com/design/pentium4/applnots
– http://developer.intel.com/design/pentium4/manuals
Documentation and application notes
– IA-32 Intel® Architecture Software Developer’s Manual
– Intel Pentium® 4 and Intel Xeon™ Processor Optimization Manual
– Intel App Note AP485 - “Intel Processor Identification and CPU Instructions”
– Intel App Note AP 949 “Using Spin-Loops on Intel Pentium 4 Processor and Intel Xeon Processor”
– Intel App Note “Detecting Support for Jackson Technology Enabled Processors”
Collateral (Cont’d)
Intel Technology Journal
– http://developer.intel.com/technology/itj/
Intel Threading Tools
– http://www.intel.com/software/products/
OpenMP
– http://www.openmp.org
HT Overview
– http://www.ixbt.com/cpu/pentium4-3ghz-ht.shtml
Performance Advantage: Optimization Path
Figure: Sunset Simulation, optimized performance. The chart shows speedup steps labeled 1x, 4x, 7x, 9x, 13x and 15x for combinations of Standard Compiler, Intel Compiler, Performance Libraries (IPP or MKL) and OpenMP Threading, with Analysis with VTune™. Code changes are labeled “Little or No Code Change” and “Minor Code Change (1 Line)”. The best result is 15x faster.
Intel SW Development Tools
Intel® Compilers
C, C++ and Fortran95
– Available on Windows* and Linux*
– Available for 32-bit and 64-bit platforms
Utilization of latest processor/platform features
– Optimizations for NetBurst™ architecture (Pentium® 4 and Xeon™ processor)
– Optimizations for Itanium® architecture
Seamless integration into Windows* (IDE) and Linux* environment
Source and binary compatible with Microsoft* compiler; mostly source compatible with GNU (gcc)
Intel SW Development Tools – Compilers
Benchmarks: Intel® Compilers 6.0 for Windows*
SPECint_base2000
    Leading C++ Compiler             703
    Intel® C++ Compiler 6.0          825    17% Faster Integer Performance
SPECfp_base2000 (Geomean of Fortran)
    CVF* 6.6                         686
    Intel® Fortran Compiler 6.0      881    28% Faster Floating-point Performance
Configuration info: Intel® Pentium® 4 Processor, 2.4 GHz, Intel® Medford 850 Motherboard (D850MD 850 motherboard) Chipset, 256 MB Memory, Windows* XP Professional Edition (build 2600), GeForce 3/nVidia* Graphics
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. Users’ results are dependent upon the application characteristics (loopy vs. flat), mix of C and C++, and other factors. For more information on performance tests and on the performance of Intel products, reference [www.intel.com] or call (U.S.) 1-800-628-8686 or 1-916-356-3104.
Intel® C++ Compiler 6.0 for Linux*
PovRay Image Rendering Time
    gcc 2.96, O2 and Fast-math Optimization    20.30 Seconds
    Intel® 6.0 Comparable Optimization         14.75 Seconds
    Intel® 6.0 Maximum Optimization            13.57 Seconds
Figure: bar chart of improvement (axis from 60% to 160%) for the three builds.
Configuration info: Intel® Pentium® 4 processor, 2.0 GHz, 256 MB Memory, nVidia* GeForce 2 graphics card, Linux* 2.4.7, PovRay 3.1G
Special Performance Features
Auto-Vectorization for NetBurst™ architecture
Software-Pipelining for EPIC architecture
Auto-Parallelization and OpenMP based parallelization
– for Hyper-Threading and multi-processor systems
Data Pre-Fetching
Profile-Guided Optimization (PGO)
Inter-procedural Optimization (IPO)
CPU Dispatch
– Establishes code path at runtime dependent on actual processor type
– Allows single binary with optimal performance across processor families
Techniques Overview
Exploit parallelism to speed up application
Vectorization
– Supported by programming languages and compilers
– Motivated by modern architectures
    Superscalarity, deeply pipelined core
    SIMD
    Software pipelining on Itanium™ architecture
Parallelization
– OpenMP™ directives for shared memory multiprocessor systems
– MPI computations for clusters
Features by Intel Compilers
Intel processors and vectorization
Type of processor: Vectorization features supported
Pentium® with MMX™ technology, Pentium® II processors: Integer types, 64 bits
Pentium® III processor: Streaming SIMD Extensions (SSE), single precision floating point
Pentium® 4 processor: Streaming SIMD Extensions 2 (SSE 2), double precision floating point, integer types, 128 bits
Features by Intel Compilers - Vectorization
Automatic Vectorization
Compiler automatically transforms sequential code for SIMD execution
Sequential code:
    for (i=0; i<n; i++) {
        a[i] = a[i] + b[i];
        a[i] = sin(a[i]);
    }
icl -Qx[MKW]
Vectorized form (array-section pseudocode, VL = vector length):
    for (i=0; i<n; i=i+VL) {
        a(i : i+VL-1) = a(i : i+VL-1) + b(i : i+VL-1);
        a(i : i+VL-1) = _vmlSin(a(i : i+VL-1));
    }
Figure labels: HW SIMD instruction (the vector addition), Run-Time Library (_vmlSin)
Vectorization Example
    double a[N], b[N];
    int i;
    for (i = 0; i < N; i++)
        a[i] = a[i] + b[i];
icl -QxW
Figure: scalar and vector execution of the loop.
    a:      0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
    b:      0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
    a + b:  0.0 2.0 4.0 6.0 8.0 10.0 12.0 14.0 16.0 18.0 (same result for Scalar and Vector)
Reduction Example
    float a[N], x;
    int i;
    x = 0.0;
    for (i = 0; i < N; i++)
        x += a[i];
Figure: vectorized reduction with four partial sums.
    a:  0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0
    Loop kernel (partial sums after each step):
        0.0   0.0   0.0   0.0
        0.0   1.0   2.0   3.0
        4.0   6.0   8.0  10.0
       12.0  15.0  18.0  21.0
    Postlude:
       30.0  36.0
       66.0
Parallel Program Development
Parallelization (figure axis: ease of use/maintenance)
– Explicit threading using operating system calls
– With industry standard OpenMP* directives
– Automatically using the compiler
Features by Intel Compilers - Parallelization
Autoparallelization
    float a[N], b[N], c[N];
    int i;
    for (i=0; i<N; i++)
        c[i] = a[i]*b[i];
icl -Qparallel foo.c { -xparallel on Linux}
….foo.c
foo.c(7) : (col. 2) remark: LOOP WAS AUTO-PARALLELIZED....
./foo.exe -- Executable detects and uses number of processors…
-Qpar_report[n] - get helpful messages from the compiler
OpenMP™ Directives
OpenMP* standard (www.openmp.org)
– Set of directives to enable the writing of multithreaded programs
Use of shared memory parallelism on programming language level
– Portability
– Performance
Support by Intel® Compilers
– Windows*, Linux*
– IA-32 and Itanium™ architectures
Simple Directives
    void foo(float *a, float *b, float *c)
    {
        int i;
    #pragma parallel
        for (i=0; i<N; i++) {
            *c++ = (*a++)*bar(b++);
        }
    }
Pointers and procedure calls with escaped pointers prevent analysis for autoparallelization
Use simple directives instead
OpenMP* Directives
    void foo()
    {
        int a[1000], b[1000], c[1000], x[1000], i, NUM;
        /* parallel region */
    #pragma omp parallel private(NUM) shared(x, a, b, c)
        {
            NUM = omp_get_num_threads();
    #pragma omp for private(i)    /* work-sharing for loop */
            for (i = 0; i < 1000; i++) {
                x[i] = bar(a[i], b[i], c[i], NUM);    /* assume bar has no side-effects */
            }
        }
    }
icl -Qopenmp -c foo.c    { -xopenmp on Linux}
foo.c
foo.c(10) : (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
foo.c(7) : (col. 1) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
OpenMP™ + Vectorization
Combined speedup
Order of use might be important
– Parallelization overhead
– Vectorize inner loops
– Parallelize outer loops
Supported by Intel® Compilers
Intel® Compilers
Leading-Edge compiler technologies
Compatible with leading industry standard compilers
Processor optimized code generation
Support single source code across Intel processor families
Make performance a feature of your applications today – stay competitive