Ramesh Peri - CSE · 2001-01-03 · Ramesh Peri Principal Engineer & Engineering Manager, Performance and Threading Tools Lab Intel® Corporation, Austin, TX 78738


Page 2

Multi-Core Processors: Are They Here Yet?

My shopping basket at Fry's Electronics on Black Friday

Item                                     Cost
Motherboard + Intel® quad-core 2.4 GHz   200
4 GB memory                              70
0.5 TB disk                              80
Case                                     10
Graphics                                 10
CD/DVD                                   10
Total                                    380

Page 3

Outline

  • Parallelism in Intel® processors
  • Intel® tools for parallel programming
  • Some thoughts

Page 4

A New Era…

THE OLD
  • Performance equals frequency
  • Unconstrained power
  • Voltage scaling

THE NEW
  • Performance equals IPC
  • Multi-core
  • Microarchitecture advancements
  • Power efficiency

Performance comes from parallelism

Page 5

Intel® Wide Dynamic Execution

[Figure: block diagram of two identical cores, Core 1 and Core 2. Stages in each core: Instruction Fetch and Pre-Decode, Instruction Queue, Decode, Rename/Alloc, Schedulers, Execute, Retirement Unit (Reorder Buffer). Icon labels: Perf, Energy.]

Each core:
  • 4-wide decode to execute
  • 4-wide micro-op execute
  • Micro- and macro-fusion
  • Deeper buffers
  • Efficient 14-stage pipeline
  • Enhanced ALUs

Multiple cores. Multiple instruction issue.

Page 6

Intel® Advanced Digital Media Boost: Single-Cycle SSE

[Figure: a 128-bit SSE operation (SSE/SSE2/SSE3) on packed sources X4..X1 and Y4..Y1 (bits 127..0), producing the destination X4opY4, X3opY3, X2opY2, X1opY1. Previous: decode, then execute spread over clock cycles 1 and 2. Core™ μarch: decode, then execute in clock cycle 1 (single-cycle SSE), in each core. Icon labels: Perf, Energy.]

*Graphics not representative of actual die photo or relative size

SIMD instructions compute multiple operations per instruction

Page 7

Advanced Digital Media Boost

SIMD operations combined with multiple instruction issue

[Figure: core block diagram. Front end: Instruction Fetch and Pre-Decode, Instruction Queue, uCode ROM, Decode, Rename/Alloc, Retirement Unit (Reorder Buffer), Schedulers. Execution units: ALU / Branch / MMX/SSE / FPmove; ALU / FAdd / MMX/SSE / FPmove; ALU / FMul / MMX/SSE / FPmove; Load; Store. Memory: L1 D-Cache and D-TLB, 2M/4M shared L2 cache, FSB up to 10.4 Gb/s. Width labels: 4, 4, 5.]

  • 128-bit packed multiply, plus
  • 128-bit packed add, plus
  • 128-bit packed load, plus
  • 128-bit packed store, plus
  • Compare and jump

Page 8

Many Levels of Parallelism

  • Instruction-level parallelism
  • SIMD (vector) parallelism
  • Multi-core parallelism

Page 9

Biggest Performance Leap Since Out-of-Order Execution

[Figure: bar chart, "Integer Performance at Introduction (normalized to 25 MHz 486DX)" (footnote 1). Vertical axis: 0 to 1200. Processors: Pentium, Pentium II, Pentium III, Pentium 4, Pentium D, Conroe. Series: single threaded, multi threaded.]

1. Estimated based on preproduction measurements; Source: SPEC Web site & Newsletter

Parallel processor: instruction-level parallelism, SIMD parallelism, multi-core parallelism

Page 10

Tera-Leap to Parallelism: Energy-Efficient Performance

[Figure: energy-efficient performance (vertical axis) versus time (horizontal axis). Steps along the curve: single-core chips (instruction-level parallelism, Hyper-Threading), Dual Core, Quad-Core, Tera-Scale Computing. Callouts: "More performance using less energy"; "Science fiction becomes a reality".]

Many-core requires a high degree of scaling in an application.

Page 11

Amdahl's Law

Runtime = serial + parallel

With perfect scaling on P processors:

Runtime = serial + parallel/P

To get a 50x speedup on 65 processors, the serial time must be less than about 0.5% of the single-processor runtime.

Emerging class of highly parallel workloads: games, media processing, personal data search, and more

[Figure: "Ideal performance scaling, varying serial percentage". Speedup (0 to 70) versus processors (1 to 64), one curve per serial percentage: 0.00%, 0.50%, 1.00%, 5.00%, 10.00%, 50.00%.]

Page 12

Application Example: Sports Video Indexing

[Figure: video frames labeled Field, Blue kicks, Red kicks.]

  • Recognize players, objects and events
  • Mine the video for target scenes
  • Synthesize a summary of what happened

Page 13

Outline

  • Parallelism in Intel® processors
  • Intel® tools for parallel programming
  • Some thoughts

Page 14

Parallel Programming Deserves Great Tools

  • Architectural Analysis: visualization of applications and the system
  • Introduce Parallelism: highly optimizing compilers delivering scalable solutions
  • Confidence / Correctness: detect latent programming bugs
  • Optimize / Tune: tune for performance and scalability

Page 15

Example: Prime Number Generation

#include <stdio.h>

const long N = 18;
long primes[N], number_of_primes = 0;

int main()
{
    printf( "Determining primes from 1-%ld \n", N );
    primes[ number_of_primes++ ] = 2; // special case
    // Outer loop: odd candidates 3, 5, 7, ...
    for ( long number = 3; number <= N; number += 2 ) {
        long factor = 3;
        // Inner loop: stops at the first odd factor that divides number
        while ( number % factor ) factor += 2;
        if ( factor == number )
            primes[ number_of_primes++ ] = number;
    }
    printf( "Found %ld primes\n", number_of_primes );
}

Outer loop steps through numbers that may be prime

Inner loop tests factors of number

Page 16

Example: Prime Number Generation

Trace of the program from the previous slide for N = 18. The last factor listed divides the number; the number is prime when that factor is the number itself.

Number   Factors tested
2        (special case)
3        3
5        3, 5
7        3, 5, 7
9        3
11       3, 5, 7, 9, 11
13       3, 5, 7, 9, 11, 13
15       3
17       3, 5, 7, 9, 11, 13, 15, 17

Page 17

Using the Intel Programming Tools

Architectural Analysis · Introducing Threads · Debugging · Performance Tuning

Call Graph
  • Functional structure
  • Execution times
  • Counts

Intel® VTune™ Performance Analyzer

Page 18

VTune™ Call Graph Profile

Page 19

VTune™ Sampling Profile

Sampled Cycles: 10,774

Sampled Instructions: 1,608

CPI: 6.7

Page 20

Hardware Support for Sampling

[Figure: pipeline stages Instruction Fetch and Pre-Decode, Instruction Queue, Decode, Rename/Alloc, Execute, Retire, with Event1 and Event2 feeding the event counters.]

Statistically associate events with instructions

Event counters count events that occur in pipeline.

Key events:

• Clock_cycle

• Instruction_retired

Counters count down to zero.

• At zero, interrupt and capture instruction pointer.

• Reset counter and continue execution.

Page 21

Example: Prime Number Kernel

Number   Factors tested
2        (special case)
3        3
5        3, 5
7        3, 5, 7
9        3
11       3, 5, 7, 9, 11
13       3, 5, 7, 9, 11, 13
15       3
17       3, 5, 7, 9, 11, 13, 15, 17

for ( long number = 3; number <= N; number += 2 ) {
    long factor = 3;
    while ( number % factor ) factor += 2;
    if ( factor == number ) primes[ number_of_primes++ ] = number;
}

Page 22

Using the Intel Programming Tools

Architectural Analysis · Introducing Threads · Debugging · Performance Tuning

OpenMP Loop Construct
  • Creates one thread per core
  • Assigns iterations to threads

Intel® Compilers

Page 23

Example: Add OpenMP Directive

Serial version:

#include <stdio.h>

const long N = 100000;
long primes[N], number_of_primes = 0;

int main()
{
    printf( "Determining primes from 1-%ld \n", N );
    primes[ number_of_primes++ ] = 2; // special case

    for ( long number = 3; number <= N; number += 2 ) {
        long factor = 3;
        while ( number % factor ) factor += 2;
        if ( factor == number ) primes[ number_of_primes++ ] = number;
    }
    printf( "Found %ld primes\n", number_of_primes );
}

OpenMP version: the same program with one directive added before the outer loop:

    #pragma omp parallel for
    for ( long number = 3; number <= N; number += 2 ) {
        long factor = 3;
        while ( number % factor ) factor += 2;
        if ( factor == number ) primes[ number_of_primes++ ] = number;
    }

Page 24

Example: Add OpenMP Directive

The slide annotates the parallel loop with a two-thread split of the range of number:
  • Thread 1: iterations 1..50000
  • Thread 2: iterations 50000..100000


Page 26

Data Race

Thread 1                    Thread 2
number_of_primes++          number_of_primes++

(time runs downward)

The updates to number_of_primes are not atomic!

Page 27

Data Race

Thread 1                    Thread 2
t1 = LOAD n_of_p            t2 = LOAD n_of_p
t1++                        t2++
STORE n_of_p = t1           STORE n_of_p = t2

(time runs downward)

The updates to number_of_primes are not atomic!

Page 28

Data Race

Thread 1                    Thread 2
1: t1 = LOAD n_of_p
                            2: t2 = LOAD n_of_p
3: t1++
                            4: t2++
                            5: STORE n_of_p = t2
6: STORE n_of_p = t1

(time runs downward)

Lose one increment to number_of_primes!

Page 29

Data Race

Thread 1                    Thread 2
CRITICAL {
  t1 = LOAD n_of_p
  t1++
  STORE n_of_p = t1
}
                            CRITICAL {
                              t2 = LOAD n_of_p
                              t2++
                              STORE n_of_p = t2
                            }

(time runs downward)

Add synchronization to remove data race

Page 30

Example: Add Synchronization Directive

Before, the shared counter is updated without protection:

        if ( factor == number )
            primes[ number_of_primes++ ] = number;

After, the update is wrapped in a critical section:

#include <stdio.h>

const long N = 100000;
long primes[N], number_of_primes = 0;

int main()
{
    printf( "Determining primes from 1-%ld \n", N );
    primes[ number_of_primes++ ] = 2; // special case

    #pragma omp parallel for
    for ( long number = 3; number <= N; number += 2 ) {
        long factor = 3;
        while ( number % factor ) factor += 2;
        if ( factor == number ) {
            // Only one thread at a time may update the counter and the array
            #pragma omp critical
            primes[ number_of_primes++ ] = number;
        }
    }
    printf( "Found %ld primes\n", number_of_primes );
}

Page 31

Using the Intel Programming Tools

Architectural Analysis · Introducing Threads · Debugging · Performance Tuning

Thread Safety Issues
  • Data races
  • Deadlocks

Intel® Thread Checker

Page 32

Intel® Thread Checker

Page 33

How Does Thread Checker Work?

Thread 1                    Thread 2
Lock(L);
n_of_p++;
Unlock(L);
                            Lock(L);
                            n_of_p++;
                            Unlock(L);

(time runs downward)

Page 34

How Does Thread Checker Work?

Use binary instrumentation

Thread 1                                Thread 2
Lock(L);     record lock(L)
n_of_p++;    record read(n_of_p)
             record write(n_of_p)
Unlock(L);   record unlock(L)
                                        Lock(L);     record lock(L)
                                        n_of_p++;    record read(n_of_p)
                                                     record write(n_of_p)
                                        Unlock(L);   record unlock(L)

(time runs downward)

Page 35

How Does Thread Checker Work?

Thread 1 (Segment 1)        Thread 2 (Segment 2)
Lock(L);
n_of_p++;
Unlock(L);
                            Lock(L);
                            n_of_p++;
                            Unlock(L);

(time runs downward)

Analysis reveals that segment 1 "happens before" segment 2. So, no data race problem.

Analysis based on [Lamport 1978]

Page 36

How Does Thread Checker Work?

Thread 1 (Segment 1)        Thread 2 (Segment 2)
n_of_p++                    n_of_p++

(time runs downward)

With no locks, neither segment "happens before" the other. So there is a data race!

Page 37

Using the Intel Programming Tools

Architectural Analysis · Introducing Threads · Debugging · Performance Tuning

Find Contended Locks
  • Most overhead
  • Largest reduction in parallelism

Intel® Thread Profiler

Page 38

Profiling the Two Threads in Primes

Work is not balanced between threads

Page 39

for ( long number = 3; number <= N; number += 2 ) {
    long factor = 3;
    while ( number % factor ) factor += 2;
    if ( factor == number ) primes[ number_of_primes++ ] = number;
}

Thread 2 does 3x more work than thread 1
  • Thread 1: iterations 1..50000
  • Thread 2: iterations 50000..100000

Page 40: Ramesh Peri - CSE · 2001-01-03 · Ramesh Peri Principal Engineer & Engineering Manager, Performance and Threading Tools Lab Intel® Corporation, Austin, TX 78738

#include <stdio.h>const long N = 100000;long primes[N], number_of_primes = 0;main(){printf( "Determining primes from 1-%d \n", N );primes[ number_of_primes++ ] = 2; // special case

#pragma omp parallel for schedule(dynamic)for ( long number = 3; number <= N; number += 2 ) {long factor = 3;while ( number % factor ) factor += 2;if ( factor == number )

#pragma omp criticalprimes[ number_of_primes++ ] = number;

}printf( "Found %d primes\n", number_of_primes );

}


Example: Change to Dynamic Scheduling

Each iteration is assigned dynamically to a worker thread


Profiling the Two Threads with Dynamic Scheduling

Work balanced between the threads


Using the Intel® Programming Tools

Introducing Threads

Debugging

Performance Tuning

Architectural Analysis

Intel® Thread Profiler

Intel® Thread Checker

Intel® C++ Compiler

Intel® VTune™ Analyzers

All of these tools are available free of charge for academic use from the Intel® website


Outline

Parallelism in Intel® processors

Intel® tools for parallel programming

Some thoughts


The landscape of parallel tools

Tools need to just work out of the box
– Provide noise-free information
– No false positives
– Information must be relevant and accurate

Must be scalable and work in a variety of environments
– Language, architecture and OS independent
– Must work in the presence of VMMs

Must work on existing binaries
– No recompilation or special preparation of binaries for tools

Performance degradation must be in an acceptable range

Need more capable tools to handle future multi-core processors
– Non-uniform memory access
– Multiple layers of memory hierarchy visible to the programmer
– Very large number of cores


Other software development tools in our lab

Data mining and machine learning techniques for utilization of PMU counters to improve program performance

Pin: a dynamic instrumentation system
– Computer Architecture Research
  – For emulating new instructions
  – Modeling micro-architectural features
    – Branch Predictor Models
    – Cache Models
    – Simple Timing Models
– Performance Analysis Tools
  – Dynamic/Static Instruction Counting Tools
  – CallGraph/CallCount Tools
  – Tool to produce Annotated CFG
– Program Analysis Tools
  – Code Coverage Tools
  – Dynamic Optimization Tools
  – Memory Checking Tools
  – Race Detection Tools
  – Replay Tools

Dynamic optimization of programs

Parallel libraries/domain specific languages


Summary

Parallel processors
– Instruction level parallelism
– SIMD parallelism
– Multi-core parallelism

Intel® tools for parallel programming
– Intel® Fortran/C++ compiler
– Intel® VTune™
– Intel® Thread Checker
– Intel® Thread Profiler

Parallel applications
– New emerging class of parallel applications
– Tune for all levels of parallelism
– Need more research into parallel tools and methodologies
