© 2013, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
Teaching “Think Parallel”
Four positive trends toward parallel programming, including advances in teaching/learning
James Reinders, Intel, April 2013
Better Tools for Parallel Programming
Better Parallel Models
Wildly more Hardware Parallelism
Better Educated Programmers
Parallel Programming is IMPORTANT. These FOUR factors combine to help parallel programming rise more quickly.
• Industry-leading performance from advanced compilers
• Comprehensive libraries
• Parallel programming models
• Insightful analysis tools
Intel® Advisor XE
Intel® Composer XE
• Intel® C/C++ Compiler, Intel® Fortran Compiler
• Intel® Math Kernel Library, Intel® Integrated Performance Primitives
Intel® Inspector XE
Intel® VTune™ Amplifier XE
Parallel Programming is IMPORTANT. Programming models are improving to be more productive.
OpenMP* (Open Multi-Processing)
• Standard used by many parallel applications; supported by every major compiler for Fortran and C
• OpenMP 4.0 standard: new mid-2013

!$omp parallel do
do i=1,10
   A(i) = B(i) * C(i)
enddo
!$omp end parallel do
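The same loop can also be written in C++; the sketch below is illustrative (not from the deck), and the `multiply` helper is an assumption. Without OpenMP enabled at compile time the pragma is simply ignored and the loop runs serially with identical results.

```cpp
#include <vector>

// Element-wise multiply, mirroring the Fortran "!$omp parallel do"
// example above. With OpenMP enabled, iterations are split across
// threads; each iteration is independent, so the result is the same
// either way.
std::vector<double> multiply(const std::vector<double>& B,
                             const std::vector<double>& C) {
    std::vector<double> A(B.size());
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(A.size()); ++i)
        A[i] = B[i] * C[i];
    return A;
}
```

Compiling with OpenMP support (e.g. `-fopenmp` on gcc) activates the pragma; compilers without OpenMP support ignore it and produce the same answers serially.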
SIMD directives: Intel innovation
• Support in Intel Compilers since 2011
• OpenMP 4.0 standard, new mid-2013: expect to see it in all OpenMP-compliant compilers!

#pragma omp simd reduction(+:val) reduction(+:val2)
for(int pos = 0; pos < RAND_N; pos++) {
    float callValue = expectedCall(Sval, Xval, MuByT, VBySqrtT, l_Random[pos]);
    val  += callValue;
    val2 += callValue * callValue;
}
SIMD vs. Prior methods (you can now choose!)
• OLDER methods (like IVDEP directives, restrict keyword, etc.):
– Keep adding directives, keywords, compile-time switches, hints, etc., hoping your code will vectorize
– Pro: you start with working code, and it continues to work at each step along the way
– Pro: if your algorithm, as written, cannot safely vectorize, the compiler will never vectorize it (hard for most programmers to distinguish from the compiler simply being too conservative)
– Con: you are trying to guess what additional information the compiler needs in order to be comfortable vectorizing
• NEW (SIMD directives) method:
– Pro: add the directive, and the compiler WILL vectorize the code. No time is spent fussing with the compiler or worrying that it is too conservative.
– Con: right or wrong, it is vectorized. If wrong, you have debugging work to do to discover how to restructure your algorithm so it vectorizes without changing results. There is no more “help” from the compiler in noticing real problems with changing your algorithm to use vector instructions.
OpenMP 4.0 offers industry convergence for a true standard; Intel first to support!

Feature comparison (columns: OpenACC | LEO | Desired Standard):
- Support for C and C++, Fortran: ✔ ✔ ✔
- Support single code base for hetero machines: ✔ ✔ ✔
- Overlap communication and computation: ✔ ✔ ✔
- Interoperate with MPI: ✔ ✔ ✔
- Interoperate with OpenMP: ✔ ✔
- Offload to GPU: ✔ ✔
- Offload to MIC Co-processor: ✔ ✔
- Ability to support all accelerators: ✔
- Ability to support all GPUs: ✔
- Ability to support all co-processors: ✔
- Proof of performance portability: ✔
- Support for nested parallelism: ✔ ✔
- User-managed memory consistency: ✔ ✔ ✔
- Multiple vendor support: ✔ ✔
- Restrict clause support: ✔
- Support for dynamic dispatch: ✔ ✔
- Parallel on/off separate from offload: ✔ ✔
- PGI, CAPS compiler support 2012: ✔
- Cray compiler support soon: ✔
- Intel compiler support 2010*: ✔
- Broad standards body approval: ✔ OpenMP 4.0 (late 2012) maybe
* public product availability was 2012
threadingbuildingblocks.org
TBB for C++ scaling: the most popular solution for C++ parallel programming
cilkplus.org
TBB has a “sister”, Cilk™ Plus:
• Help for C programmers
• Involves the compiler
• Vectorization support
Intel® Cilk™ Plus (cilkplus.org)
• Tasking: Cilk Keywords, Hyperobjects
• Vectorization: Array Notation, SIMD Annotation, Elemental Functions
Intel® Cilk™ Plus (cilkplus.org)
Intel products, plus gcc and LLVM branches available.
• Tasking: Cilk Keywords, Hyperobjects
• Vectorization: Array Notation, SIMD Annotation (adopted by OpenMP 4.0, mid-2013), Elemental Functions (adopted by OpenMP 4.0, mid-2013)
Parallel Programming Model abstractions that yield portability, performance, productivity, usability, and maintainability.
Parallel Programming is IMPORTANT. These FOUR factors combine to help parallel programming rise more quickly.
Intel® Xeon Phi™ Coprocessors: Highly-parallel Processing for Unparalleled Discovery
Groundbreaking: differences
• Up to 61 IA cores / 1.1 GHz / 244 threads
• Up to 8 GB memory with up to 352 GB/s bandwidth
• 512-bit SIMD instructions
• Linux operating system, IP addressable
• Standard programming languages and tools

Leading to groundbreaking results
• Up to 1 TeraFlop/s double precision peak performance (1)
• Up to 2.2x higher memory bandwidth than on an Intel® Xeon® processor E5 family-based server (2)
• Up to 4x more performance per watt than with an Intel® Xeon® processor E5 family-based server (3)
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance Notes 1, 2 & 3, see backup for system configuration details.
Intel® Xeon Phi™ Coprocessors: They’re So Much More
General purpose IA hardware leads to less idle time for your investment.

Intel® Xeon Phi™ Coprocessor: it’s a supercomputer on a chip
• Operate as a compute node
• Run a full OS
• Program to MPI
• Run x86 code
• Run offloaded code

Custom HW acceleration (GPU, ASIC, FPGA): restrictive architectures
• Restrictive architectures limit the ability of applications to use arbitrary nested parallelism, function calls, and threading models
• Run restricted code; run offloaded code

Source: Intel Estimates
Our vision: span from few cores to many cores with consistent models, languages, tools, and techniques.
[Diagram: one source base, built with common compilers, libraries, and parallel models, targets both multicore CPUs and the Intel® MIC architecture coprocessor]
Game Changer
“Unparalleled productivity… most of this software does not run on a GPU” – Robert Harrison, NICS, ORNL
(R. Harrison, “Opportunities and Challenges Posed by Exascale Computing: ORNL’s Plans and Perspectives”, National Institute for Computational Sciences, Nov 2011)
Intel® Parallel Studio XE
• Intel® C/C++ and Fortran Compilers w/OpenMP
• Intel® MKL, Intel® Cilk™ Plus, Intel® TBB, and Intel® IPP
• Intel® Inspector XE, Intel® VTune™ Amplifier XE, Intel® Advisor
plus:
• Intel® Trace Analyzer and Collector
• Intel® MPI Library
SMP on a chip…
Intel® Xeon Phi™ Coprocessor: Increases Application Performance up to 10x
Application Performance Examples
* Xeon = Intel® Xeon® processor; * Xeon Phi = Intel® Xeon Phi™ coprocessor
Customer | Application | Performance Increase (1) vs. 2S Xeon*
Los Alamos | Molecular Dynamics | Up to 2.52x
Acceleware | 8th order isotropic variable velocity | Up to 2.05x
Jefferson Labs | Lattice QCD | Up to 2.27x
Financial Services | BlackScholes SP | Up to 7x
Financial Services | Monte Carlo SP | Up to 10.75x
Sinopec | Seismic Imaging | Up to 2.53x (2)
Sandia Labs | miniFE (Finite Element Solver) | Up to 2x (3)
Intel Labs | Ray Tracing (incoherent rays) | Up to 1.88x (4)
Source: Customer Measured results as of October 22, 2012. Configuration Details: Please reference slide speaker notes. For more information go to http://www.intel.com/performance
Notes:
1. 2S Xeon* vs. 1 Xeon Phi* (preproduction HW/SW & application running 100% on coprocessor unless otherwise noted)
2. 2S Xeon* vs. 2S Xeon* + 2 Xeon Phi* (offload)
3. 8-node cluster, each node with 2S Xeon* (comparison is cluster performance with and without 1 Xeon Phi* per node) (Hetero)
4. Intel Measured Oct. 2012
• Intel® Xeon Phi™ coprocessor accelerates highly parallel & vectorizable applications (graph above)
• Table provides examples of such applications
Synthetic Benchmark Summary (Intel® MKL) (5110P)
[Bar charts: 2S Intel® Xeon® processor vs. 1 Intel® Xeon Phi™ coprocessor; higher is better in all cases]
• SGEMM (GF/s): 640 vs. 1,729 (up to 2.7x, 85% efficient)
• DGEMM (GF/s): 309 vs. 833 (up to 2.7x, 82% efficient)
• SMP Linpack (GF/s): 303 vs. 722 (up to 2.3x, 71% efficient)
• STREAM Triad (GB/s): 78 vs. 159 (ECC on) / 171 (ECC off) (up to 2.1x)
Coprocessor results: benchmark run 100% on coprocessor, no help from Intel® Xeon® processor host (aka native)
Source: Intel Measured results as of October 26, 2012. Configuration Details: Please reference slide speaker notes. For more information go to http://www.intel.com/performance
Notes:
1. Intel® Xeon® Processor E5-2670 used for all. SGEMM Matrix = 13824 x 13824, DGEMM Matrix 7936 x 7936, SMP Linpack Matrix 30720 x 30720
2. Intel® Xeon Phi™ coprocessor 5110P (ECC on) with “Gold Release Candidate” SW stack. SGEMM Matrix = 11264 x 11264, DGEMM Matrix 7680 x 7680, SMP Linpack Matrix 26872 x 28672
Lots of Parallelism is a big deal
http://tinyurl.com/inteljames twitter @jamesreinders http://intel.com/software/mic
Books to Help “Think Parallel”
Intel® Xeon Phi™ Coprocessor High Performance Programming, Jim Jeffers, James Reinders, (c) 2013, publisher: Morgan Kaufmann
It all comes down to PARALLEL PROGRAMMING! (applicable both to processors and to Intel® Xeon Phi™ coprocessors)
Foreword, Preface
Chapters:
1. Introduction
2. High Performance Closed Track Test Drive!
3. A Friendly Country Road Race
4. Driving Around Town: Optimizing A Real-World Code Example
5. Lots of Data (Vectors)
6. Lots of Tasks (not Threads)
7. Offload
8. Coprocessor Architecture
9. Coprocessor System Software
10. Linux on the Coprocessor
11. Math Library
12. MPI
13. Profiling and Timing
14. Summary
Glossary, Index
Available since February 2013.
This book belongs on the bookshelf of every HPC professional. Not only does it successfully and accessibly teach us how to use and obtain high performance on the Intel MIC architecture, it is about much more than that. It takes us back to the universal fundamentals of high-performance computing, including how to think and reason about the performance of algorithms mapped to modern architectures, and it puts into your hands powerful tools that will be useful for years to come.
—Robert J. Harrison, Institute for Advanced Computational Science, Stony Brook University
Learn more about this book:
lotsofcores.com
© 2013, James Reinders & Jim Jeffers, book image used with permission
This is a really great book… I’ve been dreaming for a while of a modern, accessible book that I could recommend to my threading-deprived colleagues and assorted enquirers to get them up to speed with the core concepts of multithreading, as well as something that covers all the major current interesting implementations. Finally I have that book.
—Martin Watt, Principal Engineer, DreamWorks Animation
Structured Parallel Programming, Michael McCool, Arch Robison, James Reinders (c) 2012, publisher: Morgan Kaufmann
Teaches parallel programming using a new pattern-based approach. Extensive examples in Cilk Plus and TBB. Not about any specific hardware, but relevant to all. It’s about effective parallel programming. Great for teaching!
Learn more about this book:
parallelbook.com
Available since July 2012.
© 2012, Michael McCool, Arch Robison, James Reinders, book image used with permission
Structured Parallel Programming (parallelbook.com): available since July 2012 in English, and since February 2013 in Japanese.
Teaching Parallelism
• Patterns & our parallel programming tools
• Map, Reduce: Dot product, Cilk Plus
• Stencil, Recurrence: Forward seismic simulation, Cilk Plus
• Pipeline: Compression, Cilk Plus and TBB
Structured Programming with Patterns
• Patterns are “best practices” for solving specific problems.
• Patterns can be used to organize your code, leading to algorithms that are more scalable and maintainable.
• A pattern supports a particular algorithmic structure with an efficient implementation.
• Intel’s tools support a set of useful parallel patterns with low-overhead implementations.
Structured Serial Patterns
The following patterns are the basis of “structured programming” for serial computation:
• Sequence • Selection • Iteration • Nesting • Functions • Recursion
• Random read • Random write • Stack allocation • Heap allocation • Objects • Closures
Compositions of structured serial control flow patterns can be used in place of unstructured mechanisms such as “goto.” Using these patterns, “goto” can (mostly) be eliminated and the maintainability of software improved.
Structured Parallel Patterns
The following additional parallel patterns can be used for “structured parallel programming”:
• Superscalar sequence • Speculative selection • Map • Recurrence • Scan • Reduce • Pack/expand • Fork/join • Pipeline
• Partition • Segmentation • Stencil • Search/match • Gather • Merge scatter • Priority scatter • *Permutation scatter • !Atomic scatter
Using these patterns, threads and vector intrinsics can (mostly) be eliminated and the maintainability of software improved.
Map
• Map invokes a function on every element of an index set.
• The index set may be abstract or associated with the elements of an array.
• Corresponds to a “parallel loop” where iterations are independent.
Examples: gamma correction and thresholding in images; color space conversions; Monte Carlo sampling; ray tracing.
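The map pattern can be sketched in plain C++ using one of the slide’s examples, gamma correction. This is an illustrative sketch, not the deck’s code: the `gamma_correct` name and the serial `std::transform` form are assumptions. Because each output depends on exactly one input, the same body could be run by a parallel map such as `cilk_for` or `tbb::parallel_for`.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Map: invoke a pure function on every element of an index set.
// Gamma correction of pixel intensities; iterations are independent,
// so a parallel runtime may execute them in any order or in parallel.
std::vector<double> gamma_correct(const std::vector<double>& in, double g) {
    std::vector<double> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(),
                   [g](double v) { return std::pow(v, g); });
    return out;
}
```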
Reduce
• Reduce combines every element in a collection into one using an associative operator: x+(y+z) = (x+y)+z
• For example, reduce can be used to find the sum or maximum of an array.
• Vectorization may require that the operator also be commutative: x+y = y+x
Examples: averaging of Monte Carlo samples; convergence testing; image comparison metrics; matrix operations.
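A serial sketch of the reduce pattern (illustrative; the `sum` and `maximum` names are assumptions): because `+` and `max` are associative, a parallel or vectorized implementation is free to regroup the combination tree without changing the result.

```cpp
#include <numeric>
#include <vector>

// Reduce: combine all elements with an associative operator.
// Shown serially; associativity is what licenses a parallel
// implementation to combine partial results in a tree.
double sum(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0);
}

// max is associative too, so "find the maximum" is also a reduce.
double maximum(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), v.front(),
                           [](double a, double b) { return a > b ? a : b; });
}
```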
Stencil
• Stencil applies a function to neighbourhoods of an array.
• Neighbourhoods are given by a set of relative offsets.
• Boundary conditions need to be considered.
Examples: image filtering including convolution, median, anisotropic diffusion.
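A minimal stencil sketch (illustrative; the 3-point averaging kernel and the clamped boundary are assumptions, not from the deck): each output reads a fixed neighbourhood of relative offsets {-1, 0, +1}, and the boundary condition the slide mentions is handled here by clamping indices at the edges.

```cpp
#include <cstddef>
#include <vector>

// Stencil: each output element is a function of a neighbourhood
// of input elements given by relative offsets {-1, 0, +1}.
// Boundary condition: indices are clamped at the array edges.
std::vector<double> smooth3(const std::vector<double>& in) {
    const std::size_t n = in.size();
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t lo = (i == 0) ? 0 : i - 1;      // clamp left
        const std::size_t hi = (i + 1 == n) ? i : i + 1;  // clamp right
        out[i] = (in[lo] + in[i] + in[hi]) / 3.0;
    }
    return out;
}
```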
Recurrence
• Recurrence results from loop nests with both input and output dependencies between iterations.
• Can also result from iterated stencils.
Examples: simulation including fluid flow, electromagnetic, and financial PDE solvers; lattice QCD; sequence alignment and pattern matching.
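A one-dimensional recurrence sketch (illustrative; the linear recurrence x[i] = a·x[i-1] + b[i] is an assumption chosen for brevity): the loop-carried dependency is what distinguishes a recurrence from a map, and why techniques such as scan (parallel prefix) or wavefront scheduling, rather than a plain parallel loop, are needed.

```cpp
#include <cstddef>
#include <vector>

// Recurrence: iteration i reads the output of iteration i-1,
// a loop-carried dependency. A plain parallel loop would race;
// parallelizing needs a scan (parallel prefix) or wavefront order.
std::vector<double> linear_recurrence(double a, const std::vector<double>& b) {
    std::vector<double> x(b.size());
    double prev = 0.0;
    for (std::size_t i = 0; i < b.size(); ++i) {
        x[i] = a * prev + b[i];  // depends on the previous output
        prev = x[i];
    }
    return x;
}
```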
Pipeline
• Pipeline uses a sequence of stages that transform a flow of data.
• Some stages may retain state.
• Data can be consumed and produced incrementally: “online”.
Examples: image filtering, data compression and decompression, signal processing.
Pipeline: Cilk Plus and TBB

Intel® TBB:
parallel_pipeline( ntoken,
    make_filter<void,T>(
        filter::serial_in_order,
        [&]( flow_control& fc ) -> T {
            T item = f();
            if( !item ) fc.stop();
            return item;
        } ) &
    make_filter<T,U>( filter::parallel, g ) &
    make_filter<U,void>( filter::serial_in_order, h ) );

Intel® Cilk™ Plus (special case):
S s;
reducer_consume<S,U> sink( &s, h );
...
void Stage2( T x ) { sink.consume(g(x)); }
...
while( T x = f() )
    cilk_spawn Stage2(x);
cilk_sync;
http://intel.com/software/mic
parallel programming from few to many cores with consistent models, languages, tools, and techniques
http://intel.com/software/mic
http://tinyurl.com/inteljames twitter @jamesreinders
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © 2013, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer & Optimization Notice
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.