Intel® Xeon Phi™ Coprocessor: Introduction

Dmitry Sergeev · 2014-11-20

Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice

Intel® Xeon Phi™ Coprocessors

Key platform characteristics…

Up to 61 IA-based cores / 1.1 GHz / 244 threads

Up to 16 GB of memory with 352 GB/s bandwidth

512-bit SIMD instructions

Linux OS, accessible by IP address

Standard programming tools and languages!

…leading to outstanding results

Up to 1.2 teraflops peak performance (note 1)

Up to 2.2x higher memory bandwidth than Intel® Xeon® E5 (note 2)

Up to 4x more energy efficient than Intel® Xeon® E5 (note 3)

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance

Notes 1, 2 & 3, see backup for system configuration details.


Intel® Xeon Phi™ Coprocessors

3 Family: an impressive solution for parallel computing (price/performance)
Models: 3120P, 3120A
6GB GDDR5, 240GB/s, >1TF DP

5 Family: high-density systems (power consumption/performance)
Models: 5110P, 5120D
8GB GDDR5, >300GB/s, >1TF DP, 225-245W

7 Family: high-performance systems (highest performance)
Models: 7120P, 7120X
16GB GDDR5, 352GB/s, >1.2TF DP



Complementary technologies

Intel® Xeon Phi™ coprocessor: highly parallel HPC workloads

Intel® Xeon® processor: general HPC workloads


Common architectural characteristics

Characteristic: Intel® Xeon® Processor E5-2690 / Intel® Xeon Phi™ Coprocessor 5110P
Frequency: 2.9GHz / 1.053GHz
Cores: 8 (multi-core) / 60 (many-core)
Threads: 16 / 240
SIMD width (bits): 256 / 512
Cache: coherent / coherent
Memory: shared memory / shared memory
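As a rough cross-check of the comparison above against the ">1TF DP" figure quoted for the 5110P, peak double-precision throughput can be estimated as cores × frequency × SIMD lanes × operations per lane per cycle. The sketch below is only an estimate under stated assumptions: 64-bit doubles (so 8 lanes in a 512-bit vector) and 2 floating-point operations per lane per cycle (for example one fused multiply-add). That factor of 2 and the function name `peak_dp_gflops` are illustrative assumptions, not values from the slides.

```c
/* Estimated peak double-precision GFLOPS for a chip.
   flops_per_lane_per_cycle is an ASSUMPTION supplied by the caller
   (2 models one multiply and one add per lane per cycle). */
double peak_dp_gflops(int cores, double ghz, int simd_bits,
                      int flops_per_lane_per_cycle)
{
    int lanes = simd_bits / 64;   /* number of 64-bit doubles per vector */
    return cores * ghz * lanes * flops_per_lane_per_cycle;
}
```

With the table's values, `peak_dp_gflops(60, 1.053, 512, 2)` gives about 1011 GFLOPS for the coprocessor, and `peak_dp_gflops(8, 2.9, 256, 2)` gives about 186 GFLOPS for one host processor under the same assumption.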


Typical platform with an Intel® Xeon Phi™ coprocessor

[Diagram, for illustration only: an Intel® Xeon® platform (the "host") with 1-2 host CPUs per node, connected by QPI, each with DDR3 memory; 1-4 Intel® Xeon Phi™ coprocessors per node, each with GDDR5 memory, attached over x16 PCIe; IBA or 10GbE network adapters.]


Intel® Xeon Phi™ microarchitecture overview

[Diagram, for illustration only: multiple cores, each with its own L2 cache and tag directory (TD); several GDDR memory controllers (MC); PCIe client logic.]

TD: Tag Directory; L2: L2 cache; MC: Memory Controller


Intel® Xeon Phi™ core

[Diagram, for illustration only: an in-order core with 4 hardware threads (instruction pointers T0-T3); L1 TLB and 32KB code cache; decode and microcode at 16B/cycle (2 IPC); two pipes (Pipe 0 and Pipe 1); x87, scalar and VPU register files; x87 unit, ALU 0, ALU 1 and a VPU with 512b SIMD; L1 TLB and 32KB data cache; L2 TLB and TLB miss handler; hardware prefetcher (HWP); L2 control and 512KB L2 cache; link to the on-die interconnect.]


Applications that run efficiently on Intel® Xeon Phi™

Allow massive parallelism

Have high computational complexity

Vectorization

Large number of computations per unit of data

Fit in the available memory

Multicore (8+)

Many-core (60)


SIMD / data parallelism

SIMD: Single Instruction, Multiple Data

Intel Xeon Phi vector: 512 bits
Types:
• integer (32 and 64 bit)
• float (F32)
• double (F64)

[Diagram: one operation ◦ applied element-wise to the vectors X1…X16 and Y1…Y16 held in 512-bit registers (bits 0 to 511), producing X1◦Y1 … X16◦Y16.]
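To make the lane arithmetic concrete: a 512-bit vector holds 512/32 = 16 floats or 512/64 = 8 doubles. The sketch below is a scalar reference for what one SIMD instruction does across 16 float lanes. Multiplication stands in for the generic operation ◦, and the function names are illustrative assumptions rather than anything defined in the slides.

```c
/* Number of elements of a given width that fit in one vector register. */
int simd_lanes(int vector_bits, int element_bits)
{
    return vector_bits / element_bits;
}

/* Scalar reference for one 16-lane SIMD operation: out[i] = x[i] * y[i].
   A single 512-bit vector instruction would produce all 16 results at once. */
void lanewise_mul16(const float *x, const float *y, float *out)
{
    for (int i = 0; i < 16; i++)
        out[i] = x[i] * y[i];
}
```

For example, `simd_lanes(512, 32)` returns 16 and `simd_lanes(512, 64)` returns 8.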


Code vectorization

• Makes sequential code use the data-parallel (SIMD) capabilities of Intel processors
– Manually, through special syntax
– Automatically, by the compiler

for (i = 0; i <= MAX; i++)
    c[i] = a[i] + b[i];

[Diagram: the scalar loop computes c[i] = a[i] + b[i] one element at a time; the vectorized loop adds a[i]…a[i+7] and b[i]…b[i+7] into c[i]…c[i+7] with a single operation.]
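Both routes to vectorization can be shown on the same loop. The sketch below is a minimal, portable version. The `restrict` qualifiers promise that the arrays do not overlap, which helps automatic vectorization. `#pragma omp simd` is an explicit request, and compilers without OpenMP SIMD support ignore it. The function name `vec_add` and the use of OpenMP rather than Intel-specific pragmas are assumptions made for illustration.

```c
#include <stddef.h>

/* c[i] = a[i] + b[i] for n elements.
   restrict: the caller guarantees a, b and c do not alias each other. */
void vec_add(size_t n, const float *restrict a,
             const float *restrict b, float *restrict c)
{
#pragma omp simd   /* explicit vectorization hint; ignored if unsupported */
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Whether the loop is actually vectorized depends on the compiler and its options. Most compilers can produce a vectorization report to confirm it.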


Why does vectorization matter?

Saturation add

#define MAX(x,y) ((x)>(y)?(x):(y))
#define MIN(x,y) ((x)<(y)?(x):(y))
#define SAT2SI16(x) \
    MAX(MIN((x),32767),-32768)

void foo1(int n, short *A, short *B){
    int i;
#pragma ivdep
#pragma vector aligned
    for (i=0; i<n; i++)
        A[i] = SAT2SI16(A[i]+B[i]);
}

Scalar code (11 instructions per element):

movsx r11d, [rdx+r9*2]
movsx ebx, [r8+r9*2]
add r11d, ebx
cmp r11d, 32767
cmovge r11d, eax
cmp r11d, -32768
cmovl r11d, ecx
mov [rdx+r9*2], r11w
inc r9
cmp r9, r10
jb .B1.8

Vector code (SSSE-3), 6 instructions per 8 elements:

movdqa xmm0, [rdx+rax*2]
paddsw xmm0, [r8+rax*2]
movdqa [rdx+rax*2], xmm0
add rax, 8
cmp rax, r9
jb .B1.4
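The comparison comes down to simple arithmetic: 11 instructions per element in the scalar loop against 6/8 = 0.75 instructions per element in the vector loop, about 14.7 times fewer. The sketch below restates the saturating add as a plain function so its behavior at the limits can be checked, and includes a helper for the instructions-per-element ratio. Both function names are illustrative assumptions.

```c
#include <stdint.h>

/* Saturating 16-bit add: compute the sum in 32 bits, then clamp it to
   the int16_t range [-32768, 32767] instead of letting it wrap around. */
int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > 32767)
        s = 32767;
    if (s < -32768)
        s = -32768;
    return (int16_t)s;
}

/* Instructions executed per processed element for one loop iteration. */
double instr_per_element(int instructions, int elements)
{
    return (double)instructions / elements;
}
```

For instance, `sat_add16(30000, 10000)` yields 32767 rather than a wrapped negative value, which is the same result `paddsw` produces in each 16-bit lane.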


Intel parallel programming models

Intel® Cilk™ Plus
C/C++ language extensions to simplify parallelism
(open source and Intel product)

Intel® Threading Building Blocks
C++ template library for parallelism
(open source and Intel product)

Specialized libraries
Intel® Integrated Performance Primitives
Intel® Math Kernel Library

Standards
Message Passing Interface (MPI)
OpenMP*
Coarray Fortran
OpenCL*

R&D
Intel® Concurrent Collections
Offload Extensions
Intel® SPMD Parallel Compiler

These apply to both multicore and many-core


Heterogeneous programming

Parallel programming is the same on both sides

[Diagram: the same tools and models (C++, Fortran (CAF), OpenMP, MKL, TBB, OpenCL, Cilk Plus) are used to build an executable for the CPU and an executable for the Xeon Phi coprocessor, which are connected over PCIe.]


Flexible execution models

Optimal performance for different workloads

[Diagram: three ways of combining Xeon® and Xeon Phi™: the native model, the offload model and the symmetric model; labels: MPI, DIRECTIVES.]


MPI+Offload

MPI ranks on Intel® Xeon® processors (only)

All messages into/out of processors

Offload models used to accelerate MPI ranks

Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads* within Intel® MIC Architecture

Homogeneous network of hybrid nodes

[Diagram: four Xeon + MIC nodes on a network; MPI messages carry data between the Xeon processors, and each Xeon exchanges data with its own MIC.]


MPI + Offload: How to run

Compile your code with the offload directives
$ mpiifort -openmp test.f -o test.offload

Create your hosts file (Xeon only)
$ cat hosts
node0
node1

Run your application (Xeon only)
$ mpirun -f hosts -n 2 ./test.offload


Offload example: computing π (demonstration only)

#include <stdio.h>
#include <stdlib.h>

#define NSET 1000000

int main ( int argc, const char** argv )
{ long int i, num_inside = 0;
  float Pi;
#pragma offload target(mic)
#pragma omp parallel for reduction(+:num_inside)
  for( i = 0; i < NSET; i++ )
  { float x, y, distance2;
    // Generate x, y random numbers in [0,1)
    // (rand() is not guaranteed to be thread-safe; acceptable for a demonstration only)
    x = (float)rand() / ((float)RAND_MAX + 1.0f);
    y = (float)rand() / ((float)RAND_MAX + 1.0f);
    distance2 = x*x + y*y;
    if ( distance2 <= 1.0f )
      num_inside++;
  }
  Pi = 4.0f * ( (float)num_inside / NSET );
  printf("Value of Pi = %f \n",Pi);
  return 0;
}

Adding just one line produces the heterogeneous (Xeon + Xeon Phi) version
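The demonstration relies on `rand()`, whose hidden global state is shared by all threads. One common alternative is a generator whose state is owned by the caller, so that each thread can keep its own copy. The sketch below shows that idea in serial form with a xorshift32 generator. The generator choice, the seed handling and all function names are assumptions for illustration and are not part of the original example.

```c
#include <stdint.h>

/* One xorshift32 step. The state lives with the caller, so separate
   threads can use separate states without sharing anything. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Uniform float in [0,1): the top 24 bits are exactly representable. */
static float uniform01(uint32_t *state)
{
    return (float)(xorshift32(state) >> 8) * (1.0f / 16777216.0f);
}

/* Monte Carlo estimate of pi from nset points in the unit square. */
double estimate_pi(long nset, uint32_t seed)
{
    uint32_t state = seed ? seed : 1u;   /* xorshift state must be nonzero */
    long inside = 0;
    for (long i = 0; i < nset; i++) {
        float x = uniform01(&state);
        float y = uniform01(&state);
        if (x * x + y * y <= 1.0f)
            inside++;
    }
    return 4.0 * (double)inside / (double)nset;
}
```

In a threaded version, each thread would call the same loop with its own seed and the partial counts would be combined with a reduction, as in the original example.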


Automatic offload with Intel® Math Kernel Library

void foo() /* Intel® Math Kernel Library */
{
  float *A, *B, *C; /* Matrices */
  sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
}

Implicit automatic offload requires no changes to the source code

[Diagram: Xeon and Xeon Phi.]


Many-core Hosted (Native)

MPI ranks on Intel® Xeon Phi™ coprocessors (only)

All messages into/out of Intel® Xeon Phi™ coprocessors

Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads* used directly within MPI processes

Programmed as homogeneous network of many-core CPUs

[Diagram: four Xeon + MIC nodes on a network; MPI messages carry data between the MIC coprocessors.]


Many-core Hosted (Native): How to run

Compile your code for Intel® Xeon Phi™ Coprocessor
$ mpiifort -mmic test.f -o test.mic

Copy the MIC-enabled executable to the coprocessor
$ scp test.mic mic0:/home/user/
$ scp test.mic mic1:/home/user/

Create your hosts file (MIC only)
$ cat hosts
mic0
mic1

Let the library know you're planning on running on MIC
$ export I_MPI_MIC=1

Run your application (from the Xeon)
$ mpirun -f hosts -n 4 /home/user/test.mic


Symmetric

MPI ranks on Intel® Xeon Phi™ coprocessors and Intel® Xeon® processors

Messages to/from any core

Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads* used directly within MPI processes

Programmed as heterogeneous network of homogeneous nodes

[Diagram: four Xeon + MIC nodes on a network; MPI messages carry data between Xeon processors, between MIC coprocessors, and between a Xeon and a MIC.]


Symmetric: How to run

Compile for the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor
$ mpiifort test.f -o /home/user/test
$ mpiifort -mmic test.f -o test.mic

Copy the MIC-enabled executable to the coprocessor (rename during copy)
$ scp test.mic mic0:/home/user/test
$ scp test.mic mic1:/home/user/test

Create your hosts file (Xeon+MIC)
$ cat hosts
node0
mic0
mic1

Let the library know you're planning on running on MIC
$ export I_MPI_MIC=1

Run your application (from the Xeon)
$ mpirun -f hosts -n 4 /home/user/test


NFS support via environment variables

Two environment variables available to support NFS on Coprocessor

I_MPI_MIC_PREFIX: prepends value to executable name (directory)

I_MPI_MIC_POSTFIX: appends value to executable name (extension)

Procedure:

Set I_MPI_MIC=1

Run job as normal: mpirun … ./app args

Host nodes will launch command as specified: ./app args

Coprocessor nodes will launch modified command: $I_MPI_MIC_PREFIX./app$I_MPI_MIC_POSTFIX args
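The renaming rule is plain string concatenation: prefix, then the executable name as given, then postfix. The sketch below only illustrates that composition. It is not how the Intel® MPI Library implements it, and the function name `build_mic_command` and the sample values are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Compose the executable name a coprocessor rank would launch:
   prefix + exe + postfix. NULL prefix or postfix is treated as empty.
   Returns the length snprintf reports (excluding the terminator). */
int build_mic_command(char *out, size_t out_size,
                      const char *prefix, const char *exe,
                      const char *postfix)
{
    return snprintf(out, out_size, "%s%s%s",
                    prefix ? prefix : "", exe, postfix ? postfix : "");
}
```

With a hypothetical postfix of `.mic` and no prefix, `./app` becomes `./app.mic` on the coprocessor while the host still launches `./app`.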


Configuration files for complex runs

Configuration files allow different MPI options, different executables, different program arguments, etc.

One argument set per line, # for comments

Run command should only specify configuration file

$ cat theconfigfile
-n 1 -host node1 ./master
-n 3 -env OMP_NUM_THREADS 8 -host node1 ./worker
-n 4 -env OMP_NUM_THREADS 60 -host node1-mic0 ./worker.mic
-n 4 -env OMP_NUM_THREADS 8 -host node2 ./worker
-n 4 -env OMP_NUM_THREADS 60 -host node2-mic0 ./worker.mic

$ mpirun -configfile theconfigfile
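Each argument set contributes its own `-n` value, so the example file launches 1 + 3 + 4 + 4 + 4 = 16 ranks in total. The sketch below checks such a file line by line. It is a simplified, hypothetical helper (`ranks_in_line`) that only looks for `-n` followed by a space. It is not the parser used by `mpirun`.

```c
#include <stdlib.h>
#include <string.h>

/* Return the rank count that follows "-n " in one argument-set line.
   Returns 0 for '#' comment lines and for lines without "-n ". */
int ranks_in_line(const char *line)
{
    while (*line == ' ' || *line == '\t')
        line++;                        /* skip leading whitespace */
    if (*line == '#')
        return 0;                      /* comment line */
    const char *p = strstr(line, "-n ");
    if (p == NULL)
        return 0;
    return atoi(p + 3);                /* number right after "-n " */
}
```

Summing `ranks_in_line` over the five lines of the example file gives 16, which is a quick sanity check before submitting a large job.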


Intel® MPI Library 5.0: What's New

MPI-3 Standard Support

Non-Blocking Collectives

Fast RMA

Large Counts

MPICH ABI Compatibility

Compatibility with MPICH* v3.1, IBM* MPI v1.4, Cray* MPT v7.0

Performance & Scaling

Memory Consumption Optimizations

Scaling up to 150K Ranks*

Gains up to 35% reduction on Collectives

Hydra now default job manager on Windows*

Configuration: Hardware: Intel® Xeon® CPU E5-2680 @ 2.70GHz, RAM 64GB; Interconnect: InfiniBand, ConnectX adapters; FDR. MIC: C0-KNC 1238095 kHz; 61 cores. RAM: 15872 MB per card. Software: RHEL 6.2, OFED 1.5.4.1, MPSS Version: 3.2, Intel® C/C++ Compiler XE 13.1.1, Intel® MPI Benchmarks 3.2.4.;

* Other brands and names are the property of their respective owners. Benchmark Source: Intel Corporation

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 .

Superior Performance with Intel® MPI Library 5.0

64 Processes, 8 nodes (InfiniBand + shared memory), Linux* 64

Relative (Geomean) MPI Latency Benchmarks (Higher is Better)

[Chart: speedup (times) of Intel MPI 5.0 relative to MVAPICH2-2.0 RC2 (= 1), by message size.]

4 bytes: 2.0 (2X faster)
512 bytes: 1.9 (1.9X faster)
16 Kbytes: 1.1 (1.1X faster)
128 Kbytes: 1.6 (1.6X faster)
4 Mbytes: 1.8 (1.8X faster)


Tuning MPI Application Performance



Performance Tuning Tools for Distributed Applications

Intel® Trace Analyzer and Collector

Tune cross-node MPI

Visualize MPI behavior

Evaluate MPI load balancing

Find communication hotspots

Intel® VTune™ Amplifier XE

Tune single node threading

Visualize thread behavior

Evaluate thread load balancing

Find thread sync bottlenecks


Intel® Trace Analyzer and Collector Overview

Intel® Trace Analyzer and Collector helps the developer:

Visualize and understand parallel application behavior

Evaluate profiling statistics and load balancing

Identify communication hotspots

Features

Event-based approach

Low overhead

Excellent scalability

Powerful aggregation and filtering functions

Idealizer

NEW in 9.0: Automatic Performance Assistant

[Diagram: build and run flow (source code, compiler, objects, linker, binary, runtime, output); Intel® Trace Collector, enabled through the API and -tcollect or through -trace, writes a trace file (.stf) that Intel® Trace Analyzer reads.]


Using the Intel® Trace Analyzer and Collector is … Easy!

Step 1: Run your binary and create a tracefile
$ mpirun -trace -n 2 ./test

Step 2: View the results
$ traceanalyzer &


Multiple Methods for Data Collection

Collection mechanism: Run with -trace or preload trace collector library.
Advantages: Automatically collects all MPI calls, requires no modification to source, compile, or link.
Disadvantages: No user code collection.

Collection mechanism: Link with -trace.
Advantages: Automatically collects all MPI calls.
Disadvantages: No user code collection. Must be done at link time.

Collection mechanism: Compile with -tcollect.
Advantages: Automatically instruments all function entries/exits.
Disadvantages: Requires recompile of code.

Collection mechanism: Add API calls to source code.
Advantages: Can selectively instrument desired code sections.
Disadvantages: Requires code modification.


Tracing on Intel® Xeon Phi™ Coprocessor

Tracing libraries have been ported

Ensure libraries are available on card

Installation path available via NFS (preferred)

Manually copy files via scp

scp /opt/intel/itac/<version>/mic/slib/libVT.so mic0:/lib64

Run as a normal job

All trace files stored in working directory

If not on NFS share, files will need to be copied from coprocessor

Analyze using Intel® Trace Analyzer

traceanalyzer test.stf &


Intel® Trace Analyzer and Collector

Compare the event timelines of two communication profiles

Blue = computation
Red = communication

Chart showing how the MPI processes interact


Improving Load Balance: Real World Case

Collapsed data per node and coprocessor card

Host: 16 MPI procs x 1 OpenMP thread
Coprocessor: 8 MPI procs x 28 OpenMP threads
Too high load on host = too low load on coprocessor

Coprocessor: 24 MPI procs x 8 OpenMP threads
Too low load on host = too high load on coprocessor

Coprocessor: 16 MPI procs x 12 OpenMP threads
Perfect balance: host load = coprocessor load


NEW in 9.0: MPI Performance Assistant

Automatic Performance Assistant

Detect common MPI performance issues

Automated tips on potential solutions

Automatically detect performance issues and their impact on runtime



Which performance issues are automatically identified?

Point-to-point exchange: Late Sender, Late Receiver

Global collective operation performance: Wait at Barrier, Early Reduce, Late Broadcast

NEW in 9.0: Summary page shows computation vs. communication breakdown

Is your application MPI-bound?

Is your application CPU-bound?

Resource usage

Largest MPI consumers

Next Steps


NEW in 9.0: Initial MPI-3.0 Support

Support for major MPI-3.0 features

Non-blocking collectives

Fast RMA

Large counts

Non-blocking Allreduce (MPI_Iallreduce)


Intel® VTune™ Amplifier XE with MPI

Tune for Scalable Multicore Performance

Launch Intel® VTune™ Amplifier XE

Use mpirun

List your app as a parameter

Results organized by MPI rank

Review results

Graphical user interface

Command line report


Using Intel® VTune™ Amplifier XE with MPI

Use the command-line tool under the MPI run script to gather report data

mpirun -n #ranks amplxe-cl -result-dir ampl_results -collect hotspots -- ./test

Argument Sets can be used for more control

Required: Only run one driver collection per node

Only collect data on certain ranks

Different collections or options on different ranks

A unique results directory is created for each analyzed MPI rank

Launch the GUI and view the results for each rank


Intel® Inspector XE with MPI

Where are my application's…

Memory Errors
• Invalid Accesses
• Memory Leaks
• Uninitialized Memory Accesses

Threading Errors
• Races
• Deadlocks
• Cross Stack References

Security Errors
• Buffer overflows and underflows
• Incorrect pointer usage
• Over 250 error types…

• MPI aware, cluster friendly
• Both dynamic and static analysis
• Multiple tools, common GUI
• Windows* & Linux*

Multi-threading problems are hard to reproduce, difficult to debug, and expensive to fix!

"Having such a tool this early in the development stage frees the validation from trivial bug reports and gives our engineers the opportunity to code more efficiently from the very beginning of the product cycle."
Jean Kypreos, Advanced Video Processing Team Manager, Envivio


Intel® Inspector XE

Dynamic Analysis

Launch Intel® Inspector XE

Use mpirun

List your app as a parameter

Results organized by MPI rank

Review results


Command line report

Static Analysis

Source analyzed for errors (similar to a build)

Review results


Find errors earlier when they are less expensive to fix



Using Intel® Inspector XE with MPI

Use the command-line tool under the MPI run script to gather report data

mpirun -n #ranks inspxe-cl -result-dir insp_results -collect mi1 -- ./test

Argument Sets can be used for more control

Only collect data on certain ranks

Different collections or options on different ranks

A unique results directory is created for each analyzed MPI rank

Launch the GUI and view the results for each rank


Online Resources

Intel® MPI Library product page
http://www.intel.com/go/mpi

Intel® Trace Analyzer and Collector product page
http://www.intel.com/go/traceanalyzer

Intel® Clusters and HPC Technology forums
http://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology

Intel® Xeon Phi™ Coprocessor Developer Community
http://software.intel.com/en-us/mic-developer


Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804
