Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors
Shuo Li, Financial Services Engineering, Software and Services Group, Intel Corporation


Page 1: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors

Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors
Shuo Li
Financial Services Engineering
Software and Services Group
Intel Corporation

Page 2: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors


Agenda

• A Tale of Two Architectures
• Transcendental Functions - Step 5 Lab 1/3
• Thread Affinity - Step 5 Lab part 2/3
• Prefetch and Streaming Store
• Data Blocking - Step 5 Lab part 3/3
• Summary


Page 3: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors

A Tale of Two Architectures

Page 4: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors


A Tale of Two Architectures

                      Intel® Xeon® processor    Intel® Xeon Phi™ Coprocessor
Sockets               2                         1
Clock Speed           2.6 GHz                   1.1 GHz
Execution Style       Out-of-order              In-order
Cores/Socket          8                         Up to 61
HW Threads/Core       2                         4
Thread Switching      HyperThreading            Round Robin
SIMD Widths           8 SP, 4 DP                16 SP, 8 DP
Peak GFLOPS           692 SP, 346 DP            2020 SP, 1010 DP
Memory Bandwidth      102 GB/s                  320 GB/s
L1 DCache/Core        32 KB                     32 KB
L2 Cache/Core         256 KB                    512 KB
L3 Cache/Socket       30 MB                     none

Page 5: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors

Transcendental Functions

Page 6: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors


Extended Math Unit
• Fast approximations of transcendental functions using a hardware lookup table, in single precision
• Minimax quadratic polynomial approximation
• Full effective 23-bit mantissa
• 1-2 cycle throughput for the 4 elementary functions
• Benefits other functions that build directly on them (a scalar sketch follows the tables below)

Elementary functions      Name        Cycles
Reciprocal                rcp()       1
Reciprocal square root    rsqrt2()    1
Exponential base 2        exp2()      2
Logarithmic base 2        log2()      1

Derived functions         Built from (X = multiply)    Cycles
Divide                    rcp(), X                     2
Square root               rsqrt(), X                   2
Power                     log2(), X, exp2()            4
Exponent base 10          vexp223ps, X                 3
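As a scalar illustration of how the derived operations compose from the elementary ones; the function names are stand-ins for the vector hardware sequences, not from the deck:

    #include <math.h>

    /* Power composes log2 -> multiply -> exp2; divide composes
       reciprocal -> multiply. Illustrative scalar sketches only. */
    static float emu_pow(float x, float y) { return exp2f(y * log2f(x)); }
    static float emu_div(float a, float b) { return a * (1.0f / b); }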

Page 7: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors


Use EMU Functions in Finance Algorithms
• Challenges in using exp2() and log2()
  - exp() and log() are widely used in finance algorithms
  - their base is e (= 2.718…), not 2
• The change-of-base formula (a minimal sketch follows this slide)
  - M_LOG2E and M_LN2 are defined in math.h
  - Cost: one multiplication, and its effect on result accuracy
• In general, it works for any base b:
  - expb(x) = exp2(x * log2(b))      logb(x) = log2(x) * logb(2)
• Manage the cost of conversion
  - Absorb the multiply into other constant calculations outside the loop
  - Always convert from other bases to base 2

exp(x) = exp2(x * M_LOG2E)      log(x) = log2(x) * M_LN2
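A minimal sketch of the rewrite, assuming a batch helper of our own invention (exp_batch is illustrative, not lab code):

    #include <math.h>   /* M_LOG2E, exp2f */

    /* exp(x) == exp2(x * log2(e)); hoisting the constant out of the loop
       leaves one multiply per element, folded into the exp2f argument. */
    static void exp_batch(float *out, const float *x, int n)
    {
        const float k = (float)M_LOG2E;
        for (int i = 0; i < n; i++)
            out[i] = exp2f(x[i] * k);
    }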

Page 8: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors


Black-Scholes can use EMU Functions

Before, with libm transcendentals:

    const float R = 0.02f;
    const float V = 0.30f;
    const float RVV = (R + 0.5f * V * V)/V;

    for (int opt = 0; opt < OPT_N; opt++)
    {
        float T = OptionYears[opt];
        float X = OptionStrike[opt];
        float S = StockPrice[opt];
        float sqrtT = sqrtf(T);
        float d1 = logf(S/X)/(V*sqrtT) + RVV*sqrtT;
        float d2 = d1 - V * sqrtT;
        float CNDD1 = CNDF(d1);
        float CNDD2 = CNDF(d2);
        float expRT = X * expf(-R * T);
        float CallVal = S * CNDD1 - expRT * CNDD2;
        CallResult[opt] = CallVal;
        PutResult[opt] = CallVal + expRT - S;
    }

After, with the EMU base-2 functions (RVV as above):

    const float LN2_V = M_LN2 * (1/V);
    const float RLOG2E = -R * M_LOG2E;

    for (int opt = 0; opt < OPT_N; opt++)
    {
        float T = OptionYears[opt];
        float X = OptionStrike[opt];
        float S = StockPrice[opt];
        float rsqrtT = 1/sqrtf(T);   /* intended to map to the hardware rsqrt/rcp */
        float sqrtT = 1/rsqrtT;
        float d1 = log2f(S/X)*LN2_V*rsqrtT + RVV*sqrtT;
        float d2 = d1 - V * sqrtT;
        float CNDD1 = CNDF(d1);
        float CNDD2 = CNDF(d2);
        float expRT = X * exp2f(RLOG2E * T);
        float CallVal = S * CNDD1 - expRT * CNDD2;
        CallResult[opt] = CallVal;
        PutResult[opt] = CallVal + expRT - S;
    }

Page 9: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors

Transcendental Functions - Step 5 Lab 1/3

Page 10: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors


Using Intel® Xeon Phi™ Coprocessors
• Use an Intel® Xeon® E5-2670 platform with an Intel® Xeon Phi™ Coprocessor
• Make sure you can invoke the Intel C/C++ Compiler
  - source /opt/intel/Compiler/2013.2.146/composerxe/pkg_bin/compilervars.sh intel64
  - Try icpc -V to print the banner for the Intel® C/C++ compiler
• Build the native Intel® Xeon Phi™ Coprocessor application
  - Change the Makefile and use -mmic in lieu of -xAVX
• Copy the program from host to coprocessor
  - Find out the host name using hostname (suppose it returns esg014)
  - Copy the executable: scp ./MonteCarlo esg014-mic0:
  - Establish an execution environment: ssh esg014-mic0
  - Set the environment variable: export LD_LIBRARY_PATH=.
  - Optionally: export KMP_AFFINITY="compact,granularity=fine"
  - Invoke the program: ./MonteCarlo

Page 11: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors


Step 5 Transcendental Functions
• Use the transcendental functions in the EMU
• The inner loop calls expf(MuByT + VBySqrtT * random[pos]);
• Call exp2f instead and adjust the argument by a factor of M_LOG2E
• Combine the multiplication with MuByT and VBySqrtT (a sketch follows):

    float VBySqrtT = VLOG2E * sqrtf(T[opt]);
    float MuByT = RVVLOG2E * T[opt];
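For context, a sketch of the resulting inner loop, assuming VLOG2E and RVVLOG2E are the original volatility and drift constants pre-multiplied by M_LOG2E, and Sval is the spot price; the payoff accumulation is elided:

    /* Sketch only: names follow the slide; the folded constants are assumptions. */
    float VBySqrtT = VLOG2E * sqrtf(T[opt]);    /* was V * sqrtf(T[opt]) */
    float MuByT    = RVVLOG2E * T[opt];
    for (int pos = 0; pos < RAND_N; pos++)
    {
        /* was: Sval * expf(MuByT + VBySqrtT * random[pos]) */
        float StockPath = Sval * exp2f(MuByT + VBySqrtT * random[pos]);
        /* ... accumulate the option payoff from StockPath ... */
    }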

Page 12: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors

Thread Affinity

Page 13: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors


More on Thread Affinity
• Bind the worker threads to specific processor cores/threads
• Minimizes thread migration and context switches
• Improves data locality; reduces coherency traffic
• Two components to the problem:
  - How many worker threads to create?
  - How to bind worker threads to cores/threads?
• Two ways to specify thread affinity (a minimal sketch follows):
  - Environment variables: OMP_NUM_THREADS, KMP_AFFINITY
  - C/C++ API: kmp_set_defaults("KMP_AFFINITY=compact") and omp_set_num_threads(244);
• Add #include <omp.h> to your source file
• Compile with -openmp
• Link against libiomp5.so
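A minimal sketch of the API route, assuming the Intel compiler's omp.h, which also declares the kmp_* extensions:

    #include <omp.h>

    int main(void)
    {
        /* Must run before the first parallel region creates the thread pool. */
        kmp_set_defaults("KMP_AFFINITY=compact,granularity=fine");
        omp_set_num_threads(244);      /* e.g. 61 cores x 4 HW threads */

        #pragma omp parallel
        {
            /* Worker threads are now pinned per the affinity setting. */
        }
        return 0;
    }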

Page 14: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors


The Optimal Thread Number
• Intel MIC maintains 4 hardware contexts per core
  - Round-robin execution policy
  - Requires 2 threads per core for decent performance
  - Financial algorithms take all 4 threads to peak
• The Intel Xeon processor optionally uses HyperThreading
  - Execute-until-stall execution policy
  - Truly compute-intensive kernels peak with 1 thread per core
  - Finance algorithms like HyperThreading: 2 threads per core
• For an OpenMP application on a platform with NCORE cores (see the sketch after this list):
  - Host only: 2 x NCORE (or 1 x NCORE if HyperThreading is disabled)
  - MIC native: 4 x NCORE
  - Offload: 4 x (NCORE - 1); the OpenMP runtime avoids the core the OS runs on
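The same rules, restated as an illustrative helper; the function and its flags are assumptions for the sketch, not part of the lab code:

    /* Pick a worker-thread count from the physical core count. */
    int optimal_threads(int ncore, int mic_native, int offload)
    {
        if (offload)
            return 4 * (ncore - 1);   /* leave one core for the OS/offload runtime */
        if (mic_native)
            return 4 * ncore;         /* 4 hardware contexts per core */
        return 2 * ncore;             /* host with HyperThreading enabled */
    }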

Page 15: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors


Thread Affinity Choices
• Intel® OpenMP* supports the following affinity types:
  - compact: assigns threads to consecutive h/w contexts on the same physical core, to benefit from the shared cache
  - scatter: assigns consecutive threads to different physical cores, to maximize access to memory
  - balanced: a blend of compact and scatter (currently only available for Intel® MIC Architecture)
• You can also specify an affinity modifier
  - Explicit setting: KMP_AFFINITY set to granularity=fine,proclist="1-240",explicit
• Affinity is particularly important if not all available threads are used
• Affinity type is less important under full thread subscription

Modifier: granularity. Describes the lowest level at which OpenMP threads are allowed to float within a topology map.

Specifier           Description
=core               Broadest granularity level supported. Allows all the OpenMP threads bound to a core to float between the core's different thread contexts.
=fine (=thread)     The finest granularity level. Causes each OpenMP thread to be bound to a single thread context.

Page 16: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors


Thread Affinity in Monte Carlo
• Monte Carlo can take all the threads available on a core
  - Enable HyperThreading for the Intel® Xeon® processor
  - Set -opt-threads-per-core=4 for the coprocessor code
• Affinity type is less important when all cores are used
  - Argument for compact: maximizes the shared random-number effect
  - Argument for scatter: maximizes the bandwidth to memory
• However, you do have to set the thread affinity type
  - API call:

        #ifdef _OPENMP
            kmp_set_defaults("KMP_AFFINITY=compact,granularity=fine");
        #endif

  - Environment variables:

        ~ $ export LD_LIBRARY_PATH=.
        ~ $ export KMP_AFFINITY="scatter,granularity=fine"
        ~ $ ./MonteCarlo

Page 17: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors

Thread Affinity - Step 5 Lab 2/3

Page 18: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors


Step 5 Thread Affinity
• Add #include <omp.h>
• Add the following lines before the very first #pragma omp:

    #ifdef _OPENMP
        kmp_set_defaults("KMP_AFFINITY=compact,granularity=fine");
    #endif

• Add -opt-threads-per-core=4 to the Makefile

Page 19: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors

Prefetch and Streaming Stores

Page 20: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors


Prefetch on Intel Multicore and Manycore Platforms
• Objective: move data from memory to the L1 or L2 cache in anticipation of CPU loads/stores
• More important on the in-order Intel Xeon Phi coprocessor
• Less important on the out-of-order Intel Xeon processor
• Compiler prefetching is on by default for Intel® Xeon Phi™ coprocessors at -O2 and above
• Compiler prefetching is not enabled by default on Intel® Xeon® processors
  - Use the option -opt-prefetch[=n], n = 1..4
• Use the compiler reporting options to see detailed per-loop prefetching diagnostics
  - Use -opt-report-phase hlo -opt-report 3

Page 21: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors


Automatic Prefetches
Loop prefetch
• Compiler-generated prefetches target memory accesses in a future iteration of the loop
• Target regular, predictable array and pointer accesses
Interactions with the hardware prefetcher
• The Intel® Xeon Phi™ Coprocessor has a hardware L2 prefetcher
• If the software prefetches are doing a good job, hardware prefetching does not kick in
• References not prefetched by the compiler may get prefetched by the hardware prefetcher

Page 22: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors


Explicit Prefetch
• Use intrinsics (a sketch follows this slide):
  - _mm_prefetch((char *) &a[i], hint);
    see xmmintrin.h for possible hints (for L1, L2, non-temporal, …)
  - But you have to specify the prefetch distance
  - There are also gather/scatter prefetch intrinsics, e.g. _mm512_prefetch_i32gather_ps; see zmmintrin.h and the compiler user guide
• Use a pragma/directive (easier):
  - #pragma prefetch a [:hint[:distance]]
  - You specify what to prefetch, but can choose to let the compiler figure out how far ahead to do it
• Use compiler switches:
  - -opt-prefetch-distance=n1[,n2]
  - Specifies the prefetch distance (how many iterations ahead) for prefetches inside loops; n1 indicates the distance from memory to L2
  - BlackScholes uses -opt-prefetch-distance=8,2
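A minimal sketch of the intrinsic route; the loop, PF_DIST, and the distance value are illustrative assumptions, not tuned lab code:

    #include <xmmintrin.h>     /* _mm_prefetch, _MM_HINT_T0 */

    #define PF_DIST 16         /* assumed prefetch distance, in iterations */

    void scale2x(float *out, const float *in, int n)
    {
        for (int i = 0; i < n; i++) {
            /* Request in[i + PF_DIST] into L1 while we work on in[i];
               prefetches of out-of-range addresses do not fault. */
            _mm_prefetch((const char *)&in[i + PF_DIST], _MM_HINT_T0);
            out[i] = 2.0f * in[i];
        }
    }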

Page 23: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors


Streaming Store
• Avoids read-for-ownership for certain memory write operations
• Bypasses the prefetch associated with the memory read
• Use #pragma vector nontemporal (v1, v2, …) to drop a hint to the compiler
• Without streaming stores: 448 bytes read/written per iteration
• With streaming stores: 320 bytes read/written per iteration
• Relieves bandwidth pressure; improves cache utilization

    for (int chunkBase = 0; chunkBase < OptPerThread; chunkBase += CHUNKSIZE)
    {
    #pragma simd vectorlength(CHUNKSIZE)
    #pragma vector aligned
    #pragma vector nontemporal (CallResult, PutResult)
        for (int opt = chunkBase; opt < (chunkBase + CHUNKSIZE); opt++)
        {
            float CNDD1;
            float CNDD2;
            float CallVal = 0.0f, PutVal = 0.0f;
            float T = OptionYears[opt];
            float X = OptionStrike[opt];
            float S = StockPrice[opt];
            ……
            CallVal = S * CNDD1 - XexpRT * CNDD2;
            PutVal = CallVal + XexpRT - S;
            CallResult[opt] = CallVal;
            PutResult[opt] = PutVal;
        }
    }

• -vec-report6 displays the compiler action:

    bs_test_sp.c(215): (col. 4) remark: vectorization support: streaming store was generated for CallResult.
    bs_test_sp.c(216): (col. 4) remark: vectorization support: streaming store was generated for PutResult.

Page 24: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors

Data Blocking

Page 25: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors


Data Blocking
• Partition data into small blocks that fit in the L2 cache
  - Exploits data reuse in the application
  - Ensures the data remains in the cache across multiple uses
  - Serving the data from cache removes the need to go to memory
  - A bandwidth-limited program may then execute at its FLOPS limit
• Simple 1D case
  - Data of size DATA_N is used WORK_N times by 100s of threads
  - Each thread handles a piece of work and has to traverse all the data

• Without blocking:
  - 100s of threads pound on different areas of DATA_N
  - The memory interconnect limits performance

    #pragma omp parallel for
    for (int wrk = 0; wrk < WORK_N; wrk++)
    {
        initialize_the_work(wrk);
        for (int ind = 0; ind < DATA_N; ind++)
        {
            dataptr datavalue = read_data(ind);
            result = compute(datavalue);
            aggregate = combine(aggregate, result);
        }
        postprocess_work(aggregate);
    }

• With blocking:
  - A cacheable BSIZE of data is processed by all of the threads at a time
  - Each block is read once and kept in cache until all threads are done with it

    for (int BBase = 0; BBase < DATA_N; BBase += BSIZE)
    {
    #pragma omp parallel for
        for (int wrk = 0; wrk < WORK_N; wrk++)
        {
            initialize_the_work(wrk);
            for (int ind = BBase; ind < BBase + BSIZE; ind++)
            {
                dataptr datavalue = read_data(ind);
                result = compute(datavalue);
                aggregate[wrk] = combine(aggregate[wrk], result);
            }
            postprocess_work(aggregate[wrk]);
        }
    }

Page 26: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors


Blocking in Monte Carlo European Options
• Each thread runs Monte Carlo using all the random numbers
• The random numbers are too big to fit in each thread's L2
  - Random-number size is RAND_N * sizeof(float) = 1 MB, at 256K floats
  - Each thread's share of L2 is 512 KB/4 = 128 KB; effectively 64 KB, or 16K floats
• Without blocking:
  - Each thread makes an independent pass over the RAND_N data for all the options it runs
  - The interconnect is busy satisfying read requests from different threads at different points
  - Prefetched data from different points also saturates the memory bandwidth
• With blocking:
  - The random numbers are partitioned into cacheable blocks
  - A block is brought into cache once its previous block is done being processed by all threads
  - It remains in the caches of all threads until every thread has finished its pass
  - Each thread repeatedly reuses the data in cache for all the options it needs to run

    const int nblocks = RAND_N/BLOCKSIZE;
    for (int block = 0; block < nblocks; ++block)
    {
        vsRngGaussian(VSL_METHOD_SGAUSSIAN_ICDF, Randomstream,
                      BLOCKSIZE, random, 0.0f, 1.0f);
    #pragma omp parallel for
        for (int opt = 0; opt < OPT_N; opt++)
        {
            float VBySqrtT = VLOG2E * sqrtf(T[opt]);
            float MuByT = RVVLOG2E * T[opt];
            float Sval = S[opt];
            float Xval = X[opt];
            float val = 0.0, val2 = 0.0;
    #pragma vector aligned
    #pragma simd reduction(+:val) reduction(+:val2)
    #pragma unroll(4)
            for (int pos = 0; pos < BLOCKSIZE; pos++)
            {
                … … …
                val += callValue;
                val2 += callValue * callValue;
            }
            h_CallResult[opt] += val;
            h_CallConfidence[opt] += val2;
        }
    }

    #pragma omp parallel for
    for (int opt = 0; opt < OPT_N; opt++)
    {
        const float val = h_CallResult[opt];
        const float val2 = h_CallConfidence[opt];
        const float exprt = exp2f(-RLOG2E*T[opt]);
        h_CallResult[opt] = exprt * val * INV_RAND_N;
        … … …
        h_CallConfidence[opt] = (float)(exprt * stdDev * CONFIDENCE_DENOM);
    }

Page 27: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors

Data Blocking - Step 5 Lab 3/3

Page 28: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors


Step 5 Data Blocking
• Apply 1D data blocking to the random numbers
  - L2: 512 KB per core, 128 KB per thread
  - Your application can use 64 KB
  - BLOCKSIZE = 16*1024
• Change the inner loop bound from RAND_N to BLOCKSIZE:

    const int nblocks = RAND_N/BLOCKSIZE;
    for (block = 0; block < nblocks; ++block)
    {
        vsRngGaussian(VSL_METHOD_SGAUSSIAN_ICDF,
                      Randomstream, BLOCKSIZE, random, 0.0f, 1.0f);

        <<<Existing Code>>>

        // Save intermediate result here
        h_CallResult[opt] += val;
        h_CallConfidence[opt] += val2;
    }

• Move the loop that calculates per-option data from the middle loop to outside the loops:

    #pragma omp parallel for
    for (int opt = 0; opt < OPT_N; opt++)
    {
        const float val = h_CallResult[opt];
        const float val2 = h_CallConfidence[opt];
        const float exprt = exp2f(-RLOG2E*T[opt]);
        h_CallResult[opt] = exprt * val * INV_RAND_N;
        const float stdDev = sqrtf((F_RAND_N * val2 - val * val) * STDDEV_DENOM);
        h_CallConfidence[opt] = (float)(exprt * stdDev * CONFIDENCE_DENOM);
    }

• Add an initialization loop before the triple-nested loops:

    #pragma omp parallel for
    for (int opt = 0; opt < OPT_N; opt++)
    {
        h_CallResult[opt] = 0.0f;
        h_CallConfidence[opt] = 0.0f;
    }

Page 29: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors


Run your program on the Intel® Xeon Phi™ Coprocessor

• Build the native Intel® Xeon Phi™ Coprocessor application
  - Change the Makefile and use -mmic in lieu of -xAVX
• Copy the program from host to coprocessor
  - Find out the host name using hostname (suppose it returns esg014)
  - Copy the executable: scp MonteCarlo esg014-mic0:
  - Establish an execution environment: ssh esg014-mic0
  - Set the environment variable: export LD_LIBRARY_PATH=.
  - Invoke the program: ./MonteCarlo

Page 30: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors

Summary

Page 31: Scale from Intel® Xeon® Processor to Intel® Xeon Phi™ Coprocessors


Summary
• Base-2 exponential and logarithmic functions are always faster than other bases on Intel multicore and manycore platforms
• Set thread affinity when you use OpenMP*
• Let the hardware prefetcher work for you; fine-tune your loops with the prefetch pragmas/directives
• Optimize your data access and rearrange your computation around the data in cache