Woo-Sun Yang, NERSC User Engagement Group
Vectorization
March 23, 2016
NERSC User Group Meeting 2016
What’s All This About Vectorization?
• Vectorization is an on-node, in-core way of exploiting data-level parallelism in programs by applying the same operation to multiple data items in parallel.

DO I = 1, N
  Z(I) = X(I) + Y(I)
ENDDO

• Requires transforming a program so that a single instruction can launch many operations on different data
• Applies most commonly to array operations in loops
What is Required for Vectorization?
• Code transformation
• Compiler generates vector instructions:

DO I = 1, N
  Z(I) = X(I) + Y(I)
ENDDO

(Compiler transforms the loop:)

DO I = 1, N, 4
  Z(I)   = X(I)   + Y(I)
  Z(I+1) = X(I+1) + Y(I+1)
  Z(I+2) = X(I+2) + Y(I+2)
  Z(I+3) = X(I+3) + Y(I+3)
ENDDO

VLOAD  X(I), X(I+1), X(I+2), X(I+3)
VLOAD  Y(I), Y(I+1), Y(I+2), Y(I+3)
VADD   Z(I, ..., I+3) = X(I, ..., I+3) + Y(I, ..., I+3)
VSTORE Z(I), Z(I+1), Z(I+2), Z(I+3)
What is Required for Vectorization?
• Vector hardware: vector registers and vector functional units

[Figure: vector registers, each 512 bits wide; the register width is the vector length]
Evolution of Vector Hardware
• Translates to (peak) speed: cores per processor × vector length × CPU speed × 2 arith. ops per vector
Data Dependencies
• Examples:

DO I = 2, N-1
  A(I) = A(I-1) + B(I)
END DO

Compiler detects a backward reference on A(I-1): read-after-write (also known as "flow dependency")

DO I = 2, N-1
  A(I-1) = X(I) + DX
  A(I)   = 2.*DY
END DO

Compiler detects the same location being written: write-after-write (also known as "output dependency")

ftn -qopt-report=2 -c mms.f90

Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]
   remark #15346: vector dependence: assumed OUTPUT dependence between line 12 and line 11
How to Vectorize Your Code?
• Auto-vectorization analysis by the compiler
• Auto-vectorization analysis by the compiler, enhanced with directives: code annotations that suggest what can be vectorized
• Code explicitly for vectorization using OpenMP 4.0 SIMD pragmas or SIMD intrinsics (not portable)
• Use assembly language
• Use vendor-supplied optimized libraries
Requirements for vectorization
• Loop trip count known at entry to the loop at run time
• Single entry and single exit
• No function calls or I/O
• No data dependencies in the loop
• Uniform control flow (although conditional computation can be implemented using "masked" assignment)
Vectorization performance (speed-up)
• Factors that affect vectorization performance
  – Efficient loads and stores with vector registers
    • Data in caches
    • Data aligned to a certain byte boundary in memory
    • Unit-stride access
  – Efficient vector operations
    • Certain arithmetic operations not at full speed
• Good speed-up with vectorization when all the conditions are met
• Examples from https://www.nersc.gov/users/computational-systems/edison/programming/vectorization/
How good is vectorization
• Compiler vectorization of loops
  – Enabled with default optimization levels for the Intel and Cray compilers on Cori/Edison (and Intel on Babbage)
  – Use the -qopt-report[=n] -qopt-report-phase=vec flags (n is from 0 through 5; default: 2)
LOOP BEGIN at a1.F(35,10)
   remark #15300: LOOP WAS VECTORIZED
   remark #15450: unmasked unaligned unit stride loads: 2
   remark #15451: unmasked unaligned unit stride stores: 1
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 6
   remark #15477: vector loop cost: 2.000
   remark #15478: estimated potential speedup: 2.990
   remark #15488: --- end vector loop cost summary ---
   remark #25015: Estimate of max trip count of loop=249
LOOP END

LOOP BEGIN at a1.F(35,10)
<Remainder loop for vectorization>
   remark #15301: REMAINDER LOOP WAS VECTORIZED
   remark #25015: Estimate of max trip count of loop=3
LOOP END
How good is vectorization (Cont’d)
• Intel Advisor
  – Vectorization analysis tool that identifies loops for vectorization and the reasons that block effective vectorization
• Many web pages with useful info
  – https://software.intel.com/en-us/intel-advisor-xe
  – https://software.intel.com/en-us/get-started-with-advisor-vectorization-linux
  – https://software.intel.com/en-us/intel-advisor-xe-support/training
  – https://software.intel.com/en-us/intel-advisor-2016-tutorial-vectorization-linux-cplusplus
  – …
Data in Caches

do i = 1, n
  c(i) = a(i) + b(i)
enddo

[Plot: speed-up vs. array size] Speed-up is close to the theoretical max while the data fit in L1 cache; it worsens as the array size exceeds the L1 size.
Data in Caches (Cont'd)

do i = 1, n
  c(i) = a(i) + b(i)
enddo

[Plot: speed-up vs. array size] Speed-up drops again as the arrays exceed the L2 cache size.
Let's try Intel Advisor on these runs
• Analysis types that Intel Advisor provides
  – survey: explore where to add efficient vectorization
  – tripcounts: iteration counts for loops
  – Refinement analysis
    • map (Memory Access Pattern): memory access strides for loops
    • dependencies: loop-carried dependencies
• Analysis can be run in the GUI or on the CLI
  – advixe-gui: GUI command
  – advixe-cl: CLI command
• Add -g to the usual optimization flags (e.g., -g -O3)
• A "project" is a physical directory where analyses can be carried out for a given executable
  – Need to create the project directory and specify it so that analysis results are saved there
  – Can contain multiple analysis types (e.g., survey, tripcounts, map, …)
Intel Advisor

[Screenshot: Advisor GUI showing results from the survey, trip counts, MAP, and dependencies analyses]
Some Advisor CLI commands
• From 'advixe-cl -help':

$ module load advisor
$ mkdir myproj                   # project directory; contains all the analysis results below

$ advixe-cl -collect=survey -project-dir=./myproj -- ./a.out          # results in myproj/e000/hs000
$ advixe-cl -report=survey -project-dir=./myproj -format=text \
    -report-output=survey.txt

$ advixe-cl -collect=tripcounts -project-dir=./myproj -- ./a.out      # results in myproj/e000/trc000
$ advixe-cl -report=tripcounts -project-dir=./myproj -format=text \
    -report-output=tripcounts.txt

$ advixe-cl -collect=map -project-dir=./myproj -- ./a.out             # results in myproj/e000/mp000
$ advixe-cl -report=map -project-dir=./myproj -format=text \
    -report-output=map.txt

$ advixe-cl -collect=dependencies -project-dir=./myproj -- ./a.out    # results in myproj/e000/dp000; can take very long
$ advixe-cl -report=dependencies -project-dir=./myproj -format=text \
    -report-output=dependencies.txt
Intel Advisor for c(:) = a(:) + b(:)
• n = 1500 (all data fit in L1 cache), using AVX2
Intel Advisor for c(:) = a(:) + b(:)
• n = 4000 (data cannot fit in L1 cache), using AVX2
• Note: Advisor does not tell you about the effects of cache misses
Intel Advisor for c(:) = a(:) + b(:)
• n = 1500 (all data fit in L1 cache), using SSE
• Advisor does tell you about the effects of using SSE, access strides, dependencies, …
Memory alignment
• More instructions are needed to collect and organize data in registers if the data are not optimally laid out in memory
• Data movement is optimal if the address of the data starts at certain byte boundaries
  – SSE: 16 bytes (128 bits)
  – AVX: 32 bytes (256 bits)
  – AVX-512 on KNL: 64 bytes (512 bits)
Memory alignment to assist vectorization
• From https://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization
• Alignment of data (Intel)
  – Fortran compiler flag -align
    • '-align array<n>byte', where n = 8, 16, 32, 64, 128, 256, as in '-align array64byte'
    • Entities of COMMON blocks: '-align commons' (4-byte); '-align dcommons' (8-byte); '-align qcommons' (16-byte); '-align zcommons' (32-byte); none for 64-byte
    • '-align rec<n>byte', where n = 1, 2, 4, 8, 16, 32, 64: for derived-data-type components
  – Alignment directives/pragmas in source code
    • Fortran
      – !dir$ attributes align : 64 :: A (when A is declared)
      – !dir$ assume_aligned A : 64 (informs the compiler that A has been aligned)
      – !dir$ vector aligned (vectorize a loop using aligned loads for all arrays)
    • C or C++
      – 'float A[1000] __attribute__((aligned(64)));' or '__declspec(align(64)) float A[1000];' when declaring a static array
      – _aligned_malloc()/_aligned_free() or _mm_malloc()/_mm_free() to allocate heap memory
      – __assume_aligned(A, 64)
      – #pragma vector aligned (vectorize a loop using aligned loads for all arrays)
Memory alignment for multidimensional arrays
• Multi-dimensional arrays need to be padded in the fastest-moving dimension, to ensure array sections are aligned at the desired byte boundaries
  – Fortran: first array dimension
  – C/C++: last array dimension
• npadded = ((n + veclen - 1)/veclen) * veclen
  – No alignment requested: veclen = 1
  – 16-byte alignment (SSE): veclen = 4 (sp) or 2 (dp)
  – 32-byte alignment (AVX2): veclen = 8 (sp) or 4 (dp)
  – 64-byte alignment (AVX-512): veclen = 16 (sp) or 8 (dp)
Memory alignment example
• Naïve matrix-matrix multiplication on Edison

real, allocatable :: a(:,:), b(:,:), c(:,:)
!dir$ attributes align : 32 :: a, b, c
...
allocate (a(npadded,n))
allocate (b(npadded,n))
allocate (c(npadded,n))
...
do j = 1, n
  do k = 1, n
!dir$ vector aligned
    do i = 1, npadded
      c(i,j) = c(i,j) + a(i,k) * b(k,j)
    end do
  end do
end do

!... Ignore c(n+1:npadded,:)
Effect of vectorization (no alignment case)
• Runtime: -no-vec > -xSSE4.2 > -xCORE-AVX2 (similarly for the aligned cases)
• Runtime drops when n is a multiple of 4
Effect of memory alignment
• Effect of padding rows (Fortran)
• Bumps get smoothed out (toward better performance)
• Little improvement with ALIGN64 over ALIGN32
AoS vs. SoA
• Data objects with component elements or attributes
• Array of structures (AoS)
  – The natural order in arranging such objects
  – But it gives non-unit-stride access when loading into vector registers

type coords
  real :: x, y, z
end type
type (coords) :: p(1024)
real dsquared(1024)

do i = 1, 1024
  dsquared(i) = p(i)%x**2 + p(i)%y**2 + p(i)%z**2
end do
AoS vs. SoA
• Structure of arrays (SoA)
  – Unit-stride access when loading into vector registers
  – More efficient loading into vector registers

type coords
  real :: x(1024), y(1024), z(1024)
end type
type (coords) :: p
real dsquared(1024)

do i = 1, 1024
  dsquared(i) = p%x(i)**2 + p%y(i)**2 + p%z(i)**2
end do
AoSoA
• With SoA, locality across the multiple fields is reduced
• With an Array of Structures of Arrays (Tiled Array of Structures), we get locality over multiple fields at the outer level and unit stride at the innermost level

type coords
  real :: x(16), y(16), z(16)
end type
type (coords) :: p(64)
real dsquared(16,64)

do i = 1, 64
  do j = 1, 16
    dsquared(j,i) = p(i)%x(j)**2 + p(i)%y(j)**2 + p(i)%z(j)**2
  end do
end do
AoS
• aossoa.F from http://www.nersc.gov/users/computational-systems/edison/programming/vectorization/
SoA
• aossoa.F from http://www.nersc.gov/users/computational-systems/edison/programming/vectorization/
SIMD-Enabled ("Elemental") Functions
• An elemental function operates element-wise and returns an array with the same shape as the input parameter
  – Widely used in Fortran intrinsic functions (but not in a vectorization sense)
• When declared, the Intel compiler generates a vector version and a scalar version of the function
• A function call within a loop generally inhibits vectorization, but a loop containing a call to an elemental function can be vectorized; in that case, the vector version is used
SIMD-Enabled function example

module fofx
contains
  function f(x)
!dir$ attributes vector :: f
    real, intent(in) :: x
    real f
    f = cos(x * x + 1.) / (x * x + 1.)
  end function
end module

program main
  use fofx
  real a(100), x(100)
  ...
  do i=1,100
    a(i) = f(x(i))
  end do
  ...
end program

$ ifort -qopt-report=3 elemental.F
...
  LOOP BEGIN at elemental.F(50,11)
    remark #15300: LOOP WAS VECTORIZED
...

(The report's line 50 is the loop in the main program; line 7 is the function f.)
OpenMP 4.0 SIMD constructs
• SIMD constructs for execution of a loop in vectorization mode
• Optional clauses
  – safelen(length)
  – aligned(list[:alignment])
  – reduction(reduction-identifier : list)
  – collapse(n)
#pragma omp simd [clauses…]
!$omp simd [clauses…]
OpenMP 4.0 SIMD constructs (Cont'd)
• Example

...
do j=1,n
  do k=1,n
!$omp simd aligned(a,b,c:32)
    do i=1,nr
      c(i,j) = c(i,j) + a(i,k) * b(k,j)
    end do
  end do
end do
...
OpenMP 4.0 SIMD constructs (Cont'd)
• SIMD-enabled function ("elemental function")
• Example
#pragma omp declare simd [clauses…] function definition or declaration
!$omp declare simd(proc-name) [clauses…] function definition
!$omp declare simd(f)
function f(x)
  real f, x
  f = cos(x * x + 1.e0) / (x * x + 1.e0)
end function f
...
do i=1,n
  a(i) = f(x(i))
end do
...
More Intel Advisor usage tips
• Tool design and functionality can change over time, but here are some tips for using advisor/2016.1.40.455986 (the default on Cori)
• To run an MPI code
  – Via advixe-cl only (not the GUI)
  – Use a dynamically-linked executable
  – Run only one rank through advixe-cl
  – Run in MPMD mode if the # of tasks > 1
$ salloc -N 1 -t 30:00 -p debug
...
$ ftn -dynamic -g -O3 myprog.f
$ module load advisor

$ cat mpmd.conf
0 advixe-cl --collect survey --project-dir ./myproj -- ./a.out
1-9 ./a.out

$ srun --multi-prog ./mpmd.conf

Rank 0 runs with advixe-cl; ranks 1-9 run without it.
More Intel Advisor usage tips (Cont'd)
• The Intel compiler can generate multiple instruction sets, even though some may not be executable on your machine
• Below, an executable built for Haswell nodes (-xCORE-AVX2) also contains AVX-512 instructions (-axCORE-AVX512), to get a glimpse of the expected performance

$ salloc -N 1 -t 30:00 -p debug
...
$ ftn -dynamic -g -O3 -xCORE-AVX2 -axCORE-AVX512 myprog.f
$ module load advisor
$ advixe-cl --collect survey --support-multi-isa-binaries \
    --project-dir ./myproj -- ./a.out
More Intel Advisor usage tips (Cont'd)
• Select 'Analyze loops that reside in non-executed code paths' in the Project Properties window (reached through the GUI's 'File -> Project Properties')
• See https://software.intel.com/en-us/blogs/2016/02/02/explore-intel-avx-512-code-paths-while-not-having-compatible-hardware
More Intel Advisor usage tips (Cont'd)

[Screenshot: the Project Properties window, indicating the option to select and the button to click]
Vectorization sample codes
• http://www.nersc.gov/users/computational-systems/edison/programming/vectorization/
• Intel-provided samples in $ADVISOR_XE_2016_DIR/samples/en

$ module load advisor
$ ls $ADVISOR_XE_2016_DIR/samples/en/C++
Vector_Tutorial_Data_Alignment.tgz
Vector_Tutorial_Introduction.tgz
Vector_Tutorial_Memory_Access_101.tgz
Vector_Tutorial_Stride_and_MAP.tgz
Vector_Tutorial_Vectorization_and_Data_Size.tgz
mmult_Advisor.tgz
mpi_sample.tgz
nqueens_Advisor.tgz
tachyon_Advisor.tgz
vec_samples.tgz

$ ls $ADVISOR_XE_2016_DIR/samples/en/Fortran
mmult.tgz
nqueens.tgz
Intel Advisor 2016 tutorial
• https://software.intel.com/en-us/intel-advisor-2016-tutorial-vectorization-linux-cplusplus
• Uses $ADVISOR_XE_2016_DIR/samples/en/C++/vec_samples.tgz
National Energy Research Scientific Computing Center