Woo-Sun Yang, NERSC User Engagement Group
Vectorization
March 23, 2016
NERSC User Group Meeting 2016
What’s All This About Vectorization?
• Vectorization is an on-node, in-core way of exploiting data-level parallelism in programs by applying the same operation to multiple data items in parallel.

DO I = 1, N
  Z(I) = X(I) + Y(I)
ENDDO

• Requires transforming a program so that a single instruction can launch many operations on different data
• Applies most commonly to array operations in loops
What is Required for Vectorization?
• Code transformation
• Compiler generates vector instructions:

DO I = 1, N
  Z(I) = X(I) + Y(I)
ENDDO

(Compiler transforms the loop:)

DO I = 1, N, 4
  Z(I)   = X(I)   + Y(I)
  Z(I+1) = X(I+1) + Y(I+1)
  Z(I+2) = X(I+2) + Y(I+2)
  Z(I+3) = X(I+3) + Y(I+3)
ENDDO

VLOAD  X(I), X(I+1), X(I+2), X(I+3)
VLOAD  Y(I), Y(I+1), Y(I+2), Y(I+3)
VADD   Z(I, ..., I+3) = X(I, ..., I+3) + Y(I, ..., I+3)
VSTORE Z(I), Z(I+1), Z(I+2), Z(I+3)
What is Required for Vectorization?
• Vector hardware: vector registers and vector functional units

[Figure: vector registers, each 512 bits wide; the register width is the vector length]
Evolution of Vector Hardware
• Translates to (peak) speed: cores per processor × vector length × CPU speed × 2 arith. ops per vector
Data Dependencies
• Examples:

DO I = 2, N-1
  A(I) = A(I-1) + B(I)
END DO

Compiler detects a backward reference on A(I-1): read-after-write (also known as "flow dependency")

DO I = 2, N-1
  A(I-1) = X(I) + DX
  A(I)   = 2.*DY
END DO

Compiler detects the same location being written: write-after-write (also known as "output dependency")

ftn -qopt-report=2 -c mms.f90

Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]
   remark #15346: vector dependence: assumed OUTPUT dependence between line 12 and line 11
How to Vectorize Your Code?
• Auto-vectorization analysis by the compiler
• Auto-vectorization analysis by the compiler, enhanced with directives: code annotations that suggest what can be vectorized
• Code explicitly for vectorization using OpenMP 4.0 SIMD pragmas or SIMD intrinsics (not portable)
• Use assembly language
• Use vendor-supplied optimized libraries
Requirements for vectorization
• Loop trip count known at entry to the loop at run time
• Single entry and single exit
• No function calls or I/O
• No data dependencies in the loop
• Uniform control flow (although conditional computation can be implemented using "masked" assignment)
Vectorization performance (speed-up)
• Factors that affect vectorization performance
  – Efficient loads and stores with vector registers
    • Data in caches
    • Data aligned to a certain byte boundary in memory
    • Unit-stride access
  – Efficient vector operations
    • Certain arithmetic operations not at full speed
• Good speed-up with vectorization when all the conditions are met
• Examples from https://www.nersc.gov/users/computational-systems/edison/programming/vectorization/
How good is vectorization
• Compiler vectorization of loops
  – Enabled with default optimization levels for the Intel and Cray compilers on Cori/Edison (and Intel on Babbage)
  – Use the -qopt-report[=n] -qopt-report-phase=vec flags (n is from 0 through 5; default: 2)
LOOP BEGIN at a1.F(35,10)
   remark #15300: LOOP WAS VECTORIZED
   remark #15450: unmasked unaligned unit stride loads: 2
   remark #15451: unmasked unaligned unit stride stores: 1
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 6
   remark #15477: vector loop cost: 2.000
   remark #15478: estimated potential speedup: 2.990
   remark #15488: --- end vector loop cost summary ---
   remark #25015: Estimate of max trip count of loop=249
LOOP END

LOOP BEGIN at a1.F(35,10)
<Remainder loop for vectorization>
   remark #15301: REMAINDER LOOP WAS VECTORIZED
   remark #25015: Estimate of max trip count of loop=3
LOOP END
How good is vectorization (Cont’d)
• Intel Advisor
  – Vectorization analysis tool that identifies loops for vectorization and the reasons that block effective vectorization
• Many web pages with useful info
  – https://software.intel.com/en-us/intel-advisor-xe
  – https://software.intel.com/en-us/get-started-with-advisor-vectorization-linux
  – https://software.intel.com/en-us/intel-advisor-xe-support/training
  – https://software.intel.com/en-us/intel-advisor-2016-tutorial-vectorization-linux-cplusplus
  – …
Data in Caches

do i = 1, n
  c(i) = a(i) + b(i)
enddo

[Plot: speed-up vs. array size] Speed-up is close to the theoretical max while the data fit in L1 cache; it worsens as the array size exceeds the L1 size.
Data in Caches (Cont'd)

do i = 1, n
  c(i) = a(i) + b(i)
enddo

[Plot: speed-up vs. array size] Speed-up drops again as the arrays exceed the L2 cache size.
Let's try Intel Advisor on these runs
• Analysis types that Intel Advisor provides
  – survey: explore where to add efficient vectorization
  – tripcounts: iteration counts for loops
  – Refinement analysis
    • map (Memory Access Pattern): memory access strides for loops
    • dependencies: loop-carried dependencies
• Analysis can be run in the GUI or on the CLI
  – advixe-gui: GUI command
  – advixe-cl: CLI command
• Add -g to the usual optimization flags (e.g., -g -O3)
• A "project" is a physical directory where analyses can be carried out for a given executable
  – Need to create the project directory and specify it so that analysis results are saved there
  – Can contain multiple analysis types (e.g., survey, tripcounts, map, …)
Intel Advisor

[Screenshot: Advisor GUI showing results from the survey, trip counts, MAP, and dependencies analyses]
Some Advisor CLI commands
• From 'advixe-cl -help':

$ module load advisor
$ mkdir myproj                   # project directory; contains all the analysis results below

$ advixe-cl -collect=survey -project-dir=./myproj -- ./a.out          # results in myproj/e000/hs000
$ advixe-cl -report=survey -project-dir=./myproj -format=text \
    -report-output=survey.txt

$ advixe-cl -collect=tripcounts -project-dir=./myproj -- ./a.out      # results in myproj/e000/trc000
$ advixe-cl -report=tripcounts -project-dir=./myproj -format=text \
    -report-output=tripcounts.txt

$ advixe-cl -collect=map -project-dir=./myproj -- ./a.out             # results in myproj/e000/mp000
$ advixe-cl -report=map -project-dir=./myproj -format=text \
    -report-output=map.txt

$ advixe-cl -collect=dependencies -project-dir=./myproj -- ./a.out    # results in myproj/e000/dp000; can take very long
$ advixe-cl -report=dependencies -project-dir=./myproj -format=text \
    -report-output=dependencies.txt
Intel Advisor for c(:) = a(:) + b(:)
• n = 1500 (all data fit in L1 cache), using AVX2
Intel Advisor for c(:) = a(:) + b(:)
• n = 4000 (data cannot fit in L1 cache), using AVX2
• Note: Advisor does not tell you about the effects of cache misses
Intel Advisor for c(:) = a(:) + b(:)
• n = 1500 (all data fit in L1 cache), using SSE
• Advisor does tell you about the effects of using SSE, access strides, dependencies, …
Memory alignment
• More instructions are needed to collect and organize data in registers if the data are not optimally laid out in memory
• Data movement is optimal if the address of the data starts at certain byte boundaries
  – SSE: 16 bytes (128 bits)
  – AVX: 32 bytes (256 bits)
  – AVX-512 on KNL: 64 bytes (512 bits)
Memory alignment to assist vectorization
• From https://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization
• Alignment of data (Intel)
  – Fortran compiler flag -align
    • '-align array<n>byte', where n = 8, 16, 32, 64, 128, 256, as in '-align array64byte'
    • Entities of COMMON blocks: '-align commons' (4-byte); '-align dcommons' (8-byte); '-align qcommons' (16-byte); '-align zcommons' (32-byte); none for 64-byte
    • '-align rec<n>byte', where n = 1, 2, 4, 8, 16, 32, 64: for derived-data-type components
  – Alignment directives/pragmas in source code
    • Fortran
      – !dir$ attributes align : 64 :: A (when A is declared)
      – !dir$ assume_aligned A : 64 (informs the compiler that A has been aligned)
      – !dir$ vector aligned (vectorize a loop using aligned loads for all arrays)
    • C or C++
      – 'float A[1000] __attribute__((aligned(64)));' or '__declspec(align(64)) float A[1000];' when declaring a static array
      – _aligned_malloc()/_aligned_free() or _mm_malloc()/_mm_free() to allocate heap memory
      – __assume_aligned(A, 64)
      – #pragma vector aligned (vectorize a loop using aligned loads for all arrays)
Memory alignment for multidimensional arrays
• Multi-dimensional arrays need to be padded in the fastest-moving dimension, to ensure array sections are aligned at the desired byte boundaries
  – Fortran: first array dimension
  – C/C++: last array dimension
• npadded = ((n + veclen - 1)/veclen) * veclen
  – No alignment requested: veclen = 1
  – 16-byte alignment (SSE): veclen = 4 (sp) or 2 (dp)
  – 32-byte alignment (AVX2): veclen = 8 (sp) or 4 (dp)
  – 64-byte alignment (AVX-512): veclen = 16 (sp) or 8 (dp)
Memory alignment example
• Naïve matrix-matrix multiplication on Edison

real, allocatable :: a(:,:), b(:,:), c(:,:)
!dir$ attributes align : 32 :: a, b, c
...
allocate (a(npadded,n))
allocate (b(npadded,n))
allocate (c(npadded,n))
...
do j = 1, n
  do k = 1, n
!dir$ vector aligned
    do i = 1, npadded
      c(i,j) = c(i,j) + a(i,k) * b(k,j)
    end do
  end do
end do

!... Ignore c(n+1:npadded,:)
Effect of vectorization (no alignment case)
• Runtime: -no-vec > -xSSE4.2 > -xCORE-AVX2 (similarly for the aligned cases)
• Runtime drops when n is a multiple of 4
Effect of memory alignment
• Effect of padding rows (Fortran)
• Bumps get smoothed out (toward better performance)
• Little improvement with ALIGN64 over ALIGN32
AoS vs. SoA
• Data objects with component elements or attributes
• Array of structures (AoS)
  – The natural order in arranging such objects
  – But it gives non-unit-stride access when loading into vector registers

type coords
  real :: x, y, z
end type
type (coords) :: p(1024)
real dsquared(1024)

do i = 1, 1024
  dsquared(i) = p(i)%x**2 + p(i)%y**2 + p(i)%z**2
end do
AoS vs. SoA
• Structure of arrays (SoA)
  – Unit-stride access when loading into vector registers
  – More efficient loading into vector registers

type coords
  real :: x(1024), y(1024), z(1024)
end type
type (coords) :: p
real dsquared(1024)

do i = 1, 1024
  dsquared(i) = p%x(i)**2 + p%y(i)**2 + p%z(i)**2
end do
AoSoA
• With SoA, locality across the multiple fields is reduced
• With an Array of Structures of Arrays (Tiled Array of Structures), we get locality over multiple fields at the outer level and unit stride at the innermost level

type coords
  real :: x(16), y(16), z(16)
end type
type (coords) :: p(64)
real dsquared(16,64)

do i = 1, 64
  do j = 1, 16
    dsquared(j,i) = p(i)%x(j)**2 + p(i)%y(j)**2 + p(i)%z(j)**2
  end do
end do
AoS
• aossoa.F from http://www.nersc.gov/users/computational-systems/edison/programming/vectorization/
SoA
• aossoa.F from http://www.nersc.gov/users/computational-systems/edison/programming/vectorization/
SIMD-Enabled ("Elemental") Functions
• An elemental function operates element-wise and returns an array with the same shape as the input parameter
  – Widely used in Fortran intrinsic functions (but not in a vectorization sense)
• When declared, the Intel compiler generates a vector version and a scalar version of the function
• A function call within a loop generally inhibits vectorization, but a loop containing a call to an elemental function can be vectorized; in that case, the vector version is used
SIMD-Enabled function example

module fofx
contains
  function f(x)
!dir$ attributes vector :: f
    real, intent(in) :: x
    real f
    f = cos(x * x + 1.) / (x * x + 1.)
  end function
end module

program main
  use fofx
  real a(100), x(100)
  ...
  do i=1,100
    a(i) = f(x(i))
  end do
  ...
end program

$ ifort -qopt-report=3 elemental.F
...
  LOOP BEGIN at elemental.F(50,11)
    remark #15300: LOOP WAS VECTORIZED
...

(The report's line 50 is the loop in the main program; line 7 is the function f.)
OpenMP 4.0 SIMD constructs
• SIMD constructs for execution of a loop in vectorization mode
• Optional clauses
  – safelen(length)
  – aligned(list[:alignment])
  – reduction(reduction-identifier : list)
  – collapse(n)
#pragma omp simd [clauses…]
!$omp simd [clauses…]
OpenMP 4.0 SIMD constructs (Cont'd)
• Example

...
do j=1,n
  do k=1,n
!$omp simd aligned(a,b,c:32)
    do i=1,nr
      c(i,j) = c(i,j) + a(i,k) * b(k,j)
    end do
  end do
end do
...
OpenMP 4.0 SIMD constructs (Cont'd)
• SIMD-enabled function ("elemental function")
• Example
#pragma omp declare simd [clauses…] function definition or declaration
!$omp declare simd(proc-name) [clauses…] function definition
!$omp declare simd(f)
function f(x)
  real f, x
  f = cos(x * x + 1.e0) / (x * x + 1.e0)
end function f
...
do i=1,n
  a(i) = f(x(i))
end do
...
More Intel Advisor usage tips
• Tool design and functionality can change over time, but here are some tips for using advisor/2016.1.40.455986 (the default on Cori)
• To run an MPI code
  – Via advixe-cl only (not the GUI)
  – Use a dynamically-linked executable
  – Run only one rank through advixe-cl
  – Run in MPMD mode if the # of tasks > 1
$ salloc -N 1 -t 30:00 -p debug
...
$ ftn -dynamic -g -O3 myprog.f
$ module load advisor

$ cat mpmd.conf
0 advixe-cl --collect survey --project-dir ./myproj -- ./a.out
1-9 ./a.out

$ srun --multi-prog ./mpmd.conf

Rank 0 runs with advixe-cl; ranks 1-9 run without it.
More Intel Advisor usage tips (Cont'd)
• The Intel compiler can generate multiple instruction sets, even though some may not be executable on your machine
• Below, an executable built for Haswell nodes (-xCORE-AVX2) also contains AVX-512 instructions (-axCORE-AVX512), to get a glimpse of the expected performance

$ salloc -N 1 -t 30:00 -p debug
...
$ ftn -dynamic -g -O3 -xCORE-AVX2 -axCORE-AVX512 myprog.f
$ module load advisor
$ advixe-cl --collect survey --support-multi-isa-binaries \
    --project-dir ./myproj -- ./a.out
More Intel Advisor usage tips (Cont'd)
• Select 'Analyze loops that reside in non-executed code paths' in the Project Properties window (reached through the GUI's 'File -> Project Properties')
• See https://software.intel.com/en-us/blogs/2016/02/02/explore-intel-avx-512-code-paths-while-not-having-compatible-hardware
More Intel Advisor usage tips (Cont'd)

[Screenshot: the Project Properties window, indicating the option to select and the button to click]
Vectorization sample codes
• http://www.nersc.gov/users/computational-systems/edison/programming/vectorization/
• Intel-provided samples in $ADVISOR_XE_2016_DIR/samples/en

$ module load advisor
$ ls $ADVISOR_XE_2016_DIR/samples/en/C++
Vector_Tutorial_Data_Alignment.tgz
Vector_Tutorial_Introduction.tgz
Vector_Tutorial_Memory_Access_101.tgz
Vector_Tutorial_Stride_and_MAP.tgz
Vector_Tutorial_Vectorization_and_Data_Size.tgz
mmult_Advisor.tgz
mpi_sample.tgz
nqueens_Advisor.tgz
tachyon_Advisor.tgz
vec_samples.tgz

$ ls $ADVISOR_XE_2016_DIR/samples/en/Fortran
mmult.tgz
nqueens.tgz
Intel Advisor 2016 tutorial
• https://software.intel.com/en-us/intel-advisor-2016-tutorial-vectorization-linux-cplusplus
• Uses $ADVISOR_XE_2016_DIR/samples/en/C++/vec_samples.tgz
National Energy Research Scientific Computing Center