120
Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | VectorizaAon in HotSpot JVM Vladimir Ivanov HotSpot JVM Compiler Oracle Corp. April 8, 2017

Vectorizaon in HotSpot JVM - java.netcr.openjdk.java.net/.../talks/2017_Vectorization_in_HotSpot_JVM.pdf · Copyright © 2017, Oracle and/or its affiliates. All rights reserved. |

Embed Size (px)

Citation preview

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

VectorizaAoninHotSpotJVM

VladimirIvanovHotSpotJVMCompilerOracleCorp.April8,2017

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

•  SIMDISAextensions– packedvectorsonx86

•  JVM– auto-vectorizaAon,intrinsics

•  Future– JDK9– VectorAPI

2

Agenda

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

0x11529c8c0:mov%eax,-0x16000(%rsp)0x11529c8c7:push%rbp0x11529c8c8:sub$0x20,%rsp0x11529c8cc:mov%rdx,(%rsp)0x11529c8d0:mov%rsi,%rbp0x11529c8d3:movabs$0x7c0013d10,%rsi0x11529c8dd:nop0x11529c8de:nop0x11529c8df:nop0x11529c8e0:vzeroupper0x11529c8e3:callq0x00000001152418a00x11529c8e8:mov%rax,%rbx0x11529c8eb:mov(%rsp),%r100x11529c8ef:vmovdqu0x10(%r10),%ymm10x11529c8f5:vmovdqu0x10(%rbp),%ymm00x11529c8fa:vpaddd%ymm0,%ymm1,%ymm00x11529c8fe:vmovdqu%ymm0,0x10(%rbx)0x11529c903:mov%rbx,%rax0x11529c906:vzeroupper0x11529c909:add$0x20,%rsp0x11529c90d:pop%rbp

0x11529d240:mov%eax,-0x16000(%rsp)0x11529d247:push%rbp0x11529d248:sub$0x30,%rsp0x11529d24c:mov%rcx,%rbp0x11529d24f:vmovdqu0x10(%rsi),%ymm00x11529d254:vmovdqu0x10(%rdx),%ymm10x11529d259:vpaddd%ymm0,%ymm1,%ymm00x11529d25d:vmovdqu%ymm0,(%rsp)0x11529d262:movabs$0x7c0013d10,%rsi0x11529d26c:vzeroupper0x11529d26f:callq0x00000001152418a00x11529d274:mov%rax,%rbx0x11529d277:vmovdqu0x10(%rbp),%ymm10x11529d27c:vmovdqu(%rsp),%ymm00x11529d281:vpaddd%ymm0,%ymm1,%ymm00x11529d285:vmovdqu%ymm0,0x10(%rbx)0x11529d28a:mov%rbx,%rax0x11529d28d:vzeroupper0x11529d290:add$0x30,%rsp0x11529d294:pop%rbp0x11529d295:test%eax,-0xb78729b(%rip)

x86Assembly

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

mov0x10(%src),%dstvs

movdst,[src+10h]

4

AssemblySyntaxAT&T

Intel

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.| 5

“TheFreeLunchIsOver”,HerbSuZer,2005

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

• Machines– Hadoop(Map/Reduce),ApacheSpark

• Cores/hardwarethreads– JavaStreamAPI– Fork/Joinframework

• CPUSIMDextensions– x86:SSE...,AVX,…,AVX-512

6

GoingParallel<10^3-10^6

<10s-100s

<10s

(servers)

(threads)

(elements)

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

• CPUs– SIMDISAextensions(SingleInstrucAon-MulApleData)– threads(MulApleInstrucAons-MulApleData)

• Co-processors

– GPUs– FPGAs– ASICs

•  DataAnalyAcsAccelerator(DAX)onSPARC

7

GoingParallel:CPUsvsCo-processors

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

• Machines– upto12cards/server

•  IntelXeonPhi

– 4threadsx72cores

• AVX-512

– 2units/core

8

SIMDvsMIMD<10^3-10^612x

<10s-100s

<10s

(servers)

288(threads)

vs

16SP(elements)

4608-way

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

•  x86:MMX,SSE,AVX– 864-bitregisters(MMX)to32512-bitregisters(AVX-512)

• ARM:NEON– 32128-bitregisters

•  SPARC:VIS– 3264-bitregisters

• POWER:VMX/AlAVec– 32128-bitregisters

SIMDtoday

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

• Wide(mulA-word)registers– 128-bit(xmm)– 256-bit(ymm)– 512-bit(zmm)

•  InstrucAonsonpackedvectors– packedinaregisterormemorylocaAon– shortvectorsofinteger/FPnumbers

•  2xdouble,4xint,8xshort– hard-codedvectorsize

10

x86SIMDExtensions

0128256512

xmm0ymm0zmm0

xmm00326496128

intintintint

double double

short short short short short short short short

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

//LoadA[i:i+3]vmovdqu0x10(%rcx,%rdx,4),%xmm0

//LoadB[i:i+3]

vmovdqu0x10(%r10,%rdx,4),%xmm1

//A[i:i+3]+B[i:i+3]

vpaddd%xmm0,%xmm1,%xmm2

//StoreintoC[i:i+3]

vmovdqu%xmm2,0x10(%r8,%rdx,4)

11

x86SIMDExtensions

A[i+0]A[i+1]A[i+2]A[i+3]

B[i+0]B[i+1]B[i+2]B[i+3]

+ + + +

= = = =

C[i+3]C[i+2]C[i+1]C[i+0]

xmm0

xmm1

xmm2

memory

int[]

A[i+3]A[i+2]A[i+1]A[i+0]

memory

int[]

B[i+3]B[i+2]B[i+1]B[i+0]int[]B[]

A[]

C[]

C[i+0]C[i+1]C[i+2]C[i+3]

3296128

registers

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

Year Name Registers1997 MMX 64-bit mm0-71999 SSE 128-bit xmm0-72001 SSE2 128-bit xmm0-152004 SSE3 128-bit xmm0-152006 SSSE3 128-bit xmm0-152006 SSE4.1 128-bit xmm0-152008 SSE4.2 128-bit xmm0-152011 AVX 256-bit ymm0-152013 AVX2 256-bit ymm0-152013 FMA3 256-bit ymm0-152015 AVX-512 512-bit zmm0-31(k0-7)

12

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

HowtouAlizeSIMDinstrucAons?

13

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

• AutomaAc– sequenAallanguagesandpracAcesgetsintheway

•  Semi-automaAc– Giveyourcompiler/runAmehintsandhopeitvectorizes– e.g.,OpenMP4.0#pragmaompsimd

• Codeexplicitly

– e.g.,SIMDinstrucAonintrinsics

14

VectorizaAontechniques

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

IfthecodeiscompiledforaparJcularinstrucJonsetthenitwillbecompaJblewithallCPUsthatsupportthis

instrucAonsetoranyhigherinstrucAonset,butpossiblynotwithearlierCPUs.

15

Problem

SSE4.2<<AVX-512

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

Idea:MakecriAcalpartsofthecodeinmulJpleversionsfordifferentCPUs.

•  Forexample,provide:

– AVX2&SSE4.2specializaAons– genericversionthatiscompaAblewitholdmicroprocessors

•  TheprogramshouldautomaAcallydetectwhichinstrucAonsetissupportedandchoosetheappropriateversion.

16

CPUDispatching

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

“Itisquiteexpensive-intermsofdevelopment,tesAngandmaintenance-tomakeapieceofcodeinmulApleversions,eachcarefullyopAmizedandfine-tunedfora

parAcularsetofCPUs.“

“OpAmizingsowwareinC++”,AgnerFog

17

CPUDispatching

hZp://www.agner.org/opAmize/opAmizing_cpp.pdf

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

• OpAmizingforpresentprocessorsratherthanfutureprocessors•  Thinkingintermsofspecificprocessormodelsratherthanprocessorfeatures

• Assumingthatprocessormodelnumbersformalogicalsequence•  Failuretohandleunknownprocessorsproperly• UnderesAmaAngthecostofkeepingaCPUdispatcherupdated• Makingtoomanybranches•  IgnoringvirtualizaAon

18

CPUdispatching:Commonpiyalls

“OpAmizingsowwareinC++”,AgnerFog

hZp://www.agner.org/opAmize/opAmizing_cpp.pdf

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

JVMisinagoodposiAon:1.  Javabytecodeisplayorm-agnosAc

2.  CPUprobingatrunAme(atstartup)– knowseverythingaboutthehardwareitexecutesatthemoment

3.  DynamiccodegeneraAon– onlyuseinstrucAonswhichareavailableonthehost

JVMandSIMDtoday

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

• Hotspotsupportssomeofx86SIMDinstrucAons

• AutomaAcvectorizaAonofJavacode– SuperwordopAmizaAonsinHotSpotC2compilertoderiveSIMDcodefromsequenAalcode

•  JVMintrinsics– Arraycopying,filling,andcomparison

JVMandSIMDtoday

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

JVMIntrinsics

21

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

“AmethodisintrinsifiediftheHotSpotVMreplacestheannotatedmethodwithhand-wriOenassemblyand/orhand-wriOencompilerIR--acompilerintrinsic--toimproveperformance.”

@HotSpotIntrinsicCandidateJavaDoc

JVMIntrinsics

publicfinalclassjava.lang.Class<T>implements…{@HotSpotIntrinsicCandidatepublicnativebooleanisInstance(Objectobj);

hZp://hg.openjdk.java.net/jdk9/jdk9/jdk/file/Ap/src/java.base/share/classes/jdk/internal/HotSpotIntrinsicCandidate.java

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

• Arraycopy– System.arraycopy(),Arrays.copyOf(),Arrays.equals()

•  Arraymismatch(@since9)– Arrays.mismatch(),Arrays.compare()– basedonArraysSupport.vectorizedMismatch()

23

VectorizedJVMIntrinsics

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

Auto-vectorizaAonbyJIT-compiler

24

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.| 25

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.| 26

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

SuperWordopAmizaAonis:1.  implementedonlyinC2JIT-compiler

hotspot/src/share/vm/opto/c2_globals.hpp:product(bool,UseSuperWord,true,"Transformscalaroperationsintosuperwordoperations”)

2.  appliedonlytounrolledloops– unrollingisperformedonlyforcountedloops

27

VectorizaAon:Prerequisites

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

“Countedloopsarealltrip-countedloops,withexactly1trip-counterexitpath(andmaybesomeotherexitpaths).Thetrip-counterexitisalwayslastintheloop.Thetrip-counterhavetostridebyaconstant;theexitvalueisalsoloopinvariant.”

hotspot/src/share/vm/opto/loopnode.hpp:136

28

CountedLoopsvsTrip-CountedLoops

hZp://hg.openjdk.java.net/jdk9/jdk9/hotspot/file/b725de5c248a/src/share/vm/opto/loopnode.hpp#l136

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

for(inti=start;i<limit;i+=stride){//loopbody}inti=start;while(i<limit){//loopbodyi+=stride;}

29

CountedLoop

•  limitisloopinvariant•  strideisconstant(compile-Ame)

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

1.  $java…-XX:+PrintCompilation-XX:+TraceLoopOpts…– availableonlyindebugbuilds1291bCountedLoop::test1(28bytes)CountedLoop:N100/N83limit_checkpredicatedcounted[0,100),+1(-1iters)

2.  $java…-XX:+PrintAssembly…– andeyeballgeneratedcode

30

Howtodetect?

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

for(inti=0;i<100;i++){/*loopbody*/}$java…-XX:+TraceLoopOpts…CountedLoop:N100/N83…counted[0,100),+1(-1iters)

31

CountedLoop?

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

for(inti=start;i<100;i++){/*loopbody*/}CountedLoop:N104/N84…counted[int,100),+1(-1iters)

32

CountedLoop?

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

for(inti=start;i<end;i++){/*loopbody*/}CountedLoop:N104/N84…counted[int,int),+1(-1iters)

33

CountedLoop?

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

for(inti=start;i<end1&&i<end2;i++){…}CountedLoop:N104/N84…counted[int,int),+1(-1iters)

34

CountedLoop?

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

for(inti=start;i<end1||i<end2;i++){…}Loop:N101/N93limit_checkpredicatedsfpts={93}PartialPeelLoop:N101/N93limit_checkpredicatedsfpts={93}CountedLoop:N136/N64counted[int,int),+1(-1iters)

35

CountedLoop?

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

for(inti=start;i<end;i+=2){/*loopbody*/}CountedLoop:N108/N85…counted[int,int),+2(-1iters)

36

CountedLoop?

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

for(inti=start;i<end;i+=d){/*loopbody*/}for(inti=start;i<end;i*=2){/*loopbody*/}for(inti=start;i<end;i++){…if(…){end++;}…}

37

CountedLoop?

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

for(longl=0;l<100;l++){…}for(inti=0;i<100;i++){…}for(byteb=0;b<100;b++){…}for(shorts=0;s<100;s++){…}for(charc=0;c<100;c++){…}

38

CountedLoop?

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

for(inti=0;i<100;i++){…f();//notinlined

}CountedLoop:…counted[0,100),+1(-1iters)has_callhas_sfpt

39

CountedLoop?

Butnounrollinghappens,hencenovectorizaAon.

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

int[]A,B,C;for(inti=0;i<MAX;i++){A[i]=B[i]+C[i];}

40

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

for(inti=0;i<MAX-4;i+=4){//mainloopA[i+0]=B[i+0]+C[i+0];A[i+1]=B[i+1]+C[i+1];A[i+2]=B[i+2]+C[i+2];A[i+3]=B[i+3]+C[i+3];}//post-loop

41

Loopunrolling(4Ames)

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

for(inti=0;i<MAX-4;i+=4){A[i+0]=B[i+0]+C[i+0];A[i+1]=B[i+1]+C[i+1];A[i+2]=B[i+2]+C[i+2];A[i+3]=B[i+3]+C[i+3];}

42

Loopunrolling

A[i:i+3]B[i:i+3]C[i:i+3]

isomorphic

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

0x10a4249d0:vmovdqu0x10(%rcx,%rdx,4),%xmm0;B[i:i+3]=>xmm0

0x10a4249d6:vpaddd0x10(%r10,%rdx,4),%xmm0,%xmm0;C[i:i+3]+xmm0=>xmm0

0x10a4249dd:vmovdqu%xmm0,0x10(%r8,%rdx,4);xmm0=>A[i:i+3]

0x10a4249e4:add$0x4,%edx;i+=4

0x10a4249e7:cmp%r9d,%edx;

0x10a4249ea:jl0x10a4249d0;if(i<(MAX-4))repeat

43

8u121,AVX2(Haswell)

Mainloop

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

for(longl=0;l<MAX-4;l+=4){//mainloopA[l+0]=B[l+0]+C[l+0];A[l+1]=B[l+1]+C[l+1];A[l+2]=B[l+2]+C[l+2];A[l+3]=B[l+3]+C[l+3];}Nope…NounrollingduringcompilaAon,hencenovectorizaAon.

44

Manualunrolling?

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

inti=0;for(;i<MAX-4;i+=4){//mainloopA[i:i+3]=B[i:i+3]+C[i:i+3];}for(;i<MAX;i++){//post-loopA[i]=B[i]+C[i];}

45

Vectorizedloop

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

//main-loop

0x10a4249d0:vmovdqu0x10(%rcx,%rdx,4),%xmm0

0x10a4249d6:vpaddd0x10(%r10,%rdx,4),%xmm0,%xmm0

0x10a4249dd:vmovdqu%xmm0,0x10(%r8,%rdx,4)

0x10a4249e4:add$0x4,%edx

0x10a4249e7:cmp%r9d,%edx

0x10a4249ea:jl0x10a4249d0

...

//post-loop

0x10a4249f4:mov0x10(%r10,%rdx,4),%ebx;A[i]=>ebx

0x10a4249f9:add0x10(%rcx,%rdx,4),%ebx;B[i]+ebx=>ebx

0x10a4249fd:mov%ebx,0x10(%r8,%rdx,4);ebx=>C[i]

0x10a424a02:inc%edx;i++

0x10a424a04:cmp%r11d,%edx;

0x10a424a07:jl0x10a4249f4;if(i<MAX)repeat46

8u121,AVX2(Haswell)

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

//???

0x10a42499d:mov0x10(%r10,%rdx,4),%r9d

0x10a4249a2:add0x10(%rcx,%rdx,4),%r9d

0x10a4249a7:mov%r9d,0x10(%r8,%rdx,4)

0x10a4249ac:inc%edx

0x10a4249ae:cmp%edi,%edx

0x10a4249b0:jl0x10a42499d

//main-loop

0x10a4249d0:vmovdqu0x10(%rcx,%rdx,4),%xmm0

0x10a4249d6:vpaddd0x10(%r10,%rdx,4),%xmm0,%xmm0

0x10a4249dd:vmovdqu%xmm0,0x10(%r8,%rdx,4)

0x10a4249e4:add$0x4,%edx

0x10a4249e7:cmp%r9d,%edx

0x10a4249ea:jl0x10a4249d047

8u121,AVX2(Haswell)

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

inti=0,prefix=???;for(;i<prefix;i++){//pre-loopA[i]=B[i]+C[i];}for(;i<MAX-4;i+=4){//mainloopA[i:i+3]=B[i:i+3]+C[i:i+3];}for(;i<MAX;i++){//post-loopA[i]=B[i]+C[i];}

48

Vectorizedloop

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.| 49

Alignment

A[i+2]A[i+1]A[i]A[0]int[] A[MAX-4]+160

8 8 64MAX

Pre-loop Mainloop Post-loopHeader

Cacheline

mov mov vmovdqu vmovdqu mov mov

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

T nounrolling notvectorized vectorized

byte 592±6 506±6 159±4short 541±7 495±4 140±3char 537±4 493±4 141±2int 532±5 490±4 154±2long 533±8 492±5 157±2float 530±4 489±7 155±2double 526±5 483±4 172±3

50

<anyT>voidadd(T[]A,T[]B,T[]C){for(inti=0;i<MAX;i++){A[i]=B[i]+C[i];}}

MAX=1000Corei7,1x2x2,Haswell(AVX2)8u121,macos-x64,ns/op

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

0x10a4249d0:vmovdqu0x10(%rcx,%rdx,4),%xmm0;B[i:i+3]=>xmm0

0x10a4249d6:vpaddd0x10(%r10,%rdx,4),%xmm0,%xmm0;C[i:i+3]+xmm0=>xmm0

0x10a4249dd:vmovdqu%xmm0,0x10(%r8,%rdx,4);xmm0=>A[i:i+3]

0x10a4249e4:add$0x4,%edx;i+=4

0x10a4249e7:cmp%r9d,%edx;

0x10a4249ea:jl0x10a4249d0;if(i<(MAX-4))repeat

51

8u121,AVX2(Haswell)

Mainloop

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

0x10a4249d0:vmovdqu0x10(%rcx,%rdx,4),%xmm0;B[i:i+3]=>xmm0

0x10a4249d6:vpaddd0x10(%r10,%rdx,4),%xmm0,%xmm0;C[i:i+3]+xmm0=>xmm0

0x10a4249dd:vmovdqu%xmm0,0x10(%r8,%rdx,4);xmm0=>A[i:i+3]

0x10a4249e4:add$0x4,%edx;i+=4

0x10a4249e7:cmp%r9d,%edx;

0x10a4249ea:jl0x10a4249d0;if(i<(MAX-4))repeat

52

8u121,AVX2(Haswell)

Hmm…Whynot256-bit?

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

0x117023512:vmovdqu0x10(%rbx,%rcx,4),%ymm0

0x117023518:vpaddd0x10(%rdi,%rcx,4),%ymm0,%ymm0

0x11702351e:vmovdqu%ymm0,0x10(%r9,%rcx,4)

0x11702354f:vmovdqu0x70(%rbx,%r8,4),%ymm0

0x117023556:vpaddd0x70(%rdi,%r8,4),%ymm0,%ymm0

0x11702355d:vmovdqu%ymm0,0x70(%r9,%r8,4)

0x117023564:add$0x20,%ecx

0x117023567:cmp%r10d,%ecx

0x11702356a:jl0x117023512

53

jdk9-b163,AVX2(Haswell)

Allright,compilerproblem.Fixedin9.

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

0x117023512:vmovdqu0x10(%rbx,%rcx,4),%ymm0;iteration#1

0x117023518:vpaddd0x10(%rdi,%rcx,4),%ymm0,%ymm0;A[i:i+7]=B[…]+C[…]

0x11702351e:vmovdqu%ymm0,0x10(%r9,%rcx,4);

…;…

0x11702354f:vmovdqu0x70(%rbx,%r8,4),%ymm0;iteration#4

0x117023556:vpaddd0x70(%rdi,%r8,4),%ymm0,%ymm0;A[i+24:i+31]=…

0x11702355d:vmovdqu%ymm0,0x70(%r9,%r8,4);

0x117023564:add$0x20,%ecx;i+=32;//(4*8)

0x117023567:cmp%r10d,%ecx

0x11702356a:jl0x117023512

54

jdk9-b163,AVX2(Haswell)

Butwait…Whathappenedtomainloop?

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.| 55

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

“…weleverageunrollfactorsfromthebaselineloopwhicharemuchlargertoobtainopJmumthroughputonx86architectures.TheupliwrangeonSpecJvm2008isseenonscimark.lu.{small|large}withupliwnotedat3%and8%respecAvely.Weseeasmuchas1.5xupliwonvectorcentricmicroslikereducAonsondefaultopAmizaAons.”

MichaelBerg,IntelJDK-8129920

56

JDK-8129920:VectorizedLoopUnrolling

hZps://bugs.openjdk.java.net/browse/JDK-8129920

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

for(;i<a;i++){…}//pre-loopfor(;i<MAX-4;i+=4){//mainloopA[i:i+3]=B[i:i+3]+C[i:i+3];}for(;i<MAX;i++){//post-loopA[i]=B[i]+C[i];}

57

JDK9:VectorizedLoop:BeforeUnrolling

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

for(;i<MAX-step;i+=step){//mainloopA[i:i+V]=B[i:i+V]+C[i:i+V];A[i+V:i+2*V]=B[i+V:i+2*V]+C[i+V:i+2*V];A[i+2*V:i+3*V]=B[i+2*V:i+3*V]+C[i+2*V:i+3*V];A[i+3*V:i+4*V]=B[i+3*V:i+4*V]+C[i+3*V:i+4*V];}for(;i<MAX;i++){A[i]=B[i]+C[i];}//post-loopintstep=4/*unroll_factor*/*max_vector_size;intV=max_vector_size–1;

58

JDK9:VectorizedLoop:AwerUnrolling

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

intstep=unroll_factor*max_vector_size;for(;i<MAX-step;i+=step){…}//mainloop//NB!Upto(unroll_factor*max_vector_size)iterationsfor(;i<MAX;i++){A[i]=B[i]+C[i];}//post-loop

59

JDK9:VectorizedMainLoop

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

“theaddiAonofatomicunrolleddrainloopswhichprecedefix-upsegmentsandwhicharesignificantlyfasterthanscalarcode.TherequirementisthatthemainloopissuperunrolledawervectorizaAon.Iseeupto54%upliwonmicrobenchmarksonx86targetsforloopswhichpasssuperwordvectorizaAonandwhichmeettheabovecriteria.”

MichaelBerg,Intelhotspot-compiler-dev@ojn

60

JDK-8149421:VectorizedPostLoops

hZp://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2016-February/021205.html

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

for(;i<MAX-4;i+=4){//vectorizedpost-loopA[i:i+3]=B[i:i+3]+C[i:i+3];}//trip-countin[0;unroll_factor)for(;i<MAX;i++){//post-loopA[i]=B[i]+C[i];}//trip-countin[0;max_vector_size)

61

JDK9:VectorizedPost-loop

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

T nounrolling notvectorized vectorized jdk9-b163

byte 592±6 506±6 159±4 69±3short 541±7 495±4 140±3 69±4char 537±4 493±4 141±2 68±2int 532±5 490±4 154±2 74±1long 533±8 492±5 157±2 141±1float 530±4 489±7 155±2 80±3double 526±5 483±4 172±3 167±2

62

<anyT>voidadd(T[]A,T[]B,T[]C){for(inti=0;i<MAX;i++){A[i]=B[i]+C[i];}}

MAX=1000Corei7,1x2x2,Haswell(AVX2)8u121,macos-x64,ns/op

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

MAX 8u121 jdk9-b163

0 2.0±0 2±0

ns/op

1 3.8±0 3.8±010 7.3±2 8.2±1100 22±1 17±110^3 153±4 73±310^4 2058±57 2025±1410^5 36±1 35±1

μs/op10^6 858±34 883±1410^7 8751±145 9144±14

63

Corei7,1x2x2,Haswell(AVX2)macos-x64T=int

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

int[]A;for(inti=0;i<MAX;i++)A[i]++;

64

[Constants]

0x102222b60:0x00000001

vmovq0x102222b60,%xmm0

vpunpcklqdq%xmm0,%xmm0,%xmm0

vinserti128$0x1,%xmm0,%ymm0,%ymm0

//Mainloop

vmovdqu0x10(%r10,%rcx,4),%ymm1

vpaddd%ymm0,%ymm1,%ymm1

vmovdqu%ymm1,0x10(%r11,%rcx,4)

add$0x8,%ecx

cmp%r9d,%ecx

jl…

8u121,AVX2(Haswell)

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

int[]A;for(inti=0;i<MAX;i++)A[i]*=10;

65

//Mainloop

vmovdqu0x10(%r10,%r11,4),%ymm0

vpslld$0x1,%ymm0,%ymm1

vpslld$0x3,%ymm0,%ymm0

vpaddd%ymm0,%ymm1,%ymm0

vmovdqu%ymm0,0x10(%r10,%r11,4)

add$0x8,%r11d

cmp%r8d,%r11d

jl...

8u121,AVX2(Haswell)

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

int[]A,B,C;for(inti=0;i<MAX;i++)A[i]=B[i]*C[i];

66

//Mainloop

vmovdqu0x10(%rcx,%rdx,4),%xmm0

vpmulld0x10(%r10,%rdx,4),%xmm0,%xmm0

vmovdqu%xmm0,0x10(%r8,%rdx,4)

add$0x4,%edx

cmp%r9d,%edx

jl…

8u121,AVX2(Haswell)

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

float[]A,B,C;for(inti=0;i<MAX;i++)A[i]=B[2*i]*C[2*i];

67

StridedAccess

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

float[]A,B,C;for(inti=0;i<MAX;i++)A[i]=B[2*i]*C[2*i];

68

StridedAccess //Mainloop

vmovss0x18(%rax,%rcx,4),%xmm4

vaddss0x18(%rdx,%rcx,4),%xmm4,%xmm1

vmovss%xmm1,0x14(%r11,%r9,4)

vmovss0x20(%rax,%rcx,4),%xmm4

vaddss0x20(%rdx,%rcx,4),%xmm4,%xmm1

vmovss%xmm1,0x18(%r11,%r9,4)

vmovss0x28(%rdx,%rcx,4),%xmm4

vaddss0x28(%rax,%rcx,4),%xmm4,%xmm1

vmovss%xmm1,0x1c(%r11,%r9,4)

add$0x4,%r10d

...

8u121,AVX2(Haswell)

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

float[]A,B,C;for(inti=0;i<MAX;i++)A[i]=B[2*i]*C[2*i];VGATHERD*inAVX2,but:• noscaZeroperaAons• onlyfloaAngpointvariants

69

StridedAccess //Mainloop

vmovss0x18(%rax,%rcx,4),%xmm4

vaddss0x18(%rdx,%rcx,4),%xmm4,%xmm1

vmovss%xmm1,0x14(%r11,%r9,4)

vmovss0x20(%rax,%rcx,4),%xmm4

vaddss0x20(%rdx,%rcx,4),%xmm4,%xmm1

vmovss%xmm1,0x18(%r11,%r9,4)

vmovss0x28(%rdx,%rcx,4),%xmm4

vaddss0x28(%rax,%rcx,4),%xmm4,%xmm1

vmovss%xmm1,0x1c(%r11,%r9,4)

add$0x4,%r10d

...

8u121,AVX2(Haswell)

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

WhataboutUnsafe?

70

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

//On-heaplongoff=Unsafe.ARRAY_INT_BASE_OFFSET;for(inti=0;i<MAX;i++){//A[i]=B[i]+C[i]intval=U.getInt(B,off)+U.getInt(C,off);U.putInt(A,off,val);off+=Unsafe.ARRAY_INT_INDEX_SCALE;}

71

WhataboutUnsafe? //Mainloop

...

mov%rdx,%rdi

add%r9,%rdi

mov(%rcx),%r8d

...

add$0x20,%r9

add$0x8,%r10d

cmp%eax,%r10d

jl...

8u121,AVX2(Haswell)

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

//Off-heaplongaddrA=U.allocateMemory(...);longaddrB=U.allocateMemory(...);longaddrC=U.allocateMemory(...);

for(inti=0;i<MAX;i++){longoff=i*4;intval=U.getInt(null,addrB+off)+U.getInt(null,addrC+off);U.putInt(null,addrA+off,val);}

72

WhataboutUnsafe? //Mainloop

mov0x0(%rbp,%rax,1),%r11d

add(%rdi,%rax,1),%r11d

mov%r11d,(%rbx,%rax,1)

add$0x8,%r10d

cmp%r14d,%r10d

jl…

8u121,AVX2(Haswell)

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.| 73

Unsafe==Fast

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.| 74

Unsafe==Fast

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

ReducAons

75

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.| 76

HorizontalAddiAon

+ + + +

xmm1 xmm0

xmm2

VPHADDD%xmm0,%xmm1,%xmm2

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.| 77

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

publicintsum(int[]A){intsum=0;for(inta:A){sum+=a;}returnsum;}

78

//Mainloop

add0x10(%r8,%rcx,4),%eax

add0x14(%r8,%rcx,4),%eax

add0x18(%r8,%rcx,4),%eax

add0x1c(%r8,%rcx,4),%eax

add0x20(%r8,%rcx,4),%eax

add0x24(%r8,%rcx,4),%eax

add0x28(%r8,%rcx,4),%eax

add0x2c(%r8,%rcx,4),%eax

add$0x8,%ecx

cmp%r10d,%ecx

jl…

jdk9-b163,AVX2(Haswell)

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

“ReducAonvectoropAmizaAoncouldbeexpensiveforsimpleexpressionsbecauseitusesseveraladdiAonalinstrucAonspervector.[…]WeneedtorestrictreducAonopAmizaAononlytocaseswhenitisbeneficial.”

8074981:RestrictreducAonopAmizaAon

79

hZps://jbs.oracle.com/browse/JDK-8074981

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

intdotProduct(int[]A,int[]B){intr=0;for(inti=0;i<MAX;i++){r+=A[i]*B[i];}returnr;}

80

//Vectorizedpost-loop

vmovdqu0x10(%rdi,%r11,4),%ymm0

vmovdqu0x10(%rbx,%r11,4),%ymm1

vpmulld%ymm0,%ymm1,%ymm0

vphaddd%ymm0,%ymm0,%ymm3

vphaddd%ymm1,%ymm3,%ymm3

vextracti128$0x1,%ymm3,%xmm1

vpaddd%xmm1,%xmm3,%xmm3

vmovd%eax,%xmm1

vpaddd%xmm3,%xmm1,%xmm1

vmovd%xmm1,%eax

add$0x8,%r11d

cmp%r8d,%r11d

jl0x117e23668

jdk9-b163,AVX2(Haswell)

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

intdotProduct(int[]A,int[]B){intr=0;for(inti=0;i<MAX;i++){r+=A[i]*B[i];}returnr;}

81

//Vectorizedpost-loop

vmovdqu0x10(%rdi,%r11,4),%ymm0

vmovdqu0x10(%rbx,%r11,4),%ymm1

vpmulld%ymm0,%ymm1,%ymm0

vphaddd%ymm0,%ymm0,%ymm3

vphaddd%ymm1,%ymm3,%ymm3

vextracti128$0x1,%ymm3,%xmm1

vpaddd%xmm1,%xmm3,%xmm3

vmovd%eax,%xmm1

vpaddd%xmm3,%xmm1,%xmm1

vmovd%xmm1,%eax

add$0x8,%r11d

cmp%r8d,%r11d

jl0x117e23668

jdk9-b163,AVX2(Haswell)

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.| 82

VPHADDD

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

FusedMulAply-Add(FMA)

83

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

•  SingleInstrucAon–MulApleNestedoperaAons

• Usecases– dotproduct

for(inti=0;i<MAX;i++)r=r+A[i]*B[i];

– matrixmulAplicaAon

for(intk=0;k<MAX;k++)r=r+A[i][k]*B[k][j];

84

FusedOperaAons

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

Mnemonic Operands OperaJon

VFMADDPDyymm,ymm,ymm/m256

a=b*c+d

VFMADDPSyVFMADDPDx

xmm,xmm,xmm/m128VFMADDPSxVFMADDSD xmm,xmm,xmm/m64VFMADDSS xmm,xmm,xmm/m32

85

FMA4(onlyAMD)

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

Mnemonic OperaJon

VFMADD132… a=a*c+bVFMADD213… a=b*a+cVFMADD231… a=b*c+a

86

FMA3(bothIntel&AMD)

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

float[]A,B,C,D=…;

for(inti=0;i<MAX;i++){A[i]=B[i]*C[i]+D[i];}

87

//Vectorizedpost-loop

vmovdqu0x10(%r8,%r11,4),%ymm0

vmulps0x10(%rcx,%r11,4),%ymm0,%ymm0

vaddps0x10(%rax,%r11,4),%ymm0,%ymm0

vmovdqu%ymm0,0x10(%rdx,%r11,4)

add$0x8,%r11d

cmp%r10d,%r11d

jl…

jdk9/hs,AVX2(Haswell)

Vectorized,noFMA.

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

float[]A,B,C,D=…;

for(inti=0;i<MAX;i++){A[i]=Math.fma(B[i],C[i],D[i]);}

88

//Post-loop

vmovss0x10(%r11,%rbx,4),%xmm1

vmovss0x10(%r9,%rbx,4),%xmm0

vmovss0x10(%r8,%rbx,4),%xmm3

vfmadd231ss%xmm1,%xmm0,%xmm3

vmovss%xmm3,0x10(%r10,%rbx,4)

inc%ebx

cmp%edi,%ebx

jl...

jdk9/hs,AVX2(Haswell)Math.fma()//@since9

Notvectorized,scalarFMA.

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

8153998:Maskedvectorpostloops8080325:SuperWordloopunrollinganalysis

8151573:MulAversioningforrangecheckeliminaAon

8135028:supportforvectorizingdoubleprecisionsqrt

8076284:ImprovevectorizaAonofparallelstreams

8139340:SuperWordenhancementtosupportvectorcondiAonalmove(CMovVD)onIntelAVXcpu

89

JDK9:OtherEnhancementsinSuperWord

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

• MaskedvectoroperaAonsif(B[i]>0)A[i]=B[i];

•  ScaZer/GatherA[i]=B[i]+C[D[i]];//indirectaccess

• ConflictDetecAonA[B[i]]++; //histogram

90

WhatelseinAVX-512?

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

for(inti=0;i<MAX;i++)if(B[i]>0){A[i]=B[i];}

91

If-conversion

xmm0

B[i+0]B[i+1]B[i+2]B[i+3]3296128 64

B[i:i+3]>0 k1

B[i+3]A[i+2]B[i+1]A[i+0]int[]

0 01 1

VMOVDQU32%xmm0,%k1,…

A[]

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

for(inti=0;i<MAX;i++)A[i]=B[i]+C[D[i]];“Gatherers”:fetchdataelementsusingvector-indexmemoryaddressing.

92

IndirectAccess

C[D[i]]C[D[i+1]]C[D[i+2]]C[D[i+3]]

xmm0

C[D[i+3]]C[D[i+2]]C[D[i+1]]C[D[i]]int[]C[]

3296128

VGATHERQPD

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

for(inti=0;i<MAX;i++)A[B[i]]++;

93

Histogram

A[B[i]]A[B[i+1]]A[B[i+2]]A[B[i+3]]

xmm03296128

VPGATHERDD

A[B[i+3]]A[B[i+2]]A[B[i+1]]A[B[i]]int[]A[]

A[B[i+3]]+1A[B[i+2]]+1A[B[i+1]]+1A[B[i]]+1int[]A[]

VPSCATTERDD

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

for(inti=0;i<MAX;i++)A[B[i]]++;

94

Histogram

A[B[i]]A[B[i+1]]A[B[i+2]]A[B[i+3]]

xmm03296128

VPGATHERDD

A[B[i+3]]A[B[i+2]]A[B[i+1]]A[B[i]]int[]A[]

A[B[i+3]]+1A[B[i+2]]+1A[B[i+1]]+1A[B[i]]+1int[]A[]

VPSCATTERDD

B[i]==B[j]?

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

for(inti=0;i<MAX;i++)A[B[i]]++;

95

Histogram

A[B[i]]A[B[i+1]]A[B[i+2]]A[B[i+3]]

xmm03296128

VPGATHERDD

A[B[i+3]]A[B[i+2]]A[B[i+1]]A[B[i]]int[]A[]

A[B[i+3]]+1A[B[i+2]]+1A[B[i+1]]+1A[B[i]]+1int[]A[]

VPSCATTERDD

ConflictdetecJon!VPCONFLICTD

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

•  100sofvectorinstrucAonsonx86•  IntelintrinsicinstrucAons

– MMX:~120– SSE:~130– SSE2/3/SSSE3/4.1/4.2:~260– AVX/AVX2:~380

VectorISAExtensions

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

•  1000sofvectorinstrucAonsonx86•  IntelintrinsicinstrucAons

– MMX:~120– SSE:~130– SSE2/3/SSSE3/4.1/4.2:~260– AVX/AVX2:~380– AVX-512:~3800

VectorISAExtensions

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

UTF-8 UTF-16

ASCII(1byte) 0aaaaaaa 000000000aaaaaaaBasicMulAlingualPlane(2or3bytes)

110bbbbb10aaaaaa 00000bbbbbaaaaaa1110cccc10bbbbbb10aaaaaa ccccbbbbbbaaaaaa

SupplementaryPlanes(4bytes)

11110ddd10ddcccc10bbbbbb10aaaaaauuuu=ddddd-1

110110uuuuccccbb110111bbbbaaaaaa

100

UseCase:UTF-8<=>UTF-16

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.| 101

UseCase:UTF-8<=>UTF-16

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

•  SuperwordopAmizaAonscanbeverybriZle– doesn’t(andcan’t)coveralltheusecases

•  Intrinsicsarepointfixes,notgeneral– powerful,lightweight,andflexible– highdevelopmentcosts

•  JNIishardtodevelopandmaintain– interoperabilityoverheadbetweenJava&naAvecode– CPUdispatchingisrequired

JavaandSIMDtoday

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

VectorAPIEmbraceexplicitvectorizaJon

103

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

ProjectPanama

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

SafeHarborStatementThefollowingisintendedtooutlineourgeneralproductdirecAon.ItisintendedforinformaAonpurposesonly,andmaynotbeincorporatedintoanycontract.Itisnotacommitmenttodeliveranymaterial,code,orfuncAonality,andshouldnotberelieduponinmakingpurchasingdecisions.Thedevelopment,release,andAmingofanyfeaturesorfuncAonalitydescribedforOracle’sproductsremainsatthesolediscreAonofOracle.

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

MoAvaAon

Expose data-parallel operations through a cross-platform API

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

Int8Vectorx=...,y=...;//vectorsof8intsInt8Vectorz=x.add(y);//element-wiseaddi4on

107

MoAvaAon

vpaddd%ymm1,%ymm0,%ymm0

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

• MaximallyexpressiveandportableAPI– “principleofleastastonishment”– uniformcoverageoperaAonsanddatatypes– type-safe

• Performant– Highqualityofgeneratedcode– CompeAAvewithexisAngfaciliAesforauto-vectorizaAon

• GracefulperformancedegradaAon– fallbackfor"holes"innaAvearchitectures

108

Goals

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

• DrawAPIproposedbyJohnRose– ImmutableVectortype– parameterizedbyelementtype&size(Vector<E,S>)

• PrototypeImplementaAoninPanama– int,float,long,anddoubleelementssupported

•  Int128Vector,Int256Vector,…– basedonMachineCodeSnippets&“super-longs”(Long2,Long4,Long8)

109

CurrentStatus

Vector<Integer,S256Bit>x=...,y=...;//vectorsof8intsVector<Integer,S256Bit>z=x.add(y);//element-wiseaddition

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

•  java.lang.Long2/Long4/Long8/…– represent128/256/512-bitvalues

•  “well-known“totheJVM– specialtreatmentintheJVM– C2knowshowtomapthevaluestoappropriatevectorregisters

RawVectors

//128-bitvector.public/*value*/finalclassLong2{privatefinallongl1,l2;//FIXMEprivateLong2(){thrownewError();}@HotSpotIntrinsicCandidatepublicstaticnativeLong2make(longlo,longhi);@HotSpotIntrinsicCandidatepublicnativelongextract(inti);@HotSpotIntrinsicCandidatepublicbooleanequals(Long2v){...}...

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

JVMvsHardware:ImpedanceMismatch

size(bits) 8 16 32 64 128 256 512 …

x86regs AL AX EAX RAX XMM0 YMM0 ZMM0 -

JVM B S I J j.l.Long2 j.l.Long4 j.l.Long8 …

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

• OpAmizeawayvectorboxesinthecode– requiredformappingVectorinstancestovectorregistersingeneratedcode– Vector<Integer,S256Bits>=>vectorregister(ymm)onx86/AVX– crucialfordecentperformance

•  EscapeAnalysisinC2– doesn’tcoverallthecases(e.g.,non-trivialcontrolflow)– briZle(dependsoninliningdecisions;easyforausertoleakaninstance)

112

VectorBoxEliminaAon

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

• ValueTypes(ProjectValhalla)fortherescue!– representsuper-longs&typedvectorsasvaluetypes– lettheJIT-compilerdotherest

• MinimalValueTypes,asthefirststep– hZp://cr.openjdk.java.net/~jrose/values/shady-values.html

113

VectorboxeliminaAon

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

ValhallaJVMvsHardware

size(bits) 8 16 32 64 128 256 512 …

x86regs AL AX EAX RAX XMM0 YMM0 ZMM0 -

JVM B S I J j.l.Long2 j.l.Long4 j.l.Long8 …

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

Summary

115

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

•  SIMDISAextensions– veryirregularonx86– hardtouAlizeincross-playormmanner

•  JVM– auto-vectorizaAon

•  briZle•  can’tcoverallthecases

– intrinsics•  pros:powerful,lightweight,andflexible•  cons:pointfixes,highdevelopmentcosts

116

Summary

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

•  JDK9– enhancementsinauto-vectorizaAon

•  parAalAVX-512support,codeshapeimprovementsonx86

– newmethodsandintrinsics• Math.fma(),Arrays.vectorizedMismatch()

• VectorAPI– easy&reliablewaytowriteperformantvectorizedcode– workinprogress!

•  underacAvedevelopmentinProjectPanama

117

Summary:Future

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

SafeHarborStatementTheprecedingisintendedtooutlineourgeneralproductdirecAon.ItisintendedforinformaAonpurposesonly,andmaynotbeincorporatedintoanycontract.Itisnotacommitmenttodeliveranymaterial,code,orfuncAonality,andshouldnotberelieduponinmakingpurchasingdecisions.Thedevelopment,release,andAmingofanyfeaturesorfuncAonalitydescribedforOracle’sproductsremainsatthesolediscreAonofOracle.

118

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

•  Vectorinterface:hZp://cr.openjdk.java.net/~jrose/arrays/vector/Vector.java•  Prototype:hZp://hg.openjdk.java.net/panama/panama/jdk/file/Ap/test/panama/vector-api-patchable

•  MinimalValueTypes:hZp://cr.openjdk.java.net/~jrose/values/shady-values.html

•  Super-longs: –  hZp://hg.openjdk.java.net/panama/panama/jdk/file/0243d8ef6bd1/src/java.base/share/classes/java/lang/Long2.java–  hZp://hg.openjdk.java.net/panama/panama/jdk/file/0243d8ef6bd1/src/java.base/share/classes/java/lang/Long4.java–  hZp://hg.openjdk.java.net/panama/panama/jdk/file/0243d8ef6bd1/src/java.base/share/classes/java/lang/Long8.java

•  MachineCodeSnippets:hZp://cr.openjdk.java.net/~vlivanov/talks/2016_JVMLS_MachineCodeSnippets.pdf

119

VectorAPI:Materials

Copyright©2017,Oracleand/oritsaffiliates.Allrightsreserved.|

[email protected]

@iwan0www