Upload
donga
View
213
Download
0
Embed Size (px)
Citation preview
Fernando Magno Quintão Pereira!
PROGRAMMING LANGUAGES LABORATORY!
UniversidadeFederaldeMinasGerais-DepartmentofComputerScience
PROGRAM ANALYSIS AND OPTIMIZATION – DCC888!
JUST-IN-TIME COMPILATION!
Thematerialintheseslideshavebeentakenmostlyfromtwopapers:"DynamicEliminaConofOverflowTestsinaTraceCompiler",and"Just-in-TimeValueSpecializaCon"
AToyExample
#include <stdio.h>!#include <stdlib.h>!#include <sys/mman.h>!
int main(void) {! char* program;! int (*fnptr)(void);! int a;! program = mmap(NULL, 1000, PROT_EXEC | PROT_READ |! PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);! program[0] = 0xB8;! program[1] = 0x34;! program[2] = 0x12;! program[3] = 0;! program[4] = 0;! program[5] = 0xC3;! fnptr = (int (*)(void)) program;! a = fnptr();! printf("Result = %X\n",a);!}!
1) WhatistheprogramonthelePdoing?
2) WhatisthisAPIallabout?
3) Whatdoesthisprogramhavetodowithajust-in-Cmecompiler?
Just-in-TimeCompilers
• AJITcompilertranslatesaprogramintobinarycodewhilethisprogramisbeingexecuted.
• WecancompileafuncConassoonasitisnecessary.– ThisisGoogle'sV8approach.
• OrwecanfirstinterpretthefuncCon,andaPerwerealizethatthisfuncConishot,wecompileitintobinary.– ThisistheapproachofMozilla'sIonMonkey.
1)When/where/whyarejust-in-Cmecompilersusuallyused?
2)CanaJITcompiledprogramrunfasterthanastaCcallycompiledprogram?
3)WhichfamousJITcompilersdoweknow?
TherearemanyJITcompilersaround
• JavaHotspotisoneofthemostefficientJITcompilersin
usetoday.Itwasreleasedin1999,andhasbeeninuse
sincethen.
• V8istheJavaScriptJITcompilerusedbyGoogleChrome.
• IonMonkeyistheJavaScriptJITcompilerusedbythe
MozillaFirefox.
• LuaJIT(http://luajit.org/)isatracebasedjust-in-CmecompilerthatgeneratescodefortheLua
programminglanguage.
• The.NetframeworkJITsCILcode.
• ForPythonwehavePyPy,whichrunsonCpython.
CanJITscompetewithstaCccompilers?
http://benchmarksgame.alioth.debian.org/!
Howcananinterpretedlanguagerunfasterthanacompiledlanguage?
Tradeoffs
• TherearemanytradeoffsinvolvedintheJITcompilaConofaprogram.
• TheCmetocompileispartofthetotalexecuConCmeoftheprogram.
• WemaybewillingtorunsimplerandfasteropCmizaCons(ornoopCmizaConatall)todiminishthecompilaConoverhead.
• AndwemaytrytolookatrunCmeinformaContoproducebecercodes.– Profilingisabigplayerhere.
• ThesamecodemaybecompiledmanyCmes!
WhywouldwecompilethesamecodemanyCmes?
Example:Mozilla'sIonMonkey
Parser
Optimizer
CodeGenerator
function inc(x) {
return x + 1;
}
JavaScript source code
mov 0x28(%rsp),%r10
shr $0x2f,%r10
cmp $0x1fff1,%r10d
je 0x1b
jmpq 0x6e
mov 0x28(%rsp),%rax
mov %eax,%ecx
mov $0x7f70b72c,%r11
mov (%r11),%rdx
cmp %rdx,%rsp
jbe 0x78
add $0x1,%ecx
jo 0xab
mov $0xfff88000,%rax
retq
Native code
Bytecodes
00000: getarg 0
00003: one
00004: add
00005: return
00006: stop
LIR
label ()
parameter ([x:1 (arg:0)])
parameter ([x:2 (arg:8)])
start ()
unbox ([i:3]) (v2:r)
checkoverrecursed()t=([i:4])
osipoint ()
addi ([i:5 (!)]) (v3:r), (c)
box ([x:6]) (v5:r)
return () (v6:rcx)
resumepoint 4 2 6
parameter -1 Value
resumepoint 4 2 6
parameter 0 Value
constant undef Undefined
start
unbox parameter4 Int32
resumepoint 4 2 6
checkoverrecursed
constant 0x1 Int32
add unbox10 const14 Int32
return add16
MIR
Interpreter
Compiler
IonMonkeyisoneoftheJITcompilersusedbytheFirefoxbrowsertoexecuteJavaScriptprograms.
ThiscompilerisCghtlyintegratedwithSpiderMonkey,theJavaScriptinterpreter.
SpiderMonkeyinvokesIonMonkeytoJITcompileafuncConeitherifitisoPencalled,orifithasaloopthatexecutesforalongCme.
WhydowehavesomanydifferentintermediaterepresentaConshere?
WhentoInvoketheJITCompiler?
• CompilaConhasacost.– FuncConsthatexecuteonlyonce,forafewiteraCons,shouldbeinterpreted.
• Compiledcoderunsfaster.– FuncConsthatarecalledoPen,orthatloopforalongCmeshouldbecompiled.
• AndwemayhavedifferentopCmizaConlevels…
Howtodecidewhentocompileapieceofcode?
Asanexample,SpiderMonkeyusesthreeexecuConmodes:thefirstisinterpretaCon;thenwehavethebaselinecompiler,whichdoesnotopCmizethecode.FinallyIonMonkeykicksin,andproduceshighlyopCmizedcode.
TheCompilaConThreshold
Interpreted Code
Unoptimized Native Code
Optimized Native Code Baseline Compiler
Optimizing Compiler
Number of instructions processed (either natively or via interpretation)
Number of instructions compiled
Tim
e
ManyexecuConenvironmentsassociatecounterswithbranches.Onceacounterreachesagiventhreshold,thatcodeiscompiled.Butdefiningwhichthresholdtouseisverydifficult,e.g.,howtominimizetheareaofthecurvebelow?JITsarecrowdedwithmagicnumbers.
TheMillion-DollarsQuesCon
• WhentoinvoketheJITcompiler?
1) CanyoucomeupwithastrategytoinvoketheJITcompilerthatopCmizesforspeed?
2) DoyouhavetoexecutetheprogramabitbeforecallingtheJIT?
3) HowmuchinformaCondoyouneedtomakeagoodguess?
4) WhatisthepriceyoupayformakingawrongpredicCon?
5) Whichprogramsareeasytopredict?
6) Dotheeasyprogramsreflecttheneedsoftheusers?
UniversidadeFederaldeMinasGerais–DepartmentofComputerScience–ProgrammingLanguagesLaboratory
SPECULATION
SpeculaCon
• AkeytrickusedbyJITcompilersisspecula(on.• Wemayassumethatagivenpropertyistrue,andthen
weproducecodethatcapitalizesonthatspeculaCon.• TherearemanydifferentkindsofspeculaCon,andthey
arealwaysagamble:
– Let'sassumethatthetypeofavariableisaninteger,
• butifwehaveanintegeroverflow…– Let'sassumethattheproperCesoftheobjectare
fixed,• butifweaddorremovesomethingfromtheobject…
– Let'sassumethatthetargetofacallisalwaysthesame,
• butifwepointthefuncConreferencetoanotherclosure…
InlineCaching
• Oneoftheearliest,andmosteffecCve,typesofspecializaConwasinlinecaching,anopCmizaCondevelopedfortheSmalltalkprogramminglanguage♧.
• Smalltalkisadynamicallytypedprogramminglanguage.• Inanutshell,objectsarerepresentedashash-tables.• Thisisaveryflexibleprogrammingmodel:wecanaddor
removeproperCesofobjectsatwill.• LanguagessuchasPythonandRubyalsoimplement
objectsinthisway.• Today,inlinecachingisthekeyideabehindJITs'shigh
performancewhenrunningJavaScriptprograms.
♧:EfficientimplementaConofthesmalltalk-80system,POPL(1984)
VirtualTables
InJava,C++,C#,andotherprogramminglanguagesthatarecompiledstaCcally,objectscontainpointerstovirtualtables.AnymethodcallcanberesolvedaPertwopointerdereferences.Thefirstdereferencefindsthevirtualtableoftheobject.Thesecondfindsthetargetmethod,givenaknownoffset.
classAnimal{publicvoideat(){System.out.println(this+"iseaCng");}publicStringtoString(){return"Animal";}}
classMammalextendsAnimal{publicvoidsuckMilk(){System.out.println(this+"issucking");}publicStringtoString(){return"Mammal";}publicvoideat(){System.out.println(this+"iseaCnglikeamammal");}}
classDogextendsMammal{publicvoidbark(){System.out.println(this+"isbarking");}publicStringtoString(){return"Dog";}publicvoideat(){System.out.println(this+",iseaCnglikeadog");}}
1) Inorderforthistricktowork,weneedtoimposesomerestricConsonvirtualtables.Whichones?
2) Whatarethevirtualtablesof?Animala=newAnimal();Animalm=newMammal();Mammald=newDog()
VirtualTables
classAnimal{publicvoideat(){System.out.println(this+"iseaCng");}publicStringtoString(){return"Animal";}}
classMammalextendsAnimal{publicvoidsuckMilk(){System.out.println(this+"issucking");}publicStringtoString(){return"Mammal";}publicvoideat(){System.out.println(this+"iseaCnglikeamammal");}}
classDogextendsMammal{publicvoidbark(){System.out.println(this+"isbarking");}publicStringtoString(){return"Dog";}publicvoideat(){System.out.println(this+",iseaCnglikeadog");}}
AnimaltoStringeat
MammaltoStringeatsuckMilk
DogtoStringeatsuckMilkbark
Animal a
Animal m
Mammal d
Howtolocatethetargetofd.eat()?
VirtualCall
DogtoStringeatsuckMilkbark
Mammal d
d.eat()First, we need to know the table d is pointing to. This requires one pointer dereference:
Second, we need to know the offset of the method eat, inside the table. This offset is always the same for any class that inherits from Animal, so we can jump blindly.
AnimaltoStringeat
MammaltoStringeatsuckMilk
public void eat() {
System.out.println ("Eats like a dog");
}
public void eat() {
System.out.println ("Eats like a mammal");
}
public void eat() {
System.out.println ("Eats like an animal");
}
Canwehavevirtualtablesinducktypedlanguages?
ObjectsinPython
INT_BITS = 32!
def getIndex(element):! index = element / INT_BITS! offset = element % INT_BITS! bit = 1 << offset! return (index, bit)!
class Set:! def __init__(self, capacity):! self.capacity = capacity! self.vector = range(1+capacity/INT_BITS)! for i in range(len(self.vector)):! self.vector[i] = 0! def add(self, element):! (index, bit) = getIndex(element)! self.vector[index] |= bit!
class ErrorSet(Set):! def checkIndex(self, element):! if (element > self.capacity):! raise IndexError(str(element) + " is out of range.")! def add(self, element):! self.checkIndex(element)! Set.add(self, element)! print element, "successfully added."!
1) WhatistheprogramonthelePdoing?
2) WhatisthecomplexitytolocatethetargetofamethodcallinPython?
3) Whycan'tcallsinPythonbeimplementedasefficientlyascallsinJava?
UsingPythonObjects
def fill(set, b, e, s):! for i in range(b, e, s):! set.add(i)!
s0 = Set(15)!fill(s0, 10, 20, 3)!
s1 = ErrorSet(15)!fill(s1, 10, 14, 3)!
class X:! def __init__(self):! self.a = 0!
fill(X(), 1, 10, 3)!>>> AttributeError: X instance!>>> has no attribute 'add'!
1) WhatdoesthefuncConfilldo?
2) Whydidthethirdcalloffillfailed?
3) Whataretherequirementsthatfillexpectsonitsparameters?
DuckTyping
def fill(set, b, e, s):! for i in range(b, e, s):! set.add(i)!
class Num:! def __init__(self, num):! self.n = num! def add(self, num):! self.n += num! def __str__(self):! return str(self.n)!
n = Num(3)!print n!fill(n, 1, 10, 1)!...!
Dowegetanerrorhere?
DuckTyping
class Num:! def __init__(self, num):! self.n = num! def add(self, num):! self.n += num! def __str__(self):! return str(self.n)!
n = Num(3)!print n!fill(n, 1, 10, 1)!print n!
>>> 3!>>> 48!
Theprogramworksjustfine.Theonlyrequirementthatfillexpectsonitsfirstargumentisthatithasamethodaddthattakestwoparameters.Anyobjectthathasthismethod,andcanreceiveanintegeronthesecondargument,willworkwithfill.Thisiscalledducktyping:ifitsqueakslikeaduck,swimslikeaduck,eatslikeaduck,thenitisaduck!
ThePriceofFlexibility
• Objects,inthesedynamicallytypedlanguages,areforthemostpartimplementedashashtables.– Thatiscool:wecanaddorremovemethodswithoutmuchhardwork.
def fill(set, b, e, s):! for i in range(b, e, s):! set.add(i)!
– Andmindhowmuchcodewecanreuse?
• Butmethodcallsareprecyexpensive.
Mammal d
__in
it__(
self,
cap
)
add(self, elem)del(self, elem)
contains(self, elem)Howcanwemakethesecallscheaper?
MonomorphicInlineCache
class Num:! def __init__(self, n):! self.n = n! def add(self, num):! self.n += num!
def fill(set, b, e, s):! for i in range(b, e, s):! set.add(i)!
n = Num(3)!print n!fill(n, 1, 10, 1)!print n!
>>> 3!>>> 48!
ThefirstCmewegeneratecodeforacall,wecancheckthetargetofthatcall.Weknowthistarget,becausewearegenera(ngcodeatrun(me!
fill(set,b,e,s):foriinrange(b,e,s):ifisinstance(set,Num):set.n+=ielse:add=lookup(set,"add")add(set,i)
CouldyouopCmizethiscodeevenfurtherusingclassiccompilertransformaCons?
InliningontheMethod
• Wecanalsospeculateonthemethodname,insteadofdoingitonthecallingsite:
fill(set,b,e,s):foriinrange(b,e,s):__f_add(set,i)
__f_add(o,e):ifisinstance(o,Num):o.n+=eelse:f=lookup(o,"add")f(o,e)
1) Isthereanyadvantagetothisapproach,whencomparedtoinliningatthecallsite?
2) Isthereanydisadvantage?
3) WhichoneislikelytochangemoreoPen?
fill(set,b,e,s):foriinrange(b,e,s):ifisinstance(set,Num):set.n+=ielse:add=lookup(set,"add")add(set,i)
PolymorphicCalls
• IfthetargetofacallchangesduringtheexecuConoftheprogram,thenwehaveapolymorphiccall.
• Amonomorphicinlinecachewouldhavetobeinvalidated,andwewouldfallbackintotheexpensivequest.
>>> l = [Num(1), Set(1), Num(2), Set(2), Num(3), Set(3)]!>>> for o in l:!... o.add(2)!...!
IsthereanythingwecoulddotoopCmizethiscase?
PolymorphicCalls
• IfthetargetofacallchangesduringtheexecuConoftheprogram,thenwehaveapolymorphiccall.
• Amonomorphicinlinecachewouldhavetobeinvalidated,andwewouldfallbackintotheexpensivequest.
>>> l = [Num(1), Set(1), Num(2), Set(2), Num(3), Set(3)]!>>> for o in l:!... o.add(2)!...!
fill(set,b,e,s):foriinrange(b,e,s):__f_add(set,i)
__f_add(o,e):ifisinstance(o,Num):o.n=eelifisinstance(o,Set):(index,bit)=getIndex(e)o.vector[index]|=bitelse:f=lookup(o,"add")f(o,e)
Woulditnotbebecerjusttohavethecodebelow?
foriinrange(b,e,s):lookup(set,"add")…
TheSpeculaCveNatureofInlineCaching
• Python–aswellasJavaScript,Ruby,Luaandotherverydynamiclanguages–allowstheusertoaddorremovemethodsfromanobject.
• Ifsuchchangesinthelayoutofanobjecthappen,thentherepresentaConofthatobjectmustberecompiled.Inthiscase,weneedtoupdatetheinlinecache.
fromSetimportINT_BITS,getIndex,Set
deferrorAdd(self,element):if(element>self.capacity):raiseIndexError(str(element)+"isoutofrange.")else:(index,bit)=getIndex(element)self.vector[index]|=bitprintelement,"addedsuccessfully!"
Set.add=errorAdds=Set(60)s.errorAdd(59)s.remove(59)
TheBenefitsoftheInlineCache
• Monomorphicinlinecachehit:– 10instrucCons
• PolymorphicInlinecachehit:– 35instrucConsifthereare10types– 60instrucConsifthereare20types
• Inlinecachemiss:1,000–4,000instrucCons.
ThesenumbershavebeenobtainedbyAhnetal.forJavaScript,intheChromeV8compiler♤:
♤:ImprovingJavaScriptPerformancebyDeconstrucCngtheTypeSystem,PLDI(2014)
WhichfactorscouldjusCfythesenumbers?
UniversidadeFederaldeMinasGerais–DepartmentofComputerScience–ProgrammingLanguagesLaboratory
PARAMETERSPECIALIZATION
FuncConsareoPencalledwithsamearguments
• Costaetal.haveinstrumentedtheMozillaFirefoxbrowser,andhaveusedittonavigatethroughthe100mostvisitedwebpagesaccordingtotheAlexaindex♤.
• Theyhavealsoperformedthesetestsonthreepopularbenchmarksuites:– Sunspider1.0,V86.0andKraken1.1
♤: http://www.alexa.com!♧:Just-in-TimeValueSpecializaCon(CGO2013)
Empiricalfinding:mostoftheJavaScriptfuncConsarealwayscalledwiththesamearguments.ThisobservaConhasbeenmadeoveracorpusof26,000funcCons.
!"
!#$"
!#%"
!#&"
!#'"
!#("
$" '" )" $!" $&" $*" $+" %%" %(" %,"
ManyFuncConsareCalledonlyOnce!
Almost50%ofthefuncConshavebeencalledonlyonceduringamockusersession.
MostFuncConsareCalledwithSameArguments
!"
!#$"
!#%"
!#&"
!#'"
!#("
!#)"
$" '" *" $!" $&" $)" $+" %%" %(" %,"
Almost60%ofthefuncConsarecalledwithonlyonesetofargumentsthroughouttheenCrewebsession.
ASimilarPacernintheBenchmarks
!"
!#!$"
!#%"
!#%$"
!#&"
!#&$"
%" '" %%" %'" &%" &'" (%" ('" )%" )'"
!"
!#!$"
!#!%"
!#!&"
!#!'"
(" &" ((" (&" $(" $&" )(" )&" %(" %&" *(" *&" &(" &&" +(" +&" '(" '&" ,(" ,&"(!("
!"
!#$"
!#%"
!#&"
!#'"
$" (" $&" $)" %*" &$" &(" '&" ')" **" +$" +("
!"
!#$"
!#%"
!#&"
!#'"
$" (" $$" $(" %$" %(" &$" &(" '$" '("
!"
!#$%"
!#&"
!#'%"
$" (" $$" $(" )$" )(" &$" &(" '$" '(" %$"
!"
!#$"
!#%"
!#&"
!#'"
!#("
!#)"
$" *" $&" $+" %(" &$" &*" '&" '+" (("
Sunspider 1.0 V8 version 6 Kraken 1.1
Sunspider 1.0 V8 version 6 Kraken 1.1
ThethreeupperhistogramsshowhowoPeneachfuncConiscalled.
ThethreelowerhistogramsshowshowoPeneachfuncConiscalledwiththesamesuiteofarguments.
So,ifitislikethis…
• Let'sassumethatthefuncConisalwayscalledwiththesamearguments.Wegeneratethebestcodespecifictothosearguments.IfthefuncConiscalledwithdifferentarguments,thenwere-compileit,thisCmeusingamoregenericapproach.
function sum(N) {! if(typeof N != 'number') ! return 0;!
var s = 0;! for(var i=0; i<N; i++)! {! if(N % 3 == 0)! s += i*N;! else! s += i;! }! return s;!}!
function sum() {! var s = 0;! for(var i=0; i<59; i++)! {! s += i;! }! return s;!}!
IfweassumethatN=59,thenwecanproducethisspecializedcode:
ARunningExample–LinearSimilaritySearch
function closest(v, q, n) {! if (n == 0) {! throw "Error";! } else {! var i = 0;! var d = 2147483647;! while (i < n) {! var nd = abs(s[i] - q);! if (nd <= d)! d = nd;! i++;! }! return d;! }!}!
Givenavectorvofsizen,holdingintegers,findtheshortestdifferencedfromanyelementinvtoaparameterq.
Wemustiteratethroughv,lookingforthesmallestdifferencev[i] – q.ThisisanO(q)algorithm.
Whatcouldwedoifwewantedtogeneratecodethatisspecifictoagiventuple(v,q,n)?
TheControlFlowGraphoftheExample
function closest(v, q, n) {! if (n == 0) {! throw "Error";! } else {! var i = 0;! var d = 2147483647;! while (i < n) {! var nd = abs(s[i]-q);! if (nd <= d)! d = nd;! i++;! }! return d;! }!}!
Function entry pointv = param[0]
q = param[1]
n = param[2]
if (n == 0) goto L1
L1:
throw "Error"L2: i0 = 0
d0 = 2147483647
L3: i1 =! (i0, i2, i3)
d1 =! (d0, d3, d4)
if (i1 < n) goto L5
L4:
return d1
L5: t0 = 4 * i
t1 = v[t0]
inbounds(t1, n) goto L8
L6: d2 = nd
L9: d3 =! (d1, d2)
i2 = i1 + 1
goto L3
On stack replacementv = param[0]
q = param[1]
n = param[2]
i3 = stack[0]
d4 = stack[0]
L7: nd = abs(t1, q)
if (nd > d1) goto L9
L8:
throw BoundsErr
TheCFGofafuncConJIT-compiledduringinterpretaConhastwoentrypoints.Why?
TheControlFlowGraphoftheExample
function closest(v, q, n) {! if (n == 0) {! throw "Error";! } else {! var i = 0;! var d = 2147483647;! while (i < n) {! var nd = abs(s[i]-q);! if (nd <= d)! d = nd;! i++;! }! return d;! }!}!
Function entry pointv = param[0]
q = param[1]
n = param[2]
if (n == 0) goto L1
L1:
throw "Error"L2: i0 = 0
d0 = 2147483647
L3: i1 =! (i0, i2, i3)
d1 =! (d0, d3, d4)
if (i1 < n) goto L5
L4:
return d1
L5: t0 = 4 * i
t1 = v[t0]
inbounds(t1, n) goto L8
L6: d2 = nd
L9: d3 =! (d1, d2)
i2 = i1 + 1
goto L3
On stack replacementv = param[0]
q = param[1]
n = param[2]
i3 = stack[0]
d4 = stack[0]
L7: nd = abs(t1, q)
if (nd > d1) goto L9
L8:
throw BoundsErr
ThisCFGhastwoentries.ThefirstmarkstheordinarystarCngpointofthefuncCon.Thesecondexistsifwecompiledthisbinarywhileitwasbeinginterpreted.ThisentryisthepointwheretheinterpreterwaswhentheJITwasinvoked.
v
q
n
load[0]
42
100i
d
heapstack
2147483647
40
Function entry pointv = load[0]
q = 42
n = 100
if (n == 0) goto L1
L1:
throw "Error"
L2: i0 = 0
d0 = 2147483647
L3: i1 =! (i0, i2, i3)
d1 =! (d0, d3, d4)
if (i1 < n) goto L5
L4:
return d1
L5: t0 = 4 * i
t1 = v[t0]
inbounds(t1, n) goto L8
L6: d2 = nd
L9: d3 =! (d1, d2)
i2 = i1 + 1
goto L3
On stack replacementv = load[0]
q = 42
n = 100
i3 = 40
d4 = 2147483647
L7: nd = abs(t1, q)
if (nd > d1) goto L9
L8:
throw BoundsErr
IdenCfyingtheParametersBecausewehavetwoentrypoints,therearetwoplacesfromwherewecanreadarguments.
Tofindthevaluesofthesearguments,wejustaskthemtotheinterpreter.WecanaskthisquesConbyinspecCngtheinterpreter'sstackattheverymomentwhentheJITcompilerwasinvoked.
DeadCodeEliminaCon(DeadCodeAvoidance?)
Function entry pointv = load[0]
q = 42
n = 100
if (n == 0) goto L1
L1:
throw "Error"
L2: i0 = 0
d0 = 2147483647
L3: i1 =! (i0, i2, i3)
d1 =! (d0, d3, d4)
if (i1 < n) goto L5
L4:
return d1
L5: t0 = 4 * i
t1 = v[t0]
inbounds(t1, n) goto L8
L6: d2 = nd
L9: d3 =! (d1, d2)
i2 = i1 + 1
goto L3
On stack replacementv = load[0]
q = 42
n = 100
i3 = 40
d4 = 2147483647
L7: nd = abs(t1, q)
if (nd > d1) goto L9
L8:
throw BoundsErr
IfweassumethatthefuncConwillnotbecalledagain,wedonotneedtogeneratethefuncCon'sentryblock.
Wecanalsoremoveanyotherblocksthatarenotreachablefromthe"On-Stack-Replacement"block.
Function entry pointv = load[0]
q = 42
n = 100
if (n == 0) goto L1
L1:
throw "Error"
L2: i0 = 0
d0 = 2147483647
L3: i1 =! (i0, i2, i3)
d1 =! (d0, d3, d4)
if (i1 < n) goto L5
L4:
return d1
L5: t0 = 4 * i
t1 = v[t0]
inbounds(t1, n) goto L8
L6: d2 = nd
L9: d3 =! (d1, d2)
i2 = i1 + 1
goto L3
On stack replacementv = load[0]
q = 42
n = 100
i3 = 40
d4 = 2147483647
L7: nd = abs(t1, q)
if (nd > d1) goto L9
L8:
throw BoundsErr
ArrayBoundsCheckingEliminaCon
L3: i1 =! (i2, i3)
d1 =! (d3, d4)
if (i1 < n) goto L5
L4:
return d3
L5: t0 = 4 * i
t1 = v[t0]
inbounds(t1, n) goto L8
L6: d2 = nd
L9: d3 =! (d1, d2)
i2 = i1 + 1
goto L3
L7: nd = abs(t1, q)
if (nd > d1) goto L9
L8:
throw BoundsErr
On stack replacementv = load[0]
q = 42
n = 100
i3 = 40
d4 = 2147483647
InJavaScript,everyarrayaccessisguardedagainstout-of-boundsaccesses.Wecaneliminatesomeoftheseguards.
Let'sjustcheckthepacerni = start; i < n; i++.Thispacerniseasytospot,anditalreadycatchesmanyinducConvariables.Therefore,ifwehavea[i],andn ≤ a.length,thenwecaneliminatetheguard.
L3: i1 =! (i2, i3)
d1 =! (d3, d4)
if (i1 < n) goto L7
L4:
return d3
L6: d2 = nd
L9: d3 =! (d1, d2)
i2 = i1 + 1
goto L3
On stack replacementv = load[0]
q = 42
n = 100
i3 = 40
d4 = 2147483647
L7: t0 = 4 * i
t1 = v[t0]
nd = abs(t1, q)
if (nd > d1) goto L9
LoopInversion
L3: i1 =! (i2, i3)
d1 =! (d3, d4)
if (i1 < n) goto L5
L4:
return d3
L5: t0 = 4 * i
t1 = v[t0]
inbounds(t1, n) goto L8
L6: d2 = nd
L9: d3 =! (d1, d2)
i2 = i1 + 1
goto L3
L7: nd = abs(t1, q)
if (nd > d1) goto L9
L8:
throw BoundsErr
On stack replacementv = load[0]
q = 42
n = 100
i3 = 40
d4 = 2147483647
Loopinversionconsistsinchangingawhileloopintoado-while:
while(cond){...}
if(cond){do{…}while(cond);}
1) WhydoweneedthesurroundingcondiConal?
2) Whichinfodoweneed,toeliminatethiscondiConal?
LoopInversion
while (i < n) {! var nd = abs(s[i] - q);! if (nd <= d)! d = nd;! i++;!}!
if (i < n) {! do {! var nd = abs(s[i] - q);! if (nd <= d)! d = nd;! i++;! } while (i < n)!}!
ThestandardloopinversiontransformaConrequiresustosurroundthenewrepeatloopwithacondiGonal.However,wecanevaluatethiscondiConal,ifweknowthevalueofthevariablesthatarebeingtested,inthiscase,iandn.Hence,wecanavoidthecostofcreaCngthiscode.
Otherwise,ifwegeneratecodeforthecondiConal,thenasubsequentsparsecondi(onalconstantpropaga(onphasemightsCllbeabletoremovethesurrounding'if'.
ConstantPropagaCon+DeadCodeEliminaCon
L3: i1 =! (i2, i3)
d1 =! (d3, d4)
if (i1 < n) goto L7
L4:
return d3
L6: d2 = nd
L9: d3 =! (d1, d2)
i2 = i1 + 1
goto L3
On stack replacementv = load[0]
q = 42
n = 100
i3 = 40
d4 = 2147483647
L7: t0 = 4 * i
t1 = v[t0]
nd = abs(t1, q)
if (nd > d1) goto L9
L3: i1 =! (i2, i3)
d1 =! (d3, d4)
L4:
return d3L6: d2 = nd
L7: d3 =! (d1, d2)
i2 = i1 + 1
if (i2 > n) goto L4
On stack replacementv = load[0]
q = 42
n = 100
i3 = 40
d4 = 2147483647
L7: t0 = 4 * i
t1 = v[t0]
nd = abs(t1, q)
if (nd > d1) goto L7
WecloseoursuiteofopCmizaConswithconstantpropagaCon,followedbyanewdead-code-eliminaConphase.
ThisispossiblytheopCmizaConthatgivesusthelargestprofit.
ItisoPenthecasethatwecaneliminatesomecondiConalteststoo,aswecanevaluatethecondiConsatcodegeneraConCme.
TheSpecializedCode
• Thespecializedcodeismuchshorterthantheoriginalcode.
• ThiscodesizereducConalsoimprovescompilaConCme.InparCcular,itspeedsupregisterallocaCon.
L3: i1 =! (i2, i3)
d1 =! (d3, d4)
L4:
return d3L6: d2 = nd
L7: d3 =! (d1, d2)
i2 = i1 + 1
if (i2 > 100) goto L4
On stack replacementv = load[0]
i3 = 40
d4 = 2147483647
L7: t0 = 4 * i
t1 = v[t0]
nd = abs(t1, 42)
if (nd > d1) goto L7
Function entry pointv = param[0]
q = param[1]
n = param[2]
if (n == 0) goto L1
L1:
throw "Error"L2: i0 = 0
d0 = 2147483647
L3: i1 =! (i0, i2, i3)
d1 =! (d0, d3, d4)
if (i1 < n) goto L5
L4:
return d1
L5: t0 = 4 * i
t1 = v[t0]
inbounds(t1, n) goto L8
L6: d2 = nd
L9: d3 =! (d1, d2)
i2 = i1 + 1
goto L3
On stack replacementv = param[0]
q = param[1]
n = param[2]
i3 = stack[0]
d4 = stack[0]
L7: nd = abs(t1, q)
if (nd > d1) goto L9
L8:
throw BoundsErr
ArethereotheropCmizaConsthatcouldbeappliedinthiscontextofrunCmespecializaCon?
FuncConCallInlining
function inc(x) {
return x + 1;
}
function map(s, b, n, f) {
var i = b;
while (i < n) {
s[i] = f(s[i]);
i++;
}
return s;
}
print (map(new Array(1, 2, 3, 4, 5), 2, 5, inc));
1)Whichcallcouldweinlineinthisexample?
2)HowwouldtheresulCngcodelooklike?
FuncConCallInlining
function inc(x) {
return x + 1;
}
function map(s, b, n, f) {
var i = b;
while (i < n) {
s[i] = f(s[i]);
i++;
}
return s;
}
print (map(new Array(1, 2, 3, 4, 5), 2, 5, inc));
function map(s, b, n, f) {! var i = b;! while (i < n) {! s[i] = s[i] + 1;! i++;! }! return s;!}!
Wecanalsoperformloopunrollinginthisexample.HowwouldbetheresulCngcode?
LoopUnrolling
function map(s, b, n, f) {! var i = b;! while (i < n) {! s[i] = s[i] + 1;! i++;! }! return s;!}!print (map(new Array(1, 2, 3, 4, 5), 2, 5, inc));!
function map(s, b, n, f) {! s[2] = s[2] + 1;! s[3] = s[3] + 1;! s[4] = s[4] + 1;! return s;!}!
Loopunrolling,inthiscase,isveryeffecCve,asweknowtheiniCalandfinallimitsoftheloop.Inthiscase,wedonotneedtoinsertprologandepiloguecodestopreservethesemanCcsoftheprogram.
UniversidadeFederaldeMinasGerais–DepartmentofComputerScience–ProgrammingLanguagesLaboratory
TRACECOMPILATION
WhatisaJITtracecompiler?
• Atrace-basedJITcompilertranslatesonlythemostexecutedpathsintheprogram’scontrolflowtomachinecode.
• Atraceisalinearsequenceofcode,thatrepresentsahotpathintheprogram.
• Twoormoretracescanbecombinedintoatree.
• ExecuConalternatesbetweentracesandinterpreter.
WhataretheadvantagesanddisadvantagesoftracecompilaConovertradiConalmethodcompilaCon?
Theanatomyofatracecompiler
• TraceMonkeyisthetracebasedJITcompilerusedintheMozillaFirefoxBrowser.
file.js AST BytecodesTraceengine
LIR x86
jsparser jsemicer jsinterpreter JIT
SpiderMonkey nanojit
Fromsourcetobytecodes
function foo(n){ var sum = 0; for(i = 0; i < n; i++){ sum+=i; } return sum; }
00:getnamen02:setlocal006:zero07:setlocal111:zero12:setlocal216:goto35(19)19:trace20:getlocal123:getlocal226:add27:setlocal131:localinc235:getlocal238:getlocal041:lt42:ifne19(-23)45:getlocal148:return49:stop
ASTjsparser jsemitter
Canyouseethecorrespondencebetweensourcecodeandbytecodes?
00:getnamen02:setlocal006:zero07:setlocal111:zero12:setlocal216:goto35(19)19:trace20:getlocal123:getlocal226:add27:setlocal131:localinc235:getlocal238:getlocal041:lt42:ifne19(-23)45:getlocal148:return49:stop
Thetraceenginekicksin
• TraceMonkeyinterpretsthebytecodes.• Oncealoopisfound,itmaydecidetoask
Nanojittotransformitintomachinecode(e.g,x86,ARM).– NanojitreadsLIRandproducesx86
• Hence,TraceMonkeymustconvertthistraceofbytecodesintoLIR
00:getnamen02:setlocal006:zero07:setlocal111:zero12:setlocal216:goto35(19)19:trace20:getlocal123:getlocal226:add27:setlocal131:localinc235:getlocal238:getlocal041:lt42:ifne19(-23)45:getlocal148:return49:stop
L: load “i” %r1 load “sum” %r2 add %r1 %r2 %r1 %p0 = ovf() bra %p0 Exit1 store %r1 “sum” inc %r2 store %r2 “i” %p0 = ovf() bra %p0 Exit2 load “i” %r0 load “n” %r1 lt %p0 %r0 %r1 bne %p0 L
Bytecodes
Nanojit LIR
FrombytecodestoLIR
Whydowehaveoverflowtests?
• ManyscripCnglanguagesrepresentnumbersasfloaCng-pointvalues.– ArithmeCcoperaConsarenotveryefficient.
• ThecompilersomeCmesisabletoinferthatthesenumberscanbeusedasintegers.– ButfloaCng-pointnumbersarelargerthanintegers.– ThisisanotherexampleofspeculaCveopCmizaCon.
– Thus,everyarithmeCcoperaConthatmightcauseanoverflowmustbeprecededbyatest.Ifthetestfails,thentherunCmeenginemustchangethenumber'stypebacktofloaCng-point.
L:movl -32(%ebp), %eax movl %eax, -20(%ebp) movl -28(%ebp), %eax movl %eax, -16(%ebp) movl -20(%ebp), %edx leal -16(%ebp), %eax addl %edx, (%eax) call _ovf testl %eax, %eax jne Exit1 movl -16(%ebp), %eax movl %eax, -28(%ebp) leal -20(%ebp), %eax incl (%eax) call _ovf testl %eax, %eax jne Exit2 movl -24(%ebp), %eax movl %eax, -12(%ebp) movl -20(%ebp), %eax cmpl -12(%ebp), %eax jl L
The overflow tests are also translated into machine code.
FromLIRtoassembly
Nanojit LIR
x86 Assembly
L: load “i” %r1 load “sum” %r2 add %r1 %r2 %r1 %p0 = ovf() bra %p0 Exit1 store %r1 “sum” inc %r2 store %r2 “i” %p0 = ovf() bra %p0 Exit2 load “i” %r0 load “n” %r1 lt %p0 %r0 %r1 bne %p0 L
CanyoucomeupwithanopCmizaContoeliminatesomeoftheoverflowchecks?
Howtoeliminatetheredundanttests
• Weuserangeanalysis:– FindtherangeofintegervaluesthatavariablemightholdduringtheexecuConofthetrace.
function foo(n){ var sum = 0; for(i = 0; i < n; i++){ sum+=i; } return sum; }
Example:ifweknowthatnis10,andiisalwayslessthann,thenwewillneverhaveanoverflowifweadd1toi.
CheaCngatrunCme
• AstaGcanalysismustbeveryconservaCve:ifwedonotknowforsurethevalueofn,thenwemustassumethatitmaybeanywherein[-∞,+∞].
• However,wearenotastaGcanalysis!
• WearecompilingatrunCme!
• Toknowthevalueofn,justasktheinterpreter.
function foo(n){ var sum = 0; for(i = 0; i < n; i++){ sum+=i; } return sum; }
Howthealgorithmworks
• Createaconstraintgraph.– WhilethetraceistranslatedtoLIR.
• Propagaterangeintervals.– BeforesendingtheLIRtoNanojit.– UsinginfiniteprecisionarithmeCc.
• Eliminatetestswheneveritissafetodoso.– WetellNanojitthatcodeforsomeoverflowtestsshouldnotbeproduced.
Nx
lt
inc
leq gt geq eq
!
Name
ArithmeCc
RelaConal
Assignment
Theconstraintgraph
• WehavefourcategoriesofverCces:
dec add sub mul
Buildingtheconstraintgraph
• WestartbuildingtheconstraintgraphonceTraceMonkeystartsrecordingatrace.
• TraceMonkeystartsatthebranchinstrucCon,whichisthefirstinstrucConvisitedintheloop.– Althoughitisattheendofthetrace.
19:trace20:getlocalsum23:getlocali26:add27:setlocalsum31:localinci35:getlocaln38:getlocali41:lt42:ifne19(-23)
IntermsofcodegeneraCon,canyourecognizethepacernofbytecodescreatedforthetestifn<igotoL?
19:trace20:getlocalsum23:getlocali26:add27:setlocalsum31:localinci35:getlocaln38:getlocali41:lt42:ifne19(-23)
n0
[10,10]
We check the interpreter’s stack to find out that n is 10.
19:trace20:getlocalsum23:getlocali26:add27:setlocalsum31:localinci35:getlocaln38:getlocali41:lt42:ifne19(-23)
n0
[10,10]
i0
[1,1]
i started holding 0, but at this point, its value is already 1.
WhywedonotiniCalizeiwith0inourconstraintgraph?
19:trace20:getlocalsum23:getlocali26:add27:setlocalsum31:localinci35:getlocaln38:getlocali41:lt42:ifne19(-23)
n0
[10,10]
lt
i0
[1,1]
n1 i1
Comparisons are great! We can learn new information about variables
Now we know that
i is less than 10, so
we rename it to i1
we also rename n
WearerenamingaPercondiConals.WhichprogramrepresentaConarewecreaCngdynamically?
19:trace20:getlocalsum23:getlocali26:add27:setlocalsum31:localinci35:getlocaln38:getlocali41:lt42:ifne19(-23)
n0
[10,10]
lt
i0
[1,1]
n1 i1
sum0
[1,1]
The current value of sum on the interpreter's stack is 1.
19:trace20:getlocalsum23:getlocali26:add27:setlocalsum31:localinci35:getlocaln38:getlocali41:lt42:ifne19(-23)
n0
[10,10]
lt
i0
[1,1]
n1 i1
addsum0
sum1
[1,1]
For the sum, we get the last version of variable i, which is i1.
Whydowepairupsum0withi1,andnotwithi0?
19:trace20:getlocalsum23:getlocali26:add27:setlocalsum31:localinci35:getlocaln38:getlocali41:lt42:ifne19(-23)
n0
[10,10]
lt
i0
[1,1]
n1 i1 inc
i2addsum0
sum1
[1,1]
The last version of
variable i is still i1
so we use it as a
parameter of the
increment.
n0
[10,10]
lt
i0
[1,1]
n1 i1 inc
i2addsum0
sum1
[1,1]
Some variables, like n, have not been updated inside the trace. We know that they are constants.
The comparisons allows us to infer
bounds on the variables that are
updated in the trace
Other variables, like sum, have been updated inside the trace.
n0
[10,10]
lt
i0
[-∞,+∞]
n1 i1 inc
i2addsum0
sum1
ThenextphaseofouralgorithmisthepropagaConofrangeintervals.
WestartbyassigningconservaGvei.e,[-∞,+∞],boundstotherangesofvariablesupdatedinsidethetrace.
[-∞,+∞]
n0
[10,10]
lt
i0
[-∞,+∞]
n1 i1 inc
i2addsum0
sum1
[-∞,+∞]
[10,10] [-∞,9]
We propagate ranges
through arithmetic
and relational nodes
in topological order.
No need to sort the graph topologically. Follow the order in which nodes are created.
19:trace20:getlocalsum23:getlocali26:add27:setlocalsum31:localinci35:getlocaln38:getlocali41:lt42:ifne19(-23)
1
2
3
WhydoestheorderofcreaConofnodesgivesusthetopologicalorderingoftheconstraintgraph?
n0
[10,10]
lt
i0
[-∞,+∞]
n1 i1 inc
i2addsum0
sum1
[-∞,+∞]
[10,10] [-∞,9]
[-∞,+∞]
WearedoingabstractinterpretaCon.WeneedanabstractsemanCcsforeveryoperaConinourconstraintgraph.Asanexample,weknowthat[-∞,+∞]+[-∞,9]=[-∞,+∞]
n0
[10,10]
lt
i0
[-∞,+∞]
n1 i1 inc
i2addsum0
sum1
[-∞,+∞]
[10,10] [-∞,9]
[-∞,+∞]
[-∞,10]
APerrangepropagaConwecancheckwhichoverflowtestsarereallynecessary.
1) Howmanyoverflowchecksdowehaveinthisconstraintgraph?
2) Isthereanyoverflowcheckthatisredundant?
n0
[10,10]
lt
i0
[-∞,+∞]
n1 i1 inc
i2addsum0
sum1
[-∞,+∞]
[10,10] [-∞,9]
[-∞,+∞]
[-∞,10]
This increment operation will never
cause an overflow,
for it receives at
most the integer 9.
The add operation might cause an overflow, for one of the inputs – sum – might be very large.
APerrangepropagaConwecancheckwhichoverflowtestsarereallynecessary.
n0
[10,10]
lt
i0
[-∞,+∞]
n1 i1 inc
i2addsum0
sum1
[-∞,+∞]
[10,10] [-∞,9]
[-∞,+∞]
[-∞,10]
L: load “i” %r1 load “sum” %r2 add %r1 %r2 %r1 %p0 = ovf() bra %p0 Exit1 store %r1 “sum” inc %r2 store %r2 “i” %p0 = ovf() bra %p0 Exit2 load “i” %r0 load “n” %r1 lt %p0 %r0 %r1 bne %p0 L
✓
✗
ABitofHistory
• Acomprehensivesurveyonjust-in-CmecompilaConisgivenbyJohnAycock.
• ClassInferencewasproposedbyChambersandUngar.• TheideaofParameterSpecializaConwasproposedby
Costaetal.• TheoverfloweliminaContechniquewastakenfroma
paperbySoletal.
• Chambers,C.,andUngar,D.,"CustomizaCon:opCmizingcompilertechnologyforSELF,adynamically-typedobject-orientedprogramminglanguage",SIGPLAN,p146-160(1989)
• Aycock,J.,"ABriefHistoryofJust-In-Time",ACMCompuCngSurveys,p97-113(2003)
• Sousa,R.,Guillon,C.,Pereira,F.andBigonha,M."DynamicEliminaConofOverflowTestsinaTraceCompiler",CC,p2-21(2011)
• Costa,I.,Oliveira,P.,Santos,H.,andPereira,F."Just-in-TimeValueSpecializaCon",CGO,(2013)