Bandit Learning for NMT Hyperparameter Search
Kevin Duh
May 2018 discussion
Speeding up Hyperparameter Search
Given budget constraints, how do we decide which runs to kill before convergence?
K-arm bandit problem
- Each run/model is an arm
- Each time we pull an arm, we train the model by one step
- Which arm should we pull first?
Simulation
• WNMT 2018 DE-EN data (4M sentences)
• Run K models to convergence. Check if bandit learning can choose correctly.
• Seq2Seq hyperparameters:
– Varied = BPE: 10k, 30k, 50k; embedding size: 100, 300, 500; RNN hidden size: 100, 300, 500; #layers: 1, 2; dropout: 0.0-0.4
– Fixed = default optimizer, learning-rate scheduler; checkpoint frequency: 10k; batch size: 128
– Each checkpoint = 1 unit of budget
Epsilon-Greedy Algorithm
• For each turn until budget runs out:
– Draw x from random_uniform(0,1)
– If x < epsilon (e.g. 0.1):
• Pull a random arm
– Else:
• Pull the best arm: k’ = argmax_k value[k]
• Update value[k’] = latest BLEU (or average so far)
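A minimal Python sketch of this loop, assuming a hypothetical helper train_one_checkpoint(k) that advances run k by one checkpoint and returns its latest validation BLEU (not the actual implementation used in these experiments):

```python
import random

def epsilon_greedy(num_arms, budget, train_one_checkpoint, epsilon=0.1):
    """Epsilon-greedy over partially trained NMT runs (illustrative sketch)."""
    value = [0.0] * num_arms          # latest BLEU observed for each arm
    for _ in range(budget):
        if random.random() < epsilon:
            k = random.randrange(num_arms)                    # explore: random arm
        else:
            k = max(range(num_arms), key=lambda a: value[a])  # exploit: current best arm
        value[k] = train_one_checkpoint(k)   # pull arm k: train one more checkpoint
    return value
```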
Epsilon-Greedy tends to explore only the models that are good initially. OK here, but risky. (Budget = 40)
Upper Confidence Bound (UCB)
• Idea: there is more uncertainty on arms pulled less often, so favor them.
• For each turn until budget runs out:
– For each arm k:
• Compute bound[k] = sqrt(2 log(total_count) / count[k])
– Pull the best arm: k’ = argmax_k value[k] + bound[k]
– Update value[k’] = latest BLEU (or average so far)
– Increment count[k’] += 1; total_count += 1
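A corresponding UCB1-style sketch, again assuming the hypothetical train_one_checkpoint(k) helper; each arm is pulled once up front so its count is non-zero before the bound is computed:

```python
import math

def ucb(num_arms, budget, train_one_checkpoint):
    """UCB1-style arm selection over NMT runs (illustrative sketch)."""
    value = [0.0] * num_arms   # latest BLEU per arm
    count = [0] * num_arms     # number of pulls per arm
    total_count = 0
    # Pull every arm once so count[k] > 0 before computing bounds.
    for k in range(num_arms):
        value[k] = train_one_checkpoint(k)
        count[k] += 1
        total_count += 1
    for _ in range(budget - num_arms):
        # Uncertainty bonus is larger for arms pulled less often.
        bound = [math.sqrt(2 * math.log(total_count) / count[k]) for k in range(num_arms)]
        k = max(range(num_arms), key=lambda a: value[a] + bound[a])
        value[k] = train_one_checkpoint(k)
        count[k] += 1
        total_count += 1
    return value, count
```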
The bound is too large in practice: UCB uniformly explores all arms. (Budget = 40)
Hyperband / Successive Halving
Li et al. 2016. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization
• Previously: choosing 1 arm is too risky, and values aren’t fairly comparable across steps
• Idea: choose half of the population at each turn
• L = list(Arms)
• For each turn until budget runs out:
– Pull each arm k in L; update value[k] = current BLEU
– S = [arms sorted by value]
– L = top half of S
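A compact sketch of successive halving under the same assumed train_one_checkpoint(k) helper; each round trains every surviving arm for one more checkpoint, then keeps the top half by current BLEU:

```python
def successive_halving(arms, budget, train_one_checkpoint):
    """Successive halving over NMT runs (illustrative sketch).

    arms: list of run identifiers; train_one_checkpoint(k) trains run k for
    one more checkpoint and returns its current validation BLEU.
    """
    value = {k: 0.0 for k in arms}
    survivors = list(arms)
    spent = 0
    while len(survivors) > 1 and spent + len(survivors) <= budget:
        for k in survivors:                    # pull every surviving arm once
            value[k] = train_one_checkpoint(k)
            spent += 1
        ranked = sorted(survivors, key=lambda k: value[k], reverse=True)
        survivors = ranked[: max(1, len(ranked) // 2)]   # keep the top half
    return survivors, value
```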
Promising arms train successively longer under Successive Halving. (Budget = 40)
16 arms. Successive Halving with Budget = 96.
Considerations / Discussions
• Next steps:
– Include multiple objectives via Pareto ranks
– Make this more practical: implement successive halving as an inner loop within evolutionary search
• Algorithmic questions:
– Fixed optimizer & learning-rate scheduler
– New run vs. old run
Considerations / Discussions
• Implementation questions:
– Continuing a finished run in Sockeye:
• --params, --source-vocab, --target-vocab
• Different datasets? E.g. smaller datasets
• Sockeye.prepared_data?
– Measurements:
• Time: CPU decoding vs. GPU decoding
• Accuracy: validation BLEU vs. train perplexity
Suggestions from Michael
• Try vastly different architectures and repeat the simulation
• Try different optimizers:
– ADAM for large data: long runs, looking at training perplexity, decreasing the learning rate slowly by 0.9
– EVE for small data
– NADAM doesn’t work, but interesting to try
• CPU decoder: look at the WNMT’18 Docker for an MKL version that is more performant
Suggestions for Rudolphe
• Initial condition: maybe I need to train each arm longer initially before starting the K-arm bandits
• But what if I have too many arms K?
June 2018 discussion
TED DE-EN – different optimizers
{adadelta, adagrad, adam, eve, nadam, rmsprop, sgd} x initial learning rate = {0.0002, 0.001}
batch_size = 4096, schedule = plateau-reduce, learning_rate_reduce_factor = 0.7, loss = "cross-entropy", checkpoint = 750 updates (~1 epoch)
Total resource = 56
Same as last slide, but evaluate at intervals of 10 checkpoints (750 x 10 updates)
Total resource = 280
Increasing resource usage → safer
Run         Initial learning rate   Validation perplexity   Validation BLEU
Adadelta1   0.0002                  19.17                   23.27
Adadelta2   0.001                   17.24                   25.02
Adam1b      0.0002                  20.68                   24.53
Adam1       0.0002                  20.41                   24.25
Adam2b      0.001                   17.18                   25.68
Adam2       0.001                   19.14                   25.24
Eve1        0.0002                  14.83                   27.39
Eve2        0.001                   40.06                   12.59
Nadam1      0.0002                  20.89                   24.24
Nadam2      0.001                   15.43                   26.97
RMSprop1    0.0002                  19.08                   24.55
RMSprop2    0.001                   16.10                   27.00
adagrad     0.001, 0.0002           411, 195                -
sgd         -                       4171, 662               -
WMT ZH-EN – different architectures
WMT RU-EN – different architectures
Suggestions from Michael
• Different batch sizes
• Different scheduler (sqrt)
• Mix architectures (& different encoder/decoder depths & layer sizes)
• BLEU: how close to the best model, i.e. can I get within 0.2 BLEU of the best model with 10% of the resources?
July 2018 discussion
More experiments to verify Hyperband’s robustness
• Motivation:
– Previous Hyperband results were promising, but we want to test on more diverse (e.g. crisscrossing) learning curves
• This month:
– Curriculum learning experiments
Curriculum Learning
• Hunch:
– Start by training on easy samples
– As the model improves, add in harder samples
– Maybe the model will converge faster? Or reach better BLEU?
• Sockeye implementation (at MTMA):
– Easy/hard samples are assigned to different shards
– Schedule which shard is visible to the trainer at what time (a sketch of such a schedule is below)
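A minimal sketch of such a visibility schedule, assuming shards are pre-sorted from easy to hard and one more shard is revealed every curriculum_update_freq updates (illustrative only, not the actual Sockeye code):

```python
def visible_shards(update, shards_easy_to_hard, curriculum_update_freq=1000):
    """Shards the trainer may sample from after `update` training updates.

    Illustrative: one additional (harder) shard becomes visible every
    curriculum_update_freq updates; eventually all shards are visible and
    batches are drawn from the full data.
    """
    num_visible = 1 + update // curriculum_update_freq
    return shards_easy_to_hard[: min(num_visible, len(shards_easy_to_hard))]

# Example with 5 shards and a new shard every 1000 updates:
shards = ["very_easy", "easy", "mid_level", "hard", "very_hard"]
assert visible_shards(0, shards) == ["very_easy"]
assert visible_shards(2500, shards) == ["very_easy", "easy", "mid_level"]
assert visible_shards(10000, shards) == shards   # all data, random batches
```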
Curriculum Learning – Visualization
[Figure: shards ordered by difficulty (Very Easy, Easy, Mid-Level, Hard, Very Hard) against training time (updates). Start with the easy shard; gradually add harder shards at the curriculum update frequency (e.g. every 1000 updates). Once all shards are visible (i.e. available), the trainer sees all the data and gets random batches.]
Curriculum Learning – many variants to get different learning curves
• Different schedules, e.g.
• Different definitions of easy/hard (see the sketch after this list):
– Sentence length
– Vocabulary frequency
– Force-decode / 1-best score of an existing model
• Different curriculum update frequencies
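A hedged sketch of two of these difficulty criteria (sentence length and vocabulary frequency); the helper names are hypothetical and the actual experiments may score and bin sentences differently:

```python
from collections import Counter

def length_difficulty(sentence):
    """Sentence-length criterion: longer sentences count as harder."""
    return len(sentence.split())

def vocab_difficulty(sentence, word_counts):
    """Vocabulary-frequency criterion: sentences containing rarer words count
    as harder (score = negative count of the rarest word in the sentence)."""
    counts = [word_counts[w] for w in sentence.split()]
    return -min(counts) if counts else 0

# Toy usage: rank a corpus from easy to hard by vocabulary frequency.
corpus = ["the cat sat on the mat", "the aardvark prevaricated"]
word_counts = Counter(w for s in corpus for w in s.split())
easy_to_hard = sorted(corpus, key=lambda s: vocab_difficulty(s, word_counts))
```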
Setup
• Data: De-En TED
• Preparation:
– 100 training runs with different curriculum learning settings
– Randomly draw 16 runs each time and observe Hyperband results
• Question:
– Can Hyperband correctly bet on near-best runs?
Rank Histogram (100 random trials)

#runs  halving_freq  Resource used vs. grid search  Rank histogram
8      1             20/64       [19,19,20,18,14,5,3,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
8      2             40/128      [26,23,20,14,10,2,2,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
8      3             60/192      [42,31,14,9,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
8      4             80/256      [43,23,18,10,5,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
16     1             48/256      [16,11,10,10,14,12,11,3,6,2,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
16     2             96/512      [41,20,14,8,1,8,0,5,0,1,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
16     3             144/768     [46,23,12,3,6,7,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
16     4             192/1024    [47,21,8,6,7,4,3,2,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
32     1             112/1024    [28,11,7,5,7,4,9,5,2,3,1,3,0,1,0,2,0,2,3,1,1,0,0,0,2,2,1,0,0,0]
32     2             224/2048    [45,21,11,2,1,3,3,4,5,3,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
32     3             336/3072    [101,61,14,6,2,4,2,1,2,3,2,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
32     4             448/4096    [56,33,1,4,1,2,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
64     1             256/4096    [10,22,27,14,1,1,0,1,0,0,2,0,0,2,3,0,3,3,3,4,0,0,2,0,2,0,0,0,0,0]
64     2             512/8192    [19,32,18,18,8,3,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0]

In 19/100 trials, Hyperband chose the rank-1 (best) curve; in 20/100 trials, Hyperband chose the rank-3 curve.
Summary
• Found relatively robust settings for Hyperband on NMT learning curves
• Next: try different NMT architectures and incorporate a speed/accuracy multi-objective
• (Next meeting: September rather than August? Currently doing a summer workshop on Domain Adaptation for NMT)
JHU HLTCOE SCALE 2018 Workshop: Resilient Machine Translation for New Domains
Kevin Duh, Paul McNamee, Kathy Baker, Philipp Koehn, Brian Thompson, Chris Callison-Burch, Jan Niehues, Marine Carpuat, Tim Anderson, Jeremy Gwinnup, Marianna Martindale, Jenn Drexler, Calandra Moore, Steven Bradtke, James Woo, Gaurav Kumar, Huda Khayrallah, Pamela Shapiro, Becky Marvin, Jonathan Weese, Dusan Varis
Final Presentation: Baltimore, August 9 (save the date!)
Goal: Improve Domain Adaptation of NMT

Test set    Training data for NMT   Ar-En   De-En   Fa-En   Ko-En   Ru-En   Zh-En
TED Talks   General Domain          29.6    34.6    22.2    11.6    23.4    15.9
TED Talks   In-Domain (TED)         27.4    32.3    21.3    14.4    22.9    16.2
TED Talks   Continued Training      35.4    39.9    27.9    17.2    28.6    20.4
Patent      General Domain          n/a     36.0    n/a     2.7     23.4    12.6
Patent      In-Domain (Patent)      n/a     61.9    n/a     29.9    26.9    40.2
Patent      Continued Training      n/a     62.3    n/a     31.7    37.0    43.7

BLEU scores: Continued Training gives consistent gains (~0.5-5 BLEU)
[Figure: 1. Train a GENERAL MODEL on the large general-domain bitext; 2. Initialize from it; 3. Continue training on the in-domain bitext (e.g. patents) to obtain the ADAPTED MODEL.]
Suggestions from Michael
• Quantify resources saved:
– What percentage of resources can we save vs. grid search while achieving similar models?
• What were the bad examples in the histogram curve?
• Pull request for the Curriculum Learning code
September 2018 discussion
Summary so far
With bandit learning, we can save X% of resources while achieving less than Y degradation in BLEU. (Here, X = 81%, Y = 0.)
Open question:
- Results for drastically different architectures
Other directions
• Methods for speeding up training (i.e. the inner loop of hyperparameter optimization):
– Bandit learning
– Data sub-selection for training speedup
• Methods for speeding up models or reducing resource usage during inference in general:
– Model compression
Data subset selection: Formulation
• Training data T: N samples
• Can we select a subset S of M << N samples
– where training on the subset gives the same hyperparameter search recommendations as training on the full set?
• Formulation (see the sketch below for step 3):
1. Train K models with different hyperparameters on T
2. Similarly, train the K models on subset S
3. Compare the rankings of (1) and (2). If they are the same, then the data subset is a good surrogate
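One way to make step 3 concrete is a rank-correlation check, e.g. Kendall's tau between the BLEU rankings obtained on T and on S (a sketch; the helper name and the example numbers are illustrative assumptions):

```python
from scipy.stats import kendalltau

def ranking_agreement(bleu_full, bleu_subset):
    """Kendall's tau between hyperparameter rankings induced by full-data
    vs. subset training.

    bleu_full, bleu_subset: validation BLEU for the same K configurations,
    trained on T and on S respectively. Tau near 1 means the subset ranks
    configurations the same way and is a good surrogate.
    """
    tau, _ = kendalltau(bleu_full, bleu_subset)
    return tau

# Example with 4 hyperparameter configurations:
tau = ranking_agreement([25.1, 27.3, 24.0, 26.8], [21.0, 23.5, 20.2, 23.1])  # tau = 1.0
```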
Data subset selection: Details
• Baseline 1: Train on T as usual, with up to the same training time as M * #epochs
• Baseline 2: Flip the subset selection criteria
• Subset selection methods:
– Cynical data selection
– Vocabulary-based selection
• Evaluation:
– How to interpret ranking differences?
Model compression
• Focus more on inference resource constraints
• Existing ideas to explore:
– Model distillation
– Quantization
– (Discussion)
• Compare speed and memory footprint
• Integrate this into the larger auto-tuning loop
Discussion notes (with Michael)
• Data subset selection:
– It’d be good to have all the plots – informative whether the results are good or bad
– For vocab selection: currently something similar is done in unit tests (replace most of the vocab with unk)
• Model distillation:
– Train a big model and translate the training data. Train a small model and then continue training on the big model’s outputs. This may be sufficient (no need for the output distribution as I originally imagined).
• Quantization:
– Michael will help look for pointers on quantization work in MXNet
• Next: exploring a new, orthogonal direction for speeding up hyperparameter search
Data Subset Selection for NMT Hyperparameter Search
Kevin Duh
Motivation
• It takes time to train models to convergence on large datasets
• Question: Can we train models to convergence on a small dataset?
– E.g. in a paper from long ago, LeCun suggests fiddling with the learning rate on a small subset first
– What subset leads to fast convergence?
– Does the ranking of hyperparameters on a small subset correlate with that on the full dataset?
Data subset selection: Formulation
• Training data T: N samples
• Can we select a subset S of M << N samples
– where training on the subset gives the same hyperparameter search recommendations as training on the full set?
• Formulation:
1. Train K models with different hyperparameters on T
2. Similarly, train the K models on subset S
3. Compare the rankings of (1) and (2). If they are the same, then the data subset is a good surrogate
Preliminary Experiments
• Big: model trained on 28 million sentence pairs of general-domain DE-EN
– Approx. 300k-500k updates to converge
• Small: model trained on a randomly selected 10% of the data
– Approx. 150k-300k updates (30-60 hours) to converge
• Vocab: model trained on sentences containing only the top 1/256 of the vocabulary (see the sketch below)
– Approx. 100k-200k updates to converge
→ Vary #layers, size, etc. and see if the ranking is the same for Big vs. Small and Big vs. Vocab
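A minimal sketch of the "Vocab" subset construction, assuming whitespace-tokenized sentence pairs and a single shared vocabulary (the actual selection may treat source and target vocabularies separately):

```python
from collections import Counter

def top_vocab_subset(pairs, fraction=1 / 256):
    """Keep sentence pairs whose tokens all lie in the most frequent `fraction`
    of the vocabulary (illustrative sketch).

    pairs: list of (source, target) whitespace-tokenized strings.
    """
    counts = Counter(tok for src, tgt in pairs for tok in (src + " " + tgt).split())
    cutoff = max(1, int(len(counts) * fraction))
    top_vocab = {tok for tok, _ in counts.most_common(cutoff)}
    return [(src, tgt) for src, tgt in pairs
            if all(tok in top_vocab for tok in (src + " " + tgt).split())]
```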
Data subset selection by vocabulary
LSTM vs. Transformer, Layers = 1, 2, 4
Changing learning rates
Connections to Bandit Learning
• Bandit: stop some runs before convergence
• Data selection: shorter time to convergence
• These are both heuristics for early-stopping some models during hyperparameter search
• Next steps:
– Collect more empirical results
– Experiment with other data selection methods
Discussion notes (with Michael)
• Plot all results on the same figure
• Try even smaller datasets and see when the ranking starts to break
• Experiment on at least one more dataset
• As a wrap-up, find general recommendations: based on some dataset characteristic, which speed-up method should we use?