Hypothesis Testing: How to Eliminate Ideas as Soon as Possible
Roman Zykov, Retail Rocket
Boston, RecSys 2016
Context
• Intro
• Offline vs Online testing
• Make offline testing shorter
• Artificial diversity metric
• Online tests
Retail Rocket
• Personalised real-time recommendations
• E-commerce only
• Multiple channels (site, email, …)
• Founded in 2012
• Offices: Amsterdam, Barcelona, Milan, Moscow
• 1000+ retail partners
• 100+ million daily events
Why is testing important?
• Highly competitive market
• It's not hard to create your own recommendations
• Constant changes in the product and algorithms
• Fast and reliable decisions
Offline vs Online testing
Offline testing forecasts online testing results
• Relatively fast: testing minor changes takes hours
• Few resources: data, computational resources, code, 1 dev
• Hard to forecast online metrics in some cases
• The influence of an algorithm on users' behaviour is ignored
• Bad values of offline metrics prevent online implementation
Online test: the final decision point
• Requires much time: at least two cycles of decision making
• Requires many resources: design, onsite production, etc.
Testing facts
• Nine out of ten ideas do not improve anything
• Most ideas have minor impact:
o add new data: extracted from text, images, etc.
o adjust parameters of an algorithm
Offline testing
Offline predicts Online
Major changes or a new algorithm
• Always check with an online experiment
• Find an appropriate offline metric afterwards
• Try different definitions of users' sessions
• Try different event sequences
Minor changes
• Use offline tests if you have a proven offline metric
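
The slide suggests trying different definitions of users' sessions but does not say which ones. As an illustration only, the sketch below uses one common definition, splitting a user's events whenever the inactivity gap exceeds a timeout. The function name `split_sessions`, the `max_gap_seconds` parameter, and the event tuples are hypothetical and not taken from the Retail Rocket framework.

```python
from typing import List, Tuple

# Hypothetical event shape: (unix timestamp in seconds, event name).
Event = Tuple[int, str]


def split_sessions(events: List[Event], max_gap_seconds: int) -> List[List[Event]]:
    """Split one user's events into sessions by inactivity gap (an assumed definition)."""
    sessions: List[List[Event]] = []
    for event in sorted(events, key=lambda e: e[0]):
        # Continue the current session if the gap to its last event is small enough.
        if sessions and event[0] - sessions[-1][-1][0] <= max_gap_seconds:
            sessions[-1].append(event)
        else:
            sessions.append([event])
    return sessions


events = [(0, "view1"), (60, "view2"), (4000, "cart1"), (4100, "purchase1")]
print(len(split_sessions(events, max_gap_seconds=1800)))  # 2
print(len(split_sessions(events, max_gap_seconds=7200)))  # 1
```

Changing the timeout changes which events count as belonging together, which in turn changes every offline metric computed per session.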
Make offline testing shorter: Retail Rocket
What we did
• Functional programming in Scala/Spark. Four languages (Python, Java, Pig, Hive) had been used previously.
• Research in Scala/Spark notebooks with added R integration for graphics
• Offline evaluation framework for all of our tasks, with metrics calculations. The most complicated project at Retail Rocket
What we got
• It takes hours to prove or disprove any simple idea, whereas previously it could have taken days
• Research is limited by the power of our cluster and the number of data scientists
Scala/Spark notebook with R
Offline framework
• Scala on Spark
• Deals with existing web logs
• Implicit feedback
• Major metrics:
o Recall, Diversity, Recall with NN, Empty Recs
• Minor metrics:
o Serendipity, Novelty, Coverage
• Different types of event sequences
• Different definitions of users' sessions
• Personalised / non-personalised recommendations
• Adjustable TOP of viewable recommendations
• Test panel of sites from different domains
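
The slides name the metrics but do not define them. The sketch below shows one plausible reading of two of them, under assumptions: Recall is taken as the share of (source item, target item) pairs whose target appears in the top N recommendations for the source, and Empty Recs as the share of pairs whose source item has no recommendations at all. The function names, the `top` parameter, and the toy data are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]        # (source item, item the user actually interacted with next)
Recs = Dict[str, List[str]]   # source item -> ranked recommendations


def recall_at_top(pairs: List[Pair], recs: Recs, top: int) -> float:
    """Share of pairs whose target is within the top `top` recommendations (assumed definition)."""
    if not pairs:
        return 0.0
    hits = sum(1 for source, target in pairs if target in recs.get(source, [])[:top])
    return hits / len(pairs)


def empty_recs_share(pairs: List[Pair], recs: Recs) -> float:
    """Share of pairs whose source item has no recommendations (assumed definition)."""
    if not pairs:
        return 0.0
    empty = sum(1 for source, _ in pairs if not recs.get(source))
    return empty / len(pairs)


recs = {"a": ["b", "c", "d"], "b": ["a"]}
pairs = [("a", "c"), ("a", "d"), ("b", "c"), ("x", "a")]
print(recall_at_top(pairs, recs, top=2))  # 0.25
print(recall_at_top(pairs, recs, top=3))  # 0.5
print(empty_recs_share(pairs, recs))      # 0.25
```

The `top` argument plays the role of the adjustable TOP of viewable recommendations: the same recommendations score differently depending on how many of them a visitor can actually see.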
Offline event sequences
view1 view2 view3 cart1 cart2 view4 view5 view6 purchase1
View2View: view1->view2, view2->view3, view3->view4, view4->view5, view5->view6
View2Cart: view1->cart1, view2->cart1, view3->cart1, view4->cart1, view5->cart2, view6->cart2
View2Purchase: view1->purchase1, view2->purchase1, view3->purchase1, view4->purchase1, view5->purchase1, view6->purchase1
Cart2Purchase: cart1->purchase1, cart2->purchase1
Cart2Cart: cart1->cart2
* Events: product view, add to cart, purchase, main page view, search, catalog page, …
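
The pairs listed above can be reproduced with a simple rule, sketched here as an inference from the slide rather than as the framework's actual code: pair each source event with the first later event of the target type. The event order used below (view1 to view4, cart1, view5, view6, cart2, purchase1) is an assumption, chosen because it is the order that yields exactly the listed pairs. `build_pairs` and the tuple layout are hypothetical names.

```python
from typing import List, Tuple

# Hypothetical event shape: (event type, event id).
Event = Tuple[str, str]


def build_pairs(session: List[Event], source_type: str, target_type: str) -> List[Tuple[str, str]]:
    """Pair every source-type event with the first later target-type event."""
    pairs = []
    for i, (event_type, event_id) in enumerate(session):
        if event_type != source_type:
            continue
        for later_type, later_id in session[i + 1:]:
            if later_type == target_type:
                pairs.append((event_id, later_id))
                break  # only the nearest later target is used
    return pairs


# Assumed chronological order of the example session.
session = [
    ("view", "view1"), ("view", "view2"), ("view", "view3"), ("view", "view4"),
    ("cart", "cart1"), ("view", "view5"), ("view", "view6"),
    ("cart", "cart2"), ("purchase", "purchase1"),
]

print(build_pairs(session, "view", "cart"))      # View2Cart
print(build_pairs(session, "cart", "purchase"))  # Cart2Purchase
print(build_pairs(session, "cart", "cart"))      # Cart2Cart
```

With this rule, each sequence type is one `(source_type, target_type)` choice over the same session, so adding a new sequence type needs no new code.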
Offline metric examples
view1 view2 view3 cart1 cart2 view4 view5 view6 purchase1
What Customers Buy After Viewing This Item
• View2Cart
• View2Purchase
• …
Customers Who Bought This Item Also Bought
• Cart2Cart
• Cart2Purchase
• View2Cart
• …
Case: Artificial diversification
Artificial diversification
[Figure: recommendations before ("Original") and after ("After") artificial diversification]
Problem: Recall cannot be used for evaluation
Recall with Nearest Neighbours (NN)
[Figure: rows of top-4 recommendations with content-based similarity (nearest neighbours) scores, compared with the real item. Scores shown: 0.8 0.7 0.5 0.5; 0.8 0.7 0.5 0.5; 0.6 0.5 0.4; 0.9 0.8 0.3 0.5]
• Direct hit: 1.0
• Indirect hit: 0.5
• No hit: 0.0
Metric = average over all sessions
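
The scoring on this slide can be sketched as follows. The weights 1.0, 0.5 and 0.0 and the averaging over sessions come from the slide; everything else is an assumption. In particular, an indirect hit is read here as "some recommended item is a content-based nearest neighbour of the real item", and the neighbour sets are taken as given, since the slide does not say how they are built. `session_score`, `recall_with_nn` and the toy items are hypothetical.

```python
from typing import Dict, List, Set, Tuple

DIRECT_HIT = 1.0
INDIRECT_HIT = 0.5
NO_HIT = 0.0


def session_score(recs: List[str], real_item: str, neighbours: Dict[str, Set[str]]) -> float:
    """Score one session: direct hit, indirect hit via a nearest neighbour, or no hit."""
    if real_item in recs:
        return DIRECT_HIT
    near = neighbours.get(real_item, set())
    # Assumed reading: a recommendation that is a content-based neighbour of the real item.
    if any(rec in near for rec in recs):
        return INDIRECT_HIT
    return NO_HIT


def recall_with_nn(sessions: List[Tuple[List[str], str]],
                   neighbours: Dict[str, Set[str]],
                   top: int = 4) -> float:
    """Average the per-session scores over all sessions, using the top `top` recommendations."""
    if not sessions:
        return 0.0
    total = sum(session_score(recs[:top], real, neighbours) for recs, real in sessions)
    return total / len(sessions)


neighbours = {"red_shoe": {"blue_shoe"}}
sessions = [
    (["red_shoe", "hat"], "red_shoe"),   # direct hit
    (["blue_shoe", "hat"], "red_shoe"),  # indirect hit
    (["hat", "scarf"], "red_shoe"),      # no hit
]
print(recall_with_nn(sessions, neighbours))  # 0.5
```

Unlike plain Recall, this score still gives partial credit when diversification swaps the exact item for a close substitute.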
Online A/B testing
AA/BB tests
[Figure: the control group consists of two A groups and the test group of two B groups]
AA/BB tests
[Figure: results for subgroups A, A, B, B in two cases, "Ideal" and "Dirty"]
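
One way to obtain two A groups and two B groups is to hash each visitor into one of four buckets. This is a sketch under assumptions, not the Retail Rocket implementation: the subgroup labels, the `experiment` salt, and both function names are hypothetical, and the slides do not define "Dirty" precisely. A plausible reading is that the two A subgroups (identical treatment) should show near-identical metrics, and a visible gap between them signals a problem with the split or the data.

```python
import hashlib

# Hypothetical labels: two control subgroups and two test subgroups.
SUBGROUPS = ["A1", "A2", "B1", "B2"]


def assign_subgroup(user_id: str, experiment: str) -> str:
    """Deterministically map a user to one of four subgroups via a salted hash."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return SUBGROUPS[int(digest, 16) % len(SUBGROUPS)]


def variant(subgroup: str) -> str:
    """A1 and A2 both see the control; B1 and B2 both see the tested change."""
    return "control" if subgroup.startswith("A") else "test"


group = assign_subgroup("user-42", experiment="recs-widget-v2")
print(group, variant(group))
```

Because the assignment depends only on the user id and the experiment name, a returning visitor always lands in the same subgroup, and a new experiment name reshuffles everyone.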
Bayesian approach
• Conversion rates
o Beta distribution with normal priors
• Average Order Values
o Normal distribution (after log) with normal priors
• Priors from historical data before the experiment
Anything may be done with posteriors.
E.g.: there is a 95% chance that A has a 1% lift over B
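
A statement such as "there is a 95% chance that A has a 1% lift over B" can be read off posterior samples. The sketch below covers conversion rates only and swaps in a simpler technique than the slide describes: it uses a conjugate Beta prior (the slide mentions normal priors), with `prior_alpha` and `prior_beta` standing in for priors estimated from historical data. The lift is treated as relative. All names, defaults, and the example counts are hypothetical.

```python
import random


def prob_lift_at_least(conv_a: int, n_a: int, conv_b: int, n_b: int,
                       min_lift: float = 0.01,
                       prior_alpha: float = 1.0, prior_beta: float = 1.0,
                       draws: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(rate_A >= rate_B * (1 + min_lift)) under Beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Beta posterior: prior pseudo-counts plus observed conversions / non-conversions.
        rate_a = rng.betavariate(prior_alpha + conv_a, prior_beta + n_a - conv_a)
        rate_b = rng.betavariate(prior_alpha + conv_b, prior_beta + n_b - conv_b)
        if rate_a >= rate_b * (1.0 + min_lift):
            wins += 1
    return wins / draws


# Made-up counts: 1200 of 10000 visitors converted in A, 1000 of 10000 in B.
print(prob_lift_at_least(1200, 10000, 1000, 10000))
```

The same posterior draws can answer other questions, for example the probability of any positive lift (`min_lift=0.0`) or the expected loss from choosing the wrong variant.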
Conclusion
• Offline testing can predict online results
• One programming language for R&D reduces the test time
• The Scala language is a good alternative for ML tasks
• Different event sequences for offline metrics
• Recall with Nearest Neighbours (NN) metric
Thank you!
https://github.com/RetailRocket/SparkMultiTool