23
Statistical Inference, Learning and Models in Big Data Franke, Beate; Plante, Jean–François; Roscher, Ribana; Lee, Annie; Smyth, Cathal; Hatefi, Armin; Chen, Fuqi; Gil, Einat; Schwing, Alexander; Selvitella, Alessandro; Hoffman, Michael M.; Grosse, Roger; Hendricks, Dietrich and Reid, Nancy. Abstract The need for new methods to deal with big data is a common theme in most scientific fields, although its definition tends to vary with the context. Statistical ideas are an essential part of this, and as a partial response, a thematic program on statistical inference, learning, and models in big data was held in 2015 in Canada, under the general direction of the Canadian Statistical Sciences Institute, with major funding from, and most activities located at, the Fields Institute for Research in Mathematical Sciences. This paper gives an overview of the topics covered, describing challenges and strategies that seem common to many different areas of application, and including some examples of applications to make these challenges and strategies more concrete. Keywords Aggregation, computational complexity, dimension reduction, high–dimensional data, streaming data, networks 1. Introduction Big data provides big opportunities for statistical inference, but perhaps even bigger challenges, especially when compared to the analysis of carefully collected, usually smaller, sets of data. From January to June 2015, the Canadian Statistical Sciences Institute organized a thematic program on Statistical Inference, Learning and Models in Big Data. It became apparent within the first two weeks of the program that a number of common issues arose in quite different practical settings. This paper arose from an attempt to distill these common themes from the presentations and discussions that took place during the thematic program. Scientifically, the program emphasized the roles of statistics, computer science, and mathematics in obtaining scientific insight from big data. Two complementary strands were introduced: cross– cutting, or foundational, research that underpins analysis, and domain–specific research focused on particular application areas. The former category included machine learning, statistical inference, optimization, network analysis, and visualization. Topic–specific workshops addressed problems in health policy, social policy, environmental science, cyber–security and social networks. These divisions are not rigid, of course, as foundational and application areas are part of a feedback cycle in which each inspires developments in the other. Some very important application areas where big data is fundamental were not able to be the subject of focussed workshops, but many of these applications did feature in individual presentations. The program

Statistical Inference, Learning and Models in Big Datadata into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity,

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Statistical Inference, Learning and Models in Big Datadata into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity,

StatisticalInference,LearningandModelsinBigData

Franke,Beate;Plante,Jean–François;Roscher,Ribana;Lee,Annie;Smyth,Cathal;Hatefi,Armin;Chen,Fuqi;Gil,Einat;Schwing,Alexander;Selvitella,Alessandro;Hoffman,MichaelM.;Grosse,Roger;Hendricks,DietrichandReid,Nancy.

AbstractTheneedfornewmethodstodealwithbigdataisacommonthemeinmostscientificfields,althoughitsdefinitiontendstovarywiththecontext.Statisticalideasareanessentialpartofthis,andasapartialresponse,athematicprogramonstatisticalinference,learning,andmodelsinbigdatawasheldin2015inCanada,underthegeneraldirectionoftheCanadianStatisticalSciencesInstitute,withmajorfundingfrom,andmostactivitieslocatedat,theFieldsInstituteforResearchinMathematicalSciences.Thispapergivesanoverviewofthetopicscovered,describingchallengesandstrategiesthatseemcommontomanydifferentareasofapplication,andincludingsomeexamplesofapplicationstomakethesechallengesandstrategiesmoreconcrete.

KeywordsAggregation,computationalcomplexity,dimensionreduction,high–dimensionaldata,streamingdata,networks

1.IntroductionBigdataprovidesbigopportunitiesforstatisticalinference,butperhapsevenbiggerchallenges,especiallywhencomparedtotheanalysisofcarefullycollected,usuallysmaller,setsofdata.FromJanuarytoJune2015,theCanadianStatisticalSciencesInstituteorganizedathematicprogramonStatisticalInference,LearningandModelsinBigData.Itbecameapparentwithinthefirsttwoweeksoftheprogramthatanumberofcommonissuesaroseinquitedifferentpracticalsettings.Thispaperarosefromanattempttodistillthesecommonthemesfromthepresentationsanddiscussionsthattookplaceduringthethematicprogram.Scientifically,theprogramemphasizedtherolesofstatistics,computerscience,andmathematicsinobtainingscientificinsightfrombigdata.Twocomplementarystrandswereintroduced:cross–cutting,orfoundational,researchthatunderpinsanalysis,anddomain–specificresearchfocusedonparticularapplicationareas.Theformercategoryincludedmachinelearning,statisticalinference,optimization,networkanalysis,andvisualization.Topic–specificworkshopsaddressedproblemsinhealthpolicy,socialpolicy,environmentalscience,cyber–securityandsocialnetworks.Thesedivisionsarenotrigid,ofcourse,asfoundationalandapplicationareasarepartofafeedbackcycleinwhicheachinspiresdevelopmentsintheother.Someveryimportantapplicationareaswherebigdataisfundamentalwerenotabletobethesubjectoffocussedworkshops,butmanyoftheseapplicationsdidfeatureinindividualpresentations.Theprogram

Page 2: Statistical Inference, Learning and Models in Big Datadata into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity,

startedwithanopeningconference1thatgaveanoverviewofthetopicsofthe6–monthprogram2.AllthetalkspresentedattheFieldsInstituteareavailableonlinethroughFieldsLive3.

Technologicaladvancesenableustocollectmoreandmoredata–alreadyin2012itwasestimatedthatdatacollectionwasgrowingat50%peryear(Lohr,2012).Whilethereismuchimportantresearchunderwayinimprovingtheprocessing,storingandrapidaccessingofrecords(Chenetal.,2014;Schadtetal.,2010),thisprogramemphasizedthechallengesformodelling,inferenceandanalysis.

Twostrikingfeaturesofthepresentationsduringtheprogramwerethebreadthoftopics,andtheremarkablecommonalitiesthatemergedacrossthisbroadrange.Weattemptheretoillustratemanyofthecommonissuesandsolutionsthatarose,andprovideapictureofthestatusofbigdataresearchinthestatisticalsciences.Whilethereportislargelyinspiredbytheseriesoftalksattheopeningconferenceandrelatedreferences,italsotouchesonissuesthatarosethroughouttheprogram.Inthenextsectionweidentifyanumberofchallengesthatarecommonlyidentifiedintheworldofbigdata,andinSection3wedescribesomeexamplesofgeneralinferencestrategiesthatarebeingdevelopedtoaddressthesechallenges.Section4highlightsseveralapplications,tomakesomeoftheseideasmoreconcrete.

2.“BigData,it’snotthedata”Thereisnouniquedefinitionofbigdata.ThetitleofthissectionistakenfromtheopeningpresentationbyBell(2015),whoparaphrasedSupremeCourtJusticePorterStewartbysaying“Ishallnottodayfurtherattempttodefinebigdata,butIknowitwhenIseeit”.Bellalsopointedoutthatthedefinitionof‘big’isevolvingwithcomputationalresources.Gaffield(2015)emphasizedthatpeopleandcompaniesareoftencaptivatedbytheconceptofbigdatabecauseitisaboutus,humans.Bigdataprovidesuswithdetailedinformation–ofteninrealtime–makingitpossibletomeasurepropertiesandpatternsofhumanbehaviourthatformerlycouldnotbemeasured.AnanalogysuggestedbyKeller(2015)istotheHubbletelescope:wecannowseeconstellationswithgreatcomplexitythatpreviouslyappearedonlyasbrightdots.Beyondnaturalhumancuriosity,bigdatadrawsinterestbecauseitisvaluable.Alargepartofthehypearoundbigdatacomesfromrapidlyincreasinginvestmentfrombusinessesseekingacompetitiveadvantage(e.g.McAffeeandBrynjolfsonn,2012;Bell,2013;KironandShockley,2011).NationalStatisticsAgenciesareexploringhowtointegratebigdatatobenchmark,complement,orcreatedataproducts(CappsandWright,2013;Struijsetal.,2014;LetouzéandJütting,2014;TamandClarke,2015).Inmanyfieldsofresearchcostsofdataacquisitionarenowlower,sometimesmuchlower,thatthecostsofdataanalysis.Thedifficultyoftransformingbig

1 The program for the opening conference is at http://www.fields.utoronto.ca/programs/scientific/14–15/bigdata/boot/ 2 A description of the complete thematic program is at https://www.fields.utoronto.ca/programs/scientific/14–15/bigdata/ 3 The Fields Institute video archive link is http://www.fields.utoronto.ca/video–archive

Page 3: Statistical Inference, Learning and Models in Big Datadata into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity,

dataintoknowledgeisrelatedtoitscomplexity,theessenceofwhichisbroadlycapturedbythe“FourVs”:volume,velocity,varietyandveracity.

2.1VolumeHandlingmassivedatasetsmayrequireaspecialinfrastructuresuchasadistributedHadoopcluster(e.g.Shvachkoetal.,2010)whereaverylargenumberofrecordsisstoredinseparatechunksscatteredovermanynodes.Asthecostsofprocessors,memoryanddiskspacecontinuestodecrease,thelimitingpiecenowisoftencommunicationbetweenmachines(Scottetal.2013).Asaconsequence,itisimpossibleorimpracticaltoaccessthewholedatasetonagivenprocessorandmanymethodswillfailbecausetheywerenotbuilttobeimplementedfromseparatepiecesofdata.Developmentsarerequired,andbeingmade,tomakewidely–usedstatisticalmethodsavailableforsuchdistributeddata.TwoexamplesarethedevelopmentofestimatingequationtheoryinLinandXi(2011),andtheconsensusMonteCarloalgorithmofScottetal.(2013).

Exploratorydataanalysisisaveryimportantstepofdatapreparationandcleaningthatbecomessignificantlymorechallengingwithincreasingsizeofthedata.Beyondthetechnicalcomplexityofmanagingthedata,graphicalandnumericalsummariesareessentiallylowdimensionalrepresentations–forexample,graphicaldisplaysarephysicallylimitedtoafewmillionpixels.Further,thehumanbrainrequiresaverylowdimensionalinputtoproceedtointerpretation(Scott,2015).Oneconsequenceofthislimitationisthatweoftenmustcompromisebetweentheintuitivenessofavisualizationandthepowerofavisualization(Carpendaleetal.,2004;Nacentaetal.,2012).Anotherconsequenceisthatwetypicallyvisualizeonlyasmallsubsetofthedataatonetime,andthismaygiveabiasedview(Kolaczyk,2009;p.91).

2.2VarietyFormanybigdataproblems,thevarietyofdatatypesiscloselyrelatedtoanincreaseincomplexity.Muchofclassicalstatisticalmodellingassumesa“tallandskinny”datastructurewithmanymorerows,indexingobservationalorexperimentalunitsn, thancolumns,indexingvariablesp.Images,bothtwo–andthree–dimensional,blogposts,twitterfeeds,socialnetworks,audiorecordings,sensornetworks,videosarejustsomeexamplesofnewtypesofdatathatdon’tfitthis“rectangular”model.Oftenrectangulardataiscreatedsomewhatartificially:forexamplefortwo–dimensionalimages,byorderingthepixelsrowbyrowandrecordingoneormorevaluesassociatedwitheachpixel,hencelosinginformationaboutthestructureofthedata.

Formanytypesofbigdatathenumberofvariablesislarger,oftenmuchlarger,thanthenumberofobservations.Thischangeintherelationshipbetweennandpcanleadtocounter–intuitivebehaviour,oracompletebreakdown,ofconventionalmethodsofanalysis.Problemsinwhichpscaleswithnareverycommon:forexample,inmodelsfornetworksthatassignoneparametertoeachnode,therearep=nparametersforagraphwithn×npossibleedges(ChungandLu,2002).Similarly,Bioucas–Diasetal.(2013)discusshyperspectralremotesensingimagedata,whichmayhavehundredsorthousandsofspectralbands,p,butaverysmallnumberoftrainingsamples,n.Inarecentgenomicsstudyreportedbythe1000GenomesProjectConsortium(2012),geneticistsacquireddataon𝑝 = 38 millionsinglenucleotidepolymorphismsinonly𝑛 = 1092individuals.

Theasymptoticpropertiesofclassicalmethodsinaregimewhere𝑝increaseswith𝑛arequitedifferentthanintheclassicalregimeoffixed𝑝.Forexample,if𝑋! , 𝑖 = 1,… ,𝑛areindependentandidenticallydistributedas𝑁(0, 𝐼!),and𝑝/𝑛staysconstantas𝑝and𝑛increase,theeigenvaluesof

Page 4: Statistical Inference, Learning and Models in Big Datadata into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity,

theempiricalcovariancematrix𝑆 = 𝑋𝑋′/𝑛donotconvergetothetruevalueof1,butratherhavealimitingdistributiondescribedbytheMarcenko–Pasturlaw.DonohoandGavish(2014)showhowthislawisrelevantforunderstandingtheasymptoticminimaxriskinestimatingalowrankmatrixfromnoisymeasurementsbythresholdingthesingularvalues.Donohoetal.(2013)obtainsimilarresultsfortheproblemofmatrixrecovery.

2.3VeracitySomebigdataisobtainedbypiecingtogetherseveraldatasetsthathavenotbeencollectedwiththecurrentresearchquestioninmind:thisissometimesreferredtoas“founddata”or“conveniencesamples”.Thistypeofdatawilltypicallyhaveextremelyheterogeneousstructure(HodasandLerman2014),andmaywellnotbeatallrepresentativeofthepopulationofinterest.Thevastmajorityofbigdataisobservational,butmoreimportantlymaybecollectedveryhaphazardly.Classicalideasofexperimentaldesign,surveysampling,anddesignofobservationalstudiesoftengetlostintheexcitementoftheavailabilityoflargeamountsofdata.Issuesofsamplingbias,ifignored,canleadtoincorrectinferencejustaseasilywithbigdataaswithsmall.Harford(2014)presentsanumberofhigh–profilefailures,fromtheLiteraryDigestpollof1936,whichhadtwomillionparticipants,toGoogleFluTrendsin2010,whichinbasingitspredictionsoninternetsearchtermsseverelyover–estimatedtheprevalenceofinfluenza.Despitethoselimitations,NationalStatisticalAgenciesaredevelopingmethodsforthecarefuluseofsuchdata,forexampleasanaidtoimputationonsmallareas(Marchettietal.,2015).

Cleaninglargevolumesofdataischallengingandveracityisaffectedbyincompleteorincorrectcleaning,andbythelackoftoolstoevaluatedataquality.Forexample,toconfirmthatameasureofsentimentcalculatedfromthesocialmediawassound,Daasetal.(2015)evaluateditsconsistencywithawell–validatedindicatorofconsumerconfidence.Whileitisclearthatsocialmediadataarenoisyandvolatile,datacollectedfromsensorsmayalsocomewithseveralproblems,includingalargeproportionofmissingvalues.Daasetal.(2015)consideredasanexampletheissuesinvolvedincleaningtrafficloopdetectordata.

Qualityofdatainlargeadministrativedatabasesisapressingproblem:additionalheterogeneitymaybeintroducedwhenthedataarerecordedbynumerousstakeholderswithvaryingprioritiesandincentives.Lixetal.(2012)usemodellingtoevaluateandimprovethequalityofthedatainlargehealthservicesdatabases.

2.4VelocityInmanyareasofapplication,datais“big”becauseitisarrivingatahighvelocity,fromcontinuouslyoperatinginstrumentation,e.g.,whenconsideringautonomousdrivingscenarios(Geigeretal.2012),orthroughanon–linestreamingservice.Sequentialoron–linemethodsofanalysismaybeneeded,whichrequiresmethodsofinferencethatcanbecomputedquickly,thatrepeatedlyadapttosequentiallyaddeddata,andthatarenotmisleadbythesuddenappearanceofirrelevantdata,asdiscussed,e.g.,byJainetal.(2006)andHoensetal.(2012).Theareasofapplicationsusingsequentiallearningarediverse.Forinstance,LiandFei–Fei(2010)useobjectrecognitiontechniquesinanefficientsequentialwaytolearnobjectcategorymodelsfromanyarbitrarilylargenumberofimagesfromtheweb.Monteleonietal.(2011)usesequentiallearningforclimatechangemodellingtoestimatetransitionprobabilitiesbetweendifferentclimatemodels.Forsomeapplications,suchashighenergyphysics,burstsofdataarrivesorapidlythatit

Page 5: Statistical Inference, Learning and Models in Big Datadata into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity,

isinfeasibletoevenrecordallofit.Inthiscaseagreatdealofscientificexpertiseisneededtoensurethatthedataretainedisrelevanttotheproblemofinterest.

2.5BeyondtheVsBigdataisabouthumans,itoffersnewpossibilities,andhence,itraisesnewethicalquestions.Duhigg(2012),forinstance,reportedhowtheTargetretailstoresperformedanalyticstosendtargetedmarketingmaterialtotheircustomers,includingpromotionsrelatedtopregnancytoateenagerwhosefatherwasnotyetawareofthecomingbaby.AtFacebook,Krameretal.(2014)conductedalargescaleexperimentonsentimentswithoutobtainingaspecificuseragreement(seealsoMeyer,2014).Privacyisalsoabigconcern,especiallywhenconsideringthatthelinkingofdatabasescandiscloseinformationthatwasmeanttoremainanonymous.ThelawsuitfollowingtheNetflixChallengeisastrikingexampleofthatwherelinkingtheprovideddatatotheIMDBmoviereviewsallowedtoidentifysomeusers(Singel,2009;2010).Ingeneral,ifreleaseddatacanbelinkedtopubliclyavailabledata,suchasmediaarticles,thenprivacycanbecompletelycompromised.Anotherhigh–profileexampleofthiswasthepublicationofthehealthrecordsoftheGovernorofMassachusetts,whichinturnledtostrictlawsonthecollectionandreleaseofhealthdataintheUnitedStatesknownasthe“HIPAArules”(Tarran,2014).Inadditiontoprivacy,legalquestionsarealsounanswered:copyrightandownershipofthedata,forinstance,isnotalwaysclear(seee.g.Struijsetal.,2014).

Sub–samplingfromalargedatasetisoftenpresentedasasolutiontosomeoftheproblemsraisedintheanalysisofbigdata;oneexampleisthe“bagoflittlebootstraps”ofKleiner,etal.(2014).Whilethisstrategycanbeusefultoanalyseadatasetthatisverymassive,butotherwiseregular,itcannotaddressmanyofthechallengesthatwerepresentedinthissection.

3.StrategiesforbigdataanalysisInspiteofthewidevarietyoftypesofbigdataproblems,thereisacommonsetofstrategiesthatseemessential.Asidefrompre–processingandmulti–disciplinarity,manyofthesefollowthegeneralapproachofidentifyingandexploitingahiddenstructurewithinthedatathatoftenisoflowerdimension.Wedescribesomeofthesecommonalitiesinthissection.

3.1DataWranglingAverylargehurdleintheanalysisofdata,whichisespeciallydifficultformassivedatasets,ismanipulatingthedataforuseinanalysis.Thisisvariouslyreferredtoasdatawrangling,datacarpentry,ordatacleaning4.Bigdataisoftenvery“raw”:considerablepre–processingsuchasextracting,parsing,andstoring,isrequiredbeforeitcanbeconsideredinananalysis.

Volumeposesachallengetodatacleaning:whilemanualorinteractiveeditingofadatasetareusuallypreferredforreasonablysizeddata,itquicklybecomesimpracticalwhenthedatagrows.Largeadministrativedatabaseswillrequirespecialtools(e.g.thetableplotsofPutsetal.,2015)

4 The R packages dplyr and tidyr were expressly created to simplify this step, and are highly recommended. The new “pipe” operator %>%, operating similarly to the old Unix pipe command, is especially helpful. (Wickham, 2014).

Page 6: Statistical Inference, Learning and Models in Big Datadata into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity,

toassistinthecleaning.Asthesizeorthevelocityincrease,automatizedsolutionsmaybecomenecessary.Putsetal.(2015),forinstance,useaMarkovmodeltoautomatizethecleaningoftrafficloopdata.Varietymakesoftenmakesdatawranglingdifficult,forinstancewhenselectingvariablesorfeaturesfromimageandsounddata(e.g.GuyonandElisseeff,2003),integratingdatafromheterogeneoussources(Izadietal.,2013),orparsingfree–formtextfortextminingorsentimentanalysis(e.g.,GuptaandLehal,2009;PangandLee,2008).Amoreadvancedapproachintextminingusesthesemanticsoftextratherthanthesimplecountofwordstodescribetexts(e.g.,Brienetal.,2013).

Oneexampleofadatapreparationproblemarisesinthemergingofseveralsetsofspatio–temporaldata,whichmayhavedifferentspatialunits–forexampleonedatabasemaybebasedoncensustracts,anotheronpostalcodes,andstillanotheronelectoraldistricts.Brownetal.(2013)havedevelopedanapproachtoheterogeneouscomplexdatabasedonalocalEMalgorithm.Thisstartsbyoverlayingallpartitionsthatareusedforaggregation,andthencreatingafinerpartitionsuchthatallelementsofthatpartitionsarefullyincludedinthelargerones.ThelocalEMalgorithmbasedonthisisparallelizable,andhenceprovidesaworkablesolutiontothiscomplexproblem.ArelatedissuearoseinthepresentationofGaffield(2015),wherehistoricalcensusdataisbeingusedaspartofastudyoftheimpactofindividualsonthehistoryofCanada.Thecensustractshavechangedovertime,andthecurrentdatabaseresolvesthisbymappingbackwardsfromthecurrentcensustracts.

3.2VisualizationVisualizationisveryoftenthefirstformalstepoftheanalysisandisoftenatoolofchoicefordatacleaning(seee.g.Putsetal.,2015).Theaimistofindefficientgraphicalrepresentationsthatsummarizethedataandemphasiseitsmaincharacteristics(Tukey,1977).Visualizationcanalsoserveasaninferentialtoolindifferentstagesoftheanalysis––understandingthepatternsinthedatahelpsincreatinggoodmodels(Bujaetal,2009).Forexample,Albert–Greenet.al(2014),WoolfordandBraun(2007)andWoolfordetal.(2009)presentexploratorydataanalysistoolsforlargeenvironmentaldatasets,whicharemeanttonotonlyplotthedata,butalsotolookforstructuresanddeparturesfromsuchstructures.Theauthorsfocusonconvergentdatasharpeningandthestalagmiteplotinthecontextofforestfirescience.

Evenif“apictureiswortha1000words”,onehastorealizethatgoingfrombigdatato1000wordisheavycompression.Visualizationis“bandwidth–limited”,inthesensethatthereisnotenoughscreenspacetovisualizelarge,orevenmoderatelysized,data.Interactivityprovidesmanywaystohelpwiththis;themostfamiliariszoomingintolargevisualizationstoreveallocalstructure,muchinthemannerofGoogleMaps.Newtechniquesarebeingdevelopedintheinformationvisualizationcommunity:Neumannetal.(2006)describedtakingcuesfromnature,andfromfractals,indevelopingPhylloTreesforillustratingnetworkgraphsatdifferentscales.Carpendaleinherbootcamppresentationdescribedseveralrecentinnovationsininteractivevisualization,manyofwhichareillustratedatthewebsiteoftheInnovisResearchGroup(Carpendale,2015).

3.3ReducingDimensionalityDirectreductionofdimensionalityisaverycommontechnique:inadditiontofacilitatingvisualization(Huronetal.,2014;Wickhametal.,2010),itisoftenneededtoaddressissuesof

Page 7: Statistical Inference, Learning and Models in Big Datadata into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity,

datastorage(WittenandCandès,2013).HintonandSalakhutdinov(2006)describetheuseofdirectdimensionreductioninclassification.Dimensionalityreductionalsooftenincorporatesanassumptionofsparsity,whichisaddressedinthenextsubsection.PrincipalComponentAnalysis(PCA)isaclassicalexampleofdimensionalityreduction,introducedbyPearson(1901).Thefirsttwo,orthree,principalcomponentsareoftenplottedinthehopesofidentifyingstructureinthedata.OneoftheissueswithperformingPCAonlargedatasetsistheintractabilityofitscomputation,whichreliesonsingularvaluedecomposition(SVD).WittenandCandès(2013),EldarandKutyniok(2012),andHalkoetal.(2011),amongothers,haveusedrandommatrixprojections.Thisisapowerfultoolbehindcompressedsensingandastrikingexampleofdimensionalityreduction,allowingapproximateQRdecompositionorsingularvaluedecompositionwithgreatperformanceadvantagesoverconventionalmethods.WittenandCandès(2013)approximateaverylargematrixbytheproductoftwo“skinny”matricesthatcanbeprocessedrelativelyeasily.Usingrandomizedmethodsonecaninfertheprincipalcomponentsoftheoriginaldatafromthesingularvaluedecompositionoftheseskinnyfactormatrices,thusovercomingacomputationalbottleneck.

Dependingontheapplication,lineardimensionalityreductiontechniquessuchasPCAmayneedtobeextendedtonon–linearones,e.g.,ifthestructureofthedatacannotbecapturedinalinearsubspace(HintonandSalakhutdinov,2006).However,theyarehardertocomputeandparameterize.Innetworkanalysis,anovelwaytoreducethedimensionalityoflargeandcomplexdatasetsisbyemployinganexplanatorytoolfordataanalysiscalled“networkhistogram”(OlhedeandWolfe,2014)illustratedinFigure1.SofiaOlhedeinjointworkwithPatrickWolfesuggestsanon–parametricapproachtoestimateconsistentlythegeneratingmodelofanetwork.Theideaistoidentifygroupssuchthatnodesfromthesamegroupshowasimilarconnectivitypatternandtoapproximatethegeneratingmodelbyapiecewiseconstantfunctionbyaveragingovergroups.ThepiecewiseconstantfunctionhereiscalledstochasticblockmodelanditapproximatesthegeneratingmodelinthesamewayasahistogramapproximatesapdfortheRiemannsumapproximatesacontinuousfunction.

Page 8: Statistical Inference, Learning and Models in Big Datadata into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity,

Figure1.ExampleofanetworkhistogramfromOlhedeandWolfe(2014).Anetworkcanbeexpressedasamatrixwhereeachentryrepresentthestrengthofconnectionbetweentwonodes.Cleverlyorderingthenodesuncoverstheunderlyingstructureofthenetwork.

3.4SparsityandRegularizationIncreasingmodelsparsityenforcesalower–dimensionalmodelstructure.Inturn,itmakesinferencemoretractable,modelseasiertointerpretanditleadstomorerobustnessagainstnoise(Candèsetal.2008,Candèsetal.2006).Regularizationtechniquesthatenforcesparsityhavebeenwidelystudiedintheliteratureformorethanadecade,includingthelasso(Tibshirani,1996),adaptivelasso(Zhou,2006),thesmoothlyclippedabsolutedeviation(SCAD)penalty(FanandLi,2001),andmodifiedCholeskydecomposition(WuandPourahmadi,2003).Thesemethodsandtheirmorerecentextensionsarenowkeytoolsindealingwithbigdata.

Astrongassumptionofsparsitymakesitpossibletouseahighernumberofparametersthanobservations,enablingustofindtheoptimalrepresentationofthedatawhilekeepingtheproblemsolvable.AgoodexampleinthiscontextisthewinningentryfortheNetflixChallengerecommendersystem,whichcombinedalargenumberofmethods,therebyartificiallyincreasingthenumberofparameterstoimprovepredictionperformance(Belletal.2008).

Sparsity–enforcingalgorithmscombinevariableselectionandmodelfitting:thesemethodstypicallyshrinksmallestimatedcoefficientstozeroandthistypicallyintroducesbiaswhichmustbetakenintoaccountinevaluatingthestatisticalsignificanceofthenon–zeroestimatedcoefficients.Muchoftheresearchinlassoandrelatedmethodshasfocussedonpointestimation,butinferentialaspectssuchashypothesistestingandconstructionofconfidenceintervalsisanactiveareaofresearch.Forinferenceaboutcoefficientsestimatedusingthelasso,vandeGeeretal.(2014),andJavanmardandMontanari(2014a,2014b)introduceanadditionalstepintheestimationprocedurefordebiasing.Lockhartetal.(2014)incontrastfocusontestingwhether

Page 9: Statistical Inference, Learning and Models in Big Datadata into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity,

thelassomodelcontainsalltrulyactivevariables.Inasurvivalanalysissetting,Fangetal.(2014)testtheeffectofalowdimensionalcomponentinahighdimensionalsettingbytreatingallremainingparametersasnuisanceparameters.Theyaddresstheproblemofthebiaswiththeaimofimprovingtheperformanceofthetheclassicallikelihood–basedstatistics:theWaldstatistic,thescorestatisticsandthelikelihoodratiostatistic,allofthesebasedontheusualpartiallikelihoodforsurvivaldata.Toofferacompromisetothecommonsparsityassumptionofzeroandlargenon–zerovaluedelements,Ahmed(2014)proposesarelaxedassumptionofsparsitybyaddingweaksignalsbackintoconsideration.Hisnewhigh–dimensionalshrinkagemethodimprovesthepredictionperformanceofaselectedsubmodelsignificantly.

3.5OptimizationOptimizationplaysacentralroleinstatisticalinference,andregularizationmethodsthatcombinemodelfittingwithmodelselectioncanleadtoverydifficultoptimizationproblems.Complexmodelswithmanyhiddenlayers,suchasariseintheconstructionofneuralnetworksanddeepBoltzmannmachines,typicallyleadtonon–convexoptimizationproblems.Sometimesaconvexrelaxationofaninitiallynon–convexproblemispossible:thelassoforexampleisarelaxationoftheproblemthatpenalizesthenumberofnon–zerocoefficientestimates,i.e.itreplacesan𝐿!penaltywithan𝐿!penalty.Whilethisconvexrelaxationmakestheoptimizationproblemmucheasier,itinducesbiasintheestimatesoftheparameters,asmentionedabove.Non–convexpenaltiescanreducesuchbias,butnon–convexoptimizationisusuallyconsideredtobeintractable.However,statisticalproblemsareoftensufficientlywell–behavedthatnon–convexoptimizationmethodsbecometractable.Moreover,duetothestructuresofthestatisticalmodels,LohandWainwright(2014,2015)suggestthatthesemodelsdonottendtobeadversarial,henceitisnotnecessarytolookattheworstcasescenario.Evenwhennon–convexoptimizationalgorithms,suchastheExpectation–Maximizationalgorithm,donotprovidetheglobaloptimum,thelocaloptimumcanbeadequatesincethedistancebetweenthelocalandglobalmaximaisoftensmallerthanthestatisticalerror(Balakrishnanetal.,2015,LohandWainwright,2015).Wainwrightconcludedhispresentationattheopeningconferencewitha"speculativemeta–theorem",thatforanyprobleminwhichaconvexrelaxationperformswell,thereisafirstordersimplemethodonthenon–convexformulationwhichperformsasgood.Applicationsofnon–convextechniquesincludespectralanalysisoftensors,analogoustoSVDofmatrices,whichcanbeemployedtoaidinlatentvariablelearning(Anandkumaretal.,2014).Non–convexmethodscanalsobeusedindictionarylearningaswellasinaversionofrobustPCA,whichislesssusceptibletosparsecorruptions(Netrapallietal.,2014).Insomecases,convexreformulationshavealsobeendevelopedtofindoptimalsolutionsinnon–convexscenarios(GuoandSchuurmans,2007).

Manymodelsusedforbigdataarebasedonprobabilitydistributions,suchasGibbsdistributions,forwhichtheprobabilitydensityfunctionisonlyknownuptoitsnormalizingconstant,sometimescalledthepartitionfunction.MarkovchainMonteCarlomethodsattempttoovercomethisbysamplingfromanunnormalizedprobabilitydensityfunctionbasedonaMarkovchainthathasthetargetdistributionasitsequilibriumdistribution,butthesemethodsdonotnecessarilyscalewell.Insomecases,approximateBayesiancomputation(ABC)methodsprovide

Page 10: Statistical Inference, Learning and Models in Big Datadata into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity,

analternativeapproximate,butcomputable,solutiontoamassiveproblem(BlumandTran,2010).

3.6MeasuringdistanceInhigh–dimensionaldataanalysis,distancesbetween“points”,orobservations,areneededformostanalyticprocedures.Forexampleoptimizationmethodstypicallymovethroughapossiblycomplexparameterspace,andsomenotionofstep–sizeforthismovementisneeded.SimulationmethodssuchasMarkovchainMonteCarlosamplingalsoneedto‘move’throughtheparameterspace.InboththesecontextstheuseofFisherinformationasametricintheparameterspacehasemergedasausefultool–itistheRiemannianmetric(Rao,1945;AmariandNagaoka,2000)whenregardingtheparameterspaceasamanifold.GirolamiandCalderhead(2011)exploitedthisinMarkovchainMonteCarloalgorithms,withimpressiveimprovementsinspeedofconvergence.GrosseandSalakhutdinov(2015)showedhowoptimizationofmulti–layerBoltzmannmachinescouldbeimprovedbyusinganapproximationtotheFisherinformation,thefullFisherinformationmatrixbeingtoohigh–dimensionaltouseroutinely.Asimilartechniqueemergedinenvironmentalscience(SampsonandGuttorp,1992),wheretwo–dimensionalgeographicalspaceismappedtoatwo–dimensionaldeformedspaceinwhichthespatialstructurebecomesstationaryandisotropic.Schmidtetal.(2011)describehowthisapproachcanbeexploitedinaBayesianframework,mappingtoadeformedspaceinhigherdimensions,andapplythistoanalysesofsolarradiationandtemperaturedata.

3.7RepresentationLearningRepresentationlearning,alsocalledfeaturelearningordeeplearning,aimstouncoverhiddenandcomplexstructuresindata(Bengioet.al,2013)andfrequentlyusesalayeredarchitecturetouncoverahierarchicaldatarepresentation.Themodelforthislayeredarchitecturemayhavehigherdimensionalitythantheoriginaldata.Inthatcase,onemayneedtocombinerepresentationlearningwiththedimensionalityreductiontechniquesdescribedaboveinthisreport.Large,well–annotateddatasetsfacilitatesuccessfulrepresentationlearning.Butlabelingdataforsupervisedlearningcanconsumeexcessivetimeandresources,andmayevenproveimpossible.Tosolvethisproblem,researchershavedevelopedtechniquestouncoverdatastructurewithunsupervisedtechniques(Anandkumaretal.,2013;SalakhutdinovandHinton,2012).

3.8SequentialLearningAstrategytodealwithbigdatacharacterizedbyahighvelocityissequentialorincrementallearning.Thistechniquetreatsthedatasuccessively,eitherbecausethetaskwasdesignedinthiswayorthedatadoesnotallowanyotheraccess,asitisthecaseforcontinuousdatastreamssuchaslongimagesequences(Kalaletal.2012).Sequentiallearningisalsoappliedifthedataiscompletelyavailable,buthastobeprocessedsequentially.Thisoccurswhenthedatasetistoolargetofitintothememoryortobeprocessedinapracticallytractableway.Forexample,Scottetal.(2013)introduceconsensusMonteCarlo,whichoperatesbyrunningseparateMonteCarloalgorithmsondistributedmachines,andthenaveragingindividualMonteCarlodrawsacrossmachines;theresultcanbenearlyindistinguishablefromthedrawsthatwouldhavebeenobtainedbyrunningasinglemachinealgorithmforaverylongtime.

Page 11: Statistical Inference, Learning and Models in Big Datadata into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity,

3.9Multi–disciplinarityAcommonthemeinagreatmanypresentationsonbigdataisthenecessityofaveryhighlevelofmulti–disciplinarity.Statisticalscientistsarewell–trainedinbothplanningofstudies,andininferenceunderuncertainty.Computerscientistsbringasophisticatedmixofcomputationalstrategiesanddatamanagementexpertise.Boththesegroupsofresearchersneedtoworkcloselywithscientists,socialscientists,andhumanistsintherelevantapplicationareastoensurethatthestatisticalmethodsandcomputationalalgorithmsareeffectiveforthescientificproblemofinterest.Forinstance,Deanetal.(2012)describeanexcitingcollaborationinforestry,wheretheyareworkingonlinkingstatisticalanalysis,computationalinfrastructurefornewinstrumentation,withsocialagencywiththegoalofmaximizingtheirimpactonthemanagementofforestfires.Thesynergyspacebetweenthemiscreatedinanintegrativeapproachinvolvingscientistsfromallfields.Thisintegrationisnotaninventionoftheworldofbigdatabyanymeans––statisticsisaninherentlyappliedfieldandcollaborationshavebeenamajorfocusofthedisciplinefromitsearliestdays.Whathaschangedisthespeedatwhichnewtechnologyprovidesnewchallengestostatisticalanalysis,andthemagnitudeofthecomputationalchallengesthatgoalongwiththis.Muchofthebigdatanowavailableinvolvesinformationaboutpeople––thisisoneoftheaspectsofbigdatathatcontributestoitspopularity.Thesocialsciencescanhelptoelucidatepotentialbiasesinsuchdata,aswellashelptofocusonthemostinterestingandrelevantresearchquestions.Humanistscanprovideframeworksforthinkingaboutprivacy,abouthistoricaldataandchangesthroughtime,andaboutethicalissuesarisinginsharingofdataabout,forexample,individuals’health.Individualsmaywellbewillingtosharetheirdataforresearchpurposes,forexample,butnotformarketingpurposes:thehumanitiesofferwaysofthinkingabouttheseissues.

4.ExamplesofApplicationsThebreadthofapplicationsisnearlylimitless;weprovidehereaselectionoftheapplicationareasdiscussedduringtheopeningconference,inparttomaketheideasoutlinedintheprevioussectionmoreconcrete.

4.1PublicHealthDietisanimportantfactorofpublichealthinthestudyofdisabilities(Murrayetal.,2013),butassessingpeople’snutritionalbehaviourisdifficult.Nielsencollectsinformationaboutallproductssoldbyalargenumberofgroceriesandcornerstoresatthe3–digitpostalcodelevel,andpurchasecardloyaltyprogramshavedataonpurchasesatthehouseholdlevel.UsingthesesourcesofinformationalongwithaUPCdatabaseofnutritionalvalues,Buckeridgeetal.(2012)matchnutritionvariablestoexistingmedicalrecordsintheprovinceofQuébec,incollaborationwiththeInstitutdesantépubliqueduQuébec.Sincethepurchasedataistemporal,Buckeridgeetal.(2014)measurethedecreaseinsalesofsugarycarbonateddrinksfollowingamarketingcampaign,henceassessingtheeffectivenessofthispublichealtheffort.Bigdataisalsobeingusedforsyndromicsurveillance,themonitoringofthesyndromesoftransmittablediseasesatthepopulationlevel.IntheprovinceofQuébec,acentralsystemmonitorsthetriagedatafromemergencyroomsincludingfreeformtextwhichneedstobetreatedappropriately.Whentheautomatedsurveillancesystemissuesanalert,anad–hoc

Page 12: Statistical Inference, Learning and Models in Big Datadata into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity,

analysisisperformedtodecidewhetherimmediateactionsareneeded.Buckeridgeetal.(2006)evaluatetheperformanceofsuchsystemsthroughsimulationfordifferentratesoffalsepositivealerts.Socialmediaproducedatawhosevolumeandvarietyareoftenchallenging.Thetraditionalreportingofflufromphysicianreportsgenerallyintroducesa1–2weektimedelay.GoogleFluTrends(Ginsberget.al2009)speedsupthereportingbytabulatingtheuseofsearchkeywords,suchas“runnynose”or“sorethroat”,inordertoestimatetheprevalenceofinfluenza–likeillnesses.Predictionscansometimesgowrong,seeButler(2013),butthegaininreactiontimeissignificant.UsingtheGPSlocationinthemetadataofTwitterpostsRamírez–Ramírezetal.(2013)introduceSimulationofInfectiousDiseases(SIMID)thatimprovegeographicandtemporallocalitybeyondGoogleFluTrends.TheyillustratetheuseofSIMIDtoestimateandpredictfluspreadinthePeelsub–metropolitanregionnearToronto.

4.2HealthPolicyManyexamplesinhealthpolicyrelyonthelinkageoflargeadministrativedatasets.InOntario,theInstituteforClinicalEvaluativeSciences(ICES)isaprovince–wideresearchnetworkthatcollectsandlinksOntario'shealth–relateddataattheindividuallevel.Similardatabasesexistsinotherprovinces.ComparingdataondiabetescaseascertainmentfromManitobawiththoseofNewfoundlandandLabrador,Lixetal.(2012)evaluatetheincompletenessofphysicianbillingclaimsandestimatetowhichextentthenumberofnon–fee–for–serviceactsisunderestimated.Withthequalityofthedatainmind,theauthorsalsoimprovediseaseascertainmentwithmodel–basedpredictivealgorithmsasanalternativetoofrelyingonpossiblyincorrectdiseasecodes.

Healthdatacanbespatiotemporalandcomplicationsarisequicklywhenthedataareheterogeneous.Postalcodesandcensustracts,forinstance,defineareasthatdonotmatch,withshapesthatvarythroughtime(seee.g.Nguyenetal.,2012).InastudyaboutcancerinNovaScotia,Leeetal.(2015)haveaccesstolocationdataforthemostrecentyears,butolderdataonlycontainthepostalcodes.Tohandlethedifferentgranularities,theyuseahierarchicalmodelwherethelocationofcancercasesaretreatedasalatentvariableandhigherlevelsofthehierarchyaretheaggregationunits(Lietal.,2012).

4.3LawandOrderInsurgenciesandriotsarefrequentinSouthAmerica,withhundredsorthousandsofoccurrencesineachcountryeveryyear.Korkmazetal.(2015)haveadatabaseofallTwittermessagessentfromSouthAmericaoveraperiodoffouryearsaswellasabinaryvariablenamedGSR,agoldstandardtodeterminewhetheraneventwasaninsurgencyornot.Theoccurrenceofaninsurgencycanbepredictedbythevolumeoftweets,thepresenceofsomekeywordstherein(fromatotalof634whowereinitiallyidentifiedaspotentiallyuseful),andanincreaseduseofTheOnionRouter(TOR),anonlineservicetoanonymizetweets.ThemassivedatabasecontainingallthetweetsisstoredonaHadoopfilesystemanditstreatmentisdonewithPythonthroughmapreduce.Thosetoolswhicharenotpartofatypicalstatisticscursusarestandardinthetreatmentofmassivedatasets.

4.4EnvironmentalSciencesTheworldisfacingglobalenvironmentalchallengeswhicharehardtomodel,giventheirinherentcomplexity.Toestimatethetemperaturerecordsince1000,Barbozaetal.(2014)use

Page 13: Statistical Inference, Learning and Models in Big Datadata into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity,

differentproxiesincludingtreerings,pollen,coral,icecores,hencecombiningveryheterogeneousdata.Fromearlierwork,theauthorsfoundthathistoricalrecordscanbemuchbetterpredictedbyincorporatingclimateforcingsinthemodel,forinstancesolarirradiance,greenhousegasconcentrationandvolcanism.

Inanexamplecombininghealthandenvironmentdata,Ensoretal.(2013)analyzethelocationandtimeofcardiacarrestsfrom2004to2011inthecityofHouston,withrespecttoairpollutionmeasurementstakenbyEPAsensorsscatteredacrossthecity.Insteadofconsideringdailyaveragesofexpositiontoairpollution,thisstudymeasuredtheexposureofanindividualinthethreehoursprecedingacardiacarrestandfindasignificanteffectontheriskofanout–of–hospitalcardiacarrest.Inanunpublishedfollow–upstudydiscussedbySallieKeller,asyntheticmodelisusedtosimulatetheactivitiesof4.4millionpeopleinHoustononagivenday.Theindividualexposurevariesgreatly,evenwithinagivenhousehold,aspeople’sexposuretopollutiondependsontheirmovementsinthecity.ThesyntheticdatahelpedoptimizingCPRtrainingsgeographicallyinthecity.

4.5EducationInaresearchprojecttostudythetheoryofJohnson(2003)aboutgradeinflation,Reeseetal.(2014)analyzethehistoryofBrighamYoungUniversitystudentevaluationsfrom2002to2014,matchingitwithstudentrecords.Atotalof2.9millioncourseevaluationsfrom185,000studentwereused.Thevariablesofinterestarethreequestionsansweredonanordinalscaleandafreetextsectionforwhichasentimentanalysisdeterminesifthecommentispositiveornegative.ThemagnitudeoftheBayesianproportionaloddsmodelfittedtotheordinalresponsevariablesistoobigforMCMCmethods,buttheABCapproachleadstoacomputationallyfeasiblesolution.Thedatashowsempiricalevidenceofgradeinflation.Withanexamplebasedoneducationdata,Cook(2014)illustrateshowvisualizationmaybeusedforinference.Basedonthe2012OECDPisatestsresultsfrom485,900studentsin65countries,Cookshowsafigurepresentingthedifferenceinmeanofmathscoresandreadingscoresbetweenboysandgirlsbycountry.Whiletheboysseemtooutperformthegirlsinmath,amuchgreaterdifferenceisseeninfavourofthegirlsforthereadingscores.Tohaveanideaofthesignificanceofthedifferences,Cookproducessimilarplotswherethegenderlabelsarerandomlyassignedtoscoreswithineachcountry.Thefactthatthetruedatastandsoutincomparisontotherandomassignmentsindicatesthatthegapdidnotoccurbychance.

4.6MobileApplicationSecurityConveyinginformationofinterestfrombigdataisachallengewhichmaybeaddressedbyvisualization.Hosseinkhanietal.(2014)proposePapilio,anovelgraphicalrepresentationofanetworkthatdisplaysthedifferentpermissionsofAndroidapps.Thecreationofatractablerepresentationismadepossiblebytakingadvantageofthespecialstructureofthedata.Thewholerepresentationisstilltoolargeforasinglescreen,sohalos(BaudischandRosenholtz,2003)areusedtoindicatethepositionofoffscreennodes,asillustratedinFigure2.Theradiusofthehalosareproportionaltothedistancetothenode.

Page 14: Statistical Inference, Learning and Models in Big Datadata into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity,

Figure2:IllustrationoftheHalosinPapilio,whichindicatethepositionofoffscreennodes.ThisPapilio

figurefromHosseinkhanietal.(2014)displaysthedifferentpermissionsofAndroidapps.

4.7ImageRecognitionandLabellingImagesareunstructureddataandtheiranalysisprogressedsignificantlywiththedevelopmentofdeeplearningmodelssuchasdeepBoltzmannmachinesthatareespeciallysuitableforautomaticallylearningfeaturesinthedata.TheTorontoDeepLearninggroupofferanonlinedemo5ofmulti–modallearningbasedontheworkofSrivastavaandSalakhutdinov(2012)thatcangenerateacaptiontodescribeanyuploadedimage.TheimagesearchengineofGooglealsousesdeeplearningmodelstofindimagesthataresimilartoanuploadedpicture.

4.8DigitalHumanitiesAlargebodyofresearchinthehumanitiesisbuiltfromanin–depthstudyofcarefullyselectedauthorsorrecords.Theabilitytoprocesslargeamountsofdataisopeningnewopportunitieswheresocialevidencecanbegatheredatapopulationlevel.Forinstance,Drummondetal.(2006)usethe1901CanadiancensusdatatorevisedifferenttheoriesaboutbilingualisminCanadaandimprovetheiragreementwiththedata.TheyalsoobservethatpatternsinQuébecdiffersignificantlyfromthoseintherestofthecountry.InhisbookCapitalintheTwenty–FirstCentury,Piketty(2014)discussestheevolutionofreferencestomoneyinnovelsasareflectionofinequalitiesinthesociety.Underwoodetal.(2014)digfurtherintothequestionbylistingallreferencestomoneyamountsinEnglish–languagenovelsfrom1750to1950.Theyfindthattherelativefrequencyofreferencestomoneyamountsincreasesthroughtime,buttheamountmentioneddecreasesinconstantdollaramount.Theydonotattempttodrawstrongconclusionsfromthissocialevidence,mentioningforinstancethatthechangesinthereadingaudienceduetoanincreasedliteracyamongtheworkingclassmayalsoplayaroleintheevolutionofthenovels.Onalargerscale,Micheletal.(2011)analyzeabout4%ofallbookseverprintedtodescribesocialtrendsthroughtimeaswellaschangesinlinguistics.Whilepatternsinthedatamatch 5 http://deeplearning.cs.toronto.edu/i2t

Page 15: Statistical Inference, Learning and Models in Big Datadata into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity,

milestoneseventssuchastheGreatWarsandsomeglobalepidemics,theevolutionofwordssuchas“feminism”measureimportantsocialtrends.Localeventssuchasthecensorshipofdifferentartistsforanumberofyearsindifferentcountriesarealsovisibleinthedata.Thosearesomeexamplesof‘culturomics’,thequantitativeanalysisofdigitizedtexttostudyculturaltrendsandhumanbehavior.

4.9MaterialsScienceInanapplicationtoMaterialsSciencebyNelsonetal.(2013),thepropertiesofabinaryalloydependontherelativepositionofatomsfrombothmaterialswhosedescriptionreliesonaverylargenumberofexplanatoryvariables.Surprisingly,asCandèsetal.(2006)showed,arandomgenerationofcandidatematerialsismoreefficientthanacarefullydesignedchoice.Thisisanexamplewherebigdatashowsacounterintuitivebehaviour,sinceonecouldthinkthatcarefuldesignofexperimentshouldbeabletobeatrandomselections.

5.ConclusionWhiletechnicalachievementsmaketheexistenceandavailabilityofverylargeamountsofdatapossible,thechallengesemergingfromsuchdatagoesbeyondprocessing,storingandaccessingrecordsrapidly.Asitstitleindicates,thethematicprogramonStatisticalInference,LearningandModelsforBigDatatriedtofocusthediscussionparticularlyonstatisticalissues.Thereareagreatmanyfieldsofresearchinwhichbigdataplaysakeyrole,andithasnotbeenpossibletosurveyallofit.

Inthisreport,weidentifychallengesandsolutionsthatarecommonacrossdifferentdisciplinesworkingoninferenceandlearningforbigdata.Withnewmagnitudesofdimensionality,complexityandheterogeneityinthedata,researchersarefacedwithnewchallengesofscalability,qualityandvisualization.Inaddition,bigdataisoftenobservationalinnature,havingbeencollectedforapurposeotherthantheoneintended,andhencemakinginferenceevenmorechallenging.Emergingtechnologicalsolutionsprovidenewopportunities,buttheyalsodefinenewparadigmsfordataandmodels.Makingthemodelsbiggerwithlayersoflatentvariablesmayprovideaninsightintothefeaturesofthedata,whilereducingthedatatolowerdimensionalmodelscanbekeyingettinganycalculationsthrough.Lowerdimensionalstructureanalysismanifestsitselfnotonlyassparsityassumptionsanddimensionalityreduction,butalsoinvisualization.Othertechniquessuchasnon–convexalgorithms,samplingstrategiesandmatrixrandomizationaidintheprocessingoflargedatasets.Withmassivedata,settlingforanapproximatesolutionseemstobethesavingcompromise,whetherbyallowingformoreflexibilityinanobjectivefunctionorbyreplacingastandardalgorithmthatisinfeasibleatsuchalargescale.

Manyofthemoststrikingexamplesofbigdataanalysispresentedduringthisthematicprogramcamefromteamsofmanyscientistswithcomplementarybackgrounds.Movingforward,researchteamsneedtoscaleup,invitingtalentsfromdifferentfieldstocometogetherandjoinforces.Thephrase“BigData”hashadagreatdealofmediaattentioninrecentyears,creatingahypethatwillcertainlyfade.Thenewproblemsthathaveemerged,however,areheretostayandtheneedforinnovativesolutionswillcertainlykeepourcommunitybusyformanyyearstocome.

Page 16: Statistical Inference, Learning and Models in Big Datadata into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity,

AcknowledgmentsWearegratefulfortheFieldsInstituteforResearchintheMathematicalSciences,whichprovidedthemajorportionofthefundingforthethematicprogramonStatisticalInference,LearningandModelsinBigData.TheFieldsInstituteisfundedbytheNaturalSciencesandEngineeringResearchCouncilofCanadaandtheMinistryofTraining,CollegesandUniversitiesinOntario.FundingwasalsoprovidedforspecificworkshopsbythePacificInstitutefortheMathematicalSciences,theCentredeRecherchesMathématiques,theAtlanticAssociationforAdvancementintheMathematicalSciences,theDepartmentofStatisticsatIowaStateUniversity,theNationalScienceFoundation,andtheCanadianStatisticalSciencesInstitute.

Wethankthereviewersofanearlierversionofthepaperforhelpfulandconstructivecommentsthatimprovedthescopeandpresentation.

References1000GenomesProjectConsortium(2012).Anintegratedmapofgeneticvariationfrom1,092humangenomes.Nature491(7422)56–65.

AHMED,S.E.(2014).Penalty,shrinkageandpreteststrategies:Variableselectionandestimation.SpringerBriefsinStatistics.Springer.Cham.

ALBERT–GREEN,A.,BRAUN,W.J.,MARTELL,D.L.andWOOLFORD,D.G.(2014).VisualizationtoolsforassessingtheMarkovproperty:sojourntimesintheforestfireweatherindexinOntario.Environmetrics25(6)417–430.

AMARI,S.–I.andNAGAOKA,H.(2000).Methodsofinformationgeometry(Vol.191).TranslationsofMathematicalMonographs.AmericanMathematicalSociety.USA.ANANDKUMAR,A.,GE,R.,HSU,D.,KAKADE,S.M.andTELGARSKY,M.(2014).Tensordecompositionsforlearninglatentvariablemodels.TheJournalofMachineLearningResearch15(1)2773–2832.ANANDKUMAR,A.,HSU,D.,JANZAMIN,M.andKAKADE,S.M.(2013).Whenareovercompletetopicmodelsidentifiable?Uniquenessoftensortuckerdecompositionswithstructuredsparsity.Burges,C.J.C.,Bottou,L.,Welling,M.,Ghahramani,Z.andWeinberger,K.Q.(Eds.).InAdvancesinNeuralInformationProcessingSystems261986–1994.

BALAKRISHNAN,S.,WAINWRIGHT,M.J.andYU,B.(2015).StatisticalguaranteesfortheEMalgorithm:Frompopulationtosample–basedanalysis.AnnalsofStatistics. Accepted.

BARBOZA,L.,LI,B.,TINGLEY,M.P.andVIENS,F.G.(2014).Reconstructingpasttemperaturesfromnaturalproxiesandestimatedclimateforcingsusingshort–andlong–memorymodels.TheAnnalsofAppliedStatistics8(4)1966–2001.

BAUDISCH,P.andROSENHOLTZ,R.(2003).Halo:atechniqueforvisualizingoff–screenobjects.InProceedingsoftheSIGCHIconferenceonhumanfactorsincomputingsystemsCHI'03481–488.ACM.

BELL,P.(2013). Creatingcompetitiveadvantageusingbigdata. IveyBusinessJournal 1-5.BELL,R.M.(2015). BigData:It'sNottheData.PresentedattheFieldsInstitute,January12th2015.availableat http://www.fields.utoronto.ca/video-archive/static/2015/01/315-4206/mergedvideo.ogv,accessedonJanuary26,2016.

Page 17: Statistical Inference, Learning and Models in Big Datadata into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity,

BENGIO,Y.,COURVILLE,A.andVINCENT,P.(2013).Representationlearning:Areviewandnewperspectives.PatternAnalysisandMachineIntelligence.IEEE35(8)1798–1828.

BIOUCAS–DIAS,J.M.,PLAZA,A.,CAMPS–VALLS,G.,SCHEUNDERS,P.,NASRABADI,N.M.andCHANUSSOT,J.(2013).Hyperspectralremotesensingdataanalysisandfuturechallenges.GeoscienceandRemoteSensingMagazine,IEEE1(2)6–36.

BLUM,M.G.andTRAN,V.C.(2010).HIVwithcontacttracing:acasestudyinapproximateBayesiancomputation.Biostatistics11(4)644–660.

BRIEN,S.,NADERI,N.,SHABAN–NEJAD,A.,MONDOR,L.,KROEMKER,D.andBUCKERIDGE,D.L.(2013).Vaccineattitudesurveillanceusingsemanticanalysis:Constructingasemanticallyannotatedcorpus.InProceedingsofthe22ndinternationalconferenceonWorldWideWebcompanion,Janeiro,Brazil,13–17May2013,683–686.InternationalWorldWideWebConferencesSteeringCommittee.

BROWN,P.E.,NGUYEN,P.,STAFFORD,J.andASHTA,S.(2013).Local–EMandmismeasureddata.StatisticsandProbabilityLetters83(1)135–140.

BUCKERIDGE,D.L.,CHARLAND,K.,LABBAN,A.andMA,Y.(2014).Amethodforneighborhood-levelsurveillanceoffoodpurchasing.AnnalsoftheNewYorkAcademyofSciences1331(1)270–277.BUCKERIDGE,D.L.,IZADI,M.,SHABAN–NEJAD,A.,MONDOR,L.,JAUVIN,C.,DUBÉ,L.,JANG,Y.andTAMBLYN,R.(2012).Aninfrastructureforreal–timepopulationhealthassessmentandmonitoring.IBMJournalofResearchandDevelopment56(5)2:1–2:11.BUCKERIDGE,D.L.,OWENS,D.K.,SWITZER,P.,FRANK,J.andMUSEN,M.A.(2006).Evaluatingdetectionofaninhalationalanthraxoutbreak.EmergingInfectiousDiseases12(12)1942–1949.BUJA,A.,COOK,D.,HOFMANN,H.,LAWRENCE,M.,LEE,E.K.,SWAYNE,D.F.andWICKHAM,H.(2009).Statisticalinferenceforexploratorydataanalysisandmodeldiagnostics.PhilosophicalTransactionsoftheRoyalSocietyA:Mathematical,PhysicalandEngineeringSciences367(1906)4361–4383.

Bᴜᴛʟᴇʀ,D.(2013).WhenGooglegotfluwrong.Nature494(7436)155-156.CANDÈS,E.J.,ROMBERG,J.andTAO,T.(2006).Robustuncertaintyprinciples:Exactsignalreconstructionfromhighlyincompletefrequencyinformation.InformationTheory,IEEE52(2)489–509.CANDÈS,E.J.,WAKIN,M.B.andBOYD,S.P.(2008).Enhancingsparsitybyreweightedℓ1minimization.JournalofFourieranalysisandapplications14(5–6)877–905.

CAPPS,C.andWRIGHT,T.(2013).Towardavision:Officialstatisticsandbigdata.AMSTATNEWS.http://magazine.amstat.org/blog/2013/08/01/official–statistics/accessedonJanuary26,2016.

CARAGEA,D.,SILVESCU,A.,andHONAVAR,V.(2004).Aframeworkforlearningfromdistributeddatausing sufficient statistics and its application to learningdecision trees. International Journal ofHybridIntelligentSystems,1(1–2)80–89.

CARPENDALE,S.(2015).WebsiteoftheInnovisResearchGroupavailableatHTTP://INNOVIS.CPSC.UCALGARY.CA/RESEARCH/RESEARCH,accessedonJanuary26,2015.

Page 18: Statistical Inference, Learning and Models in Big Datadata into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity,

CARPENDALE,S.,LIGH,J.andPATTISON,E.(2004).Achievinghighermagnificationincontext.InProceedingsofthe17thannualACMsymposiumonUserinterfacesoftwareandtechnology,SantaFe,USA,24–27October2004,71–80.ACM.CHEN,M.,MAO,S.andLIU,Y.(2014).Bigdata:Asurvey.MobileNetworksandApplications19(2)171–209.

COOK,D.(2014).Visiphilia:Thegendergapinmathisnotuniversal.CHANCE27(4)48–52.CHUNG,F.andLU,L.(2002).Theaveragedistancesinrandomgraphswithgivenexpecteddegrees.ProceedingsoftheNationalAcademyofSciences99(25)15879–15882.Daas,P.J.H.,Puts,M.J.,Buelens,B.andVandenHurk.P.A.M.(2015)BigDataasasourceforofficialstatistics.JournalofOfficialStatistics,31(2)249–262.

DEAN,C.,BRAUN,W.J.,MARTELL,D.andWOOLFORD,D.(2012).Interdisciplinarytraininginacollaborativeresearchenvironment.Chrisman,N.andWachowicz,M.(Eds.).Theaddedvalueofscientificnetworking:PerspectivesfromtheGEOIDEnetworkmembers1998–2012(pages69-73).GEOIDENetwork,Quebec.

DONOHO,D.andGAVISH,M.(2014).Minimaxriskofmatrixdenoisingbysingularvaluethresholding.AnnalsofStatistics42(6)2413–2440.DONOHO,D.,GAVISH,M.ANDMONTANARI,A.(2013).ThephasetransitionofmatrixrecoveryfromGaussianmeasurementsmatchestheminimaxMSEofmatrixdenoising.ProceedingsoftheNationalAcademyofSciences110(21)8405–8410.DRUMMOND,C.,MATWIN,S.,&GAFFIELD,C.(2006).Inferringandrevisingtheorieswithconfidence:analyzingbilingualisminthe1901Canadiancensus.AppliedArtificialIntelligence20(1)1–33.DUHIGG,C.(2012).Howcompanieslearnyoursecrets.TheNewYorkTimes.February16,2012.

ELDAR,Y.C.ANDKUTYNIOK,G.(2012).Compressedsensing:Theoryandapplications.CambridgeUniversityPress.NewYork.FAN,J.andLI,R.(2001).Variableselectionvianonconcavepenalizedlikelihoodanditsoracleproperties.JournaloftheAmericanStatisticalAssociation96(456)1348–1360.FANG,E.X.,NING,Y.ANDLIU,H.(2014).Testingandconfidenceintervalsforhighdimensionalproportionalhazardsmodel.arXivpreprintarXiv:1412.5158.

ENSOR,K.B.,RAUN,L.H.andPERSSE,D.(2013).Acase–crossoveranalysisofout–of–hospitalcardiacarrestandairpollution.Circulation1271192–1199.

GAFFIELD,C.(2015).Bigdatavs.humancomplexity:anearlystatusreportonthecentralquestionofthe21stcentury.PresentedattheFieldsInstitute,January20th2015.availableathttp://www.fields.utoronto.ca/video–archive/2015/01/315–4136,accessedonJanuary26,2016.GEIGER,A.,LENZ,P.andURTASUN,R.(2012).Arewereadyforautonomousdriving?Thekittivisionbenchmarksuite.InComputerVisionandPatternRecognition,IEEECVPR20123354–3361.

GINSBERG,J.,MOHEBBI,M.H.,PATEL,R.S.,BRAMMER,L.,SMOLINSKI,M.S.andBRILLIANT,L.(2009).Detectinginfluenzaepidemicsusingsearchenginequerydata.Nature457(7232)1012–1014.

Page 19: Statistical Inference, Learning and Models in Big Datadata into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity,

GIROLAMI,M.andCALDERHEAD,B.(2011).RiemannmanifoldLangevinandHamiltonianMonteCarlomethods.JournaloftheRoyalStatisticalSociety:SeriesB(StatisticalMethodology)73(2)123–214.GROSSE,R.andSALAKHUTDINOV,R.(2015).ScalingupnaturalgradientbysparselyfactorizingtheinverseFisherinformationmatrix.Bach,F.andBlei,D.(Eds.).InProceedingsofThe32ndInternationalConferenceonMachineLearning(ICML-15)2304–2313.GUO,Y.andSCHUURMANS,D.(2007).Convexrelaxationsoflatentvariabletraining.Platt,J.C.,Koller,D.,Singer,Y.andRoweis,S.T.(Eds.).InAdvancesinNeuralInformationProcessingSystems20601–608.

GUPTA,V.andLEHAL,G.S.(2009).Asurveyoftextminingtechniquesandapplications.Journalofemergingtechnologiesinwebintelligence1(1)60–76.GUYON,I.andELISSEEFF,A.(2003).Anintroductiontovariableandfeatureselection.TheJournalofMachineLearningResearch31157–1182.HALKO,N.,MARTINSSON,P.G.andTROPP,J.A.(2011).Findingstructurewithrandomness:Probabilisticalgorithmsforconstructingapproximatematrixdecompositions.SIAMReview53(2)217–288.HARFORD,T.(2014).Bigdata:arewemakingabigmistake?FinancialTimes.March 24, 2014.

HINTON,G.E.andSALAKHUTDINOV,R.R.(2006).Reducingthedimensionalityofdatawithneuralnetworks.Science313(5786)504–507.HODAS,N.O.andLERMAN,K.(2014).Thesimplerulesofsocialcontagion.ScientificReports4(4343).HOENS,T.R.,POLIKAR,R.,CHAWLA,N.V.(2012).Learningfromstreamingdatawithconceptdriftandimbalance:Anoverview.ProgressinArtificialIntelligence1(1)89–101.

HOSSEINKHANI,M.,FONG,P.W.L.andCARPENDALE,S.(2014).Papilio:VisualizingAndroidapplicationpermissions.Carr,H.,Rheingans,P.,andSchumann,H.(Eds.).InProceedingsoftheEuroVisConference33(3)391–400.JohnWiley&SonsLtd.HURON,S.,CARPENDALE,S.,THUDT,A.,TANG,A.andMAUERER,M.(2014).Constructivevisualization.InProceedingsofthe2014conferenceonDesigninginteractivesystems, Vancouver,Canada,June21-25,2014,433–442.ACM.IZADI,M.,SHABAN–NEJAD,A.,OKHMATOVSKAIA,A.,MONDOR,L.andBUCKERIDGE,D.L.(2013).Populationhealthrecord:Aninformaticsinfrastructureformanagement,integration,andanalysisoflargescalepopulationhealthdata.ProceedingsofAAAIHIAI201335–40.AAAI.JAIN,S.,LANGE,S.andZILLES,S.(2006).Towardsabetterunderstandingofincrementallearning.InAlgorithmicLearningTheory(169–183).Springer.Berlin.JAVANMARD,A.andMONTANARI,A.(2014a).Hypothesistestinginhigh–dimensionalregressionundertheGaussianrandomdesignmodel:Asymptotictheory.InformationTheory,IEEE60(10)6522–6554.JAVANMARD,A.andMONTANARI,A.(2014b).Confidenceintervalsandhypothesistestingforhigh–dimensionalregression.TheJournalofMachineLearningResearch15(1)2869–2909.

Page 20: Statistical Inference, Learning and Models in Big Datadata into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity,

JOHNSON,V.E.(2003).GradeInflation:Acrisisincollegeeducation,Springer–Verlag,NewYork.

KALAL,Z.,MIKOLAJCZYK,K.andMATAS,J.(2012).Tracking–learning–detection.PatternAnalysisandMachineIntelligence,IEEE34(7)1409–1422.KELLER,S.(2015).Buildingresilientcities:Harnessingthepowerofurbananalytics.PresentedattheFieldsInstitute,January20th2015.availableathttp://www.fields.utoronto.ca/video–archive/static/2015/01/315–4137/mergedvideo.ogv,accessedonJanuary26,2016.KIRON,D.,andSHOCKLEY,R.(2011). Creatingbusinessvaluewithanalytics. MITSloanManagementReview 53(1)57-63.KLEINER,A.,TALWALKAR,A.,SARKAR,P.andJORDAN,M.I.(2014).Ascalablebootstrapformassivedata.JournaloftheRoyalStatisticalSocietySeriesB76(4)795–816.

KOLACZYK,E.D.(2009).Statisticalanalysisofnetworkdata:Methodsandmodels.SpringerScienceandBusinessMedia,NewYork.

KORKMAZ,G.,CADENA,J.,KUHLMAN,C.J.,MARATHE,A.,VULLIKANTI,A.andRAMAKRISHNAN,N.(2015).Combiningheterogeneousdatasourcesforcivilunrestforecasting.InProceedingsofthe2015IEEE/ACMInternationalConferenceonAdvancesinSocialNetworksAnalysisandMiningASONAM'15258–265.ACM.KRAMER,A.D.,GUILLORY,J.E.,andHANCOCK,J.T.(2014).Experimentalevidenceofmassive–scaleemotionalcontagionthroughsocialnetworks.ProceedingsoftheNationalAcademyofSciences,111(24)8788–8790.LEE,J.S.,NGUYEN,P.,BROWN,P.E.,STAFFORD,J.andSAINT–JACQUES,N.(2015).Local–EMAlgorithmforspatio–temporalanalysiswithapplicationinSouthwesternNovaScotia,UnpublishedManuscript.LETOUZÉ,E.ANDJÜTTING,J.(2014).OfficialStatistics,bigdataandhumandevelopment:Towardsanewconceptualandoperationalapproach.Dat–PopAllianceWhitePapersSeries.http://www.odi.org/sites/odi.org.uk/files/odi-assets/events-documents/5161.pdf,accessedonJanuary26,2016.

LI,Y.,BROWN,P.,GESINK,D.C.andRUE,H.(2012).LogGaussianCoxprocessesandspatiallyaggregateddiseaseincidencedata.StatisticalMethodsinMedicalResearch21(5)479–507.

LI,L.J.andFEI–FEI,L.(2010).Optimol:Automaticonlinepicturecollectionviaincrementalmodellearning.InternationalJournalofComputerVision88(2),147–168.Lin,N.,andXi,R.(2011).Aggregatedestimatingequationestimation. StatisticsandItsInterface 473–83. LIX,L.M.,SMITH,M.,AZIMAEE,M.,DAHL,M.,NICOL,P.,BURCHILL,C.,BURLAND,E.,…andBAILLYA.(2012).AsystematicinvestigationofManitoba’sprovinciallaboratorydata.Winnipeg,MB:ManitobaCentreforHealthPolicy.

LOCKHART,R.,TAYLOR,J.,TIBSHIRANI,R.J.andTIBSHIRANI,R.(2014).Asignificancetestforthelasso.AnnalsofStatistics42(2)413–468.LOH,P.L.andWAINWRIGHT,M.J.(2014).Supportrecoverywithoutincoherence:Acasefornonconvexregularization.arXivpreprintarXiv:1412.5632.

Page 21: Statistical Inference, Learning and Models in Big Datadata into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity,

LOH,P.L.andWAINWRIGHT,M.J.(2015).RegularizedM–estimatorswithnonconvexity:Statisticalandalgorithmictheoryforlocaloptima.JournalofMachineLearningResearch16559–616.

Lohr,S.(2012).Theageofbigdata.NewYorkTimes.February11,2012.MARCHETTI,S.,GIUSTI,C.,PRATESI,M.,SALVATI,N.,GIANNOTTI,F.,PEDRESCHI,D.,RINZIVILLO,S.,PAPPALARDO,L.ANDGABRIELLI,L.(2015).Smallareamodel–basedestimatorsusingbigdatasources.JournalofOfficialStatistics31(2)263–281.

MCAFFEE,A.ANDBRYNJOLFSONN,E.(2012).Bigdata:Themanagementrevolution,HarvardBusinessReview.March18,2015.

MICHEL,J.B.,SHEN,Y.K.,AIDEN,A.P.,VERES,A.,GRAY,M.K.,PICKETT,J.P.,...&AIDEN,E.L.(2011).Quantitativeanalysisofcultureusingmillionsofdigitizedbooks.Science331(6014)176–182.MEYER,M.N.(2014).EverythingyouneedtoknowaboutFacebook’scontroversialemotionexperiment.Wired.RetrievedJuly03,2014.MONTELEONI,C.,SCHMIDT,G.A.,SAROHA,S.andASPLUND,E.(2011).Trackingclimatemodels.StatisticalAnalysisandDataMining:TheASADataScienceJournal4(4)372–392.

MURRAY,C.J.,VOS,T.,LOZANO,R.,NAGHAVI,M.,FLAXMAN,A.D.,MICHAUD,C.,...andBRIDGETT,L.(2013).Disability–adjustedlifeyears(DALYs)for291diseasesandinjuriesin21regions,1990–2010:AsystematicanalysisfortheGlobalBurdenofDiseaseStudy2010.TheLancet380(9859)2197–2223.NACENTA,M.,HINRICHS,U.andCARPENDALE,S.(2012).FatFonts:Combiningthesymbolicandvisualaspectsofnumbers.Tortora,G.,Levialdi,S.andTucci,M.(Eds.).InProceedingsoftheInternationalWorkingConferenceonAdvancedVisualInterfaces. Naples,Italy,May22-25,2012,407–414.ACM.

NELSON,L.J.,OZOLIŅŠ,V.,REESE,C.S.,ZHOU,F.andHART,G.L.(2013).ClusterexpansionmadeeasywithBayesiancompressivesensing.PhysicalReviewB88(15),155105.

NETRAPALLI,P.,NIRANJAN,U.N.,SANGHAVI,S.,ANANDKUMAR,A.andJAIN,P.(2014).Non–convexrobustPCA.Ghahramani,Z.,Welling,M.,Cortes,C.,Lawrence,N.D.andWeinberger,K.Q.(Eds.).InAdvancesinNeuralInformationProcessingSystems271107–1115.

NEUMANN,P.,CARPENDALE,M.S.T.andAGARAWALA,A.(2006).PhylloTrees:Phyllotacticpatternsfortreelayout.Ertl,T.,Joy,K.andSantos,B.(Eds.).InProceedingsofEurographics/IEEEVGTCSymposiumonVisualizationEuroVis200659–66.

NGUYEN,P.,BROWN,P.E.andSTAFFORD,J.(2012).MappingcancerriskinsouthwesternOntariowithchangingcensusboundaries.Biometrics68(4)1228–1237.

OLHEDE,S.C.andWOLFE,P.J.(2014).Networkhistogramsanduniversalityofblockmodelapproximation.ProceedingsoftheNationalAcademyofSciences111(41)14722–14727.

PANG,B.andLEE,L.(2008).Opinionminingandsentimentanalysis.Foundationsandtrendsininformationretrieval2(1–2)1–135.PIKETTY,T.(2014).Capitalinthetwenty–firstcentury.HarvardUniversityPress.Cambridge,USA.

PUTS,M.,DAAS,P.andDEWAAL,T.(2015).Findingerrorsinbigdata.Significance12(3)26–29.

Page 22: Statistical Inference, Learning and Models in Big Datadata into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity,

RAMÍREZ–RAMÍREZ,L.L.,GEL,Y.R.,THOMPSON,M.,DEVILLA,E.andMCPHERSON,M.(2013).Anewsurveillanceandspatio–temporalvisualizationtoolSIMID:SimulationofinfectiousdiseasesusingrandomnetworksandGIS.ComputerMethodsandProgramsinBiomedicine110(3)455–470.

RAO,C.R.(1945).Informationandtheaccuracyattainableintheestimationofstatisticalparameters.BulletinoftheCalcuttaMath.Soc.3781–89.Rᴇᴇsᴇ,C.S.,Tᴏʟʟᴇʏ,H.D.,Kᴇɪᴛʜ,J.,andBᴜʀʟɪɴɢᴀᴍᴇ,G.(2014).Unravelingthemysteryofgradeinflationandstudentratings:ABYUCaseStudy,BYUDepartmentofStatisticsTechnicalReport,TR–14–021.

ROBERTSON,J.,MCELDUFF,P.,PEARSON,S.A.,HENRY,D.A.,INDER,K.J.andATTIA,J.R.(2012).Thehealthservicesburdenofheartfailure:Ananalysisusinglinkedpopulationhealthdata–sets.BMCHealthServicesResearch12(103)1–11.

SALAKHUTDINOV,R.andHINTON,G.(2012).AnefficientlearningprocedurefordeepBoltzmannmachines.NeuralComputation24(8)1967–2006.

SAMPSON, P. D., andGUTTORP, P. (1992). Nonparametricestimationofnonstationaryspatialcovariancestructure.JournaloftheAmericanStatisticalAssociation,87(417)108–119. SCHADT,E.E.,LINDERMAN,M.D.,SORENSON,J.,LEE,L.andNOLAN,G.P.(2010).Computationalsolutionstolarge–scaledatamanagementandanalysis.NatureReviewsGenetics11(9)647–657. SCHMIDT,A.M.,GUTTORP,P.andO'HAGAN,A.(2011).Consideringcovariatesinthecovariancestructureofspatialprocesses.Environmetrics22(4),487–500.

SCOTT,S.L.(2015).Bayesandbigdata:TheconsensusMonteCarloalgorithm.PresentedattheFieldsInstitute,January13th2015.availableathttp://www.fields.utoronto.ca/video–archive/static/2015/01/315–4117/mergedvideo.ogv,accessedonJanuary26,2016.

SCOTT,S.L.,BLOCKER,A.W.,BONASSI,F.V.,CHIPMAN,H.A.,GEORGE,E.I.andMCCULLOCH,R.E.(2013).Bayesandbigdata:[email protected],USA,December15-17,2013.SHVACHKO,K.,KUANG,H.,RADIA,S.,&CHANSLER,R.(2010).TheHadoopdistributedfilesystem.InMassStorageSystemsandTechnologies(MSST),2010IEEE261–10.

SINGEL,R.(2009).NetflixspilledyourBrokebackMountainSecret,lawsuitclaims.Wired.December17,2009.

SINGEL,R.(2010).Netflixcancelsrecommendationcontestafterprivacylawsuit.Wired.March12,2010.SRIVASTAVA,N.andSALAKHUTDINOV,R.R.(2012).MultimodallearningwithdeepBoltzmannmachines.Pereira,F.,Burges,C.J.C.,Bottou,L.andWeinberger,K.Q.(Eds.).InAdvancesinNeuralInformationProcessingSystems252222–2230.

STRUIJS,P.,BRAAKSMA,B.ANDDAAS,P.J.H.(2014).Officialstatisticsandbigdata.BigDataandSociety.1(1)1-6.DOI:10.1177/2053951714538417.

Page 23: Statistical Inference, Learning and Models in Big Datadata into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity,

TAM,S.M.ANDCLARKE,F.(2015).Bigdata,officialstatisticsandsomeinitiativesbytheAustralianBureauof Statistics.InternationalStatisticalReview83(3)436–448.

TARRAN,B.(2014).Nowyouseeme,nowyoudon’t.Significance11(4)11–14.TIBSHIRANI,R.(1996).Regressionshrinkageandselectionviathelasso.JournaloftheRoyalStatisticalSociety58(1)267–288.

TUKEY,J.W.(1977).Exploratorydataanalysis.Addison–Wesley.Reading,USA.UNDERWOOD,T.,LONG,H.andRICHARD,J.S.(2014).CentsandSensibility:TrustThomasPikettyoneconomicinequality.Ignorewhathesaysaboutliterature.BlogpostonSlate.com.Retrievedfromhttp://www.slate.com/articles/business/moneybox/2014/12/thomas_piketty_on_literature_balzac_austen_fitzgerald_show_arc_of_money.htmlaccessedonJanuary26,2016.

VANDEGEER,S.,BÜHLMANN,P.,RITOV,Y.A.andDEZEURE,R.(2014).Onasymptoticallyoptimalconfidenceregionsandtestsforhigh–dimensionalmodels.AnnalsofStatistics42(3)1166–1202.

WICKHAM,H.,COOK,D.,HOFMANN,H.andBUJA,A.(2010).GraphicalinferenceforInfovis.VisualizationandComputerGraphics,IEEE16(6)973–979.

WITTEN,R.andCANDÈS,E.(2013).Randomizedalgorithmsforlow–rankmatrixfactorizations:Sharpperformancebounds.Algorithmica72(1)264–281.WOOLFORD,D.G.,BRAUN,W.J.,DEAN,C.B.andMARTELL,D.L.(2009).Site–specificseasonalbaselinesforfireriskinOntario.Geomatica63(4)355–363.

WOOLFORD,D.G.andBRAUN,W.J.(2007).Convergentdatasharpeningfortheidentificationandtrackingofspatialtemporalcentersoflightningactivity.Environmetrics18(5)461–479.

WU,W.B.andPOURAHMADI,M.(2003).Nonparametricestimationoflargecovariancematricesoflongitudinaldata.Biometrika90(4)831–844.

ZOU,H.(2006).Theadaptivelassoanditsoracleproperties.JournaloftheAmericanStatisticalAssociation101(476)1418–1429.