Statistical Inference, Learning and Models in Big Data

Franke, Beate; Plante, Jean-François; Roscher, Ribana; Lee, Annie; Smyth, Cathal; Hatefi, Armin; Chen, Fuqi; Gil, Einat; Schwing, Alexander; Selvitella, Alessandro; Hoffman, Michael M.; Grosse, Roger; Hendricks, Dietrich and Reid, Nancy.

Abstract

The need for new methods to deal with big data is a common theme in most scientific fields, although its definition tends to vary with the context. Statistical ideas are an essential part of this, and as a partial response, a thematic program on statistical inference, learning, and models in big data was held in 2015 in Canada, under the general direction of the Canadian Statistical Sciences Institute, with major funding from, and most activities located at, the Fields Institute for Research in Mathematical Sciences. This paper gives an overview of the topics covered, describing challenges and strategies that seem common to many different areas of application, and including some examples of applications to make these challenges and strategies more concrete.

Keywords

Aggregation, computational complexity, dimension reduction, high-dimensional data, streaming data, networks

1. Introduction

Big data provides big opportunities for statistical inference, but perhaps even bigger challenges, especially when compared to the analysis of carefully collected, usually smaller, sets of data. From January to June 2015, the Canadian Statistical Sciences Institute organized a thematic program on Statistical Inference, Learning and Models in Big Data. It became apparent within the first two weeks of the program that a number of common issues arose in quite different practical settings. This paper arose from an attempt to distill these common themes from the presentations and discussions that took place during the thematic program. Scientifically, the program emphasized the roles of statistics, computer science, and mathematics in obtaining scientific insight from big data. Two complementary strands were introduced: cross-cutting, or foundational, research that underpins analysis, and domain-specific research focused on particular application areas. The former category included machine learning, statistical inference, optimization, network analysis, and visualization. Topic-specific workshops addressed problems in health policy, social policy, environmental science, cyber-security and social networks. These divisions are not rigid, of course, as foundational and application areas are part of a feedback cycle in which each inspires developments in the other. Some very important application areas where big data is fundamental could not be made the subject of focussed workshops, but many of these applications did feature in individual presentations. The program started with an opening conference [1] that gave an overview of the topics of the 6-month program [2]. All the talks presented at the Fields Institute are available online through FieldsLive [3].

Technological advances enable us to collect more and more data: already in 2012 it was estimated that data collection was growing at 50% per year (Lohr, 2012). While there is much important research underway in improving the processing, storing and rapid accessing of records (Chen et al., 2014; Schadt et al., 2010), this program emphasized the challenges for modelling, inference and analysis.

Two striking features of the presentations during the program were the breadth of topics, and the remarkable commonalities that emerged across this broad range. We attempt here to illustrate many of the common issues and solutions that arose, and provide a picture of the status of big data research in the statistical sciences. While the report is largely inspired by the series of talks at the opening conference and related references, it also touches on issues that arose throughout the program. In the next section we identify a number of challenges that are commonly identified in the world of big data, and in Section 3 we describe some examples of general inference strategies that are being developed to address these challenges. Section 4 highlights several applications, to make some of these ideas more concrete.

2. “Big Data, it’s not the data”

There is no unique definition of big data. The title of this section is taken from the opening presentation by Bell (2015), who paraphrased Supreme Court Justice Potter Stewart by saying “I shall not today further attempt to define big data, but I know it when I see it”. Bell also pointed out that the definition of ‘big’ is evolving with computational resources. Gaffield (2015) emphasized that people and companies are often captivated by the concept of big data because it is about us, humans. Big data provides us with detailed information, often in real time, making it possible to measure properties and patterns of human behaviour that formerly could not be measured. An analogy suggested by Keller (2015) is to the Hubble telescope: we can now see constellations with great complexity that previously appeared only as bright dots. Beyond natural human curiosity, big data draws interest because it is valuable. A large part of the hype around big data comes from rapidly increasing investment from businesses seeking a competitive advantage (e.g. McAfee and Brynjolfsson, 2012; Bell, 2013; Kiron and Shockley, 2011). National Statistics Agencies are exploring how to integrate big data to benchmark, complement, or create data products (Capps and Wright, 2013; Struijs et al., 2014; Letouzé and Jütting, 2014; Tam and Clarke, 2015). In many fields of research costs of data acquisition are now lower, sometimes much lower, than the costs of data analysis. The difficulty of transforming big data into knowledge is related to its complexity, the essence of which is broadly captured by the “Four Vs”: volume, velocity, variety and veracity.

[1] The program for the opening conference is at http://www.fields.utoronto.ca/programs/scientific/14-15/bigdata/boot/
[2] A description of the complete thematic program is at https://www.fields.utoronto.ca/programs/scientific/14-15/bigdata/
[3] The Fields Institute video archive link is http://www.fields.utoronto.ca/video-archive

2.1 Volume

Handling massive datasets may require a special infrastructure such as a distributed Hadoop cluster (e.g. Shvachko et al., 2010) where a very large number of records is stored in separate chunks scattered over many nodes. As the costs of processors, memory and disk space continue to decrease, the limiting factor now is often communication between machines (Scott et al., 2013). As a consequence, it is impossible or impractical to access the whole dataset on a given processor, and many methods will fail because they were not built to be implemented from separate pieces of data. Developments are required, and being made, to make widely-used statistical methods available for such distributed data. Two examples are the development of estimating equation theory in Lin and Xi (2011), and the consensus Monte Carlo algorithm of Scott et al. (2013).
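The requirement that a method work from separate pieces of data can be illustrated with a minimal Python sketch. It assumes that each node reduces its chunk to three numbers (count, sum, sum of squares) and that only these summaries are communicated. The function names `chunk_summary` and `combine_summaries` and the example chunks are hypothetical, and the sketch illustrates the general idea only, not the estimating-equation approach of Lin and Xi (2011) or consensus Monte Carlo.

```python
from typing import Iterable, List, Tuple

Summary = Tuple[int, float, float]  # (count, sum, sum of squares)


def chunk_summary(chunk: Iterable[float]) -> Summary:
    """Reduce one chunk, held on one node, to its sufficient statistics."""
    n, s, ss = 0, 0.0, 0.0
    for x in chunk:
        n += 1
        s += x
        ss += x * x
    return n, s, ss


def combine_summaries(summaries: List[Summary]) -> Tuple[float, float]:
    """Pool per-chunk summaries into the overall mean and sample variance."""
    n = sum(m[0] for m in summaries)
    s = sum(m[1] for m in summaries)
    ss = sum(m[2] for m in summaries)
    mean = s / n
    variance = (ss - n * mean * mean) / (n - 1)
    return mean, variance


# Three hypothetical chunks standing in for data scattered over three nodes.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
pooled_mean, pooled_var = combine_summaries([chunk_summary(c) for c in chunks])
```

Only three numbers per node cross the network, whatever the chunk size. The sum-of-squares formula is kept for brevity; it can lose precision on very large or badly scaled data, where a more careful pairwise update would be preferable.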

Exploratory data analysis is a very important step of data preparation and cleaning that becomes significantly more challenging with increasing size of the data. Beyond the technical complexity of managing the data, graphical and numerical summaries are essentially low dimensional representations: for example, graphical displays are physically limited to a few million pixels. Further, the human brain requires a very low dimensional input to proceed to interpretation (Scott, 2015). One consequence of this limitation is that we often must compromise between the intuitiveness of a visualization and the power of a visualization (Carpendale et al., 2004; Nacenta et al., 2012). Another consequence is that we typically visualize only a small subset of the data at one time, and this may give a biased view (Kolaczyk, 2009; p. 91).

2.2 Variety

For many big data problems, the variety of data types is closely related to an increase in complexity. Much of classical statistical modelling assumes a “tall and skinny” data structure with many more rows, indexing observational or experimental units n, than columns, indexing variables p. Images, both two- and three-dimensional, blog posts, twitter feeds, social networks, audio recordings, sensor networks, and videos are just some examples of new types of data that don’t fit this “rectangular” model. Often rectangular data is created somewhat artificially: for example for two-dimensional images, by ordering the pixels row by row and recording one or more values associated with each pixel, hence losing information about the structure of the data.

For many types of big data the number of variables is larger, often much larger, than the number of observations. This change in the relationship between n and p can lead to counter-intuitive behaviour, or a complete breakdown, of conventional methods of analysis. Problems in which p scales with n are very common: for example, in models for networks that assign one parameter to each node, there are p = n parameters for a graph with n × n possible edges (Chung and Lu, 2002). Similarly, Bioucas-Dias et al. (2013) discuss hyperspectral remote sensing image data, which may have hundreds or thousands of spectral bands, p, but a very small number of training samples, n. In a recent genomics study reported by the 1000 Genomes Project Consortium (2012), geneticists acquired data on p = 38 million single nucleotide polymorphisms in only n = 1092 individuals.

The asymptotic properties of classical methods in a regime where p increases with n are quite different than in the classical regime of fixed p. For example, if X_i, i = 1, …, n are independent and identically distributed as N(0, I_p), and p/n stays constant as p and n increase, the eigenvalues of the empirical covariance matrix S = XX′/n do not converge to the true value of 1, but rather have a limiting distribution described by the Marchenko-Pastur law. Donoho and Gavish (2014) show how this law is relevant for understanding the asymptotic minimax risk in estimating a low rank matrix from noisy measurements by thresholding the singular values. Donoho et al. (2013) obtain similar results for the problem of matrix recovery.
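For reference, the limiting law can be written down explicitly. The display below states the standard form of the density for unit-variance entries; the symbols \gamma, a and b are notation introduced here for illustration and do not appear in the text above.

```latex
\[
  \frac{p}{n} \to \gamma \in (0, 1], \qquad
  f_\gamma(x) = \frac{\sqrt{(b - x)(x - a)}}{2 \pi \gamma x}, \quad a \le x \le b,
\]
\[
  a = (1 - \sqrt{\gamma})^2, \qquad b = (1 + \sqrt{\gamma})^2 .
\]
```

For instance, with p/n = 1/4 the eigenvalues of S spread over roughly the interval [0.25, 2.25] instead of concentrating at 1, even though every true eigenvalue equals 1.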

2.3 Veracity

Some big data is obtained by piecing together several datasets that have not been collected with the current research question in mind: this is sometimes referred to as “found data” or “convenience samples”. This type of data will typically have extremely heterogeneous structure (Hodas and Lerman, 2014), and may well not be at all representative of the population of interest. The vast majority of big data is observational, but more importantly may be collected very haphazardly. Classical ideas of experimental design, survey sampling, and design of observational studies often get lost in the excitement of the availability of large amounts of data. Issues of sampling bias, if ignored, can lead to incorrect inference just as easily with big data as with small. Harford (2014) presents a number of high-profile failures, from the Literary Digest poll of 1936, which had two million participants, to Google Flu Trends in 2010, which in basing its predictions on internet search terms severely over-estimated the prevalence of influenza. Despite those limitations, National Statistical Agencies are developing methods for the careful use of such data, for example as an aid to imputation on small areas (Marchetti et al., 2015).

Cleaning large volumes of data is challenging and veracity is affected by incomplete or incorrect cleaning, and by the lack of tools to evaluate data quality. For example, to confirm that a measure of sentiment calculated from social media was sound, Daas et al. (2015) evaluated its consistency with a well-validated indicator of consumer confidence. While it is clear that social media data are noisy and volatile, data collected from sensors may also come with several problems, including a large proportion of missing values. Daas et al. (2015) considered as an example the issues involved in cleaning traffic loop detector data.

Quality of data in large administrative databases is a pressing problem: additional heterogeneity may be introduced when the data are recorded by numerous stakeholders with varying priorities and incentives. Lix et al. (2012) use modelling to evaluate and improve the quality of the data in large health services databases.

2.4 Velocity

In many areas of application, data is “big” because it is arriving at a high velocity, from continuously operating instrumentation, e.g., when considering autonomous driving scenarios (Geiger et al., 2012), or through an on-line streaming service. Sequential or on-line methods of analysis may be needed, which requires methods of inference that can be computed quickly, that repeatedly adapt to sequentially added data, and that are not misled by the sudden appearance of irrelevant data, as discussed, e.g., by Jain et al. (2006) and Hoens et al. (2012). The areas of application using sequential learning are diverse. For instance, Li and Fei-Fei (2010) use object recognition techniques in an efficient sequential way to learn object category models from an arbitrarily large number of images from the web. Monteleoni et al. (2011) use sequential learning for climate change modelling to estimate transition probabilities between different climate models. For some applications, such as high energy physics, bursts of data arrive so rapidly that it is infeasible to even record all of it. In this case a great deal of scientific expertise is needed to ensure that the data retained is relevant to the problem of interest.
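As a small illustration of an on-line method that adapts to sequentially added data without storing it, the following Python sketch maintains a running mean and variance with Welford's update. The class name `RunningMoments` and the example readings are hypothetical; the cited papers use far richer models, so this is only a sketch of the computational pattern.

```python
class RunningMoments:
    """Online mean and variance (Welford's algorithm): O(1) memory per stream."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one new observation into the summary; x need not be kept."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Population variance of everything seen so far."""
        return self._m2 / self.n if self.n > 0 else 0.0


stream = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sensor readings
moments = RunningMoments()
for reading in stream:
    moments.update(reading)
```

Each observation is touched once and then discarded, so the cost per update is constant regardless of how much data has already arrived.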

2.5 Beyond the Vs

Big data is about humans; it offers new possibilities and hence raises new ethical questions. Duhigg (2012), for instance, reported how the Target retail stores performed analytics to send targeted marketing material to their customers, including promotions related to pregnancy to a teenager whose father was not yet aware of the coming baby. At Facebook, Kramer et al. (2014) conducted a large scale experiment on sentiments without obtaining a specific user agreement (see also Meyer, 2014). Privacy is also a big concern, especially when considering that the linking of databases can disclose information that was meant to remain anonymous. The lawsuit following the Netflix Challenge is a striking example: linking the provided data to the IMDB movie reviews made it possible to identify some users (Singel, 2009; 2010). In general, if released data can be linked to publicly available data, such as media articles, then privacy can be completely compromised. Another high-profile example of this was the publication of the health records of the Governor of Massachusetts, which in turn led to strict laws on the collection and release of health data in the United States known as the “HIPAA rules” (Tarran, 2014). In addition to privacy, legal questions are also unanswered: copyright and ownership of the data, for instance, are not always clear (see e.g. Struijs et al., 2014).

Sub-sampling from a large dataset is often presented as a solution to some of the problems raised in the analysis of big data; one example is the “bag of little bootstraps” of Kleiner et al. (2014). While this strategy can be useful to analyse a dataset that is very massive, but otherwise regular, it cannot address many of the challenges that were presented in this section.

3. Strategies for big data analysis

In spite of the wide variety of types of big data problems, there is a common set of strategies that seem essential. Aside from pre-processing and multi-disciplinarity, many of these follow the general approach of identifying and exploiting a hidden structure within the data that often is of lower dimension. We describe some of these commonalities in this section.

3.1 Data Wrangling

A very large hurdle in the analysis of data, which is especially difficult for massive data sets, is manipulating the data for use in analysis. This is variously referred to as data wrangling, data carpentry, or data cleaning [4]. Big data is often very “raw”: considerable pre-processing, such as extracting, parsing, and storing, is required before it can be considered in an analysis.

Volume poses a challenge to data cleaning: while manual or interactive editing of a data set is usually preferred for reasonably sized data, it quickly becomes impractical when the data grows. Large administrative databases will require special tools (e.g. the tableplots of Puts et al., 2015) to assist in the cleaning. As the size or the velocity increases, automated solutions may become necessary. Puts et al. (2015), for instance, use a Markov model to automate the cleaning of traffic loop data. Variety often makes data wrangling difficult, for instance when selecting variables or features from image and sound data (e.g. Guyon and Elisseeff, 2003), integrating data from heterogeneous sources (Izadi et al., 2013), or parsing free-form text for text mining or sentiment analysis (e.g., Gupta and Lehal, 2009; Pang and Lee, 2008). A more advanced approach in text mining uses the semantics of text rather than the simple count of words to describe texts (e.g., Brien et al., 2013).

[4] The R packages dplyr and tidyr were expressly created to simplify this step, and are highly recommended. The new “pipe” operator %>%, operating similarly to the old Unix pipe command, is especially helpful (Wickham, 2014).

One example of a data preparation problem arises in the merging of several sets of spatio-temporal data, which may have different spatial units: for example one database may be based on census tracts, another on postal codes, and still another on electoral districts. Brown et al. (2013) have developed an approach to heterogeneous complex data based on a local EM algorithm. This starts by overlaying all partitions that are used for aggregation, and then creating a finer partition such that all elements of that partition are fully included in the larger ones. The local EM algorithm based on this is parallelizable, and hence provides a workable solution to this complex problem. A related issue arose in the presentation of Gaffield (2015), where historical census data is being used as part of a study of the impact of individuals on the history of Canada. The census tracts have changed over time, and the current database resolves this by mapping backwards from the current census tracts.
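The overlay step can be sketched in one dimension, where each partition is simply a sorted list of breakpoints along a line. This is a deliberately simplified, hypothetical stand-in: the function name `common_refinement` and the breakpoints are assumptions, real spatial units are two-dimensional polygons, and the local EM algorithm itself is not shown.

```python
from typing import List, Tuple


def common_refinement(*partitions: List[float]) -> List[Tuple[float, float]]:
    """Overlay partitions given as sorted breakpoints; return the finest cells."""
    breakpoints = sorted(set().union(*map(set, partitions)))
    return list(zip(breakpoints[:-1], breakpoints[1:]))


# Hypothetical breakpoints for two ways of cutting the same stretch of road.
census_tracts = [0.0, 5.0, 10.0]
postal_codes = [0.0, 3.0, 10.0]
cells = common_refinement(census_tracts, postal_codes)
```

Every resulting cell lies entirely inside exactly one unit of each original partition, which is the property the finer partition is required to have.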

3.2 Visualization

Visualization is very often the first formal step of the analysis and is often a tool of choice for data cleaning (see e.g. Puts et al., 2015). The aim is to find efficient graphical representations that summarize the data and emphasise its main characteristics (Tukey, 1977). Visualization can also serve as an inferential tool in different stages of the analysis: understanding the patterns in the data helps in creating good models (Buja et al., 2009). For example, Albert-Green et al. (2014), Woolford and Braun (2007) and Woolford et al. (2009) present exploratory data analysis tools for large environmental data sets, which are meant to not only plot the data, but also to look for structures and departures from such structures. The authors focus on convergent data sharpening and the stalagmite plot in the context of forest fire science.

Even if “a picture is worth a 1000 words”, one has to realize that going from big data to 1000 words is heavy compression. Visualization is “bandwidth-limited”, in the sense that there is not enough screen space to visualize large, or even moderately sized, data. Interactivity provides many ways to help with this; the most familiar is zooming into large visualizations to reveal local structure, much in the manner of Google Maps. New techniques are being developed in the information visualization community: Neumann et al. (2006) described taking cues from nature, and from fractals, in developing PhylloTrees for illustrating network graphs at different scales. Carpendale in her bootcamp presentation described several recent innovations in interactive visualization, many of which are illustrated at the website of the Innovis Research Group (Carpendale, 2015).

3.3 Reducing Dimensionality

Direct reduction of dimensionality is a very common technique: in addition to facilitating visualization (Huron et al., 2014; Wickham et al., 2010), it is often needed to address issues of data storage (Witten and Candès, 2013). Hinton and Salakhutdinov (2006) describe the use of direct dimension reduction in classification. Dimensionality reduction also often incorporates an assumption of sparsity, which is addressed in the next subsection. Principal Component Analysis (PCA) is a classical example of dimensionality reduction, introduced by Pearson (1901). The first two, or three, principal components are often plotted in the hopes of identifying structure in the data. One of the issues with performing PCA on large data sets is the intractability of its computation, which relies on singular value decomposition (SVD). Witten and Candès (2013), Eldar and Kutyniok (2012), and Halko et al. (2011), among others, have used random matrix projections. This is a powerful tool behind compressed sensing and a striking example of dimensionality reduction, allowing approximate QR decomposition or singular value decomposition with great performance advantages over conventional methods. Witten and Candès (2013) approximate a very large matrix by the product of two “skinny” matrices that can be processed relatively easily. Using randomized methods one can infer the principal components of the original data from the singular value decomposition of these skinny factor matrices, thus overcoming a computational bottleneck.
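The first step of such randomized methods, multiplying the data matrix by a random test matrix to obtain a much narrower sketch, can be illustrated in a few lines of Python. The sketch below uses only the standard library and a Gaussian test matrix; the function name `random_sketch`, the seed, and the toy matrix are assumptions made for illustration, and the subsequent QR or SVD of the sketch is deliberately omitted.

```python
import random
from typing import List

Matrix = List[List[float]]


def random_sketch(a: Matrix, k: int, seed: int = 0) -> Matrix:
    """Return Y = A @ Omega, where Omega is a p-by-k Gaussian test matrix."""
    rng = random.Random(seed)
    p = len(a[0])
    omega = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(p)]
    return [
        [sum(row[j] * omega[j][c] for j in range(p)) for c in range(k)]
        for row in a
    ]


# Hypothetical 3-by-6 data matrix (n = 3 rows, p = 6 columns); the last row is zero.
a = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
y = random_sketch(a, k=2)
```

The sketch has the same number of rows as the data but only k columns, and because the map is linear it preserves linear relations among rows: here the second row of y is still twice the first. In practice one would use an optimized linear algebra library rather than nested lists.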

Depending on the application, linear dimensionality reduction techniques such as PCA may need to be extended to non-linear ones, e.g., if the structure of the data cannot be captured in a linear subspace (Hinton and Salakhutdinov, 2006). However, non-linear techniques are harder to compute and parameterize. In network analysis, a novel way to reduce the dimensionality of large and complex datasets is by employing an explanatory tool for data analysis called the “network histogram” (Olhede and Wolfe, 2014), illustrated in Figure 1. Sofia Olhede in joint work with Patrick Wolfe suggests a non-parametric approach to estimate consistently the generating model of a network. The idea is to identify groups such that nodes from the same group show a similar connectivity pattern and to approximate the generating model by a piecewise constant function by averaging over groups. The piecewise constant function here is called a stochastic block model and it approximates the generating model in the same way as a histogram approximates a pdf or the Riemann sum approximates a continuous function.

Figure 1. Example of a network histogram from Olhede and Wolfe (2014). A network can be expressed as a matrix where each entry represents the strength of connection between two nodes. Cleverly ordering the nodes uncovers the underlying structure of the network.
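The averaging step behind the piecewise constant approximation can be sketched as follows. The sketch assumes the group labels are already known, which sidesteps the difficult part of the method (finding the grouping); the function name `block_densities` and the four-node network are hypothetical.

```python
from typing import Dict, List, Tuple


def block_densities(
    adjacency: List[List[int]], labels: List[int]
) -> Dict[Tuple[int, int], float]:
    """Average the adjacency matrix over each pair of groups (self-loops excluded)."""
    totals: Dict[Tuple[int, int], float] = {}
    counts: Dict[Tuple[int, int], int] = {}
    n = len(adjacency)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            key = (labels[i], labels[j])
            totals[key] = totals.get(key, 0.0) + adjacency[i][j]
            counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical 4-node network with two tightly knit groups and no links between them.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
labels = [0, 0, 1, 1]
densities = block_densities(adjacency, labels)
```

Each block of the node-ordered matrix is replaced by a single number, its average edge density, in the same way that a histogram replaces the observations in a bin by a single height.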

3.4 Sparsity and Regularization

Increasing model sparsity enforces a lower-dimensional model structure. In turn, it makes inference more tractable and models easier to interpret, and it leads to more robustness against noise (Candès et al., 2008; Candès et al., 2006). Regularization techniques that enforce sparsity have been widely studied in the literature for more than a decade, including the lasso (Tibshirani, 1996), adaptive lasso (Zou, 2006), the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001), and modified Cholesky decomposition (Wu and Pourahmadi, 2003). These methods and their more recent extensions are now key tools in dealing with big data.

A strong assumption of sparsity makes it possible to use a higher number of parameters than observations, enabling us to find the optimal representation of the data while keeping the problem solvable. A good example in this context is the winning entry for the Netflix Challenge recommender system, which combined a large number of methods, thereby artificially increasing the number of parameters to improve prediction performance (Bell et al., 2008).

Sparsity-enforcing algorithms combine variable selection and model fitting: these methods typically shrink small estimated coefficients to zero and this typically introduces bias which must be taken into account in evaluating the statistical significance of the non-zero estimated coefficients. Much of the research in lasso and related methods has focussed on point estimation, but inferential aspects such as hypothesis testing and construction of confidence intervals are an active area of research. For inference about coefficients estimated using the lasso, van de Geer et al. (2014), and Javanmard and Montanari (2014a, 2014b) introduce an additional step in the estimation procedure for debiasing. Lockhart et al. (2014) in contrast focus on testing whether the lasso model contains all truly active variables. In a survival analysis setting, Fang et al. (2014) test the effect of a low dimensional component in a high dimensional setting by treating all remaining parameters as nuisance parameters. They address the problem of the bias with the aim of improving the performance of the classical likelihood-based statistics: the Wald statistic, the score statistic and the likelihood ratio statistic, all of these based on the usual partial likelihood for survival data. To offer a compromise to the common sparsity assumption of zero and large non-zero valued elements, Ahmed (2014) proposes a relaxed assumption of sparsity by adding weak signals back into consideration. His new high-dimensional shrinkage method improves the prediction performance of a selected submodel significantly.
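The shrinkage and the resulting bias can be seen in the simplest special case. Assuming an orthonormal design and the objective (1/2)||y - Xb||^2 + lam * ||b||_1, the lasso estimate of each coefficient is the soft-thresholded least-squares estimate. The Python sketch below illustrates this special case only; the function name `soft_threshold`, the penalty level and the example estimates are hypothetical.

```python
def soft_threshold(z: float, lam: float) -> float:
    """Lasso estimate for one coefficient when the design is orthonormal."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical least-squares estimates and penalty level.
ols_estimates = [3.0, -2.5, 0.4, -0.1]
lasso_estimates = [soft_threshold(z, lam=1.0) for z in ols_estimates]
```

Small coefficients are set exactly to zero (variable selection), while the surviving coefficients are pulled toward zero by the penalty level, which is the bias that debiasing procedures aim to correct.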

3.5 Optimization

Optimization plays a central role in statistical inference, and regularization methods that combine model fitting with model selection can lead to very difficult optimization problems. Complex models with many hidden layers, such as arise in the construction of neural networks and deep Boltzmann machines, typically lead to non-convex optimization problems. Sometimes a convex relaxation of an initially non-convex problem is possible: the lasso for example is a relaxation of the problem that penalizes the number of non-zero coefficient estimates, i.e. it replaces an L_0 penalty with an L_1 penalty. While this convex relaxation makes the optimization problem much easier, it induces bias in the estimates of the parameters, as mentioned above. Non-convex penalties can reduce such bias, but non-convex optimization is usually considered to be intractable. However, statistical problems are often sufficiently well-behaved that non-convex optimization methods become tractable. Moreover, due to the structures of the statistical models, Loh and Wainwright (2014, 2015) suggest that these models do not tend to be adversarial, hence it is not necessary to look at the worst case scenario. Even when non-convex optimization algorithms, such as the Expectation-Maximization algorithm, do not provide the global optimum, the local optimum can be adequate since the distance between the local and global maxima is often smaller than the statistical error (Balakrishnan et al., 2015; Loh and Wainwright, 2015). Wainwright concluded his presentation at the opening conference with a “speculative meta-theorem”, that for any problem in which a convex relaxation performs well, there is a first order simple method on the non-convex formulation which performs as well. Applications of non-convex techniques include spectral analysis of tensors, analogous to SVD of matrices, which can be employed to aid in latent variable learning (Anandkumar et al., 2014). Non-convex methods can also be used in dictionary learning as well as in a version of robust PCA, which is less susceptible to sparse corruptions (Netrapalli et al., 2014). In some cases, convex reformulations have also been developed to find optimal solutions in non-convex scenarios (Guo and Schuurmans, 2007).
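The relaxation that turns best-subset selection into the lasso can be stated compactly. In the display below, y, X, \beta and \lambda are notation introduced here for a linear regression setting and are not taken from the text; the first problem counts non-zero coefficients and is non-convex, while the second replaces the count by a sum of absolute values and is convex.

```latex
\[
  \hat\beta^{(0)} = \arg\min_{\beta} \; \tfrac{1}{2} \lVert y - X\beta \rVert_2^2
    + \lambda \lVert \beta \rVert_0 ,
  \qquad \lVert \beta \rVert_0 = \#\{ j : \beta_j \neq 0 \},
\]
\[
  \hat\beta^{\text{lasso}} = \arg\min_{\beta} \; \tfrac{1}{2} \lVert y - X\beta \rVert_2^2
    + \lambda \lVert \beta \rVert_1 ,
  \qquad \lVert \beta \rVert_1 = \sum_j \lvert \beta_j \rvert .
\]
```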

Many models used for big data are based on probability distributions, such as Gibbs distributions, for which the probability density function is only known up to its normalizing constant, sometimes called the partition function. Markov chain Monte Carlo methods attempt to overcome this by sampling from an unnormalized probability density function based on a Markov chain that has the target distribution as its equilibrium distribution, but these methods do not necessarily scale well. In some cases, approximate Bayesian computation (ABC) methods provide an alternative approximate, but computable, solution to a massive problem (Blum and Tran, 2010).
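The reason Markov chain Monte Carlo can work with an unnormalized density is that only ratios of the density enter the algorithm, so the normalizing constant cancels. The Python sketch below shows this for a random-walk Metropolis sampler, one simple member of the MCMC family; the function names, the step size, the seed and the standard normal target are assumptions chosen for illustration.

```python
import math
import random
from typing import Callable, List


def acceptance_probability(
    current: float, proposal: float, log_unnorm: Callable[[float], float]
) -> float:
    """Metropolis acceptance probability; any normalizing constant cancels."""
    log_ratio = log_unnorm(proposal) - log_unnorm(current)
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)


def metropolis(
    log_unnorm: Callable[[float], float],
    n_draws: int,
    step: float = 1.0,
    seed: int = 0,
) -> List[float]:
    """Random-walk Metropolis chain targeting exp(log_unnorm) up to a constant."""
    rng = random.Random(seed)
    x, chain = 0.0, []
    for _ in range(n_draws):
        proposal = x + rng.gauss(0.0, step)
        if rng.random() < acceptance_probability(x, proposal, log_unnorm):
            x = proposal
        chain.append(x)
    return chain


def log_target(x: float) -> float:
    """Unnormalized log-density of a standard normal (constant term dropped)."""
    return -0.5 * x * x


chain = metropolis(log_target, n_draws=1000)
```

Adding any constant to `log_target` leaves the acceptance probabilities, and hence the chain, unchanged. Each step still requires evaluating the unnormalized density on the full data, which is one reason such methods may not scale well.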

3.6 Measuring distance

In high-dimensional data analysis, distances between “points”, or observations, are needed for most analytic procedures. For example, optimization methods typically move through a possibly complex parameter space, and some notion of step-size for this movement is needed. Simulation methods such as Markov chain Monte Carlo sampling also need to ‘move’ through the parameter space. In both these contexts the use of Fisher information as a metric on the parameter space has emerged as a useful tool: it is the Riemannian metric (Rao, 1945; Amari and Nagaoka, 2000) when regarding the parameter space as a manifold. Girolami and Calderhead (2011) exploited this in Markov chain Monte Carlo algorithms, with impressive improvements in speed of convergence. Grosse and Salakhutdinov (2015) showed how optimization of multi-layer Boltzmann machines could be improved by using an approximation to the Fisher information, the full Fisher information matrix being too high-dimensional to use routinely. A similar technique emerged in environmental science (Sampson and Guttorp, 1992), where two-dimensional geographical space is mapped to a two-dimensional deformed space in which the spatial structure becomes stationary and isotropic. Schmidt et al. (2011) describe how this approach can be exploited in a Bayesian framework, mapping to a deformed space in higher dimensions, and apply this to analyses of solar radiation and temperature data.
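For concreteness, the Fisher information and the metric it induces can be written as below, using the standard definitions. The second display is the generic natural-gradient update, included only to illustrate how the metric rescales a step; \ell denotes an objective to be maximized and \varepsilon a step size, both notation introduced here, and the display is not claimed to be the exact algorithm of any of the cited papers.

```latex
\[
  I(\theta) = \mathbb{E}_{x \sim p(\cdot \mid \theta)}
    \left[ \nabla_\theta \log p(x \mid \theta) \,
           \nabla_\theta \log p(x \mid \theta)^{\top} \right],
  \qquad
  \mathrm{d}s^2 = \mathrm{d}\theta^{\top} I(\theta) \, \mathrm{d}\theta ,
\]
\[
  \theta_{t+1} = \theta_t + \varepsilon \, I(\theta_t)^{-1} \nabla_\theta \ell(\theta_t).
\]
```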

3.7 Representation Learning

Representation learning, also called feature learning or deep learning, aims to uncover hidden and complex structures in data (Bengio et al., 2013) and frequently uses a layered architecture to uncover a hierarchical data representation. The model for this layered architecture may have higher dimensionality than the original data. In that case, one may need to combine representation learning with the dimensionality reduction techniques described above in this report. Large, well-annotated datasets facilitate successful representation learning. But labeling data for supervised learning can consume excessive time and resources, and may even prove impossible. To solve this problem, researchers have developed techniques to uncover data structure with unsupervised techniques (Anandkumar et al., 2013; Salakhutdinov and Hinton, 2012).

3.8 Sequential Learning

A strategy to deal with big data characterized by a high velocity is sequential or incremental learning. This technique treats the data successively, either because the task was designed in this way or the data does not allow any other access, as is the case for continuous data streams such as long image sequences (Kalal et al., 2012). Sequential learning is also applied if the data is completely available, but has to be processed sequentially. This occurs when the data set is too large to fit into the memory or to be processed in a practically tractable way. For example, Scott et al. (2013) introduce consensus Monte Carlo, which operates by running separate Monte Carlo algorithms on distributed machines, and then averaging individual Monte Carlo draws across machines; the result can be nearly indistinguishable from the draws that would have been obtained by running a single machine algorithm for a very long time.
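The averaging step described above can be sketched in Python for a scalar parameter. The sketch assumes the per-machine draws have already been produced and takes the weights as given; weighting each machine by the inverse of the sample variance of its draws is one natural choice, stated here as an assumption. The function name `consensus_draws` and the numbers are hypothetical.

```python
from typing import List, Sequence


def consensus_draws(
    shard_draws: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Combine the g-th draw from every shard into one weighted-average draw."""
    total = sum(weights)
    n_draws = min(len(draws) for draws in shard_draws)
    return [
        sum(w * draws[g] for w, draws in zip(weights, shard_draws)) / total
        for g in range(n_draws)
    ]


# Hypothetical draws of a scalar parameter from two worker machines.
shard_draws = [[1.0, 2.0], [3.0, 4.0]]
weights = [1.0, 3.0]  # e.g. inverse sample variances of each shard's draws
combined = consensus_draws(shard_draws, weights)
```

The workers never exchange data while sampling; only their draws are communicated at the end, which keeps the communication cost low.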

3.9 Multi-disciplinarity

A common theme in a great many presentations on big data is the necessity of a very high level of multi-disciplinarity. Statistical scientists are well-trained in both planning of studies, and in inference under uncertainty. Computer scientists bring a sophisticated mix of computational strategies and data management expertise. Both these groups of researchers need to work closely with scientists, social scientists, and humanists in the relevant application areas to ensure that the statistical methods and computational algorithms are effective for the scientific problem of interest. For instance, Dean et al. (2012) describe an exciting collaboration in forestry, where they are working on linking statistical analysis and computational infrastructure for new instrumentation with social agency, with the goal of maximizing their impact on the management of forest fires. The synergy space between them is created in an integrative approach involving scientists from all fields. This integration is not an invention of the world of big data by any means: statistics is an inherently applied field and collaborations have been a major focus of the discipline from its earliest days. What has changed is the speed at which new technology provides new challenges to statistical analysis, and the magnitude of the computational challenges that go along with this. Much of the big data now available involves information about people; this is one of the aspects of big data that contributes to its popularity. The social sciences can help to elucidate potential biases in such data, as well as help to focus on the most interesting and relevant research questions. Humanists can provide frameworks for thinking about privacy, about historical data and changes through time, and about ethical issues arising in sharing of data about, for example, individuals’ health. Individuals may well be willing to share their data for research purposes, for example, but not for marketing purposes: the humanities offer ways of thinking about these issues.

4. Examples of Applications

The breadth of applications is nearly limitless; we provide here a selection of the application areas discussed during the opening conference, in part to make the ideas outlined in the previous section more concrete.

4.1 Public Health

Diet is an important factor of public health in the study of disabilities (Murray et al., 2013), but assessing people’s nutritional behaviour is difficult. Nielsen collects information about all products sold by a large number of groceries and corner stores at the 3-digit postal code level, and purchase card loyalty programs have data on purchases at the household level. Using these sources of information along with a UPC database of nutritional values, Buckeridge et al. (2012) match nutrition variables to existing medical records in the province of Québec, in collaboration with the Institut de santé publique du Québec. Since the purchase data is temporal, Buckeridge et al. (2014) measure the decrease in sales of sugary carbonated drinks following a marketing campaign, hence assessing the effectiveness of this public health effort. Big data is also being used for syndromic surveillance, the monitoring of the syndromes of transmittable diseases at the population level. In the province of Québec, a central system monitors the triage data from emergency rooms, including free-form text which needs to be treated appropriately. When the automated surveillance system issues an alert, an ad-hoc analysis is performed to decide whether immediate actions are needed. Buckeridge et al. (2006) evaluate the performance of such systems through simulation for different rates of false positive alerts. Social media produce data whose volume and variety are often challenging. The traditional reporting of flu from physician reports generally introduces a 1-2 week time delay. Google Flu Trends (Ginsberg et al., 2009) speeds up the reporting by tabulating the use of search keywords, such as “runny nose” or “sore throat”, in order to estimate the prevalence of influenza-like illnesses. Predictions can sometimes go wrong, see Butler (2013), but the gain in reaction time is significant. Using the GPS location in the metadata of Twitter posts, Ramírez-Ramírez et al. (2013) introduce Simulation of Infectious Diseases (SIMID), which improves geographic and temporal locality beyond Google Flu Trends. They illustrate the use of SIMID to estimate and predict flu spread in the Peel sub-metropolitan region near Toronto.

4.2 Health Policy

Many examples in health policy rely on the linkage of large administrative datasets. In Ontario, the Institute for Clinical Evaluative Sciences (ICES) is a province-wide research network that collects and links Ontario's health-related data at the individual level. Similar databases exist in other provinces. Comparing data on diabetes case ascertainment from Manitoba with those of Newfoundland and Labrador, Lix et al. (2012) evaluate the incompleteness of physician billing claims and estimate to what extent the number of non-fee-for-service acts is underestimated. With the quality of the data in mind, the authors also improve disease ascertainment with model-based predictive algorithms as an alternative to relying on possibly incorrect disease codes.

Health data can be spatiotemporal, and complications arise quickly when the data are heterogeneous. Postal codes and census tracts, for instance, define areas that do not match, with shapes that vary through time (see e.g. Nguyen et al., 2012). In a study about cancer in Nova Scotia, Lee et al. (2015) have access to location data for the most recent years, but older data only contain the postal codes. To handle the different granularities, they use a hierarchical model where the locations of cancer cases are treated as latent variables and the higher levels of the hierarchy are the aggregation units (Li et al., 2012).

4.3 Law and Order

Insurgencies and riots are frequent in South America, with hundreds or thousands of occurrences in each country every year. Korkmaz et al. (2015) have a database of all Twitter messages sent from South America over a period of four years, as well as a binary variable named GSR, a gold standard that determines whether an event was an insurgency or not. The occurrence of an insurgency can be predicted by the volume of tweets, the presence of some keywords therein (from a total of 634 that were initially identified as potentially useful), and an increased use of The Onion Router (TOR), an online service used to anonymize tweets. The massive database containing all the tweets is stored on a Hadoop file system and is processed with Python through MapReduce. These tools, which are not part of a typical statistics curriculum, are standard in the treatment of massive datasets.
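The MapReduce pattern mentioned above can be illustrated with a minimal, single-machine sketch. It is not the pipeline of Korkmaz et al. (2015): the keyword list, the `(day, text)` record layout and the function names `map_tweet`, `shuffle`, `reduce_counts` and `run_job` are all assumptions made for illustration, and a real job would run the map and reduce steps in parallel on a Hadoop cluster.

```python
from collections import defaultdict

# Hypothetical keyword list; the study started from 634 candidate keywords.
KEYWORDS = {"protest", "strike", "march"}


def map_tweet(tweet):
    """Map step: emit ((day, feature), 1) pairs for a single tweet.

    `tweet` is assumed to be a (day, text) pair.
    """
    day, text = tweet
    words = set(text.lower().split())
    for keyword in KEYWORDS & words:
        yield (day, keyword), 1
    # Also count the tweet itself, so that daily volume is available.
    yield (day, "__volume__"), 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one key."""
    return key, sum(values)


def run_job(tweets):
    """Run map, shuffle and reduce sequentially on an iterable of tweets."""
    mapped = (pair for tweet in tweets for pair in map_tweet(tweet))
    return dict(reduce_counts(key, values) for key, values in shuffle(mapped).items())
```

The resulting daily keyword counts and tweet volumes are the kind of aggregated features that could then be passed to a predictive model; because each tweet is mapped independently, the same logic scales out across machines.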

4.4 Environmental Sciences

The world is facing global environmental challenges which are hard to model, given their inherent complexity. To estimate the temperature record since the year 1000, Barboza et al. (2014) use different proxies including tree rings, pollen, coral and ice cores, hence combining very heterogeneous data. From earlier work, the authors found that historical records can be predicted much better by incorporating climate forcings in the model, for instance solar irradiance, greenhouse gas concentration and volcanism.

In an example combining health and environment data, Ensor et al. (2013) analyze the location and time of cardiac arrests from 2004 to 2011 in the city of Houston, with respect to air pollution measurements taken by EPA sensors scattered across the city. Instead of considering daily averages of exposure to air pollution, this study measures the exposure of an individual in the three hours preceding a cardiac arrest and finds a significant effect on the risk of an out-of-hospital cardiac arrest. In an unpublished follow-up study discussed by Sallie Keller, a synthetic model is used to simulate the activities of 4.4 million people in Houston on a given day. Individual exposure varies greatly, even within a given household, as people’s exposure to pollution depends on their movements in the city. The synthetic data helped optimize the geographic allocation of CPR training in the city.
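The difference between a daily average and an exposure window tied to the event time can be made concrete with a short sketch. It assumes hourly readings from a single sensor stored as `(timestamp, value)` pairs; the function name `exposure_before` and the data layout are hypothetical and do not reproduce the case-crossover analysis of Ensor et al. (2013).

```python
from datetime import datetime, timedelta


def exposure_before(event_time, readings, window_hours=3):
    """Average pollution reading in the `window_hours` preceding an event.

    `readings` is an iterable of (timestamp, value) pairs from one sensor.
    Returns None when no reading falls inside the window.
    """
    start = event_time - timedelta(hours=window_hours)
    values = [value for time, value in readings if start <= time < event_time]
    if not values:
        return None
    return sum(values) / len(values)


# Hypothetical hourly readings for one day, rising steadily from 0 to 12.
readings = [(datetime(2011, 6, 1, hour), float(hour)) for hour in range(13)]
daily_average = sum(value for _, value in readings) / len(readings)
window_average = exposure_before(datetime(2011, 6, 1, 10), readings)
```

In this toy series the three-hour window before a 10:00 event averages the 7:00, 8:00 and 9:00 readings, which differs markedly from the daily mean; that is the kind of short-term variation a daily summary would hide.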

4.5 Education

In a research project to study the theory of Johnson (2003) about grade inflation, Reese et al. (2014) analyze the history of Brigham Young University student evaluations from 2002 to 2014, matching it with student records. A total of 2.9 million course evaluations from 185,000 students were used. The variables of interest are three questions answered on an ordinal scale and a free text section for which a sentiment analysis determines whether the comment is positive or negative. The Bayesian proportional odds model fitted to the ordinal response variables is too large for MCMC methods, but the ABC approach leads to a computationally feasible solution. The data show empirical evidence of grade inflation.

With an example based on education data, Cook (2014) illustrates how visualization may be used for inference. Based on the 2012 OECD PISA test results from 485,900 students in 65 countries, Cook shows a figure presenting the differences in mean math scores and in mean reading scores between boys and girls by country. While the boys seem to outperform the girls in math, a much greater difference is seen in favour of the girls for the reading scores. To get an idea of the significance of the differences, Cook produces similar plots where the gender labels are randomly assigned to scores within each country. The fact that the true data stand out in comparison to the random assignments indicates that the gap did not occur by chance.
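The label-shuffling idea behind such plots has a simple numerical analogue, the permutation test, sketched below for a single country. Cook (2014) compares plots visually rather than computing a p-value, so this is an illustration of the underlying logic under stated assumptions, not the author's procedure; the function names and the `"boy"`/`"girl"` label coding are hypothetical.

```python
import random
from statistics import mean


def gender_gap(scores, labels):
    """Difference in mean score, boys minus girls, within one country."""
    boys = [s for s, g in zip(scores, labels) if g == "boy"]
    girls = [s for s, g in zip(scores, labels) if g == "girl"]
    return mean(boys) - mean(girls)


def null_gaps(scores, labels, n_perm=1000, seed=0):
    """Gaps obtained after randomly reassigning the gender labels.

    Shuffling keeps the number of boys and girls fixed, so each shuffled
    data set mimics what the gap would look like if gender were unrelated
    to the score.
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    gaps = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        gaps.append(gender_gap(scores, shuffled))
    return gaps


def permutation_p_value(scores, labels, n_perm=1000, seed=0):
    """Share of shuffled gaps at least as extreme as the observed gap."""
    observed = gender_gap(scores, labels)
    gaps = null_gaps(scores, labels, n_perm=n_perm, seed=seed)
    extreme = sum(abs(gap) >= abs(observed) for gap in gaps)
    # Adding one to numerator and denominator counts the observed data
    # as one of the possible arrangements.
    return (extreme + 1) / (n_perm + 1)
```

A small p-value plays the same role as the true plot standing out among the shuffled ones: the observed gap is unlikely to arise from a random assignment of labels.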

4.6 Mobile Application Security

Conveying information of interest from big data is a challenge which may be addressed by visualization. Hosseinkhani et al. (2014) propose Papilio, a novel graphical representation of a network that displays the different permissions of Android apps. The creation of a tractable representation is made possible by taking advantage of the special structure of the data. The whole representation is still too large for a single screen, so halos (Baudisch and Rosenholtz, 2003) are used to indicate the position of offscreen nodes, as illustrated in Figure 2. The radius of each halo is proportional to the distance to the node.

Figure 2: Illustration of the halos in Papilio, which indicate the position of offscreen nodes. This Papilio figure from Hosseinkhani et al. (2014) displays the different permissions of Android apps.
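The geometry of a halo can be sketched in a few lines. The sketch follows the general idea of Baudisch and Rosenholtz (2003), in which a circle centred on the offscreen node is drawn just large enough to reach into the visible area, so that its radius grows with the distance to the node. The rectangular viewport, the `intrusion` margin and the function name `halo_radius` are assumptions for illustration; this is not the code used in Papilio.

```python
import math


def halo_radius(node, viewport, intrusion=20.0):
    """Radius of a halo centred on an offscreen node.

    `node` is an (x, y) position and `viewport` is (left, top, right, bottom)
    in the same coordinates. The circle is sized so that its arc reaches
    `intrusion` units into the viewport. Returns None for on-screen nodes,
    which need no halo.
    """
    x, y = node
    left, top, right, bottom = viewport
    # Horizontal and vertical distance from the node to the viewport.
    dx = max(left - x, 0.0, x - right)
    dy = max(top - y, 0.0, y - bottom)
    distance = math.hypot(dx, dy)
    if distance == 0.0:
        return None
    return distance + intrusion
```

Because the visible arc of a large circle is flatter than that of a small one, a viewer can judge from the curvature alone how far away the offscreen node is.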

4.7 Image Recognition and Labelling

Images are unstructured data, and their analysis has progressed significantly with the development of deep learning models such as deep Boltzmann machines, which are especially suitable for automatically learning features in the data. The Toronto Deep Learning group offers an online demo⁵ of multi-modal learning based on the work of Srivastava and Salakhutdinov (2012) that can generate a caption to describe any uploaded image. The image search engine of Google also uses deep learning models to find images that are similar to an uploaded picture.

⁵ http://deeplearning.cs.toronto.edu/i2t

4.8 Digital Humanities

A large body of research in the humanities is built from an in-depth study of carefully selected authors or records. The ability to process large amounts of data is opening new opportunities where social evidence can be gathered at a population level. For instance, Drummond et al. (2006) use the 1901 Canadian census data to revise different theories about bilingualism in Canada and improve their agreement with the data. They also observe that patterns in Québec differ significantly from those in the rest of the country.

In his book Capital in the Twenty-First Century, Piketty (2014) discusses the evolution of references to money in novels as a reflection of inequalities in society. Underwood et al. (2014) dig further into the question by listing all references to money amounts in English-language novels from 1750 to 1950. They find that the relative frequency of references to money amounts increases through time, but that the amounts mentioned decrease in constant dollars. They do not attempt to draw strong conclusions from this social evidence, mentioning for instance that changes in the reading audience due to increased literacy among the working class may also play a role in the evolution of the novels.

On a larger scale, Michel et al. (2011) analyze about 4% of all books ever printed to describe social trends through time as well as changes in linguistics. While patterns in the data match milestone events such as the Great Wars and some global epidemics, the evolution of words such as “feminism” measures important social trends. Local events such as the censorship of different artists for a number of years in different countries are also visible in the data. Those are some examples of ‘culturomics’, the quantitative analysis of digitized text to study cultural trends and human behavior.

4.9 Materials Science

In an application to materials science by Nelson et al. (2013), the properties of a binary alloy depend on the relative position of atoms from both materials, whose description relies on a very large number of explanatory variables. Surprisingly, as Candès et al. (2006) showed, a random generation of candidate materials is more efficient than a carefully designed choice. This is an example where big data shows counterintuitive behaviour, since one could think that a careful design of experiments should be able to beat random selection.

5. Conclusion

While technical achievements make the existence and availability of very large amounts of data possible, the challenges emerging from such data go beyond processing, storing and accessing records rapidly. As its title indicates, the thematic program on Statistical Inference, Learning and Models for Big Data tried to focus the discussion particularly on statistical issues. There are a great many fields of research in which big data plays a key role, and it has not been possible to survey all of them.

In this report, we identify challenges and solutions that are common across different disciplines working on inference and learning for big data. With new magnitudes of dimensionality, complexity and heterogeneity in the data, researchers are faced with new challenges of scalability, quality and visualization. In addition, big data is often observational in nature, having been collected for a purpose other than the intended analysis, which makes inference even more challenging. Emerging technological solutions provide new opportunities, but they also define new paradigms for data and models. Making the models bigger with layers of latent variables may provide an insight into the features of the data, while reducing the data to lower dimensional models can be key in getting any calculations through. Lower dimensional structure analysis manifests itself not only as sparsity assumptions and dimensionality reduction, but also in visualization. Other techniques such as non-convex algorithms, sampling strategies and matrix randomization aid in the processing of large data sets. With massive data, settling for an approximate solution seems to be the saving compromise, whether by allowing for more flexibility in an objective function or by replacing a standard algorithm that is infeasible at such a large scale.

Many of the most striking examples of big data analysis presented during this thematic program came from teams of many scientists with complementary backgrounds. Moving forward, research teams need to scale up, inviting talents from different fields to come together and join forces. The phrase “Big Data” has had a great deal of media attention in recent years, creating a hype that will certainly fade. The new problems that have emerged, however, are here to stay and the need for innovative solutions will certainly keep our community busy for many years to come.

Acknowledgments

We are grateful to the Fields Institute for Research in the Mathematical Sciences, which provided the major portion of the funding for the thematic program on Statistical Inference, Learning and Models in Big Data. The Fields Institute is funded by the Natural Sciences and Engineering Research Council of Canada and the Ministry of Training, Colleges and Universities in Ontario. Funding was also provided for specific workshops by the Pacific Institute for the Mathematical Sciences, the Centre de Recherches Mathématiques, the Atlantic Association for Advancement in the Mathematical Sciences, the Department of Statistics at Iowa State University, the National Science Foundation, and the Canadian Statistical Sciences Institute.

We thank the reviewers of an earlier version of the paper for helpful and constructive comments that improved the scope and presentation.

References

1000 Genomes Project Consortium (2012). An integrated map of genetic variation from 1,092 human genomes. Nature 491(7422) 56–65.

AHMED, S. E. (2014). Penalty, shrinkage and pretest strategies: Variable selection and estimation. Springer Briefs in Statistics. Springer. Cham.

ALBERT-GREEN, A., BRAUN, W. J., MARTELL, D. L. and WOOLFORD, D. G. (2014). Visualization tools for assessing the Markov property: sojourn times in the forest fire weather index in Ontario. Environmetrics 25(6) 417–430.

AMARI, S.-I. and NAGAOKA, H. (2000). Methods of information geometry (Vol. 191). Translations of Mathematical Monographs. American Mathematical Society. USA.

ANANDKUMAR, A., GE, R., HSU, D., KAKADE, S. M. and TELGARSKY, M. (2014). Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research 15(1) 2773–2832.

ANANDKUMAR, A., HSU, D., JANZAMIN, M. and KAKADE, S. M. (2013). When are overcomplete topic models identifiable? Uniqueness of tensor Tucker decompositions with structured sparsity. Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z. and Weinberger, K. Q. (Eds.). In Advances in Neural Information Processing Systems 26 1986–1994.

BALAKRISHNAN, S., WAINWRIGHT, M. J. and YU, B. (2015). Statistical guarantees for the EM algorithm: From population to sample-based analysis. Annals of Statistics. Accepted.

BARBOZA, L., LI, B., TINGLEY, M. P. and VIENS, F. G. (2014). Reconstructing past temperatures from natural proxies and estimated climate forcings using short- and long-memory models. The Annals of Applied Statistics 8(4) 1966–2001.

BAUDISCH, P. and ROSENHOLTZ, R. (2003). Halo: a technique for visualizing off-screen objects. In Proceedings of the SIGCHI conference on human factors in computing systems CHI '03 481–488. ACM.

BELL, P. (2013). Creating competitive advantage using big data. Ivey Business Journal 1–5.

BELL, R. M. (2015). Big Data: It's Not the Data. Presented at the Fields Institute, January 12th 2015. Available at http://www.fields.utoronto.ca/video-archive/static/2015/01/315-4206/mergedvideo.ogv, accessed on January 26, 2016.

BENGIO, Y., COURVILLE, A. and VINCENT, P. (2013). Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, IEEE 35(8) 1798–1828.

BIOUCAS-DIAS, J. M., PLAZA, A., CAMPS-VALLS, G., SCHEUNDERS, P., NASRABADI, N. M. and CHANUSSOT, J. (2013). Hyperspectral remote sensing data analysis and future challenges. Geoscience and Remote Sensing Magazine, IEEE 1(2) 6–36.

BLUM, M. G. and TRAN, V. C. (2010). HIV with contact tracing: a case study in approximate Bayesian computation. Biostatistics 11(4) 644–660.

BRIEN, S., NADERI, N., SHABAN-NEJAD, A., MONDOR, L., KROEMKER, D. and BUCKERIDGE, D. L. (2013). Vaccine attitude surveillance using semantic analysis: Constructing a semantically annotated corpus. In Proceedings of the 22nd international conference on World Wide Web companion, Janeiro, Brazil, 13–17 May 2013, 683–686. International World Wide Web Conferences Steering Committee.

BROWN, P. E., NGUYEN, P., STAFFORD, J. and ASHTA, S. (2013). Local-EM and mismeasured data. Statistics and Probability Letters 83(1) 135–140.

BUCKERIDGE, D. L., CHARLAND, K., LABBAN, A. and MA, Y. (2014). A method for neighborhood-level surveillance of food purchasing. Annals of the New York Academy of Sciences 1331(1) 270–277.

BUCKERIDGE, D. L., IZADI, M., SHABAN-NEJAD, A., MONDOR, L., JAUVIN, C., DUBÉ, L., JANG, Y. and TAMBLYN, R. (2012). An infrastructure for real-time population health assessment and monitoring. IBM Journal of Research and Development 56(5) 2:1–2:11.

BUCKERIDGE, D. L., OWENS, D. K., SWITZER, P., FRANK, J. and MUSEN, M. A. (2006). Evaluating detection of an inhalational anthrax outbreak. Emerging Infectious Diseases 12(12) 1942–1949.

BUJA, A., COOK, D., HOFMANN, H., LAWRENCE, M., LEE, E. K., SWAYNE, D. F. and WICKHAM, H. (2009). Statistical inference for exploratory data analysis and model diagnostics. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 367(1906) 4361–4383.

BUTLER, D. (2013). When Google got flu wrong. Nature 494(7436) 155–156.

CANDÈS, E. J., ROMBERG, J. and TAO, T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. Information Theory, IEEE 52(2) 489–509.

CANDÈS, E. J., WAKIN, M. B. and BOYD, S. P. (2008). Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications 14(5–6) 877–905.

CAPPS, C. and WRIGHT, T. (2013). Toward a vision: Official statistics and big data. AMSTAT NEWS. http://magazine.amstat.org/blog/2013/08/01/official-statistics/ accessed on January 26, 2016.

CARAGEA, D., SILVESCU, A. and HONAVAR, V. (2004). A framework for learning from distributed data using sufficient statistics and its application to learning decision trees. International Journal of Hybrid Intelligent Systems 1(1–2) 80–89.

CARPENDALE, S. (2015). Website of the Innovis Research Group available at HTTP://INNOVIS.CPSC.UCALGARY.CA/RESEARCH/RESEARCH, accessed on January 26, 2015.

CARPENDALE, S., LIGH, J. and PATTISON, E. (2004). Achieving higher magnification in context. In Proceedings of the 17th annual ACM symposium on User interface software and technology, Santa Fe, USA, 24–27 October 2004, 71–80. ACM.

CHEN, M., MAO, S. and LIU, Y. (2014). Big data: A survey. Mobile Networks and Applications 19(2) 171–209.

CHUNG, F. and LU, L. (2002). The average distances in random graphs with given expected degrees. Proceedings of the National Academy of Sciences 99(25) 15879–15882.

COOK, D. (2014). Visiphilia: The gender gap in math is not universal. CHANCE 27(4) 48–52.

DAAS, P. J. H., PUTS, M. J., BUELENS, B. and VAN DEN HURK, P. A. M. (2015). Big Data as a source for official statistics. Journal of Official Statistics 31(2) 249–262.

DEAN, C., BRAUN, W. J., MARTELL, D. and WOOLFORD, D. (2012). Interdisciplinary training in a collaborative research environment. Chrisman, N. and Wachowicz, M. (Eds.). The added value of scientific networking: Perspectives from the GEOIDE network members 1998–2012 (pages 69–73). GEOIDE Network, Quebec.

DONOHO, D. and GAVISH, M. (2014). Minimax risk of matrix denoising by singular value thresholding. Annals of Statistics 42(6) 2413–2440.

DONOHO, D., GAVISH, M. and MONTANARI, A. (2013). The phase transition of matrix recovery from Gaussian measurements matches the minimax MSE of matrix denoising. Proceedings of the National Academy of Sciences 110(21) 8405–8410.

DRUMMOND, C., MATWIN, S. and GAFFIELD, C. (2006). Inferring and revising theories with confidence: analyzing bilingualism in the 1901 Canadian census. Applied Artificial Intelligence 20(1) 1–33.

DUHIGG, C. (2012). How companies learn your secrets. The New York Times. February 16, 2012.

ELDAR, Y. C. and KUTYNIOK, G. (2012). Compressed sensing: Theory and applications. Cambridge University Press. New York.

ENSOR, K. B., RAUN, L. H. and PERSSE, D. (2013). A case-crossover analysis of out-of-hospital cardiac arrest and air pollution. Circulation 127 1192–1199.

FAN, J. and LI, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96(456) 1348–1360.

FANG, E. X., NING, Y. and LIU, H. (2014). Testing and confidence intervals for high dimensional proportional hazards model. arXiv preprint arXiv:1412.5158.

GAFFIELD, C. (2015). Big data vs. human complexity: an early status report on the central question of the 21st century. Presented at the Fields Institute, January 20th 2015. Available at http://www.fields.utoronto.ca/video-archive/2015/01/315-4136, accessed on January 26, 2016.

GEIGER, A., LENZ, P. and URTASUN, R. (2012). Are we ready for autonomous driving? The KITTI vision benchmark suite. In Computer Vision and Pattern Recognition, IEEE CVPR 2012 3354–3361.

GINSBERG, J., MOHEBBI, M. H., PATEL, R. S., BRAMMER, L., SMOLINSKI, M. S. and BRILLIANT, L. (2009). Detecting influenza epidemics using search engine query data. Nature 457(7232) 1012–1014.

GIROLAMI, M. and CALDERHEAD, B. (2011). Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73(2) 123–214.

GROSSE, R. and SALAKHUTDINOV, R. (2015). Scaling up natural gradient by sparsely factorizing the inverse Fisher information matrix. Bach, F. and Blei, D. (Eds.). In Proceedings of The 32nd International Conference on Machine Learning (ICML-15) 2304–2313.

GUO, Y. and SCHUURMANS, D. (2007). Convex relaxations of latent variable training. Platt, J. C., Koller, D., Singer, Y. and Roweis, S. T. (Eds.). In Advances in Neural Information Processing Systems 20 601–608.

GUPTA, V. and LEHAL, G. S. (2009). A survey of text mining techniques and applications. Journal of Emerging Technologies in Web Intelligence 1(1) 60–76.

GUYON, I. and ELISSEEFF, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research 3 1157–1182.

HALKO, N., MARTINSSON, P. G. and TROPP, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review 53(2) 217–288.

HARFORD, T. (2014). Big data: are we making a big mistake? Financial Times. March 24, 2014.

HINTON, G. E. and SALAKHUTDINOV, R. R. (2006). Reducing the dimensionality of data with neural networks. Science 313(5786) 504–507.

HODAS, N. O. and LERMAN, K. (2014). The simple rules of social contagion. Scientific Reports 4(4343).

HOENS, T. R., POLIKAR, R. and CHAWLA, N. V. (2012). Learning from streaming data with concept drift and imbalance: An overview. Progress in Artificial Intelligence 1(1) 89–101.

HOSSEINKHANI, M., FONG, P. W. L. and CARPENDALE, S. (2014). Papilio: Visualizing Android application permissions. Carr, H., Rheingans, P. and Schumann, H. (Eds.). In Proceedings of the EuroVis Conference 33(3) 391–400. John Wiley & Sons Ltd.

HURON, S., CARPENDALE, S., THUDT, A., TANG, A. and MAUERER, M. (2014). Constructive visualization. In Proceedings of the 2014 conference on Designing interactive systems, Vancouver, Canada, June 21–25, 2014, 433–442. ACM.

IZADI, M., SHABAN-NEJAD, A., OKHMATOVSKAIA, A., MONDOR, L. and BUCKERIDGE, D. L. (2013). Population health record: An informatics infrastructure for management, integration, and analysis of large scale population health data. Proceedings of AAAI HIAI 2013 35–40. AAAI.

JAIN, S., LANGE, S. and ZILLES, S. (2006). Towards a better understanding of incremental learning. In Algorithmic Learning Theory (169–183). Springer. Berlin.

JAVANMARD, A. and MONTANARI, A. (2014a). Hypothesis testing in high-dimensional regression under the Gaussian random design model: Asymptotic theory. Information Theory, IEEE 60(10) 6522–6554.

JAVANMARD, A. and MONTANARI, A. (2014b). Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research 15(1) 2869–2909.

JOHNSON, V. E. (2003). Grade Inflation: A crisis in college education. Springer-Verlag, New York.

KALAL, Z., MIKOLAJCZYK, K. and MATAS, J. (2012). Tracking-learning-detection. Pattern Analysis and Machine Intelligence, IEEE 34(7) 1409–1422.

KELLER, S. (2015). Building resilient cities: Harnessing the power of urban analytics. Presented at the Fields Institute, January 20th 2015. Available at http://www.fields.utoronto.ca/video-archive/static/2015/01/315-4137/mergedvideo.ogv, accessed on January 26, 2016.

KIRON, D. and SHOCKLEY, R. (2011). Creating business value with analytics. MIT Sloan Management Review 53(1) 57–63.

KLEINER, A., TALWALKAR, A., SARKAR, P. and JORDAN, M. I. (2014). A scalable bootstrap for massive data. Journal of the Royal Statistical Society Series B 76(4) 795–816.

KOLACZYK, E. D. (2009). Statistical analysis of network data: Methods and models. Springer Science and Business Media, New York.

KORKMAZ, G., CADENA, J., KUHLMAN, C. J., MARATHE, A., VULLIKANTI, A. and RAMAKRISHNAN, N. (2015). Combining heterogeneous data sources for civil unrest forecasting. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining ASONAM '15 258–265. ACM.

KRAMER, A. D., GUILLORY, J. E. and HANCOCK, J. T. (2014). Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences 111(24) 8788–8790.

LEE, J. S., NGUYEN, P., BROWN, P. E., STAFFORD, J. and SAINT-JACQUES, N. (2015). Local-EM Algorithm for spatio-temporal analysis with application in Southwestern Nova Scotia. Unpublished Manuscript.

LETOUZÉ, E. and JÜTTING, J. (2014). Official Statistics, big data and human development: Towards a new conceptual and operational approach. Data-Pop Alliance White Papers Series. http://www.odi.org/sites/odi.org.uk/files/odi-assets/events-documents/5161.pdf, accessed on January 26, 2016.

LI, Y., BROWN, P., GESINK, D. C. and RUE, H. (2012). Log Gaussian Cox processes and spatially aggregated disease incidence data. Statistical Methods in Medical Research 21(5) 479–507.

LI, L. J. and FEI-FEI, L. (2010). Optimol: Automatic online picture collection via incremental model learning. International Journal of Computer Vision 88(2) 147–168.

LIN, N. and XI, R. (2011). Aggregated estimating equation estimation. Statistics and Its Interface 4 73–83.

LIX, L. M., SMITH, M., AZIMAEE, M., DAHL, M., NICOL, P., BURCHILL, C., BURLAND, E., … and BAILLY, A. (2012). A systematic investigation of Manitoba’s provincial laboratory data. Winnipeg, MB: Manitoba Centre for Health Policy.

LOCKHART, R., TAYLOR, J., TIBSHIRANI, R. J. and TIBSHIRANI, R. (2014). A significance test for the lasso. Annals of Statistics 42(2) 413–468.

LOH, P. L. and WAINWRIGHT, M. J. (2014). Support recovery without incoherence: A case for nonconvex regularization. arXiv preprint arXiv:1412.5632.

LOH, P. L. and WAINWRIGHT, M. J. (2015). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research 16 559–616.

LOHR, S. (2012). The age of big data. New York Times. February 11, 2012.

MARCHETTI, S., GIUSTI, C., PRATESI, M., SALVATI, N., GIANNOTTI, F., PEDRESCHI, D., RINZIVILLO, S., PAPPALARDO, L. and GABRIELLI, L. (2015). Small area model-based estimators using big data sources. Journal of Official Statistics 31(2) 263–281.

MCAFEE, A. and BRYNJOLFSSON, E. (2012). Big data: The management revolution. Harvard Business Review. March 18, 2015.

MEYER, M. N. (2014). Everything you need to know about Facebook’s controversial emotion experiment. Wired. Retrieved July 03, 2014.

MICHEL, J. B., SHEN, Y. K., AIDEN, A. P., VERES, A., GRAY, M. K., PICKETT, J. P., ... and AIDEN, E. L. (2011). Quantitative analysis of culture using millions of digitized books. Science 331(6014) 176–182.

MONTELEONI, C., SCHMIDT, G. A., SAROHA, S. and ASPLUND, E. (2011). Tracking climate models. Statistical Analysis and Data Mining: The ASA Data Science Journal 4(4) 372–392.

MURRAY, C. J., VOS, T., LOZANO, R., NAGHAVI, M., FLAXMAN, A. D., MICHAUD, C., ... and BRIDGETT, L. (2013). Disability-adjusted life years (DALYs) for 291 diseases and injuries in 21 regions, 1990–2010: A systematic analysis for the Global Burden of Disease Study 2010. The Lancet 380(9859) 2197–2223.

NACENTA, M., HINRICHS, U. and CARPENDALE, S. (2012). FatFonts: Combining the symbolic and visual aspects of numbers. Tortora, G., Levialdi, S. and Tucci, M. (Eds.). In Proceedings of the International Working Conference on Advanced Visual Interfaces, Naples, Italy, May 22–25, 2012, 407–414. ACM.

NELSON, L. J., OZOLIŅŠ, V., REESE, C. S., ZHOU, F. and HART, G. L. (2013). Cluster expansion made easy with Bayesian compressive sensing. Physical Review B 88(15) 155105.

NETRAPALLI, P., NIRANJAN, U. N., SANGHAVI, S., ANANDKUMAR, A. and JAIN, P. (2014). Non-convex robust PCA. Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D. and Weinberger, K. Q. (Eds.). In Advances in Neural Information Processing Systems 27 1107–1115.

NEUMANN, P., CARPENDALE, M. S. T. and AGARAWALA, A. (2006). PhylloTrees: Phyllotactic patterns for tree layout. Ertl, T., Joy, K. and Santos, B. (Eds.). In Proceedings of Eurographics/IEEE VGTC Symposium on Visualization EuroVis 2006 59–66.

NGUYEN, P., BROWN, P. E. and STAFFORD, J. (2012). Mapping cancer risk in southwestern Ontario with changing census boundaries. Biometrics 68(4) 1228–1237.

OLHEDE, S. C. and WOLFE, P. J. (2014). Network histograms and universality of blockmodel approximation. Proceedings of the National Academy of Sciences 111(41) 14722–14727.

PANG, B. and LEE, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1–2) 1–135.

PIKETTY, T. (2014). Capital in the twenty-first century. Harvard University Press. Cambridge, USA.

PUTS, M., DAAS, P. and DE WAAL, T. (2015). Finding errors in big data. Significance 12(3) 26–29.

RAMÍREZ-RAMÍREZ, L. L., GEL, Y. R., THOMPSON, M., DE VILLA, E. and MCPHERSON, M. (2013). A new surveillance and spatio-temporal visualization tool SIMID: Simulation of infectious diseases using random networks and GIS. Computer Methods and Programs in Biomedicine 110(3) 455–470.

RAO, C. R. (1945). Information and the accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Math. Soc. 37 81–89.

REESE, C. S., TOLLEY, H. D., KEITH, J. and BURLINGAME, G. (2014). Unraveling the mystery of grade inflation and student ratings: A BYU Case Study. BYU Department of Statistics Technical Report, TR-14-021.

ROBERTSON, J., MCELDUFF, P., PEARSON, S. A., HENRY, D. A., INDER, K. J. and ATTIA, J. R. (2012). The health services burden of heart failure: An analysis using linked population health data-sets. BMC Health Services Research 12(103) 1–11.

SALAKHUTDINOV, R. and HINTON, G. (2012). An efficient learning procedure for deep Boltzmann machines. Neural Computation 24(8) 1967–2006.

SAMPSON, P. D. and GUTTORP, P. (1992). Nonparametric estimation of nonstationary spatial covariance structure. Journal of the American Statistical Association 87(417) 108–119.

SCHADT, E. E., LINDERMAN, M. D., SORENSON, J., LEE, L. and NOLAN, G. P. (2010). Computational solutions to large-scale data management and analysis. Nature Reviews Genetics 11(9) 647–657.

SCHMIDT, A. M., GUTTORP, P. and O'HAGAN, A. (2011). Considering covariates in the covariance structure of spatial processes. Environmetrics 22(4) 487–500.

SCOTT, S. L. (2015). Bayes and big data: The consensus Monte Carlo algorithm. Presented at the Fields Institute, January 13th 2015. Available at http://www.fields.utoronto.ca/video-archive/static/2015/01/315-4117/mergedvideo.ogv, accessed on January 26, 2016.

SCOTT, S. L., BLOCKER, A. W., BONASSI, F. V., CHIPMAN, H. A., GEORGE, E. I. and MCCULLOCH, R. E. (2013). Bayes and big data: [email protected], USA, December 15–17, 2013.

SHVACHKO, K., KUANG, H., RADIA, S. and CHANSLER, R. (2010). The Hadoop distributed file system. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26 1–10.

SINGEL, R. (2009). Netflix spilled your Brokeback Mountain Secret, lawsuit claims. Wired. December 17, 2009.

SINGEL, R. (2010). Netflix cancels recommendation contest after privacy lawsuit. Wired. March 12, 2010.

SRIVASTAVA, N. and SALAKHUTDINOV, R. R. (2012). Multimodal learning with deep Boltzmann machines. Pereira, F., Burges, C. J. C., Bottou, L. and Weinberger, K. Q. (Eds.). In Advances in Neural Information Processing Systems 25 2222–2230.

STRUIJS, P., BRAAKSMA, B. and DAAS, P. J. H. (2014). Official statistics and big data. Big Data and Society 1(1) 1–6. DOI:10.1177/2053951714538417.

TAM, S. M. and CLARKE, F. (2015). Big data, official statistics and some initiatives by the Australian Bureau of Statistics. International Statistical Review 83(3) 436–448.

TARRAN, B. (2014). Now you see me, now you don’t. Significance 11(4) 11–14.

TIBSHIRANI, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society 58(1) 267–288.

TUKEY, J. W. (1977). Exploratory data analysis. Addison-Wesley. Reading, USA.

UNDERWOOD, T., LONG, H. and RICHARD, J. S. (2014). Cents and Sensibility: Trust Thomas Piketty on economic inequality. Ignore what he says about literature. Blog post on Slate.com. Retrieved from http://www.slate.com/articles/business/moneybox/2014/12/thomas_piketty_on_literature_balzac_austen_fitzgerald_show_arc_of_money.html accessed on January 26, 2016.

VAN DE GEER, S., BÜHLMANN, P., RITOV, Y. A. and DEZEURE, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Annals of Statistics 42(3) 1166–1202.

WICKHAM, H., COOK, D., HOFMANN, H. and BUJA, A. (2010). Graphical inference for Infovis. Visualization and Computer Graphics, IEEE 16(6) 973–979.

WITTEN, R. and CANDÈS, E. (2013). Randomized algorithms for low-rank matrix factorizations: Sharp performance bounds. Algorithmica 72(1) 264–281.

WOOLFORD, D. G., BRAUN, W. J., DEAN, C. B. and MARTELL, D. L. (2009). Site-specific seasonal baselines for fire risk in Ontario. Geomatica 63(4) 355–363.

WOOLFORD, D. G. and BRAUN, W. J. (2007). Convergent data sharpening for the identification and tracking of spatial temporal centers of lightning activity. Environmetrics 18(5) 461–479.

WU, W. B. and POURAHMADI, M. (2003). Nonparametric estimation of large covariance matrices of longitudinal data. Biometrika 90(4) 831–844.

ZOU, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101(476) 1418–1429.
