CompSci 590.02 Instructor: Ashwin Machanavajjhala

Algorithms for Big-Data Management

Lecture 1: 590.02 Spring 13

Administrivia
http://www.cs.duke.edu/courses/spring13/compsci590.2/

•  Tue/Thu 3:05–4:20 PM

•  “Reading Course + Project”
–  No exams!
–  Every class is based on 1 (or 2) assigned papers that students must read.

•  Projects: (50% of grade)
–  Individual or groups of size 2-3

•  Class participation + assignments (other 50%)

•  Office hours: by appointment


Administrivia
•  Projects: (50% of grade)
–  Ideas will be posted in the coming weeks

•  Goals:
–  Literature review
–  Some original research/implementation

•  Timeline (details will be posted on the website soon)
–  ≤ Feb 12: Choose project (ideas will be posted… new ideas welcome)
–  Feb 21: Project proposal (1-4 pages describing the project)
–  Mar 21: Mid-project review (2-3 page report on progress)
–  Apr 18: Final presentations and submission (6-10 page conference-style paper + 20 minute talk)


Why you should take this course?
•  Industry, academic and government research identifies the value of analyzing large data collections in all walks of life.
–  “What Next? A Half-Dozen Data Management Research Goals for Big Data and Cloud”, Surajit Chaudhuri, Microsoft Research
–  “Big data: The next frontier for innovation, competition, and productivity”, McKinsey Global Institute Report, 2011


Why you should take this course?
•  Very active field and tons of interesting research.

We will read papers in:
–  Data Management
–  Theory
–  Machine Learning

–  …


Why you should take this course?
•  Intro to research by working on a cool project
–  Read scientific papers
–  Formulate a problem
–  Perform a scientific evaluation


Today
•  Course overview

•  An algorithm for sampling


INTRODUCTION


What is Big Data?



http://visual.ly/what-big-data

3 Key Trends
•  Increased data collection

•  (Shared nothing) Parallel processing frameworks on commodity hardware

•  Powerful analysis of trends by linking data from heterogeneous sources


Big-Data impacts all aspects of our life


The value in Big-Data…

[Figure panels: Recommended links, Personalized News Interests, Top Searches]

+250% clicks vs. editorial one size fits all

+79% clicks vs. randomly selected

+43% clicks vs. editor selected

The value in Big-Data…

“If US healthcare were to use big data creatively and effectively to drive efficiency and quality, the sector could create more than $300 billion in value every year.”
McKinsey Global Institute Report

Example: Google Flu



http://www.ccs.neu.edu/home/amislove/twittermood/

Course Overview
•  Sampling
–  Reservoir Sampling
–  Sampling with indices
–  Sampling from Joins
–  Markov chain Monte Carlo sampling
–  Graph Sampling & PageRank

•  Streaming Algorithms
–  Sketches
–  Online Aggregation
–  Windowed queries
–  Online learning

•  Parallel Architectures & Algorithms
–  PRAM
–  MapReduce
–  Graph processing architectures: Bulk Synchronous Parallel and asynchronous models
–  (Graph connectivity, Matrix Multiplication, Belief Propagation)

•  Joining datasets & Record Linkage
–  Theta Joins: or how to optimally join two large datasets
–  Clustering similar documents using minHash
–  Identifying matching users across social networks
–  Correlation Clustering
–  Markov Logic Networks


SAMPLING


Why Sampling?
•  Approximately compute quantities when

–  Processing the entire dataset takes too long.
   How many tweets mention Obama?

–  Computation is intractable.
   Number of satisfying assignments for a DNF.

–  We do not have access, or it is expensive to get access, to the entire data.
   How many restaurants does Google know about?
   Number of users in Facebook whose birthday is today.
   What fraction of the population has the flu?


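Zero-One Estimator Theorem
Input: A universe of items U (e.g., all tweets)

A subset G of U (e.g., tweets mentioning Obama)

Goal: Estimate $\mu = |G|/|U|$

Algorithm:
•  Pick N samples uniformly at random from U: $\{x_1, x_2, \ldots, x_N\}$
•  For each sample, let $Y_i = 1$ if $x_i \in G$, and $Y_i = 0$ otherwise.
•  Output: $Y = \frac{1}{N}\sum_{i=1}^{N} Y_i$

Theorem: Let $\varepsilon < 2$. If $N > \frac{1}{\mu} \cdot \frac{4 \ln(2/\delta)}{\varepsilon^2}$, then

$\Pr[(1-\varepsilon)\mu < Y < (1+\varepsilon)\mu] > 1 - \delta$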



Proof:Homework
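The estimator and its sample-size bound can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `required_samples` and `zero_one_estimate` are hypothetical, samples are drawn uniformly with replacement, and because the true μ is unknown in practice, the bound is evaluated with an assumed lower bound `mu_lower` on μ (a smaller μ only makes N larger, so the guarantee still applies).

```python
import math
import random


def required_samples(mu_lower, eps, delta):
    """Smallest integer N with N > (1/mu_lower) * 4 ln(2/delta) / eps^2."""
    bound = (1.0 / mu_lower) * 4.0 * math.log(2.0 / delta) / eps ** 2
    return math.floor(bound) + 1


def zero_one_estimate(universe, in_subset, num_samples, rng):
    """Estimate |G|/|U| as the fraction of uniform samples that fall in G."""
    hits = 0
    for _ in range(num_samples):
        x = rng.choice(universe)   # uniform draw from U, with replacement
        if in_subset(x):           # Y_i = 1 if x_i is in G, else 0
            hits += 1
    return hits / num_samples


# Hypothetical toy universe: G = multiples of 10, so the true mu is 0.1.
rng = random.Random(0)
universe = list(range(1000))
n_samples = required_samples(mu_lower=0.1, eps=0.2, delta=0.05)
print(n_samples, zero_one_estimate(universe, lambda x: x % 10 == 0, n_samples, rng))
```

Note how the required N grows as 1/μ: rare subsets need many more samples for the same relative error.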


Simple Random Sample
•  Given a table of size N, pick a subset of n rows, such that each subset of n rows is equally likely.

•  How to sample n rows?
•  … if we don’t know N?
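For contrast with the unknown-N case, here is a minimal sketch of one standard way to draw a simple random sample in a single pass when N is known in advance (often called selection sampling). It is not taken from the slides; the function name `sample_known_size` is hypothetical. Each row is kept with probability (rows still needed) / (rows still remaining).

```python
import random


def sample_known_size(rows, N, n, rng=None):
    """One-pass simple random sample of n rows when the table size N is known."""
    rng = rng or random.Random()
    chosen = []
    for seen, row in enumerate(rows):   # `seen` rows were read before this one
        remaining = N - seen            # rows left, including the current one
        needed = n - len(chosen)        # rows still to be selected
        if rng.random() * remaining < needed:
            chosen.append(row)
    return chosen


print(sample_known_size(range(100), N=100, n=5, rng=random.Random(0)))
```

The selection probability depends on N at every step, which is exactly what is unavailable when the table size is unknown.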


Reservoir Sampling
Highlights:

•  Make one pass over the data
•  Maintain a reservoir of n records.

•  After reading t rows, the reservoir is a simple random sample of the first t rows.


Reservoir Sampling [Vitter ACM ToMS ’85]
Algorithm R:

•  Initialize the reservoir to the first n rows.

•  For the (t+1)st row R,

–  Pick a random number m between 1 and t+1

–  If m <= n, then replace the m-th row in the reservoir with R
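The steps of Algorithm R translate almost directly into Python. The sketch below is a minimal illustration, assuming the rows arrive as any iterable and that m is drawn uniformly from 1..t+1 inclusive; the function name `reservoir_sample` is hypothetical.

```python
import random


def reservoir_sample(rows, n, rng=None):
    """One-pass sample of n rows from an iterable whose length is not known."""
    rng = rng or random.Random()
    reservoir = []
    for t, row in enumerate(rows):      # t rows have been read before this one
        if t < n:
            reservoir.append(row)       # the first n rows fill the reservoir
        else:
            m = rng.randint(1, t + 1)   # uniform over 1..t+1, inclusive
            if m <= n:
                reservoir[m - 1] = row  # replace the m-th row with the new row
    return reservoir


print(reservoir_sample(range(1_000_000), 5, random.Random(0)))
```

If the stream holds fewer than n rows, this sketch simply returns all of them.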


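Proof
•  If N = n, then P[row is in sample] = 1. Hence, the reservoir contains all the rows in the table.

•  Suppose for N = t, the reservoir is a simple random sample. That is, each row has an n/t chance of appearing in the sample.

•  For N = t+1:
–  The (t+1)st row is included in the sample with probability n/(t+1).
–  Any other row in the reservoir is replaced only when m equals its slot, which happens with probability 1/(t+1). Hence:

$P[\text{row is in reservoir}] = P[\text{row is in reservoir after } t \text{ steps}] \cdot P[\text{row is not replaced}] = \frac{n}{t} \cdot \left(1 - \frac{1}{t+1}\right) = \frac{n}{t+1}$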


Complexity
•  Running time: O(N)

•  Number of calls to the random number generator: O(N)

•  Expected number of elements that may appear in the reservoir:

$n + \sum_{t=n}^{N-1} \frac{n}{t+1} = n(1 + H_N - H_n) \approx n(1 + \ln(N/n))$

•  Is there a way to sample faster, in time O(n(1 + ln(N/n)))?
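A quick numeric check of the expected-count formula can make the gap between O(N) and O(n(1 + ln(N/n))) concrete. The sketch below evaluates the exact sum and the logarithmic approximation; the function names and the example values of n and N are chosen purely for illustration.

```python
import math


def expected_reservoir_entries(n, N):
    """Exact value of n + sum_{t=n}^{N-1} n/(t+1)."""
    return n + sum(n / (t + 1) for t in range(n, N))


def approx_reservoir_entries(n, N):
    """The approximation n * (1 + ln(N/n))."""
    return n * (1 + math.log(N / n))


n, N = 10, 1_000_000
print(expected_reservoir_entries(n, N), approx_reservoir_entries(n, N))
```

For these values only on the order of a hundred rows ever enter the reservoir, out of a million rows scanned.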


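Faster algorithm
•  Algorithm R skips over (does not insert into the reservoir) a large number of records: about N − n(1 + ln(N/n)) in expectation.

•  After t rows have been read, let S(n,t) denote the number of rows skipped by Algorithm R before the next row is inserted into the reservoir.
–  Algorithm R spends O(S) time and O(S) calls to the random number generator on these skipped rows.

•  P[S(n,t) = s] = ?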


•  P[S(n,t) = s] is the probability that, for all t < x <= t+s, row x was not inserted into the reservoir, but row t+s+1 is inserted:

$P[S(n,t) = s] = \left(1 - \frac{n}{t+1}\right)\left(1 - \frac{n}{t+2}\right)\cdots\left(1 - \frac{n}{t+s}\right)\cdot\frac{n}{t+s+1}$

•  We can derive an expression for the CDF:

$P[S(n,t) \le s] = 1 - \frac{t}{t+s+1}\cdot\frac{t-1}{t+s}\cdot\frac{t-2}{t+s-1}\cdots\frac{t-n+1}{t+s-n+2}$
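The closed-form CDF follows from P[S(n,t) > s] being the product of (1 − n/x) for x = t+1, …, t+s+1, which telescopes into n ratios. As a sanity check, the sketch below compares the summed probability mass with the closed form using exact rational arithmetic; the function names `skip_pmf` and `skip_cdf` are hypothetical.

```python
from fractions import Fraction


def skip_pmf(n, t, s):
    """P[S(n,t) = s]: rows t+1..t+s are skipped and row t+s+1 is inserted."""
    p = Fraction(n, t + s + 1)
    for x in range(t + 1, t + s + 1):
        p *= 1 - Fraction(n, x)
    return p


def skip_cdf(n, t, s):
    """Closed form: 1 - prod_{j=0}^{n-1} (t - j) / (t + s + 1 - j)."""
    prod = Fraction(1)
    for j in range(n):
        prod *= Fraction(t - j, t + s + 1 - j)
    return 1 - prod


# The running sum of the mass function should equal the closed-form CDF.
n, t = 3, 10
for s in range(5):
    total = sum(skip_pmf(n, t, k) for k in range(s + 1))
    print(s, total, skip_cdf(n, t, s), total == skip_cdf(n, t, s))
```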




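Faster Algorithm
Algorithm X

•  Initialize the reservoir with the first n rows.
•  After seeing t rows, randomly sample a skip s = S(n,t) from the CDF:
–  Pick a random U uniformly between 0 and 1
–  Find the minimum s such that P[S(n,t) <= s] >= U (equivalently, P[S(n,t) > s] <= 1 − U)

•  Pick a number m uniformly at random between 1 and n
•  Replace the m-th row in the reservoir with the (t+s+1)st row.
•  Set t = t+s+1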
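A skip-based sampler along these lines can be sketched as follows. This is an illustrative sketch rather than Vitter's exact formulation: the name `reservoir_sample_with_skips` is hypothetical, and the skip is found by inverse-transform sampling, scanning for the smallest s whose tail probability P[S(n,t) > s] drops to 1 − U or below. Rows are discarded while the scan advances, so the work per skip is O(s) and only two random numbers are drawn per inserted row.

```python
import random
from itertools import islice

_END = object()  # sentinel marking the end of the stream


def reservoir_sample_with_skips(rows, n, rng=None):
    """Reservoir sampling that draws each skip length instead of testing every row."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = rng or random.Random()
    it = iter(rows)
    reservoir = list(islice(it, n))     # the first n rows fill the reservoir
    if len(reservoir) < n:
        return reservoir
    t = n                               # number of rows read so far
    while True:
        u = rng.random()
        prob_more = (t + 1 - n) / (t + 1)   # P[S > 0] = 1 - n/(t+1)
        # Skip rows until P[S > s] <= 1 - u, i.e. the minimum s with P[S <= s] >= u.
        while prob_more > 1 - u:
            if next(it, _END) is _END:
                return reservoir
            t += 1
            prob_more *= (t + 1 - n) / (t + 1)  # extend the product by one factor
        row = next(it, _END)            # this is row t+s+1 in the slide's notation
        if row is _END:
            return reservoir
        t += 1
        reservoir[rng.randrange(n)] = row   # replace a uniformly chosen slot


print(reservoir_sample_with_skips(range(1_000_000), 5, random.Random(0)))
```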


Algorithm X
•  Running time:

Each skip takes O(s) time to compute.
Total time = sum of all the skips = O(N)

•  Expected number of calls to the random number generator = 2 × expected number of rows in the reservoir

= O(n(1 + ln(N/n))), which is optimal!

See the paper for an algorithm which has optimal runtime.


Summary
•  Sampling is an important technique for computation when data is too large, or the computation is intractable, or if access to data is limited.

•  Reservoir sampling techniques allow computing a sample even without knowledge of the size of the data.
–  Can also do weighted sampling [Efraimidis, Spirakis IPL 2006]

•  Very useful for sampling from streams (e.g., Twitter stream)
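For the weighted-sampling remark in the summary, one commonly described scheme from the cited Efraimidis and Spirakis paper gives each item the key u^(1/w), with u uniform in (0, 1) and w the item's weight, and keeps the n items with the largest keys. The sketch below is a minimal illustration under that assumption; the function name `weighted_reservoir_sample` and the (row, weight) input format are hypothetical.

```python
import heapq
import random


def weighted_reservoir_sample(weighted_rows, n, rng=None):
    """One-pass weighted sample of n rows from an iterable of (row, weight) pairs."""
    rng = rng or random.Random()
    heap = []   # min-heap of (key, index, row); the index breaks ties between keys
    for i, (row, weight) in enumerate(weighted_rows):
        if weight <= 0:
            continue                    # rows with non-positive weight are never sampled
        key = rng.random() ** (1.0 / weight)
        if len(heap) < n:
            heapq.heappush(heap, (key, i, row))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, i, row))   # evict the smallest key
    return [row for _, _, row in heap]


rows = [("a", 1.0), ("b", 9.0), ("c", 3.0)]
print(weighted_reservoir_sample(rows, 2, random.Random(0)))
```

With n = 1 this selects each row with probability proportional to its weight.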


References
•  J. Vitter, “Random Sampling with a Reservoir”, ACM Transactions on Mathematical Software, 1985
•  P. Efraimidis, P. Spirakis, “Weighted random sampling with a reservoir”, Information Processing Letters, 97(5), 2006
•  R. Karp, M. Luby, N. Madras, “Monte Carlo Approximation Algorithms for Enumeration Problems”, Journal of Algorithms, 1989

