Algorithms for Big-Data Management
CompSci 590.02
Instructor: Ashwin Machanavajjhala
Lecture 1: 590.02 Spring 13

Administrivia
http://www.cs.duke.edu/courses/spring13/compsci590.2/
• Tue/Thu 3:05–4:20 PM
• “Reading Course + Project”
  – No exams!
  – Every class is based on 1 (or 2) assigned papers that students must read.
• Projects: (50% of grade)
  – Individual or groups of size 2-3
• Class participation + assignments (other 50%)
• Office hours: by appointment

Administrivia
• Projects: (50% of grade)
  – Ideas will be posted in the coming weeks
• Goals:
  – Literature review
  – Some original research/implementation
• Timeline (details will be posted on the website soon)
  – ≤ Feb 12: Choose project (ideas will be posted … new ideas welcome)
  – Feb 21: Project proposal (1-4 pages describing the project)
  – Mar 21: Mid-project review (2-3 page report on progress)
  – Apr 18: Final presentations and submission (6-10 page conference-style paper + 20 minute talk)

Why you should take this course?
• Industry, academic and government research identifies the value of analyzing large data collections in all walks of life.
  – “What Next? A Half-Dozen Data Management Research Goals for Big Data and Cloud”, Surajit Chaudhuri, Microsoft Research
  – “Big data: The next frontier for innovation, competition, and productivity”, McKinsey Global Institute Report, 2011

Why you should take this course?
• Very active field and tons of interesting research.
  We will read papers in:
  – Data Management
  – Theory
  – Machine Learning
  – …

Why you should take this course?
• Intro to research by working on a cool project
  – Read scientific papers
  – Formulate a problem
  – Perform a scientific evaluation

Today
• Course overview
• An algorithm for sampling

INTRODUCTION

What is Big Data?
[Infographic: http://visual.ly/what-big-data]

3 Key Trends
• Increased data collection
• (Shared nothing) Parallel processing frameworks on commodity hardware
• Powerful analysis of trends by linking data from heterogeneous sources

Big-Data impacts all aspects of our life

The value in Big-Data…
[Figure panels: Recommended links, Personalized News Interests, Top Searches]
• +250% clicks vs. editorial one size fits all
• +79% clicks vs. randomly selected
• +43% clicks vs. editor selected

The value in Big-Data…
“If US healthcare were to use big data creatively and effectively to drive efficiency and quality, the sector could create more than $300 billion in value every year.”
McKinsey Global Institute Report

Example: Google Flu

[Figure: http://www.ccs.neu.edu/home/amislove/twittermood/]

Course Overview
• Sampling
  – Reservoir Sampling
  – Sampling with indices
  – Sampling from Joins
  – Markov chain Monte Carlo sampling
  – Graph Sampling & PageRank

Course Overview
• Sampling
• Streaming Algorithms
  – Sketches
  – Online Aggregation
  – Windowed queries
  – Online learning

Course Overview
• Sampling
• Streaming Algorithms
• Parallel Architectures & Algorithms
  – PRAM
  – MapReduce
  – Graph processing architectures: Bulk Synchronous Parallel and asynchronous models
  – (Graph connectivity, Matrix Multiplication, Belief Propagation)

Course Overview
• Sampling
• Streaming Algorithms
• Parallel Architectures & Algorithms
• Joining datasets & Record Linkage
  – Theta Joins: or how to optimally join two large datasets
  – Clustering similar documents using minHash
  – Identifying matching users across social networks
  – Correlation Clustering
  – Markov Logic Networks

SAMPLING
Why Sampling?
• Approximately compute quantities when
  – Processing the entire dataset takes too long.
    How many tweets mention Obama?
  – Computation is intractable.
    Number of satisfying assignments for a DNF.
  – We do not have access, or it is expensive to get access, to the entire data.
    How many restaurants does Google know about?
    Number of users in Facebook whose birthday is today.
    What fraction of the population has the flu?

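Zero-One Estimator Theorem
Input: A universe of items $U$ (e.g., all tweets)
       A subset $G \subseteq U$ (e.g., tweets mentioning Obama)
Goal: Estimate $\mu = |G| / |U|$
Algorithm:
• Pick $N$ samples from $U$: $\{x_1, x_2, \ldots, x_N\}$
• For each sample, let $Y_i = 1$ if $x_i \in G$, and $Y_i = 0$ otherwise.
• Output: $Y = \sum_{i=1}^{N} Y_i / N$
Theorem: Let $\varepsilon < 2$. If $N > \frac{1}{\mu} \cdot \frac{4 \ln(2/\delta)}{\varepsilon^2}$, then
  $\Pr[(1-\varepsilon)\mu < Y < (1+\varepsilon)\mu] > 1 - \delta$
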
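The estimator and the sample-size bound can be tried on a toy universe. The following is a minimal sketch, not part of the original slides: the function names, the toy universe (integers 0..9999 with G = multiples of 10), and the use of a known lower bound on μ in place of the unknown μ are all assumptions made for illustration.

```python
import math
import random


def required_samples(mu_lower_bound, eps, delta):
    """Sample size suggested by the theorem: (1/mu) * 4 ln(2/delta) / eps^2.

    Assumption: a lower bound on mu is known in advance. A smaller mu only
    increases the returned N, so the bound still holds for the true mu.
    """
    return math.ceil((1.0 / mu_lower_bound) * 4.0 * math.log(2.0 / delta) / eps ** 2)


def zero_one_estimate(universe, in_g, num_samples, rng):
    """Estimate mu = |G|/|U| from uniform samples (with replacement) of U."""
    hits = sum(1 for _ in range(num_samples) if in_g(rng.choice(universe)))
    return hits / num_samples


# Hypothetical toy data: U = {0, ..., 9999}, G = multiples of 10, so mu = 0.1.
universe = list(range(10_000))
rng = random.Random(0)
n_samples = required_samples(mu_lower_bound=0.1, eps=0.2, delta=0.05)
estimate = zero_one_estimate(universe, lambda x: x % 10 == 0, n_samples, rng)
print(n_samples, estimate)
```

Note that the required N grows like 1/μ, so rare subsets need many more samples for the same relative error.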
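Zero-One Estimator Theorem
Algorithm:
• Pick $N$ samples from $U$: $\{x_1, x_2, \ldots, x_N\}$
• For each sample, let $Y_i = 1$ if $x_i \in G$, and $Y_i = 0$ otherwise.
• Output: $Y = \sum_{i=1}^{N} Y_i / N$
Theorem: Let $\varepsilon < 2$. If $N > \frac{1}{\mu} \cdot \frac{4 \ln(2/\delta)}{\varepsilon^2}$, then
  $\Pr[(1-\varepsilon)\mu < Y < (1+\varepsilon)\mu] > 1 - \delta$
Proof: Homework
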
Simple Random Sample
• Given a table of size N, pick a subset of n rows, such that each subset of n rows is equally likely.
• How to sample n rows?
• … if we don’t know N?

Reservoir Sampling
Highlights:
• Make one pass over the data
• Maintain a reservoir of n records.
• After reading t rows, the reservoir is a simple random sample of the first t rows.

Reservoir Sampling [Vitter ACM ToMS ’85]
Algorithm R:
• Initialize reservoir to the first n rows.
• For the (t+1)st row R,
  – Pick a random number m between 1 and t+1
  – If m ≤ n, then replace the m-th row in the reservoir with R

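The steps of Algorithm R translate almost line by line into code. This is a minimal sketch under stated assumptions: the function name `reservoir_sample` and the optional `rng` parameter are illustrative choices, and m is drawn uniformly from 1..t+1 inclusive.

```python
import random


def reservoir_sample(rows, n, rng=None):
    """One-pass sample of n rows from an iterable of unknown length (Algorithm R)."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for t, row in enumerate(rows):  # t = number of rows read before this one
        if t < n:
            # Initialize the reservoir with the first n rows.
            reservoir.append(row)
        else:
            # This is the (t+1)st row: pick m uniformly in 1..t+1.
            m = rng.randint(1, t + 1)
            if m <= n:
                reservoir[m - 1] = row  # replace the m-th reservoir row
    return reservoir


sample = reservoir_sample(iter(range(1000)), 10, random.Random(1))
print(sample)
```

Because the input is consumed as an iterator, the total number of rows never needs to be known; if the input has fewer than n rows, all of them are returned.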
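Proof
• If N = n, then P[row is in sample] = 1. Hence, the reservoir contains all the rows in the table.
• Suppose for N = t, the reservoir is a simple random sample. That is, each row has n/t chance of appearing in the sample.
• For N = t+1:
  – The (t+1)st row is included in the sample with probability n/(t+1).
  – Any other row is replaced only if the (t+1)st row is inserted (probability n/(t+1)) and lands in that row's slot (probability 1/n), that is, with probability 1/(t+1). So:
    P[row is in reservoir] = P[row is in reservoir after t steps] × P[row is not replaced]
    $= \frac{n}{t} \left(1 - \frac{1}{t+1}\right) = \frac{n}{t} \cdot \frac{t}{t+1} = \frac{n}{t+1}$
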
Complexity
• Running time: O(N)
• Number of calls to random number generator: O(N)
• Expected number of elements that may appear in the reservoir:
  $n + \sum_{t=n}^{N-1} \frac{n}{t+1} = n(1 + H_N - H_n) \approx n(1 + \ln(N/n))$
  where $H_k$ is the k-th harmonic number.
• Is there a way to sample faster, in time $O(n(1 + \ln(N/n)))$?

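To get a feel for how small the expected number of reservoir entries is compared with N, the exact sum and the logarithmic approximation can be evaluated directly. This is an illustrative sketch; the function names and the values n = 100, N = 1,000,000 are assumptions chosen for the example.

```python
import math


def expected_reservoir_entries(n, N):
    """Exact value of n + sum_{t=n}^{N-1} n/(t+1) = n * (1 + H_N - H_n)."""
    return n + sum(n / (t + 1) for t in range(n, N))


def approx_reservoir_entries(n, N):
    """Approximation n * (1 + ln(N/n))."""
    return n * (1 + math.log(N / n))


n, N = 100, 1_000_000
exact = expected_reservoir_entries(n, N)
approx = approx_reservoir_entries(n, N)
print(exact, approx)
```

For these values only about a thousand of the million rows are expected to ever enter the reservoir, which is why skipping the remaining rows cheaply is worthwhile.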
Faster algorithm
• Algorithm R skips over (does not insert into the reservoir) a number of records: about $N - n(1 + \ln(N/n))$.
• At any step t, let S(n,t) denote the number of rows skipped by Algorithm R before the next insertion.
  – Skipping them involves O(S) time and O(S) calls to the random number generator.
• P[S(n,t) = s] = ?

Faster algorithm
• At any step t, let S(n,t) denote the number of rows skipped by Algorithm R.
• P[S(n,t) = s] = P[for all t < x ≤ t+s, row x was not inserted into the reservoir, but row t+s+1 is inserted]
  $= \left(1 - \frac{n}{t+1}\right) \left(1 - \frac{n}{t+2}\right) \cdots \left(1 - \frac{n}{t+s}\right) \cdot \frac{n}{t+s+1}$
• We can derive an expression for the CDF:
  $P[S(n,t) \le s] = 1 - \frac{t}{t+s+1} \cdot \frac{t-1}{t+s} \cdot \frac{t-2}{t+s-1} \cdots \frac{t-n+1}{t+s-n+2}$

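The closed-form CDF can be checked numerically against the running sum of the probabilities P[S(n,t) = s]. The sketch below assumes t ≥ n; the function names `skip_pmf` and `skip_cdf` are illustrative.

```python
def skip_pmf(n, t, s):
    """P[S(n,t) = s]: rows t+1..t+s are skipped and row t+s+1 is inserted."""
    p = n / (t + s + 1)
    for x in range(t + 1, t + s + 1):
        p *= 1 - n / x
    return p


def skip_cdf(n, t, s):
    """P[S(n,t) <= s] = 1 - prod_{i=0}^{n-1} (t - i) / (t + s + 1 - i)."""
    tail = 1.0
    for i in range(n):
        tail *= (t - i) / (t + s + 1 - i)
    return 1 - tail


# Compare the closed form with the cumulative sum of the pmf for n = 3, t = 10.
running = 0.0
for s in range(20):
    running += skip_pmf(3, 10, s)
    assert abs(running - skip_cdf(3, 10, s)) < 1e-12
```

The closed form has only n factors regardless of s, whereas the product form of the pmf has s + 1 factors.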
Faster Algorithm
Algorithm X
• Initialize reservoir with first n rows.
• After seeing t rows, randomly sample a skip s = S(n,t) from the CDF
• Pick a number m between 1 and n
• Replace the m-th row in the reservoir with the (t+s+1)st row.
• Set t = t+s+1

Faster Algorithm
Algorithm X
• Initialize reservoir with first n rows.
• After seeing t rows, randomly sample a skip s = S(n,t) from the CDF
  – Pick a random U between 0 and 1
  – Find the minimum s such that P[S(n,t) ≤ s] ≥ 1 − U
• Pick a number m between 1 and n
• Replace the m-th row in the reservoir with the (t+s+1)st row.
• Set t = t+s+1

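A skip-based sampler along the lines of Algorithm X can be sketched as follows. This is a hedged illustration rather than Vitter's exact procedure: the function name `algorithm_x`, the sentinel-based iteration, and the interleaving of the sequential CDF search with row skipping are implementation assumptions. The search stops at the smallest s with P[S(n,t) > s] ≤ U, which is the same as P[S(n,t) ≤ s] ≥ 1 − U.

```python
import random

_END = object()  # sentinel marking the end of the input stream


def algorithm_x(rows, n, rng=None):
    """Reservoir sampling that draws skip lengths instead of testing every row."""
    if rng is None:
        rng = random.Random()
    it = iter(rows)
    reservoir = []
    for row in it:  # initialize the reservoir with the first n rows
        reservoir.append(row)
        if len(reservoir) == n:
            break
    if len(reservoir) < n:
        return reservoir  # fewer than n rows in total
    t = n  # number of rows seen so far
    while True:
        u = 1.0 - rng.random()  # uniform in (0, 1]
        s = 0
        tail = (t + 1 - n) / (t + 1)  # P[S(n,t) > 0]
        row = next(it, _END)  # candidate row number t + 1
        # Sequential search: skip rows while P[S(n,t) > s] is still above u.
        while row is not _END and tail > u:
            s += 1
            tail *= (t + s + 1 - n) / (t + s + 1)  # now P[S(n,t) > s]
            row = next(it, _END)  # candidate row number t + s + 1
        if row is _END:
            return reservoir
        reservoir[rng.randrange(n)] = row  # replace a uniformly chosen slot
        t += s + 1


sample_x = algorithm_x(range(1000), 10, random.Random(7))
print(sample_x)
```

Each insertion uses two random numbers (one for the skip, one for the slot), while skipped rows cost no random numbers at all; the search itself still touches every skipped row, matching the O(N) running time discussed for Algorithm X.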
Algorithm X
• Running time:
  Each skip takes O(s) time to compute.
  Total time = sum of all the skips = O(N)
• Expected number of calls to the random number generator
  = 2 × expected number of rows that enter the reservoir
  = $O(n(1 + \ln(N/n)))$, which is optimal!
See the paper [Vitter ’85] for an algorithm which has optimal running time.

Summary
• Sampling is an important technique for computation when data is too large, or the computation is intractable, or if access to data is limited.
• Reservoir sampling techniques allow computing a sample even without knowledge of the size of the data.
  – Also can do weighted sampling [Efraimidis, Spirakis IPL 2006]
• Very useful for sampling from streams (e.g., Twitter stream)

References
• J. Vitter, “Random Sampling with a Reservoir”, ACM Transactions on Mathematical Software, 1985
• P. Efraimidis, P. Spirakis, “Weighted random sampling with a reservoir”, Information Processing Letters, 97(5), 2006
• R. Karp, R. Luby, N. Madras, “Monte-Carlo Approximation Algorithms for Enumeration Problems”, Journal of Algorithms, 1989