Interactive Clustering Barna Saha

Page 1: Interactive Clustering -

Interactive Clustering

Barna Saha

Page 2: Interactive Clustering -

Clustering

Page 3: Interactive Clustering -

Learning over Noisy Data

Noise comes from using similarity functions: add an edge between two images if they represent the same monument. The resulting clusters could be erroneous.

Learn a classifier or find clusters over noisy/uncertain data.

Page 4: Interactive Clustering -

Learning over Noisy Data

Learn a classifier or find clusters over noisy/uncertain data.

•  Noise comes from inherent data errors or missing attributes: clustering a collaboration network obtained from DBLP could be erroneous.

Page 5: Interactive Clustering -

Further Applications

•  Linking Census Records
•  Public Health
•  Web search
•  Comparison shopping
•  Spam Detection
•  Machine Reading
•  IP Aliasing
•  …

Page 6: Interactive Clustering -

Query complexity of optimal strategy?

Page 7: Interactive Clustering -

Query complexity of optimal strategy?

Davidson, Khanna, Milo, Roy, 2014

Page 8: Interactive Clustering -

Faulty Oracle

Page 9: Interactive Clustering -

Faulty Oracle

Repeat the same question. Assuming $p = q$, repeat each question (say) $24 \log n / (1-2p)^2$ times.
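A minimal sketch of the repeat-and-take-majority idea. It assumes that each repeated answer is wrong independently with probability p (reading the p = q case as a single error probability, which is an assumption here), that the logarithm is natural, and that `noisy_oracle` is a hypothetical stand-in for the real oracle. None of the function names come from the slides.

```python
import math
import random


def repetitions(n: int, p: float) -> int:
    """Repeat count from the slide: 24 log n / (1 - 2p)^2 (natural log assumed)."""
    return math.ceil(24 * math.log(n) / (1 - 2 * p) ** 2)


def noisy_oracle(same_cluster: bool, p: float, rng: random.Random) -> bool:
    """Hypothetical faulty oracle: flips the true answer with probability p."""
    return (not same_cluster) if rng.random() < p else same_cluster


def majority_answer(same_cluster: bool, n: int, p: float, rng: random.Random) -> bool:
    """Ask the same question repeatedly and return the majority answer."""
    m = repetitions(n, p)
    yes_votes = sum(noisy_oracle(same_cluster, p, rng) for _ in range(m))
    return 2 * yes_votes > m


rng = random.Random(0)
print(repetitions(1000, 0.3))
print(majority_answer(True, 1000, 0.3, rng))
```

The repeat count grows as p approaches 1/2, since the factor 1/(1-2p)^2 blows up when the oracle is barely better than a coin flip.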

Page 10: Interactive Clustering -

Faulty Oracle

Page 11: Interactive Clustering -

Faulty Oracle: No Resampling

•  Find seed nodes for each cluster
•  If we can find $24 \log n / (1-2p)^2$ seed nodes from each cluster then we are done! [Why?]
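One possible answer to the [Why?], offered as a sketch and not necessarily the argument intended on the slide. Assume each answer is wrong independently with probability $p < 1/2$, the logarithm is natural, and all $m = 24 \log n / (1-2p)^2$ seeds of a cluster truly belong to it. To place a node $v$, query $v$ against each seed once (so no pair is ever asked twice) and take the majority. Let $X$ be the number of wrong answers among these $m$ queries, so $\mathbb{E}[X] = pm$. The majority is wrong only if $X \ge m/2$, and Hoeffding's inequality gives

\[
\Pr\left[X \ge \tfrac{m}{2}\right]
\le \exp\left(-2m\left(\tfrac{1}{2} - p\right)^2\right)
= \exp\left(-\tfrac{m(1-2p)^2}{2}\right)
= \exp(-12 \log n)
= n^{-12}.
\]

A union bound over at most $n$ nodes and at most $n$ clusters keeps the total failure probability below $n^{-10}$.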

Page 12: Interactive Clustering -

Faulty Oracle: How to find seed nodes?

•  Let $N = O(k^2 \log n / (1-2p)^4)$
•  Select $N$ nodes and ask all possible pairwise queries among these nodes.
•  Run a correlation clustering algorithm on this small set of nodes
•  Each cluster returned by the correlation clustering that has size at least $24 \log n / (1-2p)^2$ acts as a seed
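The seed-finding steps can be sketched as follows. This is an illustration under stated assumptions, not the algorithm used in the talk: exhaustive search over all partitions stands in for the correlation clustering algorithm (the slides do not say which one is run), so it only works on a handful of nodes. The names `partitions`, `disagreements`, `find_seeds`, the toy `truth` labels, and the fixed set of `flipped` answers are all hypothetical, and the sample size and `min_size` are tiny demo values, not the bounds from the slide.

```python
import itertools


def partitions(items):
    """Yield every partition of `items` as a list of blocks (exponential; tiny inputs only)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + part


def disagreements(part, answers):
    """Count queried pairs whose oracle answer contradicts the partition."""
    label = {v: i for i, block in enumerate(part) for v in block}
    return sum((label[u] == label[v]) != ans for (u, v), ans in answers.items())


def find_seeds(sample, oracle, min_size):
    """Query all pairs once, pick the min-disagreement partition, keep large clusters."""
    answers = {(u, v): oracle(u, v) for u, v in itertools.combinations(sample, 2)}
    best = min(partitions(list(sample)), key=lambda part: disagreements(part, answers))
    return [sorted(block) for block in best if len(block) >= min_size]


# Toy ground truth and a hypothetical oracle that answers two fixed pairs incorrectly.
truth = {0: "a", 1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b", 7: "c"}
flipped = {(0, 4), (1, 2)}


def oracle(u, v):
    correct = truth[u] == truth[v]
    return (not correct) if (u, v) in flipped else correct


seeds = find_seeds(list(range(8)), oracle, min_size=3)
print(sorted(seeds))  # [[0, 1, 2, 3], [4, 5, 6]]
```

Despite the two wrong answers, the minimum-disagreement partition recovers the true clusters on this toy input, and the singleton cluster is dropped because it is too small to act as a seed.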

Page 13: Interactive Clustering -

Faulty Oracle: How to find seed nodes?

•  Let $N = O(k^2 \log n / (1-2p)^4)$
•  Select $N$ nodes and ask all possible pairwise queries among these nodes.
•  Run a correlation clustering algorithm on this small set of nodes
•  Each cluster returned by the correlation clustering that has size at least $24 \log n / (1-2p)^2$ acts as a seed

Some intuition on the analysis: If we know all the query results, correlation clustering gives the maximum likelihood estimator.

Moreover, it is an instance of correlation clustering where errors are random: we know how to solve it!
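The maximum likelihood claim can be made concrete with a short calculation, sketched here under the assumption that each pairwise answer is wrong independently with the same probability $p < 1/2$. For a candidate clustering $\mathcal{C}$, let $M$ be the number of queried pairs and $d(\mathcal{C})$ the number of those pairs whose answer disagrees with $\mathcal{C}$ (the symbols $M$ and $d$ are introduced here for illustration). The likelihood of the observed answers is

\[
L(\mathcal{C}) = p^{\,d(\mathcal{C})} (1-p)^{\,M - d(\mathcal{C})}
= (1-p)^{M} \left(\frac{p}{1-p}\right)^{d(\mathcal{C})}.
\]

Since $p/(1-p) < 1$, maximizing $L(\mathcal{C})$ is the same as minimizing $d(\mathcal{C})$, which is the minimum-disagreement objective of correlation clustering.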