Jeffrey D. Ullman Stanford University

Page 1: Jeffrey D. Ullman Stanford University


Page 2: Jeffrey D. Ullman Stanford University

• Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that members of a cluster are “close” to each other, while members of different clusters are “far.”

Page 3: Jeffrey D. Ullman Stanford University

[Figure: scatter plot of points (each marked x) in two dimensions.]

Page 4: Jeffrey D. Ullman Stanford University

• Clustering in two dimensions looks easy.
• Clustering small amounts of data looks easy.
• And in most cases, looks are not deceiving.

Page 5: Jeffrey D. Ullman Stanford University

• Many applications involve not 2, but 10 or 10,000 dimensions.
• High-dimensional spaces look different: almost all pairs of points are at about the same distance.

Page 6: Jeffrey D. Ullman Stanford University

• Assume random points between 0 and 1 in each dimension.
• In 2 dimensions: a variety of distances between 0 and 1.41.
• In any number of dimensions, the distance between two random points in any one dimension is distributed as a triangle.

[Figure: triangular distribution of the distance in one dimension, annotated: "Any point is distance zero from itself." "Half the points are the first of a pair of points at distance ½." "Only points 0 and 1 are distance 1."]

Page 7: Jeffrey D. Ullman Stanford University

• The law of large numbers applies.
• Actual distance between two random points is the square root of the sum of squares of essentially the same set of differences.
  - I.e., “all points are the same distance apart.”
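The effect can be checked numerically. The following is a minimal sketch, assuming points drawn uniformly from the unit cube; the function name `relative_spread` and the sample size are illustrative choices, not part of the slides. It reports the standard deviation of the pairwise distances divided by their mean, which shrinks as the number of dimensions grows.

```python
import math
import random


def relative_spread(dim, n_points=40, seed=0):
    """Return std/mean of pairwise Euclidean distances among
    n_points random points in the unit cube [0, 1]^dim."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    mean = sum(dists) / len(dists)
    var = sum((d - mean) ** 2 for d in dists) / len(dists)
    return math.sqrt(var) / mean


# The ratio drops as the dimension grows: distances concentrate.
for dim in (2, 10, 1000):
    print(dim, round(relative_spread(dim), 3))
```

A small ratio means that almost all pairs are at about the same distance, which is what makes "near" and "far" hard to tell apart in high dimensions.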

Page 8: Jeffrey D. Ullman Stanford University

• Euclidean spaces have dimensions, and points have coordinates in each dimension.
• Distance between points is usually the square root of the sum of the squares of the distances in each dimension.
• Non-Euclidean spaces have a distance measure, but points do not really have a position in the space.
  - Big problem: cannot “average” points.

Page 9: Jeffrey D. Ullman Stanford University

• Objects are sequences of {C, A, T, G}.
• Distance between sequences = edit distance = the minimum number of inserts and deletes needed to turn one into the other.
  - Notice: no way to “average” two strings.
• In practice, the distance for DNA sequences is more complicated: it allows other operations like mutations (change of a symbol into another) or reversal of substrings.
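As a sketch of the insert/delete edit distance described above: when only inserts and deletes are allowed, the distance equals the two lengths added together minus twice the length of a longest common subsequence (LCS). The function names are illustrative, and mutations and reversals are deliberately not modeled.

```python
def lcs_length(a, b):
    """Length of a longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a, b):
    """Minimum number of single-character inserts and deletes
    needed to turn a into b (no substitutions)."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


print(edit_distance("CAT", "CATG"))  # one insert
print(edit_distance("A", "C"))       # one delete plus one insert
```

Under this definition a single changed symbol costs 2 (delete, then insert), which is one reason practical DNA distances add a mutation operation.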

Page 10: Jeffrey D. Ullman Stanford University

• Hierarchical (Agglomerative):
  - Initially, each point is in a cluster by itself.
  - Repeatedly combine the two “nearest” clusters into one.
• Point Assignment:
  - Maintain a set of clusters.
  - Place points into their “nearest” cluster.
  - Possibly split clusters or combine clusters as we go.

Page 11: Jeffrey D. Ullman Stanford University

• Point assignment is good when clusters are nice, convex shapes.
• Hierarchical can win when shapes are weird.

Aside: if you realized you had concentric clusters, you could map points based on distance from the center, and turn the problem into a simple, one-dimensional case.

Page 12: Jeffrey D. Ullman Stanford University

• Two important questions:
  1. How do you determine the “nearness” of clusters?
  2. How do you represent a cluster of more than one point?

Page 13: Jeffrey D. Ullman Stanford University

• Key problem: as you build clusters, how do you represent the location of each cluster, to tell which pair of clusters is closest?
• Euclidean case: each cluster has a centroid = average of its points.
  - Measure intercluster distances by distances of centroids.

Page 14: Jeffrey D. Ullman Stanford University

[Figure: six data points (o) at (0,0), (1,2), (2,1), (4,1), (5,0), and (5,3), with the centroids (x) of merged clusters at (1.5,1.5), (4.5,0.5), (1,1), and (4.7,1.3).]

Page 15: Jeffrey D. Ullman Stanford University

[Figure: dendrogram with leaves (0,0), (1,2), (2,1), (4,1), (5,0), (5,3).]
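The six-point example can be reproduced with a short sketch of centroid-based agglomerative clustering. This is a naive version that rescans all pairs at every step; the function names and the stopping rule (stop at k clusters) are assumptions made for illustration.

```python
import math


def centroid(cluster):
    """Coordinate-wise average of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def agglomerate(points, k):
    """Repeatedly merge the two clusters with the nearest centroids
    until k clusters remain. Returns the clusters and the centroid
    of each merged cluster, in merge order."""
    clusters = [[p] for p in points]
    merged_centroids = []
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        merged_centroids.append(centroid(merged))
    return clusters, merged_centroids


points = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
clusters, history = agglomerate(points, k=2)
print(clusters)
print([tuple(round(v, 1) for v in c) for c in history])
```

The merge centroids come out as (1.5, 1.5), (4.5, 0.5), (1, 1), and roughly (4.7, 1.3), matching the centroids listed for the example.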

Page 16: Jeffrey D. Ullman Stanford University

• The only “locations” we can talk about are the points themselves.
  - I.e., there is no “average” of two points.
• Approach 1: clustroid = point “closest” to other points.
  - Treat the clustroid as if it were the centroid when computing intercluster distances.

Page 17: Jeffrey D. Ullman Stanford University

• Possible meanings:
  1. Smallest maximum distance to the other points.
  2. Smallest average distance to other points.
  3. Smallest sum of squares of distances to other points.
  4. Etc., etc.
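The three meanings listed above can give different answers on the same cluster. The sketch below takes an arbitrary distance function, so it applies to non-Euclidean data as well; the function name, the criterion labels, and the toy numbers are illustrative assumptions. Minimizing the sum is equivalent to minimizing the average.

```python
def clustroid(points, dist, criterion="sum_squares"):
    """Return the member of `points` that minimizes the chosen
    criterion over its distances to the other members."""
    def score(i):
        ds = [dist(points[i], q) for j, q in enumerate(points) if j != i]
        if criterion == "max":
            return max(ds)
        if criterion == "sum":  # same ordering as the average
            return sum(ds)
        if criterion == "sum_squares":
            return sum(d * d for d in ds)
        raise ValueError(f"unknown criterion: {criterion}")

    return points[min(range(len(points)), key=score)]


# Toy one-dimensional cluster with a far-away member.
values = [0, 1, 2, 3, 10]
absolute = lambda a, b: abs(a - b)
for name in ("max", "sum", "sum_squares"):
    print(name, clustroid(values, absolute, name))
```

On this toy cluster the maximum and sum-of-squares criteria pick 3, while the sum (average) criterion picks 2, because the far-away point 10 is penalized differently by each.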

Page 18: Jeffrey D. Ullman Stanford University

[Figure: points numbered 1 to 6 in two clusters, each with its clustroid marked; the intercluster distance is shown between the clustroids.]

Page 19: Jeffrey D. Ullman Stanford University

• Approach 2: intercluster distance = minimum of the distances between any two points, one from each cluster.
• Approach 3: Pick a notion of “cohesion” of clusters, e.g., maximum distance from the centroid or clustroid.
  - Merge clusters whose union is most cohesive.

Page 20: Jeffrey D. Ullman Stanford University

• Approach 1: Use the diameter of the merged cluster = maximum distance between points in the cluster.
• Approach 2: Use the average distance between points in the cluster.
• Approach 3: Density-based approach: take the diameter or average distance, e.g., and divide by the number of points in the cluster.
  - Perhaps raise the number of points to a power first, e.g., square root.
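The three cohesion measures can be sketched as follows, assuming Euclidean points. The function names and the `power` parameter are illustrative; a smaller score means a more cohesive cluster.

```python
import math
from itertools import combinations


def diameter(cluster):
    """Maximum distance between any two points in the cluster."""
    return max(math.dist(p, q) for p, q in combinations(cluster, 2))


def average_distance(cluster):
    """Average distance over all pairs of points in the cluster."""
    ds = [math.dist(p, q) for p, q in combinations(cluster, 2)]
    return sum(ds) / len(ds)


def density_score(cluster, power=1.0):
    """Diameter divided by (number of points) ** power."""
    return diameter(cluster) / (len(cluster) ** power)


cluster = [(0, 0), (3, 4), (0, 4)]
print(diameter(cluster), average_distance(cluster), density_score(cluster))
```

To use any of these for merging, one would compute the score of each candidate union and merge the pair whose union scores lowest.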

Page 21: Jeffrey D. Ullman Stanford University

• It really depends on the shape of clusters.
  - Which you may not know in advance.
• Example: we’ll compare two approaches:
  1. Merge clusters with the smallest distance between centroids (or clustroids for non-Euclidean).
  2. Merge clusters with the smallest distance between two points, one from each cluster.

Page 22: Jeffrey D. Ullman Stanford University

• Centroid-based merging works well.
• But merger based on closest members might accidentally merge incorrectly.

[Figure: three clusters A, B, and C. A and B have closer centroids than A and C, but the closest points are from A and C.]
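A small numeric sketch of the situation in the figure: the coordinates of A, B, and C below are hypothetical, chosen only so that the two rules disagree, and the function names are illustrative.

```python
import math


def centroid(cluster):
    """Coordinate-wise average of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def centroid_distance(a, b):
    """Distance between the centroids of clusters a and b."""
    return math.dist(centroid(a), centroid(b))


def closest_pair_distance(a, b):
    """Smallest distance between a point of a and a point of b."""
    return min(math.dist(p, q) for p in a for q in b)


# Hypothetical clusters: A is elongated and nearly touches C.
A = [(0, 0), (6, 0)]
B = [(3, 3), (3, 4)]
C = [(7, 0), (13, 0)]

print(centroid_distance(A, B), centroid_distance(A, C))          # B is nearer
print(closest_pair_distance(A, B), closest_pair_distance(A, C))  # C is nearer
```

With these coordinates the centroid rule would merge A with B, while the closest-members rule would merge A with C.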

Page 23: Jeffrey D. Ullman Stanford University

• Linking based on closest members works well.
• But centroid-based linking might cause errors.

Page 24: Jeffrey D. Ullman Stanford University

• An example of point assignment.
• Assumes Euclidean space.
• Start by picking k, the number of clusters.
• Initialize clusters with a seed (= one point per cluster).
  - Example: pick one point at random, then k-1 other points, each as far away as possible from the previous points.
  - OK, as long as there are no outliers (points that are far from any reasonable cluster).
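The seeding example can be sketched as a farthest-first selection. This reads "as far away as possible from the previous points" as maximizing the distance to the nearest already-picked seed, which is an interpretation rather than something the slide spells out; the function name and the optional `first` argument (used to make the run repeatable) are illustrative.

```python
import math
import random


def farthest_first_seeds(points, k, rng=None, first=None):
    """Pick k seeds: one starting point (random unless `first` is
    given), then repeatedly the point whose distance to its nearest
    already-picked seed is largest."""
    rng = rng or random.Random()
    seeds = [first if first is not None else rng.choice(points)]
    while len(seeds) < k:
        next_seed = max(
            points, key=lambda p: min(math.dist(p, s) for s in seeds)
        )
        seeds.append(next_seed)
    return seeds


points = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 9)]
print(farthest_first_seeds(points, k=3, first=(0, 0)))
```

Because the rule always chases the most remote point, a single outlier is very likely to be chosen as a seed, which is the caveat noted above.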

Page 25: Jeffrey D. Ullman Stanford University

• Basic idea: pick a small sample of points, cluster them by any algorithm, and use the centroids as a seed.
• In k-means++, sample size = k times a factor that is logarithmic in the total number of points.
• Sequentially pick sample points randomly, but the probability of adding a point p to the sample is proportional to D(p)².
  - D(p) = distance between p and the nearest picked point.
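The sequential D(p)² sampling rule can be sketched as below. For simplicity this sketch picks exactly `n_samples` points and returns them directly; it does not model the larger logarithmic-factor sample or the later step of clustering the sample, and the function name is illustrative. It assumes the input points are distinct.

```python
import math
import random


def d_squared_sample(points, n_samples, rng=None):
    """Pick points one at a time; each point p is chosen with
    probability proportional to D(p)^2, where D(p) is the distance
    from p to the nearest point picked so far."""
    rng = rng or random.Random()
    picked = [rng.choice(points)]
    while len(picked) < n_samples:
        weights = [min(math.dist(p, s) for s in picked) ** 2 for p in points]
        picked.append(rng.choices(points, weights=weights, k=1)[0])
    return picked


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0), (21, 0)]
print(d_squared_sample(points, n_samples=3, rng=random.Random(1)))
```

Already-picked points have D(p) = 0 and therefore zero weight, so they are never picked twice. Note that every new pick changes D(p) for the remaining points, which is what makes the method inherently sequential.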

Page 26: Jeffrey D. Ullman Stanford University

• k-means++, like other seed methods, is sequential.
  - You need to update D(p) for each unpicked p due to the new point.
• Naturally parallel: many compute nodes can each handle a small set of points.
  - Each picks a few new sample points using the same D(p).
• Really important and common trick: don’t update after every selection; rather make many selections at one round.
  - Suboptimal picks don’t really matter.

Page 27: Jeffrey D. Ullman Stanford University

1. For each point, place it in the cluster to whose current centroid it is nearest.
2. After all points are assigned, fix the centroids of the k clusters.
3. Optional: reassign all points to their closest centroid.
  - Sometimes moves points between clusters.
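The steps above can be sketched as a loop that alternates assignment and centroid updates. This is the common batch variant (all points are assigned before any centroid moves), which may differ in detail from the one-point-at-a-time placement in step 1; the function names and the `rounds` cap are illustrative.

```python
import math


def centroid(cluster):
    """Coordinate-wise average of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, seeds, rounds=10):
    """Assign points to the nearest centroid, recompute centroids,
    and repeat until assignments stop changing or `rounds` is hit."""
    centroids = list(seeds)
    assignment = None
    for _ in range(rounds):
        new_assignment = [nearest(p, centroids) for p in points]
        if new_assignment == assignment:
            break  # no point moved between clusters
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = centroid(members)
    return assignment, centroids


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
assignment, centroids = kmeans(points, seeds=[(0, 0), (10, 2)])
print(assignment, centroids)
```

Each extra pass of the loop corresponds to the optional reassignment in step 3.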

Page 28: Jeffrey D. Ullman Stanford University

[Figure: points numbered 1 to 8 with two centroids (x), labeled "Clusters after first round" and "Reassigned points."]

Page 29: Jeffrey D. Ullman Stanford University

• Try different k, looking at the change in the average distance to centroid, as k increases.
• Average falls rapidly until right k, then changes little.

[Figure: plot of average distance to centroid against k, with the best value of k marked.]

Note: binary search for k is possible.
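One way to turn "falls rapidly, then changes little" into a rule is to stop at the first k after which the relative improvement drops below a threshold. This is a sketch only: the `min_gain` threshold of 10% and the measurements in the example are made-up assumptions, and a linear scan is used here rather than the binary search mentioned in the note.

```python
def pick_k(avg_distance, min_gain=0.1):
    """Given a dict {k: average distance to centroid}, return the
    first k after which moving to the next k improves the average
    by less than `min_gain` (as a fraction of the current value).
    Assumes all averages are positive."""
    ks = sorted(avg_distance)
    for prev, curr in zip(ks, ks[1:]):
        gain = (avg_distance[prev] - avg_distance[curr]) / avg_distance[prev]
        if gain < min_gain:
            return prev
    return ks[-1]


# Hypothetical measurements: a sharp drop up to k = 3, then a plateau.
measured = {1: 10.0, 2: 6.0, 3: 2.0, 4: 1.9, 5: 1.85}
print(pick_k(measured))
```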

Page 30: Jeffrey D. Ullman Stanford University

[Figure: scatter plot of points (each marked x), clustered with too few clusters.]

Too few; many long distances to centroid.

Page 31: Jeffrey D. Ullman Stanford University

[Figure: scatter plot of points (each marked x), clustered with the right number of clusters.]

Just right; distances rather short.

Page 32: Jeffrey D. Ullman Stanford University

[Figure: scatter plot of points (each marked x), clustered with too many clusters.]

Too many; little improvement in average distance.

Page 33: Jeffrey D. Ullman Stanford University

• BFR (Bradley-Fayyad-Reina) is a variant of k-means designed to handle very large (disk-resident) data sets.
• It assumes that clusters are normally distributed around a centroid in a Euclidean space.
  - Standard deviations in different dimensions may vary.

Page 34: Jeffrey D. Ullman Stanford University

• Points are read one main-memory-full at a time.
• Most points from previous memory loads are summarized by simple statistics.
  - Also kept in main memory, which limits how many points can be read in one “memory load.”
• To begin, from the initial load we select the initial k centroids by some sensible approach.

Page 35: Jeffrey D. Ullman Stanford University

1. The discard set (DS): points close enough to a centroid to be summarized.
2. The compression set (CS): groups of points that are close together but not close to any centroid. They are summarized, but not assigned to a cluster.
3. The retained set (RS): isolated points.

Page 36: Jeffrey D. Ullman Stanford University

[Figure: a cluster and its centroid, with the cluster's points in DS; compression sets, whose points are in CS; and isolated points in RS.]

Page 37: Jeffrey D. Ullman Stanford University

• Each cluster in the discard set and each compression set is summarized by:
  1. The number of points, N.
  2. The vector SUM, whose ith component is the sum of the coordinates of the points in the ith dimension.
  3. The vector SUMSQ: ith component = sum of squares of coordinates in ith dimension.

Page 38: Jeffrey D. Ullman Stanford University

• 2d + 1 values represent any number of points.
  - d = number of dimensions.
• Averages in each dimension (centroid coordinates) can be calculated easily as SUM_i / N.
  - SUM_i = ith component of SUM.
• Variance in dimension i can be computed by: (SUMSQ_i / N) - (SUM_i / N)^2
  - And the standard deviation is the square root of that.
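A minimal sketch of the N, SUM, SUMSQ summary, showing how the centroid and per-dimension variance follow from the 2d + 1 stored values. The class and method names are illustrative, not taken from the BFR paper or the slides.

```python
import math


class ClusterSummary:
    """Holds N, SUM, and SUMSQ for a set of d-dimensional points."""

    def __init__(self, dims):
        self.n = 0
        self.sum = [0.0] * dims
        self.sumsq = [0.0] * dims

    def add(self, point):
        """Fold one point into the summary; the point itself is not kept."""
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        """SUM_i / N in each dimension."""
        return [s / self.n for s in self.sum]

    def variance(self):
        """(SUMSQ_i / N) - (SUM_i / N)^2 in each dimension."""
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.sum, self.sumsq)]

    def std_dev(self):
        # Clamp tiny negative values caused by floating-point rounding.
        return [math.sqrt(max(v, 0.0)) for v in self.variance()]


summary = ClusterSummary(dims=2)
for p in [(1, 2), (3, 4), (5, 6)]:
    summary.add(p)
print(summary.n, summary.centroid(), summary.variance())
```

Adding a point only touches the stored totals, so the summary can absorb any number of points while using constant memory.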

Page 39: Jeffrey D. Ullman Stanford University

1. Find those points that are “sufficiently close” to a cluster centroid; add those points to that cluster and the DS.
2. Use any main-memory clustering algorithm to cluster the remaining points and the old RS.
  - Clusters go to the CS; outlying points to the RS.

Page 40: Jeffrey D. Ullman Stanford University

3. Adjust statistics of the clusters to account for the new points.
  - Consider merging compressed sets in the CS.
4. If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster.

Page 41: Jeffrey D. Ullman Stanford University

• How do we decide if a point is “close enough” to a cluster that we will add the point to that cluster?
• How do we decide whether two compressed sets deserve to be combined into one?

Page 42: Jeffrey D. Ullman Stanford University

• We need a way to decide whether to put a new point into a cluster.
• BFR suggests two ways:
  1. The Mahalanobis distance is less than a threshold.
  2. Low likelihood of the currently nearest centroid changing.

Page 43: Jeffrey D. Ullman Stanford University

• Normalized Euclidean distance from centroid.
• For point (x_1, …, x_k) and centroid (c_1, …, c_k):
  1. Normalize in each dimension: y_i = (x_i - c_i) / σ_i
    - σ_i = standard deviation in ith dimension for this cluster.
  2. Take the sum of the squares of the y_i’s.
  3. Take the square root.

Page 44: Jeffrey D. Ullman Stanford University

• If clusters are normally distributed in d dimensions, then after transformation, one standard deviation = √d.
  - I.e., 70% of the points of the cluster will have a Mahalanobis distance < √d.
• Accept a point for a cluster if its M.D. is < some threshold, e.g. 4 standard deviations.
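A sketch of the normalized distance and the acceptance test, assuming axis-aligned clusters with a known, nonzero standard deviation in every dimension. The function names are illustrative, and the threshold is taken to be `n_sigmas` times √d, following the statement that one standard deviation corresponds to √d.

```python
import math


def mahalanobis(point, centroid, sigma):
    """Normalize each coordinate difference by that dimension's
    standard deviation, then take the Euclidean length."""
    return math.sqrt(
        sum(((x - c) / s) ** 2 for x, c, s in zip(point, centroid, sigma))
    )


def accept(point, centroid, sigma, n_sigmas=4.0):
    """Accept the point if its Mahalanobis distance is below
    n_sigmas * sqrt(d), where d is the number of dimensions."""
    d = len(point)
    return mahalanobis(point, centroid, sigma) < n_sigmas * math.sqrt(d)


centroid = (1.0, 2.0)
sigma = (3.0, 4.0)
print(mahalanobis((4.0, 6.0), centroid, sigma))  # one sigma in each dimension
print(accept((4.0, 6.0), centroid, sigma), accept((100.0, 2.0), centroid, sigma))
```

In BFR the centroid and standard deviations would come from the N, SUM, and SUMSQ summary of the cluster rather than being passed in by hand.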

Page 45: Jeffrey D. Ullman Stanford University

[Figure: illustration of the standard deviation σ.]

Page 46: Jeffrey D. Ullman Stanford University

• Similar to measuring cohesion. For example:
• Compute the variance of the combined subcluster, in each dimension.
  - N, SUM, and SUMSQ allow us to make that calculation quickly.
• Combine if the variance is below some threshold.
• Many alternatives: treat dimensions differently, consider density.
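Because N, SUM, and SUMSQ are all additive, the variance of a combined subcluster can be computed without revisiting any points. The sketch below represents a summary as a plain `(N, SUM, SUMSQ)` tuple and requires every dimension to fall below one shared threshold; both choices, and the function names, are assumptions made for illustration.

```python
def combine(a, b):
    """Add two (N, SUM, SUMSQ) summaries component by component."""
    n_a, sum_a, sumsq_a = a
    n_b, sum_b, sumsq_b = b
    return (
        n_a + n_b,
        [x + y for x, y in zip(sum_a, sum_b)],
        [x + y for x, y in zip(sumsq_a, sumsq_b)],
    )


def variances(summary):
    """Per-dimension variance: (SUMSQ_i / N) - (SUM_i / N)^2."""
    n, total, total_sq = summary
    return [sq / n - (s / n) ** 2 for s, sq in zip(total, total_sq)]


def should_merge(a, b, max_variance):
    """Merge if the combined variance is below the threshold
    in every dimension."""
    return all(v < max_variance for v in variances(combine(a, b)))


near_1 = (2, [1.0, 1.0], [1.0, 1.0])            # points (0,0) and (1,1)
near_2 = (2, [5.0, 5.0], [13.0, 13.0])          # points (2,2) and (3,3)
far = (2, [201.0, 201.0], [20201.0, 20201.0])   # points (100,100) and (101,101)
print(variances(combine(near_1, near_2)))
print(should_merge(near_1, near_2, 2.0), should_merge(near_1, far, 2.0))
```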

Page 47: Jeffrey D. Ullman Stanford University

• Problem with BFR/k-means:
  - Assumes clusters are normally distributed in each dimension.
  - And axes are fixed; ellipses at an angle are not OK.
• CURE:
  - Assumes a Euclidean distance.
  - Allows clusters to assume any shape.

Page 48: Jeffrey D. Ullman Stanford University

[Figure: scatter plot of points labeled e and h, with axes salary and age.]

Page 49: Jeffrey D. Ullman Stanford University

1. Pick a random sample of points that fit in main memory.
2. Cluster these points hierarchically: group nearest points/clusters.
3. For each cluster, pick a sample of points, as dispersed as possible.
4. From the sample, pick representatives by moving them (say) 20% toward the centroid of the cluster.
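Steps 3 and 4 can be sketched for a single cluster as follows. "As dispersed as possible" is approximated here by a farthest-first pick that starts from the point farthest from the centroid; that heuristic, the function names, and the default of 4 representatives are assumptions for illustration.

```python
import math


def centroid(cluster):
    """Coordinate-wise average of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def cure_representatives(cluster, n_reps=4, shrink=0.2):
    """Pick up to n_reps well-spread points from the cluster, then
    move each a fraction `shrink` of the way toward the centroid."""
    c = centroid(cluster)
    # Start with the point farthest from the centroid.
    picked = [max(cluster, key=lambda p: math.dist(p, c))]
    while len(picked) < min(n_reps, len(cluster)):
        # Next: the point farthest from its nearest already-picked point.
        picked.append(
            max(cluster, key=lambda p: min(math.dist(p, q) for q in picked))
        )
    return [tuple(x + shrink * (cx - x) for x, cx in zip(p, c)) for p in picked]


cluster = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 5)]
print(sorted(cure_representatives(cluster)))
```

For this square-shaped cluster the four corners are picked and each moves 20% of the way toward the centroid (5, 5).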

Page 50: Jeffrey D. Ullman Stanford University

[Figure: scatter plot of points labeled e and h, with axes salary and age.]

Page 51: Jeffrey D. Ullman Stanford University

[Figure: scatter plot of points labeled e and h, with axes salary and age.]

Pick (say) 4 remote points for each cluster.

Page 52: Jeffrey D. Ullman Stanford University

[Figure: scatter plot of points labeled e and h, with axes salary and age.]

Move points (say) 20% toward the centroid.

Page 53: Jeffrey D. Ullman Stanford University

• A large, dispersed cluster will have large moves from its boundary.
• A small, dense cluster will have little move.
• Favors a small, dense cluster that is near a larger dispersed cluster.

Page 54: Jeffrey D. Ullman Stanford University

• Now, visit each point p in the data set.
• Place it in the “closest cluster.”
  - Normal definition of “closest”: that cluster with the closest (to p) among all the sample points of all the clusters.
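The final assignment pass can be sketched as a nearest-representative lookup. The dictionary layout, the cluster labels, and the representative coordinates below are hypothetical.

```python
import math


def assign(p, representatives):
    """Return the label of the cluster that owns the representative
    point closest to p. `representatives` maps each cluster label to
    that cluster's list of representative points."""
    return min(
        representatives,
        key=lambda label: min(math.dist(p, r) for r in representatives[label]),
    )


# Hypothetical representatives for two clusters.
reps = {
    "e": [(1.0, 1.0), (2.0, 3.0)],
    "h": [(8.0, 8.0), (9.0, 6.0)],
}
print(assign((3.0, 3.0), reps), assign((7.0, 7.0), reps))
```

Each point needs only one scan over the representatives, so the full data set can be streamed from disk while the representatives stay in main memory.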