48
Clustering and community detection Social and Technological Networks Rik Sarkar University of Edinburgh, 2018.

Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Clusteringandcommunitydetection

SocialandTechnologicalNetworks

Rik Sarkar

UniversityofEdinburgh,2018.

Page 2: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Communitydetection

• Givenanetwork• Whatarethe“communities”– Closelyconnectedgroupsofnodes– Relativelyfewedgestooutsidethecommunity

• Similartoclusteringindatasets– Grouptogetherpointsthataremorecloseorsimilartoeachotherthanotherpoints

Page 3: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Communitydetectionbyclustering

• First,defineametricbetweennodes– Eithercomputeintrinsicmetricslikeallpairsshortestpaths[Floyd-Warshall algorithmO(n3)]

– OrembedthenodesinaEuclideanspace,andusethemetricthere• Wewilllaterstudyembeddingmethods

• Applyaclusteringalgorithmwiththemetric

Page 4: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Clustering

• Acoreproblemofmachinelearning:–Whichitemsareinthesamegroup?

• Identifiesitemsthataresimilarrelativetorestofdata

• Simplifiesinformationbygroupingsimilaritems– Helpsinalltypesofotherproblems

Page 5: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Clustering• Outlineapproach:• Givenasetofitems

– Defineadistancebetween them• E.g.Euclideandistancebetweenpointsinaplane;Euclideandistancebetweenotherattributes;non-euclidean distances;pathlengthsinanetwork;tiestrengthsinanetwork…

– Determineagrouping(partitioning)thatoptimisessomefunction(prefers‘close’itemsinsamegroup).

• Referenceforclustering:– CharuAggarwal:TheDataMiningTextbook,Springer

• FreeonSpringersite(fromuniversitynetwork)– Blumetal.FoundationsofDataScience(freeonline)

Page 6: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

K-meansclustering

• Findk-clusters

–Withcenters

– Thatminimizethesumofsquareddistancesofnodestotheirclustercenters(calledthek-meanscost)

Page 7: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

K-meansclustering:Lloyd’salgorithm

• Therearenitems• Selectk‘centers’

– Mayberandomklocationsinspace– Maybelocationofkoftheitemsselectedrandomly– Maybechosenaccordingtosomemethod

• Iteratetillconvergence:– Assigneachitemtotheclusterforitsclosestcenter– Recompute locationofcenterasthemeanlocationofallelements inthecluster(theircentroid)

– Repeat• Warning:Lloyd’salgorithmisaHeuristic.Doesnotguaranteethatthek-meanscostisminimised

Page 8: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

K-means

• Visualisations• http://stanford.edu/class/ee103/visualizations/kmeans/kmeans.html

• http://shabal.in/visuals/kmeans/1.html

Page 9: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

K-means

• Ward’salgorithm(alsoHeuristic)– Startwitheachnodeasitsowncluster– Ateachround,findtwoclusterssuchthatmergingthemwillreducethek-meanscostthemost

–Mergethesetwoclusters– Repeatuntiltherearek-clusters

Page 10: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Kmeans:discussion• Triestominimise squaredsumofdistancesofitemsto

clustercenters– NP-hard.Computationally intractable– Algorithmgiveslocaloptimum

• Dependsoninitialisation (startingsetofcenters)– Cangivepoorresults– Submodular optimisation canhelp

• Theright‘k’maybeunknown– Possiblestrategy:trydifferentpossibilities andtakethebest

• Canbeimprovedbyheuristicslikechoosingcenterscarefully– E.g.choosingcenterstobeasfarapartaspossible:chooseone,

choosepointfarthesttoit,choosepointfarthesttoboth(maximise mindistancetoexistingsetetc)…

– Trymultipletimesandtakebestresult..

Page 11: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

K-medoids

• Similar,butnoweachcentermustbeoneofthegivenitems– Ineachcluster,findtheitemthatisthebest‘center’andrepeat

• Usefulwhenthereisnoambientspace(extrinsicmetric)– E.g.Adistancebetweenitemscanbecomputedbetweennodes,buttheyarenotinanyparticularEuclideanspace,sothe‘centroid’inLloyd’salgorithmisnotameaningfulpoint

Page 12: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Othercenterbasedmethods

• K-center:Minimise maximumdistancetocenter:

• K-median:Minimise sumofdistances:

Page 13: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Hierarchicalclustering

• Hierarchicallygroupitems• Usingsomestandardclusteringmethod

Page 14: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Hierarchicalclustering• Topdown(divisive):– Startwitheverything in1cluster

– Makethebestdivision,andrepeatineachsubcluster

• Bottomup(agglomerative):– Startwithndifferent clusters– Mergetwoatatimebyfindingpairsthatgivethebestimprovement

Page 15: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Hierarchicalclustering• Givesmanyoptionsforaflatclustering

• Problem:whatisagood‘cut’ofthedendogram?

Page 16: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Densitybasedclustering

• Groupdenseregionstogether

• Betteratnon-linearseparations

• Workswithunknownnumberofclusters

Page 17: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

DBSCAN• Densityatadatapoint:

– NumberofdatapointswithinradiusEps• Acorepoint:

– Pointwithdensityatleastτ• Borderpoint

– Densitylessthanτ,butatleastonecorepointwithinradiusEps• Noisepoint

– Neithercorenorborder.Farfromdenseregions

Algorithm

• ConstructUDGofcorepoints

• Connectedcomponentsofthegraphgivetheclusters

• Assignborderpointstosuitableclusters(E.g.totheclustertowhichithasmostedges)

Page 18: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

DBSCAN:Discussions

• Requiresknowledgeofsuitableradiusanddensityparameters(Eps andτ)

• Doesnotallowforpossibilitythatdifferentclustersmayhavedifferentdensities

Page 19: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

DBSCAN

• Usefulincaseswhereitisclearwhichobjectscanbeconsideredsimilarbutnumberofclustersisnotknown

• Knowntoperformverywellinrealproblems

• Worstcasecomplexity:O(n2)

• Currentresearch:Makingfasterinspecialcases,approximations,distributedalgorithms.

Page 20: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Otherdensitybasedclustering

• Singlelinkage(sameasKruskal’s MSTalgorithm)– Startwithnclusters–Mergetwoclusterswiththeshortestbridginglink– Repeatuntilkclusters

• Other,morerobustmethodsexist

Page 21: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Communities

• Groupsoffriends• Colleagues/collaborators• Webpagesonsimilartopics• Biologicalreactiongroups• Similarcustomers/users…

Page 22: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Otherapplications

• Acoarserrepresentationofnetworks• Oneormoremeta-nodeforeachcommunity• Identifybridges/weak-links• Structuralholes

Page 23: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Communitydetectioninnetworks

• Asimplestrategy:– Chooseasuitabledistancemeasurebasedonavailabledata• E.g.Pathlengths;distancebasedoninversetiestrengths;sizeoflargestenclosinggrouporcommonattribute;distanceinaspectral(eigenvector)embedding;etc..

– Applyastandardclusteringalgorithm

Page 24: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Clusteringisnotalwayssuitableinnetworks

• Smallworldnetworkshavesmalldiameter– Andsometimeintegerdistances– Adistancebasedmethoddoesnothavealotofoptiontorepresentsimilarities/dissimilarities

• Highdegreenodesarecommon– Connectdifferentcommunities– Hardtoseparatecommunities

• Edgedensitiesvaryacrossthenetwork– Samethresholddoesnotworkwelleverywhere

Page 25: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Definitionsofcommunities

• Varies.Dependingonapplication

• Generalidea:Densesubgraphs:Morelinkswithincommunity,fewlinksoutside

• Sometypesandconsiderations:– Partitions:Eachnodeinexactlyonecommunity– Overlapping:Eachnodecanbeinmultiplecommunities

Page 26: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Comment:Findingdensesubgraphs ishardingeneral

• Findinglargestclique– NP-hard– Computationallyintractable

• Decisionversion:Doesacliqueofsizekexist?– AlsoNP-complete– Computationallyintractable– Polynomialtime(efficient)algorithmsunlikelytoexist

• Wewilllookforapproximations

Page 27: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Densesubgraphs:Fewpreliminarydefinitions

• ForS,Tsubgraphs ofV• e(S,T):SetofedgesfromStoT– e(S)=e(S,S):EdgeswithinS

• dS(v):numberofedgesfromvtoS• EdgedensityofS:|e(S)|/|S|– Largestforcompletegraphsorcliques

Page 28: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Dense subgraph Problem

• Findthesubgraph withlargestedgedensity• Therealsoexistsadecisionversion:– Isthereasubgraph withedgedensity>α

• CanbesolvedusingMaxFlowalgorithms– O(n2m):inefficient inlargedatasets– Findstheonedensestsubgraph

• Variant:FinddensestScontaininggivensubsetX• Otherversions:Findsubgraphs sizekorless• NP-hard

Page 29: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

EfficientapproximationforfindingdenseScontainingX

• Givesa1/2approximation• EdgedensityofoutputSsetisatleasthalfofoptimalsetS*

• (ProofinKempe 2018:http://www-bcf.usc.edu/~dkempe/teaching/structure-dynamics.pdf ).

Page 30: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Betweenness&graphpartitioning• Wewanttosplitnetworkintotightlyknitgroups(communitiesetc)

• Idea:Identifytheedgesconnectingdifferentcommunitiesandremovethem

• Theseedgesare“central”tothenetwork– Theylieonshortestpaths

• Betweenness ofedge(e)(canbeconsideredforvertex(v)):– Wesend1unitoftrafficbetween everypairofnodesinthenetwork

– measurewhatfractionpassesthroughe,assumingtheflowissplitequallyamongallshortestpaths.

Page 31: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Computingbetweenness

• Computingallshortestpathsseparatelyisinefficient

• Amoreefficientway:• Fromeachnode:– Step1:ComputeBFStree– Step2:Findnumberofshortestpathstoeachnode

– Step3:Findtheflowthrougheachedge– Seekleinberg-Easleyfordetailedalgorithm

Page 32: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Partitioning(Girvan-newman)Repeat:• Findedgeeofhighest

betweenness• Removee

• Producesahierarchicparitioning structureasthegraphdecomposesintosmallercomponents

• Networkversionofhierarchicclustering

Page 33: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Modularity

• Whatistheright“cut”inahierarchicclusteringthatrepresentsgoodcommunities?

• Clusteringagraph• Problem:Whatistherightclustering?• Idea:Maximizeaquantitycalledmodularity

Page 34: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

ModularityofsubsetS

• GivengraphG• ConsiderarandomG’graphwithsamenodedegrees(rememberconfigurationmodel)– NumberofedgesinSinG:|e(S)|G– ExpectednumberofedgesinSinG’:E[|e(S)|G’]– Modularity ofS:|e(S)|G - E[|e(S)|G’]– Morecoherentcommunitieshavemoreedgesinsidethanwouldbeexpected inarandomgraphwithsamedegrees

– Note:modularity canbenegative

Page 35: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Modularityofaclustering

• Takeapartition(clustering)ofV:• Writed(Si)forsumofdegreesofallnodesinSi• ItcanbeshownthatE[|e(S)|G’]≈d(Si)2• Definition:Sumoverthepartition:

q(P) =1

m

X

i

|e(Si)|G � 1

4md(si)

2

Page 36: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

• Canbeusedasastoppingcriterion(orfindingrightlevelofpartitioning)inothermethods– Eg.Girvan-newman

Page 37: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Modularitybasedclustering• Modularityismeantforusemoreasameasureofquality,notso

muchasaclusteringmethod

• FindingclusteringwithhighestmodularityisNP-hard• Heuristic:Louvain method:

– Placeeach node inits own community– For each community,consider merging with neighbor.

• Make the greedy choice – make the merge that maximizes modularity• Or donotmerge if none increases modularity

– Repeat• Note:Modularityisarelativemeasureforcomparingcommunity

structure.• Notentirelyclearinwhichcasesitmayormaynotgivegoodresults• Athresholdof0.3ormoreissometimes consideredtogivegood

clustering

Page 38: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Karateclubhierarchicclustering

• Shapeofnodesgivesactualsplitintheclubduetointernalconflicts– Newman2003

Page 39: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Correlationclustering• Someedgesareknowntobesimilar/friends/trusted

• marked“+”• Someedgesareknowntobedissimilar/enemies/distrusted

• marked“-”• Maximizethenumberof+edgesinsideclustersand

• Minimizethenumberof-edgesbetweenclusters

Page 40: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Applications

• Communitydetectionbasedonsimilarpeople/users

• Documentclusteringbasedonknownsimilarityordissimilaritybetweendocuments

• Useofsentimentsand/orotherdivisiveattributes

Page 41: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Features

• Clusteringwithoutneedtoknownumberofclusters– k-means,medians,clustersetc needtoknownumberofclustersorotherparameters likethreshold

– Numberofclustersdepends onnetworkstructure• Actually,doesnotneedanyparameter• NPhard• Notethatgraphmaybecompleteornotcomplete

– Insomeapplicationswithunlabelededges,itmaybereasonabletochangeedgesto“+”edgesandnon-edgesto“-”edges

Page 42: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Approximation

• Naive1/2approximation:– Iftherearemore+edges

• Putthemallin1cluster

– Iftherearemore- edges• Putnodesinndifferentclusters

• (notveryuseful)!

Page 43: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Betterapproximations

• 2waysoflookingatit:– MaximizeagreementorMinimizedisagreement– Similar idea,butweknowdifferentapproximationalgorithms

• NikhilBansal etal.developPTAS(polynomialtimeapproximationscheme)formaximizingagreement:– (1-ε)approximation, running time

• Min-disagree:– 4-approximation

Page 44: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Localdetectionofcommunities

• Supposethereisahugegraph,likewww,orfacebook network

• Weoftenwanttofindthecommunitythatcontainsaparticularnodeorgroup– E.g.tomakerecommendations: “yourfriendshavewatchedthismovie…”

– Toinferpreferences andattributes• Runningafullscalecommunitydetectioniscomputationallyimpractical

• Wedonotknowthenumberofcommunities• A“localmethod”likeDBSCANcanhelp

Page 45: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Conductance:measureofedgesinsidecommunityvsoutside

• GivensubsetsS,TinV• e(S,T):setofedgesbetweenSandT• Volumeofedges:

• ConductanceofSisdefinedas:

• Communitiesarelikelytohavelowconductance

vol(S) =X

v2S

d(v)

Page 46: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Personalised pagerank

• GivenaseedsetX• FindthecommunitySthatcontainsX• Pagerank style:Userandomwalks• Algorithm– Setalimitktonumberofstepsinrandomwalks– Repeat:

• SelectatrandomastartpointfromX• Takekrandomstepsinthegraph

– Counthowfrequently eachnodeoccurs– pagerank– Nodesinthecommunityhavehighpagerank

Page 47: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Personalised pagerank

• AlternativeAlgorithm– Setaprobabilitytoresetrandomwalk– Repeat:

• SelectatrandomastartpointfromX• Withprobability1– 𝜀movetoarandomneighbor• Withprobability𝜀movetoarandomnodeinX• Counthowfrequentlyeachnodeoccurs– pagerank

– Nodesinthecommunityhavehighpagerank

Page 48: Clustering and community detection · Comment: Finding dense subgraphs is hard in general • Finding largest clique – NP-hard – Computationally intractable • Decision version:

Personalised pagerank

• Communitieshavelowconductance• Therefore,shortrandomwalkswillleavethecommunityonlyrarely

• Therefore,nodesinthecommunityofXwillhavehighpagerank comparedtothoseoutside

• ItcanbeprovedthatifXisinalowconductancecommunity,nodesoutsidethiscommunitywilloccurinfrequently.– Wewillomitthisproof