Upload
others
View
12
Download
0
Embed Size (px)
Citation preview
Communitydetectionandclustering
SocialandTechnologicalNetworks
Rik Sarkar
UniversityofEdinburgh,2019.
Communities
• Groupsoffriends• Colleagues/collaborators• Webpagesonsimilartopics• Biologicalreactiongroups• Similarcustomers/users…
Otherapplications
• Acoarserrepresentationofnetworks• Oneormoremeta-nodeforeachcommunity• Identifybridges/weak-links• Structuralholes
Communitydetection
• Givenanetwork• Whatarethe“communities”– Closelyconnectedgroupsofnodes– Relativelyfewedgestooutsidethecommunity
• Similartoclusteringindatasets– Grouptogetherpointsthataremorecloseorsimilartoeachotherthanotherpoints
Definitionsofcommunities
• Varies.Dependingonapplication
• Generalidea:Densesubgraphs: Morelinkswithincommunity,fewlinksoutside
• Sometypesandconsiderations:– Partitions:Eachnodeinexactlyonecommunity– Overlapping:Eachnodecanbeinmultiplecommunities
–Wewillusuallyconsiderpartitions
Comment:Findingdensesubgraphs ishardingeneral
• Findinglargestclique– NP-hard– Computationallyintractable
• Decisionversion:Doesacliqueofsizekexist?– AlsoNP-complete– Computationallyintractable– Polynomialtime(efficient)algorithmsunlikelytoexist
• Wewilllookforapproximations
Densesubgraphs:Fewpreliminarydefinitions
• ForS,Tsubgraphs ofV• e(S,T):SetofedgesfromStoT– e(S)=e(S,S):EdgeswithinS
• dS(v):numberofedgesfromvtoS• EdgedensityofS:|e(S)|/|S|– Largestforcompletegraphsorcliques
Dense subgraph Problem
• Findthesubgraph withlargestedgedensity• Therealsoexistsadecisionversion:– Isthereasubgraph withedgedensity>α
• CanbesolvedusingMaxFlowalgorithms– O(n2m):inefficientinlargedatasets– Findstheonedensestsubgraph
• Variant:FinddensestScontaininggivensubsetX• Otherversions:Findsubgraphs sizekorless• NP-hard
EfficientapproximationforfindingdenseScontainingX
• Givesa1/2approximation• EdgedensityofoutputSsetisatleasthalfofoptimalsetS*
• (ProofinKempe 2018:http://www-bcf.usc.edu/~dkempe/teaching/structure-dynamics.pdf ).
Betweenness &graphpartitioning• Wewanttosplitnetworkintotightlyknitgroups(communitiesetc)
• Idea:Identifytheedgesconnectingdifferentcommunitiesandremovethem
• Theseedgesare“central”tothenetwork– Theylieonshortestpaths
• Betweenness ofedge(e)(canbeconsideredforvertex(v)):– Wesend1unitoftrafficbetweeneverypairofnodesinthenetwork
– measurewhatfractionpassesthroughe,assumingtheflowissplitequallyamongallshortestpaths.
Computingbetweenness
• Computingallshortestpathsseparatelyisinefficient
• Amoreefficientway:• Fromeachnode:– Step1:ComputeBFStree– Step2:Findnumberofshortestpathstoeachnode
– Step3:Findtheflowthrougheachedge– Seekleinberg-Easleyfordetailedalgorithm
Partitioning(Girvan-newman)Repeat:• Findedgeeofhighest
betweenness• Removee
• Producesahierarchicparitioning structureasthegraphdecomposesintosmallercomponents
• Networkversionofhierarchicclustering
Modularity
• Howdoweevaluatethequalityofdetectedcommunities?– Howdoyousaytheverticalcutisbetterthanthehorizontalcut?
• Idea:Maximizeaquantitycalledmodularity
ConfigurationmodelofRandomgraphs
• Supposewewantagraphthatisrandom• Buthasgivendegreeforeachvertex:
• Ateachvertexi wediopen-edges• Pairuptheedgesrandomly
• GivenagraphG,wecanconstructarandomgraphG’withsamedegreesbutrandomedges
ModularityofsubsetSofV
• GiventhegraphG• ConsiderarandomG’graphwithsamenodedegrees– NumberofedgesinSinG:|e(S)|G– ExpectednumberofedgesinSinG’:E[|e(S)|G’]– ModularityofS:|e(S)|G - E[|e(S)|G’]– Morecoherentcommunitieshavemoreedgesinsidethanwouldbeexpectedinarandomgraphwithsamedegrees
– Note:modularitycanbenegative
Modularityofaclustering
• Takeapartition(clustering)ofV:• Writed(Si)forsumofdegreesofallnodesinSi• ItcanbeshownthatE[|e(S)|G’]≈d(Si)2• Definition:Sumoverthepartition:
•
q(P) =1
m
X
i
|e(Si)|G � 1
4md(si)
2
• Canbeusedasastoppingcriterion(orfindingrightlevelofpartitioning)inhierarchicmethods– Eg.Girvan-newman
Modularitybasedclustering• Modularityismeantforusemoreasameasureofquality,notso
muchasaclusteringmethod
• FindingclusteringwithhighestmodularityisNP-hard• Heuristic:Louvainmethod:
– Placeeach node inits own community– For each community,considermergingwith neighbor.
• Make the greedy choice – make the merge that maximizes modularity• Or donotmerge if none increases modularity
– Repeat• Note:Modularityisarelativemeasureforcomparingcommunity
structure.• Notentirelyclearinwhichcasesitmayormaynotgivegoodresults• Athresholdof0.3ormoreissometimesconsideredtogivegood
clustering
Karateclubhierarchicclustering
• Shapeofnodesgivesactualsplitintheclubduetointernalconflicts– Newman2003
Correlationclustering• Someedgesareknowntobesimilar/friends/trusted
• marked“+”• Someedgesareknowntobedissimilar/enemies/distrusted
• marked“-”• Maximizethenumberof+edgesinsideclustersand
• Minimizethenumberof-edgesbetweenclusters
Applications
• Communitydetectionbasedonsimilarpeople/users
• Documentclusteringbasedonknownsimilarityordissimilaritybetweendocuments
• Useofsentimentsand/orotherdivisiveattributes
Features
• Clusteringwithoutneedtoknownumberofclusters– k-means,medians,clustersetc needtoknownumberofclustersorotherparameterslikethreshold
– Numberofclustersdependsonnetworkstructure• Actually,doesnotneedanyparameter• NPhard• Notethatgraphmaybecompleteornotcomplete– Insomeapplicationswithunlabelededges,itmaybereasonabletochangeedgesto“+”edgesandnon-edgesto“-”edges
Approximation
• Naive1/2approximation:– Iftherearemore+edges• Putthemallin1cluster
– Iftherearemore- edges• Putnodesinndifferentclusters
• (notveryuseful)!
Betterapproximations
• 2waysoflookingatit:– MaximizeagreementorMinimizedisagreement– Similaridea,butweknowdifferentapproximationalgorithms
• NikhilBansal etal.developPTAS(polynomialtimeapproximationscheme)formaximizingagreement:– (1-ε)approximation,runningtime
• Min-disagree:– 4-approximation
Localdetectionofcommunities
• Supposethereisahugegraph,likewww,orfacebook network
• Weoftenwanttofindthecommunitythatcontainsaparticularnodeorgroup– E.g.tomakerecommendations:“yourfriendshavewatchedthismovie…”
– Toinferpreferencesandattributes• Runningafullscalecommunitydetectioniscomputationallyimpractical
• Wedonotknowthenumberofcommunities
Conductance:measureofedgesinsidecommunityvsoutside
• GivensubsetsS,TinV• e(S,T):setofedgesbetweenSandT• Volumeofedges:
• ConductanceofSisdefinedas:
• Communitiesarelikelytohavelowconductance
vol(S) =X
v2S
d(v)
Personalised pagerank
• GivenaseedsetX• FindthecommunitySthatcontainsX• Pagerank style:Userandomwalks• Algorithm– Setalimitktonumberofstepsinrandomwalks– Repeat:
• SelectatrandomastartpointfromX• Takekrandomstepsinthegraph
– Counthowfrequentlyeachnodeoccurs– pagerank– Nodesinthecommunityhavehighpagerank
Personalised pagerank
• AlternativeAlgorithm– Setaprobabilitytoresetrandomwalk– Repeat:• SelectatrandomastartpointfromX• Withprobability1– 𝜀 movetoarandomneighbor• Withprobability𝜀 movetoarandomnodeinX• Counthowfrequentlyeachnodeoccurs– pagerank
– Nodesinthecommunityhavehighpagerank
Personalised pagerank
• Communitieshavelowconductance• Therefore,shortrandomwalkswillleavethecommunityonlyrarely
• Therefore,nodesinthecommunityofXwillhavehighpagerank comparedtothoseoutside
• ItcanbeprovedthatifXisinalowconductancecommunity,nodesoutsidethiscommunitywilloccurinfrequently.– Wewillomitthisproof
Communitydetectionbyclustering
• First,defineametricbetweennodes– Eithercomputeintrinsicmetricslikeallpairsshortestpaths[Floyd-Warshall algorithmO(n3)]
– OrembedthenodesinaEuclideanspace,andusethemetricthere• Wewilllaterstudyembeddingmethods
• Applyaclusteringalgorithmwiththemetric
Clustering
• Acoreproblemofmachinelearning:–Whichitemsareinthesamegroup?
• Identifiesitemsthataresimilarrelativetorestofdata
• Simplifiesinformationbygroupingsimilaritems– Helpsinalltypesofotherproblems
Clustering• Outlineapproach:• Givenasetofitems– Defineadistancebetweenthem
• E.g.Euclideandistancebetweenpointsinaplane;Euclideandistancebetweenotherattributes;non-euclideandistances;pathlengthsinanetwork;tiestrengthsinanetwork…
– Determineagrouping(partitioning)thatoptimisessomefunction(prefers‘close’itemsinsamegroup).
• Generalreferencesforclustering:– CharuAggarwal:TheDataMiningTextbook,Springer
• FreeonSpringersite(fromuniversitynetwork)– Blumetal.FoundationsofDataScience(freeonline)
K-meansclustering
• Findk-clusters
–Withcenters
– Thatminimizethesumofsquareddistancesofnodestotheirclustercenters(calledthek-meanscost)
K-meansclustering:Lloyd’salgorithm
• Therearenitems• Selectk‘centers’– Mayberandomklocationsinspace– Maybelocationofkoftheitemsselectedrandomly– Maybechosenaccordingtosomemethod
• Iteratetillconvergence:– Assigneachitemtotheclusterforitsclosestcenter– Recompute locationofcenterasthemeanlocationofallelementsinthecluster(theircentroid)
– Repeat• Warning:Lloyd’salgorithmisaHeuristic.Doesnotguaranteethatthek-meanscostisminimised
K-means
• Visualisations• http://stanford.edu/class/ee103/visualizations/kmeans/kmeans.html
• http://shabal.in/visuals/kmeans/1.html
K-means
• Ward’salgorithm(alsoHeuristic)– Startwitheachnodeasitsowncluster– Ateachround,findtwoclusterssuchthatmergingthemwillreducethek-meanscostthemost
– Mergethesetwoclusters– Repeatuntiltherearek-clusters
Kmeans:discussion• Triestominimise squaredsumofdistancesofitemsto
clustercenters– NP-hard.Computationallyintractable– Algorithmgiveslocaloptimum
• Dependsoninitialisation (startingsetofcenters)– Cangivepoorresults– Submodular optimisation canhelp
• Theright‘k’maybeunknown– Possiblestrategy:trydifferentpossibilitiesandtakethebest
• Canbeimprovedbyheuristicslikechoosingcenterscarefully– E.g.choosingcenterstobeasfarapartaspossible:chooseone,
choosepointfarthesttoit,choosepointfarthesttoboth(maximisemindistancetoexistingsetetc)…
– Trymultipletimesandtakebestresult..
K-medoids
• Similar,butnoweachcentermustbeoneofthegivenitems– Ineachcluster,findtheitemthatisthebest‘center’andrepeat
• Usefulwhenthereisnoambientspace(extrinsicmetric)– E.g.Adistancebetweenitemscanbecomputedbetweennodes,buttheyarenotinanyparticularEuclideanspace,sothe‘centroid’inLloyd’salgorithmisnotameaningfulpoint
Othercenterbasedmethods
• K-center:Minimise maximumdistancetocenter:
• K-median:Minimise sumofdistances:
Hierarchicalclustering
• Hierarchicallygroupitems• Usingsomestandardclusteringmethod
Hierarchicalclustering• Topdown(divisive):– Startwitheverythingin1cluster
– Makethebestdivision,andrepeatineachsubcluster
• Bottomup(agglomerative):– Startwithndifferentclusters– Mergetwoatatimebyfindingpairsthatgivethebestimprovement
Hierarchicalclustering• Givesmanyoptionsforaflatclustering
• Problem:whatisagood‘cut’ofthedendogram?
Densitybasedclustering
• Groupdenseregionstogether
• Betteratnon-linearseparations
• Workswithunknownnumberofclusters
DBSCAN• Densityatadatapoint:
– NumberofdatapointswithinradiusEps• Acorepoint:
– Pointwithdensityatleastτ• Borderpoint
– Densitylessthanτ,butatleastonecorepointwithinradiusEps• Noisepoint
– Neithercorenorborder.Farfromdenseregions
Algorithm
• ConstructUDGofcorepoints
• Connectedcomponentsofthegraphgivetheclusters
• Assignborderpointstosuitableclusters(E.g.totheclustertowhichithasmostedges)
DBSCAN:Discussions
• Requiresknowledgeofsuitableradiusanddensityparameters(Eps andτ)
• Doesnotallowforpossibilitythatdifferentclustersmayhavedifferentdensities
DBSCAN
• Usefulincaseswhereitisclearwhichobjectscanbeconsideredsimilarbutnumberofclustersisnotknown
• Knowntoperformverywellinrealproblems
• Worstcasecomplexity:O(n2)
• Currentresearch:Makingfasterinspecialcases,approximations,distributedalgorithms.
Otherdensitybasedclustering
• Singlelinkage(sameasKruskal’s MSTalgorithm)– Startwithnclusters– Mergetwoclusterswiththeshortestbridginglink– Repeatuntilkclusters
• Other,morerobustmethodsexist
Communitydetectioninnetworks
• Asimplestrategy:– Chooseasuitabledistancemeasurebasedonavailabledata• E.g.Pathlengths;distancebasedoninversetiestrengths;sizeoflargestenclosinggrouporcommonattribute;distanceinaspectral(eigenvector)embedding;etc..
– Applyastandardclusteringalgorithm
Clusteringisnotalwayssuitableinnetworks
• Smallworldnetworkshavesmalldiameter– Andsometimeintegerdistances– Adistancebasedmethoddoesnothavealotofoptiontorepresentsimilarities/dissimilarities
• Highdegreenodesarecommon– Connectdifferentcommunities– Hardtoseparatecommunities
• Edgedensitiesvaryacrossthenetwork– Samethresholddoesnotworkwelleverywhere