Upload
doanminh
View
216
Download
3
Embed Size (px)
Citation preview
Community Detection by SCOREwith applications to Statisticians’ Networks
Jiashun Jin
Statistics DepartmentCarnegie Mellon University
Collaborators: Pengsheng Ji (Univ. of Georgia)Zheng Tracy Ke (Univ. of Chicago)
April 6, 2015
Jiashun Jin Community Detection by SCORE
Network community detection
Jiashun Jin Coauthorship and Citation networks for Statisticians
Political web blogs (Adamic and
Glance; 2005)
I n = 1222 web blogs (nodes)
I 16714 hyperlinks (edges)
I #edges n2: adjacencymatrix X is very sparse
I Two perceivable communities
I Goal. Find the (unknown)community labels
Jiashun Jin Community Detection by SCORE
Abstraction (undirected)
Data: adjacency matrix A of a network N = (V ,E )
I V = 1, 2, . . . , n: nodes
A(i , j) =
1, an edge between nodes i and j0, otherwise
I K perceivable “communities”
V = V (1) ∪ V (2) . . . ∪ V (K )
Goal. For each node, predict the community label.
Diagonals of A are 0 for convenience
Jiashun Jin Community Detection by SCORE
Signal and noise decomposition
Adjacency matrix : A = E [A]+W , W ≡ (A−E [A]), “signal”+“noise”
I W = A− E [A]: generalized Wigner matrix
I upper triangles: independent centered-Bernoulli
I Question: How to model Ω if we write
E [X ] = Ω− diag(Ω)
Jiashun Jin Community Detection by SCORE
Box’s wisdom
George E.P. Box (1919–2013)
“All models are wrong, butsome are useful”
Jiashun Jin Community Detection by SCORE
Degree Corrected Block Model (DCBM)Ω(i , j)
θ(i) · θ(j)= P(k , `) ⇐⇒ Ω = ΘLΘ
P =
[a bb c
], Θ =
θ(1)
θ(2). . .
θ(7)
L =
a b a b a b ab c b c b c ba b a b a b ab c b c b c ba b a b a b ab c b c b c ba b a b a b a
permute−−−−→
a a a a b b ba a a a b b ba a a a b b ba a a a b b bb b b b c c cb b b b c c cb b b b c c c
Karrer and Newman (2010)
Jiashun Jin Community Detection by SCORE
Tukey’s suggestion
John W. Tukey (1915–2000)
“Which part of the samplecontains the information”Tukey (1965), PNAS
Jiashun Jin Community Detection by SCORE
Where is the information?
A = Ω− diag(Ω) + W ≈ Ω
SVD : Ω = ΘLΘ = Un,KDK ,K (Un,K )′
Un,K = ΘTn,K =
θ(1)
θ(2). . .
θ(n)
s1 t1
s2 t2
s1 t1...
...s1 t1
s2 t2
Jiashun Jin Community Detection by SCORE
SCORE: algorithm
SCORE: Spectral Clustering On Ratios-of-Eigenvectors
Input: A and K
I Obtain leading eigenvectors η1, η2, . . ., ηK
I Obtain n × (K − 1) matrix of entry-wise ratios
R(i , k) =ηk+1(i)
η1(i), 1 ≤ i ≤ n, 1 ≤ k ≤ K − 1
I Apply k-means to R (assume ≤ K clusters)
Jiashun Jin Community Detection by SCORE
Political weblog network (K = 2)x-axis: i = 1, 2, . . . , n; y -axis: R(i); 58 errors (lowest in literature)
0 200 400 600 800 1000 1200−4
−3.5
−3
−2.5
−2
−1.5
−1
−0.5
0
0.5
1
Methods SCORE PCA normalized PCA NSC BCPLErrors 58 437 600 69 104.5 (SD: 145.4)
Newman (2016), Bickel and Chen (2009), Zhao et al (2012)
Jiashun Jin Community Detection by SCORE
Regularity conditions
A = Ω− diag(Ω) + W , Ω = ΘLΘ, L =K∑
k,`=1
P(k, `)1k1′`
I (a). Eigen-spacing of DPD is ≥ a constant C
D(k , k)2 =[ ∑i∈V (k)
θ(i)2]/‖θ‖2
I (b). log(n)θmax‖θ‖1/‖θ‖4 → 0, so that
‖W ‖ ‖Ω‖, with prob. 1− o(n−3)
I (c). log(n)θ2max/θmin ≤ ‖θ‖33, so matrix-form Bernsteininequality holds (for the sum of random matrices)
Jiashun Jin Community Detection by SCORE
Consistency of SCORE
Hammp(ˆ, `) = n−1 minπ
n∑i=1
P(
ˆi 6= π(`i )
), errn =
‖θ‖33‖θ‖4
max n∑
i=1
1
θ(i),
1
θmin
(‖θ‖1‖θ‖2
)2Theorem. Consider DCBM where (a)-(c) hold. As n→∞, if
n−1∗ log(n)errn → 0, where n∗ is the minimum community size,
then Hammp(ˆscore , `) ≤ Cn−1 log3(n)errn.
Proof. Full analysis of Θ−1(ηk − ηk)
I Spectral perturbation theory
I Classical large deviations inequalities
I Matrix-form Bernstein inequality (Tropp, 2012)
Remark. If we assume θ(i)iid∼ F as in Zhao et al (2012), then
Hammp(ˆscore , `) ≤ Cn−1 log3(n)
Jiashun Jin Community Detection by SCORE
Coauthor/Citation Networks (statisticians)
I People most interested: statisticians/friends
I We know “inside information” N/A to outsiders
Scientific Problem: Dynamics of US-basedstatisticians in theory & methods of the HDDA eraHDDA: High-Dimensional Data Analysis
Data: All published research papers in AoS,Biometrika, JASA, and JRSS-B, 2003–2012
Jiashun Jin Community Detection by SCORE
Disclaimer
I Data and scope of scientific interests: limitedI It is not our intention to
I rank one author/paper/area over the othersI label an author/paper to a certain area
I We have to use real names because thenetworks are for real people (“us”)
Jiashun Jin Community Detection by SCORE
Citation Network, I
Large-Scale Multiple Testing by SCORE (359 nodes; 26 shown)
0 5 10 15 2010
15
20
25
30
35
40
Aad van der Vaart
Abba M Krieger
Bradley Efron
Christian P Robert
Christopher Genovese
D R Cox
Daniel Yekutieli
David L DonohoDavid Siegmund
Donald B Rubin
E L Lehmann
Felix AbramovichIain M Johnstone
James O Berger
Jiashun Jin
John D Storey
John Rice
Joseph P Romano
Larry Wasserman
Mark G Low
Paul R Rosenbaum
Peter Muller
Sanat K Sarkar
Subhashis Ghosal
Yoav Benjamini
Zhiyi Chi
Jiashun Jin Community Detection by SCORE
Citation Network, II
Spatial stat./nonparametric stat. by SCORE (1010 nodes; 42 shown)
Adrian E Raftery
Alan E Gelfand
Alan H Welsh
Amy H Herring
Andrew O Finley Anthony OHagan
Athanasios Kottas
Brian S Caffo
Ciprian M Crainiceanu
David Ruppert
Douglas W Nychka
Gareth Roberts
Gary L Rosner
Hao Purdue Zhang
Huiyan Sang
Jeffrey S Morris
Jonathan Tawn
Joseph G Ibrahim
Laurens de Haan
Marc G Genton
Mark F J Steel
Martin Schlather
Michael L Stein
Michael Sherman
Ming−Hui Chen
Mohammad Hosseini−Nasab
Montserrat Fuentes
N Reid
Naisyin Wang
Omiros Papaspiliopoulos
Paul Fearnhead
R Todd Ogden
Raymond J Carroll
Robin Henderson
Simon N Wood
Steven N MacEachern
Sudipto Banerjee
Theo GasserTilmann Gneiting
Ulrich Stadtmuller
Yi Li
Yongtao Guan
Jiashun Jin Community Detection by SCORE
Citation Network, II (further split, I)
Parametric Spatial Statistics by SCORE (304 nodes; 21 shown)
Adrian E Raftery
Andrew O Finley
Anthony OHagan
Cristiano Varin
Douglas W Nychka
Fadoua Balabdaoui
Haavard Rue
Hao Zhang (Purdue)
Huiyan Sang
Jonathan Tawn
Laurens de Haan
Leah J Welty
Marc G GentonMartin Schlather
Michael L Stein
Montserrat Fuentes
N ReidNicolas Chopin
Paolo Vidoni
Sudipto Banerjee
Tilmann Gneiting
Jiashun Jin Community Detection by SCORE
Citation Network, II (further split, II)
Nonparametric Spatial Statistics by SCORE (212 nodes; 21 shown)
Alan E Gelfand
Alexandros Beskos
Athanasios KottasDavid M Blei
Fernando A Quintana
Gareth Roberts
Gary L Rosner
Herbert K H Lee
Ju−Hyun Park
Mark F J Steel
Matthew J Beal
Natesh Pillai
Omiros PapaspiliopoulosPaul Fearnhead
Pilar L Iglesias
Radford M Neal
Robert B GramacySteven N MacEachern
Trivellore E Raghunathan
Yee Whye Teh
Yi Li
Jiashun Jin Community Detection by SCORE
Citation Network, II (further split, III)
Non-parametrics/semi-parametrics by SCORE (392 nodes; 24 shown)
Alan H Welsh
Brian S Caffo
Ciprian M Crainiceanu
D Mikis Stasinopoulos
David Ruppert
Hongtu Zhu
Hua Yun Chen
Jeffrey S Morris
Joseph G Ibrahim
Michael A Benjamin
Ming−Hui Chen
Mohammad Hosseini−Nasab
Naisyin Wang
Nilanjan Chatterjee
Rabi Bhattacharya
Ray Carroll
Robert A Rigby
Robin Henderson
Rui Paulo
Silvia Shimakura
Theo Gasser
Thomas C M Lee
Ulrich Stadtmuller
Vic Patrangenaru
Jiashun Jin Community Detection by SCORE
Citation Network, III
Variable Selection by SCORE (1285 nodes; 40 shown)
Alexandre B Tsybakov
Cun−Hui Zhang
Dan Yu Lin
Elizaveta Levina
Emmanuel J Candes
Hans−Georg Muller
Hansheng Wang
Hao Helen Zhang
Heng Peng
Hui Zou
Ji Zhu
Jian HuangJianhua Z HuangJianqing Fan
Jinchi Lv
Joel L Horowitz
L J Wei
Lixing Zhu
Michael R Kosorok
Ming Yuan
Mohsen Pourahmadi
Nicolai Meinshausen
Peter Buhlmann
Peter HallPeter J Bickel
Qiwei Yao
R Dennis Cook
Robert J Tibshirani
Runze LiTerence TaoTrevor J HastieXuming He
Yi Lin
Jiashun Jin Community Detection by SCORE
Coauthorship Network, I
Objective Bayes by SCORE (64 nodes; 14 shown)
0 5 10 15 206
7
8
9
10
11
Alan E Gelfand
Athanasios Kottas
Carlos M Carvalho
Daniel Walsh
Fei Liu
Gonzalo Garcia−DonatoJ Palomo
James O Berger
Jerry Sacks
John A Cafeo
M J Bayarri
R J Parthasarathy
Rui Paulo
Steven N MacEachern
Jiashun Jin Community Detection by SCORE
Coauthorship Network, II
Biostatistics by SCORE (388 nodes; 16 shown)
David Dunson
Debajyoti SinhaEric Feuer
Helen Zhang
Heping ZhangHongtu Zhu
Steve MarronJi Zhu
Joseph Ibrahim
Jun LiuL J Wei
Louise Ryan
Tapabrata Maiti
Trivellore Raghunathan
Weili LinYimei Li
Zhiliang Ying
Jiashun Jin Community Detection by SCORE
Coauthorship Network, III
HDDA by SCORE (1811 nodes, 32 shown)
Alexandre TsybakovAndrea Rotnitzky
Bani Mallick
Christian Robert
Ciprian Crainiceanu
Enno Mammen
Gerda Claeskens
Giovanni Parmigiani
Hans−Georg Muller
Holger Dette
Hua Liang
James R Robins
Jane−Ling Wang
Jianqing Fan
Larry Wasserman
Larry BrownLixing Zhu
Malay Ghosh
Marc G Genton
Nilanjan Chatterjee
Peter Hall
Peter Muller
Ray Carroll
Robert J Tibshirani
Runze Li
Song Xi Chen
T Tony Cai
Trevor Hastie
Wolfgang Hardle
Xihong Lin
Xuming He
Yanyuan Ma
Jiashun Jin Community Detection by SCORE
Comparisons, I
Undirected networks:
I Newman’s Spectral Clustering (NSC)
I Bickel and Chen’s Profile Likelihood (BCPL)
I Amini et al’s Pseudo Likelihood (APL)
Directed networks: Leicht & Newman’s Spectral Clustering
Jiashun Jin Community Detection by SCORE
Comparisons, IIAdjusted Rand Index (ARI); larger means more similar
SCORE NSC APL BCPLSCORE 1.00 .55 .19 .00NSC 1.00 .41 .00APL 1.00 0.00BCPL 1.00
Sizes of the 3 communities identified by SCORE, NSC, and APL
Objective Bayes Biostat-Coau HDDA-CoauSCORE 64 388 1811
NSC 69 163 2031APL 20 50 2193
SCORE ∩ NSC 55 162 1807SCORE ∩ APL 20 50 1811
NSC ∩ APL 20 50 2032
SCORE ∩ NSC ∩ APL 20 50 1807
Jiashun Jin Community Detection by SCORE
More on Coauthorship Network, I
“Theo. Statist. Learning” (15 nodes) and “Dim. Reduction” (14 nodes)
Alexandre B TsybakovAnatoli B Juditsky
Bin Yu
Bing Li
Fadoua Balabdaoui
Florentina BuneaFrancesca Chiaromonte
Guilherme Rocha
Jon A Wellner
Karim Lounici
Lexin Li
Liliana Forzani
Liping Zhu
Liqiang Ni
Liugen Xue
Lixing ZhuLukas MeierMarkus Kalisch
Marloes H Maathuis
Marten H Wegkamp
Nicolai Meinshausen
Peter Buhlmann
Philippe Rigollet
Piet Groeneboom
R Dennis Cook
Sara van de Geer
Tao Shi
Winfried StuteXia Cui
Xiangrong YinXin Chen
Yuexiao Dong
Jiashun Jin Community Detection by SCORE
More on Coauthorship Network, II
“Johns Hopkins”, “Duke”, “Stanford”, “Quant. Reg.”, “Exp. Design”
Barry RowlingsonBrian S CaffoChong-Zhi DiCiprian M CrainiceanuDavid RuppertDobrin MarchevGalin L JonesJames P HobertJohn P BuonaccorsiJohn StaudenmayerNaresh M PunjabiPeter J DiggleSheng Luo
Carlos M CarvalhoGary L RosnerGerard LetacHelene MassamJames G ScottJonathan R StroudMaria De IorioMike WestNicholas G PolsonPeter Muller
Armin SchwartzmanBenjamin YakirDavid SiegmundF GosselinJohn D StoreyJonathan E TaylorKeith J WorsleyNancy Ruonan ZhangRyan J Tibshirani
Hengjian CuiHuixia Judy WangJianhua HuJianhui ZhouValen E JohnsonWing K FungXuming HeYijun ZuoZhongyi Zhu
Andrey PepelyshevFrank BretzHolger DetteNatalie NeumeyerStanislav VolgushevStefanie BiedermannTim Holland-LetzViatcheslav B Melas
Jiashun Jin Community Detection by SCORE
More on Coauthorship Network, III
David Dunson
Donglin Zeng
Hans−Georg Muller
Hongtu Zhu
Hua Liang
Jianqing Fan
Jing Qin
Joseph G Ibrahim
Peter HallRaymond J Carroll
T Tony Cai
David Dunson
Donglin Zeng
Hans−Georg Muller
Hongtu Zhu
Hua Liang
Jianqing Fan
Jing Qin
Joseph G Ibrahim
Peter HallRaymond J Carroll
T Tony Cai
Jiashun Jin Community Detection by SCORE
More on Coauthorship Network, IV
David Dunson
Donglin Zeng
Hans−Georg Muller
Hongtu Zhu
Hua Liang
Jianqing Fan
Jing Qin
Joseph G Ibrahim
Peter HallRaymond J Carroll
T Tony Cai
David Dunson
Donglin Zeng
Hans−Georg Muller
Hongtu Zhu
Hua Liang
Jianqing Fan
Jing Qin
Joseph G Ibrahim
Peter HallRaymond J Carroll
T Tony Cai
Jiashun Jin Community Detection by SCORE
Take home messages
I Proposed a fast, flexible, easy-to-implement,yet effective, method: SCORE
I Successfully applied to Statisticians’ networksand found many meaningful communities
I Data sets: a fertile ground for future research(many results are not reported here)
References:Jin J (2015) Fast network community detection by SCORE. Ann. Statist.43(1), 57-89.
Ji P, Jin J (2014) Coauthorship and Citation networks for statisticians.
arXiv.1410.2840.
Jiashun Jin Community Detection by SCORE