Community Detection by SCORE with applications to Statisticians’ Networks

Jiashun Jin
Statistics Department, Carnegie Mellon University

Collaborators: Pengsheng Ji (Univ. of Georgia), Zheng Tracy Ke (Univ. of Chicago)

April 6, 2015



Network community detection

Political web blogs (Adamic and Glance, 2005)

- n = 1222 web blogs (nodes)
- 16714 hyperlinks (edges)
- #edges ≪ n^2: the adjacency matrix X is very sparse
- Two perceivable communities
- Goal. Find the (unknown) community labels


Abstraction (undirected)

Data: adjacency matrix $A$ of a network $N = (V, E)$

- $V = \{1, 2, \ldots, n\}$: nodes
- $A(i,j) = 1$ if there is an edge between nodes $i$ and $j$, and $A(i,j) = 0$ otherwise
- $K$ perceivable “communities”: $V = V^{(1)} \cup V^{(2)} \cup \ldots \cup V^{(K)}$

Goal. For each node, predict the community label.

Diagonals of A are 0 for convenience


Signal and noise decomposition

Adjacency matrix: $A = E[A] + W$, $W \equiv A - E[A]$, “signal” + “noise”

- $W = A - E[A]$: generalized Wigner matrix
- upper triangle: independent centered-Bernoulli entries
- Question: how to model $\Omega$ if we write $E[A] = \Omega - \mathrm{diag}(\Omega)$


Box’s wisdom

George E.P. Box (1919–2013)

“All models are wrong, but some are useful”


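Degree Corrected Block Model (DCBM)

For nodes $i \in V^{(k)}$ and $j \in V^{(\ell)}$:

$$\frac{\Omega(i,j)}{\theta(i) \cdot \theta(j)} = P(k,\ell) \iff \Omega = \Theta L \Theta$$

$$P = \begin{bmatrix} a & b \\ b & c \end{bmatrix}, \qquad \Theta = \mathrm{diag}\big(\theta(1), \theta(2), \ldots, \theta(7)\big)$$

$$L = \begin{bmatrix}
a & b & a & b & a & b & a \\
b & c & b & c & b & c & b \\
a & b & a & b & a & b & a \\
b & c & b & c & b & c & b \\
a & b & a & b & a & b & a \\
b & c & b & c & b & c & b \\
a & b & a & b & a & b & a
\end{bmatrix}
\xrightarrow{\text{permute}}
\begin{bmatrix}
a & a & a & a & b & b & b \\
a & a & a & a & b & b & b \\
a & a & a & a & b & b & b \\
a & a & a & a & b & b & b \\
b & b & b & b & c & c & c \\
b & b & b & b & c & c & c \\
b & b & b & b & c & c & c
\end{bmatrix}$$

Karrer and Newman (2010)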
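As a rough illustration of the model, the sketch below builds Ω = ΘLΘ for a 7-node, 2-community example and draws one adjacency matrix from it. The function name `sample_dcbm` and the numeric values of θ and P are hypothetical choices made for this sketch; they do not come from the talk.

```python
import numpy as np

def sample_dcbm(theta, labels, P, rng):
    """Draw a symmetric 0/1 matrix A with E[A] = Omega - diag(Omega)."""
    theta = np.asarray(theta, dtype=float)
    labels = np.asarray(labels)
    # L(i, j) = P(k, l) when node i is in community k and node j in community l
    L = P[np.ix_(labels, labels)]
    # Omega = Theta L Theta, with Theta = diag(theta)
    Omega = theta[:, None] * L * theta[None, :]
    if Omega.max() > 1:
        raise ValueError("entries of Omega must be probabilities")
    n = len(theta)
    # independent Bernoulli draws strictly above the diagonal, then symmetrize
    upper = np.triu(rng.random((n, n)) < Omega, k=1)
    A = (upper | upper.T).astype(int)
    return A, Omega

rng = np.random.default_rng(0)
labels = np.array([0, 1, 0, 1, 0, 1, 0])            # alternating pattern, as in L above
theta = np.array([0.9, 0.5, 0.8, 0.6, 0.7, 0.9, 0.4])  # illustrative degree parameters
P = np.array([[1.0, 0.3], [0.3, 0.8]])              # plays the role of [[a, b], [b, c]]
A, Omega = sample_dcbm(theta, labels, P, rng)
```

Sorting the nodes by `labels` reproduces the block pattern shown after the "permute" arrow.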


Tukey’s suggestion

John W. Tukey (1915–2000)

“Which part of the sample contains the information”
Tukey (1965), PNAS


Where is the information?

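$$A = \Omega - \mathrm{diag}(\Omega) + W \approx \Omega$$

$$\text{SVD}: \quad \Omega = \Theta L \Theta = U_{n,K} D_{K,K} (U_{n,K})'$$

$$U_{n,K} = \Theta T_{n,K} = \mathrm{diag}\big(\theta(1), \theta(2), \ldots, \theta(n)\big)
\begin{bmatrix}
s_1 & t_1 \\
s_2 & t_2 \\
s_1 & t_1 \\
\vdots & \vdots \\
s_1 & t_1 \\
s_2 & t_2
\end{bmatrix}$$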
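The factorization above suggests a quick numerical check: dividing each row of U by θ(i) should leave only K distinct rows, so the entry-wise ratio of the two eigenvectors depends on the community label alone. The sketch below tries this on a noiseless Ω with K = 2; all numeric values are illustrative assumptions, not data from the talk.

```python
import numpy as np

labels = np.array([0, 1, 0, 1, 0, 1, 0])               # two communities
theta = np.array([0.9, 0.5, 0.8, 0.6, 0.7, 0.9, 0.4])  # illustrative degree parameters
P = np.array([[1.0, 0.3], [0.3, 0.8]])                 # illustrative [[a, b], [b, c]]

# Omega = Theta L Theta
Omega = theta[:, None] * P[np.ix_(labels, labels)] * theta[None, :]

vals, vecs = np.linalg.eigh(Omega)
order = np.argsort(-np.abs(vals))[:2]   # the two leading eigenvalues in magnitude
U = vecs[:, order]                      # n x K matrix of leading eigenvectors

T = U / theta[:, None]                  # rows of Theta^{-1} U: one distinct row per community
ratio = U[:, 1] / U[:, 0]               # theta(i) cancels in the ratio

print(np.round(T, 4))
print(np.round(ratio, 4))
```

The printed ratio vector takes one value on community 0 and another on community 1, regardless of the individual θ(i).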


SCORE: algorithm

SCORE: Spectral Clustering On Ratios-of-Eigenvectors

Input: A and K

- Obtain the leading eigenvectors $\eta_1, \eta_2, \ldots, \eta_K$ of $A$
- Obtain the $n \times (K-1)$ matrix $R$ of entry-wise ratios

  $$R(i,k) = \frac{\eta_{k+1}(i)}{\eta_1(i)}, \qquad 1 \le i \le n, \quad 1 \le k \le K-1$$

- Apply k-means to $R$ (assume $\le K$ clusters)
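The three steps can be sketched in a few lines of NumPy. This is a minimal illustration rather than the author's implementation: the helper `kmeans` is a plain Lloyd's algorithm with a deterministic farthest-point start (an assumption made here so the example is reproducible), "leading" is taken to mean largest in magnitude, and the network is assumed connected so that the first eigenvector has no zero entries.

```python
import numpy as np

def kmeans(X, K, n_iter=100):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, K):
        # squared distance of every point to its nearest chosen center
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def score(A, K):
    """Cluster nodes by k-means on entry-wise ratios of leading eigenvectors."""
    vals, vecs = np.linalg.eigh(A.astype(float))
    order = np.argsort(-np.abs(vals))[:K]   # K leading eigenvalues in magnitude
    eta = vecs[:, order]
    R = eta[:, 1:] / eta[:, [0]]            # n x (K - 1) ratio matrix
    return kmeans(R, K)

# Toy input (hypothetical): two 5-node cliques joined by a single edge
A = np.zeros((10, 10), dtype=int)
A[:5, :5] = 1
A[5:, 5:] = 1
A[4, 5] = A[5, 4] = 1
np.fill_diagonal(A, 0)
labels = score(A, K=2)
print(labels)
```

On this toy graph the ratio is positive on one clique and negative on the other, so the two cliques end up in different clusters.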


Political weblog network (K = 2)

x-axis: i = 1, 2, . . . , n; y-axis: R(i); 58 errors (lowest in literature)

[Figure: scatter plot of R(i) against the node index i]

Methods   SCORE   PCA   normalized PCA   NSC   BCPL
Errors    58      437   600              69    104.5 (SD: 145.4)

Newman (2006), Bickel and Chen (2009), Zhao et al (2012)


Regularity conditions

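$$A = \Omega - \mathrm{diag}(\Omega) + W, \qquad \Omega = \Theta L \Theta, \qquad L = \sum_{k,\ell=1}^{K} P(k,\ell)\, 1_k 1_\ell'$$

- (a). Eigen-spacing of $DPD$ is $\ge$ a constant $C$, where

  $$D(k,k)^2 = \Big[\sum_{i \in V^{(k)}} \theta(i)^2\Big] \Big/ \|\theta\|^2$$

- (b). $\log(n)\, \theta_{\max} \|\theta\|_1 / \|\theta\|^4 \to 0$, so that $\|W\| \ll \|\Omega\|$, with prob. $1 - o(n^{-3})$
- (c). $\log(n)\, \theta_{\max}^2 / \theta_{\min} \le \|\theta\|_3^3$, so the matrix-form Bernstein inequality holds (for the sum of random matrices)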


Consistency of SCORE

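$$\mathrm{Hamm}_p(\hat\ell, \ell) = n^{-1} \min_{\pi} \sum_{i=1}^{n} P\big(\hat\ell_i \ne \pi(\ell_i)\big), \qquad \mathrm{err}_n = \frac{\|\theta\|_3^3}{\|\theta\|^4} \max\Big\{ \sum_{i=1}^{n} \frac{1}{\theta(i)},\ \frac{1}{\theta_{\min}} \Big(\frac{\|\theta\|_1}{\|\theta\|_2}\Big)^2 \Big\}$$

Theorem. Consider DCBM where (a)-(c) hold. As $n \to \infty$, if $n_*^{-1} \log(n)\, \mathrm{err}_n \to 0$, where $n_*$ is the minimum community size, then

$$\mathrm{Hamm}_p(\hat\ell^{score}, \ell) \le C n^{-1} \log^3(n)\, \mathrm{err}_n.$$

Proof. Full analysis of $\Theta^{-1}(\hat\eta_k - \eta_k)$

- Spectral perturbation theory
- Classical large deviations inequalities
- Matrix-form Bernstein inequality (Tropp, 2012)

Remark. If we assume $\theta(i) \overset{iid}{\sim} F$ as in Zhao et al (2012), then

$$\mathrm{Hamm}_p(\hat\ell^{score}, \ell) \le C n^{-1} \log^3(n)$$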
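For a single realized labeling, the natural sample analogue of the Hamming error replaces the probability with a 0/1 mismatch indicator and keeps the minimum over label permutations π. The sketch below is one way to compute it; the function name `hamming_error` and the toy labels are assumptions for illustration, and the brute-force loop over all K! permutations is only practical for small K.

```python
from itertools import permutations
import numpy as np

def hamming_error(est, truth, K):
    """Fraction of mislabeled nodes, minimized over permutations of the K labels."""
    est = np.asarray(est)
    truth = np.asarray(truth)
    best = 1.0
    for perm in permutations(range(K)):
        relabeled = np.array(perm)[truth]        # pi(l_i) for every node i
        best = min(best, float(np.mean(est != relabeled)))
    return best

truth = np.array([0, 0, 0, 1, 1, 1])
est = np.array([1, 1, 0, 0, 0, 0])   # same split up to label swap, with one node misplaced
err = hamming_error(est, truth, K=2)
print(err)
```

The minimum over π matters because community labels are only identified up to relabeling: without it, the estimate above would be scored as 5 mismatches out of 6 instead of 1 out of 6.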


Coauthor/Citation Networks (statisticians)

- People most interested: statisticians/friends
- We know “inside information” N/A to outsiders

Scientific Problem: Dynamics of US-based statisticians in theory & methods of the HDDA era
HDDA: High-Dimensional Data Analysis

Data: All published research papers in AoS, Biometrika, JASA, and JRSS-B, 2003–2012


Disclaimer

- Data and scope of scientific interests: limited
- It is not our intention to
  - rank one author/paper/area over the others
  - label an author/paper to a certain area
- We have to use real names because the networks are for real people (“us”)


Citation Network, I

Large-Scale Multiple Testing by SCORE (359 nodes; 26 shown)


Aad van der Vaart

Abba M Krieger

Bradley Efron

Christian P Robert

Christopher Genovese

D R Cox

Daniel Yekutieli

David L Donoho

David Siegmund

Donald B Rubin

E L Lehmann

Felix Abramovich

Iain M Johnstone

James O Berger

Jiashun Jin

John D Storey

John Rice

Joseph P Romano

Larry Wasserman

Mark G Low

Paul R Rosenbaum

Peter Muller

Sanat K Sarkar

Subhashis Ghosal

Yoav Benjamini

Zhiyi Chi


Citation Network, II

Spatial stat./nonparametric stat. by SCORE (1010 nodes; 42 shown)

Adrian E Raftery

Alan E Gelfand

Alan H Welsh

Amy H Herring

Andrew O Finley

Anthony OHagan

Athanasios Kottas

Brian S Caffo

Ciprian M Crainiceanu

David Ruppert

Douglas W Nychka

Gareth Roberts

Gary L Rosner

Hao Zhang (Purdue)

Huiyan Sang

Jeffrey S Morris

Jonathan Tawn

Joseph G Ibrahim

Laurens de Haan

Marc G Genton

Mark F J Steel

Martin Schlather

Michael L Stein

Michael Sherman

Ming-Hui Chen

Mohammad Hosseini-Nasab

Montserrat Fuentes

N Reid

Naisyin Wang

Omiros Papaspiliopoulos

Paul Fearnhead

R Todd Ogden

Raymond J Carroll

Robin Henderson

Simon N Wood

Steven N MacEachern

Sudipto Banerjee

Theo Gasser

Tilmann Gneiting

Ulrich Stadtmuller

Yi Li

Yongtao Guan


Citation Network, II (further split, I)

Parametric Spatial Statistics by SCORE (304 nodes; 21 shown)

Adrian E Raftery

Andrew O Finley

Anthony OHagan

Cristiano Varin

Douglas W Nychka

Fadoua Balabdaoui

Haavard Rue

Hao Zhang (Purdue)

Huiyan Sang

Jonathan Tawn

Laurens de Haan

Leah J Welty

Marc G Genton

Martin Schlather

Michael L Stein

Montserrat Fuentes

N Reid

Nicolas Chopin

Paolo Vidoni

Sudipto Banerjee

Tilmann Gneiting


Citation Network, II (further split, II)

Nonparametric Spatial Statistics by SCORE (212 nodes; 21 shown)

Alan E Gelfand

Alexandros Beskos

Athanasios Kottas

David M Blei

Fernando A Quintana

Gareth Roberts

Gary L Rosner

Herbert K H Lee

Ju-Hyun Park

Mark F J Steel

Matthew J Beal

Natesh Pillai

Omiros Papaspiliopoulos

Paul Fearnhead

Pilar L Iglesias

Radford M Neal

Robert B Gramacy

Steven N MacEachern

Trivellore E Raghunathan

Yee Whye Teh

Yi Li


Citation Network, II (further split, III)

Non-parametrics/semi-parametrics by SCORE (392 nodes; 24 shown)

Alan H Welsh

Brian S Caffo

Ciprian M Crainiceanu

D Mikis Stasinopoulos

David Ruppert

Hongtu Zhu

Hua Yun Chen

Jeffrey S Morris

Joseph G Ibrahim

Michael A Benjamin

Ming-Hui Chen

Mohammad Hosseini-Nasab

Naisyin Wang

Nilanjan Chatterjee

Rabi Bhattacharya

Ray Carroll

Robert A Rigby

Robin Henderson

Rui Paulo

Silvia Shimakura

Theo Gasser

Thomas C M Lee

Ulrich Stadtmuller

Vic Patrangenaru


Citation Network, III

Variable Selection by SCORE (1285 nodes; 40 shown)

Alexandre B Tsybakov

Cun-Hui Zhang

Dan Yu Lin

Elizaveta Levina

Emmanuel J Candes

Hans-Georg Muller

Hansheng Wang

Hao Helen Zhang

Heng Peng

Hui Zou

Ji Zhu

Jian Huang

Jianhua Z Huang

Jianqing Fan

Jinchi Lv

Joel L Horowitz

L J Wei

Lixing Zhu

Michael R Kosorok

Ming Yuan

Mohsen Pourahmadi

Nicolai Meinshausen

Peter Buhlmann

Peter Hall

Peter J Bickel

Qiwei Yao

R Dennis Cook

Robert J Tibshirani

Runze Li

Terence Tao

Trevor J Hastie

Xuming He

Yi Lin


Coauthorship Network, I

Objective Bayes by SCORE (64 nodes; 14 shown)


Alan E Gelfand

Athanasios Kottas

Carlos M Carvalho

Daniel Walsh

Fei Liu

Gonzalo Garcia-Donato

J Palomo

James O Berger

Jerry Sacks

John A Cafeo

M J Bayarri

R J Parthasarathy

Rui Paulo

Steven N MacEachern


Coauthorship Network, II

Biostatistics by SCORE (388 nodes; 16 shown)

David Dunson

Debajyoti Sinha

Eric Feuer

Helen Zhang

Heping Zhang

Hongtu Zhu

Steve Marron

Ji Zhu

Joseph Ibrahim

Jun Liu

L J Wei

Louise Ryan

Tapabrata Maiti

Trivellore Raghunathan

Weili Lin

Yimei Li

Zhiliang Ying


Coauthorship Network, III

HDDA by SCORE (1811 nodes, 32 shown)

Alexandre Tsybakov

Andrea Rotnitzky

Bani Mallick

Christian Robert

Ciprian Crainiceanu

Enno Mammen

Gerda Claeskens

Giovanni Parmigiani

Hans-Georg Muller

Holger Dette

Hua Liang

James R Robins

Jane-Ling Wang

Jianqing Fan

Larry Wasserman

Larry Brown

Lixing Zhu

Malay Ghosh

Marc G Genton

Nilanjan Chatterjee

Peter Hall

Peter Muller

Ray Carroll

Robert J Tibshirani

Runze Li

Song Xi Chen

T Tony Cai

Trevor Hastie

Wolfgang Hardle

Xihong Lin

Xuming He

Yanyuan Ma


Comparisons, I

Undirected networks:

- Newman’s Spectral Clustering (NSC)
- Bickel and Chen’s Profile Likelihood (BCPL)
- Amini et al’s Pseudo Likelihood (APL)

Directed networks: Leicht & Newman’s Spectral Clustering


Comparisons, II

Adjusted Rand Index (ARI); larger means more similar

         SCORE   NSC    APL    BCPL
SCORE    1.00    .55    .19    .00
NSC              1.00   .41    .00
APL                     1.00   0.00
BCPL                           1.00

Sizes of the 3 communities identified by SCORE, NSC, and APL

                     Objective Bayes   Biostat-Coau   HDDA-Coau
SCORE                64                388            1811
NSC                  69                163            2031
APL                  20                50             2193
SCORE ∩ NSC          55                162            1807
SCORE ∩ APL          20                50             1811
NSC ∩ APL            20                50             2032
SCORE ∩ NSC ∩ APL    20                50             1807
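For readers who want to reproduce this kind of comparison, the sketch below computes the Adjusted Rand Index between two labelings from their contingency table, using the standard pair-counting formula. The function name `adjusted_rand_index` and the toy labelings are illustrative assumptions; the talk does not say which implementation produced the numbers above.

```python
from math import comb
import numpy as np

def adjusted_rand_index(u, v):
    """ARI between two labelings of the same n nodes (pair-counting form)."""
    u = np.asarray(u)
    v = np.asarray(v)
    n = len(u)
    _, ui = np.unique(u, return_inverse=True)
    _, vi = np.unique(v, return_inverse=True)
    # contingency table: table[a, b] = number of nodes with labels (a, b)
    table = np.zeros((ui.max() + 1, vi.max() + 1), dtype=int)
    np.add.at(table, (ui, vi), 1)
    sum_cells = sum(comb(int(x), 2) for x in table.ravel())
    sum_rows = sum(comb(int(x), 2) for x in table.sum(axis=1))
    sum_cols = sum(comb(int(x), 2) for x in table.sum(axis=0))
    expected = sum_rows * sum_cols / comb(n, 2)   # chance-level agreement
    max_index = 0.5 * (sum_rows + sum_cols)
    # undefined (division by zero) when both labelings put all nodes in one cluster
    return (sum_cells - expected) / (max_index - expected)

same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])      # identical up to relabeling
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])   # every pair disagrees
print(same, crossed)
```

The index equals 1 for labelings that agree up to relabeling, is near 0 for unrelated labelings, and can be negative when agreement is below chance.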


More on Coauthorship Network, I

“Theo. Statist. Learning” (15 nodes) and “Dim. Reduction” (14 nodes)

Alexandre B Tsybakov

Anatoli B Juditsky

Bin Yu

Bing Li

Fadoua Balabdaoui

Florentina Bunea

Francesca Chiaromonte

Guilherme Rocha

Jon A Wellner

Karim Lounici

Lexin Li

Liliana Forzani

Liping Zhu

Liqiang Ni

Liugen Xue

Lixing Zhu

Lukas Meier

Markus Kalisch

Marloes H Maathuis

Marten H Wegkamp

Nicolai Meinshausen

Peter Buhlmann

Philippe Rigollet

Piet Groeneboom

R Dennis Cook

Sara van de Geer

Tao Shi

Winfried Stute

Xia Cui

Xiangrong Yin

Xin Chen

Yuexiao Dong


More on Coauthorship Network, II

“Johns Hopkins”, “Duke”, “Stanford”, “Quant. Reg.”, “Exp. Design”

Johns Hopkins: Barry Rowlingson, Brian S Caffo, Chong-Zhi Di, Ciprian M Crainiceanu, David Ruppert, Dobrin Marchev, Galin L Jones, James P Hobert, John P Buonaccorsi, John Staudenmayer, Naresh M Punjabi, Peter J Diggle, Sheng Luo

Duke: Carlos M Carvalho, Gary L Rosner, Gerard Letac, Helene Massam, James G Scott, Jonathan R Stroud, Maria De Iorio, Mike West, Nicholas G Polson, Peter Muller

Stanford: Armin Schwartzman, Benjamin Yakir, David Siegmund, F Gosselin, John D Storey, Jonathan E Taylor, Keith J Worsley, Nancy Ruonan Zhang, Ryan J Tibshirani

Quant. Reg.: Hengjian Cui, Huixia Judy Wang, Jianhua Hu, Jianhui Zhou, Valen E Johnson, Wing K Fung, Xuming He, Yijun Zuo, Zhongyi Zhu

Exp. Design: Andrey Pepelyshev, Frank Bretz, Holger Dette, Natalie Neumeyer, Stanislav Volgushev, Stefanie Biedermann, Tim Holland-Letz, Viatcheslav B Melas


More on Coauthorship Network, III

David Dunson

Donglin Zeng

Hans-Georg Muller

Hongtu Zhu

Hua Liang

Jianqing Fan

Jing Qin

Joseph G Ibrahim

Peter Hall

Raymond J Carroll

T Tony Cai



More on Coauthorship Network, IV

David Dunson

Donglin Zeng

Hans-Georg Muller

Hongtu Zhu

Hua Liang

Jianqing Fan

Jing Qin

Joseph G Ibrahim

Peter Hall

Raymond J Carroll

T Tony Cai



Take home messages

- Proposed a fast, flexible, easy-to-implement, yet effective method: SCORE
- Successfully applied to Statisticians’ networks and found many meaningful communities
- Data sets: a fertile ground for future research (many results are not reported here)

References:

Jin J (2015) Fast network community detection by SCORE. Ann. Statist. 43(1), 57-89.

Ji P, Jin J (2014) Coauthorship and Citation networks for statisticians. arXiv:1410.2840.
