Statistical Analysis of Network Data in the Context of ‘Big Data’: Large Networks and Many Networks Eric D. Kolaczyk Dept of Mathematics and Statistics, Boston University [email protected] Northeastern U, May 2015




Introduction

Outline

1 Introduction
    Complex Networks @ ∼ 15 Years
    Statistical Analysis of Network Data: Foundations?

2 Propagation of Uncertainty to Network Summary Statistics

3 ‘Stat 101’ for Collections of Network Data Objects

4 Closing Thoughts: Much Still to Do . . .


Introduction Complex Networks @ ∼ 15 Years

Complex Networks

[Figure: example network with vertices labeled Pdp, dCLK, Cyc, Tim, Vri, Per]

Network-based analysis traditionally a relatively small ‘field’ of study

Epidemic-like spread of interest in networks since mid/late-1990s

Arguably due to various factors, such as

Increasingly systems-level perspective in science, away from reductionism;

Flood of high-throughput data;

Globalization, the Internet, etc.


Much Has Happened in 15 Years . . .

Google Scholar reports ∼ 361,000 articles with ‘network’ in the title published since 1999.

Contributions have been from across the sciences:

Computer Science, Mathematics, Statistics

Signal processing, Statistical Physics, Information Sciences

Bioinformatics, Economics, Neuroscience, Sociology

Digital Humanities

[Figure: book cover. Eric Kolaczyk · Gábor Csárdi, Statistical Analysis of Network Data with R. Use R! series. ISBN 978-1-4939-0982-7]

This book is the first of its kind in network research. It can be used as a stand-alone resource in which multiple R packages are used to illustrate how to use the base code for many tasks. igraph is the central package and has created a standard for developing and manipulating network graphs in R. Measurement and analysis are integral components of network research. As a result, there is a critical need for all sorts of statistics for network analysis, ranging from applications to methodology and theory. Networks have permeated everyday life through everyday realities like the Internet, social networks, and viral marketing, and as such, network analysis is an important growth area in the quantitative sciences. Their roots are in social network analysis going back to the 1930s and graph theory going back centuries. This text also builds on Eric Kolaczyk’s book Statistical Analysis of Network Data (Springer, 2009).

Eric Kolaczyk is a professor of statistics, and Director of the Program in Statistics, in the Department of Mathematics and Statistics at Boston University, where he also is an affiliated faculty member in the Bioinformatics Program, the Division of Systems Engineering, and the Program in Computational Neuroscience. His publications on network-based topics, beyond the development of statistical methodology and theory, include work on applications ranging from the detection of anomalous traffic patterns in computer networks to the prediction of biological function in networks of interacting proteins to the characterization of influence of groups of actors in social networks. He is an elected fellow of the American Statistical Association (ASA) and an elected senior member of the Institute of Electrical and Electronics Engineers (IEEE).

Gábor Csárdi is a research associate at the Department of Statistics at Harvard University, Cambridge, Mass. He holds a PhD in Computer Science from Eötvös University, Hungary. His research includes applications of network analysis in biology and social sciences, bioinformatics and computational biology, and graph algorithms. He created the igraph software package in 2005 and has been one of the lead developers since then.


Introduction Statistical Analysis of Network Data: Foundations?

Where is Statistics Amid All of this Work?

Everywhere!

Statistical aspects of network analysis include problems involving

sampling and design

description and visualization

modeling and inference

prediction

for data both of and on networks.

Much of this work occurs in domain-specific areas; a nontrivial amount of it is general.

However . . . comparatively less of it is foundational in nature.


Implications of Network Data on Statistical Foundations

Broadly speaking, the primary statistical challenges in most network problems come from the nontrivial interplay between

relational/dependent nature of the data;

network structure;

lack of (traditional) geometry; and

‘big data’.

Question: Despite a fairly well-developed ability to deal successfully with many classes of problems in many contexts and domain areas, how well do we truly understand the implications of network data (e.g., as opposed to IID, temporal, or spatial data) on statistics at a foundational level?

My Answer: Somewhat . . . but there’s still a long way to go!


Focus of this Talk: Behavior of ‘Network Averages’

Fundamental to much of statistics are averages and their asymptotic behavior.

Goal: To consider (arguably quite basic!) notions of ‘network averages’ and their asymptotic behavior in the settings of

1 large networks, and

2 many networks

and sketch some of our recent results, with an eye towards illustrating

the novelty lent by the aspect of complex networks; and

some of the resulting challenges and open problems.


Propagation of Uncertainty to Network Summary Statistics


Setting the Stage

Common modus operandi in network analysis:

System of elements and their interactions is of interest.

Collect elements and relations among elements.

Represent the collected data via a network.

Summarize properties of the network.

Sounds good . . . right?


But What About the Noise?!

(Stating the Obvious . . . ) If there is uncertainty in the determination of edge status (i.e., presence/absence) between vertex pairs, then that uncertainty propagates through any calculations using the resulting network graph as input!

Surprisingly, there is very little work acknowledging, much less addressing, this issue.

Work needed on problems of

characterizing propagation of error, from networks to summaries, and

adjusting for error (e.g., improved estimators).

=⇒ This represents a major foundation missing in network science. ⇐=


Propagation of Noise: Illustration

[Figure: data matrix with axes labeled by gene/probe identifiers (e.g., ilvY_b3773_at, yebF__U_N0075) =⇒ network graph with labeled vertices (e.g., YBR083W, YHR084W)]

=⇒ Density = 0.14 ± ????

Examples particularly prevalent in biology (e.g., gene regulatory networks, protein-protein interaction networks, and neural functional connectivity networks), but some noise likely present in most network applications.


A General Formulation of the Problem

Suppose we have

$G$, a true underlying (undirected) network-graph

$\eta(G)$, a network summary / characteristic of interest.

We will restrict our attention to the problem of counting subgraphs $H$, i.e., where $\eta$ is of the form

$$\eta_H(G) = \frac{1}{|\mathrm{Iso}(H)|} \sum_{H' \subseteq K_{n_v},\ H' \cong H} \mathbf{1}\{H' \subseteq G\}.$$
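As a concrete reading of the subgraph count defined above, the following sketch (not from the talk) counts edges and triangles in a small graph by brute-force enumeration of vertex subsets. The function names and the toy graph are assumptions made for illustration. For edges and triangles, each vertex subset supports at most one copy of H, so enumerating unordered subsets plays the role of the 1/|Iso(H)| normalization.

```python
from itertools import combinations

def count_edges(n_v, edges):
    """Count copies of H = single edge among n_v vertices (this is just |E|)."""
    edge_set = {frozenset(e) for e in edges}
    return sum(1 for pair in combinations(range(n_v), 2)
               if frozenset(pair) in edge_set)

def count_triangles(n_v, edges):
    """Count copies of H = triangle by checking every vertex triple."""
    edge_set = {frozenset(e) for e in edges}
    return sum(
        1 for a, b, c in combinations(range(n_v), 3)
        if {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))} <= edge_set
    )

# Hypothetical toy graph: a triangle 0-1-2 with a pendant vertex 3.
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n_edges = count_edges(4, toy_edges)
n_triangles = count_triangles(4, toy_edges)
print(n_edges, n_triangles)
```

Brute-force enumeration costs on the order of n_v^3 checks for triangles, so it is only meant to make the definition tangible, not to scale to large networks.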


A General Formulation (cont)

Rather than having $G$, however, suppose we only have an approximation/estimate of $G$, say $\hat{G} = (V, \hat{E})$.

We model observed edges to reflect potential errors in determining the (non)edge status between vertex pair $\{i,j\}$:

$$Y_{ij} \sim \begin{cases} \mathrm{Bernoulli}(\alpha_{ij}), & \text{if } \{i,j\} \in E^c, \\ \mathrm{Bernoulli}(1-\beta_{ij}), & \text{if } \{i,j\} \in E. \end{cases}$$

Goal: Understand the propagation of error from the $Y_{ij}$ to the standard plug-in estimate $\eta_H(\hat{G})$, by characterizing the distribution of the discrepancy

$$D = \eta_H(\hat{G}) - \eta_H(G).$$
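To make the noise model tangible, here is a minimal simulation sketch for the edge-count case, assuming homogeneous and independent errors (a single α for all non-edges, a single β for all edges). The function names, the ring graph, and the value λ = 2 are illustrative assumptions, not taken from the talk. The rates are chosen so that the expected numbers of false and missed edges are both λ.

```python
import random
from itertools import combinations

def observe_edges(n_v, true_edges, alpha, beta, rng):
    """Noisy observation: a non-edge appears w.p. alpha, a true edge survives w.p. 1 - beta."""
    true_set = {frozenset(e) for e in true_edges}
    observed = set()
    for pair in combinations(range(n_v), 2):
        pair = frozenset(pair)
        keep_prob = (1.0 - beta) if pair in true_set else alpha
        if rng.random() < keep_prob:
            observed.add(pair)
    return observed

def edge_count_discrepancy(n_v, true_edges, alpha, beta, rng):
    """D for H = single edge: observed edge count minus true edge count."""
    n_true = len({frozenset(e) for e in true_edges})
    return len(observe_edges(n_v, true_edges, alpha, beta, rng)) - n_true

rng = random.Random(0)
n_v = 30
ring = [(i, (i + 1) % n_v) for i in range(n_v)]   # hypothetical true graph with 30 edges
n_pairs = n_v * (n_v - 1) // 2
lam = 2.0
alpha = lam / (n_pairs - len(ring))               # expected number of false edges = lam
beta = lam / len(ring)                            # expected number of missed edges = lam

draws = [edge_count_discrepancy(n_v, ring, alpha, beta, rng) for _ in range(2000)]
mean_d = sum(draws) / len(draws)
var_d = sum((x - mean_d) ** 2 for x in draws) / (len(draws) - 1)
print(mean_d, var_d)
```

Under these assumptions the discrepancy is centered near zero with variance a little below 2λ, which is the behavior a Skellam(λ, λ) approximation would predict.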


Some Assumptions

Motivated by a combination of (i) canonical characteristics in complex networks in practice, and (ii) a need to further simplify, we assume the following:

(A1) Large Graphs: $n_v \to \infty$.

(A2) Sparse Graphs: $|E| = \Theta(n_v \log n_v)$.

(A3) Edge Unbiasedness: $\sum_{\{i,j\} \in E^c} \alpha_{ij} = \sum_{\{i,j\} \in E} \beta_{ij}$ ($\equiv \lambda$).

(A4) Low Error Rate: $\lambda = \Theta(n_v^{\gamma} \log^{\kappa} n_v)$, for $\gamma \in [0, 1)$ and $\kappa \ge 0$.

(A5) Homogeneity: $\alpha_{ij} \equiv \alpha$ and $\beta_{ij} \equiv \beta$, for $\alpha, \beta \in (0, 1)$.


Key Observation

When subgraph counting is the goal, the statistic

$$D = \frac{1}{|\mathrm{Iso}(H)|} \sum_{H' \subseteq K_{n_v},\ H' \cong H} \left[ \mathbf{1}\{H' \subseteq \hat{G}\} - \mathbf{1}\{H' \subseteq G\} \right],$$

is a difference of sums of indicator random variables.

Key Point: This can be rewritten as # Type I Errors − # Type II Errors.

Intuitively, the combination of

Sparse Networks + Low-rate Errors

suggests that each sum behaves like a Poisson random variable, and hence their difference, like a Skellam random variable.


The Skellam Distribution

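A random variable $W$ defined on the integers is said to have a Skellam distribution, with parameters $\lambda_1, \lambda_2 > 0$, i.e., $W \sim \mathrm{Skellam}(\lambda_1, \lambda_2)$, if

$$P(W = k) = e^{-(\lambda_1 + \lambda_2)} \left( \sqrt{\frac{\lambda_1}{\lambda_2}} \right)^{k} I_k\left( 2\sqrt{\lambda_1 \lambda_2} \right) \quad \text{for } k \in \mathbb{Z}, \tag{1}$$

where $I_k\left(2\sqrt{\lambda_1 \lambda_2}\right)$ is the modified Bessel function of the first kind with index $k$ and argument $2\sqrt{\lambda_1 \lambda_2}$.

Note:

1 $E[W] = \lambda_1 - \lambda_2$ and $\mathrm{Var}(W) = \lambda_1 + \lambda_2$;

2 Distribution symmetric iff $\lambda_1 = \lambda_2$.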
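The probability mass function in (1) can be checked numerically. The sketch below is an illustration using only the Python standard library; it evaluates the Bessel function through its power series (the truncation at 40 terms and the function names are choices made here, adequate for small parameters) and verifies the stated mean, variance, and symmetry.

```python
import math

def bessel_i(k, x, terms=40):
    """Modified Bessel function of the first kind, integer order, via its power series.

    Uses I_{-k}(x) = I_k(x) for integer k.
    """
    k = abs(k)
    half = x / 2.0
    return sum(half ** (2 * j + k) / (math.factorial(j) * math.factorial(j + k))
               for j in range(terms))

def skellam_pmf(k, lam1, lam2):
    """P(W = k) for W ~ Skellam(lam1, lam2), following equation (1)."""
    return (math.exp(-(lam1 + lam2))
            * (lam1 / lam2) ** (k / 2.0)
            * bessel_i(k, 2.0 * math.sqrt(lam1 * lam2)))

lam1, lam2 = 3.0, 1.5
support = range(-40, 41)                      # mass outside this window is negligible here
probs = [skellam_pmf(k, lam1, lam2) for k in support]
total = sum(probs)
mean = sum(k * p for k, p in zip(support, probs))
var = sum((k - mean) ** 2 * p for k, p in zip(support, probs))
print(total, mean, var)
```

With these parameters the total mass should be numerically 1, the mean close to λ1 − λ2 = 1.5, and the variance close to λ1 + λ2 = 4.5.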


An Initial Treatment: Edge-Counting + Independence

Theorem

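Let $D_E = |\hat{E}| - |E|$. Under assumptions (A1)-(A5) and independence among errors in declaring (non)edge status (i.e., among the $Y_{ij}$),

$$d_{KS}\left(D_E, \mathrm{Skellam}(\lambda, \lambda)\right) \le O\left( \left(n_v^{1-\gamma} \log^{1-\kappa} n_v\right)^{-1} \right). \tag{2}$$

At the same time,

$$d_{KS}\left(D_E/\sigma, N(0, 1)\right) \le O\left( \left(n_v^{\gamma/2} \log^{\kappa/2} n_v\right)^{-1} \right) \tag{3}$$

and, for sufficiently large $n_v$,

$$d_{KS}\left(D_E/\sigma, N(0, 1)\right) \ge \Omega\left( (n_v \log n_v)^{-1} \right), \tag{4}$$

where $N(0, 1)$ refers to a standard normal random variable.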
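The Skellam approximation in the theorem can be probed by simulation. The following is a rough Monte Carlo sketch under independent, homogeneous errors; the graph sizes (1000 non-edges, 100 edges), λ = 2, and all function names are assumptions made for this illustration. It estimates the Kolmogorov-Smirnov distance between the simulated edge-count discrepancy and Skellam(λ, λ). The estimate includes Monte Carlo error, so it only bounds the true distance loosely.

```python
import math
import random

def bessel_i(k, x, terms=40):
    """Modified Bessel function of the first kind (integer order) via its power series."""
    k = abs(k)
    half = x / 2.0
    return sum(half ** (2 * j + k) / (math.factorial(j) * math.factorial(j + k))
               for j in range(terms))

def skellam_cdf_table(lam, k_min, k_max):
    """CDF of Skellam(lam, lam) on k_min..k_max, ignoring the tiny mass below k_min."""
    cdf, running = [], 0.0
    for k in range(k_min, k_max + 1):
        running += math.exp(-2.0 * lam) * bessel_i(k, 2.0 * lam)
        cdf.append(running)
    return cdf

def simulate_discrepancy(n_non_edges, n_edges, lam, rng):
    """One draw of (# Type I errors) - (# Type II errors) with independent errors."""
    alpha, beta = lam / n_non_edges, lam / n_edges
    type1 = sum(rng.random() < alpha for _ in range(n_non_edges))
    type2 = sum(rng.random() < beta for _ in range(n_edges))
    return type1 - type2

def ks_to_skellam(samples, lam, k_min=-30, k_max=30):
    """Largest gap between the empirical CDF and the Skellam(lam, lam) CDF."""
    n = len(samples)
    cdf = skellam_cdf_table(lam, k_min, k_max)
    worst = 0.0
    for idx, k in enumerate(range(k_min, k_max + 1)):
        empirical = sum(1 for s in samples if s <= k) / n
        worst = max(worst, abs(empirical - cdf[idx]))
    return worst

rng = random.Random(42)
lam = 2.0
samples = [simulate_discrepancy(1000, 100, lam, rng) for _ in range(2000)]
ks_estimate = ks_to_skellam(samples, lam)
print(ks_estimate)
```

Repeating this for growing graph sizes, with λ scaled as in (A4), would give an empirical counterpart to the rates in (2) through (4).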


Stronger Results Are Likely . . .

[Figure: log10(KS distance) versus number of vertices (100 to 10,000) in four panels (Constant, Log, Sqrt, Linear), with curves for the Normal and Skellam limits]

(Log) Kolmogorov-Smirnov distance between Skellam and standard normal approximations to the distribution of discrepancy $D_E$ in edge counts under independent errors, with rates constant, log, square-root, and linear in $n_v$.


A More General Treatment

For the special case of counting edges under independent errors, the binary random variables defining $D = \eta_H(\hat{G}) - \eta_H(G)$ are independent.

In fact, dependence can be expected in practice, due to

construction/estimation of $\hat{G}$ (e.g., gene regulatory network inference); and/or

overlap of candidate subgraphs $H'$, for subgraphs of order larger than 2.

Nevertheless, a Skellam approximation is still likely to be appropriate in many contexts, due to the fundamental role of sparse graphs and low-rate errors.


Motivated by these observations, we provide¹ a general analysis of differences of sums of binary random variables,

$$U = \sum_{k=1}^{n} L_k - \sum_{k=1}^{m} M_k,$$

extending the above treatment.

Our approach uses Stein’s method, through which we

exhibit a Stein operator for the Skellam distribution;

produce several bounds on $d_{KS}(U, W)$, for $W \sim \mathrm{Skellam}(\lambda_1, \lambda_2)$;

control the Stein constant, in the case $\lambda_1 = \lambda_2$, and

illustrate in a handful of simple network error models.

¹ Balachandran, Kolaczyk, Viles (2014). On the Propagation of Low-Rate Measurement Error to Subgraph Counts in Large, Sparse Networks. (arXiv:1409.5640)


Illustration: Edge-Counting + Dependence

Suppose that Type I and Type II errors are determined according to a simple two-step process:

1 A fixed integer number $\nu$ of vertex pairs are selected for potential errors;

2 These potential errors are realized with probability

$\alpha = \lambda/[\nu(1-\delta)]$, for vertex pairs in $E^c$, and

$\beta = \lambda/[\nu\delta]$, for vertex pairs in $E$,

where $\delta$ is the network density (i.e., fraction of edges).

Theorem

The resulting noise process is negatively associated and, as a result,

$$d_{KS}\left(D_E, \mathrm{Skellam}(\lambda, \lambda)\right) \le O(\lambda/\nu).$$
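As a quick sanity check on the two-step process, here is a short worked calculation. It assumes the ν vertex pairs in step 1 are chosen uniformly at random from all pairs, which the slide does not state; the symbols N_0 and N_1 are introduced here only for illustration. Under that assumption the expected numbers of Type I and Type II errors both equal λ, consistent with a Skellam(λ, λ) approximation centered at zero.

```latex
\begin{align*}
N_0 &= \#\{\text{selected pairs in } E^c\}, & \mathbb{E}[N_0] &= \nu(1-\delta), \\
N_1 &= \#\{\text{selected pairs in } E\},   & \mathbb{E}[N_1] &= \nu\delta, \\
\mathbb{E}[\#\,\text{Type I}]  &= \alpha\,\mathbb{E}[N_0]
   = \frac{\lambda}{\nu(1-\delta)}\,\nu(1-\delta) = \lambda, \\
\mathbb{E}[\#\,\text{Type II}] &= \beta\,\mathbb{E}[N_1]
   = \frac{\lambda}{\nu\delta}\,\nu\delta = \lambda, \\
\mathbb{E}[D_E] &= \lambda - \lambda = 0 .
\end{align*}
```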


Challenges / Open Problems in Network Error Propagation

Tighter bounds are needed on $d_{KS}(U, W)$.

Control is needed on the Stein constant when $\lambda_1 \ne \lambda_2$.

‘Signal plus noise’ formulation posed here leads to a need for better understanding the large-graph limit of subgraph counts in general (particularly sparse!) network models.

Development/study of domain-specific measurement error models.

Note: Also of interest are related statistical questions, e.g., improved estimation of functions $\eta(G)$, confidence intervals, etc. We have made some initial contributions² using methods of matrix approximation.

² Balachandran, P., Airoldi, E., and Kolaczyk, E.D. (2013). Inference of network summary statistics through network denoising. (arXiv:1310.0423)


‘Stat 101’ for Collections of Network Data Objects


From Large Networks to Many Networks

Traditionally in ‘network science’ what is ‘large’ is the number of vertices $n_v$.

But a second paradigm arguably is emerging, where what is large is the number of networks $n$.

Naturally emerging trend in the ‘big data’ approach to science


Motivation: Functional Connectivity Networks

Kramer, M.A., Eden, U.T., Kolaczyk, E.D., Zepeda, R., Eskandar, E.N., Cash, S.S. (2010). Coalescence and fragmentation of cortical networks during focal seizures. Journal of Neuroscience, 30(30), 10076-10085.


‘Statistics 101’ for Network Data Objects?

While there are lots of methods and models for single (large) complex networks, there is comparatively little for analysis of collections of networks.

Tools are needed for answering basic questions like

“What is the ‘average’ of a collection of networks?”

“Do these networks differ, on average, from a given nominal network?”

“Do two collections of networks differ on average?”

“What factors (e.g., age, gender, etc.) appear to contribute to differences in networks?”

In order to answer these and similar questions, we require network-based analogues of classical tools for statistical estimation and hypothesis testing.


Laying an Initial Foundation . . .

Extension of classical tools to network-based datasets is non-trivial: networks are not Euclidean objects – rather, they are combinatorial objects, defined simply through their sets of vertices and edges.

Nevertheless, in recent work³ we have

shown certain classes of networks can be associated with nice subsets of Euclidean space,

which permits definition of a natural notion of distance and averaging,

allowing results in statistical shape analysis to be used to establish an asymptotic theory,

resulting in, for example, a principled approach to one- and two-sample hypothesis testing.

³ Ginestet, C.E., Balachandran, P., Rosenberg, S., and Kolaczyk, E.D. (2014). Hypothesis testing for network data in functional neuroimaging. (arXiv:1407.5525)


Some Notation and Such

Let $G = (V, E, W)$ be a weighted undirected graph, that is

simple (i.e., no self-loops or multi-edges)

connected (i.e., only one component)

and define the (combinatorial) graph Laplacian

$$L = D(W) - W,$$

where $D$ is a diagonal matrix of weighted degrees, i.e., $D_{jj} = d_j(W) = \sum_{i \ne j} w_{ij}$.

Our interest will be in IID collections of graphs $G_1, \ldots, G_n$, with which we will interchangeably associate an IID collection of graph Laplacians $L_1, \ldots, L_n$.
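For readers who want to see the Laplacian definition in action, here is a small sketch using plain Python lists. The function name and the 3-vertex weight matrix are hypothetical, chosen only for illustration. It builds L = D(W) − W and exposes the zero row sums and symmetry that reappear in the geometric characterization later in the talk.

```python
def graph_laplacian(weights):
    """Combinatorial Laplacian L = D(W) - W for a symmetric weight matrix with zero diagonal."""
    d = len(weights)
    lap = [[-weights[i][j] for j in range(d)] for i in range(d)]
    for j in range(d):
        # Weighted degree of vertex j: sum of weights on edges incident to j.
        lap[j][j] = sum(weights[i][j] for i in range(d) if i != j)
    return lap

# Hypothetical weighted triangle on 3 vertices.
W = [[0.0, 2.0, 1.0],
     [2.0, 0.0, 0.5],
     [1.0, 0.5, 0.0]]
L = graph_laplacian(W)
row_sums = [sum(row) for row in L]
for row in L:
    print(row)
```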


Geometry of a Certain Space of Networks

Theorem

Let the set $\mathcal{L}'_d$ consist of $d \times d$ matrices $A$, satisfying:

(1) $\mathrm{Rank}(A) = d - 1$,

(2) Symmetry, $A^T = A$,

(3) Positive semi-definiteness, $A \ge 0$,

(4) The entries in each row sum to 0,

(5) The off-diagonal entries are non-positive, $a_{ij} \le 0$.

Then $\mathcal{L}'_d$ is a manifold with corners, of dimension $d(d-1)/2$.

Furthermore, $\mathcal{L}'_d$ is a convex subset of an affine space in $\mathbb{R}^{d^2}$ of dimension $d(d-1)/2$.


A Central Limit Theorem

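For $L_1, \ldots, L_n$ IID wrt some distribution $Q$, and $\rho_F$ the distance induced by the Frobenius norm, define the (Fréchet) means

$$E_Q[L] := \arg\min_{L' \in \mathcal{L}'_d} \int_{\mathcal{L}'_d} \rho_F^2(L', L)\, Q(dL) \quad \text{and} \quad \hat{L}_n := \arg\min_{L' \in \mathcal{L}'_d} \frac{1}{n} \sum_{i=1}^{n} \rho_F^2(L', L_i).$$

Theorem

If the expectation, $\Lambda := E_Q[L]$, does not lie on the boundary of $\mathcal{L}'_d$, and $P_Q[U] > 0$, where $U$ is an open subset of $\mathcal{L}'_d$ with $\Lambda \in U$, then (under some further regularity conditions) we obtain the following convergence in distribution:

$$n^{1/2}\left(\phi(\hat{L}_n) - \phi(\Lambda)\right) \longrightarrow N(0, \Sigma),$$

where $\Sigma := \mathrm{Cov}[\phi(L)]$ and $\phi(\cdot)$ denotes the half-vectorization of its matrix argument.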
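Because the space of Laplacians is convex and the distance is Frobenius, the sample Fréchet mean should coincide with the entrywise arithmetic mean: the arithmetic mean minimizes the sum of squared Frobenius distances without constraints, and by convexity it stays inside the set. The sketch below illustrates this under that reasoning. It also assumes that the half-vectorization keeps the strictly upper-triangular entries, a choice made here so that the dimension matches d(d − 1)/2; the function names and example matrices are hypothetical.

```python
def mean_laplacian(laplacians):
    """Entrywise average of a list of d x d Laplacians."""
    n, d = len(laplacians), len(laplacians[0])
    return [[sum(lap[i][j] for lap in laplacians) / n for j in range(d)]
            for i in range(d)]

def half_vec(matrix):
    """Strictly upper-triangular entries, row by row (assumed form of phi)."""
    d = len(matrix)
    return [matrix[i][j] for i in range(d) for j in range(i + 1, d)]

def frobenius_sq(a, b):
    """Squared Frobenius distance between two matrices of equal shape."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def frechet_objective(candidate, laplacians):
    """Average squared Frobenius distance from candidate to the sample."""
    return sum(frobenius_sq(candidate, lap) for lap in laplacians) / len(laplacians)

# Two hypothetical Laplacians of connected weighted graphs on 3 vertices.
L_a = [[3.0, -2.0, -1.0], [-2.0, 2.5, -0.5], [-1.0, -0.5, 1.5]]
L_b = [[2.0, -1.0, -1.0], [-1.0, 2.0, -1.0], [-1.0, -1.0, 2.0]]
sample = [L_a, L_b]

L_bar = mean_laplacian(sample)
phi_bar = half_vec(L_bar)
obj_mean = frechet_objective(L_bar, sample)
print(phi_bar, obj_mean)
```

The averaged matrix keeps zero row sums and non-positive off-diagonal entries, and its objective value is no larger than that of either sample point.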


Large-Sample Testing Theory

Corollary

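Under the null hypothesis $H_0 : E[L] = \Lambda_0$, we have,

$$T_1 := n \left(\phi(\hat{L}_n) - \phi(\Lambda_0)\right)^T \hat{\Sigma}^{-1} \left(\phi(\hat{L}_n) - \phi(\Lambda_0)\right) \longrightarrow \chi^2_m,$$

with $m := \binom{d}{2}$ degrees of freedom, and where $\hat{\Sigma}$ is the sample covariance.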

Analogous results may be stated for two- and multiple-sample testing.
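A minimal sketch of the one-sample statistic follows, using only the standard library. It works directly with half-vectorized Laplacians of simulated graphs on d = 3 vertices, so m = 3. The simulated edge weights (uniform on [0.5, 1.5]), the sample size, the helper names, and the plain sample covariance are all assumptions made for illustration; they are not the estimator used in the talk's applications.

```python
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        x[r] = (aug[r][m] - sum(aug[r][c] * x[c] for c in range(r + 1, m))) / aug[r][r]
    return x

def one_sample_statistic(phis, phi_null):
    """T1 = n (mean - null)^T Sigma_hat^{-1} (mean - null), Sigma_hat the sample covariance."""
    n, m = len(phis), len(phi_null)
    mean = [sum(p[k] for p in phis) / n for k in range(m)]
    cov = [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in phis) / (n - 1)
            for b in range(m)] for a in range(m)]
    diff = [mean[k] - phi_null[k] for k in range(m)]
    weighted = solve(cov, diff)
    return n * sum(diff[k] * weighted[k] for k in range(m))

rng = random.Random(1)
n = 200
# Off-diagonal Laplacian entries are minus the edge weights.
phis = [[-rng.uniform(0.5, 1.5) for _ in range(3)] for _ in range(n)]

t_null = one_sample_statistic(phis, [-1.0, -1.0, -1.0])   # null matches the simulated mean
t_alt = one_sample_statistic(phis, [-1.3, -1.3, -1.3])    # null is wrong
print(t_null, t_alt)
```

Under the corollary, the first statistic would be compared with a chi-squared distribution on 3 degrees of freedom, while the second should be far out in the tail.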


Practical Considerations: Covariance Estimation

In order to use these results in practice, we require knowledge of $\Sigma$ or, more realistically, for $\hat{\Sigma}$ to be stable.

For $n \gg O(d^2)$, it may be that $\hat{\Sigma}$ is stable, but for $n \ll O(d^2)$, we face a “small n, large p” problem.

Natural to seek to exploit recent literature on estimation of large, structured covariance matrices from limited data.

In neuroimaging, it has been argued that the networks of interest are sparse. Our empirical experience suggests that $\Sigma$ is similarly sparse.

We use a thresholding-based method of Cai & Liu ’11 in our applications.


Illustration: 1000 FCP data

The 1000 Functional Connectomes Project (FCP) is a major MRI data-sharing initiative, launched in 2010.

Data collected on both genders, at five international locations, across a range of ages.

[Figure: bar charts of subject counts by (A) Sex (Female, Male) and (B) Age (x ≤ 22, 22 < x ≤ 32, 32 < x)]


Testing for Differences in 1000 FCP Networks

Current state of the art (aka ‘mass univariate’) uses edge-wise testing (presence/absence) and multiple-testing correction.

[Figure: (A) Mass-univariate analysis (Sex) and (B) Mass-univariate analysis (Age), each shown Uncorrected and Corrected]

Comparison: Mass-Univariate vs Multivariate

Both methods detect differences in mean networks across gender and age, when using the full 1000 connectomes; but . . .

Only our multivariate method detects those differences at small sample sizes (i.e., relevant to single labs).


Challenges / Open Problems with Network Data Objects

This work lays only an initial stone in the foundation. There are a vast number of directions to go from here!

Examples include

Critical understanding of the impact of network structure (e.g., sparseness, degree distribution, etc.) on geometric, probabilistic, and statistical aspects of the problem.

A more refined understanding of covariance structure of Laplacians, as well as a careful study of their estimation.

Finite-sample properties, corrections, etc.

Adaptation of other tools in the ‘Statistics 101’ toolbox?


Closing Thoughts: Much Still to Do . . .


Some Closing Thoughts

Use of network-based perspective in modeling and analysis is now pervasive across the sciences (and has even penetrated the humanities).

While much of the work done in this area is necessarily problem-specific, to varying extents, there is sufficient evidence after 15 years to suggest it is both necessary and interesting to (re)visit the statistical foundations⁴.

The hook (hopefully!) is that the area is a source of problems that are broadly relevant, with an intriguing blend of traditional and new elements, and generally needing to incorporate many of the directions of research already being explored in the broader community.

⁴ Note there is parallel movement in the making within the signal and information processing literatures (e.g., witness the rise of ‘graph signal processing’).


Collaborators and Support

Supported in part by AFOSR, NIH, NSF, and ONR.
