Statistical Analysis of Network Data. James Boyle, supervised by George Bolt. September 4, 2020.


  • Statistical Analysis of Network Data

    James Boyle, supervised by George Bolt

    September 4, 2020

    James Boyle supervised by George Bolt Statistical Analysis of Network Data

  • Network Basics

    Network $G = (V, E)$

    Vertices $V = \{1, \dots, N_V\}$

    Edges $\{i, j\}$ joining vertices

    Figure: A graph with $N_V = 10$ vertices


  • Network Basics

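    Network $G = (V, E)$

    Adjacency matrix $A \in \mathbb{R}^{N_V \times N_V}$

    $$a_{ij} = \begin{cases} 1 & \text{if } \{i, j\} \in E, \\ 0 & \text{otherwise} \end{cases}$$

    $$A = \begin{pmatrix}
    0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
    0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 1 & 0 \\
    1 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
    1 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 1 \\
    0 & 1 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
    0 & 1 & 0 & 1 & 0 & 0 & 1 & 1 & 1 & 0 \\
    0 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 1 & 1 \\
    0 & 0 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
    0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 \\
    0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0
    \end{pmatrix}$$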
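
    As a minimal sketch of the adjacency matrix definition, the following Python builds the matrix of an undirected graph from an edge list. The function name `adjacency_matrix` and the 4-vertex edge list are illustrative assumptions, not taken from the slides; vertices are labelled 1, ..., N_V as in the text.

```python
def adjacency_matrix(n_vertices, edges):
    """Return the n_vertices x n_vertices 0/1 adjacency matrix as nested lists.

    Vertices are labelled 1..n_vertices; edges are unordered pairs {i, j},
    so both a_ij and a_ji are set and the matrix is symmetric.
    """
    a = [[0] * n_vertices for _ in range(n_vertices)]
    for i, j in edges:
        a[i - 1][j - 1] = 1
        a[j - 1][i - 1] = 1
    return a


# Hypothetical example graph (not the 10-vertex graph from the figure).
example_edges = [(1, 2), (1, 3), (3, 4)]
A = adjacency_matrix(4, example_edges)
for row in A:
    print(row)
```

    The row sums of the resulting matrix give the vertex degrees.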


  • Network Characteristics

    Vertex centrality, measures of how “important” a vertex is:

    degree - number of edges incident to a vertex

    closeness centrality - $c_{cl}(v) = \dfrac{1}{\sum_{u \in V} d(v, u)}$

    betweenness centrality - proportion of shortest paths between pairs of vertices passing through $v$
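
    The three measures can be sketched in plain Python using breadth-first search. This is an illustration under stated assumptions, not the method used in the talk: the graph is undirected, unweighted and connected, it is stored as a dictionary of neighbour lists, and betweenness is left unnormalised (a sum over vertex pairs of the fraction of shortest paths through $v$). All function names and the path graph are hypothetical.

```python
from collections import deque


def bfs(adj, source):
    """Breadth-first search from `source`.

    Returns (dist, sigma): dist[u] is the shortest-path distance d(source, u),
    sigma[u] is the number of distinct shortest paths from source to u.
    """
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                sigma[u] = 0
                queue.append(u)
            if dist[u] == dist[v] + 1:
                sigma[u] += sigma[v]
    return dist, sigma


def degree(adj, v):
    """Number of edges incident to v."""
    return len(adj[v])


def closeness(adj, v):
    """c_cl(v) = 1 / sum_u d(v, u); assumes the graph is connected."""
    dist, _ = bfs(adj, v)
    return 1.0 / sum(dist.values())


def betweenness(adj, v):
    """Sum over pairs s < t (both != v) of the fraction of shortest
    s-t paths that pass through v. Assumes a connected, undirected graph."""
    searches = {s: bfs(adj, s) for s in adj}
    dist_v, sigma_v = searches[v]
    nodes = sorted(adj)
    total = 0.0
    for idx, s in enumerate(nodes):
        for t in nodes[idx + 1:]:
            if v in (s, t):
                continue
            dist_s, sigma_s = searches[s]
            # v lies on a shortest s-t path iff d(s, v) + d(v, t) = d(s, t).
            if dist_s[v] + dist_v[t] == dist_s[t]:
                total += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return total


# Hypothetical path graph 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
for v in path:
    print(v, degree(path, v), closeness(path, v), betweenness(path, v))
```

    On the path graph the two interior vertices score highest on all three measures, which matches the intuition that they are the most "important".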


  • Random Graphs

  • Stochastic Block Model

    Idea: Split the vertices into groups, and consider all vertices in a given group stochastically equivalent.

    Parameters: $N_V, K \in \mathbb{N}$, $B \in \mathbb{R}^{K \times K}$

    Split the $N_V$ vertices into $K$ classes (a priori or at random).

    Class memberships $c = (c_1, \dots, c_{N_V})$

    $P(\text{edge between vertices } i \text{ and } j) = b_{c_i c_j}$
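
    A minimal sketch of sampling from this model, assuming an undirected graph without self-loops in which each vertex pair is drawn independently. The function name `sample_sbm`, the 0-based class labels and the two-class parameters below are illustrative assumptions.

```python
import random


def sample_sbm(classes, B, rng=None):
    """Sample a symmetric 0/1 adjacency matrix from a stochastic block model.

    classes[i] is the class (0..K-1) of vertex i; B[k][l] is the probability
    of an edge between a vertex in class k and a vertex in class l.
    """
    rng = rng or random.Random()
    n = len(classes)
    a = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # no self-loops; each pair is drawn once
            if rng.random() < B[classes[i]][classes[j]]:
                a[i][j] = a[j][i] = 1
    return a


# Hypothetical parameters: two classes of three vertices each,
# dense within a class and sparse between classes.
classes = [0, 0, 0, 1, 1, 1]
B = [[0.9, 0.1],
     [0.1, 0.9]]
A = sample_sbm(classes, B, rng=random.Random(0))
for row in A:
    print(row)
```

    Here the class memberships are fixed a priori; drawing `classes` at random first would give the other variant mentioned on the slide.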


  • Stochastic Block Model - Example

    $N_V = 35$

    $K = 3$

    $$B = \begin{pmatrix} 0.9 & 0.1 & 0.1 \\ 0.1 & 0.3 & 0.05 \\ 0.1 & 0.05 & 0.3 \end{pmatrix}$$


  • Measures of Vertex Centrality

    Figure: Betweenness Centrality


  • Measures of Vertex Centrality

    Figure: Closeness Centrality


  • Multiple Network Models

    Single network models are in general not suitable for modelling multiple network observations, e.g. brain scans.

    Figure: Brain Networks [4]


  • Multiple Network Models

    Aim: Model multiple noisy realisations of a single “true” network, i.e. observations of the form

    True Network + Noise

    For a binary network, noise can only manifest itself in the form of false positive and false negative observations.


  • The Measurement Error Model

    Model assumptions [3]¹:

    True network $A \sim \mathrm{StochasticBlockModel}(N_V, K, B)$

    Noise in the observations $A^{(1)}, \dots, A^{(n)}$ respects the block structure

    Concretely, letting $P, Q \in \mathbb{R}^{K \times K}$ be the matrices of false positive and false negative rates respectively, we suppose that

    $$A^{(m)}_{ij} \sim \begin{cases} \mathrm{Bernoulli}(P_{c_i c_j}) & \text{if } A_{ij} = 0, \\ \mathrm{Bernoulli}(1 - Q_{c_i c_j}) & \text{if } A_{ij} = 1 \end{cases}$$

    ¹ C. M. Le, K. Levin, E. Levina, et al. Estimating a network from multiple noisy realizations. Electronic Journal of Statistics, 12(2):4697–4740, 2018.
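
    The observation step can be sketched as follows: a forward simulation only, not the estimation procedure of [3]. It assumes undirected networks without self-loops and independent draws per vertex pair; the function name `sample_noisy_observations`, the fixed 4-vertex true network and the rate matrices are all hypothetical.

```python
import random


def sample_noisy_observations(a_true, classes, P, Q, n_obs, rng=None):
    """Draw n_obs noisy copies of the true adjacency matrix a_true.

    P[k][l] is the false positive rate and Q[k][l] the false negative rate
    for a pair of vertices in classes k and l.
    """
    rng = rng or random.Random()
    n = len(a_true)
    observations = []
    for _ in range(n_obs):
        obs = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                k, l = classes[i], classes[j]
                if a_true[i][j] == 0:
                    p_edge = P[k][l]        # non-edge observed as an edge
                else:
                    p_edge = 1.0 - Q[k][l]  # true edge survives the noise
                if rng.random() < p_edge:
                    obs[i][j] = obs[j][i] = 1
        observations.append(obs)
    return observations


# Hypothetical true network with two classes of two vertices each.
a_true = [[0, 1, 0, 0],
          [1, 0, 0, 0],
          [0, 0, 0, 1],
          [0, 0, 1, 0]]
classes = [0, 0, 1, 1]
P = [[0.05, 0.01], [0.01, 0.05]]  # false positive rates (assumed values)
Q = [[0.10, 0.20], [0.20, 0.10]]  # false negative rates (assumed values)
samples = sample_noisy_observations(a_true, classes, P, Q, n_obs=3,
                                    rng=random.Random(0))
print(len(samples), "noisy observations drawn")
```

    Setting every entry of `P` and `Q` to zero reproduces the true network exactly, which is a quick sanity check on the two branches.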


  • Metric Based Models

    Idea [4]²: Assign probabilities to networks based on their distance from a central, “true”, network.

    e.g. For a true network $G^{\text{true}}$, the Spherical Network Model assigns

    $$P(G; G^{\text{true}}, \gamma) \propto \exp(-\gamma \, d(G, G^{\text{true}}))$$

    ² S. Lunagómez, S. C. Olhede, and P. J. Wolfe (2020). Modeling network populations via graph distances. Journal of the American Statistical Association (just-accepted):1–59.
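
    To make the formula concrete, here is a small sketch that normalises these weights by brute force over every graph on a few vertices. The slides do not fix the distance $d$; the Hamming distance (number of vertex pairs on which two graphs disagree) is used here purely as an assumption, and the function names and the 3-vertex true network are hypothetical.

```python
import itertools
import math


def hamming_distance(g, h):
    """Number of vertex pairs on which edge-indicator tuples g and h differ."""
    return sum(x != y for x, y in zip(g, h))


def spherical_network_model(g_true, gamma):
    """Return {graph: probability} over all graphs on the same vertex pairs.

    Graphs are tuples of 0/1 edge indicators, one per unordered vertex pair.
    Probabilities are proportional to exp(-gamma * d(G, G_true)).
    """
    graphs = list(itertools.product((0, 1), repeat=len(g_true)))
    weights = [math.exp(-gamma * hamming_distance(g, g_true)) for g in graphs]
    z = sum(weights)  # normalising constant, computed by enumeration
    return {g: w / z for g, w in zip(graphs, weights)}


# Hypothetical true network on 3 vertices; pairs ordered (1,2), (1,3), (2,3).
g_true = (1, 1, 0)
probs = spherical_network_model(g_true, gamma=1.0)
for g, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(g, round(p, 4))
```

    Enumeration is only feasible for very small graphs, since the number of graphs doubles with every additional vertex pair. Larger $\gamma$ concentrates the probability mass more tightly around the true network.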


  • On Things not Covered

    Network path data - Each data point is a path through a network, e.g.

    vertex ↔ webpage

    edge ↔ navigation by user between webpages

    Inference - Very complicated, so approximate methods such as MCMCMLE or EMA must be used

    Single Network Models

    Dynamic networks


  • Bibliography

    [1] B. Kim, K. H. Lee, L. Xue, and X. Niu. A review of dynamic network models with latent variables. Statistics Surveys, 12:105, 2018.

    [2] E. D. Kolaczyk and G. Csárdi. Statistical Analysis of Network Data with R, volume 65. Springer, 2014.

    [3] C. M. Le, K. Levin, E. Levina, et al. Estimating a network from multiple noisy realizations. Electronic Journal of Statistics, 12(2):4697–4740, 2018.

    [4] S. Lunagómez, S. C. Olhede, and P. J. Wolfe. Modeling network populations via graph distances. Journal of the American Statistical Association, (just-accepted):1–59, 2020.

    [5] M. Salter-Townshend, A. White, I. Gollini, and T. Murphy. Review of statistical network analysis: Models, algorithms and software supplementary material. 2012.