
Graphical Models: Learning

Pradeep Ravikumar

Carnegie Mellon University

Learning Graphical Models

• In many contexts, we do not have an a priori specified graphical model distribution

• All we have access to are i.i.d. samples drawn from an *unknown* graphical model distribution

• We would like to estimate the graphical model distribution from the data:

‣ the graph

‣ the factor functions

• We focus on parametric graphical models, where the factor functions have a specific parametric form, so this entails estimating parameters (rather than general functions)

Learning Undirected Graphical Models

We will focus on pairwise graphical models:

p(X; θ, G) = (1/Z(θ)) exp( Σ_{(s,t) ∈ E(G)} θ_st φ_st(X_s, X_t) )

φ_st(x_s, x_t): arbitrary potential functions

• Ising: φ_st(x_s, x_t) = x_s x_t

• Potts: φ_st(x_s, x_t) = I(x_s = x_t)

• Indicator: φ_st(x_s, x_t) = I(x_s = j, x_t = k)
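To make the pairwise form concrete, here is a minimal sketch that evaluates the (un)normalized density under each of the three potential choices. The 3-node chain graph and the weights `theta` are illustrative assumptions, not an example from the slides:

```python
import itertools
import math

# Toy sketch of the pairwise model above; the 3-node chain and the
# weights theta are illustrative assumptions.
def potential(kind, xs, xt, j=1, k=1):
    if kind == "ising":          # phi(xs, xt) = xs * xt
        return xs * xt
    if kind == "potts":          # phi(xs, xt) = I(xs == xt)
        return float(xs == xt)
    if kind == "indicator":      # phi(xs, xt) = I(xs == j, xt == k)
        return float(xs == j and xt == k)
    raise ValueError(kind)

def unnormalized(x, edges, theta, kind):
    # exp( sum_{(s,t) in E} theta_st * phi_st(x_s, x_t) )
    return math.exp(sum(th * potential(kind, x[s], x[t])
                        for th, (s, t) in zip(theta, edges)))

edges = [(0, 1), (1, 2)]         # chain graph on 3 nodes
theta = [0.5, -0.3]

# Z(theta) normalizes by summing over all {-1,+1}^3 configurations.
Z = sum(unnormalized(x, edges, theta, "ising")
        for x in itertools.product([-1, 1], repeat=3))
probs = {x: unnormalized(x, edges, theta, "ising") / Z
         for x in itertools.product([-1, 1], repeat=3)}
print(round(sum(probs.values()), 6))  # probabilities sum to 1
```

Swapping `"ising"` for `"potts"` or `"indicator"` changes only the factor functions, not the overall form of the model.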

Graphical Model Selection

Let G = (V, E) be an undirected graph on p = |V| vertices.

Pairwise Markov random field: family of probability distributions

P(x_1, …, x_p; θ) = (1/Z(θ)) exp( Σ_{(s,t) ∈ E} θ_st x_s x_t )

Problem of graph selection: given n independent and identically distributed (i.i.d.) samples of X = (X_1, …, X_p), identify the underlying graph structure.

Martin Wainwright (UC Berkeley), High-dimensional graph selection, September 2009

Given: n samples of X = (X_1, …, X_p) with distribution p(X; θ, G), where

p(X; θ) = exp( Σ_{(s,t) ∈ E(G)} θ_st φ_st(x_s, x_t) − A(θ) )

Problem: Estimate the graph G given just the n samples.

Samples from binary-valued pairwise MRFs

[Figure: sample patterns under the independence model (θ_st = 0), medium coupling (θ_st ≈ 0.2), and strong coupling (θ_st ≈ 0.8).]



Learning Graphical Models

• Two Step Procedures:

‣ 1. Model Selection; estimate graph structure

‣ 2. Parameter Inference given graph structure

• Score Based Approaches: search over space of graphs, with a score for graph based on parameter inference

• Constraint-based Approaches: estimate individual edges by hypothesis tests for conditional independences

• Caveats: (a) it is difficult to provide guarantees for these estimators; (b) the estimators are NP-hard to compute

Learning Graphical Models

• State-of-the-art methods are based on estimating neighborhoods

‣ Via high-dimensional statistical model estimation

‣ Via high-dimensional hypothesis tests

Ising Model Selection

Given: n samples of X = (X_1, …, X_p) with distribution p(X; θ, G), where

p(X; θ) = exp( Σ_{(s,t) ∈ E(G)} θ_st X_s X_t − A(θ) )

Problem: Estimate the graph G given just the n samples.


Applications: statistical physics, computer vision, social network analysis

US Senate 109th Congress

Banerjee et al, 2008


Ising Model Selection

• Just computing the likelihood of a known Ising model is NP-hard (since the normalization constant requires summing over exponentially many configurations)

• Estimating the unknown Ising model parameters, as well as the graph structure, might seem to be NP-hard as well

• On the other hand, it is tractable to estimate the node-wise conditional distributions, of one variable conditioned on the rest of the variables

Z(θ) = Σ_{x ∈ {−1,1}^p} exp( Σ_{(s,t)} θ_st x_s x_t )
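A brute-force evaluation of Z(θ) makes the exponential blow-up visible: the outer sum has 2^p terms. This sketch (a complete graph with small random weights, an illustrative assumption) is feasible only for tiny p:

```python
import itertools
import math
import random

# Brute-force partition function for a small Ising model; the random
# complete-graph weights here are an illustrative assumption.
def log_Z(p, theta):
    # theta: dict (s, t) -> theta_st; the outer sum has 2**p terms
    total = 0.0
    for x in itertools.product([-1, 1], repeat=p):
        total += math.exp(sum(w * x[s] * x[t] for (s, t), w in theta.items()))
    return math.log(total)

random.seed(0)
for p in (4, 8, 12):
    theta = {(s, t): random.gauss(0, 0.1)
             for s in range(p) for t in range(s + 1, p)}
    print(p, 2 ** p, round(log_Z(p, theta), 3))  # term count doubles per node
```

At p = 30 the sum would already have over a billion terms, which is why likelihood-based learning of the global model is avoided.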


Neighborhood Estimation in Ising Models

• So instead of estimating a graph-structure-constrained global Ising model, we could estimate structure-constrained local node-conditional distributions, i.e., logistic regression models

• But would node-conditional distributions uniquely specify a consistent joint, or even be consistent with any joint at all?

For Ising models, the node-conditional distribution is just a logistic regression model:

p(X_r | X_{V\r}; θ, G) = exp( Σ_{t ∈ N(r)} 2θ_rt X_r X_t ) / ( exp( Σ_{t ∈ N(r)} 2θ_rt X_r X_t ) + 1 )
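The logistic form can be verified numerically: on a toy 3-node Ising model (assumed chain weights, not from the slides), the conditional computed directly from the joint matches it exactly, because the partition function and all edge terms not touching node r cancel in the ratio:

```python
import itertools
import math

# Toy 3-node Ising model with assumed edge weights (a chain 0-1-2).
theta = {(0, 1): 0.7, (1, 2): -0.4}

def joint_unnorm(x):
    return math.exp(sum(w * x[s] * x[t] for (s, t), w in theta.items()))

def exact_conditional(r, x):
    # P(X_r = x_r | rest): Z and non-incident edge terms cancel
    flipped = list(x)
    flipped[r] = -x[r]
    return joint_unnorm(x) / (joint_unnorm(x) + joint_unnorm(tuple(flipped)))

def logistic_conditional(r, x):
    # exp(sum_{t in N(r)} 2 theta_rt x_r x_t) / (same + 1)
    a = sum(2 * w * x[s] * x[t] for (s, t), w in theta.items() if r in (s, t))
    return math.exp(a) / (math.exp(a) + 1)

for x in itertools.product([-1, 1], repeat=3):
    for r in range(3):
        assert abs(exact_conditional(r, x) - logistic_conditional(r, x)) < 1e-12
print("node conditionals match the logistic form")
```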


Conditional and Joint Distributions

• Would node-conditional distributions uniquely specify a consistent joint, or even be consistent with any joint at all?

• In general: no!

• But for the Ising model and node-wise logistic regression models: yes!

• Theorem (Besag 1974; R., Wainwright, Lafferty 2010): An Ising model uniquely specifies, and is uniquely specified by, a set of node-wise logistic regression models.

Neighborhood Estimation in Ising Models

• Global graph constraint of sparse, bounded degree graphs is equivalent to local constraint of bounded node-degrees (number of neighbors)

• Estimate node neighborhoods via constrained logistic regression models, and stitch the node-neighborhoods together to form the global graph

For Ising models, the node-conditional distribution is just a logistic regression model:

p(X_r | X_{V\r}; θ, G) = exp( Σ_{t ∈ N(r)} 2θ_rt X_r X_t ) / ( exp( Σ_{t ∈ N(r)} 2θ_rt X_r X_t ) + 1 )

Graph selection via neighborhood regression

Observation: Recovering the graph G is equivalent to recovering the neighborhood set N(s) for all s ∈ V.

Method: Given n i.i.d. samples X^(1), …, X^(n), perform logistic regression of each node X_s on X_\s := {X_t, t ≠ s} to estimate the neighborhood structure N̂(s).

1. For each node s ∈ V, perform ℓ1-regularized logistic regression of X_s on the remaining variables X_\s:

θ̂[s] := argmin_{θ ∈ R^(p−1)} { (1/n) Σ_{i=1}^n f(θ; X^(i)_\s) + ρ_n ‖θ‖_1 }

where the first term is the logistic likelihood and the second is the regularization penalty.

2. Estimate the local neighborhood N̂(s) as the support (non-zero entries) of the regression vector θ̂[s].

3. Combine the neighborhood estimates in a consistent manner (AND or OR rule).
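The three steps above can be sketched end-to-end on synthetic data. Everything below is an illustrative assumption (a 5-node chain, Gibbs sampling for the data, a plain proximal-gradient (ISTA) solver for the ℓ1-regularized logistic regression, and ad hoc tuning constants), not the paper's experimental setup:

```python
import numpy as np

# End-to-end sketch of neighborhood regression on synthetic data.
# The chain graph, Gibbs sampler, ISTA solver, and constants below are
# illustrative assumptions.
rng = np.random.default_rng(0)
p, n = 5, 4000
true_edges = {(0, 1), (1, 2), (2, 3), (3, 4)}
W = np.zeros((p, p))
for s, t in true_edges:
    W[s, t] = W[t, s] = 0.8

def gibbs_sample(n, burn=200):
    # Sample the Ising model via its logistic node conditionals
    x = rng.choice([-1.0, 1.0], size=p)
    out = []
    for it in range(burn + n):
        for r in range(p):
            a = 2.0 * (x @ W[r])                # 2 * sum_t theta_rt x_t
            x[r] = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-a)) else -1.0
        if it >= burn:
            out.append(x.copy())
    return np.array(out)

def l1_logistic(X, y, rho, steps=300, lr=0.1):
    # ISTA: gradient step on the logistic loss, then soft-thresholding
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        margin = (X @ theta) * y
        grad = -(X * (y / (1.0 + np.exp(margin)))[:, None]).mean(axis=0)
        theta = theta - lr * grad
        theta = np.sign(theta) * np.maximum(np.abs(theta) - lr * rho, 0.0)
    return theta

X = gibbs_sample(n)
rho = 0.5 * np.sqrt(np.log(p) / n)
nbhd = []
for r in range(p):
    th = l1_logistic(np.delete(X, r, axis=1), X[:, r], rho)
    others = [t for t in range(p) if t != r]
    nbhd.append({others[j] for j in np.flatnonzero(np.abs(th) > 0.1)})

# AND rule: keep (s, t) only if each node appears in the other's estimate
est_edges = {(min(s, t), max(s, t))
             for s in range(p) for t in nbhd[s] if s in nbhd[t]}
print(sorted(est_edges))
```

The OR rule would instead keep an edge whenever either endpoint selects it; the AND rule used here is the more conservative choice.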

Empirical behavior: Unrescaled plots

[Figure: probability of success versus raw sample size n (0 to 600) for a star graph with a linear fraction of neighbors, at p = 64, 100, and 225.]

Sufficient conditions for consistent model selection

Graph sequences G_{p,d} = (V, E) with p vertices and maximum degree d.

Edge weights |θ_st| ≥ θ_min for all (s, t) ∈ E.

Draw n i.i.d. samples, and analyze the probability of success indexed by (n, p, d).

Theorem (R., Wainwright, Lafferty, 2010)

Under incoherence conditions, for a rescaled sample size

θ_LR(n, p, d) := n / (d³ log p) > θ_crit

and regularization parameter ρ_n ≥ c₁ τ √(log p / n), with probability greater than 1 − 2 exp(−c₂ (τ − 2) log p) → 1:

(a) Uniqueness: For each node s ∈ V, the ℓ1-regularized logistic convex program has a unique solution. (Non-trivial, since p ≫ n implies the program is not strictly convex.)

(b) Correct exclusion: The estimated sign neighborhood N̂(s) correctly excludes all edges not in the true neighborhood.

(c) Correct inclusion: For θ_min ≥ c₃ τ √d ρ_n, the method selects the correct signed neighborhood.

Consequence: For θ_min = Ω(1/d), it suffices to have n = Ω(d³ log p).

Assumptions

Define the Fisher information matrix of the logistic regression: Q* := E_{θ*}[∇² f(θ*; X)].

A1. Dependency condition: bounded eigenspectra:

C_min ≤ λ_min(Q*_SS), and λ_max(Q*_SS) ≤ C_max;

λ_max(E_{θ*}[X Xᵀ]) ≤ D_max.

A2. Incoherence: there exists a ν ∈ (0, 1] such that

|||Q*_{SᶜS} (Q*_SS)^(−1)|||_{∞,∞} ≤ 1 − ν,

where |||A|||_{∞,∞} := max_i Σ_j |A_ij|.

• Bounds on eigenvalues are fairly standard.

• The incoherence condition is partly necessary (prevention of degenerate models) and partly an artifact of ℓ1-regularization.

• The incoherence condition is weaker than correlation decay.

Other Undirected Graphical Models

• Similar estimators work for other undirected parametric graphical models as well

• Discrete/Categorical Graphical Models (Jalali, Ravikumar, Sanghavi, Ruan 2011)

• Gaussian Graphical Models (Ravikumar, Raskutti, Wainwright, Yu 2011)

• Exponential Family Graphical Models (Yang, Ravikumar, Allen, Liu 2015)

Example: Mixed Graphical Models

Experiments: Cancer Genomic and Transcriptomic Data

Combine 'Level III RNA-sequencing' data and 'Level II non-silent somatic mutation and Level III copy number variation' data for 697 breast cancer patients.

[Figure: learned gene network over the combined expression and mutation variables.]

TPGM - Ising graphical model

(Yellow) Gene expression via RNA-sequencing, count-valued

(Blue) Genomic mutation, binary mutation status

Well known components: (DLK1, THSD4) - (TP53)

(UT Austin) Mixed Graphical Models via Exponential Families AISTATS 2014 22 / 25


Poisson-Ising Models

Example: Poisson Graphical Models

• MicroRNA network learnt from The Cancer Genome Atlas (TCGA) Breast Cancer Level II Data

Case Study: Biological Results

SPGM miRNA Network:

An Important Special Case: Poisson Graphical Model

Joint distribution:

P(X) = exp( Σ_s θ_s X_s + Σ_{(s,t) ∈ E} θ_st X_s X_t − Σ_s log(X_s!) − A(θ) )

Node-conditional distributions:

P(X_s | X_{V\s}) ∝ exp( (θ_s + Σ_{t ∈ N(s)} θ_st X_t) X_s − log(X_s!) )

The pairwise variant is discussed as the "Poisson auto-model" in (Besag, 1974).
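A quick numerical check (with toy values for θ_s, θ_st, and the neighbor value X_t, all assumed) that the node-conditional above is a Poisson distribution with rate λ = exp(θ_s + Σ_t θ_st X_t):

```python
import math

# Assumed toy parameters; a negative coupling keeps the joint normalizable.
theta_s, theta_st, x_t = 0.2, -0.3, 2

eta = theta_s + theta_st * x_t   # natural parameter of the conditional
# Unnormalized conditional weights exp(eta * k - log k!) for k = 0, 1, ...
weights = [math.exp(eta * k - math.lgamma(k + 1)) for k in range(50)]
Z = sum(weights)                 # truncation error is negligible here
lam = math.exp(eta)
for k in range(10):
    poisson_pmf = math.exp(-lam) * lam ** k / math.factorial(k)
    assert abs(weights[k] / Z - poisson_pmf) < 1e-9
print("node conditional is Poisson with rate", round(lam, 4))
```

This is the Poisson analogue of the Ising/logistic correspondence: each node-conditional is a generalized linear model whose natural parameter is linear in the neighboring variables.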

Learning Directed Graphical Models

Moralization

• The undirected graphical model corresponding to the moralized graph is the smallest undirected graphical model that includes the directed graphical model distribution

• Learning the undirected graphical model structure given i.i.d. samples from a directed graphical model would thus estimate the moralized graph

Recall: Moralization

[Figure: (a) a directed graph on X1, …, X6 and (b) its moralized undirected counterpart.]
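Moralization itself is mechanical: marry all co-parents, then drop edge directions. A minimal sketch, on an assumed 6-node DAG (not necessarily the exact graph in the figure):

```python
from itertools import combinations

# Moralize a DAG given as a dict: node -> list of its parents.
def moralize(parents):
    edges = set()
    for child, pa in parents.items():
        for u in pa:
            edges.add(tuple(sorted((u, child))))   # drop edge direction
        for u, v in combinations(pa, 2):
            edges.add(tuple(sorted((u, v))))       # marry co-parents
    return edges

dag = {1: [], 2: [1], 3: [1], 4: [2], 5: [3], 6: [4, 5]}
print(sorted(moralize(dag)))
# edge (4, 5) appears even though the DAG has no arc between 4 and 5
```

The married edge (4, 5) is exactly why learning an undirected model from directed-model samples recovers the moralized graph rather than the DAG skeleton.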

Learning Directed Graphical Models

• Two Step Process:

• Learn the undirected “moralized” graph

• Orient the edges using conditional independence tests

• PC algorithm — Spirtes, Glymour & Scheines (2000)

• Open Problem: the algorithms for the second step (and for overall directed graphical model estimation) are not very scalable
