
Statistical Learning from Relational Data

Daphne Koller, Stanford University

Joint work with many many people

Relational Data is Everywhere

The web: webpages (& the entities they represent), hyperlinks

Social networks: people, institutions, friendship links

Biological data: genes, proteins, interactions, regulation

Bibliometrics: papers, authors, journals, citations

Corporate databases: customers, products, transactions

Relational Data is Different

Data instances are not independent: topics of linked webpages are correlated

Data instances are not identically distributed: heterogeneous instances (papers, authors)

No IID assumption

This is a good thing

New Learning Tasks

Collective classification of related instances: labeling an entire website of related webpages

Relational clustering: finding coherent clusters in the genome

Link prediction & classification: predicting when two people are likely to be friends

Pattern detection in networks of related objects: finding groups (research groups, terrorist groups)

Probabilistic Models

Uncertainty model: a space of “possible worlds” and a probability distribution over this space

Worlds are often defined via a set of state variables (medical diagnosis: diseases, symptoms, findings, …); each world is an assignment of values to the variables

The number of worlds is exponential in the number of variables: 2^n if we have n binary variables
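To make the "possible worlds" picture concrete, here is a tiny sketch (my own example, not from the talk) that enumerates every assignment to three binary state variables:

```python
from itertools import product

# With n binary state variables there are 2**n possible worlds,
# each world being one assignment of values to all the variables.
variables = ["Flu", "Fever", "Cough"]                 # n = 3 state variables
worlds = list(product([False, True], repeat=len(variables)))
print(len(worlds))                                    # 2**3 = 8 worlds
print(dict(zip(variables, worlds[0])))                # one possible world
```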

Outline

Relational Bayesian networks*
Relational Markov networks
Collective classification
Relational clustering

* with Avi Pfeffer, Nir Friedman, Lise Getoor

Bayesian Networks

nodes = variables; edges = direct influence

Graph structure encodes independence assumptions: Job is conditionally independent of Intelligence given Grade

[Figure: network with nodes Difficulty, Intelligence, Grade, SAT, Job, and a bar chart of the CPD P(Grade | Difficulty, Intelligence) over grades A, B, C for each (easy/hard, low/high) parent combination.]
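As a quick sketch of the semantics, the following mirrors the structure above with made-up CPD numbers (the parameters are illustrative assumptions, not from the talk); the joint probability is the chain-rule product of each node given its parents:

```python
# Illustrative CPDs for the network Difficulty, Intelligence -> Grade,
# Intelligence -> SAT, Grade -> Job. All numbers are assumptions.
p_diff = {"easy": 0.6, "hard": 0.4}
p_intel = {"low": 0.7, "high": 0.3}
p_grade = {  # P(Grade | Difficulty, Intelligence)
    ("easy", "low"):  {"A": 0.3, "B": 0.4, "C": 0.3},
    ("easy", "high"): {"A": 0.9, "B": 0.08, "C": 0.02},
    ("hard", "low"):  {"A": 0.05, "B": 0.25, "C": 0.7},
    ("hard", "high"): {"A": 0.5, "B": 0.3, "C": 0.2},
}
p_sat = {"low": {"lowSAT": 0.95, "highSAT": 0.05},
         "high": {"lowSAT": 0.2, "highSAT": 0.8}}   # P(SAT | Intelligence)
p_job = {"A": 0.8, "B": 0.5, "C": 0.2}              # P(Job = yes | Grade)

def joint(d, i, g, s, job):
    """Chain rule under the graph: each node conditioned only on its parents."""
    pj = p_job[g] if job else 1 - p_job[g]
    return p_diff[d] * p_intel[i] * p_grade[(d, i)][g] * p_sat[i][s] * pj

print(joint("hard", "high", "A", "highSAT", True))
```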

Bayesian Networks: Problem

Bayesian nets use a propositional representation; the real world has objects, related to each other

[Figure: the template fragment Intelligence, Difficulty → Grade, instantiated once per registration: Intell_Jane and Diffic_CS101 → Grade_Jane_CS101; Intell_George and Diffic_Geo101 → Grade_George_Geo101; Intell_George and Diffic_CS101 → Grade_George_CS101.]

These “instances” are not independent

Relational Schema

Specifies the types of objects in the domain, the attributes of each type of object, and the types of relations between objects

Classes: Student, Course, Professor, Registration

Attributes: Intelligence (Student), Difficulty (Course), Teaching-Ability (Professor), Grade and Satisfaction (Registration)

Relations: Teach, In, Take
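One way to picture this schema in code, a minimal sketch rendering the classes, attributes, and relations above as plain Python objects (the field names are my own choices):

```python
from dataclasses import dataclass

@dataclass
class Professor:
    name: str
    teaching_ability: str          # attribute

@dataclass
class Course:
    name: str
    difficulty: str                # attribute
    taught_by: Professor           # Teach relation

@dataclass
class Student:
    name: str
    intelligence: str              # attribute

@dataclass
class Registration:                # Take/In relations, with own attributes
    student: Student
    course: Course
    grade: str
    satisfaction: str

smith = Professor("Prof. Smith", "high")
geo101 = Course("Geo101", "easy", smith)
jane = Student("Jane", "high")
reg = Registration(jane, geo101, "A", "high")
print(reg.course.taught_by.name)   # follow relations between objects
```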

St. Nordaf University

[Figure: an example world. Prof. Smith and Prof. Jones (Teaching-ability) teach the courses Geo101 and CS101 (Difficulty); George and Jane (Intelligence) are registered in these courses, and each registration carries a Grade and a Satisfaction.]

Relational Bayesian Networks

Universals: probabilistic patterns hold for all objects in a class
Locality: represent direct probabilistic dependencies; links define potential interactions

[Figure: template model over Student.Intelligence, Course.Difficulty, Professor.Teaching-Ability, Reg.Grade, and Reg.Satisfaction.]

[K. & Pfeffer; Poole; Ngo & Haddawy]


RBN Semantics

[Figure: the ground network for St. Nordaf: the template CPDs are instantiated for Prof. Smith, Prof. Jones, Geo101, CS101, George, and Jane, tying each Grade and Satisfaction variable to the relevant Teaching-ability, Difficulty, and Intelligence variables.]

Ground model:
variables: attributes of all objects
dependencies: determined by relational links & template model
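A minimal sketch of this grounding step (the skeleton and names are illustrative): given a set of registrations and the template "Grade depends on Intelligence and Difficulty," we mechanically emit the ground variables and their parents:

```python
# Relational skeleton: which students took which courses (toy example).
registrations = [("Jane", "Geo101"), ("George", "Geo101"), ("George", "CS101")]

variables, parents = [], {}
for student, course in registrations:
    # One variable per attribute of each object, created once.
    for obj, attr in [(student, "Intelligence"), (course, "Difficulty")]:
        v = f"{attr}_{obj}"
        if v not in parents:
            variables.append(v)
            parents[v] = []
    # Template: Reg.Grade depends on Student.Intelligence, Course.Difficulty.
    g = f"Grade_{student}_{course}"
    variables.append(g)
    parents[g] = [f"Intelligence_{student}", f"Difficulty_{course}"]

for v in variables:
    print(v, "<-", parents[v])
```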


The Web of Influence

[Figure: evidence about grades propagates through the ground network, updating posteriors over each student's intelligence (low/high) and each course's difficulty (easy/hard) across Geo101 and CS101.]

Outline

Relational Bayesian networks*
Relational Markov networks†
Collective classification
Relational clustering

* with Avi Pfeffer, Nir Friedman, Lise Getoor

† with Ben Taskar, Pieter Abbeel

Why Undirected Models?

Symmetric, non-causal interactions: e.g., on the web, the categories of linked pages are correlated, and we cannot introduce directed edges because of cycles

Patterns involving multiple entities: e.g., “triangle” patterns on the web, where directed edges are not appropriate

“Solution”: impose an arbitrary direction. But it is not clear how to parameterize a CPD for variables involved in multiple interactions, and it is very difficult within a class-based parameterization

[Taskar, Abbeel, K. 2001]

Markov Networks

[Figure: Markov network over five people: James (J), Kyle (K), Laura (L), Mary (M), Noah (N).]

$$P(J,K,L,M,N) = \frac{1}{Z}\,\phi(J,K)\,\phi(J,L)\,\phi(K,L)\,\phi(J,M)\,\phi(M,N)\,\phi(L,N)$$

[Figure: shared template potential over label pairs AA, AB, AC, BA, BB, BC, CA, CB, CC.]
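To ground the formula above, here is a small sketch with one shared pairwise template potential (the values 2.0/0.5, which favor neighbors taking the same label, are my own illustrative assumptions):

```python
import itertools

edges = [("James", "Kyle"), ("James", "Laura"), ("Kyle", "Laura"),
         ("James", "Mary"), ("Mary", "Noah"), ("Laura", "Noah")]
people = ["James", "Kyle", "Laura", "Mary", "Noah"]
# Template potential over label pairs AA ... CC, shared by every edge.
phi = {(a, b): (2.0 if a == b else 0.5) for a in "ABC" for b in "ABC"}

def unnormalized(assign):
    """Product of edge potentials for one joint label assignment."""
    score = 1.0
    for u, v in edges:
        score *= phi[(assign[u], assign[v])]
    return score

# Partition function Z: sum the product of potentials over all assignments.
Z = sum(unnormalized(dict(zip(people, labels)))
        for labels in itertools.product("ABC", repeat=len(people)))
a = dict.fromkeys(people, "A")
print(unnormalized(a) / Z)   # P(J,K,L,M,N) = product of potentials / Z
```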

Relational Markov Networks

Universals: probabilistic patterns hold for all groups of objects

Locality: represent local probabilistic dependencies; sets of links give us possible interactions

[Figure: RMN template: two registrations Reg1 and Reg2 (Grade) of students Student1 and Student2 (Intelligence) in a shared Course (Difficulty), coupled through a Study Group; a template potential over grade pairs AA … CC relates the two grades.]

RMN Semantics

[Figure: ground RMN: George, Jane, and Jill (Intelligence) take Geo101 and CS101 (Difficulty); the Geo Study Group and CS Study Group instantiate the template potential, coupling the Grades of students who study together.]

Outline

Relational Bayesian Networks
Relational Markov Networks
Collective Classification*
  Discriminative training
  Web page classification
  Link prediction
Relational clustering

Relational clustering

* with Ben Taskar, Carlos Guestrin, Ming Fai Wong, Pieter Abbeel

Model Structure

[Figure: workflow: training data plus the model structure are input to learning, which produces a probabilistic relational model (Course, Student, Reg); inference on new data then yields conclusions.]

Collective Classification

Train on one year of student intelligence, course difficulty, and grades; given only grades in the following year, predict all students’ intelligence

Example: the training data has features $\mathbf{x}$ and labels $\mathbf{y}^*$; the new data has features $\mathbf{x}'$ and unobserved labels $\mathbf{y}'$

Learning RMN Parameters

[Figure: the study-group template from before, with its template potential over grade pairs AA … CC.]

Parameterize potentials as log-linear model

$$P_{\mathbf{w}}(x) = \frac{1}{Z}\exp\big(\mathbf{w}^\top \mathbf{f}(x)\big)$$

$$\phi(R_1.\mathit{Grade}, R_2.\mathit{Grade}) = \exp\big(w_{AA}\,f_{AA} + \dots + w_{CC}\,f_{CC}\big)$$
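A minimal sketch of this log-linear parameterization (the weights are illustrative assumptions): each potential entry is the exponential of a weighted sum of indicator features that fire on the observed grade pair:

```python
import math

# One weight per grade-pair feature; unlisted pairs default to weight 0.
w = {("A", "A"): 1.2, ("A", "B"): 0.3, ("B", "B"): 0.8, ("C", "C"): 1.0}

def phi(g1, g2):
    """phi(Reg1.Grade, Reg2.Grade) = exp(sum_k w_k * f_k), where the
    indicator feature f_k fires exactly on the observed grade pair."""
    return math.exp(w.get((g1, g2), 0.0))

print(phi("A", "A"), phi("A", "C"))
```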

Max Likelihood Estimation

Estimation: $\max_{\mathbf{w}} \log P_{\mathbf{w}}(\mathbf{y}^*, \mathbf{x})$, the joint likelihood of the training features $\mathbf{x}$ and labels $\mathbf{y}^*$

Classification: $\arg\max_{\mathbf{y}'} \log P_{\mathbf{w}}(\mathbf{y}' \mid \mathbf{x}')$

But we don’t care about the joint distribution $P(\mathbf{x}, \mathbf{y})$

Web KB

[Figure: WebKB example: Tom Mitchell (Professor), the WebKB Project, and Sean Slattery (Student), connected by Advisor-of, Project-of, and Member links.]

[Craven et al.]

Web Classification Experiments

WebKB dataset: four CS department websites; bag of words on each page; links between pages; anchor text for links

Experimental setup: trained on three universities, tested on the fourth; repeated for all four combinations

[Figure: example professor page with words such as “department”, “extract”, “information”, “computer”, “science”, “machine”, “learning”.]

Standard Classification

Categories: faculty, course, project, student, other

[Figure: flat model: each Page’s Category predicts its words Word1 … WordN.]

Standard Classification

[Figure: the same model with anchor-text features added: link words LinkWord1 … LinkWordN (e.g., “working with Tom Mitchell …”) also depend on the page’s Category.]

[Chart: test-set error of the baseline. 4-fold CV: trained on three universities, tested on the fourth.]

Discriminatively trained naïve Markov = logistic regression

Power of Context

[Figure: a page of unknown category: Professor? Student? Post-doc? The labels of linked pages provide context.]

Collective Classification

[Figure: collective model: for each Link, the Categories of the From-Page and To-Page are connected through a compatibility potential φ(From, To) over category pairs CC, CF, CP, CS, …, SS; each page keeps its word features Word1 … WordN.]
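A small sketch of how word features and link compatibility combine (all scores here are illustrative assumptions, not learned values); collective classification picks the jointly best labeling rather than labeling each page alone:

```python
# Log-potentials: how well each label fits each page's words (assumed).
word_score = {
    ("page1", "faculty"): 1.2, ("page1", "student"): 0.1,
    ("page2", "faculty"): 0.3, ("page2", "student"): 0.9,
}
links = [("page1", "page2")]
# Compatibility over (From, To) label pairs; unlisted pairs score 0.
compat = {("faculty", "student"): 0.8}

def joint_log_score(labels):
    s = sum(word_score[p, y] for p, y in labels.items())
    s += sum(compat.get((labels[u], labels[v]), 0.0) for u, v in links)
    return s

# Collective classification: choose the labeling with the best joint score.
candidates = [{"page1": a, "page2": b}
              for a in ("faculty", "student") for b in ("faculty", "student")]
best = max(candidates, key=joint_log_score)
print(best, joint_log_score(best))
```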

Collective Classification

Classify all pages collectively, maximizing the joint label probability

[Chart: test-set error, Logistic vs. Links model.]

[Taskar, Abbeel, K., 2002]

More Complex Structure

[Figure: richer templates: a Faculty page’s category C and words W1 … Wn, related to the categories S of pages in the Students and Courses sections.]

Collective Classification: Results

[Chart: test-set error for the Logistic, Links, Section, and Link+Section models.]

35.4% error reduction over logistic

[Taskar, Abbeel, K., 2002]

Max Conditional Likelihood

Estimation: $\max_{\mathbf{w}} \log P_{\mathbf{w}}(\mathbf{y}^* \mid \mathbf{x})$, where

$$P_{\mathbf{w}}(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_{\mathbf{w}}(\mathbf{x})} \exp\big(\mathbf{w}^\top \mathbf{f}(\mathbf{x}, \mathbf{y})\big), \qquad \log P_{\mathbf{w}}(\mathbf{y} \mid \mathbf{x}) = \mathbf{w}^\top \mathbf{f}(\mathbf{x}, \mathbf{y}) - \log Z_{\mathbf{w}}(\mathbf{x})$$

Classification: $\arg\max_{\mathbf{y}'} \log P_{\mathbf{w}}(\mathbf{y}' \mid \mathbf{x}') = \arg\max_{\mathbf{y}'} \mathbf{w}^\top \mathbf{f}(\mathbf{x}', \mathbf{y}')$

But we don’t care about the conditional distribution $P(\mathbf{y} \mid \mathbf{x})$ either

Max Margin Estimation

[Taskar, Guestrin, K., 2003] (see also [Collins, 2002; Hoffman 2003])

What we really want: correct class labels

Estimation: maximize the margin $\gamma$ subject to $\|\mathbf{w}\| = 1$ and

$$\mathbf{w}^\top \mathbf{f}(\mathbf{x}, \mathbf{y}^*) - \mathbf{w}^\top \mathbf{f}(\mathbf{x}, \mathbf{y}) \;\ge\; \gamma\,\Delta(\mathbf{y}^*, \mathbf{y}) \quad \text{for all } \mathbf{y},$$

where the margin scales with $\Delta(\mathbf{y}^*, \mathbf{y})$, the number of labeling mistakes in $\mathbf{y}$

Classification: $\arg\max_{\mathbf{y}} \mathbf{w}^\top \mathbf{f}(\mathbf{x}', \mathbf{y})$

A quadratic program, but with exponentially many constraints

Max Margin Markov Networks

We use the structure of the Markov network to give an equivalent formulation of the QP: exponential only in the tree width of the network; complexity = max-likelihood classification

Can solve approximately in networks where the induced width is too large (analogous to loopy belief propagation)

Can use kernel-based features! SVMs meet graphical models

[Taskar, Guestrin, K., 2003]

WebKB Revisited

[Chart: test error for logistic regression, likelihood-trained RMNs, and max-margin RMNs.]

16.1% relative reduction in error over conditional-likelihood RMNs

Predicting Relationships

Even more interesting: relationships between objects

[Figure: WebKB example again, now with relationships to predict: Advisor-of between Tom Mitchell (Professor) and Sean Slattery (Student), and Member links to the WebKB Project.]

Predicting Relations

[Chart: error of flat vs. collective models for relation prediction.]

Introduce an exists/type attribute for each potential link
Learn a discriminative model for this attribute
Collectively predict its value in the new world

[Figure: relation model: for each candidate link, the Exists/Type attribute depends on the Categories of the From-Page and To-Page (each with words Word1 … WordN) and on the link words LinkWord1 … LinkWordN.]

72.9% error reduction over flat

[Taskar, Wong, Abbeel, K., 2003]

Outline

Relational Bayesian Networks
Relational Markov Networks
Collective Classification
Relational clustering
  Movie data*
  Biological data†

* with Ben Taskar, Eran Segal

† with Eran Segal, Nir Friedman, Aviv Regev, Dana Pe’er, Haidong Wang, Micha Shapira, David Botstein

Model Structure

[Figure: workflow: unlabeled relational data is input to learning, which produces a probabilistic relational model (Course, Student, Reg).]

Relational Clustering

Given only students’ grades, cluster similar students

Example: clustering of instances

Learning w. Missing Data: EM

The EM algorithm applies essentially unchanged:
E-step computes expected sufficient statistics, aggregated over all objects in the class
M-step uses ML (or MAP) parameter estimation

Key difference: in general, the hidden variables are not independent, so computing the expected sufficient statistics requires inference over the entire network
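A minimal EM sketch for the student/course model (the data and initialization are my own toy assumptions, not from the talk). The hidden course Difficulty and student Intelligence variables are coupled through shared registrations, so the E-step must reason about the entire ground network at once; here it does so by brute-force enumeration, which is fine at this scale:

```python
import itertools
import random
from collections import defaultdict

courses, students = ["Geo101", "CS101"], ["George", "Jane"]
grades = {("George", "Geo101"): "A", ("George", "CS101"): "C",
          ("Jane", "CS101"): "A"}                    # observed registrations
p_diff = {"easy": 0.5, "hard": 0.5}                  # P(Difficulty), fixed
p_intel = {"low": 0.5, "high": 0.5}                  # P(Intelligence), fixed

random.seed(0)
def rand_dist():                                     # random CPD column A/B/C
    w = [random.random() for _ in range(3)]
    return dict(zip("ABC", (x / sum(w) for x in w)))
p_grade = {(d, i): rand_dist() for d in p_diff for i in p_intel}

for _ in range(25):
    # E-step: posterior over *joint* assignments to all hidden variables.
    post = {}
    for ds in itertools.product(p_diff, repeat=len(courses)):
        for hs in itertools.product(p_intel, repeat=len(students)):
            d, i = dict(zip(courses, ds)), dict(zip(students, hs))
            w = 1.0
            for c in courses:
                w *= p_diff[d[c]]
            for s in students:
                w *= p_intel[i[s]]
            for (s, c), g in grades.items():
                w *= p_grade[(d[c], i[s])][g]
            post[(ds, hs)] = w
    z = sum(post.values())
    # M-step: expected sufficient statistics for P(Grade | Diff, Intel),
    # aggregated over all registrations, then (smoothed) MAP re-estimation.
    counts = defaultdict(lambda: defaultdict(float))
    for (ds, hs), w in post.items():
        d, i = dict(zip(courses, ds)), dict(zip(students, hs))
        for (s, c), g in grades.items():
            counts[(d[c], i[s])][g] += w / z
    for key, cnt in counts.items():
        tot = sum(cnt.values())
        p_grade[key] = {g: (cnt[g] + 0.01) / (tot + 0.03) for g in "ABC"}

print({k: {g: round(p, 2) for g, p in v.items()} for k, v in p_grade.items()})
```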

Learning w. Missing Data: EM

[Figure: EM on the student/course model: the CPD P(Registration.Grade | Course.Difficulty, Student.Intelligence), shown as bar charts over grades A, B, C for each (easy/hard, low/high) parent combination, is re-estimated over successive iterations while the hidden course (easy/hard) and student (low/high) attributes are inferred.]

[Dempster et al. 77]

Movie Data

Internet Movie Database: http://www.imdb.com

[Figure: schema: Actor, Director, and Movie, with movie attributes Genres, Rating, Year, #Votes, MPAA Rating.]

Discovering Hidden Types

[Figure: a hidden Type attribute is added to each class (Actor, Director, Movie).]

Learn model using EM

[Taskar, Segal, K., 2001]

Directors (two discovered types):
  Steven Spielberg, Tim Burton, Tony Scott, James Cameron, John McTiernan, Joel Schumacher
  Alfred Hitchcock, Stanley Kubrick, David Lean, Milos Forman, Terry Gilliam, Francis Coppola

Actors (two discovered types):
  Anthony Hopkins, Robert De Niro, Tommy Lee Jones, Harvey Keitel, Morgan Freeman, Gary Oldman
  Sylvester Stallone, Bruce Willis, Harrison Ford, Steven Seagal, Kurt Russell, Kevin Costner, Jean-Claude Van Damme, Arnold Schwarzenegger

Movies (two discovered types):
  Wizard of Oz, Cinderella, Sound of Music, The Love Bug, Pollyanna, The Parent Trap, Mary Poppins, Swiss Family Robinson
  Terminator 2, Batman, Batman Forever, GoldenEye, Starship Troopers, Mission: Impossible, Hunt for Red October

Discovering Hidden Types

[Taskar, Segal, K., 2001]

Biology 101: Gene Expression

[Figure: central dogma: DNA (control and coding regions of Gene 1 and Gene 2) is transcribed into RNA and translated into protein; the Swi5 transcription factor binds a control region to regulate transcription.]

Cells express different subsets of their genes in different tissues and under different conditions

Gene Expression Microarrays

Measure mRNA level for all genes in one condition; hundreds of experiments; highly noisy

[Figure: genes × experiments matrix; entry (i, j) is the expression of gene i in experiment j, ranging from induced to repressed.]

Standard Analysis

Cluster genes by similarity of expression profiles; manually examine clusters to understand what’s common to the genes in each cluster

[Figure: clustering.]

General Approach

Expression level is a function of gene properties and experiment properties
Learn the model that best explains the data
  Observed properties: gene sequence, array condition, …
  Hidden properties: gene cluster

[Figure: the attributes of Gene i and of Experiment j jointly determine the expression Level of gene i in experiment j.]

What is learned: the assignment to hidden variables (e.g., module assignment), and the expression level as a function of the properties

[Figure: PRM with Gene.Cluster, Experiment.ID, and Expression.Level.]

Clustering as a PRM

[Figure: naïve Bayes PRM: each gene’s hidden cluster g.C (values 1, 2, 3) is the single parent of its expression levels g.E1, g.E2, …, g.Ek, with one CPD P(Ei.L | g.C) per experiment.]
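A compact sketch of this naïve Bayes clustering trained with EM, on synthetic data (all numbers are assumptions; for simplicity the per-experiment CPDs are unit-variance Gaussians with cluster-specific means):

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, k_exps, n_clusters = 200, 10, 3
# Synthetic expression matrix: genes drawn around cluster-specific means.
true_means = rng.normal(0, 2, size=(n_clusters, k_exps))
z = rng.integers(n_clusters, size=n_genes)
X = true_means[z] + rng.normal(0, 0.5, size=(n_genes, k_exps))

pi = np.full(n_clusters, 1 / n_clusters)          # P(g.C)
mu = rng.normal(0, 1, size=(n_clusters, k_exps))  # per-cluster CPD means

for _ in range(50):
    # E-step: responsibilities P(g.C = c | g.E1..Ek) for every gene.
    logp = -0.5 * ((X[:, None, :] - mu[None]) ** 2).sum(-1) + np.log(pi)
    logp -= logp.max(1, keepdims=True)            # stabilize before exp
    resp = np.exp(logp)
    resp /= resp.sum(1, keepdims=True)
    # M-step: re-estimate mixing weights and per-cluster means.
    nk = resp.sum(0)
    pi = nk / n_genes
    mu = (resp.T @ X) / nk[:, None]

print("learned cluster sizes:", np.round(nk).astype(int))
```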

Modular Regulation

Learn functional modules: clusters of genes that are similarly controlled
Learn a control program for each module: expression as a function of control genes

[Figure: regulation tree splitting on control genes HAP4 (true/false) and CMK1 (true/false).]

[Segal, Regev, Pe’er, Koller, Friedman, 2003]

Module Network PRM

[Figure: the expression Level of a gene depends on its module (cluster) and on the activity levels Control1, Control2, …, Controlk of the control genes in the experiment; each module has a decision tree over control genes, e.g., Cluster 1 splits on BMH1, Yer184c, GIC2, USV1, FAR1 and Cluster 2 on USV1, APG1 (true/false branches), with an expression distribution at each leaf. The activity level of a control gene in an experiment determines which branch is taken.]

Experimental Results

Yeast stress data (Gasch et al.): 2355 genes that showed activity; 173 experiments (microarrays) covering diverse environmental stress conditions (e.g., heat shock)

Learned a module network with 50 modules: cluster assignments are hidden variables; the structure of the dependency trees is unknown

Learned the model using the structural EM algorithm

Segal et al., Nature Genetics, 2003

Biological Evaluation

Find sets of co-regulated genes (regulatory modules): 46/50

Find the regulators of each module: 30/50

[Segal et al., Nature Genetics, 2003]

Experimental Results

Hypothesis: regulator ‘X’ regulates process ‘Y’
Experiment: knock out ‘X’ and rerun the experiment

[Figure: the module’s regulation tree (HAP4, CMK1) with regulator X knocked out.]

[Segal et al., Nature Genetics, 2003]

Differentially Expressed Genes

[Figure: expression time courses, wild type (wt) vs. knockout strain:]
Ypl230w (0–24 hrs., >16x): 341 differentially expressed genes
Ppt1 (0–60 min., >4x): 602 differentially expressed genes
Kin82 (0–60 min., >4x): 281 differentially expressed genes

[Segal et al., Nature Genetics, 2003]

Were the differentially expressed genes predicted as targets? Rank modules by enrichment for differentially expressed genes.

Ppt1:
  #14 Ribosomal and phosphate metabolism: 8/32, 9e-3
  #11 Amino acid and purine metabolism: 11/53, 1e-2
  #15 mRNA, rRNA and tRNA processing: 9/43, 2e-2
  #39 Protein folding: 6/23, 2e-2
  #30 Cell cycle: 7/30, 2e-2

Ypl230w:
  #39 Protein folding: 7/23, 1e-4
  #29 Cell differentiation: 6/41, 2e-2
  #5 Glycolysis and folding: 5/37, 4e-2
  #34 Mitochondrial and protein fate: 5/37, 4e-2

Kin82:
  #3 Energy and osmotic stress I: 8/31, 1e-4
  #2 Energy, osmolarity & cAMP signaling: 9/64, 6e-3
  #15 mRNA, rRNA and tRNA processing: 6/43, 2e-2

Biological Experiments Validation

All regulators regulate predicted modules

[Segal et al., Nature Genetics, 2003]

Biology 102: Pathways

Pathways are sets of genes that act together to achieve a common function

Finding Pathways: Attempt I

Use protein-protein interaction data

Problems: the data is very noisy, and structure is lost: there is one large connected component in the interaction graph (3527/3589 genes)

Finding Pathways: Attempt II

Use expression microarray clusters

[Figure: Pathway I and Pathway II as expression clusters.]

Problems: expression is only a ‘weak’ indicator of interaction; interacting pathways are not separable

Finding Pathways: Our Approach

Use both types of data to find pathways:
Find “active” interactions using gene expression
Find pathway-related co-expression using interactions

[Figure: Pathway I, Pathway II, Pathway III, Pathway IV.]

[Segal, Wang, K., 2003]

Probabilistic Model

[Segal, Wang, K., 2003]

[Figure: each gene g has a hidden Pathway assignment g.C and observed expression levels Exp1 … ExpN in the N arrays; an Interacts link between genes whose protein products interact introduces a compatibility potential φ(g1.C, g2.C) over the pair of pathway assignments.]

Cluster all genes collectively, maximizing the joint model likelihood
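A sketch of the collective objective (synthetic numbers, my own construction): a gene's pathway assignment must explain both its expression profile and its protein-product interactions, so genes are scored jointly rather than one at a time:

```python
# Log-likelihood of each gene's expression profile under each pathway
# (assumed values for illustration).
expr_loglik = {
    ("g1", 1): -1.0, ("g1", 2): -3.0,
    ("g2", 1): -1.5, ("g2", 2): -2.5,
    ("g3", 1): -4.0, ("g3", 2): -0.5,
}
interacts = [("g1", "g2"), ("g2", "g3")]

def log_compat(c1, c2):
    # Compatibility potential (log space): rewards assigning interacting
    # genes to the same pathway. The 0.7 bonus is an assumption.
    return 0.7 if c1 == c2 else 0.0

def joint_score(assign):
    s = sum(expr_loglik[g, c] for g, c in assign.items())
    s += sum(log_compat(assign[u], assign[v]) for u, v in interacts)
    return s

# Compare two collective assignments of genes to pathways.
print(joint_score({"g1": 1, "g2": 1, "g3": 2}))   # g2 grouped with g1
print(joint_score({"g1": 1, "g2": 2, "g3": 2}))   # g2 grouped with g3
```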

Capturing Protein Complexes

Independent data set of interacting proteins

[Chart: number of complexes vs. complex coverage (%), our method vs. standard expression clustering.]

124 complexes covered at 50% by our method; 46 complexes covered at 50% by clustering

[Segal, Wang, K., 2003]

RNAse Complex Pathway

[Figure: predicted pathway: YHR081W, RRP40, RRP42, MTR3, RRP45, RRP4, RRP43, DIS3, TRM7, SKI6, RRP46, CSL4, with their protein-protein interactions.]

Includes all 10 known pathway genes

Only 5 genes found by clustering

[Segal, Wang, K., 2003]

Interaction Clustering

The RNAse complex is found by interaction clustering only as part of a cluster with 138 genes

[Segal, Wang, K., 2003]

Truth in Advertising

Huge graphical models: 3,000–50,000 hidden variables; hundreds of thousands of observed nodes; very densely connected

Learning: multiple iterations of model updates, each requiring inference on the model

Inference: exact inference is intractable; we use belief propagation; a single inference iteration takes 1–6 hours; algorithmic ideas are key to scaling

Relational Data: A New Challenge

Data consists of different types of instances

Instances are related in complex networks

Instances are not independent

New tasks for machine learning: collective classification, relational clustering, link prediction, group detection

Opportunity

http://robotics.stanford.edu/~koller/