Automated Gene Classification using Nonnegative Matrix Factorization on Biomedical Literature
Kevin Heinrich
PhD Dissertation Defense
Department of Computer Science, UTK
March 19, 2007

Heinrich Presentation - Michael W. Berry



Kevin Heinrich: Using NMF for Gene Classification


What Is The Problem?

Understanding functional gene relationships requires expert knowledge.

Gene sequence analysis does not necessarily imply function.

Gene structure analysis is difficult.

Issue of scale.

Biologists know a small subset of genes. There are thousands of genes.

Time & Money.


Defining Functional Gene Relationships

Direct Relationships.

Known gene relationships (e.g., A-B). Based on term co-occurrence.¹

Indirect Relationships.

Unknown gene relationships (e.g., A-C). Based on semantic structure.

¹ Jenssen et al., Nature Genetics, 28:21, 2001.


Semantic Gene Organizer

Gene information is compiled in human-curated databases.

Medical Literature, Analysis, and Retrieval System Online (MEDLINE)
EntrezGene (LocusLink)
Medical Subject Headings (MeSH)
Gene Ontology (GO)

Gene documents are formed by taking titles and abstracts from MEDLINE citations cross-referenced in the Mouse, Rat, and Human EntrezGene entries for that gene.

Examines literature (phenotype) instead of genotype.

Can be used as a guide for future gene exploration.


Vector Space Model

Gene documents are parsed into tokens.

Tokens are assigned a weight, w_ij, for the i-th token in the j-th document.

An m × n term-by-document matrix A = [w_ij] is created.

Genes are m-dimensional vectors; tokens are n-dimensional vectors.
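As a concrete sketch, a raw-count term-by-document matrix can be assembled from tokenized documents as below. The toy documents are hypothetical; the actual gene documents are MEDLINE titles and abstracts.

```python
import numpy as np

def term_document_matrix(docs):
    """Build an m x n term-by-document matrix of raw counts.

    docs: list of token lists, one per gene document (toy data here).
    Rows are terms, columns are documents.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for tok in doc:
            A[index[tok], j] += 1
    return vocab, A

docs = [["gene", "tumor", "tumor"], ["gene", "memory"]]
vocab, A = term_document_matrix(docs)
# vocab = ['gene', 'memory', 'tumor']; A is 3 x 2, with A[2, 0] = 2
```

In practice the counts would then be reweighted (e.g., by the log-entropy scheme described later) rather than used raw.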


Term-by-Document Matrix

       d1    d2    d3   ...   dn
t1    w11   w12   w13   ...  w1n
t2    w21   w22   w23   ...  w2n
t3    w31   w32   w33   ...  w3n
t4    w41   w42   w43   ...  w4n
...                     ...
tm    wm1   wm2   wm3   ...  wmn

Typically, a term-document matrix is sparse and unstructured.


Weighting Schemes

Term weights are the product of a local component, a global component, and a document normalization factor.

w_ij = l_ij g_i d_j

The log-entropy weighting scheme is used, where

l_ij = log2(1 + f_ij)

g_i = 1 + ( Σ_j p_ij log2 p_ij ) / log2 n,   p_ij = f_ij / Σ_j f_ij
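A minimal sketch of the log-entropy weights above, assuming F is the m × n matrix of raw term frequencies f_ij and (as is common for log-entropy) the document normalization factor d_j is 1:

```python
import numpy as np

def log_entropy_weights(F):
    """Log-entropy term weights from raw frequencies F (m terms x n docs).

    l_ij = log2(1 + f_ij)
    g_i  = 1 + sum_j(p_ij log2 p_ij) / log2 n, with p_ij = f_ij / sum_j f_ij
    """
    n = F.shape[1]
    L = np.log2(1.0 + F)
    row_sums = F.sum(axis=1, keepdims=True)
    P = np.divide(F, row_sums, out=np.zeros_like(F, dtype=float),
                  where=row_sums > 0)
    # 0 * log2(0) is taken as 0
    plogp = np.where(P > 0, P * np.log2(np.where(P > 0, P, 1.0)), 0.0)
    g = 1.0 + plogp.sum(axis=1) / np.log2(n)
    return L * g[:, None]

F = np.array([[1.0, 1.0, 1.0, 1.0],   # term spread evenly across docs
              [4.0, 0.0, 0.0, 0.0]])  # term concentrated in one doc
W = log_entropy_weights(F)
# row 0 gets global weight 0 (uninformative); row 1 gets global weight 1
```

The evenly spread term is driven to zero weight, which is exactly the discriminating behavior the entropy factor is meant to provide.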


Latent Semantic Indexing (LSI)

LSI performs a truncated singular value decomposition (SVD) of A, factoring it into three matrices

A = UΣV T

U is the m × r matrix of eigenvectors of AA^T

V^T is the r × n matrix of eigenvectors of A^T A

Σ is the r × r diagonal matrix of the r nonnegative singular values of A

r is the rank of A


SVD Properties

A rank-k approximation is generated by truncating to the first k columns of each factor matrix, i.e., A_k = U_k Σ_k V_k^T

A_k is the closest of all rank-k approximations in the Frobenius norm, i.e., ‖A − A_k‖_F ≤ ‖A − B‖_F for any rank-k matrix B
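The rank-k truncation and the Eckart-Young optimality property can be checked numerically on a toy matrix (random data assumed here purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((8, 6))  # toy term-by-document matrix

# Truncated SVD: keep the first k singular triplets
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
Ak = (U[:, :k] * s[:k]) @ Vt[:k, :]  # A_k = U_k S_k V_k^T

# Eckart-Young: the Frobenius error of the best rank-k approximation
# equals the energy of the discarded singular values
err = np.linalg.norm(A - Ak, "fro")
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
```

Any other rank-k matrix B would give a Frobenius error at least as large as `err`.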


SVD Querying

Document-to-Document Similarity

A_k^T A_k = (V_k Σ_k)(V_k Σ_k)^T

Term-to-Term Similarity

A_k A_k^T = (U_k Σ_k)(U_k Σ_k)^T

Document-to-Term Similarity

A_k = U_k Σ_k V_k^T


Advantages of LSI

A is sparse; the factor matrices are dense. This improves recall for concept-based matching.

Scaled document vectors can be computed once and stored for quick retrieval.

Components of the factor matrices represent concepts.

Decreasing the number of dimensions compares documents in a broader sense and achieves better compression.

Similar word-usage patterns get mapped to the same geometric space.

Genes are compared at a concept level rather than a simple term co-occurrence level, resulting in vocabulary-independent comparisons.


Presentation of Results

Problem:

Biologists are familiar with interpreting trees.

LSI produces ranked lists of related terms/documents.

Solution:

Generate pairwise distance data, i.e., 1 − cos θ_ij

Apply distance-based tree-building algorithm

Fitch: O(n^4)
NJ (Neighbor Joining): O(n^3)
FastME: O(n^2)
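The pairwise distances fed to these tree builders can be computed directly from the document (gene) vectors; a small sketch with toy column vectors:

```python
import numpy as np

def cosine_distance_matrix(A):
    """Pairwise distances 1 - cos(theta_ij) between the columns of A."""
    norms = np.linalg.norm(A, axis=0)
    cos = (A.T @ A) / np.outer(norms, norms)
    return 1.0 - np.clip(cos, -1.0, 1.0)  # clip guards rounding noise

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 0.0, 3.0]])  # columns: three toy gene vectors
D = cosine_distance_matrix(A)
# D[0,1] ~ 0 (parallel vectors), D[0,2] ~ 1 (orthogonal vectors)
```

The resulting symmetric matrix D, with zero diagonal, is exactly the input format a distance-based algorithm such as neighbor joining expects.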


Defining Functional Gene Relationships on Test Data


“Problems” with LSI

Initial term weights are nonnegative; the SVD introduces negative components.

Dimensions of the factored space do not have an immediate interpretation.

Want the advantages of a factored/reduced-dimension space, but also want to interpret dimensions for clustering/labeling trees.

Issue of scale: small collections are understood better than huge collections.


Defining Functional Gene Relationships

Direct Relationships.

Known gene relationships (e.g., A-B). Based on term co-occurrence.²

Indirect Relationships.

Unknown gene relationships (e.g., A-C). Based on semantic structure.

Label Relationships (e.g. x & y).

² Jenssen et al., Nature Genetics, 28:21, 2001.

[Diagram: genes A, B, and C; a direct relationship A-B, an indirect relationship A-C, and label relationships x and y]

NMF Problem Definition

Given a nonnegative V, find W and H such that

V ≈ WH

W, H ≥ 0

W has size m × k

H has size k × n

W and H are not unique, i.e., V ≈ (WD)(D^-1 H) for any invertible matrix D where D and D^-1 are nonnegative.
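The non-uniqueness is easy to demonstrate numerically with a positive diagonal D (toy random factors assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 5, 4, 2
W = rng.random((m, k))
H = rng.random((k, n))

# A positive diagonal D rescales the factors without changing the product:
# WH = (W D)(D^-1 H), and both rescaled factors stay nonnegative.
d = np.array([2.0, 0.5])       # diagonal of D
W2 = W * d                     # W D  (scales columns of W)
H2 = H / d[:, None]            # D^-1 H  (scales rows of H)

assert np.allclose(W @ H, W2 @ H2)
```

So any fitted (W, H) pair represents a whole family of equivalent factorizations, which is why later slides discuss smoothness constraints to encourage uniqueness.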


NMF Interpretation

V ≈ WH

Columns of W are k "feature" or "basis" vectors; they represent semantic concepts.

Columns of H are linear combinations of feature vectors that approximate the corresponding columns of V.

The choice of k determines the accuracy and quality of the basis vectors.

Ultimately produces a "parts-based" representation of the original space.

[Figure: V ≈ WH, with W of size m × k and H of size k × n. A sample column of H with coefficients 3, 0.18, 2.29, 0.7 weights the feature vectors; the dominant terms of one feature vector are cerebrovascular, disturbance, microcephaly, spectroscopy, neuromuscular]

Euclidean Distance (Cost Function)

E(W, H) = ‖V − WH‖_F^2 = Σ_{i,j} (V_ij − (WH)_ij)^2

Minimize E(W, H) subject to W, H ≥ 0.

E(W, H) ≥ 0.

E(W, H) = 0 if and only if V = WH.

‖V − WH‖ is convex in W or H separately, but not in both simultaneously.

There is no guarantee of finding the global minimum.


Initialization Methods

Since NMF is an iterative algorithm, W and H must be initialized.

Random positive entries.

Structured initialization typically speeds convergence.

Run k-means on V.
Choose a representative vector from each cluster to form W and H.

Most methods do not provide a static starting point.


Non-Negative Double SVD

NNDSVD is one way to provide a static starting point.³

Observe A_k = Σ_{j=1}^{k} σ_j u_j v_j^T, i.e., a sum of rank-1 matrices.

For each j:

Compute C = u_j v_j^T
Set to 0 all negative elements of C
Compute the maximum singular triplet of C, i.e., [u, s, v]
Set the j-th column of W to u and the j-th row of H to σ_j s v

Resulting W and H are influenced by SVD.
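A sketch of the per-triplet procedure above (following the slide's outline, not necessarily every refinement of the published NNDSVD algorithm):

```python
import numpy as np

def nndsvd_init(A, k):
    """NNDSVD-style initialization for NMF, per the outline above.

    For each of the first k singular triplets of A: form the rank-1
    matrix u_j v_j^T, zero its negative elements, take the resulting
    matrix's maximum singular triplet, and store it in W and H.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    m, n = A.shape
    W = np.zeros((m, k))
    H = np.zeros((k, n))
    for j in range(k):
        C = np.outer(U[:, j], Vt[j, :])
        C[C < 0] = 0.0                       # zero all negative elements
        u2, s2, v2t = np.linalg.svd(C)
        u, sv, v = u2[:, 0], s2[0], v2t[0, :]
        if u.sum() < 0:                      # resolve the SVD sign ambiguity
            u, v = -u, -v
        W[:, j] = u
        H[j, :] = s[j] * sv * v
    return W, H

A = np.random.default_rng(2).random((6, 5))  # toy nonnegative matrix
W0, H0 = nndsvd_init(A, 3)                   # nonnegative, deterministic start
```

Because C is nonnegative after clipping, its leading singular vectors can be taken entrywise nonnegative, so W and H are valid NMF starting points.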

³ Boutsidis & Gallopoulos, Tech Report, 2005.


NNDSVD Variations

Zero elements remain “locked” during MM update.

NNDSVDz keeps zero elements.

NNDSVDe assigns ε = 10−9 to zero elements.

NNDSVDa assigns average value of A to zero elements.


Update Rules

Update rules should

decrease the approximation error.

maintain nonnegativity constraints.

maintain other constraints imposed by the application (smoothness/sparsity).


Multiplicative Method (MM)

H_cj ← H_cj (W^T V)_cj / ((W^T W H)_cj + ε)

W_ic ← W_ic (V H^T)_ic / ((W H H^T)_ic + ε)

ε ensures numerical stability.

Lee and Seung proved the MM updates are non-increasing under the Euclidean cost function.

Most implementations update H and W “simultaneously.”
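The MM iteration above is only a few lines of dense linear algebra; a minimal sketch (random initialization, toy data, fixed iteration count rather than a convergence test):

```python
import numpy as np

def nmf_mm(V, k, iters=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for min ||V - WH||_F^2, W, H >= 0."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k))   # random positive initialization
    H = rng.random((k, V.shape[1]))
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H with current W
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # then update W
    return W, H

V = np.random.default_rng(0).random((10, 8))
W, H = nmf_mm(V, k=4)
# W and H stay nonnegative because each update multiplies by a
# nonnegative ratio; eps keeps the denominators away from zero.
```

Note the zero-locking behavior mentioned on the NNDSVD slide follows directly from this form: a zero entry multiplied by any ratio stays zero.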


Other Objective Functions

‖V − WH‖_F^2 + α J_1(W) + β J_2(H)

α and β are parameters that control the strength of the additional constraints.


Smoothing Update Rules

For example, set J_2(H) = ‖H‖_F^2 to enforce smoothness on H and try to force uniqueness on W.⁴

H_cj ← H_cj [(W^T V)_cj − β H_cj] / ((W^T W H)_cj + ε)

W_ic ← W_ic [(V H^T)_ic − α W_ic] / ((W H H^T)_ic + ε)

⁴ Piper et al., AMOS, 2004.

Page 35: Heinrich Presentation - Michael W. Berry

Sparsity

Hoyer defined sparsity as

sparseness(x) = (√n − (Σ|x_i|) / √(Σ x_i^2)) / (√n − 1)

Zero if and only if all components have the same magnitude.

One if and only if x contains exactly one nonzero component.
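Hoyer's measure translates directly into code, and the two boundary cases above are easy to verify:

```python
import numpy as np

def sparseness(x):
    """Hoyer's sparseness: 0 for uniform magnitudes, 1 for one nonzero."""
    x = np.asarray(x, dtype=float).ravel()
    n = x.size
    l1 = np.abs(x).sum()                 # L1 norm
    l2 = np.sqrt(np.sum(x ** 2))         # L2 norm
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0)

s_uniform = sparseness(np.ones(9))            # all equal -> 0.0
s_onehot = sparseness(np.array([0, 0, 5.0]))  # one nonzero -> 1.0
```

The measure is scale-invariant (it depends only on the L1/L2 ratio), which is what lets a sparsity target be imposed on rows of H or columns of W independently of their magnitudes.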


Visual Interpretation

Sparseness constraints are explicitly set and built into the update algorithm.

Can be applied to W , H, or both.

Dominant features are (hopefully) preserved.

[Figure: example factorizations at sparseness levels 0.9, 0.7, 0.3, and 0.1]

Sparsity Update Rules with MM

Pauca et al. implemented sparsity within MM as

H_cj ← H_cj [(W^T V)_cj − β (c_1 H_cj + c_2 E_cj)] / ((W^T W H)_cj + ε)

c_1 = ω^2 − ω ‖H‖_1 / (2‖H‖_2)

c_2 = ‖H‖_1 − ω ‖H‖_2

ω = √(kn) − (√(kn) − 1) · sparseness(H)


Benefits of NMF

Automated cluster labeling (& clustering)

Synonym generation (features)

Possible automated ontology creation

Labeling can be applied to any hierarchy


Comparison of SVD vs. NMF

                                  SVD    NMF
Solution Accuracy                  A      B
Uniqueness                         A      C
Convergence                        A      C-
Querying                           A      C+
Interpretability of Parameters     A      C
Interpretability of Elements       D      A
Sparseness                         D-     B+
Storage                            B-     A


Labeling Algorithm

Given a hierarchical tree and a weighted list of terms associated with each gene, assign labels to each internal node:

Mark each gene as labeled.

For each pair of labeled sibling nodes:

Add all terms to the parent's list.
Keep the top t terms.


Example (t = 7):

Gene1: tumor 0.95, cancer 0.8, inherited 0.6, genetic 0.5, disorder 0.35
Gene2: Alzheimer 0.9, memory 0.8, disorder 0.6, genetic 0.5, inherited 0.45
Parent: inherited 1.05, genetic 1, disorder 0.95, tumor 0.95, Alzheimer 0.9, memory 0.8, cancer 0.8
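The bottom-up merge can be sketched recursively; the worked parent list above is consistent with summing the weights of terms shared by siblings, so that is the merge rule assumed here:

```python
from collections import Counter

def label_tree(node, t=7):
    """Propagate term labels up a binary tree (sketch of the algorithm above).

    A leaf is a dict of term weights for one gene; an internal node is a
    (left, right) pair. Sibling lists are merged by summing weights and
    truncated to the top t terms.
    """
    if isinstance(node, dict):
        return Counter(node)
    left, right = node
    merged = label_tree(left, t) + label_tree(right, t)  # Counter '+' sums weights
    return Counter(dict(merged.most_common(t)))

gene1 = {"tumor": 0.95, "cancer": 0.8, "inherited": 0.6,
         "genetic": 0.5, "disorder": 0.35}
gene2 = {"Alzheimer": 0.9, "memory": 0.8, "disorder": 0.6,
         "genetic": 0.5, "inherited": 0.45}
parent = label_tree((gene1, gene2))
# parent: inherited ~1.05, genetic ~1.0, disorder ~0.95, ...
```

Because the recursion only consumes each node's top-t list, the same procedure applies unchanged to deeper hierarchies.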


Calculating Initial Term Weights

Three different methods to calculate initial term weights:

Assign the global weight associated with each term to each document.

Calculate document-to-term similarity (LSI).

For NMF, for each document j:

Determine the top feature i (by examining H).
Assign dominant terms from feature vector i to document j, scaled by the coefficient from H.
(Can be extended/thresholded to assign more features.)


MeSH Labeling

Many studies validate via inspection.

For automated validation, need to generate a "correct" tree labeling.

Take advantage of expert opinions, i.e., indexers from MeSH.
Create a MeSH meta-document for each gene, i.e., a list of MeSH headings.
Label the hierarchy using global weights of the meta-collection.


Recall

From traditional IR, recall is the ratio of relevant returned documents to all relevant documents.

Extending this to trees, recall can be found at each node.

Averaging across each tree depth level produces average recall.

Averaging the average recalls across all levels produces mean average recall (MAR).
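The two-stage averaging reduces to a few lines; the input format below (per-node recall values grouped by depth level) is an assumption for illustration:

```python
def mean_average_recall(node_recalls):
    """Mean average recall from per-node recalls grouped by tree depth.

    node_recalls: dict mapping depth level -> list of per-node recall
    values (hypothetical format). Average within each level, then
    average the level averages.
    """
    level_avgs = [sum(vals) / len(vals) for vals in node_recalls.values()]
    return sum(level_avgs) / len(level_avgs)

mar = mean_average_recall({1: [1.0, 0.5], 2: [0.25, 0.75, 0.5]})
# level averages 0.75 and 0.5 -> MAR 0.625
```

Averaging per level first keeps deep levels, which contain many nodes, from dominating the summary score.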


Feature Vector Replacement

Unfortunately, the MeSH vocabulary is too restrictive, so nearly all runs produced near 0% recall.

Map NMF vocabulary to MeSH terminology.

For each document i:

Determine j, the highest coefficient from the i-th column of H.
Choose the top r MeSH headings from the corresponding MeSH meta-document.
Split each MeSH heading into tokens.
Add each token to the j-th column of W′, where the weight is the global MeSH header weight × the coefficient from H.

Result: feature vectors comprised solely of MeSH terms (MeSH feature vectors).

[Figure: Average Recall vs. Node Level, plotting achieved recall against the best possible recall]

Data Sets

Five collections were available:

50TG, test set of 50 genes

115IFN, set of 115 interferon genes

3 cerebellar datasets, 40 genes of unknown relationship


Constraints

Each set was run under the given constraints for k = 2, 4, 6, 8, 10, 15, 20, 25, 30:

no constraints

smoothing W with α = 0.1, 0.01, 0.001

smoothing H with β = 0.1, 0.01, 0.001

sparsifying W with α = 0.1, 0.01, 0.001 and sparseness = 0.1, 0.25, 0.5, 0.75, 0.9

sparsifying H with β = 0.1, 0.01, 0.001 and sparseness = 0.1, 0.25, 0.5, 0.75, 0.9


50TG Recall

[Figure: Average Recall vs. Node Level for 50TG. Best Overall 72%, Best First 83%, Best Second 76%, Best Third 83%]

115IFN Recall

[Figure: Average Recall vs. Node Level for 115IFN. Best Overall 41%, Best First 51%, Best Second 46%, Best Third 56%]

Math1 Recall

[Figure: Average Recall vs. Node Level for Math1. Best Overall 69%, Best First 84%, Best Second 77%, Best Third 83%]

Mea Recall

[Figure: Average Recall vs. Node Level for Mea. Best Overall 64%, Best First 72%, Best Second 66%, Best Third 91%]

Sey Recall

[Figure: Average Recall vs. Node Level for Sey. Best Overall 56%, Best First 71%, Best Second 71%, Best Third 78%]

Observations

Recall was more a function of initialization and the choice of k than of the constraint.

Often, the NMF run with the smallest approximation error did not produce the optimal labeling.

Overall, sparsity did not perform well.

Smoothing W achieved the best MAR in general.

Small parameter choices and larger values of k performed better.


Future Work

Lots of NMF algorithms, each with different strengths/weaknesses.

More complex labeling scheme.

Different weighting schemes.

Investigate hierarchical graphs.

Visualization?

Gold Standard?


SGO & CGx

This tool as implemented is called the Semantic Gene Organizer (SGO).

Much of the technology used in SGO is the basis for Computable Genomix, LLC.


SGO & CGx Screenshots


Acknowledgments

SGO: shad.cs.utk.edu/sgo
CGx: computablegenomix.com

Committee:

Michael W. Berry, chair

Ramin Homayouni (Memphis)

Jens Gregor

Mike Thomason

Paul Pauca, WFU
