22
A Statistical Approach to Literature-based Gene Group Annotation Xin He 01/31/2007

A Statistical Approach to Literature-based Gene Group Annotation

Embed Size (px)

DESCRIPTION

A Statistical Approach to Literature-based Gene Group Annotation. Xin He 01/31/2007. Annotating a Gene List. Goal: understand the commonalities in a list of genes from: Clustering by expression patterns Differentially expressed genes Genes sharing cis-regulatory elements - PowerPoint PPT Presentation

Citation preview

Page 1: A Statistical Approach to Literature-based Gene Group Annotation

A Statistical Approach to Literature-based Gene Group

Annotation

Xin He

01/31/2007

Page 2: A Statistical Approach to Literature-based Gene Group Annotation

Annotating a Gene List

• Goal: understand the commonalities in a list of genes from:– Clustering by expression patterns– Differentially expressed genes – Genes sharing cis-regulatory elements

• How to automatically construct the annotations?

Page 3: A Statistical Approach to Literature-based Gene Group Annotation

Gene Ontology-based Approach

• Each gene is annotated by a set of GO terms

• The importance of any term wrt the gene list is measured by the number of genes that are associated with this term

• Need to correct for the uneven distribution of GO terms: a hypergeometric test

Page 4: A Statistical Approach to Literature-based Gene Group Annotation

Limitations of GO-based Approach

• GO annotations of all genes: may not be available

• Rapid growth of literature: constantly add new functions to existing genes

• Coverage is not even in all areas. E.g. ecology and behavior; medicine; anatomy and physiology; etc.

Page 5: A Statistical Approach to Literature-based Gene Group Annotation

Literature-based Approach

• Annotate via the analysis of text extracted from literature

• Advantages:– Not dependent on manually created data– Easy to keep up with the recent discoveries– Broad coverage

• Explorative tool: suggest new hypothesis

Page 6: A Statistical Approach to Literature-based Gene Group Annotation

Literature-based Approach

• Extract abstracts for each gene• Idea: If a word is overrepresented in the

abstracts for the list, then it is likely to describe the common functions of the list

• A simple measure of significance: Z-score = (observed count – expected count under background distribution) / standard deviation

Page 7: A Statistical Approach to Literature-based Gene Group Annotation

Motivations for the Current Work

• Drawbacks of the existing approach– Biased textual representation: genes that are

well-studied will dominate the results

• Motivations for a new approach:– Need to capture overrepresentation of words– Favor words that are common to all or most

genes– A unified way to solve both problems?

Page 8: A Statistical Approach to Literature-based Gene Group Annotation

Ideas for the Statistical Model

• Observation: typically, some genes in the list are related to a given word, but the other genes are not (Few gene clusters are perfect!)

• Assumption: the count of a term in a document follows Poisson distribution

• Idea: the count of a term in one gene is either from a background distribution (if the gene is unrelated) or from a positive distribution (if the gene is related)

Page 9: A Statistical Approach to Literature-based Gene Group Annotation

Poisson Mixture Model

1 2

1 2

0

... : the size of documents of genes 1 to n

... : counts of the word in documents 1 to n

: the rate of the background and positive Poisson distributions

: the mixing weight of the

n

n

d ,d d

x ,x x

, positive Poisson

])|()1()|([

)|(|(

10

1

n

iiiii

n

i0ii0

dxPdxP

,,,dxP),,,P

dx

)|()1()|(|( 0 iiii0ii dxPdxP ),,,dxP

Each count is generated from a mixture of the background and signal Poisson distributions:

The probability of observing the data is thus:

Notations:

Page 10: A Statistical Approach to Literature-based Gene Group Annotation

Parameter EstimationMaximum likelihood estimation of parameters:

),,,P, 0

dx |(maxarg)ˆˆ(0,10

EM algorithm to maximize the likelihood function. The updating formula is given by:

i(t)(t)

0ii

n

iii

(t)(t)0ii

n

ii

t

(t)(t)0ii

n

ii

t

d,,,d,xzPx,,,d,xzP

n,,,d,xzP

)|1(/)|1(

/)|1(

11

)1(

1

)1(

The posterior probability of missing label (zi) is:

)|()1()|((

)|()|1(

0)()()(

)()(

iit

it

it

it

it

(t)(t)0iii dxPdxP

dxP,,,d,xzP

Page 11: A Statistical Approach to Literature-based Gene Group Annotation

Evaluating the Statistical Significance

• The candidate terms: those with a large theta (an estimate of the proportion of related genes)

• Need to assess the significance– E.g. a term from a distribution slightly different from

the background. EM may estimate lambda to be this distribution and theta close to 1

• Idea: if the counts can be explained well by the background, then there is no need to use a mixture of two distributions. This word would be insignificant regardless of the estimated theta

Page 12: A Statistical Approach to Literature-based Gene Group Annotation

Likelihood Ratio Test

Hypothesis testing:

mixture thefrom generated are counts observed the0, :H

background thefrom generated are counts observed the0, :H

1

0

Generalized Likelihood Ratio test Statistic (LRS):

n

i iiii

ii

0

0

dxPdxP

dxP

,,,P

,PT

1 0

0

)|()ˆ1()ˆ|(ˆ)|(

log2

)ˆˆ|(

)|(log2

dx

dx

Reject H0 if T is greater than a certain threshold.

Page 13: A Statistical Approach to Literature-based Gene Group Annotation

Asymptotic Distribution of LRS

• It is well known that the distribution of LRS converges to chi-square, with degree of freedom equal to the difference between the number of free parameters of null and alternative hypothesis

• Bonferroni correction: significance level (0.05) divided by the number of hypothesis tested (number of terms)

Page 14: A Statistical Approach to Literature-based Gene Group Annotation

Implementation• Computational framework

– Medline collections: yeast and fly– Indexing, retrieval, analysis by Indri 2.2

• Background distribution of terms– Organism-specific distribution

• Phrase construction: bigrams only– NSP (Ngram statistical package) for bigram selection: chi-square

measure• Procedure

– Retrieval of articles about the given list of genes– Collection of counts of all words and significant bigrams– Statistical analysis by Poission mixture model– Presentation of results: (1) sorted by statistical significance; (2)

remove concepts whose fractions (theta) are below a threshold

Page 15: A Statistical Approach to Literature-based Gene Group Annotation

Web Interface

• Web link:http://beespace.igb.uiuc.edu:8080/BeeSpace/

Annot.jsp

• User interface– Choose collection: MedYeast and MedFly– Input genes: textual names rather than

identifiers

Page 16: A Statistical Approach to Literature-based Gene Group Annotation

Experimental Validation

• Methods compared– StatAnnot– vs. term overpresentation method: GEISHA– vs. GO enrichment analysis: FuncAssociate

http://llama.med.harvard.edu/cgi/func/funcassociate

• Evaluation– Concepts that are common to both methods– Concepts that are unique to the method compared– Concepts that are unique to StatAnnot– Genes that are found by StatAnnot

Page 17: A Statistical Approach to Literature-based Gene Group Annotation

Comparison with other Literature-based Method

• Gene List: Eisen K cluster (15 genes)– Mainly respiratory chain complex (13), one

mitochondrial membrane pore (por1)– However, por1 matches many more articles than other

genes, thus dominates the results, eg. voltage-dependent channel

• Results:– GESHIA: voltage-dependent, outer member, pores,

channel, succinate– StatAnnot: electron_transport, electron_transfer,

respiratory, cytochrome_c1, mitochondrial_membrane, succinate_dehydrogenase

Page 18: A Statistical Approach to Literature-based Gene Group Annotation

Comparison with GO-based Method

• Gene list: Eisen B cluster (11 genes)• Results

– cell_cycle, mitosis, s_phase, microtubule, cytokinesi, cytoskeletal_protein, spindle_pole, pole_body, mitotic_spindle, spindle_apparatus, spindle_checkpoint

– contractile ring, actomyosin ring, cell cortex, microtubule nucleation, microtubule organizing center, M phase, bud site selection, phosphatidylinositol binding, septin ring

– anaphase, anaphase_promote, ubiquitin, ligase, bud_neck, metaphase, proteolysis

– cyclin_b, cdc3, cdc28, cdc12, cdc15, cdc2

Page 19: A Statistical Approach to Literature-based Gene Group Annotation

Comparison with GO-based Method

• Gene List: cAMPDown001 cluster– Originally 59 genes– 16 genes have fly orthologs that match at least one article

• Results– heat_shock, molecular_chaperone, stress (response to stress)– response to temperature, unfolded protein binding, response to

biotic stimulus– channel/ion_channel/channel_gate,

kinase/threonine_kinase/serine_threonine, gtpase, ethanol, immunophilin, tetratricopeptide_repeat, geldanamycin, tacrolimu_bind, co_chaperone/cochaperon, steroid, quinone, cyclophilin

– genes: hsp90, hsc70, hsp70, hsp40, cyp40, fkbp52, mok,

Page 20: A Statistical Approach to Literature-based Gene Group Annotation

Conclusion from Experiments

• Solved the biased textual representation problem of the earlier literature-based method

• In general, the new method is able to cover a large proportion of terms from GO enrichment analysis

• Supplement with additional biological concepts, including many related genes

• May be particularly useful for studying aspects not focused in GO, such as medicine

Page 21: A Statistical Approach to Literature-based Gene Group Annotation

Future Plan (I)

• Phrases– N-gram, N > 2– Syntactic phrase construction– Removal of redundant concepts in the results

• Retrieval for genes– Gene name mapping to database entries– Synonym expansion– Gene name disambiguition

• Entity recognition in the resulting concepts– Gene names, chemicals, organisms

Page 22: A Statistical Approach to Literature-based Gene Group Annotation

Future Plan (II)

• Sentence selection– Informative sentences summarizing the concepts

(theme retrieval in Beespace)

• Clustering of concepts– Ex. Distance measure of terms based on co-occurrence

• A general tool to summarize a group of entities? For example:– Common aspects of a set of diseases– Genes commonly involved in multiple developmental

stages