A Statistical Approach to Literature-based Gene Group Annotation


Xin He

01/31/2007

Annotating a Gene List

• Goal: understand the commonalities in a list of genes obtained from:
  – Clustering by expression patterns
  – Differentially expressed genes
  – Genes sharing cis-regulatory elements

• How to automatically construct the annotations?

Gene Ontology-based Approach

• Each gene is annotated by a set of GO terms

• The importance of a term with respect to the gene list is measured by the number of genes in the list that are associated with this term

• Need to correct for the uneven distribution of GO terms: a hypergeometric test
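The hypergeometric correction above can be sketched as a one-sided enrichment test. This is a minimal illustration, not the implementation of any particular GO tool; the function name and the gene counts in the example are hypothetical.

```python
from math import comb

def hypergeom_enrichment_pvalue(population, annotated, list_size, hits):
    """P(X >= hits) when list_size genes are drawn without replacement
    from `population` genes, of which `annotated` carry the GO term."""
    upper = min(annotated, list_size)
    total = comb(population, list_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, list_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 6000 genes in the genome, 60 of them carry the term,
# and 8 of the 15 genes in the list carry it.
p = hypergeom_enrichment_pvalue(6000, 60, 15, 8)
print(f"enrichment p-value: {p:.3g}")
```

A rare term shared by 8 of 15 genes gets a very small p-value, whereas a term carried by most of the genome would not, which is the point of correcting for the uneven distribution of GO terms.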

Limitations of GO-based Approach

• GO annotations of all genes: may not be available

• Rapid growth of literature: new functions are constantly added to existing genes

• Coverage is uneven across areas, e.g. ecology and behavior, medicine, anatomy and physiology are less well covered

Literature-based Approach

• Annotate via the analysis of text extracted from literature

• Advantages:
  – Not dependent on manually created data
  – Easy to keep up with recent discoveries
  – Broad coverage

• Explorative tool: suggests new hypotheses

Literature-based Approach

• Extract abstracts for each gene

• Idea: if a word is overrepresented in the abstracts for the list, then it is likely to describe the common functions of the list

• A simple measure of significance: Z-score = (observed count – expected count under background distribution) / standard deviation
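As a rough illustration of the Z-score above: the slide does not say how the standard deviation is obtained, so this sketch assumes a binomial background in which every token of the pooled abstracts is the word with probability `background_prob`. The function name and the numbers are hypothetical.

```python
from math import sqrt

def z_score(observed, total_tokens, background_prob):
    """Z-score of a word count against a binomial background (an assumption)."""
    expected = total_tokens * background_prob
    std_dev = sqrt(total_tokens * background_prob * (1.0 - background_prob))
    return (observed - expected) / std_dev

# Hypothetical example: the word occurs 30 times in 10,000 pooled tokens,
# while its background frequency is 1 in 1,000 (expected count 10).
z = z_score(observed=30, total_tokens=10_000, background_prob=0.001)
print(f"z = {z:.2f}")
```

Because the counts are pooled over all abstracts of the list, a single gene with many abstracts can produce a large Z-score on its own.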

Motivations for the Current Work

• Drawbacks of the existing approach
  – Biased textual representation: genes that are well-studied will dominate the results

• Motivations for a new approach:
  – Need to capture overrepresentation of words
  – Favor words that are common to all or most genes
  – A unified way to solve both problems?

Ideas for the Statistical Model

• Observation: typically, some genes in the list are related to a given word, but the other genes are not (Few gene clusters are perfect!)

• Assumption: the count of a term in a document follows a Poisson distribution

• Idea: the count of a term in one gene is either from a background distribution (if the gene is unrelated) or from a positive distribution (if the gene is related)

Poisson Mixture Model

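Notations:

• $d_1, d_2, \ldots, d_n$: the sizes of the documents of genes 1 to n
• $x_1, x_2, \ldots, x_n$: the counts of the word in documents 1 to n
• $\lambda_0, \lambda$: the rates of the background and positive Poisson distributions
• $\theta$: the mixing weight of the positive Poisson

Each count is generated from a mixture of the background and signal Poisson distributions:

$$P(x_i \mid d_i, \lambda_0, \lambda, \theta) = \theta\, P(x_i \mid \lambda d_i) + (1-\theta)\, P(x_i \mid \lambda_0 d_i)$$

Here $P(x \mid \mu)$ denotes the Poisson probability of count $x$ with mean $\mu$.

The probability of observing the data is thus:

$$P(\mathbf{x} \mid \mathbf{d}, \lambda_0, \lambda, \theta) = \prod_{i=1}^{n} P(x_i \mid d_i, \lambda_0, \lambda, \theta) = \prod_{i=1}^{n} \left[\theta\, P(x_i \mid \lambda d_i) + (1-\theta)\, P(x_i \mid \lambda_0 d_i)\right]$$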
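The mixture likelihood for one word can be sketched as follows, assuming (as the EM updates later in the deck suggest) that the Poisson mean for gene i is the rate times the document size d_i. The function names and the data are hypothetical, and the plain `exp`/`log` arithmetic would need more care for very large counts.

```python
from math import exp, lgamma, log

def poisson_pmf(x, mean):
    """Poisson probability of count x with the given mean (mean > 0)."""
    return exp(x * log(mean) - mean - lgamma(x + 1))

def mixture_log_likelihood(counts, sizes, lam0, lam, theta):
    """Log-likelihood of one word's counts under the two-component mixture."""
    total = 0.0
    for x, d in zip(counts, sizes):
        positive = theta * poisson_pmf(x, lam * d)
        background = (1.0 - theta) * poisson_pmf(x, lam0 * d)
        total += log(positive + background)
    return total

# Hypothetical data: 10 genes, each with 1000 tokens of text; the first five
# genes use the word about ten times more often than the background rate.
counts = [12, 9, 11, 10, 8, 1, 0, 2, 1, 0]
sizes = [1000] * 10
lam0 = 0.001

ll_background = mixture_log_likelihood(counts, sizes, lam0, lam=0.01, theta=0.0)
ll_mixture = mixture_log_likelihood(counts, sizes, lam0, lam=0.01, theta=0.5)
print(ll_background, ll_mixture)
```

With theta = 0 the expression reduces to the pure background likelihood, which is what the significance test later compares against.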

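Parameter Estimation

Maximum likelihood estimation of parameters:

$$(\hat{\lambda}, \hat{\theta}) = \arg\max_{\lambda > 0,\; 0 \le \theta \le 1} P(\mathbf{x} \mid \mathbf{d}, \lambda_0, \lambda, \theta)$$

EM algorithm to maximize the likelihood function. The updating formula is given by:

$$\theta^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} P(z_i = 1 \mid x_i, d_i, \lambda_0, \lambda^{(t)}, \theta^{(t)})$$

$$\lambda^{(t+1)} = \frac{\sum_{i=1}^{n} P(z_i = 1 \mid x_i, d_i, \lambda_0, \lambda^{(t)}, \theta^{(t)})\, x_i}{\sum_{i=1}^{n} P(z_i = 1 \mid x_i, d_i, \lambda_0, \lambda^{(t)}, \theta^{(t)})\, d_i}$$

The posterior probability of the missing label ($z_i$) is:

$$P(z_i = 1 \mid x_i, d_i, \lambda_0, \lambda^{(t)}, \theta^{(t)}) = \frac{\theta^{(t)}\, P(x_i \mid \lambda^{(t)} d_i)}{\theta^{(t)}\, P(x_i \mid \lambda^{(t)} d_i) + (1-\theta^{(t)})\, P(x_i \mid \lambda_0 d_i)}$$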
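The EM iteration can be sketched in a few lines, assuming the background rate is fixed and the Poisson mean for gene i is rate times document size. The initial values, the fixed iteration count, and the data are illustrative choices, not taken from the slides.

```python
from math import exp, lgamma, log

def poisson_pmf(x, mean):
    """Poisson probability of count x with the given mean (mean > 0)."""
    return exp(x * log(mean) - mean - lgamma(x + 1))

def em_poisson_mixture(counts, sizes, lam0, theta=0.5, iters=200):
    """Estimate (lam, theta) for one word with the background rate lam0 fixed."""
    n = len(counts)
    lam = 2.0 * lam0  # hypothetical starting point above the background rate
    for _ in range(iters):
        # E-step: posterior probability that each gene is related to the word.
        post = []
        for x, d in zip(counts, sizes):
            positive = theta * poisson_pmf(x, lam * d)
            background = (1.0 - theta) * poisson_pmf(x, lam0 * d)
            post.append(positive / (positive + background))
        # M-step: re-estimate the mixing weight and the positive rate.
        weighted_sizes = sum(p * d for p, d in zip(post, sizes))
        if weighted_sizes == 0.0:
            break  # no gene is assigned to the positive component
        theta = sum(post) / n
        weighted_counts = sum(p * x for p, x in zip(post, counts))
        lam = max(weighted_counts / weighted_sizes, 1e-12)  # keep log() defined
    return lam, theta

# Hypothetical data: five related genes followed by five unrelated genes.
counts = [12, 9, 11, 10, 8, 1, 0, 2, 1, 0]
sizes = [1000] * 10
lam_hat, theta_hat = em_poisson_mixture(counts, sizes, lam0=0.001)
print(f"lambda = {lam_hat:.4f}, theta = {theta_hat:.2f}")
```

On this toy input the estimated theta should land near 0.5, matching the fraction of genes that actually use the word heavily.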

Evaluating the Statistical Significance

• The candidate terms: those with a large theta (an estimate of the proportion of related genes)

• Need to assess the significance
  – E.g. a term from a distribution slightly different from the background. EM may estimate lambda to be this distribution and theta close to 1

• Idea: if the counts can be explained well by the background, then there is no need to use a mixture of two distributions. This word would be insignificant regardless of the estimated theta

Likelihood Ratio Test

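Hypothesis testing:

$$H_0:\ \theta = 0,\ \text{the observed counts are generated from the background}$$

$$H_1:\ \theta > 0,\ \text{the observed counts are generated from the mixture}$$

Generalized Likelihood Ratio test Statistic (LRS):

$$T = -2 \log \frac{P(\mathbf{x} \mid \mathbf{d}, \lambda_0)}{P(\mathbf{x} \mid \mathbf{d}, \lambda_0, \hat{\lambda}, \hat{\theta})} = -2 \sum_{i=1}^{n} \log \frac{P(x_i \mid \lambda_0 d_i)}{\hat{\theta}\, P(x_i \mid \hat{\lambda} d_i) + (1-\hat{\theta})\, P(x_i \mid \lambda_0 d_i)}$$

Reject H0 if T is greater than a certain threshold.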
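Given fitted values of lambda and theta, the statistic T is a sum of per-gene log ratios. The sketch below assumes the estimates are already available; the fitted values and the data are hypothetical.

```python
from math import exp, lgamma, log

def poisson_pmf(x, mean):
    """Poisson probability of count x with the given mean (mean > 0)."""
    return exp(x * log(mean) - mean - lgamma(x + 1))

def likelihood_ratio_statistic(counts, sizes, lam0, lam_hat, theta_hat):
    """T = -2 * log( L(background only) / L(fitted mixture) )."""
    t = 0.0
    for x, d in zip(counts, sizes):
        background = poisson_pmf(x, lam0 * d)
        mixture = theta_hat * poisson_pmf(x, lam_hat * d) + (1.0 - theta_hat) * background
        t += -2.0 * log(background / mixture)
    return t

# Hypothetical counts and fitted values (e.g. as returned by an EM fit).
counts = [12, 9, 11, 10, 8, 1, 0, 2, 1, 0]
sizes = [1000] * 10
T = likelihood_ratio_statistic(counts, sizes, lam0=0.001, lam_hat=0.01, theta_hat=0.5)
print(f"T = {T:.1f}")
```

When theta_hat is 0 the mixture collapses to the background and T is 0; a word whose counts the background explains well therefore gets a small T regardless of the estimated theta.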

Asymptotic Distribution of LRS

• It is well known that the distribution of the LRS converges asymptotically to chi-square, with degrees of freedom equal to the difference between the numbers of free parameters under the null and alternative hypotheses

• Bonferroni correction: significance level (0.05) divided by the number of hypotheses tested (number of terms)
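A minimal sketch of the thresholding step, under the assumption that the alternative has two free parameters (lambda and theta) and the null has none, which gives 2 degrees of freedom. For 2 degrees of freedom the chi-square tail probability has the closed form exp(-T/2). The term names and statistics are made up for illustration.

```python
from math import exp

def chi2_sf_2df(t):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return exp(-t / 2.0)

def significant_terms(stats, alpha=0.05):
    """Keep terms whose p-value passes the Bonferroni-corrected level."""
    threshold = alpha / len(stats)  # one test per term
    result = {}
    for term, t in stats.items():
        p_value = chi2_sf_2df(t)
        if p_value < threshold:
            result[term] = p_value
    return result

# Hypothetical likelihood ratio statistics for three terms.
stats = {"electron_transport": 126.6, "protein": 3.1, "cell": 0.4}
sig = significant_terms(stats)
print(sig)
```

In practice the number of tested terms is the full vocabulary (words plus selected bigrams), so the corrected threshold is far stricter than in this three-term example.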

Implementation

• Computational framework
  – Medline collections: yeast and fly
  – Indexing, retrieval, analysis by Indri 2.2

• Background distribution of terms
  – Organism-specific distribution

• Phrase construction: bigrams only
  – NSP (Ngram Statistics Package) for bigram selection: chi-square measure

• Procedure
  – Retrieval of articles about the given list of genes
  – Collection of counts of all words and significant bigrams
  – Statistical analysis by Poisson mixture model
  – Presentation of results: (1) sorted by statistical significance; (2) remove concepts whose fractions (theta) are below a threshold
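The final presentation step of the procedure might look like the following sketch. The tuple layout, the threshold value of 0.3, and the example concepts are assumptions for illustration; the slides do not give the actual threshold.

```python
def present_results(concepts, min_theta=0.3):
    """Drop concepts with a small theta, then sort by significance (largest LRS first).

    `concepts` is a list of (name, lrs, theta) tuples.
    """
    kept = [c for c in concepts if c[2] >= min_theta]
    return sorted(kept, key=lambda c: c[1], reverse=True)

# Hypothetical per-concept results: (name, likelihood ratio statistic, theta).
concepts = [
    ("voltage_dependent", 95.0, 0.07),
    ("respiratory", 80.2, 0.85),
    ("electron_transport", 126.6, 0.90),
]
ranked = present_results(concepts)
print([name for name, _, _ in ranked])
```

The theta filter is what removes a concept that is highly significant but driven by only one or two genes in the list.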

Web Interface

• Web link: http://beespace.igb.uiuc.edu:8080/BeeSpace/Annot.jsp

• User interface
  – Choose collection: MedYeast or MedFly
  – Input genes: textual names rather than identifiers

Experimental Validation

• Methods compared
  – StatAnnot
  – vs. term overrepresentation method: GEISHA
  – vs. GO enrichment analysis: FuncAssociate (http://llama.med.harvard.edu/cgi/func/funcassociate)

• Evaluation
  – Concepts that are common to both methods
  – Concepts that are unique to the method compared
  – Concepts that are unique to StatAnnot
  – Genes that are found by StatAnnot

Comparison with Another Literature-based Method

• Gene list: Eisen K cluster (15 genes)
  – Mainly respiratory chain complex (13), one mitochondrial membrane pore (por1)
  – However, por1 matches many more articles than the other genes and thus dominates the results, e.g. voltage-dependent channel

• Results:
  – GEISHA: voltage-dependent, outer member, pores, channel, succinate
  – StatAnnot: electron_transport, electron_transfer, respiratory, cytochrome_c1, mitochondrial_membrane, succinate_dehydrogenase

Comparison with GO-based Method

• Gene list: Eisen B cluster (11 genes)

• Results
  – cell_cycle, mitosis, s_phase, microtubule, cytokinesi, cytoskeletal_protein, spindle_pole, pole_body, mitotic_spindle, spindle_apparatus, spindle_checkpoint
  – contractile ring, actomyosin ring, cell cortex, microtubule nucleation, microtubule organizing center, M phase, bud site selection, phosphatidylinositol binding, septin ring
  – anaphase, anaphase_promote, ubiquitin, ligase, bud_neck, metaphase, proteolysis
  – cyclin_b, cdc3, cdc28, cdc12, cdc15, cdc2

Comparison with GO-based Method

• Gene list: cAMPDown001 cluster
  – Originally 59 genes
  – 16 genes have fly orthologs that match at least one article

• Results
  – heat_shock, molecular_chaperone, stress (response to stress)
  – response to temperature, unfolded protein binding, response to biotic stimulus
  – channel/ion_channel/channel_gate, kinase/threonine_kinase/serine_threonine, gtpase, ethanol, immunophilin, tetratricopeptide_repeat, geldanamycin, tacrolimu_bind, co_chaperone/cochaperon, steroid, quinone, cyclophilin
  – genes: hsp90, hsc70, hsp70, hsp40, cyp40, fkbp52, mok,

Conclusion from Experiments

• Solves the biased textual representation problem of the earlier literature-based method

• In general, the new method covers a large proportion of the terms from GO enrichment analysis

• Supplements them with additional biological concepts, including many related genes

• May be particularly useful for studying aspects that GO does not focus on, such as medicine

Future Plan (I)

• Phrases
  – N-gram, N > 2
  – Syntactic phrase construction
  – Removal of redundant concepts in the results

• Retrieval for genes
  – Gene name mapping to database entries
  – Synonym expansion
  – Gene name disambiguation

• Entity recognition in the resulting concepts
  – Gene names, chemicals, organisms

Future Plan (II)

• Sentence selection
  – Informative sentences summarizing the concepts (theme retrieval in BeeSpace)

• Clustering of concepts
  – E.g. distance measure of terms based on co-occurrence

• A general tool to summarize a group of entities? For example:
  – Common aspects of a set of diseases
  – Genes commonly involved in multiple developmental stages
