Introduction to Functional Analysis

J.L. Mosquera and Alex Sanchez


Motivation

• The rise of the genomic era, and especially the deciphering of the whole genome sequences of several organisms, has produced huge quantities of information.

• New technologies such as DNA microarrays (but not only these!) allow the simultaneous study of hundreds, even thousands of genes, in a single experiment.


Motivation

• This poses several challenges:
  1) The experiment itself
  2) Statistical analysis of the results
  3) Biological interpretation

• Very often the results are long lists of genes which have been selected according to some specific criteria.

PROBLEM: How can a researcher give these gene sets a biological interpretation?


Rationale

• A reasonable thing to do is to rely on existing annotations, which help to relate the selected sequences to biological knowledge.

• Bioinformatics resources hold data, often in the form of sequences which are annotated in scientific natural language.

• The annotation in this form is human readable and understandable, but difficult to interpret computationally.


What’s in a name?

QUESTION: What’s a cell?

Image from http://microscopy.fsu.edu

• The same name can be used to describe different concepts

• A concept can be described using different names

• Comparison is difficult, especially across species or across databases


Functional annotation

• Probably the most important thing you want to know is what the genes or their products are concerned with, i.e. their function.

• Function annotation is difficult:
  1) Different people use different words for the same function.
  2) Different people may mean different things by the same word.
  3) The context in which a gene was found (e.g. “TGF-induced gene”) may not be particularly associated with its function.

• Inference of function from sequence alone is error-prone and sometimes unreliable.

• The best function annotation systems use human beings who read the literature before assigning a function to a gene.


What can we do?

To overcome some of these problems, an annotation system has been created: the Gene Ontology (GO).


What is an ontology?

• An ontology is an entity which provides a set of vocabulary terms covering a conceptual domain.

• These terms must
  1) have an exhaustive and rigorous definition,
  2) be placed within a structure of relationships, usually a hierarchical data structure.

• The terms may be linked by two kinds of relationships:
  1) “is-a”, between parent and child.
  2) “part-of”, between part and whole.

• A term may have one or more parents.
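The structure described on this slide can be sketched as a small directed graph in which every term has a definition and a list of typed links to its parents. This is only an illustrative sketch: the class name `Ontology`, its methods, and the toy terms and definitions are assumptions made for the example, not actual GO entries or the GO file format.

```python
class Ontology:
    """Toy ontology: defined terms linked to parents by 'is_a' or 'part_of'."""

    RELATIONSHIPS = ("is_a", "part_of")

    def __init__(self):
        self.definitions = {}   # term -> definition text
        self.parents = {}       # term -> list of (relationship, parent term)

    def add_term(self, term, definition):
        self.definitions[term] = definition
        self.parents.setdefault(term, [])

    def add_relationship(self, child, relationship, parent):
        if relationship not in self.RELATIONSHIPS:
            raise ValueError(f"unknown relationship: {relationship}")
        if child not in self.definitions or parent not in self.definitions:
            raise KeyError("both terms must be defined before linking them")
        self.parents[child].append((relationship, parent))

    def ancestors(self, term):
        """Return every term reachable by following parent links upwards."""
        seen = set()
        stack = [term]
        while stack:
            current = stack.pop()
            for _, parent in self.parents.get(current, []):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Hypothetical toy terms, for illustration only.
onto = Ontology()
onto.add_term("cell", "toy definition of a cell")
onto.add_term("organelle", "toy definition of an organelle")
onto.add_term("nucleus", "toy definition of a nucleus")
onto.add_term("nucleolus", "toy definition of a nucleolus")
onto.add_relationship("nucleus", "is_a", "organelle")
onto.add_relationship("nucleus", "part_of", "cell")     # a term with two parents
onto.add_relationship("nucleolus", "part_of", "nucleus")
print(sorted(onto.ancestors("nucleolus")))
```

Because a term may have more than one parent, the structure is a graph rather than a strict tree, which is why `ancestors` tracks the terms it has already visited.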


What’s the GO?

• The GO is a cooperative project, developed and maintained by the Gene Ontology Consortium.

• It is an annotation database created to provide a controlled vocabulary to describe gene and gene product attributes in any organism.

• It is organized around three basic ontologies:

  Ontology              Number of terms (1)
  Molecular Function    7220
  Biological Process    9529
  Cellular Component    1536
  Total terms           18285

  (1) May, 2005


The GO ontologies and the GO graph

[Figure: the GO graph. Labels: GO; Molecular Functions (MF); Biological Processes (BP); Cellular Components (CC)]


Genes and GO terms

A given gene product may

• have one or more molecular functions,

• be used in one or more biological processes, and

• appear in one or more cellular components.


GO database

• It consists of two essential parts:
  1) The current ontologies:
     o Vocabulary
     o Structure
  2) The current annotations:
     o They create a link between the known genes and the associated GO terms that define their function.

• The GO database exists independently of other annotation databases:
  1) It does not depend on the organism.
  2) It does not depend on other databases, but

• most important databases have cross-references to the GO database:
  o It is possible to link and relate other annotations with those contained in GO.


Two types of GO Annotations

Electronic Annotation

Manual Annotation

• All annotations must

1) be attributed to a source,

2) indicate what evidence was found to support the GO term-gene/protein association


Evidence Codes

IEA  Inferred from Electronic Annotation
ISS  Inferred from Sequence Similarity
IEP  Inferred from Expression Pattern
IMP  Inferred from Mutant Phenotype
IGI  Inferred from Genetic Interaction
IPI  Inferred from Physical Interaction
IDA  Inferred from Direct Assay
RCA  Inferred from Reviewed Computational Analysis
TAS  Traceable Author Statement
NAS  Non-traceable Author Statement
IC   Inferred by Curator
ND   No biological Data available
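As a rough illustration of the two requirements on annotations (a source and an evidence code), an annotation record could be modelled as below. The `Annotation` class, its field names and the example values are hypothetical. Treating IEA as the only electronic code is an assumption made for this sketch, not something the slides state.

```python
from dataclasses import dataclass

# Evidence codes exactly as listed on the slide.
EVIDENCE_CODES = {
    "IEA": "Inferred from Electronic Annotation",
    "ISS": "Inferred from Sequence Similarity",
    "IEP": "Inferred from Expression Pattern",
    "IMP": "Inferred from Mutant Phenotype",
    "IGI": "Inferred from Genetic Interaction",
    "IPI": "Inferred from Physical Interaction",
    "IDA": "Inferred from Direct Assay",
    "RCA": "Inferred from Reviewed Computational Analysis",
    "TAS": "Traceable Author Statement",
    "NAS": "Non-traceable Author Statement",
    "IC": "Inferred by Curator",
    "ND": "No biological Data available",
}


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record linking a gene product to a GO term."""
    gene: str
    go_term: str
    evidence: str   # one of the codes above
    source: str     # every annotation must be attributed to a source

    def __post_init__(self):
        if self.evidence not in EVIDENCE_CODES:
            raise ValueError(f"unknown evidence code: {self.evidence}")
        if not self.source:
            raise ValueError("an annotation must be attributed to a source")


def is_electronic(annotation):
    # Assumption for this sketch: only IEA counts as electronic annotation.
    return annotation.evidence == "IEA"


# Placeholder gene, term and source names, for illustration only.
manual = Annotation("geneA", "TERM_1", "IDA", "literature reference (placeholder)")
electronic = Annotation("geneB", "TERM_1", "IEA", "automated pipeline (placeholder)")
print(is_electronic(manual), is_electronic(electronic))
```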


Enrichment Analysis

• An unbiased method to ask the question, “What’s so special about my set of genes?”

• Many tools follow similar steps:
  1) Obtain the GO annotation (most specific term(s)) for the genes in your set.
  2) Climb the ontology to get all “parents” (more general terms).
  3) Look at the occurrence of each term in your set compared to its occurrence in the population (all genes, or all genes on your chip).
  4) Are some terms over-represented?
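Steps 1 to 3 can be sketched in a few lines, assuming the annotations and the parent links are available as plain dictionaries. All names and the toy data below are hypothetical; real tools read these from GO annotation files.

```python
from collections import Counter


def with_ancestors(term, parents):
    """Step 2: the term itself plus every more general term above it."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def term_counts(genes, annotations, parents):
    """Number of genes annotated (directly or via a child) to each term."""
    counts = Counter()
    for gene in genes:
        terms = set()
        for term in annotations.get(gene, []):   # step 1: most specific terms
            terms |= with_ancestors(term, parents)
        counts.update(terms)                     # each gene counted once per term
    return counts


def occurrence_table(selected, population, annotations, parents):
    """Step 3: (x, K, M, N) per term, i.e. count in set, set size,
    count in population, population size."""
    in_set = term_counts(selected, annotations, parents)
    in_pop = term_counts(population, annotations, parents)
    return {t: (in_set[t], len(selected), in_pop[t], len(population))
            for t in in_set}


# Hypothetical toy data.
parents = {"t_leaf1": ["t_mid"], "t_leaf2": ["t_mid"], "t_mid": ["t_root"]}
annotations = {"g1": ["t_leaf1"], "g2": ["t_leaf2"], "g3": ["t_leaf1"], "g4": []}
table = occurrence_table(["g1", "g2"], ["g1", "g2", "g3", "g4"], annotations, parents)
print(table["t_mid"])
```

Step 4, deciding whether a term is over-represented, is a statistical question that takes the four numbers in each tuple as input.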


Statistical Methods for enrichment analysis

• Let us consider:
  o N genes on a microarray:
      M belong to a given GO term category (A)
      N−M do not belong to it (category Ac)
  o K of the N genes are selected and assigned to a given class (e.g. regulated genes)
  o x of these K genes are in A (see the Example)

STATISTICAL HYPOTHESES:
  H0: GO category A is equally represented on the microarray and in the class of differentially regulated genes.
  H1: GO category A is more (or less) represented in the class of differentially regulated genes than on the microarray.


Hypergeometric Distribution (1/2)

We ask: Assuming sampling without replacement, what is the probability of having exactly x genes of category A?

• The probability that a certain category occurs x times just by chance in the list of differentially regulated genes is modelled by a hypergeometric distribution with parameters (N, M, K).

$$P(X = x) = \frac{\binom{M}{x}\binom{N-M}{K-x}}{\binom{N}{K}}$$


Hypergeometric Distribution (2/2)

• So, under the null hypothesis, the p_value of having x or more genes in A will be:

$$p_{\text{value}} = P(X \ge x \mid H_0) = \sum_{k=x}^{K} \frac{\binom{M}{k}\binom{N-M}{K-k}}{\binom{N}{K}}$$

• This corresponds to a one-sided test in which small p_values relate to over-represented GO terms.

• For under-represented categories, the corresponding probability can be calculated as 1 − p_value = P(X < x | H0).
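A direct way to compute this upper-tail sum is to add up the exact integer numerators and divide once at the end. The function name `enrichment_p_value` is an assumption for this sketch.

```python
from math import comb


def enrichment_p_value(N, M, K, x):
    """One-sided p-value P(X >= x) under the hypergeometric null."""
    upper = min(K, M)   # terms with k > M are zero, so the sum can stop here
    numerator = sum(comb(M, k) * comb(N - M, K - k) for k in range(x, upper + 1))
    return numerator / comb(N, K)


# Small check by hand: N=10, M=4, K=3, x=1 gives 1 - P(X=0) = 1 - 20/120.
print(enrichment_p_value(10, 4, 3, 1))
```

Summing the tail terms directly, rather than computing one minus the lower tail, avoids losing precision when the p-value is very small.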


Disadvantages

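• The hypergeometric distribution is rather difficult and time-consuming to calculate when N is large.

• It can be shown that, as N → ∞ with M/N held fixed, the hypergeometric distribution converges to a binomial distribution:

$$\mathrm{Hyp}(N, M, K) \xrightarrow[N \to \infty]{} \mathrm{Bin}\left(K, \frac{M}{N}\right)$$

• Using this approximation, the p_value for over-represented GO terms can be calculated as

$$p_{\text{value}} \approx 1 - \sum_{i=0}^{x-1} \binom{K}{i} \left(\frac{M}{N}\right)^{i} \left(1 - \frac{M}{N}\right)^{K-i}$$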
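The binomial approximation can be sketched as follows, written as one minus the lower tail so that it matches the upper-tail p_value P(X ≥ x) defined earlier. The function name `binomial_p_value` is an assumption for this example.

```python
from math import comb


def binomial_p_value(N, M, K, x):
    """Approximate P(X >= x) with X ~ Bin(K, M/N)."""
    p = M / N
    lower_tail = sum(comb(K, i) * p**i * (1 - p) ** (K - i) for i in range(x))
    return 1.0 - lower_tail


# Small check by hand: K=3, p=0.4, x=1 gives 1 - 0.6**3.
print(binomial_p_value(10, 4, 3, 1))
```

The subtraction from 1 loses precision once the p-value approaches floating-point rounding error, so for very small p-values it may be preferable to sum the upper tail directly.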


Alternative approaches

• Let us assume the following contingency table:

                 Differentially regulated genes (D)    Dc     Genes on microarray
  Category A     n11                                   n12    N1.
  Ac             n21                                   n22    N2.
                 N.1                                   N.2    N..

  where N=N.., M=N1., K=N.1 and x=n11

• Using this notation, alternatives include:
  o χ² test for equality of two proportions
  o Fisher’s Exact Test
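For the 2×2 table above, the χ² statistic has a closed form, and with one degree of freedom its p-value can be obtained from the complementary error function. This is a minimal sketch without continuity correction; the function name is an assumption, and the counts in the usage line are derived from the Example slide (n11 = x = 51, row total M = 467, column total K = 173, N = 9177).

```python
from math import erfc, sqrt


def chi_square_2x2(n11, n12, n21, n22):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    Returns (statistic, p_value). Assumes all four marginal totals are > 0.
    """
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    total = row1 + row2
    statistic = total * (n11 * n22 - n12 * n21) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(chi2 > s) = erfc(sqrt(s / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# n12 = 467 - 51, n21 = 173 - 51, n22 = 9177 - 467 - 122.
stat, p = chi_square_2x2(51, 416, 122, 8588)
print(stat, p)
```

The χ² test relies on a large-sample approximation, which is one reason an exact test is often preferred when some cell counts are small.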


Fisher’s Exact Test

• This test treats the marginal totals as fixed and uses the hypergeometric distribution to calculate the probability of observing each individual table.

• One can calculate a table containing all possible combinations of n11, n12, n21 and n22.

• The p_value for a particular occurrence is the sum of all probabilities lower than or equal to the probability corresponding to the observed combination.
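The procedure described above can be sketched by enumerating every table compatible with the fixed margins. Exact fractions are used so that the "lower than or equal to" comparison is not affected by rounding. The function names are assumptions for this example.

```python
from fractions import Fraction
from math import comb


def table_probability(n11, row1, col1, total):
    """Hypergeometric probability of the table with top-left cell n11."""
    return Fraction(comb(row1, n11) * comb(total - row1, col1 - n11),
                    comb(total, col1))


def fisher_exact_two_sided(n11, n12, n21, n22):
    """Sum the probabilities of all tables no more likely than the observed one."""
    row1, col1 = n11 + n12, n11 + n21
    total = n11 + n12 + n21 + n22
    lowest = max(0, col1 - (total - row1))   # smallest feasible n11
    highest = min(row1, col1)                # largest feasible n11
    observed = table_probability(n11, row1, col1, total)
    p_value = Fraction(0)
    for k in range(lowest, highest + 1):
        probability = table_probability(k, row1, col1, total)
        if probability <= observed:
            p_value += probability
    return float(p_value)


# Small check by hand: margins 4/4 and 4/4, observed n11 = 3.
# Table probabilities are 1, 16, 36, 16, 1 (each over 70), so p = 34/70.
print(fisher_exact_two_sided(3, 1, 1, 3))
```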


Correction for Multiple Tests

• As the number of GO terms tested for significance is large, p_values have to be corrected for multiple testing. For instance:
  o Methods controlling the False Discovery Rate (FDR):
      Benjamini and Hochberg (assuming independence)
      Benjamini and Yekutieli (dropping the independence assumption)
  o Methods controlling the Family-Wise Error Rate (FWER):
      Holm correction
      Westfall and Young
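Two of the listed corrections are simple enough to sketch directly as adjusted p-values: Benjamini and Hochberg for the FDR, and Holm for the FWER. The function names are assumptions for this example, and the input p-values are made-up numbers used only to show the mechanics.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def holm(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    # Walk from the smallest p-value up, enforcing monotonicity.
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, p_values[i] * (m - rank)))
        adjusted[i] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]   # made-up p-values, one per GO term
print(benjamini_hochberg(raw))
print(holm(raw))
```

A GO term is then called significant when its adjusted p-value falls below the chosen threshold. The Holm values are never smaller than the Benjamini-Hochberg ones, reflecting the stricter error criterion.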


Example

N = 9177 genes on the microarray
M = 467 genes in GO category A
N−M = 8710 genes in Ac
K = 173 genes picked randomly
x = 51 genes of category A (among the K picked)
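With these numbers, the count expected under the null hypothesis is K·M/N, which can be compared with the observed x = 51. The sketch below computes that expectation together with the hypergeometric upper-tail probability defined earlier. Variable names are chosen for illustration, and no output values are quoted here.

```python
from math import comb

N, M, K, x = 9177, 467, 173, 51   # values from the Example slide

# Expected number of category-A genes among K genes drawn at random.
expected = K * M / N

# Ratio of observed to expected count.
fold_enrichment = x / expected

# Exact upper tail P(X >= x): sum integer numerators, divide once.
numerator = sum(comb(M, k) * comb(N - M, K - k) for k in range(x, min(K, M) + 1))
p_value = numerator / comb(N, K)

print(f"expected = {expected:.2f}, fold enrichment = {fold_enrichment:.2f}")
print(f"p_value  = {p_value:.3g}")
```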