Genome analysis. Genome – the sum of genes and intergenic sequences of a haploid cell.



The value of genome sequences lies in their annotation

• Annotation – Characterizing genomic features using computational and experimental methods

• Genes: four levels of annotation
  - Gene prediction: where are the genes?
  - What do they look like?
  - What do they encode?
  - Which proteins/pathways are they involved in?


Koonin & Galperin

Accuracy of genome annotation.

• In most genomes, functional predictions have been made for the majority of genes (54-79%).

• Sources of annotation errors:
  - Overprediction: hits that are statistically significant in the database search are accepted without further checking.
  - Multidomain proteins: similarity is found to only one domain, but the annotation is extended to the whole protein.

The error rate of genome annotation can be as high as 25%.

Sample genomes

Species           Size (Mb)   Genes    Genes/Mb
H. sapiens        3,200       35,000   11
D. melanogaster   137         13,338   97
C. elegans        85.5        18,266   214
A. thaliana       115         25,800   224
S. cerevisiae     15          6,144    410
E. coli           4.6         4,300    934
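The Genes/Mb column is simply the gene count divided by the genome size. A minimal Python sketch of that arithmetic, using the sizes and gene counts from the table (the dictionary layout and the name `gene_density` are illustrative choices, not part of the slides):

```python
# Genome size in Mb and gene count, taken from the table above.
GENOMES = {
    "H. sapiens": (3200.0, 35000),
    "D. melanogaster": (137.0, 13338),
    "C. elegans": (85.5, 18266),
    "A. thaliana": (115.0, 25800),
    "S. cerevisiae": (15.0, 6144),
    "E. coli": (4.6, 4300),
}


def gene_density(size_mb, genes):
    """Return the number of genes per megabase."""
    return genes / size_mb


# Gene density for every species in the table.
density = {name: gene_density(size, genes) for name, (size, genes) in GENOMES.items()}

for name, value in density.items():
    print(f"{name:16s} {value:7.1f} genes/Mb")
```

The computed values agree with the table to within rounding, and they make the trend explicit: gene density drops by almost two orders of magnitude from E. coli to H. sapiens.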

So much DNA – so “few” genes …

[Figure: genic versus intergenic DNA]


Human Genome project.

Comparative genomics - comparison of gene number, gene content and gene location in genomes.

Campbell & Heyer “Genomics”

Analysis of gene order (synteny).

Genes with a related function are frequently clustered on the chromosome.

Example: the E. coli genes responsible for tryptophan (Trp) synthesis are clustered, and their order is conserved between different bacterial species.

Operon: a set of genes transcribed together, in the same direction of transcription.


Analysis of gene order (synteny).

Koonin & Galperin “Sequence, Evolution, Function”

Analysis of gene order (synteny).

• Gene order is poorly conserved between prokaryotic genomes whose sequence identity is below 50%.

• A gene neighborhood can be conserved such that all neighboring genes belong to the same functional class.

• Function can therefore be predicted from gene neighborhood.
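One simple way to look for conserved gene neighborhoods is to collect the pairs of adjacent genes in each genome and intersect them. The sketch below assumes that orthologous genes already carry the same label in both genomes; the two gene orders are hypothetical toy data, and the function names are illustrative.

```python
def adjacent_pairs(gene_order):
    """Return the set of unordered pairs of neighboring genes in a linear gene order."""
    return {frozenset(pair) for pair in zip(gene_order, gene_order[1:])}


def conserved_neighbors(order_a, order_b):
    """Return the gene pairs that are adjacent in both genomes, in either orientation."""
    return adjacent_pairs(order_a) & adjacent_pairs(order_b)


# Hypothetical gene orders; geneX, geneY and geneZ are made-up placeholders.
genome_a = ["trpE", "trpD", "trpC", "trpB", "trpA", "geneX"]
genome_b = ["geneY", "trpA", "trpB", "trpC", "geneZ", "trpD", "trpE"]

shared = conserved_neighbors(genome_a, genome_b)
for pair in sorted(sorted(p) for p in shared):
    print(" - ".join(pair))
```

In this toy case the trpA-trpB, trpB-trpC and trpD-trpE adjacencies survive, while trpC-trpD is broken by an insertion. An uncharacterized gene that repeatedly appears next to such a conserved pair would be a candidate for the same functional class.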

Role of “junk” DNA in a cell.

Species           Size (Mb)   Genes    Genes/Mb
H. sapiens        3,200       35,000   11
D. melanogaster   137         13,338   97
C. elegans        85.5        18,266   214
A. thaliana       115         25,800   224
S. cerevisiae     15          6,144    410
E. coli           4.6         4,300    934

1. There is almost no correlation between the number of genes and an organism’s complexity.

2. There is a correlation between the amount of non-protein-coding DNA and complexity.

New interpretation of introns.

1. Modern introns invaded eukaryotes late in evolution; they are derived from self-splicing mobile genetic elements similar to group II introns.

2. The nucleus, which separates transcription from translation, appears only in eukaryotes. In prokaryotes, where the two processes are coupled, there would not be time for introns to splice themselves out.

3. Hypothesis: introns play an important regulatory role.

Regulatory role of non-coding regions.

- “Micro-RNAs” control the timing of processes in development and apoptosis.

- Intronic RNAs signal that a particular gene is being transcribed.

- Alternative splicing can be regulated by non-coding regions.

- Non-coding regions can be very well conserved between species, and many genetic diseases have been linked to variations/mutations in non-coding regions.

COGs – Clusters of Orthologous Genes.

Orthologs – genes in different species that evolved from a common ancestral gene by speciation.

Paralogs – genes related by duplication within a genome.

Classwork I: Comparing microbial genomes.

• Go to http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi
• Select the Thermus thermophilus genome.
• View TaxTable.
• Which gene clusters do you see that are shared with Archaea?


Systems biology.

• Integrative approach to study the relationships and interactions between various parts of a complex system.

Goal: to develop a model of interacting components for the whole system.


Basic notions of networks.

Network (graph) – a set of vertices connected via edges.

The degree of a vertex – the total number of connections of a vertex.

Random networks – networks with a disordered arrangement of edges.


Properties of networks.

• Vertex degree distribution/connectivity.

• Clustering coefficient.

• Network diameter.

Characteristics of networks: vertex degree distribution.

[Figure: example graph with vertices labeled by degree (k = 1, 2, 2, 3)]

P(k, N) – degree distribution; k – degree of the vertex; N – number of vertices.

If vertices are statistically independent and connections are random, the degree distribution completely determines the statistical properties of a network.
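As an illustration, the degree distribution of a small graph can be tabulated directly from its edge list. The four-vertex graph below is an assumed toy example, chosen so that its vertex degrees are 1, 2, 2 and 3; the function name is illustrative.

```python
from collections import Counter


def degree_distribution(edges):
    """Return P(k): the fraction of vertices that have degree k.

    Only vertices that appear in at least one edge are counted.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n_vertices = len(degree)
    counts = Counter(degree.values())
    return {k: counts[k] / n_vertices for k in sorted(counts)}


# Toy graph (an assumption for illustration): a triangle B-C-D with a pendant vertex A.
EDGES = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")]

P = degree_distribution(EDGES)
for k, p in P.items():
    print(f"P({k}) = {p}")
```

Here half of the vertices have degree 2, and one quarter each have degree 1 and degree 3.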


Characteristics of networks: clustering coefficient.

The clustering coefficient characterizes the density of connections in the environment close to a given vertex.

$$C = \frac{2d}{n(n-1)}$$

d – total number of edges connecting the nearest neighbors; n – number of nearest neighbors of the given vertex.

Example: C = 2/6 (d = 1, n = 3).
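Reading the formula as C = 2d / (n(n-1)), the coefficient can be computed from an adjacency map. The sketch below uses an assumed toy graph in which vertex B has three neighbors with a single edge among them, which reproduces the value 2/6. Returning 0 for vertices with fewer than two neighbors is a convention chosen here, not something stated on the slide.

```python
from itertools import combinations


def clustering_coefficient(adjacency, vertex):
    """Return C = 2d / (n(n-1)) for one vertex.

    n is the number of neighbors of the vertex and d is the number of
    edges that connect those neighbors to each other.
    """
    neighbors = adjacency[vertex]
    n = len(neighbors)
    if n < 2:
        return 0.0  # convention: undefined case reported as zero
    d = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
    return 2 * d / (n * (n - 1))


# Toy graph (an assumption for illustration): triangle B-C-D plus pendant vertex A.
ADJACENCY = {
    "A": {"B"},
    "B": {"A", "C", "D"},
    "C": {"B", "D"},
    "D": {"B", "C"},
}

for vertex in ADJACENCY:
    print(vertex, clustering_coefficient(ADJACENCY, vertex))
```

Vertex C has two neighbors that are themselves connected, so its coefficient is 1, the maximum possible value.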

Characteristics of networks: diameter, small-world.

Diameter of a network – the shortest path along the existing links, averaged over all pairs of vertices. The distance between two vertices is the smallest number of steps needed to reach one vertex from the other.

Small-world character of networks: any two vertices can be connected by a relatively short path.

For random networks, the diameter increases logarithmically with the number of vertices.
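With the diameter defined as the average shortest path, it can be computed by running a breadth-first search from every vertex. This is a minimal sketch for an unweighted, undirected graph with at least two vertices; the toy graph and function names are assumptions for illustration, and unreachable pairs are simply left out of the average.

```python
from collections import deque


def distances_from(adjacency, source):
    """Breadth-first search: number of steps from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_path_length(adjacency):
    """Average shortest-path length over all reachable ordered pairs of distinct vertices."""
    total = 0
    pairs = 0
    for source in adjacency:
        dist = distances_from(adjacency, source)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs


# Toy graph (an assumption for illustration): triangle B-C-D plus pendant vertex A.
ADJACENCY = {
    "A": {"B"},
    "B": {"A", "C", "D"},
    "C": {"B", "D"},
    "D": {"B", "C"},
}

print(average_path_length(ADJACENCY))
```

Each ordered pair is counted once in each direction, which gives the same average as counting unordered pairs in an undirected graph.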

Different network models: Erdős-Rényi model.

• Start with a fixed set of vertices.
• Iterate the following process: choose two vertices at random and connect them by an edge.
• Stop at a certain number of edges.

Degree distribution – Poisson distribution, λ – average degree:

$$P(k) = \frac{e^{-\lambda}\lambda^{k}}{k!}$$

[Plot: ln(P(k)) versus ln(k)]
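The construction above can be simulated in a few lines and compared with the Poisson form. This is a sketch under stated assumptions: duplicate edges are skipped (the slide does not say how repeats are handled), and the graph size, edge count and seed are arbitrary illustrative values.

```python
import math
import random
from collections import Counter


def erdos_renyi_degrees(n_vertices, n_edges, seed=0):
    """Connect random vertex pairs until n_edges distinct edges exist; return all degrees."""
    if n_edges > n_vertices * (n_vertices - 1) // 2:
        raise ValueError("more edges requested than distinct vertex pairs")
    rng = random.Random(seed)
    degree = [0] * n_vertices
    edges = set()
    while len(edges) < n_edges:
        u, v = rng.sample(range(n_vertices), 2)  # two distinct vertices
        edge = (min(u, v), max(u, v))
        if edge not in edges:  # assumption: repeated pairs are skipped
            edges.add(edge)
            degree[u] += 1
            degree[v] += 1
    return degree


def poisson(k, lam):
    """Poisson probability P(k) = exp(-lam) * lam**k / k!."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


degrees = erdos_renyi_degrees(n_vertices=2000, n_edges=4000)
lam = sum(degrees) / len(degrees)  # average degree = 2 * edges / vertices
observed = Counter(degrees)

for k in range(0, 11):
    print(k, observed[k] / len(degrees), round(poisson(k, lam), 4))
```

The printed columns are the simulated fraction of vertices with degree k and the Poisson value for the same average degree; for a graph of this size the two should be close but not identical.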

Different network models: model 2.

• At each step, a new vertex is added to the graph.
• Simultaneously, a pair of randomly chosen vertices is connected by an edge.

This is a non-equilibrium model: the total number of vertices is not fixed.

Degree distribution – exponential distribution:

$$p(k) \sim e^{-k/\kappa}$$

where κ is a constant that sets the decay scale.

[Plot: ln(P(k)) versus ln(k)]
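A minimal simulation of this growing model might look as follows. It assumes that the two endpoints of each new edge are distinct, that repeated edges are allowed, and that no edge is added on the first step, when only one vertex exists; none of these details is specified on the slide.

```python
import random
from collections import Counter


def growing_random_network(n_steps, seed=0):
    """Add one vertex per step and connect one random pair of existing vertices.

    Returns the list of vertex degrees after n_steps steps.
    """
    rng = random.Random(seed)
    degree = []
    for _ in range(n_steps):
        degree.append(0)  # the new vertex starts with no connections
        if len(degree) >= 2:
            u, v = rng.sample(range(len(degree)), 2)  # two distinct vertices
            degree[u] += 1
            degree[v] += 1
    return degree


degrees = growing_random_network(5000)
histogram = Counter(degrees)

for k in sorted(histogram):
    print(k, histogram[k] / len(degrees))
```

Plotting the logarithm of the printed fractions against k, rather than against ln(k), should give a roughly straight line if the distribution is exponential.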

Different network models: Barabási-Albert.

Model of preferential attachment.

• At each step, a new vertex is added to the graph.
• The new vertex is attached to one of the old vertices with probability proportional to the degree of that old vertex.

Degree distribution – power-law distribution:

$$p(k) \sim k^{-\gamma}$$

[Plot: ln(P(k)) versus ln(k)]
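Preferential attachment can be sketched with a common sampling trick: keep a list in which every vertex appears once per edge end, so that a uniform draw from the list picks a vertex with probability proportional to its degree. The starting graph (two vertices joined by one edge) and the seed are assumptions made for this illustration.

```python
import random
from collections import Counter


def preferential_attachment(n_vertices, seed=0):
    """Grow a network by attaching each new vertex to one old vertex.

    The old vertex is chosen with probability proportional to its degree.
    Returns the list of vertex degrees.
    """
    rng = random.Random(seed)
    degree = [1, 1]    # assumed seed graph: vertices 0 and 1 joined by an edge
    endpoints = [0, 1]  # each vertex appears once per edge end it owns
    for new in range(2, n_vertices):
        old = rng.choice(endpoints)  # degree-proportional choice
        degree.append(1)
        degree[old] += 1
        endpoints.extend([new, old])
    return degree


degrees = preferential_attachment(10000)
histogram = Counter(degrees)

print("largest degree:", max(degrees))
for k in sorted(histogram)[:10]:
    print(k, histogram[k] / len(degrees))
```

Most vertices end up with degree 1, while a handful of early vertices accumulate many links, which is the qualitative signature of a power-law tail.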

Power-law distribution: $p(k) \sim k^{-\gamma}$

$$p(\alpha k) \sim (\alpha k)^{-\gamma} = \alpha^{-\gamma} k^{-\gamma} \sim \alpha^{-\gamma}\, p(k)$$

Multiplying k by a constant α does not change the shape of the distribution: it is a scale-free distribution.

From T. Przytycka
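Because a power law is a straight line on a ln(P(k)) versus ln(k) plot, the exponent γ can be read off as minus the slope. The sketch below fits that slope by ordinary least squares on exact synthetic data with γ = 2.2; the function name is illustrative, and for noisy empirical data a log-log regression is only a rough estimate.

```python
import math


def fit_power_law_exponent(ks, ps):
    """Estimate gamma in p(k) ~ k**(-gamma) from the slope of ln p versus ln k."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(p) for p in ps]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance


# Exact synthetic power law (unnormalized), used only to check the fit.
ks = list(range(1, 101))
ps = [k ** -2.2 for k in ks]

gamma = fit_power_law_exponent(ks, ps)
print("estimated gamma:", round(gamma, 6))

# Scale invariance: p(2k) / p(k) is the same constant, 2**(-gamma), for every k.
ratios = [(2 * k) ** -2.2 / k ** -2.2 for k in ks]
print("min ratio:", min(ratios), "max ratio:", max(ratios))
```

The constant ratio in the last two lines is the scale-free property in numerical form: rescaling k multiplies the whole distribution by a fixed factor.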

Difference between scale-free and random networks.

Random networks are homogeneous: most nodes have approximately the same number of links.

Scale-free networks have a few highly connected vertices (hubs).

Example 1: the large-scale organization of metabolic networks.

Glycolysis metabolic network: substrates linked by enzymes (EC numbers in parentheses).

1. D-Glucose → D-Glucose-6P: Hexokinase (2.7.1.1); ATP → ADP
2. D-Glucose-6P → D-Fructose-6P: Phosphoglucose isomerase (5.3.1.9)
3. D-Fructose-6P → D-Fructose-1,6P2: Phosphofructokinase (2.7.1.11); ATP → ADP
4. D-Fructose-1,6P2 → Glyceraldehyde-3P + Glycerone-P: Aldolase (4.1.2.13)
5. Glycerone-P ↔ Glyceraldehyde-3P: Triose phosphate isomerase (5.3.1.1)
6. Glyceraldehyde-3P → Glycerate-1,3P2: Glyceraldehyde 3-P dehydrogenase (1.2.1.12); NAD+ + Pi → NADH + H+
7. Glycerate-1,3P2 → Glycerate-3P: Phosphoglycerate kinase (2.7.2.3); ADP → ATP
8. Glycerate-3P → Glycerate-2P: Phosphoglycerate mutase (5.4.2.1)
9. Glycerate-2P → Phosphoenol-pyruvate: Enolase (4.2.1.11); releases H2O
10. Phosphoenol-pyruvate → Pyruvate: Pyruvate kinase (2.7.1.40); ADP → ATP
11. Pyruvate → Lactate: Lactate dehydrogenase (1.1.1.27); NADH + H+ → NAD+

Connected pathways: pentose phosphate cycle, glycerolipid metabolism, pyruvate metabolism, apicoplast FA synthesis.

Slide credit: Hagai Ginsburg

Example 1: the large-scale organization of metabolic networks.

Jeong et al., Nature, 2000:
- Compared the metabolic networks of 43 organisms.
- Vertices – substrates, connected to each other through links (metabolic reactions).

Results:

- Scale-free nature of metabolic networks for all organisms, γ = 2.2

- Diameters of metabolic networks for all organisms are the same.

Biological interpretations of power-law connectivity.

• A few vertices dominate the overall connectivity of the network.

• Self-similarity of networks.

• Small diameter: the network can respond quickly to a mutation that destroys an enzyme by activating alternative paths.

Protein-protein interaction networks.

Sneppen & Maslov:

• Vertices – proteins; edges connect proteins that interact in a cell.
• Network: 3278 interactions, 1289 proteins.
• Scale-free network, γ = 2.5 ± 0.3.