Introduction to Bioinformatics


Page 1: Introduction to

1

Introduction to

Bioinformatics

Page 2: Introduction to

2

Introduction to Bioinformatics.

LECTURE 8: Whole genome comparisons

* Chapter 8: Welcome to the Hotel Chlamydia

Page 3: Introduction to

3

Introduction to Bioinformatics LECTURE 8: WHOLE GENOME COMPARISONS

8.1 Uninvited guests

* Symbionts: organisms that live together in a beneficial relation

* E. coli and humans: the bacterium receives nutrients and supplies vitamin K

* Numerous examples in nature: flowers and bees, tickbird and rhinoceros, pea aphid and Buchnera, mitochondria and eukaryotes

Page 4: Introduction to

4

Introduction to Bioinformatics LECTURE 8: WHOLE GENOME COMPARISONS

8.1 Uninvited guests

* Some symbionts have moved permanently into the cells of the host

* They have become entirely dependent on the host to provide them with nutrients, oxygen, specific proteins …

* In the process they have lost many genes necessary to produce such products themselves

* As a result, intracellular obligate symbionts have the smallest genomes, both in total size and in number of genes

Page 5: Introduction to

5

Introduction to Bioinformatics 8.1 – UNINVITED GUESTS

Chlamydia trachomatis

* Chlamydia trachomatis is an intracellular symbiont that gives no benefit to the host: it is a parasite

* C. trachomatis causes the most common sexually transmitted disease, with approximately 3 million new infections per year in the USA

* It has lost the ability to produce many biochemical products and must live in specific human cells (hence the characterisation as an obligate endosymbiont)

Page 6: Introduction to

6

Introduction to Bioinformatics LECTURE 8: WHOLE GENOME COMPARISONS

Chlamydia trachomatis

Page 7: Introduction to

7

Introduction to Bioinformatics 8.1 – UNINVITED GUESTS

Chlamydia pneumoniae

* Chlamydia pneumoniae is a related bacterial parasite of the human respiratory tract: it causes pneumonia and bronchitis

* Like C. trachomatis, it has a very small genome of about 1 Mb

Page 8: Introduction to

8

Human respiratory tract; Chlamydia pneumoniae

Page 9: Introduction to

9

Introduction to Bioinformatics 8.1 – UNINVITED GUESTS

Chlamydia pneumoniae

* Phylogenetic analysis shows that the parasitic lifestyle of Chlamydia dates back some 700 million years, to the emergence of eukaryotes

* The purely symbiotic lifestyle of mitochondria dates from the same time

Page 10: Introduction to

10

Introduction to Bioinformatics 8.1 – UNINVITED GUESTS

Whole genome comparisons

* In this lecture we study the problems involved in comparing entire genomes

* Because of their very small genomes, Chlamydia are a perfect case study

* Moreover, Chlamydia shows a high conservation of the order of the genes and virtually no horizontal gene transfer.

Page 11: Introduction to

11

Introduction to Bioinformatics 8.1 – UNINVITED GUESTS

Hotel Chlamydia

The point of ‘Hotel Chlamydia’ is not so much that we are a living hotel for Chlamydia (which is true), but that genes are guests in Hotel Chlamydia: guests that come in, reshuffle rooms, move out, pass along …

Page 12: Introduction to

12

Introduction to Bioinformatics LECTURE 8: WHOLE GENOME COMPARISONS

8.2 Patterns of genome evolution

* Genome comparison looks at the differences between the entire gene sets of two organisms

* This provides insight into the evolution and function of genes

* Single nucleotide polymorphisms form the bulk of the genetic variability

* Also rearrangement and shuffling of genes: inversion, duplication, translocation

Page 13: Introduction to

13

Introduction to Bioinformatics LECTURE 8: WHOLE GENOME COMPARISONS

8.2 Patterns of genome evolution

* Translocation often occurs between two organisms: horizontal gene transfer

* Some 20% of E. coli's genes derive from horizontal transfer

* Chromosomes can break apart and/or stick together

* Whole genomes can be duplicated → polyploid individuals

* This is the basis for new functions as these extra genes are free to evolve

Page 14: Introduction to

14

Introduction to Bioinformatics LECTURE 8: WHOLE GENOME COMPARISONS

Page 15: Introduction to

15

Introduction to Bioinformatics LECTURE 8: WHOLE GENOME COMPARISONS

8.3 Beanbag genomics

* Genome = beanbag of genes + junk-DNA

* Comparison of whole genomes is more than comparing individual genes

* inversions, transpositions, duplications, deletions, chromosomal rearrangements

* Therefore an alignment of entire genomes will not work

Page 16: Introduction to

16

8.3 - BEANBAG GENOMICS : Comparison of two genomes

Basic mechanisms of gene evolution

Page 17: Introduction to

17

Page 18: Introduction to

18

8.3 - BEANBAG GENOMICS : Comparison of two genomes

Figure: gene evolution. Blocks of a genome sequence are changed by inversion, transposition, duplication and deletion.

An alignment of entire genomes will not work!

Therefore we have to break the problem into smaller pieces and build it back up …

… with multiple single-gene analyses
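The four operations named on this slide can be sketched on a toy genome written as a list of blocks. This is only an illustration: the function names and the block labels are hypothetical, and strand orientation (which a real inversion also flips) is ignored.

```python
def invert(blocks, i, j):
    """Inversion: reverse the order of blocks[i:j]."""
    return blocks[:i] + blocks[i:j][::-1] + blocks[j:]


def transpose(blocks, i, j, k):
    """Transposition: cut blocks[i:j] out and reinsert it at index k of the rest."""
    segment = blocks[i:j]
    rest = blocks[:i] + blocks[j:]
    return rest[:k] + segment + rest[k:]


def duplicate(blocks, i, j):
    """Duplication: insert a second copy of blocks[i:j] right after the first."""
    return blocks[:j] + blocks[i:j] + blocks[j:]


def delete(blocks, i, j):
    """Deletion: drop blocks[i:j]."""
    return blocks[:i] + blocks[j:]


genome = ["A", "B", "C", "D", "E"]   # toy block labels, not real genes
inverted = invert(genome, 1, 4)      # A D C B E
moved = transpose(genome, 0, 2, 3)   # C D E A B
doubled = duplicate(genome, 1, 3)    # A B C B C D E
shortened = delete(genome, 1, 3)     # A D E
```

After even a few such events the block order differs between two genomes, which is why a single end-to-end alignment breaks down.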

Page 19: Introduction to

19

8.3 - BEANBAG GENOMICS : Comparison of two chromosomes

AAATAATCG AACCCCCGACTTTTTGG TATATA CATGTAGTACGAC GGG

chromosome evolution

CATGTAGTACAAATAATAACCCCCG ACTTTTTGGTATATA CGGACGGG

Splitting into new chromosomes

Reshuffling of genes over the chromosomes

Page 20: Introduction to

20

* STEP 1: Find which genes are present in both

* STEP 2: use ORF-finder with threshold of 100 codons

* EXAMPLE: Chlamydia trachomatis and C. pneumoniae:

--- Organism --------- size (nt) --- ORFs ---

C. trachomatis 1 042 519 916

C. pneumoniae 1 229 853 1048

E. coli - 5000
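A minimal sketch of an ORF finder with a codon threshold, under simplifying assumptions that are not from the slides: only the forward strand is scanned, an ORF starts at ATG, and it ends at the first in-frame stop codon. The names `find_orfs` and `min_codons` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=100):
    """Return (start, end) index pairs of forward-strand ORFs.

    start is the index of the ATG, end is one past the stop codon.
    An ORF is kept if it has at least min_codons codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                       # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None                      # close it, keep scanning
    return orfs


# Toy sequence: ATG + 4 codons + stop, so 5 codons before the stop.
toy = "ATG" + "GCT" * 4 + "TAA"
short_orfs = find_orfs(toy, min_codons=5)
none_found = find_orfs(toy, min_codons=6)
```

With the default threshold of 100 codons, short chance ORFs are filtered out; a real tool would also scan the reverse complement.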

8.3 - BEANBAG GENOMICS : Comparison of two genomes

Page 21: Introduction to

21

8.3 - BEANBAG GENOMICS : Comparison of two genomes

* Intracellular symbionts like CT and CP have lost many genes: they parasitise their host and ‘steal’ the gene products

* CT lives in urinary tracts and CP lives in respiratory tracts

* The differences in their genomes tell us something about the function of their retained genes

* What are suitable algorithmic methods for comparing lost and gained genes in Chlamydia?

--- Organism --------- size (nt) --- ORFs ---

C. trachomatis 1 042 519 916

C. pneumoniae 1 229 853 1048

E. coli - 5000

Page 22: Introduction to

22

8.3 - BEANBAG GENOMICS : Comparison of two genomes

Similarity on a genomic scale

Similarity between pairs of genes informs about:

* Blocks of conserved gene order

* Changes in size of gene families

* Nucleotide substitutions between orthologous genes

Page 23: Introduction to

23

8.3 - BEANBAG GENOMICS : Comparison of two genomes

Central idea for genomic comparison

Define the similarity scores between genomes as follows:

* Take the nucleotide sequences of all genes (ORFs) found in the two genomes

* Fill out a matrix with alignment scores between each possible pair of sequences

* For the Chlamydiae CT and CP this is a 1048x916 matrix

* Use Needleman-Wunsch or BLAST to compute similarity scores
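The matrix described above can be sketched with a plain Needleman-Wunsch global alignment score. The scoring values (match +1, mismatch -1, gap -1) and the function names are assumptions for illustration; the slides do not specify them.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score of sequences a and b."""
    prev = [j * gap for j in range(len(b) + 1)]   # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def similarity_matrix(genes_1, genes_2):
    """scores[i][j] = alignment score of gene i (genome 1) vs gene j (genome 2)."""
    return [[nw_score(g1, g2) for g2 in genes_2] for g1 in genes_1]


# Toy 'genomes' with two short genes each.
scores = similarity_matrix(["ACGT", "TTTT"], ["ACGT", "TTT"])
```

For CT and CP the same loop would fill the 1048x916 matrix; at that scale a heuristic such as BLAST is the practical choice.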

Page 24: Introduction to

24

8.3 - BEANBAG GENOMICS : Comparison of two genomes

Identifying orthologous and paralogous genes

* Use genome similarity matrix to distinguish between paralogs and orthologs

* remember: homologs are genes that have a common ancestor; orthologs arise as homologs evolve in sister species; paralogs arise from duplication and subsequent specialisation

* Result of evolution of homologs and paralogs: no one-to-one relationship, but (many/one)-to-(many/one)

Page 25: Introduction to

25

8.3 - BEANBAG GENOMICS : Comparison of two genomes

Reciprocal similarity

* Recognition of orthologs: Best Reciprocal similarity Hits (BRHs)

* A pair of ORFs is a BRH if each is the other's best match in the other genome (using alignment scores)

* possible: ORFs without BRH

* possible: ORF with an ortholog in the other species and a paralog in the same species.
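A minimal sketch of picking best reciprocal hits out of a similarity matrix. The function name and the toy scores are hypothetical; ties are broken by taking the first maximum, which is an assumption.

```python
def best_reciprocal_hits(scores):
    """Return (i, j) pairs where ORF i and ORF j are each other's best match.

    scores[i][j] is the similarity of ORF i in genome 1 to ORF j in genome 2.
    """
    n_rows, n_cols = len(scores), len(scores[0])
    # Best partner in genome 2 for every ORF of genome 1.
    best_for_row = [max(range(n_cols), key=lambda j: scores[i][j])
                    for i in range(n_rows)]
    # Best partner in genome 1 for every ORF of genome 2.
    best_for_col = [max(range(n_rows), key=lambda i: scores[i][j])
                    for j in range(n_cols)]
    return [(i, j) for i, j in enumerate(best_for_row) if best_for_col[j] == i]


toy_scores = [
    [9, 2, 1],
    [3, 8, 7],
    [2, 6, 5],
]
brh_pairs = best_reciprocal_hits(toy_scores)
```

Here ORF 2 of genome 1 prefers ORF 1 of genome 2, but that ORF prefers ORF 1 of genome 1, so ORF 2 is left without a BRH, as the slide allows.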

Page 26: Introduction to

26

8.3 - BEANBAG GENOMICS : Comparison of two genomes

EXAMPLE 8.1: Homology in Chlamydia

* Similarity matrix → BRH → orthologs

* With threshold = 100 codons we find 1964 ORFs (CT: 916 and CP: 1048)

* Among these 1964 ORFs are 728 ortholog pairs

* Also 126 pairs of paralogs (CT: 56, CP: 70)

* These paralogs are more similar to each other than to orthologs → result of duplication after the species split

* The remaining 13% (=253 ORFs) are perhaps older paralogs that have been lost in the other species due to specialisation

Page 27: Introduction to

27

8.3 - BEANBAG GENOMICS : Comparison of two genomes

Identifying gene families

* Defining a gene family is a tricky thing; at some high level all genes are ‘family’

* The very first DNA-based organism, ancestral to all present living beings, had a set of genes.

* All (?) genes have derived from these ancestral genes through duplication and subsequent specialization.

* Genes that cooperate tend to move close together (Dawkins: like rowers in a rowing boat)

Page 28: Introduction to

28

8.3 - BEANBAG GENOMICS : Comparison of two genomes

Identifying gene families

* Practical solution: only consider genes that are > 50% similar; they are ‘closely’ related and probably have a similar ‘function’

* Method for finding ‘similar’ genes: clustering

* Drawback: all clustering methods have some degree of arbitrariness

* Cluster both genomes simultaneously, then count #genes in each cluster (=gene family).

Page 29: Introduction to

29

8.3 - BEANBAG GENOMICS : Comparison of two genomes

Identify gene families with Hierarchical Clustering

* input: genome similarity matrix d

* method: cluster d with NJ- or UPGMA-algorithm to group the genes in families

* This is called Hierarchical Clustering (HC)
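A small sketch of average-linkage (UPGMA-style) clustering that stops merging at a cutoff, so that each remaining cluster is one gene family. It assumes the similarity matrix has already been converted into distances between 0 and 1 (for example 1 minus fractional identity), so a cutoff of 0.5 corresponds to the 50% similarity rule; the function name and toy values are hypothetical.

```python
def average_linkage_families(dist, cutoff):
    """Merge the two closest clusters until the closest pair exceeds cutoff."""
    clusters = [[i] for i in range(len(dist))]

    def cluster_distance(c1, c2):
        # UPGMA: mean of all pairwise distances between the two clusters.
        return sum(dist[a][b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = [(cluster_distance(clusters[x], clusters[y]), x, y)
                 for x in range(len(clusters))
                 for y in range(x + 1, len(clusters))]
        d, x, y = min(pairs)
        if d > cutoff:
            break                                  # remaining clusters are families
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Four toy genes: 0 and 1 are close, 2 and 3 are close, the two groups are far apart.
toy_dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
families = average_linkage_families(toy_dist, cutoff=0.5)
family_sizes = [len(f) for f in families]
```

The choice of cutoff is exactly the arbitrariness mentioned on the previous slide: a different value gives different families.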

Page 30: Introduction to

30

Hierarchical Clustering on both Chlamydiae

Application of HC on Chlamydia reveals a large number of small gene families and a small number of large families

Page 31: Introduction to

31

8.3 - BEANBAG GENOMICS : Comparison of two genomes

Similar function of gene families

Largest gene families in C. trachomatis and C. pneumoniae:

--- CT --- CP --- Function ---

12 12 ABC transporters

6 15 G family outer membrane protein

9 10 Function not known

9 10 Function not known

ABC transporters are transmembrane proteins with binding sites on both sides: they play a major role in transport into and out of the cell

They are very old: they are (near) identical in all organisms

Page 32: Introduction to

32

ABC transporter · Hydrophobicity

ABC transport ATP-binding cassette

Page 33: Introduction to

33

8.3 - BEANBAG GENOMICS : Comparison of two genomes

Schematic of the E. coli vitamin B12 importer system.

Page 34: Introduction to

34

8.3 - BEANBAG GENOMICS : Comparison of two genomes

Snapshots of pore formation in the bilayer, with an applied field of 0.5 V/nm in the presence of 1 M NaCl

Page 35: Introduction to

35

8.3 - BEANBAG GENOMICS : Comparison of two genomes

Alternative approaches to finding orthologs

* Clustering of genes in families has some arbitrariness

* Orthologous genes are separated by a speciation event

* Thus, a phylogenetic tree is also a useful metaphor

* A phylogenetic tree is a better representation, but it is less amenable to an automated analysis

Page 36: Introduction to

36

Introduction to Bioinformatics LECTURE 8: WHOLE GENOME COMPARISONS

8.4 Synteny

* In Section 8.3 the emphasis was on analysis of genes, here the emphasis is on chromosomes

* syn- = together, tenia = ribbon, band

* synteny : the relative ordering of genes on the same chromosomes

Page 37: Introduction to

37

Introduction to Bioinformatics 8.4 – SYNTENY

Major mechanisms of reshuffling of synteny are: inversions and transpositions

Noise on synteny is caused by: insertions, duplications, and deletions

Blocks of synteny: long stretches of DNA where the relative ordering of orthologous genes is conserved

Synteny allows for annotation of non-coding sequences and identification of homologous intergenic regions

Page 38: Introduction to

38

Introduction to Bioinformatics 8.4 – SYNTENY

Page 39: Introduction to

39

Introduction to Bioinformatics 8.4 – SYNTENY

Cat on Human

Conserved synteny map

Page 40: Introduction to

40

Introduction to Bioinformatics 8.4 – SYNTENY

Visualising Synteny

Dot-plot:

* x-axis = position on genome_1,

* y-axis = position on genome_2,

* For a homologous gene with genome_1-position x, and genome_2-position y: put a dot ‘*’ on (x,y)
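A dot-plot can be sketched as a small character grid. In this illustration gene positions are simply gene indices, and the list of homologous pairs is a made-up example with one conserved block and one inverted block; none of these values come from the Chlamydia data.

```python
def dot_plot(pairs, n_1, n_2):
    """Return text rows of a dot-plot; x = index on genome_1, y = index on genome_2.

    Rows are ordered so that y increases upward, as on a normal plot.
    """
    grid = [["."] * n_1 for _ in range(n_2)]
    for x, y in pairs:
        grid[n_2 - 1 - y][x] = "*"
    return ["".join(row) for row in grid]


# Genes 0-2 keep their order; genes 3-5 sit in an inverted block.
homolog_pairs = [(0, 0), (1, 1), (2, 2), (3, 5), (4, 4), (5, 3)]
rows = dot_plot(homolog_pairs, n_1=6, n_2=6)
print("\n".join(rows))
# ...*..
# ....*.
# .....*
# ..*...
# .*....
# *.....
```

Conserved gene order shows up as a rising diagonal, while an inversion shows up as a falling diagonal.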

Page 41: Introduction to

41

Page 42: Introduction to

42

Introduction to Bioinformatics 8.4 – SYNTENY

Chlamydia Synteny

The high level of conservation in Chlamydia is remarkable but typical for all intracellular symbionts

For instance Buchnera aphidicola, an intracellular symbiont in pea aphids, has retained synteny for 50 million years.

Page 43: Introduction to

43

Introduction to Bioinformatics 8.4 – SYNTENY

Buchnera aphidicola, an endosymbiont in pea aphids, has retained synteny for 50 million years.

Pea aphid; Buchnera aphidicola

Page 44: Introduction to

44

Introduction to Bioinformatics 8.4 – SYNTENY

SYMBIOSIS BETWEEN PEA APHIDS AND BUCHNERA

Plant sap contains little protein and aphids cannot produce ten essential amino acids

The required amino acids come from their symbiotic friends, the bacterium Buchnera aphidicola.

Page 45: Introduction to

45

Introduction to Bioinformatics 8.4 – SYNTENY

SYMBIOSIS BETWEEN PEA APHIDS AND BUCHNERA

The symbiont's genome reflects this biosynthetic activity.

Buchnera aphidicola carries the two genes trpEG for tryptophan synthesis. Each bacterium contains three or four plasmids that contain four tandem repeats of these genes, resulting in 12 to 16 copies of trpEG.

Thus, the symbiont supplies the host with the essential amino acids and receives free nutrients and shelter.

Page 46: Introduction to

46

Introduction to Bioinformatics 8.4 – SYNTENY

The relation between pea aphids and Buchnera aphidicola is very old ….

Page 47: Introduction to

47

Introduction to Bioinformatics 8.4 – SYNTENY

Endosymbionts and Synteny

Buchnera aphidicola has retained synteny for 50 million years.

IN GENERAL: The cloistered lifestyle of endo-symbiotic organisms shields them from viruses and other bacteria that may induce gene rearrangement

Page 48: Introduction to

48

Introduction to Bioinformatics 8.4 – SYNTENY

Homologous intergenic regions and ‘phylogenetic footprinting’

Intergenic regions are not selected for → fast evolution

Non-protein coding regions of the genome that are conserved are suspicious: they may be RNA-coding or regulatory sequences

Using syntenic coding regions as anchors we can find intergenic regions that are highly conserved.

This is called phylogenetic footprinting

Page 49: Introduction to

49

Figure: syntenic ORF pairs CT671-CP1949, CT672-CP1950 and CT673-CP1951, with the intergenic regions between them

Page 50: Introduction to

50

Introduction to Bioinformatics 8.4 – SYNTENY

A metric for the syntenic distance

Two genomes can be formed by many smaller syntenic blocks rearranged by inversions or transpositions

Can we define a metric for the syntenic distance of these two genomes?

We are not interested in nucleotide differences but in the number of genomic rearrangements that separate the two species.

METRIC = smallest number of operations (=inversion or transposition) that transform one genome into the other

Page 51: Introduction to

51

Introduction to Bioinformatics 8.4 – SYNTENY

EXAMPLE: Sorting by reversals

As an example let us consider only the case of inversions

“sorting by reversals” = minimum number of inversions to transform one genome into the other

Algorithm: given a permutation of N numbers, find the shortest series of reversals that sorts them back into their original order

Page 52: Introduction to

52

Introduction to Bioinformatics 8.4 – SYNTENY

EXAMPLE: Sorting by reversals

3 2 1 4 8 7 6 5 9

1 2 3 4 8 7 6 5 9

1 2 3 4 5 6 7 8 9

2 reversals → syntenic distance = 2
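The two reversals in this example can be checked with a few lines of code. The helper name `reverse_segment` is hypothetical, and indices are 0-based and inclusive.

```python
def reverse_segment(genome, i, j):
    """Return a copy of genome with positions i..j (0-based, inclusive) reversed."""
    return genome[:i] + genome[i:j + 1][::-1] + genome[j + 1:]


start = [3, 2, 1, 4, 8, 7, 6, 5, 9]
step_1 = reverse_segment(start, 0, 2)    # undo the inversion of 1 2 3
step_2 = reverse_segment(step_1, 4, 7)   # undo the inversion of 5 6 7 8
print(step_1)  # [1, 2, 3, 4, 8, 7, 6, 5, 9]
print(step_2)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```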

Page 53: Introduction to

53

Introduction to Bioinformatics 8.4 – SYNTENY

Sorting by reversals

NOTE:

In practice we do not know the original genome, so we select either one of the two as ‘the’ standard

Page 54: Introduction to

54

Introduction to Bioinformatics 8.4 – SYNTENY

SIMPLE REVERSAL ALGORITHM

STEP 1: designate one sequence as the standard s and the other as t

STEP 2: i=1, increase(i) until s(i) ≠ t(i) or i=length(t)

STEP 3: j=i; increase(j) until t(j) = s(i); reverse(t(i:j))

STEP 4: i=j+1; if i=length(t), stop, else goto STEP-2
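The steps above can be sketched as follows. This is a variant, not a literal transcription: after a reversal it resumes at position i+1 instead of j+1, an assumption made so that the sketch also finishes when reversals overlap. It assumes s and t contain the same distinct elements, and it does not guarantee the minimum number of reversals. The function name is hypothetical.

```python
def simple_reversal_sort(s, t):
    """Greedily turn t into s by reversals; return (reversals, result).

    Each reversal is a pair (i, j) of 0-based inclusive positions in t.
    """
    t = list(t)                      # work on a copy
    reversals = []
    for i in range(len(s)):
        if t[i] != s[i]:
            # Positions before i already agree, so s[i] must sit at some j > i.
            j = t.index(s[i], i)
            t[i:j + 1] = t[i:j + 1][::-1]
            reversals.append((i, j))
    return reversals, t


standard = [1, 2, 3, 4, 5, 6, 7, 8, 9]
other = [3, 2, 1, 4, 8, 7, 6, 5, 9]
steps, result = simple_reversal_sort(standard, other)

# A case where the second reversal overlaps the first one.
overlap_steps, overlap_result = simple_reversal_sort([1, 2, 3], [3, 1, 2])
```

On the example from the previous slides this finds the same two reversals, (0, 2) and (4, 7). The second call shows a case in which the reversals overlap.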

Page 55: Introduction to

55

Introduction to Bioinformatics 8.4 – SYNTENY

SIMPLE REVERSAL ALGORITHM

QUESTION:

Can this algorithm solve overlapping reversals?

REMARK :

Involving transpositions is even more complex …

Page 56: Introduction to

56

END of LECTURE 8