Andrea Pagnani: Database of coding sequences and rational organization of knowledge. DATA BEERS, Cortile del Maglio, 16 October 2014

Database of coding sequences and rational organization of knowledge


Page 1: Database of coding sequences and rational organization of knowledge

Andrea Pagnani

Database of coding sequences and rational organization of knowledge

DATA BEERS

Cortile del Maglio, 16 October 2014

Page 2: Database of coding sequences and rational organization of knowledge

Central Dogma of Molecular Biology (outdated!)

DNA → mRNA → Protein
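The DNA → mRNA → Protein flow can be illustrated with a minimal Python sketch. It assumes the input is the coding (sense) strand, uses the standard genetic code, and ignores everything the "outdated" label hints at (splicing, regulation, reverse transcription). The function names and the toy sequence are illustrative and do not come from the slides.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G codon order.
# '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA matching a coding-strand DNA sequence (T -> U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Toy (hypothetical) coding sequence: start codon, one alanine codon, stop codon.
mrna = transcribe("ATGGCCTAA")
protein = translate(mrna)
```

Here `protein` is a string of one-letter amino acid codes, which is the representation used throughout protein sequence databases.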

Page 3: Database of coding sequences and rational organization of knowledge

Proteins are chains of amino acids (AA)

normally 50-1,000 amino acids

longest AA sequence in UniProt: 36,805 AAs

Tuesday, November 2, 2010

Page 4: Database of coding sequences and rational organization of knowledge

Amino acids

[Figure: general structure of an amino acid, with the amino group, the carboxyl group, and the side chain attached to the central carbon]

- 20 standard amino acids in proteins

- physical-chemical amino acid properties determined by side chains


Page 5: Database of coding sequences and rational organization of knowledge

α-helices

• hydrogen bonds between C=O of residue n and NH of residue n+4
• 3.6 residues / turn
• rise 1.5 Å / residue
• 4-40 residues long

Secondary structure elements
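The helix parameters quoted on this slide fix its geometry. The sketch below derives the pitch, the rotation per residue, and the axial length from the two numbers on the slide (3.6 residues per turn, 1.5 Å rise per residue). The function names are illustrative, and the helix is treated as ideal.

```python
RESIDUES_PER_TURN = 3.6  # from the slide
RISE_PER_RESIDUE = 1.5   # angstrom per residue, from the slide


def helix_pitch() -> float:
    """Axial advance of one full turn, in angstrom."""
    return RESIDUES_PER_TURN * RISE_PER_RESIDUE


def rotation_per_residue() -> float:
    """Rotation around the helix axis between consecutive residues, in degrees."""
    return 360.0 / RESIDUES_PER_TURN


def helix_length(n_residues: int) -> float:
    """Approximate axial length of an ideal helix of n_residues, in angstrom."""
    return n_residues * RISE_PER_RESIDUE


# Range quoted on the slide: helices are 4-40 residues long.
shortest = helix_length(4)
longest = helix_length(40)
```

With these numbers one turn advances about 5.4 Å along the axis, and the 4-40 residue range corresponds to roughly 6-60 Å.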


Page 6: Database of coding sequences and rational organization of knowledge

β-strand / sheet

• strands of 5-10 amino acids
• H bonds between neighboring strands
• strands can be parallel / antiparallel

Secondary structure elements


Page 7: Database of coding sequences and rational organization of knowledge

Large proteins frequently show modular organization

• parts of the AA chain (~40-400 AAs) fold independently into stable structures

• linked by (relatively) unstructured linker regions

• different domains are usually associated with different functions

‣ allows for complex protein functionality

• alternative to protein complexes; the solution adopted varies from species to species

Protein domains as modular protein units


Page 8: Database of coding sequences and rational organization of knowledge

Proteome by the numbers I

20,000 ~ 30,000: number of distinct proteins in humans
50,000 ~ 500,000: number of distinct alternative splicing isoforms

14 Gigabyte: size of the db (2011)
21 Gigabyte: size of the db (2014)
30 million: number of sequences (2011)
49 million: number of sequences (2013)
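The database figures above imply a growth rate that can be worked out directly. The sketch assumes steady compound growth between the two quoted years, which is a simplification and not something the slide states. The helper name is illustrative.

```python
def annual_growth_factor(start: float, end: float, years: float) -> float:
    """Constant yearly multiplier that takes `start` to `end` in `years` years."""
    return (end / start) ** (1.0 / years)


# Figures quoted on the slide.
size_growth = annual_growth_factor(14, 21, 2014 - 2011)        # db size in gigabytes
count_growth = annual_growth_factor(30e6, 49e6, 2013 - 2011)   # number of sequences

print(f"size grows about {100 * (size_growth - 1):.0f}% per year")
print(f"sequence count grows about {100 * (count_growth - 1):.0f}% per year")
```

Under this assumption the sequence count grows noticeably faster than the size on disk. The two pairs of figures cover different year ranges, so the comparison is only indicative.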

Complete genome sequencing projects

Page 9: Database of coding sequences and rational organization of knowledge

Proteome by the numbers II

Sequencing centre

Page 10: Database of coding sequences and rational organization of knowledge

The Tree of Life

Page 11: Database of coding sequences and rational organization of knowledge

Protein families have a phylogenetic structure

RAS superfamily

Page 12: Database of coding sequences and rational organization of knowledge

3. Segment shuffling: two or more existing genes can be broken and rejoined to make a hybrid gene consisting of DNA segments that originally belonged to separate genes.

4. Horizontal (intercellular) transfer: a piece of DNA can be transferred from the genome of one cell to that of another, even to that of another species. This process is in contrast with the usual vertical transfer of genetic information from parent to progeny.

Each of these types of change leaves a characteristic trace in the DNA sequence of the organism, providing clear evidence that all four processes have occurred. In later chapters we discuss the underlying mechanisms, but for the present we focus on the consequences.

Gene Duplications Give Rise to Families of Related Genes Within a Single Cell

A cell duplicates its entire genome each time it divides into two daughter cells. However, accidents occasionally result in the inappropriate duplication of just part of the genome, with retention of original and duplicate segments in a single cell. Once a gene has been duplicated in this way, one of the two gene copies is free to mutate and become specialized to perform a different function within the same cell. Repeated rounds of this process of duplication and divergence, over many millions of years, have enabled one gene to give rise to a family of genes that may all be found within a single genome. Analysis of the DNA sequence of procaryotic genomes reveals many examples of such gene families: in Bacillus subtilis, for example, 47% of the genes have one or more obvious relatives (Figure 1–24).

When genes duplicate and diverge in this way, the individuals of one species become endowed with multiple variants of a primordial gene. This evolutionary


Figure 1–23 Four modes of genetic innovation and their effects on the DNA sequence of an organism. A special form of horizontal transfer occurs when two different types of cells enter into a permanent symbiotic association. Genes from one of the cells then may be transferred to the genome of the other, as we shall see below when we discuss mitochondria and chloroplasts.

[Figure 1–23 diagram: original genome vs. genetic innovation for (1) intragenic mutation of a gene, (2) gene duplication, (3) DNA segment shuffling between gene A and gene B, (4) horizontal transfer from organism A to organism B, giving organism B with a new gene]

Mechanisms behind genetic innovation

Page 13: Database of coding sequences and rational organization of knowledge

Structural conservation of domains

Trypsin (bovine) Elastase (pig)

The structure of domain folds within families is frequently more conserved than the amino-acid sequence (~25% sequence identity)

Sequence variability expected to carry information about structure
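The "~25% sequence identity" figure refers to the fraction of aligned positions at which two sequences carry the same amino acid. A minimal sketch of that calculation follows. It assumes the two sequences are already aligned to equal length with `-` as the gap character, and it excludes gapped columns from the denominator, which is one of several conventions in use. The toy sequences are hypothetical, not real trypsin or elastase fragments.

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percent of ungapped aligned columns where both sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Hypothetical aligned fragments, for illustration only.
identity = percent_identity("ACDEFGHIK-", "ACNEFQHLKW")
```

Two domains can score low on this measure and still share the same fold, which is the point the slide makes.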


Page 14: Database of coding sequences and rational organization of knowledge

Different? Maybe in the details ...

We Are All Different in Detail

What precisely do we mean when we speak of the human genome? Whose genome? On average, any two people taken at random differ in about one or two in every 1000 nucleotide pairs in their DNA sequence. The Human Genome Project has arbitrarily selected DNA from a small number of anonymous individuals for sequencing. The human genome (the genome of the human species) is, properly speaking, a more complex thing, embracing the entire pool of variant genes that are found in the human population and continually exchanged and reassorted in the course of sexual reproduction. Ultimately, we can hope to document this variation too. Knowledge of it will help us understand, for example, why some people are prone to one disease, others to another; why some respond well to a drug, others badly. It will also provide new clues to our history: the population movements and minglings of our ancestors, the infections they suffered, the diets they ate. All these things leave traces in the variant forms of genes that have survived in human communities.


Figure 1–53 Human and mouse: similar genes and similar development. The human baby and the mouse shown here have similar white patches on their foreheads because both have mutations in the same gene (called Kit), required for the development and maintenance of pigment cells. (Courtesy of R.A. Fleischman.)

Figure 1–52 Times of divergence of different vertebrates. The scale on the left shows the estimated date and geological era of the last common ancestor of each specified pair of animals. Each time estimate is based on comparisons of the amino acid sequences of orthologous proteins; the longer a pair of animals have had to evolve independently, the smaller the percentage of amino acids that remain identical. Data from many different classes of proteins have been averaged to arrive at the final estimates, and the time scale has been calibrated to match the fossil evidence that the last common ancestor of mammals and birds lived 310 million years ago. The figures on the right give data on sequence divergence for one particular protein (chosen arbitrarily): the α chain of hemoglobin. Note that although there is a clear general trend of increasing divergence with increasing time for this protein, there are also some irregularities. These reflect the randomness within the evolutionary process and, probably, the action of natural selection driving especially rapid changes of hemoglobin sequence in some organisms that experienced special physiological demands. On average, within any particular evolutionary lineage, hemoglobins accumulate changes at a rate of about 6 altered amino acids per 100 amino acids every 100 million years. Some proteins, subject to stricter functional constraints, evolve much more slowly than this, others as much as 5 times faster. All this gives rise to substantial uncertainties in estimates of divergence times, and some experts believe that the major groups of mammals diverged from one another as much as 60 million years more recently than shown here. (Adapted from S. Kumar and S.B. Hedges, Nature 392:917–920, 1998. With permission from Macmillan Publishers Ltd.)

[Figure 1–52 plot: time in millions of years (axis from 0 to 550 in steps of 50) across the geological eras Tertiary, Cretaceous, Jurassic, Triassic, Permian, Carboniferous, Devonian, Silurian, Ordovician, Cambrian, Proterozoic. Divergence dates of the individual pairs are not recoverable from the extracted text.]

Species pair        % amino acids identical in hemoglobin α chain
human/chimp         100
human/orangutan     98
mouse/rat           84
cat/dog             86
pig/whale           77
pig/sheep           87
human/rabbit        82
human/elephant      83
human/mouse         89
human/sloth         81
human/kangaroo      81
bird/crocodile      76
human/lizard        57
human/chicken       70
human/frog          56
human/tuna fish     55
human/shark         51
human/lamprey       35
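The rate quoted in the Figure 1–52 caption (about 6 altered amino acids per 100 amino acids every 100 million years, within one lineage) can be turned into a rough expected identity between two species. The sketch rests on assumptions that are not in the source: both lineages change at the same constant rate, and sites change independently, optionally with a Poisson correction for repeated changes at the same site. The caption itself stresses that real data deviate from any such simple model.

```python
import math

# Fraction of sites altered per lineage per million years (from the caption:
# about 6 per 100 amino acids every 100 million years).
RATE_PER_MY = 6 / 100 / 100


def expected_identity(divergence_time_my: float, poisson_correction: bool = True) -> float:
    """Rough expected % identity between two lineages that split
    divergence_time_my million years ago (assumed model, see lead-in)."""
    # Both lineages accumulate changes independently, hence the factor 2.
    expected_changes = 2 * RATE_PER_MY * divergence_time_my
    if poisson_correction:
        # Probability that a site has seen no change in either lineage.
        return 100.0 * math.exp(-expected_changes)
    # Naive linear estimate, only sensible for short times.
    return max(0.0, 100.0 * (1.0 - expected_changes))


# 310 million years is the mammal/bird calibration point quoted in the caption.
estimate = expected_identity(310)
```

The value in `estimate` is a model output, not a measurement, and should be read as an order-of-magnitude check against the percentages reported in the figure.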

mutations in Kit gene

Page 15: Database of coding sequences and rational organization of knowledge

What do we know

PDB is the largest database for protein structure

Page 16: Database of coding sequences and rational organization of knowledge

What do we NOT know

A very long list of items but, to my taste, the top one is Protein-Protein Interactions