19
Bioinformatics Chapter 1. Introduction (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2 Outline ! Biological Data in Digital Symbol Sequences ! Genomes – Diversity, Size, and Structure ! Proteins and Proteomes ! On the Information Content of Biological Sequences ! Prediction of Molecular Function and Structure

Introduction · 2015-11-24 · Bioinformatics Chapter 1. Introduction (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2 Outline! Biological Data in Digital Symbol Sequences! Genomes

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Introduction · 2015-11-24 · Bioinformatics Chapter 1. Introduction (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2 Outline! Biological Data in Digital Symbol Sequences! Genomes

Bioinformatics Chapter 1.

Introduction

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2

Outline

! Biological Data in Digital Symbol Sequences

! Genomes – Diversity, Size, and Structure

! Proteins and Proteomes

! On the Information Content of Biological Sequences

! Prediction of Molecular Function and Structure

Page 2: Introduction · 2015-11-24 · Bioinformatics Chapter 1. Introduction (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2 Outline! Biological Data in Digital Symbol Sequences! Genomes

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 3

! A liger is the result of breeding a male lion with a female tiger. It hasstripes and spots.

! All ligers are presumed to be born sterile like mule. This is not unusualfor hybrids.

! Extravagant combination of genomes: this organism is unable tocontribute to further to the evolution of the gene pool.

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 4

1. Biological Data in Digital Symbol Sequence

! Digital nature of geneticdata/biological sequence

! DNA sequence:4 nucleotides (A, T, G, C)

! Protein sequence:20 amino acids

Structure of DNA

Page 3: Introduction · 2015-11-24 · Bioinformatics Chapter 1. Introduction (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2 Outline! Biological Data in Digital Symbol Sequences! Genomes

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 5

! Central Dogma4 Information flow

from DNA to protein

4Proteins aresynthesized based onthe information ofDNA

4DNA: informationstorage

4RNA: informationintermediate

4Protein: variouscellular functions

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 6

! RNA transcription4Newly synthesized RNAs have

complementary base sequencesto template DNAs. (A – U, C –G)

4Three major RNA classes:

4mRNA: protein coding

4 tRNA: at translation process,brings amino acids to ribosomes

4 rRNA: ribosome component

4This transcription process isperformed by RNA polymerase.

Page 4: Introduction · 2015-11-24 · Bioinformatics Chapter 1. Introduction (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2 Outline! Biological Data in Digital Symbol Sequences! Genomes

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 7

! Translation: synthesis ofproteins4Ribosome, tRNA and other

components carry out thisprocess.

4Codon (mRNA): anticodon(tRNA) recognition

4Codon triplelets give morediverse compositions

(4 nucleotide compositions =give different 64 codons)

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 8

! Post translational proteinfolding4Proteins exist not as linear

form but as complex 3D form.4Proteins are folded by many

molecular interactions(hydrophobic interactions,disulfied bond, ionic bond…)

4Various molecules likechaperone help proteinfolding.

4This protein conformationconstruction is determined byinformation on its sequence.

Page 5: Introduction · 2015-11-24 · Bioinformatics Chapter 1. Introduction (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2 Outline! Biological Data in Digital Symbol Sequences! Genomes

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 9

Why Probabilistic Models?

! This course focuses on the probabilistic models ofsequences. Why?

! While the goal is to study a particular sequence and itsmolecular structure and function, the analysis typicallyproceeds through the study of an ensemble of sequencesconsisting of its different versions in different species ordifferent versions in the same species.

! Thus, comparison of sequence patterns across species musttake into account that biological sequences are inherentlynoisy, the variability resulting in part from random eventsamplified by evolution.

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 10

1.1 Database Annotation Quality

! High error rate after handling of information4Handling by a diverse group of people

4 Incorrect assignments by experimentalists: simple consistency check

4Features are indicated by listing the relevant positions in numeric form.

4Fuzzy statements in annotation: comments like “potential” or “probable”

! Bioinformaticians4Should consider potential sources of errors when creating machine-

learning approaches for prediction and classification.

4Prepare the data carefully

4Discard data from unclear sources

Page 6: Introduction · 2015-11-24 · Bioinformatics Chapter 1. Introduction (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2 Outline! Biological Data in Digital Symbol Sequences! Genomes

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 11

1.1 Database Annotation Quality

! Machine learning techniques4can handle noise if large corpora of sequences are available.

! Learning from DNA sequences as language acquisition4 Infants can detect linguistic regularities and learn simple statistics

for the recognition of word boundaries

4Learning techniques are useful for revealing similar regularities(“grammar” or rules) in genomic data (“sentences” or examples).

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 12

1.2 Database Redundancy

! Redundancy: sources of error4Data for large families of closely related sequences" biased and over-represent features

4Apparent correlations between different positions in the sequencesmay be artifact of biased sampling of the data.

4Predictive performance may be overestimated if training/test dataare closed related.

! Over-representation problem4 It is necessary to avoid too closely related sequences in a data set.4Keep all sequences in a data set and assign weights according to

their novelty4Redundancy reduction

• arbitrary similarity threshold• “representative” data set

Page 7: Introduction · 2015-11-24 · Bioinformatics Chapter 1. Introduction (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2 Outline! Biological Data in Digital Symbol Sequences! Genomes

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 13

2. Genomes – Diversity, Size, and Structure

! Genomes: total genetic material within a cell or organism

DNA (e.g. cellular organism) or RNA (e.g. HIV virus)

! Genome size: 5,386bp (bacteriophage ϕX174)

~ 3Gbp (homo sapiens)

• Gene numbers in several organisms

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 14

2. Genomes – Diversity, Size, and Structure

! Genome size does not directly relate to cell complexity4Many plant and amphibian genomes are larger than the human

genome.

4Cell complexity is related to gene numbers or gene-to-geneinteractions.

4Most regions of genome are not expressive units.

Gene: expressive segment - codingproteins or functional RNAmolecules

Page 8: Introduction · 2015-11-24 · Bioinformatics Chapter 1. Introduction (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2 Outline! Biological Data in Digital Symbol Sequences! Genomes

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 15

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 16

Page 9: Introduction · 2015-11-24 · Bioinformatics Chapter 1. Introduction (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2 Outline! Biological Data in Digital Symbol Sequences! Genomes

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 17

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 18

Page 10: Introduction · 2015-11-24 · Bioinformatics Chapter 1. Introduction (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2 Outline! Biological Data in Digital Symbol Sequences! Genomes

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 19

3. Proteins and Proteomes

3.1 From Genome to Proteome! Gene : genome = protein : proteome

! Proteome: total protein expression of a set of chromosomes

! Protein: dynamic structure with various modification

! Proteome analysis:4Not only dealing with sequences and functions but also concerning

biochemical states of proteins in its posttranslational form

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 20

3.2 Protein Length Distributions

! Amino acid sequences in protein direct its structure whichin turn influences its functions.

! To make a suitable structure, regularity of amino acidsequences does exist.

! Statistical analysis has played a major role in proteinsequence analysis.

Sequence ofhemoglobin andits structure

Page 11: Introduction · 2015-11-24 · Bioinformatics Chapter 1. Introduction (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2 Outline! Biological Data in Digital Symbol Sequences! Genomes

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 21

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 22

Some Observations

! It is observed that the peaks for the prokaryote H.influenzae and the eukaryote S. cerevisiae are positioned indifferent intervals: at 140-160 and 100-120, respectively.

! Further studies show that a eukaryotic distribution from awide range of species peaks at 125 amino acids and thatthe distribution displays a periodicity based on this sizeunit.

! Interestingly, the distribution for the achaeon M. jannaschiiis sort of inbetween the H. influenzae (eukaryote) and the S.cerevisiae (prokaryote) distributions.

Page 12: Introduction · 2015-11-24 · Bioinformatics Chapter 1. Introduction (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2 Outline! Biological Data in Digital Symbol Sequences! Genomes

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 23

• Tree of life: Three great kingdoms

4 Drawn by sequence analysis of rRNA

4 Three kingdoms: bacteria, archaea, eukarya

* Former five kingdoms: Animalia, Plantae, Fungi, Protista, Monea

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 24

3.3 Protein Function

! Protein function4 Does not depend on whole structure

4 Determined mainly by local sequences

(e.g. active site of enzyme)

! End of human genome project: finding genes, measuring itsfunction and activity and many other analysis are required.4 Prediction by experiment: time costing and hard job & non-real

environment (under laboratory environment)

4 Many computational approaches can be helpful for protein analysis.

Page 13: Introduction · 2015-11-24 · Bioinformatics Chapter 1. Introduction (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2 Outline! Biological Data in Digital Symbol Sequences! Genomes

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 25

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 26

Page 14: Introduction · 2015-11-24 · Bioinformatics Chapter 1. Introduction (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2 Outline! Biological Data in Digital Symbol Sequences! Genomes

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 27

4. On the Information Content of BiologicalSequences

! Concept of information and its quantification is essentialfor the basic principles of machine-learning approaches inmolecular biology

! Data-driven prediction methods can extract essentialfeatures from individual examples and discard unwantedinformation.

! Machine-learning techniques are excellent for the task ofdiscarding of and compacting redundant sequenceinformation

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 28

! Information reduction4Key feature in the understanding of almost any kind of system

4Create simple representation of a sequence space " much morepowerful and useful than the original data that contain all details

! Compression rate4Ratio between size of the an encoded corpus of sequences and the

original corpus of sequences

4Global quantifier of the degree of regularity in the data

O

EC

S

SR =

Page 15: Introduction · 2015-11-24 · Bioinformatics Chapter 1. Introduction (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2 Outline! Biological Data in Digital Symbol Sequences! Genomes

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 29

4.2 Alignment vs. Prediction: When areAlignment Reliable?

! New sequences are aligned against all sequences in DB4Hints toward structural/functional relationships + functional insights

! Fundamental question4When is the sequence similarity high enough that one may infer a

structural/functional similarity from the pairwise alignment of twosequences?

4Given the detected overlap in a sequence segment, can a similaritythreshold be defined that sifts out cases where the inference will bereliable?

! Answer4 It depends on the structural/functional aspect one wants to investigate4Different for each task

! Prediction4When alignment alone is not enough to lead a reliable inference

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 30

BLAST search result

Query: Soy bean (plant)leghemoglobin

Database: homo sapiens

Alignment result shows twomerely matched sequences, buttheir functions and structures aresurprisingly coincided.

Page 16: Introduction · 2015-11-24 · Bioinformatics Chapter 1. Introduction (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2 Outline! Biological Data in Digital Symbol Sequences! Genomes

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 31

4.3 Prediction of Functional Features

! Two protein sequences share sequence similarity4These proteins share common function?

! New sequence identity threshold is required for predictionof function4Threshold used for structural problems can not be used.

! Solution4Split each sequence into a number of subsequences

4The fraction of aligned site per alignment

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 32

4.4 Global and Local Alignments andSubstitution Matrix Entropies

! Pairwise alignment4There is no optimal answer (actually we don’t know optimal

alignment): changed by various options

4Dynamic programming: computational demanding job

4Heuristic approaches: BLAST, FASTA

! Substitution matrix4Set of scores sij of replacing amino acid i by amino acid j

4Generated from simplified protein evolution model involvingamino acid frequencies, pi

)(ln1

ji

ijij

pp

qs

λ=

≠=

==jiq

jiqqp iji

for

for

20

1

Page 17: Introduction · 2015-11-24 · Bioinformatics Chapter 1. Introduction (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2 Outline! Biological Data in Digital Symbol Sequences! Genomes

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 33

PAM250 Matrix

Matches involving rare amino acids count more than common aminoacids

Mismatches between similar amino acids have higher score thanbetween distinct amino acids

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 34

4.5 Consensus Sequences and Sequence Logos

! Creating consensus sequences4From alignments4Choose most common nucleotides or amino acids4Can be highly misleading

! Sequence logo4Based on Shannon information content at each position4Emphasize the deviation from the uniform or flat distribution

4Displays a significant degree of deviation from flat distribution

20or4||

)(log)(||log)(||

122

=

+= ∑=

A

ipipAiDA

k

kk

Page 18: Introduction · 2015-11-24 · Bioinformatics Chapter 1. Introduction (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2 Outline! Biological Data in Digital Symbol Sequences! Genomes

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 35

• Initiation codon (ATG) is very well conserved.

• 5’ to the initiation codons (-12~-7) are conserved also.

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 36

! Slight variation of the logo formula

4Based on relative entropy

4Quantification of the contrast between the observed probabilitiesP(i) and a reference probability distribution Q(i)

4Q: depends on the position i in the alignment

∑=

==||

1 )(

)(log)())(),(()(

A

k k

kk

iq

ipipiQiPHiH

Page 19: Introduction · 2015-11-24 · Bioinformatics Chapter 1. Introduction (C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2 Outline! Biological Data in Digital Symbol Sequences! Genomes

(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 37

5. Prediction of Molecular Function and Structure

! DNA, RNA and protein sequence analysis based onmachine-learning approaches can be helpful for manybiological problems.4Finding intron splice sites in mRNA processing4Gene finding4Recognition of promoter and other regulatory regions4Prediction of gene expression levels4Prediction of DNA bending and bendability4Sequence clustering4RNA secondary structure prediction4Protein function/structure prediction4Protein family classification