Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Bioinformatics Chapter 1.
Introduction
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 2
Outline
! Biological Data in Digital Symbol Sequences
! Genomes – Diversity, Size, and Structure
! Proteins and Proteomes
! On the Information Content of Biological Sequences
! Prediction of Molecular Function and Structure
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 3
! A liger is the result of breeding a male lion with a female tiger. It hasstripes and spots.
! All ligers are presumed to be born sterile like mule. This is not unusualfor hybrids.
! Extravagant combination of genomes: this organism is unable tocontribute to further to the evolution of the gene pool.
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 4
1. Biological Data in Digital Symbol Sequence
! Digital nature of geneticdata/biological sequence
! DNA sequence:4 nucleotides (A, T, G, C)
! Protein sequence:20 amino acids
Structure of DNA
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 5
! Central Dogma4 Information flow
from DNA to protein
4Proteins aresynthesized based onthe information ofDNA
4DNA: informationstorage
4RNA: informationintermediate
4Protein: variouscellular functions
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 6
! RNA transcription4Newly synthesized RNAs have
complementary base sequencesto template DNAs. (A – U, C –G)
4Three major RNA classes:
4mRNA: protein coding
4 tRNA: at translation process,brings amino acids to ribosomes
4 rRNA: ribosome component
4This transcription process isperformed by RNA polymerase.
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 7
! Translation: synthesis ofproteins4Ribosome, tRNA and other
components carry out thisprocess.
4Codon (mRNA): anticodon(tRNA) recognition
4Codon triplelets give morediverse compositions
(4 nucleotide compositions =give different 64 codons)
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 8
! Post translational proteinfolding4Proteins exist not as linear
form but as complex 3D form.4Proteins are folded by many
molecular interactions(hydrophobic interactions,disulfied bond, ionic bond…)
4Various molecules likechaperone help proteinfolding.
4This protein conformationconstruction is determined byinformation on its sequence.
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 9
Why Probabilistic Models?
! This course focuses on the probabilistic models ofsequences. Why?
! While the goal is to study a particular sequence and itsmolecular structure and function, the analysis typicallyproceeds through the study of an ensemble of sequencesconsisting of its different versions in different species ordifferent versions in the same species.
! Thus, comparison of sequence patterns across species musttake into account that biological sequences are inherentlynoisy, the variability resulting in part from random eventsamplified by evolution.
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 10
1.1 Database Annotation Quality
! High error rate after handling of information4Handling by a diverse group of people
4 Incorrect assignments by experimentalists: simple consistency check
4Features are indicated by listing the relevant positions in numeric form.
4Fuzzy statements in annotation: comments like “potential” or “probable”
! Bioinformaticians4Should consider potential sources of errors when creating machine-
learning approaches for prediction and classification.
4Prepare the data carefully
4Discard data from unclear sources
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 11
1.1 Database Annotation Quality
! Machine learning techniques4can handle noise if large corpora of sequences are available.
! Learning from DNA sequences as language acquisition4 Infants can detect linguistic regularities and learn simple statistics
for the recognition of word boundaries
4Learning techniques are useful for revealing similar regularities(“grammar” or rules) in genomic data (“sentences” or examples).
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 12
1.2 Database Redundancy
! Redundancy: sources of error4Data for large families of closely related sequences" biased and over-represent features
4Apparent correlations between different positions in the sequencesmay be artifact of biased sampling of the data.
4Predictive performance may be overestimated if training/test dataare closed related.
! Over-representation problem4 It is necessary to avoid too closely related sequences in a data set.4Keep all sequences in a data set and assign weights according to
their novelty4Redundancy reduction
• arbitrary similarity threshold• “representative” data set
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 13
2. Genomes – Diversity, Size, and Structure
! Genomes: total genetic material within a cell or organism
DNA (e.g. cellular organism) or RNA (e.g. HIV virus)
! Genome size: 5,386bp (bacteriophage ϕX174)
~ 3Gbp (homo sapiens)
• Gene numbers in several organisms
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 14
2. Genomes – Diversity, Size, and Structure
! Genome size does not directly relate to cell complexity4Many plant and amphibian genomes are larger than the human
genome.
4Cell complexity is related to gene numbers or gene-to-geneinteractions.
4Most regions of genome are not expressive units.
Gene: expressive segment - codingproteins or functional RNAmolecules
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 15
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 16
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 17
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 18
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 19
3. Proteins and Proteomes
3.1 From Genome to Proteome! Gene : genome = protein : proteome
! Proteome: total protein expression of a set of chromosomes
! Protein: dynamic structure with various modification
! Proteome analysis:4Not only dealing with sequences and functions but also concerning
biochemical states of proteins in its posttranslational form
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 20
3.2 Protein Length Distributions
! Amino acid sequences in protein direct its structure whichin turn influences its functions.
! To make a suitable structure, regularity of amino acidsequences does exist.
! Statistical analysis has played a major role in proteinsequence analysis.
Sequence ofhemoglobin andits structure
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 21
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 22
Some Observations
! It is observed that the peaks for the prokaryote H.influenzae and the eukaryote S. cerevisiae are positioned indifferent intervals: at 140-160 and 100-120, respectively.
! Further studies show that a eukaryotic distribution from awide range of species peaks at 125 amino acids and thatthe distribution displays a periodicity based on this sizeunit.
! Interestingly, the distribution for the achaeon M. jannaschiiis sort of inbetween the H. influenzae (eukaryote) and the S.cerevisiae (prokaryote) distributions.
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 23
• Tree of life: Three great kingdoms
4 Drawn by sequence analysis of rRNA
4 Three kingdoms: bacteria, archaea, eukarya
* Former five kingdoms: Animalia, Plantae, Fungi, Protista, Monea
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 24
3.3 Protein Function
! Protein function4 Does not depend on whole structure
4 Determined mainly by local sequences
(e.g. active site of enzyme)
! End of human genome project: finding genes, measuring itsfunction and activity and many other analysis are required.4 Prediction by experiment: time costing and hard job & non-real
environment (under laboratory environment)
4 Many computational approaches can be helpful for protein analysis.
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 25
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 26
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 27
4. On the Information Content of BiologicalSequences
! Concept of information and its quantification is essentialfor the basic principles of machine-learning approaches inmolecular biology
! Data-driven prediction methods can extract essentialfeatures from individual examples and discard unwantedinformation.
! Machine-learning techniques are excellent for the task ofdiscarding of and compacting redundant sequenceinformation
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 28
! Information reduction4Key feature in the understanding of almost any kind of system
4Create simple representation of a sequence space " much morepowerful and useful than the original data that contain all details
! Compression rate4Ratio between size of the an encoded corpus of sequences and the
original corpus of sequences
4Global quantifier of the degree of regularity in the data
O
EC
S
SR =
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 29
4.2 Alignment vs. Prediction: When areAlignment Reliable?
! New sequences are aligned against all sequences in DB4Hints toward structural/functional relationships + functional insights
! Fundamental question4When is the sequence similarity high enough that one may infer a
structural/functional similarity from the pairwise alignment of twosequences?
4Given the detected overlap in a sequence segment, can a similaritythreshold be defined that sifts out cases where the inference will bereliable?
! Answer4 It depends on the structural/functional aspect one wants to investigate4Different for each task
! Prediction4When alignment alone is not enough to lead a reliable inference
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 30
BLAST search result
Query: Soy bean (plant)leghemoglobin
Database: homo sapiens
Alignment result shows twomerely matched sequences, buttheir functions and structures aresurprisingly coincided.
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 31
4.3 Prediction of Functional Features
! Two protein sequences share sequence similarity4These proteins share common function?
! New sequence identity threshold is required for predictionof function4Threshold used for structural problems can not be used.
! Solution4Split each sequence into a number of subsequences
4The fraction of aligned site per alignment
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 32
4.4 Global and Local Alignments andSubstitution Matrix Entropies
! Pairwise alignment4There is no optimal answer (actually we don’t know optimal
alignment): changed by various options
4Dynamic programming: computational demanding job
4Heuristic approaches: BLAST, FASTA
! Substitution matrix4Set of scores sij of replacing amino acid i by amino acid j
4Generated from simplified protein evolution model involvingamino acid frequencies, pi
)(ln1
ji
ijij
pp
qs
λ=
≠=
==jiq
jiqqp iji
for
for
20
1
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 33
PAM250 Matrix
Matches involving rare amino acids count more than common aminoacids
Mismatches between similar amino acids have higher score thanbetween distinct amino acids
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 34
4.5 Consensus Sequences and Sequence Logos
! Creating consensus sequences4From alignments4Choose most common nucleotides or amino acids4Can be highly misleading
! Sequence logo4Based on Shannon information content at each position4Emphasize the deviation from the uniform or flat distribution
4Displays a significant degree of deviation from flat distribution
20or4||
)(log)(||log)(||
122
=
+= ∑=
A
ipipAiDA
k
kk
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 35
• Initiation codon (ATG) is very well conserved.
• 5’ to the initiation codons (-12~-7) are conserved also.
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 36
! Slight variation of the logo formula
4Based on relative entropy
4Quantification of the contrast between the observed probabilitiesP(i) and a reference probability distribution Q(i)
4Q: depends on the position i in the alignment
∑=
==||
1 )(
)(log)())(),(()(
A
k k
kk
iq
ipipiQiPHiH
(C) 2001 SNU CSE Artificial Intelligence Lab (SCAI) 37
5. Prediction of Molecular Function and Structure
! DNA, RNA and protein sequence analysis based onmachine-learning approaches can be helpful for manybiological problems.4Finding intron splice sites in mRNA processing4Gene finding4Recognition of promoter and other regulatory regions4Prediction of gene expression levels4Prediction of DNA bending and bendability4Sequence clustering4RNA secondary structure prediction4Protein function/structure prediction4Protein family classification