MATH 5610, Computational Biology
Lecture 2 – Intro to Molecular Biology (cont.)
Stephen Billups
University of Colorado at Denver
MATH 5610, Computational Biology – p.1/24
Announcements
Error on syllabus: class meets 5:30-6:45, not 7:00-8:15.
Office hours for Christiaan Van Woudenberg: TR after class (6:45-7:45)
Key Ideas from Last Lecture
DNA and RNA are strings over a 4-letter alphabet.
Orientation
Complementarity
Central Dogma
Tonight
More Molecular Biology
  Proteins
  Translation & the genetic code
  Genes: reading frames, introns/exons
  Protein Structure and Function
  Hydrophobicity/Hydrophilicity
Sequence Alignment
Proteins
Do most of the “work” in a cell.
  Enzymes: catalysts for chemical reactions.
  Structural Proteins: form cellular structure.
  Regulatory Proteins: control expression of genes or activities of other proteins.
  Transport Proteins: carry molecules across membranes or around the body.
Composed of strings of amino acids.
Translated from RNA by ribosome complexes.
Fold up into a 3-dimensional structure, which largely determines protein function.
Translation
mRNA is translated to form proteins.
Genetic Code: 3 nucleotides translate to 1 amino acid.
  20 amino acids + stop codon = 21 codes needed.
  Question: How many possible sequences of 3 nucleotides are there?
  Answer: 4^3 = 64.
  Duplication: different codons code for the same amino acid.
  Often an error in the 3rd position of a codon results in the same amino acid.
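The counting argument above can be checked directly. The sketch below enumerates all codons and builds the standard genetic code from a compact 64-letter string. It assumes the conventional U, C, A, G ordering, one-letter amino acid codes, and `*` for a stop codon; the names `BASES`, `AMINO`, and `CODON_TABLE` are illustrative and not from the lecture.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, in UCAG order ('*' = stop).
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# All 4^3 sequences of three nucleotides: UUU, UUC, UUA, UUG, UCU, ...
codons = ["".join(c) for c in product(BASES, repeat=3)]
CODON_TABLE = dict(zip(codons, AMINO))

n_codons = len(codons)                        # 64
n_symbols = len(set(CODON_TABLE.values()))    # 20 amino acids + stop = 21

# First-two-base prefixes for which the 3rd base does not matter at all.
prefixes = ["".join(p) for p in product(BASES, repeat=2)]
fourfold = [p for p in prefixes
            if len({CODON_TABLE[p + b] for b in BASES}) == 1]

print(n_codons, n_symbols, len(fourfold))  # 64 21 8
```

For 8 of the 16 possible two-base prefixes, any third base yields the same amino acid, which illustrates why third-position errors are often silent.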
Genes
A gene is a sequence of nucleotides in the DNA molecule coding for one unit of genetic information.
A gene is expressed when that segment of DNA is transcribed into mRNA.
RNA polymerase binds to the DNA molecule and then moves along the DNA, building the complementary RNA molecule.
For binding to occur, there must be a promoter sequence on the DNA, which is a set of specific nucleotide sequences in just the right positions relative to the gene.
Gene expression is regulated by proteins (or RNA molecules) that bind to the promoter regions.
  Sometimes, such a protein makes it easier for RNA polymerase to bind to the DNA and begin transcription (positive regulation).
  Other times, the protein makes it harder (negative regulation).
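The "building the complementary RNA molecule" step can be sketched in a few lines. This is a minimal illustration, not a model of RNA polymerase. It assumes the DNA template strand is written 3'→5', so that the base-by-base complement reads as mRNA in the 5'→3' direction; the function name `transcribe` and the example sequence are made up for this sketch.

```python
# Base pairing between a DNA template and the RNA built from it (U replaces T).
PAIR = {"A": "U", "T": "A", "C": "G", "G": "C"}

def transcribe(template_3to5):
    """Return the mRNA (5'->3') complementary to a DNA template given 3'->5'."""
    return "".join(PAIR[base] for base in template_3to5)

mrna = transcribe("TACGGATTC")
print(mrna)  # AUGCCUAAG
```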
Reading Frames
Recall: nucleotides in the mRNA molecule are translated to proteins in triplets.
If you shifted by one nucleotide, you would get an entirely different amino acid sequence:
  tyr   leu   arg   leu
|-----|-----|-----|-----|
 U A C C U U A G A C U C G
  |-----|-----|-----|-----|
    thr   leu   asp   ser
There are 3 possible reading frames, which correspond to the 3 possible ways of dividing the sequence up into triplets.
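The three reading frames of the example sequence can be translated with a short sketch. It assumes the standard genetic code packed into a 64-letter string (U, C, A, G ordering, one-letter amino acid codes, `*` for stop); `translate_frame` is a hypothetical helper name. In one-letter codes, Y = tyr, L = leu, R = arg, T = thr, D = asp, S = ser, P = pro.

```python
from itertools import product

BASES = "UCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AMINO))

def translate_frame(mrna, frame):
    """Translate every complete triplet of mrna starting at offset frame (0, 1, 2)."""
    return "".join(CODON_TABLE[mrna[i:i + 3]]
                   for i in range(frame, len(mrna) - 2, 3))

mrna = "UACCUUAGACUCG"
frames = [translate_frame(mrna, f) for f in range(3)]
print(frames)  # ['YLRL', 'TLDS', 'P*T']
```

The first two frames reproduce the two readings in the diagram; the third frame runs into a stop codon (UAG) after a single amino acid.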
Open reading frames
The choice of reading frame depends on where the ribosome binds to the mRNA. This always occurs at a start codon (AUG).
Translation begins at a start codon (AUG) and continues until a stop codon is reached (UAA, UAG, or UGA).
An open reading frame (ORF) is a sequence of mRNA beginning with a start codon and ending at the first stop codon encountered.
All proteins are translated from ORFs. But not all ORFs are translated.
Long ORFs are rare, unless the ORF corresponds to an actual protein.
In a random nucleotide sequence, stop codons make up 3/64 of the codons, so the average ORF length is about 64/3 ≈ 21 codons.
Most proteins are hundreds of amino acids long.
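The ORF definition above translates into a simple scan over each reading frame. The sketch below is one straightforward way to do it, not an established gene-finding method; `find_orfs` and the toy sequence are assumptions of this example. Each reported ORF starts at the first AUG seen in its frame and ends at the first in-frame stop codon after it; a start codon with no following stop is not reported.

```python
STOPS = {"UAA", "UAG", "UGA"}

def find_orfs(mrna):
    """Return (start, end, n_codons) for each ORF; end is exclusive and
    includes the stop codon, n_codons counts AUG but not the stop."""
    orfs = []
    for frame in range(3):
        codons = [mrna[i:i + 3] for i in range(frame, len(mrna) - 2, 3)]
        start = None
        for idx, codon in enumerate(codons):
            if start is None and codon == "AUG":
                start = idx                      # open an ORF
            elif start is not None and codon in STOPS:
                orfs.append((frame + 3 * start, frame + 3 * idx + 3, idx - start))
                start = None                     # close it at the first stop
    return orfs

mrna = "GGAUGAAAUGGUAACC"
orfs = find_orfs(mrna)
print(orfs)  # [(2, 14, 3)]

# With independent, uniformly random codons, a stop appears with probability
# 3/64 per codon, so the mean number of codons up to the first stop is 64/3.
expected_codons = 64 / 3
```

Here the only ORF is `mrna[2:14]`, that is `AUGAAAUGGUAA`. The AUG in the second frame is never closed by a stop codon, so it does not count.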
Introns and Exons
In prokaryotes, the mRNA is transcribed directly from the DNA.
In eukaryotes, the mRNA can be modified by splicing before it is translated.
Splicing involves removing internal sequences called introns from the mRNA. Introns do not code for proteins.
The parts of the mRNA that are not removed are called exons. Exons contain the sequence information for the protein.
Alternative splicing: Many RNA sequences can be spliced in multiple ways (called alternative splicings).
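Splicing and alternative splicing can be illustrated on a toy transcript. The sketch assumes exon positions are already known as (start, end) intervals with an exclusive end; the sequences, the `pieces` layout, and the `splice` helper are invented for illustration and do not describe how a cell recognizes splice sites.

```python
def splice(pre_mrna, exons):
    """Join the given exon intervals (start, end), end exclusive, in order."""
    return "".join(pre_mrna[start:end] for start, end in exons)

# Toy transcript: exon, intron, exon, intron, exon.
pieces = [("exon", "AUGGCU"), ("intron", "GUAAGUAG"),
          ("exon", "GAUCCU"), ("intron", "GUCCAG"),
          ("exon", "UUUUAA")]
pre_mrna = "".join(seq for _, seq in pieces)

# Work out where each exon sits in the unspliced transcript.
exons, pos = [], 0
for kind, seq in pieces:
    if kind == "exon":
        exons.append((pos, pos + len(seq)))
    pos += len(seq)

full = splice(pre_mrna, exons)                    # all three exons
skipped = splice(pre_mrna, [exons[0], exons[2]])  # alternative: skip exon 2
print(full)     # AUGGCUGAUCCUUUUUAA
print(skipped)  # AUGGCUUUUUAA
```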
Protein Structure and Function
The 3-D structure of a protein largely determines the function.
Hierarchy of structure:
Primary structure: Linear order of the amino acids.
Secondary structure: Location and direction of common structures called α-helices and β-sheets.
Tertiary structure: The 3-dimensional shape of the protein.
Quaternary structure: The overall 3-D structure of a complex of multiple proteins.
(Image from www.biology.bnl.gov/structure/images/swami_p18.jpg)
Hydrophobicity/Hydrophilicity
Some amino acids have polar side chains (the charge of the molecule is not symmetric). These residues are attracted to water. Such amino acids are called hydrophilic.
Other amino acids have nonpolar side chains. These are called hydrophobic.
Because proteins reside in water, they tend to fold up in ways such that the hydrophilic residues are on the outside and the hydrophobic residues are on the inside.
Topics not covered from Ch. 1
Read about this on your own.
Chemical details.
Molecular Biology Tools
Genomic Information Content (Optional).
Sequence Comparison/Alignment
Dot Plots
Sequence Alignment
Scoring methods
Derivation of scoring matrices
Dynamic Programming
Dot Plot
  A C T C G A G C
A *         *
C   *   *       *
A *         *
G         *   *
T     *
A *         *
G         *   *
C   *   *       *
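A dot plot like the one above marks every cell where the row character equals the column character. The following sketch builds it as plain text; the function name `dot_plot` and the choice of `*` as the mark are assumptions of this example.

```python
def dot_plot(row_seq, col_seq):
    """Return the dot plot as text lines: '*' wherever the characters match."""
    lines = ["  " + " ".join(col_seq)]           # header with the column sequence
    for r in row_seq:
        cells = ["*" if r == c else " " for c in col_seq]
        lines.append((r + " " + " ".join(cells)).rstrip())
    return lines

plot = dot_plot("ACAGTAGC", "ACTCGAGC")
print("\n".join(plot))
```

Runs of marks along a diagonal correspond to stretches where the two sequences agree without gaps, which is what makes the plot useful for spotting candidate alignments by eye.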
Sequence Alignment
Definition: Alignment = pairwise matching between the characters of each sequence.
Often requires inserting gaps into the sequences.
Ex: 2 alignments of the same sequences:

  AATCTATA
  AAG-AT-A

  AATCTATA
  AA--GATA

Which is better?
Types of Alignments
Global – Finds the best alignment of two sequences over their entire lengths.
Semiglobal – Finds the best overlap of two sequences. Doesn’t penalize gaps at the beginning or end.
Local – Finds the best scoring alignment of subsequences.
Multiple Sequence Alignment – Aligns multiple sequences.
Scoring Alignments
Goal: Devise a scoring function for an alignment such that the “best” alignment gets the highest score.
Once the scoring function is defined, we will then be able to devise algorithms to search for the highest scoring alignments.
Simple Example:
Matches = +1
Mismatches = 0
Gaps = -1

  AATCTATA
  AAG-AT-A
  ++0-00-+ = +1

  AATCTATA
  AA--GATA
  ++--0+++ = +3
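The simple scoring scheme above is easy to express as a function over the two gapped rows of an alignment. This is a minimal sketch; `score_alignment` and its keyword defaults are names chosen here, with the defaults set to the values from the example (match +1, mismatch 0, gap -1).

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1):
    """Sum column scores for two equal-length gapped strings ('-' = gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap          # a character paired with a gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total

print(score_alignment("AATCTATA", "AAG-AT-A"))  # 1
print(score_alignment("AATCTATA", "AA--GATA"))  # 3
```

Under this scheme the second alignment scores higher, so it would be preferred.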
Discussion Question
What makes a good scoring function?
Ideas for Scoring Functions:
Edit distance. (How many edits (sub., ins., del.) are needed to transform one sequence to another?)
Homology. Assume both sequences evolved from a common (but unknown) ancestor. Which alignment best reflects this evolutionary relationship?
Avoid unintended mathematical biases.
Computational efficiency. Complex scoring functions may be harder to compute with.
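The edit distance idea can be made concrete with the classic dynamic programming recurrence, in which each substitution, insertion, and deletion costs 1. The lecture does not specify an algorithm at this point, so the sketch below is only one standard way to compute it; `edit_distance` is a name chosen for this example.

```python
def edit_distance(s, t):
    """Minimum number of substitutions, insertions, and deletions turning s into t."""
    prev = list(range(len(t) + 1))        # distances from "" to each prefix of t
    for i, a in enumerate(s, start=1):
        curr = [i]                        # deleting the first i characters of s
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,              # delete a
                curr[j - 1] + 1,          # insert b
                prev[j - 1] + (a != b),   # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("AATCTATA", "AAGATA"))  # 3
```

For the two sequences from the alignment example, the distance is 3: two deletions and one substitution, matching the two gaps and one mismatch of the higher scoring alignment.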
Substitution Score Matrix
Evolutionarily, some substitutions are more probable than others.
Physical/Chemical properties.
  Ex: in DNA, transitions (purine ↔ purine or pyrimidine ↔ pyrimidine) are more probable than transversions (purine ↔ pyrimidine).
Selective pressure during evolution.
  Ex: in proteins, substitutions that change structure are selected against.
Nucleotide Score Matrices
Usually quite simple:
BLAST matrix (match=5, mismatch=-4)
       A   T   C   G
   A   5  -4  -4  -4
   T  -4   5  -4  -4
   C  -4  -4   5  -4
   G  -4  -4  -4   5
Transition/Transversion matrix
       A   T   C   G
   A   1  -5  -5  -1
   T  -5   1  -1  -5
   C  -5  -1   1  -5
   G  -1  -5  -5   1
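The transition/transversion matrix follows a simple rule, so it can be generated rather than typed in. The sketch below assumes the scores shown above (match 1, transition -1, transversion -5) and uses the fact that A and G are purines while C and T are pyrimidines; `ti_tv_score` is a hypothetical helper name.

```python
PURINES = {"A", "G"}   # C and T are the pyrimidines

def ti_tv_score(a, b, match=1, transition=-1, transversion=-5):
    """Score a pair of nucleotides by substitution type."""
    if a == b:
        return match
    same_class = (a in PURINES) == (b in PURINES)
    return transition if same_class else transversion

order = "ATCG"
matrix = [[ti_tv_score(a, b) for b in order] for a in order]
for base, row in zip(order, matrix):
    print(base, row)
```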
Amino Acid Substitution Score Matrix
Based on a statistical model of accepted mutations (i.e., mutations that survive evolution).
Example: (BLOSUM62 amino acid substitution matrix)
       C   S   T   P   A   G  ···
   C   9
   S  -1   4
   T  -1   1   5
   P  -3  -1  -1   7
   A   0   1   0  -1   4
   G  -3   0  -2  -2   0   6
   ...
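Because a substitution matrix is symmetric, storing only the lower triangle, as on the slide, is enough. The sketch below covers just the six amino acids shown above (C, S, T, P, A, G), not the full BLOSUM62 matrix; `blosum_score` and the two short peptide fragments are made up for illustration.

```python
ORDER = "CSTPAG"
# Lower triangle of the BLOSUM62 excerpt, one row per amino acid in ORDER.
LOWER = [
    [9],
    [-1, 4],
    [-1, 1, 5],
    [-3, -1, -1, 7],
    [0, 1, 0, -1, 4],
    [-3, 0, -2, -2, 0, 6],
]

def blosum_score(a, b):
    """Look up the score for a pair, using symmetry to stay in the lower triangle."""
    i, j = ORDER.index(a), ORDER.index(b)
    if i < j:
        i, j = j, i
    return LOWER[i][j]

# Score an ungapped pairing of two short fragments, column by column.
total = sum(blosum_score(a, b) for a, b in zip("CAST", "CGSP"))
print(total)  # 9 + 0 + 4 + (-1) = 12
```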