Source: math.ucdenver.edu/~billups/courses/ma5610/lectures/lec2.pdf


MATH 5610, Computational Biology
Lecture 2 – Intro to Molecular Biology (cont)

Stephen Billups

University of Colorado at Denver

MATH 5610, Computational Biology – p.1/24


Announcements

Error on syllabus: class meets 5:30-6:45, not 7:00-8:15.

Office hours for Christiaan Van Woudenberg: TR after class (6:45-7:45)


Key Ideas from Last Lecture

DNA and RNA are strings over a 4-letter alphabet.

Orientation

Complementarity

Central Dogma


Tonight

More Molecular Biology
  Proteins
  Translation & the genetic code
  Genes: reading frames, introns/exons
  Protein Structure and Function
  Hydrophobicity/Hydrophilicity

Sequence Alignment


Proteins

Do most of the “work” in a cell.
  Enzymes: catalysts for chemical reactions.
  Structural Proteins: form cellular structure.
  Regulatory Proteins: control expression of genes or activities of other proteins.
  Transport Proteins: carry molecules across membranes or around the body.

Composed of strings of amino acids.

Translated from RNA by ribosome complexes.

Fold up into a 3-dimensional structure, which largely determines protein function.


Translation

mRNA is translated to form proteins.

Genetic Code: 3 nucleotides translate to 1 amino acid.
  20 amino acids + stop codon = 21 codes needed.
  Question: How many possible sequences of 3 nucleotides are there?
  Answer: 4^3 = 64.
  Duplication: different codons code for same amino acids.
  Often an error in 3rd position in codon results in same amino acid.
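The count of 64 can be checked by brute force. The snippet below is a small sketch, not part of the original slides; the name `NUCLEOTIDES` is chosen here for illustration.

```python
from itertools import product

NUCLEOTIDES = "ACGU"  # the 4-letter RNA alphabet

# Every ordered triplet of nucleotides is a possible codon.
codons = ["".join(triplet) for triplet in product(NUCLEOTIDES, repeat=3)]

print(len(codons))  # 64
print(codons[:4])   # ['AAA', 'AAC', 'AAG', 'AAU']
```

Since only 21 codes are needed, the remaining 43 codons are necessarily duplicates.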


Genes

A gene is a sequence of nucleotides in the DNA molecule coding for one unit of genetic information.

A gene is expressed when that segment of DNA is transcribed into mRNA.

RNA polymerase binds to the DNA molecule, and then moves along the DNA building the complementary RNA molecule.

For binding to occur, there must be a promoter sequence on the DNA, which is a set of specific nucleotide sequences in just the right positions relative to the gene.

Gene expression is regulated by proteins (or RNA molecules) that bind to the promoter regions.

  Sometimes, such a protein makes it easier for RNA polymerase to bind to the DNA and begin transcription (positive regulation).
  Other times, the protein makes it harder (negative regulation).


Reading Frames

Recall: nucleotides in the mRNA molecule are translated to proteins in triplets.

If you shifted by one nucleotide, you would get an entirely different amino acid sequence:

   tyr   leu   arg   leu
  |-----|-----|-----|-----|
   U A C C U U A G A C U C G
    |-----|-----|-----|-----|
     thr   leu   asp   ser

There are 3 possible reading frames, which correspond to the 3 possible ways of dividing the sequence up into triplets.
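The three reading frames of the example sequence can be sketched as follows. This is an illustration only: `CODON_TABLE` is a partial table holding just the codons that occur in this example (taken from the standard genetic code), and `translate_frame` is a hypothetical helper name.

```python
# Partial codon table: only the codons that occur in the example sequence.
CODON_TABLE = {
    "UAC": "tyr", "CUU": "leu", "AGA": "arg", "CUC": "leu",
    "ACC": "thr", "UUA": "leu", "GAC": "asp", "UCG": "ser",
    "CCU": "pro", "UAG": "STOP", "ACU": "thr",
}

def translate_frame(mrna, frame):
    """Translate mrna from offset frame (0, 1, or 2); leftover bases are ignored.

    Stop codons are only labeled as "STOP"; translation is not cut short there.
    """
    codons = [mrna[i:i + 3] for i in range(frame, len(mrna) - 2, 3)]
    return [CODON_TABLE[codon] for codon in codons]

mrna = "UACCUUAGACUCG"
for frame in range(3):
    print(frame, translate_frame(mrna, frame))
# 0 ['tyr', 'leu', 'arg', 'leu']
# 1 ['thr', 'leu', 'asp', 'ser']
# 2 ['pro', 'STOP', 'thr']
```

Note that the third frame runs into a stop codon (UAG) after a single amino acid.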


Open reading frames

The choice of reading frame depends on where the ribosome binds to the mRNA. This always occurs at a start codon (AUG).

Translation begins at a start codon (AUG, which also codes for methionine), and continues until a stop codon is reached (UAA, UAG, or UGA).

An open reading frame (ORF) is a sequence of mRNA beginning with a start codon, and ending at the first stop codon encountered.

All proteins are translated from ORFs. But not all ORFs are translated.

Long ORFs are rare, unless the ORF corresponds to an actual protein.

In a random nucleotide sequence, stop codons make up 3/64 of the codons, so the average ORF length is about 64/3 ≈ 21 codons.

Most proteins are hundreds of amino acids long.
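A minimal sketch of an ORF scan under the definition above (start codon through the first in-frame stop codon). The function name `find_orfs` and the test sequence are made up for illustration; real gene finders do considerably more.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

def find_orfs(mrna):
    """Return (start, end) index pairs for every AUG followed by an in-frame stop."""
    orfs = []
    for start in range(len(mrna) - 2):
        if mrna[start:start + 3] != "AUG":
            continue
        # Walk codon by codon in the frame set by this AUG.
        for i in range(start + 3, len(mrna) - 2, 3):
            if mrna[i:i + 3] in STOP_CODONS:
                orfs.append((start, i + 3))  # end is exclusive; stop codon included
                break
    return orfs

mrna = "GGAUGAAAUGCCCUAAGG"  # made-up sequence
print(find_orfs(mrna))       # [(7, 16)]
print(mrna[7:16])            # AUGCCCUAA

# Each random codon is a stop with probability p = 3/64, so the expected
# number of codons up to the first stop is 1/p (geometric distribution).
print(64 / 3)                # about 21.3
```

The AUG at index 2 yields no ORF because no stop codon occurs in its frame before the sequence ends, even though UGA appears out of frame at index 3.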


Introns and Exons

In prokaryotes, the mRNA is transcribed directly from the DNA.

In eukaryotes, the mRNA can be modified by splicing before it is translated.

Splicing involves removing internal sequences called introns from the mRNA. Introns do not code for proteins.

The parts of the mRNA that are not removed are called exons. Exons contain the sequence information for the protein.

Alternative splicing: Many RNA sequences can be spliced in multiple ways (called alternative splicings).
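As a toy illustration of splicing, assume the exon boundaries are already known. The sequences, the coordinates, and the helper `splice` below are all made up for this sketch.

```python
def splice(pre_mrna, exons):
    """Join the exon segments of pre_mrna; exons are half-open (start, end) index pairs."""
    return "".join(pre_mrna[start:end] for start, end in exons)

# Made-up transcript: three 6-base exons separated by two 9-base introns.
exon1, intron1, exon2, intron2, exon3 = "AUGGCC", "GUAAGUCAG", "GAUUUC", "GUGAGUUAG", "AAAUAA"
pre_mrna = exon1 + intron1 + exon2 + intron2 + exon3

all_exons = [(0, 6), (15, 21), (30, 36)]
skip_exon2 = [(0, 6), (30, 36)]  # one possible alternative splicing

print(splice(pre_mrna, all_exons))   # AUGGCCGAUUUCAAAUAA
print(splice(pre_mrna, skip_exon2))  # AUGGCCAAAUAA
```

The two outputs are different mRNAs from the same transcript, which is the point of alternative splicing.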


Protein Structure and Function

The 3-D structure of a protein largely determines the function.

Hierarchy of structure:

Primary structure: Linear order of the amino acids.

Secondary structure: Location and direction of common structures called α-helices and β-sheets.

Tertiary structure: The 3-dimensional shape of the protein.

Quaternary structure: The overall 3-D structure of a complex of multiple proteins.

(Image from www.biology.bnl.gov/structure/images/swami_p18.jpg)


Hydrophobicity/Hydrophilicity

Some amino acids have polar side chains (the charge of the molecule is not symmetric). These residues are attracted to water. Such amino acids are called hydrophilic.

Other amino acids involve nonpolar side chains. These are called hydrophobic.

Because proteins reside in water, they tend to fold up in ways such that the hydrophilic residues are on the outside and the hydrophobic residues are on the inside.


Topics not covered from Ch. 1

Read about these on your own.

Chemical details.

Molecular Biology Tools

Genomic Information Content (Optional).


Sequence Comparison/Alignment

Dot Plots

Sequence Alignment

Scoring methods

Derivation of scoring matrices

Dynamic Programming


Dot Plot

    A C T C G A G C
  A ∗ . . . . ∗ . .
  C . ∗ . ∗ . . . ∗
  A ∗ . . . . ∗ . .
  G . . . . ∗ . ∗ .
  T . . ∗ . . . . .
  A ∗ . . . . ∗ . .
  G . . . . ∗ . ∗ .
  C . ∗ . ∗ . . . ∗
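A dot plot marks every cell whose row and column characters match. The sketch below rebuilds the plot for the two sequences on this slide; `dot_plot` is a name chosen here, and empty cells are printed as `.` for readability.

```python
def dot_plot(horizontal, vertical):
    """Return one text row per character of vertical, marking matches with '*'."""
    rows = []
    for v in vertical:
        cells = ["*" if v == h else "." for h in horizontal]
        rows.append(v + " " + " ".join(cells))
    return rows

print("  " + " ".join("ACTCGAGC"))
for row in dot_plot("ACTCGAGC", "ACAGTAGC"):
    print(row)
```

Diagonal runs of marks correspond to stretches where the two sequences agree.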


Sequence Alignment

Definition: Alignment = pairwise matching between the characters of each sequence.

Often requires inserting gaps into the sequences.

Ex: 2 alignments of the same sequences:

  AATCTATA
  AAG-AT-A

  AATCTATA
  AA--GATA

Which is better?


Types of Alignments

Global – Finds the best alignment of two fixed sequences.

Semiglobal – Finds best overlap of two sequences. Doesn’t penalize gaps at beginning or end.

Local – Finds the best scoring alignment of subsequences.

Multiple Sequence Alignment – Aligns multiple sequences.


Scoring Alignments

Goal: Devise a scoring function for an alignment such that the “best” alignment gets the highest score.

Once the scoring function is defined, we will then be able to devise algorithms to search for the highest scoring alignments.

Simple Example:

Matches = +1

Mismatches = 0

Gaps = -1

  AATCTATA
  AAG-AT-A
  ++0-00-+ = +1

  AATCTATA
  AA--GATA
  ++--0+++ = +3
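The simple scoring scheme above can be written as a short function. This is a sketch; the name `score_alignment` and the keyword parameters are chosen here, with defaults set to the slide's values.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1):
    """Sum column scores of two gapped sequences of equal length ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total

print(score_alignment("AATCTATA", "AAG-AT-A"))  # 1
print(score_alignment("AATCTATA", "AA--GATA"))  # 3
```

Under this scheme the second alignment is preferred, since it scores +3 against +1.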


Discussion Question

What makes a good scoring function?


Ideas for Scoring Functions:

Edit distance. (How many edits (sub., ins., del.) are needed to transform one sequence to another?)

Homology. Assume both sequences evolved from a common (but unknown) ancestor. Which alignment best reflects this evolutionary relationship?

Avoid unintended mathematical biases.

Computational efficiency. Complex scoring functions may be harder to compute with.
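For the edit-distance idea, one common formulation (assumed here, not stated on the slide) charges a unit cost for each substitution, insertion, and deletion and computes the minimum with dynamic programming. A sketch, with `edit_distance` as a name chosen for illustration:

```python
def edit_distance(s, t):
    """Minimum number of substitutions, insertions, and deletions turning s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        curr = [i]                  # turning s[:i] into "" takes i deletions
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from s
                curr[j - 1] + 1,         # insert b into s
                prev[j - 1] + (a != b),  # substitute a -> b (free if equal)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("AATCTATA", "AAGATA"))  # 3
```

The value 3 agrees with the second example alignment shown earlier: two gaps plus one mismatch.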


Substitution Score Matrix

Evolutionarily, some substitutions are more probable than others.

Physical/Chemical properties.
  Ex: in DNA, transitional substitutions (purine ↔ purine or pyrimidine ↔ pyrimidine) are more probable than transversional substitutions (purine ↔ pyrimidine).

Selective pressure during evolution.
  Ex: in protein, substitutions that change structure are selected against.


Nucleotide Score Matrices

Usually quite simple:

BLAST matrix (match=5, mismatch=-4)

       A   T   C   G
   A   5  -4  -4  -4
   T  -4   5  -4  -4
   C  -4  -4   5  -4
   G  -4  -4  -4   5

Transition/Transversion matrix

       A   T   C   G
   A   1  -5  -5  -1
   T  -5   1  -1  -5
   C  -5  -1   1  -5
   G  -1  -5  -5   1
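The transition/transversion matrix follows from a simple rule: +1 for a match, -1 when both bases are purines (A, G) or both are pyrimidines (C, T), and -5 otherwise. A sketch that regenerates it, with function and variable names chosen here for illustration:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_score(a, b, match=1, transition=-1, transversion=-5):
    """Score one aligned pair of DNA bases under the transition/transversion scheme."""
    if a == b:
        return match
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion

order = "ATCG"
matrix = {a: {b: substitution_score(a, b) for b in order} for a in order}
for a in order:
    print(a, [matrix[a][b] for b in order])
# A [1, -5, -5, -1]
# T [-5, 1, -1, -5]
# C [-5, -1, 1, -5]
# G [-1, -5, -5, 1]
```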


Amino Acid Substitution Score Matrix

Based on statistical model of accepted mutations (i.e., mutations that survive evolution).

Example: (BLOSUM62 amino acid substitution matrix)

       C   S   T   P   A   G  ...
   C   9
   S  -1   4
   T  -1   1   5
   P  -3  -1  -1   7
   A   0   1   0  -1   4
   G  -3   0  -2  -2   0   6
  ...
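Because the matrix is symmetric, only the lower triangle needs to be stored. The sketch below holds just the six-residue excerpt shown on the slide and scores an ungapped pair of made-up peptides; the names `BLOSUM62_EXCERPT`, `blosum_score`, and `score_ungapped` are chosen here for illustration.

```python
# Lower-triangular excerpt from the slide (residues C, S, T, P, A, G only).
BLOSUM62_EXCERPT = {
    "C": {"C": 9},
    "S": {"C": -1, "S": 4},
    "T": {"C": -1, "S": 1, "T": 5},
    "P": {"C": -3, "S": -1, "T": -1, "P": 7},
    "A": {"C": 0, "S": 1, "T": 0, "P": -1, "A": 4},
    "G": {"C": -3, "S": 0, "T": -2, "P": -2, "A": 0, "G": 6},
}

def blosum_score(a, b):
    """Symmetric lookup: each unordered pair is stored once."""
    row = BLOSUM62_EXCERPT[a]
    return row[b] if b in row else BLOSUM62_EXCERPT[b][a]

def score_ungapped(top, bottom):
    """Sum pairwise scores of two equal-length peptides with no gaps."""
    return sum(blosum_score(a, b) for a, b in zip(top, bottom))

print(blosum_score("S", "T"), blosum_score("T", "S"))  # 1 1
print(score_ungapped("CSTA", "CTTG"))                  # 15  (9 + 1 + 5 + 0)
```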
