Source: math.ucdenver.edu/~billups/courses/ma5610/lectures/lec2.pdf

MATH 5610, Computational Biology
Lecture 2 – Intro to Molecular Biology (cont)

Stephen Billups

University of Colorado at Denver

Announcements

Error on syllabus: class meets 5:30-6:45, not 7:00-8:15.

Office hours for Christiaan Van Woudenberg: TR after class (6:45-7:45)


Key Ideas from Last Lecture

DNA and RNA are strings over a 4-letter alphabet.

Orientation

Complementarity

Central Dogma


Tonight

More Molecular Biology
  Proteins
  Translation & the genetic code
  Genes: reading frames, introns/exons
  Protein Structure and Function
  Hydrophobicity/Hydrophilicity

Sequence Alignment


Proteins

Do most of the “work” in a cell.
  Enzymes: catalysts for chemical reactions.
  Structural Proteins: form cellular structure.
  Regulatory Proteins: control expression of genes or activities of other proteins.
  Transport Proteins: carry molecules across membranes or around the body.

Composed of strings of amino acids.

Translated from RNA by ribosome complexes.

Fold up into a 3-dimensional structure which largely determines protein function.


Translation

mRNA is translated to form proteins.

Genetic Code: 3 nucleotides translate to 1 amino acid.
  20 amino acids + stop codon = 21 codes needed.
  Question: How many possible sequences of 3 nucleotides are there?
  Answer: $4^3 = 64$.
  Duplication: different codons code for the same amino acid.
  Often an error in the 3rd position of a codon results in the same amino acid.
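The counting argument above can be checked with a short sketch. It assumes the standard genetic code written as a 64-letter string in U, C, A, G order (one-letter amino acid codes, `*` for stop); the names `CODON_TABLE` and `translate` are illustrative and not from the lecture.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every triplet (UUU, UUC, UUA, ...) to its amino acid or stop symbol.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an mRNA string codon by codon, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[mrna[i:i + 3]] for i in range(0, len(mrna) - 2, 3))


codons_per_symbol = Counter(CODON_TABLE.values())
print(len(CODON_TABLE))          # 64 codons
print(len(codons_per_symbol))    # 21 meanings: 20 amino acids plus stop
print(codons_per_symbol["L"])    # 6 codons all code for leucine
# All four codons starting with GC give alanine, whatever the 3rd position is.
print({CODON_TABLE["GC" + b] for b in BASES})
```

Since 64 codons map onto 21 meanings, some amino acids must be encoded by several codons, which is the duplication described above.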


Genes

A gene is a sequence of nucleotides in the DNA molecule coding for one unit of genetic information.

A gene is expressed when that segment of DNA is transcribed into mRNA.

RNA polymerase binds to the DNA molecule, and then moves along the DNA building the complementary RNA molecule.

For binding to occur, there must be a promoter sequence on the DNA, which is a set of specific nucleotide sequences in just the right positions relative to the gene.

Gene expression is regulated by proteins (or RNA molecules) that bind to the promoter regions.

Sometimes, such a protein makes it easier for RNA polymerase to bind to the DNA and begin transcription (positive regulation).
Other times, the protein makes it harder (negative regulation).
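As a rough illustration of "building the complementary RNA molecule", the sketch below pairs each base of a DNA template strand with its RNA complement. It assumes the template is written in the 3'→5' direction, so the resulting mRNA reads 5'→3'. The function name and the example sequence are hypothetical.

```python
# RNA base that pairs with each DNA base (uracil replaces thymine in RNA).
DNA_TO_RNA_COMPLEMENT = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_3_to_5):
    """Return the mRNA (5'->3') complementary to a template strand given 3'->5'."""
    return "".join(DNA_TO_RNA_COMPLEMENT[base] for base in template_3_to_5)


print(transcribe("TACCGA"))
```

Promoters and regulation are not modelled here; the sketch covers only the base-pairing step.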


Reading Frames

Recall: nucleotides in the mRNA molecule are translated to proteins in triplets.

If you shifted by one nucleotide, you would get an entirely different amino acid sequence:

  tyr   leu   arg   leu
|-----|-----|-----|-----|
 U A C C U U A G A C U C G
  |-----|-----|-----|-----|
    thr   leu   asp   ser

There are 3 possible reading frames, which correspond to the 3 possible ways of dividing the sequence up into triplets.
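The three reading frames of the example sequence can be enumerated with a small sketch. It assumes the standard genetic code written as a 64-letter string in U, C, A, G order, and reports one-letter amino acid codes (Y = tyr, L = leu, R = arg, T = thr, D = asp, S = ser, P = pro, `*` = stop). The function name `reading_frames` is illustrative.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reading_frames(mrna):
    """Translate the sequence in each of the 3 frames (offsets 0, 1, 2)."""
    frames = []
    for offset in range(3):
        # Only complete triplets are translated; leftover bases are dropped.
        codons = [mrna[i:i + 3] for i in range(offset, len(mrna) - 2, 3)]
        frames.append("".join(CODON_TABLE[c] for c in codons))
    return frames


for offset, protein in enumerate(reading_frames("UACCUUAGACUCG")):
    print(offset, protein)
```

Offsets 0 and 1 reproduce the two translations shown on the slide; offset 2 hits a stop codon (UAG) in its second triplet.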


Open reading frames

The choice of reading frame depends on where the ribosome binds to the mRNA. This always occurs at a start codon (AUG).

Translation begins at a start codon (AUG), and continues until a stop codon is reached (UAA, UAG, or UGA).

An open reading frame (ORF) is a sequence of mRNA beginning with a start codon, and ending at the first stop codon encountered.

All proteins are translated from ORFs. But not all ORFs are translated.

Long ORFs are rare, unless the ORF corresponds to an actual protein.

In a random nucleotide sequence, stop codons make up 3/64 of the codons, so the average ORF length is about 64/3 ≈ 21 codons.

Most proteins are hundreds of amino acids long.
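A minimal sketch of ORF detection under the definition above: start at each AUG, step forward three bases at a time, and stop at the first stop codon in that frame. The function name and test sequence are hypothetical, and a start codon with no in-frame stop is simply skipped.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_orfs(mrna):
    """Return every substring that runs from an AUG to the first in-frame stop codon."""
    orfs = []
    for start in range(len(mrna) - 2):
        if mrna[start:start + 3] != "AUG":
            continue
        # Walk codon by codon in the frame set by this start codon.
        for pos in range(start, len(mrna) - 2, 3):
            if mrna[pos:pos + 3] in STOP_CODONS:
                orfs.append(mrna[start:pos + 3])
                break
    return orfs


print(find_orfs("CCAUGGCUUAAGG"))

# If each codon is a stop with probability 3/64, the expected number of codons
# up to and including the first stop is the mean of a geometric distribution.
print(1 / (3 / 64))
```

The second print gives 64/3, which is the "about 21 codons" estimate for random sequence. An ORF hundreds of codons long is therefore unlikely to arise by chance.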


Introns and Exons

In prokaryotes, the mRNA is transcribed directly from the DNA.

In eukaryotes, the mRNA can be modified by splicing before it is translated.

Splicing involves removing internal sequences called introns from the mRNA. Introns do not code for proteins.

The parts of the mRNA that are not removed are called exons. Exons contain the sequence information for the protein.

Alternative splicing: Many RNA sequences can be spliced in multiple ways (called alternative splicings).
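Splicing can be pictured as keeping the exon intervals of a transcript and discarding the rest. The sketch below uses a made-up transcript with three exons (upper case) and two introns (lower case). The sequence, the coordinates, and the function name are all hypothetical.

```python
def splice(pre_mrna, exons):
    """Join the exon intervals (start, end), end exclusive, in the order given."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Hypothetical transcript: exon, intron, exon, intron, exon.
pre_mrna = "AUGGCU" + "guaag" + "CCAGAA" + "guuag" + "UGGUAA"
exon_1, exon_2, exon_3 = (0, 6), (11, 17), (22, 28)

# Removing both introns keeps all three exons.
print(splice(pre_mrna, [exon_1, exon_2, exon_3]))
# An alternative splicing that also skips the middle exon.
print(splice(pre_mrna, [exon_1, exon_3]))
```

The two calls yield different mature mRNAs from the same transcript, which is the point of alternative splicing.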


Protein Structure and Function

The 3-D structure of a protein largely determines the function.

Hierarchy of structure:

Primary structure: Linear order of the amino acids.

Secondary structure: Location and direction of common structures called α-helices and β-sheets.

Tertiary structure: The 3-dimensional shape of the protein.

Quaternary structure: The overall 3-D structure of a complex of multiple proteins.

(Image from www.biology.bnl.gov/structure/images/swami_p18.jpg)


Hydrophobicity/Hydrophilicity

Some amino acids have polar side chains (the charge of the molecule is not symmetric). These residues are attracted to water. Such amino acids are called hydrophilic.

Other amino acids have nonpolar side chains. These are called hydrophobic.

Because proteins reside in water, they tend to fold up in ways such that the hydrophilic residues are on the outside and the hydrophobic residues are on the inside.
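One common way to put numbers on this is a hydropathy scale. The sketch below uses the Kyte-Doolittle scale, which is not part of this lecture and is brought in only for illustration. Positive values indicate hydrophobic residues and negative values hydrophilic ones. The function name, window size, and example peptide are hypothetical.

```python
# Kyte-Doolittle hydropathy values, keyed by one-letter amino acid code.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=3):
    """Average hydropathy over each window of consecutive residues."""
    return [
        sum(KYTE_DOOLITTLE[aa] for aa in protein[i:i + window]) / window
        for i in range(len(protein) - window + 1)
    ]


# Hypothetical peptide: a hydrophobic stretch followed by a charged one.
for value in hydropathy_profile("IVLKDE"):
    print(round(value, 2))
```

In a profile like this, strongly positive stretches are candidates for buried (or membrane-spanning) segments, and strongly negative ones for surface-exposed segments.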


Topics not covered from Ch. 1

Read about this on your own.

Chemical details.

Molecular Biology Tools

Genomic Information Content (Optional).


Sequence Comparison/Alignment

Dot Plots

Sequence Alignment

Scoring methods

Derivation of scoring matrices

Dynamic Programming


Dot Plot

  A C T C G A G C
A ∗         ∗
C   ∗   ∗       ∗
A ∗         ∗
G         ∗   ∗
T     ∗
A ∗         ∗
G         ∗   ∗
C   ∗   ∗       ∗
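A dot plot puts a mark wherever the row character equals the column character. A minimal sketch using the two sequences from the slide follows; the function names are illustrative.

```python
def dot_plot(row_seq, col_seq):
    """Boolean grid: cell [i][j] is True when row_seq[i] == col_seq[j]."""
    return [[r == c for c in col_seq] for r in row_seq]


def render(row_seq, col_seq):
    """Format the grid as text, one row per character of row_seq."""
    lines = ["  " + " ".join(col_seq)]
    for r, row in zip(row_seq, dot_plot(row_seq, col_seq)):
        cells = " ".join("*" if hit else " " for hit in row)
        lines.append((r + " " + cells).rstrip())
    return "\n".join(lines)


print(render("ACAGTAGC", "ACTCGAGC"))
```

Runs of marks along a diagonal correspond to stretches where the two sequences agree, which is what makes the plot a useful first look before computing an alignment.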


Sequence Alignment

Definition: Alignment = pairwise matching between the characters of each sequence.

Often requires inserting gaps into the sequences.

Ex: 2 alignments of the same sequences:

AATCTATA
AAG-AT-A

AATCTATA
AA--GATA

Which is better?



Types of Alignments

Global – Finds the best alignment of two fixed sequences.

Semiglobal – Finds the best overlap of two sequences. Doesn’t penalize gaps at beginning or end.

Local – Finds the best scoring alignment of subsequences.

Multiple Sequence Alignment – Aligns multiple sequences.


Scoring Alignments

Goal: Devise a scoring function for an alignment such that the “best” alignment gets the highest score.

Once the scoring function is defined, we will then be able to devise algorithms to search for the highest scoring alignments.

Simple Example:

Matches = +1

Mismatches = 0

Gaps = -1

AATCTATA
AAG-AT-A
++0-00-+ = +1

AATCTATA
AA--GATA
++--0+++ = +3
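The simple scheme above (match +1, mismatch 0, gap -1) can be written as a column-by-column sum. This is a minimal sketch; the function name and keyword arguments are illustrative.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1):
    """Sum the per-column scores of two aligned, equal-length strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # one of the sequences has a gap in this column
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("AATCTATA", "AAG-AT-A"))  # 1
print(score_alignment("AATCTATA", "AA--GATA"))  # 3
```

Under this scoring function the second alignment is the better one. A different choice of the three parameters could reverse the ranking, which is why the choice of scoring function matters.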


Discussion Question

What makes a good scoring function?


Ideas for Scoring Functions:

Edit distance. (How many edits (sub., ins., del.) are needed to transform one sequence to another?)

Homology. Assume both sequences evolved from a common (but unknown) ancestor. Which alignment best reflects this evolutionary relationship?

Avoid unintended mathematical biases.

Computational efficiency. Complex scoring functions may be harder to compute with.
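The edit-distance idea can be made concrete with the classic dynamic-programming recurrence, sketched here with unit cost for each substitution, insertion, and deletion. The unit costs and the function name are assumptions for illustration.

```python
def edit_distance(source, target):
    """Minimum number of substitutions, insertions, and deletions (unit cost each)."""
    # previous[j] holds the distance between the current prefix of source
    # and target[:j]; start with the empty prefix of source.
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]  # turning source[:i] into the empty string takes i deletions
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,             # delete s
                current[j - 1] + 1,          # insert t
                previous[j - 1] + (s != t),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(edit_distance("AATCTATA", "AAGATA"))  # 3
```

For the two sequences from the alignment example the distance is 3: two deletions and one substitution.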


Substitution Score Matrix

Evolutionarily, some substitutions are more probable than others.

Physical/Chemical properties.
Ex: in DNA, transitional substitutions (purine ↔ purine or pyrimidine ↔ pyrimidine) are more probable than transversional substitutions (purine ↔ pyrimidine).

Selective pressure during evolution.
Ex: in protein, substitutions that change structure are selected against.


Nucleotide Score Matrices

Usually quite simple:

BLAST matrix (match=5, mismatch=-4)

     A   T   C   G
A    5  -4  -4  -4
T   -4   5  -4  -4
C   -4  -4   5  -4
G   -4  -4  -4   5

Transition/Transversion matrix

     A   T   C   G
A    1  -5  -5  -1
T   -5   1  -1  -5
C   -5  -1   1  -5
G   -1  -5  -5   1
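Both matrices follow one rule with three parameters: a match score, a transition score (A↔G or C↔T), and a transversion score. A minimal sketch that builds them from those parameters and scores an ungapped pair follows; the function names and the example sequences are illustrative.

```python
PURINES = {"A", "G"}  # C and T are the pyrimidines


def make_matrix(match, transition, transversion):
    """Build a nucleotide score matrix as a dict keyed by (base, base)."""
    matrix = {}
    for a in "ATCG":
        for b in "ATCG":
            if a == b:
                matrix[a, b] = match
            elif (a in PURINES) == (b in PURINES):
                matrix[a, b] = transition      # purine<->purine or pyrimidine<->pyrimidine
            else:
                matrix[a, b] = transversion    # purine<->pyrimidine
    return matrix


BLAST = make_matrix(match=5, transition=-4, transversion=-4)
TRANSITION_TRANSVERSION = make_matrix(match=1, transition=-1, transversion=-5)


def ungapped_score(seq_a, seq_b, matrix):
    """Sum matrix scores position by position for equal-length sequences."""
    return sum(matrix[a, b] for a, b in zip(seq_a, seq_b))


print(ungapped_score("ACGT", "ACGA", BLAST))                    # 11
print(ungapped_score("ACGT", "ACGA", TRANSITION_TRANSVERSION))  # -2
```

The BLAST matrix treats all mismatches alike, whereas the second matrix penalizes the T/A transversion in the last column more heavily than it would a transition.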


Amino Acid Substitution Score Matrix

Based on a statistical model of accepted mutations (i.e., mutations that survive evolution).

Example: (BLOSUM62 amino acid substitution matrix)

     C   S   T   P   A   G  ···
C    9
S   -1   4
T   -1   1   5
P   -3  -1  -1   7
A    0   1   0  -1   4
G   -3   0  -2  -2   0   6
...
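Because the matrix is symmetric, only the lower triangle needs to be stored. The sketch below looks up scores for the six amino acids shown on the slide. The remaining 14 rows of BLOSUM62 are omitted, and the function name is illustrative.

```python
# Row/column order and lower-triangle entries exactly as shown on the slide.
ORDER = "CSTPAG"
LOWER_TRIANGLE = [
    [9],
    [-1, 4],
    [-1, 1, 5],
    [-3, -1, -1, 7],
    [0, 1, 0, -1, 4],
    [-3, 0, -2, -2, 0, 6],
]


def blosum62_score(a, b):
    """Score for substituting a with b (symmetric), for the six residues above."""
    i, j = ORDER.index(a), ORDER.index(b)
    if i < j:
        i, j = j, i  # mirror into the stored lower triangle
    return LOWER_TRIANGLE[i][j]


print(blosum62_score("S", "T"), blosum62_score("T", "S"))
print(blosum62_score("C", "C"), blosum62_score("A", "A"))
```

The diagonal entries differ (9 for C, 4 for A), which reflects how strongly each residue tends to be conserved in the data the matrix was derived from.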

