Computational Molecular Biology Introduction and Preliminaries

Page 1: Computational Molecular Biology

Computational Molecular Biology

Introduction and Preliminaries

Page 2: Computational Molecular Biology

My T. [email protected]

Preliminaries in Computer Science

• Strings and alphabet
• Basic notations in graph theory
• Algorithms and Complexity

Page 3: Computational Molecular Biology

Strings

A string consists of a sequence of letters:
• DNA: four nucleotides A, C, G, T
• Proteins: 20-symbol alphabet of amino acids

Given a string s, we use the following notation:
• Length: |s|
• Substring: ACT is a substring of ATGACTG
• Superstring: ATGACTG is a superstring of ACT
• Index and interval: s[i] and s[i..j]
• Prefix and suffix: s[1..j] and s[i..|s|]
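The notation above can be tried out directly. The sketch below is only an illustration: the helper names `interval`, `prefix`, and `suffix` are made up for this example, and the one assumption is that the slide's indices are 1-based and inclusive, so they are shifted onto Python's 0-based slices.

```python
def interval(s: str, i: int, j: int) -> str:
    """Return s[i..j] in the slide's 1-based, inclusive notation."""
    if not (1 <= i <= j <= len(s)):
        raise ValueError("need 1 <= i <= j <= |s|")
    return s[i - 1:j]

def prefix(s: str, j: int) -> str:
    """Return the prefix s[1..j]."""
    return interval(s, 1, j)

def suffix(s: str, i: int) -> str:
    """Return the suffix s[i..|s|]."""
    return interval(s, i, len(s))

s = "ATGACTG"
print(len(s))              # |s| = 7
print("ACT" in s)          # substring test: True
print(interval(s, 4, 6))   # s[4..6] = ACT
print(prefix(s, 3))        # s[1..3] = ATG
print(suffix(s, 5))        # s[5..|s|] = CTG
```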

Page 4: Computational Molecular Biology

Graphs

G = (V, E) where V is a set of vertices and E is a set of edges

• Undirected graph: edges are undirected
• Directed graph: edges are directed
• Weighted graph G = (V, E, w), where each edge has some weight
• Some special graphs: complete graph, bipartite graph, tree, and interval graph
• Subgraph, spanning tree, Steiner tree

Page 5: Computational Molecular Biology

Interval Graphs

Intersection graph of a set of intervals on the real line

A vertex represents an interval and an edge (u, v) exists if intervals u and v intersect
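As a small illustration of this definition, the following sketch builds the edge set of an interval graph. It assumes closed intervals given as (left, right) pairs; the function name `interval_graph` and the sample intervals are invented for the example.

```python
from itertools import combinations

def interval_graph(intervals):
    """Return the edge set of the interval graph.

    intervals: dict mapping a vertex name to a closed interval (left, right).
    Two vertices are adjacent iff their intervals share at least one point.
    """
    edges = set()
    for (u, (a, b)), (v, (c, d)) in combinations(intervals.items(), 2):
        if max(a, c) <= min(b, d):  # the two intervals overlap
            edges.add((u, v))
    return edges

# Hypothetical intervals on the real line
intervals = {"u": (1, 4), "v": (3, 6), "w": (5, 8), "x": (9, 10)}
print(sorted(interval_graph(intervals)))  # [('u', 'v'), ('v', 'w')]
```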

Page 6: Computational Molecular Biology

Some Problems in Graphs

Euler circuit: Given a graph, find a cycle that passes through each edge exactly once

Hamiltonian circuit: Given a graph, find a cycle that passes through each vertex exactly once

Minimum Spanning Tree: Given a weighted undirected graph, find a spanning tree with minimum total weight

Maximum Matching: Given an undirected graph, find a maximum cardinality matching, which is a subset of edges such that no two edges in the subset share an endpoint

Page 7: Computational Molecular Biology

P vs. NP

Class P: the set of problems solvable by polynomial-time algorithms

Class NP: the set of problems whose solutions, once found, can be verified in polynomial time

NP-complete (NP-hard) problems: no polynomial-time algorithm for finding an optimal solution is known, and none exists unless P = NP
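To make "verified in polynomial time" concrete, here is a minimal sketch of a verifier for the Hamiltonian circuit problem from the earlier slide. Finding such a circuit is hard in general, but checking a proposed vertex order is a simple scan. The function name and the input format are assumptions made for this illustration.

```python
def is_hamiltonian_circuit(vertices, edges, order):
    """Check that `order` visits every vertex once and closes into a cycle.

    vertices: iterable of vertex names
    edges: iterable of undirected edges (u, v)
    order: proposed visiting order (the certificate to verify)
    """
    vertex_set = set(vertices)
    edge_set = {frozenset(e) for e in edges}
    n = len(order)
    # A cycle in a simple graph needs at least 3 vertices,
    # and every vertex must appear exactly once.
    if n < 3 or n != len(vertex_set) or set(order) != vertex_set:
        return False
    # Consecutive vertices (wrapping around) must be joined by an edge.
    return all(
        frozenset((order[k], order[(k + 1) % n])) in edge_set
        for k in range(n)
    )

square = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
print(is_hamiltonian_circuit("abcd", square, ["a", "b", "c", "d"]))  # True
print(is_hamiltonian_circuit("abcd", square, ["a", "c", "b", "d"]))  # False
```

The check runs in time linear in the size of the graph plus the length of the certificate, which is what places the problem in NP.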

Page 8: Computational Molecular Biology

Some approaches for NP-complete Problems

Special-case method: Work on the problem with a restricted class of inputs

Exhaustive search: Design an exponential-time algorithm that may perform well in practice

Approximation algorithms: Design a polynomial-time algorithm that is guaranteed to find near-optimal solutions (with a good approximation ratio)

Heuristics: Fast algorithms that produce satisfactory solutions most of the time but without guarantee

Page 9: Computational Molecular Biology

Preliminaries in Molecular Biology

Page 10: Computational Molecular Biology

DNA and Base Pairs

Double helix consisting of two dual strands

Has four types of nucleotides: Adenine, Thymine, Guanine, Cytosine

Base pairs: A↔T, C↔G

The two ends of a strand are marked 3’ and 5’

The entire DNA of a living organism is called its genome
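The base-pairing rule can be written as a lookup table. The sketch below is illustrative only: the function names are made up, and it relies on the fact that the two strands of the helix run in opposite directions, so the partner strand read from its own 5’ end is the reverse of the complement.

```python
# Base pairs from the slide: A<->T, C<->G
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Pair every base with its partner, position by position."""
    return "".join(COMPLEMENT[base] for base in strand)

def reverse_complement(strand: str) -> str:
    """Partner strand written in its own 5' to 3' direction."""
    return complement(strand)[::-1]

print(complement("ATGACTG"))          # TACTGAC
print(reverse_complement("ATGACTG"))  # CAGTCAT
```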

Page 11: Computational Molecular Biology

DNA Sequences

Page 12: Computational Molecular Biology

DNA Replication

The two strands are separated

Each new strand is synthesized using one of the parental strands as a template

Page 13: Computational Molecular Biology

Cell, Chromosome, and DNA

Page 14: Computational Molecular Biology

Cell Classification

Page 15: Computational Molecular Biology

Chromosomes

A chromosome consists of a DNA molecule associated with proteins that fold and pack the DNA thread into a more compact structure, together with proteins required for gene expression, DNA replication, and DNA repair.

The human genome is distributed over 24 distinct chromosomes (22 autosomes plus X and Y)

Each cell contains 46 chromosomes:
• 22 pairs common to both males and females
• 2 sex chromosomes: X and Y in males, two Xs in females

Page 16: Computational Molecular Biology

Genes

Segments of DNA

Functional and physical unit of heredity passed from parent to offspring

Contain the information for making a specific protein

Page 17: Computational Molecular Biology

Proteins

Short strings over the 20-letter amino acid alphabet

Human genome: about 100,000 proteins, with each protein a few hundred amino acids long

Bacteria make 500-1500 proteins

Proteins are made by genes (fragments of DNA) that are roughly three times longer than the corresponding proteins.

Why? Every 3 nucleotides in the DNA alphabet code for one letter in the protein alphabet of amino acids

Page 18: Computational Molecular Biology

Central Dogma of Molecular Biology

Page 19: Computational Molecular Biology

Transcription

Page 20: Computational Molecular Biology

Translation

mRNA (after being exported from the nucleus and reaching the cytosol) directs the synthesis of the protein by joining together amino acids in the order encoded by the mRNA

Genetic code: defines a mapping between codons and amino acids.

Codon: a triplet of nucleotides that specifies a single amino acid in the corresponding protein

64 codons and 20 amino acids

Translation is carried out by ribosomes
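The codon-to-amino-acid mapping can be sketched as a dictionary lookup. This is only an illustration: `CODON_TABLE` lists a handful of entries of the standard genetic code rather than all 64, and the function name and sample mRNA are made up. The count of 64 codons follows from 4 nucleotides in each of 3 positions.

```python
# Partial codon table (standard genetic code); a full table has 4**3 = 64 entries.
CODON_TABLE = {
    "AUG": "M",                                       # methionine, usual start codon
    "UUU": "F", "UUC": "F",                           # phenylalanine
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",   # glycine
    "UGG": "W",                                       # tryptophan
    "AAA": "K", "AAG": "K",                           # lysine
    "UAA": "*", "UAG": "*", "UGA": "*",               # stop codons
}

def translate(mrna: str) -> str:
    """Read the mRNA codon by codon until a stop codon or the end."""
    protein = []
    for k in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[k:k + 3]]  # KeyError if codon is not in the partial table
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(4 ** 3)                            # 64 possible codons
print(translate("AUGUUUGGCAAAUGGUAA"))   # MFGKW
```

Because 64 codons map onto 20 amino acids plus stop signals, several codons encode the same amino acid, as the glycine entries show.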

Page 21: Computational Molecular Biology

Polymerase Chain Reaction (PCR)

Primer:
• Nucleic acid strand
• Serves as a starting point of DNA replication

Page 22: Computational Molecular Biology

Plasmid Vector

Plasmid:
• Circular, double-stranded DNA
• Antibiotic resistance
• Autonomous replication
• Exists in bacteria

Vector: an agent that can carry a DNA fragment into a host cell

Page 23: Computational Molecular Biology

DNA Cloning Using Plasmids as Vectors

(a) DNA recombination

(b) Transformation

Page 24: Computational Molecular Biology

DNA Cloning Using Plasmids as Vectors (Cont)

(c) Selective amplification

(d) Isolation of desired DNA clones

Page 25: Computational Molecular Biology

DNA Library Screening

Probe:
• Labeled with a radioisotope or fluorescence
• Used to detect specific DNA sequences by hybridization

Hybridization: binding of two nucleic acid chains by base pairing

DNA library screening: determine, for each clone, whether it contains a probe from a given set of probes

Positive clone: contains a probe

Page 26: Computational Molecular Biology

Some Computational Problems

• Pooling Design
• Non-unique probe selection
• Sequence Alignment, Multiple Sequence Alignment
• DNA sequencing
• Genome Rearrangement
• Protein Structure Prediction and Recognition
• Protein-Protein Interactions
• Functional Groups, Modules

Page 27: Computational Molecular Biology

Pooling Designs

Problem Definition:
• Given a set of n clones with at most d positive clones
• Identify all positive clones with the minimum number of tests

Pool: a subset of clones

Positive pool: a pool that contains at least one positive clone

Page 28: Computational Molecular Biology

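Pooling Designs

The t×n binary matrix M has one row per pool (p1 … pt) and one column per clone (c1 … cn); V(D) is the t×1 vector of test outcomes:

       c1 c2 … cj … cn             V(D)
p1     0 0 … 0 … 0 … 0 … 0         0
p2     0 1 … 0 … 0 … 0 … 0         1
.
.
pi     0 0 … 0 … 1 … 0 … 0         1
.
.
pt     0 0 … 0 … 0 … 0 … 0         0

M[i, j] = 1 iff the ith pool contains the jth clone

Testing: applying the t pools produces the outcome vector V(D)

Decoding Algorithm: Given M and V(D), identify all positive clones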
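The testing step can be simulated in a few lines. In this sketch the function name `pool_outcomes`, the 0-based indexing, and the small 4×4 matrix are all assumptions made for illustration; the rule it applies is the one on the slides, namely that a pool is positive iff it contains at least one positive clone.

```python
def pool_outcomes(M, positives):
    """Simulate testing: return the outcome vector V(D).

    M: t x n binary matrix as a list of rows, M[i][j] == 1 iff pool i contains clone j
    positives: set of 0-based indices of the positive clones (the set D)
    """
    return [int(any(row[j] for j in positives)) for row in M]

# Hypothetical design with t = 4 pools and n = 4 clones
M = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
]
print(pool_outcomes(M, {1}))    # [1, 1, 0, 0]
print(pool_outcomes(M, set()))  # [0, 0, 0, 0]
```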

Page 29: Computational Molecular Biology

Challenges

Challenge 1: How to construct the binary matrix M such that the outcome vectors produced by the unions of any d columns are all distinct

Challenge 2: How to design a decoding algorithm with efficient time complexity, O(tn)
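One simple decoder that meets the O(tn) bound removes every clone that appears in a negative pool and reports the rest. The sketch below is not taken from the slides. It assumes M is d-disjunct (no column is contained in the union of any d other columns), a stronger condition than the one in Challenge 1, under which this elimination rule is known to return exactly the positive clones. The function name and the example matrix are hypothetical.

```python
def naive_decode(M, outcomes):
    """Return the clones that never appear in a negative pool.

    M: t x n binary matrix as a list of rows
    outcomes: list of t test results (0 or 1)
    Runs in O(t * n) time: every matrix entry is inspected at most once.
    """
    n = len(M[0])
    candidate = [True] * n
    for row, outcome in zip(M, outcomes):
        if outcome == 0:
            for j in range(n):
                if row[j]:
                    candidate[j] = False  # clone j sits in a negative pool
    return {j for j in range(n) if candidate[j]}

# Hypothetical 1-disjunct design: no column is contained in another column
M = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
]
print(naive_decode(M, [1, 1, 0, 0]))  # {1}
```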

Page 30: Computational Molecular Biology

Probe Selection

Problem Definition:
• Given a biological sample (e.g., blood) and a set of probes
• Identify the presence (or absence) of some biological objects (e.g., viruses or bacteria) with the minimum number of probes

Page 31: Computational Molecular Biology

Unique Probes vs. Non-unique Probes

Unique probes: gene-specific probes or signature probes. Difficult to find

Non-unique probes: hybridize to more than one target. Difficult to decode the results

Page 32: Computational Molecular Biology

Probe-Target Matrix

• 12 probe candidates.
• 4 targets (genes).
• For a target set S, define P(S) as the set of probes reacting to any gene in S.
• P({1, 2}) = {1, 2, 3, 4, 7, 8, 9, 10, 12}.
• P({2, 3}) = {1, 3, 4, 5, 6, 7, 8, 9, 12}.
• Symmetric set difference: P({1, 2})∆P({2, 3}) = {2, 5, 6, 10}. These are the probes that separate the two sets.
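The symmetric set difference can be checked with Python's built-in sets. The two probe sets are copied from the slide; the variable and function names are made up for this sketch, and the probe-target matrix itself is not reproduced because it is not part of the text.

```python
def separating_probes(p_a, p_b):
    """Probes that react to exactly one of the two target sets."""
    return p_a ^ p_b  # symmetric set difference

P_12 = {1, 2, 3, 4, 7, 8, 9, 10, 12}   # P({1, 2}) from the slide
P_23 = {1, 3, 4, 5, 6, 7, 8, 9, 12}    # P({2, 3}) from the slide

print(sorted(separating_probes(P_12, P_23)))  # [2, 5, 6, 10]
```

Probes 2 and 10 react only to target set {1, 2}, and probes 5 and 6 only to {2, 3}, so any one of these four is enough to tell the two sets apart.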

Page 33: Computational Molecular Biology

Sequence Alignment

Problem Definition:
• Given: 2 DNA or protein sequences
• Find: Best match between them

What is an Alignment:
• Given: 2 strings S and S’
• Goal: Make the lengths of S and S’ equal by inserting spaces into these strings

A -- T C -- A

-- C T C A A

Page 34: Computational Molecular Biology

Matches, Mismatches and Indels

Match: two aligned, identical characters in an alignment

Mismatch: two aligned, unequal characters

Indel: A character aligned with a space

A A C T A C T -- C C T A A C A C T -- --

-- -- C T C C T A C C T -- -- T A C T T T

10 matches, 2 mismatches, 7 indels
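The counts can be reproduced by scanning the alignment column by column. The sketch below reads the example as two rows of 19 columns and writes each space as a single "-" instead of the slide's "--". The function name is made up for this illustration.

```python
def classify_columns(row1, row2, gap="-"):
    """Count (matches, mismatches, indels) in a two-row alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = indels = 0
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            indels += 1        # a character aligned with a space
        elif a == b:
            matches += 1       # two identical characters
        else:
            mismatches += 1    # two unequal characters
    return matches, mismatches, indels

top    = "AACTACT-CCTAACACT--"
bottom = "--CTCCTACCT--TACTTT"
print(classify_columns(top, bottom))  # (10, 2, 7)
```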

Page 35: Computational Molecular Biology

Basic Algorithmic Problem

Find the alignment of the two strings that maximizes m, where m = (# matches – # mismatches – # indels)

The maximum value of m defines the similarity of the two strings; an alignment that achieves it is called an optimal global alignment

Biologically: a mismatch represents a mutation, whereas an indel represents a historical insertion or deletion of a single character
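The slides do not name an algorithm for this problem. A standard choice is dynamic programming in the style of Needleman and Wunsch, sketched below under the slide's scoring: +1 for a match, -1 for a mismatch, -1 for an indel. The function name and parameter names are assumptions for this illustration, and only the optimal score m is computed, not the alignment itself.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, indel=-1):
    """Best value of (# matches - # mismatches - # indels) over all alignments."""
    n, m = len(s), len(t)
    # score[i][j] = best score for aligning the prefixes s[:i] and t[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * indel          # s[:i] aligned against spaces only
    for j in range(1, m + 1):
        score[0][j] = j * indel          # t[:j] aligned against spaces only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,   # align s[i-1] with t[j-1]
                score[i - 1][j] + indel,      # s[i-1] aligned with a space
                score[i][j - 1] + indel,      # t[j-1] aligned with a space
            )
    return score[n][m]

print(global_alignment_score("ACT", "ACT"))     # 3
print(global_alignment_score("ATCA", "CTCAA"))  # 1
```

The table has (|s| + 1)(|t| + 1) cells and each is filled in constant time, so the running time and space are both proportional to |s|·|t|.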

Page 36: Computational Molecular Biology

Multiple Sequence Alignment

Problem Definition: Similar to the sequence alignment problem, but the input has more than 2 strings

Challenges:
• NP-hard
• Approximation guarantee: factor 2 – 2/k, where k is the number of input sequences
• More work is needed to reduce the time and space complexity

Page 37: Computational Molecular Biology

DNA Sequencing

Problem Definition:
• Given a set of fragments that are contained in a DNA string S
• Goal: Determine the string S

Notes:
• NP-complete
• Further complicated by the existence of repetitive sequences in the genome
• Can be cast as a Hamiltonian path or Euler path problem (introduced by Pavel Pevzner)

Page 38: Computational Molecular Biology

Genome Rearrangement

Problem Definition:
• Given genomes of 2 different species
• Goal: Find a sequence of evolutionary events that turns the first genome into the second one.

Biological reasons: How close are these species, and how much evolution separates them?

E.g.: We usually test new drugs on mice before humans. However, how close is a mouse to a human?

Page 39: Computational Molecular Biology

Genome Rearrangement

Can we use the solutions of sequence alignment to solve this problem?

Answer: NO, because:
• A genome is a very long string (about 3 billion letters for a human genome)
• The sequence alignment model is not appropriate for human genome comparison, since the differences are not insertions, deletions, or mutations of single nucleotides but rearrangements of long DNA regions
• The basic unit of comparison is the gene

Page 40: Computational Molecular Biology

An Example

• Sequence alignment cannot meaningfully compare these two strings

• However, the second string is the first string after reversing the fragment AATGGT…CCC.

Page 41: Computational Molecular Biology

Main Evolutionary Events

• Deletions: a fragment is removed
• Duplications: many copies of a fragment are created and inserted into different positions
• Transpositions: a fragment is removed and re-inserted into a different position
• Inversions: a fragment is removed, reversed, and then reinserted into the same position
• Translocations: a pair of fragments is exchanged between the ends of two chromosomes
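These events can be modeled as operations on strings in which each letter stands for a gene. The sketch below is an illustration only: the function names, the 0-based half-open positions, and the toy genomes are assumptions. It follows the slide in treating an inversion as a plain reversal; in real double-stranded DNA the inverted fragment is also complemented.

```python
def delete(g, i, j):
    """Deletion: remove the fragment g[i:j]."""
    return g[:i] + g[j:]

def invert(g, i, j):
    """Inversion: reverse the fragment g[i:j] in place."""
    return g[:i] + g[i:j][::-1] + g[j:]

def transpose(g, i, j, k):
    """Transposition: remove g[i:j] and re-insert it at position k of the rest."""
    fragment, rest = g[i:j], g[:i] + g[j:]
    return rest[:k] + fragment + rest[k:]

def translocate(c1, c2, i, j):
    """Translocation: exchange the end c1[i:] with the end c2[j:]."""
    return c1[:i] + c2[j:], c2[:j] + c1[i:]

genome = "ABCDEFGH"                        # each letter is one gene
print(delete(genome, 2, 5))                # ABFGH
print(invert(genome, 2, 5))                # ABEDCFGH
print(transpose(genome, 2, 5, 4))          # ABFGCDEH
print(translocate("ABCD", "WXYZ", 2, 1))   # ('ABXYZ', 'WCD')
```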

Page 43: Computational Molecular Biology

Protein Structure Prediction

Problem Definition:
• Given: A sequence of amino acids
• Goal: Predict the 3D structure of the protein

Some approaches:
• Determine the positions of a protein’s atoms so as to minimize the total free energy
• Find similarities to some known proteins

Page 44: Computational Molecular Biology

Community Structure

Problem Definition:
• Given a graph G = (V, E) representing a network
• Partition G into a set of subgraphs (the community structure) so that nodes in each subgraph are highly connected

Biological reason: Genes with similar expression data may have similar functions. Identifying the community structure can help us reduce the number of tests

Others: Community structure is also studied in different fields