Program

Table: Topics

Brute F. Greedy DP D&C Graph Comb. patt. Clust. & trees HMM Rand.

Subject 4 5 6 7 8 9 10 11 12

Mapping DNA *
Sequencing *
Comparing Seqs * * *
Predicting Genes *
Finding Signals * * * *
Identifying Prots *
Repeat Analysis *
DNA arrays * *
Genome Rearrang. *
Molecular evol. *

Bioinfo I (Institut Pasteur de Montevideo) Algorithm Techniques -class1- July 11th, 2011 1 / 85

Recommended book

An Introduction to Bioinformatics Algorithms, Neil Jones and Pavel Pevzner


Algorithm Techniques

In this chapter we will very briefly review the most common algorithmic techniques used in bioinformatics.


Algorithms

An algorithm is a well-defined and finite sequence of steps used to solve a well-defined problem.

Algorithms that solve all instances of the problem for which they were designed are said to be correct.

The running time of an algorithm is the number of machine instructions it executes when run on a particular instance.

For the analysis of the algorithm, the running time is computed for the worst-case instance of the problem.


Running time

Computers need a fixed amount of time t_op to execute one operation (e.g. 10^-9 s)

Algorithms need a certain number of steps s

If t_op and s are known → running time of the algorithm: t_op · s

Since t_op changes constantly, we base the analysis on s (independent of hardware)

s is not always easy to determine → it depends on the input size n


Big-O Notation

Big-O for describing the running time of an algorithm

O(n^2): the running time of the algorithm is bounded by a 2nd-degree polynomial

f(n) = O(n^2): f doesn't grow faster than c · n^2 for some constant c

2n = O(n^2) is valid but uninformative → more informative: 2n = O(n)

Big-O establishes an upper bound for the growth of a function. If f(n) = O(g(n)), then f doesn't grow faster than g


Definitions

Let f and g be real functions

1 One writes f(x) = O(g(x)) if and only if there exist constants c and x0 (c, x0 ∈ R, c > 0) such that

f(x) ≤ c · g(x) for all x ≥ x0

2 One writes f(x) = Ω(g(x)) if and only if there exist constants c and x0 (c, x0 ∈ R, c > 0) such that

f(x) ≥ c · g(x) for all x ≥ x0

3 One writes f(x) = Θ(g(x)) if and only if

f(x) = O(g(x)) and f(x) = Ω(g(x))
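
A short worked example may help to see how the constants c and x0 are chosen in practice. The function f(x) = 3x^2 + 5x is an illustration picked here, not taken from the slides:

```latex
\begin{align*}
f(x) &= 3x^2 + 5x \\
5x &\le x^2 \quad \text{for all } x \ge 5 \\
\Rightarrow\ f(x) &\le 3x^2 + x^2 = 4x^2 \quad \text{for all } x \ge 5
  && \text{so } f(x) = O(x^2) \text{ with } c = 4,\ x_0 = 5 \\
f(x) &\ge 3x^2 \quad \text{for all } x \ge 0
  && \text{so } f(x) = \Omega(x^2) \text{ with } c = 3,\ x_0 = 0
\end{align*}
```

Both bounds hold, so f(x) = Θ(x^2). Other choices of c and x0 work just as well; the definitions only ask that some pair exists.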


Example: Sorting Algorithms

Sorting Problem: Sort a list of integers
Input: A list of n distinct integers a = (a_1, a_2, ..., a_n)
Output: Sorted list of integers, that is, a reordering b = (b_1, b_2, ..., b_n) of the integers from a such that b_1 < b_2 < ... < b_n

Selection Sort Algorithm:
SELECTIONSORT(a, n)
  for i ← 1 to n − 1
    a_j ← smallest element among a_i, a_{i+1}, ..., a_n
    Swap a_i and a_j
  return a


Example: Sorting Algorithms

Recursive Selection Sort:
RECURSIVESELECTIONSORT(a, first, last)
  if first < last
    index ← INDEXOFMIN(a, first, last)
    Swap a_first with a_index
    a ← RECURSIVESELECTIONSORT(a, first + 1, last)
  return a

INDEXOFMIN(array, first, last)
  index ← first
  for k ← first + 1 to last
    if array_k < array_index
      index ← k
  return index
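
The two pseudocode variants can be sketched in Python roughly as follows. This is a minimal illustration that assumes 0-based lists sorted in place; the function names simply mirror the pseudocode.

```python
def index_of_min(a, first, last):
    """Return the index of the smallest element in a[first..last] (inclusive)."""
    index = first
    for k in range(first + 1, last + 1):
        if a[k] < a[index]:
            index = k
    return index


def selection_sort(a):
    """Iterative selection sort: sorts the list in place and returns it."""
    n = len(a)
    for i in range(n - 1):
        j = index_of_min(a, i, n - 1)
        a[i], a[j] = a[j], a[i]  # swap the minimum into position i
    return a


def recursive_selection_sort(a, first=0, last=None):
    """Recursive selection sort on a[first..last] (inclusive)."""
    if last is None:
        last = len(a) - 1
    if first < last:
        index = index_of_min(a, first, last)
        a[first], a[index] = a[index], a[first]
        recursive_selection_sort(a, first + 1, last)
    return a
```

Both functions do the same comparisons; the recursive one only replaces the outer loop by a call on the shorter suffix.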


Complexity Analysis

n − 1 iterations

Iteration i analyzes n − i + 1 elements

The approx. number of operations: n + (n − 1) + (n − 2) + ... + 2 + 1 = 1 + 2 + ... + n = n(n+1)/2

In each iteration a swap: 3 ops

Total: n(n+1)/2 + 3(n − 1) → O(n^2)

SELECTIONSORT(a, n)
  for i ← 1 to n − 1
    j ← INDEXOFMIN(a, i, n)
    Swap elements a_i and a_j
  return a

INDEXOFMIN(array, first, last)
  index ← first
  for k ← first + 1 to last
    if array_k < array_index
      index ← k
  return index


Complexity Analysis

Let T(n) be the time the algorithm needs for an input of size n

Finding the smallest element → at most n steps

Recursive call on an array of size n − 1 → T(n − 1)

Call on an array of size 1 → T(1) = 1

It holds:
T(n) = n + T(n − 1), T(1) = 1
T(n) = n + (n − 1) + T(n − 2) = n + (n − 1) + (n − 2) + ... + 2 + T(1) → O(n^2)

Recursive Selection Sort:
RECURSIVESELSORT(a, first, last)
  if first < last
    index ← INDEXOFMIN(a, first, last)
    Swap a_first with a_index
    a ← RECURSIVESELSORT(a, first + 1, last)
  return a


Algorithms

Conceptually we distinguish

Algorithm strategy

Algorithm structure
  - recursive
  - iterative

Algorithm solution
  - find a good solution
  - find the best solution(s)


Algorithm strategies

Brute force algorithms

Greedy algorithms

Recursive algorithms

Backtracking algorithms

Branch and bound algorithms

Divide and conquer algorithms

Dynamic programming algorithms

Heuristic algorithms


Brute Force or Exhaustive Search

Systematically enumerating all possible candidates for the solution and checking whether each candidate satisfies the problem’s statement

Simple

Very slow

Used as starting point for other types of algorithms


Greedy

Many algorithms are iterative processes

Greedy algorithms choose in each iteration the most “attractive” alternative


Recursive

A combinatorial problem: Fibonacci numbers

n    0 1 2 3 4 5 6 7  8  9  10 11
F_n  0 1 1 2 3 5 8 13 21 34 55 89

The problem of the Fibonacci numbers is a classical example of a recursion problem:

F_0 = 0

F_1 = 1

F_n = F_{n−1} + F_{n−2}


Recursions

Recursion: reapply the algorithm to a subproblem

Another example: N!, the factorial of a number N:

function fact(N)
  if (N <= 1)
    return 1
  else
    return N * fact(N-1)


Backtracking

Backtracking is a general technique for organizing the exhaustive search for a solution to a combinatorial problem.

The backtracking technique can be applied to those problems that exhibit the domino principle: if a constraint (condition) is not satisfied by a partial solution, the constraint will not be satisfied by any extension of the partial solution to a global solution.


Backtracking

Domino principle

[Figure: a row of dominoes numbered 1, 2, 3, ..., n, n+1, each of height h and spaced w apart]

Given h (height of a domino) > w (space between dominoes):
we knock over the first domino
if the nth domino falls, then the (n + 1)st domino will fall.


Backtracking

The backtracking algorithm enumerates a set of partial candidates that could be completed in various ways to give all the possible solutions to the given problem.

The solution is built incrementally, by a sequence of candidate extension steps.

Conceptually, the partial candidates are the nodes of a tree, the “search tree”

Each partial candidate is the parent of the candidates that differ from it by a single extension step

Leaves of the tree are the partial candidates that cannot be further extended

The backtracking algorithm traverses this search tree recursively, from the root down, in depth-first order


Backtracking

[Figure: search tree with a root and nodes numbered 1 to 6]

At each node c, the algorithm checks whether c can be completed to a valid solution

If it cannot, the whole sub-tree rooted at c is skipped (pruned)

Otherwise, the algorithm (a) checks whether c itself is a valid solution and (b) recursively enumerates all sub-trees of c

The actual search tree that is traversed by the algorithm is only a part of the tree. The total cost of the algorithm is the number of nodes of the actual tree times the cost of obtaining and processing each node.


Backtracking

Example: Eight queens puzzle → how to place 8 queens on a chess board so that no two queens attack each other

Consider one row of the board at a time

Eliminate most non-solution board positions at a very early stage

Backtracking rejects attacks on incomplete boards, hence it examines only 15,720 possible queen placements (brute force: 64^8 = 281,474,976,710,656)

The actual search tree is only a part of the tree. The total cost of the algorithm is the number of nodes of the actual tree × the cost of obtaining and processing each node.
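
A row-by-row backtracking search for the queens puzzle might look like the sketch below. The function name `count_queens` and the three bookkeeping sets are choices made for this illustration; the slides do not prescribe an implementation.

```python
def count_queens(n=8):
    """Count the ways to place n non-attacking queens, one row at a time."""
    cols, diag1, diag2 = set(), set(), set()  # attacked columns and diagonals

    def place(row):
        if row == n:  # every row holds a queen: one complete solution
            return 1
        total = 0
        for col in range(n):
            if col in cols or (row - col) in diag1 or (row + col) in diag2:
                continue  # partial board already has an attack: prune this sub-tree
            cols.add(col)
            diag1.add(row - col)
            diag2.add(row + col)
            total += place(row + 1)
            # undo the extension step (backtrack) before trying the next column
            cols.remove(col)
            diag1.remove(row - col)
            diag2.remove(row + col)
        return total

    return place(0)
```

`count_queens(8)` returns 92, the known number of solutions of the eight queens puzzle.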


Branch-and-Bound

The branch-and-bound method can be used for finding one or all solutions of a combinatorial problem where solutions are associated with a cost, such that the cost of the whole solution cannot be smaller than the cost of any partial solution → optimization problems

The technique consists of remembering the lowest-cost solution found at each stage of the backtracking search, and using the cost of the lowest-cost solution found so far as an upper bound on the cost of a least-cost solution to the problem, in order to discard partial solutions with costs larger than the lowest-cost solution found so far.


Branch-and-Bound

Represent again as a tree: the root of the bb-tree is a so-called dummy node of cost zero, the nodes at level one represent the possible values which the first variable can be assigned to, the nodes at level two represent the possible values which the second variable can be assigned to, given the value which the first variable was assigned to, and so on.

Subtrees rooted at nodes of cost greater than the cost of a previous leaf node are pruned off the bb-tree.

[Figure: example bb-tree with nodes labeled A, B, C, E, S and edge costs]

Problem: the running time can become exponential
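
As an illustration of the pruning rule, here is a small branch-and-bound sketch for a travelling-salesman tour. The problem choice, the name `tsp_branch_and_bound` and the distance-matrix input are assumptions made for this example, and the pruning is only valid when all distances are non-negative.

```python
def tsp_branch_and_bound(dist):
    """Cheapest round trip through all cities, starting and ending at city 0.

    dist[i][j] is the (non-negative) cost of going from city i to city j.
    """
    n = len(dist)
    best = {"cost": float("inf"), "tour": None}

    def extend(tour, visited, cost):
        if cost >= best["cost"]:
            return  # bound: this partial tour is already no better than the best found
        if len(tour) == n:
            total = cost + dist[tour[-1]][tour[0]]  # close the tour
            if total < best["cost"]:
                best["cost"], best["tour"] = total, tour[:]
            return
        for city in range(n):  # branch: try every unvisited city next
            if city not in visited:
                visited.add(city)
                tour.append(city)
                extend(tour, visited, cost + dist[tour[-2]][city])
                tour.pop()
                visited.remove(city)

    extend([0], {0}, 0)
    return best["cost"], best["tour"]
```

In the worst case no branch is pruned and all (n − 1)! tours are visited, which is the exponential behaviour mentioned above.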


Divide-and-Conquer

Definition: An algorithmic technique. To solve a problem on an instance of size n, a solution is found either directly because solving that instance is easy (typically, because the instance is small) or the instance is divided into two or more smaller instances. Each of these smaller instances is recursively solved, and the solutions are combined to produce a solution for the original instance.


Divide-and-Conquer Methodology

1 Given a problem, identify a small number of significantly smaller subproblems of the same type

2 Solve each subproblem recursively (the smallest possible size of a subproblem is a base case)

3 Combine these solutions into a solution for the main problem

The name divide and conquer comes from the fact that the problem is conquered by dividing it into several smaller problems.


Divide-and-Conquer

The divide-and-conquer technique can be applied to those problems that exhibit the independence principle: a problem instance can be divided into a series of smaller problem instances which are independent of each other.

Example: One of the simplest examples is “Quicksort” of an array: partition the array into two parts, and quicksort each of the parts. Here, in fact, no additional work is required to combine the two sorted parts. Running time: O(n log n) on average, O(n^2) in the worst case
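
A compact way to see the three steps (divide, solve recursively, combine) is the sketch below. It is an out-of-place variant chosen for readability, so the "combine" step is a list concatenation; the in-place version described above needs no combining at all. Using the middle element as pivot is an arbitrary choice for this example.

```python
def quicksort(a):
    """Return a sorted copy of the list a (out-of-place quicksort)."""
    if len(a) <= 1:  # base case: nothing to sort
        return a
    pivot = a[len(a) // 2]
    smaller = [x for x in a if x < pivot]   # divide: elements below the pivot
    equal = [x for x in a if x == pivot]
    larger = [x for x in a if x > pivot]    # divide: elements above the pivot
    # the two parts are independent, so each is sorted on its own
    return quicksort(smaller) + equal + quicksort(larger)
```

The quadratic worst case occurs when the pivot repeatedly splits the array into parts of size 0 and n − 1.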


Divide-and-Conquer

When a problem is solved by “divide-and-conquer”, sometimes the same subproblem appears multiple times.
A recursive algorithm that follows the definition of the Fibonacci numbers directly is:

Fibonacci-R(i)
  if i = 0
    then return 0
  else
    if i = 1
      then return 1
    else return Fibonacci-R(i-1) + Fibonacci-R(i-2)


Divide-and-Conquer

However, it is easy to see that the algorithm is not efficient, since values of F_i are calculated several times independently.

[Figure: recursion tree of Fibonacci-R(n), in which the subproblems n−2, n−3, n−4, ... appear several times]
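
The amount of repeated work can be made visible by counting calls. The sketch below contrasts the direct recursion with a memoized version; the call counter and the use of `functools.lru_cache` are additions for this illustration only.

```python
from functools import lru_cache


def fib_naive(i, counter):
    """Direct recursion; counter[0] is incremented once per call."""
    counter[0] += 1
    if i < 2:
        return i
    return fib_naive(i - 1, counter) + fib_naive(i - 2, counter)


@lru_cache(maxsize=None)
def fib_memo(i):
    """Same recursion, but each F_i is computed only once and then cached."""
    if i < 2:
        return i
    return fib_memo(i - 1) + fib_memo(i - 2)
```

For i = 20 the naive version makes 21,891 calls to obtain F_20 = 6765, whereas the memoized version evaluates each of the 21 values F_0, ..., F_20 once.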


Randomized Algorithms

Toss a coin to decide where to start looking for the phone

Not as intuitive as deterministic algorithms


Machine Learning

Collect statistics over the course of a year about where you leave the phone, learning where the phone tends to end up most of the time.

E.g. 80% of the time it was left in the bathroom, 15% in the bedroom and 5% in the kitchen

Strategy: first look in the bathroom, then in the bedroom and finally in the kitchen


Dynamic Programming

Dynamic Programming is a very general programming technique.

Most often applied in the construction of algorithms to solve a certain class of optimisation problems, i.e. problems which require the minimisation or maximisation of some measure.

Applicable when a large search space can be structured into a succession of stages, such that the initial stage contains trivial solutions to sub-problems, each partial solution in a later stage can be calculated by recurring on only a fixed number of partial solutions in an earlier stage, and the final stage contains the overall solution.

The method usually accomplishes this by maintaining a table or matrix of sub-instance results.


Dynamic Programming

Dynamic programming can be thought of as being the reverse of recursion or divide-and-conquer.
  - Divide-and-conquer is a top-down mechanism: we take a problem, split it up, and solve the smaller problems that are created.
  - Dynamic programming is a bottom-up mechanism: we solve all possible small problems and then combine them to obtain solutions for bigger problems.


Dynamic Programming

A general DP algorithm consists of 4 steps:

1 Characterization of the structure of the (an) optimal solution

2 Recursive definition of the value of an optimal solution

3 Computation of the optimum using recursion

4 Construction of an optimal solution through the computed optimal value.


Dynamic Programming

Example: “The Rocks game”

2 players, 2 piles of rocks, say 10 each

In each turn one player may take either one rock (from either pile) or two rocks (one from each pile). Taken rocks are removed from the game.

The player that takes the last rock wins the game

To find the winning strategy we construct a 10 × 10 table R:
→ If Player 1 can always win the game (i, j), then we say R_ij = W
→ If Player 1 loses the game (i, j), then we say R_ij = L
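
The table R can be filled bottom-up, from small games to large ones. The sketch below is one possible way to do it; it adds a row and a column for empty piles (index 0) as a base case, which is an assumption of this example and not part of the 10 × 10 table described above.

```python
def rocks_table(n, m):
    """R[i][j] is "W" if the player to move wins with piles of i and j rocks."""
    R = [[None] * (m + 1) for _ in range(n + 1)]
    R[0][0] = "L"  # no rocks left: the previous player took the last rock
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            outcomes = []  # results, for the opponent, of each legal move
            if i > 0:
                outcomes.append(R[i - 1][j])      # take one rock from pile 1
            if j > 0:
                outcomes.append(R[i][j - 1])      # take one rock from pile 2
            if i > 0 and j > 0:
                outcomes.append(R[i - 1][j - 1])  # take one rock from each pile
            # a position is winning if some move leaves the opponent in "L"
            R[i][j] = "W" if "L" in outcomes else "L"
    return R
```

With `R = rocks_table(10, 10)`, the entry `R[10][10]` is "L": the player who moves first loses whenever both piles hold an even number of rocks.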


Tractable vs. Non Tractable Problems

Algorithms can be classified according to their complexity

Problems can also be classified according to their inherent complexity

There are problems for which no polynomial algorithm exists, e.g. enumerating all subsets of n elements

Other problems can be solved in polynomial time

Between these two, exponential and polynomial problems, lie the NP-complete problems


Tractable vs. Non Tractable Problems

Problems for which there is no known polynomial algorithm, but for which it has not been proved that none exists

The classic: Traveling-Salesman Problem


Literature

Sources and further recommended reading:

Schöning, Algorithmik, Spektrum Akademischer Verlag, 2001.

Kay Nieselt's Lecture Notes (Grundlagen der Bioinformatik, SS 2007), Eberhard Karls Universität Tübingen

N. C. Jones and P. A. Pevzner, An Introduction to Bioinformatics Algorithms, 2004


DNA Mapping, Motifs and Brute Force Algorithms

In this chapter we will see:

Restriction Enzymes

Gel Electrophoresis

Partial Digest Problem

Brute Force Algorithm for Partial Digest Problem

Branch and Bound Algorithm for Partial Digest Problem

Double Digest Problem

Finding Regulatory Motifs


Molecular Scissors

Molecular Cell Biology, 4th edition


Molecular Scissors

ePlantScience.com, An online botanical encyclopedia, Chapter 3.


Uses of restriction enzymes

Recombinant DNA technology
  - Recombinant technology starts with the isolation of a gene of interest. It is then inserted into a vector and cloned
  - Recombinant proteins result from the expression of rDNA

DNA cloning
  - A technique to reproduce DNA fragments
  - Cell-based or via PCR

cDNA/genomic library construction
  - mRNA → cDNA → restriction enzyme + ligase → into plasmid
  - genomic regions

DNA mapping


Restriction maps

A map showing positions of restriction sites in a DNA sequence

If the DNA sequence is known, then constructing the restriction map is a trivial exercise

In the early days of molecular biology, DNA sequences were often unknown

Biologists had to solve the problem of constructing restriction maps without knowing the DNA sequences


Full Restriction Digest

A map showing positions of restriction sites in a DNA sequence

Cutting DNA at each restriction site creates multiple restrictionfragments:

Is it possible to reconstruct the order of the fragments from the sizes of the fragments {3, 5, 5, 9}?


Full Restriction Digest

Multiple solutions: [Figure: two alternative orderings of the same fragments]


Measuring length of fragments: Gel electrophoresis

Gel electrophoresis: process for separating DNA by size and measuring the sizes of restriction fragments

Can separate DNA fragments that differ in length by only 1 nucleotide, for fragments up to 500 nucleotides long

Using an electric field, molecules can be made to move through a gel (agar)


Measuring length of fragments: Gel electrophoresis

The gel is placed in an electrophoresis chamber. When the electric current is applied, the larger molecules move more slowly through the gel while the smaller molecules move faster. The different-sized molecules form bands on the gel


Detecting DNA

One possibility to visualize DNA bands: Fluorescence

The gel is incubated with a solution containing the fluorescent dye ethidium

Ethidium binds to the DNA

The DNA lights up when the gel is exposed to ultraviolet light.


Partial Restriction Digest

The sample of DNA is exposed to the restriction enzyme for only a limited amount of time to prevent it from being cut at all restriction sites

This experiment generates the set of all possible restriction fragments between every two (not necessarily consecutive) cuts

This set of fragment sizes is used to determine the positions of the restriction sites in the DNA sequence


Partial Restriction Digest: Example

Partial Digest results in the following 10 restriction fragments:

Multiset: {3, 5, 5, 8, 9, 14, 14, 17, 19, 22}

→ We assume that the multiplicity of a fragment can be detected, i.e., the number of restriction fragments of the same length can be determined (e.g., by observing twice as much fluorescence intensity for a double fragment as for a single fragment)


Partial Digest

Fundamentals:

X: the set of n integers representing the locations of all cuts in the restriction map, including the start and end

n: the total number of cuts

DX: the multiset of integers representing the lengths of each of the C(n, 2) = n(n−1)/2 fragments produced from a partial digest


Partial Digest

A way of representing n, X, DX:

Representation of DX = {2, 2, 3, 3, 4, 5, 6, 7, 8, 10} as a two-dimensional table, with the elements of X = {0, 2, 4, 7, 10} along both the top and left side. The element at (i, j) in the table is x_j − x_i for 1 ≤ i < j ≤ n.
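
Computing DX from X is a one-liner once all pairs are enumerated. The helper name `delta` is chosen here for illustration; it returns the multiset as a sorted list.

```python
from itertools import combinations


def delta(X):
    """Sorted multiset of pairwise distances between the points of X."""
    return sorted(abs(b - a) for a, b in combinations(X, 2))


# The example from the table: five cut positions give C(5, 2) = 10 distances.
print(delta([0, 2, 4, 7, 10]))  # [2, 2, 3, 3, 4, 5, 6, 7, 8, 10]
```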


Partial Digest Problem

Formulation:

Goal: Given all pairwise distances between points on a line, reconstruct the positions of those points

Input: The multiset of pairwise distances L, containing n(n−1)/2 integers

Output: A set X of n integers such that DX = L


Partial Digest Problem: Multiple Solutions

It is not always possible to uniquely reconstruct a set X based only on DX

For example, the sets X = {0, 2, 5} and X + 10 = {10, 12, 15} both produce DX = {2, 3, 5} as their partial digest set.

The sets {0, 1, 2, 5, 7, 9, 12} and {0, 1, 5, 7, 8, 10, 12} present a less trivial example of non-uniqueness. They both digest into:
{1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5, 6, 7, 7, 7, 8, 9, 10, 11, 12}
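
Both examples are easy to check mechanically. In this small sketch the multiset is represented by a `collections.Counter`, and the names `delta`, `A` and `B` are introduced only for this illustration.

```python
from collections import Counter
from itertools import combinations


def delta(X):
    """Multiset of pairwise distances of X, as a Counter (value -> multiplicity)."""
    return Counter(abs(b - a) for a, b in combinations(X, 2))


A = [0, 1, 2, 5, 7, 9, 12]
B = [0, 1, 5, 7, 8, 10, 12]

print(delta([0, 2, 5]) == delta([10, 12, 15]))  # True: shifting X does not change DX
print(delta(A) == delta(B))                     # True: A and B are not shifts or mirrors
```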


Homometric Sets


Brute Force Algorithms

Also known as exhaustive search algorithms; examine every possible variant to find a solution

Efficient in rare cases; usually impractical


Partial Digest Problem: Brute Force

Find the restriction fragment of maximum length M. M is the length of the DNA sequence.
For every possible set X = {0, x_2, ..., x_{n−1}, M}, compute the corresponding DX.
If DX is equal to the experimental partial digest L, then X is the correct restriction map.

BruteForcePDP(L, n):
  M ← maximum element in L
  for every set of n − 2 integers 0 < x_2 < ... < x_{n−1} < M
    X ← {0, x_2, ..., x_{n−1}, M}
    Form DX from X
    if DX = L
      return X
  output “no solution”
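
A direct transcription of BruteForcePDP into Python could look like this. The helper `delta` and the convention of returning `None` instead of printing “no solution” are choices made for this sketch.

```python
from itertools import combinations


def delta(X):
    """Sorted multiset of pairwise distances between the points of X."""
    return sorted(abs(b - a) for a, b in combinations(X, 2))


def brute_force_pdp(L, n):
    """Try every set of n - 2 interior positions strictly between 0 and M."""
    L = sorted(L)
    M = L[-1]  # the longest fragment spans the whole sequence
    for inner in combinations(range(1, M), n - 2):
        X = [0, *inner, M]
        if delta(X) == L:
            return X
    return None  # no solution
```

For L = {2, 2, 3, 3, 4, 5, 6, 7, 8, 10} and n = 5 the loop runs over the C(9, 3) = 84 candidate sets, which is still tiny; the number of candidates grows with M rather than with the size of L.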


Efficiency of Brute Force

BruteForcePDP takes O(M^(n−2)) time since it must examine all possible sets of positions.

One way to improve the algorithm is to limit the values of x_i to only those values which occur in L.

AnotherBruteForcePDP(L, n):
  M ← maximum element in L
  for every set of n − 2 integers 0 < x_2 < ... < x_{n−1} < M from L
    X ← {0, x_2, ..., x_{n−1}, M}
    Form DX from X
    if DX = L
      return X
  output “no solution”
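
The restriction to values from L works because every position x_i is itself a distance (x_i − 0) and therefore must occur in L. A sketch under the same conventions as before (hypothetical helper `delta`, `None` for “no solution”):

```python
from itertools import combinations


def delta(X):
    """Sorted multiset of pairwise distances between the points of X."""
    return sorted(abs(b - a) for a, b in combinations(X, 2))


def another_brute_force_pdp(L, n):
    """Like the brute force search, but positions are drawn only from values in L."""
    L = sorted(L)
    M = L[-1]
    candidates = sorted(set(L) - {M})  # distinct distances smaller than M
    for inner in combinations(candidates, n - 2):
        X = [0, *inner, M]
        if delta(X) == L:
            return X
    return None  # no solution
```

For L = {2, 998, 1000} and n = 3 only the two candidates 2 and 998 are tried, instead of the 999 interior positions examined by the first version.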


Efficiency of AnotherBruteForcePDP

It's more efficient, but still slow. This algorithm examines C(|L|, n−2) different sets of positions

If L = {2, 998, 1000} (n = 3, M = 1000), BruteForcePDP will be extremely slow, but AnotherBruteForcePDP will be quite fast

Fewer sets are examined, but the runtime is still exponential: O(n^(2n−4))


Branch and Bound Algorithm for PDP

1 Begin with X = {0}

2 Remove the largest element in L and place it in X

3 See if the element fits on the right or left side of the restriction map

4 When it fits, find the other lengths it creates and remove those from L

5 Go back to step 2 until L is empty


PartialDigest Algorithm

Before describing PartialDigest, first define D(y, X) as the multiset of all distances between point y and all other points in the set X

D(y, X) = {|y − x_1|, |y − x_2|, ..., |y − x_n|}

for X = {x_1, x_2, ..., x_n}

PartialDigest(L):
  width ← maximum element in L
  DELETE(width, L)
  X ← {0, width}
  PLACE(L, X)


PartialDigest Algorithm

PLACE(L, X)
  if L is empty
    output X
    return
  y ← maximum element in L
  if D(y, X) ⊆ L
    Add y to X and remove lengths D(y, X) from L
    PLACE(L, X)
    Remove y from X and add lengths D(y, X) to L
  if D(width − y, X) ⊆ L
    Add width − y to X and remove lengths D(width − y, X) from L
    PLACE(L, X)
    Remove width − y from X and add lengths D(width − y, X) to L
  return
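
One way to turn PartialDigest and PLACE into running code is sketched below. It keeps L as a `collections.Counter` so that multiset containment and removal are simple. The name `partial_digest`, the nested helpers, and collecting solutions in a set (to drop duplicates that symmetric inputs can produce) are choices of this sketch, not part of the pseudocode.

```python
from collections import Counter


def partial_digest(L):
    """Return all sets X (as sorted lists) whose pairwise distances equal L."""
    remaining = Counter(L)
    width = max(remaining)
    remaining[width] -= 1  # DELETE(width, L)
    X = [0, width]
    solutions = set()

    def distances(y):
        """D(y, X): multiset of distances from y to the points placed so far."""
        return Counter(abs(y - x) for x in X)

    def place():
        if sum(remaining.values()) == 0:  # L is empty: X explains every distance
            solutions.add(tuple(sorted(X)))
            return
        y = max(k for k, c in remaining.items() if c > 0)
        for candidate in dict.fromkeys((y, width - y)):  # left end, then right end
            d = distances(candidate)
            if all(remaining[k] >= c for k, c in d.items()):  # D(candidate, X) ⊆ L
                remaining.subtract(d)
                X.append(candidate)
                place()
                X.pop()              # backtrack: undo the placement
                remaining.update(d)

    place()
    return [list(s) for s in sorted(solutions)]
```

For L = {2, 2, 3, 3, 4, 5, 6, 7, 8, 10} this returns [0, 2, 4, 7, 10] and its mirror image [0, 3, 6, 8, 10].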


An Example

L = {2, 2, 3, 3, 4, 5, 6, 7, 8, 10}
X = {0}

Remove 10 from L and insert it into X. We know this must be the length of the DNA sequence because it is the largest fragment.

L = {2, 2, 3, 3, 4, 5, 6, 7, 8}
X = {0, 10}

Take 8 from L and make y = 2 or 8. But since the two cases are symmetric, we can assume y = 2.

We find that the distances from y = 2 to the other elements in X are D(y, X) = {8, 2}, so we remove {8, 2} from L and add 2 to X.

L = {2, 3, 3, 4, 5, 6, 7}
X = {0, 2, 10}

Take 7 from L and make y = 7 or y = 10 − 7 = 3. We will explore y = 7 first, so D(y, X) = {7, 5, 3} = {|7 − 0|, |7 − 2|, |7 − 10|}. Therefore we remove {7, 5, 3} from L and add 7 to X.

L = {2, 3, 4, 6}
X = {0, 2, 7, 10}

Take 6 from L and make y = 6. Unfortunately D(y, X) = {6, 4, 1, 4}, which is not a subset of L. Therefore we won’t explore this branch.

This time make y = 4. D(y, X) = {4, 2, 3, 6}, which is a subset of L, so we will explore this branch. We remove {4, 2, 3, 6} from L and add 4 to X.

L = {}
X = {0, 2, 4, 7, 10}

L is now empty, so we have a solution, which is X.

To find other solutions, we backtrack.

L = {2, 3, 4, 6}
X = {0, 2, 7, 10}

More backtracking.

L = {2, 3, 3, 4, 5, 6, 7}
X = {0, 2, 10}

This time we will explore y = 3. D(y, X) = {3, 1, 7}, which is not a subset of L, so we won’t explore this branch.

L = {2, 2, 3, 3, 4, 5, 6, 7, 8}
X = {0, 10}

We backtracked to the root. Therefore we have found all the solutions (up to the mirror image, since we fixed y = 2 at the first step).


Complexity analysis of PartialDigest Problem

Still exponential in the worst case, but very fast on average

Informally, let T(n) be the time PartialDigest takes to place n cuts:

No branching case: T(n) = T(n − 1) + O(n)

→ Quadratic

Branching case: T(n) = 2T(n − 1) + O(n) = 2(2T(n − 2) + O(n)) + O(n) = ...

→ Exponential
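
For readers who want to see where “quadratic” and “exponential” come from, the two recurrences can be unrolled as follows. The O(n) term is written as cn with an unspecified constant c, which is an assumption made for this derivation.

```latex
\begin{align*}
\text{No branching:}\quad T(n) &= T(n-1) + cn = c\,(n + (n-1) + \dots + 2) + T(1) = O(n^2) \\
\text{Branching:}\quad T(n) &= 2T(n-1) + cn \\
  &= 2^{k}\,T(n-k) + c\sum_{i=0}^{k-1} 2^{i}(n-i) \\
  &= 2^{n-1}\,T(1) + c\sum_{i=0}^{n-2} 2^{i}(n-i) \qquad (k = n-1) \\
  &\le 2^{n-1}\,T(1) + c\,2^{n}\sum_{j \ge 1} \frac{j}{2^{j}}
   = 2^{n-1}\,T(1) + 2c\,2^{n} = O(2^n)
\end{align*}
```

The branching case assumes that both alternatives are explored at every level, which is the worst case; in practice most branches are cut off early.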


Double Digest Problem (DDP)

Double Digest is yet another experimental method to construct restriction maps

Uses two restriction enzymes; three full digests:
  - One with only the first enzyme
  - One with only the second enzyme
  - One with both enzymes

Computationally, the Double Digest problem is more complex than the Partial Digest problem


Double Digest Problem (DDP)

Without the information about X (i.e. A + B), it is impossible to solve the DDP


Double Digest Problem

Formulation:

Input:
dA → fragment lengths from the digest with enzyme A
dB → fragment lengths from the digest with enzyme B
dX → fragment lengths from the digest with both A and B

Output:
A → location of the cuts in the restriction map for the enzyme A
B → location of the cuts in the restriction map for the enzyme B
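
The formulation can be made concrete with an exhaustive search over all orderings of the fragments in dA and dB. This is only a brute-force sketch for very small inputs; the function names and the convention of returning the first consistent pair of cut lists (or `None`) are assumptions of this example.

```python
from itertools import accumulate, permutations


def cuts(fragments):
    """Interior cut positions implied by an ordered list of fragment lengths."""
    return list(accumulate(fragments))[:-1]


def double_digest(dA, dB, dX):
    """Find cut positions A and B whose combined digest has fragment lengths dX."""
    target = sorted(dX)
    total = sum(dA)  # length of the DNA sequence
    for order_a in sorted(set(permutations(dA))):
        A = cuts(order_a)
        for order_b in sorted(set(permutations(dB))):
            B = cuts(order_b)
            points = sorted({0, total, *A, *B})  # all cuts of the double digest
            lengths = sorted(q - p for p, q in zip(points, points[1:]))
            if lengths == target:
                return A, B
    return None  # no ordering is consistent with dX
```

For example, `double_digest([1, 2, 3], [2, 4], [1, 1, 1, 3])` finds a consistent map; typically several orderings fit the same data.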


DDP: Multiple Solutions
