Bioinformatics Algorithms

Lecture 1

© Jeff Parker, 2009

It is always advisable to perceive clearly our ignorance. Charles Darwin


Outline

What is this course about?

What do I need to know?

What will I learn?

What tools will I be using?

What is our first task?


Outline

Introduce an interesting problem from Biology

Apply some Computer Science techniques

Introduce some biological background

Explain the motivation for our problem

Look at exact pattern match

Find a faster algorithm

Look at approximate pattern match

Find a much faster algorithm that uses Dynamic Programming

Algorithm used in tools such as Basic Local Alignment Search Tool (BLAST)

http://www.ncbi.nlm.nih.gov/BLAST/

If time permits, we will consider searching a text for multiple patterns


What is Bioinformatics?

The use of techniques from mathematics, statistics, and computer science to solve biological problems

Many activities of the cell can be interpreted as manipulation of strings from a small alphabet

Things we will not be studying

How to use cells to perform computation

How the cells perform the computation

Instead, we will be studying computations that can help us

Identify genes that are similar - pattern matching

Retrace evolutionary history - phylogenetic trees

Understand how some reactions are facilitated - protein folding


Pattern Matching

We are interested in exact match and inexact match


Phylogenetic Trees


Protein Folding

Proteins are defined by a sequence, but their use depends upon their three dimensional shape

Evolution has selected proteins that reliably assume the same shape


Biotechnology @ Extension

The Extension School has an ALM in Biotechnology program

This course serves as one of the Information Technology courses

Prerequisite: CS 119 (Data Structures)

Comfort reading and writing algorithms

Comfort evaluating their running time

Will run as a standard lecture course with problem sets

We hope to have more interaction in class than in a typical lecture

Edward Freedman, course TF

Broad Institute at MIT

Chair of the Boston chapter of the ACM

The course is new: we expect to hear about your interests


What do I need to know?

Enough Biology to understand the Central Dogma

Enough Programming to read and write algorithms

A willingness to explore regions that we don’t understand fully


What will I learn?

An understanding of how some Bioinformatics tools work

The datasets are huge and many of the problems are intractable (NP-complete)

Thus most algorithms are heuristics (algorithms that may not yield an optimal solution, but find one quickly)

An appreciation for the strengths and weaknesses of certain approaches

An introduction to a wide range of computer algorithms

An introduction to an important new field


What tools will I be using?

Biologists use a number of tools, such as BLAST

Our prime interest will be in understanding how these algorithms work

We will be using a computer language to express algorithms


What is our first task?

We will begin with a simple problem to introduce the major ideas

Pattern Matching

Understand the problem

Write some algorithms

Look at the odds

First we need to review some basic Biology

We will not get through all of my notes tonight


Central Dogma of Biology

To understand Life, we must understand

DNA - holds information on how the cell works

RNA - is used to transfer information from DNA and to build proteins

Proteins - which form enzymes that are used to signal and regulate all activity, and build key components

All three can be viewed as a string of symbols from a small alphabet

DNA - 4 characters: A G T C

Adenine, Guanine, Thymine, Cytosine (A-T, C-G)

RNA - like DNA, replacing Thymine with Uracil

Protein - 20 amino acids - Glycine, Alanine, Valine, etc.


Central Dogma of Biology

All living organisms are described by 4 letter strings of DNA

A-T and G-C form complementary pairs, as shown in the slide's figure

The figure shows replication; more realistic images come later
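The pairing rule can be sketched in a few lines of Python. This is an illustration rather than code from the lecture; the names COMPLEMENT, complement, and reverse_complement are chosen here, and the sketch assumes the strand contains only the letters A, T, G, and C.

```python
# Pairing rule from the slide: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Return the base that pairs with each base of the strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand):
    """Return the partner strand read in the opposite direction.

    The two strands of the helix run in opposite directions, so the
    partner strand is usually written reversed.
    """
    return complement(strand)[::-1]
```

For example, complement("ATCG") gives "TAGC", and applying reverse_complement twice returns the original strand.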


DNA

DNA is a double helix made up of

Sugar Molecule

Phosphate Group

a base that holds the information

The two sides are not symmetric

The sugar molecule has 5 carbons

Note the special role of carbons 3' and 5'

DNA rebuilding proceeds naturally only at the 3' end.


Replication and Transcription

We duplicate all of a DNA strand

We transcribe a gene (a section of the strand) to mRNA, which is translated

news.bbc.co.uk/2/shared/spl/hi/sci_nat/03/dna50/how_dna_works/html/default.stm


Translation

Triplets of RNA bases (called codons) describe the 20 amino acids that are used to build up proteins.

Sample amino acids: Leucine, Proline, …

There is redundancy in the encoding: 4^3 = 64 > 20

Different codons may yield the same amino acid

ACU, ACC, ACA, ACG (written ACT, ACC, ACA, ACG in the DNA alphabet) all yield Threonine, C4H9NO3
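The counting argument can be checked with a short sketch. The table below holds only the four Threonine codons named on the slide; a complete table would cover all 64 codons. The names ALL_CODONS, THREONINE_CODONS, and to_rna are illustrative.

```python
from itertools import product

RNA_BASES = "ACGU"

# Every triplet over a 4-letter alphabet: 4 * 4 * 4 of them.
ALL_CODONS = ["".join(triplet) for triplet in product(RNA_BASES, repeat=3)]

# Only the four codons named on the slide; a real table covers all 64.
THREONINE_CODONS = {"ACU": "Thr", "ACC": "Thr", "ACA": "Thr", "ACG": "Thr"}


def to_rna(dna_codon):
    """Rewrite a codon from the DNA alphabet in the RNA alphabet (T becomes U)."""
    return dna_codon.replace("T", "U")
```

len(ALL_CODONS) is 64, and to_rna("ACT") is "ACU", one of the four keys that map to Threonine.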


Pattern Match

A known gene may help us understand an unknown gene with similar structure

Proteins with similar makeup may act similarly.

Locating similar genes in different organisms can help us trace lineage.

Our Goal

Want to be able to find approximate matches for a gene or protein.

Model this as a search for a pattern in a text.

A related problem is looking at similarities between strings.

Parallel solution

Problem is hard because

Strings are very long

The set of possible matches is large

We start with a simpler problem: exact match


Exact Pattern Matching

The basis for most exact pattern matching follows

Algorithm

Line up text and pattern

Compare the two

If they match

Report the position of match

Else

Slide pattern to right and try again

[Figure: the pattern lined up beneath the text]


Python Pattern Match

def find(text, pattern):
    """Look for pattern in the string text."""
    # Stop early enough that text[x+y] never runs off the end of text.
    for x in range(len(text) - len(pattern) + 1):
        for y in range(len(pattern)):
            if (text[x+y] != pattern[y]):
                break
            if (y == len(pattern) - 1):
                return x
    return -1

print find("This is my wish", "is")

Notes on the code

The def statement on the first line defines a function


Python Pattern Match

def simpleSearch(text, pattern):
    """Look for pattern in the string text."""
    # Stop early enough that text[x+y] never runs off the end of text.
    for x in range(len(text) - len(pattern) + 1):
        for y in range(len(pattern)):
            if (text[x+y] != pattern[y]):
                break
            if (y == len(pattern) - 1):
                return x
    return -1

print simpleSearch("This is my wish", "is")

>>> print range(4)
[0, 1, 2, 3]

Indentation is used rather than { } to group statements

Note the use of ":" at the end of the def, for, and if lines


Using Python Slice

def simpleSearch2(text, pattern):
    """Find the pattern in the string text."""
    for x in range(len(text)):
        if (text[x:x+len(pattern)] == pattern):
            return x
    return -1

print simpleSearch2("This is my wish", "is")
print simpleSearch2("This is my wish", "if")

>>> text = "This is my wish"
>>> print text[1:3]
hi


Analysis

This algorithm behaves well in practice

The worst case is bad

For pattern of length N

Text of length M

Worst case is O(NM)
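The gap between typical and worst-case behavior can be seen by counting character comparisons. The sketch below is an instrumented version of the simple search, written for illustration; count_comparisons is a name chosen here, not one from the lecture.

```python
def count_comparisons(text, pattern):
    """Run the simple search and count character comparisons.

    Returns (index of the first match or -1, comparisons made).
    """
    comparisons = 0
    for x in range(len(text) - len(pattern) + 1):
        matched = True
        for y in range(len(pattern)):
            comparisons += 1
            if text[x + y] != pattern[y]:
                matched = False
                break
        if matched:
            return x, comparisons
    return -1, comparisons
```

Searching "This is my wish" for "is" stops after 4 comparisons. Searching a run of twenty A's for "AAAB" fails only on the last letter at every one of the 17 starting spots, so it costs 17 * 4 = 68 comparisons, which is the (M - N + 1) * N pattern behind the O(NM) bound.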


Odds of a match here

A priori odds of matching two characters: 1/P, where P is the size of the alphabet

A posteriori: we base the odds on measurement.

Say there are 100,000 distinct last names in the Boston Phone Book.

What are the odds that two people selected at random have the same name?

Higher than 1/100,000: some names are common, such as Smith or Parker
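Why uneven frequencies raise the odds can be sketched numerically. If name i is carried by a fraction p_i of the population, two people drawn independently share a name with probability equal to the sum of the p_i squared, which is smallest when all names are equally common. The function name and the sample counts below are made up for illustration.

```python
def chance_same_name(counts):
    """Chance that two people drawn independently at random share a name.

    counts[i] is the number of people carrying name i.
    """
    total = float(sum(counts))
    return sum((c / total) ** 2 for c in counts)
```

With four equally common names, chance_same_name([5, 5, 5, 5]) is 1/4. With the same four names but one dominant, chance_same_name([17, 1, 1, 1]) is well above 1/4.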


What are the odds of a match?

What are the odds that a pattern of length N matches at an arbitrary spot?

(1/P)^N

What are the odds that there is no match at a given spot?

1 - (1/P)^N

Odds of no match in the first two spots? (Must fail in both spots.)

(1 - (1/P)^N) * (1 - (1/P)^N)

Odds of no match in text of length M? Have M - N + 1 starting spots

(1 - (1/P)^N)^{M-N+1} ~ 1 - (M-N+1)(1/P)^N + …
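These a priori formulas are easy to evaluate. The sketch below treats every starting spot as independent and every symbol as equally likely, which is an approximation; the function names are chosen here for illustration.

```python
def odds_of_no_match(P, N, M):
    """(1 - (1/P)^N)^(M - N + 1): no match at any of the M - N + 1 spots."""
    p_match_here = (1.0 / P) ** N
    return (1.0 - p_match_here) ** (M - N + 1)


def first_order_estimate(P, N, M):
    """The first two terms of the expansion: 1 - (M - N + 1)(1/P)^N."""
    return 1.0 - (M - N + 1) * (1.0 / P) ** N
```

For a DNA alphabet (P = 4), a pattern of length 10, and a text of length 1000, the two values agree to within about one part in a million, because (M - N + 1)(1/P)^N is small.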


Odds of match somewhere

Odds of no match in text of length M. Have M - N + 1 starting spots

(1 - (1/P)^N)^{M-N+1} ~ 1 - (M-N+1)(1/P)^N + …

But the odds that there is a match at the next spot are not independent of the outcome at a previous position.

A posteriori, we need to look at the pattern and what we have learned

We will often be sloppy and use a priori reasoning.

One theme that we will encounter multiple times is that there is information contained in the work we do to find a partial match


What are the frequencies?

Let's count the frequency of letters in a real gene

"""Frequency - count the frequency of each letter in DNA sequence"""
text = input("Enter the quoted text: ")
print "Saw ", text
symbolCounts = {}  # Empty Dictionary
# Go over all characters in the sequence
for x in range(len(text)):
    ch = text[x]
    # Increment count
    if (ch in symbolCounts):
        symbolCounts[ch] = symbolCounts[ch] + 1
    else:
        symbolCounts[ch] = 1


What are the frequencies?

Let's count the frequency of letters in a real gene

# Rough print
print symbolCounts

symbols = ['A', 'G', 'C', 'T']

# Pretty Print
for ch in symbols:
    if (ch in symbolCounts):
        print ch, symbolCounts[ch]
    else:
        print ch, 0


What are the frequencies?

% python freq.py
Enter the quoted text: AGCT
Traceback (most recent call last):
  File "freq.py", line 3, in <module>
    text = input("Enter the quoted text: ")
  File "<string>", line 1, in <module>
NameError: name 'AGCT' is not defined

% python freq.py
Enter the quoted text: "AGCT"
Saw AGCT
{'A': 1, 'C': 1, 'T': 1, 'G': 1}
A 1
G 1
C 1
T 1


What are the frequencies?

% wc temp
 1 1 1272 temp
% python freq.py < temp
Enter the quoted text: Saw

ATGAAAGCTTCCTGGGCCTCCTTCCCCATCCTTGCACCTGTAGCCACCGTCAGTGGTGTTTGGAGGCTACAGCTGTTCCGACTGATGCTCATAGGACTCATACATGGTATGTCATCTGTATTCGTGGTGAAAAATGGCTACTGAACAACTTGCACAATGGAAGTCTACTCAAGCTGCCTCCTTGTCAAATTAACATACTAACAGCAGTGATAAAAATGTGACCTTCAACCTGCCCTGTAATTTAGAAGTACTAAATAACAAATGTCGTGGTCAAGGAAATGCT…

{'A': 413, 'C': 270, 'T': 347, 'G': 239}
A 413
G 239
C 270
T 347
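The same tally can be sketched more compactly with the standard library's collections.Counter, and turned into fractions for comparison with the a priori value of 1/4 per base. This is an alternative written for illustration, not the lecture's freq.py; base_fractions is a name chosen here.

```python
from collections import Counter


def base_fractions(sequence, symbols="AGCT"):
    """Return {symbol: fraction of the counted symbols that are that symbol}."""
    counts = Counter(sequence)            # missing symbols count as 0
    total = sum(counts[ch] for ch in symbols)
    if total == 0:
        return {ch: 0.0 for ch in symbols}
    return {ch: counts[ch] / float(total) for ch in symbols}
```

With the counts reported above (413, 239, 270, and 347 out of 1269), A makes up roughly a third of the gene and G less than a fifth, so the measured odds differ noticeably from 1/4.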


Better Pattern Matching

The Boyer Moore algorithm uses the same basic idea as simple search

Algorithm

Line up text and pattern

Compare the two

If they match

Report the position of match

Else

Slide pattern to right and try again


Insight

We get the most mileage by looking at right edge

Match text above last letter in pattern

Only need to call function compare here

[Figure: skip table for a four-symbol pattern. If the text has a given symbol, skip 1, 3, 2, or -4 places]


Data Structures for Boyer Moore

Boyer-Moore preprocesses the pattern and keeps a "skip table"

If you see this character in the text, skip this many places

If the text holds Blue, slide 3

Now Blue in pattern is below Blue in text

The skip table has an entry for each element of the "alphabet" - the 4 nucleotides in our case.

If the character matches the last char of pattern, we compare full string


Data Structures for Boyer Moore

# Create a dictionary to hold skip table.
# Insert skip for last letter.
ln = len(pattern)
letter = pattern[ln-1]  # Last letter
d = { letter:(ln) }
ln = ln - 1

# Iterate over the pattern, filling out skip table.
for x in xrange(len(pattern) - 1):
    d[pattern[x]] = ln
    ln = ln - 1

# The last character is special.
d[letter] = -1 * d[letter]


Match

def boyerMooreSearch(text, pattern, d):
    ln = len(pattern)
    x = ln - 1
    while (x < len(text)):
        if (text[x] in d):
            skip = d[text[x]]
        else:
            skip = ln

        if (skip < 0):  # Match last char
            start = x - ln + 1
            if (text[start: x + 1] == pattern):
                return start  # Found it!
            x = x - skip  # Not a match: skip
        else:
            x = x + skip  # Ordinary skip.

    return -1  # Never found it


Input

pattern = input("Enter the pattern in quotes: ")
...

% python BoyerMoore.py
Enter the pattern in quotes: wiki
Traceback (most recent call last):
  File "BoyerMoore.py", line 21, in <module>
    pattern = input("Enter the pattern in quotes: ")
  File "<string>", line 1, in <module>
NameError: name 'wiki' is not defined

% python BoyerMoore.py
Enter the pattern in quotes: "wiki"


Analysis of Boyer Moore

Much of the time, we are just looking up an entry in the skip table and sliding

Has a modest setup time to build the skip table

For some reasonable assumptions, Boyer Moore is sublinear in text length

Does not need to even look at many characters in the text

In the slide's example, we inspect only 4 items in text

Does better with large alphabets

Not much use for approximate matches
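The claim that many text characters are never inspected can be checked by counting. The sketch below is a self-contained rewrite of the single-table search the slides describe (often called Horspool's variant of Boyer-Moore), with a counter added. It assumes a non-empty pattern, and the names build_skip_table and horspool_search are chosen here rather than taken from the lecture.

```python
def build_skip_table(pattern):
    """Distance from each symbol's last occurrence to the end of the pattern.

    The final position is left out, so the last letter gets the distance
    to its previous occurrence, if it has one.
    """
    last = len(pattern) - 1
    return {pattern[i]: last - i for i in range(last)}


def horspool_search(text, pattern):
    """Return (index of first match or -1, number of right-edge probes)."""
    n = len(pattern)
    skip = build_skip_table(pattern)
    probes = 0
    x = n - 1                        # text position above the pattern's last letter
    while x < len(text):
        probes += 1                  # one look at the text symbol at the right edge
        if text[x] == pattern[-1]:
            start = x - n + 1
            # The slice comparison reads up to n - 1 further symbols.
            if text[start:x + 1] == pattern:
                return start, probes
        x += skip.get(text[x], n)    # symbols not in the table skip the full length
    return -1, probes
```

Searching the 15-character text "This is my wish" for "wish" finds the match at index 11 after probing only 5 right-edge positions.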


Odds in Boyer Moore

What can we expect from Boyer-Moore?

With a large alphabet, when we have a mismatch, we hope to slide a long way

With a small alphabet, the average length of slide decreases

We can have a long slide with a small alphabet: it is just not very likely

In the slide's example, the skip table entry for blue is 15.

Typical measure used is the expected length of a slide


Rumination on Boyer Moore

Is it worth the effort to preprocess the pattern?

If we are searching a long text, and it speeds up the search, it is worthwhile

Knuth-Morris-Pratt is another algorithm that preprocesses the pattern

Looks for repeats in pattern. If we have matched the first instance of a repeat, we don't have to check it again

In the slide's example, when pattern and text fail to match, we know that the first two symbols in the pattern will match 3 spaces to the right.

http://www.ics.uci.edu/~goodrich/dsa/11strings/demos/pattern/
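The Knuth-Morris-Pratt idea can be sketched as follows. This is a common textbook formulation rather than code from the lecture, and the names failure_table and kmp_search are illustrative. The table records, for each prefix of the pattern, how much of it repeats at its own end, which is the information that lets the search avoid rechecking symbols it has already matched.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    fail = failure_table(pattern)
    k = 0                               # pattern symbols matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]             # fall back to the repeat already matched
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1
```

Each text symbol is read once and never revisited, so the search takes time proportional to the text length plus the pattern length.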

Later we will see some algorithms that preprocess the text


Mutations

DNA is constantly being transcribed and replicated

Sometimes there are copying errors, which lead to mutations

Three types

The Good: Mutation in sickle cell gene provides resistance to malaria

The Bad: Huntington’s disease, a degenerative disease of the nervous system

The Silent: may cause no difference

May result in same amino acid

May be part of junk DNA

ATCTAG

ATCGAG


Cystic Fibrosis

Cystic Fibrosis (CF) is a chronic and frequently fatal genetic disease of the body’s mucus glands. CF primarily affects the lungs of children.

In early 1980s biologists hypothesized that CF is caused by mutations in some gene.

ATP binding proteins are present on cell membranes and act as a transport channel.

In 1989 biologists found similarity between CF Gene and ATP binding proteins.

A mutation was found in 70% of CF patients.

Patients with this mutation are missing a single amino acid.


Browse CFTR Gene

Go to www.ensembl.org

Select human

Click on Human Genome

Click on Chromosome 7

In Search box on page (not browser) enter CFTR

Select map element (NM_001104950.1)

Click on NM_001104950.1

Takes you to http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?list_uids=157279742

List of publications about the gene, with PubMed links

Scroll down to see /translation and ORIGIN

Gene is over 4K base pairs long.


Approximate pattern match

In biology, we are often looking for an approximate match

Changes can be viewed as one of three forms

Add a character to pattern

Remove a character from pattern

Alter a character

ATCGGA
ATG-GA


Distance

How can we define how far apart two sequences are?

We can take two sequences, and count the number of places they differ

ATCGGT
ATGGGA

We speak of the Hamming distance. Measures the number of substitutions to get from one string to the other

Since mutations can also lead to dropped or added terms, we will also use Levenshtein distance or edit distance

ATCGG-
AT-GGA

Smallest number of insertions, deletions, and substitutions required

Finding the edit distance is a problem in itself that we will address soon

Either satisfies the properties for a metric, including the triangle inequality

D(a,c) <= D(a, b) + D(b, c)
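The Hamming distance is simple enough to sketch directly. The function name is chosen here for illustration; the distance is only defined for strings of equal length, so the sketch rejects anything else.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(1 for x, y in zip(a, b) if x != y)
```

For the pair on the slide, hamming_distance("ATCGGT", "ATGGGA") is 2: the strings differ in the third and sixth positions.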


Approximate pattern match

Example: find pattern ATGGA in text ATCGGA

Here are two choices

Can change G in pattern to C and add G to pattern

We illustrate add or delete with “-” in text or pattern

ATCGGA
ATG-GA

Or we could add a C to pattern

ATCGGA
AT-GGA

Second version should be cheaper

Define distance between two strings as the sum of the costs of the operations needed to make them the same

We assume today that each operation has cost 1. Methods extend to other pricing schemes


Recursive Solution

To find a match between ATGGA and ATCGGA

Try all possible actions on first characters, then compare the rest

Match or replace first characters of each string

Drop first char of text

Drop first char of pattern

Try to match the remainder using recursion

At each step, at least one string is shorter.

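[Figure: first level of the recursion tree]

ATGGA, ATCGGA branches to

TGGA, TCGGA (match or replace first characters)

TGGA, ATCGGA (drop first char of the first string)

ATGGA, TCGGA (drop first char of the second string)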


Backtracking Solution

def approxMatch(text, pattern):
    """Find the pattern in the string text."""
    print "Looking for", pattern, "in text ", text
    tlen = len(text)
    plen = len(pattern)
    if (tlen == 0):
        return plen
    if (plen == 0):
        return tlen

    match = approxMatch(text[1:tlen], pattern[1:plen])
    if (text[0] != pattern[0]):
        match = match + 1

    delt = 1 + approxMatch(text[1:tlen], pattern)
    delp = 1 + approxMatch(text, pattern[1:plen])

    return min(match, min(delt, delp))

print approxMatch("ATGGA", "ATCGGA")


Problems with backtracking

This finds right answer, but spends time recomputing values

Example "ATGGA" to "ATCGGA"

Worse at next level

If we can organize previous results, we can use Dynamic Programming to build a solution from the ground up

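[Figure: first two levels of the recursion tree; several pairs appear more than once]

ATGGA, ATCGGA

    TGGA, TCGGA
        GGA, CGGA
        GGA, TCGGA
        TGGA, CGGA

    TGGA, ATCGGA
        GGA, TCGGA
        GGA, ATCGGA
        TGGA, TCGGA

    ATGGA, TCGGA
        TGGA, CGGA
        TGGA, TCGGA
        ATGGA, CGGA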
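How much work is repeated can be measured with a short sketch. The first function counts the calls made by the plain three-way recursion; the second computes the same distance but stores each answer the first time it is found. Both are illustrations written here, not the lecture's code, and the names are hypothetical. A pair of suffixes is identified by the two lengths, since both are always suffixes of the original strings.

```python
def count_calls_naive(a, b):
    """Number of calls the plain three-way recursion makes on a and b."""
    if len(a) == 0 or len(b) == 0:
        return 1
    return (1 + count_calls_naive(a[1:], b[1:])
              + count_calls_naive(a[1:], b)
              + count_calls_naive(a, b[1:]))


def edit_distance_memo(a, b, cache=None):
    """Edit distance where each pair of suffixes is solved only once."""
    if cache is None:
        cache = {}
    key = (len(a), len(b))            # a suffix is identified by its length
    if key in cache:
        return cache[key]
    if len(a) == 0:
        result = len(b)
    elif len(b) == 0:
        result = len(a)
    else:
        replace = edit_distance_memo(a[1:], b[1:], cache) + (1 if a[0] != b[0] else 0)
        drop_a = 1 + edit_distance_memo(a[1:], b, cache)
        drop_b = 1 + edit_distance_memo(a, b[1:], cache)
        result = min(replace, drop_a, drop_b)
    cache[key] = result
    return result
```

For "ATGGA" and "ATCGGA" there are at most 6 * 7 = 42 distinct pairs of suffixes, while the plain recursion makes well over a thousand calls. Dynamic Programming fills in those 42 answers from the ground up instead of discovering them through recursion.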


Dots

List the pattern and text as row and column headings.

Place a dot in each cell where row heading and column heading match.

We will use this idea in other ways…


Dots

def dots(text, pattern):
    tlen = len(text)
    plen = len(pattern)

    # Print the text.
    print " ",
    for col in xrange(tlen):
        print text[col],
    print ""

    for row in xrange(plen):
        print(pattern[row]),
        for col in xrange(tlen):
            if (text[col] == pattern[row]):
                print "*",
            else:
                print " ",
        print ""

dots("ATGGA", "ATCGGA")

  A T G G A
A *       *
T   *
C
G     * *
G     * *
A *       *


Dynamic Programming

2 5 1 6 7 3 2

3 2 6 8 2 9 3

1 7 6 8 5 3 8

8 6 8 3 4 2 1

2 6 3 8 2 3 4

6 7 5 6 8 4 2

6 3 4 6 8 3 6

Our first example of Dynamic Programming is a puzzle

Given an array, pick the path that goes from top to bottom that maximizes the values hit.

Path must descend with every step: cannot meander back up.


Puzzle path

Pick a path that goes from top to bottom, and maximizes the sum

Every step must descend.

Here is a sample path, with value

7 + 8 + 8 + 4 + 3 + 8 + 3

This isn't the best we can do. (Tweak the tail of the path to select 8 rather than 3)

What other changes do you see?

How can we find the best? Too many choices to try them all.


Dynamic Programming

It is easy to find the best path in a two level puzzle

For each new row

For each element of the row

Look at the three neighbors in the row above: pick the best of them

Store the running total for following round

For each square, we remember

where the path came from (lines)

2 5 1 6 7 3 2

3 2 6 8 2 9 3

1 7 6 8 5 3 8

Running totals after the second row:

8 7 12 15 9 16 6


Iteration

At each stage, we build on the previous results.

Note that some squares are never selected (the 1 and 2s in first row)

Note that some paths are started, and then dropped (3 to 3)

These will never be used again

Input to each new round: contents of current row, and the running totals from previous row. We don't care about prior path yet.

For solution: select the largest total in last row, and follow path back.

2 5 1 6 7 3 2

3 2 6 8 2 9 3

1 7 6 8 5 3 8

Running totals after the second row:

8 7 12 15 9 16 6

Running totals after the third row:

9 19 21 23 21 19 24
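The row-by-row procedure can be sketched as follows. This is an illustration of the steps described on the slides, not code from the lecture; best_path is a name chosen here, and the sketch assumes a rectangular grid where each step moves down one row to the same column or an adjacent one.

```python
def best_path(grid):
    """Return (best_sum, path), where path gives one column index per row."""
    rows, cols = len(grid), len(grid[0])
    totals = list(grid[0])             # running totals for the first row
    came_from = []                     # came_from[r-1][c]: column chosen in row r-1
    for r in range(1, rows):
        new_totals, back = [], []
        for c in range(cols):
            # The (up to) three neighbors in the row above.
            candidates = [k for k in (c - 1, c, c + 1) if 0 <= k < cols]
            best = max(candidates, key=lambda k: totals[k])
            new_totals.append(grid[r][c] + totals[best])
            back.append(best)
        totals = new_totals
        came_from.append(back)
    # Select the largest total in the last row, then follow the path back.
    c = max(range(cols), key=lambda k: totals[k])
    best_sum = totals[c]
    path = [c]
    for back in reversed(came_from):
        c = back[c]
        path.append(c)
    path.reverse()
    return best_sum, path
```

On the 7 by 7 puzzle from the slides this gives a best sum of 55 (a 7 followed by six 8s), compared with 41 for the sample path.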


Rock Pile Game

Two player game

Have two piles of rocks

Players take turns.

Must take a rock from a pile

Can take a rock from each pile

If you take the last rock, you win the game.

Is there a winning strategy for the game?

Assume we start with two piles of 8 rocks each

Rock Pile Game

Must take a rock from a pile

Can take a rock from each pile

Represent situation as ordered pair, (x, y)

If player has (1, 0) left, can win by taking the rock

If player has (1, 1) left, wins by taking both.

If player has (2, 0) left, has no winning move

Must take one, leaving (1, 0), from which the opponent wins

If player has (2, 1) left, can take one and leave (2, 0)

Strategy: try to leave an even number of rocks in each pile
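The case-by-case reasoning above can be automated by filling in a table from small positions to large ones, which is the same bottom-up idea used in Dynamic Programming. The sketch below is an illustration with a name chosen here; it assumes the rules as stated: a move takes one rock from one pile or one rock from each, and whoever takes the last rock wins.

```python
def winning_table(max_x, max_y):
    """win[x][y] is True when the player to move from (x, y) can force a win."""
    win = [[False] * (max_y + 1) for _ in range(max_x + 1)]
    for x in range(max_x + 1):
        for y in range(max_y + 1):
            # Positions reachable in one move.
            moves = []
            if x > 0:
                moves.append((x - 1, y))
            if y > 0:
                moves.append((x, y - 1))
            if x > 0 and y > 0:
                moves.append((x - 1, y - 1))
            # Winning if some move leaves the opponent in a losing position.
            # (0, 0) has no moves: the previous player took the last rock.
            win[x][y] = any(not win[a][b] for (a, b) in moves)
    return win
```

The table shows that a position is lost for the player to move exactly when both piles are even, so the player who starts with two piles of 8 rocks loses against best play.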


Approximate pattern match

Example: find pattern CTAG in text CCTG

Here are two choices

Can change T in pattern to C and A in pattern to T

CTAG
CCTG

Or we could add a C to pattern and an A to the text

C_TAG
CCT_G


DP Approximate Pattern Match

Keep a table that stores the best match to substrings

Use stored values to compute next value

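[Table: text C C T G across the top, pattern C T A G down the side. Each cell shows the best alignment (text/pattern) and its cost. The table is partly filled in.]

          (empty)       C             C             T             G
(empty)   0             0             0             0             0
C         -/C 1         C/C 0         C/C 0         T/C 1         G/C 1
T         --/CT 2       C-/CT 1       CC/CT 1       CT/CT 0       CTG/CT- 1
A         ---/CTA 3     C--/CTA 2     C-C/CTA 2     CT-/CTA 1
G         ----/CTAG 4   C---/CTAG 3   C--C/CTAG 3

The cell CT-/CTA with cost 1 represents the best we can do matching pattern CTA in text CCT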


How we build the table

Consider filling in the blank spot in pink

We have three choices

Build on pair above, deleting char T in pattern

T-
CT
Cost: 1 + 1 = 2

Build on pair on left, inserting char T from text

CCT
CT-
Cost: 1 + 1 = 2

Match or replace, using pair from upper left

CT
CT
Cost: 0 + 0 = 0 (since the T’s in text and pattern match)

We only display the winner

[Figure: the corner of the table around the pink cell: C/C (cost 0) at upper left, T/C (cost 1) above, CC/CT (cost 1) on the left, and the winner CT/CT (cost 0) in the pink cell]


Key Idea

To compute the best match ending at location [i,j] we compute the three values below, pick minimal value, and store it in d[i][j]

insertCost = d[i-1][j] + 1;
deleteCost = d[i][j-1] + 1;

if (pattern[i] == text[j])
    matchCost = d[i-1][j-1];
else
    reviseCost = d[i-1][j-1] + 1;
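Putting the three cases together gives a complete table-filling routine. The sketch below is one way to write it in Python, assuming each operation costs 1 and a top row of zeros so that a match may start anywhere in the text, as in the table shown earlier. The names approx_match_table, diagonal, from_above, and from_left are chosen here for illustration.

```python
def approx_match_table(text, pattern):
    """Fill the table d, where d[i][j] is the lowest cost of matching
    pattern[:i] against some substring of text that ends at position j."""
    rows, cols = len(pattern) + 1, len(text) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i                   # pattern[:i] against empty text
    # Row 0 stays 0: a match may start anywhere in the text.
    for i in range(1, rows):
        for j in range(1, cols):
            same = pattern[i - 1] == text[j - 1]
            diagonal = d[i - 1][j - 1] + (0 if same else 1)   # match or replace
            from_above = d[i - 1][j] + 1                      # pattern char unmatched
            from_left = d[i][j - 1] + 1                       # text char unmatched
            d[i][j] = min(diagonal, from_above, from_left)
    return d
```

For text CCTG and pattern CTAG this reproduces the costs in the table, and the smallest entry in the last row, 1, is the cost of the best approximate match.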

[Figure: a corner of the table showing a cell, its three neighbors, and the candidate costs]

Compare to Backtracking

if (pattern[i] == text[j])
    matchCost = d[i-1][j-1];
else
    reviseCost = d[i-1][j-1] + 1;

insertCost = d[i-1][j] + 1;
deleteCost = d[i][j-1] + 1;

match = approxMatch(text[1:tlen], pattern[1:plen])
if (text[0] != pattern[0]):
    match = match + 1
delt = 1 + approxMatch(text[1:tlen], pattern)
delp = 1 + approxMatch(text, pattern[1:plen])


What do we need to store?

Possible to compute one row at a time: we only need to store the prior row

Better to run down columns: they are shorter, since the pattern is shorter than the text

Finding the best approximate match for a pattern of length N in a text of length M takes O(NM) time

Same as the worst case for simple match
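Running down the columns might look like the sketch below, which keeps only the previous column (length N + 1) instead of the whole table. The function name `best_approx_match` and the choice to report the first column that reaches the lowest cost are assumptions made for illustration.

```python
def best_approx_match(pattern, text):
    """Return (cost, end): the lowest cost of an approximate occurrence of
    pattern in text, and the text position where that occurrence ends."""
    m = len(pattern)
    prev = list(range(m + 1))  # column for the empty text prefix: d[i][0] = i
    best_cost, best_end = prev[m], 0
    for j, t in enumerate(text, start=1):
        cur = [0] * (m + 1)    # top cell is always 0: free start in the text
        for i in range(1, m + 1):
            cur[i] = min(
                cur[i - 1] + 1,                       # cell above, same column
                prev[i] + 1,                          # cell to the left
                prev[i - 1] + (pattern[i - 1] != t),  # diagonal cell
            )
        if cur[m] < best_cost:  # bottom cell: the whole pattern is matched
            best_cost, best_end = cur[m], j
        prev = cur
    return best_cost, best_end


print(best_approx_match("CTAG", "GATCGCCTGACGG"))  # (1, 9)
```

Time is still O(NM), but memory drops to O(N). The price is that the alignment itself can no longer be read back from the table, only its cost and end position.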

Columns are text characters, rows are pattern characters. Each cell shows the cost, then the alignment as text/pattern.

                 C            C            T            G            C
    0            0            0            0            0            0
C   1 -/C        0 C/C        0 C/C        1 T/C        1 G/C        0 C/C
T   2 --/CT      1 C-/CT      1 CC/CT      0 CT/CT      1 CTG/CT-    1 C-/CT
A   3 ---/CTA    2 C--/CTA    2 C-C/CTA    1 CT-/CTA    1 CTG/CTA    2 CTGC/CT-A
G   4 ----/CTAG  3 C---/CTAG  3 C--C/CTAG  2 CT--/CTAG  1 CT-G/CTAG  2 CTGC/CTAG


Larger Sample

We don’t need the strings: implicit from the shape of the path

Only need to store the scores

     G  A  T  C  G  C  C  T  G  A  C  G  G
  0  0  0  0  0  0  0  0  0  0  0  0  0  0
C 1  1  1  1  0  1  0  0  1  1  1  0  1  1
T 2  2  2  1  1  1  1  1  0  1  2  1  1  2
A 3  3  2  2  2  2  2  2  1  1  1  2  2  2
G 4  3  3  3  3  2  3  3  2  1  2  2  2  2

Sometimes we have multiple choices that yield the same score

ATCG

CTAG

and

C--G

CTAG
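Reading the strings back from the scores could be done by walking the path backwards from a bottom-row cell, as in the sketch below. The function name `traceback` and the `prefer_gaps` switch are hypothetical. The switch only decides which move wins when two moves give the same score, which is how the two alignments above arise from the same table.

```python
def traceback(pattern, text, end, prefer_gaps=False):
    """Return (text_line, pattern_line, cost) for one best alignment of
    pattern that ends at text column `end`."""
    m, n = len(pattern), len(text)
    # Score table: row 0 is all zeros, column 0 counts up.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (pattern[i - 1] != text[j - 1]))
    top, bottom = [], []  # text line and pattern line, built right to left
    i, j = m, end
    while i > 0:  # stop once the whole pattern is accounted for
        diag_ok = j > 0 and d[i][j] == d[i - 1][j - 1] + (pattern[i - 1] != text[j - 1])
        up_ok = d[i][j] == d[i - 1][j] + 1
        if up_ok and (prefer_gaps or not diag_ok):
            top.append("-"); bottom.append(pattern[i - 1])  # pattern char unmatched
            i -= 1
        elif diag_ok:
            top.append(text[j - 1]); bottom.append(pattern[i - 1])
            i, j = i - 1, j - 1
        else:
            top.append(text[j - 1]); bottom.append("-")     # text char unmatched
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), d[m][end]


text = "GATCGCCTGACGG"
print(traceback("CTAG", text, 5))                    # ('ATCG', 'CTAG', 2)
print(traceback("CTAG", text, 5, prefer_gaps=True))  # ('C--G', 'CTAG', 2)
```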


Odds

What are the odds of a match between two strings of length k, if we can tolerate one replacement error?

AGCT vs AGTT

Clearly the odds are better than (1/N)^k
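To put a number on this, here is a rough sketch under assumptions the slide does not state: N is taken to be the alphabet size, each character is drawn uniformly and independently, and "odds" is read as probability. An exact match has probability (1/N)^k. Allowing one replacement adds k positions, each with N - 1 wrong characters, giving (1 + k(N - 1)) / N^k. The function names are hypothetical, and the brute-force count is only there to check the formula.

```python
from fractions import Fraction
from itertools import product


def p_match_one_error(k, n=4):
    """Probability that a random length-k string matches a fixed one
    with at most one replacement, for an alphabet of size n."""
    return Fraction(1 + k * (n - 1), n ** k)


def brute_force(k, alphabet="ACGT"):
    """Same probability, found by enumerating every string of length k."""
    target = alphabet[0] * k
    hits = sum(
        1
        for s in product(alphabet, repeat=k)
        if sum(a != b for a, b in zip(s, target)) <= 1
    )
    return Fraction(hits, len(alphabet) ** k)


print(p_match_one_error(4), brute_force(4))  # 13/256 13/256
```

For DNA with k = 4 that is 13/256 rather than 1/256, so tolerating a single replacement makes a chance match 13 times more likely.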


Other Pricing Schemes

We may decide that alternative pricing models are better

One common assumption is that the first deletion is rare (expensive) but it is much cheaper to continue to delete

ATC AT- --T GGT GTT

Our basic algorithm can deal with this change
    Modify the cost of a delete when we are already in a gap

Use an Affine Gap Function
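One common way to realise an affine gap cost is to keep three tables, so that each cell remembers whether the alignment reaching it ends inside a gap. The sketch below is an illustration, not the lecture's code: the function name and the default costs (`gap_open=3`, `gap_extend=1`, `replace=1`) are assumptions, and a gap of length L is charged `gap_open + (L - 1) * gap_extend`.

```python
INF = float("inf")


def affine_approx_match(pattern, text, gap_open=3, gap_extend=1, replace=1):
    """Lowest cost of matching pattern against some substring of text when
    starting a gap costs gap_open and each further gap character gap_extend."""
    m, n = len(pattern), len(text)
    # M: last step pairs two characters. X: pattern character against a gap.
    # Y: text character against a gap.
    M = [[INF] * (n + 1) for _ in range(m + 1)]
    X = [[INF] * (n + 1) for _ in range(m + 1)]
    Y = [[INF] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        M[0][j] = 0  # a match may start anywhere in the text
    for i in range(1, m + 1):
        for j in range(n + 1):
            # Open a new gap from M or Y, or extend a gap already in X.
            X[i][j] = min(M[i - 1][j] + gap_open,
                          Y[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend)
            if j > 0:
                Y[i][j] = min(M[i][j - 1] + gap_open,
                              X[i][j - 1] + gap_open,
                              Y[i][j - 1] + gap_extend)
                step = 0 if pattern[i - 1] == text[j - 1] else replace
                M[i][j] = min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + step
    return min(min(M[m][j], X[m][j], Y[m][j]) for j in range(n + 1))


# Replacement is made expensive so the best alignment must skip C and G.
print(affine_approx_match("ACGT", "AT", gap_open=3, gap_extend=1, replace=10))  # 4
print(affine_approx_match("ACGT", "AT", gap_open=1, gap_extend=1, replace=10))  # 2
```

With `gap_open` equal to `gap_extend` the three tables collapse back to the unit-cost table used so far. The running time stays O(NM), with three times the bookkeeping.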


Problem for next week

The human genome includes many repetitions
    Some of this reflects history
    Some reflects motifs

The book uses finding motifs as an important example

Our problem: take a string, and look for the longest repetition
    Come up with as many ideas as you can, and implement some
    You may assume that the string is DNA


Problem for next week

Write a program that takes a DNA string and counts the frequency of each 2 letter sequence

Online references for Python
    The Python Tutorial at python.org (poke around)
    Dive Into Python

% python freq.py
Enter the text: "ACGGTCG"
Saw ACGGTCG
{'GG': 1, 'AC': 1, 'GT': 1, 'CG': 2, 'TC': 1}
  A G C T
A 0 0 1 0
G 0 1 0 1
C 0 2 0 0
T 0 0 1 0


References

Boyer R.S., Moore J.S., "A fast string searching algorithm", CACM, 20:762-772, 1977

Knuth D.E., Morris J.H., Pratt V.R., "Fast Pattern Matching in Strings", SIAM Journal on Computing, 6(2):323-350, 1977

The approximate match algorithm is due to Wagner and Fischer, and is described in "The String-to-String Correction Problem", Journal of the ACM, 21(1):168-173, 1974

A good reference is Computer Algorithms by Sara Baase and Allen Van Gelder, Addison-Wesley


Summary

There is a world of interesting problems in Biology

There is great interest in finding solutions

Computer Science can help

Crucial to keep in touch with Biologists about solutions

Not all simplifications are equally valid

Not all matches are meaningful

Many Biologists use the new tools in their research

There is a need for those who understand the algorithms the tools use