Introduction to Bioinformatics: Lecture III Genome Assembly and String Matching

JM - http://folding.chmcc.org


Jarek Meller

Division of Biomedical Informatics, Children’s Hospital Research Foundation & Department of Biomedical Engineering, UC


Outline of the lecture

Physical mapping problem and the resulting computational challenges

Ordering clone libraries: from the consecutive ones to global optimization methods

Applications of exact string matching methods

Towards the shortest superstring problem and the shotgun assembly problem


Literature watch

Aloy et al., “Structure-Based Assembly of Protein Complexes in Yeast”, Science 303,

As a way of getting acquainted with protein pathways and their intersection with structural studies.


Assembling physical maps of a genome

Figure: markers positioned along the DNA.

Physical mapping problem: create and locate in the genome of interest a set of markers (e.g. stretches of DNA that hybridize to a given probe).

With a sufficiently dense and ordered set of markers, any newly sequenced DNA fragment (long enough to cover at least one marker) can be mapped to a rough location on the genome.

One of the early goals of the Human Genome Project was to select and map a set of STS markers such that there would be at least one STS in each stretch of 100 kb of the genome.


Physical mapping and the problem of ordering clone libraries with STS markers

Figure: a stretch of DNA with STSs 1 to 5 and clones 1 to 4 mapped onto it.

Definition A clone library consists of a set of short DNA fragments, called clones, that originated in a stretch of the studied DNA.

Definition A sequence tagged site (STS) is a DNA substring which occurs only once in the DNA of interest. One may think of STSs as a set of indices to which new DNA sequences can be referenced.

Problem What is the minimum length of the STSs that could (at least in principle) provide the requested coverage for the Human genome?
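One rough way to approach this problem is a counting argument: an STS of length L can be unique only if the number of possible strings of that length, 4^L, is at least the genome length. The sketch below assumes a genome of about 3 x 10^9 base pairs and a random sequence with four equally likely letters; both are assumptions made for illustration, and the function name is hypothetical.

```python
def min_sts_length(genome_length, alphabet_size=4):
    """Smallest L such that alphabet_size**L >= genome_length.

    Uses integer arithmetic only, to avoid floating-point rounding
    in the logarithm.
    """
    length = 0
    while alphabet_size ** length < genome_length:
        length += 1
    return length


# Assumed approximate size of the human genome in base pairs.
HUMAN_GENOME_BP = 3_000_000_000

print(min_sts_length(HUMAN_GENOME_BP))
```

Under these assumptions the bound is 16 bases, since 4^15 is about 1.07 x 10^9 and 4^16 is about 4.29 x 10^9. Real genomes contain repeats, so this is only a lower bound and practical STSs need to be considerably longer.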


The problem of ordering clone libraries with STS markers can be cast (and solved) as the consecutive ones problem



The true location of the STSs and clones is not known. However, for each clone the list of STSs hybridizing to it is given.

Our task is to reconstruct the original order of the STSs (and thus order the clone library) given this data.

Assuming that the STS probes are unique and that there are no hybridization errors, the problem can be cast as the consecutive ones problem and solved efficiently using computer science techniques (PQ-tree algorithm, Booth and Lueker, 1976).


The consecutive ones problem and its solution

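For a binary hybridization matrix, find a permutation of its columns such that in each row all ones are located in a block of consecutive entries.

Hybridization matrix as observed, with the STS columns in arbitrary order (rows are clones, columns are STSs):

STS        3  5  1  4  2
Clone 1    1  0  0  1  0
Clone 2    0  0  1  0  1
Clone 3    1  0  0  1  1
Clone 4    1  1  0  1  0

The same matrix after permuting the columns into the order 1 2 3 4 5, so that the ones in every row are consecutive:

STS        1  2  3  4  5
Clone 1    0  0  1  1  0
Clone 2    1  1  0  0  0
Clone 3    0  1  1  1  0
Clone 4    0  0  1  1  1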
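The consecutive ones property can be checked directly on the small matrix from this slide. The sketch below is not the PQ-tree algorithm; it simply tries every column permutation, which is only feasible for a handful of STSs. The function names are hypothetical.

```python
from itertools import permutations


def ones_are_consecutive(row):
    """True if all 1s in the row form a single contiguous block."""
    ones = [i for i, value in enumerate(row) if value == 1]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)


def find_consecutive_ones_order(matrix):
    """Return a column permutation giving consecutive ones in every row,
    or None if no such permutation exists (exhaustive search)."""
    n_cols = len(matrix[0])
    for perm in permutations(range(n_cols)):
        if all(ones_are_consecutive([row[j] for j in perm]) for row in matrix):
            return perm
    return None


# Observed matrix from the slide; its columns carry the STS labels 3 5 1 4 2.
labels = [3, 5, 1, 4, 2]
observed = [
    [1, 0, 0, 1, 0],  # clone 1
    [0, 0, 1, 0, 1],  # clone 2
    [1, 0, 0, 1, 1],  # clone 3
    [1, 1, 0, 1, 0],  # clone 4
]

perm = find_consecutive_ones_order(observed)
ordered_labels = [labels[j] for j in perm]
print(ordered_labels)
```

The order found need not be 1 2 3 4 5 itself: the reversed order and orders that swap identical columns satisfy the property equally well, so the solution is in general not unique.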


Fortunately errors make life more interesting …

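Hybridization matrix with errors, as observed, with the STS columns in arbitrary order (rows are clones, columns are STSs):

STS        5  4  1  3  2
Clone 1    0  1  0  1  0
Clone 2    0  1  1  0  1
Clone 3    0  1  0  1  1
Clone 4    1  0  0  1  0

The same matrix with the columns in the order 1 2 3 4 5; the ones in rows 2 and 4 no longer form a single block:

STS        1  2  3  4  5
Clone 1    0  0  1  1  0
Clone 2    1  1  0  1  0
Clone 3    0  1  1  1  0
Clone 4    0  0  1  0  1

In the presence of experimental errors the problem leads to a global optimization problem (see Pevzner, Chapter 3).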


Heuristic solutions may still provide good probe ordering

The number of “gaps” (blocks of zeros in rows) in the hybridization matrix may be used as a cost function, since hybridization errors typically split a block of ones (false negative) or split a gap into two gaps (false positive).

The problem of finding a permutation that minimizes the number of gaps can be cast as a Traveling Salesman Problem (TSP), in which the cities are the columns of the hybridization matrix (plus an additional column of zeros) and the distance between two cities is the number of positions in which the two columns differ (Hamming distance).

Thus, an efficient algorithm is unlikely in the general case (unless P=NP), and heuristic solutions are sought that provide good probe ordering, at least for most cases (e.g. Alizadeh et al., 1995).

Problem Is the correct order of the STSs in the example from the previous slide the shortest cycle for the corresponding TSP?
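The TSP formulation can be tried out on the erroneous matrix from the previous slide. The sketch below builds the cycle as: zero column, STS columns in the chosen order, zero column, and sums the Hamming distances along it. The shortest cycle is found by exhaustive search, which stands in for a real TSP heuristic and only works for very small matrices. The function names are hypothetical.

```python
from itertools import permutations


def hamming(u, v):
    """Number of positions in which two equally long columns differ."""
    return sum(a != b for a, b in zip(u, v))


def tour_length(matrix, order):
    """Length of the cycle: zero column -> columns in `order` -> zero column."""
    zero = [0] * len(matrix)
    columns = [[row[j] for row in matrix] for j in order]
    cycle = [zero] + columns + [zero]
    return sum(hamming(cycle[k], cycle[k + 1]) for k in range(len(cycle) - 1))


def shortest_tour(matrix):
    """Exhaustive search over all column orders (small matrices only)."""
    n_cols = len(matrix[0])
    return min(permutations(range(n_cols)),
               key=lambda order: tour_length(matrix, order))


# Matrix with hybridization errors, columns in the true STS order 1 2 3 4 5.
noisy = [
    [0, 0, 1, 1, 0],  # clone 1
    [1, 1, 0, 1, 0],  # clone 2
    [0, 1, 1, 1, 0],  # clone 3
    [0, 0, 1, 0, 1],  # clone 4
]

true_order = (0, 1, 2, 3, 4)
best_order = shortest_tour(noisy)
print("true order length:", tour_length(noisy, true_order))
print("shortest length:  ", tour_length(noisy, best_order))
print("shortest order:   ", [j + 1 for j in best_order])
```

Each block of ones in a row contributes one 0-to-1 and one 1-to-0 change along the cycle, so the tour length is twice the total number of blocks of ones; minimizing it is the same as minimizing the number of gaps.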


Map location of anonymous DNA as a string matching problem

A sufficiently long string of anonymous yet sequenced DNA can be placed on the physical map by finding which STSs are contained in this sequence.

Due to the size of the problem, efficiency is very important.

Millions of STSs are available at present and their total length is typically much larger than the length of the DNA sequence to be mapped.

Assuming no sequencing errors, the problem can be cast as the exact set matching problem and solved efficiently using, for example, suffix trees.

Generalized suffix tree or inexact string matching methods need to be used when some errors are allowed.
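To make the exact set matching problem concrete, here is a minimal sketch that reports every occurrence of every pattern in a sequence in a single pass. It uses an Aho-Corasick automaton rather than the suffix tree named above, simply because it is short to write down. The STS strings and the sequence are made up for illustration, and the function names are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build a keyword trie with failure links and merged output lists."""
    goto, fail, output = [{}], [0], [[]]
    for pid, pattern in enumerate(patterns):
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(pid)
    # Breadth-first pass: depth-1 states keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Patterns ending at the failure target also end here.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_sts_hits(sequence, patterns):
    """Return (pattern index, 0-based start) for every exact occurrence."""
    goto, fail, output = build_automaton(patterns)
    hits, state = [], 0
    for i, ch in enumerate(sequence):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pid in output[state]:
            hits.append((pid, i - len(patterns[pid]) + 1))
    return hits


# Made-up STSs and an anonymous sequence, for illustration only.
sts_set = ["ACGT", "CGTA", "GGG", "TTT"]
anonymous_dna = "TACGTAGGGA"
print(find_sts_hits(anonymous_dna, sts_set))
```

The scan itself is linear in the length of the sequence plus the number of hits, after preprocessing proportional to the total length of the patterns. When the pattern set is much larger than the sequence, as described above, indexing the sequence instead (for example with a suffix tree) avoids rebuilding a large automaton.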


Strings, sequences and string operations


String exact matching problem


Solving the exact matching problem: conceptual simplicity vs. computational complexity


Computationally efficient and elegant solutions


The idea of the suffix tree method

A string with m characters has m suffixes, which can be represented as m leaves of a rooted directed tree. Consider for example T=cabca.

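Suffix tree for T=cabca$ (edge labels shown; each leaf is numbered by the starting position of its suffix):

root
  ca
    bca$    leaf 1
    $       leaf 4
  a
    bca$    leaf 2
    $       leaf 5
  bca$      leaf 3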

For simplicity, the one leaf due to the terminal character $ alone is not included.

Problem What is the reason for adding the terminal character?


Why does it work?

A substring of a string is a prefix of a suffix of that string. For example, the substring P=ab is a prefix of the suffix abca of T=cabca. Thus, if P occurs in T, there is a leaf in the suffix tree whose path label (the edge labels read from the root down to that leaf) starts with P.
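The argument can be followed on T=cabca with a small sketch. It builds a naive suffix trie (one character per edge, quadratic size), not a compressed suffix tree built in linear time, but the search works the same way: walk down from the root along P, then collect the leaves below. The function names and the "leaf" marker key are choices made for this illustration.

```python
def build_suffix_trie(text):
    """Naive suffix trie of text + '$' as nested dicts.

    Each suffix ends in a node carrying its 1-based start position under
    the key 'leaf'. The suffix consisting of '$' alone is skipped, as in
    the slide.
    """
    text = text + "$"
    root = {}
    for start in range(len(text) - 1):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1
    return root


def find_occurrences(root, pattern):
    """Return sorted 1-based start positions of pattern in the text."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []          # pattern is not a prefix of any suffix
        node = node[ch]
    positions, stack = [], [node]
    while stack:               # collect every leaf below the matched node
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


trie = build_suffix_trie("cabca")
for pattern in ("ab", "ca", "a", "cc"):
    print(pattern, find_occurrences(trie, pattern))
```

For P=ca the walk ends at the node above leaves 1 and 4, which are exactly the two positions where ca occurs in cabca.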


As a related problem consider the motif search, as implemented in PROSITE. Explain how finite automata formalism is used for motif search.
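One way to see the connection: a motif written in PROSITE pattern syntax describes a regular language, so it can be translated into a regular expression and, in principle, compiled into a finite automaton. The sketch below covers only a subset of the syntax (x, ranges, allowed and forbidden residue sets, anchors). The pattern, the test sequence and the function name are illustrative assumptions, and Python's re module is a backtracking matcher rather than an explicit deterministic automaton.

```python
import re


def prosite_to_regex(pattern):
    """Translate a subset of PROSITE pattern syntax into a Python regex."""
    regex = pattern.rstrip(".")                       # optional trailing period
    regex = regex.replace("-", "")                    # element separators
    regex = regex.replace("x", ".")                   # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {AB} = anything but A, B
    regex = regex.replace("(", "{").replace(")", "}")   # (2,4) = repeat count
    regex = regex.replace("<", "^").replace(">", "$")   # sequence anchors
    return regex


# Illustrative pattern written in PROSITE syntax.
motif = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(motif)
print(regex)

# Made-up protein fragment containing one instance of the motif.
protein = "AACAACAAALAAAAAAAAHAAAHGG"
match = re.search(regex, protein)
print(match.start(), match.group())
```

Each element of the pattern corresponds to a state transition on one residue, which is why the scan of a protein sequence can proceed residue by residue without re-reading earlier positions when a true automaton is used.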


General idea: ordered fingerprints and the notion of closeness between DNA fragments

Hierarchical sequencing: physical maps, clone libraries and shotgun

Definition The algorithmic problem of shotgun sequence assembly is to deduce the sequence of the DNA string from a set of sequenced and partially overlapping short substrings derived from that string.

Analogy to physical map assembly: the DNA sequence of a substring may be viewed as a precise ordered fingerprint (in analogy to STSs), and the suffix-prefix match determines whether two substrings would be assembled together.

In general, the shortest superstring problem (find the shortest string that contains each string from a given set of strings as a substring) is NP-hard, and heuristics are being developed to address the problem.
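A common heuristic of this kind is greedy merging: repeatedly join the two fragments with the longest suffix-prefix match. The sketch below is a minimal version of that idea, assuming error-free fragments that all come from the same strand; it is not how any particular assembler is implemented, and the fragments and function names are made up for illustration.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Greedy heuristic for the shortest superstring (not guaranteed optimal)."""
    # Drop duplicates and fragments contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the best pair, writing the overlapping part only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags[0] if frags else ""


# Made-up error-free fragments of a short sequence.
reads = ["ATGCGTA", "GTACGTT", "CGTTAG"]
assembled = greedy_superstring(reads)
print(assembled, len(assembled))
```

The result always contains every fragment, but it is not always the shortest possible superstring, and repeats in the underlying sequence can make the shortest superstring differ from the true sequence.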


Get the relevant sequences to compare them: conservation and differences

Problem | Algorithms | Programs
Sequencing | Fragment assembly problem, the Shortest Superstring Problem | Phrap (Green, 1994)
Gene finding | Hidden Markov Models, pattern recognition methods | GenScan (Burge & Karlin, 1997)
Sequence comparison | Pairwise and multiple sequence alignments, dynamic programming, heuristic methods | BLAST (Altschul et al., 1990)
