
Error-Tolerant Algorithms in Bioinformatics

Wen-Lian Hsu

Institute of Information Science

Academia Sinica


Discrete Algorithms

‧ Discrete mathematics lies at the foundation of modern computer science

‧ Most algorithms we have learned in computer science are discrete

‧ Discrete algorithms emphasize “worst case analysis”

‧ Many sequence manipulation algorithms in bioinformatics are discrete


Error-Tolerant Algorithms

‧ Many recognition problems in nature need algorithms that remove noise automatically to recover the correct information:

– Optical character recognition (OCR)

– Human face recognition

– Voice recognition

– Style checker


Design of Algorithms

‧ Optimization problems

– can define “approximation algorithms”

‧ Decision problems (isomorphism, recognition, etc.)

‧ one can consider the “least # of changes” needed to yield a “yes” answer, but this often makes the problem much harder

‧ even if one can find such a solution, it might not make any practical sense

‧ no easy way to measure the “deviation”


A New Paradigm: Error-Tolerant Algorithms

‧ Real-life data always contain some errors (say 5%)

‧ The challenge: discover the 95% “correct” information versus the 5% “incorrect” information automatically

‧ Robustness (difficult to define)

‧ Similar in nature to voice recognition and character recognition algorithms


Natural Problems (1)

‧ Natural problems: problems arising from nature, which are guaranteed to have feasible solutions if the data are collected accurately.

– But because of noise in the sampled data, such solutions are hard to come by.

‧ To tackle these problems, one should focus on real data rather than on worst-case analysis.


Natural Problems (2)

‧ Techniques that take advantage of the natural constraints of these problems do not necessarily work for general data (especially in the worst case), but could perform very well on such well-structured problems.

Constraints Structures Knowledge


An Error-Tolerant Algorithm for the Consecutive Ones Property

Wen-Lian Hsu

Academia Sinica


Human Genome Project

‧ DNA sequencing (could be over 10 million bp): sequences of 4 letters A, G, C, T

‧ Topics of the human genome project:

– Cutting and reassembling DNA sequences

– Sequence comparison

– Gene finding

– Transcription mechanism of DNA sequences

– Prediction of the structure of proteins

– Phylogenetic trees

Cutting and Reassembling a DNA Sequence

‧ Cut a DNA sequence into small pieces in different ways and reassemble them together

‧ the “small” pieces (called clones) are still too large to find complete sequences

‧ biologically, use “probes” to mark the clones

– each probe could mark several clones; each clone could contain several probes


Probe-Clone (0,1)-Matrix

‧ Each probe can be regarded as a column; each clone can be regarded as a row of probes

‧ If each probe hits the DNA sequence only once (unique probe) and there is no error in the probe-clone matrix, then one can use the consecutive ones test to order the clones
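
To make the consecutive ones test concrete, here is a minimal Python sketch. It checks whether a given column (probe) order makes the 1s in every row (clone) contiguous, and searches all column orders by brute force. The function names and the small example matrix are illustrative assumptions, and the exhaustive search is exponential; it is not the linear-time method discussed in the talk.

```python
from itertools import permutations


def rows_consecutive(matrix, order):
    """Return True if, with columns arranged as in `order`,
    the 1s of every row occupy contiguous positions."""
    for row in matrix:
        positions = [i for i, col in enumerate(order) if row[col] == 1]
        if positions and positions[-1] - positions[0] + 1 != len(positions):
            return False
    return True


def find_c1p_order(matrix):
    """Brute-force search for a column order with consecutive ones.
    Exponential in the number of columns; for tiny examples only."""
    n_cols = len(matrix[0]) if matrix else 0
    for order in permutations(range(n_cols)):
        if rows_consecutive(matrix, order):
            return list(order)
    return None


# Hypothetical probe-clone matrix: rows are clones, columns are probes.
clones = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]
print(find_c1p_order(clones))                                 # a valid probe order
print(find_c1p_order([[1, 1, 0], [0, 1, 1], [1, 0, 1]]))      # None: no such order
```

A valid column order places the probes along the DNA sequence, and the clones can then be ordered by where their runs of 1s begin.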


Consecutive Ones Property (C1P)

‧ Booth & Lueker [1976]: linear time, on-line

– made use of a data structure called PQ-trees

‧ Hsu [1992]: decomposition, off-line

– did not use PQ-trees

‧ However, these algorithms do not work on data that contain errors


C1P Testing with Good Row Ordering


Exact Algorithm for Consecutive Ones Testing

1. Construct G’’, a spanning tree of G’ (the strictly overlapping graph). Each connected component corresponds to a prime submatrix (matrix decomposition).

2. Decide the topological ordering of the prime matrices.

3. For each prime matrix, determine the ordering of columns using the set partition strategy, following the preorder traversal of the corresponding connected component of G’’ (good ordering).
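
Step 1 can be sketched as follows. The slides do not define "strictly overlapping"; this sketch assumes the usual meaning, that two rows share at least one column and neither row's column set contains the other's. The function names are hypothetical, and the pairwise comparison is quadratic in the number of rows, so this illustrates the decomposition rather than the efficient algorithm.

```python
def strictly_overlap(a, b):
    """Assumed definition: the column sets intersect and neither contains the other."""
    return bool(a & b) and not a <= b and not b <= a


def prime_components(matrix):
    """Group rows into connected components of the strictly overlapping graph.

    Each component is returned as a list of row indices in the order they are
    visited, which is a preorder of the spanning tree built by the traversal.
    """
    col_sets = [frozenset(j for j, x in enumerate(row) if x) for row in matrix]
    n = len(col_sets)
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack = [start]
        visit_order = []
        while stack:
            u = stack.pop()
            visit_order.append(u)
            for v in range(n):
                if not seen[v] and strictly_overlap(col_sets[u], col_sets[v]):
                    seen[v] = True  # the edge (u, v) becomes a spanning-tree edge
                    stack.append(v)
        components.append(visit_order)
    return components


# Hypothetical example: rows 0 and 1 strictly overlap; row 2 is disjoint from
# the others; row 3 contains rows 0 and 1, so it strictly overlaps neither.
example = [
    [1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [1, 1, 1, 1, 0, 0],
]
print(prime_components(example))
```

Each returned group of rows corresponds to one prime submatrix in the sense of step 1; ordering the columns within each group (step 3) is not covered by this sketch.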


Problems in Lab Data

‧ False positives: a “1” should actually be a “0”

‧ False negatives: a “0” should actually be a “1”

‧ The probes are not necessarily unique

– there are a lot of repeating subsequences in a DNA sequence

‧ Chimeric clones: two clones stick together at the end

‧ At STOC, Karp [1993] posed this as a problem that needs a major breakthrough in computational biology

‧ How to deal with it? Neighborhood consensus
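
The first two error types can be simulated on a clean (0,1)-matrix, which is useful for testing an error-tolerant method against known ground truth. The sketch below is an assumption about how such noise might be injected; the function name, the independent per-entry flip model, and the rate parameters are all hypothetical, and non-unique probes and chimeric clones are not modelled.

```python
import random


def inject_errors(matrix, fp_rate, fn_rate, seed=0):
    """Return a noisy copy of `matrix` plus a log of the flipped entries.

    A true 1 becomes an observed 0 with probability `fn_rate` (false negative);
    a true 0 becomes an observed 1 with probability `fp_rate` (false positive).
    Entries are flipped independently, which is a simplifying assumption.
    """
    rng = random.Random(seed)
    noisy, flipped = [], []
    for i, row in enumerate(matrix):
        new_row = []
        for j, x in enumerate(row):
            if x == 1 and rng.random() < fn_rate:
                new_row.append(0)
                flipped.append((i, j, "FN"))
            elif x == 0 and rng.random() < fp_rate:
                new_row.append(1)
                flipped.append((i, j, "FP"))
            else:
                new_row.append(x)
        noisy.append(new_row)
    return noisy, flipped


clean = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
]
noisy, flipped = inject_errors(clean, fp_rate=0.05, fn_rate=0.05, seed=1)
print(noisy)
print(flipped)
```

Keeping the log of flipped entries makes it possible to score afterwards how many injected errors a correction procedure actually found.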


False Positives and False Negatives

false positive

false negative

Non-unique Probe


Remote False Positives

Chimeric Clone


An Error-Tolerant Algorithm for the C1P test

The idea is derived from the off-line C1P test based on good row ordering


Strategy of Fault-tolerant Algorithm for Consecutive Ones Testing

1. Detect and correct the four types of errors to construct G’’.

2. Decide the topological ordering of the prime matrices.

3. Use a heuristic set partition strategy to determine the ordering of columns. Some bad rows and lost columns will remain; they indicate that the corresponding clones and probes are bad and that additional lab work is needed.


A Matrix Satisfying the C1P

A Matrix Mixed with All Four Types of Errors


Monotone Property in a Consecutive Ones Matrix

Processing row u (I): Errorless case

[Figure: row u with the row sets E(u), A(u), B(u), C(u), D(u), STA(u), and STA’(u); end labels LL and RR]

Processing row u (II): Errorless case

‧ At the end, row u is shrunk to 2 columns, representing the left and right parts

‧ At the end of the algorithm, we can rewind the rows to restore all the shrunk rows

False Negatives of Row u

False Negatives of C(u)


A General Error-Tolerant Algorithm for constructing G’’ (I)

1. Determine, for each probe, whether it is unique, and remove the remote false positives.

2. Determine, for each clone, whether it is chimeric, and remove the remote false positives.

3. Detect certain false negatives using a global technique.

4. Partition STA’(u) (= STA(u) – E(u)), C(u), and D(u) based on the containment relationship, and separate A(u) and B(u) from STA’(u).


A General Error-Tolerant Algorithm for constructing G’’ (II)

5. For each row u, detect those local false negatives and false positives.

6. Make u adjacent to every row in A(u) and B(u).

7. Delete row u, construct a special row [u] such that CL([u]) = {v1,v2}, and proceed to the next regular row.

Neighborhood Clustering

[Figure: row u with the row sets E(u), A(u), B(u), C(u), D(u), STA(u), and STA’(u); uncertain entries marked “?”; end labels LL and RR]


Non-Unique Probes


Chimeric Clones

Remote False Positive

False Negative (Global Method)

[Figure: rows “close” to the above rows; uncertain entries marked “?”]

Avoid False Negatives of Row u

Where would the false negatives go: to the left or to the right?

Avoid False Negatives of C(u)


Monotone Property in a Consecutive Ones Matrix


Local False Positives and False Negatives

false positive

false negative


A Heuristic for Local False Positives and Negatives

Fill-in

Try the columns one by one to see which has the minimum fill-ins
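
One possible reading of the fill-in heuristic is sketched below; it is an illustration, not the method from the talk. Here a "fill-in" is taken to be a 0 that lies between the leftmost and rightmost 1 of a row under a given column order, that is, an entry that would have to be treated as a false negative for the row to have consecutive ones. The function names, and the idea of trying every insertion position for one new column, are assumptions.

```python
def fill_ins(matrix, order):
    """Count the 0s that fall inside each row's span of 1s under `order`."""
    total = 0
    for row in matrix:
        positions = [i for i, col in enumerate(order) if row[col] == 1]
        if positions:
            total += (positions[-1] - positions[0] + 1) - len(positions)
    return total


def best_insertion(matrix, order, new_col):
    """Try every position for `new_col` in `order`; return (cost, new_order)
    for the first position with the fewest fill-ins."""
    best = None
    for pos in range(len(order) + 1):
        candidate = order[:pos] + [new_col] + order[pos:]
        cost = fill_ins(matrix, candidate)
        if best is None or cost < best[0]:
            best = (cost, candidate)
    return best


# Hypothetical data: columns 0..2 are already ordered; column 3 is placed next.
rows = [
    [1, 1, 0, 1],
    [0, 1, 1, 1],
]
print(fill_ins(rows, [0, 1, 2, 3]))        # fill-ins if column 3 goes last
print(best_insertion(rows, [0, 1, 2], 3))  # cheapest position for column 3
```

Ties are broken in favour of the leftmost position here; a real heuristic would likely need a more informed tie-break.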


Ordering Probes

False negatives

False positives


Bad Row for Partition


Islands of probes

Island 1 Island 2

Bad row


Order of Islands

Island 1 Island 2 Island 3


Jump Column of Result Matrix

1 47 2 3 5 6 8 9

Simulation Results (I): 100x100 (total 50 matrices)

[Bar chart: vertical axis 0 to 60; horizontal bins 0~5, 6~10, 11~15, 16~20, 21~25, 26~30; series 10%, 5%, 3%]

Simulation Results (II): 200x200 (total 50 matrices)

[Bar chart: vertical axis 0 to 40; horizontal bins 0~5, 6~10, 11~15, 16~20, 21~25, 26~30, 31~35; series 10%, 5%, 3%]

Simulation Results (III): 400x400 (total 50 matrices)

[Bar chart: vertical axis 0 to 50; horizontal bins 0~5, 6~10, 11~15, 16~20, 21~25, 26~30, 31~35; series 10%, 5%, 3%]


A 50x50 matrix with error rate 5%

11111111 1111111111N11 111N1111 111111111111111111 1N11111N111111111111 1111111111111 111111 11N11111111111111111 11111111111111111111 1111111 111111111 111111111111111111 11111111111111N1 11111111111 111N1111111111N 11111N11111 1111111111111111N11 11111111111111 11111111111N111 1111111111111N11111 111111N111111111 11111111111111111111 11111111111111111 111111 111111111N11 11111111111111 11111111 11111111111111111 111111111N11 1111111111 1N1111111111N11111 P N111N111111111 N1N11111 P 1111111111111111 1111111N11N111111 11111111111111 11111111N111111 11N1111111 N1N111 111111111111111111 1111111111111 11111111111111 P 1111111 11111N111111111 111111111111111111 1N111111111111 111111N11 11111111111111111111 1111111 111111111111111

11111111 1111111111F1F1 111F1111 111111111111111111 1F11111F111111111111 1111111111111 111111 11F11111111111111111 11111111111111111111 1111111 111111111 111111111111111111 111111111111111 11111111111 111F1111111111 111111F1111 1111111111111111F11 11111111111111 11111111111F111 11FF11111111111F1111F1 111111F111111111 11111F11111111111111 11111F11111111111 11111 1111111111F1 11111111111111 11111111 1111111111111111 1111111111 1111111111 11111111111F11111 11F1111111111 111111 1F1111111111111111 1111111F11F111111 11F111111111111 11111111F111111 1F11111111 1F111 111111111111111111 1111111111111 11111111111111 11111111 11111F11111111F1 111111111111111111 1F111111111111 111111F11 11111111111111111111 1111111 11F1111111111111


A 50x50 matrix with error rate 10%

11111111 1111111N11N11 1N1N1111 P 11111111111111111N 1N1NN11N11111111111N 1111111111111 P 111111 11N11111111111N11111 11111111111111111111 1N11111 1111111N1 111111N11111111111 11111111111111NN N1111111N11 111N1111111111N 11111N1N111 11111NN111111111N11 11111111111111 P 1N111111111N111 1N1N1111N1111N11111 111111N111111111 11111111111111111111 11111111111111111 111111 111111111N11 11111111111111 11111111 111N1111111111N11 111111111N11 111111N111 1N1111111111NN1111 N111N111111111 N1N11111 1111111111111111 1111111N11N111111 11111111111111 11111111N111111 P 11N1N11111 N1N11N 1N111N111111111111 1N1111111111N 111111111N1111 P 1111111 11111N111111111 111111111111111111 1N111111111111 111111N11 11111111111111111111 P 1111111 111111111111111

11111111 1111111F11F11 11F1FF11 11111FF1111111111 11FFFF1F1111111111 11FF11111111FF11 111111 111111111111F11111 11111111111111111111 1F11111 1111111F1 11111F11111111111 11111FF111111 111111F11 1F11FFF11111111 1111F1F111 1111111FF1111111 11111111111111 11111111111F111 1F111F11F1111F1FF11 111111F111111111 11F11111111111111111 1FF11111111111111 11111 111111111F11 11111111111111 11111111 11F111111111F111 111111F1111 1F11111F111 11F11111111FF1111 1111F1111111 1F11111 11111111111111FF1 1111111F11F111111 11FFF1F11111FF111F1 111FF11F1111111 111FF1111 11F11111111111111 11FF11F111111F1 11111111F11111 11F11111 11111F111111111 111111111111111111 1F111111111111 111111F11 11111111111111111111 1111111 111111111111111