BIONFORMATIC ALGORITHMS Ryan Tinsley Brandon Lile May 9th, 2014

BIONFORMATIC ALGORITHMS

Ryan TinsleyBrandon Lile

May 9th, 2014

2

Bioinformatics

• What?• Why?• Remains large frontier• Goals:

– Organize and serve data– Develop tools to analyze– Interpret results

3

Clustering

• Clustering’s goal is to take large amounts of data and group it into similarity classes.

• Viewing and analyzing vast amounts of biological data as a whole can be a challenging job. It is easier to interpret the data if it is partitioned into clusters combining similar data points.

• Clustering arose from researchers wanting to know the functions of newly sequenced genes. If you were to compare the gene sequence to already know DNA sequence it would not often give away the function of the gene. In fact 40% of sequenced gene’s functionality cannot be attained by only comparing sequences of known genes.

4

Microarrays

• Biologists infer gene functions by using microarrays. Microarrays are a grid of DNA segment of known sequences that is used to test and map DNA fragments. They measure the expression level of genes under varying conditions.

• Expression level is estimated by measuring the amount of mRNA for that particular gene. Microarray data are transformed into an intensity matrix, which allows biologists to make correlations between genes. This is where clustering comes in handy.

• Microarray data is plotted by each data as a point in an N-dimensional space. Clustering makes a distance matrix for the distances between every gene point. Genes that are close together share the same characteristics. The outcome is groups of clusters that have related functionality.

5

Green: control Red: experimental cellYellow: in both samplesBlack: in neither sample

Each box represents one gene’s expression over time.

Microarrays can also be expressed by tables.

6

Hierarchical Clustering

• Hierarchical clustering is a method which seeks to build a hierarchy of clusters. This is done in two ways, agglomerative and divisive.

• Agglomerative starts with every element in its own cluster and iteratively joins clusters together. The time complexity of agglomerative clustering is O(n^3).

• Divisive starts with one cluster and iteratively divides it into smaller clusters. Divisive clustering with an exhaustive search is O(2^n), which is even worse. However, for some special cases, optimal efficient agglomerative methods (of complexity O(n^2)) are known: SLINK[1] for single-linkage and CLINK[2] for complete-linkage clustering.

7

• Hierarchical clustering is often used to reveal evolutionary history. Distance between two blusters is the smallest distance between any pair of their elements. Average distance is between all pairs of elements.

This is a good example of clustering. It satisfies both homogeneity and separation principles.

8

H.C. Algorithm1. Hierarchical Clustering (d , n)

2. Form n clusters each with one element

3. Construct a graph T by assigning one vertex to each cluster

4. while there is more than one cluster

5. Find the two closest clusters C1 and C2

6. Merge C1 and C2 into new cluster C with |C1| +|C2| elements

7. Compute distance from C to all other clusters

8. Add a new vertex C to T and connect to vertices C1 and C2

9. Remove rows and columns of d corresponding to C1 and C2

10. Add a row and column to d corrsponding to the new cluster C

11. return T

Distance:

dmin(C, C*) = min d(x,y)

davg(C, C*) = (1 / |C*||C|) ∑ d(x,y)

Squared Error Disttortion:d(V,X) = ∑d(vi, X)2 / n 1 < i < n

9

K-Means Cluster and Lloyd’s Clustering

• K-Means clustering is a technique to partition a set of N points into K clusters.

• 1-Means Clustering problem is easy. However, it becomes very difficult (NP-complete) for more than one center.

• An efficient heuristic method for K-Means clustering is the Lloyd algorithm. It differs from K-Means in that its inputs are a continuous geometric region rather than a discrete set of points. Lloyd’s algorithm, while being fast, in each iteration it moves many data points not necessarily causing a better convergence.

• A more overall better clustering cost is done with the Progressive Greedy K-Means clustering. This method moves one point at a time.

• A comparison between the two has the conclusion that Lloyd’s method is more efficient in run-time and in best found SED.

10

K-Means Clustering: Lloyd Algorithm

1. Arbitrarily assign the k cluster centers

2. while the cluster centers keep changing

3. Assign each data point to the cluster Ci corresponding to the closest cluster representative (center) (1 ≤ i ≤ k)

4. After the assignment of all data points, compute new cluster representatives according to the center of gravity of each cluster, that is, the new cluster representative is ∑v \ |C| for all v in C for every cluster C

11

Progressive “Greedy” K-Means

12

K-Mean Execution Ex.

13

Distance Graph

• A distance matrix can be turned into a distance graph. Genes are then represented as vertices in the graph, otherwise known as Clique graphs. A problem with clique graphs is corruption which can be solved with the CAST algorithm.

14

Other Clustering Uses

• Clustering has a few more applications. Clustering can be used on highly homologous sequences to reduce the size of large protein databases. It reduces the complexity and speeds up comparisons.

• Clustering can be used to evaluate protein-protein interaction networks. The structure of proteins interactions can be represented by a graph. Clustering then looks for clusters in the graph.

• Also it can be used to improve the protein structure prediction by merging the predictions made by a large number of alternative conformation models.

BLAST

15

• Need way to compare newly sequenced genes with predetermined genes

• Quadratic local alignment too slow

• Heuristic approach instead allows for O(nm) runtime

16

BLAST cont…

• Step 1: Removal of low complexity/repetitive regions– DUST

• Step 2: Splitting amino acidsnucleotides into words/sequences

17

BLAST cont…

• Step 3: All words above threshold incorporated into query list and placed in efficient search tree

18

BLAST cont…

• Step 4: DB queried for matches• Step 5: Verification/Extension of

matches– Ungapped BLAST– Gapped BLAST

19

BLAST VerificationUngapped Gapped

• Step 6: Verifies threshold and returns ranked results

20

BLAST Implementations

• NCBI: http://blast.ncbi.nlm.nih.gov/– Blastn– Blastp– Blastx– Etc.

• mpiBLAST• ScalaBLAST• BLAT• Future of BLAST

http://blast.ncbi.nlm.nih.gov/

http://blast.ncbi.nlm.nih.gov/

21

Questions?

Documents

BIONFORMATIC ALGORITHMS Ryan Tinsley Brandon Lile May 9th, 2014