Hierarchical Clustering of Gene Expression Data. Authors: Feng Luo, Kun Tang, Latifur Khan. Graduate: Chien-Ming Hsiao

Page 1: Hierarchical Clustering of Gene Expression Data

Hierarchical Clustering of Gene Expression Data

Author : Feng Luo, Kun Tang

Latifur Khan

Graduate : Chien-Ming Hsiao

Page 2: Hierarchical Clustering of Gene Expression Data

Outline

Motivation

Objective

Introduction

Hierarchical Clustering

Self-Organizing Tree algorithm

hierarchical growing self-organizing tree algorithm

Preliminary result

Conclusions

Personal Opinion

Page 3: Hierarchical Clustering of Gene Expression Data

Motivation

Rapid development of biological technologies generates a huge amount of data.

Analyzing and interpreting these massive data is a challenging task.

We are interested in data analysis tools that can help researchers detect patterns hidden behind these complex initial data.

Page 4: Hierarchical Clustering of Gene Expression Data

Objective

To extract useful and rational fundamental patterns of gene expression inherent in these huge data sets.

Page 5: Hierarchical Clustering of Gene Expression Data

Introduction

Current approaches for measuring gene expression profiles:

SAGE, RT/PCR, cDNA, oligonucleotide microarray

[Figure: Sample of Microarray]

Page 6: Hierarchical Clustering of Gene Expression Data

Introduction

Two classes of algorithms have been successfully used to analyze gene expression data:

hierarchical clustering

self-organizing tree

Page 7: Hierarchical Clustering of Gene Expression Data

Hierarchical Clustering

Page 8: Hierarchical Clustering of Gene Expression Data

Self-Organizing Tree algorithm

The Self-Organizing Tree algorithm (SOTA) is based on Kohonen’s self-organizing map (SOM) and Fritzke’s growing cell structures.

The output of SOTA is a neural network with a binary tree topology.

Page 9: Hierarchical Clustering of Gene Expression Data

Self-Organizing Tree algorithm

Step 1: Initialize the system as a binary tree with three nodes.

[Figure: (A) Initial architecture of SOTA, with elements labeled Node and Cell. (B) Two different reference vector updating schemas.]

Page 10: Hierarchical Clustering of Gene Expression Data

Self-Organizing Tree algorithm

Step 2: Present all data and compute the distance from each datum to every external cell (tree leaf), using either Euclidean or cosine distance.

Step 3: For each datum, select the winning output cell c, the one with the minimum distance dij.
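Steps 2 and 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each datum and each reference vector is a plain list of floats, cosine distance is taken as one minus the cosine similarity (the slide does not define it), vectors are non-zero, and the function names are hypothetical.

```python
import math

def euclidean(x, w):
    """Euclidean distance between a datum x and a reference vector w."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, w)))

def cosine_distance(x, w):
    """Assumed definition: 1 - cosine similarity (requires non-zero vectors)."""
    dot = sum(a * b for a, b in zip(x, w))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_w = math.sqrt(sum(b * b for b in w))
    return 1.0 - dot / (norm_x * norm_w)

def winning_cell(x, cells, distance=euclidean):
    """Return the index of the leaf cell whose reference vector is closest to x."""
    return min(range(len(cells)), key=lambda i: distance(x, cells[i]))
```

Calling `winning_cell(x, cells)` once per datum gives the assignment of data to leaves that the later update step works on.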

Page 11: Hierarchical Clustering of Gene Expression Data

Self-Organizing Tree algorithm

Step 4: Update the reference vector of the winning cell and its neighbors:

$\Delta w_i = \varphi(t) \times (x - w_i)$

where $\varphi(t)$ is the learning function:

$\varphi(t) = \alpha \times \eta(t)$

$\eta(t)$ is the learning rate function, $\eta(t) = 1/t$.

$\alpha$ is a learning constant. It takes a different value for the winning cell and for its neighbors.
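The update in Step 4 can be sketched as follows, assuming it has the form Δw = φ(t)(x − w) with φ(t) = α · η(t) and η(t) = 1/t. The function names are hypothetical, and the cycle counter t is assumed to start at 1.

```python
def learning_rate(t):
    """Learning rate function eta(t) = 1/t, with t >= 1."""
    return 1.0 / t

def update_reference(w, x, alpha, t):
    """Move reference vector w toward datum x by phi(t) = alpha * eta(t)."""
    phi = alpha * learning_rate(t)
    return [wi + phi * (xi - wi) for wi, xi in zip(w, x)]

# The winning cell and its neighbors use different learning constants;
# the values here are illustrative only.
winner = update_reference([0.0, 0.0], [1.0, 2.0], alpha=0.5, t=1)
neighbor = update_reference([0.0, 0.0], [1.0, 2.0], alpha=0.05, t=1)
```

Because η(t) shrinks as t grows, later presentations move the reference vectors less than early ones.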

Page 12: Hierarchical Clustering of Gene Expression Data

Self-Organizing Tree algorithm

Steps 2, 3, and 4 form a cycle. While the relative error of the entire tree is greater than a threshold, repeat the cycle.

Step 5: When a cycle has finished, increase the network size: two new cells are attached to the cell with the highest resource. This cell becomes a node.

Resource: the average distance between a cell's reference vector and the input data associated with that cell:

$\mathrm{Resource}_i = \frac{1}{D} \sum_{j=1}^{D} d(x_j, w_i)$

Step 6: Repeat from Step 2 until convergence (resources are below a threshold).
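The resource computation and the choice of the cell to split in Step 5 might look like the sketch below. It assumes Euclidean distance, at least one datum per cell, and a hypothetical dictionary layout mapping a cell name to a (reference vector, associated data) pair.

```python
import math

def euclidean(x, w):
    """Euclidean distance between a datum x and a reference vector w."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, w)))

def resource(data, w, distance=euclidean):
    """Average distance from a cell's associated data to its reference vector w."""
    return sum(distance(x, w) for x in data) / len(data)

def cell_to_split(cells):
    """cells maps a name to (reference_vector, associated_data).

    Returns the name of the cell with the highest resource, which is the
    one that receives two new child cells and becomes a node.
    """
    return max(cells, key=lambda name: resource(cells[name][1], cells[name][0]))
```

A cell whose data lie far from its reference vector has a high resource, so growth concentrates on the most heterogeneous clusters.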

Page 13: Hierarchical Clustering of Gene Expression Data

Self-Organizing Tree algorithm

The time complexity of SOTA is O(n log N).

The space complexity of SOTA is O(n).

Page 14: Hierarchical Clustering of Gene Expression Data

The incorrect capture of the hierarchical relationship

SOTA

Page 15: Hierarchical Clustering of Gene Expression Data

hierarchical growing self-organizing tree algorithm

hierarchical growing self-organizing tree algorithm (HGSOT)

The HGSOT grows in two directions:

Vertical growth adds descendants, using the same strategy as SOTA.

Horizontal growth adds more siblings. A level threshold controls growth in the sibling generation.

Page 16: Hierarchical Clustering of Gene Expression Data

hierarchical growing self-organizing tree algorithm

To determine horizontal growth

Page 17: Hierarchical Clustering of Gene Expression Data

hierarchical growing self-organizing tree algorithm

[Flowchart: Initialization → Vertical Growing → Distribution → check the error of the entire tree → Horizontal Growing → Distribution → check the error of the entire tree → more levels to grow?]

Page 18: Hierarchical Clustering of Gene Expression Data

The pseudo code of HGSOT

1. Initialization: initially the tree has only one root node. Initialize its reference vector with the centroid of the entire data set; all data are associated with the root.

2. Vertical Growing: change each leaf cell to a node and add two children to it. The reference vector of a new cell is initialized to the node’s reference vector.

3. Distribution: distribute each input datum between the two newly created cells; find the winning cell (using KLD, see 2.2.1), and then update the reference vector of the winning cell and its neighbor.

4. Error: while the error of the entire tree is larger than a threshold, called the error threshold (TE), repeat Step 3.

Page 19: Hierarchical Clustering of Gene Expression Data

The pseudo code of HGSOT

5. Horizontal Growing: when the difference between the minimum and maximum distance of all children cells of a node (x) is less than a threshold, called the level threshold (TL), a child is added to this node; if instead the difference is greater than TL, a child is deleted from this node and the horizontal growth terminates.

6. Distribution: distribute the input data associated with x among its descendants along siblings; find the winning cell (using KLD, see 2.2.), then update the reference vector of the winning cell and its neighbor.

7. Error: if the error of the entire tree is greater than TE, repeat Step 6.

8. If there are more levels to grow in the hierarchy, return to Step 2; otherwise, stop.
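The initialization in Step 1 sets the root's reference vector to the centroid of the whole data set. A minimal sketch, assuming each datum is an equal-length list of floats; `centroid` is a name introduced here for illustration.

```python
def centroid(data):
    """Component-wise mean of all data vectors."""
    n = len(data)
    return [sum(column) / n for column in zip(*data)]

# The root starts with every datum associated with it.
data = [[0.0, 2.0], [2.0, 4.0]]
root_reference = centroid(data)
```

Since Step 2 initializes each new cell to its parent's reference vector, the first two children also start at this centroid.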

Page 20: Hierarchical Clustering of Gene Expression Data

Hierarchical Cluster Algorithms

How can we distribute the input data of a selected node among its newly created cells?

One approach is similar to the SOTA approach.

In the other approach, the input data of the selected node are distributed not only to its newly created cells but also to its neighbor cells. We determine the ancestor node K levels above the selected node, take the sub-tree rooted at that ancestor, and distribute the input data of the selected cell among all cells (leaves) of this sub-tree.

The latter approach is known as K level distribution (KLD).

Page 21: Hierarchical Clustering of Gene Expression Data

Hierarchical Cluster Algorithms

KLD: We need to distribute the data associated with node M to its newly created cells. For K=1, the data of node M will be distributed to cells B, C, D, and E. For K=0, the data of M will be distributed between B and C.
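The K level distribution rule can be sketched as a tree walk: climb K levels from the selected node, then collect every leaf under that ancestor. The tree shape below is an assumption made to reproduce the B, C, D, E example (the original figure is not available), and the class and function names are hypothetical.

```python
class TreeNode:
    """Minimal tree node; a node without children is a cell (leaf)."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

def ancestor(node, k):
    """Climb at most k levels toward the root."""
    for _ in range(k):
        if node.parent is None:
            break
        node = node.parent
    return node

def leaves(node):
    """All leaf cells in the sub-tree rooted at node, left to right."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found

def kld_targets(selected, k):
    """Names of the cells that receive the data of the selected node."""
    return [leaf.name for leaf in leaves(ancestor(selected, k))]

# Assumed layout: A is the parent of M, D and E; M has the new cells B and C.
a = TreeNode("A")
m = TreeNode("M", a)
d = TreeNode("D", a)
e = TreeNode("E", a)
b = TreeNode("B", m)
c = TreeNode("C", m)
```

With this layout, `kld_targets(m, 0)` yields B and C, and `kld_targets(m, 1)` yields B, C, D, and E. Larger K lets data that were misassigned early move to a cell elsewhere in the tree, at the cost of more distance computations.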

Page 22: Hierarchical Clustering of Gene Expression Data

Preliminary result

Experiment setup: Experiment Data

Expression data for 112 genes of the rat central nervous system (CNS).

Four gene families: Neuro-Glial Markers Family (NGMs), Neurotransmitter Receptors Family (NTRs), Peptide Signaling Family (PepS), and Diverse.

The gene expression data were measured by RT-PCR of mRNA expression in rat cervical spinal cord tissue at nine developmental time points, from embryonic days 11 through 21, postnatal days 0 through 14, and adult.

For each gene, the data are normalized to the maximal expression level among the nine time points.
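The per-gene normalization can be illustrated with a short sketch. It assumes each gene's profile is a list of non-negative expression values with a positive maximum; the function name and the sample values are made up for illustration.

```python
def normalize_to_max(profile):
    """Scale a gene's expression profile so its largest value becomes 1.0."""
    peak = max(profile)
    return [value / peak for value in profile]

# Illustrative profile only, not taken from the data set.
example = normalize_to_max([2.0, 4.0, 1.0])
```

After this step every profile lies in the range 0 to 1, so the clustering compares the shape of the expression curves and not their absolute levels.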

Page 23: Hierarchical Clustering of Gene Expression Data

Preliminary result

Experiment setup: The Parameters of HGSOT

The winner learning rate αw and the sibling learning rate αs of HGSOT are 0.15 and 0.015, respectively.

The error threshold is 0.001.

The level threshold is 0.8, which means the minimum distance will not be less than 80% of the maximum distance.

The distribution level K is equal to 4.

Euclidean distance is used to calculate the similarity.
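One possible reading of the level threshold is a ratio test on the distances among a node's children: the smallest distance must be at least 80% of the largest. The sketch below encodes only that check. Which distances are compared and what action follows are not specified on this slide, so the function name and its use are assumptions.

```python
def satisfies_level_threshold(distances, level_threshold=0.8):
    """True when the minimum distance is at least level_threshold times the maximum."""
    return min(distances) >= level_threshold * max(distances)
```

For example, distances of 0.9 and 1.0 pass the check at a threshold of 0.8, while 0.5 and 1.0 do not.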

Page 24: Hierarchical Clustering of Gene Expression Data

Preliminary result

Page 25: Hierarchical Clustering of Gene Expression Data

Preliminary result

Page 26: Hierarchical Clustering of Gene Expression Data

Conclusions

HGSOT successfully obtains five clusters similar to Wen et al.’s original HAC result and gives a better hierarchical structure.

The algorithm can detect more subtle patterns at the lower hierarchical levels, and for some genes it gives a more suitable clustering than HAC.

Page 27: Hierarchical Clustering of Gene Expression Data

Personal Opinion

We would like to run more experiments on different data sets.