Motivation
Clustering pipeline
Improvements
Outlook
Summary
News from the cluster pipeline
Jan Engelhardt
Department of Computer Science, Bioinformatics Group
University of Leipzig
October 24, 2009
Jan Engelhardt Cluster pipeline
ncRNA candidates
a lot of new genomes
a lot of predicted ncRNAs (RNAz, wet lab)
[Figure: phylogenetic tree of the Bilateria. Protostomes: Nemathelminthes (nematodes: Caenorhabditis elegans, Caenorhabditis briggsae, ~3600). Deuterostomes: hemichordates, echinoderms, and chordates (cephalochordates; urochordates: larvaceans with Oikopleura dioica, ~3300, and ascidians with Ciona intestinalis and Ciona savignyi; vertebrates with Homo sapiens, ~30,000). The 2R WGD is marked on the tree.]
LocARNA, RNAsoup, Structural clustering pipeline
LocARNA - local alignment of RNA (S. Will)
local sequence/structure alignment tool
detects homologous secondary structure motifs
variant of the Sankoff algorithm (very fast)
defines a distance measure
distance can be used to build a cluster tree
[Figure: example cluster tree with the leaves RNA 1 to RNA 8]
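How a distance measure turns into a cluster tree can be sketched as follows. This is a minimal illustration using average linkage (UPGMA); the slides do not state which linkage the pipeline uses, so the linkage, the function name `average_linkage_tree`, and the toy distances are all assumptions.

```python
from itertools import combinations


def average_linkage_tree(names, dist):
    """Agglomerative clustering with average linkage (an assumed choice).

    names: list of leaf labels
    dist:  dict mapping frozenset({a, b}) to the distance between leaves a and b
    Returns the tree as nested tuples, e.g. (("A", "B"), "C").
    """
    # Map each active subtree to the list of leaves it contains.
    clusters = {name: [name] for name in names}

    def cluster_distance(x, y):
        # Mean of all leaf-to-leaf distances between the two subtrees.
        pairs = [(a, b) for a in clusters[x] for b in clusters[y]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest subtrees.
        x, y = min(combinations(clusters, 2),
                   key=lambda xy: cluster_distance(*xy))
        clusters[(x, y)] = clusters.pop(x) + clusters.pop(y)
    return next(iter(clusters))


# Toy distances (hypothetical values, not LocARNA output).
names = ["A", "B", "C", "D"]
dist = {frozenset(p): 10.0 for p in combinations(names, 2)}
dist[frozenset(("A", "B"))] = 1.0
dist[frozenset(("C", "D"))] = 2.0
tree = average_linkage_tree(names, dist)
print(tree)
```

The nested-tuple result keeps the merge order, so the closest pairs end up as sibling leaves.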
RNAsoup - Spot grOUPs in RNA cluster-tree (K. Reiche)
decision rule after Duda and Hart
optimal number of clusters
squared error of the minimum free energies
different threshold values
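The decision rule can be illustrated with the textbook form of the Duda-Hart test, applied to one-dimensional values such as minimum free energies. This is only a sketch: how RNAsoup applies the rule internally is not shown on the slides, and the function names, the threshold `z`, and the example energies are hypothetical.

```python
import math


def sum_squared_error(values):
    """Sum of squared deviations from the mean of one cluster."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def duda_hart_split(left, right, z=1.645):
    """Return True if splitting into `left` and `right` is supported.

    left, right: one-dimensional values (here: hypothetical minimum free energies)
    z: threshold in standard-normal units (illustrative; other values are stricter or looser)
    """
    merged = list(left) + list(right)
    n, p = len(merged), 1  # p is the number of dimensions
    je1 = sum_squared_error(merged)                           # one cluster
    je2 = sum_squared_error(left) + sum_squared_error(right)  # two clusters
    if je1 == 0:
        return False  # all values identical, nothing to split
    critical = (1 - 2 / (math.pi * p)
                - z * math.sqrt(2 * (1 - 8 / (math.pi ** 2 * p)) / (n * p)))
    # A small ratio means the two-cluster model explains much more variance.
    return je2 / je1 < critical


# Two well-separated groups of energies versus one evenly spread group.
separated = duda_hart_split([-30 + 0.1 * i for i in range(10)],
                            [-10 + 0.1 * i for i in range(10)])
uniform = duda_hart_split(list(range(10)), list(range(10, 20)))
print(separated, uniform)
```

Varying `z` corresponds to trying different threshold values: a larger `z` lowers the critical value and makes the rule accept fewer splits.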
LocARNA - RNAsoup - SoupViewer
[Figure: pipeline diagram with the labels annotated ncRNAs, predicted ncRNAs, RNAfold, LocARNA, RNAclust, RNAsoup, SoupViewer, (manual), ncRNAs]
Pipeline
Start: A set of RNA sequences.
Calculate pairwise distances using LocARNA.
Build a hierarchical cluster tree.
Get propositions for partitions by RNAsoup.
Examine your RNA tree using SoupViewer.
RNAclust, RNAsoup, SoupViewer
Several updates
adapted to the new version of LocARNA
RNAsoup is embedded in RNAclust (--rnasoup)
removed some leftover code
small computational improvements
Adding new sequences to an existing cluster tree
Assumption
You have an RNA cluster tree (n sequences).
You want to add an additional sequence.
“naive” approach
Do the complete computation again.
Takes $n + 1$ RNAfold calls and $\frac{(n+1)^2}{2}$ calls of LocARNA
“clever” approach
Do just the new computations.
Takes 1 RNAfold call and n calls of LocARNA
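The difference between the two approaches can be sketched by counting distance computations. In this illustration `distance_fn` is a hypothetical stand-in for one LocARNA call and `add_sequence` is an assumed helper, not part of the pipeline. The exact number of unordered pairs among n + 1 sequences is n(n + 1)/2, which is of the same order as the figure quoted for the naive approach.

```python
def naive_pair_count(n):
    """Pairwise comparisons needed to recompute everything for n + 1 sequences."""
    return (n + 1) * n // 2


def add_sequence(names, dist, new_name, distance_fn):
    """Extend an existing pairwise distance table by one sequence.

    Only the n distances between the new sequence and the existing ones
    are computed; all previously stored distances are reused.
    """
    for name in names:
        dist[frozenset((name, new_name))] = distance_fn(name, new_name)
    names.append(new_name)
    return names, dist


# Count how often the (hypothetical) distance function is called.
calls = []

def toy_distance(a, b):
    calls.append((a, b))
    return float(abs(len(a) - len(b)))  # placeholder, not a real RNA distance

names = ["GGGAAACCC", "GGAAACC", "GCGAAAGC"]
dist = {}
names, dist = add_sequence(names, dist, "GGGGAAACCCC", toy_distance)
print(len(calls), naive_pair_count(3))
```

With three existing sequences the incremental update needs 3 distance computations, while recomputing all pairs for four sequences needs 6.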
Improvements of RNAsoup
Cluster validation measure over LocARNA distances
Goal: Give the user a measure for the quality of a specific cluster
(possible) Solution: Silhouette-value
Silhouette-value: Measures the compactness of a cluster
For one element of a cluster, the distances to every other element in this cluster should be smaller than the distances to elements of other clusters.
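A minimal sketch of the silhouette value, assuming the standard definition s(i) = (b - a) / max(a, b), where a is the mean distance from element i to the other members of its cluster and b is the smallest mean distance to any other cluster. The slides do not show how RNAsoup computes it, and the distance matrix below is made up for illustration.

```python
def silhouette(i, labels, dist):
    """Silhouette value of element i.

    labels: cluster label per element
    dist:   symmetric matrix (list of lists) of pairwise distances
    """
    own = labels[i]
    same = [j for j, lab in enumerate(labels) if lab == own and j != i]
    if not same:
        return 0.0  # convention for clusters with a single element
    a = sum(dist[i][j] for j in same) / len(same)
    # Mean distance to each other cluster; keep the nearest one.
    b = min(
        sum(dist[i][j] for j in members) / len(members)
        for members in (
            [j for j, lab in enumerate(labels) if lab == other]
            for other in set(labels) - {own}
        )
    )
    if max(a, b) == 0:
        return 0.0
    return (b - a) / max(a, b)


def cluster_silhouette(cluster, labels, dist):
    """Average silhouette value over all elements of one cluster."""
    members = [i for i, lab in enumerate(labels) if lab == cluster]
    return sum(silhouette(i, labels, dist) for i in members) / len(members)


# Two tight clusters (hypothetical distances): within 1, between 9.
labels = [0, 0, 1, 1]
dist = [
    [0, 1, 9, 9],
    [1, 0, 9, 9],
    [9, 9, 0, 1],
    [9, 9, 1, 0],
]
print(cluster_silhouette(0, labels, dist))
```

Values close to 1 indicate a compact, well-separated cluster; values near 0 or below indicate elements that sit as close to another cluster as to their own.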
Cluster validation over sequence identities
“Mean pairwise sequence identity”: Calculate the pairwise identity for all elements of a set of sequences and compute the average.
A measure similar to the “structure conservation index”, but at the sequence level.
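The measure can be sketched as follows. The sketch assumes the sequences are already aligned to equal length and counts identity over all columns in which at least one of the two sequences has a residue; the pipeline may define pairwise identity differently, and the function names are hypothetical.

```python
from itertools import combinations


def pairwise_identity(a, b, gap="-"):
    """Fraction of identical columns between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore columns where both sequences have a gap.
    columns = [(x, y) for x, y in zip(a, b) if not (x == gap and y == gap)]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return matches / len(columns)


def mean_pairwise_identity(sequences):
    """Average pairwise identity over all pairs of aligned sequences."""
    pairs = list(combinations(sequences, 2))
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Toy alignment (hypothetical sequences).
aligned = ["ACGU", "ACGU", "ACGA"]
mpi = mean_pairwise_identity(aligned)
print(round(mpi, 3))
```

A cluster with high structural similarity but low mean pairwise identity is the interesting case, because its members are related by structure rather than by sequence alone.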
Improvements of SoupViewer
Usability improvements
New text highlighting function (4 regular expressions at once).
Export of the tree in Newick format.
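What a Newick export produces can be shown with a small sketch. It assumes the tree is held as nested tuples with string leaves and writes topology only, without branch lengths; SoupViewer's internal tree representation is not described on the slides, and `to_newick` is a hypothetical name.

```python
def to_newick(tree):
    """Serialize a nested-tuple tree to a Newick string (topology only)."""
    def render(node):
        if isinstance(node, tuple):
            # Internal node: comma-separated children in parentheses.
            return "(" + ",".join(render(child) for child in node) + ")"
        # Unquoted Newick labels cannot contain blanks; use underscores.
        return node.replace(" ", "_")
    return render(tree) + ";"


newick = to_newick((("RNA 1", "RNA 2"), ("RNA 3", ("RNA 4", "RNA 5"))))
print(newick)
```

The resulting string can be opened in any tree viewer that reads Newick.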
Manual inspection of cluster quality: Constraint folding of leaves
Fold the sequences of a group using RNAfold -pC with the consensus secondary structure as constraint.
Colorize the positional entropy with relplot.pl.
Visualize how well the sequences fit the consensus secondary structure.
Constraint folding of leaves
Clustering of Rfam 9.1
How to make a large cluster tree
Rfam database
192,445 sequences
1372 families (768 new in 9.1)
clustering of a subset of Rfam 9.0 took about 4 weeks
Why should we cluster annotated sequences?
Helps us to validate the correctness of the cluster pipeline
Especially the newly introduced cluster validation measure
Makes it possible to add new ncRNA sequences with minor effort
Produces a large tree to test methods for reducing the resource demand of SoupViewer
Summary
Structure-based clustering is cool and easy
The cluster pipeline is still in progress
There are a lot of useful new features