News from the cluster pipeline

Jan Engelhardt

Department of Computer Science, Bioinformatics Group, University of Leipzig

October 24, 2009

Outline: Motivation, Clustering pipeline, Improvements, Outlook, Summary

ncRNA candidates

A lot of new genomes

A lot of predicted ncRNAs (RNAz, wet lab)

[Figure: phylogenetic tree of the Bilateria. Protostomes: Nemathelminthes (Nematodes). Deuterostomes: Echinoderms, Hemichordates, and Chordates (Cephalochordates; Urochordates with Larvaceans and Ascidians; Vertebrates, with the 2R WGD marked). Species labels: Ciona intestinalis, Ciona savignyi. Numbers shown at species: Caenorhabditis elegans / Caenorhabditis briggsae ~3600, Oikopleura dioica ~3300, Homo sapiens ~30,000.]

Clustering pipeline: LocARNA, RNAsoup, Structural clustering pipeline

LocARNA - local alignment of RNA (S. Will)

local sequence/structure alignment tool

detects homologous secondary structure motifs

variant of the Sankoff algorithm (very fast)

defines a distance measure

distance can be used to build a cluster tree
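The step from pairwise distances to a cluster tree can be illustrated with a minimal sketch. It assumes average linkage (UPGMA), which is chosen here only for illustration; the slides do not say which linkage the pipeline uses. The function name, the toy labels, and the toy distances are made up and are not LocARNA output.

```python
from itertools import combinations


def average_linkage_tree(labels, dist):
    """Agglomerative clustering with average linkage (UPGMA).

    labels: list of leaf names
    dist:   dict mapping frozenset({a, b}) -> distance between leaves a and b
    Returns the tree as nested tuples, e.g. (("RNA 1", "RNA 2"), "RNA 3").
    """
    # Each active cluster is stored as: subtree -> list of its leaves.
    clusters = {label: [label] for label in labels}

    def cluster_distance(x, y):
        # Average over all leaf pairs between the two clusters.
        pairs = [(a, b) for a in clusters[x] for b in clusters[y]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters into a new inner node.
        x, y = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        merged_leaves = clusters.pop(x) + clusters.pop(y)
        clusters[(x, y)] = merged_leaves
    return next(iter(clusters))


# Toy distances, invented for this example.
toy = {
    frozenset(("RNA 1", "RNA 2")): 1.0,
    frozenset(("RNA 3", "RNA 4")): 2.0,
    frozenset(("RNA 1", "RNA 3")): 8.0,
    frozenset(("RNA 1", "RNA 4")): 9.0,
    frozenset(("RNA 2", "RNA 3")): 8.0,
    frozenset(("RNA 2", "RNA 4")): 9.0,
}
tree = average_linkage_tree(["RNA 1", "RNA 2", "RNA 3", "RNA 4"], toy)
print(tree)
```

The sketch recomputes cluster distances in every round, which is fine for a handful of sequences but not for large trees.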

[Figure: example cluster tree with the leaves RNA 1 to RNA 8]

RNAsoup - Spot grOUPs in RNA cluster-tree (K. Reiche)

decision rule after Duda and Hart

optimal number of clusters

squared error of the minimum free energies

different threshold values
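A minimal sketch of a Duda-Hart style decision for one inner node may make the rule more concrete. It compares the squared error of one cluster with the summed squared error of its two children and splits only if the ratio falls below a critical value. The critical-value formula is the one commonly given for the Duda-Hart test; how RNAsoup applies the rule internally is not stated on the slide, so the function names, the scalar MFE input, and the default quantile `z` are assumptions.

```python
import math


def squared_error(values):
    """Sum of squared deviations from the mean (J_e of one cluster)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def duda_hart_split(left, right, z=1.645):
    """Return True if splitting into `left` and `right` is supported.

    left, right: scalar features of the two child clusters
                 (here: minimum free energies, as named on the slide).
    z: standard-normal quantile acting as threshold (assumed default).
    """
    n = len(left) + len(right)
    je1 = squared_error(left + right)                  # one cluster
    je2 = squared_error(left) + squared_error(right)   # two clusters
    if je1 == 0:
        return False  # all values identical, nothing to split
    d = 1  # dimension of the feature space (scalar values)
    critical = (1 - 2 / (math.pi * d)
                - z * math.sqrt(2 * (1 - 8 / (math.pi ** 2 * d)) / (n * d)))
    # For very small n the critical value is <= 0, so no split is accepted.
    return je2 / je1 < critical


# Invented example values (kcal/mol), two clearly separated groups.
group_a = [-30.0, -31.0, -29.0, -30.5, -29.5]
group_b = [-10.0, -11.0, -9.0, -10.5, -9.5]
print(duda_hart_split(group_a, group_b))
```

Varying `z` corresponds to trying different threshold values: a larger `z` makes the rule more reluctant to split.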

[Figure: the example cluster tree with the leaves RNA 1 to RNA 8]

LocARNA - RNAsoup - SoupViewer

[Diagram: pipeline flowchart with the elements predicted ncRNAs, annotated ncRNAs, RNAfold, LocARNA, RNAclust, RNAsoup, and SoupViewer (manual)]

Pipeline

1. Start: a set of RNA sequences.
2. Calculate pairwise distances using LocARNA.
3. Build a hierarchical cluster tree.
4. Get proposals for partitions from RNAsoup.
5. Examine your RNA tree using SoupViewer.

Improvements: RNAclust, RNAsoup, SoupViewer

Several updates

[Diagram: the pipeline flowchart (predicted and annotated ncRNAs, RNAfold, LocARNA, RNAclust, RNAsoup, SoupViewer (manual))]

adapted to the new version of LocARNA

RNAsoup is embedded in RNAclust (--rnasoup)

removed some leftover code lines

small computational improvements

Adding new sequences to an existing cluster tree

Assumption

You have an RNA cluster tree (n sequences).

You want to add an additional sequence.

“naive” approach

Do the complete computation again.

Takes n + 1 RNAfold calls and (n+1)^2 / 2 calls of LocARNA

“clever” approach

Do just the new computations.

Takes 1 RNAfold call and n calls of LocARNA
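The difference between the two approaches can be made concrete with a small sketch. The call counts follow the slide, except that the exact number of unordered pairs among n + 1 sequences, n(n+1)/2, is used, which the slide's (n+1)^2 / 2 approximates. The function names and the `pairwise_distance` callable are hypothetical; the callable stands in for one LocARNA alignment.

```python
def naive_cost(n):
    """Calls needed to recompute everything for n + 1 sequences."""
    total = n + 1
    return {"RNAfold": total, "LocARNA": total * (total - 1) // 2}


def incremental_cost(n):
    """Calls needed when only the new sequence is processed."""
    return {"RNAfold": 1, "LocARNA": n}


def add_sequence(dist, labels, new_label, pairwise_distance):
    """Extend a distance table by one sequence, reusing all old entries.

    dist:              dict mapping frozenset({a, b}) -> distance
    labels:            list of names already in the table (extended in place)
    pairwise_distance: callable (a, b) -> distance, a hypothetical
                       stand-in for one LocARNA alignment
    """
    for old in labels:
        dist[frozenset((old, new_label))] = pairwise_distance(old, new_label)
    labels.append(new_label)
    return dist


print(naive_cost(1000))
print(incremental_cost(1000))
```

In this sketch the tree would then be rebuilt from the extended table, which needs no further alignments.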

Improvements of RNAsoup

Cluster validation measure over LocARNA distances

Goal: give the user a measure for the quality of a specific cluster.

(Possible) solution: silhouette value

Silhouette value: measures how compact a cluster is and how well it is separated from the other clusters

For one element of a cluster, the distances to every other element in this cluster should be smaller than the distances to elements of other clusters.
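The silhouette value described above can be sketched directly over a distance function. This is the standard definition, s(i) = (b - a) / max(a, b), where a is the mean distance of element i to the other members of its own cluster and b is the smallest mean distance to any other cluster. The function name and the toy data are made up; in the pipeline the distances would come from LocARNA.

```python
def silhouette_values(dist, clusters):
    """Silhouette value s(i) for every element.

    dist:     callable (a, b) -> distance, e.g. a lookup in a distance table
    clusters: list of lists of element names (at least two clusters)
    """
    if len(clusters) < 2:
        raise ValueError("silhouette values need at least two clusters")
    values = {}
    for own in clusters:
        for i in own:
            if len(own) == 1:
                values[i] = 0.0  # common convention for singleton clusters
                continue
            # a: mean distance to the other members of the own cluster
            a = sum(dist(i, j) for j in own if j != i) / (len(own) - 1)
            # b: smallest mean distance to the members of another cluster
            b = min(
                sum(dist(i, j) for j in other) / len(other)
                for other in clusters if other is not own
            )
            denom = max(a, b)
            values[i] = (b - a) / denom if denom > 0 else 0.0
    return values


# Toy example: four points on a line, invented for illustration.
positions = {"a": 0, "b": 1, "c": 10, "d": 11}
scores = silhouette_values(
    lambda x, y: abs(positions[x] - positions[y]),
    [["a", "b"], ["c", "d"]],
)
print(scores)
```

Values close to 1 indicate an element that sits well inside its cluster; values near 0 or below indicate an element that lies between clusters or in the wrong one.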

Cluster validation over sequence identities

“Mean pairwise sequence identity”: calculate the pairwise identity for all pairs in a set of sequences and compute the average. A measure similar to the “structure conservation index”, but at the sequence level.
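A minimal sketch of the mean pairwise sequence identity, assuming the sequences are already aligned to equal length. Identity definitions differ in how they treat gaps; here columns in which both sequences have a gap are ignored, which is one common choice and not necessarily the one used in the pipeline. The function names are made up.

```python
from itertools import combinations


def pairwise_identity(a, b, gap="-"):
    """Fraction of identical columns, ignoring columns where both rows are gaps."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if not (x == gap and y == gap)]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return matches / len(columns)


def mean_pairwise_identity(aligned):
    """Average of pairwise_identity over all pairs (needs >= 2 sequences)."""
    pairs = list(combinations(aligned, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Invented toy alignment.
mpi = mean_pairwise_identity(["GGCA", "GGCA", "GGUA"])
print(round(mpi, 3))
```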

Improvements of SoupViewer

Usability improvements

New text highlighting function (4 regular expressions at once).

Export of the tree in Newick format.
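For readers unfamiliar with the Newick format, here is a minimal sketch of how a tree could be serialized. It assumes the tree is held as nested tuples with leaf names as strings and writes the topology only, without branch lengths. This data layout and the function name are assumptions for illustration and not SoupViewer's internal representation.

```python
def to_newick(tree):
    """Serialize a nested-tuple tree to a Newick string (topology only)."""
    def render(node):
        if isinstance(node, tuple):
            return "(" + ",".join(render(child) for child in node) + ")"
        # Unquoted Newick labels must not contain blanks;
        # underscores are the usual replacement.
        return str(node).replace(" ", "_")
    return render(tree) + ";"


newick = to_newick((("RNA 1", "RNA 2"), "RNA 3"))
print(newick)
```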

Manual inspection of cluster quality: Constraint folding of leaves

Fold the sequences of a group using RNAfold -pC with the consensus secondary structure as constraint.

Colorize the positional entropy with relplot.pl

Visualizes how well each sequence fits the consensus secondary structure.
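A consensus structure refers to alignment columns, so it has to be mapped onto each individual sequence before it can serve as a folding constraint. The sketch below shows one way to do that: gap columns are dropped, and base pairs with a gap on either side are released. The function names are hypothetical and this is not taken from SoupViewer. The three-line input layout (header, sequence, constraint) follows what current ViennaRNA releases accept for `RNAfold -p -C`; whether the version used in the talk reads the same layout is an assumption.

```python
def project_consensus(aligned_seq, consensus, gap="-"):
    """Map an alignment-level consensus structure onto one gapped sequence.

    Returns (ungapped sequence, dot-bracket constraint of the same length).
    Base pairs with a gap at either position are turned into '.'.
    """
    if len(aligned_seq) != len(consensus):
        raise ValueError("sequence and structure must have equal length")
    structure = list(consensus)
    stack = []
    for j, symbol in enumerate(consensus):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            i = stack.pop()  # partner of the closing bracket at column j
            if aligned_seq[i] == gap or aligned_seq[j] == gap:
                structure[i] = structure[j] = "."
    keep = [k for k, base in enumerate(aligned_seq) if base != gap]
    seq = "".join(aligned_seq[k] for k in keep)
    constraint = "".join(structure[k] for k in keep)
    return seq, constraint


def rnafold_input(name, aligned_seq, consensus):
    """Build a FASTA-like record: header, sequence, constraint line."""
    seq, constraint = project_consensus(aligned_seq, consensus)
    return f">{name}\n{seq}\n{constraint}\n"


# Invented toy alignment row and consensus structure.
print(rnafold_input("leaf_1", "G-GAAAACCC", "(((....)))"))
```

The resulting record could be piped to `RNAfold -p -C`, and the structure and dot plot files it writes could then be passed to `relplot.pl` to color the structure drawing by positional entropy.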

Constraint folding of leaves

Clustering of Rfam 9.1

How to make a large cluster tree

Rfam database

192,445 sequences

1372 families (768 new in 9.1)

clustering of a subset of Rfam 9.0 took about 4 weeks

Why should we cluster annotated sequences?

Helps us to validate the correctness of the cluster pipeline

Especially the newly introduced cluster validation measure

Makes it possible to add new ncRNA sequences with minor effort

Produces a large tree to test methods for reducing the resource demand of SoupViewer

Summary

Structure-based clustering is cool and easy

The cluster pipeline is still work in progress

There are a lot of new useful features

Acknowledgements

Kristin Reiche, Jörg Hackermüller

Peter Stadler, Sebastian Will, Steffen Heyne, Rolf Backofen
