Advancing Science with DNA Sequence Sequence Clustering MGM Workshop September 26, 2011 Reducing Search Space in Protein and DNA/RNA Sequence Analysis Denis Kaznadzey, GBP


Page 1:

Advancing Science with DNA Sequence

Sequence Clustering

MGM Workshop, September 26, 2011

Reducing Search Space in Protein and DNA/RNA Sequence Analysis

Denis Kaznadzey, GBP

Page 2:

Sequence clustering

To deal with a huge variety of individual ‘objects’:

- Classify into groups of essentially similar objects

- When new data arrives, assign objects to existing groups

- Classify ‘leftovers’

- Occasionally review entire classification

Problem: What is ‘essentially similar’?

• Finding properties that are important (ontological relevancy)

• Does classification reflect reality in any way?

Page 3:

Sequence clustering

Taxonomical Classification (Carl Linnaeus)

vs.

Continuity of Great Chain of Being (Georges Buffon)

Even if reductionist, classification is a tool to study the world, and biology in particular.

When data is incomplete, any classification is a convention. At the same time, it is an approximation of a “reality”.

Page 4:

Sequence clustering

In modern biology, the most abundant type of data is sequence:

• Genomic DNA

• RNA (through RNA-Seq)

• Derived proteins

The primary feature is primary structure, but classification criteria depend on the application.

Page 5:

Sequence Clustering

Select Applications in Genomic Sciences:

Genome Assembly: Binning, Scaffolding

Transcriptomics: EST (read) clustering

Protein Function and Evolution studies: Protein families

Phylogenetic profiling: OTUs

Page 6:

Sequence Clustering

In Metagenomics:

Primary tasks:

• Assess diversity

• Find genes

• Predict functions

• Predict pathways

• Estimate capabilities

Based on sequence comparison.

Page 7:

Sequence Clustering

- Any Clustering is based on the Distance in some Metric.

- Initial clustering is based on pair-wise distances.

- Subsequent classification is based on distances from object to clusters:

  - Representative

  - Set of representatives (all members at the extreme)

  - Other measure, which may be unrelated to the initial one.
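As a minimal sketch of the object-to-cluster options listed above: the toy metric (Hamming distance on equal-length strings), the use of the minimum over members, and all function names are assumptions made for illustration, not something taken from the slides.

```python
from typing import Callable, Sequence

Distance = Callable[[str, str], float]


def hamming(a: str, b: str) -> float:
    """Toy metric: number of mismatching positions (equal-length strings)."""
    if len(a) != len(b):
        raise ValueError("hamming() needs equal-length sequences")
    return float(sum(x != y for x, y in zip(a, b)))


def dist_to_representative(obj: str, representative: str, d: Distance) -> float:
    """Option 1: compare against a single representative of the cluster."""
    return d(obj, representative)


def dist_to_members(obj: str, members: Sequence[str], d: Distance) -> float:
    """Option 2: compare against a set of representatives.

    Passing every member is the extreme case. Taking the minimum is one
    possible choice (single-linkage style), not the only one.
    """
    return min(d(obj, m) for m in members)


def assign(obj: str, clusters: dict[str, list[str]], d: Distance) -> str:
    """Return the id of the cluster closest to obj (all members compared)."""
    return min(clusters, key=lambda cid: dist_to_members(obj, clusters[cid], d))


clusters = {"c1": ["ACGT", "ACGA"], "c2": ["TTTT", "TTTA"]}
print(assign("ACGG", clusters, hamming))  # c1
print(assign("TTAA", clusters, hamming))  # c2
```

Comparing against one representative costs one distance computation per cluster; comparing against all members costs one per sequence, which is the trade-off the list above hints at.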

Page 8:

Sequence Clustering

Once a distance measure is chosen and the distances are obtained / computed:

• There is a HUGE variety of clustering methods (clustering / classification is a very elaborate methodology)

  • K-means, average linkage, complete linkage, single linkage, iterative, SOM, etc.

• However, options for large-volume clustering are limited by the performance of the algorithms.

• Single-linkage can be computed very efficiently

• (The method for pledging new sequences to clusters may be more computationally intensive)

Page 9:

Sequence clustering

Most efficient clustering: transitive-closure based.

• Requires ‘boolean’ distances (two sequences can be linked or not linked)

• Requires number of nodes to be known

• Space ~ NodesNo

• Run-time (worst) ~ EdgesNo * AveClustSize

• Run-time (average) ~ EdgesNo * log2(AveClustSize)

Page 10:

Sequence clustering

Practical Transitive Closure algorithm:

Allocate array of sequence numbers A[0..N], initialized so that A[i] = i

Phase I: connect linked vertices through vertex of smallest index

    For each edge (m, n):
        While A[n] != n:
            n = A[n]
        While A[m] != m:
            m = A[m]
        A[max(m, n)] = min(m, n)

Phase II: propagate smallest indices as cluster identifiers

    For each n from 0 to N:
        If A[n] != A[A[n]]:
            A[n] = A[A[n]]

Phase III: collect clusters (implementation dependent)

    Count number of distinct cluster “id”s => M (1 pass)
    Allocate array of sizes; count size of each cluster (1 pass)
    Allocate array of clusters; fill it in (1 pass)

Example (state of A after each step):

    Initial:   0 1 2 3 4 5 6
    +(1,3):    0 1 2 1 4 5 6
    +(5,6):    0 1 2 1 4 5 5
    +(6,1):    0 1 2 1 4 1 5
    Phase II:  0 1 2 1 4 1 1

Clusters: (0); (1,3,5,6); (2); (4)
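The three phases can be sketched in Python as follows. This is one possible rendering, not the author's implementation: the function name is hypothetical, the array is assumed to start as A[i] = i, and Phase III uses a dictionary instead of the three counting passes described on the slide.

```python
from collections import defaultdict


def transitive_closure_clusters(n_nodes, edges):
    """Cluster nodes 0..n_nodes-1 linked by (m, n) edges."""
    a = list(range(n_nodes))  # A[i] = i: every node starts as its own cluster

    # Phase I: link the two roots, the larger index pointing to the smaller.
    for m, n in edges:
        while a[n] != n:
            n = a[n]
        while a[m] != m:
            m = a[m]
        a[max(m, n)] = min(m, n)

    # Phase II: parents always have smaller indices, so one ascending pass
    # leaves every node pointing directly at its root (the cluster id).
    for i in range(n_nodes):
        a[i] = a[a[i]]

    # Phase III: collect members per cluster id.
    clusters = defaultdict(list)
    for node, root in enumerate(a):
        clusters[root].append(node)
    return a, list(clusters.values())


labels, clusters = transitive_closure_clusters(7, [(1, 3), (5, 6), (6, 1)])
print(labels)    # [0, 1, 2, 1, 4, 1, 1]
print(clusters)  # [[0], [1, 3, 5, 6], [2], [4]]
```

The call reproduces the seven-node example from the slide, with edges (1,3), (5,6) and (6,1).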

Page 11:

Sequence clustering

Computing ‘boolean’ distances:

• Threshold-based

• Additional rules (match arrangement)

Example: read/EST clustering by % identity + length + arrangement
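A hedged sketch of such a boolean link test for read/EST clustering is shown here. The slide does not give concrete rules, so the `Hit` record, the threshold values, and the arrangement rule (the alignment must reach an end of at least one read on each side) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One pairwise alignment between reads a and b (0-based, end-exclusive)."""
    identity: float   # percent identity over the aligned region
    a_start: int
    a_end: int
    a_len: int
    b_start: int
    b_end: int
    b_len: int


def is_linked(hit, min_identity=95.0, min_overlap=40, max_overhang=5):
    """Return True if the two reads should be linked (boolean 'distance')."""
    overlap = min(hit.a_end - hit.a_start, hit.b_end - hit.b_start)
    if hit.identity < min_identity or overlap < min_overlap:
        return False
    # Arrangement rule (assumed): on each side of the alignment, at least one
    # read must end within max_overhang bases; otherwise the match is internal
    # to both reads and looks like a shared repeat or a chimera.
    left_ok = min(hit.a_start, hit.b_start) <= max_overhang
    right_ok = min(hit.a_len - hit.a_end, hit.b_len - hit.b_end) <= max_overhang
    return left_ok and right_ok


dovetail = Hit(98.0, 60, 100, 100, 0, 40, 100)   # end of a overlaps start of b
internal = Hit(98.0, 30, 70, 100, 30, 70, 100)   # match in the middle of both
print(is_linked(dovetail))  # True
print(is_linked(internal))  # False
```

Both example hits pass the identity and length thresholds; only the arrangement rule separates them.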

Page 12:

Computing similarity measure:

- Edit distance or (ungapped) statistics P-value: BLAST, FASTA, needle, water, etc.

- Adjusted edit distance through progressive alignment: Clustal, MUSCLE, T-Coffee

- K-mer statistics: CD-HIT, USEARCH, MUSCLE

- Suffix trees (and probabilistic suffix trees): MUMmer, REPuter, CLUSEQ

- Suffix arrays: Bowtie, BWT

- Position-specific scoring matrix: PSI-BLAST, IMPALA

- Hidden Markov models: HMMER, HHsearch/HHpred, SAM
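To illustrate the k-mer family of measures, here is a minimal sketch that scores two sequences by the Jaccard index of their k-mer sets. This is a generic stand-in chosen for brevity; it is not the statistic actually used by CD-HIT, USEARCH, or MUSCLE, and the function names are hypothetical.

```python
def kmers(seq, k):
    """Set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(a, b, k=3):
    """Shared k-mers divided by all distinct k-mers of the two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


print(kmer_jaccard("ACGTACGT", "ACGTACGA"))  # 0.8
print(kmer_jaccard("ACGTACGT", "TTTTTTTT"))  # 0.0
```

No alignment is computed, which is why measures of this kind scale to large data sets.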

Page 13:

Sequence clustering

Distance computing is harder than clustering. (Bundled solutions: BLASTClust, CD-HIT, UCLUST, CLUSEQ)

- For large data sets, only k-mer and suffix array measures are practical.

- However, incremental / greedy approaches can be used to avoid computing the entire distance matrix. This makes the use of sensitive similarity measures possible.

- For boolean distance, iterative similarity detection is possible: fast binning -> slow comparing. (No off-the-shelf implementations(?))
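The incremental / greedy idea can be sketched as follows: each sequence is compared only against the current cluster representatives, so the full all-vs-all matrix is never computed. The longest-first ordering and the toy `shares_prefix` test are assumptions for illustration; any slower, more sensitive comparison could be passed in instead.

```python
def greedy_cluster(seqs, is_similar):
    """Assign each sequence to the first matching representative.

    seqs: iterable of sequences; is_similar(a, b) -> bool is any (possibly
    slow, sensitive) comparison. Only sequence-vs-representative pairs are
    compared, never the full all-vs-all matrix.
    """
    representatives = []   # one representative per cluster
    clusters = []          # clusters[i] holds the members of cluster i
    # Longest first, so that each representative covers its members
    # (a common convention, assumed here).
    for seq in sorted(seqs, key=len, reverse=True):
        for i, rep in enumerate(representatives):
            if is_similar(seq, rep):
                clusters[i].append(seq)
                break
        else:
            representatives.append(seq)
            clusters.append([seq])
    return clusters


def shares_prefix(a, b, n=4):
    """Toy stand-in for a real similarity test."""
    return a[:n] == b[:n]


reads = ["ACGTAA", "ACGTA", "TTTTGG", "ACGTAAC", "TTTTG"]
print(greedy_cluster(reads, shares_prefix))
# [['ACGTAAC', 'ACGTAA', 'ACGTA'], ['TTTTGG', 'TTTTG']]
```

The number of comparisons is bounded by sequences times clusters rather than sequences squared, at the price of a result that depends on input order.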

Page 14:

Sequence clustering

Boolean distance clustering killer:

CLUSTER AGGREGATION.

In large clusters, even a small number of random links leads to huge conglomerates.

Page 15:

Common causes:

1) Contamination with standard constructs

2) Repeats

3) Chimeras

4) Spurious similarities (low complexity zones, etc.)

Page 16:

Sequence clustering

Fighting aggregation

- Vector / adapter trimming:

  - Lucy, Figaro, etc. Integrated in many assembly suites (Newbler, Velvet, AMOS, CLCbio, etc.)

- Low complexity detection / masking:

  - SEG, DUST, FastQC, WindowMasker, etc.; often integrated in search tools.
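As a rough illustration of low complexity masking, the sketch below masks sliding windows whose Shannon entropy falls under a threshold. This is a simplified stand-in, not the SEG or DUST algorithm; the window size, threshold, and function names are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits per symbol of the letter composition of window."""
    counts = Counter(window)
    total = len(window)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=1.0, mask_char="N"):
    """Replace every window whose entropy is below min_entropy by mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            masked[start:start + window] = mask_char * window
    return "".join(masked)


seq = "ACGTTGCAGTCA" + "A" * 16 + "GCTAGTCAGTAC"
print(mask_low_complexity(seq))
```

Because whole windows are masked, a few bases flanking the poly-A run are masked as well; real tools trim masked regions more carefully.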

Page 17:

Sequence clustering

- Repeat detection / masking:

  - Regular (tandem) repeats:

    - Pre-search masking: based on structure (IMEx, SRF) or on database (TRDB)

    - Post-search detection based on similarity properties (multiple parallel threads)

  - Irregular (long) repeats:

    - Database-based: RepeatMasker

    - De novo: RepeatScout, orrb, PILER, etc. Require genome as input, construct database.

Page 18:

Sequence clustering

Detecting chimeric sequences:

• Abundance-based: Perseus, UCHIME

  • Chimeras undergo fewer amplification cycles, so chimera segments in native arrangement are more frequent.

• Specific to 16S: ChimeraSlayer, Bellerophon

  • Chimera ‘arms’ are closer to originating phyla than the entire chimera

Page 19:

Sequence clustering

Detecting chimeric sequences:

• Similarity coverage based: MIRA assembler

Page 20:

Sequence clustering

Detecting chimeric sequences:

• Similarity graph topology based: dchim

Figure panels: alignment view; connectivity view

Page 21:

Protein Clusters: various criteria

- Primary structure similarity

- Close evolutionary relationship

- Similarity in physical properties

- 3-D structure similarity

- Similar fold arrangement

- Domain structure similarity

- Common or similar functions

- etc.

Page 22:

Sequence clustering

Functional and structural classifications in IMG

Page 23:

Sequence clustering

Direct similarity measure by edit distance is not sensitive enough for evolutionarily distant species.

Position-specific scoring matrices and profile HMMs provide better sensitivity, but are MUCH SLOWER.

For individual genomes (10^3 to 5×10^4 proteins), they could be used with massively parallel computations (while the number of genomes is within thousands).

For metagenomes, they cannot be used with foreseeable computing resources.

Page 24:

Sequence clustering

Functional annotation of metagenome genes through protein clusters (under development):

- Build set of functionally homogeneous clusters of similar proteins – for annotated genomes

- Build HMMs for each cluster, compose model database

- Pledge metagenome proteins to clusters by matching to models

- Cluster unpledged proteins, build models, update model database.

- Balance model database by creating model tree: aggregating small related clusters and dissecting large ones.

- Perform hierarchical searches through profiles tree.

Page 25:

Sequence clustering

Clustering reduces search space, but adds another level of indirection, which is a source of errors, and complexity, which consumes effort.

Clustering improves only searches within the parameter space used for clustering (structure-based clusters are not useful for searching for a certain codon usage, etc.)

Page 26:

However, for proteins, which form dense relationship networks, clustering is a great tool.

Page 27:

Thank you!