Introducing MMseqs MARTIN STEINEGGER GENE CENTER MUNICH

MMseqs NGS 2014

DESCRIPTION

MMseqs (Many-against-Many sequence searching) is a novel software suite for very fast protein sequence searches and clustering of huge protein sequence data sets, such as sets of predicted protein sequences or 6-frame-translated open reading frames (ORFs) from large metagenomics experiments. MMseqs is around 1000 times faster than protein BLAST and sensitive enough to capture similarities down to less than 30% sequence identity. At the core of MMseqs are two modules for comparing two sequence sets with each other. The first, the prefiltering module, computes the similarities between all sequences in one set and all sequences in the other based on a very fast and sensitive alignment-free metric, the sum of scores of similar 7-mers. The second module implements an AVX2-accelerated Smith-Waterman alignment of all sequence pairs that pass a score cut-off in the first module. Due to this combination of speed and sensitivity, searches of all predicted ORFs in large metagenomics data sets through the entire UniProt or NCBI-NR databases will be feasible. This could allow many reads that are too diverged to be mapped by current software to be assigned to functional clusters and taxonomic clades. MMseqs' third module can also cluster sequence sets efficiently, based on the similarity graph obtained by comparing the sequence set with itself in modules 1 and 2. MMseqs further supports an updating mode in which sequences can be added to an existing clustering with stable cluster identifiers and without the need to recluster the entire sequence set. MMseqs will therefore be used to offer high-quality clustered versions of the UniProt database down to a 30% sequence similarity threshold.
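To make the prefiltering idea concrete, here is a minimal Python sketch of "sum of scores of similar k-mers". It is not the MMseqs implementation: the residue groups, the scoring values in `sub_score`, the threshold and all function names are hypothetical stand-ins for a real substitution matrix and the tool's actual data structures, and the demo uses k = 3 instead of 7 to stay small.

```python
from collections import defaultdict
from itertools import product

# Hypothetical toy similarity groups; a real tool would use a substitution matrix.
GROUPS = ["KR", "DE", "ILVM", "FYW", "ST", "NQ"]
GROUP_OF = {aa: group for group in GROUPS for aa in group}

def sub_score(a, b):
    """Toy substitution score: identical > same group > unrelated."""
    if a == b:
        return 4
    if GROUP_OF.get(a) is not None and GROUP_OF.get(a) == GROUP_OF.get(b):
        return 1
    return -2

def similar_kmers(kmer, min_score):
    """Return {similar k-mer: score} for all k-mers scoring >= min_score.

    Only substitutions within the same toy group are enumerated.
    """
    options = [GROUP_OF.get(aa, aa) for aa in kmer]
    result = {}
    for candidate in product(*options):
        score = sum(sub_score(a, b) for a, b in zip(kmer, candidate))
        if score >= min_score:
            result["".join(candidate)] = score
    return result

def build_index(db, k):
    """Index table: k-mer -> set of database sequence ids containing it."""
    index = defaultdict(set)
    for seq_id, seq in db.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index

def prefilter(query, index, k, min_score):
    """Sum similar-k-mer scores per database sequence, best first."""
    totals = defaultdict(int)
    for i in range(len(query) - k + 1):
        for kmer, score in similar_kmers(query[i:i + k], min_score).items():
            for seq_id in index.get(kmer, ()):
                totals[seq_id] += score
    return sorted(totals.items(), key=lambda item: -item[1])

# Tiny demo with made-up sequences.
db = {"a": "MHWVRQ", "b": "MHWVKQ", "c": "GGGGGG"}
index = build_index(db, k=3)
hits = prefilter("MHWVRQ", index, k=3, min_score=9)
print(hits)
```

Only sequences that accumulate a high enough summed score would be passed on to the Smith-Waterman stage; the cut-off itself is not modelled here.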

Citation preview

  • 1. Introducing MMseqs MARTIN STEINEGGER GENE CENTER MUNICH

  • 2. Motivation
    Pipeline: sequence genome, gene prediction, map to protein / organism.
    Search reads against UniProt: 7 lanes × 200M reads ≈ 7 × 200M seqs of 50 amino acids (1.4 × 10^9) against UniProt (5 × 10^7 protein seqs).
    BLAST: ~40 000 days (16 cores)
    MMseqs: ~40 days (16 cores)

  • 3. Growth of the UniProtKB/TrEMBL Protein Sequence Database

  • 4. Result Protein Search
    Search of 7 616 proteins against UniProt (54 790 250 sequences).

    Method       Build & read index   Search time    Speed-up factor
    MMseqs s=4   1h 17m               6m             950x
    MMseqs s=7   1h 17m               11m            518x
    swipe        36m                  2d 5h 34m      1.8x
    BLAST        36m                  3d 23h 01m     1x
    ublast       1h 52m               46m            127x
    RAPsearch    2h 11m               10h 56m        9.5x

  • 5. ROC5
    Hit lists (database scores per query):
    query 1: db 50, db 48
    query 2: db 55, db 43
    query 3: db 65, db 63, db 62, db 59, db 56
    query 4: db 100, db 99
    ROC5 value per query: query 4: 0.2, query 1: 0.4, query 3: 0.6, query 2: 1.0
    [Figure: per-query ROC curves (TP vs. FP) and the summary plot of fraction of queries vs. ROC5; AUC 0.6]

  • 6. Result Protein Search
    Benchmark: search; SCOP25 UniProtKB 283 406; SCOP25 7 616.
    True positive: same SCOP superfamily.
    False positive: different SCOP fold.
    Ignored: same fold, different superfamily.
    [Figure: fraction of queries vs. ROC5 for each method; curves labelled with speed-up factors 950x, 518x, 127x, 9.5x, 1.8x, 1x]

  • 7. Workflow Protein Search
    Prefiltering: search space 10^8 × 10^8, ~7 days for UniProt (5.4 × 10^7).
    Alignment: search space 10^8 × 10^2, ~2 days.
    [Figure: queries 1 to n are matched against the database; prefiltering yields per-query candidate lists such as "query 1: db 5: (123) db 23: (68) db 2: (32) ...", and alignment yields scored hits such as "query 1: hit 1: 123, hit 2: 68, hit 3: 32"]

  • 8. Filtering Sequences with k-mers
    [Figure: k-mer match plots of Sequence 1 vs. Sequence 2 for homologous proteins and for unrelated proteins]

  • 9. Filtering Sequences with k-mers
    Exact matches of length 3
    Similar matches of length 6
    2014/5/8 MARTIN STEINEGGER

  • 10. Filtering Sequences with k-mers
    Exact 3-mer matches vs. similar 6-mer matches.
    Information is power: k-mers as long as possible.
    But we need inexact matches to keep sensitivity high.

    k-mer setting        Prob. of chance k-mer match   Prob. of homologous match at 25% sequence identity
    3-mer, exact         1.2 × 10^-3                   1/64
    5-mer, exact         3 × 10^-7                     1/1024
    5-mer, 25 similar    7.5 × 10^-6                   1/40
    6-mer, 100 similar   1.5 × 10^-6                   1/40
    7-mer, 400 similar   3 × 10^-7                     1/40

    Chance match probability: keep low for high speed!
    Homologous match probability: keep high for high sensitivity!

  • 11. Prefiltering Algorithm
    Most critical part of MMseqs regarding speed and memory consumption.
    Calculates similarity scores on multiple CPUs.
    Computationally intense parts are vectorized.
    [Figure: each query (e.g. "..LGTMHWVRQA..") is turned into a list of similar k-mers with scores (MHWVRQ 42, MHWVKQ 34, MHWVRE 34, ...); each k-mer (AAAAAA, AAAAAR, ..., MHWVRE, ..., XXXXXX) is looked up in the index table of the database, which lists the sequence ids containing it (5351, 43314, 2314, ...); the scores are summed per database sequence, giving the result of query 1: db 5351: (123), db 2314: (68), db 2: (62)]

  • 12. Z-scores correct for background k-mer matches
    [Formula lost in extraction. Legend: summed k-mer match score of the query with the target protein; expected score from background matches, with a parameter from a calibration run; number of expected chance k-mer matches; matches are Poisson distributed]

  • 13. Fast Smith-Waterman alignment using SSE2
    Uses Michael Farrar's version of the Smith-Waterman algorithm to align prefiltering outputs.
    [Figure: prefiltering result (queries 1 to n) is turned into the alignment result]

  • 14. Multi core parallelization over query sequence
    Thread-level parallelization with OpenMP.
    Splits the query database into packages and matches them against the database set.
    [Figure: one node with four cores (Core 1 to Core 4); query packages of sequences 0 - 25.000, 25.001 - 50.000, 50.001 - 75.000 and 75.001 - 100.000 are each matched against the database set, and each produces per-query hit lists such as "query 1: db 5: (123) db 23: (68) db 2: (32) ..." that are collected in the result]

  • 15. Multi node parallelization over database sequence
    From top to bottom:
    1. Message Passing Interface
    2. Thread Level Parallelism
    3. Data Level Parallelism
    [Figure: the queries are sent to Node 1 (DB seq. 0 - 100.000), Node 2 (DB seq. 100.001 - 200.000) and Node 3 (DB seq. 200.001 - 300.000); the per-node results are aggregated]

  • 16. Why Sequence Clustering
    [Figure: sequences (e.g. GLTRETVSR) grouped into clusters]

  • 17. Workflow of MMseqs
    Prefiltering, alignment, clustering.
    [Figure: queries 1 to n are matched against the database; hits are aligned and then clustered]

  • 18. Clustering
    Clustering with greedy set cover.
    Linear time and space greedy set cover algorithm to cluster results.
    [Figure: query set and database set, alignment result, clustering result]

  • 19. Cascaded Clustering
    Successive clustering steps at 90%, 50% and 20% sequence identity, each consisting of prefiltering, alignment and clustering.
    [Figure: speed, sensitivity and amount of data to cluster across the cascade]

  • 20. Updating
    We created an updating mechanism that is able to detect changes and update the current database. We also guarantee stable cluster identifiers.
    Update = old result + (new against new) + (new against old), handling new, old and deleted sequences.
    [Cost comparison, subscripts lost in extraction: Updating: N N; Reclustering: N N]

  • 21. Clustering Results
    Benchmark: cluster; SCOP25 UniProtKB 283 406; SCOP25 7 616.

    Method                   Clusters   Corrupted clusters   Seq. per cluster   Time
    MMseqs s=4 naive clust   85 780     3.4                  3.4                4m 03s
    MMseqs s=4 set cover     60 915     1                    4.7                4m 02s
    MMseqs cascaded s=4      41 173     3                    7.0                3m 35s
    MMseqs s=7               29 801     2                    9.7                9m 26s
    MMseqs cascaded s=7      22 541     1                    12.9               5m 07s
    blastclust               21 890     1                    13.3               7h 25m 01s
    CD-HIT                   114 386    260                  2.5                1h 25m 01s
    kClust                   91 681     1                    3.2                9m 57s
    Usearch                  157 981    11                   1.8                45s

  • 22. Summary
    - BLAST-like searches at up to 1000x speed
    - Application on metagenomics datasets
    - Copes with huge sequence data amounts
    - Clustering large protein seq data sets with best sensitivity/speed
    Outlook
    - More sensitive core algorithm
    - Profile searches => boosts sensitivity at same speed
    - Applications in metagenomics, e.g. gut microbiomes for medical research, soil for agriculture etc.
    - Nucleotide sequence version to be tested

  • 23. Thanks
    Maria Hauser: Development, Gene Center Munich, Ludwig-Maximilians-Universität
    Johannes Söding: PI, Max Planck Institute Göttingen
    Justas Dapkunas: Betatest, Institute of Biotechnology, Vilnius University
    Klaus Faidt: Betatest, Max Planck Institute Tübingen
    Borisas Bursteinas: Betatest, EBI: UniProt development
    Andreas Hauser: FFindex

  • 24. Thank you for your time.
    Discussion

  • 25. Backup
    2014/5/8 MILOT MIRDITA

  • 26. Result Protein Search
    [Figure: TP vs. FP]

  • 27. ROC5
    Hit lists (database scores per query):
    query 1: db 50, db 48
    query 2: db 55, db 43
    query 3: db 65, db 63, db 62, db 59, db 56
    query 4: db 100, db 99
    ROC over all queries (merged ranking): db 100, db 99, db 65, db 63, db 62, db 59, db 56, db 55, db 50, db 48, db 43
    ROC5 value per query: query 4: 0.2, query 1: 0.4, query 3: 0.6, query 2: 1.0
    Note: query 3 contributes [fraction missing] of the scores; query 4 contributes all highest scores.
    [Figure: per-query ROC curves (TP vs. FP) and the summary plot of fraction of queries vs. ROC5; AUC 0.6]
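The slides report ROC5 values per query and plot the fraction of queries against ROC5, but do not spell out the formula. The sketch below assumes the commonly used ROCn definition (area under the ROC curve up to the n-th false positive, normalised to 1). The function names, the input layout and the handling of hit lists with fewer than n false positives are assumptions made for illustration, not taken from the talk.

```python
def roc_n(ranked_is_tp, total_tp, n=5):
    """ROCn for one query.

    ranked_is_tp: hits sorted by decreasing score, True for a true positive
                  and False for a false positive (ignored hits removed).
    total_tp:     number of true positives that exist for this query.
    """
    if total_tp == 0:
        return 0.0
    area = 0
    tp_seen = 0
    fp_seen = 0
    for is_tp in ranked_is_tp:
        if is_tp:
            tp_seen += 1
        else:
            fp_seen += 1
            area += tp_seen  # TPs ranked above this false positive
            if fp_seen == n:
                break
    # Assumption: false positives missing from the list rank below all listed hits.
    area += (n - fp_seen) * tp_seen
    return area / (n * total_tp)

def fraction_of_queries_at_least(roc_values, threshold):
    """Fraction of queries whose ROCn value is >= threshold."""
    return sum(value >= threshold for value in roc_values) / len(roc_values)

# One hypothetical query: 3 true positives exist, hits ranked by score.
example = roc_n([True, True, False, True, False, False, False, False], total_tp=3)
print(round(example, 3))

# Per-query ROC5 values as listed on the slide.
slide_values = [0.2, 0.4, 0.6, 1.0]
print(fraction_of_queries_at_least(slide_values, 0.5))
```

Evaluating `fraction_of_queries_at_least` over a grid of thresholds gives a curve of the kind shown in the benchmark plots, where a method is better the longer its curve stays high.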