SePhHaDe: Computational Challenges on High Throughput Sequencing and Phenotyping. E. Pacitti & E. Rivals. Colloque Mastodons, 22/1/2015, Paris

PLAN

• Introduction
• Phenotyping
– Information Retrieval of Complex Contents
– Search and Recommendation for Image Observations
• Sequencing
– Indexing reads and sequencing error correction
– New spaced seed filtering for similarity search
– Metagenomics pipeline
• Project fusion with Credible
• Conclusions

INTRODUCTION - ARCHITECTURE

SCIENTIFIC BIG DATA: sequencing, phenotyping and image data

P2P, Cloud, Data Streams, Multi-Site, HPC

Data Analysis

Programs for High-Throughput Sequencing

Indexing

Information Retrieval of Complex Contents

Recommendation

PLANT PHENOTYPING

PHENOTYPING - MOTIVATIONS

• Phenotype
– Observable state, characteristic or behavior of a living being
– Examples: plant morphology, blood glucose levels
– Data: images, metadata

• Phenotyping
– Observational method that records phenotype data for analysis, querying, etc.
– Botanical observations use content-based multimedia identification methods

Greenhouse-based phenotyping (Inra, Montpellier)

Field-based phenotyping (Plant Observations)

INFORMATION RETRIEVAL OF COMPLEX CONTENTS

CONTEXT

• Accurate knowledge of the distribution and evolution of living species is essential
– Ultimate goal: sustainable, global surveillance tools for living species
– Global warming effects, invasive species, biodiversity, impact of human activities

• It is necessary to boost the production of observations

• The taxonomic gap is a tricky problem
– Scientific name = unique access key to information
– Knowledge accessible only to specialists

Taxon: Castanea sativa Mill.

LIFECLEF OBJECTIVES

A new lab of the CLEF evaluation forum

• "European NIST" for information retrieval: 15 years, hundreds of research groups worldwide

Objectives

• Study, evaluate and boost state-of-the-art content-based multimedia identification methods (signals + metadata)

• Assemble a transdisciplinary and cross-media community around the topic

• Promote environmental challenges in the multimedia community

LIFECLEF 2014: THREE TASKS (CHALLENGES)

Based on real-world big data

LIFECLEF 2014: SCHEDULE

• December 2013: registration opens
• January 2014: training data and task details release
• March 2014: test data release
• May 1st 2014: deadline for submission of runs
• May 15th 2014: release of results
• June 15th 2014: deadline for submission of working notes (peer reviewed)
• 15-19 September 2014: LifeCLEF Workshop at CLEF 2014 Conference (Sheffield, UK)

REGISTRANTS / COMMUNITY

[Figure: breakdown of registered groups: 42, 31, 5, 4, 36, 6 and 3 groups; the category labels are not recoverable]

127 research groups registered worldwide (academics & industry)

PARTICIPANTS ON FINISH LINE

[Figure: breakdown of groups that submitted runs: 0, 0, 0, 10, 10, 1 and 1 groups; the category labels are not recoverable]

22 groups submitted a total of 70 runs and 22 working notes (published in CEUR-WS proceedings)

FOCUS ON PLANTCLEF

Hervé Goeau (Inria), Pierre Bonnet (CIRAD)

• Context: the Pl@ntNet initiative, a multimedia-oriented citizen science and participatory sensing project

France flora dataset + iPhone & Android applications (content-based participatory sensing) + botanical social network (citizen sciences)

+350K downloads, 2K users / day

+20K members, +100K images, +5K species

FOCUS ON PLANTCLEF: DATA

                 2011     2012     2013     2014
Species            71      126      250      500
Images          5 400   11 500   26 077   60 962
Organs/views    (values not recoverable)
Contributors       17       46      327    1 000
Observations      368    1 136   15 046   30 136

FOCUS ON PLANTCLEF: RESULTS

• Best scores increase each year despite the increasing complexity of the task [Joly et al., CLEF proceedings 2014]

• Man vs. Machine experiment [Bonnet et al., MTAP journal 2015]

                2011    2012    2013    2014
Scan-like       0.52    0.56    0.61    0.64
Photographs     0.25    0.32    0.40    0.47

NEXT YEAR

• Keep the 3 tasks but with enriched test data (1000 bird species, 1000 plant species)

• PlantCLEF novelty = external training data authorized as a variant of the task
– On the condition that the experiment is reliable and reproducible
– Data availability, clear description, no risk of including test data, etc.

• FishCLEF restructuring = fusion of the 4 subtasks into 1 single application-oriented task
– Counting the number of fish instances of a list of species
– This should avoid fragmentation and make the task more attractive for cutting-edge research

SEARCH AND RECOMMENDATION OF PLANT OBSERVATIONS

CHALLENGE

[Figure: distribution of observations over species (axes: #observations, #species) and of recommendations (#recommendations)]

A few plants represent the majority of the observations! The majority of the plants are rarely observed!

A better distribution for recommendations: need for diversification!

Challenge: retrieve/recommend the k most diverse plant observations given a query (e.g. grape).
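
To make the idea of diversified top-k retrieval concrete, here is a minimal sketch. It uses a generic greedy "maximal marginal relevance" selection, which is swapped in for illustration and is not the probabilistic scoring function or threshold algorithm of the project. The function name `diversify`, the `trade_off` parameter, the relevance scores and the toy observations are all assumptions.

```python
def diversify(candidates, k, trade_off=0.5):
    """Greedy diversified top-k (illustrative, not the project's algorithm).

    candidates: list of (observation_id, relevance, species) tuples.
    An observation is penalised when its species is already selected.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def marginal_score(c):
            # Redundancy is 1.0 if the species was already picked, else 0.0.
            redundancy = max((1.0 for s in selected if s[2] == c[2]), default=0.0)
            return (1 - trade_off) * c[1] - trade_off * redundancy
        best = max(remaining, key=marginal_score)
        selected.append(best)
        remaining.remove(best)
    return [c[0] for c in selected]


# Hypothetical answers to the query "grape": one species dominates.
observations = [
    ("o1", 0.90, "Vitis vinifera"),
    ("o2", 0.85, "Vitis vinifera"),
    ("o3", 0.80, "Vitis vinifera"),
    ("o4", 0.60, "Vitis riparia"),
    ("o5", 0.50, "Parthenocissus quinquefolia"),
]
top3 = diversify(observations, k=3)
print(top3)  # ['o1', 'o4', 'o5']
```

A plain relevance ranking would return three observations of the same species; the penalty term spreads the answers over rarely observed species.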

PROFILE DIVERSITY

With profile diversity, the recommended observations take into account the diverse profiles of relevant users and their observations.

• Results:
– A new scoring function based on a probabilistic model (2013)
– Top-k threshold algorithm for content and profile diversity (2014)
– Optimizations for scaling up, factor of 12 (2014) [Servajean et al., Information Systems Journal, 2015]
– Distributed Profile Diversification [Servajean et al., Globe 2014]
– 2 prototypes [Servajean et al., BDA 2014]
– 1 PhD thesis defense

• Next year: exploit recommendation / crowdsourcing for plant identification

USE CASE: PROFILE DIVERSITY FOR PLANT OBSERVATION

Sequencing data analysis: a challenge

Overview

BIG DATA ANALYSIS

Recommendation

Information retrieval, complex content

Next-gen sequencing bioinformatics programs

Indexing

Context

• 3rd generation sequencing technologies yield longer reads

• PacBio SMRT sequencing: much longer reads (up to 20 Kb) but much higher error rates

• Error correction is required

1. self correction: using only PacBio reads [Chin et al. 2013]
2. hybrid correction: using short reads to correct long reads (our focus!)

Motivation

Long read (LR) correction programs "require high computational resources and long running times on a supercomputer even for bacterial genome datasets".

[Deshpande et al. 2013]

Algorithm overview

1. build a de Bruijn graph of the short reads

2. take each long read in turn and attempt to correct it

I. correct internal regions,

II. correct end regions of the long read


Example of de Bruijn graph of order k = 3

[Figure: de Bruijn graph with nodes bba, bac, acb, cba, bab, aac, baa, abc, caa, bca]

S = {bbacbaa, cbaac, bacbab, cbabcaa, bcaacb}
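
The example can be reproduced with a few lines of code. This is a minimal sketch, assuming a node-centric definition (nodes are the k-mers of the reads, with an arc from u to v when the last k-1 characters of u equal the first k-1 characters of v). The function names are illustrative, and the sketch does not use GATB or filter k-mers by abundance.

```python
from collections import defaultdict


def kmers(seq, k):
    """All substrings of length k of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_de_bruijn(reads, k):
    """Return (nodes, edges) of a node-centric de Bruijn graph of order k."""
    nodes = set()
    for read in reads:
        nodes.update(kmers(read, k))
    # Index nodes by their (k-1)-prefix to find successors quickly.
    by_prefix = defaultdict(set)
    for node in nodes:
        by_prefix[node[:k - 1]].add(node)
    # u -> v whenever the (k-1)-suffix of u is the (k-1)-prefix of v.
    edges = {node: sorted(by_prefix[node[1:]]) for node in nodes}
    return nodes, edges


S = ["bbacbaa", "cbaac", "bacbab", "cbabcaa", "bcaacb"]
nodes, edges = build_de_bruijn(S, k=3)
print(sorted(nodes))
print(edges["bba"])  # ['baa', 'bab', 'bac']
```

The ten nodes printed are the ten 3-mers listed in the figure. Under this node-centric assumption, some arcs (such as bba to baa) are implied by the overlap alone rather than by a 4-mer seen in a read.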

LoRDEC uses GATB (from the IRISA partner in Rennes)

http://gatb.inria.fr

Long read is corrected with DBG

[Figure: a long read with three regions: a bridge path found between s1 and t1, no path found between s2 and t2, and an extension path starting from s3]

For each putative region of a long read:

• align the region to paths of the de Bruijn graph

• find the best path according to edit distance

• limit the path search
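
The three steps above can be sketched as a bounded path search. This is only an illustration under stated assumptions, not LoRDEC's implementation: the graph is a plain dictionary of successor lists (here the node-centric graph of order 3 from the earlier example), and the names `best_bridge_path`, `slack` and `max_steps` are hypothetical. The search enumerates paths from a source k-mer to a target k-mer, bounds their length and the number of explored nodes, and keeps the path whose spelled sequence has the smallest edit distance to the read region.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match or substitution
        prev = cur
    return prev[-1]


def best_bridge_path(edges, source, target, region, slack=2, max_steps=1000):
    """Return (distance, sequence) of the best source-to-target path, or None."""
    max_len = len(region) + slack        # bound on the spelled path length
    best = None
    stack = [(source, source)]           # (current node, sequence spelled so far)
    steps = 0
    while stack and steps < max_steps:   # bound on the number of explored nodes
        node, spelled = stack.pop()
        steps += 1
        if node == target:
            d = edit_distance(spelled, region)
            if best is None or d < best[0]:
                best = (d, spelled)
            continue
        if len(spelled) >= max_len:
            continue
        for nxt in edges.get(node, []):
            stack.append((nxt, spelled + nxt[-1]))
    return best


# Node-centric de Bruijn graph of order 3 (assumed, as in the example above).
EDGES = {
    "bba": ["baa", "bab", "bac"], "bac": ["acb"], "acb": ["cba"],
    "cba": ["baa", "bab", "bac"], "baa": ["aac"], "aac": ["acb"],
    "bab": ["abc"], "abc": ["bca"], "bca": ["caa"], "caa": ["aac"],
}
# Hypothetical erroneous region of a long read, between k-mers bba and caa.
result = best_bridge_path(EDGES, "bba", "caa", "bbacbbbcaa")
print(result)  # (1, 'bbacbabcaa')
```

When no path is found within the bounds, the function returns `None`, which corresponds to the "path not found" case in the figure.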

Runtime, memory and disk usage

[Figure: CPU time (h), memory (GB) and disk (GB) used on the Yeast dataset by PacBioToCA, LSC and LoRDEC; y-axis from 0 to 1200]

Scalability of LoRDEC

[Figure: CPU time (h), memory (GB) and disk (GB) used by LoRDEC on the E. coli, Yeast and Parrot datasets; logarithmic y-axis from 0.1 to 1000]

New spaced seed filtering for similarity search

New seeds for sequence comparison

• Principle: similar sequences share exact or approximate common subwords

• Application
– choice of the combinatorial model for the seed (sensitivity, selectivity)
– data organization (hash table, burst trie, suffix array, BWT, ...)
– choice of the algorithm to locate seeds

[Figure: three seed models: contiguous seed (TACGC), spaced seed (∗A∗GC, where ∗ matches any character) and approximate seed (up to 1 error, drawn as an automaton with ∗ and ε transitions)]
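
As an illustration of two of the choices listed above (seed model and data organization), here is a minimal sketch, not the authors' implementation. It indexes one sequence in a hash table keyed by the characters found at the `1` positions of a spaced seed, then looks up every window of a second sequence. The function names, the seed `111*1*11` and the toy sequences are assumptions chosen for the example.

```python
from collections import defaultdict


def seed_key(text, pos, seed):
    """Characters of text at the '1' positions of the seed placed at pos."""
    return "".join(text[pos + i] for i, s in enumerate(seed) if s == "1")


def index_spaced(text, seed):
    """Hash table: spaced key -> list of positions in text."""
    index = defaultdict(list)
    for pos in range(len(text) - len(seed) + 1):
        index[seed_key(text, pos, seed)].append(pos)
    return index


def seed_hits(query, target, seed):
    """All (query_pos, target_pos) pairs that agree on the '1' positions."""
    index = index_spaced(target, seed)
    hits = []
    for q in range(len(query) - len(seed) + 1):
        for t in index.get(seed_key(query, q, seed), []):
            hits.append((q, t))
    return hits


query = "ATCAGTGCGAATGCGCAAGA"
target = "ATCAGCGCAAATGCTCAAGA"   # 3 substitutions with respect to query
hits = seed_hits(query, target, "111*1*11")
diagonal = [q for q, t in hits if q == t]
print(diagonal)  # [0, 9, 11]
```

A contiguous seed of the same weight (`111111`) would miss the hit at position 0, because the substitution at position 5 falls inside its window; the `*` positions of the spaced seed tolerate it.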

New seeds for sequence comparison: Results

• study of the coverage criterion

Coverage measure for a seed

Definition: number of match symbols covered by at least one 1 symbol from any seed hit [Benson and Mak, 2008, Martin, 2013]

Example:

ATCAGTGCGAATGCGCAAGA
|||||:||:|||||.|||||
ATCAGCGCAAATGCTCAAGA
111*1*11
         111*1*11
           111*1*11

The coverage is 15.

Laurent Noe, Donald E. K. Martin, A coverage criterion for spaced seeds and its applications

• optimisation of spaced seeds for eukaryotic genome comparison

• new type of seeds for short patterns with high error rate

[Figure: block-wise comparison of ATGG TACA TCAA CGTA GCAT with ATGG TATA TCGAA CGGA GCAT (block flags 0 1 1 1 0) and with ATG TACA TCTA CGTA GACAT (block flags 0 1 0, partly lost in extraction)]
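
The coverage measure defined on this slide can be checked on its own example with a short sketch. The function names are illustrative assumptions; the alignment is encoded as the string of match (`|`) and mismatch (`:` or `.`) symbols shown between the two sequences.

```python
def seed_hit_starts(alignment, seed, match="|"):
    """Start positions where every '1' of the seed falls on a match symbol."""
    care = [i for i, s in enumerate(seed) if s == "1"]
    return [start
            for start in range(len(alignment) - len(seed) + 1)
            if all(alignment[start + i] == match for i in care)]


def coverage(alignment, seed, match="|"):
    """Number of match symbols covered by a '1' of at least one seed hit."""
    care = [i for i, s in enumerate(seed) if s == "1"]
    covered = set()
    for start in seed_hit_starts(alignment, seed, match):
        covered.update(start + i for i in care)
    return len(covered)


alignment = "|||||:||:|||||.|||||"
print(seed_hit_starts(alignment, "111*1*11"))  # [0, 9, 11]
print(coverage(alignment, "111*1*11"))         # 15
```

The three hits overlap on a few positions, so the coverage (15) is smaller than three times the seed weight (18).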

New seeds for sequence comparison: Some examples of biological applications

• read mapping (ongoing) [BGE 2014]

• finding microRNA targets at genome scale [IWOCA 2014]

• taxonomic assignment in metagenomics (ongoing)

• 20 000 new alignments between human and mouse genomes [NAR 2014]

• non-coding RNA classification by Support Vector Machine string kernels [JCB 2014]

Metagenomic sample analysis

Comparison of metagenomic data

• Input:
– Sequencing data from the environment (water, soil, air, etc.)
– Protein sequence banks

• Output:
– The set of proteins that match the metagenomic data

[Figure: metagenomic data (RNA-seq) and a protein bank are compared to produce a list of matching proteins, and hence functions (PROTEIN => FUNCTION)]

Comparing metagenomic samples to protein banks is a way to functionally characterize a specific environment.

Common method

[Figure: metagenomic data and protein bank → BLASTx → SELECT → protein list]

Standard software used by everyone.

Time-consuming process: several hours (or days) of computation on multicore systems.

MASTODONS CHALLENGE: speed up the process by at least one order of magnitude.

MASTODONS Approach

[Figure: standard approach: metagenomic data and protein bank → BLASTx → SELECT → protein list. MASTODONS approach: metagenomic data → MetaContiger → PLASTx (with the protein bank) → SELECT → protein list]

PLASTx: software developed by GenScale (before MASTODONS). Speed-up: ~x5.

MetaContiger: new software developed in this project. It eliminates redundancy in the metagenomic data, which significantly decreases the number of metagenomic sequences to compare.
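
How MetaContiger removes redundancy is not detailed in these slides. Purely to illustrate why redundancy elimination shrinks the comparison workload, the hypothetical sketch below drops exact duplicates and sequences fully contained in a longer one. The function name, the strategy and the toy reads are assumptions, not MetaContiger's method.

```python
def remove_redundancy(sequences):
    """Keep one copy of each sequence and drop those contained in a longer one.

    Quadratic in the number of sequences: fine for a sketch, not for real data.
    """
    # Longest first, so a sequence is only compared against longer or equal ones.
    unique = sorted(set(sequences), key=lambda s: (-len(s), s))
    kept = []
    for seq in unique:
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept


reads = ["ATGGCGTACG", "GGCGTA", "ATGGCGTACG", "TTTACCGGA", "ACCGG"]
reduced = remove_redundancy(reads)
print(reduced)  # ['ATGGCGTACG', 'TTTACCGGA']
```

In this toy case five input sequences shrink to two, so the subsequent comparison against the protein bank handles fewer queries. The more redundant the sample, the larger the gain.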

Results

• Project still ongoing

• Preliminary results:

– Global speed-up: from x10 to x30 (1 day vs. 1 hour)

– Highly correlated with the redundancy of the metagenomic data

• Future

– Validation from a qualitative point of view

– Tests on various metagenomic projects

– Extension of the method to the general sequence comparison problem

Conclusion

Actions and highlights

• Colloquium "Indexing scientific big data": 147 participants, Paris, 15 Jan 2014

• Joint workshop with COST Action SeqAhead, "Data Structures in Bioinformatics": 10 countries

• LifeCLEF challenge launched and meetings over 2014

• New partner teams: Telabotanica, Univ. Rouen, UPMC, Paris 5, CIRAD

SCIENTIFIC BIG DATA: sequencing, phenotyping and image data

• Voluminous
• Complex
• Heterogeneous

Data analysis and workflows

Programs for high-throughput sequencing

Indexing, mediation

Recommendation and retrieval of complex contents

P2P, Cloud, Multi-Site, HPC

Future

• Fusion between projects SePhHaDe and Credible

• New graphs, indexes and algorithms for genome assembly

• Metagenomic pipeline

• Recommendation and plant identification for LifeCLEF

• New edition of the workshop "Data Structures in Bioinformatics"

Publications

• Drezen et al., GATB: Genome Assembly & Analysis Tool Box, Bioinformatics, 2014

• Salmela and Rivals, LoRDEC: accurate and efficient long read error correction, Bioinformatics, 2014

• Noe and Martin, A coverage criterion for spaced seeds and its applications to SVM string kernels and k-mer distances, J. Computational Biology, 2014

• Frith and Noe, Improved search heuristics find 20 000 new alignments between human and mouse genomes, Nucleic Acids Research, 2014

• Cazaux et al., From Indexing Data Structures to de Bruijn Graphs, CPM, 2014

• Blanc-Mathieu et al., An improved genome of the model marine alga Ostreococcus tauri unfolds by assessing Illumina read de novo assemblies, BMC Genomics, 2014

• Servajean et al., Profile Diversity for Query Processing using User Recommendations, Information Systems, 2015

• Joly et al., Are Species Identification Tools Biodiversity-friendly?, ACM IW on Multimedia Analysis for Ecological Data, 2014

• Joly et al., LifeCLEF 2014: multimedia life species identification challenges, Information Access Evaluation, 2014

• Cazaux and Rivals, Reverse Engineering of Compact Suffix Trees and Links, J. Discrete Algorithms, 2014

Partners

Thank you for your attention