Thierry Lecroq, Eric Rivals and Hélène Touzet (Eds.)
Montpellier, France, November 25th and 26th, 2013




Table of Contents

Preface 3

Committees 4

Invited Talks

I Can Only Read Equations: The ICORE framework for dynamic programming over sequences and trees 5
Robert Giegerich

The GEM suite for sequencing data analysis: present and future perspectives 6
Paolo Ribeca

Regular Submissions

Finding the Core-genes of Chloroplast Species 7
Bassam Alkindy, Jean-François Couchot, Christophe Guyeux and Michel Salomon

The HRS-seq: a new method for genome-wide profiling of nuclear compartment-associated sequences 9
Marie-Odile Baudement, Axel Cournac, Franck Court, Marie Seveno, Hugues Parrinello, Christelle Reynes, Marie-Noëlle Le Lay-Taha, Cathala Guy, Laurent Journot and Thierry Forné

Greedy algorithm for the shortest superstring and shortest cyclic cover of linear strings 11
Bastien Cazaux and Eric Rivals

On the number of prefix and border tables 13
Julien Clément and Laura Giambruno

Indexing DNA sequences in a NoSQL database using perceptual hashing algorithms 15
Jocelyn De Goer De Herve, Myoung-Ah Kang, Xavier Bailly and Engelbert Mephu Nguifo

Tyrosine-1 phosphorylation of Pol II CTD is associated with antisense promoter transcription and active enhancers in mammalian cells 18
Nicolas Descostes, Martin Heidemann, Ahmad Maqbool, Lionel Spinelli, Romain Fenouil, Marta Gut, Ivo Gut, Dirk Eick and Jean-Christophe Andrau

Abelian Repetition in Sturmian Words 20
Gabriele Fici, Alessio Langiu, Thierry Lecroq, Arnaud Lefebvre, Filippo Mignosi and Elise Prieur-Gaston

Co-optimality and Ambiguity, Disentangled 21
Robert Giegerich and Benedikt Lowes


Folded self-avoiding walks related to protein folding 23
Christophe Guyeux

New software for mapping high-throughput reads for genomic and metagenomic data 25
Evguenia Kopylova, Laurent Noé, Mikaël Salson and Hélène Touzet

SUMATRA and SUMACLUST: fast and exact comparison and clustering of sequences 27
Céline Mercier, Frédéric Boyer, Aurélie Bonin and Eric Coissac

New motif extraction and weighting algorithms for protein classification 30
Faouzi Mhamdi, Mehdi Kchouk and Salma Aouled El Haj Mohamed

Suffix tree and suffix array of an alignment 33
Joong Chae Na, Heejin Park, Sunho Lee, Minsung Hong, Thierry Lecroq, Laurent Mouchard and Kunsoo Park

Comparison of sets of homologous gene transcripts 34
Aida Ouangraoua, Krister M. Swenson and Anne Bergeron

Disentangling homeologous contigs in tetraploid assembly: application to durum wheat 36
Vincent Ranwez, Yan Holtz, Gautier Sarah, Morgane Ardisson, Sylvain Santoni, Sylvain Glémin, Muriel Tavaud-Pirra and Jacques David

An algorithm for pattern occurrences P-values computation 38
Mireille Régnier, Evgenia Furletova, Victor Yakovlev and Mikhail Roytberg

Filtering systematic errors in next-generation genotype calls: stranding matters 40
Laure Sambourg and Nicolas Thierry-Mieg

Reconstructing Textual Documents from perfect n-gram Information 41
Matias Tealdi and Matthias Gallé


Preface

The pluridisciplinary workshop SeqBio 2013 was held at the CNRS Campus in Montpellier, France, on November 25th and 26th, 2013. It gathered the computer science and bioinformatics communities working on textual analysis methods, together with biologists and geneticists interested in sequence bioinformatics.

Thanks to the financial support of the GdR (working groups) BIM (BioInformatique Moléculaire) and IM (Informatique Mathématique) of the CNRS and of the project MASTODONS SePhHaDe, participation was completely free.

The programme included talks selected from the submissions, as well as two invited talks by Robert Giegerich and Paolo Ribeca.

The problems tackled during SeqBio range from combinatorics on words and text algorithmics to their applications to the bioinformatic analysis of biological sequences. This includes, without being restricted to, the following topics:

— text algorithmics;
— indexing data structures;
— combinatorics and statistics on words;
— high-performance or parallel algorithmics;
— text mining;
— compression;
— alignment and similarity search;
— pattern or repeat matching, extraction and inference;
— analysis of high-throughput sequencing data (genomic, RNA-seq, ChIP-seq, ...);
— genome annotation, gene prediction;
— haplotypes and polymorphisms;
— comparative genomics;
— control signals.

This meeting comes after the following previous editions:
— Marne-la-Vallée, November 2012;
— Lille, December 2011;
— Rennes, January 2011;
— Montpellier, January 2010;
— Rouen, September 2008;
— Marne-la-Vallée, September 2007;
— Orsay, November 2005;
— Lille, December 2004;
— Nantes, May 2004;
— Montpellier, November 2003;
— Nancy, January 2003;
— Rouen, June 2002;
— Montpellier, March 2002.


Programme Committee
— Guillaume Blin, LIGM, Univ. Marne-la-Vallée
— Jérémie Bourdon, LINA, Univ. Nantes
— Christine Brun, TAGC, Inserm UMR 1090, Marseille
— Annie Chateau, LIRMM, CNRS Univ. Montpellier 2
— Hélène Chiapello, INRA Toulouse
— Julien Clément, GREYC, Univ. Caen
— Eric Coissac, LECA Univ. Grenoble 1
— Bernard de Massy, CNRS Montpellier
— Thomas Faraut, INRA Toulouse
— Nicolas Galtier, ISEM, CNRS Univ. Montpellier 2
— Thierry Lecroq, LITIS, Univ. Rouen (chair)
— Claire Lemaitre, INRIA Rennes
— Laurent Mouchard, LITIS, Univ. Rouen
— Macha Nikolski, LABRI, Univ. Bordeaux I
— Olivier Panaud, LGDP, Univ. Perpignan
— Fabio Pardi, LIRMM, CNRS Univ. Montpellier 2
— Eric Rivals, LIRMM, CNRS Univ. Montpellier 2 (chair)
— Eric Tannier, LBBE, CNRS Univ. Lyon I
— Hélène Touzet, LIFL, Univ. Lille I (chair)
— Raluca Uricaru, LABRI, Univ. Bordeaux I
— Jean-Stéphane Varré, LIFL, Univ. Lille I

Organizing Committee

The local organization was carried out by:
— the team “Methods and Algorithms for Bioinformatics (MAB)” of the LIRMM (Lab. of Computer Science, Robotics and Microelectronics of Montpellier);
— the “Institut de Biologie Computationnelle (IBC)”;
— the French CNRS (Centre National de la Recherche Scientifique).

Members:
— Bastien Cazaux
— Annie Chateau
— Maxime Hebrard
— Vincent Lefort
— Sylvain Milanesi
— Fabio Pardi
— Eric Rivals


I Can Only Read Equations: The ICORE framework for dynamic programming over sequences and trees

Robert Giegerich

Univ. of Bielefeld, Germany, Practical Computer Science lab

Dynamic programming problems are ubiquitous in bioinformatics, mostly dealing with sequences and tree-structured data. A large class of such problems can be cast in a uniform framework based on algebras and term rewrite systems:

A solution of a dynamic programming problem, indicated by an optimal score, can be represented by the formula (term) which computes this score. A simple term rewrite system can specify the transformation of such a formula “backwards” to the input(s) it is derived from. The inverse of this rewrite relation constitutes a declarative problem specification with a high intuitive appeal. Algorithmic ideas and relationships between similar problems come out clearly, and re-use of specification components is high.

The presentation will introduce the novel framework of Inverse COupled REwrite systems (ICOREs), and demonstrate their appeal with a familiar set of bioinformatics problems encoded as ICOREs. It will indicate the sub-class of the framework which we can implement automatically and efficiently today, and sketch the challenges of full ICORE implementation.

ICOREs are joint work with Hélène Touzet.


The GEM suite for sequencing data analysis: present and future perspectives

Paolo Ribeca

National Center of Genomic Analyses, Barcelona

In this talk we describe the key features and the future roadmap of the GEM suite of software tools for sequencing data analysis. Built upon a pair of high-performance alignment and processing libraries, the suite today offers several utilities and pipelines: a mapper (which has been shown to be significantly faster and more accurate than many other short-read aligners), a tool to compute the mappability of a reference, an RNA-seq alignment pipeline, and several ancillary tools for easy genome-browser track production and visualization. However, new challenges are just around the corner (mainly new sequencing technologies producing longer reads with a higher error rate, and the general need for higher processing throughput), and we discuss how we are preparing to tackle them.


Finding the Core-genes of Chloroplast Species

Bassam Alkindy, Jean-François Couchot, Christophe Guyeux and Michel Salomon

FEMTO-ST Institute, UMR 6174 CNRS, University of Franche-Comté, France

Identifying core genes is important to understand evolutionary and functional phylogenies. Therefore, in this work we present two methods to build a gene-content evolutionary tree. More precisely, we focus on the following questions, considering a collection of 99 chloroplast genomes annotated with NCBI [1] and Dogma [2]: how can we identify the best core genome, and what is the evolutionary scenario of these chloroplasts? Two methods are considered here. The first one, based on the NCBI annotation, is explained below. We start with the following definition.

Definition 1 Let A = {A, T, C, G} be the nucleotide alphabet, and A* be the set of finite words over A (i.e., of DNA sequences). Let d : A* × A* → [0, 1] be a distance on A*. Consider a given value T ∈ [0, 1] called a threshold. For all x, y ∈ A*, we say that x ∼d,T y if d(x, y) ≤ T.

∼d,T is obviously an equivalence relation. When d = 1 − Δ, where Δ is the similarity scoring function embedded in the EMBOSS package (Needleman–Wunsch, released by EMBL), we simply denote ∼d,0.1 by ∼. The method starts by building an undirected graph based on the similarity rates r_ij between sequences g_i and g_j (i.e., r_ij = Δ(g_i, g_j)). In this graph, the nodes are all the coding sequences of the set of genomes under consideration, and there is an edge between g_i and g_j if the similarity rate r_ij is greater than the given similarity threshold. The connected components (CC) of this “similarity” graph are then computed. This produces an equivalence relation between sequences in the same CC, based on Definition 1. Any class of this relation is called a “gene” here, and its representatives (DNA sequences) are the “alleles” of this gene. Thus this first method produces, for each genome G, which is a set g_1^G, ..., g_{m_G}^G of m_G DNA coding sequences, the projection of each sequence according to π, where π maps each sequence to its gene (class) according to ∼. In other words, G is mapped to {π(g_1^G), ..., π(g_{m_G}^G)}. Remark that a projected genome has no duplicated gene, as it is a set. The core genome (resp. the pan genome) of G1 and G2 is thus defined as the intersection (resp. the union) of these projected genomes.
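The clustering step just described (similarity graph, then connected components) can be sketched with a plain union-find structure. This is an illustrative reimplementation, not the authors' code: the `similarity` callback and its threshold stand in for the EMBOSS-based scoring Δ with T = 0.1.

```python
def cluster_genes(seqs, similarity, threshold=0.9):
    """Group coding sequences into 'genes': connected components of the
    graph with an edge between two sequences whose similarity exceeds
    the threshold (Definition 1 with d = 1 - similarity)."""
    parent = list(range(len(seqs)))

    def find(i):                      # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if similarity(seqs[i], seqs[j]) > threshold:
                parent[find(i)] = find(j)

    classes = {}                      # component root -> its alleles
    for i, s in enumerate(seqs):
        classes.setdefault(find(i), []).append(s)
    return list(classes.values())
```

Each returned class corresponds to one "gene", its members being the alleles; projecting a genome then amounts to replacing each of its coding sequences by the class it belongs to.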

Figure 1: General overview of the system pipeline


We then consider the intersection of all the projected genomes, which is the set of all the genes x such that each genome has at least one allele in x. The pan genome is computed similarly, as the union of all the projected genomes. However, this approach produces core genomes that are too small, for any chosen similarity threshold, compared to what biologists usually expect for these chloroplasts. We are then left with the following questions: how can we improve the confidence put in the produced core? Can we then infer the evolutionary scenario of these genomes?

The second method is based on both the NCBI and Dogma annotations. In this method, we compute an Intersection Core Matrix (ICM). The ICM is a two-dimensional symmetric matrix in which each row and column represents a genome with its set of genes. Each position in the ICM contains the Intersection Score (IS), i.e., the cardinality obtained by intersecting one genome with another. Iteratively, we select the maximum intersection cardinality in the matrix, according to Equation 1, creating a new core id. We then remove the two corresponding genomes and add the newly established core.

Score = max_{i<j} |x_i ∩ x_j|    (1)
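As a sketch of this iterative procedure, assuming each genome is given as a Python set of gene identifiers (all names here are illustrative, not the authors' implementation):

```python
def build_core_tree(genomes):
    """Iteratively merge the two gene sets with the largest intersection
    (Equation 1), replacing them by their intersection (a new 'core'),
    until a single core genome remains. Returns the final core and the
    merge history."""
    pool = {f"G{i}": set(g) for i, g in enumerate(genomes)}
    history = []
    while len(pool) > 1:
        # Pick the pair (a, b) maximising |x_a ∩ x_b|.
        a, b = max(
            ((a, b) for a in pool for b in pool if a < b),
            key=lambda p: len(pool[p[0]] & pool[p[1]]),
        )
        core = pool.pop(a) & pool.pop(b)
        name = f"core({a},{b})"
        history.append((a, b, name, len(core)))
        pool[name] = core
    return next(iter(pool.values())), history
```

The merge history (which pair was merged, and the size of the resulting core) is what allows a gene-content tree to be reconstructed.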

References
[1] E. W. Sayers et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 39(suppl 1):D38–D51, 2011.
[2] S. K. Wyman, R. K. Jansen, and J. L. Boore. Automatic annotation of organellar genomes with DOGMA. Bioinformatics, 20(17):3252–3255, 2004.


The HRS-seq: a new method for genome-wide profiling of nuclear compartment-associated sequences

Marie-Odile Baudement, Axel Cournac, Franck Court, Marie Seveno, Hugues Parrinello, Christelle Reynes, Marie-Noëlle Le Lay-Taha, Cathala Guy, Laurent Journot and Thierry Forné

We are interested in how mammalian nuclear organization controls gene expression in physiological and pathological situations. We have suggested that genome organization in gene-rich domains is based on a fundamental level of chromatin compaction: the statistical helix [1]. Functional folding would then be achieved by chromatin looping mediated by locus-specific factors and/or by recruitment to specific nuclear compartments (e.g. rDNA/nucleolus; snRNA genes/Cajal bodies, ...). However, most nuclear compartments are difficult to isolate, and methods that provide general overviews of such sequences are quite complex to handle. We have developed the HRS-seq method, a novel straightforward genome-wide approach whereby sequences associated with some nuclear compartments (the HRS fraction) are isolated from the rest of genomic DNA (the loop fraction) upon high-salt treatments (HRS = High-salt Recovered Sequences) and subjected to high-throughput sequencing.

After the bioinformatic filters applied by the MGX platform of Montpellier, we apply two further filters: one based on the position of the start of each tag relative to the restriction site used to separate the two different types of sequences, and one based on tag length, a consequence of our sequencing strategy. Tags not fulfilling these two conditions are discarded, and the sums of tags for each restriction-enzyme fragment are computed independently in the two fractions (R software).

After a statistical analysis (collaboration with Robert Sabatier), we determine which restriction-enzyme fragments are significantly enriched in the HRS fraction versus the loop fraction. This analysis yields a sub-list of fragments called “HRS fragments”.

Using mouse liver cells, we showed, by a differential analysis against a randomization, that these HRS fragments are highly clustered in the genome, and that two categories of HRS can be distinguished: AT-rich HRS and GC-rich HRS. Using the UCSC genes database, we bioinformatically identified Transcription Start Sites (TSS) inside and close to HRS fragments: remarkably, GC-rich HRS map close to TSS, including the TSS of histone genes (DAVID ontology analysis). In collaboration with Axel Cournac, we then asked whether HRS fragments cluster together in three dimensions. For this we used contact maps obtained by Hi-C. First, we computed the mean interaction score over all possible contacts between all HRS fragments isolated from liver nuclei; second, we compared this result against a randomization. We found a strong difference: our HRS fragments preferentially cluster together in the 3D space of the nucleus. We also looked for families of repeats over-represented in our data; we found transfer RNA genes, which are already known to cluster in 3D within the yeast nucleus [2]. This last finding confirms that a significant part of the GC-rich HRS represents sequences associated with specific nuclear compartments.

To better understand the specificity of the AT-rich population of HRS, we crossed our data with data from the Bas van Steensel lab [3], which identify Lamina-Associated Domains (LADs). These LADs are described as AT-rich sequences associated with lamin B1. We selected DNA microarray probes that are triple positive (in three different cell types) and compared their positions with the locations of HRS, with the help of the “IRanges” and “GenomicRanges” packages. We found a strong overlap between AT-rich HRS and LADs. This result is concordant with what is known about the so-called MAR sequences, which are also AT-rich.

We are now applying this method to cellular models in which specific types of nuclear compartments are perturbed, as well as to undifferentiated and differentiated cells. Global profiling of HRS should help us better understand how genome organization impacts genome function, and whether some diseases are due to an alteration of the recruitment of specific HRS to these nuclear bodies.


References
[1] Court et al. Genome Biol., 12:R42, 2011.
[2] Haeusler et al. Genes Dev., 22:2204–2214, 2008.
[3] Peric-Hupkes et al. Molecular Cell, 38:603–613, 2010.


Greedy algorithm for the shortest superstring and shortest cyclic cover of linear strings

Bastien Cazaux and Eric Rivals

L.I.R.M.M. & Institut Biologie Computationnelle, University of Montpellier II, CNRS U.M.R. 5506, 161 rue Ada, F-34392 Montpellier Cedex 5, France

Abstract: In bioinformatics, the assembly of reads to produce a complete genome is currently a major bottleneck in genomics. This problem has been modeled as the Shortest Superstring problem: given a set of input words, one asks for a shortest string such that each input word is a substring of this superstring. This well-studied problem is known to be NP-hard [2] and difficult to approximate [1]. The optimisation can measure either the length of the obtained superstring, or the amount of compression it achieves (i.e., the cumulated length of the words minus that of the superstring). Numerous approximation algorithms achieving a constant approximation ratio have been described in the literature; see for instance [6]. Many of these use, as a subroutine, the question of finding a set of cyclic strings of minimal total length covering the input words, also known as Shortest Cyclic Cover. We study the greedy algorithm, which iteratively agglomerates the two words having the largest maximal overlap, and has been shown to reach a 1/2 compression ratio [7]. Using hereditary systems [5], we provide a simple proof of this 1/2 approximation ratio. By extending the reasoning, we obtain that the greedy algorithm is optimal for the Shortest Cyclic Cover (when each word is allowed to overlap itself). The simplicity of the greedy algorithm gives it practical advantages over other, more complex approximation algorithms, for instance in terms of implementation effort.

Résumé: In molecular biology, DNA sequencing methods produce the sequences of small portions of the sequenced molecule. The assembly phase then consists in determining the complete sequence of the molecule from the overlaps between the produced sequences. DNA or RNA molecules may be linear or circular. Assembly can thus be modeled as the search for a Shortest Superstring, of which each input sequence is by definition a substring. This is an NP-hard problem [2] that is hard to approximate [1], and for which numerous constant-ratio algorithms have been described (see for instance [6]). If one searches not for a linear superstring but for a set of circular superstrings, also called a cyclic cover, the problem is then called the Shortest Cyclic Cover (of linear strings). For these two problems, we study the performance of the greedy algorithm, which iteratively agglomerates two input sequences having the largest maximal overlap. Using hereditary systems [5], we give a simple proof that the greedy algorithm achieves a compression approximation ratio of 1/2 for the Shortest (linear) Superstring question [7]. By extending the reasoning, we obtain that this same algorithm solves the Shortest Cyclic Cover problem exactly. This problem asks, for a set of input words, for a set of circular strings of minimal total length such that each input word is a substring of at least one cyclic string. A cyclic string is simply a linear string laid out on a circle, looping back on itself. The greedy algorithm can be implemented in time linear in the total length of the input strings, thanks to indexing structures such as the generalized suffix tree. Our result also has practical significance, since solving the Shortest Cyclic Cover is used in various approximation algorithms for the Shortest Superstring [4, 3].
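The greedy algorithm studied above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the quadratic pair scan is for clarity only, whereas the abstract notes that a linear-time implementation is possible with a generalized suffix tree.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(words):
    """Greedy superstring of a non-empty word set: repeatedly glue the
    pair with the largest maximal overlap until one string remains."""
    # Substrings of other words contribute nothing; drop them and dedupe.
    words = [w for w in words
             if not any(w != v and w in v for v in words)]
    words = list(dict.fromkeys(words))
    while len(words) > 1:
        best = (-1, 0, 1)                      # (overlap, i, j)
        for i in range(len(words)):
            for j in range(len(words)):
                if i != j:
                    o = overlap(words[i], words[j])
                    if o > best[0]:
                        best = (o, i, j)
        o, i, j = best
        merged = words[i] + words[j][o:]
        words = [w for k, w in enumerate(words) if k not in (i, j)]
        words.append(merged)
    return words[0]
```

On small inputs the greedy choice often happens to be optimal; in general, the result is only guaranteed to achieve half of the optimal compression.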

References
[1] A. Blum, T. Jiang, M. Li, J. Tromp, and M. Yannakakis. Linear approximation of shortest superstrings. In ACM Symposium on the Theory of Computing, pages 328–336, 1991.


[2] J. Gallant, D. Maier, and J. A. Storer. On finding minimal length superstrings. J. Comput. Syst. Sci., 20:50–58, 1980.
[3] D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University Press, 1997.
[4] H. Kaplan, M. Lewenstein, N. Shafrir, and M. Sviridenko. Approximation algorithms for asymmetric TSP by decomposing directed regular multigraphs. J. ACM, 52(4):602–626, July 2005.
[5] J. Mestre. Greedy in approximation algorithms. In Proceedings of the 14th Annual European Symposium on Algorithms (ESA), volume 4168 of Lecture Notes in Computer Science, pages 528–539. Springer, 2006.
[6] M. Mucha. Lyndon words and short superstrings. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 958–972, 2013.
[7] J. Tarhio and E. Ukkonen. A greedy approximation algorithm for constructing shortest common superstrings. Theor. Comput. Sci., 57:131–145, 1988.


On the number of prefix and border tables

Julien Clément and Laura Giambruno

GREYC, CNRS-UMR 6072, Universite de Caen, 14032 Caen, France

Julien Clément: [email protected]
Laura Giambruno: [email protected]

The prefix table of a string w reports, for each position i, the length of the longest substring of w that begins at i and matches a prefix of w. This table stores the same information as the border table of the string, which records for each position the maximal length of the prefixes of w ending at that position. Indeed, two strings have the same border table if and only if they have the same prefix table.
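For concreteness, both tables can be computed as follows (an illustrative sketch; the quadratic prefix-table computation is written for readability, and linear-time algorithms exist for both tables):

```python
def prefix_table(w):
    """t[i] = length of the longest substring of w starting at i
    that is also a prefix of w (by convention t[0] = len(w))."""
    n = len(w)
    t = [0] * n
    if n:
        t[0] = n
    for i in range(1, n):
        k = 0
        while i + k < n and w[k] == w[i + k]:
            k += 1
        t[i] = k
    return t

def border_table(w):
    """b[i] = length of the longest proper prefix of w[0..i] that is
    also a suffix of w[0..i] (the Knuth-Morris-Pratt failure table)."""
    n = len(w)
    b = [0] * n
    for i in range(1, n):
        k = b[i - 1]
        while k > 0 and w[i] != w[k]:
            k = b[k - 1]
        b[i] = k + 1 if w[i] == w[k] else 0
    return b
```

As noted below in the text, the distinct words abb and abc share the prefix table [3, 0, 0] (and the border table [0, 0, 0]).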

Both tables are useful in several algorithms on strings. They are used to design efficient string-matching algorithms and are essential for this type of application (see for example [6] or [2]). It has been noted that some text algorithms (like the Knuth-Morris-Pratt pattern-matching algorithm) do not consider the string itself but rather its structure, meaning that two strings with the same prefix or border table are treated in the same manner. For instance, the strings abbbbb, baaaaa and abcdef are equivalent in this respect.

The study of these tables has become topical. Several recent articles (cf. [5, 3, 1, 4]) focus on the problem of validating prefix and border tables, that is, checking whether an integer array is the prefix or border table of at least one string. In a previous paper [7], the authors represented distinct border tables by canonical strings and gave results on the generation and enumeration of these strings for bounded and unbounded alphabets. Some of these results were reformulated in [4] using automata-theoretic methods. Note that different words over a binary alphabet have distinct prefix/border tables. This gives a trivial lower bound of 2^{n−1} (since exchanging the two letters of the alphabet does not change the tables). This is no longer true as soon as the alphabet has cardinality strictly greater than 2: for instance, the words abb and abc admit the same prefix table [3, 0, 0].

In this paper we are interested in giving better estimates of the number p_n of prefix/border tables of words of a given length n than those known in the literature.

For this purpose, we define the combinatorial class of p-lists, where a p-list L = [ℓ_1, ..., ℓ_k] is a finite sequence of non-negative integers.

We constructively define an injection ψ from the set of prefix tables to the set of p-lists, which are easier to count. In particular, we provide an algorithm associating a p-list to a prefix table. We define prefix lists as the p-lists that are images of prefix tables under ψ. We moreover describe an “inverse” algorithm that associates to a prefix list L = ψ(P) a word whose prefix table is P. This result supports the idea that prefix lists are a more concise representation of prefix tables.

We then deduce a new upper bound and a new lower bound on the number p_n of prefix tables for strings of length n (see Table 1 for the first numerical values) or, equivalently, on the number of border tables of length n.

Let φ = (1 + √5)/2 ≈ 1.618 be the golden mean; we have:

Proposition 1 (Upper bound) The number of valid prefix tables p_n can be asymptotically upper bounded by the quantity (1/2)(1 + √5/5)(1 + φ)^n + o(1).

Proposition 2 (Lower bound) For any ε > 0 there exists a family of prefix tables (L_n)_{n≥0} such that Card(L_n) = Ω((1 + φ − ε)^n).

The problem of finding an asymptotic equivalent for the number of prefix tables is however still open,and would require a very fine understanding of the autocorrelation structure of words.


 n | pn,1 | pn,2   | pn,3    | pn,4   | pn,5 | pn
 1 | 1    |        |         |        |      | 1
 2 | 1    | 1      |         |        |      | 2
 3 | 1    | 3      |         |        |      | 4
 4 | 1    | 7      | 1       |        |      | 9
 5 | 1    | 15     | 4       |        |      | 20
 6 | 1    | 31     | 15      |        |      | 47
 7 | 1    | 63     | 46      |        |      | 110
 8 | 1    | 127    | 134     | 1      |      | 263
 9 | 1    | 255    | 370     | 4      |      | 630
10 | 1    | 511    | 997     | 16     |      | 1525
11 | 1    | 1023   | 2625    | 52     |      | 3701
12 | 1    | 2047   | 6824    | 162    |      | 9034
13 | 1    | 4095   | 17544   | 500    |      | 22140
14 | 1    | 8191   | 44801   | 1467   |      | 54460
15 | 1    | 16383  | 113775  | 4180   |      | 134339
16 | 1    | 32767  | 287928  | 11742  | 1    | 332439
17 | 1    | 65535  | 726729  | 32466  | 4    | 824735
18 | 1    | 131071 | 1831335 | 88884  | 16   | 2051307
19 | 1    | 262144 | 4610078 | 241023 | 52   | 5113298

Table 1: First values: pn is the total number of prefix tables for strings of size n; pn,k is the number of prefix tables for strings of size n over an alphabet of size k that cannot be obtained using a smaller alphabet.
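The small values of Table 1 can be checked by brute force. Since a prefix table depends only on the pattern of equalities between letters, it suffices to enumerate one canonical word per letter-renaming class (restricted growth strings, Bell(n) of them) and collect the distinct tables. This is an illustrative verification sketch, not the counting method of the paper:

```python
def prefix_table(w):
    """Naive prefix table: t[i] = longest common prefix of w and w[i:]."""
    n = len(w)
    t = [0] * n
    if n:
        t[0] = n
    for i in range(1, n):
        k = 0
        while i + k < n and w[k] == w[i + k]:
            k += 1
        t[i] = k
    return t

def restricted_growth_strings(n):
    """One canonical word per renaming class: each new letter is the
    smallest unused integer (there are Bell(n) such words)."""
    def rec(prefix, used):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        for c in range(used + 1):
            yield from rec(prefix + [c], max(used, c + 1))
    yield from rec([], 0)

def count_prefix_tables(n):
    """Total number p_n of distinct prefix tables of words of length n."""
    return len({tuple(prefix_table(w)) for w in restricted_growth_strings(n)})
```

For n = 1, ..., 8 this enumeration reproduces the pn column: 1, 2, 4, 9, 20, 47, 110, 263.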

References
[1] J. Clément, M. Crochemore, and G. Rindone. Reverse engineering prefix tables. In S. Albers and J.-Y. Marion, editors, 26th International Symposium on Theoretical Aspects of Computer Science (STACS 2009), volume 3 of Leibniz International Proceedings in Informatics (LIPIcs), pages 289–300, Dagstuhl, Germany, 2009. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
[2] M. Crochemore, C. Hancart, and T. Lecroq. Algorithms on strings. Cambridge University Press, Cambridge, UK, 2007.
[3] J.-P. Duval, T. Lecroq, and A. Lefebvre. Border array on bounded alphabet. Journal of Automata, Languages and Combinatorics, 10(1):51–60, 2005.
[4] J.-P. Duval, T. Lecroq, and A. Lefebvre. Efficient validation and construction of border arrays and validation of string matching automata. RAIRO-Theoretical Informatics and Applications, 43(2):281–297, 2009.
[5] F. Franek, S. Gao, W. Lu, P. J. Ryan, W. F. Smyth, Y. Sun, and L. Yang. Verifying a border array in linear time. Journal on Combinatorial Mathematics and Combinatorial Computing, 42:223–236, 2002.
[6] D. Gusfield. Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, Cambridge, UK, 1997.
[7] D. Moore, W. F. Smyth, and D. Miller. Counting distinct strings. Algorithmica, 23(1):1–13, 1999.


Indexing DNA sequences in a NoSQL database using perceptual hashing algorithms

Jocelyn De Goer De Herve1,2,3, Myoung-Ah Kang1,2, Xavier Bailly3 and Engelbert Mephu Nguifo1,2

1 CNRS, UMR 6158, LIMOS, Université Blaise Pascal, F-63173 AUBIERE
2 Clermont Université, Université Blaise Pascal, LIMOS, BP 10448, F-63000 CLERMONT-FERRAND
3 INRA, UR346 Épidémiologie Animale, F-63122 ST GENES CHAMPANELLE

Keywords: Indexing, Storage, NoSQL, Bioinformatics, DNA sequences, Sequence alignment, Image processing, Perceptual hashing, DCT, Hamming distance

1 Context

The arrival of new generations of high-throughput DNA sequencers [12] has enabled the production of genomic data at ever higher throughput and ever lower cost. The volume of data that biologists must store and analyse has therefore grown exponentially. In computer science, the acceleration of computations in recent years has been made possible by the parallelisation of algorithms, the multiplication of processing units (cores) per processor [2, 10], and general-purpose computing on graphics processors (GPUs) [11]. However, for the analysis of genomic data, these technologies are not evolving fast enough. Beyond the fact that this enormous mass of information will lead to new biological questions and new applications, one of the major challenges of the next 10 years is to scale computational tools accordingly.

Searching for similarity between DNA sequences stored in large databases is a fundamental step of every genomic study. Numerous algorithms have thus been developed. Still-standard references in the field include local alignment (BLAST [1]), global alignment (Needleman–Wunsch [14]), and alignment methods derived from text comparison [5, 8] or from indexing. Nevertheless, most of these methods require a large number of complex operations, or the loading into main memory of all or part of the raw sequences to perform the comparisons, which, given the growth in the quantity of data to process, demands ever more hardware resources.

2 Objectives and methods

The work presented here proposes a method to accelerate the search for similarities between a candidate sequence and a database of reference sequences. It relies on methods from content-based image retrieval [3], and more specifically on perceptual hashing methods. Unlike the algorithms mentioned above, the goal is not to perform sequence alignments, but to perform an upstream filtering step that returns, for a candidate sequence, all the reference sequences with a non-zero probability of aligning.

To speed up read and write operations, the hash table corresponding to the sequences is stored in a Redis database (a key/value NoSQL store [13, 9]). This distributed database engine has the particularity of keeping all its data in the main memory of the machines. Since the table contains only the hashes (64 bits each) and the sequence identifiers, it turns out to be 8 to 30 times smaller than the total volume of the raw sequences.


The hash-based indexing process relies on perceptual hashing methods [6]. In general, these methods identify a multimedia document (image, sound or video) by generating a unique key (the hash key) from it. Although they resemble cryptographic hashing (e.g., the MD5 algorithm) in some respects, they differ in that the hash keys of two documents presenting a slight difference remain relatively close, and comparable by measures such as the Hamming distance [7]. They are therefore not subject to the “avalanche effect” [4], where a one-bit difference between two documents generates two radically different hash keys, making any distance computation impossible.

The method proposed here enables fast indexing of DNA sequences through a perceptual hashing algorithm [16] based on a Discrete Cosine Transform (DCT) [15]. To be able to apply the DCT and compute the hashes, sequences are first converted into a grayscale image, more precisely a pixel matrix assigning a luminous intensity value to each type of nucleotide: Adenine 63, Thymine 127, Cytosine 191 and Guanine 255.
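A minimal, self-contained sketch of this pipeline follows. The 8×8 window size, the zero-padding and the median thresholding are assumptions of this illustration; the paper's actual implementation [16] is not reproduced here.

```python
import math

# Intensity mapping from the text: A -> 63, T -> 127, C -> 191, G -> 255.
INTENSITY = {"A": 63, "T": 127, "C": 191, "G": 255}

def dct2(block):
    """Naive 2-D DCT-II of a square matrix (O(n^4), for illustration only)."""
    n = len(block)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            out[u][v] = sum(
                block[x][y]
                * math.cos(math.pi * (2 * x + 1) * u / (2 * n))
                * math.cos(math.pi * (2 * y + 1) * v / (2 * n))
                for x in range(n) for y in range(n))
    return out

def dna_phash(seq, n=8):
    """64-bit perceptual hash of (a window of) a DNA sequence: map the
    first n*n nucleotides to grayscale pixels, take the 2-D DCT, and set
    one bit per coefficient above the median."""
    pixels = [INTENSITY.get(c, 0) for c in seq[: n * n].upper()]
    pixels += [0] * (n * n - len(pixels))           # pad short windows
    block = [pixels[i * n:(i + 1) * n] for i in range(n)]
    coeffs = [c for row in dct2(block) for c in row]
    median = sorted(coeffs)[len(coeffs) // 2]
    return sum(1 << i for i, c in enumerate(coeffs) if c > median)

def hamming(h1, h2):
    """Bit-level Hamming distance between two hash keys."""
    return bin(h1 ^ h2).count("1")
```

Unlike a cryptographic hash, a single mutation only perturbs the DCT coefficients slightly, so similar windows tend to yield hash keys at a small Hamming distance; a key/value store such as Redis can then map each 64-bit key to the identifiers of the sequences it indexes.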

3 Evaluation

A first implementation of this method (October 2013) shows that, on a laptop (Core i7 processor with 4 cores), it indexes 500,000 nucleotides per second, i.e. 8,000 hash keys. An evaluation of the quality of the method is currently under way. The goal is to perform BLAST alignments between a database of reference sequences and a set of candidate sequences, first without the indexing step, then with the indexing step applied upstream, and finally to compare the results obtained together with their execution times.

References

[1] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215:403–410, 1990.
[2] V. Balaji. Multi-core processors – an overview. arXiv:1110.3535v1 [cs.AR], 2011.
[3] R. da Silva Torres and A. X. Falcão. Content-based image retrieval: theory and applications. 13(2):165–189, 2006.
[4] H. Feistel. Cryptography and computer privacy. Scientific American, 228(5):15–23, May 1973.
[5] G. Navarro and M. Raffinot. Flexible Pattern Matching in Strings – Practical On-line Search Algorithms for Texts and Biological Sequences. 2002.
[6] J. Haitsma, T. Kalker, and J. Oostveen. Robust audio hashing for content identification. In International Workshop on Content-Based Multimedia Indexing (CBMI), pages 117–124, 2001.
[7] R. W. Hamming. Error detecting and error correcting codes. Bell System Technical Journal, 29(2):147–160, 1950.
[8] D. E. Knuth, J. H. Morris Jr, and V. R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323–350, 1977.
[9] N. Leavitt. Will NoSQL databases live up to their promise? Computer, 43(2):12–14, 2010.
[10] G. Lowney. Why Intel is designing multi-core processors. In Proceedings of the Eighteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 113–113, 2006.
[11] D. Luebke, M. Harris, N. Govindaraju, A. Lefohn, M. Houston, J. Owens, M. Segal, M. Papakipos, and I. Buck. GPGPU: general-purpose computation on graphics hardware. Article no. 208, doi:10.1145/1188455.1188672, 2006.
[12] M. Metzker. Sequencing technologies – the next generation. Nature Reviews Genetics, 11:31–46, 2010.
[13] A. B. M. Moniruzzaman and S. A. Hossain. NoSQL database: new era of databases for big data analytics. arXiv:1307.0191 [cs.DB], 2013.
[14] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[15] K. R. Rao and P. Yip. Discrete Cosine Transform: Algorithms, Advantages, Applications. Academic Press, Boston, 1990.
[16] C. Zauner. Implementation and benchmarking of perceptual image hash functions. Master's thesis, Upper Austria University of Applied Sciences, Hagenberg Campus, 2010.


Tyrosine-1 phosphorylation of Pol II CTD is associated with antisense promoter transcription and active enhancers in mammalian cells

Nicolas Descostes, Martin Heidemann, Ahmad Maqbool, Lionel Spinelli, Romain Fenouil, Marta Gut, Ivo Gut, Dirk Eick and Jean-Christophe Andrau

Genome-wide characterization of transcription mechanisms requires the development of adapted bioinformatics approaches. The development of high-throughput sequencing (HTS) technologies such as ChIP-Seq and RNA-Seq has offered the possibility to explore the inner workings of transcription at almost base-pair resolution. However, the revealed complexity of molecular processes and of sequence organization and composition [2] has brought bioinformaticians to deal with more complex data than ever before. In this talk, I will present how we developed original approaches for HTS data treatment and for the analysis of transcription processes.

I will first present the R package PASHA (Preprocessing of Aligned Sequences from HTS Analyses), which we developed for ChIP-Seq, RNA-Seq and MNase-Seq data treatment. This package deals with removal of sequencing artifacts, elongation of sequenced tags, scoring, binning, multiple alignment of reads, nucleosome positioning and paired-end sequencing. It also produces various statistics and quality controls on the data treatment.

Then, through the case study of Tyrosine-1 phosphorylation (Tyr1P) of the carboxy-terminal domain (CTD) [8, 1, 3] of RNA Polymerase II (Pol II), I will show how the analysis of epigenetic data [5, 9], enhancers [4, 7] and Pol II phospho-isoforms led us to develop specific lines of data analysis. For example, we developed specific strategies for isolating enhancers and for studying the spatial organization of Pol II isoforms. Finally, through the analysis of relevant ChIP-Seq, nucleosome (MNase-Seq) and short strand-specific RNA-Seq data, I will show that, contrary to the previously observed association of Tyr1P with transcription elongation in the yeast model [6], this Pol II post-translational modification in human cells is involved in the initiation phase and in antisense transcription, as well as in enhancer transcription.

References

[1] S. Egloff and S. Murphy. Cracking the RNA polymerase II CTD code. Trends Genet, 24(6):280–288, 2008.
[2] R. Fenouil, P. Cauchy, F. Koch, N. Descostes, J. Cabeza, and C. Innocenti et al. CpG islands and GC content dictate nucleosome depletion in a transcription-independent manner at mammalian promoters. Genome Res, 22(12):2399–2408, 2012.
[3] C. Hintermair, M. Heidemann, F. Koch, N. Descostes, M. Gut, and I. Gut et al. Threonine-4 of mammalian RNA polymerase II CTD is targeted by Polo-like kinase 3 and required for transcriptional elongation. EMBO J, 31(12):2784–2797, 2012.
[4] F. Koch, R. Fenouil, M. Gut, P. Cauchy, T. Albert, and J. Zacarias-Cabeza et al. Transcription initiation platforms and GTF recruitment at tissue-specific enhancers and promoters. Nat Struct Mol Biol, 18(8):956–963, 2011.
[5] T. Kouzarides. Chromatin modifications and their function. Cell, 128(4):693–705, 2007.
[6] A. Mayer, M. Heidemann, M. Lidschreiber, A. Schreieck, M. Sun, and C. Hintermair et al. CTD tyrosine phosphorylation impairs termination factor recruitment to RNA polymerase II. Science, 336(6089):1723, 2012.
[7] G. Natoli and J. Andrau. Noncoding transcription at enhancers: general principles and functional models. Annu Rev Genet, 46:1–19, 2012.
[8] H. Phatnani and A. Greenleaf. Phosphorylation and functions of the RNA polymerase II CTD. Genes Dev, 20(21):2922–2936, 2006.
[9] V. Zhou, A. Goren, and B. Bernstein. Charting histone modifications and the functional organization of mammalian genomes. Nat Rev Genet, 12(1):7–18, 2011.


Abelian Repetition in Sturmian Words

Gabriele Fici¹, Alessio Langiu², Thierry Lecroq³, Arnaud Lefebvre³, Filippo Mignosi⁴ and Élise Prieur-Gaston³

¹ Università di Palermo, Italy
² King's College London, UK
³ Normandie Université, LITIS EA 4108, Université de Rouen, France
⁴ Università dell'Aquila, Italy

In this talk we investigate abelian repetitions in Sturmian words. We exploit a bijection between factors of Sturmian words and subintervals of the unit segment that allows us to study the periods of abelian repetitions by using classical results of elementary number theory. If k_m denotes the maximal exponent of an abelian repetition of period m, we prove that lim sup k_m/m ≥ √5 for any Sturmian word, and that equality holds for the Fibonacci infinite word. We further prove that the longest prefix of the Fibonacci infinite word that is an abelian repetition of period F_j, j > 1, has length F_j(F_{j+1} + F_{j−1} + 1) − 2 if j is even, or F_j(F_{j+1} + F_{j−1}) − 2 if j is odd. This allows us to give an exact formula for the smallest abelian periods of the finite Fibonacci words. More precisely, we prove that for j ≥ 3, the Fibonacci word f_j has abelian period equal to F_n, where n = ⌈j/2⌉ if j ≡ 0, 1, 2 (mod 4), or n = 1 + ⌈j/2⌉ if j ≡ 3 (mod 4) [1].
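The abelian notions above can be explored numerically. The sketch below (our own illustration, not the authors' code) builds a prefix of the Fibonacci word, tests abelian equivalence via Parikh vectors, and measures a crude abelian-power exponent:

```python
from collections import Counter

def fibonacci_word(n):
    """Prefix of length n of the Fibonacci infinite word, fixed point
    of the morphism a -> ab, b -> a."""
    w = "a"
    while len(w) < n:
        w = "".join("ab" if c == "a" else "a" for c in w)
    return w[:n]

def abelian_equiv(u, v):
    """Abelian equivalence: same Parikh vector (letter counts)."""
    return Counter(u) == Counter(v)

def abelian_run(w, start, m):
    """Number of consecutive length-m blocks from `start` that are all
    abelian-equivalent to the first one (a lower bound on the exponent
    of an abelian repetition of period m)."""
    ref = Counter(w[start:start + m])
    k, i = 0, start
    while i + m <= len(w) and Counter(w[i:i + m]) == ref:
        k, i = k + 1, i + m
    return k
```

For instance, in the prefix abaababaabaab..., the blocks ba, ab, ab starting at position 1 are pairwise abelian-equivalent, giving a run of exponent 3 for period 2.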

References

[1] G. Fici, A. Langiu, T. Lecroq, A. Lefebvre, F. Mignosi, and E. Prieur-Gaston. Abelian repetitions in Sturmian words. In M.-P. Béal and O. Carton, editors, Proceedings of the 17th International Conference on Developments in Language Theory (DLT 2013), volume 7907 of Lecture Notes in Computer Science, pages 227–238, Marne-la-Vallée, France, 2013. Springer-Verlag, Berlin.


Co-optimality and Ambiguity, Disentangled

Robert Giegerich and Benedikt Lowes

Faculty of Technology, Bielefeld University, Germany

Summary of the Contribution

Solutions to combinatorial optimization problems are often not unique. Co-optimal solutions are alternatives that achieve the same (optimal) score, while otherwise they need not be similar at all. Dynamic programming algorithms can report co-optimal alternatives, or at least determine their number; but they rarely do. If our optimization problem admits an excessive number of co-optimals, it is ill-defined.

While co-optimality is a property of the model, ambiguity is a property of the algorithm employed for itssolution. An ambiguous algorithm reports the same solution multiple times, often to an exponential degree.Internally creating a larger solution space than actually exists, it also explodes the number of co-optimals.In the presence of ambiguity, we cannot tell whether our model is well-defined or not.

We show how to disentangle these issues, using a recent algorithm for minisatellite alignment as a case study. We observe that the ARLEM algorithm produces an exorbitant number of co-optimal alignments. We show that it is ambiguous, and create a variant that is proved to be unambiguous using formal language technology. The unambiguous algorithm demonstrates that, independent of any algorithm, the model allows for a high number of co-optimals. Hence, we conclude that the present model of minisatellite alignment and duplication history reconstruction is not refined enough to yield meaningful alignments.

The presentation at the SEQBIO workshop is closely based on the publication [3], but emphasizes the general approach over the specific findings concerning minisatellite alignments.

Overview of the method

Prerequisites. The prerequisite of our method is that we can express the dynamic programming problem at hand in the framework of algebraic dynamic programming (ADP) [2]. Its perfect separation of search space and scoring allows us to present our case study on minisatellites as a demonstration of a general method for assessing co-optimality and ambiguity. In ADP, we deal with dynamic programming problems over sequence(s) x. The given problem is encoded by a search space generator G, written in the form of a regular tree grammar, and by an evaluation algebra A that includes the objective function h. Solution candidates are trees in L(G) which carry x as their yield sequence. A(t) denotes the score of candidate t. A problem instance is posed by a sequence x (or several sequences), and the problem is solved by computing

h[A(t) | t ∈ L(G), yield(t) = x].

The square brackets denote multisets, as there may be co-optimal solutions. The framework of ADP is supported by systems such as Bellman's GAP [4].

Methodical steps

1. Encode the original algorithm for the problem at hand by a tree grammar G and an evaluation algebra A (if not already formulated in this way).

2. Design a canonical string representation of solutions. Implement it as a "printing" algebra P such that G(P, x) enumerates the search space in canonical representation, following [1]. Duplicates in the resulting multiset indicate semantic ambiguity, i.e. the same solution is found several times.

3. Revise G into an unambiguous grammar G₀.


4. Prove that G₀ is semantically unambiguous, i.e. G₀(P, x) has no duplicates for all x. (One way to achieve this is to partially evaluate G₀ and P into a context-free string grammar G₀^cfg, whose syntactic unambiguity (in the formal language sense) can be shown by an ambiguity checker and, if positive, guarantees semantic unambiguity of G₀.)

5. Compare the co-optimal answers of G(A, x) and G₀(A, x). If their number is reduced to a tolerable degree, the underlying problem is well-defined.
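A toy illustration of steps 2–5 (our own example, unrelated to minisatellite alignment): an ambiguous grammar S → SEG | S S for segmenting a string into segments of one or two characters derives the same canonical segmentation once per binary split tree, while a right-linear revision S → SEG | SEG S derives each segmentation exactly once:

```python
from collections import Counter

def candidates(x):
    """Canonical strings of all derivations of the AMBIGUOUS grammar
    S -> SEG | S S, SEG a segment of one or two characters."""
    results = [x] if 1 <= len(x) <= 2 else []
    for i in range(1, len(x)):  # S -> S S, every split point
        for left in candidates(x[:i]):
            for right in candidates(x[i:]):
                results.append(left + "|" + right)
    return results

def candidates_unambiguous(x):
    """Right-linear revision S -> SEG | SEG S: one derivation each."""
    out = []
    for j in (1, 2):
        if j > len(x):
            continue
        head, tail = x[:j], x[j:]
        if tail:
            out.extend(head + "|" + rest
                       for rest in candidates_unambiguous(tail))
        else:
            out.append(head)
    return out
```

Both enumerations cover the same search space, but only the ambiguous one contains duplicates, and counting those duplicates is exactly the diagnostic of step 2.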

Fallacies. Although systematic, this is by no means an automated method. It requires creativity in all steps and may fail. (1) The given algorithm may be difficult to express in algebraic style. (2) A canonical string representation P may not be obvious and may require some clever encoding. (3) Grammar G₀ may be tricky to construct. In fact, if the language G(P, x), for all x, is inherently ambiguous, G₀^cfg does not exist. (4) It may be a debatable matter of expert judgement what is to be considered a tolerable degree of co-optimality.

In the case of minisatellite alignment we observed that G₀(A, x) has about 10^|x| co-optimal solutions, and we conclude that the underlying model is ill-defined.

References

[1] R. Giegerich. Explaining and controlling ambiguity in dynamic programming. In Proceedings of Combinatorial Pattern Matching, volume 1848 of Lecture Notes in Computer Science, pages 46–59. Springer, 2000.
[2] R. Giegerich, C. Meyer, and P. Steffen. A discipline of dynamic programming over sequence data. Science of Computer Programming, 51(3):215–263, 2004.
[3] B. Lowes and R. Giegerich. Avoiding ambiguity and assessing uniqueness in minisatellite alignment. In Beissbarth et al., editors, German Conference on Bioinformatics, OASIcs, 2013.
[4] G. Sauthoff, M. Möhl, S. Janssen, and R. Giegerich. Bellman's GAP – a language and compiler for dynamic programming in sequence analysis. Bioinformatics, 2013.


Folded self-avoiding walks related to protein folding

Christophe Guyeux

Keywords: self-avoiding walks; protein folding; conformation prediction

Let us begin by recalling the definition of self-avoiding walks [4].

Definition 2 (Self-avoiding walks) Let d ≥ 1. An n-step self-avoiding walk from x ∈ Z^d to y ∈ Z^d is a function w : ⟦0, n⟧ → Z^d such that:

— w(0) = x and w(n) = y,
— |w(i+1) − w(i)| = 1 for all i,
— ∀ i, j ∈ ⟦0, n⟧, i ≠ j ⇒ w(i) ≠ w(j).
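This definition translates directly into code. The brute-force sketch below (our own illustration, for d = 2) checks the self-avoidance condition and counts n-step walks from the origin:

```python
from itertools import product

STEPS = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # unit steps of Z^2 (d = 2)

def is_self_avoiding(steps):
    """True iff the walk w(0) = origin, w(i+1) = w(i) + steps[i]
    satisfies w(i) != w(j) for all i != j."""
    pos, seen = (0, 0), {(0, 0)}
    for dx, dy in steps:
        pos = (pos[0] + dx, pos[1] + dy)
        if pos in seen:
            return False
        seen.add(pos)
    return True

def count_saws(n):
    """Brute-force count of n-step self-avoiding walks in Z^2."""
    return sum(is_self_avoiding(w) for w in product(STEPS, repeat=n))
```

The first counts are 4, 12, 36, ...: every 1-step walk is self-avoiding, and from step 2 on the immediate backtrack is forbidden.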

We noticed that, among the bioinformatics methods used to predict the 3D conformation of proteins, some cannot reach every self-avoiding walk (SAW) while others can. Thus, some prediction tools are unable to produce certain conformations, while others reach all possible shapes. Starting from this observation, the question is which behaviour makes the most sense biologically. As a side issue, the NP-completeness proof for the protein structure prediction problem does not hold in every context.

We therefore rediscovered and exploited a non-ergodicity result for certain kinds of transformations of self-avoiding walks, deduced from it a subfamily of folded self-avoiding walks (FSAWs) that arises naturally in protein structure prediction tools, and obtained first results on these FSAWs. One question driving our research is to estimate the SAW/FSAW ratio, in order to know whether prediction tools focusing on FSAWs lose many possible protein shapes or not.

The aim of the proposed talk is to position this problem as it appeared to us, to review the knowledge we have obtained about FSAWs, to list open problems, and to discuss the consequences for 3D protein conformation prediction [1, 2]. In particular, I will show that nowhere-unfoldable SAWs are infinite in number and can be arbitrarily large, and that below a certain size every SAW is an FSAW. Finally, I will exhibit the smallest nowhere-unfoldable self-avoiding walk currently known.

Figure 2 – The first SAW that is not an FSAW (Madras and Sokal [3])


References

[1] J. Bahi, C. Guyeux, K. Mazouzi, and L. Philippe. Computational investigations of folded self-avoiding walks related to protein folding. Journal of Bioinformatics and Computational Biology, 2013. Accepted manuscript, to appear.
[2] C. Guyeux, N. M.-L. Côte, W. Bienia, and J. Bahi. Is the protein folding problem really NP-complete? First investigations. Journal of Bioinformatics and Computational Biology, 2013. Accepted manuscript, to appear.
[3] N. Madras and A. D. Sokal. The pivot algorithm: a highly efficient Monte Carlo method for the self-avoiding walk. Journal of Statistical Physics, 50:109–186, 1988.
[4] N. Madras and G. Slade. The Self-Avoiding Walk. Probability and its Applications. Birkhäuser, Boston, 1993.


New software for mapping high-throughput reads for genomic and metagenomic data

Evguenia Kopylova, Laurent Noé, Mikaël Salson and Hélène Touzet

LIFL, UMR 8022 CNRS, Université Lille 1, and INRIA Lille Nord Europe, France

Context

The arrival of high-throughput sequencing technologies has introduced new problems for read mapping. The main challenges involve efficient processing of large amounts of sequenced read data and delivering robust algorithms generic to different sequencing technologies and their characteristic error types. The nature of the reads to map depends on the sequencing technology. Today, Illumina, 454 and Ion Torrent produce read lengths of 100–1000 bp with the lowest error rates in one round of sequencing, whilst the single-molecule sequencing platform PacBio (Pacific Biosciences) can produce average read lengths of 4600 bp but with a much higher error rate than the other technologies (nearly 15%, mostly indels). Furthermore, new applications such as community metagenomics and metatranscriptomics require sensitive algorithms capable of aligning low-complexity regions and distantly related species.

Read mapping problem

Without applying heuristics, algorithms which identify both substitution and indel errors are computationally expensive for large quantities of data. Although heuristics can effectively speed up the algorithm, many existing alignment tools such as BWA-SW [4], SOAP2 [5] and Bowtie2 [3] (all based on the Burrows-Wheeler Transform (BWT) [1]) have been optimized for genome resequencing (∼99% sequence similarity) and impose error-free or substitution-only 'seeding' techniques for identifying short homologs prior to extending an alignment using dynamic programming. In the last decade, the application of sequencing technologies has been extended to metagenomics, that is to DNA extracted directly from an environmental sample. Raw samples of microbial organisms can be easily sequenced in parallel, and this new culture-independent practice allows for the study of all genomes recovered from an environmental community. For this type of application, where the reference sequences may share ∼75–98% similarity to the query, the aforementioned tools are no longer sensitive enough. Moreover, sequencing technologies such as Ion Torrent, 454 and PacBio introduce artifacts into the reads in the form of indel errors. In these contexts, approximate seeds allowing mismatch and indel errors anywhere in the seed would serve as an optimal choice (but at some computational cost), and few tools exist today that implement them efficiently.

Methods

In this talk we present a new software tool called SortMeDNA, which can map reads generated by second- and third-generation technologies in genomic and metagenomic studies. SortMeDNA uses approximate seeds (based on the universal Levenshtein automaton [7, 6]) allowing up to one error: the error can be either a mismatch or an indel, and its position in the seed is not predetermined. This unique feature gives the seed flexibility for different error types, such as indels in 454 reads and the unpredictable error distribution readily observed in PacBio reads, and for capturing similarities between distantly related species. Furthermore, we introduce a new arborescent indexing data structure based on a lookup table and the Burst trie [2], which is specifically tailored to perform fast queries in large texts using this approximate seed. Lastly, SortMeDNA also applies statistical analysis to evaluate the significance of an alignment, which becomes important when aligning distantly related species.
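As a simplified stand-in for the universal Levenshtein automaton (which decides the same question without enumerating cases), the one-error seed criterion can be written naively as:

```python
def within_one_error(seed, window):
    """True iff `window` matches `seed` with at most one mismatch,
    insertion or deletion (edit distance <= 1), the error being
    allowed anywhere in the seed."""
    if seed == window:
        return True
    if len(seed) == len(window):  # one substitution
        return sum(a != b for a, b in zip(seed, window)) == 1
    if abs(len(seed) - len(window)) == 1:  # one indel
        short, long_ = sorted((seed, window), key=len)
        i = 0
        while i < len(short) and short[i] == long_[i]:
            i += 1
        return short[i:] == long_[i + 1:]
    return False
```

This captures the flexibility described above: the same seed matches windows carrying a substitution (Illumina-like errors) or an indel (454/Ion Torrent/PacBio-like errors) at any position.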


Experimental results

The performance of SortMeDNA was evaluated against a representative selection of read mappers: Bowtie2, BWA-SW and SHRiMP2. Our tests took into consideration sequencing data produced by Illumina, 454, Ion Torrent and PacBio technologies, and covered a wide variety of applications, from genomic to metagenomic projects. SortMeDNA proved to be the most robust tool of the set, quickly and accurately mapping all types of reads using one set of default parameters. However, the tradeoff for maintaining such seeds is the new text indexing data structure, which must accommodate searches with insertions and deletions and therefore requires more space than the BWT. We implement an index fragmentation technique which nonetheless allows convenient utilization of the tool.

References

[1] M. Burrows and D. Wheeler. A block-sorting lossless data compression algorithm. Technical report, 1994.
[2] S. Heinz, J. Zobel, and H. E. Williams. Burst tries: a fast, efficient data structure for string keys. ACM Transactions on Information Systems, 20:192–223, 2002.
[3] B. Langmead and S. Salzberg. Fast gapped-read alignment with Bowtie 2. Nat Methods, 9:357–359, 2012.
[4] H. Li and R. Durbin. Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754–1760, 2009.
[5] R. Li, C. Yu, and Y. Li. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 25:1966–1967, 2009.
[6] S. Mihov and K. Schulz. Fast approximate search in large dictionaries. J. Comput. Ling., 30:451–477, 2004.
[7] K. Schulz and S. Mihov. Fast string correction with Levenshtein automata. IJDAR, 5:67–85, 2002.


sumatra and sumaclust: fast and exact comparison and clustering of sequences

Céline Mercier, Frédéric Boyer, Aurélie Bonin and Eric Coissac

Laboratoire d'Écologie Alpine, UMR 5553 CNRS, LECA BP 53, 2233 Rue de la Piscine, 38041 Grenoble Cedex 9, France

Next-generation sequencing (NGS) has undergone impressive developments within the last decade [11]. Today, an experiment can produce several million sequences, and efficient tools are needed to handle these volumes of data in reasonable amounts of time. The development of NGS has found numerous applications in the assessment and description of all forms of genetic diversity, from species to populations to individuals [1, 3]. In particular, DNA metabarcoding now allows high-throughput monitoring of biodiversity without requiring the collection and identification of specimens in the field [13, 12]. This approach is therefore widely used in environmental microbiology and is becoming fairly popular for many ecological studies involving biodiversity assessment. DNA metabarcoding relies on the extraction of DNA from environmental samples (soil or water samples, for example). Once the DNA is extracted, short DNA fragments called markers or barcodes (by analogy with DNA barcoding [7]) are amplified by PCR and sequenced. The tools used to treat the several million sequences produced by a DNA metabarcoding experiment have to be efficient, but also adapted to the type of data produced by DNA metabarcoding, i.e. entirely sequenced, short markers.

Classification methods are a key aspect in the analysis of DNA metabarcoding data. Sequences can be assigned to taxa by comparing them, with a supervised classification method, to a reference database containing barcodes of described taxa. However, such reference databases are not always available or exhaustive (typically for microorganisms). In that case, unsupervised classification makes it possible to cluster highly similar sequences into groups called Molecular Operational Taxonomic Units (MOTUs) [2], which become the unit of measurement for biodiversity. Choosing an adapted clustering procedure requires thinking about the reasons that make the clustering necessary [8]. Indeed, depending on the similarity measure and on the clustering algorithm used, the result of this classification procedure can be highly variable, both in terms of number of clusters and of cluster composition.

Here, we present sumaclust and sumatra, a package of two programs which aim to compare sequences in a way that is both fast and exact, unlike the most popular clustering methods, which usually rely on fast heuristics, such as uclust [5] or cd-hit [6]. sumaclust and sumatra are devoted to the type of data generated by DNA metabarcoding, i.e. entirely sequenced, short markers. A clustering process has two components: the first is the computation of the pairwise similarities between sequences, and the second is the clustering itself. sumatra performs the first step, the computation of the pairwise similarities. The output can then go through a classification process with programs such as mcl [4] or mothur [10]. sumaclust performs both the similarity computation and the clustering, using the same clustering algorithm as uclust and cd-hit.

Four elements in particular make sumaclust and sumatra interesting for handling DNA metabarcoding data:

Clustering algorithm. When representing the similarities between sequences as a graph, with sequences as vertices and similarities as edges, erroneous sequences and the true sequences from which they were generated during the PCR or sequencing steps appear as star-shaped clusters. The 'true' sequences are then the centres of the 'stars', and all the erroneous variants are linked to the centre from which they derive. This form of clustering corresponds to the one produced by the clustering algorithm used by sumaclust, uclust and cd-hit, which makes them well adapted to finding erroneous sequences.


Order of the sequences. Since the clustering procedure implemented in these tools is greedy, cluster composition is highly dependent on the order of the sequences. Clustering sequences sorted by length will lead to cluster centres corresponding to the longest sequences, whereas clustering sequences sorted by count will lead to the most abundant sequences as cluster centres. This justifies the fact that sumaclust sorts sequences by count, because 'true' sequences should be more abundant than erroneous sequences and should become the centres of their clusters.
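A minimal sketch of this abundance-seeded greedy scheme (our own simplification; the real tools use alignment-based identity and optimized indexing, and the `identity` function below is only a placeholder):

```python
def identity(a, b):
    """Placeholder similarity: matching positions over the longer
    length (the real tools use alignment-based identity)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def greedy_cluster(seqs_with_counts, threshold=0.9):
    """uclust/cd-hit-style greedy clustering seeded by abundance:
    sequences are visited by decreasing count; each joins the first
    centre it is similar enough to, or founds a new centre."""
    centres = {}  # centre sequence -> list of member sequences
    for seq, _count in sorted(seqs_with_counts, key=lambda p: -p[1]):
        for centre in centres:
            if identity(seq, centre) >= threshold:
                centres[centre].append(seq)
                break
        else:
            centres[seq] = [seq]
    return centres
```

Because abundant sequences are visited first, they found the clusters, and their rare erroneous variants attach to them, producing the star shape described above.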

Similarity index. uclust and cd-hit perform semiglobal alignments. Semiglobal alignments aim to find the best possible alignment that includes the whole length of one of the sequences, in their case the shortest one. This method was of interest in the first works on bacterial DNA metabarcoding, when barcodes were long and generally not entirely sequenced. Nowadays, barcodes are generally entirely sequenced. Moreover, the size polymorphisms of barcodes are part of the signal used to differentiate the studied taxa. They consequently have to be aligned over their whole lengths, with global alignment algorithms.

Speed and exactness. sumatra and sumaclust are implemented using a banded alignment algorithm [9], which is very efficient when high thresholds are used, as is often the case in DNA metabarcoding. A lossless k-mer filter makes it possible to align only the pairs of sequences that potentially present an identity greater than the chosen threshold. Besides, the filter and alignment steps are both parallelized using Single Instruction Multiple Data (SIMD) instructions, yielding execution times similar to those of the most popular heuristic-based programs, while remaining exact.
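The lossless filter idea can be sketched with the classical q-gram lemma (our own illustration; the actual filter in sumatra/sumaclust may differ): two strings of length n within e edits share at least (n − k + 1) − k·e k-mers, so any pair sharing fewer can be skipped without ever losing a pair above the threshold.

```python
from collections import Counter

def kmer_profile(s, k):
    """Multiset of the k-mers of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def passes_filter(a, b, k, threshold):
    """q-gram lemma filter: strings of length n within e edits share at
    least (n - k + 1) - k*e k-mers (with multiplicity), so a pair
    sharing fewer k-mers cannot reach the identity threshold."""
    n = min(len(a), len(b))
    max_errors = int((1 - threshold) * n)
    bound = (n - k + 1) - k * max_errors
    if bound <= 0:
        return True  # the filter cannot reject; fall back to alignment
    shared = sum((kmer_profile(a, k) & kmer_profile(b, k)).values())
    return shared >= bound
```

The filter only ever says "maybe" or "definitely below threshold", which is what makes the overall method exact despite skipping most alignments.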

In conclusion, sumatra and sumaclust combine the speed of heuristic methods with the accuracy of exact methods, and their characteristics make them particularly well adapted to DNA metabarcoding data.

References

[1] H. M. Bik, D. L. Porazinska, S. Creer, J. G. Caporaso, R. Knight, and W. K. Thomas. Sequencing our way towards understanding global eukaryotic biodiversity. Trends Ecol Evol, 27(4):233–43, Apr 2012.
[2] M. Blaxter, J. Mann, T. Chapman, F. Thomas, C. Whitton, R. Floyd, and E. Abebe. Defining operational taxonomic units using DNA barcode data. Philos Trans R Soc Lond B Biol Sci, 360(1462):1935–43, Oct 2005.
[3] P. I. Diaz, A. K. Dupuy, L. Abusleme, B. Reese, C. Obergfell, L. Choquette, A. Dongari-Bagtzoglou, D. E. Peterson, E. Terzi, and L. D. Strausbaugh. Using high throughput sequencing to explore the biodiversity in oral bacterial communities. Mol Oral Microbiol, 27(3):182–201, Jun 2012.
[4] S. van Dongen. Graph clustering by flow simulation. PhD thesis, University of Utrecht, Utrecht, The Netherlands, 2000.
[5] R. C. Edgar. Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26(19):2460–1, Oct 2010.
[6] L. Fu, B. Niu, Z. Zhu, S. Wu, and W. Li. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23):3150–2, Dec 2012.
[7] P. D. N. Hebert, A. Cywinska, S. L. Ball, and J. R. deWaard. Biological identifications through DNA barcodes. Proc Biol Sci, 270(1512):313–21, Feb 2003.
[8] S. M. Huse, D. M. Welch, H. G. Morrison, and M. L. Sogin. Ironing out the wrinkles in the rare biosphere through improved OTU clustering. Environ Microbiol, 12(7):1889–98, Jul 2010.
[9] J. B. Kruskal. An overview of sequence comparison: time warps, string edits, and macromolecules. SIAM Review, 25(2):201–237, 1983.
[10] P. D. Schloss, D. Gevers, and S. L. Westcott. Reducing the effects of PCR amplification and sequencing artifacts on 16S rRNA-based studies. PLoS One, 6(12):e27310, 2011.
[11] J. Shendure and H. Ji. Next-generation DNA sequencing. Nat Biotechnol, 26(10):1135–45, Oct 2008.
[12] P. Taberlet, E. Coissac, F. Pompanon, C. Brochmann, and E. Willerslev. Towards next-generation biodiversity assessment using DNA metabarcoding. Mol Ecol, 21(8):2045–50, Apr 2012.
[13] A. Valentini, F. Pompanon, and P. Taberlet. DNA barcoding for ecologists. Trends Ecol Evol, 24(2):110–7, Feb 2009.


New Motif Extraction and Weighting Algorithms for Protein Classification

Faouzi Mhamdi, Mehdi Kchouk and Salma Aouled El Haj Mohamed

Laboratoire de Recherche en Technologies de l'Information et de la Communication & Génie Électrique, ESTT – Université de Tunis, Tunisia

Faouzi Mhamdi: [email protected]
Mehdi Kchouk: [email protected]
Salma Aouled El Haj Mohamed: [email protected]

Abstract: In their primary structure, biological data are represented as strings of characters. The goal of this work is to classify protein sequences automatically. To do so, it seemed appropriate to use a well-known data mining process: KDD (Knowledge Discovery in Databases). We focus on the first phase of KDD, preprocessing, and more specifically on the motif extraction task. Motif extraction amounts to generating a set of descriptors that are then presented to supervised learning algorithms to perform the classification. The best-known extraction method is the n-gram, a string of n consecutive characters. Standard n-gram algorithms fix the value of n a priori, extract the n-grams, and work with that value throughout the protein classification process. In this paper, we propose an n-gram construction algorithm that produces descriptors of variable size, and then a new weighting method based on dynamic programming. The performance of these new proposals is evaluated by the error rate obtained with a linear SVM classifier. Experiments on real biological data give good results compared with previous work.
Keywords: KDD, feature extraction, n-gram, feature weighting, protein sequence classification, SVM, biological data

1 Motif extraction algorithm

Our algorithm is a motif extraction algorithm following a top-down hierarchical approach, which consists in building n-grams (motifs) of variable size. In general, the hierarchy extracts the (n − i)-grams from the n-grams as long as n − i ≥ 2. We use the term top-down because we start by extracting the motifs of size n (with n given by the user); as a second step, we extract the motifs of size n − 1, and we repeat the procedure, decreasing the size of the extracted motifs at each step from n − 1 down to 2. For example, if n = 5, the algorithm first extracts the motifs of size 5, then 4, 3 and finally 2. The algorithm hierarchically extracts descriptors from protein sequences grouped into families, and with this idea we attempt to show that increasing the size of the n-grams, so as to mix as many descriptors as possible, yields better results.

Figure 1 – Top-down descriptor construction process.

2 Weighting method

To perform the weighting, we build a weighting matrix whose rows represent the protein sequences of two different protein families and whose columns contain the attributes extracted by the n-gram technique. Each sequence/attribute cell is the S/W alignment score, computed by a function that we adapted to our needs. This score is expressed as a percentage and indicates whether the attribute occurs in the sequence in full or only in part: if the whole attribute is present its score is 100%, and the score decreases as the number of missing characters grows. In our setting the scores lie between 0 and 1. The last column of the matrix gives the family the sequence belongs to. This method allows sequences to be classified more precisely and assigned to the right family. To demonstrate the effectiveness of this weighting, we compare it with the existing weighting schemes (Boolean, occurrence, frequency and TF*IDF weighting).
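The adapted S/W score is not specified in detail here; a crude stand-in that captures the idea (a fully present attribute scores 1, partial matches score proportionally less) could look like this sketch, with names of our own choosing:

```python
def partial_match_score(attribute, sequence):
    """Illustrative stand-in for the adapted S/W score: the length of
    the longest piece of the attribute found in the sequence, divided
    by the attribute length (1.0 = attribute present in full)."""
    for k in range(len(attribute), 0, -1):
        if any(attribute[i:i + k] in sequence
               for i in range(len(attribute) - k + 1)):
            return k / len(attribute)
    return 0.0

def weighting_matrix(sequences, attributes):
    """Rows = sequences, columns = n-gram attributes."""
    return [[partial_match_score(a, s) for a in attributes]
            for s in sequences]

print(weighting_matrix(["AMKVB", "AMKB"], ["MKV"]))
# [[1.0], [0.6666666666666666]]
```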

Figure 2 – Example of weighting based on S/W alignment with 2- and 4-grams.

3 Conclusion

We have presented a new algorithm for extracting features from biological sequences and a new method for weighting these features. The goal is to improve the classification error rate. We obtained good results compared with previous work.


References

[1] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. J Mol Biol, 1990.

[2] C. Cortes and V. N. Vapnik. Support Vector Networks. Machine Learning J, 1995.

[3] M. Maddouri and M. Elloumi. Encoding of primary structures of biological macromolecules within a data mining perspective. Journal of Computer Science and Technology, pages 78–88, 2004.

[4] F. Mhamdi, R. Rakotomalala, and M. Elloumi. Textmining, Features Selection and Datamining for Proteins Classification. ICTTA, IEEE Catalog Number, Damascus, Syria, 2004.

[5] F. Mhamdi, R. Rakotomalala, and M. Elloumi. A hierarchical approach of n-grams extraction from proteins classification. SITIS (Hammamet, Tunisia), 1, 2006.

[6] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. J Mol Biol, pages 195–197, 1981.


Suffix tree and suffix array of an alignment

Joong Chae Na1, Heejin Park2, Sunho Lee3, Minsung Hong3, Thierry Lecroq4, Laurent Mouchard4 and Kunsoo Park3

1 Sejong University, Korea
2 Hanyang University, Korea
3 Seoul National University, Korea
4 Normandie Universite, LITIS EA 4108, Universite de Rouen, France

The huge amount of individual genome sequencing raises the problem of indexing highly similar sequences. Classical indexing structures such as generalized suffix trees or generalized suffix arrays do not offer satisfactory solutions to this problem because they do not sufficiently exploit the redundancies inside these data.

In this talk we present two recently introduced data structures aimed at addressing the problem: the suffix tree of an alignment [1] and the suffix array of an alignment [2].

The alignment of two sequences x = αβγ and y = αδγ is denoted by α(β/δ)γ, where α and γ are respectively the longest common prefix and the longest common suffix of x and y (β and δ are thus different). The suffix tree and the suffix array of such an alignment represent all the suffixes of x and y while storing the suffixes common to the two sequences only once.
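A small Python check (a toy example of our own, not part of the original abstract) makes the sharing explicit: every suffix of the common part γ is a suffix of both x and y, so an index of the alignment needs to store those suffixes only once.

```python
def suffixes(s):
    """All non-empty suffixes of s."""
    return {s[i:] for i in range(len(s))}

alpha, beta, delta, gamma = "ac", "gg", "t", "ca"
x = alpha + beta + gamma   # "acggca"
y = alpha + delta + gamma  # "actca"

common = suffixes(x) & suffixes(y)
# The suffixes of gamma are shared between x and y.
print(suffixes(gamma) <= common)  # True
```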

The suffix tree of the alignment α(β/δ)γ can be constructed in time O(|x| + |α∗| + |δ| + |γ̂|), where α∗ is the longest suffix of α that occurs twice in x or y, and γ̂ is the longest prefix of γ such that dγ̂ occurs twice in x and y (d is the symbol preceding γ).

The suffix array of the alignment α(β/δ)γ can be constructed in time O(|x| + |α∗| + |δ| + |γ∗|), where γ∗ is the longest prefix of γ that occurs twice in x or y.

The two structures can be extended to several non-common regions and to more than two sequences. Pattern matching in these two structures can be done with complexities similar to those of the classical structures.

References

[1] J. C. Na, H. Park, M. Crochemore, J. Holub, C. S. Iliopoulos, L. Mouchard, and K. Park. Suffix tree of an alignment: An efficient index for similar data. In T. Lecroq and L. Mouchard, editors, Proceedings of the 24th International Workshop on Combinatorial Algorithms (IWOCA 2013), number 8288 in Lecture Notes in Computer Science, Rouen, France, 2013. Springer-Verlag, Berlin. To appear.

[2] J. C. Na, H. Park, S. Lee, M. Hong, T. Lecroq, L. Mouchard, and K. Park. Suffix array of alignment: A practical index for similar data. In O. Kurland, M. Lewenstein, and E. Porat, editors, Proceedings of the 20th International Symposium on String Processing and Information Retrieval (SPIRE 2013), number 8214 in Lecture Notes in Computer Science, pages 243–254, Jerusalem, Israel, 2013. Springer-Verlag, Berlin.


Comparison of sets of homologous gene transcripts

Aida Ouangraoua1, Krister M. Swenson2 and Anne Bergeron3

1 INRIA Lille, LIFL, Universite Lille 1, Villeneuve d’Ascq, France
2 Universite de Montreal and McGill University, Canada
3 LaCIM, Universite du Quebec a Montreal, Montreal, Canada

Aida Ouangraoua: aida.ouangraoua@inria
Krister M. Swenson: [email protected] 1
Anne Bergeron: [email protected]

One of the most intriguing and powerful discoveries of the post-genomic era is the revelation of the extent of alternative splicing in eukaryote genomes, where a single gene sequence can produce a multitude of transcripts [2, 3]. The “one gene, one protein” dogma of the last century has been shattered into pieces, and these pieces tell a story in which genome sequences acquire new function not only by mutation, but by being processed differently.

Most studies on the analysis of gene transcript variants are based on cataloging splicing events between pairs of transcripts. For instance, in a recent paper [5], an analysis was carried out on hundreds of transcripts from over three hundred genes in human, mouse and other genomes, yielding dozens of conserved or species-specific splicing events. The results are given as combined statistics by species or group of species, and cataloged as one of 68 different kinds of splicing events.

However, beyond recognizing that two transcripts are conserved between species, or that a specific alternative splicing event is conserved, there is no formal setting for the comparison of two or more sets of transcripts that are transcribed from homologous genes of different genomes. The most widely used approach is to compare all pairs of transcripts within a set, or between two sets (see [11] for a review).

There are several hurdles on the way to a good representation. The first comes from the fact that, when alternative transcripts were scarce, much of the focus was directed towards the representation of alternative splicing events: splicing graphs [4] or pictograms [1] are adequate, but do not scale easily to genes that can have dozens of transcripts, or to comparisons between multiple species. Other representation techniques, such as bit matrices and codes (see [6, 9] and references therein), proposed for the identification and categorization of alternative splicing events, are often more appropriate for computers than for human beings. A second problem is the identification of the features to compare. The splicing machinery is entangled with a myriad of bits and pieces that can vary within and between species: transcripts, coding sequences, exons, introns, splicing donor and acceptor sites, start and stop codons, untranslated regions of arbitrary lengths, frame shifts, etc. Ideally, a model would capture as much as is known about transcripts, including the underlying sequences. In that direction, the goal of the Exalign method [8, 10] is to integrate the exon-intron structure of transcripts with gene comparison, in order to find “splicing orthology” for pairs of transcripts. What about integrating the whole structure of orthologous sets of transcripts in gene comparison, and the discovery of homologous genes?

Here we propose a switch from the paradigm of comparing single transcripts between species to comparing all transcripts from a global perspective, rather than focusing on specific splicing events. We describe a representation of sets of transcripts that is straightforward, readable by both humans and computers, and able to incorporate the various mechanisms that drive transcript evolution (see [7] for a detailed description of the model). This representation yields very flexible tools to compare sets of transcripts. It serves as a powerful representation for the reconstruction of the evolutionary histories of sets of transcripts, and for the evaluation of the structural similarity between potentially homologous genes. On the other hand, the model has a precise, formal specification that ensures its coherence, consistency and scalability. We show several applications, among them a comparison of 24 Smox gene transcripts across five species.

1. current affiliation: Institut de Biologie Computationnelle, Montpellier, France


References

[1] D. Bollina, B. Lee, T. Tan, and S. Ranganathan. ASGS: an alternative splicing graph web service. Nucleic Acids Res., 34:W444–447, Jul 2006.

[2] P. Carninci, T. Kasukawa, S. Katayama, et al. The transcriptional landscape of the mammalian genome. Science, 309:1559–1563, Sep 2005.

[3] T. E. P. Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447:799–816, 2007.

[4] S. Heber, M. Alekseyev, S. Sze, H. Tang, and P. Pevzner. Splicing graphs and EST assembly problem. Bioinformatics, 18 Suppl 1:S181–188, 2002.

[5] J. Mudge, A. Frankish, J. Fernandez-Banet, T. Alioto, T. Derrien, C. Howald, A. Reymond, R. Guigo, T. Hubbard, and J. Harrow. The origins, evolution and functional potential of alternative splicing in vertebrates. Molecular Biology and Evolution, 28:2949–2959, Oct 2011.

[6] H. Nagasaki, M. Arita, T. Nishizawa, M. Suwa, and O. Gotoh. Automated classification of alternative splicing and transcriptional initiation and construction of visual database of classified patterns. Bioinformatics, 22(10):1211–1216, 2006.

[7] A. Ouangraoua, K. M. Swenson, and A. Bergeron. On the comparison of sets of alternative transcripts. In ISBRA, LNCS 7292, pages 201–212, 2012.

[8] G. Pavesi, F. Zambelli, C. Caggese, and G. Pesole. Exalign: a new method for comparative analysis of exon-intron gene structures. Nucleic Acids Res, 36:e47, May 2008.

[9] M. Sammeth, S. Foissac, and R. Guigo. A general definition and nomenclature for alternative splicing events. PLoS Computational Biology, 8:e1000147, 2008.

[10] F. Zambelli, G. Pavesi, C. Gissi, D. Horner, and G. Pesole. Assessment of orthologous splicing isoforms in human and mouse orthologous genes. BMC Genomics, 11:534, 2010.

[11] M. Zavolan and E. van Nimwegen. The types and prevalence of alternative splice forms. Curr. Opin. Struct. Biol., 16:362–367, Jun 2006.


Disentangling homeologous contigs in tetraploid assembly: application to durum wheat

Vincent Ranwez1*, Yan Holtz2, Gautier Sarah2, Morgane Ardisson2, Sylvain Santoni2, Sylvain Glemin3, Muriel Tavaud-Pirra1 and Jacques David1

1 Montpellier SupAgro, UMR AGAP, F-34060 Montpellier, France
2 INRA, UMR AGAP, F-34060 Montpellier, France
3 Institut des Sciences de l’Evolution de Montpellier (ISE-M), UMR 5554 CNRS, Universite Montpellier II, place E. Bataillon, CC 064, 34095 Montpellier cedex 05, France

* Corresponding author

Vincent Ranwez : [email protected] Holtz : [email protected] Sarah : [email protected] Ardisson : [email protected] Santoni : [email protected] Glemin : [email protected] Tavaud-Pirra : [email protected] David : [email protected]

Context

With high-throughput sequencing tools, SNP detection has become relatively routine for diploid species. It remains problematic for polyploid species, however, notably because homeologous loci can be confused and erroneously assembled into a single contig. We propose a method that efficiently splits such chimeric contigs into two homologous contigs, based on the differential expression of the two copies. The HomeoSplitter software, which implements this solution, handles these homeolog-mixing problems efficiently through a maximum-likelihood approach.
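The published likelihood model is described in [1]; the intuition of detecting a chimeric contig through the differential expression of its two copies can be sketched as follows (a toy binomial comparison of our own, not the HomeoSplitter model):

```python
from math import comb, log

def loglik(site_counts, p):
    """Binomial log-likelihood of (alt_reads, depth) counts per site."""
    return sum(log(comb(d, a)) + a * log(p) + (d - a) * log(1 - p)
               for a, d in site_counts)

def split_evidence(site_counts):
    """Log-likelihood gain of an unbalanced allele ratio (fitted p)
    over a balanced one (p = 0.5). A large gain suggests the contig
    mixes two differentially expressed homeologous copies."""
    p_hat = sum(a for a, d in site_counts) / sum(d for a, d in site_counts)
    p_hat = min(max(p_hat, 1e-9), 1 - 1e-9)   # keep log() well-defined
    return loglik(site_counts, p_hat) - loglik(site_counts, 0.5)

print(split_evidence([(9, 10), (8, 10), (9, 10)]) > 0)   # skewed: True
print(abs(split_evidence([(5, 10), (5, 10)])) < 1e-9)    # balanced: True
```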

We validated HomeoSplitter on real RNA-seq data from thirty accessions of durum wheat (Triticum turgidum, a tetraploid containing the A and B genomes, 2n=4x=28). The transcriptomes of the diploid species that contributed the elementary genomes, Aegilops speltoides (close to the B genome) and Triticum urartu (close to the A genome), were used as a point of comparison to validate the method.

The millions of reads were assembled de novo and the resulting contigs clustered. The 2505 clusters containing homologous sequences of durum wheat, urartu and speltoides formed a test set confirming the benefit of HomeoSplitter. On these data, HomeoSplitter yields durum wheat contigs that are closer to those of its diploid ancestors. Mapping the reads on these new contigs, rather than directly on the de novo assembly, multiplies the number of reliable SNPs identified by 4 (762 SNPs among 1360 polymorphic sites instead of 188 among 1620).

The HomeoSplitter program is freely available at http://davem-193/homeoSplitter/. This tool is a practical solution to the homeo-genome mixing problems of allo-tetraploid species, and enables more efficient SNP detection in these species.

This work has been accepted for publication in [1].


References

[1] V. Ranwez, Y. Holtz, G. Sarah, M. Ardisson, S. Santoni, S. Glemin, and M. Tavaud-Pirra. BMC Bioinformatics, 14(Suppl 15):S15 (RECOMB-CG 2013 special issue). Accepted.


An algorithm for pattern occurrences P-values computation

Mireille Regnier1*, Evgenia Furletova2*, Victor Yakovlev2,4 and Mikhail Roytberg2,3,4*

1 INRIA team AMIB, LIX-Ecole Polytechnique and LRI-UPSud, 1 rue d’Estienne d’Orves, 91120 Palaiseau, France
2 Institute of Mathematical Problems of Biology, 142290, Institutskaya, 4, Pushchino, Russia
3 Laboratoire J.-V. Poncelet (UMI 2615), 119002, Bolshoy Vlasyevskiy Pereulok, 11, Moscow, Russia
4 National Research University “Higher School of Economics”, 101978, Myasnitskaya str., 20, Moscow, Russia

∗ corresponding author

Mireille Regnier: [email protected] Furletova: [email protected] Yakovlev: [email protected] Roytberg: [email protected]

Our study is related to finding new functional fragments in biological sequences, an important problem in bioinformatics. Methods addressing this problem commonly search for clusters of pattern occurrences that are statistically significant. A measure of statistical significance is the P-value of the number of pattern occurrences, i.e. the probability of finding at least S occurrences of words from a pattern H in a random text of length N generated according to a given probability model. All words of the pattern are assumed to have the same length.
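For intuition only (this is not the SufPref algorithm, which precisely avoids such exponential enumeration), the exact P-value under a Bernoulli model can be computed by brute force on tiny instances:

```python
from itertools import product

def exact_pvalue(pattern_words, S, N, probs):
    """P-value by exhaustive enumeration over all |alphabet|^N texts:
    probability that a Bernoulli text of length N contains at least
    S (possibly overlapping) occurrences of words from the pattern."""
    total = 0.0
    for letters in product(list(probs), repeat=N):
        text = "".join(letters)
        occ = sum(1 for i in range(N)
                  for w in pattern_words if text.startswith(w, i))
        if occ >= S:
            p = 1.0
            for c in text:        # Bernoulli: letters are independent
                p *= probs[c]
            total += p
    return total

# P(at least one "aa" in a uniform binary text of length 3) = 3/8
print(exact_pvalue({"aa"}, 1, 3, {"a": 0.5, "b": 0.5}))  # 0.375
```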

We present a novel algorithm, SufPref, that computes an exact P-value for three types of probability models: Bernoulli, Markov models of arbitrary order, and Hidden Markov Models (HMMs). The algorithm computes the P-value as the probability of the set of all sequences containing at least S occurrences of H. We describe the structure of this set using auxiliary sets of sequences and derive equations for the probabilities of the auxiliary sets. These equations allow efficient processing of the overlaps between pattern words, which is the main advantage of our algorithm. The information about overlaps and their relations is stored in a specific data structure, the overlap graph. The nodes of the graph are associated with the overlaps of words from H; edges are associated with the prefix and suffix relations between overlaps. An originality of our data structure is that the pattern H need not be explicitly represented in nodes or leaves. The algorithm inductively traverses the graph. It relies on the Cartesian product of the overlap graph and the graph of model states; the approach is analogous to the automaton approach of [2]. We carefully analyze the structure of the Cartesian product, e.g. the reachability of vertices, which leads to further improvements in time and space complexity. Taking the overlaps between pattern words into account significantly decreases the space and time complexities. Note that the nodes of the widely used Aho-Corasick trie, used in particular by the algorithm AhoPro [1], are associated with prefixes of pattern words; the number of prefixes is much larger than the number of overlaps.

The algorithm SufPref was implemented as a C++ program; it can be used both as a web server and as a stand-alone program for Linux and Windows. The program is available at http://server2.lpm.org.ru/bio/online/sf/.

The implementation of SufPref was compared with the program AhoPro for the Bernoulli model and a first-order Markov model. The comparison shows that, in all considered cases, our algorithm is more than four times faster than AhoPro for Bernoulli models and more than two times faster for Markov models. In the vast majority of cases, it also outperforms AhoPro in space.


References

[1] V. Boeva, J. Clement, M. Regnier, M. Roytberg, and V. Makeev. Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules. Algorithms for Molecular Biology, 2(13):25 pages, 2007. [http://www.almob.org/content/2/1/13].

[2] G. Kucherov, L. Noe, and M. Roytberg. A unifying framework for seed sensitivity and its application to subset seeds. Journal of Bioinformatics and Computational Biology, 4(2):553–569, 2009.


Filtering systematic errors in next-generation genotype calls: stranding matters

Laure Sambourg and Nicolas Thierry-Mieg

UJF-Grenoble 1 / CNRS / TIMC-IMAG UMR 5525, Computational and Mathematical Biology (BCM),Grenoble, F-38041, France

Background

Next-generation sequencing technologies have enabled the production of massive datasets, opening up new avenues of research. In particular, exome capture and sequencing provides a cost-effective means of genotyping the coding portion of the genome for large cohorts of individuals. However, calling genotypes from NGS data remains difficult, and high rates of discrepancies between called genotypes are observed when using different sequencing technologies on the same samples, or different software pipelines on the same raw sequencing data. Some of these difficulties may be due to systematic biases and errors arising at the sequencing, alignment or genotype-calling steps. In order to investigate these possibilities, we analyzed 108 deep exome-seq replicates of non-tumor cells produced by The Cancer Genome Atlas with ABI SOLiD and Illumina GAII platforms, searching for sources of systematic errors.

Results

We aligned the TCGA short reads using the MAGIC pipeline, which allowed us to distinguish reads aligning on the forward and reverse strands of the genome, and developed a simple and effective algorithm for calling genotypes. This algorithm favors specificity over sensitivity: calls are only made when the data is unambiguous and sufficiently deep. Furthermore, we called genotypes independently on each strand and compared the resulting calls. Surprisingly, we discovered that they disagree in 40.6% of the positions where high-confidence calls can be made on both strands (excluding homozygous reference positions). This observation is not an artifact of the MAGIC aligner, as shown by reanalyzing a published dataset initially studied with GATK. These strand-discordant positions appear due to the sequence context, which is strand-specific and can lead to systematic sequencing errors on one strand. Furthermore, the TCGA replicates allowed us to identify systematic error-prone positions in the genome, some of which are specific to the ABI or Illumina sequencers and some of which are cross-platform. Filtering these positions significantly improves the genotyping quality.
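The abstract does not give the exact calling thresholds; a conservative per-strand caller in the spirit described (illustrative thresholds of our own, not those of the paper) might look like:

```python
def call_strand(ref_reads, alt_reads, min_depth=10):
    """Conservative call from one strand's read counts: 'ref', 'alt',
    'het', or None when the data is shallow or ambiguous."""
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return None                 # not deep enough: no call
    frac = alt_reads / depth
    if frac >= 0.9:
        return "alt"
    if frac <= 0.1:
        return "ref"
    if 0.3 <= frac <= 0.7:
        return "het"
    return None                     # ambiguous allele balance

def call_position(fwd_counts, rev_counts):
    """Report a genotype only when both strands are callable and agree."""
    a, b = call_strand(*fwd_counts), call_strand(*rev_counts)
    return a if a is not None and a == b else None

print(call_position((20, 0), (15, 0)))   # strands concordant -> ref
print(call_position((0, 20), (18, 2)))   # strand-discordant -> None
```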

Conclusions

Our results clearly show that strand-specific read counts should always be provided, and that a reliable genotype can only be called when the two strands are compatible and sufficiently covered. In addition, lists of error-prone positions are provided and should help to filter out systematic errors. Beyond genotype calling, our findings have implications for the experimental design of exome-capture experiments: capture libraries should be short enough to allow a significant proportion of positions to be sequenced on both strands.


Reconstructing Textual Documents from Perfect n-gram Information

Matias Tealdi and Matthias Galle

Xerox Research Centre Europe

Companies may be interested in releasing part of the data they own for reasons of general good, prestige, harnessing the work of those the data is released to, or because it opens access to new sources (in a marketplace setting, for instance). However, most of the time it is not possible to release the complete data, due to privacy concerns, legal constraints or economic interest. A compromise is to release some statistics computed over this data. In the case of releasing n-gram counts of text documents (the case we study here), two examples are the release of copyrighted material (notably the Google Ngram Corpus) and the exchange of phrase tables for machine translation when the original parallel corpora are private or confidential.

The obvious question that then arises is how much of the information should be released in order to prevent reconstruction by a third party. We analyse here the question of which are the longest blocks that can be reconstructed with total certainty (that is, without resorting to probabilistic approaches) starting from an n-gram corpus containing perfect information (no n-gram or count is omitted, and no noise is introduced).

A similar problem is routinely solved in DNA sequencing by mapping the n-grams into a graph (the de Bruijn graph) and finding an Eulerian tour in this graph [1]. However, the number of different Eulerian tours can grow worse than exponentially with the number of nodes, and only one of these tours corresponds to the original document. We present a novel reduction of this graph into its most irreducible form, from which large blocks of substrings of the document can easily be read off. In our experiments on books from Project Gutenberg 1, we were able to obtain blocks of an average size of 53.21 words starting from 5-grams.

1 Definitions

We will be working with directed multigraphs where an edge not only has a multiplicity attached to it, but also a label denoting the substring it represents. This motivates the following definition of graph:

Definition 3 A graph G is a tuple (V, E), with V the set of nodes and E the set of edges, where each edge is of the form (〈u, v, ℓ〉, k) with u, v ∈ V, ℓ ∈ Σ∗, k ∈ N, where Σ is the vocabulary of the original sequence.

Given an edge e = (〈v, w, ℓ〉, k), we use the following terms to refer to its components: tail(e) = v, head(e) = w, label(e) = ℓ, multiplicity(e) = k.

The indegree of a node v is din(v) = ∑_{e∈E : head(e)=v} multiplicity(e), and its outdegree is dout(v) = ∑_{e∈E : tail(e)=v} multiplicity(e).

A graph is Eulerian if it is connected and din(v) = dout(v) for all nodes v. In this case we define d(v) = din(v) = dout(v).

We furthermore require that the labels determine the edges uniquely. That is, for all e1, e2 ∈ E, if label(e1) = label(e2), then e1 = e2.

An Eulerian cycle in our graphs is then a cycle that visits each edge e exactly multiplicity(e) times. We denote the set of all Eulerian cycles of G by ec(G). Given an Eulerian cycle c = e1, . . . , en, its label sequence is the list ℓ(c) = [label(e1), . . . , label(en)], and the string it represents is the concatenation of these labels: s(c) = label(e1).label(e2). · · · .label(en).

Recall that our original problem was to find substrings, as long as possible, that we are sure appear in the original sequence, given the evidence of the n-grams. Our strategy is to start with the original de Bruijn graph and to merge some edges iteratively until the final edges correspond exactly to those maximal strings. Formally, given the original graph G, we are interested in a graph G∗ that:

1. http://www.gutenberg.org/

41

1. is equivalent:
{s(c) : c ∈ ec(G)} = {s(c) : c ∈ ec(G∗)} (2)

2. is irreducible:

∄ e1, e2 ∈ E∗ : [label(e1), label(e2)] appears in all ℓ(c), c ∈ ec(G∗) (3)

2 Reduction Steps

To achieve this we introduce two reduction rules which preserve Eulerian cycles (correctness) and whose successive application ensures an irreducible Eulerian graph (completeness).

The first rule, which in practice is applied the most, takes only local information into consideration, by analyzing the incoming and outgoing edges of a specific vertex. The second rule uses global information, through the use of division points, which are nodes that are either articulation points [2] or have a self-loop edge.
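The local rule is close in spirit to the path compaction performed on de Bruijn graphs in sequence assembly. A simplified sketch of our own for the special case where every n-gram has count 1 (ignoring multiplicities, division points, and purely cyclic components, which the full method handles):

```python
from collections import defaultdict

def maximal_blocks(ngrams):
    """Build the de Bruijn graph of a set of distinct n-grams (count 1
    each) and merge unambiguous edges: from every branching node,
    follow 1-in/1-out nodes to read off a maximal block."""
    out_edges = defaultdict(list)
    indeg, outdeg = defaultdict(int), defaultdict(int)
    nodes = set()
    for g in ngrams:
        u, v = g[:-1], g[1:]            # (n-1)-gram prefix and suffix
        out_edges[u].append(v)
        indeg[v] += 1
        outdeg[u] += 1
        nodes.update((u, v))

    def branching(v):
        return not (indeg[v] == 1 and outdeg[v] == 1)

    blocks = []
    for u in nodes:
        if branching(u):
            for v in out_edges[u]:
                block = u + v[-1]
                while not branching(v):  # unique extension: merge edges
                    v = out_edges[v][0]
                    block += v[-1]
                blocks.append(block)
    return sorted(blocks)

print(maximal_blocks(["abc", "bcd", "cde"]))  # ['abcde']
print(maximal_blocks(["ab", "bc", "cb"]))     # ['ab', 'bcb']
```

In the first example the path is unambiguous and the whole text is recovered; in the second, the node "b" is branching, so reconstruction stops there and only two maximal blocks can be asserted with certainty.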

References

[1] P. E. C. Compeau, P. A. Pevzner, and G. Tesler. How to apply de Bruijn graphs to genome assembly. Nature Biotechnology, 29(11):987–991, Nov. 2011.

[2] R. Tarjan. Depth-First Search and Linear Graph Algorithms. SIAM Journal on Computing, 1(2):146–160, June 1972.
