BIOINFORMATICS 2012 STOCKHOLM JUNE 11-14 2012 http://socbin.org/bioinfo2012/



Welcome

SocBiN, in collaboration with the Center for Biomembrane Research, welcomes you to the 12th annual conference in bioinformatics. This year the conference will be held in beautiful Stockholm, starting at lunchtime on June 11 and ending at lunch on June 14. It will be held in the lecture hall "Berzelius" (Berzelius väg 3 / Tomtebodavägen, at the bus stop for SL bus 69) on the Karolinska Institutet campus, close to Science for Life Laboratory, Stockholm. We are looking forward to an exciting scientific program with 4 invited keynote speakers and 5 sessions (Molecular Machines, Using next generation sequence data, Data analysis of proteomics assays, Bioinformatics of chemical biology and RNA bioinformatics).

We wish you all a warm welcome.

The organization committee: Arne Elofsson, Department of Biochemistry and Biophysics, Stockholm University; Erik Lindahl, Theoretical Physics, KTH; and Bengt Persson, Linköping University

SocBiN

SocBiN (Society for Bioinformatics in Northern Europe) is a non-profit organization for people working with and interested in bioinformatics and computational biology. The members of the organization are predominantly from the Nordic and Baltic countries, but others are also welcome.

We are grateful for the help of our session chairs:

• Arne Elofsson, Science for Life Laboratory, Stockholm University, Sweden
• Lukas Käll, Science for Life Laboratory, KTH, Sweden
• Anders Andersson, Science for Life Laboratory, KTH, Sweden
• Jens Carlsson, Center for Biomembrane Research, Stockholm University, Sweden
• Janusz Bujnicki, International Institute of Molecular and Cell Biology in Warsaw, Poland

 


Program

Mon 11: Data Analysis of Proteomics Assays

13:45-14:00 Arne Elofsson: Welcome
14:00-14:30 Ruedi Aebersold: Searching and Mining of Proteomic SWATH-MS datasets
14:30-15:00 Lennart Martens: Snakes and ladders: where do proteomics assays fail and how can we fix them?
15:00-15:15 Finn Drabløs: The Triform algorithm: improved sensitivity and specificity in ChIP-Seq peak finding
Coffee
15:45-16:15 Edward Marcotte: Insights from proteomics into protein organization, evolution, and genetic disease
16:15-16:45 Roman Zubarev: Pathway Analysis in Expression Proteomics
16:45-17:00 Paul Horton: MoiraiSP: a novel mitochondrial cleavage site predictor
17:00-19:00 Reception and poster session (Presentation by odd numbers)

   


Tue June 12: RNA Bioinformatics

09:30-10:00 Bob Darnell
10:00-10:30 Jan Gorodkin: Towards the search for RNA-RNA interaction based networks
10:30-10:45 Mihaela Zavolan: A biophysical model to infer canonical and non-canonical microRNA-target interaction
Coffee
11:15-11:45 Eric Westhof: The Detection of the Architectural Modules of RNA and Recent Progress in RNA Modelling
11:45-12:15 Samuel Flores: A structural and dynamical model of human telomerase
12:15-12:30 Nanjiang Shu: Computational analysis of membrane protein topology evolution
Lunch

Keynotes Session

13:30-14:30 Anders Krogh: On the accuracy of short read mapping
14:30-15:30 Kerstin Lindblad-Toh
Coffee
16:00-17:00 Jens Nielsen: Genome-Scale Metabolic Models: A Bridge between Bioinformatics and Systems Biology
17:00-18:00 Paul Horton: Excavating human NUMTs
18:00-19:00 Michael Levitt: Solving the Recalcitrant Crystal Structure of Group II Chaperonin TRiC/CCT by Mass Spectrometry and Sentinel Correlation Analysis
19:30-24:00 Conference Dinner

   


Wed 13: Bioinformatics of chemical biology

09:30-10:00 Gert Vriend: What can we (not yet) learn from 70 GPCR structures
10:00-10:30 Raymond Stevens: Understanding Human G-protein Coupled Receptor Structural Diversity and Modularity
10:30-10:45 David Gloriam: Chemogenomic Discovery of Allosteric Antagonists at the GPRC6A Receptor
Coffee
11:15-11:45 Helgi Schiöth: The origin of GPCRs, the largest family of membrane bound proteins
11:45-12:15 Andreas Bender: Using Chemogenomics Approaches to Modulate Biological Systems
12:15-12:30 Kentaro Tomii: PoSSuM: a database of known and potential ligand-binding sites in proteins

Using Next generation sequence data

14:00-14:30 Jeroen Raes: Metagenomics data analysis: from the oceans to the human microbiome
14:30-15:00 Christopher Quince: Extracting ecological signal from noisy microbiomics data
15:00-15:15 Johan Bengtsson: Comprehensive Analysis of Antibiotic Resistance Genes in River Sediment, Well Water and Soil Microbial Communities Using Metagenomic DNA Sequencing
15:15-15:30 Daniel Edsgärd: Allele specific expression changes after induction of inflammation
Coffee
16:00-16:30 Erik van Nimwegen: Reconstructing transcription regulatory networks in mammals using a combination of modeling and next-generation sequencing data
16:30-17:00 Joakim Lundeberg: Sequencing and assembly of the largest and most complex genome to date - the Norway spruce (Picea abies)
17:00-17:30 Ivo Gut: High-resolution whole-genome analysis and cancer
17:30-19:00 Poster session (Presentation by even numbers)

   


Thu 14: Molecular machines

09:00-09:30 Martin Weigt: From sequence variability to protein (complex) structure prediction
09:30-10:00 Burkhard Rost: Evolution teaches protein prediction
10:00-10:15 Janusz Bujnicki: If There's an Order in All of This Disorder…: Structural Bioinformatics of the Human Spliceosomal Proteome
10:15-10:30 Joanna M Kasprzak: PyRy3D: a software tool for modelling of large macromolecular complexes
Coffee
11:00-11:30 Ingemar André: Design and Prediction of Protein Self-assembly
11:30-12:00 Rob Russell
12:00-12:15 Closing words

 


List of participants

Conference "Bioinformatics 2012", June 11-14 at Karolinska Institutet, Stockholm Sweden

First name Surname/Family name University/Organization Country
Ruedi Aebersold ETH Zurich SWITZERLAND
Agarwal Chalmers SWEDEN
Alam Khan KTH SWEDEN
Hashim Ali Kungliga Tekniska Hogskolan SWEDEN
Andersson KTH SWEDEN
André Lund University SWEDEN
Andreson University of Tartu ESTONIA
Arvestad Stockholm University SWEDEN
Barghash Saarland University GERMANY
Basile SWEDEN
Bengtsson University of Gothenburg SWEDEN
Boekel Scilifelab Stockholm SWEDEN
Borg SWEDEN
Bornelöv Uppsala university SWEDEN
Boss Karolinska Institutet SWEDEN
Boulund Chalmers University of Technology SWEDEN
Brüffer Lund University SWEDEN
Brömstrup KTH SWEDEN
Bujnicki IIMCB POLAND
Bunikis Uppsala University SWEDEN
Carlsson Stockholm University SWEDEN
Chernobrovkin IBMC RAMS RUSSIA
Czerwoniec Adam Mickiewicz University POLAND
Darnell The Rockefeller University UNITED STATES
Daub Karolinska Institutet and SciLifeLab SWEDEN
De Bruijn Stockholm University SWEDEN
Drablos Norwegian Univ of Science and Technology NORWAY
Du Karolinska Institutet SWEDEN
Dunin-Horkawicz IIMCB POLAND
Edsgärd KTH, Science for Life Laboratory SWEDEN
Elofsson Principal Investigator/Lab Head/Senior R SWEDEN
Emanuelsson Kungliga Tekniska Högskolan SWEDEN
Foroughi Asl Karolinska Institutet SWEDEN
Frings SWEDEN
Frånberg Stockholms Universitet/Karolinska Instit SWEDEN
Gloriam University of Copenhagen DENMARK
Gomez-Cabrero BILS SWEDEN
Gorodkin University of Copenhagen DENMARK
Granholm Stockholm University SWEDEN
Helge Grindhaug Uni Research NORWAY
Guala SWEDEN
Gut Centro Nacional de Análisis Genómico SPAIN
Hamed Saarland University GERMANY
Hautaniemi University of Helsinki FINLAND
Hayat
Horton AIST, Computational Biology Res. Ctr. JAPAN
Hugerth KTH SWEDEN
Hultin Rosenberg Karolinska Institutet SWEDEN
Huminiecki
Abigail Icay University of Helsinki FINLAND
Johansson Karolinska Institute SWEDEN
Johnning University of Gothenburg SWEDEN
Jonsson University of Gothenburg SWEDEN
Junttila University of Turku FINLAND
Jørgensen University of Copenhagen DENMARK


Yvonne Kallberg Karolinska Institutet SWEDEN
Kasprzak Kasprzak Adam Mickiewicz University POLAND
Khaliq Uppsala University SWEDEN
Kraulis SWEDEN
Krogh University of Copenhagen DENMARK
Kumar Adam Mickiewicz University POLAND
Kumar Saarland University GERMANY
Kusonmano Uni Research AS NORWAY
Kytömäki University of Turku FINLAND
Käll Royal Institute of Technology (KTH) SWEDEN
Lagergren KTH SWEDEN
Laht University of Tartu ESTONIA
Larhammar Uppsala University SWEDEN
Lavrichenko University of Bergen NORWAY
Levander Lund University SWEDEN
Levitt Stanford University UNITED STATES
Light SWEDEN
Lindahl KTH Royal Institute of Technology SWEDEN
Lindvall Karolinska Institutet SWEDEN
Lundeberg KTH, Science for Life Laboratory SWEDEN
Lundell Uppsala University SWEDEN
Lysholm Linköping University SWEDEN
Löytynoja University of Helsinki FINLAND
Owais Mahmudi KTH SWEDEN
Marcotte University of Texas UNITED STATES
Margus University of Tartu ESTONIA
Martens VIB and Ghent University BELGIUM
Martinez Chalmers University of Technology SWEDEN
Matelska IIMCB POLAND
Mäkinen University of Helsinki FINLAND
Nielsen Chalmers University of Technology SWEDEN
Nielsen Technical University of Denmark DENMARK
Nilsson Karolinska Institutet SWEDEN
Nowak Uniwersytet M.Kopernika w Toruniu POLAND
Nylander Swedish Museum of Natural History SWEDEN
Isolfur Olason Uppsala University SWEDEN
Paine Karolinska Institute SWEDEN
Parviainen KTH SWEDEN
Pernemalm Karolinska Institutet SWEDEN
Persson Linköping University and BILS SWEDEN
Peters SWEDEN
Petersen Uni Research AS NORWAY
Pilstål Linköping University SWEDEN
Pinto Umeå University SWEDEN
Pouya Royal Institute of Technology SWEDEN
Pruner Uppsala University SWEDEN
Quince University of Glasgow UNITED KINGDOM
Raes Vrije Universiteit Brussel BELGIUM
Rajashekar Tartu University ESTONIA
Ranganathan
Raska Tallinn University of Technology ESTONIA
Reimegård Royal Institute of Technology SWEDEN
Remm University of Tartu ESTONIA
Repsilber Leibniz Inst. for Farm Animal Biology GERMANY
Maria Rodriguez Sanchez KAROLINSKA UNIVERSITY HOSPITAL SWEDEN
Rost Technische Universitaet Muenchen GERMANY
Rubio García Technical University of Denmark DENMARK
Russell University of Heidelberg GERMANY
Sahlin KTH/Science for life Laboratory SWEDEN
Schiöth Uppsala University SWEDEN
Schmitt SWEDEN


Sophie Schwaiger KTH SWEDEN
Sennblad Karolinska Institutet SWEDEN
Sergushichev University ITMO RUSSIA
Shahrabi Farahani KTH SWEDEN
Shu SWEDEN
Silberberg Stockholm University SWEDEN
Sinha Karolinska Institute SWEDEN
Sjöstrand SWEDEN
Skwark POLAND
Sonnhammer SWEDEN
Studham SWEDEN
Sukumar University of Wisconsin-Madison UNITED STATES
Svensson SWEDEN
Svensson Science for Life Laboratory SWEDEN
Tellgren-Roth Uppsala University SWEDEN
Tjärnberg SWEDEN
Tomii AIST JAPAN
Ullah KTH SWEDEN
Unneberg SWEDEN
…örn Wallner Linköping University SWEDEN
Valls Guimera SWEDEN
Van Nimwegen University of Basel SWITZERLAND
Wang Umeå University SWEDEN
Warholm SWEDEN
Weigt University Pierre & Marie Curie FRANCE
…örn Wesén KTH SWEDEN
Westhof University of Strasbourg, IBMC-CNRS FRANCE
Vezzi KTH Royal Institute of Technology SWEDEN
Volpato University College Dublin (UCD) IRELAND
Vriend CMBI NETHERLANDS, THE
Xu Uppsala University SWEDEN
Özge Yoluk KTH SWEDEN
Zaremba-Niedzwiedzka Uppsala University SWEDEN
Zhao Uppsala University SWEDEN
Öhman Stockholm University SWEDEN
Östberg Karolinska Institutet SWEDEN


Abstracts from Invited speakers

Ruedi Aebersold, Institute of Molecular Systems Biology, ETH Zurich and Faculty of Science, University of Zurich

Searching and Mining of Proteomic SWATH-MS datasets

Recently we introduced a new data-independent acquisition (DIA) method termed SWATH-MS (1). This is, in effect, a time- and mass-segmented acquisition method in which complex, high-specificity fragment ion maps of all precursor ions within a user-defined precursor retention time (RT) and m/z space are generated and recorded. This is accomplished by stepping the isolation window of a specifically tuned quadrupole time-of-flight (QqTOF) instrument in discrete increments, repeatedly, throughout the duration of the LC separation. The data acquired by SWATH-MS are not searchable by conventional database search engines, because each fragment ion spectrum is a composite of multiple, concurrently fragmented precursor ions.
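The stepping of the isolation window can be pictured with a minimal sketch. The m/z range (400 to 1200) and the window width (25) used here are illustrative assumptions, not values given in this abstract, and the function names are hypothetical.

```python
def swath_windows(mz_start, mz_end, width):
    """Return consecutive (low, high) precursor isolation windows covering the m/z range."""
    windows = []
    low = mz_start
    while low < mz_end:
        windows.append((low, min(low + width, mz_end)))
        low += width
    return windows


def window_index(mz, mz_start, width):
    """Index of the isolation window whose fragment ion map would contain this precursor."""
    return int((mz - mz_start) // width)


# Assumed, illustrative settings: one acquisition cycle steps through every window once.
windows = swath_windows(400.0, 1200.0, 25.0)
idx = window_index(512.3, 400.0, 25.0)
print(len(windows), "windows per cycle; precursor 512.3 falls in", windows[idx])
```

Because every precursor in a window is fragmented together, each window yields a composite fragment ion spectrum, which is why a targeted extraction step is needed afterwards.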

In this presentation we will describe an automatic pipeline for peptide identification and quantification from SWATH-MS datasets. It is conceptually related to the mProphet algorithm developed for the analysis of S/MRM datasets (2). The algorithm applies a targeted search strategy, whereby peak groups uniquely identifying a particular peptide are extracted from the SWATH-MS dataset and assigned a probability of being correctly associated with the target peptide. The algorithm uses a system of individual feature score rankings that are then combined into a composite score.

The performance of the method will be illustrated with selected examples that indicate the power of the approach for the reproducible analysis of proteomes, the detection of modified peptides and the estimation of the absolute quantity of proteins and proteomes.

1. Gillet LC, Navarro P, Tate S, Roest H, Selevsek N, Reiter L, Bonner R, Aebersold R. (2012) Targeted data extraction of the MS/MS spectra generated by data independent acquisition: a new concept for consistent and accurate proteome analysis. MCP [Epub ahead of print].

2. Reiter L, Rinner O, Picotti P, Huettenhain R, Beck M, Brusniak MY, Hengartner MO, Aebersold R. (2011) mProphet: automated data processing and statistical validation for large-scale SRM experiments. Nat Methods 8(5):430-5.


Ingemar André, Center for Molecular Protein Science, Biochemistry and Structural Biology, Lund University

Design and Prediction of Protein Self-assembly

Many of the largest protein complexes in biology are composed of a single type of subunit that is repeated a large number of times to generate a functional assembly. Such homomeric structures are often assembled spontaneously from individual components through the process of self-assembly. Research in our group is focused on the prediction of the three-dimensional structure of homomeric assemblies and the rational design of novel self-assembling proteins and peptides. Over the last several years we have developed computational methods to model the structure of homomeric assemblies using the powerful constraint of molecular symmetry. In this presentation I will illustrate how these prediction methods, in conjunction with limited experimental constraints, can be used to tackle important problems in structural biology. The second part of the talk will deal with the rational design of self-assembling proteins and peptides. We combine the powerful design template of self-assembly with structural modeling and computational protein design to design protein assemblies at the atomic level.


Andreas Bender, Cambridge University

Using Chemogenomics Approaches to Modulate Biological Systems

Modulating biological systems can be achieved via biological means (such as knock-out animals or RNA interference); however, chemical modulation by small molecules is an alternative method with significantly different properties, such as the ability to control the dose and time course of administration in detail. In this presentation, different methods for analysing the mode of action of small molecules that show an effect in phenotypic assays will be discussed, in order to understand small molecule action better. Also, reversing the direction of the analysis, we will outline how the large bioactivity databases available today can be used to design molecules with the desired effect on a biological system, be it by modulating single targets or, as has become more popular recently, by modulating a defined set of target proteins.


Samuel Flores, Uppsala University

A structural and dynamical model of human telomerase

Mutations in the telomerase complex disrupt either nucleic acid binding or catalysis, and are the cause of numerous human diseases. Despite its importance, the structure of the human telomerase complex has not been observed crystallographically, nor are its dynamics understood in detail. Fragments of this complex from Tetrahymena thermophila and, more controversially, Tribolium castaneum have been crystallized. Biochemical probes provide important insight into dynamics. In this work we use available structural fragments to build a homology model of human TERT, and validate the result with functional assays. We then generate a trajectory of telomere elongation following a “typewriter” mechanism: the RNA template moves to keep the end of the growing telomere in the active site, disengaging after every 6-residue extension to execute a “carriage return” and go back to its starting position. A hairpin can easily form in the telomere from DNA residues leaving the telomere-template duplex. The trajectory is consistent with available experimental evidence and suggests focused biochemical experiments for further validation.


Jan Gorodkin, Center for non-coding RNA in Technology and Health, Denmark

Towards the search for RNA-RNA interaction based networks

In recent years, awareness of non-coding RNAs has increased rapidly, and experimental as well as in silico results point to their large potential. Our motivation here is the thousands of in silico generated RNA structure candidates in the genome. A major challenge is to assign function to these. The first step is to search for interactions between these RNAs and other RNAs (or DNA or proteins). Searching for RNA-RNA interactions is in general a time-consuming task. As a first step we have developed an approach that searches only for near-complementary interactions (ignoring intramolecular base pairs). We show that this approach is faster than existing methods while maintaining accuracy, and that the method can be used as a filter (in front of existing methods) for microRNA target search. In a case study on microRNAs, we combined target predictions (conserved in human and mouse) to protein-coding genes with literature mining and obtained a combined enrichment for transcription factors (TFs) only; we subsequently found that TFs are also enriched for targeting microRNAs. Our results suggest a network of mutually activating and suppressive regulation.
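The idea of searching only for near-complementary sites can be illustrated with a minimal sketch. This is not the method presented in the talk: it assumes Watson-Crick pairs only (no G-U wobble), allows a fixed number of mismatches, and ignores intramolecular structure entirely. The sequences and function names are made up for illustration.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Sequence that would pair perfectly (antiparallel) with the given RNA."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def near_complement_sites(query, target, max_mismatches=1):
    """Return (position, mismatches) for target windows nearly complementary to query."""
    probe = reverse_complement(query)
    hits = []
    for i in range(len(target) - len(probe) + 1):
        window = target[i:i + len(probe)]
        mismatches = sum(a != b for a, b in zip(probe, window))
        if mismatches <= max_mismatches:
            hits.append((i, mismatches))
    return hits


# Toy example: a short query and a target with one perfect and one imperfect site.
query = "UGAGGUAG"
target = "AAACUACCUCAGGGCUACGUCAUU"
hits = near_complement_sites(query, target, max_mismatches=1)
print(hits)
```

A scan of this kind is linear in the target length for a fixed query, which is what makes it usable as a cheap pre-filter before a slower, structure-aware interaction method.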


Ivo Gut, Centro Nacional de Análisis Genómico, C/Baldiri Reixac 4, 08028 Barcelona, Spain.

High-resolution whole-genome analysis and cancer

The International Cancer Genome Consortium (ICGC) aims to characterize exhaustively 50 tumour/normal sample pairs for each of the 50 most common forms of cancer, and then to validate the observations in a further 450 samples. The first three years of this project have seen huge advances in the development, implementation and standardisation of the methods for characterising samples, ethical approval, whole-genome sequencing, exome sequencing, RNA sequencing, epigenetic analysis, methods for validation, informatics analysis and databasing.

The Spanish contribution to the ICGC is on Chronic Lymphocytic Leukaemia (CLL). Our main responsibility has been whole-genome sequence analysis, exome analysis, RNA sequence analysis and epigenetic analysis. Complete genome sequencing of many samples requires bringing together many different elements: samples, preparation for sequencing, sequencing itself, data analysis, verification of results and translation of a result into biological knowledge. Thorough examination of the first 4 tumour/normal pairs and follow-up in a large replication set allowed us to identify four recurrent mutations, in the NOTCH1, XPO1, MYD88 and KLHL6 genes. In an extension we analysed 100 tumour/normal pairs by exome sequencing, which allowed the identification of further recurrent somatic mutations, the most frequent being in SF3B1 and POT1. Interestingly, the two recognised subtypes of CLL, immunoglobulin modulated and not, are not fully reflected in the recurrent mutations. The methods and findings will be discussed.


Paul Horton, CBRC

Excavating human NUMTs

NUMTs (Nuclear mtDNA) are partial copies of the mitochondrial genome found in the nuclear genome. They are sometimes referred to as molecular fossils and, due to the higher mutation rate of mtDNA, can in some cases be more similar to parts of our ancestral mtDNA than our extant mtDNA genome is. The existence of NUMTs has been known for decades, and many informatics studies on NUMTs have attempted to elucidate the characteristics of their insertion sites. By showing that NUMTs are typically very clean insertions with only minimal deletion or duplication of the surrounding nuclear DNA, these studies have led to a consensus opinion that most NUMTs are likely inserted as filler DNA via NHEJ (Non-Homologous End Joining). Previous informatics studies have not shed much light upon the preferred insertion sites of NUMTs. Most of them conclude that NUMT insertion is random, except for contradictory reports that NUMTs correlate positively, or negatively, with retrotransposons. By employing more careful methodology, we were able to uncover several previously unknown aspects of this phenomenon. We found that inferred NUMT insertion sites strongly correlate with predicted physical properties of DNA (curvature and bendability) and with A+T-rich oligomers. Moreover, recently inserted NUMTs correlate strongly with nucleosome-free regions as measured by DNase-seq and FAIRE-seq. We also firmly establish that NUMTs do indeed tend to co-occur with retrotransposons. As for the source mtDNA that is copied to create NUMTs, we find that part of the mtDNA D-loop region is very seldom copied. Relating these facts to concrete hypotheses regarding the mechanism of NUMT insertion proved very challenging, but also fascinating, as it touched upon diverse topics in molecular biology: from retrotransposon activity and DNA repair to evolutionary conservation of chromatin structure and the packaging of mtDNA.

REFERENCES Tsuji et al., under revision, NAR


Anders Krogh, Copenhagen University, Denmark

On the accuracy of short read mapping.

Next-generation DNA sequencing technologies produce huge amounts of DNA sequence reads. Often the initial bioinformatics task is to map these reads to a reference genome. For this, well-tested methods like BLAST are far too slow and next-generation bioinformatics tools are needed. Several new methods have been developed, some of which build on the Burrows-Wheeler index, an elegant indexing of the genome that facilitates fast searches in a small memory footprint. These methods are based on mapping the reads exactly, apart from a few mismatches and indels. Most of them do not report any significance or probability that a match is actually correct. In this talk I will briefly review the field, give some general results for mapping accuracy, and suggest a more precise notion of uniqueness. I will also present a probabilistic approach to short read mapping, which uses quality scores to calculate mapping probabilities. This can improve mapping accuracy, in particular when mapping very short reads, such as small RNAs, various tag sequences, and ancient DNA. The effect on mapping performance will be illustrated using both simulated and actual DNA reads.
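How quality scores can be turned into mapping probabilities is sketched below. This is a generic textbook-style illustration, not the method presented in the talk: it assumes Phred-scaled qualities (error probability 10^(-Q/10)), ungapped alignment, a uniform prior over start positions, and an exhaustive scan instead of an index. The function names and the toy genome are hypothetical.

```python
def read_likelihood(read, quals, ref_segment):
    """P(read | read originates from ref_segment), using per-base Phred error rates."""
    p = 1.0
    for base, q, ref_base in zip(read, quals, ref_segment):
        err = 10 ** (-q / 10)          # Phred quality -> probability the base call is wrong
        p *= (1 - err) if base == ref_base else err / 3
    return p


def mapping_posteriors(read, quals, genome):
    """Posterior probability of each ungapped start position, under a uniform prior."""
    n = len(read)
    likes = {i: read_likelihood(read, quals, genome[i:i + n])
             for i in range(len(genome) - n + 1)}
    total = sum(likes.values())
    return {i: like / total for i, like in likes.items()}


# Toy genome in which the read matches exactly at two positions (0 and 4).
genome = "ACGTACGTTTGACCA"
post = mapping_posteriors("ACGT", [30, 30, 30, 30], genome)
best = max(post, key=post.get)
print(best, round(post[best], 3))
```

In the toy case the two exact matches split the probability almost evenly, which is the kind of ambiguity a plain "best hit" report would hide.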


Michael Levitt, Stanford, USA

Solving the Recalcitrant Crystal Structure of Group II Chaperonin TRiC/CCT by Mass Spectrometry and Sentinel Correlation Analysis

Eukaryotic group II Chaperonin TRiC or CCT is a 0.95 megadalton protein complex that is essential for the correct and efficient folding of cytosolic polypeptides. The closed form is a 16 nm sphere made of two hemi‐spherical rings of 8 subunits (~550 residues/subunit) that rotate to open a central folding chamber. In eukaryotes, 8 different genes encode the subunits of this ATP‐powered nanomachine. The high sequence identity of subunits made the 40,320 (=8 factorial) possible arrangements indistinguishable in previous cryo‐electron microscopy and crystallographic analysis. We solve this problem by independent studies on bovine and yeast TRiC chaperonin.

First we use cross‐linking, mass spectrometry and combinatorial homology modeling. We react bovine TRiC under native conditions with a lysine‐specific cross‐linker, follow up with trypsin digestion, and use mass spectrometry to identify 63 cross‐linked pairs providing distance restraints. Independently of the cross‐link set, we construct all 40,320 homology models of the TRiC particle. When we compared each model with the cross‐link set, we discovered that one model is significantly more compatible than any other model. Bootstrapping analysis confirms that this model is 10 times more likely to result from this cross‐link set than the next best‐fitting model.
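The exhaustive comparison of all 8! = 40,320 arrangements against a restraint set can be illustrated with a toy sketch. It is not the authors' procedure: instead of homology models and measured distances it uses hypothetical subunit labels, a made-up "true" ring order, and a score that simply counts cross-linked pairs sitting next to each other in a single ring.

```python
from itertools import permutations

SUBUNITS = "ABCDEFGH"  # hypothetical labels for the 8 distinct subunits


def satisfied(arrangement, crosslinks):
    """Count cross-linked subunit pairs that are ring neighbours in this arrangement."""
    pos = {s: i for i, s in enumerate(arrangement)}
    n = len(arrangement)
    return sum(1 for a, b in crosslinks if (pos[a] - pos[b]) % n in (1, n - 1))


# Made-up ground truth and toy restraints: one cross-link per neighbouring pair.
true_ring = "ADHBGCFE"
crosslinks = [(true_ring[i], true_ring[(i + 1) % 8]) for i in range(8)]

# Score every one of the 8! orderings and keep the best-fitting ones.
scores = {p: satisfied(p, crosslinks) for p in permutations(SUBUNITS)}
best = max(scores.values())
top = [p for p, s in scores.items() if s == best]
print(len(scores), "arrangements scored; best score", best, "reached by", len(top))
```

In this simplified ring, rotations and mirror images of the true order cannot be told apart, so several arrangements tie for the top score; the sketch only shows the enumerate-and-score pattern, not how a single best model is singled out.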

Second, we re‐examine the 3.8 Å resolution X‐ray data of yeast TRiC. Our method of Sentinel Correlation Analysis (SCA) exhaustively tests all 2,580,460 possible models. This unbiased analysis singles out with overwhelming significance one model, which is fully consistent with our previous biochemical data and refines to a much lower Rfree value than reported previously with the same X‐ray data. With four‐fold averaging, our structure reveals remarkably resolved details of the unique conformation of each subunit, and suggests a mechanism for the initiation of transition to the open state. More generally, we expect SCA to resolve ambiguity in future low‐resolution crystallographic studies.


Joakim Lundeberg, SciLifeLab, Sweden and The Spruce genome project

Sequencing and assembly of the largest and most complex genome to date - the Norway spruce (Picea abies)

Conifers are the dominant plant species in many ecosystems, including large areas in Sweden. Despite this, no conifer genome has yet been published, mainly owing to their large size and complexity. The lack of a genome sequence has hampered our understanding of conifer biology and evolution, as well as the development of potential novel breeding strategies of these economically important species.

We are currently performing whole genome sequencing and assembly of the 20 Gbp Norway spruce genome. This genome contains huge amounts of repeated elements, with an estimated gene density of only one gene per 500 kbp. In common with other tree genomes, heterozygosity is high, which further complicates the assembly process. The Spruce Genome Project is addressing questions of genome size, content and evolution, including analyses of gene families and repeats, and will establish Norway spruce as a prime model species for conifer research.

In this talk, we will present our main strategies concerning sequencing and assembly of this de novo genome, and give an update on the results obtained so far. In brief, we use a combination of whole genome shotgun and fosmid pool sequencing, followed by scaffolding and merging of the separate assemblies. This is complemented by a manually curated spruce-specific repeat library, sequencing of random fosmid clones for assembly benchmarking, as well as assemblies of the chloroplast and mitochondrial genomes.


Ed Marcotte, University of Texas Austin, US

Insights from proteomics into protein organization, evolution, and genetic disease


Lennart Martens, VIB, Gent, Belgium

Snakes and ladders: where do proteomics assays fail and how can we fix them?

Proteomics assays increasingly rely on two distinct and largely independent informatics processing steps: identification and quantification. Both processing steps can rely on a plethora of available algorithms and tools, but the maturity of these algorithms is quite distinct. Whereas identification is typically handled by venerable algorithms called search engines that have been in use for many years, quantification algorithms are still continuously evolving to accommodate the increasing resolution and sensitivity of modern mass spectrometers. Despite this difference in maturity, both steps can be improved. Indeed, the performance of current quantitative workflows can be boosted by simply combining several of them into a single, joint analysis, making the most of the specific sensitivities of each of the algorithms used. On the other hand, the long-serving search engines have also reached crucial limits in terms of specificity, effectively preventing proteomics from reaching a central status in the life sciences. Fortunately, this inherent limitation of current search engines can be fixed by improving the way in which we use the measurements provided by the mass spectrometer. We will here discuss these developments, and highlight how both quantification and identification can be improved; the former by incremental advances, the latter by a more radical change in approach.


Jens Nielsen Department of Chemical and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden

Genome-Scale Metabolic Models: A Bridge between Bioinformatics and Systems Biology

We are currently working on building a Human Metabolic Atlas, a novel web-based database and modelling tool that can be used by medical and pharmaceutical researchers to analyse clinical data with the objectives of identifying biomarkers associated with disease development and improving health care. The central technology in the Human Metabolic Atlas is the use of so-called genome-scale metabolic models (GEMs), which will be made tissue-specific by using different types of experimental data, e.g. from the Human Protein Atlas. These models allow for context-dependent analysis of clinical data, providing much more information than traditional statistical correlation analysis, and hence advance the identification of biomarkers from high-throughput experimental data that can be used for early diagnosis of metabolism-related diseases. As part of the Human Metabolic Atlas we are developing GEMs for the gut microbiome. In this context we are using metagenomics for identification of different metabolic functions that are associated with human diseases. Here we are using metagenomics sequencing data from the gut microbiome of patients with different diseases, e.g. arteriosclerosis and type 2 diabetes. Through the combination of the bacterial GEMs and metagenomics data we have identified enriched metabolic functions in the microbiome, and based on this we point to novel prospective biomarkers for disease development. We are further integrating metagenomics information into predictive metabolic models that offer the prospect of simulating how the gut microbiome will respond to diet.


Raymond Stevens, The Scripps Research Institute, USA

Understanding Human G-protein Coupled Receptor Structural Diversity and Modularity

GPCRs constitute one of the largest protein families in the human genome and play essential roles in normal cell processes, most notably in cell signaling. The human GPCR family contains more than 800 members, recognizes thousands of different ligands, and activates a number of signaling pathways through interactions with a small number of binding partners. GPCRs have also been implicated in numerous human diseases, and represent more than 40% of drug targets. Delivering GPCR structures in close collaboration with experts on specific receptor systems is of immense value to the basic science community interested in cell signaling and molecular recognition, as well as the applied science community interested in drug discovery. This work is being followed up with additional biophysical characterization including NMR spectroscopy, HDX mass spectrometry, medicinal chemistry and community-wide assessments with computational biology groups throughout the world. Crystal structures are now available for rhodopsin, adrenergic, and adenosine receptors in both inactive and activated forms, as well as for chemokine, dopamine, histamine, S1P1, muscarinic, and opioid receptors in inactive conformations. The common structural features seen in these receptors will be reviewed; the scope of structural diversity of GPCRs at different levels of homology provides insight into our growing understanding of the biology of GPCR action and their impact on drug discovery. Given the current set of GPCR structural data, a distinct modularity is now being observed between the extracellular (ligand-binding) and intracellular (signaling) regions. The rapidly expanding repertoire of GPCR structures provides a solid framework for experimental and molecular modeling studies, and helps to chart a roadmap for comprehensive structural coverage of the whole superfamily and an understanding of GPCR biological and therapeutic mechanisms.
The long range goal is to understand GPCR molecular recognition and evolution in relation to human cognition.

This work was supported by NIGMS PSI:Biology for GPCR structure processing (U54GM094618) and the NIH Roadmap Initiative (JCIMPT) for technology development (GM073197).


Burkhard Rost, TU Munich

Evolution teaches protein prediction

The objective of our group is to predict aspects of protein function from sequence. The only reason why we can pursue such an ambitious goal is the wealth of evolutionary information available through the comparison of the whole bio-diversity of species. Many approaches have benefited substantially from using evolutionary information; for some of these methods learning from evolution made the difference between possible and impossible. In my talk I will present examples of methods that target the prediction of protein interactions, of protein disorder, and of the effect of single residue mutations upon protein structure and function.


Schiöth HB.

The origin of GPCRs, the largest family of membrane bound proteins

G protein-coupled receptors (GPCRs) are the largest superfamily among membrane bound proteins. The GPCRs in humans are classified into the five main families named Glutamate, Rhodopsin, Adhesion, Frizzled and Secretin according to the GRAFS classification. Several families of GPCRs, however, show no apparent sequence similarities to each other, and it has been debated which of them share a common origin. Mining of early vertebrates including lancelet (Branchiostoma floridae) and one of the most primitive animals, the cnidarian sea anemone (Nematostella vectensis), provided considerable evidence suggesting that the Adhesion family is ancestral to the peptide hormone binding Secretin family of GPCRs. We also used integrated and independent HHsearch, Needleman-Wunsch-based and motif analyses to determine the relationship of the other main families. We found strong evidence that the Adhesion and Frizzled families are children to the cyclic AMP (cAMP) family while the large Rhodopsin family is likely a child of the cAMP family. We suggest that the Adhesion and Frizzled families originated from the cAMP family in an event close to that which gave rise to the Rhodopsin family. We also found convincing evidence that the Rhodopsin family is parent to the important sensory families Taste 2 and Vomeronasal type 1, as well as the Nematode chemoreceptor families. The insect odorant, gustatory, and Trehalose receptors, frequently referred to as GPCRs, form a separate cluster without relationship to the other families, and we propose, based on these and other results, that these families are ligand-gated ion channels rather than GPCRs. We suggest common descent of at least 97% of the GPCR sequences found in humans. Moreover, we provide the first evidence that four of the five main mammalian families of GPCRs, namely Rhodopsin, Adhesion, Glutamate and Frizzled, are present in Fungi.
The unicellular relatives of the Metazoan lineage, Salpingoeca rosetta and Capsaspora owczarzaki, have a rich group of both the Adhesion and Glutamate families, which in particular provided insight into the early emergence of the N-terminal domains of the Adhesion family. Further mining of Dictyostelium discoideum suggests that the Glutamate family is as ancient as the cAMP receptor family. Together, these studies clarify the early evolutionary history of the GPCR superfamily, whose emergence can be traced back at least 1400 MYA.


Gert Vriend, Radboud University Nijmegen Medical Centre, Netherlands

What can we (not yet) learn from 70 GPCR structures?

Headed by the next speaker, the crystallography community has cracked the GPCR crystallisation problem, and in the past years we have seen at least one new GPCR structure enter the PDB each month. These structures are in an active state, semi-active state, inactive state, or sometimes also an artefactual state. We have been comparing all available structures, trying to average out the things done to make the GPCRs crystallize (mutation of crucial residues; adding llama antibodies; adding funny salts and lipids; cloning-in lysozyme). The sheer volume of data now allows us to extract the beginning of a coherent story about the activation of GPCRs. Not surprisingly, this story agrees more with basic laws of physics and thermodynamics, and less with the myriad funny activation schemas that include distinct states like R, R*, etc., that have entered the literature over the years.

Martin Weigt, University of Sorbonne, France

From sequence variability to protein (complex) structure prediction

Many families of homologous proteins show a remarkable degree of structural and functional conservation, despite their large variability in amino acid sequences. We have developed a statistical-mechanics inspired inference approach to link this variability (easy to observe) to structure (hard to obtain), i.e. to infer directly co-evolving residue pairs which turn out to form native contacts in the folded protein with high accuracy. The gained information is used to guide tertiary and quaternary structure prediction. As a specific example, I will discuss the auto-phosphorylation complex of histidine kinases, which are involved in the majority of signal transduction systems in bacteria. Only a multidisciplinary approach integrating statistical genomics, biophysical protein simulation, and mutagenesis experiments allows us to predict and verify the - so far unknown - active kinase structure.
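To make the idea of co-evolving residue pairs concrete, the sketch below scores pairs of alignment columns by mutual information. This is a deliberately simpler covariation measure than the statistical-mechanics inspired inference the abstract describes, which aims to separate direct from indirect couplings. The toy alignment and the function names are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(col_i, col_j):
    """Mutual information (in nats) between two alignment columns."""
    n = len(col_i)
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), joint in count_ij.items():
        # p(a,b) * log( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (joint / n) * math.log(joint * n / (count_i[a] * count_j[b]))
    return mi


def top_pairs(alignment, k=3):
    """Return the k column pairs with the highest mutual information."""
    columns = list(zip(*alignment))
    scores = [
        (mutual_information(columns[i], columns[j]), i, j)
        for i, j in combinations(range(len(columns)), 2)
    ]
    scores.sort(key=lambda s: (-s[0], s[1], s[2]))
    return scores[:k]


# Invented alignment: columns 0, 1 and 3 vary together, column 2 does not.
toy_alignment = ["AKLE", "AKLE", "GRLD", "GRLD", "AKME", "GRMD"]
for score, i, j in top_pairs(toy_alignment):
    print(f"columns {i} and {j}: MI = {score:.3f}")
```

Mutual information cannot tell whether two columns covary because the residues touch or because both are coupled to a third position, which is one motivation for the more elaborate direct-coupling inference.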


Eric Westhof Architecture et Réactivité de lʼARN, Université de Strasbourg, Institut de Biologie Moléculaire et Cellulaire, CNRS, 15 rue René Descartes, 67084 Strasbourg, France

The Detection of the Architectural Modules of RNA and Recent Progress in RNA Modelling

RNA architecture can be viewed as the hierarchical assembly of preformed double-stranded helices defined by Watson-Crick base pairs and RNA modules maintained by non-Watson-Crick base pairs. RNA modules are recurrent ensembles of ordered non-Watson-Crick base pairs. Such RNA modules constitute a signal for detecting noncoding RNAs with specific biological functions. It is, therefore, important to be able to recognize such elements within genomes. Through systematic comparisons between homologous sequences and x-ray structures, followed by automatic clustering, the whole range of sequence diversity in recurrent RNA modules has been characterized. These data permitted the construction of a computational pipeline for identifying known 3D structural modules in single and multiple RNA sequences in the absence of any other information. Any module can in principle be searched, but four can be searched automatically: the G-bulged loop, the Kink-turn, the C-loop and the tandem GA loop. The present pipeline can be used for RNA 2D structure refinement, 3D model assembly, and for searching and annotating structured RNAs in genomic data. Following the recent dramatic advances in tools aimed at RNA 3D modelling, a first, collective, blind experiment in RNA three-dimensional structure prediction has been performed. The goals are to assess the leading edge of RNA structure prediction techniques, compare existing methods and tools, and evaluate their relative strengths, weaknesses, and limitations in terms of sequence length and structural complexity. The results should give potential users insight into the suitability of available methods for different applications and facilitate efforts in the RNA structure prediction community to improve their tools.


Roman A. Zubarev Division of Physiological Chemistry I, Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Scheeles väg 2, S-171 77 Stockholm, Sweden

Pathway Analysis in Expression Proteomics

Proteomics studies have revealed unexpected plasticity and dynamic nature of the human proteome. The paradigm that the time evolution of a biological system can be described by abundance variation of relatively few “regulated” proteins has been shattered, being replaced by the growing understanding that the whole proteome is regulated, and virtually no protein remains unaffected when the system undergoes transition from one state to another.

This finding underlines the importance of systems biology analysis of expression proteomics data. Systems biology shifts the analytical focus from thousands of proteins to hundreds of signaling pathways, thus reducing the number of entities to be analyzed. Application of these methods required the development of novel systems biology tools, such as the pathway search engine (PSE [1-3]). These tools can only be effective when they are quantitative, i.e. predict not only the activated pathway, but also the relative degree of its activation. Introducing the quantitative aspect in systems biology is one of the greatest challenges this field is facing today, since the final goal of pathway analysis is the creation of a quantitative predictive model of the biological process under investigation.

1. Zubarev, R. A.; Nielsen, M. L.; Savitski, M. M.; Kel-Margoulis, O.; Wingender, E.; Kel, A. Identification of dominant signaling pathways from proteomics expression data, J. Proteomics, 2008, 1, 89-96.
2. Ståhl, S.; Fung, Y.M.E.; Adams, C. M.; Lengqvist, J.; Mörk, B.; Stenerlöw, B.; Lewensohn, R.; Lehtiö, J.; Zubarev, R. A.; Viktorsson, K. Proteomics and Pathway Analysis Identifies JNK-signaling as Critical for High-LET Radiation-induced Apoptosis in Non-Small Lung Cancer Cells, Mol. Cell Proteomics, 2009, 8, 1117-1129.
3. Marin-Vicente, C.; Zubarev, R. A. Search engine for proteomics, Fact or Fiction? G.I.T. Lab J, 2009, 11-12, 10-11.

 


ScalaLife – Scalable Software Services for Life Science

Rossen Apostolov, KTH

The Life Sciences have rapidly become one of the major beneficiaries of the European e-Infrastructures, placing a growing demand on the capabilities of simulation software and on the support services. The ScalaLife project has set out to address some of the specific problems associated with this growth, acting along two distinct and complementary directions. On the one hand, the project is concerned with the discrepancy between the scalability advances made by e-Infrastructure projects such as PRACE/DEISA on large molecular systems and the reality of the typical Life Science simulation, which works predominantly with small-to-medium systems. Thus, ScalaLife is implementing new techniques for efficient small-system parallelisation, developing new hierarchical approaches (explicitly based on ensemble and high-throughput computing for new multi-core and streaming/GPU architectures) and establishing open software standards for data storage and exchange. On the other hand, the project is committed to the long-term support of the Life Science users and communities, providing both training and expert advice. First, ScalaLife is documenting and developing training material for the new techniques and data storage formats implemented by the project. Second, the project has created a pilot for a cross-disciplinary Competence Centre, which enables the Life Science community to exploit the key European applications developed as part of the project as well as the existing European e-Infrastructures effectively. By providing a training and support infrastructure and by developing an adequate framework and associated policies to foster collaboration, the Competence Centre establishes a long-term structure for the maintenance and optimisation of Life Science software. The ScalaLife Competence Centre is welcoming developers of bioinformatics applications for partnership projects!

Poster Number 1


Bioinformatics2012Abstract

Prokaryotic and eukaryotic genomes each encode hundreds of membrane transporter proteins that play important roles in the cellular import and export of ions, small molecules or macromolecules. Therefore, the functional classification of membrane proteins is an important task in genome annotation. Experimental knowledge about transporter function has been compiled in databases such as TCDB, TransportDB, and Aramemnon. An important research question for membrane biology is whether two membrane transporters in organisms X and Y that show a certain sequence similarity will have the same function or not. Previous computational work in this area includes, e.g., the tool TransportTP (Xhao 2009) and work by (Gromiha 2008). Prediction methods often include features such as sequence homology, enriched motifs, and amino acid properties. Interestingly, no study has so far critically analyzed the reliability margins of the individual features. Here, we provide a benchmarking study of the transferability of functional classifications of membrane transporters between organisms. We have tested the method using the transporters of the two model organisms E. coli and Arabidopsis thaliana. 157 experimentally validated transporter sequences from E. coli were obtained from TransportDB and 156 such sequences from A. thaliana were obtained from the Aramemnon database. The statistical significance of sequence similarity between an input sequence and sequences in the training set was determined using the well-known tools BLAST and HMMER. The MEME program suite was used to identify enriched motifs in different transporter families. Later, the MAST program from the MEME suite provided a score for statistically significant motifs identified in the unknown sequence. If all 3 approaches (BLAST, HMMER, MEME) assigned membership to the same TC family, this was considered a high confidence annotation. We tested at which E-value annotations could be reliably transferred between E. coli and A. thaliana. For this purpose we created subsets according to (1) TC families, (2) substrate annotations and (3) substrates split into TC families. According to the TC system, transporters of the two organisms are annotated to 47 different TC families (E. coli) and 29 (A. thaliana). 14 TC families are shared and could be used for testing. Concerning the first subset, E-values of 10^-4, 10^-3 and 10^-8 were identified as reliable thresholds for the three classifiers BLAST, HMMER and MEME, respectively. Different thresholds were discovered for the other subsets. To the best of our knowledge, these results provide the first benchmarking study for the transfer of functional annotations for the important class of membrane transporter proteins.
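The high-confidence rule described above (all three tools must assign the same TC family) can be sketched as a small consensus function. The sketch reads the reported thresholds as 10^-4, 10^-3 and 10^-8, and assumes a hit is accepted when its E-value is at or below the tool's threshold. The input layout, the function name and the example hits are hypothetical, not taken from the study.

```python
# Thresholds for the TC-family subset, read as powers of ten (assumption).
THRESHOLDS = {"BLAST": 1e-4, "HMMER": 1e-3, "MEME": 1e-8}


def consensus_family(hits):
    """Return the TC family if all three tools agree, otherwise None.

    `hits` maps a tool name to (tc_family, e_value) for its best hit.
    This layout is an assumption made for illustration.
    """
    families = set()
    for tool, cutoff in THRESHOLDS.items():
        hit = hits.get(tool)
        if hit is None:
            return None  # a tool reported nothing: no consensus
        family, e_value = hit
        if e_value > cutoff:
            return None  # hit too weak to be trusted
        families.add(family)
    return families.pop() if len(families) == 1 else None


# Invented example hits for one query sequence.
agreeing = {
    "BLAST": ("2.A.1", 1e-30),
    "HMMER": ("2.A.1", 1e-12),
    "MEME": ("2.A.1", 1e-10),
}
conflicting = dict(agreeing, HMMER=("3.A.1", 1e-12))
weak = dict(agreeing, MEME=("2.A.1", 1e-5))

print(consensus_family(agreeing))
print(consensus_family(conflicting))
print(consensus_family(weak))
```

Requiring agreement from all three tools trades sensitivity for precision: a sequence that only two tools can place is left unannotated rather than risk a wrong transfer.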

Poster Number 2

Barghash_Ahmad_Saarland University_Germany.pdf

Comprehensive Analysis of Antibiotic Resistance Genes in River Sediment, Well Water and Soil Microbial Communities Using Metagenomic DNA Sequencing

Johan Bengtsson*, Fredrik Boulund^, Erik Kristiansson^, DG Joakim Larsson*

* Dept. of Neuroscience and Physiology, University of Gothenburg, Sweden

^ Dept. of Mathematical Statistics, Chalmers University of Technology, Sweden

The development and spread of antibiotic resistance across the globe has emerged as one of the most immense health problems of modern times, further accentuated by the slow pace of development of antibiotics with new functional mechanisms. While the role of antibiotic use and abuse in resistance development has been extensively investigated, examination of the impact of environmental antibiotic pollution in promoting emergence and dissemination of resistance genes has been limited. We have shown that the selection pressure of antimicrobial agents can be exceptionally high in environments contaminated by wastewater from antibiotic manufacturing facilities, creating the kind of extreme conditions that likely could drive mobilization of resistance genes. We have compared bacterial genes within microbial communities from river sediments upstream and downstream of a treatment plant in India receiving wastewater from the pharmaceutical industry, and releasing effluent containing high concentrations of several antibiotics into a small river. We have previously characterized these metagenomes using 454 pyrosequencing; however, to get a more thorough view of the community composition and the resistance gene content, we have now sequenced the same communities using high throughput Illumina sequencing. In addition, we have sampled soil from nearby farmland, as well as water from wells in villages affected by antibiotic pollution. From the DNA extracted from these microbial communities, we have generated more than 650 million paired-end reads, corresponding to between 15 and 20 million pairs of reads per sample. In these data, we can identify a wide range of resistance gene types.
Preliminary analysis of the resistance gene content reveals clear differences in abundances between upstream and downstream samples; for example the sul2 and sul3 genes are much more commonly encountered downstream from the treatment plant. In addition, in a nearby lake polluted by dumping of industrial waste, we find further deviations from the resistance gene pattern of the river communities, with for example higher abundance of sul1. The preliminary data also indicate that there are substantial differences in the prevalence of antibiotic resistance genes between bacterial communities from different wells. Utilizing short-read sequencing technologies allows broader screening for antibiotic resistance genes in various environments, as the vast number of reads generated by e.g. Illumina sequencing allows for far deeper studies than the fairly limited pyrosequencing approaches. Thus, we are able to search also for relatively rare types of resistance genes. However, some caution should be exercised, as the complexity of the sampled community may be too large to generate sufficiently long stretches of DNA to accurately identify and classify resistance genes and mobile genetic elements. Nevertheless, the material investigated allows more precise studies of the effect on resistance promotion in microbial communities, and consequently risks for further dissemination to human pathogens as a result of antibiotic pollution from manufacturing sites.

Poster Number 3


Searching metagenomes to identify and discover mobile fluoroquinolone antibiotic resistance genes using hidden Markov models

Fredrik Boulund1, Anna Johnning2, Mariana B. Pereira1, Joakim D.G. Larsson2, Erik Kristiansson1

1Department of Mathematical Sciences, Chalmers University of Technology and University of Gothenburg, SE-412 96 Göteborg, Sweden, 2Department of Neuroscience and Physiology, the Sahlgrenska Academy at the University of Gothenburg, Box 434, SE-405 30 Göteborg, Sweden

Antibiotics are one of our most powerful tools for treating bacterial infections and have since their introduction vastly improved human health and drastically reduced mortality rates. However, the growing use of antibiotics has brought increased resistance in pathogens. Bacteria can acquire resistance either through chromosomal mutations or via horizontal transfer of antibiotic resistance genes. It is believed that there exists a vast and unexplored environmental library of mobile resistance genes called the resistome. Many antibiotics are derived from compounds produced by organisms in the environment and bacteria have therefore developed natural protection mechanisms against such substances. Not surprisingly, it has been shown that several of the clinically important resistance genes originate from the environment. Fluoroquinolones are a family of widely used broad-spectrum antibiotics of synthetic origin, thus lacking any known natural production system. Consequently, it was originally believed that they would lack any natural resistance mechanisms. However, a class of mobile fluoroquinolone resistance genes called qnr was recently discovered. There are currently five known subclasses of plasmid-mediated qnr genes, with the last novel subclass discovered as late as 2009. It is unknown whether more subclasses exist in the environmental resistome. The Qnr proteins are pentapeptide repeat proteins that display a repeating pattern of five amino acid residues. Based on this distinctive sequence feature we created a hidden Markov model from the sequences of all currently known plasmid-mediated subclasses and variants. To enable identification of novel qnr-like gene variants or subclasses, we developed a classifier to discriminate between putative novel qnr sequence fragments and non-qnr fragments in metagenomic data.
Evaluation of the model’s performance showed that the statistical power for correctly classifying fragments from a novel class of qnr genes was more than 94% for input sequences as short as 100 nucleotides. We applied the model to several large datasets containing both annotated (e.g. NCBI GenBank) and metagenomic sequences produced with high-throughput sequencing technologies (e.g. CAMERA, Meta-HIT). Using our method, we were able to identify all previously known qnr-genes, as well as several putative novel variants. In addition, we discovered several sequences in the annotated data sources where we could correct and improve annotation.

Poster Number 4


Two-site mechanism for the allosteric modulation of pentameric ligand gated ion channels by anesthetics and alcohols

Torben Broemstrup (a,c), Rebecca Howard (d), Samuel Murail (c), James Trudell (e), Adrian Harris (d), Erik Lindahl (a,b)

(a) Center for Biomembrane Research, Department of Biochemistry and Biophysics, Stockholm University, SE-10961 Stockholm, Sweden
(b) Theoretical and Computational Biophysics, Kungliga Tekniska högskolan Royal Institute of Technology, SE-10691 Stockholm, Sweden
(c) Institut Pasteur, Groupe Récepteurs-Canaux, and Centre National de la Recherche Scientifique, Unité de Recherche Associée 2182, F-75015 Paris, France
(d) Waggoner Center for Alcohol and Addiction Research, The University of Texas at Austin, Austin, Texas, United States of America
(e) Department of Anesthesia and Beckman Program for Molecular and Genetic Medicine, Stanford University School of Medicine, Stanford, United States of America

Pentameric ligand-gated ion channels (pLGICs) of the Cys-loop family mediate fast chemo-electrical transduction. General anesthetics and n-alcohols alter nerve signaling by interacting with pLGICs. Despite mutagenesis and labeling studies, the relevant anesthetic binding sites remain controversial, as modeling studies have proposed diverse intrasubunit and intersubunit binding sites. The recent determination of the crystal structure of GLIC, a prokaryotic member of the pLGIC family, enables structural studies to characterize the anesthetic and alcohol binding sites. Although GLIC is a pLGIC from a lower organism, it shows the bimodal n-alcohol modulation of eukaryotic channels: methanol and ethanol potentiate the channel, while longer n-alcohols inhibit it.

Site-directed mutagenesis studies and a chimera between GLIC and the human glycine receptor identified the transmembrane domain as the alcohol binding location. A single mutation in GLIC was identified that turns the volatile anesthetics desflurane and chloroform from inhibitors into activators. Furthermore, this mutation increases ethanol potentiation and extends n-alcohol potentiation up to hexanol, while longer-chain alcohols still inhibit the channel; in the wild type, only methanol and ethanol are potentiating. To explain the increased potentiation of the GLIC mutant, the exact interaction sites of general anesthetics and n-alcohols need to be characterized and the binding to the different sites needs to be quantified.

To this end we apply atomistic MD simulations and the free energy perturbation (FEP) method to obtain binding free energies for desflurane and chloroform, as well as n-alcohols, in the intra- and intersubunit binding sites of GLIC. Our results demonstrate two independent binding sites for alcohols and anesthetics in GLIC: an inhibitory intrasubunit site and a potentiating intersubunit site. For example, the free energies of binding show that the wild-type inhibition by desflurane correlates with superior intrasubunit binding of desflurane (intra: -21.8 ± 0.3 kJ/mol versus inter: -14.4 ± 0.6 kJ/mol), while the potentiation-enhancing mutation makes desflurane intersubunit binding superior to intrasubunit binding (intra: -19.7 ± 0.4 kJ/mol versus inter: -23.2 ± 0.5 kJ/mol). Similarly, the binding affinities of n-alcohols in the intersubunit site are increased by the mutation, which correlates with the increased potentiation of GLIC by n-alcohols.
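To put the reported free energy differences on a more intuitive scale, the following minimal sketch converts a difference in binding free energy into a ratio of dissociation constants using ΔG = RT ln Kd. The temperature of 300 K and the function name `affinity_ratio` are assumptions made for illustration (the abstract does not state the simulation temperature), and the reported uncertainties are ignored.

```python
import math

GAS_CONSTANT = 8.314462618e-3  # kJ/(mol*K)

def affinity_ratio(dg_tight, dg_weak, temperature_k=300.0):
    """Return Kd(weak site) / Kd(tight site) for binding free energies in kJ/mol.

    Uses dG = RT ln(Kd). temperature_k = 300 K is an assumed value,
    not one taken from the abstract.
    """
    ddg = dg_weak - dg_tight
    return math.exp(ddg / (GAS_CONSTANT * temperature_k))

# Desflurane values quoted in the abstract (kJ/mol).
wild_type = affinity_ratio(-21.8, -14.4)  # intrasubunit vs intersubunit
mutant = affinity_ratio(-23.2, -19.7)     # intersubunit vs intrasubunit

print(f"wild type: intrasubunit site favored about {wild_type:.0f}-fold")
print(f"mutant: intersubunit site favored about {mutant:.0f}-fold")
```

Under the assumed temperature, the wild-type preference for the intrasubunit site corresponds to roughly a 19-fold difference in Kd, and the mutant preference for the intersubunit site to roughly a 4-fold difference.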

In conclusion, we present a two-site model for the modulation of pLGICs with an inhibitory intrasubunit site and a potentiating intersubunit site. Computational prediction of the binding affinities gives quantitative support for the two-site model, demonstrating that differential binding to the two sites results in differential modulation of pLGICs.

Poster Number 5

If There’s an Order in All of This Disorder…: Structural Bioinformatics of the Human Spliceosomal Proteome

Iga Korneta1, Marcin Magnus1, Janusz M. Bujnicki1,2,*

1 Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology, Warsaw, PL-02-109, Poland
2 Bioinformatics Laboratory, Institute of Molecular Biology and Biotechnology, Faculty of Biology, Adam Mickiewicz University, Poznań, PL-61-614, Poland
* [email protected]

The spliceosome is one of the largest molecular machines known. It performs the excision of introns from eukaryotic pre-mRNAs. In human cells it comprises five RNAs, over one hundred “core” proteins and more than one hundred additional associated proteins. The details of the spliceosome mechanism of action are unclear, because only a small fraction of spliceosomal proteins have been characterized structurally in high resolution. To aid structural and functional analyses of the spliceosomal proteins and complexes, and to provide a starting point for multiscale modeling, we carried out a comprehensive structural bioinformatics analysis of the entire spliceosomal proteome.

First, we discovered that almost half of the combined sequence of proteins abundant in the spliceosome is predicted to be intrinsically disordered, at least when the individual proteins are considered in isolation. The distribution of intrinsic order and disorder throughout the spliceosome is uneven, and is related to the various functions performed by the intrinsic disorder of the spliceosomal proteins in the complex. In particular, proteins involved in the secondary functions of the spliceosome, such as mRNA recognition, intron/exon definition and spliceosomal assembly and dynamics, are more disordered than proteins directly involved in assisting splicing catalysis. Conserved disordered regions in splicing proteins are evolutionarily younger and less widespread than ordered domains of essential splicing proteins at the core of the spliceosome, suggesting that disordered regions were added to a preexistent ordered functional core. The spliceosomal proteome contains a much higher amount of intrinsic disorder predicted to lack secondary structure than the proteome of the ribosome, another large RNP machine. This result agrees with the currently recognized different functions of proteins in these two complexes.

For the ordered part of the spliceosomal proteome, we carried out protein structure prediction. We identified new domains in spliceosomal proteins and predicted 3D folds for many previously known domains. We also established a non-redundant set of experimental models of spliceosomal proteins and constructed in silico models for regions without an experimental structure. Altogether, over 90% of the ordered regions of the spliceosomal proteome can be represented structurally with a high degree of confidence. The combined set of structural models for the entire spliceosomal proteome is available for download from the SpliProt3D database (http://iimcb.genesilico.pl/SpliProt3D).

Finally, we analyzed the reduced spliceosomal proteome of the intron-poor organism Giardia lamblia, and as a result, we proposed a candidate set of ordered structural regions necessary for a functional spliceosome.

The results of this work enable multiscale modeling of the structure and dynamics of the entire spliceosome and its subcomplexes and will have a profound impact on the understanding of the molecular mechanism of mRNA splicing.

Poster Number 6

COMPREHENSIVE ANALYSIS OF UNIDENTIFIED LC-MS FEATURES FOR INVESTIGATING PROTEIN DIVERSITY IN HIGH-THROUGHPUT PROTEOMICS EXPERIMENTS

A.L. Chernobrovkin*, V.G. Zgoda, A.V. Lisitsa and A.I. Archakov

Institute of Biomedical Chemistry RAMS, Moscow, Russia
e-mail: [email protected]
*Corresponding author

Key words: single amino-acid polymorphisms; LC-MS; protein identification

Motivation and Aim

More than 65 thousand nsSNPs are known to exist in the human genome, and more than 20% of them are associated with different diseases. However, the vast majority of annotated nsSNPs have not yet been observed at the protein level. Investigation of disease-related nsSNPs at the protein level can shed light on the molecular nature of diseases and provide additional information for molecular biomarker discovery.

Methods and Algorithms

According to a recent estimate, only small proteomes can be analyzed properly using high-accuracy LC-MS without using MS/MS for peptide identification [1]. Within the human proteome only 20% of peptides can be properly identified using only accurate parent mass and retention time data. Here we propose a new strategy for the analysis of unidentified LC-MS features, which significantly increases the sequence coverage of proteins identified using MS/MS data and reveals protein variants caused by translation of non-synonymous nucleotide polymorphisms. The method uses accurate m/z and retention time data to assign theoretical peptides of proteins identified by MS/MS to the unidentified LC-MS features. As an additional resource for removing ambiguity in feature annotation, we use quantitative data on protein abundance changes during cell differentiation.

Results

A total of 1370 proteins were identified in HL60 cells using LC-MS/MS (LTQ Orbitrap Velos, Thermo Scientific) analysis of tryptically digested cell lysates. Quantitative analysis was performed using Progenesis LC-MS software and revealed 300 proteins whose abundance changed more than 3-fold during the cell differentiation process. LC-MS chromatograms were reanalyzed to select those features that could be matched to the tryptic peptides of the selected proteins and their variants. This procedure gives a two- to three-fold increase in the sequence coverage of the selected proteins. Additionally, we observed 38 features that match 17 SAP-specific proteotypic peptides of identified proteins.

Conclusion

The proposed approach makes it possible to decrease the number of unassigned features in LC-MS-based proteomics experiments. Assigning additional features to previously identified proteins increases protein sequence coverage and reveals variant-specific proteotypic peptides.

References

1. P. Bochet et al. (2010) Fragmentation-free LC-MS can identify hundreds of proteins, Proteomics, 11(1): 22-32.
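The accurate-mass part of the matching step described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the 5 ppm tolerance and the considered charge states are hypothetical choices, the retention time and abundance criteria used by the authors are omitted, and the residue masses are standard monoisotopic values.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # Da, added once per peptide
PROTON = 1.007276   # Da, added once per charge

def peptide_mz(sequence, charge):
    """Theoretical m/z of an unmodified peptide at the given charge state."""
    neutral = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    return (neutral + charge * PROTON) / charge

def match_features(features, peptides, tol_ppm=5.0, charges=(1, 2, 3)):
    """Assign peptides to features by accurate mass only.

    features: list of (feature_id, observed_mz) pairs.
    Returns a list of (feature_id, peptide, charge, ppm_error) tuples.
    tol_ppm and charges are assumed values for illustration.
    """
    matches = []
    for feature_id, observed_mz in features:
        for peptide in peptides:
            for charge in charges:
                theoretical = peptide_mz(peptide, charge)
                ppm = (observed_mz - theoretical) / theoretical * 1e6
                if abs(ppm) <= tol_ppm:
                    matches.append((feature_id, peptide, charge, ppm))
    return matches

example = match_features([("f1", 400.6873), ("f2", 500.0)], ["PEPTIDE"])
```

In practice several peptides can fall within the tolerance of one feature, which is why the abstract brings in retention time and abundance changes to resolve ambiguous assignments.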

Poster Number 7

Structural bioinformatics analysis of pre-mRNA editing complex in Trypanosoma brucei

Anna Czerwoniec1, Joanna Kasprzak1, Patrycja Bytner1, Janusz M. Bujnicki1,2

1 Bioinformatics Laboratory, Institute of Molecular Biology and Biotechnology, Adam Mickiewicz University, Umultowska 89, PL-61-614 Poznan, Poland

2 Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology, Ks. Trojdena 4, PL-02-190 Warsaw, Poland

Corresponding author: [email protected]

Key words

structural bioinformatics, pre-mRNA editing complex, Trypanosoma brucei

Abstract

Mitochondrial pre-mRNA in kinetoplastid trypanosomes undergoes an editing process to become a translatable molecule. Insertion and deletion of uridine nucleotides is catalyzed by up to 20 proteins acting in a series of catalytic steps. Despite intensive research on editing complexes, their complete structure and the interactions between their components remain unknown. Here we present a structural analysis of the ~20S pre-mRNA editing complexes of Trypanosoma brucei. We built homology models for components of the ~20S complexes and gathered information about disordered regions of proteins and about macromolecular interactions between individual elements and within whole editing complexes. We then used PyRy3D, software developed in our group, to build and visualize very low-resolution 3D models of large macromolecular complexes fitted into density maps. The procedure represents components as experimental structures (e.g. X-ray or NMR models), structural models (e.g. homology models) or flexible shapes, and applies a Monte Carlo approach to find solutions fulfilling experimental restraints. All generated models were clustered, scored and ranked, and the best complexes are presented. The results provide information about macromolecular interactions in pre-mRNA editing complexes.

Acknowledgments

This analysis was funded by the Polish Ministry of Science and Higher Education (grant to AC - number 0083/IP1/2011/71, grant to JK - N N301 123138).

Poster Number 8

The Triform algorithm: improved sensitivity and specificity in ChIP-Seq peak finding

Karl Kornacker1, Morten Beck Rye2, Tony Håndstad2, and Finn Drabløs2

1Division of Sensory Biophysics, Ohio State University, Columbus, Ohio, USA
2Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology (NTNU), NO-7491 Trondheim, Norway

Chromatin immunoprecipitation combined with high-throughput sequencing (ChIP-Seq) is the most frequently used method to identify the binding sites of transcription factors. Active binding sites can be seen as peaks in enrichment profiles when the sequencing reads are mapped to a reference genome. However, the profiles are normally noisy, making it challenging to identify all significantly enriched regions in a reliable way, and with an acceptable false discovery rate. We have developed the Triform algorithm, an improved approach to automatic peak finding in ChIP-Seq enrichment profiles for transcription factors. The method uses model-free statistics to identify peak-like distributions of sequencing reads, taking advantage of improved peak definition in combination with known characteristics of ChIP-Seq data. The statistical test in Triform is fully nonparametric, i.e. free from any assumed relationships or fitted parameters. In particular, the test is free from any assumed background model and is therefore more robust than model-based tests, which depend on locally uniform background models and fitted background parameters.

Triform outperforms several existing methods (i.e. MACS, Meta, QuEST, PeakRanger, PICS, FindPeaks, and TPic) in the identification of representative peak profiles in curated benchmark data sets for the transcription factors NRSF/REST, SRF and MAX [1]. We also show that Triform in many cases is able to identify peaks that are more consistent with biological function, compared with other methods. In particular, we test for properties that are significantly associated with peak regions identified by Triform, MACS, Meta, QuEST, PeakRanger and TPic, using statistical overrepresentation analysis. Finally, we show that Triform can be used to generate novel information on transcription factor binding in repeat regions, which represents a particular challenge in many ChIP-Seq experiments.

1. Rye MB, Saetrom P, Drablos F: A manually curated ChIP-seq benchmark demonstrates room for improvement in current peak-finder programs. Nucleic Acids Res 2011, 39(4):e25.

Poster Number 9

HAMP domains: implications for transmembrane signal transduction

Stanisław Dunin-Horkawicz (a,b), Andrei Lupas (b)

(a) International Institute of Molecular and Cell Biology, Warsaw, Poland
(b) Max Planck Institute for Developmental Biology, Tuebingen, Germany

Homodimeric receptors with one or two transmembrane (TM) segments per monomer are universal to life and represent the largest and most diverse group of cellular TM receptors. They frequently share domain types across phyla and, in some cases, have been recombined experimentally into functional chimeras (e.g., the bacterial aspartate chemoreceptor with the human insulin receptor), suggesting that they have a common mechanism. We have proposed a model for the transduction mechanism by axial helix rotation, based on the structure of a widespread domain, HAMP, that frequently occurs in direct continuation of the last TM segment. Here we show by statistical analysis that HAMP domain sequences have biophysical properties compatible with the two conformations proposed by the model. The analysis also identifies networks of coevolving residues, which allow the mechanism to be subdivided into individual steps. The most extended of these networks is specific for membrane-bound HAMP domains and most likely accepts the signal from the TM helices. In a classification based on sequence clustering, these HAMPs form a central supercluster, surrounded by smaller clusters of divergent HAMPs, which typically combine into arrays of up to 31 consecutive copies and accept conformational input from other HAMP domains.
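The abstract does not say which statistic was used to detect coevolving residues. As one common possibility, the sketch below computes the mutual information between two columns of a multiple sequence alignment; the function name and the toy columns are hypothetical, and no correction for phylogeny or small sample size is applied.

```python
from collections import Counter
from math import log2

def column_mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns.

    col_a and col_b hold the residues observed at two positions,
    one entry per sequence, in the same sequence order.
    """
    if len(col_a) != len(col_b) or not col_a:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    return sum(
        (c / n) * log2((c / n) / ((count_a[x] / n) * (count_b[y] / n)))
        for (x, y), c in count_ab.items()
    )

# Toy columns: perfectly coupled positions versus independent ones.
coupled = column_mutual_information("AAVV", "LLII")
independent = column_mutual_information("AAVV", "LILI")
```

Pairs of positions with high scores can then be linked into a graph, whose connected groups correspond to the kind of residue networks the abstract describes.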

Poster Number 10

Allele-specific expression changes after induction of inflammation

Recent advances in RNA and DNA sequencing technology have enabled a more detailed picture of gene expression and genomic differences to emerge. One particularly interesting aspect is the difference in expression between the two different alleles of a gene within a single individual, one inherited from the mother and one from the father. Any such allele-specific expression (ASE) could indicate an allele-specific cis-acting genetic factor. ASE thereby provides an efficient means to explore the functional effects of genomic variation and can help in identifying functional variants in the extensive conserved non-coding part of the genome.

In this study we assessed ASE in human white blood cells with and without treatment with the immune-inducing chemical LPS by performing RNA-seq on several individuals. This allowed us to study ASE of transcripts that are potentially of special importance in inflammation. Further, to find candidate haplotypes responsible for observed allelic differences, we conducted whole-genome genotyping of the RNA source subjects. Preliminary results indicate that about 5% of all genes show ASE. Searching for variants where a change in allele specificity was induced by the treatment, we detected a total of 117 unique significant variants among all individuals, of which ten were found in two or more individuals. To our knowledge, ASE analysis coupled with differential expression analysis of inflammation-induced cells has not previously been done.
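One common way to test a heterozygous site for ASE is an exact binomial test of the reference and alternative allele read counts against an expected 50:50 ratio. The abstract does not state which test was used, so the sketch below is an assumption for illustration only; the function name and read counts are hypothetical, and real analyses usually also correct for mapping bias and multiple testing.

```python
from math import comb

def binomial_two_sided_p(ref_reads, alt_reads, p_ref=0.5):
    """Exact two-sided binomial p-value for allelic imbalance at one site.

    Sums the probabilities of all outcomes that are no more likely than
    the observed reference read count under the null proportion p_ref.
    """
    n = ref_reads + alt_reads
    probs = [
        comb(n, k) * p_ref ** k * (1 - p_ref) ** (n - k)
        for k in range(n + 1)
    ]
    observed = probs[ref_reads]
    # Small relative tolerance so that ties are not lost to rounding.
    total = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, total)

balanced = binomial_two_sided_p(10, 10)   # no evidence of ASE
skewed = binomial_two_sided_p(20, 0)      # all reads from one allele
```

Comparing such per-site results between untreated and LPS-treated samples would be one way to look for the treatment-induced changes in allele specificity mentioned above.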

Poster Number 11

Edsgärd_Daniel_KTHSciLifeLab_Sweden.pdf

Statistical assessment of gene group crosstalk enrichment in networks

Oliver Frings1,2, Theodore McCormack1,2,‡, Andrey Alexeyenko1,3, Erik L.L. Sonnhammer1,2,4

1Stockholm Bioinformatics Centre, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden
2Department of Biochemistry and Biophysics, Stockholm University
3School of Biotechnology, Royal Institute of Technology
4Swedish eScience Research Center

Abstract

Motivation: Analyzing groups of functionally coupled genes or proteins in the context of global interaction networks has become an important part of bioinformatic analysis. Typically, one wants to analyze the crosstalk, that is, the extent of connectivity between or within functional groups. However, this is only meaningful if the statistical significance of the measured crosstalk enrichment is assessed.

Results: We present CrossTalkZ, a statistical method and software to assess the significance of crosstalk enrichment between pairs of gene or protein groups in large undirected biological networks. We demonstrate that the standard z-score is generally an appropriate and unbiased statistic. We further evaluate the ability of four different methods to recover crosstalk within known biological pathways and estimate the confidence of the findings. We conclude that the methods preserving second-order topological properties perform the best for crosstalk analyses.

Availability and Implementation: CrossTalkZ (available at http://sonnhammer.sbc.su.se/download/software/CrossTalkZ/) is implemented in C++ and is fast, accepts various input file formats, and produces a number of statistics. These include z-score, p-value, false discovery rate, and a test of normality for the random distribution.
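The general idea of a crosstalk z-score can be sketched as follows: count the links between two groups in the real network, repeat the count in randomized networks, and standardize the observed count against that null distribution. This is not the CrossTalkZ implementation (which is written in C++). The function names are hypothetical, and the simple edge-swap randomization used here preserves node degrees only, whereas the abstract concludes that methods preserving second-order topological properties perform best.

```python
import random
import statistics

def count_crosstalk(edges, group_a, group_b):
    """Count edges that link a node in group_a to a node in group_b."""
    a, b = set(group_a), set(group_b)
    return sum(1 for u, v in edges if (u in a and v in b) or (u in b and v in a))

def rewire(edges, n_swaps, rng):
    """Degree-preserving randomization by repeated edge swaps."""
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if rng.random() < 0.5:
            c, d = d, c  # try both possible pairings over time
        if len({a, b, c, d}) < 4:
            continue  # swap would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue  # swap would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
    return edges

def crosstalk_z(edges, group_a, group_b, n_random=200, seed=0):
    """z-score of the observed crosstalk against randomized networks."""
    rng = random.Random(seed)
    observed = count_crosstalk(edges, group_a, group_b)
    null = [
        count_crosstalk(rewire(edges, 10 * len(edges), rng), group_a, group_b)
        for _ in range(n_random)
    ]
    sd = statistics.pstdev(null)
    if sd == 0:
        return float("nan")  # null distribution has no spread
    return (observed - statistics.mean(null)) / sd
```

Interpreting the z-score as a p-value assumes that the null counts are approximately normal, which is presumably why the tool also reports a test of normality for the random distribution.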

Poster Number 12

Associate Professor David E. Gloriam

University of Copenhagen, Department of Drug Design and Pharmacology, Universitetsparken 2, 21000 Copenhagen, E-mail: [email protected]

Chemogenomic Discovery of Allosteric Antagonists at the GPRC6A Receptor

We have integrated chemogenomic ligand inference, homology modeling, compound synthesis, and pharmacological mechanism-of-action studies to discover the most selective GPRC6A allosteric antagonists reported to date (1). GPRC6A is a Family C G protein-coupled receptor recently discovered and deorphanized by the Bräuner-Osborne group at the University of Copenhagen. Three compounds with at least ~3-fold selectivity for GPRC6A were discovered, which represent a significant step forward compared with the previously published GPRC6A antagonists, calindol and NPS 2143, which instead are ~30-fold selective for the calcium-sensing receptor. The antagonists constitute novel research tools for investigating the signaling mechanism of the GPRC6A receptor at the cellular level and serve as initial ligands for further optimization of potency and selectivity, enabling future ex vivo/in vivo pharmacological studies.

Our chemogenomic lead identification is, to our knowledge, the first ligand inference between two different GPCR families, Families A and C. The unprecedented inference of pharmacological activity across GPCR families provides proof-of-concept for in silico approaches against Family C targets based on Family A templates, greatly expanding the prospects of successful drug design and discovery. Furthermore, ongoing work on the application of the chemogenomic method to a large number of orphan receptors and drug targets will be described. Finally, a novel bioinformatic method for the identification of endogenous peptide ligands will be presented.

(1) Gloriam, D. E. et al., Chem. Biol. 2011, 11, 1489-1498.

Poster Number 13

CCdeep - a bioinformatics tool for chromatin conformation data analysis.

David Gomez-Cabrero1,2, Davide Cittaro3, Deborah Farmer4, Alejandro Woodbridge5, Jesper Tegnér1, Elia Stupka3

1Unit of Computational Medicine, Center for Molecular Medicine, Department of Medicine, Karolinska University Hospital, Solna, Sweden
2BILS, Bioinformatics Infrastructure for Life Sciences, Sweden
3San Raffaele Scientific Institute, Center for Translational Genomics and Bioinformatics, Italy
4Division of Infection and Immunity, University College of London, United Kingdom
5Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet, Solna, Sweden

Next Generation Sequencing (NGS) technologies make it possible to address biological questions at a new level of resolution, resulting in large volumes of data that as a rule require sophisticated yet manageable bioinformatics pipelines for their analysis and understanding. One recent example is the analysis of Chromatin Conformation (CC) data capturing chromosome structure and chromosome-chromosome interactions. Recent experimental identification of CC has been performed by what is referred to as the Chromosome Conformation Capture (3C) technique. Here the contact probabilities between specific loci are quantified, and recent developments allow the quantification of multiple loci, thus setting the stage for genome-wide mapping of pairwise contacts.

However, the analysis of the data provided by 3C or Hi-C faces two major challenges. First, pre-processing and normalization of the data are key, since the experimental procedures have inherent biases and artifacts; even though experimental methods are under development, proper methods for pre-processing and normalization are urgently needed. The second challenge is how to generate biological conclusions from the data. To illustrate, in Hi-C experiments the "number of reads" in a sequencing sample and the "number of possible chromosomal interactions" are of approximately the same order of magnitude for human samples, and furthermore the mean and median number of reads per sample is not larger than one. Hence, distinguishing signal from noise under these conditions is difficult. Finally, to the best of our knowledge there is not yet any published software allowing researchers to manipulate and analyze their own CC data.

We have therefore designed CCdeep, software written in C++ that incorporates tools for pre-processing, normalization and bioinformatics analysis of CC data. The tool divides the analysis into four stages. The first stage filters the reads and identifies CC loci of low quality (for instance using mapping quality for reads and mappability for CC loci). In the second stage CCdeep summarizes the reads into locus-to-locus interactions by assigning reads to loci. The third stage implements three algorithms: (a) a normalization algorithm (similar to (4)), (b) an interaction identification algorithm and (c) demarcation of physical domains (see (3)). A fourth stage allows the integration of the CC data with other data types such as ChIP-Seq data. CCdeep allows researchers to define their own parameters for the analysis, but it also provides suggestions based on samples of the data. The summary outputs are prepared to allow the generation of explanatory plots.

Poster Number 14
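The first two stages of the CCdeep analysis (read filtering and summarizing reads into locus-to-locus interactions) can be illustrated with a short sketch. This is not CCdeep code, which is written in C++. The tuple layout of a read pair, the fixed-size bins used as loci, and the default values of `bin_size` and `min_mapq` are all assumptions made for the example.

```python
from collections import Counter

def summarize_contacts(read_pairs, bin_size=1_000_000, min_mapq=30):
    """Count read pairs per pair of loci.

    read_pairs: iterable of (chrom1, pos1, mapq1, chrom2, pos2, mapq2).
    A locus is approximated here by a fixed-size genomic bin, written
    as (chromosome, bin_index). Both defaults are assumed values.
    """
    contacts = Counter()
    for chrom1, pos1, mapq1, chrom2, pos2, mapq2 in read_pairs:
        if min(mapq1, mapq2) < min_mapq:
            continue  # filtering: drop pairs with a poorly mapped read
        locus1 = (chrom1, pos1 // bin_size)
        locus2 = (chrom2, pos2 // bin_size)
        # Sort the two loci so that A-B and B-A land on the same key.
        key = tuple(sorted((locus1, locus2)))
        contacts[key] += 1
    return contacts

pairs = [
    ("chr1", 1_500_000, 60, "chr2", 200_000, 60),
    ("chr2", 250_000, 40, "chr1", 1_999_999, 42),
    ("chr1", 10, 5, "chr1", 20, 60),  # removed by the quality filter
]
contact_counts = summarize_contacts(pairs)
```

The resulting sparse table of counts is the kind of input on which normalization and interaction calling, as in the third stage, would operate.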

Title: Topology prediction and three-dimensional modeling of single-chain and multi-chain transmembrane β-barrel proteins
Authors: Sikander Hayat and Arne Elofsson
Affiliation: Dept. Biochem. & Biophy., Stockholm University, SciLifeLab, Stockholm, Sweden

Transmembrane β-barrels play a major role in the normal functioning of a cell, for example in the translocation and insertion of other proteins and the transport of substrates across the membrane. Further, both single-chain and multi-chain transmembrane β-barrels are key constituents of the Type V secretion system in gram-negative bacteria and are relevant anti-microbial drug targets. Given that it is difficult to crystallize membrane proteins to determine their 3D structure experimentally, it is necessary to develop computational methods to predict their topology and three-dimensional structure. The barrel region of single-chain transmembrane β-barrels is formed by a single protein chain, and the number of β-strands varies from 8 to 24. In contrast, the barrel of multi-chain transmembrane β-barrels consists of at least 3 chains that contribute an equal number of strands to form a barrel.

We recently developed computational methods for the topology prediction (BOCTOPUS [1]) and three-dimensional modeling (tobmodel [2]) of single-chain transmembrane β-barrels. Briefly, BOCTOPUS is a two-stage topology predictor that employs a support vector machine and a hidden Markov model layer to account for local and global residue preferences. BOCTOPUS uses a position-specific scoring matrix as input and outputs the topology (i=inner loop, o=outer loop, M=β-strand). BOCTOPUS predicts the correct number of strands in 30.1±1.5 out of 36 proteins in the dataset. Further, the correct topology is predicted for 25.4±2.0 proteins, which is slightly better than the other methods [1].

For three-dimensional modeling, BOCTOPUS [1] is first used to obtain alternative topologies for the given sequence. Then, multiple Cα models of the transmembrane β-barrel region are generated for different tilts of β-strands for all obtained topologies [2]. A novel z-coordinate predictor called ZPRED3 is used to predict the distance of residues from the membrane center. The top-ranking model is then chosen based on the minimum difference between the predicted z-coordinate and the z-coordinate obtained from the generated models [2]. Tobmodel is compared with TMBpro [4] and 3D-SPoT [3]. Models obtained from TMBpro [4] and tobmodel have an average RMSD of 8.79 and 7.24 Å, respectively. The average TM_Score for TMBpro models is 0.56 and is slightly higher than for top-ranking tobmodel models (0.43). The average RMSD when the correct topology is available is 5.86 Å and 4.10 Å for tobmodel and 3D-SPoT [3], respectively. However, we believe that tobmodel can be a useful tool for topology prediction and 3D modeling of transmembrane β-barrels. In the future, more advanced model selection methods will be developed to select the best possible model. Further, BOCTOPUS and other methods cannot predict the topology of multi-chain transmembrane β-barrels with high accuracy. Thus, we are currently working on the topology prediction and three-dimensional modeling of multi-chain transmembrane β-barrels. BOCTOPUS and tobmodel are freely available at boctopus.cbr.su.se and tobmodel.cbr.su.se

1. Hayat, S. & Elofsson, A., Bioinformatics, 28, 516–522, (2012).
2. Hayat, S. & Elofsson, A., ISMB Proceedings, (2012).
3. Naveed, H., Xu, Y., Jackups, R., and Liang, J., Journal of the Am. Chem. Society, 134, 1775–1781, (2012).
4. Randall, A., Cheng, J., Sweredoski, M., and Baldi, P., Bioinformatics, 24, 513–520, (2008).
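Two steps of the abstract above lend themselves to a small illustration: reading strands off a topology string in the i/o/M alphabet, and choosing the model whose z-coordinates agree best with the predicted ones. The sketch below is not taken from BOCTOPUS or tobmodel. The function names are hypothetical, and the mean absolute difference is an assumed choice, since the abstract only speaks of a "minimum difference".

```python
from itertools import groupby

def strand_segments(topology):
    """Return 0-based inclusive (start, end) positions of each 'M' run.

    topology uses the alphabet from the abstract:
    i = inner loop, o = outer loop, M = beta-strand.
    """
    segments = []
    position = 0
    for label, run in groupby(topology):
        length = len(list(run))
        if label == "M":
            segments.append((position, position + length - 1))
        position += length
    return segments

def pick_model(predicted_z, models_z):
    """Index of the model closest to the predicted z-coordinates.

    Closeness is the mean absolute difference per residue, which is an
    assumption made for this sketch.
    """
    def mean_abs_diff(model):
        return sum(abs(p - m) for p, m in zip(predicted_z, model)) / len(predicted_z)
    return min(range(len(models_z)), key=lambda i: mean_abs_diff(models_z[i]))

strands = strand_segments("iiMMMMooMMMMii")
best = pick_model([0.0, 5.0, 10.0], [[1.0, 1.0, 1.0], [0.0, 6.0, 9.0]])
```

The number of strands of a predicted topology is then simply `len(strands)`, the quantity on which the strand-count accuracy quoted above is based.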

Poster Number 15

Luisa Hugerth1, Daniel Lundin1, Ino DeBruijn1, Daniel Herlemann2, Anders F Andersson1

1) Science for Life Laboratory, School of Biotechnology, KTH Royal Institute of Technology, Tomtebodavägen 23 B, SE-17165 Solna, Sweden
2) Leibniz Institute for Baltic Sea Research Warnemünde, Seestraße 15, 18119 Rostock

Systems Biology of Baltic Sea Microbial Food Webs

The Baltic Sea represents one of world’s largest brackish water reservoirs. It faces large yearly variations intemperature and nutrient levels and is highly affected by human activities, such as overfishing andeutrophication. As in all environments, microbes are key drivers of the fluxes of nutrients and energy in theBaltic Sea. The extremely wide variety of microbes in natural environments baffles any attempt to identify theirdiversity through culture or microscopy. In reality, only deep sequencing can provide enough data for athorough inventory of microbial stocks.

While being drivers of nutrient and energy fluxes, microbes are also greatly affected by these same conditions.Therefore, along the length and depth of the Baltic Sea, which represent gradients of salinity and dissolvedoxygen, bacterial communities are highly structured (Herlemann, 2011). Communities also experience strongfluctuations over the course of a year, but show recurrent annual patterns (Andersson, 2010).

Bacterial communities are controlled both by the environment and by interactions with other microbes. Inparticular, single­cellular eukaryotes (protists) that feed on bacteria are important elements in the energy fluxfrom bacteria to higher organisms. To investigate environmental protist variety and dynamics, we performed anextensive in silico evaluation of over 55,000 eukaryotic ribosomal RNA sequences, selecting the primer sitesthat will allow us both to amplify the widest possible variety of eukaryotes and to extract the highest proportionof unique sequences from short paired Illumina reads. A highly resolved temporal series of bacterial and protistcommunities and physico­chemical parameters will allow us to infer the network structure of food webs in theBaltic Sea environment.

However, even with high temporal resolution, identifying microbes will not reveal much about the biological and climatic factors shaping an environment unless this information is coupled to knowledge about individual microbes’ metabolic capabilities. To this end, we have initiated the work of reconstructing the genomes of the most abundant Baltic Sea microbial populations through de novo assembly of shotgun metagenomic sequencing reads. In a first study we reconstructed, by 454 sequencing, the genome of a highly abundant bacterial population in Baltic surface waters belonging to the Verrucomicrobia. Unsupervised binning of metagenomic genome fragments (contigs) using tetra-nucleotide frequencies and a self-organizing map, and verification with marker genes, resulted in a cluster of contigs that represents a near complete genome of the Verrucomicrobium. The draft genome sequence of this organism, which lacks cultured relatives, revealed a typical aerobic heterotrophic metabolism but also several glycoside hydrolases that allow the use of complex carbon molecules as carbon source. Its high abundance in the Baltic Sea and the presence of potential cellulase genes in its genome suggest an important role of this organism for organic matter decomposition in this environment. To enable assembly of lower-abundance organisms we are currently optimizing assembly protocols for Illumina-based metagenomics.
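The binning step described above relies on the observation that contigs from the same genome tend to share a similar tetranucleotide composition. The following is a minimal sketch of how such a composition profile could be computed and compared. The function names, the toy contigs and the use of Euclidean distance are assumptions made for illustration; the self-organizing map and the marker gene verification used in the study are not reproduced here, and reverse complements are not merged.

```python
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides in a fixed order (defines the vector layout).
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(seq):
    """Return a 256-long vector of tetranucleotide frequencies for one contig."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Windows containing ambiguous bases (e.g. N) are simply ignored.
    total = sum(counts[k] for k in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]


def euclidean(u, v):
    """Euclidean distance between two frequency vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


# Hypothetical toy contigs: a and b are nearly identical, c is unrelated.
contigs = {
    "contig_a": "ATGCATGCATGCATGCATGC",
    "contig_b": "ATGCATGCATGCATGCTTGC",
    "contig_c": "GGGGGGCCCCCCGGGGGGCC",
}
profiles = {name: tetranucleotide_frequencies(s) for name, s in contigs.items()}
```

In a real binning run the profiles would be computed on contigs of at least several kilobases, since short sequences give very noisy frequency estimates.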

Andersson AF, Riemann L, Bertilsson S. Pyrosequencing reveals contrasting seasonal dynamics of taxa within Baltic Sea bacterioplankton communities. ISME J, 2010. 4(2): p. 171-81.

Herlemann DPR, Labrenz M, Jürgens K, Bertilsson S, Waniek JJ, Andersson AF. Transitions in bacterial communities along the 2000 km salinity gradient of the Baltic Sea. ISME J, 2011. Apr 7.

Poster Number 16


NETWORK BASED ANALYSIS OF IN-DEPTH PROTEOMICS DATA TO ASSESS FBW7 CELLULAR TARGETS AND FUNCTIONS

Hultin Rosenberg L1, Branca R1, Forshed J1 and Lehtiö J1.

1. Clinical Proteomics Mass Spectrometry, Science For Life Laboratory, Karolinska Institutet, Karolinska University Hospital, Stockholm, Sweden.

Background

The omics research fields have so far mainly failed to deliver biomarkers for clinical and therapeutic use. Traditionally, biomarkers have been selected by differential expression analysis, scoring each gene or protein for how well its expression pattern discriminates between groups of samples. However, biological processes are driven by functional modules rather than individual genes or proteins, so it is necessary to understand how the state of single units jointly determines the higher level state of these functional modules. To enhance the generation of biologically and clinically relevant information from expression data and to enable better models of disease and healthy phenotype, systems based approaches are necessary. In systems biology the focus is on complex interactions between components in biological systems. Several recent studies have shown, for example, that the predictive performance of gene expression data can be improved by incorporating interactome data1,2. These studies revealed that changes in network activity can be more predictive of clinical outcome in cancer patients than the expression of individual genes.

Aim

In the current study, a network based approach will be developed and coupled to protein quantities from mass spectrometry data. The idea is to shift the focus from individual proteins showing differential expression to whole protein subnetworks with altered activity or regulation between conditions. The aim is to identify protein subnetworks that are altered or deregulated between wild type and Fbw7 knockout samples to define new targets of Fbw7 and improve the understanding of its function.

Fbw7 is a known tumor suppressor that targets several oncogenes for ubiquitin-mediated proteasomal degradation. Disruption of the Fbw7 gene is associated with embryonic lethality, genetic instability and tumorigenesis; however, the full extent of Fbw7 targets and functions is still poorly understood. By studying protein expression in the context of protein-protein interaction networks we hope to detect differences on a biological system level, even when they are not detectable on a protein-by-protein level.

Methods

Mass spectrometry (TMT IEF-LC-MS/MS) is applied to identify and quantify proteins in samples from the HCT116 colon cancer cell line, wild type and Fbw7 knockout. The protein quantities are merged with protein-protein interaction data from the Human Protein Reference Database (HPRD). Using a heuristic search algorithm, the protein-protein interaction network is searched for subnetworks with a significant difference in activity between phenotypic classes. The scoring of subnetworks will be based on different measures of subnetwork state: average expression activity as well as variance between proteins in the subnetwork. The highest scoring subnetworks and corresponding measures will be fed to a multivariate model to select the subnetworks most predictive of phenotypic outcome. The identified subnetworks will also be studied in terms of enriched functions and pathways to characterize them further.

1. Chuang, HY. et al. Network-based classification of breast cancer metastasis. Mol. Syst. Biol. 3, 140 (2007)
2. Taylor, IW. et al. Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nature Biotech. 27, 2 (2009)
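The heuristic subnetwork search outlined in the Methods could, as one possibility, be realized as a greedy expansion around a seed protein. The sketch below is only an illustration under stated assumptions: the score (absolute sum of log ratios divided by the square root of the subnetwork size), the function names and the toy interaction data are all hypothetical, and the variance-based measure mentioned in the abstract is not included.

```python
import math


def subnetwork_score(members, log_ratio):
    """Aggregate activity of a subnetwork (assumed score, for illustration)."""
    values = [log_ratio[p] for p in members]
    return abs(sum(values)) / math.sqrt(len(values))


def greedy_subnetwork(seed, neighbours, log_ratio, max_size=10):
    """Grow a subnetwork from `seed`, adding the best-scoring neighbour each round."""
    members = {seed}
    best = subnetwork_score(members, log_ratio)
    while len(members) < max_size:
        # Proteins adjacent to the current subnetwork that have a measurement.
        frontier = {
            n
            for m in members
            for n in neighbours.get(m, ())
            if n not in members and n in log_ratio
        }
        candidates = [
            (subnetwork_score(members | {n}, log_ratio), n) for n in sorted(frontier)
        ]
        if not candidates:
            break
        score, node = max(candidates)
        if score <= best:
            break  # no neighbour improves the score, so stop growing
        members.add(node)
        best = score
    return members, best


# Hypothetical toy interaction network and knockout/wild-type log2 ratios.
neighbours = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}
log_ratio = {"A": 1.0, "B": 0.8, "C": -0.9, "D": 1.2}

members, score = greedy_subnetwork("A", neighbours, log_ratio)
```

Starting from protein A, the search adds B and D, which change in the same direction, and stops before adding C, whose opposite change would lower the aggregate score.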

Poster Number 17


Bioinformatics2012Abstract

Lukasz Huminiecki, DBB, Stockholm University

Abstract

Whole genome duplication (WGD) is a special case of gene duplication, observed rarely in animals, whereby all genes duplicate simultaneously through polyploidisation. Two rounds of WGD (2R-WGD) occurred at the base of vertebrates, giving rise to an enormous wave of genetic novelty, but a systematic analysis of functional consequences of this event has not yet been performed.

We show that 2R-WGD affected an overwhelming majority (74%) of signalling genes, in particular developmental pathways involving receptor tyrosine kinases, Wnt and transforming growth factor-β ligands, G protein-coupled receptors and the apoptosis pathway. 2R-retained genes, in contrast to tandem duplicates, were enriched in protein interaction domains and multifunctional signalling modules of Ras and mitogen-activated protein kinase cascades. 2R-WGD had a fundamental impact on the cell-cycle machinery, redefined molecular building blocks of the neuronal synapse, and was formative for vertebrate brains. We investigated 2R-associated nodes in the context of the human signalling network, as well as in an inferred ancestral pre-2R (AP2R) network, and found that hubs (particularly involving negative regulation) were preferentially retained, with high connectivity driving retention. Finally, microarrays and proteomics demonstrated a trend for gradual paralog expression divergence independent of the duplication mechanism, but inferred ancestral expression states suggested preferential subfunctionalisation among 2R-ohnologs (2ROs).

The 2R event left an indelible imprint on vertebrate signalling and the cell cycle. We show that 2R-WGD preferentially retained genes are associated with higher organismal complexity (for example, locomotion, nervous system, morphogenesis), while genes associated with basic cellular functions (for example, translation, replication, splicing, recombination; with the notable exception of cell cycle) tended to be excluded. 2R-WGD set the stage for the emergence of key vertebrate functional novelties (such as complex brains, circulatory system, heart, bone, cartilage, musculature and adipose tissue). A full explanation of the impact of 2R on evolution, function and the flow of information in vertebrate signalling networks is likely to have practical consequences for regenerative medicine, stem cell therapies and cancer treatment.

Poster Number 18


Genomic variation of mouse microRNAs

Katherine Icay1,2*, Tessa Sipilä1,2,3*, Dario Greco4, Iiris Hovatta1,2,3,5

1Research Programs Unit, Molecular Neurology, Biomedicum-Helsinki, University of Helsinki, Finland

2Department of Medical Genetics, Haartman Institute, University of Helsinki, Finland

3Department of Mental Health and Substance Abuse Services, National Institute for Health and Welfare, Helsinki, Finland

4Department of Bioscience and Nutrition, Karolinska Institutet, Sweden

5Institute of Molecular Medicine Finland (FIMM), University of Helsinki, Finland

The last decade of genetics and biomedical research has seen an explosion of interest in microRNAs

(miRNAs), a set of small non-protein coding molecules with functional roles in the post-transcriptional

regulation of gene expression. MiRNAs are characterized in all higher organisms and tissues. A single miRNA

can target hundreds of genes, often with similar functions and/or biological processes, and regulate their

expression by inhibiting the translation of their mRNAs. Consequently, abnormal miRNA expression has

been observed in numerous diseases including psychiatric disorders. We hypothesize that genetic variation

within miRNA genes and their putative regulatory regions could affect the biological function of miRNAs,

thus resulting in phenotypic differences.

The well-characterized biochemical and/or behavioural phenotypes of inbred mouse strains make

them ideal models for the study of a wide range of phenotypes. We have previously used a panel of inbred

mouse strains and carried out gene expression profiling to identify genes that regulate anxiety-like

behaviour. Keane et al. (2011) recently published genome sequences of 17 inbred mouse strains, including

the most commonly used laboratory strains and wild-derived strains. We utilized this publicly available

dataset (http://www.sanger.ac.uk/resources/mouse/genomes/) and performed a genome-wide analysis of

SNPs and structural variations that could affect the biological function of miRNAs (i.e. within hairpin,

mature, seed and putative regulatory regions) in different inbred mouse strains.

We observed, as expected, genomic variations occurring less frequently within miRNA genes

compared to the rest of the genome. Interestingly, we also observed that a miRNA’s genomic location and

membership in a miRNA cluster or family significantly influenced the occurrence of polymorphisms.

Moreover, we observed SNPs in miRNA seed regions altering the set of predicted target genes of a miRNA

by over 90% and, consequently, the biological processes and pathways potentially regulated by a miRNA.

Additionally, miRNAs having human and/or rat orthologs were more likely to be conserved and not contain

genetic variation.

In conclusion, these findings provide a valuable characterization of miRNAs with sequence

information to be used in the consideration and design of future experimental studies. We aim to further

investigate if SNPs affect miRNA expression level with miRNA-seq data (derived from the hippocampus of

five mouse strains), with special interest in finding correlations to anxiety-like behavior.

Poster Number 19


Copernicus: Enabling large scale computing as a workflow

The amount of available compute resources has grown vastly in recent years. These resources are often underutilized, yet there are many problems that could put them to good use. As an example within the field of molecular dynamics, adaptive sampling algorithms such as Markov state modelling or free energy calculations consist of many short (100-1000) simulations to gather statistics, followed by iterations of adaptive sampling that guide the simulations of the coming iteration. Although the workflow is simple to define, these types of problems can easily utilize thousands of cores and generate massive amounts of data, and they require something more than a queue.

Copernicus is a platform that enables large scale computing to be defined as a workflow. The platform takes care of the task breakdown and distributes the tasks to available compute resources, all in a secure and fault-tolerant manner. Its overlay P2P network can utilize a wide variety of heterogeneous compute resources such as desktops, clusters and cloud compute instances, and automatic resource awareness makes sure to use the best resources for the defined job.

As a proof of concept, the folding of the villin headpiece was performed by combining molecular simulations with Markov state modelling for kinetic clustering and statistical model building. This combination made it possible to identify the native folded state without any prior knowledge within 46 hours, utilizing a total of 5736 cores on a Cray XT6 and an AMD Istanbul cluster. By combining simulations with statistical model building, parallelization was achieved both on a fine-grained level and on an algorithmic level, resulting in much stronger scaling.

The Copernicus platform is built in a general manner, and its plugin architecture makes it possible for any executable to be utilized in a workflow, making it ideal for any type of large scale statistical computation and data processing.

Poster Number 20

IMAN_POUYA_THEORETICAL_COMPUTATIONAL_BIOPHYSICS_SWEDEN.pdf

Cross species comparison of C/EBPα, PPARγ, DHS and chromatin state in mouse and human adipocytes highlights factors important for retention of PPARγ binding

Mette Jørgensen2†, Søren F Schmidt1†, Yun Chen2, Ronni Nielsen1, Albin Sandelin2* and Susanne Mandrup1*

1 Department of Biochemistry and Molecular Biology, University of Southern Denmark, Campusvej 55, DK-5230 Odense M, Denmark

2The Bioinformatics Centre, Department of Biology and Biomedical Research and Innovation Centre, Copenhagen University, Ole Maaløs Vej 5, DK-2200, Copenhagen N, Denmark

* Corresponding authors: Albin Sandelin [email protected] - Susanne Mandrup [email protected]
† Equal contributors

The transcription factors peroxisome proliferator activated receptor γ (PPARγ) and CCAAT/enhancer binding protein α (C/EBPα) are key transcriptional regulators of adipocyte differentiation and function.

In this study we have mapped all binding sites of C/EBPα and PPARγ in human SGBS adipocytes and compared these with the genome-wide profiles from mouse adipocytes to systematically investigate which biological features correlate with retention of sites in orthologous regions between mouse and human. Despite a limited interspecies retention of binding sites (~20%), several biological features make sites more likely to be retained. First, co-binding of PPARγ and C/EBPα in mouse is the most powerful predictor of retention of the corresponding binding sites in human. Second, vicinity to genes highly upregulated during adipogenesis significantly increases retention. Third, the presence of C/EBPα consensus sites correlates with retention of both factors, indicating that C/EBPα facilitates recruitment of PPARγ. Fourth, retention correlates with overall sequence conservation within the binding regions independent of C/EBPα and PPARγ sequence patterns, indicating that other transcription factors work cooperatively with these two key transcription factors. Fifth, binding sites that are highly accessible in preadipocytes (based on publicly available DHS data) are more likely to be retained. Sixth, PPARγ sites are more likely to be retained if they have high H3K27ac levels in either preadipocytes, adipocytes or both. Thus, the total H3K27ac levels in the PPARγ binding regions seem to be more important for retention than the development of acetylation during adipogenesis.

This study provides a comprehensive and systematic analysis of which biological features affect retention of binding sites between human and mouse. Specifically, we show that the binding of C/EBPα and PPARγ in adipocytes has evolved in a highly interdependent manner, indicating significant cooperativity between these two transcription factors.

 

Poster Number 21


Resistance mutations in environmental bacterial communities living under different antibiotic selection pressures

Anna Johnning1, Erik Kristiansson2, Birgitta Weijdegård1, D.G. Joakim Larsson1

1Institute of Neuroscience and Physiology, Sahlgrenska Academy at University of Gothenburg, Gothenburg, Sweden

2Department of Mathematical Sciences, Chalmers University of Technology/University of Gothenburg, Gothenburg, Sweden

Antibiotic resistance is a pressing concern for the health care sector globally. Resistant pathogenic

bacteria can cause refractory infections which lead to added suffering, greater risk of spreading of

the disease and death. Recently, pollution with antibiotics in the environment has been recognized

as a potential driver of microbial resistance. Several human pathogens, such as Escherichia coli and

Salmonella spp., can spread through contaminated water. Therefore, exposing environmental

communities of pathogenic bacteria to sufficiently high concentrations of antibiotics could select for

resistant strains and hence pose a risk to human health.

We have sampled river sediment up- and downstream from an Indian treatment plant receiving

industrial effluent from pharmaceutical production, and from a regular Swedish sewage treatment

plant. We have previously shown that the Indian river receiving the treated effluent is highly

polluted with fluoroquinolone antibiotics downstream from the treatment plant, but some are also

detected upstream. The Indian sediment samples hence represent a gradient of fluoroquinolone

pollution (in the range 914-5.24 µg ciprofloxacin/g organic matter) while no fluoroquinolones were

detected in the Swedish samples. Fluoroquinolones target the essential enzymes DNA gyrase and

topoisomerase IV, encoded by the two gene pairs gyrA and gyrB, and parC and parE respectively.

Certain mutations within these genes, especially in gyrA and parC, have been linked to a lowered

susceptibility to fluoroquinolones.

To study if the abundance of resistance mutations in the Escherichia and Salmonella communities

residing in the river sediments depends on the level of fluoroquinolone exposure, we designed

primers targeting gyrA and parC in these families. The resulting amplicons were sequenced using

massively parallel pyrosequencing, and any deviations from the amino acid sequence of the type

strain were analyzed using the GS Amplicon Variant Analyzer from 454. For Escherichia, the

resistance mutations S83L and D87N in gyrA, and S80I and E84V or E84G in parC could be detected at

all sampling sites, more or less frequently occurring in the same sequence read. There was no

apparent correlation between the level of fluoroquinolone pollution and the mutation abundance,

except that the most polluted river site showed the highest abundance of the aforementioned

double mutations in both genes. For Salmonella, we detected resistance mutations S83F and D87N in

gyrA, and Y57S in parC in all samples. In the most polluted sample, the mentioned parC mutation was

often coupled with mutation E84K. Single mutations in gyrA or parC are not sufficient to provide a

high level resistance to fluoroquinolones. The sediment directly downstream from the Indian

wastewater treatment plant appears to be the only investigated site where the selection pressure

was sufficiently high to promote bacteria with such double mutations.
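Resistance mutations such as S83L are written as reference residue, position, observed residue. As a rough illustration of how deviations from the type strain could be listed per read, the sketch below compares two aligned amino acid fragments. It is not the GS Amplicon Variant Analyzer workflow used in the study; the function names, the toy fragment and its numbering are assumptions chosen only so that positions 83 and 87 carry S and D.

```python
def call_substitutions(reference, observed, first_position=1):
    """List amino acid substitutions between two aligned, equal-length fragments.

    `first_position` is the residue number of the first character in `reference`.
    """
    if len(reference) != len(observed):
        raise ValueError("sequences must be aligned to the same length")
    return [
        f"{ref}{first_position + i}{obs}"
        for i, (ref, obs) in enumerate(zip(reference, observed))
        if ref != obs
    ]


def carries_all(substitutions, required):
    """True if one read carries every substitution in `required` (e.g. a double mutant)."""
    return set(required) <= set(substitutions)


# Hypothetical toy fragment covering residues 81-88; not taken from the poster data.
type_strain = "GDSAVYDT"
read = "GDLAVYNT"

found = call_substitutions(type_strain, read, first_position=81)
is_double_mutant = carries_all(found, ["S83L", "D87N"])
```

Counting the reads for which `carries_all` is true at each sampling site would give the abundance of double mutations discussed above.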

Poster Number 22


Statistical challenges of comparative metagenomics

Viktor Jonsson, Olle Nerman, Erik Kristiansson

Mathematical Sciences, University of Gothenburg and Chalmers University of Technology, 412 96 Göteborg, Sweden.

In metagenomics the whole joint genome of microbial communities is analyzed. Samples are taken

directly from the environment and many organisms that cannot be cultivated in the laboratory can

therefore be investigated. In comparative metagenomics the difference between samples is studied by

quantifying and comparing gene abundance. However, there are many statistical challenges associated

with comparative metagenomics that can, if not correctly handled, result in a substantial decrease in

power and data reliability. On this poster we give a summary of some of these challenges and present a

first view on new improved statistical methods for comparative metagenomics. The main challenge in

statistical analysis of metagenomics data is the substantial variation. The enormous diversity of most

microbial communities results in high biological variability between different metagenomes. Technical

variability is also introduced by sequencing errors, the limited length of the generated DNA fragments

and imprecise matching of reads with the correct gene function. In addition, most metagenomes are

heavily undersampled as even modern massively parallel DNA sequencing techniques can only sequence

a fraction of the total DNA content available in most samples. Finally, the data observed is discrete

(counts of gene occurrences) and high dimensional (many genes tested simultaneously) which makes

many standard statistical methods unsuitable. For example we show that the standard t-test has a poor

performance under these conditions. Furthermore we present work being done on a new method for

statistical comparison of metagenomes. The method is built on hierarchical Bayesian modeling within

the framework of a generalized linear model. Markov chain Monte Carlo (MCMC) will be used to fit the

model to the data. The main benefit is the sharing of variance between genes making the variance

estimates more stable. Using a generalized linear model as basis allows for flexibility in the choice of

experimental design. These parts together will form a data driven statistical model that will improve the

potential of comparative metagenomics.
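The benefit of sharing variance between genes can be illustrated without the full hierarchical Bayesian model. The sketch below is a deliberately simplified stand-in, not the method presented on the poster: it shrinks each gene's sample variance towards the average variance over all genes using a fixed weight. The function name, the `prior_weight` parameter and the toy counts are assumptions for illustration.

```python
from statistics import mean, variance


def shrink_variances(counts_by_gene, prior_weight=0.5):
    """Shrink per-gene sample variances towards the mean variance over all genes.

    `prior_weight` = 0 keeps the raw variances, 1 replaces them by the pooled value.
    """
    raw = {
        gene: variance([float(c) for c in counts])
        for gene, counts in counts_by_gene.items()
    }
    pooled = mean(raw.values())
    return {
        gene: (1 - prior_weight) * v + prior_weight * pooled
        for gene, v in raw.items()
    }


# Hypothetical gene counts from three metagenomes of one condition.
counts = {
    "geneA": [10, 12, 14],
    "geneB": [5, 5, 5],    # raw variance 0: a t-test would be undefined here
    "geneC": [0, 20, 40],
}
shrunk = shrink_variances(counts)
```

With only a few samples per group, a raw variance of zero (geneB) or an extreme one (geneC) is mostly an artefact of small sample size; borrowing information across genes pulls both towards a more plausible value. In the hierarchical model the amount of shrinkage would be estimated from the data instead of being fixed.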

Poster Number 23


De novo assembly and annotation of the grey reindeer lichen (Cladonia rangiferina) transcriptome

Sini Junttila and Stephen Rudd

Turku Centre for Biotechnology, University of Turku and Åbo Akademi University,

Tykistökatu 6, 20520 Turku, Finland

Lichen is a symbiotic relationship between a fungus and an alga, and these organisms

have a remarkable ability to survive in some of the harshest climates on earth. They

can endure frequent drying and wetting and are able to survive in the desiccated state

for long periods at a time. Although molecular biological and genetic resources have

been established for the systematic study and classification of lichens, there are no

published lichen reference genomes available and the sequences available in public

databases are very limited. We report the de novo assembly and annotation of the

transcriptome of the grey reindeer lichen, Cladonia rangiferina, using high-

throughput next generation sequencing and traditional Sanger sequencing data.

High quality sequence reads from a Roche GS FLX sequencing run and

Sanger sequencing were de novo assembled. The genome of origin for the lichen

sequences was determined and the assembled sequences were annotated using

BLASTX analysis against the non-redundant database. The C. rangiferina

transcriptome was further characterised by functional annotation of the sequences

against GO and KEGG databases.

Our results present the first transcriptome sequencing and de novo assembly of

any lichen species, describe the ongoing molecular processes and the most active

pathways in C. rangiferina, and bring a significant increase to publicly available

lichen sequence information. These data provide a first look into the molecular nature

of the lichen symbiosis and characterise the transcriptional space of this remarkable

organism. These data will also enable further studies aimed at deciphering the genetic

mechanisms behind lichen desiccation tolerance.

Poster Number 24


PyRy3D: a software tool for modelling of large macromolecular complexes

Joanna M. Kasprzak1, *, Wojciech Potrzebowski2, Mateusz Dobrychłop1, Janusz M. Bujnicki 1,2

1 Faculty of Biology, Adam Mickiewicz University, ul. Umultowska 89, 61-614 Poznan, POLAND

2 International Institute of Molecular and Cell Biology in Warsaw, ul. Ks. Trojdena 4, 02-109 Warsaw, POLAND

* presenting author

One of the major challenges in structural biology is to determine the structures of macromolecular

complexes and to understand their function and mechanism of action. However, compared to

structure determination of the individual components, structural characterization of

macromolecular assemblies is very difficult. To maximize completeness, accuracy and efficiency

of structure determination for large macromolecular complexes, a hybrid computational approach

is required that will be able to incorporate spatial information from a variety of experimental

methods (like X-ray, NMR, cryo-EM, cross-linking and mass spectrometry, etc.) into modeling

procedure. For many biological complexes such an approach might become the only possibility

to retrieve structural details essential for planning further experiments e.g. in order to explain

mechanism of action.

We developed PyRy3D, a method for building and visualizing low-resolution models of large

macromolecular complexes. The components can be represented as rigid bodies (e.g.

macromolecular structures determined by X-ray crystallography or NMR, theoretical models, or

abstract shapes) or as flexible shapes (e.g. disordered regions or parts of protein or nucleic acid

sequence with unknown structure). Spatial restraints are used to identify components interacting

with each other, and to pack them tightly into contours of the entire complex (e.g. cryoEM density

maps or ab initio reconstructions from SAXS or SANS methods). Such an approach enables

creation of low-resolution models even for very large macromolecular complexes with

components of unknown 3D structure. Our model building procedure applies Monte Carlo

approach to sample the space of solutions fulfilling experimental restraints.
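To make the idea of Monte Carlo sampling under spatial restraints concrete, the following is a minimal sketch using a Metropolis acceptance rule on point-like components with pairwise distance restraints. It is not the PyRy3D implementation: the penalty function, the move set, the parameter values and all names are assumptions, and real components would be rigid bodies or flexible shapes fitted into a density map.

```python
import math
import random


def restraint_penalty(positions, restraints):
    """Sum of squared deviations from the target distance of each restrained pair."""
    return sum(
        (math.dist(positions[a], positions[b]) - target) ** 2
        for a, b, target in restraints
    )


def monte_carlo_fit(positions, restraints, steps=5000, step_size=1.0,
                    temperature=0.1, seed=0):
    """Metropolis sampling of component positions; returns the best state seen."""
    rng = random.Random(seed)
    current = {name: tuple(xyz) for name, xyz in positions.items()}
    current_score = restraint_penalty(current, restraints)
    best, best_score = dict(current), current_score
    names = sorted(current)
    for _ in range(steps):
        # Propose a random translation of one randomly chosen component.
        name = rng.choice(names)
        trial = dict(current)
        trial[name] = tuple(
            c + rng.uniform(-step_size, step_size) for c in current[name]
        )
        trial_score = restraint_penalty(trial, restraints)
        delta = trial_score - current_score
        # Always accept improvements; accept worse states with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = trial, trial_score
            if current_score < best_score:
                best, best_score = dict(current), current_score
    return best, best_score


# Hypothetical three-component complex with pairwise distance restraints.
start = {"A": (0.0, 0.0, 0.0), "B": (10.0, 0.0, 0.0), "C": (0.0, 10.0, 0.0)}
restraints = [("A", "B", 3.0), ("B", "C", 3.0), ("A", "C", 3.0)]

start_score = restraint_penalty(start, restraints)
best, best_score = monte_carlo_fit(start, restraints)
```

Accepting occasional uphill moves lets the sampler escape arrangements that satisfy some restraints while violating others, which is the main reason to prefer this over a purely greedy search.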

Acknowledgements: This analysis was funded by the Polish Ministry of Science and Higher

Education grant N N301 123138 to JMK, and by the European Research Council (StG grant

RNA+P=123D to JMB). JMK has been a scholarship-holder of Adam Mickiewicz University

Foundation for 2011. JMB has been supported by the "Ideas for Poland" fellowship from the

Foundation for Polish Science.


Poster Number 25


RNA Specificity of RNA Recognition Motif (RRM) Domains

Deepak Kumar1,*, Joanna M. Kasprzak1, Janusz M. Bujnicki2,1

1 Laboratory of Structural Bioinformatics, Institute of Molecular Biology and Biotechnology, Collegium Biologicum, Adam Mickiewicz University, ul. Umultowska 89, 61-614 Poznan, Poland.

2 Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw, ul. Ks. Trojdena 4, 02-109 Warsaw, Poland.

* presenting author

The RNA-recognition motif (RRM), also known as RBD (RNA binding domain) or RNP

(ribonucleoprotein domain) is the most abundant RNA-binding domain in higher vertebrates and

is the most extensively studied RNA-binding domain, both in terms of structure and

biochemistry. RRM-containing proteins are involved in most post-transcriptional gene

expression processes (i.e. mRNA and rRNA processing, RNA export and stability).

The mechanism of protein and RNA recognition by RRMs is not clear owing to the high

variability of interactions. To elucidate sequence-structure-function relationships in the RRM

family, a comprehensive bioinformatics analysis was carried out. Extensive database search was

performed to identify all proteins with the RRM domain. Clustering analysis on the basis of

sequence similarity revealed subfamilies of closely related sequences that are likely to share a

similar function and specificity. Analysis was performed for two sets of data including

full-length sequences and sequences limited to RRM domains only. We analyzed relations

between groups of related RRMs by phylogenetic patterns and calculation of structure- and

sequence-based trees for representative members of the RRM family. Based on these results, we

inferred the phylogenetic tree and suggested a scenario for the evolutionary origin of RRMs.

Molecular Dynamics simulations and application of statistical potentials are being performed on

the RRMs for the study of RRM-RNA interaction.

Acknowledgements:

Deepak Kumar has been supported by the International PhD school grant from the Foundation for Polish Science (grant MPD/2010/3). J.M.K. has been supported by the Polish Ministry of Science and Higher Education (grants 0067/P01/2010/70 and N N301 123138) and by the Foundation for Polish Science (grant POMOST C/58). J.M.K. has been a scholarship-holder of the Adam Mickiewicz University Foundation for 2011. J.M.B. has been supported by the FNP (TEAM/2009-4/2 and "Ideas for Poland" fellowship).


Poster Number 26


LocusVu: Abstract

Mayank Kumar, Christian Spaniol, Volkhard Helms

May 14, 2012

Here we present LocusVu, a novel, easy to use software tool that can analyze large sets of genomic loci data, e.g. from NGS experiments, and perform statistics on it. On the back-end, the tool is linked to The Genome Browser from UCSC via its MySQL interface. The front-end is supported by a Java Swing based GUI, so that the user can interact with the data in a simple and interactive manner. The tool takes as input a list of genomic loci (positions on the chromosome), and fetches attributes (e.g. gene name, cytogenetic band, repeats information, etc.) for each of these loci. It then presents this information in a tabular form, where each locus and its corresponding attribute is listed (the user is given the freedom to choose among various attributes). One can then do many operations on this information, which include but are not restricted to: viewing N neighboring upstream-downstream genes (the user can choose the value of N dynamically); drawing pie-charts/bar-charts on the data in order to obtain a graphical summary of the data; viewing many datasets in the same window; comparing different datasets to draw comparative-genomic conclusions, etc. The tool’s ready connection to UCSC gives it a direct advantage: it does not require the maintenance of a local copy of large databases, while at the same time ensuring the user always has access to the most up-to-date data. Present day tools with similar functionality are either interactive (requiring mouse clicks, etc. and thus slow and tedious), or are non-interactive (one can do batch submissions, but then lose the interactivity). LocusVu instead provides the user with the power to do batch submissions, without taking away the interactivity. LocusVu also provides the user with the ability to perform statistics on the generated data, which gives it the distinct advantage of analyzing large sets of data in a relatively short time.
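The "N neighboring genes" operation can be sketched independently of the database back-end. The example below assumes the gene start coordinates for one chromosome have already been fetched and sorted; the function name and the toy gene list are hypothetical and not part of LocusVu, and strand is ignored, so "upstream" simply means a lower start coordinate than the query locus.

```python
from bisect import bisect_left


def neighbouring_genes(genes, position, n):
    """Return the n genes on each side of `position`.

    `genes` is a list of (start, name) tuples for one chromosome, sorted by start.
    """
    starts = [start for start, _ in genes]
    i = bisect_left(starts, position)  # index of the first gene at or after position
    upstream = [name for _, name in genes[max(0, i - n):i]]
    downstream = [name for _, name in genes[i:i + n]]
    return upstream, downstream


# Hypothetical genes on one chromosome (start coordinate, gene name).
genes = [(100, "g1"), (500, "g2"), (900, "g3"), (1500, "g4"), (2200, "g5")]

upstream, downstream = neighbouring_genes(genes, position=1000, n=2)
```

Because the lookup is a binary search on a sorted list, changing N interactively only requires slicing the list again, not a new database query.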


Poster Number 27


An ensemble-based feature selection algorithm for identifying candidate metastasis marker genes in endometrial cancer

Kanthida Kusonmano1,2,3, Elisabeth Wik2,3, Helga Salvesen2,3, Kjell Petersen1

1 Computational Biology Unit, Uni Computing, Uni Research AS, Bergen, Norway
2 Department of Obstetrics and Gynecology, Haukeland University Hospital, Bergen, Norway
3 Department of Clinical Medicine, University of Bergen, Bergen, Norway

Abstract: Endometrial cancer is a malignant growth of the endometrium, the lining of the uterus. It is the most common pelvic gynecological malignancy. Although the majority of patients are treated at an early stage, about one fourth of patients in Norway suffer from distant metastases, which are largely incurable and lead to death. Although most comprehensive profiling studies of malignant tissues are based on primary lesions, metastatic lesions may be more relevant for defining targets for new therapeutics against systemic disease. A metastatic lesion of cancer cells disseminated to remote sites will have certain cell-biological and molecular properties that may differ from those of the primary tumor. With the availability of high-throughput technologies, the development of more specific diagnostic markers of metastasis, to define the most relevant therapeutic targets at the molecular level, is of broad interest. However, there are only a small number of studies on metastasis in general and in endometrial cancer in particular. In this study, we aim to identify metastasis signature genes in endometrial cancer that show distinct changes or have high discriminatory ability between the transcriptional profiles of 122 primary tumors and 19 metastatic lesions, based on microarray data. We propose an ensemble-based feature (marker or gene in this context) selection method for identifying candidate marker genes of interest. Instead of obtaining differently ranked feature sets from different feature selection techniques, we try to identify a consensus feature set among various methods, e.g. t-test, Significance Analysis of Microarrays (SAM), Information Gain, ReliefF, etc. The algorithm provides the features common to the different feature selection methods, taking ranking order into consideration. Ensemble approaches are widely used for classification algorithms and have been shown to provide more robust results than applying only a single method. The selected feature subsets will subsequently give better confidence in the selection of biologically relevant markers for further validation studies.
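One way to picture a consensus over several ranked feature lists is sketched below. The abstract does not state the aggregation rule, so this is an assumption for illustration only: a feature is kept if it appears in the top k of every method, and the kept features are ordered by their mean rank. The gene names and rankings in `RANKINGS` are hypothetical.

```python
# Hypothetical rankings (best feature first) from three selection methods.
RANKINGS = {
    "t_test": ["g1", "g2", "g3", "g4", "g5"],
    "sam": ["g2", "g1", "g4", "g3", "g5"],
    "info_gain": ["g2", "g3", "g1", "g5", "g4"],
}


def consensus_features(rankings, k):
    """Features in the top k of every ranking, ordered by mean rank."""
    top_sets = [set(ranked[:k]) for ranked in rankings.values()]
    common = set.intersection(*top_sets)

    def mean_rank(feature):
        # Average 0-based position of the feature across all methods.
        return sum(r.index(feature) for r in rankings.values()) / len(rankings)

    # Ties in mean rank are broken alphabetically for reproducibility.
    return sorted(common, key=lambda f: (mean_rank(f), f))


print(consensus_features(RANKINGS, 3))  # ['g2', 'g1']
```

Requiring agreement across all methods is the strictest choice; a majority vote or a rank-sum cutoff would be looser alternatives.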

Keywords: Feature selection, Biomarker identification, Endometrial cancer, Ensemble method

Poster Number 28


Genome and transcriptome of venomous marine snail Conus consors

Age Brauer1, Reidar Andreson1, Silja Laht1, Lauris Kaplinski1, Aleksander Sudakov1, Mikk Eelmets1, Maido Remm1, CONCO Consortium2

1 Estonian Biocentre, Riia 23C, 51010 Tartu, Estonia
2 http://www.conco.eu

Conus consors is a fish-hunting snail that uses venom to paralyze its prey. The venom consists of a mixture of neurotoxins called conopeptides. Conopeptides block different ion channels very specifically. Some conopeptides are used as medications, and several others are in drug development. The main goal of sequencing the Conus consors genome and venom duct transcriptome was to find new conopeptides. From the venom duct transcriptome, 53 conopeptides were discovered (published by Terrat et al., 2012). Of these, 47 were also found in the genome assembly. In addition, 33 conopeptides not present in the transcriptomes were discovered from the genome assembly. Only a few mollusk genomes have been sequenced. The C. consors genome is the largest assembled genome so far.

Species              Length of the genome  Raw sequence coverage  Length of assembled sequences  N50 (bp)  Nr of core genes (max=458)  Core gene coverage
Lottia gigantea      0.5 Gb                8.9x                   360 Mbp                        1870055   452                         80 %
Aplysia californica  1.8 Gb                9.9x                   716 Mbp                        264327    451                         77 %
Conus bullatus       3.0 Gb                3.0x                   201 Mbp                        182       317                         12.1 %
Conus consors        3.0 Gb                6.0x                   1393 Mbp                       599       457                         88.5 %
Pinctada fucata      1.15 Gb               40.0x                  1413 Mbp                       1629      457                         85 %

To allow a better comparison of very different genome assemblies, we used eukaryotic core gene coverage as one of the parameters. A set of 458 eukaryotic core genes that should exist in all eukaryotes was searched against the assembled genomes (http://korflab.ucdavis.edu/Datasets/cegma/). In addition, we used the mitochondrial genome to choose the best C. consors genome assembly. The mitochondrial genome has been sequenced and assembled independently. We compared how much of the mitochondrial genome was present in each assembly and in how many contigs/scaffolds. The values ranged from 6 to 100% and from 4 to 251 scaffolds. The assembly chosen for annotation had 100% of the mitochondrial genome present in 4 scaffolds.
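The N50 statistic reported in the table is a standard assembly metric: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal sketch of that standard definition follows; the function name and the example lengths are illustrative and not taken from the poster.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 is undefined for an empty list of lengths")


# Hypothetical contig lengths: total 20, and 6 + 5 = 11 covers half.
print(n50([2, 3, 4, 5, 6]))  # 5
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why the abstract complements it with core gene and mitochondrial genome coverage.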

Poster Number 29


Accurate extension of multiple sequence alignments

Ari Loytynoja 1, Albert J. Vilella 2 and Nick Goldman 2

1 Institute of Biotechnology, University of Helsinki, Finland
2 EMBL-European Bioinformatics Institute, Hinxton, UK

Accurate multiple alignment is demanding, and extension of existing alignments with new data is often an attractive option: addition of new sequences without re-computation of the full alignment saves time and resources, especially when the amounts of data added are small relative to the full alignment sizes; extension of existing alignments also retains the relative matching of reference sequences and thus ensures that downstream analyses depending on certain features of the alignment will not be broken and the need for manual re-annotation is minimised. However, the benefits of alignment extension are especially significant in the analysis of fragmented sequences such as those coming from next-generation metagenomics. Popular progressive alignment methods designed for global alignment struggle with short sequence fragments that do not all overlap with each other and contain little information to anchor them in their correct context. In evolutionary analyses of metagenomic data, multiple alignment is therefore often performed with the HMMER package [1], which first generates a profile HMM of a pre-defined reference alignment, then aligns the sequence fragments against this profile and finally maps the against-profile alignments to the original reference sequences. A limitation of profile-based methods is that they do not incorporate and use phylogenetic information and are affected by the composition of the reference alignment and the phylogenetic positions of query sequences within it.

We have developed a method for phylogeny-aware alignment of partial-order sequence graphs and apply it to the extension of alignments with new data [2]. Our new method, called PAGAN [3], infers ancestral sequence history for the reference alignment and adds new sequences by aligning them against extant sequences or inferred ancestral sequences in their phylogenetic context, either to pre-defined positions or by finding the best placement for sequences of unknown origin. Unlike profile-based alternatives, PAGAN considers the phylogenetic relatedness of the sequences and is not affected by inclusion of more diverged sequences in the reference set. Our analyses show that PAGAN outperforms alternative methods for alignment extension and provides superior accuracy for both DNA and protein data, the improvement being especially large for fragmented sequences. PAGAN-generated alignments of noisy NGS sequences are accurate enough for the use of RNA-seq data in evolutionary analyses, while the method also scales up to analyses of large metagenomic data sets. The concepts developed for progressive alignment of sequence graphs can be extended to phylogeny-aware alignment refinement and co-estimation of sequence alignment and phylogeny.

[1] S Eddy. HMMER 3.0 (http://hmmer.org).

[2] A Loytynoja, AJ Vilella, and N Goldman. Accurate extension of multiple sequence alignments using a phylogeny-aware graph algorithm. Bioinformatics, accepted.

[3] A Loytynoja. PAGAN (http://code.google.com/p/pagan-msa).

Poster Number 30


Analysis of functional divergence in the EFG I and EFG II subfamilies

Author(s): Tõnu Margus1,2, Maido Remm 1,2 and Tanel Tenson1

Affiliation(s): 1University of Tartu, 2Estonian Biocentre,

Elongation factor G (EFG) is an indispensable protein whose primary function is to catalyze translocation in protein synthesis. EFG duplications in bacteria form four subfamilies: EFG I, EFG II, spdEFG1 and spdEFG2. The four EFG subfamilies are characterized by genome context conservation, evolutionary speed and by their indispensability. We have described in depth, for the first time, the EFG II subfamily (e.g., Thermus thermophilus EFG-2). This differs from the EFG I subfamily (e.g., Escherichia coli EFG) by its high levels of primary sequence divergence. To study EFG II, we analyzed conservation of domains and motifs and identified differentially conserved positions between EFG I and EFG II.
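The notion of a differentially conserved position can be made concrete with a small sketch. The abstract does not give the exact criterion, so the rule below is an assumption: an alignment column counts as differentially conserved when each subfamily has a dominant residue at or above a frequency threshold and the two dominant residues differ. The toy sequences are hypothetical and are not real EFG sequences.

```python
from collections import Counter


def conserved_residue(column, threshold):
    """Dominant residue of a column if frequent enough, otherwise None."""
    residue, count = Counter(column).most_common(1)[0]
    if residue != "-" and count / len(column) >= threshold:
        return residue
    return None


def differentially_conserved(group_a, group_b, threshold=0.9):
    """Columns conserved within both groups but with different residues.

    Both groups must come from the same multiple alignment, so that
    column i means the same position in every sequence.
    """
    positions = []
    for i in range(len(group_a[0])):
        a = conserved_residue([seq[i] for seq in group_a], threshold)
        b = conserved_residue([seq[i] for seq in group_b], threshold)
        if a is not None and b is not None and a != b:
            positions.append((i, a, b))
    return positions


# Hypothetical aligned fragments for two subfamilies.
SUBFAMILY_1 = ["GKTTA", "GKTSA", "GKTTA"]
SUBFAMILY_2 = ["GRTTL", "GRTAL", "GRTTL"]
print(differentially_conserved(SUBFAMILY_1, SUBFAMILY_2))  # [(1, 'K', 'R'), (4, 'A', 'L')]
```

Column 3 is skipped in this toy case because neither group reaches the threshold there, which separates "different and conserved" from merely "variable".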

The main EFG II specific characteristics are: a weakly conserved, functionally important GTPase domain; absence of the trGTPase family specific consensus RGITI in the G2 motif; and six differentially conserved positions. The latter are related to substantial changes in physicochemical properties, indicating EFG II specific functional changes. Interestingly, the differentially conserved positions were found within the most divergent domains of EFG II (GTPase domain and domain II). Moreover, three of these positions were located in the GTPase domain consensus elements P-loop and G2 motif. This location strongly suggests that some of the EFG II specific functional peculiarities are associated with changes in GTP/GDP binding and hydrolysis conditions.

The mapping of differentially conserved positions onto the tertiary structure revealed that another three positions in domain II point towards different interaction partners of the domain. This means that the nature of the interactions between the GTPase domain, ribosomal 16S rRNA (h5 and h15) and domain III differs between EFG I and EFG II. The presence of these characteristics, amongst the otherwise highly divergent sequences of EFG II, is consistent with functional peculiarities unique to this subfamily. The nature of these interactions is a suitable subject for investigation in future experimental designs targeting differentially conserved amino acid residues within EFG II.

Poster Number 31


Exploratory Metagenomic Analysis of Antibiotic Resistance Genes in Bacterial Communities

Paula Andrea Martinez*, Viktor Jonsson, Fredrik Boulund, Erik Kristiansson

Division of Mathematical Statistics, Department of Mathematical Sciences, Chalmers University of Technology and University of Gothenburg

* E-mail: [email protected]

The increasing prevalence of antibiotic-resistant bacteria has become a notorious threat to human health. Bacteria become resistant through resistance genes that can move between cells using horizontal gene transfer. Antibiotics are naturally produced by microorganisms in the environment and therefore bacterial communities maintain a large collection of resistance genes (the resistome). The diversity and mobility of the environmental resistome is however not well studied and further research into these issues is warranted.

The aim of this project is to explore the environmental resistome and to characterize the abundance of known resistance genes in the environment. We used 98 gigabytes of publicly available data from CAMERA, the community database for metagenomic data, including more than 650 study sites around the world. Based on these data, we identified several common antibiotic resistance gene families spread across different environments, of which the beta-lactamase TEM was the most abundant (having 41.7 % occurrence in 347 sites). We also compared different sites by clustering, and found that the resistome is highly variable. However, similarities were also found between geographically close sites and between sites from similar environments. For example, environments contaminated with antibiotics showed similarities in their resistome abundance. Additionally, we clustered the resistome, observing groups of antibiotic resistance genes with similar abundance patterns between the sites. Several of these groups could be associated with genetically linked co-resistance through known horizontally transferred elements.
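The two quantities used in this comparison, how often a gene occurs across sites and how similar two sites are, can be sketched as follows. The abstract names neither the distance measure nor the clustering method, so the Bray-Curtis dissimilarity used here is an assumption chosen because it is common in ecology. The sites and counts in `ABUNDANCE` are hypothetical; only the gene family name TEM comes from the text.

```python
# Hypothetical resistance gene counts per site.
ABUNDANCE = {
    "site1": {"TEM": 4, "tetM": 1},
    "site2": {"TEM": 2, "tetM": 3},
    "site3": {"vanA": 5},
}


def occurrence(abundance, gene):
    """Fraction of sites in which the gene has a non-zero count."""
    hits = sum(1 for counts in abundance.values() if counts.get(gene, 0) > 0)
    return hits / len(abundance)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: 0 for identical profiles, 1 for disjoint ones."""
    genes = set(a) | set(b)
    diff = sum(abs(a.get(g, 0) - b.get(g, 0)) for g in genes)
    total = sum(a.get(g, 0) + b.get(g, 0) for g in genes)
    return diff / total if total else 0.0


print(occurrence(ABUNDANCE, "TEM"))                        # TEM is present in 2 of 3 sites
print(bray_curtis(ABUNDANCE["site1"], ABUNDANCE["site2"]))  # 0.4
print(bray_curtis(ABUNDANCE["site1"], ABUNDANCE["site3"]))  # 1.0
```

A matrix of such pairwise dissimilarities is the usual input to hierarchical clustering of sites; transposing the table gives the corresponding clustering of genes.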

We conclude that metagenomics is a powerful tool for identifying antibiotic resistance genes in uncultured bacteria.

Keywords: metagenomics, environmental bacterial communities, antibiotic resistance, resistome, next generation sequencing (NGS).

Poster Number 32


A web tool for discovering protein-protein interactions using sequence information

Dorota Matelska a,b, Robert B. Russell a

a Cell Networks, University of Heidelberg, Im Neuenheimer Feld 267, 69120 Heidelberg, Germany
b International Institute of Molecular and Cell Biology, 4 Ks. Trojdena Street, 02-109 Warsaw, Poland

Proteins fulfill their function as part of large molecular machines that are coordinated by regulatory interaction networks. Currently, details of interaction interfaces are captured only in high-resolution 3D structures of protein complexes. Initial efforts to classify such molecular details of interaction interfaces resulted in the development of databases of domain-domain and domain-peptide interactions. However, despite the growing number of bioinformatic tools that make use of protein sequences, there is no meta-service that incorporates local sequence features and uses them to search for interaction interfaces in a set of proteins.

We present a web tool that integrates various structural and sequence data and predicts possible modes of interaction between given proteins. Predictions of protein domains, sequence motifs, and homology to known structures are used to reveal several types of interfaces, i.e. domain-domain, domain-peptide and peptide-peptide interactions. Moreover, we introduce a probability measure for the possible modes and visualize them using an intuitive graph model.

Poster Number 33


MoiraiSP: a novel mitochondrial cleavage site predictor

Yoshinori Fukasawa, Szu-Chin Fu, Junko Tsuji, Noriyuki Sakiyama, Kenichiro Imai and Paul Horton

A large fraction of mitochondrial proteins are cleaved upon entry into the mitochondria, but prediction of this cleavage is still challenging. In recent years, large-scale mitochondrial proteomics research has provided large data sets of mitochondrial protein cleavage sites [1, 2].

We present MoiraiSP (Mitochondrial matrix targeting Signal Predictor), a novel mitochondrial cleavage site predictor trained on recent proteomics data. To prepare our dataset, we needed to mark intermediate cleavage events; fortunately, Vögtle et al. [1] provide data from knockout experiments determining cleavage by Oct1 and Icp55. However, many cleavage sites remained that do not follow the observation that almost all known MPP (Mitochondrial Processing Peptidase) cleavage sites occur with arginine in the -2 position (the “R-2 rule” [3]). Although explaining those cleavage sites is an interesting scientific question, for training of our predictor we filtered out all cleavage sites which do not follow the R-2 rule.
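The R-2 filter described above is simple enough to sketch. This is an illustration, not the authors' code, and it rests on a stated indexing assumption: the cleavage site is given as the 0-based index of the first residue of the mature protein (the +1 position), so the -2 position is two residues before it. The example records are hypothetical toy sequences.

```python
def follows_r2_rule(sequence, cleavage_index):
    """True if arginine (R) occupies the -2 position relative to cleavage.

    cleavage_index is the 0-based index of the first mature residue (+1),
    so position -1 is cleavage_index - 1 and position -2 is cleavage_index - 2.
    """
    return cleavage_index >= 2 and sequence[cleavage_index - 2] == "R"


# Hypothetical records: (protein id, precursor sequence, cleavage index).
SITES = [
    ("protA", "MLSLRQSIRFFK", 10),  # residue at index 8 is R
    ("protB", "MLSLRQSIRFFK", 7),   # residue at index 5 is Q
]

training_sites = [pid for pid, seq, idx in SITES if follows_r2_rule(seq, idx)]
print(training_sites)  # ['protA']
```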

We trained an SVM (LIBSVM) classifier to predict MPP cleavage sites. For the classification task, we defined several feature types. For each protease we trained a profile HMM (HMMER2) to learn the local sequence patterns near its cleavage sites, and treated the likelihood ratio score of the HMM as a feature for the SVM. To reflect distance from the N-terminus, we trained a mixture model of Γ distributions; we additionally used physico-chemical properties such as net charge and the number of charged residues. When predicting, the SVM model is used to scan the sequence for MPP cleavage sites. Secondary cleavage events are predicted using the Oct1 and Icp55 profile HMMs, respectively (Figure 1A).

We developed a separate version of MoiraiSP for plants, because they exhibit important differences from yeast, including generally longer MTSs [2] and the lack of an Oct1 protease. Due to the relatively small plant dataset, we used the yeast-trained profiles to model the sequence preferences around the MPP site but retrained the Icp55 profile on plant data. We also modeled the distance from the MPP cleavage site to the original N-terminus with a mixture model of Γ distributions fit to the plant data.

Using 10-fold cross-validation on a non-redundant dataset, we estimated the performance of MoiraiSP and compared it to two previous predictors [4, 5], as summarized in Table 1 and Figure 1. MCC stands for Matthews correlation coefficient and shows the performance of classification between cleaved and non-cleaved mitochondrial proteins. The preliminary results indicate that, having the advantage of a large training dataset for cleavage sites, MoiraiSP makes more accurate predictions than previous methods.

Table 1. Preliminary classification result for yeast dataset [1]

       MoiraiSP        TargetP [4]  MitoProtII [5]
MCC    0.794 ± 0.094   0.582        0.552
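For readers unfamiliar with the measure in Table 1, the Matthews correlation coefficient is computed from the four cells of a binary confusion matrix as MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below implements this standard formula; the counts in the example calls are made up and are unrelated to the values in the table.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when a marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


print(mcc(50, 50, 0, 0))    # 1.0, perfect separation of cleaved and non-cleaved
print(mcc(25, 25, 25, 25))  # 0.0, no better than chance
```

Unlike plain accuracy, MCC stays informative when the two classes differ greatly in size.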

Figure 1. (A) Flow of MoiraiSP. (B) Result of cleavage site prediction in the yeast dataset [1]. The y-axis shows the fraction of cleaved proteins whose cleavage position is correctly predicted, and the x-axis shows the accepted distance between the predicted site and the experimentally determined site. (C) Result for the plant dataset [2].

References:
[1] Vögtle, F.-N. et al., Cell, 139, 428-39, 2009.
[2] Huang, S. et al., Plant Physiology, 150, 1272-85, 2009.
[3] Gakh, O., Biochimica et Biophysica Acta (BBA) - Molecular Cell Research, 1592, 63-77, 2002.
[4] Emanuelsson, O. et al., Journal of Molecular Biology, 300, 1005-16, 2000.
[5] Claros, M.G. and Vincens, P., European Journal of Biochemistry / FEBS, 241, 779-86, 1996.

Poster Number 34



Nanomechanics of proteins at synaptic junction

W. Nowak*, K.Mikulska, R.Jakubowski, L.Pepłowski, J.Strzelecki

Theoretical Molecular Biophysics Group, Institute of Physics, N. Copernicus University, Grudziądzka 5, 87-100 Toruń, Poland

* presenting author ([email protected])

Mechanical stability of the synaptic junction is of paramount importance for proper functioning of the brain. Out of hundreds of proteins present in the cleft, some form pairs linking pre- and post-synaptic neurons, for example neuroligins (NLG) and neurexins (NRX). Others, such as contactins (CNTN), contribute to proper functioning of Ranvier nodes and early formation of a neuronal network. A better understanding of the nanomechanics of these modular proteins may help in the future to engineer molecular machines working in a neural membrane or the extracellular matrix environment. Moreover, recent genetic studies indicate that mutations in genes coding for these proteins lead to severe diseases such as autism. In order to better understand the mechanical properties and stability of synaptic junction proteins, we combined single-molecule experimental techniques, such as Atomic Force Microscopy (AFM), and theoretical methods, such as Steered Molecular Dynamics (SMD) simulations [1,2], to unfold selected modular, adhesive proteins such as CNTN4 [3,4]. Computer simulations of mechanical unfolding, despite the known mismatch with experimental timescales, provide information on intra-molecular interactions critical for protein functionality. In this presentation we will show, for the first time, results of SMD unfolding of the whole CNTN4 protein (100 ns timescale, 10 modules) and of an enforced dissociation of a NRX-NLG pair in the presence (and absence) of calcium ions. Problems arising when natural interactions of protein modules with other signaling molecules are affected by molecular strain will be discussed. We believe that our computational approach, together with bioinformatic analysis of homologous systems, provides new scientific data on the biological role of these abundant proteins.

Acknowledgements: Support from Polish Funds for Science (grant No. N N202 262038 and the nationwide license for Accelrys software) is acknowledged. Calculations were performed at the Computational Center TASK in Gdansk and UMK Torun. UMK grants (2011) to KM and JS are also acknowledged.

References:
1. W. Nowak, P. Marszalek, 2005, Molecular Dynamics Simulations of Single Molecule Atomic Force Microscope Experiments, Current Trends in Computational Chemistry, 47-83.
2. Ł. Pepłowski, M. Sikora, W. Nowak, M. Cieplak, 2011, Molecular jamming - The cystine slipknot mechanical clamp in all-atom simulations, J. Chem. Phys., 134: 085102-1 - 085102-14.
3. K. Mikulska, Ł. Pepłowski, W. Nowak, 2011, Nanomechanics of Ig-like domains of human contactin (BIG-2), J. Mol. Model. (Springer) 17: 2313-2323.
4. K. Mikulska, J. Strzelecki, A. Balter, W. Nowak, 2012, Nanomechanical unfolding of α-neurexin – a major component of the synaptic junction, Chem. Phys. Lett. (Elsevier) 521: 134-137.

Poster Number 35


StoreBioinfo - high capacity storage for Life Sciences projects in Norway

Kjell Petersen1, H. Sagehaug7, S. Omholt8, K.S. Jakobsen9, N.P. Willassen6, F. Drabløs2, E. Hovig3,4 and I. Jonassen1,5

1 Computational Biology Unit, Uni Computing, Uni Research AS, Bergen.
2 Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology (NTNU), Trondheim.
3 Medical Informatics and Department of Tumor Biology, Norwegian Radium Hospital.
4 Institute for Informatics, University of Oslo.
5 Department of Informatics, University of Bergen.
6 Department of Molecular Biotechnology, Institute of Medical Biology, Faculty of Medicine, University of Tromsø.
7 Parallab, Uni Computing, Uni Research AS, Bergen.
8 Centre for Integrative Genetics, Norwegian University of Life Sciences (UMB), Ås.
9 Centre for Ecological and Evolutionary Synthesis (CEES), University of Oslo.

Abstract:

The overall aim of the StoreBioinfo project is to provide life science users with integrated access to NorStore storage resources and Notur computational resources. Notur is the Norwegian national metacenter for computational resources as well as the coordinator of the NorStore project for high capacity storage of scientific data.

The StoreBioinfo project originates from the Norwegian FUGE Bioinformatics platform, extended with representatives from UMB at Ås and CEES at UiO. The platform is coordinated from the Computational Biology Unit at the University of Bergen.

The main deliverables of the StoreBioinfo project are to i) manage a large block quota of storage on behalf of the Life Sciences community in Norway, and ii) develop e-services for better integration of NorStore and Notur resources in the tools of the Bioinformatics platform, and of the Life Sciences community in general.

This work presents the operation of the StoreBioinfo project as well as the developed infrastructure solutions in the project.

Keywords: Large scale data storage, structured storage, sharing project data, interactive data access, programmatic data access.

Poster Number 36


Transcriptome analysis reveals genes involved in early cone setting

Tree products are one of the largest exports in Sweden. Tree selection programs to increase the benefit have started but are progressing slowly due to the long generation time in trees. This is especially true for the Norway spruce, which sets cones after 20 years. We are interested in identifying the genes involved in cone setting and using those to reduce the generation time of Norway spruce. A naturally occurring mutant of the Norway spruce called Acrocona sets cones after just four years. We have collected samples from both Norway spruce and Acrocona during development in order to identify the genes involved in early cone setting.

Since the genome sequence is large, more than 20 gigabases, no genome or transcriptome sequence for Norway spruce is available. We have, by performing de novo assembly on RNA-seq data, created a transcriptome covering approximately 80 percent of all transcripts in spruce. By comparing the transcriptomes of Norway spruce and Acrocona we are generating an Acrocona-specific SNP library. We have further analyzed the differential expression patterns of the transcripts across cell types and time points to identify genes involved in cone setting. Among other results we found one transcription factor, known to be important in flower development in other plants, which we hypothesize to be one of the key players in initiation of cone setting. The poster will explain the pipeline and the results that led to this hypothesis.

Poster Number 37

Reimegård_Johan_SciLifeLab_Sweden.pdf

Screening for few but independent biomarker candidates by a genetic algorithm

Dirk Repsilber1, Lena Scheubert2 and Georg Fuellen2

1) Leibniz Institute for Farm Animal Biology, Dummerstorf, Germany
2) Institute for Biostatistics and Informatics in Medicine and Ageing Research, University of Rostock, Germany

We present an approach for screening OMICs datasets explicitly for small discriminating biosignatures. A genetic algorithm is used together with a statistical learning algorithm combined in a wrapper, enabling us to reward small biosignatures in particular. Interestingly, the resulting signatures appear to be enriched in features which contribute independently to the observed patterns. Features selected by this method are compared to features selected by other common algorithms, with respect to pairwise mutual information among the top candidates. Two examples of schoolbook-like 2D gene expression patterns are presented from two datasets, involving mouse gene expression data on pluripotency as well as human brain tissue gene expression data related to Alzheimer’s disease. We suggest that our approach successfully proposes small biosignature candidates by eliminating redundancy in the resulting sets of features.
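Pairwise mutual information, the redundancy measure mentioned above, can be computed for two discretized features as sketched here. The abstract does not say how expression values were discretized or which estimator was used, so the plug-in estimate on already-binned values is an assumption, and the example vectors are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Plug-in estimate of mutual information (in bits) of two discrete vectors."""
    n = len(x)
    count_x, count_y = Counter(x), Counter(y)
    count_xy = Counter(zip(x, y))
    return sum(
        (c / n) * log2((c / n) / ((count_x[a] / n) * (count_y[b] / n)))
        for (a, b), c in count_xy.items()
    )


# Hypothetical binarized expression of three genes over four samples.
GENE_A = [0, 0, 1, 1]
GENE_B = [0, 0, 1, 1]  # carries the same information as GENE_A
GENE_C = [0, 1, 0, 1]  # independent of GENE_A

print(mutual_information(GENE_A, GENE_B))  # 1.0
print(mutual_information(GENE_A, GENE_C))  # 0.0
```

Low pairwise values among selected features, as for `GENE_A` and `GENE_C`, are what one would expect from a signature whose members contribute independently.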

Scheubert, L., Schmidt, R., Repsilber, D., Lustrek, M. & Fullen, G. (2011). Learning biomarkers of pluripotent stem cells in mouse. DNA Research, 18, 233–251.

Poster Number 38


HumLoc: An Integrated Service for Coarse-Grained Subcellular Localization

Arcadio Rubio1, Bernard de Bono2,3, Henrik Nielsen1 and Ramneek Gupta1

1 Center for Biological Sequence Analysis, Technical University of Denmark
2 European Bioinformatics Institute
3 Auckland Bioengineering Institute, University of Auckland

Background

Proteins are directed to cellular compartments by peptide sequences that act as targeting signals. Mislocalization due to disrupted signaling caused by sequence mutations is likely to have a major impact on protein function, as well as on physiological processes that such a function brokers. Localization models achieve good predictive performance for most individual cell compartments, but fail to scale to many transport signals simultaneously or to integrate existing annotations.

Methods

We describe HumLoc, a protein subcellular annotation pipeline aimed at Homo sapiens and closely related mammalian organisms. This tool enriches expert-curated localization information from UniProt with machine learning predictions to maximize coverage and provide a one-stop shop. Integration of both types of information is achieved by mapping compartments to a 3-element ontology (extracellular, cell membrane or intracellular) which eliminates granularity differences in annotations and makes prediction more tractable.
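The idea of collapsing fine-grained compartments onto the 3-element ontology can be sketched as a lookup with a fallback. Everything specific in the sketch is an assumption for illustration: the compartment terms in `COARSE_CLASS` are hypothetical examples rather than HumLoc's actual vocabulary, and the policy of preferring a curated annotation and using the prediction only when no mappable annotation exists is one plausible reading of "enriches", not a documented rule.

```python
# Hypothetical mapping from fine-grained terms to the three coarse classes.
COARSE_CLASS = {
    "secreted": "extracellular",
    "extracellular space": "extracellular",
    "cell membrane": "cell membrane",
    "plasma membrane": "cell membrane",
    "nucleus": "intracellular",
    "cytoplasm": "intracellular",
    "mitochondrion": "intracellular",
}


def coarse_location(curated_term, predicted_class):
    """Return (coarse class, evidence source) for one protein.

    curated_term is a curated compartment name or None; predicted_class
    is the coarse class proposed by a machine learning predictor.
    """
    if curated_term is not None:
        mapped = COARSE_CLASS.get(curated_term.strip().lower())
        if mapped is not None:
            return mapped, "curated"
    return predicted_class, "predicted"


print(coarse_location("Nucleus", "extracellular"))  # ('intracellular', 'curated')
print(coarse_location(None, "cell membrane"))       # ('cell membrane', 'predicted')
```

Returning the evidence source alongside the class keeps curated and predicted annotations distinguishable downstream.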

Results

The prediction pipeline of HumLoc achieves an estimated 83% correct classification rate assigning proteins to the 3-element ontology. It is of special interest for the interpretation of GWAS results. Given a set of SNPs, HumLoc can be used to filter those which are predicted to alter localization, potentially leading to a disease. We present such a set of germline and somatic mutations, in addition to some general findings about mislocalization SNPs.

Availability

A preliminary web interface as well as a web service can be accessed at http://cbs.dtu.dk/cgi-bin/humloc-2.0.cgi. These allow browsing existing subcellular annotations and precomputed predictions. Furthermore, the submission of novel sequences is also possible.

Poster Number 39


IMPROVED GAP SIZE ESTIMATION FOR SCAFFOLDING ALGORITHMS

KRISTOFFER SAHLIN, NATHANIEL STREET, JOAKIM LUNDEBERG, AND LARS ARVESTAD

Abstract.

Motivation: One of the important steps of genome assembly is scaffolding, in which contigs are linked using information from read-pairs. Scaffolding provides estimates about the order, relative orientation and distance between contigs. We have found that contig distance estimates are generally strongly biased and based on false assumptions. Since erroneous distance estimates can mislead subsequent analysis, it is important to provide unbiased estimation of contig distance.

Results: We show that state-of-the-art programs for scaffolding are using an incorrect model of gap size estimation. We discuss why current ML estimators are biased and describe the different cases of bias we are facing. Furthermore, we provide a model for the distribution of reads that span a gap, and derive the ML equation for the gap length. We motivate why this ML estimate is sound and show empirically that it outperforms gap estimators in popular scaffolding programs. Our results have consequences for scaffolding software, for structural variation detection, and for library insert-size estimation as is commonly performed by read aligners.

A new scaffolding tool (BESST) is also presented. BESST has a new way of inferring spurious links between contigs based on the gap estimator mentioned above.
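
The abstract does not spell out the model, so the following is only a sketch of how a maximum-likelihood (ML) gap estimate of this kind can be set up. All symbols are introduced here for illustration and are assumptions, not the authors' notation: d is the unknown gap, f the insert-size distribution of the library, o_i the part of the i-th spanning fragment that is observed on the two contigs (so the fragment length is o_i + d), and w(x, d) the number of positions at which a fragment of length x can be placed so that its two reads land on different contigs (w also depends on the contig and read lengths). Conditioning on the fact that only spanning pairs are observed gives the log-likelihood

```latex
\ell(d) \;=\; \sum_{i=1}^{n} \Bigl[ \log f(o_i + d) + \log w(o_i + d,\, d) \Bigr]
\;-\; n \log \sum_{x} f(x)\, w(x, d),
\qquad
\hat{d}_{\mathrm{ML}} \;=\; \arg\max_{d}\, \ell(d).
```

A naive estimator such as the library mean minus the mean observed span ignores both the weight w and the normalizing sum. Under the assumptions above this is one way a bias can arise, since long fragments are more likely than short ones to span a given gap.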

KTH Royal Institute of Technology, Science for Life Laboratory, School of Computer Science and Communication, Solna, Sweden.

Umeå Plant Science Centre, Department of Plant Physiology, Umeå University, Sweden.

KTH Royal Institute of Technology, Science for Life Laboratory, School of Biotechnology, Division of Gene Technology, Solna, Sweden.

Swedish eScience Research Centre (SeRC), Department of Numerical Analysis and Computing Science, Stockholm University. KTH Royal Institute of Technology, Science for Life Laboratory, School of Computer Science and Communication, Solna, Sweden.

E-mail addresses: [email protected].


Poster Number 40


Title: Comparative interactomics with FunCoup

Abstract: FunCoup (http://FunCoup.sbc.su.se) is a database that maintains and visualizes global gene/protein networks of functional coupling that have been constructed by Bayesian integration of diverse high-throughput data. FunCoup achieves high coverage by orthology-based integration of data sources from different model organisms and from different platforms. Network links are annotated with confidence scores in support of different kinds of interactions: physical interaction, protein complex membership, metabolic, or signaling link. The current release, version 2.0, integrates 70 large-scale experimental datasets of such diverse types as: mRNA expression, protein expression, sub-cellular localization, protein-protein interaction, miRNA-mRNA targeting, transcription factor binding, phylogenetic profile, genetic interaction, and domain-domain interaction. A total of 22 million links has been predicted for 11 different species including the major model organisms. The FunCoup website allows query-based analysis of conserved subnetworks in multiple species.

Poster Number 41

Schmitt_Thomas_StockholmUniversity_Sweden.pdf

Tracking a complete voltage-sensor cycle with metal-ion bridges and modulation through neurotoxins

Ulrike Henrion‡, Jakob Renhorn‡, Sara I. Börjesson‡, Erin M. Nelson‡, Christine S. Schwaiger†*, Pär Bjelkmar†, Björn Wallner‡, Erik Lindahl†, and Fredrik Elinder

†KTH Royal Institute of Technology, Stockholm, Sweden

‡Linköping University, Sweden

*[email protected]

Voltage-gated ion channels open and close in response to changes in membrane potential, thereby enabling electrical signaling in excitable cells. The voltage sensitivity is conferred through four voltage-sensor domains (VSDs) where positively charged residues in the fourth transmembrane segment (S4) sense the potential. While an open state is known from the Kv1.2/2.1 X-ray structure, the conformational changes underlying voltage sensing have not been resolved.

We present 20 additional interactions in one open and four different closed conformations based on metal-ion bridges between all four segments of the VSD in the voltage-gated Shaker K channel. A subset of the experimental constraints was used to generate Rosetta models of the conformations, which were subjected to molecular simulation and tested against the remaining constraints. This yields a detailed model of intermediate conformations during VSD gating. The results provide molecular insight into the transition, suggesting that S4 slides at least 12 Å along its axis to open the channel, with a 3₁₀-helix region present that moves in sequence in S4 in order to occupy the same position in space opposite F290 from the open state through the first three closed states.

Additionally, we study how neurotoxins, such as Hanatoxin1, can modulate the gating process. Through free energy calculations and further electrophysiological experiments we try to understand why the toxin stabilizes the resting over the activated state and to identify the important residues ensuring the toxin's high binding affinity.

Poster Number 42


Combining de Bruijn graph, overlaps graph and microassembly for de novo genome assembly

Anton Alexandrov, Sergey Kazakov, Sergey Melnikov, Alexey Sergushichev¹, Anatoly Shalyto, Fedor Tsarev

St. Petersburg National Research University of Information Technologies, Mechanics and Optics

Genome Assembly Algorithms Laboratory

197101, Kronverksky pr. 49, St. Petersburg, Russia

In this paper we present a method for de novo genome assembly that splits the process into three stages: quasicontig assembly; contig assembly from quasicontigs; and contig postprocessing with microassembly. We assembled the E. coli genome from the Illumina Genome Analyzer paired-end read library SRR001665 (160-fold coverage, insert sizes of about 200 bp) and obtained 247 contigs with an N50 size of 53720, covering 98% of the reference genome.

The first stage uses a de Bruijn graph built from all the input data. For each pair of reads we look for a path connecting the reads' beginning k-mers, assuming that the reads are directed inwards. To do so, we search for all paths connecting these k-mers whose lengths are bounded above and below by a priori limits on insert size. This is done by a pair of simultaneous breadth-first searches starting from the two k-mers. If all paths found have the same length and are similar to each other, then we have a sequence that is likely to be in the genome. We call these sequences quasicontigs, as they are far from being contigs but are longer than raw reads.
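
The path search of the first stage can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a single forward, level-by-level search instead of the pair of simultaneous breadth-first searches described above, it ignores reverse complements, and the function names `build_de_bruijn` and `paths_between` are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def paths_between(graph, start, end, min_len, max_len):
    """Return every sequence that begins with k-mer `start`, ends with
    k-mer `end`, and whose length lies in [min_len, max_len]."""
    k = len(start)
    found = []
    frontier = [start]  # one spelled-out sequence per partial path
    while frontier:
        next_frontier = []
        for seq in frontier:
            if seq.endswith(end) and min_len <= len(seq) <= max_len:
                found.append(seq)
            if len(seq) < max_len:  # the length bound guarantees termination
                for nxt in graph.get(seq[-k:], ()):
                    next_frontier.append(seq + nxt[-1])
        frontier = next_frontier
    return found
```

In this sketch a read pair would yield a quasicontig candidate when `paths_between` returns paths that agree with each other; how similarity between several paths is judged is left open here, as the abstract does not specify it.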

The second stage uses the previously assembled quasicontigs. First, short ones are discarded to reduce the input to a reasonable size; e.g., 10-fold coverage can be kept. Then contigs are assembled with an algorithm based on the overlap-layout-consensus (OLC) approach.

The third stage is similar to OLC and scaffolding. We try to order the contigs and fill the gaps between them. First, all of the paired-end reads are aligned to the contigs using Bowtie (the reads in a pair are aligned independently). If both reads in a pair are aligned, but to different contigs, the reads are called bridging and the contigs are called bridged (see Figure). For every pair of bridged contigs we can infer their order from the orientations of the alignments of the bridging reads. After that, all pairs of reads with at least one read aligned to one of these contigs are used to build a relatively small de Bruijn graph (hence the name microassembly).

Figure. Contigs A and B are bridged, reads a1 and a2 are bridging, pairs (b1, b2) and (c1, c2) can be used for microassembly.

As the graph is small and "local", we are likely to find a path connecting the reads in a bridging pair using the same technique as in the first stage of the whole algorithm (quasicontig assembly). This path gives us the distance between the contigs and a filling sequence. Once the distance is determined (it is exact, unlike the estimate obtained in scaffolding), we have a layout task similar to that of the second stage.

On the E. coli dataset after the first stage we had about 10 million quasicontigs with a total size of two Gbp. Then this data was truncated to 175 Mbp. After the second phase there were 525 contigs with an N50 size of 17804 and a maximum size of 73908. After the third phase there were 247 contigs with an N50 size of 53720 and a maximum size of 167319.
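
For readers unfamiliar with the N50 statistic quoted above, here is a short sketch of its usual definition: the length of the contig at which the cumulative length of the contigs, sorted from longest to shortest, first reaches half of the total assembly length. The function name `n50` is illustrative.

```python
def n50(lengths):
    """Return the N50 of a non-empty list of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached at least half of the assembly
            return length
    raise ValueError("lengths must be non-empty")
```

For example, for contig lengths 2, 3, 4, 5 and 6 (total 20) the two longest contigs already cover 11 bases, so the N50 is 5.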

1 Corresponding author. Email: [email protected]

Poster Number 43


Computational analysis of membrane protein topology evolution

Nanjiang Shu and Arne Elofsson

Department of Biochemistry and Biophysics, Stockholm University, 10691 Stockholm, Sweden

Traditionally, the topology of integral membrane proteins (IMPs) is viewed as a helical bundle roughly perpendicular to the membrane plane. Further, the topology within a protein family has been believed to be quite conserved in evolution. However, recent observations provide a much more complex picture of the structure of IMPs: homologous proteins can adopt opposite orientations, and internal duplications are common (von Heijne, 2006). Here, we aim to gain deep insights into how the topology varies within protein families on a genomic scale.

[Figure 1 plot: bar chart of the percentages of topology comparison classes (IDT, SHIFT, INV, INV_SHIFT, DUP, Others; y-axis) per sequence identity range (x-axis, 0-10 to 90-100). Number of cases per range: 5597, 25555, 28043, 28848, 28878, 28905, 17929, 8417, 4349, 3798.]

Figure 1: Topology comparison versus sequence identity. The relationships of the compared topologies are categorized in the following groups: identical (IDT), inverted (INV), shifted (SHIFT), inverted and shifted (INV_SHIFT), duplicated (DUP) and other different topology (Others). The numbers above each bar are the number of pairs included in the statistics at each sequence identity range. Topology was predicted by TOPCONS (Bernsel et al., 2009).

By analyzing tens of thousands of pairs of homologous IMPs at various sequence identities, we observed that the fraction of identical topology increases with the sequence identity (see Figure 1). Moreover, we noticed that about 4% of pairs adopt inverted topology across the different sequence identity ranges. This probably indicates that proteins with dual topology exist extensively in protein families. Homologous pairs with shifted topology are common when sequence identity is low. However, their fraction drops quickly as the sequence identity increases. The high occurrence of shifted topology at low sequence identity is most probably caused by unreliable alignments when sequence identity is low. A small fraction of pairs have different topologies just because one is a duplicated form of the other. Such duplications account for about 1% of all cases (shown in cyan in Figure 1).

Furthermore, we analyzed topology evolution in all families of transmembrane proteins by multiple comparison of topologies. Figure 2 shows an example of the phylogenetic tree of the family Cytochrome C assembly protein (Pfam ID: PF01578), highlighted by topology comparison. The tree clearly shows the evolution of the topology: in this family, roughly 7 major topology changes are found. We found that the Histidine kinase family (Pfam ID: PF07730), the PAP2 superfamily (Pfam ID: PF01569) and the CAAX amino terminal protease family (Pfam ID: PF02517) are the families with the most varied topology. They are probably the most evolved families.

[Figure 2 image: circular phylogenetic tree with numeric leaf labels and a legend for four annotation rings: numTM, cmpclass, NtermState, cluster.]

Figure 2: Phylogenetic tree of the family Cytochrome C assembly protein (Pfam ID: PF01578). From innermost to outermost, the circles represent topology relationship categories (as in Figure 1), inside/outside status of the N-terminus (red for inside and blue for outside), topology clustering, and number of transmembrane helices, respectively. Topology was predicted by TOPCONS.

From our study, we discovered that topology variations such as topology shifting, inversion and duplication exist extensively in protein families.

References

Bernsel, A., Viklund, H., Hennerdal, A., and Elofsson, A. (2009). TOPCONS: consensus prediction of membrane protein topology. Nucleic Acids Res., 37(Web Server issue), W465–468.

von Heijne, G. (2006). Membrane-protein topology. Nat. Rev. Mol. Cell Biol., 7(12), 909–918.

Poster Number 44


ULTRA RAPID, ACCURATE QUALITY ASSESSMENT OF PROTEIN STRUCTURE MODELS

Marcin J. Skwark1,2 & Arne Elofsson1,2*

Dept. of Biochemistry and Biophysics1, Stockholm University. Science for Life Laboratory2, Stockholm, Sweden

*[email protected]

INTRODUCTION

Prediction of the 3D structure of proteins is one of the major goals of contemporary bioinformatics. For each predicted model, an independent measure is needed to evaluate its correctness. This is the role of Model Quality Assessment programs (MQAPs). Traditionally, MQAPs focused on evaluating structural features of predicted models to assess the likelihood of a model being similar to a native protein structure. These approaches are capable of detecting non-physical model conformations, but not of discriminating which of two biophysically feasible structure models is in the correct conformation. The advent of consensus methods alleviated this problem. Consensus methods are based on the premise that, among different models of the same protein, the one that is most similar to the others is most likely to be correct.

Most consensus-based MQAPs rely on structural superposition, which is computationally expensive and therefore makes consensus approaches infeasible for larger model ensembles. Additionally, structural superposition does not account for the conformational flexibility of proteins.

METHOD

The approach presented in this work does not rely on structural superposition, but rather on comparison of inter-atom distance matrices. It is at least as efficient in selecting the most accurate models from the model ensemble as world-leading consensus methods. The increase in selection accuracy is particularly notable for more difficult targets, where there is no evident largest cluster of structurally similar models¹. Owing to the use of a streaming computing platform (off-the-shelf CUDA-compatible GPU), it obtains at least a 10-fold speed-up in comparison to other approaches, with no upper bound on the number of models in the ensemble or on the model size.
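
The idea of superposition-free consensus scoring can be sketched as follows. This is a plain CPU illustration, not the PconsD GPU implementation: the similarity function `1 / (1 + (diff / d0)^2)` and the threshold `d0 = 3.0` are assumptions chosen for the example, and all function names are hypothetical.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between the atoms of one model."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def pair_similarity(dm_a, dm_b, d0=3.0):
    """Mean agreement of two distance matrices, a value in (0, 1]."""
    n = len(dm_a)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            diff = dm_a[i][j] - dm_b[i][j]
            total += 1.0 / (1.0 + (diff / d0) ** 2)
            count += 1
    return total / count


def consensus_scores(models):
    """Score each model by its mean similarity to all other models.

    `models` is a list of at least two coordinate lists for the same
    protein, each with the same atoms in the same order.
    """
    mats = [distance_matrix(m) for m in models]
    scores = []
    for i, mi in enumerate(mats):
        others = [pair_similarity(mi, mj) for j, mj in enumerate(mats) if j != i]
        scores.append(sum(others) / len(others))
    return scores
```

Because distance matrices are invariant to rotation and translation, no superposition step is needed before comparing two models, and each matrix entry can be compared independently of the others.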

RESULTS

The method presented in this work, PconsD Q-score (a GPU implementation of a distance-driven quality metric), is significantly faster than other model quality assessment methods for non-trivial targets. It is up to 60 times faster than ModFOLDclustQ², another method relying on the same principle, and approximately 8-10 times faster than superposition-based methods such as Pcons³. Additionally, it scales very well with both the model length and the number of models (see Figures 1 and 2).

Increased performance allows for much shorter turnaround times, thus enabling options not feasible with other approaches (e.g. iterative modelling, nearly real-time assessment etc.).

While it is intuitively obvious that quality assessment methods based on distance matrix comparison do not correlate perfectly with superposition-based metrics, PconsD has been demonstrated to outperform superposition-based methods in its capability to select the most native-like decoy in the set.

REFERENCES

1. Ben-David M. et al Proteins 26(Suppl 9)7, 50-76, (2009)

2. McGuffin, L.J. & Roche, D.B Bioinformatics 26(2), 182-186 (2009)

3. Wallner B. et al Nucleic Acids Research 35(Suppl 2), W369-W374, (2007)

Poster Number 45


Optimal Sparsity Criteria for Network Inference

Network inference is an intense area of research in systems biology. Most contemporary inference methods rely on a sparsity parameter, which we call zeta, to obtain sparse network estimates. Since small changes in zeta can lead to very different networks, it is crucial to set this parameter correctly. Here we propose a method for optimization of zeta which maximizes the accuracy of the predicted network for any given inference method and data set. Our procedure is based on leave-one-out cross optimization and selection of the zeta value that minimizes the prediction error. We demonstrate that our zeta optimization method, applied to two widely used inference algorithms, Glmnet and NIR, gives accurate prediction of the network structure, provided that the data is informative enough. We also use a simple least squares approximation algorithm with a link strength threshold cutoff to demonstrate the effect of our method. Our results hence show how to improve the experimental workflow from data to meaningful transcriptional networks.

Poster Number 46

Tjarnberg_Andreas_SBC_Sweden.pdf

PoSSuM: a database of known and potential ligand-binding sites in proteins

Jun-Ichi Ito1,2,4, Yasuo Tabei3, Kana Shimizu2, Koji Tsuda2,3 and Kentaro Tomii1,2

1. Department of Computational Biology, Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8568, Japan, 2. Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo 135-0064, Japan, 3. Minato Discrete Structure Manipulation System Project, ERATO, JST, Sapporo 060-0814, Japan, 4. Present address: National Institute of Biomedical Innovation (NIBIO), Saito-Asagi, Ibaraki, Japan.

We proposed an ultrafast alignment-free method that can compare over 1 million ligand-binding sites in the Protein Data Bank (PDB) [1]. In our method, ligand-binding sites are first encoded as feature vectors based on their physicochemical and geometric properties. Once ligand-binding sites are converted to bit strings, called structural sketches, which are obtained by random projections of the feature vectors, a multiple sorting method is applied to enumerate all similar pairs in terms of the Hamming distance. We created our new database, called Pocket Similarity Search using Multiple-sketchsorts (PoSSuM), to compile all similar pairs detected using our method [2]. As the source dataset, we concatenated the following two sets: 226,630 small molecule-binding sites obtained from protein–ligand complexes in the PDB, and 3,134,413 potential ligand-binding sites identified using an existing pocket detection algorithm. We applied our method to all-pair similarity searches for the 3.4 million known and potential ligand-binding sites. Consequently, we discovered ca. 24 million similar binding sites, which is the largest-scale study of binding site comparison for the PDB entries ever reported. We provide those results as a relational database including all the discovered pairs with annotations of various types such as CATH, SCOP, EC numbers, and Gene Ontology (GO) terms. Therefore, users can easily scrutinize similar ligand-binding sites between proteins with different folds or similar sites between enzymes with different EC numbers. Users can also browse superpositions of similar sites with the Jmol viewer. Our database is expected to be useful for annotation of protein functions and rapid screening of target proteins in drug design. The PoSSuM database is available for use by researchers at http://possum.cbrc.jp/PoSSuM/.
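
The step from feature vectors to bit-string sketches can be illustrated with sign random projections. This is a generic sketch under stated assumptions, not the PoSSuM code: the hyperplanes are drawn from a seeded Gaussian generator, the function names are hypothetical, and the multiple sorting method used to enumerate similar pairs is not reproduced (sketches would simply be compared pairwise here).

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def sketch(vector, hyperplanes):
    """Encode a feature vector as an integer bit string.

    Bit i is 1 when the vector lies on the non-negative side of
    hyperplane i, so vectors pointing in similar directions tend to
    share most of their bits.
    """
    bits = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two sketches."""
    return bin(a ^ b).count("1")
```

Two binding sites would then be reported as similar when the Hamming distance between their sketches falls below a chosen threshold, which replaces a costly geometric comparison with a cheap bit operation.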

References:

[1] Ito et al., Proteins. (2012) 80 (3): 747-763.

[2] Ito et al., Nucl. Acids Res. (2012) 40 (D1): D541-D548.

Poster Number 47


ERNE: a multi-purpose alignment package

Francesco Vezzi∗, Cristian Del Fabbro†, Alexandru I. Tomescu‡, Nicola Prezza§, and Alberto Policriti¶

May 15, 2012

Abstract

String alignment against a genome reference is the first and most important phase in every (re-)sequencing project based on Next Generation Sequencing data. The importance of this problem is demonstrated by the large number of tools (i.e., aligners) designed to tackle it.

Aligners must be able to manipulate a broad variety of reads (different lengths, paired reads, etc.), obtained from a wide range of organisms (from short viral genomes up to gigabase-pair-long plant genomes), and sequenced for different purposes (DNA-seq, RNA-seq, BS-seq, etc.).

In practice, different problems are solved by different tools, thus obliging researchers to use more than one program to align different types of reads. This situation gives rise to several problems, all stemming from the fact that users must become familiar with different tools and learn how to tune a large number of parameters for each one of them. Moreover, different tools can handle similar problems in different ways (e.g., reads mapping in multiple positions) or output alignments in different (often non-standard) formats.

We present ERNE (Extended Randomized Numerical alignEr), a short string alignment package whose goal is to provide an all-inclusive set of tools to handle short (NGS-like) reads. ERNE comprises ERNE-MAP (core alignment tool/algorithm), ERNE-DMAP (distributed version of the aligner), ERNE-BS5 (bisulfite treated reads aligner), and ERNE-VISUAL (graphical user interface).

ERNE-MAP (ERNE MAPper) is a highly performing and sensitive hash-based aligner: it implements a Hamming-aware hash function able to handle mismatches, extending the approach originally proposed by Rabin and Karp. ERNE-MAP handles paired reads, allows both gapped and un-gapped alignments, and outputs alignments in standard SAM/BAM format. Moreover, it can align RNA-seq reads, taking care of reads spanning exon junctions.
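
For context, the classical Rabin-Karp approach that ERNE-MAP extends can be sketched as below. This shows only the textbook exact-matching version with a rolling hash; the Hamming-aware hash function of ERNE-MAP is not reproduced, and the base, modulus and function name are choices made for this illustration.

```python
def rabin_karp(text, pattern, base=4, mod=(1 << 61) - 1):
    """Return all start positions of exact occurrences of pattern in text.

    Both strings are assumed to use the DNA alphabet A, C, G, T.
    """
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the leftmost character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + code[pattern[i]]) % mod
        t_hash = (t_hash * base + code[text[i]]) % mod
    hits = []
    for i in range(n - m + 1):
        # Compare strings only when the hashes agree (guards against collisions).
        if t_hash == p_hash and text[i:i + m] == pattern:
            hits.append(i)
        if i + m < n:
            # Roll the window: drop text[i], append text[i + m].
            t_hash = ((t_hash - code[text[i]] * high) * base + code[text[i + m]]) % mod
    return hits
```

The rolling update makes each window hash a constant-time operation, which is what makes hash-based seeding attractive for aligning large numbers of short reads.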

ERNE-DMAP (ERNE Distributed MAPer) was designed to tackle the main computational bottle-neck of all classical parallel implementation of aligners: references longer than 4 Gbp. ERNE-DMAPdistributes alignment’s computation over a cluster of computers using the OpenMPI protocol. Its imple-mentation allows to split the genome across nodes: the maximum allowed reference length depends onlyon the number of available nodes in a cluster. The computation is based on a PIPELINE model but weare developing a new faster approach based on point-to-point intercomunication.

ERNE-BS5 (ERNE-BiSulfite 5, the newest) has been developed to efficiently map bisulfite-treatedreads against large genomes (e.g., Human). To achieve this goal we have implemented three differentideas: 1) we use a weighted context-aware Hamming distance to identify a T coming from an unmethy-lated C context, 2) we use a 5-letter alphabet for storing methylation information, and 3) we use aniterative process to position multiple-hit reads starting from a preliminary map built using single hits.The map is corrected and extended at each cycle using the alignments added in the previous step. ERNE-BS5 implements an improved (xor based) hash function that we plan to integrate in ERNE-MAP andERNE-DMAP.

To ease interaction with the various components of the tool, we developed a graphical user interface (GUI) dubbed ERNE-VISUAL.

ERNE executables and source code are freely downloadable at http://erne.sourceforge.net/.

[email protected] - KTH Royal Institute of Technology, Science for Life Laboratory, School of Computer Science and Communication, Solna, Sweden
† [email protected] - Applied Genomics Institute, Italy
‡ [email protected] - University of Udine, Italy
§ [email protected] - University of Udine, Italy
¶ [email protected] - University of Udine and Applied Genomics Institute, Italy


Poster Number 48


Accurate prediction of protein enzymatic class by N-to-1 Neural Networks

Viola Volpato1,2, Alessandro Adelfio1,2 and Gianluca Pollastri1,2 1School of Computer Science and Informatics, University College Dublin, Ireland 2Complex and Adaptive Systems Laboratory, University College Dublin, Ireland

Genome sequencing projects and high-throughput experimental procedures have recently produced rapid growth in protein databases, but only a small fraction of known sequences have had a function determined by experimental means. Moreover, the prediction of protein function remains problematic to date: when two proteins lack significant sequence homology it is hard to transfer functional annotations reliably, and divergent/convergent evolutionary events make this task even more complex [1]. Accordingly, one of the most important challenges of Bioinformatics at present is to develop accurate computational methods capable of determining or accurately predicting protein functions and enhancing the annotation of sequence databases in order to expand our knowledge of the mechanisms of life [2].

Since protein structures are known for less than 1% of known protein sequences, most proteins of newly sequenced genomes have to be characterized by their amino-acid sequences alone [2]. We present a novel ab initio N-to-1 Neural Network predictor based on the architecture of SCLpred, which was developed to predict subcellular localization [3]. Our model, trained on a large, curated database of over 6,000 non-redundant proteins, can classify proteins, based solely on their sequences, into one of six classes extracted from the Enzyme Commission (EC) classification scheme. In addition, in order to exploit evolutionary information, which is effective at detecting functionally significant residue patterns (e.g. active-site residues/portions) that can be harnessed for enzymatic class prediction, we represent each input sequence position by the residue frequencies derived from multiple sequence alignments instead of using single protein sequences. The model is capable of approximating non-linear functions mapping sequences to features and features to classes in a two-step prediction. As the model operates on the full sequence and not on predefined features, in a first step all motifs of a predefined length (31 residues in this work) are considered and each is compressed by an N-to-1 Neural Network into a feature vector whose content is determined automatically during training. In a second step, the vectorial outputs of all networks are added up and the resulting feature vector is input to the final network to produce the enzymatic class prediction.
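
The two-step forward pass described above can be sketched in plain Python as follows. This is only an illustration of the data flow (windows of 31 positions, a shared motif network, summation of the per-window outputs, a final classifier over six classes). The layer sizes, the tanh activations, the zero padding at the sequence ends, the random initialization, and all function names are assumptions, and no training is shown.

```python
import math
import random


def dense(x, weights, bias, activation=math.tanh):
    """One fully connected layer; weights[j] holds the inputs of output unit j."""
    return [activation(sum(xi * w for xi, w in zip(x, row)) + b)
            for row, b in zip(weights, bias)]


def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (hypothetical initialization)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def n_to_1_forward(profile, motif_net, final_net, window=31):
    """profile: one residue-frequency vector per sequence position.
    Returns a probability for each output class."""
    n_symbols = len(profile[0])
    half = window // 2
    pad = [[0.0] * n_symbols for _ in range(half)]
    padded = pad + list(profile) + pad

    # Step 1: compress every window with the same motif network, then add up.
    pooled = None
    for i in range(len(profile)):
        x = [v for row in padded[i:i + window] for v in row]
        for weights, bias in motif_net:
            x = dense(x, weights, bias)
        pooled = x if pooled is None else [p + v for p, v in zip(pooled, x)]

    # Step 2: the summed feature vector is the input of the final network.
    h = pooled
    for weights, bias in final_net[:-1]:
        h = dense(h, weights, bias)
    weights, bias = final_net[-1]
    logits = dense(h, weights, bias, activation=lambda z: z)

    # Softmax over the classes.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
motif_net = [init_layer(rng, 31 * 20, 10), init_layer(rng, 10, 8)]
final_net = [init_layer(rng, 8, 10), init_layer(rng, 10, 6)]
toy_profile = [[1.0 / 20] * 20 for _ in range(40)]  # uninformative toy profile
print(n_to_1_forward(toy_profile, motif_net, final_net))
```

Because the per-window outputs are summed, the same networks handle sequences of any length, which is what allows the model to operate on full sequences.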

We test our predictor in 10-fold cross-validation and obtain state-of-the-art results, with 96% correct classification and 86% generalized correlation. All six classes are predicted with a specificity of at least 80% and false positive rates never exceeding 7%. It has been reported that even among pairs of enzymes with over 70% residue identity in the optimal alignment, more than 30% do not belong to the same class (first EC number) [4], underlining the difficulty of obtaining accurate predictions based on sequence identity. Nevertheless, the overall accuracy of our method in predicting the main enzymatic classes is very high for the datasets used, in which sequence identity is below 30% for any two proteins. We also compare our method to ProtFun [5], which has been shown to be one of the most accurate methods for function prediction. Although comparisons on different datasets are to be taken with caution, we obtain performances exceeding those of ProtFun by over 10% while also considerably reducing false positive rates.

In conclusion, the high classification performance we achieve suggests that the network is able to recognize those functionally conserved portions of enzymatic sequences that are related to the chemical reactions on which the EC classification scheme is based. We are therefore now analysing the trained networks to mine the motifs that are most informative for the prediction, and hence likely to be functionally relevant, in order to provide them as a public database.

References
1. Ganfornina MD, Sanchez D: Generation of evolutionary novelty by functional shift. BioEssays 1999, 21:432–439.
2. Whisstock JC, Lesk AM: Prediction of protein function from protein sequence and structure. Quarterly Reviews of Biophysics 2003, 36:307–340.
3. Mooney C, Wang YH, Pollastri G: SCLpred: protein subcellular localization prediction by N-to-1 neural networks. Bioinformatics 2011, 27(20):2812–2819.
4. Rost B: Enzyme function less conserved than anticipated. J. Mol. Biol. 2002, 318:595–608.
5. Jensen LJ, Gupta R, Blom N, Devos D, Tamames J, Kesmir C, Nielsen H, Stærfeldt HH, Rapacki K, Workman C, Andersen CAF, Knudsen S, Krogh A, Valencia A, Brunak S: Prediction of human protein function from post-translational modifications and localization features. J. Mol. Biol. 2002, 319:1257–1265.

Poster Number 49


Microsecond simulations of membrane-proteins as a global ligand-docking method

Wesén, B.

Most drugs on the market target membrane-proteins, but their exact molecular functions are rarely known. Anesthetics are thought to act on membrane-bound ion-channels in the nervous system, and having better knowledge of their exact sites of action and functioning would accelerate the optimization of current drugs and design of new ones.

To find sites of action for a drug, automatic docking algorithms can heuristically search a protein, although these are commonly optimized for globular proteins and do not perform as well with membrane-proteins of complex topology. Free-energy perturbation (FEP) methods can accurately determine the binding affinity of a drug in a specific, small site, but the site has to be given explicitly, which introduces bias and requires manual work. Hence, for analyzing membrane-protein / ligand combinations which are not well known, a global, unbiased and unguided method is desired that can report putative sites and probable poses and automatically seed more detailed FEP calculations.

In this work, we apply microsecond-scale simulations of the prokaryotic pH-gated ion-channel GLIC (from Gloeobacter violaceus) together with the anesthetic desflurane in order to evaluate the feasibility of this method. The trajectories of the ligands and water are aligned to the protein and occupancy maps are built, which are automatically analyzed for hotspots. Some of the sites of high occupancy correspond to a known crystallographically resolved binding site of desflurane on GLIC near residues I201 and I202, while others have been suggested in previous studies but not thoroughly analyzed.
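
The occupancy-map step can be illustrated with a small Python sketch. It assumes that the ligand coordinates have already been aligned to the protein frame, bins them into cubic voxels, and reports the voxels that are occupied in a large fraction of the frames. The voxel size, the threshold, and the function names are hypothetical choices for this example and are not taken from the work described.

```python
from collections import Counter


def occupancy_map(frames, voxel=2.0):
    """frames: list of frames, each a list of (x, y, z) ligand-atom positions
    in Angstrom, already aligned to the protein.
    Returns {voxel index: fraction of frames in which the voxel is occupied}."""
    counts = Counter()
    for coords in frames:
        # A voxel counts at most once per frame, however many atoms it holds.
        occupied = {tuple(int(c // voxel) for c in xyz) for xyz in coords}
        counts.update(occupied)
    n_frames = len(frames)
    return {index: count / n_frames for index, count in counts.items()}


def hotspots(occupancy, threshold=0.5):
    """Voxels whose occupancy reaches the threshold, in sorted order."""
    return sorted(index for index, frac in occupancy.items() if frac >= threshold)


toy_frames = [
    [(1.0, 1.0, 1.0)],
    [(1.5, 0.5, 1.9)],
    [(9.0, 9.0, 9.0)],
    [(0.1, 0.2, 0.3), (9.5, 9.5, 9.5)],
]
occ = occupancy_map(toy_frames, voxel=2.0)
print(hotspots(occ, threshold=0.6))  # prints [(0, 0, 0)]
```

In a real analysis the hotspot voxels would typically be clustered into contiguous sites before any follow-up calculation; that step is omitted here.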

The ongoing project will add the automatic spawning of FEP calculations for each identified site of high occupancy in order to assign affinity values to the individual sites. In addition, correlating the ligand occupancy in the identified binding-site regions with other dynamical properties of the system makes further analysis possible, for example of the modulation of hydrogen-bonding patterns among structurally important parts of the protein backbone, of salt bridges between the channel subunits, and of the tilt angle or other motions of the major parts of the protein.

Poster Number 50


Where do anesthetics bind: Modeling of human ligand-gated ion channels.

Ozge Yoluk1,2, Erik Lindahl1,2

1 Theoretical & Computational Biophysics, Royal Institute of Technology, Stockholm, Sweden

2 Center for Biomembrane Research, Stockholm University, Stockholm, Sweden

Anesthesia is crucial in all fields of medicine, allowing patients to undergo surgery without stress and pain. However, the adverse effects of anesthetics are quite strong and occasionally even lethal. Anesthetics with fewer side effects would increase the confidence of patients, lowering the discomfort and stress before procedures. To be able to identify potential anesthetics with fewer side effects, it is important to fully understand how anesthetics work at the molecular level. The main targets of anesthetics in the nervous system are ligand-gated ion channels (LGICs). There are still no X-ray structures of human LGICs available due to the difficulties with overexpression and crystallization. However, studies on prokaryotic homologues of ligand-gated ion channels contribute greatly to our knowledge of human LGICs (e.g. GABAAR, a GABA-gated ion channel that transports Cl⁻ ions into the cell).

A recent eukaryotic structure (the X-ray structure of GluCl from C. elegans) with higher similarity in structure and function has opened up another door to help us study human LGICs. However, the presence of co-crystallized ligands in the structure is a major concern for computational studies, as they might force the channel into a particular state. As part of our modeling efforts on human LGICs, we are therefore performing simulations with and without ligands in order to analyze the stability and the native state of the structure. In particular, GluCl remains stable in the presence of the drug Ivermectin, and the subunits preserve the distance they have in the crystal structure (~9 Å). When simulated without Ivermectin, the subunits are closer (~7 Å) and the pore is partially collapsed. The resulting trajectories are used to construct different models of the human GABAAR that should correspond to states without ligands. We are also carrying out further tests with these different models to understand the role of the subunit distance and the binding pockets for anesthetics in human LGICs.

Poster Number 51


The metagenomics of the dead: taxonomic and functional annotation methods for analysis of ancient datasets

Katarzyna Zaremba-Niedzwiedzka1 and Siv G.E. Andersson1

1Department of Molecular Evolution, Biomedical Center, Uppsala University, Uppsala, Sweden

Metagenomics allows insight into the microbiology of unique samples, such as fossils and mummies. Many questions about microbes in ancient samples remain unanswered, including basic ones: which bacteria are present, what their source is, and whether they are ancient or modern. The Neandertal genome has recently been determined from DNA extracted from a 38,000-year-old fossil. Mammalian DNA accounted for only a few percent of the DNA sample and the large majority, 80%, remained uncharacterized. The main difficulties in the bioinformatic analysis of such data are the short read length and the lack of known reference genomes. We test the performance of different annotation methods using both artificially generated metagenomes and real sequencing reads of known origin. Full-dataset annotation by BLAST performed poorly in the tests, especially in nucleotide searches, but even protein-based searches suffered from the lack of very closely related reference sequences. Testing ribosomal RNA based taxonomic methods revealed tRNAs adjacent to rRNA genes as the main source of false positive hits. We propose a modification of the lowest common ancestor assignment procedure that prevents high-level taxonomic assignment due to single misclassified sequences in the databases. BLAST-based assignments served as the starting point for assembly in subsets of reads with limited diversity, to prevent chimeras. Ultimately, identified sequences of interest were analyzed by phylogenies to confirm annotation and taxonomic position. Finally, we developed a substitution calculation procedure, based on comparison of individual reads to the consensus sequences, that allowed recognition of the typical ancient substitution patterns. We applied these bioinformatic approaches to the analysis of metagenome sequence data generated from DNA extracted from the remains of the Neandertal.
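
To illustrate why a single misclassified database sequence is a problem for lowest common ancestor (LCA) assignment, the following Python sketch compares a strict LCA with a tolerant variant that follows a taxon only while a minimum fraction of the hits supports it. The support-fraction rule is an assumption chosen for illustration; the abstract does not describe the actual modification, and the function names and the threshold are hypothetical.

```python
from collections import Counter


def strict_lca(lineages):
    """Deepest taxon path shared by all hit lineages (root-to-leaf tuples)."""
    prefix = []
    for taxa in zip(*lineages):
        if len(set(taxa)) != 1:
            break
        prefix.append(taxa[0])
    return tuple(prefix)


def tolerant_lca(lineages, min_support=0.8):
    """Descend the taxonomy while one taxon is supported by at least
    `min_support` of all hits, so a few outliers cannot force a
    high-level assignment."""
    prefix = []
    current = list(lineages)
    total = len(lineages)
    depth = 0
    while True:
        candidates = Counter(l[depth] for l in current if len(l) > depth)
        if not candidates:
            break
        taxon, count = candidates.most_common(1)[0]
        if count / total < min_support:
            break
        prefix.append(taxon)
        current = [l for l in current if len(l) > depth and l[depth] == taxon]
        depth += 1
    return tuple(prefix)


hits = ([("Bacteria", "Actinobacteria", "Streptomyces")] * 5
        + [("Bacteria", "Actinobacteria", "Mycobacterium")] * 4
        + [("Archaea", "Euryarchaeota", "Methanobacterium")])  # one outlier
print(strict_lca(hits))    # prints ()
print(tolerant_lca(hits))  # prints ('Bacteria', 'Actinobacteria')
```

In this toy example the single archaeal hit drives the strict LCA to the root, whereas the tolerant variant still assigns the read at the phylum level.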
Our taxonomic profiling analyses suggest that Actinobacteria dominate the Neandertal metagenome and that a single species of Streptomyces accounted for 25% of all bacterial sequences in the data. Estimates of substitution frequencies from assembled rRNA sequence data verified the typical ancient features of the Neandertal sequences, and suggested a modern origin for the bacterial reads. Streptomyces-like collagenase genes were present at genome equivalents similar to those of the rRNA genes, indicating that they are derived from the same genome. We hypothesize that actinobacteria of the genus Streptomyces have been enriched inside the Neandertal bones because of their ability to degrade collagen and obtain nutrients from this otherwise nutrient-poor environment. The approach used here paves the way for similar metagenomic studies of microbial communities associated with the remains of extinct organisms.
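
The comparison of individual reads to a consensus sequence can be sketched as a simple tally of substitution types. The Python example below assumes ungapped reads with known start positions on the consensus, and it summarizes the share of C-to-T and G-to-A changes, which are the substitutions generally associated with cytosine deamination in ancient DNA. The data layout, the function names, and the summary statistic are assumptions for illustration and not the procedure used by the authors.

```python
from collections import Counter


def substitution_counts(consensus, aligned_reads):
    """aligned_reads: iterable of (start, sequence) pairs, ungapped.
    Returns a Counter keyed by (consensus base, read base)."""
    counts = Counter()
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            ref = consensus[start + offset]
            # Skip matches and ambiguous symbols such as N.
            if base != ref and base in "ACGT" and ref in "ACGT":
                counts[(ref, base)] += 1
    return counts


def deamination_fraction(counts):
    """Share of all substitutions that are C->T or G->A."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts[("C", "T")] + counts[("G", "A")]) / total


consensus = "ACGTCCGA"
reads = [(0, "ATGT"), (4, "TCGA"), (2, "GTCCAA"), (0, "CCGT")]
counts = substitution_counts(consensus, reads)
print(deamination_fraction(counts))  # prints 0.75
```

A high share of these two substitution types would be consistent with an ancient origin of the reads, while a profile without such an excess would point to modern DNA.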

Poster Number 52


A biophysical model to infer canonical and non-canonical microRNA-target interactions

Mihaela Zavolan, Mohsen Khorshid, Jean Hausser, Erik van Nimwegen

Biozentrum, University of Basel and Swiss Institute of Bioinformatics, Klingelbergstrasse 50-70, CH-4056 Basel, Switzerland

miRNAs are a large class of regulators of gene expression, acting at the post-transcriptional level to modulate the stability of mRNA targets and their rate of translation into proteins. Concerted experimental and computational approaches revealed that in mammals, 7-8 nucleotides of perfect complementarity between the miRNA 5' end and the target mRNA are frequently sufficient to elicit a response, typically measured in terms of mRNA degradation. Many putative binding sites, however, do not seem to elicit an effect, and some binding sites that do not have perfect complementarity to the corresponding miRNA "seed" region have been found to be effective in mRNA destabilization. Further progress in understanding the determinants of miRNA-dependent regulation will likely depend on the availability of much more comprehensive data sets of bona fide miRNA binding sites. Such data sets may be obtained with recently developed experimental methods relying on Argonaute (Ago) protein crosslinking and immunoprecipitation (CLIP) [1–4] if one were able to identify the miRNA that guided the interaction of Ago with the isolated RNA site. To address this need, we developed a quantitative, biophysical model for comparatively evaluating the likelihood of interaction of individual miRNAs with a given site. We inferred the model's parameters from Ago2 CLIP data and found that they largely reflect previously uncovered principles of miRNA-target interaction. Application of this model to various Ago2 CLIP data sets enabled us to identify a substantial number of miRNA binding sites that are non-canonical, yet effective in mRNA destabilization upon miRNA transfection. The degree of mRNA destabilization correlates well with the predicted affinities of binding of these sites to miRNAs, indicating that our model enables the discovery of binding sites of individual miRNAs through Ago-CLIP. By combining Ago-CLIP with mRNA and protein expression measurements, we are unraveling the kinetics and mechanism of action of miRNAs.
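
One generic way to compare miRNAs for a given site, shown here purely as an illustration, is to turn a binding energy for each miRNA into a Boltzmann weight and normalize over all candidates. The Python sketch below does this for energies supplied by the caller. The energy values, the optional abundance weighting, the miRNA labels, and the function name are hypothetical, and the abstract does not state that the authors' model takes this form.

```python
import math

RT = 0.616  # kcal/mol at about 37 degrees Celsius


def mirna_posteriors(energies, abundances=None):
    """energies: {miRNA name: binding energy to the site in kcal/mol}
    (more negative means stronger binding).
    abundances: optional {miRNA name: relative abundance}, default 1.0.
    Returns {miRNA name: probability that this miRNA guided the interaction}."""
    abundances = abundances or {}
    e_min = min(energies.values())  # shift energies for numerical stability
    weights = {
        name: abundances.get(name, 1.0) * math.exp(-(energy - e_min) / RT)
        for name, energy in energies.items()
    }
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical energies for three candidate miRNAs at one CLIP site.
site_energies = {"miR-a": -10.0, "miR-b": -10.0, "miR-c": -5.0}
posterior = mirna_posteriors(site_energies)
for name, prob in sorted(posterior.items()):
    print(name, round(prob, 4))
```

Under this weighting, two miRNAs with equal energies share the probability mass equally, and a difference of a few kcal/mol is enough to make one candidate dominate.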

References

1. Chi SW, Zang JB, Mele A, Darnell RB (2009) Argonaute HITS-CLIP decodes microRNA-mRNA interaction maps. Nature 460: 479–486.

2. Zisoulis DG, Lovci MT, Wilbert ML, Hutt KR, Liang TY, et al. (2010) Comprehensive discovery of endogenous Argonaute binding sites in Caenorhabditis elegans. Nature structural & molecular biology 17: 173–179.

3. Hafner M, Landthaler M, Burger L, Khorshid M, Hausser J, et al. (2010) Transcriptome-wide Identification of RNA-Binding Protein and MicroRNA Target Sites by PAR-CLIP. Cell 141: 129–141.

4. Kishore S, Jaskiewicz L, Burger L, Hausser J, Khorshid M, et al. (2011) A quantitative analysis of CLIP methods for identifying binding sites of RNA-binding proteins. Nature methods 8: 559–564.

Poster Number 53