Cool Informatics Tools and Services for Biomedical Research


David Ruau, PhD. August 1st, 2012

Sponsored by the Office of Postdoctoral Affairs and the Lane Medical Library

@druau

BIG DATA


Big Data in Biomedicine

http://www.nature.com/news/gene-data-to-hit-milestone-1.11019

We live in a Big Data world

Course outline

1. Analyzing genomic data
   1. Traditional bioinformatics tools
   2. Microarrays/gene lists without any code
   3. Microarrays/gene lists with code
   4. NGS and mRNA-seq
2. Beyond genomics
   1. Protein-protein interaction network
3. General data handling tools
   1. Storing your data
   2. Data are dirty
4. Statistics made easy
5. Graphics rule!
6. Demystifying "the work" (the code)
7. Conclusion + Q&A

Traditional bioinformatics tools

Bioinformatics software to solve everyday problems.

The EMBOSS tool suite: http://emboss.sourceforge.net/
One web portal is: http://mobyle.pasteur.fr/cgi-bin/portal.py
- DNA / AA pairwise global and local alignment
- Sequence feature analysis (CpG island, gene scan, restriction enzyme site, 2D/3D structure...)
- Protein structure and domains
- Similarity search (BLAST, PHI-BLAST, PSI-BLAST, DELTA-BLAST...)
- Phylogenetics (trees from multiple alignments)
- ...



UPGMA joining method






Some tools are provided through database interfaces such as NCBI Entrez.
- The UCSC Genome Browser.
- The ENCODE project results.
- For example: visualize GC content and restriction enzyme sites in your gene of interest.
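To make the GC content and restriction site example concrete, here is a minimal sketch in Python (chosen only for illustration; the browsers above do this for you graphically). The sequence `toy_gene` is a made-up string, not a real gene, and the function names are hypothetical. GAATTC is the EcoRI recognition sequence.

```python
def gc_content(seq):
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_sites(seq, site):
    """Return the 0-based start of every occurrence of a recognition site."""
    seq, site = seq.upper(), site.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        # Step by one base so overlapping occurrences are also reported.
        start = seq.find(site, start + 1)
    return positions


# Hypothetical toy sequence used only for illustration.
toy_gene = "ATGAATTCGGCGCCGAATTCTTAA"
ECORI = "GAATTC"  # EcoRI recognition sequence

print(f"GC content: {gc_content(toy_gene):.2f}")
print("EcoRI sites at:", find_sites(toy_gene, ECORI))
```

A genome browser typically shows GC content in a sliding window along the gene; this sketch computes it once over the whole sequence to keep the idea visible.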


Stating the obvious

Just because you have a GUI does not mean the analysis is brain-dead simple.

Analyzing genomic data

Analyzing microarray gene expression data without any code.

GenePattern: http://genepattern.broadinstitute.org/gp/

Upload your expression data as a text file.
GenePattern takes RES and GCT files. Conversion tools are provided to transform CEL files to GCT / RES.
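If your expression values are already in a table, a GCT file is simple enough to write yourself. The sketch below assumes the GCT version 1.2 layout (a `#1.2` line, a line with the number of rows and samples, a header starting with `Name` and `Description`, then one tab-separated line per gene). The sample names, gene names and values are hypothetical, and `write_gct` is an illustrative helper, not part of GenePattern. Check the GenePattern file format guide before uploading.

```python
import io


def write_gct(handle, samples, rows):
    """Write (name, description, values) rows in the GCT v1.2 layout."""
    handle.write("#1.2\n")
    handle.write(f"{len(rows)}\t{len(samples)}\n")
    handle.write("\t".join(["Name", "Description"] + list(samples)) + "\n")
    for name, description, values in rows:
        if len(values) != len(samples):
            raise ValueError(f"{name}: expected {len(samples)} values")
        fields = [name, description] + [str(v) for v in values]
        handle.write("\t".join(fields) + "\n")


# Hypothetical samples and expression values, for illustration only.
samples = ["ctrl_1", "ctrl_2", "treated_1"]
rows = [
    ("GENE_A", "hypothetical gene A", [7.1, 6.9, 9.4]),
    ("GENE_B", "hypothetical gene B", [5.0, 5.2, 5.1]),
]

buffer = io.StringIO()  # swap for open("expression.gct", "w") to write a file
write_gct(buffer, samples, rows)
gct_text = buffer.getvalue()
print(gct_text)
```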


StarBiogene: http://web.mit.edu/star/biogene/index.html (Java web app)
- Part of GenePattern, but provides a pipeline-style process online.

SeqExpress: http://www.seqexpress.com/ (Windows only)
- Alternative independent application (less activity than GenePattern).

Expander: http://acgt.cs.tau.ac.il/expander/
- Alternative independent application (less activity than GenePattern).

RMAExpress: http://rmaexpress.bmbolstad.com/
- Useful for quality control of your microarrays.

Cluster: http://bonsai.hgc.jp/~mdehoon/software/cluster/
- This is the original program to analyze microarray results. No pre-processing functionality: you need to pre-process separately (using RMAExpress, for example).

SAM (Significance Analysis of Microarrays): http://www-stat.stanford.edu/~tibs/SAM/
- To extract the differentially expressed (DE) genes. This is an Excel plugin. Again, you need to pre-process separately.



Commercial solution
GeneSpring GX (first 20 days are free)
Access through subscription @ Stanford with CMGM: http://cmgm3.stanford.edu


http://cmgm3.stanford.edu/membership

Interpreting your results

Interpreting a gene list relies on external knowledge. Several resources / tools are available to help.

KEGG: http://www.genome.jp/kegg/
  Pathway database
REACTOME: http://www.reactome.org
  Pathway 2.0 database
Gene Ontology: http://www.geneontology.org/
  The ultimate resource for gene function, processes, localization
BioMart: http://www.biomart.org/
  Portal providing access to multiple databases
GSEA: http://www.broadinstitute.org/gsea/index.jsp
  Part of GenePattern, but also available in R
DAVID: http://david.abcc.ncifcrf.gov/
  To perform an over-representation analysis
BiNGO: http://www.psb.ugent.be/cbd/papers/BiNGO/Home.html
  Over-representation analysis, but produces graphical results (Cytoscape)
BioGPS: http://biogps.org/
  To find out where your gene is expressed in the body, or in which cell line


Reactome
• Made to be used programmatically
• Cytoscape (a network tool) has a plugin for Reactome. Just give it a gene list, or a list of genes plus the number of samples in which each gene is mutated (for Cox survival analysis).

- Retrieve a network from a gene list
- Do network analysis
- Perform Gene Ontology analysis
- Survival analysis
http://www.reactome.org/userguide/Usersguide.html#FI_Network_Tool


DAVID database
Perform fast over-representation analysis against different databases:
- KEGG; Reactome; OMIM (diseases); GeneRIF (literature); protein domains, etc.
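What an over-representation analysis computes can be sketched with a one-sided hypergeometric test: given a background of genes, some of which carry an annotation, how surprising is the number of annotated genes in your list? This is a generic illustration in Python and not necessarily the exact statistic DAVID reports. The function name and all numbers are made up.

```python
from math import comb


def enrichment_p_value(population, annotated, list_size, hits):
    """One-sided hypergeometric test: P(X >= hits).

    population: number of genes in the background
    annotated:  background genes carrying the annotation (e.g. a pathway)
    list_size:  number of genes in your gene list
    hits:       genes in your list that carry the annotation
    """
    upper = min(annotated, list_size)
    total = comb(population, list_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, list_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20 background genes, 5 in the pathway,
# a list of 4 genes of which 3 are in the pathway.
p = enrichment_p_value(population=20, annotated=5, list_size=4, hits=3)
print(f"p = {p:.4f}")
```

Real tools run this test for every annotation term and then correct for multiple testing, which this sketch leaves out.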


Protein domains

BioGPS: exploring expression across tissues and cell lines


Look at other libraries of tissues

RMAExpress and quality control of microarrays

Several tests exist to check whether a microarray performed correctly.

Hall of fame of failed microarrays: http://plmimagegallery.bmbolstad.com/

Analyzing public gene expression data

Analyzing public microarrays with code (kind of...)

Then click on the "TOP 250" button.

R code

Top 250 genes

Next Generation Sequencing

The main NGS platforms are:

• Roche/454 (Genome Sequencer; GS)

• Illumina/Solexa (Genome Analyzer software)

• SOLiD (Applied Biosystems)

Upcoming challengers:
• Ion Torrent (Life Technologies)
• Oxford Nanopore

Done by the core facility

What you should request

Analyzing mRNA-seq

Analyzing mRNA-seq data: 4 steps.

1- Alignment and trimming of reads:
[no GUI]
TopHat (read alignment and splice junction mapper)
Cufflinks (assembly and RPKM estimates)
GALAXY provides access to TopHat and Cufflinks.

2- Calling variants and indels:
GATK (http://www.broadinstitute.org/gsa/wiki/index.php/Home_Page)
VarScan (http://varscan.sourceforge.net/)
SHRiMP2; VARiD; Atlas-SNP2; SomaticSniper...
Interpretation of variants: SIFT (Galaxy)

3- Finding differentially expressed genes:
Cuffdiff (Galaxy)
DEXSeq (R)

4- Visualization:
Savant (http://genomesavant.com/savant/)
IGV (http://www.broadinstitute.org/software/igv)

[with GUI and commercial]
GenomeStudio from Illumina
GenomeQuest [looks pretty awesome.]
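The RPKM estimates mentioned for Cufflinks normalize a raw read count by gene length and sequencing depth: reads per kilobase of transcript per million mapped reads. The sketch below shows only that arithmetic, in Python for illustration. Cufflinks itself estimates abundances with a statistical model, so its values will not simply equal this formula. Gene names and counts are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical counts: two genes with the same number of reads
# but different lengths, in a library of 10 million mapped reads.
TOTAL_MAPPED = 10_000_000
genes = {
    "GENE_A": {"reads": 500, "length_bp": 2000},
    "GENE_B": {"reads": 500, "length_bp": 1000},
}

rpkm_values = {
    name: rpkm(info["reads"], info["length_bp"], TOTAL_MAPPED)
    for name, info in genes.items()
}

for name, value in rpkm_values.items():
    print(f"{name}: {value:.1f} RPKM")
```

Both genes have the same read count, but the shorter one gets twice the RPKM, which is the point of the length normalization.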


Analyzing mRNA-seq data: Introducing GALAXY

How to use Galaxy?

http://galaxy.psu.edu/



Working in the cloud

Dudley JT, and Butte AJ. 2010. In silico research in the era of cloud computing. Nat Biotechnol 28: 1181–1185.

Summary mRNA-seq

GALAXY
This is a compendium of software. You even have UNIX tools and EMBOSS in it.

Take-home message:
FASTQ files > TopHat > Cuffdiff > IGV (for differential expression)
FASTQ files > TopHat > GATK > IGV (for variant detection)

Where to find help: http://seqanswers.com

Analyzing RNA-seq using R
DEXSeq is an R / Bioconductor package. R is a statistical programming language widely used in bioinformatics.

http://seqanswers.com/forums/forumdisplay.php?s=6dfca16ab09b5a5c23e616500d044922&f=18

Additional tools for genomics
-- GenomeSpace: http://www.genomespace.org
   Collection of tools: GenePattern, Galaxy, Cytoscape, Genomica, etc. (free, apparently). Data are stored in the cloud on Amazon VMs.

If you do not want to do it yourself:
-- Science Exchange: https://www.scienceexchange.com/
   Science jobs for hire! This is where top core facilities compete to provide the best service.
-- Assay Depot: https://www.assaydepot.com/
   Like Home Depot, but for science.
-- TaskRabbit: http://www.taskrabbit.com/
   If science takes too much of your time!


Beyond genomics: results interpretation

Interpreting your gene list with protein-protein interaction networks.

iHOP: http://www.ihop-net.org/UniPub/iHOP/

Ingenuity Pathway Analysis (commercial): access through CMGM @ Stanford



Looking into PPI databases:
IntAct: http://www.ebi.ac.uk/intact/
BioGRID: http://thebiogrid.org/ (soon multigene search)
HPRD: http://www.hprd.org/index_html

What about open-source solutions for searching the interactions between the genes in your gene list?
• Cytoscape: http://cytoscape.org
• BioNetBuilder: http://chianti.ucsd.edu/cyto_web/plugins/
• ...
• R for programmatic access to databases
• http://brainchronicle.blogspot.com

The advantage of using R is that results are reproducible, and you can share your method more easily than with a point-and-click interface.
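Searching for interactions between the genes in your list comes down to filtering a table of interaction pairs. Here is a minimal sketch in Python, shown only as an illustration of the scripted approach (the same few lines could be written in R). The interaction pairs and gene names are invented; in practice the pairs would come from a file exported from one of the PPI databases above.

```python
from collections import Counter


def subnetwork(interactions, gene_list):
    """Keep only interactions where both partners are in the gene list.

    Pairs are treated as undirected, so (A, B) and (B, A) count once;
    self-interactions are dropped.
    """
    genes = set(gene_list)
    edges = {
        tuple(sorted((a, b)))
        for a, b in interactions
        if a in genes and b in genes and a != b
    }
    return sorted(edges)


# Hypothetical interaction pairs, for illustration only.
interactions = [
    ("GENE_A", "GENE_B"),
    ("GENE_B", "GENE_C"),
    ("GENE_C", "GENE_D"),
    ("GENE_B", "GENE_A"),  # same pair reported in the other direction
    ("GENE_E", "GENE_A"),
]
my_genes = ["GENE_A", "GENE_B", "GENE_C"]

edges = subnetwork(interactions, my_genes)
degree = Counter(gene for edge in edges for gene in edge)

print("Edges within the gene list:", edges)
print("Most connected gene:", degree.most_common(1))
```

Because the filter is a script, rerunning it on an updated database export reproduces the analysis exactly.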


Data management and manipulation

REDCap: http://project-redcap.org/
Web app for building and managing online surveys and databases.

To find participants: https://www.researchmatch.org

MySQL for a professional relational database.
Requires some programming skills in SQL and database design.

Applications to query and build databases (goodbye command line):
[OS X]: Sequel Pro
[Windows]: SQLyog; Toad for MySQL...
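To give a feel for the SQL and database design involved, here is a small sketch of two related tables and one query. It uses Python's built-in `sqlite3` module as a stand-in so it runs without a server. This is SQLite, not MySQL, and some syntax differs between the two. The table names, column names and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Design: one row per patient, many lab results per patient.
conn.execute(
    "CREATE TABLE patient (id INTEGER PRIMARY KEY, label TEXT NOT NULL)"
)
conn.execute(
    """CREATE TABLE lab_result (
           id INTEGER PRIMARY KEY,
           patient_id INTEGER NOT NULL REFERENCES patient(id),
           test TEXT NOT NULL,
           value REAL
       )"""
)

# Hypothetical records, for illustration only.
conn.executemany(
    "INSERT INTO patient (id, label) VALUES (?, ?)",
    [(1, "P001"), (2, "P002")],
)
conn.executemany(
    "INSERT INTO lab_result (patient_id, test, value) VALUES (?, ?, ?)",
    [(1, "Glucose", 132.0), (1, "Glucose", 91.0), (2, "Glucose", 9.6)],
)

# Query: number of results and mean value per patient.
summary = conn.execute(
    """SELECT p.label, COUNT(*), AVG(r.value)
       FROM patient AS p
       JOIN lab_result AS r ON r.patient_id = p.id
       GROUP BY p.label
       ORDER BY p.label"""
).fetchall()

for label, n_results, mean_value in summary:
    print(label, n_results, mean_value)
```

Tools such as Sequel Pro or SQLyog let you run the same kind of query from a graphical interface.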


Data are dirty...

How to clean your data more efficiently than doing everything by hand?

12:10:00 9999999 POCT Comment GLUCOSE BY METER
21:24:00 51 O2 Saturation, ISTAT (Ven) ISTAT EG7, VENOUS
5:39:00 91 Glu GLUCOSE BY METER
10:58:00 9999999 Comments BLOOD CULTURE (2 AEROBIC BOTTLES)
9:36:00 9999999 Report Status BLOOD CULTURE (2 AEROBIC BOTTLES)
16:25:00 25 CO2, Ser/Plas METABOLIC PANEL, COMPREHENSIVE
8:12:00 132 Glucose, Ser/Plas METABOLIC PANEL, BASIC
8:06:00 5.7 MONO, % CBC WITH DIFF
8:01:00 9.6 Glucose METABOLIC PANEL, BASIC
13:22:00 16.2 CO2 (a) BLOOD GASES, ARTERIAL
4:45:00 2.7 MONO CBC WITH DIFF

DataWrangler @ Stanford: http://vimeo.com/19185801

Google Refine @ down the road. A bit less intuitive than Wrangler.

For more complex data transformation: reshape2 package in R
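Scripted cleaning is the alternative to fixing records by hand. The sketch below takes a few records like the lab export shown earlier and normalizes them. It is written in Python for illustration and makes two assumptions that are not stated in the slides: that the export is tab-separated, and that the value 9999999 is a placeholder for "no numeric result".

```python
import csv
import io

MISSING_CODE = "9999999"  # assumption: placeholder for a non-numeric result

# A few records in the style of the lab export; tab separation is assumed.
RAW_ROWS = [
    ("12:10:00", "9999999", "POCT Comment", "GLUCOSE BY METER"),
    ("5:39:00", "91", "Glu", "GLUCOSE BY METER"),
    ("8:12:00", "132", "Glucose, Ser/Plas", "METABOLIC PANEL, BASIC"),
    ("8:06:00", "5.7", "MONO, %", "CBC WITH DIFF"),
]
raw_text = "\n".join("\t".join(row) for row in RAW_ROWS)


def clean_lab_export(text):
    """Parse tab-separated lab rows into dictionaries with typed fields."""
    cleaned = []
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for time_text, value_text, test_name, panel in reader:
        hours, minutes, seconds = (int(part) for part in time_text.split(":"))
        cleaned.append({
            # Zero-pad the hour so times sort correctly as text.
            "time": f"{hours:02d}:{minutes:02d}:{seconds:02d}",
            # Turn the placeholder into a real missing value.
            "value": None if value_text == MISSING_CODE else float(value_text),
            "test": test_name.strip(),
            "panel": panel.strip(),
        })
    return cleaned


records = clean_lab_export(raw_text)
numeric = [r for r in records if r["value"] is not None]

for record in numeric:
    print(record)
```

A further cleaning step would be mapping the different names used for the same test ("Glu", "Glucose", "Glucose, Ser/Plas") onto one label, which needs a lookup table you define yourself.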


Statistics made easy...

Excel... Obviously. But what else when you want something more powerful?

• Switch to statistical software like R.
• R graphical interface: Deducer (http://www.deducer.org/)
• http://www.youtube.com/watch?v=T6kOvlMaFCA

The case for starting to use R
1. Powerful statistical procedures
   • R has become the lingua franca of statistical programming
2. Packages for everything, from
   • Flow cytometry
   • DNA microarrays
   • RNA-seq
   • Google graph API
   • ... See http://goo.gl/RwER7
3. Graphics, graphics, graphics...
   • R graphical manual: http://goo.gl/qSHMQ

http://www.deducer.org/pmwiki/pmwiki.php?n=Main.DeducerManual


Graphics in R

Data Science Visualization: Circos

CIRCOS: http://circos.ca/
To visualize genome-scale interaction and functional information.

CIRCOS is a Perl program. Some light programming is needed. But it is worth it!

Data Science Visualization

Tableau: http://www.tableausoftware.com/
Great for geo-localized data




Google Visualization: https://developers.google.com/chart/interactive/docs/gallery

Requires data in JSON format. Fortunately, a bridge with R is possible.
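To show what "data in JSON format" means here, this sketch converts a small table into the column and row layout used by Google's DataTable literal format (`cols` with `id`, `label` and `type`; `rows` made of cells with a `v` value). It is written in Python for illustration, the helper `to_datatable` is hypothetical, and the values are made up. Treat the layout as an assumption to check against the Google Charts documentation.

```python
import json


def to_datatable(columns, rows):
    """Build a DataTable-style dictionary from column specs and row values.

    columns: list of (name, type) pairs, e.g. ("Year", "number")
    rows:    list of value lists, one value per column
    """
    return {
        "cols": [
            {"id": name, "label": name, "type": kind}
            for name, kind in columns
        ],
        "rows": [{"c": [{"v": value} for value in row]} for row in rows],
    }


# Made-up values, for illustration only.
columns = [("Fruit", "string"), ("Year", "number"), ("Sales", "number")]
rows = [["Apples", 2008, 98], ["Oranges", 2008, 96]]

payload = json.dumps(to_datatable(columns, rows), indent=2)
print(payload)
```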

Earthquake in Japan


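Motion chart: http://www.youtube.com/watch?v=rnF-7TCIe08

R commands (the googleVis package provides gvisMotionChart and the Fruits example data):
> library(googleVis)
> M1 <- gvisMotionChart(Fruits, idvar="Fruit", timevar="Year")
> plot(M1)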




Demystifying the work

It's all about "reproducible research"

Sharing your analytical process (i.e. what you did) is as important as the final manuscript.

How do you share what you did with a graphical interface?

The solution is to use a programming language, like R if suitable, and share your code.

Several tools can make your life easier: RStudio or Deducer.

Come to the workshop in 2 weeks!

The kitchen

TextMate and Notepad++ for coding

Use version control, with hosting services like GitHub or Bitbucket

To make research reproducible when data are not available:
DataThief: http://www.datathief.org/

To follow the latest buzz in science: Twitter @druau

Some R books. Most of these books are available online for free through the Stanford Library.

Q&A

This Class was sponsored by the Office of Postdoctoral Affairs and the Lane Library

Offline questions to [email protected]

Thanks!
