Cool Informatics Tools and Services for Biomedical Research

David Ruau, PhD
August 1st, 2012

Sponsored by the Office of Postdoctoral Affairs and the Lane Medical Library

@druau

BIG DATA

Big Data in Biomedicine

http://www.nature.com/news/gene-data-to-hit-milestone-1.11019

We live in a Big Data world

Course outline

1. Analyzing genomic data
   1. Traditional bioinformatics tools
   2. Microarrays/gene lists without any code
   3. Microarrays/gene lists with code
   4. NGS and mRNA-seq
2. Beyond genomics
   1. Protein-protein interaction network
3. General data handling tools
   1. Storing your data
   2. Data are dirty
4. Statistics made easy
5. Graphics rule!
6. Demystifying “the work”! (the code)
7. Conclusion + Q&A

Traditional bioinformatics tools

Bioinformatics software to solve everyday problems.

The EMBOSS tool suite: http://emboss.sourceforge.net/
One web portal is: http://mobyle.pasteur.fr/cgi-bin/portal.py
- DNA / AA pairwise global and local alignment
- Sequence feature analysis (CpG island, gene scan, restriction enzyme site, 2D/3D structure...)
- Protein structure and domains
- Similarity search (Blast, phi-blast, psi-blast, delta-blast...)
- Phylogenetics (trees from multiple alignments)
- ...

UPGMA joining method

Traditional bioinformatics tools

Bioinformatics software to solve everyday problems.

Some tools are provided through database interfaces such as NCBI Entrez.
- The UCSC Genome Browser
- The ENCODE project results
- For example: visualize GC content and restriction enzyme sites in your gene of interest.
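The GC content and restriction enzyme site example above can also be sketched in a few lines of Python. This is only an illustration of the two quantities, not what the genome browser does internally: the sequence is made up, the function names `gc_content` and `find_sites` are hypothetical, and EcoRI (recognition site GAATTC) is picked arbitrarily as the example enzyme.

```python
def gc_content(seq):
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_sites(seq, site):
    """Return the 0-based start position of every occurrence of a recognition site."""
    seq, site = seq.upper(), site.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        # Advance by one base so that overlapping occurrences are also reported.
        start = seq.find(site, start + 1)
    return positions


# Made-up sequence, for illustration only.
EXAMPLE_SEQ = "ATGCGAATTCGGCTAGAATTCTA"
ECORI_SITE = "GAATTC"  # EcoRI recognition sequence

if __name__ == "__main__":
    print("GC content:", round(gc_content(EXAMPLE_SEQ), 3))
    print("EcoRI sites start at:", find_sites(EXAMPLE_SEQ, ECORI_SITE))
```

For a real gene you would read the sequence from a FASTA file instead of typing it in, and slide a window along it to get a GC profile rather than a single number.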

Stating the obvious

Just because you have a GUI does not mean the analysis is brain-dead simple.

Analyzing genomic data

Analyzing microarray gene expression data without any code.

GenePattern: http://genepattern.broadinstitute.org/gp/

Analyzing genomic data

Upload your expression data as a text file. GenePattern takes RES and GCT files. Conversion tools are provided to transform CEL files to GCT or RES.

Analyzing genomic data

StarBiogene http://web.mit.edu/star/biogene/index.html (Java web app)
- Part of GenePattern, but provides a pipeline-style process online.

SeqExpress http://www.seqexpress.com/ (Windows only)
- Alternative independent application (less activity than GenePattern).

Expander http://acgt.cs.tau.ac.il/expander/
- Alternative independent application (less activity than GenePattern).

RMAExpress http://rmaexpress.bmbolstad.com/
- Useful for performing quality control of your microarrays.

Cluster http://bonsai.hgc.jp/~mdehoon/software/cluster/
- This is the original program to analyze microarray results. No pre-processing functionality: you need to pre-process separately (using RMAExpress, for example).

SAM http://www-stat.stanford.edu/~tibs/SAM/ (Significance Analysis of Microarrays)
- To extract the differentially expressed (DE) genes. This is an Excel plugin. Again, you need to pre-process separately.

Analyzing genomic data

Commercial solution: GeneSpring GX (first 20 days are free)
Access through subscription @ Stanford with CMGM: http://cmgm3.stanford.edu

Interpreting your results

Interpreting a gene list relies on external knowledge. Several resources / tools are available to help.

KEGG: http://www.genome.jp/kegg/
  pathway database
REACTOME: http://www.reactome.org
  pathway 2.0 database
Gene Ontology: http://www.geneontology.org/
  the ultimate resource for gene function, processes, localization
BioMart: http://www.biomart.org/
  portal providing access to multiple databases
GSEA: http://www.broadinstitute.org/gsea/index.jsp
  part of GenePattern but also R
DAVID: http://david.abcc.ncifcrf.gov/
  to perform an over-representation analysis
BiNGO: http://www.psb.ugent.be/cbd/papers/BiNGO/Home.html
  over-representation analysis, but produces graphical results (Cytoscape)
BioGPS: http://biogps.org/
  to find out where your gene is expressed in the body or in which cell line

Interpreting your results

Reactome
• Made to be used programmatically
• Cytoscape (a network tool) has a plugin for Reactome. Just give a gene list, or a list of genes plus the number of samples where each gene is mutated (for Cox survival analysis).

- Retrieve a network from a gene list
- Do network analysis
- Perform Gene Ontology analysis
- Survival analysis

http://www.reactome.org/userguide/Usersguide.html#FI_Network_Tool

Interpreting your results

DAVID database
Perform fast over-representation analysis against different databases:
- KEGG, Reactome, OMIM (diseases), GeneRIF (literature), protein domains, etc.

Protein domains
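To make "over-representation analysis" concrete, here is a minimal sketch of the underlying idea: a one-sided hypergeometric test asking whether a pathway shows up in your gene list more often than chance would predict. It is not DAVID's exact statistic (DAVID reports its own modified Fisher exact score), and the function name `hypergeom_pvalue` and all numbers in the example are made up for illustration.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """Return P(X >= k) under the hypergeometric distribution.

    N: number of genes in the background (e.g. all genes on the array)
    K: number of background genes annotated to the pathway
    n: number of genes in your gene list
    k: number of genes in your list that are annotated to the pathway
    """
    upper = min(n, K)
    total = comb(N, n)
    # Sum the probabilities of observing k, k+1, ..., upper annotated genes.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: 20000 background genes, 100 in the pathway,
    # a list of 200 genes, 5 of which belong to the pathway.
    p = hypergeom_pvalue(k=5, n=200, K=100, N=20000)
    print("Over-representation p-value:", p)
```

When many pathways are tested at once, as these tools do, the resulting p-values still need a multiple-testing correction before they are interpreted.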

Interpreting your results

BioGPS: exploring expression across tissues and cell lines.

Look at other libraries of tissues.

Interpreting your results

RMAExpress and quality control of microarrays

Several tests exist to check whether the microarray performed correctly.

Hall of fame of failed microarrays: http://plmimagegallery.bmbolstad.com/

Analyzing public gene expression data

Analyzing public microarrays with code (kind of...)

Analyzing public gene expression data

Then click on the “TOP 250” button.

Analyzing public gene expression data

R code

Top 250 genes

Next Generation Sequencing

The main NGS platforms are:
• Roche/454 (Genome Sequencer; GS)
• Illumina/Solexa (Genome Analyzer)
• SOLiD (Applied Biosystems)

Upcoming challengers:
• Ion Torrent (Life Technologies)
• Oxford Nanopore

Done by the core facility

What you should request

Analyzing mRNA-seq

Analyzing mRNA-seq data: 4 steps.

1- Alignment and trimming of reads [no GUI]:
   TopHat (alignment and splice junction mapper)
   Cufflinks (assembly and RPKM estimates)
   GALAXY provides access to TopHat and Cufflinks.

2- Calling variants and indels:
   GATK (http://www.broadinstitute.org/gsa/wiki/index.php/Home_Page)
   VarScan (http://varscan.sourceforge.net/)
   SHRIMP2; VARiD; Atlas-SNP2; SomaticSniper...
   Interpretation of variants: SIFT (Galaxy)

3- Finding differentially expressed genes:
   Cuffdiff (Galaxy)
   DEXSeq (R)

4- Visualization:
   SAVANT (http://genomesavant.com/savant/)
   IGV (http://www.broadinstitute.org/software/igv)

[with GUI and commercial]
Genome Studio from Illumina
Genomequest [looks pretty awesome.]
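The "RPKM estimates" mentioned in step 1 are read counts normalized for gene length and sequencing depth (reads per kilobase of transcript per million mapped reads). The sketch below shows only the basic definition; Cufflinks estimates abundances with a more elaborate statistical model, and the function name `rpkm` and the example numbers are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads.

    RPKM = 1e9 * reads mapped to the gene / (gene length in bp * total mapped reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # Hypothetical gene: 500 reads on a 2000 bp transcript,
    # in a library of 10 million mapped reads.
    print("RPKM:", rpkm(500, 2000, 10_000_000))
```

The normalization is what makes a long gene and a short gene, or two libraries of different depth, comparable at all; raw read counts alone are not.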

Analyzing mRNA-seq data: Introducing GALAXY

How to use Galaxy?

http://galaxy.psu.edu/

Working in the cloud

Dudley JT, and Butte AJ. 2010. In silico research in the era of cloud computing. Nat Biotechnol 28: 1181–1185.

Summary mRNA-seq

GALAXY
This is a compendium of software. You even have UNIX tools and EMBOSS in it.

Take-home message:
FASTQ files > TopHat > Cuffdiff > IGV (for differential expression)
FASTQ files > TopHat > GATK > IGV (for variant detection)

Where to find help: http://seqanswers.com

Analyzing RNA-seq using R
DEXSeq is an R / Bioconductor package. R is a statistical programming language widely used in bioinformatics.

Summary mRNA-seq

Additional tools for genomics:
-- GenomeSpace: http://www.genomespace.org
   Collection of tools: GenePattern, Galaxy, Cytoscape, Genomica, etc. (free, apparently). Data are stored in the cloud on Amazon VMs.

If you do not want to do it yourself:
-- Science Exchange: https://www.scienceexchange.com/
   Science jobs for hire! This is where top core facilities compete to provide the best service.
-- Assay Depot: https://www.assaydepot.com/
   Like Home Depot, but for science.
-- TaskRabbit: http://www.taskrabbit.com/
   If science takes too much of your time!

Beyond genomics: results interpretation

Interpreting your gene list with protein-protein interaction networks.

iHOP: http://www.ihop-net.org/UniPub/iHOP/

Ingenuity Pathway Analysis (commercial): access through CMGM @ Stanford

Beyond genomics: results interpretation

Looking into PPI databases:
IntAct: http://www.ebi.ac.uk/intact/
BioGrid: http://thebiogrid.org/ (soon multigene search)
HPRD: http://www.hprd.org/index_html

What about open-source solutions for searching the interactions between the genes in your gene list?
• Cytoscape http://cytoscape.org
• BioNetBuilder http://chianti.ucsd.edu/cyto_web/plugins/
• ...
• R for programmatic access to databases
• http://brainchronicle.blogspot.com

The advantage of using R is that results are reproducible and you can share your method more easily than with a point-and-click interface.

Data management and manipulation

REDCap: http://project-redcap.org/
Web app for building and managing online surveys and databases.

To find participants: https://www.researchmatch.org

MySQL for a professional relational database. Requires some programming skills in SQL and database design.

Applications to query and build databases (goodbye command line):
[OS X]: Sequel Pro
[Windows]: SQLyog; Toad for MySQL...

Data are dirty...

How to clean your data more efficiently than doing everything by hand?

12:10:00 9999999 POCT Comment GLUCOSE BY METER
21:24:00 51 O2 Saturation, ISTAT (Ven) ISTAT EG7, VENOUS
5:39:00 91 Glu GLUCOSE BY METER
10:58:00 9999999 Comments BLOOD CULTURE (2 AEROBIC BOTTLES)
9:36:00 9999999 Report Status BLOOD CULTURE (2 AEROBIC BOTTLES)
16:25:00 25 CO2, Ser/Plas METABOLIC PANEL, COMPREHENSIVE
8:12:00 132 Glucose, Ser/Plas METABOLIC PANEL, BASIC
8:06:00 5.7 MONO, % CBC WITH DIFF
8:01:00 9.6 Glucose METABOLIC PANEL, BASIC
13:22:00 16.2 CO2 (a) BLOOD GASES, ARTERIAL
4:45:00 2.7 MONO CBC WITH DIFF

DataWrangler @ Stanford: http://vimeo.com/19185801

Google Refine @ down the road. A bit less intuitive than Wrangler.

For more complex data transformation: reshape2 package in R
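For a sense of what such a cleaning step does under the hood, here is a minimal sketch applied to a few of the lab rows shown above. It assumes, without confirmation from the data source, that 9999999 is a placeholder for "no numeric result"; the function names `parse_row` and `clean` and the output field names are hypothetical.

```python
# Assumption: 9999999 marks rows that carry a comment rather than a measurement.
SENTINEL = "9999999"

RAW_ROWS = [
    "12:10:00 9999999 POCT Comment GLUCOSE BY METER",
    "5:39:00 91 Glu GLUCOSE BY METER",
    "8:06:00 5.7 MONO, % CBC WITH DIFF",
]


def parse_row(line):
    """Split one raw row into a zero-padded time, a numeric value and a free-text label."""
    time_str, value_str, label = line.split(maxsplit=2)
    hours, minutes, seconds = (int(part) for part in time_str.split(":"))
    value = None if value_str == SENTINEL else float(value_str)
    return {
        "time": f"{hours:02d}:{minutes:02d}:{seconds:02d}",
        "value": value,
        "label": label,
    }


def clean(lines):
    """Parse every non-empty line and drop the rows without a numeric result."""
    records = [parse_row(line) for line in lines if line.strip()]
    return [record for record in records if record["value"] is not None]


if __name__ == "__main__":
    for record in clean(RAW_ROWS):
        print(record)
```

Writing the rule down as code, instead of deleting rows by hand in a spreadsheet, means the same cleaning can be rerun unchanged when the next data export arrives.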

Statistics made easy...

Excel... Obviously. But what else when you want something more powerful?

• Switch to statistical software like R.
• R graphical interface: Deducer (http://www.deducer.org/)
• http://www.youtube.com/watch?v=T6kOvlMaFCA

The case for starting to use R
1. Powerful statistical procedures
   • R has become the lingua franca for statistical programming
2. Packages for everything, from
   • Flow cytometry
   • DNA microarrays
   • RNA-seq
   • Google graph API
   • ... See http://goo.gl/RwER7
3. Graphics, graphics, graphics...
   • R graphical manual: http://goo.gl/qSHMQ

Graphics in R

Data Science Visualization: Circos

CIRCOS: http://circos.ca/
To visualize genome-scale interaction and functional information.

CIRCOS is a Perl program. Some light programming is needed. But it is worth it!

Data Science Visualization

Tableau: http://www.tableausoftware.com/
Great for geo-localized data

Data Science Visualization

Google Visualization: https://developers.google.com/chart/interactive/docs/gallery

Requires data in JSON format. Fortunately, a bridge with R is possible.

Earthquake in Japan
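To show what "data in JSON format" looks like in practice, here is a small sketch that turns a plain table into the `cols` / `rows` JSON layout that Google's chart DataTable accepts. The helper name `to_datatable` and the sales figures are made up for illustration; in R, the bridge mentioned above does this conversion for you.

```python
import json


def to_datatable(columns, rows):
    """Build a DataTable-style dictionary.

    columns: list of (label, type) pairs, with type such as "string" or "number"
    rows: list of tuples, one value per column
    """
    return {
        "cols": [
            {"id": label, "label": label, "type": col_type}
            for label, col_type in columns
        ],
        # Each row is a list of cells ("c"), and each cell holds its value under "v".
        "rows": [{"c": [{"v": value} for value in row]} for row in rows],
    }


# Made-up example data.
COLUMNS = [("Fruit", "string"), ("Year", "number"), ("Sales", "number")]
ROWS = [("Apples", 2008, 98), ("Oranges", 2008, 93), ("Bananas", 2008, 81)]

if __name__ == "__main__":
    print(json.dumps(to_datatable(COLUMNS, ROWS), indent=2))
```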

Data Science Visualization

Google Visualization: https://developers.google.com/chart/interactive/docs/gallery

Motion chart: http://www.youtube.com/watch?v=rnF-7TCIe08

R commands:
> library(googleVis)  # provides gvisMotionChart() and the Fruits example data set
> M1 <- gvisMotionChart(Fruits, idvar="Fruit", timevar="Year")
> plot(M1)

Demystifying the work

It's all about “reproducible research”

Sharing your analytical process (a.k.a. what you did) is as important as the final manuscript.

How do you share what you did with a graphical interface?

The solution is to use a programming language, like R if suitable, and share your code.

Several tools can make your life easier: RStudio or Deducer.

Come to the workshop in 2 weeks!

The kitchen

TextMate and Notepad++ for coding

Use version control, with hosting services like GitHub or Bitbucket

To make research reproducible when data are not available: DataThief: http://www.datathief.org/

To follow the latest buzz in science: Twitter @druau

Some R books. Most of those books are available online for free through the Stanford Library.

Q&A

This class was sponsored by the Office of Postdoctoral Affairs and the Lane Library.

Offline questions to [email protected]

Thanks!