Cool Informatics Tools and Services for Biomedical Research
David Ruau, PhD. August 1st, 2012
Sponsored by the Office of Postdoctoral Affairs and the Lane Medical Library
@druau
BIG DATA
Big Data in Biomedicine
http://www.nature.com/news/gene-data-to-hit-milestone-1.11019
We live in a Big Data world
1. Analyzing genomic data
   1. Traditional bioinformatics tools
   2. Microarrays/gene lists without any code
   3. Microarrays/gene lists with code
   4. NGS and mRNA-seq
2. Beyond genomic
   1. Protein-protein interaction network
3. General data handling tools
   1. Storing your data
   2. Data are dirty
4. Statistics made easy
5. Graphics rules!
6. Demystifying “the work”! (the code)
7. Conclusion + Q&A
Course outline
Bioinformatics software to solve everyday problems.
The EMBOSS tool suite: http://emboss.sourceforge.net/
One web portal is: http://mobyle.pasteur.fr/cgi-bin/portal.py
- DNA / AA pairwise global and local alignment
- Sequence feature analysis (CpG island, gene scan, restriction enzyme site, 2D/3D structure...)
- Protein structure and domains
- Similarity search (BLAST, PHI-BLAST, PSI-BLAST, DELTA-BLAST...)
- Phylogenetics (trees from multiple alignments)
- ...
Traditional bioinformatics tools
UPGMA joining method
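UPGMA is average-linkage hierarchical clustering, so a small tree can be sketched in base R with hclust(). The four sequence names and the distance values below are made up for illustration; in practice the distances would come from a multiple alignment.

```r
# Hypothetical pairwise distances between four sequences (illustration only).
seqs <- c("seqA", "seqB", "seqC", "seqD")
d <- matrix(c( 0,  2, 10, 10,
               2,  0, 10, 10,
              10, 10,  0,  4,
              10, 10,  4,  0),
            nrow = 4, byrow = TRUE, dimnames = list(seqs, seqs))

# method = "average" is the UPGMA joining rule.
tree <- hclust(as.dist(d), method = "average")

# Cut the tree into two clusters to see which sequences were joined.
groups <- cutree(tree, k = 2)
print(tree$height)
print(groups)
```

Calling plot(tree) draws the dendrogram.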
Bioinformatics software to solve everyday problems.
Some tools are provided through database interfaces such as NCBI Entrez.
- The UCSC genome browser.
- The ENCODE project results
- For example: visualize GC content and restriction enzyme sites in your gene of interest.
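What the browser tracks compute can be sketched in a few lines of base R. The sequence below is invented, and the helper names gc_content() and find_sites() are hypothetical; GAATTC is the EcoRI recognition site.

```r
# Fraction of G and C bases in a DNA string.
gc_content <- function(dna) {
  bases <- strsplit(toupper(dna), "")[[1]]
  mean(bases %in% c("G", "C"))
}

# 1-based start positions of a recognition site (non-overlapping matches).
find_sites <- function(dna, site) {
  hits <- gregexpr(site, toupper(dna), fixed = TRUE)[[1]]
  if (hits[1] == -1) integer(0) else as.integer(hits)
}

gene <- "ATGGAATTCGCGCTAGAATTCTAA"   # made-up sequence
print(gc_content(gene))
print(find_sites(gene, "GAATTC"))
```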
Traditional bioinformatics tools
Just because you have a GUI does not mean the analysis is
brain-dead simple.
Stating the obvious
Analyzing microarray gene expression data without any code.
GenePattern: http://genepattern.broadinstitute.org/gp/
Analyzing genomic data
Upload your expression data as a text file.
GenePattern takes RES and GCT files. Conversion tools are provided to transform CEL files to GCT or RES.
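If your expression values are already in R, a GCT file is simple to write by hand: a "#1.2" version line, a line with the number of genes and samples, a header row, then one row per gene. A minimal sketch, assuming a numeric matrix with gene names as row names; write_gct() is a hypothetical helper, not a GenePattern function.

```r
# Write a numeric matrix (genes x samples) as a GCT v1.2 text file.
write_gct <- function(expr, descriptions, path) {
  stopifnot(is.matrix(expr), length(descriptions) == nrow(expr))
  con <- file(path, open = "wt")
  on.exit(close(con))
  writeLines("#1.2", con)                                    # format version
  writeLines(paste(nrow(expr), ncol(expr), sep = "\t"), con) # genes, samples
  writeLines(paste(c("Name", "Description", colnames(expr)),
                   collapse = "\t"), con)                    # header row
  body <- data.frame(Name = rownames(expr), Description = descriptions,
                     expr, check.names = FALSE)
  write.table(body, con, sep = "\t", quote = FALSE,
              row.names = FALSE, col.names = FALSE)
}

# Made-up expression values for two genes in two samples.
expr <- matrix(c(7.1, 8.3, 5.0, 5.2), nrow = 2, byrow = TRUE,
               dimnames = list(c("GENE_A", "GENE_B"), c("sample1", "sample2")))
gct_path <- tempfile(fileext = ".gct")
write_gct(expr, c("na", "na"), gct_path)
gct_lines <- readLines(gct_path)
print(gct_lines)
```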
Analyzing genomic data
StarBiogene http://web.mit.edu/star/biogene/index.html (Java web app)
- Part of GenePattern but provides a pipeline-style process online
SeqExpress http://www.seqexpress.com/ (Windows only)
- Alternative independent application (less active than GenePattern)
Expander http://acgt.cs.tau.ac.il/expander/
- Alternative independent application (less active than GenePattern)
RMAExpress http://rmaexpress.bmbolstad.com/
- Useful for quality control of your microarrays.
Cluster http://bonsai.hgc.jp/~mdehoon/software/cluster/
- This is the original program to analyze microarray results. No pre-processing functionality. You need to pre-process separately (using RMAExpress for example)
SAM http://www-stat.stanford.edu/~tibs/SAM/ (Significance Analysis of Microarrays)
- To extract the differentially expressed (DE) genes. This is an Excel plugin. Again, you need to pre-process separately
Analyzing genomic data
Commercial solution
GeneSpring GX (first 20 days are free)
Access through subscription @ Stanford with CMGM http://cmgm3.stanford.edu
Analyzing genomic data
Interpreting a gene list relies on external knowledge.
Several resources / tools are available to help.
KEGG: http://www.genome.jp/kegg/
pathway database
REACTOME: http://www.reactome.org
pathway 2.0 database
Gene Ontology: http://www.geneontology.org/
the ultimate resource for gene function, processes, localization
BioMart: http://www.biomart.org/
Portal providing access to multiple databases
GSEA: http://www.broadinstitute.org/gsea/index.jsp
part of GenePattern but also R
DAVID: http://david.abcc.ncifcrf.gov/
to perform an over-representation analysis
BiNGO: http://www.psb.ugent.be/cbd/papers/BiNGO/Home.html
over-representation analysis but produces graphical results (Cytoscape)
BioGPS: http://biogps.org/
To find out in which tissue or cell line your gene is expressed
Interpreting your results
Reactome
• Made to be used programmatically
• Cytoscape (a network tool) has a plugin for Reactome.
Just give a gene list, or a list of genes plus the number of samples where each gene is mutated (for Cox survival analysis)
Interpreting your results
- Retrieve a network from a gene list
- Do network analysis
- Perform Gene Ontology analysis
- Survival analysis
http://www.reactome.org/userguide/Usersguide.html#FI_Network_Tool
DAVID database
Perform fast over-representation analysis against different databases
- KEGG; Reactome; OMIM (diseases), GeneRIF (literature), protein domains etc...
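The idea behind an over-representation analysis is a hypergeometric (one-sided Fisher) test: are more genes from your list annotated to a pathway than expected by chance? A minimal sketch in base R with made-up counts; ora_pvalue() is a hypothetical helper, and DAVID's own statistic may differ in detail.

```r
# P(X >= hits) when drawing list_size genes from a universe in which
# `annotated` genes belong to the pathway.
ora_pvalue <- function(hits, list_size, annotated, universe) {
  phyper(hits - 1, annotated, universe - annotated, list_size,
         lower.tail = FALSE)
}

# Hypothetical numbers: 20000 genes in the universe, 100 in the pathway,
# 200 genes in your list, 10 of which fall in the pathway (1 expected).
p <- ora_pvalue(hits = 10, list_size = 200, annotated = 100, universe = 20000)
print(p)
```

Remember to correct for multiple testing (for example with p.adjust()) when many pathways are tested.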
Interpreting your results
Protein domains
BioGPS: exploring expression across tissues and cell lines
Interpreting your results
Look at other libraries of tissues
RMAExpress and quality control of microarrays
Several tests exist to check whether the microarray performed correctly.
Hall of fame of failed microarrays: http://plmimagegallery.bmbolstad.com/
Interpreting your results
Analyzing public microarrays with code (kind of...)
Analyzing public gene expression data
Then click on the “TOP 250” button
R code
Top 250 genes
Next Generation Sequencing
The main NGS platforms are:
• Roche/454 (Genome Sequencer; GS)
• Illumina/Solexa (Genome Analyzer)
• SOLiD (Applied Biosystems)
Upcoming challengers:
• Ion Torrent (Life Technologies)
• Oxford Nanopore
Next Generation Sequencing
Done by the core facility
What you should request
Analyzing mRNA-seq data: 4 steps.
1- Alignment and trimming of reads:
[no GUI]
TopHat (alignment and splice junction mapper)
Cufflinks (assembly and RPKM estimates)
Galaxy provides access to TopHat and Cufflinks.
2- Calling variants and indels:
GATK (http://www.broadinstitute.org/gsa/wiki/index.php/Home_Page)
VarScan (http://varscan.sourceforge.net/)
SHRiMP2; VARiD; Atlas-SNP2; SomaticSniper...
Interpretation of variants: SIFT (Galaxy)
3- Finding differentially expressed genes
Cuffdiff (Galaxy)
DEXSeq (R)
4- Visualization:
Savant (http://genomesavant.com/savant/)
IGV (http://www.broadinstitute.org/software/igv)
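RPKM (reads per kilobase of transcript per million mapped reads) normalizes a read count for gene length and sequencing depth. The plain definition can be sketched as below, with made-up counts; Cufflinks estimates abundance with a statistical model, so this is the textbook formula, not its algorithm.

```r
# RPKM = reads * 1e9 / (gene length in bp * total mapped reads)
rpkm <- function(counts, gene_length_bp, total_mapped_reads) {
  counts * 1e9 / (gene_length_bp * total_mapped_reads)
}

# Hypothetical example: two genes in a library of 10 million mapped reads.
counts  <- c(geneA = 500, geneB = 1000)
lengths <- c(geneA = 1000, geneB = 4000)
values  <- rpkm(counts, lengths, total_mapped_reads = 1e7)
print(values)
```

geneB has twice the reads of geneA but is four times longer, so its RPKM is lower.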
Analyzing mRNA-seq
[with GUI and commercial]
GenomeStudio from Illumina
GenomeQuest [looks pretty awesome.]
Analyzing mRNA-seq data: Introducing Galaxy
How to use Galaxy?
http://galaxy.psu.edu/
Working in the cloud
Dudley JT, and Butte AJ. 2010. In silico research in the era of cloud computing. Nat Biotechnol 28: 1181–1185.
Summary mRNA-seq
Galaxy
This is a compendium of software. You even have UNIX tools and EMBOSS in it.
Take home message:
FASTQ files > TopHat > Cuffdiff > IGV (for differential expression)
FASTQ files > TopHat > GATK > IGV (for variant detection)
Where to find help: http://seqanswers.com
Analyzing RNA-seq using R
DEXSeq is an R / Bioconductor package. R is a statistical programming language widely used in bioinformatics.
Additional tools for genomics
-- GenomeSpace: http://www.genomespace.org
Collection of tools: GenePattern, Galaxy, Cytoscape, Genomica etc... (free apparently). Data are stored in the cloud on Amazon VMs.
If you do not want to do it yourself:
-- Science Exchange: https://www.scienceexchange.com/
Science jobs for hire! This is where top core facilities compete to provide the best service.
-- Assay Depot: https://www.assaydepot.com/
Like Home Depot but for science
-- TaskRabbit: http://www.taskrabbit.com/
If science takes too much of your time!
Beyond genomics: results interpretation
Interpreting your gene list with a protein-protein interaction network.
iHOP: http://www.ihop-net.org/UniPub/iHOP/
Ingenuity Pathway Analysis (commercial): access through CMGM @ Stanford
Looking into PPI databases:
IntAct: http://www.ebi.ac.uk/intact/
BioGRID: http://thebiogrid.org/ (soon multigene search)
HPRD: http://www.hprd.org/index_html
What about open-source solutions for searching the interactions between the genes in your gene list?
• Cytoscape http://cytoscape.org
• BioNetBuilder http://chianti.ucsd.edu/cyto_web/plugins/
• ...
• R for programmatic access to databases
• http://brainchronicle.blogspot.com
The advantage of using R is that results are reproducible and you can share your method more easily than with a point-and-click interface.
Data management and manipulation
REDCap: http://project-redcap.org/
Web app for building and managing online surveys and databases
To find participants: https://www.researchmatch.org
MySQL for a professional relational database.
Requires some programming skills in SQL and database design.
Applications to query and build databases (goodbye command line):
[OS X]: Sequel Pro
[Windows]: SQLyog; Toad for MySQL...
Data are dirty...
How to clean your data more efficiently than doing everything by hand?
12:10:00 9999999 POCT Comment GLUCOSE BY METER
21:24:00 51 O2 Saturation, ISTAT (Ven) ISTAT EG7, VENOUS
5:39:00 91 Glu GLUCOSE BY METER
10:58:00 9999999 Comments BLOOD CULTURE (2 AEROBIC BOTTLES)
9:36:00 9999999 Report Status BLOOD CULTURE (2 AEROBIC BOTTLES)
16:25:00 25 CO2, Ser/Plas METABOLIC PANEL, COMPREHENSIVE
8:12:00 132 Glucose, Ser/Plas METABOLIC PANEL, BASIC
8:06:00 5.7 MONO, % CBC WITH DIFF
8:01:00 9.6 Glucose METABOLIC PANEL, BASIC
13:22:00 16.2 CO2 (a) BLOOD GASES, ARTERIAL
4:45:00 2.7 MONO CBC WITH DIFF
DataWrangler @ Stanford
http://vimeo.com/19185801
Google Refine @ down the road.
A bit less intuitive than Wrangler.
For more complex data transformation: reshape2 package in R
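The same kind of cleaning can be scripted. A minimal sketch in base R on a few rows from the lab table, assuming that 9999999 is a placeholder for "no numeric result" and that every component starting with "Glu" is a glucose measurement; both are assumptions to check against your own data dictionary.

```r
# A few rows from the lab table, typed in by hand.
labs <- data.frame(
  time      = c("12:10:00", "5:39:00", "8:12:00", "8:01:00", "16:25:00"),
  value     = c(9999999, 91, 132, 9.6, 25),
  component = c("POCT Comment", "Glu", "Glucose, Ser/Plas", "Glucose",
                "CO2, Ser/Plas"),
  stringsAsFactors = FALSE
)

# Assumed sentinel: 9999999 marks rows without a numeric result.
labs$value[labs$value == 9999999] <- NA

# Harmonize the different spellings of the same analyte.
labs$analyte <- ifelse(grepl("^Glu", labs$component), "glucose",
                       tolower(labs$component))

glucose <- labs[labs$analyte == "glucose" & !is.na(labs$value), ]
print(glucose)
```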
Statistics made easy...
Excel... Obviously. But what else when you want something more powerful?
• Switch to statistical software like R.
• R graphical interface: Deducer (http://www.deducer.org/)
• http://www.youtube.com/watch?v=T6kOvlMaFCA
The case for starting to use R
1. Powerful statistical procedures
• R has become the lingua franca for statistical programming
2. Packages for everything from
• Flow cytometry
• DNA microarrays
• RNA-seq
• Google graph API
• ... See http://goo.gl/RwER7
3. Graphics, graphics, graphics...
• R graphical manual: http://goo.gl/qSHMQ
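To give a feel for point 1, here is a two-sample comparison in base R. The measurements are invented; t.test() runs Welch's t-test by default.

```r
# Made-up measurements for two groups.
control <- c(5.1, 4.9, 5.3, 5.0, 5.2)
treated <- c(6.1, 5.9, 6.3, 6.0, 6.2)

# Welch two-sample t-test (unequal variances is the default).
result <- t.test(treated, control)
print(result$estimate)   # group means
print(result$p.value)
```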
Graphics in R
Data Science Visualization: Circos
CIRCOS: http://circos.ca/
To visualize genome-scale interaction and functional information
CIRCOS is a Perl program. Some light programming is needed. But it is worth it!
Data Science Visualization
Tableau: http://www.tableausoftware.com/
Great for geo-localized data
Google Visualization: https://developers.google.com/chart/interactive/docs/gallery
Requires data in JSON format. Fortunately a bridge with R is possible.
Earthquake in Japan
Motion chart: http://www.youtube.com/watch?v=rnF-7TCIe08
R commands:
> library(googleVis)
> M1 <- gvisMotionChart(Fruits, idvar="Fruit", timevar="Year")
> plot(M1)
Demystifying the work
It's all about “reproducible research”
Sharing your analytical process (aka. what you did) is as important as the final manuscript.
How do you share what you did with a graphical interface?
The solution is to use a programming language, like R if suitable, and share your code.
Several tools can make your life easier.
RStudio or Deducer
Come to the workshop in 2 weeks!
The kitchen
TextMate and NotePad++ for coding
Use a version control system such as Git, hosted on GitHub or Bitbucket
To make research reproducible when data are not available:
DataThief: http://www.datathief.org/
To follow the latest buzz in science: Twitter @druau
Some R books. Most of those books are available online for free through the Stanford Library.
Q&A
This Class was sponsored by the Office of Postdoctoral Affairs and the Lane Library
Offline questions to [email protected]
Thanks!