67
Galaxy: A Web-based Platform for Accessible, Reproducible, and Transparent High-throughput (Genome) Biology Jeremy Goecks Depts. of Biology and Math & Computer Science Emory University 1

Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Galaxy: A Web-based Platform for Accessible, Reproducible, and Transparent

High-throughput (Genome) Biology

Jeremy GoecksDepts. of Biology and Math & Computer Science

Emory University

1

Page 2: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

GenomicsGenomics

Identify and annotate all functional genomic elements

Understand interactions among elements

Apply genome knowledge to address biomedical challenges

2

Page 3: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

3

Page 4: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

4“Will Computers Crash Genomics?”, Pennisi, E., Science, Feb 11, 2011

Page 5: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

5Figure 1. Contribution of di!erent factors to the overall cost of a sequencing project across time. Left, the four-step process: (i) experimental design and sample collection, (ii) sequencing, (iii) data reduction and management, and (iv) downstream analysis. Right, the changes over time of relative impact of these four components of a sequencing experiment. BAM, Binary Sequence Alignment/Map; BED, Browser Extensible Data; CRAM, compression algorithm; MRF, Mapped Read Format; NGS, next-generation sequencing; TAR, transcriptionally active region; VCF, Variant Call Format.

Pre-NGS(Approximately 2000)

Now(Approximately 2010)

Future(Approximately 2020)

0%

2040

6080

100%

Data reductionData management

Sample collection and experimental design

Sequencing Downstreamanalyses

Dat

a m

anag

emen

t

Figure 2. Cost of 1 MB of DNA sequencing. Decreasing cost of sequencing in the past 10 years compared with the expectation if it had followed Moore’s law. Adapted from [11]. Cost was calculated in January of each year. MB, megabyte.

$0.10

$1.00

$10.00

$100.00

$1,000.00

$10,000.00

2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011

Cost per Mb of DNA Sequence Moore's law

Sboner et al. Genome Biology 2011, 12:125 http://genomebiology.com/2011/12/8/125

Page 2 of 10

Page 6: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

A Key Challenge in Genomics

Generating data is easy✦ high-throughput/next-generation sequencing

(HTS/NGS) technologies improving rapidly✦ datasets are many MBs or GBs

Analyzing data is THE bottleneck✦ computation is essential due to dataset size

6

Page 7: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Computation in Science?

Scientists unfamiliar with computation

Reproducibility hindered by complexity: systems, scripts, tools, parameters

Collaboration and publishing difficult because current media do not support computational artifacts well

7

Page 8: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Galaxy Project: Fundamental Questions

When Biology (or any science) becomes dependent on computational methods:

✦ how best to make methods accessible to scientists?✦ how best to ensure that analyses are reproducible?✦ how best to enable transparent communication and

reuse of analyses?

8

Page 9: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Vision

Galaxy is an open, Web-based platform for accessible, reproducible, and transparent computational biomedical research

9

Page 10: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

What is Galaxy?

GUI for genomics✦ for complete analyses: analyze, visualize, share, publish

A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage

Open source software that makes integrating your own tools and data and customizing for your own site simple

Page 11: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Accessibility

11

Page 12: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Accessibility

12

Page 13: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Accessibility

13

Page 14: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Accessibility

14

Page 15: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Connecting Users with Accessibility

15

Page 16: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Connecting Users with Accessibility

16

Page 17: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Galaxy Accessibility

Only a Web browser is required

Standardization✦ tools, parameters, outputs all look the same

Easy to use output of one tool as input for another tool

17

Page 18: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Accessibility for Tool Developers

Defined via abstract interface:

✦ inputs & outputs✦ parameters✦ how to generate

command line

As simple as possible but allows for rigorous

18

Page 19: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Dozens of tools for different HTS applications packaged with Galaxy19

Page 20: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Amplification

Many tools available in a single place means that tools can be combined in novel ways

Users, developers, community benefit

20

Page 21: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Reproducibility

21

Page 22: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Reproducibility in Genomics

18 Nat. Genetics experiments in microarray gene expression

<50% of reproducible

Problems✦ missing data (38%)✦ missing software, hardware

details (50%) ✦ missing method, processing

details (66%)

Ioannidis, J.P.A. et al. “Repeatability of published microarray gene expression analyses.” Nat Genet 41, 149-155 (2009)

22

Page 23: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Reproducibility in Genomics

18 Nat. Genetics experiments in microarray gene expression

<50% of reproducible

Problems✦ missing data (38%)✦ missing software, hardware

details (50%) ✦ missing method, processing

details (66%)

Ioannidis, J.P.A. et al. “Repeatability of published microarray gene expression analyses.” Nat Genet 41, 149-155 (2009)

23

14 re-sequencing experiments in Nat. Genetics, Nature, Science

0% reproducible?

Problems✦ missing primary data (50%)✦ tools unavailable (50%)✦ missing parameter setting,

tool versions (100%)

"Devil in the details," Nature, vol. 470, 305-306 (2011).

Page 24: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Metadata = Reproducibility

24

Page 25: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Automatic Metadata

25

Page 26: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

User Metadata

26

Page 27: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Workflows can be constructed from scratch or extracted from existing analysis histories

Facilitate reuse and provide precise reproducibility of a complex analysis

Galaxy Workflow System

Page 28: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Galaxy Workflows

28

Page 29: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Example: Workflow for differential expression analysis of RNAseq using Tophat/Cufflinks tools

Page 30: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Example: Diagnosing lowfrequency heterosplasmic sites in two tissues from the same individual

Page 31: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Transparency ~ Sharing, Publishing, Reusing

31

Page 32: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Everything can be shared

32

Page 33: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

33

Page 34: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

34

Text

Page 35: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Galaxy Pages

A web-based, interactive medium for presenting all aspects of an analysis: data, methods, visualization, and results

35

Page 36: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Galaxy Page for a recent study on mitochondrial heteroplasmy

Page 37: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Actual histories and datasets directly accessible from the text

Page 38: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Histories can be imported and the exact parameters inspected

Page 39: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Workflows and other entities can also be embedded

Page 40: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

And imported for inspection, verification, and reuse

Page 41: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Galaxy's publishing features facilitate access and reproducibility without any extra leg work

One click grants access to the actual analysis you performed to generate your original results

✦ not just data access, the full pipeline + annotations✦ anyone can import your work and immediately

reproduce or build on it

The Power of Galaxy Publishing

Page 42: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:
Page 43: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Three Ways to Use Galaxy

1. Public Website (http://usegalaxy.org)

2. Download and run locally

3. Run on the cloud

43

Page 44: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Galaxy main site(http://usegalaxy.org)

Public Website, anybody can use

~500 new users per month, ~100 TB of user data, ~130,000 analysis jobs per month, every month is our busiest month ever...

Will continue to be maintained and enhanced, but with limits and quotas

Centralized solution cannot scale to meet data analysis demands

Page 45: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Download and Run Locally

No configuration needed but everything can be configured✦ tools✦ computing cluster integration✦ proxy server and authentication

Prominent local installations at:✦ Cold Spring Harbor Lab✦ Dept. of Energy’s Joint Genome Institute✦ Harvard School of Public Health✦ U of Texas System✦ Netherlands Bioinformatics Center✦ Oxford College

Requires computing resources, technical expertise, and maintenance

45

Page 46: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Galaxy on the Cloud

For extended or particular resource needs✦ customization necessary✦ oscillating data volume

For when informatics expertise or infrastructure is limited

46

Work done by Enis Afgan

Page 47: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Using Amazon EC2: Startup in 3 steps

Page 48: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

48

Page 49: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Like any other Galaxy instance plus: ✦ additional compute nodes acquired and released

automatically as needed✦ can share or publish an instance

Page 50: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Challenges Going Forward

Promoting community involvement✦ tools, assays, analyses growing too fast for us alone✦ facilitate community contributions and usage of contributions

Scaling to many, many Galaxies✦ moving objects between Galaxies while ensuring reproducibility✦ enabling users to find useful “stuff”

Novel application areas✦ genomics ideal application area -- what next?

50

Page 51: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Supported by the NHGRI (HG005542, HG004909, HG005133), NSF (DBI-0850103), Penn State University, Emory University, and the Pennsylvania Department of Public Health

Dan Blankenberg Nate Coraor

Greg von Kuster

Enis Afgan Dannon Baker

Jeremy Goecks

Anton NekrutenkoJames Taylor

Dave Clements Jennifer Jackson

51

Page 52: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

http://galaxyproject.org

Galaxy publications: http://galaxyproject.org/wiki/Citing

Galaxy is hiring! http://galaxyproject.org/wiki/GalaxyisHiring

[email protected]

52

Thanks! Questions?

Page 53: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

53

Page 54: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Visualization

54

Page 55: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Integration with many existing browsers extensible

Page 56: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

56

Genome Browser✦ physical depiction of data✦ visually identify correlations✦ find interesting regions, features

Galaxy✦ tool integration framework✦ heavy focus on usability✦ sharing, publication framework

Trackster

Page 57: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:
Page 58: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:
Page 59: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:
Page 60: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Dynamic Filtering

60

Page 61: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

61

Integrating Tools and Visualization

Page 62: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

62

Integrating Tools and Visualization

Page 63: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

63

Integrating Tools and Visualization

Page 64: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

Scaling Galaxy: two distinct problems

So much data, not enough infrastructure ✦ solution: encourage local Galaxy instances, cloud

Galaxy, support increasingly decentralized model

So many tools and workflows, not enough manpower

✦ focus on building infrastructure to allow community to integrate and share tools, workflows, and best practices

64

Page 65: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

1 2 3 ∞

http://usegalaxy.org

http://usegalaxy.org/community

Galaxy Tool Shed

...

Galaxies on private clouds

Galaxies on public clouds

...

private Galaxy installations

private Tool Sheds

Page 66: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

66

“The expanding transcriptome: the genome as the 'Book of Sand'”, Soares and Valcarcel, The EMBO Journal (2006) 25

Page 67: Galaxy: A Web-based Platform for Accessible, …...accessible, reproducible, and transparent computational biomedical research 9 What is Galaxy? GUI for genomics for complete analyses:

67

Figure 1. Contribution of di!erent factors to the overall cost of a sequencing project across time. Left, the four-step process: (i) experimental design and sample collection, (ii) sequencing, (iii) data reduction and management, and (iv) downstream analysis. Right, the changes over time of relative impact of these four components of a sequencing experiment. BAM, Binary Sequence Alignment/Map; BED, Browser Extensible Data; CRAM, compression algorithm; MRF, Mapped Read Format; NGS, next-generation sequencing; TAR, transcriptionally active region; VCF, Variant Call Format.

Pre-NGS(Approximately 2000)

Now(Approximately 2010)

Future(Approximately 2020)

0%

2040

6080

100%

Data reductionData management

Sample collection and experimental design

Sequencing Downstreamanalyses

Dat

a m

anag

emen

t

Figure 2. Cost of 1 MB of DNA sequencing. Decreasing cost of sequencing in the past 10 years compared with the expectation if it had followed Moore’s law. Adapted from [11]. Cost was calculated in January of each year. MB, megabyte.

$0.10

$1.00

$10.00

$100.00

$1,000.00

$10,000.00

2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011

Cost per Mb of DNA Sequence Moore's law

Sboner et al. Genome Biology 2011, 12:125 http://genomebiology.com/2011/12/8/125

Page 2 of 10

Sboner et al. Genome Biology 2011, 12:125

Trends in Genomics