
Galaxy for Bioinformatics Analysis: An Introduction

TCD Bioinformatics Support Team
Fiona Roche, PhD
Date: 31/08/15

Overview

• What is Galaxy?

• Why is it useful?

• Command-line vs Galaxy

• A Basic Analysis with Galaxy

• Resources for Learning

What is Galaxy?

A web-based genome analysis platform designed for experimental biologists

www.galaxyproject.org

Why is it useful to a biologist?

• Easy to use!

• Allows data import from popular resources

• Provides access to best practice bioinformatics tools

• Allows you to build analysis pipelines and share them

• Provides multiple ways to visualise your data

Trinity College Dublin, The University of Dublin


Case Study: ChIP-seq Analysis Pipeline

1. Sequencing
2. Pre-processing of raw reads
3. Quality control
4. Map reads to reference genome
5. Peak calling
6. Enriched regions

Analyses that follow from the enriched regions:

• Visualisation with genome browser
• Motif discovery
• Relationship with gene structure
• Gene set analysis
• Differential profile analysis


Question:

Which gene promoter regions do these enriched regions map to?


Command-line approach

1. Extract gene coordinates from UCSC
2. Extract 1kb upstream coordinates from UCSC
3. Merge upstream coordinates and gene annotation
4. Clean files
5. Join the input files
6. Create user track for UCSC
7. Import to UCSC
8. Run a wrapper script to enable a re-run of this pipeline with different parameters


Galaxy Approach


The Galaxy Interface

Datasources and Tools

Main Analysis window

History of commands

Main Menus


Overview of Analysis

Question: Which gene promoter regions do these enriched regions map to?

Analysis steps:

• Import two datasets into Galaxy
  1. Genomic coordinates of enriched peaks
  2. Genomic coordinates of genes
• Extract upstream regions of genes
• Data cleaning
• Identify overlap between promoter regions and enriched regions
• Visualise on a genome browser


Let’s begin!

Register an account

http://bioinf.gen.tcd.ie/workshops/Galaxy/


Step 1: Get Data into Galaxy

Step 1: Get data #1 TAF1 peaks

Get Data -> Upload File -> Paste/Fetch -> Enter URL -> Start

1. Click Upload File
2. Click Paste/Fetch to display the URL box
3. Paste in the URL containing your data: http://bioinf.gen.tcd.ie/workshops/Galaxy/TAF1_peaks.txt
4. Select ‘tabular’ file type
5. Type hg19 and specify Human Feb. 2009 (GRCh37/hg19) (hg19)
6. Click Start to upload the data to your history!
7. Click Close

Data uploaded to your history!

The file was sent to your history and given a number

The history keeps track of all steps in your analysis

Step 2: Rename your History

1. Click the history name to rename your history
2. Click the cog wheel if you want to create a new history or see a list of your saved histories

You can have multiple histories with different names.

Step 3: Review your dataset

1. Click on the dataset name to expand/collapse the metadata and mini view of the file content
2. Click the eye icon to see the file contents in the main analysis window
3. Click the pencil icon to edit the file attributes
4. Click the x to delete the file

Step 4a: Edit dataset

Change File name to a shorter name

1. Click the pencil icon to edit the file attributes
2. There are four tabs in edit mode. To change the file name, click Attributes
3. First rename the file
4. Copy and paste the old name into the info to keep a record of it
5. Click save

Step 4b: Edit dataset

Change File format so Galaxy knows where to find chr, start, end

1. Click Datatype to change the file format
2. Select interval from drop down and then click save
3. Define which columns of your TAF1 file are “chrom”, “start” and “end”. Look at the mini view image to see your TAF1 file
4. Click save
5. Format changed to interval. Galaxy now knows where chr, start and end are.
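To make the idea of an interval format concrete, here is a minimal Python sketch of what "telling Galaxy which columns are chrom, start and end" amounts to. It is not Galaxy's own code. The function name `read_intervals`, the default column numbers and the two example rows are assumptions made up for illustration; check the mini view of your own TAF1 file for the real column positions.

```python
import csv
import io
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int


def read_intervals(handle, chrom_col=1, start_col=2, end_col=3):
    """Yield Interval records from tab-delimited text.

    Column numbers are 1-based (hypothetical defaults, adjust to your file).
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        yield Interval(
            row[chrom_col - 1],
            int(row[start_col - 1]),
            int(row[end_col - 1]),
        )


# Invented rows standing in for a peaks file (not real TAF1 data)
example = "chr1\t100\t500\tpeak1\nchr1\t1000\t1200\tpeak2\n"
peaks = list(read_intervals(io.StringIO(example)))
print(peaks)
```

Once every row can be read as (chrom, start, end), tools that operate on genomic intervals have what they need, whatever other columns the file carries.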

Step 5: Get data #2 -> Genes

Get Data -> UCSC Main Table Browser

Ensure all drop downs are selected as shown on the slide

1. Select all fields from the drop downs, then click get output
2. Click Send query to Galaxy

Step 6: Edit dataset

Click the pencil icon to edit the file name

Change File name to a shorter name

File name changed

File format = bed. Galaxy already knows where chr, start and end are.

Step 7: Get Promoter Regions

Tool: Operate on Genomic Intervals -> Get Flanks

1. Select Genes dataset
2. Select upstream
3. Select 1000bp upstream
4. Click Execute
5. Output sent to history! Same file content as ‘Genes’ but start and end coordinates are replaced with promoter regions
6. Rename file to ‘Promoters’
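The coordinate arithmetic behind an upstream flank can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Get Flanks tool's implementation: it assumes BED-style coordinates (0-based start, end not included) and treats "upstream" as strand-aware, so for a gene on the minus strand the flank lies past the end coordinate. The function name `upstream_flank` is hypothetical.

```python
def upstream_flank(start, end, strand, size=1000):
    """Return (flank_start, flank_end) for the region upstream of a gene.

    Assumes BED-style half-open coordinates. On the "+" strand the flank
    sits before the gene start; on the "-" strand it sits after the gene end.
    """
    if strand == "-":
        return end, end + size
    # Clamp at 0 so a gene near the chromosome start cannot go negative
    return max(0, start - size), start


print(upstream_flank(5000, 8000, "+"))  # plus-strand gene
print(upstream_flank(5000, 8000, "-"))  # minus-strand gene
print(upstream_flank(300, 900, "+"))    # gene close to the chromosome start
```

Changing `size` to 3000 gives the 3kb promoter regions used when the workflow is rerun later in the session.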

Step 8: Clean dataset

Tool: Text Manipulation -> Cut

1. Cut out the specific columns we want from the ‘Promoters’ file

2. Click Execute

3. Rename the output file to ‘Clean Promoters’
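For comparison, cutting columns from a tab-delimited file looks roughly like this in Python. It is a sketch only: `cut_columns` is a hypothetical helper, the example row is invented, and the chosen columns (1, 2, 3, 4 and 6, which are chrom, start, end, name and strand in a standard BED file) are an assumption, since the slide does not list which columns were kept.

```python
def cut_columns(lines, columns):
    """Keep only the given 1-based columns from tab-delimited lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield "\t".join(fields[c - 1] for c in columns)


# Invented BED-like row: chrom, start, end, name, score, strand, extra field
rows = ["chr1\t4000\t5000\tgeneA\t0\t+\textra\n"]
cleaned = list(cut_columns(rows, [1, 2, 3, 4, 6]))
print(cleaned[0])
```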

Datasets ready for analysis!

Both files are associated with human hg19.

Galaxy knows for each file where chr, start and end are.

Now we are ready to join these files and see which promoters have TAF1 peaks!

Dataset #1 Dataset #2

How do we Join Genomic Intervals?

Example

Interval file #1

Chr   Start  End   Name    Strand
Chr1  100    500   int1    +
Chr1  1000   1200  int2    +

Interval file #2

Chr   Start  End   Name    Strand
Chr1  200    400   cloneA  +
Chr1  900    1000  cloneB  +

Intervals that overlap!

Chr1 100 500 int1 + Chr1 200 400 cloneA +

#1: 100-500, 1000-1200

#2: 200-400, 900-1000
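The example above can be reproduced with a short Python sketch of an inner join on overlapping intervals. It is an illustration rather than the Galaxy Join tool itself, and it assumes half-open coordinates (the end position is not part of the interval). Under that assumption int2 (1000-1200) and cloneB (900-1000) only touch and do not overlap, which matches the single overlapping pair shown on the slide. The names `overlaps` and `inner_join` are hypothetical.

```python
# The two interval files from the example
file1 = [("Chr1", 100, 500, "int1", "+"), ("Chr1", 1000, 1200, "int2", "+")]
file2 = [("Chr1", 200, 400, "cloneA", "+"), ("Chr1", 900, 1000, "cloneB", "+")]


def overlaps(a, b):
    """True if two intervals share a chromosome and at least one base.

    Assumes half-open coordinates: an interval covers start up to end - 1.
    """
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def inner_join(first, second):
    """Return one combined row for every overlapping pair of intervals."""
    return [a + b for a in first for b in second if overlaps(a, b)]


joined = inner_join(file1, file2)
for row in joined:
    print(*row)
```

Comparing every pair is fine for a handful of rows. For genome-scale files, dedicated interval tools sort or index the data first so that most pairs never need to be compared.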

Step 9: Join on Genomic Intervals

Tool: Operate on Genomic Intervals -> Join

The first dataset is the one we want to filter (i.e. the large dataset containing all of the promoter regions)

The second dataset is the one we use for the filter (i.e. we want to filter the promoter dataset for just those regions that contain the TAF1 peaks)

Inner join returns only the genomic regions that overlap in both files

Click Execute

Step 9: Join on Genomic Intervals

Output

We have reduced the promoters from >54,000 to 154! All of these promoter regions contain a TAF1 peak region.

Rename the output file to ‘Overlap’

Step 10: Build Custom Tracks for UCSC

Tool: Graph/Display Data -> Build custom track

Click ‘Insert Track’ to open the track information.

We will add three tracks to UCSC:

1. TAF1 peaks
2. Promoter regions
3. TAF1 peaks in promoter regions

Step 10: Build Custom Tracks for UCSC

Track 1: TAF1 peaks

• Select dataset
• Label the track
• Describe the track
• Select the colour of the track

Click ‘Insert Track’ to open another track

Step 10: Build Custom Tracks for UCSC

Tracks 2 and 3:

Click Execute when all three tracks are filled in

Step 10: Build Custom Tracks for UCSC

Output

This single output file contains the information to visualise three tracks on UCSC Genome Browser

Click here to visualise your three tracks on UCSC Genome Browser
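To show how a single file can describe several tracks, here is a minimal Python sketch that writes UCSC custom-track text by hand. In the UCSC custom track format each track begins with a `track` line (attributes such as `name`, `description` and `color`) followed by its data lines. The helper `build_custom_tracks`, the colours and all coordinates below are invented for illustration, and the exact file Galaxy produces may differ.

```python
def build_custom_tracks(tracks):
    """Return UCSC custom-track text for several BED-style tracks.

    tracks: list of (name, description, color, intervals), where color is an
    "R,G,B" string and intervals is a list of (chrom, start, end) tuples.
    """
    lines = []
    for name, description, color, intervals in tracks:
        # One header line per track, followed by that track's intervals
        lines.append(f'track name="{name}" description="{description}" color={color}')
        for chrom, start, end in intervals:
            lines.append(f"{chrom}\t{start}\t{end}")
    return "\n".join(lines) + "\n"


# Invented coordinates, for illustration only
text = build_custom_tracks([
    ("TAF1 peaks", "TAF1 enriched regions", "255,0,0", [("chr1", 4200, 4600)]),
    ("Promoters", "1kb upstream of genes", "0,0,255", [("chr1", 4000, 5000)]),
    ("Overlap", "Promoters with a TAF1 peak", "0,128,0", [("chr1", 4000, 5000)]),
])
print(text)
```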

Visualisation on UCSC Genome Browser

The three tracks

Zoom out to see a larger genomic context

Extract Workflow from History

Want to rerun your analysis but extract 3kb upstream?

Click the cog wheel and select ‘Extract Workflow’ from the drop down menu

Extract Workflow from History

Create a workflow name

Lists all the tools used to create your history

Click Create workflow

Extract Workflow from History

Click edit workflow

Or access your workflows from the top menu

Editing Workflows

Click on a box and you can edit the variables of that step in the Details section on the right (in orange)

Each box is a step of the analysis

Noodles connect the steps

Use blue window to move around the workflow

Editing Workflows

This input dataset is the transcription factor dataset. Label this dataset in the details box on the right

Editing Workflows

This input dataset is the Gene dataset. Label this dataset in the details box on the right

Editing Workflows

1. Click on Get Flanks tool to edit the upstream promoter region

2. Change the upstream promoter region to 3000

3. Click the cog wheel to save the workflow. Then click the cog wheel again to run the workflow

Running Workflows

1. Select Transcription factor file (e.g. TAF1_peaks)
2. Select Genes file (e.g. Genes)
3. Send output to a new history
4. Run workflow and go for a coffee!!
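A workflow is, in effect, the whole analysis wrapped up as one reusable unit with its settings exposed as parameters. The Python sketch below makes that idea concrete under the same assumptions as before (half-open coordinates, strand-aware upstream regions). The function `promoters_with_peaks` and the example genes and peaks are invented; this is not how Galaxy stores or runs workflows.

```python
def promoters_with_peaks(genes, peaks, upstream=1000):
    """Return names of genes whose upstream region overlaps any peak.

    genes: list of (chrom, start, end, name, strand)
    peaks: list of (chrom, start, end)
    upstream: promoter size in bp, the one setting changed between runs
    """
    hits = []
    for chrom, start, end, name, strand in genes:
        # Step 1: promoter coordinates (strand-aware, half-open)
        if strand == "-":
            p_start, p_end = end, end + upstream
        else:
            p_start, p_end = max(0, start - upstream), start
        # Step 2: keep the gene if any peak overlaps its promoter
        if any(c == chrom and s < p_end and p_start < e for c, s, e in peaks):
            hits.append(name)
    return hits


# Invented data, for illustration only
genes = [("chr1", 5000, 8000, "geneA", "+"), ("chr1", 20000, 25000, "geneB", "-")]
peaks = [("chr1", 2500, 2800), ("chr1", 25100, 25300)]

print(promoters_with_peaks(genes, peaks, upstream=1000))
print(promoters_with_peaks(genes, peaks, upstream=3000))
```

With these made-up inputs, widening the promoter region from 1kb to 3kb picks up an extra gene, which is the kind of difference a rerun of the workflow is meant to reveal.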

Your new History!


Summary

What you learned today

– Getting data into Galaxy

– How to review and edit datasets

– Running Common Galaxy Tools

– How to visualise your data in UCSC genome browser

– How to extract workflows from a history

Large Tool Repository


Data Visualisations

UCSC Genome Browser

Clustered Heatmaps

Visualisation of ChIP-seq data

Charts

Circster – structural variation

Galaxy Learning Resources

Thank You

Please fill in the online survey at bioinf.gen.tcd.ie/surveys/Galaxy