Green Center Computational Core
ChIP-Seq Pipeline, Just a Click Away

Venkat Malladi
Computational Biologist, Computational Core
Cecil H. and Ida Green Center for Reproductive Biology Science




Introduction to the Green Center

● Basic research in female reproductive biology, with a focus on signaling, gene regulation, and genome function.
  ◦ pregnancy
  ◦ parturition
  ◦ stem cells
  ◦ oncology
  ◦ inflammation

● Key areas:
  ◦ Chromatin structure and gene regulation
  ◦ Epigenetics
  ◦ Nuclear endpoints of cellular signaling pathways
  ◦ Genome organization and evolution
  ◦ DNA replication and repair


Who is in the Green Center?

W. Lee Kraus, Ph.D., Director of the Green Center.

● Associated with the Department of Obstetrics and Gynecology
● Consists of 9 main faculty/labs
● 20 associated faculty/labs
● Computational Core


Role of the Computational Core

● Consists of 4 Computational Biologists
● Analysis of Genomic Sequencing Data
● Responsibilities
  ◦ Data quality assurance
  ◦ Perform basic analyses
  ◦ Work with investigators to perform integrative analyses

Green Center Computation Team: Anusha Nagari, Tulip Nandu, Venkat Malladi, Aishwarya Gogate


Challenge: Variety of Assays Supported?

ATAC-seq, RNA-seq, GRO-seq

[Figure modified from PLoS Biol 9: e1001046, 2011 (M. Pazin)]


What is ATAC-seq?

Assay for Transposase-Accessible Chromatin using Sequencing (ATAC-Seq): a genomic method that captures open chromatin sites.

Buenrostro et al. (2013) Nature Methods


What is RNA-Seq?

RNA Sequencing (RNA-Seq): measures the abundance of mature RNA species in the cell. These experiments contribute to the understanding of how RNA-based mechanisms impact gene regulation.

● Types:
  ● Total RNA
  ● polyA mRNA (long and short)
  ● shRNA
  ● small RNA
  ● microRNA
  ● polyA depleted RNA


What is GRO-Seq?

Global Run On Sequencing (GRO-Seq): a genomic method that maps the position and orientation of all actively transcribing RNA polymerases.

● Transcription from all three RNA polymerases is captured, providing transcriptional profiles, in both annotated and unannotated regions of the genome, that include:
  ● protein coding mRNA
  ● long non-coding RNAs (lncRNAs)
  ● enhancer RNAs (eRNAs)
  ● divergent transcription
  ● antisense transcription
  ● intergenic transcription

[Figure: transcript classes: Annotated, Divergent, Intergenic, Antisense, ERα Enhancer, Other Genic. Hah et al. (2011) Cell]


What is ChIP-Seq?

Chromatin immunoprecipitation followed by Sequencing (ChIP-Seq): identifies the binding sites of chromatin-associated proteins.

● Categories:
  • Transcription factor ChIP-Seq: targets proteins that associate with specific DNA sequences to influence the rate of transcription
  • Histone ChIP-Seq: measures the histone content of chromatin, specifically the incorporation of particular post-translational histone modifications

Park (2009) Nature Reviews


Considerations for Making a Pipeline

1. Identify who the users are
2. Define what the pipeline should deliver
3. Identify all input and output files
4. Decide which QA/QC metrics should be available to users
5. Identify all software used in the pipeline
6. Break down the pipeline into discrete steps (based on deliverable files and metrics)


Users and Goals

● Users:
  ● Wet lab scientists (grad students/postdocs)
  ● Computational biologists in the Green Center

● Goals:
  ● Allow wet lab scientists to quickly assess the quality of their data and explore it
  ● Allow for easily reproducible analysis within the Green Center


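Schema: ChIP-seq Pipeline

[Figure: pipeline flow diagram. Recoverable labels, in reading order:]

● Input: FASTQ (SE/PE)
● Quality: fastqc
● Map: bowtie2 → BAM
● Remove Duplicates: picard → BAM
● Cross-correlation → tagAlign, fragment size
● Call Peaks: macs2 → bigWig, narrowPeak
● QA Metrics (shown at three points in the diagram)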


FASTQ: Quality Metrics

FastQC Report (Tue 19 Feb 2013): HF_K9_GATCAG_L005_R1_001.fastq.gz

Summary
● Basic Statistics
● Per base sequence quality
● Per sequence quality scores
● Per base sequence content
● Per base GC content
● Per sequence GC content
● Per base N content
● Sequence Length Distribution
● Sequence Duplication Levels
● Overrepresented sequences
● Kmer Content

Basic Statistics

Measure              Value
Filename             HF_K9_GATCAG_L005_R1_001.fastq.gz
File type            Conventional base calls
Encoding             Sanger / Illumina 1.9
Total Sequences      22571166
Filtered Sequences   0
Sequence length      50
%GC                  42

Per Base Sequence Quality

[Figure: per-base quality plot with regions marked as good, reasonable, and poor quality calls]
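The basic statistics and per-base quality in the report can be approximated from the raw reads. The sketch below is not FastQC itself: it assumes uncompressed four-line FASTQ records already read into a list, and Phred+33 quality encoding (which is what "Sanger / Illumina 1.9" denotes). The function name `fastq_basic_stats` is hypothetical.

```python
def fastq_basic_stats(lines):
    """Summarize FASTQ records given as a list of lines (4 lines per read)."""
    total = 0
    gc = 0
    bases = 0
    lengths = set()
    qual_sums = []    # per-position sum of Phred scores
    qual_counts = []  # per-position number of reads covering that position
    for i in range(0, len(lines) - 3, 4):
        seq = lines[i + 1].strip().upper()
        qual = lines[i + 3].strip()
        total += 1
        bases += len(seq)
        gc += seq.count("G") + seq.count("C")
        lengths.add(len(seq))
        for pos, ch in enumerate(qual):
            if pos == len(qual_sums):
                qual_sums.append(0)
                qual_counts.append(0)
            qual_sums[pos] += ord(ch) - 33  # Phred+33 offset
            qual_counts[pos] += 1
    return {
        "total_sequences": total,
        "sequence_lengths": sorted(lengths),
        "percent_gc": round(100 * gc / bases) if bases else 0,
        "mean_quality_per_base": [s / c for s, c in zip(qual_sums, qual_counts)],
    }


# Two made-up reads, only to show the call shape.
example = ["@r1", "ACGT", "+", "IIII", "@r2", "GGCC", "+", "!!II"]
print(fastq_basic_stats(example))
```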


Alignment: Quality Metrics

FASTQ File: DNA sequence

Aligned File: DNA sequence + genomic localization

Alignment % = (No. of aligned reads / Total no. of raw reads) * 100
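The alignment percentage is a direct ratio of two read counts. A minimal sketch follows; the function name and the example counts are made up for illustration and do not come from the slides.

```python
def alignment_percent(aligned_reads, raw_reads):
    """Alignment % = aligned reads / total raw reads * 100."""
    if raw_reads <= 0:
        raise ValueError("raw_reads must be positive")
    return 100.0 * aligned_reads / raw_reads


# Hypothetical counts: 963 of 1,000 reads aligned.
print(alignment_percent(963, 1000))
```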


Uniquely Mapped Reads: Quality Metrics

● Depth
  ● Number of uniquely mapping reads

● Library Complexity
  ● Non-Redundant Fraction (NRF): number of distinct uniquely mapping reads (i.e. after removing duplicates) / total number of reads
  ● PCR Bottlenecking Coefficient 1 (PBC1)
    ◦ PBC1 = M1 / M_DISTINCT, where
      M1: number of genomic locations where exactly one read maps uniquely
      M_DISTINCT: number of distinct genomic locations to which some read maps uniquely
  ● PCR Bottlenecking Coefficient 2 (PBC2)
    ◦ PBC2 = M1 / M2, where
      M1: number of genomic locations where only one read maps uniquely
      M2: number of genomic locations where two reads map uniquely

ENCODE Standards: https://www.encodeproject.org/data-standards/chip-seq/
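The library complexity definitions can be checked on a small example. The sketch below assumes each uniquely mapped read is reduced to a (chromosome, start, strand) tuple and, unless `total_reads` is given, takes the number of those reads as the NRF denominator; both choices and the function name are assumptions for illustration.

```python
from collections import Counter


def library_complexity(positions, total_reads=None):
    """Compute NRF, PBC1, and PBC2 from one position tuple per read."""
    counts = Counter(positions)
    n_reads = len(positions) if total_reads is None else total_reads
    m_distinct = len(counts)                             # distinct locations
    m1 = sum(1 for c in counts.values() if c == 1)       # exactly one read
    m2 = sum(1 for c in counts.values() if c == 2)       # exactly two reads
    return {
        "NRF": m_distinct / n_reads if n_reads else None,
        "PBC1": m1 / m_distinct if m_distinct else None,
        "PBC2": m1 / m2 if m2 else None,  # undefined when M2 is zero
    }


# Made-up reads: 7 reads over 4 distinct locations.
reads = (
    [("chr1", 100, "+")] * 2
    + [("chr1", 200, "+"), ("chr1", 300, "-")]
    + [("chr2", 50, "+")] * 3
)
print(library_complexity(reads))
```

Here two locations carry exactly one read and one location carries exactly two, so PBC1 is 2/4 and PBC2 is 2/1.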


Uniquely Mapped Reads: Quality Metrics (cont.)

[Figures: NRF Guidelines, PBC1 Guidelines, PBC2 Guidelines]

ENCODE Standards: https://www.encodeproject.org/data-standards/chip-seq/


Alignment: Quality Metrics Report

Sample Information     Raw reads     Alignment %
Control Replicate 1    28,259,069    96.30%
Control Replicate 2    28,892,302    96.00%
Sample 2 Replicate 1   23,239,486    96.10%
Sample 2 Replicate 2   25,637,094    96.90%
Sample 3 Replicate 1   22,713,054    96.60%
Sample 3 Replicate 2   20,419,272    95.90%
Sample 4 Replicate 1   22,617,154    96.60%
Sample 4 Replicate 2   20,068,460    96.00%


Cross-correlation: Quality Metrics Report

[Figure: replicate comparison plots. Sample 1: R=0.99; Sample 2: R=0.99]

R: Pearson correlation coefficient
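The Pearson correlation coefficient R can be computed from two equal-length signal vectors. The slide does not state which quantities were compared, so the sketch below assumes per-bin read counts from two replicates; the function name and the example values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up per-bin counts for two replicates.
rep1 = [12, 40, 7, 95, 33]
rep2 = [10, 44, 9, 90, 30]
print(round(pearson_r(rep1, rep2), 3))
```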


Call Peaks: Quality Metrics Report

1. Peak calls for individual replicates
2. Overlapping peaks between the pooled pseudo replicates
3. bigWig files (UCSC Genome Browser, IGV…)
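Overlapping peaks between two peak sets can be found with a simple interval test. The sketch below is a simplification: the pipeline's actual overlap criterion is not stated on the slide, so any shared base is counted as an overlap, with peaks given as (chrom, start, end) in the half-open coordinates used by BED-style files such as narrowPeak. The function name is hypothetical.

```python
def overlapping_peaks(peaks_a, peaks_b):
    """Return peaks from peaks_a that overlap at least one peak in peaks_b."""
    by_chrom = {}
    for chrom, start, end in peaks_b:
        by_chrom.setdefault(chrom, []).append((start, end))
    kept = []
    for chrom, start, end in peaks_a:
        # Half-open intervals [start, end) overlap when each starts
        # before the other ends.
        if any(start < b_end and b_start < end
               for b_start, b_end in by_chrom.get(chrom, [])):
            kept.append((chrom, start, end))
    return kept


# Made-up peaks for two pseudo replicates.
pr1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 10, 20)]
pr2 = [("chr1", 150, 250), ("chr2", 20, 30)]
print(overlapping_peaks(pr1, pr2))
```

Adjacent peaks that only touch at a boundary, such as chr2 10-20 and 20-30, share no base and are not counted.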


Call Peaks: Quality Metrics Report

Visualizing signal tracks (bigWig files) in the UCSC Genome Browser:

[Figure: genome browser signal tracks. Franco et al. (2015)]


Working With BioHPC and Astrocyte

Creating a Project

Create New Project to run analysis

Adding Data

Select “Add Data to this Project” ...

ChIP-Seq Workflow

[Screenshot: workflow form. Annotated fields:]

● ChIP-Input fastq files
● ChIP TF or Histone fastq files
● Sequence format
● Assembly

Run Time of ChIP-Seq Pipeline

Thank you!

Questions?