Upload
others
View
12
Download
0
Embed Size (px)
Citation preview
useR2012 High-throughput sequence analysis
with R and Bioconductor
Valerie Obenchain Martin Morganlowast
June 2012
Abstract
DNA sequence analysis generates large volumes of data presentingchallenging bioinformatic and statistical problems This tutorial intro-duces Bioconductor packages and work flows for the analysis of sequencedata We learn about approaches for efficiently manipulating sequencesand alignments and introduce common work flows and the unique sta-tistical challenges associated with lsquoRNA-seqrsquo lsquoChIP-seqlsquo and variant an-notation experiments The emphasis is on exploratory analysis and theanalysis of designed experiments The workshop assumes an intermedi-ate level of familiarity with R and basic understanding of biological andtechnological aspects of high-throughput sequence analysis The workshopemphasizes orientation within the Bioconductor milieu we will touch onthe Biostrings ShortRead GenomicRanges edgeR and DiffBind andVariantAnnotation packages with short exercises to illustrate the func-tionality of each package Participants should come prepared with a mod-ern laptop with current R installed
Contents
1 Introduction 311 Rstudio 312 R 313 Bioconductor 514 Resources 7
2 Sequences and Ranges 821 Biostrings 822 GenomicRanges 823 Resources 14
lowastmtmorganfhcrcorg
1
3 Reads and Alignments 1531 The pasilla Data Set 1532 Reads and the ShortRead Package 1533 Alignments and the Rsamtools Package 19
4 RNA-seq 2441 Differential Expression with the edgeR Package 2442 Additional Steps in RNA-seq Work Flows 2943 Resources 30
5 ChIP-seq 3151 Initial Work Flow 3152 Motifs 34
6 Annotation 3761 Gene-Centric Annotations with AnnotationDbi 3762 Genome-Centric Annotations with GenomicFeatures 4063 Exploring Annotated Variants VariantAnnotation 43
2
Table 1 Schedule
Introduction Rstudio and the AMI R packages and help Bioconductor
Sequences and ranges Representing and manipulating biological sequencesworking with genomic coordinates
Reads and alignments High-throughput sequence reads (fastq files) andtheir aligned representation (bam files)
RNA-seq A post-alignment workflow for differential representation
ChIP-seq Working with multiple ChIP-seq experiments
Annotation Resources for annotation of gene and genomes working the vari-ants (VCF files)
1 Introduction
This workshop introduces use of R and Bioconductor for analysis of high-throughput sequence data The workshop is structured as a series of shortremarks followed by group exercises The exercises explore the diversity of tasksfor which R Bioconductor are appropriate but are far from comprehensive
The goals of the workshop are to (1) develop familiarity with R Biocon-ductor software for high-throughput analysis (2) expose key statistical issues inthe analysis of sequence data and (3) provide inspiration and a framework forfurther independent exploration An approximate schedule is shown in Table 1
11 RstudioExercise 1Log on to the Rstudio account set up for this course
Visit the lsquoPackagesrsquo tab in the lower right panel find the useR2012 packageand discover the vignette (ie this document)
Under the lsquoFilesrsquo tab figure out how to upload and download (small) filesto the server
12 R
The following should be familiar to you R has a number of standard datatypes that represent integer numeric (floating point) complex characterlogical (boolean) and raw (byte) data It is possible to convert between datatypes and to discover the class (eg class) of a variable All of the vectorsmentioned so far are homogenous consisting of a single type of element A list
can contain a collection of different types of elements and like all vectors theseelements can be named to create a key-value association A dataframe is a list
3
of equal-length vectors representing a rectangular data structure not unlike aspread sheet Each column of the data frame is a vector so data types must behomogenous within a column A dataframe can be subset by row or columnand columns can be accessed with $ or [[ A matrix is also a rectangular datastructure but subject to the constraint that all elements are the same type
R has object-oriented ways of representing complicated data objects Bio-conductor makes extensive use of lsquoS4rsquo objects Objects are often created byfunctions (eg GRanges below) with parts of the object extracted or assignedusing accessor functions Many operations on classes are implemented as meth-ods that specialize a generic function for the particular class of objects used toinvoke the function For instance countOverlaps is a generic that counts thenumber of times elements of its query argument overlaps elements of its sub-
ject there are methods with slightly different behaviors when the argumentsare IRange instances or GRanges instances (in the latter case the countOverlaps
method pays attention to whether ranges are on the same strand and chromo-some for instance)
Packages provide functionality beyond that available in base R There aremore than 500 Bioconductor packages Packages are contributed by diversemembers of the community they vary in quality (many are excellent) and some-times contain idiosyncratic aspects to their implementation New packages (egShortRead VariantAnnotation packages and their dependencies) can be addedto an R installation using
gt source(httpbioconductororgbiocLiteR)
gt biocLite(c(ShortRead VariantAnnotation)) new packages
gt biocLite(character()) update packages
installpackages A package is installed only once per R installation butneeds to be loaded (with library) in each session in which it is used
Find help using the R help system Start a web browser with
gt helpstart()
The lsquoSearch Engine and Keywordsrsquo link is helpful in day-to-day use Use man-ual pages to find detailed descriptions of the arguments and return values offunctions and the structure and methods of classes Find help within an Rsession using the package defining the help page must have been loaded (withlibrary)
gt library(ShortRead)
gt readFastq
S4 classes and generics can be discovered with syntax like the following for thecomplement generic in the Biostrings package
gt library(Biostrings)
gt showMethods(complement)
4
Function complement (package Biostrings)
x=DNAString
x=DNAStringSet
x=MaskedDNAString
x=MaskedRNAString
x=RNAString
x=RNAStringSet
x=XStringViews
Methods defined on the DNAStringSet class of Biostrings can be found with
gt showMethods(class=DNAStringSet where=getNamespace(Biostrings))
Obtaining help on S4 classes and methods requires syntax such as
gt class DNAStringSet
gt method complementDNAStringSet
Vignettes especially in Bioconductor packages provide an extensive narra-tive describing overall package functionality Use
gt vignette(package=useR2012)
to see a list of vignettes available in the useR2012 package Vignettes usuallyconsist of text with embedded R code a form of literate programming Thevignette can be read as a PDF document while the R source code is present asa script file ending with extension R The script file can be sourced or copiedinto an R session to evaluate exactly the commands used in the vignette
13 Bioconductor
Bioconductor is a collection of R packages for the analysis and comprehensionof high-throughput genomic data Bioconductor started more than 10 yearsago It gained credibility for its statistically rigorous approach to microarraypre-preprocessing and analysis of designed experiments and integrative and re-producible approaches to bioinformatic tasks There are now more than 500Bioconductor packages for expression and other microarrays sequence analy-sis flow cytometry imaging and other domains The Bioconductor web siteprovides installation package repository help and other documentation
The Bioconductor web site is at bioconductororg Features include
bull Introductory work flowsbull A manifest of Bioconductor packages arranged in BiocViewsbull Annotation (data bases of relevant genomic information eg Entrez gene
ids in model organisms KEGG pathways) and experiment data (contain-ing relatively comprehensive data sets and their analysis) packages
bull Mailing lists including searchable archives as the primary source of helpbull Course and conference information including extensive reference materialbull General information about the projectbull Package developer resources including guidelines for creating and submit-
ting new packages
5
Table 2 Selected Bioconductor packages for high-throughput sequence analysis
Concept PackagesData representation IRanges GenomicRanges GenomicFeatures
Biostrings BSgenome girafeInput output ShortRead (fastq) Rsamtools (bam) rtrack-
layer (gff wig bed) VariantAnnotation (vcf)R453Plus1Toolbox (454)
Annotation GenomicFeatures ChIPpeakAnno VariantAnnota-tion
Alignment Rsubread BiostringsVisualization ggbio GvizQuality assessment qrqc seqbias ReQON htSeqTools TEQC Rolexa
ShortReadRNA-seq BitSeq cqn cummeRbund DESeq DEXSeq
EDASeq edgeR gage goseq iASeq tweeDEseqChIP-seq etc BayesPeak baySeq ChIPpeakAnno chipseq
ChIPseqR ChIPsim CSAR DiffBind MEDIPSmosaics NarrowPeaks nucleR PICS PING RED-seq Repitools TSSi
Motifs BCRANK cosmo cosmoGUI MotIV seqLogorGADEM
3C etc HiTC r3CseqCopy number cnmops CNAnorm exomeCopy seqmentSeqMicrobiome phyloseq DirichletMultinomial clstutils manta
mcaGUI Work flows ArrayExpressHTS Genominator easyRNASeq
oneChannelGUI rnaSeqMapDatabase SRAdb
High-throughput sequence analysis Table 2 enumerates many of the pack-ages available for sequence analysis The table includes packages for repre-senting sequence-related data (eg GenomicRanges Biostrings) as well asdomain-specific analysis such as RNA-seq (eg edgeR DEXSeq) ChIP-seq(eg ChIPpeakAnno DiffBind) and SNPs and copy number variation (eggenoset ggtools VariantAnnotation)
Exercise 2Scavenger hunt Spend five minutes tracking down the following information
a The package containing the readFastq function
b The author of the alphabetFrequency function defined in the Biostringspackage
c A description of the GappedAlignments class
6
d The number of vignettes in the GenomicRanges package
e From the Bioconductor web site instructions for installing or updatingBioconductor packages
f A list of all packages in the current release of Bioconductor
g The URL of the Bioconductor mailing list subscription page
Solution Possible solutions are found with the following R commands
gt readFastq
gt library(Biostrings)
gt alphabetFrequency
gt classGappedAlignments
gt vignette(package=GenomicRanges)
and by visiting the Bioconductor web site eg installation instructions1 currentsoftware packages2 and mailing lists3
14 Resources
Dalgaard [4] provides an introduction to statistical analysis with R Matloff [11]introduces R programming concepts Chambers [3] provides more advancedinsights into R Gentleman [5] emphasizes use of R for bioinformatic program-ming tasks The R web site enumerates additional publications from the usercommunity
1httpbioconductororginstall2httpbioconductororgpackagesreleasebioc3httpbioconductororghelpmailing-list
7
Table 3 Selected Bioconductor packages for representing and manipulatingranges strings and other data structures
Package DescriptionIRanges Defines important classes (eg IRanges Rle) and meth-
ods (eg findOverlaps countOverlaps) for representingand manipulating ranges of consecutive values Also in-troduces DataFrame SimpleList and other classes tai-lored to representing very large data
GenomicRanges Range-based classes tailored to sequence representation(eg GRanges GRangesList) with information aboutstrand and sequence name
GenomicFeatures Foundation for manipulating data bases of genomicranges eg representing coordinates and organizationof exons and transcripts of known genes
Biostrings Classes (eg DNAStringSet) and methods (eg alpha-betFrequency pairwiseAlignment) for representing andmanipulating DNA and other biological sequences
BSgenome Representation and manipulation of large (eg whole-genome) sequences
2 Sequences and Ranges
Many Bioconductor packages are available for analysis of high-throughput se-quence data This section introduces two essential ways in which sequence dataare manipulated Sets of DNA strings represent the reads themselves and thenucleotide sequence of reference genomes Ranges describe both aligned readsand features of interest on the genome Key packages are summarized in Table3
21 Biostrings
The Biostrings package provides tools for working with DNA (and other biolog-ical) sequence data The essential data structures are DNAString and DNAS-tringSet for working with one or multiple DNA sequences The Biostrings pack-age contains additional classes for representing amino acid and general biologicalstrings The BSgenome and related packages (eg BSgenomeDmelanogasterUCSCdm3)are used to represent whole-genome sequences The following exercise exploresthese packages
Fixme Exercise trimLRPattern
22 GenomicRanges
Next-generation sequencing data consists of a large number of short reads Theseare typically aligned to a reference genome Basic operations are performed
8
on the alignment asking eg how many reads are aligned in a genomic rangedefined by nucleotide coordinates (eg in the exons of a gene) or how manynucleotides from all the aligned reads cover a set of genomic coordinates How isthis type of data the aligned reads and the reference genome to be representedin R in a way that allows for effective computation
The IRanges GenomicRanges and GenomicFeatures Bioconductor pack-ages provide the essential infrastructure for these operations we start with theGRanges class defined in GenomicRanges
GRanges Instances of GRanges are used to specify genomic coordinates Sup-pose we wish to represent two D melanogaster genes The first is located on thepositive strand of chromosome 3R from position 19967117 to 19973212 Thesecond is on the minus strand of the X chromosome with lsquoleft-mostrsquo base at18962306 and right-most base at 18962925 The coordinates are 1-based (iethe first nucleotide on a chromosome is numbered 1 rather than 0) left-most(ie reads on the minus strand are defined to lsquostartrsquo at the left-most coordi-nate rather than the 5rsquo coordinate) and closed (the start and end coordinatesare included in the range a range with identical start and end coordinates haswidth 1 a 0-width range is represented by the special construct where the endcoordinate is one less than the start coordinate)
A complete definition of these genes as GRanges is
gt genes lt- GRanges(seqnames=c(3R X)
+ ranges=IRanges(
+ start=c(19967117 18962306)
+ end=c(19973212 18962925))
+ strand=c(+ -)
+ seqlengths=c(`3R`=27905053L `X`=22422827L))
The components of a GRanges object are defined as vectors eg of seqnamesmuch as one would define a dataframe The start and end coordinates aregrouped into an IRanges instance The optional seqlengths argument specifiesthe maximum size of each sequence in this case the lengths of chromosomes 3Rand X in the lsquodm2rsquo build of D melanogaster genome This data is displayed as
gt genes
GRanges with 2 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] 3R [19967117 19973212] +
[2] X [18962306 18962925] -
---
seqlengths
3R X
27905053 22422827
9
Figure 1 Ranges
For the curious the gene coordinates and sequence lengths are derived fromthe orgDmegdb package for genes with Flybase identifiers FBgn0039155 andFBgn0085359 using the annotation facilities described in section 6
The GRanges class has many useful methods defined on it Consult the helppage
gt GRanges
and package vignettes (especially lsquoAn Introduction to GenomicRangesrsquo)
gt vignette(package=GenomicRanges)
for a comprehensive introduction
Operations on ranges The GRanges class has many useful methods fromthe IRanges class some of these methods are illustrated here We use IRangesto illustrate these operations to avoid complexities associated with strand andseqname but the operations are comparable on GRanges We begin with asimple set of ranges
gt ir lt- IRanges(start=c(7 9 12 14 2224)
+ end=c(15 11 12 18 26 27 28))
These and some common operations are illustrated in the upper panel of Figure1 and summarized in Table 4
elementMetadata (values) The GRanges class (actually most of the data struc-tures defined or extending those in the IRanges package) has two additional veryuseful data components The elementMetadata function (or its synonym values)allows information on each range to be stored and manipulated (eg subset)along with the GRanges instance The element metadata is represented as aDataFrame defined in IRanges and acting like a standard R dataframe butwith the ability to hold more complicated data structures as columns (and withelement metadata of its own providing an enhanced alternative to the Biobaseclass AnnotatedDataFrame)
10
Table 4 Common operations on IRanges GRanges and GRangesList
Category Function DescriptionAccessors start end width Get or s et the starts ends and widths
names Get or set the nameselementMetadata metadata Get or set metadata on elements or objectlength Number of ranges in the vectorrange Range formed from min start and max end
Ordering lt lt= gt gt= == = Compare ranges ordering by start then widthsort order rank Sort by the orderingduplicated Find ranges with multiple instancesunique Find unique instances removing duplicates
Arithmetic r + x r - x r x Shrink or expand ranges r by number x
shift Move the ranges by specified amountresize Change width ancoring on start end or middistance Separation between ranges (closest endpoints)restrict Clamp ranges to within some start and endflank Generate adjacent regions on start or end
Set operations reduce Merge overlapping and adjacent rangesintersect union setdiff Set operations on reduced rangespintersect punion psetdiff Parallel set operations on each x[i] y[i]gaps pgap Find regions not covered by reduced rangesdisjoin Ranges formed from union of endpoints
Overlaps findOverlaps Find all overlaps for each x in y
countOverlaps Count overlaps of each x range in y
nearest Find nearest neighbors (closest endpoints)precede follow Find nearest y that x precedes or followsx in y Find ranges in x that overlap range in y
Coverage coverage Count ranges covering each positionExtraction r[i] Get or set by logical or numeric index
r[[i]] Get integer sequence from start[i] to end[i]
subsetByOverlaps Subset x for those that overlap in y
head tail rev rep Conventional R semanticsSplit combine split Split ranges by a factor into a RangesList
c Concatenate two or more range objects
11
gt elementMetadata(genes) lt-
+ DataFrame(EntrezId=c(42865 2768869)
+ Symbol=c(kal-1 CG34330))
metadata allows addition of information to the entire object The information isin the form of a list any data can be provided
gt metadata(genes) lt-
+ list(CreatedBy=A User Date=date())
GRangesList The GRanges class is extremely useful for representing simpleranges Some next-generation sequence data and genomic features are morehierarchically structured A gene may be represented by seven exons within itAn aligned read may be represented by discontinuous ranges of alignment to areference The GRangesList class represents this type of information It is alist-like data structure which each element of the list itself a GRanges instanceThe gene FBgn0039155 contains several exons and can be represented as a listof length 1 where the element of the list contains a GRanges object with 7elements
GRangesList of length 1
$FBgn0039155
GRanges with 7 ranges and 2 elementMetadata cols
seqnames ranges strand | exon_id exon_name
ltRlegt ltIRangesgt ltRlegt | ltintegergt ltcharactergt
[1] chr3R [19967117 19967382] + | 64137 ltNAgt
[2] chr3R [19970915 19971592] + | 64138 ltNAgt
[3] chr3R [19971652 19971770] + | 64139 ltNAgt
[4] chr3R [19971831 19972024] + | 64140 ltNAgt
[5] chr3R [19972088 19972461] + | 64141 ltNAgt
[6] chr3R [19972523 19972589] + | 64142 ltNAgt
[7] chr3R [19972918 19973212] + | 64143 ltNAgt
---
seqlengths
chr3R
27905053
The GenomicFeatures package Many public resources provide annotationsabout genomic features For instance the UCSC genome browser maintains thelsquoknownGenersquo track of established exons transcripts and coding sequences ofmany model organisms The GenomicFeatures package provides a way to re-trieve save and query these resources The underlying representation is assqlite data bases but the data are available in R as GRangesList objects Thefollowing exercise explores the GenomicFeatures package and some of the func-tionality for the IRanges family introduced above
12
Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines
Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths
and table to summarize the number of exons in each gene for instance howmany single-exon genes are there
Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo
Solution
gt library(TxDbDmelanogasterUCSCdm3ensGene)
gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias
gt ex0 lt- exonsBy(txdb gene)
gt head(table(elementLengths(ex0)))
1 2 3 4 5 6
3182 2608 2070 1628 1133 886
gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)
gt ex lt- reduce(ex0[ids])
Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise
Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3
Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome
Use Views to create views on to the chromosome that span the start and endcoordinates of all exons
The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons
Look at the helper function and describe what it does
Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183
gt library(BSgenomeDmelanogasterUCSCdm3)
gt nm lt- ascharacter(unique(seqnames(ex[[1]])))
gt chr lt- Dmelanogaster[[nm]]
gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))
13
Using the gcFunction helper function the subject GC content is
gt gcFunction(v)
[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333
[8] 05189394 05110132
The gcFunction is really straight-forward it invokes the function alphabetFre-
quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon
gt gcFunction
function (x)
alf lt- alphabetFrequency(x asprob = TRUE)
rowSums(alf[ c(G C)])
ltenvironment namespaceuseR2012gt
23 Resources
There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group
14
Table 5 Selected Bioconductor packages for sequence reads and alignments
Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-
nipulating fastq files these classes rely heavily onBiostrings
GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads
Rsamtools Provides access to BAM alignment and other largesequence-related files
rtracklayer Input and output of bed wig and similar files
3 Reads and Alignments
The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis
31 The pasilla Data Set
As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences
In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package
32 Reads and the ShortRead Package
Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components
SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
15
+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet
$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~
are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported
FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set
gt fastqDir lt- filepath(bigdata() fastq)
gt fastqFiles lt- dir(fastqDir full=TRUE)
gt fq lt- readFastq(fastqFiles[1])
gt fq
class ShortReadQ
length 1000000 reads width 37 cycles
The data are represented as an object of class ShortReadQ
gt head(sread(fq) 3)
A DNAStringSet instance of length 3
width seq
[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT
[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA
gt head(quality(fq) 3)
class FastqQuality
quality
A BStringSet instance of length 3
width seq
[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+
16
There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance
gt abc lt- alphabetByCycle(sread(fq))
gt abc[14 18]
cycle
alphabet [1] [2] [3] [4] [5] [6] [7] [8]
A 78194 153156 200468 230120 283083 322913 162766 220205
C 439302 265338 362839 251434 203787 220855 253245 287010
G 397671 270342 258739 356003 301640 247090 227811 246684
T 84833 311164 177954 162443 211490 209142 356178 246101
FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample
gt sampler lt- FastqSampler(fastqFiles[1] 1000000)
gt yield(sampler) sample of 1000000 reads
class ShortReadQ
length 1000000 reads width 37 cycles
A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently
ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser
gt qas0 lt- Map(function(fl nm)
+ fq lt- FastqSampler(fl)
+ qa(yield(fq) nm)
+ fastqFiles
+ sub(_subsetfastq basename(fastqFiles)))
gt qas lt- docall(rbind qas0)
gt rpt lt- report(qas dest=tempfile())
gt browseURL(rpt)
A report from a larger subset of the experiment is available
gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml
+ package=useR2012)
gt browseURL(rpt)
17
Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package
Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a
histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each
cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package
and mclapply to read and summarize the GC content of reads in two files inparallel
Solution Discovery
gt dir(bigdata())
[1] bam fastq
gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)
Input
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 4071585 2175 6193578 3308 6193578 3308
Vcells 125856983 9603 279142478 21297 279142148 21297
gt fq lt- readFastq(fls[1])
GC content
gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)
gt sum(alf0[c(G C)])
[1] 05457237
A histogram of the GC content of individual reads is obtained with
gt gc lt- gcFunction(sread(fq))
gt hist(gc)
Alphabet by cycle
gt abc lt- alphabetByCycle(sread(fq))
gt matplot(t(abc[c(A C G T)]) type=l)
Advanced (Mac Linux only) processing on multiple cores
18
gt library(parallel)
gt gc0 lt- mclapply(fls function(fl)
+ fq lt- readFastq(fl)
+ gc lt- gcFunction(sread(fq))
+ table(cut(gc seq(0 1 05)))
+ )
gt simplify list of length 2 to 2-D array
gt gc lt- simplify2array(gc0)
gt matplot(gc type=s)
33 Alignments and the Rsamtools Package
Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data
Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields
gt fl lt- systemfile(extdata ex1sam package=Rsamtools)
gt strsplit(readLines(fl 1) t)[[1]]
[1] B7_591496693509
[2] 73
[3] seq1
[4] 1
[5] 99
[6] 36M
[7]
[8] 0
[9] 0
[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7
[12] MFi18
[13] Aqi73
[14] NMi0
[15] UQi0
[16] H0i1
[17] H1i0
19
The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R
Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)
gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)
gt aln lt- readGappedAlignments(alnFile)
gt head(aln 3)
GappedAlignments with 3 alignments and 0 elementMetadata cols
seqnames strand cigar qwidth start end width
ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt
[1] seq1 + 36M 36 1 36 36
[2] seq1 + 35M 35 3 37 35
[3] seq1 + 35M 35 5 39 35
ngap
ltintegergt
[1] 0
[2] 0
[3] 0
---
seqlengths
seq1 seq2
1575 1584
The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments
A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings
gt table(strand(aln))
+ -
1647 1624
20
gt table(width(aln))
30 31 32 33 34 35 36 38 40
2 21 1 8 37 2804 285 1 112
gt head(sort(table(cigar(aln)) decreasing=TRUE))
35M 36M 40M 34M 33M 14M4I17M
2804 283 112 37 6 4
Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes
Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from
The object ex created earlier contains coordinates of four genes Use coun-
tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to
Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation
Solution We discover the location of files using standard R commands
gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)
gt names(fls) lt- sub(_ basename(fls))
Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data
gt input
gt aln lt- readGappedAlignments(fls[1])
gt xtabs(~seqnames + strand asdataframe(aln))
strand
seqnames + -
chr3L 5402 5974
chrX 2278 2283
To count overlaps in regions defined in a previous exercise load the regions
gt data(ex) from an earlier exercise
Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known
21
gt strand(aln) lt- protocol not strand-aware
For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to
gt hits lt- countOverlaps(aln ex)
gt table(hits)
hits
0 1 2
772 15026 139
and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment
gt cnt lt- countOverlaps(ex aln[hits==1])
A simple function for counting reads is
gt counter lt-
+ function(filePath range)
+
+ aln lt- readGappedAlignments(filePath)
+ strand(aln) lt-
+ hits lt- countOverlaps(aln range)
+ countOverlaps(range aln[hits==1])
+
This can be applied to all files using sapply
gt counts lt- sapply(fls counter ex)
The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is
gt if (require(parallel))
+ simplify2array(mclapply(fls counter ex))
The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies
Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function
Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)
22
Figure 2 GC content in aligned reads
Solution
gt param lt- ScanBamParam(what=seq)
gt seqs lt- scanBam(fls[[1]] param=param)
gt readGC lt- gcFunction(seqs[[1]][[seq]])
gt hist(readGC)
23
Table 6 Selected Bioconductor packages for RNA-seq analysis
Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-
rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data
also limmarsquos roast or camera after transformation withvoom or cqn
easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI
Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments
4 RNA-seq
Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments
The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial
Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use
41 Differential Expression with the edgeR Package
RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of
24
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
3 Reads and Alignments 1531 The pasilla Data Set 1532 Reads and the ShortRead Package 1533 Alignments and the Rsamtools Package 19
4 RNA-seq 2441 Differential Expression with the edgeR Package 2442 Additional Steps in RNA-seq Work Flows 2943 Resources 30
5 ChIP-seq 3151 Initial Work Flow 3152 Motifs 34
6 Annotation 3761 Gene-Centric Annotations with AnnotationDbi 3762 Genome-Centric Annotations with GenomicFeatures 4063 Exploring Annotated Variants VariantAnnotation 43
2
Table 1 Schedule
Introduction Rstudio and the AMI R packages and help Bioconductor
Sequences and ranges Representing and manipulating biological sequencesworking with genomic coordinates
Reads and alignments High-throughput sequence reads (fastq files) andtheir aligned representation (bam files)
RNA-seq A post-alignment workflow for differential representation
ChIP-seq Working with multiple ChIP-seq experiments
Annotation Resources for annotation of gene and genomes working the vari-ants (VCF files)
1 Introduction
This workshop introduces use of R and Bioconductor for analysis of high-throughput sequence data The workshop is structured as a series of shortremarks followed by group exercises The exercises explore the diversity of tasksfor which R Bioconductor are appropriate but are far from comprehensive
The goals of the workshop are to (1) develop familiarity with R Biocon-ductor software for high-throughput analysis (2) expose key statistical issues inthe analysis of sequence data and (3) provide inspiration and a framework forfurther independent exploration An approximate schedule is shown in Table 1
11 RstudioExercise 1Log on to the Rstudio account set up for this course
Visit the lsquoPackagesrsquo tab in the lower right panel find the useR2012 packageand discover the vignette (ie this document)
Under the lsquoFilesrsquo tab figure out how to upload and download (small) filesto the server
12 R
The following should be familiar to you R has a number of standard datatypes that represent integer numeric (floating point) complex characterlogical (boolean) and raw (byte) data It is possible to convert between datatypes and to discover the class (eg class) of a variable All of the vectorsmentioned so far are homogenous consisting of a single type of element A list
can contain a collection of different types of elements and like all vectors theseelements can be named to create a key-value association A dataframe is a list
3
of equal-length vectors representing a rectangular data structure not unlike aspread sheet Each column of the data frame is a vector so data types must behomogenous within a column A dataframe can be subset by row or columnand columns can be accessed with $ or [[ A matrix is also a rectangular datastructure but subject to the constraint that all elements are the same type
R has object-oriented ways of representing complicated data objects Bio-conductor makes extensive use of lsquoS4rsquo objects Objects are often created byfunctions (eg GRanges below) with parts of the object extracted or assignedusing accessor functions Many operations on classes are implemented as meth-ods that specialize a generic function for the particular class of objects used toinvoke the function For instance countOverlaps is a generic that counts thenumber of times elements of its query argument overlaps elements of its sub-
ject there are methods with slightly different behaviors when the argumentsare IRange instances or GRanges instances (in the latter case the countOverlaps
method pays attention to whether ranges are on the same strand and chromo-some for instance)
Packages provide functionality beyond that available in base R There aremore than 500 Bioconductor packages Packages are contributed by diversemembers of the community they vary in quality (many are excellent) and some-times contain idiosyncratic aspects to their implementation New packages (egShortRead VariantAnnotation packages and their dependencies) can be addedto an R installation using
gt source(httpbioconductororgbiocLiteR)
gt biocLite(c(ShortRead VariantAnnotation)) new packages
gt biocLite(character()) update packages
installpackages A package is installed only once per R installation butneeds to be loaded (with library) in each session in which it is used
Find help using the R help system Start a web browser with
gt helpstart()
The lsquoSearch Engine and Keywordsrsquo link is helpful in day-to-day use Use man-ual pages to find detailed descriptions of the arguments and return values offunctions and the structure and methods of classes Find help within an Rsession using the package defining the help page must have been loaded (withlibrary)
gt library(ShortRead)
gt readFastq
S4 classes and generics can be discovered with syntax like the following for thecomplement generic in the Biostrings package
gt library(Biostrings)
gt showMethods(complement)
4
Function complement (package Biostrings)
x=DNAString
x=DNAStringSet
x=MaskedDNAString
x=MaskedRNAString
x=RNAString
x=RNAStringSet
x=XStringViews
Methods defined on the DNAStringSet class of Biostrings can be found with
gt showMethods(class=DNAStringSet where=getNamespace(Biostrings))
Obtaining help on S4 classes and methods requires syntax such as
gt class DNAStringSet
gt method complementDNAStringSet
Vignettes especially in Bioconductor packages provide an extensive narra-tive describing overall package functionality Use
gt vignette(package=useR2012)
to see a list of vignettes available in the useR2012 package Vignettes usuallyconsist of text with embedded R code a form of literate programming Thevignette can be read as a PDF document while the R source code is present asa script file ending with extension R The script file can be sourced or copiedinto an R session to evaluate exactly the commands used in the vignette
13 Bioconductor
Bioconductor is a collection of R packages for the analysis and comprehensionof high-throughput genomic data Bioconductor started more than 10 yearsago It gained credibility for its statistically rigorous approach to microarraypre-preprocessing and analysis of designed experiments and integrative and re-producible approaches to bioinformatic tasks There are now more than 500Bioconductor packages for expression and other microarrays sequence analy-sis flow cytometry imaging and other domains The Bioconductor web siteprovides installation package repository help and other documentation
The Bioconductor web site is at bioconductororg Features include
bull Introductory work flowsbull A manifest of Bioconductor packages arranged in BiocViewsbull Annotation (data bases of relevant genomic information eg Entrez gene
ids in model organisms KEGG pathways) and experiment data (contain-ing relatively comprehensive data sets and their analysis) packages
bull Mailing lists including searchable archives as the primary source of helpbull Course and conference information including extensive reference materialbull General information about the projectbull Package developer resources including guidelines for creating and submit-
ting new packages
5
Table 2 Selected Bioconductor packages for high-throughput sequence analysis
Concept PackagesData representation IRanges GenomicRanges GenomicFeatures
Biostrings BSgenome girafeInput output ShortRead (fastq) Rsamtools (bam) rtrack-
layer (gff wig bed) VariantAnnotation (vcf)R453Plus1Toolbox (454)
Annotation GenomicFeatures ChIPpeakAnno VariantAnnota-tion
Alignment Rsubread BiostringsVisualization ggbio GvizQuality assessment qrqc seqbias ReQON htSeqTools TEQC Rolexa
ShortReadRNA-seq BitSeq cqn cummeRbund DESeq DEXSeq
EDASeq edgeR gage goseq iASeq tweeDEseqChIP-seq etc BayesPeak baySeq ChIPpeakAnno chipseq
ChIPseqR ChIPsim CSAR DiffBind MEDIPSmosaics NarrowPeaks nucleR PICS PING RED-seq Repitools TSSi
Motifs BCRANK cosmo cosmoGUI MotIV seqLogorGADEM
3C etc HiTC r3CseqCopy number cnmops CNAnorm exomeCopy seqmentSeqMicrobiome phyloseq DirichletMultinomial clstutils manta
mcaGUI Work flows ArrayExpressHTS Genominator easyRNASeq
oneChannelGUI rnaSeqMapDatabase SRAdb
High-throughput sequence analysis Table 2 enumerates many of the pack-ages available for sequence analysis The table includes packages for repre-senting sequence-related data (eg GenomicRanges Biostrings) as well asdomain-specific analysis such as RNA-seq (eg edgeR DEXSeq) ChIP-seq(eg ChIPpeakAnno DiffBind) and SNPs and copy number variation (eggenoset ggtools VariantAnnotation)
Exercise 2Scavenger hunt Spend five minutes tracking down the following information
a The package containing the readFastq function
b The author of the alphabetFrequency function defined in the Biostringspackage
c A description of the GappedAlignments class
6
d The number of vignettes in the GenomicRanges package
e From the Bioconductor web site instructions for installing or updatingBioconductor packages
f A list of all packages in the current release of Bioconductor
g The URL of the Bioconductor mailing list subscription page
Solution Possible solutions are found with the following R commands
gt readFastq
gt library(Biostrings)
gt alphabetFrequency
gt classGappedAlignments
gt vignette(package=GenomicRanges)
and by visiting the Bioconductor web site eg installation instructions1 currentsoftware packages2 and mailing lists3
14 Resources
Dalgaard [4] provides an introduction to statistical analysis with R Matloff [11]introduces R programming concepts Chambers [3] provides more advancedinsights into R Gentleman [5] emphasizes use of R for bioinformatic program-ming tasks The R web site enumerates additional publications from the usercommunity
1httpbioconductororginstall2httpbioconductororgpackagesreleasebioc3httpbioconductororghelpmailing-list
7
Table 3 Selected Bioconductor packages for representing and manipulatingranges strings and other data structures
Package DescriptionIRanges Defines important classes (eg IRanges Rle) and meth-
ods (eg findOverlaps countOverlaps) for representingand manipulating ranges of consecutive values Also in-troduces DataFrame SimpleList and other classes tai-lored to representing very large data
GenomicRanges Range-based classes tailored to sequence representation(eg GRanges GRangesList) with information aboutstrand and sequence name
GenomicFeatures Foundation for manipulating data bases of genomicranges eg representing coordinates and organizationof exons and transcripts of known genes
Biostrings Classes (eg DNAStringSet) and methods (eg alpha-betFrequency pairwiseAlignment) for representing andmanipulating DNA and other biological sequences
BSgenome Representation and manipulation of large (eg whole-genome) sequences
2 Sequences and Ranges
Many Bioconductor packages are available for analysis of high-throughput se-quence data This section introduces two essential ways in which sequence dataare manipulated Sets of DNA strings represent the reads themselves and thenucleotide sequence of reference genomes Ranges describe both aligned readsand features of interest on the genome Key packages are summarized in Table3
21 Biostrings
The Biostrings package provides tools for working with DNA (and other biolog-ical) sequence data The essential data structures are DNAString and DNAS-tringSet for working with one or multiple DNA sequences The Biostrings pack-age contains additional classes for representing amino acid and general biologicalstrings The BSgenome and related packages (eg BSgenomeDmelanogasterUCSCdm3)are used to represent whole-genome sequences The following exercise exploresthese packages
Fixme Exercise trimLRPattern
22 GenomicRanges
Next-generation sequencing data consists of a large number of short reads Theseare typically aligned to a reference genome Basic operations are performed
8
on the alignment asking eg how many reads are aligned in a genomic rangedefined by nucleotide coordinates (eg in the exons of a gene) or how manynucleotides from all the aligned reads cover a set of genomic coordinates How isthis type of data the aligned reads and the reference genome to be representedin R in a way that allows for effective computation
The IRanges GenomicRanges and GenomicFeatures Bioconductor pack-ages provide the essential infrastructure for these operations we start with theGRanges class defined in GenomicRanges
GRanges Instances of GRanges are used to specify genomic coordinates Sup-pose we wish to represent two D melanogaster genes The first is located on thepositive strand of chromosome 3R from position 19967117 to 19973212 Thesecond is on the minus strand of the X chromosome with lsquoleft-mostrsquo base at18962306 and right-most base at 18962925 The coordinates are 1-based (iethe first nucleotide on a chromosome is numbered 1 rather than 0) left-most(ie reads on the minus strand are defined to lsquostartrsquo at the left-most coordi-nate rather than the 5rsquo coordinate) and closed (the start and end coordinatesare included in the range a range with identical start and end coordinates haswidth 1 a 0-width range is represented by the special construct where the endcoordinate is one less than the start coordinate)
A complete definition of these genes as GRanges is
gt genes lt- GRanges(seqnames=c(3R X)
+ ranges=IRanges(
+ start=c(19967117 18962306)
+ end=c(19973212 18962925))
+ strand=c(+ -)
+ seqlengths=c(`3R`=27905053L `X`=22422827L))
The components of a GRanges object are defined as vectors eg of seqnamesmuch as one would define a dataframe The start and end coordinates aregrouped into an IRanges instance The optional seqlengths argument specifiesthe maximum size of each sequence in this case the lengths of chromosomes 3Rand X in the lsquodm2rsquo build of D melanogaster genome This data is displayed as
gt genes
GRanges with 2 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] 3R [19967117 19973212] +
[2] X [18962306 18962925] -
---
seqlengths
3R X
27905053 22422827
9
Figure 1 Ranges
For the curious the gene coordinates and sequence lengths are derived fromthe orgDmegdb package for genes with Flybase identifiers FBgn0039155 andFBgn0085359 using the annotation facilities described in section 6
The GRanges class has many useful methods defined on it Consult the helppage
gt GRanges
and package vignettes (especially lsquoAn Introduction to GenomicRangesrsquo)
gt vignette(package=GenomicRanges)
for a comprehensive introduction
Operations on ranges The GRanges class has many useful methods fromthe IRanges class some of these methods are illustrated here We use IRangesto illustrate these operations to avoid complexities associated with strand andseqname but the operations are comparable on GRanges We begin with asimple set of ranges
gt ir lt- IRanges(start=c(7 9 12 14 2224)
+ end=c(15 11 12 18 26 27 28))
These and some common operations are illustrated in the upper panel of Figure1 and summarized in Table 4
elementMetadata (values) The GRanges class (actually most of the data struc-tures defined or extending those in the IRanges package) has two additional veryuseful data components The elementMetadata function (or its synonym values)allows information on each range to be stored and manipulated (eg subset)along with the GRanges instance The element metadata is represented as aDataFrame defined in IRanges and acting like a standard R dataframe butwith the ability to hold more complicated data structures as columns (and withelement metadata of its own providing an enhanced alternative to the Biobaseclass AnnotatedDataFrame)
10
Table 4 Common operations on IRanges GRanges and GRangesList
Category Function DescriptionAccessors start end width Get or s et the starts ends and widths
names Get or set the nameselementMetadata metadata Get or set metadata on elements or objectlength Number of ranges in the vectorrange Range formed from min start and max end
Ordering lt lt= gt gt= == = Compare ranges ordering by start then widthsort order rank Sort by the orderingduplicated Find ranges with multiple instancesunique Find unique instances removing duplicates
Arithmetic r + x r - x r x Shrink or expand ranges r by number x
shift Move the ranges by specified amountresize Change width ancoring on start end or middistance Separation between ranges (closest endpoints)restrict Clamp ranges to within some start and endflank Generate adjacent regions on start or end
Set operations reduce Merge overlapping and adjacent rangesintersect union setdiff Set operations on reduced rangespintersect punion psetdiff Parallel set operations on each x[i] y[i]gaps pgap Find regions not covered by reduced rangesdisjoin Ranges formed from union of endpoints
Overlaps findOverlaps Find all overlaps for each x in y
countOverlaps Count overlaps of each x range in y
nearest Find nearest neighbors (closest endpoints)precede follow Find nearest y that x precedes or followsx in y Find ranges in x that overlap range in y
Coverage coverage Count ranges covering each positionExtraction r[i] Get or set by logical or numeric index
r[[i]] Get integer sequence from start[i] to end[i]
subsetByOverlaps Subset x for those that overlap in y
head tail rev rep Conventional R semanticsSplit combine split Split ranges by a factor into a RangesList
c Concatenate two or more range objects
11
gt elementMetadata(genes) lt-
+ DataFrame(EntrezId=c(42865 2768869)
+ Symbol=c(kal-1 CG34330))
metadata allows addition of information to the entire object The information isin the form of a list any data can be provided
gt metadata(genes) lt-
+ list(CreatedBy=A User Date=date())
GRangesList The GRanges class is extremely useful for representing simpleranges Some next-generation sequence data and genomic features are morehierarchically structured A gene may be represented by seven exons within itAn aligned read may be represented by discontinuous ranges of alignment to areference The GRangesList class represents this type of information It is alist-like data structure which each element of the list itself a GRanges instanceThe gene FBgn0039155 contains several exons and can be represented as a listof length 1 where the element of the list contains a GRanges object with 7elements
GRangesList of length 1
$FBgn0039155
GRanges with 7 ranges and 2 elementMetadata cols
seqnames ranges strand | exon_id exon_name
ltRlegt ltIRangesgt ltRlegt | ltintegergt ltcharactergt
[1] chr3R [19967117 19967382] + | 64137 ltNAgt
[2] chr3R [19970915 19971592] + | 64138 ltNAgt
[3] chr3R [19971652 19971770] + | 64139 ltNAgt
[4] chr3R [19971831 19972024] + | 64140 ltNAgt
[5] chr3R [19972088 19972461] + | 64141 ltNAgt
[6] chr3R [19972523 19972589] + | 64142 ltNAgt
[7] chr3R [19972918 19973212] + | 64143 ltNAgt
---
seqlengths
chr3R
27905053
The GenomicFeatures package Many public resources provide annotationsabout genomic features For instance the UCSC genome browser maintains thelsquoknownGenersquo track of established exons transcripts and coding sequences ofmany model organisms The GenomicFeatures package provides a way to re-trieve save and query these resources The underlying representation is assqlite data bases but the data are available in R as GRangesList objects Thefollowing exercise explores the GenomicFeatures package and some of the func-tionality for the IRanges family introduced above
12
Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines
Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths
and table to summarize the number of exons in each gene for instance howmany single-exon genes are there
Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo
Solution
gt library(TxDbDmelanogasterUCSCdm3ensGene)
gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias
gt ex0 lt- exonsBy(txdb gene)
gt head(table(elementLengths(ex0)))
1 2 3 4 5 6
3182 2608 2070 1628 1133 886
gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)
gt ex lt- reduce(ex0[ids])
Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise
Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3
Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome
Use Views to create views on to the chromosome that span the start and endcoordinates of all exons
The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons
Look at the helper function and describe what it does
Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183
gt library(BSgenomeDmelanogasterUCSCdm3)
gt nm lt- ascharacter(unique(seqnames(ex[[1]])))
gt chr lt- Dmelanogaster[[nm]]
gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))
13
Using the gcFunction helper function the subject GC content is
gt gcFunction(v)
[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333
[8] 05189394 05110132
The gcFunction is really straight-forward it invokes the function alphabetFre-
quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon
gt gcFunction
function (x)
alf lt- alphabetFrequency(x asprob = TRUE)
rowSums(alf[ c(G C)])
ltenvironment namespaceuseR2012gt
23 Resources
There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group
14
Table 5 Selected Bioconductor packages for sequence reads and alignments
Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-
nipulating fastq files these classes rely heavily onBiostrings
GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads
Rsamtools Provides access to BAM alignment and other largesequence-related files
rtracklayer Input and output of bed wig and similar files
3 Reads and Alignments
The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis
31 The pasilla Data Set
As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences
In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package
32 Reads and the ShortRead Package
Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components
SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
15
+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet
$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~
are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported
FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set
gt fastqDir lt- filepath(bigdata() fastq)
gt fastqFiles lt- dir(fastqDir full=TRUE)
gt fq lt- readFastq(fastqFiles[1])
gt fq
class ShortReadQ
length 1000000 reads width 37 cycles
The data are represented as an object of class ShortReadQ
gt head(sread(fq) 3)
A DNAStringSet instance of length 3
width seq
[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT
[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA
gt head(quality(fq) 3)
class FastqQuality
quality
A BStringSet instance of length 3
width seq
[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+
16
There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance
gt abc lt- alphabetByCycle(sread(fq))
gt abc[14 18]
cycle
alphabet [1] [2] [3] [4] [5] [6] [7] [8]
A 78194 153156 200468 230120 283083 322913 162766 220205
C 439302 265338 362839 251434 203787 220855 253245 287010
G 397671 270342 258739 356003 301640 247090 227811 246684
T 84833 311164 177954 162443 211490 209142 356178 246101
FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample
gt sampler lt- FastqSampler(fastqFiles[1] 1000000)
gt yield(sampler) sample of 1000000 reads
class ShortReadQ
length 1000000 reads width 37 cycles
A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently
ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser
gt qas0 lt- Map(function(fl nm)
+ fq lt- FastqSampler(fl)
+ qa(yield(fq) nm)
+ fastqFiles
+ sub(_subsetfastq basename(fastqFiles)))
gt qas lt- docall(rbind qas0)
gt rpt lt- report(qas dest=tempfile())
gt browseURL(rpt)
A report from a larger subset of the experiment is available
gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml
+ package=useR2012)
gt browseURL(rpt)
17
Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package
Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a
histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each
cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package
and mclapply to read and summarize the GC content of reads in two files inparallel
Solution Discovery
gt dir(bigdata())
[1] bam fastq
gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)
Input
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 4071585 2175 6193578 3308 6193578 3308
Vcells 125856983 9603 279142478 21297 279142148 21297
gt fq lt- readFastq(fls[1])
GC content
gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)
gt sum(alf0[c(G C)])
[1] 05457237
A histogram of the GC content of individual reads is obtained with
gt gc lt- gcFunction(sread(fq))
gt hist(gc)
Alphabet by cycle
gt abc lt- alphabetByCycle(sread(fq))
gt matplot(t(abc[c(A C G T)]) type=l)
Advanced (Mac Linux only) processing on multiple cores
18
gt library(parallel)
gt gc0 lt- mclapply(fls function(fl)
+ fq lt- readFastq(fl)
+ gc lt- gcFunction(sread(fq))
+ table(cut(gc seq(0 1 05)))
+ )
gt simplify list of length 2 to 2-D array
gt gc lt- simplify2array(gc0)
gt matplot(gc type=s)
33 Alignments and the Rsamtools Package
Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data
Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields
gt fl lt- systemfile(extdata ex1sam package=Rsamtools)
gt strsplit(readLines(fl 1) t)[[1]]
[1] B7_591496693509
[2] 73
[3] seq1
[4] 1
[5] 99
[6] 36M
[7]
[8] 0
[9] 0
[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7
[12] MFi18
[13] Aqi73
[14] NMi0
[15] UQi0
[16] H0i1
[17] H1i0
19
The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R
Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)
gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)
gt aln lt- readGappedAlignments(alnFile)
gt head(aln 3)
GappedAlignments with 3 alignments and 0 elementMetadata cols
seqnames strand cigar qwidth start end width
ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt
[1] seq1 + 36M 36 1 36 36
[2] seq1 + 35M 35 3 37 35
[3] seq1 + 35M 35 5 39 35
ngap
ltintegergt
[1] 0
[2] 0
[3] 0
---
seqlengths
seq1 seq2
1575 1584
The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments
A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings
gt table(strand(aln))
+ -
1647 1624
20
gt table(width(aln))
30 31 32 33 34 35 36 38 40
2 21 1 8 37 2804 285 1 112
gt head(sort(table(cigar(aln)) decreasing=TRUE))
35M 36M 40M 34M 33M 14M4I17M
2804 283 112 37 6 4
Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes
Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from
The object ex created earlier contains coordinates of four genes Use coun-
tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to
Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation
Solution We discover the location of files using standard R commands
gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)
gt names(fls) lt- sub(_ basename(fls))
Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data
gt input
gt aln lt- readGappedAlignments(fls[1])
gt xtabs(~seqnames + strand asdataframe(aln))
strand
seqnames + -
chr3L 5402 5974
chrX 2278 2283
To count overlaps in regions defined in a previous exercise load the regions
gt data(ex) from an earlier exercise
Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known
21
gt strand(aln) lt- protocol not strand-aware
For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to
gt hits lt- countOverlaps(aln ex)
gt table(hits)
hits
0 1 2
772 15026 139
and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment
gt cnt lt- countOverlaps(ex aln[hits==1])
A simple function for counting reads is
gt counter lt-
+ function(filePath range)
+
+ aln lt- readGappedAlignments(filePath)
+ strand(aln) lt-
+ hits lt- countOverlaps(aln range)
+ countOverlaps(range aln[hits==1])
+
This can be applied to all files using sapply
gt counts lt- sapply(fls counter ex)
The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is
gt if (require(parallel))
+ simplify2array(mclapply(fls counter ex))
The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies
Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function
Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)
22
Figure 2 GC content in aligned reads
Solution
gt param lt- ScanBamParam(what=seq)
gt seqs lt- scanBam(fls[[1]] param=param)
gt readGC lt- gcFunction(seqs[[1]][[seq]])
gt hist(readGC)
23
Table 6 Selected Bioconductor packages for RNA-seq analysis
Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-
rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data
also limmarsquos roast or camera after transformation withvoom or cqn
easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI
Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments
4 RNA-seq
Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments
The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial
Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use
41 Differential Expression with the edgeR Package
RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of
24
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
Table 1 Schedule
Introduction Rstudio and the AMI R packages and help Bioconductor
Sequences and ranges Representing and manipulating biological sequencesworking with genomic coordinates
Reads and alignments High-throughput sequence reads (fastq files) andtheir aligned representation (bam files)
RNA-seq A post-alignment workflow for differential representation
ChIP-seq Working with multiple ChIP-seq experiments
Annotation Resources for annotation of gene and genomes working the vari-ants (VCF files)
1 Introduction
This workshop introduces use of R and Bioconductor for analysis of high-throughput sequence data The workshop is structured as a series of shortremarks followed by group exercises The exercises explore the diversity of tasksfor which R Bioconductor are appropriate but are far from comprehensive
The goals of the workshop are to (1) develop familiarity with R Biocon-ductor software for high-throughput analysis (2) expose key statistical issues inthe analysis of sequence data and (3) provide inspiration and a framework forfurther independent exploration An approximate schedule is shown in Table 1
11 RstudioExercise 1Log on to the Rstudio account set up for this course
Visit the lsquoPackagesrsquo tab in the lower right panel find the useR2012 packageand discover the vignette (ie this document)
Under the lsquoFilesrsquo tab figure out how to upload and download (small) filesto the server
12 R
The following should be familiar to you R has a number of standard datatypes that represent integer numeric (floating point) complex characterlogical (boolean) and raw (byte) data It is possible to convert between datatypes and to discover the class (eg class) of a variable All of the vectorsmentioned so far are homogenous consisting of a single type of element A list
can contain a collection of different types of elements and like all vectors theseelements can be named to create a key-value association A dataframe is a list
3
of equal-length vectors representing a rectangular data structure not unlike aspread sheet Each column of the data frame is a vector so data types must behomogenous within a column A dataframe can be subset by row or columnand columns can be accessed with $ or [[ A matrix is also a rectangular datastructure but subject to the constraint that all elements are the same type
R has object-oriented ways of representing complicated data objects Bio-conductor makes extensive use of lsquoS4rsquo objects Objects are often created byfunctions (eg GRanges below) with parts of the object extracted or assignedusing accessor functions Many operations on classes are implemented as meth-ods that specialize a generic function for the particular class of objects used toinvoke the function For instance countOverlaps is a generic that counts thenumber of times elements of its query argument overlaps elements of its sub-
ject there are methods with slightly different behaviors when the argumentsare IRange instances or GRanges instances (in the latter case the countOverlaps
method pays attention to whether ranges are on the same strand and chromo-some for instance)
Packages provide functionality beyond that available in base R There aremore than 500 Bioconductor packages Packages are contributed by diversemembers of the community they vary in quality (many are excellent) and some-times contain idiosyncratic aspects to their implementation New packages (egShortRead VariantAnnotation packages and their dependencies) can be addedto an R installation using
gt source(httpbioconductororgbiocLiteR)
gt biocLite(c(ShortRead VariantAnnotation)) new packages
gt biocLite(character()) update packages
installpackages A package is installed only once per R installation butneeds to be loaded (with library) in each session in which it is used
Find help using the R help system Start a web browser with
gt helpstart()
The lsquoSearch Engine and Keywordsrsquo link is helpful in day-to-day use Use man-ual pages to find detailed descriptions of the arguments and return values offunctions and the structure and methods of classes Find help within an Rsession using the package defining the help page must have been loaded (withlibrary)
gt library(ShortRead)
gt readFastq
S4 classes and generics can be discovered with syntax like the following for thecomplement generic in the Biostrings package
gt library(Biostrings)
gt showMethods(complement)
4
Function complement (package Biostrings)
x=DNAString
x=DNAStringSet
x=MaskedDNAString
x=MaskedRNAString
x=RNAString
x=RNAStringSet
x=XStringViews
Methods defined on the DNAStringSet class of Biostrings can be found with
gt showMethods(class=DNAStringSet where=getNamespace(Biostrings))
Obtaining help on S4 classes and methods requires syntax such as
gt class DNAStringSet
gt method complementDNAStringSet
Vignettes especially in Bioconductor packages provide an extensive narra-tive describing overall package functionality Use
gt vignette(package=useR2012)
to see a list of vignettes available in the useR2012 package Vignettes usuallyconsist of text with embedded R code a form of literate programming Thevignette can be read as a PDF document while the R source code is present asa script file ending with extension R The script file can be sourced or copiedinto an R session to evaluate exactly the commands used in the vignette
13 Bioconductor
Bioconductor is a collection of R packages for the analysis and comprehensionof high-throughput genomic data Bioconductor started more than 10 yearsago It gained credibility for its statistically rigorous approach to microarraypre-preprocessing and analysis of designed experiments and integrative and re-producible approaches to bioinformatic tasks There are now more than 500Bioconductor packages for expression and other microarrays sequence analy-sis flow cytometry imaging and other domains The Bioconductor web siteprovides installation package repository help and other documentation
The Bioconductor web site is at bioconductororg Features include
bull Introductory work flowsbull A manifest of Bioconductor packages arranged in BiocViewsbull Annotation (data bases of relevant genomic information eg Entrez gene
ids in model organisms KEGG pathways) and experiment data (contain-ing relatively comprehensive data sets and their analysis) packages
bull Mailing lists including searchable archives as the primary source of helpbull Course and conference information including extensive reference materialbull General information about the projectbull Package developer resources including guidelines for creating and submit-
ting new packages
5
Table 2 Selected Bioconductor packages for high-throughput sequence analysis
Concept PackagesData representation IRanges GenomicRanges GenomicFeatures
Biostrings BSgenome girafeInput output ShortRead (fastq) Rsamtools (bam) rtrack-
layer (gff wig bed) VariantAnnotation (vcf)R453Plus1Toolbox (454)
Annotation GenomicFeatures ChIPpeakAnno VariantAnnota-tion
Alignment Rsubread BiostringsVisualization ggbio GvizQuality assessment qrqc seqbias ReQON htSeqTools TEQC Rolexa
ShortReadRNA-seq BitSeq cqn cummeRbund DESeq DEXSeq
EDASeq edgeR gage goseq iASeq tweeDEseqChIP-seq etc BayesPeak baySeq ChIPpeakAnno chipseq
ChIPseqR ChIPsim CSAR DiffBind MEDIPSmosaics NarrowPeaks nucleR PICS PING RED-seq Repitools TSSi
Motifs BCRANK cosmo cosmoGUI MotIV seqLogorGADEM
3C etc HiTC r3CseqCopy number cnmops CNAnorm exomeCopy seqmentSeqMicrobiome phyloseq DirichletMultinomial clstutils manta
mcaGUI Work flows ArrayExpressHTS Genominator easyRNASeq
oneChannelGUI rnaSeqMapDatabase SRAdb
High-throughput sequence analysis Table 2 enumerates many of the pack-ages available for sequence analysis The table includes packages for repre-senting sequence-related data (eg GenomicRanges Biostrings) as well asdomain-specific analysis such as RNA-seq (eg edgeR DEXSeq) ChIP-seq(eg ChIPpeakAnno DiffBind) and SNPs and copy number variation (eggenoset ggtools VariantAnnotation)
Exercise 2Scavenger hunt Spend five minutes tracking down the following information
a The package containing the readFastq function
b The author of the alphabetFrequency function defined in the Biostringspackage
c A description of the GappedAlignments class
6
d The number of vignettes in the GenomicRanges package
e From the Bioconductor web site instructions for installing or updatingBioconductor packages
f A list of all packages in the current release of Bioconductor
g The URL of the Bioconductor mailing list subscription page
Solution Possible solutions are found with the following R commands
gt readFastq
gt library(Biostrings)
gt alphabetFrequency
gt classGappedAlignments
gt vignette(package=GenomicRanges)
and by visiting the Bioconductor web site eg installation instructions1 currentsoftware packages2 and mailing lists3
14 Resources
Dalgaard [4] provides an introduction to statistical analysis with R Matloff [11]introduces R programming concepts Chambers [3] provides more advancedinsights into R Gentleman [5] emphasizes use of R for bioinformatic program-ming tasks The R web site enumerates additional publications from the usercommunity
1httpbioconductororginstall2httpbioconductororgpackagesreleasebioc3httpbioconductororghelpmailing-list
7
Table 3 Selected Bioconductor packages for representing and manipulatingranges strings and other data structures
Package DescriptionIRanges Defines important classes (eg IRanges Rle) and meth-
ods (eg findOverlaps countOverlaps) for representingand manipulating ranges of consecutive values Also in-troduces DataFrame SimpleList and other classes tai-lored to representing very large data
GenomicRanges Range-based classes tailored to sequence representation(eg GRanges GRangesList) with information aboutstrand and sequence name
GenomicFeatures Foundation for manipulating data bases of genomicranges eg representing coordinates and organizationof exons and transcripts of known genes
Biostrings Classes (eg DNAStringSet) and methods (eg alpha-betFrequency pairwiseAlignment) for representing andmanipulating DNA and other biological sequences
BSgenome Representation and manipulation of large (eg whole-genome) sequences
2 Sequences and Ranges
Many Bioconductor packages are available for analysis of high-throughput se-quence data This section introduces two essential ways in which sequence dataare manipulated Sets of DNA strings represent the reads themselves and thenucleotide sequence of reference genomes Ranges describe both aligned readsand features of interest on the genome Key packages are summarized in Table3
21 Biostrings
The Biostrings package provides tools for working with DNA (and other biolog-ical) sequence data The essential data structures are DNAString and DNAS-tringSet for working with one or multiple DNA sequences The Biostrings pack-age contains additional classes for representing amino acid and general biologicalstrings The BSgenome and related packages (eg BSgenomeDmelanogasterUCSCdm3)are used to represent whole-genome sequences The following exercise exploresthese packages
Fixme Exercise trimLRPattern
22 GenomicRanges
Next-generation sequencing data consists of a large number of short reads Theseare typically aligned to a reference genome Basic operations are performed
8
on the alignment asking eg how many reads are aligned in a genomic rangedefined by nucleotide coordinates (eg in the exons of a gene) or how manynucleotides from all the aligned reads cover a set of genomic coordinates How isthis type of data the aligned reads and the reference genome to be representedin R in a way that allows for effective computation
The IRanges GenomicRanges and GenomicFeatures Bioconductor pack-ages provide the essential infrastructure for these operations we start with theGRanges class defined in GenomicRanges
GRanges Instances of GRanges are used to specify genomic coordinates Sup-pose we wish to represent two D melanogaster genes The first is located on thepositive strand of chromosome 3R from position 19967117 to 19973212 Thesecond is on the minus strand of the X chromosome with lsquoleft-mostrsquo base at18962306 and right-most base at 18962925 The coordinates are 1-based (iethe first nucleotide on a chromosome is numbered 1 rather than 0) left-most(ie reads on the minus strand are defined to lsquostartrsquo at the left-most coordi-nate rather than the 5rsquo coordinate) and closed (the start and end coordinatesare included in the range a range with identical start and end coordinates haswidth 1 a 0-width range is represented by the special construct where the endcoordinate is one less than the start coordinate)
A complete definition of these genes as GRanges is
gt genes lt- GRanges(seqnames=c(3R X)
+ ranges=IRanges(
+ start=c(19967117 18962306)
+ end=c(19973212 18962925))
+ strand=c(+ -)
+ seqlengths=c(`3R`=27905053L `X`=22422827L))
The components of a GRanges object are defined as vectors eg of seqnamesmuch as one would define a dataframe The start and end coordinates aregrouped into an IRanges instance The optional seqlengths argument specifiesthe maximum size of each sequence in this case the lengths of chromosomes 3Rand X in the lsquodm2rsquo build of D melanogaster genome This data is displayed as
gt genes
GRanges with 2 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] 3R [19967117 19973212] +
[2] X [18962306 18962925] -
---
seqlengths
3R X
27905053 22422827
9
Figure 1 Ranges
For the curious the gene coordinates and sequence lengths are derived fromthe orgDmegdb package for genes with Flybase identifiers FBgn0039155 andFBgn0085359 using the annotation facilities described in section 6
The GRanges class has many useful methods defined on it Consult the helppage
gt GRanges
and package vignettes (especially lsquoAn Introduction to GenomicRangesrsquo)
gt vignette(package=GenomicRanges)
for a comprehensive introduction
Operations on ranges The GRanges class has many useful methods fromthe IRanges class some of these methods are illustrated here We use IRangesto illustrate these operations to avoid complexities associated with strand andseqname but the operations are comparable on GRanges We begin with asimple set of ranges
gt ir lt- IRanges(start=c(7 9 12 14 2224)
+ end=c(15 11 12 18 26 27 28))
These and some common operations are illustrated in the upper panel of Figure1 and summarized in Table 4
elementMetadata (values) The GRanges class (actually most of the data struc-tures defined or extending those in the IRanges package) has two additional veryuseful data components The elementMetadata function (or its synonym values)allows information on each range to be stored and manipulated (eg subset)along with the GRanges instance The element metadata is represented as aDataFrame defined in IRanges and acting like a standard R dataframe butwith the ability to hold more complicated data structures as columns (and withelement metadata of its own providing an enhanced alternative to the Biobaseclass AnnotatedDataFrame)
10
Table 4 Common operations on IRanges GRanges and GRangesList
Category Function DescriptionAccessors start end width Get or s et the starts ends and widths
names Get or set the nameselementMetadata metadata Get or set metadata on elements or objectlength Number of ranges in the vectorrange Range formed from min start and max end
Ordering lt lt= gt gt= == = Compare ranges ordering by start then widthsort order rank Sort by the orderingduplicated Find ranges with multiple instancesunique Find unique instances removing duplicates
Arithmetic r + x r - x r x Shrink or expand ranges r by number x
shift Move the ranges by specified amountresize Change width ancoring on start end or middistance Separation between ranges (closest endpoints)restrict Clamp ranges to within some start and endflank Generate adjacent regions on start or end
Set operations reduce Merge overlapping and adjacent rangesintersect union setdiff Set operations on reduced rangespintersect punion psetdiff Parallel set operations on each x[i] y[i]gaps pgap Find regions not covered by reduced rangesdisjoin Ranges formed from union of endpoints
Overlaps findOverlaps Find all overlaps for each x in y
countOverlaps Count overlaps of each x range in y
nearest Find nearest neighbors (closest endpoints)precede follow Find nearest y that x precedes or followsx in y Find ranges in x that overlap range in y
Coverage coverage Count ranges covering each positionExtraction r[i] Get or set by logical or numeric index
r[[i]] Get integer sequence from start[i] to end[i]
subsetByOverlaps Subset x for those that overlap in y
head tail rev rep Conventional R semanticsSplit combine split Split ranges by a factor into a RangesList
c Concatenate two or more range objects
11
gt elementMetadata(genes) lt-
+ DataFrame(EntrezId=c(42865 2768869)
+ Symbol=c(kal-1 CG34330))
metadata allows addition of information to the entire object The information isin the form of a list any data can be provided
gt metadata(genes) lt-
+ list(CreatedBy=A User Date=date())
GRangesList The GRanges class is extremely useful for representing simpleranges Some next-generation sequence data and genomic features are morehierarchically structured A gene may be represented by seven exons within itAn aligned read may be represented by discontinuous ranges of alignment to areference The GRangesList class represents this type of information It is alist-like data structure which each element of the list itself a GRanges instanceThe gene FBgn0039155 contains several exons and can be represented as a listof length 1 where the element of the list contains a GRanges object with 7elements
GRangesList of length 1
$FBgn0039155
GRanges with 7 ranges and 2 elementMetadata cols
seqnames ranges strand | exon_id exon_name
ltRlegt ltIRangesgt ltRlegt | ltintegergt ltcharactergt
[1] chr3R [19967117 19967382] + | 64137 ltNAgt
[2] chr3R [19970915 19971592] + | 64138 ltNAgt
[3] chr3R [19971652 19971770] + | 64139 ltNAgt
[4] chr3R [19971831 19972024] + | 64140 ltNAgt
[5] chr3R [19972088 19972461] + | 64141 ltNAgt
[6] chr3R [19972523 19972589] + | 64142 ltNAgt
[7] chr3R [19972918 19973212] + | 64143 ltNAgt
---
seqlengths
chr3R
27905053
The GenomicFeatures package Many public resources provide annotationsabout genomic features For instance the UCSC genome browser maintains thelsquoknownGenersquo track of established exons transcripts and coding sequences ofmany model organisms The GenomicFeatures package provides a way to re-trieve save and query these resources The underlying representation is assqlite data bases but the data are available in R as GRangesList objects Thefollowing exercise explores the GenomicFeatures package and some of the func-tionality for the IRanges family introduced above
12
Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines
Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths
and table to summarize the number of exons in each gene for instance howmany single-exon genes are there
Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo
Solution
gt library(TxDbDmelanogasterUCSCdm3ensGene)
gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias
gt ex0 lt- exonsBy(txdb gene)
gt head(table(elementLengths(ex0)))
1 2 3 4 5 6
3182 2608 2070 1628 1133 886
gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)
gt ex lt- reduce(ex0[ids])
Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise
Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3
Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome
Use Views to create views on to the chromosome that span the start and endcoordinates of all exons
The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons
Look at the helper function and describe what it does
Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183
gt library(BSgenomeDmelanogasterUCSCdm3)
gt nm lt- ascharacter(unique(seqnames(ex[[1]])))
gt chr lt- Dmelanogaster[[nm]]
gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))
13
Using the gcFunction helper function the subject GC content is
gt gcFunction(v)
[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333
[8] 05189394 05110132
The gcFunction is really straight-forward it invokes the function alphabetFre-
quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon
gt gcFunction
function (x)
alf lt- alphabetFrequency(x asprob = TRUE)
rowSums(alf[ c(G C)])
ltenvironment namespaceuseR2012gt
23 Resources
There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group
14
Table 5 Selected Bioconductor packages for sequence reads and alignments
Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-
nipulating fastq files these classes rely heavily onBiostrings
GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads
Rsamtools Provides access to BAM alignment and other largesequence-related files
rtracklayer Input and output of bed wig and similar files
3 Reads and Alignments
The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis
31 The pasilla Data Set
As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences
In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package
32 Reads and the ShortRead Package
Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components
SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
15
+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet
$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~
are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported
FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set
gt fastqDir lt- filepath(bigdata() fastq)
gt fastqFiles lt- dir(fastqDir full=TRUE)
gt fq lt- readFastq(fastqFiles[1])
gt fq
class ShortReadQ
length 1000000 reads width 37 cycles
The data are represented as an object of class ShortReadQ
gt head(sread(fq) 3)
A DNAStringSet instance of length 3
width seq
[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT
[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA
gt head(quality(fq) 3)
class FastqQuality
quality
A BStringSet instance of length 3
width seq
[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+
16
There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance
gt abc lt- alphabetByCycle(sread(fq))
gt abc[14 18]
cycle
alphabet [1] [2] [3] [4] [5] [6] [7] [8]
A 78194 153156 200468 230120 283083 322913 162766 220205
C 439302 265338 362839 251434 203787 220855 253245 287010
G 397671 270342 258739 356003 301640 247090 227811 246684
T 84833 311164 177954 162443 211490 209142 356178 246101
FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample
gt sampler lt- FastqSampler(fastqFiles[1] 1000000)
gt yield(sampler) sample of 1000000 reads
class ShortReadQ
length 1000000 reads width 37 cycles
A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently
ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser
gt qas0 lt- Map(function(fl nm)
+ fq lt- FastqSampler(fl)
+ qa(yield(fq) nm)
+ fastqFiles
+ sub(_subsetfastq basename(fastqFiles)))
gt qas lt- docall(rbind qas0)
gt rpt lt- report(qas dest=tempfile())
gt browseURL(rpt)
A report from a larger subset of the experiment is available
gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml
+ package=useR2012)
gt browseURL(rpt)
17
Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package
Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a
histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each
cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package
and mclapply to read and summarize the GC content of reads in two files inparallel
Solution Discovery
gt dir(bigdata())
[1] bam fastq
gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)
Input
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 4071585 2175 6193578 3308 6193578 3308
Vcells 125856983 9603 279142478 21297 279142148 21297
gt fq lt- readFastq(fls[1])
GC content
gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)
gt sum(alf0[c(G C)])
[1] 05457237
A histogram of the GC content of individual reads is obtained with
gt gc lt- gcFunction(sread(fq))
gt hist(gc)
Alphabet by cycle
gt abc lt- alphabetByCycle(sread(fq))
gt matplot(t(abc[c(A C G T)]) type=l)
Advanced (Mac Linux only) processing on multiple cores
18
gt library(parallel)
gt gc0 lt- mclapply(fls function(fl)
+ fq lt- readFastq(fl)
+ gc lt- gcFunction(sread(fq))
+ table(cut(gc seq(0 1 05)))
+ )
gt simplify list of length 2 to 2-D array
gt gc lt- simplify2array(gc0)
gt matplot(gc type=s)
33 Alignments and the Rsamtools Package
Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data
Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields
gt fl lt- systemfile(extdata ex1sam package=Rsamtools)
gt strsplit(readLines(fl 1) t)[[1]]
[1] B7_591496693509
[2] 73
[3] seq1
[4] 1
[5] 99
[6] 36M
[7]
[8] 0
[9] 0
[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7
[12] MFi18
[13] Aqi73
[14] NMi0
[15] UQi0
[16] H0i1
[17] H1i0
19
The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R
Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)
gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)
gt aln lt- readGappedAlignments(alnFile)
gt head(aln 3)
GappedAlignments with 3 alignments and 0 elementMetadata cols
seqnames strand cigar qwidth start end width
ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt
[1] seq1 + 36M 36 1 36 36
[2] seq1 + 35M 35 3 37 35
[3] seq1 + 35M 35 5 39 35
ngap
ltintegergt
[1] 0
[2] 0
[3] 0
---
seqlengths
seq1 seq2
1575 1584
The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments
A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings
gt table(strand(aln))
+ -
1647 1624
20
gt table(width(aln))
30 31 32 33 34 35 36 38 40
2 21 1 8 37 2804 285 1 112
gt head(sort(table(cigar(aln)) decreasing=TRUE))
35M 36M 40M 34M 33M 14M4I17M
2804 283 112 37 6 4
Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes
Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from
The object ex created earlier contains coordinates of four genes Use coun-
tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to
Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation
Solution We discover the location of files using standard R commands
gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)
gt names(fls) lt- sub(_ basename(fls))
Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data
gt input
gt aln lt- readGappedAlignments(fls[1])
gt xtabs(~seqnames + strand asdataframe(aln))
strand
seqnames + -
chr3L 5402 5974
chrX 2278 2283
To count overlaps in regions defined in a previous exercise load the regions
gt data(ex) from an earlier exercise
Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known
21
gt strand(aln) lt- protocol not strand-aware
For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to
gt hits lt- countOverlaps(aln ex)
gt table(hits)
hits
0 1 2
772 15026 139
and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment
gt cnt lt- countOverlaps(ex aln[hits==1])
A simple function for counting reads is
gt counter lt-
+ function(filePath range)
+
+ aln lt- readGappedAlignments(filePath)
+ strand(aln) lt-
+ hits lt- countOverlaps(aln range)
+ countOverlaps(range aln[hits==1])
+
This can be applied to all files using sapply
gt counts lt- sapply(fls counter ex)
The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is
gt if (require(parallel))
+ simplify2array(mclapply(fls counter ex))
The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies
Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function
Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)
22
Figure 2 GC content in aligned reads
Solution
gt param lt- ScanBamParam(what=seq)
gt seqs lt- scanBam(fls[[1]] param=param)
gt readGC lt- gcFunction(seqs[[1]][[seq]])
gt hist(readGC)
23
Table 6 Selected Bioconductor packages for RNA-seq analysis
Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-
rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data
also limmarsquos roast or camera after transformation withvoom or cqn
easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI
Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments
4 RNA-seq
Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments
The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial
Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use
41 Differential Expression with the edgeR Package
RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of
24
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
of equal-length vectors representing a rectangular data structure not unlike aspread sheet Each column of the data frame is a vector so data types must behomogenous within a column A dataframe can be subset by row or columnand columns can be accessed with $ or [[ A matrix is also a rectangular datastructure but subject to the constraint that all elements are the same type
R has object-oriented ways of representing complicated data objects Bio-conductor makes extensive use of lsquoS4rsquo objects Objects are often created byfunctions (eg GRanges below) with parts of the object extracted or assignedusing accessor functions Many operations on classes are implemented as meth-ods that specialize a generic function for the particular class of objects used toinvoke the function For instance countOverlaps is a generic that counts thenumber of times elements of its query argument overlaps elements of its sub-
ject there are methods with slightly different behaviors when the argumentsare IRange instances or GRanges instances (in the latter case the countOverlaps
method pays attention to whether ranges are on the same strand and chromo-some for instance)
Packages provide functionality beyond that available in base R There aremore than 500 Bioconductor packages Packages are contributed by diversemembers of the community they vary in quality (many are excellent) and some-times contain idiosyncratic aspects to their implementation New packages (egShortRead VariantAnnotation packages and their dependencies) can be addedto an R installation using
gt source(httpbioconductororgbiocLiteR)
gt biocLite(c(ShortRead VariantAnnotation)) new packages
gt biocLite(character()) update packages
installpackages A package is installed only once per R installation butneeds to be loaded (with library) in each session in which it is used
Find help using the R help system Start a web browser with
gt helpstart()
The lsquoSearch Engine and Keywordsrsquo link is helpful in day-to-day use Use man-ual pages to find detailed descriptions of the arguments and return values offunctions and the structure and methods of classes Find help within an Rsession using the package defining the help page must have been loaded (withlibrary)
gt library(ShortRead)
gt readFastq
S4 classes and generics can be discovered with syntax like the following for thecomplement generic in the Biostrings package
gt library(Biostrings)
gt showMethods(complement)
4
Function complement (package Biostrings)
x=DNAString
x=DNAStringSet
x=MaskedDNAString
x=MaskedRNAString
x=RNAString
x=RNAStringSet
x=XStringViews
Methods defined on the DNAStringSet class of Biostrings can be found with
gt showMethods(class=DNAStringSet where=getNamespace(Biostrings))
Obtaining help on S4 classes and methods requires syntax such as
gt class DNAStringSet
gt method complementDNAStringSet
Vignettes especially in Bioconductor packages provide an extensive narra-tive describing overall package functionality Use
gt vignette(package=useR2012)
to see a list of vignettes available in the useR2012 package Vignettes usuallyconsist of text with embedded R code a form of literate programming Thevignette can be read as a PDF document while the R source code is present asa script file ending with extension R The script file can be sourced or copiedinto an R session to evaluate exactly the commands used in the vignette
13 Bioconductor
Bioconductor is a collection of R packages for the analysis and comprehensionof high-throughput genomic data Bioconductor started more than 10 yearsago It gained credibility for its statistically rigorous approach to microarraypre-preprocessing and analysis of designed experiments and integrative and re-producible approaches to bioinformatic tasks There are now more than 500Bioconductor packages for expression and other microarrays sequence analy-sis flow cytometry imaging and other domains The Bioconductor web siteprovides installation package repository help and other documentation
The Bioconductor web site is at bioconductororg Features include
bull Introductory work flowsbull A manifest of Bioconductor packages arranged in BiocViewsbull Annotation (data bases of relevant genomic information eg Entrez gene
ids in model organisms KEGG pathways) and experiment data (contain-ing relatively comprehensive data sets and their analysis) packages
bull Mailing lists including searchable archives as the primary source of helpbull Course and conference information including extensive reference materialbull General information about the projectbull Package developer resources including guidelines for creating and submit-
ting new packages
5
Table 2 Selected Bioconductor packages for high-throughput sequence analysis
Concept PackagesData representation IRanges GenomicRanges GenomicFeatures
Biostrings BSgenome girafeInput output ShortRead (fastq) Rsamtools (bam) rtrack-
layer (gff wig bed) VariantAnnotation (vcf)R453Plus1Toolbox (454)
Annotation GenomicFeatures ChIPpeakAnno VariantAnnota-tion
Alignment Rsubread BiostringsVisualization ggbio GvizQuality assessment qrqc seqbias ReQON htSeqTools TEQC Rolexa
ShortReadRNA-seq BitSeq cqn cummeRbund DESeq DEXSeq
EDASeq edgeR gage goseq iASeq tweeDEseqChIP-seq etc BayesPeak baySeq ChIPpeakAnno chipseq
ChIPseqR ChIPsim CSAR DiffBind MEDIPSmosaics NarrowPeaks nucleR PICS PING RED-seq Repitools TSSi
Motifs BCRANK cosmo cosmoGUI MotIV seqLogorGADEM
3C etc HiTC r3CseqCopy number cnmops CNAnorm exomeCopy seqmentSeqMicrobiome phyloseq DirichletMultinomial clstutils manta
mcaGUI Work flows ArrayExpressHTS Genominator easyRNASeq
oneChannelGUI rnaSeqMapDatabase SRAdb
High-throughput sequence analysis Table 2 enumerates many of the pack-ages available for sequence analysis The table includes packages for repre-senting sequence-related data (eg GenomicRanges Biostrings) as well asdomain-specific analysis such as RNA-seq (eg edgeR DEXSeq) ChIP-seq(eg ChIPpeakAnno DiffBind) and SNPs and copy number variation (eggenoset ggtools VariantAnnotation)
Exercise 2Scavenger hunt Spend five minutes tracking down the following information
a The package containing the readFastq function
b The author of the alphabetFrequency function defined in the Biostringspackage
c A description of the GappedAlignments class
6
d The number of vignettes in the GenomicRanges package
e From the Bioconductor web site instructions for installing or updatingBioconductor packages
f A list of all packages in the current release of Bioconductor
g The URL of the Bioconductor mailing list subscription page
Solution Possible solutions are found with the following R commands
gt readFastq
gt library(Biostrings)
gt alphabetFrequency
gt classGappedAlignments
gt vignette(package=GenomicRanges)
and by visiting the Bioconductor web site eg installation instructions1 currentsoftware packages2 and mailing lists3
14 Resources
Dalgaard [4] provides an introduction to statistical analysis with R Matloff [11]introduces R programming concepts Chambers [3] provides more advancedinsights into R Gentleman [5] emphasizes use of R for bioinformatic program-ming tasks The R web site enumerates additional publications from the usercommunity
1httpbioconductororginstall2httpbioconductororgpackagesreleasebioc3httpbioconductororghelpmailing-list
7
Table 3 Selected Bioconductor packages for representing and manipulatingranges strings and other data structures
Package DescriptionIRanges Defines important classes (eg IRanges Rle) and meth-
ods (eg findOverlaps countOverlaps) for representingand manipulating ranges of consecutive values Also in-troduces DataFrame SimpleList and other classes tai-lored to representing very large data
GenomicRanges Range-based classes tailored to sequence representation(eg GRanges GRangesList) with information aboutstrand and sequence name
GenomicFeatures Foundation for manipulating data bases of genomicranges eg representing coordinates and organizationof exons and transcripts of known genes
Biostrings Classes (eg DNAStringSet) and methods (eg alpha-betFrequency pairwiseAlignment) for representing andmanipulating DNA and other biological sequences
BSgenome Representation and manipulation of large (eg whole-genome) sequences
2 Sequences and Ranges
Many Bioconductor packages are available for analysis of high-throughput se-quence data This section introduces two essential ways in which sequence dataare manipulated Sets of DNA strings represent the reads themselves and thenucleotide sequence of reference genomes Ranges describe both aligned readsand features of interest on the genome Key packages are summarized in Table3
21 Biostrings
The Biostrings package provides tools for working with DNA (and other biolog-ical) sequence data The essential data structures are DNAString and DNAS-tringSet for working with one or multiple DNA sequences The Biostrings pack-age contains additional classes for representing amino acid and general biologicalstrings The BSgenome and related packages (eg BSgenomeDmelanogasterUCSCdm3)are used to represent whole-genome sequences The following exercise exploresthese packages
Fixme Exercise trimLRPattern
22 GenomicRanges
Next-generation sequencing data consists of a large number of short reads Theseare typically aligned to a reference genome Basic operations are performed
8
on the alignment asking eg how many reads are aligned in a genomic rangedefined by nucleotide coordinates (eg in the exons of a gene) or how manynucleotides from all the aligned reads cover a set of genomic coordinates How isthis type of data the aligned reads and the reference genome to be representedin R in a way that allows for effective computation
The IRanges GenomicRanges and GenomicFeatures Bioconductor pack-ages provide the essential infrastructure for these operations we start with theGRanges class defined in GenomicRanges
GRanges Instances of GRanges are used to specify genomic coordinates Sup-pose we wish to represent two D melanogaster genes The first is located on thepositive strand of chromosome 3R from position 19967117 to 19973212 Thesecond is on the minus strand of the X chromosome with lsquoleft-mostrsquo base at18962306 and right-most base at 18962925 The coordinates are 1-based (iethe first nucleotide on a chromosome is numbered 1 rather than 0) left-most(ie reads on the minus strand are defined to lsquostartrsquo at the left-most coordi-nate rather than the 5rsquo coordinate) and closed (the start and end coordinatesare included in the range a range with identical start and end coordinates haswidth 1 a 0-width range is represented by the special construct where the endcoordinate is one less than the start coordinate)
A complete definition of these genes as GRanges is
gt genes lt- GRanges(seqnames=c(3R X)
+ ranges=IRanges(
+ start=c(19967117 18962306)
+ end=c(19973212 18962925))
+ strand=c(+ -)
+ seqlengths=c(`3R`=27905053L `X`=22422827L))
The components of a GRanges object are defined as vectors eg of seqnamesmuch as one would define a dataframe The start and end coordinates aregrouped into an IRanges instance The optional seqlengths argument specifiesthe maximum size of each sequence in this case the lengths of chromosomes 3Rand X in the lsquodm2rsquo build of D melanogaster genome This data is displayed as
gt genes
GRanges with 2 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] 3R [19967117 19973212] +
[2] X [18962306 18962925] -
---
seqlengths
3R X
27905053 22422827
9
Figure 1 Ranges
For the curious the gene coordinates and sequence lengths are derived fromthe orgDmegdb package for genes with Flybase identifiers FBgn0039155 andFBgn0085359 using the annotation facilities described in section 6
The GRanges class has many useful methods defined on it Consult the helppage
gt GRanges
and package vignettes (especially lsquoAn Introduction to GenomicRangesrsquo)
gt vignette(package=GenomicRanges)
for a comprehensive introduction
Operations on ranges The GRanges class has many useful methods fromthe IRanges class some of these methods are illustrated here We use IRangesto illustrate these operations to avoid complexities associated with strand andseqname but the operations are comparable on GRanges We begin with asimple set of ranges
gt ir lt- IRanges(start=c(7 9 12 14 2224)
+ end=c(15 11 12 18 26 27 28))
These and some common operations are illustrated in the upper panel of Figure1 and summarized in Table 4
elementMetadata (values) The GRanges class (actually most of the data struc-tures defined or extending those in the IRanges package) has two additional veryuseful data components The elementMetadata function (or its synonym values)allows information on each range to be stored and manipulated (eg subset)along with the GRanges instance The element metadata is represented as aDataFrame defined in IRanges and acting like a standard R dataframe butwith the ability to hold more complicated data structures as columns (and withelement metadata of its own providing an enhanced alternative to the Biobaseclass AnnotatedDataFrame)
10
Table 4 Common operations on IRanges GRanges and GRangesList
Category Function DescriptionAccessors start end width Get or s et the starts ends and widths
names Get or set the nameselementMetadata metadata Get or set metadata on elements or objectlength Number of ranges in the vectorrange Range formed from min start and max end
Ordering lt lt= gt gt= == = Compare ranges ordering by start then widthsort order rank Sort by the orderingduplicated Find ranges with multiple instancesunique Find unique instances removing duplicates
Arithmetic r + x r - x r x Shrink or expand ranges r by number x
shift Move the ranges by specified amountresize Change width ancoring on start end or middistance Separation between ranges (closest endpoints)restrict Clamp ranges to within some start and endflank Generate adjacent regions on start or end
Set operations reduce Merge overlapping and adjacent rangesintersect union setdiff Set operations on reduced rangespintersect punion psetdiff Parallel set operations on each x[i] y[i]gaps pgap Find regions not covered by reduced rangesdisjoin Ranges formed from union of endpoints
Overlaps findOverlaps Find all overlaps for each x in y
countOverlaps Count overlaps of each x range in y
nearest Find nearest neighbors (closest endpoints)precede follow Find nearest y that x precedes or followsx in y Find ranges in x that overlap range in y
Coverage coverage Count ranges covering each positionExtraction r[i] Get or set by logical or numeric index
r[[i]] Get integer sequence from start[i] to end[i]
subsetByOverlaps Subset x for those that overlap in y
head tail rev rep Conventional R semanticsSplit combine split Split ranges by a factor into a RangesList
c Concatenate two or more range objects
11
gt elementMetadata(genes) lt-
+ DataFrame(EntrezId=c(42865 2768869)
+ Symbol=c(kal-1 CG34330))
metadata allows addition of information to the entire object The information isin the form of a list any data can be provided
gt metadata(genes) lt-
+ list(CreatedBy=A User Date=date())
GRangesList The GRanges class is extremely useful for representing simpleranges Some next-generation sequence data and genomic features are morehierarchically structured A gene may be represented by seven exons within itAn aligned read may be represented by discontinuous ranges of alignment to areference The GRangesList class represents this type of information It is alist-like data structure which each element of the list itself a GRanges instanceThe gene FBgn0039155 contains several exons and can be represented as a listof length 1 where the element of the list contains a GRanges object with 7elements
GRangesList of length 1
$FBgn0039155
GRanges with 7 ranges and 2 elementMetadata cols
seqnames ranges strand | exon_id exon_name
ltRlegt ltIRangesgt ltRlegt | ltintegergt ltcharactergt
[1] chr3R [19967117 19967382] + | 64137 ltNAgt
[2] chr3R [19970915 19971592] + | 64138 ltNAgt
[3] chr3R [19971652 19971770] + | 64139 ltNAgt
[4] chr3R [19971831 19972024] + | 64140 ltNAgt
[5] chr3R [19972088 19972461] + | 64141 ltNAgt
[6] chr3R [19972523 19972589] + | 64142 ltNAgt
[7] chr3R [19972918 19973212] + | 64143 ltNAgt
---
seqlengths
chr3R
27905053
The GenomicFeatures package Many public resources provide annotationsabout genomic features For instance the UCSC genome browser maintains thelsquoknownGenersquo track of established exons transcripts and coding sequences ofmany model organisms The GenomicFeatures package provides a way to re-trieve save and query these resources The underlying representation is assqlite data bases but the data are available in R as GRangesList objects Thefollowing exercise explores the GenomicFeatures package and some of the func-tionality for the IRanges family introduced above
12
Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines
Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths
and table to summarize the number of exons in each gene for instance howmany single-exon genes are there
Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo
Solution
gt library(TxDbDmelanogasterUCSCdm3ensGene)
gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias
gt ex0 lt- exonsBy(txdb gene)
gt head(table(elementLengths(ex0)))
1 2 3 4 5 6
3182 2608 2070 1628 1133 886
gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)
gt ex lt- reduce(ex0[ids])
Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise
Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3
Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome
Use Views to create views on to the chromosome that span the start and endcoordinates of all exons
The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons
Look at the helper function and describe what it does
Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183
gt library(BSgenomeDmelanogasterUCSCdm3)
gt nm lt- ascharacter(unique(seqnames(ex[[1]])))
gt chr lt- Dmelanogaster[[nm]]
gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))
13
Using the gcFunction helper function the subject GC content is
gt gcFunction(v)
[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333
[8] 05189394 05110132
The gcFunction is really straight-forward it invokes the function alphabetFre-
quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon
gt gcFunction
function (x)
alf lt- alphabetFrequency(x asprob = TRUE)
rowSums(alf[ c(G C)])
ltenvironment namespaceuseR2012gt
23 Resources
There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group
14
Table 5 Selected Bioconductor packages for sequence reads and alignments
Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-
nipulating fastq files these classes rely heavily onBiostrings
GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads
Rsamtools Provides access to BAM alignment and other largesequence-related files
rtracklayer Input and output of bed wig and similar files
3 Reads and Alignments
The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis
31 The pasilla Data Set
As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences
In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package
32 Reads and the ShortRead Package
Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components
SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
15
+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet
$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~
are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported
FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set
gt fastqDir lt- filepath(bigdata() fastq)
gt fastqFiles lt- dir(fastqDir full=TRUE)
gt fq lt- readFastq(fastqFiles[1])
gt fq
class ShortReadQ
length 1000000 reads width 37 cycles
The data are represented as an object of class ShortReadQ
gt head(sread(fq) 3)
A DNAStringSet instance of length 3
width seq
[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT
[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA
gt head(quality(fq) 3)
class FastqQuality
quality
A BStringSet instance of length 3
width seq
[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+
16
There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance
gt abc lt- alphabetByCycle(sread(fq))
gt abc[14 18]
cycle
alphabet [1] [2] [3] [4] [5] [6] [7] [8]
A 78194 153156 200468 230120 283083 322913 162766 220205
C 439302 265338 362839 251434 203787 220855 253245 287010
G 397671 270342 258739 356003 301640 247090 227811 246684
T 84833 311164 177954 162443 211490 209142 356178 246101
FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample
gt sampler lt- FastqSampler(fastqFiles[1] 1000000)
gt yield(sampler) sample of 1000000 reads
class ShortReadQ
length 1000000 reads width 37 cycles
A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently
ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser
gt qas0 lt- Map(function(fl nm)
+ fq lt- FastqSampler(fl)
+ qa(yield(fq) nm)
+ fastqFiles
+ sub(_subsetfastq basename(fastqFiles)))
gt qas lt- docall(rbind qas0)
gt rpt lt- report(qas dest=tempfile())
gt browseURL(rpt)
A report from a larger subset of the experiment is available
gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml
+ package=useR2012)
gt browseURL(rpt)
17
Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package
Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a
histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each
cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package
and mclapply to read and summarize the GC content of reads in two files inparallel
Solution Discovery
gt dir(bigdata())
[1] bam fastq
gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)
Input
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 4071585 2175 6193578 3308 6193578 3308
Vcells 125856983 9603 279142478 21297 279142148 21297
gt fq lt- readFastq(fls[1])
GC content
gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)
gt sum(alf0[c(G C)])
[1] 05457237
A histogram of the GC content of individual reads is obtained with
gt gc lt- gcFunction(sread(fq))
gt hist(gc)
Alphabet by cycle
gt abc lt- alphabetByCycle(sread(fq))
gt matplot(t(abc[c(A C G T)]) type=l)
Advanced (Mac Linux only) processing on multiple cores
18
gt library(parallel)
gt gc0 lt- mclapply(fls function(fl)
+ fq lt- readFastq(fl)
+ gc lt- gcFunction(sread(fq))
+ table(cut(gc seq(0 1 05)))
+ )
gt simplify list of length 2 to 2-D array
gt gc lt- simplify2array(gc0)
gt matplot(gc type=s)
33 Alignments and the Rsamtools Package
Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data
Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields
gt fl lt- systemfile(extdata ex1sam package=Rsamtools)
gt strsplit(readLines(fl 1) t)[[1]]
[1] B7_591496693509
[2] 73
[3] seq1
[4] 1
[5] 99
[6] 36M
[7]
[8] 0
[9] 0
[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7
[12] MFi18
[13] Aqi73
[14] NMi0
[15] UQi0
[16] H0i1
[17] H1i0
19
The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R
Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)
gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)
gt aln lt- readGappedAlignments(alnFile)
gt head(aln 3)
GappedAlignments with 3 alignments and 0 elementMetadata cols
seqnames strand cigar qwidth start end width
ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt
[1] seq1 + 36M 36 1 36 36
[2] seq1 + 35M 35 3 37 35
[3] seq1 + 35M 35 5 39 35
ngap
ltintegergt
[1] 0
[2] 0
[3] 0
---
seqlengths
seq1 seq2
1575 1584
The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments
A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings
gt table(strand(aln))
+ -
1647 1624
20
gt table(width(aln))
30 31 32 33 34 35 36 38 40
2 21 1 8 37 2804 285 1 112
gt head(sort(table(cigar(aln)) decreasing=TRUE))
35M 36M 40M 34M 33M 14M4I17M
2804 283 112 37 6 4
Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes
Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from
The object ex created earlier contains coordinates of four genes Use coun-
tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to
Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation
Solution We discover the location of files using standard R commands
gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)
gt names(fls) lt- sub(_ basename(fls))
Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data
gt input
gt aln lt- readGappedAlignments(fls[1])
gt xtabs(~seqnames + strand asdataframe(aln))
strand
seqnames + -
chr3L 5402 5974
chrX 2278 2283
To count overlaps in regions defined in a previous exercise load the regions
gt data(ex) from an earlier exercise
Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known
21
gt strand(aln) lt- protocol not strand-aware
For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to
gt hits lt- countOverlaps(aln ex)
gt table(hits)
hits
0 1 2
772 15026 139
and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment
gt cnt lt- countOverlaps(ex aln[hits==1])
A simple function for counting reads is
gt counter lt-
+ function(filePath range)
+
+ aln lt- readGappedAlignments(filePath)
+ strand(aln) lt-
+ hits lt- countOverlaps(aln range)
+ countOverlaps(range aln[hits==1])
+
This can be applied to all files using sapply
gt counts lt- sapply(fls counter ex)
The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is
gt if (require(parallel))
+ simplify2array(mclapply(fls counter ex))
The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies
Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function
Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)
22
Figure 2 GC content in aligned reads
Solution
gt param lt- ScanBamParam(what=seq)
gt seqs lt- scanBam(fls[[1]] param=param)
gt readGC lt- gcFunction(seqs[[1]][[seq]])
gt hist(readGC)
23
Table 6 Selected Bioconductor packages for RNA-seq analysis
Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-
rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data
also limmarsquos roast or camera after transformation withvoom or cqn
easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI
Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments
4 RNA-seq
Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments
The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial
Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use
41 Differential Expression with the edgeR Package
RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of
24
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
Function complement (package Biostrings)
x=DNAString
x=DNAStringSet
x=MaskedDNAString
x=MaskedRNAString
x=RNAString
x=RNAStringSet
x=XStringViews
Methods defined on the DNAStringSet class of Biostrings can be found with
gt showMethods(class=DNAStringSet where=getNamespace(Biostrings))
Obtaining help on S4 classes and methods requires syntax such as
gt class DNAStringSet
gt method complementDNAStringSet
Vignettes especially in Bioconductor packages provide an extensive narra-tive describing overall package functionality Use
gt vignette(package=useR2012)
to see a list of vignettes available in the useR2012 package Vignettes usuallyconsist of text with embedded R code a form of literate programming Thevignette can be read as a PDF document while the R source code is present asa script file ending with extension R The script file can be sourced or copiedinto an R session to evaluate exactly the commands used in the vignette
13 Bioconductor
Bioconductor is a collection of R packages for the analysis and comprehensionof high-throughput genomic data Bioconductor started more than 10 yearsago It gained credibility for its statistically rigorous approach to microarraypre-preprocessing and analysis of designed experiments and integrative and re-producible approaches to bioinformatic tasks There are now more than 500Bioconductor packages for expression and other microarrays sequence analy-sis flow cytometry imaging and other domains The Bioconductor web siteprovides installation package repository help and other documentation
The Bioconductor web site is at bioconductororg Features include
bull Introductory work flowsbull A manifest of Bioconductor packages arranged in BiocViewsbull Annotation (data bases of relevant genomic information eg Entrez gene
ids in model organisms KEGG pathways) and experiment data (contain-ing relatively comprehensive data sets and their analysis) packages
bull Mailing lists including searchable archives as the primary source of helpbull Course and conference information including extensive reference materialbull General information about the projectbull Package developer resources including guidelines for creating and submit-
ting new packages
5
Table 2 Selected Bioconductor packages for high-throughput sequence analysis
Concept PackagesData representation IRanges GenomicRanges GenomicFeatures
Biostrings BSgenome girafeInput output ShortRead (fastq) Rsamtools (bam) rtrack-
layer (gff wig bed) VariantAnnotation (vcf)R453Plus1Toolbox (454)
Annotation GenomicFeatures ChIPpeakAnno VariantAnnota-tion
Alignment Rsubread BiostringsVisualization ggbio GvizQuality assessment qrqc seqbias ReQON htSeqTools TEQC Rolexa
ShortReadRNA-seq BitSeq cqn cummeRbund DESeq DEXSeq
EDASeq edgeR gage goseq iASeq tweeDEseqChIP-seq etc BayesPeak baySeq ChIPpeakAnno chipseq
ChIPseqR ChIPsim CSAR DiffBind MEDIPSmosaics NarrowPeaks nucleR PICS PING RED-seq Repitools TSSi
Motifs BCRANK cosmo cosmoGUI MotIV seqLogorGADEM
3C etc HiTC r3CseqCopy number cnmops CNAnorm exomeCopy seqmentSeqMicrobiome phyloseq DirichletMultinomial clstutils manta
mcaGUI Work flows ArrayExpressHTS Genominator easyRNASeq
oneChannelGUI rnaSeqMapDatabase SRAdb
High-throughput sequence analysis Table 2 enumerates many of the pack-ages available for sequence analysis The table includes packages for repre-senting sequence-related data (eg GenomicRanges Biostrings) as well asdomain-specific analysis such as RNA-seq (eg edgeR DEXSeq) ChIP-seq(eg ChIPpeakAnno DiffBind) and SNPs and copy number variation (eggenoset ggtools VariantAnnotation)
Exercise 2Scavenger hunt Spend five minutes tracking down the following information
a The package containing the readFastq function
b The author of the alphabetFrequency function defined in the Biostringspackage
c A description of the GappedAlignments class
6
d The number of vignettes in the GenomicRanges package
e From the Bioconductor web site instructions for installing or updatingBioconductor packages
f A list of all packages in the current release of Bioconductor
g The URL of the Bioconductor mailing list subscription page
Solution Possible solutions are found with the following R commands
gt readFastq
gt library(Biostrings)
gt alphabetFrequency
gt classGappedAlignments
gt vignette(package=GenomicRanges)
and by visiting the Bioconductor web site eg installation instructions1 currentsoftware packages2 and mailing lists3
14 Resources
Dalgaard [4] provides an introduction to statistical analysis with R Matloff [11]introduces R programming concepts Chambers [3] provides more advancedinsights into R Gentleman [5] emphasizes use of R for bioinformatic program-ming tasks The R web site enumerates additional publications from the usercommunity
1httpbioconductororginstall2httpbioconductororgpackagesreleasebioc3httpbioconductororghelpmailing-list
7
Table 3 Selected Bioconductor packages for representing and manipulatingranges strings and other data structures
Package DescriptionIRanges Defines important classes (eg IRanges Rle) and meth-
ods (eg findOverlaps countOverlaps) for representingand manipulating ranges of consecutive values Also in-troduces DataFrame SimpleList and other classes tai-lored to representing very large data
GenomicRanges Range-based classes tailored to sequence representation(eg GRanges GRangesList) with information aboutstrand and sequence name
GenomicFeatures Foundation for manipulating data bases of genomicranges eg representing coordinates and organizationof exons and transcripts of known genes
Biostrings Classes (eg DNAStringSet) and methods (eg alpha-betFrequency pairwiseAlignment) for representing andmanipulating DNA and other biological sequences
BSgenome Representation and manipulation of large (eg whole-genome) sequences
2 Sequences and Ranges
Many Bioconductor packages are available for analysis of high-throughput se-quence data This section introduces two essential ways in which sequence dataare manipulated Sets of DNA strings represent the reads themselves and thenucleotide sequence of reference genomes Ranges describe both aligned readsand features of interest on the genome Key packages are summarized in Table3
21 Biostrings
The Biostrings package provides tools for working with DNA (and other biolog-ical) sequence data The essential data structures are DNAString and DNAS-tringSet for working with one or multiple DNA sequences The Biostrings pack-age contains additional classes for representing amino acid and general biologicalstrings The BSgenome and related packages (eg BSgenomeDmelanogasterUCSCdm3)are used to represent whole-genome sequences The following exercise exploresthese packages
Fixme Exercise trimLRPattern
22 GenomicRanges
Next-generation sequencing data consists of a large number of short reads Theseare typically aligned to a reference genome Basic operations are performed
8
on the alignment asking eg how many reads are aligned in a genomic rangedefined by nucleotide coordinates (eg in the exons of a gene) or how manynucleotides from all the aligned reads cover a set of genomic coordinates How isthis type of data the aligned reads and the reference genome to be representedin R in a way that allows for effective computation
The IRanges GenomicRanges and GenomicFeatures Bioconductor pack-ages provide the essential infrastructure for these operations we start with theGRanges class defined in GenomicRanges
GRanges Instances of GRanges are used to specify genomic coordinates Sup-pose we wish to represent two D melanogaster genes The first is located on thepositive strand of chromosome 3R from position 19967117 to 19973212 Thesecond is on the minus strand of the X chromosome with lsquoleft-mostrsquo base at18962306 and right-most base at 18962925 The coordinates are 1-based (iethe first nucleotide on a chromosome is numbered 1 rather than 0) left-most(ie reads on the minus strand are defined to lsquostartrsquo at the left-most coordi-nate rather than the 5rsquo coordinate) and closed (the start and end coordinatesare included in the range a range with identical start and end coordinates haswidth 1 a 0-width range is represented by the special construct where the endcoordinate is one less than the start coordinate)
A complete definition of these genes as GRanges is
gt genes lt- GRanges(seqnames=c(3R X)
+ ranges=IRanges(
+ start=c(19967117 18962306)
+ end=c(19973212 18962925))
+ strand=c(+ -)
+ seqlengths=c(`3R`=27905053L `X`=22422827L))
The components of a GRanges object are defined as vectors eg of seqnamesmuch as one would define a dataframe The start and end coordinates aregrouped into an IRanges instance The optional seqlengths argument specifiesthe maximum size of each sequence in this case the lengths of chromosomes 3Rand X in the lsquodm2rsquo build of D melanogaster genome This data is displayed as
gt genes
GRanges with 2 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] 3R [19967117 19973212] +
[2] X [18962306 18962925] -
---
seqlengths
3R X
27905053 22422827
9
Figure 1 Ranges
For the curious the gene coordinates and sequence lengths are derived fromthe orgDmegdb package for genes with Flybase identifiers FBgn0039155 andFBgn0085359 using the annotation facilities described in section 6
The GRanges class has many useful methods defined on it Consult the helppage
gt GRanges
and package vignettes (especially lsquoAn Introduction to GenomicRangesrsquo)
gt vignette(package=GenomicRanges)
for a comprehensive introduction
Operations on ranges The GRanges class has many useful methods fromthe IRanges class some of these methods are illustrated here We use IRangesto illustrate these operations to avoid complexities associated with strand andseqname but the operations are comparable on GRanges We begin with asimple set of ranges
gt ir lt- IRanges(start=c(7 9 12 14 2224)
+ end=c(15 11 12 18 26 27 28))
These and some common operations are illustrated in the upper panel of Figure1 and summarized in Table 4
elementMetadata (values) The GRanges class (actually most of the data struc-tures defined or extending those in the IRanges package) has two additional veryuseful data components The elementMetadata function (or its synonym values)allows information on each range to be stored and manipulated (eg subset)along with the GRanges instance The element metadata is represented as aDataFrame defined in IRanges and acting like a standard R dataframe butwith the ability to hold more complicated data structures as columns (and withelement metadata of its own providing an enhanced alternative to the Biobaseclass AnnotatedDataFrame)
10
Table 4 Common operations on IRanges GRanges and GRangesList
Category Function DescriptionAccessors start end width Get or s et the starts ends and widths
names Get or set the nameselementMetadata metadata Get or set metadata on elements or objectlength Number of ranges in the vectorrange Range formed from min start and max end
Ordering lt lt= gt gt= == = Compare ranges ordering by start then widthsort order rank Sort by the orderingduplicated Find ranges with multiple instancesunique Find unique instances removing duplicates
Arithmetic r + x r - x r x Shrink or expand ranges r by number x
shift Move the ranges by specified amountresize Change width ancoring on start end or middistance Separation between ranges (closest endpoints)restrict Clamp ranges to within some start and endflank Generate adjacent regions on start or end
Set operations reduce Merge overlapping and adjacent rangesintersect union setdiff Set operations on reduced rangespintersect punion psetdiff Parallel set operations on each x[i] y[i]gaps pgap Find regions not covered by reduced rangesdisjoin Ranges formed from union of endpoints
Overlaps findOverlaps Find all overlaps for each x in y
countOverlaps Count overlaps of each x range in y
nearest Find nearest neighbors (closest endpoints)precede follow Find nearest y that x precedes or followsx in y Find ranges in x that overlap range in y
Coverage coverage Count ranges covering each positionExtraction r[i] Get or set by logical or numeric index
r[[i]] Get integer sequence from start[i] to end[i]
subsetByOverlaps Subset x for those that overlap in y
head tail rev rep Conventional R semanticsSplit combine split Split ranges by a factor into a RangesList
c Concatenate two or more range objects
11
gt elementMetadata(genes) lt-
+ DataFrame(EntrezId=c(42865 2768869)
+ Symbol=c(kal-1 CG34330))
metadata allows addition of information to the entire object The information isin the form of a list any data can be provided
gt metadata(genes) lt-
+ list(CreatedBy=A User Date=date())
GRangesList The GRanges class is extremely useful for representing simpleranges Some next-generation sequence data and genomic features are morehierarchically structured A gene may be represented by seven exons within itAn aligned read may be represented by discontinuous ranges of alignment to areference The GRangesList class represents this type of information It is alist-like data structure which each element of the list itself a GRanges instanceThe gene FBgn0039155 contains several exons and can be represented as a listof length 1 where the element of the list contains a GRanges object with 7elements
GRangesList of length 1
$FBgn0039155
GRanges with 7 ranges and 2 elementMetadata cols
seqnames ranges strand | exon_id exon_name
ltRlegt ltIRangesgt ltRlegt | ltintegergt ltcharactergt
[1] chr3R [19967117 19967382] + | 64137 ltNAgt
[2] chr3R [19970915 19971592] + | 64138 ltNAgt
[3] chr3R [19971652 19971770] + | 64139 ltNAgt
[4] chr3R [19971831 19972024] + | 64140 ltNAgt
[5] chr3R [19972088 19972461] + | 64141 ltNAgt
[6] chr3R [19972523 19972589] + | 64142 ltNAgt
[7] chr3R [19972918 19973212] + | 64143 ltNAgt
---
seqlengths
chr3R
27905053
The GenomicFeatures package Many public resources provide annotationsabout genomic features For instance the UCSC genome browser maintains thelsquoknownGenersquo track of established exons transcripts and coding sequences ofmany model organisms The GenomicFeatures package provides a way to re-trieve save and query these resources The underlying representation is assqlite data bases but the data are available in R as GRangesList objects Thefollowing exercise explores the GenomicFeatures package and some of the func-tionality for the IRanges family introduced above
12
Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines
Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths
and table to summarize the number of exons in each gene for instance howmany single-exon genes are there
Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo
Solution
gt library(TxDbDmelanogasterUCSCdm3ensGene)
gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias
gt ex0 lt- exonsBy(txdb gene)
gt head(table(elementLengths(ex0)))
1 2 3 4 5 6
3182 2608 2070 1628 1133 886
gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)
gt ex lt- reduce(ex0[ids])
Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise
Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3
Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome
Use Views to create views on to the chromosome that span the start and endcoordinates of all exons
The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons
Look at the helper function and describe what it does
Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183
gt library(BSgenomeDmelanogasterUCSCdm3)
gt nm lt- ascharacter(unique(seqnames(ex[[1]])))
gt chr lt- Dmelanogaster[[nm]]
gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))
13
Using the gcFunction helper function the subject GC content is
gt gcFunction(v)
[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333
[8] 05189394 05110132
The gcFunction is really straight-forward it invokes the function alphabetFre-
quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon
gt gcFunction
function (x)
alf lt- alphabetFrequency(x asprob = TRUE)
rowSums(alf[ c(G C)])
ltenvironment namespaceuseR2012gt
23 Resources
There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group
14
Table 5 Selected Bioconductor packages for sequence reads and alignments
Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-
nipulating fastq files these classes rely heavily onBiostrings
GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads
Rsamtools Provides access to BAM alignment and other largesequence-related files
rtracklayer Input and output of bed wig and similar files
3 Reads and Alignments
The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis
31 The pasilla Data Set
As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences
In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package
32 Reads and the ShortRead Package
Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components
SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
15
+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet
$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~
are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported
FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set
gt fastqDir lt- filepath(bigdata() fastq)
gt fastqFiles lt- dir(fastqDir full=TRUE)
gt fq lt- readFastq(fastqFiles[1])
gt fq
class ShortReadQ
length 1000000 reads width 37 cycles
The data are represented as an object of class ShortReadQ
gt head(sread(fq) 3)
A DNAStringSet instance of length 3
width seq
[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT
[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA
gt head(quality(fq) 3)
class FastqQuality
quality
A BStringSet instance of length 3
width seq
[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+
16
There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance
gt abc lt- alphabetByCycle(sread(fq))
gt abc[14 18]
cycle
alphabet [1] [2] [3] [4] [5] [6] [7] [8]
A 78194 153156 200468 230120 283083 322913 162766 220205
C 439302 265338 362839 251434 203787 220855 253245 287010
G 397671 270342 258739 356003 301640 247090 227811 246684
T 84833 311164 177954 162443 211490 209142 356178 246101
FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample
gt sampler lt- FastqSampler(fastqFiles[1] 1000000)
gt yield(sampler) sample of 1000000 reads
class ShortReadQ
length 1000000 reads width 37 cycles
A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently
ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser
gt qas0 lt- Map(function(fl nm)
+ fq lt- FastqSampler(fl)
+ qa(yield(fq) nm)
+ fastqFiles
+ sub(_subsetfastq basename(fastqFiles)))
gt qas lt- docall(rbind qas0)
gt rpt lt- report(qas dest=tempfile())
gt browseURL(rpt)
A report from a larger subset of the experiment is available
gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml
+ package=useR2012)
gt browseURL(rpt)
17
Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package
Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a
histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each
cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package
and mclapply to read and summarize the GC content of reads in two files inparallel
Solution Discovery
gt dir(bigdata())
[1] bam fastq
gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)
Input
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 4071585 2175 6193578 3308 6193578 3308
Vcells 125856983 9603 279142478 21297 279142148 21297
gt fq lt- readFastq(fls[1])
GC content
gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)
gt sum(alf0[c(G C)])
[1] 05457237
A histogram of the GC content of individual reads is obtained with
gt gc lt- gcFunction(sread(fq))
gt hist(gc)
Alphabet by cycle
gt abc lt- alphabetByCycle(sread(fq))
gt matplot(t(abc[c(A C G T)]) type=l)
Advanced (Mac Linux only) processing on multiple cores
18
gt library(parallel)
gt gc0 lt- mclapply(fls function(fl)
+ fq lt- readFastq(fl)
+ gc lt- gcFunction(sread(fq))
+ table(cut(gc seq(0 1 05)))
+ )
gt simplify list of length 2 to 2-D array
gt gc lt- simplify2array(gc0)
gt matplot(gc type=s)
33 Alignments and the Rsamtools Package
Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data
Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields
gt fl lt- systemfile(extdata ex1sam package=Rsamtools)
gt strsplit(readLines(fl 1) t)[[1]]
[1] B7_591496693509
[2] 73
[3] seq1
[4] 1
[5] 99
[6] 36M
[7]
[8] 0
[9] 0
[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7
[12] MFi18
[13] Aqi73
[14] NMi0
[15] UQi0
[16] H0i1
[17] H1i0
19
The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R
Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)
gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)
gt aln lt- readGappedAlignments(alnFile)
gt head(aln 3)
GappedAlignments with 3 alignments and 0 elementMetadata cols
seqnames strand cigar qwidth start end width
ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt
[1] seq1 + 36M 36 1 36 36
[2] seq1 + 35M 35 3 37 35
[3] seq1 + 35M 35 5 39 35
ngap
ltintegergt
[1] 0
[2] 0
[3] 0
---
seqlengths
seq1 seq2
1575 1584
The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments
A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings
gt table(strand(aln))
+ -
1647 1624
20
gt table(width(aln))
30 31 32 33 34 35 36 38 40
2 21 1 8 37 2804 285 1 112
gt head(sort(table(cigar(aln)) decreasing=TRUE))
35M 36M 40M 34M 33M 14M4I17M
2804 283 112 37 6 4
Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes
Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from
The object ex created earlier contains coordinates of four genes Use coun-
tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to
Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation
Solution We discover the location of files using standard R commands
gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)
gt names(fls) lt- sub(_ basename(fls))
Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data
gt input
gt aln lt- readGappedAlignments(fls[1])
gt xtabs(~seqnames + strand asdataframe(aln))
strand
seqnames + -
chr3L 5402 5974
chrX 2278 2283
To count overlaps in regions defined in a previous exercise load the regions
gt data(ex) from an earlier exercise
Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known
21
gt strand(aln) lt- protocol not strand-aware
For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to
gt hits lt- countOverlaps(aln ex)
gt table(hits)
hits
0 1 2
772 15026 139
and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment
gt cnt lt- countOverlaps(ex aln[hits==1])
A simple function for counting reads is
gt counter lt-
+ function(filePath range)
+
+ aln lt- readGappedAlignments(filePath)
+ strand(aln) lt-
+ hits lt- countOverlaps(aln range)
+ countOverlaps(range aln[hits==1])
+
This can be applied to all files using sapply
gt counts lt- sapply(fls counter ex)
The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is
gt if (require(parallel))
+ simplify2array(mclapply(fls counter ex))
The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies
Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function
Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)
22
Figure 2 GC content in aligned reads
Solution
gt param lt- ScanBamParam(what=seq)
gt seqs lt- scanBam(fls[[1]] param=param)
gt readGC lt- gcFunction(seqs[[1]][[seq]])
gt hist(readGC)
23
Table 6 Selected Bioconductor packages for RNA-seq analysis
Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-
rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data
also limmarsquos roast or camera after transformation withvoom or cqn
easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI
Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments
4 RNA-seq
Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments
The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial
Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use
41 Differential Expression with the edgeR Package
RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of
24
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
Table 2 Selected Bioconductor packages for high-throughput sequence analysis
Concept PackagesData representation IRanges GenomicRanges GenomicFeatures
Biostrings BSgenome girafeInput output ShortRead (fastq) Rsamtools (bam) rtrack-
layer (gff wig bed) VariantAnnotation (vcf)R453Plus1Toolbox (454)
Annotation GenomicFeatures ChIPpeakAnno VariantAnnota-tion
Alignment Rsubread BiostringsVisualization ggbio GvizQuality assessment qrqc seqbias ReQON htSeqTools TEQC Rolexa
ShortReadRNA-seq BitSeq cqn cummeRbund DESeq DEXSeq
EDASeq edgeR gage goseq iASeq tweeDEseqChIP-seq etc BayesPeak baySeq ChIPpeakAnno chipseq
ChIPseqR ChIPsim CSAR DiffBind MEDIPSmosaics NarrowPeaks nucleR PICS PING RED-seq Repitools TSSi
Motifs BCRANK cosmo cosmoGUI MotIV seqLogorGADEM
3C etc HiTC r3CseqCopy number cnmops CNAnorm exomeCopy seqmentSeqMicrobiome phyloseq DirichletMultinomial clstutils manta
mcaGUI Work flows ArrayExpressHTS Genominator easyRNASeq
oneChannelGUI rnaSeqMapDatabase SRAdb
High-throughput sequence analysis Table 2 enumerates many of the pack-ages available for sequence analysis The table includes packages for repre-senting sequence-related data (eg GenomicRanges Biostrings) as well asdomain-specific analysis such as RNA-seq (eg edgeR DEXSeq) ChIP-seq(eg ChIPpeakAnno DiffBind) and SNPs and copy number variation (eggenoset ggtools VariantAnnotation)
Exercise 2Scavenger hunt Spend five minutes tracking down the following information
a The package containing the readFastq function
b The author of the alphabetFrequency function defined in the Biostringspackage
c A description of the GappedAlignments class
6
d The number of vignettes in the GenomicRanges package
e From the Bioconductor web site instructions for installing or updatingBioconductor packages
f A list of all packages in the current release of Bioconductor
g The URL of the Bioconductor mailing list subscription page
Solution Possible solutions are found with the following R commands
gt readFastq
gt library(Biostrings)
gt alphabetFrequency
gt classGappedAlignments
gt vignette(package=GenomicRanges)
and by visiting the Bioconductor web site eg installation instructions1 currentsoftware packages2 and mailing lists3
14 Resources
Dalgaard [4] provides an introduction to statistical analysis with R Matloff [11]introduces R programming concepts Chambers [3] provides more advancedinsights into R Gentleman [5] emphasizes use of R for bioinformatic program-ming tasks The R web site enumerates additional publications from the usercommunity
1httpbioconductororginstall2httpbioconductororgpackagesreleasebioc3httpbioconductororghelpmailing-list
7
Table 3 Selected Bioconductor packages for representing and manipulatingranges strings and other data structures
Package DescriptionIRanges Defines important classes (eg IRanges Rle) and meth-
ods (eg findOverlaps countOverlaps) for representingand manipulating ranges of consecutive values Also in-troduces DataFrame SimpleList and other classes tai-lored to representing very large data
GenomicRanges Range-based classes tailored to sequence representation(eg GRanges GRangesList) with information aboutstrand and sequence name
GenomicFeatures Foundation for manipulating data bases of genomicranges eg representing coordinates and organizationof exons and transcripts of known genes
Biostrings Classes (eg DNAStringSet) and methods (eg alpha-betFrequency pairwiseAlignment) for representing andmanipulating DNA and other biological sequences
BSgenome Representation and manipulation of large (eg whole-genome) sequences
2 Sequences and Ranges
Many Bioconductor packages are available for analysis of high-throughput se-quence data This section introduces two essential ways in which sequence dataare manipulated Sets of DNA strings represent the reads themselves and thenucleotide sequence of reference genomes Ranges describe both aligned readsand features of interest on the genome Key packages are summarized in Table3
21 Biostrings
The Biostrings package provides tools for working with DNA (and other biolog-ical) sequence data The essential data structures are DNAString and DNAS-tringSet for working with one or multiple DNA sequences The Biostrings pack-age contains additional classes for representing amino acid and general biologicalstrings The BSgenome and related packages (eg BSgenomeDmelanogasterUCSCdm3)are used to represent whole-genome sequences The following exercise exploresthese packages
Fixme Exercise trimLRPattern
22 GenomicRanges
Next-generation sequencing data consists of a large number of short reads Theseare typically aligned to a reference genome Basic operations are performed
8
on the alignment asking eg how many reads are aligned in a genomic rangedefined by nucleotide coordinates (eg in the exons of a gene) or how manynucleotides from all the aligned reads cover a set of genomic coordinates How isthis type of data the aligned reads and the reference genome to be representedin R in a way that allows for effective computation
The IRanges GenomicRanges and GenomicFeatures Bioconductor pack-ages provide the essential infrastructure for these operations we start with theGRanges class defined in GenomicRanges
GRanges Instances of GRanges are used to specify genomic coordinates Sup-pose we wish to represent two D melanogaster genes The first is located on thepositive strand of chromosome 3R from position 19967117 to 19973212 Thesecond is on the minus strand of the X chromosome with lsquoleft-mostrsquo base at18962306 and right-most base at 18962925 The coordinates are 1-based (iethe first nucleotide on a chromosome is numbered 1 rather than 0) left-most(ie reads on the minus strand are defined to lsquostartrsquo at the left-most coordi-nate rather than the 5rsquo coordinate) and closed (the start and end coordinatesare included in the range a range with identical start and end coordinates haswidth 1 a 0-width range is represented by the special construct where the endcoordinate is one less than the start coordinate)
A complete definition of these genes as GRanges is
gt genes lt- GRanges(seqnames=c(3R X)
+ ranges=IRanges(
+ start=c(19967117 18962306)
+ end=c(19973212 18962925))
+ strand=c(+ -)
+ seqlengths=c(`3R`=27905053L `X`=22422827L))
The components of a GRanges object are defined as vectors eg of seqnamesmuch as one would define a dataframe The start and end coordinates aregrouped into an IRanges instance The optional seqlengths argument specifiesthe maximum size of each sequence in this case the lengths of chromosomes 3Rand X in the lsquodm2rsquo build of D melanogaster genome This data is displayed as
gt genes
GRanges with 2 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] 3R [19967117 19973212] +
[2] X [18962306 18962925] -
---
seqlengths
3R X
27905053 22422827
9
Figure 1 Ranges
For the curious the gene coordinates and sequence lengths are derived fromthe orgDmegdb package for genes with Flybase identifiers FBgn0039155 andFBgn0085359 using the annotation facilities described in section 6
The GRanges class has many useful methods defined on it Consult the helppage
gt GRanges
and package vignettes (especially lsquoAn Introduction to GenomicRangesrsquo)
gt vignette(package=GenomicRanges)
for a comprehensive introduction
Operations on ranges The GRanges class has many useful methods fromthe IRanges class some of these methods are illustrated here We use IRangesto illustrate these operations to avoid complexities associated with strand andseqname but the operations are comparable on GRanges We begin with asimple set of ranges
gt ir lt- IRanges(start=c(7 9 12 14 2224)
+ end=c(15 11 12 18 26 27 28))
These and some common operations are illustrated in the upper panel of Figure1 and summarized in Table 4
elementMetadata (values) The GRanges class (actually most of the data struc-tures defined or extending those in the IRanges package) has two additional veryuseful data components The elementMetadata function (or its synonym values)allows information on each range to be stored and manipulated (eg subset)along with the GRanges instance The element metadata is represented as aDataFrame defined in IRanges and acting like a standard R dataframe butwith the ability to hold more complicated data structures as columns (and withelement metadata of its own providing an enhanced alternative to the Biobaseclass AnnotatedDataFrame)
10
Table 4 Common operations on IRanges GRanges and GRangesList
Category Function DescriptionAccessors start end width Get or s et the starts ends and widths
names Get or set the nameselementMetadata metadata Get or set metadata on elements or objectlength Number of ranges in the vectorrange Range formed from min start and max end
Ordering lt lt= gt gt= == = Compare ranges ordering by start then widthsort order rank Sort by the orderingduplicated Find ranges with multiple instancesunique Find unique instances removing duplicates
Arithmetic r + x r - x r x Shrink or expand ranges r by number x
shift Move the ranges by specified amountresize Change width ancoring on start end or middistance Separation between ranges (closest endpoints)restrict Clamp ranges to within some start and endflank Generate adjacent regions on start or end
Set operations reduce Merge overlapping and adjacent rangesintersect union setdiff Set operations on reduced rangespintersect punion psetdiff Parallel set operations on each x[i] y[i]gaps pgap Find regions not covered by reduced rangesdisjoin Ranges formed from union of endpoints
Overlaps findOverlaps Find all overlaps for each x in y
countOverlaps Count overlaps of each x range in y
nearest Find nearest neighbors (closest endpoints)precede follow Find nearest y that x precedes or followsx in y Find ranges in x that overlap range in y
Coverage coverage Count ranges covering each positionExtraction r[i] Get or set by logical or numeric index
r[[i]] Get integer sequence from start[i] to end[i]
subsetByOverlaps Subset x for those that overlap in y
head tail rev rep Conventional R semanticsSplit combine split Split ranges by a factor into a RangesList
c Concatenate two or more range objects
11
gt elementMetadata(genes) lt-
+ DataFrame(EntrezId=c(42865 2768869)
+ Symbol=c(kal-1 CG34330))
metadata allows addition of information to the entire object The information isin the form of a list any data can be provided
gt metadata(genes) lt-
+ list(CreatedBy=A User Date=date())
GRangesList The GRanges class is extremely useful for representing simpleranges Some next-generation sequence data and genomic features are morehierarchically structured A gene may be represented by seven exons within itAn aligned read may be represented by discontinuous ranges of alignment to areference The GRangesList class represents this type of information It is alist-like data structure which each element of the list itself a GRanges instanceThe gene FBgn0039155 contains several exons and can be represented as a listof length 1 where the element of the list contains a GRanges object with 7elements
GRangesList of length 1
$FBgn0039155
GRanges with 7 ranges and 2 elementMetadata cols
seqnames ranges strand | exon_id exon_name
ltRlegt ltIRangesgt ltRlegt | ltintegergt ltcharactergt
[1] chr3R [19967117 19967382] + | 64137 ltNAgt
[2] chr3R [19970915 19971592] + | 64138 ltNAgt
[3] chr3R [19971652 19971770] + | 64139 ltNAgt
[4] chr3R [19971831 19972024] + | 64140 ltNAgt
[5] chr3R [19972088 19972461] + | 64141 ltNAgt
[6] chr3R [19972523 19972589] + | 64142 ltNAgt
[7] chr3R [19972918 19973212] + | 64143 ltNAgt
---
seqlengths
chr3R
27905053
The GenomicFeatures package Many public resources provide annotationsabout genomic features For instance the UCSC genome browser maintains thelsquoknownGenersquo track of established exons transcripts and coding sequences ofmany model organisms The GenomicFeatures package provides a way to re-trieve save and query these resources The underlying representation is assqlite data bases but the data are available in R as GRangesList objects Thefollowing exercise explores the GenomicFeatures package and some of the func-tionality for the IRanges family introduced above
12
Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines
Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths
and table to summarize the number of exons in each gene for instance howmany single-exon genes are there
Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo
Solution
gt library(TxDbDmelanogasterUCSCdm3ensGene)
gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias
gt ex0 lt- exonsBy(txdb gene)
gt head(table(elementLengths(ex0)))
1 2 3 4 5 6
3182 2608 2070 1628 1133 886
gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)
gt ex lt- reduce(ex0[ids])
Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise
Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3
Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome
Use Views to create views on to the chromosome that span the start and endcoordinates of all exons
The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons
Look at the helper function and describe what it does
Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183
gt library(BSgenomeDmelanogasterUCSCdm3)
gt nm lt- ascharacter(unique(seqnames(ex[[1]])))
gt chr lt- Dmelanogaster[[nm]]
gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))
13
Using the gcFunction helper function the subject GC content is
gt gcFunction(v)
[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333
[8] 05189394 05110132
The gcFunction is really straight-forward it invokes the function alphabetFre-
quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon
gt gcFunction
function (x)
alf lt- alphabetFrequency(x asprob = TRUE)
rowSums(alf[ c(G C)])
ltenvironment namespaceuseR2012gt
23 Resources
There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group
14
Table 5 Selected Bioconductor packages for sequence reads and alignments
Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-
nipulating fastq files these classes rely heavily onBiostrings
GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads
Rsamtools Provides access to BAM alignment and other largesequence-related files
rtracklayer Input and output of bed wig and similar files
3 Reads and Alignments
The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis
31 The pasilla Data Set
As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences
In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package
32 Reads and the ShortRead Package
Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components
SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
15
+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet
$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~
are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported
FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set
gt fastqDir lt- filepath(bigdata() fastq)
gt fastqFiles lt- dir(fastqDir full=TRUE)
gt fq lt- readFastq(fastqFiles[1])
gt fq
class ShortReadQ
length 1000000 reads width 37 cycles
The data are represented as an object of class ShortReadQ
gt head(sread(fq) 3)
A DNAStringSet instance of length 3
width seq
[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT
[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA
gt head(quality(fq) 3)
class FastqQuality
quality
A BStringSet instance of length 3
width seq
[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+
16
There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance
gt abc lt- alphabetByCycle(sread(fq))
gt abc[14 18]
cycle
alphabet [1] [2] [3] [4] [5] [6] [7] [8]
A 78194 153156 200468 230120 283083 322913 162766 220205
C 439302 265338 362839 251434 203787 220855 253245 287010
G 397671 270342 258739 356003 301640 247090 227811 246684
T 84833 311164 177954 162443 211490 209142 356178 246101
FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample
gt sampler lt- FastqSampler(fastqFiles[1] 1000000)
gt yield(sampler) sample of 1000000 reads
class ShortReadQ
length 1000000 reads width 37 cycles
A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently
ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser
gt qas0 lt- Map(function(fl nm)
+ fq lt- FastqSampler(fl)
+ qa(yield(fq) nm)
+ fastqFiles
+ sub(_subsetfastq basename(fastqFiles)))
gt qas lt- docall(rbind qas0)
gt rpt lt- report(qas dest=tempfile())
gt browseURL(rpt)
A report from a larger subset of the experiment is available
gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml
+ package=useR2012)
gt browseURL(rpt)
17
Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package
Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a
histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each
cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package
and mclapply to read and summarize the GC content of reads in two files inparallel
Solution Discovery
gt dir(bigdata())
[1] bam fastq
gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)
Input
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 4071585 2175 6193578 3308 6193578 3308
Vcells 125856983 9603 279142478 21297 279142148 21297
gt fq lt- readFastq(fls[1])
GC content
gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)
gt sum(alf0[c(G C)])
[1] 05457237
A histogram of the GC content of individual reads is obtained with
gt gc lt- gcFunction(sread(fq))
gt hist(gc)
Alphabet by cycle
gt abc lt- alphabetByCycle(sread(fq))
gt matplot(t(abc[c(A C G T)]) type=l)
Advanced (Mac Linux only) processing on multiple cores
18
gt library(parallel)
gt gc0 lt- mclapply(fls function(fl)
+ fq lt- readFastq(fl)
+ gc lt- gcFunction(sread(fq))
+ table(cut(gc seq(0 1 05)))
+ )
gt simplify list of length 2 to 2-D array
gt gc lt- simplify2array(gc0)
gt matplot(gc type=s)
33 Alignments and the Rsamtools Package
Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data
Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields
gt fl lt- systemfile(extdata ex1sam package=Rsamtools)
gt strsplit(readLines(fl 1) t)[[1]]
[1] B7_591496693509
[2] 73
[3] seq1
[4] 1
[5] 99
[6] 36M
[7]
[8] 0
[9] 0
[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7
[12] MFi18
[13] Aqi73
[14] NMi0
[15] UQi0
[16] H0i1
[17] H1i0
19
The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R
Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)
gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)
gt aln lt- readGappedAlignments(alnFile)
gt head(aln 3)
GappedAlignments with 3 alignments and 0 elementMetadata cols
seqnames strand cigar qwidth start end width
ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt
[1] seq1 + 36M 36 1 36 36
[2] seq1 + 35M 35 3 37 35
[3] seq1 + 35M 35 5 39 35
ngap
ltintegergt
[1] 0
[2] 0
[3] 0
---
seqlengths
seq1 seq2
1575 1584
The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments
A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings
gt table(strand(aln))
+ -
1647 1624
20
gt table(width(aln))
30 31 32 33 34 35 36 38 40
2 21 1 8 37 2804 285 1 112
gt head(sort(table(cigar(aln)) decreasing=TRUE))
35M 36M 40M 34M 33M 14M4I17M
2804 283 112 37 6 4
Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes
Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from
The object ex created earlier contains coordinates of four genes Use coun-
tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to
Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation
Solution We discover the location of files using standard R commands
gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)
gt names(fls) lt- sub(_ basename(fls))
Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data
gt input
gt aln lt- readGappedAlignments(fls[1])
gt xtabs(~seqnames + strand asdataframe(aln))
strand
seqnames + -
chr3L 5402 5974
chrX 2278 2283
To count overlaps in regions defined in a previous exercise load the regions
gt data(ex) from an earlier exercise
Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known
21
gt strand(aln) lt- protocol not strand-aware
For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to
gt hits lt- countOverlaps(aln ex)
gt table(hits)
hits
0 1 2
772 15026 139
and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment
gt cnt lt- countOverlaps(ex aln[hits==1])
A simple function for counting reads is
gt counter lt-
+ function(filePath range)
+
+ aln lt- readGappedAlignments(filePath)
+ strand(aln) lt-
+ hits lt- countOverlaps(aln range)
+ countOverlaps(range aln[hits==1])
+
This can be applied to all files using sapply
gt counts lt- sapply(fls counter ex)
The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is
gt if (require(parallel))
+ simplify2array(mclapply(fls counter ex))
The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies
Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function
Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)
22
Figure 2 GC content in aligned reads
Solution
gt param lt- ScanBamParam(what=seq)
gt seqs lt- scanBam(fls[[1]] param=param)
gt readGC lt- gcFunction(seqs[[1]][[seq]])
gt hist(readGC)
23
Table 6 Selected Bioconductor packages for RNA-seq analysis
Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-
rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data
also limmarsquos roast or camera after transformation withvoom or cqn
easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI
Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments
4 RNA-seq
Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments
The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial
Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use
41 Differential Expression with the edgeR Package
RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of
24
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
d The number of vignettes in the GenomicRanges package
e From the Bioconductor web site instructions for installing or updatingBioconductor packages
f A list of all packages in the current release of Bioconductor
g The URL of the Bioconductor mailing list subscription page
Solution Possible solutions are found with the following R commands
gt readFastq
gt library(Biostrings)
gt alphabetFrequency
gt classGappedAlignments
gt vignette(package=GenomicRanges)
and by visiting the Bioconductor web site eg installation instructions1 currentsoftware packages2 and mailing lists3
14 Resources
Dalgaard [4] provides an introduction to statistical analysis with R Matloff [11]introduces R programming concepts Chambers [3] provides more advancedinsights into R Gentleman [5] emphasizes use of R for bioinformatic program-ming tasks The R web site enumerates additional publications from the usercommunity
1httpbioconductororginstall2httpbioconductororgpackagesreleasebioc3httpbioconductororghelpmailing-list
7
Table 3 Selected Bioconductor packages for representing and manipulatingranges strings and other data structures
Package DescriptionIRanges Defines important classes (eg IRanges Rle) and meth-
ods (eg findOverlaps countOverlaps) for representingand manipulating ranges of consecutive values Also in-troduces DataFrame SimpleList and other classes tai-lored to representing very large data
GenomicRanges Range-based classes tailored to sequence representation(eg GRanges GRangesList) with information aboutstrand and sequence name
GenomicFeatures Foundation for manipulating data bases of genomicranges eg representing coordinates and organizationof exons and transcripts of known genes
Biostrings Classes (eg DNAStringSet) and methods (eg alpha-betFrequency pairwiseAlignment) for representing andmanipulating DNA and other biological sequences
BSgenome Representation and manipulation of large (eg whole-genome) sequences
2 Sequences and Ranges
Many Bioconductor packages are available for analysis of high-throughput se-quence data This section introduces two essential ways in which sequence dataare manipulated Sets of DNA strings represent the reads themselves and thenucleotide sequence of reference genomes Ranges describe both aligned readsand features of interest on the genome Key packages are summarized in Table3
21 Biostrings
The Biostrings package provides tools for working with DNA (and other biolog-ical) sequence data The essential data structures are DNAString and DNAS-tringSet for working with one or multiple DNA sequences The Biostrings pack-age contains additional classes for representing amino acid and general biologicalstrings The BSgenome and related packages (eg BSgenomeDmelanogasterUCSCdm3)are used to represent whole-genome sequences The following exercise exploresthese packages
Fixme Exercise trimLRPattern
22 GenomicRanges
Next-generation sequencing data consists of a large number of short reads Theseare typically aligned to a reference genome Basic operations are performed
8
on the alignment asking eg how many reads are aligned in a genomic rangedefined by nucleotide coordinates (eg in the exons of a gene) or how manynucleotides from all the aligned reads cover a set of genomic coordinates How isthis type of data the aligned reads and the reference genome to be representedin R in a way that allows for effective computation
The IRanges GenomicRanges and GenomicFeatures Bioconductor pack-ages provide the essential infrastructure for these operations we start with theGRanges class defined in GenomicRanges
GRanges Instances of GRanges are used to specify genomic coordinates Sup-pose we wish to represent two D melanogaster genes The first is located on thepositive strand of chromosome 3R from position 19967117 to 19973212 Thesecond is on the minus strand of the X chromosome with lsquoleft-mostrsquo base at18962306 and right-most base at 18962925 The coordinates are 1-based (iethe first nucleotide on a chromosome is numbered 1 rather than 0) left-most(ie reads on the minus strand are defined to lsquostartrsquo at the left-most coordi-nate rather than the 5rsquo coordinate) and closed (the start and end coordinatesare included in the range a range with identical start and end coordinates haswidth 1 a 0-width range is represented by the special construct where the endcoordinate is one less than the start coordinate)
A complete definition of these genes as GRanges is
gt genes lt- GRanges(seqnames=c(3R X)
+ ranges=IRanges(
+ start=c(19967117 18962306)
+ end=c(19973212 18962925))
+ strand=c(+ -)
+ seqlengths=c(`3R`=27905053L `X`=22422827L))
The components of a GRanges object are defined as vectors eg of seqnamesmuch as one would define a dataframe The start and end coordinates aregrouped into an IRanges instance The optional seqlengths argument specifiesthe maximum size of each sequence in this case the lengths of chromosomes 3Rand X in the lsquodm2rsquo build of D melanogaster genome This data is displayed as
gt genes
GRanges with 2 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] 3R [19967117 19973212] +
[2] X [18962306 18962925] -
---
seqlengths
3R X
27905053 22422827
9
Figure 1 Ranges
For the curious the gene coordinates and sequence lengths are derived fromthe orgDmegdb package for genes with Flybase identifiers FBgn0039155 andFBgn0085359 using the annotation facilities described in section 6
The GRanges class has many useful methods defined on it Consult the helppage
gt GRanges
and package vignettes (especially lsquoAn Introduction to GenomicRangesrsquo)
gt vignette(package=GenomicRanges)
for a comprehensive introduction
Operations on ranges The GRanges class has many useful methods fromthe IRanges class some of these methods are illustrated here We use IRangesto illustrate these operations to avoid complexities associated with strand andseqname but the operations are comparable on GRanges We begin with asimple set of ranges
gt ir lt- IRanges(start=c(7 9 12 14 2224)
+ end=c(15 11 12 18 26 27 28))
These and some common operations are illustrated in the upper panel of Figure1 and summarized in Table 4
elementMetadata (values) The GRanges class (actually most of the data struc-tures defined or extending those in the IRanges package) has two additional veryuseful data components The elementMetadata function (or its synonym values)allows information on each range to be stored and manipulated (eg subset)along with the GRanges instance The element metadata is represented as aDataFrame defined in IRanges and acting like a standard R dataframe butwith the ability to hold more complicated data structures as columns (and withelement metadata of its own providing an enhanced alternative to the Biobaseclass AnnotatedDataFrame)
10
Table 4 Common operations on IRanges GRanges and GRangesList
Category Function DescriptionAccessors start end width Get or s et the starts ends and widths
names Get or set the nameselementMetadata metadata Get or set metadata on elements or objectlength Number of ranges in the vectorrange Range formed from min start and max end
Ordering lt lt= gt gt= == = Compare ranges ordering by start then widthsort order rank Sort by the orderingduplicated Find ranges with multiple instancesunique Find unique instances removing duplicates
Arithmetic r + x r - x r x Shrink or expand ranges r by number x
shift Move the ranges by specified amountresize Change width ancoring on start end or middistance Separation between ranges (closest endpoints)restrict Clamp ranges to within some start and endflank Generate adjacent regions on start or end
Set operations reduce Merge overlapping and adjacent rangesintersect union setdiff Set operations on reduced rangespintersect punion psetdiff Parallel set operations on each x[i] y[i]gaps pgap Find regions not covered by reduced rangesdisjoin Ranges formed from union of endpoints
Overlaps findOverlaps Find all overlaps for each x in y
countOverlaps Count overlaps of each x range in y
nearest Find nearest neighbors (closest endpoints)precede follow Find nearest y that x precedes or followsx in y Find ranges in x that overlap range in y
Coverage coverage Count ranges covering each positionExtraction r[i] Get or set by logical or numeric index
r[[i]] Get integer sequence from start[i] to end[i]
subsetByOverlaps Subset x for those that overlap in y
head tail rev rep Conventional R semanticsSplit combine split Split ranges by a factor into a RangesList
c Concatenate two or more range objects
11
gt elementMetadata(genes) lt-
+ DataFrame(EntrezId=c(42865 2768869)
+ Symbol=c(kal-1 CG34330))
metadata allows addition of information to the entire object The information isin the form of a list any data can be provided
gt metadata(genes) lt-
+ list(CreatedBy=A User Date=date())
GRangesList The GRanges class is extremely useful for representing simpleranges Some next-generation sequence data and genomic features are morehierarchically structured A gene may be represented by seven exons within itAn aligned read may be represented by discontinuous ranges of alignment to areference The GRangesList class represents this type of information It is alist-like data structure which each element of the list itself a GRanges instanceThe gene FBgn0039155 contains several exons and can be represented as a listof length 1 where the element of the list contains a GRanges object with 7elements
GRangesList of length 1
$FBgn0039155
GRanges with 7 ranges and 2 elementMetadata cols
seqnames ranges strand | exon_id exon_name
ltRlegt ltIRangesgt ltRlegt | ltintegergt ltcharactergt
[1] chr3R [19967117 19967382] + | 64137 ltNAgt
[2] chr3R [19970915 19971592] + | 64138 ltNAgt
[3] chr3R [19971652 19971770] + | 64139 ltNAgt
[4] chr3R [19971831 19972024] + | 64140 ltNAgt
[5] chr3R [19972088 19972461] + | 64141 ltNAgt
[6] chr3R [19972523 19972589] + | 64142 ltNAgt
[7] chr3R [19972918 19973212] + | 64143 ltNAgt
---
seqlengths
chr3R
27905053
The GenomicFeatures package Many public resources provide annotationsabout genomic features For instance the UCSC genome browser maintains thelsquoknownGenersquo track of established exons transcripts and coding sequences ofmany model organisms The GenomicFeatures package provides a way to re-trieve save and query these resources The underlying representation is assqlite data bases but the data are available in R as GRangesList objects Thefollowing exercise explores the GenomicFeatures package and some of the func-tionality for the IRanges family introduced above
12
Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines
Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths
and table to summarize the number of exons in each gene for instance howmany single-exon genes are there
Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo
Solution
gt library(TxDbDmelanogasterUCSCdm3ensGene)
gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias
gt ex0 lt- exonsBy(txdb gene)
gt head(table(elementLengths(ex0)))
1 2 3 4 5 6
3182 2608 2070 1628 1133 886
gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)
gt ex lt- reduce(ex0[ids])
Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise
Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3
Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome
Use Views to create views on to the chromosome that span the start and endcoordinates of all exons
The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons
Look at the helper function and describe what it does
Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183
gt library(BSgenomeDmelanogasterUCSCdm3)
gt nm lt- ascharacter(unique(seqnames(ex[[1]])))
gt chr lt- Dmelanogaster[[nm]]
gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))
13
Using the gcFunction helper function the subject GC content is
gt gcFunction(v)
[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333
[8] 05189394 05110132
The gcFunction is really straight-forward it invokes the function alphabetFre-
quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon
gt gcFunction
function (x)
alf lt- alphabetFrequency(x asprob = TRUE)
rowSums(alf[ c(G C)])
ltenvironment namespaceuseR2012gt
23 Resources
There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group
14
Table 5 Selected Bioconductor packages for sequence reads and alignments
Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-
nipulating fastq files these classes rely heavily onBiostrings
GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads
Rsamtools Provides access to BAM alignment and other largesequence-related files
rtracklayer Input and output of bed wig and similar files
3 Reads and Alignments
The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis
31 The pasilla Data Set
As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences
In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package
32 Reads and the ShortRead Package
Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components
SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
15
+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet
$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~
are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported
FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set
gt fastqDir lt- filepath(bigdata() fastq)
gt fastqFiles lt- dir(fastqDir full=TRUE)
gt fq lt- readFastq(fastqFiles[1])
gt fq
class ShortReadQ
length 1000000 reads width 37 cycles
The data are represented as an object of class ShortReadQ
gt head(sread(fq) 3)
A DNAStringSet instance of length 3
width seq
[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT
[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA
gt head(quality(fq) 3)
class FastqQuality
quality
A BStringSet instance of length 3
width seq
[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+
16
There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance
gt abc lt- alphabetByCycle(sread(fq))
gt abc[14 18]
cycle
alphabet [1] [2] [3] [4] [5] [6] [7] [8]
A 78194 153156 200468 230120 283083 322913 162766 220205
C 439302 265338 362839 251434 203787 220855 253245 287010
G 397671 270342 258739 356003 301640 247090 227811 246684
T 84833 311164 177954 162443 211490 209142 356178 246101
FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample
gt sampler lt- FastqSampler(fastqFiles[1] 1000000)
gt yield(sampler) sample of 1000000 reads
class ShortReadQ
length 1000000 reads width 37 cycles
A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently
ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser
gt qas0 lt- Map(function(fl nm)
+ fq lt- FastqSampler(fl)
+ qa(yield(fq) nm)
+ fastqFiles
+ sub(_subsetfastq basename(fastqFiles)))
gt qas lt- docall(rbind qas0)
gt rpt lt- report(qas dest=tempfile())
gt browseURL(rpt)
A report from a larger subset of the experiment is available
gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml
+ package=useR2012)
gt browseURL(rpt)
17
Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package
Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a
histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each
cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package
and mclapply to read and summarize the GC content of reads in two files inparallel
Solution Discovery
gt dir(bigdata())
[1] bam fastq
gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)
Input
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 4071585 2175 6193578 3308 6193578 3308
Vcells 125856983 9603 279142478 21297 279142148 21297
gt fq lt- readFastq(fls[1])
GC content
gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)
gt sum(alf0[c(G C)])
[1] 05457237
A histogram of the GC content of individual reads is obtained with
gt gc lt- gcFunction(sread(fq))
gt hist(gc)
Alphabet by cycle
gt abc lt- alphabetByCycle(sread(fq))
gt matplot(t(abc[c(A C G T)]) type=l)
Advanced (Mac Linux only) processing on multiple cores
18
gt library(parallel)
gt gc0 lt- mclapply(fls function(fl)
+ fq lt- readFastq(fl)
+ gc lt- gcFunction(sread(fq))
+ table(cut(gc seq(0 1 05)))
+ )
gt simplify list of length 2 to 2-D array
gt gc lt- simplify2array(gc0)
gt matplot(gc type=s)
33 Alignments and the Rsamtools Package
Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data
Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields
gt fl lt- systemfile(extdata ex1sam package=Rsamtools)
gt strsplit(readLines(fl 1) t)[[1]]
[1] B7_591496693509
[2] 73
[3] seq1
[4] 1
[5] 99
[6] 36M
[7]
[8] 0
[9] 0
[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7
[12] MFi18
[13] Aqi73
[14] NMi0
[15] UQi0
[16] H0i1
[17] H1i0
19
The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R
Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)
gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)
gt aln lt- readGappedAlignments(alnFile)
gt head(aln 3)
GappedAlignments with 3 alignments and 0 elementMetadata cols
seqnames strand cigar qwidth start end width
ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt
[1] seq1 + 36M 36 1 36 36
[2] seq1 + 35M 35 3 37 35
[3] seq1 + 35M 35 5 39 35
ngap
ltintegergt
[1] 0
[2] 0
[3] 0
---
seqlengths
seq1 seq2
1575 1584
The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments
A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings
gt table(strand(aln))
+ -
1647 1624
20
gt table(width(aln))
30 31 32 33 34 35 36 38 40
2 21 1 8 37 2804 285 1 112
gt head(sort(table(cigar(aln)) decreasing=TRUE))
35M 36M 40M 34M 33M 14M4I17M
2804 283 112 37 6 4
Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes
Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from
The object ex created earlier contains coordinates of four genes Use coun-
tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to
Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation
Solution We discover the location of files using standard R commands
gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)
gt names(fls) lt- sub(_ basename(fls))
Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data
gt input
gt aln lt- readGappedAlignments(fls[1])
gt xtabs(~seqnames + strand asdataframe(aln))
strand
seqnames + -
chr3L 5402 5974
chrX 2278 2283
To count overlaps in regions defined in a previous exercise load the regions
gt data(ex) from an earlier exercise
Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known
21
gt strand(aln) lt- protocol not strand-aware
For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to
gt hits lt- countOverlaps(aln ex)
gt table(hits)
hits
0 1 2
772 15026 139
and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment
gt cnt lt- countOverlaps(ex aln[hits==1])
A simple function for counting reads is
gt counter lt-
+ function(filePath range)
+
+ aln lt- readGappedAlignments(filePath)
+ strand(aln) lt-
+ hits lt- countOverlaps(aln range)
+ countOverlaps(range aln[hits==1])
+
This can be applied to all files using sapply
gt counts lt- sapply(fls counter ex)
The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is
gt if (require(parallel))
+ simplify2array(mclapply(fls counter ex))
The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies
Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function
Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)
22
Figure 2 GC content in aligned reads
Solution
gt param lt- ScanBamParam(what=seq)
gt seqs lt- scanBam(fls[[1]] param=param)
gt readGC lt- gcFunction(seqs[[1]][[seq]])
gt hist(readGC)
23
Table 6 Selected Bioconductor packages for RNA-seq analysis
Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-
rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data
also limmarsquos roast or camera after transformation withvoom or cqn
easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI
Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments
4 RNA-seq
Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments
The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial
Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use
41 Differential Expression with the edgeR Package
RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of
24
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
Table 3 Selected Bioconductor packages for representing and manipulatingranges strings and other data structures
Package DescriptionIRanges Defines important classes (eg IRanges Rle) and meth-
ods (eg findOverlaps countOverlaps) for representingand manipulating ranges of consecutive values Also in-troduces DataFrame SimpleList and other classes tai-lored to representing very large data
GenomicRanges Range-based classes tailored to sequence representation(eg GRanges GRangesList) with information aboutstrand and sequence name
GenomicFeatures Foundation for manipulating data bases of genomicranges eg representing coordinates and organizationof exons and transcripts of known genes
Biostrings Classes (eg DNAStringSet) and methods (eg alpha-betFrequency pairwiseAlignment) for representing andmanipulating DNA and other biological sequences
BSgenome Representation and manipulation of large (eg whole-genome) sequences
2 Sequences and Ranges
Many Bioconductor packages are available for analysis of high-throughput se-quence data This section introduces two essential ways in which sequence dataare manipulated Sets of DNA strings represent the reads themselves and thenucleotide sequence of reference genomes Ranges describe both aligned readsand features of interest on the genome Key packages are summarized in Table3
21 Biostrings
The Biostrings package provides tools for working with DNA (and other biolog-ical) sequence data The essential data structures are DNAString and DNAS-tringSet for working with one or multiple DNA sequences The Biostrings pack-age contains additional classes for representing amino acid and general biologicalstrings The BSgenome and related packages (eg BSgenomeDmelanogasterUCSCdm3)are used to represent whole-genome sequences The following exercise exploresthese packages
Fixme Exercise trimLRPattern
22 GenomicRanges
Next-generation sequencing data consists of a large number of short reads Theseare typically aligned to a reference genome Basic operations are performed
8
on the alignment asking eg how many reads are aligned in a genomic rangedefined by nucleotide coordinates (eg in the exons of a gene) or how manynucleotides from all the aligned reads cover a set of genomic coordinates How isthis type of data the aligned reads and the reference genome to be representedin R in a way that allows for effective computation
The IRanges GenomicRanges and GenomicFeatures Bioconductor pack-ages provide the essential infrastructure for these operations we start with theGRanges class defined in GenomicRanges
GRanges Instances of GRanges are used to specify genomic coordinates Sup-pose we wish to represent two D melanogaster genes The first is located on thepositive strand of chromosome 3R from position 19967117 to 19973212 Thesecond is on the minus strand of the X chromosome with lsquoleft-mostrsquo base at18962306 and right-most base at 18962925 The coordinates are 1-based (iethe first nucleotide on a chromosome is numbered 1 rather than 0) left-most(ie reads on the minus strand are defined to lsquostartrsquo at the left-most coordi-nate rather than the 5rsquo coordinate) and closed (the start and end coordinatesare included in the range a range with identical start and end coordinates haswidth 1 a 0-width range is represented by the special construct where the endcoordinate is one less than the start coordinate)
A complete definition of these genes as GRanges is
gt genes lt- GRanges(seqnames=c(3R X)
+ ranges=IRanges(
+ start=c(19967117 18962306)
+ end=c(19973212 18962925))
+ strand=c(+ -)
+ seqlengths=c(`3R`=27905053L `X`=22422827L))
The components of a GRanges object are defined as vectors eg of seqnamesmuch as one would define a dataframe The start and end coordinates aregrouped into an IRanges instance The optional seqlengths argument specifiesthe maximum size of each sequence in this case the lengths of chromosomes 3Rand X in the lsquodm2rsquo build of D melanogaster genome This data is displayed as
gt genes
GRanges with 2 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] 3R [19967117 19973212] +
[2] X [18962306 18962925] -
---
seqlengths
3R X
27905053 22422827
9
Figure 1 Ranges
For the curious the gene coordinates and sequence lengths are derived fromthe orgDmegdb package for genes with Flybase identifiers FBgn0039155 andFBgn0085359 using the annotation facilities described in section 6
The GRanges class has many useful methods defined on it Consult the helppage
gt GRanges
and package vignettes (especially lsquoAn Introduction to GenomicRangesrsquo)
gt vignette(package=GenomicRanges)
for a comprehensive introduction
Operations on ranges The GRanges class has many useful methods fromthe IRanges class some of these methods are illustrated here We use IRangesto illustrate these operations to avoid complexities associated with strand andseqname but the operations are comparable on GRanges We begin with asimple set of ranges
gt ir lt- IRanges(start=c(7 9 12 14 2224)
+ end=c(15 11 12 18 26 27 28))
These and some common operations are illustrated in the upper panel of Figure1 and summarized in Table 4
elementMetadata (values) The GRanges class (actually most of the data struc-tures defined or extending those in the IRanges package) has two additional veryuseful data components The elementMetadata function (or its synonym values)allows information on each range to be stored and manipulated (eg subset)along with the GRanges instance The element metadata is represented as aDataFrame defined in IRanges and acting like a standard R dataframe butwith the ability to hold more complicated data structures as columns (and withelement metadata of its own providing an enhanced alternative to the Biobaseclass AnnotatedDataFrame)
10
Table 4 Common operations on IRanges GRanges and GRangesList
Category Function DescriptionAccessors start end width Get or s et the starts ends and widths
names Get or set the nameselementMetadata metadata Get or set metadata on elements or objectlength Number of ranges in the vectorrange Range formed from min start and max end
Ordering lt lt= gt gt= == = Compare ranges ordering by start then widthsort order rank Sort by the orderingduplicated Find ranges with multiple instancesunique Find unique instances removing duplicates
Arithmetic r + x r - x r x Shrink or expand ranges r by number x
shift Move the ranges by specified amountresize Change width ancoring on start end or middistance Separation between ranges (closest endpoints)restrict Clamp ranges to within some start and endflank Generate adjacent regions on start or end
Set operations reduce Merge overlapping and adjacent rangesintersect union setdiff Set operations on reduced rangespintersect punion psetdiff Parallel set operations on each x[i] y[i]gaps pgap Find regions not covered by reduced rangesdisjoin Ranges formed from union of endpoints
Overlaps findOverlaps Find all overlaps for each x in y
countOverlaps Count overlaps of each x range in y
nearest Find nearest neighbors (closest endpoints)precede follow Find nearest y that x precedes or followsx in y Find ranges in x that overlap range in y
Coverage coverage Count ranges covering each positionExtraction r[i] Get or set by logical or numeric index
r[[i]] Get integer sequence from start[i] to end[i]
subsetByOverlaps Subset x for those that overlap in y
head tail rev rep Conventional R semanticsSplit combine split Split ranges by a factor into a RangesList
c Concatenate two or more range objects
11
gt elementMetadata(genes) lt-
+ DataFrame(EntrezId=c(42865 2768869)
+ Symbol=c(kal-1 CG34330))
metadata allows addition of information to the entire object The information isin the form of a list any data can be provided
gt metadata(genes) lt-
+ list(CreatedBy=A User Date=date())
GRangesList The GRanges class is extremely useful for representing simpleranges Some next-generation sequence data and genomic features are morehierarchically structured A gene may be represented by seven exons within itAn aligned read may be represented by discontinuous ranges of alignment to areference The GRangesList class represents this type of information It is alist-like data structure which each element of the list itself a GRanges instanceThe gene FBgn0039155 contains several exons and can be represented as a listof length 1 where the element of the list contains a GRanges object with 7elements
GRangesList of length 1
$FBgn0039155
GRanges with 7 ranges and 2 elementMetadata cols
seqnames ranges strand | exon_id exon_name
ltRlegt ltIRangesgt ltRlegt | ltintegergt ltcharactergt
[1] chr3R [19967117 19967382] + | 64137 ltNAgt
[2] chr3R [19970915 19971592] + | 64138 ltNAgt
[3] chr3R [19971652 19971770] + | 64139 ltNAgt
[4] chr3R [19971831 19972024] + | 64140 ltNAgt
[5] chr3R [19972088 19972461] + | 64141 ltNAgt
[6] chr3R [19972523 19972589] + | 64142 ltNAgt
[7] chr3R [19972918 19973212] + | 64143 ltNAgt
---
seqlengths
chr3R
27905053
The GenomicFeatures package Many public resources provide annotationsabout genomic features For instance the UCSC genome browser maintains thelsquoknownGenersquo track of established exons transcripts and coding sequences ofmany model organisms The GenomicFeatures package provides a way to re-trieve save and query these resources The underlying representation is assqlite data bases but the data are available in R as GRangesList objects Thefollowing exercise explores the GenomicFeatures package and some of the func-tionality for the IRanges family introduced above
12
Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines
Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths
and table to summarize the number of exons in each gene for instance howmany single-exon genes are there
Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo
Solution
gt library(TxDbDmelanogasterUCSCdm3ensGene)
gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias
gt ex0 lt- exonsBy(txdb gene)
gt head(table(elementLengths(ex0)))
1 2 3 4 5 6
3182 2608 2070 1628 1133 886
gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)
gt ex lt- reduce(ex0[ids])
Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise
Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3
Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome
Use Views to create views on to the chromosome that span the start and endcoordinates of all exons
The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons
Look at the helper function and describe what it does
Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183
gt library(BSgenomeDmelanogasterUCSCdm3)
gt nm lt- ascharacter(unique(seqnames(ex[[1]])))
gt chr lt- Dmelanogaster[[nm]]
gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))
13
Using the gcFunction helper function the subject GC content is
gt gcFunction(v)
[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333
[8] 05189394 05110132
The gcFunction is really straight-forward it invokes the function alphabetFre-
quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon
gt gcFunction
function (x)
alf lt- alphabetFrequency(x asprob = TRUE)
rowSums(alf[ c(G C)])
ltenvironment namespaceuseR2012gt
23 Resources
There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group
14
Table 5 Selected Bioconductor packages for sequence reads and alignments
Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-
nipulating fastq files these classes rely heavily onBiostrings
GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads
Rsamtools Provides access to BAM alignment and other largesequence-related files
rtracklayer Input and output of bed wig and similar files
3 Reads and Alignments
The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis
31 The pasilla Data Set
As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences
In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package
32 Reads and the ShortRead Package
Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components
SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
15
+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet
$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~
are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported
FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set
gt fastqDir lt- filepath(bigdata() fastq)
gt fastqFiles lt- dir(fastqDir full=TRUE)
gt fq lt- readFastq(fastqFiles[1])
gt fq
class ShortReadQ
length 1000000 reads width 37 cycles
The data are represented as an object of class ShortReadQ
gt head(sread(fq) 3)
A DNAStringSet instance of length 3
width seq
[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT
[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA
gt head(quality(fq) 3)
class FastqQuality
quality
A BStringSet instance of length 3
width seq
[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+
16
There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance
gt abc lt- alphabetByCycle(sread(fq))
gt abc[14 18]
cycle
alphabet [1] [2] [3] [4] [5] [6] [7] [8]
A 78194 153156 200468 230120 283083 322913 162766 220205
C 439302 265338 362839 251434 203787 220855 253245 287010
G 397671 270342 258739 356003 301640 247090 227811 246684
T 84833 311164 177954 162443 211490 209142 356178 246101
FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample
gt sampler lt- FastqSampler(fastqFiles[1] 1000000)
gt yield(sampler) sample of 1000000 reads
class ShortReadQ
length 1000000 reads width 37 cycles
A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently
ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser
gt qas0 lt- Map(function(fl nm)
+ fq lt- FastqSampler(fl)
+ qa(yield(fq) nm)
+ fastqFiles
+ sub(_subsetfastq basename(fastqFiles)))
gt qas lt- docall(rbind qas0)
gt rpt lt- report(qas dest=tempfile())
gt browseURL(rpt)
A report from a larger subset of the experiment is available
gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml
+ package=useR2012)
gt browseURL(rpt)
17
Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package
Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a
histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each
cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package
and mclapply to read and summarize the GC content of reads in two files inparallel
Solution Discovery
gt dir(bigdata())
[1] bam fastq
gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)
Input
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 4071585 2175 6193578 3308 6193578 3308
Vcells 125856983 9603 279142478 21297 279142148 21297
gt fq lt- readFastq(fls[1])
GC content
gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)
gt sum(alf0[c(G C)])
[1] 05457237
A histogram of the GC content of individual reads is obtained with
gt gc lt- gcFunction(sread(fq))
gt hist(gc)
Alphabet by cycle
gt abc lt- alphabetByCycle(sread(fq))
gt matplot(t(abc[c(A C G T)]) type=l)
Advanced (Mac Linux only) processing on multiple cores
18
gt library(parallel)
gt gc0 lt- mclapply(fls function(fl)
+ fq lt- readFastq(fl)
+ gc lt- gcFunction(sread(fq))
+ table(cut(gc seq(0 1 05)))
+ )
gt simplify list of length 2 to 2-D array
gt gc lt- simplify2array(gc0)
gt matplot(gc type=s)
33 Alignments and the Rsamtools Package
Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data
Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields
gt fl lt- systemfile(extdata ex1sam package=Rsamtools)
gt strsplit(readLines(fl 1) t)[[1]]
[1] B7_591496693509
[2] 73
[3] seq1
[4] 1
[5] 99
[6] 36M
[7]
[8] 0
[9] 0
[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7
[12] MFi18
[13] Aqi73
[14] NMi0
[15] UQi0
[16] H0i1
[17] H1i0
19
The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R
Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)
gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)
gt aln lt- readGappedAlignments(alnFile)
gt head(aln 3)
GappedAlignments with 3 alignments and 0 elementMetadata cols
seqnames strand cigar qwidth start end width
ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt
[1] seq1 + 36M 36 1 36 36
[2] seq1 + 35M 35 3 37 35
[3] seq1 + 35M 35 5 39 35
ngap
ltintegergt
[1] 0
[2] 0
[3] 0
---
seqlengths
seq1 seq2
1575 1584
The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments
A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings
gt table(strand(aln))
+ -
1647 1624
20
gt table(width(aln))
30 31 32 33 34 35 36 38 40
2 21 1 8 37 2804 285 1 112
gt head(sort(table(cigar(aln)) decreasing=TRUE))
35M 36M 40M 34M 33M 14M4I17M
2804 283 112 37 6 4
Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes
Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from
The object ex created earlier contains coordinates of four genes Use coun-
tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to
Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation
Solution We discover the location of files using standard R commands
gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)
gt names(fls) lt- sub(_ basename(fls))
Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data
gt input
gt aln lt- readGappedAlignments(fls[1])
gt xtabs(~seqnames + strand asdataframe(aln))
strand
seqnames + -
chr3L 5402 5974
chrX 2278 2283
To count overlaps in regions defined in a previous exercise load the regions
gt data(ex) from an earlier exercise
Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known
21
gt strand(aln) lt- protocol not strand-aware
For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to
gt hits lt- countOverlaps(aln ex)
gt table(hits)
hits
0 1 2
772 15026 139
and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment
gt cnt lt- countOverlaps(ex aln[hits==1])
A simple function for counting reads is
gt counter lt-
+ function(filePath range)
+
+ aln lt- readGappedAlignments(filePath)
+ strand(aln) lt-
+ hits lt- countOverlaps(aln range)
+ countOverlaps(range aln[hits==1])
+
This can be applied to all files using sapply
gt counts lt- sapply(fls counter ex)
The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is
gt if (require(parallel))
+ simplify2array(mclapply(fls counter ex))
The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies
Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function
Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)
22
Figure 2 GC content in aligned reads
Solution
gt param lt- ScanBamParam(what=seq)
gt seqs lt- scanBam(fls[[1]] param=param)
gt readGC lt- gcFunction(seqs[[1]][[seq]])
gt hist(readGC)
23
Table 6 Selected Bioconductor packages for RNA-seq analysis
Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-
rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data
also limmarsquos roast or camera after transformation withvoom or cqn
easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI
Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments
4 RNA-seq
Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments
The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial
Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use
41 Differential Expression with the edgeR Package
RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of
24
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
on the alignment asking eg how many reads are aligned in a genomic rangedefined by nucleotide coordinates (eg in the exons of a gene) or how manynucleotides from all the aligned reads cover a set of genomic coordinates How isthis type of data the aligned reads and the reference genome to be representedin R in a way that allows for effective computation
The IRanges GenomicRanges and GenomicFeatures Bioconductor pack-ages provide the essential infrastructure for these operations we start with theGRanges class defined in GenomicRanges
GRanges Instances of GRanges are used to specify genomic coordinates Sup-pose we wish to represent two D melanogaster genes The first is located on thepositive strand of chromosome 3R from position 19967117 to 19973212 Thesecond is on the minus strand of the X chromosome with lsquoleft-mostrsquo base at18962306 and right-most base at 18962925 The coordinates are 1-based (iethe first nucleotide on a chromosome is numbered 1 rather than 0) left-most(ie reads on the minus strand are defined to lsquostartrsquo at the left-most coordi-nate rather than the 5rsquo coordinate) and closed (the start and end coordinatesare included in the range a range with identical start and end coordinates haswidth 1 a 0-width range is represented by the special construct where the endcoordinate is one less than the start coordinate)
A complete definition of these genes as GRanges is
gt genes lt- GRanges(seqnames=c(3R X)
+ ranges=IRanges(
+ start=c(19967117 18962306)
+ end=c(19973212 18962925))
+ strand=c(+ -)
+ seqlengths=c(`3R`=27905053L `X`=22422827L))
The components of a GRanges object are defined as vectors eg of seqnamesmuch as one would define a dataframe The start and end coordinates aregrouped into an IRanges instance The optional seqlengths argument specifiesthe maximum size of each sequence in this case the lengths of chromosomes 3Rand X in the lsquodm2rsquo build of D melanogaster genome This data is displayed as
gt genes
GRanges with 2 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] 3R [19967117 19973212] +
[2] X [18962306 18962925] -
---
seqlengths
3R X
27905053 22422827
9
Figure 1 Ranges
For the curious the gene coordinates and sequence lengths are derived fromthe orgDmegdb package for genes with Flybase identifiers FBgn0039155 andFBgn0085359 using the annotation facilities described in section 6
The GRanges class has many useful methods defined on it Consult the helppage
gt GRanges
and package vignettes (especially lsquoAn Introduction to GenomicRangesrsquo)
gt vignette(package=GenomicRanges)
for a comprehensive introduction
Operations on ranges The GRanges class has many useful methods fromthe IRanges class some of these methods are illustrated here We use IRangesto illustrate these operations to avoid complexities associated with strand andseqname but the operations are comparable on GRanges We begin with asimple set of ranges
gt ir lt- IRanges(start=c(7 9 12 14 2224)
+ end=c(15 11 12 18 26 27 28))
These and some common operations are illustrated in the upper panel of Figure1 and summarized in Table 4
elementMetadata (values) The GRanges class (actually most of the data struc-tures defined or extending those in the IRanges package) has two additional veryuseful data components The elementMetadata function (or its synonym values)allows information on each range to be stored and manipulated (eg subset)along with the GRanges instance The element metadata is represented as aDataFrame defined in IRanges and acting like a standard R dataframe butwith the ability to hold more complicated data structures as columns (and withelement metadata of its own providing an enhanced alternative to the Biobaseclass AnnotatedDataFrame)
10
Table 4 Common operations on IRanges GRanges and GRangesList
Category Function DescriptionAccessors start end width Get or s et the starts ends and widths
names Get or set the nameselementMetadata metadata Get or set metadata on elements or objectlength Number of ranges in the vectorrange Range formed from min start and max end
Ordering lt lt= gt gt= == = Compare ranges ordering by start then widthsort order rank Sort by the orderingduplicated Find ranges with multiple instancesunique Find unique instances removing duplicates
Arithmetic r + x r - x r x Shrink or expand ranges r by number x
shift Move the ranges by specified amountresize Change width ancoring on start end or middistance Separation between ranges (closest endpoints)restrict Clamp ranges to within some start and endflank Generate adjacent regions on start or end
Set operations reduce Merge overlapping and adjacent rangesintersect union setdiff Set operations on reduced rangespintersect punion psetdiff Parallel set operations on each x[i] y[i]gaps pgap Find regions not covered by reduced rangesdisjoin Ranges formed from union of endpoints
Overlaps findOverlaps Find all overlaps for each x in y
countOverlaps Count overlaps of each x range in y
nearest Find nearest neighbors (closest endpoints)precede follow Find nearest y that x precedes or followsx in y Find ranges in x that overlap range in y
Coverage coverage Count ranges covering each positionExtraction r[i] Get or set by logical or numeric index
r[[i]] Get integer sequence from start[i] to end[i]
subsetByOverlaps Subset x for those that overlap in y
head tail rev rep Conventional R semanticsSplit combine split Split ranges by a factor into a RangesList
c Concatenate two or more range objects
11
gt elementMetadata(genes) lt-
+ DataFrame(EntrezId=c(42865 2768869)
+ Symbol=c(kal-1 CG34330))
metadata allows addition of information to the entire object The information isin the form of a list any data can be provided
gt metadata(genes) lt-
+ list(CreatedBy=A User Date=date())
GRangesList The GRanges class is extremely useful for representing simpleranges Some next-generation sequence data and genomic features are morehierarchically structured A gene may be represented by seven exons within itAn aligned read may be represented by discontinuous ranges of alignment to areference The GRangesList class represents this type of information It is alist-like data structure which each element of the list itself a GRanges instanceThe gene FBgn0039155 contains several exons and can be represented as a listof length 1 where the element of the list contains a GRanges object with 7elements
GRangesList of length 1
$FBgn0039155
GRanges with 7 ranges and 2 elementMetadata cols
seqnames ranges strand | exon_id exon_name
ltRlegt ltIRangesgt ltRlegt | ltintegergt ltcharactergt
[1] chr3R [19967117 19967382] + | 64137 ltNAgt
[2] chr3R [19970915 19971592] + | 64138 ltNAgt
[3] chr3R [19971652 19971770] + | 64139 ltNAgt
[4] chr3R [19971831 19972024] + | 64140 ltNAgt
[5] chr3R [19972088 19972461] + | 64141 ltNAgt
[6] chr3R [19972523 19972589] + | 64142 ltNAgt
[7] chr3R [19972918 19973212] + | 64143 ltNAgt
---
seqlengths
chr3R
27905053
The GenomicFeatures package Many public resources provide annotationsabout genomic features For instance the UCSC genome browser maintains thelsquoknownGenersquo track of established exons transcripts and coding sequences ofmany model organisms The GenomicFeatures package provides a way to re-trieve save and query these resources The underlying representation is assqlite data bases but the data are available in R as GRangesList objects Thefollowing exercise explores the GenomicFeatures package and some of the func-tionality for the IRanges family introduced above
12
Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines
Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths
and table to summarize the number of exons in each gene for instance howmany single-exon genes are there
Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo
Solution
gt library(TxDbDmelanogasterUCSCdm3ensGene)
gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias
gt ex0 lt- exonsBy(txdb gene)
gt head(table(elementLengths(ex0)))
1 2 3 4 5 6
3182 2608 2070 1628 1133 886
gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)
gt ex lt- reduce(ex0[ids])
Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise
Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3
Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome
Use Views to create views on to the chromosome that span the start and endcoordinates of all exons
The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons
Look at the helper function and describe what it does
Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183
gt library(BSgenomeDmelanogasterUCSCdm3)
gt nm lt- ascharacter(unique(seqnames(ex[[1]])))
gt chr lt- Dmelanogaster[[nm]]
gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))
13
Using the gcFunction helper function the subject GC content is
gt gcFunction(v)
[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333
[8] 05189394 05110132
The gcFunction is really straight-forward it invokes the function alphabetFre-
quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon
gt gcFunction
function (x)
alf lt- alphabetFrequency(x asprob = TRUE)
rowSums(alf[ c(G C)])
ltenvironment namespaceuseR2012gt
23 Resources
There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group
14
Table 5 Selected Bioconductor packages for sequence reads and alignments
Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-
nipulating fastq files these classes rely heavily onBiostrings
GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads
Rsamtools Provides access to BAM alignment and other largesequence-related files
rtracklayer Input and output of bed wig and similar files
3 Reads and Alignments
The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis
31 The pasilla Data Set
As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences
In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package
32 Reads and the ShortRead Package
Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components
SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
15
+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet
$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~
are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported
FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set
gt fastqDir lt- filepath(bigdata() fastq)
gt fastqFiles lt- dir(fastqDir full=TRUE)
gt fq lt- readFastq(fastqFiles[1])
gt fq
class ShortReadQ
length 1000000 reads width 37 cycles
The data are represented as an object of class ShortReadQ
gt head(sread(fq) 3)
A DNAStringSet instance of length 3
width seq
[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT
[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA
gt head(quality(fq) 3)
class FastqQuality
quality
A BStringSet instance of length 3
width seq
[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+
16
There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance
gt abc lt- alphabetByCycle(sread(fq))
gt abc[14 18]
cycle
alphabet [1] [2] [3] [4] [5] [6] [7] [8]
A 78194 153156 200468 230120 283083 322913 162766 220205
C 439302 265338 362839 251434 203787 220855 253245 287010
G 397671 270342 258739 356003 301640 247090 227811 246684
T 84833 311164 177954 162443 211490 209142 356178 246101
FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample
gt sampler lt- FastqSampler(fastqFiles[1] 1000000)
gt yield(sampler) sample of 1000000 reads
class ShortReadQ
length 1000000 reads width 37 cycles
A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently
ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser
gt qas0 lt- Map(function(fl nm)
+ fq lt- FastqSampler(fl)
+ qa(yield(fq) nm)
+ fastqFiles
+ sub(_subsetfastq basename(fastqFiles)))
gt qas lt- docall(rbind qas0)
gt rpt lt- report(qas dest=tempfile())
gt browseURL(rpt)
A report from a larger subset of the experiment is available
gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml
+ package=useR2012)
gt browseURL(rpt)
17
Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package
Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a
histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each
cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package
and mclapply to read and summarize the GC content of reads in two files inparallel
Solution Discovery
gt dir(bigdata())
[1] bam fastq
gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)
Input
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 4071585 2175 6193578 3308 6193578 3308
Vcells 125856983 9603 279142478 21297 279142148 21297
gt fq lt- readFastq(fls[1])
GC content
gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)
gt sum(alf0[c(G C)])
[1] 05457237
A histogram of the GC content of individual reads is obtained with
gt gc lt- gcFunction(sread(fq))
gt hist(gc)
Alphabet by cycle
gt abc lt- alphabetByCycle(sread(fq))
gt matplot(t(abc[c(A C G T)]) type=l)
Advanced (Mac Linux only) processing on multiple cores
18
gt library(parallel)
gt gc0 lt- mclapply(fls function(fl)
+ fq lt- readFastq(fl)
+ gc lt- gcFunction(sread(fq))
+ table(cut(gc seq(0 1 05)))
+ )
gt simplify list of length 2 to 2-D array
gt gc lt- simplify2array(gc0)
gt matplot(gc type=s)
33 Alignments and the Rsamtools Package
Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data
Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields
gt fl lt- systemfile(extdata ex1sam package=Rsamtools)
gt strsplit(readLines(fl 1) t)[[1]]
[1] B7_591496693509
[2] 73
[3] seq1
[4] 1
[5] 99
[6] 36M
[7]
[8] 0
[9] 0
[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7
[12] MFi18
[13] Aqi73
[14] NMi0
[15] UQi0
[16] H0i1
[17] H1i0
19
The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R
Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)
gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)
gt aln lt- readGappedAlignments(alnFile)
gt head(aln 3)
GappedAlignments with 3 alignments and 0 elementMetadata cols
seqnames strand cigar qwidth start end width
ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt
[1] seq1 + 36M 36 1 36 36
[2] seq1 + 35M 35 3 37 35
[3] seq1 + 35M 35 5 39 35
ngap
ltintegergt
[1] 0
[2] 0
[3] 0
---
seqlengths
seq1 seq2
1575 1584
The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments
A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings
gt table(strand(aln))
+ -
1647 1624
20
gt table(width(aln))
30 31 32 33 34 35 36 38 40
2 21 1 8 37 2804 285 1 112
gt head(sort(table(cigar(aln)) decreasing=TRUE))
35M 36M 40M 34M 33M 14M4I17M
2804 283 112 37 6 4
Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes
Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from
The object ex created earlier contains coordinates of four genes Use coun-
tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to
Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation
Solution We discover the location of files using standard R commands
gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)
gt names(fls) lt- sub(_ basename(fls))
Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data
gt input
gt aln lt- readGappedAlignments(fls[1])
gt xtabs(~seqnames + strand asdataframe(aln))
strand
seqnames + -
chr3L 5402 5974
chrX 2278 2283
To count overlaps in regions defined in a previous exercise load the regions
gt data(ex) from an earlier exercise
Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known
21
gt strand(aln) lt- protocol not strand-aware
For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to
gt hits lt- countOverlaps(aln ex)
gt table(hits)
hits
0 1 2
772 15026 139
and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment
gt cnt lt- countOverlaps(ex aln[hits==1])
A simple function for counting reads is
gt counter lt-
+ function(filePath range)
+
+ aln lt- readGappedAlignments(filePath)
+ strand(aln) lt-
+ hits lt- countOverlaps(aln range)
+ countOverlaps(range aln[hits==1])
+
This can be applied to all files using sapply
gt counts lt- sapply(fls counter ex)
The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is
gt if (require(parallel))
+ simplify2array(mclapply(fls counter ex))
The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies
Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function
Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)
22
Figure 2 GC content in aligned reads
Solution
gt param lt- ScanBamParam(what=seq)
gt seqs lt- scanBam(fls[[1]] param=param)
gt readGC lt- gcFunction(seqs[[1]][[seq]])
gt hist(readGC)
23
Table 6 Selected Bioconductor packages for RNA-seq analysis
Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-
rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data
also limmarsquos roast or camera after transformation withvoom or cqn
easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI
Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments
4 RNA-seq
Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments
The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial
Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use
41 Differential Expression with the edgeR Package
RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of
24
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
Figure 1 Ranges
For the curious the gene coordinates and sequence lengths are derived fromthe orgDmegdb package for genes with Flybase identifiers FBgn0039155 andFBgn0085359 using the annotation facilities described in section 6
The GRanges class has many useful methods defined on it Consult the helppage
gt GRanges
and package vignettes (especially lsquoAn Introduction to GenomicRangesrsquo)
gt vignette(package=GenomicRanges)
for a comprehensive introduction
Operations on ranges The GRanges class has many useful methods fromthe IRanges class some of these methods are illustrated here We use IRangesto illustrate these operations to avoid complexities associated with strand andseqname but the operations are comparable on GRanges We begin with asimple set of ranges
gt ir lt- IRanges(start=c(7 9 12 14 2224)
+ end=c(15 11 12 18 26 27 28))
These and some common operations are illustrated in the upper panel of Figure1 and summarized in Table 4
elementMetadata (values) The GRanges class (actually most of the data struc-tures defined or extending those in the IRanges package) has two additional veryuseful data components The elementMetadata function (or its synonym values)allows information on each range to be stored and manipulated (eg subset)along with the GRanges instance The element metadata is represented as aDataFrame defined in IRanges and acting like a standard R dataframe butwith the ability to hold more complicated data structures as columns (and withelement metadata of its own providing an enhanced alternative to the Biobaseclass AnnotatedDataFrame)
10
Table 4 Common operations on IRanges GRanges and GRangesList
Category Function DescriptionAccessors start end width Get or s et the starts ends and widths
names Get or set the nameselementMetadata metadata Get or set metadata on elements or objectlength Number of ranges in the vectorrange Range formed from min start and max end
Ordering lt lt= gt gt= == = Compare ranges ordering by start then widthsort order rank Sort by the orderingduplicated Find ranges with multiple instancesunique Find unique instances removing duplicates
Arithmetic r + x r - x r x Shrink or expand ranges r by number x
shift Move the ranges by specified amountresize Change width ancoring on start end or middistance Separation between ranges (closest endpoints)restrict Clamp ranges to within some start and endflank Generate adjacent regions on start or end
Set operations reduce Merge overlapping and adjacent rangesintersect union setdiff Set operations on reduced rangespintersect punion psetdiff Parallel set operations on each x[i] y[i]gaps pgap Find regions not covered by reduced rangesdisjoin Ranges formed from union of endpoints
Overlaps findOverlaps Find all overlaps for each x in y
countOverlaps Count overlaps of each x range in y
nearest Find nearest neighbors (closest endpoints)precede follow Find nearest y that x precedes or followsx in y Find ranges in x that overlap range in y
Coverage coverage Count ranges covering each positionExtraction r[i] Get or set by logical or numeric index
r[[i]] Get integer sequence from start[i] to end[i]
subsetByOverlaps Subset x for those that overlap in y
head tail rev rep Conventional R semanticsSplit combine split Split ranges by a factor into a RangesList
c Concatenate two or more range objects
11
gt elementMetadata(genes) lt-
+ DataFrame(EntrezId=c(42865 2768869)
+ Symbol=c(kal-1 CG34330))
metadata allows addition of information to the entire object The information isin the form of a list any data can be provided
gt metadata(genes) lt-
+ list(CreatedBy=A User Date=date())
GRangesList The GRanges class is extremely useful for representing simpleranges Some next-generation sequence data and genomic features are morehierarchically structured A gene may be represented by seven exons within itAn aligned read may be represented by discontinuous ranges of alignment to areference The GRangesList class represents this type of information It is alist-like data structure which each element of the list itself a GRanges instanceThe gene FBgn0039155 contains several exons and can be represented as a listof length 1 where the element of the list contains a GRanges object with 7elements
GRangesList of length 1
$FBgn0039155
GRanges with 7 ranges and 2 elementMetadata cols
seqnames ranges strand | exon_id exon_name
ltRlegt ltIRangesgt ltRlegt | ltintegergt ltcharactergt
[1] chr3R [19967117 19967382] + | 64137 ltNAgt
[2] chr3R [19970915 19971592] + | 64138 ltNAgt
[3] chr3R [19971652 19971770] + | 64139 ltNAgt
[4] chr3R [19971831 19972024] + | 64140 ltNAgt
[5] chr3R [19972088 19972461] + | 64141 ltNAgt
[6] chr3R [19972523 19972589] + | 64142 ltNAgt
[7] chr3R [19972918 19973212] + | 64143 ltNAgt
---
seqlengths
chr3R
27905053
The GenomicFeatures package Many public resources provide annotationsabout genomic features For instance the UCSC genome browser maintains thelsquoknownGenersquo track of established exons transcripts and coding sequences ofmany model organisms The GenomicFeatures package provides a way to re-trieve save and query these resources The underlying representation is assqlite data bases but the data are available in R as GRangesList objects Thefollowing exercise explores the GenomicFeatures package and some of the func-tionality for the IRanges family introduced above
12
Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines
Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths
and table to summarize the number of exons in each gene for instance howmany single-exon genes are there
Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo
Solution
gt library(TxDbDmelanogasterUCSCdm3ensGene)
gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias
gt ex0 lt- exonsBy(txdb gene)
gt head(table(elementLengths(ex0)))
1 2 3 4 5 6
3182 2608 2070 1628 1133 886
gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)
gt ex lt- reduce(ex0[ids])
Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise
Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3
Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome
Use Views to create views on to the chromosome that span the start and endcoordinates of all exons
The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons
Look at the helper function and describe what it does
Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183
gt library(BSgenomeDmelanogasterUCSCdm3)
gt nm lt- ascharacter(unique(seqnames(ex[[1]])))
gt chr lt- Dmelanogaster[[nm]]
gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))
13
Using the gcFunction helper function the subject GC content is
gt gcFunction(v)
[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333
[8] 05189394 05110132
The gcFunction is really straight-forward it invokes the function alphabetFre-
quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon
gt gcFunction
function (x)
alf lt- alphabetFrequency(x asprob = TRUE)
rowSums(alf[ c(G C)])
ltenvironment namespaceuseR2012gt
23 Resources
There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group
14
Table 5 Selected Bioconductor packages for sequence reads and alignments
Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-
nipulating fastq files these classes rely heavily onBiostrings
GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads
Rsamtools Provides access to BAM alignment and other largesequence-related files
rtracklayer Input and output of bed wig and similar files
3 Reads and Alignments
The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis
31 The pasilla Data Set
As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences
In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package
32 Reads and the ShortRead Package
Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components
SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
15
+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet
$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~
are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported
FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set
gt fastqDir lt- filepath(bigdata() fastq)
gt fastqFiles lt- dir(fastqDir full=TRUE)
gt fq lt- readFastq(fastqFiles[1])
gt fq
class ShortReadQ
length 1000000 reads width 37 cycles
The data are represented as an object of class ShortReadQ
gt head(sread(fq) 3)
A DNAStringSet instance of length 3
width seq
[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT
[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA
gt head(quality(fq) 3)
class FastqQuality
quality
A BStringSet instance of length 3
width seq
[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+
16
There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance
gt abc lt- alphabetByCycle(sread(fq))
gt abc[14 18]
cycle
alphabet [1] [2] [3] [4] [5] [6] [7] [8]
A 78194 153156 200468 230120 283083 322913 162766 220205
C 439302 265338 362839 251434 203787 220855 253245 287010
G 397671 270342 258739 356003 301640 247090 227811 246684
T 84833 311164 177954 162443 211490 209142 356178 246101
FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample
gt sampler lt- FastqSampler(fastqFiles[1] 1000000)
gt yield(sampler) sample of 1000000 reads
class ShortReadQ
length 1000000 reads width 37 cycles
A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently
ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser
gt qas0 lt- Map(function(fl nm)
+ fq lt- FastqSampler(fl)
+ qa(yield(fq) nm)
+ fastqFiles
+ sub(_subsetfastq basename(fastqFiles)))
gt qas lt- docall(rbind qas0)
gt rpt lt- report(qas dest=tempfile())
gt browseURL(rpt)
A report from a larger subset of the experiment is available
gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml
+ package=useR2012)
gt browseURL(rpt)
17
Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package
Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a
histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each
cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package
and mclapply to read and summarize the GC content of reads in two files inparallel
Solution Discovery
gt dir(bigdata())
[1] bam fastq
gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)
Input
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 4071585 2175 6193578 3308 6193578 3308
Vcells 125856983 9603 279142478 21297 279142148 21297
gt fq lt- readFastq(fls[1])
GC content
gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)
gt sum(alf0[c(G C)])
[1] 05457237
A histogram of the GC content of individual reads is obtained with
gt gc lt- gcFunction(sread(fq))
gt hist(gc)
Alphabet by cycle
gt abc lt- alphabetByCycle(sread(fq))
gt matplot(t(abc[c(A C G T)]) type=l)
Advanced (Mac Linux only) processing on multiple cores
18
gt library(parallel)
gt gc0 lt- mclapply(fls function(fl)
+ fq lt- readFastq(fl)
+ gc lt- gcFunction(sread(fq))
+ table(cut(gc seq(0 1 05)))
+ )
gt simplify list of length 2 to 2-D array
gt gc lt- simplify2array(gc0)
gt matplot(gc type=s)
33 Alignments and the Rsamtools Package
Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data
Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields
gt fl lt- systemfile(extdata ex1sam package=Rsamtools)
gt strsplit(readLines(fl 1) t)[[1]]
[1] B7_591496693509
[2] 73
[3] seq1
[4] 1
[5] 99
[6] 36M
[7]
[8] 0
[9] 0
[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7
[12] MFi18
[13] Aqi73
[14] NMi0
[15] UQi0
[16] H0i1
[17] H1i0
19
The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R
Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)
gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)
gt aln lt- readGappedAlignments(alnFile)
gt head(aln 3)
GappedAlignments with 3 alignments and 0 elementMetadata cols
seqnames strand cigar qwidth start end width
ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt
[1] seq1 + 36M 36 1 36 36
[2] seq1 + 35M 35 3 37 35
[3] seq1 + 35M 35 5 39 35
ngap
ltintegergt
[1] 0
[2] 0
[3] 0
---
seqlengths
seq1 seq2
1575 1584
The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments
A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings
gt table(strand(aln))
+ -
1647 1624
20
gt table(width(aln))
30 31 32 33 34 35 36 38 40
2 21 1 8 37 2804 285 1 112
gt head(sort(table(cigar(aln)) decreasing=TRUE))
35M 36M 40M 34M 33M 14M4I17M
2804 283 112 37 6 4
Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes
Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from
The object ex created earlier contains coordinates of four genes Use coun-
tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to
Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation
Solution We discover the location of files using standard R commands
gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)
gt names(fls) lt- sub(_ basename(fls))
Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data
gt input
gt aln lt- readGappedAlignments(fls[1])
gt xtabs(~seqnames + strand asdataframe(aln))
strand
seqnames + -
chr3L 5402 5974
chrX 2278 2283
To count overlaps in regions defined in a previous exercise load the regions
gt data(ex) from an earlier exercise
Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known
21
gt strand(aln) lt- protocol not strand-aware
For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to
gt hits lt- countOverlaps(aln ex)
gt table(hits)
hits
0 1 2
772 15026 139
and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment
gt cnt lt- countOverlaps(ex aln[hits==1])
A simple function for counting reads is
gt counter lt-
+ function(filePath range)
+
+ aln lt- readGappedAlignments(filePath)
+ strand(aln) lt-
+ hits lt- countOverlaps(aln range)
+ countOverlaps(range aln[hits==1])
+
This can be applied to all files using sapply
gt counts lt- sapply(fls counter ex)
The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is
gt if (require(parallel))
+ simplify2array(mclapply(fls counter ex))
The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies
Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function
Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)
22
Figure 2 GC content in aligned reads
Solution
gt param lt- ScanBamParam(what=seq)
gt seqs lt- scanBam(fls[[1]] param=param)
gt readGC lt- gcFunction(seqs[[1]][[seq]])
gt hist(readGC)
23
Table 6 Selected Bioconductor packages for RNA-seq analysis
Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-
rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data
also limmarsquos roast or camera after transformation withvoom or cqn
easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI
Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments
4 RNA-seq
Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments
The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial
Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use
41 Differential Expression with the edgeR Package
RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of
24
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
Table 4 Common operations on IRanges GRanges and GRangesList
Category Function DescriptionAccessors start end width Get or s et the starts ends and widths
names Get or set the nameselementMetadata metadata Get or set metadata on elements or objectlength Number of ranges in the vectorrange Range formed from min start and max end
Ordering lt lt= gt gt= == = Compare ranges ordering by start then widthsort order rank Sort by the orderingduplicated Find ranges with multiple instancesunique Find unique instances removing duplicates
Arithmetic r + x r - x r x Shrink or expand ranges r by number x
shift Move the ranges by specified amountresize Change width ancoring on start end or middistance Separation between ranges (closest endpoints)restrict Clamp ranges to within some start and endflank Generate adjacent regions on start or end
Set operations reduce Merge overlapping and adjacent rangesintersect union setdiff Set operations on reduced rangespintersect punion psetdiff Parallel set operations on each x[i] y[i]gaps pgap Find regions not covered by reduced rangesdisjoin Ranges formed from union of endpoints
Overlaps findOverlaps Find all overlaps for each x in y
countOverlaps Count overlaps of each x range in y
nearest Find nearest neighbors (closest endpoints)precede follow Find nearest y that x precedes or followsx in y Find ranges in x that overlap range in y
Coverage coverage Count ranges covering each positionExtraction r[i] Get or set by logical or numeric index
r[[i]] Get integer sequence from start[i] to end[i]
subsetByOverlaps Subset x for those that overlap in y
head tail rev rep Conventional R semanticsSplit combine split Split ranges by a factor into a RangesList
c Concatenate two or more range objects
11
gt elementMetadata(genes) lt-
+ DataFrame(EntrezId=c(42865 2768869)
+ Symbol=c(kal-1 CG34330))
metadata allows addition of information to the entire object The information isin the form of a list any data can be provided
gt metadata(genes) lt-
+ list(CreatedBy=A User Date=date())
GRangesList The GRanges class is extremely useful for representing simpleranges Some next-generation sequence data and genomic features are morehierarchically structured A gene may be represented by seven exons within itAn aligned read may be represented by discontinuous ranges of alignment to areference The GRangesList class represents this type of information It is alist-like data structure which each element of the list itself a GRanges instanceThe gene FBgn0039155 contains several exons and can be represented as a listof length 1 where the element of the list contains a GRanges object with 7elements
GRangesList of length 1
$FBgn0039155
GRanges with 7 ranges and 2 elementMetadata cols
seqnames ranges strand | exon_id exon_name
ltRlegt ltIRangesgt ltRlegt | ltintegergt ltcharactergt
[1] chr3R [19967117 19967382] + | 64137 ltNAgt
[2] chr3R [19970915 19971592] + | 64138 ltNAgt
[3] chr3R [19971652 19971770] + | 64139 ltNAgt
[4] chr3R [19971831 19972024] + | 64140 ltNAgt
[5] chr3R [19972088 19972461] + | 64141 ltNAgt
[6] chr3R [19972523 19972589] + | 64142 ltNAgt
[7] chr3R [19972918 19973212] + | 64143 ltNAgt
---
seqlengths
chr3R
27905053
The GenomicFeatures package Many public resources provide annotationsabout genomic features For instance the UCSC genome browser maintains thelsquoknownGenersquo track of established exons transcripts and coding sequences ofmany model organisms The GenomicFeatures package provides a way to re-trieve save and query these resources The underlying representation is assqlite data bases but the data are available in R as GRangesList objects Thefollowing exercise explores the GenomicFeatures package and some of the func-tionality for the IRanges family introduced above
12
Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines
Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths
and table to summarize the number of exons in each gene for instance howmany single-exon genes are there
Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo
Solution
gt library(TxDbDmelanogasterUCSCdm3ensGene)
gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias
gt ex0 lt- exonsBy(txdb gene)
gt head(table(elementLengths(ex0)))
1 2 3 4 5 6
3182 2608 2070 1628 1133 886
gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)
gt ex lt- reduce(ex0[ids])
Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise
Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3
Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome
Use Views to create views on to the chromosome that span the start and endcoordinates of all exons
The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons
Look at the helper function and describe what it does
Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183
gt library(BSgenomeDmelanogasterUCSCdm3)
gt nm lt- ascharacter(unique(seqnames(ex[[1]])))
gt chr lt- Dmelanogaster[[nm]]
gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))
13
Using the gcFunction helper function the subject GC content is
gt gcFunction(v)
[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333
[8] 05189394 05110132
The gcFunction is really straight-forward it invokes the function alphabetFre-
quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon
gt gcFunction
function (x)
alf lt- alphabetFrequency(x asprob = TRUE)
rowSums(alf[ c(G C)])
ltenvironment namespaceuseR2012gt
23 Resources
There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group
14
Table 5 Selected Bioconductor packages for sequence reads and alignments
Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-
nipulating fastq files these classes rely heavily onBiostrings
GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads
Rsamtools Provides access to BAM alignment and other largesequence-related files
rtracklayer Input and output of bed wig and similar files
3 Reads and Alignments
The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis
31 The pasilla Data Set
As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences
In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package
32 Reads and the ShortRead Package
Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components
SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
15
+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet
$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~
are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported
FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set
gt fastqDir lt- filepath(bigdata() fastq)
gt fastqFiles lt- dir(fastqDir full=TRUE)
gt fq lt- readFastq(fastqFiles[1])
gt fq
class ShortReadQ
length 1000000 reads width 37 cycles
The data are represented as an object of class ShortReadQ
gt head(sread(fq) 3)
A DNAStringSet instance of length 3
width seq
[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT
[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA
gt head(quality(fq) 3)
class FastqQuality
quality
A BStringSet instance of length 3
width seq
[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+
16
There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance
gt abc lt- alphabetByCycle(sread(fq))
gt abc[14 18]
cycle
alphabet [1] [2] [3] [4] [5] [6] [7] [8]
A 78194 153156 200468 230120 283083 322913 162766 220205
C 439302 265338 362839 251434 203787 220855 253245 287010
G 397671 270342 258739 356003 301640 247090 227811 246684
T 84833 311164 177954 162443 211490 209142 356178 246101
FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample
gt sampler lt- FastqSampler(fastqFiles[1] 1000000)
gt yield(sampler) sample of 1000000 reads
class ShortReadQ
length 1000000 reads width 37 cycles
A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently
ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser
gt qas0 lt- Map(function(fl nm)
+ fq lt- FastqSampler(fl)
+ qa(yield(fq) nm)
+ fastqFiles
+ sub(_subsetfastq basename(fastqFiles)))
gt qas lt- docall(rbind qas0)
gt rpt lt- report(qas dest=tempfile())
gt browseURL(rpt)
A report from a larger subset of the experiment is available
gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml
+ package=useR2012)
gt browseURL(rpt)
17
Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package
Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a
histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each
cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package
and mclapply to read and summarize the GC content of reads in two files inparallel
Solution Discovery
gt dir(bigdata())
[1] bam fastq
gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)
Input
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 4071585 2175 6193578 3308 6193578 3308
Vcells 125856983 9603 279142478 21297 279142148 21297
gt fq lt- readFastq(fls[1])
GC content
gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)
gt sum(alf0[c(G C)])
[1] 05457237
A histogram of the GC content of individual reads is obtained with
gt gc lt- gcFunction(sread(fq))
gt hist(gc)
Alphabet by cycle
gt abc lt- alphabetByCycle(sread(fq))
gt matplot(t(abc[c(A C G T)]) type=l)
Advanced (Mac Linux only) processing on multiple cores
18
gt library(parallel)
gt gc0 lt- mclapply(fls function(fl)
+ fq lt- readFastq(fl)
+ gc lt- gcFunction(sread(fq))
+ table(cut(gc seq(0 1 05)))
+ )
gt simplify list of length 2 to 2-D array
gt gc lt- simplify2array(gc0)
gt matplot(gc type=s)
33 Alignments and the Rsamtools Package
Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data
Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields
gt fl lt- systemfile(extdata ex1sam package=Rsamtools)
gt strsplit(readLines(fl 1) t)[[1]]
[1] B7_591496693509
[2] 73
[3] seq1
[4] 1
[5] 99
[6] 36M
[7]
[8] 0
[9] 0
[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7
[12] MFi18
[13] Aqi73
[14] NMi0
[15] UQi0
[16] H0i1
[17] H1i0
19
The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R
Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)
gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)
gt aln lt- readGappedAlignments(alnFile)
gt head(aln 3)
GappedAlignments with 3 alignments and 0 elementMetadata cols
seqnames strand cigar qwidth start end width
ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt
[1] seq1 + 36M 36 1 36 36
[2] seq1 + 35M 35 3 37 35
[3] seq1 + 35M 35 5 39 35
ngap
ltintegergt
[1] 0
[2] 0
[3] 0
---
seqlengths
seq1 seq2
1575 1584
The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments
A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings
gt table(strand(aln))
+ -
1647 1624
20
gt table(width(aln))
30 31 32 33 34 35 36 38 40
2 21 1 8 37 2804 285 1 112
gt head(sort(table(cigar(aln)) decreasing=TRUE))
35M 36M 40M 34M 33M 14M4I17M
2804 283 112 37 6 4
Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes
Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from
The object ex created earlier contains coordinates of four genes Use coun-
tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to
Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation
Solution We discover the location of files using standard R commands
gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)
gt names(fls) lt- sub(_ basename(fls))
Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data
gt input
gt aln lt- readGappedAlignments(fls[1])
gt xtabs(~seqnames + strand asdataframe(aln))
strand
seqnames + -
chr3L 5402 5974
chrX 2278 2283
To count overlaps in regions defined in a previous exercise load the regions
gt data(ex) from an earlier exercise
Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known
21
gt strand(aln) lt- protocol not strand-aware
For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to
gt hits lt- countOverlaps(aln ex)
gt table(hits)
hits
0 1 2
772 15026 139
and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment
gt cnt lt- countOverlaps(ex aln[hits==1])
A simple function for counting reads is
gt counter lt-
+ function(filePath range)
+
+ aln lt- readGappedAlignments(filePath)
+ strand(aln) lt-
+ hits lt- countOverlaps(aln range)
+ countOverlaps(range aln[hits==1])
+
This can be applied to all files using sapply
gt counts lt- sapply(fls counter ex)
The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is
gt if (require(parallel))
+ simplify2array(mclapply(fls counter ex))
The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies
Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function
Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)
22
Figure 2 GC content in aligned reads
Solution
gt param lt- ScanBamParam(what=seq)
gt seqs lt- scanBam(fls[[1]] param=param)
gt readGC lt- gcFunction(seqs[[1]][[seq]])
gt hist(readGC)
23
Table 6 Selected Bioconductor packages for RNA-seq analysis
Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-
rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data
also limmarsquos roast or camera after transformation withvoom or cqn
easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI
Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments
4 RNA-seq
Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments
The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial
Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use
41 Differential Expression with the edgeR Package
RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of
24
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
gt elementMetadata(genes) lt-
+ DataFrame(EntrezId=c(42865 2768869)
+ Symbol=c(kal-1 CG34330))
metadata allows addition of information to the entire object The information isin the form of a list any data can be provided
gt metadata(genes) lt-
+ list(CreatedBy=A User Date=date())
GRangesList The GRanges class is extremely useful for representing simpleranges Some next-generation sequence data and genomic features are morehierarchically structured A gene may be represented by seven exons within itAn aligned read may be represented by discontinuous ranges of alignment to areference The GRangesList class represents this type of information It is alist-like data structure which each element of the list itself a GRanges instanceThe gene FBgn0039155 contains several exons and can be represented as a listof length 1 where the element of the list contains a GRanges object with 7elements
GRangesList of length 1
$FBgn0039155
GRanges with 7 ranges and 2 elementMetadata cols
seqnames ranges strand | exon_id exon_name
ltRlegt ltIRangesgt ltRlegt | ltintegergt ltcharactergt
[1] chr3R [19967117 19967382] + | 64137 ltNAgt
[2] chr3R [19970915 19971592] + | 64138 ltNAgt
[3] chr3R [19971652 19971770] + | 64139 ltNAgt
[4] chr3R [19971831 19972024] + | 64140 ltNAgt
[5] chr3R [19972088 19972461] + | 64141 ltNAgt
[6] chr3R [19972523 19972589] + | 64142 ltNAgt
[7] chr3R [19972918 19973212] + | 64143 ltNAgt
---
seqlengths
chr3R
27905053
The GenomicFeatures package Many public resources provide annotationsabout genomic features For instance the UCSC genome browser maintains thelsquoknownGenersquo track of established exons transcripts and coding sequences ofmany model organisms The GenomicFeatures package provides a way to re-trieve save and query these resources The underlying representation is assqlite data bases but the data are available in R as GRangesList objects Thefollowing exercise explores the GenomicFeatures package and some of the func-tionality for the IRanges family introduced above
12
Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines
Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths
and table to summarize the number of exons in each gene for instance howmany single-exon genes are there
Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo
Solution
gt library(TxDbDmelanogasterUCSCdm3ensGene)
gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias
gt ex0 lt- exonsBy(txdb gene)
gt head(table(elementLengths(ex0)))
1 2 3 4 5 6
3182 2608 2070 1628 1133 886
gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)
gt ex lt- reduce(ex0[ids])
Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise
Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3
Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome
Use Views to create views on to the chromosome that span the start and endcoordinates of all exons
The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons
Look at the helper function and describe what it does
Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183
gt library(BSgenomeDmelanogasterUCSCdm3)
gt nm lt- ascharacter(unique(seqnames(ex[[1]])))
gt chr lt- Dmelanogaster[[nm]]
gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))
13
Using the gcFunction helper function the subject GC content is
gt gcFunction(v)
[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333
[8] 05189394 05110132
The gcFunction is really straight-forward it invokes the function alphabetFre-
quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon
gt gcFunction
function (x)
alf lt- alphabetFrequency(x asprob = TRUE)
rowSums(alf[ c(G C)])
ltenvironment namespaceuseR2012gt
23 Resources
There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group
14
Table 5 Selected Bioconductor packages for sequence reads and alignments
Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-
nipulating fastq files these classes rely heavily onBiostrings
GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads
Rsamtools Provides access to BAM alignment and other largesequence-related files
rtracklayer Input and output of bed wig and similar files
3 Reads and Alignments
The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis
31 The pasilla Data Set
As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences
In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package
32 Reads and the ShortRead Package
Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components
SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
15
+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet
$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~
are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported
FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set
gt fastqDir lt- filepath(bigdata() fastq)
gt fastqFiles lt- dir(fastqDir full=TRUE)
gt fq lt- readFastq(fastqFiles[1])
gt fq
class ShortReadQ
length 1000000 reads width 37 cycles
The data are represented as an object of class ShortReadQ
gt head(sread(fq) 3)
A DNAStringSet instance of length 3
width seq
[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT
[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA
gt head(quality(fq) 3)
class FastqQuality
quality
A BStringSet instance of length 3
width seq
[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+
16
There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance
gt abc lt- alphabetByCycle(sread(fq))
gt abc[14 18]
cycle
alphabet [1] [2] [3] [4] [5] [6] [7] [8]
A 78194 153156 200468 230120 283083 322913 162766 220205
C 439302 265338 362839 251434 203787 220855 253245 287010
G 397671 270342 258739 356003 301640 247090 227811 246684
T 84833 311164 177954 162443 211490 209142 356178 246101
FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample
gt sampler lt- FastqSampler(fastqFiles[1] 1000000)
gt yield(sampler) sample of 1000000 reads
class ShortReadQ
length 1000000 reads width 37 cycles
A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently
ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser
gt qas0 lt- Map(function(fl nm)
+ fq lt- FastqSampler(fl)
+ qa(yield(fq) nm)
+ fastqFiles
+ sub(_subsetfastq basename(fastqFiles)))
gt qas lt- docall(rbind qas0)
gt rpt lt- report(qas dest=tempfile())
gt browseURL(rpt)
A report from a larger subset of the experiment is available
gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml
+ package=useR2012)
gt browseURL(rpt)
17
Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package
Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a
histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each
cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package
and mclapply to read and summarize the GC content of reads in two files inparallel
Solution Discovery
gt dir(bigdata())
[1] bam fastq
gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)
Input
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 4071585 2175 6193578 3308 6193578 3308
Vcells 125856983 9603 279142478 21297 279142148 21297
gt fq lt- readFastq(fls[1])
GC content
gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)
gt sum(alf0[c(G C)])
[1] 05457237
A histogram of the GC content of individual reads is obtained with
gt gc lt- gcFunction(sread(fq))
gt hist(gc)
Alphabet by cycle
gt abc lt- alphabetByCycle(sread(fq))
gt matplot(t(abc[c(A C G T)]) type=l)
Advanced (Mac Linux only) processing on multiple cores
18
gt library(parallel)
gt gc0 lt- mclapply(fls function(fl)
+ fq lt- readFastq(fl)
+ gc lt- gcFunction(sread(fq))
+ table(cut(gc seq(0 1 05)))
+ )
gt simplify list of length 2 to 2-D array
gt gc lt- simplify2array(gc0)
gt matplot(gc type=s)
33 Alignments and the Rsamtools Package
Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data
Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields
gt fl lt- systemfile(extdata ex1sam package=Rsamtools)
gt strsplit(readLines(fl 1) t)[[1]]
[1] B7_591496693509
[2] 73
[3] seq1
[4] 1
[5] 99
[6] 36M
[7]
[8] 0
[9] 0
[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7
[12] MFi18
[13] Aqi73
[14] NMi0
[15] UQi0
[16] H0i1
[17] H1i0
19
The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R
Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)
gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)
gt aln lt- readGappedAlignments(alnFile)
gt head(aln 3)
GappedAlignments with 3 alignments and 0 elementMetadata cols
seqnames strand cigar qwidth start end width
ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt
[1] seq1 + 36M 36 1 36 36
[2] seq1 + 35M 35 3 37 35
[3] seq1 + 35M 35 5 39 35
ngap
ltintegergt
[1] 0
[2] 0
[3] 0
---
seqlengths
seq1 seq2
1575 1584
The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments
A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings
gt table(strand(aln))
+ -
1647 1624
20
gt table(width(aln))
30 31 32 33 34 35 36 38 40
2 21 1 8 37 2804 285 1 112
gt head(sort(table(cigar(aln)) decreasing=TRUE))
35M 36M 40M 34M 33M 14M4I17M
2804 283 112 37 6 4
Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes
Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from
The object ex created earlier contains coordinates of four genes Use coun-
tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to
Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation
Solution We discover the location of files using standard R commands
gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)
gt names(fls) lt- sub(_ basename(fls))
Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data
gt input
gt aln lt- readGappedAlignments(fls[1])
gt xtabs(~seqnames + strand asdataframe(aln))
strand
seqnames + -
chr3L 5402 5974
chrX 2278 2283
To count overlaps in regions defined in a previous exercise load the regions
gt data(ex) from an earlier exercise
Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known
21
gt strand(aln) lt- protocol not strand-aware
For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to
gt hits lt- countOverlaps(aln ex)
gt table(hits)
hits
0 1 2
772 15026 139
and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment
gt cnt lt- countOverlaps(ex aln[hits==1])
A simple function for counting reads is
gt counter lt-
+ function(filePath range)
+
+ aln lt- readGappedAlignments(filePath)
+ strand(aln) lt-
+ hits lt- countOverlaps(aln range)
+ countOverlaps(range aln[hits==1])
+
This can be applied to all files using sapply
gt counts lt- sapply(fls counter ex)
The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is
gt if (require(parallel))
+ simplify2array(mclapply(fls counter ex))
The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies
Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function
Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)
22
Figure 2 GC content in aligned reads
Solution
gt param lt- ScanBamParam(what=seq)
gt seqs lt- scanBam(fls[[1]] param=param)
gt readGC lt- gcFunction(seqs[[1]][[seq]])
gt hist(readGC)
23
Table 6 Selected Bioconductor packages for RNA-seq analysis
Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-
rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data
also limmarsquos roast or camera after transformation withvoom or cqn
easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI
Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments
4 RNA-seq
Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments
The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial
Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use
41 Differential Expression with the edgeR Package
RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of
24
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
Exercise 3Load the TxDbDmelanogasterUCSCdm3ensGene annotation package andcreate an alias txdb pointing to the TranscriptDb object this class defines
Extract all exon coordinates organized by gene using exonsBy What is theclass of this object How many elements are in the object What does eachelement correspond to And the elements of each element Use elementLengths
and table to summarize the number of exons in each gene for instance howmany single-exon genes are there
Select just those elements corresponding to flybase gene ids FBgn0002183FBgn0003360 FBgn0025111 and FBgn0036449 Use reduce to simplify genemodels so that exons that overlap are considered lsquothe samersquo
Solution
gt library(TxDbDmelanogasterUCSCdm3ensGene)
gt txdb lt- TxDbDmelanogasterUCSCdm3ensGene alias
gt ex0 lt- exonsBy(txdb gene)
gt head(table(elementLengths(ex0)))
1 2 3 4 5 6
3182 2608 2070 1628 1133 886
gt ids lt- c(FBgn0002183 FBgn0003360 FBgn0025111 FBgn0036449)
gt ex lt- reduce(ex0[ids])
Exercise 4The objective of this exercise is to calculate the GC content of the exons of asingle gene whose coordinates are specified by the ex object of the previousexercise
Load the BSgenomeDmelanogasterUCSCdm3 data package containing theUCSC representation of D melanogaster genome assembly dm3
Extract the sequence name of the first gene of ex Use this to load theappropriate D melanogaster chromosome
Use Views to create views on to the chromosome that span the start and endcoordinates of all exons
The useR2012 package defines a helper function gcFunction to calculate GCcontent Use this to calculate the GC content in each of the exons
Look at the helper function and describe what it does
Solution Here we load the D melanogaster genome select a single chromo-some and create Views that reflect the ranges of the FBgn0002183
gt library(BSgenomeDmelanogasterUCSCdm3)
gt nm lt- ascharacter(unique(seqnames(ex[[1]])))
gt chr lt- Dmelanogaster[[nm]]
gt v lt- Views(chr start=start(ex[[1]]) end=end(ex[[1]]))
13
Using the gcFunction helper function the subject GC content is
gt gcFunction(v)
[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333
[8] 05189394 05110132
The gcFunction is really straight-forward it invokes the function alphabetFre-
quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon
gt gcFunction
function (x)
alf lt- alphabetFrequency(x asprob = TRUE)
rowSums(alf[ c(G C)])
ltenvironment namespaceuseR2012gt
23 Resources
There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group
14
Table 5 Selected Bioconductor packages for sequence reads and alignments
Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-
nipulating fastq files these classes rely heavily onBiostrings
GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads
Rsamtools Provides access to BAM alignment and other largesequence-related files
rtracklayer Input and output of bed wig and similar files
3 Reads and Alignments
The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis
31 The pasilla Data Set
As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences
In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package
32 Reads and the ShortRead Package
Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components
SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
15
+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet
$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~
are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported
FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set
gt fastqDir lt- filepath(bigdata() fastq)
gt fastqFiles lt- dir(fastqDir full=TRUE)
gt fq lt- readFastq(fastqFiles[1])
gt fq
class ShortReadQ
length 1000000 reads width 37 cycles
The data are represented as an object of class ShortReadQ
gt head(sread(fq) 3)
A DNAStringSet instance of length 3
width seq
[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT
[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA
gt head(quality(fq) 3)
class FastqQuality
quality
A BStringSet instance of length 3
width seq
[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+
16
There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance
gt abc lt- alphabetByCycle(sread(fq))
gt abc[14 18]
cycle
alphabet [1] [2] [3] [4] [5] [6] [7] [8]
A 78194 153156 200468 230120 283083 322913 162766 220205
C 439302 265338 362839 251434 203787 220855 253245 287010
G 397671 270342 258739 356003 301640 247090 227811 246684
T 84833 311164 177954 162443 211490 209142 356178 246101
FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample
gt sampler lt- FastqSampler(fastqFiles[1] 1000000)
gt yield(sampler) sample of 1000000 reads
class ShortReadQ
length 1000000 reads width 37 cycles
A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently
ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser
gt qas0 lt- Map(function(fl nm)
+ fq lt- FastqSampler(fl)
+ qa(yield(fq) nm)
+ fastqFiles
+ sub(_subsetfastq basename(fastqFiles)))
gt qas lt- docall(rbind qas0)
gt rpt lt- report(qas dest=tempfile())
gt browseURL(rpt)
A report from a larger subset of the experiment is available
gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml
+ package=useR2012)
gt browseURL(rpt)
17
Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package
Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a
histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each
cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package
and mclapply to read and summarize the GC content of reads in two files inparallel
Solution Discovery
gt dir(bigdata())
[1] bam fastq
gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)
Input
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 4071585 2175 6193578 3308 6193578 3308
Vcells 125856983 9603 279142478 21297 279142148 21297
gt fq lt- readFastq(fls[1])
GC content
gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)
gt sum(alf0[c(G C)])
[1] 05457237
A histogram of the GC content of individual reads is obtained with
gt gc lt- gcFunction(sread(fq))
gt hist(gc)
Alphabet by cycle
gt abc lt- alphabetByCycle(sread(fq))
gt matplot(t(abc[c(A C G T)]) type=l)
Advanced (Mac Linux only) processing on multiple cores
18
gt library(parallel)
gt gc0 lt- mclapply(fls function(fl)
+ fq lt- readFastq(fl)
+ gc lt- gcFunction(sread(fq))
+ table(cut(gc seq(0 1 05)))
+ )
gt simplify list of length 2 to 2-D array
gt gc lt- simplify2array(gc0)
gt matplot(gc type=s)
33 Alignments and the Rsamtools Package
Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data
Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields
gt fl lt- systemfile(extdata ex1sam package=Rsamtools)
gt strsplit(readLines(fl 1) t)[[1]]
[1] B7_591496693509
[2] 73
[3] seq1
[4] 1
[5] 99
[6] 36M
[7]
[8] 0
[9] 0
[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7
[12] MFi18
[13] Aqi73
[14] NMi0
[15] UQi0
[16] H0i1
[17] H1i0
19
The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R
Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)
gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)
gt aln lt- readGappedAlignments(alnFile)
gt head(aln 3)
GappedAlignments with 3 alignments and 0 elementMetadata cols
seqnames strand cigar qwidth start end width
ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt
[1] seq1 + 36M 36 1 36 36
[2] seq1 + 35M 35 3 37 35
[3] seq1 + 35M 35 5 39 35
ngap
ltintegergt
[1] 0
[2] 0
[3] 0
---
seqlengths
seq1 seq2
1575 1584
The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments
A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings
gt table(strand(aln))
+ -
1647 1624
20
gt table(width(aln))
30 31 32 33 34 35 36 38 40
2 21 1 8 37 2804 285 1 112
gt head(sort(table(cigar(aln)) decreasing=TRUE))
35M 36M 40M 34M 33M 14M4I17M
2804 283 112 37 6 4
Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes
Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from
The object ex created earlier contains coordinates of four genes Use coun-
tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to
Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation
Solution We discover the location of files using standard R commands
gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)
gt names(fls) lt- sub(_ basename(fls))
Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data
gt input
gt aln lt- readGappedAlignments(fls[1])
gt xtabs(~seqnames + strand asdataframe(aln))
strand
seqnames + -
chr3L 5402 5974
chrX 2278 2283
To count overlaps in regions defined in a previous exercise load the regions
gt data(ex) from an earlier exercise
Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known
21
gt strand(aln) lt- protocol not strand-aware
For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to
gt hits lt- countOverlaps(aln ex)
gt table(hits)
hits
0 1 2
772 15026 139
and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment
gt cnt lt- countOverlaps(ex aln[hits==1])
A simple function for counting reads is
gt counter lt-
+ function(filePath range)
+
+ aln lt- readGappedAlignments(filePath)
+ strand(aln) lt-
+ hits lt- countOverlaps(aln range)
+ countOverlaps(range aln[hits==1])
+
This can be applied to all files using sapply
gt counts lt- sapply(fls counter ex)
The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is
gt if (require(parallel))
+ simplify2array(mclapply(fls counter ex))
The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies
Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function
Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)
22
Figure 2 GC content in aligned reads
Solution
gt param lt- ScanBamParam(what=seq)
gt seqs lt- scanBam(fls[[1]] param=param)
gt readGC lt- gcFunction(seqs[[1]][[seq]])
gt hist(readGC)
23
Table 6 Selected Bioconductor packages for RNA-seq analysis
Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-
rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data
also limmarsquos roast or camera after transformation withvoom or cqn
easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI
Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments
4 RNA-seq
Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments
The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial
Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use
41 Differential Expression with the edgeR Package
RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of
24
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
Using the gcFunction helper function the subject GC content is
gt gcFunction(v)
[1] 04767442 05555556 05389610 05513308 05351788 05441426 04933333
[8] 05189394 05110132
The gcFunction is really straight-forward it invokes the function alphabetFre-
quency from the Biostrings package This returns a simple matrix of exon timesnuclotiede probabilities The row sums of the G and C columns of this matrixare the GC contents of each exon
gt gcFunction
function (x)
alf lt- alphabetFrequency(x asprob = TRUE)
rowSums(alf[ c(G C)])
ltenvironment namespaceuseR2012gt
23 Resources
There are extensive vignettes for Biostrings and GenomicRanges packages Auseful online resource is from Thomas Grikersquos group
14
Table 5 Selected Bioconductor packages for sequence reads and alignments
Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-
nipulating fastq files these classes rely heavily onBiostrings
GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads
Rsamtools Provides access to BAM alignment and other largesequence-related files
rtracklayer Input and output of bed wig and similar files
3 Reads and Alignments
The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis
31 The pasilla Data Set
As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences
In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package
32 Reads and the ShortRead Package
Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components
SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
15
+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet
$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~
are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported
FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set
gt fastqDir lt- filepath(bigdata() fastq)
gt fastqFiles lt- dir(fastqDir full=TRUE)
gt fq lt- readFastq(fastqFiles[1])
gt fq
class ShortReadQ
length 1000000 reads width 37 cycles
The data are represented as an object of class ShortReadQ
gt head(sread(fq) 3)
A DNAStringSet instance of length 3
width seq
[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT
[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA
gt head(quality(fq) 3)
class FastqQuality
quality
A BStringSet instance of length 3
width seq
[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+
16
There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance
gt abc lt- alphabetByCycle(sread(fq))
gt abc[14 18]
cycle
alphabet [1] [2] [3] [4] [5] [6] [7] [8]
A 78194 153156 200468 230120 283083 322913 162766 220205
C 439302 265338 362839 251434 203787 220855 253245 287010
G 397671 270342 258739 356003 301640 247090 227811 246684
T 84833 311164 177954 162443 211490 209142 356178 246101
FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample
gt sampler lt- FastqSampler(fastqFiles[1] 1000000)
gt yield(sampler) sample of 1000000 reads
class ShortReadQ
length 1000000 reads width 37 cycles
A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently
ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser
gt qas0 lt- Map(function(fl nm)
+ fq lt- FastqSampler(fl)
+ qa(yield(fq) nm)
+ fastqFiles
+ sub(_subsetfastq basename(fastqFiles)))
gt qas lt- docall(rbind qas0)
gt rpt lt- report(qas dest=tempfile())
gt browseURL(rpt)
A report from a larger subset of the experiment is available
gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml
+ package=useR2012)
gt browseURL(rpt)
17
Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package
Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a
histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each
cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package
and mclapply to read and summarize the GC content of reads in two files inparallel
Solution Discovery
gt dir(bigdata())
[1] bam fastq
gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)
Input
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 4071585 2175 6193578 3308 6193578 3308
Vcells 125856983 9603 279142478 21297 279142148 21297
gt fq lt- readFastq(fls[1])
GC content
gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)
gt sum(alf0[c(G C)])
[1] 05457237
A histogram of the GC content of individual reads is obtained with
gt gc lt- gcFunction(sread(fq))
gt hist(gc)
Alphabet by cycle
gt abc lt- alphabetByCycle(sread(fq))
gt matplot(t(abc[c(A C G T)]) type=l)
Advanced (Mac Linux only) processing on multiple cores
18
gt library(parallel)
gt gc0 lt- mclapply(fls function(fl)
+ fq lt- readFastq(fl)
+ gc lt- gcFunction(sread(fq))
+ table(cut(gc seq(0 1 05)))
+ )
gt simplify list of length 2 to 2-D array
gt gc lt- simplify2array(gc0)
gt matplot(gc type=s)
33 Alignments and the Rsamtools Package
Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data
Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields
gt fl lt- systemfile(extdata ex1sam package=Rsamtools)
gt strsplit(readLines(fl 1) t)[[1]]
[1] B7_591496693509
[2] 73
[3] seq1
[4] 1
[5] 99
[6] 36M
[7]
[8] 0
[9] 0
[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7
[12] MFi18
[13] Aqi73
[14] NMi0
[15] UQi0
[16] H0i1
[17] H1i0
19
The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R
Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)
gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)
gt aln lt- readGappedAlignments(alnFile)
gt head(aln 3)
GappedAlignments with 3 alignments and 0 elementMetadata cols
seqnames strand cigar qwidth start end width
ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt
[1] seq1 + 36M 36 1 36 36
[2] seq1 + 35M 35 3 37 35
[3] seq1 + 35M 35 5 39 35
ngap
ltintegergt
[1] 0
[2] 0
[3] 0
---
seqlengths
seq1 seq2
1575 1584
The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments
A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings
gt table(strand(aln))
+ -
1647 1624
20
gt table(width(aln))
30 31 32 33 34 35 36 38 40
2 21 1 8 37 2804 285 1 112
gt head(sort(table(cigar(aln)) decreasing=TRUE))
35M 36M 40M 34M 33M 14M4I17M
2804 283 112 37 6 4
Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes
Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from
The object ex created earlier contains coordinates of four genes Use coun-
tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to
Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation
Solution We discover the location of files using standard R commands
gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)
gt names(fls) lt- sub(_ basename(fls))
Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data
gt input
gt aln lt- readGappedAlignments(fls[1])
gt xtabs(~seqnames + strand asdataframe(aln))
strand
seqnames + -
chr3L 5402 5974
chrX 2278 2283
To count overlaps in regions defined in a previous exercise load the regions
gt data(ex) from an earlier exercise
Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known
21
gt strand(aln) lt- protocol not strand-aware
For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to
gt hits lt- countOverlaps(aln ex)
gt table(hits)
hits
0 1 2
772 15026 139
and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment
gt cnt lt- countOverlaps(ex aln[hits==1])
A simple function for counting reads is
gt counter lt-
+ function(filePath range)
+
+ aln lt- readGappedAlignments(filePath)
+ strand(aln) lt-
+ hits lt- countOverlaps(aln range)
+ countOverlaps(range aln[hits==1])
+
This can be applied to all files using sapply
gt counts lt- sapply(fls counter ex)
The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is
gt if (require(parallel))
+ simplify2array(mclapply(fls counter ex))
The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies
Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function
Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)
22
Figure 2 GC content in aligned reads
Solution
gt param lt- ScanBamParam(what=seq)
gt seqs lt- scanBam(fls[[1]] param=param)
gt readGC lt- gcFunction(seqs[[1]][[seq]])
gt hist(readGC)
23
Table 6 Selected Bioconductor packages for RNA-seq analysis
Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-
rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data
also limmarsquos roast or camera after transformation withvoom or cqn
easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI
Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments
4 RNA-seq
Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments
The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial
Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use
41 Differential Expression with the edgeR Package
RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of
24
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
Table 5 Selected Bioconductor packages for sequence reads and alignments
Package DescriptionShortRead Defines the ShortReadQ class and functions for ma-
nipulating fastq files these classes rely heavily onBiostrings
GenomicRanges GappedAlignments and GappedAlignmentPairs storesingle- and paired-end aligned reads
Rsamtools Provides access to BAM alignment and other largesequence-related files
rtracklayer Input and output of bed wig and similar files
3 Reads and Alignments
The following sections introduce core tools for working with high-throughputsequence data key packages for representating reads and alignments are sum-marized in Table 5 This section focus on the reads and alignments that arethe raw material for analysis Section 6 introduces resources for annotating se-quences while section 4 addresses statistical approaches to assessing differentialrepresentation in RNA-seq experiments Section 5 outlines ChIP-seq analysis
31 The pasilla Data Set
As a running example we use the pasilla data set derived from [2] The authorsinvestigate conservation of RNA regulation between D melanogaster and mam-mals Part of their study used RNAi and RNA-seq to identify exons regulated byPasilla (ps) the D melanogaster ortholog of mammalian NOVA1 and NOVA2Briefly their experiment compared gene expression as measured by RNAseq inS2-DRSC cells cultured with or without a 444bp dsRNA fragment correspond-ing to the ps mRNA sequence Their assessment investigated differential exonuse but our worked example will focus on gene-level differences
In this section we look at a subset of the ps data corresponding to readsobtained from lanes of their RNA-seq experiment and to the same reads alignedto a D melanogaster reference genome Reads were obtained from GEO and theShort Read Archive (SRA) and were aligned to the D melanogaster referencegenome dm3 as described in the pasilla experiment data package
32 Reads and the ShortRead Package
Short read formats The Illumina GAII and HiSeq technologies generatesequences by measuring incorporation of florescent nucleotides over successivePCR cycles These sequencers produce output in a variety of formats butFASTQ is ubiquitous Each read is represented by a record of four components
SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
15
+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet
$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~
are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported
FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set
gt fastqDir lt- filepath(bigdata() fastq)
gt fastqFiles lt- dir(fastqDir full=TRUE)
gt fq lt- readFastq(fastqFiles[1])
gt fq
class ShortReadQ
length 1000000 reads width 37 cycles
The data are represented as an object of class ShortReadQ
gt head(sread(fq) 3)
A DNAStringSet instance of length 3
width seq
[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT
[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA
gt head(quality(fq) 3)
class FastqQuality
quality
A BStringSet instance of length 3
width seq
[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+
16
There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance
gt abc lt- alphabetByCycle(sread(fq))
gt abc[14 18]
cycle
alphabet [1] [2] [3] [4] [5] [6] [7] [8]
A 78194 153156 200468 230120 283083 322913 162766 220205
C 439302 265338 362839 251434 203787 220855 253245 287010
G 397671 270342 258739 356003 301640 247090 227811 246684
T 84833 311164 177954 162443 211490 209142 356178 246101
FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample
gt sampler lt- FastqSampler(fastqFiles[1] 1000000)
gt yield(sampler) sample of 1000000 reads
class ShortReadQ
length 1000000 reads width 37 cycles
A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently
ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser
gt qas0 lt- Map(function(fl nm)
+ fq lt- FastqSampler(fl)
+ qa(yield(fq) nm)
+ fastqFiles
+ sub(_subsetfastq basename(fastqFiles)))
gt qas lt- docall(rbind qas0)
gt rpt lt- report(qas dest=tempfile())
gt browseURL(rpt)
A report from a larger subset of the experiment is available
gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml
+ package=useR2012)
gt browseURL(rpt)
17
Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package
Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a
histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each
cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package
and mclapply to read and summarize the GC content of reads in two files inparallel
Solution Discovery
gt dir(bigdata())
[1] bam fastq
gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)
Input
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 4071585 2175 6193578 3308 6193578 3308
Vcells 125856983 9603 279142478 21297 279142148 21297
gt fq lt- readFastq(fls[1])
GC content
gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)
gt sum(alf0[c(G C)])
[1] 05457237
A histogram of the GC content of individual reads is obtained with
gt gc lt- gcFunction(sread(fq))
gt hist(gc)
Alphabet by cycle
gt abc lt- alphabetByCycle(sread(fq))
gt matplot(t(abc[c(A C G T)]) type=l)
Advanced (Mac Linux only) processing on multiple cores
18
gt library(parallel)
gt gc0 lt- mclapply(fls function(fl)
+ fq lt- readFastq(fl)
+ gc lt- gcFunction(sread(fq))
+ table(cut(gc seq(0 1 05)))
+ )
gt simplify list of length 2 to 2-D array
gt gc lt- simplify2array(gc0)
gt matplot(gc type=s)
33 Alignments and the Rsamtools Package
Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data
Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields
gt fl lt- systemfile(extdata ex1sam package=Rsamtools)
gt strsplit(readLines(fl 1) t)[[1]]
[1] B7_591496693509
[2] 73
[3] seq1
[4] 1
[5] 99
[6] 36M
[7]
[8] 0
[9] 0
[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7
[12] MFi18
[13] Aqi73
[14] NMi0
[15] UQi0
[16] H0i1
[17] H1i0
19
The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R
Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)
gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)
gt aln lt- readGappedAlignments(alnFile)
gt head(aln 3)
GappedAlignments with 3 alignments and 0 elementMetadata cols
seqnames strand cigar qwidth start end width
ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt
[1] seq1 + 36M 36 1 36 36
[2] seq1 + 35M 35 3 37 35
[3] seq1 + 35M 35 5 39 35
ngap
ltintegergt
[1] 0
[2] 0
[3] 0
---
seqlengths
seq1 seq2
1575 1584
The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments
A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings
gt table(strand(aln))
+ -
1647 1624
20
gt table(width(aln))
30 31 32 33 34 35 36 38 40
2 21 1 8 37 2804 285 1 112
gt head(sort(table(cigar(aln)) decreasing=TRUE))
35M 36M 40M 34M 33M 14M4I17M
2804 283 112 37 6 4
Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes
Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from
The object ex created earlier contains coordinates of four genes Use coun-
tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to
Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation
Solution We discover the location of files using standard R commands
gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)
gt names(fls) lt- sub(_ basename(fls))
Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data
gt input
gt aln lt- readGappedAlignments(fls[1])
gt xtabs(~seqnames + strand asdataframe(aln))
strand
seqnames + -
chr3L 5402 5974
chrX 2278 2283
To count overlaps in regions defined in a previous exercise load the regions
gt data(ex) from an earlier exercise
Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known
21
gt strand(aln) lt- protocol not strand-aware
For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to
gt hits lt- countOverlaps(aln ex)
gt table(hits)
hits
0 1 2
772 15026 139
and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment
gt cnt lt- countOverlaps(ex aln[hits==1])
A simple function for counting reads is
gt counter lt-
+ function(filePath range)
+
+ aln lt- readGappedAlignments(filePath)
+ strand(aln) lt-
+ hits lt- countOverlaps(aln range)
+ countOverlaps(range aln[hits==1])
+
This can be applied to all files using sapply
gt counts lt- sapply(fls counter ex)
The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is
gt if (require(parallel))
+ simplify2array(mclapply(fls counter ex))
The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies
Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function
Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)
22
Figure 2 GC content in aligned reads
Solution
gt param lt- ScanBamParam(what=seq)
gt seqs lt- scanBam(fls[[1]] param=param)
gt readGC lt- gcFunction(seqs[[1]][[seq]])
gt hist(readGC)
23
Table 6 Selected Bioconductor packages for RNA-seq analysis
Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-
rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data
also limmarsquos roast or camera after transformation withvoom or cqn
easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI
Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments
4 RNA-seq
Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments
The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial
Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use
41 Differential Expression with the edgeR Package
RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of
24
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
+SRR0317241 HWI-EAS299_4_30M2BAAXX5115131024 length=37
IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
The first and third lines (beginning with and + respectively) are unique identi-fiers The second and fourth lines of the FASTQ record are the nucleotides andqualities of each cycle in the read This information is given in 5rsquo to 3rsquo orienta-tion as seen by the sequencer A letter N in the sequence is used to signify basesthat the sequencer was not able to call The fourth line of the FASTQ recordencodes the quality (confidence) of the corresponding base call The qualityscore is encoded following one of several conventions with the general notionbeing that letters later in the visible ASCII alphabet
$amp()+-0123456789lt=gtABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz|~
are of lower quality Both the sequence and quality scores may span multiplelines Technologies other than Illumina use different formats to represent se-quences 454 sequence input is supported in R453Plus1Toolbox lsquocolor spacersquo isnot supported
FASTQ files can be read in to R using the readFastq function from theShortRead package Use this function by providing the path to a FASTQ fileThere are sample data files available in the useR2012 package each consistingof 1 million reads from a lane of the Pasilla data set
gt fastqDir lt- filepath(bigdata() fastq)
gt fastqFiles lt- dir(fastqDir full=TRUE)
gt fq lt- readFastq(fastqFiles[1])
gt fq
class ShortReadQ
length 1000000 reads width 37 cycles
The data are represented as an object of class ShortReadQ
gt head(sread(fq) 3)
A DNAStringSet instance of length 3
width seq
[1] 37 GTTTTGTCCAAGTTCTGGTAGCTGAATCCTGGGGCGC
[2] 37 GTTGTCGCATTCCTTACTCTCATTCGGGAATTCTGTT
[3] 37 GAATTTTTTGAGAGCGAAATGATAGCCGATGCCCTGA
gt head(quality(fq) 3)
class FastqQuality
quality
A BStringSet instance of length 3
width seq
[1] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIII+HIIIIltIE
[2] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
[3] 37 IIIIIIIIIIIIIIIIIIIIIIIIIIIGBIIII2I+
16
There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance
gt abc lt- alphabetByCycle(sread(fq))
gt abc[14 18]
cycle
alphabet [1] [2] [3] [4] [5] [6] [7] [8]
A 78194 153156 200468 230120 283083 322913 162766 220205
C 439302 265338 362839 251434 203787 220855 253245 287010
G 397671 270342 258739 356003 301640 247090 227811 246684
T 84833 311164 177954 162443 211490 209142 356178 246101
FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample
gt sampler lt- FastqSampler(fastqFiles[1] 1000000)
gt yield(sampler) sample of 1000000 reads
class ShortReadQ
length 1000000 reads width 37 cycles
A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently
ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser
gt qas0 lt- Map(function(fl nm)
+ fq lt- FastqSampler(fl)
+ qa(yield(fq) nm)
+ fastqFiles
+ sub(_subsetfastq basename(fastqFiles)))
gt qas lt- docall(rbind qas0)
gt rpt lt- report(qas dest=tempfile())
gt browseURL(rpt)
A report from a larger subset of the experiment is available
gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml
+ package=useR2012)
gt browseURL(rpt)
17
Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package
Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a
histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each
cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package
and mclapply to read and summarize the GC content of reads in two files inparallel
Solution Discovery
gt dir(bigdata())
[1] bam fastq
gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)
Input
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 4071585 2175 6193578 3308 6193578 3308
Vcells 125856983 9603 279142478 21297 279142148 21297
gt fq lt- readFastq(fls[1])
GC content
gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)
gt sum(alf0[c(G C)])
[1] 05457237
A histogram of the GC content of individual reads is obtained with
gt gc lt- gcFunction(sread(fq))
gt hist(gc)
Alphabet by cycle
gt abc lt- alphabetByCycle(sread(fq))
gt matplot(t(abc[c(A C G T)]) type=l)
Advanced (Mac Linux only) processing on multiple cores
18
gt library(parallel)
gt gc0 lt- mclapply(fls function(fl)
+ fq lt- readFastq(fl)
+ gc lt- gcFunction(sread(fq))
+ table(cut(gc seq(0 1 05)))
+ )
gt simplify list of length 2 to 2-D array
gt gc lt- simplify2array(gc0)
gt matplot(gc type=s)
33 Alignments and the Rsamtools Package
Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data
Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields
gt fl lt- systemfile(extdata ex1sam package=Rsamtools)
gt strsplit(readLines(fl 1) t)[[1]]
[1] B7_591496693509
[2] 73
[3] seq1
[4] 1
[5] 99
[6] 36M
[7]
[8] 0
[9] 0
[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7
[12] MFi18
[13] Aqi73
[14] NMi0
[15] UQi0
[16] H0i1
[17] H1i0
19
The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R
Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)
gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)
gt aln lt- readGappedAlignments(alnFile)
gt head(aln 3)
GappedAlignments with 3 alignments and 0 elementMetadata cols
seqnames strand cigar qwidth start end width
ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt
[1] seq1 + 36M 36 1 36 36
[2] seq1 + 35M 35 3 37 35
[3] seq1 + 35M 35 5 39 35
ngap
ltintegergt
[1] 0
[2] 0
[3] 0
---
seqlengths
seq1 seq2
1575 1584
The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments
A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings
gt table(strand(aln))
+ -
1647 1624
20
gt table(width(aln))
30 31 32 33 34 35 36 38 40
2 21 1 8 37 2804 285 1 112
gt head(sort(table(cigar(aln)) decreasing=TRUE))
35M 36M 40M 34M 33M 14M4I17M
2804 283 112 37 6 4
Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes
Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from
The object ex created earlier contains coordinates of four genes Use coun-
tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to
Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation
Solution We discover the location of files using standard R commands
gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)
gt names(fls) lt- sub(_ basename(fls))
Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data
gt input
gt aln lt- readGappedAlignments(fls[1])
gt xtabs(~seqnames + strand asdataframe(aln))
strand
seqnames + -
chr3L 5402 5974
chrX 2278 2283
To count overlaps in regions defined in a previous exercise load the regions
gt data(ex) from an earlier exercise
Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known
21
gt strand(aln) lt- protocol not strand-aware
For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to
gt hits lt- countOverlaps(aln ex)
gt table(hits)
hits
0 1 2
772 15026 139
and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment
gt cnt lt- countOverlaps(ex aln[hits==1])
A simple function for counting reads is
gt counter lt-
+ function(filePath range)
+
+ aln lt- readGappedAlignments(filePath)
+ strand(aln) lt-
+ hits lt- countOverlaps(aln range)
+ countOverlaps(range aln[hits==1])
+
This can be applied to all files using sapply
gt counts lt- sapply(fls counter ex)
The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is
gt if (require(parallel))
+ simplify2array(mclapply(fls counter ex))
The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies
Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function
Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)
22
Figure 2 GC content in aligned reads
Solution
gt param lt- ScanBamParam(what=seq)
gt seqs lt- scanBam(fls[[1]] param=param)
gt readGC lt- gcFunction(seqs[[1]][[seq]])
gt hist(readGC)
23
Table 6 Selected Bioconductor packages for RNA-seq analysis
Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-
rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data
also limmarsquos roast or camera after transformation withvoom or cqn
easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI
Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments
4 RNA-seq
Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments
The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial
Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use
41 Differential Expression with the edgeR Package
RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of
24
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
There are many ways to manipulate these objects the alphabetByCycle functionsummarizes use of nucleotides at each cycle in a (equal width) ShortReadQ orDNAStringSet instance
gt abc lt- alphabetByCycle(sread(fq))
gt abc[14 18]
cycle
alphabet [1] [2] [3] [4] [5] [6] [7] [8]
A 78194 153156 200468 230120 283083 322913 162766 220205
C 439302 265338 362839 251434 203787 220855 253245 287010
G 397671 270342 258739 356003 301640 247090 227811 246684
T 84833 311164 177954 162443 211490 209142 356178 246101
FASTQ files are getting larger A very common reason for looking at dataat this early stage in the processing pipeline is to explore sequence quality Inthese circumstances it is often not necessary to parse the entire FASTQ fileInstead create a representative sample
gt sampler lt- FastqSampler(fastqFiles[1] 1000000)
gt yield(sampler) sample of 1000000 reads
class ShortReadQ
length 1000000 reads width 37 cycles
A second common scenario is to pre-process reads eg trimming low-qualitytails adapter sequences or artifacts of sample preparation The FastqStreamerclass can be used to lsquostreamrsquo over the fastq files in chunks processing each chunkindependently
ShortRead contains facilities for quality assessment of FASTQ files Here wegenerate a report from a sample of 1 million reads from each of our files anddisplay it in a web browser
gt qas0 lt- Map(function(fl nm)
+ fq lt- FastqSampler(fl)
+ qa(yield(fq) nm)
+ fastqFiles
+ sub(_subsetfastq basename(fastqFiles)))
gt qas lt- docall(rbind qas0)
gt rpt lt- report(qas dest=tempfile())
gt browseURL(rpt)
A report from a larger subset of the experiment is available
gt rpt lt- systemfile(GSM461176_81_qa_report indexhtml
+ package=useR2012)
gt browseURL(rpt)
17
Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package
Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a
histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each
cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package
and mclapply to read and summarize the GC content of reads in two files inparallel
Solution Discovery
gt dir(bigdata())
[1] bam fastq
gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)
Input
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 4071585 2175 6193578 3308 6193578 3308
Vcells 125856983 9603 279142478 21297 279142148 21297
gt fq lt- readFastq(fls[1])
GC content
gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)
gt sum(alf0[c(G C)])
[1] 05457237
A histogram of the GC content of individual reads is obtained with
gt gc lt- gcFunction(sread(fq))
gt hist(gc)
Alphabet by cycle
gt abc lt- alphabetByCycle(sread(fq))
gt matplot(t(abc[c(A C G T)]) type=l)
Advanced (Mac Linux only) processing on multiple cores
18
gt library(parallel)
gt gc0 lt- mclapply(fls function(fl)
+ fq lt- readFastq(fl)
+ gc lt- gcFunction(sread(fq))
+ table(cut(gc seq(0 1 05)))
+ )
gt simplify list of length 2 to 2-D array
gt gc lt- simplify2array(gc0)
gt matplot(gc type=s)
33 Alignments and the Rsamtools Package
Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data
Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields
gt fl lt- systemfile(extdata ex1sam package=Rsamtools)
gt strsplit(readLines(fl 1) t)[[1]]
[1] B7_591496693509
[2] 73
[3] seq1
[4] 1
[5] 99
[6] 36M
[7]
[8] 0
[9] 0
[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7
[12] MFi18
[13] Aqi73
[14] NMi0
[15] UQi0
[16] H0i1
[17] H1i0
19
The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R
Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)
gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)
gt aln lt- readGappedAlignments(alnFile)
gt head(aln 3)
GappedAlignments with 3 alignments and 0 elementMetadata cols
seqnames strand cigar qwidth start end width
ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt
[1] seq1 + 36M 36 1 36 36
[2] seq1 + 35M 35 3 37 35
[3] seq1 + 35M 35 5 39 35
ngap
ltintegergt
[1] 0
[2] 0
[3] 0
---
seqlengths
seq1 seq2
1575 1584
The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments
A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings
gt table(strand(aln))
+ -
1647 1624
20
gt table(width(aln))
30 31 32 33 34 35 36 38 40
2 21 1 8 37 2804 285 1 112
gt head(sort(table(cigar(aln)) decreasing=TRUE))
35M 36M 40M 34M 33M 14M4I17M
2804 283 112 37 6 4
Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes
Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from
The object ex created earlier contains coordinates of four genes Use coun-
tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to
Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation
Solution We discover the location of files using standard R commands
gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)
gt names(fls) lt- sub(_ basename(fls))
Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data
gt input
gt aln lt- readGappedAlignments(fls[1])
gt xtabs(~seqnames + strand asdataframe(aln))
strand
seqnames + -
chr3L 5402 5974
chrX 2278 2283
To count overlaps in regions defined in a previous exercise load the regions
gt data(ex) from an earlier exercise
Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known
21
gt strand(aln) lt- protocol not strand-aware
For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to
gt hits lt- countOverlaps(aln ex)
gt table(hits)
hits
0 1 2
772 15026 139
and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment
gt cnt lt- countOverlaps(ex aln[hits==1])
A simple function for counting reads is
gt counter lt-
+ function(filePath range)
+
+ aln lt- readGappedAlignments(filePath)
+ strand(aln) lt-
+ hits lt- countOverlaps(aln range)
+ countOverlaps(range aln[hits==1])
+
This can be applied to all files using sapply
gt counts lt- sapply(fls counter ex)
The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is
gt if (require(parallel))
+ simplify2array(mclapply(fls counter ex))
The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies
Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function
Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)
22
Figure 2 GC content in aligned reads
Solution
gt param lt- ScanBamParam(what=seq)
gt seqs lt- scanBam(fls[[1]] param=param)
gt readGC lt- gcFunction(seqs[[1]][[seq]])
gt hist(readGC)
23
Table 6 Selected Bioconductor packages for RNA-seq analysis
Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-
rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data
also limmarsquos roast or camera after transformation withvoom or cqn
easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI
Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments
4 RNA-seq
Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments
The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial
Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use
41 Differential Expression with the edgeR Package
RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of
24
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
Exercise 5Use the helper function bigdata (defined in the useR2012 package) and thefilepath and dir functions to locate two fastq files from [2] (the files wereobtained as described in the appendix and pasilla experiment data package
Input one of the fastq files using readFastq from the ShortRead packageUsing the helper function gcFunction from the useR2012 package draw a
histogram of the distribution of GC frequencies across readsUse alphabetByCycle to summarize the frequency of each nucleotide at each
cycle Plot the results using matplot from the graphics packageAs an advanced exercise and if on Mac or Linux use the parallel package
and mclapply to read and summarize the GC content of reads in two files inparallel
Solution Discovery
gt dir(bigdata())
[1] bam fastq
gt fls lt- dir(filepath(bigdata() fastq) full=TRUE)
Input
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 4071585 2175 6193578 3308 6193578 3308
Vcells 125856983 9603 279142478 21297 279142148 21297
gt fq lt- readFastq(fls[1])
GC content
gt alf0 lt- alphabetFrequency(sread(fq) asprob=TRUE collapse=TRUE)
gt sum(alf0[c(G C)])
[1] 05457237
A histogram of the GC content of individual reads is obtained with
gt gc lt- gcFunction(sread(fq))
gt hist(gc)
Alphabet by cycle
gt abc lt- alphabetByCycle(sread(fq))
gt matplot(t(abc[c(A C G T)]) type=l)
Advanced (Mac Linux only) processing on multiple cores
18
gt library(parallel)
gt gc0 lt- mclapply(fls function(fl)
+ fq lt- readFastq(fl)
+ gc lt- gcFunction(sread(fq))
+ table(cut(gc seq(0 1 05)))
+ )
gt simplify list of length 2 to 2-D array
gt gc lt- simplify2array(gc0)
gt matplot(gc type=s)
33 Alignments and the Rsamtools Package
Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data
Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields
gt fl lt- systemfile(extdata ex1sam package=Rsamtools)
gt strsplit(readLines(fl 1) t)[[1]]
[1] B7_591496693509
[2] 73
[3] seq1
[4] 1
[5] 99
[6] 36M
[7]
[8] 0
[9] 0
[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7
[12] MFi18
[13] Aqi73
[14] NMi0
[15] UQi0
[16] H0i1
[17] H1i0
19
The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R
Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)
gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)
gt aln lt- readGappedAlignments(alnFile)
gt head(aln 3)
GappedAlignments with 3 alignments and 0 elementMetadata cols
seqnames strand cigar qwidth start end width
ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt
[1] seq1 + 36M 36 1 36 36
[2] seq1 + 35M 35 3 37 35
[3] seq1 + 35M 35 5 39 35
ngap
ltintegergt
[1] 0
[2] 0
[3] 0
---
seqlengths
seq1 seq2
1575 1584
The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments
A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings
gt table(strand(aln))
+ -
1647 1624
20
gt table(width(aln))
30 31 32 33 34 35 36 38 40
2 21 1 8 37 2804 285 1 112
gt head(sort(table(cigar(aln)) decreasing=TRUE))
35M 36M 40M 34M 33M 14M4I17M
2804 283 112 37 6 4
Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes
Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from
The object ex created earlier contains coordinates of four genes Use coun-
tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to
Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation
Solution We discover the location of files using standard R commands
gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)
gt names(fls) lt- sub(_ basename(fls))
Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data
gt input
gt aln lt- readGappedAlignments(fls[1])
gt xtabs(~seqnames + strand asdataframe(aln))
strand
seqnames + -
chr3L 5402 5974
chrX 2278 2283
To count overlaps in regions defined in a previous exercise load the regions
gt data(ex) from an earlier exercise
Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known
21
gt strand(aln) lt- protocol not strand-aware
For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to
gt hits lt- countOverlaps(aln ex)
gt table(hits)
hits
0 1 2
772 15026 139
and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment
gt cnt lt- countOverlaps(ex aln[hits==1])
A simple function for counting reads is
gt counter lt-
+ function(filePath range)
+
+ aln lt- readGappedAlignments(filePath)
+ strand(aln) lt-
+ hits lt- countOverlaps(aln range)
+ countOverlaps(range aln[hits==1])
+
This can be applied to all files using sapply
gt counts lt- sapply(fls counter ex)
The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is
gt if (require(parallel))
+ simplify2array(mclapply(fls counter ex))
The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies
Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function
Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)
22
Figure 2 GC content in aligned reads
Solution
gt param lt- ScanBamParam(what=seq)
gt seqs lt- scanBam(fls[[1]] param=param)
gt readGC lt- gcFunction(seqs[[1]][[seq]])
gt hist(readGC)
23
Table 6 Selected Bioconductor packages for RNA-seq analysis
Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-
rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data
also limmarsquos roast or camera after transformation withvoom or cqn
easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI
Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments
4 RNA-seq
Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments
The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial
Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use
41 Differential Expression with the edgeR Package
RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of
24
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
gt library(parallel)
gt gc0 lt- mclapply(fls function(fl)
+ fq lt- readFastq(fl)
+ gc lt- gcFunction(sread(fq))
+ table(cut(gc seq(0 1 05)))
+ )
gt simplify list of length 2 to 2-D array
gt gc lt- simplify2array(gc0)
gt matplot(gc type=s)
33 Alignments and the Rsamtools Package
Most down-stream analysis of short read sequences is based on reads aligned toreference genomes There are many aligners available including BWA [10 9]Bowtie [8] and GSNAP merits of these are discussed in the literature Thereare also alignment algorithms implemented in Bioconductor (eg matchPDict inthe Biostrings package and the Rsubread package) matchPDict is particularlyuseful for flexible alignment of moderately sized subsets of data
Alignment formats Most main-stream aligners produce output in SAM (text-based) or BAM format A SAM file is a text file with one line per aligned readand fields separated by tabs Here is an example of a single SAM line split intofields
gt fl lt- systemfile(extdata ex1sam package=Rsamtools)
gt strsplit(readLines(fl 1) t)[[1]]
[1] B7_591496693509
[2] 73
[3] seq1
[4] 1
[5] 99
[6] 36M
[7]
[8] 0
[9] 0
[10] CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
[11] ltltltltltltltltltltltltltltltltltltltltltltltlt5ltltltltltlt7
[12] MFi18
[13] Aqi73
[14] NMi0
[15] UQi0
[16] H0i1
[17] H1i0
19
The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R
Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)
gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)
gt aln lt- readGappedAlignments(alnFile)
gt head(aln 3)
GappedAlignments with 3 alignments and 0 elementMetadata cols
seqnames strand cigar qwidth start end width
ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt
[1] seq1 + 36M 36 1 36 36
[2] seq1 + 35M 35 3 37 35
[3] seq1 + 35M 35 5 39 35
ngap
ltintegergt
[1] 0
[2] 0
[3] 0
---
seqlengths
seq1 seq2
1575 1584
The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments
A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings
gt table(strand(aln))
+ -
1647 1624
20
gt table(width(aln))
30 31 32 33 34 35 36 38 40
2 21 1 8 37 2804 285 1 112
gt head(sort(table(cigar(aln)) decreasing=TRUE))
35M 36M 40M 34M 33M 14M4I17M
2804 283 112 37 6 4
Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes
Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from
The object ex created earlier contains coordinates of four genes Use coun-
tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to
Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation
Solution We discover the location of files using standard R commands
gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)
gt names(fls) lt- sub(_ basename(fls))
Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data
gt input
gt aln lt- readGappedAlignments(fls[1])
gt xtabs(~seqnames + strand asdataframe(aln))
strand
seqnames + -
chr3L 5402 5974
chrX 2278 2283
To count overlaps in regions defined in a previous exercise load the regions
gt data(ex) from an earlier exercise
Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known
21
gt strand(aln) lt- protocol not strand-aware
For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to
gt hits lt- countOverlaps(aln ex)
gt table(hits)
hits
0 1 2
772 15026 139
and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment
gt cnt lt- countOverlaps(ex aln[hits==1])
A simple function for counting reads is
gt counter lt-
+ function(filePath range)
+
+ aln lt- readGappedAlignments(filePath)
+ strand(aln) lt-
+ hits lt- countOverlaps(aln range)
+ countOverlaps(range aln[hits==1])
+
This can be applied to all files using sapply
gt counts lt- sapply(fls counter ex)
The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is
gt if (require(parallel))
+ simplify2array(mclapply(fls counter ex))
The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies
Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function
Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)
22
Figure 2 GC content in aligned reads
Solution
gt param lt- ScanBamParam(what=seq)
gt seqs lt- scanBam(fls[[1]] param=param)
gt readGC lt- gcFunction(seqs[[1]][[seq]])
gt hist(readGC)
23
Table 6 Selected Bioconductor packages for RNA-seq analysis
Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-
rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data
also limmarsquos roast or camera after transformation withvoom or cqn
easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI
Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments
4 RNA-seq
Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments
The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial
Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use
41 Differential Expression with the edgeR Package
RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of
24
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
The SAM specification summarizes these fields We recognize from the FASTQfile the identifier string read sequence and quality The alignment is to a chro-mosome lsquoseq1rsquo starting at position 1 The strand of alignment is encoded in thelsquoflagrsquo field The alignment record also includes a measure of mapping qualityand a CIGAR string describing the nature of the alignment In this case theCIGAR is 36M indicating that the alignment consisted of 36 Matches or mis-matches with no indels or gaps indels are represented by I and D gaps (egfrom alignments spanning introns) by N BAM files encode the same informationas SAM files but in a format that is more efficiently parsed by software BAMfiles are the primary way in which aligned reads are imported in to R
Aligned reads in R The readGappedAlignments function from the Genom-icRanges package reads essential information from a BAM file in to R Theresult is an instance of the GappedAlignments class The GappedAlignmentsclass has been designed to allow useful manipulation of many reads (eg 20million) under moderate memory requirements (eg 4 GB)
gt alnFile lt- systemfile(extdata ex1bam package=Rsamtools)
gt aln lt- readGappedAlignments(alnFile)
gt head(aln 3)
GappedAlignments with 3 alignments and 0 elementMetadata cols
seqnames strand cigar qwidth start end width
ltRlegt ltRlegt ltcharactergt ltintegergt ltintegergt ltintegergt ltintegergt
[1] seq1 + 36M 36 1 36 36
[2] seq1 + 35M 35 3 37 35
[3] seq1 + 35M 35 5 39 35
ngap
ltintegergt
[1] 0
[2] 0
[3] 0
---
seqlengths
seq1 seq2
1575 1584
The readGappedAlignments function takes an additional parameter param allow-ing the user to specify regions of the BAM file (eg known gene coordinates)from which to extract alignments
A GappedAlignments instance is like a data frame but with accessors assuggested by the column names It is easy to query eg the distribution ofreads aligning to each strand the width of reads or the cigar strings
gt table(strand(aln))
+ -
1647 1624
20
gt table(width(aln))
30 31 32 33 34 35 36 38 40
2 21 1 8 37 2804 285 1 112
gt head(sort(table(cigar(aln)) decreasing=TRUE))
35M 36M 40M 34M 33M 14M4I17M
2804 283 112 37 6 4
Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes
Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from
The object ex created earlier contains coordinates of four genes Use coun-
tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to
Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation
Solution We discover the location of files using standard R commands
gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)
gt names(fls) lt- sub(_ basename(fls))
Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data
gt input
gt aln lt- readGappedAlignments(fls[1])
gt xtabs(~seqnames + strand asdataframe(aln))
strand
seqnames + -
chr3L 5402 5974
chrX 2278 2283
To count overlaps in regions defined in a previous exercise load the regions
gt data(ex) from an earlier exercise
Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known
21
gt strand(aln) lt- protocol not strand-aware
For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to
gt hits lt- countOverlaps(aln ex)
gt table(hits)
hits
0 1 2
772 15026 139
and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment
gt cnt lt- countOverlaps(ex aln[hits==1])
A simple function for counting reads is
gt counter lt-
+ function(filePath range)
+
+ aln lt- readGappedAlignments(filePath)
+ strand(aln) lt-
+ hits lt- countOverlaps(aln range)
+ countOverlaps(range aln[hits==1])
+
This can be applied to all files using sapply
gt counts lt- sapply(fls counter ex)
The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is
gt if (require(parallel))
+ simplify2array(mclapply(fls counter ex))
The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies
Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function
Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)
22
Figure 2 GC content in aligned reads
Solution
gt param lt- ScanBamParam(what=seq)
gt seqs lt- scanBam(fls[[1]] param=param)
gt readGC lt- gcFunction(seqs[[1]][[seq]])
gt hist(readGC)
23
Table 6 Selected Bioconductor packages for RNA-seq analysis
Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-
rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data
also limmarsquos roast or camera after transformation withvoom or cqn
easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI
Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments
4 RNA-seq
Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments
The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial
Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use
41 Differential Expression with the edgeR Package
RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of
24
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
gt table(width(aln))
30 31 32 33 34 35 36 38 40
2 21 1 8 37 2804 285 1 112
gt head(sort(table(cigar(aln)) decreasing=TRUE))
35M 36M 40M 34M 33M 14M4I17M
2804 283 112 37 6 4
Exercise 6Use bigdata filepath and dir to obtain file paths to the BAM files These area subset of the aligned reads overlapping just four genes
Input the aligned reads from one file using readGappedAlignments Explorethe reads eg using table or xtabs to summarize which chromosome andstrand the subset of reads is from
The object ex created earlier contains coordinates of four genes Use coun-
tOverlaps to first determine the number of genes an individual read aligns toand then the number of uniquely aligning reads overlapping each gene Sincethe RNAseq protocol was not strand-sensitive set the strand of aln to
Write the sequence of steps required to calculate counts as a simple functionand calculate counts on each file On Mac or Linux can you easily parallelizethis operation
Solution We discover the location of files using standard R commands
gt fls lt- dir(filepath(bigdata() bam) bam$ full=TRUE)
gt names(fls) lt- sub(_ basename(fls))
Use readGappedAlignments to input data from one of the files and standard Rcommands to explore the data
gt input
gt aln lt- readGappedAlignments(fls[1])
gt xtabs(~seqnames + strand asdataframe(aln))
strand
seqnames + -
chr3L 5402 5974
chrX 2278 2283
To count overlaps in regions defined in a previous exercise load the regions
gt data(ex) from an earlier exercise
Many RNA-seq protocols are not strand aware ie reads align to the plusor minus strand regardless of the strand on which the corresponding gene isencoded Adjust the strand of the aligned reads to indicate that the strand isnot known
21
gt strand(aln) lt- protocol not strand-aware
For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to
gt hits lt- countOverlaps(aln ex)
gt table(hits)
hits
0 1 2
772 15026 139
and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment
gt cnt lt- countOverlaps(ex aln[hits==1])
A simple function for counting reads is
gt counter lt-
+ function(filePath range)
+
+ aln lt- readGappedAlignments(filePath)
+ strand(aln) lt-
+ hits lt- countOverlaps(aln range)
+ countOverlaps(range aln[hits==1])
+
This can be applied to all files using sapply
gt counts lt- sapply(fls counter ex)
The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is
gt if (require(parallel))
+ simplify2array(mclapply(fls counter ex))
The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies
Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function
Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)
22
Figure 2 GC content in aligned reads
Solution
gt param lt- ScanBamParam(what=seq)
gt seqs lt- scanBam(fls[[1]] param=param)
gt readGC lt- gcFunction(seqs[[1]][[seq]])
gt hist(readGC)
23
Table 6 Selected Bioconductor packages for RNA-seq analysis
Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-
rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data
also limmarsquos roast or camera after transformation withvoom or cqn
easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI
Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments
4 RNA-seq
Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments
The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial
Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use
41 Differential Expression with the edgeR Package
RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of
24
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
gt strand(aln) lt- protocol not strand-aware
For simplicity we are interested in reads that align to only a single gene Countthe number of genes a read aligns to
gt hits lt- countOverlaps(aln ex)
gt table(hits)
hits
0 1 2
772 15026 139
and reverse the operation to count the number of times each region of interestaligns to a uniquely overlapping alignment
gt cnt lt- countOverlaps(ex aln[hits==1])
A simple function for counting reads is
gt counter lt-
+ function(filePath range)
+
+ aln lt- readGappedAlignments(filePath)
+ strand(aln) lt-
+ hits lt- countOverlaps(aln range)
+ countOverlaps(range aln[hits==1])
+
This can be applied to all files using sapply
gt counts lt- sapply(fls counter ex)
The counts in one BAM file are independent of counts in another BAM fileThis encourages us to count reads in each BAM file in parallel decreasing thelength of time required to execute our program On Linux and Mac OS astraight-forward way to parallelize this operation is
gt if (require(parallel))
+ simplify2array(mclapply(fls counter ex))
The summarizeOverlaps function in the GenomicRanges package implementsmore appropraite counting strategies
Exercise 7Consult the help page for ScanBamParam and construct an object that restrictsthe information returned by a scanBam query to the aligned read DNA sequenceYour solution will use the what parameter to the ScanBamParam function
Use the ScanBamParam object to query a BAM file and calculate the GCcontent of all aligned reads Summarize the GC content as a histogram (Figure2)
22
Figure 2 GC content in aligned reads
Solution
gt param lt- ScanBamParam(what=seq)
gt seqs lt- scanBam(fls[[1]] param=param)
gt readGC lt- gcFunction(seqs[[1]][[seq]])
gt hist(readGC)
23
Table 6 Selected Bioconductor packages for RNA-seq analysis
Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-
rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data
also limmarsquos roast or camera after transformation withvoom or cqn
easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI
Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments
4 RNA-seq
Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments
The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial
Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use
41 Differential Expression with the edgeR Package
RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of
24
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
Figure 2 GC content in aligned reads
Solution
gt param lt- ScanBamParam(what=seq)
gt seqs lt- scanBam(fls[[1]] param=param)
gt readGC lt- gcFunction(seqs[[1]][[seq]])
gt hist(readGC)
23
Table 6 Selected Bioconductor packages for RNA-seq analysis
Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-
rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data
also limmarsquos roast or camera after transformation withvoom or cqn
easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI
Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments
4 RNA-seq
Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments
The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial
Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use
41 Differential Expression with the edgeR Package
RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of
24
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
Table 6 Selected Bioconductor packages for RNA-seq analysis
Package DescriptionEDASeq Exploratory analysis and QA also qrqc ShortReadedgeR DESeq Generalized Linear Models using negative binomial er-
rorDEXSeq Exon-level differential representationgoseq Gene set enrichment tailored to RNAseq count data
also limmarsquos roast or camera after transformation withvoom or cqn
easyRNASeq Workflow also ArrayExpressHTS rnaSeqMaponeChannelGUI
Rsubread Alignment (Linux only) also Biostrings matchPDict forspecial-purpose alignments
4 RNA-seq
Varieties of RNA-seq RNA-seq experiments typically ask about differencesin trancription of genes or other features across experimental groups The anal-ysis of designed experiments is statistical and hence an ideal task for R Theoverall structure of the analysis with tens of thousands of features and tensof samples is reminiscent of microarray analysis some insights from the mi-croarray domain will apply at least conceptually to the analysis of RNA-seqexperiments
The most straight-forward RNA-seq experiments quantify abundance forknown gene models The known models are derived from reference databasesreflecting the accumulated knowledge of the community responsible for the dataA more ambitious approach to RNA-seq attempts to identify novel transcriptsthis is beyond the scope of todayrsquos tutorial
Bioconductor packages play a role in several stages of an RNA-seq analysis(Table 6 a more comprehensive list is under the RNAseq and HighThroughput-Sequencing BiocViews terms) The GenomicRanges infrastructure can be effec-tively employed to quantify known exon or transcript abundances Quantifiedabundances are in essence a matrix of counts with rows representing featuresand columns samples The edgeR [15] and DESeq [1] packages facilitate anal-ysis of this data in the context of designed experiments and are appropriatewhen the questions of interest involve between-sample comparisons of relativeabundance The DEXSeq package extends the approach in edgeR and DESeqto ask about within-gene between group differences in exon use ie for a givengene do groups differ in their exon use
41 Differential Expression with the edgeR Package
RNA-seq differential representation experiments like classical microarray ex-periments consist of a single statistical design (eg comparing expression of
24
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
samples assigned to lsquoTreatmentrsquo versus lsquoControlrsquo groups) applied to each fea-ture for which there are aligned reads While one could naively perform simpletests (eg t-tests) on all features it is much more informative to identify impor-tant aspects of RNAseq experiments and to take a flexible route through thispart of the work flow Key steps involve formulation of a model matrix to cap-ture the experimental design estimation of a test static to describe differencesbetween groups and calculation of a P value or other measure as a statementof statistical significance
Counting and filtering An essential step is to arrive at some measure ofgene representation amongst the aligned reads A straight-forward and com-monly used approach is to count the number of times a read overlaps exonsNuance arises when a read only partly overlaps an exon when two exons over-lap (and hence a read appears to be lsquodouble countedrsquo) when reads are alignedwith gaps and the gaps are inconsistent with known exon boundaries etc ThesummarizeOverlaps function in the GenomicRanges package provides facilitiesfor implementing different count strategies using the argument mode to deter-mine the counting strategy The result of summarizeOverlaps can easily be usedin subsequent steps of an RNA-seq analysis Software other than R can also beused to summarize count data An important point is that the desired input fordownstream analysis is often raw count data rather than normalized (eg readsper kilobase of gene model per million mapped reads) values This is becausecounts allow information about uncertainty of estimates to propagate to laterstages in the analysis
The following exercise illustrates key steps in counting and filtering readsoverlapping known genes
Exercise 8The useR2012 package contains a data set counts with pre-computed countdata Use the data command to load it Create a variable grp to define thegroups associated with each column using the column names as a proxy formore authoritative metadata
Create a DGEList object (defined in the edgeR package) from the count matrixand group information Calculate relative library sizes using the calcNormFac-
tors functionA lesson from the microarray world is to discard genes that cannot be in-
formative (eg because of lack of variation) regardless of statistical hypothesisunder evaluation Filter reads to remove those that are represented at less than1 per million mapped reads in fewer than 2 samples
Solution Here we load the data (a matrix of counts) and create treatmentgroup names from the column names of the counts matrix
gt data(counts)
gt dim(counts)
[1] 14470 7
25
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
gt grps lt- factor(sub([1-4] colnames(counts))
+ levels=c(untreated treated))
gt pairs lt- factor(c(single paired paired
+ single single paired paired))
gt pData lt- dataframe(Group=grps PairType=pairs
+ rownames=colnames(counts))
We use the edgeR package creating a DGEList object from the count andgroup data The calcNormFactors function estimates relative library sizes foruse as offsets in the generalized linear model
gt library(edgeR)
gt dge lt- DGEList(counts group=pData$Group)
gt dge lt- calcNormFactors(dge)
To filter reads we scale the counts by the library sizes and express the resultson a per-million read scale This is done using the sweep function dividing eachcolumn by itrsquos library size and multiplying by 1e6 We require that the gene berepresented at a frequency of at least 1 read per million mapped (m gt 1 below)in two or more samples (rowSums(m gt 1) gt= 2) and use this criterion to subsetthe DGEList instance
gt m lt- sweep(dge$counts 2 1e6 dge$samples$libsize ``)gt ridx lt- rowSums(m gt 1) gt= 2
gt table(ridx) number filtered retained
ridx
FALSE TRUE
6476 7994
gt dge lt- dge[ridx]
Experimental design In R an experimental design is specified with themodelmatrix function The function takes as its first argument a formula
describing the independent variables and their relationship to the response(counts) and as a second argument a dataframe containing the (phenotypic)data that the formula describes A simple formula might read ~ 1 + Groupwhich says that the response is a linear function involving an intercept (1) plusa term encoded in the variable Group If (as in our case) Group is a factor thenthe first coefficient (column) of the model matrix corresponds to the first levelof Group and subsequent terms correspond to deviations of each level from thefirst If Group were numeric rather than factor the formula would representlinear regressions with an intercept Formulas are very flexible allowing repre-sentation of models with one two or more factors as main effects models withor without interaction and with nested effects
26
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
Exercise 9To be more concrete use the modelmatrix function and a formula involvingGroup to create the model matrix for our experiment
Solution Here is the experimental design it is worth discussing with yourneighbor the interpretation of the design instance
gt (design lt- modelmatrix(~ Group pData))
(Intercept) Grouptreated
treated1fb 1 1
treated2fb 1 1
treated3fb 1 1
untreated1fb 1 0
untreated2fb 1 0
untreated3fb 1 0
untreated4fb 1 0
attr(assign)
[1] 0 1
attr(contrasts)
attr(contrasts)$Group
[1] contrtreatment
The coefficient (column) labeled lsquoInterceptrsquo corresponds to the first level ofGroup ie lsquountreatedrsquo The coefficient lsquoGrouptreatedrsquo represents the deviationof the treated group from untreated Eventually we will test whether the secondcoefficient is significantly different from zero ie whether samples with a lsquo1rsquo inthe second column are on average different from samples with a lsquo0rsquo On theone hand use of modelmatrix to specify experimental design implies that theuser is comfortable with something more than elementary statistical conceptswhile on the other it provides great flexibility in the experimental design thatcan be analyzed
Negative binomial error RNA-seq count data are often described by a neg-ative binomial error model This model includes a lsquodispersionrsquo parameter thatdescribes biological variation beyond the expectation under a Poisson modelThe simplest approach estimates a dispersion parameter from all the data Theestimate needs to be conducted in the context of the experimental design sothat variability between experimental factors is not mistaken for variability incounts The square root of the estimated dispersion represents the coefficient ofvariation between biological samples The following edgeR commands estimatedispersion
gt dge lt- estimateTagwiseDisp(dge)
gt mean(sqrt(dge$tagwisedispersion))
[1] 01778359
27
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
This approach assumes that a common dispersion parameter is shared by allgenes A different approach appropriate when there are more samples in thestudy is to estimate a dispersion parameter that is specific to each tag (usingestimateTagwiseDisp in the edgeR package) As another alternative Andersand Huber [1] note that dispersion increases as the mean number of reads pergene decreases One can estimate the relationship between dispersion and meanusing estimateGLMTrendedDisp in edgeR using a fitted relationship across allgenes to estimate the dispersion of individual genes Because in our case sam-ple sizes (biological replicates) are small gene-wise estimates of dispersion arelikely imprecise One approach is to moderate these estimates by calculating aweighted average of the gene-specific and common dispersion estimateGLMTag-wiseDisp performs this calculation requiring that the user provides an a prioriestimate of the weight between tag-wise and common dispersion
Differential representation The final steps in estimating differential repre-sentation are to fit the full model to perform the likelihood ratio test comparingthe full model to a model in which one of the coefficients has been removed andto summarize from the likelihood ratio calculation genes that are most differ-entially represented The result is a lsquotop tablersquo whose row names are the Flybasegene ids used to label the elements of the ex GRangesList
Exercise 10Use glmFit to fit the general linear model This function requires the input datadge the experimental design design and the estimate of dispersion
Use glmLRT to form the likelihood ratio test This requires the original datadge and the fitted model from the previous part of this question Which coeffi-cient of the design matrix do you wish to test
Create a lsquotop tablersquo of differentially represented genes using topTags
Solution Here we fit a generalized linear model to our data and experimentaldesign using the tagwise dispersion estimate
gt fit lt- glmFit(dge design)
The fit can be used to calculate a likelihood ratio test comparing the fullmodel to a reduced version with the second coefficient removed The secondcoefficient captures the difference between treated and untreated groups andthe likelihood ratio test asks whether this term contributes meaningfully to theoverall fit
gt lrTest lt- glmLRT(dge fit coef=2)
Here the topTags function summarizes results across the experiment
gt tt lt- topTags(lrTest n=10)
gt tt[13]
28
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
Coefficient Grouptreated
logFC logCPM LR PValue FDR
FBgn0039155 -4698051 6030423 5423132 5918302e-120 4731091e-116
FBgn0039827 -4275280 4591903 2472889 1012761e-55 4048005e-52
FBgn0029167 -2233973 8246355 2114650 6580299e-48 1753430e-44
As a rsquosanity checkrsquo summarize the original data for the first several probesconfirming that the average counts of the treatment and control groups aresubstantially different
gt sapply(rownames(tt$table)[14]
+ function(x) tapply(counts[x] pData$Group mean))
FBgn0039155 FBgn0039827 FBgn0029167 FBgn0034736
untreated 1576 554 6447000 38225
treated 64 31 1482667 3600
42 Additional Steps in RNA-seq Work Flows
The forgoing provides an elementary work flow There are many interestingadditional opportunities including
Annotation Standard Bioconductor facilities eg the select method fromthe AnnotationDbi package applied to packages such as orgDmegdbcan provide biological context (eg gene name KEGG or GO pathwaymembership) for interpretting genes at the top of a top table Packagesusing GenomicFeatures eg TxDbDmelanogasterUCSCdm3ensGenecan provide information on genome structure eg genomic coordinates ofexons and relationship between exons coding sequences transcripts andgenes
Gene Set Enrichment Care needs to be taken because statistical signficanceof genes is proportional to the number of reads aligning to the gene (egdue to gene length or GC content) see eg goseq
Exon-level Differential Representation The DEXSeq package takes an in-teresting approach to within-gene differential expression testing for in-teraction between exon use and treatment The forthcoming SpliceGraphpackage takes this a step further by summarizing gene models into graphswith lsquobubblesrsquo representing alternative splicing events this reduces thenumber of statistical tests (increasing count per edge and statistical power)while providing meaningful insight into the types of events (eg lsquoexonskiprsquo lsquoalternative acceptorrsquo) occuring
29
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
43 Resources
The edgeR DESeq and DEXSeq package vignettes provide excellent exten-sive discussion of issues and illustration of methods for RNA-seq differentialexpression analysis
30
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
Table 7 Selected Bioconductor packages for RNA-seq analysis
Package Descriptionqrqc Quality assessment also ShortRead chipseqPICS Peak calling also mosaics chipseq ChIPseqR
BayesPeak nucleR (nucleosome positioning)ChIPpeakAnno Peak annotationDiffBind Multiple-experiment analysisMotIV Motif identification and validation also rGADEM
5 ChIP-seq
ChIP-seq and similar experiments combine chromosome immuno-precipitation(ChIP) with sequence analysis The idea is that the ChIP protocol enrichesgenomic DNA for regions of interest eg sites to which transcription factors arebound The regions of interest are then subject to high throughput sequencingthe reads aligned to a reference genome and the location of mapped reads(lsquopeaksrsquo) interpreted as indicators of the ChIPrsquoed regions Reviews include thoseby Park and colleagues [13 6] there is a large collection of peak-calling softwaresome features of which are summarized in Pepke et al [14]
The main challenge in early ChIP-seq studies was to develop efficient peak-calling software often tailored to the characteristics of the peaks of interest(eg narrow and well-defined CTCF binding sites vs broad histone marks)More comprehensive studies draw from multiple samples eg in the ENCODEproject [7 12] Decreasing sequence costs and better experimental and dataanalytic protocols mean that these larger-scale studies are increasingly accessibleto individual investigators Peak-calling in this kind of study represents aninitial step but interpretting analyses derived from multiple samples presentsignificant analytic challenges Bioconductor packages play a role in severalstages of a ChIP-seq analysis (Table 7 a more comprehensive list is under theChIPseq and HighThroughputSequencing BiocViews terms)
Our attention is on analyzing multiple samples from a single experiment andidentifying and annotating peaks We start with a typical work flow re-iteratingkey components in an exploration of data from the ENCODE project and con-tinue with down-stream analysis including motif discovery and annotation
51 Initial Work Flow
We use data from GEO accession GSE30263 representing ENCODE CTCFbinding sites CTCF is a zinc finger transcription factor It is a sequencespecific DNA binding protein that functions as an insulator blocking enhanceractivity and possibly the spread of chromatin structure The original analysisinvolved Illumina ChIP-seq and matching lsquoinputrsquo lanes of 1 or 2 replicates frommany cell lines The GEO accession includes BAM files of aligned reads inaddition to tertiary files summarizing identified peaks We focus on 15 cell lines
31
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
aligned to hg19As a precursor to analysis it is prudent to perform an overall quality as-
sesssement of the data an example is available
gt rpt lt- systemfile(GSE30263_qa_report indexhtml
+ package=useR2012 mustWork=TRUE)
gt if (interactive())
+ browseURL(rpt)
The main computational stages in the original work flow involved alignmentusing Bowtie followed by peak identification using an algorithm (lsquoHotSpotsrsquo[16]) originally developed for lower-throughput methodologies We collated theoutput files from the original analysis with a goal of enumerating all peaks fromall files but collapsing the coordinates of sufficiently similar peaks to a commonlocation The DiffBind package provides a formalism with which to do theseoperations Here we load the data as an R object stam (an abbreviation for thelab generating the data)
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
gt stam
class SummarizedExperiment
dim 369674 96
exptData(0)
assays(2) Tags PVals
rownames NULL
rowData values names(0)
colnames(96) A549_1 A549_2 Wi38_1 Wi38_2
colData names(10) CellLine Replicate PeaksDate PeaksFile
Exercise 11Explore stam Tabulate the number of peaks represented 1 2 96 times Weexpect replicates to have similar patterns of peak representation do they
Solution Load the data and display the SummarizedExperiment instance ThecolData summarizes information about each sample the rowData about eachpeak Use xtabs to summarize Replicate and CellLine representation withincolData(stam)
gt head(colData(stam) 3)
DataFrame with 3 rows and 10 columns
CellLine Replicate TotTags TotPeaks Tags Peaks
ltcharactergt ltfactorgt ltintegergt ltintegergt ltnumericgt ltnumericgt
A549_1 A549 1 1857934 50144 1569215 43119
A549_2 A549 2 2994916 77355 2881475 73062
Ag04449_1 Ag04449 1 5041026 81855 4730232 75677
32
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
FastqDate FastqSize PeaksDate
ltDategt ltnumericgt ltDategt
A549_1 2011-06-25 463 2011-06-25
A549_2 2011-06-25 703 2011-06-25
Ag04449_1 2010-10-22 368 2010-10-22
PeaksFile
ltcharactergt
A549_1 wgEncodeUwTfbsA549CtcfStdPkRep1narrowPeakgz
A549_2 wgEncodeUwTfbsA549CtcfStdPkRep2narrowPeakgz
Ag04449_1 wgEncodeUwTfbsAg04449CtcfStdPkRep1narrowPeakgz
gt head(rowData(stam) 3)
GRanges with 3 ranges and 0 elementMetadata cols
seqnames ranges strand
ltRlegt ltIRangesgt ltRlegt
[1] chr1 [10100 10370]
[2] chr1 [15640 15790]
[3] chr1 [16100 16490]
---
seqlengths
chr1 chr2 chr3 chr4 chr22 chrX chrY
249250621 243199373 198022430 191154276 51304566 155270560 59373566
gt xtabs(~Replicate + CellLine colData(stam))[15]
CellLine
Replicate A549 Ag04449 Ag04450 Ag09309 Ag09319
1 1 1 1 1 1
2 1 1 1 1 1
Extract the Tags matrix from the assays This is a standard R matrix Testwhich matrix elements are non-zero tally these by row and summarize thetallies This is the number of times a peak is detected across each of thesamples
gt m lt- assays(stam)[[Tags]] gt 0 peaks detected
gt peaksPerSample lt- table(rowSums(m))
gt head(peaksPerSample)
1 2 3 4 5 6
174574 35965 18939 12669 9143 7178
gt tail(peaksPerSample)
91 92 93 94 95 96
1226 1285 1542 2082 2749 14695
33
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
Figure 3 Hierarchical clustering of ENCODE samples
To explore similarity between replicates extract the matrix of counts Trans-form the counts using the asinh function (a log-like transform except near 0are there other methods for transformation) and use the lsquocorrelationrsquo distance(cordist from bioDist) to measure similarity Cluster these using a hierarchi-cal algorithm via the hclust function
gt library(bioDist) for cordist
gt m lt- asinh(assays(stam)[[Tags]]) transformed tag counts
gt d lt- cordist(t(m)) correlation distance
gt h lt- hclust(d) hierarchical clustering
Plot the result as in Figure 3
gt plot(h cex=8 ann=FALSE)
52 Motifs
Transcription factors and other common regulatory elements often target spe-cific DNA sequences (lsquomotifsrsquo) These are often well-characterized and can beused to help identify a priori regions in which binding is expected Knownbinding motifs may also be used to identify promising peaks identified using denovo peak discovery methods like MACS This section explores use of knownbinding motifs to characterize peaks packages such as MotIV can assist in motifdiscovery
Known binding motifs The JASPAR data base curates known binding mo-tifs obtained from the literature A binding motif is summarized as a positionweight matrix PWM Rows of a PWM correspond to nucleotides columns topositions and entries to the probability of the nucleotide at that position Eachstart position in a reference sequence can be compared and scored for similarityto the PWM and high-scoring positions retained
34
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
Exercise 12The objective of this exercise is to identify occurrences of the CTCF motif onchromosome 1 of H sapiens
Load needed packages Biostrings can represent a PWM and score a referencesequence The BSgenomeHsapiensUCSChg19 package contains the hg19 buildof H sapiens retrieved from the UCSC genome browser seqLogo and latticeare used for visualization
Retrieve the PWM for CTCF with JASPAR id MA01391pfm using thehelper function getJASPAR defined in the useR2012 package
Use matchPWM to score the plus strand of chr1 for the CTCF PWM Visualizethe distribution of scores using eg densityplot and summarize the high-scoring matches (using consensusMatrix) as a seqLogo
As an additional exercise work up a short code segment to apply the PWMto both strands (see PWM for some hints) and to all chromosomes
Solution Here we load the required packages and retrieve the position weightmatrix for CTCF
gt library(Biostrings)
gt library(BSgenomeHsapiensUCSChg19)
gt library(seqLogo)
gt library(lattice)
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
Chromosome 1 can be loaded with Hsapiens[[chr1]] matchPWM returns alsquoviewrsquo of the high-scoring locations matching the PWM Scores are retrievedfrom the PWM and hits using PWMscoreStartingAt
gt chrid lt- chr1
gt hits lt-matchPWM(pwm Hsapiens[[chrid]]) + strand
gt scores lt- PWMscoreStartingAt(pwm subject(hits) start(hits))
The distribution of scores can be visualized with eg densityplot from thelattice package
gt densityplot(scores xlim=range(scores) pch=|)
consensusMatrix applied to the views in hits returns a position frequency amtrixthis can be plotted as a logo with the result in Figure 4 Reassuringly the foundsequences have a logo very similar to the expected
gt cm lt- consensusMatrix(hits)[14]
gt seqLogo(makePWM(scale(cm FALSE colSums(cm))))
35
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
Figure 4 CTCF position weight matrix of found sites on the plus strand of chr1(hits within 80 of maximum score)
36
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
6 Annotation
Bioconductor provides extensive annotation resources summarized in Figure 5These can be gene- or genome-centric Annotations can be provided in packagescurated by Bioconductor or obtained from web-based resources Gene-centricAnnotationDbi packages include
bull Organism level eg orgMmegdb
bull Platform level eg hgu133plus2db hgu133plus2probes hgu133plus2cdf
bull Homology level eg homDminpdb
bull System biology level GOdb KEGGdb Reactomedb
Examples of genome-centric packages include
bull GenomicFeatures to represent genomic features including constructingreproducible feature or transcript data bases from file or web resources
bull Pre-built transcriptome packages eg TxDbHsapiensUCSChg19knownGenebased on the H sapiens UCSC hg19 knownGenes track
bull BSgenome for whole genome sequence representation and manipulation
bull Pre-built genomes eg BSgenomeHsapiensUCSChg19 based on the Hsapiens UCSC hg19 build
Web-based resources include
bull biomaRt to query biomart resource for genes sequence SNPs and etc
bull rtracklayer for interfacing with browser tracks especially the UCSC genomebrowser
61 Gene-Centric Annotations with AnnotationDbi
Organism-level (lsquoorgrsquo) packages uses a central gene identifier (eg Entrez Geneid) and contain mappings between this identifier and other kinds of identifiers(eg GenBank or Uniprot accession number RefSeq id etc) The name ofan org package is always of the form orgltAbgtltefggtdb (eg orgScsgddb)where ltAbgt is a 2-letter abbreviation of the organism (eg Sc for Saccha-romyces cerevisiae) and ltefggt is an abbreviation (in lower-case) describing thetype of central identifier (eg sgd for gene identifiers assigned by the Saccha-romyces Genome Database or eg for Entrez gene ids) The How to use the lsquodbrsquoannotation packages vignette in the AnnotationDbi package (org packages areonly one type of ldquodbrdquo annotation packages) is a key reference The lsquodbrsquo andmost other Bioconductor annotation packages are updated every 6 months
37
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
GENE ID
PLATFORM PKGS
GENE ID
ONTO IDrsquoS
ORG PKGS
GENE ID
ONTO ID
TRANSCRIPT PKGS
SYSTEM BIOLOGY
(GO KEGG)
GENE ID
HOMOLOGY PKGS
Figure 5 Annotation Packages the big picture
Exercise 13What is the name of the org package for Drosophila Load it
Use ls(packageltpkgnamegt) to display the list of all symbols defined in thispackage Explore a few of the symbols by looking at their man page at theirclass and by viewing their head with toTable
Most maps can be reversed with revmap Reverse the orgDmegUNIPROT mapand extract a few identifiers from the reversed map
Solution
gt library(orgDmegdb)
gt head(ls(packageorgDmegdb) 3)
[1] orgDmeg orgDmegdb orgDmegACCNUM
gt orgDmegUNIPROT
UNIPROT map for Fly (object of class AnnDbBimap)
gt class(orgDmegUNIPROT)
[1] AnnDbBimap
attr(package)
[1] AnnotationDbi
gt toTable(head(orgDmegUNIPROT 3))
gene_id uniprot_id
1 30970 Q8IRZ0
2 30970 Q95RP8
3 30971 Q95RU8
4 30972 Q9W5H1
38
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
Each map consists of left keys and right keys The left keys are the Entrez geneids and the right keys the Uniprot accession numbers For all maps in an orgpackage the left key is always the central gene id
gt toTable(head(revmap(orgDmegUNIGENE) 3))
gene_id unigene_id
1 30970 Dm6474
2 30971 Dm9
3 30972 Dm12271
gt identical(Lkeys(orgDmegUNIGENE) Lkeys(revmap(orgDmegUNIGENE)))
[1] TRUE
Recent versions of many annotation packages allow a simpler way of extract-ing annotations Annotation packages supporting these new methods contain anobject named after the package itself These objects are collectively called An-notationDb objects with more specific classes named OrgDb ChipDb or Tran-scriptDb objects Methods that can be applied to these objects include colskeys keytypes and select
Exercise 14Display the OrgDb object for the orgDmegdb package
Use the cols method to discover which sorts of annotations can be extractedfrom it
Use the keys method to extract UNIPROT identifiers and then pass thosekeys in to the select method in such a way that you extract the SYMBOL (genesymbol) and KEGG pathway information for each
Use select to retrieve the ENTREZ and SYMBOL identifiers of all genes inthe KEGG pathway 00310
Solution The OrgDb object is named orgDmegdb
gt cols(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] GENENAME MAP PATH PMID REFSEQ
[11] SYMBOL UNIGENE CHRLOC CHRLOCEND FLYBASE
[16] FLYBASECG FLYBASEPROT UNIPROT ENSEMBL ENSEMBLPROT
[21] ENSEMBLTRANS GO
gt keytypes(orgDmegdb)
[1] ENTREZID ACCNUM ALIAS CHR ENZYME
[6] MAP PATH PMID REFSEQ SYMBOL
[11] UNIGENE FLYBASE FLYBASECG FLYBASEPROT UNIPROT
[16] ENSEMBL ENSEMBLPROT ENSEMBLTRANS GO
39
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
gt uniprotKeys lt- head(keys(orgDmegdb keytype=UNIPROT))
gt cols lt- c(SYMBOL PATH)
gt select(orgDmegdb keys=uniprotKeys cols=cols keytype=UNIPROT)
UNIPROT SYMBOL PATH
1 Q8IRZ0 CG3038 ltNAgt
2 Q95RP8 CG3038 ltNAgt
3 Q95RU8 G9a 00310
4 Q9W5H1 CG13377 ltNAgt
5 P39205 cin ltNAgt
6 Q24312 ewg ltNAgt
Selecting UNIPROT and SYMBOL ids of KEGG pathway 00310 is very similar
gt kegg lt- select(orgDmegdb 00310 c(UNIPROT SYMBOL) PATH)
gt nrow(kegg)
[1] 32
gt head(kegg 3)
PATH UNIPROT SYMBOL
1 00310 Q95RU8 G9a
2 00310 Q9W5E0 Suv4-20
3 00310 Q9W3N9 CG10932
62 Genome-Centric Annotations with GenomicFeatures
Genome-centric packages are very useful for annotations involving genomic co-ordinates It is straight-forward for instance to discover the coordinates ofcoding sequences in regions of interest and from these retrieve correspondingDNA or protein coding sequences Other examples of the types of operationsthat are easy to perform with genome-centric annotations include defining re-gions of interest for counting aligned reads in RNA-seq experiments (Section 4)and retrieving DNA sequences underlying regions of interest in ChIP-seq anal-ysis (Section 5) eg for motif characterization
Exercise 15The objective of this exercise is to characterize the distance between identifiedpeaks and nearest transcription start site
Load the ENCODE summary data select the peaks found in all samplesand use the center of these peaks as a proxy for the true ChIP binding site
Use the transcript data base for the UCSC Known Genes track of hg19 as asource for transcripts and transcription start sites (TSS)
Use nearest to identify the TSS that is nearest each peak and calculate thedistance between the peak and TSS measure distance taking account of thestrand of the transcript so that peaks 5rsquo of the TSS have negative distance
Summarize the locations of the peaks relative to the TSS
40
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
Solution Read in the ENCODE ChIP peaks for all cell lines
gt stamFile lt- systemfile(data stamRda package=useR2012)
gt load(stamFile)
Identify the rows of stam that have non-zero counts for all cell lines and extractthe corresponding ranges
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt peak lt- rowData(stam)[ridx]
Select the center of the ranges of these peaks as a proxy for the ChIP bindingsite
gt peak lt- resize(peak width=1 fix=center)
Obtain the TSS from the TxDbHsapiensUCSChg19knownGene using thetranscripts function to extract coordinates of each transcript and resize to awidth of 1 for the TSS does this do the right thing for transcripts on the plusand on the minus strand
gt library(TxDbHsapiensUCSChg19knownGene)
gt tx lt- transcripts(TxDbHsapiensUCSChg19knownGene)
gt tss lt- resize(tx width=1)
The nearest function returns the index of the nearest subject to each query
element the distance between peak and nearest TSS is thus
gt idx lt- nearest(peak tss)
gt sgn lt- asinteger(ifelse(strand(tss)[idx] == + 1 -1))
gt dist lt- (start(peak) - start(tss)[idx]) sgn
Here we summarize the distances as a simple table and density plot focusingon binding sites within 1000 bases of a transcription start site the density plotis in Figure 6
gt bound lt- 1000
gt ok lt- abs(dist) lt bound
gt dist lt- dist[ok]
gt table(sign(dist))
-1 0 1
1262 4 707
gt griddensityplot lt-
+ function()
+ panel function to plot a grid underneath density
+
+ panelgrid()
+ paneldensityplot()
41
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
Figure 6 Distance to nearest TSS amongst conserved peaks
+
gt print(densityplot(dist[ok] plotpoints=FALSE
+ panel=griddensityplot
+ xlab=Distance to Nearest TSS))
The distance to transcript start site is a useful set of operations so letrsquos makeit a re-usable function
gt distToTss lt-
+ function(peak tx)
+
+ peak lt- resize(peak width=1 fix=center)
+ tss lt- resize(tx width=1)
+ idx lt- nearest(peak tss)
+ sgn lt- asnumeric(ifelse(strand(tss)[idx] == + 1 -1))
+ (start(peak) - start(tss)[idx]) sgn
+
Exercise 16As an additional exercise extract the sequences of all conserved peaks on lsquochr6rsquoDo this using the BSgenomeHsapiensUCSChg19 package and getSeq functionUse matchPWM to find sequences with a strong match to the JASPAR CTCFPWM motif and plot the density of distances to nearest transcription start sitefor those with and without a match What strategies are available for motifdiscovery
Solution Here we select peaks on chromosome 6 and extract the DNA se-quences corresponding to these peaks
42
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
gt library(BSgenomeHsapiensUCSChg19)
gt ridx lt- rowSums(assays(stam)[[Tags]] gt 0) == ncol(stam)
gt ridx lt- ridx amp (seqnames(rowData(stam)) == chr6)
gt pk6 lt- rowData(stam)[ridx]
gt seqs lt- getSeq(Hsapiens pk6 ascharacter=FALSE)
gt head(seqs 3)
A DNAStringSet instance of length 3
width seq
[1] 311 CAGGGAGACTTGGGAAGGCTTCACGAAGGAGGGTACCCAACTCCTAAGCGTCACACATATAATCCTG
[2] 331 GCTAATAATTTACCATGAAGTAACAACTTTTCACTTTCCTAGGCAGCGAATTTAAGGGTAATGATCA
[3] 751 GTAAAGAATGGACTGACTTAAAGGCAGATGGAATAATCAAACAAGACAAAGAATCTTCGTGGCCACA
matchPWM operates on one DNA sequence at a time so we arrange to search forthe PWM on each sequence using lapply We identify sequences with a matchby testing the length of the returned object and use this to create a densityplot
gt pwm lt- getJASPAR(MA01391pfm) useR2012getJASPAR
gt hits lt- lapply(seqs matchPWM pwm=pwm)
gt hasPwmMatch lt- sapply(hits length) gt 0
gt dist lt- distToTss(pk6 tx)
gt ok lt- abs(dist) lt bound
gt df lt- dataframe(Distance = dist[ok] HasPwmMatch = hasPwmMatch[ok])
gt print(densityplot(~Distance group=HasPwmMatch df
+ plotpoints=FALSE panel=griddensityplot
+ autokey=list(
+ columns=2
+ title=Has Position Weight Matrix
+ cextitle=1)
+ xlab=Distance to Nearest Tss))
63 Exploring Annotated Variants VariantAnnotation
A major product of DNASeq experiments are catalogs of called variants (egSNPs indels) We will use the VariantAnnotation package to explore this typeof data Sample data included in the package are a subset of chromosome 22from the 1000 Genomes project Variant Call Format (VCF full description)text files contain meta-information lines a header line with column names datalines with information about a position in the genome and optional genotypeinformation on samples for each position
Data are read from a VCF file and variants identified according to regionsuch as coding intron intergenic spliceSite etc Amino acid coding changesare computed for the non-synonymous variants SIFT and PolyPhen databasesprovide predictions of how severely the coding changes affect protein function
43
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
Exercise 17The objective of this exercise is to compare the quality of called SNPs that arelocated in dbSNP versus those that are novel
Locate the sample data in the file system Explore the metadata (informationabout the content of the file) using scanVcfHeader Discover the lsquoinforsquo fields VT
(variant type) and RSQ (genotype imputation quality)Input sample data in using readVcf Yoursquoll need to specify the genome build
(genome=hg19) on which the variants are annotated Take a peak at the rowData
to see the genomic locations of each variantdbSNP uses abbreviations such as ch22 to represent chromosome 22 whereas
the VCF file uses 22 Use rowData and renameSeqlevels to extract the row dataof the variants and rename the chromosomes
The SNPlocsHsapiensdbSNP20101109 contains information about SNPs ina particular build of dbSNP Load the package use the dbSNPFilter function tocreate a filter and query the row data of the VCF file for membership
Create a data frame containing the dbSNP membership status and imputa-tion quality of each SNP Create a density plot to illustrate the results
Solution Explore the header
gt library(VariantAnnotation)
gt fl lt- systemfile(extdata chr22vcfgz
+ package=VariantAnnotation)
gt (hdr lt- scanVcfHeader(fl))
class VCFHeader
samples(5) HG00096 HG00097 HG00099 HG00100 HG00101
meta(1) fileformat
fixed(1) ALT
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
gt info(hdr)[c(VT RSQ)]
DataFrame with 2 rows and 3 columns
Number Type Description
ltcharactergt ltcharactergt ltcharactergt
VT 1 String indicates what type of variant the line represents
RSQ 1 Float Genotype imputation quality from MaCHThunder
Input the data and peak at their locations
gt (vcf lt- readVcf(fl hg19))
class VCF
dim 10376 5
genome hg19
exptData(1) header
44
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
fixed(4) REF ALT QUAL FILTER
info(22) LDAF AVGPOST VT SNPSOURCE
geno(3) GT DS GL
rownames(10376) rs7410291 rs147922003 rs144055359 rs114526001
rowData values names(1) paramRangeID
colnames(5) HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1) Samples
gt head(rowData(vcf) 3)
GRanges with 3 ranges and 1 elementMetadata col
seqnames ranges strand | paramRangeID
ltRlegt ltIRangesgt ltRlegt | ltfactorgt
rs7410291 22 [50300078 50300078] | ltNAgt
rs147922003 22 [50300086 50300086] | ltNAgt
rs114143073 22 [50300101 50300101] | ltNAgt
---
seqlengths
22
NA
Rename chromosome levels
gt rowData(vcf) lt- renameSeqlevels(rowData(vcf) c(22=ch22))
Discover whether SNPs are located in dbSNP
gt library(SNPlocsHsapiensdbSNP20101109)
gt snpFilt lt- dbSNPFilter(SNPlocsHsapiensdbSNP20101109)
gt inDbSNP lt- snpFilt(rowData(vcf) subset=FALSE)
gt table(inDbSNP)
inDbSNP
FALSE TRUE
6126 4250
Create a data frame summarizing SNP quality and dbSNP membership
gt metrics lt-
+ dataframe(inDbSNP=inDbSNP RSQ=values(info(vcf))$RSQ)
Finally visualize the data eg using ggplot2 (Figure 7)
gt library(ggplot2)
gt ggplot(metrics aes(RSQ fill=inDbSNP)) +
+ geom_density(alpha=05) +
+ scale_x_continuous(name=MaCH Thunder Imputation Quality) +
+ scale_y_continuous(name=Density) +
+ opts(legendposition=top)
45
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
Figure 7 Quality scores of variants in dbSNP compared to those not in dbSNP
46
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
References
[1] S Anders and W Huber Differential expression analysis for sequence countdata Genome Biology 11R106 2010
[2] A N Brooks L Yang M O Duff K D Hansen J W Park S DudoitS E Brenner and B R Graveley Conservation of an RNA regulatorymap between Drosophila and mammals Genome Research pages 193ndash2022011
[3] J M Chambers Software for Data Analysis Programming with RSpringer New York 2008
[4] P Dalgaard Introductory Statistics with R Springer 2nd edition 2008
[5] R Gentleman R Programming for Bioinformatics Computer Science ampData Analysis Chapman amp HallCRC Boca Raton FL 2008
[6] J W Ho E Bishop P V Karchenko N Negre K P White and P JPark ChIP-chip versus ChIP-seq lessons for experimental design and dataanalysis BMC Genomics 12134 2011 [PubMed CentralPMC3053263][DOI1011861471-2164-12-134] [PubMed21356108]
[7] P V Kharchenko A A Alekseyenko Y B Schwartz A Minoda N CRiddle J Ernst P J Sabo E Larschan A A Gorchakov T GuD Linder-Basso A Plachetka G Shanower M Y Tolstorukov L JLuquette R Xi Y L Jung R W Park E P Bishop T K CanfieldR Sandstrom R E Thurman D M MacAlpine J A Stamatoyannopou-los M Kellis S C Elgin M I Kuroda V Pirrotta G H Karpenand P J Park Comprehensive analysis of the chromatin landscape inDrosophila melanogaster Nature 471480ndash485 Mar 2011 [PubMed Cen-tralPMC3109908] [DOI101038nature09725] [PubMed21179089]
[8] B Langmead C Trapnell M Pop and S L Salzberg Ultrafast andmemory-efficient alignment of short DNA sequences to the human genomeGenome Biol 10R25 2009
[9] H Li and R Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics 251754ndash1760 Jul 2009
[10] H Li and R Durbin Fast and accurate long-read alignment with Burrows-Wheeler transform Bioinformatics 26589ndash595 Mar 2010
[11] N Matloff The Art of R Programming No Starch Pess 2011
[12] R M Myers J Stamatoyannopoulos M Snyder and P JGood A userrsquos guide to the encyclopedia of DNA elements (ENCODE)PLoS Biol 9e1001046 Apr 2011 [PubMed CentralPMC3079585][DOI101371journalpbio1001046] [PubMed21526222]
47
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48
[13] P J Park ChIP-seq advantages and challenges of a maturing technologyNat Rev Genet 10669ndash680 Oct 2009 [PubMed CentralPMC3191340][DOI101038nrg2641] [PubMed19736561]
[14] S Pepke B Wold and A Mortazavi Computation for ChIP-seq and RNA-seq studies Nat Methods 622ndash32 Nov 2009 [DOI101038nmeth1371][PubMed19844228]
[15] M D Robinson D J McCarthy and G K Smyth edgeR a Bioconductorpackage for differential expression analysis of digital gene expression dataBioinformatics 26139ndash140 Jan 2010
[16] P J Sabo M Hawrylycz J C Wallace R Humbert M Yu A ShaferJ Kawamoto R Hall J Mack M O Dorschner M McArthur and J AStamatoyannopoulos Discovery of functional noncoding elements by digitalanalysis of chromatin structure Proc Natl Acad Sci USA 10116837ndash16842 Nov 2004
48