Exploring the Ranges Infrastructure

Michael Lawrence

July 28, 2017

Outline

Introduction

Data structures

Algorithms

Example workflow: Structural variants

Introduction

The Ranges infrastructure: what is it good for?

Method Prototyping

Data Analysis

Insight incubation

Platform Integration

Integrative data analysis

Developing and prototyping methods

Peak calling

Isoform expression

Variant calling

Software integration

S4Vectors

SummarizedExperiment

rtracklayer

VariantAnnotation

Data structures

Data types

Data on genomic ranges

Summarized data

GRanges: data on genomic ranges

    seqnames  start  end  strand  ...
    chr1          1   10  +
    chr1         15   24  -

• Plus, sequence information (lengths, genome, etc.), for example chr1 of hg19 with length 249250621
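
A minimal sketch of how the two ranges on this slide could be built by hand, assuming the GenomicRanges package from Bioconductor is installed. The `score` metadata column and its values are made up for illustration; they stand in for the ". . ." columns shown above.

```r
library(GenomicRanges)

# Two ranges on chr1, with strand and sequence information for hg19
gr <- GRanges(
  seqnames = c("chr1", "chr1"),
  ranges   = IRanges(start = c(1, 15), end = c(10, 24)),
  strand   = c("+", "-"),
  score    = c(5, 3),  # hypothetical metadata column
  seqinfo  = Seqinfo("chr1", seqlengths = 249250621, genome = "hg19")
)

gr              # the ranges plus their metadata
seqlengths(gr)  # chromosome length carried along with the object
width(gr)       # both ranges are 10 bases wide
```

Because the sequence information travels with the object, later operations can check that two sets of ranges refer to the same genome.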

SummarizedExperiment: the central data model

Reality

• In practice, we have a BED file:

    bash-3.2$ ls *.bed
    my.bed

• And we turn to R to analyze the data:

    df <- read.table("my.bed", sep="\t")
    colnames(df) <- c("chrom", "start", "end")

      chrom     start       end
    1  chr7 127471196 127472363
    2  chr7 127472363 127473530
    3  chr7 127473530 127474697
    4  chr9 127474697 127475864
    5  chr9 127475864 127477031

Reality bites

Now for a GFF file:

    df <- read.table("my.gff", sep="\t")
    colnames(df) <- c("chr", "start", "end")

GFF

       chr     start       end
    1 chr7 127471197 127472363
    2 chr7 127472364 127473530
    3 chr7 127473531 127474697
    4 chr9 127474698 127475864
    5 chr9 127475865 127477031

BED

      chrom     start       end
    1  chr7 127471196 127472363
    2  chr7 127472363 127473530
    3  chr7 127473530 127474697
    4  chr9 127474697 127475864
    5  chr9 127475864 127477031
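
The start coordinates in the two tables differ by one because BED intervals are 0-based and half-open, while GFF intervals are 1-based and closed. A minimal base R sketch of the conversion, using the first three BED rows from the slide; the helper name `bed_to_one_based` is made up for illustration.

```r
# First three BED rows from the slide (0-based start, exclusive end)
bed <- data.frame(
  chrom = c("chr7", "chr7", "chr7"),
  start = c(127471196L, 127472363L, 127473530L),
  end   = c(127472363L, 127473530L, 127474697L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: add one to the start. The end stays the same,
# because a 0-based exclusive end equals a 1-based inclusive end.
bed_to_one_based <- function(df) {
  df$start <- df$start + 1L
  df
}

gff_like <- bed_to_one_based(bed)
print(gff_like)
```

With a plain data frame the analyst has to remember this adjustment for every file format, which is the kind of bookkeeping a format-aware importer can take over.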

From reality to ideality

The abstraction gradient

[Figure: the abstraction gradient for a BED file of genes, from Text, to Table (via read.table()), to Genomic Ranges (via rtracklayer), to Gene Coordinates]

• Abstraction is semantic enrichment
• Enables the user to think of data in terms of the problem domain
• Hides implementation details
• Unifies frameworks

Semantic slack

[Figure: rtracklayer, Genomic Ranges, Gene Coordinates]

    > mcols(gr)
    [1] "gene_name"
    [2] "gene_symbol"

• Science defies rigidity: we define flexible objects that combine strongly typed fields with arbitrary user-level metadata

Abstraction is the responsibility of the user

• Only the user knows the true semantics of the data
• Explicitly declaring semantics:
    • Helps the software do the right thing
    • Helps the user be more expressive

Algorithms

The Ranges API

• Semantically rich data enables:
    • Semantically rich vocabularies and grammars
    • Semantically aware behavior (DWIM, "do what I mean")
• The range algebra expresses typical range-oriented operations
• Base R API is extended to have range-oriented behaviors

The Ranges API: Examples

    Type         Range operations           Range extensions
    Filter       subsetByOverlaps()         [()
    Transform    shift(), resize()          *() to zoom
    Aggregation  coverage(), reduce()       intersect(), union()
    Comparison   findOverlaps(), nearest()  match(), sort()
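
A minimal sketch of a few entries from this table, assuming the IRanges package from Bioconductor is installed. The `query` and `subject` coordinates are made up for illustration.

```r
library(IRanges)

# Toy ranges (made up)
query   <- IRanges(start = c(1, 15, 40), end = c(10, 24, 45))
subject <- IRanges(start = c(8, 100), end = c(18, 110))

# Transform: move every range 5 positions to the right
shifted <- shift(query, 5L)

# Transform: keep each start and set the width to 3
resized <- resize(query, width = 3L)

# Filter: keep the query ranges that overlap any subject range
kept <- subsetByOverlaps(query, subject)

# Comparison: all (query, subject) index pairs that overlap
hits <- findOverlaps(query, subject)
hits
```

Here the first two query ranges overlap the first subject range, so `hits` holds two pairs and `kept` holds two ranges.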

Range algebra

[Figure: the ranges in gr and the result of each operation]

Operation

    range(gr)
    reduce(gr)
    disjoin(gr)
    flank(gr)
    psetdiff(range(gr), gr)
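
A minimal sketch of some of these operations on plain integer ranges, assuming the IRanges package from Bioconductor is installed. The coordinates are made up, and `gaps()` is used here as a simple stand-in for the set-difference idea; it is not the call shown on the slide.

```r
library(IRanges)

# Toy ranges (made up): the first two overlap, the third stands alone
ir <- IRanges(start = c(1, 5, 20), end = c(10, 12, 25))

span   <- range(ir)    # one range covering everything: 1-25
merged <- reduce(ir)   # overlapping ranges merged: 1-12, 20-25
pieces <- disjoin(ir)  # non-overlapping pieces: 1-4, 5-10, 11-12, 20-25
holes  <- gaps(ir)     # uncovered space between ranges: 13-19

merged
pieces
```

Each result is again a set of ranges, so the operations can be chained.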

Overlap detection

Example workflow: Structural variants

Structural variants are important for disease

• Structural variants (SVs) are rarer than single-nucleotide variants (SNVs)
    • SNVs: ~4,000,000 per genome
    • SVs: 5,000 to 10,000 per genome
• However, SVs are much larger (typically > 1 kb) and cover more genomic space than SNVs.
• The effect size of SV associations with disease is larger than that of SNV associations.
    • SVs account for 13% of GTEx eQTLs
    • SVs are 26 to 54 times more likely to modulate expression than SNVs (or indels)

Detection of deletions from WGS data

[Figure: four signals used to detect a deletion (DEL): Coverage, Read Pairs, Split Reads, Assembly]

Motivation

Problem

• Often need to evaluate a tool before adding it to our workflow
• "lumpy" is a popular SV caller

Goal

Evaluate the performance of lumpy

Data

• Simulated a FASTQ containing known deletions using varsim
• Aligned the reads with BWA
• Ran lumpy on the alignments

Overview

1. Import the lumpy calls and truth set
2. Tidy the data
3. Match the calls to the truth
4. Compute error rates
5. Diagnose errors

Data import

Read from VCF:

    library(VariantAnnotation)  # provides readVcf()
    library(RangesTutorial2017)
    calls <- readVcf(system.file("extdata", "lumpy.vcf.gz",
                                 package="RangesTutorial2017"))
    truth <- readVcf(system.file("extdata", "truth.vcf.bgz",
                                 package="RangesTutorial2017"))

Select for deletions:

    truth <- subset(truth, SVTYPE=="DEL")
    calls <- subset(calls, SVTYPE=="DEL")

Data cleaning

Make the seqlevels compatible:

    seqlevelsStyle(calls) <- "NCBI"
    truth <- keepStandardChromosomes(truth,
                                     pruning.mode="coarse")

Tighten

Move from the constrained VCF representation to a range-oriented model (VRanges) with a tighter cognitive link to the problem:

    calls <- as(calls, "VRanges")
    truth <- as(truth, "VRanges")

More cleaning

Homogenize the ALT field:

    ref(truth) <- "."

Remove the flagged calls with poor read support:

    calls <- calls[called(calls)]

Comparison

• How to decide whether a call represents a true event?
• Ranges should at least overlap:

    hits <- findOverlaps(truth, calls)

• But more filtering is needed.

Comparing breakpoints

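Compute the deviation in the breakpoints:

    # one list element per true deletion, holding its overlapping calls
    hits <- as(hits, "List")
    call_rl <- extractList(ranges(calls), hits)
    # summed distance between the true and the called breakpoints
    dev <- abs(start(truth) - start(call_rl)) +
        abs(end(truth) - end(call_rl))

Select and store the call with the least deviance, per true deletion:

    dev_ord <- order(dev)
    # position of the smallest deviance within each list element
    keep <- phead(dev_ord, 1L)
    truth$deviance <- drop(dev[keep])
    truth$call <- drop(hits[keep])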
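
The same selection can be spelled out with plain vectors, which may make the logic easier to follow. This is a base R sketch on made-up coordinates, not the List-based code from the slide: for each true deletion it finds the overlapping calls, sums the two breakpoint distances, and keeps the call with the smallest sum.

```r
# Toy coordinates (made up for illustration)
truth <- data.frame(start = c(100L, 500L, 900L), end = c(200L, 650L, 950L))
calls <- data.frame(start = c(98L, 120L, 505L), end = c(203L, 180L, 640L))

# For each true deletion, pick the overlapping call with the least deviance
best <- lapply(seq_len(nrow(truth)), function(i) {
  ov <- which(calls$start <= truth$end[i] & calls$end >= truth$start[i])
  if (length(ov) == 0L) {
    # no overlapping call: record missing values
    return(c(call = NA_integer_, deviance = NA_integer_))
  }
  dev <- abs(truth$start[i] - calls$start[ov]) +
         abs(truth$end[i] - calls$end[ov])
  k <- which.min(dev)
  c(call = ov[k], deviance = dev[k])
})
best <- do.call(rbind, best)

truth$call <- best[, "call"]
truth$deviance <- best[, "deviance"]
print(truth)
```

In this toy data the first deletion overlaps two calls and keeps the closer one, while the third deletion has no overlapping call and gets `NA`.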

Choosing a deviance cutoff

    library(ggplot2)
    rdf <- as.data.frame(truth)
    ggplot(aes(x=deviance),
           data=subset(rdf, deviance <= 500)) +
        stat_ecdf() + ylab("fraction <= deviance")

[Figure: empirical cumulative distribution of deviance; x-axis: deviance, y-axis: fraction <= deviance]

Applying the deviance filter

    truth$called <-
        with(truth, !is.na(deviance) & deviance <= 300)

Sensitivity

    mean(truth$called)

    [1] 0.8214107

Specificity

Determine which calls were true:

    calls$fp <- TRUE
    calls$fp[subset(truth, called)$call] <- FALSE

Compute FDR:

    mean(calls$fp)

    [1] 0.1009852
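
A minimal base R sketch of the two error rates on made-up data, to show how the cutoff feeds into both. The matching result, the number of calls, and the variable names `n_calls` and `cutoff` are assumptions for illustration; only the cutoff value of 300 comes from the slides.

```r
# Toy matching result (made up): best call index and deviance per true deletion
truth <- data.frame(call     = c(1L, 3L, NA, 4L),
                    deviance = c(5L, 15L, NA, 420L))
n_calls <- 5L   # hypothetical total number of calls
cutoff  <- 300L # deviance cutoff used on the slides

# A true deletion counts as called if it has a match within the cutoff
truth$called <- !is.na(truth$deviance) & truth$deviance <= cutoff
sensitivity <- mean(truth$called)

# Every call starts as a false positive; matched calls are then cleared
fp <- rep(TRUE, n_calls)
fp[truth$call[truth$called]] <- FALSE
fdr <- mean(fp)

c(sensitivity = sensitivity, fdr = fdr)
```

Sensitivity is averaged over the true deletions, while the false discovery rate is averaged over the calls, so the two rates have different denominators.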

Explaining the FDR

• Suspect that calls may be error-prone in regions where the population varies
• Load alt regions from a BED file:

    file <- system.file("extdata",
                        "altRegions.GRCh38.bed.gz",
                        package="RangesTutorial2017")
    altRegions <- import(file)
    seqlevelsStyle(altRegions) <- "NCBI"
    altRegions <-
        keepStandardChromosomes(altRegions,
                                pruning.mode="coarse")

FDR and variable "alt" regions

• Compute the association between FP status and overlap of an alt region:

    calls$inAlt <- calls %over% altRegions
    xtabs(~ inAlt + fp, calls)

           fp
    inAlt   FALSE TRUE
      FALSE  1402  112
      TRUE     58   52
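
One way to read this table is to compare the false positive fraction inside and outside the alt regions. The base R sketch below re-enters the counts from the slide by hand; the sample odds ratio and Fisher's exact test are an added illustration, not part of the original workflow.

```r
# Counts copied from the cross-tabulation on the slide
tab <- matrix(c(1402, 58, 112, 52), nrow = 2,
              dimnames = list(inAlt = c("FALSE", "TRUE"),
                              fp    = c("FALSE", "TRUE")))

# False positive fraction within each inAlt group (row-wise proportions)
fp_rate <- prop.table(tab, margin = 1)[, "TRUE"]
fp_rate

# Sample odds ratio for being a false positive, inside vs. outside alt regions
odds_ratio <- (tab["TRUE", "TRUE"] * tab["FALSE", "FALSE"]) /
              (tab["TRUE", "FALSE"] * tab["FALSE", "TRUE"])
odds_ratio

# Fisher's exact test of independence for the 2x2 table
p_value <- fisher.test(tab)$p.value
p_value
```

About 47% of the calls inside alt regions are false positives (52 of 110), against about 7% outside (112 of 1514), which is consistent with the suspicion that calls are more error-prone where the population varies.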