To Enrich or Not To Enrich: How Target Enrichment Can Advance Your Research
[0:00:00] Sean Sanders: Hello and welcome to this Science/AAAS webinar. My name is Sean
Sanders and I’m the commercial editor and webinar editor at Science. Slide 1 Today’s webinar will focus on the methodologies for DNA target
enrichment prior to next‐generation sequencing. As the price of next‐generation sequencing continues to fall, the debate is heating up about whether it is viable and prudent to perform some type of target selection procedure before sequencing. A number of variables must be considered to make the best decision including the number of samples, the amount of DNA available, the sequencing platform used, budgets, reproducibility requirements, and the availability of automation. We hope to expand on and clarify some of these issues today, discussing the pros and cons of various target enrichment procedures and how they can be applied to your next‐generation workflow.
We’re using a slightly different format for this webinar today. We are for
the first time broadcasting live from the headquarters of the American Association for the Advancement of Science, the publisher of the journal Science here in Washington, D.C. in front of a live audience. I trust that everything will go smoothly, but if we have some technical hiccups, please bear with us.
Joining me today are three expert scientists to share their knowledge and
expertise in this field. Just to my left is Dr. Dale Hedges from the Hussman Institute for Human Genomics in Miami, Florida. Next, we have Dr. Elaine Mardis from Washington University in St. Louis, Missouri. And finally, we have Dr. Jun Wei from the National Institutes of Health in Bethesda, Maryland.
A reminder to our online viewers that you can see an enlarged version of
any of the slides by clicking the enlarge slides button, which is located just underneath the slide window of your web console. You can also download a PDF copy of the slides by using the download slides button. If you're joining us live, you can submit a question to the panel at any time by typing it into the ask-a-question box to the bottom left of your viewing console, just under the video screen, and clicking the submit
button. Please remember to keep your questions short and to the point as this will give them the best chance of being put to the panel. We’ll get to as many questions as we can in the Q&A session following the presentations at which time we will also take questions from our live studio audience.
Finally, thank you to Agilent Technologies for their sponsorship of today’s
webinar. So I’d like now to introduce our first speaker for the webinar, Dr. Dale
Hedges. Slide 2 Dr. Hedges is an assistant professor of Human Genetics and serves as the
Assistant Director of the Center for Genome Technology, part of the Hussman Institute of Human Genomics at the University of Miami. Through his role in the University of Miami Center for Genome Technology, Dr. Hedges is actively involved in the incorporation of novel genomic technologies into the process of searching for genetic variation underlying human disease risk.
So welcome, Dr. Hedges. Dr. Dale Hedges: Thank you, Sean, for the introduction. Slide 3 I thought I'd start today by discussing the essential problem that targeted
enrichment seeks to address, which is the fact that, despite all we hear about the increased throughput of next-generation sequencing (it's almost becoming a cliché to talk about the volumes of data we're producing), our ability to selectively provide material for those platforms, to actually take advantage of that increasing throughput, remains surprisingly limited.
Slide 4 And so why would we want to selectively provide material to these
platforms? Well, despite the fact that we have this capacity, we still can't routinely go out and sequence human genomes, at least not at the sort of scale that we would need, for instance, at the institute I work with, where we're looking at complex genetic disorders. The population sizes that we actually need to examine for many of these diseases, where the odds ratios aren't very high, can be in the many hundreds to sometimes thousands before you could expect to detect a novel locus conveying risk for a disease.
So being that we’re not there yet, what can we do in the meantime
between sort of routine sequencing of whole human genomes and where we are now? And I think that’s ‐‐ and I think everyone recognizes that we’re sort of in a temporary phase right now and we would like to just sort of sit around in some sense and wait till we could leapfrog over this whole problem and just routinely sequence all the genomes we want. But we all know you just can’t sit around twirling your thumbs and that’s exactly where targeted resequencing comes in.
Slide 5 It’s sort of the idea of let’s scale up the process of reaching down into the
genome and grabbing those particular regions that we’re interested in in much the same way that we are doing with conventional ‐‐ we were doing with conventional PCR, but at a larger scale to sort of match the capacity of these next‐generation sequencing platforms. And so hopefully, that will help bridge the gap between where we are now and where we see ourselves going in the next few years.
[0:05:17] Slide 6 So what are the primary strategies that are employed right now to do
this? I’ll very briefly go over these. My colleagues will go over them in more detail soon. But on the one hand, you have array‐based oligonucleotide selection, which is a solid phase substrate where so you grab those regions of interest on a physical array, or the solution‐phase capture, which is a same basic idea of hybridization but removing the array from the picture.
An entirely different set of approaches involves essentially using PCR, but
at a larger scale. One of those is molecular inversion probe sequencing, and also very popular right now is RainDance Technologies, which uses a massively parallel PCR approach that leverages their particular microfluidic emulsion system.
Slide 7
And the power of target enrichment can't really be utilized effectively unless you combine it, right now at least, with a barcoding or indexing strategy. Simply attaching some sort of identifier to the sequence during the library preparation process allows you to sort out the particular samples after the fact bioinformatically.
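To make that sorting step concrete, here is a minimal Python sketch of demultiplexing barcoded reads after sequencing; the barcode sequences, sample names, and reads are hypothetical placeholders rather than anything from the panelists' pipelines, and real pipelines usually also tolerate a mismatch or two in the barcode.

    # Sketch of demultiplexing: assign each read to a sample by its leading barcode.
    barcodes = {"ACGT": "sample_1", "TGCA": "sample_2", "GATC": "sample_3"}

    def demultiplex(reads, barcode_length=4):
        """Sort reads into per-sample bins using the leading barcode bases."""
        bins = {sample: [] for sample in barcodes.values()}
        bins["unassigned"] = []
        for read in reads:
            tag, insert = read[:barcode_length], read[barcode_length:]
            bins[barcodes.get(tag, "unassigned")].append(insert)
        return bins

    reads = ["ACGTTTGACCAGT", "TGCAGGCATTACG", "NNNNACGTACGTA"]
    print({name: len(hits) for name, hits in demultiplex(reads).items()})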
Slide 8 And so there are several commercially available options right now for
targeted enrichment. Agilent and Nimblegen both offer array- and solution-based approaches; RainDance, as I mentioned, offers the PCR-based approach. Febit is newer on the scene and offers array-based capture in a more completely automated system, and LC Sciences offers services for both array- and solution-based capture.
Slide 9 So where exactly does this targeting enter into the workflow for next-generation sequencing? It really depends on what sort of system you're using. Basically, if your capture results in fragments that aren't really amenable to sonication, because they're below 2,000 base pairs or so and don't fragment efficiently, you have to go through an additional procedure of concatenating and then refragmenting those molecules. On the other hand, depending on what combination of capture and platform you're using, the capture can work very seamlessly into your sequencing library preparation and go straight to the sequencing platform.
Slide 10 And so one of the things I wanted to address is the major
considerations you have to think about when you're going about target enrichment. One is the fact that you're not going to get everything that you target, and that a lot of what you actually will sequence is going to be off target. It is very important to have at least a rough estimate beforehand of how much that is going to be, so that you can calculate how much sequence throughput to dedicate to each particular sample. If you're planning a very large project, this is a huge deal.
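As a rough illustration of that calculation, the sketch below works out raw throughput per sample from an assumed target size, desired depth, mapping rate, and on-target fraction; every number is a hypothetical placeholder, not a figure from the study being described.

    # Back-of-the-envelope planning of raw sequence needed per sample.
    target_size_bp = 3.3e6     # size of the targeted region
    desired_depth = 20         # mean fold coverage wanted over the target
    on_target_fraction = 0.55  # fraction of mapped bases expected on target
    mapping_fraction = 0.60    # fraction of raw reads expected to map at all

    needed_on_target_bases = target_size_bp * desired_depth
    raw_bases_per_sample = needed_on_target_bases / (on_target_fraction * mapping_fraction)
    print(f"~{raw_bases_per_sample / 1e9:.2f} Gb of raw sequence per sample")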
Slide 11 First, a few notes on semantics. I think a lot of us mix this up, myself included, but just to keep the word usage standardized: when we talk about coverage, we mean the horizontal coverage, basically how much of my intended target is covered by one or more reads of sequence, and sometimes people will specify a particular depth at which they're getting that coverage estimate. That's as opposed to depth, which is the more vertical dimension: how many reads are stacked up at one given nucleotide position.

And so one of the metrics that I like to use, and I'm going to show a little bit of our own data later, is basically the percent that's covered at 20- to 30-fold, which is very important for the reliability of the genotype information that you can extract from that sequencing.
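A minimal Python sketch of that horizontal-coverage metric, computed over a toy list of per-base depths; the depths and thresholds below are illustrative only, and in practice they would come from a pileup of aligned reads.

    # "Horizontal" coverage: the fraction of targeted positions covered by at least N reads.
    per_base_depth = [0, 3, 25, 40, 18, 22, 7, 31]  # depth at each targeted base

    def breadth_at(depths, min_depth):
        covered = sum(1 for d in depths if d >= min_depth)
        return covered / len(depths)

    for threshold in (1, 10, 20, 30):
        print(f">= {threshold}x: {breadth_at(per_base_depth, threshold):.0%}")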
Slide 12 Another important issue to think about is the number of unique reads
that you actually have in your data. [0:09:58] As many of you are aware, when you’re going through the process of
library preparation, there are some amplification steps that can produce clonal copies of what were originally independent fragments, and so those don't actually represent truly unique sources of data. Many informatics pipelines right now toss out those duplicate reads through a process of de-duplication.
Now, there are some more recent algorithms that claim to be able to
actually use those non-unique reads to increase the robustness of the unique reads: basically, understanding that they're not independent data points, but nevertheless using that redundancy to increase the robustness of the one data point that they do represent.
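A minimal sketch of that de-duplication step, under the common simplifying assumption that aligned reads sharing a chromosome, start coordinate, and strand are PCR clones of one original molecule; the alignments shown are hypothetical.

    # Sketch of de-duplication: keep one read per (chromosome, start, strand) signature.
    alignments = [
        ("chr8", 1000123, "+"),
        ("chr8", 1000123, "+"),  # likely PCR duplicate of the read above
        ("chr8", 1000130, "-"),
        ("chr8", 1000123, "+"),  # another clone of the first read
    ]

    seen, unique = set(), []
    for aln in alignments:
        if aln not in seen:
            seen.add(aln)
            unique.append(aln)
    print(f"{len(unique)} unique out of {len(alignments)} aligned reads")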
Slide 13 Another important concept is the idea of uniformity among targets. So
you can have on average 50X coverage, but if that includes 6000-fold coverage at one region and three-fold coverage at a bunch of others, that doesn't work well for the downstream inferences that you can make. So be very wary when people give you their average read depths after the fact of a study; that really tells you very little about how reliable the data is going to be for inferences. You really need some sort of metric of the dispersion around the mean, and people have come up with various approaches to measure that.
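One simple way to express that dispersion is the coefficient of variation of per-target depth. The sketch below uses two hypothetical depth distributions with similar means, roughly matching the 6000-fold versus three-fold scenario above, to show the difference that a bare average hides.

    # Similar mean depth, very different uniformity; CV exposes the difference.
    from statistics import mean, pstdev

    even_depths = [50] * 119 + [53]      # mean ~50x, very uniform
    uneven_depths = [3] * 119 + [6000]   # mean ~53x, one huge pile-up

    for name, depths in (("even", even_depths), ("uneven", uneven_depths)):
        cv = pstdev(depths) / mean(depths)
        print(f"{name}: mean = {mean(depths):.0f}x, CV = {cv:.2f}")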
Slide 14 Ultimately, depth and coverage come down to this: at any given
base position that I'm looking at in my sequencing data, how many independent samples of fragments do I have from my sequencing library, and how reliable is my inference, for a diploid genome, of the state of that particular position? Is it heterozygous or homozygous for a particular variant? All those metrics that I was talking about in terms of depth and coverage and so forth ultimately come down, for us at least, to whether I have enough independent sampling events at any position to reliably call a genotype.
And I should also mention that the particular algorithms that you then go
on to use to call that variation, and to decide what to toss out and what to keep when you make those calls, can tremendously affect the numbers that you find and the quality of the genotype calls themselves.
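As a back-of-the-envelope illustration of why independent sampling depth matters for genotype calls, the sketch below uses a simple binomial model of a heterozygous site; it ignores sequencing error and mapping bias, and the depths and the three-read threshold are illustrative assumptions, not anyone's actual caller.

    # At a true het, each independent read samples either allele with p ~ 0.5.
    from math import comb

    def p_variant_seen_lt(depth, min_alt=3, p=0.5):
        """P(variant allele appears in fewer than min_alt reads) at a true het site."""
        return sum(comb(depth, k) * p**k * (1 - p)**(depth - k) for k in range(min_alt))

    for depth in (5, 10, 20, 30):
        print(f"{depth}x: {p_variant_seen_lt(depth):.4f} chance of < 3 variant reads")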
Slide 15 So very briefly, I wanted to go over some preliminary data from a study
we very recently concluded in our laboratory, where we looked at essentially the same region on three different platforms: the Nimblegen array capture, RainDance massively parallel PCR, and Agilent SureSelect. Now, as many of you are aware, these platforms allow you to target different-sized regions, and at the time we were planning the experiment, the Nimblegen 385K array could target 5 megabases, Agilent about 3.3 megabases, and RainDance about 1.6.
So we basically picked 5 megabases essentially at random, sampling
from across the genome. That full set was used for Nimblegen, a 3.3 megabase subset of it was used for Agilent, and a 1.6 megabase subset of that was used for RainDance, so at least some portion was covered by all three platforms.
Slide 16 And sort of highlighting the importance of some of the metrics that I
mentioned earlier, the very first thing you'll want to think about when you're designing the study is that not every platform can actually target everything you might be interested in. In our particular instance, we looked at what each platform could actually target out of what we gave it, and this is an instance where RainDance comes out looking very well, because they can adjust the primers they're using to avoid repetitive regions and get a more comprehensive coverage of the regions of interest. Hybridization approaches are a little more limited in what they can actually design probes to, because of repetitive sequences and other melting temperature (Tm) considerations.
Slide 17 As far as percent on target, we didn’t really see any appreciable
differences between one capture platform and the other in terms of the number of reads we got per SOLiD octet spot. I neglected to mention that each of these samples was independently loaded on a SOLiD octet spot; basically, we got 28 to 39 million reads each.
[0:15:15] As far as mapping, 60 to 65% of those reads mapped somewhere on the
genome, which is about what one expects from SOLiD data. Where things really start to get interesting is when you look at the
percent on target: of that fraction that mapped, what percent were on target? For Nimblegen, we saw about 55% on target. The little asterisk there indicates that these samples were captured using Nimblegen's standard protocol, which included attaching their own anchors, so we actually had to re-concatenate and shear again, and they lost some efficiency due to that process. So you should probably add another 6 to 10% to that number.
RainDance, 50%, and they also had to go through a concatenation
process as well although that’s sort of unavoidable in their particular instance. Agilent, we saw on average, and this is an average of a couple of samples that I’ve looked at so far, 66%. So we’re quite happy.
Slide 18 How much did we cover for each given depth? So at 1X, all the platforms
did particularly well. At 5, 10, and 20X you can see things start to slowly go down, but actually, we were pleasantly surprised with the sort of numbers we were getting for each platform. Really, 20X is where we start to pay attention, because that's where we begin to get confident calling genotypes across millions of base pairs and stop worrying so much, since our false negative rate is going to be extremely low; and obviously, 30X or 50X is even better.
But it’s important that I mention here that when I get back and actually
publish the data, I'm going to adjust for the actual total amount targeted by each platform. If one made those adjustments, Nimblegen's numbers would actually come out a little better than they look here, because they're targeting more sequence, so the throughput that's dedicated to that sequence is a little less than for Agilent, and even more so compared to RainDance. So Nimblegen would actually come off looking a little better once that adjustment is made, and RainDance a little worse.
Slide 19 Just from what we’ve seen so far in our hands, we certainly can say that
each method has its own strengths and weaknesses and it depends a lot on what the goal of your particular project is. If you absolutely have to cover everything and you can’t afford to miss 10% here or there that you simply can’t even target, then RainDance is a really attractive option.
We’ve also found that in the process of working with the arrays and with
the solution capture, we certainly find the solution approach to be a lot more amenable to scaling up. Also, it's really important to know that you won't be getting every base position. So if your downstream analysis is completely dependent on getting every bit of information from the regions you're interested in, you're going to be disappointed at the end, because you will miss a lot, and you won't always miss the same thing from the same region in every individual. You'll miss different things in different people, although there is some consistency in what you miss.
Slide 20 So very briefly, I wanted to acknowledge in terms of the data I was just
speaking about the few technicians that actually did all the SOLiD library preparation work for that, our Ashley team from UM.
Sean Sanders: Okay, thank you very much, Dr. Hedges. Slide 21 Slide 22
Our next speaker today is Dr. Elaine Mardis and Dr. Mardis holds a Ph.D. in Biochemistry and Chemistry and a B.S. in Zoology from the University of Oklahoma. Prior to joining The Genome Center, she was a senior research scientist at Bio‐Rad Laboratories in Hercules, California, and in her current position, Dr. Mardis has played a pivotal role in the evaluation, optimization, and application of novel sequencing instrumentation, chemistry, and molecular biology toward improving genome sequencing.
[0:20:00] She also orchestrates the center’s efforts to explore next‐generation
sequencing technologies and to transition them into production sequencing capabilities.
So welcome, Dr. Mardis. Dr. Elaine Mardis: Thanks very much, Sean, for the introduction and for the opportunity to
be here to present some data from our center regarding our experiences with targeted hybridization capture of exomes.
Slide 23 So I thought it would be interesting to at least give you an overview
briefly of some of the targeted resequencing projects that we have ongoing that involve re-examining portions of the human genome. They're listed here, and essentially range from studying ovarian tumor/normal pairs as part of The Cancer Genome Atlas, to a variety of complex disease studies ranging from eye diseases such as retinitis pigmentosa and macular degeneration, to broad-scale studies that involve thousands of samples, including this Allelic Spectrum of Metabolic Disorders project that we're involved in now. And the scale of some of these studies has actually required us to invent automation around the solution-phase capture that I'll talk about today.
For the purposes of today’s presentation, I’m going to focus on results
from five matched tumor/normal pairs of ovarian cancer samples. In particular, I should tell you in advance that these are also samples that our sequencing center has sequenced by whole genome methods to a full depth suitable for genome-wide analysis, and I'll come back round to that in the very last slide to show you some comparative data between exome capture and whole genome sequencing analysis.
Slide 24
So the next slide essentially looks at the overlap between the two
commercially available platforms, which shall remain nameless because of confidentiality and disclosure agreements that we have signed with both companies that offer them, and the consensus coding sequences (CCDS) defined by NCBI, which are shown in the green circle. What you can see is that there's a large amount of overlap, but it's not complete. Now, both of these exome capture platforms actually offer a nice amount over and above the coding sequences of genes, including non-coding RNAs and other regions of interest in the human genome, but most people, for the purposes of capturing exomes, would like to be able to say that they are at least targeting the majority of the consensus coding sequences available through NCBI. So that's sort of our metric for comparison.
Slide 25 And Dale did a really nice job of explaining coverage depth and breadth. I
just have a couple of slides to illustrate this from real data. This is a region on chromosome 8 that's been targeted by both platforms A and B, and what you can see across the breadth of the region is that, at 1X coverage, both of the platforms actually provide full coverage: 100% of the bases are covered at 1X. But of course, this is not a sufficient depth at which one would call variants.
Slide 26 So if you fast forward to change the y‐axis now to start at 20‐fold
coverage and on up, what you can see is that the picture very rapidly changes; namely, about 7% of the target region has no coverage whatsoever by either exome capture set, and the shared coverage between the two reagents drops to about 10.5%. And as you can also see, the coverage depth varies quite a bit still.
Slide 27 So how does this look if you broaden out now to the entire genome-wide
coverage with these two reagents, both at 1X depth and 20X depth? As you can see, this is a quite different coverage model at these two levels and varies a little bit between the two platforms depending upon whether you’re looking for 1X coverage or 20X coverage across the totality of regions that are represented by the exome capture reagents.
Slide 28
Now, let’s look very carefully at a number of slides that follow the course
of analysis across some of these metrics that Dale just introduced, for these five ovarian tumor/normal pairs. The graph shown here represents on the y-axis the total number of high quality nucleotides that were aligned onto the genome, and then how they broke out into different categories. If you look carefully at the blue bars along the bottom, these represent the truly unique reads.
[0:25:00] These are paired-end reads from the Illumina system that are on target, so they
conform to the targeted regions defined by the BED files for these exome capture reagents A and B. And on average, about 50% of the reads fall into that unique, on-target category that you're looking for.
Dale also mentioned that there are problems with PCR biasing that lead
to true duplicate reads with the same start and stop site. These are represented by the varying levels of the small intermediate red bars. And then the green bars and the purple bars above those actually represent the combination of unique and duplicate off‐target reads. So this is in many cases approximately equal to the amount of on‐target reads that you obtain.
And then at the very top, one can see the number of unaligned reads that
don’t map to the human genome, and this can be due to a variety of reasons that include the overall quality of the read but more likely just that there are regions of the reference genome that often don’t correspond.
Slide 29 So now let’s look at these ovarian tumor/normal pairs broken out by
exome reagents A and B at the level of depth relative to the consensus coding sequences that I talked about earlier. The depth you establish for yourself as what you're comfortable with for truly calling variants typically falls somewhere in the range between 10X and 20X coverage. What you can see from looking at these varying levels of blue bars is that on average you're getting, overall, about 75 to 80% coverage across the CCDS regions of the exomes. If you want your coverage higher, at 30X, then the levels drop commensurately, but they're roughly equal across all of the tumor/normal pairs shown, with the exception of ovarian cancer 5 in the B reagent, where we
unfortunately encountered some low data quality that was not related to the quality of the capture reagent in that particular case.
Slide 30 Let’s just break this out now for one of these ovarian tumor samples,
again showing the percent of base pairs for the CCDS on the y-axis. The blue bars represent the shared amount of coverage between reagents A and B at various levels of overall coverage, ranging from 1X to 20X, and you can see that the levels again drop as you go out to the 10 to 20X range, which is where you truly need to be to call variants. The red bars represent sequences that are unique to the A reagent, the green ones those unique to the B reagent, and the purple bars on top are regions of the exome that are defined but not covered by these reagents.
Slide 31 And then lastly, I have some specific gene examples that would be of
interest for example in studying ovarian cancer and how well they are covered. So you see at the top level the canonical BRCA1 and 2 loci where in particular, BRCA1 is covered quite nicely by both reagents, BRCA2 a little bit less so, and the two genes along the bottom, ERCC2 and STK11, have varying levels of coverage, STK11 in particular being not terribly well covered at all.
So I mentioned the comparison to whole genome shotgun data and this
really gets at the key point for me: even though I spend a lot of time talking about capture and coverage, we have to have coverage in order to define variants. But at the end of the day, what we really care about is detecting variants, and that to me is the key metric of success for these methodologies.
Slide 32 And what you can see is again the five ovarian tumors now just
specifically represented. Whole genome shotgun data is represented in the first column, the two exome capture reagents in the second and third columns. In parentheses after the number of high quality genic variants detected is the number that we've actually now been able to go back and validate using directed PCR and an orthogonal sequencing method.
And what you can see quite readily is that in every case for whole
genome sequencing, we’re able to detect a larger number of somatic
mutations and also validate a larger number of somatic mutations from that dataset.
Slide 33 And so I’ll finish there and acknowledge the group that provided me with
these slides, especially Todd Wylie and Jason Walker, who in association with Jasreet Hundal, Dan Koboldt, and Will Schierding produced a lot of the data analysis that was shown in these slides, and Bob Fulton, who oversees the production of all the capture data at our center.
Sean Sanders: That’s great. Thank you very much, Dr. Mardis. Slide 34 [0:30:00] So we’re on to our final speaker for today and that is Dr. Jun Wei. Slide 35 Dr. Wei completed his undergraduate degree in Biology at the University
of Houston before doing his doctoral work at the Baylor College of Medicine, also in Houston. Following a postdoctoral fellowship at the National Human Genome Research Institute at the National Institutes of Health, Dr. Wei stayed on at the NIH and is now a staff scientist in the Pediatric Oncology Branch of the National Cancer Institute. He and his laboratory are currently focused on using next‐generation sequencing technologies to identify causative genes for pediatric cancers.
Welcome, Dr. Wei. Dr. Jun Wei: Thank you, Sean, for the nice introduction, and I would like to share with
you our experience with target enrichment and the next-gen sequencing projects in our lab.
Slide 36 Here is the outline of my portion of the talk. I will explain why
we're doing this next-generation sequencing, and I also want to introduce to you the applications of next-generation sequencing technology. As our previous two speakers eloquently elaborated on the necessity of enrichment, the two main reasons we need enrichment are the cost and the coverage of the
sequencing. So I will show you some data produced in our lab using the AB SOLiD platform as well as the Agilent SureSelect whole exome enrichment.
Slide 37 So currently in the pediatric oncology field, as with other cancers, metastatic
disease is treated uniformly with a standard therapy. However, while some patients respond to this therapy, others with the cancer will eventually die.
Slide 38 And in order to improve the success rate of the therapy, we now
employ genomics to devise biomarkers to further categorize these patients into different groups with either a good or a poor signature.
Slide 39 So the group of patients with a good signature can be treated with the standard therapy and
have a good outcome. Slide 40 For the poor signature group of patients, we hope to use this new
technology to discover the driver mutations or the cause of disease, so that we can do targeted, individualized combination therapy to improve survival in this group of patients.
Slide 41 So next-generation sequencing allows us to do comprehensive
analysis of cancer genomes on the same platform. Currently, we have DNA and RNA derived from either germ line or tumor samples, and we can use the same platform to acquire all the genomic information that we need to discover new biomarkers or understand the biology of the tumor, and hopefully to identify therapeutic targets. On the DNA side, this information includes DNA copy number, gene rearrangements, structural changes, methylome information, and of course the damaging mutations.
On the RNA side, we can acquire information on gene expression
levels, chimeric genes, splice variants, novel transcripts, and also the damaging mutations.
Slide 42 Sequencing is all about coverage, as you have seen Dale and Elaine go
through with all this technology. More coverage means higher accuracy in the data.
Slide 43 However, high coverage is also associated with high cost. Slide 44 And this is a chart I put together to compare the different types of
sequencing experiments and the cost of those kinds of experiments done on the SOLiD 3 Plus platform.
[0:35:00] So currently, to sequence a whole human genome at about 20X
coverage, the cost is about $16,000, but if you do the whole exome, you only need to sequence about a quarter of a flow cell, and the cost is about $2,400. And if you do whole transcriptome sequencing, it will currently cost about double the cost of the whole exome.
Slide 45 So whole exome sequencing costs only a fraction of whole
genome sequencing, and about half of transcriptome sequencing, at adequate coverage. Also, the costs shown here do not include the cost of data storage and analysis; if we include those, the cost is even higher.
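The arithmetic behind that comparison, using the ballpark figures quoted above for the SOLiD 3 Plus platform at the time (and still excluding storage and analysis), looks roughly like this:

    # Rough cost comparison using the figures quoted in the talk.
    whole_genome_cost = 16000   # ~20x whole human genome
    whole_exome_cost = 2400     # ~a quarter of a flow cell
    whole_transcriptome_cost = 2 * whole_exome_cost  # "about double the exome"

    print(f"exome / genome cost: {whole_exome_cost / whole_genome_cost:.0%}")
    print(f"exome / transcriptome cost: {whole_exome_cost / whole_transcriptome_cost:.0%}")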
Slide 46 And this is the genome partition platform we use: the Agilent
SureSelect system. I'm not going to go into the details because I think this audience is familiar with this system. When we use this platform, we can capture a certain targeted region of the genome, for example a part of 1p36, which is the region that we are interested in for the disease neuroblastoma, a pediatric cancer.

Also, Agilent now has an available kit to target the whole exome, which targets the coding regions of all the genes, about 38 megabases in total. So instead of sequencing the whole genome, which is about 3 billion base pairs, we can now reduce the complexity to about 38 megabases, and this 38 megabases is equivalent to about 20,000 genes in the consensus coding regions. As I previously showed, that will reduce the cost. However, using this system to pull out the exome, we're going to lose the structural information.
Slide 47 What do we mean by coverage? The coverage at a
particular base is how many times that base was covered by unique sequence reads. For example, as you see here, every sequence read starts at a different site, so they are coming from different molecules from the genome. That gives you unique information, as Dale and Elaine already alluded to. You need these kinds of unique reads to acquire adequate information, but if you do not exclude the reads that are duplicated, you don't increase your information; you just sequence the same thing over and over, which is not good in sequencing.
Slide 48 And now we’re looking at the coverage. As I previously said, this whole
genome we sequenced at about the 16X coverage, and at that time, we used the SOLiD 3 platform which took us three sequence runs equivalent to six flow cells of sequencing reactions. And the total base pair we covered for sequencing is about 53 billion base and it cost a whole bunch of dollars. And if you see this, the y‐axis is a representation of the bases, of the 3 billion base pair, and then you can see that by 10X coverage or more, we have about 90% of these mappable reads to cover.
Slide 49 And this is a transcriptome sequencing experiment. What we do
here is use the SureSelect target regions as a baseline to see how much of the transcriptome sequence covered them, because we want to see how much of the transcriptome sequence mapped to the exon regions. From this figure, you can see that at 10X coverage, we only covered about 38% of the SureSelect targeted region; that's represented by about 7,500 CCDS coding genes.
[0:40:13]
Slide 50 However, if we use SureSelect, we only needed to sequence a quarter of a flow cell on the SOLiD
platform, about 3 billion base pairs of sequencing, at a fraction of the cost; and this is the typical coverage for this kind of exome sequencing, what we routinely generate in our lab.

Here you can see that at 10X or greater coverage, we covered about 70
to 80% of the Agilent targeted region, represented by about 16,000 CCDS coding genes.
Slide 51 What about the on-target rate? The definition we use in our lab is
very simple: any read with at least one base falling into a SureSelect region is called on target.

Slide 52 So if we calculate it that way, with a SureSelect whole exome target of about 37.8
megabases, the on-target rate we routinely get is 70 to 80% of all the mappable reads.
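A minimal sketch of that on-target bookkeeping, with hypothetical read and target intervals (chromosome, start, end; 0-based, half-open). A real pipeline would use indexed BED intervals rather than a linear scan, but the counting logic is the same.

    # A read counts as on target if any of its bases overlaps a target interval.
    targets = [("chr5", 176500000, 176502000), ("chr5", 176520000, 176521500)]
    reads = [
        ("chr5", 176501950, 176502025),  # partial overlap: on target
        ("chr5", 176510000, 176510075),  # no overlap: off target
        ("chr5", 176520100, 176520175),  # fully inside a target: on target
    ]

    def on_target(read, targets):
        chrom, start, end = read
        return any(chrom == c and start < t_end and end > t_start
                   for c, t_start, t_end in targets)

    hits = sum(on_target(r, targets) for r in reads)
    print(f"on-target rate: {hits / len(reads):.0%}")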
Slide 53 And this is an IGV view of the three different experiments, looking at
the coverage. The first track is the whole genome germ line sequencing I just showed you, at about 16X average coverage. You can see there are a lot of reads, but the coverage goes up and down throughout the whole genome. If you look at the whole exome track, the reads are very concentrated on the exon regions; shown at the bottom is the gene structure. This is the FGFR4 gene, which is important in rhabdomyosarcoma.

You can see that for the exome sequencing, the reads are
concentrated on the exons, as they should be. And if you look at the transcriptome sequencing experiment I described earlier, for this gene the coverage is very poor.
Slide 54 And this is one of the samples that we previously identified as having this CTT
mutation, which changes the amino acid.
We published it last year. With this whole exome sequencing, we were able to show that this mutation exists only in the tumor and not in the germ line, just as we reported before; but you can see that if we had used transcriptome sequencing, we likely would not have found it because of the low coverage.
Slide 55 So in summary, for all the different kinds of
experiments, you have to consider the advantages and disadvantages. For whole genome sequencing, of course, you get the most complete genome information: everything, including the chromosome structure information, copy number, and the mutations. But the disadvantage is that it's still very expensive on the current platforms, and another disadvantage is that it's difficult to interpret the data, especially those alterations you find outside of coding regions.
For whole exome sequencing, the advantage is that it's relatively less
expensive and it's easy to achieve high coverage on the targeted region for mutation detection. However, it lacks coverage outside of the CCDS regions, and you lose the structural information.
Whole transcriptome sequencing still has the advantages that it can
detect RNA editing events, transcript variants, novel transcripts, fusion genes, and mutations, and it can also yield gene expression level information. However, the disadvantage is uneven coverage, due to the different levels of gene expression, and currently it's expensive.
[0:45:00] Slide 56 I would like to acknowledge the chief of our section, Javed Khan who
heads this section, and also the biologists in our lab who generated all these data, and the support from the bioinformatics group in our lab. Thank you.
Sean Sanders: Thanks. Slide 57
Okay, thank you very much, Dr. Wei. So thank you all for the excellent presentations and we are going to move on to the Q&A portion of the webinar now. Just a reminder to everyone online that you can submit your questions by simply typing them into the ask‐a‐question box and clicking the submit button. Also to our live audience, if you have any questions, just step up to the mike and I’ll acknowledge you when I see you.
So the first question: all of you talked a little bit about currently available
options, but maybe we can talk a little bit more about target enrichment and sample tagging. We’ve had a couple of questions come in on that. So maybe I’ll start with Dr. Hedges and we can work our way down.
Dr. Dale Hedges: Well, in terms of tagging, pretty much all the major platforms out there
right now have some tagging available to work with that can be used in conjunction with capture. You just have to know beforehand what you're working with and make sure that you integrate that tagging appropriately in the library preparation step. If you're not thinking things through ahead of time, you can find yourself having to stop and redo some part of the protocol because you haven't used the appropriate anchor or primer sequence at some point. Right now, we are working with 16 barcodes available for SOLiD, with potentially more on the way, 12 for Illumina, and I'm sure they're also working on more ‐‐ and as far as Nimblegen…
Dr. Elaine Mardis: I think it’s 12. Dr. Dale Hedges: 12, yeah. Dr. Elaine Mardis: Of the 454. Dr. Dale Hedges: Of the 454. Dr. Elaine Mardis: Yeah. Dr. Dale Hedges: Okay. But basically, each platform, each major capture technology, they
have either worked directly with an indexing method or, because of the way their output is produced (for instance with RainDance), you can just move into whatever library platform and indexing mechanism you wish, essentially starting with double-stranded DNA.
Sean Sanders: Dr. Mardis?
Dr. Elaine Mardis: Yeah. I mean, I would agree with Dale. I think the push to do additional indexing or barcoding has really risen commensurate with the rising capacity. You see yourself at a point on these different platforms where a single lane or region of the microfluidic chip can yield an ever-increasing number of reads, and that drives the need to figure out how many samples you can multiplex together onto the region. It's really a complex calculation that goes directly to what average local coverage you would like to achieve, the number of samples that you have available, the amount of sequence you expect on average to get out of that region, and so on. So it really comes down to doing that calculation and then figuring out what the right combination is. And then there's also the de-convolution piece on the other end, which goes to actually sorting the different reads into bins so that you can associate them with the appropriate sample they originated with.
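As a rough sketch of that calculation, with every figure a hypothetical placeholder rather than any center's actual numbers:

    # How many barcoded capture libraries can share one lane or region?
    lane_yield_bp = 10e9        # expected mapped bases from one lane/region
    target_size_bp = 38e6       # size of the capture target per sample
    desired_mean_depth = 30     # coverage goal over the target
    on_target_fraction = 0.60   # fraction of mapped bases expected on target

    bases_needed_per_sample = target_size_bp * desired_mean_depth / on_target_fraction
    samples_per_lane = int(lane_yield_bp // bases_needed_per_sample)
    print(f"{samples_per_lane} samples can share the lane at {desired_mean_depth}x")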
Sean Sanders: Uh‐hum. Dr. Wei? Dr. Jun Wei: Yeah, I agree with Elaine about this throughput thing because the
advantage of barcoding is you can sequence many more samples in one go, one sequencing run. But this is really on the condition that you have enough throughput. If you don't have enough throughput on a platform, there's almost no point in doing it.
Dr. Elaine Mardis: Yeah. Dr. Jun Wei: And another thing is I think we have to pay attention to the
bioinformatics side of this tagging exercise, because on any platform, if you add another layer of complexity, you're subject to more errors happening in the decoding process. For example, in our experience with SOLiD, you have to really match up which tags you mix together, because certain combinations will cause an imbalance of the color calls, and that will increase the error when sequencing the tag. So those factors need to be considered before you design this kind of experiment.
[0:50:15] Dr. Dale Hedges: Let me just chime in once again briefly. If you
have the luxury to do so, in a larger project for instance, instead of trying to just calculate up front how much throughput you're going to need for each individual, one of the strategies we have been implementing is simply to do an empirical ramp-up phase, where we start by undershooting what we're pretty certain is going to be our
general ballpark of how many samples we'll be able to index, and then ramp up from there; that's how we can empirically see how many samples we can index. Because in all the calculations you can think of, there's always some other factor you may not have thought about in your particular target set that may cause you to fall short of your predictions.
Sean Sanders: Excellent. Our next question is how could small insertions or deletions
possibly impact enrichment? Dr. Mardis, you had mentioned earlier that you've done some work with those.
Dr. Elaine Mardis: Yeah. We have a little bit of experience with this especially in the ovarian
capture, where you actually see quite a large number of insertion deletions in important genes like TP53, for example. The experience there has been that these regions are capturable, I think, to a certain extent. Typically, what you're going to find in genes are along the lines of one, two, or three base pair insertion deletion events, and within that range, you can probably capture those with reasonably good efficiency; beyond that, sort of all bets are off, right? Because you're going to lose hybridization efficiency at some level.
Sean Sanders: Right, right. So we have a question from the audience. Audience Member: Yes. For Dr. Hedges and Mardis, can you speak a little bit to the
amplification bias, how your results compared, and how that weighs into the amount of material that goes onto the sequencer?
Sean Sanders: Okay. In case the online audience didn’t hear, that was about the
amplification bias and how that might affect your results. Dr. Elaine Mardis: I’ll start. So yeah, so the biasing really goes to the following
phenomenology. Duplication in and of itself is not necessarily a bad thing. It's a bad thing when errors creep into the PCR step early on, and then what you end up doing is amplifying an error-containing amplicon, if you will, which when you align it to the genome can masquerade as a true variant. And so the reason for eliminating these duplicate reads is so that you represent that erroneous PCR product only once, if you will, and then all the others should represent the true sequence. That's really the reason for removing those.
The problem becomes more severe actually as you go down in terms of
the size, the number of megabases, that you're targeting. So below a certain threshold, I don't even recommend that people try to do targeted capture, mainly because the yield coming off of the solution
phase, the beads that are used to pull these biotinylated fragments down, is going to be so infinitesimally small compared to the size of the genome that you're trying to capture out of, that to get that material ready to do sequencing on, you'll have to do a large number of PCR steps, and therein enhance the bias even more than you would otherwise. So we typically do a very, very limited number of PCR cycles, somewhere in the range of 10 to 13, and we don't go any higher than that because beyond that, the duplication rate becomes too severe.
Dr. Dale Hedges: I completely agree with Elaine and the primary thing you really need to
worry about is not simply duplication itself but what sort of biases you're introducing there, and if those errors have crept in early in the amplification, then they look just like a real, legitimate variant; even the SOLiD color correction will not filter those out. So what we continually try to do internally is reduce the total amount of amplification we have to do at any step we can, and that's something we're always working on, because the more of that you can leave out, the smaller that problem gets.
Although I have to say sort of anecdotally, we’ve had a couple of datasets
and I wasn’t working directly with them so I didn’t have a hands‐on experience with the data, but we have had a couple of cases with really high coverage where if you ‐‐ we’ve actually done genotype concordance, turning undue duplification and turning it off and not seeing a lot of difference, but those were particularly cases of extremely high average coverage. So I would not recommend it as a general, you know, rule of thumb.
[0:55:20] Sean Sanders: Great. I’m going to address this question to Dr. Wei. It’s a couple of
maybe related questions. The first is are there regions of the genome that can’t be hybridized and enriched, and the other part of that is could repeated elements cause problems with enrichment procedures?
Dr. Jun Wei: Yeah, that certainly is a problem. You saw the data I showed from the
whole exome sequencing: there's about 5% of the targeted region where we never have coverage. The cause can be that maybe some probes have issues; there are so many probes, and some perform better while some don't perform as they should, so that's maybe the cause of that. And the repeated regions definitely are a problem, especially because the human genome is such a complex genome. It's always a problem. Even with whole genome sequencing, the data will show that
about half of the genome's base pairs are really hard to map with any method, unless you do something special like mate pairs and allow the mate pair to span across that kind of region. So it is a problem, yeah.
Sean Sanders: Okay. Dr. Elaine Mardis: Okay. Most of the off‐target effects we see actually map really nicely to
LINE and SINE elements, the predominant classes of repetitive elements in the genome. The probes may actually be capturing something that's unique; it's just that the repetitive sequence comes along for the ride. And these repetitive regions can be in introns, in intergenic regions; they're sort of all over the place. So that's really the major problem that repetitive content brings into the picture.
Sean Sanders: Uh‐hum. So I’m going to stay with you, Dr. Mardis, for this question. Are
the off‐target reads random or reproducible across runs and across enrichment platforms?
Dr. Elaine Mardis: I would say that based on my last comment, they’re largely reproducible
across runs, regardless of the platform, because the genome remains the same, right? There will be some small differences just based on how the different capture platforms are better or less well able to reliably target a particular region. But it's a general phenomenon that just kind of goes with the territory.
Sean Sanders: Uh‐hum. A question for you, Dr. Hedges; online viewer asking if you could
speak a little bit more about Febit's technology and how it compares to the others available.
Dr. Dale Hedges: Sure. I’ve actually had no personal experience with Febit. I’m just aware
that as a system, it is an array-based hybridization approach that they've integrated into an automated fluidic system, so it's apparently a very automatable process. My understanding at the moment is that it's focused more on targeting things in the neighborhood of 100 kb to a megabase in scale, at various complexities, so it's sort of in that range. In terms of the general approaches, it falls into the array hybridization category, although as I mentioned earlier, we've found in our own hands that solution capture was more accommodating for ramping up in scale and doing large projects. Febit, from what I understand of the technology, is offering a solid-phase, array-based approach that is automatable and could potentially be a solution for larger-scale projects.
Sean Sanders: Okay. So we’re coming to the end of the hour. So if there’s any more questions from the audience, get them in now. All right. I have a couple of more questions I’d like to pose. Online viewers are asking how is it possible to use these enrichment technologies to enrich for rare transcripts in the sample sequence analysis. Has anyone had experience with rare transcripts? Dr. Wei?
Dr. Jun Wei: I saw a paper using this technology to capture the transcript. I think it
theoretically is possible, but we don’t have any experience with that. Dr. Elaine Mardis: There are a number of permutations I think that are quite interesting
about this, and that's certainly one of them. Other people have talked about actually using a pre-capture step, and there is a recent paper published on The Plant Genome, plant genomes being highly repetitive in nature, where they actually used a capture technology to selectively capture out the repetitive content of the genome first and then did a second go-round on the exome of that particular plant. So, give scientists the technology and they'll figure out a number of ways to use it to their great advantage. I thought that was rather clever actually.
[1:00:23] Sean Sanders: Uh‐hum, great. So maybe a final question for you about the future and it
seems that right now there’s a balance between doing whole genome sequencing and doing enrichment or using enrichment technologies. Where do you see the tipping point and how soon or far in the future do you feel that that’s going to be? So why don’t we start with Dr. Wei and work our way back down?
Dr. Jun Wei: Yeah. As I showed in the data, I think the most important factor
nowadays is still the cost. The cost of whole genome sequencing is still too high for the average lab; maybe not for a genome center, but for us, it's still high. And there's also the complexity of the informatics: think about dealing with whole exome coding sequence of about 38 megabases versus 3 billion base pairs. That's a huge difference in the storage and the analysis. This is all very challenging currently.
And another thing, it’s difficult nowadays to just make sense of the whole
genome sequencing data outside of the coding regions. You really have to do experiments to prove something is biologically relevant; otherwise, it's hard to make sense of it. So that's the current status. I think in the future, when the throughput is high enough and the cost of sequencing and storage comes down, whole genome sequencing is the way to go. Yes.
Sean Sanders: Uh‐hum. Dr. Mardis? Dr. Elaine Mardis: Yes. So I mean for our work, I tend to not see this as an issue of cost but
rather cost-to-benefit. Our laboratory has done a very large amount of whole genome cancer sequencing comparing tumor and normal; we have in excess of probably 160 genome pairs now completed, and that rate is going up very dramatically. To me, whole genome sequencing of cancer makes sense on a cost-to-benefit comparison, because there's so much more going on in the cancer genome than simple point mutations in small windows that it's a shame if you leave the other important things behind. So I completely agree with Dr. Wei on the cost of analysis being the driver, not the cost of sequencing.
Sean Sanders: Uh‐hum. Dr. Hedges, final word to you. Dr. Dale Hedges: I guess I’m a little more of a technological optimist in the sense that I
definitely see that sort of inevitable push towards routine whole genome sequencing in the near ‐‐ sooner than later. Probably in the three to five year range, we’re going to start to see large studies that are going to involve whole genome sequencing.
And just like Dr. Mardis was saying, it makes sense for cancer. I would
argue it makes sense for everything really. In most of our cases at least, we don't know what we're looking for, and if I have the opportunity to have everything as opposed to a fraction, then I would go for it. So I think it's all going to come down to the price point: as soon as the price drops to a point where people, or even before that, would pay a little bit of a premium to have that much more information. And I think we'll get there pretty soon.
Dr. Elaine Mardis: Yeah. I mean, I think in the broad sense of time, however, I get asked
questions a lot like, well, once X happens, then is Y sort of off the table? And I think it's really fantastic how much effort in technology development has gone into targeted capture, because even if whole genome sequencing becomes the choice du jour for the original discovery phase, there will still be a very compelling reason for targeted capture. For example, moving many of these discoveries into the clinic, where you know exactly what you want to go after, you don't need the rest of the genome to come along for the ride, and you have an exquisitely sensitive vehicle by which to selectively take that from your sample and sequence it in almost real time. And I think that's a really compelling aspect that a lot of what's going on now is really going to feed into very, very nicely in the next few years.
Sean Sanders: Great. Well, I think that’s a great way to end this webinar. So I’d like to
thank our panelists for being with us today and for the enlightening discussion that they’ve provided: Dr. Dale Hedges from the Hussman Institute for Human Genomics, Dr. Elaine Mardis from Washington University in St. Louis, and Dr. Jun Wei from the NIH.
Thank you also to our online audience and to those people in the
audience here. Apologies that we didn’t get to all of the questions. We did have many more than we could possibly use in just the short time that we have, but thank you for sending them in.
Please go to the URL at the bottom of your slide viewer now for more
information on products related to today’s webinar, and just know that this webinar will be available as an on‐demand video within approximately 48 hours from now. If you would like to share some thoughts with us about the webinar, you can email us at [email protected].
And again, thank you to our panel and thank you to Agilent Technologies
for their kind sponsorship of today’s webinar. Goodbye. [1:06:10] End of Audio