18
1 Clinical Applications for Next-Generation Sequencing. http://dx.doi.org/10.1016/B978-0-12-801739-5.00001-5 Copyright © 2016 Elsevier Inc. All rights reserved. CHAPTER NEXT GENERATION SEQUENCING—GENERAL INFORMATION ABOUT THE TECHNOLOGY, POSSIBILITIES, AND LIMITATIONS Rafał Płoski Department of Medical Genetics, Centre of Biostructure, Medical University of Warsaw, Warsaw, Poland 1 CHAPTER OUTLINE NGS Versus Traditional (Sanger Sequencing) 2 Coverage 3 NGS Library Preparation 3 Sequence Assembly: De Novo Sequencing vs Resequencing 4 Paired-End and Mate-Pair Libraries and Long Fragment Read Technology 5 NGS Platforms 6 Illumina ��������������������������������������������������������������������������������������������������������������������������������� 6 Illumina apparatuses ���������������������������������������������������������������������������������������������������������������������� 7 Semiconductor-Based Platforms ����������������������������������������������������������������������������������������������� 8 Semiconductor-based apparatuses ������������������������������������������������������������������������������������������������� 9 Sequencing by the Oligo Ligation Detection (SOLiD) Platform������������������������������������������������������ 9 Pyrosequencing on Roche/454 Platforms ��������������������������������������������������������������������������������� 10 Complete Genomics Analysis (CGA™) Platform ������������������������������������������������������������������������ 10 Single-Molecule Sequencing �������������������������������������������������������������������������������������������������� 11 Pacific biosciences single-molecule real-time sequencing ������������������������������������������������������������� 11 Helicos genetic analysis system (HeliScope) ��������������������������������������������������������������������������������� 11 Targeted Resequencing/Enrichment Strategies 12 Whole Exome Sequencing and Whole Genome Sequencing 14 Limitations of NGS in Clinical Medicine 15 Conclusion 16 References 17

Chapter 1 - Next Generation Sequencing—General Information ......Next generation sequencing (NGS) is defined as technology allowing one to determine in a single experiment the sequence

  • Upload
    others

  • View
    12

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Chapter 1 - Next Generation Sequencing—General Information ......Next generation sequencing (NGS) is defined as technology allowing one to determine in a single experiment the sequence

1Clinical Applications for Next-Generation Sequencing. http://dx.doi.org/10.1016/B978-0-12-801739-5.00001-5Copyright © 2016 Elsevier Inc. All rights reserved.

CHAPTER

NEXT GENERATION SEQUENCING—GENERAL INFORMATION ABOUT THE TECHNOLOGY, POSSIBILITIES, AND LIMITATIONS

Rafał PłoskiDepartment of Medical Genetics, Centre of Biostructure, Medical University of Warsaw, Warsaw, Poland

1

CHAPTER OUTLINE

NGS Versus Traditional (Sanger Sequencing)���������������������������������������������������������������������������������������������� 2Coverage �������������������������������������������������������������������������������������������������������������������������������������������������� 3NGS Library Preparation ���������������������������������������������������������������������������������������������������������������������������� 3Sequence Assembly: De Novo Sequencing vs Resequencing ����������������������������������������������������������������������� 4Paired-End and Mate-Pair Libraries and Long Fragment Read Technology ���������������������������������������������������� 5NGS Platforms ������������������������������������������������������������������������������������������������������������������������������������������ 6

Illumina ��������������������������������������������������������������������������������������������������������������������������������� 6Illumina apparatuses ���������������������������������������������������������������������������������������������������������������������� 7

Semiconductor-Based Platforms ����������������������������������������������������������������������������������������������� 8Semiconductor-based apparatuses ������������������������������������������������������������������������������������������������� 9

Sequencing by the Oligo Ligation Detection (SOLiD) Platform ������������������������������������������������������ 9Pyrosequencing on Roche/454 Platforms ��������������������������������������������������������������������������������� 10Complete Genomics Analysis (CGA™) Platform ������������������������������������������������������������������������ 10Single-Molecule Sequencing �������������������������������������������������������������������������������������������������� 11

Pacific biosciences single-molecule real-time sequencing ������������������������������������������������������������� 11Helicos genetic analysis system (HeliScope) ��������������������������������������������������������������������������������� 11

Targeted Resequencing/Enrichment Strategies ����������������������������������������������������������������������������������������� 12Whole Exome Sequencing and Whole Genome Sequencing ����������������������������������������������������������������������� 14Limitations of NGS in Clinical Medicine ��������������������������������������������������������������������������������������������������� 15Conclusion ��������������������������������������������������������������������������������������������������������������������������������������������� 16References ��������������������������������������������������������������������������������������������������������������������������������������������� 17

Page 2: Chapter 1 - Next Generation Sequencing—General Information ......Next generation sequencing (NGS) is defined as technology allowing one to determine in a single experiment the sequence

CHAPTER 1 INTRODUCTION2

Next generation sequencing (NGS) is defined as technology allowing one to determine in a single experiment the sequence of a DNA molecule(s) with total size significantly larger than 1 million base pairs (1 million bp or 1 Mb). From a clinical perspective the important feature of NGS is the possibility of sequencing hundreds/thousands of genes or even a whole genome in one experiment.

The high-throughput characteristic of NGS is achieved by a massively parallel approach allowing one to sequence, depending on the platform used, from tens of thousands to more than a billion mole-cules in a single experiment (Figure 1). This massively parallel analysis is achieved by the miniaturiza-tion of the volume of individual sequencing reactions, which limits the size of the instruments and reduces the cost of reagents per reaction. In the case of some platforms (referred to as third generation sequencers) the miniaturization has reached an extreme and allows sequencing of single DNA molecules.

An important characteristic of main NGS platforms used today is the limited length of sequence generated in individual reactions, that is, limited read length. Despite constant improvements the read length for the majority of platforms has stayed in the range of hundreds of base pairs. To sequence DNA longer than the feasible read length, the material is fragmented prior to analysis. After the sequencing, the reads are reassembled in silico to provide the information on the sequence of the whole target molecule.

NGS VERSUS TRADITIONAL (SANGER SEQUENCING)The term NGS emphasizes an increase in output relative to traditional DNA sequencing developed by Sanger in 1975 [1], which, despite the improvements introduced since then, still has an output limited to ∼75,000 bp (75 kb). This increase in output translates to the possibility of genome-wide analyses, again contrasting with Sanger DNA sequencing, allowing in practice the analysis of single genes or parts thereof. Despite the spreading use of NGS, Sanger sequencing remains the method of choice for validation necessary for all clinically relevant NGS findings.

FIGURE 1

General principles of technical solutions for NGS. The central part of the process consists of a large number of sequencing reactions carried out in parallel on fragmented DNA in very small volumes (multicolored dots). The outcome of the individual reactions is read by an optical or electronic detector. The final step is the assembly of the thus-generated sequences (reads) allowing the determination of the sequence of the DNA molecule(s) before fragmentation.

Page 3: Chapter 1 - Next Generation Sequencing—General Information ......Next generation sequencing (NGS) is defined as technology allowing one to determine in a single experiment the sequence

3 NGS LIBRARY PREPARATION

COVERAGEAn important feature of NGS is multiple sequencing of each base of the target sequence. The number of times a given position has been sequenced in an NGS experiment (i.e., number of reads containing this position) is termed “coverage.” On one hand, multiple coverage is a consequence of the above-mentioned random target fragmentation necessitated by short read lengths. On the other hand, obtain-ing multiple reads covering the same target is necessary for eliminating random sequencing errors and, equally importantly, enabling the detection of individual components in DNA mixtures. The DNA mixtures that commonly need to be resolved in a clinical setting are those due to heterozygosity.

Sufficient coverage is important for good quality of an NGS experiment. Although the detection of heterozygosity may seem straightforward with a coverage of ∼10, it should be realized that the proba-bility of obtaining all reads from the chromosome without the variant is 1 in 210 = 1/1024, meaning that in whole-genome sequencing (WGS) hundreds of heterozygous variants can be missed. Even more challenging is the detection of a variant present in a proportion smaller than 50%, which often is the case for somatic mutations in neoplastic tissue, chimerism, and mosaicism, or heteroplasmy in mito-chondrial DNA.

Coverage, sometimes also called “sequencing depth” or “depth or coverage,” can be quantified by “mean coverage,” that is, the sum of coverage for all nucleotides in the target sequence divided by the number of nucleotides. Mean coverage gives a general idea about experiment design but it may be mis-leadingly high if some few regions are covered excessively and others poorly or not at all. A more informative way to characterize coverage is to calculate what percentage of the target has been sequenced with a specified (or higher) depth that is deemed satisfactory. A reasonable result when looking for germ-line variants (e.g., disease-causing mutations, expected to be present in 50% or 100% of appropri-ately positioned reads) is to have more than 80% of the target covered a minimum of 20 times.

From an economical perspective it is also desirable to obtain coverage that is maximally smooth, that is, there are no discrete regions that are covered excessively or insufficiently. Excessive coverage is unnecessary and it generates cost since it uses up expensive sequencing reagents. A parameter describing smoothness of coverage is “fold 80 base penalty”—the fold overcoverage necessary to raise 80% of the bases in targets to the mean coverage in those targets. A value of 1 indicates a perfectly smooth coverage (unrealistic). If the mean coverage was satisfactory, a value of 2 indicates that the sequencing done so far should be repeated (doubled) to have 80% of targets satisfactorily covered. A fold 80 base penalty of <3 is regarded as satisfactory.

NGS LIBRARY PREPARATIONThe steps needed to prepare DNA for NGS analysis are collectively called “library preparation.” NGS libraries are platform specific, so that a library prepared for one platform cannot be used on another unless it is explicitly compatible (usually coming from the same manufacturer). NGS libraries can be prepared starting directly from target DNA (usually total genomic DNA) or from polymerase chain reaction (PCR) products. To undergo sequencing, DNA molecules in a library need short sequences called “adapters” to be present on both ends.

If a library is prepared by PCR the simplest approach is to incorporate adapter sequences into PCR primers so that they become part of the PCR products, which are then ready for sequencing. PCR usage

Page 4: Chapter 1 - Next Generation Sequencing—General Information ......Next generation sequencing (NGS) is defined as technology allowing one to determine in a single experiment the sequence

CHAPTER 1 INTRODUCTION4

for library preparation is particularly attractive when few genes/exons need to be analyzed. However, advanced approaches based on emulsion PCR have also been developed allowing large-scale analyses [2] (see also below).

If library preparation is not based on PCR the first stage is fragmentation of DNA. The next step after fragmentation is ligation of the adapters. Adapters may contain an “index”—4- to 10-bp sequences that provide tags allowing one to distinguish different samples sequenced together. The indexing is also known as “bar coding.” Multiplexing of samples (pooling samples for a single sequencing experiment) is a common strategy allowing the use of high-throughput machines for analysis of samples that indi-vidually require less extensive and/or less deep coverage than that offered by a given platform. It is particularly efficient to use double indexing, usually in a strategy using a separate index for each of the two paired end reads (see below). The samples are then identified by the combination of two indices, which increases the multiplexing possibilities (10 indices when used in pairs allow multiplexing of 100 samples). Most of the commonly used NGS applications allow multiplexing of 24–96 samples, in some cases this number is 384; even higher numbers can be achieved with customized approaches.

Traditional methods of fragmentation are based on sonication: The Adaptive Focused Acoustics™ technology patented by Covaris (http://covarisinc.com) or Adaptive Cavitation Technology of Diagenode (http://www.diagenode.com/en/index.php). While sonication gives high-quality results in terms of randomness of break points and reproducible fragment size distribution, it introduces damage at the ends of DNA molecules necessitating an additional enzymatic step of repair.

An ingenious advancement in the preparation of NGS libraries relies on enzymatic reaction with transposase, which catalyzes simultaneously both DNA fragmentation and adaptor/tag incorporation [3], a process that has been nicknamed “tagmentation.” Tagmentation greatly reduces the amount of material needed for library construction, allowing one to routinely process samples of 50 ng DNA or less (vs ∼1 μg typically required if sonication was used). It also speeds up library preparation and allows easy automation. For example, using tagmentation a library for whole-exome sequencing (WES) can be finished within 3 h, whereas previous protocols required ∼2 days.

Libraries made from genomic DNA, although not based on PCR in the initial stages, often use 10–20 PCR cycles at the final stage. Although PCR compensates for sample losses during library preparation and increases the yield of molecules with correctly ligated adapters, it has disadvantages: (1) during PCR fragments with extremely high or low GC content are less efficiently amplified. Since GC-rich sequences are often located in functionally important 5′ regions of genes (first exons, in par-ticular), this leads to annoying gaps in coverage [4]. (2) PCR decreases the diversity of a library, gen-erating “duplicates,” that is, multiple fragments that are all copies of a single molecule. Duplicates decrease the quality of sequencing—they can falsely suggest homozygosity and/or amplify a random error to an such extent that it can be accepted as a true variant. A solution to these problems, increas-ingly often used for WGS, is provided by protocols and kits that allow one to make PCR-free libraries (http://www.illumina.com, http://www.biospace.com).

SEQUENCE ASSEMBLY: DE NOVO SEQUENCING VS RESEQUENCINGOwing to short reads generated by NGS platforms, the important step of analysis is the assembly of the sequence. Two basically different approaches exist: de novo assembly and resequencing. De novo assembly is performed whenever a completely unknown target is analyzed, as is typically the case

Page 5: Chapter 1 - Next Generation Sequencing—General Information ......Next generation sequencing (NGS) is defined as technology allowing one to determine in a single experiment the sequence

5 PAIRED-END AND MATE-PAIR LIBRARIES

when a genome/plasmid is sequenced for the first time. De novo sequencing requires high coverage to provide enough overlapping reads to guide assembly throughout the whole target. It is also computa-tionally demanding since all reads need to be checked against one another for overlaps. Further chal-lenge in de novo sequencing comes from abundant repetitive regions often present in genomes. Sequences of such regions are particularly difficult to infer from the short reads generated by sequencers.

In resequencing the assembly of reads is guided by an a priori knowledge of the target available as a reference sequence. The ideal reference sequence is a consensus sequence providing a general frame-work of the target with its most prevalent variants. When a reference sequence is available it is typically used as a target for alignment of the generated reads. Despite their short length the majority of reads can usually be mapped with high confidence, that is, defined as coming from a given part of the genome. After being mapped, the reads are scanned for mismatches with the reference sequence and these are interpreted as variants.

Given the high and constantly improving quality of the human reference genome [5–7], resequenc-ing is the predominant approach in medical genetics. In comparison to de novo sequencing, resequenc-ing requires less coverage and is simpler computationally. Resequencing is efficient at detecting variants much shorter than the length of the reads. Typically these are single nucleotide variants (SNVs) as well as small insertions/deletions. Conversely, detection of larger variants such as copy number vari-ants (CNVs, which involve fragments >1000 bp) or even bigger structural chromosomal variants is more challenging or even impossible. Obviously, detection of variants in repetitive regions is also chal-lenging as it is difficult to confidently map reads from such regions.

PAIRED-END AND MATE-PAIR LIBRARIES AND LONG FRAGMENT READ TECHNOLOGY [8]Problems associated with short read lengths generated by NGS platforms can be to some extent alle-viated by certain strategies of library preparation and/or sequencing. A common approach is to per-form sequencing from both ends of the fragments contained in the library. This is known as paired-end sequencing and allows one to effectively double the length of the sequenced DNA molecule. Paired-end sequencing is typically performed on libraries of DNA fragments longer than the part that under-goes sequencing to ensure that the reads from both ends do not overlap. Usually paired-end sequencing is used for libraries of fragments <1 kb.

An approach allowing one to sequence fragments located much farther apart (up to 25 kb) is known as mate-pair sequencing [9]. After an initial gentle fragmentation that leaves appropriately long DNA fragments the ends of the DNA molecules are labeled with biotin and the molecules are circularized. Then, the second round of fragmentation yielding fragments <1 kb is performed, followed by enrich-ment for biotin-labeled molecules. Finally paired-end sequencing is carried out on the molecules, which effectively consist of terminal ends of the initial long DNA fragment joined together. Mate-pair sequencing allows one to overcome to some extent the limitations of NGS associated with long repeti-tive stretches commonly present in the human genome and is useful for the detection of chromosomal rearrangements.

An interesting approach to sequencing relatively long fragments (∼10 kb) using the available short reads is offered by long fragment read (LFR) technology [8]. The first step of LFR is to dilute high-molecular-mass DNA (fragments ∼10 kb) and physically separate it into aliquots, which are then

Page 6: Chapter 1 - Next Generation Sequencing—General Information ......Next generation sequencing (NGS) is defined as technology allowing one to determine in a single experiment the sequence

CHAPTER 1 INTRODUCTION6

processed in parallel: DNA in each well is fragmented, amplified, and ligated to uniquely indexed adapters, thus allowing them to be distinguished from the other DNA in all the wells. Next, DNA from all the wells is pooled and submitted to a standard NGS procedure. Three hundred eighty-four aliquots are commonly prepared on a microtiter plate and this number is regarded as sufficient for whole human genome analysis. The DNA concentration in each aliquot (well) is low enough to ensure that a given DNA fragment, with a reasonably high likelihood, is present in a number of wells as a single copy. Since information about the “well of origin” is kept owing to indexing, provided successful bioinfor-matics assembly, each individual ∼10-kb sequence can be regarded as representing a continuous stretch of DNA from a single chromosome. The LFR approach has been implemented in a commercially avail-able kit (TruSeq Synthetic Long-Read DNA Library Prep Kit, http://www.illumina.com).

The LFR approach is not fully equivalent to single-molecule sequencing since in some cases the ∼10-kb fragments may be difficult to assemble owing to repetitive sequences. Notwithstanding this, LFR is a valuable tool to obtain phase information, that is, information on which variants are located together in a single maternal or paternal chromosome. Phase information can be of paramount impor-tance in a number of settings [10], in medical genetics it is, for example, important in the search for compound heterozygous mutations in diagnosing autosomal recessive diseases. Phase information allows one to easily filter out variants found in cis, whereas without it each candidate pair of mutations has to be verified in a family study, which is laborious, or by inclusion of the parents in the initial study, which is expensive.

NGS PLATFORMSILLUMINAIllumina platforms rely on fluorescence-based sequencing of single DNA molecules after a non-PCR-based clonal amplification on solid support. The approach was developed in 2006 by the com-pany Solexa, which was subsequently acquired by Illumina.

Library preparation for Illumina platforms originally included DNA fragmentation by physical means and enzymatic repair of the ends of molecules with subsequent addition of a single adenine base to the 3′ end of the DNA fragments. The final step was ligation of adapters. The ligation is facilitated by a single thymine overhanging the 3′ end of each adapter, which complements the adenine overhang of the DNA fragments. Although this procedure is still used, alternative protocols based on transposase-catalyzed tagmentation are gaining increasing popularity [3]. Illumina adapters always include (1) so-called P5 and P7 binding regions, which are complementary to oligos on the surface of the flow cell (see below) and (2) sequencing primer binding regions. Whenever multiplexing is planned adapters should also have one or two indices.

On Illumina platforms sequencing takes place on the surface of a fluidic chamber (flow cell) designed to provide access to reagents and make optical imaging possible [11]. A flow cell can have up to eight channels called lanes, which can accommodate independent samples. The surface of a flow cell is coated with a lawn of oligonucleotides complementary to the P5 and P7 binding regions in the adapters.

The prerequisite for sequencing is binding of the DNA library to the flow cell. A denatured (single stranded) and appropriately diluted library is applied to the flow cell allowing hybridization (noncova-lent binding) between DNA fragments and oligos at the flow cell’s surface. The relatively weak

Page 7: Chapter 1 - Next Generation Sequencing—General Information ......Next generation sequencing (NGS) is defined as technology allowing one to determine in a single experiment the sequence

7 NGS PLATFORMS

noncovalent binding is subsequently converted to strong covalent bonds by synthesis of the comple-mentary (reverse) strand followed by washing away of the originally bound DNA strand.

The next step is bridge amplification—a cyclic process that clonally replicates DNA molecules bound to the flow cell, creating so-called “clusters.” During bridge amplification a single-stranded molecule flips over and forms a bridge by hybridizing to an adjacent, complementary primer. After extension by polymerase a double-stranded bridge is formed, which, after denaturation, yields a reverse copy of the original (forward) DNA fragment covalently bound to the flow cell surface. The process is cyclically repeated, and at the final step the reverse strands are cleaved away leaving a homogeneous cluster with ∼1000 forward strands.

Appropriate density of clusters is a critical determinant of successful sequencing. Too few clusters decrease the sequencing yield; too many clusters result in overlaps, which negatively affect the quality of data and in extreme cases cause total failure of the experiment.

The DNA sequencing proper on Illumina platforms is performed by sequencing-by-synthesis (SBS) technology. The reaction is started by hybridization of a primer complementary to the part of the adapter adjacent to the sequenced insert followed by cycles of (1) addition of DNA polymerase with four nucleotides, (2) imaging, and (3) cleavage of the fluorophore and deblocking. The nucleotides are reversibly blocked (terminated) and individually labeled fluorescently, which ensures that during each cycle the primer is extended by a single base only and that this base can be identified by its fluorescence during appropriate excitation and scanning.

Depending on the application and the particular platform 36–301 cycles can performed, allowing one to sequence 35–300 bp (base calling at the nth cycle requires fluorescence data for this cycle as well as the n − 1 and n + 1 cycles; thus the number of cycles is always higher by 1 than the length of sequence obtained).

All Illumina platforms support paired-end sequencing. The sequencing of the other end of a DNA molecule is achieved by stripping off the strand synthesized during the first read and performing a single cycle of bridge amplification with the cleavage of the original forward strand. This converts single-stranded DNA molecules in each cluster into their reverses, which are then sequenced as described above.

The indices are sequenced in separate additional reads (one or two as required). Each index read starts by hybridization of a dedicated primer followed by a number of SBS cycles appropriate to the length of the index.

Illumina apparatusesThe first platform using the described technology was the Genome Analyzer (GA), initially offered by Solexa, a company acquired by Illumina in 2007. Although still used, GA is being largely replaced by HiSeq instruments (HiSeq 1000 and 2000), and their software and/or hardware upgraded versions (HiSeq 1500, HiSeq2500 and HiSeq 3000, HiSeq4000, respectively) as well as by MiSeq and MiSeqDx. The important upgrade of the HiSeq 3000/4000 is a patterned flow cell with nanowells directing cluster formation, which ensures optimal cluster density. The HiSeq instruments are as of this writing the dominant platforms for high-throughput NGS applications worldwide. The MiSeq machines belong to a category of benchtop sequencers—MiSeq is being developed toward low-scale research projects and MiSeqDx is focused on clinical applications. It is noteworthy that as of 2014 MiSeqDx is the only NGS instrument that has received FDA clearance. A comparison of the most popular Illumina NGS plat-forms is shown in Table 1.

Page 8: Chapter 1 - Next Generation Sequencing—General Information ......Next generation sequencing (NGS) is defined as technology allowing one to determine in a single experiment the sequence

CHAPTER 1 INTRODUCTION8

For specialized centers Illumina also offers the HiSeq X Ten platform, which is a set of five or 10 machines sold together to enable human WGS at a population scale with a price of US $1000 per genome (www.illumina.com).

Sequencing on GA or HiSeq 1000/1500/2000/2500 (but not MiSeq, NextSeq 500) requires a sepa-rate machine (cBOT) for the clustering.

SEMICONDUCTOR-BASED PLATFORMSSemiconductor-based platforms rely on detection of pH changes occurring during DNA synthesis [12]. Sequencing is performed after single-molecule amplification by a process known as emulsion PCR [13].

Library preparation starts from DNA fragmentation and adapter ligation similar to the Illumina approach, and then emulsion PCR is performed [13]. The library is mixed with microscopic beads in an environment of oil and an aqueous solution containing PCR reagents and the mixture is shaken to form an emulsion. The dilution of the library is low enough to ensure that only single DNA molecules have a chance to be encapsulated together with a bead in one emulsion micelle. The emulsion is then subjected to thermal changes allowing PCR. The micelles are separated from one another by oil so that PCR occurs independently in each without diffusion of products. As the beads are covalently coated with oligonucle-otides complementary to adapter sequences, the PCR products generated within a micelle adhere to the bead surface. After PCR the emulsion is broken, and the beads are separated, enriched for those that contain PCR products, and primed for sequencing by annealing of an appropriate primer.

The primed beads are placed into wells of a specialized chip (Ion Chip). The size of the Ion Chip wells ensures that each can accommodate a single bead only. The chip is then cyclically flushed with four nucleotides (one after another, in a constant order) and reagents allowing DNA synthesis. Each time a nucleotide is incorporated there is a release of H+ ions leading to a pH drop in the well, which is detected by a sensor located at the bottom of the well. If two or more identical nucleotides are present side by side in the sequenced fragment their number can be inferred from the stronger decrease in pH relative to what is observed for a single nucleotide.

Since signals from all wells are collected simultaneously without the need for sequential scanning the semiconductor-based sequencing is fast, with a single run taking ∼2 h. The use of unlabeled nucleo-tides simplifies sequencing chemistry, lowering the cost. The availability of chips with different num-bers of wells makes the size of sequencing experiments easily scalable.

Table 1 Comparison of Most Popular Illumina NGS Platforms

Max. Output (Gb)

Max. Read Number (M, Paired-End Reads)

Max. Read Length (bp)

No. of Lanes

Genome Analyzer IIx 95 300 2 × 150 8

HiSeq 2500 1000 4000 2 × 125 2 × 8

HiSeq 2500 rapid mode/single flow cell 90 300 2 × 150 2

HiSeq 3000/4000 750/1500 2500/5000 2 × 150 8/16 (2 × 8)

NextSeq 500 120 400 2 × 150 1

MiSeq 15 25 2 × 300 1

Page 9: Chapter 1 - Next Generation Sequencing—General Information ......Next generation sequencing (NGS) is defined as technology allowing one to determine in a single experiment the sequence

9 NGS PLATFORMS

Semiconductor-based apparatusesThe first semiconductor-based machine was the Ion PGM released in 2010 by Ion Torrent, a company later acquired by Life Technologies Corp. (now part of Thermo Fisher Scientific, Inc.). The Ion PGM can be used with three chips, allowing 0.03–2 Gb of output. It is a low-throughput, low-cost benchtop sequencer dedicated mainly to amplicon sequencing. A considerably upgraded version of the PGM is the Ion Proton, which, although still in the benchtop class, is capable of WES and, according to company claims (www.lifetechnologies.com), in the near future should allow WGS as well. The characteristics of performance of semiconductor-based apparatuses from Life Technologies are shown in Table 2.

A recent development in the field of clinically oriented semiconductor-based NGS platforms comes from Vela Diagnostics (http://www.veladx.com/), who offer the Sentosa system, including both a sam-ple preparation station and a sequencer. The sequencer is manufactured by Thermo Fisher according to Vela Diagnostics specifications and uses Ion Torrent technology. As of this writing this system is dedi-cated to running CE-IVD cancer panels developed by the company.

SEQUENCING BY THE OLIGO LIGATION DETECTION (SOLiD) PLATFORMThe SOLiD platform relies on ligation [14] with fluorescence-based detection. After the standard steps of fragmentation and adapter ligation, the library is amplified by emulsion PCR or, in the final upgrade, by an isothermal amplification on the surface of a flow cell (flow chip), called “template walking” or “Wildfire,” which is a process with some similarities to the bridge amplification of Illumina [15].

Sequencing on the SOLiD [14] is based on ligation. It starts with annealing of a primer ending at the last base of the adapter. Next a mixture of 16 oligonucleotide octamer probes labeled with four different fluorochromes is added. Each probe at one end has an interrogation sequence representing one of the 16 combinations of a 2-base sequence followed by a 6-bp degenerate stretch and the fluorochrome label at the other end. After hybridization ligation is performed, which covalently links the primer and the adja-cently annealed probe. Next, unbound probes are washed away, fluorescence is read, and the probe is cleaved, removing the label and three neighboring bases (leaving a 5-mer bound). The hybridization–ligation cycle is repeated six more times, the only difference being that ligation occurs with the previously bound probe instead of the primer. This completes the first, so-called, “round” of sequencing. Next the synthesized strand is removed and the second round is started by annealing a new primer, finishing one base before the end of the adapter (n − 1 primer). Five rounds are performed with successively more offset primers (to n − 4). Although four fluorochromes do not allow discrimination of 16 probes, the sequence can be determined using information from all the offset cycles. In addition, in some cycles knowledge of the adapter sequence is used to interpret the data (see [16] for a detailed explanation for the n − 1 cycle). The advantage of SOLiD sequencing is accuracy due to effective double interrogation of each position.

Table 2 Characteristics of Performance of Semiconductor-Based Apparatuses

Machine Chip Max. Output (Gb) Max. Read Number (M) Read Length (bp)

Ion PGM™ Ion 314™ v2 0.03–0.1 0.4–0.55 ∼400

Ion 316™ v2 0.3–1 2-–3 ∼400

Ion 318™ v2 0.6–2.0 4–5.5 ∼400

Ion Proton Ion Proton I Chip 10 60–80 ∼200

Page 10: Chapter 1 - Next Generation Sequencing—General Information ......Next generation sequencing (NGS) is defined as technology allowing one to determine in a single experiment the sequence

CHAPTER 1 INTRODUCTION10

The first SOLiD platform was released by Applied Biosystems in 2007, followed by upgrades of which the 5500xl with the Wildfire chemistry was the most advanced. The SOLiD platform has good accuracy (up to 99.99%), moderate output (30 Gb), and rather short reads (from the initial 35 to 85 bp for SOLiD 5500xl). Although SOLiD is potentially attractive for high-throughput-dependent diagnos-tic applications such as WES or WGS, in 2013 Life Technologies announced that it has no plans for further development and in 2014 SOLiD was available only to existing customers.

PYROSEQUENCING ON ROCHE/454 PLATFORMSPyrosequencing relies on the detection of the pyrophosphate molecule released during DNA synthesis [17]. It was the first NGS platform available commercially and some concepts behind it were later used in semiconductor sequencing. Both processes share: (1) emulsion PCR on microbeads, (2) deposition of the beads on a microplate (PTP or picotiterplate) according to the one bead–one well principle, (3) sequencing by strand elongation after sequential flushing with four nucleotides, and (4) detection of a product released during strand elongation (pyrophosphate or H+, respectively) whose concentration is proportional to the number of bases incorporated. A difference is that in pyrosequencing the signal ultimately comes from conversion of luciferin into oxyluciferin, which generates visible light [16].

Since the first release of the system by Roche in 2005 the platform has been upgraded, with its final high-throughput version being the 454 GS FLX Titanium system. This system used eight independent lanes each allowing ∼100,000 reads and produced 14 Gb in an ∼10-h run. In 2009 Roche, as the first company, introduced a benchtop NGS sequencer called Junior. Junior is a machine similar to FLX but it can accommodate a PTP with a single lane only.

The advantage of the Roche/454 system is the long read length (up to 800 bp); the disadvantages are low output, problems with homopolymers, and very costly reagents. As announced in 2013, both Roche platforms will be no longer developed.

COMPLETE GENOMICS ANALYSIS (CGA™) PLATFORMThe CGA platform employs sequencing by ligation with fluorescence-based detection. Sequencing is performed on self-assembling DNA nanoarrays or DNB™ arrays [18,19].

An unique feature of the library preparation for the CGA is amplification of fragmented DNA by rolling-circle replication, which produces covalently linked tandem copies of single-stranded DNA, called “DNA nanoballs” (DNBs). DNB formation allows very dense packaging of amplified library molecules—hundreds of fragments are effectively squeezed, forming a sphere with a diam-eter of approximately 200 nm. Next, the DNBs are immobilized on the surface of a chip manufac-tured to contain ∼3 billion regularly patterned sticky spots, each binding only one DNB. The chip with bound nanoballs is called the DNB™ array. The dense and ordered pattern of the DNB™ array reduces the volume of sequencing reagents and maximizes the efficiency of the imaging by ensuring an optimal alignment with the camera, so that every two pixels are used to image a dif-ferent DNB.

The sequencing by ligation on the CGA™ platform has some similarities to the SOLiD platform. The difference is that in the CGA protocol nucleotide positions are interrogated one at a time. Further-more, the CGA approach is fully “unchained,” that is, there is no need to determine the first base before reading the second one, etc. Thus, possible errors (in particular deletions/insertions) introduced at the beginning of sequence do not affect the quality of downstream bases as is the case with other methods.

Page 11: Chapter 1 - Next Generation Sequencing—General Information ......Next generation sequencing (NGS) is defined as technology allowing one to determine in a single experiment the sequence

11 NGS PLATFORMS

Complete Genomics, Inc., was established in 2006, in Mountain View, California, USA, and in 2013 it was acquired by BGI-Shenzhen (www.completegenomics.com). The company has never commercial-ized its platform but offers DNA sequencing as a service with a focus on high-quality human WGS [19].

SINGLE-MOLECULE SEQUENCINGAll NGS platforms described above rely on clonal amplification of a library prior to sequencing (bridge amplification, emulsion PCR, etc.), which is necessary to make the sequencing signal strong enough for detection but can lead to errors and biases (GC bias in PCR, preferential amplification of shorter fragments in bridge amplification). Single-molecule sequencing, also known as third genera-tion sequencing (TGS), avoids these pitfalls.

Pacific biosciences single-molecule real-time sequencingOn the PacBio SMRT platform the sequencing is carried out by monitoring in real time the activity of a single DNA polymerase extending a primer annealed to the sequenced template. This is achieved by recording the fluorescence emitted each time a labeled nucleotide is bound by the enzyme [20,21]. The reactions are performed in “zero-mode waveguide” microwells—sophisticated ultra-small wells with a transparent bottom, which allow one to immobilize a single molecule of DNA polymerase and guide the light emitted by the nucleotides it binds in a way that facilitates detection [22].

The PacBio library is prepared by ligating SMRTbell™ adapters to both ends of double-stranded DNA fragments. The adapters have a hairpin structure so that after ligation a topologically circular single-stranded template is generated, called the SMRTbell. After annealing of a primer complemen-tary to the adapter sequence the SMRTbell allows for multiple rounds of DNA synthesis so that the insert (especially a short one) can be sequenced many times.

The advantage of SMRT technology is long read length (up to 30 kb), whereas a high error rate (15%) and limited number of reads as well as high price, large size, and complicated maintenance of the instrument are disadvantages. In contrast to other platforms the errors of PacBio are essentially random [23] and can be corrected by multiple reads from SMRTbell templates (although at the expense of read length). A unique feature of the SMRT technology is direct detection of modified bases based on the variation of fluorescence duration reflecting differences in polymerase kinetics (although detec-tion of 5-methylcytosine, the most common epigenetic modification in humans, is challenging) [24].

The first platform based on SMRT sequencing was commercialized in 2010. The latest upgrade, PacBio RS II with P5-C3 chemistry, yields, for each of 16 independent SMRT cells, 50,000 reads with 375 Mb of sequence, half of which comes from reads >10 kb (www.pacificbiosciences.com).

Helicos genetic analysis system (HeliScope)The Helicos Genetic Analysis System (HeliScope) was the first TGS platform introduced in 2009 by Helicos BioSciences Corporation. The HeliScope library is prepared by adding a poly(A) tail to the molecules, which are then bound to a flow cell coated with poly(T) [25,26]. Sequencing is carried out by synthesis using fluorescently labeled nucleotide analogs [25–27]. Despite short reads (24 to 70 bp), low throughput (35 Gb/8 days), and a rather high error rate (∼4%), the HeliScope has been shown to be capable of WGS [28]. The HeliScope is so far the only platform allowing direct sequencing of RNA [29] and it has been used in some high-profile studies of human gene expression [30,31] by the FAN-TOM consortium (http://fantom.gsc.riken.jp). The future of the Helicos technology is uncertain owing to the bankruptcy announced by the company in 2012.

Page 12: Chapter 1 - Next Generation Sequencing—General Information ......Next generation sequencing (NGS) is defined as technology allowing one to determine in a single experiment the sequence

CHAPTER 1 INTRODUCTION12

TARGETED RESEQUENCING/ENRICHMENT STRATEGIESThe gold standard of medical NGS analysis is WGS, but owing to the size of the human genome (∼3 Gb) this approach is too expensive for widespread use. A popular alternative to WGS is a selec-tive analysis of regions of interest. The target may range from a single exon to a substantial part of the genome such as a whole exome. For small targets the selection (enrichment) is typically achieved by PCR, including long-range PCR used, for example, in whole mitochondrial DNA sequencing. The PCR enrichments usually rely on multiplex PCR, in some cases requiring sophisticated equipment such as the Fluidigm Access Array (allowing simultaneous amplification of up to ∼500 loci in 48 samples) or ThunderStorm from RainDance (allowing PCR with up to 20,000 primer pairs to be performed in emulsion droplets). Large targets are usually enriched by hybridization with oligonu-cleotide probes (baits), which can be made as DNA or RNA molecules. There are also methods using ligation with or without circularization that can be used for intermediate/large enrichments [32]. PCR- and ligation-based enrichments start directly from DNA, whereas those using hybridization require preparation of a library appropriate for the platform used. PCR-based enrichment requires much less DNA than other strategies except for those based on transposons and pioneered by Illu-mina (Nextera).

An enrichment can be prepared locally but it is more common to order it from a commercial ven-dor (Table 3), either as a predesigned product or as a set of reagents prepared according to user speci-fication (including genomic coordinates or gene symbols). Predesigned enrichments are usually well performing, whereas custom panels often need to be optimized. Typical optimization steps for hybridization enrichments include adjustment of library size and sequencing read lengths and, if necessary, redesign of the primers/probes set. As human exons have a median length of only ∼120 bp, exon-targeted enrichments often perform better with relatively small library sizes and shorter read lengths. It is also desirable to reduce the number of PCR cycles throughout the procedure to decrease PCR-inherent biases.

An important issue with predesigned enrichments is a formal certificate for in vitro diagnostic use (IVD). Although there is a clear demand for such products at present (January 2015), there are only two enrichments with FDA clearance (Illumina’s cystic fibrosis assays).

Enrichments, especially those that are customized, are expensive and they are seldom ordered if only a few samples are to be analyzed. Some enrichments are designed for up to 12 samples that need to be processed simultaneously, while others allow single-sample experiments. An example of a deci-sion tree in enrichment selection has been proposed [33].

An important factor in considering an enrichment is the amount of sequencing necessary to obtain desired coverage. Whereas for enrichments of small targets by PCR the amount of sequenc-ing reflects target size multiplied by desired coverage, for larger targets the situation is compli-cated by the relatively low enrichment efficiency possible, which implies that a considerable amount of sequencing is “wasted” on off-target regions. Figure 2 shows the relationship between target size and fraction of bases expected to be on target for an enrichment with a performance in the range of enrichments used in exome studies (∼50) as well as the amount of sequencing neces-sary to achieve a mean coverage of 50. Note that relatively similar amounts of sequencing (∼3 Gb vs ∼8 Gb) are needed for disproportionately different targets (∼1 Mb vs ∼100 Mb, respectively). A detailed discussion of enrichment parameters vs amount of necessary sequencing can be found elsewhere [32].

Page 13: Chapter 1 - Next Generation Sequencing—General Information ......Next generation sequencing (NGS) is defined as technology allowing one to determine in a single experiment the sequence

13 TARGETED RESEQUENCING/ENRICHMENT STRATEGIES

Table 3 Commercially Available Enrichment Technologies for NGS

Company Technology Predesigned Panels Custom Designs Platforms

Agilent Technologies www.agilent.com

120-nt RNA probes

Exome, kinome, X-chromosome, SureSelect Inherited Disease (2742 genes)

1 kb–24 Mb Illumina, SOLiD

Ligation based (up to 200,000 amplicons)

Exome, cancer, cardiomyopathy, Noonan syndrome, connec-tive tissue disorder, X chromosome. ICCG (the International Collabora-tion for Clinical Genom-ics www.iccg.org/panel)

1 kb–5 Mb Illumina, Ion Torrent (excl. whole exome enrichment)

Roche NimbleGen www.NimbleGen.com

60- to 90-nt DNA probes

Exome, cancer (575 genes), neurology (256 genes)

<7 Mb or <50 Mb

Illumina, Roche 454

Illumina www.illumina.com

DNA probes Exome, tumor (somatic mutations), myeloid malignancies (somatic mutations), cancer (germ-line mutations), cardiomyopathy, inher-ited disease (recessive pediatric onset diseases), autism, HLA, TruSight One >4800 genes

Nextera rapid capture custom 0.5–15 Mb

Illumina

Ligation based Up to 1536 amplicons, cov-ering ∼650 kb

Illumina

Integrated DNA Technologies www.idtdna.com

Oligos individu-ally synthesized on the IDT Ultramer® platform

Acute myeloid leukemia (>260 genes), cancer (127 genes), inherited disease (4503 genes)

Up to 2000 probes

Compatible with Agilent and Nimble-Gen libraries

Qiagen http://www.qiagen.com

PCR Various cancer panels (up to 160 genes), cardiomyopathy (58 genes), carrier testing (157 genes)

A subset of 570 preverified gene primer sets

MiSeq/HiSeq, PGM/Proton Ion

Life Technologies PCR Cancer panels (>400 genes), fusion panel, inherited diseases (>300 genes), exome

Up to 6144 amplicons

PGM/Proton

DNA probes Exome (TargetSeq) PGM/Proton

RainDance Technologies www.raindancetech.com

Microdoplet PCR (equipment required)

Cancer (142 genes), tumor (somatic), HLA, autism (62 genes), pharmacokinetic and pharmacology genes (36/242), X-linked dis-orders (800 genes)

Up to 20,000 amplicons

HiSeq/MiSeq, PGM, SOLiD, PacBio, FLX/Junior

Continued

Page 14: Chapter 1 - Next Generation Sequencing—General Information ......Next generation sequencing (NGS) is defined as technology allowing one to determine in a single experiment the sequence

CHAPTER 1 INTRODUCTION14

WHOLE EXOME SEQUENCING AND WHOLE GENOME SEQUENCINGWES was described in 2009 [34] as a technique allowing one to sequence the exome, which is the portion of the genome including all of the protein-coding regions (exons). The attractiveness of WES comes from the fact that although it encompasses only ∼1.5% of the genome it harbors the majority (∼85%) of variants causing single-gene disorders. Indeed, in the same year the detection of congenital chloride-losing diar-rhea in a patient misdiagnosed as having Bartter syndrome was reported as the first clinical application of

Company Technology Predesigned Panels Custom Designs Platforms

Fluidigm Corporation www.fluidigm.com

Multiplex PCR in microfluidic cham-bers (equipment required)

ThunderBolts cancer panel (50 cancer genes)

Up to 480 ampli-cons per each of 48 samples

Illumina, Junior/FLX, PGM/Ion

Multiplicom www.multiplicom.com

PCR-based MASTR™ (multi-plex amplification of specific targets for resequencing) assays

Panels for >10 inherited diseases including CE-IVD validated panels for BRCA1/2 and CFTR mutations), three onco-panels, an aneuploidy test

No Junior/FLX, MiSeq

Table 3 Commercially Available Enrichment Technologies for NGS—cont’d

0

10

20

30

40

50

60

70

80

1 20 40 60 80 100

Enrichment target size [Mb]

% of bases on targetAmount of sequencing [Gbx10]

FIGURE 2

Relationship between target size and fraction of bases expected to be on target as well as amount of sequenc-ing required for mean coverage of 50, for an enrichment characterized by an enrichment factor or EF = 50. EF is defined as the mean coverage of the target (T) divided by the mean coverage of the rest of the genome. The fraction of bases on target F = EF × target size/[(EF − 1) × target size + genome size] [32]; the amount of sequencing = 50 × (T/F), genome size = 3 Gb.

Page 15: Chapter 1 - Next Generation Sequencing—General Information ......Next generation sequencing (NGS) is defined as technology allowing one to determine in a single experiment the sequence

15 LIMITATIONS OF NGS IN CLINICAL MEDICINE

WES [35]. A report on spectacular WES-aided diagnosis and therapy of a case of severe inflammatory bowel disease illustrated the potential of this technology not only for diagnosis but also for treatment [36].

In 2011 Ambry Genetics, as the first CLIA-certified laboratory, offered WES together with medical interpretation for clinical purposes. As of October 2014 the GeneTests web site (www.genetests.org) listed 16 labs offering clinical WES. The turnaround time is in most cases 12–13 weeks, with the short-est being 4–8 weeks, and the price US $4000–5500. Notably, the cost of WES may be only two to four times as high as the cost of some much more selective panels. The majority of labs also offer interpreta-tion of data, including in some cases interpretation of external data. The GeneTests web site (www.genetests.org) lists a single lab (Medical College of Wisconsin) offering WGS declaring a mini-mal coverage of 10 for over 90% of the genome with a turnaround time of 12–13 weeks. WGS can also be obtained from Illumina and BGI (www.bgiamericas.com).

WES is usually performed using hybridization-based enrichments using kits from Agilent Tech-nologies, Roche NimbleGen, Illumina, or others. Although all the offered kits allow WES, differences exist in their design and performance [37]. In some cases the kits cover only protein-coding exons, whereas other kits include additional loci for noncoding RNAs (including micro RNAs) and 5′ as well 3′ untranslated regions. The resulting differences in target size (e.g., 37 Mb for Nextera Rapid Capture Exome from Illumina vs 96 Mb for SeqCap EZ Exome + UTR from NimbleGen) imply significant, up to threefold, differences in sequencing cost.

WES is particularly attractive when there is a suspicion that the disease is caused by an unknown locus—so far WES has allowed the linkage of more than 150 new genes to human Mendelian disorders [38]. However, in a clinical setting the majority of information supplied by WES remains difficult to interpret as it pertains to loci that have so far not been linked to any human disease. To focus on most clinically relevant loci, the so-called “clinical exome” panels have been developed, such as TruSight One from Illumina, which covers >4800 genes and has a cumulative target size of only ∼12 Mb, or SureSelect Inherited Disease from Agilent (>2700 genes, 10.5 Mb).

The use of WGS, which in theory should offer a much more powerful diagnostic tool than WES, is at present limited by poor understanding of the role of variation in the noncoding part of the genome. Intriguingly, a study of 50 patients with severe intellectual disability in whom the causative molecular defect had not been found in extensive former studies, including WES and comparative genomic hybridization concluded that WGS has a high diagnostic yield of 42%, but mainly owing to findings of exonic variants apparently missed by earlier studies [39].

LIMITATIONS OF NGS IN CLINICAL MEDICINEThe major limitations of clinical NGS applications stem from technical shortcomings of current tech-nologies (short read lengths, relatively high error rate, incomplete coverage) and from challenges in data interpretation (see later chapters).

The short reads support identification of SNVs and short deletions/insertions, whereas it is difficult or impossible to detect variants involving longer sequences. In particular, using NGS it is not possible to analyze stretches of short tandem repeats including those causing clinically important diseases such as fragile X, Huntington disease, and others [40]. Despite efforts (see “Analysis of Structural Chromo-some Variants by NGS”) it remains difficult to detect CNVs such as those causing DiGeorge syndrome (22q11.2 deletion syndrome) or Charcot–Marie–Tooth disease type 1A. Similarly chromosomal

Page 16: Chapter 1 - Next Generation Sequencing—General Information ......Next generation sequencing (NGS) is defined as technology allowing one to determine in a single experiment the sequence

CHAPTER 1 INTRODUCTION16

translocations and aneuploidy are likely to be missed by NGS analyses unless they are specifically designed for detection of such variants. Short read lengths cause problems in analyzing regions of the genome with segmental duplications containing stretches of sequences with high similarity.

The most commonly used NGS platforms (Illumina, PGM, Proton Ion) have error rates in the range of ∼0.1–1%. These error rates translate into thousands of mistakes in extensive analyses such as WES or WGS. The problem is compounded by a generally nonrandom distribution of errors, which makes corrections based on increased coverage difficult [21]. It is noteworthy that a study of WGS in 12 sub-jects found only 66% concordance in variants detected by Illumina and CG platforms [41].

Relatively high error rates pose unique problems in the detection of somatic variants, for example, in searches of clinically actionable mutations in neoplastic tissue. Owing to tumor heterogeneity and contami-nation with normal cells, the rate of occurrence of such mutations may well be comparable to the NGS error rate, making their detection difficult/impossible. Ingenious strategies to minimize errors based on combin-ing information from both strands of the DNA molecule have been proposed [42], but they are not routine.

Whereas small (<10 kb) targets can often be enriched (usually by PCR) and sequenced with 100% efficiency, this is not possible with longer targets, whose analysis as a rule has gaps from incomplete coverage. When considering such tests it should be borne in mind that adequate coverage can be gener-ally expected only for 85–95% of the target. Although WGS can improve coverage by avoiding the shortcomings of enrichment techniques, it is still not perfect—in a 2014 study of 56 genes with a high clinical importance according to the American College of Medical Genetics and Genomics, a median of up to 17% were not appropriately covered by WGS [41].

Owing to the above-mentioned problems, positive results of NGS tests, in particular WES and WGS, should be confirmed with an independent technique (e.g., Sanger sequencing), whereas in the case of nega-tive findings it should always be remembered that they may be a false negative. Importantly, even WES or WGS, despite their power, cannot be used to exclude a possibility of a genetic defect in a patient [41,43].

CONCLUSIONThe development of NGS has revolutionized genetic research as well as the practice of clinical genetics. As will become apparent from the following chapters, NGS testing is heavily used in virtually all branches of medical science. An integral part of NGS implementation is related to the development of appropriate data analysis tools. Actually, even at present the challenges presented by analysis and inter-pretation of NGS results are bigger than those associated with sequencing itself. Thus, no discussion of NGS is complete without mentioning bioinformatics issues (see the chapters Basic Bio-informatic Anal-ysis of NGS Data and Analysis of Structural Chromosome Variants by NGS). An important aspect of NGS is linked to ethical problems (see Ethical Issues), in particular the problem of incidental findings, that is, finding variants that, with a high probability, indicate the presence or high risk of a disease that was not searched for. The strategy of dealing with incidental findings is the most hotly debated issue that has emerged from NGS. Another aspect of NGS, relevant in particular to its clinical applications, is the necessity for development of efficient and balanced reimbursement strategies (see Organizational and Financial Challenges). Finally, it should be emphasized that the process of NGS development is by no means over. There is a constant progress in increasing the throughput of existing platforms; in parallel, as discussed in Future Directions, there is an ongoing search for completely new technical strategies, such as nanopore sequencing and novel massively parallel bioinformatical solutions.

Page 17: Chapter 1 - Next Generation Sequencing—General Information ......Next generation sequencing (NGS) is defined as technology allowing one to determine in a single experiment the sequence

17 REFERENCES

REFERENCES [1] Sanger F, Coulson AR. A rapid method for determining sequences in DNA by primed synthesis with DNA

polymerase. J Mol Biol May 25, 1975;94(3):441–8. [2] Tewhey R, Warner JB, Nakano M, Libby B, Medkova M, David PH, et al. Microdroplet-based PCR enrichment

for large-scale targeted sequencing. Nat Biotechnol November 2009;27(11):1025–U94. [3] Adey A, Morrison H, Asan, Xun X, Kitzman J, Turner E, et al. Rapid, low-input, low-bias construction of

shotgun fragment libraries by high-density in vitro transposition. Genome Biol 2010;11(12):R119. [4] Ross MG, Russ C, Costello M, Hollinger A, Lennon NJ, Hegarty R, et al. Characterizing and measuring bias

in sequence data. Genome Biol 2013;14(5). [5] Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial sequencing and analysis of

the human genome. Nature February 15, 2001;409(6822):860–921. [6] Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, et al. The sequence of the human genome.

Science February 16, 2001;291(5507):1304–51. [7] Genome Reference Consortium. http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/; 2014. [8] Peters BA, Kermani BG, Sparks AB, Alferov O, Hong P, Alexeev A, et al. Accurate whole-genome sequencing

and haplotyping from 10 to 20 human cells. Nature July 12, 2012;487(7406):190–5. [9] Korbel JO, Urban AE, Affourtit JP, Godwin B, Grubert F, Simons JF, et al. Paired-end mapping reveals

extensive structural variation in the human genome. Science October 19, 2007;318(5849):420–6. [10] Tewhey R, Bansal V, Torkamani A, Topol EJ, Schork NJ. The importance of phase information for human

genomics. Nat Rev Genet March 2011;12(3):215–23. [11] Lebl M, Buermann D, Reed MT, Heiner DL, Triener A. Flow cells and manifolds having an electroosmotic

pump. 08.02.12 [Google Patents]. [12] Rothberg JM, Hinz W, Rearick TM, Schultz J, Mileski W, Davey M, et al. An integrated semiconductor

device enabling non-optical genome sequencing. Nature July 21, 2011;475(7356):348–52. [13] Dressman D, Yan H, Traverso G, Kinzler KW, Vogelstein B. Transforming single DNA molecules into

fluorescent magnetic particles for detection and enumeration of genetic variations. Proc Natl Acad Sci USA July 22, 2003;100(15):8817–22.

[14] Shendure J, Porreca GJ, Reppas NB, Lin XX, McCutcheon JP, Rosenbaum AM, et al. Accurate multiplex polony sequencing of an evolved bacterial genome. Science September 9, 2005;309(5741):1728–32.

[15] Ma ZC, Lee RW, Li B, Kenney P, Wang YF, Erikson J, et al. Isothermal amplification method for next-generation sequencing. Proc Natl Acad Sci USA August 27, 2013;110(35):14320–3.

[16] Voelkerding KV, Dames SA, Durtschi JD. Next-generation sequencing: from basic research to diagnostics. Clin Chem April 2009;55(4):641–58.

[17] Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, et al. Genome sequencing in micro-fabricated high-density picolitre reactors. Nature September 15, 2005;437(7057):376–80.

[18] Drmanac R, Sparks AB, Callow MJ, Halpern AL, Burns NL, Kermani BG, et al. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science January 1, 2010;327(5961):78–81.

[19] Reid C, Complete Genomics Inc. Future Oncol 2011;7(2):219–21. [20] Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, et al. Real-time DNA sequencing from single polymerase

molecules. Science January 2, 2009;323(5910):133–8. [21] Mardis ER. Next-generation sequencing platforms. Annu Rev Anal Chem 2013;6(6):287–303. [22] Levene MJ, Korlach J, Turner SW, Foquet M, Craighead HG, Webb WW. Zero-mode waveguides for single-

molecule analysis at high concentrations. Science January 31, 2003;299(5607):682–6. [23] Carneiro MO, Russ C, Ross MG, Gabriel SB, Nusbaum C, DePristo MA. Pacific biosciences sequencing

technology for genotyping and variation discovery in human data. BMC Genomics August 5, 2012;13. [24] Flusberg BA, Webster DR, Lee JH, Travers KJ, Olivares EC, Clark TA, et al. Direct detection of DNA

methylation during single-molecule, real-time sequencing. Nat Methods June 2010;7(6):461–U72.

Page 18: Chapter 1 - Next Generation Sequencing—General Information ......Next generation sequencing (NGS) is defined as technology allowing one to determine in a single experiment the sequence

CHAPTER 1 INTRODUCTION18

[25] Braslavsky I, Hebert B, Kartalov E, Quake SR. Sequence information can be obtained from single DNA molecules. Proc Natl Acad Sci USA April 1, 2003;100(7):3960–4.

[26] Ozsolak F. Third-generation sequencing techniques and applications to drug discovery. Expert Opin Drug Discov March 2012;7(3):231–43.

[27] Bowers J, Mitchell J, Beer E, Buzby PR, Causey M, Efcavitch JW, et al. Virtual terminator nucleotides for next-generation DNA sequencing. Nat Methods August 2009;6(8):593–5.

[28] Pushkarev D, Neff NF, Quake SR. Single-molecule sequencing of an individual human genome. Nat Bio-technol September 2009;27(9):847–50.

[29] Ozsolak F, Platt AR, Jones DR, Reifenberger JG, Sass LE, McInerney P, et al. Direct RNA sequencing. Nature October 8, 2009;461(7265):814–8.

[30] Forrest ARR, Kawaji H, Rehli M, Baillie JK, de Hoon MJL, Haberle V, et al. A promoter-level mammalian expression atlas. Nature March 27, 2014;507(7493):462–70.

[31] Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, et al. An atlas of active enhancers across human cell types and tissues. Nature March 27, 2014;507(7493):455–61.

[32] Mertes F, ElSharawy A, Sauer S, van Helvoort JM, van der Zaag PJ, Franke A, et al. Targeted enrich-ment of genomic DNA regions for next-generation sequencing. Briefings Funct Genomics November 1, 2011;10(6):374–86.

[33] Altmuller J, Budde BS, Nurnberg P. Enrichment of target sequences for next-generation sequencing applica-tions in research and diagnostics. Biol Chem February 2014;395(2):231–7.

[34] Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature September 10, 2009;461(7261):272–U153.

[35] Choi M, Scholl UI, Ji WZ, Liu TW, Tikhonova IR, Zumbo P, et al. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc Natl Acad Sci USA November 10, 2009;106(45):19096–101.

[36] Worthey EA, Mayer AN, Syverson GD, Helbling D, Bonacci BB, Decker B, et al. Making a definitive diag-nosis: successful clinical application of whole exome sequencing in a child with intractable inflammatory bowel disease. Genet Med March 2011;13(3):255–62.

[37] Chilamakuri CSR, Lorenz S, Madoui MA, Vodak D, Sun JC, Hovig E, et al. Performance comparison of four exome capture systems for deep sequencing. BMC Genomics June 9, 2014;15.

[38] Rabbani B, Tekin M, Mahdieh N. The promise of whole-exome sequencing in medical genetics. J Hum Genet January 2014;59(1):5–15.

[39] Gilissen C, Hehir-Kwa JY, Thung DT, van de Vorst M, van Bon BWM, Willemsen MH, et al. Genome sequencing identifies major causes of severe intellectual disability. Nature July 17, 2014;511(7509):344–7.

[40] Almeida B, Fernandes S, Abreu IA, Macedo-Ribeiro S. Trinucleotide repeats: a structural perspective. Front Neurol 2013;4.

[41] Feero WG. Clinical application of whole-genome sequencing proceed with care. JAMA March 12, 2014;311(10):1017–9.

[42] Kirsch S, Klein CA. Sequence error storms and the landscape of mutations in cancer. Proc Natl Acad Sci USA September 4, 2012;109(36):14289–90.

[43] Biesecker LG, Green RC. Diagnostic clinical genome and exome sequencing. N Engl J Med June 18, 2014;370(25):2418–25.