

NEXT GENERATION SEQUENCING AT THE FGCZ

newsletter

Technologies, Applications, and Access to Support

Next Generation Sequencing at the Functional Genomics Center Zurich

Next Generation Sequencing (NGS) has become one of the major technologies at the FGCZ. The impressive throughput and read lengths of the current high-end systems enable research groups to de novo sequence genomes of small to complex organisms. They allow for the analysis of genome alterations, gene expression and DNA modifications, as well as a constantly growing number of other applications.

To account for the vast diversity of life science research at the ETH Zurich and the University of Zurich, the FGCZ has continuously extended its portfolio of NGS technologies and applications. The emphasis of this expansion has not only been on capacities but also on capabilities, leading to the establishment of a complete range of NGS technologies: short read technologies (Life Technologies SOLiD and Illumina HiSeq) for the generation of very large numbers of short to medium length reads, long read technologies (Roche/454 GS FLX+) for the generation of medium to large numbers of long reads, plus the latest technologies focusing on the very rapid generation of sequencing data (Ion Torrent PGM) and very long reads and novel applications at the single molecule level (Pacific Biosciences RS).

[Figure: timeline, 2008 to 2012]

The on-site availability of these systems allows us to flexibly combine the different technologies within or between projects and opens up combinations of possibilities that ensure suitable analytical strategies tailored to the research project’s needs. As a consequence of this flexibility and the number of technologies available, NGS projects need careful planning and consideration at multiple levels, including sample availability and quality, robustness of methods and protocols, processing and analysis of data, as well as financial and time constraints.

This complexity emphasizes the need for a close collaboration of ETH and UZH research groups with analytical and bioinformatics experts of the FGCZ, which is why access to the FGCZ NGS platform is provided through the User Lab. In parallel to the technology expansion, significant efforts have been undertaken in increasing the User Lab staff working in the technologies and bioinformatics sections. As a result, optimized analytical protocols and data analysis workflows have been established that lead to significantly shortened turnover times from sample generation to data interpretation.

Technologies and support modes at the FGCZ

In the following, we briefly describe the available technologies (from most recent to most established) including supported applications. More information on the setup of NGS and access via the FGCZ User Lab and User Lab Services can be found on the FGCZ website at www.fgcz.ch. For further or more specific questions, please contact us at [email protected]

OVERVIEW

NGS AT THE FGCZ: Technologies and organization

NGS APPLICATIONS AND TECHNOLOGIES: Applications in sequencing

PACIFIC BIOSCIENCES RS: Single molecule sequencing

ION TORRENT PGM: Fast turnover sequencing

ILLUMINA HISEQ2000: High throughput short reads

LIFE TECHNOLOGIES SOLID 5500XL: Flexible short read sequencing

ROCHE/454 GS FLX+: Reliable long read sequencing

BIOINFORMATICS OF NGS: Analysis workflows and support

FGCZ NEWSLETTER FALL 2011


NGS@FGCZ Applications and Technologies

De novo sequencing of genomes and metagenomes
Emphasis on longer reads, medium to high throughput
*** Roche GS FLX+
*** Pacific Biosciences RS
** Illumina HiSeq (+++ in combination with PacBio or GS FLX+)
** Ion Torrent PGM (for smaller genomes)

De novo sequencing of transcriptomes
Emphasis on medium to longer reads, high to very high throughput
*** Illumina HiSeq
*** Roche GS FLX+
** Ion Torrent PGM

Genome-wide SNP discovery and variant detection
Emphasis on high to ultra-high throughput, short to medium read lengths
*** Illumina HiSeq (max. throughput)
*** LifeTech SOLiD (max. flexibility)
** Ion Torrent PGM (max. speed)

Transcriptome analysis
Emphasis on high to ultra-high throughput, short to medium read lengths
*** Illumina HiSeq (max. throughput)
*** LifeTech SOLiD (max. flexibility)
** Ion Torrent PGM (max. speed)

Small RNAs
Emphasis on high to ultra-high throughput, short read lengths
*** Illumina HiSeq (max. throughput)
*** LifeTech SOLiD (max. flexibility)
** Ion Torrent PGM (max. speed)

ChIP-seq
Emphasis on high to ultra-high throughput, short to medium read lengths
*** Illumina HiSeq (max. throughput)
*** LifeTech SOLiD (max. flexibility)
** Ion Torrent PGM (max. speed)

Amplicon sequencing
Emphasis on longer reads, medium to high throughput
*** Roche GS FLX+
*** Pacific Biosciences RS
** Illumina HiSeq (+++ in combination with PacBio or GS FLX+)
** LifeTech SOLiD (+++ in combination with PacBio or GS FLX+)
** Ion Torrent PGM

DNA Methylation analysis
Emphasis on flexibility and protocols
*** Pacific Biosciences RS
*** LifeTech SOLiD
*** Illumina HiSeq

The list of applications and suitable technologies is not exhaustive, nor does it exclude the use of a specific analytical technology for the applications mentioned. The purpose of the list is to provide an initial overview that is always refined based on individual needs during the setup phase of an FGCZ project. Technologies of equal suitability are listed in alphabetical order.


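NGS@FGCZ Technology Specifications

* Library types: Frag: fragment; MP: mate pair; PE: paired-end
§ The actual amount of input material depends on the library type (and the number of needed SMRT cells in the case of PacBio). For accurate information please refer to each technology section.

PACIFIC BIOSCIENCES RS
Library type *: Frag
Library preparation: PCR free
Chemistry: Single Molecule Real Time Sequencing
Run time: 24 hrs (90 min per SMRT cell)
Read length (bp): 1500
Throughput (Gb per run): 1.2 (16 SMRT cells)
Amount of input material §: 1-10 µg DNA

ION TORRENT PGM
Library type *: Frag
Library preparation: emPCR
Chemistry: Ion semiconductor sequencing
Run time: 2 hrs
Read length (bp): 200
Throughput (Gb per run): 1
Amount of input material §: 0.1-5 µg DNA; 0.2-1 µg poly(A) RNA or rRNA-depleted total RNA

ILLUMINA HISEQ2000
Library type *: Frag, MP, PE
Library preparation: Solid phase
Chemistry: Reversible Terminator
Run time: 10 days
Read length (bp): 2x100
Throughput (Gb per run): 600
Amount of input material §: 1 µg DNA; 0.1-4 µg total RNA; 1-10 µg for small RNA; 10 µg Single ChIP enriched DNA

LIFE TECHNOLOGIES SOLID 5500XL
Library type *: Frag, MP, PE
Library preparation: emPCR
Chemistry: Sequencing by Ligation
Run time: 10 days
Read length (bp): 35-75 (Frag); 75+35 (PE); 2x60 (MP)
Throughput (Gb per run): 200
Amount of input material §: 0.1-5 µg DNA; 0.2-1 µg poly(A) RNA or rRNA-depleted total RNA

ROCHE/454 GS FLX+
Library type *: Frag, MP
Library preparation: emPCR
Chemistry: Pyrosequencing
Run time: 23 hrs
Read length (bp): 700
Throughput (Gb per run): 0.7
Amount of input material §: 1 µg DNA; 1-2 µg total RNA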


Single Molecule Sequencing

Pacific Biosciences RS

The Pacific Biosciences RS system is a third generation sequencer that provides single molecule and real time sequencing technology (SMRT) based on an uninterrupted template-directed DNA-polymerase synthesis. The RS gives the longest read length achievable to date: the average read length per run is currently around 1.5 kb, with instances of over 10,000 base pairs, which facilitates mapping and assembly. SMRT technology bears the potential to retrieve kinetic sequencing information of individual molecules, a method that is foreseen to allow for the direct identification of chemical modifications, such as methylation, at a single base resolution. The PacBio RS, in combination with massive high throughput sequencing (i.e. HiSeq 2000 or SOLiD 5500xl), provides an ideal solution to cover a wide range of applications.

Technology

The SMRT DNA sequencing technology is built upon three key innovations. The SMRT Cell provides single molecule, real-time observation of individual fluorophores against a dense background of labeled nucleotides while maintaining a high signal-to-noise ratio. In contrast to using base-linked nucleotides, SMRT Sequencing uses phospholinked nucleotides (the fluorescent dye is attached to the polyphosphate chain of the nucleotide). Through such attachment, incorporation of a phospholinked nucleotide by the DNA polymerase results in separation of the dye molecule from the nucleotide when the enzyme forms the phosphodiester bond between two nucleotides. A Real Time Detection platform provides single molecule, real-time detection as well as flexibility in run configurations and applications.

DNA sequencing is performed on SMRT Cells, each containing thousands of zero-mode waveguides (ZMWs). A ZMW is a hole, tens of nanometers in diameter, fabricated in a 100 nm metal film deposited on a silicon dioxide substrate. Each ZMW becomes a nanophotonic visualization chamber providing a detection volume of just 2 x 10^-20 liters (20 zeptoliters). At this volume, the activity of a single molecule can be detected amongst a background of thousands of labeled nucleotides. Within each chamber, a single DNA polymerase molecule is attached to the bottom surface such that it permanently resides within the detection volume. Phospholinked nucleotides, each type labeled with a different colored fluorophore, are then introduced into the reaction solution. As the DNA polymerase incorporates complementary nucleotides, each base is held within the detection volume for tens of milliseconds. During this time, the engaged fluorophore emits fluorescent light whose color corresponds to the base identity. Then, as part of the natural incorporation cycle, the polymerase cleaves the bond holding the fluorophore in place and the dye diffuses out of the detection volume. Following incorporation, the signal immediately returns to baseline and the process repeats. The DNA polymerase continues incorporating an average of 1-2 bases per second, producing a long chain of DNA in minutes.

Applications

SMRT technology is ideal for applications such as de novo genome and transcriptome sequencing (especially in combination with short read sequencers). PacBio’s long reads make the assembly of the genomic structure much easier, enabling a comprehensive view of the genome. SMRT technology is a PCR free approach: standard PacBio library preparation and sequencing methods do not include any amplification step. Native DNA is directly sequenced without PCR bias, which potentially allows the identification of chemical base modifications, such as methylation.

Performance

The PacBio RS can process 16 SMRT cells per run in approximately 24 hours (90 minutes per SMRT cell). Currently every cell produces up to 50'000 reads with an average read length of 1.5 kb, and a few reads reach 10 kb. With the next chemistry release (Q1 2012), the average read length is expected to double to 3 kb.
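The performance figures above can be cross-checked with a little arithmetic. The following Python sketch multiplies cells, reads per cell and average read length to estimate the yield of one run, and it reproduces the 1.2 Gb per run and 24 hours quoted in this newsletter. The function names are our own, and the projection for the 3 kb chemistry assumes that the number of reads per cell stays unchanged, which is an assumption and not a vendor specification.

```python
# Back-of-the-envelope yield for one PacBio RS run, using the figures quoted above.
CELLS_PER_RUN = 16            # SMRT cells per run
READS_PER_CELL = 50_000       # upper bound quoted per cell
MEAN_READ_LENGTH_BP = 1_500   # current average read length
MINUTES_PER_CELL = 90


def run_yield_gb(cells, reads_per_cell, mean_read_length_bp):
    """Return the run yield in gigabases (1 Gb = 1e9 bases)."""
    return cells * reads_per_cell * mean_read_length_bp / 1e9


def run_time_hours(cells, minutes_per_cell):
    """Return the total run time in hours when cells are processed one after another."""
    return cells * minutes_per_cell / 60


current = run_yield_gb(CELLS_PER_RUN, READS_PER_CELL, MEAN_READ_LENGTH_BP)
# Assumption: the same number of reads per cell with the announced 3 kb chemistry.
projected = run_yield_gb(CELLS_PER_RUN, READS_PER_CELL, 3_000)
hours = run_time_hours(CELLS_PER_RUN, MINUTES_PER_CELL)

print(f"current yield:   {current:.1f} Gb per run")
print(f"projected yield: {projected:.1f} Gb per run")
print(f"run time:        {hours:.0f} hours")
```

Since 50'000 reads per cell is an upper bound, the result should be read as a best-case estimate.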

Input Requirements

* 500 bp library: 1 µg high quality gDNA
* 2000 bp library: 3 µg high quality gDNA
* 8000 bp library: 8 µg high quality gDNA

PacBio RS at FGCZ

The PacBio RS system is accessible through the FGCZ User Lab. For the time being, no User Lab Services are offered, but projects including all analytical steps are carried out in close collaboration between FGCZ experts and User Group researchers. More information: PacBioRS


APPLICATIONS AND MAIN USE

Backbone sequencing

De novo sequencing

Methylation studies

TECHNOLOGY IN SHORT

Third generation sequencer using Single Molecule Real Time Sequencing Technology (SMRT).

Very long reads up to 10 kb, average read length 1.5 kb. Sample preparation and sequencing are PCR free: sequence native DNA.

Small genomes de novo sequencing and potential detection of DNA methylation at base resolution.


Personal Genome Machine

PGM by Ion Torrent

The Personal Genome Machine (PGM) Sequencer is a benchtop PostLight semiconductor-based platform that performs a sequencing run in approximately two hours.

Technology

The PGM advances Next-Generation sequencing to PostLight sequencing: the translation of chemical sequence information directly into digital form. The sequencing technology underlying the PGM exploits a well-characterized biochemical process: when a nucleotide is incorporated into a strand of DNA by a polymerase, a hydrogen ion (H+) is released as a byproduct. This hydrogen ion carries a charge which the PGM System’s ion sensor (essentially the world's smallest solid-state pH meter) can detect. As the sequencer floods the chip with one nucleotide after another, any nucleotide added to a DNA template will be detected as a voltage change, and the PGM System will call the base. If a nucleotide is not a match for a particular template, no voltage change will be detected and no base will be called for that template.

By eliminating the need for an optical system, the PGM provides sequencing that is simpler, faster, more cost effective, and more scalable than any other technology available.

A principal component of the PGM is the sequencing chip. This microprocessor chip incorporates an extremely dense array of >1 million micro-machined wells married to Ion Torrent's proprietary ion sensor.

Each well contains a different DNA template, allowing massively parallel sequencing. Chips can scale in density for any application, from small, targeted experiments to large genomes.
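The flow-by-flow principle described above can be illustrated with a toy simulation for a single well. The Python sketch below floods one nucleotide at a time over a template, records how many bases are incorporated in each flow, and rebuilds the read from those signals. The cyclic flow order "TACG", the function names, and the idealized noise-free signal that scales with the number of identical bases in a row are assumptions made for illustration; they are not taken from the instrument documentation. For simplicity, the strand being synthesized is modeled directly.

```python
FLOW_ORDER = "TACG"  # assumed cyclic flow order, for illustration only


def simulate_flows(read, flow_order=FLOW_ORDER, max_flows=400):
    """Return (nucleotide, incorporations) per flow for the strand being synthesized.

    A count of 0 corresponds to "no voltage change, no base called".
    """
    flows = []
    pos = 0
    flow_index = 0
    while pos < len(read) and flow_index < max_flows:
        nucleotide = flow_order[flow_index % len(flow_order)]
        count = 0
        # All consecutive matching bases are incorporated within the same flow.
        while pos < len(read) and read[pos] == nucleotide:
            count += 1
            pos += 1
        flows.append((nucleotide, count))
        flow_index += 1
    return flows


def call_bases(flows):
    """Rebuild the read from the per-flow incorporation counts."""
    return "".join(nucleotide * count for nucleotide, count in flows)


flows = simulate_flows("TTACGGA")
print(flows)
print(call_bases(flows))
```

On a real instrument the signal is an analog voltage that has to be converted into an integer count, which is why runs of identical bases are harder to call than single bases.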

Applications

The PGM system enables many possible sequencing applications. Multiplex amplicon sequencing can be performed in a very short time. The 200 bp reads make sequencing of long amplicons possible, resulting in simpler and more affordable projects. It is a well-suited platform for microbial sequencing projects, with highly uniform coverage and long read lengths. The four to eight million reads produced by the 316 and 318 chips are ideal for applications such as small RNA sequencing, transcriptome sequencing, and ChIP-Seq.

Performance

Performance of the system directly depends on the sequencing chip used.

Outlook

Ion Torrent recently launched a long read kit for the PGM sequencer with modal high-quality read lengths of 225 bases. Read lengths greater than 500 bases are feasible, as demonstrated by the generation of a perfect 525-base read.

A novel paired-end sequencing (PES) method was developed for the PGM sequencer that produces paired reads, each 100 bases in length (2 x 100). We anticipate that applying recent improvements in read length to the Ion PES protocol should produce paired reads of 200 bases in length (2 x 200), with the possibility of paired reads up to 400 bases in length (2 x 400).

Input Requirements

* DNA Seq - From 100 ng to 5 µg high-quality RNA-free genomic DNA
* RNA Seq - From 200 ng to 1 µg poly(A) RNA or rRNA-depleted total RNA

PGM at FGCZ

The PGM system is accessible through the FGCZ User Lab Services. More information: PGM


APPLICATIONS AND MAIN USE

Small genomes sequencing

Targeted re-sequencing (amplicon)

Low complexity transcriptomes and small RNA sequencing

TECHNOLOGY IN SHORT

Output depending on chip: > 10 Megabases to > 100 Megabases

Read length up to 200 bases

Run time < 2 hours

Ion Semiconductor Sequencing

Chip specifications

Chip 314: output > 10 Megabases
Chip 316: output > 100 Megabases
Chip 318: output > 1 Gigabase

Read length (all chips): > 200 bp in 2011; > 400 bp in 2012
Total sequencing time (all chips): < 2 hours
Accuracy: > 99.99% consensus accuracy and > 99.5% raw accuracy


High Throughput Short Read Sequencing

Illumina HiSeq 2000

The HiSeq2000 from Illumina is a high throughput sequencer that uses a massively parallel sequencing-by-synthesis approach to generate billions of bases of high quality sequence data. To date, it delivers the industry’s highest throughput of quality filtered data at up to 55 Gb per day or 600 Gb per run with read lengths of 2 x 100 bp. The high throughput and read length of the system provide the coverage necessary for various next generation sequencing applications.

Technology

Illumina differs from the SOLiD and 454 sequencers in that it does not use beads for clonal amplification of fragments. Instead, after library preparation, cluster generation by bridge PCR is performed directly on the flow cell.

For sequencing, Illumina uses the sequencing by synthesis (SBS) technology. It is a proprietary reversible terminator-based method that enables detection of single bases as they are incorporated into growing DNA strands. A fluorescently labeled terminator is imaged as each dNTP is added and is then cleaved to allow incorporation of the next base. Since all four reversible terminator-bound dNTPs are present during each sequencing cycle, natural competition minimizes incorporation bias.

Applications

SBS technology supports both single read and paired-end libraries. A wide array of available sample preparation methods serves to enable diverse applications, including:
- DNA Sequencing
- Transcriptome Analysis (RNA-Seq)
- SNP Discovery and Structural Variation Analysis
- Cytogenetic Analysis (copy number variation analysis, CNV)
- DNA-Protein Interaction Analysis by Chromatin Immunoprecipitation (ChIP-Seq)
- Methylation Analysis
- Small RNA Discovery and Analysis

Performance

The HiSeq 2000 can process 2 flow cells in parallel. Each flow cell is made of 8 lanes that can be run independently. The HiSeq 2000 can generate a maximum of 600 Gigabases of quality-filtered sequence data per run with read lengths of up to 2 x 100 base pairs, and provide up to three billion single-end reads and up to six billion paired-end reads. A run takes about ten days.

Input Requirements

* DNA Seq - From 1 µg high-quality RNA-free genomic DNA
* RNA Seq - From 100 ng to 4 µg column purified, genomic DNA free total RNA
* Small RNA Seq - From 1 to 10 µg of Trizol purified total RNA
* ChIP Seq - From 10 ng Single ChIP enriched DNA or input DNA

HiSeq2000 at FGCZ

The HiSeq2000 system is accessible through the FGCZ User Lab Services. More information: HiSeq2000


APPLICATIONS AND MAIN USE

DNA Sequencing

Gene Regulation Analysis

Sequencing-Based Transcriptome Analysis

SNP Discovery and Structural Variation Analysis

Cytogenetic Analysis, DNA-Protein Interaction (ChIP-Seq)

Sequencing-Based Methylation Analysis

Small RNA Discovery and Analysis

TECHNOLOGY IN SHORT

Sequence up to 16 independent lanes on 2 flow cells in one run

24 unique barcodes provided by Illumina for multiplexing for RNA and DNA libraries

Up to 384 independent samples per run

~ 1.5 billion high-quality reads per flow cell (~180 million/lane)

Up to 600 Gb output per run

Single-read (50 or 100 bp) or paired-end sequencing (2 x 50 bp; 2 x 100 bp)

Number of reads and throughput for 2 x 100 sequencing

1 lane: 188 million reads, 37.5 Gigabases
8 lanes (1 flow cell): 1504 million reads, 300 Gigabases
16 lanes (2 flow cells): 3008 million reads, 600 Gigabases
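The lane figures above follow from simple multiplication, as the short Python sketch below shows. It assumes that the 188 million "reads" per lane count clusters, each yielding two reads of 100 bp in paired-end mode, and that all 24 barcodes are used in every lane; both are our reading of the numbers rather than vendor statements, and the function names are our own. The product for one lane is 37.6 Gb, which the table rounds to 37.5 Gb and 600 Gb per run.

```python
# Figures taken from the HiSeq 2000 lane table and "Technology in short" box.
READS_PER_LANE_M = 188      # millions, per lane
LANES_PER_FLOW_CELL = 8
FLOW_CELLS_PER_RUN = 2
BARCODES_PER_LANE = 24


def throughput_gb(lanes, read_length_bp=100, paired=True,
                  reads_per_lane_m=READS_PER_LANE_M):
    """Return the expected output in gigabases for a number of lanes."""
    bases_per_cluster = read_length_bp * (2 if paired else 1)
    return lanes * reads_per_lane_m * 1e6 * bases_per_cluster / 1e9


def max_samples(lanes, barcodes_per_lane=BARCODES_PER_LANE):
    """Return how many barcoded samples fit when every lane is fully multiplexed."""
    return lanes * barcodes_per_lane


lanes_per_run = LANES_PER_FLOW_CELL * FLOW_CELLS_PER_RUN
for lanes in (1, LANES_PER_FLOW_CELL, lanes_per_run):
    print(f"{lanes:2d} lane(s): {throughput_gb(lanes):6.1f} Gb, "
          f"up to {max_samples(lanes)} samples")
```

For a full run this gives the 384 independent samples quoted above.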


APPLICATIONS AND MAIN USE

DNA Sequencing (Fragment, Paired End, Mate-Paired)

Whole Transcriptome Analysis

Small RNA Expression

SAGE - genome-wide expression

TECHNOLOGY IN SHORT

Sequencing by ligation, two bases at a time

Sequence up to 12 independent lanes on 2 flow chips in one run

96 unique barcodes

Up to 200 Gb output per run

Flexible read lengths

Paired-end sequencing (75 x 35 bp) is available

High Accuracy and Reference-free analysis with Exact Call Chemistry Module

Flexible Short Read Sequencing

Life Technologies SOLiD 5500xl

The SOLiD 5500xl is a highly accurate, massively parallel next-generation sequencing platform that supports a wide range of applications. Sequencing lanes in the flow chip can be run independently with pay-per-use reagents. Multiplexing capability (96 barcodes) allows sequencing of multiple library types in a single run.

Technology

The major difference between SOLiD sequencing and other high-throughput sequencing platforms is its unique chemistry: sequencing by ligation instead of sequencing by synthesis.

The ligation is performed using specific, fluorescently labeled octamer probes made up of 2 known bases at the 3 prime end followed by three degenerate bases, then three universal bases. These probes are present simultaneously and compete for incorporation. After each ligation, the fluorescence signal is measured and the dye is then cleaved off before another round of ligation takes place. A reset phase, a capping step that prevents dephasing, reduces noise. Sequential ligations of fluorescently labeled probes detect every combination of two adjacent bases. Multiple cycles of ligation, detection and cleavage are performed, with the number of cycles determining the eventual read length. Following a series of ligation cycles, the extension product is removed and the template is reset with a primer complementary to the n-1 position for a second round of ligation cycles. Five rounds of primer reset are completed for each sequence tag. Through the primer reset process, virtually every base is interrogated in two independent ligation reactions by two different primers. Up to 99.99 % accuracy is achieved with the Exact Call Chemistry Module by sequencing with an additional primer using a multi-base encoding scheme.
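Because each ligation reports on two adjacent bases, SOLiD data is commonly described as a sequence of "colors" rather than bases. The Python sketch below illustrates the widely described two-base encoding, in which each pair of neighbouring bases maps to one of four colors and a known first base allows the sequence to be decoded again. The numeric color assignment (0 to 3, computed as an XOR of 2-bit base codes) and the function names are assumptions for illustration and are not specified in this newsletter.

```python
# Assumed 2-bit codes; the color of a base pair is the XOR of the two codes.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_colors(seq):
    """Translate a base sequence into one color (0-3) per adjacent base pair."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]


def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and the colors."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)


reference = "ATGGC"
variant = "ATAGC"  # single base change at the third position
ref_colors = encode_colors(reference)
var_colors = encode_colors(variant)
changed = [i for i, (r, v) in enumerate(zip(ref_colors, var_colors)) if r != v]

print(ref_colors, var_colors, changed)
print(decode_colors("A", ref_colors))
```

In this scheme a true single-base variant changes two adjacent colors, whereas an isolated color change is more likely a measurement error, which is one way to picture how interrogating each base twice supports accuracy.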

Applications

The sequencing by ligation technology supports both single read and paired-end libraries. A wide array of sample preparation methods serves to enable diverse applications.
- Fragment Sequencing (from 35 bp to 75 bp), also with barcodes (currently 96)
- Paired End (50-75)/35 bp (96 barcodes)
- Mate-Paired 2 x 60 bp
- Whole Transcriptome Analysis (96 barcodes)
- Small RNA Expression (96 barcodes)
- SAGE - genome-wide expression

Performance

The SOLiD 5500xl can process 2 flow chips in parallel. Each flow chip is made of six lanes that can be run independently. The SOLiD 5500xl can generate up to 198 gigabases of raw sequence data per run with read lengths of up to 75 x 35 base pairs. A run takes about ten days.

Input Requirements

* DNA Fragment library - 500 ng of high quality column purified genomic DNA
* Mate Pair library - 5-20 µg of high quality column purified genomic DNA
* Targeted Enrichment (Sure Select) - 3 µg of high quality column purified genomic DNA
* ChIP-Seq - 20 ng - 1 µg DNA
* RNA seq - 200 ng of poly(A) RNA or 200 ng of rRNA-depleted RNA
* Small RNA analysis - 5-10 µg of integral high quality total RNA

SOLiD 5500xl at FGCZ

The SOLiD 5500xl system is accessible through the FGCZ User Lab Services. More information: SOLiD5500xl


Number of reads and throughput for 75 x 35 sequencing

1 lane: 150 million reads, 16.5 Gigabases
6 lanes (1 flow chip): 900 million reads, 99 Gigabases
12 lanes (2 flow chips): 1800 million reads, 198 Gigabases


High Reliability Long Read Sequencing

Roche/454 GS FLX+

454 Sequencing uses a large-scale parallel pyrosequencing system, which is a bioluminescence method that relies on a single addition of a dNTP by a DNA polymerase.

Technology

The system depends on fixing nebulised and adapter-ligated DNA fragments to small DNA-capture beads in a water-in-oil emulsion. The DNA fixed to these beads, ideally one DNA molecule per bead, is then clonally amplified by PCR. Each DNA-bound bead is placed into a ~29 µm well on a PicoTiterPlate (PTP), a titanium coated fiber optic slide. A mix of enzymes such as DNA polymerase, ATP sulfurylase, and luciferase is also packed into the well; the latter two facilitate light production.

[Figure: Pyrosequencing]

[Figure: Emulsion PCR for signal amplification]

The PTP is then mounted in a flow chamber, and individual dNTPs are flushed across the wells in a pre-determined sequential order. The lower surface of the PTP is directly attached to a high-resolution CCD camera, which allows detection of the light generated from each PTP well undergoing the pyrosequencing reaction. Pyrosequencing essentially measures the release of inorganic pyrophosphate by proportionally converting it into light using a series of enzymatic reactions. The light signal generated by the enzymatic cascade is recorded as a series of peaks called a flowgram.
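Since the light signal is proportional to the amount of pyrophosphate released, a flowgram can be turned back into bases by rounding each peak to the nearest whole number of incorporated nucleotides. The Python sketch below shows this idea on a made-up flowgram. The flow order "TACG", the normalization of a single incorporation to an intensity of about 1.0, and the example values are assumptions for illustration; production base callers use more elaborate signal models.

```python
def decode_flowgram(intensities, flow_order="TACG"):
    """Convert normalized flow intensities into a base sequence.

    Each intensity is rounded to the nearest non-negative integer, which is
    taken as the number of identical bases incorporated in that flow.
    """
    bases = []
    for index, signal in enumerate(intensities):
        nucleotide = flow_order[index % len(flow_order)]
        count = max(0, int(signal + 0.5))  # round half up for non-negative signals
        bases.append(nucleotide * count)
    return "".join(bases)


# Hypothetical normalized flowgram: six flows in the order T, A, C, G, T, A.
example_flowgram = [1.02, 0.05, 2.10, 0.97, 0.10, 1.04]
print(decode_flowgram(example_flowgram))
```

The rounding step also hints at why long runs of the same base are the main source of uncertainty in this chemistry: the relative difference between neighbouring peak heights shrinks as the run gets longer.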

Applications

454 sequencing enables:
- De novo whole genome sequencing
- RNA analysis
- Re-sequencing of whole genomes and target DNA regions
- Metagenomics

The GS FLX+ system allows straightforward de novo assembly to decode previously uncharacterized genomes or transcriptomes, or re-sequencing of organisms with an available reference. Whole genome sequencing projects may use shotgun reads alone or in combination with mate paired reads to generate accurate draft assemblies of any organism. The clonal nature of the 454 Sequencing System renders this platform particularly suitable for amplicon sequencing. This application allows unambiguous allele resolution of variation in complex regions of the genome along with quantitative detection of variants present in less than 1 % of a mixture.

Performance

The GS FLX+ System features the unique combination of long reads, exceptional accuracy and high throughput, making the system well suited for varied genomic projects.

The latest improvements in sequencing chemistry, instrumentation and software have led to an increased read length of up to 1000 bp. Thus, this technology allows comprehensive genome coverage and the exploration of the full range of genetic variation using long, high-quality reads with a very high consensus accuracy (99.995 % to 99.997 %).

The GS FLX+ System is designed for use with both the new long-read Sequencing Kit XL+ and the existing Sequencing Kit XLR70, each with its own features.

Input Requirements

For shotgun fragment libraries:
* 1 µg for average read length of 700 bp
* 500 ng for a shotgun fragment library of 400 bp read length
* For mate pair libraries with insert sizes of 3 kb, 8 kb or 20 kb, we need a gDNA input of 5 µg, 15 µg or 30 µg.
* For RNA analysis, 1 to 2 µg of total RNA.

GS FLX+ at FGCZ

The GS FLX+ system is accessible through the FGCZ User Lab Services. More information: FLX+


APPLICATIONSAND MAIN ADVANTAGES

De novo sequencing

Re-sequencing

Targeted re-sequencing (Amplicon)

RNA analysis

Metagenomics

TECHNOLOGYIN SHORT

Over 1 Mio reads per run

Consensus accuracy greater than 99.99 %

For shotgun sequencing: Average read length of 700 bp Up to 1000 bp reads

Sequencing Kit        New GS FLX Titanium XL+    GS FLX Titanium XLR70
Read Length           Up to 1'000 bp             Up to 600 bp
Mode Read Length      700 bp                     450 bp
Typical Throughput    700 Megabases              450 Megabases
Reads per Run         ~1 Million Shotgun         ~1 Million Shotgun, ~700'000 Amplicon
Consensus Accuracy    99.997%                    99.995%
Run Time              23 Hours                   10 Hours
Multiplexing          Multiplex Identifiers (MIDs): 132; Gaskets: 2, 4, 8, 16 Regions
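The typical throughput figures quoted for the two kits are consistent with the number of shotgun reads per run multiplied by the mode read length. The short sketch below reproduces that arithmetic and converts it into an approximate fold coverage. The 5 Mb genome size and the helper names are assumptions chosen purely for illustration.

```python
# Figures taken from the kit comparison; everything else is illustrative.
KITS = {
    "XL+": {"reads_per_run": 1_000_000, "mode_read_length_bp": 700},
    "XLR70": {"reads_per_run": 1_000_000, "mode_read_length_bp": 450},
}

def typical_throughput_mb(reads_per_run, mode_read_length_bp):
    """Approximate bases per run in megabases (reads x read length)."""
    return reads_per_run * mode_read_length_bp / 1_000_000

def fold_coverage(throughput_mb, genome_size_mb):
    """Average depth if all bases were spread evenly over the genome."""
    return throughput_mb / genome_size_mb

if __name__ == "__main__":
    genome_size_mb = 5.0  # hypothetical small bacterial genome
    for name, kit in KITS.items():
        mb = typical_throughput_mb(**kit)
        cov = fold_coverage(mb, genome_size_mb)
        print(f"{name}: ~{mb:.0f} Mb per run, ~{cov:.0f}x on a {genome_size_mb} Mb genome")
```

Real coverage is uneven and depends on read quality filtering, so these numbers are best read as an upper-level planning estimate.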


Turning Sequencing Reads into Knowledge

NGS Bioinformatics

NGS technology and the resulting data are uniquely suited for answering a wide range of biological questions by overcoming limitations of classical sequencing and of alternative microarray approaches. Due to the massive amount of data involved, data management and analysis require dedicated software as well as high-performance and high-capacity computing resources. The FGCZ has developed and productively implemented data processing pipelines for standard data analysis workflows, which can be readily applied to analyze the sequencing data generated at the center. In addition to standard tools and workflows, the bioinformatics team provides access to its expertise and the optional development of customized solutions via the FGCZ User Lab. As with the experimental part, the analysis and interpretation of the resulting data rely on close interaction between FGCZ staff and users, which is essential for generating scientifically relevant results.

Standard Data Analysis Support

As an integral part of every sequencing project, the FGCZ provides experimental design consultation during the setup phase of a project, in conjunction with the discussion for selecting the most suitable analytical platform. The actual data analysis support following data production in a given project then depends significantly on the type of study:

For projects with a reference genome, the standard support includes:
- files of raw reads
- read alignment to the reference(s)
- initial secondary analysis (e.g. SNP calling, expression quantitation, peak finding, and so on)
- QC report

For de novo sequencing projects, the standard support includes:
- files of raw reads
- assembly of the raw reads with the vendor assembler using default parameters

The currently available standard data analysis workflows cover popular NGS applications, such as:
- Re-sequencing, SNP discovery and variant detection
- Transcriptome analysis and expression quantitation
- Regulome analysis (small RNA, transcription factor binding, histone modification, methylation)

Standard data analysis and support efforts in the form of a limited number of consulting hours are part of the User Lab Services and are free of charge. Analysis pipelines and consulting for additional applications can be added, based on demand and available resources.

Customized Data Analysis Service

A significant number of research projects that exploit the flexibility and many options of NGS show a distinct need for customized support or even the development of new data analysis procedures or tools. Depending on the availability of resources and subject to project-specific agreements, the FGCZ collaborates on non-standard data analysis, for example the discovery and in silico verification of new miRNAs.

Training and Support

In addition to project-specific data analysis, the FGCZ provides training and education at both the conceptual level and the level of concrete software usage. Courses and tutorials are announced on the FGCZ Genomics Bioinformatics website.

FGCZ NEWSLETTER - NGS BIOINFORMATICS FALL 2011

DATA ANALYSIS WORKFLOWS IN SHORT

De novo assembly

Mapping to the reference genome

SNP and INDEL detection

Digital expression

ChIP-seq region detection


Example Workflows

From Raw Data to Knowledge

De novo assembly and annotation of genomes and transcriptomes

De novo assembly and annotation of genomes place high demands on the customization of the analysis workflows. While assembly using default parameters will frequently yield a satisfactory result, an iterative approach with manual inspection and adaptation of assembly parameters considerably improves the assembly outcome. The annotation process has two components: structural annotation (identification of genomic elements) and functional annotation (linking biological information to the genomic elements). Both steps depend strongly on the species sequenced and require close collaboration between the researchers and the FGCZ bioinformaticians. Within such collaborations we run sequence similarity and sequence domain searches to provide in silico predictions and functional annotations of genes. We also contribute our expertise to further analyses, such as in-depth analysis of gene families (e.g. secreted proteins, membrane proteins).
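When assembly parameters are adapted iteratively, successive assemblies need to be compared in some way. One widely used contiguity summary is the N50 contig length. The sketch below computes it for two hypothetical assembly runs. The text does not state which metrics are used at the FGCZ, so N50 is an assumption here, and the contig lengths are invented for illustration.

```python
def n50(contig_lengths):
    """Return the N50: the largest length L such that contigs of length >= L
    together cover at least half of the total assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")

if __name__ == "__main__":
    # Invented contig lengths (bp) for two parameter settings.
    default_run = [100, 80, 60, 40, 20]
    tuned_run = [150, 90, 60]
    print("default parameters N50:", n50(default_run))
    print("tuned parameters N50:", n50(tuned_run))
```

N50 only describes contiguity. It says nothing about correctness, so it complements rather than replaces the manual inspection described above.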

RNA-seq analysis

For RNA-seq samples, we map the reads to both the transcriptome and the genome. For human samples, for example, this will be the latest version of the ENSEMBL transcripts and the hg19 assembly. Subsequently we use R/Bioconductor to load the reads from the BAM alignment files and generate expression counts for genes. With the DESeq package for R, we compute the significantly differentially expressed genes.
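The step of turning mapped reads into per-gene counts can be illustrated with a minimal sketch. The workflow described here uses R/Bioconductor on BAM files. The Python code below is not that implementation: it works on small in-memory tuples with half-open coordinates, and the gene names, coordinates and function name are invented for illustration.

```python
def count_reads_per_gene(genes, reads):
    """Count, for every gene, the reads that overlap its region.

    genes: list of (name, chrom, start, end), half-open intervals
    reads: list of (chrom, start, end), half-open intervals
    """
    counts = {name: 0 for name, _chrom, _start, _end in genes}
    for r_chrom, r_start, r_end in reads:
        for name, g_chrom, g_start, g_end in genes:
            # Two half-open intervals overlap if each starts before the other ends.
            if r_chrom == g_chrom and r_start < g_end and g_start < r_end:
                counts[name] += 1
    return counts

if __name__ == "__main__":
    genes = [("geneA", "chr1", 100, 200), ("geneB", "chr1", 300, 400)]
    reads = [
        ("chr1", 90, 140),   # overlaps geneA
        ("chr1", 150, 190),  # overlaps geneA
        ("chr1", 350, 420),  # overlaps geneB
        ("chr2", 100, 150),  # other chromosome, not counted
    ]
    print(count_reads_per_gene(genes, reads))
```

The nested loop is quadratic and only suitable for toy data. Production tools use indexed BAM access and interval trees, and they also need a policy for reads that overlap several genes.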

As an example from a recent project, RNA from four human tissue samples was sequenced as paired reads on a single slide of the SOLiD 5500xl. Following sequencing, the reads were mapped to the human reference genome (hg19), creating four sorted and indexed BAM files. Using the R/Bioconductor library Rsamtools and Ensembl definitions of gene regions, the mapped reads were transformed into counts for each gene. The counts table was analyzed with the DESeq test to reveal significantly differentially expressed genes. The results were uploaded into a MetaCore workspace and functionally analyzed for the enrichment of pathways, gene networks and Gene Ontology terms. As an additional element, we searched the intergenic spaces for regions of previously unknown transcription and performed an in-depth analysis of splicing variants for a list of genes pre-selected by the users.
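Before counts from different samples can be compared, they have to be normalized for sequencing depth. DESeq does this with per-sample size factors estimated by the median-of-ratios method. The sketch below is a simplified stand-alone illustration of that normalization idea in Python, not the DESeq package itself, and it does not perform the differential expression test. The counts are invented.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping gene name -> list of raw counts, one per sample.
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if min(row) <= 0:
            continue
        log_geo_mean = sum(math.log(k) for k in row) / n_samples
        for j, k in enumerate(row):
            log_ratios[j].append(math.log(k) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]

if __name__ == "__main__":
    # Invented counts: sample 2 was sequenced about twice as deeply.
    counts = {"g1": [10, 20], "g2": [100, 200], "g3": [5, 10], "g4": [0, 3]}
    factors = size_factors(counts)
    normalized = {
        gene: [k / s for k, s in zip(row, factors)] for gene, row in counts.items()
    }
    print(factors)
    print(normalized)
```

Dividing each sample's counts by its size factor puts the samples on a common scale. The statistical test for differential expression is a separate step that this sketch leaves out.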

ChIP-seq analysis

A frequent goal of ChIP-seq is to identify active transcription factor binding sites or chromatin modifications. Here the standard workflow starts with mapping the reads of the IP and control samples to the corresponding reference genome using the default mapper (e.g. for SOLiD reads, the mapper included in the LifeScope software package). Depending on whether the study searches for short IP-enriched regions (e.g. transcription factor binding sites) or extended enriched regions (e.g. promoter acetylation), we select different peak-finding strategies. For short peaks we use the MACS software, and for extended regions we use SICER. Additionally, we provide a third, fully customizable analysis approach using the chipseq package in R/Bioconductor. In all cases, we provide coverage plots for visual verification of identified peaks as well as interactive browsing using the Integrative Genomics Viewer (IGV).
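The basic idea behind finding IP-enriched regions can be sketched as a comparison of read counts between IP and control in fixed windows. The code below is a deliberately simple toy model under stated assumptions: it uses fixed, non-overlapping windows and a Poisson test against the depth-scaled control. It is not the algorithm of MACS, SICER or the chipseq package, and all names, thresholds and read positions are hypothetical.

```python
import math

def window_counts(read_starts, genome_length, window):
    """Number of read start positions falling into each fixed-size window."""
    counts = [0] * math.ceil(genome_length / window)
    for pos in read_starts:
        counts[pos // window] += 1
    return counts

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def enriched_windows(ip, control, window, genome_length, alpha=1e-3, pseudocount=1.0):
    """Windows whose IP count is unexpectedly high given the control."""
    ip_counts = window_counts(ip, genome_length, window)
    ctl_counts = window_counts(control, genome_length, window)
    scale = len(ip) / max(len(control), 1)  # adjust for sequencing depth
    hits = []
    for idx, (k, c) in enumerate(zip(ip_counts, ctl_counts)):
        expected = (c + pseudocount) * scale
        p = poisson_sf(k, expected)
        if p < alpha:
            hits.append((idx * window, (idx + 1) * window, k, p))
    return hits

if __name__ == "__main__":
    # Invented data: 20 IP reads pile up between positions 500 and 520.
    ip = list(range(500, 520)) + [50, 150, 250, 350]
    control = [50, 150, 250, 350, 450, 550, 650, 750, 850, 950]
    for start, end, k, p in enriched_windows(ip, control, 100, 1000):
        print(f"{start}-{end}: {k} IP reads, p = {p:.2e}")
```

Dedicated peak callers additionally model fragment size, local background and multiple testing, which is why the standard workflow relies on established tools rather than a fixed-window test like this one.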


Access to Computing and Software

Resources for Users

Access to and organization of data

All NGS data is automatically stored in the FGCZ B-Fabric system and is accessible via a web interface. Sample meta-information and project information are available through the same interface to all members of the respective User Lab Services project.

Access to Computing Resources

While the high-performance computing infrastructure for processing and basic analysis of NGS data is used exclusively within the established pipelines and by FGCZ bioinformatics staff, the center provides users with access to a dedicated high-performance Linux computer at the FGCZ for NGS data analysis. This way, users have access to a wide range of NGS data analysis software tools and databases. Computationally intensive data analysis tasks can be scheduled by the FGCZ team on the FGCZ computing cluster.

Bioinformatics Tools and Databases

More than one hundred bioinformatics software packages and many standard life science databases are implemented and maintained at the FGCZ. Additional applications or databases can be hosted on request.

Frequently used packages are:
- Open source software: Abyss, amos, BEDTools, BLAST, BLAT, Bowtie, bwa, consed, FASTQC, GATK, InterProScan, JIGSAW, MACS, MAQ, mira, Mosaik, mothur, phrap, phred, picard, PyroNoise2, Qiime, R/Bioconductor, Samtools, SHRiMP, SICER, SOAPdenovo, Tophat, velvet
- Commercial software: CASAVA, CLCBio Genomics Workbench, CLCBio NGS Cell, LifeScope, 454 Software Package, tmap

Frequently used databases:
- Standard sequence databases: NCBI nr, nt, est, cdd, SwissProt, PFAM, KEGG
- Specialized databases: miRBase, 16S rDNA databases (RDP, GreenGenes)

FURTHER INFORMATION

SEQUENCING CONTACT

Via e-mail to [email protected] or [email protected]. FGCZ staff will initiate a meeting to discuss options and workflows, issue quotes, and advise on all aspects from study design to analysis to data interpretation.

BIOINFORMATICS SUPPORT

Bioinformatics consulting and support is an integral part of User Lab projects and User Lab Services. As resources are limited, discussion about the level of support and training offered to users is part of the project setup process.

USER LAB AND SERVICES

More information on the general setup of the FGCZ and its access modes can be found at www.fgcz.ch


Sources: Pictures of instruments, consumables and workflows have been retrieved from vendor sources. Text is in part based on information materials of the vendors.

Disclaimer: The technologies mentioned may well be used for applications, mentioned or not mentioned, also in the absence of an FGCZ recommendation: due to the large number of protocols and applications possible, the FGCZ can only support a limited number of protocols and methods per platform and therefore limits the number of standard recommendations. Before excluding options, please consult with us.

SOFTWARE TOOLS AND ACCESS TO RESOURCES

Software available:

NGS reads mapping and assembly

Tag counting, quantification, profile pattern discovery

Sequence analysis and annotation

Databases available:

Primary sequence database

Protein sequence database

Genome database

Specialized Hardware available:

Dedicated high memory machine (24 processors, 146 GB RAM) with direct user access

SGE cluster of 120 cores with restricted access