Course Expectations Sequencing technology and (very) large datasets 6/1/2015


Goals for the course

Understand how next-generation sequencing technologies are used in biomedical research

Learn how to conduct an RNA-seq analysis

Learn how to analyze gene lists to form hypotheses that can be tested experimentally

Learn to write a results section for a manuscript

Logistics

Course website: http://biochem.slu.edu/bchm628/

Some data will be shared via Google Drive

Contact:
  Phone: 977-8858
  Email: [email protected]
  Office: DRC 507. Call or email. Usually at WashU on Thursdays
  Lab: DRC 654


Exercise format

There will be 5 exercises, each consisting of 2-4 sections. Each section represents a biological question to be answered with bioinformatics tools/resources from that week or earlier weeks.

You’ll provide the answer in the same format as you would write for the results section of a paper:
  Why did you do this experiment or analysis?
  What did you actually do?
  What did you observe?
  What does it mean?

Include supporting data:
  Figures with figure legends
  Correctly formatted tables of data

Exercises, cont

You will hand in your exercise via email in either Word or PDF format, with supplemental data in Excel, Word or PDF format.

The exercise should print in portrait orientation.

The exercise should include a header with your name at the top, and the file should be named: Your Name-Ex #.

There is a penalty for turning in your exercises after the deadline. The timestamp on your email is the final determination of whether an exercise is on-time or not.

Final project

This will be a project summary of the analyses that you will do over the course of the 4 weeks.

You will be asked to choose 3 genes from your gene lists that you would follow up on at the bench. You will be asked to give a rationale for the choices that you made.

You will analyze the three genes virtually using some of the tools from weeks 3 & 4.

You will also be asked to propose additional bench experiments for them.

Final project will be due July 7th at 3:00 pm.

Data tables

In general, columns describe attributes and rows contain the individual data. The first row contains a header. If you have lots of data, it is generally formatted to have more rows than columns.

Table 1: Gene expression for WT cells under conditions X, Y, Z.

Gene name   Log2 (Cond. X/untreated)   Log2 (Cond. Y/untreated)   Log2 (Cond. Z/untreated)
NM_00522    2.56                       3.12                       2.75
NM_06588    -1.25                      -1.02                      -0.98

Table 2: Comparison of clinical parameters for groups 1 and 2.

Clinical parameter   Group 1 (avg ± mean)   Group 2 (avg ± mean)   P-value
ALT/AST ratio        25 ± 1                 35 ± 2                 0.0021
Leukocyte count      1200 ± 32              950 ± 65               0.0512

1 Statistical significance was determined by a Mann-Whitney test
2 Statistical significance was determined by a 2-tailed t-test
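As a rough illustration of the "columns are attributes, rows are data, first row is a header" layout, the sketch below stores Table 1 as tab-delimited text and reads it back with Python's standard csv module. The names `TABLE_1`, `read_table` and `log2_to_fold_change` are made up for this example, and the conversion to linear fold change assumes the values are log2 ratios, as the column headers state.

```python
import csv
import io

# Table 1 from the slide as tab-delimited text: header in the first row,
# one gene per row, one attribute per column.
TABLE_1 = (
    "Gene name\tLog2 (Cond. X/untreated)\tLog2 (Cond. Y/untreated)\tLog2 (Cond. Z/untreated)\n"
    "NM_00522\t2.56\t3.12\t2.75\n"
    "NM_06588\t-1.25\t-1.02\t-0.98\n"
)


def read_table(text):
    """Parse tab-delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def log2_to_fold_change(log2_ratio):
    """Convert a log2 ratio back to a linear fold change (2 ** ratio)."""
    return 2.0 ** log2_ratio


rows = read_table(TABLE_1)
for row in rows:
    name = row["Gene name"]
    folds = [log2_to_fold_change(float(v)) for k, v in row.items() if k != "Gene name"]
    print(name, [round(f, 2) for f in folds])
```

Keeping one gene per row means the same loop works unchanged whether the table has two genes or twenty thousand.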

Data tables, cont

For the purposes of this class, the tables should be formatted to fit onto a letter size page in portrait orientation.
  If your table is so wide that it forces the page into landscape orientation, then it should be included as a supplemental attachment to the exercise.
  If the table extends past 1 page, then include it as a supplemental attachment.

Refer to supplemental tables in your write-up, and number them and the files as Name_SuppTable1, etc.

Supplemental tables can be in Excel format.

Figures

If you can export the figure from whatever program in jpeg or png format, those can be inserted into a Word document easily.

PDFs can be converted to other formats using Illustrator

There are some online converters:
  http://www.wikihow.com/Convert-PDF-to-JPEG

Screen capture and placement may also work. Talk to me if you have issues. I won’t be very picky about high resolution.



Figures, cont.

Figures should have figure legends. The figure legends should describe the experiment that led to the data in the figure and include an explanation for any symbols used.

Figures should be numbered consecutively and should not take up more than ¼ of the page. If larger than that, include as supplemental data.

Create a text box in Word, write the figure legend and then insert the figure above the figure legend. This will allow you to resize as necessary.

Again, talk to me if you have issues.

Grading

Grading:
  Exercises 65%
  Final exam 25%
  Class attendance 10%

Grading policy handout: details about late assignments and tests

Lecture outline

Overview of sequencing a genome

Next generation sequencing

High-throughput experiments by sequencing

Genome browsers

Genome sequencing

Approach depends on the source, size, complexity and goal for the data for a given organism

Goal?
  De novo sequencing
  Re-sequencing for annotation
  Sequencing to identify variations

Size and complexity
  Virus, bacterial, single-celled eukaryote, mammal, plant

Sample prep
  Can it be cultured?
  Tissue source: unlimited or limited quantities?
  Virus levels, RNA or DNA

Genome sizes

Organism                      Genome size (base pairs)   Number of genes
Hepatitis C virus             0.01 x 10^6                10
Epstein-Barr virus            0.172 x 10^6               37
Bacterium (E. coli)           4.6 x 10^6                 4406
Yeast (S. cerevisiae)         12.5 x 10^6                6172
Nematode worm (C. elegans)    100.3 x 10^6               19,099
Thale cress (A. thaliana)     115.4 x 10^6               25,498
Fruit fly (D. melanogaster)   128.3 x 10^6               13,601
Corn (Z. mays)                2500 x 10^6                39,469
Human (H. sapiens)            3223 x 10^6                20,500
Wheat (T. aestivum)           5500 x 10^6 (x 3)          ~95,000
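One way to see that gene number does not scale with genome size is to compute genes per megabase from the table above. This is only a sketch: the function name `genes_per_megabase` is made up for this example, and wheat is left out because the "(x 3)" in its row makes the effective genome size ambiguous.

```python
# (organism, genome size in base pairs, number of genes), copied from the table above.
GENOMES = [
    ("Hepatitis C virus", 0.01e6, 10),
    ("Epstein-Barr virus", 0.172e6, 37),
    ("Bacterium (E. coli)", 4.6e6, 4406),
    ("Yeast (S. cerevisiae)", 12.5e6, 6172),
    ("Nematode worm (C. elegans)", 100.3e6, 19099),
    ("Thale cress (A. thaliana)", 115.4e6, 25498),
    ("Fruit fly (D. melanogaster)", 128.3e6, 13601),
    ("Corn (Z. mays)", 2500e6, 39469),
    ("Human (H. sapiens)", 3223e6, 20500),
]


def genes_per_megabase(genome_size_bp, n_genes):
    """Gene density: number of genes per million base pairs."""
    return n_genes / (genome_size_bp / 1e6)


for organism, size_bp, n_genes in GENOMES:
    print(f"{organism:30s} {genes_per_megabase(size_bp, n_genes):10.1f} genes/Mb")
```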

Types of questions

How many genes?

How many functional genetic elements?
  miRNAs, ncRNAs

What’s different about this genome compared to another one?
  Virulence differences in pathogenic organisms
  What is the cause of this particular phenotype?

What taxonomic groups are represented in this population of bacteria, viruses or fungi?

How do the gene expression patterns change between samples (across time)?

Where does this transcription factor bind in the genome?

Genetic maps

Chromosomal banding patterns
  Stain with Giemsa (G-banding pattern)
  Dark regions: heterochromatic, late replicating and AT rich
  Lighter regions: euchromatic, early replicating and GC rich

Chromosomes are numbered based on size

Giemsa binds to phosphate groups & attaches to regions that are AT rich

Chromosome nomenclature

p (petite) = short arm

q (queue) = long arm

Bands are numbered going away from centromere

4q21.1 represents chromosome 4, long arm, 2nd band, 1st sub-band and 1st sub-sub-band
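The naming scheme can be sketched as a small parser. This is an illustration only: the `CytoBand` type, the `parse_band` function and the regular expression are made up for this example, the field names follow the slide's wording (band, sub-band, sub-sub-band), and only labels with two band digits such as 4q21.1 or Xp11.23 are handled.

```python
import re
from typing import NamedTuple, Optional


class CytoBand(NamedTuple):
    chromosome: str
    arm: str                      # "short" (p) or "long" (q)
    band: int
    sub_band: int
    sub_sub_band: Optional[int]   # digits after the decimal point, if any


# chromosome (1-22, X or Y), arm (p or q), two band digits, optional ".digits"
_BAND_PATTERN = re.compile(r"^([1-9]|1[0-9]|2[0-2]|X|Y)([pq])(\d)(\d)(?:\.(\d+))?$")


def parse_band(label):
    """Split a label such as '4q21.1' into its named parts."""
    match = _BAND_PATTERN.match(label)
    if match is None:
        raise ValueError(f"not a recognised band label: {label!r}")
    chromosome, arm, band, sub_band, sub_sub_band = match.groups()
    return CytoBand(
        chromosome=chromosome,
        arm="short" if arm == "p" else "long",
        band=int(band),
        sub_band=int(sub_band),
        sub_sub_band=int(sub_sub_band) if sub_sub_band is not None else None,
    )


print(parse_band("4q21.1"))
```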

DNA sequencing – Overview

Gel electrophoresis
  Predominant in 1980s

Whole genome strategies
  Physical mapping (BAC clones)
  Walking
  Shotgun sequencing
  Capillary sequencing machines

Computational fragment assembly

Next generation technologies
  Polony based sequencing
  Novel assembly techniques

Cost/base for DNA sequence

[Figure: cost per base of DNA sequence by year, 1990-2010; log-scale axis from 1.0E-07 to 1.0E+02]

Traditional approach

Shear the very large genome into smaller chunks
Clone in vectors that can support large inserts
Digest and separate on high resolution gel to determine the clone overlap
Pick minimum number of clones
Shotgun sequence each clone
Read the traces and assemble
Make the gene calls
Load it into a genome viewer

BAC library in DNA sequencing

Shotgun sequencing

[Figure: D) Sequence each clone to produce individual sequence reads. E) Contig assembly, showing Contig A and Contig B separated by a gap]

Paired reads vs single reads

[Figure: two diagrams of Contig A and Contig B separated by a gap]

Gap closure! Prefer 3-10 mate pairs per gap

Inserts of different, but known sizes

Single reads
  • M13 clones
  • robotic template prep

Paired reads
  • Plasmids, cosmids, BACs

Steps to Assemble a Genome

1. Find overlapping reads

2. Merge some “good” pairs of reads into longer contigs

3. Link contigs to form supercontigs

4. Derive consensus sequence ..ACGATTACAATAGGTT..
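The first two steps (finding overlaps and merging reads into contigs) can be illustrated with a toy greedy merger. This is a teaching sketch, not how production assemblers work: it assumes error-free reads from one strand, ignores mate pairs and consensus calling, and the function names `overlap` and `greedy_assemble` and the `min_len` parameter are made up for this example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the two sequences with the largest overlap until none overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep input order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # whatever is left cannot be joined: gaps remain
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ACGATTACA", "TTACAATAG", "CAATAGGTT"]
print(greedy_assemble(reads))
```

Reads that share no overlap of at least `min_len` bases are returned as separate contigs, which is the toy equivalent of a gap between Contig A and Contig B.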

Some Terminology

read: a 500-900 bp long word that comes out of the sequencer

mate pair: a pair of reads from two ends of the same insert fragment

contig: a contiguous sequence formed by several overlapping reads with no gaps

supercontig (scaffold): an ordered and oriented set of contigs, usually by mate pairs

consensus sequence: sequence derived from the multiple alignment of reads in a contig

Target: 30X coverage or >30 high quality reads per base
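Average coverage is the total number of sequenced bases divided by the genome size: coverage = (number of reads x read length) / genome size. The sketch below turns that around to estimate how many reads a 30X target needs. The function names are made up for this example, and the calculation assumes reads are spread evenly over the genome, which real data never quite achieves.

```python
def coverage(n_reads, read_length, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest whole number of reads that reaches the target average coverage."""
    total_bases = target_coverage * genome_size
    return -(-total_bases // read_length)  # integer ceiling division


# Human genome size taken from the genome size table (3223 x 10^6 bp).
HUMAN_GENOME_BP = 3_223_000_000

for read_length in (100, 500, 900):
    n = reads_needed(30, read_length, HUMAN_GENOME_BP)
    print(f"{read_length} bp reads: {n:,} reads for 30X")
```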

Assembled into chromosomes

RefSeq nomenclature:
  NG: genomic sequence of a complete gene (NT is used for genomic contigs)
  NC: chromosome
  NM: mRNA sequence
  NP: protein sequence

Assembly: completed genome, multiple assemblies

Calling the genes

De novo computer algorithms
  Identify coding sequences by GC content
  Start and stop sites
  Intron/exon boundaries

Comparison with other known genes
  EST libraries
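Two of the de novo signals listed above, GC content and start/stop sites, are simple enough to sketch directly. This is a deliberately naive illustration: it scans only the forward strand, knows nothing about introns, and the names `gc_content`, `find_orfs` and `min_codons` are made up for this example. Real gene callers combine many more signals.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_orfs(seq, min_codons=3):
    """Return (start, end, sequence) for forward-strand ORFs running from ATG to a stop codon.

    min_codons counts the start and stop codons as part of the ORF length.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None
    return orfs


example = "CCATGGCTTGATT"
print(gc_content(example))
print(find_orfs(example))
```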

Sanger method

Misha Angrist

Sanger sequencing reached its technical limits

Only modestly parallel (394 lanes/machine)
Long read lengths (500-900 bp) & >99.9% correct
Need to clone the DNA to obtain enough for sequencing reaction

At SLU: cost for typical Sanger sequencing is $5-6/sample with reliable 500 bp of sequence

DNA sequencing timeline

How many sequenced genomes?
  NCBI: >12,000 genomes deposited
  JGI (Joint Genome Institute): 6600 complete, >20,000 draft genomes

NGS sequencing

Polony: discrete clonal amplifications of a single DNA molecule, grown in a gel matrix. The clusters can then be individually sequenced, producing short reads

Polony-based or cluster-based sequencing is the basis of most second generation sequencers

Typical NGS workflow:

1. Library construction to add adapters to sequence
2. Template CLONAL amplification (on a bead or chip)
3. Massively PARALLEL sequencing

A) Fragment DNA

B) Repair ends/Add A overhang DNA

C) Ligate adapters

D) Select ligated DNA

E) Attach DNA to flow cell

F) Bridge amplification

G) Generate clusters

H) Anneal sequencing primer

I) Extend 1st base, read & deblock

J) Repeat to extend strand

K) Generate base calls

Library prep: ~6 hours

Cluster generation: ~6 hours

Sequencing: 2-6 days

Illumina NGS

Illumina HiSeq and MiSeq

100-200 bp read lengths
Available locally with MoGene and Cofactor Genomics
GTAC (Wash U) has HiSeq 2000 which has 50 bp single end reads and 100 bp paired-end reads

Why not use this for all sequencing?
  Cost is ~$300-400/library and ~$1100/lane of sequencing
  Generate Tb of data per run, Gb per lane

Ion Torrent – measures pH changes

Done on a semi-conductor chip

Ion Torrent workflow

Illumina vs Ion Torrent

Illumina has greater capacity but longer run times
Latest versions of both have read lengths ~200 bp
SLU has an Ion Torrent machine
  Cost is ~$270/sample, including the sequencing

Can do single- or paired-end reads
  Paired end are 2X cost for library construction, but necessary for de novo genome assembly
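For budgeting, the per-library, per-lane and per-sample figures quoted in these slides can be combined into a rough estimate. This is a back-of-the-envelope sketch only: the function names are made up, and the number of samples pooled per Illumina lane (`samples_per_lane`) is a hypothetical parameter that the slides do not specify. The paired-end surcharge is not modelled.

```python
import math

# Figures quoted in the slides (US dollars).
ILLUMINA_LIBRARY_COST_RANGE = (300, 400)   # per library
ILLUMINA_LANE_COST = 1100                  # per lane of sequencing
ION_TORRENT_SAMPLE_COST = 270              # per sample, sequencing included


def illumina_cost(n_samples, samples_per_lane):
    """Return (low, high) estimate; samples_per_lane is an assumed pooling level."""
    lanes = math.ceil(n_samples / samples_per_lane)
    low, high = ILLUMINA_LIBRARY_COST_RANGE
    return (n_samples * low + lanes * ILLUMINA_LANE_COST,
            n_samples * high + lanes * ILLUMINA_LANE_COST)


def ion_torrent_cost(n_samples):
    """Estimate for the same samples on the Ion Torrent."""
    return n_samples * ION_TORRENT_SAMPLE_COST


print("Illumina, 6 samples on 1 lane:", illumina_cost(6, samples_per_lane=6))
print("Ion Torrent, 6 samples:", ion_torrent_cost(6))
```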

Bioinformatics challenges

Each flow cell in the Illumina HiSeq 2000 can generate a billion bases of sequence
  Raw read files are Tb in size
  Processed read files are 700-800 Mb
  Alignment files 150-300 Mb

Assembly of millions of short (75-100 bp) reads into vertebrate genome
  Need high-performance compute (HPC) cluster for vertebrate sized genomes

Sequencing has become a standard technique

RNA sequencing for expression
ChIP sequencing for TF site identification
DNA sequencing for variants
Identification of populations/genetic changes in highly variable viruses and bacteria
Metagenomics
  Identification of unknown/non-culturable communities of bacteria/viruses/fungi

Why RNAseq over microarray?

Technical variation is less
Do not need a sequenced genome
Greater dynamic range of expression
Detect transcript isoforms
Identify novel transcripts
Identify non-coding RNAs

Data availability

Public repositories of microarray, RNAseq and other high-throughput expression data are GEO & SRA at the NCBI

GEO: Gene Expression Omnibus
  http://www.ncbi.nlm.nih.gov/geo/
  Tools for downloading as well as querying datasets
  Array and sequence-based data available

SRA: Short Read Archive
  http://www.ncbi.nlm.nih.gov/sra
  Can download raw sequence data (fastq files)
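A fastq file stores each read as four lines: an "@" header, the sequence, a "+" separator, and one quality character per base. The sketch below reads such records from any text stream. It assumes the common Phred+33 quality encoding and unwrapped (single-line) sequences; the function name `read_fastq` and the two example reads are made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, phred_scores) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        # Assumes Phred+33: quality score = ASCII code of the character minus 33.
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


# Two invented reads, just to exercise the parser.
EXAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!5I\n"

records = list(read_fastq(io.StringIO(EXAMPLE)))
for read_id, sequence, scores in records:
    print(read_id, sequence, scores)
```

Because the function is a generator, it never holds more than one record in memory, which matters when the real files are hundreds of Mb or more.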


Today in computer lab

Tutorial on searching NCBI/GEO for large datasets

Partek Genomics Suite (PGS) tutorial
