laurence-goodman
Course Expectations
Sequencing technology and
(very) large datasets
6/1/2015
Goals for the course
Understand how next-generation sequencing technologies are used in biomedical research
Learn how to conduct an RNA-seq analysis
Learn how to analyze gene lists to form hypotheses that can be tested experimentally
Learn to write a results section for a manuscript
Logistics
Course website: http://biochem.slu.edu/bchm628/
Some data will be shared via Google Drive
Contact:
Phone: 977-8858
Email: [email protected]
Office – DRC 507
Lab – DRC 654
Call or email. Usually at WashU on Thursdays
Exercise format
There will be 5 exercises, each consisting of 2-4 sections which represent a biological question to be answered with bioinformatics tools/resources from that week or earlier weeks.
You’ll provide the answer in the same format as you would write for the results section of a paper:
Why did you do this experiment or analysis?
What did you actually do?
What did you observe?
What does it mean?
Include supporting data:
Figures with figure legends
Correctly formatted tables of data
Exercises, cont
You will hand in your exercise via email in either Word or PDF format, with supplemental data in Excel, Word or PDF format.
The exercise should print in portrait orientation.
The exercise should include a header with your name at the top, and the file should be named: Your Name-Ex #.
There is a penalty for turning in your exercises after the deadline. The timestamp on your email is the final determination of whether an exercise is on-time or not.
Final project
This will be a project summary of the analyses that you will do over the course of the 4 weeks.
You will be asked to choose 3 genes from your gene lists that you would follow up on at the bench.
You will be asked to give a rationale for the choices that you made.
You will analyze the three genes virtually using some of the tools from weeks 3 & 4.
You will also be asked to propose additional bench experiments for them.
Final project will be due July 7th at 3:00 pm.
Data tables
In general, columns describe attributes and rows contain the individual data. The first row contains a header. If you have lots of data, it is generally formatted to have more rows than columns.
Table 1: Gene expression for WT cells under conditions X, Y, Z.
Gene name    Log2 (Cond. X/untreated)    Log2 (Cond. Y/untreated)    Log2 (Cond. Z/untreated)
NM_00522     2.56                        3.12                        2.75
NM_06588     -1.25                       -1.02                       -0.98

Table 2: Comparison of clinical parameters for groups 1 and 2.
Clinical parameter    Group 1 (avg ± mean)    Group 2 (avg ± mean)    P-value
ALT/AST ratio         25 ± 1                  35 ± 2                  0.002¹
Leukocyte count       1200 ± 32               950 ± 65                0.051²
¹ Statistical significance was determined by a Mann-Whitney test
² Statistical significance was determined by a 2-tailed t-test
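The values in Table 1 are log2 ratios of a treated condition over the untreated control. A minimal sketch of that arithmetic follows; the expression values and the function name `log2_ratio` are made up for illustration and are not course data.

```python
import math

def log2_ratio(treated, untreated):
    """Return log2(treated / untreated) for two positive expression values."""
    if treated <= 0 or untreated <= 0:
        raise ValueError("expression values must be positive")
    return math.log2(treated / untreated)

# Hypothetical expression values, chosen only to show the arithmetic.
print(log2_ratio(400.0, 100.0))  # 4-fold increase  -> 2.0
print(log2_ratio(50.0, 100.0))   # 2-fold decrease  -> -1.0
```

A positive value means higher expression than in the untreated sample, a negative value means lower, and 0 means no change.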
Data tables, cont
For the purposes of this class, the tables should be formatted to fit onto a letter size page in portrait orientation.
If your table is so wide that it forces the page into landscape orientation, then it should be included as a supplemental attachment to the exercise.
If the table extends past 1 page, then include it as a supplemental attachment.
Refer to supplemental tables in your write-up, number them, and name the file Name_SuppTable1, etc.
Supplemental tables can be in Excel format.
Figures
If you can export the figure from whatever program in jpeg or png format, those can be inserted into a Word document easily.
PDFs can be converted to other formats using Illustrator.
There are some online converters: http://www.wikihow.com/Convert-PDF-to-JPEG
Screen capture and placement may also work.
Talk to me if you have issues. I won’t be very picky about high resolution.
Figures, cont.
Figures should have figure legends. The figure legends should describe the experiment that led to the data in the figure and include an explanation for any symbols used.
Figures should be numbered consecutively and should not take up more than ¼ of the page. If larger than that, include as supplemental data.
Create a text box in Word, write the figure legend and then insert the figure above the figure legend. This will allow you to resize as necessary.
Again, talk to me if you have issues.
Grading
Grading:
Exercises 65%
Final exam 25%
Class attendance 10%
Grading policy handout: details about late assignments and tests
Lecture outline
Overview of sequencing a genome
Next generation sequencing
High-throughput experiments by sequencing
Genome browsers
Genome sequencing
Approach depends on the source, size, complexity and goal for the data for a given organism
Goal? De novo sequencing; re-sequencing for annotation; sequencing to identify variations
Size and complexity: virus, bacterium, single-celled eukaryote, mammal, plant
Sample prep: Can it be cultured? Tissue source: unlimited or limited quantities? Virus levels, RNA or DNA
Genome sizes
Organism                      Genome size (base pairs)    Number of genes
Hepatitis C virus             0.01 x 10^6                 10
Epstein-Barr virus            0.172 x 10^6                37
Bacterium (E. coli)           4.6 x 10^6                  4,406
Yeast (S. cerevisiae)         12.5 x 10^6                 6,172
Nematode worm (C. elegans)    100.3 x 10^6                19,099
Thale cress (A. thaliana)     115.4 x 10^6                25,498
Fruit fly (D. melanogaster)   128.3 x 10^6                13,601
Corn (Z. mays)                2500 x 10^6                 39,469
Human (H. sapiens)            3223 x 10^6                 20,500
Wheat (T. aestivum)           5500 x 10^6 (x 3)           ~95,000
Types of questions
How many genes?
How many functional genetic elements (miRNAs, ncRNAs)?
What’s different about this genome compared to another one?
Virulence differences in pathogenic organisms
What is the cause of this particular phenotype?
What taxonomic groups are represented in this population of bacteria, viruses or fungi?
How do the gene expression patterns change between samples (across time)?
Where does this transcription factor bind in the genome?
Genetic maps
Chromosomal banding patterns
Stain with Giemsa (G-banding pattern)
Dark regions: heterochromatic, late replicating and AT rich
Lighter regions: euchromatic, early replicating and GC rich
Chromosomes are numbered based on size
Giemsa binds to phosphate groups & attaches to regions that are AT rich
Chromosome nomenclature
p (petite) = short arm
q (queue) = long arm
Bands are numbered going away from centromere
4q21.1 represents chromosome 4, long arm, 2nd band, 1st sub-band and 1st sub-sub-band
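The naming scheme can be made concrete with a small parser. This is only a sketch: the function name `parse_band`, the regular expression, and the field names (which follow the wording on this slide) are illustrative assumptions, and it handles only simple names of the form shown here.

```python
import re

# chromosome (1-22, X or Y), arm (p or q), two digits, optional ".digits"
BAND_PATTERN = re.compile(r"^(\d{1,2}|X|Y)([pq])(\d)(\d)(?:\.(\d+))?$")

def parse_band(name):
    """Split a band name such as '4q21.1' into its parts."""
    match = BAND_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized band name: {name!r}")
    chrom, arm, band, sub_band, sub_sub_band = match.groups()
    return {
        "chromosome": chrom,
        "arm": "short" if arm == "p" else "long",
        "band": int(band),
        "sub_band": int(sub_band),
        "sub_sub_band": sub_sub_band,  # kept as text; None when absent
    }

print(parse_band("4q21.1"))
```

For "4q21.1" this reports chromosome 4, the long arm, band 2, sub-band 1 and sub-sub-band 1, matching the reading given above.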
DNA sequencing – Overview
Gel electrophoresis: predominant in 1980s
Whole genome strategies: physical mapping (BAC clones), walking, shotgun sequencing, capillary sequencing machines
Computational fragment assembly
Next generation technologies: polony based sequencing, novel assembly techniques
Cost/base for DNA sequence
[Figure: cost per base of DNA sequence by year, 1990-2010; log-scale axis from 1.0E-07 to 1.0E+02]
Traditional approach
Shear the very large genome into smaller chunks
Clone in vectors that can support large inserts
Digest and separate on high resolution gel to determine the clone overlap
Pick minimum number of clones
Shotgun sequence each clone
Read the traces and assemble
Make the gene calls
Load it into a genome viewer
BAC library in DNA sequencing
Shotgun sequencing
[Figure: D) Sequence each clone to produce individual sequence reads. E) Contig assembly: Contig A, gap, Contig B]
Paired reads vs single reads
[Figure: two panels, each showing Contig A, a gap, and Contig B]
Gap closure: prefer 3-10 mate pairs per gap
Inserts of different, but known sizes
Single reads
• M13 clones
• robotic template prep
Paired reads
• Plasmids, cosmids, BACs
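Because the insert size is known, each mate pair that spans a gap gives an estimate of the gap length. A rough sketch of that reasoning, with hypothetical numbers and a hypothetical function name `gap_from_mate_pair`; real insert sizes vary around their nominal value, which is one reason several mate pairs per gap are preferred.

```python
from statistics import mean

def gap_from_mate_pair(insert_size, contig_a_length, start_in_a, end_in_b):
    """Estimate the gap between contig A and contig B from one mate pair.

    insert_size:      known size of the cloned insert (bp)
    contig_a_length:  length of contig A (bp)
    start_in_a:       0-based position in contig A where the insert begins
    end_in_b:         number of bases at the start of contig B covered by the insert
    """
    bases_in_a = contig_a_length - start_in_a
    return insert_size - bases_in_a - end_in_b

# Hypothetical mate pairs: (insert size, start in contig A, end in contig B).
# Contig A is assumed to be 10,000 bp long.
pairs = [(3000, 9000, 1500), (3000, 8800, 1320), (10000, 4000, 3450)]
estimates = [gap_from_mate_pair(size, 10_000, start, end) for size, start, end in pairs]
print(estimates)        # [500, 480, 550]
print(mean(estimates))  # 510
```

Averaging over inserts of different sizes smooths out the uncertainty in any single estimate.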
Steps to Assemble a Genome
1. Find overlapping reads
2. Merge some “good” pairs of reads into longer contigs
3. Link contigs to form supercontigs
4. Derive consensus sequence ..ACGATTACAATAGGTT..
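The first two steps above can be sketched as a toy greedy assembler. This is an illustration only, not how production assemblers work: it assumes error-free reads from one strand, ignores repeats and mate-pair linking, and the function names `overlap` and `greedy_assemble` are hypothetical. The three reads are invented so that they tile the example sequence shown in step 4.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # no remaining overlaps: what is left are the contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)

reads = ["ACGATTACA", "TTACAATAG", "AATAGGTT"]
print(greedy_assemble(reads))  # ['ACGATTACAATAGGTT']
```

If two groups of reads share no overlap, the function returns more than one contig, which is where mate pairs come in for linking contigs into supercontigs.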
Some Terminology
read: a 500-900 base long word that comes out of the sequencer
mate pair: a pair of reads from the two ends of the same insert fragment
contig: a contiguous sequence formed by several overlapping reads with no gaps
supercontig (scaffold): an ordered and oriented set of contigs, usually by mate pairs
consensus sequence: sequence derived from the multiple alignment of reads in a contig
Target: 30X coverage or >30 high quality reads per base
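Average coverage is the total number of sequenced bases divided by the genome size, so the coverage target translates directly into a number of reads. A back-of-the-envelope sketch, assuming uniform coverage and a fixed read length; the function names are illustrative, the genome size is the human value from the genome sizes table, and the 100 bp read length is an assumed short-read length.

```python
def mean_coverage(n_reads, read_length_bp, genome_size_bp):
    """Average number of reads covering each base: (reads x read length) / genome size."""
    return n_reads * read_length_bp / genome_size_bp

def reads_needed(genome_size_bp, read_length_bp, target_coverage):
    """Smallest number of reads whose total bases reach the target coverage."""
    total_bases = genome_size_bp * target_coverage
    return -(-total_bases // read_length_bp)  # ceiling division on integers

human_genome = 3_223_000_000  # 3223 x 10^6 bp
n = reads_needed(human_genome, 100, 30)
print(n)                                    # 966900000
print(mean_coverage(n, 100, human_genome))  # 30.0
```

Real coverage is uneven across the genome, so an average of 30X does not guarantee 30 reads at every base.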
Assembled into chromosomes
RefSeq nomenclature:
NT: genomic sequence of complete gene
NC: chromosome
NM: mRNA sequence
NP: protein sequence
Assembly: completed genome, multiple assemblies
Calling the genes
De novo computer algorithms
Identify coding sequences by GC content
Start and stop sites
Intron/exon boundaries
Comparison with other known genes
EST libraries
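Two of the de novo signals listed above, GC content and start/stop sites, are easy to sketch. The example below is a deliberately simplified illustration and not a real gene caller: it scans only the forward strand, ignores intron/exon boundaries, and the function names and test sequence are made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def find_orfs(seq, min_codons=3):
    """Return (start, end, sequence) for forward-strand open reading frames.

    An ORF here runs from an ATG to the first in-frame stop codon, inclusive.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None
    return orfs

print(find_orfs("CCATGGCTTGGTAACC"))  # [(2, 14, 'ATGGCTTGGTAA')]
print(gc_content("ATGC"))             # 0.5
```

Comparison with known genes and EST libraries, the second approach on this slide, is what confirms or rejects candidates like these.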
Sanger method
Misha Angrist
Sanger sequencing reached its technical limits
Only modestly parallel (394 lanes/machine)
Long read lengths (500-900 bp) & >99.9% correct
Need to clone the DNA to obtain enough for sequencing reaction
At SLU: cost for typical Sanger sequencing is $5-6/sample with reliable 500 bp of sequence
DNA sequencing timeline
How many sequenced genomes?
NCBI: >12,000 genomes deposited
JGI (Joint Genome Institute): 6600 complete, >20,000 draft genomes
NGS sequencing
Polony: discrete clonal amplifications of a single DNA molecule, grown in a gel matrix. The clusters can then be individually sequenced, producing short reads
Polony-based or cluster-based sequencing is the basis of most second generation sequencers
Typical NGS workflow:
1. Library construction to add adapters to sequence
2. Template CLONAL amplification (on a bead or chip)
3. Massively PARALLEL sequencing
A) Fragment DNA
B) Repair ends/Add A overhang DNA
C) Ligate adapters
D) Select ligated DNA
E) Attach DNA to flow cell
F) Bridge amplification
G) Generate clusters
H) Anneal sequencing primer
I) Extend 1st base, read & deblock
J) Repeat to extend strand
K) Generate base calls
Library prep: ~6 hours
Cluster generation: ~6 hours
Sequencing: 2-6 days
Illumina NGS
Illumina HiSeq and MiSeq
100 – 200 bp read lengths
Available locally with MoGene and Cofactor Genomics
GTAC (Wash U) has HiSeq 2000 which has 50 bp single-end reads and 100 bp paired-end reads
Why not use this for all sequencing?
Cost is ~$300-400/library and ~$1100/lane of sequencing
Generate Tb of data per run, Gb per lane
Ion Torrent – measures pH changes
Done on a semi-conductor chip
Ion Torrent workflow
Illumina vs Ion Torrent
Illumina has greater capacity but longer run times
Latest versions of both have read lengths ~200 bp
SLU has an Ion Torrent machine
Cost is ~$270/sample, including the sequencing
Can do single- or paired-end reads
Paired-end reads are 2X cost for library construction, but necessary for de novo genome assembly
Bioinformatics challenges
Each flow cell in the Illumina HiSeq 2000 can generate a billion bases of sequence
Raw read files are Tb in size
Processed read files are 700-800 Mb
Alignment files are 150-300 Mb
Assembly of millions of short (75-100 bp) reads into a vertebrate genome
Need high-performance compute (HPC) cluster for vertebrate-sized genomes
Sequencing has become a standard technique
RNA sequencing for expression
ChIP sequencing for TF site identification
DNA sequencing for variants
Identification of populations/genetic changes in highly variable viruses and bacteria
Metagenomics
Identification of unknown/non-culturable communities of bacteria/viruses/fungi
Why RNAseq over microarray?
Technical variation is less
Do not need a sequenced genome
Greater dynamic range of expression
Detect transcript isoforms
Identify novel transcripts
Identify non-coding RNAs
Data availability
Public repositories of microarray, RNAseq and other high-throughput expression data are GEO & SRA at the NCBI
GEO: Gene Expression Omnibus
http://www.ncbi.nlm.nih.gov/geo/
Tools for downloading as well as querying datasets
Array and sequence-based data available
SRA: Short Read Archive
http://www.ncbi.nlm.nih.gov/sra
Can download raw sequence data (fastq files)
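A fastq file stores each read as four lines: an identifier starting with "@", the sequence, a "+" separator, and one quality character per base. The sketch below reads that layout. It assumes unwrapped four-line records and Phred+33 quality encoding, which should be checked for any given dataset; the record shown is invented and `read_fastq` is a hypothetical helper name.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, qualities) from a four-line-per-record FASTQ file."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip()
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Assumes Phred+33 encoding: quality score = ASCII code - 33
        yield header[1:], seq, [ord(c) - 33 for c in qual]

# Invented record, used in place of a downloaded file.
example = io.StringIO("@read1\nACGATTACA\n+\nIIIIIIII5\n")
for read_id, seq, quals in read_fastq(example):
    print(read_id, seq, min(quals))  # read1 ACGATTACA 20
```

With a real download, the `io.StringIO` object would be replaced by an opened file handle.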
Today in computer lab
Tutorial on searching NCBI/GEO for large datasets
Partek Genomics Suite (PGS) tutorial