
Session 1b Applications, genome sequencing, RNA-Seq, GBS

Applications: RNA-Seq and GBS (Simon)

Making RNA-Seq libraries

Making GBS libraries

Quality filtering next-gen sequence data

Applications for RNA-Seq data: finding differentially-expressed genes.

Applications for GBS: genetic map making, Genomic Selection (basic introduction)

Sunday, April 14, 13

Introduction to gene expression studies

• Why it is so useful

• What you can learn with it.


The central dogma of genetics

[Figure: the central dogma. Image credit: 123RF.com]

Genes are expressed as mRNAs. We can sequence mRNAs.

Studying gene expression by sequencing mRNAs

• mRNA sequence data are the most effective evidence for finding and studying the genes located in a genomic sequence

• mRNA sequence data aren’t perfect, or complete, because many genes are expressed only in particular cell/tissue types, developmental stages, or environmental/stress conditions

• Capturing all full-length transcripts is technically challenging and has only been done for special model organism systems, e.g. Drosophila

• Instead, nowadays people capture material from several different tissues/stages/stresses and sequence very deeply

Date  Project                      Sequences                                           Read length (bp)  Total sequence (Mbp)
2003  Drosophila genome            250k Sanger ESTs                                    700               175
2005  Drosophila full-length cDNA  12-13k full-length cDNAs, Sanger shotgun sequenced  700               ~650
2009  Cassava ESTs                 1.7 M 454 ESTs                                      250               425
2012  Cassava RNA-Seq              400 M Illumina reads                                100               40,000
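The "Total sequence" column for the EST and RNA-Seq rows is simply the number of reads multiplied by the read length. A minimal sketch of that arithmetic follows; the function name `total_mbp` is made up for illustration, and it assumes every read has the nominal length given in the table. The 2005 full-length cDNA row is left out because each cDNA was shotgun sequenced with many reads, so the same shortcut does not apply.

```python
# Rows taken from the table above: (project, number of reads, read length in bp).
datasets = [
    ("Drosophila Sanger ESTs, 2003", 250_000, 700),
    ("Cassava 454 ESTs, 2009", 1_700_000, 250),
    ("Cassava Illumina RNA-Seq, 2012", 400_000_000, 100),
]


def total_mbp(n_reads, read_length_bp):
    """Total sequence in megabase pairs, assuming every read has the nominal length."""
    return n_reads * read_length_bp / 1_000_000


for name, n_reads, length in datasets:
    print(f"{name}: {total_mbp(n_reads, length):,.0f} Mbp")
```

The jump from hundreds of Mbp to tens of thousands is what makes "sequence very deeply" practical with Illumina reads.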


DNA and genome sequencing


Whole genome shotgun sequencing

1. Cut plant sample and purify DNA (long genomic DNA molecules)

2. Fragment DNA into several size ranges

3. Sequence one or both ends of fragments on 454 platform (sequence reads; paired-end reads come from both ends of one DNA fragment)

4. Assemble sequence reads into contigs by looking for near-perfect overlaps between reads

5. Construct scaffolds by joining contigs that overlap sequence read pairs from both ends of long DNA fragments

Genome sequence: GGATCTAGNNNNNNNNNNNNNNNNNNNGGCTATTTCC

Gaps in the sequence (usually representing repeat sequences that could not be crossed) are filled with Ns

The genome of cassava has been sequenced

• JGI CSP pilot 1x coverage from over 700,000 Sanger shotgun reads using plasmid and fosmid libraries

• Main sequencing effort: partnership with 454 Life Sciences (Roche), Steve Rounsley, Dan Rokhsar, Chinnappa Kodira, and Tim Harkins

• 454 GS FLX Titanium platform. Nearly 61 million 454 reads (single and paired-end) were generated and combined with the Sanger data from the pilot project as input for genome assembly

• $1.3 million grant by the Bill & Melinda Gates Foundation


Cassava assembly v1

• 11,243 scaffolds

• 416Mb (26% N = gaps)

• scaffold N50/L50 514/180kb

• cucumber assemblies N50/L50 ~60 / ~900kb (Therese’s presentation)

• Cassava has an estimated genome size of ~760Mb

• Remaining 300Mb estimated to be non-genic, repetitive

• 1) A large fraction of reads (both Sanger and 454) were not used by the assembly software, and were primarily repetitive in nature.

• 2) 95% of publicly-available cassava ESTs map to the assembly.
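The N50/L50 figures quoted above summarize how contiguous an assembly is. Here is a minimal sketch of how such a pair of numbers is computed, using made-up scaffold lengths. Sources differ on which of the two values is called N50 and which L50, so the function returns them under neutral names.

```python
def n50_stats(lengths):
    """Return (length, count): walking the scaffolds from longest to shortest,
    `count` is how many scaffolds are needed to cover half of the total
    assembly length, and `length` is the size of the last one added."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no scaffolds given")


# Hypothetical scaffold lengths in kb (not cassava data).
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
length_at_half, scaffolds_needed = n50_stats(toy_scaffolds)
print(length_at_half, scaffolds_needed)
```

A more contiguous assembly reaches the halfway point with fewer, longer scaffolds.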


Whole genome resequencing

[Figure: extract DNA from a genome and sequence it to obtain many short reads, then align the reads to the reference genome sequence. A second panel shows the alignment of two genomes to a reference genome.]

Expression studies

• ESTs (expressed sequence tags)

• RNA-Seq

Generating RNA-Seq data from expressed genes

RNA-seq workflow (figure from Martin, Wang, 2011)

1. Generate library

2. Ligate and size select

3. Data analysis: remove low quality reads and sequence errors

4. Assembly: correct assembly errors

5. Estimate expression

Win Hide

Overview of transcript reconstruction

Genome sequence

Stranded mRNA sequence aligned to genome

Process alignments, find splice sites and join exons to reconstruct transcripts

Using overlapping k-mers to reconstruct transcripts

Short Read Assembly Using de Bruijn Graphs

Target: ..GATCAGCAGGTCCAGACGT..

Reads: GATCAGCA, AGCAGGTCCAG, CCAGACGT

k-mers (k = 4):
GATCAGCA: GATC, ATCA, TCAG, CAGC, AGCA
AGCAGGTCCAG: AGCA, GCAG, CAGG, AGGT, GGTC, GTCC, TCCA, CCAG
CCAGACGT: CCAG, CAGA, AGAC, GACG, ACGT

De Bruijn Graph Construction (k = 4)

Sequence Reconstruction By Path Traversal: ..GATCAGCAGGTCCAGACGT..

Brian Haas
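The idea on this slide can be sketched in a few lines: break three overlapping reads into k-mers, link k-mers that follow one another (overlapping by k-1 bases), and walk the resulting path to recover the original sequence. This is a toy illustration with k = 4 and the reads GATCAGCA, AGCAGGTCCAG and CCAGACGT; the function names are made up, and real assemblers also handle errors, branches and reverse complements, which this sketch ignores.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(reads, k):
    """Link each k-mer to the k-mer that follows it in a read (k-1 overlap)."""
    graph = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        for left, right in zip(ks, ks[1:]):
            graph[left].add(right)
    return graph


def walk(graph, start):
    """Follow the path from `start` while each k-mer has exactly one successor."""
    sequence, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop rather than loop forever on a cycle
            break
        sequence += nxt[-1]  # each step adds one new base
        seen.add(nxt)
        node = nxt
    return sequence


reads = ["GATCAGCA", "AGCAGGTCCAG", "CCAGACGT"]
graph = build_graph(reads, k=4)
print(walk(graph, "GATC"))
```

Because no k-mer repeats in this toy target, the graph is a single unbranched path. Repeats, sequencing errors and alternative splice forms are what introduce branches in real data.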


Generating and analyzing EST sequences


Sequences obtained for the cassava genome

represented in an assembly of publicly available cassava EST sequences (http://cassava.igs.umaryland.edu/blast/db/EST_asmbl_and_single.fasta), 96% can be mapped to the genome assembly. It can be estimated that the remaining portion of the genome is largely repetitive and non-gene-coding. Consistent with this, the fractions of the estimated genome size (~31%) and WGS reads (~36%) that do not appear in the assembly are approximately equal, despite low read error rates. The scaffolds obtained have yet to be assigned to chromosomes, as this requires genetic markers with known sequence. However, an 88% complete genetic map comprising 23 linkage groups includes the genetic locations of 284 scaffolds (Sraphet et al. 2011).

With the gene-rich portion of the genome in hand, the next step is to identify the protein-coding genes and the exons that comprise them. This is achieved computationally by aligning sequences from mRNA fragments (ESTs) to the genome, as well as looking for regions with homology to known proteins from other plant species. The 80,459 Sanger ESTs from Genbank were augmented by a new set of 2.7 million reads from leaf and root libraries, generated by 454 Life Sciences using the FLX Titanium platform. While half

Fig. 1 a Overview of whole genome shotgun sequencing and assembly. Starting with plant material, many genomes’ worth of DNA is extracted, purified, fragmented, pooled by length and sequenced to a high level of redundancy with the aim of sequencing every region of the genome so that the chromosomal sequence can be generated (assembled) by overlapping fragments that have (near-)identical sequences. Longer range, paired-end sequence information is used to bridge sections of the genome that are not unique (repeats) and impossible to resolve by this approach. b The phytozome genome browser (http://www.phytozome.net/cassava) provides a portal for accessing, browsing, searching and downloading all available cassava sequence and annotation data and for comparative plant genomic analysis

Table 1 Cassava genomic and mRNA sequence data

Sequence type                   Source                  Technology                   Sequences generated  Notes
Genome shotgun                  Roche                   454 Titanium                 39,259,112
                                                        454 FLX Plus (experimental)  10,785,244
                                JGI                     454 Titanium                 21,581,680
                                                        Sanger                       723,958
                                U. Maryland & UC Davis  Sanger                       75,748               BAC-end seq.
Expressed sequence tags (ESTs)  Roche                   454 Titanium                 1.51 M reads (leaf)  0.30 M after removing chloroplast and rDNA sequences
                                                        454 Titanium                 1.19 M reads (root)
                                various                 Sanger                       80,459

Tropical Plant Biol. (2012) 5:88–94

ESTs in cassava

• 80,459 ESTs from Manihot esculenta from GenBank database

• Roche 454 sequenced leaf EST library was highly redundant

• 1,512,780 reads -> filtered -> 299,509

• Take 10,000 random sequences and group with blastclust (50% length, 95% identity)

• The four commonest clusters include 30% of reads

• BLASTN and/or TBLASTN a few sequences from each cluster against nr

• Remove the 30-50% of ESTs with homology to 18S, 26S rDNA

• seqclean removes short, low-complexity sequences, vector etc.

• 1,187,328 root ESTs from 454

Highly redundant leaf EST library:

gene         % EST reads
26S rDNA     48%
18S rDNA     18%
chloroplast  8.5%

Cassava genome at Phytozome

ESTs

Genes

Alternative splice forms

Similar plant proteins


RNA-Seq Analysis steps

• Filtering

• Constructing transcripts

• Looking at changes in expression


Tools for processing RNA-Seq data

• FastQC

• Trimming and quality filtering tools: FASTX-Toolkit (fastx_trimmer, fastq_quality_trimmer)

• When trimming and filtering, take care that both reads of each paired-end pair stay together: if one mate is discarded, its partner must be removed (or set aside as a single read) as well

• Trim adapters with cutadapt or SeqPrep

• Normalize reads with diginorm (Titus Brown) or in silico normalization (insilinorm, Trinity)
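The point about keeping paired-end reads in their pairs can be illustrated with a small filter that treats each pair as a unit. This is only a sketch, not how any of the tools above are implemented: it assumes Phred+33 quality encoding, a hypothetical mean-quality cutoff of 20, and reads already loaded into memory as (sequence, quality) tuples.

```python
def mean_quality(quality_string, offset=33):
    """Mean Phred score of a non-empty quality string (Phred+33 assumed)."""
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)


def filter_pairs(pairs, min_mean_q=20):
    """Keep a read pair only if BOTH mates pass, so the two output files
    never drift out of step with each other."""
    kept = []
    for (seq1, qual1), (seq2, qual2) in pairs:
        if mean_quality(qual1) >= min_mean_q and mean_quality(qual2) >= min_mean_q:
            kept.append(((seq1, qual1), (seq2, qual2)))
    return kept


pairs = [
    (("ACGT", "IIII"), ("TTGA", "IIII")),  # both mates high quality: kept
    (("ACGT", "IIII"), ("TTGA", "####")),  # second mate poor: whole pair dropped
]
print(len(filter_pairs(pairs)))
```

Filtering the forward and reverse files independently would instead leave the first file with two reads and the second with one, breaking the pairing that aligners and assemblers rely on.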


FastQC generates a detailed report on your sequence fastq file


fastQC sample output

Base distribution by position in read

High quality bases in reads


Many sequencing problems can be seen quickly

Failed library: can you see why?

There’s a clue in the output below


Digital normalization

[Figure: mRNA sequence reads aligned to the genome sequence; x = sequence error]

Each sequence error generates a possible alternative transcript form. This uses up lots of time and memory to process.

Digital normalization discards reads from regions whose depth already exceeds a cutoff (e.g. 20), leaving a cleaner signal from real transcripts
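A rough sketch of the digital normalization idea: stream through the reads, estimate each read's depth from the k-mers seen so far, and keep the read only if that estimate is still below the cutoff. The median k-mer count is used as the depth estimate here. The tiny k = 4 and the exact in-memory counter are simplifications chosen for illustration; production tools use longer k-mers and memory-efficient approximate counting.

```python
from collections import Counter
from statistics import median


def digital_normalize(reads, k=4, cutoff=20):
    """Keep a read only while the median count of its k-mers is below `cutoff`."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read shorter than k
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(set(read_kmers))  # count each k-mer once per kept read
    return kept


# 30 copies of a highly expressed transcript fragment plus one rare read.
reads = ["GATCAGCAGGTCCAGACGT"] * 30 + ["TTGACCTTAGGCATTCGA"]
kept = digital_normalize(reads)
print(len(kept))
```

The abundant read is capped at the cutoff while the rare read survives, which is how normalization reduces data volume (and the errors that come with it) without losing low-expression transcripts.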


Short read aligners

Win Hide

Bowtie2 maps reads with a single splice site

GSNAP http://research-pub.gene.com/gmap/ fast, accurate short read version of GMAP

Yes


Trinity overview: Inchworm, Chrysalis, Butterfly

• New in 2012: preprocessing with diginorm (Titus Brown) khmer, screed

• Inchworm extends contigs one base at a time through k-mers that overlap by k-1, starting with the most abundant k-mer and using each k-mer exactly once, so alternative splice forms are represented by one most complete assembly plus partial contigs that overlap it by k-1 bp.

• Chrysalis collects Inchworm contigs that overlap perfectly by k-1, with a minimum number of reads spanning the junction, builds a de Bruijn graph for each such group, and assigns reads to the appropriate parts of that graph.

• Butterfly resolves ambiguities and reconstructs alternative splice forms and paralogs. First it merges adjacent sections and prunes bad edges. Second, it scores plausible paths by dynamic programming, reconciling the actual reads and read pairs with the paths.

2 ADVANCE ONLINE PUBLICATION NATURE BIOTECHNOLOGY

complexity of overlaps between variants. Finally, Butterfly (Fig. 1c) analyzes the paths taken by reads and read pairings in the context of the corresponding de Bruijn graph and reports all plausible transcript sequences, resolving alternatively spliced isoforms and transcripts derived from paralogous genes. Below, we describe each of Trinity’s modules.

Inchworm assembles contigs greedily and efficiently

Inchworm efficiently reconstructs linear transcript contigs in six steps (Fig. 1a). Inchworm (i) constructs a k-mer dictionary from all sequence reads (in practice, k = 25); (ii) removes likely error-containing k-mers from the k-mer dictionary; (iii) selects the most frequent k-mer in the dictionary to seed a contig assembly, excluding both low-complexity

For transcriptome assembly, each path in the graph represents a possible transcript. A scoring scheme applied to the graph structure can rely on the original read sequences and mate-pair information to discard non-sensical solutions (transcripts) and compute all plausible ones.

Applying the scheme of de Bruijn graphs to de novo assembly of RNA-Seq data represents three critical challenges: (i) efficiently constructing this graph from large amounts (billions of base pairs) of raw data; (ii) defining a suitable scoring and enumeration algorithm to recover all plausible splice forms and paralogous transcripts; and (iii) providing robustness to the noise stemming from sequencing errors and other artifacts in the data. In particular, sequencing errors would introduce a large number of false nodes, resulting in a massive graph with millions of possible (albeit mostly implausible) paths.

Here, we present Trinity, a method for the efficient and robust de novo reconstruction of transcriptomes, consisting of three software modules: Inchworm, Chrysalis and Butterfly, applied sequentially to process large volumes of RNA-Seq reads. We evaluated Trinity on data from two well-annotated species—one microorganism (fission yeast) and one mammal (mouse)—as well as an insect (the whitefly Bemisia tabaci), whose genome has not yet been sequenced. In each case, Trinity recovers most of the reference (annotated) expressed transcripts as full-length sequences, and resolves alternative isoforms and duplicated genes, performing better than other available transcriptome de novo assembly tools, and similarly to methods relying on genome alignments.

RESULTS

Trinity: a method for de novo transcriptome assembly

In contrast to de novo assembly of a genome, where few large connected sequence graphs can represent connectivities among reads across entire chromosomes, in assembling transcriptome data we expect to encounter numerous individual disconnected graphs, each representing the transcriptional complexity at nonoverlapping loci. Accordingly, Trinity partitions the sequence data into these many individual graphs, and then processes each graph independently to extract full-length isoforms and tease apart transcripts derived from paralogous genes.

In the first step in Trinity, Inchworm assembles reads into the unique sequences of transcripts. Inchworm (Fig. 1a) uses a greedy k-mer–based approach for fast and efficient transcript assembly, recovering only a single (best) representative for a set of alternative variants that share k-mers (owing to alternative splicing, gene duplication or allelic variation). Next, Chrysalis (Fig. 1b) clusters related contigs that correspond to portions of alternatively spliced transcripts or otherwise unique portions of paralogous genes. Chrysalis then constructs a de Bruijn graph for each cluster of related contigs, each graph reflecting the

[Figure 1, panels a–c. (a) Read set; extend in k-mer space and break ties; linear sequences (e.g. >a121:len = 5,845, >a122:len = 2,560, >a123:len = 4,443, >a124:len = 48, >a126:len = 66). (b) Overlap linear sequences by overlaps of k – 1 to build graph components. (c) De Bruijn graph (k = 5); compacting; compact graph; finding paths; compact graph with reads; extracting sequences; transcripts ...CTTCGCAA...TGATCGGAT... and ...ATTCGCAA...TCATCGGAT...]

Figure 1 Overview of Trinity. (a) Inchworm assembles the read data set (short black lines, top) by greedily searching for paths in a k-mer graph (middle), resulting in a collection of linear contigs (color lines, bottom), with each k-mer present only once in the contigs. (b) Chrysalis pools contigs (colored lines) if they share at least one k – 1-mer and if reads span the junction between contigs, and then it builds individual de Bruijn graphs from each pool. (c) Butterfly takes each de Bruijn graph from Chrysalis (top), and trims spurious edges and compacts linear paths (middle). It then reconciles the graph with reads (dashed colored arrows, bottom) and pairs (not shown), and outputs one linear sequence for each splice form and/or paralogous transcript represented in the graph (bottom, colored sequences).


Grabherr et al.


Transcript reconstruction with a reference genome: the Tuxedo suite by Trapnell et al.

©2012 Nature America, Inc. All rights reserved.

PROTOCOL

564 | VOL.7 NO.3 | 2012 | NATURE PROTOCOLS

feel comfortable creating directories, moving files between them and editing text files in a UNIX environment. Installation of the tools may require additional expertise and permission from one’s computing system administrators.

Read alignment with TopHat

Alignment of sequencing reads to a reference genome is a core step in the analysis workflows for many high-throughput sequencing assays, including ChIP-Seq [31], RNA-seq, ribosome profiling [32] and others. Sequence alignment itself is a classic problem in computer science and appears frequently in bioinformatics. Hence, it is perhaps not surprising that many read alignment programs have been developed within the last few years. One of the most popular and to date most efficient is Bowtie [33] (http://bowtie-bio.sourceforge.net/index.shtml), which uses an extremely economical data structure called the FM index [34] to store the reference genome sequence and allows it to be searched rapidly. Bowtie uses the FM index to align reads at a rate of tens of millions per CPU hour. However, Bowtie is not suitable for all sequence alignment tasks. It does not allow alignments between a read and the genome to contain large gaps; hence, it cannot align reads that span introns. TopHat was created to address this limitation.

TopHat uses Bowtie as an alignment ‘engine’ and breaks up reads that Bowtie cannot align on its own into smaller pieces called segments. Often, these pieces, when processed independently, will align to the genome. When several of a read’s segments align to the genome far apart (e.g., between 100 bp and several hundred kilobases) from one another, TopHat infers that the read spans a splice junction and estimates where that junction’s splice sites are. By processing each ‘initially unmappable’ read, TopHat can build up an index of splice sites in the transcriptome on the fly without a priori gene or splice site annotations. This capability is crucial, because, as numerous RNA-seq studies have now shown, our catalogs of alternative splicing events remain woefully incomplete. Even in the transcriptomes of often-studied model organisms, new splicing events are discovered with each additional RNA-seq study.

Aligned reads say much about the sample being sequenced. Mismatches, insertions and deletions in the alignments can identify polymorphisms between the sequenced sample and the reference genome, or even pinpoint gene fusion events in tumor samples. Reads that align outside annotated genes are often strong evidence of new protein-coding genes and noncoding RNAs. As mentioned above, RNA-seq read alignments can reveal new alternative splicing events and isoforms. Alignments can also be used to accurately quantify gene and transcript expression, because the number of reads produced by a transcript is proportional to its abundance (Box 2). Discussion of polymorphism and fusion detection is out of the scope of this protocol, and we address transcript assembly and gene discovery only as they relate to differential expression analysis. For a further review of these topics, see Garber et al. [12].

Transcript assembly with Cufflinks

Accurately quantifying the expression level of a gene from RNA-seq reads requires accurately identifying which isoform of a given gene produced each read. This, of course, depends on knowing all of the splice variants (isoforms) of that gene. Attempting to quantify gene and transcript expression by using an incomplete or incorrect transcriptome annotation leads to inaccurate expression values [8]. Cufflinks assembles individual transcripts from RNA-seq reads that have been aligned to the genome. Because a sample may contain reads from multiple splice variants for a given gene, Cufflinks must be able to infer the splicing structure of each gene. However, genes sometimes have multiple alternative splicing events, and there may be many possible reconstructions of the gene model that explain the sequencing data. In fact, it is often not obvious how many splice variants of the gene may be present. Thus, Cufflinks reports a parsimonious transcriptome assembly of the data. The algorithm reports as few full-length transcript fragments or ‘transfrags’ as are needed to ‘explain’ all the splicing event outcomes in the input data.

[Figure 2 flow chart, drawn separately for Condition A and Condition B: Reads -> TopHat (Step 1) -> Mapped reads -> Cufflinks (Step 2) -> Assembled transcripts -> Cuffmerge (Steps 3–4) -> Final transcriptome assembly; Mapped reads + Final transcriptome assembly -> Cuffdiff (Step 5) -> Differential expression results -> CummeRbund (Steps 6–18) -> Expression plots]

Figure 2 | An overview of the Tuxedo protocol. In an experiment involving two conditions, reads are first mapped to the genome with TopHat. The reads for each biological replicate are mapped independently. These mapped reads are provided as input to Cufflinks, which produces one file of assembled transfrags for each replicate. The assembly files are merged with the reference transcriptome annotation into a unified annotation for further analysis. This merged annotation is quantified in each condition by Cuffdiff, which produces expression data in a set of tabular files. These files are indexed and visualized with CummeRbund to facilitate exploration of genes identified by Cuffdiff as differentially expressed, spliced, or transcriptionally regulated genes. FPKM, fragments per kilobase of transcript per million fragments mapped.


TopHat and Cufflinks are both operated through the UNIX shell. No graphical user interface is included. However, there are now commercial products and open-source interfaces to these and other RNA-seq analysis tools. For example, the Galaxy Project [18] uses a web interface to cloud computing resources to bring command-line–driven tools such as TopHat and Cufflinks to users without UNIX skills through the web and the computing cloud.

Alternative analysis packages

TopHat and Cufflinks provide a complete RNA-seq workflow, but there are other RNA-seq analysis packages that may be used instead of or in combination with the tools in this protocol. Many alternative read-alignment programs [19–21] now exist, and there are several alternative tools for transcriptome reconstruction [22,23], quantification [10,24,25] and differential expression [26–28] analysis. Because many of these tools operate on similarly formatted data files, they could be used instead of or in addition to the tools used here. For example, with straightforward postprocessing scripts, one could provide GSNAP [19] read alignments to Cufflinks, or use a Scripture [22] transcriptome reconstruction instead of a Cufflinks one before differential expression analysis. However, such customization is beyond the scope of this protocol, and we discourage novice RNA-seq users from making changes to the protocol outlined here.

This protocol is appropriate for RNA-seq experiments on organisms with sequenced reference genomes. Users working without a sequenced genome but who are interested in gene discovery should consider performing de novo transcriptome assembly using one of several tools such as Trinity [29], Trans-Abyss [30] or Oases (http://www.ebi.ac.uk/~zerbino/oases/). Users performing expression analysis with a de novo transcriptome assembly may wish to consider RSEM [10] or IsoEM [25]. For a survey of these tools (including TopHat and Cufflinks) readers may wish to see the study by Garber et al. [12], which describes their comparative advantages and disadvantages and the theoretical considerations that inform their design.

Overview of the protocol

Although RNA-seq experiments can serve many purposes, we describe a workflow that aims to compare the transcriptome profiles of two or more biological conditions, such as a wild-type versus mutant or control versus knockdown experiments. For simplicity, we assume that the experiment compares only two biological conditions, although the software is designed to support many more, including time-course experiments.

This protocol begins with raw RNA-seq reads and concludes with publication-ready visualization of the analysis. Figure 2 highlights the main steps of the protocol. First, reads for each condition are mapped to the reference genome with TopHat. Many RNA-seq users are also interested in gene or splice variant discovery, and the failure to look for new transcripts can bias expression estimates and reduce accuracy [8]. Thus, we include transcript assembly with Cufflinks as a step in the workflow (see Box 1 for a workflow that skips gene and transcript discovery). After running TopHat, the resulting alignment files are provided to Cufflinks to generate a transcriptome assembly for each condition. These assemblies are then merged together using the Cuffmerge utility, which is included with the Cufflinks package. This merged assembly provides a uniform basis for calculating gene and transcript expression in each condition. The reads and the merged assembly are fed to Cuffdiff, which calculates expression levels and tests the statistical significance of observed changes. Cuffdiff also performs an additional layer of differential analysis. By grouping transcripts into biologically meaningful groups (such as transcripts that share the same transcription start site (TSS)), Cuffdiff identifies genes that are differentially regulated at the transcriptional or post-transcriptional level. These results are reported as a set of text files and can be displayed in the plotting environment of your choice.

We have recently developed a powerful plotting tool called CummeRbund (http://compbio.mit.edu/cummeRbund/), which provides functions for creating commonly used expression plots such as volcano, scatter and box plots. CummeRbund also handles the details of parsing Cufflinks output file formats to connect Cufflinks and the R statistical computing environment. CummeRbund transforms Cufflinks output files into R objects suitable for analysis with a wide variety of other packages available within the R environment and can also now be accessed through the Bioconductor website (http://www.bioconductor.org/).

This protocol does not require extensive bioinformatics expertise (e.g., the ability to write complex scripts), but it does assume familiarity with the UNIX command-line interface. Users should

Bowtie: Extremely fast, general purpose short read aligner

TopHat: Aligns RNA-Seq reads to the genome using Bowtie. Discovers splice sites

Cufflinks package:

  Cufflinks: Assembles transcripts

  Cuffcompare: Compares transcript assemblies to annotation

  Cuffmerge: Merges two or more transcript assemblies

  Cuffdiff: Finds differentially expressed genes and transcripts. Detects differential splicing and promoter use

CummeRbund: Plots abundance and differential expression results from Cuffdiff

Figure 1 | Software components used in this protocol. Bowtie33 forms the algorithmic core of TopHat, which aligns millions of RNA-seq reads to the genome per CPU hour. TopHat’s read alignments are assembled by Cufflinks and its associated utility program to produce a transcriptome annotation of the genome. Cuffdiff quantifies this transcriptome across multiple conditions using the TopHat read alignments. CummeRbund helps users rapidly explore and visualize the gene expression data produced by Cuffdiff, including differentially expressed genes and transcripts.
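To make the order of the steps concrete, here is a sketch that only builds the command lines for one pass through the pipeline and prints them; nothing is executed. The directory names (`*_thout`, `*_clout`, `merged_asm`, `diff_out`), the `assemblies.txt` file and all input file names are illustrative assumptions, and the options shown (`-G`, `-o`, `-g`, `-s`, `-L`) should be checked against each tool's own help output before use.

```python
def tuxedo_commands(genome_index, genome_fasta, annotation_gtf, samples):
    """Build (but do not run) command lines for TopHat -> Cufflinks ->
    Cuffmerge -> Cuffdiff.

    samples maps a condition label to a list of (replicate_name, fastq_files).
    """
    commands = []
    bam_groups = []
    for condition, replicates in samples.items():
        bams = []
        for name, fastq_files in replicates:
            # Step 1: map each replicate's reads to the genome.
            commands.append(["tophat", "-G", annotation_gtf, "-o", f"{name}_thout",
                             genome_index, *fastq_files])
            # Step 2: assemble transcripts from the mapped reads.
            commands.append(["cufflinks", "-o", f"{name}_clout",
                             f"{name}_thout/accepted_hits.bam"])
            bams.append(f"{name}_thout/accepted_hits.bam")
        bam_groups.append(",".join(bams))
    # Steps 3-4: merge assemblies (assemblies.txt is assumed to list the per-replicate GTF files).
    commands.append(["cuffmerge", "-g", annotation_gtf, "-s", genome_fasta, "assemblies.txt"])
    # Step 5: test for differential expression between conditions.
    commands.append(["cuffdiff", "-o", "diff_out", "-L", ",".join(samples),
                     "merged_asm/merged.gtf", *bam_groups])
    return commands


example = {
    "C1": [("C1_R1", ["C1_R1_1.fq", "C1_R1_2.fq"])],
    "C2": [("C2_R1", ["C2_R1_1.fq", "C2_R1_2.fq"])],
}
for cmd in tuxedo_commands("genome", "genome.fa", "genes.gtf", example):
    print(" ".join(cmd))
```

Building the commands as lists keeps each step inspectable, and the same lists could later be handed to a job scheduler or to `subprocess.run`.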


Expression data (RNA-Seq, ESTs) can detect

• New exons, splice forms

• Differentially-expressed genes

Additional 3’ exon predicted by GeneMark and supported by EST evidence. PASA builds this and other transcripts.

Novel exon discovery

some slides from Win

Win Hide

Normalized Expression Values

• Normalized for both length of the transcript and total depth of sequencing.

• FPKM: number of RNA-Seq Fragments Per Kilobase of transcript per total Million fragments mapped

Note: RPKM (reads instead of fragments) is often used with single-end reads.

Brian Haas
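The FPKM definition translates directly into a one-line calculation. A minimal sketch with made-up numbers follows (the function name and the example counts are illustrative only):

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million fragments mapped."""
    kilobases = transcript_length_bp / 1_000
    millions_mapped = total_mapped_fragments / 1_000_000
    return fragments / kilobases / millions_mapped


# 500 fragments on a 2,000 bp transcript in a library of 10 million mapped fragments.
print(fpkm(500, 2_000, 10_000_000))
```

Dividing by transcript length stops long transcripts from looking more highly expressed simply because they yield more fragments, and dividing by library size makes samples sequenced to different depths comparable.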

Identifying Differentially Expressed Transcripts

• Statistical tests performed on fragment counts (not FPKM values).

• Given observed read counts for a transcript in each of two samples, what’s the probability they were derived from the same distribution (null hypothesis)? (ex. Fisher’s exact test) If (P <= 0.05), significantly different

• Don’t forget to adjust P-values due to false discovery rate (FDR) resulting from running many (thousands of) statistical tests. (ex. use Q-values)

Brian Haas
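Both ideas on this slide, an exact test on counts and a correction for running many tests, can be sketched with the standard library alone. The first function is a two-sided Fisher's exact test on a 2x2 table. The second is the Benjamini-Hochberg adjustment, used here as one common way of obtaining FDR-adjusted values; the slide does not say which adjustment method is meant. The example tables use small, made-up counts.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]],
    e.g. a/c = fragments on the transcript, b/d = all other fragments."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denominator = comb(n, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(p for p in map(prob, range(low, high + 1)) if p <= p_observed * (1 + 1e-9))
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """FDR-adjusted p-values, returned in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


print(fisher_exact_two_sided(3, 1, 1, 3))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

With thousands of transcripts, a raw cutoff of 0.05 would flag many transcripts by chance alone, which is why the adjusted values are the ones to threshold.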

Increased Power for Identifying Differentially Expressed Transcripts with Deeper Sequencing

[Figure: Log2(fold change) plotted against Average log2(counts), with one region annotated "2-fold is stat significant here" and another annotated "2-fold is NOT stat significant here"]

Brian Haas
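The two axes of such a plot are easy to compute from a pair of counts. In the sketch below the pseudocount of 1 (added to avoid taking the log of zero) is an assumption of this example, not something stated on the slide.

```python
from math import log2


def ma_values(count_a, count_b, pseudocount=1):
    """Return (log2 fold change of B over A, average log2 count)."""
    log_a = log2(count_a + pseudocount)
    log_b = log2(count_b + pseudocount)
    return log_b - log_a, (log_a + log_b) / 2


print(ma_values(3, 7))        # 2-fold change supported by very few reads
print(ma_values(1023, 2047))  # the same 2-fold change supported by many reads
```

Both transcripts show the same fold change, but the second sits far to the right on the average-count axis, where counting noise is relatively small. This is why deeper sequencing increases the power to call a given fold change significant.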

Comparing Multiple Samples

Heatmaps provide an effective tool for navigating differential expression across multiple samples. Clustering can be performed across both axes:

- cluster transcripts with similar expression patterns.

- cluster samples according to similar expression values among transcripts.

Brian Haas

Similar expression profiles are grouped and highlighted with different colours

Examining Patterns of Expression Across Samples

Can extract clusters of transcripts and examine them separately.

Brian Haas


Genotyping by sequencing

• Reduced-representation sequencing

• Genotyping by sequencing and variant analysis

• Genetic mapping


Making a GBS library: overview

ApeKI: GCWGC / CGWCG (W = A or T)

1. Digest genomic DNA

2. Ligate barcoded adapters to DNA

3. Pool adapter-ligated DNA

4. Size select: 400–800 bp

5. PCR: 5–10 cycles

6. Sequence: 100 bp paired-end reads on HiSeq 2000

[Figure: barcoded adapter with sequence labels 5’ ACACTCTTTCCCTACACGA, 3’ GAGCC, GTAAGGAC, GACTTG, CGCTCTTCCGATCT-barcode 3’ and GCGAGAAGGCTAGA-barcode-GWC 5’. After PCR the fragments read T-barcode-CWG ... CWG-barcode-A on one strand and A-barcode-GWC ... GWC-barcode-T on the other. A gel image carries a size ladder with marks at 10,000 bp, 1,000 bp and 100 bp.]

Jessica Lyons

Genotyping by sequencing (GBS) for cassava

A reduced-representation approach.

• Only a subset of SNPs are sampled from each individual: need fewer reads per individual, allowing for multiplexing.

• Restriction digest ensures that the same sites are sampled from each individual.

Perform digests of up to [??] samples in parallel on a plate. The same amount of DNA is used from each sample.

• GWC overhang is complementary to the overhang left by ApeKI digestion.

• Barcodes (arrow) are 4-8 bp long, from Elshire et al.

• Each digested sample receives an adapter with a different barcode.

• Unlike Elshire et al., we use Y-shaped adapters, which ensure that each strand of DNA has a different sequence on each end.

2.  Ligate barcoded adapters to DNA

[Figure: barcoded adapter with sequence labels 5’ ACACTCTTTCCCTACACGA, 3’ GAGCC, GTAAGGAC, GACTTG, CGCTCTTCCGATCT-barcode 3’ and GCGAGAAGGCTAGA-barcode-GWC 5’]

Jessica Lyons

Genotyping by sequencing (GBS) for cassava

Once the DNAs have unique adapters ligated to them, they are pooled into one tube.

The same volume is taken from each ligation.

Pooled, cleaned, and concentrated adapter-ligated DNA is run on an agarose gel.

A size range of 400–800 bp is well-removed from any adapter-mers (asterisk), and selects for fragments that will cluster well during sequencing.

A narrower size selection results in a more reduced representation (fewer sites sampled, though fewer reads needed per sample for adequate depth).

The Buckler technique does not perform this kind of size selection.

3.  Pool adapter-ligated DNA

4.  Size select: 400–800 bp

[Figure: agarose gel with size ladder marks at 10,000 bp, 1,000 bp and 100 bp]

Jessica Lyons

Genotyping by sequencing (GBS) for cassava

PCR primers are complementary to adapter sequences and thus enrich for properly-ligated DNA fragments.

After the PCR each double-stranded piece of DNA has a different sequence on each end.

The orange and green sequences, added during the PCR, facilitate binding to the flow cell.

Sequence reads should begin with the barcode followed by the cutsite.

Here is a typical Bioanalyzer trace for a cassava GBS library. ([?] PCR cycles, High Sensitivity Bioanalyzer)

5.  PCR: 5–10 cycles

[Figure: PCR product reading T-barcode-CWG ... CWG-barcode-A on one strand and A-barcode-GWC ... GWC-barcode-T on the other]

6.  Sequence: 100 bp paired-end reads on HiSeq 2000

Jessica Lyons
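Because each read should begin with a sample barcode followed by the cut-site remnant, reads can be assigned to samples by checking their first bases. The sketch below is a simplified illustration: the barcodes and sample names are hypothetical, only exact barcode matches are accepted, and the remnant is taken to be CWG (W = A or T) as drawn in the adapter diagram.

```python
import re

# CWG overhang left after ApeKI digestion, W = A or T.
CUT_SITE_REMNANT = re.compile(r"C[AT]G")


def demultiplex(reads, barcodes):
    """Assign reads to samples; barcodes maps barcode sequence -> sample name.

    A read is kept only if it starts with a known barcode that is immediately
    followed by the cut-site remnant. The barcode is trimmed off the stored read.
    """
    by_sample = {}
    # Try longer barcodes first so a short barcode cannot shadow a longer one.
    ordered = sorted(barcodes, key=len, reverse=True)
    for read in reads:
        for barcode in ordered:
            rest = read[len(barcode):]
            if read.startswith(barcode) and CUT_SITE_REMNANT.match(rest):
                by_sample.setdefault(barcodes[barcode], []).append(rest)
                break
    return by_sample


barcodes = {"ACGT": "sample_1", "TTGACA": "sample_2"}  # hypothetical barcodes
reads = [
    "ACGTCAGCTTTT",    # barcode ACGT + CAG remnant: sample_1
    "TTGACACTGCAAAA",  # barcode TTGACA + CTG remnant: sample_2
    "ACGTGGGCTTTT",    # barcode but no cut site: discarded
    "GGGGCAGCTTTT",    # no known barcode: discarded
]
print(demultiplex(reads, barcodes))
```

Requiring the cut-site remnant after the barcode is a cheap sanity check that the read really comes from a properly ligated fragment.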


Reduced representation sequencing of variants

Diploid organisms have two sets of chromosomes, each carrying different mutations

[Figure: two chromosome copies with mutations marked x]

Reduced representation or GBS sequences some portions of the genome very deeply, from both copies

Reads from a heterozygous genomic position will contain the two alleles in an approximately 1:1 ratio

cgatcgactagctatcgactcgatcgactagctatcgactcgatcgactagctatcgactcgatcgactagctatcgactcgatcgactagctatcgactcgatcgactagctatcgactcgatcggctagctatcgactcgatcggctagctatcgactcgatcggctagctatcgactcgatcggctagctatcgactcgatcggctagctatcgactcgatcggctagctatcgactcgatcggctagctatcgact

Reads / Sequences
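The 1:1 expectation suggests a very simple way to call a heterozygous site: count the bases observed at one position and see whether two alleles are each well represented. The sketch below uses the two read types shown above (differing at one base); the read copy numbers and the 20% minimum allele fraction are illustrative assumptions, and real variant callers model base quality and sequencing error far more carefully.

```python
from collections import Counter


def allele_counts(reads, position):
    """Count the bases seen at a 0-based position across all reads covering it."""
    return Counter(read[position] for read in reads if len(read) > position)


def call_alleles(counts, min_fraction=0.2):
    """Alleles supported by at least min_fraction of the reads, sorted."""
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(base for base, n in counts.items() if n / total >= min_fraction)


reads = ["cgatcgactagctatcgact"] * 6 + ["cgatcggctagctatcgact"] * 7
counts = allele_counts(reads, position=6)
print(counts, call_alleles(counts))
```

Two alleles at roughly equal depth indicate a heterozygote. A single sequencing error among many reads would fall below the fraction cutoff and be ignored.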


Visualization software like Gigabayes or tview

(iii) SNP and short INDEL calling

Gabor Marth


Building genetic maps from GBS data

Distribution of SNPs across genome in OWB Barley (Poland et al. 2012)


Building genetic maps from GBS data with JoinMap

JoinMap encoding of parental genotypes

Segregation type   Example SNPs
lm x ll            A/G x A/A
nn x np            C/C x C/A

Expected Genotype Frequencies in progeny

lm x ll:   ll: 0.5   lm: 0.5   mm: 0.0
nn x np:   nn: 0.5   np: 0.5   pp: 0.0

Data table encoding characters or markers for making the map

chromosome  position  SNP type
chr1        354       lm
chr1        4983      lm
chr1        4985      lm
chr1        10876     ll
chr2        765       ll
chr2        1034526   lm
chr2        1034700   ll
chr3        45673     lm
etc
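The expected progeny frequencies follow from combining one allele from each parent with equal probability. A small sketch that assigns the two segregation types shown on the slide and derives the expected frequencies for the example SNPs is given below; crosses of any other kind (for example both parents heterozygous) are deliberately left unclassified here.

```python
from collections import Counter
from itertools import product


def segregation_type(parent1, parent2):
    """Classify a cross such as 'A/G' x 'A/A' using the two codes on the slide."""
    het1 = len(set(parent1.split("/"))) == 2
    het2 = len(set(parent2.split("/"))) == 2
    if het1 and not het2:
        return "lm x ll"  # first parent heterozygous
    if het2 and not het1:
        return "nn x np"  # second parent heterozygous
    return None  # not covered in this sketch


def expected_progeny(parent1, parent2):
    """Expected genotype frequencies: each parent passes on either allele with probability 0.5."""
    freqs = Counter()
    for allele1, allele2 in product(parent1.split("/"), parent2.split("/")):
        freqs["/".join(sorted((allele1, allele2)))] += 0.25
    return dict(freqs)


print(segregation_type("A/G", "A/A"), expected_progeny("A/G", "A/A"))
print(segregation_type("C/C", "C/A"), expected_progeny("C/C", "C/A"))
```

In both example crosses only two genotype classes appear, each at 0.5, and the second homozygote (mm or pp) is absent, matching the table above.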


Genomic selection from GBS data

Phenotypic evaluation of cassava is a lengthy process

Year 1: Make crosses: 10000’s of seedlings

Year 2: 500 to 3000 clones

Year 3: 50 to 100 clones

Year 4: 20 to 25 clones

Year 5: 5 to 10 best

Year 6: 5 best

Martha Hamblin

Evaluation at the Seedling Stage by Genomic Selection

Meuwissen et al. 2001:

Use genome-wide markers to capture effects at many (20-30k) loci

Develop a statistical model to predict breeding value

“Selection on genetic values predicted from markers could substantially increase the rate of genetic gain in animals and plants.”
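Once marker effects have been estimated from a training population with both genotypes and phenotypes, predicting a seedling's breeding value is just a weighted sum over its markers. The sketch below shows only that prediction step. The marker effects and genotypes are invented numbers, genotypes are assumed to be coded as 0, 1 or 2 copies of one allele, and estimating the effects (the statistical model itself) is not shown.

```python
def predicted_breeding_value(genotype, marker_effects):
    """Sum of (allele copies x estimated marker effect) over all markers."""
    return sum(copies * effect for copies, effect in zip(genotype, marker_effects))


def rank_seedlings(genotypes, marker_effects):
    """Seedling names ordered from highest to lowest predicted breeding value."""
    return sorted(genotypes,
                  key=lambda name: predicted_breeding_value(genotypes[name], marker_effects),
                  reverse=True)


marker_effects = [0.5, -0.2, 0.1]  # hypothetical effects for three markers
seedlings = {
    "seedling_1": [2, 0, 1],
    "seedling_2": [0, 2, 2],
    "seedling_3": [1, 1, 0],
}
print(rank_seedlings(seedlings, marker_effects))
```

Ranking seedlings this way in year 1 is what would allow selection before the multi-year clonal evaluation, at least to the extent that the predictions are accurate.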
