International Tomato Finishing Workshop Wellcome Trust Sanger Institute April 2007 Wellcome Trust Medical Photographic Library


International Tomato Finishing Workshop

Wellcome Trust Sanger Institute, April 2007

Wellcome Trust Medical Photographic Library

Overview

WTSI Finishing Pipeline

WTSI Finishing Strategy

Tomato Genome Finishing Standards

Document on SGN

Contiguous Finished Sequence to HTGS Phase 3

All bases above phred 30

Base Error rate <1:10,000

Discussion on Day 2
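For reference, the two numeric standards above can be related through the standard phred definition, Q = -10 log10(P), where P is the estimated probability that a base call is wrong. The sketch below only restates that definition; the function names are illustrative and are not part of any finishing tool.

```python
import math

def phred_to_error_probability(q: float) -> float:
    """Estimated probability that a base with phred quality q is wrong."""
    return 10 ** (-q / 10)

def error_probability_to_phred(p: float) -> float:
    """Phred quality corresponding to an error probability p."""
    return -10 * math.log10(p)

# phred 30 corresponds to an estimated 1 error in 1,000 bases
print(phred_to_error_probability(30))
# an error rate of 1 in 10,000 corresponds to phred 40
print(error_probability_to_phred(1 / 10_000))
```

On this definition, phred 30 is a per-base floor, while 1 in 10,000 is equivalent to an average quality of phred 40.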

WTSI Finishing Pipeline

Shotgun Sequencing

Auto-Prefinishing

Manual Finishing

QC

Final EMBL Submission

HTGS 3

Established Clone Pipeline

Finishing Software


Main software tools used in the WTSI finishing process:

Sequence data viewed in Gap4 databases (Staden) (assemblies created using phrap)

Read pair viewer – Orchid (Flowers)

Restriction Digest Viewer - Confirm (Attwood - WTSI)

Sequence plot viewer – Dotter (Sonnhammer)

(URLs on Handout)

Used throughout the finishing process and for final confirmation of assembly (and QC)

WTSI Finishing Strategy

BAC Confirmation

Identify Region to be Finished

Contig Order and Orientation

Assessment of Gap Sizes and Type

Improvement of Low Quality Sequence

Confirmation of Contiguous Assembly

Selection of Finishing Reactions

Finishing Strategy – Getting Started

BAC confirmation

Checking for overlapping BACs

Identifying Region to be Finished

Confirming BAC Clone Ends

Prevents overlaps being finished twice

Confirmation of clone placement on map

Resources available for BAC confirmation

Identifying region to be finished in your BAC: whether overlapping BACs are available

Overlapping BAC available
• Confirm ends of your BAC and overlapping BAC (BES or sequence if available)
• Confirm status of overlap (shotgun, in finishing, finished)
• Confirm region to be finished (who is finishing overlap)
• Finish 2Kb into overlap if already finished or being finished by someone else
• Finish whole BAC insert if overlap is not being finished by someone else

No overlaps available
• Confirm BAC Ends (BES)

Resources available for BAC confirmation

SOL Genomics Network (SGN)

• BES

• Marker verification

• BLAST – repeats, unigenes, ESTs, markers, overlaps

SGN Resources for BAC Confirmation: BES Search

Aligning BES to BAC sequence data

BACs are clipped to the cloning vector cut site, dependent on the library used:

SL_MboI Library = GATC
LE_HBa Library = AAGCTT
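As a quick sanity check when aligning BES to BAC sequence, one can test whether a clipped sequence begins and ends at the expected cut site for its library. This is a minimal sketch: the function name is hypothetical, and it assumes the recognition site itself is retained at each clipped end, which the slide does not state. Both sites are palindromic, so the same string is expected on either strand.

```python
# Cut sites per library, as listed on the slide.
CUT_SITES = {
    "SL_MboI": "GATC",
    "LE_HBa": "AAGCTT",
}

def ends_match_cut_site(sequence: str, library: str) -> tuple[bool, bool]:
    """Return (left_ok, right_ok): does each end of the sequence carry the site?

    Assumption: clipping leaves the recognition site in place at both ends.
    """
    site = CUT_SITES[library]
    seq = sequence.upper()
    return seq.startswith(site), seq.endswith(site)

print(ends_match_cut_site("AAGCTTacgtacgtAAGCTT", "LE_HBa"))
```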

Sequence Resources: Searching for Finished Overlaps

Match to self (bTH198L24)

Match to left overlap (bTH119A16)

Match to right overlap (bTH27G19)

BLAST

Aligning finished overlapping sequence

Finishing 2Kb into existing finished overlapping BACs

Overlapping sequence ends at 48435. Finished region will begin at 46435 to give a 2Kb overlap.
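The coordinate arithmetic on this slide can be written out as a small helper. `finished_region_start` is a hypothetical name, the 2,000 bp default simply restates the 2Kb rule, and positions are assumed to be 1-based coordinates in your own BAC's consensus.

```python
def finished_region_start(overlap_end: int, overlap_bp: int = 2000) -> int:
    """Position at which finishing should begin so that `overlap_bp` bases
    of the already finished neighbouring BAC are re-covered.

    overlap_end: last position in this BAC matched by the finished overlap.
    """
    return max(1, overlap_end - overlap_bp)

# Slide example: overlapping sequence ends at 48435
print(finished_region_start(48435))  # 46435
```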

Making use of available overlaps

Finished consensus spans 2 gaps reducing number of contigs to finish

Sequence Verification: Searching for Expected Markers

Marker size would be 1284bp

Matches predicted product size for S. lycopersicum on SGN

Finishing Strategy

BAC Confirmation

Identify Region to be Finished

Contig Order and Orientation

Use read pair information to order and orientate contigs

Plasmid inserts typically 4-6Kb

Both strands sequenced to give read pairs

Contig Order and Orientation

[Figure: double-stranded sequencing vector (pUC) carrying inserted sequence (BAC); forward strand and reverse strand reads; assembled sequence contigs between the left clone end and the right clone end; a sequence gap with a good read pair link across it.]

Forward and reverse read pairs 4-6Kb apart in assembly

Look at read pair information across gaps to order contigs
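The read pair rule described above (partners face each other and lie roughly 4-6Kb apart) can be sketched as a simple check. This is an illustration only, not how Orchid or Gap4 evaluate pairs. The `Read` record, the function name, and the strict 4,000-6,000 bp window are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Read:
    template: str  # subclone (plasmid) name shared by both reads of a pair
    start: int     # leftmost consensus position covered by the read
    end: int       # rightmost consensus position covered by the read
    strand: str    # "+" reads left to right, "-" reads right to left

def classify_pair(a: Read, b: Read,
                  min_insert: int = 4000, max_insert: int = 6000) -> str:
    """Classify a forward/reverse pair placed in the same contig."""
    left, right = sorted((a, b), key=lambda r: r.start)
    # A consistent pair points inwards: left read on "+", right read on "-".
    if left.strand != "+" or right.strand != "-":
        return "wrong orientation"
    span = right.end - left.start + 1  # implied insert size in the assembly
    if span < min_insert:
        return "too close"
    if span > max_insert:
        return "too far apart"
    return "consistent"

print(classify_pair(Read("t1", 100, 800, "+"), Read("t1", 4600, 5300, "-")))
```

Pairs reported as "too close", "too far apart" or "wrong orientation" are the ones worth inspecting for collapsed repeats, false joins or misordered contigs.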

Shotgun Sequencing

Contig Order and Orientation

[Figure: assembled sequence contigs and a sequence gap, shown between the left clone end and the right clone end.]

Using Read Pair information to find Assembly Problems

Orchid: Read Pair Visualisation Tool

Contiguous sequence with good read pair coverage

Finishing Strategy

BAC Confirmation

Identify Region to be Finished

Contig Order and Orientation

Assessment of Gap Sizes and Type

Use all available information

Type of gap and size determines finishing approach

Assess Sequence

Restriction Digest Data

Restriction Digest Data

• Used in confirmation of finished contiguous assembly

• Also used throughout finishing process

• Sizing of gaps within BACs
  – use appropriate finishing strategy

• Identifying assembly problems
  – caused by repeats

• Sizing of repeats
  – confirming size of assembly of tandem repeats
  – sizing force joins made in repeats for tagging purposes

Restriction Digests

• Minimum of three restriction enzymes used to confirm the assembly

• Selection depends on organism and the nature of the sequence

• S. lycopersicum BACs are digested with
  – BamHI
  – EcoRI
  – HindIII

• Comparison of real and virtual digest of entire BAC sequence
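The comparison of real and virtual digests can be illustrated with a short sketch. It uses the standard recognition sites for the three enzymes (BamHI G^GATCC, EcoRI G^AATTC, HindIII A^AGCTT). It is not the method used by Confirm or Gap4. The sequence is treated as linear with no vector, and the 5% sizing tolerance and function names are assumptions made for the example.

```python
# Recognition site and cut offset (bases from the start of the site).
ENZYMES = {
    "BamHI": ("GGATCC", 1),
    "EcoRI": ("GAATTC", 1),
    "HindIII": ("AAGCTT", 1),
}

def virtual_digest(sequence: str, enzyme: str) -> list[int]:
    """Fragment sizes (largest first) from cutting a linear sequence."""
    site, offset = ENZYMES[enzyme]
    seq = sequence.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return sorted((b - a for a, b in zip(bounds, bounds[1:])), reverse=True)

def compare_digests(virtual: list[int], observed: list[int],
                    tolerance: float = 0.05) -> tuple[list[int], list[int]]:
    """Greedy match of virtual to gel fragments.

    Returns (virtual fragments with no gel band, gel bands with no virtual fragment).
    """
    remaining = list(observed)
    missing = []
    for size in virtual:
        match = next((o for o in remaining if abs(o - size) <= tolerance * size), None)
        if match is None:
            missing.append(size)
        else:
            remaining.remove(match)
    return missing, remaining

demo = "A" * 10 + "GAATTC" + "C" * 20 + "GAATTC" + "T" * 5
print(virtual_digest(demo, "EcoRI"))
```

Gel bands with no virtual counterpart suggest sequence missing from the assembly. Virtual fragments with no band suggest a mis-assembly or an incorrect join.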

Confirm: WTSI in-house digest visualisation tool

Gap4 - Restriction Digest Viewer

Compare fragment lengths from virtual digest in Gap4 to actual fragment sizes on the gel produced in the lab

Using Restriction Digest Data to Check for Assembly Problems

• Identifying assembly problems from digests
  – mis-assemblies caused by repeats
  – direct repeats
  – inverted repeats

• All digests showing similar amount of missing data or extra data at a particular position
  – Possible repeat with incorrect copy number represented

• Certain digests show too much data, others have missing cut sites or data missing
  – Possible inverted repeat in wrong orientation
  – Possible E. coli transposon insertion

Assessment of Sequence - Dotter

Sequence plot of BAC used throughout finishing process
  – Check for repeat sequences at gaps
  – Highlight any potential areas of mis-assembly

Also used to confirm sequence overlaps
  – Confirm unique sequence
  – Not false repeat matches

Used as final assembly check
  – Repeats
  – Cross reference sizes with restriction digests

Sequence Plot – Overlap Confirmation

Sequence Plot – Assembly Check Repeat Examples

Direct Repeat

Inverted Repeat

Sequence Plot – Assembly Check: Repeat Example
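To make the direct and inverted repeat patterns concrete, here is a minimal exact-word sketch of a self-comparison. It is not Dotter's algorithm, which scores a sliding window; it only lists the coordinates where a k-mer recurs on the same strand (direct) or as its reverse complement (inverted). The function names and word size are illustrative.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def kmer_dots(sequence: str, k: int = 12):
    """Return (direct, inverted) lists of (i, j) start positions with i < j."""
    seq = sequence.upper()
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    direct, inverted = [], []
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        # Same word further along the sequence: a dot on a direct diagonal.
        direct.extend((i, j) for j in index[word] if j > i)
        # Reverse complement further along: a dot on an inverted diagonal.
        inverted.extend((i, j) for j in index.get(revcomp(word), []) if j > i)
    return direct, inverted

unit = "AAGAGGGA"
print(kmer_dots(unit + "CCCCC" + unit, k=8))           # direct repeat
print(kmer_dots(unit + "CCCCC" + revcomp(unit), k=8))  # inverted repeat
```

Runs of consecutive direct dots form a line parallel to the main diagonal, and runs of inverted dots form a line perpendicular to it.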

WTSI Finishing Strategy

BAC Confirmation

Identify Region to be Finished

Contig Order and Orientation

Assessment of Gap Sizes and Type

Improvement of Low Quality Sequence

Selection of Finishing Reactions

Options for Gap Closure and Improving Sequence Quality

Primer walking on subclones across region or gap

Resequencing of subclones across region if appropriate read length, using alternative chemistries if possible

Sequence any unpaired reads which may fall in low quality region or in gap

Direct clone walks

Depending on length of region or gap and associated sequence (repeat, structural problems)

Manual Editing

PCR SIL or TIL

Comment Tag for EMBL submission

Gap Closure in BACs – Gap Types

Spanned Gap
• Re-sequencing (read pairs)
• Oligo walks
• Small Insert Libraries, Transposon Libraries

Un-spanned Gap
• Direct clone walks
• PCR
• Restriction Fragment Library

Repeats
• Alternative Library Sizes

Primer Walking into Spanned Gaps

[Figure: assembled sequence contigs between the left clone end and the right clone end; a sequence gap with good read pair links across it; Primer 1 and Primer 2; original shotgun templates; primer extended template; gap closed.]

Forward and reverse read pairs 4-6Kb apart in assembly

Look at read pair information across gaps to order contigs

Primer Walking into Spanned Gaps

[Figure: Primer 1, Primer 2, Primer 3 and Primer 4 on assembled sequence contigs.]

Small Insert Library (SIL)

[Figure: spanning shotgun template (4-6Kb insert); SIL templates (average 300-500bp insert); assembled sequence contigs.]

Spanning subclone is shattered into smaller fragments to create a SIL. Smaller insert sizes can break up structural problems.

Transposon Insertion Library (TIL)

[Figure: double-stranded sequencing vector (pUC) carrying inserted sequence (BAC).]

• Normal sequencing from either end of insert: read pairs ~4-6Kb apart
• Transposon randomly inserts across entire plasmid
• Sequence outwards from transposon insertion site
• TIL read pairs overlap by 9bp duplication site

Unspanned Gaps and gaps unresolved by walking on spanning subclones

Resequence any unpaired reads that face into the gap. The partner may fall in the gap, reducing gap size, or may fall within another contig and span the gap.

Assembled Sequence Contigs

Primer sequence needs to be unique

Unspanned Gaps and gaps unresolved by walking on spanning subclones

No unpaired reads. Design oligo primers from each contig end to read into the gap. Use for walking directly on BAC (clone/stock) DNA and PCR. Try to find unique sequence within the BAC for oligo selection.

Primer 1 Primer 2

Sequence search facility in Gap4

Assembled Sequence Contigs
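The requirement that a primer be unique within the BAC can be checked with an exact-match search on both strands, similar in spirit to a sequence search in Gap4. This is a minimal sketch under stated assumptions: the function names are hypothetical, only perfect matches are counted, and near matches that could still prime are ignored.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def count_sites(primer: str, sequence: str) -> int:
    """Count exact primer binding sites on both strands of the sequence."""
    seq, fwd = sequence.upper(), primer.upper()
    rev = revcomp(fwd)

    def occurrences(word: str) -> int:
        n, pos = 0, seq.find(word)
        while pos != -1:  # step by one so overlapping sites are counted
            n += 1
            pos = seq.find(word, pos + 1)
        return n

    # A palindromic primer matches both strands at the same site; count it once.
    return occurrences(fwd) if rev == fwd else occurrences(fwd) + occurrences(rev)

def is_unique(primer: str, sequence: str) -> bool:
    return count_sites(primer, sequence) == 1

bac = "AAGAGGGACCCCCAAGAGGGATCGATTACG"
print(is_unique("GATTACG", bac), is_unique("AAGAGGGA", bac))
```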

Direct Clone Walks

Depending on gap size (from restriction digest data), the direct clone walks may close the gap. Alternatively, they may extend into the gap, allowing further primers to be designed on the newly recovered sequence.

Primer 1 Primer 2

Primer 3 Primer 4

Assembled Sequence Contigs

PCR

The same principle applies to PCR. Design unique primers from each contig end to obtain a product that can be sequenced and extended with further primer walking. If confirmed to span the gap, a PCR product may be shattered into a SIL but may skip out repetitive sequence.

Primer 1 Primer 2

Assembled Sequence Contigs

SIL from Restriction Fragment

Alternatively a restriction fragment known to contain the missing data can be isolated from the digest gel and be made into a SIL. The fragment of interest must be distinct from other fragments on the gel and be a suitable size.

Assembled Sequence Contigs

Cut site 1

Cut site 2

Shatter this fragment of digested BAC DNA

Sequence gap known to be within this fragment

Gaps and Assembly Problems Caused by Repeats

Varying complexity of repeats depending on:
• Repeat unit
• Size of unit
• Copy number
• Direct or inverted copies
• How conserved?

Importance of visualising repeat sequence to assess repeat type

Lower complexity, e.g. di-nucleotide runs

Higher complexity, e.g. LTRs

Alter phrap parameters for more stringent assembly

Alternative library sizes if necessary

Discussion point for Tuesday

Improving Sequence Quality - Summary

Primer walking on subclones across region

Resequencing of subclones across region if appropriate read length, using alternative chemistries if possible

Sequence any unpaired reads which may fall in region

Direct clone walks

Depending on length of poor quality region and associated sequence (repeat, structural problems)

Manual Editing

PCR SIL or TIL

Comment Tag for EMBL Submission

WTSI Finishing Strategy

BAC Confirmation

Identify Region to be Finished

Contig Order and Orientation

Assessment of Gap Sizes and Type

Improvement of Low Quality Sequence

Confirmation of Contiguous Assembly

Selection of Finishing Reactions

Confirmation of contiguous sequence

Contiguous Sequence Generated

No Quality Issues Remain

All Assembly checks completed

Identify any regions to be tagged

Restriction Digests

Read pair coverage

Dotplot

QC check

Final Submission of Finished Sequence to EMBL as HTGS Phase 3

Tomato Genome Finishing Strategy

General overview of WTSI finishing approach

Real time finishing examples on Tuesday afternoon if anyone wants to look at something specific

Main Discussion point for Tuesday:

Updated Finishing Standards Document on SGN

http://www.sgn.cornell.edu/solanaceae-project/sol-bioinformatics/

Printed copies available

Acknowledgements

Wellcome Trust Sanger Institute:
Jane Rogers
Mapping Core Group: Sean Humphray, Christine Nicholson
Subcloning: Matt Jones and Team 53
Shotgun Sequencing: Karen Oliver and Team 41, Sarah Sims
Auto-Prefinishing, Finishing and QC: Karen McLaren and Team 46, Stuart McLaren and Team 58, Christine Lloyd and Team 57

Cornell University: Lukas Mueller, Jim Giovannoni, Steve Tanksley, Yimin Xu

MIPS/IBI Institute for Bioinformatics: Klaus Mayer, Remy Bruggmann

Imperial College London: Gerard Bishop, Daniel Buchan, James Abbott, Sarah Butcher

University of Nottingham: Graham Seymour

Scottish Crop Research Institute: Glenn Bryan

FUNDING