
Page 1:

BIOS816/VBMS818

Lecture 6 – Sequence Assembly

Guoqing Lu

Office: E115 Beadle Center

Tel: (402) 472-4982

Email: glu3@unl.edu

Website: http://biocore.unl.edu

Page 2:

A Whole Genome Shotgun Sequencing Project

Source: Nature, August 2000, p. 801.

Page 3:

Introduction to Sequence Assembly

• Sequence assembly

– also known as fragment assembly

– assembling DNA fragments (both text sequences and chromatograms) from automated sequencers into longer contiguous sequences, or “contigs”
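As a toy illustration of what "assembling fragments into a contig" means, the sketch below joins two reads when the end of one exactly matches the start of the other. The function names, the example reads, and the `min_len` parameter are hypothetical; real assemblers tolerate mismatches and use quality values, which this sketch does not.

```python
from typing import Optional


def longest_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_pair(left: str, right: str, min_len: int = 3) -> Optional[str]:
    """Join two fragments into one contig if they overlap exactly, else None."""
    size = longest_overlap(left, right, min_len)
    if size == 0:
        return None
    # Keep `left` whole and append only the part of `right` beyond the overlap.
    return left + right[size:]


# Hypothetical reads: the last 6 bases of fragment_a equal the first 6 of fragment_b.
fragment_a = "ATGGCGTACGTTAGC"
fragment_b = "GTTAGCCATGA"
contig = merge_pair(fragment_a, fragment_b)
print(contig)
```

The merged string contains the shared bases only once, which is why a contig is shorter than the sum of its fragments.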

Page 4:

Introduction to Sequence Assembly

• Raw sequence data come from the sequencer in the form of graphical trace files

• Traces are viewed and converted into textual sequence files

• Fragments are aligned to create assemblies

• Note that

– Not all bases can be read correctly

– Not all bases are equally reliable

– Current sequencing methods allow reading of ~1000 bases per gel

– Reads may contain vector contamination

Page 5:

Available Sequence Assembly Systems

• GCG Fragment Assembly Package

• VNTI ContigExpress

• Staden GAP4

• Phred/phrap/consed

• TIGR

• Web Contig Assembly Program

• …

Page 6:

GCG Fragment Assembly Package

• Only works with text-based sequence files

• Does not work directly with automated sequencer trace files

• Can generate sequence files from trace files using FromTrace

Page 7:

A GCG Fragment Assembly Project

• GelStart – Initializes a new project

• GelEnter – Incorporates individual sequence files into the project

• GelMerge – Automatic identification of overlaps and arrangement of ordered contigs

• GelAssemble – Multiple sequence editor

• GelView – Presents the reader with a graphical representation

• Final Consensus

Page 8:

Sequence Data

• Import

• Must be in a supported format

– GCG

– FastA

– Staden

• Enter in SeqEd

• Enter in GelEnter

Page 9:

New Project

• Create a new fragment assembly project

– Creates a new set of directories and files

– DO NOT alter these files and directories

• GB:M13mp18, GB:SynpBR322

• GAATTC, GGATCC

Page 10:

GelEnter

• Sequence editor

• Works like SeqEd

• For entering new fragments or importing fragments

• Existing fragments are modified with GelAssemble

Page 11:

GelMerge

• Finds overlaps between fragments and contigs

• Compares every fragment with every other fragment

• Settings determine the stringency necessary for an overlap


Page 12:

Calculating an Overlap

• Word Size (* 7 *)

• Stringency (* 0.80 *)

– What fraction of words must match?

• Minimum overlap length (* 14 *)

[Figure: overlap between Sequence 1 and Sequence 2; position labels 1, 125, 200 and 1]
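The three parameters above can be made concrete with a small sketch. It is not GCG's GelMerge algorithm, only one plausible reading of the slide: slide the start of Sequence 2 along Sequence 1, and accept the first (longest) placement where the overlap is at least `min_overlap` bases and at least `stringency` of the aligned words of length `word_size` are identical. The function names and the test sequences are hypothetical; the defaults 7, 0.80 and 14 are the values shown on the slide.

```python
from typing import Optional, Tuple


def word_match_fraction(a: str, b: str, word_size: int = 7) -> float:
    """Fraction of aligned word positions where the words in a and b are identical."""
    n_words = min(len(a), len(b)) - word_size + 1
    if n_words <= 0:
        return 0.0
    hits = sum(a[i:i + word_size] == b[i:i + word_size] for i in range(n_words))
    return hits / n_words


def find_overlap(seq1: str, seq2: str, word_size: int = 7,
                 stringency: float = 0.80,
                 min_overlap: int = 14) -> Optional[Tuple[int, int]]:
    """Return (0-based start in seq1, overlap length) or None if nothing qualifies."""
    # Smaller start positions give longer overlaps, so the first hit is the longest.
    for start in range(len(seq1) - min_overlap + 1):
        length = min(len(seq1) - start, len(seq2))
        if length < min_overlap:
            continue
        fraction = word_match_fraction(seq1[start:start + length],
                                       seq2[:length], word_size)
        if fraction >= stringency:
            return start, length
    return None


# Hypothetical reads sharing a 20-base region.
shared = "GGCTTACGATCGGATCCATG"
seq1 = "TTGACCGTAAGCTA" + shared
seq2 = shared + "AACGTTGCA"
print(find_overlap(seq1, seq2))
```

Raising the word size or the stringency makes the test stricter, so fewer spurious overlaps are accepted but reads with sequencing errors in the overlap are more likely to be left unmerged.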

Page 13:

GelView

• Displays the structure of the fragments and contigs graphically

• Shows the current state of the fragment assembly project

Page 14:

ContigExpress

• A program for assembling DNA fragments (both text sequences and chromatograms) from automated sequencers, into longer contiguous sequences or “contigs”

Page 15:

Launch ContigExpress (CE)

• From the Start menu choose Programs | InforMax | Vector NTI Suite 8 | ContigExpress

NOTE: CE can be launched from most other Vector NTI Suite applications

Download the demo project, then open it

Page 16:

End Trimming By Sequence Characteristics

• With sequences highlighted, choose Edit | Trim Selected Fragment Ends

• Click Settings and review the options:

– For 5’ End; For 3’ End

• Leave all settings as the default, click OK and then click Calculate!

• Any regions meeting the trim criteria defined above will be in red and lowercase

• Click OK then right-click on the gray column heading bar and choose Columns

• Double-click each of Length, 3’ Trimmed bases and 5’ Trimmed bases

• Click OK

Page 17:

Trimming Using Phred Quality Values

• Select all fragments in the Project pane then right-click and choose Load phred quality values

• Click Quality Values

• If you have data with associated Phred quality values, navigate to the .qual file and click Open

• Click OK

• Once Phred data have been imported, the scores may be used to trim sequence data

• Select sequences in the right-hand pane and choose Edit | Trim Selected Fragments Ends Using Phred QVs

• Review the Settings options:

– Trim bases with QV less than: Select the threshold below which bases will be trimmed

– Trim 5’/3’ bases: Specify which end(s) you wish to trim

• Click OK
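The two settings above (a QV threshold and a choice of ends) can be sketched as follows. This is only an illustration of end trimming by quality value, not ContigExpress's actual rule, which the slides do not describe; the function name, the default threshold of 20, and the example read are assumptions.

```python
from typing import List, Tuple


def trim_ends_by_quality(bases: str, quals: List[int], min_qv: int = 20,
                         trim_5: bool = True,
                         trim_3: bool = True) -> Tuple[str, List[int]]:
    """Remove bases with QV below min_qv from the selected end(s) of a read."""
    start, end = 0, len(bases)
    if trim_5:
        # Walk in from the 5' end until a base meets the threshold.
        while start < end and quals[start] < min_qv:
            start += 1
    if trim_3:
        # Walk in from the 3' end until a base meets the threshold.
        while end > start and quals[end - 1] < min_qv:
            end -= 1
    return bases[start:end], quals[start:end]


# Hypothetical read with poor quality at both ends and one weak base inside.
bases = "ACGTACGTAC"
quals = [5, 8, 30, 35, 12, 40, 38, 25, 10, 6]
trimmed, trimmed_quals = trim_ends_by_quality(bases, quals)
print(trimmed, trimmed_quals)
```

Note that this sketch trims only from the ends inward: the internal base with QV 12 is kept because it is flanked by bases above the threshold.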

Page 18:

Selecting Plasmid Regions for Vector Trimming

• From the Vector NTI Explorer, open the DNA molecule pUC19

• Set the selection to 351bp to 500bp (to include the polylinker)

• Choose Tools | Send to | Polylinker to Contig Express

• Check Selection Only and Direct then click OK

• Name the file ‘pUC19 (351-500)direct.seq’ then click Save

• Repeat for the complement and name the file ‘pUC19 (351-500)comp.seq’
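The direct and complement files can be pictured with a short sketch. It assumes that "complement" here means the complementary strand read 5’ to 3’ (the reverse complement) and that positions are 1-based and inclusive, as in the 351bp to 500bp selection; the function names and the toy sequence are hypothetical and are not the real pUC19 sequence.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def subsequence(seq: str, start: int, end: int) -> str:
    """Extract positions start..end, counted from 1 and inclusive."""
    return seq[start - 1:end]


def reverse_complement(seq: str) -> str:
    """Return the complementary strand written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


# Toy sequence standing in for a plasmid; positions 3..10 play the polylinker.
toy_plasmid = "AAGCTTGCATGCCTGCAGGTCGAC"
direct = subsequence(toy_plasmid, 3, 10)
comp = reverse_complement(direct)
print(direct, comp)
```

Both orientations are needed because an insert can be sequenced from either strand, so vector sequence may appear in a read in either form.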

Page 19:

Trimming for Vector Contamination

• Highlight the sequences in the right-hand pane

• Choose Edit | Trim Selected Fragments For Vector Contamination…

• Click Settings

• In the Polylinker list, check the sequences defined earlier (pUC19 (351-500)direct and pUC19 (351-500)comp)

• Highlight the name pUC19 (351-500)direct, click Add REN Sites, choose Enzlist25.dat then click Open

• Click HindIII (it will change color from gray to blue)

• Repeat for pUC19 (351-500)comp

• Click OK then click Calculate!

• Any contaminated regions will be in red and lowercase

• Click OK

Page 20:

Calling Secondary Peaks

• With all 12 sequences highlighted, choose Edit | Call Secondary Peaks For Selected Fragments

• Review the settings (Allow Ns to be Replaced, Allow Edited Bases to be Replaced, Set Threshold)

• Click Unselect All Fragments

• Check Allow Ns to be replaced

• Check the box next to ONE4KANR in the left-hand pane (ensure this is the only fragment checked), move the sliding bar to choose the threshold, and observe the result in the sequence window. Choose 85%; the viewer will display secondary bases whose peaks are at least 85% as tall as the higher peak

• Click OK

This tool can be used to resolve occurrences of double peaks in a chromatogram
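The threshold rule can be illustrated with a sketch: at each position, if the secondary peak is at least the chosen fraction of the primary peak's height, report both bases. Writing the pair as a standard IUPAC two-base ambiguity code is an assumption made here for illustration; the slides do not say how ContigExpress displays the result. The function name and the peak heights are hypothetical.

```python
from typing import List, Tuple

# Standard IUPAC codes for two-base ambiguities.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

# One peak record: (primary base, primary height, secondary base, secondary height).
Peak = Tuple[str, float, str, float]


def call_secondary_peaks(peaks: List[Peak], threshold: float = 0.85) -> str:
    """Replace a base with an ambiguity code where the secondary peak is tall enough."""
    called = []
    for primary, p_height, secondary, s_height in peaks:
        tall_enough = p_height > 0 and s_height / p_height >= threshold
        if secondary != primary and tall_enough:
            called.append(IUPAC[frozenset(primary + secondary)])
        else:
            called.append(primary)
    return "".join(called)


# Hypothetical peak heights for four positions.
peaks = [("A", 1000, "G", 900), ("C", 800, "T", 300),
         ("G", 500, "T", 430), ("T", 1200, "C", 100)]
print(call_secondary_peaks(peaks, threshold=0.85))
print(call_secondary_peaks(peaks, threshold=0.95))
```

Moving the threshold up, as with the sliding bar, reports fewer double peaks; moving it down reports more.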

Page 21:

Saving a Project

• Choose Project | Save As... and save the Project to your desktop as ‘Tutorial.cep’

Note: Tools such as BLAST Search and BioPlot are available from the menu bar of all the ContigExpress viewers

Page 22:

Assembly Setup

• From the Contig Express Project Window, choose Assemble | Assembly Setup

– Contig Assembly Tab: Definition of various parameters such as length and % identity allowed for overlap

– Alignment Tab: Define parameters for the alignments generated between fragments in contig creation (e.g. the score assigned to matching nucleotides or a mismatch). These are greyed out when using Linear Assembly

– Algorithm Tab: Two algorithms are available

– Light Settings Tab: Light contigs disregard chromatogram data and editing done on light contigs isn’t reflected in the original fragment sequences. Light contig assembly is preferred for assembling very large projects

• Leave all selections as the defaults and click OK

Page 23:

[Figure panels: Pairwise Assembly; Linear Assembly]

Page 24:

Assembling Contigs

• From the List Pane on the right-hand side, highlight all 12 fragments

• Choose Assemble | Assemble Selected Fragments and click OK when the assembly is complete

• The Tree Pane on the left hand side shows the Assembly (Assembly 1)

• Click the Content View icon to show the tree/branching of contigs

• Click the History View icon

• In the List pane, the arrows indicate whether a fragment was included in the assembly (blue) or whether its inclusion was only attempted (gray)

• Highlight the name of the Contig containing most fragments (Contig1) in the List pane

• Click the Show Unassembled Fragments icon to deselect it and thus view only those fragments that are part of the contig. Click the icon again to return to the original view

Page 25:

Exporting the Contig Consensus Sequence from Vector NTI

• In the Contig Express Project Viewer, highlight the name Contig 2 in the List pane

• Right-click and choose Export Contig | To GenBank file

• Save the file to your desktop

• With Contig 2 still highlighted, choose Edit | Copy

• Return to the Vector NTI Explorer

• Choose Edit | Paste

• In the New DNA/RNA Molecule dialog box click OK (leave the name as Contig 2)

• In the Vector NTI Explorer, the Contig 2 molecule should now be present

• Double-click the name Contig 2 to open the Molecule Viewer

• The consensus sequence is now available for restriction mapping, editing, annotation and other analyses in Vector NTI

Page 26:

phred/phrap/consed

• Developed at the University of Washington

– Phil Green (phrap)

– Brent Ewing (phred)

– David Gordon (consed)

• http://www.mbt.washington.edu/

Page 27:

Sequence Assembly

• PHRED

– Base calling with quality scores

• PHRAP

– Sequence Assembly

• CONSED

– Assembly visualization/Editing

Page 28:

Quality Scores

• Phred assigns a quality value to each called base

• Phrap uses the quality value during automated assembly

• Consed displays the qualities in different shades
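The slides do not state how a quality value is defined. In the standard phred definition, the quality value Q of a base is related to its estimated error probability P by Q = -10 log10(P), so Q20 corresponds to a 1 in 100 chance of a wrong call and Q30 to 1 in 1000. The sketch below converts in both directions; the function names are hypothetical.

```python
import math


def error_probability(quality: float) -> float:
    """Estimated probability that a base with this phred quality is wrong."""
    return 10 ** (-quality / 10)


def phred_quality(p_error: float) -> float:
    """Phred quality value for a given error probability (0 < p_error <= 1)."""
    return -10 * math.log10(p_error)


# Print a small lookup table for commonly quoted quality values.
for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {error_probability(q):.4f}")
```

Because the scale is logarithmic, each increase of 10 in the quality value means a tenfold drop in the estimated error rate.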

Page 29:

Exercise

• Log onto your biocomp2 account

• Create a directory: $ mkdir sequenceAssembly

• Fetch all mu* sequence files to the directory

– $ cd sequenceAssembly

– $ fetch mu*.seq

• Add the fetched sequences to the seqlab working.list

• Highlight all mu*.seq sequence files and run GelStart, GelEnter, GelMerge, GelAssemble, GelView

• Export the fetched sequences to a file called mu.genbank in GenBank format

• Use ftp to transfer mu.genbank to your desktop computer

• Drag and drop the mu.genbank file onto the ContigExpress Project Window

• Select all fragments and run Assemble Selected Fragments

Page 30:

Answer the following questions

• Summarize the outcome of the assembly and compare the results generated from the two sequence assembly systems

• How many contigs resulted?

• What were the lengths of the contigs?

• What were the sequences of the contigs?
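One way to answer the questions about contig number and length is to save each system's contig consensus sequences as FASTA text and tabulate them. The sketch below assumes such a FASTA export is available (the exercise itself only mentions GenBank format); the function names, contig names and sequences are made up for illustration.

```python
from typing import Dict


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into {record name: sequence}."""
    records: Dict[str, list] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record name.
            name = (line[1:].split() or ["unnamed"])[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def summarize(contigs: Dict[str, str]) -> Dict[str, int]:
    """Return the length of every contig, keyed by name."""
    return {name: len(seq) for name, seq in contigs.items()}


# Hypothetical export containing two contigs.
example = """>Contig1 assembled from 3 fragments
ATGGCGTACGTTAGC
CATGA
>Contig2
GGATCCGAATTC
"""
contigs = read_fasta(example)
lengths = summarize(contigs)
print(f"{len(contigs)} contigs")
for name, length in lengths.items():
    print(name, length)
```

Running the same summary on the output of both assembly systems gives a side-by-side comparison of contig counts and lengths.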