Lecture 2102

7/27/2019 Lecture 2102

1/42

Lecture 2

DNA sequencing

Whole genome sequencing

Sequence data formats

SeqLab

7/27/2019 Lecture 2102

2/42

DNA Sequencing

Sanger's method was invented in 1977. It uses purified

template DNA, a DNA primer, deoxynucleotide triphosphates

(dNTPs), di-deoxynucleotide triphosphates (ddNTPs),enzymes (Taq polymerase), and gel electrophoresis.

dNTPs + ddNTPsenzymes

7/27/2019 Lecture 2102

3/42

primer

= a piece of RNA or DNA whose purpose is

to allow the enzyme DNA polymerase to

make new DNA. It "primes" the reaction.

7/27/2019 Lecture 2102

4/42

Base

OH

BaseOO P O

O

...

DNA Sequencing

Base

BaseOO P O

O

XdNTP dNTP

dNTP ddNTP

Taq polymerase adds the complementary nucleotide (normal

or di-deoxy) to the 3' end of the growing strand. If the 3' endhas no 3'-OH group, then growth stops.

7/27/2019 Lecture 2102

5/42

results of polymerization rx:

GCCGATCTAGAAATCTAAGAGGAGAG

AGGCTAGATCTTTAGATTCTCCTCTC

AGGC*

AGGCTAGATC*

AGGCTAGATCTTTAGATTC*

AGGCTAGATCTTTAGATTCTC*

AGGCTAGATCTTTAGATTCTCC*

AGGCTAGATCTTTAGATTCTCCTC*

template

reverse

complement

d

dCTP-termin

atedfragmen

ts

3'

5'

5'

3'

7/27/2019 Lecture 2102

6/42

Separating fragments on a gel

larger

smaller

Gel (or capillary)

electrophoresis separates the

fragments by charge, which is

proportional to length.

Now the sequence can be read

from the gel. Top to bottom is

the reverse complement

sequence. Bottom to top is thetemplate sequence if we

switch the labels of the lanes

to the complement bases.

ddA ddT ddC ddG

A

G

G

C

T

A

G

7/27/2019 Lecture 2102

7/42

DNA SequencingGels can resolve around 500 bases.

dNTP/ddNTP ratio is optimized so that ~1 ddNTP is

incorporated every ~500 bases.

Fragments are visualized by

(1) Using 32P dNTPs in the primer. 32P gives off beta

radiation, which is captured when the gel is layed out on a

piece of film. b won't pass through glass, so the gel must beremoved from the gel apparatus.

(2) Using fluorescent dyes attached to the primer.

The fluorescence is detected by scanning the gel or capillary.

No need to remove the gel.

7/27/2019 Lecture 2102

8/42

Sequence trace

F

Each color is one lane of an electrophoresis gel.

7/27/2019 Lecture 2102

9/42

base-calling

"I call this a G"

"N" = no call, trace

not good enough to

call the base.

"phred" (by Phil Green) is the most widely used base-calling program.

7/27/2019 Lecture 2102

10/42

Naive strategy for sequencing long

pieces of DNAOrder a primer to start the sequencing process.

Sequence ~500 bases.

Design a primer based on the last good part of the sequence.

Order that primer. (wait a few weeks for delivery)

Repeat.

Time required to sequence a small genome (E. coli) this way if

delivery takes one week: 9200 weeks = 176 years

7/27/2019 Lecture 2102

11/42

Whole Genome SequencingDifferent strategies have been tried. We will discuss one:

Shotgun Sequencing.

Start with purified DNA.Shear it into random sized fragments.

Clone fragments into yeast artificial chromosomes (YAC)

orbacterial artificial chromosomes (BAC). Grow

these up.

Sequence the ends of these fragments, and determine the total fragment

size. (These pairs of fragments are calledBAC ends orYAC ends)

Assemble the fragments.

7/27/2019 Lecture 2102

12/42

The Sequence Assembly Problem

Imagine several copies of a book are cut by

scissors into 10 million small pieces. Each copy

is cut in an individual way so that a piece from

one copy may overlap a piece from another

copy. Assuming that 1 million pieces are lost

and the remaining 9 million are splashed with

ink.

Try to recover the original text.

7/27/2019 Lecture 2102

13/42

Sequence assembly strategy

Sequence at least 10 times as much DNA as contained in the

genome. i.e. If the genome has 4.6 Mb (mega-bases) then

sequence 46 Mb. This is called "10-fold redundancy".

Find all overlapping sequences. (sometimes the overlap is

ambiguous)

If the overlap is ambiguous on one end of the BAC or YAC,

the ambiguity can be resolved using the other end.

Errors in assembly can still occur in highly repetitive

regions of the genome (such as near the centromeres).

7/27/2019 Lecture 2102

14/42

Bacterial Artificial Chromosomes

TTAGCTGATACAGGGGCTCAAA

GGGGCTCAAAGTGTCACACATTCA

Size of BAC is known, so distancebetween BAC ends is known

Only 500 bp on the ends of each BACare sequenced

Relative position of BACs is determined by sequence overlap

7/27/2019 Lecture 2102

15/42

Alignment

7/27/2019 Lecture 2102

16/42

Contig

end of one contig

start of next contig

}no data zone

Throughout a "contig" there is a continuous

tiling of overlapping fragments with no gaps

in the data.

7/27/2019 Lecture 2102

17/42

Sources of sequence data

NCBI Washington,DC GenBank

EMBL Heidelberg, Germany EMBL

NIG Shizuoka-ken, Japan DDBJ

Members ofInternational Nucleotide Sequence Database Collaboration

Web sites:

NCBI www.ncbi.nlm.nih.gov

EMBL www.embl-heidelberg.de

DDBJ www.ddbj.nig.ac.jp

7/27/2019 Lecture 2102

18/42

NCBI tour

Log onto yourmodlab (client) machine.

Start a web browser.

Set the browser to

www.ncbi.nlm.nih.gov

And follow along.

S fil hi d bl

7/27/2019 Lecture 2102

19/42

"machine readable" files...

are keyworded

have space delimited fields

contain special characters like /, :,=,{}, etc (/product)

contain database identifiers, accession number

(gi:123456789)

sometimes have a checksum, to guard against corruption.

Sequence files are machine readable

Generally it is better to let the machine do the reading.

7/27/2019 Lecture 2102

20/42

>> > >

>

>>>

>

server

clients

The prompt indicates which machine you are talking to.

7/27/2019 Lecture 2102

21/42

SCP a file to the server

Send the file "lotsofjunk" to bioinf45.bio.rpi.edu as follows:

scp lotsofjunk [email protected]:lotsofjunk

where n is the same as the number of yourmodlab machine(i.e. if you are on modlab16, use bio454016)

Enter your password: (see whiteboard)

7/27/2019 Lecture 2102

22/42

Start SeqLab on the server

client_prompt>ssh -X [email protected]

where n is the same as the number of yourmodlab machine(i.e. if you are on modlab16, login as bio454016)

password: ******* (see whiteboard)

server_prompt>seqlab &

How to start seqlab on the bioinf45 server

background it

7/27/2019 Lecture 2102

23/42

Show otherdisplay

windows

change your

working

directory

and fonts

add-onsHundreds of

functions

some ofthese are

button

functions

Open,

Save,Import,

Export,

Print, etc

Intro to SeqLab

This introduction covers only the basics. Check out the

SeqLab Guide for detailed descriptions.

File Edit Functions Extensions Options Windows

MAIN MENU

7/27/2019 Lecture 2102

24/42

Two SeqLab modes:

Editor The editor window. Where the action is.

Main list A list of available files for editing.

7/27/2019 Lecture 2102

25/42

Basic SeqLab operations

Get a sequence from the local database:

File-->Add Sequence From-->Databases

(only GenEMBL database is actually present)

Select one. Add to Main Window.

Get a sequence from a file:

File-->Import

Use "Filter" button and line to view the files you want.

SeqLab recognizes GenBank format.

7/27/2019 Lecture 2102

26/42

Basic SeqLab operationsSelect a sequence: click on sequence name.

Select a region of a sequence: click and drag. Or click on the

start, then shift-click on the end.

Move a sequence: select a position and hitspace ordelete.(protections may have to be set. Hit the lockicon)

Cut, copy, paste: use icons. (There are two copy buffers, one

for wholesequences, one forregions. SeqLab will ask you

which one you want.)

Create a new sequence: File-->New Sequence (select DNA,

RNA, or protein). Click on blank sequence (~) and type or

paste.

7/27/2019 Lecture 2102

27/42

SeqLab windows

Job manager -- use this to check the status of a job

request, or to kill a job that is taking too long.

Output manager -- find the output of a job.Database browser -- find a sequence if you know

the database and sequence identifier/accession

number.

Traces -- look at trace files

Features -- locate annotated sequence features such

as ORFs, SNPs. Long sequences can be color-coded

by features.

7/27/2019 Lecture 2102

28/42

In class exercise: Seqlab

In the Editor, import the file "lotsofjunk" (GenBank format)

Answer the following:

What kind of sequence is it?

What is the GenBank identifier(accession number) of this

sequence?

What does the keyword CDS mean?

What is its G/C content?

7/27/2019 Lecture 2102

29/42

In class exercise: Seqlab

Display using graphic features and scale so that the whole

sequence fits in the window.

Open the Features window and edit the features as follows:Make all polyproteins purple (with different fills).

Make all BGI-PUPs lime-green.

Make all nucleocapsid and envelope proteins orange.

Make any feature shorter than 100 bases a diamond.

7/27/2019 Lecture 2102

30/42

Edit-->Find

Display as monochrome text, with 1:1 scaling.

Find all A-rich regions with up to 3 mismatches.

enter "AAAAAAAAAAAA"

and allow 3 mismatches. Note coloring. What does it

mean?

Find all potential start codons "atg". Which ones are at thestart of CDS's?

7/27/2019 Lecture 2102

31/42

Save

Save your modified sequence in a new file called "sars.ref"

Explore other functions using the Help button.

You may also use one of the copies of the SeqLab Guide.

7/27/2019 Lecture 2102

32/42

Dot Matrix

Each position in the matrix D[i,j] is either

has a dot, if A[i] == B[j]

or blank, otherwise.

AAGACGTTTAGA

CGTACT

7/27/2019 Lecture 2102

33/42

A more advanced dot matrix

Seqlab function "Compare"

With thousands of bases, it is impossible to plot all dots in

the matrix. Instead we look for stretches of sequence withfew mismatches. If the number of mismatches is less than

the cutoff, plot a dot or line.

AAGACGTTTAGACGTA

CT

All diagonals with

at least 4 out of 5

matches.

In SeqLab:

"window" is the

length of adiagonal,

"stringency" is

minimum number

of matches.

7/27/2019 Lecture 2102

34/42

What you can do with a DNA dot matrix

Find regions of self-similarity (microrepeats, paralogs)

Find regions of complementarity (RNA secondary

structure)

Find the locations of genes between two genomes.

Find Non-sequential alignments

Weaknesses:Dot plots show only the most obvious similarities.

Dot plots are not alignments, yet.

In class exercise:

7/27/2019 Lecture 2102

35/42

Dot matrix for a phageIn SeqLab:

File-->Add Sequence From-->Databases

GenEMBL, phage (click "Show matching entries")

Select first entry (em_ph:s66725)Add to Main Menu

Select the first windowfull (about 80-90 bases), copy it and

use it to make a New DNA sequence.

File-->New Sequence (choose DNA)select a position in the empty sequence (~)

paste

In class exercise:

In class exercise:

7/27/2019 Lecture 2102

36/42

Making the reverse compliment

Copy the sequence. (select, copy, paste) Now you have two

identical sequences.

Make the reverse compliment of the second one. (Edit--

>Reverse. Select "reverse and compliment")

In class exercise:

In class exercise:

7/27/2019 Lecture 2102

37/42

Making a dot plot

Make a dotplot. Select both sequences.

Function-->Pairwise comparison-->Compare

Options...

set Window = 1 (close, Run)Check progress using the Jobs Window

Display the Dotplot

Repeat using Window=8, Stringency=5

Then repeat again using Window=16, stringency=10.

Where is the longest region of self-complimentarity?

In class exercise:

7/27/2019 Lecture 2102

38/42

Self-hybridization plot

Dotplot for randomly-selected DNA and its reverse complement.

DNA sequence

reversecomplementsequence

In class exercise:

7/27/2019 Lecture 2102

39/42

Dot matrix for two viral genomesIn SeqLab:

Import the SARS genome (File-->Import, select lotsofjunk)

Download another coronavirus from the NCBI website.

Search Genome for coronavirus.

Select "Porcine epidemic diarrhea virus" NC_003436

Click on NC_003436 to go to the sequence.

Display GenBank. Send to File.

(Rename it, say,pig.gbk. Copy it to bioinf45)

In SeqLab: File-->Import

Select the two viral genomes, and run

Functions-->Pairwise-->Compare

Use window length =50 andstringency= 30

(Why?)

Try other settings.

In class exercise:

7/27/2019 Lecture 2102

40/42

Parametric search

7/27/2019 Lecture 2102

41/42

Dotplot between two viruses

Results of SeqLabs Compare program using window=50, stringency=26

7/27/2019 Lecture 2102

42/42

Homework 1: Exercise in database searching

Do Problem 1 (a throughj), p.61 in Mount, Bioinformatics2nd Ed.

Write down the number of hits for steps b-i.

Copy the search results into SeqLab for further manipulation.

See details on course web page.

Documents

Lecture 2102