Lecture 2102

  • 7/27/2019 Lecture 2102

    1/42

    Lecture 2

    DNA sequencing

    Whole genome sequencing

    Sequence data formats

    SeqLab

  • 7/27/2019 Lecture 2102

    2/42

    DNA Sequencing

    Sanger's method was invented in 1977. It uses purified template DNA, a DNA primer, deoxynucleotide triphosphates (dNTPs), di-deoxynucleotide triphosphates (ddNTPs), enzymes (Taq polymerase), and gel electrophoresis.

    [Figure labels: dNTPs + ddNTPs, enzymes]

  • 7/27/2019 Lecture 2102

    3/42

    primer

    = a piece of RNA or DNA whose purpose is

    to allow the enzyme DNA polymerase to

    make new DNA. It "primes" the reaction.

  • 7/27/2019 Lecture 2102

    4/42

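    DNA Sequencing

    [Figure: sugar-phosphate backbone at the 3' end of the growing strand. Labels: dNTP followed by dNTP (3'-OH present, chain continues); ddNTP followed by dNTP (no 3'-OH, addition blocked, marked X).]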

    Taq polymerase adds the complementary nucleotide (normal or di-deoxy) to the 3' end of the growing strand. If the 3' end has no 3'-OH group, then growth stops.

  • 7/27/2019 Lecture 2102

    5/42

    Results of the polymerization reaction:

    3' TCCGATCTAGAAATCTAAGAGGAGAG 5'   template

    5' AGGCTAGATCTTTAGATTCTCCTCTC 3'   reverse complement

    ddCTP-terminated fragments:

    5' AGGC*

    5' AGGCTAGATC*

    5' AGGCTAGATCTTTAGATTC*

    5' AGGCTAGATCTTTAGATTCTC*

    5' AGGCTAGATCTTTAGATTCTCC*

    5' AGGCTAGATCTTTAGATTCTCCTC*
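A minimal Python sketch of the chain-termination idea, assuming the newly synthesized strand shown on this slide and an idealized reaction in which every possible terminated product appears. The function and variable names are illustrative only. Unlike the slide, the sketch also lists the full-length product when the strand happens to end in the terminating base.

```python
def terminated_fragments(new_strand: str, dd_base: str) -> list[str]:
    """Return every prefix of the new strand that ends in the ddNTP base."""
    return [
        new_strand[: i + 1]
        for i, base in enumerate(new_strand)
        if base == dd_base
    ]


# Newly synthesized strand from the slide, written 5'->3'.
NEW_STRAND = "AGGCTAGATCTTTAGATTCTCCTCTC"

# One reaction (one gel lane) per di-deoxy base.
fragments_by_lane = {
    base: terminated_fragments(NEW_STRAND, base) for base in "ATCG"
}

for fragment in fragments_by_lane["C"]:
    print(fragment + "*")
```

Each base of the strand ends exactly one fragment, so the four lanes together contain one fragment of every length.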

  • 7/27/2019 Lecture 2102

    6/42

    Separating fragments on a gel

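    Gel (or capillary) electrophoresis separates the fragments by length. DNA carries a nearly constant charge per base, so the fragments are sorted by size as they move through the gel, with shorter fragments migrating faster.

    Now the sequence can be read from the gel. Top to bottom is the reverse complement sequence. Bottom to top is the template sequence if we switch the labels of the lanes to the complement bases.

    [Figure: gel with lanes ddA, ddT, ddC, ddG, running from larger to smaller fragments. The bands read A G G C T A G.]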
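Reading a gel amounts to sorting the bands by fragment length and noting which lane each band is in. The sketch below assumes the band positions have already been converted to fragment lengths; the lane contents are chosen to reproduce the A G G C T A G read shown in the figure, and `read_gel` is a hypothetical helper name.

```python
def read_gel(lanes: dict[str, list[int]]) -> str:
    """Read a sequence from the fragment lengths seen in each ddNTP lane."""
    bands = sorted(
        (length, base)
        for base, lengths in lanes.items()
        for length in lengths
    )
    # Shortest fragment first: its lane gives the first base, and so on.
    return "".join(base for _, base in bands)


# Fragment lengths per lane (illustrative values matching the figure).
lanes = {"A": [1, 6], "T": [5], "C": [4], "G": [2, 3, 7]}
print(read_gel(lanes))  # prints AGGCTAG
```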

  • 7/27/2019 Lecture 2102

    7/42

    DNA Sequencing

    Gels can resolve around 500 bases.

    The dNTP/ddNTP ratio is optimized so that ~1 ddNTP is incorporated every ~500 bases.

    Fragments are visualized by

    (1) Using 32P-labeled nucleotides in the primer. 32P gives off beta radiation, which is captured when the gel is laid out on a piece of film. Beta radiation won't pass through glass, so the gel must be removed from the gel apparatus.

    (2) Using fluorescent dyes attached to the primer. The fluorescence is detected by scanning the gel or capillary. No need to remove the gel.

  • 7/27/2019 Lecture 2102

    8/42

    Sequence trace

    Each color is one lane of an electrophoresis gel.

  • 7/27/2019 Lecture 2102

    9/42

    base-calling

    "I call this a G"

    "N" = no call, trace

    not good enough to

    call the base.

    "phred" (by Phil Green) is the most widely used base-calling program.

  • 7/27/2019 Lecture 2102

    10/42

    Naive strategy for sequencing long pieces of DNA

    Order a primer to start the sequencing process.

    Sequence ~500 bases.

    Design a primer based on the last good part of the sequence.

    Order that primer. (wait a few weeks for delivery)

    Repeat.

    Time required to sequence a small genome (E. coli, 4.6 Mb at ~500 bases per round) this way if delivery takes one week: 9200 weeks, or about 177 years
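The estimate can be checked with a few lines of arithmetic. This sketch assumes a 4.6 Mb genome (the E. coli figure used later in the lecture), 500 bases per sequencing round, and one week per primer order; the constant names are illustrative.

```python
GENOME_SIZE_BP = 4_600_000  # E. coli, about 4.6 Mb
READ_LENGTH_BP = 500        # bases obtained per sequencing round
WEEKS_PER_ROUND = 1         # assumed primer delivery time

rounds = GENOME_SIZE_BP // READ_LENGTH_BP
weeks = rounds * WEEKS_PER_ROUND
years = weeks / 52

print(f"{rounds} rounds, {weeks} weeks, about {years:.1f} years")
```

With a delivery time of a few weeks rather than one, the total scales up proportionally.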

  • 7/27/2019 Lecture 2102

    11/42

    Whole Genome Sequencing

    Different strategies have been tried. We will discuss one: Shotgun Sequencing.

    Start with purified DNA.

    Shear it into random-sized fragments.

    Clone the fragments into yeast artificial chromosomes (YAC) or bacterial artificial chromosomes (BAC). Grow these up.

    Sequence the ends of these fragments, and determine the total fragment size. (These pairs of end sequences are called BAC ends or YAC ends.)

    Assemble the fragments.

  • 7/27/2019 Lecture 2102

    12/42

    The Sequence Assembly Problem

    Imagine several copies of a book are cut by

    scissors into 10 million small pieces. Each copy

    is cut in an individual way so that a piece from

    one copy may overlap a piece from another

    copy. Assume that 1 million pieces are lost and the remaining 9 million are splashed with ink.

    Try to recover the original text.

  • 7/27/2019 Lecture 2102

    13/42

    Sequence assembly strategy

    Sequence at least 10 times as much DNA as is contained in the genome. For example, if the genome has 4.6 Mb (megabases), then sequence 46 Mb. This is called "10-fold redundancy".

    Find all overlapping sequences. (sometimes the overlap is

    ambiguous)

    If the overlap is ambiguous on one end of the BAC or YAC,

    the ambiguity can be resolved using the other end.

    Errors in assembly can still occur in highly repetitive

    regions of the genome (such as near the centromeres).
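To get a feel for what 10-fold redundancy buys, the sketch below computes the number of reads required and the fraction of the genome expected to stay unsequenced. It assumes 500-base reads and reads that start at uniformly random positions, in which case the chance that a given base is missed by every read is approximately e^(-coverage). That Poisson approximation is an assumption added here, not something stated in the lecture.

```python
import math

GENOME_SIZE_BP = 4_600_000  # 4.6 Mb genome from the slide
READ_LENGTH_BP = 500        # assumed read length
COVERAGE = 10               # "10-fold redundancy"

total_bases = COVERAGE * GENOME_SIZE_BP
reads_needed = total_bases // READ_LENGTH_BP

# Poisson approximation: P(a base is covered by zero reads) = exp(-coverage).
uncovered_fraction = math.exp(-COVERAGE)
expected_uncovered_bp = GENOME_SIZE_BP * uncovered_fraction

print(f"bases to sequence: {total_bases}")
print(f"reads needed:      {reads_needed}")
print(f"expected gaps:     about {expected_uncovered_bp:.0f} bases")
```

Real reads are not perfectly random, so actual gaps tend to be larger than this idealized estimate.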

  • 7/27/2019 Lecture 2102

    14/42

    Bacterial Artificial Chromosomes

    TTAGCTGATACAGGGGCTCAAA

    GGGGCTCAAAGTGTCACACATTCA

    Size of BAC is known, so distance between BAC ends is known

    Only 500 bp on the ends of each BAC are sequenced

    Relative position of BACs is determined by sequence overlap
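The two end sequences on this slide share the run GGGGCTCAAA. A minimal sketch of how such a suffix-prefix overlap can be found and used to merge two reads follows; the function names and the `min_length` parameter are illustrative, and real assemblers also tolerate sequencing errors, which this exact-match version does not.

```python
from typing import Optional


def longest_overlap(left: str, right: str, min_length: int = 3) -> int:
    """Length of the longest suffix of left that equals a prefix of right."""
    for length in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def merge(left: str, right: str, min_length: int = 3) -> Optional[str]:
    """Join two reads across their overlap, or return None if none is found."""
    overlap = longest_overlap(left, right, min_length)
    if overlap == 0:
        return None
    return left + right[overlap:]


read_1 = "TTAGCTGATACAGGGGCTCAAA"
read_2 = "GGGGCTCAAAGTGTCACACATTCA"

print(longest_overlap(read_1, read_2))  # prints 10
print(merge(read_1, read_2))
```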

  • 7/27/2019 Lecture 2102

    15/42

    Alignment

  • 7/27/2019 Lecture 2102

    16/42

    Contig

    end of one contig

    start of next contig

    no data zone (the gap between contigs)

    Throughout a "contig" there is a continuous

    tiling of overlapping fragments with no gaps

    in the data.

  • 7/27/2019 Lecture 2102

    17/42

    Sources of sequence data

    Institution   Location               Database

    NCBI          Bethesda, MD, USA      GenBank

    EMBL          Heidelberg, Germany    EMBL

    NIG           Shizuoka-ken, Japan    DDBJ

    Members of the International Nucleotide Sequence Database Collaboration

    Web sites:

    NCBI www.ncbi.nlm.nih.gov

    EMBL www.embl-heidelberg.de

    DDBJ www.ddbj.nig.ac.jp

  • 7/27/2019 Lecture 2102

    18/42

    NCBI tour

    Log onto your modlab (client) machine.

    Start a web browser.

    Set the browser to

    www.ncbi.nlm.nih.gov

    And follow along.

  • 7/27/2019 Lecture 2102

    19/42

    "machine readable" files...

    are keyworded

    have space delimited fields

    contain special characters like /, :, =, {}, etc. (e.g. /product)

    contain database identifiers and accession numbers (gi:123456789)

    sometimes have a checksum, to guard against corruption.

    Sequence files are machine readable

    Generally it is better to let the machine do the reading.
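As a small illustration of letting the machine do the reading, the sketch below pulls keywords and /qualifiers out of a keyworded text record. The record is a made-up, GenBank-like example (the locus name, accession, and product are hypothetical), and the parser handles only this simplified layout. For real files, an established parser such as Biopython's `SeqIO.parse(handle, "genbank")` is the safer choice.

```python
# Hypothetical, simplified record in a GenBank-like keyworded layout.
RECORD = """\
LOCUS       EXAMPLE01     24 bp    DNA
ACCESSION   X00000
FEATURES
     CDS     1..24
             /product="hypothetical protein"
"""


def parse_record(text: str) -> tuple[dict[str, str], dict[str, str]]:
    """Collect top-level keywords and /qualifier=value pairs."""
    keywords: dict[str, str] = {}
    qualifiers: dict[str, str] = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("/"):
            # Qualifier lines look like /name="value".
            name, _, value = stripped[1:].partition("=")
            qualifiers[name] = value.strip('"')
        elif not line[0].isspace():
            # Top-level keywords start in the first column.
            key, _, rest = line.partition(" ")
            keywords[key] = rest.strip()
    return keywords, qualifiers


keywords, qualifiers = parse_record(RECORD)
print(keywords["ACCESSION"])
print(qualifiers["product"])
```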

  • 7/27/2019 Lecture 2102

    20/42

    [Figure: one server connected to several clients; each machine shows its own > prompt.]

    The prompt indicates which machine you are talking to.

  • 7/27/2019 Lecture 2102

    21/42

    SCP a file to the server

    Send the file "lotsofjunk" to bioinf45.bio.rpi.edu as follows:

    scp lotsofjunk [email protected]:lotsofjunk

    where n is the same as the number of your modlab machine (e.g. if you are on modlab16, use bio454016)

    Enter your password: (see whiteboard)

  • 7/27/2019 Lecture 2102

    22/42

    Start SeqLab on the server

    client_prompt>ssh -X [email protected]

    where n is the same as the number of your modlab machine (e.g. if you are on modlab16, log in as bio454016)

    password: ******* (see whiteboard)

    server_prompt>seqlab &

    This is how to start seqlab on the bioinf45 server. The trailing & runs it in the background.

  • 7/27/2019 Lecture 2102

    23/42

    Intro to SeqLab

    This introduction covers only the basics. Check out the SeqLab Guide for detailed descriptions.

    MAIN MENU: File  Edit  Functions  Extensions  Options  Windows

    File: Open, Save, Import, Export, Print, etc.

    Edit: some of these are button functions

    Functions: hundreds of functions

    Extensions: add-ons

    Options: change your working directory and fonts

    Windows: show other display windows

  • 7/27/2019 Lecture 2102

    24/42

    Two SeqLab modes:

    Editor The editor window. Where the action is.

    Main list A list of available files for editing.

  • 7/27/2019 Lecture 2102

    25/42

    Basic SeqLab operations

    Get a sequence from the local database:

    File-->Add Sequence From-->Databases

    (only GenEMBL database is actually present)

    Select one. Add to Main Window.

    Get a sequence from a file:

    File-->Import

    Use "Filter" button and line to view the files you want.

    SeqLab recognizes GenBank format.

  • 7/27/2019 Lecture 2102

    26/42

    Basic SeqLab operations

    Select a sequence: click on sequence name.

    Select a region of a sequence: click and drag. Or click on the start, then shift-click on the end.

    Move a sequence: select a position and hit space or delete. (Protections may have to be set. Hit the lock icon.)

    Cut, copy, paste: use icons. (There are two copy buffers, one for whole sequences, one for regions. SeqLab will ask you which one you want.)

    Create a new sequence: File-->New Sequence (select DNA,

    RNA, or protein). Click on blank sequence (~) and type or

    paste.

  • 7/27/2019 Lecture 2102

    27/42

    SeqLab windows

    Job manager -- use this to check the status of a job

    request, or to kill a job that is taking too long.

    Output manager -- find the output of a job.

    Database browser -- find a sequence if you know the database and sequence identifier/accession number.

    Traces -- look at trace files

    Features -- locate annotated sequence features such

    as ORFs, SNPs. Long sequences can be color-coded

    by features.

  • 7/27/2019 Lecture 2102

    28/42

    In class exercise: Seqlab

    In the Editor, import the file "lotsofjunk" (GenBank format)

    Answer the following:

    What kind of sequence is it?

    What is the GenBank identifier (accession number) of this sequence?

    What does the keyword CDS mean?

    What is its G/C content?
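Outside SeqLab, G/C content is simply the fraction of bases that are G or C. A minimal sketch, assuming that ambiguous characters such as N should be left out of the count (a choice made here, not one taken from the lecture):

```python
def gc_content(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


# Example: the newly synthesized strand from the sequencing slides.
example = "AGGCTAGATCTTTAGATTCTCCTCTC"
print(f"{gc_content(example):.3f}")
```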

  • 7/27/2019 Lecture 2102

    29/42

    In class exercise: Seqlab

    Display using graphic features and scale so that the whole

    sequence fits in the window.

    Open the Features window and edit the features as follows:

    Make all polyproteins purple (with different fills).

    Make all BGI-PUPs lime-green.

    Make all nucleocapsid and envelope proteins orange.

    Make any feature shorter than 100 bases a diamond.

  • 7/27/2019 Lecture 2102

    30/42

    Edit-->Find

    Display as monochrome text, with 1:1 scaling.

    Find all A-rich regions with up to 3 mismatches.

    enter "AAAAAAAAAAAA"

    and allow 3 mismatches. Note coloring. What does it

    mean?

    Find all potential start codons "atg". Which ones are at the start of CDSs?
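A search that allows mismatches can be sketched as a sliding window that counts differing positions (the Hamming distance) between the pattern and each stretch of the sequence. This is an illustration of the idea only, not SeqLab's implementation; the function name and the short test sequence are made up.

```python
def find_with_mismatches(
    seq: str, pattern: str, max_mismatches: int
) -> list[tuple[int, int]]:
    """Return (start, mismatches) for each window within the mismatch limit."""
    seq, pattern = seq.upper(), pattern.upper()
    hits = []
    for start in range(len(seq) - len(pattern) + 1):
        window = seq[start : start + len(pattern)]
        mismatches = sum(a != b for a, b in zip(window, pattern))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


sequence = "GCAAAAATAAAAAAGCATGTT"  # made-up test sequence

# A-rich regions: 12 A's with up to 3 mismatches.
print(find_with_mismatches(sequence, "AAAAAAAAAAAA", 3))

# Potential start codons: exact matches to "atg" (0-based positions).
print(find_with_mismatches(sequence, "atg", 0))
```

Overlapping hits are all reported, which is why one A-rich region produces a run of neighboring start positions.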

  • 7/27/2019 Lecture 2102

    31/42

    Save

    Save your modified sequence in a new file called "sars.ref"

    Explore other functions using the Help button.

    You may also use one of the copies of the SeqLab Guide.

  • 7/27/2019 Lecture 2102

    32/42

    Dot Matrix

    Each position in the matrix D[i,j] either has a dot, if A[i] == B[j], or is blank otherwise.

    AAGACGTTTAGA

    CGTACT
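The definition translates directly into code. This sketch builds D for the two sequences on the slide and prints it as text, with `*` for a dot and `.` for a blank; the function names are illustrative.

```python
def dot_matrix(a: str, b: str) -> list[list[bool]]:
    """D[i][j] is True exactly when a[i] == b[j]."""
    return [[a_base == b_base for b_base in b] for a_base in a]


def render(a: str, b: str) -> str:
    """Draw the matrix with sequence a down the side and b across the top."""
    lines = ["  " + b]
    for base, row in zip(a, dot_matrix(a, b)):
        lines.append(base + " " + "".join("*" if dot else "." for dot in row))
    return "\n".join(lines)


A = "AAGACGTTTAGA"
B = "CGTACT"
print(render(A, B))
```

The shared CGT shows up as three dots in a row along a diagonal, starting at A[4] and B[0].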

  • 7/27/2019 Lecture 2102

    33/42

    A more advanced dot matrix

    Seqlab function "Compare"

    With thousands of bases, it is impossible to plot all dots in the matrix. Instead we look for stretches of sequence with few mismatches. If the number of mismatches is less than the cutoff, plot a dot or line.

    AAGACGTTTAGA

    CGTACT

    All diagonals with at least 4 out of 5 matches.

    In SeqLab: "window" is the length of a diagonal, "stringency" is the minimum number of matches.
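A windowed dot plot can be sketched as follows: for every pair of start positions, count the matches along a diagonal of length `window` and record a dot when the count reaches `stringency`. This follows the definitions on the slide but is not the actual code behind Compare; the function name is made up.

```python
def windowed_dots(
    a: str, b: str, window: int, stringency: int
) -> list[tuple[int, int]]:
    """Start positions (i, j) of diagonals with at least stringency matches."""
    dots = []
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            if matches >= stringency:
                dots.append((i, j))
    return dots


A = "AAGACGTTTAGA"
B = "CGTACT"

# Exact 3-base matches: only the shared CGT survives.
print(windowed_dots(A, B, window=3, stringency=3))  # prints [(4, 0)]

# Window = 1 and stringency = 1 reproduces the simple dot matrix.
print(len(windowed_dots(A, B, window=1, stringency=1)))  # prints 16
```

Raising the window and stringency together removes chance matches, at the cost of missing short or weakly conserved regions.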

  • 7/27/2019 Lecture 2102

    34/42

    What you can do with a DNA dot matrix

    Find regions of self-similarity (microrepeats, paralogs)

    Find regions of complementarity (RNA secondary

    structure)

    Find the locations of genes between two genomes.

    Find non-sequential alignments

    Weaknesses:

    Dot plots show only the most obvious similarities.

    Dot plots are not alignments, yet.

    In class exercise:

  • 7/27/2019 Lecture 2102

    35/42

    Dot matrix for a phage

    In SeqLab:

    File-->Add Sequence From-->Databases

    GenEMBL, phage (click "Show matching entries")

    Select first entry (em_ph:s66725)

    Add to Main Menu

    Select the first windowful (about 80-90 bases), copy it and use it to make a New DNA sequence.

    File-->New Sequence (choose DNA)

    select a position in the empty sequence (~)

    paste

    In class exercise:

  • 7/27/2019 Lecture 2102

    36/42

    Making the reverse complement

    Copy the sequence. (select, copy, paste) Now you have two identical sequences.

    Make the reverse complement of the second one. (Edit-->Reverse. Select "reverse and complement")

    In class exercise:
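For comparison with the menu operation, a reverse complement can be computed in a few lines. This sketch handles only the four unambiguous bases in upper or lower case and leaves any other character unchanged, which is a simplification.

```python
# Translation table mapping each base to its complement, preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the order."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("AGGC"))  # prints GCCT
```

Applying the function twice returns the original sequence, which is a quick sanity check.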

  • 7/27/2019 Lecture 2102

    37/42

    Making a dot plot

    Make a dotplot. Select both sequences.

    Functions-->Pairwise comparison-->Compare

    Options...

    set Window = 1 (close, Run)

    Check progress using the Jobs Window

    Display the Dotplot

    Repeat using Window=8, Stringency=5

    Then repeat again using Window=16, Stringency=10.

    Where is the longest region of self-complementarity?

    In class exercise:

  • 7/27/2019 Lecture 2102

    38/42

    Self-hybridization plot

    Dotplot for randomly-selected DNA and its reverse complement.

    DNA sequence

    reverse complement sequence

    In class exercise:

  • 7/27/2019 Lecture 2102

    39/42

    Dot matrix for two viral genomes

    In SeqLab:

    Import the SARS genome (File-->Import, select lotsofjunk)

    Download another coronavirus from the NCBI website.

    Search Genome for coronavirus.

    Select "Porcine epidemic diarrhea virus" NC_003436

    Click on NC_003436 to go to the sequence.

    Display GenBank. Send to File.

    (Rename it, say, pig.gbk. Copy it to bioinf45)

    In SeqLab: File-->Import

    Select the two viral genomes, and run

    Functions-->Pairwise-->Compare

    Use window length = 50 and stringency = 30

    (Why?)

    Try other settings.

    In class exercise:

  • 7/27/2019 Lecture 2102

    40/42

    Parametric search

  • 7/27/2019 Lecture 2102

    41/42

    Dotplot between two viruses

    Results of SeqLab's Compare program using window=50, stringency=26

  • 7/27/2019 Lecture 2102

    42/42

    Homework 1: Exercise in database searching

    Do Problem 1 (a through j), p. 61 in Mount, Bioinformatics, 2nd Ed.

    Write down the number of hits for steps b-i.

    Copy the search results into SeqLab for further manipulation.

    See details on course web page.