Lecture 2102

  • 7/27/2019 Lecture 2102

    1/42

    Lecture 2

    DNA sequencing

    Whole genome sequencing

    Sequence data formats

    SeqLab

    DNA Sequencing

    Sanger's method was invented in 1977. It uses purified template DNA, a DNA primer, deoxynucleotide triphosphates (dNTPs), di-deoxynucleotide triphosphates (ddNTPs), enzymes (Taq polymerase), and gel electrophoresis.

    primer

    = a piece of RNA or DNA whose purpose is

    to allow the enzyme DNA polymerase to

    make new DNA. It "primes" the reaction.

    DNA Sequencing

    [Figure: sugar-phosphate backbone of the growing strand, comparing extension after a dNTP with blocked extension (X) after a ddNTP.]

    Taq polymerase adds the complementary nucleotide (normal or di-deoxy) to the 3' end of the growing strand. If the 3' end has no 3'-OH group, then growth stops.

    Results of the polymerization reaction:

    3' GCCGATCTAGAAATCTAAGAGGAGAG 5'   template

    5' AGGCTAGATCTTTAGATTCTCCTCTC 3'   reverse complement

    ddCTP-terminated fragments:

    5' AGGC*

    5' AGGCTAGATC*

    5' AGGCTAGATCTTTAGATTC*

    5' AGGCTAGATCTTTAGATTCTC*

    5' AGGCTAGATCTTTAGATTCTCC*

    5' AGGCTAGATCTTTAGATTCTCCTC*
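    The fragment set for the ddCTP reaction can be reproduced with a short sketch. It assumes an idealized reaction in which every C in the newly synthesized strand yields one terminated fragment. The function name `terminated_fragments` is illustrative and not part of the lecture.

```python
def terminated_fragments(new_strand, dd_base):
    """Return every prefix of new_strand (written 5'->3') that ends in dd_base."""
    return [
        new_strand[: i + 1]
        for i, base in enumerate(new_strand)
        if base == dd_base
    ]


# Newly synthesized strand from the slide, written 5'->3'.
new_strand = "AGGCTAGATCTTTAGATTCTCCTCTC"

for fragment in terminated_fragments(new_strand, "C"):
    print(fragment + "*")  # "*" marks the terminating ddC
```

    Under this assumption the sketch also reports the full-length strand, because its last base is a C.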

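    Separating fragments on a gel

    Gel (or capillary) electrophoresis separates the fragments by length. Each fragment carries charge in proportion to its length, so the pull per base is the same for all fragments; the gel matrix slows larger fragments more, so smaller fragments travel farther.

    Now the sequence can be read from the gel. Top to bottom is the reverse complement sequence. Bottom to top is the template sequence if we switch the labels of the lanes to the complement bases.

    [Figure: gel with four lanes labeled ddA, ddT, ddC, ddG; larger fragments at the top, smaller fragments at the bottom; bands marked A, G, G, C, T, A, G.]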
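    Reading a four-lane gel can be sketched as sorting bands by fragment length. This is a toy model, assuming one clean band per base and no resolution limit; `run_lanes` and `read_gel` are hypothetical names. Reading from the shortest fragment to the longest recovers the synthesized strand 5' to 3'.

```python
def run_lanes(new_strand):
    """Map each ddNTP lane to the fragment lengths that terminate in that base."""
    lanes = {base: [] for base in "ATCG"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read bands from shortest to longest fragment, noting the lane of each band."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


lanes = run_lanes("AGGCTAG")   # first seven bases of the synthesized strand
print(lanes)                   # band positions (fragment lengths) per lane
print(read_gel(lanes))         # the strand recovered from the band order
```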

    DNA Sequencing

    Gels can resolve around 500 bases.

    The dNTP/ddNTP ratio is optimized so that ~1 ddNTP is incorporated every ~500 bases.

    Fragments are visualized by

    (1) Using 32P-labeled dNTPs in the primer. 32P gives off beta radiation, which is captured when the gel is laid out on a piece of film. Beta radiation won't pass through glass, so the gel must be removed from the gel apparatus.

    (2) Using fluorescent dyes attached to the primer. The fluorescence is detected by scanning the gel or capillary. No need to remove the gel.

    Sequence trace

    Each color is one lane of an electrophoresis gel.

    base-calling

    "I call this a G"

    "N" = no call, trace

    not good enough to

    call the base.
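    The idea of calling a base or falling back to "N" can be sketched as follows. This is a deliberately naive rule, not the algorithm used by any real base caller: it assumes the trace has already been reduced to one intensity per channel at each peak, and the `min_ratio` threshold is an arbitrary illustrative value.

```python
def call_bases(peaks, min_ratio=2.0):
    """Call the strongest channel at each peak, or "N" if it is not clearly dominant.

    peaks: list of dicts mapping base -> signal intensity at that peak.
    min_ratio: how many times stronger the best channel must be than the runner-up.
    """
    calls = []
    for peak in peaks:
        ranked = sorted(peak.items(), key=lambda item: item[1], reverse=True)
        (best_base, best), (_, runner_up) = ranked[0], ranked[1]
        if best > 0 and best >= min_ratio * runner_up:
            calls.append(best_base)
        else:
            calls.append("N")  # trace not good enough to call the base
    return "".join(calls)


# Made-up intensities: a clean A, an ambiguous A/C peak, a clean G.
example = [
    {"A": 90, "C": 5, "G": 3, "T": 2},
    {"A": 40, "C": 38, "G": 2, "T": 1},
    {"A": 1, "C": 2, "G": 80, "T": 3},
]
print(call_bases(example))
```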

    "phred" (by Phil Green) is the most widely used base-calling program.

    Naive strategy for sequencing long pieces of DNA

    Order a primer to start the sequencing process.

    Sequence ~500 bases.

    Design a primer based on the last good part of the sequence.

    Order that primer. (wait a few weeks for delivery)

    Repeat.

    Time required to sequence a small genome (E. coli) this way if delivery takes one week: 9200 weeks = 176 years
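    The 9200-week figure follows from simple arithmetic. The sketch below assumes a 4.6 Mb genome (the size the lecture uses in its 10-fold redundancy example), 500 bases per read, one primer delivery per read, and 52 weeks per year.

```python
genome_size = 4_600_000   # bases; assumed E. coli genome size (4.6 Mb)
read_length = 500         # bases resolved per sequencing run
weeks_per_round = 1       # one primer delivery per round

rounds = genome_size // read_length   # one primer-walking step per read
weeks = rounds * weeks_per_round
years = weeks / 52

print(rounds, "rounds,", weeks, "weeks,", int(years), "full years")
```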

    Whole Genome Sequencing

    Different strategies have been tried. We will discuss one: Shotgun Sequencing.

    Start with purified DNA.

    Shear it into random sized fragments.

    Clone fragments into yeast artificial chromosomes (YAC) or bacterial artificial chromosomes (BAC). Grow these up.

    Sequence the ends of these fragments, and determine the total fragment size. (These pairs of end sequences are called BAC ends or YAC ends.)

    Assemble the fragments.

    The Sequence Assembly Problem

    Imagine several copies of a book are cut by scissors into 10 million small pieces. Each copy is cut in an individual way so that a piece from one copy may overlap a piece from another copy. Assume that 1 million pieces are lost and the remaining 9 million are splashed with ink.

    Try to recover the original text.

    Sequence assembly strategy

    Sequence at least 10 times as much DNA as is contained in the genome. For example, if the genome has 4.6 Mb (mega-bases), then sequence 46 Mb. This is called "10-fold redundancy".

    Find all overlapping sequences. (Sometimes the overlap is ambiguous.)

    If the overlap is ambiguous on one end of the BAC or YAC, the ambiguity can be resolved using the other end.

    Errors in assembly can still occur in highly repetitive regions of the genome (such as near the centromeres).
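    The redundancy arithmetic can be written out as a small sketch. The 500-base read length is an assumption carried over from the gel resolution quoted earlier in the lecture; the read count is only a rough estimate.

```python
genome_size = 4_600_000   # bases (4.6 Mb)
redundancy = 10           # "10-fold redundancy"
read_length = 500         # assumed bases per read

bases_to_sequence = genome_size * redundancy
reads_needed = bases_to_sequence // read_length

print(bases_to_sequence / 1_000_000, "Mb to sequence")
print(reads_needed, "reads of", read_length, "bases")
```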

    Bacterial Artificial Chromosomes

    TTAGCTGATACAGGGGCTCAAA

                GGGGCTCAAAGTGTCACACATTCA

    Size of BAC is known, so distance between BAC ends is known

    Only 500 bp on the ends of each BAC are sequenced

    Relative position of BACs is determined by sequence overlap
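    Positioning two reads by sequence overlap can be sketched with an exact suffix/prefix match, using the two sequences from this slide. This is a minimal illustration with hypothetical function names; real assemblers must also tolerate sequencing errors and repeats, which this sketch ignores.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge(left, right, min_overlap=3):
    """Join two reads at their overlap, or return None if they do not overlap."""
    k = overlap_length(left, right, min_overlap)
    return left + right[k:] if k else None


read_a = "TTAGCTGATACAGGGGCTCAAA"
read_b = "GGGGCTCAAAGTGTCACACATTCA"

print(overlap_length(read_a, read_b))   # number of shared bases
print(merge(read_a, read_b))            # the two reads joined into one sequence
```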

    Alignment

    Contig

    Throughout a "contig" there is a continuous tiling of overlapping fragments with no gaps in the data.

    [Figure: the end of one contig and the start of the next contig, separated by a "no data" zone.]
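    Once fragments have been placed on a common coordinate, grouping them into contigs amounts to merging overlapping intervals. The sketch below assumes each fragment is given as a half-open (start, end) position pair; the coordinates are made up for illustration.

```python
def contigs(fragments):
    """Merge overlapping (start, end) fragments into contigs.

    Fragments are half-open intervals; two fragments belong to the same
    contig only if they share at least one base.
    """
    merged = []
    for start, end in sorted(fragments):
        if merged and start < merged[-1][1]:
            # Overlaps the current contig: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # No overlap: a "no data" zone separates this from the previous contig.
            merged.append((start, end))
    return merged


placed = [(1000, 1500), (0, 500), (300, 800), (1400, 1900)]
print(contigs(placed))
```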

    Sources of sequence data

    Organization   Location                             Database

    NCBI           Bethesda, MD (near Washington, DC)   GenBank

    EMBL           Heidelberg, Germany                  EMBL

    NIG            Shizuoka-ken, Japan                  DDBJ

    Members of the International Nucleotide Sequence Database Collaboration

    Web sites:

    NCBI   www.ncbi.nlm.nih.gov

    EMBL   www.embl-heidelberg.de

    DDBJ   www.ddbj.nig.ac.jp

    NCBI tour

    Log onto your modlab (client) machine.

    Start a web browser.

    Set the browser to www.ncbi.nlm.nih.gov and follow along.

    "machine readable" files...

    are keyworded

    have space-delimited fields

    contain special characters like /, :, =, {}, etc. (/product)

    contain database identifiers and accession numbers (gi:123456789)

    sometimes have a checksum, to guard against corruption.

    Sequence files are machine readable. Generally it is better to let the machine do the reading.
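    Letting the machine do the reading might look like the sketch below. The record is a made-up fragment loosely styled after a keyworded flat file, not a complete or valid database entry, and `parse_record` is a hypothetical helper that handles only the two patterns shown: keyword lines and /qualifier="value" lines.

```python
import re

# Hypothetical record; the identifiers are placeholders, not real entries.
RECORD = '''\
LOCUS       EXAMPLE01    12 bp    DNA
ACCESSION   X00000
FEATURES
     CDS    1..12
            /product="hypothetical protein"
            /db_xref="GI:123456789"
'''


def parse_record(text):
    """Split a keyworded record into top-level fields and /qualifier values."""
    fields, qualifiers = {}, {}
    for line in text.splitlines():
        match = re.match(r'\s*/(\w+)="([^"]*)"', line)
        if match:
            qualifiers[match.group(1)] = match.group(2)
        elif line and not line[0].isspace():
            # Keyword starts in column 1; the rest of the line is its value.
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
    return fields, qualifiers


fields, qualifiers = parse_record(RECORD)
print(fields["ACCESSION"])
print(qualifiers["product"])
```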

    [Figure: one server connected to several clients.]

    The prompt indicates which machine you are talking to.

    SCP a file to the server

    Send the file "lotsofjunk" to bioinf45.bio.rpi.edu as follows:

    scp lotsofjunk bio4540n@bioinf45.bio.rpi.edu:lotsofjunk

    where n is the same as the number of your modlab machine (i.e. if you are on modlab16, use bio454016).

    Enter your password: (see whiteboard)

    Start SeqLab on the server

    client_prompt>ssh -X bio4540n@bioinf45.bio.rpi.edu

    where n is the same as the number of your modlab machine (i.e. if you are on modlab16, login as bio454016).

    password: ******* (see whiteboard)

    server_prompt>seqlab &

    (The trailing & runs seqlab in the background.)

    Intro to SeqLab

    This introduction covers only the basics. Check out the SeqLab Guide for detailed descriptions.

    MAIN MENU: File  Edit  Functions  Extensions  Options  Windows

    File: Open, Save, Import, Export, Print, etc.

    Functions: hundreds of functions; some of these are button functions

    Extensions: add-ons

    Options: change your working directory and fonts

    Windows: show other display windows

    Two SeqLab modes:

    Editor: the editor window, where the action is.

    Main list: a list of available files for editing.

    Basic SeqLab operations

    Get a sequence from the local database:

    File-->Add Sequence From-->Databases

    (only GenEMBL database is actually present)

    Select one. Add to Main Window.

    Get a sequence from a file:

    File-->Import

    Use "Filter" button and line to view the files you want.

    SeqLab recognizes GenBank format.

    Basic SeqLab operations

    Select a sequence: click on sequence name.

    Select a region of a sequence: click and drag. Or click on the start, then shift-click on the end.

    Move a sequence: select a position and hit space or delete. (Protections may have to be set. Hit the lock icon.)

    Cut, copy, paste: use icons. (There are two copy buffers, one for whole sequences, one for regions. SeqLab will ask you which one you want.)

    Create a new sequence: File-->New Sequence (select DNA, RNA, or protein). Click on blank sequence (~) and type or paste.

    SeqLab windows

    Job manager -- use this to check the status of a job

    request, or to kill a job that is taking too long.

    Output manager -- find the output of a job.

    Database browser -- find a sequence if you know

    the database and sequence identifier/accession

    number.

    Traces -- look at trace files

    Features -- locate annotated sequence features such

    as ORFs, SNPs. Long sequences can be color-cod