
DNA Sequence Analysis and Fragment Assembly System (FAS)


Page 1: DNA Sequence Analysis and Fragment Assembly         System (FAS)

DNA Sequence Analysis and Fragment Assembly System (FAS)

Lecture 92-06

Page 2: DNA Sequence Analysis and Fragment Assembly         System (FAS)

The Organization of a Eukaryotic Gene

[Figure: three stacked diagrams linked by the steps Transcription and Processing.
GENE: Enhancer, Promoter, Exon 1, Intron, Exon 2, Intron, Exon 3, Intron, Exon 4.
mRNA transcript (5' to 3'): 5'-untranslated region, Exon 1, Intron, Exon 2, Intron, Exon 3, Intron, Exon 4, 3'-untranslated region, Poly(A) signal.
Mature mRNA (5' to 3'): 7-mG cap, Exon 1, Exon 2, Exon 3, Exon 4, (AAAAAA)n, with the start and stop codons marked.]

Page 3: DNA Sequence Analysis and Fragment Assembly         System (FAS)

Gene identification involves 4 main stages

(1) Find non-coding features of interest in the sequence
    CpG islands
    Tandem and dispersed repeats
    Promoter regions (TATA box, cap signal, CCAAT-box)
    Transcription factors, Poly-A sites

(2) Find the putative coding region(s) in the sequence
    Open reading frame

(3) Determine the exon-intron organization
    5' and 3' splice sites: AG/GUAAGU--------------PyPyPyPyPyPyPyPy-CAG/G
    Branch point signal: CT(G,A)A(C,T)

(4) Identify the gene
    Motif, signal and pattern
    BLAST, FASTA
    Functional studies
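The splice-site and branch-point consensus strings on this slide can be searched for directly. The following is a minimal Python sketch, not the method used by any of the gene finders named in this lecture. The function name `scan_signals` and the regular-expression encodings are assumptions made for illustration: RNA "U" is read as DNA "T", and the "-" between the pyrimidine tract and CAG/G is read as "any base".

```python
import re

# Consensus patterns from the slide, written for DNA (U -> T).
# The regex encodings below are this sketch's assumption.
SIGNALS = {
    "branch_point": re.compile(r"CT[GA]A[CT]"),    # CT(G,A)A(C,T)
    "donor_site": re.compile(r"AGGTAAGT"),         # AG/GUAAGU
    "acceptor_site": re.compile(r"[CT]{8}.CAGG"),  # (Py)8-CAG/G, "-" read as any base
}


def scan_signals(seq):
    """Return (name, 1-based start, matched text) for every signal hit."""
    seq = seq.upper().replace("U", "T")
    hits = []
    for name, pattern in SIGNALS.items():
        # Wrap the pattern in a lookahead so overlapping hits are reported too.
        for m in re.finditer(f"(?=({pattern.pattern}))", seq):
            hits.append((name, m.start() + 1, m.group(1)))
    return sorted(hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    demo = "CCAGGTAAGT" + "GGG" + "CTGAC" + "GG" + "T" * 8 + "A" + "CAGG" + "AA"
    for name, start, text in scan_signals(demo):
        print(f"{start:>4}  {name:<14} {text}")
```

Short consensus strings like these match by chance very often, so hits are candidates to be checked against other evidence, not predictions.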

Page 4: DNA Sequence Analysis and Fragment Assembly         System (FAS)

GENE FINDERS

Banbury Cross   http://igs-server.cnrs-mrs.fr/igs/banbury
FGENEH          http://genomic.sanger.ac.uk/gf/gf.shtml
GeneID          http://www1.imim.es/geneid.html
GeneMachine     http://genome.nhgri.nih.gov/genemachine
GeneParser      http://beagle.colorado.edu/_eesnyder/GeneParser.htl
GENSCAN         http://genes.mit.edu/GENSCAN.html
Genotator       http://www.fruitfly.org/_nomi/genotator/
GRAIL           http://compbio.ornl.gov/tools/index.shtml
GRAIL-EXP       http://compbio.ornl.gov/grailexp/
HMMgene         http://www.cbs.dtu.dk/services/HMMgene/
MZEF            http://www.cshl.org/genefinder
PROCRUSTES      http://www-hto.usc.edu/software/procrustes
RepeatMasker    http://ftp.genome.washington.edu/RM/RepeatMasker.html
Sputnik         http://rast.abajian.com/sputnik/

Page 5: DNA Sequence Analysis and Fragment Assembly         System (FAS)

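GCG and SeqWeb programs

Function                          Command
Sequence manipulation             Reverse
ORF searching                     Frames, Map, Translate
Mapping (restriction sites)       Map (-minc) (-maxc), Mapsort (-exclude) (-digest), Mapplot
Mapping (transcription factors)   Map tfsites

[The slide also marked each command with + or - in GCG and SeqWeb columns. The marks survive only as the unaligned strings "+++++++++++ +" and "++++++++--+ -".]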

Page 6: DNA Sequence Analysis and Fragment Assembly         System (FAS)

What to do next?

A prediction by these programs is just that: a prediction.

NEVER TRUST A COMPUTER!

Page 7: DNA Sequence Analysis and Fragment Assembly         System (FAS)

Exercise 92-06-1

Programs used in this exercise:
(1) Sequence manipulation: reverse
(2) ORF searching: frames, map, translate
(3) Mapping (restriction sites): map (-minc, -maxc), mapsort (-exclude, -digest), mapplot, plasmidmap
(4) Mapping (transcription factors): map (tfsites)

Sequences used in this exercise:
gb:z18853 (C. elegans mRNA for capping protein alpha subunit), cds: 10-858
gb:x03795 (Human mRNA for platelet derived growth factor A-chain, PDGF-A), cds: 388-1020
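To see what reverse-complementing, six-frame translation and ORF searching do to a sequence, here is a minimal Python sketch. It is not the GCG reverse, frames or translate code. The function names, the `min_codons` parameter and the rule "an ORF runs from ATG to the next in-frame stop" are assumptions made for illustration, and only the standard genetic code is used.

```python
# Standard genetic code, built from the usual TCAG-ordered amino-acid string.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A<->T, C<->G)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; codons with ambiguous bases become X."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def find_orfs(seq, min_codons=2):
    """Yield (strand, frame, start, end, protein) for ATG...stop ORFs.

    start and end are 1-based positions on the strand being read;
    end is the last base of the stop codon.
    """
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            protein = translate(s[frame:])
            pos = 0
            while True:
                start = protein.find("M", pos)
                if start == -1:
                    break
                stop = protein.find("*", start)
                if stop == -1:
                    break
                if stop - start >= min_codons:
                    yield (strand, frame + 1, frame + 3 * start + 1,
                           frame + 3 * stop + 3, protein[start:stop])
                pos = stop + 1


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATAGCC"):
        print(orf)
```

For the two exercise sequences, a check of this kind is whether an ORF is reported at the annotated cds coordinates (10-858 and 388-1020).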

Page 8: DNA Sequence Analysis and Fragment Assembly         System (FAS)

Fragment Assembly System (FAS)

Page 9: DNA Sequence Analysis and Fragment Assembly         System (FAS)

Fragment Assembly System (FAS)

Please download Ex92-06.exe. It contains:

Exercise92-06-2.doc (in-class exercise)
Gelassemble commands.doc and SeqED commands.doc (command reference)
Seq01.txt - seq10.txt (sequences for the exercise)

Page 10: DNA Sequence Analysis and Fragment Assembly         System (FAS)

Fragment Assembly System (FAS)

Assemble overlapping fragment sequences from a sequencing project.

(1) Store fragment sequences;
(2) Recognize overlapping sequences and create aligned assemblies, called contigs;
(3) Display, edit and output the contigs for further analysis.

[Figure: fragments 1 to 5 aligned into Contig 1 and Contig 2, each with a consensus sequence.]

A contig may not contain more than 1,650 fragments and may not be longer than 200,000 bases. No single fragment may be longer than 2,500 bases.
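The idea of joining overlapping fragments into a contig, together with the size limits quoted above, can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how FAS works internally: it only accepts exact suffix/prefix overlaps, whereas a real assembler tolerates mismatches and gaps. The function names are hypothetical. The default minimum overlap of 14 mirrors the GelMerge default shown later in this lecture.

```python
# Limits quoted on the slide.
MAX_FRAGMENTS_PER_CONTIG = 1650
MAX_CONTIG_LENGTH = 200_000
MAX_FRAGMENT_LENGTH = 2500


def overlap_length(left, right, min_overlap=14):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_pair(left, right, min_overlap=14):
    """Join two fragments on an exact overlap; return None if there is none."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


def within_limits(fragments, contig):
    """Check the fragment and contig size limits stated for FAS."""
    return (
        len(fragments) <= MAX_FRAGMENTS_PER_CONTIG
        and len(contig) <= MAX_CONTIG_LENGTH
        and all(len(f) <= MAX_FRAGMENT_LENGTH for f in fragments)
    )


if __name__ == "__main__":
    a, b = "ACGTACGTTGCA", "TTGCAGGGCCC"
    contig = merge_pair(a, b, min_overlap=4)
    print(contig, within_limits([a, b], contig))
```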

Page 11: DNA Sequence Analysis and Fragment Assembly         System (FAS)

GelStart        Begins a fragment assembly session by creating a new fragment assembly project or by identifying an existing project.

GelEnter        Enters fragment sequences into a fragment assembly project from your terminal keyboard, a digitizer, or existing sequence files.

GelMerge        Aligns the sequences in a fragment assembly project into assemblies called contigs.

GelAssemble     A multiple sequence editor for viewing and editing contigs assembled by GelMerge.

GelView         Displays the structure of the contigs in a fragment assembly project.

GelDisassemble  Breaks up the contigs in a fragment assembly project into single fragments.

Contig: mu26b

8 mu18b      +--------------------->
7 mu9        <---+
6 mu32       +--->
5 mu26       <----+
4 mu18       +---->
3 mu27       <--------------------------------+
2 mu26b      <------------------+
C CONSENSUS  <-----------------------------------------------+
             |----------|----------|----------|---------|---------|
             0         100        200        300       400

[Horizontal offsets of the bars were lost in extraction.]

Page 12: DNA Sequence Analysis and Fragment Assembly         System (FAS)

GelStart

Use GelStart to create a new project database for each sequencing project. For each new project, GelStart creates a new directory, named after the project, as a subdirectory of your current working directory.

gcg% gelstart -check

Minimal Syntax: % gelstart [-NAME=]MyProject -Default

Prompted Parameters:

-NEWproject                        begins a new sequencing project
-VECtors=GB:M13mp18,GB:SynpBR322   highlights specified sequences in GELENTER
-SITes=GAATTC,GGATCC               highlights specified patterns in GELENTER

Local Data Files: None

Optional Parameters:

-DELete      deletes a whole project!
-NOMONitor   suppresses the screen monitor
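The -SITes parameter asks for given patterns, here GAATTC (the EcoRI site) and GGATCC (the BamHI site), to be highlighted. How GelEnter renders the highlight is not described in these slides, so the Python sketch below only illustrates the idea: it locates the patterns and marks them by case. The function names `find_sites` and `highlight` are hypothetical.

```python
import re


def find_sites(seq, sites=("GAATTC", "GGATCC")):
    """Return sorted (1-based start, site) pairs for every occurrence."""
    seq = seq.upper()
    hits = []
    for site in sites:
        site = site.upper()
        # Lookahead so overlapping occurrences are found as well.
        for m in re.finditer(f"(?={re.escape(site)})", seq):
            hits.append((m.start() + 1, site))
    return sorted(hits)


def highlight(seq, sites=("GAATTC", "GGATCC")):
    """Show the sequence in lower case with matched sites in upper case."""
    chars = list(seq.lower())
    for start, site in find_sites(seq, sites):
        for i in range(start - 1, start - 1 + len(site)):
            chars[i] = chars[i].upper()
    return "".join(chars)


if __name__ == "__main__":
    print(find_sites("ttgaattcaaggatccgg"))
    print(highlight("ttgaattcaaggatccgg"))
```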

Page 13: DNA Sequence Analysis and Fragment Assembly         System (FAS)

SeqEd

SeqEd is an interactive editor for entering and modifying sequences and for assembling parts of existing sequences into new genetic constructs. You can enter sequences from the keyboard or from a digitizer.

  AGTCTTAGTCGATCGTAcTGCATRCGA
 ....|:.......:|.........i.......:.|.........|.........|.........|.........|..
 0        10        20        30        40        50        60        70
 "sample.seq" 27 nucleotides

Press <Ctrl>D to go from screen mode to command mode, and <Return> to go from command mode back to screen mode.

Page 14: DNA Sequence Analysis and Fragment Assembly         System (FAS)

Screen Mode

G, A, T, . . . - insert a sequence character
<Delete> - delete a sequence character
<Ctrl>H - delete a sequence character
/TAACG<Return> - find the next occurrence of TAACG (last pattern entered is the default)
1<Return> - move to start of the sequence
<Ctrl>E - move to end of the sequence
[n]<Right-arrow> - go ahead n characters
[n]<Left-arrow> - go back n characters
<Up-arrow> - go up to check sequence
<Down-arrow> - go down to original sequence
'markcharacter - go to marked position
37<Return> - go to position 37 (any positive integer)
< - go back 50 characters
> - go ahead 50 characters
<Ctrl>R - redraw the screen
<Ctrl>D - enter command mode

[n] is an optional numeric parameter.

Page 15: DNA Sequence Analysis and Fragment Assembly         System (FAS)

Command Mode

EDit seqname - get a new sequence file to edit
[n] Include [seqname] - insert another sequence [at position n] (SeqEd prompts for range and strand)
s,f Delete - delete a range of bases
[s] Check [/Blind] - check a range of bases [beginning at s]
37 - go to base 37
REDraw - redraw the screen
[n] COmment comment - insert a comment [at position n]
[n] COmment - enter comment editing mode [at position n]
[n] HEAding - edit documentary heading [at line n]
change - enter screen mode (<Return> is sufficient)
screen - enter screen mode (<Return> is sufficient)
OVERstrike - enter overstrike mode
INSert - enter insert mode
[n] Mark markcharacter - mark the sequence [at position n]
PERFect - require finds to be perfect matches
PROtein - set sequence type to PROTEIN
NUCleotide - set sequence type to NUCLEOTIDE
[s,f] Write [seqname] - write [a part of] the sequence to a file
DIGitizer - enter digitizer mode
RELoad - enter reload mode
ACCept - terminate reload mode
Help - show commands in screen and command modes
[s,f] EXit [seqname] - write [a part of] the sequence and quit
Quit - quit the editor without writing the sequence

[n] indicates an optional parameter. s and f are the start and finish positions of a range of interest.

Page 16: DNA Sequence Analysis and Fragment Assembly         System (FAS)

GelEnter

GelEnter is a sequence editor that accepts sequence data.

gcg% gelenter -check

Minimal Syntax: % gelenter [-INfile1=]mu*.seq

Prompted Parameters: None

Local Data Files:

set.keys (must be in your current working directory to be used)

Optional Parameters:

-ENTER=mu*.seq          enters existing files into the database
-STAden                 enters existing Staden format files into the database
-FASTA                  enters existing FASTA format files into the database
-SINGlecommand          automatically returns to screen mode after each command
-PERFect                sets find to search for perfect symbol matches
-VECtors=gb:synpbr322   highlights sequences from pBR322
-SITes=gaattc           highlights GAATTC patterns
-LANes=g,A,T,C          sets lane order for digitizer
-MINOverlap=10          sets minimum overlap length for Reload command
-PCTOverlap=95          sets stringency for the Reload command
-TOLerance=0.4          sets tolerance for digitizing ambiguity (0 to 1), with 1 being the most tolerant

Page 17: DNA Sequence Analysis and Fragment Assembly         System (FAS)

GelEnter

GelEnter accepts any valid GCG sequence character. Once you enter sequences into a project database, you can no longer edit them with GelEnter.

gcg2 21% gelenter seq02.dat

GelEnter adds fragment sequences to a fragment assembly project. It
accepts sequence data from your terminal keyboard, a digitizer, or
existing sequence files.

"seq02" 593 nucleotides

IUB/GCG   Meaning
A         A
C         C
G         G
T/U       T
M         A or C
R         A or G
W         A or T
S         C or G
Y         C or T
K         G or T
V         A or C or G
H         A or C or T
D         A or G or T
B         C or G or T
X/N       G or A or T or C
./~       gap character
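The table above translates directly into a lookup, which makes it easy to see how an ambiguity code can summarise a column of aligned bases. The Python sketch below uses a simple "union of all bases seen" rule; that rule is an assumption chosen for illustration, since these slides do not say how GelAssemble itself derives its consensus. The function names are hypothetical.

```python
# IUB/GCG codes from the table above (U is treated as T, X as N).
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}
BASES_TO_CODE = {frozenset(bases): code for code, bases in IUB_CODES.items()}
GAPS = {".", "~"}


def consensus_symbol(column):
    """Return the IUB code covering every base seen in one alignment column.

    Gap characters are ignored; a column of only gaps yields ".".
    Symbols outside the table raise KeyError.
    """
    bases = set()
    for symbol in column:
        symbol = symbol.upper()
        if symbol in GAPS:
            continue
        symbol = {"U": "T", "X": "N"}.get(symbol, symbol)
        bases.update(IUB_CODES[symbol])
    return BASES_TO_CODE[frozenset(bases)] if bases else "."


def consensus(aligned):
    """Consensus of equal-length aligned sequences, column by column."""
    return "".join(consensus_symbol(column) for column in zip(*aligned))


if __name__ == "__main__":
    print(consensus(["ACGT.", "ACAT.", "ATGTA"]))
```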

Page 18: DNA Sequence Analysis and Fragment Assembly         System (FAS)

GelMerge

GelMerge automatically recognizes overlaps among all of the sequences in a project database and creates aligned assemblies, called contigs, from the overlapping sequences. These contigs are stored in the project database. As you add new sequences that connect separate contigs to the project database, GelMerge aligns the contigs into larger assemblies.

% GelMerge

What word size (* 7 *) ?
What fraction of the words in an overlap must match (* 0.80 *) ?
What is the minimum overlap length (* 14 *) ?

Reading ............
Comparing ............
Aligning .........
Writing ...

Input Contigs: 12
Output Contigs: 3

CPU time: 02.29 (seconds)
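The three prompts (word size, fraction of matching words, minimum overlap length) suggest a word-based overlap test. The Python sketch below shows one possible reading of how such parameters could interact. It is an assumption for illustration, not GelMerge's algorithm: it compares ungapped overlaps only, and the function names are hypothetical. The defaults are the ones shown in the session above.

```python
def word_match_fraction(a, b, word_size=7):
    """Fraction of word positions at which two equal-length strings agree
    over a full word of `word_size` bases."""
    n_words = len(a) - word_size + 1
    if len(a) != len(b) or n_words <= 0:
        return 0.0
    matches = sum(
        a[i:i + word_size] == b[i:i + word_size] for i in range(n_words)
    )
    return matches / n_words


def best_overlap(left, right, word_size=7, stringency=0.80, min_overlap=14):
    """Longest ungapped suffix/prefix overlap that passes the word test.

    Returns the overlap length, or 0 if no overlap qualifies.
    """
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        fraction = word_match_fraction(left[-size:], right[:size], word_size)
        if fraction >= stringency:
            return size
    return 0


if __name__ == "__main__":
    shared = "ACGTTGCAAGCTTAGCCATGA"
    print(best_overlap("GGGGG" + shared, shared + "TTTTT"))
```

A single mismatch in the middle of an overlap breaks every word that spans it, so a word-based test is strict for short overlaps and more forgiving for long ones.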

Page 19: DNA Sequence Analysis and Fragment Assembly         System (FAS)

Minimal Syntax: % gelmerge -Default

Prompted Parameters:

-WORdsize=7       sets word size for overlap determination
-STRIngency=0.8   sets minimum fraction of matching words in overlap
-MINOverlap=14    sets minimum length of overlap

Local Data Files:

-MATRix1=gelmergedna.cmp        assigns the scoring matrix for contig assembly
-MATRix2=gelmergelocaldna.cmp   assigns the scoring matrix for vector recognition

Optional Parameters:

-MINIdentity=14          sets minimum run of identical bases found at least once in an overlap between two contigs
-MAXGap=10               sets maximum gap size for overlap determination
-GAPweight=8             sets gap creation penalty in contig assembly
-LENgthweight=2          sets gap extension penalty in contig assembly
-ARChive                 creates contigs from the original gel readings
-WORKing                 creates contigs from individual working fragment (with gaps removed)
-REPortfile[=Filename]   writes report of recognized vector sequences
-EXCise                  removes vector sequences from single-fragment contigs
-VECTORSTrigency=0.8     sets minimum fraction of matches in vector recognition
-VECTORMINIdentity=12    sets minimum run of identical bases found at least once in a match between vector and fragment
-VECTORMAXGap=5          sets maximum gap size in first step of vector recognition
-VECTORGAPweight=30      sets gap creation penalty in vector recognition
-VECTORLENgthweight=3    sets gap extension penalty in vector recognition
-NOMERge                 suppresses contig assembly
-NOMONitor               suppresses screen trace of program progress
-NOSUMmary               suppresses screen summary at the end of the program
-BATch                   submits program to the batch queue

Page 20: DNA Sequence Analysis and Fragment Assembly         System (FAS)

GelAssemble

After assembling contigs with GelMerge, use the contig editor, GelAssemble, to review and modify the alignments. After choosing a contig for review, GelAssemble lets you edit the individual sequences in that contig to resolve inconsistencies. GelAssemble creates a consensus sequence that uses the IUB nucleotide ambiguity codes. You can modify a sequence and change the alignment in the same way you edit text with a text editor.

Although GelMerge assembles and aligns contigs automatically, you can assemble contigs manually using GelAssemble. For example, you could manually assemble separate contigs that do not share sufficient overlap for GelMerge to assemble automatically. You can also separate fragments from a contig if you believe they should not be included. Once you are satisfied with a contig, you can store it in the sequencing project database.

seq03     > GTTCATCAGTCTTGGTGGAGAAGTTCGACAGATGCCATTGGCAGATTTCACCGATGGTTC 220
seq01     > GTTCATCAGTCTTGGTGGAGAAGTTCGACAGATGCCATTGGCAGATTTCACCGATGGTTC 540
CONSENSUS > GTTCATCAGTCTTGGTGGAGAAGTTCGACAGATGCCATTGGCAGATTTCACCGATGGTTC 540
            .........+.........+.........+.........+.........+.........+

Press <Ctrl>D to go from screen mode to command mode, and <Return> to go from command mode back to screen mode.

Page 21: DNA Sequence Analysis and Fragment Assembly         System (FAS)

GelAssemble Screen Mode

Keys Pressed        Action
[n]<Right-arrow>    move ahead [n bases]
[n]<Left-arrow>     move back [n bases]
[n]<Up-arrow>       move up [to row n]
[n]<Down-arrow>     move down [to row n]
>                   scroll one screen to the right
<                   scroll one screen to the left
1<Return>           move to start of the sequence
<Ctrl>E             move to end of the sequence
165<Return>         move to base 165 in sequence
/GATTC<Return>      find next occurrence of GATTC
<Ctrl>A             move to next ambiguity in alignment
<Ctrl>R             move to next ambiguity in sequence
<Ctrl>V             move to next gap in consensus
<Ctrl>D             enter Command Mode
<Ctrl>L             toggle alignment display enlargement
<Ctrl>W             redraw the screen
<Ctrl>O             toggle INSERT/OVERSTRIKE mode
!                   summary of current sequence
?                   display these help screens
<Ctrl>G             recalculate the consensus
G A T C ....        add base at the cursor
<Delete>            delete a base, or move sequence left
<Ctrl>H             delete a base, or move sequence left
<Space bar>         move the sequence to the right
<Ctrl>X             delete alignment column
<Ctrl>I             restore alignment column
<Ctrl>B             begin selecting a range for removal
<Ctrl>N             remove the selected range
<Ctrl>P             insert the removed range
-                   reject current fragment

Page 22: DNA Sequence Analysis and Fragment Assembly         System (FAS)

GelAssemble Command Mode

[a,b] specifies a range of fragments.
[x,y] specifies a range of bases.
[n] is an optional numeric parameter.

EDit [ContigName]             replace current contig with a new contig
CONTIGs                       select another contig for editing
WRite                         write a contig to the database
EXit                          write the contig and quit
QUIT                          quit without writing
ERASE                         delete current contig from the database
238                           move to position 238 in the current fragment
[x,y] PRETTYout [FileName]    write the sequence alignment [position x - y]
[a,b] SEQOUT                  write fragments [a - b] to sequence files
BIGPICture [FileName]         write bar schematic to an output file
OVERstrike                    select OVERSTRIKE sequence edit mode
NOOVERstrike                  select INSERT sequence edit mode
[x,y] CONSensus               recalculate the consensus sequence
[a,b] LOCk                    lock strands [a through b]
[a,b] Unlock                  unlock strands [a through b]
[x,y] SELect                  select bases [x through y]
REMove                        remove the selected bases
[n] INSert                    insert the removed bases [at position n]
CAncel                        cancel the selection

Page 23: DNA Sequence Analysis and Fragment Assembly         System (FAS)

[x,y] DElete                  delete bases [x through y]
GOTo [FragmentName]           move to strand by name
FInd GAATC                    find the next occurrence of GAATC
DIfferences                   show differences from the consensus
MAtches                       show matches with the consensus
Neither                       show neither matches nor differences
REDraw                        redraw the screen
Help                          display these help screens
SORt [DEScending]             sorts strands by their offsets in alignment
[a,b] MOve                    moves a strand [from line a to line b]
OPen                          opens a blank line at the cursor position
[a,b] ANChor                  anchors strands [a through b]
[a,b] NOANchor                unanchors strands [a through b]
LOad [ContigName]             loads another contig into the Edit Screen
REVerse                       reverse-complement the (anchored) strand(s)
[n] Offset                    shifts the current fragment [to begin at n]
REJect                        removes the current fragment from the screen
NODUPlicate                   removes a duplicated fragment from the screen
SPAWN                         renames a duplicated fragment
SEParate                      makes two contigs from anchored and unanchored strands

Page 24: DNA Sequence Analysis and Fragment Assembly         System (FAS)

GelView

GelView displays bar diagrams that show the overlaps among the fragments in each contig, providing a schematic view of the whole sequencing project.

GelView writes its display to a file (filename.view), which you can read with cat or more (cat/more filename.view).

GELVIEW Fragment Assembly contig display of Project: bio May 4, 2000 17:42

Contig: seq01

3 seq03      +------------------->
2 seq01      +----------------------------->
C CONSENSUS  +------------------------------------>
             |----------|----------|----------|---------|---------|
             0         200        400        600       800

Contig: seq04

3 seq02      <---------------+
2 seq04      +------------>
C CONSENSUS  +--------------------------->
             |----------|----------|----------|---------|---------|
             0         400        800        1200      1600

Contig: seq05

2 seq05      +---------------------------->
C CONSENSUS  +---------------------------->
             |----------|----------|----------|---------|---------|
             0         200        400        600       800

[Horizontal offsets of the bars were lost in extraction.]

5 Fragments in 3 Contigs
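A bar diagram of this kind is easy to generate once each fragment's position and orientation in the contig are known. The Python sketch below imitates the layout shown above. It is not the GelView program: the coordinate convention (0-based start, end, strand), the `scale` parameter, the function names and the choice to always draw the consensus as a forward bar are assumptions made for illustration.

```python
def bar(start, end, strand, scale):
    """Draw one fragment as +---> (forward) or <---+ (reverse)."""
    left = start // scale
    width = max(2, (end - start) // scale)
    body = "-" * (width - 2)
    arrow = "+" + body + ">" if strand == "+" else "<" + body + "+"
    return " " * left + arrow


def render_contig(name, fragments, scale=20):
    """Render a contig from a non-empty list of (name, start, end, strand).

    Rows are numbered downwards to 2, and the consensus row is labelled C,
    as in the display above.
    """
    label_width = max([len("CONSENSUS")] + [len(f[0]) for f in fragments])
    lines = [f"Contig: {name}", ""]
    rows = range(len(fragments) + 1, 1, -1)
    for row, (frag, start, end, strand) in zip(rows, fragments):
        lines.append(
            f"{row} {frag:<{label_width}} {bar(start, end, strand, scale)}"
        )
    first = min(f[1] for f in fragments)
    last = max(f[2] for f in fragments)
    lines.append(
        f"C {'CONSENSUS':<{label_width}} {bar(first, last, '+', scale)}"
    )
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_contig("seq01", [("seq03", 0, 100, "+"),
                                  ("seq01", 40, 200, "+")]))
```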

Page 25: DNA Sequence Analysis and Fragment Assembly         System (FAS)

GelDisassemble

GelDisassemble breaks up the contigs in a sequencing project, thus recreating the database as a collection of single fragments.

% geldisassemble

Are you sure you want to disassemble your project (* No *) ? Yes

1) Emptying "relation" directory....

2) Emptying "consensus" directory....

3) Copying "working" to "consensus"....

4) Creating "relation"....

Gel Project Disassembled

Page 26: DNA Sequence Analysis and Fragment Assembly         System (FAS)

Exercise 92-06-2

Download Ex92-06.exe and decompress the file. It contains:

Exercise 92-06-2.doc (in-class exercise)
Gelassemble commands.doc and SeqED commands.doc (command reference)
Seq01.txt - seq10.txt (sequences for the exercise)

Start GCG FAS

Questions:
(1) What is the correct order of the assembled sequence?
(2) Which putative protein does this sequence encode?
(3) Are there any potential regulatory elements upstream of the gene?
(4) What is the identity with the human protein?

Page 27: DNA Sequence Analysis and Fragment Assembly         System (FAS)

The Complete Guide to Bioinformatics Analysis

Page 28: DNA Sequence Analysis and Fragment Assembly         System (FAS)

Are there any standard procedures?

Gene-mRNA-Protein
Protein-mRNA-Gene
mRNA-Protein-Gene

Download Bioinfo91-08.exe and decompress the file. You will find the following files in FASTA format:

Gene.txt RNA.txt Protein.txt

Page 29: DNA Sequence Analysis and Fragment Assembly         System (FAS)

Gene-mRNA-Protein

FILE PROCESSING (Trace File Viewer & Format Converter)

OPEN READING FRAME: DNA, RNA; Reverse or Directional

HOMOLOGY SEARCH: FASTA, BLASTn, BLASTx

MOTIF SEARCH

ALIGNMENT: Bestfit, gap, pileup

RESTRICTION MAPPING

Secondary structure