Data
Sequences
and
Other Stuff
Sequence Data
Nucleic Acid and Protein Sequences
Sources of Genetic Sequences
User
GCG supplied databases
  Flat File
  Oracle Relational Database
NCBI supplied databases
Other databases
Sequence Databases
Genbank EMBL DDBJ
NCBI PIR Swiss-Prot Swiss-Prot TrEMBL
Genbank
Primary nucleic acid sequence database
Maintained by NCBI
  National Center for Biotechnology Information
  http://www.ncbi.nlm.nih.gov
Current Release 122, 2/2001
  11,720,120,326 bases
  10,896,781 sequences
Species      1995   1996   1997   1998   1999   2000   2001   Increase       Increase
                                                               (since 1995)   (12 months)
all:        16109  23119  32880  43516  61952  87751  95168   490%           40.9%
Viruses:     1845   2122   2678   2968   3573   4428   4857   163%           32.4%
Bacteria:    2939   3847   6091   8711  14322  22758  24878   746%           53.3%
Archaea:      162    235    385    555   1015   1709   1906   1076%          68.8%
Eukaryota:  10366  15901  22596  29926  41420  56961  61571   493%           37.4%
How Many Organisms Are In The Sequence Databases?(April 1, 2001)
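The "Increase (since 1995)" figures appear to be the relative growth from the 1995 count to the 2001 count, rounded down to a whole percent. A minimal sketch of that arithmetic follows; the function name and the dictionary layout are illustrative, and the 12-month column is not recomputed because the counts it is based on are not shown in the table.

```python
def percent_increase(old: int, new: int) -> int:
    """Relative growth from old to new, as a whole percent rounded down."""
    if old <= 0:
        raise ValueError("the starting count must be positive")
    return (new - old) * 100 // old


# (1995 count, 2001 count) for each row of the table above.
species_counts = {
    "all": (16109, 95168),
    "Viruses": (1845, 4857),
    "Bacteria": (2939, 24878),
    "Archaea": (162, 1906),
    "Eukaryota": (10366, 61571),
}

for group, (count_1995, count_2001) in species_counts.items():
    print(f"{group:<10} {percent_increase(count_1995, count_2001)}%")
```

For the "all" row this is (95168 - 16109) / 16109, or roughly 4.9 times the 1995 count, which is reported as 490%.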
Other NCBI Databases
HTGS EST STS GSS RefSeq Unigene Genomic
HTGS
High Throughput Genomic Sequences
'Unfinished' DNA sequences generated by the high-throughput sequencing centers
Phase 0
  Single-few pass reads of a single clone (not contigs)
Phase 1
  Unfinished, may be unordered, unoriented contigs, with gaps
Phase 2
  Unfinished, ordered, oriented contigs, with or without gaps
Phase 3
  Primary division (Genbank)
  Finished, no gaps (with or without annotations)
EST
Expressed Sequence Tags
"Single-pass" cDNA sequences
Generally representative of the 3' ends of cDNAs
More "full-length" ESTs now available
STS
Sequence Tagged Sites
Sequence and mapping data
Short genomic landmark sequences
GSS
Genome Survey Sequences
Similar to the EST division, except that its sequences are genomic in origin, rather than cDNA
  Random "single pass read" genome survey sequences
  Cosmid/BAC/YAC end sequences
  Exon trapped genomic sequences
  Alu PCR sequences
RefSeq
NCBI Reference Sequence project
Provides reference sequence standards for the naturally occurring molecules, from chromosomes to mRNAs to proteins
Stable reference point for:
  mutation analysis
  gene expression studies
  polymorphism discovery
RefSeq…
Curated RefSeq: transcripts and proteins
Genome Annotation: contigs, transcripts, and proteins
Complete Genomes: genomes, chromosomes, and proteins
Unigene
Experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters
Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location.
Includes EST and cDNA sequences
Includes human, rat, mouse, cow and zebrafish
HomoloGene
Curated and calculated orthologs and homologs for genes represented in UniGene and LocusLink
Includes human, mouse, rat, zebrafish, cow and Drosophila
LocusLink
Provides a single query interface to curated sequence and descriptive information about genetic loci
  Nomenclature
  Aliases
  Sequence accessions
  Phenotypes
  EC numbers
  MIM numbers
  UniGene clusters
  Homology
  Map locations
  Web sites
EMBL and DDBJ
European Molecular Biology Laboratory Hinxton, UK http://www.ebi.ac.uk/
DNA Data Bank of Japan Mishima, Japan http://www.ddbj.nig.ac.jp/
Coordination with Genbank
Prevents duplication
Genbank enters sequences from U.S. journals and researchers
EMBL handles European data
DDBJ handles Asian data
Data exchanged daily
Sequence submissions
Sequences entered from journals
Sequences submitted by individual researchers
  BankIt
    NCBI WWW Site
  Sequin
    Multi-platform program
Sequence Names
DO NOT rely on names to find particular sequences
Few conventions
Organism
  Hum: Human
  Mus: mouse
  Eco: E. coli
  Syn: synthetic
Last Letter(s)
  Sometimes gives useful information
  cg: Complete genome
    Viruses
Other Letters
  Specifies a particular sequence
vsvcg
  Vesicular stomatitis virus (Indiana serotype) complete genome
EMBL File Names
Ec: E. coli
Hs: Human
Locus name
Names are short, fairly non-descriptive, and can change from one release to another
vsvcg
  The complete sequence for the virus VSV
Most "mnemonic" names already taken
Genbank now using accession numbers as locus names
Accession Numbers
Each sequence submitted to a database is assigned a unique primary accession number
Accession numbers do not change
If a sequence is merged with another, a new accession number is assigned, and the original number becomes a secondary accession number
Accession numbers may include version numbers
  A02428.2
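As a rough illustration of the accession-plus-version form, the sketch below splits an identifier into its stable accession and its optional version suffix. The pattern (letters followed by digits, then an optional ".N") and the names `Accession` and `parse_accession` are assumptions made for this example; M28668 is the accession of the cystic fibrosis record shown later, and the ".2" suffix is invented for illustration.

```python
import re
from typing import NamedTuple, Optional


class Accession(NamedTuple):
    base: str                # stable accession number, e.g. "M28668"
    version: Optional[int]   # None when no version suffix is present


# Assumed shape: letters, then digits, then an optional ".<version>".
_ACCESSION = re.compile(r"^([A-Za-z]+\d+)(?:\.(\d+))?$")


def parse_accession(text: str) -> Accession:
    """Split 'M28668.2' into ('M28668', 2); 'M28668' gives ('M28668', None)."""
    match = _ACCESSION.match(text.strip())
    if match is None:
        raise ValueError(f"not an accession number: {text!r}")
    base, version = match.groups()
    return Accession(base.upper(), int(version) if version else None)


print(parse_accession("M28668.2"))
print(parse_accession("m28668"))
```

Keeping the version separate makes it easy to compare two identifiers by accession alone, which is the part that does not change.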
Accession Numbers
Using GCG to access sequences via their accession number
Data Library:Accession Number
  Flatfile - vi:J02428
  RDB - gcgnuc:J02428
The Sequence Record
Different for each database
  Locus (Name)
  Accession Number
  Keywords
  Description
  Properties
  References
  The Sequence
analyze% typedata ge:humcftrm
!!NA_SEQUENCE 1.0
LOCUS       HUMCFTRM     6129 bp    mRNA            PRI       15-DEC-1989
DEFINITION  Human cystic fibrosis mRNA, encoding a presumed transmembrane
            conductance regulator (CFTR).
ACCESSION   M28668
NID         g180331
KEYWORDS    cystic fibrosis; transmembrane conductance regulator.
SOURCE      Human, cDNA to mRNA.
  ORGANISM  Homo sapiens
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 6129)
  AUTHORS   Riordan,J.R., Rommens,J.M., Kerem,B., Alon,N., Rozmahel,R.,
            Grzelczak,Z., Zielenski,J., Lok,S., Plavsic,N., Chou,J.-L.,
            Drumm,M.L., Iannuzzi,M.C., Collins,F.S. and Tsui,L.-C.
  TITLE     Identification of the cystic fibrosis gene: Cloning and
            characterization of complementary DNA
  JOURNAL   Science 245, 1066-1073 (1989)
  MEDLINE   89368940
COMMENT     A three base-pair deletion spanning positions 1654-1656 is
            observed in cDNAs from cystic fibrosis patients.
FEATURES             Location/Qualifiers
     source          1. .6129
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
     CDS             133. .4575
                     /note="cystic fibrosis transmembrane conductance
                     regulator"
                     /codon_start=1
                     /db_xref="PID:g180332"
                     /translation="MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLELSDIYQIPSVD
                     SADNLSEKLEREWDRELASKKNPKLINALRRCFFWRFMFYGIFLYLGEVTKAVQPLLL
                     LNRFSKDIAILDDLLPLTIFDFIQLLLIVIGAIAVVAVLQPYIFVATVPVIVAFIMLR
                     AYFLQTSQQLKQLESEGRSPIFTHLVTSLKGLWTLRAFGRQPYFETLFHKALNLHTAN
                     WFLYLSTLRWFQMRIEMIFVIFFIAVTFISILTTGEGEGRVGIILTLAMNIMSTLQWA
                     VNSSIDVDSLMRSVSRVFKFIDMPTEGKPTKSTKPYKNGQLSKVMIIENSHVKKDDIW
                     PSGGQMTVKDLTAKYTEGGNAILENISFSISPGQRVGLLGRTGSGKSTLLSAFLRLLN
                     TEGEIQIDGVSWDSITLQQWRKAFGVIPQKVFIFSGTFRKNLDPYEQWSDQEIWKVAD
                     EVGLRSVIEQFPGKLDFVLVDGGCVLSHGHKQLMCLARSVLSKAKILLLDEPSAHLDP
                     VTYQIIRRTLKQAFADCTVILCEHRIEAMLECQQFLVIEENKVRQYDSIQKLLNERSL
                     FRQAISPSDRVKLFPHRNSSKCKSKPQIAALKEETEEEVQDTRL"
BASE COUNT     1886 a   1181 c   1330 g   1732 t
ORIGIN

HUMCFTRM  Length: 6129  April 13, 1998 13:00  Type: N  Check: 6781  ..

       1  AATTGGAAGC AAATGACATC ACAGCAGGTC AGAGAAAAAG GGTTGAGCGG
      51  CAGGCACCCA GAGTAGTAGG TCTTTGGCAT TAGGAGCTTG AGCCCAGACG
     101  GCCCTAGCAG GGACCCCAGC GCCCGAGAGA CCATGCAGAG GTCGCCTCTG
     151  GAAAAGGCCA GCGTTGTCTC CAAACTTTTT TTCAGCTGGA CCAGACCAAT
     201  TTTGAGGAAA GGATACAGAC AGCGCCTGGA ATTGTCAGAC ATATACCAAA
     251  TCCCTTCTGT TGATTCTGCT GACAATCTAT CTGAAAAATT GGAAAGAGAA
     301  TGGGATAGAG AGCTGGCTTC AAAGAAAAAT CCTAAACTCA TTAATGCCCT
     351  TCGGCGATGT TTTTTCTGGA GATTTATGTT CTATGGAATC TTTTTATATT
     401  TAGGGGAAGT CACCAAAGCA GTACAGCCTC TCTTACTGGG AAGAATCATA
     451  GCTTCCTATG ACCCGGATAA CAAGGAGGAA CGCTCTATCG CGATTTATCT
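The LOCUS line at the top of a record packs the name, length, molecule type, division and date into one line. A minimal sketch of reading it, assuming the seven whitespace-separated fields seen in the records in these slides (other releases of the format may carry additional fields, which this sketch rejects rather than guesses at); `Locus` and `parse_locus_line` are names chosen for the example.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str
    length: int      # sequence length in base pairs
    molecule: str    # e.g. "mRNA" or "DNA"
    division: str    # e.g. "PRI", "EST", "STS"
    date: str        # kept as text, e.g. "15-DEC-1989"


def parse_locus_line(line: str) -> Locus:
    """Parse a line such as 'LOCUS HUMCFTRM 6129 bp mRNA PRI 15-DEC-1989'."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError(f"unexpected LOCUS line layout: {line!r}")
    return Locus(fields[1], int(fields[2]), fields[4], fields[5], fields[6])


locus = parse_locus_line("LOCUS HUMCFTRM 6129 bp mRNA PRI 15-DEC-1989")
print(locus.name, locus.length, locus.division)
```

The division field is what distinguishes a primate entry (PRI) from the EST and STS entries shown in the following examples.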
analyze% typedata -ref GB_PR:HUMIFNRF1A
!!NA_SEQUENCE 1.0
LOCUS       HUMIFNRF1A   7721 bp    DNA             PRI       10-NOV-1992
DEFINITION  Homo sapiens interferon regulatory factor 1 gene, complete cds.
ACCESSION   L05072
NID         g184648
KEYWORDS    interferon regulatory factor 1.
SOURCE      Homo sapiens Placenta DNA.
  ORGANISM  Homo sapiens
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 7721)
  AUTHORS   Cha,Y., Sims,S.H., Romine,M.F., Kaufmann,M. and Deisseroth,A.B.
  TITLE     Human interferon regulatory factor 1: intron/exon organization
  JOURNAL   DNA Cell Biol. 11, 605-611 (1992)
  MEDLINE   93000481
FEATURES             Location/Qualifiers
     source          1. .7721
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /tissue_type="Placenta"
                     /map="5q23-q31"
     exon            1. .219
                     /gene="IRF1"
                     /note="putative"
                     /number=1
     5'UTR           join(1. .219,1279. .1287)
                     /gene="IRF1"
     gene            join(1. .219,1279. .1287)
                     /gene="IRF1"
     intron          220. .1278
                     /gene="IRF1"
                     /number=1
     exon            1279. .1374
                     /gene="IRF1"
                     /number=2
     CDS             join(1288. .1374,2738. .2837,3630. .3806,3916. .3965,
                     4073. .4202,4386. .4508,5040. .5089,6248. .6383,6670. .6794)
                     /gene="IRF1"
                     /codon_start=1
                     /product="interferon regulatory factor 1"
                     /db_xref="PID:g184649"
                     /translation="MPITRMRMRPWLEMQINSNQIPGLIWINKEEMIFQIPWKHAAKH
                     GWDINKDACLFRSWAIHTGRYKAGEKEPDPKTWKANFRCAMNSLPDIEEVKDQSRNKG
                     SSAVRVYRMLPPLTKNQRKERKSKSSRDAKSKAKRKSCGDSSPDTFSDGLSSSTLPDD
                     HSSYTVPGYMQDLEVEQALTPALSPCAVSSTLPDWHIPVEVVPDSTSDLYNFQVSPMP
                     STSEATTDEDEEGKLPEDIMKLLEQSEWQPTNVDGKGYLLNEPGVQPTSVYGDFSCKE
                     EPEIDSPGGDIGLSLQRVFTDLKNMDATWLDSLLTPVRLPSIQAIPCAP"
     intron          1375. .2737
                     /gene="IRF1"
                     /number=2
     exon            2738. .2837
                     /gene="IRF1"
                     /number=3
     intron          2838. .3629
                     /gene="IRF1"
                     /number=3
     exon            3630. .3806
                     /gene="IRF1"
                     /number=4
     intron          3807. .3915
                     /gene="IRF1"
                     /number=4
     exon            3916. .3965
                     /gene="IRF1"
                     /number=5
     intron          3966. .4072
                     /gene="IRF1"
                     /number=5
...
     exon            5040. .5089
                     /gene="IRF1"
                     /number=8
     intron          5090. .6247
                     /gene="IRF1"
                     /number=8
     exon            6248. .6383
                     /gene="IRF1"
                     /number=9
     intron          6384. .6669
                     /gene="IRF1"
                     /number=9
     exon            6670. .7656
                     /gene="IRF1"
                     /number=10
     3'UTR           6795. .7656
BASE COUNT     1750 a   1946 c   2253 g   1772 t
ORIGIN
analyze% typedata -ref est:hum091226f
!!NA_SEQUENCE 1.0
LOCUS       HUM091226F    152 bp    mRNA            EST       02-APR-1996
DEFINITION  Homo sapiens retinal fovea EST HFV091226 sequence.
ACCESSION   L48850
NID         g1254959
KEYWORDS    EST; expressed sequence tag.
SOURCE      Homo sapiens (clone: EST HFV091226) age normalized retinal
            foveae cDNA to mRNA.
  ORGANISM  Homo sapiens
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (sites)
  AUTHORS   Adams,M.D., Kerlavage,A.R., Fields,C. and Venter,J.C.
  TITLE     3,400 new expressed sequence tags identify diversity of
            transcripts in human brain
  JOURNAL   Nature Genet. 4 (3), 256-267 (1993)
  MEDLINE   93364420
REFERENCE   2  (sites)
  AUTHORS   Liew,C.C., Hwang,D.M., Fung,Y.W., Laurenssen,C., Cukerman,E.,
            Tsui,S. and Lee,C.Y.
  TITLE     A catalogue of genes in the cardiovascular system as identified
            by expressed sequence tags
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 91 (22), 10645-10649 (1994)
  MEDLINE   95024171
REFERENCE   3  (bases 1 to 152)
  AUTHORS   Bernstein,S.L., Borst,D.E., Neuder,M.E. and Wong,P.
  TITLE     Characterization of a human fovea cDNA library and regional
            differential gene expression in the human retina
  JOURNAL   Genomics 32 (3), 301-308 (1996)
FEATURES             Location/Qualifiers
     source          1. .152
                     /organism="Homo sapiens"
                     /note="Expressed sequence tags (first pass sequencing)
                     from randomly selected bacteriophage clones (mRNA-cDNA)
                     from human retinal fovea. The library is age normalized
                     from ten sets of donor foveae 2-79 years old.
                     /db_xref="taxon:9606"
                     /clone="EST HFV091226"
                     /dev_stage="age normalized"
                     /tissue_type="retinal foveae"
     mRNA            <1. .>152
                     /standard_name="EST HFV091226"
BASE COUNT       31 a     42 c     41 g     36 t      2 others
ORIGIN
analyze% typedata -ref sts:humswx153
!!NA_SEQUENCE 1.0
LOCUS       HUMSWX153     192 bp    DNA             STS       24-MAY-1993
DEFINITION  Human chromosome X STS sWXD153; single read.
ACCESSION   L15212
NID         g292645
KEYWORDS    STS; primer; sequence tagged site.
SOURCE      Homo sapiens DNA.
  ORGANISM  Homo sapiens
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 192)
  AUTHORS   Kere,J., Nagaraja,R., Mumm,S.R., Ciccodicola,A., D'Urso,M. and
            Schlessinger,D.
  TITLE     Mapping human chromosomes by walking with sequence-tagged sites
            from end fragments of yeast artificial chromosome inserts
  JOURNAL   Genomics 14, 241-248 (1992)
  MEDLINE   93052321
COMMENT     Submitted by: David Schlessinger, Center for Genetics in
            Medicine, Washington University School of Medicine, Box 8232
            4566 Scott Avenue, St. Louis, MO 63110, USA
            e-mail: [email protected]
            Primer A: TAAAGGGATCGCCAAGGAC
            Primer B: CTTACTCATTTGCTGGATTCTC
            STS size: 85bp
            Template: 600 ng/100ul
            Primer: 40 pmoles/100ul
            dNTPs: 100 uM
            MgCl2: 1.5 mM
            KCl: 100 mM
            TrisHCl: 10 mM
            Taq Polymerase: 0.125 U
            NH4Cl: 5 mM
            pH: 8.6
            Total Vol: 5 ul
            PCR Profile:
            Denaturation: 94 degrees C for 1.00 minute(s)
            Annealing: 55 degrees C for 2.00 minute(s)
            Polymerization: 72 degrees C for 2.00 minute(s)
            PCR Cycles: 35
            Thermal Cycler: P-E.
FEATURES             Location/Qualifiers
     source          1. .192
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /map="Xq13-q24"
     STS             60. .144
                     /standard_name="sWXD153"
     primer_bind     60. .78
     primer_bind     complement(123. .144)
BASE COUNT       72 a     26 c     60 g     29 t      5 others
ORIGIN
analyze%
Swiss-Prot
http://www.expasy.ch/sprot/
Protein Database
University of Geneva
Arranged by protein function
Release 39.15, March 19, 2001
  94,152 entries
Provides annotated protein records
Swiss-Prot Names
Protein_Species
Allows easier comparisons when studying evolutionary relationships
H1b_Human
  Human histone 1b
Swiss-Prot Names
Vgl*_*
  Viral glycoproteins
VGLG_HRSVL
  Viral GLycoprotein G
  Human Respiratory Syncytial Virus Long strain
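Because Swiss-Prot names follow the Protein_Species pattern, the same protein from different species can be collected by splitting on the underscore. The sketch below does that for a few entry names taken from these slides; the function names are invented for the example, and treating names as case-insensitive is an assumption based on the mixed-case spellings used in the slides.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_entry_name(entry: str) -> Tuple[str, str]:
    """Split 'H1b_Human' into its protein and species codes, upper-cased."""
    protein, separator, species = entry.strip().partition("_")
    if not separator or not protein or not species:
        raise ValueError(f"not a Protein_Species name: {entry!r}")
    return protein.upper(), species.upper()


def group_by_protein(entries: Iterable[str]) -> Dict[str, List[str]]:
    """Map each protein code to the species codes it was seen with."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for entry in entries:
        protein, species = split_entry_name(entry)
        groups[protein].append(species)
    return dict(groups)


names = ["H1b_Human", "VGLG_HRSVL", "Hs70_Chick", "Hs70_Human", "Hs70_Mouse"]
print(group_by_protein(names))
```

Grouping by the protein code is what makes cross-species comparisons convenient: every Hs70 entry ends up in one list.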
analyze% typedata swp:H1b_Human
!!AA_SEQUENCE 1.0
ID   H1B_HUMAN      STANDARD;      PRT;   218 AA.
AC   P10412;
DT   01-MAR-1989 (REL. 10, CREATED)
DT   01-MAR-1989 (REL. 10, LAST SEQUENCE UPDATE)
DT   01-JUN-1994 (REL. 29, LAST ANNOTATION UPDATE)
DE   HISTONE H1B (H1.4).
GN   H1F4.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC   EUTHERIA; PRIMATES.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 92009931.
RA   ALBIG W., KARDALINOU E., DRABENT B., ZIMMER A., DOENECKE D.;
RL   GENOMICS 10:940-948(1991).
RN   [2]
RP   SEQUENCE.
RC   TISSUE=SPLEEN;
RX   MEDLINE; 87057092.
RA   OHE Y., HAYASHI H., IWAI K.;
RL   J. BIOCHEM. 100:359-368(1986).
CC   -!- FUNCTION: HISTONES H1 ARE NECESSARY FOR THE CONDENSATION OF
CC       NUCLEOSOME CHAINS INTO HIGHER ORDER STRUCTURES.
CC   -!- SUBCELLULAR LOCATION: NUCLEAR.
CC   -!- THIS VARIANT ACCOUNTS FOR 60% OF HISTONE H1.
DR   EMBL; M60748; G184074; -.
DR   PIR; A24413; HSHU1B.
DR   PIR; C40335; C40335.
DR   HSSP; P08287; 1GHC.
KW   CHROMOSOMAL PROTEIN; NUCLEAR PROTEIN; DNA-BINDING; MULTIGENE FAMILY;
KW   ACETYLATION; METHYLATION.
FT   INIT_MET      0      0
FT   MOD_RES       1      1       ACETYLATION.
FT   MOD_RES      25     25       METHYLATION (PARTIAL).
FT   DOMAIN       35    113       GLOBULAR.
SQ   SEQUENCE   218 AA;  21734 MW;  5A277FB0 CRC32;

H1B_HUMAN  Length: 218  April 13, 1998 13:19  Type: P  Check: 2701  ..

       1  SETAPAAPAA PAPAEKTPVK KKARKSAGAA KRKASGPPVS ELITKAVAAS
      51  KERSGVSLAA LKKALAAAGY DVEKNNSRIK LGLKSLVSKG TLVQTKGTGA
     101  SGSFKLNKKA ASGEAKPKAK KAGAAKAKKP AGAAKKPKKA TGAATPKKSA
     151  KKTPKKAKKP AAAAGAKKAK SPKKAKAAKP KKAPKSPAKA KAVKPKAAKP
     201  KTAKPKAAKP KKAAAKKK
analyze%
Swiss-Prot TrEMBL
Translation of all EMBL Nucleic Acid coding sequences not yet present in Swiss-Prot
Allows rapid availability without immediate annotation
Release 16.3 March 30, 2001 436,896 entries
TrEMBL Divisions
Everything in TrEMBL: spt
  sp_bacteria
  sp_fungi
  sp_human
  sp_invertebrate
  sp_mammal
  sp_mhc
  sp_organelle
  sp_phage
  sp_plant
  sp_rodent
  sp_unclassified
  sp_vertebrate
Protein Identification Resource - PIR
http://pir.georgetown.edu/
National Biomedical Research Foundation
Georgetown University
Current Release 67.05, March 23, 2001
  219,178 Entries
National Biomedical Research Foundation
Database begun over twenty years ago by Margaret O. Dayhoff
Originally published sequences in book form
Started with sequences derived from direct amino acid sequencing
analyze% typedata -ref PIR1:HSHU1B
!!AA_SEQUENCE 1.0
P1;HSHU1B - histone H1-4 - human
N;Alternate names: histone H1.4; histone H1b
C;Species: Homo sapiens (man)
C;Date: 31-Dec-1988 #sequence_revision 12-Apr-1996 #text_change 05-Sep-1997
C;Accession: C40335; A24413
R;Albig, W.; Kardalinou, E.; Drabent, B.; Zimmer, A.; Doenecke, D.
Genomics 10, 940-948, 1991
A;Title: Isolation and characterization of two human H1 histone genes within clusters of core histone genes.
A;Reference number: A40335; MUID:92009931
A;Accession: C40335
A;Status: preliminary
A;Molecule type: DNA
A;Residues: 1-219 <ALB>
A;Cross-references: GB:M60748; NID:g184073; PID:g184074
A;Experimental source: blood
R;Ohe, Y.; Hayashi, H.; Iwai, K.
J. Biochem. 100, 359-368, 1986
A;Title: Human spleen histone H1. Isolation and amino acid sequence of a main variant, H1b.
A;Reference number: A24413; MUID:87057092
A;Accession: A24413
A;Molecule type: protein
A;Residues: 2-219 <OHE>
A;Experimental source: spleen
C;Comment: This variant accounts for 60% of histone H1.
C;Genetics:
A;Gene: GDB:H1F4
A;Cross-references: GDB:120030; OMIM:142220
A;Map position: 12q11-12q21
C;Superfamily: histone H1
C;Keywords: acetylated amino end; chromosomal protein; DNA binding; methylated amino acid; nucleosome; spleen
F;2-219/Product: histone H1-4 #status experimental <MAT>
F;2-32/Domain: amino-terminal <NH2>
F;33-110/Domain: globular <GLB>
F;111-219/Domain: carboxyl-terminal <END>
F;2/Modified site: acetylated amino end (Ser) (in mature form) #status experimental
F;26/Modified site: N6-methyllysine (Lys) (partial) #status experimental
iProClass Database - PIR
http://pir.georgetown.edu/iproclass/
Comprehensive family relationships and structural/functional classifications and features of proteins
  Superfamilies
  Families
  Domains
GCG Supplied Databases
GCG sequence database files are NOT normal UNIX files.
UNIX commands cannot be used to manipulate sequences in these databases
  Stored as Data Libraries
  Stored in Oracle RDB
Sequence Data Updates
Genbank
  Daily
GCG Flat file
  No longer updated
  Last update June, 2000
GCG SeqStore Oracle RDB
  Daily updates
Database listing – GCG-FF
Databases available:
GenBank Release 118.0 (06/2000)
EMBL (Abridged) Release 62.0 (03/2000)
PIR-Protein Release 65.0 (06/2000)
NRL_3D Release 27.0 (03/2000)
SWISS-PROT Release 39.0 (06/2000)
SP-TREMBL Release 14.0 (06/2000)
PROSITE Release 16.0 (07/1999)
Restriction Enzymes (REBASE) (06/2000)
Database listing – SeqStore
Databases available:
GCGNUC updated nightly by DATASERVE
GCGPROT updated weekly by DATASERVE
GCGEST updated nightly by DATASERVE
PROSITE Release 15.0 (07/1999)
Restriction Enzymes (REBASE) (06/2000)
Data Libraries
Allows rapid searches
Sequences organized into groups
Each data library can be referred to by a logical name
Individual sequences can be extracted from the data library.
Logical Names:GCG Sequence Databases
http://www.microbio.uab.edu/seqCourse/datalib.htm
GCG SeqStore (Oracle-based Sequences)
Data Library Names
Database Name   Description

Nucleic Acid Sequences
gcgnuc          All Genbank nucleotide sequences (except ESTs), updated nightly by SeqStore
gcgest          All Genbank Expressed Sequence Tags, updated nightly by SeqStore

Protein Sequences
gcgprot         All Swissprot and Swissprot TrEMBL sequences, updated nightly by SeqStore
GCG Flat-file
Data Library Names
Nucleic Acid Databases (Genbank and EMBL)
Database Name(s)                                    Description
GenEMBL, GE                                         Entire database (except tags)
genemblplus, gep, geplus                            Entire database (including tags)
Bacterial, Bacteria, Ba                             Bacterial sequences
HTG                                                 High throughput genome
Invertebrate, In                                    Invertebrate sequences
Organelle, Or                                       Organelle sequences
Other_Mammalian, OtherMammal, OtherMamm, Om         non-rodent, non-primate Mammalian sequences
Other_Vertebrate, Ov, OtherVertebrate, OtherVert    non-mammalian Vertebrate sequences
Nucleic Acid Databases…
Database Name(s)                 Description
Patent, Pat                      Sequences from patents and patent applications
Phage, Ph                        Phage sequences
Plant, Pl                        Plant and Fungal sequences
Primate, Pr                      Primate (Mammalian) sequences
Rodent, Ro                       Rodent (Mammalian) sequences
Structural_RNA, Structural, St   Structural RNA sequences (such as rRNAs)
Synthetic, Sy                    Synthetic sequences
Unannotated, Un                  Unannotated sequences
Viral, Vi                        Viral sequences
Sequence Tag Databases
Database Name(s)   Description
EST                Expressed sequence tags
GSS                Genome survey sequences
STS                Sequence-tagged site sequences
Tags               EST, STS, and GSS
Protein Databases
Database Name(s)             Description
PIR, P                       Entire PIR-Protein Protein Sequence Data Library
Protein, Prot, PIR1          PIR-Protein annotated sequences
New, Nw                      PIR-Protein preliminary and unverified sequences
PIR2                         PIR-Protein preliminary sequences
PIR3                         PIR-Protein unverified sequences
SwissProt, Swiss             Entire SwissProt Protein Sequence Data Library
Sptrembl, spt                Newly added preliminary sequences, translated from EMBL
swissprotplus, swplus, swp   SwissProt + SPTrEMBL
NCBI Blast Databases
Nucleotide Databases for NetBlast Searching
nr Non-redundant Genbank+EMBL+DDBJ+PDB sequences (but no ESTs or STSs)
pdb PDB nucleotide sequences
vector Vector subset of Genbank
yeast Saccharomyces cerevisiae genomic nucleotide sequences
est Non-redundant Database of Genbank+EMBL+DDBJ EST Division
sts Non-redundant Database of Genbank+EMBL+DDBJ STS Division
htgs High Throughput Genomic Sequences
mito Database of mitochondrial sequences, Rel. 1.0, July 1995
kabat Kabat Sequences of Nucleic Acid of Immunological Interest
epd Eukaryotic Promoter Database
alu Select Alu Repeats from REPBASE
gss Genome Survey Sequence, includes single_pass genomic data
ecoli E. coli genomic nucleotide sequences
Drosophila genome Drosophila genome provided by Celera and Berkeley
month All new or revised Genbank+EMBL+DDBJ+PDB sequences released in the last 30 days
Protein Databases for NetBlast Searching
nr Non-redundant Genbank CDS translations+PDB+SwissProt+PIR
pdb PDB protein sequences
swissprot SwissProt sequences
yeast Saccharomyces cerevisiae protein sequences
kabat Kabat Sequences of Proteins of Immunological Interest
alu Translations of Select Alu Repeats from REPBASE
ecoli E. coli genomic CDS translations
Drosophila genome Drosophila genome proteins provided by Celera and Berkeley
month All new or revised Genbank CDS translation+PDB+SwissProt+PIR sequences released in the last 30 days
Specifying Sequences
Filename
Data library specification
Accession number specification
Sequences within your own directories
Use the normal file specification:
lefkowit/sequences/vsvcg.seq
Sequences within a Data Library
Flatfile
  Data Library:Sequence Name
  sw:vglg_vsvsj
    VSV G protein in the SwissProt library
  primate:humada
    The sequence for human adenosine deaminase mRNA
SeqStore
  gcgprot:vglg_vsvsj
  gcgnuc:humada
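The two ways of naming a sequence covered so far (a plain file name, or a Data Library:Sequence Name pair) can be told apart by the colon. The sketch below illustrates that rule only; it is not how GCG itself resolves names, `SequenceSpec` and `parse_spec` are invented for the example, and lower-casing the library and entry names assumes they are case-insensitive, as the mixed spellings in these slides suggest.

```python
from typing import NamedTuple, Optional


class SequenceSpec(NamedTuple):
    library: Optional[str]   # logical data library name, or None for a file
    name: str                # entry name, accession number, or file path


def parse_spec(spec: str) -> SequenceSpec:
    """Split 'sw:vglg_vsvsj' into library and entry; leave file names alone."""
    spec = spec.strip()
    library, separator, name = spec.partition(":")
    if not separator:
        # No colon: treat the whole string as a file in the user's directories.
        return SequenceSpec(None, spec)
    return SequenceSpec(library.strip().lower(), name.strip().lower())


for example in ("sw:vglg_vsvsj", "gcgnuc:humada", "lefkowit/sequences/vsvcg.seq"):
    print(parse_spec(example))
```

The same split works for accession-number specifications, since those also use the library name followed by a colon.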
Sequence Formats
GCG requires a specific sequence format
Sequences entered from outside GCG must be reformatted
  analyze% reformat
    GCG program
  analyze% readseq
    Non-GCG addition
Non-GCG Sequence File
analyze% cat seq.txt
ACGAAGACAAACAAACCATTATTATCATTAAAAGGCTC
AGGAGAAACTTTAACAGTAATCAAAATGTCTGTTACAG
TCAAGAGAATCATTGACAACACAG
analyze%
analyze% reformat
analyze% reformat -check seq.txt
Reformat rewrites sequence file(s), scoring matrix file(s), or enzyme
data file(s) so that they can be read by GCG programs.
Minimal Syntax: % reformat [-INfile=]reformat.txt -Default
Prompted Parameters: None
Local Data Files:
-DATa=translate.txt three-letter to one-letter codes
Optional Parameters:

[-OUTfile=]NewSeqName      names the output file
-EXTension=.seq            specifies a file name extension for the output
-LIStfile[=reformat.list]  writes a list file of output sequence names
-MSF                       reformats sequences into an MSF output file
-RSF                       reformats sequences into an RSF output file
-PROtein or -NUCleotide    insists that the sequences are reformatted as protein or nucleotide sequences
-DEGap                     removes gap characters (. and ~) from the sequence
-LINesize=50               sets number of characters per line
-BLOcksize=10              sets number of characters per block
-BLAnklines=1              puts blank lines between the sequence lines
-NONUMbering               suppresses numbering
-NOCOMments                suppresses comments
-DNA                       changes U into T
-RNA                       changes T into U
-UPPer                     makes all sequence characters uppercase
-LOWer                     makes all sequence characters lowercase
-ONEIntothree              translates one-letter peptides into three-letter
-THReeintoone              translates three-letter peptides into one-letter
-NOHEAding                 input sequence from stdin contains no header information
-COMparison                reformats a scoring matrix instead of a sequence (used with -PROtein or -NUCleotide, insists that the matrix is reformatted as a protein or nucleotide scoring matrix)
-GAPweight=12              specifies the gap creation penalty associated with the scoring matrix
-LENgthweight=4            specified the gap extension penalty associated with the scoring matrix
-SCAle=10                  multiplies each value in the scoring matrix by 10 (use any number from .01 to 100.0)
-EQUALSformat              writes the scoring matrix in a form that may be more easily read
-OLDCMPformat              converts a pre-Version 9 scoring matrix into a Version 9 scoring matrix (all options used with -COMparison can also be used with -OLDCMPformat. -PROtein or -NUCleotide must be specified with -OLDCMPformat
-TRANSlate=filename.txt    lets you name the translation table
-NOMONitor                 suppresses the screen trace showing each output file

 Add what to the command line ?

 No ".." divider
 seq.txt length: 100 bp
analyze%
analyze% cat seq.txt
!!NA_SEQUENCE 1.0

 REFORMAT of: seq.txt check: 3430 from: 1 to: 100 April 9, 1998 14:31

 (No documentation)

seq.txt Length: 100 April 9, 1998 14:31 Type: N Check: 3430 ..

       1  ACGAAGACAA ACAAACCATT ATTATCATTA AAAGGCTCAG GAGAAACTTT
      51  AACAGTAATC AAAATGTCTG TTACAGTCAA GAGAATCATT GACAACACAG
analyze%
Reformatted Sequence
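To make the layout of the reformatted file concrete, the sketch below rebuilds the numbered body lines (50 characters per line in blocks of 10, matching the -LINesize and -BLOcksize defaults listed above) and computes a checksum. The checksum rule used here (position weights cycling from 1 to 57, total taken modulo 10000) is the commonly described GCG rule rather than something stated in these slides, so treat it as an assumption; it does give 3430 for this sequence, the Check value in the file above. The function names and the width of the number column are choices made for the example.

```python
from typing import List


def gcg_checksum(sequence: str) -> int:
    """Weighted character sum, weights cycling 1..57, modulo 10000 (assumed rule)."""
    total = 0
    for index, char in enumerate(sequence.upper()):
        total += (index % 57 + 1) * ord(char)
    return total % 10000


def gcg_body_lines(sequence: str, line_size: int = 50, block_size: int = 10) -> List[str]:
    """Numbered lines of line_size characters, split into block_size blocks."""
    lines = []
    for start in range(0, len(sequence), line_size):
        chunk = sequence[start:start + line_size]
        blocks = [chunk[i:i + block_size] for i in range(0, len(chunk), block_size)]
        lines.append(f"{start + 1:>8}  " + " ".join(blocks))
    return lines


# The three lines of the original seq.txt, joined into one sequence.
raw = (
    "ACGAAGACAAACAAACCATTATTATCATTAAAAGGCTC"
    "AGGAGAAACTTTAACAGTAATCAAAATGTCTGTTACAG"
    "TCAAGAGAATCATTGACAACACAG"
)

print(f"seq.txt  Length: {len(raw)}  Type: N  Check: {gcg_checksum(raw)}  ..")
print()
for line in gcg_body_lines(raw):
    print(line)
```

The ".." at the end of the header line is the divider that Reformat reported as missing from the original file; everything after it is read as sequence.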
GCG Sequence Import Programs
fromstaden fromembl fromgenbank frompir fromig fromfasta fromtrace
GCG Sequence Export Programs
tostaden topir toig tofasta
ReadSeq
General reformatting program
analyze% readseq
analyze% readseq
readSeq (1Feb93), multi-format molbio sequence reader.
Name of output file (?=help, defaults to display):
seq.fasta
  1. IG/Stanford          10. Olsen (in-only)
  2. GenBank/GB           11. Phylip3.2
  3. NBRF                 12. Phylip
  4. EMBL                 13. Plain/Raw
  5. GCG                  14. PIR/CODATA
  6. DNAStrider           15. MSF
  7. Fitch                16. ASN.1
  8. Pearson/Fasta        17. PAUP/NEXUS
  9. Zuker (in-only)      18. Pretty (out-only)
Choose an output format (name or #):
8
Name an input sequence or -option:
seq.txt
Name an input sequence or -option:

analyze% cat seq.fasta
>seq.txt, 100 bases, D66 checksum.
ACGAAGACAAACAAACCATTATTATCATTAAAAGGCTCAGGAGAAACTTTAACAGTAATCAAAATGTCTGTTACAGTCAAGAGAATCATTGACAACACAG
analyze%
ReadSeq Formatted Sequence
Sequence File Utilities
Chopup
  Break up long lines in a text file prior to running reformat
Breakup
  Break up long sequences into individual, overlapping sequence files
>uunt, 751719 bases, 1F08 checksum.ATGGCTAATAATTATCAAACTTTATATGATTCAGCAATAAAAAGGATTCCATACGATCTTATTTCTGATCAAGCTTATGCAATTCTACAAAATGCTAAAACTCATAAAGTTTGCGATGGTGTTTTATATATAATTGTAGCCAATGCCTTTGAAAAAAGTATTATTAACGGTAATTTTATTAACATTATTTCTAAATATCTAAGCGAAGAATTCAAAAAGGAAAATATTGTTAATTTTGAATTTATTATAGACAATGAAAAATTATTAATTAATAGCAATTTTTTAATTAAAGAAACTAATATTAAAAATCGTTTTAATTTTAGTGATGAACTTTTACGTTACAATTTTAACAATTTAGTAATTAGTAATTTTAATCAAAAAGCGATTAAGGCGATTGAAAATTTATTTTCAAATAACTATGATAATAGTTCAATGTGTAACCCTTTATTTTTATTTGGTAAAGTTGGTGTTGGTAAAACGCATATCGTGGCTGCTGCTGGTAATCGTTTTGCTAATAGTAATCCTAATTTAAAAATTTATTATTATGAAGGGCAAGATTTTTTTCGAAAGTTTTGTTCTGCTTCGTTAAAAGGGACTAGTTATGTTGAAGAGTTTAAAAAAGAAATTGCTTCAGCAGATTTATTAATTTTTGAAGATATTCAAAATATCCAATCACGTGATTCAACGGCTGAATTGTTTTTTAATATCTTTAATGATATAAAATTAAATGGTGGAAAAATTATCTTAACATCTGACCGTACACCAAACGAACTTAATGGTTTTCATAATCGAATTATTTCGAGATTAGCGTCAGGTTTGCAGTGTAAAATTTCTCAACCCGACAAAAATGAAGCTATTAAAATTATTAATAATTGGTTTGAATTCAAAAAAAAATATCAAATTACTGACGAAGCTAAAGAATATATTGCTGAAGGTTTTCACACTGATATTAGACAGATGATtGGTAATCTAAAACAAATTTGTTTTTGAGCGGACAATGATACTAATAAAGATTTAATAATCACAAAAGATTATGTAATTGAGTGTTCAGTTGAAAACGAAATTCCACTAAATATTGTTGTTAAAAAACAATTTAAACC
analyze% readseq
readSeq (1Feb93), multi-format molbio sequence reader.
Name of output file (?=help, defaults to display):
uunt.seq
  1. IG/Stanford          10. Olsen (in-only)
  2. GenBank/GB           11. Phylip3.2
  3. NBRF                 12. Phylip
  4. EMBL                 13. Plain/Raw
  5. GCG                  14. PIR/CODATA
  6. DNAStrider           15. MSF
  7. Fitch                16. ASN.1
  8. Pearson/Fasta        17. PAUP/NEXUS
  9. Zuker (in-only)      18. Pretty (out-only)
Choose an output format (name or #):
5
Name an input sequence or -option:
uunt
Name an input sequence or -option:
analyze% more uunt.seq
uunt uunt, Length: 751719 (today) Check: 7944 ..
       1  ATGGCTAATA ATTATCAAAC TTTATATGAT TCAGCAATAA AAAGGATTCC
      51  ATACGATCTT ATTTCTGATC AAGCTTATGC AATTCTACAA AATGCTAAAA
     101  CTCATAAAGT TTGCGATGGT GTTTTATATA TAATTGTAGC CAATGCCTTT
     151  GAAAAAAGTA TTATTAACGG TAATTTTATT AACATTATTT CTAAATATCT
     201  AAGCGAAGAA TTCAAAAAGG AAAATATTGT TAATTTTGAA TTTATTATAG
     251  ACAATGAAAA ATTATTAATT AATAGCAATT TTTTAATTAA AGAAACTAAT
     301  ATTAAAAATC GTTTTAATTT TAGTGATGAA CTTTTACGTT ACAATTTTAA
     351  CAATTTAGTA ATTAGTAATT TTAATCAAAA AGCGATTAAG GCGATTGAAA
     401  ATTTATTTTC AAATAACTAT GATAATAGTT CAATGTGTAA CCCTTTATTT
     451  TTATTTGGTA AAGTTGGTGT TGGTAAAACG CATATCGTGG CTGCTGCTGG
     501  TAATCGTTTT GCTAATAGTA ATCCTAATTT AAAAATTTAT TATTATGAAG
     551  GGCAAGATTT TTTTCGAAAG TTTTGTTCTG CTTCGTTAAA AGGGACTAGT
...
  751301  GAAAATAAAC TACGATTTGA TTAGAATGAA TTTTTTGTTG TTTCTTAATT
  751351  GTATCAAGTA TATCTTCATT TTTTTTTAGA CTAATAAAAT TAGCCATAAA
  751401  AATTATTTTT CACTAGAAAC TGTTAGACTA TGACGCCCTT TAAGTCTTCT
  751451  TCTAGCTAAA ACATTACGCC CATTTTTTGT TTTCATGCGT GCACGAAAAC
  751501  CATGCACTTT TGCTCTTTTA CGATTATTAG GTTGAAACGT TCTTTTCATA
  751551  AATCCACCGC CCTCTTACTT TTTTGAAAAC ATAATATGGA TTATTATAAC
  751601  ATTTTAGTTA TTTTTTATTT AATATATTTT TTTAAAAAAG TCAATGATAT
  751651  CTTTTTAAAA ATAAACATAT ATAATATGAT AATAGGACAA AGATTATTTA
  751701  TAAAAAATAG AGGTTACTA
analyze% map uunt.seq

 Map maps a DNA sequence and displays both strands of the mapped sequence
with restriction enzyme cut points above the sequence and protein
translations below. Map can also create a peptide map of an amino acid
sequence.

 ***Error: Sequence "uunt.seq" could not be read or is not in GCG format
analyze% breakup uunt.seq

 BreakUp reads a GCG-format sequence file containing more than 350,000
sequence characters and writes it as a set of separate, shorter,
overlapping sequence files that can be analyzed by Wisconsin Package programs.

 uunt_0.seq length: 110000 bp
 uunt_1.seq length: 110000 bp
 uunt_2.seq length: 110000 bp
 uunt_3.seq length: 110000 bp
 uunt_4.seq length: 110000 bp
 uunt_5.seq length: 110000 bp
 uunt_6.seq length: 110000 bp
 uunt_7.seq length: 51719 bp
analyze%
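The lengths BreakUp reports (seven pieces of 110000 bp and a final piece of 51719 bp from a 751719 bp input) are consistent with 110000-base pieces that each overlap the previous one by 10000 bases. Those two values are inferred from this single run, not taken from the BreakUp documentation, and the function below is only a sketch of the slicing arithmetic, not of the program itself.

```python
from typing import List, Tuple


def breakup_ranges(length: int, size: int = 110_000, overlap: int = 10_000) -> List[Tuple[int, int]]:
    """Return 0-based (start, end) slices of `size` bases that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than the piece size")
    if length <= 0:
        return []
    step = size - overlap
    ranges = []
    start = 0
    while True:
        end = min(start + size, length)
        ranges.append((start, end))
        if end >= length:
            return ranges
        start += step


for number, (start, end) in enumerate(breakup_ranges(751_719)):
    print(f"uunt_{number}.seq length: {end - start} bp")
```

The overlap means a feature that straddles a cut point still appears intact in at least one piece, provided it is shorter than the overlap.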
Specifying Multiple Sequences
Multiple sequences
If the program prompts with: sequence(s), file(s), or file name(s), then it can accept more than one input file
Specifying Multiple Sequences
Wild Card Specification
File of File Names
  List Files
Multiple Sequence Format File
Wild card specification (flatfile)
GenEMBL:* All sequences in Genbank and EMBL
Primate:* All primate sequences in GenBank
Primate:Hum* All Human sequences in GenBank
  EMBL uses HS for human
Wild card specification (SeqStore)
gcgnuc:* All sequences in Genbank and EMBL
Must create a query or list for most groupings
File of Sequence Names
List Files
You or certain GCG programs can construct a file containing any number of sequence names.
Specify as @Sequence_names.fil
The @ tells the program that Sequence_names.fil is a file of sequence names
The program uses all listed sequences
Contents of a File of Sequence Names
Begin with a comment
Sequence file names follow a double period at the end of a line: ..
Other comments can be included if preceded by a !
One sequence name per line
File of Sequence Names...
Put an ! in front of a name to have the program ignore that particular entry.
A sequence name may include a wild card
The file can contain another file of sequence names as a listing
  It must be preceded by an @
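The rules above are enough to sketch a reader for a file of sequence names. This is an illustration of those rules only, not the parser GCG uses: it skips everything up to the line ending in "..", drops comments and entries disabled with "!", keeps the first word of each remaining line as the name (so qualifiers such as -BEGin are ignored), and returns nested "@file" entries unchanged instead of opening them. The function name and the sample text are invented for the example.

```python
from typing import List


def read_sequence_names(text: str) -> List[str]:
    """Return the active sequence names listed after the '..' divider."""
    names = []
    past_divider = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not past_divider:
            # Everything before the line that ends in '..' is a comment.
            past_divider = line.endswith("..")
            continue
        # A '!' starts a comment; a leading '!' therefore disables the entry.
        line = line.split("!", 1)[0].strip()
        if line:
            names.append(line.split()[0])
    return names


sample = "\n".join([
    "Heat shock proteins",
    "January 21, 1998 ..",
    "",
    "SWP:Hs70_Chick",
    "!SWP:Hs70_Human",
    "SWP:GR78_Yeast -BEGin=43 -END=682 ! trimmed",
    "@more_names.fil",
])
print(read_sequence_names(sample))
```

A fuller version would expand wild cards and follow the "@" entries, which depends on how the data libraries are stored and is outside the scope of this sketch.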
hsp70.fil File
January 21, 1998 ..
SWP:Hs70_Brelc
SWP:Hs70_Chick
SWP:Hs70_Human
SWP:Hs70_Leido
SWP:Hs70_Leima
SWP:Hs70_Maize
SWP:Hs70_Mouse
SWP:Hs70_Pethy
SWP:HS77_Yeast
SWP:GR78_Yeast -BEGin=43 -END=682
sequences/hsp70/ssa4.pep
ob0/users/lefkowit/sequences/hsp70/ssa1.pep
SWP:DNAK_EColi
Multiple Sequence Files (msf)
File containing multiple sequences that are related and have been aligned
Specifying msf files: filename.msf{*}
  The {*} indicates which sequences are to be used
You can exclude a sequence in subsequent analyses by preceding its name within the msf file with an ! sign.
hsp70.msf
PileUp of: @Hsp70.Fil

Symbol comparison table: GenRunData:NWSGapPep.Cmp  CompCheck: 1254

GapWeight: 3.0
GapLengthWeight: 0.1

Pileup.Msf  MSF: 738  Type: P  December 26, 1990 13:39  Check: 288 ..

 Name: Hs70_Plafa  Len: 738  Check: 9820  Weight: 1.00
 Name: Hs70_Thean  Len: 738  Check: 120  Weight: 1.00
!Name: Hs70_Leido  Len: 738  Check: 7985  Weight: 1.00

//

            1                                                   50
Hs70_Plafa  .......... .....MASAK GSKPNLPESN IAIGIDLGTT YSCVGVWRNE
Hs70_Thean  .......... .......... .......MTG PAIGIDLGTT YSCVAVYKDN
Hs70_Leido  .......... .......... ......MTFD GAIGIDLGTT YSCVGVWQNE

            51                                                 100
Hs70_Plafa  NVDIIANDQG NRTTPSYVAF T.DTERLIGD AAKNQVARNP ENTVFDAKRL
Hs70_Thean  NVEIIPNDQG NRTTPSYVAF T.DTERLIGD AAKNQEARNP ENTIFDAKRL
Hs70_Leido  RVDIIANDQG NRTTPSYVAF TSDSERLIGD AAKNQVAMNP HNTVFDAKRL
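As a rough illustration of how the `!` exclusion in an msf header works, the following Python sketch reads the `Name:` lines and separates active sequences from excluded ones. It assumes the header layout shown on the slide (`Name: ... Len: ... Check: ... Weight: ...`, ending at `//`); the function name `msf_names` is hypothetical and this is not GCG code.

```python
import re

# One header entry; an optional leading "!" marks the sequence as excluded.
NAME_LINE = re.compile(
    r"^\s*(!?)\s*Name:\s*(\S+)\s+Len:\s*(\d+)\s+Check:\s*(\d+)\s+Weight:\s*([\d.]+)"
)


def msf_names(text):
    """Return (active, excluded) sequence names from an msf header."""
    active, excluded = [], []
    for line in text.splitlines():
        if line.strip() == "//":
            break  # the alignment itself starts after the divider
        match = NAME_LINE.match(line)
        if match:
            target = excluded if match.group(1) else active
            target.append(match.group(2))
    return active, excluded


header = """Pileup.Msf  MSF: 738  Type: P  Check: 288 ..
 Name: Hs70_Plafa  Len: 738  Check: 9820  Weight: 1.00
 Name: Hs70_Thean  Len: 738  Check: 120  Weight: 1.00
!Name: Hs70_Leido  Len: 738  Check: 7985  Weight: 1.00
//
"""
print(msf_names(header))
```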
rsf Files
Rich Sequence Format
Allows entry of additional information about each sequence
File can contain multiple sequences
Allows gaps
Different sequences do not need to be related
Create and edit rsf files within SeqLab
rsf Sequence Information
Creator/author of the sequence
Sequence weight
Creation date
One-line description of the sequence
Offset, or the number of leading gaps in a sequence that is part of an alignment or fragment assembly project
Known sequence features
rsf File Specification
Similar to msf files
hsp70.rsf{*} Use all the sequences in the file
hsp70.rsf{hs70_human} Only use this single sequence
hsp70.rsf{hs70*} Only use sequences whose name starts with hs70
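The brace notation can be mimicked with ordinary wildcard matching. The sketch below is an assumption-laden illustration, not GCG behaviour: it assumes names are compared without regard to case, and the function name `select_sequences` and the example names are hypothetical.

```python
import re
from fnmatch import fnmatchcase


def select_sequences(spec, names):
    """Split 'file{pattern}' and return the file name plus matching names."""
    match = re.fullmatch(r"(.+?)\{(.+)\}", spec)
    if match is None:
        raise ValueError(f"not a file{{pattern}} specification: {spec!r}")
    filename, pattern = match.groups()
    # Assumption: matching ignores case, so hs70* also selects Hs70_Human.
    chosen = [n for n in names if fnmatchcase(n.lower(), pattern.lower())]
    return filename, chosen


names = ["Hs70_Human", "Hs70_Mouse", "DnaK_Ecoli"]
print(select_sequences("hsp70.rsf{hs70*}", names))
```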
analyze% more rsb.rsf
!!RICH_SEQUENCE 1.0
..
{
name dc-62-18537
descrip Description: PileUp of: *.seq
type DNA
longname dc-62-18537
checksum 8717
creation-date 4/10/98 15:45:50
strand 1
sequence
  TCCACCGTGCTCGACACAATCACTCCAAAATACACAATCCAACAGCAATCCCTCCACTCA
  ACCACCTCCGAAAACACACCCAGCTCCACACAAATACCCACAGCATCCGAGCCCTCCACA
  TTAAATCCTAAT
}
{
name swed-60-860
descrip Description: PileUp of: *.seq
type DNA
longname swed-60-860
checksum 8595
creation-date 4/10/98 15:45:50
strand 1
sequence
  TCCACCGTGATCGACACAATCACTCCAAAATACACAATCCAACAGCAATCCCTCCACTCA
  ACCACCTCCGAAAACACACCCAGCTCCACACAAATACCCACAGCATCCGAGCCCTCCACA
  TCAAATCCTACT
}
Finding and Displaying Sequences
List Refinement
Run search program 1
Create a list of file names
Use as input to search program 2
Create a second list of file names
Edit the list file at each step as necessary
etc.
Programs Which Create a List of Sequences
Names Blast Lookup StringSearch FindPatterns FastA TFastA
Names
Searches sequence names for a match
analyze% names primate:Hum*
Will create a file listing all human sequences present in GenBank
Dependent on knowing name features
GenBank:Hum*
EMBL:Hs*
analyze% names -check pr:huma*

Names identifies GCG data files and sequence entries by name. It can show you what set of sequences is implied by any sequence specification.

Minimal Syntax: % names [-INfile=]GenEMBL:Humhb* -Default

Prompted Parameters:

[-OUTfile=]Term   output file name (defaults to your terminal)

Options:

-SHOwfiles=132    limits documentation in the output file to column 132
-NOHEAding        suppresses the heading at the top of the file.
-NOMONitor        suppresses the screen monitor

Add what to the command line ?

What (file of filenames) output file (* TERM *) ?

gb_pr1:
huma1aadr huma1acm huma1acmb huma1ar1
huma1ar2 huma1at huma1ata huma1atb
analyze% more list.file
!!SEQUENCE_LIST 1.0
! NAMES from: pr:huma* April 13, 1998 14:55 ..
gb_pr1:huma1aadr LOCUS HUMA1AADR 2002 bp mRNA PRI 04-NOV-1991 DEFINITION Human alpha-A1-adrenergic receptor mRNA, complete cds. ACCE
gb_pr1:huma1acm LOCUS HUMA1ACM 1520 bp mRNA PRI 30-OCT-1994 DEFINITION Human alpha-1-antichymotrypsin (AACT) mRNA, complete cds. ACC
gb_pr1:huma1acmb LOCUS HUMA1ACMB 559 bp DNA PRI 30-OCT-1994 DEFINITION Human alpha-1-antichymotrypsin gene, exon 1. ACCESSION M18035
gb_pr1:huma1ar1 LOCUS HUMA1AR1 890 bp DNA PRI 30-OCT-1994 DEFINITION Human alpha-1-antitrypsin-related protein gene, exon 2. ACCESSI
gb_pr1:huma1ar2 LOCUS HUMA1AR2 3758 bp DNA PRI 30-OCT-1994 DEFINITION Human alpha-1-antitrypsin-related protein gene, exons 3, 4 and
gb_pr1:huma1at LOCUS HUMA1AT 143 bp mRNA PRI 30-OCT-1994 DEFINITION Human alpha-1-antitrypsin (alpha-1-AT) mRNA, 3' end. ACCESSION M
gb_pr1:huma1ata LOCUS HUMA1ATA 322 bp DNA PRI 30-OCT-1994 DEFINITION Human alpha-1-antitrypsin gene, exon 1 (unexpressed). ACCESSION
gb_pr1:huma1atb LOCUS HUMA1ATB 1345 bp mRNA PRI 30-OCT-1994 DEFINITION Human alpha-1-antitrypsin mRNA, complete cds. ACCESSION M1146
StringSearch
Old search method
Searches for a particular text pattern in the sequence documentation
Definition Search
Record Search
Complete search for possible text occurrences
Very slow!
Lookup (gcgff only)
Rapid text pattern searching
Uses an index of sequence file documentation
Allows field-specific searches
Allows AND, OR, NOT matching
Lookup Considerations
Be sure that analyze is set to use a vt100 terminal: analyze% setenv TERM vt100
Lookup may miss some sequences
Dependent on the annotation
Spelling counts
Searches are case-insensitive
Logical Operators Within a Field
AND: &
A & B means find all entries that contain both A and B.
OR: |
A | B means find all entries that contain either A or B.
BUT-NOT: !
A ! B means find all entries that contain A but do not contain B.
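To make the three operators concrete, here is a small Python sketch that evaluates a query such as `globin & region` against the words of one entry. It is a simplified illustration, not LookUp's query engine: terms are single words, matching is case-insensitive, and mixed operators are applied strictly left to right, which is an assumption. The function name `matches` is hypothetical.

```python
import re


def matches(query, text):
    """Evaluate a query using & (AND), | (OR) and ! (BUT-NOT) on one entry."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    # Splitting on the operators keeps them in the token list.
    tokens = re.split(r"\s*([&|!])\s*", query.strip().lower())
    result = tokens[0] in words
    for op, term in zip(tokens[1::2], tokens[2::2]):
        hit = term in words
        if op == "&":
            result = result and hit
        elif op == "|":
            result = result or hit
        else:  # "!" is BUT-NOT
            result = result and not hit
    return result


print(matches("globin & region", "Human beta globin region"))
print(matches("globin ! pseudogene", "Human globin pseudogene"))
```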
analyze% lookup -check

LookUp identifies sequence database entries by name, accession number, author, organism, keyword, title, reference, feature, definition, length, or date. The output is a list of sequences.

The LookUp program is experimental in this release. LookUp sometimes crashes or produces incorrect results if you query a nucleic acid database and request fragment output. Please look carefully at your results.

Minimal Syntax: % lookup [-ALLtext=]Globin -Default

Prompted Parameters:

-LIBrary=SwissProt[,...]    lookup in specified data libraries
-ALLtext=Globin             searches all text indices for globin
-DEFInition=Globin          words indexed independently "Globin & Region"
-AUThor=Smithies            for more than one "Smithies,O. & Slightom,J.L."
-KEYword=Globin             see document before using keywords
-NAMe=hsggl3                entry name
-ACCessionnumber=S12345     accession number
-ORGanism="Homo Sapiens"    genus and species
-REFerence=Cell&1981        complete reference: "Cell & 26 & 191- & 1981"
-TITle=History              title of citation "History & Duplication"
-FEAture=Gamma              any word in a feature table
-SHOrtest=100               find only sequences of length 100 or more
-LONgest=400                find only sequences of length 400 or less
-EARliest=01-apr-1992       sequences modified on or after April 1, 1992
-LATest=30-apr-1992         sequences modified on or before April 30, 1992
-MATch=OR                   specifies inter-field logic (AND is default)
-OUTfile=lookup.list        output file for list of sequences

Optional Parameters:

-NOWILdcardextension        turns off automatic wildcard extension
[option name lost in extraction]  searches in lookup.list instead of libraries
-ANNotate=FEAture[,...]     shows fields from original annotation in output
                            acceptable values include: ACCession, AUThor, DATe, DEFinition, FEAture, NAMe, KEYword, ORGanism, REFerence, and TITle
-FRAgments                  shows features as fragments instead of whole entries
-COMplete                   shows only features with unambiguous coordinates
-MONitor                    shows databases searched and how many hits found

Add what to the command line ?
LOOKUP in what sequence libraries:

a) swissprot
b) sptrembl
c) pir
d) embl
e) genbank
f) em_tags
g) gb_tags
h) All libraries
q) quit

Please choose one or more (* h *):
Complete the query form below:

All text:
Definition:
Author:
Keyword:
Sequence name:
Accession number:
Organism:
Reference:
Title:
Feature:
On or after (dd-mmm-yy):
On or before (dd-mmm-yy):
Shortest sequence length:
Longest sequence length:
Inter-field operator: AND
Form of output list: Whole Entries

Press <Ctrl>D to continue.
SeqStore
Sequence searching
Lookup_rdb (gcgrdb)
Seqstore command-line sequence searching
Barebones – Use Seqstore Web interface
SeqStore Web Searching
Setup multiple criteria for selecting sets of sequences
Save as a query or list
Query: Active list. Changes as new sequences are added
List: Static list. No change with database updates
Save to SeqWeb
Powerful but can be slow
NCBI Sequence Services
Obtain sequences directly from NCBI
Sequence Searches
Sequence Retrieval
Other services
BLAST Searches
Sequence Submission
PubMed Searches
Entrez
NCBI Databases on the Web
Sequence retrieval
Text pattern searches
GenBank is updated on a daily basis
Web Site: http://www.ncbi.nlm.nih.gov
Finding Sequences by Similarity
Using GCG
Sequence Similarities
What other sequences have some primary sequence similarity to my query sequence?
Time and cost of the search depend on the size of the database
Restrict the size of the database
FindPatterns
Look for sequence patterns within sequence files
Allows complex pattern definitions Ambiguous sequence specifications
BLAST; NetBlast
All search combinations possible
blastn: nt vs. nt database
blastp: protein vs. protein database
blastx: translated nt vs. protein database
tblastn: protein vs. translated nt database
tblastx: translated nt vs. translated nt database
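The five query/database combinations can be captured in a small lookup table. The Python sketch below is only a memory aid built from the combinations listed on this slide; the function name `choose_blast` and the `"nt"`/`"protein"` labels are hypothetical, and the code does not run any search.

```python
# (query type, database type, compare at the translated level?) -> program
BLAST_PROGRAMS = {
    ("nt", "nt", False): "blastn",
    ("protein", "protein", False): "blastp",
    ("nt", "protein", True): "blastx",
    ("protein", "nt", True): "tblastn",
    ("nt", "nt", True): "tblastx",
}


def choose_blast(query_type, db_type, translate=False):
    """Pick the BLAST program for a query/database pairing."""
    if query_type != db_type:
        translate = True  # mixed searches always compare at the protein level
    try:
        return BLAST_PROGRAMS[(query_type, db_type, translate)]
    except KeyError:
        raise ValueError(
            f"no BLAST program for {query_type} vs. {db_type} (translate={translate})"
        ) from None


print(choose_blast("nt", "protein"))
print(choose_blast("nt", "nt", translate=True))
```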
FastA
Search nucleotide sequences with a nucleotide query
Search protein sequences with a peptide query
TFastA
Translates nucleotide sequences in all 6 reading frames
Search the translated sequences with a peptide query
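Translation in all six reading frames means three offsets on the given strand and three on its reverse complement. The Python sketch below illustrates the idea with the standard genetic code; it is not TFastA's implementation, and the names `translate` and `six_frames` are hypothetical. Codons containing anything other than A, C, G or T are reported as X.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate complete codons only; unknown codons become X."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def six_frames(seq):
    """Return frames 1-3 of the forward strand, then 1-3 of the reverse."""
    forward = seq.upper().replace("U", "T")
    reverse = forward.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (forward, reverse)
            for offset in range(3)]


print(six_frames("ATGGCCTAA"))
```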
Displaying Data
analyze% typedata
Displays on your screen the contents of any GCG data file
-REF
Display documentation only
Copying Data
analyze% fetch
Will copy any GCG data or sequence file to your directory
Sequence Symbols
Sequence symbols
Handout lists the sequence symbols recognized by GCG
Ambiguity codes are as proposed by the IUB nomenclature committee
Used by GenBank, EMBL, and NBRF
Nucleotide Symbols

IUB/GCG   Meaning                 Complement   Staden/Sanger
A         A                       T            A
C         C                       G            C
G         G                       C            G
T/U       T                       A            T
M         A or C                  K            5
R         A or G                  Y            R
W         A or T                  W            7
S         C or G                  S            8
Y         C or T                  R            Y
K         G or T                  M            6
V         A or C or G             B            not supported
H         A or C or T             D            not supported
D         A or G or T             H            not supported
B         C or G or T             V            not supported
X/N       G or A or T or C        X            -/X
. (Gap)   not G or A or T or C    .            not supported
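The complement column follows mechanically from the meaning column: complement every base a symbol stands for, then look up the symbol for the resulting set. The Python sketch below shows that derivation. It is an illustration under stated assumptions: it uses N for "any base" (the slide shows GCG's X/N), it leaves out the gap symbol, and the function names are hypothetical.

```python
# IUB ambiguity symbols and the bases each one stands for.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT",
}
BASE_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
# Reverse lookup: set of bases -> symbol.
SYMBOL_FOR = {frozenset(bases): symbol for symbol, bases in IUB.items()}


def complement_symbol(symbol):
    """Complement one symbol by complementing every base it can represent."""
    bases = IUB[symbol.upper()]
    return SYMBOL_FOR[frozenset(BASE_COMPLEMENT[b] for b in bases)]


def reverse_complement(seq):
    """Reverse-complement a sequence that may contain ambiguity symbols."""
    return "".join(complement_symbol(s) for s in reversed(seq))


print({s: complement_symbol(s) for s in "MRWSYKVHDB"})
print(reverse_complement("ATGRY"))
```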
Amino Acid Symbols

IUB Symbol   3-letter   Meaning                Codons                     Depiction
A            Ala        Alanine                GCT,GCC,GCA,GCG            !GCX
B            Asp,Asn    Aspartic, Asparagine   GAT,GAC,AAT,AAC            !RAY
C            Cys        Cysteine               TGT,TGC                    !TGY
D            Asp        Aspartic               GAT,GAC                    !GAY
E            Glu        Glutamic               GAA,GAG                    !GAR
F            Phe        Phenylalanine          TTT,TTC                    !TTY
G            Gly        Glycine                GGT,GGC,GGA,GGG            !GGX
H            His        Histidine              CAT,CAC                    !CAY
I            Ile        Isoleucine             ATT,ATC,ATA                !ATH
K            Lys        Lysine                 AAA,AAG                    !AAR
L            Leu        Leucine                TTG,TTA,CTT,CTC,CTA,CTG    !TTR,CTX,YTR;YTX
M            Met        Methionine             ATG                        !ATG
N            Asn        Asparagine             AAT,AAC                    !AAY
P            Pro        Proline                CCT,CCC,CCA,CCG            !CCX
Q            Gln        Glutamine              CAA,CAG                    !CAR
R            Arg        Arginine               CGT,CGC,CGA,CGG,AGA,AGG    !CGX,AGR,MGR;MGX
S            Ser        Serine                 TCT,TCC,TCA,TCG,AGT,AGC    !TCX,AGY;WSX
T            Thr        Threonine              ACT,ACC,ACA,ACG            !ACX
V            Val        Valine                 GTT,GTC,GTA,GTG            !GTX
W            Trp        Tryptophan             TGG                        !TGG
X            Xxx        Unknown                                           !XXX
Y            Tyr        Tyrosine               TAT,TAC                    !TAY
Z            Glu,Gln    Glutamic, Glutamine    GAA,GAG,CAA,CAG            !SAR
*            End        Terminator             TAA,TAG,TGA                !TAR,TRA;TRR
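The depiction column is a degenerate codon written with nucleotide ambiguity symbols, so each depiction expands to a set of concrete codons. The Python sketch below performs that expansion. It is an illustration only: X is treated as "any base" following the GCG usage shown on the slides, the leading `!` is stripped, and the function name `expand_codon` is hypothetical.

```python
from itertools import product

# Ambiguity symbols and the bases they stand for; X and N both mean any base.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT", "X": "ACGT",
}


def expand_codon(depiction):
    """Expand a degenerate codon such as '!ATH' into its concrete codons."""
    symbols = depiction.lstrip("!").upper()
    choices = [IUB[s] for s in symbols]
    return sorted("".join(combo) for combo in product(*choices))


print(expand_codon("!ATH"))
print(expand_codon("!RAY"))
```

Comparing the expansion with the codon column is a quick way to check an entry: `!RAY` gives exactly the four codons listed for B, while a looser depiction such as `!WSX` also covers codons of other amino acids.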
Other Stuff
Non-sequence Data
NonSequence Data
Non-Sequence Data
Data required to run a program
Copy to your directory with Fetch
Local Data Files
Copies of GCG Data files stored in your own directory.
May be altered as desired.
Using Local Data Files
Programs will look first in the default directory for a particular data file with a particular name.
If not found, the public data file will be used.
A user may specify a new name for the data file when running a program.
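The search order described here (your own directory first, then the public copy) is a simple fallback lookup. The Python sketch below illustrates the rule with ordinary directories; it is not how GCG resolves its logical names, and the function name `find_data_file` and both directory arguments are hypothetical.

```python
from pathlib import Path


def find_data_file(name, local_dir, public_dir):
    """Return the local copy of a data file if present, else the public one."""
    for directory in (local_dir, public_dir):
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"{name} not found in {local_dir} or {public_dir}")
```

For example, `find_data_file("enzyme.dat", ".", public_dir)` would pick up an edited copy in the working directory and fall back to the shared file only when no local copy exists.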
Restriction Enzyme Files
REBASE (enzyme.dat)
REBASE 6/2000
Dr. Richard J. Roberts, Cold Spring Harbor Laboratory
Used by: Map, MapSort, MapPlot
Prosite
Dictionary of sequence motifs
Dr. Amos Bairoch, University of Geneva
Release 16, 7/1999
Over 1300 patterns
Used by: Motifs
Profiles
Database of peptide profiles
Drs. Michael Gribskov and Amos Bairoch
Over 600 profiles
Used by: ProfileScan
Eukaryotic Transcription Factor Recognition Sites
Transcription Factor Database
Dr. David Ghosh, NCBI
Release 7.5, 3/96
genmoredata:tfsites.dat
Used by: FindPatterns, Map, MapSort, MapPlot
Codon Frequency Tables
Frequency of particular codon usage
Look in genmoredata
Organism: Human, E. coli, Drosophila
Used by: BackTranslate, CodonPreference
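A codon frequency table is, at its simplest, a count of how often each codon appears in coding sequences from one organism. The Python sketch below computes relative frequencies for a single in-frame coding sequence. It only illustrates the idea: it does not read or write GCG's codon frequency file format, and the function name `codon_frequencies` is hypothetical.

```python
from collections import Counter


def codon_frequencies(cds):
    """Return the fraction of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


print(codon_frequencies("ATGGCCGCCTAA"))
```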
Translation Tables
Standard table for translating nucleotide sequences into amino acid sequences
Look in genmoredata
Alternate translation tables: Mitochondria, Mycoplasma
Used by: Translate, Map, Frames
Symbol Comparison Tables
Amino acid similarities
What is the chance that one amino acid can substitute for another without affecting function?
Used by all sequence comparison programs: FastA, TFastA, Blast, Gap, BestFit, PileUp
Protein Analysis Data
Amino acid properties: charge, hydrophobicity, molecular weight, secondary structure predictions, etc.
Protease digestion sites
Used by: PepPlot, PlotStructure
Free Energy Values
RNA secondary structure prediction Used by:
Mfold, FoldRNA