Data Sequences and Other Stuff

Sequence Data

Nucleic Acid and Protein Sequences

Sources of Genetic Sequences
  User
  GCG supplied databases
    Flat File
    Oracle Relational Database
  NCBI supplied databases
  Other databases

Sequence Databases

Genbank, EMBL, DDBJ

NCBI, PIR, Swiss-Prot, Swiss-Prot TrEMBL

Genbank

Primary nucleic acid sequence database
Maintained by NCBI
  National Center for Biotechnology Information
  http://www.ncbi.nlm.nih.gov
Current Release 122, 2/2001
  11,720,120,326 bases
  10,896,781 sequences

How Many Organisms Are In The Sequence Databases? (April 1, 2001)

Species      1995   1996   1997   1998   1999   2000   2001  Increase      Increase
                                                             (since 1995)  (12 months)
all:        16109  23119  32880  43516  61952  87751  95168  490%          40.9%
Viruses:     1845   2122   2678   2968   3573   4428   4857  163%          32.4%
Bacteria:    2939   3847   6091   8711  14322  22758  24878  746%          53.3%
Archaea:      162    235    385    555   1015   1709   1906  1076%         68.8%
Eukaryota:  10366  15901  22596  29926  41420  56961  61571  493%          37.4%
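The "Increase (since 1995)" column can be reproduced from the 1995 and 2001 counts. The sketch below assumes the figure is plain percentage growth truncated to a whole number, which matches all five rows. The function name is illustrative only. The 12-month column cannot be recomputed this way because its starting counts are not shown.

```python
def percent_increase(old: int, new: int) -> int:
    """Whole-percent growth from old to new, truncated rather than rounded."""
    return (new - old) * 100 // old


# (1995 count, 2001 count) taken from the table above.
counts = {
    "all": (16109, 95168),
    "Viruses": (1845, 4857),
    "Bacteria": (2939, 24878),
    "Archaea": (162, 1906),
    "Eukaryota": (10366, 61571),
}

for taxon, (y1995, y2001) in counts.items():
    print(f"{taxon:<10} {percent_increase(y1995, y2001)}%")
```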

Other NCBI Databases

HTGS EST STS GSS RefSeq Unigene Genomic

HTGS

High Throughput Genomic Sequences
‘Unfinished’ DNA sequences generated by the high-throughput sequencing centers
Phase 0
  Single-few pass reads of a single clone (not contigs)
Phase 1
  Unfinished, may be unordered, unoriented contigs, with gaps
Phase 2
  Unfinished, ordered, oriented contigs, with or without gaps
Phase 3
  Primary division (Genbank)
  Finished, no gaps (with or without annotations)

EST

Expressed Sequence Tags
“Single-pass” cDNA sequences
Generally representative of the 3’ ends of cDNAs
More “full-length” ESTs now available

STS

Sequence Tagged Sites
Sequence and mapping data
Short genomic landmark sequences

GSS

Genome Survey Sequences
Similar to the EST division, except that its sequences are genomic in origin, rather than cDNA
  Random “single pass read” genome survey sequences
  Cosmid/BAC/YAC end sequences
  Exon trapped genomic sequences
  Alu PCR sequences

RefSeq

NCBI Reference Sequence project
Provides reference sequence standards for the naturally occurring molecules, from chromosomes to mRNAs to proteins
Stable reference point for:
  mutation analysis
  gene expression studies
  polymorphism discovery

RefSeq…

Curated RefSeq transcripts and proteins

Genome Annotation contigs, transcripts, and proteins

Complete Genomes genomes, chromosomes, and proteins

Unigene

Experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters
Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location
Includes EST and cDNA sequences
Includes human, rat, mouse, cow and zebrafish

HomoloGene

Curated and calculated orthologs and homologs for genes represented in UniGene and LocusLink

Includes human, mouse, rat, zebrafish, cow and Drosophila

LocusLink

Provides a single query interface to curated sequence and descriptive information about genetic loci
  Nomenclature
  Aliases
  Sequence accessions
  Phenotypes
  EC numbers
  MIM numbers
  UniGene clusters
  Homology
  Map locations
  Web sites

EMBL and DDBJ

European Molecular Biology Laboratory Hinxton, UK http://www.ebi.ac.uk/

DNA Data Bank of Japan Mishima, Japan http://www.ddbj.nig.ac.jp/


Coordination with Genbank

Prevents duplication
Genbank enters sequences from U.S. journals and researchers
EMBL handles European data
DDBJ handles Asian data
Data exchanged daily

Sequence submissions

Sequences entered from journals
Sequences submitted by individual researchers
  BankIt
    NCBI WWW Site
  Sequin
    Multi-platform program

Sequence Names

DO NOT rely on names to find particular sequences

Few conventions
Organism
  Hum: Human
  Mus: Mouse
  Eco: E. coli
  Syn: Synthetic

Last Letter(s)

Sometimes gives useful information
  cg: Complete genome (viruses)

Other Letters

Specifies a particular sequence
  vsvcg: Vesicular stomatitis virus (Indiana serotype) complete genome

EMBL File Names

Ec: E. coli Hs: Human

Locus name

Names are short, fairly non-descriptive, and can change from one release to another
  vsvcg: the complete sequence for the virus VSV
Most “mnemonic” names already taken
Genbank now using accession numbers as locus names

Accession Numbers

Each sequence submitted to a database is assigned a unique primary accession number
Accession numbers do not change
If a sequence is merged with another, a new accession number is assigned, and the original number becomes a secondary accession number
Accession numbers may include version numbers
  A02428.2
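A versioned accession is the primary accession followed by a dot and a version number, so it can be split mechanically. The sketch below is a minimal illustration. The function and class names are made up for this example, and the pattern only covers the simple letters-then-digits style used in these slides (for example M28668). The ".1" suffix in the usage line is a hypothetical version, not one taken from the record.

```python
import re
from typing import NamedTuple, Optional


class Accession(NamedTuple):
    primary: str            # stable accession, e.g. "M28668"
    version: Optional[int]  # None when no ".N" suffix is present


# Letters followed by digits, with an optional ".version" suffix.
_ACCESSION_RE = re.compile(r"^([A-Za-z]+\d+)(?:\.(\d+))?$")


def parse_accession(text: str) -> Accession:
    """Split 'M28668.1' into ('M28668', 1); 'M28668' into ('M28668', None)."""
    match = _ACCESSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a simple accession number: {text!r}")
    primary, version = match.groups()
    return Accession(primary.upper(), int(version) if version else None)


print(parse_accession("M28668"))
print(parse_accession("m28668.1"))
```

Keeping the version separate makes it easy to look a sequence up by its stable primary accession while still noticing when the sequence itself has been revised.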

Accession Numbers

Using GCG to access sequences via their accession number
  Data Library:Accession Number
  Flatfile - vi:J02428
  RDB - gcgnuc:J02428

The Sequence Record

Different for each database
  Locus (Name)
  Accession Number
  Keywords
  Description
  Properties
  References
  The Sequence

analyze% typedata ge:humcftrm
!!NA_SEQUENCE 1.0
LOCUS       HUMCFTRM     6129 bp    mRNA            PRI       15-DEC-1989
DEFINITION  Human cystic fibrosis mRNA, encoding a presumed transmembrane
            conductance regulator (CFTR).
ACCESSION   M28668
NID         g180331
KEYWORDS    cystic fibrosis; transmembrane conductance regulator.
SOURCE      Human, cDNA to mRNA.
  ORGANISM  Homo sapiens
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 6129)
  AUTHORS   Riordan,J.R., Rommens,J.M., Kerem,B., Alon,N., Rozmahel,R.,
            Grzelczak,Z., Zielenski,J., Lok,S., Plavsic,N., Chou,J.-L.,
            Drumm,M.L., Iannuzzi,M.C., Collins,F.S. and Tsui,L.-C.
  TITLE     Identification of the cystic fibrosis gene: Cloning and
            characterization of complementary DNA
  JOURNAL   Science 245, 1066-1073 (1989)
  MEDLINE   89368940
COMMENT     A three base-pair deletion spanning positions 1654-1656 is
            observed in cDNAs from cystic fibrosis patients.
FEATURES             Location/Qualifiers
     source          1..6129
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
     CDS             133..4575
                     /note="cystic fibrosis transmembrane conductance
                     regulator"
                     /codon_start=1
                     /db_xref="PID:g180332"
                     /translation="MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLELSDIYQIPSVD
                     SADNLSEKLEREWDRELASKKNPKLINALRRCFFWRFMFYGIFLYLGEVTKAVQPLLL
                     LNRFSKDIAILDDLLPLTIFDFIQLLLIVIGAIAVVAVLQPYIFVATVPVIVAFIMLR
                     AYFLQTSQQLKQLESEGRSPIFTHLVTSLKGLWTLRAFGRQPYFETLFHKALNLHTAN
                     WFLYLSTLRWFQMRIEMIFVIFFIAVTFISILTTGEGEGRVGIILTLAMNIMSTLQWA
                     VNSSIDVDSLMRSVSRVFKFIDMPTEGKPTKSTKPYKNGQLSKVMIIENSHVKKDDIW
                     PSGGQMTVKDLTAKYTEGGNAILENISFSISPGQRVGLLGRTGSGKSTLLSAFLRLLN
                     TEGEIQIDGVSWDSITLQQWRKAFGVIPQKVFIFSGTFRKNLDPYEQWSDQEIWKVAD
                     EVGLRSVIEQFPGKLDFVLVDGGCVLSHGHKQLMCLARSVLSKAKILLLDEPSAHLDP
                     VTYQIIRRTLKQAFADCTVILCEHRIEAMLECQQFLVIEENKVRQYDSIQKLLNERSL
                     FRQAISPSDRVKLFPHRNSSKCKSKPQIAALKEETEEEVQDTRL"
BASE COUNT     1886 a   1181 c   1330 g   1732 t
ORIGIN

HUMCFTRM  Length: 6129  April 13, 1998 13:00  Type: N  Check: 6781  ..

       1  AATTGGAAGC AAATGACATC ACAGCAGGTC AGAGAAAAAG GGTTGAGCGG
      51  CAGGCACCCA GAGTAGTAGG TCTTTGGCAT TAGGAGCTTG AGCCCAGACG
     101  GCCCTAGCAG GGACCCCAGC GCCCGAGAGA CCATGCAGAG GTCGCCTCTG
     151  GAAAAGGCCA GCGTTGTCTC CAAACTTTTT TTCAGCTGGA CCAGACCAAT
     201  TTTGAGGAAA GGATACAGAC AGCGCCTGGA ATTGTCAGAC ATATACCAAA
     251  TCCCTTCTGT TGATTCTGCT GACAATCTAT CTGAAAAATT GGAAAGAGAA
     301  TGGGATAGAG AGCTGGCTTC AAAGAAAAAT CCTAAACTCA TTAATGCCCT
     351  TCGGCGATGT TTTTTCTGGA GATTTATGTT CTATGGAATC TTTTTATATT
     401  TAGGGGAAGT CACCAAAGCA GTACAGCCTC TCTTACTGGG AAGAATCATA
     451  GCTTCCTATG ACCCGGATAA CAAGGAGGAA CGCTCTATCG CGATTTATCT
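The header fields of a record like this one (locus, definition, accession, keywords) can be pulled out with a few lines of code. The sketch below is only an illustration. It assumes the record is in the usual flat-file layout with one keyword per line in column 1 and indented continuation lines, and the function name is hypothetical.

```python
def parse_genbank_header(record: str) -> dict:
    """Collect a few top-level header fields from a GenBank-style record.

    Continuation lines (those starting with whitespace) are appended to
    the field they follow. Parsing stops at FEATURES or ORIGIN.
    """
    wanted = {"LOCUS", "DEFINITION", "ACCESSION", "KEYWORDS"}
    fields = {}
    current = None  # the wanted keyword we are currently inside, if any
    for line in record.splitlines():
        if line.startswith(("FEATURES", "ORIGIN")):
            break
        if line[:1].isspace():
            if current in fields:
                fields[current] += " " + line.strip()
            continue
        keyword, _, value = line.partition(" ")
        current = keyword if keyword in wanted else None
        if current:
            fields[current] = value.strip()
    return fields


sample = """LOCUS       HUMCFTRM     6129 bp    mRNA            PRI       15-DEC-1989
DEFINITION  Human cystic fibrosis mRNA, encoding a presumed transmembrane
            conductance regulator (CFTR).
ACCESSION   M28668
KEYWORDS    cystic fibrosis; transmembrane conductance regulator.
SOURCE      Human, cDNA to mRNA.
FEATURES             Location/Qualifiers
"""
header = parse_genbank_header(sample)
print(header["LOCUS"].split()[0], header["ACCESSION"])
```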

analyze% typedata -ref GB_PR:HUMIFNRF1A
!!NA_SEQUENCE 1.0
LOCUS       HUMIFNRF1A   7721 bp    DNA             PRI       10-NOV-1992
DEFINITION  Homo sapiens interferon regulatory factor 1 gene, complete cds.
ACCESSION   L05072
NID         g184648
KEYWORDS    interferon regulatory factor 1.
SOURCE      Homo sapiens Placenta DNA.
  ORGANISM  Homo sapiens
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 7721)
  AUTHORS   Cha,Y., Sims,S.H., Romine,M.F., Kaufmann,M. and Deisseroth,A.B.
  TITLE     Human interferon regulatory factor 1: intron/exon organization
  JOURNAL   DNA Cell Biol. 11, 605-611 (1992)
  MEDLINE   93000481
FEATURES             Location/Qualifiers
     source          1..7721
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /tissue_type="Placenta"
                     /map="5q23-q31"
     exon            1..219
                     /gene="IRF1"
                     /note="putative"
                     /number=1
     5'UTR           join(1..219,1279..1287)
                     /gene="IRF1"
     gene            join(1..219,1279..1287)
                     /gene="IRF1"
     intron          220..1278
                     /gene="IRF1"
                     /number=1
     exon            1279..1374
                     /gene="IRF1"
                     /number=2
     CDS             join(1288..1374,2738..2837,3630..3806,3916..3965,
                     4073..4202,4386..4508,5040..5089,6248..6383,6670..6794)
                     /gene="IRF1"
                     /codon_start=1
                     /product="interferon regulatory factor 1"
                     /db_xref="PID:g184649"
                     /translation="MPITRMRMRPWLEMQINSNQIPGLIWINKEEMIFQIPWKHAAKH
                     GWDINKDACLFRSWAIHTGRYKAGEKEPDPKTWKANFRCAMNSLPDIEEVKDQSRNKG
                     SSAVRVYRMLPPLTKNQRKERKSKSSRDAKSKAKRKSCGDSSPDTFSDGLSSSTLPDD
                     HSSYTVPGYMQDLEVEQALTPALSPCAVSSTLPDWHIPVEVVPDSTSDLYNFQVSPMP
                     STSEATTDEDEEGKLPEDIMKLLEQSEWQPTNVDGKGYLLNEPGVQPTSVYGDFSCKE
                     EPEIDSPGGDIGLSLQRVFTDLKNMDATWLDSLLTPVRLPSIQAIPCAP"
     intron          1375..2737
                     /gene="IRF1"
                     /number=2
     exon            2738..2837
                     /gene="IRF1"
                     /number=3
     intron          2838..3629
                     /gene="IRF1"
                     /number=3
     exon            3630..3806
                     /gene="IRF1"
                     /number=4
     intron          3807..3915
                     /gene="IRF1"
                     /number=4
     exon            3916..3965
                     /gene="IRF1"
                     /number=5
     intron          3966..4072
                     /gene="IRF1"
                     /number=5

...

     exon            5040..5089
                     /gene="IRF1"
                     /number=8
     intron          5090..6247
                     /gene="IRF1"
                     /number=8
     exon            6248..6383
                     /gene="IRF1"
                     /number=9
     intron          6384..6669
                     /gene="IRF1"
                     /number=9
     exon            6670..7656
                     /gene="IRF1"
                     /number=10
     3'UTR           6795..7656
BASE COUNT     1750 a   1946 c   2253 g   1772 t
ORIGIN

analyze% typedata -ref est:hum091226f
!!NA_SEQUENCE 1.0
LOCUS       HUM091226F    152 bp    mRNA            EST       02-APR-1996
DEFINITION  Homo sapiens retinal fovea EST HFV091226 sequence.
ACCESSION   L48850
NID         g1254959
KEYWORDS    EST; expressed sequence tag.
SOURCE      Homo sapiens (clone: EST HFV091226) age normalized retinal foveae
            cDNA to mRNA.
  ORGANISM  Homo sapiens
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (sites)
  AUTHORS   Adams,M.D., Kerlavage,A.R., Fields,C. and Venter,J.C.
  TITLE     3,400 new expressed sequence tags identify diversity of
            transcripts in human brain
  JOURNAL   Nature Genet. 4 (3), 256-267 (1993)
  MEDLINE   93364420
REFERENCE   2  (sites)
  AUTHORS   Liew,C.C., Hwang,D.M., Fung,Y.W., Laurenssen,C., Cukerman,E.,
            Tsui,S. and Lee,C.Y.
  TITLE     A catalogue of genes in the cardiovascular system as identified
            by expressed sequence tags
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 91 (22), 10645-10649 (1994)
  MEDLINE   95024171
REFERENCE   3  (bases 1 to 152)
  AUTHORS   Bernstein,S.L., Borst,D.E., Neuder,M.E. and Wong,P.
  TITLE     Characterization of a human fovea cDNA library and regional
            differential gene expression in the human retina
  JOURNAL   Genomics 32 (3), 301-308 (1996)
FEATURES             Location/Qualifiers
     source          1..152
                     /organism="Homo sapiens"
                     /note="Expressed sequence tags (first pass sequencing)
                     from randomly selected bacteriophage clones (mRNA-cDNA)
                     from human retinal fovea. The library is age normalized
                     from ten sets of donor foveae 2-79 years old."
                     /db_xref="taxon:9606"
                     /clone="EST HFV091226"
                     /dev_stage="age normalized"
                     /tissue_type="retinal foveae"
     mRNA            <1..>152
                     /standard_name="EST HFV091226"
BASE COUNT       31 a     42 c     41 g     36 t      2 others
ORIGIN

analyze% typedata -ref sts:humswx153
!!NA_SEQUENCE 1.0
LOCUS       HUMSWX153     192 bp    DNA             STS       24-MAY-1993
DEFINITION  Human chromosome X STS sWXD153; single read.
ACCESSION   L15212
NID         g292645
KEYWORDS    STS; primer; sequence tagged site.
SOURCE      Homo sapiens DNA.
  ORGANISM  Homo sapiens
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 192)
  AUTHORS   Kere,J., Nagaraja,R., Mumm,S.R., Ciccodicola,A., D'Urso,M. and
            Schlessinger,D.
  TITLE     Mapping human chromosomes by walking with sequence-tagged sites
            from end fragments of yeast artificial chromosome inserts
  JOURNAL   Genomics 14, 241-248 (1992)
  MEDLINE   93052321
COMMENT     Submitted by: David Schlessinger, Center for Genetics in Medicine,
            Washington University School of Medicine, Box 8232
            4566 Scott Avenue, St. Louis, MO 63110, USA
            e-mail: [email protected]
            Primer A: TAAAGGGATCGCCAAGGAC
            Primer B: CTTACTCATTTGCTGGATTCTC
            STS size: 85bp
            Template: 600 ng/100ul
            Primer: 40 pmoles/100ul
            dNTPs: 100 uM
            MgCl2: 1.5 mM
            KCl: 100 mM
            TrisHCl: 10 mM
            Taq Polymerase: 0.125 U
            NH4Cl: 5 mM
            pH: 8.6
            Total Vol: 5 ul
            PCR Profile:
              Denaturation: 94 degrees C for 1.00 minute(s)
              Annealing: 55 degrees C for 2.00 minute(s)
              Polymerization: 72 degrees C for 2.00 minute(s)
              PCR Cycles: 35
              Thermal Cycler: P-E.
FEATURES             Location/Qualifiers
     source          1..192
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /map="Xq13-q24"
     STS             60..144
                     /standard_name="sWXD153"
     primer_bind     60..78
     primer_bind     complement(123..144)
BASE COUNT       72 a     26 c     60 g     29 t      5 others
ORIGIN
analyze%

Swiss-Prot

http://www.expasy.ch/sprot/
Protein Database
University of Geneva
Arranged by protein function
Release 39.15, March 19, 2001
  94,152 entries
Provides annotated protein records


Swiss-Prot Names

Protein_Species
Allows easier comparisons when studying evolutionary relationships
  H1b_Human
    Human histone 1b

Swiss-Prot Names

Vgl*_*
  Viral glycoproteins
VGLG_HRSVL
  Viral GLycoprotein G
  Human Respiratory Syncytial Virus Long strain
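Because Swiss-Prot names follow the Protein_Species pattern, the two halves can be separated with a single split. The sketch below is illustrative only. The function names are made up for this example, and the viral-glycoprotein check simply encodes the Vgl*_* convention described above.

```python
def split_swissprot_name(name: str) -> tuple:
    """Split a Protein_Species entry name into (protein, species) codes."""
    protein, sep, species = name.strip().partition("_")
    if not sep or not protein or not species:
        raise ValueError(f"expected Protein_Species, got {name!r}")
    return protein.upper(), species.upper()


def is_viral_glycoprotein(name: str) -> bool:
    """True when the protein code matches the Vgl* naming convention."""
    protein, _species = split_swissprot_name(name)
    return protein.startswith("VGL")


print(split_swissprot_name("H1b_Human"))
print(is_viral_glycoprotein("VGLG_HRSVL"))
```

Grouping entries by the protein half gives the same protein across species, which is the comparison the naming scheme is meant to make easy.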

analyze% typedata swp:H1b_Human
!!AA_SEQUENCE 1.0
ID   H1B_HUMAN      STANDARD;      PRT;   218 AA.
AC   P10412;
DT   01-MAR-1989 (REL. 10, CREATED)
DT   01-MAR-1989 (REL. 10, LAST SEQUENCE UPDATE)
DT   01-JUN-1994 (REL. 29, LAST ANNOTATION UPDATE)
DE   HISTONE H1B (H1.4).
GN   H1F4.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC   EUTHERIA; PRIMATES.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 92009931.
RA   ALBIG W., KARDALINOU E., DRABENT B., ZIMMER A., DOENECKE D.;
RL   GENOMICS 10:940-948(1991).
RN   [2]
RP   SEQUENCE.
RC   TISSUE=SPLEEN;
RX   MEDLINE; 87057092.
RA   OHE Y., HAYASHI H., IWAI K.;
RL   J. BIOCHEM. 100:359-368(1986).
CC   -!- FUNCTION: HISTONES H1 ARE NECESSARY FOR THE CONDENSATION OF
CC       NUCLEOSOME CHAINS INTO HIGHER ORDER STRUCTURES.
CC   -!- SUBCELLULAR LOCATION: NUCLEAR.
CC   -!- THIS VARIANT ACCOUNTS FOR 60% OF HISTONE H1.
DR   EMBL; M60748; G184074; -.
DR   PIR; A24413; HSHU1B.
DR   PIR; C40335; C40335.
DR   HSSP; P08287; 1GHC.
KW   CHROMOSOMAL PROTEIN; NUCLEAR PROTEIN; DNA-BINDING; MULTIGENE FAMILY;
KW   ACETYLATION; METHYLATION.
FT   INIT_MET      0      0
FT   MOD_RES       1      1       ACETYLATION.
FT   MOD_RES      25     25       METHYLATION (PARTIAL).
FT   DOMAIN       35    113       GLOBULAR.
SQ   SEQUENCE   218 AA;  21734 MW;  5A277FB0 CRC32;

H1B_HUMAN  Length: 218  April 13, 1998 13:19  Type: P  Check: 2701  ..

       1  SETAPAAPAA PAPAEKTPVK KKARKSAGAA KRKASGPPVS ELITKAVAAS
      51  KERSGVSLAA LKKALAAAGY DVEKNNSRIK LGLKSLVSKG TLVQTKGTGA
     101  SGSFKLNKKA ASGEAKPKAK KAGAAKAKKP AGAAKKPKKA TGAATPKKSA
     151  KKTPKKAKKP AAAAGAKKAK SPKKAKAAKP KKAPKSPAKA KAVKPKAAKP
     201  KTAKPKAAKP KKAAAKKK

analyze%

Swiss-Prot TrEMBL

Translation of all EMBL Nucleic Acid coding sequences not yet present in Swiss-Prot

Allows rapid availability without immediate annotation

Release 16.3, March 30, 2001
  436,896 entries

TrEMBL Divisions

Everything in TrEMBL: spt
  sp_bacteria
  sp_fungi
  sp_human
  sp_invertebrate
  sp_mammal
  sp_mhc
  sp_organelle
  sp_phage
  sp_plant
  sp_rodent
  sp_unclassified
  sp_vertebrate

Protein Identification Resource - PIR

http://pir.georgetown.edu/
National Biomedical Research Foundation
Georgetown University
Current Release 67.05, March 23, 2001
  219,178 Entries


National Biomedical Research Foundation

Database begun over twenty years ago by Margaret O. Dayhoff

Originally published sequences in book form

Started with sequences derived from direct amino acid sequencing

analyze% typedata -ref PIR1:HSHU1B
!!AA_SEQUENCE 1.0
P1;HSHU1B - histone H1-4 - human
N;Alternate names: histone H1.4; histone H1b
C;Species: Homo sapiens (man)
C;Date: 31-Dec-1988 #sequence_revision 12-Apr-1996 #text_change 05-Sep-1997
C;Accession: C40335; A24413
R;Albig, W.; Kardalinou, E.; Drabent, B.; Zimmer, A.; Doenecke, D.
Genomics 10, 940-948, 1991
A;Title: Isolation and characterization of two human H1 histone genes within clusters of core histone genes.
A;Reference number: A40335; MUID:92009931
A;Accession: C40335
A;Status: preliminary
A;Molecule type: DNA
A;Residues: 1-219 <ALB>
A;Cross-references: GB:M60748; NID:g184073; PID:g184074
A;Experimental source: blood
R;Ohe, Y.; Hayashi, H.; Iwai, K.
J. Biochem. 100, 359-368, 1986
A;Title: Human spleen histone H1. Isolation and amino acid sequence of a main variant, H1b.
A;Reference number: A24413; MUID:87057092
A;Accession: A24413
A;Molecule type: protein
A;Residues: 2-219 <OHE>
A;Experimental source: spleen
C;Comment: This variant accounts for 60% of histone H1.
C;Genetics:
A;Gene: GDB:H1F4
A;Cross-references: GDB:120030; OMIM:142220
A;Map position: 12q11-12q21
C;Superfamily: histone H1
C;Keywords: acetylated amino end; chromosomal protein; DNA binding; methylated amino acid; nucleosome; spleen
F;2-219/Product: histone H1-4 #status experimental <MAT>
F;2-32/Domain: amino-terminal <NH2>
F;33-110/Domain: globular <GLB>
F;111-219/Domain: carboxyl-terminal <END>
F;2/Modified site: acetylated amino end (Ser) (in mature form) #status experimental
F;26/Modified site: N6-methyllysine (Lys) (partial) #status experimental

iProClass Database - PIR

http://pir.georgetown.edu/iproclass/
Comprehensive family relationships and structural/functional classifications and features of proteins
  Superfamilies
  Families
  Domains


GCG Supplied Databases

GCG sequence database files are NOT normal UNIX files.
UNIX commands cannot be used to manipulate sequences in these databases
  Stored as Data Libraries
  Stored in Oracle RDB

Sequence Data Updates

Genbank
  Daily
GCG Flat file
  No longer updated
  Last update June, 2000
GCG SeqStore Oracle RDB
  Daily updates

Database listing – GCG-FF

Databases available:

GenBank Release 118.0 (06/2000)

EMBL (Abridged) Release 62.0 (03/2000)

PIR-Protein Release 65.0 (06/2000)

NRL_3D Release 27.0 (03/2000)

SWISS-PROT Release 39.0 (06/2000)

SP-TREMBL Release 14.0 (06/2000)

PROSITE Release 16.0 (07/1999)

Restriction Enzymes (REBASE) (06/2000)

Database listing – SeqStore

Databases available:

GCGNUC updated nightly by DATASERVE

GCGPROT updated weekly by DATASERVE

GCGEST updated nightly by DATASERVE

PROSITE Release 15.0 (07/1999)

Restriction Enzymes (REBASE) (06/2000)

Data Libraries

Allows rapid searches
Sequences organized into groups
Each data library can be referred to by a logical name
Individual sequences can be extracted from the data library.

Logical Names:GCG Sequence Databases

http://www.microbio.uab.edu/seqCourse/datalib.htm









GCG SeqStore (Oracle-based Sequences)

Data Library Names

Database Name   Description

Nucleic Acid Sequences
gcgnuc          All Genbank nucleotide sequences (except ESTs), updated nightly by SeqStore
gcgest          All Genbank Expressed Sequence Tags, updated nightly by SeqStore

Protein Sequences
gcgprot         All Swissprot and Swissprot TrEMBL sequences, updated nightly by SeqStore

GCG Flat-file

Data Library Names

Nucleic Acid Databases (Genbank and EMBL)

Database Name(s)                                  Description
GenEMBL, GE                                       Entire database (except tags)
genemblplus, gep, geplus                          Entire database (including tags)
Bacterial, Bacteria, Ba                           Bacterial sequences
HTG                                               High throughput genome
Invertebrate, In                                  Invertebrate sequences
Organelle, Or                                     Organelle sequences
Other_Mammalian, OtherMammal, OtherMamm, Om       Non-rodent, non-primate mammalian sequences
Other_Vertebrate, Ov, OtherVertebrate, OtherVert  Non-mammalian vertebrate sequences

Nucleic Acid Databases…

Database Name(s)                Description
Patent, Pat                     Sequences from patents and patent applications
Phage, Ph                       Phage sequences
Plant, Pl                       Plant and Fungal sequences
Primate, Pr                     Primate (Mammalian) sequences
Rodent, Ro                      Rodent (Mammalian) sequences
Structural_RNA, Structural, St  Structural RNA sequences (such as rRNAs)
Synthetic, Sy                   Synthetic sequences
Unannotated, Un                 Unannotated sequences
Viral, Vi                       Viral sequences

Sequence Tag Databases

Database Name(s)  Description
EST               Expressed sequence tags
GSS               Genome survey sequences
STS               Sequence-tagged site sequences
Tags              EST, STS, and GSS

Protein Databases

Database Name(s)            Description
PIR, P                      Entire PIR-Protein Protein Sequence Data Library
Protein, Prot, PIR1         PIR-Protein annotated sequences
New, Nw                     PIR-Protein preliminary and unverified sequences
PIR2                        PIR-Protein preliminary sequences
PIR3                        PIR-Protein unverified sequences
SwissProt, Swiss            Entire SwissProt Protein Sequence Data Library
Sptrembl, spt               Newly added preliminary sequences, translated from EMBL
swissprotplus, swplus, swp  SwissProt + SPTrEMBL

NCBI Blast Databases

Nucleotide Databases for NetBlast Searching

nr Non-redundant Genbank+EMBL+DDBJ+PDB sequences (but no EST's or STS's)

pdb PDB nucleotide sequences

vector Vector subset of Genbank

yeast Saccharomyces cerevisiae genomic nucleotide sequences

est Non-redundant Database of Genbank+EMBL+DDBJ EST Division

sts Non-redundant Database of Genbank+EMBL+DDBJ STS Division

htgs High Throughput Genomic Sequences

mito Database of mitochondrial sequences, Rel. 1.0, July 1995

kabat Kabat Sequences of Nucleic Acid of Immunological Interest

epd Eukaryotic Promoter Database

alu Select Alu Repeats from REPBASE

gss Genome Survey Sequence, includes single_pass genomic data

ecoli E. coli genomic nucleotide sequences

Drosophila genome Drosophila genome provided by Celera and Berkeley

month All new or revised Genbank+EMBL+DDBJ+PDB sequences released in the last 30 days

Protein Databases for NetBlast Searching

nr Non-redundant Genbank CDS translations+PDB+SwissProt+PIR

pdb PDB protein sequences

swissprot SwissProt sequences

yeast Saccharomyces cerevisiae protein sequences

kabat Kabat Sequences of Proteins of Immunological Interest

alu Translations of Select Alu Repeats from REPBASE

ecoli E. coli genomic CDS translations

Drosophila genome Drosophila genome proteins provided by Celera and Berkeley

month All new or revised Genbank CDS translation+PDB+SwissProt+PIR sequences released in the last 30 days

Specifying Sequences

Filename
Data library specification
Accession number specification

Sequences within your own directories

Use the normal file specification:

lefkowit/sequences/vsvcg.seq

Sequences within a Data Library

Flatfile
  Data Library:Sequence Name
  sw:vglg_vsvsj - VSV G protein in the SwissProt library
  primate:humada - the sequence for human adenosine deaminase mRNA
SeqStore
  gcgprot:vglg_vsvsj
  gcgnuc:humada
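The two styles of specification (an ordinary file path versus a Data Library:Sequence Name pair) differ only in whether a colon is present. The sketch below shows one way to tell them apart. It is an illustration under that assumption rather than a description of how GCG parses its arguments, and the function name is hypothetical.

```python
from typing import Optional, Tuple


def parse_sequence_spec(spec: str) -> Tuple[Optional[str], str]:
    """Return (library, name) for 'library:name', or (None, path) for a file."""
    library, sep, name = spec.strip().partition(":")
    if sep and library and name:
        return library, name
    return None, spec.strip()


for spec in ("sw:vglg_vsvsj", "gcgnuc:humada", "lefkowit/sequences/vsvcg.seq"):
    print(spec, "->", parse_sequence_spec(spec))
```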

Sequence Formats

GCG requires a specific sequence format
Sequences entered from outside GCG must be reformatted
  analyze% reformat
    GCG program
  analyze% readseq
    Non-GCG addition

Non-GCG Sequence File

analyze% cat seq.txt

ACGAAGACAAACAAACCATTATTATCATTAAAAGGCTC

AGGAGAAACTTTAACAGTAATCAAAATGTCTGTTACAG

TCAAGAGAATCATTGACAACACAG

analyze%

analyze% reformat

analyze% reformat -check seq.txt

Reformat rewrites sequence file(s), scoring matrix file(s), or enzyme

data file(s) so that they can be read by GCG programs.

Minimal Syntax: % reformat [-INfile=]reformat.txt -Default

Prompted Parameters: None

Local Data Files:

-DATa=translate.txt three-letter to one-letter codes

Optional Parameters:

[-OUTfile=]NewSeqName      names the output file
-EXTension=.seq            specifies a file name extension for the output
-LIStfile[=reformat.list]  writes a list file of output sequence names
-MSF                       reformats sequences into an MSF output file
-RSF                       reformats sequences into an RSF output file
-PROtein or -NUCleotide    insists that the sequences are reformatted as protein or nucleotide sequences
-DEGap                     removes gap characters (. and ~) from the sequence
-LINesize=50               sets number of characters per line
-BLOcksize=10              sets number of characters per block
-BLAnklines=1              puts blank lines between the sequence lines
-NONUMbering               suppresses numbering
-NOCOMments                suppresses comments
-DNA                       changes U into T
-RNA                       changes T into U
-UPPer                     makes all sequence characters uppercase
-LOWer                     makes all sequence characters lowercase
-ONEIntothree              translates one-letter peptides into three-letter
-THReeintoone              translates three-letter peptides into one-letter
-NOHEAding                 input sequence from stdin contains no header information

-COMparison                reformats a scoring matrix instead of a sequence (used with -PROtein or -NUCleotide, insists that the matrix is reformatted as a protein or nucleotide scoring matrix)
-GAPweight=12              specifies the gap creation penalty associated with the scoring matrix
-LENgthweight=4            specified the gap extension penalty associated with the scoring matrix
-SCAle=10                  multiplies each value in the scoring matrix by 10 (use any number from .01 to 100.0)
-EQUALSformat              writes the scoring matrix in a form that may be more easily read
-OLDCMPformat              converts a pre-Version 9 scoring matrix into a Version 9 scoring matrix (all options used with -COMparison can also be used with -OLDCMPformat. -PROtein or -NUCleotide must be specified with -OLDCMPformat)
-TRANSlate=filename.txt    lets you name the translation table
-NOMONitor                 suppresses the screen trace showing each output file

Add what to the command line ?

No ".." divider
seq.txt length: 100 bp
analyze%

analyze% cat seq.txt
!!NA_SEQUENCE 1.0
 REFORMAT of: seq.txt  check: 3430  from: 1  to: 100  April 9, 1998 14:31
 (No documentation)
 seq.txt  Length: 100  April 9, 1998 14:31  Type: N  Check: 3430  ..

       1  ACGAAGACAA ACAAACCATT ATTATCATTA AAAGGCTCAG GAGAAACTTT
      51  AACAGTAATC AAAATGTCTG TTACAGTCAA GAGAATCATT GACAACACAG

analyze%

Reformatted Sequence
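The visible difference between the raw file and the reformatted one is the layout of the sequence body: numbered lines of 50 characters in blocks of 10, matching the -LINesize=50 and -BLOcksize=10 defaults listed in the help text. The sketch below reproduces only that layout. It does not write the header, the ".." divider, or the checksum, so its output is not a valid GCG file, and the function name is made up for this example.

```python
def gcg_style_lines(sequence: str, line_size: int = 50, block_size: int = 10) -> list:
    """Lay out a raw sequence as numbered lines of space-separated blocks."""
    seq = "".join(sequence.split()).upper()  # drop line breaks and spaces
    lines = []
    for start in range(0, len(seq), line_size):
        chunk = seq[start:start + line_size]
        blocks = [chunk[i:i + block_size] for i in range(0, len(chunk), block_size)]
        # The number at the left is the 1-based position of the first base.
        lines.append(f"{start + 1:>8}  " + " ".join(blocks))
    return lines


raw = """ACGAAGACAAACAAACCATTATTATCATTAAAAGGCTC
AGGAGAAACTTTAACAGTAATCAAAATGTCTGTTACAG
TCAAGAGAATCATTGACAACACAG"""

for line in gcg_style_lines(raw):
    print(line)
```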

GCG Sequence Import Programs

fromstaden fromembl fromgenbank frompir fromig fromfasta fromtrace

GCG Sequence Export Programs

tostaden topir toig tofasta

ReadSeq

General reformatting program

analyze% readseq
readSeq (1Feb93), multi-format molbio sequence reader.
Name of output file (?=help, defaults to display):
seq.fasta
    1. IG/Stanford           10. Olsen (in-only)
    2. GenBank/GB            11. Phylip3.2
    3. NBRF                  12. Phylip
    4. EMBL                  13. Plain/Raw
    5. GCG                   14. PIR/CODATA
    6. DNAStrider            15. MSF
    7. Fitch                 16. ASN.1
    8. Pearson/Fasta         17. PAUP/NEXUS
    9. Zuker (in-only)       18. Pretty (out-only)
Choose an output format (name or #):
8
Name an input sequence or -option:
seq.txt
Name an input sequence or -option:

analyze% cat seq.fasta
>seq.txt, 100 bases, D66 checksum.
ACGAAGACAAACAAACCATTATTATCATTAAAAGGCTCAGGAGAAACTTTAACAGTAATCAAAATGTCTGTTACAGTCAAGAGAATCATTGACAACACAG
analyze%

ReadSeq Formatted Sequence

Sequence File Utilities

Chopup
  Break up long lines in a text file prior to running reformat
Breakup
  Break up long sequences into individual, overlapping sequence files

>uunt, 751719 bases, 1F08 checksum.ATGGCTAATAATTATCAAACTTTATATGATTCAGCAATAAAAAGGATTCCATACGATCTTATTTCTGATCAAGCTTATGCAATTCTACAAAATGCTAAAACTCATAAAGTTTGCGATGGTGTTTTATATATAATTGTAGCCAATGCCTTTGAAAAAAGTATTATTAACGGTAATTTTATTAACATTATTTCTAAATATCTAAGCGAAGAATTCAAAAAGGAAAATATTGTTAATTTTGAATTTATTATAGACAATGAAAAATTATTAATTAATAGCAATTTTTTAATTAAAGAAACTAATATTAAAAATCGTTTTAATTTTAGTGATGAACTTTTACGTTACAATTTTAACAATTTAGTAATTAGTAATTTTAATCAAAAAGCGATTAAGGCGATTGAAAATTTATTTTCAAATAACTATGATAATAGTTCAATGTGTAACCCTTTATTTTTATTTGGTAAAGTTGGTGTTGGTAAAACGCATATCGTGGCTGCTGCTGGTAATCGTTTTGCTAATAGTAATCCTAATTTAAAAATTTATTATTATGAAGGGCAAGATTTTTTTCGAAAGTTTTGTTCTGCTTCGTTAAAAGGGACTAGTTATGTTGAAGAGTTTAAAAAAGAAATTGCTTCAGCAGATTTATTAATTTTTGAAGATATTCAAAATATCCAATCACGTGATTCAACGGCTGAATTGTTTTTTAATATCTTTAATGATATAAAATTAAATGGTGGAAAAATTATCTTAACATCTGACCGTACACCAAACGAACTTAATGGTTTTCATAATCGAATTATTTCGAGATTAGCGTCAGGTTTGCAGTGTAAAATTTCTCAACCCGACAAAAATGAAGCTATTAAAATTATTAATAATTGGTTTGAATTCAAAAAAAAATATCAAATTACTGACGAAGCTAAAGAATATATTGCTGAAGGTTTTCACACTGATATTAGACAGATGATtGGTAATCTAAAACAAATTTGTTTTTGAGCGGACAATGATACTAATAAAGATTTAATAATCACAAAAGATTATGTAATTGAGTGTTCAGTTGAAAACGAAATTCCACTAAATATTGTTGTTAAAAAACAATTTAAACC

analyze% readseq
readSeq (1Feb93), multi-format molbio sequence reader.
Name of output file (?=help, defaults to display):
uunt.seq
    1. IG/Stanford           10. Olsen (in-only)
    2. GenBank/GB            11. Phylip3.2
    3. NBRF                  12. Phylip
    4. EMBL                  13. Plain/Raw
    5. GCG                   14. PIR/CODATA
    6. DNAStrider            15. MSF
    7. Fitch                 16. ASN.1
    8. Pearson/Fasta         17. PAUP/NEXUS
    9. Zuker (in-only)       18. Pretty (out-only)
Choose an output format (name or #):
5
Name an input sequence or -option:
uunt
Name an input sequence or -option:

analyze% more uunt.seq
uunt
 uunt, Length: 751719 (today) Check: 7944 ..
       1  ATGGCTAATA ATTATCAAAC TTTATATGAT TCAGCAATAA AAAGGATTCC
      51  ATACGATCTT ATTTCTGATC AAGCTTATGC AATTCTACAA AATGCTAAAA
     101  CTCATAAAGT TTGCGATGGT GTTTTATATA TAATTGTAGC CAATGCCTTT
     151  GAAAAAAGTA TTATTAACGG TAATTTTATT AACATTATTT CTAAATATCT
     201  AAGCGAAGAA TTCAAAAAGG AAAATATTGT TAATTTTGAA TTTATTATAG
     251  ACAATGAAAA ATTATTAATT AATAGCAATT TTTTAATTAA AGAAACTAAT
     301  ATTAAAAATC GTTTTAATTT TAGTGATGAA CTTTTACGTT ACAATTTTAA
     351  CAATTTAGTA ATTAGTAATT TTAATCAAAA AGCGATTAAG GCGATTGAAA
     401  ATTTATTTTC AAATAACTAT GATAATAGTT CAATGTGTAA CCCTTTATTT
     451  TTATTTGGTA AAGTTGGTGT TGGTAAAACG CATATCGTGG CTGCTGCTGG
     501  TAATCGTTTT GCTAATAGTA ATCCTAATTT AAAAATTTAT TATTATGAAG
     551  GGCAAGATTT TTTTCGAAAG TTTTGTTCTG CTTCGTTAAA AGGGACTAGT
...

  751301  GAAAATAAAC TACGATTTGA TTAGAATGAA TTTTTTGTTG TTTCTTAATT
  751351  GTATCAAGTA TATCTTCATT TTTTTTTAGA CTAATAAAAT TAGCCATAAA
  751401  AATTATTTTT CACTAGAAAC TGTTAGACTA TGACGCCCTT TAAGTCTTCT
  751451  TCTAGCTAAA ACATTACGCC CATTTTTTGT TTTCATGCGT GCACGAAAAC
  751501  CATGCACTTT TGCTCTTTTA CGATTATTAG GTTGAAACGT TCTTTTCATA
  751551  AATCCACCGC CCTCTTACTT TTTTGAAAAC ATAATATGGA TTATTATAAC
  751601  ATTTTAGTTA TTTTTTATTT AATATATTTT TTTAAAAAAG TCAATGATAT
  751651  CTTTTTAAAA ATAAACATAT ATAATATGAT AATAGGACAA AGATTATTTA
  751701  TAAAAAATAG AGGTTACTA

analyze% map uunt.seq
 Map maps a DNA sequence and displays both strands of the mapped sequence
 with restriction enzyme cut points above the sequence and protein
 translations below. Map can also create a peptide map of an amino acid
 sequence.
 ***Error: Sequence "uunt.seq" could not be read or is not in GCG format

analyze% breakup uunt.seq
 BreakUp reads a GCG-format sequence file containing more than 350,000
 sequence characters and writes it as a set of separate, shorter,
 overlapping sequence files that can be analyzed by Wisconsin Package programs.
 uunt_0.seq length: 110000 bp
 uunt_1.seq length: 110000 bp
 uunt_2.seq length: 110000 bp
 uunt_3.seq length: 110000 bp
 uunt_4.seq length: 110000 bp
 uunt_5.seq length: 110000 bp
 uunt_6.seq length: 110000 bp
 uunt_7.seq length: 51719 bp
analyze%
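The file lengths reported by breakup are consistent with pieces of 110,000 bases that each overlap the previous piece by 10,000 bases: seven full pieces plus a 51,719-base remainder cover 751,719 bases. Those two numbers are inferred from the output above and are not confirmed GCG defaults. The sketch below shows the general technique of cutting a long sequence into overlapping windows under that assumption, with a made-up function name.

```python
def overlapping_chunks(length: int, chunk_size: int = 110_000, overlap: int = 10_000) -> list:
    """Return (start, end) spans covering 0..length with overlapping windows.

    Each window starts chunk_size - overlap after the previous one, so
    neighbouring windows share `overlap` positions.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if length <= 0:
        return []
    step = chunk_size - overlap
    spans = []
    start = 0
    while True:
        end = min(start + chunk_size, length)
        spans.append((start, end))
        if end == length:  # the last window reaches the end of the sequence
            break
        start += step
    return spans


for index, (start, end) in enumerate(overlapping_chunks(751_719)):
    print(f"uunt_{index}.seq length: {end - start} bp")
```

The overlap matters because a feature that straddles a cut point would otherwise be split between two files and missed by any program that analyzes the pieces separately.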

Specifying Multiple Sequences

Multiple sequences

If the program prompts with: sequence(s), file(s), or file name(s), then it can accept more than one input file

Specifying Multiple Sequences

Wild Card Specification
File of File Names
  List Files
Multiple Sequence Format File

Wild card specification (flatfile)

GenEMBL:*      All sequences in Genbank and EMBL
Primate:*      All primate sequences in GenBank
Primate:Hum*   All Human sequences in GenBank
               EMBL uses HS for human

Wild card specification (SeqStore)

gcgnuc:* All sequences in Genbank and EMBL

Must create a query or list for most groupings

File of Sequence Names

List Files
You or certain GCG programs can construct a file containing any number of sequence names.
Specify as @Sequence_names.fil
  The @ tells the program that Sequence_names.fil is a file of sequence names
  The program uses all listed sequences

Contents of a File of Sequence Names

Begin with a comment
Sequence file names follow a double period at the end of a line: ..
Other comments can be included if preceded by a !
One sequence name per line

File of Sequence Names...

Put an ! in front of a name to have the program ignore that particular entry.
A sequence name may include a wild card
The file can contain another file of sequence names as a listing
  It must be preceded by an @

hsp70.fil File

January 21, 1998 ..

SWP:Hs70_Brelc
SWP:Hs70_Chick
SWP:Hs70_Human
SWP:Hs70_Leido
SWP:Hs70_Leima
SWP:Hs70_Maize
SWP:Hs70_Mouse
SWP:Hs70_Pethy
SWP:HS77_Yeast
SWP:GR78_Yeast -BEGin=43 -END=682
sequences/hsp70/ssa4.pep
ob0/users/lefkowit/sequences/hsp70/ssa1.pep
SWP:DNAK_EColi
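The rules for a file of sequence names (a leading comment ended by "..", one name per line, "!" for comments and disabled entries, "@" for a nested list) are simple enough to express in a short reader. The sketch below is a rough illustration of those rules, not GCG's own parser. The function name is hypothetical, qualifiers such as -BEGin are simply dropped, and the nested file more.fil in the usage example is invented.

```python
def read_list_file_names(text: str) -> list:
    """Return the active sequence names from a GCG-style list file.

    Everything up to the first line ending in '..' is treated as the
    leading comment. After that, text following '!' is ignored, and only
    the first token of each remaining line is kept.
    """
    names = []
    in_body = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_body:
            if stripped.endswith(".."):
                in_body = True
            continue
        entry = stripped.split("!", 1)[0].strip()  # drop comments and disabled names
        if entry:
            names.append(entry.split()[0])  # keep the name, drop -BEGin/-END qualifiers
    return names


example = """hsp70 sequences
January 21, 1998 ..

SWP:Hs70_Human
!SWP:Hs70_Mouse
SWP:GR78_Yeast -BEGin=43 -END=682
@more.fil
"""
print(read_list_file_names(example))
```

Names that start with "@" are returned unchanged, so a caller could open each nested list and apply the same function again.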

Multiple Sequence Files (msf)

File containing multiple sequences that are related and have been aligned

Specifying msf files: filename.msf{*}
  The {*} indicates which sequences are to be used

You can exclude a sequence in subsequent analyses by preceding its name within the msf file with an ! sign.

hsp70.msf

PileUp of: @Hsp70.Fil

Symbol comparison table: GenRunData:NWSGapPep.Cmp  CompCheck: 1254

GapWeight: 3.0
GapLengthWeight: 0.1

Pileup.Msf  MSF: 738  Type: P  December 26, 1990 13:39  Check: 288 ..

 Name: Hs70_Plafa  Len: 738  Check: 9820  Weight: 1.00
 Name: Hs70_Thean  Len: 738  Check: 120  Weight: 1.00
!Name: Hs70_Leido  Len: 738  Check: 7985  Weight: 1.00

//

            1                                                   50
Hs70_Plafa  .......... .....MASAK GSKPNLPESN IAIGIDLGTT YSCVGVWRNE
Hs70_Thean  .......... .......... .......MTG PAIGIDLGTT YSCVAVYKDN
Hs70_Leido  .......... .......... ......MTFD GAIGIDLGTT YSCVGVWQNE

            51                                                 100
Hs70_Plafa  NVDIIANDQG NRTTPSYVAF T.DTERLIGD AAKNQVARNP ENTVFDAKRL
Hs70_Thean  NVEIIPNDQG NRTTPSYVAF T.DTERLIGD AAKNQEARNP ENTIFDAKRL
Hs70_Leido  RVDIIANDQG NRTTPSYVAF TSDSERLIGD AAKNQVAMNP HNTVFDAKRL

rsf Files

Rich Sequence Format
Allows entry of additional information about each sequence
File can contain multiple sequences
Allows gaps
Different sequences do not need to be related
Create and edit rsf files within SeqLab

rsf Sequence Information

Creator/author of the sequence
Sequence weight
Creation date
One-line description of the sequence
Offset, or the number of leading gaps in a sequence that is part of an alignment or fragment assembly project
Known sequence features

rsf File Specification

Similar to msf files
hsp70.rsf{*}
  Use all the sequences in the file
hsp70.rsf{hs70_human}
  Only use this single sequence
hsp70.rsf{hs70*}
  Only use sequences whose name starts with hs70
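How a specification such as hsp70.rsf{hs70*} selects sequences can be sketched in Python. The function name `select_sequences` is hypothetical, and case-insensitive matching is an assumption based on the lowercase names used in the examples above. The real programs read the sequence names from the file itself, whereas this sketch takes them as a list.

```python
import re
from fnmatch import fnmatchcase

# file name, then a brace-enclosed name pattern, e.g. hsp70.rsf{hs70*}
SPEC = re.compile(r"^(?P<file>[^{}]+)\{(?P<pattern>[^{}]*)\}$")


def select_sequences(spec, names_in_file):
    """Split spec into (file name, matching sequence names)."""
    match = SPEC.match(spec)
    if match is None:
        raise ValueError(f"not a file{{pattern}} specification: {spec!r}")
    pattern = match.group("pattern").lower()
    # Assumption: names are compared without regard to case.
    chosen = [n for n in names_in_file if fnmatchcase(n.lower(), pattern)]
    return match.group("file"), chosen


if __name__ == "__main__":
    names = ["Hs70_Human", "Hs70_Mouse", "DnaK_Ecoli"]
    for spec in ("hsp70.rsf{*}", "hsp70.rsf{hs70_human}", "hsp70.rsf{hs70*}"):
        print(spec, "->", select_sequences(spec, names))
```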

analyze% more rsb.rsf
!!RICH_SEQUENCE 1.0
..
{
name  dc-62-18537
descrip  Description: PileUp of: *.seq
type  DNA
longname  dc-62-18537
checksum  8717
creation-date  4/10/98 15:45:50
strand  1
sequence
  TCCACCGTGCTCGACACAATCACTCCAAAATACACAATCCAACAGCAATCCCTCCACTCA
  ACCACCTCCGAAAACACACCCAGCTCCACACAAATACCCACAGCATCCGAGCCCTCCACA
  TTAAATCCTAAT
}
{
name  swed-60-860
descrip  Description: PileUp of: *.seq
type  DNA
longname  swed-60-860
checksum  8595
creation-date  4/10/98 15:45:50
strand  1
sequence
  TCCACCGTGATCGACACAATCACTCCAAAATACACAATCCAACAGCAATCCCTCCACTCA
  ACCACCTCCGAAAACACACCCAGCTCCACACAAATACCCACAGCATCCGAGCCCTCCACA
  TCAAATCCTACT
}

Finding and Displaying Sequences

List Refinement

Run search program 1
  Create a list of file names
Use as input to search program 2
  Create a second list of file names
Edit the list file at each step as necessary.
etc.

Programs Which Create a List of Sequences

Names Blast Lookup StringSearch FindPatterns FastA TFastA

Names

Searches sequence names for a match
analyze% names primate:Hum*
  Will create a file listing all human sequences present in GenBank
Dependent on knowing name features
  GenBank:Hum*
  EMBL:Hs*

analyze% names -check pr:huma*

Names identifies GCG data files and sequence entries by name. It can show you what set of sequences is implied by any sequence specification.

Minimal Syntax: % names [-INfile=]GenEMBL:Humhb* -Default

Prompted Parameters:

[-OUTfile=]Term   output file name (defaults to your terminal)

Options:

-SHOwfiles=132    limits documentation in the output file to column 132
-NOHEAding        suppresses the heading at the top of the file.
-NOMONitor        suppresses the screen monitor

Add what to the command line ?

What (file of filenames) output file (* TERM *) ?

gb_pr1:

huma1aadr  huma1acm  huma1acmb  huma1ar1
huma1ar2   huma1at   huma1ata   huma1atb

analyze% more list.file
!!SEQUENCE_LIST 1.0
! NAMES from: pr:huma* April 13, 1998 14:55 ..

gb_pr1:huma1aadr LOCUS HUMA1AADR 2002 bp mRNA PRI 04-NOV-1991 DEFINITION Human alpha-A1-adrenergic receptor mRNA, complete cds. ACCE

gb_pr1:huma1acm LOCUS HUMA1ACM 1520 bp mRNA PRI 30-OCT-1994 DEFINITION Human alpha-1-antichymotrypsin (AACT) mRNA, complete cds. ACC

gb_pr1:huma1acmb LOCUS HUMA1ACMB 559 bp DNA PRI 30-OCT-1994 DEFINITION Human alpha-1-antichymotrypsin gene, exon 1. ACCESSION M18035

gb_pr1:huma1ar1 LOCUS HUMA1AR1 890 bp DNA PRI 30-OCT-1994 DEFINITION Human alpha-1-antitrypsin-related protein gene, exon 2. ACCESSI

gb_pr1:huma1ar2 LOCUS HUMA1AR2 3758 bp DNA PRI 30-OCT-1994 DEFINITION Human alpha-1-antitrypsin-related protein gene, exons 3, 4 and

gb_pr1:huma1at LOCUS HUMA1AT 143 bp mRNA PRI 30-OCT-1994 DEFINITION Human alpha-1-antitrypsin (alpha-1-AT) mRNA, 3' end. ACCESSION M

gb_pr1:huma1ata LOCUS HUMA1ATA 322 bp DNA PRI 30-OCT-1994 DEFINITION Human alpha-1-antitrypsin gene, exon 1 (unexpressed). ACCESSION

gb_pr1:huma1atb LOCUS HUMA1ATB 1345 bp mRNA PRI 30-OCT-1994 DEFINITION Human alpha-1-antitrypsin mRNA, complete cds. ACCESSION M1146

StringSearch

Old search method
Searches for a particular text pattern in the sequence documentation.
  Definition Search
  Record Search
Complete search for possible text occurrences

Very Slow!!

Lookup (gcgff only)

Rapid Text Pattern Searching
Uses an index of sequence file documentation
Allows field-specific searches
Allows AND; OR; NOT matching

Lookup Considerations

Be sure that analyze is set to use a vt100 terminal:
  analyze% setenv TERM vt100
Lookup may miss some sequences
  Dependent on the annotation
  Spelling counts
Searches are case-insensitive

Logical Operators Within a Field

AND: &
  A & B means find all entries that contain both A and B.
OR: |
  A | B means find all entries that contain either A or B.
BUT-NOT: !
  A ! B means find all entries that contain A but do not contain B.
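The three within-field operators can be illustrated with a small Python function that tests one entry's text against a query. It is only a sketch of the logic described above, not LookUp's index search. The name `entry_matches` is hypothetical, terms are assumed to be single words, and a query may use only one kind of operator.

```python
import re


def entry_matches(query, text):
    """Apply an A & B, A | B, or A ! B query to the words of one entry."""
    # Case-insensitive comparison on whole words.
    words = set(re.findall(r"[a-z0-9]+", text.lower()))

    def has(term):
        return term.strip().lower() in words

    if "&" in query:                      # AND: every term must be present
        return all(has(t) for t in query.split("&"))
    if "|" in query:                      # OR: at least one term present
        return any(has(t) for t in query.split("|"))
    if "!" in query:                      # BUT-NOT: first term without the rest
        wanted, *unwanted = query.split("!")
        return has(wanted) and not any(has(t) for t in unwanted)
    return has(query)


if __name__ == "__main__":
    definition = "Human beta globin region on chromosome 11"
    for q in ("Globin & Region", "globin | myosin", "globin ! region"):
        print(q, "->", entry_matches(q, definition))
```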

analyze% lookup -check

LookUp identifies sequence database entries by name, accession number, author, organism, keyword, title, reference, feature, definition, length, or date. The output is a list of sequences.

The LookUp program is experimental in this release. LookUp sometimes crashes or produces incorrect results if you query a nucleic acid database and request fragment output. Please look carefully at your results.

Minimal Syntax: % lookup [-ALLtext=]Globin -Default

Prompted Parameters:

-LIBrary=SwissProt[,...]   lookup in specified data libraries
-ALLtext=Globin            searches all text indices for globin
-DEFInition=Globin         words indexed independently "Globin & Region"
-AUThor=Smithies           for more than one "Smithies,O. & Slightom,J.L."
-KEYword=Globin            see document before using keywords
-NAMe=hsggl3               entry name
-ACCessionnumber=S12345    accession number
-ORGanism="Homo Sapiens"   genus and species
-REFerence=Cell&1981       complete reference: "Cell & 26 & 191- & 1981"
-TITle=History             title of citation "History & Duplication"
-FEAture=Gamma             any word in a feature table
-SHOrtest=100              find only sequences of length 100 or more
-LONgest=400               find only sequences of length 400 or less
-EARliest=01-apr-1992      sequences modified on or after April 1, 1992
-LATest=30-apr-1992        sequences modified on or before April 30, 1992
-MATch=OR                  specifies inter-field logic (AND is default)
-OUTfile=lookup.list       output file for list of sequences

Optional Parameters:

-NOWILdcardextension       turns off automatic wildcard extension
@lookup.list               searches in lookup.list instead of libraries
-ANNotate=FEAture[,...]    shows fields from original annotation in output
                           acceptable values include: ACCession, AUThor, DATe, DEFinition, FEAture, NAMe, KEYword, ORGanism, REFerence, and TITle
-FRAgments                 shows features as fragments instead of whole entries
-COMplete                  shows only features with unambiguous coordinates
-MONitor                   shows databases searched and how many hits found

Add what to the command line ?

LOOKUP in what sequence libraries:

  a) swissprot
  b) sptrembl
  c) pir
  d) embl
  e) genbank
  f) em_tags
  g) gb_tags
  h) All libraries
  q) quit

Please choose one or more (* h *):

Complete the query form below:

  All text:
  Definition:
  Author:
  Keyword:
  Sequence name:
  Accession number:
  Organism:
  Reference:
  Title:
  Feature:
  On or after (dd-mmm-yy):
  On or before (dd-mmm-yy):
  Shortest sequence length:
  Longest sequence length:
  Inter-field operator: AND
  Form of output list: Whole Entries

Press <Ctrl>D to continue.

SeqStore

Sequence searching

Lookup_rdb (gcgrdb)

Seqstore command-line sequence searching

Barebones – Use Seqstore Web interface

SeqStore Web Searching

Setup multiple criteria for selecting sets of sequences

Save as a query or list
  Query: Active list. Changes as new sequences are added
  List: Static list. No change with database updates
Save to SeqWeb
Powerful but can be slow

NCBI Sequence Services

Obtain sequences directly from NCBI
  Sequence Searches
  Sequence Retrieval
Other services
  BLAST Searches
  Sequence Submission
  PubMed Searches

Entrez

NCBI Databases on the Web
  Sequence retrieval
  Text pattern searches
GenBank is updated on a daily basis
Web Site: http://www.ncbi.nlm.nih.gov









Finding Sequences by Similarity

Using GCG

Sequence Similarities

What other sequences have some primary sequence similarity to my query sequence?

Time and cost of the search depend on the size of the database
  Restrict the size of the database

FindPatterns

Look for sequence patterns within sequence files

Allows complex pattern definitions
Ambiguous sequence specifications

BLAST; NetBlast

All search combinations possible
  blastn    nt vs. nt database
  blastp    protein vs. protein database
  blastx    translated nt vs. protein database
  tblastn   protein vs. translated nt database
  tblastx   translated nt vs. translated nt database
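The five combinations amount to a lookup on the query type, the database type, and whether nucleotide sequences are translated. The Python sketch below encodes that table. The function `choose_blast` and its argument names are hypothetical and are not part of BLAST or GCG.

```python
BLAST_PROGRAMS = {
    # (query type, database type, nucleotides translated?) -> program
    ("nucleotide", "nucleotide", False): "blastn",
    ("protein", "protein", False): "blastp",
    ("nucleotide", "protein", True): "blastx",
    ("protein", "nucleotide", True): "tblastn",
    ("nucleotide", "nucleotide", True): "tblastx",
}


def choose_blast(query, database, translate=None):
    """Return the BLAST program name for a query/database combination."""
    if translate is None:
        # A mixed search is only possible by translating the nucleotide side.
        translate = query != database
    try:
        return BLAST_PROGRAMS[(query, database, translate)]
    except KeyError:
        raise ValueError(
            f"no BLAST program for {query} vs. {database} (translate={translate})"
        ) from None


if __name__ == "__main__":
    print(choose_blast("nucleotide", "protein"))                     # blastx
    print(choose_blast("nucleotide", "nucleotide", translate=True))  # tblastx
```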

FastA

Search nucleotide sequences with a nucleotide query

Search protein sequences with a peptide query

TFastA

Translates nucleotide sequences in all 6 reading frames

Search the translated sequences with a peptide query
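Translating a nucleotide sequence in all six reading frames, as TFastA does before comparing with a peptide query, can be sketched in Python. This is an illustration using the standard genetic code only. The function names are hypothetical, and incomplete or unrecognized codons are reported as X by assumption.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate seq from its first base; trailing partial codons are dropped."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")  # X for codons with other symbols
        for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return translations of the three forward and three reverse frames."""
    forward = seq.upper().replace("U", "T")
    reverse = forward.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for name, peptide in six_frames("ATGGCCTAA").items():
        print(name, peptide)
```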

Displaying Data

analyze% typedata
  Displays on your screen the contents of any GCG data file
-REF
  Display documentation only

Copying Data

analyze% fetch
  Will copy any GCG data or sequence file to your directory

Sequence Symbols

Sequence symbols
  Handout lists the sequence symbols recognized by GCG
  Ambiguity codes are as proposed by the IUB nomenclature committee
  Used by GenBank, EMBL, and NBRF

Nucleotide Symbols

IUB/GCG   Meaning                      Complement   Staden/Sanger
A         A                            T            A
C         C                            G            C
G         G                            C            G
T/U       T                            A            T
M         A or C                       K            5
R         A or G                       Y            R
W         A or T                       W            7
S         C or G                       S            8
Y         C or T                       R            Y
K         G or T                       M            6
V         A or C or G                  B            not supported
H         A or C or T                  D            not supported
D         A or G or T                  H            not supported
B         C or G or T                  V            not supported
X/N       G or A or T or C             X            -/X
.         not G or A or T or C (gap)   .            not supported
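The Complement column is what a reverse-complement operation needs when a sequence contains ambiguity codes. The Python sketch below encodes that column. The function name is hypothetical, and mapping N to N (the table lists X/N together with complement X) is an assumption made so that N survives a round trip.

```python
# Complement of each IUB/GCG symbol, taken from the table above.
IUB_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A", "U": "A",
    "M": "K", "R": "Y", "W": "W", "S": "S", "Y": "R", "K": "M",
    "V": "B", "H": "D", "D": "H", "B": "V",
    "X": "X",
    "N": "N",  # assumption: N is kept as N rather than rewritten to X
    ".": ".",  # gap
}


def reverse_complement(seq):
    """Return the reverse complement of seq, honoring ambiguity codes."""
    return "".join(IUB_COMPLEMENT[base] for base in reversed(seq.upper()))


if __name__ == "__main__":
    print(reverse_complement("ATGCRYM"))  # KRYGCAT
```

Symbols that are their own complement (W, S, X, and the gap) stay unchanged, so only their position is reversed.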

Amino Acid Symbols

Symbol  3-letter  Meaning               Codons                    Depiction
A       Ala       Alanine               GCT,GCC,GCA,GCG           !GCX
B       Asp,Asn   Aspartic, Asparagine  GAT,GAC,AAT,AAC           !RAY
C       Cys       Cysteine              TGT,TGC                   !TGY
D       Asp       Aspartic              GAT,GAC                   !GAY
E       Glu       Glutamic              GAA,GAG                   !GAR
F       Phe       Phenylalanine         TTT,TTC                   !TTY
G       Gly       Glycine               GGT,GGC,GGA,GGG           !GGX
H       His       Histidine             CAT,CAC                   !CAY
I       Ile       Isoleucine            ATT,ATC,ATA               !ATH
K       Lys       Lysine                AAA,AAG                   !AAR
L       Leu       Leucine               TTG,TTA,CTT,CTC,CTA,CTG   !TTR,CTX,YTR;YTX
M       Met       Methionine            ATG                       !ATG
N       Asn       Asparagine            AAT,AAC                   !AAY
P       Pro       Proline               CCT,CCC,CCA,CCG           !CCX
Q       Gln       Glutamine             CAA,CAG                   !CAR
R       Arg       Arginine              CGT,CGC,CGA,CGG,AGA,AGG   !CGX,AGR,MGR;MGX
S       Ser       Serine                TCT,TCC,TCA,TCG,AGT,AGC   !TCX,AGY;WSX
T       Thr       Threonine             ACT,ACC,ACA,ACG           !ACX
V       Val       Valine                GTT,GTC,GTA,GTG           !GTX
W       Trp       Tryptophan            TGG                       !TGG
X       Xxx       Unknown                                         !XXX
Y       Tyr       Tyrosine              TAT,TAC                   !TAY
Z       Glu,Gln   Glutamic, Glutamine   GAA,GAG,CAA,CAG           !SAR
*       End       Terminator            TAA,TAG,TGA               !TAR,TRA;TRR
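The Depiction column writes each codon set with nucleotide ambiguity codes. A short Python sketch shows how such a depiction expands back into explicit codons. The function name `expand_codon` is hypothetical, and the leading ! seen in the table is assumed to be stripped before the call.

```python
from itertools import product

# Bases represented by each IUB/GCG nucleotide symbol.
IUB_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "X": "ACGT", "N": "ACGT",
}


def expand_codon(depiction):
    """List every explicit codon covered by an ambiguous codon such as RAY."""
    choices = (IUB_BASES[symbol] for symbol in depiction.upper())
    return sorted("".join(codon) for codon in product(*choices))


if __name__ == "__main__":
    print(expand_codon("TGY"))  # ['TGC', 'TGT']
    print(expand_codon("RAY"))  # ['AAC', 'AAT', 'GAC', 'GAT']
```

Expanding RAY gives the four Asp and Asn codons listed for symbol B, and ATH gives the three isoleucine codons.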

Other Stuff

Non-Sequence Data

Data required to run a program
Copy to your directory with Fetch

Local Data Files

Copies of GCG Data files stored in your own directory.

May be altered as desired.

Using Local Data Files

Programs will look first in the default directory for a particular data file with a particular name.
  If not found, the public data file will be used.
A user may specify a new name for the data file when running a program.
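The search order for data files (local copy first, then the public copy) can be sketched in Python. The function name and both directory arguments are hypothetical. The real programs resolve their data directories through GCG's own configuration.

```python
from pathlib import Path


def find_data_file(name, local_dir, public_dir):
    """Return the local copy of a data file if present, else the public one."""
    for directory in (local_dir, public_dir):  # local directory is tried first
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"{name} not found in {local_dir} or {public_dir}")
```

Because the local directory is checked first, an edited copy obtained with Fetch overrides the public file without any extra option. Passing a different `name` corresponds to specifying a new data file name when running a program.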

Restriction Enzyme Files

REBASE (enzyme.dat)
  REBASE 6/2000
  Dr. Richard J. Roberts
  Cold Spring Harbor Laboratory

Used by: Map, MapSort, MapPlot

Prosite

Dictionary of sequence motifs
  Dr. Amos Bairoch, University of Geneva
  Release 16, 7/1999
  Over 1300 patterns

Used by: Motifs

Profiles

Database of peptide profiles
  Drs. Michael Gribskov and Amos Bairoch
  Over 600 Profiles
Used by: ProfileScan

Eukaryotic Transcription Factor Recognition Sites

Transcription Factor Database
  Dr. David Ghosh, NCBI
  Release 7.5, 3/96
  genmoredata:tfsites.dat
Used by: FindPatterns, Map, MapSort, MapPlot

Codon Frequency Tables

Frequency of particular codon usage
Look in genmoredata
Organism
  Human
  E. coli
  Drosophila

Used by: BackTranslate, CodonPreference
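What a codon frequency table records can be shown with a few lines of Python that count codon usage in a single coding sequence. This is only an illustration of the idea. The tables in genmoredata are compiled per organism and have their own file format, which this sketch does not read or write.

```python
from collections import Counter


def codon_usage(cds):
    """Return the fraction of codons in cds that each codon accounts for."""
    cds = cds.upper()
    # Read in frame from the first base; a trailing partial codon is ignored.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


if __name__ == "__main__":
    print(codon_usage("GCTGCTGCC"))
```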

Translation Tables

Standard table for translating nucleotide sequences into amino acid sequences
  Look in genmoredata
Alternate translation tables
  Mitochondria
  Mycoplasma

Used by: Translate, Map, Frames

Symbol Comparison Tables

Amino acid similarities
  What is the chance that one amino acid can substitute for another without affecting function?
Used by all sequence comparison programs
  FastA, TFastA, Blast
  Gap, BestFit
  PileUp

Protein Analysis Data

Amino acid properties
  Charge, hydrophobicity, molecular weight, secondary structure predictions, etc.
Protease digestion sites
Used by: PepPlot; PlotStructure

Free Energy Values

RNA secondary structure prediction
Used by: Mfold, FoldRNA
