
Fundamentals of Sequence Analysis

Fourie Joubert

FASTA File Format

• The first line contains > followed by a space and a short descriptor.
• The sequence follows on the next lines, usually 60 or 80 characters per line.
• Further sequences may follow, each after a blank line.

FASTA Example

> mysequence
ACGTCGATCGATCGATGCATCGTGCTAGCTACAGTCGATGCAT
CAGTCGATGCTAGCATGCTAGCTGCATCGATCGATGCTACGTA
CAGTCGATCGATGCAT

> mysequence2
ACCGTACGATGCTAGCTAGCTAGCTACAGTCAGTCGATGCTACG
CAGTCGTAGCATGCTAACGTCGATCGTA

> mysequence3
CAGTCAGTCGTAGCTAGCTAGCTAGCTAGGGGTATCGATGCTAA
CAGTACTTTGCATGCAGCATGCTAGCTAGCTAGCTA
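The FASTA layout above is simple enough to read with a few lines of code. The following is a minimal sketch, not a reference parser: the function name `parse_fasta` and the shortened sample text are illustrative assumptions. It treats every line starting with > as a new record and joins the following lines into one sequence.

```python
def parse_fasta(text):
    """Return a list of (descriptor, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between records carry no data
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # drop the ">" and the optional space before the descriptor
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


# Shortened, made-up sample in the same layout as the example above.
example = """> mysequence
ACGT
CAGT

> mysequence2
ACCG
"""

for name, seq in parse_fasta(example):
    print(name, len(seq))
```

Sequence lines are concatenated without checking their length, so files wrapped at 60 or at 80 characters are read the same way.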

GenBank File Format

File Header

• The first line in the file must have "GENETIC SEQUENCE DATA BANK" in spaces 20 through 46.
• The next 8 lines may contain arbitrary text. They are ignored but are required to maintain the GenBank format.

Sequence Data Entries

Each sequence entry in the file should have the following format:

• 1st line: Must have LOCUS in the first 5 spaces. The genetic locus name or identifier must be in spaces 13 - 22. The length of the sequence is right justified in spaces 23 through 29.
• 2nd line: Must have DEFINITION in the first 10 spaces. Spaces 13 - 80 are free form text to identify the sequence.
• 3rd line: Must have ACCESSION in the first 9 spaces. Spaces 13 - 18 must hold the primary accession number.
• 4th line: Must have ORIGIN in the first 6 spaces. Nothing else is required on this line; it indicates that the nucleic acid sequence begins on the next line.
• 5th line: Begins the nucleotide sequence. The first 9 spaces of each sequence line may either be blank or may contain the position in the sequence of the first nucleotide on the line. The next 66 spaces hold the nucleotide sequence in six blocks of ten nucleotides. Each of the six blocks begins with a blank space followed by ten nucleotides. Thus the first nucleotide is in space 11 of the line while the last is in space 75.
• Last line: Must have // in the first 2 spaces to indicate termination of the sequence.
• NOTE: Multiple sequences may appear in each file. To begin another sequence, go back to the 1st line (LOCUS) and start again.
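The entry layout described above can be read line by line. The sketch below is one possible approach under stated assumptions: the function name `parse_genbank_entry` is hypothetical, the keywords are matched at the start of each line, and the locus name and accession are taken by splitting on whitespace rather than by enforcing the exact column positions. Only the DEFINITION text is taken from a fixed column (space 13 onward).

```python
def parse_genbank_entry(lines):
    """Extract locus, definition, accession and sequence from one entry."""
    entry = {"locus": None, "definition": "", "accession": None, "sequence": ""}
    seq = []
    in_origin = False
    for line in lines:
        if line.startswith("//"):
            break  # terminator line ends the entry
        if in_origin:
            # keep letters only: drops the position number and block spacing
            seq.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("LOCUS"):
            entry["locus"] = line.split()[1]
        elif line.startswith("DEFINITION"):
            entry["definition"] = line[12:].strip()  # spaces 13 - 80
        elif line.startswith("ACCESSION"):
            entry["accession"] = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True  # sequence starts on the next line
    entry["sequence"] = "".join(seq)
    return entry


# Shortened record for illustration; the sequence line is truncated on purpose.
record = [
    "LOCUS       NM_079846    1190 bp    mRNA",
    "DEFINITION  Drosophila melanogaster Triose phosphate isomerase (Tpi), mRNA.",
    "ACCESSION   NM_079846",
    "ORIGIN",
    "        1 ttaatctcga atctgggaaa",
    "//",
]
print(parse_genbank_entry(record))
```

Lines between ACCESSION and ORIGIN that the sketch does not recognise (for example FEATURES) are skipped, so it also tolerates fuller records than the minimal layout listed above.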

GenBank Example

LOCUS       NM_079846    1190 bp    mRNA    linear   INV 15-DEC-2001
DEFINITION  Drosophila melanogaster Triose phosphate isomerase (Tpi), mRNA.
ACCESSION   NM_079846
VERSION     NM_079846.1  GI:17864111
KEYWORDS    .
SOURCE      fruit fly.
  ORGANISM  Drosophila melanogaster
            Eukaryota; Metazoa; Arthropoda; Tracheata; Hexapoda; Insecta;
            Pterygota; Neoptera; Endopterygota; Diptera; Brachycera;
            Muscomorpha; Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1  (bases 1 to 1190)
  AUTHORS   Shaw-Lee,R.L., Lissemore,J.L. and Sullivan,D.T.
  TITLE     Structure and expression of the triose phosphate isomerase (Tpi) gene of
            Drosophila melanogaster
  JOURNAL   Mol. Gen. Genet. 230 (1-2), 225-229 (1991)
  MEDLINE   92079900
   PUBMED   1720860
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final NCBI
            review. The reference sequence was derived from AE003772.1.
FEATURES             Location/Qualifiers
     source          1..1190
                     /organism="Drosophila melanogaster"
                     /db_xref="taxon:7227"
                     /chromosome="3"
                     /map="99E1-99E2"
     gene            1..1190
                     /gene="Tpi"
                     /note="TPI; TPIS; CG2171; CT6334"
                     /db_xref="FLYBASE:FBgn0003738"
                     /db_xref="LocusID:43582"
     CDS             181..924
                     /gene="Tpi"
                     /EC_number="5.3.1.1"
                     /note="Nucleotide sequence of the Celera sequence differs from the
                     published sequence for this transcript."
                     /codon_start=1
                     /db_xref="FLYBASE:FBgn0003738"
                     /db_xref="LocusID:43582"
                     /product="Triose phosphate isomerase"
                     /protein_id="NP_524585.1"
                     /db_xref="GI:17864112"
                     /translation="MSRKFCVGGNWKMNGDQKSIAEIAKTLSSAALDPNTEVVIGCPA
                     IYLMYARNLLPCELGLAGQNAYKVAKGAFTGEISPAMLKDIGADWVILGHSERRAIFG
                     ESDALIAEKAEHALAEGLKVIACIGETLEEREAGKTNEVVARQMCAYAQKIKDWKNVV
                     VAYEPVWAIGTGQTATPDQAQEVHAFLRQWLSDNISKEVSASLRIQYGGSVTAANAKE
                     LAKKPDIDGFLVGGASLKPEFVDIINARQ"
     misc_feature    187..921
                     /note="TIM; Region: Triosephosphate isomerase"
BASE COUNT      279 a    368 c    323 g    220 t
ORIGIN
        1 ttaatctcga atctgggaaa aatctgagtg gaaaagtcga cggcgagcct ccagtcatcg
       61 agttacccac ttgaaattat cagttccaaa cactctaata gcagtcccct tgttttgtcc
      121 cccgatccgc agttctacgc caatttcagc accgattgca ccgacagcaa cagcaacaac
      181 atgagccgaa agttctgcgt gggaggcaac tggaagatga acggcgacca gaagtccatc
      241 gccgagatcg ccaagaccct gagctcggcc gccctcgacc ccaacacgga ggtggtcatc
      301 ggctgcccgg ccatctacct gatgtacgcc cgcaacctgc tgccctgcga gctgggtctg
      361 gccggccaga atgcctacaa ggtggccaag ggcgcattca ccggcgagat ctcccctgcg
      421 atgctgaagg
//

EMBL File Format

Unlike the GenBank file format, the EMBL file format does not require a series of header lines. Thus the first line in the file begins the first sequence entry of the file.

1. The first line of each sequence entry contains the two letters ID in the first two spaces. This is followed by the EMBL identifier in spaces 6 through 14.
2. The second line of each sequence entry has the two letters AC in the first two spaces. This is followed by the accession number in spaces 6 through 11.
3. The third line of each sequence entry has the two letters DE in the first two spaces. This is followed by a free form text definition in spaces 6 through 72.
4. The fourth line in each sequence entry has the two letters SQ in the first two spaces. This is followed by the length of the sequence beginning at or after space 13. After the sequence length there is a blank space and the two letters BP.
5. The nucleotide sequence begins on the fifth line of the sequence entry. Each line of sequence begins with four blank spaces. The next 66 spaces hold the nucleotide sequence in six blocks of ten nucleotides. Each of the six blocks begins with a blank space followed by ten nucleotides. Thus the first nucleotide is in space 6 of the line while the last is in space 70.
6. The last line of each sequence entry in the file is a terminator line which has the two characters // in the first two spaces.

Multiple sequences may appear in each file. To begin another sequence, go back to item 1 and start again.
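The column rules for the sequence lines are easiest to check by generating them. The sketch below writes an entry in the simplified layout described above (ID, AC, DE, SQ, sequence lines, terminator); the function names `format_embl_sequence` and `embl_entry` are hypothetical, and the output follows this simplified description rather than the full EMBL flat-file specification.

```python
def format_embl_sequence(seq, blocks_per_line=6, block_size=10):
    """Lay out a sequence as lines of six ten-letter blocks.

    Each line starts with four blanks, and each block is preceded by one
    blank, so the first nucleotide lands in space 6 and the last in space 70.
    """
    per_line = blocks_per_line * block_size
    lines = []
    for start in range(0, len(seq), per_line):
        chunk = seq[start:start + per_line]
        blocks = [chunk[i:i + block_size] for i in range(0, len(chunk), block_size)]
        lines.append("    " + "".join(" " + block for block in blocks))
    return lines


def embl_entry(identifier, accession, definition, seq):
    """Build the lines of one entry in the simplified layout."""
    lines = [
        "ID   " + identifier,                    # identifier starts in space 6
        "AC   " + accession,                     # accession starts in space 6
        "DE   " + definition,                    # definition starts in space 6
        "SQ" + " " * 10 + f"{len(seq)} BP",      # length starts in space 13
    ]
    lines += format_embl_sequence(seq)
    lines.append("//")
    return lines


# Made-up 80-letter sequence; identifier and accession reuse the example names.
for line in embl_entry("DMTPIG", "X57576", "illustrative entry", "acgt" * 20):
    print(line)
```

A full 60-letter line produced this way is exactly 70 characters long, which matches the "space 6 to space 70" statement in item 5.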

EMBL Example

ID   DMTPIG     standard; DNA; INV; 3419 BP.
XX
AC   X57576; S70377;
XX
SV   X57576.1
XX
DT   20-JAN-1992 (Rel. 30, Created)
DT   19-AUG-1996 (Rel. 49, Last updated, Version 10)
XX
DE   D.melanogaster Tpi gene for Triosephosphate isomerase
XX
KW   glycolytic enzyme; tpi gene; triosephosphate isomerase.
XX
OS   Drosophila melanogaster (fruit fly)
OC   Eukaryota; Metazoa; Arthropoda; Tracheata; Hexapoda; Insecta; Pterygota;
OC   Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha; Ephydroidea;
OC   Drosophilidae; Drosophila.
XX
RN   [1]
RP   1-3419
RA   Sullivan D.T.;
RT   ;
RL   Submitted (07-FEB-1991) to the EMBL/GenBank/DDBJ databases.
RL   D.T. Sullivan, Biological Research Laboratories, 130 College Pl, Syracuse
RL   University, Syracuse, NY 13244, USA
XX
RN   [3]
RX   MEDLINE; 92079900.
RA   Shaw-Lee R.L., Lissemore J.L., Sullivan D.T.;
RT   "Structure and expression of the triose phosphate isomerase (Tpi) gene of
RT   Drosophila melanogaster.";
RL   Mol. Gen. Genet. 230:225-229(1991).
XX
DR   FLYBASE; FBgn0003738; Tpi.
DR   SWISS-PROT; P29613; TPIS_DROME.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..3419
FT                   /db_xref="taxon:7227"
FT                   /germline
FT                   /organism="Drosophila melanogaster"
FT                   /strain="Oregon-R"
FT                   /clone_lib="EMBL-4"
FT   CDS             join(2237..2773,2830..3036)
FT                   /db_xref="FLYBASE:FBgn0003738"
FT                   /db_xref="SWISS-PROT:P29613"
FT                   /gene="Tpi"
FT                   /EC_number="5.3.1.1"
FT                   /product="triosephosphate isomerase"
FT                   /protein_id="CAA40804.1"
FT                   /translation="MSRKFCVGGNWKMNGDQKSIAEIAKTLSSAALDPNTEVVIGCPAI
FT                   YLMYARNLLPCELGLAGQNAYKVAKGAFTGEISPAMLKDIGADWVILGHSERRAIFGES
FT                   DALIAEKAEHALAEGLKVIACIGETLEEREAGKTNEVVARQMCAYAQKIKDWKNVVVAY
FT                   EPVWAIGTGKTATPDQAQEVHASLRQWLSDNISKEVSASLRIQYGGSVTAANAKELAKK
FT                   PDIDGFLVGGASLKPEFLDIINARQ"
FT   mRNA            join(2004..2028,2186..2773,2830..3036)
FT                   /gene="Tpi"
FT   prim_transcript 2004..3296
FT   exon            2008..2032
FT                   /number=1
FT   exon            2189..2773
FT                   /number=2
FT   exon            2830..3296
FT                   /number=3
FT   intron          2033..2188
FT                   /number=1
FT   intron          2774..2829
FT                   /number=2
FT   misc_feature    2147..2151
FT                   /note="intron 1 lariat sequence"
FT   misc_feature    2789..2793
FT                   /note="intron 2 lariat sequence"
FT   polyA_signal    3258..3262
XX
SQ   Sequence 3419 BP; 855 A; 933 C; 849 G; 778 T; 4 other;
     gatctcgagc gagaaatgtg gaacatagtg gaggcctcca gtggcgccga gctgggtgaa        60
     accagctacg agttcccttc ccccgctccg gttcccagcg cagcagtgaa cgaaatagca       120
     gttccacagt cccaccagct cctcctgctc ctgcgaagcc ctcagttccg tccgcctcct       180
     atgacaacca caactacagt ttcagccagg atgaggacga agatgatgat gatctggagt       240
     ttgaggacgt attcgtgccg gccagctctg ttccaaatcc cgttcagcct ggcatagatc       300
     ccgtggaact gcgtcgctcc ctggctttgg tcatgaggga gaaattgcga tcggatgaca       360
     cggactccag gccaatgggc aacaatcagg atcttcccat agatgaacag tccagggaga       420
     gaccgctctc cactcaaaca tctcccacaa atggcccact tccggctctt ctgagggcca       480
     aactgcttgc tgggcaactc nnnncaatag cgctcactgc ctgccaggat ccacggcgag       540
     tcctgctccc caggagcaat ccggtatctt tgtgatcgat agtgaggcga gtcccggctc       600
     aaatgggcac aagcctaagt atcgaaaggg cacggcattc actcggagtt cgctgaagaa       660
     gagccgatcc tgcaactgta gctccatcgc taagggacga ggggtccacg acgagcccag       720
     cagtaatctc tgcagggatc aggagtcctc tgtacttcca cagcatccgc agccagccaa       780
     ccatcccaca gagaactttt
//

PHYLIP File Format

Interleaved and Sequential formats

The sequences can continue over multiple lines; when this is done the sequences must be either in "interleaved" format, similar to the output of alignment programs, or "sequential" format. These are described in the main document file. In sequential format all of one sequence is given, possibly on multiple lines, before the next starts. In interleaved format the first part of the file should contain the first part of each of the sequences, then possibly a line containing nothing but a carriage-return character, then the second part of each sequence, and so on. Only the first parts of the sequences should be preceded by names.

Interleaved

   18   206
a121      MNTTNCFIAL VHAIREIRAF FLSRATG-KM EFTLYNGERK TFYSRPNNHD
a241      MNTTDCFIAL VTAIREIRAF FLPRATG-RM EFTLHNGERK VFYSRPNNHD
c-s8c1    MNTTDCFIAV VNAIKEVRAL FLPRTAG-KM EFTLHDGEKK VFYSRPNNHD
c1nov     MNTTDCFIAV VNAIREIRAL FLPRTTG-KM EFTLHDGEKK VFYSRPNNHD
o1brazl   MNTTDCFIAL VQAIREIKAL FLPRTTG-KM ELTLYNGEKK TFYSRPNNHD
o1campos  MNTTDCFIAL VQAIREIKAL FLPRTTG-KM ELTLYNGEKK TFYSRPNNHD
o1kauf    MNTTDCFIAL VQAIREIKAL FLSRTTG-KM ELTLYNGEKK TFYSRPNNHD
ken1-76   MNTTDCFIAL LRAFREIKTL FLSRVRG-KM EFTLYNGEKK TFYSRPNNHD
ken34-84  MNTTDCFIAL VRAIREFKIL FSLRPLARKM EFTLYNGIKK TFYSRPNKHD
ken       MNTTDCFIAL VQAIREIKLL FKG--IR-KM KLTLYNGEKK TFYSRPNSHD
uga97-1   MNTTDCFIAL VQAIREIKSL FRS--SR-KM EFTLYNGEKK TFYSRPNNHD
bec1-65   MKTTDCFNVL FEIFHRFGQT FKA--DR-KM EFTLYNGEKK TFYSRPNTHG
zim88-3   MKTTDCFDVL LEIFHRFRQT FKT--DR-KM EFTLYNGEKK TFYSRPNTHG
knp10-90  MKTTDCFNVL LETFHRFRNV FKT--DR-KM EFTLYNGDKK TFYSRPNTHG
zim96-3   MKTTGCFDVL IEIAHRLRQL NKT--DR-KM EFTLYNGEKK TFYSRPNTHG
zim7-83   MKTTDCFNVL LEIIYRFRHT FKT--DR-KM EFTLYNGEKK TFYSRPNKHG
knp196-9  MKTTDCFSVL FEIFHRLRHT LKT--ER-KM EFTLYNGERK TFYSRPNKHG
zam4-96   MKTTDCFDAL LEAFHRLRQT FKT--DR-KM EFTLYNGEKK TFYSRPNRHG

          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LAAIKQLEEL TGLELHEGGP
          NCWLNTILQL FRYVGEPFFD WVYDSPENLT LEAIEQLEEL TGLELHEGGP
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LEAIKQLEEL TGLELREGGP
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LEAIKQLEEL TGLELREGGP
          NCWLNAILQL FRYVEEPFFD WVYSTPENLT LEAIKQLEDL TGLELHEGGP
          NCWLNAILQL FRYVEEPFFD WVYSTPENLT LEAIKQLEDL TGLELHEGGP
          NCWLNAILQL FRYVEEPFFD WVYSSPENLT LEAIKQLEDL TGLELHEGGP
          NCWLNAILQL FRYVDEPFFE WVYDSPENLT VEAIRQLEEL TGLELHEGGP
          NCWLNAILQL FRYVDEPFFD WVYESPENLT IQAIGQLEEL TGLDLREGGP
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LRAIEQLEEL TGLELREGGP
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LQAIEQLEEL TGLELHEGGP
          NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP
          NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP
          NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKRLSDY TKLDLSDGGP
          NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP
          NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP
          NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP
          NCWLNSLLQL FRYVDEPLFE SEYLSPENKT LDMIKQLSDY TKLDLSDGGP

          PALVIWNIKH LLQTGIGTAS RPAR-CMVDG TNMCLADFHA GIFLKEQEHA
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TNMCLADFHA GIFLKGQEHA
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFMKGREHA
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFMKGQEHA
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFLKGQEHA
          …
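Reading the interleaved layout means stitching together the pieces of each sequence across blocks. The following is a minimal sketch under stated assumptions: the function name `parse_interleaved_phylip` is hypothetical, names are assumed to occupy the first ten characters of each line in the first block (the usual PHYLIP convention, which the slides do not spell out), and every block is assumed to list the sequences in the same order.

```python
def parse_interleaved_phylip(text, name_width=10):
    """Return (n_chars, [(name, sequence), ...]) from interleaved PHYLIP text."""
    # Blank separator lines between blocks carry no data, so drop them.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_seqs, n_chars = (int(x) for x in lines[0].split()[:2])
    names, parts = [], []
    for i, line in enumerate(lines[1:]):
        row = i % n_seqs  # which sequence this line continues
        if i < n_seqs:
            # only the first block carries names
            names.append(line[:name_width].strip())
            parts.append([])
            line = line[name_width:]
        parts[row].append(line.replace(" ", ""))
    return n_chars, [(n, "".join(p)) for n, p in zip(names, parts)]


# Tiny made-up alignment: 2 sequences of 12 characters in two blocks.
sample = """ 2 12
seqA      ACGTAC
seqB      ACGTAA

GGTTCC
GGTTCA
"""

n_chars, records = parse_interleaved_phylip(sample)
for name, seq in records:
    print(name, seq, len(seq) == n_chars)
```

Comparing each assembled length against the character count from the header line is a cheap way to catch a block that was truncated or mis-ordered.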

Sequential

   18   206 YF
a121      MNTTNCFIAL VHAIREIRAF FLSRATG-KM EFTLYNGERK TFYSRPNNHD
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LAAIKQLEEL TGLELHEGGP
          PALVIWNIKH LLQTGIGTAS RPAR-CMVDG TNMCLADFHA GIFLKEQEHA
          VFACVTSNGW YAIDDEDFYP WTPDPSDVLV FVPYDQEPLN GGWKANVQRK
          LK----
a241      MNTTDCFIAL VTAIREIRAF FLPRATG-RM EFTLHNGERK VFYSRPNNHD
          NCWLNTILQL FRYVGEPFFD WVYDSPENLT LEAIEQLEEL TGLELHEGGP
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TNMCLADFHA GIFLKGQEHA
          VFACVTSNGW YAIDDDDFYP WTPDPSDVLV FVPYDQEPLN GEWKTKVQQK
          LK----
c-s8c1    MNTTDCFIAV VNAIKEVRAL FLPRTAG-KM EFTLHDGEKK VFYSRPNNHD
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LEAIKQLEEL TGLELREGGP
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFMKGREHA
          VFACVTSNGW YAIDDEDFYP WTPDPSDVLV FVPYDQEPLN EGWKASVQRK
          LKGAGQ
c1nov     MNTTDCFIAV VNAIREIRAL FLPRTTG-KM EFTLHDGEKK VFYSRPNNHD
          NCWLNTILQL FRYVDEPFFD WVYNSPENLT LEAIKQLEEL TGLELREGGP
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFMKGQEHA
          VFACVTSNGW YAIDDEDFYP WTPDPSDVLV FVPYDQEPLN EGWKANVQRK
          LKGAGQ
o1brazl   MNTTDCFIAL VQAIREIKAL FLPRTTG-KM ELTLYNGEKK TFYSRPNNHD
          NCWLNAILQL FRYVEEPFFD WVYSTPENLT LEAIKQLEDL TGLELHEGGP
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFLKGQEHA
          VFACVTSNGW YAIDDEDFYP WTPDPSDVLV FVPYDQEPLN GEWKAKVQRK
          LK----
o1campos  MNTTDCFIAL VQAIREIKAL FLPRTTG-KM ELTLYNGEKK TFYSRPNNHD
          NCWLNAILQL FRYVEEPFFD WVYSTPENLT LEAIKQLEDL TGLELHEGGP
          PALVIWNIKH LLHTGIGTAS RPSEVCMVDG TDMCLADFHA GIFLKGQEHA
          VFAC…

PDB File Format

COLUMNS   DATA TYPE      FIELD        DEFINITION
--------------------------------------------------------------------------------
 1 -  6   Record name    "ATOM  "
 7 - 11   Integer        serial       Atom serial number.
13 - 16   Atom           name         Atom name.
17        Character      altLoc       Alternate location indicator.
18 - 20   Residue name   resName      Residue name.
22        Character      chainID      Chain identifier.
23 - 26   Integer        resSeq       Residue sequence number.
27        AChar          iCode        Code for insertion of residues.
31 - 38   Real(8.3)      x            Orthogonal coordinates for X in Angstroms.
39 - 46   Real(8.3)      y            Orthogonal coordinates for Y in Angstroms.
47 - 54   Real(8.3)      z            Orthogonal coordinates for Z in Angstroms.
55 - 60   Real(6.2)      occupancy    Occupancy.
61 - 66   Real(6.2)      tempFactor   Temperature factor.
73 - 76   LString(4)     segID        Segment identifier, left-justified.
77 - 78   LString(2)     element      Element symbol, right-justified.
79 - 80   LString(2)     charge       Charge on the atom.
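Because ATOM records are fixed-width, each field in the table above maps directly onto a string slice (column N in the table is index N - 1 in Python). The sketch below is illustrative only: the function name `parse_atom_record` is hypothetical, and the sample line is built from made-up values with a format string so that the columns are exact; it is not taken from entry 1QU4.

```python
def parse_atom_record(line):
    """Split one ATOM line into its fields using the column table."""
    line = line.ljust(80)  # pad short lines so every slice is valid
    return {
        "serial": int(line[6:11]),        # columns 7 - 11
        "name": line[12:16].strip(),      # columns 13 - 16
        "altLoc": line[16].strip(),       # column 17
        "resName": line[17:20].strip(),   # columns 18 - 20
        "chainID": line[21].strip(),      # column 22
        "resSeq": int(line[22:26]),       # columns 23 - 26
        "iCode": line[26].strip(),        # column 27
        "x": float(line[30:38]),          # columns 31 - 38
        "y": float(line[38:46]),          # columns 39 - 46
        "z": float(line[46:54]),          # columns 47 - 54
        "occupancy": float(line[54:60]),  # columns 55 - 60
        "tempFactor": float(line[60:66]), # columns 61 - 66
        "segID": line[72:76].strip(),     # columns 73 - 76
        "element": line[76:78].strip(),   # columns 77 - 78
        "charge": line[78:80].strip(),    # columns 79 - 80
    }


# Made-up atom, laid out with explicit field widths matching the table.
sample = (
    "ATOM  {:>5d} {:<4s}{:1s}{:>3s} {:1s}{:>4d}{:1s}   "
    "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}      {:<4s}{:>2s}{:2s}"
).format(1, " N", "", "GLY", "A", 1, "", 10.0, 20.5, 30.25, 1.0, 25.0, "", "N", "")

atom = parse_atom_record(sample)
print(atom["name"], atom["resName"], atom["chainID"], atom["x"], atom["y"], atom["z"])
```

Slicing by position rather than splitting on whitespace matters here, since adjacent numeric fields in real files can touch with no blank between them.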

PDB ExamplePDB ExampleHEADER LYASE 06-JUL-99 1QU4 HEADER LYASE 06-JUL-99 1QU4 TITLE CRYSTAL STRUCTURE OF TRYPANOSOMA BRUCEI ORNITHINE TITLE CRYSTAL STRUCTURE OF TRYPANOSOMA BRUCEI ORNITHINE TITLE 2 DECARBOXYLASE TITLE 2 DECARBOXYLASE COMPND MOL_ID: 1; COMPND MOL_ID: 1; COMPND 2 MOLECULE: ORNITHINE DECARBOXYLASE; COMPND 2 MOLECULE: ORNITHINE DECARBOXYLASE; COMPND 3 CHAIN: A, B, C, D; COMPND 3 CHAIN: A, B, C, D; COMPND 4 EC: 4.1.1.17; COMPND 4 EC: 4.1.1.17; COMPND 5 ENGINEERED: YES COMPND 5 ENGINEERED: YES SOURCE MOL_ID: 1; SOURCE MOL_ID: 1; SOURCE 2 ORGANISM_SCIENTIFIC: TRYPANOSOMA BRUCEI; SOURCE 2 ORGANISM_SCIENTIFIC: TRYPANOSOMA BRUCEI; SOURCE 3 EXPRESSION_SYSTEM: ESCHERICHIA COLI; SOURCE 3 EXPRESSION_SYSTEM: ESCHERICHIA COLI; SOURCE 4 EXPRESSION_SYSTEM_COMMON: BACTERIA; SOURCE 4 EXPRESSION_SYSTEM_COMMON: BACTERIA; SOURCE 5 EXPRESSION_SYSTEM_STRAIN: B21/DG3; SOURCE 5 EXPRESSION_SYSTEM_STRAIN: B21/DG3; SOURCE 6 EXPRESSION_SYSTEM_VECTOR_TYPE: PLASMID SOURCE 6 EXPRESSION_SYSTEM_VECTOR_TYPE: PLASMID KEYWDS POLYAMINE METABOLISM, PYRIDOXAL 5'-PHOSPHATE, ALPHA-BETA KEYWDS POLYAMINE METABOLISM, PYRIDOXAL 5'-PHOSPHATE, ALPHA-BETA KEYWDS 2 BARREL, LYASE KEYWDS 2 BARREL, LYASE EXPDTA X-RAY DIFFRACTION EXPDTA X-RAY DIFFRACTION AUTHOR N.V.GRISHIN,A.L.OSTERMAN,H.B.BROOKS,M.A.PHILLIPS, AUTHOR N.V.GRISHIN,A.L.OSTERMAN,H.B.BROOKS,M.A.PHILLIPS, AUTHOR 2 E.J.GOLDSMITH AUTHOR 2 E.J.GOLDSMITH REVDAT 2 29-DEC-99 1QU4 1 JRNL COMPND REMARK REVDAT 2 29-DEC-99 1QU4 1 JRNL COMPND REMARK REVDAT 1 17-NOV-99 1QU4 0 REVDAT 1 17-NOV-99 1QU4 0 JRNL AUTH N.V.GRISHIN,A.L.OSTERMAN,H.B.BROOKS,M.A.PHILLIPS, JRNL AUTH N.V.GRISHIN,A.L.OSTERMAN,H.B.BROOKS,M.A.PHILLIPS, JRNL AUTH 2 E.J.GOLDSMITH JRNL AUTH 2 E.J.GOLDSMITH JRNL TITL X-RAY STRUCTURE OF ORNITHINE DECARBOXYLASE FROM JRNL TITL X-RAY STRUCTURE OF ORNITHINE DECARBOXYLASE FROM JRNL TITL 2 TRYPANOSOMA BRUCEI: THE NATIVE STRUCTURE AND THE JRNL TITL 2 TRYPANOSOMA BRUCEI: THE NATIVE STRUCTURE AND THE JRNL TITL 3 STRUCTURE IN COMPLEX WITH JRNL TITL 3 
STRUCTURE IN COMPLEX WITH JRNL TITL 4 ALPHA-DIFLUOROMETHYLORNITHINE JRNL TITL 4 ALPHA-DIFLUOROMETHYLORNITHINE JRNL REF BIOCHEMISTRY V. 38 15174 1999 JRNL REF BIOCHEMISTRY V. 38 15174 1999 JRNL REFN ASTM BICHAW US ISSN 0006-2960 JRNL REFN ASTM BICHAW US ISSN 0006-2960 REMARK 1 REMARK 1 REMARK 2 REMARK 2 REMARK 2 RESOLUTION. 2.90 ANGSTROMS.REMARK 2 RESOLUTION. 2.90 ANGSTROMS.REMARK REMARK … …

DBREF 1QU4 A 1 425 SWS P07805 DCOR_TRYBB 21 445
DBREF 1QU4 B 1 425 SWS P07805 DCOR_TRYBB 21 445
DBREF 1QU4 C 1 425 SWS P07805 DCOR_TRYBB 21 445
DBREF 1QU4 D 1 425 SWS P07805 DCOR_TRYBB 21 445
SEQRES 1 A 425 GLY ALA MET ASP ILE VAL VAL ASN ASP ASP LEU SER CYS
SEQRES 2 A 425 ARG PHE LEU GLU GLY PHE ASN THR ARG ASP ALA LEU CYS
SEQRES 3 A 425 LYS LYS ILE SER MET ASN THR CYS ASP GLU GLY ASP PRO
SEQRES 4 A 425 PHE PHE VAL ALA ASP LEU GLY ASP ILE VAL ARG LYS HIS
SEQRES 5 A 425 GLU THR TRP LYS LYS CYS LEU PRO ARG VAL THR PRO PHE
SEQRES 6 A 425 TYR ALA VAL LYS CYS ASN ASP ASP TRP ARG VAL LEU GLY
SEQRES 7 A 425 THR LEU ALA ALA LEU GLY THR GLY PHE ASP CYS ALA SER
SEQRES 8 A 425 ASN THR GLU ILE GLN ARG VAL ARG GLY ILE GLY VAL PRO
SEQRES 9 A 425 PRO GLU LYS ILE ILE TYR ALA ASN PRO CYS LYS GLN ILE
SEQRES 10 A 425 SER HIS ILE ARG TYR ALA ARG ASP SER GLY VAL ASP VAL
SEQRES 11 A 425 MET THR PHE ASP CYS VAL ASP GLU LEU GLU LYS VAL ALA
SEQRES 12 A 425 LYS THR HIS PRO LYS ALA LYS MET VAL LEU ARG ILE SER
SEQRES 13 A 425 THR ASP ASP SER LEU ALA ARG CYS ARG LEU SER VAL LYS
SEQRES 14 A 425 PHE GLY ALA LYS VAL GLU ASP CYS ARG PHE ILE LEU GLU
SEQRES 15 A 425 GLN ALA LYS LYS LEU ASN ILE ASP VAL THR GLY VAL SER
SEQRES 16 A 425 PHE HIS VAL GLY SER GLY SER THR ASP ALA SER THR PHE
SEQRES 17 A 425 ALA GLN ALA ILE SER ASP SER ARG PHE VAL PHE ASP MET
SEQRES 18 A 425 GLY THR GLU LEU GLY PHE ASN MET HIS ILE LEU ASP ILE
SEQRES 19 A 425 GLY GLY GLY PHE PRO GLY THR ARG ASP ALA PRO LEU LYS
SEQRES 20 A 425 PHE GLU GLU ILE ALA GLY VAL ILE ASN ASN ALA LEU GLU
SEQRES 21 A 425 LYS HIS PHE PRO PRO ASP LEU LYS LEU THR ILE VAL ALA
SEQRES 22 A 425 GLU PRO GLY ARG TYR TYR VAL ALA SER ALA PHE THR LEU
SEQRES 23 A 425 ALA VAL ASN VAL ILE ALA LYS LYS VAL THR PRO GLY VAL
SEQRES 24 A 425 GLN THR ASP VAL GLY ALA HIS ALA GLU SER ASN ALA GLN
SEQRES 25 A 425 SER PHE MET TYR TYR VAL ASN ASP GLY VAL TYR GLY SER
SEQRES 26 A 425 PHE ASN CYS ILE LEU TYR ASP HIS ALA VAL VAL ARG PRO
SEQRES 27 A 425 LEU PRO GLN ARG GLU PRO ILE PRO ASN GLU LYS LEU TYR
SEQRES 28 A 425 PRO SER SER VAL TRP GLY PRO THR CYS ASP GLY LEU ASP
SEQRES 29 A 425 GLN ILE VAL GLU ARG TYR TYR LEU PRO GLU MET GLN VAL
SEQRES 30 A 425 GLY GLU TRP LEU LEU PHE GLU ASP MET GLY ALA TYR THR
SEQRES 31 A 425 VAL VAL GLY THR SER SER PHE ASN GLY PHE GLN SER PRO
SEQRES 32 A 425 THR ILE TYR TYR VAL VAL SER GLY LEU PRO ASP HIS VAL
SEQRES 33 A 425 VAL ARG GLU LEU LYS SER GLN LYS SER

HET PLP A 600 15
HET PLP B 600 15
HET PLP C 600 15
HET PLP D 600 15
HETNAM PLP PYRIDOXAL-5'-PHOSPHATE
HETSYN PLP VITAMIN B6 COMPLEX
FORMUL 5 PLP 4(C8 H10 N1 O6 P1)
HELIX 1 1 LEU A 45 LEU A 59 1 15
HELIX 2 2 LYS A 69 ASN A 71 5 3
HELIX 3 3 ASP A 73 GLY A 84 1 12
HELIX 4 4 SER A 91 ILE A 101 1 11
HELIX 5 5 PRO A 104 GLU A 106 5 3
HELIX 6 6 GLN A 116 SER A 126 1 11
HELIX 7 7 CYS A 135 HIS A 146 1 12
HELIX 8 8 LYS A 173 GLU A 175 5 3
HELIX 9 9 ASP A 176 LEU A 187 1 12
HELIX 10 10 ALA A 205 LEU A 225 1 21
HELIX 11 11 LYS A 247 PHE A 263 1 17
HELIX 12 12 GLY A 276 ALA A 281 1 6
HELIX 13 13 PHE A 326 HIS A 333 1 8
HELIX 14 14 THR A 390 THR A 394 5 5
HELIX 15 15 SER A 396 PHE A 400 5 5
SHEET 1 A 6 GLN A 365 PRO A 373 0
SHEET 2 A 6 LEU A 350 TRP A 356 -1 N TYR A 351 O LEU A 372
SHEET 3 A 6 SER A 313 VAL A 318 1 O PHE A 314 N SER A 354
SHEET 4 A 6 PHE A 284 THR A 296 -1 N ILE A 291 O TYR A 317
SHEET 5 A 6 PHE A 40 ASP A 44 -1 O PHE A 40 N ALA A 287
SHEET 6 A 6 THR A 404 VAL A 408 1 O THR A 404 N PHE A 41
SHEET 1 A1 6 GLN A 365 PRO A 373 0
SHEET 2 A1 6 LEU A 350 TRP A 356 -1 N TYR A 351 O LEU A 372
SHEET 3 A1 6 SER A 313 VAL A 318 1 O PHE A 314 N SER A 354
SHEET 4 A1 6 PHE A 284 THR A 296 -1 N ILE A 291 O TYR A 317
SHEET 5 A1 6 TRP A 380 PHE A 383 -1 N LEU A 381 O VAL A 288
SHEET 6 A1 6 PRO A 338 PRO A 340 -1 O LEU A 339 N LEU A 382

CRYST1 66.800 151.700 85.350 90.00 102.30 90.00 P 1 21 1 8
ORIGX1 1.000000 0.000000 0.000000 0.00000
ORIGX2 0.000000 1.000000 0.000000 0.00000
ORIGX3 0.000000 0.000000 1.000000 0.00000
SCALE1 0.014970 0.000000 0.003264 0.00000
SCALE2 0.000000 0.006592 0.000000 0.00000
SCALE3 0.000000 0.000000 0.011992 0.00000
ATOM 1 N ASP A 35 34.731 -5.686 15.000 1.00 98.44 N
ATOM 2 CA ASP A 35 34.249 -5.884 13.629 1.00 98.39 C
ATOM 3 C ASP A 35 33.320 -4.750 13.203 1.00 98.13 C
ATOM 4 O ASP A 35 33.474 -3.594 13.603 1.00 98.29 O
ATOM 5 CB ASP A 35 33.558 -7.247 13.545 1.00 98.38 C
ATOM 6 CG ASP A 35 33.566 -7.887 12.170 1.00 98.36 C
ATOM 7 OD1 ASP A 35 33.717 -9.133 12.114 1.00 98.26 O
ATOM 8 OD2 ASP A 35 33.419 -7.182 11.148 1.00 98.39 O
ATOM 9 N GLU A 36 32.332 -5.073 12.378 1.00 97.79 N
ATOM 10 CA GLU A 36 31.446 -4.080 11.787 1.00 95.51 C
ATOM 11 C GLU A 36 32.259 -2.944 11.199 1.00 90.65 C
ATOM 12 O GLU A 36 32.220 -1.813 11.692 1.00 94.96 O
ATOM 13 CB GLU A 36 30.419 -3.638 12.840 1.00 97.63 C
ATOM 14 CG GLU A 36 29.111 -3.155 12.261 1.00 98.19 C
ATOM 15 CD GLU A 36 27.791 -3.597 12.824 1.00 98.33 C
ATOM 16 OE1 GLU A 36 27.308 -4.727 12.601 1.00 98.28 O
ATOM 17 OE2 GLU A 36 27.115 -2.806 13.520 1.00 98.43 O
ATOM 18 N GLY A 37 33.018 -3.192 10.131 1.00 52.86 N
ATOM 19 CA GLY A 37 33.624 -2.167 9.299 1.00 39.88 C
ATOM 20 C GLY A 37 32.598 -1.167 8.712 1.00 34.34 C
ATOM 21 O GLY A 37 32.236 -1.162 7.531 1.00 31.44 O
ATOM 22 N ASP A 38 32.135 -0.248 9.564 1.00 37.23 N
ATOM 23 CA ASP A 38 31.136 0.700 9.138 1.00 36.44 C
ATOM 24 C ASP A 38 31.794 1.722 8.228 1.00 33.49 C
ATOM 25 O ASP A 38 33.029 1.896 8.156 1.00 34.06 O
ATOM 26 CB ASP A 38 30.500 1.242 10.405 1.00 42.06 C
ATOM 27 CG ASP A 38 29.583 0.207 11.047 1.00 44.59 C
ATOM 28 OD1 ASP A 38 29.408 -0.876 10.434 1.00 45.72 O
ATOM 38 CA PHE A 40 32.728 6.727 7.615 1.00 20.51 C
......
CONECT1117911177
CONECT1118011177
MASTER 482 0 4 60 80 0 0 611176 4 64 132
END


ATOM record fields, left to right: Atom serial number, Name, Residue, Chain, Seq Nr, X, Y, Z, Occupancy, Temp Factor, Element
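The field labels above map onto fixed character columns in a real PDB file; the listing in these notes has had its whitespace collapsed, so it cannot be sliced directly. The sketch below is a minimal illustration, assuming the standard fixed-width ATOM layout (serial in columns 7-11, name 13-16, residue 18-20, chain 22, residue number 23-26, X/Y/Z from column 31, occupancy 55-60, temperature factor 61-66, element 77-78). The function name `parse_atom_line` and the re-spaced sample line are illustrative, not part of any library.

```python
def parse_atom_line(line):
    """Slice one fixed-width PDB ATOM record into named fields."""
    return {
        "serial": int(line[6:11]),          # columns 7-11
        "name": line[12:16].strip(),        # columns 13-16
        "res_name": line[17:20].strip(),    # columns 18-20
        "chain": line[21],                  # column 22
        "res_seq": int(line[22:26]),        # columns 23-26
        "x": float(line[30:38]),            # columns 31-38
        "y": float(line[38:46]),            # columns 39-46
        "z": float(line[46:54]),            # columns 47-54
        "occupancy": float(line[54:60]),    # columns 55-60
        "temp_factor": float(line[60:66]),  # columns 61-66
        "element": line[76:78].strip(),     # columns 77-78
    }


# First ATOM record of the listing, rebuilt piece by piece so that
# every field sits in its proper column (78 characters in total).
line = (
    "ATOM  " "    1" " " " N  " " " "ASP" " " "A" "  35" " " "   "
    "  34.731" "  -5.686" "  15.000" "  1.00" " 98.44"
    "          " " N"
)

atom = parse_atom_line(line)
print(atom["name"], atom["res_name"], atom["chain"], atom["res_seq"])
print(atom["x"], atom["y"], atom["z"], atom["temp_factor"])
```

Slicing by column rather than splitting on whitespace matters because adjacent PDB fields can touch each other with no separating space, as in the CONECT and MASTER lines of the listing.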

File Format Conversions
Wide variety of formats
Common tools
• readseq (all flavors of Unix)

 1. IG/Stanford         10. Olsen (in-only)
 2. GenBank/GB          11. Phylip3.2 (Sequential)
 3. NBRF                12. Phylip (Interleaved)
 4. EMBL                13. Plain/Raw
 5. GCG                 14. PIR/CODATA
 6. DNAStrider          15. MSF
 7. Fitch               16. ASN.1
 8. Pearson/Fasta       17. PAUP/NEXUS
 9. Zuker (in-only)     18. Pretty (out-only)

• seqret (EMBOSS)
    gcg            GCG 9.x and 10.x format
    embl
    swiss
    fasta
    genbank
    nbrf
    pir            NBRF (PIR)
    codata         CODATA format
    strider        DNA strider format
    clustal
    phylip         PHYLIP non-interleaved multiple alignment format
    acedb          ACeDB format
    msf            Wisconsin Package GCG's MSF multiple sequence format
    hennig86       Hennig86 format
    jackknifer     Jackknifer format
    jackknifernon  Jackknifernon format
    nexus
    paup           Nexus/PAUP format
    treecon        Treecon format
    mega           Mega format
    ig             IntelliGenetics format
    staden
    text

• Many GUI packages such as GCG SeqLab (Unix), BioEdit (Windows), etc. have built-in conversion utilities between different file formats

• Forcon is handy for converting between phylogenetic multiple alignment formats

Structure file formats
Major formats
• PDB - Protein Data Bank
• mol2 - Tripos Sybyl
• mmCIF - Macromolecular Crystallographic Information File
• XYZ

Some packages can automatically convert between these formats

Babel
Alchemy, AMBER PREP, Ball and Stick,
Biosym .CAR, Boogie, Cacao Cartesian,
Cambridge CADPAC, CHARMm, Chem3D Cartesian 1,
Chem3D Cartesian 2, CSD CSSR, CSD FDAT,
CSD GSTAT, Dock Database, Dock PDB,
Feature, Free Form Fractional, GAMESS Output,
Gaussian Output, Gaussian Z-Matrix, Gaussian 92 Output,
Gaussian 94 Output, GAMESS Output (A), GROMOS96 (nm),
Hyperchem HIN, MDL Isis (SDF), M3D,
Mac Molecule, Macromodel, Micro World,
MM2 Input, MM2 Output, MM3,
MMADS, MDL MOLfile, MOLIN,
Mopac Cartesian, Mopac Internal, Mopac Output,
PC Model, PDB, PS-GVB Input,
PS-GVB Output, Quanta MSF, Schakal,
ShelX, SMILES, Spartan,
Spartan Semi-Empirical, Spartan Mol. Mechanics, Sybyl Mol,
Sybyl Mol2, Conjure, UniChem XYZ,
XYZ, XED

Also has the ability to add and delete hydrogens
Available for Unix (AIX, Ultrix, Sun-OS, Convex, SGI, Cray, Linux), MS-DOS, and on Macs running at least System 7.0.

babel -imm2out mm2.grf -omopint mopac.dat

Some programming tools for conversions
• bioperl

use Bio::SeqIO;

$in = Bio::SeqIO->new(-file => "inputfilename", '-format' => 'Fasta');
$out = Bio::SeqIO->new(-file => ">outputfilename", '-format' => 'EMBL');

while ( my $seq = $in->next_seq() ) {
    $out->write_seq($seq);
}

or

use Bio::SeqIO;

$in = Bio::SeqIO->newFh(-file => "inputfilename", '-format' => 'Fasta');
$out = Bio::SeqIO->newFh('-format' => 'EMBL');

# World's shortest Fasta<->EMBL format converter:
print $out $_ while <$in>;

• biopython

Scanner - The part of the parser that actually does the work of going through the file and extracting useful information. This useful information is converted into events.

Consumer - The consumer processes the useful information and emits it in a format that the programmer can use. It does this by receiving the events created by the scanner.

You may be required to write your own scanner and consumer for certain formats
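The scanner/consumer split described above can be illustrated with a small, self-contained sketch for FASTA input. This is a generic illustration of the pattern in plain Python, not Biopython's own classes: the names `FastaScanner`, `RecordConsumer` and the event methods `start_record`, `sequence_line` and `end_record` are all hypothetical.

```python
class FastaScanner:
    """Walks through FASTA lines and reports what it finds as events."""

    def feed(self, lines, consumer):
        in_record = False
        for raw in lines:
            line = raw.strip()
            if not line:
                continue  # blank lines between records carry no information
            if line.startswith(">"):
                if in_record:
                    consumer.end_record()
                consumer.start_record(line[1:].strip())
                in_record = True
            elif in_record:
                consumer.sequence_line(line)
        if in_record:
            consumer.end_record()


class RecordConsumer:
    """Receives scanner events and builds (title, sequence) tuples."""

    def __init__(self):
        self.records = []
        self._title = None
        self._chunks = []

    def start_record(self, title):
        self._title = title
        self._chunks = []

    def sequence_line(self, text):
        self._chunks.append(text)

    def end_record(self):
        self.records.append((self._title, "".join(self._chunks)))


consumer = RecordConsumer()
FastaScanner().feed(["> seq1", "ACGT", "ACGT", "", "> seq2", "GGCC"], consumer)
print(consumer.records)
```

Because the scanner only knows the file layout and the consumer only knows the desired output, supporting a new output representation means writing a new consumer while the scanner stays unchanged.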

Translating nucleotide formats
Factors to take into account
• Translate in all 6 reading frames
    3 forward, 3 reverse
    The use of non-standard genetic codes for different organisms
    Stop codons
Output format
• 1 letter
• 3 letter
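The factors listed above can be made concrete with a short sketch of six-frame translation. It assumes the standard genetic code only (non-standard codes would need a different amino-acid string), writes stop codons as '*', and produces one-letter output. The helper names `translate`, `reverse_complement` and `six_frames` are illustrative.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Translate the 3 forward and 3 reverse reading frames."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, peptide in sorted(six_frames("ATGGCCTAA").items()):
    print(name, peptide)
```

Three-letter output would only need a second lookup from one-letter to three-letter codes applied to each translated residue.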

EMBOSS

• transeq
    It can translate in any of the 3 forward or 3 reverse sense frames, or in all three forward or reverse frames, or in all six frames.
    It can translate specified regions corresponding to the coding regions of your sequences.
    It can translate using the standard ('Universal') genetic code and also with a selection of non-standard codes.
    Termination (STOP) codons are translated as the character '*'.
    The output peptide sequence is always in the standard one-letter IUPAC code.

• prettyseq
    This writes out a nicely formatted display of the sequence with the translation (within specified ranges) displayed beneath it.
    Slightly unusually, this application uses the codon usage tables to translate the codons.

Web tools
• Expasy translate tool
• EBI translation machine

Viewers for sequencer data
    abiview (EMBOSS)
    Trev (Unix)
    EditView (Mac)
    Chromas (Windows)
    AbiView (Windows)

Most viewers allow you to:
• View the traces
• Change the scale
• Edit the basecalling
• Preserve the original sequence
• Export the data

Analysis of primary data from sequencers

Staden Package (MRC-LMB)
Preparing sequence trace data for analysis or assembly
• pregap4
    Graphical user interface
    Prepare trace data
    Automation
    Trace format conversion
    Quality analysis
    Vector clipping
    Contaminant screening
    Repeat searching

Assembly program
• gap4
    Assembly
    Contig joining
    Assembly checking
    Repeat searching
    Experiment suggestion
    Read pair analysis
    Contig editing
    Graphical views of contigs
    Database

Consed
• Phred: base caller
• Phrap: assembler
• Consed: editor and finishing program
• Quality values

Phred designed for gel-based sequencers
Being checked for capillary data

Finding open reading frames

GRAIL
• Neural network
• Combines evidence from 7 different statistical measures
    Frame bias
    Periodicities
    Fractal dimensions
    Coding 6-tuples
    In-frame 6-tuples
    K-tuple commonality
    Repetitive 6-tuple words
• At each position of the sequence, info is weighted, integrated and scored for ORF or intergenic region

Organism/dataset specificity

Genscan
• Statistics and probabilistic models of gene structure

GeneWise
• Comparison of translations with known proteins

NetGene
• Donor and acceptor sites

EMBOSS
• getorf
• plotorf
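For orientation, the simplest possible notion of an open reading frame (a start codon followed in frame by a stop codon) can be sketched in a few lines. This is a naive scan under stated assumptions (ATG as the only start codon, TAA/TAG/TGA as stops, forward strand only) and is not the method used by GRAIL, Genscan, GeneWise, NetGene or getorf. The function name `find_orfs` and its `min_codons` parameter are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for ATG...stop runs on the forward strand.

    Coordinates are zero-based and half-open, so seq[start:end] includes
    the stop codon. min_codons counts the codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


dna = "CCATGAAATGGTAACC"
for start, end, frame in find_orfs(dna):
    print(frame, start, end, dna[start:end])
```

To cover the reverse strand as well, the same function can be run on the reverse complement of the sequence, with the coordinates mapped back afterwards.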

Determining protein and DNA characteristics

Web
• BCM Search Launcher
    Nucleic acid sequence searches
    General protein sequence/pattern searches
    Species-specific protein sequence searches
    Multiple sequence alignments
    Pairwise sequence alignments
    Gene feature searches
    Sequence utilities
    Protein secondary structure prediction

• SMART
    Protein domain and feature analysis

• Pfam
    HMM-based protein motif searches

Prosite
• Detects signature motifs in proteins
• Regular expression searches
• Scan sequences against database

• Prints
    Protein fingerprints
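The regular expression searches listed under Prosite can be sketched as follows. The converter handles only the core of the PROSITE pattern syntax (elements joined by '-', 'x' for any residue, '[...]' for allowed residues, '{...}' for forbidden residues, '(n)' or '(n,m)' for repeats, '<' and '>' for sequence ends) and is a simplification, not PROSITE's own scanner. The function name `prosite_to_regex` and the test peptide are made up for illustration; the pattern N-{P}-[ST]-{P} is the N-glycosylation site motif.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into a Python regular expression."""
    regex = pattern.rstrip(".")
    regex = regex.replace("-", "")                      # element separators
    regex = regex.replace("{", "[^").replace("}", "]")  # forbidden residues
    regex = regex.replace("(", "{").replace(")", "}")   # repeat counts
    regex = regex.replace("x", ".")                     # any residue
    regex = regex.replace("<", "^").replace(">", "$")   # sequence ends
    return regex


motif = re.compile(prosite_to_regex("N-{P}-[ST]-{P}"))
protein = "MNGSAANPSA"  # made-up peptide for illustration
hits = [(m.start(), m.group()) for m in motif.finditer(protein)]
print(motif.pattern, hits)
```

Note that `finditer` reports non-overlapping matches only; a lookahead-based pattern would be needed if overlapping sites must all be reported.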

EMBOSS DNA
• cpgplot - plots CpG-rich areas
• restrict - restriction sites
• tfscan - transcription factors
• einverted - find inverted repeats
• chips - codon usage
• geecee - GC content

EMBOSS protein
• garnier - predicts protein secondary structure
• helixturnhelix - report nucleic acid binding motifs
• hmoment - hydrophobic moment calculation
• pepcoil - predicts coiled coil regions
• pepnet - displays proteins as a helical net
• pepwheel - shows protein sequences as helices
• tmap - displays membrane spanning regions
• topo - draws an image of a transmembrane protein
• charge - protein charge plot
• checktrans - reports STOP codons and ORF statistics of a protein sequence
• compseq - counts the composition of dimer/trimer/etc words in a sequence
• iep - calculates the isoelectric point of a protein
• octanol - displays protein hydropathy
• pepinfo - plots simple amino acid properties in parallel
• pepstats - protein statistics
• pepwindow - displays protein hydropathy
• antigenic - finds antigenic sites in proteins
• pscan - scans proteins using PRINTS
• sigcleave - reports protein signal cleavage sites

Primer Design
Factors
• Melting point
    Length
    Composition
    Methods for calculating melting point
    Internal stability
• Specificity
    False priming sites
• Internal stability
    Hairpin structures
• Compatibility
    Primer dimers
    Compatible melting points
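Two of the factors above, composition and melting point, are easy to compute by hand. The sketch below uses the Wallace rule, Tm = 2(A+T) + 4(G+C), a rough estimate intended for short oligonucleotides; it is swapped in here for simplicity and is not the nearest-neighbour method used by OLIGO or Primer3, so its values differ slightly from theirs. The function names are illustrative; the primer sequences are the first left and right primers from the Primer3 sample output later in these notes.

```python
def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    gc = primer.count("G") + primer.count("C")
    return 100.0 * gc / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in degrees C: 2 per A/T plus 4 per G/C."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


left = "AAGAGTCTGGGGGAGCTGAT"
right = "ATCATTGCTGGGCTGATCTC"
for primer in (left, right):
    print(primer, len(primer), gc_percent(primer), wallace_tm(primer))
```

The GC percentages agree with the gc% column reported by Primer3 for these two primers (55.00 and 50.00), while the Wallace estimates come out a degree or two away from the reported tm values of about 60.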

OLIGO Package
• Nearest neighbour method for Tm calculation
• Comprehensive analysis suite
• $$$

CODEHOP
• COnsensus-DEgenerate Hybrid Oligonucleotide Primer
• PCR primers designed from protein multiple sequence alignments

Primer3
• You provide the target sequence
• It picks primers for PCR reactions, considering as criteria:
    Oligonucleotide melting temperature
    Size
    GC content
    Primer-dimer possibilities
    PCR product size
    Positional constraints within the source sequence
    Miscellaneous other constraints

                 start  len     tm    gc%   any    3'  seq

1 LEFT PRIMER       66   20  60.22  55.00  5.00  2.00  AAGAGTCTGGGGGAGCTGAT
  RIGHT PRIMER     259   20  60.19  50.00  4.00  2.00  ATCATTGCTGGGCTGATCTC
  PRODUCT SIZE: 194, PAIR ANY COMPL: 4.00, PAIR 3' COMPL: 2.00

2 LEFT PRIMER      331   20  60.25  45.00  5.00  2.00  AGCTCATTGGGCAAAAAGTG
  RIGHT PRIMER     529   20  59.55  55.00  2.00  1.00  CCAGTTCCAATAGCCCAGAC

3 LEFT PRIMER      331   20  60.25  45.00  5.00  2.00  AGCTCATTGGGCAAAAAGTG
  RIGHT PRIMER     538   20  60.12  45.00  3.00  2.00  GCAGTTTTGCCAGTTCCAAT

4 LEFT PRIMER      379   20  59.67  50.00  4.00  2.00  TCATCGCCTGTATTGGTGAG
  RIGHT PRIMER     578   20  60.44  50.00  6.00  2.00  GCGGAGTTTCTTGTGCACTT


Statistics

        con   too    in    in          no    tm    tm  high  high        high
        sid  many   tar  excl   bad    GC   too   too   any    3'  poly   end
       ered    Ns   get   reg   GC% clamp   low  high compl compl     X  stab    ok
Left   4198     0     0     0     0     0   810  2322    17    65     0    86   898
Right  4172     0     0     0     0     0   807  2281     2     5     0    83   994

Pair Stats:
considered 811, unacceptable product size 422, high any compl 1, high end compl 33, ok 355
