6
Hindawi Publishing Corporation Journal of eoretical Chemistry Volume 2013, Article ID 961378, 5 pages http://dx.doi.org/10.1155/2013/961378 Research Article Studying the Polypeptide Sequence (-Code) of Escherichia coli Vladimir R. Rosenfeld Mathematical Chemistry Group, Noble Institution for Environmental Peace, P.O. Box 163, 720 Spadina Avenue, Toronto, ON, Canada M5S 2T9 Correspondence should be addressed to Vladimir R. Rosenfeld; vladimir [email protected] Received 19 March 2013; Accepted 21 November 2013 Academic Editors: A. M. Lamsabhi, A. Stavrakoudis, and B. M. Wong Copyright Β© 2013 Vladimir R. Rosenfeld. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. is paper is devoted to algebraically simulating the -code of bacterium Escherichia coli and studying contrast factors (words) in its polypeptide sequence. We utilize the methods of spectral theory of graphs which were previously employed by us for enumerating De Bruijn and Kautz sequences. e empirical material is borrowed from the computer investigation of contrast factors in the polypeptide sequences of prokaryotes. 1. Introduction It was proposed [1, 2] to divide 19 out of all the 20 amino acids into two subgroups, an alanine subgroup (of fatty amino acids: alanine, phenylamine, isoleucine, leucine, methionine, proline, threonine, and valine) and a glycine subgroup (of more polar amino acids: cysteine, aspartic acid, glutamic acid, glycine, histidine, lysine, asparagine, glutamine, arginine, tryptophan, and tyrosine), while serine remains a spare ele- ment in the full classification thereof. In a shorthand notation, this gives a, f, i, l, m, p, t, and v (an alanine subgroup); c, d, e, g, h, k, n, q, r, w, and y (a glycine subgroup); and s (a free character). e three numbers 1, 2, and 3 were picked to represent the main two subgroups and the character s, respectively. Brute statistics, under the natural ratio of 1s to 2s being 0.526: 0.474, had predicted an almost regular distribution (alternation) of the two ciphers. To check such a hypothesis, there were found the frequencies of all 2 (1 ≀ ≀ 11) possible substrings of the length in the genomic sequence of E. coli. e results at once showed that the respective perfectly alternating substrings are in fact the least frequent ones, in the entire genomic sequence. However, visual observation allowed suspecting that the main condition for near-to- statistical distribution of the two subgroups of amino acids may be disguised in grouping equal ciphers (either 1 or 2) rep- resenting the respective subgroups in adjacent pairs thereof. at is, in lieu of the code 121212 β‹… β‹… β‹… 12 it should be 112211 β‹… β‹… β‹… 22, where general ratio of ciphers stays thus unal- tered. e situation with pairing of equal ciphers reminded us of the phenomenon of the so-called -code in polypeptides. According to it, one turn of polypeptide spiral involves 3.5 amino acids on an average. Since the nearest to 3.5 multiple integer equals 7, it was of interest to interpret the -code as a one in which all structural features are due to conditionally grouping amino acids into consecutive sevens thereof. Merging the ideas of sevens (which suit well for interpret- ing the -code) and pairs (which better obey the natural pro- portion of amino acids and follow experimental observation) allows us to set forward a universal model of the -code. In this, the distribution of amino acids along the entire polypep- tide sequence should maximize a mean number of pairs of equal integers (1 or 2) that are contained in consecutive sevens of it. us, the optimal case is that one has 3 pairs (11 and/or 22) and one unpaired number (1 or 2) which can either be between any two pairs out of the three or be outside of these (e.g., it may comprise a pair with an equal number of the adja- cent seven). However, it is more interesting to consider the case when the window of length 7 cuts out a substring without 3 such pairs; this is also a necessary pattern in the sequence. Logically, we could readily establish that the optimal - code should have not less than 2 pairs of adjacent equal numbers in every window of length 7. Another criterion that seemed to be of use was that the -code should reject the subsequences 1212 and 2121 since these allow no more than

Research Article Studying the Polypeptide Sequence ( -Code ...De Bruijn and Kautz sequences. e empirical material is borrowed from the computer investigation of contrast factors in

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

  • Hindawi Publishing CorporationJournal of Theoretical ChemistryVolume 2013, Article ID 961378, 5 pageshttp://dx.doi.org/10.1155/2013/961378

    Research ArticleStudying the Polypeptide Sequence (𝛼-Code) of Escherichia coli

    Vladimir R. Rosenfeld

    Mathematical Chemistry Group, Noble Institution for Environmental Peace, P.O. Box 163, 720 Spadina Avenue,Toronto, ON, Canada M5S 2T9

    Correspondence should be addressed to Vladimir R. Rosenfeld; vladimir [email protected]

    Received 19 March 2013; Accepted 21 November 2013

    Academic Editors: A. M. Lamsabhi, A. Stavrakoudis, and B. M. Wong

    Copyright Β© 2013 Vladimir R. Rosenfeld. This is an open access article distributed under the Creative Commons AttributionLicense, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properlycited.

    This paper is devoted to algebraically simulating the 𝛼-code of bacterium Escherichia coli and studying contrast factors (words) in itspolypeptide sequence. We utilize the methods of spectral theory of graphs which were previously employed by us for enumeratingDe Bruijn and Kautz sequences. The empirical material is borrowed from the computer investigation of contrast factors in thepolypeptide sequences of prokaryotes.

    1. Introduction

    It was proposed [1, 2] to divide 19 out of all the 20 aminoacids into two subgroups, an alanine subgroup (of fatty aminoacids: alanine, phenylamine, isoleucine, leucine, methionine,proline, threonine, and valine) and a glycine subgroup (ofmore polar amino acids: cysteine, aspartic acid, glutamic acid,glycine, histidine, lysine, asparagine, glutamine, arginine,tryptophan, and tyrosine), while serine remains a spare ele-ment in the full classification thereof. In a shorthandnotation,this gives a, f, i, l, m, p, t, and v (an alanine subgroup); c,d, e, g, h, k, n, q, r, w, and y (a glycine subgroup); and s (afree character). The three numbers 1, 2, and 3 were pickedto represent the main two subgroups and the character s,respectively.

    Brute statistics, under the natural ratio of 1s to 2s being0.526: 0.474, had predicted an almost regular distribution(alternation) of the two ciphers. To check such a hypothesis,there were found the frequencies of all 2𝑙 (1 ≀ 𝑙 ≀ 11) possiblesubstrings of the length 𝑙 in the genomic sequence of E.coli. The results at once showed that the respective perfectlyalternating substrings are in fact the least frequent ones, inthe entire genomic sequence. However, visual observationallowed suspecting that the main condition for near-to-statistical distribution of the two subgroups of amino acidsmay be disguised in grouping equal ciphers (either 1 or 2) rep-resenting the respective subgroups in adjacent pairs thereof.That is, in lieu of the code 121212 β‹… β‹… β‹… 12 it should be

    112211 β‹… β‹… β‹… 22, where general ratio of ciphers stays thus unal-tered.

    The situation with pairing of equal ciphers reminded usof the phenomenon of the so-called 𝛼-code in polypeptides.According to it, one turn of polypeptide spiral involves 3.5amino acids on an average. Since the nearest to 3.5 multipleinteger equals 7, it was of interest to interpret the 𝛼-code asa one in which all structural features are due to conditionallygrouping amino acids into consecutive sevens thereof.

    Merging the ideas of sevens (which suit well for interpret-ing the 𝛼-code) and pairs (which better obey the natural pro-portion of amino acids and follow experimental observation)allows us to set forward a universal model of the 𝛼-code. Inthis, the distribution of amino acids along the entire polypep-tide sequence should maximize a mean number of pairs ofequal integers (1 or 2) that are contained in consecutive sevensof it. Thus, the optimal case is that one has 3 pairs (11 and/or22) and one unpaired number (1 or 2) which can either bebetween any two pairs out of the three or be outside of these(e.g., it may comprise a pair with an equal number of the adja-cent seven). However, it is more interesting to consider thecasewhen thewindowof length 7 cuts out a substringwithout3 such pairs; this is also a necessary pattern in the sequence.

    Logically, we could readily establish that the optimal 𝛼-code should have not less than 2 pairs of adjacent equalnumbers in every window of length 7. Another criterion thatseemed to be of use was that the 𝛼-code should reject thesubsequences 1212 and 2121 since these allow no more than

  • 2 Journal of Theoretical Chemistry

    one pair 11 or 22 in most seven-cipher substrings that containthem. Experimentally, there was the third (and strongest ofall we know) criterion; the 𝛼-code should avoid the inclusionof the palindrome of the type 1211121; however, here, veryserious reservations must be made concerning the fact thatsubstituting 2 for 1 (and conversely) in this palindromeproduces (longer) substrings that are, on the contrary, notvery frequent in the natural sequence of E. coli. In caseof longer palindromes including 1211121 (or 2122212), thesituation may seem rather ambiguous.

    That is why we first tried to attest the simplest (forimplementation) mathematical criterion of the above; theprohibition of 1212 and 2121, in amodel𝛼-code. Here, we shallturn to some rigorous mathematical matters.

    2. Preliminaries

    First of all, we must introduce an ancillary digraph𝐷 whichis constructed as follows. The vertices of 𝐷 are all the eightordered triples of 1s and 2s (i.e., 111, 112, 121, 211, 122, 202,220, and 222) and an arc (a self-loop) goes out of one vertexto another (this same vertex) if the last two ciphers (on theright) in the first triple coincide with the first two ciphers inthe second triple. Say there is an arc from 111 to 112 (with twocommon ciphers: 11) and also an oriented selfloop attached tothe vertex 111 (i.e., an arc from 111 into 111 itself).

    Using the connectivity of𝐷, one can mentally travel in itfrom any vertex V, consecutively traversing arcs and visitingadjacent vertices. So, the passage to any adjacent vertex 𝑒along the respective arc (or returning to the original vertexV through the selfloop attached to it, if any) means that onehas done a walk of length 1 (et seq.). However, what is veryessential is that every walk of the length 𝑙, in 𝐷, factuallycovers 𝑙+3 ciphers since every next step increases the numberof visited vertices by one but the very first vertex alreadycontained three ciphers, by definition. It is very important tonote that the auxiliary digraph 𝐷 allows one to reproduceevery sequence of 1s and 2s of length 𝑙 + 3 (𝑙 β‰₯ 0), which alsoincludes sequences containing 1212 and 2121.

    In order to rule out all forbidden sequences with 1212and/or 2121, we shall derive from 𝐷 a β€œbetter” workingdigraph 𝐷. Specifically, 𝐷 is the above digraph 𝐷 less twoopposite arcs going from 121 to 212, and vice versa, from 212to 121. Since now, the problem on enumerating all (model)𝛼-code sequences of length 𝑙 + 3 (𝑙 β‰₯ 3) can be reduced toanother one enumerating all walks of length 𝑙 in the workingdigraph𝐷.

    It is not difficult to form the adjacencymatrix𝐴 = 𝐴(𝐷) =[π‘Žπ‘–π‘—]8

    𝑖,𝑗=1of 𝐷, where an entry π‘Žπ‘–π‘— = 1 if and only if there is

    an arc (a selfloop) from the 𝑖th vertex to the 𝑗th vertex andπ‘Žπ‘–π‘— = 0, otherwise, namely,

    𝐴 =

    [

    [

    [

    [

    [

    [

    [

    [

    [

    [

    1 0 0 0 1 0 0 0

    0 1 0 0 0 1 0 0

    1 0 0 0 1 0 0 0

    0 1 0 0 0 1 0 0

    0 0 0 1 0 0 1 0

    0 0 1 0 0 0 0 1

    0 0 1 0 0 0 0 0

    0 0 0 1 0 0 0 0

    ]

    ]

    ]

    ]

    ]

    ]

    ]

    ]

    ]

    ]

    . (1)

    The characteristic polynomial 𝑃(𝐷; π‘₯) of the digraph 𝐷(which is by definition the characteristic polynomial 𝑃(𝐴; π‘₯)of its adjacency matrix 𝐴 [3]) is

    𝑃 (𝐷; π‘₯) = π‘₯8βˆ’ 2π‘₯7+ π‘₯6βˆ’ 2π‘₯5+ π‘₯4+ π‘₯2

    = π‘₯2(π‘₯ βˆ’ 1) (π‘₯

    2+ 1) (π‘₯

    3βˆ’ π‘₯2βˆ’ π‘₯ βˆ’ 1) .

    (2)

    The roots of 𝑃(𝐷; π‘₯) are πœ†1 = 1.83929, πœ†2 = 1, πœ†3,4 = 0, πœ†5,6 =±𝑖, and πœ†7,8 = 0.419645Β±0.6063𝑖, where 𝑖 = βˆšβˆ’1.These rootscomprise the spectrum (of eigenvalues) of the digraph𝐷 [3].

    Also, we need to construct a derivative matrix𝐴 = π½βˆ’πΌβˆ’π΄, where 𝐽 is the matrix of all 1s while 𝐼 is a diagonal identitymatrix; namely,

    𝐴 =

    [

    [

    [

    [

    [

    [

    [

    [

    [

    [

    [

    βˆ’1 1 1 1 0 1 1 1

    1 βˆ’1 1 1 1 0 1 1

    0 1 0 1 0 1 1 1

    1 0 1 0 1 0 1 1

    1 1 1 0 0 1 0 1

    1 1 0 1 1 0 1 0

    1 1 0 1 1 1 0 1

    1 1 1 0 1 1 1 0

    ]

    ]

    ]

    ]

    ]

    ]

    ]

    ]

    ]

    ]

    ]

    . (3)

    Thematrix𝐴 is the matrix of the complementary digraph𝐷. The corresponding characteristic polynomial is

    𝑃 (𝐷; π‘₯) = π‘₯8+ 2π‘₯7βˆ’ 15π‘₯

    6βˆ’ 80π‘₯

    5βˆ’ 180π‘₯

    4

    βˆ’ 232π‘₯3βˆ’ 180π‘₯

    2βˆ’ 80π‘₯ βˆ’ 16.

    (4)

    Let 𝐻𝐷(𝑑) = βˆ‘βˆž

    𝑙=0𝑁𝑙𝑑𝑙 be the generating function of the

    number of walks in the digraph 𝐷, wherein the coefficient𝑁𝑙 of 𝑑

    𝑙 is the number of walks of length 𝑙, in 𝐷. Taking intoaccount that a walk of length 𝑙 (𝑙 β‰₯ 0) corresponds to asubstring of length 𝑙+3 of the𝛼-code, one can alsowrite downthe generating function 𝑆(𝑑) for the number of substrings (ofthe 𝛼-code) of length 𝑙 as follows:

    𝑆 (𝑑) =

    ∞

    βˆ‘

    𝑙=0

    𝑁𝑙𝑑𝑙+3= 𝑑3𝐻𝐷 (𝑑) . (5)

    In the next subsection, we shall demonstrate that ana-lytically calculating 𝑆(𝑑) can readily be done with the aid ofspectral theory of graphs.

    3. Main Part

    We shall start this subsection with adapting adapted in ournotation the following fundamental result (see Theorem 1.11in [3]).

    Lemma 1. Let 𝐷 be a finite connected (di)graph. Then thegenerating function𝐻𝐷(𝑑) of the number of walks in 𝐷 is

    𝐻𝐷 (𝑑) =

    1

    𝑑

    {(βˆ’1)𝑛𝑃𝐷[βˆ’ (𝑑 + 1) /𝑑]

    𝑃𝐷 (1/𝑑)

    βˆ’ 1} , (6)

    where 𝑛 is the number of vertices in 𝐷.

  • Journal of Theoretical Chemistry 3

    For practically using (6), one should substitute to theR.H.S. of it the polynomials 𝑃(𝐷; π‘₯) and 𝑃(𝐷; π‘₯) obtainedabove. After elementary but tedious manipulations (whichwill be omitted herein) we have arrived at the following for-mula for our specific digraph𝐷:

    𝐻𝐷 (𝑑) =

    ∞

    βˆ‘

    𝑙=0

    𝑁𝑙𝑑𝑙= βˆ’2(

    2𝑑4+ 3𝑑3+ 6𝑑2+ 3𝑑 + 4

    𝑑5+ 𝑑4+ 2𝑑3+ 𝑑 βˆ’ 1

    ) . (7)

    It should also be noted that along with the exact resultgiven by Theorem 1.11 from [3], there exists a remarkableasymptotic evaluation for 𝑁𝑙, based on Theorem 1.12 in [3].In our specific case, it gives the following simple formula:

    limπ‘™β†’βˆž

    𝑁𝑙 = 8 Γ— 1.83929𝑙, (8)

    where 8 is the number of vertices in the digraph 𝐷 while1.83929 is its eigenvalue πœ†1 with the maximum modulus(|πœ†1| > |πœ†π‘ |; 2 ≀ 𝑠 ≀ 8). Thus, there can be deduced thefollowing approximate formula for the entire series𝐻𝐷(𝑑):

    𝐻𝐷 (π‘₯) β‰ˆ

    8

    1 βˆ’ 1.83929𝑑

    . (9)

    From the obtained spectral results for the number ofwalks, one can immediately derive the respective workingformulae for the number of substrings of length 𝑙 (𝑙 β‰₯ 3) ofthe 𝛼-code. So, we arrive at the following final expression:

    𝑆 (𝑑) =

    ∞

    βˆ‘

    𝑙=0

    𝑁𝑙𝑑𝑙+3= βˆ’2𝑑

    3(

    2𝑑4+ 3𝑑3+ 6𝑑2+ 3𝑑 + 4

    𝑑5+ 𝑑4+ 2𝑑3+ 𝑑 βˆ’ 1

    )

    β‰ˆ

    8𝑑3

    1 βˆ’ 1.83929𝑑

    ,

    𝑆 (𝑑) = 8𝑑3+ 14𝑑4+ 26𝑑5+ 48𝑑6+ 88𝑑7

    + 162𝑑8+ 298𝑑

    9+ 548𝑑

    10+ 1008𝑑

    11

    + 1854𝑑12+ 3410𝑑

    13+ 6272𝑑

    14

    + 11536𝑑15+ 21218𝑑

    16+ 39026𝑑

    17

    + 71780𝑑18+ 132024𝑑

    19+ 242830𝑑

    20

    + 446634𝑑21+ 821488𝑑

    22+ 1510952𝑑

    23

    + 2779074𝑑24+ 5111514𝑑

    25+ 9401540𝑑

    26

    + 17292128𝑑27+ 31805182𝑑

    28+ 58498850𝑑

    29

    + 107596160𝑑30+ 197900192𝑑

    31+ 363995202𝑑

    32

    + 669491554𝑑33+ 1231386948𝑑

    34

    + 2264873704𝑑35+ 4165752206𝑑

    36

    + 7662012858𝑑37+ 14092638768𝑑

    38

    + 25920403832𝑑39+ 47675055458𝑑

    40

    + 87688098058𝑑41+ 161283557348𝑑

    42

    + 296646710864𝑑43+ 545618366270𝑑

    44

    + 1003548634482𝑑45+ 1845813711616𝑑

    46

    + 3394980712368𝑑47+ 6244343058466𝑑

    48

    + 11485137482450𝑑49+ 21124461253284𝑑

    50

    + 38853941794200𝑑51+ 71463540529934𝑑

    52+ β‹… β‹… β‹… ,

    𝑆 (𝑑) β‰ˆ 8𝑑3+ 14.714𝑑

    4+ 27.064𝑑

    5

    + 49.778𝑑6+ 91.557𝑑

    7+ 168.4𝑑

    8

    + 309.74𝑑9+ 569.69𝑑

    10+ 1047.8𝑑

    11

    + 1927𝑑12+ 3544.8𝑑

    13+ 6519.9𝑑

    14

    + 11992𝑑15+ 22057𝑑

    16+ 40569𝑑

    17

    + 74618𝑑18+ 137240𝑑

    19+ 252430𝑑

    20

    + 464290𝑑21+ 853970𝑑

    22+ 1570700𝑑

    23

    + 2889000𝑑24+ 5313700𝑑

    25+ 9773400𝑑

    26

    + 17976000𝑑27+ 33063000𝑑

    28+ 60813000𝑑

    29

    + 111850000𝑑30+ 205730000𝑑

    31+ 378400000𝑑

    32

    + 695980000𝑑33+ 1280100000𝑑

    34

    + 235450000𝑑35+ 4330600000𝑑

    36

    + 7965200000𝑑37+ 14650000000𝑑

    38

    + 26946000000𝑑39+ 49562000000𝑑

    40

    + 91159000000𝑑41+ 167670000000𝑑

    42

    + 308390000000𝑑43+ 567220000000𝑑

    44

    + 1043300000000𝑑45+ 19189000000𝑑

    46

    + 3529400000000𝑑47+ 6491600000000𝑑

    48

    + 11940000000000𝑑49+ 21961000000000𝑑

    50

    + 40392000000000𝑑51+ 74293000000000𝑑

    52

    + 136650000000000𝑑53+ β‹… β‹… β‹… .

    (10)

    Beckmann et al. [4] and Brendel et al. [5] studied themeasure for evaluating the deviation of the frequency of astring of length 𝑛, in a genomic sequence, from its statisticallyexpectedmagnitude. Here, we shall also adopt their approach[4, 5]. Let 𝑀 = π‘Ž1π‘Ž2 β‹… β‹… β‹… π‘Žπ‘› be a sequence of letters, of length𝑛, encoding amino acids in a genome or random sequence(π‘Žπ‘– ∈ A20 = {a, c, d, e, f , g, h, i, k, l,m,n, p, q, r, s, t, k,w, y},1 ≀ 𝑖 ≀ 𝑛). Let further 𝑓(𝑀) = 𝑓(π‘Ž1π‘Ž2 β‹… β‹… β‹… π‘Žπ‘›) denote

  • 4 Journal of Theoretical Chemistry

    Table 1: Contrast 8-letter factors in the polypeptide sequence of Escherichia coli.

    Number 𝑀 𝑓(𝑀) std(𝑀) Number 𝑀 𝑓(𝑀) std(𝑀)1 11111111 10436 4.686098 61 21121112 3489 3.4702662 11111112 5225 βˆ’6.185607 70 21211111 3439 βˆ’3.4302153 21111111 5214 βˆ’6.595223 74 21121111 3430 βˆ’3.5949185 12111111 4886 4.464338 78 11211112 3403 βˆ’3.5465516 11112111 4852 3.606632 82 12211222 3378 βˆ’3.1681037 11111121 4835 3.235524 92 21111121 3341 βˆ’4.15011910 11111122 4126 βˆ’3.226558 94 22211221 3337 7.77608513 22111111 4064 βˆ’4.725432 110 12111112 3234 βˆ’5.33894216 11221122 3977 5.215973 119 12121111 3193 3.53810626 21111112 3749 8.531731 132 21211112 3124 4.05307629 21112111 3728 βˆ’3.948552 155 12211212 3018 βˆ’3.14304130 22111112 3726 5.740505 162 21111212 2980 3.49067733 22112112 3665 βˆ’3.097771 165 12112222 2971 βˆ’4.23780334 11112112 3639 βˆ’3.956724 177 22122212 2910 3.51141340 21111122 3618 4.192473 206 22222222 2767 3.07154641 21112112 3609 4.460223

    the observed frequency of a string 𝑀 in a genome. Given theobserved frequencies of (𝑛 βˆ’ 2)- and (𝑛 βˆ’ 1)-mers, one cancalculate the expected frequency of 𝑛-mers in a genome asfollows:

    πœ™ (π‘Ž1π‘Ž2 β‹… β‹… β‹… π‘Žπ‘›) =

    𝑓 (π‘Ž1 β‹… β‹… β‹… π‘Žπ‘›βˆ’1) 𝑓 (π‘Ž2 β‹… β‹… β‹… π‘Žπ‘›)

    𝑓 (π‘Ž2 β‹… β‹… β‹… π‘Žπ‘›βˆ’1)

    . (11)

    To measure the deviation of the observed frequency of string𝑀 from its expected occurrence in a genome sequence, oneshould further define the standard deviation of 𝑀 as

    std (𝑀) =𝑓 (𝑀) βˆ’ πœ™ (𝑀)

    max (βˆšπœ™ (𝑀), 1). (12)

    A string is called a contrast factor (or word) if the absolutevalue of std(𝑀), defined as above, is greater or equal to somethreshold value (which is set to be 3.0, in [4, 5]). In otherwords, a contrast word 𝑀 should obey the claim |std(𝑀)| β‰₯ 3.

    Table 1 contains the values std(𝑀) for all 31 8-charactercontrast factors in the polypeptide sequence of Escherichiacoli, wherein the numbers are taken from the full inventoryof all possible 256 8-character factors over the two-characteralphabet A2 = {1, 2}, as these follow in the nonincreasingorder of the magnitudes of 𝑓(𝑀).

    Now we shall consider some conclusions.

    4. Conclusions

    The following inferences can be drawn.

    (1) The distribution the most contrast factors approxi-mately corresponds to that of the most frequent ones.Say the first 16 out of 31 contrast factors fall into thegroup of the 41 most frequent factors, whereas 50 ofthe least frequent factors include no contrast factorswhatever. The last, 31st, contrast word is 206th in thecomplete list of all 8-character sequences overA2.

    (2) It is clearly seen that a β€œnonpolar” alanine subgroup ofamino acids, denoted by 1, and a β€œpolar” glycine sub-group, denoted by 2, play asymmetric parts in com-posing the contrast words. For instance, 1 is the lastcharacter of word 12 times while 2 is at the tail19 times; truly, at the head of word the respectivenumbers are 14 and 17, whose difference seems to benot so essential. The sums 12 + 14 = 26 and 19 + 17 =36 also conserve this disbalance of frequencies. Thus,mutually substituting 1s for all 2s, and vice versa, inthe words collected in Table 1 is not at all an invariantaction on it. Additionally note that out of 8Γ—31 = 248characters comprising the 31 contrast words 2 appearsonly 86 times (or in 34.68% of cases). Separately, thenumber of occurrences of 2 in Table 1 for the words𝑀 with std(𝑀) β‰₯ 3 is 48 (or 19.35% of cases) and thatfor the words with std(𝑀) ≀ βˆ’3 is 38 (or 15.32% ofcases). Since the share of 2 in a natural polypeptidesequence of Escherichia coli is 0.474, contrast factorsin it can comprise only a minor part of its total length(it is a trivial qualitative fact that is easily deduciblefrom the data in Table 1). But another readily seenfact is of paramount importance. Contrast factorsoccur chiefly due to the presence of a fattier alaninesubgroup of amino acids, denoted by 1.

    (3) Since (the most preferable) contrast factors 𝑀 withstd(𝑀) β‰₯ 3 and (the most avoidable) contrast factorswith std(𝑀) ≀ βˆ’3have practically equal frequencies inTable 1 (16 and 15, resp.), both types of contrast factorsplay a commanding role in forming the polypeptidesequence (in particular of Escherichia coli). That is,the synthesis of polypeptides in nature is carried outso as to give preference to the maximum number ofadmissible β€œpreferable” factors and reject the maxi-mum number of avoidable factors. Though the con-trast factors themselves comprise only a minor part

  • Journal of Theoretical Chemistry 5

    of polypeptide sequence (see above), their existence,with the underlined preference of one of them andavoidance of the others, crucially controls the synthe-sis of polypeptides. Accordingly, noncontrast factorsoccur in a quite statistical way and thereby ensurethe conservation of the natural ratio of polar andfatty amino acids. The current study just emphasizesthe significance of compiling a special dictionary ofcontrast factors for polypeptides.

    (4) Thus, the role of noncontrast factors is to be themain buildingmaterial for the aminoacid sequence ofEscherichia coli and, as we may guess, also of such asequence of any other organism. Since our proposedmathematical approach is applicable to an arbitraryaminoacid sequence, it would be of interest to checkit for different organism, as well as to investigateaminoacid factors of different lengths (longer than 7,as in our present work).

    The material of this paper has entirely been borrowedfrom [6]. Additionally, the interested reader may see alsounsolved combinatorial problems in our earlier publications[7, 8].

    References

    [1] B. M. Broome and M. H. Hecht, β€œNature disfavors sequences ofalternating polar and non-polar amino acids: implications foramyloidogenesis,” Journal of Molecular Biology, vol. 296, no. 4,pp. 961–968, 2000.

    [2] Y. Mandel-Gutfreund and L. M. Gregoret, β€œOn the significanceof alternating patterns of polar and non-polar residues in beta-strands,” Journal of Molecular Biology, vol. 323, no. 3, pp. 453–461, 2002.

    [3] D. M. Cvetković, M. Doob, and H. Sachs, Spectra of Graphs:Theory andApplication, Academic Press, Berlin, Germany, 1980.

    [4] J. S. Beckmann, V. Brendel, and E. N. Trifonov, β€œInterveningsequences exhibit distinct vocabulary,” Journal of BiomolecularStructure and Dynamics, vol. 4, no. 3, pp. 391–400, 1986.

    [5] V. Brendel, J. S. Beckmann, and E. N. Trifonov, β€œLinguistics ofnucleotide sequences: morphology and comparison of vocabu-laries,” Journal of Biomolecular Structure and Dynamics, vol. 4,no. 1, pp. 11–21, 1986.

    [6] V. R. Rosenfeld, Function-related linguistics of noncodingsequences [Ph.D. thesis], The University of Haifa, 2003.

    [7] V. R. Rosenfeld, β€œEnumerating De Bruijn sequences,”MATCH:Communications in Mathematical and in Computer Chemistry,no. 45, pp. 71–83, 2002.

    [8] V. R. Rosenfeld, β€œEnumerating Kautz sequences,” KragujevacJournal of Mathematics, vol. 24, pp. 19–41, 2002.

  • Submit your manuscripts athttp://www.hindawi.com

    Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

    Inorganic ChemistryInternational Journal of

    Hindawi Publishing Corporation http://www.hindawi.com Volume 2014

    International Journal ofPhotoenergy

    Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

    Carbohydrate Chemistry

    International Journal of

    Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

    Journal of

    Chemistry

    Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

    Advances in

    Physical Chemistry

    Hindawi Publishing Corporationhttp://www.hindawi.com

    Analytical Methods in Chemistry

    Journal of

    Volume 2014

    Bioinorganic Chemistry and ApplicationsHindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

    SpectroscopyInternational Journal of

    Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

    The Scientific World JournalHindawi Publishing Corporation http://www.hindawi.com Volume 2014

    Medicinal ChemistryInternational Journal of

    Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

    Chromatography Research International

    Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

    Applied ChemistryJournal of

    Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

    Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

    Theoretical ChemistryJournal of

    Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

    Journal of

    Spectroscopy

    Analytical ChemistryInternational Journal of

    Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

    Journal of

    Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

    Quantum Chemistry

    Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

    Organic Chemistry International

    ElectrochemistryInternational Journal of

    Hindawi Publishing Corporation http://www.hindawi.com Volume 2014

    Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

    CatalystsJournal of