07/09/04 SFI Summer School 2
Complexity of Biological Symbolic Sequences
Bailin Hao
T-Life Research Center, Fudan University
Shanghai 200433
Institute of Theoretical Physics, Academia Sinica
Beijing 100080
http://www.itp.ac.cn/~hao/
Symbols and symbolic sequences are inevitable in a coarse-grained description of Nature
An Observation
u d c s b t
(Quarks with charge, mass, flavor, charm, …)
p n e
(Particles with charge, mass, spin, magnetic moment, …)
H C N O P …
(Atoms with atomic number, ion radius, valence, affinity, …)
H2O NO CO2 …
(Molecules with molecular weight, polarity, color, …)
a c g t
(Nucleotides with strong or weak coupling)
A D E F G H … W Y V
(Amino acids with different physico-chemical properties)
BRCA1 PDGF
(Genes, proteins)
The Central Dogma of Molecular Biology
DNA -> DNA (replication)
DNA -> mRNA (transcription)
mRNA -> cDNA (reverse transcription)
mRNA -> Protein/Enzyme (translation)
Protein/Enzyme -> Structure (folding)
Structure -> Function (interaction)
Biological Symbolic Sequences
Nucleic Acid (DNA, RNA) Sequences
• One-dimensional
• Directed, from 5’ to 3’
• Unbranching polymers
• Heteropolymers made of 4 kinds of monomers/bases/nucleotides (a, c, g, t or u)
• Length: 10^4 to 10^8
cccaatatcttgcttcagcaagatattgggtatttctagctttcctttcttcaaaaattgctatatgttagcagaaaagccttatccattaagagatggaacttcaagagcagctaggtctagagggaagttgtgagcattacgttcgtgcattacttccataccaagattagcacggttgatgatatcagcccaagtattaataacgcgaccttggctatcaactacagattggttgaaattgaatccgtttagattgaaagccatagtactaatacctaaagcagtgaaccaaatccctactacaggccaagcagccaagaagaagtgtaaagaacgagagttgttaaaactagcatattggaagattaatcggccaaaataaccatgagcggccacaatattataagtttcttcctcttgaccaaatctgtaaccctcattagcagattcgttttcagtggtttccctgatcaaactagaggttaccaaggaaccatgcatagcactgaatagggaaccgccgaatacaccagctacacctaacatgtgaaatggatgcataaggatgttatgctctgcctggaatacaatcataaagttgaaagtaccagatattcctaaaggcataccatcagagaaacttccttgaccaatagggtaaatcaagaaaacagcagtagcagctgcaacaggagctgaatatgcaacagcaatccaaggacgcatacccagacggaaactcagttcccactcacgacccatataacaagctacaccaagtaagaagtgtagaacaattagctcataaggaccaccattgtataaccactcatcaacagatgcagcttcccaaattgggtaaaagtgcaatccgatcgccgcagaagtaggaataatggcaccagagataatattgtttccgtaaagtaaagaaccagaaacaggctcacgaataccatcaatatctactggaggggcagcgatgaaggcgataataaatacagaagttgcggtcaataaggtagggatcatcaaaacaccgaaccatccgatgtaaagacggttttcggtgctagttatccagttgcagaagcgaccccacaggcttgtactttcgcgtctctctaaaattgcagtcatggtaagatcttggtttattcaaattgcaaggactcccaagcacacgtattaactagaaagataatagaaggcttgttatttaacagtataatatagactatataccaatgtcaaccaagccagccccgacagttgtatatccatacaacaaaatttaccaaaccaaaaaattttgtaaatgaagtgagtgaaaaatcaaaactcagattgctcctttctagtttccatatgggttgcccgggactcgaacccggaactagtcggatggagtagataattattccttgttacaatagagaaaaaacctctccccaaatcgtgcttgcatttttcattgcacacgactttccctatgtagaaataggctatttctattccgaagaggaagtctactaatttttttagtagtaagttgattcacttactatttattatagtacagagaacatttcagaatggaaactgtgaaagttttaccttgatcatttatcaatcatttctagtttattagttttgtttaatgattaattaagaggattcaccagatcattgatacggagaatatccaaataccaaatacgctcactgtgcgatccacggaaagaaaagtaagttgttttggcgaacatcaaagaaaaaacttgctcttcttccgtaaaaaattcttctaaaaataccgaacccaaccattgcataaaagctcgtaccgtgcttttatgtttacgagctaaagttctagcgcatgaaagtcgaagtatatactttagtcgatacaaagtcttcttttttgaagatccactgtgataatgaaaaagatttctacatatccgaccaaaccgatcaagaatatcccaatc
cgataaatcggtccaaattggtttactaataggatgccccgatccagtacaaaattgggcttttgctaaagatccaatgagaggagtaacagggactttggtatcgaattttttcatttgagtatctattagaaatgaattctccagcatttgattccttactaacaaagaatttattggtacacttgaaaagtaccccagaaaatcgaagcaagagttttctaattggtttagatggatcctttgcggttgagtccaaaaagagaaagaatattgccacaaacggacaaggtaacatttccatttcttcttcaaaagaagagttccttttgatgcaagaattgcctttccttgatatcgaacataatgcataaggggatccataacgaaccatatggttttccgaaaaaaagcagggtacattaacccaaaatgttccatcttcctagaaaagatgattcgttccagaaaggttccggaagaagttaatcgcaagcaagaagattgtttacgaagaaacaacaagaaaaattcatattctgatacataagagttatataggaaccgaaatagtcttttattttcttttttcaaaataaaaatggatttcattgaagtaataaaactattccaattcgagtagtagttgagaaagaatcgcaataaatgcaaggatggaacatcttggatccggtattgaaggagttgaagcaagatatccaaatggataggatagggtatttctatatgtgctagataatgtaagtgcaaaaatttgtcttctaaaaaaggaaatattgaatgaatagatcgtaaattctgaaactttggtatttctttttcttccggacaagactgttctcgtagcgagaatgggatttctacaacgatcgcaaacccctcagatagaatctgagaataaaactcagaataaaaaaaattgttgtaatccaataatcgatcttggttaggatgattaaccaaattaatccaaaaattctgctgatacattcgaatcattaaccgtttcacaagtagtgaactaaatttcttgttattagaaccaataatttcgacaagttcggaaccatttaatccataatcatgggcaaacacataaatgtactcctgaaagagtagtgggtagacgaaatattgtctaggaaatttaagtttttctgaataaccctcgaatttttccatttgtatttctacttgaatcagagagagagaaatatttctcggtttatcaaatggtgatacatagtacaatatggtcagaacagggtgttgcattttttaatacaaacccctggggaagaaaaggagtctaatccacggatctttttccgctccttttctatccaatttgtttatgtttgttctaattacaaaagagaacaaatcctttatttttgcaggccaattgctcttttgactttgggatacagtctctttatcaatatactgcttcttttacacattcaatccataacatccttttcaatccaaaatcaagaataattaggatttctaaaaaaaaaagaaaaaatcaaaggtctactcataggaaaaccagcttttccctacatcaggcactaatctatttttaacgtctaattagatcagggagttcttccaattaagaagttaagctcgttgctttttgttttaccagaattggagccaggctctatccatttattcattagacccagaaaatcagaatttttttattccattccaaaaatccaaaataagaaattgattttattacgacatgctattttttccattcattacccttgaggatcagtcgcggtcttatagactctaccaagagtctggacgaattttttgcttcatccaaatgtgtaaaagatcatagtcgcacttaaaagccgagtactctaccattgagttagcaacccagataaactaggatcttagatacgatcgaaatccaaaaatcaatggaattacaccgcacacccctgtca
aaatcttaaaatagcaagacattaaaagaaagattttatcaccattgaaaacactcagataccaaaaggaacgggtctggttaaatttcactaaggttaaaagtggcaccaatcacgatcgtaaaattgtcatttttttagcatttttatttaaataaataaataaatcttgtatgagagtacaaacaagagggacaaccctaccatttgagcaaagtgtaggcaaaaaacctaatagggagtgaggataaagagacttatccatctacaaattctagatgttcaatggacctttgtcaatggaaatacaatggtaagaaaaaaattagatagaaaaactcaaaaaaataaaggcttatgttggattggcacgacataaatccagtcaaaaataggattaagaaagaggcaaattatttctaaatagttagacaacaagggatactagtgagcctctcctagttttttattcatttagttcttcaattaactcaaagttctttctttttctttaaagaattccgccttccttaaaatatcagaaacggttcttgtaggttgagcacctttttcaaggaaatagagaatagctggaacatttaaacaagtttgattctttatcggatcataaaaacctacttttcgaagatctcttccttctcttcgagatcgaacatcaattgcaacgattcgatagacagcttattgggatagatgtagataaataaagccccccctagaaacgtataggaggttttctcctcatacggctcgagaatatgacttgcattaatttccgtacagaaaaaacaaatttcatttatactcatgactcaagttgactaattttgattgacagacttgaaagaaaaaaatcctttgaaattttttgagtcgtctctaaactcttttctttgcctcatctcgaacaaattcacttttattccttattccggtccaattctattgttgagacagttgaaaatcgtgtttacttgttcgggaatcctttatctttgatttgtgaaatccttgggtttaaacattacttcgggaattcttattcttttttctttcaaaagagtagcaacatacccttttttcttatttccttcgataaagcatttccctcttctatagaaatcgaatatgagcgattgattctgatagactttaatcaaaagagttttcccatatcttccaaaattggactttcttcttattttaaccttttgatttctatattatttcgatttctatattaagggtagaatgacaaagttggcctaatttattagttttcactaaccctagattctttcccttgataaaaaataaattctgtcctctcgagctccatcgtgtactatttacttagcttacttacaaacaacccagcgaaaattcggttcgggacgaatagaacagactatgtcgagccaagagcattttcattactatggaaaatggtggatagcaaaatccacaatcgatcgtgtccttcaagtcgcacgttgctttctaccacatcgttttaaacgaagttttaacataacattcctctaatttcattgcaaagtgttatagggaattgatccaatatggatggaatcatgaatagtcattagtttcgttttttgtatactaattcaaacttgctttgctatctatggagaaatatgaataaaagaaattaagtatttatcgggaaagactccgcaaagagccaatttatttaaacccatattctatcatatgaatgaaatatagttcgaaaaaagggaataaacaagtttgcttaagacttatttattatggaatttccatcctcaacagaggactcgagatgatcaatccaatcctgaaatgataagagaagaattgactcttctccaacaaataaactatcaacctcccgtttaattaatttaattaatatattagattagcaatctatttttccataccatttttccgtaacaaaactaattaactattaa
ctagttaaactattgcaatgaaaagaaagttttttggtagttatagaattctcgtatttcttcgactcgaataccaaaagaaagaaaaaaatgaagtaaaaaaaacgcatttcctgtaaagtaaaattaaggtctttgcttttacttattttttcttttacctaaaagaagcaactccaaatcaaaattgaatccattctatctaacgagcagttcttatcttatctttaccgggatggatcattctggatatttaaaaaatcgcggatcgagatcgtttttgcttaaccaaagaaagaaaaagaagaaggaaccttttttactaataaaatactataaaaaaaatttatctctatcataaatctatctctaccataaaggaataggtctcgttttttatacaatgttctacgtcaagtttaaaattttttcatgaaaaaaagattttcaatttgactggacttgacactggattatgttttctgagacagaaaatgaacgcattaggactgcatcgaatctaagagtttataagagaaaaaaattctctttaataaactttatgtctcgtgcagaatacaatacgatttcatctttcgtttcatcagaaaaaatctgggacggaaggattcgaacctccgagtaacgggaccaaaacccgctgccttaccacttggccacgccccatttcgggttttatgcgacactaataaacagtattatgtttatttcttattcgtcaatcctacttcaattacataaaaatggggggtattctcttggtaggattctagacatgcgaataatatagaatccaaaaaatgcattgatcattacatggaattctattaagatattatatgaaagtcgaatttcttccactctcatttgagagtgcgaatacaaggaggtattttgtgtttgggaaagtccgaagaaaaaaggattttgaatcctccttttcctttttcccttagaaaaataactcaatcaaaatccaattatctactctacaagaacgaaacgcttgttatgcctaatatacttagtttaacctgtatttgttttaattctgttatttatccgactagttttttcttcgccaaattgcccgaagcttatgccattttcaatccaatcgtggattttatgcctgtcatacctgtactcttttttctattagcctttgtttggcaagctgctgtaagttttcgatgaaatctttactactctgtctgccaaattgaatcatgtattcattctaaaaaaattcgaaaaatggataagagccgagaagtcttatattatgaaccttcgattctaaaattcaaattcttctacattgaatgtatagctgcagcaataaatttggatcagcctttctactccctgcatctacgttgagcaggtatctttaggtaaccgcacaatacctaacctaatttattgataagagtgcttattataaatcaattcttgcaatttttttcaaaaattgatttttgcatttttaggtgtcaaaataaacaaaacccatcctagtggatttgtgtggtaaggaaaaacgggtaatctattccttaaaaaaaaatcttggagattatgtaatgcttactctcaaactttttgtttatacagtagtgatattctttgtttccctctttatctttggattcttatctaatgatccaggacgtaatcctgggcgtgacgagtaaaaatccaaaattttttcttacaaattggatttgtttcatacatttatctacgagaaaatccgggggtcagaattccttccaattcgaaagtcccaaacgatccgagggggcggaaagagagggattcgaaccctcggtacaaaaaaattgtacaacggattagcaatccgccgctttagtccactcagccatctctccccgttccaaatcgaaaggtttccgtgatatgacagaggcaagaaataacgattgcaaaaaatccttcctttttctttcaaaagt
tcaaaaaaattatattgccaattccattttagttatattcttttttcttaatgttaataaaaaaaagaagaaaattcttcttttttctttctaattctaaaattggatattggctaaaagacaatcagatagattttctcttcagcaggcatttccatataggacttgttataataaaacaagcaggttatagaaaaaaactcttttttttattatttatcaacaaagcaaaaaggggtcttatcaaaccaacccaccccataaaattggaaagaaagataaagtaagtggacctgactccttgaatgaggcctctatccgctattctgatatataaattcgatgtagatgaaattgtataagtggatttttttgtatttccttagacttagaccacgcaaggcaagaatttctcgctatttactatttcatattcttgttactagatgttctataggaataagaagaaatcgcaacccctttccgctacacataaaaatggatttcgaaagtcaatttttcttttcaatatctttactttttttcagaatcctatttttgttcttatacccatgcaatagagagcgagtgggaaaagggaggttactttttttcattttttccttaaaaaataggctttcttggaaataggaatcatggaataatctgaattccaatgtttatttctatagtataagaaaaactaattgaatcaaattcatggatttaccacgacctcggctgtgaccccatagataaaaatgcaaaatttctatcttcgagaccattgaaaaaaggcattgaacgagaaaaaatcgtccacagataatctatcgtatgccttggaagtgatataaggtgctcggaaatggttgaagtaattgaataggaggatcactatgactatagcccttggtagagttactaaagaagaaaatgatttatttgatattatggacgactggttacgaagggaccgttttgtttttgtaggatggtctggcctattgctttttccttgtgcttatttcgctttaggaggttggtttacagggacaacttttgtaacttcttggtatacccatggattggcgagttcctatttggaaggttgcaatttcttaaccgcagcagtttccacccctgccaatagtttagcacactctttgttgctactatggggcccggaagcacaaggggattttactcgttggtgtcaattaggtggtctgtggacttttgttgctctccatggggcttttgcactaataggtttcatgttacgtcaatttgaacttgctcggtctgttcaattgcggccttataatgcaatttcattctctggcccaatcgctgtttttgtttccgtattcctgatttat
ccactggggcaatccggttggttctttgcgccgagttttggcgtagcagcgatatttcgattcatcctcttcttccaaggatttcataattggacgttgaacccatttcatatgatgggagttgccggagtattaggcgcggctctgctatgcgctattcatggggcaaccgtgga
Biological Symbolic Sequences
Protein Sequences
• One-dimensional
• Directed, from N-terminus to C-terminus
• Unbranching polymers
• Heteropolymers made of 20 kinds of monomers/amino acids (A, C, … W, Y)
• Length: 10^2 to 10^3 AA, say, 50-6000
ID   A1BG_HUMAN     STANDARD;      PRT;   495 AA.
... ... ...
KW   Immunoglobulin domain; Glycoprotein; Plasma; Repeat; Signal.
... ... ...
SQ   SEQUENCE   495 AA;  54209 MW;  87A50C21CE89459C CRC64;
     MSMLVVFLLL WGVTWGPVTE AAIFYETQPS LWAESESLLK PLANVTLTCQ ARLETPDFQL
     FKNGVAQEPV HLDSPAIKHQ FLLTGDTQGR YRCRSGLSTG WTQLGKLLEL TGPKSLPAPW
     LSMAPVPWIT PGLKTTAVCR GVLRGETFLL RREGDHEFLE VPEAQEDVEA TFPVHQPGNY
     SCSYRTDGEG ALSEPSATVT IEELAAPPPP VLMHHGESSQ VLHPGNKVTL TCVAPLSGVD
     FQLRRGEKEL LVPRSSTSPD RIFFHLNAVA LGDGGHYTCR YRLHDNQNGW SGDSAPVELI
     LSDETLPAPE FSPEPESGRA LRLRCLAPLE GARFALVRED RGGRRVHRFQ SPAGTEALFE
     LHNISVADSA NYSCVYVDLK PPFGGSAPSE RLELHVDGPP PRPQLRATWS GAALAGRDAV
     LRCEGPIPDV TFELLREGET KAVKTIPTPG AAANLELIFV GPQHAGNYRC RYRSWVPHTF
     ESELSDPVEL LVAES
//
Example of a protein sequence: human immunoglobulin domain
Biological Symbolic Sequences
Other possible types of sequences
• Structural data
• Gene expression data from microarrays
• Bio-medical literature
We will only touch on DNA and protein sequences
A Proviso: Limitations of Symbols
• DNA sequences: no methylation info (epigenome program), no role of metallic ions, no spatial configurations, etc.
• Protein sequences: no post-translational modifications (signal-peptide, glycosylation, phosphorylation, splicing), no explicit structural info, etc.
• These sequences are the result of billions of years of evolution
• They contain regular elements that are conserved yet variable and fuzzy, as well as random elements
• There are sequencing errors (1/10000 in genomic DNA, greater in ESTs)
• The relation between random and non-random aspects may be scrutinized in different ways, e.g., in a collection of many sequences or in a single long sequence
Where to Get the Sequences?
Public Databases, e.g.:
DNA sequences (NCBI/EMBL/DDBJ):
http://www.ncbi.nlm.nih.gov
ftp://ftp.ncbi.nlm.gov/genbank/genomes
ftp://ftp.ncbi.nih.gov/genomes
Protein sequences (SWISS-PROT, PIR, UNIPROT):
http://www.expasy.ch/
GenBank Rel.142 (15 June 2004)
• 35 532 003 (3.5x10^7) sequences
• 40 325 321 348 (4x10^10) bases/letters
• Average length: 1134 (many ESTs, few but increasing number of complete genomes)
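The quoted average length follows directly from the two totals on this slide. A minimal check, with illustrative variable names:

```python
# Totals quoted for GenBank Rel. 142 (15 June 2004).
n_sequences = 35_532_003
n_bases = 40_325_321_348

# Average sequence length, truncated to an integer as on the slide.
average_length = n_bases // n_sequences
print(average_length)
```

The small value reflects the many short ESTs in the release, as the slide notes.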
A simple-minded approach: Counting K-strings
• K-words, K-grams, K-tuples, K-strings. A convention: “string” versus “word”
• One can look at K-tuples in a collection of sequences or in a single sequence
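Counting K-strings amounts to sliding a window of width K along the sequence. The sketch below is one straightforward way to do it for a single linear sequence or for a collection. The function names are illustrative, and the sequence is treated as linear (a circular genome would need wrap-around windows).

```python
from collections import Counter

def count_kstrings(sequence, k):
    """Count every overlapping K-string in one linear sequence."""
    sequence = sequence.lower()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def count_kstrings_in_collection(sequences, k):
    """Sum the K-string counts over a collection of sequences."""
    total = Counter()
    for sequence in sequences:
        total.update(count_kstrings(sequence, k))
    return total

counts = count_kstrings("acgtacgt", 2)
print(counts)
```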
Transition from Randomness to “Determinism” with K Increasing
• E. coli (strain K12) genome: a loop of 4639221 bp, distribution of single letters almost random
• K=1, 2, …: close to random sequences
• Among the 4^4639221 possible sequences only 1 or a small subset makes E. coli
• Probes on gene-chips: K in the tens, say, 20 ~ 80. Already gene-specific.
• Some extreme examples follow
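To get a feel for these numbers, the sketch below computes how many decimal digits 4^4639221 has. It also computes the expected count of any one K-string if the genome were a uniformly random loop of the same length. The uniform-letter model and the function name are assumptions made for illustration only.

```python
import math

genome_length = 4_639_221   # E. coli K12, in bp
alphabet_size = 4           # a, c, g, t

# Number of decimal digits of 4**genome_length, via logarithms.
log10_count = genome_length * math.log10(alphabet_size)
decimal_digits = math.floor(log10_count) + 1

def expected_count_uniform(k, length=genome_length):
    """Expected count of one given K-string in a uniformly random
    circular sequence: 'length' start positions, each matching
    with probability 4**-k."""
    return length / alphabet_size ** k

print(decimal_digits)
print(expected_count_uniform(20))
```

Under this crude model a given 20-string is expected far less than once per genome, which is consistent with probes of that length being gene-specific.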
Genus-Specific Oligo-Nucleotides
• The 18-tuple gttccaataagactaaaa appears only in the genomes of 3 species of the Archaea genus Pyrococcus: P. horikoshii, P. abyssi, and P. furiosus (as repeats in Non-CDS)
• No exact match in GenBank + EMBL + DDBJ and in the Human draft sequences
• Is it a genus-marker?
Species-Specific Oligo-Peptides
• HAMSCAPDKE: only in E. coli and Shigella among more than 1.3 million known proteins in the PIR database
• HAMSCAPERD: only in Salmonella
• It is known that:
  D (Aspartate) and E (Glutamate) are interchangeable in many homologous proteins
  K (Lysine) and R (Arginine) are interchangeable in many homologous proteins
• A consensus HAMSCAP[D/E][K/R][D/E] picks up only Enterobacteria (a bacterial family in the Proteobacteria phylum)
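The consensus can be written as a regular expression, with each bracketed pair becoming a character class. This is a minimal sketch: the function name is illustrative, and scanning a real protein database would need a parser for its file format.

```python
import re

# HAMSCAP[D/E][K/R][D/E] as a regular expression.
CONSENSUS = re.compile(r"HAMSCAP[DE][KR][DE]")

def contains_consensus(protein):
    """Return True if the protein sequence contains the consensus."""
    return CONSENSUS.search(protein) is not None

print(contains_consensus("HAMSCAPDKE"))
print(contains_consensus("HAMSCAPERD"))
```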
Four Examples from Our Own Work
• Regularities in bacterial K-string “portraits” -> combinatorics
• True and redundant avoided strings in bacterial genomes -> formal language
• A surprise in 1D histograms of K-strings in randomized bacterial genomes -> statistics
• Decomposition and reconstruction of protein sequences -> graph theory
2D Histogram of K-Tuples
“Hao Histograms”: implemented at NIST and EBI:
http://math.nist.gov/~Fhunt/GenPatterns/
http://industry.ebi.ac.uk/openBSA/bsa_viewers/home.html
• Our software SeeDNA will be made public soon
• Relation to “Chaos Game Representation” of DNA:
P. Tino, “Multifractal properties of Hao’s geometric representation of DNA sequences”, Physica A304 (2002) 480-499.
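A 2D histogram of this kind gives each of the 4^K K-strings its own cell in a 2^K x 2^K square, so that strings sharing a prefix fall in the same sub-square. The sketch below shows one such recursive layout. The assignment of letters to corners is an assumption made for illustration and may differ from the one used in SeeDNA or in the NIST and EBI implementations.

```python
# Assumed corner layout (row, column) for each letter; hypothetical.
DEFAULT_CORNERS = {"g": (0, 0), "c": (0, 1), "a": (1, 0), "t": (1, 1)}

def kstring_cell(kstring, corners=DEFAULT_CORNERS):
    """Map a K-string to its (row, column) cell in a 2**K x 2**K grid."""
    row = col = 0
    for letter in kstring:
        r, c = corners[letter]
        row = 2 * row + r   # each letter halves the current square
        col = 2 * col + c
    return row, col

def histogram_2d(sequence, k):
    """Count K-strings of a linear sequence into a 2**K x 2**K grid."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for i in range(len(sequence) - k + 1):
        row, col = kstring_cell(sequence[i:i + k])
        grid[row][col] += 1
    return grid

grid = histogram_2d("acgtacgt", 2)
```

Letters outside a, c, g, t raise a `KeyError` here; a real tool would have to decide how to treat ambiguous bases.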
1D Histogram of K-Strings
• Collect those K-strings whose counts fall in a bin between n_min and n_max.
• Plot the number of such K-strings versus the counts.
• This is a 1D histogram, or an expectation curve if one wants to calculate it from a statistical model.
1-D Histograms (continued)
• X-axis: string counts from 0 to some maximal number (with K fixed)
• Y-axis: number of different string types within a count bin: 0 (absent or avoided string), 1-3, 4-6 (rare strings), … to some big numbers, e.g. 774 (repeats)
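The 1D histogram is a "count of counts". A minimal sketch with bins of width 1 follows (the slide groups counts into wider bins such as 1-3 and 4-6). The function name is illustrative, and the sequence is assumed to be linear and to contain only the four letters.

```python
from collections import Counter

def histogram_1d(sequence, k, alphabet_size=4):
    """Map each count n to the number of distinct K-strings occurring n times."""
    kstring_counts = Counter(
        sequence[i:i + k] for i in range(len(sequence) - k + 1)
    )
    histogram = Counter(kstring_counts.values())
    # K-strings that never occur (absent or avoided strings) go in bin 0.
    histogram[0] = alphabet_size ** k - len(kstring_counts)
    return histogram

h = histogram_1d("acgtacgt", 2)
print(sorted(h.items()))
```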
Effect of Randomization
What happens to the 1D and 2D histograms if we randomize the original genomic sequence?
One must do this comparison in order to show that the somewhat regular patterns seen in the bacterial portraits are not incidental.
A Surprise in 1D Histograms of K-Tuples of Randomized Prokaryote Genomic Sequences
A single biased peak with a long tail in the original genome
Rich fine structure in randomized sequences of some genomes
G+C Content of Some Bacteria
Species            G+C Content
M. tuberculosis    65.61%
M. leprae          57.80%
E. coli            50.79%
H. influenzae      38.15%
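G+C content is the fraction of g and c among the counted letters. A minimal sketch follows; the function name is illustrative, and letters other than a, c, g, t are left out of the denominator, which is an assumption.

```python
def gc_content(sequence):
    """Return the fraction of g and c among the a, c, g, t letters."""
    sequence = sequence.lower()
    gc = sequence.count("g") + sequence.count("c")
    acgt = gc + sequence.count("a") + sequence.count("t")
    if acgt == 0:
        raise ValueError("sequence contains no a, c, g or t")
    return gc / acgt

print(f"{gc_content('acgtacgg'):.2%}")
```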
Statistical Combinatorics or Combinatorial Statistics
Need tools
• More dirty and fuzzy than pure combinatorics
• More deterministic than pure statistics
Decomposition and Reconstruction of a Given Amino Acid Sequence
How unique is it?
Connection with the problem of the number of Eulerian loops in a graph
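On short sequences the uniqueness question can be explored by brute force. The sketch below assumes that a reconstruction must use every K-string of the decomposition exactly as often as it occurs, and that it must start with the same K-1 letters as the original. It counts distinct sequences by depth-first search rather than through a formula for Eulerian loops, so it is practical only for small inputs. Function names are illustrative.

```python
from collections import Counter

def decompose(sequence, k):
    """Multiset of overlapping K-strings of a linear sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def count_reconstructions(sequence, k):
    """Count distinct sequences with the same K-string multiset (k >= 2)."""
    remaining = decompose(sequence, k)
    total = sum(remaining.values())

    def extend(suffix, used):
        if used == total:
            return 1
        found = 0
        # Try every unused K-string that overlaps the current K-1 suffix.
        candidates = [w for w in remaining
                      if remaining[w] > 0 and w[:-1] == suffix]
        for kstring in candidates:
            remaining[kstring] -= 1
            found += extend(kstring[1:], used + 1)
            remaining[kstring] += 1
        return found

    return extend(sequence[:k - 1], 0)

print(count_reconstructions("ABACA", 2))  # ABACA and ACABA
print(count_reconstructions("ABACA", 3))
```

In the toy example the two loops through A can be traversed in either order at K=2, while at K=3 the overlaps pin down a single sequence.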
ANPA_PSEAM 82AA
MALSLFTVGQLIFLFWTMRITEASPDPAAKAAPAAAAAPAAAAPDTASDAAAAAALTAANAKAAAELTAANAAAAAAATARG
K=5; overlapping 4-letter strings along the sequence: MALS, ALSL, LSLF, SLFT, LFTV, FTVG, TVGQ, VGQL, …, AKAA
ANPA_PSEAM 82AA
Antifreeze protein A/B precursor in winter flounder
Alanine-rich
Amphiphilic
[Figure: graph for ANPA_PSEAM with numbered nodes AKAA 1, SSPA 2, APAA 3, PAAA 4, AAAA 5, AAAP 6, LTAA 7, TANN 8, AANA 9; one auxiliary arc; 6 rings]
From pdb.seq, a special selection of SWISS-PROT: 2821-1=2820 proteins (May 2000)
R: number of reconstructed AA sequences from a given protein decomposition

K    R=1            2-10   11-100   101-1000   1001-10000   >10^4
5    2164 (76.7%)   404    90       45         21           93
6    2651 (94%)     77     29       10         4            49
7    2732 (96.9%)   32     16       3          2            44
8    2740 (97.1%)   23     10       3          0            44
9    2763 (97.9%)   13     7        1          0            36
10   2793 (99%)     11     7        2          1            6
11   2798 (99.2%)   12     2        1          1            6
Compositional Representation of Proteins
The collection $\{W_i^K\}_{i=1}^{L-K+1}$ or $\{W_j^K, n_j\}_{j=1}^{M}$ may be used as an equivalent representation of the original protein sequence, if the reconstruction is unique.
A seemingly trivial result upon further reflection: random AA sequences have unique reconstruction as well.
Compositional Representation works equally well for random AA sequences and for most protein sequences.
A given realization of a short random AA sequence is as specific as a real protein sequence.
The other extreme:
Quite a few proteins have an enormous number of reconstructions.
Transmembrane
Antifreeze
Fibrous: collagens
Protein       AA     R(11)            K for R(K)=1
MAGA_XENLA    303    150              84
SPA2_STAAU    508    8640             28
SRTX_ATREN    543    9.9732*10^5      101
EBN1_EBV      641    5.407*10^16      28
ICEN_PSESY    1200   1.55675*10^27    46
XPIN_XENLA    1350   3.7584*10^5      15
CAIH_MOUSE    1315   21840            15
CIPA_CLOTM    1853   5.671*10^13      179
A Related Problem and a Piece of On-Going Work: Constrained Randomization
• Given a DNA sequence: randomize it with number of a, c, g, t unchanged. Easy and software available in public domain (e.g., shuffleseq in EMBOSS)
• Given a DNA sequence: randomize it with the number of dinucleotides (aa, ac, …, tt) unchanged. Much harder.
• Algorithm understood and program is being written.
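The easy case can be sketched in a few lines: a plain shuffle keeps the number of each letter but in general changes the dinucleotide counts, which is why the second problem needs a dedicated algorithm. This sketch uses Python's standard library rather than shuffleseq, and it does not implement the dinucleotide-preserving randomization. All names are illustrative.

```python
import random
from collections import Counter

def shuffle_sequence(sequence, seed=None):
    """Randomly permute the letters; single-letter counts are unchanged."""
    rng = random.Random(seed)
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)

def dinucleotide_counts(sequence):
    """Counts of overlapping dinucleotides in a linear sequence."""
    return Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))

original = "aaaacccggt" * 20
shuffled = shuffle_sequence(original, seed=0)

print(Counter(original) == Counter(shuffled))   # letter counts preserved
print(dinucleotide_counts(original))
print(dinucleotide_counts(shuffled))            # usually different
```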