07/09/04 SFI Summer School 2 Complexity of Biological Symbolic Sequences Bailin Hao T-Life Research Center, Fudan University Shanghai 200433 Institute of Theoretical Physics, Academia Sinica Beijing 100080 http://www.itp.ac.cn/~hao/

07/09/04 SFI Summer School 2

Complexity of Biological Symbolic Sequences

Bailin Hao

T-Life Research Center, Fudan University

Shanghai 200433

Institute of Theoretical Physics, Academia Sinica

Beijing 100080

http://www.itp.ac.cn/~hao/


Symbols and symbolic sequences are inevitable in a coarse-grained description of Nature


An Observation

u d c s b t

(Quarks with charge, mass, flavor, charm, …)

p n e

(Particles with charge, mass, spin, magnetic moment, …)

H C N O P …

(Atoms with atomic number, ion radius, valence, affinity, …)

H2O NO CO2 …

(Molecules with molecular weight, polarity, color, …)

a c g t

(Nucleotides with strong or weak coupling)

A D E F G H … W Y V

(Amino acids with different physico-chemical properties)

BRCA1 PDGF

(Genes, proteins)


The Central Dogma of Molecular Biology

[Diagram: DNA -> DNA (replication); DNA -> mRNA (transcription); mRNA -> cDNA (reverse transcription); mRNA -> Protein/Enzyme (translation); Protein/Enzyme -> Structure (folding); Structure, Function, interaction]


Biological Symbolic Sequences

Nucleic Acid (DNA, RNA) Sequences

• One-dimensional

• Directed, from 5’ to 3’

• Unbranching polymers

• Heteropolymers made of 4 kinds of monomers/bases/nucleotides (a, c, g, t or u)

• Length: 10^4 to 10^8


cccaatatcttgcttcagcaagatattgggtatttctagctttcctttcttcaaaaattgctatatgttagcagaaaagccttatccattaagagatggaacttcaagagcagctaggtctagagggaagttgtgagcattacgttcgtgcattacttccataccaagattagcacggttgatgatatcagcccaagtattaataacgcgaccttggctatcaactacagattggttgaaattgaatccgtttagattgaaagccatagtactaatacctaaagcagtgaaccaaatccctactacaggccaagcagccaagaagaagtgtaaagaacgagagttgttaaaactagcatattggaagattaatcggccaaaataaccatgagcggccacaatattataagtttcttcctcttgaccaaatctgtaaccctcattagcagattcgttttcagtggtttccctgatcaaactagaggttaccaaggaaccatgcatagcactgaatagggaaccgccgaatacaccagctacacctaacatgtgaaatggatgcataaggatgttatgctctgcctggaatacaatcataaagttgaaagtaccagatattcctaaaggcataccatcagagaaacttccttgaccaatagggtaaatcaagaaaacagcagtagcagctgcaacaggagctgaatatgcaacagcaatccaaggacgcatacccagacggaaactcagttcccactcacgacccatataacaagctacaccaagtaagaagtgtagaacaattagctcataaggaccaccattgtataaccactcatcaacagatgcagcttcccaaattgggtaaaagtgcaatccgatcgccgcagaagtaggaataatggcaccagagataatattgtttccgtaaagtaaagaaccagaaacaggctcacgaataccatcaatatctactggaggggcagcgatgaaggcgataataaatacagaagttgcggtcaataaggtagggatcatcaaaacaccgaaccatccgatgtaaagacggttttcggtgctagttatccagttgcagaagcgaccccacaggcttgtactttcgcgtctctctaaaattgcagtcatggtaagatcttggtttattcaaattgcaaggactcccaagcacacgtattaactagaaagataatagaaggcttgttatttaacagtataatatagactatataccaatgtcaaccaagccagccccgacagttgtatatccatacaacaaaatttaccaaaccaaaaaattttgtaaatgaagtgagtgaaaaatcaaaactcagattgctcctttctagtttccatatgggttgcccgggactcgaacccggaactagtcggatggagtagataattattccttgttacaatagagaaaaaacctctccccaaatcgtgcttgcatttttcattgcacacgactttccctatgtagaaataggctatttctattccgaagaggaagtctactaatttttttagtagtaagttgattcacttactatttattatagtacagagaacatttcagaatggaaactgtgaaagttttaccttgatcatttatcaatcatttctagtttattagttttgtttaatgattaattaagaggattcaccagatcattgatacggagaatatccaaataccaaatacgctcactgtgcgatccacggaaagaaaagtaagttgttttggcgaacatcaaagaaaaaacttgctcttcttccgtaaaaaattcttctaaaaataccgaacccaaccattgcataaaagctcgtaccgtgcttttatgtttacgagctaaagttctagcgcatgaaagtcgaagtatatactttagtcgatacaaagtcttcttttttgaagatccactgtgataatgaaaaagatttctacatatccgaccaaaccgatcaagaatatcccaatc
cgataaatcggtccaaattggtttactaataggatgccccgatccagtacaaaattgggcttttgctaaagatccaatgagaggagtaacagggactttggtatcgaattttttcatttgagtatctattagaaatgaattctccagcatttgattccttactaacaaagaatttattggtacacttgaaaagtaccccagaaaatcgaagcaagagttttctaattggtttagatggatcctttgcggttgagtccaaaaagagaaagaatattgccacaaacggacaaggtaacatttccatttcttcttcaaaagaagagttccttttgatgcaagaattgcctttccttgatatcgaacataatgcataaggggatccataacgaaccatatggttttccgaaaaaaagcagggtacattaacccaaaatgttccatcttcctagaaaagatgattcgttccagaaaggttccggaagaagttaatcgcaagcaagaagattgtttacgaagaaacaacaagaaaaattcatattctgatacataagagttatataggaaccgaaatagtcttttattttcttttttcaaaataaaaatggatttcattgaagtaataaaactattccaattcgagtagtagttgagaaagaatcgcaataaatgcaaggatggaacatcttggatccggtattgaaggagttgaagcaagatatccaaatggataggatagggtatttctatatgtgctagataatgtaagtgcaaaaatttgtcttctaaaaaaggaaatattgaatgaatagatcgtaaattctgaaactttggtatttctttttcttccggacaagactgttctcgtagcgagaatgggatttctacaacgatcgcaaacccctcagatagaatctgagaataaaactcagaataaaaaaaattgttgtaatccaataatcgatcttggttaggatgattaaccaaattaatccaaaaattctgctgatacattcgaatcattaaccgtttcacaagtagtgaactaaatttcttgttattagaaccaataatttcgacaagttcggaaccatttaatccataatcatgggcaaacacataaatgtactcctgaaagagtagtgggtagacgaaatattgtctaggaaatttaagtttttctgaataaccctcgaatttttccatttgtatttctacttgaatcagagagagagaaatatttctcggtttatcaaatggtgatacatagtacaatatggtcagaacagggtgttgcattttttaatacaaacccctggggaagaaaaggagtctaatccacggatctttttccgctccttttctatccaatttgtttatgtttgttctaattacaaaagagaacaaatcctttatttttgcaggccaattgctcttttgactttgggatacagtctctttatcaatatactgcttcttttacacattcaatccataacatccttttcaatccaaaatcaagaataattaggatttctaaaaaaaaaagaaaaaatcaaaggtctactcataggaaaaccagcttttccctacatcaggcactaatctatttttaacgtctaattagatcagggagttcttccaattaagaagttaagctcgttgctttttgttttaccagaattggagccaggctctatccatttattcattagacccagaaaatcagaatttttttattccattccaaaaatccaaaataagaaattgattttattacgacatgctattttttccattcattacccttgaggatcagtcgcggtcttatagactctaccaagagtctggacgaattttttgcttcatccaaatgtgtaaaagatcatagtcgcacttaaaagccgagtactctaccattgagttagcaacccagataaactaggatcttagatacgatcgaaatccaaaaatcaatggaattacaccgcacacccctgtca
aaatcttaaaatagcaagacattaaaagaaagattttatcaccattgaaaacactcagataccaaaaggaacgggtctggttaaatttcactaaggttaaaagtggcaccaatcacgatcgtaaaattgtcatttttttagcatttttatttaaataaataaataaatcttgtatgagagtacaaacaagagggacaaccctaccatttgagcaaagtgtaggcaaaaaacctaatagggagtgaggataaagagacttatccatctacaaattctagatgttcaatggacctttgtcaatggaaatacaatggtaagaaaaaaattagatagaaaaactcaaaaaaataaaggcttatgttggattggcacgacataaatccagtcaaaaataggattaagaaagaggcaaattatttctaaatagttagacaacaagggatactagtgagcctctcctagttttttattcatttagttcttcaattaactcaaagttctttctttttctttaaagaattccgccttccttaaaatatcagaaacggttcttgtaggttgagcacctttttcaaggaaatagagaatagctggaacatttaaacaagtttgattctttatcggatcataaaaacctacttttcgaagatctcttccttctcttcgagatcgaacatcaattgcaacgattcgatagacagcttattgggatagatgtagataaataaagccccccctagaaacgtataggaggttttctcctcatacggctcgagaatatgacttgcattaatttccgtacagaaaaaacaaatttcatttatactcatgactcaagttgactaattttgattgacagacttgaaagaaaaaaatcctttgaaattttttgagtcgtctctaaactcttttctttgcctcatctcgaacaaattcacttttattccttattccggtccaattctattgttgagacagttgaaaatcgtgtttacttgttcgggaatcctttatctttgatttgtgaaatccttgggtttaaacattacttcgggaattcttattcttttttctttcaaaagagtagcaacatacccttttttcttatttccttcgataaagcatttccctcttctatagaaatcgaatatgagcgattgattctgatagactttaatcaaaagagttttcccatatcttccaaaattggactttcttcttattttaaccttttgatttctatattatttcgatttctatattaagggtagaatgacaaagttggcctaatttattagttttcactaaccctagattctttcccttgataaaaaataaattctgtcctctcgagctccatcgtgtactatttacttagcttacttacaaacaacccagcgaaaattcggttcgggacgaatagaacagactatgtcgagccaagagcattttcattactatggaaaatggtggatagcaaaatccacaatcgatcgtgtccttcaagtcgcacgttgctttctaccacatcgttttaaacgaagttttaacataacattcctctaatttcattgcaaagtgttatagggaattgatccaatatggatggaatcatgaatagtcattagtttcgttttttgtatactaattcaaacttgctttgctatctatggagaaatatgaataaaagaaattaagtatttatcgggaaagactccgcaaagagccaatttatttaaacccatattctatcatatgaatgaaatatagttcgaaaaaagggaataaacaagtttgcttaagacttatttattatggaatttccatcctcaacagaggactcgagatgatcaatccaatcctgaaatgataagagaagaattgactcttctccaacaaataaactatcaacctcccgtttaattaatttaattaatatattagattagcaatctatttttccataccatttttccgtaacaaaactaattaactattaa
ctagttaaactattgcaatgaaaagaaagttttttggtagttatagaattctcgtatttcttcgactcgaataccaaaagaaagaaaaaaatgaagtaaaaaaaacgcatttcctgtaaagtaaaattaaggtctttgcttttacttattttttcttttacctaaaagaagcaactccaaatcaaaattgaatccattctatctaacgagcagttcttatcttatctttaccgggatggatcattctggatatttaaaaaatcgcggatcgagatcgtttttgcttaaccaaagaaagaaaaagaagaaggaaccttttttactaataaaatactataaaaaaaatttatctctatcataaatctatctctaccataaaggaataggtctcgttttttatacaatgttctacgtcaagtttaaaattttttcatgaaaaaaagattttcaatttgactggacttgacactggattatgttttctgagacagaaaatgaacgcattaggactgcatcgaatctaagagtttataagagaaaaaaattctctttaataaactttatgtctcgtgcagaatacaatacgatttcatctttcgtttcatcagaaaaaatctgggacggaaggattcgaacctccgagtaacgggaccaaaacccgctgccttaccacttggccacgccccatttcgggttttatgcgacactaataaacagtattatgtttatttcttattcgtcaatcctacttcaattacataaaaatggggggtattctcttggtaggattctagacatgcgaataatatagaatccaaaaaatgcattgatcattacatggaattctattaagatattatatgaaagtcgaatttcttccactctcatttgagagtgcgaatacaaggaggtattttgtgtttgggaaagtccgaagaaaaaaggattttgaatcctccttttcctttttcccttagaaaaataactcaatcaaaatccaattatctactctacaagaacgaaacgcttgttatgcctaatatacttagtttaacctgtatttgttttaattctgttatttatccgactagttttttcttcgccaaattgcccgaagcttatgccattttcaatccaatcgtggattttatgcctgtcatacctgtactcttttttctattagcctttgtttggcaagctgctgtaagttttcgatgaaatctttactactctgtctgccaaattgaatcatgtattcattctaaaaaaattcgaaaaatggataagagccgagaagtcttatattatgaaccttcgattctaaaattcaaattcttctacattgaatgtatagctgcagcaataaatttggatcagcctttctactccctgcatctacgttgagcaggtatctttaggtaaccgcacaatacctaacctaatttattgataagagtgcttattataaatcaattcttgcaatttttttcaaaaattgatttttgcatttttaggtgtcaaaataaacaaaacccatcctagtggatttgtgtggtaaggaaaaacgggtaatctattccttaaaaaaaaatcttggagattatgtaatgcttactctcaaactttttgtttatacagtagtgatattctttgtttccctctttatctttggattcttatctaatgatccaggacgtaatcctgggcgtgacgagtaaaaatccaaaattttttcttacaaattggatttgtttcatacatttatctacgagaaaatccgggggtcagaattccttccaattcgaaagtcccaaacgatccgagggggcggaaagagagggattcgaaccctcggtacaaaaaaattgtacaacggattagcaatccgccgctttagtccactcagccatctctccccgttccaaatcgaaaggtttccgtgatatgacagaggcaagaaataacgattgcaaaaaatccttcctttttctttcaaaagt
tcaaaaaaattatattgccaattccattttagttatattcttttttcttaatgttaataaaaaaaagaagaaaattcttcttttttctttctaattctaaaattggatattggctaaaagacaatcagatagattttctcttcagcaggcatttccatataggacttgttataataaaacaagcaggttatagaaaaaaactcttttttttattatttatcaacaaagcaaaaaggggtcttatcaaaccaacccaccccataaaattggaaagaaagataaagtaagtggacctgactccttgaatgaggcctctatccgctattctgatatataaattcgatgtagatgaaattgtataagtggatttttttgtatttccttagacttagaccacgcaaggcaagaatttctcgctatttactatttcatattcttgttactagatgttctataggaataagaagaaatcgcaacccctttccgctacacataaaaatggatttcgaaagtcaatttttcttttcaatatctttactttttttcagaatcctatttttgttcttatacccatgcaatagagagcgagtgggaaaagggaggttactttttttcattttttccttaaaaaataggctttcttggaaataggaatcatggaataatctgaattccaatgtttatttctatagtataagaaaaactaattgaatcaaattcatggatttaccacgacctcggctgtgaccccatagataaaaatgcaaaatttctatcttcgagaccattgaaaaaaggcattgaacgagaaaaaatcgtccacagataatctatcgtatgccttggaagtgatataaggtgctcggaaatggttgaagtaattgaataggaggatcactatgactatagcccttggtagagttactaaagaagaaaatgatttatttgatattatggacgactggttacgaagggaccgttttgtttttgtaggatggtctggcctattgctttttccttgtgcttatttcgctttaggaggttggtttacagggacaacttttgtaacttcttggtatacccatggattggcgagttcctatttggaaggttgcaatttcttaaccgcagcagtttccacccctgccaatagtttagcacactctttgttgctactatggggcccggaagcacaaggggattttactcgttggtgtcaattaggtggtctgtggacttttgttgctctccatggggcttttgcactaataggtttcatgttacgtcaatttgaacttgctcggtctgttcaattgcggccttataatgcaatttcattctctggcccaatcgctgtttttgtttccgtattcctgatttat

ccactggggcaatccggttggttctttgcgccgagttttggcgtagcagcgatatttcgattcatcctcttcttccaaggatttcataattggacgttgaacccatttcatatgatgggagttgccggagtattaggcgcggctctgctatgcgctattcatggggcaaccgtgga


Biological Symbolic Sequences

Protein Sequences

• One-dimensional

• Directed, from N-terminus to C-terminus

• Unbranching polymers

• Heteropolymers made of 20 kinds of monomers/amino acids (A, C, … W, Y)

• Length: 10^2 to 10^3 AA, say, 50-6000


ID   A1BG_HUMAN     STANDARD;      PRT;   495 AA.
... ... ...
KW   Immunoglobulin domain; Glycoprotein; Plasma; Repeat; Signal.
... ... ...
SQ   SEQUENCE   495 AA;  54209 MW;  87A50C21CE89459C CRC64;
     MSMLVVFLLL WGVTWGPVTE AAIFYETQPS LWAESESLLK PLANVTLTCQ ARLETPDFQL
     FKNGVAQEPV HLDSPAIKHQ FLLTGDTQGR YRCRSGLSTG WTQLGKLLEL TGPKSLPAPW
     LSMAPVPWIT PGLKTTAVCR GVLRGETFLL RREGDHEFLE VPEAQEDVEA TFPVHQPGNY
     SCSYRTDGEG ALSEPSATVT IEELAAPPPP VLMHHGESSQ VLHPGNKVTL TCVAPLSGVD
     FQLRRGEKEL LVPRSSTSPD RIFFHLNAVA LGDGGHYTCR YRLHDNQNGW SGDSAPVELI
     LSDETLPAPE FSPEPESGRA LRLRCLAPLE GARFALVRED RGGRRVHRFQ SPAGTEALFE
     LHNISVADSA NYSCVYVDLK PPFGGSAPSE RLELHVDGPP PRPQLRATWS GAALAGRDAV
     LRCEGPIPDV TFELLREGET KAVKTIPTPG AAANLELIFV GPQHAGNYRC RYRSWVPHTF
     ESELSDPVEL LVAES
//

Example of a protein sequence: human immunoglobulin domain


Biological Symbolic Sequences

Other possible types of sequences

• Structural data

• Gene expression data from microarrays

• Bio-medical literature

We will only touch on DNA and protein sequences


A Proviso: Limitations of Symbols

• DNA sequences: no methylation info (epigenome program), no role of metallic ions, no spatial configurations, etc.

• Protein sequences: no post-translational modifications (signal-peptide, glycosylation, phosphorylation, splicing), no explicit structural info, etc.


• These sequences are the result of billions of years of evolution

• They contain regular elements (conserved, yet variable and fuzzy) as well as random elements

• There are sequencing errors (1/10000 in genomic DNA, greater in ESTs)

• The relation between random and non-random aspects may be scrutinized in different ways, e.g., in a collection of many sequences or in a single long sequence


Where to Get the Sequences?

Public Databases, e.g.:

DNA sequences (NCBI/EMBL/DDBJ):

http://www.ncbi.nlm.nih.gov

ftp://ftp.ncbi.nlm.gov/genbank/genomes

ftp://ftp.ncbi.nih.gov/genomes

Protein sequences (SWISS-PROT, PIR, UNIPROT):

http://www.expasy.ch/


GenBank Rel.142 (15 June 2004)

• 35 532 003 (3.5x10^7) sequences

• 40 325 321 348 (4x10^10) bases/letters

• Average length: 1134 (many ESTs, few but increasing number of complete genomes)


A simple-minded approach: Counting K-strings

• K-words, K-grams, K-tuples, K-strings. A convention: “string” versus “word”

• One can look at K-tuples in a collection of sequences or in a single sequence
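The counting itself can be sketched in a few lines of Python. This is only a minimal illustration for a single linear sequence: the function name `count_k_strings` and the toy input are assumptions made here, not part of the original slides, and a circular genome would also need the K-strings that wrap around the end.

```python
from collections import Counter


def count_k_strings(seq, k):
    """Count every overlapping K-string in a linear sequence."""
    seq = seq.lower()
    # A sequence of length L contains L - K + 1 overlapping K-strings.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    toy = "cccaatatcttgcttcagcaagatattggg"  # toy fragment, for illustration only
    counts = count_k_strings(toy, 3)
    for word, n in counts.most_common(5):
        print(word, n)
```

For a collection of sequences, the same function can be applied to each member and the resulting counters added together.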


Transition from Randomness to “Determinism” with K Increasing

• E. coli (strain K12) genome: a loop of 4639221 bp, distribution of single letters almost random

• K=1, 2, …: close to random sequences

• Among the 4^4639221 possible sequences only 1 or a small subset makes E. coli

• Probes on gene-chips: K in the tens, say, 20 ~ 80. Already gene-specific.

• Some extreme examples follow
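The arithmetic behind this transition can be checked with a short Python sketch. It assumes nothing beyond the genome length quoted on the slide; the helper name `possible_k_strings` is introduced here for illustration. Once 4^K greatly exceeds the genome length, most K-strings cannot occur at all, so the ones that do occur become specific.

```python
import math

GENOME_LENGTH = 4639221  # E. coli K12 genome length in bp, as quoted on the slide

# 4**N is far too large to print, so work with its base-10 logarithm instead.
log10_sequences = GENOME_LENGTH * math.log10(4)


def possible_k_strings(k):
    """Number of distinct K-strings over the 4-letter alphabet."""
    return 4 ** k


if __name__ == "__main__":
    print(f"4^{GENOME_LENGTH} is roughly 10^{log10_sequences:.0f}")
    for k in (1, 2, 8, 11, 12, 20):
        n = possible_k_strings(k)
        # True once there are more possible K-strings than positions in the genome
        print(k, n, n > GENOME_LENGTH)
```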


Genus-Specific Oligo-Nucleotides

• The 18-tuple gttccaataagactaaaa appears only in the genomes of 3 species of the Archaea genus Pyrococcus: P. horikoshii, P. abyssi, and P. furiosus (as repeats in Non-CDS)

• No exact match in GenBank + EMBL + DDBJ and in the Human draft sequences

• Is it a genus-marker?


Species-Specific Oligo-Peptides

• HAMSCAPDKE: only in E. coli and Shigella among more than 1.3 million known proteins in the PIR database

• HAMSCAPERD: only in Salmonella

• It is known that D (Aspartate) and E (Glutamate) are interchangeable in many homologous proteins, as are K (Lysine) and R (Arginine)

• A consensus HAMSCAP[D/E][K/R][D/E] picks up only Enterobacteria (a bacterial family in the Proteobacteria phylum)
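The consensus HAMSCAP[D/E][K/R][D/E] translates directly into a regular expression. The sketch below uses Python's standard `re` module; the function name `find_consensus` and the toy protein strings are hypothetical, and a real search would run over a protein database rather than a hand-written string.

```python
import re

# [D/E] on the slide means "D or E", which is the character class [DE] in a regex.
CONSENSUS = re.compile(r"HAMSCAP[DE][KR][DE]")


def find_consensus(protein):
    """Return all non-overlapping matches of the consensus in a protein string."""
    return CONSENSUS.findall(protein.upper())


if __name__ == "__main__":
    # Toy strings built around the two peptides quoted on the slide.
    for toy in ("MKTHAMSCAPDKELLV", "GGHAMSCAPERDAA", "MKTHAMSCAPAKELLV"):
        print(toy, find_consensus(toy))
```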


Four Examples from Our Own Work

• Regularities in bacterial K-string “portraits” -> combinatorics

• True and redundant avoided strings in bacterial genomes -> formal language

• A surprise in 1D histograms of K-strings in randomized bacterial genomes -> statistics

• Decomposition and reconstruction of protein sequences -> graph theory


2D Histogram of K-Tuples

“Hao Histograms”: implemented at NIST and EBI:

http://math.nist.gov/~Fhunt/GenPatterns/

http://industry.ebi.ac.uk/openBSA/bsa_viewers/home.html

• Our software SeeDNA will be made public soon

• Relation to “Chaos Game Representation” of DNA:

P. Tino, “Multifractal properties of Hao’s geometric representation of DNA sequences”, Physica A304 (2002) 480-499.


[Figure: layout of the four letters g, c, a, t in the 2D histogram]
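One way to place K-strings on a 2D grid is by recursive subdivision of a square, as sketched below in Python. The 2x2 layout used here (g and c in the top row, a and t in the bottom row) is an assumption read off the figure labels, and the names `LAYOUT`, `cell_of` and `portrait` are introduced for illustration; this is not the SeeDNA implementation.

```python
# Assumed 2x2 layout of the four letters: (row, column) of each base.
LAYOUT = {"g": (0, 0), "c": (0, 1), "a": (1, 0), "t": (1, 1)}


def cell_of(kstring):
    """Map a K-string to (row, col) in a 2^K x 2^K grid by recursive subdivision."""
    row = col = 0
    for letter in kstring:
        r, c = LAYOUT[letter]
        # Each further letter halves the current square and selects one quadrant.
        row = 2 * row + r
        col = 2 * col + c
    return row, col


def portrait(seq, k):
    """Count K-strings of seq into a 2^K x 2^K grid of integers."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    seq = seq.lower()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if all(ch in LAYOUT for ch in word):  # skip ambiguous letters such as n
            row, col = cell_of(word)
            grid[row][col] += 1
    return grid


if __name__ == "__main__":
    for line in portrait("cccaatatcttgcttcagcaagatattggg", 2):
        print(line)
```

With this mapping, all K-strings sharing a prefix fall into the same sub-square, which is why avoided strings show up as regular patterns of empty cells.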


1D Histogram of K-Strings

• Collect those K-strings whose counts fall in a bin between n_min and n_max.

• Plot the number of such K-strings versus the counts.

• This is a 1D histogram, or an expectation curve if one wants to calculate it from a statistical model.


1-D Histograms (continued)

• X-axis: string counts from 0 to some maximal number (with K fixed)

• Y-axis: number of different string types within a count bin: 0 (absent or avoided string), 1-3, 4-6 (rare strings), … to some big numbers, e.g. 774 (repeats)
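A minimal Python sketch of such a 1D histogram follows. It uses bins of width 1 for simplicity (wider bins such as 1-3 or 4-6 could be formed by grouping the keys), and it enumerates all 4^K strings so that absent strings are counted in the bin 0; this is practical only for moderate K. The function name `one_d_histogram` is an assumption made here.

```python
from collections import Counter
from itertools import product


def one_d_histogram(seq, k, alphabet="acgt"):
    """Return {count: number of distinct K-strings occurring exactly `count` times}."""
    seq = seq.lower()
    occurrences = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    histogram = Counter()
    # Enumerate every possible K-string so that absent strings land in bin 0.
    for letters in product(alphabet, repeat=k):
        histogram[occurrences.get("".join(letters), 0)] += 1
    return dict(sorted(histogram.items()))


if __name__ == "__main__":
    print(one_d_histogram("cccaatatcttgcttcagcaagatattggg", 2))
```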


Effect of Randomization

What happens to the 1D and 2D histograms if we randomize the original genomic sequence?

One must do this comparison in order to show that the somewhat regular patterns seen in the bacterial portraits are not incidental.
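The simplest randomization keeps the number of each letter and destroys everything else. The Python sketch below assumes a plain letter shuffle; the names `shuffle_sequence` and `k_string_counts` and the toy input are introduced here for illustration.

```python
import random
from collections import Counter


def shuffle_sequence(seq, seed=None):
    """Return a random permutation of seq; the letter composition is unchanged."""
    rng = random.Random(seed)
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def k_string_counts(seq, k):
    """Count overlapping K-strings in a linear sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    original = "gttccaataagactaaaa" * 20  # toy repetitive input
    randomized = shuffle_sequence(original, seed=1)
    # Compare the most frequent 6-strings before and after randomization.
    print(k_string_counts(original, 6).most_common(3))
    print(k_string_counts(randomized, 6).most_common(3))
```

Feeding both sequences to the same histogram routine gives the comparison described above.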


A Surprise in 1D Histograms of K-Tuples of Randomized Prokaryote Genomic Sequences

A single biased peak with a long tail in the original genome

Rich fine structure in randomized sequences of some genomes


G+C Content of Some Bacteria

Species            G+C Content
M. tuberculosis    65.61%
M. leprae          57.80%
E. coli            50.79%
H. influenzae      38.15%
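G+C content is easy to compute, and it fixes the expected count of a K-string in a fully randomized sequence. The Python sketch below rests on two assumptions that are not stated on the slide: letters are independent after randomization, and p(g) = p(c) and p(a) = p(t). Under that model the expected count depends only on how many g or c letters a K-string contains, which is one plausible reading of why genomes with biased G+C content behave differently from E. coli after randomization.

```python
def gc_content(seq):
    """Fraction of g and c among the a, c, g, t letters of seq."""
    seq = seq.lower()
    gc = seq.count("g") + seq.count("c")
    total = gc + seq.count("a") + seq.count("t")
    return gc / total if total else 0.0


def expected_count(kstring, seq_length, gc):
    """Expected occurrences of kstring in a random sequence with independent letters.

    Assumes p(g) = p(c) = gc / 2 and p(a) = p(t) = (1 - gc) / 2.
    """
    prob_of = {"g": gc / 2, "c": gc / 2, "a": (1 - gc) / 2, "t": (1 - gc) / 2}
    prob = 1.0
    for letter in kstring.lower():
        prob *= prob_of[letter]
    return (seq_length - len(kstring) + 1) * prob


if __name__ == "__main__":
    # G+C values taken from the table above; the length is a round toy number.
    for species, gc in (("M. tuberculosis", 0.6561), ("E. coli", 0.5079)):
        print(species,
              expected_count("gcgcgcgc", 4_000_000, gc),
              expected_count("atatatat", 4_000_000, gc))
```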


Statistical Combinatorics or Combinatorial Statistics

Need tools

• More dirty and fuzzy than pure combinatorics

• More deterministic than pure statistics


Decomposition and Reconstruction of a Given Amino Acid Sequence

How unique is it?

Connection with the problem of the number of Eulerian loops in a graph


ANPA_PSEAM 82AA

MALSLFTVGQLIFLFWTMRITEASPDPAAKAAPAAAAAPAAAAPDTASDAAAAAALTAANAKAAAELTAANAAAAAAATARG

K=5

MALS ALSL LSLF SLFT LFTV FTVG TVGQ VGQL … AKAA
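The decomposition and the uniqueness question can be explored on toy inputs with the Python sketch below. Each K-string is treated as an arc from its first K-1 letters to its last K-1 letters, and reconstructions are counted by brute force as walks that use every arc exactly as often as it occurs. Fixing the starting (K-1)-string to that of the original sequence is an assumption made here, and the exhaustive search is only practical for short sequences; it is not the graph-theoretic counting used in the original work.

```python
from collections import Counter


def decompose(seq, k):
    """Multiset of overlapping K-strings of seq (each string with its count)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def count_reconstructions(seq, k):
    """Count distinct sequences with the same K-string multiset and the same start.

    Brute-force search over walks; suitable only for short toy sequences.
    """
    remaining = decompose(seq, k)
    total = sum(remaining.values())

    def walk(node, used):
        if used == total:
            return 1  # every K-string has been used: one valid reconstruction
        found = 0
        # Candidate arcs: unused K-strings whose first K-1 letters match the node.
        candidates = [w for w, n in remaining.items() if n > 0 and w[:-1] == node]
        for word in candidates:
            remaining[word] -= 1
            found += walk(word[1:], used + 1)
            remaining[word] += 1
        return found

    return walk(seq[:k - 1], 0)


if __name__ == "__main__":
    print(count_reconstructions("abacabad", 2))  # several orderings of the loops at a
    print(count_reconstructions("abacabad", 4))  # longer K-strings pin the order down
```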


ANPA_PSEAM 82AA

Antifreeze protein A/B precursor in winter flounder

Alanine-rich

Amphiphilic

[Figure: graph with nodes labeled LTAA 7, AAAA 5, PAAA 4, APAA 3, TANN 8, AANA 9, SSPA 2, AAAP 6, AKAA 1; an auxiliary arc; 6 rings]


From pdb.seq, a special selection of SWISS-PROT: 2821-1=2820 proteins (May 2000)

R: number of reconstructed AA sequences from a given protein decomposition

K     R=1             R=2-10   R=11-100   R=101-1000   R=1001-10000   R>10^4
5     2164 (76.7%)    404      90         45           21             93
6     2651 (94%)      77       29         10           4              49
7     2732 (96.9%)    32       16         3            2              44
8     2740 (97.1%)    23       10         3            0              44
9     2763 (97.9%)    13       7          1            0              36
10    2793 (99%)      11       7          2            1              6
11    2798 (99.2%)    12       2          1            1              6


Compositional Representation of Proteins

The collection $\{W_i^K\}_{i=1}^{L-K+1}$ or $\{W_j^K, n_j\}_{j=1}^{M}$ may be used as an equivalent representation of the original protein sequence, if the reconstruction is unique.

A seemingly trivial result upon further reflection: random AA sequences have unique reconstruction as well.

Compositional Representation works equally well for random AA sequences and for most protein sequences.

A given realization of a short random AA sequence is as specific as a real protein sequence.


The other extreme:

Quite a few proteins have an enormous number of reconstructions.

Transmembrane

Antifreeze

Fibrous: collagens


Protein       AA      R(11)            K for R(K)=1
MAGA_XENLA    303     150              84
SPA2_STAAU    508     8640             28
SRTX_ATREN    543     9.9732*10^5      101
EBN1_EBV      641     5.407*10^16      28
ICEN_PSESY    1200    1.55675*10^27    46
XPIN_XENLA    1350    3.7584*10^5      15
CAIH_MOUSE    1315    21840            15
CIPA_CLOTM    1853    5.671*10^13      179


A Related Problem and a Piece of On-Going Work: Constrained Randomization

• Given a DNA sequence: randomize it with the numbers of a, c, g, t unchanged. Easy, and software is available in the public domain (e.g., shuffleseq in EMBOSS)

• Given a DNA sequence: randomize it with the number of dinucleotides (aa, ac, …, tt) unchanged. Much harder.

• Algorithm understood and program is being written.
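The difference between the two constraints can be made concrete with a small checker in Python. The sketch does not implement the dinucleotide-preserving randomization itself; it only tests whether a candidate sequence satisfies the constraint. The function names and toy sequences are assumptions made here.

```python
import random
from collections import Counter


def dinucleotide_counts(seq):
    """Count overlapping dinucleotides (aa, ac, ..., tt) in a linear sequence."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def preserves_dinucleotides(original, candidate):
    """True if candidate has exactly the same dinucleotide counts as original."""
    return dinucleotide_counts(original) == dinucleotide_counts(candidate)


if __name__ == "__main__":
    seq = "acgtacgtaaccggttacag"  # toy sequence
    letters = list(seq)
    random.Random(0).shuffle(letters)
    shuffled = "".join(letters)
    # A plain shuffle always keeps the letter counts ...
    print(Counter(seq) == Counter(shuffled))
    # ... but in general it does not keep the dinucleotide counts.
    print(preserves_dinucleotides(seq, shuffled))
    # Swapping the two loops "ca" and "ga" around the letter a keeps them.
    print(preserves_dinucleotides("acagat", "agacat"))
```

The last example hints at why the constrained problem is harder: valid rearrangements correspond to reordering loops in a graph of letters, which connects it to the Eulerian-loop counting discussed earlier.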


The End

Thank You!