Upload
annis-perry
View
226
Download
3
Tags:
Embed Size (px)
Citation preview
Grammatical ComplexityOf Symbolic Sequences:
A Brief Introducton
Bailin HaoT-Life Research Center, Fudan University
Institute of Theoretical Physics, Academia SinicaThe Santa Fe Institute, New Mexico, USA
http://www.itp.ac.cn/~hao/
Three Paradigms in TheoreticalDescription of Nature
• Deterministic: based on periodicities and recurrences, from Kepler to Yang-Mills
• Stochastic: based on randomness, from Brownian motion to MSR field theory of hydrodynamics and molecular motors
• Fractal, self-similar, scale invariant: from phase transitions and critical phenomena to chaotic dynamics
• Finiteness is the unifying Physics: languages
语言学( language 而非 philology )方法
统计语言学 “ 字”的频度和关联 Zipf 定律
代数语言学:生成语法和语法复杂性 串行生成: Chomsky 体系 平行生成: Lindenmayer 体系(来自发育生物学) 可因式化语言 (Factorizable language)
自然语言与遗传语言
相似处:多义性 冗余度 容错和纠错 长程关联 均基于离散的排列组合系统有某些语法,但不能完全生成方言、个体差异性演化、突变、灭绝历史“垃圾”、古语、“化石”外来语、横向交换
相异处: 标点符号和间隔不同
两种语言的相互作用
二维、三维的相互作用
重复序列的数目和作用
An Observation u d c s b t
charge, mass, flavor, charm, …
p n e
charge, mass, spin, magnetic momentum, …
H C N O P …
atomic number, ion radius, valence, affinity, …
H2O NO CO2 …
molecular weight, polarity, …
a c g t
A D E F G H … W Y VBRCA1 PDGF
A PROGRAMME:
Coarse-Grained Description of Nature
Use of Symbols and Symbolic Strings
Language
Grammar and Complexity (Chomsky, Lindenmayer, etc.)
So far this programme has been best realized in the study of dynamics by using Symbolic Dynamics.
There have been preliminary attempts in analyzing biological sequences.
It may not be a coincidence that the two systems in the universe that most impress us with their open-ended complex design — life and mind — are based on discrete combinatorial systems. Many biologists believe that if inheritance were not discrete, evolution as we know it could not have taken place.
S. Pinker, The Language Instinct (1995)
Simple Examples
At the level of words:
DOG GOD
At sentence level:
Dog bites Man
Man bites Dog
N C EGF (Epidermal GF)
N C Chymotrypsin ( 胰凝乳蛋白酶 )
N C Urokinase (UK) ( 尿激酶 )
N C Factor IX
( 凝血因子 IX, X-mas 抗血友病因子 )
N C Plasminogen
( 纤维蛋白融酶原 )
几种丝氨酸蛋白酶的 domain组合 B.Alberts 等, Mol.Biology of the Cell 第三版 1994. P.123
Ca 结合蛋白
含 3 个 -s-s-
GC 语法复杂性 字母表 例 1. = {a, c, g, t}
例 2. = {A, C, D … W, Y}
例 3. = {a, … z, A, … Z, +, –, …}
字母表中各种字母组成的一切字母串 (包括空串) *
* 的任何子集是基于的一种语言
语法 = { 字母表,初始字母,产生规则 }
基于该语法的语言
Classification of Formal Languages
Chomsky Hierarchy
Sequential production rules
Lindenmayer Systems
Parallel production rules
Generative Grammar S Sentence
NP Noun Phrase
VP Verb Phrase
Adj Adjective
Art Article
S if S then S
S either S or S
Non-Terminal and Terminal Symbols
N boy | girl | scientist | …
V sees | believes | loves | eats | …
Adj young | good | beautiful | …
Art a | one | the
S NP VP
VP V NP
NP (Art) Adj* N
Chomsky 语法层次 N — 非终结字母集(工作用符号) T — 终结字母集 S N 起始字母 P = { 生成规则( x y )的集合 }
x, y 为字母串 关于 x, y 的不同规定导致不同语法 语法 G = (N, T, P, S)
0 类语法 x (NT)* N(NT)*
y (NT)*至少含有一个非终结字母
1 类语法 上下文有关语法 x = t1 a t2
t1, t2 T*
a N
2 类语法 上下文无关语法 x = a N
3 类语法 正规语法 x = a y = b 或 bc
a, c N b = 空 或 b T
形式语言的 Chomsky 层次
层 语言 计算机 存储要求0 递归可数
REL
图灵机(万能计算机)
无根
1 上下文有关CSL
线性有界自动机 比例於输入字长
2 上下文无关CFL
下推自动机 下推区(堆栈)
3 正规RGL
有限自动机 不要求
R L R R R L R R
a b
(i) (ii)
R L
a b c
b … …
c … …
d … …
A transfer function
(a, R) = b
A Finite State Automaton(FSA)
FSA: Finite State Automata
• Deterministic FSA
• Non-Deterministic SFA
• Equivalence of DFSA and NDFSA: subset construction
• Minimal DFSA
• Myhill-Nerode theorem (1958): number of nodes in minDFSA
A Pushdown Automaton
Pushdown list
Stack
First In Last Out (FILO)
A Turing MachineAlan M. Turing (1912-1954)
FSA + R/W tape
Church-Turing Thesis (1936):
Any effective (mechanical) computation can
be carried out by a Turing machine
Terminals = {a, b, c}
Non-terminal = {A, B}
Sequential rules: B aBAc | abc
bA bb
cA Ac
B abc
B aBAc aabcAc aabAcc
B abAc aaBAcAc
aaBAAc
aaabcAAc
aaabAcAc aaabbAcc
Example: {ai b ici | i>0} CSL
Rules to Generate Gene-Like Sequences( according to David Searls )
gene upstream transcript downstream
transcript 5’-untranslated-region start-codon coding-region
3’-untranslated-region
coding-region codon coding-region | stop-codon | splice |
coding region
codon lys | asn | thr | met | glu | his | pro | asp | ala | gly | tyr |
trp | phe | leu | ile | ser | arg | gln | val | cys
start-codon met
stop-codon taa | tag | tga
leu tt purine | ct base (6)ser ag pyrimidine | tc base (6)arg ag purine | cg base (6)val gt base pro cc base (4)ala gc base gly gg base (4)thr ac base (4) ile at pyrimidine | ata (3)lys aa purine asn aa pyrimidine (2)gln ca purine his ca pyrimidine (2)glu ga purine cys tg pyrimidine (2)phe tt pyrimidine tyr ta pyrimidine (2)asp ga pyrimidine (2)met atg trp tggbase m a | c | g | t purine a | gprimidine c | t
splice intron intron gt | intron-body | ag
splice a a intron splice c c intron
splice t t intron splice g g intron
a splice intron a c splice intron c
t splice intron t g splice intron g
upstream enhancer promotor enhancer
enhancer …
promotor …
silencer …
isolator …
These rules are capable to generate an unlimited
set of gene-like sequences, mostly biological nonsense.
They may be used to recognize gene-like segments
in long DNA sequences.
Syntax versus Semantics: texts vs. grammar.
Physics behind this coarse-grained description:
stereochemistry, interaction between proteins and
DNA chains, metallic ions etc.
Symbolic Dynamics Languages
1991 1999
谢惠民定理和猜测• 单峰映射揉序列对应的语言中的正规语言只有周期序列和最终周期序列两种
• 如何走向比正规语言高一级的上下文无关语言?
• 猜测:单峰映射揉序列对应的语言中没有上下文无关语言
Subintervals determined by the periodic kneading
Sequence (RLRRC)∞
Order of visits in the periodic kneading
Sequence (RLLRC)∞
Transformations of subintervals
• a → c + d (on reading L)
• b → d (on reading R)
• c → b + c (on reading R)
• d → a (on reading R)
Input L R R R
q a b c d
d 1 1 0 0
c 1 0 1 0
b 0 0 1 0
a 0 0 0 1
Transfer Functions
R L
a c, d
b d
c b, c
d a
R L
{a,b,c,d} {a,b,c,d} {c,d}
{c,d} {a,b,c}
{a,b,c} {b,c,d} {c,d}
{b,c,d} {a,b,c,d}
Stefan matrix for 256P in Feigenbaum cascade
Stefan matrix for F13=233; Case (a)
Stefan matrix for F13=233. Case (b)
Stefan matrix for F13=233. Case (c)
Stefan matrix for F13=233. Case (d)
Symbolic Dynamics Languages
1991 1999
br bl
ar al
albr blar
Alphabet: S = {ar, al, br, bl}
Production rules:
Initial symbol (axiom) = ar
Grammar: G = (S, P, )
Language: L (G) S*
Development of Anabaena catenula ( 串珠藻项圈藻属 )
br ar
ar albr
bl al
al blar
P =
Lindenmayer Systems
Parallel production rules. Finer classification
D0L – Deterministic, no interaction, i.e., context-free
0L – non-deterministic, no interaction
IL – non-deterministic, with Interaction, i.e., context
sensitive
T0L – with Table of production rules
TIL –
E0L – Extended to non-terminal symbols
ET0L –
EIL REL of Chomsky
RGL Regular CFL Context-Free
CSL Context-Sensitive REL Recursively Enumerable
CSL
CFL
RGL
FINDOL
REL
Chomsky
Lindenmayer
Indexed
0:REL
1:CSL
IND
ET0L
E0L
2:CFL
3:RGL
IL
T0L
0L
D0L
EIL
L = {aibici | i > 0} CSL
G = (S, T, )
= abc
S = {a, b, c}
T = {t1, t2}
T1 = {a aa, b bb, c cc}
T2 = {a , b , c }
T0L
Example a la Lindenmayer
Dyck language: A language of nested parentheses
• Many types of parentheses
• Finite depth of nesting
• Context-free language
Our case:
• Only 3 types of parentheses
• Shallow nesting
• Conjecture (Xie): may be regular language
模糊语言学 形式推广不难: Z .G .Yu ( 喻祖国 20
01)
如何定量地引用生物知识 Consensus 序列和权重矩阵
随机语法 隐马可夫链 = 随机正规语法 更高阶的随机语法?
Factorizable Languages
• Symbolic dynamics leads to factorizable languages
• A complete genome defines a factorizable langauge
• An amino acid sequence with unique reconstruction (at certain K) defines a factorizable language
Modeling in Biology• Cells
• Tissues
• Organs
• “Systems”: circulation, respiration, reproduction, neural, sensory, musclular, etc.
• Organisms, population, ecosystems
• Animals versus plants
• Plant development, morphology, physiology and pathology
Modeling of Plant MorphologyBy using L-System
• P. Prusinkiewicz, J. Hanan, Lindenmayer Systems, Fractals, and Plants, LN in Biomath., vol. 79, Springer, 1989
• P. Prusinkiewicz, A. Lindenmayer, The Algorithmic Beauty of Plants, Springer, 1990
• P. Prusinkiewicz, M. Hammel, J. Hanan, R. Mech, Visual models of plant development, Chap.9 in Handbook of Formal Languages, Vol.3, Springer, 1997
Consistency of Macro-and Micro-Description of
Nature• Molecular phylogeny versus phylogeny
based on morphological features
• Modeling plant development without getting into molecular and cellular description
• No need to model protein folding by invoking quarks!
Some Useful URLs
• www.grogra.org (Growth Grammar)
• http://www.computableplant.org
• http://algorithmicbotany.org
Huimin Xie 谢惠民 Grammatical Complexity and
1D dynamical Systems
Vol.6 in Directions in Chaos
WSPC, 1996.
谢惠民 《复杂性与动力系统》 上海科技教育出版社 , 1994
Bailin Hao, Weimou Zheng, Applied Symbolic Dynamics and Chaos (WSPC, 1998), Chap. 8
J.Hopcroft, J.Ullman, Introduction to Automata Theory, Languages and Computation, Addison-Wesley, 1979.
Thanks!