Grammatical Complexity of Symbolic Sequences: A Brief Introduction. Bailin Hao, T-Life Research Center, Fudan University; Institute of Theoretical Physics, Academia Sinica; The Santa Fe Institute, New Mexico, USA. http://www.itp.ac.cn/~hao/

Grammatical Complexity of Symbolic Sequences:

A Brief Introduction

Bailin Hao

T-Life Research Center, Fudan University

Institute of Theoretical Physics, Academia Sinica

The Santa Fe Institute, New Mexico, USA

http://www.itp.ac.cn/~hao/

Three Paradigms in the Theoretical Description of Nature

• Deterministic: based on periodicities and recurrences, from Kepler to Yang-Mills

• Stochastic: based on randomness, from Brownian motion to MSR field theory of hydrodynamics and molecular motors

• Fractal, self-similar, scale invariant: from phase transitions and critical phenomena to chaotic dynamics

• Finiteness is the unifying Physics: languages

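The Linguistic Approach (linguistics in the sense of language, not philology)

• Statistical linguistics: frequencies and correlations of "words"; Zipf's law

• Algebraic linguistics: generative grammars and grammatical complexity

• Sequential generation: the Chomsky hierarchy

• Parallel generation: Lindenmayer systems (originating in developmental biology)

• Factorizable languages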

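Natural Language and Genetic Language

Similarities:

• Polysemy (ambiguity)

• Redundancy

• Fault tolerance and error correction

• Long-range correlations

• Both are based on discrete combinatorial systems

• Both have some grammar, but the grammar cannot generate the language completely

• Dialects and individual variation

• Evolution, mutation, extinction

• Historical "junk", archaic words, "fossils"

• Loanwords, horizontal exchange

Differences:

• Punctuation and spacing differ

• Interaction between the two languages

• Two- and three-dimensional interactions

• Number and role of repeated sequences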

An Observation

u d c s b t

charge, mass, flavor, charm, …

p n e

charge, mass, spin, magnetic momentum, …

H C N O P …

atomic number, ion radius, valence, affinity, …

H2O NO CO2 …

molecular weight, polarity, …

a c g t

A D E F G H … W Y V

BRCA1 PDGF

A PROGRAMME:

Coarse-Grained Description of Nature

Use of Symbols and Symbolic Strings

Language

Grammar and Complexity (Chomsky, Lindenmayer, etc.)

So far this programme has been best realized in the study of dynamics by using Symbolic Dynamics.

There have been preliminary attempts in analyzing biological sequences.

It may not be a coincidence that the two systems in the universe that most impress us with their open-ended complex design — life and mind — are based on discrete combinatorial systems. Many biologists believe that if inheritance were not discrete, evolution as we know it could not have taken place.

S. Pinker, The Language Instinct (1995)

Simple Examples

At the level of words:

DOG GOD

At sentence level:

Dog bites Man

Man bites Dog

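Domain combinations of several serine proteases (B. Alberts et al., Molecular Biology of the Cell, 3rd ed., 1994, p. 123)

[Figure: domain structure of each protein, drawn from the N terminus to the C terminus]

• EGF (epidermal growth factor)

• Chymotrypsin

• Urokinase (UK)

• Factor IX (coagulation factor IX, Christmas antihemophilic factor)

• Plasminogen

Figure labels: Ca-binding protein; contains 3 -S-S- (disulfide) bonds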

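GC: Grammatical Complexity

Alphabet Σ

Example 1. Σ = {a, c, g, t}

Example 2. Σ = {A, C, D, …, W, Y}

Example 3. Σ = {a, …, z, A, …, Z, +, –, …}

Σ*: the set of all strings formed from the letters of Σ (including the empty string)

Any subset of Σ* is a language over Σ

Grammar = {alphabet, initial letter, production rules}

The language based on this grammar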
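The definitions on this slide can be made concrete with a small sketch. It enumerates a finite part of Σ* for the four-letter DNA alphabet and then picks out a language as a subset. The function name `sigma_star` and the choice of language (strings beginning with "atg") are illustrative assumptions, not taken from the slides.

```python
from itertools import product

def sigma_star(alphabet, max_len):
    """Yield all strings over `alphabet` of length 0..max_len (a finite part of Σ*)."""
    for n in range(max_len + 1):
        for letters in product(sorted(alphabet), repeat=n):
            yield "".join(letters)

SIGMA = {"a", "c", "g", "t"}          # Example 1 on the slide
words = list(sigma_star(SIGMA, 3))    # includes the empty string ""

# A language is any subset of Σ*. Illustrative choice: strings that start with "atg".
language = {w for w in words if w.startswith("atg")}
```

With a length bound of 3 there are 1 + 4 + 16 + 64 strings, and the illustrative language contains the single word "atg".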

Classification of Formal Languages

Chomsky Hierarchy

Sequential production rules

Lindenmayer Systems

Parallel production rules

Generative Grammar

S: Sentence
NP: Noun Phrase
VP: Verb Phrase
Adj: Adjective
Art: Article

S → if S then S
S → either S or S

Non-Terminal and Terminal Symbols

N → boy | girl | scientist | …
V → sees | believes | loves | eats | …
Adj → young | good | beautiful | …
Art → a | one | the

S → NP VP
VP → V NP
NP → (Art) Adj* N
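As a rough illustration of how such production rules generate sentences, the sketch below expands S at random. The function names, the 50% chance of an article, the cap on adjectives, and the recursion limit are all assumptions made for the example; the elided "…" word lists are left as they appear on the slide.

```python
import random

# Terminal word lists copied from the slide (the "…" entries are not filled in).
LEXICON = {
    "N": ["boy", "girl", "scientist"],
    "V": ["sees", "believes", "loves", "eats"],
    "Adj": ["young", "good", "beautiful"],
    "Art": ["a", "one", "the"],
}

def expand_np(rng, max_adj=2):
    """NP -> (Art) Adj* N: optional article, zero or more adjectives, one noun."""
    words = []
    if rng.random() < 0.5:
        words.append(rng.choice(LEXICON["Art"]))
    for _ in range(rng.randint(0, max_adj)):
        words.append(rng.choice(LEXICON["Adj"]))
    words.append(rng.choice(LEXICON["N"]))
    return words

def expand_vp(rng):
    """VP -> V NP"""
    return [rng.choice(LEXICON["V"])] + expand_np(rng)

def expand_s(rng, depth=0, max_depth=2):
    """S -> NP VP | if S then S | either S or S (recursion capped at max_depth)."""
    kind = rng.choice(["plain", "if", "either"]) if depth < max_depth else "plain"
    if kind == "if":
        return (["if"] + expand_s(rng, depth + 1, max_depth)
                + ["then"] + expand_s(rng, depth + 1, max_depth))
    if kind == "either":
        return (["either"] + expand_s(rng, depth + 1, max_depth)
                + ["or"] + expand_s(rng, depth + 1, max_depth))
    return expand_np(rng) + expand_vp(rng)

sentence = " ".join(expand_s(random.Random(0)))
```

Every generated sentence is syntactically valid under these rules, whatever its meaning, which is the syntax-versus-semantics point the later slides return to.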

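The Chomsky Hierarchy of Grammars

N: the set of non-terminal letters (working symbols)
T: the set of terminal letters
S ∈ N: the start letter
P = {production rules x → y}, where x and y are strings of letters

Different restrictions on x and y lead to different grammars.

Grammar G = (N, T, P, S)

Type 0 grammar: x ∈ (N ∪ T)* N (N ∪ T)*, y ∈ (N ∪ T)*; that is, x contains at least one non-terminal letter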

Type 1 grammar (context-sensitive): x = t1 a t2, with t1, t2 ∈ T* and a ∈ N

Type 2 grammar (context-free): x = a ∈ N

Type 3 grammar (regular): x = a, y = b or bc, with a, c ∈ N and b either empty or b ∈ T
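A minimal sketch of how a set of production rules might be sorted into these four types follows. Symbols are single characters and all function names are hypothetical. One point is an assumption rather than something stated on the slide: the condition on y for type 1 is not recoverable from the text, so the sketch substitutes the standard non-contracting test |x| ≤ |y|.

```python
def is_regular_rhs(y, nonterminals):
    """Type 3 right-hand side: y = b or bc, b empty or one terminal, c a non-terminal."""
    if y and y[-1] in nonterminals:
        y = y[:-1]                      # strip the optional trailing non-terminal c
    return len(y) <= 1 and all(s not in nonterminals for s in y)

def rule_type(x, y, nonterminals):
    """Most restrictive type (3, 2, 1 or 0) satisfied by the rule x -> y."""
    if not any(s in nonterminals for s in x):
        raise ValueError("x must contain at least one non-terminal")
    if len(x) == 1:
        return 3 if is_regular_rhs(y, nonterminals) else 2
    # ASSUMPTION: type 1 is tested via the non-contracting criterion |x| <= |y|.
    return 1 if len(x) <= len(y) else 0

def grammar_type(rules, nonterminals):
    """A grammar is only as restricted as its least restricted rule."""
    return min(rule_type(x, y, nonterminals) for x, y in rules)
```

For example, `grammar_type([("S", "aS"), ("S", "b")], {"S"})` gives 3, while adding a rule such as `("S", "aSb")` lowers the result to 2.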

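The Chomsky Hierarchy of Formal Languages

Level   Language                        Automaton                               Memory requirement
0       Recursively enumerable (REL)    Turing machine (universal computer)     Unlimited
1       Context-sensitive (CSL)         Linear bounded automaton                Proportional to input length
2       Context-free (CFL)              Pushdown automaton                      Pushdown store (stack)
3       Regular (RGL)                   Finite automaton                        None required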

A Finite State Automaton (FSA)

[Figure: two state-transition diagrams, (i) and (ii), with states a and b and edges labeled R and L]

A transfer function, e.g. δ(a, R) = b:

        R    L
a       b    c
b       …    …
c       …    …
d       …    …

FSA: Finite State Automata

• Deterministic FSA

• Non-Deterministic FSA

• Equivalence of DFSA and NDFSA: subset construction

• Minimal DFSA

• Myhill-Nerode theorem (1958): number of nodes in minDFSA

A Pushdown Automaton

Pushdown list

Stack

First In Last Out (FILO)

A Turing Machine

Alan M. Turing (1912-1954)

FSA + R/W tape

Church-Turing Thesis (1936):

Any effective (mechanical) computation can be carried out by a Turing machine

Terminals = {a, b, c}

Non-terminals = {A, B}

Sequential rules:

B → aBAc | abc
bA → bb
cA → Ac

Derivations:

B ⇒ abc

B ⇒ aBAc ⇒ aabcAc ⇒ aabAcc ⇒ aabbcc

B ⇒ aBAc ⇒ aaBAcAc ⇒ aaBAAcc ⇒ aaabcAAcc ⇒ aaabAcAcc ⇒ aaabbcAcc ⇒ aaabbAccc ⇒ aaabbbccc

Example: {a^i b^i c^i | i > 0} ∈ CSL
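The sequential rules on this slide can be checked mechanically. The sketch below does a breadth-first search over sentential forms, starting from B and applying each rule at every position where its left-hand side occurs. The function name `generate` and the length cutoff are assumptions for the example. Since none of the rules shortens a string, discarding forms longer than the cutoff loses no short words.

```python
from collections import deque

# Rules from the slide, written as (left-hand side, right-hand side).
RULES = [("B", "aBAc"), ("B", "abc"), ("bA", "bb"), ("cA", "Ac")]

def generate(start="B", max_len=9):
    """Breadth-first search over sentential forms; return the terminal strings found."""
    seen, queue, words = {start}, deque([start]), set()
    while queue:
        form = queue.popleft()
        if form.islower():               # only the terminals a, b, c remain
            words.add(form)
            continue
        for lhs, rhs in RULES:
            pos = form.find(lhs)
            while pos != -1:             # apply the rule at every occurrence
                new = form[:pos] + rhs + form[pos + len(lhs):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                pos = form.find(lhs, pos + 1)
    return sorted(words, key=len)
```

Calling `generate(max_len=9)` yields abc, aabbcc and aaabbbccc, the first three words of the language.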

Rules to Generate Gene-Like Sequences (according to David Searls)

gene → upstream transcript downstream
transcript → 5’-untranslated-region start-codon coding-region 3’-untranslated-region
coding-region → codon coding-region | stop-codon | splice coding-region
codon → lys | asn | thr | met | glu | his | pro | asp | ala | gly | tyr | trp | phe | leu | ile | ser | arg | gln | val | cys
start-codon → met
stop-codon → taa | tag | tga

leu → tt purine | ct base (6)
ser → ag pyrimidine | tc base (6)
arg → ag purine | cg base (6)
val → gt base (4)
pro → cc base (4)
ala → gc base (4)
gly → gg base (4)
thr → ac base (4)
ile → at pyrimidine | ata (3)
lys → aa purine (2)
asn → aa pyrimidine (2)
gln → ca purine (2)
his → ca pyrimidine (2)
glu → ga purine (2)
cys → tg pyrimidine (2)
phe → tt pyrimidine (2)
tyr → ta pyrimidine (2)
asp → ga pyrimidine (2)
met → atg
trp → tgg

base → a | c | g | t
purine → a | g
pyrimidine → c | t
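The codon rules can be expanded by machine to confirm the counts given in parentheses. The sketch below encodes each alternative as a literal prefix plus an optional class for the last position; the data layout and the names `CODON_RULES`, `CLASSES` and `expand` are assumptions made for the example.

```python
CLASSES = {"base": "acgt", "purine": "ag", "pyrimidine": "ct"}

# amino acid -> list of (literal prefix, class of the final base or None)
CODON_RULES = {
    "leu": [("tt", "purine"), ("ct", "base")],
    "ser": [("ag", "pyrimidine"), ("tc", "base")],
    "arg": [("ag", "purine"), ("cg", "base")],
    "val": [("gt", "base")], "pro": [("cc", "base")],
    "ala": [("gc", "base")], "gly": [("gg", "base")],
    "thr": [("ac", "base")],
    "ile": [("at", "pyrimidine"), ("ata", None)],
    "lys": [("aa", "purine")], "asn": [("aa", "pyrimidine")],
    "gln": [("ca", "purine")], "his": [("ca", "pyrimidine")],
    "glu": [("ga", "purine")], "cys": [("tg", "pyrimidine")],
    "phe": [("tt", "pyrimidine")], "tyr": [("ta", "pyrimidine")],
    "asp": [("ga", "pyrimidine")],
    "met": [("atg", None)], "trp": [("tgg", None)],
}

def expand(amino_acid):
    """Return the set of codons derivable from one amino-acid symbol."""
    codons = set()
    for prefix, cls in CODON_RULES[amino_acid]:
        if cls is None:
            codons.add(prefix)
        else:
            codons.update(prefix + x for x in CLASSES[cls])
    return codons

table = {aa: expand(aa) for aa in CODON_RULES}
all_codons = set().union(*table.values())
```

The 20 amino-acid symbols expand to 61 distinct codons, which together with the three stop codons taa, tag and tga account for all 64 triplets.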

splice → intron
intron → gt intron-body ag

splice a → a intron        splice c → c intron
splice t → t intron        splice g → g intron

a splice → intron a        c splice → intron c
t splice → intron t        g splice → intron g

upstream → enhancer promoter enhancer
enhancer → …
promoter → …
silencer → …
isolator → …

These rules can generate an unlimited set of gene-like sequences, most of them biological nonsense.

They may be used to recognize gene-like segments in long DNA sequences.

Syntax versus semantics: texts vs. grammar.

Physics behind this coarse-grained description: stereochemistry, interactions between proteins and DNA chains, metallic ions, etc.
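As a hedged illustration of the recognition idea, the sketch below handles only the splice-free fragment of the rules, start-codon codon* stop-codon, which is a regular language and can therefore be written as a regular expression. The pattern, the function name `gene_like_segments` and the test sequence are assumptions for the example. Untranslated regions, introns and the upstream and downstream parts are ignored.

```python
import re

# start-codon codon* stop-codon, read in frame.
# (?:(?!taa|tag|tga)[acgt]{3}) matches one in-frame codon that is not a stop codon.
ORF = re.compile(r"atg(?:(?!taa|tag|tga)[acgt]{3})*(?:taa|tag|tga)")

def gene_like_segments(dna):
    """Return (start, end, text) for non-overlapping gene-like segments in `dna`."""
    return [(m.start(), m.end(), m.group()) for m in ORF.finditer(dna.lower())]

hits = gene_like_segments("ccatgaaacgttagggatgtga")
```

On this toy sequence the scan finds two segments, atgaaacgttag and atgtga. Neither means anything biologically, in line with the remark that the rules describe syntax only.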

Symbolic Dynamics Languages

1991 1999

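Huimin Xie's Theorem and Conjecture

• Among the languages associated with kneading sequences of unimodal maps, the only regular languages are those of periodic and eventually periodic kneading sequences

• How does one move from regular languages to the next level up, the context-free languages?

• Conjecture: among the languages associated with kneading sequences of unimodal maps there are no context-free languages (beyond the regular ones)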

Subintervals determined by the periodic kneading sequence (RLRRC)∞

Order of visits in the periodic kneading sequence (RLRRC)∞

Transformations of subintervals

• a → c + d (on reading L)

• b → d (on reading R)

• c → b + c (on reading R)

• d → a (on reading R)

Input   L   R   R   R

q       a   b   c   d

d       1   1   0   0
c       1   0   1   0
b       0   0   1   0
a       0   0   0   1
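The 0/1 table above follows directly from the subinterval transformations a → c + d, b → d, c → b + c, d → a. The sketch below rebuilds it and, as an added illustration, counts the admissible itineraries of a given length through the subintervals. The names `IMAGE`, `matrix` and `count_paths` are assumptions for the example.

```python
# Image of each subinterval under one application of the map.
IMAGE = {"a": "cd", "b": "d", "c": "bc", "d": "a"}
STATES = "abcd"      # column order of the table
ROWS = "dcba"        # row order of the table, top to bottom

# matrix[i][j] = 1 if subinterval STATES[j] is mapped onto subinterval ROWS[i].
matrix = [[1 if r in IMAGE[c] else 0 for c in STATES] for r in ROWS]

def count_paths(n):
    """Number of admissible length-n itineraries through the subintervals."""
    counts = {s: 1 for s in STATES}        # one path of length 1 from each state
    for _ in range(n - 1):
        counts = {s: sum(counts[t] for t in IMAGE[s]) for s in STATES}
    return sum(counts.values())
```

The rows of `matrix` come out as 1100, 1010, 0010 and 0001, matching the table, and the path counts for lengths 1, 2 and 3 are 4, 6 and 9.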

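Transfer Functions

Non-deterministic transfer function on the subintervals:

        R       L
a               c, d
b       d
c       b, c
d       a

Deterministic transfer function obtained by the subset construction:

              R             L
{a,b,c,d}     {a,b,c,d}     {c,d}
{c,d}         {a,b,c}
{a,b,c}       {b,c,d}       {c,d}
{b,c,d}       {a,b,c,d}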
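The table over sets of states on this slide can be reproduced from the single-state transfer function by the subset construction mentioned earlier for FSAs. The sketch below is one minimal way to do it; the names `DELTA` and `subset_construction` are assumptions, and a missing entry is read as "no transition".

```python
# Non-deterministic transfer function read off the slide: (state, symbol) -> set of states.
DELTA = {
    ("a", "L"): {"c", "d"},
    ("b", "R"): {"d"},
    ("c", "R"): {"b", "c"},
    ("d", "R"): {"a"},
}

def subset_construction(start, symbols=("R", "L")):
    """Return the deterministic transfer table over the sets of states reachable from `start`."""
    table, todo = {}, [frozenset(start)]
    while todo:
        current = todo.pop()
        if current in table:
            continue
        table[current] = {}
        for sym in symbols:
            target = frozenset(t for s in current for t in DELTA.get((s, sym), ()))
            if target:                     # an empty target means the word is rejected
                table[current][sym] = target
                todo.append(target)
    return table

dfa = subset_construction("abcd")
```

Starting from {a,b,c,d}, the construction reaches exactly four sets of states, the same four that appear as rows of the table.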

Stefan matrix for 256P in Feigenbaum cascade

Stefan matrix for F13=233; Case (a)

Stefan matrix for F13=233. Case (b)

Stefan matrix for F13=233. Case (c)

Stefan matrix for F13=233. Case (d)

Symbolic Dynamics Languages

1991 1999

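Development of Anabaena catenula (a bead-chain alga of the genus Anabaena)

[Figure: cell-division diagram showing a_r → a_l b_r, a_l → b_l a_r, b_r → a_r, b_l → a_l]

Alphabet: Σ = {a_r, a_l, b_r, b_l}

Production rules:

P = {
  a_r → a_l b_r
  a_l → b_l a_r
  b_r → a_r
  b_l → a_l
}

Initial symbol (axiom): ω = a_r

Grammar: G = (Σ, P, ω)

Language: L(G) ⊂ Σ*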
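The parallel character of these production rules is easy to see in code: at every step all cells of the filament are rewritten simultaneously. The sketch below is a minimal version under that reading; the cell names are spelled "ar", "al", "br", "bl", and the function name `develop` and the number of steps are assumptions for the example.

```python
# Production rules of the Anabaena system: each cell has a type (a or b) and a polarity (r or l).
P = {"ar": ["al", "br"], "al": ["bl", "ar"], "br": ["ar"], "bl": ["al"]}

def develop(axiom=("ar",), steps=5):
    """Rewrite every cell in parallel at each step; return the successive filaments."""
    filament = list(axiom)
    history = [filament]
    for _ in range(steps):
        filament = [new for cell in filament for new in P[cell]]
        history.append(filament)
    return history

history = develop()
lengths = [len(f) for f in history]
```

Starting from the axiom, the filament lengths over the first five steps are 1, 2, 3, 5, 8, 13: each a-cell divides in two while each b-cell only matures into an a-cell.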

Lindenmayer Systems

Parallel production rules. Finer classification

D0L – Deterministic, no interaction, i.e., context-free

0L – non-deterministic, no interaction

IL – non-deterministic, with Interaction, i.e., context-sensitive

T0L – with Table of production rules

TIL –

E0L – Extended to non-terminal symbols

ET0L –

EIL = REL of Chomsky

RGL: Regular; CFL: Context-Free; CSL: Context-Sensitive; REL: Recursively Enumerable

[Figure: inclusion diagram of the classes REL, CSL, CFL and RGL, with the regions FIN and D0L marked]

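[Figure: inclusion diagram relating the Chomsky hierarchy (0: REL, 1: CSL, 2: CFL, 3: RGL), the indexed languages (IND), and the Lindenmayer families (EIL, ET0L, E0L, IL, T0L, 0L, D0L)]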

L = {a^i b^i c^i | i > 0} ∈ CSL

G = (Σ, T, ω)

ω = abc

Σ = {a, b, c}

T = {T1, T2}

T1 = {a → aa, b → bb, c → cc}

T2 = {a , b , c }

T0L

Example a la Lindenmayer

Dyck language: A language of nested parentheses

• Many types of parentheses

• Finite depth of nesting

• Context-free language

Our case:

• Only 3 types of parentheses

• Shallow nesting

• Conjecture (Xie): may be regular language
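To illustrate why the depth of nesting matters, the sketch below recognizes properly nested words over three types of parentheses with a stack, and optionally enforces a bound on the depth. The concrete bracket characters, the function name `is_dyck` and the `max_depth` parameter are assumptions for the example; the slide does not say which three parentheses occur in "our case".

```python
PAIRS = {")": "(", "]": "[", "}": "{"}     # three types of parentheses (illustrative)

def is_dyck(word, max_depth=None):
    """Check that `word` is properly nested; optionally also bound the nesting depth."""
    stack = []
    for ch in word:
        if ch in PAIRS.values():           # opening parenthesis: push
            stack.append(ch)
            if max_depth is not None and len(stack) > max_depth:
                return False
        elif ch in PAIRS:                  # closing parenthesis: must match the top
            if not stack or stack.pop() != PAIRS[ch]:
                return False
        else:
            return False                   # symbol outside the alphabet
    return not stack
```

With unbounded depth the stack is essential, which is what makes the language context-free. Once the depth is bounded, the stack can take only finitely many configurations, so a finite automaton suffices. That is consistent with the conjecture that the shallow-nesting case may be regular.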

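Fuzzy linguistics: the formal generalization is not difficult; see Z. G. Yu (喻祖国, 2001)

How to incorporate biological knowledge quantitatively: consensus sequences and weight matrices

Stochastic grammars: hidden Markov chain = stochastic regular grammar. Higher-order stochastic grammars?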

Factorizable Languages

• Symbolic dynamics leads to factorizable languages

• A complete genome defines a factorizable language

• An amino acid sequence with unique reconstruction (at certain K) defines a factorizable language
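The slide does not spell out the definition, so the sketch below assumes the usual one: a language is factorizable if every substring (factor) of each of its words also belongs to the language. Under that assumption, the set of all substrings of a genome is factorizable by construction. The toy sequence and the function names are illustrative.

```python
def factors(word):
    """All substrings (factors) of `word`, including the empty string."""
    return {word[i:j] for i in range(len(word) + 1) for j in range(i, len(word) + 1)}

def is_factorizable(language):
    """True if every factor of every word in the finite set `language` is also in it."""
    return all(factors(w) <= language for w in language)

genome = "acgtacg"                 # toy stand-in for a complete genome
L_genome = factors(genome)         # the language defined by the genome
```

Here `is_factorizable(L_genome)` holds, whereas a language consisting of the single word "acg" is not factorizable because it lacks factors such as "ac".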

Modeling in Biology

• Cells

• Tissues

• Organs

• “Systems”: circulation, respiration, reproduction, neural, sensory, muscular, etc.

• Organisms, population, ecosystems

• Animals versus plants

• Plant development, morphology, physiology and pathology

Modeling of Plant Morphology by Using L-Systems

• P. Prusinkiewicz, J. Hanan, Lindenmayer Systems, Fractals, and Plants, LN in Biomath., vol. 79, Springer, 1989

• P. Prusinkiewicz, A. Lindenmayer, The Algorithmic Beauty of Plants, Springer, 1990

• P. Prusinkiewicz, M. Hammel, J. Hanan, R. Mech, Visual models of plant development, Chap.9 in Handbook of Formal Languages, Vol.3, Springer, 1997

Consistency of Macro- and Micro-Description of Nature

• Molecular phylogeny versus phylogeny based on morphological features

• Modeling plant development without getting into molecular and cellular description

• No need to model protein folding by invoking quarks!

Some Useful URLs

• www.grogra.org (Growth Grammar)

• http://www.computableplant.org

• http://algorithmicbotany.org

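Huimin Xie (谢惠民), Grammatical Complexity and 1D Dynamical Systems, Vol. 6 in Directions in Chaos, WSPC, 1996.

Huimin Xie (谢惠民), Complexity and Dynamical Systems (《复杂性与动力系统》, in Chinese), Shanghai Scientific and Technological Education Publishing House (上海科技教育出版社), 1994.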

Bailin Hao, Weimou Zheng, Applied Symbolic Dynamics and Chaos (WSPC, 1998), Chap. 8

J.Hopcroft, J.Ullman, Introduction to Automata Theory, Languages and Computation, Addison-Wesley, 1979.

Thanks!