INDUCTIVE SYNTHESIS OF PROGRAMS FOR SYMBOLIC
SEQUENCES PROCESSING
N. A. Chuzhanova
Institute of Mathematics SO AN SSSR, Novosibirsk 90, USSR
A practical method for synthesizing programs that process symbolic sequences is considered. The method is based on inductive generalization of regularities such as repetitions, inversions and common subsequences in a finite set of positive samples. Such regularities are described in terms of pattern grammars. The main stages of the synthesis method are considered. An application of the method to the recognition of ribosomal binding sites in prokaryotic genetic texts is given.
1. Introduction
The main aim of this research is to create a practical method of program synthesis for symbolic sequence processing. To choose the theoretical and practical foundations of this method, we investigate the requirements that follow from the specifics of real tasks and of the objects themselves, the sequences. We restrict the tasks of symbolic sequence processing to those that arise in genetics. The main goal of such processing is to formalize the knowledge about sequences that play a functional role in genetic texts. A typical example is the task of recognizing sites in a text.
A genetic text in this context is a DNA molecule represented as an ordered sequence of nucleotides over the four-symbol alphabet {A, C, G, T}.
Sites are relatively short segments of the text (from three to a few thousand symbols). They play a regulatory role in the main genetic processes such as replication, translation and transcription. Accordingly, there are different classes of sites (terminators, ribosomal binding sites and so on). The only source of information about a class of sites is a set of samples. For each class of sites there is a special recognition mechanism at the molecular level, but the details of such recognition are not well known.
The sites do not have precise boundaries. Noninformative domains are interspersed with informative ones (only the latter determine membership in a class of sites). Samples from the same class of sites may differ greatly.
Recognizing sites by the methods of genetics is a very difficult process that involves many chemical and biological experiments. Recently developed methods of site recognition are based on determining formal characteristics of the sites directly from the text and on constructing a "formula of the site" or "consensus of the site". It is known that regularities such as repetitions, inversions and common subsequences play an important role in these methods.
An analysis of the procedures that humans use to solve these tasks shows that they are based on modelling semantic notions through syntactic ones by inductive generalization. The high complexity of these procedures can be explained as follows:
- to get reliable results, the inductive generalization must be made on a large set of samples that covers the variety of syntactic expressions of the semantic notions more fully;
- to describe the semantic notions adequately, it is necessary to preprocess the set of samples, to determine the informative domains in the sequences and to classify the set of samples. The informativity and similarity measures have to be based on the formal properties of sequences that are used in inductive generalization. These formal properties usually have no semantic meaning for biologists;
- to operate on lengthy texts with complicated structure, the programs must satisfy high quality requirements. Since biologists have no opportunity to study programming languages and methods of efficient programming in depth, the synthesizing system must ensure the high quality of the produced programs.
The specifics of the tasks, the procedures for solving them and the peculiarities of the investigated objects have determined the choice of the inductive approach to symbolic sequence processing. As can be seen from the above, the method must satisfy the following requirements:
- the method must include preprocessing of the set of samples (determining the informative domains and clustering the set of samples);
- the complexity of the method must lie within practically acceptable bounds;
- the efficiency of the synthesized programs must be no worse than that of human-written ones;
- the languages for inductive generalization must describe the symbolic sequences adequately.
For the last purpose it is proposed to use some classes of pattern grammars [1-3].
2. Preliminaries
According to [1-3], let $E$ be a finite alphabet of constant symbols containing at least two symbols, and let $X = \{x_1, x_2, \ldots\}$ be a countable set of symbols called variables ($X \cap E = \emptyset$).
A pattern is any finite string over $X \cup E$. The set $(X \cup E)^*$ of all patterns is denoted by $P^*$. For any pattern $p$, the number of variables in $p$ is the number of distinct variables that occur in $p$.
Let $f$ be a homomorphism (with respect to concatenation) from $P^*$ to $P^*$. If $f(a) = a$ for every constant symbol $a$, then $f$ is called a substitution. If the pattern $p$ has only two variables $x$ and $y$, and if $f(x) = a_1 \ldots a_m$ and $f(y) = a_m \ldots a_1$ ($a_i \in E$, $1 \le i \le m$, $m \ge 2$), then $f$ is called an inverse substitution [8]. If $f(x) = a_1 \ldots a_m$ and $f(y) = a'_1 \ldots a'_m$ ($a_i \in E$, $1 \le i \le m$, $m \ge 1$), where $a'_i = \varphi(a_i)$ for some recoding function $\varphi$, then $f$ is called a $\varphi$-equivalence substitution.
We define a binary relation $\le'$ on $P^*$ as follows: $p \le' q$ iff $p = f(q)$ for some substitution $f$.
The language of a pattern $p$, denoted by $L(p)$, is the set $\{w \in E^* \mid w \le' p\}$.
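As an illustration of this definition, the following Python sketch tests whether a string belongs to the language of a pattern when only non-erasing substitutions are allowed. It is not taken from the paper: the encoding of a pattern as a list of tokens, the explicit set of variable names and the function names are assumptions made for this example, and the check simply translates the pattern into a regular expression with backreferences (so a variable may match any non-empty string, not only strings over a fixed alphabet).

```python
import re

def pattern_to_regex(pattern, variables):
    """Translate a pattern (list of tokens) into a regex with backreferences.

    Tokens listed in `variables` are pattern variables; every other token is
    a constant string. This encoding is an assumption of the sketch.
    """
    groups = {}   # variable name -> number of its capturing group
    parts = []
    for tok in pattern:
        if tok in variables:
            if tok in groups:
                # A repeated variable must be replaced by the same string.
                parts.append(r"(?:\%d)" % groups[tok])
            else:
                groups[tok] = len(groups) + 1
                parts.append("(.+)")   # non-erasing: at least one symbol
        else:
            parts.append(re.escape(tok))
    return re.compile("".join(parts))

def in_language(word, pattern, variables):
    """True if `word` is obtained from `pattern` by a non-erasing substitution."""
    return pattern_to_regex(pattern, variables).fullmatch(word) is not None
```

For instance, `in_language("ACGACG", ["x", "x"], {"x"})` holds because the substitution x = ACG produces the word, while `in_language("ACGT", ["x", "x"], {"x"})` does not.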
The binary relation $>$ on $E^*$ is defined as follows: for any strings $a = a_1 \ldots a_k$ and $b \in E^+$, let $a > b$ iff $b = a_{i_1} \ldots a_{i_m}$ ($1 \le i_1 < i_2 < \ldots < i_m \le k$). The string $b$ is called a subsequence of the string $a$. If $i_1 = i_2 - 1 = \ldots = i_l - l + 1$, $l \le m$, then $b$ is called an $l$-gram.
If a string $a \in E^+$ is presented in the form $uvz$, then $u$, $v$ and $z$ are called a prefix, a substring and a suffix of $a$, respectively. The empty word $e$ is a prefix and a suffix of any string.
By the $l$-nary occurrence characteristic we mean the set $\Theta^l = \{\theta^l_1, \ldots, \theta^l_{M_l}\}$ in which each element $\theta^l_i$ is a pair ⟨$l$-gram, number of occurrences of this $l$-gram in the analysed part of the text⟩, and $M_l$ is the number of different $l$-grams in the given part of the text.
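A minimal Python sketch of this definition is given below. The function name and the representation of the characteristic as a mapping from each contiguous substring of length l to its number of occurrences are choices made for this example only.

```python
from collections import Counter

def occurrence_characteristic(text, l):
    """Count every contiguous substring of length l in `text`.

    The result maps each substring to its number of occurrences; the number
    of keys corresponds to the number of different substrings of that length.
    """
    return Counter(text[i:i + l] for i in range(len(text) - l + 1))
```

For the fragment ACGACG and l = 2 the characteristic contains three different pairs of symbols: AC and CG occur twice each and GA occurs once.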
Let us consider a subset $U$ of the set of all patterns $P^*$ that generates the following classes of pattern languages:
- regular pattern languages, obtained from a pattern with $k$ variables in which each variable symbol occurs at most once (a regular pattern) by a substitution $f$ that is a non-erasing and nonempty homomorphism;
- extended regular pattern languages, obtained from regular patterns by any substitution $f$;
- pattern languages with one variable, obtained from a pattern with one variable by a non-erasing and nonempty substitution $f$;
- pattern languages with two variables, obtained from a pattern with two variables by an inverse or $\varphi$-equivalence substitution.
The first three classes of languages were introduced and investigated by Angluin [1-2] and Shinohara [3]. They have shown that for these classes there exists a polynomial time algorithm for inductive inference of a grammar from positive data. The inference algorithm for patterns with one variable may be used in the case of two variables if the input is the concatenation of a string and its inversion, or of a string and its recoded copy. The pattern with one variable is then transformed into a pattern with two variables. The complexity of the transformation operations is polynomial [8].
3. The method
The method of synthesis consists of:
- preprocessing of the set of samples;
- the synthesis of patterns;
- the presentation of the synthesized patterns and programs in the R-metalanguage [4];
- optimization of the synthesized programs.
The preprocessing of the set of samples $S$ consists of determining the informative domains in the sequences and clustering the samples.
The coefficient of concordance of $l$-nary occurrence characteristics ($l = 1, 2, 3$) is used as the measure of informativity and similarity of parts of the texts. The main idea of the algorithm is to order all $l$-grams of each occurrence characteristic by decreasing number of occurrences. The position of an $l$-gram in this ordering is called its rank. All unique $l$-grams are arranged in the order of their occurrence. The measure of the similarity of $M$ orderings within the given part of the text is the coefficient of concordance of $l$-nary occurrence characteristics

$$W = \frac{V_l}{\frac{1}{12} M^2 \left(|A_l|^3 - |A_l|\right) - M \sum T_l},$$

where $V_l$ is the sum of squared deviations of the rank sums of all $l$-grams (each rank sum being taken over all $M$ orderings) from the average value of these sums. This average value equals $\frac{1}{2} M (|A_l| + 1)$, where $|A_l|$ is the cardinality of the alphabet of $l$-grams ($|A_l| = n^l$), $n$ is the cardinality of the alphabet of constant symbols, and $\sum T_l$ is the correction for tied ("connected") ranks [7].
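The coefficient can be computed along the following lines. This Python sketch is an illustration, not the implementation used in the paper: it assumes that each of the M orderings comes from one text fragment, that the ranked items are all n^l possible strings of length l over the alphabet (absent ones get count 0), that tied items share the mean of their ranks, and that the tie correction of one ordering is the sum of (t^3 - t)/12 over its groups of t tied items, as in the standard rank-correlation formula. The function names and the default alphabet are hypothetical.

```python
from collections import Counter
from itertools import product

def average_ranks(counts):
    """Rank items by decreasing count; tied items share the mean rank.

    Returns (ranks, tie_correction) for one ordering.
    """
    order = sorted(range(len(counts)), key=lambda i: -counts[i])
    ranks = [0.0] * len(counts)
    correction = 0.0
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and counts[order[end + 1]] == counts[order[start]]):
            end += 1
        t = end - start + 1                       # size of the tie group
        mean_rank = (start + 1 + end + 1) / 2     # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        correction += (t ** 3 - t) / 12
        start = end + 1
    return ranks, correction

def concordance(fragments, l, alphabet="ACGT"):
    """Coefficient of concordance of the orderings induced by `fragments`."""
    grams = ["".join(g) for g in product(alphabet, repeat=l)]
    m, size = len(fragments), len(grams)
    rank_sums = [0.0] * size
    ties = 0.0
    for text in fragments:
        occ = Counter(text[i:i + l] for i in range(len(text) - l + 1))
        ranks, corr = average_ranks([occ[g] for g in grams])
        ties += corr
        rank_sums = [s + r for s, r in zip(rank_sums, ranks)]
    mean = m * (size + 1) / 2
    v = sum((s - mean) ** 2 for s in rank_sums)
    denom = m * m * (size ** 3 - size) / 12 - m * ties
    # If every ordering is completely tied the coefficient is undefined;
    # returning 0.0 in that case is a convention of this sketch.
    return v / denom if denom else 0.0
```

With this convention identical fragments give the value 1, and two fragments whose orderings are exactly opposite give the value 0.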
Fixing the length of the $l$-gram and the length of the analysed part of the text (the window) and moving the window along the sequence, one obtains a graph of the values of the concordance coefficient. We consider the maxima of this graph to be the functional informative domains. Each sequence is represented as $x w y$, where $x, y \in X$ and $w \in E^+$ is one of the chosen informative domains.
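The window scan can be sketched as follows. This is only an illustration under assumptions not stated in the paper: the sequences are taken to be aligned, the orderings being compared are taken to come from the individual sequences restricted to the current window, and `score` is a caller-supplied function (for example an implementation of the coefficient of concordance) that maps a list of fragments to a number. All names are hypothetical.

```python
def window_profile(sequences, width, score):
    """Value of `score` for every position of a window of the given width.

    `score` receives the list of fragments cut from all sequences at the
    same position and returns a number.
    """
    shortest = min(len(s) for s in sequences)
    return [score([s[i:i + width] for s in sequences])
            for i in range(shortest - width + 1)]

def local_maxima(profile):
    """Positions whose value is strictly greater than both neighbours."""
    peaks = []
    for i, value in enumerate(profile):
        left = profile[i - 1] if i > 0 else float("-inf")
        right = profile[i + 1] if i + 1 < len(profile) else float("-inf")
        if value > left and value > right:
            peaks.append(i)
    return peaks
```

The positions returned by `local_maxima` are the candidate starting points of informative domains; how ties and plateaus on the graph should be treated is left open here.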
The next step of the preprocessing is the clustering of the samples. The idea of the clustering is to sequentially unite the nearest (in the above sense) sequences into clusters. The number of clusters decreases by 1 at each iteration through uniting the nearest clusters. This process continues until either a single cluster is obtained or the maximal value of the coefficient becomes less than some experimentally chosen threshold. The result of the preprocessing is the set of clusters $S_j$ ($S = \bigcup_j S_j$). The complexity of the preprocessing algorithms is $O(\min(n^l, n_s)\, m\, n_s)$ and $O(m^2)$, where $m$ is the number of sequences, $n_s$ is the length of the longest sequence, $l$ is the length of the $l$-gram and $n$ is the cardinality of the alphabet $E$.
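The merging loop described above might look as follows in Python. The sketch assumes that the closeness of two clusters is measured by applying a caller-supplied function `score` (for example the coefficient of concordance) to their union; this choice, the function name and the naive re-evaluation of all pairs at every iteration are assumptions of the example, so it does not reproduce the complexity bound quoted in the text.

```python
def cluster(samples, score, threshold):
    """Agglomerative clustering: repeatedly unite the two nearest clusters.

    Stops when a single cluster remains or when the best achievable value
    of `score` for a united pair falls below `threshold`.
    """
    clusters = [[s] for s in samples]
    while len(clusters) > 1:
        best, pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                value = score(clusters[i] + clusters[j])
                if best is None or value > best:
                    best, pair = value, (i, j)
        if best < threshold:
            break
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]   # unite the nearest pair
        del clusters[j]
    return clusters
```

Each iteration reduces the number of clusters by one, so at most m - 1 iterations are performed for m samples.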
The synthesis procedure is applied to the sets $S_j$. For each $S_j$, a pattern $p_j$ of maximal length from the set of patterns $U$ is constructed. This pattern generates the language $L(p_j)$ such that $S_j \subseteq L(p_j)$. The complexity of the synthesis algorithms is O(mn~) for regular pattern grammars, O(mn~) for extended regular pattern grammars and O(mn~ log n) for pattern grammars with one and two variables (here $n = \min\{|w| : w \in S\}$).
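The inference algorithms of [1-3] are not reproduced here. As a deliberately simplified stand-in, the Python sketch below builds one pattern of the form x w y for a cluster, where w is a longest substring common to all samples and x, y are variables that may be replaced by any string, including the empty one. It therefore illustrates only the requirement that every sample of the cluster belongs to the language of the constructed pattern; the function names and the brute-force search are assumptions of this example, not the method of the paper.

```python
def longest_common_substring(samples):
    """Longest string that occurs as a substring of every sample.

    Brute force: candidates are substrings of the shortest sample, tried
    from the longest to the shortest. Returns "" if nothing is shared.
    """
    shortest = min(samples, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in samples):
                return candidate
    return ""

def simple_pattern(samples):
    """Pattern x w y (as a token list) that covers every sample."""
    return ["x", longest_common_substring(samples), "y"]
```

For the cluster TTAGGAC, CAGGAT, AGGAG the common substring is AGGA, so the sketch yields the pattern x AGGA y.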
The program $p$ synthesized from the set of samples is presented by an R-grammar [4] with the following rules:
$r_0\ \{\langle G_1 \rangle \to r_1, \ldots, \langle G_k \rangle \to r_k\}$,
where $\langle G_j \rangle$ is the pattern $p_j$ presented in the R-metalanguage and $k$ is the number of clusters. R-grammars without stacks are used for the presentation of regular and extended regular pattern languages; R-grammars with counter, register and car memories are used for languages with one and two variables.
To obtain efficient synthesized programs it is necessary to optimize them. Optimization here means ordering the languages by containment. It is well known that this problem is NP-hard in general.
Let us formulate the conditions of containment for extended regular pattern languages. There exists an algorithm of linear complexity that verifies these conditions [9].
Let us assume that patterns $p$ and $q$ have the following presentations:
$p = v_0 x_1 v_1 x_2 \ldots x_m v_m$, $v_0, v_m \in E^*$, $v_i \in E^+$ ($i = 1, \ldots, m-1$), $x_i \in X^+$ ($i = 1, \ldots, m$);
$q = w_0 y_1 w_1 y_2 \ldots y_n w_n$, $w_0, w_n \in E^*$, $w_j \in E^+$ ($j = 1, \ldots, n-1$), $y_j \in X^+$ ($j = 1, \ldots, n$).
PROPOSITION. $p \le' q$ iff the following conditions are satisfied:
(i) $w_0$ is a prefix of $v_0$, $w_n$ is a suffix of $v_m$;
(ii) there exists at least one set of indexes $0 \le i_1 \le i_2 \le \ldots \le i_{n-1} \le m$ such that for $k = 1, \ldots, n-1$ the string $w_k$ is a substring of the string $v_{i_k}$, and in the case when $i_r = i_{r+1} = \ldots = i_{r+s}$ ($1 \le r + s \le n-1$) ...
The complexity of the algorithm that verifies these conditions is $O(|p| + |q|)$. It is well known that $p \le' q \Rightarrow L(p) \subseteq L(q)$ [2].
The method has been implemented for OS ES computers.
4. An example of a practical task
The method was used for solving the task of recognition of ribosomal binding sites in genetic texts.
The set of samples consisted of 86 phage (φX174, G4, FD, MS2, RI, T7, λ, Q) and bacterial (E. coli) sites. All sites were aligned according to the ATG or GTG codons that initiate the process of translation. We assume that the functional domains of different sites lie at approximately equal distances from these codons.
The choice of the informative domains was made from the graph with $l = 2$. The chosen informative domains were clustered with the threshold value 0.75 of the concordance coefficient. The extended regular pattern languages were used for describing the inductive generalizations within each cluster. Experiments called "sliding control" were used to estimate the quality of the synthesized program. These experiments consisted of removing the sites that belong to one of the genomes (φX174, G4, FD, MS2) and performing the control on these texts.
The results of recognition of ribosomal binding sites by the synthesized program A33 and the algorithms Aij [6] are collected in the table.

Table. The results of recognition of ribosomal binding sites by algorithms A11 (l = 2), A12 (D = 7, l = 2), A21, A22 (D = 7, l = 5) and the synthesized algorithm A33 (D = 7, l = 3). S and S̄ denote the sets of positive and negative samples. Each error cell gives the number of errors as "s→ns / ns→s".

Type of experiment   Samples   Number of fragments   A11      A21      A12      A22      A33
                     S ∪ S̄     86 s; 86 ns           2 / 9    5 / 12   8 / 7    3 / 13   9 / 13
Sliding control      φX174     11 s; 195 ns          0 / 26   0 / 21   0 / 24   0 / 15   0 / 24
                     G4        11 s; 176 ns          0 / 19   0 / 21   0 / 18   0 / 15   1 / 8
                     FD        10 s; 175 ns          0 / 11   1 / 12   2 / 5    0 / 3    5 / 6
                     MS2       … s; 109 ns           0 / 5    1 / 9    1 / 10   1 / 13   1 / …
The control text     φX174     11 s; 195 ns          1 / 21   1 / 19   3 / 17   2 / 13   2 / 29
                     G4        11 s; 176 ns          0 / 23   1 / 21   1 / 20   2 / 19   1 / 14
                     FD        10 s; 195 ns          5 / 4    2 / 11   6 / 4    5 / 2    5 / 8
                     MS2       … s; 109 ns           1 / 2    1 / 9    1 / 10   1 / 13   2 / 5
The sum of the errors                                57       65       62       57       66
The algorithms Aij use the set of positive and negative samples. They combine admissible combinations of different ways of describing symbolic sequences and of different types of decision rules: l-gram and common subsequence languages (i = 1, 2, respectively) and decision rules of the threshold and taxonomy types (j = 1, 2, respectively).
The following notation is used in the table:
"s → ns" - a site is not recognized;
"ns → s" - a nonsite is recognized as a site;
D - the width of the window.
A comparison of the results of processing genetic texts by the synthesized algorithm with those of four human-written programs for the same task has shown that the results of the synthesized algorithm are no worse than the results of the human-written programs.
References:
1. Angluin D. Inductive inference of formal languages from positive data, Inform. and Control, 45, N 3 (1980), 119-135.
2. Angluin D. Finding patterns common to a set of strings, J. Comput. and Syst. Sci., 21, N 1 (1980), 46-62.
3. Shinohara T. Polynomial time inference of extended regular pattern languages, LNCS, N 147 (1983), 115-127.
4. Vel'bitskii I.V. Metalanguage of R-grammars, Kibernetika, N 5 (1973), 47-63 (in Russian).
5. Gusev V.D., Kosarev Yu.G., Timofeeva, Titkova T.N., Chuzhanova N.A. A package of applied programs for the analysis of arbitrary symbolic sequences of considerable length (SIMVOL), Vychislitel'nye sistemy, N 101 (1984), 5-21 (in Russian).
6. Gusev V.D., Kulichkov V.A., Chuzhanova N.A. Algorithms for detecting punctuation marks in genetic texts, Vychislitel'nye sistemy, N 12… (1987), 1-24 (in Russian).
7. Kendall M. Rank correlations. Moscow: Statistika, 1975 (in Russian).
8. Chuzhanova N.A. Grammatical method of program synthesis, Vychislitel'nye ... (in Russian).
9. Chuzhanova N.A. On the grammatical method of synthesis of graphical programs, Vychislitel'nye sistemy, N 123 (1987), 50-60 (in Russian).