[Lecture Notes in Computer Science] Analogical and Inductive Inference Volume 397 || Inductive synthesis of programs for symbolic sequences processing

INDUCTIVE SYNTHESIS OF PROGRAMS FOR SYMBOLIC SEQUENCES PROCESSING

N. A. Chuzhanova

Institute of Mathematics SO AN SSSR, Novosibirsk 90, USSR

A practical method for synthesizing programs that process symbolic sequences is considered. The method is based on inductive generalization of regularities such as repetitions, inversions and common subsequences in a finite set of positive samples. Such regularities are described in terms of pattern grammars. The main stages of the synthesis method are considered. An application of the method to the recognition of ribosomal binding sites in prokaryotic genetic texts is given.

1. Introduction

The main aim of this research is to create a practical method of program synthesis for symbolic sequence processing. In order to choose the theoretical and practical foundations of this method we investigate the requirements that follow from the specifics of the real tasks and of the sequences themselves. We have limited the tasks of symbolic sequence processing to those which appear in genetics. The main goal of this processing is to formalize the knowledge about the sequences that play a functional role in genetic texts. A typical example of such a task is the recognition of sites in the text.

A genetic text in this consideration is a DNA molecule presented as an ordered sequence of nucleotides over the alphabet of four symbols {A, C, G, T}.

The sites are relatively short segments of the text (from three to a few thousand symbols). They play a regulatory role in the main genetic processes such as replication, translation and transcription, so there are different classes of sites (terminators, ribosomal binding sites and so on). The only source of information about a class of sites is the set of samples. For each class of sites there is a special mechanism of recognition on the molecular level, but the details of such recognition are not well known.

The sites do not have precise bounds. Noninformative domains are mixed with informative ones (only the latter determine membership in a class of sites). Samples from the same class of sites may be very different.

The recognition of sites in genetics is a very difficult process. It includes a lot of chemical and biological experiments. The recently developed methods of site recognition are based on determining formal characteristics of the sites directly from the text and on constructing the "formula of a site" or "consensus of a site". It is known that an important role in these methods belongs to such kinds of regularities as repetitions, inversions and common subsequences.

The analysis of human procedures for solving these tasks shows that they are based on modelling semantic notions through syntactic ones by inductive generalization. The high complexity of these procedures can be explained as follows:

- to get reliable results it is necessary to make the inductive generalization on a large set of samples, which covers the variety of syntactic expressions of the semantic notions more fully;

- to describe the semantic notions adequately it is necessary to preprocess the set of samples, to determine the informative domains in the sequences and to classify the set of samples. The informativity and similarity measures have to be based on the formal properties of sequences which are used in inductive generalization. These formal properties usually have no semantic meaning for biologists;

- to operate with lengthy texts of complicated structure it is necessary to satisfy high requirements on the quality of the programs. Since biologists have no opportunity to study programming languages and methods of efficient programming in depth, the synthesizing system must ensure the high quality of the produced programs.

The specifics of the tasks, the procedures of their solution and the peculiarities of the investigated objects have determined the choice of the inductive approach to the tasks of symbolic sequence processing. As we can see from the above, the method must satisfy the following requirements:

- the method must contain the preprocessing of the set of samples (to determine the informative domains and to cluster the set of samples);

- the complexity of the method must lie within practically acceptable bounds;

- the efficiency of the synthesized programs must be not worse than the efficiency of human-written ones;

- the languages for inductive generalization must describe the symbolic sequences adequately.

For the last purpose it is proposed to use some classes of pattern grammars [1-3].

2. Preliminaries

Following [1-3], let $E$ be a finite alphabet of constant symbols containing at least two symbols and let $X = \{x_1, x_2, \dots\}$ be a countable set of symbols called variables ($X \cap E = \emptyset$).

A pattern is any finite string over $X \cup E$. The set $(X \cup E)^*$ of all patterns is denoted by $P^*$. For any pattern $p$, the number of variables in $p$ is the number of distinct variables which occur in $p$.

Let $f$ be a homomorphism (with respect to concatenation) from $P^*$ to $P^*$. If $f(a) = a$ for any constant symbol $a$ then $f$ is called a substitution. If the pattern $p$ has only two variables $x$ and $y$ and if $f(x) = a_1 \dots a_m$ and $f(y) = a_m \dots a_1$ ($a_i \in E$, $1 \le i \le m$, $m \ge 2$) then $f$ is called an inverse substitution [8]. If $f(x) = a_1 \dots a_m$, $f(y) = a'_1 \dots a'_m$ ($a_i \in E$, $1 \le i \le m$, $m \ge 1$) and $a'_i = \varphi(a_i)$ for some recoding function $\varphi$ then $f$ is called a $\varphi$-equivalence substitution.

We define a binary relation $\le'$ on $P^*$ as follows: $p \le' q$ iff $p = f(q)$ for some substitution $f$.

The language of a pattern $p$, denoted by $L(p)$, is the set $\{w \in E^* \mid w \le' p\}$.
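The definitions of a substitution and of the relation $\le'$ can be illustrated with a minimal sketch. The list-of-tokens representation of a pattern and the name `apply_substitution` are assumptions made for this illustration only; they do not come from the paper.

```python
from typing import Dict, List

# Assumed representation: a pattern is a list of tokens. Tokens that appear
# as keys of the substitution are variables; all other tokens are constants.
Pattern = List[str]


def apply_substitution(pattern: Pattern, f: Dict[str, Pattern]) -> Pattern:
    """Apply the substitution f to a pattern, token by token.

    Tokens not mentioned in f are copied unchanged, which mirrors the
    requirement f(a) = a for every constant symbol a.
    """
    result: Pattern = []
    for token in pattern:
        result.extend(f.get(token, [token]))
    return result


q = ["A", "x1", "G", "x2"]
f = {"x1": ["C", "T"], "x2": ["x3", "A"]}
p = apply_substitution(q, f)              # p = f(q), hence p <=' q
w = apply_substitution(p, {"x3": ["G"]})  # a constant word, so w is in L(p)
```

Because `w` is obtained from `p` by a substitution and `p` from `q`, the word `w` also belongs to $L(q)$.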

The binary relation $>$ on $E^*$ is defined as follows: for any strings $a = a_1 \dots a_k$ and $b \in E^+$ let $a > b$ iff $b = a_{i_1} \dots a_{i_m}$ ($1 \le i_1 < i_2 < \dots < i_m \le k$). The string $b$ is called a subsequence of the string $a$. If $i_1 = i_2 - 1 = \dots = i_l - l + 1$, $l \le m$, then $b$ is called an $l$-gram.

If a string $a \in E^+$ is presented in the form $uvz$ then $u$, $v$, $z$ are called its prefix, substring and suffix, respectively.

The empty word $e$ is a prefix and a suffix of any string.

The $l$-ary occurrence characteristic is the set $\Phi^l = \{\phi^l_1, \dots, \phi^l_{M_l}\}$ in which each element $\phi^l_i$ is the pair <$l$-gram, the number of occurrences of this $l$-gram in the analysed part of the text> and $M_l$ is the number of different $l$-grams in the given part of the text.
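A small sketch may help to make the notions of subsequence, $l$-gram and occurrence characteristic concrete. The function names are hypothetical, and counting overlapping occurrences of an $l$-gram is an assumption; the text does not say how overlaps are treated.

```python
from collections import Counter
from typing import Dict


def is_subsequence(b: str, a: str) -> bool:
    """True if b can be obtained from a by deleting symbols (a > b)."""
    remaining = iter(a)
    # Each membership test consumes the iterator up to the matched symbol,
    # so the symbols of b must be found in a in the same order.
    return all(symbol in remaining for symbol in b)


def occurrence_characteristic(text: str, l: int) -> Dict[str, int]:
    """Map every l-gram of `text` to its number of (overlapping) occurrences."""
    return dict(Counter(text[i:i + l] for i in range(len(text) - l + 1)))


phi2 = occurrence_characteristic("AGGAGG", 2)
```

Here `len(phi2)` plays the role of $M_2$, the number of different 2-grams in the analysed part of the text.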

Let us consider a subset $U$ of the set of all patterns $P^*$ which generates the following classes of pattern languages:

- regular pattern languages, obtained from a pattern with $k$ variables in which each variable symbol occurs at most once (a regular pattern) by a substitution $f$ which is a non-erasing and nonempty homomorphism;

- extended regular pattern languages, obtained from regular patterns by any substitution $f$;

- pattern languages with one variable, obtained from a pattern with one variable by a non-erasing and nonempty substitution $f$;

- pattern languages with two variables, obtained from a pattern with two variables by an inverse or $\varphi$-equivalence substitution.
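The difference between regular and extended regular pattern languages can be sketched with ordinary regular expressions: since every variable of a regular pattern occurs once, a non-erasing substitution corresponds to `.+` and an arbitrary (possibly erasing) one to `.*`. This is only an illustration under the assumed list-of-tokens representation; the helper names are hypothetical.

```python
import re
from typing import List, Set


def regular_pattern_regex(pattern: List[str], variables: Set[str],
                          erasing: bool) -> "re.Pattern[str]":
    """Translate a regular pattern into an equivalent regular expression."""
    seen: Set[str] = set()
    parts: List[str] = []
    for token in pattern:
        if token in variables:
            if token in seen:
                raise ValueError("not a regular pattern: variable repeated")
            seen.add(token)
            parts.append(".*" if erasing else ".+")
        else:
            parts.append(re.escape(token))
    return re.compile("".join(parts), re.DOTALL)


def in_language(word: str, pattern: List[str], variables: Set[str],
                erasing: bool = False) -> bool:
    """Membership test for a regular (or extended regular) pattern language."""
    return regular_pattern_regex(pattern, variables, erasing).fullmatch(word) is not None


p = ["x", "A", "G", "G", "y"]
V = {"x", "y"}
```

With this pattern, `in_language("AGGC", p, V)` is false because `x` may not be erased, while `in_language("AGGC", p, V, erasing=True)` is true.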

The first three classes of languages were introduced and investigated by Angluin [1-2] and Shinohara [3]. They have shown that for these classes there exists a polynomial time algorithm for inductive inference of a grammar from positive data. The inference algorithm for patterns with one variable may also be used in the case of two variables if the input is the concatenation of a string and the inversion of the same string, or of a string and its recoded copy. The pattern with one variable is then transformed into a pattern with two variables. The complexity of the transformation is polynomial [8].
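The two input shapes mentioned above (a string followed by its inversion, or by its recoded copy) can be recognized with a few lines of code. This sketch only detects the shapes; it is not the inference or transformation algorithm of [8]. The complement table used as the recoding function is an assumption chosen because the texts are DNA sequences.

```python
from typing import Callable, Optional

# Hypothetical recoding function: nucleotide complement.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def split_inverse(w: str) -> Optional[str]:
    """Return u if w = u + reversed(u) with |u| >= 2, otherwise None."""
    half, odd = divmod(len(w), 2)
    if odd or half < 2:
        return None
    u = w[:half]
    return u if w[half:] == u[::-1] else None


def split_recoded(w: str, recode: Callable[[str], str]) -> Optional[str]:
    """Return u if w = u + recode(u) symbol by symbol with |u| >= 1, otherwise None."""
    half, odd = divmod(len(w), 2)
    if odd or half < 1:
        return None
    u = w[:half]
    return u if w[half:] == "".join(recode(a) for a in u) else None


u1 = split_inverse("ACGGCA")
u2 = split_recoded("ACTG", COMPLEMENT.__getitem__)
```

The length bounds follow the definitions of the inverse substitution ($m \ge 2$) and of the $\varphi$-equivalence substitution ($m \ge 1$).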

3. The method

The method of synthesis contains:

- preprocessing of the set of samples;

- the synthesis of patterns;

- the presentation of the synthesized patterns and programs in the R-metalanguage [4];

- optimization of the synthesized programs.

The preprocessing of the set of samples $S$ consists of the determination of informative domains in the sequences and the clustering of the samples.

The coefficient of concordance of $l$-ary occurrence characteristics ($l = 1, 2, 3$) is used as the measure of informativity and similarity of parts of the texts. The main idea of the algorithm is to order all $l$-grams of each occurrence characteristic by decreasing number of occurrences. The position of an $l$-gram in this ordering is called its rank. All unique $l$-grams are ordered and arranged in the order of their occurrence. The measure of the similarity of $M$ orderings within the given part of the text is the coefficient of concordance of $l$-ary occurrence characteristics [7]

$$W_l = \frac{V_l}{\frac{1}{12} M^2 \left(|A_l|^3 - |A_l|\right) - M \sum T_l},$$

where $V_l$ is the sum of squared deviations of the sums of ranks of all $l$-grams (each sum taken over all $M$ orderings) from the average value of these sums. This average value equals $\frac{1}{2} M (|A_l| + 1)$, where $|A_l|$ is the cardinality of the alphabet of $l$-grams ($|A_l| = n^l$), $n$ is the cardinality of the alphabet of constant symbols, and $T_l$ are the corrections for tied ("connected") ranks [7].
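The coefficient can be computed directly from the occurrence counts. The sketch below assumes that every ordering ranks the same $|A_l|$ $l$-grams, that tied counts receive the mean of the ranks they span, and that the tie correction of one ordering is $\frac{1}{12}\sum (t^3 - t)$ over its groups of $t$ tied ranks, as in Kendall's coefficient of concordance. All function names are hypothetical.

```python
from typing import Dict, List, Sequence


def average_ranks(counts: Sequence[int]) -> List[float]:
    """Rank objects by decreasing count; tied objects share the mean rank."""
    order = sorted(range(len(counts)), key=lambda i: -counts[i])
    ranks = [0.0] * len(counts)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and counts[order[end + 1]] == counts[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def tie_correction(ranks: Sequence[float]) -> float:
    """(1/12) * sum(t^3 - t) over the groups of t equal ranks."""
    sizes: Dict[float, int] = {}
    for r in ranks:
        sizes[r] = sizes.get(r, 0) + 1
    return sum(t ** 3 - t for t in sizes.values()) / 12


def concordance(orderings: Sequence[Sequence[float]]) -> float:
    """Coefficient of concordance of M orderings of the same n objects."""
    m = len(orderings)
    n = len(orderings[0])
    rank_sums = [sum(o[j] for o in orderings) for j in range(n)]
    mean = m * (n + 1) / 2
    v = sum((s - mean) ** 2 for s in rank_sums)
    denom = m * m * (n ** 3 - n) / 12 - m * sum(tie_correction(o) for o in orderings)
    # Assumption: if every ordering is completely tied the coefficient is
    # undefined; 0.0 is returned here by convention.
    return v / denom if denom else 0.0


ranks = [average_ranks(c) for c in ([5, 3, 3, 1], [4, 4, 2, 0])]
w_value = concordance(ranks)
```

Identical orderings give the value 1 and exactly opposite orderings give 0, which is the behaviour expected of a similarity measure.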

Fixing the length of the $l$-gram and the length of the analysed part of the text (the window) and moving the window along the sequence, one obtains the graph of the values of the concordance coefficient. We consider the maxima of this graph as the functional informative domains. Each sequence is represented as $x w y$, where $x, y \in X$ and $w \in E^+$ is one of the chosen informative domains.
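The window procedure can be sketched as follows. The sketch assumes that the samples are aligned so that the same window position can be cut from each of them, and it takes the scoring function as a parameter; the toy score used in the example (share of the most frequent window content) is only a placeholder for the concordance coefficient. All names are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence


def window_profile(seqs: Sequence[str], width: int,
                   score: Callable[[List[str]], float]) -> List[float]:
    """Score every window position over a set of aligned sequences."""
    length = min(len(s) for s in seqs)
    return [score([s[i:i + width] for s in seqs])
            for i in range(length - width + 1)]


def local_maxima(profile: Sequence[float]) -> List[int]:
    """Positions whose value is strictly greater than both neighbours."""
    peaks = []
    for i, value in enumerate(profile):
        left = profile[i - 1] if i > 0 else float("-inf")
        right = profile[i + 1] if i + 1 < len(profile) else float("-inf")
        if value > left and value > right:
            peaks.append(i)
    return peaks


def toy_score(windows: List[str]) -> float:
    """Placeholder score: share of the most frequent window content."""
    return max(Counter(windows).values()) / len(windows)


samples = ["TTAGGAC", "CAAGGTT", "GCAGGCA"]
profile = window_profile(samples, 3, toy_score)
domains = local_maxima(profile)
```

In this toy example the only maximum is at the window that covers the shared segment AGG, which would then be taken as the informative domain $w$.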

The next step of the preprocessing is the clustering of the samples. The idea of the clustering is to sequentially unite the nearest (in the above sense) sequences into clusters. The number of clusters is decreased by 1 at each iteration by uniting the two nearest clusters. This process continues until either a single cluster is obtained or the maximal value of the coefficient becomes less than some experimentally chosen threshold. The result of the preprocessing is the set of clusters $S_j$ ($S = \bigcup_j S_j$). The complexity of the preprocessing algorithms is $O(\min(n^l, n_S)\, m\, n_S)$ and $O(m^2)$, where $m$ is the number of sequences, $n_S$ is the length of the longest sequence, $l$ is the length of the $l$-gram and $n$ is the cardinality of the alphabet $E$.
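The clustering step described above is an agglomerative procedure with a similarity threshold. A minimal sketch follows; the similarity function is passed in as a parameter, and the positional-identity measure used in the example is a stand-in for the concordance coefficient, not the measure of the paper. This straightforward version recomputes all pairwise similarities at every iteration and makes no attempt to reach the complexity bounds quoted above.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def cluster(items: Sequence[T],
            similarity: Callable[[List[T], List[T]], float],
            threshold: float) -> List[List[T]]:
    """Unite the two nearest clusters until one cluster remains or the best
    similarity falls below the threshold."""
    clusters: List[List[T]] = [[item] for item in items]
    while len(clusters) > 1:
        best, (a, b) = max(
            (similarity(clusters[a], clusters[b]), (a, b))
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # the number of clusters decreases by one
    return clusters


def identity(u: str, v: str) -> float:
    """Hypothetical similarity: share of positions with equal symbols."""
    return sum(x == y for x, y in zip(u, v)) / max(len(u), len(v))


def min_link(c1: List[str], c2: List[str]) -> float:
    """Similarity of two clusters: the least similar pair decides."""
    return min(identity(u, v) for u in c1 for v in c2)


groups = cluster(["AGGA", "AGGT", "CTTC", "CTTA"], min_link, 0.75)
```

With the threshold 0.75 the four toy sequences end up in two clusters, since no pair taken from different clusters is similar enough to be united.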

The synthesis procedure is applied to the sets $S_j$. For each $S_j$ the pattern $p_j$ of the maximal length from the set of patterns $U$ is constructed. This pattern generates the language $L(p_j)$ such that $S_j \subseteq L(p_j)$. The complexity of the synthesis algorithms is $O(mn^{?})$ for regular pattern grammars, $O(mn^{?})$ for extended regular pattern grammars, and $O(mn^{?}\log n)$ for pattern grammars with one and two variables (here $n = \min\{|w| : w \in S\}$).
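To give a feeling for what the synthesis step produces, here is a greedy sketch that builds an extended regular pattern whose language contains all samples of a cluster. It is explicitly not the algorithm of [3]: it does not guarantee a pattern of maximal length, and it relies on regular-expression matching rather than on a polynomial-time procedure. Variables are represented by `None`, and the function names are hypothetical.

```python
import re
from typing import List, Optional, Sequence

VAR = None  # marks a variable position; every variable is distinct

Token = Optional[str]


def matches_all(pattern: List[Token], samples: Sequence[str]) -> bool:
    """True if every sample belongs to the extended regular pattern language."""
    regex = "".join(".*" if t is VAR else re.escape(t) for t in pattern)
    return all(re.fullmatch(regex, s, re.DOTALL) for s in samples)


def greedy_pattern(samples: Sequence[str]) -> List[Token]:
    shortest = min(samples, key=len)
    # Step 1: greedily collect constants of the shortest sample that can be
    # kept, in order, in every sample; variables separate all constants.
    pattern: List[Token] = [VAR]
    for symbol in shortest:
        candidate = pattern + [symbol, VAR]
        if matches_all(candidate, samples):
            pattern = candidate
    # Step 2: remove every variable that is not needed, which makes the
    # pattern more specific while still covering all samples.
    i = 0
    while i < len(pattern):
        if pattern[i] is VAR:
            candidate = pattern[:i] + pattern[i + 1:]
            if matches_all(candidate, samples):
                pattern = candidate
                continue
        i += 1
    return pattern


result = greedy_pattern(["TAGGAC", "AGGTC", "CCAGGC"])
```

For the three toy samples the sketch returns the pattern $x_1\,\mathrm{AGG}\,x_2\,\mathrm{C}$, whose language contains each of them.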

The program $p$ synthesized from the set of samples is presented by an R-grammar [4] with the following rules:

$r \to \{\langle G_1 \rangle \to r_1, \dots, \langle G_k \rangle \to r_k\}$,

where $\langle G_j \rangle$ is the pattern $p_j$ presented in the R-metalanguage and $k$ is the number of clusters. R-grammars without stacks are used for the presentation of regular and extended regular pattern languages; R-grammars with counter, register and car memories are used for languages with one and two variables.

To obtain efficient synthesized programs it is necessary to optimize them. Optimization here means ordering the languages by containment. It is well known that this problem is NP-hard in general.

Let us formulate the conditions of containment for extended regular pattern languages. There exists an algorithm of linear complexity that verifies these conditions [9].

Let us assume that patterns $p$ and $q$ have the following presentations:

$p = v_0 x_1 v_1 x_2 \dots x_m v_m$, $v_0, v_m \in E^*$, $v_i \in E^+$ ($i = 1, \dots, m-1$), $x_i \in X^+$ ($i = 1, \dots, m$);

$q = w_0 y_1 w_1 y_2 \dots y_n w_n$, $w_0, w_n \in E^*$, $w_j \in E^+$ ($j = 1, \dots, n-1$), $y_j \in X^+$ ($j = 1, \dots, n$).

PROPOSITION. $p \le' q$ iff the following conditions are satisfied:

(i) $w_0$ is a prefix of $v_0$, $w_n$ is a suffix of $v_m$;

(ii) there exists a set of indexes $0 \le i_1 \le i_2 \le \dots \le i_{n-1} \le m$ such that for $k = 1, \dots, n-1$ the string $w_k$ is a substring of the string $v_{i_k}$ and in the case when $i_r = i_{r+1} = \dots = i_{r+s}$ ($1 \le r+s \le n-1$) [...]

The complexity of the algorithm that verifies these conditions is $O(|p| + |q|)$. It is well known that $p \le' q \Rightarrow L(p) \subseteq L(q)$ [2].
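For small examples the relation $p \le' q$ can be checked straight from its definition when $q$ is an extended regular pattern: since every variable of $q$ occurs once, $p = f(q)$ for some substitution $f$ exactly when $p$, read as a string over $E \cup X$, matches $q$ with each variable of $q$ standing for an arbitrary (possibly empty) pattern. The sketch below does this with a regular expression; it is not the linear-time algorithm of [9]. It assumes that constants are single characters different from `?`, and the names are hypothetical.

```python
import re
from typing import List, Set


def precedes(p: List[str], q: List[str], variables: Set[str]) -> bool:
    """Check p <=' q by definition, for an extended regular pattern q."""
    q_vars = [t for t in q if t in variables]
    if len(q_vars) != len(set(q_vars)):
        raise ValueError("q must be a regular pattern")
    # Variables of p are encoded as "?", which no constant of q can match,
    # so they can only be produced by the variables of q.
    encoded_p = "".join("?" if t in variables else t for t in p)
    regex = "".join(".*" if t in variables else re.escape(t) for t in q)
    return re.fullmatch(regex, encoded_p, re.DOTALL) is not None


V = {"x1", "x2"}
specific = ["x1", "A", "G", "G", "x2", "C"]
general = ["x1", "G", "x2", "C"]
ordered = precedes(specific, general, V)
```

Here `ordered` is true, so $L(\mathit{specific}) \subseteq L(\mathit{general})$, and a recognizer may test the more specific pattern first.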

The method is implemented for OB EC computers.

4. An example of the practical task

The method was used for solving the task of recognition of ribosomal binding sites in genetic texts.

The set of samples consisted of 86 phage (φX174, G4, FD, MS2, R17, T7, λ, Qβ) and bacterial (E. coli) sites. All sites were aligned according to the ATG or GTG codons which initiate the process of translation. We assume that functional domains of different sites lie at approximately equal distances from these codons.

The choice of the informative domains was made from the graph with $l = 2$. The chosen informative domains were clustered with the concordance coefficient threshold equal to 0.75. The extended regular pattern languages were used for describing inductive generalizations in each cluster. Experiments called "sliding control" (cross-validation) were used to estimate the quality of the synthesized program. The experiments consisted of removing the sites which belong to one of the genomes (φX174, G4, FD, MS2) and performing the control on these texts.

Table. The results of recognition of ribosomal binding sites by algorithms A11 (l = 2), A12 (D = 7, l = 2), A21, A22 (D = 7, l = 5) and the synthesized algorithm A33 (D = 7, l = 3). Each entry gives the number of errors s→ns / ns→s; "?" marks an illegible value.

Type of experiment  Set of samples  Number of fragments  A11      A21      A12      A22      A33
-                   S ∪ S̄           86 s; 86 ns          2 / 9    5 / 12   8 / 7    3 / 13   9 / 13
Sliding control     φX174           11 s; 195 ns         0 / 26   0 / 21   0 / 24   0 / 15   0 / ?
                    G4              11 s; 176 ns         0 / 19   0 / 21   0 / 18   0 / 15   1 / 8
                    FD              10 s; 175 ns         0 / 11   1 / 12   2 / 5    0 / 3    5 / 6
                    MS2             ? s; 109 ns          0 / 5    1 / 9    1 / 10   1 / 13   1 / ?
Control text        φX174           11 s; 195 ns         1 / 21   1 / 19   3 / 17   2 / 13   2 / 29
                    G4              11 s; 176 ns         0 / 23   1 / 21   1 / 20   2 / 19   1 / 14
                    FD              10 s; 195 ns         5 / 4    2 / 11   6 / 4    5 / 2    5 / 8
                    MS2             ? s; 109 ns          1 / 2    1 / 9    1 / 10   1 / 13   2 / 5
The sum of the errors                                    57       65       62       57       66

The results of recognition of ribosomal binding sites by the synthesized program A33 and the algorithms $A_{ij}$ [6] are collected in the table.

The algorithms $A_{ij}$ use sets of positive and negative samples. They combine permissible combinations of different ways of describing symbolic sequences and of different types of decision rules. The $l$-gram and common subsequence languages ($i = 1, 2$, respectively) and decision rules of the threshold and taxonomy types ($j = 1, 2$, respectively) are used.

The following notation is used in the table:

"s → ns" - a site is not recognized;

"ns → s" - a nonsite is recognized as a site;

D - the width of the window.
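The two error counts and the experiment in which the sites of one genome are removed from the samples and then used for control can be sketched as follows. The sample format, the `train` callback and the trivial recognizer in the example are assumptions made for the illustration; they stand in for the synthesis procedure and the synthesized program.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Sample = Tuple[str, str, bool]          # (genome name, fragment, is_site)
Recognizer = Callable[[str], bool]


def count_errors(recognize: Recognizer,
                 samples: Iterable[Sample]) -> Tuple[int, int]:
    """Return (s->ns, ns->s): missed sites and nonsites accepted as sites."""
    missed = false_sites = 0
    for _genome, fragment, is_site in samples:
        predicted = recognize(fragment)
        if is_site and not predicted:
            missed += 1
        elif not is_site and predicted:
            false_sites += 1
    return missed, false_sites


def leave_one_genome_out(samples: List[Sample],
                         train: Callable[[List[Sample]], Recognizer]
                         ) -> Dict[str, Tuple[int, int]]:
    """Train without one genome, then count the errors on that genome."""
    results = {}
    for genome in sorted({g for g, _, _ in samples}):
        training = [s for s in samples if s[0] != genome]
        held_out = [s for s in samples if s[0] == genome]
        results[genome] = count_errors(train(training), held_out)
    return results


def toy_train(training: List[Sample]) -> Recognizer:
    """Placeholder for the synthesis step: accepts fragments containing AGG."""
    return lambda fragment: "AGG" in fragment


data = [("g1", "TAGGA", True), ("g1", "TTTTT", False),
        ("g2", "CCCAG", True), ("g2", "AAGGA", False)]
errors = leave_one_genome_out(data, toy_train)
```

Each value of `errors` is a pair in the order used in the table: first the number of unrecognized sites, then the number of nonsites recognized as sites.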

A comparison of the results of processing genetic texts by the synthesized algorithm with those of four other, human-written programs for the same task has shown that the results of the synthesized algorithm are not worse than the results of the human-written programs.

References:

1. Angluin D. Inductive inference of formal languages from positive data, Inform. and Control, 45, N 3 (1980), 119-135.

2. Angluin D. Finding patterns common to a set of strings, J. Comput. and Syst. Sci., 21, N 1 (1980), 46-62.

3. Shinohara T. Polynomial time inference of extended regular pattern languages, LNCS, N 147 (1983), 115-127.

4. Velbitsky I.V. Metalanguage of R-grammars, Kibernetika, N 5 (1973), 47-63 (in Russian).

5. Gusev V.D., Kosarev Yu.G., Timofeeva, Titkova T.N., Chuzhanova N.A. A package of applied programs for the analysis of arbitrary symbolic sequences [...] (SIMVOL), Vychislitel'nye sistemy, issue 101 (1984), 5-21 (in Russian).

6. Gusev V.D., Kulichkov V.A., Chuzhanova N.A. Algorithms for detecting punctuation marks in genetic texts, Vychislitel'nye sistemy, issue 124 (1987), 1-24 (in Russian).

7. Kendall M. Rank Correlations. Moscow: Statistika, 1975 (in Russian).

8. Chuzhanova N.A. Grammatical method of program synthesis, Vychislitel'nye ... (in Russian).

9. Chuzhanova N.A. On the grammatical method of synthesis of graphical programs, Vychislitel'nye sistemy, issue 123 (1987), 50-60 (in Russian).
