Eli Shamir Hebrew university of Jerusalem, Israel ISCOL Haifa university September 2014


Transformation schemes for context-free grammars

Structural, algorithmic, linguistic applications

Eli Shamir

Hebrew University of Jerusalem, Israel

ISCOL, Haifa University, September 2014

Overview

• CFG: devices producing strings and their derivation trees (with weights)

• Top-down schemes transforming the grammars

• Driven by a rotation operations tree (BOT)

• Preserving derivation trees and semiring weights

• Enhancing property tests, parsing and optimal-tree algorithms: time down to O(n²), space to O(n)

• Decomposition of bounded-ambiguity grammars (Sam Eilenberg’s question [SE])

• Non-expansive [NE] (quasi-rational) grammars

• Implications for NLP, sequence alignment, …

Schemes - simple to subtle

• Chomsky’s normal form (CNF)

• Elimination of redundant symbols, ε-rules

• Greibach’s normal form (GNF) (subtle): all rules are A → T x, T terminal (lexicalization)

GNF destroys derivation trees; however, it has many applications (structural…).

Schemes for sub-classes of CFG (in parsing technology): deterministic, LR(k)…

Context Free Basics 1

Such a grammar G = (V, T, P, S = root) is a well-known model to derive/generate a set of terminal strings in T*. G defines a derivation relation between strings over V ∪ T:

One step x ⇒ y: y is obtained from x by rewriting a single occurrence of some A by B1..Bk when A → B1..Bk is a production rule in P.

Several steps x ⇒* y if x ⇒ x1 ⇒ … ⇒ y.

L_A(G) = {w ∈ T* | A ⇒* w}, L(G) = L_S(G), the language generated by G.

A derivation is best described by a labeled tree in which the k sons of a node labeled A are labeled B1, .., Bk.

Ambiguity: deg(A ⇒* w) = number of distinct trees for (A ⇒* w); deg(G_A) = max over w of deg(A ⇒* w).

A ⇒* …B… defines a partial order on V ∪ T, denoted A > B. It induces a complete order on any branch of a derivation tree.

B in G is pumping if B > B' > B. Then B' is also pumping; both belong to the pumping equivalence class [B].
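The ambiguity degree of a string (its number of distinct derivation trees) can be computed by a counting variant of the CYK table. The sketch below is only an illustration: it assumes the grammar is given in Chomsky normal form, and the rule encoding and the toy grammar S → S S | a are assumptions, not taken from the slides.

```python
from collections import defaultdict

def count_trees(binary_rules, lexical_rules, start, w):
    """Count distinct derivation trees of w rooted at `start`.

    binary_rules:  list of (A, B, C) for rules A -> B C   (assumed CNF)
    lexical_rules: list of (A, a)    for rules A -> a
    """
    n = len(w)
    # count[(i, j)][A] = number of trees with root A deriving w[i:j]
    count = defaultdict(lambda: defaultdict(int))
    for i, ch in enumerate(w):
        for A, a in lexical_rules:
            if a == ch:
                count[(i, i + 1)][A] += 1
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):  # split point
                for A, B, C in binary_rules:
                    left = count[(i, k)].get(B, 0)
                    right = count[(k, j)].get(C, 0)
                    if left and right:
                        count[(i, j)][A] += left * right
    return count[(0, n)].get(start, 0)

# Toy grammar (illustrative assumption): S -> S S | a
binary = [("S", "S", "S")]
lexical = [("S", "a")]
degrees = [count_trees(binary, lexical, "S", "a" * n) for n in range(1, 6)]
print(degrees)
```

For this toy grammar the counts grow with the length of the string, which is the kind of unbounded ambiguity the later slides discuss.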

Context Free Basics 2

Node Type and Spread Lemma

(i) B pumping; (ii) C pre-terminal: if NOT {C > B, B pumping}; (iii) D spread: D is not pumping but D > B, B pumping.

SPREAD LEMMA

1. A pre-terminal C derives a bounded number of bounded terminal strings.

2. In each derivation tree a spread node D derives a bounded sub-tree, the leaves of which are terminals or pump nodes.

3. In G, each spread symbol D derives a bounded number of sub-trees, as mentioned in 2.
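The three node types can be computed from the relation A > B by a transitive closure over the productions. The following is a minimal sketch; the dictionary encoding of rules and the toy grammar are assumptions made for illustration.

```python
def reach_closure(rules):
    """reach[A] = set of symbols B with A > B (A derives a string containing B).

    rules: dict mapping each non-terminal to a list of right-hand sides (tuples).
    """
    symbols = set(rules) | {s for rhss in rules.values() for rhs in rhss for s in rhs}
    reach = {A: set() for A in symbols}
    for A, rhss in rules.items():
        for rhs in rhss:
            reach[A].update(rhs)
    changed = True
    while changed:  # naive transitive closure, fine for small grammars
        changed = False
        for A in symbols:
            new = set()
            for B in reach[A]:
                new |= reach[B]
            if not new <= reach[A]:
                reach[A] |= new
                changed = True
    return reach

def classify(rules):
    """Label each non-terminal as 'pumping', 'spread' or 'pre-terminal'."""
    reach = reach_closure(rules)
    pumping = {A for A in rules if A in reach[A]}
    kinds = {}
    for A in rules:
        if A in pumping:
            kinds[A] = "pumping"
        elif reach[A] & pumping:
            kinds[A] = "spread"        # not pumping, but above a pumping symbol
        else:
            kinds[A] = "pre-terminal"  # no pumping symbol below it
    return kinds

# Toy grammar (illustrative assumption); lowercase strings are terminals.
toy = {
    "S": [("D", "D")],
    "D": [("B", "c")],
    "B": [("a", "B", "b"), ("C",)],
    "C": [("c",), ("d",)],
}
kinds = classify(toy)
print(kinds)
```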

Non Expansive Grammars

G is non-expansive (NE) if no production rule has the form B → …B'…B''… where the B's are from the same pumping class. Equivalently, no derivation B ⇒* …B…B… is possible (sideways pumping is forbidden!).

NE is the quasi-rational class, the substitution closure of linear grammars [1]. Our BOT scheme simplifies proofs of its known properties and yields new ones (parsing speed).
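The rule-based form of the NE condition can be tested mechanically: look for a production whose left side is pumping and whose right side contains two occurrences of symbols from the same pumping class. The sketch below is an illustration only; the rule encoding and the three toy grammars are assumptions.

```python
def reach_closure(rules):
    """reach[A] = set of symbols B such that A derives a string containing B."""
    symbols = set(rules) | {s for rhss in rules.values() for rhs in rhss for s in rhs}
    reach = {A: set() for A in symbols}
    for A, rhss in rules.items():
        for rhs in rhss:
            reach[A].update(rhs)
    changed = True
    while changed:
        changed = False
        for A in symbols:
            new = set()
            for B in reach[A]:
                new |= reach[B]
            if not new <= reach[A]:
                reach[A] |= new
                changed = True
    return reach

def is_non_expansive(rules):
    """True if no rule B -> ...B'...B''... has B, B', B'' in one pumping class."""
    reach = reach_closure(rules)
    for A, rhss in rules.items():
        if A not in reach[A]:
            continue  # A is not pumping, so it belongs to no pumping class
        for rhs in rhss:
            # occurrences on the right side that lie in the pumping class [A]
            same_class = [s for s in rhs if A in reach[s]]
            if len(same_class) >= 2:
                return False
    return True

# Toy grammars (illustrative assumptions); lowercase and bracket strings are terminals.
palindromes = {"S": [("a", "S", "a"), ("b",)]}            # linear
product = {"S": [("A", "A")], "A": [("a", "A"), ("a",)]}  # product of linear grammars
dyck = {"S": [("S", "S"), ("(", "S", ")"), ()]}           # sideways pumping

print(is_non_expansive(palindromes), is_non_expansive(product), is_non_expansive(dyck))
```

The check on rules suffices because a derivation B ⇒* …B…B… must pass through a rule that places two symbols of the class [B] side by side.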

Bounded Operation Tree (BOT)

BOT tree-nodes are labeled by:

• Current grammar as a product Π = P1…Pk

• Current operation SPREAD / CYC / TTR (depending on the type of the root of P1 or Pk), which determines the children nodes and their labels

Root of BOT = #G; leaves of BOT: linear G(i)

Main Claim: each derivation tree for w w.r.t. #G is mapped onto a derivation tree for ƍw w.r.t. some G(i) (with the same weight), and vice versa.

SPREAD / CYC / TTR Operations

Type=SPREAD: Pk is split into ∪ Q(j); the current grammar at the j’th child is P1…Pk-1 * Q(j).

Type=CYC: Pk is terminal; the (effective) current grammar at the single child is Pk P1…Pk-1.

Type=TTR, if the root of Pk is pumping: let M = P1…Pk-1, N = Pk; the top trunk of N is rotated by 180° and mounted on M, so MN → M*N^.

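Top Trunk Rotation of MN to (M*N^)

[Figure 1.1: top trunk rotation. For trees: the top trunk of N, with side parts x1, x2, … and y1, y2, … around the exit N^, is rotated by 180° and mounted on M, giving M*.]

For strings: m x1x2 … n^ …y2y1 → …y2y1 m x1x2 … n^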

Figure 1.2: TTR for grammars

N grammar (top trunk)        M* grammar
B → B’C                      B’ → CB
B → DB’                      B’ → BD
B → B^, B^ → α               B → root(M), root(M) → α

All other productions carry over from N to M*; those of M are unchanged.

The TTR rotation is invertible, one-to-one and onto for the derivation trees, preserving ambiguity in the ‘cyclic rotated’ sense.

Termination and Correctness

TTR operations dominate the BOT scheme for NE grammars. The E-depth of N^ and of the two sides of the mounted trunk must decrease. The M* factors become taller and thinner until they become linear G(i). [without spread symbols]

Claim: each derivation tree for w w.r.t. #G is mapped to a derivation tree for ƍw w.r.t. some G(i) (with the same weight), and vice versa.

ƍw = cyclic rotation of w. This holds for each SPREAD/CYC/TTR step!

Tabular Dynamic Programming for Parsing G

(CYK / Earley algorithm.) For a terminal string w of length n, the table extends to items over rotated intervals [i+1, i+k (mod n), A → BC], at the same cost. For linear G(i) the total time cost is only O(n²). Space cost is O(n): one or a few diagonals of width near k are kept in memory, with pointers to a few neighbors, enabling table reconstruction.

• Just membership, or the total-weight algorithm, is in the parallel class NC(1), as for finite-state transductions.
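The quadratic time and linear space bounds for linear grammars can be illustrated by a recognizer that fills the table one diagonal (interval length) at a time and keeps only the previous diagonal. This is a sketch under stated assumptions: the grammar is assumed to be in the normal form A → a B, A → B a, A → a; it handles ordinary intervals only (not the rotated ones); and it does no table reconstruction. The function name and the toy grammar are hypothetical.

```python
def recognize_linear(left_rules, right_rules, lexical_rules, start, w):
    """Membership test for a linear grammar in O(n^2 * |G|) time, O(n) space.

    left_rules:    (A, a, B) for rules A -> a B
    right_rules:   (A, B, a) for rules A -> B a
    lexical_rules: (A, a)    for rules A -> a
    """
    n = len(w)
    if n == 0:
        return False
    # prev[i] = non-terminals deriving the interval of the previous length at i
    prev = [{A for A, a in lexical_rules if a == w[i]} for i in range(n)]
    for length in range(2, n + 1):
        cur = []
        for i in range(n - length + 1):
            j = i + length  # current interval is w[i:j]
            items = set()
            for A, a, B in left_rules:   # A -> a B: peel the first letter
                if w[i] == a and B in prev[i + 1]:
                    items.add(A)
            for A, B, a in right_rules:  # A -> B a: peel the last letter
                if w[j - 1] == a and B in prev[i]:
                    items.add(A)
            cur.append(items)
        prev = cur  # only one diagonal is kept in memory
    return start in prev[0]

# Toy grammar (illustrative assumption): palindromes over {a, b} with center c,
# S -> a A | b B | c,  A -> S a,  B -> S b
LEFT = [("S", "a", "A"), ("S", "b", "B")]
RIGHT = [("A", "S", "a"), ("B", "S", "b")]
LEX = [("S", "c")]

print(recognize_linear(LEFT, RIGHT, LEX, "S", "abcba"))
print(recognize_linear(LEFT, RIGHT, LEX, "S", "abcab"))
```

Each interval of length k depends only on the two intervals of length k-1 nested in it, which is why a single diagonal is enough for membership.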

Example (from [4])

• (M)(N) = (u I uᴿ)(v J vᴿ), u, v ∈ {0,1}* = I = J, uᴿ = reversal of u.

• It has unbounded "direct (product) ambiguity", which increases the time in the Earley algorithm. But after one TTR step MN is rotated to

• (M*)(N^) = (vᴿ u I uᴿ v)(J), which has a linear grammar (of unbounded ambiguity degree).

And all product-ambiguity trees are rotated to a union of trees for the linear M*N^.
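At the level of strings, the rotation in this example can be checked directly. The sketch below assumes, purely for illustration, that I and J are single marker characters (the slide does not pin down their exact nature); the helper names are hypothetical. It moves the trailing reversed copy of v to the front and tests that the result has the palindrome-like shape s I sᴿ J that a linear grammar can generate.

```python
def product_string(u, v):
    """Build a string of the product (u I u^R)(v J v^R); I, J assumed to be markers."""
    return u + "I" + u[::-1] + v + "J" + v[::-1]

def rotate_tail_to_front(w, tail_len):
    """Cyclic rotation moving the last tail_len characters to the front."""
    cut = len(w) - tail_len
    return w[cut:] + w[:cut]

def has_linear_shape(x):
    """True if x = s I s^R J for some s over {0, 1}."""
    if not x.endswith("J"):
        return False
    body = x[:-1]
    if body.count("I") != 1:
        return False
    s, t = body.split("I")
    return t == s[::-1]

u, v = "011", "10"
w = product_string(u, v)
rotated = rotate_tail_to_front(w, len(v))
print(w, rotated, has_linear_shape(rotated))
```

Here s = vᴿu, so the rotated string is s I sᴿ J, matching the linear form stated on the slide.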

Decomposing Bounded Ambiguity

SE Claim: Let Ambiguity-deg(G) = l < ∞. Then L(G) is a bounded-size union of languages of deg-1 grammars. This provides a positive answer to a question Sam Eilenberg posed, c. 1970.

"Bounded size" means polynomial in |G|, the size of the grammar G, and l.

Expansive G and Ambiguity

G expansive ⇒ each pump symbol has ambiguity-degree = 1 or unbounded (exponential in length).

B ⇒* …B…B…B… …B… (k times). If deg B ≥ 2 then deg B ≥ 2ᵏ.

This is a cornerstone in the proof of SE.

• Extending ambiguity to cyclic-closed strings is helpful (cf. last slides)

Proof of SE

We briefly sketch the scheme for proving the claim. Starting with #G, and using the SPREAD LEMMA, the claim reduces to:

LEMMA. Let Π = M N(1)…N(k), deg M = 1, deg Π = l < ∞, where the N(i) are terminals or have pump roots. Then L(M) = ∪ L(M(j)), j ∈ J, with deg M(j) = 1 and J bounded.

It suffices to prove it for a pair, starting with M N(1), after which M(j) N(2) are decomposed, and so on.

Proof of SE (2)

For a pair MN the operation TTR is used, transforming it to M*N^. Now deg M* < l and its ambiguity must be concentrated along the top trunk which it got from N. An easy direct argument shows it decomposes into a bounded union of M(j) of deg 1. As for N^, its E-depth is smaller than that of N, so for M(j)N^ we can use induction on the E-depth of the second factor or, more explicitly, continue the recursive descent on N^ until it is consumed.

Approximate G by NE G’

• Easy to achieve by duplicating symbols of the pumping classes.

• Makes linguistic sense

• Advantages of NE G’ using the BOT scheme: view the linear G’i as finite-state transductions, a powerful tool in several linguistic fields

• Applications to bioinformatics (stringology)?

• Extension of the NE condition to mildly context-sensitive models (LIG, TAG…)?

The Hardest Context Free Grammar

The concept is due to S. Greibach. The simplest reduction is based on Shamir's homomorphism theorem ([1]), which maps each b in T to a finite set φ(b) of strings over the vocabulary of the Dyck language and states that w is in L(G) if and only if φ(w) contains a string in the Dyck language (see the description in [1]).
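The shape of this membership criterion can be sketched as a brute-force test: pick one string from φ(b) for each letter b of w, concatenate, and ask whether any such choice lands in the Dyck language. This illustrates the general form of the statement only; it is not the construction in [1], and the mapping `toy_phi`, the bracket alphabet and the function names are assumptions made for the example. The enumeration is exponential in the length of w.

```python
from itertools import product

def in_dyck(s, pairs):
    """Stack-based test that s is a well-nested bracket string."""
    closers = {close: open_ for open_, close in pairs.items()}
    stack = []
    for ch in s:
        if ch in pairs:
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
        else:
            return False  # symbol outside the bracket vocabulary
    return not stack

def phi_hits_dyck(phi, w, pairs):
    """True if some choice of one string per letter of w concatenates into Dyck."""
    choices = [phi[b] for b in w]
    return any(in_dyck("".join(c), pairs) for c in product(*choices))

# Hypothetical finite substitution over two bracket pairs (illustration only).
PAIRS = {"(": ")", "[": "]"}
toy_phi = {"a": ["(", "["], "b": [")"], "c": ["]"]}

print(phi_hits_dyck(toy_phi, "aabc", PAIRS))
print(phi_hits_dyck(toy_phi, "abc", PAIRS))
```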

In fact, the categorial grammar model in the 1960 article ([2]) provides another homomorphism which makes it a hardest CFG.

However, those hardest CFG languages are inherently expansive. Indeed, an NE candidate grammar for Dyck will be negated by its BOT scheme, upon using local pump-shrinks, which for linear grammars can operate near any point of the (sufficiently long) main branch of non-terminals.

We conjecture that any hardest CFG must be expansive. Note that finding a non-expansive one would entail O(n²) complexity of the membership test for any context-free grammar.

Ambiguity and Cyclic Rotation

Ambiguity in natural languages can be resolved (or created) by cyclic rotation. Consider the Bible verse in the book of Job, chapter 6, verse 14 (six Hebrew words). Translated to English: "a friend should extend mercy to the sufferer, even if he abandons God's fear."

• The ambiguity here is anaphoric: does the pronoun "he" refer to the sufferer or to the friend? The beautiful poetic answer is: to both.

• The rotated sentences, starting at the symbols # and $, resolve the ambiguity one way or the other.

• Politically loaded example: the policeman shot the boy with the gun

[Slide markers # and $: the two rotation start points in the example sentences.]

References

1. J. Autebert, J. Berstel and L. Boasson, Context-free languages and pushdown automata. Chap. 3 in: Handbook of Formal Languages, Vol. 1, G. Rozenberg and A. Salomaa (eds.), Springer-Verlag, 1997.

2. Y. Bar-Hillel, H. Gaifman and E. Shamir, On categorial and phrase structure grammars. Bulletin of the Research Council of Israel, vol. 9F (1960), 1-16.

3. S. Greibach, The hardest context-free language. SIAM J. on Computing 2 (1973), 304-310.

4. E. Shamir, Some inherently ambiguous context-free languages. Inf. and Control 18 (1971), 355-363.
