
    Do We Need Grammar for Natural Language Semantics?

    Shalom Lappin, University of Gothenburg

    Workshop on Machine Learning, Chalmers University of Technology

    April 14, 2016


    Outline

    Classical Approaches to Semantic Representation

    Distributional Models of Meaning

    Semantic Interpretation as Machine Translation

    Conclusions

    Classical Theories of Formal Syntax

    • Linguistic theory has traditionally represented syntactic knowledge through formal grammars.

    • A formal grammar G specifies a language L ⊆ Σ*, where Σ* is the set of all possible strings over a finite alphabet (vocabulary) Σ (see the toy example below).

    • Such a grammar is (in most cases) a generative device (with constraints) that assigns structures to the strings of the language that it defines.
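    As a toy illustration of these definitions (not from the slides), the sketch below fixes Σ = {a, b} and takes L = {aⁿbⁿ : n ≥ 1} ⊆ Σ*, the language defined by the grammar S → a S b | a b:

```python
# Toy illustration (not from the slides): the grammar S -> a S b | a b
# defines the language L = { a^n b^n : n >= 1 }, a subset of Sigma* with Sigma = {a, b}.

SIGMA = {"a", "b"}

def in_language(s: str) -> bool:
    """Recognizer for L = { a^n b^n : n >= 1 }."""
    n = len(s) // 2
    return len(s) >= 2 and len(s) % 2 == 0 and s == "a" * n + "b" * n

def generate(max_n: int):
    """Generate the strings of L up to a^max_n b^max_n."""
    return ["a" * n + "b" * n for n in range(1, max_n + 1)]

print(generate(3))          # ['ab', 'aabb', 'aaabbb']
print(in_language("aabb"))  # True
print(in_language("abab"))  # False: a string of Sigma* outside L
```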


    Classical Theories of Formal Semantics

    • On the dominant approach to semantic theory, interpretation consists in mapping the categories of a formal syntax into semantic types that correspond to kinds of denotation in a model theory.

    • Semantic rules compute the denotation of an expression as a relation on the denotations of its syntactic constituents, in tandem with the syntactic operations that generate (or parse) the expression.

    • Recently, proof theoretic accounts have been proposed for representing meaning in terms of inference rather than denotation (Moss (2010), Francez and Dyckhoff (2011)).


    Classical Semantic Theories

    • Classical semantic theories (Montague (1974)), as well as dynamic (Kamp and Reyle (1993)) and underspecified (Fox and Lappin (2010)) frameworks, use categorical type systems.

    • A type T identifies a set of possible denotations for expressions in T.

    • The theory specifies combinatorial operations for deriving the denotation of an expression from the values of its constituents (see the toy sketch below).
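    A minimal, purely illustrative sketch (not from the slides) of type-driven interpretation in a toy model: entities have type e, truth values type t, an intransitive verb denotes a function of type ⟨e, t⟩, and the denotation of a sentence is obtained by applying the verb's denotation to the subject's:

```python
# Toy model-theoretic interpretation (illustrative only).
# Type e = entities, type t = booleans, <e,t> = functions from entities to booleans.

DOMAIN = {"john", "mary", "fido"}

# Lexical denotations: proper names denote entities (type e);
# an intransitive verb denotes its extension, encoded as a function of type <e,t>.
denotation = {
    "John":   "john",
    "Mary":   "mary",
    "smokes": (lambda x: x in {"john"}),   # [[smokes]] : e -> t
    "barks":  (lambda x: x in {"fido"}),   # [[barks]]  : e -> t
}

def interpret_sentence(subject: str, verb: str) -> bool:
    """Compositional rule: [[Subj V]] = [[V]]([[Subj]])  (function application)."""
    return denotation[verb](denotation[subject])

print(interpret_sentence("John", "smokes"))                  # True
print(interpret_sentence("Mary", "smokes"))                  # False
print({x for x in DOMAIN if denotation["smokes"](x)})        # the extension of "smokes"
```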


    Problems with the Classical Approach

    • These theories cannot represent the gradience of semantic properties that is pervasive in speakers' judgements concerning truth, predication, and meaning relations.

    • Even when embedded in probabilistic grammars they are not computationally robust, and they do not yield wide coverage systems.

    • They focus on compositional semantic relations, but do not provide empirically broad or computationally interesting treatments of lexical semantics.

    • It is not clear how classical semantic theories could be learned from available linguistic data.


    Vector Space Models

    • Vector Space Models (VSMs) (Turney and Pantel (2010)) offer a fine-grained distributional method for identifying a range of semantic relations among words and phrases.

    • They are constructed from matrices in which words are listed vertically on the left, and the environments in which they appear are given horizontally along the top (a construction sketch follows below).

    • These environments specify the dimensions of the model, corresponding to words, phrases, documents, units of discourse, or any other objects for tracking the occurrence of words.

    • They can also include data structures encoding extra-linguistic elements, like visual scenes and events.
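    A minimal sketch of how such a word-context matrix can be built, assuming a toy corpus and a symmetric word window as the environments (neither of which comes from the slides):

```python
import numpy as np

# Toy corpus; in practice the rows and columns would come from a large corpus.
corpus = [
    "the financial market fell on economic news",
    "the chip runs a sequential algorithm",
    "a distributed algorithm beats a sequential algorithm",
]

window = 2  # context = words within +/- 2 positions of the target word

tokens = [sent.split() for sent in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Rows = target words, columns = context words (the dimensions of the model).
M = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                M[index[w], index[sent[j]]] += 1

print(M[index["algorithm"]])  # the frequency vector for "algorithm"
```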


    A Word-Context Matrix

                  context 1   context 2   context 3   context 4
    financial         0           6           4           8
    market            1           0          15           9
    share             5           0           0           4
    economic          0           1          26          12
    chip              7           8           0           0
    distributed      11          15           0           0
    sequential       10          31           0           1
    algorithm        14          22           2           1

    Matrices and Vectors

    • The integers in the cells of the matrix give the frequency of the word in an environment.

    • A vector for a word is the row of values across the dimension columns of the matrix.

    • The vectors for chip and algorithm are [7 8 0 0] and [14 22 2 1], respectively.


    Measuring Semantic Distance

    • A pair of vectors from a matrix can be projected as lines from a common point on a plane.

    • The smaller the angle between the lines, the greater the similarity of the terms, as measured by their co-occurrence across the dimensions of the matrix.

    • Computing the cosine of this angle is a convenient way of measuring the angle between a pair of vectors (a worked example follows below).

    • If $\vec{x} = \langle x_1, x_2, \ldots, x_n \rangle$ and $\vec{y} = \langle y_1, y_2, \ldots, y_n \rangle$ are two vectors, then

      $$\cos(\vec{x}, \vec{y}) = \frac{\sum_{i=1}^{n} x_i \cdot y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 \cdot \sum_{i=1}^{n} y_i^2}}$$
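    Applying this definition to the chip and algorithm rows of the word-context matrix above (a minimal sketch):

```python
import numpy as np

chip      = np.array([7, 8, 0, 0])
algorithm = np.array([14, 22, 2, 1])
financial = np.array([0, 6, 4, 8])

def cosine(x: np.ndarray, y: np.ndarray) -> float:
    """cos(x, y) = (sum_i x_i * y_i) / sqrt(sum_i x_i^2 * sum_i y_i^2)"""
    return float(np.dot(x, y) / np.sqrt(np.sum(x**2) * np.sum(y**2)))

print(round(cosine(chip, algorithm), 3))  # ~0.98: high similarity
print(round(cosine(chip, financial), 3))  # ~0.42: lower similarity
```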


    Measuring Semantic Distance

    • The cosine of $\vec{x}$ and $\vec{y}$ is their inner product, formed by summing the products of the corresponding elements of the two vectors, normalized relative to the lengths of the vectors.

    • In computing $\cos(\vec{x}, \vec{y})$ it may be desirable to apply a smoothing function to the raw frequency counts in each vector, to compensate for sparse data or to filter out the effects of high frequency terms (one common re-weighting is sketched below).

    • A higher value for $\cos(\vec{x}, \vec{y})$ correlates with greater semantic relatedness of the terms associated with the $\vec{x}$ and $\vec{y}$ vectors.
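    The slides do not name a particular re-weighting scheme; positive pointwise mutual information (PPMI) is one widely used choice, sketched here for illustration:

```python
import numpy as np

def ppmi(counts: np.ndarray) -> np.ndarray:
    """Positive PMI: max(0, log2( p(w,c) / (p(w) p(c)) )), computed cell-wise."""
    total = counts.sum()
    p_wc = counts / total
    p_w = p_wc.sum(axis=1, keepdims=True)   # row (word) marginals
    p_c = p_wc.sum(axis=0, keepdims=True)   # column (context) marginals
    expected = p_w * p_c
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / expected)
    pmi[~np.isfinite(pmi)] = 0.0             # cells with zero counts get weight 0
    return np.maximum(pmi, 0.0)

M = np.array([[7, 8, 0, 0],     # chip
              [14, 22, 2, 1]])  # algorithm
print(ppmi(M).round(2))
```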


    VSMs as Representations of Lexical Meaning and Learning

    • VSMs provide highly successful methods for identifying a variety of lexical semantic relations, including synonymy, antonymy, polysemy, and hypernym classes.

    • They also perform very well in unsupervised sense disambiguation tasks.

    • VSMs offer a distributional view of lexical semantic learning.

    • On this approach speakers acquire lexical meaning by estimating the environments (linguistic and non-linguistic) in which the words of their language appear.


    Compositional VSMs (CVSMs)

    • VSMs measure semantic distances and relations among words independently of syntactic structure (bag of words).

    • Recent work has sought both to integrate syntactic information into the dimensions of the vector matrices (Pado and Lapata (2007)), and to extend VSM semantic spaces to the compositional meanings of sentences.

    • Mitchell and Lapata (2008) compare additive and multiplicative models for computing the vectors of complex syntactic constituents, and they demonstrate better results with the latter for sentential semantic similarity tasks (both composition functions are sketched below).

    • These models use simple functions for combining constituent vectors, and they do not represent the dependence of composite vectors on syntactic structure.
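    A minimal sketch of the two composition functions compared by Mitchell and Lapata (2008), applied to made-up toy word vectors:

```python
import numpy as np

# Toy word vectors (made up for illustration).
u = np.array([0.2, 0.8, 0.1, 0.4])   # e.g. a vector for "practical"
v = np.array([0.5, 0.6, 0.0, 0.3])   # e.g. a vector for "difficulty"

additive = u + v            # additive model: p = u + v
multiplicative = u * v      # multiplicative model: p = u ⊙ v (point-wise product)

print(additive)             # [0.7  1.4  0.1  0.7 ]
print(multiplicative)       # [0.1  0.48 0.   0.12]
```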


    A Syntactically Driven Compositional VSM

    • Coecke et al. (2010) and Grefenstette et al. (2011) propose a procedure for computing vector values for sentences that specifies a correspondence between the vectors and the syntactic structures of their constituents.

    • This procedure relies upon a category theoretic representation of the types of a pregroup grammar (PGG, Lambek (2007, 2008)), which builds up complex syntactic categories through direction-marked function application in a manner similar to a basic categorial grammar.

    • All sentences receive vectors in the same vector space, and so they can be compared for semantic similarity using measures like cosine.


    Tensor Products

    • A PGG CVSM computes the values of a complex syntactic structure through a function that computes the tensor product of the vectors of its constituents, while encoding the correspondence between their grammatical types and their semantic vectors.

    • For two (finite) vector spaces A, B, their tensor product A ⊗ B is constructed from the Cartesian product of the vectors in A and B.

    • For any two vectors v ∈ A, w ∈ B, v ⊗ w is the vector consisting of all possible products $v_i \cdot w_j$ of the components of v and w (see the numerical sketch below).

    • Smolensky (1990) uses tensor products of vector spaces to construct representations of complex structures (strings and trees) from the distributed variables and values of the units in a neural network.
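    A numerical sketch of the tensor product of two made-up vectors: the components of v ⊗ w are all products v_i · w_j, which numpy exposes as an outer product:

```python
import numpy as np

v = np.array([1, 2])        # v in A
w = np.array([3, 4, 5])     # w in B

tensor = np.outer(v, w)     # (v ⊗ w)_ij = v_i * w_j
print(tensor)
# [[ 3  4  5]
#  [ 6  8 10]]
print(tensor.flatten())     # the same object viewed as a vector in A ⊗ B (dim 2 * 3 = 6)
```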


    Computing the Vector of a Sentence

    • PGGs are modeled as compact closed categories.

    • A sentence vector is computed by a linear map f on the tensor product of the vectors of its main constituents, where f stores the type categorial structure of the string determined by its PGG representation.

    • The vector for a sentence headed by a transitive verb, for example, is computed according to the equation

      $$\overrightarrow{subj\; V_{tr}\; obj} = f(\overrightarrow{subj} \otimes \overrightarrow{V_{tr}} \otimes \overrightarrow{obj})$$


    Computing the Vector of a Sentence

    • The vector of a transitive verb $V_{tr}$ could be taken to be an element of the tensor product of the vector spaces for the two noun bases corresponding to its possible subject and object arguments:

      $$\overrightarrow{V_{tr}} \in N \otimes N$$

    • Then the vector for a sentence headed by a transitive verb could be computed as the point-wise product of the verb's vector and the tensor product of its subject and its object (see the sketch below):

      $$\overrightarrow{subj\; V_{tr}\; obj} = \overrightarrow{V_{tr}} \odot (\overrightarrow{subj} \otimes \overrightarrow{obj})$$
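    A minimal sketch of this computation with made-up low-dimensional vectors: the verb's vector lives in N ⊗ N (here a 2×2 matrix), and the sentence vector is its point-wise (Hadamard) product with subj ⊗ obj:

```python
import numpy as np

# Made-up 2-dimensional noun space N.
subj = np.array([1.0, 0.0])          # e.g. a vector for "dogs"
obj  = np.array([0.5, 0.5])          # e.g. a vector for "cats"

# The transitive verb's vector is an element of N ⊗ N, here a 2x2 matrix.
V_tr = np.array([[0.9, 0.1],
                 [0.2, 0.7]])        # e.g. a vector for "chase"

sentence = V_tr * np.outer(subj, obj)   # V_tr ⊙ (subj ⊗ obj)
print(sentence)
# [[0.45 0.05]
#  [0.   0.  ]]
```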


    Advantages of PGG Compositional VSMs

    • PGG CVSMs offer a formally grounded and computationally efficient method for obtaining vectors for complex expressions from their syntactic constituents.

    • They permit the same kind of measurement for relations of semantic similarity among sentences that lexical VSMs give for word pairs.

    • They can be trained on a (PGG parsed) corpus, and their performance evaluated against human annotators' semantic judgements for phrases and sentences.


    Problems with CVSMs

    • Although the vector of a complex expression is the value of a linear map on the vectors of its parts, it is not obvious what independent property this vector represents.

    • Sentential vectors do not correspond to the distributional properties of these sentences, as the data is too sparse to estimate distributional vectors for all but a few sentences, across most dimensions.

    • But CVSMs are interesting to the extent that the sentential vectors that they assign are derived from lexical vectors that represent the distributional properties of these expressions.

    • VSMs measure intra-corpus semantic relations, but they need to be extended to language-world relations to provide a fully adequate representation of meaning.


    Neural Network Language Models

    • Bengio et al. (2003) propose a single hidden layer neural network language model (a simplified sketch follows below).

    • This network takes vectors encoding the distributional patterns of words across a large number of dimensions (contexts of occurrence) as input, and it produces a generative language model that assigns probabilities to word sequences.

    • It is more powerful than traditional N-gram language models because it represents a large amount of information concerning the syntactic and semantic properties of lexical items.
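    A highly simplified, untrained sketch of the forward pass of a feed-forward language model in this spirit (it omits details of Bengio et al.'s architecture, such as the direct connections from the input to the output layer):

```python
import numpy as np

rng = np.random.default_rng(0)

V, d, h, n = 1000, 50, 100, 3      # vocab size, embedding dim, hidden dim, context length

C = rng.normal(size=(V, d))        # word embedding matrix (the distributional word vectors)
H = rng.normal(size=(h, n * d))    # input-to-hidden weights
U = rng.normal(size=(V, h))        # hidden-to-output weights

def next_word_probs(context_ids):
    """P(w_t | w_{t-n}, ..., w_{t-1}) over the whole vocabulary."""
    x = np.concatenate([C[i] for i in context_ids])   # concatenated context embeddings
    hidden = np.tanh(H @ x)                            # single hidden layer
    logits = U @ hidden
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                             # softmax over the vocabulary

p = next_word_probs([12, 7, 431])
print(p.shape, round(float(p.sum()), 6))               # (1000,) 1.0
```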


    Deep Neural Network Language Models

    • Arisoy et al. (2012) extend this approach to a multiple hidden layer neural network language model, while Mikolov et al. (2011) and Mikolov (2012) apply it to Recurrent Neural Network (RNN) models.

    • Both variants of deep NN language models produce better perplexity values and error rates than traditional N-gram and word cluster models, for a variety of NLP tasks.

    • Lau, Clark, and Lappin (2015) show that RNNs, augmented by normalising functions, generally outperform other types of language models in the prediction of speakers' grammatical acceptability judgements.


    Deep Neural Network Machine Translation

    • The current state of the art in MT uses statistical language models trained on aligned corpora of source and target language pairs.

    • Bahdanau et al. (2015) propose a deep neural translation model which uses an RNN with an encoder and a decoder component.

    • The encoder maps the sequence of vectors corresponding to the input words of the source language to a context vector c.


    Bidirectional RNN MT

    • The decoder incrementally generates the target sequence by estimating the conditional probability of each target word, given the preceding target words and c.

    • $p(y_i \mid y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)$, where g is a nonlinear (possibly multilayer) function that gives the probability of the target word $y_i$, $s_i$ is a hidden state of the RNN for time i, and $c_i$ is the context vector for $y_i$ (a schematic decoder step is sketched below).

    • The RNN is bidirectional in that it searches backwards and forwards through the entire context vector for the input sequence to generate the most likely target words.

    • Deep NN MT is now competitive in performance with state of the art statistical language model MT.
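    A schematic sketch of one decoder step in this spirit, with made-up dimensions and untrained weights: an attention-weighted context vector c_i over the encoder states, a decoder state update, and a softmax giving p(y_i | y_1, ..., y_{i-1}, x):

```python
import numpy as np

rng = np.random.default_rng(1)
d_enc, d_dec, d_emb, V = 8, 8, 6, 20      # made-up dimensions and target vocabulary size

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(enc_states, s_prev, y_prev_emb, Wa, Ws, Wy, Wc, Wo):
    """One attention-based decoder step: returns (p(y_i | ...), s_i, c_i)."""
    # Attention: score each encoder state against the previous decoder state.
    scores = enc_states @ (Wa @ s_prev)             # one score per source position
    alpha = softmax(scores)                         # attention weights
    c_i = alpha @ enc_states                        # context vector c_i
    # State update s_i = f(s_{i-1}, y_{i-1}, c_i); here a simple tanh cell.
    s_i = np.tanh(Ws @ s_prev + Wy @ y_prev_emb + Wc @ c_i)
    # g(y_{i-1}, s_i, c_i): a softmax over the target vocabulary.
    p_y = softmax(Wo @ np.concatenate([y_prev_emb, s_i, c_i]))
    return p_y, s_i, c_i

enc_states = rng.normal(size=(5, d_enc))            # 5 source positions
Wa = rng.normal(size=(d_enc, d_dec))
Ws = rng.normal(size=(d_dec, d_dec))
Wy = rng.normal(size=(d_dec, d_emb))
Wc = rng.normal(size=(d_dec, d_enc))
Wo = rng.normal(size=(V, d_emb + d_dec + d_enc))

p_y, s_i, c_i = decoder_step(enc_states, np.zeros(d_dec),
                             rng.normal(size=d_emb), Wa, Ws, Wy, Wc, Wo)
print(p_y.shape, round(float(p_y.sum()), 6))        # (20,) 1.0
```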


    Multi-Modal MT: Generating Descriptions for Images

    • Recent work (Socher et al. (2014), Karpathy and Fei-Fei (2015), Xu et al. (2015), Vinyals et al. (2015)) has applied the encoder-decoder architecture to the task of generating descriptions for visual images, and assigning images to descriptions of scenes.

    • The models are trained on data sets of images annotated with sentences or captions (Flickr 8k, Flickr 30k).

    • The encoder is a deep multi-level NN that maps sets of pixels in an image into vectors corresponding to visual features.

    • The decoder is an RNN that generates a description from the vector inputs produced by the encoder.

    • The system aligns image features with words and phrases in the way that deep NN MT does for source and target linguistic expressions.


    Defining the Decoder RNN over Dependency Parse Trees

    • Socher et al. (2014) use dependency parse trees as the input to the decoder RNN of their image description generator.

    • A function $g_\theta$ computes the compositional vectors of the parent nodes in the tree.

    • Unlike earlier compositional VSM models, this function is learned as a parameter through backpropagation.

    • The dependency tree RNN outperforms constituency tree and bag of words decoder RNNs in generating image descriptions, and in identifying suitable images from descriptions.


    A CNN Encoder + an RNN Decoder without Parse Tree Inputs

    • Karpathy and Fei-Fei (2015), Xu et al. (2015), and Vinyals et al. (2015) develop image caption systems that use Convolutional Neural Networks (CNNs) to encode images in vectors.

    • They employ RNNs that apply directly to word sequences as decoders.

    • Karpathy and Fei-Fei use a bidirectional RNN, while Xu et al. and Vinyals et al. apply an LSTM RNN.

    • These models can generate descriptions for sub-scenes in an image.

    • They yield better results than Socher et al.'s (2014) system.


    Karpathy and Fei-Fei's (2015) Model

    [Figure slide; image not reproduced in this transcript.]

    Image and Description from Xu et al. (2015)

    [Figure slide; image not reproduced in this transcript.]

  • Classical Approaches Distributional Models Semantic Interpretation as MT Conclusions

    Spatial Terms in Image Descriptions

    • Kelleher (2016) observes that spatial terms like over in the Xu et al. image-matched sentence are not keyed to visual features of the image.

    • He argues that these terms are generated directly by the RNN language model through the conditional probabilities of the preceding word sequences (an illustrative sketch of such a conditional next-word distribution follows this list).

    • Kelleher suggests that it may be possible to ground spatial terms in dynamic image sequences provided by videos, as these offer richer and more salient representations of language-spatial correspondences.
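
    A minimal sketch (not Kelleher's or Xu et al.'s code; the toy vocabulary and untrained model are assumptions) of how an RNN language model assigns a conditional probability to a spatial term like over given only the preceding words:

```python
import torch
import torch.nn as nn

# Toy vocabulary for illustration only.
vocab = ["a", "bird", "flying", "over", "water", "the"]
word_to_id = {w: i for i, w in enumerate(vocab)}

class WordLSTM(nn.Module):
    """Illustrative LSTM language model over word indices."""

    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def next_word_distribution(self, prefix_ids):
        # Run the prefix through the LSTM and softmax the final state:
        # this is p(w_t | w_1 ... w_{t-1}), with no visual input at all.
        h, _ = self.lstm(self.embed(prefix_ids))
        return torch.softmax(self.out(h[:, -1]), dim=-1)

lm = WordLSTM(len(vocab))
prefix = torch.tensor([[word_to_id[w] for w in ["a", "bird", "flying"]]])
probs = lm.next_word_distribution(prefix)
# After training on caption text, p("over" | "a bird flying") would typically
# be high even when the image itself gives no cue for the spatial relation.
print(probs[0, word_to_id["over"]])
```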

  • Classical Approaches Distributional Models Semantic Interpretation as MT Conclusions

    Generalising Multi-Modal MT to Semantic Interpretation

    • The MT model of image caption generation suggests an approach to semantic interpretation.

    • The encoder component of the model could map an integrated data structure for visual, audio, and textual input representing a situation to vector space features.

    • The decoder would align words and phrases to these features, generating sentences describing the situation, or part of it (a speculative encoder sketch follows this list).

    • Situations (scenes) could also be produced for sentences that correspond to them.
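
    A speculative sketch of the multi-modal encoder the slide envisages, not an existing system; the modality encoders, dimensions, and fusion step are chosen purely for illustration.

```python
import torch
import torch.nn as nn

class SituationEncoder(nn.Module):
    """Illustrative fusion of visual, audio, and textual input into one vector."""

    def __init__(self, text_vocab=10000, dim=256):
        super().__init__()
        self.visual = nn.Linear(2048, dim)             # e.g. pooled CNN image features
        self.audio = nn.Linear(128, dim)               # e.g. pooled spectrogram features
        self.text = nn.EmbeddingBag(text_vocab, dim)   # bag of context words
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, visual_feats, audio_feats, text_ids):
        # Concatenate the three modality vectors and project them into a single
        # situation vector; a decoder RNN would align words and phrases to it.
        v = self.visual(visual_feats)
        a = self.audio(audio_feats)
        t = self.text(text_ids)
        return torch.tanh(self.fuse(torch.cat([v, a, t], dim=-1)))

enc = SituationEncoder()
situation_vec = enc(torch.randn(1, 2048), torch.randn(1, 128),
                    torch.randint(0, 10000, (1, 7)))
```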

  • Classical Approaches Distributional Models Semantic Interpretation as MT Conclusions

    Multi-Modal MT and Classical Formal Semantics

    • The classical formal semantic program (Davidson (1967), Montague (1974)) seeks a recursive definition of a truth predicate which entails appropriate truth conditions for each declarative sentence in a language (the T-schema sketched after this list).

    • To the extent that it is successful, a generalised multi-modal MT model would achieve the core part of this program.

    • It would specify suitable correspondences between sentences and sets of situations that the sentences describe.

    • These correspondences are produced not by a recursive definition of a truth predicate, but by an extended deep neural network language model.
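
    For concreteness, the target of the classical program can be illustrated with the standard Tarskian T-schema; the example sentence is supplied here for illustration and is not taken from the slides.

```latex
% The recursive truth definition must entail, for every declarative
% sentence S of the object language, a biconditional of the form
% True("S") <-> p, where p states S's truth conditions. For example:
\[
  \mathrm{True}(\text{``Snow is white''}) \;\leftrightarrow\; \text{snow is white}
\]
```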

  • Classical Approaches Distributional Models Semantic Interpretation as MT Conclusions

    Conclusions

    • Classical approaches to semantic interpretation assign semantic values to hierarchical syntactic structures, and they compute the interpretations of sentences by applying combinatorial rules to these values.

    • These approaches provide formally elegant systems, but they have not yielded wide-coverage methods for semantic learning and representation.

    • They have also not integrated compositional and lexical meaning in a natural and computationally efficient way.

  • Classical Approaches Distributional Models Semantic Interpretation as MT Conclusions

    Conclusions

    • Multi-modal deep neural network MT approaches to image description suggest a model for wide-coverage semantic learning and representation that is driven by aligning vector space encodings of linguistic and non-linguistic entities.

    • Hierarchical syntactic structure need not be included in the linguistic input to this model, but it is implicitly represented as part of the distributional feature patterns expressed in the vectors assigned to lexical and phrasal items.

    • If it is successful, the multi-modal deep MT model of semantic interpretation will satisfy the central condition of adequacy that the classical formal semantic program imposes on a theory of meaning for natural language.

    Classical Approaches to Semantic Representation
    Distributional Models of Meaning
    Semantic Interpretation as Machine Translation
    Conclusions