Background review
Marco Kuhlmann
Department of Computer and Information Science
Natural Language Processing (2020)
This work is licensed under a Creative Commons Attribution 4.0 International License.
• Essentials of machine learning (45 min)
• Introduction to the project (15 min)
Essentials of linguistics
Form and meaning
• Linguistics studies the relationship between form and meaning.
Different languages have different words for the animal
‘cat’.
• This relationship is in principle an arbitrary one – it varies
with the language, the time, the speaker, the context, …
Languages of the world
Number of living languages: 7,097
[Map of the world’s languages, taken from Wikimedia Commons (credit and license)]
Levels of linguistic description
[Diagram: phonology, morphology, syntax, semantics, pragmatics, connected by the directions analyse and generate]
Phonology
Phonology describes linguistic expressions as sequences of sounds
that can be represented in the phonetic alphabet.
fəˈnɑlədʒi dɪˈskraɪbz lɪŋˈɡwɪstɪk ɪkˈsprɛʃənz æz ˈsikwənsəz əv saʊndz ðæt kæn bi ˌrɛprɪˈzɛntəd ɪn ðə fəˈnɛtɪk ˈælfəbɛt
Levels of linguistic description
Words consist of morphemes
• Morphemes are the smallest meaningful units of language.
Morpheme+s are the small+est mean+ing+ful unit+s of language.
• A word consists of one root morpheme and zero or more affixes.
draw, draw+s, draw+ing+s, un+draw+able
• The sounds making up a morpheme are not always contiguous.
Hebrew root k-t-b ‘write’: khatav ‘wrote’, mikhtav ‘a letter’
Lexemes and lemmas
• The term lexeme refers to a set of word forms that all share the
same fundamental meaning. word forms run, runs, ran, running –
lexeme run
• The term lemma refers to the particular word form that is chosen,
by convention, to represent a given lexeme. what you would put into
a lexicon
Levels of linguistic description
Syntax
• Syntax is what puts constraints on possible sentences.
• Theoretical syntacticians study what formal mechanisms are
required in order to describe natural languages in this way.
connection to formal language theory
• These questions are not directly relevant to natural language
processing, which needs to be robust to ill-formed input. obvious
exceptions: grammar checkers, text generation systems
Parts of speech
• A part of speech is a category of words that play similar roles
within the syntactic structure of a sentence.
• Parts of speech can be defined distributionally or functionally.
Kim saw the {elephant, movie, mountain, error} before we did.
verbs = predicates; nouns = arguments; adverbs = modify verbs,
…
• There are many different ‘tag sets’ for parts of speech.
different languages, different levels of granularity, different
design principles
Universal part-of-speech tags
Tag Category Examples
VERB verb run, eat
PRON pronoun you, herself
Source: Universal Dependencies Project
Phrases and syntactic heads
• Words form groupings called phrases or constituents. Kim read [a
book]. Kim read [a very interesting book about grammar].
• Each phrase is projected by a syntactic head, which determines
its internal structure and external distribution. [The war on
drugs] is controversial. / *[The battle on drugs] is
controversial.
[The war on drugs] is controversial. / *[The war on drugs] are
controversial.
Dependency trees
[Figure: a dependency tree with labelled arcs such as subject and object]
A dependency is an asymmetric relation between a head and a dependent.
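As a side note (my own encoding, not from the slides), a dependency tree can be represented simply as one head index and one label per word:

# A dependency tree for "Kim saw a man" encoded as one (head, label) pair
# per word; head index 0 stands for a virtual root, so "saw" is the sentence head.
words = ["Kim", "saw", "a", "man"]
heads = [2, 0, 4, 2]                       # 1-based index of each word's head (0 = root)
labels = ["subject", "root", "det", "object"]

# Each dependency is an asymmetric head -> dependent relation.
for i, (word, head, label) in enumerate(zip(words, heads, labels), start=1):
    governor = "ROOT" if head == 0 else words[head - 1]
    print(f"{governor} -{label}-> {word}   (dependent {i}, head {head})")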
Syntactic ambiguity
Kim saw a man with a hat.
[Two analyses: ‘with a hat’ attaches to ‘saw’ (Kim had the hat) or to ‘a man’ (the man had the hat).]
Levels of linguistic description
[Diagram: phonology, morphology, syntax, semantics, pragmatics, connected by the directions analyse and generate]
Semantics
• Semantics is the study of the meaning of linguistic expressions
such as words, phrases, and sentences.
• The focus is on what expressions conventionally mean, rather than
on what they might mean in a particular context. The latter is the
focus of pragmatics.
• This distinction is generally presented as the one between referential meaning and associative meaning. What does the word needle mean? What does it mean to you?
Source: Yule (2016)
Lexemes and lemmas
• The term lexeme refers to a set of word forms that all share the
same fundamental meaning. word forms run, runs, ran, running –
lexeme run
• The term lemma refers to the particular word form that is chosen,
by convention, to represent a given lexeme. what you would put into
a lexicon
Relations between senses of different words
• Synonymy – Antonymy
The situation where two senses of two different words (lemmas) are identical or nearly identical – or opposite of each other. couch/sofa, car/automobile – cold/hot, leader/follower
• Hyponymy – Hypernymy
The situation where, in a pair of senses of two different words, one sense is more specific – or less specific – than the other. car/vehicle, mango/fruit – furniture/chair, mammal/dog
The Principle of Compositionality
• Syntax provides the scaffolding for semantic composition. The
brown dog on the mat saw the striped cat through the window.
The brown cat saw the striped dog through the window on the
mat.
• Principle of Compositionality (Frege, 1848–1925)
The meaning of a complex expression is determined by its structure
and the meanings of its parts. challenges include idiomatic
expressions
Levels of linguistic description
Pragmatics
• Pragmatics studies the way linguistic expressions with their semantic meanings are used for particular communicative goals.
• In contrast to semantics, pragmatics explicitly asks the question
what an expression means in a given context.
• An important term in pragmatics is speech act, which describes an
action that involves language, such as ‘requesting’. Can you pass
the salt?
Source: Yule (2016)
Essentials of machine learning
What do we mean by learning?
• Informally speaking, a machine learning algorithm is an algorithm that is able to learn from data.
• ‘A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.’ Mitchell (1997)
Different learning experiences
• Supervised learning
The system has access to both the input and the target output.
regression, classification
• Unsupervised learning
The system has access to the input but not the target output.
clustering, topic models
Supervised learning tasks
• Regression
Predict a numerical value for a given input. housing price prediction
• Classification
Predict which of k categories a given input belongs to. handwritten digit recognition
Housing prices in Portland, OR
[Scatter plot: price (in thousands of dollars) against living area]
Handwritten digit recognition
• Task: predict the digit represented in an image. classification problem with 10 classes
• Experience: manually labelled images of handwritten digits. MNIST database
• Performance measure: accuracy. proportion of correct predictions
Linear regression with one variable
Living area (x)    Price (y)
2,104              399,900
1,600              329,900
2,400              369,000
1,416              232,000
Linear regression with one variable
• Modelling assumption
The relation between housing area and price can be described in
terms of an affine function. affine = linear function + intercept
(bias)
• Learning task
Find the ‘best’ affine function – the function that minimizes the
total distance to the data points. distance measure: mean squared
error
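As a concrete illustration (my own code, not part of the lecture), the following fits an affine function to the four data points above by minimizing the mean squared error, using NumPy’s least-squares solver:

import numpy as np

# Toy data: living area and price, as in the table above.
x = np.array([2104.0, 1600.0, 2400.0, 1416.0])
y = np.array([399900.0, 329900.0, 369000.0, 232000.0])

# Affine model y ≈ w * x + b, written as a least-squares problem
# with a column of ones standing in for the bias (intercept).
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)

predictions = w * x + b
mse = np.mean((predictions - y) ** 2)
print(f"w = {w:.2f}, b = {b:.2f}, MSE = {mse:.2f}")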
Linear regression with one variable
[Equation: y = w · x + b, with x the input value and b the bias]
Linear regression with multiple variables
Living area (x1)   # bedrooms (x2)   Price (y)
2,104              3                 399,900
1,600              3                 329,900
2,400              3                 369,000
1,416              2                 232,000
…                  …                 …
In matrix form, y = Xw + b, with input matrix X (N-by-F), weight vector w (F-by-1), and bias b.
N = number of training examples, F = number of features (independent variables)
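A small sketch (my own variable names, not from the slides) of how these shapes fit together in NumPy:

import numpy as np

# N = 4 training examples, F = 2 features (living area, # bedrooms).
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 3.0],
              [1416.0, 2.0]])      # input matrix, N-by-F
y = np.array([399900.0, 329900.0, 369000.0, 232000.0])

w = np.zeros(2)                    # weight vector, F-by-1 (here a flat array)
b = 0.0                            # bias (scalar)

y_hat = X @ w + b                  # one prediction per training example
mse = np.mean((y_hat - y) ** 2)    # mean squared error of the current parameters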
Learning by gradient search
• Choose a loss function (error function) that can be used to
measure how good a model is. The loss function should be
differentiable.
• Compute the gradient of the loss function with respect to the
current parameters of the model. This can be done efficiently using
backpropagation.
• Update the parameters by subtracting the gradient, and repeat the
process until the loss is sufficiently low.
Gradient descent: Intuition
[Figures: the MSE loss plotted as a function of the model parameters, illustrating downhill steps along the negative gradient]
Gradient descent
• Step 0: Start with an arbitrary value for the parameters θ.
• Step 1: Compute the gradient of the loss function, ∇L(θ).
• Step 2: Update the parameters as follows: θ ← θ − η ∇L(θ). The parameter η is the learning rate.
• Repeat steps 1–2 until the loss is sufficiently low.
‘Follow the gradient into the valley of error.’
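The steps above, applied to the linear regression model with the MSE loss, can be sketched as follows (my own code; the learning rate, the number of steps, and the feature scaling are arbitrary choices):

import numpy as np

X = np.array([[2104.0, 3.0], [1600.0, 3.0], [2400.0, 3.0], [1416.0, 2.0]])
y = np.array([399900.0, 329900.0, 369000.0, 232000.0])

# Feature scaling keeps the gradient steps well-behaved on this toy data.
X = (X - X.mean(axis=0)) / X.std(axis=0)

w, b = np.zeros(X.shape[1]), 0.0   # Step 0: arbitrary initial parameters
eta = 0.1                          # learning rate

for step in range(1000):
    error = X @ w + b - y
    grad_w = 2 * X.T @ error / len(y)   # Step 1: gradient of the MSE loss
    grad_b = 2 * error.mean()
    w -= eta * grad_w                   # Step 2: move against the gradient
    b -= eta * grad_b

print("MSE:", np.mean((X @ w + b - y) ** 2))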
Stochastic gradient descent
• For large training sets, the time required to take a single
gradient step becomes prohibitively large.
• The idea behind stochastic gradient descent (SGD) is to estimate
the gradient by computing it on a small set of samples.
• A common method is to use small batches of randomly selected
training samples; these are called minibatches.
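A sketch of the same loop with minibatch updates (my own code; the synthetic data and the batch size of 32 are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
N, F = 1000, 2
X = rng.normal(size=(N, F))                 # synthetic training data
y = X @ np.array([3.0, -2.0]) + 1.0 + rng.normal(scale=0.1, size=N)

w, b = np.zeros(F), 0.0
eta, batch_size = 0.1, 32

for epoch in range(20):
    order = rng.permutation(N)              # shuffle the training samples
    for start in range(0, N, batch_size):
        idx = order[start:start + batch_size]      # one randomly selected minibatch
        Xb, yb = X[idx], y[idx]
        error = Xb @ w + b - yb
        w -= eta * 2 * Xb.T @ error / len(idx)     # gradient estimated on the batch only
        b -= eta * 2 * error.mean()

print("learned parameters:", w, b)          # should be close to [3, -2] and 1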
Learning versus optimization
Training error and generalization error
• We train a model by minimizing its error on the training data.
optimization
• The training error is different from the generalization error –
the expected value of the error on previously unseen inputs.
• We can estimate the generalization error of a model by measuring
its test error – the error on a held-out test set.
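A minimal sketch of this protocol (my own code; the 80/20 split is an arbitrary choice): fit the parameters on the training portion only, then measure the error on the held-out portion.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + 1.0 + rng.normal(scale=0.5, size=200)

# Hold out 20% of the data; never touch it during training.
perm = rng.permutation(len(y))
split = int(0.8 * len(y))
train, test = perm[:split], perm[split:]

A = np.column_stack([X[train], np.ones(len(train))])
params, *_ = np.linalg.lstsq(A, y[train], rcond=None)
w, b = params[:2], params[2]

train_err = np.mean((X[train] @ w + b - y[train]) ** 2)
test_err = np.mean((X[test] @ w + b - y[test]) ** 2)   # estimate of the generalization error
print(f"training error: {train_err:.3f}, test error: {test_err:.3f}")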
How can we hope to perform well on the test set?
• Assumption 1: The examples in the training set and the test set
are mutually independent.
• Assumption 2: The examples in the training set and the test set
are identically distributed. sampled from the same data generating
distribution
• Under these assumptions, the expected test error is greater than
or equal to the expected training error.
Underfitting and overfitting
• Underfitting
The model is unable to obtain a sufficiently low error on the
training set. The model is not expressive enough.
• Overfitting
The gap between the training error and the test error is too large.
The model is over-optimized for the training data. memorises
noise
Underfitting and overfitting
[Figure: three fits to the same data, illustrating underfitting, appropriate capacity, and overfitting]
[Figure: training and generalization error as a function of model capacity, with an underfitting zone, an overfitting zone, and the generalization gap between them. Source: GBC, Figure 5]
Handwritten digit recognition
• Task: predict the digit represented in an image. classification problem with 10 classes
• Experience: manually labelled images of handwritten digits. MNIST database
• Performance measure: accuracy. proportion of correct predictions
z = x W + b, with input vector x (1-by-F), weight matrix W (F-by-K), bias vector b (1-by-K), and output vector z (1-by-K)
F = number of features (input variables), K = number of classes (output variables)
The softmax classifier
• We may think of z = xW + b as a vector of scores that quantify the compatibility of the input x with each of the classes k.
• We convert these scores into a probability distribution P(k | x) over the classes by feeding them through the softmax function (shown below).
• We then predict the class with the highest probability.
softmax(z)_k = exp(z_k) / Σ_j exp(z_j)
Cross-entropy loss
• To train a softmax classifier, we present it with training samples (x, y) where x is a feature vector and y is the gold-standard class.
• The output of the model is a vector of conditional probabilities P(k | x), where k ranges over the possible classes.
• The cross-entropy loss for a specific sample is the negative log probability of the gold-standard class, −log P(y | x). The loss for a batch of samples is the average loss per sample.
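A sketch of the batch loss computation (toy probabilities, my own code):

import numpy as np

# probs[i, k] = P(k | x_i), e.g. produced by the softmax classifier above.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
gold = np.array([0, 1, 0])          # gold-standard class of each sample

per_sample = -np.log(probs[np.arange(len(gold)), gold])   # negative log probability of the gold class
loss = per_sample.mean()                                   # batch loss = average loss per sample
print(per_sample, loss)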
Cross-entropy loss
[Plot of −log P(y | x): the loss is low if the gold-standard class has high probability, and grows as that probability approaches zero.]
Note on terminology
• The softmax function can be viewed as a generalization of the
standard logistic function to more than two classes.
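A quick numerical check of this claim (my own code): for two classes, the softmax over the scores [0, z] reproduces the standard logistic function σ(z).

import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

z = 1.3
logistic = 1.0 / (1.0 + np.exp(-z))
two_class = softmax(np.array([0.0, z]))[1]   # probability of the 'positive' class
print(logistic, two_class)                   # both ≈ 0.786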