Background review
Marco Kuhlmann
Department of Computer and Information Science
Natural Language Processing (2020)
This work is licensed under a Creative Commons Attribution 4.0 International License.
• Essentials of machine learning (45 min)
• Introduction to the project (15 min)
Essentials of linguistics
Form and meaning
• Linguistics studies the relationship between form and meaning.
Different languages have different words for the animal
‘cat’.
• This relationship is in principle an arbitrary one – it varies
with the language, the time, the speaker, the context, …
Languages of the world
Number of living languages: 7,097
[Map of the world’s languages, taken from Wikimedia Commons (credit and license)]
Levels of linguistic description
[Diagram: phonology, morphology, syntax, semantics, pragmatics, connected by the directions analyse and generate]
Phonology
Phonology describes linguistic expressions as sequences of sounds
that can be represented in the phonetic alphabet.
fəˈnɑlədʒi dɪˈskraɪbz lɪŋˈɡwɪstɪk ɪkˈsprɛʃənz æz ˈsikwənsəz əv saʊndz ðæt kæn bi ˌrɛprɪˈzɛntəd ɪn ðə fəˈnɛtɪk ˈælfəbɛt
Levels of linguistic description
Words consist of morphemes
• Morphemes are the smallest meaningful units of language.
Morpheme+s are the small+est mean+ing+ful unit+s of language.
• A word consists of one root morpheme and zero or more affixes.
draw, draw+s, draw+ing+s, un+draw+able
• The sounds making up a morpheme are not always contiguous.
Hebrew root k-t-b ‘write’: khatav ‘wrote’, mikhtav ‘a letter’
Lexemes and lemmas
• The term lexeme refers to a set of word forms that all share the
same fundamental meaning. word forms run, runs, ran, running –
lexeme run
• The term lemma refers to the particular word form that is chosen,
by convention, to represent a given lexeme. what you would put into
a lexicon
Levels of linguistic description
Syntax
• Syntax is what puts constraints on possible sentences.
• Theoretical syntacticians study what formal mechanisms are
required in order to describe natural languages in this way.
connection to formal language theory
• These questions are not directly relevant to natural language
processing, which needs to be robust to ill-formed input. obvious
exceptions: grammar checkers, text generation systems
Parts of speech
• A part of speech is a category of words that play similar roles
within the syntactic structure of a sentence.
• Parts of speech can be defined distributionally or functionally.
Kim saw the {elephant, movie, mountain, error} before we did.
verbs = predicates; nouns = arguments; adverbs = modify verbs,
…
• There are many different ‘tag sets’ for parts of speech.
different languages, different levels of granularity, different
design principles
Universal part-of-speech tags
Tag Category Examples
VERB verb run, eat
PRON pronoun you, herself
Source: Universal Dependencies Project
Phrases and syntactic heads
• Words form groupings called phrases or constituents. Kim read [a
book]. Kim read [a very interesting book about grammar].
• Each phrase is projected by a syntactic head, which determines
its internal structure and external distribution. [The war on
drugs] is controversial. / *[The battle on drugs] is
controversial.
[The war on drugs] is controversial. / *[The war on drugs] are
controversial.
Dependency trees
[Figure: a dependency tree with labelled arcs such as subject and object]
A dependency is an asymmetric relation between a head and a dependent.
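As a side note (my own encoding, not from the slides), a dependency tree can be represented simply as one head index and one label per word:

# A dependency tree for "Kim saw a man" encoded as one (head, label) pair
# per word; head index 0 stands for a virtual root, so "saw" is the sentence head.
words = ["Kim", "saw", "a", "man"]
heads = [2, 0, 4, 2]                       # 1-based index of each word's head (0 = root)
labels = ["subject", "root", "det", "object"]

# Each dependency is an asymmetric head -> dependent relation.
for i, (word, head, label) in enumerate(zip(words, heads, labels), start=1):
    governor = "ROOT" if head == 0 else words[head - 1]
    print(f"{governor} -{label}-> {word}   (dependent {i}, head {head})")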
Syntactic ambiguity
Kim saw a man with a hat.
[Two analyses: ‘with a hat’ attaches to ‘saw’ (Kim had the hat) or to ‘a man’ (the man had the hat).]
Levels of linguistic description
[Diagram: phonology, morphology, syntax, semantics, pragmatics, connected by the directions analyse and generate]
Semantics
• Semantics is the study of the meaning of linguistic expressions
such as words, phrases, and sentences.
• The focus is on what expressions conventionally mean, rather than
on what they might mean in a particular context. The latter is the
focus of pragmatics.
• This distinction is generally presented as the one between referential meaning and associative meaning. What does the word needle mean? What does it mean to you?
Source: Yule (2016)
Lexemes and lemmas
• The term lexeme refers to a set of word forms that all share the
same fundamental meaning. word forms run, runs, ran, running –
lexeme run
• The term lemma refers to the particular word form that is chosen,
by convention, to represent a given lexeme. what you would put into
a lexicon
Relations between senses of different words
• Synonymy – Antonymy
The situation where two senses of two different words (lemmas) are identical or nearly identical – or opposite of each other. couch/sofa, car/automobile – cold/hot, leader/follower
• Hyponymy – Hypernymy
The situation where, in a pair of senses of two different words, one sense is more specific – or less specific – than the other. car/vehicle, mango/fruit – furniture/chair, mammal/dog
The Principle of Compositionality
• Syntax provides the scaffolding for semantic composition. The
brown dog on the mat saw the striped cat through the window.
The brown cat saw the striped dog through the window on the
mat.
• Principle of Compositionality (Frege, 1848–1925)
The meaning of a complex expression is determined by its structure
and the meanings of its parts. challenges include idiomatic
expressions
Levels of linguistic description
Pragmatics
• Pragmatics studies the way linguistic expressions with their semantic meanings are used for particular communicative goals.
• In contrast to semantics, pragmatics explicitly asks the question
what an expression means in a given context.
• An important term in pragmatics is speech act, which describes an
action that involves language, such as ‘requesting’. Can you pass
the salt?
Source: Yule (2016)
Essentials of machine learning
What do we mean by learning?
• Informally speaking, a machine learning algorithm is an algorithm that is able to learn from data.
• ‘A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.’ Mitchell (1997)
Different learning experiences
• Supervised learning
The system has access to both the input and the target output.
regression, classification
• Unsupervised learning
The system has access to the input but not the target output.
clustering, topic models
Supervised learning tasks
• Regression
Predict a numerical value for a given input. housing price prediction
• Classification
Predict which of k categories a given input belongs to. handwritten digit recognition
Housing prices in Portland, OR
[Scatter plot: price (in thousands of dollars) against living area]
Handwritten digit recognition
• Task: predict the digit represented in an image. classification problem with 10 classes
• Experience: manually labelled images of handwritten digits. MNIST database
• Performance measure: accuracy. proportion of correct predictions
Linear regression with one variable
Living area (x)    Price (y)
2,104              399,900
1,600              329,900
2,400              369,000
1,416              232,000
Linear regression with one variable
• Modelling assumption
The relation between housing area and price can be described in
terms of an affine function. affine = linear function + intercept
(bias)
• Learning task
Find the ‘best’ affine function – the function that minimizes the
total distance to the data points. distance measure: mean squared
error
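As a concrete illustration (my own code, not part of the lecture), the following fits an affine function to the four data points above by minimizing the mean squared error, using NumPy’s least-squares solver:

import numpy as np

# Toy data: living area and price, as in the table above.
x = np.array([2104.0, 1600.0, 2400.0, 1416.0])
y = np.array([399900.0, 329900.0, 369000.0, 232000.0])

# Affine model y ≈ w * x + b, written as a least-squares problem
# with a column of ones standing in for the bias (intercept).
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)

predictions = w * x + b
mse = np.mean((predictions - y) ** 2)
print(f"w = {w:.2f}, b = {b:.2f}, MSE = {mse:.2f}")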
Linear regression with one variable
[Equation: y = w · x + b, with x the input value and b the bias]
Linear regression with multiple variables
Living area (x1)   # bedrooms (x2)   Price (y)
2,104              3                 399,900
1,600              3                 329,900
2,400              3                 369,000
1,416              2                 232,000
…                  …                 …
In matrix form, y = Xw + b, with input matrix X (N-by-F), weight vector w (F-by-1), and bias b.
N = number of training examples, F = number of features (independent variables)
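A small sketch (my own variable names, not from the slides) of how these shapes fit together in NumPy:

import numpy as np

# N = 4 training examples, F = 2 features (living area, # bedrooms).
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 3.0],
              [1416.0, 2.0]])      # input matrix, N-by-F
y = np.array([399900.0, 329900.0, 369000.0, 232000.0])

w = np.zeros(2)                    # weight vector, F-by-1 (here a flat array)
b = 0.0                            # bias (scalar)

y_hat = X @ w + b                  # one prediction per training example
mse = np.mean((y_hat - y) ** 2)    # mean squared error of the current parameters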
Learning by gradient search
• Choose a loss function (error function) that can be used to
measure how good a model is. The loss function should be
differentiable.
• Compute the gradient of the loss function with respect to the
current parameters of the model. This can be done efficiently using
backpropagation.
• Update the parameters by subtracting the gradient, and repeat the
process until the loss is sufficiently low.
Gradient descent: Intuition
[Figures: the MSE loss plotted as a function of the model parameters, illustrating downhill steps along the negative gradient]
Gradient descent
• Step 0: Start with an arbitrary value for the parameters θ.
• Step 1: Compute the gradient of the loss function, ∇L(θ).
• Step 2: Update the parameters as follows: θ ← θ − η ∇L(θ). The parameter η is the learning rate.
• Repeat steps 1–2 until the loss is sufficiently low.
‘Follow the gradient into the valley of error.’
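The steps above, applied to the linear regression model with the MSE loss, can be sketched as follows (my own code; the learning rate, the number of steps, and the feature scaling are arbitrary choices):

import numpy as np

X = np.array([[2104.0, 3.0], [1600.0, 3.0], [2400.0, 3.0], [1416.0, 2.0]])
y = np.array([399900.0, 329900.0, 369000.0, 232000.0])

# Feature scaling keeps the gradient steps well-behaved on this toy data.
X = (X - X.mean(axis=0)) / X.std(axis=0)

w, b = np.zeros(X.shape[1]), 0.0   # Step 0: arbitrary initial parameters
eta = 0.1                          # learning rate

for step in range(1000):
    error = X @ w + b - y
    grad_w = 2 * X.T @ error / len(y)   # Step 1: gradient of the MSE loss
    grad_b = 2 * error.mean()
    w -= eta * grad_w                   # Step 2: move against the gradient
    b -= eta * grad_b

print("MSE:", np.mean((X @ w + b - y) ** 2))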
Stochastic gradient descent
• For large training sets, the time required to take a single
gradient step becomes prohibitively large.
• The idea behind stochastic gradient descent (SGD) is to estimate
the gradient by computing it on a small set of samples.
• A common method is to use small batches of randomly selected
training samples; these are called minibatches.
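A sketch of the same loop with minibatch updates (my own code; the synthetic data and the batch size of 32 are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
N, F = 1000, 2
X = rng.normal(size=(N, F))                 # synthetic training data
y = X @ np.array([3.0, -2.0]) + 1.0 + rng.normal(scale=0.1, size=N)

w, b = np.zeros(F), 0.0
eta, batch_size = 0.1, 32

for epoch in range(20):
    order = rng.permutation(N)              # shuffle the training samples
    for start in range(0, N, batch_size):
        idx = order[start:start + batch_size]      # one randomly selected minibatch
        Xb, yb = X[idx], y[idx]
        error = Xb @ w + b - yb
        w -= eta * 2 * Xb.T @ error / len(idx)     # gradient estimated on the batch only
        b -= eta * 2 * error.mean()

print("learned parameters:", w, b)          # should be close to [3, -2] and 1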
Learning versus optimization
Training error and generalization error
• We train a model by minimizing its error on the training data.
optimization
• The training error is different from the generalization error –
the expected value of the error on previously unseen inputs.
• We can estimate the generalization error of a model by measuring
its test error – the error on a held-out test set.
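A minimal sketch of this protocol (my own code; the 80/20 split is an arbitrary choice): fit the parameters on the training portion only, then measure the error on the held-out portion.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + 1.0 + rng.normal(scale=0.5, size=200)

# Hold out 20% of the data; never touch it during training.
perm = rng.permutation(len(y))
split = int(0.8 * len(y))
train, test = perm[:split], perm[split:]

A = np.column_stack([X[train], np.ones(len(train))])
params, *_ = np.linalg.lstsq(A, y[train], rcond=None)
w, b = params[:2], params[2]

train_err = np.mean((X[train] @ w + b - y[train]) ** 2)
test_err = np.mean((X[test] @ w + b - y[test]) ** 2)   # estimate of the generalization error
print(f"training error: {train_err:.3f}, test error: {test_err:.3f}")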
How can we hope to perform well on the test set?
• Assumption 1: The examples in the training set and the test set
are mutually independent.
• Assumption 2: The examples in the training set and the test set
are identically distributed. sampled from the same data generating
distribution
• Under these assumptions, the expected test error is greater than
or equal to the expected training error.
Underfitting and overfitting
• Underfitting
The model is unable to obtain a sufficiently low error on the
training set. The model is not expressive enough.
• Overfitting
The gap between the training error and the test error is too large.
The model is over-optimized for the training data. memorises
noise
Underfitting and overfitting
[Figure: three fits to the same data, illustrating underfitting, appropriate capacity, and overfitting]
[Figure: training and generalization error as a function of model capacity, with an underfitting zone, an overfitting zone, and the generalization gap between them. Source: GBC, Figure 5]
Handwritten digit recognition
• Task: predict the digit represented in an image. classification problem with 10 classes
• Experience: manually labelled images of handwritten digits. MNIST database
• Performance measure: accuracy. proportion of correct predictions
z = x W + b, with input vector x (1-by-F), weight matrix W (F-by-K), bias vector b (1-by-K), and output vector z (1-by-K)
F = number of features (input variables), K = number of classes (output variables)
The softmax classifier
• We may think of z = xW + b as a vector of scores that quantify the compatibility of the input x with each of the classes k.
• We convert these scores into a probability distribution P(k | x) over the classes by feeding them through the softmax function (shown below).
• We then predict the class with the highest probability.
softmax(z)_k = exp(z_k) / Σ_j exp(z_j)
Cross-entropy loss
• To train a softmax classifier, we present it with training samples (x, y) where x is a feature vector and y is the gold-standard class.
• The output of the model is a vector of conditional probabilities P(k | x), where k ranges over the possible classes.
• The cross-entropy loss for a specific sample is the negative log probability of the gold-standard class, −log P(y | x). The loss for a batch of samples is the average loss per sample.
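A sketch of the batch loss computation (toy probabilities, my own code):

import numpy as np

# probs[i, k] = P(k | x_i), e.g. produced by the softmax classifier above.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
gold = np.array([0, 1, 0])          # gold-standard class of each sample

per_sample = -np.log(probs[np.arange(len(gold)), gold])   # negative log probability of the gold class
loss = per_sample.mean()                                   # batch loss = average loss per sample
print(per_sample, loss)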
Cross-entropy loss
[Plot of −log P(y | x): the loss is low if the gold-standard class has high probability, and grows as that probability approaches zero.]
Note on terminology
• The softmax function can be viewed as a generalization of the
standard logistic function to more than two classes.
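A quick numerical check of this claim (my own code): for two classes, the softmax over the scores [0, z] reproduces the standard logistic function σ(z).

import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

z = 1.3
logistic = 1.0 / (1.0 + np.exp(-z))
two_class = softmax(np.array([0.0, z]))[1]   # probability of the 'positive' class
print(logistic, two_class)                   # both ≈ 0.786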