Natural Language Processing: Part 2

Source: web.cecs.pdx.edu/~mm/ArtificialIntelligenceFall2007/October24Slides.pdf


Page 1:

Natural Language Processing: Part 2

Page 2:

AskMSR Question-Answering System (Brill, Dumais, and Banko, 2002)

From Brill et al. (2002), An analysis of the AskMSR question-answering system

Page 3:

1. Determine expected answer type of question:

7 question types:

– who-question

– what-question

– where-question

– how-many question

– ...

Filters are written manually.
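As an illustration, here is a minimal sketch of answer-type detection from the question's leading words. The rules are hypothetical stand-ins, not the actual AskMSR filters:

```python
# Toy sketch (not the AskMSR code): map a question to one of the
# hand-built answer-type filters using its leading question word.
def question_type(question: str) -> str:
    q = question.lower().strip()
    for prefix, qtype in [("who", "who-question"),
                          ("what", "what-question"),
                          ("where", "where-question"),
                          ("how many", "how-many question")]:
        if q.startswith(prefix):
            return qtype
    return "other"

print(question_type("How many moons does Mars have?"))     # -> 'how-many question'
print(question_type("When was the paper clip invented?"))  # -> 'other'
```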

Page 4:

2. Rewrite query as declarative sentence likely to be in text with answer:

“When was the paper clip invented?”

→ “the paper clip was invented”

→ “invented the paper clip”

→ “paper clips were invented”

→ “invented paper clips”

→ paper AND clip AND invented

This can be done with simple rewrite rules.

Weights are given to rewrite rules (set manually).
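A minimal sketch of weighted rewrite rules follows; the rule table and weights are made up for illustration, not taken from the paper:

```python
import re

# Hypothetical rewrite rules in the spirit of AskMSR: each rule maps a
# question pattern to a declarative search string, with a hand-set weight.
REWRITE_RULES = [
    # (pattern, template, weight) -- higher weight = more precise rewrite
    (r"when was (.+) invented\?", r'"\1 was invented"', 5),
    (r"when was (.+) invented\?", r'"invented \1"', 5),
    (r"when was (.+) invented\?", r"\1 AND invented", 1),  # bag-of-words fallback
]

def rewrite(question: str):
    """Return (query, weight) pairs for every rule matching the question."""
    out = []
    for pattern, template, weight in REWRITE_RULES:
        m = re.match(pattern, question.lower())
        if m:
            out.append((m.expand(template), weight))
    return out

for query, w in rewrite("When was the paper clip invented?"):
    print(w, query)
```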

Page 5:

3. N-gram mining:

Send query to search engine.

Collect the top M page summaries (snippets).

– less computationally expensive than using entire web page

Extract unigrams, bigrams, and trigrams from each snippet.

Score of each n-gram: the sum of the weights of the query rewrites that retrieved the corresponding pages (sketched below).
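A toy sketch of this mining-and-scoring step, assuming snippets have already been retrieved and grouped by the rewrite that produced them (the data are invented):

```python
from collections import Counter

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def mine_ngrams(snippets_by_rewrite):
    """snippets_by_rewrite: list of (rewrite_weight, [snippet, ...]).
    Score each 1-/2-/3-gram by summing the weights of the rewrites
    whose retrieved snippets contained it."""
    scores = Counter()
    for weight, snippets in snippets_by_rewrite:
        for snippet in snippets:
            tokens = snippet.lower().split()
            for n in (1, 2, 3):
                for g in set(ngrams(tokens, n)):  # count once per snippet
                    scores[g] += weight
    return scores

# Toy data standing in for search-engine snippets:
data = [(5, ["the paper clip was invented in 1899 by johan vaaler"]),
        (1, ["paper clip history: invented 1899"])]
print(mine_ngrams(data).most_common(3))
```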

Page 6:

4. N-gram filtering:

Filter and reweight n-grams according to how well each candidate n-gram matches the expected answer type.

(Uses human-constructed filters)
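A minimal sketch of type-based filtering, assuming the expected answer type calls for a year-like number; the real filters are hand-built and far richer:

```python
import re

# Toy filter: keep only n-grams containing a 3-4 digit number (a plausible
# year) and boost their scores. The pattern and boost are assumptions.
def filter_ngrams(scores, pattern=r"\b\d{3,4}\b", boost=2.0):
    return {g: s * boost for g, s in scores.items() if re.search(pattern, g)}

print(filter_ngrams({"in 1899": 6.0, "johan vaaler": 4.0}))
# -> {'in 1899': 12.0}
```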

Page 7:

5. N-gram tiling:

The goal is to combine the remaining n-grams to produce a longer answer.

E.g., trigrams ABC and BCD become ABCD

Start with the highest-scoring n-gram as the current answer.

Iterate through the remaining n-grams in order of score; for each one, see if it can be tiled with the current answer. If so, the tiled result becomes the current answer (see the sketch below).
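A greedy tiling sketch; the exact overlap test and iteration order are assumptions in the spirit of the description above, not the paper's precise algorithm:

```python
def tile(a: str, b: str):
    """Merge two n-grams if one's suffix overlaps the other's prefix,
    e.g. 'a b c' + 'b c d' -> 'a b c d'. Returns None if no overlap."""
    ta, tb = a.split(), b.split()
    for k in range(min(len(ta), len(tb)), 0, -1):
        if ta[-k:] == tb[:k]:
            return " ".join(ta + tb[k:])
        if tb[-k:] == ta[:k]:
            return " ".join(tb + ta[k:])
    return None

def tile_all(scores):
    """Greedy tiling: start from the best n-gram, try to absorb the rest."""
    answer, *rest = sorted(scores, key=scores.get, reverse=True)
    for g in rest:
        merged = tile(answer, g)
        if merged:
            answer = merged
    return answer

print(tile_all({"invented in 1899": 6, "1899 by johan": 4}))
# -> 'invented in 1899 by johan'
```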

Page 8:

Results

• Tested on 500 queries from TREC data set

• 61% answered correctly − relatively good performance at this competition!

Page 9:

Knowing when we don’t know

• However... no answer is usually better than a wrong answer. The system needs the option of responding “I don’t know”.

• Built a decision tree to predict whether the system will answer correctly, based on a set of features from the question string:

– unigrams, bigrams, sentence length, number of capitalized words, number of stop words, length of longest word.

• Decision tree didn’t work very well. (Are you surprised?)

• You can read about other attempts at this in their paper (in optional reading on the class web page)

Page 10:

Beyond n-grams: Latent Semantic Analysis

• Problem: How to capture semantic similarity between documents in a natural corpus (e.g., problems of homonymy, polysemy, synonymy, etc.)

• In general, n-grams often fail to capture semantic similarity, even with query expansion, etc.

• “LSA assumes that there exists a LATENT structure in word usage – obscured by variability in word choice” (http://ir.dcs.gla.ac.uk/oldseminars/Girolami.ppt)

Page 11:

Latent Semantic Analysis (Landauer et al.)

• From training data (large sample of documents), create word-by-document matrix.

Page 12:

From Deerwester et al., Indexing by latent semantic analysis

Page 13:

Page 14:

• Now apply “singular value decomposition” to this matrix

• SVD is similar to principal components analysis (if you know what that is)

• Basically, reduce the dimensionality of the matrix by re-representing it in terms of “features” (derived from the eigenvalues and eigenvectors), keeping only the ones with the highest values.

• Result: Each document is represented by a vector of features obtained by SVD.

• Given a new document (or query), compute its representation vector in this feature space, then compute its similarity with other documents using the cosine of the angle between their vectors. Retrieve the documents with the highest similarities (a sketch follows).
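A minimal LSA sketch in Python/NumPy, using a made-up four-document corpus and k = 2; folding new queries into the space is omitted to keep it short:

```python
import numpy as np

# Word-by-document count matrix -> truncated SVD -> cosine similarity
# in the reduced space. The corpus and k are illustrative choices.
docs = ["human computer interface",
        "user interface system",
        "graph of trees",
        "minors in graph trees"]
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[d.split().count(w) for d in docs] for w in vocab], float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                    # keep only the k largest singular values
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # each row: a document in LSA space

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(doc_vecs[0], doc_vecs[1]))  # related docs: high similarity
print(cosine(doc_vecs[0], doc_vecs[2]))  # unrelated docs: low similarity
```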

Page 15:

Result of Applying LSA

From http://lair.indiana.edu/courses/i502/lectures/lect6.ppt

           DOC1  DOC2  DOC3  DOC4  DOC5  DOC6  DOC7  DOC8  DOC9
HUMAN      0.16  0.40  0.38  0.47  0.18 -0.05 -0.12 -0.16 -0.09
INTERFACE  0.14  0.37  0.33  0.40  0.16 -0.03 -0.07 -0.10 -0.04
COMPUTER   0.15  0.51  0.36  0.41  0.24  0.02  0.06  0.09  0.12
USER       0.26  0.84  0.61  0.70  0.39  0.03  0.08  0.12  0.19
SYSTEM     0.45  1.23  1.05  1.27  0.56 -0.07 -0.15 -0.21 -0.05
RESPONSE   0.16  0.58  0.38  0.42  0.28  0.06  0.13  0.19  0.22
TIME       0.16  0.58  0.38  0.42  0.28  0.06  0.13  0.19  0.22
EPS        0.22  0.55  0.51  0.63  0.24 -0.07 -0.14 -0.20 -0.11
SURVEY     0.10  0.53  0.23  0.21  0.27  0.14  0.31  0.44  0.42
TREES     -0.06  0.23 -0.14 -0.27  0.14  0.24  0.55  0.77  0.66
GRAPH     -0.06  0.34 -0.15 -0.30  0.20  0.31  0.69  0.98  0.85
MINORS    -0.04  0.25 -0.10 -0.21  0.15  0.22  0.50  0.71  0.62

In the above matrix we can now observe:

r(human, user) = 0.94

r(human, minors) = −0.83

Page 16:

How does it find the latent associations?

From http://lair.indiana.edu/courses/i502/lectures/lect6.ppt

• By analyzing the contexts in which the words appear

• The word user has co-occurred with words that human has co-occurred with (e.g., system and interface)

• It downgrades associations when such contextual similarities are not found

Page 17:

Some General LSA-Based Applications
From http://lsa.colorado.edu/~quesadaj/pdf/LSATutorial.pdf

Information Retrieval

Text Assessment

Compare document to documents of known quality / content

Automatic summarization of text

Determine best subset of text to portray same meaning

Categorization / Classification

Place text into appropriate categories or taxonomies

Page 18:

Application: Automatic Essay Scoring (in collaboration with Educational Testing Service)

Create domain semantic space

Compute vectors for essays, add to vector database

To predict the grade on a new essay, compare it to ones previously scored by humans.

From http://lsa.colorado.edu/~quesadaj/pdf/LSATutorial.pdf

Page 19:

Mutual information between two sets of grades:

human – human .90

LSA – human .81

From http://lsa.colorado.edu/~quesadaj/pdf/LSATutorial.pdf

Page 20:

Demo

Page 21:

Vision

1. Intro to content-based image retrieval

2. Face recognition with eigenfaces

Page 22:

Intro to Content-Based Image Retrieval

Page 23:

A sample of existing systems

• QBIC at the Hermitage Museum

• Simplicity

• WISE

• Like Visual Shopping

Page 24:

[Example image: http://www.bodexyachting.com/images/sailing.jpg]

Page 25:

Basic technique (block diagram): Query Image → Extract Features (Primitives) → Similarity Measure → Matched Results, drawing on an Image Database and a Features Database, with a Relevance Feedback Algorithm closing the loop.

From http://www.amrita.edu/cde/downloads/ACBIR.ppt

Each image in the database is represented by a feature vector: (F1, F2, ..., Fn)

Query is represented in terms of same features: Q

Goal: Find stored image with vector Fi most similar to query vector Q

Page 26:

• Typical similarity measure: the inner product

$$F \cdot Q = F_1 Q_1 + F_2 Q_2 + \cdots + F_m Q_m$$
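A minimal sketch of ranking stored images by inner product, with made-up feature vectors:

```python
import numpy as np

# Rank database images by inner-product similarity to the query.
# Feature values are invented; real systems use color/texture/shape features.
features = np.array([[0.8, 0.1, 0.3],    # image 0
                     [0.2, 0.9, 0.4],    # image 1
                     [0.7, 0.2, 0.2]])   # image 2
query = np.array([0.9, 0.1, 0.2])

scores = features @ query                # one inner product per stored image
print(np.argsort(scores)[::-1])          # best match first
```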

Page 27:

Types of Features

• Text

• Color

• Texture

• Shape

• Layout

Page 28:

Color Features

Hue, saturation, value

Page 29:

[Color histogram example: http://www.owlnet.rice.edu/~elec301/Projects02/artSpy/patmac/mcolhist.gif]

Page 30:

• Moments of color distribution (mean, variance, skewness)

From: http://www2.cmp.uea.ac.uk/Research/compvis/ImageRetrieval/ImageRetrieval.htm

RGB distribution
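A sketch of the color-moment features named above (mean, variance, skewness per channel), using a random stand-in image:

```python
import numpy as np

# Color moments: per channel, the mean, variance, and skewness of the
# pixel distribution (9 numbers for an RGB image).
def color_moments(image):                     # image: H x W x 3 array
    feats = []
    for c in range(3):
        ch = image[..., c].astype(float).ravel()
        mu = ch.mean()
        var = ch.var()
        skew = ((ch - mu) ** 3).mean() / (var ** 1.5 + 1e-12)
        feats += [mu, var, skew]
    return np.array(feats)

print(color_moments(np.random.randint(0, 256, (32, 32, 3))))
```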

Page 31:

Texture representations

• entropy

• contrast

• Fourier and wavelet transforms

• Gabor filters

Page 32:

Wavelet transform

From: http://www.mathworks.com/matlabcentral/files/9554/content/wavelets/tp2.html

Page 33:

Shape representations

Need segmentation (another whole story!)

• area, eccentricity, major axis orientation

• Skeletons, shock graphs

• Fourier transformation of boundary

• histograms of edge orientations

Page 34:

From: http://www.lems.brown.edu/vision/researchAreas/ShockMatching/shock-ed-match-results1.g

Page 35:

Histogram of edge orientations

From: http://www.cs.ucl.ac.uk/staff/k.jacobs/teaching/prmv/Edge_histogramming.jpg

Page 36:

Some issues

• Query format, ease of querying

• Speed

• Crawling, preprocessing

• Interactivity, user relevance feedback

• Visual features — which to use? How to combine?

• Curse of dimensionality

• Indexing

• Evaluation of performance

Page 37:

Visual abilities largely missing from current CBIR systems

• Object recognition

• Perceptual organization

• Similarity between semantic concepts

• More generally, bridging “the semantic gap”

Page 38:

Image 1 Image 2

Examples of “semantic” similarity

Page 39:

Examples of “semantic” similarity

Image 1 Image 2

Page 40:

Image 1 Image 2

Examples of “semantic” similarity

Page 41:

“In general, current systems have not yet had significant impact on society due to an inability to bridge the semantic gap between computers and humans.”

Page 42:

How to bridge the semantic gap?

Page 43:

How could we apply LSA toimage retrieval?

Page 44:

Principal Components Analysis (PCA)

PCA is used to reduce dimensions of data without much loss of information.

Used in machine learning and in signal processing and image compression (among other things).

Page 45:

Background for PCA

• Suppose the attributes are A1 and A2, and we have n training examples. The x’s denote values of A1 and the y’s denote values of A2 over the training examples.

• Variance of an attribute:

$$\mathrm{var}(A_1) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Page 46:

• Covariance of two attributes:

$$\mathrm{cov}(A_1, A_2) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

• If the covariance is positive, both dimensions increase together. If negative, as one increases, the other decreases. Zero: the attributes are (linearly) uncorrelated.

Page 47:

• Covariance matrix

– Suppose we have n attributes, A1, ..., An.

– Covariance matrix:

$$C_{n \times n} = (c_{i,j}), \quad \text{where } c_{i,j} = \mathrm{cov}(A_i, A_j)$$
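A NumPy sketch of these definitions: build the covariance matrix from sample data and eigendecompose it (the data here are random stand-ins; note that np.cov uses the same n − 1 denominator by default):

```python
import numpy as np

# Covariance matrix of n attributes from sample data, then its
# eigen-decomposition.
X = np.random.randn(100, 3)              # 100 examples, 3 attributes
C = np.cov(X, rowvar=False)              # 3 x 3 covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)     # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]        # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print(eigvals)                           # variance along each principal axis
```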

Page 48:

Eigenvectors and eigenvalues

Let M be an n × n matrix.

v is an eigenvector of M if $M \times \mathbf{v} = \lambda \mathbf{v}$.

λ is called the eigenvalue associated with v.

For any eigenvector v of M and scalar a, $M \times (a\mathbf{v}) = \lambda (a\mathbf{v})$.

Thus you can always choose eigenvectors of length 1: $v_1^2 + \cdots + v_n^2 = 1$.

If M is real and symmetric (as a covariance matrix is), it has n such eigenvectors, and they are orthogonal to one another.

Thus eigenvectors can be used as a new basis for an n-dimensional vector space.

Page 49:

Face analysis using PCA:“Eigenfaces”

Main idea: From training images, learn basis vectors for “face space”

Page 50:

Generating Eigenfaces – in words

1. A large set of images of human faces is taken.

2. The images are normalized to line up the eyes, mouths and other features.

3. The eigenvectors of the covariance matrix of the face image vectors are then extracted.

4. These eigenvectors are called eigenfaces.

http://www.cs.cmu.edu/afs/cs/academic/class/15385-s06/lectures/ppts/lec-20.ppt

Page 51:

More details

Raw features are pixel intensity values Ai.

Each image has n × m features, where n, m are the dimensions of the image.

Each image is encoded as a vector Γi of these features:

Γi = (A1, A2, ..., An×m)

Compute the “mean” face in the training set:

$$\Psi = \frac{1}{M} \sum_{i=1}^{M} \Gamma_i$$

Page 52:

• Subtract the mean face from each face vector:

$$\Phi_i = \Gamma_i - \Psi$$

• Compute the covariance matrix C:

$$C_{n \times n} = (c_{j,k}), \quad \text{where } c_{j,k} = \mathrm{cov}(A_j, A_k)$$

• Compute the (unit) eigenvectors vi of C

• Keep only the first K principal components (eigenvectors)
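A NumPy sketch of these steps on random stand-in images. It uses the standard small-matrix trick (eigendecompose the M × M matrix ΦΦᵀ rather than the full pixel-space covariance), which is an implementation choice on my part, not something stated on the slide:

```python
import numpy as np

# Eigenfaces sketch with random stand-ins (M images of h x w pixels).
M, h, w = 20, 16, 16
faces = np.random.rand(M, h * w)         # each row: one flattened face Gamma_i

psi = faces.mean(axis=0)                 # mean face Psi
phi = faces - psi                        # mean-subtracted faces Phi_i

# Small-matrix trick: when M << h*w, eigendecompose the M x M matrix
# phi @ phi.T; mapping its eigenvectors back through phi.T gives the
# eigenvectors of the pixel-space covariance.
eigvals, a = np.linalg.eigh(phi @ phi.T)
order = np.argsort(eigvals)[::-1]
K = 8                                    # keep the first K principal components
eigenfaces = phi.T @ a[:, order[:K]]     # map back to pixel space
eigenfaces /= np.linalg.norm(eigenfaces, axis=0)  # unit eigenvectors v_i
print(eigenfaces.shape)                  # (h*w, K)
```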

Page 53:

Eigenfaces for Face Recognition

• When properly weighted, eigenfaces can be summed together to create an approximate gray-scale rendering of a human face.

• Remarkably few eigenvector terms are needed to give a fair likeness of most people’s faces.

• Hence eigenfaces provide a means of applying data compression to faces for identification purposes.

http://www.cs.cmu.edu/afs/cs/academic/class/15385-s06/lectures/ppts/lec-20.ppt

Any face: approximately the mean face plus a weighted sum of eigenfaces.

Page 54:

Projecting onto the Eigenfaces

• The eigenfaces v1, ..., vK span the space of faces

– A face is converted to eigenface coordinates by projecting onto each eigenface: $w_i = \mathbf{v}_i \cdot (\Gamma - \Psi)$

http://www.cs.cmu.edu/afs/cs/academic/class/15385-s06/lectures/ppts/lec-20.ppt

Page 55:

Recognition with Eigenfaces

• Algorithm

1. Process the image database (set of images with labels)

• Run PCA—compute eigenfaces

• Calculate the K coefficients for each image

2. Given a new image (to be recognized) x, calculate its K coefficients

http://www.cs.cmu.edu/afs/cs/academic/class/15385-s06/lectures/ppts/lec-20.ppt

3. Detect if x is a face

4. If it is a face, who is it?

• Find closest labeled face in database

• nearest-neighbor in K-dimensional space
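Continuing the previous sketch (this reuses faces, psi, eigenfaces, h, and w from that snippet), a minimal projection-plus-nearest-neighbor recognizer with hypothetical labels:

```python
import numpy as np

def coefficients(face, psi, eigenfaces):
    """w_i = v_i . (Gamma - Psi): the K coordinates in face space."""
    return eigenfaces.T @ (face - psi)

# Hypothetical database of labeled coefficient vectors:
db_coeffs = {name: coefficients(f, psi, eigenfaces)
             for name, f in [("alice", faces[0]), ("bob", faces[1])]}

def recognize(new_face):
    """Nearest neighbor in the K-dimensional coefficient space."""
    w = coefficients(new_face, psi, eigenfaces)
    return min(db_coeffs, key=lambda n: np.linalg.norm(db_coeffs[n] - w))

# A slightly perturbed copy of face 0 should match 'alice':
print(recognize(faces[0] + 0.01 * np.random.rand(h * w)))
```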

Page 56:

Choosing the Dimension K

[Figure: plot of eigenvalue magnitude versus index i, for i = 1, ..., N·M; the curve decays, and K is chosen where it levels off.]

• How many eigenfaces to use?

• Look at the decay of the eigenvalues

– the eigenvalue tells you the amount of variance “in the direction” of that eigenface

– ignore eigenfaces with low variance
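A sketch of one common way to pick K from the eigenvalue decay: keep enough components to cover a chosen fraction of the total variance (95% here is an arbitrary threshold, and the spectrum is a random stand-in):

```python
import numpy as np

# Stand-in decaying eigenvalue spectrum, sorted in decreasing order.
eigvals = np.sort(np.random.rand(50))[::-1] ** 3

# Cumulative fraction of total variance explained by the first i components.
explained = np.cumsum(eigvals) / eigvals.sum()
K = int(np.searchsorted(explained, 0.95)) + 1   # smallest K reaching 95%
print(K, explained[K - 1])
```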

Page 57:

Page 58:

Interpreting eigenfaces

The eigenfaces encode the principal sources of variation in the dataset (e.g., absence/presence of facial hair, skin tone, glasses, etc.).

We can represent any face as a linear combination of these “basis” faces.

Use this representation for:

• Face recognition (e.g., Euclidean distance from known faces)

• Linear discrimination (e.g., “glasses” versus “no glasses”, or “male” versus “female”)