Feature/Kernel Learning in Semi-supervised scenarios
Kevin [email protected]
UW 2007 Workshop on SSL for Language Processing
Agenda
1. Intro: Feature Learning in SSL
   1. Assumptions
   2. General algorithm/setup
2. 3 Case Studies
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
5. Related work
Semi-supervised learning (SSL): Three general assumptions
• How can unlabeled data help?
1. Bootstrap assumption:
   – Automatic labels on unlabeled data have sufficient accuracy
   – They can be incorporated into the training set
2. Low Density Separation assumption:
   – The classifier should not cut through a high density region, because clusters of data likely come from the same class
   – Unlabeled data can help identify high/low density regions
3. Change of Representation assumption
Low Density Assumption

[Figure: a scatter of +, -, and o samples. The black line cuts through a low density region; the green line cuts through a high density region.]
Change of Representation Assumption
• Basic learning process:
  1. Define a feature representation of the data
  2. Map each sample into a feature vector
  3. Apply a vector classifier learning algorithm
• What if the original feature set is bad?
  – Then learning algorithms will have a hard time
    • Ex: high correlation among features
    • Ex: irrelevant features
  – Different learning algorithms deal with this differently
• Assumption: large amounts of unlabeled data can help us learn a better feature representation
Two-stage learning process
• Stage 1: Feature learning
  1. Define an original feature representation
  2. Using unlabeled data, learn a "better" feature representation
• Stage 2: Supervised training
  1. Map each labeled training sample into the new feature vector
  2. Apply a vector classifier learning algorithm

[Diagram: in the basic learning process, labeled data feeds a feature vector, which feeds the classifier. In the 2-stage learning process, labeled and unlabeled data feed the original feature vector, which is mapped to the new feature vector used by the classifier.]
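The two stages above can be sketched in a few lines of Python. This is a minimal illustration, not an algorithm from the talk: Stage 1 here learns a reduced feature set from unlabeled data by dropping features whose counts barely vary (a crude stand-in for a "better" representation), and Stage 2 trains a simple nearest-centroid classifier on the labeled data. All names and the toy data are invented for illustration.

```python
def learn_feature_map(unlabeled_vectors, min_variance=0.1):
    """Stage 1: keep feature indices whose variance across unlabeled data is high."""
    n = len(unlabeled_vectors)
    dims = len(unlabeled_vectors[0])
    keep = []
    for j in range(dims):
        col = [v[j] for v in unlabeled_vectors]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        if var > min_variance:
            keep.append(j)
    return lambda v: [v[j] for j in keep]  # the learned map: original -> new features

def train_centroid_classifier(labeled, feature_map):
    """Stage 2: map labeled samples into the new space, store one centroid per class."""
    by_class = {}
    for vec, label in labeled:
        by_class.setdefault(label, []).append(feature_map(vec))
    centroids = {c: [sum(col) / len(vs) for col in zip(*vs)]
                 for c, vs in by_class.items()}
    def classify(vec):
        y = feature_map(vec)
        dist = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b))
        return min(centroids, key=lambda c: dist(y, centroids[c]))
    return classify

# Toy data: feature 0 is constant (useless); features 1-2 discriminate the topics.
unlabeled = [[1, 5, 0], [1, 4, 1], [1, 0, 6], [1, 1, 5]]
labeled = [([1, 5, 0], "politics"), ([1, 0, 6], "sports")]
fmap = learn_feature_map(unlabeled)
classify = train_centroid_classifier(labeled, fmap)
print(classify([1, 4, 1]))  # → politics
```

The key point of the sketch is the division of labor: the feature map is learned from unlabeled data alone, and only the (small) labeled set is used for supervised training.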
Philosophical reflection
• Machine learning is not magic
  – Human knowledge is encoded in the feature representation
  – The machine & statistics only learn the relative weight/importance of each feature from data
• Two-stage learning is not magic
  – It merely extends machine/statistical analysis to the feature "engineering" stage
  – Human knowledge is still needed in:
    • the original feature representation
    • deciding how original features transform into new features (here is the research frontier!)
• Designing features and learning weights on features are really the same thing – it's just a division of labor between human and machine.
Wait, how is this different from …
• Feature selection/extraction?
  – This approach applies traditional feature selection to a larger, unlabeled dataset.
  – In particular, we apply unsupervised feature selection methods (semi-supervised feature selection is an unexplored area)
• Supervised learning?
  – Supervised learning is a critical component
• We can use all we know in feature selection & supervised learning here:
  – These components aren't new: the new thing here is the semi-supervised perspective.
  – So far there is little work on this in language processing (compared to auto-labeling): lots of research opportunities!!!
Research questions
• Given lots of unlabeled data and an original feature representation, how can we transform it into a better, new representation?
1. How to define "better"? (objective)
   – Ideally, "better" = "leads to higher classification accuracy", but this also depends on the downstream classifier algorithm
   – Is there a surrogate metric for "better" that does not require labels?
2. How to model the feature transformation? (modeling)
   – Mostly we use a linear transform: y = Ax
     • x = original feature vector; y = new feature vector
     • A = transform matrix to learn
3. Given the objective and model, how to find A? (algorithm)
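The linear transform y = Ax from the modeling question can be spelled out concretely. The matrix values below are arbitrary illustrations (not from the talk): the first row merges two (presumably correlated) original features into one new feature, and the second row keeps a third feature as-is.

```python
def transform(A, x):
    """Map an original feature vector x to a new feature vector y = Ax."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

A = [[1.0, 1.0, 0.0],   # new feature 1 merges original features 1 and 2
     [0.0, 0.0, 1.0]]   # new feature 2 copies original feature 3
x = [2.0, 3.0, 5.0]
print(transform(A, x))  # → [5.0, 5.0]
```

Feature learning amounts to choosing A from unlabeled data; merging correlated features (as row 1 does) is exactly what clustering-style methods such as LSA aim to discover automatically.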
Agenda
1. Intro: Feature Learning in SSL
2. 3 Case Studies
   1. Correlated features
   2. Irrelevant features
   3. Feature set is not sufficiently expressive
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
5. Related work
Running Example: Topic classification
• Binary classification of documents by topic: Politics vs. Sports
• Setup:
  – Data: each document is vectorized so that each element represents a unigram and its count
    • Doc1: "The President met the Russian President"
    • Vocab: [The President met Russian talks strategy Czech missile]
    • Vector1: [2 2 1 1 0 0 0 0]
  – Classifier: some supervised training algorithm
• We will discuss 3 cases where the original feature set may be unsuitable for training
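The vectorization step in the setup is straightforward to implement; a small sketch for the example document (counting case-insensitively, so "The" and "the" share one count, matching Vector1 on the slide):

```python
def vectorize(doc, vocab):
    """Map a document to a vector of unigram counts over a fixed vocabulary."""
    tokens = doc.lower().split()
    return [tokens.count(word.lower()) for word in vocab]

vocab = ["The", "President", "met", "Russian",
         "talks", "strategy", "Czech", "missile"]
doc1 = "The President met the Russian President"
print(vectorize(doc1, vocab))  # → [2, 2, 1, 1, 0, 0, 0, 0]
```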
1. Highly correlated features
• Unigrams (features) in "Politics":
  – Republican, conservative, GOP
  – White, House, President
  – Iowa, primary
  – Israel, Hamas, Gaza
• Unigrams in "Sports":
  – game, ballgame
  – strike-out, baseball
  – NBA, Sonics
  – Tiger, Woods, golf
• OTHER EXAMPLES?
• Highly correlated features represent redundant information:
  – A document with "Republican" isn't necessarily more about politics than a document with "Republican" and "conservative."
  – Some classifiers are hurt by highly correlated (redundant) features through "double-counting", e.g. Naïve Bayes & other generative models
2. Irrelevant features
• Unigrams common to both "Politics" & "Sports":
  – the, a, but, if, while (stop words)
  – win, lose (topic-neutral words)
  – Bush (politician or sports player)
• Very rare unigrams in either topic:
  – Wazowski, Pitcairn (named entities)
  – Brittany Speers (typos, noisy features)
• OTHER EXAMPLES?
• Irrelevant features unnecessarily expand the "search space". They should get zero weight, but the classifier may not do so.
3. Feature set is not sufficiently expressive
• Original feature set:
  – 3 features: [leaders challenge group]
  – Politics or Sports?
    • "The leaders of the terrorist group issued a challenge"
    • "It's a challenge for the cheer leaders to take a group picture"
  – Different topics have similar (here, identical) feature vectors
• A more expressive (discriminative) feature set:
  – 5 features: [leader challenge group cheer terrorist]
  – Bigrams: ["the leaders" "cheer leaders" …]
• Vector-based linear classifiers simply don't work
  – The data is "inseparable" by the classifier
• Non-linear classifiers (e.g. kernel methods, decision trees, neural networks) automatically learn combinations of features
  – But there is still no substitute for a human expert thinking of features
Agenda
1. Intro: Feature Learning in SSL
2. 3 Case Studies
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
5. Related work
Dealing with correlated and/or irrelevant features
• Feature clustering methods:
  – Idea: cluster features so that similarly distributed features are mapped together
  – Algorithms:
    • Word clustering (using the Brown algorithm, spectral clustering, k-means, …)
    • Latent Semantic Analysis
• Principal Components Analysis (PCA):
  – Idea: map the original feature vector onto directions of high variance
    • Words with similar counts across all documents should be deleted from the feature set
    • Words with largely varying counts *might* be discriminating features
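A minimal PCA sketch of the idea above, assuming NumPy is available (the toy data is invented): center the count vectors, take the eigendecomposition of their covariance, and project onto the direction of highest variance. Note how the constant third feature (an "irrelevant" feature) gets zero weight, while the two correlated features collapse into one direction.

```python
import numpy as np

docs = np.array([[5.0, 4.0, 1.0],
                 [6.0, 5.0, 1.0],
                 [0.0, 1.0, 1.0],
                 [1.0, 0.0, 1.0]])  # 4 documents x 3 unigram counts

centered = docs - docs.mean(axis=0)            # remove per-feature means
cov = centered.T @ centered / len(docs)        # feature covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
top = eigvecs[:, np.argsort(eigvals)[::-1][:1]]  # direction of highest variance
projected = centered @ top                     # new 1-d feature per document
print(projected.shape)                         # → (4, 1)
```

The constant feature contributes zero variance, so the top principal direction ignores it entirely, which is exactly the deletion behavior described on the slide.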
Latent Semantic Analysis (LSA)
• Currently, a document is represented by a set of terms (words).
• LSA: represent a document by a set of "latent semantic" variables, which are coarser-grained than terms.
  – A kind of "feature clustering"
  – Latent semantics could cluster together synonyms or similar words, but in general it may be difficult to say "what" a latent semantic variable corresponds to.
  – Strengths: data-driven, scalable algorithm (with good results in natural language processing and information retrieval!)
LSA Algorithm
1. Represent the data as a term-document matrix
2. Perform singular value decomposition
3. Map documents to new feature vectors using the singular vectors with the largest singular values

• Term-document matrix:
  – t terms, d documents. The matrix has size (t x d)
  – Each column is a document (represented by how many times each term occurs in that doc)
  – Each row is a term (represented by how many times it occurs across the documents)
Term-document matrix: X
• Distance between 2 documents: dot product between column vectors
  – d1' * d2 = 5*5 + 4*3 + 0*0 = 37
  – d1' * d3 = 5*0 + 4*0 + 0*7 = 0
  – X' * X is the matrix of document distances
• Distance between 2 terms: dot product between row vectors
  – t1 * t2' = 5*4 + 5*3 + 0*0 + 0*1 + 10*7 = 105
  – t1 * t3' = 5*0 + 5*0 + 0*7 + 0*6 + 10*3 = 30
  – X * X' is the matrix of term distances

        doc1  doc2  doc3  doc4  doc5
term1    5     5     0     0    10
term2    4     3     0     1     7
term3    0     0     7     6     3
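The dot products on this slide can be recomputed directly from the matrix (rows = terms, columns = documents):

```python
X = [[5, 5, 0, 0, 10],   # term1
     [4, 3, 0, 1, 7],    # term2
     [0, 0, 7, 6, 3]]    # term3

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def column(M, j):
    return [row[j] for row in M]

print(dot(column(X, 0), column(X, 1)))  # d1' * d2 → 37
print(dot(column(X, 0), column(X, 2)))  # d1' * d3 → 0
print(dot(X[0], X[1]))                  # t1 * t2' → 105
print(dot(X[0], X[2]))                  # t1 * t3' → 30
```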
Singular Value Decomposition (SVD)
• Eigendecomposition:
  – A is a square matrix
  – Av = ev
    • (v is an eigenvector, e is an eigenvalue)
  – A = VEV'
    • (E is a diagonal matrix of eigenvalues)
    • (V is a matrix of eigenvectors; V*V' = V'*V = I)
• SVD:
  – X is a (t x d) rectangular matrix
  – Xd = st
    • (s is a singular value; t & d are left/right singular vectors)
  – X = TSD'
    • (S is a t x d matrix with the singular values on its diagonal and zeros elsewhere)
    • (T is t x t, D is d x d; T*T' = I; D*D' = I)
SVD on term-document matrix
• X = T*S*D' = (t x t) * (t x d) * (d x d)
• Low-rank approximation: X ≈ (t x k) * (k x k) * (k x d)
• X'*X = (TSD')'(TSD') = D(S'S)D'
• New document vector representation: d'*T*inv(S)

SVD of the example matrix (showing the three rows of D' that carry nonzero singular values):

T  =  -0.78  0.25  0.56      S  =  15.34  0     0     0  0
      -0.55  0.11 -0.82             0     9.10  0     0  0
      -0.27 -0.96  0.05             0     0     0.85  0  0

D' =  -0.40 -0.36 -0.12 -0.14 -0.81
       0.18  0.17 -0.73 -0.62  0.04
      -0.52  0.42  0.45 -0.56  0.09

Rank-2 approximation: keep the first two columns of T, the top-left 2x2 block of S, and the first two rows of D':

T2 =  -0.78  0.25      S2 =  15.34  0        D2' =  -0.40 -0.36 -0.12 -0.14 -0.81
      -0.55  0.11             0     9.10             0.18  0.17 -0.73 -0.62  0.04
      -0.27 -0.96
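The decomposition is easy to verify numerically, assuming NumPy: compute the SVD of the term-document matrix, form the rank-2 approximation, and map each document to its new low-dimensional representation d'*T*inv(S).

```python
import numpy as np

X = np.array([[5, 5, 0, 0, 10],
              [4, 3, 0, 1, 7],
              [0, 0, 7, 6, 3]], dtype=float)

T, s, Dt = np.linalg.svd(X, full_matrices=False)  # X = T @ diag(s) @ Dt
print(np.round(s, 2))  # singular values, approximately [15.34, 9.10, 0.86]

# Rank-2 approximation: keep the two largest singular values.
k = 2
X2 = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]

# New k-dimensional document representation d' * T * inv(S), one row per document.
docs_k = X.T @ T[:, :k] @ np.diag(1.0 / s[:k])
print(docs_k.shape)  # → (5, 2)
```

Documents that were far apart in the original 3-term space (e.g. doc1 vs. doc3) remain well separated in the 2-dimensional latent space, while the small third singular value (~0.85) confirms that one dimension of the original representation was largely redundant.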
Agenda
1. Intro: Feature Learning in SSL
2. 3 Case Studies
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
   1. Links between features and kernels
   2. Algo2: Neighborhood kernels
   3. Algo3: Gaussian Mixture Model kernels
5. Related work
Philosophical reflection: Distances
• (Almost) everything in machine learning is about distances:
  – A new sample is labeled positive because it is "closer" in distance to the positive training examples
• How to define distance?
  – Conventional way: dot product of feature vectors
  – That's why feature representation is important
• What if we directly learn the distance?
  – d(x1, x2): the inputs are two samples, the output is a number
  – Argument: it doesn't matter what features we define; all that matters is whether positive samples are close in distance to other positive samples, etc.
What is a kernel?
• A kernel is a distance function:
  – Dot product in feature space
    • x1 = [a b]; x2 = [c d]; d(x1,x2) = x1'*x2 = ac + bd
  – Polynomial kernel:
    • Map features into higher-order polynomials
    • x1 → [aa bb ab ba]; x2 → [cc dd cd dc]
    • d(x1,x2) = aacc + bbdd + 2abcd
    • Kernel trick: polynomial kernel = (ac+bd)^2
• When we define a kernel, we are implicitly defining a feature space
  – Recall that in SVMs, kernels allow classification in spaces non-linear w.r.t. the original features
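The kernel trick on this slide can be checked directly: the explicit degree-2 feature map and the closed form (x1'*x2)^2 always give the same value.

```python
def explicit_poly2(x1, x2):
    """Dot product after mapping [a, b] -> [aa, bb, ab, ba]."""
    a, b = x1
    c, d = x2
    phi1 = [a * a, b * b, a * b, b * a]
    phi2 = [c * c, d * d, c * d, d * c]
    return sum(p * q for p, q in zip(phi1, phi2))

def kernel_poly2(x1, x2):
    """Kernel trick: (x1' * x2)^2, no explicit feature map needed."""
    dot = sum(p * q for p, q in zip(x1, x2))
    return dot ** 2

x1, x2 = [2.0, 3.0], [1.0, 4.0]
print(explicit_poly2(x1, x2), kernel_poly2(x1, x2))  # → 196.0 196.0
```

This is why kernels implicitly define a feature space: the four-dimensional polynomial features are never materialized, yet the dot product in that space is computed exactly.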
Kernel Learning in Semi-supervised scenarios
1. Learn a kernel function k(x1,x2) from unlabeled + labeled data
2. Use this kernel when training a supervised classifier on the labeled data

[Diagram: in the 2-stage learning process, labeled and unlabeled data feed the kernel learning algorithm, which produces a new kernel used by a kernel-based classifier trained on the labeled data.]
Neighborhood kernel (1/3)
• Idea: the distance between x1 and x2 is based on the distances between their neighbors
• Below are samples in feature space
  – Would you say the distances d(x1,x2) and d(x2,x7) are equivalent?
  – Or does it seem like x2 should be closer to x1 than x3?

[Figure: samples x1–x11 scattered in feature space.]
Neighborhood kernel (2/3)
• Analogy: "I am close to you if my clique of close friends is close to yours."
  – Neighborhood kernel k(x1,x2): the average distance among the neighbors of x1 and x2
• E.g. x1 = [1]; x2 = [6]; x3 = [5]; x4 = [2]
  – Original distance: x2 - x1 = 6 - 1 = 5
  – Neighbor distance = 4
    • The closest neighbor to x1 is x4; the closest neighbor to x2 is x3
    • [(x2-x1) + (x3-x1) + (x2-x4) + (x3-x4)] / 4 = [5+4+4+3] / 4 = 4
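The worked example above can be reproduced with a short sketch (function names are my own). Each point's neighborhood is the point itself plus its nearest neighbor from the (unlabeled + labeled) pool, and the new distance averages the base distances over all neighborhood pairs:

```python
def base_dist(a, b):
    return abs(a - b)  # 1-d example from the slide; any base distance works

def neighborhood(x, samples, k=1):
    """The point itself plus its k nearest neighbors from the pool."""
    others = sorted((s for s in samples if s != x),
                    key=lambda s: base_dist(s, x))
    return [x] + others[:k]

def neighborhood_dist(x1, x2, samples, k=1):
    """Average base distance over all pairs drawn from the two neighborhoods."""
    n1 = neighborhood(x1, samples, k)
    n2 = neighborhood(x2, samples, k)
    return sum(base_dist(u, v) for u in n1 for v in n2) / (len(n1) * len(n2))

samples = [1, 6, 5, 2]                    # x1, x2, x3, x4 from the slide
print(base_dist(1, 6))                    # original distance → 5
print(neighborhood_dist(1, 6, samples))   # neighbor distance → 4.0
```

Note that this is the unlabeled data doing the work: the neighborhoods are found in the full sample pool, so adding more unlabeled points changes the learned distance.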
Neighborhood kernel (3/3)
• Mathematical form (with N(x) the set containing x and its nearest neighbors, and d a base distance, consistent with the worked example):

  k(x1, x2) = [ Σ_{u in N(x1)} Σ_{v in N(x2)} d(u, v) ] / ( |N(x1)| * |N(x2)| )

• Design questions:
  – What's the base distance? What's the distance used for finding close neighbors?
  – How many neighbors to use?
• Important: when do you expect neighborhood kernels to work? What assumptions does this method employ?
Gaussian Mixture Model Kernels
1. Perform a (soft) clustering of your labeled + unlabeled dataset (by fitting a Gaussian mixture model)
2. The distance between two samples is defined by whether they are in the same cluster:

   K(x,y) = prob[x in cluster1] * prob[y in cluster1]
          + prob[x in cluster2] * prob[y in cluster2]
          + prob[x in cluster3] * prob[y in cluster3] + …

[Figure: samples x1–x11 scattered in feature space.]
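The kernel formula itself is a one-liner once the soft clustering is done. In practice the posteriors prob[x in cluster_c] come from fitting a Gaussian mixture with EM on the labeled + unlabeled pool; in this sketch they are hand-set illustrative values (for three hypothetical samples and two clusters), so only the kernel computation is real:

```python
posteriors = {
    "x1": [0.9, 0.1],   # mostly in cluster 1
    "x2": [0.8, 0.2],   # mostly in cluster 1
    "x7": [0.1, 0.9],   # mostly in cluster 2
}

def gmm_kernel(x, y):
    """K(x,y) = sum over clusters c of prob[x in c] * prob[y in c]."""
    return sum(px * py for px, py in zip(posteriors[x], posteriors[y]))

print(round(gmm_kernel("x1", "x2"), 2))  # same cluster → 0.74 (high)
print(round(gmm_kernel("x1", "x7"), 2))  # different clusters → 0.18 (low)
```

Samples that fall in the same soft cluster get a high kernel value regardless of their raw feature distance, which is exactly how the unlabeled data's cluster structure enters the classifier.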
Seeing relationships
• We've presented various methods, but at some level they are all the same…
• Gaussian mixture kernel & neighborhood kernel?
• Gaussian mixture kernel & feature clustering?
• Can you define a feature learning version of neighborhood kernels?
• Can you define a kernel learning version of LSA?
Agenda
1. Intro: Feature Learning in SSL
2. 3 Case Studies
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
5. Related work
Related Work: Feature Learning
• Look into the machine learning & signal processing literature for unsupervised feature selection / extraction
• Discussed here:
  – Two-stage learning:
    • C. Oliveira, F. Cozman, and I. Cohen. "Splitting the unsupervised and supervised components of semi-supervised learning." ICML 2005 Workshop on Learning with Partially Classified Training Data, 2005.
  – Latent Semantic Analysis:
    • S. Deerwester et al. "Indexing by Latent Semantic Analysis." Journal of the American Society for Information Science (41), 1990.
    • T. Hofmann. "Probabilistic Latent Semantic Analysis." Uncertainty in AI, 1999.
  – Feature clustering:
    • W. Li and A. McCallum. "Semi-supervised sequence modeling with syntactic topic models." AAAI 2005.
    • Z. Niu, D.-H. Ji, and C. L. Tan. "A semi-supervised feature clustering algorithm with application to word sense disambiguation." HLT 2005.
• Semi-supervised feature selection is a new area to explore:
  – Z. Zhao and H. Liu. "Spectral Feature Selection for Supervised and Unsupervised Learning." ICML 2007.
Related Work: Kernel Learning
• Beware that many kernel learning algorithms are transductive (they don't apply to samples not originally in your training data):
  – N. Cristianini et al. "On kernel target alignment." NIPS 2002.
  – G. Lanckriet et al. "Learning a kernel matrix with semi-definite programming." ICML 2002.
  – C. Ong, A. Smola, and R. Williamson. "Learning the kernel with hyperkernels." Journal of Machine Learning Research (6), 2005.
• Discussed here:
  – Neighborhood kernel and others:
    • J. Weston et al. "Semi-supervised protein classification using cluster kernels." Bioinformatics 21(15), 2005.
  – Gaussian mixture kernel and KernelBoost:
    • T. Hertz, A. Bar-Hillel, and D. Weinshall. "Learning a Kernel Function for Classification with Small Training Samples." ICML 2006.
• Other keywords: "distance learning", "distance metric learning"
  – E. Xing, M. Jordan, and S. Russell. "Distance metric learning with applications to clustering with side-information." NIPS 2002.
Related Work: SSL
• Good survey article:
  – Xiaojin Zhu. "Semi-supervised learning literature survey." Wisconsin CS Tech Report.
• Graph-based SSL methods are popular. Some can be related to kernel learning.
• SSL on sequence models (of particular importance to NLP):
  – J. Lafferty, X. Zhu, and Y. Liu. "Kernel Conditional Random Fields: Representation and Clique Selection." ICML 2004.
  – Y. Altun, D. McAllester, and M. Belkin. "Maximum margin semi-supervised learning for structured variables." NIPS 2005.
  – U. Brefeld and T. Scheffer. "Semi-supervised learning for structured output variables." ICML 2006.
• Other promising methods:
  – Fisher kernels and features from generative models:
    • T. Jaakkola and D. Haussler. "Exploiting generative models in discriminative classifiers." NIPS 1998.
    • A. Holub, M. Welling, and P. Perona. "Exploiting unlabelled data for hybrid object classification." NIPS 2005 Workshop on Inter-class transfer.
    • A. Fraser and D. Marcu. "Semi-supervised word alignment." ACL 2006.
  – Multi-task learning formulation:
    • R. Ando and T. Zhang. "A High-Performance Semi-Supervised Learning Method for Text Chunking." ACL 2005.
    • J. Blitzer, R. McDonald, and F. Pereira. "Domain adaptation with structural correspondence learning." EMNLP 2006.
Summary
• Feature learning and kernel learning are two sides of the same coin
• Feature/kernel learning in SSL consists of a 2-stage approach:
  – (1) learn a better feature representation/kernel with unlabeled data
  – (2) learn a traditional classifier on labeled data
• 3 cases of inadequate feature representation: correlated, irrelevant, not expressive
• Particular algorithms presented: LSA, neighborhood kernels, GMM kernels
• Many related areas, many algorithms, but the underlying assumptions may be very similar