Feature/Kernel Learning in Semi-supervised scenarios
Kevin [email protected]
UW 2007 Workshop on SSL for Language Processing
Agenda
1. Intro: Feature Learning in SSL
   1. Assumptions
   2. General algorithm/setup
2. 3 Case Studies
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
5. Related work
Semi-supervised learning (SSL): Three general assumptions
• How can unlabeled data help?
1. Bootstrap assumption:
   – Automatic labels on unlabeled data have sufficient accuracy
   – They can be incorporated into the training set
2. Low Density Separation assumption:
   – The classifier should not cut through a high density region, because clusters of data likely come from the same class
   – Unlabeled data can help identify high/low density regions
3. Change of Representation assumption
Low Density Assumption

[Figure: a scatter of +, -, and o samples. The black line cuts through a low density region; the green line cuts through a high density region.]
Change of Representation Assumption
• Basic learning process:
  1. Define a feature representation of the data
  2. Map each sample into a feature vector
  3. Apply a vector classifier learning algorithm
• What if the original feature set is bad?
  – Then learning algorithms will have a hard time
    • Ex: high correlation among features
    • Ex: irrelevant features
  – Different learning algorithms deal with this differently
• Assumption: large amounts of unlabeled data can help us learn a better feature representation
Two-stage learning process
• Stage 1: Feature learning
  1. Define an original feature representation
  2. Using unlabeled data, learn a "better" feature representation
• Stage 2: Supervised training
  1. Map each labeled training sample into the new feature vector
  2. Apply a vector classifier learning algorithm

[Diagram: in the basic learning process, labeled data feeds a feature vector, which feeds the classifier. In the 2-stage learning process, labeled and unlabeled data feed the original feature vector, which is mapped to the new feature vector used by the classifier.]
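The two stages above can be sketched in a few lines of Python. This is a minimal illustration, not an algorithm from the talk: Stage 1 here learns a reduced feature set from unlabeled data by dropping features whose counts barely vary (a crude stand-in for a "better" representation), and Stage 2 trains a simple nearest-centroid classifier on the labeled data. All names and the toy data are invented for illustration.

```python
def learn_feature_map(unlabeled_vectors, min_variance=0.1):
    """Stage 1: keep feature indices whose variance across unlabeled data is high."""
    n = len(unlabeled_vectors)
    dims = len(unlabeled_vectors[0])
    keep = []
    for j in range(dims):
        col = [v[j] for v in unlabeled_vectors]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        if var > min_variance:
            keep.append(j)
    return lambda v: [v[j] for j in keep]  # the learned map: original -> new features

def train_centroid_classifier(labeled, feature_map):
    """Stage 2: map labeled samples into the new space, store one centroid per class."""
    by_class = {}
    for vec, label in labeled:
        by_class.setdefault(label, []).append(feature_map(vec))
    centroids = {c: [sum(col) / len(vs) for col in zip(*vs)]
                 for c, vs in by_class.items()}
    def classify(vec):
        y = feature_map(vec)
        dist = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b))
        return min(centroids, key=lambda c: dist(y, centroids[c]))
    return classify

# Toy data: feature 0 is constant (useless); features 1-2 discriminate the topics.
unlabeled = [[1, 5, 0], [1, 4, 1], [1, 0, 6], [1, 1, 5]]
labeled = [([1, 5, 0], "politics"), ([1, 0, 6], "sports")]
fmap = learn_feature_map(unlabeled)
classify = train_centroid_classifier(labeled, fmap)
print(classify([1, 4, 1]))  # → politics
```

The key point of the sketch is the division of labor: the feature map is learned from unlabeled data alone, and only the (small) labeled set is used for supervised training.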
Philosophical reflection
• Machine learning is not magic
  – Human knowledge is encoded in the feature representation
  – The machine & statistics only learn the relative weight/importance of each feature from data
• Two-stage learning is not magic
  – It merely extends machine/statistical analysis to the feature "engineering" stage
  – Human knowledge is still needed in:
    • the original feature representation
    • deciding how original features transform into new features (here is the research frontier!)
• Designing features and learning weights on features are really the same thing – it's just a division of labor between human and machine.
Wait, how is this different from …
• Feature selection/extraction?
  – This approach applies traditional feature selection to a larger, unlabeled dataset.
  – In particular, we apply unsupervised feature selection methods (semi-supervised feature selection is an unexplored area)
• Supervised learning?
  – Supervised learning is a critical component
• We can use all we know in feature selection & supervised learning here:
  – These components aren't new: the new thing here is the semi-supervised perspective.
  – So far there is little work on this in language processing (compared to auto-labeling): lots of research opportunities!!!
Research questions
• Given lots of unlabeled data and an original feature representation, how can we transform it into a better, new representation?
1. How to define "better"? (objective)
   – Ideally, "better" = "leads to higher classification accuracy", but this also depends on the downstream classifier algorithm
   – Is there a surrogate metric for "better" that does not require labels?
2. How to model the feature transformation? (modeling)
   – Mostly we use a linear transform: y = Ax
     • x = original feature vector; y = new feature vector
     • A = transform matrix to learn
3. Given the objective and model, how to find A? (algorithm)
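The linear transform y = Ax from the modeling question can be spelled out concretely. The matrix values below are arbitrary illustrations (not from the talk): the first row merges two (presumably correlated) original features into one new feature, and the second row keeps a third feature as-is.

```python
def transform(A, x):
    """Map an original feature vector x to a new feature vector y = Ax."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

A = [[1.0, 1.0, 0.0],   # new feature 1 merges original features 1 and 2
     [0.0, 0.0, 1.0]]   # new feature 2 copies original feature 3
x = [2.0, 3.0, 5.0]
print(transform(A, x))  # → [5.0, 5.0]
```

Feature learning amounts to choosing A from unlabeled data; merging correlated features (as row 1 does) is exactly what clustering-style methods such as LSA aim to discover automatically.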
Agenda
1. Intro: Feature Learning in SSL
2. 3 Case Studies
   1. Correlated features
   2. Irrelevant features
   3. Feature set is not sufficiently expressive
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
5. Related work
Running Example: Topic classification
• Binary classification of documents by topic: Politics vs. Sports
• Setup:
  – Data: each document is vectorized so that each element represents a unigram and its count
    • Doc1: "The President met the Russian President"
    • Vocab: [The President met Russian talks strategy Czech missile]
    • Vector1: [2 2 1 1 0 0 0 0]
  – Classifier: some supervised training algorithm
• We will discuss 3 cases where the original feature set may be unsuitable for training
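The vectorization step in the setup is straightforward to implement; a small sketch for the example document (counting case-insensitively, so "The" and "the" share one count, matching Vector1 on the slide):

```python
def vectorize(doc, vocab):
    """Map a document to a vector of unigram counts over a fixed vocabulary."""
    tokens = doc.lower().split()
    return [tokens.count(word.lower()) for word in vocab]

vocab = ["The", "President", "met", "Russian",
         "talks", "strategy", "Czech", "missile"]
doc1 = "The President met the Russian President"
print(vectorize(doc1, vocab))  # → [2, 2, 1, 1, 0, 0, 0, 0]
```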
1. Highly correlated features
• Unigrams (features) in "Politics":
  – Republican, conservative, GOP
  – White, House, President
  – Iowa, primary
  – Israel, Hamas, Gaza
• Unigrams in "Sports":
  – game, ballgame
  – strike-out, baseball
  – NBA, Sonics
  – Tiger, Woods, golf
• OTHER EXAMPLES?
• Highly correlated features represent redundant information:
  – A document with "Republican" isn't necessarily more about politics than a document with "Republican" and "conservative."
  – Some classifiers are hurt by highly correlated (redundant) features through "double-counting", e.g. Naïve Bayes & other generative models
2. Irrelevant features
• Unigrams common to both "Politics" & "Sports":
  – the, a, but, if, while (stop words)
  – win, lose (topic-neutral words)
  – Bush (politician or sports player)
• Very rare unigrams in either topic:
  – Wazowski, Pitcairn (named entities)
  – Brittany Speers (typos, noisy features)
• OTHER EXAMPLES?
• Irrelevant features unnecessarily expand the "search space". They should get zero weight, but the classifier may not do so.
3. Feature set is not sufficiently expressive
• Original feature set:
  – 3 features: [leaders challenge group]
  – Politics or Sports?
    • "The leaders of the terrorist group issued a challenge"
    • "It's a challenge for the cheer leaders to take a group picture"
  – Different topics have similar (here, identical) feature vectors
• A more expressive (discriminative) feature set:
  – 5 features: [leader challenge group cheer terrorist]
  – Bigrams: ["the leaders" "cheer leaders" …]
• Vector-based linear classifiers simply don't work
  – The data is "inseparable" by the classifier
• Non-linear classifiers (e.g. kernel methods, decision trees, neural networks) automatically learn combinations of features
  – But there is still no substitute for a human expert thinking of features
Agenda
1. Intro: Feature Learning in SSL
2. 3 Case Studies
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
5. Related work
Dealing with correlated and/or irrelevant features
• Feature clustering methods:
  – Idea: cluster features so that similarly distributed features are mapped together
  – Algorithms:
    • Word clustering (using the Brown algorithm, spectral clustering, k-means, …)
    • Latent Semantic Analysis
• Principal Components Analysis (PCA):
  – Idea: map the original feature vector onto directions of high variance
    • Words with similar counts across all documents should be deleted from the feature set
    • Words with largely varying counts *might* be discriminating features
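A minimal PCA sketch of the idea above, assuming NumPy is available (the toy data is invented): center the count vectors, take the eigendecomposition of their covariance, and project onto the direction of highest variance. Note how the constant third feature (an "irrelevant" feature) gets zero weight, while the two correlated features collapse into one direction.

```python
import numpy as np

docs = np.array([[5.0, 4.0, 1.0],
                 [6.0, 5.0, 1.0],
                 [0.0, 1.0, 1.0],
                 [1.0, 0.0, 1.0]])  # 4 documents x 3 unigram counts

centered = docs - docs.mean(axis=0)            # remove per-feature means
cov = centered.T @ centered / len(docs)        # feature covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
top = eigvecs[:, np.argsort(eigvals)[::-1][:1]]  # direction of highest variance
projected = centered @ top                     # new 1-d feature per document
print(projected.shape)                         # → (4, 1)
```

The constant feature contributes zero variance, so the top principal direction ignores it entirely, which is exactly the deletion behavior described on the slide.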
Latent Semantic Analysis (LSA)
• Currently, a document is represented by a set of terms (words).
• LSA: represent a document by a set of "latent semantic" variables, which are coarser-grained than terms.
  – A kind of "feature clustering"
  – Latent semantics could cluster together synonyms or similar words, but in general it may be difficult to say "what" a latent semantic variable corresponds to.
  – Strengths: data-driven, scalable algorithm (with good results in natural language processing and information retrieval!)
LSA Algorithm
1. Represent the data as a term-document matrix
2. Perform singular value decomposition
3. Map documents to new feature vectors using the singular vectors with the largest singular values

• Term-document matrix:
  – t terms, d documents. The matrix has size (t x d)
  – Each column is a document (represented by how many times each term occurs in that doc)
  – Each row is a term (represented by how many times it occurs across the documents)
Term-document matrix: X
• Distance between 2 documents: dot product between column vectors
  – d1' * d2 = 5*5 + 4*3 + 0*0 = 37
  – d1' * d3 = 5*0 + 4*0 + 0*7 = 0
  – X' * X is the matrix of document distances
• Distance between 2 terms: dot product between row vectors
  – t1 * t2' = 5*4 + 5*3 + 0*0 + 0*1 + 10*7 = 105
  – t1 * t3' = 5*0 + 5*0 + 0*7 + 0*6 + 10*3 = 30
  – X * X' is the matrix of term distances

        doc1  doc2  doc3  doc4  doc5
term1    5     5     0     0    10
term2    4     3     0     1     7
term3    0     0     7     6     3
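The dot products on this slide can be recomputed directly from the matrix (rows = terms, columns = documents):

```python
X = [[5, 5, 0, 0, 10],   # term1
     [4, 3, 0, 1, 7],    # term2
     [0, 0, 7, 6, 3]]    # term3

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def column(M, j):
    return [row[j] for row in M]

print(dot(column(X, 0), column(X, 1)))  # d1' * d2 → 37
print(dot(column(X, 0), column(X, 2)))  # d1' * d3 → 0
print(dot(X[0], X[1]))                  # t1 * t2' → 105
print(dot(X[0], X[2]))                  # t1 * t3' → 30
```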
Singular Value Decomposition (SVD)
• Eigendecomposition:
  – A is a square matrix
  – Av = ev
    • (v is an eigenvector, e is an eigenvalue)
  – A = VEV'
    • (E is a diagonal matrix of eigenvalues)
    • (V is a matrix of eigenvectors; V*V' = V'*V = I)
• SVD:
  – X is a (t x d) rectangular matrix
  – Xd = st
    • (s is a singular value; t & d are left/right singular vectors)
  – X = TSD'
    • (S is a t x d matrix with the singular values on its diagonal and zeros elsewhere)
    • (T is t x t, D is d x d; T*T' = I; D*D' = I)
SVD on term-document matrix
• X = T*S*D' = (t x t) * (t x d) * (d x d)
• Low-rank approximation: X ≈ (t x k) * (k x k) * (k x d)
• X'*X = (TSD')'(TSD') = D(S'S)D'
• New document vector representation: d'*T*inv(S)

SVD of the example matrix (showing the three rows of D' that carry nonzero singular values):

T  =  -0.78  0.25  0.56      S  =  15.34  0     0     0  0
      -0.55  0.11 -0.82             0     9.10  0     0  0
      -0.27 -0.96  0.05             0     0     0.85  0  0

D' =  -0.40 -0.36 -0.12 -0.14 -0.81
       0.18  0.17 -0.73 -0.62  0.04
      -0.52  0.42  0.45 -0.56  0.09

Rank-2 approximation: keep the first two columns of T, the top-left 2x2 block of S, and the first two rows of D':

T2 =  -0.78  0.25      S2 =  15.34  0        D2' =  -0.40 -0.36 -0.12 -0.14 -0.81
      -0.55  0.11             0     9.10             0.18  0.17 -0.73 -0.62  0.04
      -0.27 -0.96
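The decomposition is easy to verify numerically, assuming NumPy: compute the SVD of the term-document matrix, form the rank-2 approximation, and map each document to its new low-dimensional representation d'*T*inv(S).

```python
import numpy as np

X = np.array([[5, 5, 0, 0, 10],
              [4, 3, 0, 1, 7],
              [0, 0, 7, 6, 3]], dtype=float)

T, s, Dt = np.linalg.svd(X, full_matrices=False)  # X = T @ diag(s) @ Dt
print(np.round(s, 2))  # singular values, approximately [15.34, 9.10, 0.86]

# Rank-2 approximation: keep the two largest singular values.
k = 2
X2 = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]

# New k-dimensional document representation d' * T * inv(S), one row per document.
docs_k = X.T @ T[:, :k] @ np.diag(1.0 / s[:k])
print(docs_k.shape)  # → (5, 2)
```

Documents that were far apart in the original 3-term space (e.g. doc1 vs. doc3) remain well separated in the 2-dimensional latent space, while the small third singular value (~0.85) confirms that one dimension of the original representation was largely redundant.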
Agenda
1. Intro: Feature Learning in SSL
2. 3 Case Studies
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
   1. Links between features and kernels
   2. Algo2: Neighborhood kernels
   3. Algo3: Gaussian Mixture Model kernels
5. Related work
Philosophical reflection: Distances
• (Almost) everything in machine learning is about distances:
  – A new sample is labeled positive because it is "closer" in distance to the positive training examples
• How to define distance?
  – Conventional way: dot product of feature vectors
  – That's why feature representation is important
• What if we directly learn the distance?
  – d(x1, x2): the inputs are two samples, the output is a number
  – Argument: it doesn't matter what features we define; all that matters is whether positive samples are close in distance to other positive samples, etc.
What is a kernel?
• A kernel is a distance function:
  – Dot product in feature space
    • x1 = [a b]; x2 = [c d]; d(x1,x2) = x1'*x2 = ac + bd
  – Polynomial kernel:
    • Map features into higher-order polynomials
    • x1 → [aa bb ab ba]; x2 → [cc dd cd dc]
    • d(x1,x2) = aacc + bbdd + 2abcd
    • Kernel trick: polynomial kernel = (ac+bd)^2
• When we define a kernel, we are implicitly defining a feature space
  – Recall that in SVMs, kernels allow classification in spaces non-linear w.r.t. the original features
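The kernel trick on this slide can be checked directly: the explicit degree-2 feature map and the closed form (x1'*x2)^2 always give the same value.

```python
def explicit_poly2(x1, x2):
    """Dot product after mapping [a, b] -> [aa, bb, ab, ba]."""
    a, b = x1
    c, d = x2
    phi1 = [a * a, b * b, a * b, b * a]
    phi2 = [c * c, d * d, c * d, d * c]
    return sum(p * q for p, q in zip(phi1, phi2))

def kernel_poly2(x1, x2):
    """Kernel trick: (x1' * x2)^2, no explicit feature map needed."""
    dot = sum(p * q for p, q in zip(x1, x2))
    return dot ** 2

x1, x2 = [2.0, 3.0], [1.0, 4.0]
print(explicit_poly2(x1, x2), kernel_poly2(x1, x2))  # → 196.0 196.0
```

This is why kernels implicitly define a feature space: the four-dimensional polynomial features are never materialized, yet the dot product in that space is computed exactly.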
Kernel Learning in Semi-supervised scenarios
1. Learn a kernel function k(x1,x2) from unlabeled + labeled data
2. Use this kernel when training a supervised classifier on the labeled data

[Diagram: in the 2-stage learning process, labeled and unlabeled data feed the kernel learning algorithm, which produces a new kernel used by a kernel-based classifier trained on the labeled data.]
Neighborhood kernel (1/3)
• Idea: the distance between x1 and x2 is based on the distances between their neighbors
• Below are samples in feature space
  – Would you say the distances d(x1,x2) and d(x2,x7) are equivalent?
  – Or does it seem like x2 should be closer to x1 than x3?

[Figure: samples x1–x11 scattered in feature space.]
Neighborhood kernel (2/3)
• Analogy: "I am close to you if my clique of close friends is close to yours."
  – Neighborhood kernel k(x1,x2): the average distance among the neighbors of x1 and x2
• E.g. x1 = [1]; x2 = [6]; x3 = [5]; x4 = [2]
  – Original distance: x2 - x1 = 6 - 1 = 5
  – Neighbor distance = 4
    • The closest neighbor to x1 is x4; the closest neighbor to x2 is x3
    • [(x2-x1) + (x3-x1) + (x2-x4) + (x3-x4)] / 4 = [5+4+4+3] / 4 = 4
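The worked example above can be reproduced with a short sketch (function names are my own). Each point's neighborhood is the point itself plus its nearest neighbor from the (unlabeled + labeled) pool, and the new distance averages the base distances over all neighborhood pairs:

```python
def base_dist(a, b):
    return abs(a - b)  # 1-d example from the slide; any base distance works

def neighborhood(x, samples, k=1):
    """The point itself plus its k nearest neighbors from the pool."""
    others = sorted((s for s in samples if s != x),
                    key=lambda s: base_dist(s, x))
    return [x] + others[:k]

def neighborhood_dist(x1, x2, samples, k=1):
    """Average base distance over all pairs drawn from the two neighborhoods."""
    n1 = neighborhood(x1, samples, k)
    n2 = neighborhood(x2, samples, k)
    return sum(base_dist(u, v) for u in n1 for v in n2) / (len(n1) * len(n2))

samples = [1, 6, 5, 2]                    # x1, x2, x3, x4 from the slide
print(base_dist(1, 6))                    # original distance → 5
print(neighborhood_dist(1, 6, samples))   # neighbor distance → 4.0
```

Note that this is the unlabeled data doing the work: the neighborhoods are found in the full sample pool, so adding more unlabeled points changes the learned distance.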
Neighborhood kernel (3/3)
• Mathematical form (with N(x) the set containing x and its nearest neighbors, and d a base distance, consistent with the worked example):

  k(x1, x2) = [ Σ_{u in N(x1)} Σ_{v in N(x2)} d(u, v) ] / ( |N(x1)| * |N(x2)| )

• Design questions:
  – What's the base distance? What's the distance used for finding close neighbors?
  – How many neighbors to use?
• Important: when do you expect neighborhood kernels to work? What assumptions does this method employ?
Gaussian Mixture Model Kernels
1. Perform a (soft) clustering of your labeled + unlabeled dataset (by fitting a Gaussian mixture model)
2. The distance between two samples is defined by whether they are in the same cluster:

   K(x,y) = prob[x in cluster1] * prob[y in cluster1]
          + prob[x in cluster2] * prob[y in cluster2]
          + prob[x in cluster3] * prob[y in cluster3] + …

[Figure: samples x1–x11 scattered in feature space.]
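The kernel formula itself is a one-liner once the soft clustering is done. In practice the posteriors prob[x in cluster_c] come from fitting a Gaussian mixture with EM on the labeled + unlabeled pool; in this sketch they are hand-set illustrative values (for three hypothetical samples and two clusters), so only the kernel computation is real:

```python
posteriors = {
    "x1": [0.9, 0.1],   # mostly in cluster 1
    "x2": [0.8, 0.2],   # mostly in cluster 1
    "x7": [0.1, 0.9],   # mostly in cluster 2
}

def gmm_kernel(x, y):
    """K(x,y) = sum over clusters c of prob[x in c] * prob[y in c]."""
    return sum(px * py for px, py in zip(posteriors[x], posteriors[y]))

print(round(gmm_kernel("x1", "x2"), 2))  # same cluster → 0.74 (high)
print(round(gmm_kernel("x1", "x7"), 2))  # different clusters → 0.18 (low)
```

Samples that fall in the same soft cluster get a high kernel value regardless of their raw feature distance, which is exactly how the unlabeled data's cluster structure enters the classifier.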
Seeing relationships
• We've presented various methods, but at some level they are all the same…
• Gaussian mixture kernel & neighborhood kernel?
• Gaussian mixture kernel & feature clustering?
• Can you define a feature learning version of neighborhood kernels?
• Can you define a kernel learning version of LSA?
Agenda
1. Intro: Feature Learning in SSL
2. 3 Case Studies
3. Algo1: Latent Semantic Analysis
4. Kernel Learning
5. Related work
Related Work: Feature Learning
• Look into the machine learning & signal processing literature for unsupervised feature selection / extraction
• Discussed here:
  – Two-stage learning:
    • C. Oliveira, F. Cozman, and I. Cohen. "Splitting the unsupervised and supervised components of semi-supervised learning." ICML 2005 Workshop on Learning with Partially Classified Training Data, 2005.
  – Latent Semantic Analysis:
    • S. Deerwester et al. "Indexing by Latent Semantic Analysis." Journal of the American Society for Information Science (41), 1990.
    • T. Hofmann. "Probabilistic Latent Semantic Analysis." Uncertainty in AI, 1999.
  – Feature clustering:
    • W. Li and A. McCallum. "Semi-supervised sequence modeling with syntactic topic models." AAAI 2005.
    • Z. Niu, D.-H. Ji, and C. L. Tan. "A semi-supervised feature clustering algorithm with application to word sense disambiguation." HLT 2005.
• Semi-supervised feature selection is a new area to explore:
  – Z. Zhao and H. Liu. "Spectral Feature Selection for Supervised and Unsupervised Learning." ICML 2007.
Related Work: Kernel Learning
• Beware that many kernel learning algorithms are transductive (they don't apply to samples not originally in your training data):
  – N. Cristianini et al. "On kernel target alignment." NIPS 2002.
  – G. Lanckriet et al. "Learning a kernel matrix with semi-definite programming." ICML 2002.
  – C. Ong, A. Smola, and R. Williamson. "Learning the kernel with hyperkernels." Journal of Machine Learning Research (6), 2005.
• Discussed here:
  – Neighborhood kernel and others:
    • J. Weston et al. "Semi-supervised protein classification using cluster kernels." Bioinformatics 21(15), 2005.
  – Gaussian mixture kernel and KernelBoost:
    • T. Hertz, A. Bar-Hillel, and D. Weinshall. "Learning a Kernel Function for Classification with Small Training Samples." ICML 2006.
• Other keywords: "distance learning", "distance metric learning"
  – E. Xing, M. Jordan, and S. Russell. "Distance metric learning with applications to clustering with side-information." NIPS 2002.
Related Work: SSL
• Good survey article:
  – Xiaojin Zhu. "Semi-supervised learning literature survey." Wisconsin CS Tech Report.
• Graph-based SSL methods are popular. Some can be related to kernel learning.
• SSL on sequence models (of particular importance to NLP):
  – J. Lafferty, X. Zhu, and Y. Liu. "Kernel Conditional Random Fields: Representation and Clique Selection." ICML 2004.
  – Y. Altun, D. McAllester, and M. Belkin. "Maximum margin semi-supervised learning for structured variables." NIPS 2005.
  – U. Brefeld and T. Scheffer. "Semi-supervised learning for structured output variables." ICML 2006.
• Other promising methods:
  – Fisher kernels and features from generative models:
    • T. Jaakkola and D. Haussler. "Exploiting generative models in discriminative classifiers." NIPS 1998.
    • A. Holub, M. Welling, and P. Perona. "Exploiting unlabelled data for hybrid object classification." NIPS 2005 Workshop on Inter-class transfer.
    • A. Fraser and D. Marcu. "Semi-supervised word alignment." ACL 2006.
  – Multi-task learning formulation:
    • R. Ando and T. Zhang. "A High-Performance Semi-Supervised Learning Method for Text Chunking." ACL 2005.
    • J. Blitzer, R. McDonald, and F. Pereira. "Domain adaptation with structural correspondence learning." EMNLP 2006.
Summary
• Feature learning and kernel learning are two sides of the same coin
• Feature/kernel learning in SSL consists of a 2-stage approach:
  – (1) learn a better feature representation/kernel with unlabeled data
  – (2) learn a traditional classifier on labeled data
• 3 cases of inadequate feature representation: correlated, irrelevant, not expressive
• Particular algorithms presented: LSA, neighborhood kernels, GMM kernels
• Many related areas, many algorithms, but the underlying assumptions may be very similar