Optimal Reverse Prediction: A Unified Perspective on Supervised, Unsupervised and Semi-supervised Learning
Linli Xu, Martha White and Dale Schuurmans
ICML 2009, Best Overall Paper Honorable Mention

Discussion led by Chunping Wang, ECE, Duke University
October 23, 2009
Outline
• Motivations
• Preliminary Foundations
• Reverse Supervised Least Squares
• Relationship between Unsupervised Least Squares and PCA, K-means, and Normalized Graph-cut
• Semi-supervised Least Squares
• Experiments
• Conclusions
Motivations
• Lack of a foundational connection between supervised and unsupervised learning
Supervised learning: minimizing prediction error
Unsupervised learning: re-representing the input data
• For semi-supervised learning, one needs to consider both together
• The semi-supervised learning literature relies on intuitions: the “cluster assumption” and the “manifold assumption”
• A unification demonstrated in this paper leads to a novel semi-supervised principle
Preliminary Foundations: Forward Supervised Least Squares
• Data:
  – an input matrix $X \in \mathbb{R}^{t \times n}$, an output matrix $Y$
  – t instances, n features, k responses
  – regression: $Y \in \mathbb{R}^{t \times k}$
  – classification: $Y \in \{0,1\}^{t \times k}$, $Y\mathbf{1}_k = \mathbf{1}_t$
  – assumption: X, Y full rank
• Problem:
  – Find parameters $W \in \mathbb{R}^{n \times k}$ minimizing the least squares loss for a linear model $f_W: X \mapsto XW$
Preliminary Foundations
• Linear:
  $\min_W \operatorname{tr}[(XW - Y)'(XW - Y)]$, solution $W = (X'X)^{-1}X'Y$
• Ridge regularization:
  $\min_W \operatorname{tr}[(XW - Y)'(XW - Y)] + \alpha\operatorname{tr}[W'W]$, solution $W = (X'X + \alpha I)^{-1}X'Y$
• Kernelization ($K = XX'$):
  $\min_A \operatorname{tr}[(KA - Y)'(KA - Y)] + \alpha\operatorname{tr}[A'KA]$, solution $A = (K + \alpha I)^{-1}Y$
• Instance weighting ($\Lambda$ diagonal, nonnegative):
  $\min_A \operatorname{tr}[\Lambda(KA - Y)(KA - Y)']$
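To make the forward problems concrete, here is a small NumPy sketch (illustrative data and regularization value of my own, not from the paper) that computes the linear, ridge, and kernel solutions and checks that the kernel and ridge predictions coincide on the training data:

```python
import numpy as np

rng = np.random.default_rng(0)
t, n, k = 50, 5, 2           # instances, features, responses
X = rng.normal(size=(t, n))
Y = rng.normal(size=(t, k))
alpha = 0.1                   # illustrative ridge parameter

# Linear forward least squares: min_W tr[(XW - Y)'(XW - Y)]
W_ls = np.linalg.solve(X.T @ X, X.T @ Y)

# Ridge: min_W tr[(XW - Y)'(XW - Y)] + alpha * tr[W'W]
W_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ Y)

# Kernelized (K = XX'): A = (K + alpha I)^{-1} Y, predictions K A
K = X @ X.T
A = np.linalg.solve(K + alpha * np.eye(t), Y)

# Kernel and ridge predictions on the training data agree
print(np.allclose(K @ A, X @ W_ridge))   # True
```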
Preliminary Foundations: PCA, k-means, Normalized Graph-cut

Principal Components Analysis – dimensionality reduction
  $Z = XW$, where $W = Q_k(X'X) = \arg\max_{W: W'W = I} \operatorname{tr}[W'X'XW]$ (the top-k eigenvectors of $X'X$)

k-means – clustering
  $S^* = \arg\min_S \sum_{i=1}^{k}\sum_{x_j \in S_i}\|x_j - \mu_i\|^2$, where $\mu_i = \frac{1}{|S_i|}\sum_{x_j \in S_i} x_j$

Normalized Graph-cut – clustering
  Weighted undirected graph $G = (V, E, A)$: nodes $V$, edges $E$, affinity matrix $A$
  Graph partition problem: find a partition minimizing the total weight of edges connecting nodes in distinct subsets.
Preliminary Foundations: Normalized Graph-cut – clustering

• Partition indicator matrix Z (constraint): $Z_{ij} = 1$ if $i \in S_j$ and $0$ otherwise, for $i = 1,\dots,t$ and $j = 1,\dots,k$
• Weighted degree matrix: $\Delta = \operatorname{diag}(A\mathbf{1})$
• Total cut (objective): $C(Z) = \frac{1}{2}\sum_{j=1}^{k} z_j'(\Delta - A)z_j = \frac{1}{2}\operatorname{tr}[Z'LZ]$, where $L = \Delta - A$
• Normalized cut (objective): $NC(Z) = \frac{1}{2}\sum_{j=1}^{k}\frac{z_j'(\Delta - A)z_j}{z_j'\Delta z_j} = \frac{1}{2}\operatorname{tr}[(Z'\Delta Z)^{-1}Z'LZ]$

From Xing & Jordan, 2003
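As a concrete check of these definitions, the sketch below (a toy graph of my own) evaluates the total-cut and normalized-cut objectives for a given partition indicator matrix, using the 1/2 convention above:

```python
import numpy as np

# Small weighted undirected graph: affinity matrix A (symmetric, nonnegative)
A = np.array([[0., 2., 1., 0.],
              [2., 0., 0., 0.1],
              [1., 0., 0., 3.],
              [0., 0.1, 3., 0.]])

# Partition indicator matrix Z: Z[i, j] = 1 iff node i is in subset j
Z = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]], dtype=float)

D = np.diag(A.sum(axis=1))   # weighted degree matrix (Delta)
L = D - A                    # graph Laplacian

total_cut = 0.5 * np.trace(Z.T @ L @ Z)                             # weight of cut edges
norm_cut = 0.5 * np.trace(np.linalg.solve(Z.T @ D @ Z, Z.T @ L @ Z))  # degree-normalized

print(total_cut, norm_cut)
```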
First contribution

[Diagram: in the literature, supervised least squares regression, least squares classification, principal components analysis, k-means, and normalized graph-cut are treated separately; this paper unifies them in a single framework.]
Reverse Supervised Least Squares
• Traditional forward least squares: predict the outputs from the inputs
  $\min_W \operatorname{tr}[(XW - Y)'(XW - Y)]$
• Reverse least squares: predict the inputs from the outputs
  $\min_U \operatorname{tr}[(X - YU)'(X - YU)]$
• Given reverse solutions U, the corresponding forward solutions W can be recovered exactly (X full rank)
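A quick NumPy check of this equivalence (an illustrative sketch, not the authors' code). In the unregularized case the recovery takes the form $W = (X'X)^{-1}U'Y'Y$, which matches the ridge-regularized recovery on the next slide with the regularizer set to zero:

```python
import numpy as np

rng = np.random.default_rng(1)
t, n, k = 40, 6, 3
X = rng.normal(size=(t, n))
Y = rng.normal(size=(t, k))

# Forward: min_W ||XW - Y||_F^2  ->  W = (X'X)^{-1} X'Y
W_fwd = np.linalg.solve(X.T @ X, X.T @ Y)

# Reverse: min_U ||X - YU||_F^2  ->  U = (Y'Y)^{-1} Y'X
U_rev = np.linalg.solve(Y.T @ Y, Y.T @ X)

# Recover the forward solution from the reverse one: W = (X'X)^{-1} U'Y'Y
W_rec = np.linalg.solve(X.T @ X, U_rev.T @ Y.T @ Y)

print(np.allclose(W_fwd, W_rec))   # True
```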
Reverse Supervised Least Squares
• Ridge regularization
  Reverse problem: $\min_U \operatorname{tr}[(X - YU)'(X - YU)]$
  Recover: $W = (X'X + \alpha I)^{-1}U'Y'Y$
• Kernelization ($K = XX'$)
  Reverse problem: $\min_B \operatorname{tr}[(I - YB)K(I - YB)']$
  Recover: $A = (K + \alpha I)^{-1}B'Y'Y$
• Instance weighting ($\Lambda$ diagonal)
  Reverse problem: $\min_B \operatorname{tr}[\Lambda(I - YB)K(I - YB)']$
  Recover: the forward solution A is obtained from B analogously
Reverse Supervised Least Squares
For supervised learning with least squares loss:
• forward and reverse perspectives are equivalent
• each can be recovered exactly from the other
• the forward and reverse losses are not identical since they are measured in different units; it is not principled to combine them directly!
Unsupervised Least Squares
Unsupervised learning: no training labels Y are given.
Principle: optimize over guessed labels Z, with $X \in \mathbb{R}^{t\times n}$, $Z \in \mathbb{R}^{t\times k}$ (or $Z \in \{0,1\}^{t\times k}$), $U \in \mathbb{R}^{k\times n}$
• forward: $\min_Z\min_W \operatorname{tr}[(XW - Z)'(XW - Z)]$
  It does not work! For any W, we can choose Z = XW to achieve zero loss, so it only gives trivial solutions.
• reverse: $\min_Z\min_U \operatorname{tr}[(X - ZU)'(X - ZU)]$
  It gives non-trivial solutions.
Unsupervised Least Squares: PCA

Proposition 1 Unconstrained reverse prediction
$\min_Z\min_U \operatorname{tr}[(X - ZU)'(X - ZU)]$
is equivalent to principal components analysis.

This connection has been made in Jong & Kotz, 1999, and the authors extend it to the kernelized case.

Corollary 1 Kernelized reverse prediction
$\min_Z\min_B \operatorname{tr}[(I - ZB)K(I - ZB)']$
is equivalent to kernel principal components analysis.
Unsupervised Least Squares: PCA (proof of Proposition 1)

Proof Let $U^* = \arg\min_U \operatorname{tr}[(X - ZU)'(X - ZU)] = (Z'Z)^{-1}Z'X$. Substituting $U^*$,
$\min_Z \operatorname{tr}[(X - ZU^*)'(X - ZU^*)] = \min_Z \operatorname{tr}[(I - Z(Z'Z)^{-1}Z')XX'] = \max_Z \operatorname{tr}[Z(Z'Z)^{-1}Z'XX']$

Recall that the solution for Z is not unique: writing $R(Z) = Z(Z'Z)^{-1}Z'$, for any invertible T we have $R(ZT) = ZT(T'Z'ZT)^{-1}T'Z' = R(Z)$.

Consider the SVD of Z: $Z = P\Sigma Q'$ with $P'P = I_k$, $Q'Q = I_k$, $\Sigma$ diagonal. Then $R(Z) = PP'$, and the objective becomes
$\max_{P: P'P = I_k} \operatorname{tr}[PP'XX'] = \max_{P: P'P = I_k} \operatorname{tr}[P'XX'P]$
Solution: $P = Q_k(XX')$, i.e., Z spans the top-k principal subspace.
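A small numeric check of Proposition 1 (my own sketch, no centering of the data assumed): taking Z to span the top-k left singular subspace of X, the reverse loss with the inner solution $U^* = (Z'Z)^{-1}Z'X$ equals the PCA (best rank-k) reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(2)
t, n, k = 30, 8, 3
X = rng.normal(size=(t, n))

# Best rank-k subspace: top-k left singular vectors of X
P, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = P[:, :k]                               # an optimal "guessed label" matrix

# Inner reverse solution U* = (Z'Z)^{-1} Z'X and the resulting reverse loss
U = np.linalg.solve(Z.T @ Z, Z.T @ X)
reverse_loss = np.sum((X - Z @ U) ** 2)

# PCA reconstruction error: sum of squared discarded singular values
pca_error = np.sum(s[k:] ** 2)

print(np.allclose(reverse_loss, pca_error))   # True
```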
Unsupervised Least Squares: k-means

Proposition 2 Constrained reverse prediction
$\min_{Z\in\{0,1\}^{t\times k},\,Z\mathbf{1}=\mathbf{1}}\;\min_U \operatorname{tr}[(X - ZU)'(X - ZU)]$
is equivalent to k-means clustering.

The connection between PCA and k-means clustering has been made in Ding & He, 2004, but the authors show the connection of both to supervised (reverse) least squares.

Corollary 2 Constrained kernelized reverse prediction
$\min_{Z\in\{0,1\}^{t\times k},\,Z\mathbf{1}=\mathbf{1}}\;\min_B \operatorname{tr}[(I - ZB)K(I - ZB)']$
is equivalent to kernel k-means.

Proof (of Proposition 2) Substituting the inner solution $U^* = (Z'Z)^{-1}Z'X$ gives the equivalent problem
$\min_{Z\in\{0,1\}^{t\times k},\,Z\mathbf{1}=\mathbf{1}} \operatorname{tr}[(X - Z(Z'Z)^{-1}Z'X)'(X - Z(Z'Z)^{-1}Z'X)]$

Consider the pieces of $Z(Z'Z)^{-1}Z'X$:
• $Z'Z$: a diagonal matrix whose entries count the data in each class
• $Z'X$: a $k\times n$ matrix; each row is the sum of the data in one class
• $(Z'Z)^{-1}Z'X$: a $k\times n$ matrix whose rows are the class means (the "means")
• $Z(Z'Z)^{-1}Z'X$: a $t\times n$ matrix whose i-th row is the mean of instance i's class (the "encoding")

Therefore the reduced objective is $\sum_{i=1}^{k}\sum_{x_j \in S_i}\|x_j - \mu_i\|^2$ with $\mu_i = \frac{1}{|S_i|}\sum_{x_j \in S_i} x_j$, so minimizing over Z is exactly the k-means problem
$S^* = \arg\min_S \sum_{i=1}^{k}\sum_{x_j \in S_i}\|x_j - \mu_i\|^2$.
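An illustrative sketch of Proposition 2 (setup and data are mine): alternating minimization of the constrained reverse objective, with Z restricted to class indicators, reproduces Lloyd's k-means iteration, because the inner solution $U^* = (Z'Z)^{-1}Z'X$ holds the class means.

```python
import numpy as np

def reverse_kmeans(X, k, iters=20, seed=0):
    """Constrained reverse prediction min_{Z,U} ||X - ZU||_F^2 with Z a {0,1}
    indicator matrix, Z 1 = 1, solved by alternating minimization.
    The inner solution U = (Z'Z)^{-1} Z'X holds the k class means, so this
    is exactly the k-means (Lloyd) iteration."""
    rng = np.random.default_rng(seed)
    t = X.shape[0]
    labels = rng.integers(k, size=t)          # random initial assignment
    for _ in range(iters):
        Z = np.eye(k)[labels]                 # t x k indicator matrix
        counts = np.maximum(Z.sum(axis=0), 1) # guard against empty clusters
        U = (Z.T @ X) / counts[:, None]       # class means = (Z'Z)^{-1} Z'X
        dists = ((X[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)         # re-assign each point to nearest mean
    return labels, U

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(0, 0.3, size=(20, 2)),
                   rng.normal(3, 0.3, size=(20, 2))])
    labels, means = reverse_kmeans(X, k=2)
    print(labels, means, sep="\n")
```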
Unsupervised Least Squares: Normalized cut

Proposition 3 For a doubly nonnegative matrix K and weighting $\Lambda = \operatorname{diag}(K\mathbf{1})^{-1}$, weighted reverse prediction
$\min_{Z\in\{0,1\}^{t\times k},\,Z\mathbf{1}=\mathbf{1}}\;\min_B \operatorname{tr}[\Lambda(I - ZB)K(I - ZB)']$
is equivalent to normalized graph-cut.

Proof For any Z, the solution to the inner minimization is $B^* = (Z'\Lambda Z)^{-1}Z'\Lambda$. Substituting $B^*$ gives the reduced objective
$\min_{Z\in\{0,1\}^{t\times k},\,Z\mathbf{1}=\mathbf{1}} \operatorname{tr}[\Lambda(I - Z(Z'\Lambda Z)^{-1}Z'\Lambda)K]$
which, after dropping terms that do not depend on Z, reduces to
$\min_{Z\in\{0,1\}^{t\times k},\,Z\mathbf{1}=\mathbf{1}} \operatorname{tr}[(Z'\Delta Z)^{-1}Z'(\Delta - K)Z]$, where $\Delta = \operatorname{diag}(K\mathbf{1})$.

Recall the normalized cut (from Xing & Jordan, 2003): $NC(Z) = \frac{1}{2}\operatorname{tr}[(Z'\Delta Z)^{-1}Z'(\Delta - A)Z]$. Since K is doubly nonnegative, it can serve as an affinity matrix, so the objective above is equivalent to normalized graph-cut.
Unsupervised Least Squares: Normalized cut (cont.)

Corollary 3 The weighted least squares problem
$\min_{Z\in\{0,1\}^{t\times k},\,Z\mathbf{1}=\mathbf{1}}\;\min_U \operatorname{tr}[\Lambda(X - ZU)(X - ZU)']$
is equivalent to normalized graph-cut on $K = XX'$ if $K \ge 0$.

With a specific K, we can relate normalized graph-cut to reverse least squares.
Second contribution

[Diagram, taken from Xu's slides: under reverse prediction, supervised least squares learning, principal components analysis, k-means, and normalized graph-cut sit in one framework, to which a new semi-supervised method is added.]
Semi-supervised Least Squares

A principled approach: reverse loss decomposition

[Figure, taken from Xu's slides: data points $x_1, \dots, x_4$; for a point $x_3$, its reconstruction $\hat{x}_3$ from the given label and its reconstruction $x_3^*$ from the optimal guessed label satisfy $\|x_3 - \hat{x}_3\|^2 = \|x_3 - x_3^*\|^2 + \|x_3^* - \hat{x}_3\|^2$.]

Supervised reverse losses: $\operatorname{tr}[(X - YU)'(X - YU)]$
Unsupervised reverse losses: $\operatorname{tr}[(X - ZU)'(X - ZU)]$
Semi-supervised Least Squares
Proposition 4 For any X, Y, and U,
$\operatorname{tr}[(X - YU)'(X - YU)] = \operatorname{tr}[(X - Z^*U)'(X - Z^*U)] + \operatorname{tr}[(Z^*U - YU)'(Z^*U - YU)]$
where $Z^* = \arg\min_Z \operatorname{tr}[(X - ZU)'(X - ZU)]$
(supervised loss = unsupervised loss + squared distance).

The unsupervised loss depends only on the input data X; the squared distance depends on both X and Y.
Note: we cannot get the true supervised loss since we don’t have all the labels Y. We may estimate it using only labeled data, or also using auxiliary unlabeled data.
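A numeric check of Proposition 4 (an illustrative sketch; in the unconstrained real-valued case the optimal guessed labels have the closed form $Z^* = XU'(UU')^{-1}$, which is what the snippet uses):

```python
import numpy as np

rng = np.random.default_rng(4)
t, n, k = 25, 6, 3
X = rng.normal(size=(t, n))
Y = rng.normal(size=(t, k))
U = rng.normal(size=(k, n))            # an arbitrary reverse model

# Optimal guessed labels for this U (unconstrained case): Z* = X U'(UU')^{-1}
Z_star = X @ U.T @ np.linalg.inv(U @ U.T)

supervised   = np.sum((X - Y @ U) ** 2)        # reverse supervised loss
unsupervised = np.sum((X - Z_star @ U) ** 2)   # depends on X only
distance     = np.sum((Z_star @ U - Y @ U) ** 2)

print(np.allclose(supervised, unsupervised + distance))   # True
```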
Semi-supervised Least Squares
Corollary 4 For any U,
$E\!\left[\tfrac{1}{T_L}\|X_L - Y_L U\|_F^2\right] = E\!\left[\tfrac{1}{T_U}\|X_U - Z_U^* U\|_F^2\right] + E\!\left[\tfrac{1}{T_L}\|Z_L^* U - Y_L U\|_F^2\right]$
where $Z^* = \arg\min_Z \operatorname{tr}[(X - ZU)'(X - ZU)]$, computed on the corresponding inputs
(supervised loss estimate = unsupervised loss estimate + squared distance estimate; the subscripts L and U denote the $T_L$ labeled and $T_U$ unlabeled instances).
Labeled data are scarce, but plenty of unlabeled data are available. Estimating the unsupervised-loss term from the unlabeled data yields an unbiased estimate of the supervised loss with strictly reduced variance.
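The sketch below illustrates one reading of this estimator (assumptions mine: unconstrained $Z^*$, synthetic data): it compares the labeled-only supervised-loss estimate with the combined estimate whose unsupervised term is computed from the much larger unlabeled sample.

```python
import numpy as np

def reverse_losses(U, X, Y=None):
    """Per-instance reverse losses for a given reverse model U (k x n).
    Returns the unsupervised term and, if labels are given, also the
    supervised term and the squared-distance term of the decomposition."""
    Z_star = X @ U.T @ np.linalg.inv(U @ U.T)          # unconstrained Z*
    unsup = np.sum((X - Z_star @ U) ** 2) / len(X)
    if Y is None:
        return unsup
    sup = np.sum((X - Y @ U) ** 2) / len(X)
    dist = np.sum((Z_star @ U - Y @ U) ** 2) / len(X)
    return sup, unsup, dist

rng = np.random.default_rng(5)
n, k, T_L, T_U = 6, 3, 10, 500
W_true = rng.normal(size=(n, k))
X_L = rng.normal(size=(T_L, n))
Y_L = X_L @ W_true + 0.1 * rng.normal(size=(T_L, k))
X_U = rng.normal(size=(T_U, n))                        # plentiful unlabeled inputs
U = rng.normal(size=(k, n))                            # some candidate reverse model

sup_L, unsup_L, dist_L = reverse_losses(U, X_L, Y_L)
unsup_U = reverse_losses(U, X_U)

labeled_only_estimate = sup_L                          # high variance (few labels)
semi_supervised_estimate = unsup_U + dist_L            # unsupervised term from unlabeled data

print(labeled_only_estimate, semi_supervised_estimate)
```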
Semi-supervised Least Squares
A naive approach:
$\min_Z\min_U \left[\tfrac{1}{T_L}\|X_L - Y_L U\|_F^2 + \tfrac{1}{T_U}\|X_U - ZU\|_F^2\right]$
(loss on labeled data + loss on unlabeled data)

Advantages:
• The authors combine supervised and unsupervised reverse losses, whereas previous approaches combine an unsupervised (reverse) loss with a supervised (forward) loss, which are not in the same units.
• Compared to the principled approach, it admits a more straightforward optimization procedure (alternating between U and Z).
Regression Experiments: Least Squares + PCA

Basic formulation:
$\min_Z\min_U \left[\tfrac{1}{T_L}\|X_L - Y_L U\|_F^2 + \tfrac{1}{T_U}\|X_U - ZU\|_F^2\right]$
• The two terms are not jointly convex, so there is no closed-form solution
• Learning method: alternating minimization (with an initial U obtained from the supervised setting), as sketched below
• Recovered forward solution: $W = (X'X + \alpha I)^{-1}U'Y'Y$
• Testing: given a new x, predict $\hat{y} = W'x$
• Can be kernelized
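A minimal sketch of the alternating procedure described above (an illustrative implementation, not the authors' code; the unconstrained Z-update uses the closed form $Z = X_U U'(UU')^{-1}$ and the U-update solves the combined normal equations):

```python
import numpy as np

def semi_supervised_reverse_ls(X_L, Y_L, X_U, iters=50):
    """Alternating minimization of
        (1/T_L)||X_L - Y_L U||_F^2 + (1/T_U)||X_U - Z U||_F^2
    over U and an unconstrained Z (the least squares + PCA variant).
    U is initialized from the supervised reverse solution."""
    T_L, T_U = len(X_L), len(X_U)
    U = np.linalg.solve(Y_L.T @ Y_L, Y_L.T @ X_L)     # supervised initialization
    for _ in range(iters):
        Z = X_U @ U.T @ np.linalg.inv(U @ U.T)        # Z-update for fixed U
        A = (Y_L.T @ Y_L) / T_L + (Z.T @ Z) / T_U     # U-update for fixed Z:
        b = (Y_L.T @ X_L) / T_L + (Z.T @ X_U) / T_U   # combined normal equations
        U = np.linalg.solve(A, b)
    return U, Z

rng = np.random.default_rng(6)
n, k = 5, 2
W_true = rng.normal(size=(n, k))
X_L = rng.normal(size=(15, n))
Y_L = X_L @ W_true + 0.05 * rng.normal(size=(15, k))
X_U = rng.normal(size=(200, n))

U, Z = semi_supervised_reverse_ls(X_L, Y_L, X_U)
objective = (np.sum((X_L - Y_L @ U) ** 2) / len(X_L)
             + np.sum((X_U - Z @ U) ** 2) / len(X_U))
print(objective)
```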
Regression Experiments: Least Squares + PCA (results)

[Table, taken from Xu's paper: forward root mean squared error (mean ± standard deviation over 10 random splits of the data); the values of $(k, n; T_L, T_U)$ are indicated for each data set.]
Classification Experiments: Least Squares + k-means

$\min_{Z\in\{0,1\}^{t\times k},\,Z\mathbf{1}=\mathbf{1}}\;\min_U \left[\tfrac{1}{T_L}\|X_L - Y_L U\|_F^2 + \tfrac{1}{T_U}\|X_U - ZU\|_F^2\right]$

Least Squares + Norm-cut

$\min_{Z\in\{0,1\}^{t\times k},\,Z\mathbf{1}=\mathbf{1}}\;\min_U \left[\tfrac{1}{T_L}\|\Lambda_L^{1/2}(X_L - Y_L U)\|_F^2 + \tfrac{1}{T_U}\|\Lambda_U^{1/2}(X_U - ZU)\|_F^2\right]$

• Recovered forward solution: $W = (X'X + \alpha I)^{-1}U'Y'Y$
• Testing: given a new x, compute $\hat{y} = W'x$ and predict the class with the maximum response (see the sketch below)
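A small sketch of the recovery-and-predict step for classification (illustrative only: the ridge value alpha, the placeholder data, and the supervised reverse model are assumptions of mine):

```python
import numpy as np

def predict_class(x, X, Y, U, alpha=0.1):
    """Recover the forward model W = (X'X + alpha I)^{-1} U'Y'Y from a learned
    reverse model U, then predict the class with the maximum response in W'x."""
    n = X.shape[1]
    W = np.linalg.solve(X.T @ X + alpha * np.eye(n), U.T @ Y.T @ Y)
    return int(np.argmax(W.T @ x))

# Tiny usage example with placeholder data
rng = np.random.default_rng(7)
X = rng.normal(size=(30, 4))
labels = np.arange(30) % 3
Y = np.eye(3)[labels]                    # one-of-k label matrix
U = np.linalg.solve(Y.T @ Y, Y.T @ X)    # supervised reverse solution
print(predict_class(rng.normal(size=4), X, Y, U))
```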
Classification Experiments: Least Squares + k-means (results)

[Table, taken from Xu's paper: forward root mean squared error (mean ± standard deviation over 10 random splits of the data); the values of $(k, n; T_L, T_U)$ are indicated for each data set.]
Conclusions
Two main contributions:
1. A unified framework based on reverse least squares loss is proposed for several existing supervised and unsupervised algorithms;
2. In the unified framework, a novel semi-supervised principle is proposed.