Optimal Reverse Prediction: A Unified Perspective on Supervised, Unsupervised and Semi-supervised Learning
Linli Xu, Martha White and Dale Schuurmans
ICML 2009, Best Overall Paper Honorable Mention

Discussion led by Chunping Wang, ECE, Duke University
October 23, 2009
Outline
• Motivations
• Preliminary Foundations
• Reverse Supervised Least Squares
• Relationship between Unsupervised Least Squares and PCA, K-means, and Normalized Graph-cut
• Semi-supervised Least Squares
• Experiments
• Conclusions
Motivations
• Lack of a foundational connection between supervised and unsupervised learning
Supervised learning: minimizing prediction error
Unsupervised learning: re-representing the input data
• For semi-supervised learning, one needs to consider both together
• The semi-supervised learning literature relies on intuitions: the “cluster assumption” and the “manifold assumption”
• A unification demonstrated in this paper leads to a novel semi-supervised principle
Preliminary Foundations: Forward Supervised Least Squares
• Data:
  – an input matrix $X \in \mathbb{R}^{t \times n}$, an output matrix $Y$
  – t instances, n features, k responses
  – regression: $Y \in \mathbb{R}^{t \times k}$
  – classification: $Y \in \{0,1\}^{t \times k}$, $Y\mathbf{1}_k = \mathbf{1}_t$
  – assumption: X, Y full rank
• Problem:
  – Find parameters $W \in \mathbb{R}^{n \times k}$ minimizing the least squares loss for a linear model $f_W: X \mapsto XW$
Preliminary Foundations
• Linear:
  $\min_W \operatorname{tr}[(XW - Y)'(XW - Y)]$, solution $W = (X'X)^{-1}X'Y$
• Ridge regularization:
  $\min_W \operatorname{tr}[(XW - Y)'(XW - Y)] + \alpha\operatorname{tr}[W'W]$, solution $W = (X'X + \alpha I)^{-1}X'Y$
• Kernelization ($K = XX'$):
  $\min_A \operatorname{tr}[(KA - Y)'(KA - Y)] + \alpha\operatorname{tr}[A'KA]$, solution $A = (K + \alpha I)^{-1}Y$
• Instance weighting ($\Lambda$ diagonal, nonnegative):
  $\min_A \operatorname{tr}[\Lambda(KA - Y)(KA - Y)']$
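To make the forward problems concrete, here is a small NumPy sketch (illustrative data and regularization value of my own, not from the paper) that computes the linear, ridge, and kernel solutions and checks that the kernel and ridge predictions coincide on the training data:

```python
import numpy as np

rng = np.random.default_rng(0)
t, n, k = 50, 5, 2           # instances, features, responses
X = rng.normal(size=(t, n))
Y = rng.normal(size=(t, k))
alpha = 0.1                   # illustrative ridge parameter

# Linear forward least squares: min_W tr[(XW - Y)'(XW - Y)]
W_ls = np.linalg.solve(X.T @ X, X.T @ Y)

# Ridge: min_W tr[(XW - Y)'(XW - Y)] + alpha * tr[W'W]
W_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ Y)

# Kernelized (K = XX'): A = (K + alpha I)^{-1} Y, predictions K A
K = X @ X.T
A = np.linalg.solve(K + alpha * np.eye(t), Y)

# Kernel and ridge predictions on the training data agree
print(np.allclose(K @ A, X @ W_ridge))   # True
```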
Preliminary Foundations: PCA, k-means, Normalized Graph-cut

Principal Components Analysis – dimensionality reduction
  $Z = XW$, where $W = Q_k(X'X) = \arg\max_{W: W'W = I} \operatorname{tr}[W'X'XW]$ (the top-k eigenvectors of $X'X$)

k-means – clustering
  $S^* = \arg\min_S \sum_{i=1}^{k}\sum_{x_j \in S_i}\|x_j - \mu_i\|^2$, where $\mu_i = \frac{1}{|S_i|}\sum_{x_j \in S_i} x_j$

Normalized Graph-cut – clustering
  Weighted undirected graph $G = (V, E, A)$: nodes $V$, edges $E$, affinity matrix $A$
  Graph partition problem: find a partition minimizing the total weight of edges connecting nodes in distinct subsets.
Preliminary Foundations: Normalized Graph-cut – clustering

• Partition indicator matrix Z (constraint): $Z_{ij} = 1$ if $i \in S_j$ and $0$ otherwise, for $i = 1,\dots,t$ and $j = 1,\dots,k$
• Weighted degree matrix: $\Delta = \operatorname{diag}(A\mathbf{1})$
• Total cut (objective): $C(Z) = \frac{1}{2}\sum_{j=1}^{k} z_j'(\Delta - A)z_j = \frac{1}{2}\operatorname{tr}[Z'LZ]$, where $L = \Delta - A$
• Normalized cut (objective): $NC(Z) = \frac{1}{2}\sum_{j=1}^{k}\frac{z_j'(\Delta - A)z_j}{z_j'\Delta z_j} = \frac{1}{2}\operatorname{tr}[(Z'\Delta Z)^{-1}Z'LZ]$

From Xing & Jordan, 2003
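As a concrete check of these definitions, the sketch below (a toy graph of my own) evaluates the total-cut and normalized-cut objectives for a given partition indicator matrix, using the 1/2 convention above:

```python
import numpy as np

# Small weighted undirected graph: affinity matrix A (symmetric, nonnegative)
A = np.array([[0., 2., 1., 0.],
              [2., 0., 0., 0.1],
              [1., 0., 0., 3.],
              [0., 0.1, 3., 0.]])

# Partition indicator matrix Z: Z[i, j] = 1 iff node i is in subset j
Z = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]], dtype=float)

D = np.diag(A.sum(axis=1))   # weighted degree matrix (Delta)
L = D - A                    # graph Laplacian

total_cut = 0.5 * np.trace(Z.T @ L @ Z)                             # weight of cut edges
norm_cut = 0.5 * np.trace(np.linalg.solve(Z.T @ D @ Z, Z.T @ L @ Z))  # degree-normalized

print(total_cut, norm_cut)
```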
First contribution

[Diagram: in the literature, supervised least squares regression, least squares classification, principal components analysis, k-means, and normalized graph-cut are treated separately; this paper unifies them in a single framework.]
Reverse Supervised Least Squares
• Traditional forward least squares: predict the outputs from the inputs
  $\min_W \operatorname{tr}[(XW - Y)'(XW - Y)]$
• Reverse least squares: predict the inputs from the outputs
  $\min_U \operatorname{tr}[(X - YU)'(X - YU)]$
• Given reverse solutions U, the corresponding forward solutions W can be recovered exactly (X full rank)
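A quick NumPy check of this equivalence (an illustrative sketch, not the authors' code). In the unregularized case the recovery takes the form $W = (X'X)^{-1}U'Y'Y$, which matches the ridge-regularized recovery on the next slide with the regularizer set to zero:

```python
import numpy as np

rng = np.random.default_rng(1)
t, n, k = 40, 6, 3
X = rng.normal(size=(t, n))
Y = rng.normal(size=(t, k))

# Forward: min_W ||XW - Y||_F^2  ->  W = (X'X)^{-1} X'Y
W_fwd = np.linalg.solve(X.T @ X, X.T @ Y)

# Reverse: min_U ||X - YU||_F^2  ->  U = (Y'Y)^{-1} Y'X
U_rev = np.linalg.solve(Y.T @ Y, Y.T @ X)

# Recover the forward solution from the reverse one: W = (X'X)^{-1} U'Y'Y
W_rec = np.linalg.solve(X.T @ X, U_rev.T @ Y.T @ Y)

print(np.allclose(W_fwd, W_rec))   # True
```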
Reverse Supervised Least Squares
• Ridge regularization
  Reverse problem: $\min_U \operatorname{tr}[(X - YU)'(X - YU)]$
  Recover: $W = (X'X + \alpha I)^{-1}U'Y'Y$
• Kernelization ($K = XX'$)
  Reverse problem: $\min_B \operatorname{tr}[(I - YB)K(I - YB)']$
  Recover: $A = (K + \alpha I)^{-1}B'Y'Y$
• Instance weighting ($\Lambda$ diagonal)
  Reverse problem: $\min_B \operatorname{tr}[\Lambda(I - YB)K(I - YB)']$
  Recover: the forward solution A is obtained from B analogously
Reverse Supervised Least Squares
For supervised learning with least squares loss:
• forward and reverse perspectives are equivalent
• each can be recovered exactly from the other
• the forward and reverse losses are not identical since they are measured in different units; it is not principled to combine them directly!
Unsupervised Least Squares
Unsupervised learning: no training labels Y are given.
Principle: optimize over guessed labels Z, with $X \in \mathbb{R}^{t\times n}$, $Z \in \mathbb{R}^{t\times k}$ (or $Z \in \{0,1\}^{t\times k}$), $U \in \mathbb{R}^{k\times n}$
• forward: $\min_Z\min_W \operatorname{tr}[(XW - Z)'(XW - Z)]$
  It does not work! For any W, we can choose Z = XW to achieve zero loss, so it only gives trivial solutions.
• reverse: $\min_Z\min_U \operatorname{tr}[(X - ZU)'(X - ZU)]$
  It gives non-trivial solutions.
Unsupervised Least Squares: PCA

Proposition 1 Unconstrained reverse prediction
$\min_Z\min_U \operatorname{tr}[(X - ZU)'(X - ZU)]$
is equivalent to principal components analysis.

This connection has been made in Jong & Kotz, 1999, and the authors extend it to the kernelized case.

Corollary 1 Kernelized reverse prediction
$\min_Z\min_B \operatorname{tr}[(I - ZB)K(I - ZB)']$
is equivalent to kernel principal components analysis.
Unsupervised Least Squares: PCA (proof of Proposition 1)

Proof Let $U^* = \arg\min_U \operatorname{tr}[(X - ZU)'(X - ZU)] = (Z'Z)^{-1}Z'X$. Substituting $U^*$,
$\min_Z \operatorname{tr}[(X - ZU^*)'(X - ZU^*)] = \min_Z \operatorname{tr}[(I - Z(Z'Z)^{-1}Z')XX'] = \max_Z \operatorname{tr}[Z(Z'Z)^{-1}Z'XX']$

Recall that the solution for Z is not unique: writing $R(Z) = Z(Z'Z)^{-1}Z'$, for any invertible T we have $R(ZT) = ZT(T'Z'ZT)^{-1}T'Z' = R(Z)$.

Consider the SVD of Z: $Z = P\Sigma Q'$ with $P'P = I_k$, $Q'Q = I_k$, $\Sigma$ diagonal. Then $R(Z) = PP'$, and the objective becomes
$\max_{P: P'P = I_k} \operatorname{tr}[PP'XX'] = \max_{P: P'P = I_k} \operatorname{tr}[P'XX'P]$
Solution: $P = Q_k(XX')$, i.e., Z spans the top-k principal subspace.
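A small numeric check of Proposition 1 (my own sketch, no centering of the data assumed): taking Z to span the top-k left singular subspace of X, the reverse loss with the inner solution $U^* = (Z'Z)^{-1}Z'X$ equals the PCA (best rank-k) reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(2)
t, n, k = 30, 8, 3
X = rng.normal(size=(t, n))

# Best rank-k subspace: top-k left singular vectors of X
P, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = P[:, :k]                               # an optimal "guessed label" matrix

# Inner reverse solution U* = (Z'Z)^{-1} Z'X and the resulting reverse loss
U = np.linalg.solve(Z.T @ Z, Z.T @ X)
reverse_loss = np.sum((X - Z @ U) ** 2)

# PCA reconstruction error: sum of squared discarded singular values
pca_error = np.sum(s[k:] ** 2)

print(np.allclose(reverse_loss, pca_error))   # True
```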
Unsupervised Least Squares: k-means

Proposition 2 Constrained reverse prediction
$\min_{Z\in\{0,1\}^{t\times k},\,Z\mathbf{1}=\mathbf{1}}\;\min_U \operatorname{tr}[(X - ZU)'(X - ZU)]$
is equivalent to k-means clustering.

The connection between PCA and k-means clustering has been made in Ding & He, 2004, but the authors show the connection of both to supervised (reverse) least squares.

Corollary 2 Constrained kernelized reverse prediction
$\min_{Z\in\{0,1\}^{t\times k},\,Z\mathbf{1}=\mathbf{1}}\;\min_B \operatorname{tr}[(I - ZB)K(I - ZB)']$
is equivalent to kernel k-means.

Proof (of Proposition 2) Substituting the inner solution $U^* = (Z'Z)^{-1}Z'X$ gives the equivalent problem
$\min_{Z\in\{0,1\}^{t\times k},\,Z\mathbf{1}=\mathbf{1}} \operatorname{tr}[(X - Z(Z'Z)^{-1}Z'X)'(X - Z(Z'Z)^{-1}Z'X)]$

Consider the pieces of $Z(Z'Z)^{-1}Z'X$:
• $Z'Z$: a diagonal matrix whose entries count the data in each class
• $Z'X$: a $k\times n$ matrix; each row is the sum of the data in one class
• $(Z'Z)^{-1}Z'X$: a $k\times n$ matrix whose rows are the class means (the "means")
• $Z(Z'Z)^{-1}Z'X$: a $t\times n$ matrix whose i-th row is the mean of instance i's class (the "encoding")

Therefore the reduced objective is $\sum_{i=1}^{k}\sum_{x_j \in S_i}\|x_j - \mu_i\|^2$ with $\mu_i = \frac{1}{|S_i|}\sum_{x_j \in S_i} x_j$, so minimizing over Z is exactly the k-means problem
$S^* = \arg\min_S \sum_{i=1}^{k}\sum_{x_j \in S_i}\|x_j - \mu_i\|^2$.
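An illustrative sketch of Proposition 2 (setup and data are mine): alternating minimization of the constrained reverse objective, with Z restricted to class indicators, reproduces Lloyd's k-means iteration, because the inner solution $U^* = (Z'Z)^{-1}Z'X$ holds the class means.

```python
import numpy as np

def reverse_kmeans(X, k, iters=20, seed=0):
    """Constrained reverse prediction min_{Z,U} ||X - ZU||_F^2 with Z a {0,1}
    indicator matrix, Z 1 = 1, solved by alternating minimization.
    The inner solution U = (Z'Z)^{-1} Z'X holds the k class means, so this
    is exactly the k-means (Lloyd) iteration."""
    rng = np.random.default_rng(seed)
    t = X.shape[0]
    labels = rng.integers(k, size=t)          # random initial assignment
    for _ in range(iters):
        Z = np.eye(k)[labels]                 # t x k indicator matrix
        counts = np.maximum(Z.sum(axis=0), 1) # guard against empty clusters
        U = (Z.T @ X) / counts[:, None]       # class means = (Z'Z)^{-1} Z'X
        dists = ((X[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)         # re-assign each point to nearest mean
    return labels, U

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(0, 0.3, size=(20, 2)),
                   rng.normal(3, 0.3, size=(20, 2))])
    labels, means = reverse_kmeans(X, k=2)
    print(labels, means, sep="\n")
```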
Unsupervised Least Squares: Normalized cut

Proposition 3 For a doubly nonnegative matrix K and weighting $\Lambda = \operatorname{diag}(K\mathbf{1})^{-1}$, weighted reverse prediction
$\min_{Z\in\{0,1\}^{t\times k},\,Z\mathbf{1}=\mathbf{1}}\;\min_B \operatorname{tr}[\Lambda(I - ZB)K(I - ZB)']$
is equivalent to normalized graph-cut.

Proof For any Z, the solution to the inner minimization is $B^* = (Z'\Lambda Z)^{-1}Z'\Lambda$. Substituting $B^*$ gives the reduced objective
$\min_{Z\in\{0,1\}^{t\times k},\,Z\mathbf{1}=\mathbf{1}} \operatorname{tr}[\Lambda(I - Z(Z'\Lambda Z)^{-1}Z'\Lambda)K]$
which, after dropping terms that do not depend on Z, reduces to
$\min_{Z\in\{0,1\}^{t\times k},\,Z\mathbf{1}=\mathbf{1}} \operatorname{tr}[(Z'\Delta Z)^{-1}Z'(\Delta - K)Z]$, where $\Delta = \operatorname{diag}(K\mathbf{1})$.

Recall the normalized cut (from Xing & Jordan, 2003): $NC(Z) = \frac{1}{2}\operatorname{tr}[(Z'\Delta Z)^{-1}Z'(\Delta - A)Z]$. Since K is doubly nonnegative, it can serve as an affinity matrix, so the objective above is equivalent to normalized graph-cut.
Unsupervised Least Squares: Normalized cut (cont.)

Corollary 3 The weighted least squares problem
$\min_{Z\in\{0,1\}^{t\times k},\,Z\mathbf{1}=\mathbf{1}}\;\min_U \operatorname{tr}[\Lambda(X - ZU)(X - ZU)']$
is equivalent to normalized graph-cut on $K = XX'$ if $K \ge 0$.

With a specific K, we can relate normalized graph-cut to reverse least squares.
Second contribution

[Diagram, taken from Xu's slides: under reverse prediction, supervised least squares learning, principal components analysis, k-means, and normalized graph-cut sit in one framework, to which a new semi-supervised method is added.]
Semi-supervised Least Squares

A principled approach: reverse loss decomposition

[Figure, taken from Xu's slides: data points $x_1, \dots, x_4$; for a point $x_3$, its reconstruction $\hat{x}_3$ from the given label and its reconstruction $x_3^*$ from the optimal guessed label satisfy $\|x_3 - \hat{x}_3\|^2 = \|x_3 - x_3^*\|^2 + \|x_3^* - \hat{x}_3\|^2$.]

Supervised reverse losses: $\operatorname{tr}[(X - YU)'(X - YU)]$
Unsupervised reverse losses: $\operatorname{tr}[(X - ZU)'(X - ZU)]$
Semi-supervised Least Squares
Proposition 4 For any X, Y, and U,
$\operatorname{tr}[(X - YU)'(X - YU)] = \operatorname{tr}[(X - Z^*U)'(X - Z^*U)] + \operatorname{tr}[(Z^*U - YU)'(Z^*U - YU)]$
where $Z^* = \arg\min_Z \operatorname{tr}[(X - ZU)'(X - ZU)]$
(supervised loss = unsupervised loss + squared distance).

The unsupervised loss depends only on the input data X; the squared distance depends on both X and Y.
Note: we cannot get the true supervised loss since we don’t have all the labels Y. We may estimate it using only labeled data, or also using auxiliary unlabeled data.
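A numeric check of Proposition 4 (an illustrative sketch; in the unconstrained real-valued case the optimal guessed labels have the closed form $Z^* = XU'(UU')^{-1}$, which is what the snippet uses):

```python
import numpy as np

rng = np.random.default_rng(4)
t, n, k = 25, 6, 3
X = rng.normal(size=(t, n))
Y = rng.normal(size=(t, k))
U = rng.normal(size=(k, n))            # an arbitrary reverse model

# Optimal guessed labels for this U (unconstrained case): Z* = X U'(UU')^{-1}
Z_star = X @ U.T @ np.linalg.inv(U @ U.T)

supervised   = np.sum((X - Y @ U) ** 2)        # reverse supervised loss
unsupervised = np.sum((X - Z_star @ U) ** 2)   # depends on X only
distance     = np.sum((Z_star @ U - Y @ U) ** 2)

print(np.allclose(supervised, unsupervised + distance))   # True
```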
Semi-supervised Least Squares
Corollary 4 For any U,
$E\!\left[\tfrac{1}{T_L}\|X_L - Y_L U\|_F^2\right] = E\!\left[\tfrac{1}{T_U}\|X_U - Z_U^* U\|_F^2\right] + E\!\left[\tfrac{1}{T_L}\|Z_L^* U - Y_L U\|_F^2\right]$
where $Z^* = \arg\min_Z \operatorname{tr}[(X - ZU)'(X - ZU)]$, computed on the corresponding inputs
(supervised loss estimate = unsupervised loss estimate + squared distance estimate; the subscripts L and U denote the $T_L$ labeled and $T_U$ unlabeled instances).
Labeled data are scarce, but plenty of unlabeled data are available. Estimating the unsupervised-loss term from the unlabeled data yields an unbiased estimate of the supervised loss with strictly reduced variance.
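The sketch below illustrates one reading of this estimator (assumptions mine: unconstrained $Z^*$, synthetic data): it compares the labeled-only supervised-loss estimate with the combined estimate whose unsupervised term is computed from the much larger unlabeled sample.

```python
import numpy as np

def reverse_losses(U, X, Y=None):
    """Per-instance reverse losses for a given reverse model U (k x n).
    Returns the unsupervised term and, if labels are given, also the
    supervised term and the squared-distance term of the decomposition."""
    Z_star = X @ U.T @ np.linalg.inv(U @ U.T)          # unconstrained Z*
    unsup = np.sum((X - Z_star @ U) ** 2) / len(X)
    if Y is None:
        return unsup
    sup = np.sum((X - Y @ U) ** 2) / len(X)
    dist = np.sum((Z_star @ U - Y @ U) ** 2) / len(X)
    return sup, unsup, dist

rng = np.random.default_rng(5)
n, k, T_L, T_U = 6, 3, 10, 500
W_true = rng.normal(size=(n, k))
X_L = rng.normal(size=(T_L, n))
Y_L = X_L @ W_true + 0.1 * rng.normal(size=(T_L, k))
X_U = rng.normal(size=(T_U, n))                        # plentiful unlabeled inputs
U = rng.normal(size=(k, n))                            # some candidate reverse model

sup_L, unsup_L, dist_L = reverse_losses(U, X_L, Y_L)
unsup_U = reverse_losses(U, X_U)

labeled_only_estimate = sup_L                          # high variance (few labels)
semi_supervised_estimate = unsup_U + dist_L            # unsupervised term from unlabeled data

print(labeled_only_estimate, semi_supervised_estimate)
```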
Semi-supervised Least Squares
A naive approach:
$\min_Z\min_U \left[\tfrac{1}{T_L}\|X_L - Y_L U\|_F^2 + \tfrac{1}{T_U}\|X_U - ZU\|_F^2\right]$
(loss on labeled data + loss on unlabeled data)

Advantages:
• The authors combine supervised and unsupervised reverse losses, whereas previous approaches combine an unsupervised (reverse) loss with a supervised (forward) loss, which are not in the same units.
• Compared to the principled approach, it admits a more straightforward optimization procedure (alternating between U and Z).
Regression Experiments: Least Squares + PCA

Basic formulation:
$\min_Z\min_U \left[\tfrac{1}{T_L}\|X_L - Y_L U\|_F^2 + \tfrac{1}{T_U}\|X_U - ZU\|_F^2\right]$
• The two terms are not jointly convex, so there is no closed-form solution
• Learning method: alternating minimization (with an initial U obtained from the supervised setting), as sketched below
• Recovered forward solution: $W = (X'X + \alpha I)^{-1}U'Y'Y$
• Testing: given a new x, predict $\hat{y} = W'x$
• Can be kernelized
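A minimal sketch of the alternating procedure described above (an illustrative implementation, not the authors' code; the unconstrained Z-update uses the closed form $Z = X_U U'(UU')^{-1}$ and the U-update solves the combined normal equations):

```python
import numpy as np

def semi_supervised_reverse_ls(X_L, Y_L, X_U, iters=50):
    """Alternating minimization of
        (1/T_L)||X_L - Y_L U||_F^2 + (1/T_U)||X_U - Z U||_F^2
    over U and an unconstrained Z (the least squares + PCA variant).
    U is initialized from the supervised reverse solution."""
    T_L, T_U = len(X_L), len(X_U)
    U = np.linalg.solve(Y_L.T @ Y_L, Y_L.T @ X_L)     # supervised initialization
    for _ in range(iters):
        Z = X_U @ U.T @ np.linalg.inv(U @ U.T)        # Z-update for fixed U
        A = (Y_L.T @ Y_L) / T_L + (Z.T @ Z) / T_U     # U-update for fixed Z:
        b = (Y_L.T @ X_L) / T_L + (Z.T @ X_U) / T_U   # combined normal equations
        U = np.linalg.solve(A, b)
    return U, Z

rng = np.random.default_rng(6)
n, k = 5, 2
W_true = rng.normal(size=(n, k))
X_L = rng.normal(size=(15, n))
Y_L = X_L @ W_true + 0.05 * rng.normal(size=(15, k))
X_U = rng.normal(size=(200, n))

U, Z = semi_supervised_reverse_ls(X_L, Y_L, X_U)
objective = (np.sum((X_L - Y_L @ U) ** 2) / len(X_L)
             + np.sum((X_U - Z @ U) ** 2) / len(X_U))
print(objective)
```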
Regression Experiments: Least Squares + PCA (results)

[Table, taken from Xu's paper: forward root mean squared error (mean ± standard deviation over 10 random splits of the data); the values of $(k, n; T_L, T_U)$ are indicated for each data set.]
Classification Experiments: Least Squares + k-means

$\min_{Z\in\{0,1\}^{t\times k},\,Z\mathbf{1}=\mathbf{1}}\;\min_U \left[\tfrac{1}{T_L}\|X_L - Y_L U\|_F^2 + \tfrac{1}{T_U}\|X_U - ZU\|_F^2\right]$

Least Squares + Norm-cut

$\min_{Z\in\{0,1\}^{t\times k},\,Z\mathbf{1}=\mathbf{1}}\;\min_U \left[\tfrac{1}{T_L}\|\Lambda_L^{1/2}(X_L - Y_L U)\|_F^2 + \tfrac{1}{T_U}\|\Lambda_U^{1/2}(X_U - ZU)\|_F^2\right]$

• Recovered forward solution: $W = (X'X + \alpha I)^{-1}U'Y'Y$
• Testing: given a new x, compute $\hat{y} = W'x$ and predict the class with the maximum response (see the sketch below)
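A small sketch of the recovery-and-predict step for classification (illustrative only: the ridge value alpha, the placeholder data, and the supervised reverse model are assumptions of mine):

```python
import numpy as np

def predict_class(x, X, Y, U, alpha=0.1):
    """Recover the forward model W = (X'X + alpha I)^{-1} U'Y'Y from a learned
    reverse model U, then predict the class with the maximum response in W'x."""
    n = X.shape[1]
    W = np.linalg.solve(X.T @ X + alpha * np.eye(n), U.T @ Y.T @ Y)
    return int(np.argmax(W.T @ x))

# Tiny usage example with placeholder data
rng = np.random.default_rng(7)
X = rng.normal(size=(30, 4))
labels = np.arange(30) % 3
Y = np.eye(3)[labels]                    # one-of-k label matrix
U = np.linalg.solve(Y.T @ Y, Y.T @ X)    # supervised reverse solution
print(predict_class(rng.normal(size=4), X, Y, U))
```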
Classification Experiments: Least Squares + k-means (results)

[Table, taken from Xu's paper: forward root mean squared error (mean ± standard deviation over 10 random splits of the data); the values of $(k, n; T_L, T_U)$ are indicated for each data set.]
Conclusions
Two main contributions:
1. A unified framework based on reverse least squares loss is proposed for several existing supervised and unsupervised algorithms;
2. In the unified framework, a novel semi-supervised principle is proposed.