Semi-Supervised Learning
Jia-Bin Huang
Virginia Tech, ECE-5424G / CS-5824, Spring 2019
Administrative
• HW 4 due April 10
Recommender Systems
• Motivation
• Problem formulation
• Content-based recommendations
• Collaborative filtering
• Mean normalization
Problem motivation

| Movie | Alice (1) | Bob (2) | Carol (3) | Dave (4) | $x_1$ (romance) | $x_2$ (action) |
|---|---|---|---|---|---|---|
| Love at last | 5 | 5 | 0 | 0 | 0.9 | 0 |
| Romance forever | 5 | ? | ? | 0 | 1.0 | 0.01 |
| Cute puppies of love | ? | 4 | 0 | ? | 0.99 | 0 |
| Nonstop car chases | 0 | 0 | 5 | 4 | 0.1 | 1.0 |
| Swords vs. karate | 0 | 0 | 5 | ? | 0 | 0.9 |
Problem motivation

| Movie | Alice (1) | Bob (2) | Carol (3) | Dave (4) | $x_1$ (romance) | $x_2$ (action) |
|---|---|---|---|---|---|---|
| Love at last | 5 | 5 | 0 | 0 | ? | ? |
| Romance forever | 5 | ? | ? | 0 | ? | ? |
| Cute puppies of love | ? | 4 | 0 | ? | ? | ? |
| Nonstop car chases | 0 | 0 | 5 | 4 | ? | ? |
| Swords vs. karate | 0 | 0 | 5 | ? | ? | ? |

Given user parameters

$$\theta^{(1)} = \begin{bmatrix} 0 \\ 5 \\ 0 \end{bmatrix}, \quad \theta^{(2)} = \begin{bmatrix} 0 \\ 5 \\ 0 \end{bmatrix}, \quad \theta^{(3)} = \begin{bmatrix} 0 \\ 0 \\ 5 \end{bmatrix}, \quad \theta^{(4)} = \begin{bmatrix} 0 \\ 0 \\ 5 \end{bmatrix},$$

infer the feature vector $x^{(1)} = \, ?$
Optimization algorithm
• Given $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(n_u)}$, to learn $x^{(i)}$:

$$\min_{x^{(i)}} \; \frac{1}{2} \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

• Given $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(n_u)}$, to learn $x^{(1)}, x^{(2)}, \dots, x^{(n_m)}$:

$$\min_{x^{(1)}, \dots, x^{(n_m)}} \; \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$
Collaborative filtering
• Given $x^{(1)}, x^{(2)}, \dots, x^{(n_m)}$ (and movie ratings), we can estimate $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(n_u)}$
• Given $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(n_u)}$, we can estimate $x^{(1)}, x^{(2)}, \dots, x^{(n_m)}$
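This back-and-forth estimation can be sketched as alternating ridge-regression solves. Everything below (the toy ratings `Y`, the observed-rating mask `R`, and the `solve_rows` helper) is illustrative and not from the lecture:

```python
import numpy as np

def solve_rows(A, Y, R, lam):
    """For each column j of Y, fit a weight vector minimizing the
    regularized squared error over the rows i with R[i, j] = True."""
    n = A.shape[1]
    W = np.zeros((Y.shape[1], n))
    for j in range(Y.shape[1]):
        idx = R[:, j]                      # observed entries for column j
        Aj, yj = A[idx], Y[idx, j]
        W[j] = np.linalg.solve(Aj.T @ Aj + lam * np.eye(n), Aj.T @ yj)
    return W

rng = np.random.default_rng(1)
Y = rng.random((5, 4)) * 5                 # toy ratings: 5 movies, 4 users
R = rng.random((5, 4)) < 0.9               # which ratings are observed
X = rng.normal(size=(5, 2))                # initial movie features x^(i)
for _ in range(10):
    Theta = solve_rows(X, Y, R, lam=0.1)       # theta^(j) given the x^(i)
    X = solve_rows(Theta, Y.T, R.T, lam=0.1)   # x^(i) given the theta^(j)
```

Each inner solve is a small regularized least-squares problem, mirroring the "given one side, estimate the other" idea above.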
Collaborative filtering optimization objective

• Given $x^{(1)}, \dots, x^{(n_m)}$, estimate $\theta^{(1)}, \dots, \theta^{(n_u)}$:

$$\min_{\theta^{(1)}, \dots, \theta^{(n_u)}} \; \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2$$

• Given $\theta^{(1)}, \dots, \theta^{(n_u)}$, estimate $x^{(1)}, \dots, x^{(n_m)}$:

$$\min_{x^{(1)}, \dots, x^{(n_m)}} \; \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$
Collaborative filtering optimization objective

• Minimizing over $x^{(1)}, \dots, x^{(n_m)}$ and $\theta^{(1)}, \dots, \theta^{(n_u)}$ simultaneously combines both objectives:

$$J = \frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$
Collaborative filtering optimization objective
$$J(x^{(1)}, \dots, x^{(n_m)}, \theta^{(1)}, \dots, \theta^{(n_u)}) = \frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$
Collaborative filtering algorithm

• Initialize $x^{(1)}, \dots, x^{(n_m)}, \theta^{(1)}, \dots, \theta^{(n_u)}$ to small random values
• Minimize $J(x^{(1)}, \dots, x^{(n_m)}, \theta^{(1)}, \dots, \theta^{(n_u)})$ using gradient descent (or an advanced optimization algorithm). For every $j = 1, \dots, n_u$ and $i = 1, \dots, n_m$:

$$x_k^{(i)} := x_k^{(i)} - \alpha \left( \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right) \theta_k^{(j)} + \lambda x_k^{(i)} \right)$$

$$\theta_k^{(j)} := \theta_k^{(j)} - \alpha \left( \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right) x_k^{(i)} + \lambda \theta_k^{(j)} \right)$$

• For a user with parameters $\theta$ and a movie with (learned) features $x$, predict a star rating of $\theta^\top x$
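These updates can be vectorized in a few lines of numpy. The toy ratings `Y`, the observed mask `R`, and all sizes below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_m, n_u, n = 5, 4, 2                         # movies, users, feature dim
Y = rng.integers(0, 6, size=(n_m, n_u)).astype(float)  # toy star ratings
R = rng.random((n_m, n_u)) < 0.8              # r(i,j) = 1 where rated

X = 0.1 * rng.standard_normal((n_m, n))       # x^(i), small random init
Theta = 0.1 * rng.standard_normal((n_u, n))   # theta^(j), small random init
alpha, lam = 0.01, 0.1

def cost(X, Theta):
    E = (X @ Theta.T - Y) * R                 # errors on observed entries only
    return 0.5 * (E ** 2).sum() + 0.5 * lam * ((X ** 2).sum() + (Theta ** 2).sum())

J_start = cost(X, Theta)
for _ in range(500):
    E = (X @ Theta.T - Y) * R
    # simultaneous gradient step on X and Theta
    X, Theta = X - alpha * (E @ Theta + lam * X), Theta - alpha * (E.T @ X + lam * Theta)
J_end = cost(X, Theta)

pred = X @ Theta.T                            # predicted star ratings
```

The mask `R` zeroes out the error on unrated entries, so only observed ratings drive the fit, exactly as the sums over $r(i,j)=1$ prescribe.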
Collaborative filtering
| Movie | Alice (1) | Bob (2) | Carol (3) | Dave (4) |
|---|---|---|---|---|
| Love at last | 5 | 5 | 0 | 0 |
| Romance forever | 5 | ? | ? | 0 |
| Cute puppies of love | ? | 4 | 0 | ? |
| Nonstop car chases | 0 | 0 | 5 | 4 |
| Swords vs. karate | 0 | 0 | 5 | ? |
Collaborative filtering
• Predicted ratings:

$$X = \begin{bmatrix} - \, (x^{(1)})^\top \, - \\ - \, (x^{(2)})^\top \, - \\ \vdots \\ - \, (x^{(n_m)})^\top \, - \end{bmatrix}, \quad \Theta = \begin{bmatrix} - \, (\theta^{(1)})^\top \, - \\ - \, (\theta^{(2)})^\top \, - \\ \vdots \\ - \, (\theta^{(n_u)})^\top \, - \end{bmatrix}, \quad Y = X \Theta^\top$$
Low-rank matrix factorization
Finding related movies/products
• For each product $i$, we learn a feature vector $x^{(i)} \in \mathbb{R}^n$
  ($x_1$: romance, $x_2$: action, $x_3$: comedy, ...)
• How do we find movies $j$ related to movie $i$?
  A small distance $\| x^{(i)} - x^{(j)} \|$ means movies $i$ and $j$ are "similar".
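As a toy illustration of this similarity search, reusing the romance/action feature values from the earlier table (the `most_similar` helper is hypothetical, not from the slides):

```python
import numpy as np

# Rows are learned features x^(i) = (romance, action) for the five movies.
X = np.array([
    [0.9, 0.0],    # Love at last
    [1.0, 0.01],   # Romance forever
    [0.99, 0.0],   # Cute puppies of love
    [0.1, 1.0],    # Nonstop car chases
    [0.0, 0.9],    # Swords vs. karate
])

def most_similar(i, X):
    """Return the index j != i minimizing ||x^(i) - x^(j)||."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                  # exclude the movie itself
    return int(np.argmin(d))

# most_similar(0, X) -> 2: the romance movies cluster together
```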
Recommender Systems
• Motivation
• Problem formulation
• Content-based recommendations
• Collaborative filtering
• Mean normalization
Users who have not rated any movies

| Movie | Alice (1) | Bob (2) | Carol (3) | Dave (4) | Eve (5) |
|---|---|---|---|---|---|
| Love at last | 5 | 5 | 0 | 0 | ? |
| Romance forever | 5 | ? | ? | 0 | ? |
| Cute puppies of love | ? | 4 | 0 | ? | ? |
| Nonstop car chases | 0 | 0 | 5 | 4 | ? |
| Swords vs. karate | 0 | 0 | 5 | ? | ? |

Eve has rated no movies, so no term with $r(i,5)=1$ appears in

$$\frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

Only the regularization term involves $\theta^{(5)}$, so minimization gives $\theta^{(5)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$.
Users who have not rated any movies

With $\theta^{(5)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, every predicted rating $(\theta^{(5)})^\top x^{(i)} = 0$ for Eve:

| Movie | Alice (1) | Bob (2) | Carol (3) | Dave (4) | Eve (5) |
|---|---|---|---|---|---|
| Love at last | 5 | 5 | 0 | 0 | 0 |
| Romance forever | 5 | ? | ? | 0 | 0 |
| Cute puppies of love | ? | 4 | 0 | ? | 0 |
| Nonstop car chases | 0 | 0 | 5 | 4 | 0 |
| Swords vs. karate | 0 | 0 | 5 | ? | 0 |
Mean normalization
• Compute each movie's mean rating $\mu_i$ over the users who rated it, subtract it from the ratings, and learn $\theta^{(j)}$, $x^{(i)}$ on the normalized data.
• For user $j$, on movie $i$ predict: $(\theta^{(j)})^\top x^{(i)} + \mu_i$
• User 5 (Eve): $\theta^{(5)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, so the prediction $(\theta^{(5)})^\top x^{(i)} + \mu_i = \mu_i$, the movie's average rating.
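A small sketch of mean normalization on the slide's rating table (`np.nan` marks missing ratings; the zero placeholder features for Eve's prediction are hypothetical):

```python
import numpy as np

# Ratings from the slide; np.nan marks "?" (unrated) entries.
Y = np.array([
    [5, 5, 0, 0, np.nan],
    [5, np.nan, np.nan, 0, np.nan],
    [np.nan, 4, 0, np.nan, np.nan],
    [0, 0, 5, 4, np.nan],
    [0, 0, 5, np.nan, np.nan],
])

mu = np.nanmean(Y, axis=1, keepdims=True)   # per-movie mean over rated entries
Y_norm = Y - mu                              # learn theta^(j), x^(i) on this

# For a brand-new user like Eve, theta^(5) = 0, so the prediction
# theta^(5)^T x^(i) + mu_i falls back to the movie's mean rating.
theta_eve = np.zeros(2)
x = np.zeros((5, 2))                         # placeholder learned features
pred_eve = x @ theta_eve + mu.ravel()
```

With the means added back, a user with no ratings gets each movie's average rating instead of all zeros.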
Recommender Systems
• Motivation
• Problem formulation
• Content-based recommendations
• Collaborative filtering
• Mean normalization
Review: Supervised Learning
• K nearest neighbor
• Linear Regression
• Naïve Bayes
• Logistic Regression
• Support Vector Machines
• Neural Networks
Review: Unsupervised Learning
• Clustering, K-Means
• Expectation maximization
• Dimensionality reduction
• Anomaly detection
• Recommender systems
Advanced Topics
• Semi-supervised learning
• Probabilistic graphical models
• Generative models
• Sequence prediction models
• Deep reinforcement learning
Semi-supervised Learning
• Motivation
• Problem formulation
• Consistency regularization
• Entropy-based method
• Pseudo-labeling
Classic Paradigm Insufficient Nowadays
• Modern applications: massive amounts of raw data.
• Only a tiny fraction can be annotated by human experts.
• Examples: protein sequences, billions of webpages, images.
(Figure: semi-supervised learning vs. active learning.)
Semi-supervised Learning
• Motivation
• Problem formulation
• Consistency regularization
• Entropy-based method
• Pseudo-labeling
Semi-supervised Learning Problem Formulation

• Labeled data:
  $S_l = \{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m_l)}, y^{(m_l)}) \}$
• Unlabeled data (the labels are not observed):
  $S_u = \{ x^{(1)}, x^{(2)}, \dots, x^{(m_u)} \}$
• Goal: learn a hypothesis $h_\theta$ (e.g., a classifier) that has small error
Combining labeled and unlabeled data: classical methods
• Transductive SVM [Joachims '99]
• Co-training [Blum and Mitchell '98]
• Graph-based methods [Blum and Chawla '01] [Zhu, Ghahramani, Lafferty '03]
Transductive SVM
• The separator goes through low density regions of the space / large margin
SVM inputs: $(x_l^{(i)}, y_l^{(i)})$

$$\min_\theta \; \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 \quad \text{s.t.} \quad y_l^{(i)} \theta^\top x_l^{(i)} \geq 1$$

Transductive SVM inputs: $(x_l^{(i)}, y_l^{(i)})$ and $x_u^{(i)}$, with the unknown labels $y_u^{(i)}$ treated as optimization variables:

$$\min_{\theta,\, y_u} \; \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 \quad \text{s.t.} \quad y_l^{(i)} \theta^\top x_l^{(i)} \geq 1, \quad y_u^{(i)} \theta^\top x_u^{(i)} \geq 1, \quad y_u^{(i)} \in \{-1, 1\}$$
Transductive SVMs
• First maximize the margin over the labeled points.
• Use the resulting separator to assign initial labels to the unlabeled points.
• Try flipping labels of unlabeled points to see whether doing so increases the margin.
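These steps can be illustrated on a toy 2-D problem. The separator `theta`, the points, and the `margin` helper below are all made up for illustration; a real transductive SVM re-optimizes `theta` jointly with the labels rather than holding it fixed:

```python
import numpy as np

# Toy data: two labeled points and two unlabeled points in 2-D.
Xl = np.array([[2.0, 0.0], [-2.0, 0.0]])
yl = np.array([1, -1])
Xu = np.array([[1.5, 0.3], [-1.6, -0.2]])

theta = np.array([1.0, 0.0])          # separator fit to the labeled points
yu = np.sign(Xu @ theta)              # step 2: initial labels for unlabeled

def margin(theta, X, y):
    """Smallest functional margin y_i * theta^T x_i over all points."""
    return np.min(y * (X @ theta))

base = margin(theta, np.vstack([Xl, Xu]), np.concatenate([yl, yu]))
# Step 3: flip each unlabeled label; keep a flip only if the margin improves.
for i in range(len(yu)):
    yu_try = yu.copy()
    yu_try[i] *= -1
    m = margin(theta, np.vstack([Xl, Xu]), np.concatenate([yl, yu_try]))
    if m > base:
        yu, base = yu_try, m
```

Here no flip helps, so the initial labels survive: the unlabeled points already sit on the correct sides of the low-density gap.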
Deep Semi-supervised Learning
Semi-supervised Learning
• Motivation
• Problem formulation
• Consistency regularization
• Entropy-based method
• Pseudo-labeling
Stochastic Perturbations / Π-Model

• Realistic perturbations $x \to \hat{x}$ of data points $x \in D_{UL}$ should not significantly change the output of $h_\theta(x)$
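A minimal sketch of this consistency loss, with additive Gaussian noise standing in for realistic data augmentations and a hypothetical fixed model `h_theta`:

```python
import numpy as np

rng = np.random.default_rng(0)

def h_theta(x):
    """Hypothetical model: a softmax over two fixed linear scores."""
    z = np.stack([x.sum(axis=-1), -x.sum(axis=-1)], axis=-1)
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(x, noise_scale=0.1):
    """Perturb x twice and penalize disagreement between the two outputs."""
    x1 = x + rng.normal(scale=noise_scale, size=x.shape)
    x2 = x + rng.normal(scale=noise_scale, size=x.shape)
    return np.mean((h_theta(x1) - h_theta(x2)) ** 2)

x_unlabeled = rng.normal(size=(8, 3))   # unlabeled batch, no y needed
loss = consistency_loss(x_unlabeled)
```

The loss needs no labels, so it can be applied to all of $D_{UL}$ and added to the usual supervised loss on the labeled data.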
Temporal Ensembling
Mean Teacher
Virtual Adversarial Training
Semi-supervised Learning
• Motivation
• Problem formulation
• Consistency regularization
• Entropy-based method
• Pseudo-labeling
EntMin (Entropy Minimization)
• Encourages more confident predictions on unlabeled data.
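The EntMin penalty can be sketched as the average prediction entropy on unlabeled data, added to the supervised loss (the helper and example distributions below are illustrative):

```python
import numpy as np

def entropy_penalty(p, eps=1e-12):
    """Mean entropy of predicted class distributions p (rows sum to 1)."""
    return float(np.mean(-np.sum(p * np.log(p + eps), axis=1)))

# Confident predictions incur a smaller penalty than uncertain ones,
# so minimizing the penalty pushes decision boundaries away from data.
confident = np.array([[0.99, 0.01], [0.01, 0.99]])
uncertain = np.array([[0.5, 0.5], [0.5, 0.5]])
```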
Semi-supervised Learning
• Motivation
• Problem formulation
• Consistency regularization
• Entropy-based method
• Pseudo-labeling
Comparison
Varying number of labels
Class mismatch between the labeled and unlabeled datasets hurts performance
Lessons
• Standardized architecture + equal budget for tuning hyperparameters
• Unlabeled data from a different class distribution is not that useful
• Most methods don't work well in the very low labeled-data regime
• Transferring a model pre-trained on ImageNet produces lower error rates
• These conclusions are based on small datasets, though