Semi-Supervised Learning
Jia-Bin Huang
Virginia Tech, ECE-5424G / CS-5824, Spring 2019
Administrative
• HW 4 due April 10
Recommender Systems
• Motivation
• Problem formulation
• Content-based recommendations
• Collaborative filtering
• Mean normalization
Problem motivation

| Movie | Alice (1) | Bob (2) | Carol (3) | Dave (4) | $x_1$ (romance) | $x_2$ (action) |
|---|---|---|---|---|---|---|
| Love at last | 5 | 5 | 0 | 0 | 0.9 | 0 |
| Romance forever | 5 | ? | ? | 0 | 1.0 | 0.01 |
| Cute puppies of love | ? | 4 | 0 | ? | 0.99 | 0 |
| Nonstop car chases | 0 | 0 | 5 | 4 | 0.1 | 1.0 |
| Swords vs. karate | 0 | 0 | 5 | ? | 0 | 0.9 |
Problem motivation

| Movie | Alice (1) | Bob (2) | Carol (3) | Dave (4) | $x_1$ (romance) | $x_2$ (action) |
|---|---|---|---|---|---|---|
| Love at last | 5 | 5 | 0 | 0 | ? | ? |
| Romance forever | 5 | ? | ? | 0 | ? | ? |
| Cute puppies of love | ? | 4 | 0 | ? | ? | ? |
| Nonstop car chases | 0 | 0 | 5 | 4 | ? | ? |
| Swords vs. karate | 0 | 0 | 5 | ? | ? | ? |

Given user parameters

$$\theta^{(1)} = \begin{bmatrix} 0 \\ 5 \\ 0 \end{bmatrix}, \quad \theta^{(2)} = \begin{bmatrix} 0 \\ 5 \\ 0 \end{bmatrix}, \quad \theta^{(3)} = \begin{bmatrix} 0 \\ 0 \\ 5 \end{bmatrix}, \quad \theta^{(4)} = \begin{bmatrix} 0 \\ 0 \\ 5 \end{bmatrix},$$

infer the feature vector $x^{(1)} = \, ?$
Optimization algorithm
• Given $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(n_u)}$, to learn $x^{(i)}$:

$$\min_{x^{(i)}} \; \frac{1}{2} \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

• Given $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(n_u)}$, to learn $x^{(1)}, x^{(2)}, \dots, x^{(n_m)}$:

$$\min_{x^{(1)}, \dots, x^{(n_m)}} \; \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$
Collaborative filtering
• Given $x^{(1)}, x^{(2)}, \dots, x^{(n_m)}$ (and movie ratings), we can estimate $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(n_u)}$
• Given $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(n_u)}$, we can estimate $x^{(1)}, x^{(2)}, \dots, x^{(n_m)}$
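This back-and-forth estimation can be sketched as alternating ridge-regression solves. Everything below (the toy ratings `Y`, the observed-rating mask `R`, and the `solve_rows` helper) is illustrative and not from the lecture:

```python
import numpy as np

def solve_rows(A, Y, R, lam):
    """For each column j of Y, fit a weight vector minimizing the
    regularized squared error over the rows i with R[i, j] = True."""
    n = A.shape[1]
    W = np.zeros((Y.shape[1], n))
    for j in range(Y.shape[1]):
        idx = R[:, j]                      # observed entries for column j
        Aj, yj = A[idx], Y[idx, j]
        W[j] = np.linalg.solve(Aj.T @ Aj + lam * np.eye(n), Aj.T @ yj)
    return W

rng = np.random.default_rng(1)
Y = rng.random((5, 4)) * 5                 # toy ratings: 5 movies, 4 users
R = rng.random((5, 4)) < 0.9               # which ratings are observed
X = rng.normal(size=(5, 2))                # initial movie features x^(i)
for _ in range(10):
    Theta = solve_rows(X, Y, R, lam=0.1)       # theta^(j) given the x^(i)
    X = solve_rows(Theta, Y.T, R.T, lam=0.1)   # x^(i) given the theta^(j)
```

Each inner solve is a small regularized least-squares problem, mirroring the "given one side, estimate the other" idea above.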
Collaborative filtering optimization objective

• Given $x^{(1)}, \dots, x^{(n_m)}$, estimate $\theta^{(1)}, \dots, \theta^{(n_u)}$:

$$\min_{\theta^{(1)}, \dots, \theta^{(n_u)}} \; \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2$$

• Given $\theta^{(1)}, \dots, \theta^{(n_u)}$, estimate $x^{(1)}, \dots, x^{(n_m)}$:

$$\min_{x^{(1)}, \dots, x^{(n_m)}} \; \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$
Collaborative filtering optimization objective

• Minimizing over $x^{(1)}, \dots, x^{(n_m)}$ and $\theta^{(1)}, \dots, \theta^{(n_u)}$ simultaneously combines both objectives:

$$J = \frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$
Collaborative filtering optimization objective
$$J(x^{(1)}, \dots, x^{(n_m)}, \theta^{(1)}, \dots, \theta^{(n_u)}) = \frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$
Collaborative filtering algorithm

• Initialize $x^{(1)}, \dots, x^{(n_m)}, \theta^{(1)}, \dots, \theta^{(n_u)}$ to small random values
• Minimize $J(x^{(1)}, \dots, x^{(n_m)}, \theta^{(1)}, \dots, \theta^{(n_u)})$ using gradient descent (or an advanced optimization algorithm). For every $j = 1, \dots, n_u$ and $i = 1, \dots, n_m$:

$$x_k^{(i)} := x_k^{(i)} - \alpha \left( \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right) \theta_k^{(j)} + \lambda x_k^{(i)} \right)$$

$$\theta_k^{(j)} := \theta_k^{(j)} - \alpha \left( \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right) x_k^{(i)} + \lambda \theta_k^{(j)} \right)$$

• For a user with parameters $\theta$ and a movie with (learned) features $x$, predict a star rating of $\theta^\top x$
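These updates can be vectorized in a few lines of numpy. The toy ratings `Y`, the observed mask `R`, and all sizes below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_m, n_u, n = 5, 4, 2                         # movies, users, feature dim
Y = rng.integers(0, 6, size=(n_m, n_u)).astype(float)  # toy star ratings
R = rng.random((n_m, n_u)) < 0.8              # r(i,j) = 1 where rated

X = 0.1 * rng.standard_normal((n_m, n))       # x^(i), small random init
Theta = 0.1 * rng.standard_normal((n_u, n))   # theta^(j), small random init
alpha, lam = 0.01, 0.1

def cost(X, Theta):
    E = (X @ Theta.T - Y) * R                 # errors on observed entries only
    return 0.5 * (E ** 2).sum() + 0.5 * lam * ((X ** 2).sum() + (Theta ** 2).sum())

J_start = cost(X, Theta)
for _ in range(500):
    E = (X @ Theta.T - Y) * R
    # simultaneous gradient step on X and Theta
    X, Theta = X - alpha * (E @ Theta + lam * X), Theta - alpha * (E.T @ X + lam * Theta)
J_end = cost(X, Theta)

pred = X @ Theta.T                            # predicted star ratings
```

The mask `R` zeroes out the error on unrated entries, so only observed ratings drive the fit, exactly as the sums over $r(i,j)=1$ prescribe.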
Collaborative filtering
| Movie | Alice (1) | Bob (2) | Carol (3) | Dave (4) |
|---|---|---|---|---|
| Love at last | 5 | 5 | 0 | 0 |
| Romance forever | 5 | ? | ? | 0 |
| Cute puppies of love | ? | 4 | 0 | ? |
| Nonstop car chases | 0 | 0 | 5 | 4 |
| Swords vs. karate | 0 | 0 | 5 | ? |
Collaborative filtering
• Predicted ratings:

$$X = \begin{bmatrix} - \, (x^{(1)})^\top \, - \\ - \, (x^{(2)})^\top \, - \\ \vdots \\ - \, (x^{(n_m)})^\top \, - \end{bmatrix}, \quad \Theta = \begin{bmatrix} - \, (\theta^{(1)})^\top \, - \\ - \, (\theta^{(2)})^\top \, - \\ \vdots \\ - \, (\theta^{(n_u)})^\top \, - \end{bmatrix}, \quad Y = X \Theta^\top$$
Low-rank matrix factorization
Finding related movies/products
• For each product $i$, we learn a feature vector $x^{(i)} \in \mathbb{R}^n$
  ($x_1$: romance, $x_2$: action, $x_3$: comedy, ...)
• How do we find movies $j$ related to movie $i$?
  A small distance $\| x^{(i)} - x^{(j)} \|$ means movies $i$ and $j$ are "similar".
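As a toy illustration of this similarity search, reusing the romance/action feature values from the earlier table (the `most_similar` helper is hypothetical, not from the slides):

```python
import numpy as np

# Rows are learned features x^(i) = (romance, action) for the five movies.
X = np.array([
    [0.9, 0.0],    # Love at last
    [1.0, 0.01],   # Romance forever
    [0.99, 0.0],   # Cute puppies of love
    [0.1, 1.0],    # Nonstop car chases
    [0.0, 0.9],    # Swords vs. karate
])

def most_similar(i, X):
    """Return the index j != i minimizing ||x^(i) - x^(j)||."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                  # exclude the movie itself
    return int(np.argmin(d))

# most_similar(0, X) -> 2: the romance movies cluster together
```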
Recommender Systems
• Motivation
• Problem formulation
• Content-based recommendations
• Collaborative filtering
• Mean normalization
Users who have not rated any movies

| Movie | Alice (1) | Bob (2) | Carol (3) | Dave (4) | Eve (5) |
|---|---|---|---|---|---|
| Love at last | 5 | 5 | 0 | 0 | ? |
| Romance forever | 5 | ? | ? | 0 | ? |
| Cute puppies of love | ? | 4 | 0 | ? | ? |
| Nonstop car chases | 0 | 0 | 5 | 4 | ? |
| Swords vs. karate | 0 | 0 | 5 | ? | ? |

Eve has rated no movies, so no term with $r(i,5)=1$ appears in

$$\frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

Only the regularization term involves $\theta^{(5)}$, so minimization gives $\theta^{(5)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$.
Users who have not rated any movies

With $\theta^{(5)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, every predicted rating $(\theta^{(5)})^\top x^{(i)} = 0$ for Eve:

| Movie | Alice (1) | Bob (2) | Carol (3) | Dave (4) | Eve (5) |
|---|---|---|---|---|---|
| Love at last | 5 | 5 | 0 | 0 | 0 |
| Romance forever | 5 | ? | ? | 0 | 0 |
| Cute puppies of love | ? | 4 | 0 | ? | 0 |
| Nonstop car chases | 0 | 0 | 5 | 4 | 0 |
| Swords vs. karate | 0 | 0 | 5 | ? | 0 |
Mean normalization
• Compute each movie's mean rating $\mu_i$ over the users who rated it, subtract it from the ratings, and learn $\theta^{(j)}$, $x^{(i)}$ on the normalized data.
• For user $j$, on movie $i$ predict: $(\theta^{(j)})^\top x^{(i)} + \mu_i$
• User 5 (Eve): $\theta^{(5)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, so the prediction $(\theta^{(5)})^\top x^{(i)} + \mu_i = \mu_i$, the movie's average rating.
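A small sketch of mean normalization on the slide's rating table (`np.nan` marks missing ratings; the zero placeholder features for Eve's prediction are hypothetical):

```python
import numpy as np

# Ratings from the slide; np.nan marks "?" (unrated) entries.
Y = np.array([
    [5, 5, 0, 0, np.nan],
    [5, np.nan, np.nan, 0, np.nan],
    [np.nan, 4, 0, np.nan, np.nan],
    [0, 0, 5, 4, np.nan],
    [0, 0, 5, np.nan, np.nan],
])

mu = np.nanmean(Y, axis=1, keepdims=True)   # per-movie mean over rated entries
Y_norm = Y - mu                              # learn theta^(j), x^(i) on this

# For a brand-new user like Eve, theta^(5) = 0, so the prediction
# theta^(5)^T x^(i) + mu_i falls back to the movie's mean rating.
theta_eve = np.zeros(2)
x = np.zeros((5, 2))                         # placeholder learned features
pred_eve = x @ theta_eve + mu.ravel()
```

With the means added back, a user with no ratings gets each movie's average rating instead of all zeros.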
Recommender Systems
• Motivation
• Problem formulation
• Content-based recommendations
• Collaborative filtering
• Mean normalization
Review: Supervised Learning
• K nearest neighbor
• Linear Regression
• Naïve Bayes
• Logistic Regression
• Support Vector Machines
• Neural Networks
Review: Unsupervised Learning
• Clustering, K-Means
• Expectation maximization
• Dimensionality reduction
• Anomaly detection
• Recommender systems
Advanced Topics
• Semi-supervised learning
• Probabilistic graphical models
• Generative models
• Sequence prediction models
• Deep reinforcement learning
Semi-supervised Learning
• Motivation
• Problem formulation
• Consistency regularization
• Entropy-based method
• Pseudo-labeling
Classic Paradigm Insufficient Nowadays
• Modern applications: massive amounts of raw data.
• Only a tiny fraction can be annotated by human experts.
• Examples: protein sequences, billions of webpages, images.
(Figure: semi-supervised learning vs. active learning.)
Semi-supervised Learning
• Motivation
• Problem formulation
• Consistency regularization
• Entropy-based method
• Pseudo-labeling
Semi-supervised Learning Problem Formulation

• Labeled data:
  $S_l = \{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m_l)}, y^{(m_l)}) \}$
• Unlabeled data (the labels are not observed):
  $S_u = \{ x^{(1)}, x^{(2)}, \dots, x^{(m_u)} \}$
• Goal: learn a hypothesis $h_\theta$ (e.g., a classifier) that has small error
Combining labeled and unlabeled data: classical methods
• Transductive SVM [Joachims '99]
• Co-training [Blum and Mitchell '98]
• Graph-based methods [Blum and Chawla '01] [Zhu, Ghahramani, Lafferty '03]
Transductive SVM
• The separator goes through low density regions of the space / large margin
SVM inputs: $(x_l^{(i)}, y_l^{(i)})$

$$\min_\theta \; \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 \quad \text{s.t.} \quad y_l^{(i)} \theta^\top x_l^{(i)} \geq 1$$

Transductive SVM inputs: $(x_l^{(i)}, y_l^{(i)})$ and $x_u^{(i)}$, with the unknown labels $y_u^{(i)}$ treated as optimization variables:

$$\min_{\theta,\, y_u} \; \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 \quad \text{s.t.} \quad y_l^{(i)} \theta^\top x_l^{(i)} \geq 1, \quad y_u^{(i)} \theta^\top x_u^{(i)} \geq 1, \quad y_u^{(i)} \in \{-1, 1\}$$
Transductive SVMs
• First maximize the margin over the labeled points.
• Use the resulting separator to assign initial labels to the unlabeled points.
• Try flipping labels of unlabeled points to see whether doing so increases the margin.
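These steps can be illustrated on a toy 2-D problem. The separator `theta`, the points, and the `margin` helper below are all made up for illustration; a real transductive SVM re-optimizes `theta` jointly with the labels rather than holding it fixed:

```python
import numpy as np

# Toy data: two labeled points and two unlabeled points in 2-D.
Xl = np.array([[2.0, 0.0], [-2.0, 0.0]])
yl = np.array([1, -1])
Xu = np.array([[1.5, 0.3], [-1.6, -0.2]])

theta = np.array([1.0, 0.0])          # separator fit to the labeled points
yu = np.sign(Xu @ theta)              # step 2: initial labels for unlabeled

def margin(theta, X, y):
    """Smallest functional margin y_i * theta^T x_i over all points."""
    return np.min(y * (X @ theta))

base = margin(theta, np.vstack([Xl, Xu]), np.concatenate([yl, yu]))
# Step 3: flip each unlabeled label; keep a flip only if the margin improves.
for i in range(len(yu)):
    yu_try = yu.copy()
    yu_try[i] *= -1
    m = margin(theta, np.vstack([Xl, Xu]), np.concatenate([yl, yu_try]))
    if m > base:
        yu, base = yu_try, m
```

Here no flip helps, so the initial labels survive: the unlabeled points already sit on the correct sides of the low-density gap.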
Deep Semi-supervised Learning
Semi-supervised Learning
• Motivation
• Problem formulation
• Consistency regularization
• Entropy-based method
• Pseudo-labeling
Stochastic Perturbations / Π-Model

• Realistic perturbations $x \to \hat{x}$ of data points $x \in D_{UL}$ should not significantly change the output of $h_\theta(x)$
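A minimal sketch of this consistency loss, with additive Gaussian noise standing in for realistic data augmentations and a hypothetical fixed model `h_theta`:

```python
import numpy as np

rng = np.random.default_rng(0)

def h_theta(x):
    """Hypothetical model: a softmax over two fixed linear scores."""
    z = np.stack([x.sum(axis=-1), -x.sum(axis=-1)], axis=-1)
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(x, noise_scale=0.1):
    """Perturb x twice and penalize disagreement between the two outputs."""
    x1 = x + rng.normal(scale=noise_scale, size=x.shape)
    x2 = x + rng.normal(scale=noise_scale, size=x.shape)
    return np.mean((h_theta(x1) - h_theta(x2)) ** 2)

x_unlabeled = rng.normal(size=(8, 3))   # unlabeled batch, no y needed
loss = consistency_loss(x_unlabeled)
```

The loss needs no labels, so it can be applied to all of $D_{UL}$ and added to the usual supervised loss on the labeled data.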
Temporal Ensembling
Mean Teacher
Virtual Adversarial Training
Semi-supervised Learning
• Motivation
• Problem formulation
• Consistency regularization
• Entropy-based method
• Pseudo-labeling
EntMin (Entropy Minimization)
• Encourages more confident predictions on unlabeled data.
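The EntMin penalty can be sketched as the average prediction entropy on unlabeled data, added to the supervised loss (the helper and example distributions below are illustrative):

```python
import numpy as np

def entropy_penalty(p, eps=1e-12):
    """Mean entropy of predicted class distributions p (rows sum to 1)."""
    return float(np.mean(-np.sum(p * np.log(p + eps), axis=1)))

# Confident predictions incur a smaller penalty than uncertain ones,
# so minimizing the penalty pushes decision boundaries away from data.
confident = np.array([[0.99, 0.01], [0.01, 0.99]])
uncertain = np.array([[0.5, 0.5], [0.5, 0.5]])
```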
Semi-supervised Learning
• Motivation
• Problem formulation
• Consistency regularization
• Entropy-based method
• Pseudo-labeling
Comparison
Varying number of labels
Class mismatch between the labeled and unlabeled datasets hurts performance
Lessons
• Standardized architecture + equal budget for tuning hyperparameters
• Unlabeled data from a different class distribution is not that useful
• Most methods don't work well in the very low labeled-data regime
• Transferring a model pre-trained on ImageNet produces lower error rates
• These conclusions are based on small datasets, though