
Semi-Supervised Learning

Jia-Bin Huang

Virginia Tech, ECE-5424G / CS-5824, Spring 2019

Administrative

• HW 4 due April 10

Recommender Systems

• Motivation

• Problem formulation

• Content-based recommendations

• Collaborative filtering

• Mean normalization

Problem motivation

Movie                  Alice (1)  Bob (2)  Carol (3)  Dave (4)  x_1 (romance)  x_2 (action)
Love at last               5         5        0          0           0.9            0
Romance forever            5         ?        ?          0           1.0            0.01
Cute puppies of love       ?         4        0          ?           0.99           0
Nonstop car chases         0         0        5          4           0.1            1.0
Swords vs. karate          0         0        5          ?           0              0.9

Problem motivation

Movie                  Alice (1)  Bob (2)  Carol (3)  Dave (4)  x_1 (romance)  x_2 (action)
Love at last               5         5        0          0            ?              ?
Romance forever            5         ?        ?          0            ?              ?
Cute puppies of love       ?         4        0          ?            ?              ?
Nonstop car chases         0         0        5          4            ?              ?
Swords vs. karate          0         0        5          ?            ?              ?

$$\theta^{(1)} = \begin{bmatrix} 0 \\ 5 \\ 0 \end{bmatrix}, \quad \theta^{(2)} = \begin{bmatrix} 0 \\ 5 \\ 0 \end{bmatrix}, \quad \theta^{(3)} = \begin{bmatrix} 0 \\ 0 \\ 5 \end{bmatrix}, \quad \theta^{(4)} = \begin{bmatrix} 0 \\ 0 \\ 5 \end{bmatrix}, \quad x^{(1)} = \begin{bmatrix} ? \\ ? \\ ? \end{bmatrix}$$

Optimization algorithm

• Given $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(n_u)}$, to learn $x^{(i)}$:

$$\min_{x^{(i)}} \; \frac{1}{2} \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

• Given $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(n_u)}$, to learn $x^{(1)}, x^{(2)}, \ldots, x^{(n_m)}$:

$$\min_{x^{(1)}, \ldots, x^{(n_m)}} \; \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

Collaborative filtering

• Given $x^{(1)}, x^{(2)}, \ldots, x^{(n_m)}$ (and movie ratings), we can estimate $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(n_u)}$

• Given $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(n_u)}$, we can estimate $x^{(1)}, x^{(2)}, \ldots, x^{(n_m)}$

Collaborative filtering optimization objective

• Given $x^{(1)}, \ldots, x^{(n_m)}$, estimate $\theta^{(1)}, \ldots, \theta^{(n_u)}$:

$$\min_{\theta^{(1)}, \ldots, \theta^{(n_u)}} \; \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2$$

• Given $\theta^{(1)}, \ldots, \theta^{(n_u)}$, estimate $x^{(1)}, \ldots, x^{(n_m)}$:

$$\min_{x^{(1)}, \ldots, x^{(n_m)}} \; \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$


• Minimize over $x^{(1)}, \ldots, x^{(n_m)}$ and $\theta^{(1)}, \ldots, \theta^{(n_u)}$ simultaneously:

$$J = \frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

Collaborative filtering optimization objective

$$J(x^{(1)}, \ldots, x^{(n_m)}, \theta^{(1)}, \ldots, \theta^{(n_u)}) = \frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

Collaborative filtering algorithm

• Initialize $x^{(1)}, \ldots, x^{(n_m)}, \theta^{(1)}, \ldots, \theta^{(n_u)}$ to small random values

• Minimize $J(x^{(1)}, \ldots, x^{(n_m)}, \theta^{(1)}, \ldots, \theta^{(n_u)})$ using gradient descent (or an advanced optimization algorithm). For every $j = 1, \ldots, n_u$ and $i = 1, \ldots, n_m$:

$$x_k^{(i)} := x_k^{(i)} - \alpha \left( \sum_{j:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right) \theta_k^{(j)} + \lambda x_k^{(i)} \right)$$

$$\theta_k^{(j)} := \theta_k^{(j)} - \alpha \left( \sum_{i:\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right) x_k^{(i)} + \lambda \theta_k^{(j)} \right)$$

• For a user with parameters $\theta$ and a movie with (learned) features $x$, predict a star rating of $\theta^\top x$
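The update rules above can be sketched in a few lines of numpy on the toy rating matrix from the slides. Everything here beyond the equations themselves (the learning rate, $\lambda$, the choice of $n = 2$ features, and all variable names) is an illustrative assumption, not part of the lecture:

```python
import numpy as np

# The slides' rating matrix: rows = movies, columns = users.
# np.nan marks a missing rating, i.e. r(i,j) = 0.
Y = np.array([
    [5, 5, 0, 0],
    [5, np.nan, np.nan, 0],
    [np.nan, 4, 0, np.nan],
    [0, 0, 5, 4],
    [0, 0, 5, np.nan],
])
R = ~np.isnan(Y)                  # r(i,j) = 1 where a rating exists
Y0 = np.nan_to_num(Y)             # missing entries zeroed; masked out via R below
n_m, n_u = Y.shape
n = 2                             # number of latent features (an arbitrary choice)

rng = np.random.default_rng(0)
X = 0.1 * rng.standard_normal((n_m, n))      # movie features x^(i), small random init
Theta = 0.1 * rng.standard_normal((n_u, n))  # user parameters theta^(j)

alpha, lam = 0.01, 0.01
for _ in range(8000):
    E = np.where(R, X @ Theta.T - Y0, 0.0)   # errors on rated entries only
    grad_X = E @ Theta + lam * X             # gradient of J w.r.t. each x^(i)
    grad_Theta = E.T @ X + lam * Theta       # gradient of J w.r.t. each theta^(j)
    X -= alpha * grad_X
    Theta -= alpha * grad_Theta

pred = X @ Theta.T                            # predicted star ratings theta^T x
rmse = np.sqrt(np.mean((pred - Y0)[R] ** 2))  # fit quality on the observed entries
```

Computing both gradients before applying either update keeps the two parameter blocks from using stale errors against fresh parameters within a step.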

Collaborative filtering

Movie                  Alice (1)  Bob (2)  Carol (3)  Dave (4)
Love at last               5         5        0          0
Romance forever            5         ?        ?          0
Cute puppies of love       ?         4        0          ?
Nonstop car chases         0         0        5          4
Swords vs. karate          0         0        5          ?

Collaborative filtering

• Predicted ratings:

$$X = \begin{bmatrix} -\; (x^{(1)})^\top \;- \\ -\; (x^{(2)})^\top \;- \\ \vdots \\ -\; (x^{(n_m)})^\top \;- \end{bmatrix}, \quad \Theta = \begin{bmatrix} -\; (\theta^{(1)})^\top \;- \\ -\; (\theta^{(2)})^\top \;- \\ \vdots \\ -\; (\theta^{(n_u)})^\top \;- \end{bmatrix}, \quad Y = X \Theta^\top$$

Low-rank matrix factorization
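A tiny numpy sketch of the factorization, stacking the feature values from the earlier table row-wise (the intercept feature is dropped here for brevity, so these are 2-vectors):

```python
import numpy as np

# Movie features x^(i) from the "Problem motivation" table, one row per movie.
X = np.array([[0.9, 0.0],    # Love at last
              [1.0, 0.01],   # Romance forever
              [0.99, 0.0],   # Cute puppies of love
              [0.1, 1.0],    # Nonstop car chases
              [0.0, 0.9]])   # Swords vs. karate

# User parameters theta^(j), one row per user (romance fans, then action fans).
Theta = np.array([[5.0, 0.0],   # Alice
                  [5.0, 0.0],   # Bob
                  [0.0, 5.0],   # Carol
                  [0.0, 5.0]])  # Dave

Y_pred = X @ Theta.T    # (n_m x n)(n x n_u) -> full n_m x n_u prediction matrix
print(Y_pred[0, 0])     # Alice on "Love at last": 0.9 * 5 = 4.5
```

One matrix product fills in every (movie, user) cell at once, including the entries that were "?" in the ratings table; that is the low-rank structure the name refers to.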

Finding related movies/products

• For each product $i$, we learn a feature vector $x^{(i)} \in \mathbb{R}^n$
  ($x_1$: romance, $x_2$: action, $x_3$: comedy, …)

• How do we find movies $j$ related to movie $i$?
  Small $\| x^{(i)} - x^{(j)} \|$ ⇒ movies $i$ and $j$ are “similar”
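Related items can then be found by a nearest-neighbor search in feature space. A minimal sketch, reusing the illustrative feature values from the earlier table:

```python
import numpy as np

# Learned movie features x^(i) (from the slides' example, intercept dropped).
X = np.array([[0.9, 0.0], [1.0, 0.01], [0.99, 0.0], [0.1, 1.0], [0.0, 0.9]])
titles = ["Love at last", "Romance forever", "Cute puppies of love",
          "Nonstop car chases", "Swords vs. karate"]

i = 0                                   # query movie: "Love at last"
d = np.linalg.norm(X - X[i], axis=1)    # ||x^(i) - x^(j)|| for every j
d[i] = np.inf                           # exclude the movie itself
most_similar = int(np.argmin(d))        # smallest distance = most similar movie
print(titles[most_similar])
```

With these numbers the nearest neighbor is another romance movie, as the feature interpretation predicts.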


Users who have not rated any movies

Suppose a fifth user, Eve, has rated no movies. The objective is

$$J = \frac{1}{2} \sum_{(i,j):\, r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

Since no $(i, 5)$ has $r(i,5) = 1$, Eve's parameters appear only in the regularization term, so minimizing $J$ gives

$$\theta^{(5)} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

Movie                  Alice (1)  Bob (2)  Carol (3)  Dave (4)  Eve (5)
Love at last               5         5        0          0         ?
Romance forever            5         ?        ?          0         ?
Cute puppies of love       ?         4        0          ?         ?
Nonstop car chases         0         0        5          4         ?
Swords vs. karate          0         0        5          ?         ?

Users who have not rated any movies

With $\theta^{(5)} = [0 \;\; 0]^\top$, the prediction $(\theta^{(5)})^\top x^{(i)} = 0$ for every movie, so Eve would be predicted 0 stars across the board:

Movie                  Alice (1)  Bob (2)  Carol (3)  Dave (4)  Eve (5)
Love at last               5         5        0          0         0
Romance forever            5         ?        ?          0         0
Cute puppies of love       ?         4        0          ?         0
Nonstop car chases         0         0        5          4         0
Swords vs. karate          0         0        5          ?         0

Mean normalization

Let $\mu_i$ be the mean of movie $i$'s observed ratings. Subtract it from each observed rating, learn $\theta^{(j)}$ and $x^{(i)}$ on the normalized data, and for user $j$ on movie $i$ predict:

$$(\theta^{(j)})^\top x^{(i)} + \mu_i$$

For user 5 (Eve), $\theta^{(5)} = [0 \;\; 0]^\top$, so

$$(\theta^{(5)})^\top x^{(i)} + \mu_i = \mu_i$$

i.e., Eve is predicted each movie's average rating rather than 0.
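Mean normalization is a one-liner in numpy. A minimal sketch on the slides' matrix, with Eve's empty column appended:

```python
import numpy as np

# Ratings matrix with Eve's empty fifth column; np.nan marks a missing rating.
Y = np.array([
    [5, 5, 0, 0, np.nan],
    [5, np.nan, np.nan, 0, np.nan],
    [np.nan, 4, 0, np.nan, np.nan],
    [0, 0, 5, 4, np.nan],
    [0, 0, 5, np.nan, np.nan],
])

# mu_i: each movie's mean over its *rated* entries only (NaNs ignored).
mu = np.nanmean(Y, axis=1)
Y_norm = Y - mu[:, None]   # learn theta^(j) and x^(i) on these normalized ratings

# For Eve, theta^(5) = 0, so the prediction theta^(5)T x^(i) + mu_i = mu_i:
# she gets each movie's average rating.
eve_pred = mu
print(eve_pred[0])   # (5 + 5 + 0 + 0) / 4 = 2.5
```

After normalization every movie's observed ratings average to zero, so a zero prediction on the normalized scale corresponds to the movie's mean on the original scale.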


Review: Supervised Learning

• k-nearest neighbors

• Linear Regression

• Naïve Bayes

• Logistic Regression

• Support Vector Machines

• Neural Networks

Review: Unsupervised Learning

• Clustering, k-means

• Expectation maximization

• Dimensionality reduction

• Anomaly detection

• Recommender systems

Advanced Topics

• Semi-supervised learning

• Probabilistic graphical models

• Generative models

• Sequence prediction models

• Deep reinforcement learning

Semi-supervised Learning

• Motivation

• Problem formulation

• Consistency regularization

• Entropy-based method

• Pseudo-labeling


Classic Paradigm Insufficient Nowadays

• Modern applications: massive amounts of raw data.

• Only a tiny fraction can be annotated by human experts

(Examples: protein sequences, billions of web pages, images)

(Figure: semi-supervised learning vs. active learning)


Semi-supervised Learning Problem Formulation

• Labeled data:

$$S_l = \left\{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m_l)}, y^{(m_l)}) \right\}$$

• Unlabeled data (no labels):

$$S_u = \left\{ x^{(1)}, x^{(2)}, \ldots, x^{(m_u)} \right\}$$

• Goal: learn a hypothesis $h_\theta$ (e.g., a classifier) that has small error

Combining labeled and unlabeled data: classical methods

• Transductive SVM [Joachims ’99]

• Co-training [Blum and Mitchell ’98]

• Graph-based methods [Blum and Chawla ’01] [Zhu, Ghahramani, Lafferty ’03]

Transductive SVM

• The separator goes through low-density regions of the space (large margin)

SVM inputs: $(x_l^{(i)}, y_l^{(i)})$

$$\min_\theta \; \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 \quad \text{s.t.} \quad y_l^{(i)} \, \theta^\top x_l^{(i)} \geq 1$$

Transductive SVM inputs: $(x_l^{(i)}, y_l^{(i)})$ and $x_u^{(i)}$; the unlabeled points' labels $y_u^{(i)}$ are optimized along with $\theta$:

$$\min_{\theta,\, y_u} \; \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 \quad \text{s.t.} \quad y_l^{(i)} \, \theta^\top x_l^{(i)} \geq 1, \quad y_u^{(i)} \, \theta^\top x_u^{(i)} \geq 1, \quad y_u^{(i)} \in \{-1, 1\}$$

Transductive SVMs

• First maximize the margin over the labeled points

• Use the resulting separator to give initial labels to the unlabeled points

• Try flipping labels of unlabeled points to see if doing so can increase the margin
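The three steps above can be sketched on a toy 2-D dataset. This assumes scikit-learn is available, and scores each labeling by the soft-margin objective $\frac{1}{2}\|\theta\|^2 + C \sum_i \max(0, 1 - y^{(i)} \theta^\top x^{(i)})$ of a refit SVM, a simplification of the full transductive objective:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs: 2 labeled points per class, 18 unlabeled.
X_pos = rng.normal([2.0, 2.0], 0.5, (20, 2))
X_neg = rng.normal([-2.0, -2.0], 0.5, (20, 2))
X_l = np.vstack([X_pos[:2], X_neg[:2]])
y_l = np.array([1, 1, -1, -1])
X_u = np.vstack([X_pos[2:], X_neg[2:]])

# Step 1: maximize the margin over the labeled points only.
svm = LinearSVC(C=10.0).fit(X_l, y_l)
# Step 2: use that separator to give initial labels to the unlabeled points.
y_u = svm.predict(X_u)

def objective(Xa, ya, C=10.0):
    # Refit and evaluate 0.5*||theta||^2 + C * sum of hinge losses;
    # a lower value corresponds to a wider, cleaner margin.
    m = LinearSVC(C=C).fit(Xa, ya)
    scores = ya * (Xa @ m.coef_.ravel() + m.intercept_[0])
    return float(0.5 * np.sum(m.coef_ ** 2) + C * np.maximum(0.0, 1.0 - scores).sum())

# Step 3: greedily try flipping each pseudo-label; keep a flip only if it helps.
X_all = np.vstack([X_l, X_u])
y_all = np.concatenate([y_l, y_u])
best = objective(X_all, y_all)
for k in range(len(y_l), len(y_all)):
    y_all[k] *= -1
    trial = objective(X_all, y_all)
    if trial < best:
        best = trial          # the flip improved the margin: keep it
    else:
        y_all[k] *= -1        # otherwise revert
```

On data this cleanly clustered the initial pseudo-labels already respect the low-density region, so every trial flip is reverted; the flipping pass matters when the labeled points alone suggest a separator that cuts through a cluster.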

Deep Semi-supervised Learning


Stochastic Perturbations/Π-Model

• Realistic perturbations $x \to \hat{x}$ of data points $x \in D_{UL}$ should not significantly change the output of $h_\theta(x)$
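The consistency penalty behind this idea can be sketched with plain numpy; the model itself is omitted, and the toy logits below stand in for $h_\theta(x)$ and $h_\theta(\hat{x})$ (both names and numbers are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def consistency_loss(logits_clean, logits_perturbed):
    # Pi-model-style consistency: mean squared distance between the model's
    # predicted distributions for x and for its perturbation x_hat.
    p = softmax(logits_clean)
    q = softmax(logits_perturbed)
    return float(np.mean(np.sum((p - q) ** 2, axis=1)))

# Identical outputs incur no penalty; diverging outputs are penalized.
z = np.array([[2.0, 0.0], [0.0, 3.0]])
z_hat = z + np.array([[0.5, -0.5], [0.0, 0.0]])   # stand-in for h(x_hat)'s logits
print(consistency_loss(z, z))       # 0.0
```

This term needs no labels, so it can be applied to every unlabeled example alongside the usual supervised loss on the labeled ones.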

Temporal Ensembling

• Maintains an exponential moving average of each example's past predictions and uses it as the consistency target

Mean Teacher

• Uses a “teacher” whose weights are an exponential moving average of the student model's weights to produce consistency targets

Virtual Adversarial Training

• Finds the perturbation that most changes the model's output and penalizes the resulting change in prediction


EntMin

• Encourages more confident predictions on unlabeled data by adding the prediction entropy $-\sum_c h_\theta(x)_c \log h_\theta(x)_c$ as a loss on unlabeled examples
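A minimal numpy sketch of the entropy term; the probability vectors stand in for a model's softmax outputs on unlabeled examples:

```python
import numpy as np

def entropy_loss(probs, eps=1e-12):
    # Average prediction entropy -sum_c p_c log p_c over a batch of
    # class-probability outputs; eps guards against log(0).
    return float(-np.mean(np.sum(probs * np.log(probs + eps), axis=1)))

confident = np.array([[0.99, 0.01]])   # low entropy: a decisive prediction
uncertain = np.array([[0.5, 0.5]])     # high entropy: maximally unsure

# Minimizing this loss on unlabeled data pushes training toward
# confident, low-entropy predictions:
assert entropy_loss(confident) < entropy_loss(uncertain)
```

Because entropy is smallest when one class gets nearly all the probability mass, this loss pushes the decision boundary away from dense regions of unlabeled data, much like the low-density-separation intuition behind transductive SVMs.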


Comparison

(Figure: test error as the number of labels varies)

• Class mismatch between the labeled and unlabeled datasets hurts performance

Lessons

• Standardized architecture + equal budget for tuning hyperparameters

• Unlabeled data from a different class distribution is not that useful

• Most methods don’t work well in the very low labeled-data regime

• Transferring a model pre-trained on ImageNet produces a lower error rate

• These conclusions are based on small datasets, though
