
An Overview of Other Topics, Conclusions, and Perspectives

Piyush Rai

Machine Learning (CS771A)

Nov 11, 2016


Plan for today...

Survey of some other topics

Sparse Modeling

Time-series modeling

Reinforcement Learning

Multitask/Transfer Learning

Active Learning

Bayesian Learning

Conclusion and take-aways...


Sparse Modeling


Sparse Modeling

Many ML problems can be written in the following general form

y = Xw + ε

where y is N × 1, X is N × D, w is D × 1, and ε denotes observation noise

Often we expect/want w to be sparse (i.e., at most, say s ≪ D, nonzeros)

Becomes especially important when D ≫ N

Examples: Sparse regression/classification, sparse matrix factorization, compressive sensing, dictionary learning, and many others.
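To make the setup concrete, here is a minimal numpy sketch (not from the slides; the dimensions and noise level are arbitrary choices) that generates data of exactly this form, y = Xw + ε with a sparse w:

import numpy as np

rng = np.random.default_rng(0)
N, D, s = 50, 200, 5                      # note D >> N and s << D

X = rng.standard_normal((N, D))           # design matrix, N x D
w_true = np.zeros(D)                      # weight vector, D x 1
support = rng.choice(D, size=s, replace=False)
w_true[support] = rng.standard_normal(s)  # only s nonzero entries
eps = 0.01 * rng.standard_normal(N)       # observation noise
y = X @ w_true + eps                      # y = Xw + eps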


Sparse Modeling

Ideally, our goal will be to solve the following problem

arg min_w ||y − Xw||^2   s.t.   ||w||_0 ≤ s

Note: ||w||_0 is known as the ℓ0 norm of w (the number of nonzero entries of w)

In general, an NP-hard problem: a combinatorial optimization problem with Σ_{i=1}^{s} (D choose i) possible locations of the nonzeros in w

Also, the constraint ||w||_0 ≤ s is non-convex (figure on next slide).

A huge body of work on solving these problems. Primarily in two categories:

Convexify the problem (e.g., replace the ℓ0 norm by the ℓ1 norm)

arg min_w ||y − Xw||^2  s.t.  ||w||_1 ≤ t    OR    arg min_w ||y − Xw||^2 + λ||w||_1

(note: the ℓ1 constraint makes the objective non-differentiable at 0, but there are many ways to handle this)

Use non-convex optimization methods, e.g., iterative hard thresholding (rough idea: use gradient descent to solve for w and set the D − s smallest entries to zero in every iteration; basically a projected GD method)

† See "Optimization Methods for ℓ1 Regularization" by Schmidt et al. (2009)
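A minimal sketch of iterative hard thresholding along these lines (my own illustration, not from the slides; the step size and iteration count are arbitrary assumptions):

import numpy as np

def iht(X, y, s, step=None, iters=200):
    # iterative hard thresholding for: min ||y - Xw||^2 s.t. ||w||_0 <= s
    N, D = X.shape
    if step is None:
        step = 1.0 / np.linalg.norm(X, 2) ** 2   # safe step: 1 / sigma_max(X)^2
    w = np.zeros(D)
    for _ in range(iters):
        w = w + step * X.T @ (y - X @ w)         # gradient step on least squares
        idx = np.argsort(np.abs(w))[:D - s]      # the D - s smallest-magnitude entries
        w[idx] = 0.0                             # project back onto ||w||_0 <= s
    return w

w_hat = iht(X, y, s=5)   # X, y as in the earlier data-generation sketch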


Sparsity using ℓ1 Norm

Why does the ℓ1 norm give sparsity? Many explanations. An informal one:

The error-function contours are more likely to first meet the ℓ1 constraint region at a corner, which lies on a coordinate axis (so some coordinates of w are exactly zero)

Another explanation: Between the ℓ2 and ℓ1 norms, ℓ1 is "closer" to the ℓ0 norm (in fact, the ℓ1 norm is the closest convex approximation to the ℓ0 norm)
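A quick way to see the difference empirically (a sketch using scikit-learn; the regularization strengths are arbitrary choices, and X, y are from the earlier sparse-regression sketch):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

lasso = Lasso(alpha=0.1).fit(X, y)    # l1-regularized least squares
ridge = Ridge(alpha=0.1).fit(X, y)    # l2-regularized least squares

print("nonzeros in Lasso solution:", np.sum(lasso.coef_ != 0))
print("nonzeros in Ridge solution:", np.sum(ridge.coef_ != 0))
# typically: the Lasso solution has only a few nonzeros, while
# the Ridge solution has (almost) all D entries nonzero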


Learning from Time-Series Data


Modeling Time-Series Data

The input is a sequence of (non-i.i.d.) examples y_1, y_2, ..., y_T

The problem may be supervised or unsupervised, e.g.,

Forecasting: Predict y_{T+1}, given y_1, y_2, ..., y_T

Cluster the examples or perform dimensionality reduction

Evolution of time-series data can be attributed to several factors

Teasing apart these factors of variation is also an important problem


Auto-regressive Models

Auto-regressive (AR): Regress each example on p previous examples

y_t = c + Σ_{i=1}^{p} w_i y_{t−i} + ε_t    (an AR(p) model)

Moving Average (MA): Regress each example on p previous stochastic errors

y_t = c + ε_t + Σ_{i=1}^{p} w_i ε_{t−i}    (an MA(p) model)

Auto-regressive Moving Average (ARMA): Regress each example on p previous examples and q previous stochastic errors

y_t = c + ε_t + Σ_{i=1}^{p} w_i y_{t−i} + Σ_{i=1}^{q} v_i ε_{t−i}    (an ARMA(p, q) model)
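Fitting an AR(p) model reduces to ordinary least squares on lagged copies of the series; a minimal numpy sketch (my own illustration; the toy series and the choice p = 2 are assumptions):

import numpy as np

def fit_ar(y, p):
    # least-squares fit of an AR(p) model: y_t = c + sum_i w_i y_{t-i}
    T = len(y)
    # row for time t holds [1, y_{t-1}, ..., y_{t-p}]
    X = np.column_stack([np.ones(T - p)] +
                        [y[p - i:T - i] for i in range(1, p + 1)])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef[0], coef[1:]              # intercept c, weights w

rng = np.random.default_rng(0)
y = np.cumsum(rng.standard_normal(500))   # a toy random-walk series
c, w = fit_ar(y, p=2)
y_next = c + w @ y[-1:-3:-1]              # one-step forecast of y_{T+1}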


State-Space Models

Assume that each observation y_t in the time-series is generated by a low-dimensional latent factor x_t (one-hot or continuous)

Basically, a generative latent factor model: y_t = g(x_t) and x_t = f(x_{t−1}), where g and f are probability distributions

Very similar to PPCA/FA, except that the latent factor x_t depends on x_{t−1}

Some popular SSMs: Hidden Markov Models (one-hot latent factor x_t), Kalman Filters (real-valued latent factor x_t)

Note: Models like RNN/LSTM are also similar, except that these are not generative (but can be made generative)
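For the Kalman-filter case, a minimal one-dimensional sketch of the predict/update recursion (my own illustration; all model parameters here are assumed values):

import numpy as np

# 1-D linear-Gaussian SSM: x_t = a*x_{t-1} + process noise, y_t = h*x_t + obs noise
a, h = 0.9, 1.0          # transition and emission coefficients (assumed)
q, r = 0.1, 0.5          # process and observation noise variances (assumed)

def kalman_filter(ys, mu0=0.0, var0=1.0):
    mu, var = mu0, var0
    means = []
    for y in ys:
        # predict: propagate the latent state through the transition model
        mu_pred, var_pred = a * mu, a * a * var + q
        # update: correct the prediction using the new observation
        k = var_pred * h / (h * h * var_pred + r)   # Kalman gain
        mu = mu_pred + k * (y - h * mu_pred)
        var = (1 - k * h) * var_pred
        means.append(mu)
    return np.array(means)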


Reinforcement Learning


Reinforcement Learning

A paradigm for interactive learning or “learning by doing”

Different from supervised learning. Supervision is “implicit”

The learner (agent) performs “actions” and gets “rewards”

Goal: Learn a policy that maximizes the agent's cumulative expected reward (the policy tells what action the agent should take next)

Order in which data arrives matters (sequential, non-i.i.d. data)

Agent's actions affect the subsequent data it receives

Many applications: Robotics and control, computer game playing (e.g., Atari, Go), online advertising, financial trading, etc.


Markov Decision Processes

Markov Decision Process (MDP) gives a way to formulate RL problems

MDP formulation assumes the agent knows its state in the environment

Partially Observed MDP (POMDP) can be used when the state is unknown

Assume there are K possible states and a set of actions at any state

A transition model p(s_{t+1} = ℓ | s_t = k, a_t = a) = P_a(k, ℓ)

Each P_a is a K × K matrix of transition probabilities (one for each action a)

A reward function R(s_t = k, a_t = a)

Goal: Find a "policy" π(s_t = k) which returns the optimal action for s_t = k

(P_a, R) and π can be estimated in an alternating fashion

Estimating P_a and R requires some training data. Can be done even when the state space is continuous (requires solving a function approximation problem)
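Given (P_a, R), one standard way to extract the optimal policy is value iteration (my choice of algorithm, not detailed on the slide; the discount factor is an assumption):

import numpy as np

def value_iteration(P, R, gamma=0.9, iters=500):
    # P: dict mapping each action to its K x K transition matrix
    # R: K x A reward table, columns in the same order as P's keys
    K = R.shape[0]
    actions = list(P.keys())
    V = np.zeros(K)
    for _ in range(iters):
        # Q(k, a) = R(k, a) + gamma * sum_l P_a(k, l) * V(l)
        Q = np.stack([R[:, i] + gamma * P[a] @ V
                      for i, a in enumerate(actions)], axis=1)
        V = Q.max(axis=1)                # Bellman optimality backup
    policy = Q.argmax(axis=1)            # best action index for each state
    return V, policy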


Multitask/Transfer Learning


Multitask/Transfer Learning

An umbrella term for settings where we want to learn multiple models that could potentially benefit from sharing information with each other

Each learning problem is a "task"

Note: We rarely know how the different tasks are related to each other

Have to learn the relatedness structure as well

For M tasks, we will jointly learn M models, say w_1, w_2, ..., w_M

Need to jointly regularize the models to make them similar to each other based on their degree/way of relatedness with each other (to be learned)

Some other related problem settings: Domain Adaptation, Covariate Shift
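One simple instance of such joint regularization (a sketch, not from the slides: each task's weights are pulled toward a shared mean model; the objective, step size, and alternating update are my choices):

import numpy as np

def multitask_gd(Xs, ys, lam=1.0, step=1e-3, iters=1000):
    # minimize sum_m ||y_m - X_m w_m||^2 + lam * sum_m ||w_m - w_bar||^2
    M, D = len(Xs), Xs[0].shape[1]
    W = np.zeros((M, D))
    for _ in range(iters):
        w_bar = W.mean(axis=0)                 # shared "consensus" model
        for m in range(M):
            # gradient treating w_bar as fixed (an alternating-style scheme)
            grad = -2 * Xs[m].T @ (ys[m] - Xs[m] @ W[m]) \
                   + 2 * lam * (W[m] - w_bar)  # pull w_m toward w_bar
            W[m] -= step * grad
    return W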


Covariate Shift

Note: The input or feature vector x is also known as a "covariate"

Suppose training inputs come from distribution p_tr(x)

Suppose test inputs come from a different distribution p_te(x), i.e., p_tr(x) ≠ p_te(x)

The following loss function can be used to handle this issue:

Σ_{n=1}^{N} [p_te(x_n) / p_tr(x_n)] ℓ(y_n, f(x_n)) + R(f)

Why will this work? Well, because

E_{(x,y)∼p_te}[ℓ(y, x, w)] = E_{(x,y)∼p_tr}[(p_te(x, y) / p_tr(x, y)) ℓ(y, x, w)]

If p(y|x) doesn't change and only p(x) changes, then p_te(x, y) / p_tr(x, y) = p_te(x) / p_tr(x)

Can actually estimate the ratio without estimating the densities (a huge body of work on this problem)
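One well-known trick for estimating p_te(x)/p_tr(x) without density estimation is to train a classifier to distinguish test inputs from training inputs and use its odds as the ratio (a sketch using scikit-learn; just one instance of this body of work, not the slide's specific method):

import numpy as np
from sklearn.linear_model import LogisticRegression

def importance_weights(X_tr, X_te):
    # estimate p_te(x)/p_tr(x) at each training input via a domain classifier
    X = np.vstack([X_tr, X_te])
    d = np.concatenate([np.zeros(len(X_tr)), np.ones(len(X_te))])  # 0=train, 1=test
    clf = LogisticRegression().fit(X, d)
    p_test = clf.predict_proba(X_tr)[:, 1]
    # by Bayes rule, p(test|x)/p(train|x) is proportional to p_te(x)/p_tr(x)
    return (p_test / (1 - p_test)) * len(X_tr) / len(X_te)  # correct for class sizes

# weights = importance_weights(X_tr, X_te)
# model.fit(X_tr, y_tr, sample_weight=weights)  # many sklearn models accept this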


Active Learning


Active Learning

Allow the learner to ask for the most informative training examples
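A common instantiation of "most informative" is uncertainty sampling (my choice of criterion, not specified on the slide): query the unlabeled point the current model is least sure about.

import numpy as np
from sklearn.linear_model import LogisticRegression

def query_most_uncertain(clf, X_pool):
    # index of the unlabeled point with the most uncertain prediction (binary case)
    probs = clf.predict_proba(X_pool)
    margin = np.abs(probs[:, 1] - 0.5)   # 0 means maximally uncertain
    return int(np.argmin(margin))

# typical loop: fit on the labeled set, query one label, add it, repeat
# clf = LogisticRegression().fit(X_labeled, y_labeled)
# i = query_most_uncertain(clf, X_pool)   # ask the oracle to label X_pool[i]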


Bayesian Learning


Optimization vs Inference

Learning as Optimization

Parameter θ is a fixed unknown

Minimize a “loss” and find a point estimate (best answer) for θ, given data X

θ̂ = arg min_{θ∈Θ} Loss(X; θ)

or  arg max_θ log p(X|θ)    (Maximum Likelihood)

or  arg max_θ log p(X|θ)p(θ)    (Maximum-a-Posteriori Estimation)

Learning as (Bayesian) Inference

Treat the parameter θ as a random variable with a prior distribution p(θ)

Infer a posterior distribution over the parameters using Bayes rule

p(θ|X) = p(X|θ)p(θ) / p(X) ∝ Likelihood × Prior

Posterior becomes the new prior for next batch of observed data

No “fitting”, so no overfitting!
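To make the contrast concrete, a Beta-Bernoulli sketch (my example; the data and prior hyperparameters are assumptions): the optimization view returns a point estimate of θ, while the Bayesian view returns a whole distribution over θ.

import numpy as np

x = np.array([1, 1, 0, 1, 0, 1, 1, 1])   # coin flips, theta = P(heads)
n, heads = len(x), x.sum()
a, b = 2.0, 2.0                           # Beta(a, b) prior on theta (assumed)

theta_mle = heads / n                             # arg max log p(X|theta)
theta_map = (heads + a - 1) / (n + a + b - 2)     # arg max log p(X|theta)p(theta)

# full posterior: Beta is conjugate to Bernoulli, so
# p(theta|X) = Beta(a + heads, b + tails)
post_a, post_b = a + heads, b + (n - heads)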


Why be Bayesian?

Can capture/quantify the uncertainty (or “variance”) in θ via the posterior

Can make predictions by averaging over the posterior

p(y|x, X, Y) = ∫ p(y|x, θ) p(θ|X, Y) dθ    (the predictive posterior)

Many other benefits (wait for next semester :) )
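Continuing the Beta-Bernoulli sketch above (post_a, post_b as computed there), the predictive integral has a closed form, and it can also be approximated by averaging over posterior samples:

import numpy as np

# exact: p(y=1 | data) = E[theta | data] = post_a / (post_a + post_b)
p_exact = post_a / (post_a + post_b)

# Monte Carlo: average p(y=1|theta) over samples from the posterior
rng = np.random.default_rng(0)
samples = rng.beta(post_a, post_b, size=100_000)
p_mc = samples.mean()                  # approximately equal to p_exact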


Conclusion and Take-aways


Conclusion and Take-aways

Most learning problems can be cast as optimizing a regularized loss function

Probabilistic viewpoint: Most learning problems can be cast as doing MLE/MAP on a probabilistic model of the data

Negative log-likelihood (NLL) = loss function, log-prior = regularizer

More sophisticated models can be constructed with this basic understanding: Just think of the appropriate loss function/probability model for the data, and the appropriate regularizer/prior

Always start with simple models. Linear models can be really powerful given a good feature representation.

Learn to first diagnose a learning algorithm rather than trying new ones

No free lunch. No learning algorithm is “universally” good.


Thank You!
