Machine Learning Techniques (機器學習技法)
Lecture 16: Finale
Hsuan-Tien Lin (林軒田)
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Roadmap
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models
Lecture 15: Matrix Factorization
linear models of movies on extracted user features (or vice versa) jointly optimized with stochastic gradient descent

Lecture 16: Finale
Feature Exploitation Techniques
Error Optimization Techniques
Overfitting Elimination Techniques
Machine Learning in Practice
Feature Exploitation Techniques
Exploiting Numerous Features via Kernel
numerous features within some Φ: embedded in kernel KΦ with inner product operation
Polynomial Kernel: 'scaled' polynomial transforms
Sum of Kernels: transform union
Gaussian Kernel: infinite-dimensional transforms
Product of Kernels: transform combination
Stump Kernel: decision stumps as transforms
Mercer Kernels: transform implicitly
kernel ridge regression; kernel logistic regression; SVM; SVR; probabilistic SVM
possibly Kernel PCA, Kernel k-Means, . . .
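As a minimal sketch of these kernels in code (NumPy; the helper names are my own, not from the lecture), here are the Gaussian and polynomial kernels computed explicitly, together with the fact that a sum or product of two kernels is again a valid kernel:

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2): the infinite-dimensional Gaussian transform."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def polynomial_kernel(X1, X2, gamma=1.0, coef0=1.0, degree=2):
    """K[i, j] = (coef0 + gamma * <x_i, x_j>)^degree: a 'scaled' polynomial transform."""
    return (coef0 + gamma * X1 @ X2.T) ** degree

X = np.random.randn(5, 3)
K_sum = gaussian_kernel(X, X) + polynomial_kernel(X, X)   # sum of kernels: transform union
K_prod = gaussian_kernel(X, X) * polynomial_kernel(X, X)  # product of kernels: transform combination
```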
Exploiting Predictive Features via Aggregation
predictive features within some Φ: φt(x) = gt(x)
Decision Stump: simplest perceptron; simplest DecTree
Decision Tree: branching (divide) + leaves (conquer)
(Gaussian) RBF: prototype (center) + influence
Uniform: Bagging; Random Forest
Non-Uniform: AdaBoost; GradientBoost; probabilistic SVM
Conditional: Decision Tree; Nearest Neighbor
possibly Infinite Ensemble Learning, Decision Tree SVM, . . .
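A minimal sketch of this view in code (NumPy; the stump parameters below are hand-picked for illustration, not learned as a real AdaBoost would be): each stump's output becomes a predictive feature φt(x) = gt(x), and a linear model aggregates them.

```python
import numpy as np

def stump_predict(x_col, theta, s):
    """One decision stump g(x) = s * sign(x_i - theta): the simplest perceptron / DecTree."""
    return s * np.sign(x_col - theta)

# toy data; the stumps below are hand-fixed for illustration (normally learned, e.g. by AdaBoost)
X = np.random.randn(100, 3)
y = np.sign(X[:, 0] + 0.5 * X[:, 1])

stumps = [(0, 0.0, +1), (1, 0.2, +1), (2, -0.1, -1)]            # (feature index, theta, direction)
Phi = np.column_stack([stump_predict(X[:, i], th, s) for i, th, s in stumps])  # phi_t(x) = g_t(x)

alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)                 # non-uniform (linear) blending weights
G = np.sign(Phi @ alpha)                                        # final aggregated hypothesis
```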
Exploiting Hidden Features via Extraction
hidden features within some Φ: as hidden variables to be 'jointly' optimized with usual weights, possibly with the help of unsupervised learning
Neural Network; Deep Learning: neuron weights
AdaBoost; GradientBoost: gt parameters
RBF Network: RBF centers
k-Means: cluster centers
Matrix Factorization: user/movie factors
Autoencoder; PCA: 'basis' directions
possibly GradientBoosted Neurons, NNet on Factorized Features, . . .
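A minimal sketch of the Lecture-15 case (NumPy; the function and variable names are mine): the user factors are the hidden features, jointly optimized with the movie weights by per-rating SGD.

```python
import numpy as np

def mf_sgd(ratings, n_users, n_movies, d=10, eta=0.01, epochs=20, seed=0):
    """Matrix factorization: hidden user/movie factors jointly optimized by SGD on (r - w_m^T v_u)^2."""
    rng = np.random.default_rng(seed)
    V = rng.normal(scale=0.1, size=(n_users, d))    # user factors (the extracted hidden features)
    W = rng.normal(scale=0.1, size=(n_movies, d))   # per-movie linear weights
    for _ in range(epochs):
        for u, m, r in ratings:
            vu, wm = V[u].copy(), W[m].copy()
            err = r - wm @ vu                        # residual of this rating
            V[u] += eta * err * wm                   # SGD step on the user factor
            W[m] += eta * err * vu                   # SGD step on the movie weights
    return V, W

ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 2, 2.0)]
V, W = mf_sgd(ratings, n_users=2, n_movies=3)
```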
Exploiting Low-Dim. Features via Compression
low-dimensional features within some Φ: compressed from original features
Decision Stump; DecTree Branching: 'best' naïve projection to R
Random Forest Tree Branching: 'random' low-dim. projection
Autoencoder; PCA: info.-preserving compression
Matrix Factorization: projection from abstract to concrete
Feature Selection: 'most-helpful' low-dimensional projection
possibly other ‘dimension reduction’ models
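A minimal sketch of the PCA case (NumPy): center the data, keep the top-d principal directions, and compress every example into d low-dimensional features.

```python
import numpy as np

def pca_compress(X, d):
    """Project centered data onto the top-d principal directions (info.-preserving compression)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # rows of Vt are the principal directions
    W = Vt[:d]                                          # (d, original_dim) projection matrix
    return Xc @ W.T, W, mean                            # low-dimensional features, basis, mean

X = np.random.randn(200, 8)
Z, W, mean = pca_compress(X, d=3)                       # Z: 3 'compressed' features per example
```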
Fun Time
Consider running AdaBoost-Stump on a PCA-preprocessed data set. Then, in terms of the original features x, what does the final hypothesis G(x) look like?
1 a neural network with tanh(·) in the hidden neurons
2 a neural network with sign(·) in the hidden neurons
3 a decision tree
4 a random forest
Reference Answer: 2
PCA results in a linear transformation of x. Then, when applying a decision stump on the transformed data, it is as if a perceptron is applied on the original data. So the resulting G is simply a linear aggregation of perceptrons.
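In symbols (a small worked step; here v_i is a PCA direction and s, θ, α_t are the usual stump direction, threshold, and AdaBoost weights), a stump on the i-th PCA feature is a perceptron in the original x, so the AdaBoost vote is a one-hidden-layer network with sign(·) neurons:

$$
g_t(\mathbf{x}) = s\,\operatorname{sign}\!\big(\mathbf{v}_i^{\top}\mathbf{x} - \theta\big),
\qquad
G(\mathbf{x}) = \operatorname{sign}\!\Big(\sum_{t=1}^{T}\alpha_t\, g_t(\mathbf{x})\Big).
$$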
Error Optimization Techniques
Numerical Optimization via Gradient Descent
when ∇E 'approximately' defined, use it for 1st-order approximation: new variables = old variables − η∇E
SGD/Minibatch/GD: (Kernel) LogReg; Neural Network [backprop]; Matrix Factorization; Linear SVM (maybe)
Steepest Descent: AdaBoost; GradientBoost
Functional GD: AdaBoost; GradientBoost
possibly 2nd-order techniques, GD under constraints, . . .
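A minimal sketch of the update rule for one of the listed models, logistic regression with SGD (NumPy; the function name and toy data are mine): each step applies exactly 'new variables = old variables − η∇E' on one example.

```python
import numpy as np

def sgd_logreg(X, y, eta=0.1, epochs=50, seed=0):
    """Logistic regression by SGD: w <- w - eta * grad of ln(1 + exp(-y w^T x)), one example at a time
    (labels y in {-1, +1})."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            grad = -y[i] * X[i] / (1.0 + np.exp(y[i] * (w @ X[i])))  # gradient on this example
            w -= eta * grad                                          # new = old - eta * grad E
    return w

X = np.hstack([np.ones((100, 1)), np.random.randn(100, 2)])  # add a bias feature
y = np.sign(X[:, 1] - X[:, 2])
w = sgd_logreg(X, y)
```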
Indirect Optimization via Equivalent Solution
when it is difficult to solve the original problem, seek an equivalent solution
Dual SVM: equivalence via convex QP
Kernel LogReg; Kernel RidgeReg: equivalence via representer
PCA: equivalence to eigenproblem
some other boosting models and modern solvers of kernel models rely on such a technique heavily
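A concrete instance of the representer-theorem equivalence (a minimal NumPy sketch with a Gaussian kernel): instead of solving for a weight vector in the huge Φ-space, kernel ridge regression solves the equivalent N-variable problem β = (λI + K)⁻¹ y.

```python
import numpy as np

def kernel_ridge_fit(X, y, lam=1.0, gamma=1.0):
    """Kernel ridge regression via the representer theorem:
    beta = (lambda * I + K)^{-1} y, where K[n, m] = exp(-gamma * ||x_n - x_m||^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-gamma * sq)
    return np.linalg.solve(lam * np.eye(len(X)) + K, y)

def kernel_ridge_predict(beta, X_train, X_test, gamma=1.0):
    """g(x) = sum_n beta_n * K(x_n, x)."""
    sq = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq) @ beta

X, y = np.random.randn(50, 2), np.random.randn(50)
beta = kernel_ridge_fit(X, y)
y_hat = kernel_ridge_predict(beta, X, X)
```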
Complicated Optimization via Multiple Steps
when it is difficult to solve the original problem, seek 'easier' sub-problems
Multi-Stage: probabilistic SVM; linear blending; stacking; RBF Network; DeepNet pre-training
Alternating Optim.: k-Means; alternating LeastSqr; (steepest descent)
Divide & Conquer: decision tree
useful for complicated models
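A minimal sketch of the alternating-optimization pattern via k-Means (NumPy): fix the centers to optimize the cluster assignments, then fix the assignments to optimize the centers, and repeat.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """k-Means as alternating optimization of assignments and cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # step 1: fix centers, optimize assignments (each point goes to its nearest center)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        # step 2: fix assignments, optimize centers (mean of each cluster)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, assign = kmeans(X, k=2)
```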
Fun Time
When running the DeepNet algorithm introduced in Lecture 13 on a PCA-preprocessed data set, which optimization technique is used?
1 variants of gradient descent
2 locating equivalent solutions
3 multi-stage optimization
4 all of the above
Reference Answer: 4
minibatch GD for training; equivalent eigenproblem solution for PCA; multi-stage for pre-training
Overfitting Elimination Techniques
Overfitting Elimination via Regularization
when model too 'powerful': add brakes somewhere
large-margin: SVM; AdaBoost (indirectly)
denoising: autoencoder
pruning: decision tree
L2: SVR; kernel models; NNet [weight-decay]
weight-elimination: NNet
early stopping: NNet (any GD-like)
voting/averaging: uniform blending; Bagging; Random Forest
constraining: autoencoder [weights]; RBF [# centers]
arguably most important techniques
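A minimal sketch of two of these brakes in one gradient-descent loop (NumPy; the function and toy data are illustrative): L2 weight decay shrinks the weights at every step, and early stopping halts when the validation error stops improving.

```python
import numpy as np

def train_with_brakes(X_tr, y_tr, X_val, y_val, eta=0.05, lam=0.1, max_iters=1000, patience=10):
    """Linear regression by GD with L2 weight decay + early stopping on validation error."""
    w = np.zeros(X_tr.shape[1])
    best_w, best_err, stall = w.copy(), np.inf, 0
    for _ in range(max_iters):
        grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(X_tr) + 2 * lam * w   # squared error + L2 brake
        w -= eta * grad
        val_err = np.mean((X_val @ w - y_val) ** 2)
        if val_err < best_err:
            best_w, best_err, stall = w.copy(), val_err, 0
        else:
            stall += 1
            if stall >= patience:       # early stopping: no recent improvement
                break
    return best_w

X = np.random.randn(200, 5)
y = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.5]) + 0.3 * np.random.randn(200)
w = train_with_brakes(X[:150], y[:150], X[150:], y[150:])
```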
Overfitting Elimination via Validation
when model too 'powerful': check performance carefully and honestly
# SV: SVM/SVR
OOB: Random Forest
Internal Validation: blending; DecTree pruning
simple but necessary
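A minimal sketch of internal validation for model selection (NumPy; the ridge model and λ grid are illustrative): hold out part of the training data and pick, honestly, the candidate with the smallest validation error.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression: w = (X^T X + lambda I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def select_by_validation(X, y, lambdas, val_ratio=0.25, seed=0):
    """Pick the regularization strength with the smallest validation error."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(val_ratio * len(X))
    val, tr = idx[:n_val], idx[n_val:]
    errs = []
    for lam in lambdas:
        w = ridge_fit(X[tr], y[tr], lam)
        errs.append(np.mean((X[val] @ w - y[val]) ** 2))
    return lambdas[int(np.argmin(errs))]

X, y = np.random.randn(120, 4), np.random.randn(120)
best_lam = select_by_validation(X, y, lambdas=[0.01, 0.1, 1.0, 10.0])
```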
Fun Time
What is the major technique for eliminating overfitting in Random Forest?
1 voting/averaging
2 pruning
3 early stopping
4 weight-elimination
Reference Answer: 1
Random Forest, based on uniform blending, relies on voting/averaging for regularization.
Machine Learning in Practice
NTU KDDCup 2010 World Champion Model
Feature engineering and classifier ensemble for KDD Cup 2010, Yu et al., KDDCup 2010
linear blending of
Logistic Regression + many rawly encoded features
Random Forest + human-designed features
yes, you’ve learned everything! :-)
NTU KDDCup 2011 Track 1 World Champion Model
A linear ensemble of individual and blended models for music rating prediction, Chen et al., KDDCup 2011
NNet, DecTree-like, and then linear blending of
• Matrix Factorization variants, including probabilistic PCA
• Restricted Boltzmann Machines: an 'extended' autoencoder
• k Nearest Neighbors
• Probabilistic Latent Semantic Analysis: an extraction model that has 'soft clusters' as hidden variables
• linear regression, NNet, & GBDT
yes, you can 'easily' understand everything! :-)
NTU KDDCup 2012 Track 2 World Champion Model
A two-stage ensemble of diverse models for advertisement ranking in KDD Cup 2012, Wu et al., KDDCup 2012
NNet, GBDT-like, and then linear blending of
• Linear Regression variants, including linear SVR
• Logistic Regression variants
• Matrix Factorization variants
• . . .
‘key’ is to blend properly without overfitting
NTU KDDCup 2013 Track 1 World Champion Model
Combination of feature engineering and ranking models for paper-author identification in KDD Cup 2013, Li et al., KDDCup 2013
linear blending of
• Random Forest with many many many trees
• GBDT variants
with tons of effort in designing features
'another key' is to construct features with domain knowledge
ICDM 2006 Top 10 Data Mining Algorithms
1 C4.5: another decision tree
2 k-Means
3 SVM
4 Apriori: for frequent itemset mining
5 EM: 'alternating optimization' algorithm for some models
6 PageRank: for link-analysis, similar to matrix factorization
7 AdaBoost
8 k Nearest Neighbor
9 Naive Bayes: a simple linear model with 'weights' decided by data statistics
10 C&RT
personal view of five missing ML competitors: LinReg, LogReg, Random Forest, GBDT, NNet
Machine Learning Jungle
soft-margin, k-means, OOB error, RBF network, probabilistic SVM, GBDT, PCA, random forest, matrix factorization, Gaussian kernel, kernel LogReg, large-margin, prototype, quadratic programming, SVR, dual, uniform blending, deep learning, nearest neighbor, decision stump, AdaBoost, aggregation, sparsity, autoencoder, functional gradient, bagging, decision tree, support vector machine, neural network, kernel
welcome to the jungle!
Fun Time
Which of the following is the official lucky number of this class?
1 9876
2 1234
3 1126
4 6211
Reference Answer: 3
May the luckiness always be with you!
Summary
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models
Lecture 16: Finale
Feature Exploitation Techniques: kernel, aggregation, extraction, low-dimensional
Error Optimization Techniques: gradient, equivalence, stages
Overfitting Elimination Techniques: (lots of) regularization, validation
Machine Learning in Practice: welcome to the jungle
• next: happy learning!