Machine Learning
Ludovic Samper
Antidot
September 1st, 2015
Antidot
Software vendor since 1999
Paris, Lyon, Aix-en-Provence
45 employees
Founders: Fabrice Lacroix (CEO), Stéphane Loesel (CTO), Jérôme Mainka (Chief Scientist Officer)
Software products and solutions
Antidot Finder Suite (AFS) search engine
Antidot Information Factory (AIF) a pipe & filters framework
SaaS, Hosted License, On-site License
50% of the revenue invested in R&D
Antidot
Machine Learning
Automatic text document classification
Named Entity Extraction
Compound Splitter (for German words)
Clustering algorithm (for news aggregation)
Open Data, Semantic Web
http://www.rechercheisidore.fr/ Social Sciences and Humanities research platform, enriched with open resources
https://github.com/antidot/db2triples/ open source library to export a database to RDF
Antidot is a partner organization in the WDAqua project
Tutorial
Study a classical Machine Learning task: text classification
Show scikit-learn.org, a Python machine learning library
Follow the “Working with text data” tutorial: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
Additional material on http://blog.antidot.net/
Summary of the tutorial
1 Problem definition
  Supervised classification
  Evaluation metrics
2 Extracting features from text files
  Bag of words model
  Term frequency inverse document frequency (tfidf)
3 Algorithms for classification
  Naïve Bayes
  Support Vector Machine (SVM)
  Tuning parameters
    Cross validation
    Grid search
4 Conclusion
  Methodology
Outline

1 Problem definition
  Supervised classification
  Evaluation metrics
2 Extracting features from text files
3 Algorithms for classification
4 Conclusion
20 newsgroups dataset
http://qwone.com/~jason/20Newsgroups/
20 newsgroups
Newsgroup documents collected in the 1990s
The label is the newsgroup the document belongs to
A popular collection
18,846 documents: 11,314 in train, 7,532 in test
wiss-ml.ipynb#The-20-newsgroups-dataset
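A minimal sketch (not part of the original notebook) of loading the dataset with scikit-learn's built-in fetch_20newsgroups helper:

```python
from sklearn.datasets import fetch_20newsgroups

# Download (and cache) the predefined train and test splits
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

print(len(train.data), len(test.data))       # 11314 7532
print(train.target_names[train.target[0]])   # newsgroup label of the first document
```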
Classification
Problem statement
One label per document
Given a set of documents and their labels, automatically determine the label of an unseen document
A supervised classification problem

Training
Set of documents and their labels
Build a model

Inference
Given a new document, use the model to predict its label
Precision and Recall I
Binary classification

                e ∈ C                 e ∉ C
Labeled C       TP (True Positive)    FP (False Positive)
Not labeled C   FN (False Negative)   TN (True Negative)

Precision

Precision = TP / (TP + FP) = P(e ∈ C | e labeled C)

Recall

Recall = TP / (TP + FN) = P(e labeled C | e ∈ C)
Precision and Recall II
F1

F1 = 2 P R / (P + R)

Harmonic mean of Precision and Recall

Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)
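A quick sketch of these metrics with sklearn.metrics on a toy prediction vector (the counts in the comments hold for this toy example only):

```python
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

y_true = [1, 1, 0, 1, 0, 0, 1, 0]  # toy ground truth
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # toy predictions

# TP = 3, FP = 1, FN = 1, TN = 3 for this toy example
print(precision_score(y_true, y_pred))  # 0.75 = TP / (TP + FP)
print(recall_score(y_true, y_pred))     # 0.75 = TP / (TP + FN)
print(f1_score(y_true, y_pred))         # 0.75, harmonic mean of P and R
print(accuracy_score(y_true, y_pred))   # 0.75, 6 correct out of 8
```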
Multiclass I
N_C = number of classes

Macro Average

B_macro = (1/N_C) Σ_{k=1..N_C} B_binary(TP_k, FP_k, TN_k, FN_k)

Average of the measure by class: small classes count as much as large ones.

Micro Average

B_micro = B_binary(Σ_{k=1..N_C} TP_k, Σ_{k=1..N_C} FP_k, Σ_{k=1..N_C} TN_k, Σ_{k=1..N_C} FN_k)

Average of the measure by instance.
Multiclass II
Micro average in single-label multiclass

Each misclassified document counts as exactly one false positive (for its predicted class) and one false negative (for its true class), so

Σ_{k=1..N_C} FN_k = Σ_{k=1..N_C} FP_k

Then,

Precision_micro = Recall_micro = Accuracy = (Σ_{k=1..N_C} TP_k) / Nb_doc

where accuracy here is the fraction of correctly classified documents.
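The same sklearn.metrics functions take an average parameter for the multiclass case; a toy sketch:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2]  # toy multiclass ground truth
y_pred = [0, 0, 1, 2, 1, 1, 2, 0]  # toy predictions

# Macro: average per-class scores, small classes weigh as much as large ones
print(precision_score(y_true, y_pred, average='macro'))  # ~0.61
# Micro: pool all decisions; equals accuracy in single-label multiclass
print(precision_score(y_true, y_pred, average='micro'))  # 0.625
print(recall_score(y_true, y_pred, average='micro'))     # 0.625
print(accuracy_score(y_true, y_pred))                    # 0.625, 5 correct out of 8
```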
Outline

1 Problem definition
2 Extracting features from text files
  Bag of words model
  Term frequency inverse document frequency (tfidf)
3 Algorithms for classification
4 Conclusion
Bag of words
From text to features
Count the number of occurrences of each word in the text
“Bag” because word positions aren’t taken into account

Extensions

Remove stop words
Remove too frequent words (max_df)
Lowercase the text
N-grams (ngram_range): tokenize n-grams instead of single words, useful to partially take word order into account
wiss-ml.ipynb#Bag-of-words
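A minimal CountVectorizer sketch (the toy documents are illustrative, not from the tutorial):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the mat.", "The dog ate the cat."]

# Bag of words: one column per term, values are occurrence counts
vectorizer = CountVectorizer(lowercase=True, ngram_range=(1, 1))
X = vectorizer.fit_transform(docs)  # sparse matrix, shape (2, vocabulary size)

print(vectorizer.vocabulary_)  # term -> column index
print(X.toarray())             # e.g. "the" is counted twice in each document
```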
Term frequency inverse document frequency (tfidf) I
Intuition
Take into account the relative importance of each word with respect to the whole dataset.
If a word occurs in every document, it doesn’t carry any information.
Term frequency inverse document frequency (tfidf) II
Definition

Term frequency × inverse document frequency

tfidf(w, d) = tf(w, d) × idf(w)

tf(w, d) = term frequency of word w in document d

idf(w) = log(N_doc / doc_freq(w))

In scikit-learn:

tfidf(w, d) = tf(w, d) × (idf(w) + 1)

so that terms occurring in all documents (idf = 0) are not entirely ignored.
Term frequency inverse document frequency (tfidf) III
Options

Normalization: ||doc|| = 1. E.g., for the L2 norm, Σ_{w∈d} tfidf(w, d)² = 1
Smoothing: add one to document frequencies, as if an extra document contained every term in the collection exactly once:

idf(w) = log((N_doc + 1) / (doc_freq(w) + 1))

Example

Show the most significant words of a document: wiss-ml.ipynb#Tfidf
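A minimal TfidfVectorizer sketch showing the idf weights (toy documents, not from the tutorial):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog ate the cat",
        "the bird flew away"]

# smooth_idf=True adds the "extra document"; norm='l2' gives unit-length rows
vectorizer = TfidfVectorizer(smooth_idf=True, norm='l2')
X = vectorizer.fit_transform(docs)

# "the" occurs in every document: lowest idf, hence the smallest weights
for term, col in sorted(vectorizer.vocabulary_.items()):
    print(term, vectorizer.idf_[col])
```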
Outline

1 Problem definition
2 Extracting features from text files
3 Algorithms for classification
  Naïve Bayes
  Support Vector Machine (SVM)
  Tuning parameters
    Cross validation
    Grid search
4 Conclusion
Supervised classification problem I
Notations

x = (x_1, ..., x_n) ∈ R^n: a feature vector; n is the dimension of the feature space
{(x_d, y_d)}_{0≤d<D}: the training set, where x_d ∈ R^n is the feature vector of document d
∀d, y_d ∈ {1, ..., N_C}; N_C is the number of classes and y_d the class of document d
ŷ: class prediction. For a new vector x, ŷ is the predicted class of x.
Supervised classification problem II
Goal

Find a function F:

F : R^n → {1, ..., N_C}
    x ↦ ŷ
In 20newsgroups I
Values in 20 newsgroups

n = 130,107 features (number of unique terms)
D = 11,314 training samples
N_C = 20 classes

Goal

Find a function F that, given a new document, predicts its class
Naïve Bayes Algorithm I

Bayes’ theorem

P(A|B) = P(B|A) P(A) / P(B)
Naïve Bayes Algorithm II

Posterior probability of class C

P(C|x) = P(x|C) P(C) / P(x)

P(x) does not depend on C, so

P(C|x) ∝ P(x|C) P(C)

Naïve Bayes independence assumption: each feature i is conditionally independent of every other feature j given the class:

P(C|x) ∝ P(C) × Π_{i=1..n} P(x_i|C)
Naïve Bayes Algorithm III

Classifier from the probability model

ŷ = argmax_{k∈{1,...,N_C}} P(y = k) × Π_{i=1..n} P(x_i | y = k)
Parameter estimation in the Naïve Bayes classifier

Prior of a class

P(y = k) = (nb of samples in class k) / (total nb of samples)

Can also be uniform: P(y = k) = 1/N_C
Multinomial Naïve Bayes I

Naïve Bayes

P(x|y = k) = Π_{i=1..n} P(x_i|y = k)

Multinomial distribution

The event “word is i” follows a multinomial distribution with parameters (p_1, ..., p_n), where p_i = P(word = i):

P(x_1, ..., x_n) = Π_{i=1..n} p_i^{x_i}

(up to the multinomial coefficient, which does not depend on the class), with Σ_i p_i = 1.
One distribution for each class y.
Multinomial Naïve Bayes II

Multinomial Naïve Bayes

One multinomial distribution for each class:

P(i|y = k) = (sum of occurrences of word i in class k) / (total nb of words in class k)
           = Σ_{d∈k} x_{d,i} / Σ_{0≤j<n} Σ_{d∈k} x_{d,j}

With smoothing,

P(i|y = k) = (Σ_{d∈k} x_{d,i} + α) / (Σ_{0≤j<n} Σ_{d∈k} x_{d,j} + αn)
Multinomial Naïve Bayes III

Inference in Multinomial Naïve Bayes

ŷ = argmax_k P(y = k|x)
  = argmax_k P(y = k) Π_{0≤i<n} P(i|y = k)^{x_i}
  = argmax_k ( log P(y = k) + Σ_{0≤i<n} x_i log P(i|y = k) )
Multinomial Naïve Bayes IV

A linear model

In log space,

(log P(y = k|x))_k ∝ W_0 + W^T x

W_0 is the vector of priors:

W_0 = (log P(y = k))_k

W is the matrix of log distributions:

W = (w_ik), i ∈ [1, n], k ∈ [1, N_C]
w_ik = log P(i|y = k)
Multinomial Naïve Bayes V
Example step-by-step
http://www.antidot.net/wiss2015/wiss-ml.html#Naive-Bayes
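A sketch of the full chain on 20 newsgroups, along the lines of the scikit-learn tutorial (the alpha value is just an example, not a tuned setting):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import metrics

train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

# tf-idf features + multinomial Naive Bayes (alpha = smoothing parameter)
clf = Pipeline([('tfidf', TfidfVectorizer()),
                ('nb', MultinomialNB(alpha=0.01))])
clf.fit(train.data, train.target)

predicted = clf.predict(test.data)
print(metrics.accuracy_score(test.target, predicted))
```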
Outline

1 Problem definition
2 Extracting features from text files
3 Algorithms for classification
  Naïve Bayes
  Support Vector Machine (SVM)
  Tuning parameters
    Cross validation
    Grid search
4 Conclusion
A linear classifier (figure slides)
Support Vector Machine, notations

Problem

S, the training set: {(x_i, y_i), x_i ∈ R^n, y_i ∈ {−1, 1}}_{i∈0..D}

Find a linear function ⟨w, x⟩ + b such that:

sign(⟨w, x_i⟩ + b) = y_i
SVM, maximum margin classifier (figure)
Margin

For x+ and x− on the two margin hyperplanes (⟨w, x+⟩ + b = 1 and ⟨w, x−⟩ + b = −1), the distance between the margins along w is:

distance(x+, x−) = ⟨w/||w||, x+ − x−⟩
                 = (1/||w||) (⟨w, x+⟩ − ⟨w, x−⟩)
                 = (1/||w||) ((⟨w, x+⟩ + b) − (⟨w, x−⟩ + b))
                 = (1/||w||) (1 − (−1))
                 = 2/||w||
Solving an optimization problem using the Lagrangian

Primal problem

minimize_{w,b} f(w, b)
under the constraints h_i(w, b) ≥ 0

Lagrange function

L(w, b, α) = f(w, b) − Σ_i α_i h_i(w, b)

Let g(α) = inf_{w,b} L(w, b, α).
∀w, b: g(α) ≤ L(w, b, α). Moreover, for feasible (w, b) and α_i ≥ 0, L(w, b, α) ≤ f(w, b).
Thus, ∀α_i ≥ 0, g(α) ≤ min_{w,b} f(w, b).
With the Karush-Kuhn-Tucker (KKT) optimality conditions,

max_α g(α) = min_{w,b} f(w, b) ⇔ α_i h_i(w, b) = 0
Support Vector Machine, problem

Primal problem

minimize_{w,b} ||w||²/2
under the constraints ∀ 0 < i ≤ D, y_i(⟨w, x_i⟩ + b) ≥ 1

Lagrange function

L(w, b, α) = (1/2)||w||² − Σ_i α_i (y_i(⟨w, x_i⟩ + b) − 1)

Dual problem: maximize_α L(w, b, α) with α_i ≥ 0
Optimality in w, b is a saddle point with α
Support Vector Machine, problem

Derivatives in w, b need to vanish

∂L/∂w (w, b, α) = w − Σ_i α_i y_i x_i = 0
∂L/∂b (w, b, α) = Σ_i α_i y_i = 0

Dual problem

maximize_α −(1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i

under the constraints:
Σ_i α_i y_i = 0
α_i ≥ 0
Support Vectors

Support vectors

w = Σ_i y_i α_i x_i

Karush-Kuhn-Tucker (KKT) optimality condition

Lagrange multiplier times constraint equals zero:

α_i (y_i(⟨w, x_i⟩ + b) − 1) = 0

Thus, either α_i = 0, or α_i > 0 ⇒ y_i(⟨w, x_i⟩ + b) = 1 (x_i lies on the margin: a support vector)
Experiments with separable space
SVMvaryingC.ipynb
What happens if the space is not separable (figure)
Adding slack variables

Problem was

minimize_{w,b} ||w||²/2
with y_i(⟨w, x_i⟩ + b) ≥ 1

With slack

minimize_{w,b} ||w||²/2 + C Σ_i ξ_i
with y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i and ξ_i ≥ 0
Support Vector Machine, without slack

Primal problem

minimize_{w,b} ||w||²/2
with y_i(⟨w, x_i⟩ + b) ≥ 1

Lagrange function

L(w, b, α) = (1/2)||w||² − Σ_i α_i (y_i(⟨w, x_i⟩ + b) − 1)

Dual problem: maximize_α L(w, b, α)
Optimality in w, b is a saddle point with α
Support Vector Machine, with slack

Primal problem

minimize_{w,b} ||w||²/2 + C Σ_i ξ_i
with y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i and ξ_i ≥ 0

Lagrange function

L(w, b, ξ, α, η) = (1/2)||w||² + C Σ_i ξ_i − Σ_i α_i (y_i(⟨x_i, w⟩ + b) + ξ_i − 1) − Σ_i η_i ξ_i

Dual problem: maximize_{α,η} L(w, b, ξ, α, η)
Optimality in w, b, ξ is a saddle point with α, η
Support Vector Machine, problem

Derivatives in w, b, ξ need to vanish

∂L/∂w (w, b, ξ, α, η) = w − Σ_i α_i y_i x_i = 0
∂L/∂b (w, b, ξ, α, η) = Σ_i α_i y_i = 0
∂L/∂ξ_i (w, b, ξ, α, η) = C − α_i − η_i = 0 ⇒ η_i = C − α_i

Dual problem

maximize_α −(1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i

under the constraints Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C
Support Vectors

Support vectors

w = Σ_i y_i α_i x_i

Karush-Kuhn-Tucker (KKT) optimality condition

Lagrange multiplier times constraint equals zero:

α_i (y_i(⟨w, x_i⟩ + b) + ξ_i − 1) = 0
η_i ξ_i = 0 ⇔ (C − α_i) ξ_i = 0

Thus,
α_i = 0 ⇒ y_i(⟨w, x_i⟩ + b) ≥ 1
0 < α_i < C ⇒ y_i(⟨w, x_i⟩ + b) = 1
α_i = C ⇒ y_i(⟨w, x_i⟩ + b) ≤ 1
Support Vector Machine, loss functions

Primal problem

minimize_{w,b} ||w||²/2 + C Σ_i ξ_i
with y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i and ξ_i ≥ 0

With a loss function

minimize_{w,b} ||w||²/2 + C Σ_i max(0, 1 − y_i(⟨w, x_i⟩ + b))

Here, with f(x) = ⟨w, x⟩ + b,

loss(x_i, y_i) = max(0, 1 − y_i(⟨w, x_i⟩ + b)) = max(0, 1 − y_i f(x_i))
Support Vector Machine, Common loss functions
Common loss functions
hinge loss (L1 loss): max(0, 1 − y_i(⟨w, x_i⟩ + b))
squared hinge (L2 loss): max(0, 1 − y_i(⟨w, x_i⟩ + b))²
logistic loss: log(1 + exp(−y_i(⟨w, x_i⟩ + b)))
Experiments with different values for C
SVMvaryingC.ipynb#Varying-C-parameter
Non-linearly separable data (figure)
Non-linearly separable data after the transformation Φ(x) = (x, x²) (figure slides)
Linear case

Primal problem

minimize_{w,b} (1/2)||w||² + C Σ_i ξ_i
subject to y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i and ξ_i ≥ 0

Dual problem

maximize_α −(1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i
subject to Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C

Support vector expansion

f(x) = Σ_i α_i y_i ⟨x_i, x⟩ + b
With a transformation Φ : x ↦ Φ(x)

Primal problem

minimize_{w,b} (1/2)||w||² + C Σ_i ξ_i
subject to y_i(⟨w, Φ(x_i)⟩ + b) ≥ 1 − ξ_i and ξ_i ≥ 0

Dual problem

maximize_α −(1/2) Σ_{i,j} α_i α_j y_i y_j ⟨Φ(x_i), Φ(x_j)⟩ + Σ_i α_i
subject to Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C

Support vector expansion

f(x) = Σ_i α_i y_i ⟨Φ(x_i), Φ(x)⟩ + b
The kernel trick

Kernel function

k(x, x′) = ⟨Φ(x), Φ(x′)⟩

We only need to compute dot products in the new space.

Dual problem

maximize_α −(1/2) Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j) + Σ_i α_i
subject to Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C

Support vector expansion

f(x) = Σ_i α_i y_i k(x_i, x) + b
Kernels
Kernel functions

linear: k(x, x′) = ⟨x, x′⟩
polynomial: k(x, x′) = (γ⟨x, x′⟩ + r)^d
rbf: k(x, x′) = exp(−γ ||x − x′||²)
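These three kernels are available in sklearn.metrics.pairwise; a small numeric sketch (the sample points and gamma are arbitrary):

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.5]])

print(linear_kernel(X, X))                                      # <x, x'>
print(polynomial_kernel(X, X, degree=2, gamma=1.0, coef0=1.0))  # (gamma <x, x'> + r)^d
print(rbf_kernel(X, X, gamma=1.0))                              # exp(-gamma ||x - x'||^2)

# Sanity check: the rbf kernel of a point with itself is always 1
assert np.allclose(np.diag(rbf_kernel(X, X)), 1.0)
```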
The RBF kernel implies an infinite-dimensional space

Here we are in dimension 1, x ∈ R:

k(x, x′) = exp(−(x − x′)²)
         = exp(−x²) exp(−x′²) exp(2xx′)

With the Taylor expansion of exp(2xx′),

k(x, x′) = exp(−x²) exp(−x′²) Σ_{k=0..∞} (2^k x^k x′^k) / k!
         = ⟨(..., (2^{k/2}/√(k!)) exp(−x²) x^k, ...), (..., (2^{k/2}/√(k!)) exp(−x′²) x′^k, ...)⟩

So Φ maps x to an infinite-dimensional feature vector.
Experiments with different kernels
www.antidot.net/wiss2015/SVMvaryingC.html#Non-linear-kernels
SVM in multiclass

one-vs-the-rest
N_C binary classifiers (but each trained on the whole dataset)
At prediction time, choose the class with the maximum decision value

one-vs-one
N_C (N_C − 1)/2 binary classifiers
At prediction time, vote
SVM in scikit-learn
SVC: Support Vector Classification

sklearn.svm.LinearSVC
based on the Liblinear library
strategy: one-vs-the-rest
only linear kernel
loss can be ‘hinge’ or ‘squared_hinge’

sklearn.svm.SVC
based on the libSVM library
multiclass strategy: one-vs-one
kernel can be linear, polynomial, RBF, sigmoid, or precomputed
only hinge loss
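A sketch of the LinearSVC variant on 20 newsgroups (C and loss are shown at their default values, not tuned ones):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn import metrics

train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

# Linear SVM (Liblinear, one-vs-the-rest); C controls the slack penalty
clf = Pipeline([('tfidf', TfidfVectorizer()),
                ('svm', LinearSVC(C=1.0, loss='squared_hinge'))])
clf.fit(train.data, train.target)

predicted = clf.predict(test.data)
print(metrics.accuracy_score(test.target, predicted))
```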
Outline

1 Problem definition
2 Extracting features from text files
3 Algorithms for classification
  Naïve Bayes
  Support Vector Machine (SVM)
  Tuning parameters
    Cross validation
    Grid search
4 Conclusion
Cross validation I
http://scikit-learn.org/stable/modules/cross_validation.html
Overfitting

Tuning parameters on the test set can lead to overfitting: the parameters are the best for this test set but not in the general case.

Train, test and validation datasets

A solution:
tune the parameters on the test set
validate on a separate validation dataset
Drawback: only few data remain in the training dataset
Cross validation II
Cross validation

k-fold cross validation:
Split the training data into k partitions of the same size
Train the model on k − 1 partitions
Evaluate on the k-th partition
Repeat for each of the k partitions and average the scores
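A sketch with the current sklearn.model_selection API (the module layout has changed since these 2015 slides; cv=5 is an arbitrary choice):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

train = fetch_20newsgroups(subset='train')
clf = Pipeline([('tfidf', TfidfVectorizer()), ('nb', MultinomialNB())])

# 5-fold cross validation on the training set only; the test set stays untouched
scores = cross_val_score(clf, train.data, train.target, cv=5)
print(scores.mean(), scores.std())
```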
Cross validation III (figure)
Grid Search
http://scikit-learn.org/stable/modules/grid_search.html
Grid search

Test each value of each parameter
Brute-force algorithm to find the best value for each parameter

In scikit-learn

Automatically runs k trainings (k-fold cross validation) for each combination of parameter values
Keeps the best model

Demo with scikit-learn: http://www.antidot.net/wiss2015/grid_search_20newsgroups.html
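A GridSearchCV sketch over a small illustrative parameter grid (the grid itself is an example, not the one from the demo):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

train = fetch_20newsgroups(subset='train')
clf = Pipeline([('tfidf', TfidfVectorizer()), ('nb', MultinomialNB())])

# Try every combination of these parameter values, with 3-fold CV each
params = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'nb__alpha': [1.0, 0.1, 0.01],
}
search = GridSearchCV(clf, params, cv=3)
search.fit(train.data, train.target)

print(search.best_params_, search.best_score_)
```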
Outline
1 Problem definition
2 Extracting features from text files
3 Algorithms for classification
4 Conclusion
  Methodology
1 Problem definition
  Supervised classification
  Evaluation metrics
2 Extracting features from text files
  Bag of words model
  Term frequency inverse document frequency (tfidf)
3 Algorithms for classification
  Naïve Bayes
  Support Vector Machine (SVM)
  Tuning parameters
    Cross validation
    Grid search
4 Conclusion
  Methodology
Methodology
To solve a problem using Machine Learning, you have to:
1 Understand the data
2 Choose an evaluation measure
3 Be able to test the model
4 Find the main features
5 Try the algorithms, with different parameters
Conclusion
Machine Learning has a lot of applications
With libraries like scikit-learn, there is no need to implement the algorithms yourself
Questions?
References
Machine Learning in Python:
http://scikit-learn.org
Alex Smola’s very good lecture on Machine Learning at CMU:
http://alex.smola.org/teaching/10-701-15/
Kernels : https://www.youtube.com/watch?v=0Nis-oMLbDs
SVM : https://www.youtube.com/watch?v=bsbpqNIKQzU
Bernoulli Naïve Bayes

Features

x_i = 1 iff word i is present in the document, else x_i = 0
The number of occurrences of word i doesn’t matter

Bernoulli

For each feature i,

P(x_i|y = k) = P(i|y = k) x_i + (1 − P(i|y = k))(1 − x_i)

The absence of a feature is explicitly taken into account

Estimation of P(i|y = k)

With Laplace (add-one) smoothing,

P(i|y = k) = (1 + nb of documents in class k that contain word i) / (2 + nb of documents in class k)
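A BernoulliNB sketch on 20 newsgroups with binary features (binary=True is one way to binarize the counts; BernoulliNB also binarizes internally by default):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn import metrics

train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

# binary=True keeps only presence/absence of each word
clf = Pipeline([('vect', CountVectorizer(binary=True)),
                ('nb', BernoulliNB(alpha=1.0))])  # alpha=1.0: Laplace smoothing
clf.fit(train.data, train.target)

predicted = clf.predict(test.data)
print(metrics.accuracy_score(test.target, predicted))
```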