Applied Machine Learning
Lecture 1: Introduction
Richard Johansson
January 21, 2020
welcome to the course!

- machine learning is increasingly popular among students
  - our courses are full!
  - many thesis projects develop or apply ML models
- ... and in industry and the public sector
  - many companies come to us looking for students
  - joint research projects
- why the fuss, and why now?
success stories: image recognition

success stories: machine translation
[image by Chris Manning]

data
[source]

applications...
[source]
topics covered in the course
- the usual "zoo": a selection of machine learning models
  - what's the idea behind them?
  - how are they implemented? (at least on a high level)
  - what are the use cases?
  - how can we apply them practically?
- but hopefully also the "real-world context":
  - extended "messy" practical assignments requiring that you think about what you're doing
  - invited talks from industry and the healthcare sector
  - annotation of data, evaluation
  - ethical and legal issues, interpretability
overview
practical issues about the course
basic ideas in machine learning
machine learning libraries in Python
example of a learning algorithm: decision tree learning
underfitting and overfitting
course webpage
- the official course webpage is the Canvas page: https://chalmers.instructure.com/courses/8685/
structure of teaching
- lectures Tuesdays and Fridays
  - some theory and introduction to ML software
  - interactive coding
  - solving a few exercises when we have time
  - most lectures will be given by Selpi
  - ... except some guest lectures
- lab sessions Thursdays
  - our TAs help you work on your assignments
  - choose between the 13-15 and the 15-17 session
  - please let me know if it's too crowded
assignments
- seven compulsory assignments:
  - PA 1A: intro to the ML workflow, decision trees
  - PA 1B: random forests
  - PA 2A: text classification
  - PA 2B: linear classifiers
  - PA 3B: skin mark classification
  - WA 1: read a scientific paper in applied machine learning
  - WA 2: written essay on ethics in ML
- please refer to the course PM for details about grading
- we will use the Python programming language
programming assignment 1A
- warmup lab exercise: quick tour of the scikit-learn library
- introduction to decision trees
- for a high grade: implement decision tree regression
- lab session on Thursday
- submission deadline: January 29
noncompulsory work
- exercise sheets
- online quizzes
literature
- the main course book is A Course in Machine Learning by Hal Daumé III: http://ciml.info
- additional papers to read for some topics
- some notes to complement the lectures
- example code will be posted on the course page
exam, mid-March
- this is a take-home exam: a written assignment
- your solution must be submitted online
- we will find a date in the exam period that suits as many as possible
exam, details
- a first part about basic concepts: you need to answer most of these questions correctly to pass
- a second part that requires more insight: answer these questions for a higher grade
student representatives
- if you're interested in being a student representative, please send me an email!
- the workload is light and there will be a small reward...
overview
practical issues about the course
basic ideas in machine learning
machine learning libraries in Python
example of a learning algorithm: decision tree learning
underfitting and overfitting
basic ideas

- given some object, make a prediction
  - is this patient diabetic?
  - is the sentiment of this movie review positive?
  - does this image contain a cat?
  - what will be tomorrow's stock market value of this company?
  - what are the phonemes contained in this speech signal?
- the goal of machine learning is to build the prediction functions by observing data
- contrast: expert-defined vs. data-driven
[source]
why machine learning?
why would we want to "learn" the function from data instead of just implementing it?

- usually because we don't really know how to write down the function by hand
  - speech recognition
  - image classification
  - machine translation
  - ...
- ML might not be necessary for limited tasks where we know how to write the function down
- what is more expensive in your case? knowledge or data?
don’t forget your domain expertise!
ML makes some tasks automatic, but we still need our brains:
- defining the tasks, terminology, evaluation metrics
- annotating (hand-labeling) training and testing data
- designing features
- error analysis
example: is the patient diabetic?

- in order to predict, we make some measurements of properties we believe will be useful: these are called the features
features: different views
- many learning algorithms operate on numerical vectors:
  features = [ 1.5, -2, 3.8, 0, 9.12 ]
- more abstractly, we often represent the features as attributes with values (in Python, typically a dictionary):
  features = { "gender": "male", "age": 37, "blood_pressure": 130, ... }
- sometimes, it's easier just to see the features as a list of e.g. words (bag of words):
  features = [ "here", "are", "some", "words", "in", "a", "document" ]
more terminology: what is the output?
- classification: learning to output a category label
  - spam/non-spam; positive/negative; ...
- regression: learning to guess a number
  - value of a share; number of stars in a review; ...
basic terminology: supervised learning
- in supervised learning, the training set consists of input-output pairs
- our goal is to learn to produce the outputs
types of supervision: alternatives
- unsupervised learning: we are given "unorganized" data
  - our goal is to discover some structure

[figure: two scatter plots of the same unlabeled 2-D points]

- reinforcement learning: our problem is formalized as a game
  - an agent carries out actions and receives rewards
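as a concrete illustration of the unsupervised setting (my own sketch, not from the slides), a clustering algorithm such as k-means discovers group structure in unlabeled points:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# two "blobs" of unlabeled 2-dimensional points
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + [5, 5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])        # discovered cluster assignments
print(kmeans.cluster_centers_)    # one center per discovered group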
example: Fisher’s iris data
[figure: scatter plot of petal_length (3.0-7.0) vs. petal_width (1.0-2.4) for versicolor and virginica examples]
approach 1: linear separator
if 0.85 · petal_length + 2.42 · petal_width ≥ 8.34:
    return virginica
else:
    return versicolor

[figure: the same petal_length/petal_width scatter plot with the linear separator drawn as a straight line]
approach 2: if/then/else tree
[figure: the same petal_length/petal_width scatter plot with the tree's axis-parallel decision boundaries]
basic machine learning workflow
basic ML methodology: evaluation
- select an evaluation procedure (a "metric"), such as
  - classification accuracy: proportion of correct classifications
  - mean squared error, often used in regression
  - or some domain-specific metric
- compare to one or more baselines
  - trivial solution
  - rule-based solution
  - existing solution
- apply your model to a held-out test set and evaluate
  - the test set must be different from the training set
  - also: don't optimize on the test set; use a development set or cross-validation!
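a minimal scikit-learn sketch of this methodology (my own illustration; the dataset and models are just placeholders): hold out a test set, compare against a trivial baseline, and tune via cross-validation on the training data only:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# trivial baseline: always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
clf = DecisionTreeClassifier().fit(X_train, y_train)

# tune and compare using cross-validation on the training data, not the test set
print(cross_val_score(clf, X_train, y_train, cv=5).mean())

# evaluate once on the held-out test set
print(accuracy_score(y_test, baseline.predict(X_test)))
print(accuracy_score(y_test, clf.predict(X_test)))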
managing your data for evaluation
[source]
overview
practical issues about the course
basic ideas in machine learning
machine learning libraries in Python
example of a learning algorithm: decision tree learning
underfitting and overfitting
use cases for machine learning
- standard use cases: standard solutions are available
- special cases: we may need to tailor our own solutions
the Python machine learning ecosystem (selection)
machine learning software: a small sample
- general-purpose software, large collections of algorithms:
  - scikit-learn: http://scikit-learn.org
    - Python library; will be used in this course
  - Weka: http://www.cs.waikato.ac.nz/ml/weka
    - Java library with a nice user interface
- special-purpose software, small collections of algorithms:
  - Keras, PyTorch, TensorFlow, CNTK for neural networks
  - LibSVM/LibLinear for support vector machines
  - XGBoost, LightGBM for tree ensembles
  - ...
- large-scale learning in distributed architectures:
  - Spark MLlib
  - H2O
scikit-learn toy example
see also https://scikit-learn.org/stable/getting_started.html
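the code shown on this slide is not preserved in this transcript; a minimal example in the same spirit (fit on training data, predict on new examples) might look like this:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)           # learn from input-output pairs
print(clf.predict(X_test[:5]))      # predict labels for new examples
print(clf.score(X_test, y_test))    # classification accuracy on the test set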
overview
practical issues about the course
basic ideas in machine learning
machine learning libraries in Python
example of a learning algorithm: decision tree learning
underfitting and overfitting
classifiers as rule systems
- assume that we're building the prediction function by hand
  - how would it look?
- probably, you would start writing rules like this:
  - IF the blood glucose level > 150, THEN
    - IF the age > 50, THEN return True
    - ELSE ...
  - ...
- a human would construct such a rule system by trial and error
- we'll see how it can be learned automatically
decision tree classifiers
- a decision tree is a tree where
  - the internal nodes represent a choice based on a feature
  - the leaves represent the return value of the classifier
- like the example we had previously:
  - IF the blood glucose level > 150, THEN
    - IF the age > 50, THEN return True
    - ELSE ...
  - ...
general idea for learning a tree
- it should make few errors on the training set
- and an Occam's razor intuition: we'd like a small tree
- however, finding a small and accurate tree is a complex computational problem
  - it is NP-hard
- instead, we'll look at an algorithm that works top-down by selecting the "most useful feature"
- some different variants:
  - basic approach: the ID3 algorithm
  - extended approaches: CART, C4.5, ...
  - see e.g. Daumé III's book or http://en.wikipedia.org/wiki/ID3_algorithm
greedy decision tree classifier learning (pseudocode)
def TrainDecisionTreeClassifier(X, Y):
    if all outputs in Y are identical:
        return a leaf with the class of the examples in Y
    if we have reached the maximally allowed depth:
        return a leaf with the majority class of Y
    F ← the "most useful feature" in X
    for each possible value f_i of F:
        X_i, Y_i ← the subset where F = f_i
        tree_i ← TrainDecisionTreeClassifier(X_i, Y_i)
    return a tree node that splits on F, where f_i is connected to the subtree tree_i
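a runnable Python version of this pseudocode (my own sketch, not the lecture's code; the feature-scoring function is left as a parameter, since the next slide discusses how to choose it):

from collections import Counter

def train_tree(X, Y, depth, max_depth, most_useful_feature):
    # X: list of feature dicts; Y: list of class labels
    if len(set(Y)) == 1:
        return Y[0]                                # leaf: all outputs identical
    if depth == max_depth:
        return Counter(Y).most_common(1)[0][0]     # leaf: majority class of Y
    F = most_useful_feature(X, Y)
    subtrees = {}
    for value in set(x[F] for x in X):             # each possible value of F
        Xi = [x for x in X if x[F] == value]
        Yi = [y for x, y in zip(X, Y) if x[F] == value]
        subtrees[value] = train_tree(Xi, Yi, depth + 1, max_depth,
                                     most_useful_feature)
    return (F, subtrees)                           # node that splits on F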
how to select the “most useful feature”?
- there are many rules of thumb to select the most useful feature
  - idea: a feature is good if the subsets T_i are homogeneous
- in Daumé III's book, he uses a simple score to rank the features:
  - for each subset T_i, compute the frequency of its majority class
  - sum the majority class frequencies
- however, the most well-known ranking measure is the information gain
  - this measures the reduction of entropy (statistical uncertainty) we get by considering the feature
- scikit-learn uses the Gini impurity by default
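minimal implementations of two of these scores (my own sketch), computed for the label subsets T_i produced by a candidate split:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, subsets):
    # reduction of entropy when parent_labels is split into the subsets
    n = len(parent_labels)
    after = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent_labels) - after

def majority_sum(subsets):
    # Daumé III's simple score: sum the majority-class counts of the subsets
    return sum(Counter(s).most_common(1)[0][1] for s in subsets)

# a perfectly separating split removes all uncertainty: gain = 1.0
print(information_gain(["a", "a", "b", "b"], [["a", "a"], ["b", "b"]]))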
example: selecting the feature for the top node
decision trees with numerical features
- when our features are numerical, we set a threshold and build subtrees for the upper and lower subsets
- so we need to find the threshold that gives us the nicest split
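a sketch of the threshold search (my own illustration): sort the examples by the feature value, try the midpoint between each pair of consecutive distinct values, and keep the best-scoring split, scored e.g. by the information gain defined earlier:

def best_threshold(values, labels, score_split):
    pairs = sorted(zip(values, labels))
    all_labels = [l for _, l in pairs]
    best_t, best_score = None, float("-inf")
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue                     # no split point between equal values
        t = (v1 + v2) / 2                # candidate threshold: the midpoint
        lower = [l for v, l in pairs if v <= t]
        upper = [l for v, l in pairs if v > t]
        score = score_split(all_labels, [lower, upper])
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score

for the iris example, best_threshold(petal_lengths, species, information_gain) would return the petal_length cutoff that best separates versicolor from virginica.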
example: finding the best threshold

[figure: the distribution of petal_length values (roughly 3.0-7.0) with the candidate thresholds between consecutive values]
implementing decision tree classifiers
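the code for this slide is not preserved in the transcript; in scikit-learn, the ready-made implementation is DecisionTreeClassifier, and export_text prints the learned if/then/else structure:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)
print(export_text(clf, feature_names=list(iris.feature_names)))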
overview
practical issues about the course
basic ideas in machine learning
machine learning libraries in Python
example of a learning algorithm: decision tree learning
underfitting and overfitting
what goes on when we “learn”?
- the learning algorithm observes the examples in the training set
- it tries to find common patterns that explain the data: it generalizes so that we can make predictions for new examples
- how this is done depends on what algorithm we are using
principles of induction: how do we select “good” models?
- hypothesis space: the set of all possible outputs of a learning algorithm
  - for decision tree learners: the set of possible trees
  - for linear separators: the set of all lines in the plane / hyperplanes in a vector space
- "learning" = searching the hypothesis space
- how do we know what hypothesis to look for?
a fundamental tradeoff in machine learning
- goodness of fit: the learned classifier should be able to capture the information in the training set
  - e.g. correctly classify the examples in the training data
- regularization: the classifier should be simple
  - use as few features as possible?
  - don't rely too much on any single feature?
  - small tree or neural network?
why would we prefer “simple” hypotheses?
“overfitting” and “underfitting”: the bias–variance tradeoff
[Source: Wikipedia]
example: training/test accuracy as a function of tree depth
[figure: training and test accuracy (y-axis 0.90-0.98) plotted against tree depth (x-axis 0-20); legend: train, test]
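a sketch of how such a curve can be produced (my own illustration; the lecture's dataset is not preserved here, so this uses a scikit-learn toy dataset): vary max_depth and record training and test accuracy.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in range(1, 21):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    print(depth,
          clf.score(X_train, y_train),   # training accuracy keeps improving
          clf.score(X_test, y_test))     # test accuracy eventually degrades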
up next
- Thursday: lab session for programming assignment 1A
- topic of Friday's lecture: ensembles and random forests
- please prepare for assignment 1A by reading my code and the extra reading on decision trees