Software-Praktikum SoSe 2005
Lehrstuhl für Maschinelles Lernen und Natürlichsprachliche Systeme
Albrecht Zimmermann, Tayfun Gürel, Kristian Kersting, Prof. Dr. Luc De Raedt
Machine Learning in Games
Crash Course on Machine Learning
Why Machine Learning?
• Past
Computers (mostly) programmed by hand
• Future
Computers (mostly) program themselves, by interaction with their environment
Behavioural Cloning

[Diagram: the user plays the game, the game logs those plays, a user model is learned from the logs and plays in the user's stead]
Backgammon
• More than 10^20 states (boards)
• Best human players see only a small fraction of all boards during their lifetime
• Searching is hard because of dice (branching factor > 100)
TD-Gammon by Tesauro (1995)
Recent Trends
• Recent progress in algorithms and theory
• Growing flood of online data
• Computational power is available
• Growing industry
Three Niches for Machine Learning
• Data mining: using historical data to improve decisions
  – Medical records → medical knowledge
• Software applications we can't program by hand
  – Autonomous driving
  – Speech recognition
• Self-customizing programs
  – Newsreader that learns user interests
Typical Data Mining Task

• Given:
  – 9,714 patient records, each describing a pregnancy and birth
  – Each patient record contains 215 features
• Learn to predict:
  – The class of future patients at risk for an Emergency Cesarean Section
Data Mining Result
One of 18 learned rules:

If   no previous vaginal delivery
  ∧ abnormal 2nd Trimester Ultrasound
  ∧ Malpresentation at admission
Then Probability of Emergency C-Section is 0.6

Accuracy over training data: 26/41 = .63
Accuracy over testing data: 12/20 = .60
Credit Risk Analysis
Learned rules:

If   Other-Delinquent-Accounts > 2
  ∧ Number-Delinquent-Billing-Cycles > 1
Then Profitable-Customer? = no

If   Other-Delinquent-Accounts = 0
  ∧ (Income > $30k OR Years-of-credit > 3)
Then Profitable-Customer? = yes
Other Prediction Problems
• Process optimization
• Customer purchase behavior
• Customer retention
Problems Too Difficult to Program by Hand
• ALVINN [Pomerleau] drives at 70 mph on highways
Software that Customizes to User
Lehrstuhl für Maschinelles Lernen und Natürlichsprachliche Systeme
Albrecht Zimmermann, Tayfun Gürel, Kristian Kersting, Prof. Dr. Luc De Raedt
Machine Learning in Games
Crash Course on Decision Tree Learning
Example decision tree:

Refund = Yes: NO
Refund = No:
  MarSt = Married: NO
  MarSt = Single, Divorced:
    TaxInc < 80K: NO
    TaxInc > 80K: YES
Classification: Definition
• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set to validate it.
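To make the train/test protocol concrete, here is a minimal sketch in Python using scikit-learn (not part of the original slides; the dataset, 70/30 split, and random seed are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Split labelled records into a training set (used to build the model)
# and a held-out test set (used to estimate its accuracy).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)  # induction
print(accuracy_score(y_test, model.predict(X_test)))    # validation on unseen records
```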
Illustrating Classification Task
Induction: Training Set → Learning algorithm → Learn Model → Model
Deduction: Model → Apply Model → Test Set

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Instance-Based Learners
• Neural Networks
• Bayesian Networks
• (Conditional) Random Fields
• Support Vector Machines
• Inductive Logic Programming
• Statistical Relational Learning
• …
Decision Tree for PlayTennis
Outlook = Sunny:
  Humidity = High: No
  Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
  Wind = Strong: No
  Wind = Weak: Yes
Decision Tree for PlayTennis
Outlook = Sunny:
  Humidity = High: No
  Humidity = Normal: Yes

• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification
Decision Tree for PlayTennis
Outlook = Sunny:
  Humidity = High: No
  Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
  Wind = Strong: No
  Wind = Weak: Yes

Classify the new instance:

Outlook  Temperature  Humidity  Wind  PlayTennis
Sunny    Hot          High      Weak  ?

(Sunny → Humidity = High, so the tree predicts No)
Decision Tree for Conjunction
Outlook=Sunny ∧ Wind=Weak

Outlook = Sunny:
  Wind = Strong: No
  Wind = Weak: Yes
Outlook = Overcast: No
Outlook = Rain: No
Decision Tree for Disjunction
Outlook=Sunny ∨ Wind=Weak

Outlook = Sunny: Yes
Outlook = Overcast:
  Wind = Strong: No
  Wind = Weak: Yes
Outlook = Rain:
  Wind = Strong: No
  Wind = Weak: Yes
Decision Tree for XOR
Outlook=Sunny XOR Wind=Weak

Outlook = Sunny:
  Wind = Strong: Yes
  Wind = Weak: No
Outlook = Overcast:
  Wind = Strong: No
  Wind = Weak: Yes
Outlook = Rain:
  Wind = Strong: No
  Wind = Weak: Yes
Decision Tree
Outlook = Sunny:
  Humidity = High: No
  Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
  Wind = Strong: No
  Wind = Weak: Yes

• Decision trees represent disjunctions of conjunctions:

(Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak)
When to consider Decision Trees
• Instances describable by attribute-value pairs
• Target function is discrete valued
• Disjunctive hypothesis may be required
• Possibly noisy training data
• Missing attribute values
• Examples:
  – Medical diagnosis
  – Credit risk analysis
  – RTS games?
Decision Tree Induction
• Many algorithms:
  – Hunt's Algorithm (one of the earliest)
  – CART
  – ID3, C4.5
  – …
Top-Down Induction of Decision Trees ID3
1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant
4. Sort the training examples to the leaf nodes according to the attribute value of the branch
5. If all training examples are perfectly classified (same value of the target attribute), stop; else iterate over the new leaf nodes.
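As an illustration of these five steps, here is a minimal Python sketch of ID3 for discrete attributes (our own illustrative implementation, not the lecture's code). It represents a tree as nested dicts and uses information gain, defined on the following slides, as the "best attribute" criterion:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels (0 * log 0 taken as 0)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def id3(examples, attributes):
    """examples: list of (attribute->value dict, label) pairs.
    Returns a nested dict {attribute: {value: subtree-or-label}}."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:            # step 5: perfectly classified
        return labels[0]
    if not attributes:                   # no tests left: majority class
        return Counter(labels).most_common(1)[0][0]

    def gain(a):                         # step 1: pick the "best" attribute
        rem = 0.0
        for v in set(x[a] for x, _ in examples):
            subset = [l for x, l in examples if x[a] == v]
            rem += len(subset) / len(examples) * entropy(subset)
        return entropy(labels) - rem

    best = max(attributes, key=gain)     # step 2: assign it to the node
    tree = {best: {}}
    for v in set(x[best] for x, _ in examples):              # step 3
        branch = [(x, l) for x, l in examples if x[best] == v]  # step 4
        tree[best][v] = id3(branch, [a for a in attributes if a != best])
    return tree
```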
Which Attribute is "best"?

• Example: 2 attributes, 1 class variable; 64 examples: 29+, 35-

A1=? splits [29+,35-] into True: [21+,5-] / False: [8+,30-]
A2=? splits [29+,35-] into True: [18+,33-] / False: [11+,2-]
Entropy
• S is a sample of training examples
• p+ is the proportion of positive examples
• p- is the proportion of negative examples
• Entropy measures the impurity of S
Entropy(S) = -p+ log2 p+ - p- log2 p-
Entropy
• Entropy(S) = expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code)

• Information theory: an optimal-length code assigns -log2 p bits to messages having probability p.

• So, the expected number of bits to encode the class (+ or -) of a random member of S:

-p+ log2 p+ - p- log2 p-

(by convention, 0 · log2 0 = 0)
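A direct transcription of this formula into Python (a small counts-based helper of our own, mirroring the 0 · log 0 = 0 convention; it reproduces the 0.99 used on the following slides):

```python
import math

def entropy_pm(pos, neg):
    """Entropy of a boolean sample, given counts of + and - examples."""
    total = pos + neg
    return -sum((n / total) * math.log2(n / total)
                for n in (pos, neg) if n > 0)   # 0 * log2(0) := 0

print(round(entropy_pm(29, 35), 2))  # 0.99
```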
Information Gain
• Gain(S,A): expected reduction in entropy due to sorting S on attribute A

Gain(S,A) = Entropy(S) - Σ_{v ∈ values(A)} (|Sv|/|S|) · Entropy(Sv)

A1=? splits [29+,35-] into True: [21+,5-] / False: [8+,30-]
A2=? splits [29+,35-] into True: [18+,33-] / False: [11+,2-]

Entropy([29+,35-]) = -29/64 log2 29/64 - 35/64 log2 35/64 = 0.99
Information Gain
Entropy(S) = Entropy([29+,35-]) = 0.99

A1=? : True [21+,5-] / False [8+,30-]
Entropy([21+,5-]) = 0.71
Entropy([8+,30-]) = 0.74
Gain(S,A1) = Entropy(S) - 26/64 · Entropy([21+,5-]) - 38/64 · Entropy([8+,30-]) = 0.27

A2=? : True [18+,33-] / False [11+,2-]
Entropy([18+,33-]) = 0.94
Entropy([11+,2-]) = 0.62
Gain(S,A2) = Entropy(S) - 51/64 · Entropy([18+,33-]) - 13/64 · Entropy([11+,2-]) = 0.12
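These numbers can be reproduced with a short helper built on entropy_pm from the previous sketch (again an illustrative sketch of the slide's formula, not the lecture's code):

```python
def gain(parent, splits):
    """Information gain: Entropy(S) minus the size-weighted entropy of
    the subsets; parent and each split are (pos, neg) example counts."""
    total = sum(p + n for p, n in splits)
    weighted = sum((p + n) / total * entropy_pm(p, n) for p, n in splits)
    return entropy_pm(*parent) - weighted

print(round(gain((29, 35), [(21, 5), (8, 30)]), 2))   # A1: 0.27
print(round(gain((29, 35), [(18, 33), (11, 2)]), 2))  # A2: 0.12
```

A1 would therefore be preferred over A2 as the split attribute.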
Another Example
• 14 training examples (9+, 5-): days for playing tennis
  – Wind: weak, strong
  – Humidity: high, normal
Another Example
S = [9+,5-], E = 0.940

Humidity = High: [3+,4-], E = 0.985
Humidity = Normal: [6+,1-], E = 0.592
Gain(S, Humidity) = 0.940 - (7/14)·0.985 - (7/14)·0.592 = 0.151

Wind = Weak: [6+,2-], E = 0.811
Wind = Strong: [3+,3-], E = 1.0
Gain(S, Wind) = 0.940 - (8/14)·0.811 - (6/14)·1.0 = 0.048
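Using the gain helper from the previous sketch, the same two values drop out directly:

```python
print(gain((9, 5), [(3, 4), (6, 1)]))  # Gain(S, Humidity) ≈ 0.151
print(gain((9, 5), [(6, 2), (3, 3)]))  # Gain(S, Wind) ≈ 0.048
```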
Yet Another Example: Playing Tennis
Day  Outlook   Temp  Humidity  Wind    PlayTennis
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No
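For completeness, here is the table as input for the id3 sketch from earlier (data layout and variable names are our own choices; branch order in the printed dict may vary):

```python
days = [
    ("Sunny",    "Hot",  "High",   "Weak",   "No"),
    ("Sunny",    "Hot",  "High",   "Strong", "No"),
    ("Overcast", "Hot",  "High",   "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny",    "Mild", "High",   "Weak",   "No"),
    ("Sunny",    "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "Normal", "Weak",   "Yes"),
    ("Sunny",    "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High",   "Strong", "Yes"),
    ("Overcast", "Hot",  "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Strong", "No"),
]
attrs = ["Outlook", "Temp", "Humidity", "Wind"]
examples = [(dict(zip(attrs, row[:4])), row[4]) for row in days]

tree = id3(examples, attrs)
print(tree)
# {'Outlook': {'Overcast': 'Yes',
#              'Sunny': {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
#              'Rain':  {'Wind': {'Strong': 'No', 'Weak': 'Yes'}}}}
```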
PlayTennis - Selecting Next Attribute
S = [9+,5-], E = 0.940

Outlook = Sunny: [2+,3-], E = 0.971
Outlook = Overcast: [4+,0-], E = 0.0
Outlook = Rain: [3+,2-], E = 0.971
Gain(S, Outlook) = 0.940 - (5/14)·0.971 - (4/14)·0.0 - (5/14)·0.971 = 0.247

Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temp) = 0.029
PlayTennis - ID3 Algorithm
Outlook at the root: [D1,D2,…,D14] = [9+,5-]

Outlook = Sunny: Ssunny = [D1,D2,D8,D9,D11] = [2+,3-] → ?
Outlook = Overcast: [D3,D7,D12,D13] = [4+,0-] → Yes
Outlook = Rain: [D4,D5,D6,D10,D14] = [3+,2-] → ?

Gain(Ssunny, Humidity) = 0.970 - (3/5)·0.0 - (2/5)·0.0 = 0.970
Gain(Ssunny, Temp) = 0.970 - (2/5)·0.0 - (2/5)·1.0 - (1/5)·0.0 = 0.570
Gain(Ssunny, Wind) = 0.970 - (2/5)·1.0 - (3/5)·0.918 = 0.019
ID3 Algorithm
Outlook = Sunny:   [D1,D2,D8,D9,D11]
  Humidity = High: No   [D1,D2,D8]
  Humidity = Normal: Yes   [D9,D11]
Outlook = Overcast: Yes   [D3,D7,D12,D13]
Outlook = Rain:   [D4,D5,D6,D10,D14]
  Wind = Strong: No   [D6,D14]
  Wind = Weak: Yes   [D4,D5,D10]
Hypothesis Space Search ID3
[Figure: ID3's search through the hypothesis space: starting from the empty tree, candidate trees are grown by adding one attribute test (A1, A2, A3, A4, …) at a time, keeping the split that looks best]
Hypothesis Space Search ID3
• Hypothesis space is complete!
  – The target function is surely in there…
• Outputs a single hypothesis
• No backtracking on selected attributes (greedy search)
  – Local minima (suboptimal splits)
• Statistically-based search choices
  – Robust to noisy data
• Inductive bias (search bias)
  – Prefer shorter trees over longer ones
  – Place high information gain attributes close to the root
Converting a Tree to Rules
Outlook = Sunny:
  Humidity = High: No
  Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
  Wind = Strong: No
  Wind = Weak: Yes

R1: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No
R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes
R3: If (Outlook=Overcast) Then PlayTennis=Yes
R4: If (Outlook=Rain) ∧ (Wind=Strong) Then PlayTennis=No
R5: If (Outlook=Rain) ∧ (Wind=Weak) Then PlayTennis=Yes
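Given the nested-dict tree produced by the earlier id3 sketch, rule extraction is just a walk over root-to-leaf paths. A minimal sketch (our own illustrative helper, reusing the tree variable from the PlayTennis example):

```python
def tree_to_rules(tree, conditions=()):
    """Yield one (conditions, label) pair per root-to-leaf path."""
    if not isinstance(tree, dict):          # leaf: the rule is complete
        yield conditions, tree
        return
    (attribute, branches), = tree.items()   # each node holds a single test
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + ((attribute, value),))

for conds, label in tree_to_rules(tree):
    body = " ∧ ".join(f"({a}={v})" for a, v in conds)
    print(f"If {body} Then PlayTennis={label}")
```

Run on the learned PlayTennis tree, this prints the five rules R1-R5 above (in tree-traversal order).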
Conclusions
1. Decision tree learning provides a practical method for concept learning.
2. ID3-like algorithms search complete hypothesis space.
3. The inductive bias of decision trees is preference (search) bias.
4. Overfitting the training data (you will see it ;-)) is an important issue in decision tree learning.
5. A large number of extensions of the ID3 algorithm have been proposed for overfitting avoidance, handling missing attributes, handling numerical attributes, etc. (feel free to try them out).