Software-Praktikum SoSe 2005
Lehrstuhl für Maschinelles Lernen und Natürlichsprachliche Systeme
Albrecht Zimmermann, Tayfun Gürel, Kristian Kersting, Prof. Dr. Luc De Raedt
Machine Learning in Games
Crash Course on Machine Learning
Why Machine Learning?
• Past
Computers (mostly) programmed by hand
• Future
Computers (mostly) program themselves, by interaction with their environment
Behavioural Cloning

[Diagram: the user plays the game, the game logs those plays, a user model is learned from the logs and plays in the user's stead]
Backgammon
• More than 10^20 states (boards)
• Best human players see only a small fraction of all boards during their lifetime
• Searching is hard because of dice (branching factor > 100)
TD-Gammon by Tesauro (1995)
Recent Trends
• Recent progress in algorithms and theory
• Growing flood of online data
• Computational power is available
• Growing industry
Three Niches for Machine Learning
• Data mining: using historical data to improve decisions
  – Medical records → medical knowledge
• Software applications we can't program by hand
  – Autonomous driving
  – Speech recognition
• Self-customizing programs
  – Newsreader that learns user interests
Typical Data Mining Task

• Given:
  – 9,714 patient records, each describing a pregnancy and birth
  – Each patient record contains 215 features
• Learn to predict:
  – The class of future patients at risk for an Emergency Cesarean Section
Data Mining Result
One of 18 learned rules:

If   no previous vaginal delivery
  ∧ abnormal 2nd Trimester Ultrasound
  ∧ Malpresentation at admission
Then Probability of Emergency C-Section is 0.6

Accuracy over training data: 26/41 = .63
Accuracy over testing data: 12/20 = .60
Credit Risk Analysis
Learned rules:

If   Other-Delinquent-Accounts > 2
  ∧ Number-Delinquent-Billing-Cycles > 1
Then Profitable-Customer? = no

If   Other-Delinquent-Accounts = 0
  ∧ (Income > $30k OR Years-of-credit > 3)
Then Profitable-Customer? = yes
Other Prediction Problems
• Process optimization
• Customer purchase behavior
• Customer retention
Problems Too Difficult to Program by Hand
• ALVINN [Pomerleau] drives at 70 mph on highways
Software that Customizes to User
Lehrstuhl für Maschinelles Lernen und Natürlichsprachliche Systeme
Albrecht Zimmermann, Tayfun Gürel, Kristian Kersting, Prof. Dr. Luc De Raedt
Machine Learning in Games
Crash Course on Decision Tree Learning
Example decision tree:

Refund = Yes: NO
Refund = No:
  MarSt = Married: NO
  MarSt = Single, Divorced:
    TaxInc < 80K: NO
    TaxInc > 80K: YES
Classification: Definition
• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set to validate it.
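To make the train/test protocol concrete, here is a minimal sketch in Python using scikit-learn (not part of the original slides; the dataset, 70/30 split, and random seed are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Split labelled records into a training set (used to build the model)
# and a held-out test set (used to estimate its accuracy).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)  # induction
print(accuracy_score(y_test, model.predict(X_test)))    # validation on unseen records
```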
Illustrating Classification Task
Induction: Training Set → Learning algorithm → Learn Model → Model
Deduction: Model → Apply Model → Test Set

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Instance-Based Learners
• Neural Networks
• Bayesian Networks
• (Conditional) Random Fields
• Support Vector Machines
• Inductive Logic Programming
• Statistical Relational Learning
• …
Decision Tree for PlayTennis
Outlook = Sunny:
  Humidity = High: No
  Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
  Wind = Strong: No
  Wind = Weak: Yes
Decision Tree for PlayTennis
Outlook = Sunny:
  Humidity = High: No
  Humidity = Normal: Yes

• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification
Decision Tree for PlayTennis
Outlook = Sunny:
  Humidity = High: No
  Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
  Wind = Strong: No
  Wind = Weak: Yes

Classify the new instance:

Outlook  Temperature  Humidity  Wind  PlayTennis
Sunny    Hot          High      Weak  ?

(Sunny → Humidity = High, so the tree predicts No)
Decision Tree for Conjunction
Outlook=Sunny ∧ Wind=Weak

Outlook = Sunny:
  Wind = Strong: No
  Wind = Weak: Yes
Outlook = Overcast: No
Outlook = Rain: No
Decision Tree for Disjunction
Outlook=Sunny ∨ Wind=Weak

Outlook = Sunny: Yes
Outlook = Overcast:
  Wind = Strong: No
  Wind = Weak: Yes
Outlook = Rain:
  Wind = Strong: No
  Wind = Weak: Yes
Decision Tree for XOR
Outlook=Sunny XOR Wind=Weak

Outlook = Sunny:
  Wind = Strong: Yes
  Wind = Weak: No
Outlook = Overcast:
  Wind = Strong: No
  Wind = Weak: Yes
Outlook = Rain:
  Wind = Strong: No
  Wind = Weak: Yes
Decision Tree
Outlook = Sunny:
  Humidity = High: No
  Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
  Wind = Strong: No
  Wind = Weak: Yes

• Decision trees represent disjunctions of conjunctions:

(Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak)
When to consider Decision Trees
• Instances describable by attribute-value pairs
• Target function is discrete valued
• Disjunctive hypothesis may be required
• Possibly noisy training data
• Missing attribute values
• Examples:
  – Medical diagnosis
  – Credit risk analysis
  – RTS games?
Decision Tree Induction
• Many algorithms:
  – Hunt's Algorithm (one of the earliest)
  – CART
  – ID3, C4.5
  – …
Top-Down Induction of Decision Trees ID3
1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for the node
3. For each value of A, create a new descendant
4. Sort the training examples to the leaf nodes according to the attribute value of the branch
5. If all training examples are perfectly classified (same value of the target attribute), stop; else iterate over the new leaf nodes.
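As an illustration of these five steps, here is a minimal Python sketch of ID3 for discrete attributes (our own illustrative implementation, not the lecture's code). It represents a tree as nested dicts and uses information gain, defined on the following slides, as the "best attribute" criterion:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels (0 * log 0 taken as 0)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def id3(examples, attributes):
    """examples: list of (attribute->value dict, label) pairs.
    Returns a nested dict {attribute: {value: subtree-or-label}}."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:            # step 5: perfectly classified
        return labels[0]
    if not attributes:                   # no tests left: majority class
        return Counter(labels).most_common(1)[0][0]

    def gain(a):                         # step 1: pick the "best" attribute
        rem = 0.0
        for v in set(x[a] for x, _ in examples):
            subset = [l for x, l in examples if x[a] == v]
            rem += len(subset) / len(examples) * entropy(subset)
        return entropy(labels) - rem

    best = max(attributes, key=gain)     # step 2: assign it to the node
    tree = {best: {}}
    for v in set(x[best] for x, _ in examples):              # step 3
        branch = [(x, l) for x, l in examples if x[best] == v]  # step 4
        tree[best][v] = id3(branch, [a for a in attributes if a != best])
    return tree
```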
Which Attribute is "best"?

• Example: 2 attributes, 1 class variable; 64 examples: 29+, 35-

A1=? splits [29+,35-] into True: [21+,5-] / False: [8+,30-]
A2=? splits [29+,35-] into True: [18+,33-] / False: [11+,2-]
Entropy
• S is a sample of training examples
• p+ is the proportion of positive examples
• p- is the proportion of negative examples
• Entropy measures the impurity of S
Entropy(S) = -p+ log2 p+ - p- log2 p-
Entropy
• Entropy(S) = expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code)

• Information theory: an optimal-length code assigns -log2 p bits to messages having probability p.

• So, the expected number of bits to encode the class (+ or -) of a random member of S:

-p+ log2 p+ - p- log2 p-

(by convention, 0 · log2 0 = 0)
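A direct transcription of this formula into Python (a small counts-based helper of our own, mirroring the 0 · log 0 = 0 convention; it reproduces the 0.99 used on the following slides):

```python
import math

def entropy_pm(pos, neg):
    """Entropy of a boolean sample, given counts of + and - examples."""
    total = pos + neg
    return -sum((n / total) * math.log2(n / total)
                for n in (pos, neg) if n > 0)   # 0 * log2(0) := 0

print(round(entropy_pm(29, 35), 2))  # 0.99
```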
Information Gain
• Gain(S,A): expected reduction in entropy due to sorting S on attribute A

Gain(S,A) = Entropy(S) - Σ_{v ∈ values(A)} (|Sv|/|S|) · Entropy(Sv)

A1=? splits [29+,35-] into True: [21+,5-] / False: [8+,30-]
A2=? splits [29+,35-] into True: [18+,33-] / False: [11+,2-]

Entropy([29+,35-]) = -29/64 log2 29/64 - 35/64 log2 35/64 = 0.99
Information Gain
Entropy(S) = Entropy([29+,35-]) = 0.99

A1=? : True [21+,5-] / False [8+,30-]
Entropy([21+,5-]) = 0.71
Entropy([8+,30-]) = 0.74
Gain(S,A1) = Entropy(S) - 26/64 · Entropy([21+,5-]) - 38/64 · Entropy([8+,30-]) = 0.27

A2=? : True [18+,33-] / False [11+,2-]
Entropy([18+,33-]) = 0.94
Entropy([11+,2-]) = 0.62
Gain(S,A2) = Entropy(S) - 51/64 · Entropy([18+,33-]) - 13/64 · Entropy([11+,2-]) = 0.12
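These numbers can be reproduced with a short helper built on entropy_pm from the previous sketch (again an illustrative sketch of the slide's formula, not the lecture's code):

```python
def gain(parent, splits):
    """Information gain: Entropy(S) minus the size-weighted entropy of
    the subsets; parent and each split are (pos, neg) example counts."""
    total = sum(p + n for p, n in splits)
    weighted = sum((p + n) / total * entropy_pm(p, n) for p, n in splits)
    return entropy_pm(*parent) - weighted

print(round(gain((29, 35), [(21, 5), (8, 30)]), 2))   # A1: 0.27
print(round(gain((29, 35), [(18, 33), (11, 2)]), 2))  # A2: 0.12
```

A1 would therefore be preferred over A2 as the split attribute.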
Another Example
• 14 training examples (9+, 5-): days for playing tennis
  – Wind: weak, strong
  – Humidity: high, normal
Another Example
S = [9+,5-], E = 0.940

Humidity = High: [3+,4-], E = 0.985
Humidity = Normal: [6+,1-], E = 0.592
Gain(S, Humidity) = 0.940 - (7/14)·0.985 - (7/14)·0.592 = 0.151

Wind = Weak: [6+,2-], E = 0.811
Wind = Strong: [3+,3-], E = 1.0
Gain(S, Wind) = 0.940 - (8/14)·0.811 - (6/14)·1.0 = 0.048
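Using the gain helper from the previous sketch, the same two values drop out directly:

```python
print(gain((9, 5), [(3, 4), (6, 1)]))  # Gain(S, Humidity) ≈ 0.151
print(gain((9, 5), [(6, 2), (3, 3)]))  # Gain(S, Wind) ≈ 0.048
```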
Yet Another Example: Playing Tennis
Day  Outlook   Temp  Humidity  Wind    PlayTennis
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No
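For completeness, here is the table as input for the id3 sketch from earlier (data layout and variable names are our own choices; branch order in the printed dict may vary):

```python
days = [
    ("Sunny",    "Hot",  "High",   "Weak",   "No"),
    ("Sunny",    "Hot",  "High",   "Strong", "No"),
    ("Overcast", "Hot",  "High",   "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny",    "Mild", "High",   "Weak",   "No"),
    ("Sunny",    "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "Normal", "Weak",   "Yes"),
    ("Sunny",    "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High",   "Strong", "Yes"),
    ("Overcast", "Hot",  "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Strong", "No"),
]
attrs = ["Outlook", "Temp", "Humidity", "Wind"]
examples = [(dict(zip(attrs, row[:4])), row[4]) for row in days]

tree = id3(examples, attrs)
print(tree)
# {'Outlook': {'Overcast': 'Yes',
#              'Sunny': {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
#              'Rain':  {'Wind': {'Strong': 'No', 'Weak': 'Yes'}}}}
```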
PlayTennis - Selecting Next Attribute
S = [9+,5-], E = 0.940

Outlook = Sunny: [2+,3-], E = 0.971
Outlook = Overcast: [4+,0-], E = 0.0
Outlook = Rain: [3+,2-], E = 0.971
Gain(S, Outlook) = 0.940 - (5/14)·0.971 - (4/14)·0.0 - (5/14)·0.971 = 0.247

Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temp) = 0.029
PlayTennis - ID3 Algorithm
Outlook at the root: [D1,D2,…,D14] = [9+,5-]

Outlook = Sunny: Ssunny = [D1,D2,D8,D9,D11] = [2+,3-] → ?
Outlook = Overcast: [D3,D7,D12,D13] = [4+,0-] → Yes
Outlook = Rain: [D4,D5,D6,D10,D14] = [3+,2-] → ?

Gain(Ssunny, Humidity) = 0.970 - (3/5)·0.0 - (2/5)·0.0 = 0.970
Gain(Ssunny, Temp) = 0.970 - (2/5)·0.0 - (2/5)·1.0 - (1/5)·0.0 = 0.570
Gain(Ssunny, Wind) = 0.970 - (2/5)·1.0 - (3/5)·0.918 = 0.019
ID3 Algorithm
Outlook = Sunny:   [D1,D2,D8,D9,D11]
  Humidity = High: No   [D1,D2,D8]
  Humidity = Normal: Yes   [D9,D11]
Outlook = Overcast: Yes   [D3,D7,D12,D13]
Outlook = Rain:   [D4,D5,D6,D10,D14]
  Wind = Strong: No   [D6,D14]
  Wind = Weak: Yes   [D4,D5,D10]
Hypothesis Space Search ID3
[Figure: ID3's search through the hypothesis space: starting from the empty tree, candidate trees are grown by adding one attribute test (A1, A2, A3, A4, …) at a time, keeping the split that looks best]
Hypothesis Space Search ID3
• Hypothesis space is complete!
  – The target function is surely in there…
• Outputs a single hypothesis
• No backtracking on selected attributes (greedy search)
  – Local minima (suboptimal splits)
• Statistically-based search choices
  – Robust to noisy data
• Inductive bias (search bias)
  – Prefer shorter trees over longer ones
  – Place high information gain attributes close to the root
Converting a Tree to Rules
Outlook = Sunny:
  Humidity = High: No
  Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
  Wind = Strong: No
  Wind = Weak: Yes

R1: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No
R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes
R3: If (Outlook=Overcast) Then PlayTennis=Yes
R4: If (Outlook=Rain) ∧ (Wind=Strong) Then PlayTennis=No
R5: If (Outlook=Rain) ∧ (Wind=Weak) Then PlayTennis=Yes
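Given the nested-dict tree produced by the earlier id3 sketch, rule extraction is just a walk over root-to-leaf paths. A minimal sketch (our own illustrative helper, reusing the tree variable from the PlayTennis example):

```python
def tree_to_rules(tree, conditions=()):
    """Yield one (conditions, label) pair per root-to-leaf path."""
    if not isinstance(tree, dict):          # leaf: the rule is complete
        yield conditions, tree
        return
    (attribute, branches), = tree.items()   # each node holds a single test
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + ((attribute, value),))

for conds, label in tree_to_rules(tree):
    body = " ∧ ".join(f"({a}={v})" for a, v in conds)
    print(f"If {body} Then PlayTennis={label}")
```

Run on the learned PlayTennis tree, this prints the five rules R1-R5 above (in tree-traversal order).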
Conclusions
1. Decision tree learning provides a practical method for concept learning.
2. ID3-like algorithms search complete hypothesis space.
3. The inductive bias of decision trees is preference (search) bias.
4. Overfitting the training data (you will see it ;-)) is an important issue in decision tree learning.
5. A large number of extensions of the ID3 algorithm have been proposed for overfitting avoidance, handling missing attributes, handling numerical attributes, etc. (feel free to try them out).