
Page 1: Topic_6

CSCI 548/B480: Introduction to Bioinformatics
Fall 2002

Topic 5: Machine Intelligence - Learning and Evolution

Dr. Jeffrey Huang, Assistant Professor
Department of Computer and Information Science, IUPUI
E-mail: [email protected]

Page 2: Topic_6

Machine Intelligence

• Machine Learning
  – The subfield of AI concerned with intelligent systems that learn
  – The computational study of algorithms that improve performance based on experience

• The attempt to build intelligent entities:
  – We must first understand intelligent entities
  – Computational brain
  – Mathematics: philosophy staked out most of the ideas of AI, but making it a formal science required mathematical formalization in
    • Computation
    • Logic
    • Probability

Page 3: Topic_6

Behavior-Based AI vs. Knowledge-Based AI

• Definitions of Machine Learning
  – Reasoning
    • The effort to make computers think and solve problems
    • The study of mental faculties through the use of computational models
  – Behavior
    • Making machines perform human actions that require intelligence
    • Seeks to explain intelligent behavior in terms of computational processes

• Agents

  [Diagram: an agent perceives its environment through sensors (percepts) and acts on it through effectors (actions)]

Page 4: Topic_6

Operational Agents

• Operational views of intelligence:
  – The ability to perform intellectual tasks
    • Prove theorems, play chess, solve puzzles
    • Focus on what goes on “between the ears”
    • Emphasize the ability to build and effectively use mental models
  – The ability to perform intellectually challenging “real world” tasks
    • Medical diagnosis, tax advising, financial investing
    • Introduce new issues such as critical interactions with the world, model grounding, and uncertainty
  – The ability to survive, adapt, and function in a constantly changing world
    • Autonomous agents
    • Vision, locomotion, and manipulation: many I/O issues
    • Self-assessment, learning, curiosity, etc.

Page 5: Topic_6

Building Intelligent Artifacts

• Symbolic approaches:
  – Construct goal-oriented symbol-manipulation systems
  – Focus on high-end abstract thinking

• Non-symbolic approaches:
  – Build performance-oriented systems
  – Focus on behavior

• Both are needed, in tightly coupled form
  – Building such systems is difficult
  – There is a growing need to automate the process
  – A good approach: evolutionary algorithms

Page 6: Topic_6

• Behavior-Based AI
  – "Situated" in an environment
  – Multiple competencies ('routines')
  – Autonomy
  – Adaptation and competition

• Artificial Life (A-Life)
  – Agents: reactive behavior
  – Abstracting the logical principles of living organisms
  – Collective behavior: competition and cooperation

Page 7: Topic_6

Classification vs. Prediction

• Classification:
  – Predicts categorical class labels
  – Constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data

• Prediction:
  – Models continuous-valued functions, i.e., predicts unknown or missing values

Page 8: Topic_6

Classification: A Two-Step Process

• Model construction: describing a set of predetermined classes
  – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  – The set of tuples used for model construction is the training set
  – The model is represented as classification rules, decision trees, or mathematical formulae

• Model usage: classifying future or unknown objects
  – Estimate the accuracy of the model
    • The known label of each test sample is compared with the model’s classification
    • Accuracy rate is the percentage of test set samples correctly classified by the model
    • The test set must be independent of the training set; otherwise over-fitting will occur

Page 9: Topic_6

Classification Process

Model Construction

Training data:

  NAME  RANK            YEARS  TENURED
  Mike  Assistant Prof  3      no
  Mary  Assistant Prof  7      yes
  Bill  Professor       2      yes
  Jim   Associate Prof  7      yes
  Dave  Assistant Prof  6      no
  Anne  Associate Prof  3      no

The classification algorithm produces a classifier (model), here a rule:

  IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Use the Model in Prediction

Testing data:

  NAME     RANK            YEARS  TENURED
  Tom      Assistant Prof  2      no
  Merlisa  Associate Prof  7      no
  George   Professor       5      yes
  Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 2) -> Tenured?
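A minimal Python sketch (not from the slides) of the model-usage step: apply the rule above to the testing data and compute the accuracy rate.

```python
# Sketch: apply the learned rule (IF rank = 'professor' OR years > 6
# THEN tenured = 'yes') to the testing data and measure accuracy.

test_set = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

def predict(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

correct = sum(predict(r, y) == label for _, r, y, label in test_set)
print(f"accuracy = {correct}/{len(test_set)}")  # 3/4: Merlisa is misclassified
```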

Page 10: Topic_6

Supervised vs. Unsupervised Learning

• Supervised learning (classification)
  – Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  – New data are classified based on the training set

• Unsupervised learning (clustering)
  – The class labels of the training data are unknown
  – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Page 11: Topic_6

Classification and Prediction

• Data preparation
  – Data cleaning
    • Preprocess data in order to reduce noise and handle missing values
  – Relevance analysis (feature selection)
    • Remove irrelevant or redundant attributes
  – Data transformation
    • Generalize and/or normalize data

• Evaluating classification methods
  – Predictive accuracy
  – Speed and scalability
    • Time to construct the model
    • Time to use the model
  – Robustness: handling noise and missing values
  – Scalability: efficiency on disk-resident databases
  – Interpretability: understanding and insight provided by the model
  – Goodness of rules
    • Decision tree size
    • Compactness of classification rules

Page 12: Topic_6

From Learning to Evolution

• Optimization
  – Accomplishing an abstract task = solving a problem
    = searching through a space of potential solutions for the “best” solution
    = an optimization process
  – Classical exhaustive methods fail when the search space is large; a special machine learning technique is needed

• Evolutionary Algorithms
  – Stochastic algorithms
  – Search methods that model natural phenomena:
    • Genetic inheritance
    • Darwinian struggle for survival

Page 13: Topic_6

• “… the metaphor underlying genetic algorithms is that of natural evolution. In evolution, the problem each species faces is one of searching for beneficial adaptations to a complicated and changing environment. The ‘knowledge’ that each species has gained is embodied in the makeup of the chromosomes of its members.”

  - L. Davis and M. Steenstrup, “Genetic Algorithms and Simulated Annealing”, pp. 1-11, Morgan Kaufmann, 1987

Page 14: Topic_6

The Essential Components

– A genetic representation for potential solutions to the problem
– A way to create an initial population of potential solutions
– An evaluation function that plays the role of the environment, rating solutions in terms of their “fitness”
  (i.e., fitness determines survival and reproductive rates)
– Genetic operators that alter the composition of children

Page 15: Topic_6

Evolutionary Algorithm Search Procedure

1. Randomly generate an initial population M(0)
2. Compute and save the fitness u(m) for each individual m in the current population M(t)
3. Define selection probabilities p(m) for each individual m in M(t) so that p(m) is proportional to u(m)
4. Generate M(t+1) by probabilistically selecting individuals to produce offspring via genetic operators (crossover and mutation), then repeat from step 2 (a runnable sketch follows)
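A minimal, runnable Python sketch of this loop for the f(x) = x^2 example developed on the following slides. The 14-bit encoding and the parameter values pop_size = 24, pc = 0.6, pm = 0.01 are taken from those slides; everything else (generation count, helper names) is an illustrative assumption.

```python
import random

BITS, PC, PM = 14, 0.6, 0.01  # chromosome length, crossover prob., mutation prob.

def decode(v):
    # genotype (list of bits) -> phenotype x in [0, 4]
    return int("".join(map(str, v)), 2) * 4 / (2**BITS - 1)

def fitness(v):
    x = decode(v)
    return x * x  # eval(v) = f(x) = x^2

def next_generation(population):
    u = [fitness(m) for m in population]          # fitness u(m) of each m in M(t)
    offspring = []
    while len(offspring) < len(population):
        # selection probability p(m) proportional to u(m)
        a, b = random.choices(population, weights=u, k=2)
        a, b = a[:], b[:]
        if random.random() < PC:                  # one-point crossover
            cut = random.randint(1, BITS - 1)
            a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
        for child in (a, b):
            for i in range(BITS):                 # bitwise mutation
                if random.random() < PM:
                    child[i] ^= 1
            offspring.append(child)
    return offspring[:len(population)]

population = [[random.randint(0, 1) for _ in range(BITS)] for _ in range(24)]
for t in range(60):                               # M(0) -> M(1) -> ... -> M(60)
    population = next_generation(population)
print(max(decode(m) for m in population))         # approaches 4.0
```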

Page 16: Topic_6

Historical Background

• Three paradigms emerged in the 1960s:
  – Genetic Algorithms
    • Introduced by Holland (University of Michigan) and De Jong (GMU)
    • Envisioned for a broad range of “adaptive systems”
  – Evolution Strategies
    • Introduced by Rechenberg
    • Focused on real-valued parameter optimization
  – Evolutionary Programming
    • Introduced by Fogel
    • Applied to AI and machine learning problems

• Today:
  – Wide variety of evolutionary algorithms
  – Applied to many areas of science and engineering

Page 17: Topic_6

Examples of Evolutionary AI

1. Parameter Tuning
   • Pervasiveness of parameterized models
   • Complex behavioral changes due to non-linear interactions
   • Examples:
     – Weights of an artificial neural network
     – Parameters of a heuristic evaluation function
     – Parameters of a rule induction system
     – Parameters of membership functions
   • Goal: evolve a useful set of discrete/continuous parameters over time

Page 18: Topic_6

2. Evolving Structure
   • Effect behavioral change via more complex structures
   • Examples:
     – Selecting/constructing the topology of ANNs
     – Selecting/constructing feature sets
     – Selecting/constructing plans/scenarios
     – Selecting/constructing membership functions
   • Goal: evolve useful structures over time

3. Evolving Programs
   • Goal: acquire new behaviors and adapt existing ones
   • Examples:
     – Acquire/adapt behavioral rule sets
     – Acquire/adapt arm/joint control programs
     – Acquire/adapt task-oriented programming code

Page 19: Topic_6

How Does a Genetic Algorithm Work?

A simple example of function optimization:

  Find max f(x) = x^2, for x in [0, 4]

1. Representation:
   • Genotype (chromosome): internally, points in the search space are represented as (binary) strings over some alphabet
   • Phenotype: the expressed traits of an individual
   • Representing x in [0, 4] at a resolution of 10^4 points needs 14 bits, since 2^13 = 8,192 < 10,000 <= 2^14 = 16,384
   • Simple fixed-length binary encoding:
     – Assign 0.0 to the string 00 0000 0000 0000
     – Assign 0.0 + bin2dec(binary string) × 4/(2^14 − 1) to each subsequent string (00 0000 0000 0001, and so on)
     – Phenotype 4.0 = genotype 11 1111 1111 1111
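A quick check of this genotype-to-phenotype mapping (an illustrative sketch, not from the slides):

```python
# Decode a 14-bit genotype string into its phenotype x in [0, 4].
def decode(bitstring):
    return int(bitstring, 2) * 4 / (2**14 - 1)

print(decode("00000000000000"))  # 0.0
print(decode("00000000000001"))  # 4/(2^14 - 1), about 0.000244
print(decode("11111111111111"))  # 4.0
```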

Page 20: Topic_6

2. Initial population:
   – Create a population (pop_size) of chromosomes, where each chromosome is a binary vector of 14 bits
   – All 14 bits of each chromosome are initialized randomly

3. Evaluation function:
   – The evaluation function eval for a binary vector v is equal to the function f:
       eval(v) = f(x)
     e.g., eval(v1) = f(x1) = fitness1

   The population v1, v2, …, v24 and the genotype-to-phenotype mapping:

     genotype               phenotype
     00 0000 0000 0000      0.0
     00 0000 0000 0001      4/(2^14 − 1)
     …                      …
     11 1111 1111 1111      4.0
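Steps 2 and 3 in a minimal Python sketch (illustrative, reusing the decoding defined above):

```python
import random

POP_SIZE, BITS = 24, 14

def decode(v):
    return int("".join(map(str, v)), 2) * 4 / (2**BITS - 1)

# Step 2: pop_size random 14-bit chromosomes
population = [[random.randint(0, 1) for _ in range(BITS)]
              for _ in range(POP_SIZE)]

# Step 3: eval(v_i) = f(x_i) = x_i^2
fitness = [decode(v) ** 2 for v in population]
```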

Page 21: Topic_6

– Parameters:
  • pop_size = 24
  • Probability of crossover, pc = 0.6
  • Probability of mutation, pm = 0.01

– Recombination, using genetic operators:
  • Crossover (pc), here one-point crossover after the third bit:
      v1 = 01111100010011  =>  v1’ = 01110101011100
      v2 = 00010101011100  =>  v2’ = 00011100010011
  • Mutation (pm), flipping individual bits:
      v2’ = 00011100010011  =>  v2” = 00011110010011
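The two operators in a short Python sketch that reproduces the example above (illustrative; the crossover point after the third bit is inferred from the strings shown):

```python
import random

def crossover(a, b, cut):
    # one-point crossover: swap tails after position `cut`
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(v, pm=0.01):
    # flip each bit independently with probability pm
    return "".join(b if random.random() >= pm else str(1 - int(b)) for b in v)

v1, v2 = "01111100010011", "00010101011100"
v1p, v2p = crossover(v1, v2, 3)
print(v1p, v2p)        # 01110101011100 00011100010011
print(mutate(v2p))     # occasionally flips a bit, e.g. 00011110010011
```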

Page 22: Topic_6

– Selection of M(t+1) from M(t), using the roulette wheel:
  • Total fitness of the population:
      F = sum for i = 1..pop_size of fitness_i
  • Probability of selection prob_i for each chromosome v_i:
      prob_i = fitness_i / F
  • Cumulative probability q_i:
      q_i = sum for j = 1..i of prob_j, for i = 1, …, pop_size
  • Generate random numbers r_j from [0, 1], for j = 1, …, pop_size
  • Select chromosome v_i such that q_(i−1) < r_j <= q_i
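Roulette-wheel selection as defined above, in a short Python sketch (illustrative):

```python
import random

def roulette_select(population, fitness):
    F = sum(fitness)                       # total fitness
    q, cumulative = [], 0.0
    for f in fitness:
        cumulative += f / F                # prob_i = fitness_i / F
        q.append(cumulative)               # q_i = prob_1 + ... + prob_i
    r = random.random()                    # r_j in [0, 1]
    for v, qi in zip(population, q):
        if r <= qi:                        # q_(i-1) < r_j <= q_i
            return v
    return population[-1]                  # guard against round-off
```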

Page 23: Topic_6


Page 24: Topic_6

Homing In on the Optimal Solution

Page 25: Topic_6

Best-so-far Curve

Page 26: Topic_6

Optimal Feature Subset

• Search for subsets of discriminatory features
  – A combinatorial optimization problem
  – Two general approaches to identifying optimal subsets of features:
    • Abstract measurement of important properties of good feature sets
      – Orthogonality (e.g., PCA), information content, low variance
      – Less expensive process
      – Falls into suboptimal performance if the abstract measures do not correlate well with actual performance
    • Building a classifier from the feature subset and evaluating its performance on actual classification tasks
      – Better classification performance
      – The cost of building and testing classifiers prohibits any kind of systematic evaluation of feature subsets
      – Suboptimal in practice: large numbers of candidate features cannot be handled by any form of systematic search, since there are 2^N possible candidate subsets of N features

Page 27: Topic_6

Inductive Learning

• Learning from examples
  – Decision Tree (DT)
  – Information Theory (IT)
  – Question: what are the BEST attributes (features) for building the decision tree?
  – Answer: the ‘BEST’ attribute is the one that is ‘MOST’ informative, i.e., for which ambiguity/uncertainty is least
  – Solution: measure information content using the expected amount of information provided by the attribute

Page 28: Topic_6

Classification by Decision Tree Induction

• Decision tree
  – A flow-chart-like tree structure
  – Each internal node denotes a test on an attribute
  – Each branch represents an outcome of the test
  – Leaf nodes represent class labels or class distributions

• Decision tree generation consists of two phases
  – Tree construction
    • At the start, all the training examples are at the root
    • Partition the examples recursively based on selected attributes
  – Tree pruning
    • Identify and remove branches that reflect noise or outliers

• Use of a decision tree: classifying an unknown sample
  – Test the attribute values of the sample against the decision tree

Exs.  Class  Size    Color   Surface
1     A      Small   Yellow  Smooth
2     A      Medium  Red     Smooth
3     A      Medium  Red     Smooth
4     A      Big     Red     Rough
5     B      Medium  Yellow  Smooth
6     B      Medium  Yellow  Smooth

[Decision tree: color? red -> A; yellow -> size? small -> A, medium -> B]

Page 29: Topic_6

• Entropy
  – Define an entropy function H such that
      H = − sum_i p_i log2 p_i
    where p_i is the probability associated with the ith class
  – For a feature, the entropy is calculated for each value
  – The sum of these entropies, weighted by the probability of each value, is the entropy for that feature
  – Example: toss a fair coin
      H = −(1/2) log2(1/2) − (1/2) log2(1/2) = 1 bit
    If the coin is not fair, e.g., P(heads) = 99%, then
      H = −(99/100) log2(99/100) − (1/100) log2(1/100) ≈ 0.08 bits
    So by tossing the biased coin you gain very little (extra) information that you didn’t expect
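A quick numeric check of both coin examples (illustrative sketch):

```python
from math import log2

def entropy(probs):
    # H = -sum_i p_i log2 p_i  (terms with p_i = 0 contribute 0)
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit    (fair coin)
print(entropy([0.99, 0.01]))  # ~0.08 bits (biased coin)
```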

Page 30: Topic_6

– In general, if you have p positive examples and n negative examples:
    H(p/(p+n), n/(p+n)) = −(p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))
  • For p = n, H = 1
  • i.e., initially there is the most uncertainty about the eventual outcome (picking an example) and the most to gain by picking an example

Page 31: Topic_6

Decision Tree Induction

• Basic algorithm (a greedy algorithm)
  – The tree is constructed in a top-down, recursive, divide-and-conquer manner
  – At the start, all the training examples are at the root
  – Attributes are categorical (continuous-valued attributes are discretized in advance)
  – Examples are partitioned recursively based on selected attributes
  – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

• Conditions for stopping partitioning
  – All samples for a given node belong to the same class
  – There are no remaining attributes for further partitioning (majority voting is employed to label the leaf)
  – There are no samples left

Page 32: Topic_6

Algorithm

1. Select a random subset W (called the window) from the training set T
2. Build a DT for the current W:
   • Select the best feature, the one which minimizes the entropy H (or maximizes the gain)
   • Categorize the training instances (examples) into subsets by this feature
   • Repeat this process recursively until each subset contains instances of one kind (class) or some statistical criterion is satisfied
3. Scan the entire training set for exceptions to the DT
4. If exceptions are found, insert some of them into W and repeat from step 2 (see the sketch below)
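A high-level Python sketch of this windowing loop (illustrative; build_tree and classify are hypothetical stand-ins for the tree-construction and classification steps described on the surrounding slides):

```python
import random

def train_with_window(training_set, window_size, build_tree, classify):
    # training_set: list of (features, label) pairs
    window = random.sample(training_set, window_size)      # step 1
    while True:
        tree = build_tree(window)                          # step 2
        exceptions = [(x, y) for x, y in training_set      # step 3
                      if classify(tree, x) != y]
        if not exceptions:
            return tree
        k = min(len(exceptions), window_size)              # step 4
        window.extend(random.sample(exceptions, k))
```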

Page 33: Topic_6

• Information Gain
  – The information gain from an attribute test is defined as the difference between the original information requirement and the new requirement:
      gain(A) = H(p/(p+n), n/(p+n)) − Remainder(A)
    where
      Remainder(A) = sum for i = 1..v of ((p_i + n_i)/(p + n)) · H(p_i/(p_i + n_i), n_i/(p_i + n_i))
    and attribute A can have v distinct values
  – Note that Remainder(A) is an entropy function weighted by the attribute values
  – Maximizing gain(A) means minimizing Remainder(A); the maximizing A is then the most informative attribute (‘question’)
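The same definitions in a short Python sketch for a binary (+/−) class problem (illustrative):

```python
from math import log2

def H(p, n):
    # entropy of a (p, n) split; pure splits have H = 0
    total = p + n
    return -sum(c / total * log2(c / total) for c in (p, n) if c > 0)

def gain(examples, attribute):
    # examples: list of (attribute_dict, cls) pairs with cls in {'+', '-'}
    p = sum(1 for _, c in examples if c == '+')
    n = len(examples) - p
    remainder = 0.0
    for value in {a[attribute] for a, _ in examples}:
        subset = [c for a, c in examples if a[attribute] == value]
        p_i = subset.count('+')
        n_i = len(subset) - p_i
        remainder += (p_i + n_i) / (p + n) * H(p_i, n_i)
    return H(p, n) - remainder
```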

Page 34: Topic_6

The ID3 Algorithm and Quinlan’s C4.5

• C4.5
  – Tutorial: http://yoda.cis.temple.edu:8080/UGAIWWW/lectures/C45/
  – Matlab program: http://www.cs.wisc.edu/~olvi/uwmp/msmt.html

• See5 / C5.0
  – Tutorial: http://borba.ncc.up.pt/niaad/Software/c50/c50manual.html
  – Software for Win2000: http://www.rulequest.com/download.html

Page 35: Topic_6

Example (the training set from the earlier slide; entropies here are computed with the natural logarithm):

Exs.  Class  Size    Color   Surface
1     A      Small   Yellow  Smooth
2     A      Medium  Red     Smooth
3     A      Medium  Red     Smooth
4     A      Big     Red     Rough
5     B      Medium  Yellow  Smooth
6     B      Medium  Yellow  Smooth

Stage 1 (all six examples):
  H_size    = (1/6)(−(1/1)ln(1/1)) + (4/6)(−(2/4)ln(2/4) − (2/4)ln(2/4)) + (1/6)(−(1/1)ln(1/1)) = 0.462
  H_color   = (3/6)(−(1/3)ln(1/3) − (2/3)ln(2/3)) + (3/6)(−(3/3)ln(3/3)) = 0.318
  H_surface = (5/6)(−(3/5)ln(3/5) − (2/5)ln(2/5)) + (1/6)(−(1/1)ln(1/1)) = 0.56
  min_i H_i = H_color = 0.318, so split on color first:

  [Partial tree: color? red -> A; yellow -> ?]

Stage 2 (applies only to examples 1, 5, 6, the yellow branch):
  H_size    = (1/3)(−(1/1)ln(1/1)) + (2/3)(−(2/2)ln(2/2)) = 0
  H_surface = −(1/3)ln(1/3) − (2/3)ln(2/3) = 0.636
  min_i H_i = H_size = 0, so split on size next:

  [Final tree: color? red -> A; yellow -> size? small -> A, medium -> B]
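The stage values above can be checked numerically (illustrative sketch; note these entropies are in nats, i.e., natural log):

```python
from math import log

def weighted_H(partitions):
    # partitions: one list of per-class counts for each attribute value
    total = sum(sum(counts) for counts in partitions)
    h = 0.0
    for counts in partitions:
        n = sum(counts)
        h += (n / total) * -sum(c / n * log(c / n) for c in counts if c)
    return h

print(weighted_H([[1], [2, 2], [1]]))  # size:    0.462
print(weighted_H([[1, 2], [3]]))       # color:   0.318
print(weighted_H([[3, 2], [1]]))       # surface: 0.561
print(weighted_H([[1], [2]]))          # stage 2, size:    0.0
print(weighted_H([[1, 2]]))            # stage 2, surface: 0.636
```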

Page 36: Topic_6

• Noise and Overfitting
  – Question: what about two or more examples with the same description but different classifications?
    Answer: each leaf node reports either the MAJORITY classification or relative frequencies
  – Question: what about irrelevant attributes (noise and overfitting)?
    Answer: tree pruning
    Solution: an information gain close to zero is a good clue to irrelevance. Compare the actual numbers of positive and negative examples in each subset i, p_i and n_i, with the expected numbers p̂_i and n̂_i assuming true irrelevance:
      p̂_i = p · (p_i + n_i)/(p + n)
      n̂_i = n · (p_i + n_i)/(p + n)
    where p and n are the total numbers of positive and negative examples to start with. The total deviation (for statistical significance) is
      D = sum over i of [ (p_i − p̂_i)^2 / p̂_i + (n_i − n̂_i)^2 / n̂_i ]
    Under the null hypothesis, D follows a chi-squared distribution with v − 1 degrees of freedom
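The deviation D in a short Python sketch (illustrative; assumes p, n > 0 and at least one example per attribute value):

```python
def deviation(splits, p, n):
    # splits: (p_i, n_i) per attribute value; p, n: totals of +/- examples
    D = 0.0
    for p_i, n_i in splits:
        p_hat = p * (p_i + n_i) / (p + n)   # expected positives if irrelevant
        n_hat = n * (p_i + n_i) / (p + n)   # expected negatives if irrelevant
        D += (p_i - p_hat) ** 2 / p_hat + (n_i - n_hat) ** 2 / n_hat
    return D
```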

Page 37: Topic_6

Extracting Classification Rules from Trees

• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example:

    IF age = “<=30” AND student = “no”  THEN buys_computer = “no”
    IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
    IF age = “31…40”                    THEN buys_computer = “yes”
    IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
    IF age = “>40” AND credit_rating = “fair”      THEN buys_computer = “no”

Page 38: Topic_6

Decision Tree

• Avoid overfitting in classification
  – The generated tree may overfit the training data
    • Too many branches; some may reflect anomalies due to noise or outliers
    • The result is poor accuracy on unseen samples
  – Two approaches to avoid overfitting
    • Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
      – It is difficult to choose an appropriate threshold
    • Postpruning: remove branches from a “fully grown” tree to get a sequence of progressively pruned trees
      – Use a set of data different from the training data to decide which is the “best pruned tree”

Page 39: Topic_6

• Approaches to determine the final tree size
  – Separate training (2/3) and testing (1/3) sets
  – Use cross-validation, e.g., 10-fold cross-validation
  – Use all the data for training
    • but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node is likely to improve performance on the entire distribution
  – Use the minimum description length (MDL) principle:
    • halt growth of the tree when the encoding is minimized

Page 40: Topic_6

Decision Tree

• Enhancements to basic decision tree induction
  – Allow for continuous-valued attributes
    • Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
  – Handle missing attribute values
    • Assign the most common value of the attribute
    • Assign a probability to each of the possible values
  – Attribute construction
    • Create new attributes based on existing ones that are sparsely represented
    • This reduces fragmentation, repetition, and replication