CSC 9010 Spring 2011. Paula Matuszek CS 9010: Knowledge-Based Systems Automated Knowledge Acquisition Paula Matuszek Spring, 2011


CS 9010: Knowledge-Based Systems

Automated Knowledge Acquisition

Paula Matuszek

Spring, 2011


Learning Systems

• Supervised machine learning systems can be viewed as an iterative process:

– produce a result,

– evaluate it against the expected results,

– tweak the system.

– E.g.: Neural Nets

• Unsupervised machine learning is also used for systems which discover patterns without prior expected results.

– E.g.: Self-Organizing Maps

• May be open or black box

– Open: changes are clearly visible in KB and understandable to humans

– Black Box: changes are to a system whose internals are not readily visible or understandable.


Learner Architecture

• Any supervised learning system needs to somehow implement four components:

– Knowledge base: what is being learned. Representation of a problem space or domain.

– Performer: does something with the knowledge base to produce results

– Critic: evaluates results produced against expected results

– Learner: takes output from critic and modifies something in KB or performer.
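
As a rough illustration of how the four components might fit together, here is a minimal sketch built around a tiny perceptron. The function names (`performer`, `critic`, `learner`), the learning rate, and the AND-function training data are all assumptions made for this example; they are not taken from the course material.

```python
# Knowledge base: what is being learned (here, weights and a bias).
kb = {"weights": [0.0, 0.0], "bias": 0.0}

def performer(kb, features):
    """Use the knowledge base to produce a result (a 0/1 prediction)."""
    total = kb["bias"] + sum(w * x for w, x in zip(kb["weights"], features))
    return 1 if total > 0 else 0

def critic(result, expected):
    """Evaluate the result against the expected result (signed error)."""
    return expected - result

def learner(kb, features, error, rate=1.0):
    """Take the critic's output and modify the knowledge base."""
    kb["weights"] = [w + rate * error * x for w, x in zip(kb["weights"], features)]
    kb["bias"] += rate * error

# Hypothetical training examples: the logical AND of two inputs.
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

for _ in range(20):  # iterate: produce a result, evaluate it, tweak the system
    for features, expected in examples:
        error = critic(performer(kb, features), expected)
        if error != 0:
            learner(kb, features, error)

predictions = [performer(kb, features) for features, _ in examples]
```

Keeping the critic separate from the learner means the same loop would still work if the expected results came from a human rather than from a stored training set.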


Representation

• How do you describe your problem?

– I'm guessing an animal: binary decision tree

– I'm playing chess: the board itself, sets of rules for choosing moves

– I'm categorizing documents: vector of word frequencies for this document and for the corpus of documents

– I'm fixing computers: frequency matrix of causes and symptoms

– I'm OCRing digits: probability of this digit; 6x10 matrix of pixels; % light; # straight lines


Performer

• How do you take action?

– Guessing an animal: walk the tree and ask associated questions

– Playing chess: chain through the rules to identify a move; use conflict resolution to choose one; output it.

– Categorizing documents: apply a function to the vector of features (word frequencies) to determine which category to put document in

– Fixing computers: use known symptoms to identify potential causes, check matrix for additional diagnostic symptoms.

– OCRing digits: input the features for a digit, output probability that it's 0-9.


Critic

• How do you judge correct actions?

– Guessing an animal: human feedback

– Playing chess: who won? (Credit assignment problem)

– Categorizing documents: a set of human-categorized test documents.

– Fixing computers: Human input about symptoms and cause observed for a specific case

– OCRing digits: Human-categorized training set.


Learner

• What does the learner do?

– Guessing an animal: elicit a question from the user and add it to the binary tree

– Playing chess: increase the weight for some rules and decrease for others.

– Categorizing documents: modify the weights on the function to improve categorization

– Fixing computers: update frequency matrix with actual symptoms and outcome

– OCRing digits: modify weights on a network of associations.
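
To make one of the running examples concrete, the "fixing computers" case could be sketched roughly as follows. The scoring rule (summing co-occurrence counts), the function names `learn` and `diagnose`, and the three sample cases are assumptions made for illustration only.

```python
from collections import defaultdict

# Representation: counts[cause][symptom] = how often they were seen together.
counts = defaultdict(lambda: defaultdict(int))

def learn(cause, symptoms):
    """Learner: update the frequency matrix with an observed case."""
    for symptom in symptoms:
        counts[cause][symptom] += 1

def diagnose(symptoms):
    """Performer: rank causes by how often they co-occurred with the symptoms."""
    scores = {
        cause: sum(row.get(symptom, 0) for symptom in symptoms)
        for cause, row in counts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Critic: human input about the cause and symptoms observed in specific cases
# (the cases below are made up for illustration).
learn("dead battery", ["no power"])
learn("dead battery", ["no power", "clock reset"])
learn("loose cable", ["no power", "flickering screen"])

ranking = diagnose(["no power", "clock reset"])
```

Here `ranking` lists the most frequently associated cause first; a fuller system would also use the matrix to suggest additional diagnostic symptoms to check.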


Approaches to Learning Systems

• Can also be classified by degree of human involvement required, in the critic or the learner component.

– All human input

– Computer-guided human input

– Human-guided computer learning

– All computerized, no human interaction


Computer-Guided Human Input

• Methods in which the computer interacts with the human to expand the knowledge base

• Knowledge base is built and largely maintained by humans

• Computer assists with checks, suggestions, inferences

– Protégé with FaCT++

– IDEs for rule-based systems

– Critic/create cycles


Teresias

• Mycin is an early rule-based expert system for diagnosing medical problems

• Teresias modifies rules in Emycin by interacting with a human at the end of a diagnostic run:

I conclude XXX. Is this the correct diagnosis?

No

I concluded XXX based on YYY and ZZZ. Is this rule correct, incorrect, or incomplete?

Incomplete

What additional tests should be added to the rule?

• 1. B. Buchanan and E. Shortliffe, Rule-Based Expert Systems. Reading, MA: Addison-Wesley, 1984.


Human-Guided Computer Learning

• KB is built and maintained by computer

• Knowledge comes entirely from interactions with humans

• KB is readily understandable but modifying it may have unpredictable results

• Example: Animals Guessing Game

– Representation is a binary tree

– Performer is a tree walker interacting with a human

– Critic is the human player

– Learning component elicits new questions and modifies the binary tree


A Very Simple Interactive Learning Program

Think of an animal. okay

Is it a mouse? no

What is it? a penguin

What is a question that would distinguish a mouse from a penguin? Does it have fur?

What is the answer for a mouse? yes

Play again? yes

Does it have fur? no

Is it a penguin?

...
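
A dialogue like the one above could be produced by a program along the following lines. This is only a sketch: the `Node` class, the `play` function, and the `scripted` helper are hypothetical names, and replies are assumed to be bare animal names ("penguin" rather than "a penguin"). Passing `input` as `ask` would make it interactive.

```python
class Node:
    """A question node (with yes/no children) or a leaf holding an animal."""
    def __init__(self, text, yes=None, no=None):
        self.text, self.yes, self.no = text, yes, no

    def is_leaf(self):
        return self.yes is None and self.no is None

def play(root, ask):
    """Play one round. `ask(prompt)` returns the player's reply as a string."""
    node = root
    # Performer: walk the tree, asking the question stored at each node.
    while not node.is_leaf():
        node = node.yes if ask(node.text) == "yes" else node.no
    # Critic: the human player says whether the guess was right.
    if ask(f"Is it a {node.text}?") == "yes":
        return
    # Learner: elicit a new animal and question, then grow the tree in place.
    old = node.text
    new = ask("What is it?")
    question = ask(f"What is a question that would distinguish a {old} from a {new}?")
    old_answer = ask(f"What is the answer for a {old}?")
    old_leaf, new_leaf = Node(old), Node(new)
    node.text = question
    node.yes, node.no = (old_leaf, new_leaf) if old_answer == "yes" else (new_leaf, old_leaf)

def scripted(replies):
    """Return an ask() function that answers from a fixed list (for demos)."""
    remaining = iter(replies)
    return lambda prompt: next(remaining)

# Replay the first round of the dialogue, starting from a single-leaf tree.
root = Node("mouse")
play(root, scripted(["no", "penguin", "Does it have fur?", "yes"]))
```

After this round the former leaf has become the question node "Does it have fur?", with the mouse on the yes branch and the penguin on the no branch, so the next game starts by asking that question.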


No Human Interaction

• Critic information is provided entirely from training examples

• System can still be examined and understood by humans

• Maintenance involves retraining

• Some typical examples

– rule induction

– decision tree creation


Rule Induction

• Given

– Features

– Training examples

– Output for training examples

• Generate automatically a set of rules or a decision tree which will allow you to judge new objects

• Basic approach is

– Combinations of features become antecedents or links

– Examples become consequents or nodes


Rule Induction Example

• Starting with 100 cases, 10 outcomes, 15 variables

• Form 100 rules, each with 15 antecedents and one consequent.

• Collapse rules.

• Cancellations: If we have

– C, A => B and not C, A => B, collapse to A => B

• Drop Terms:

– D, E => F and D, G => F, collapse to D => F

• Test rules and undo collapse if performance gets worse

• Additional heuristics for combining rules.
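
The cancellation step might be sketched as below. The rule representation (a dict of antecedent values paired with a consequent) and the function names `cancel` and `collapse_once` are assumptions for this example; the drop-terms heuristic and the test-and-undo step are not implemented here.

```python
from itertools import combinations

def cancel(rule_a, rule_b):
    """Return the collapsed rule if the two rules share a consequent and differ
    only in the value of one feature; otherwise return None."""
    (ante_a, cons_a), (ante_b, cons_b) = rule_a, rule_b
    if cons_a != cons_b or ante_a.keys() != ante_b.keys():
        return None
    differing = [f for f in ante_a if ante_a[f] != ante_b[f]]
    if len(differing) != 1:
        return None
    merged = {f: v for f, v in ante_a.items() if f != differing[0]}
    return (merged, cons_a)

def collapse_once(rules):
    """Apply the first available cancellation; return (new_rules, changed)."""
    for i, j in combinations(range(len(rules)), 2):
        merged = cancel(rules[i], rules[j])
        if merged is not None:
            rest = [r for k, r in enumerate(rules) if k not in (i, j)]
            return rest + [merged], True
    return rules, False

# The cancellation example from the slide: C, A => B and not C, A => B.
rules = [({"C": True, "A": True}, "B"), ({"C": False, "A": True}, "B")]
rules, changed = collapse_once(rules)
```

Calling `collapse_once` repeatedly until `changed` is false would collapse as far as cancellation alone allows; each collapse would then be tested and undone if performance gets worse.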


Rose Diagnosis

Case  Diagnosis  Yellow Leaves  Wilted Leaves  Brown Spots
1     Fungus     N              Y              Y
2     Bugs       N              Y              Y
3     Nutrition  Y              N              N
4     Fungus     N              N              Y
5     Fungus     Y              N              Y
6     Bugs       Y              Y              N

R1: If not yellow leaves and wilted leaves and brown spots then fungus.

…

R6: If wilted leaves and yellow leaves and not brown spots then bugs.


Rose Diagnosis

• Cases 1 and 4 have opposite values for wilted leaves, so create new rule:

– R7: If not yellow leaves and brown spots then fungus.

• KB is rules. Learner is the system collapsing and testing rules. Critic is the test cases. Performer is rule-based inference.

• Problems:

– Over-generalization

– Irrelevance

– Need data on all features for all training cases

– Computationally painful.

• Useful if you have enough good training cases.

• Output can be understood and modified by humans
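
The rose example can be walked through in code as a rough sketch. The first-match inference strategy, the rule ordering, and all function names are assumptions made here. Note that cases 1 and 2 in the table have identical symptoms but different diagnoses, so no rule set of this kind can get all six cases right.

```python
FEATURES = ("yellow leaves", "wilted leaves", "brown spots")

# Training cases from the Rose Diagnosis table: (diagnosis, feature values).
CASES = [
    ("fungus",    (False, True,  True)),
    ("bugs",      (False, True,  True)),
    ("nutrition", (True,  False, False)),
    ("fungus",    (False, False, True)),
    ("fungus",    (True,  False, True)),
    ("bugs",      (True,  True,  False)),
]

def make_rule(diagnosis, values):
    """One fully specified rule per case (R1 to R6)."""
    return (dict(zip(FEATURES, values)), diagnosis)

def classify(rules, values):
    """Performer: consequent of the first rule whose antecedents all hold."""
    observed = dict(zip(FEATURES, values))
    for antecedents, diagnosis in rules:
        if all(observed[f] == v for f, v in antecedents.items()):
            return diagnosis
    return None

def accuracy(rules):
    """Critic: fraction of the cases the rule set diagnoses correctly."""
    hits = sum(classify(rules, values) == diagnosis for diagnosis, values in CASES)
    return hits / len(CASES)

initial = [make_rule(d, v) for d, v in CASES]

# Learner: cases 1 and 4 differ only in wilted leaves, so collapse them into R7.
r7 = ({"yellow leaves": False, "brown spots": True}, "fungus")
collapsed = [r7] + [r for k, r in enumerate(initial) if k not in (0, 3)]

# Keep the collapse only if performance on the cases did not get worse.
rules = collapsed if accuracy(collapsed) >= accuracy(initial) else initial
```

In this sketch the collapsed rule set scores the same as the original six rules on the table's cases, so the collapse is kept.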


Decision Tree Induction

• Very common data mining technique.

• Given:

– Examples

– Attributes

– Goal (classification, typically)

• Pick “important” attribute: one which divides set cleanly.

• Recur with subsets not yet classified.


ID3

• A greedy algorithm for decision tree construction, published by Ross Quinlan in 1986

• Top-down construction of the decision tree by recursively selecting the “best attribute” to use at the current node in the tree

– Once the attribute is selected for the current node, generate children nodes, one for each possible value of the selected attribute

– Partition the examples using the possible values of this attribute, and assign these subsets of the examples to the appropriate child node

– Repeat for each child node until all examples associated with a node are either all positive or all negative
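
The recursion described above can be sketched as follows, using information gain (the entropy-based measure ID3 uses) to pick the "best attribute". The tree representation, the function names, and the toy data are assumptions for illustration. This sketch creates children only for attribute values that actually occur in the examples, and it handles any number of class labels rather than just positive and negative.

```python
from collections import Counter
from math import log2

def entropy(examples):
    """Shannon entropy of the class labels in a list of (attributes, label) pairs."""
    counts = Counter(label for _, label in examples)
    return -sum(n / len(examples) * log2(n / len(examples)) for n in counts.values())

def partition(examples, attribute):
    """Group examples by their value for the given attribute."""
    groups = {}
    for attrs, label in examples:
        groups.setdefault(attrs[attribute], []).append((attrs, label))
    return groups

def information_gain(examples, attribute):
    """Reduction in entropy obtained by splitting on the attribute."""
    groups = partition(examples, attribute).values()
    remainder = sum(len(g) / len(examples) * entropy(g) for g in groups)
    return entropy(examples) - remainder

def id3(examples, attributes):
    """Return a label (leaf) or an (attribute, {value: subtree}) pair."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1 or not attributes:
        # Pure node, or nothing left to split on: use the majority label.
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a))
    remaining = [a for a in attributes if a != best]
    return (best, {value: id3(subset, remaining)
                   for value, subset in partition(examples, best).items()})

# Hypothetical toy data, not the Russell and Norvig training set.
data = [
    ({"hungry": "yes", "raining": "no"},  "go"),
    ({"hungry": "yes", "raining": "yes"}, "go"),
    ({"hungry": "no",  "raining": "no"},  "stay"),
    ({"hungry": "no",  "raining": "yes"}, "stay"),
]
tree = id3(data, ["raining", "hungry"])
```

On this toy data, splitting on `hungry` separates the labels completely while `raining` gives no gain, so the greedy choice puts `hungry` at the root.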


A training set

From Russell and Norvig, Artificial Intelligence: A Modern Approach. third edition, Prentice Hall, 2010, p 700


Making the Partitions

From Russell and Norvig, Artificial Intelligence: A Modern Approach. third edition, Prentice Hall, 2010, p 701


ID3-induced decision tree

From Russell and Norvig, Artificial Intelligence: A Modern Approach. third edition, Prentice Hall, 2010, p 702


Evaluating Classifying Systems

• Standard methodology:

– 1. Collect a large set of examples (all with correct classifications)

– 2. Randomly divide collection into two disjoint sets: training and test

– 3. Apply learning algorithm to training set

– 4. Measure performance with respect to test set

• Important: keep the training and test sets disjoint!

• To study the efficiency and robustness of an algorithm, repeat steps 2-4 for different training sets and sizes of training sets

• If you improve your algorithm, start again with step 1 to avoid evolving the algorithm to work well on just this collection
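
Steps 2 to 4 of the methodology could look roughly like this. The majority-class learner is only a stand-in for a real learning algorithm, and the data, seeds, and split sizes are made up for the example.

```python
import random
from collections import Counter

def split(examples, test_fraction, rng):
    """Step 2: randomly divide the examples into disjoint training and test sets."""
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def train_majority(training):
    """Step 3 (placeholder learner): always predict the most common training label."""
    majority = Counter(label for _, label in training).most_common(1)[0][0]
    return lambda features: majority

def accuracy(classifier, test):
    """Step 4: fraction of test examples classified correctly."""
    return sum(classifier(x) == y for x, y in test) / len(test)

# Step 1 (hypothetical data): label is "big" when the number is at least 30.
examples = [(n, "big" if n >= 30 else "small") for n in range(100)]

# Repeat steps 2-4 for different random splits and training-set sizes.
results = []
for seed, test_fraction in [(0, 0.2), (1, 0.2), (2, 0.5)]:
    train, test = split(examples, test_fraction, random.Random(seed))
    results.append((len(train), accuracy(train_majority(train), test)))
```

Because every example lands in exactly one of the two slices of the shuffled list, the training and test sets stay disjoint by construction.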


What next?

• Machine learning is one of those fields where the web is changing everything!

• Three major factors

– One problematic aspect of machine learning research is finding enough data.

• This is NOT an issue on the web!

– Another problematic aspect is getting a critic

• Web offers a lot of opportunities

– A third is identifying good practical uses for machine learning

• Lots of online opportunities here


Finding Enough Data

• The web is an enormous repository of machine-readable data. What are some of the things we can do with it?

– Expand Knowledge Bases.

• Searching for Common Sense, Matuszek et al, 2005. www.cyc.com/doc/white_papers/AAAI051MatuszekC.pdf

• “All You Can Eat” Ontology-Building: Feeding Wikipedia to Cyc, Sarjant et al, 2009. www.cs.waikato.ac.nz/~olena/publications/feedingWikipedia2Cyc.pdf

– Learn taxonomies.

• Acquisition of Categorized Named Entities for Web Search, Pasca, 2004. www.google.com/research/pubs/archive/74.pdf

• Deriving a Large Scale Taxonomy from Wikipedia, Ponzetto et al, 2007. www.cl.uni-heidelberg.de/~ponzetto/pubs/ponzetto07b.pdf

• “Acquisition of Instance Attributes via Labeled and Related Instances”, Alfonseca, et al, 2010. portal.acm.org/citation.cfm?id=1835462

– Expand the semantic web:

• Ontology-driven, unsupervised instance population, McDowell et al, 2008. www.usna.edu/Users/cs/lmcdowel/pubs/jwsOntosyphon.pdf.


Know It All!

• The KnowItAll project,

– Turing Center, University of Washington.

– http://www.cs.washington.edu/research/knowitall/

• Extracting relations from the web

• Provide search of concepts, not pages.

• TextRunner:

– http://www.cs.washington.edu/research/textrunner/indexTRTypes.html

– Extracts from web pages

– Search by subject/predicate/object


Getting Critics

• People spend a lot of time on the web

• The success of sites like Wikipedia is evidence that people are willing to volunteer time and effort

– Cyc.

– The Open Mind.

– Learner

– The ESP game. http://www.espgame.org/gwap/


Example: The Cyc Ontology

• Cyc

– Long term project: begun in 1985.

– Large hand-crafted ontology (>500K terms)

– Natural language processing capabilities

– Extensive reasoning capabilities.

• Online Browser

• Open source version available, in CycL or OWL

• Online learning game


Online Uses for Machine Learning

• Improved search: learn from click-throughs. Google Personalized Search

• Recommendations: learn from people’s opinions and choices. Amazon Recommender and many others

• Online games. AIs add to the background but can’t be too static.

• Better targeting for ads. More learning from click-throughs.

• Customer Response Centers. Clustering, improved retrieval of responses.


Summary

• Valuable both because we want to understand how humans learn and because it improves computer systems

• May learn representation or actions or both

• Variety of methods, some knowledge-based and some statistical

• Currently very active research area

• Web is providing a lot of new opportunities

• Still a long way to go