
Machine Learning for Natural Language Processing (CS4062)

Lecturer: Kevin Koidl

Original material by S. Luz, updated by E. Moreau and L. Mamani-Sanchez

School of Computer Science and Statistics, Trinity College Dublin

[email protected]

2017


1 Course Overview

1.1 Introduction

This course will introduce a number of concepts and techniques developed in the field of Machine Learning and review their application to Natural Language Processing (NLP) and, to a lesser extent, to issues in Computational Linguistics. Natural Language Processing is a sub-discipline of Artificial Intelligence which studies algorithms and methods for building systems (or, more commonly, components of larger systems) to deal with linguistic input. Sometimes a distinction is drawn between NLP and Computational Linguistics whereby the latter is seen as the study of linguistic ability viewed as a computational process, whereas the former is viewed more from an engineering (application-directed) perspective¹. Although the boundaries between these fields are not always clear, we will accept the distinction and focus on applications. In order to do so, we will introduce a number of tools developed in the related field of Machine Learning, and illustrate their use in NLP tasks ranging from text classification to document analysis and clustering. These tools and techniques will be illustrated through a series of case studies designed to help you understand the main concepts behind Machine Learning while getting acquainted with modern NLP applications.

Course Overview

• Course goals:

– study Artificial Intelligence concepts and techniques

– with an emphasis on Machine Learning (ML),

– relevant to Natural Language Processing (NLP) and Computational Linguistics

• ML: study/design algorithms which can “learn” from data and make predictions

• NLP: applications which require processing language-related data

¹Gliozzo and Strapparava (2009, p. 1) attribute a distinction along these lines to Martin Kay.


Machine Learning: subjects covered

• Machine learning (ML)

– theoretical background

– methods

– ML in agent architectures

• Supervised learning:

– Bayesian learning: theory and applications

– Case-based and instance-based Reasoning; nearest-neighbour classi-fiers

– Symbolic approaches: decision trees, decision rules

– Predictor functions: SVM, perceptron

• Unsupervised learning:

– clustering algorithms

• Active Learning

How is all this relevant in practice?

Challenges:

• Information overload

• Coping with complex, highly dynamic environments:

– Ubiquity of networked devices (phones, printers, home appliances,...)

– large repositories of poorly structured data

– heterogeneity of methods for access to data

– an increasing variety of ever-changing “standards”

Information Overload

• Internet growth

• Massive amount of data in companies and administrations

• Examples:

– Do you find it hard to “manage” your email?

– Have you ever wondered what lies hidden behind those distant links that your search engine returns?



ML: leveraging information

• Humans are (still) better than machines at making sense of information

• But machines are much faster

• Typical ML setting:

– feed the machine with a small amount of annotated data

– make it predict the answer for any new instance

• Make the machine able to pick out the relevant information from a lot of noise

– by making the task a matter of counting things (statistical ML)

The role of NLP in an ocean of information

• Most digital information available comes as text:

– Internet: Wikipedia, social media, ...

– News, research, laws, patents, ...

– Companies and administration reports

• A machine cannot easily use unstructured information

• Using ML in NLP:

– Transforming text into numerical/logical features

Syllabus

• Course description

• Text Categorisation: introduction

• Machine learning (introduction)

• TC and supervised learning

• Formal definition of a text categorisation (TC) task

• Classifier induction, lifecycle and document representation issues

• Dimensionality Reduction and term filtering

• Dimensionality Reduction by term extraction

• Naive Bayes text classifiers

• Variants and properties of Naive Bayes text classifiers

• Decision trees


Syllabus (ctd)

• Other symbolic approaches to TC

• Evaluation Metrics and Comparisons

• Instance-based methods

• Regression methods, SVMs and classifier ensembles;

• Unsupervised learning: introduction

• Unsupervised learning: clustering algorithms

• Other applications: Word-sense disambiguation

• Active Learning

Practicals:

• Small TC implementation project in Java (weekly assignments)

• ML packages (WEKA)

• Some demos in R (http://cran.r-project.org/)

Course organisation and resources

• Coursework: tutorials and small programming project account for 10% of the overall marks

• Programming project will give you the chance to apply machine learning concepts to a text categorisation task

• Recommended reading:

– Text classification/mining: (Emms and Luz, 2007), (Sebastiani, 2002)and (Manning and Schutze, 1999, ch 16 and parts of 15 and 14)

– Machine Learning (Mitchell, 1997)

• Course web-page:

https://www.cs.tcd.ie/kevin.koidl/cs4062/

– Slides

– course reader at https://www.cs.tcd.ie/kevin.koidl/cs4062/ml4nlp-reader.pdf (updated weekly; don’t forget to clear web browser cache).


2 An application area: text categorisation

Text Mining

• Unstructured data is any kind of data which does not follow a predefined model

– A database is structured data; text is unstructured data

• Data mining: extracting information from structured data

– Example: user visits to a website, stockmarket indicators...

• Text mining is the task of extracting information from text

– Usually from a large amount of text

– Usually looking for high-level information (e.g. topic, who’s involved,opinion...)

A taxonomy of text mining tasks

• Information Extraction: retrieving relevant information w.r.t. an information extraction template

– Example: DARPA’s named-entity recognition task

ORGANIZATION: named corporate, governmental, or other organizational entity

PERSON: named person or family

LOCATION: name of politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains, etc.)

• DEMO OpenCalais: http://www.opencalais.com/opencalais-demo/


OpenCalais demo

Taxonomy of text mining tasks (ctd.)

• Information Retrieval: retrieving relevant documents w.r.t. the occurrence and relevance of the terms of a query

– Example: search engines (global or specialized)

• Text categorisation: labelling texts with a set of thematic categories. Examples:

– spam filtering

– hierarchical web catalogues (Yahoo, etc)

– automatic news categories (politics, sports...)

– authorship attribution

– ...

What is Text Categorisation?

• Roughly: the task of filing a set of texts (also known in computational linguistics as a corpus) into predefined categories

• Also known as “text classification”, or “topic spotting”

[Figure: text classification shown as one of the text mining tasks, alongside information retrieval and information extraction.]


Background: initial goals

• Automatic filtering of information (e.g. by topic)

• Improvements on IR systems

– disambiguation

• An example of manually built content hierarchies

• Problem:

– Manually built catalogues are costly to build and maintain

Related techniques

• From Information Retrieval

– tokenisation: identifying meaningful units

– word indexing (inverted indices etc.), e.g. ...

∗ ... record the position of each token in a file (called an inverted index, or inverted file); a minimal code sketch follows this list:

Each word is analysed. The position of each word ...
1    6    11 14        25  29       38 41   46

vocabulary     positions
each      −→   1, 44
word      −→   6, 46
is        −→   11
...            ...

– corpus statistics (type and token counts, frequency tables etc.)

– feature selection
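As a rough, hedged illustration (the class and method names are invented for this sketch and are not part of the course code), such an index could be built in Java along these lines:

import java.util.*;

// Minimal sketch of an inverted index mapping each (lower-cased) token to the
// list of character offsets at which it occurs in the text.
public class InvertedIndexSketch {
    public static Map<String, List<Integer>> build(String text) {
        Map<String, List<Integer>> index = new HashMap<>();
        int pos = 0;
        for (String token : text.split("\\s+")) {
            String key = token.toLowerCase().replaceAll("[^a-z]", "");
            if (!key.isEmpty()) {
                index.computeIfAbsent(key, k -> new ArrayList<>()).add(pos);
            }
            pos += token.length() + 1; // assumes tokens are separated by single spaces
        }
        return index;
    }

    public static void main(String[] args) {
        String text = "Each word is analysed. The position of each word is recorded.";
        build(text).forEach((w, positions) -> System.out.println(w + " -> " + positions));
    }
}

Real indexing code would also handle punctuation, normalisation and document identifiers; the point here is only the token-to-positions mapping.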

Related techniques (ctd.)

• From machine learning:

– dimensionality reduction

– automated knowledge acquisition (coping with the knowledge acquisition bottleneck)

– classifier induction (different techniques have been applied, yielding various levels of efficacy)


Real-world Applications of TC

• Document categorization:

– Classification of newspaper articles under the appropriate sections (e.g. Science, Politics, etc.)

∗ Reuters news corpora have been used for training and testing TC systems

– Classified ads, product catalogues, hierarchical categorisation of web pages...

• Automatic indexing for boolean information retrieval

– Assigning keywords (and key phrases) of a controlled dictionary to documents

– Automated metadata creation

Other instances of potential applications of automatic document categorisation include web product catalogues, mailing list filtering, etc.

Examples of controlled dictionaries include the ACM classification system, the MeSH thesaurus for medicine, the NASA thesaurus, etc.

Controlled vocabulary: example

• From https://www.nlm.nih.gov/mesh/trees.html

Applications of TC: Web catalogues

• Hierarchical classification of web pages:

– Internet portals, recommendation sites, price comparators...

– Such catalogues are impossible to maintain manually

– Multiple documents for each category


– (and possibly) Multiple categories per document

– Rich sources of extra-linguistic information:

∗ links, hypertext structure

∗ hierarchical structure of the category set

Example: Text filtering

• “(...) the activity of classifying a stream of incoming documents dispatched in an asynchronous way from an information producer to an information consumer”

• Example in (Hayes and Weinstein, 1990):

– Producer: a news agency

– Consumer: a newspaper

• Another example: personalised news web sites

• User interface issues

– routing: no profile is available and texts are ranked by relevance

– batch filtering: boolean (accept/reject) classification; no user profile is built

– Adaptive filtering: user provides (explicit or implicit) feedback and the system builds user profiles

Applications of TC in Computational Linguistics (1)

• Word sense disambiguation

– Task description: given an ambiguous word, e.g. bank as in:

∗ “The banks are selling Euro notes.” vs.

∗ “Dublin grew up at a fording point on the banks of the Liffey.”

find the sense of each particular occurrence

• Word contexts can be seen as documents, and

• Word senses can be seen as categories

• Uses in systems for: context-sensitive spelling, prepositional phrase attachment, machine translation

Example of PP attachment ambiguity:

• “The children ate the cake with a spoon.”

• “The children ate the cake with frosting.”

ML: supervised vs. unsupervised learning

• Learning from what?

• Supervised learning: labeled examples

– e.g. “spam” / “not spam” labels

– usually requires manually annotated training data

• Unsupervised learning: (unlabeled) examples

– e.g. grouping by similarity, ranking


Further distinctions

• Text categorization can be seen (alternatively) as:

– Unsupervised identification of a set of categories for a given corpus (e.g. topics)

– Unsupervised grouping of texts under categories: clustering

– Automatic assignment of texts to (one or more) pre-defined categories(supervised)

• The next lectures will focus on automatic assignment to pre-defined categories

Some history

• 80’s:

– Rule-based approaches

∗ Knowledge engineering and expert systems

∗ CONSTRUE (Hayes and Weinstein, 1990)

if   (wheat & farm) or
     (wheat & commodity) or
     (bushels & export) or
     (wheat & tonnes) or
     (wheat & winter & (¬ soft))
then WHEAT
else ¬ WHEAT

– “Breakeven” result of 0.9 (Reuters test corpus)

Recent history

• 90’s:

– Increasing availability of larger corpora in electronic format

– The knowledge acquisition bottleneck

– Development of Machine learning techniques

– Automated knowledge acquisition

– Supervised learning and annotated data

– Levels of effectiveness comparable to those obtained by hand-coded classifiers

• 2000’s:

– Massive amount of unstructured data available

– Increased computing power

– Distributed processing, cloud computing

• 2010’s:

– New methods: Deep learning

– Concerns about AI becoming too powerful in the future


3 Practical: Creating a rule-based TC System

3.1 Goal(s)

The goal of this activity is to encourage you to explore some real data used in text categorisation (extracted from the REUTERS-21578 corpus (Lewis, 1997)) by assessing the feasibility and effectiveness of rule-based text categorisation. The Reuters-21578 text categorization test collection has been used as the standard corpus for training and evaluation of a number of Text Categorisation (TC) systems, including CONSTRUE (Hayes and Weinstein, 1990). You will analyse a set of hand-classified “training” texts, create a set of production rules (akin to those used in CONSTRUE) to perform (binary) classification, and evaluate the efficacy of your rules on a “test” set.

3.2 Finding the data

Training and test data can be found at

http://www.scss.tcd.ie/~koidlk/cs4062/sw/hctc.tgz

in two sub-directories: training-data and test-data.

In training-data you will find two sets of files containing economy news which have been classified (by human annotators) into a number of different categories. In this tutorial we will be interested in two categories only, grouped according to file name:

acq01.sgm, ..., acq20.sgm: files containing news about acquisitions, mergers, takeovers, reporting mainly events involving the purchase of shares of a company by another,

nonacq01.sgm, ..., nonacq14: other news whose main subject cannot be classified as the above

In test-data you will find files test01, ..., test19, containing news items to be used in the evaluation of your categorisation rules, and results.sgm, containing the original (human-generated) classification of each of these news items.

All files are in SGML format. As you will realise, you do not need to know details of the SGML language in order to be able to make sense of the annotation. You will be mainly interested in the text enclosed by the tags <TOPICS>...</TOPICS> and <BODY>...</BODY>. The former contains the class(es) to which the news belongs, and the latter the news itself. Take the content of acq01.sgm, for instance:

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"

OLDID="5639" NEWID="96">

<DATE>26-FEB-1987 16:32:37.30</DATE>

<TOPICS><D>acq</D></TOPICS>

<PLACES><D>usa</D></PLACES>

<PEOPLE></PEOPLE>

<ORGS></ORGS>

<EXCHANGES></EXCHANGES>

<COMPANIES></COMPANIES>

<TEXT>

<TITLE>INVESTMENT FIRMS CUT CYCLOPS &lt;CYL> STAKE</TITLE>

<DATELINE> WASHINGTON, Feb 26 - </DATELINE>

<BODY>A group of affiliated New York investment firms said they

lowered their stake in Cyclops Corp to 260,500 shares, or 6.4 pct of

the total outstanding common stock, from 370,500 shares, or 9.2 pct.

In a filing with the Securities and Exchange Commission, the group,

led by Mutual Shares Corp, said it sold 110,000 Cyclops common shares

on Feb 17 and 19 for 10.0 mln dlrs. Reuter &#3;</BODY>

</TEXT>

</REUTERS>

This text has been classified by Reuters staff as being about an acquisition, as indicated by <TOPICS><D>acq</D></TOPICS> (it also belongs to category usa), based on the text delimited by the <BODY> and <TITLE> tag pairs.

An informal description of the mark-up scheme used in the Reuters files can be found in tag-descriptions.txt. A list of all categories along with their acronyms can be found in cat-descriptions 120396.txt.

3.3 Exercises

1. Study the hand-annotated files in training-data and create production rules that, given an unannotated text, will correctly classify it as belonging to exactly one of the following classes: acq or ¬acq, for texts that are about company acquisitions and texts that are not about acquisitions, respectively, as described above.

Binary TC production rules are logical rules of the form:

if CONDITION then CATEGORY else ¬CATEGORY                            (3.1)


where ¬CATEGORY stands for the complement of CATEGORY, i.e. all categories except CATEGORY. CONDITION is a logical expression which combines words (true if the word appears in the text, false otherwise) using logical operators. An example of a TC production rule (Hayes and Weinstein, 1990) is shown in (3.2); a small code sketch of how such a rule could be applied to a text follows the example.

if   (wheat & farm) or
     (wheat & commodity) or
     (bushels & export) or
     (wheat & tonnes) or
     (wheat & winter & (¬ soft))
then WHEAT
else ¬ WHEAT                                                         (3.2)
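To make the rule format concrete, here is a hedged Java sketch that applies the condition of (3.2) to the set of words occurring in a text; the class name and the example sentence are invented for illustration:

import java.util.*;

// Sketch: applying a CONSTRUE-style boolean rule to the word set of a text.
public class RuleClassifierSketch {
    // true iff the rule's condition (a disjunction of conjunctions) holds
    static boolean wheatRule(Set<String> words) {
        return (words.contains("wheat") && words.contains("farm"))
            || (words.contains("wheat") && words.contains("commodity"))
            || (words.contains("bushels") && words.contains("export"))
            || (words.contains("wheat") && words.contains("tonnes"))
            || (words.contains("wheat") && words.contains("winter") && !words.contains("soft"));
    }

    public static void main(String[] args) {
        String body = "Exports of wheat rose to 120,000 tonnes this winter, traders said.";
        Set<String> words = new HashSet<>(Arrays.asList(body.toLowerCase().split("\\W+")));
        System.out.println(wheatRule(words) ? "WHEAT" : "not WHEAT"); // prints WHEAT
    }
}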

2. Once your rules have been defined, the next step is to test them on test-data/test01.sgm, ..., test-data/test19.sgm. It is important that the outcome of the categorisation be based solely on the rules (and not on how you think the texts should be categorised). For each text in the test set, describe how the categorisation was performed. That is, which rules were applied and how.¹

3. After you have assigned a category (either acq or ¬acq) to each test file, compare your results with the original Reuters categorisation in test-data/results.sgm. Evaluate your TC rules in terms of accuracy: the number of correct answers divided by the number of cases.

4. In your opinion, what are the main difficulties in building a rule-based TCsystem?

5. How feasible would it be to convert your specification into a working system? What would it involve?

6. Comment on the scalability of the approach used: how difficult would it be to port your rules to a different domain, or to extend them to cover other categories?

¹Perhaps you could swap your rules with a colleague and ask them to perform a categorisation of the test set using your rules, in order to make sure that your rules don’t contain hidden assumptions that rely on your own background knowledge.


4 Machine Learning: overview and examples

Outline

• Does TC (and NLP) need Machine Learning?

• What can Machine Learning do for us (“what has machine learning ever done for us”?)

• What is machine learning?

• How do we design machine learning systems?

• What is a well-defined learning problem?

• An example

Why Machine Learning

• Progress in algorithms and theory

• Growing flood of online data

• Computational power is available

• Rich application potential

• The knowledge acquisition bottleneck

User interface agents?

• An (artificial) agent may help users cope with increasing information:

An agent is a computer system that is situated in some environment and that is capable of autonomous action in its environment in order to meet its design objectives.

(Wooldridge, 2002)

• In contrast, Human-Computer Interaction researchers have argued against the use of agents: see (Shneiderman and Maes, 1997) for a debate.


Do Agents need machine learning?

• Practical concerns:

– large amounts of language data have become available (on the web and elsewhere), and one needs to be able to make sense of them all,

– knowledge engineering methods don’t seem to be able to cope with the growing flood of data

– Machine learning can be used to automate knowledge acquisition and inference

• Theoretical contributions:

– reasonably solid foundations (theory and algorithms)

Application niches for machine learning

• ML for text classification for use in, for instance, self-customizing programs:

– Newsreader that learns user interests

• Data mining: using historical data to improve decisions

– medical records → medical knowledge

– analysis of customer behaviour

• Software applications we can’t program by hand

– autonomous driving

– speech recognition

Examples: data mining problem

[Figure: a single patient record (Patient103) observed at times t = 1, 2, ..., n. At each time point the record lists features such as Age, FirstPregnancy, Anemia, Diabetes, PreviousPrematureBirth, Ultrasound, Elective C-Section and Emergency C-Section; some values (e.g. Emergency C-Section) are still unknown at the earlier time points.]

Given:

• 9714 patient records, each describing a pregnancy and birth

• Each patient record contains 215 features

Learn to predict:

• Classes of future patients at high risk for Emergency Cesarean Section


Examples: data mining results

[Figure: the same patient-record data as above, from which the following rule was learned.]

If No previous vaginal delivery, and

Abnormal 2nd Trimester Ultrasound, and

Malpresentation at admission

Then Probability of Emergency C-Section is 0.6

Over training data: 26/41 = .63,

Over test data: 12/20 = .60

Other prediction problems

• Customer purchase behavior:

[Figure: a customer record (Customer103) observed at times t0, ..., tn, with features such as Sex, Age, Income, Own House, Computer and MS Products, and the target attribute Purchase Excel?, which is unknown until the final time point.]

• Process optimization:

[Figure: a product record (Product72) observed at times t0, ..., tn, with features such as Stage (mix, cook, cool), Mixing-speed, Temperature, Fan-speed, Viscosity, Fat content, Density and Spectral peak, and the target attribute Product underweight?.]

Problems Too Difficult to Program by Hand

• ALVINN (Pomerleau, 1994): drives 70 mph


Software that adapts to its user

• Recommendation services,

• Bayes spam filtering

• etc

Perspectives

• Common applications

– First-generation algorithms: neural nets, decision trees, regression ...

– Applied to well-formatted databases

• Advanced applications; areas of active research:

– Learn across full mixed-media data

– Learn across multiple internal databases, plus the web and newsfeeds

– Learn by active experimentation

– Learn decisions rather than predictions

– Cumulative, lifelong learning

– Deep learning


Learning agents

[Figure: a learning agent situated in its environment. Sensors and effectors connect the agent to the environment; internally, the critic compares sensed feedback against a performance standard, the learning element uses this feedback to make changes to the knowledge used by the performance element, and the problem generator suggests learning goals.]

The architecture in some detail

• Performance element: responsible for selecting appropriate actions

• Learning element: responsible for making improvements

• Critic: evaluates action selection against a performance standard

• Problem generator: suggests actions that might lead to new and instructiveexperiences

Defining “learning”

• ML has been studied from various perspectives (AI, control theory, statistics, information theory, ...)

• From an AI perspective, the general definition is formulated in terms of agents and tasks. E.g.:

[An agent] is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with E. (Mitchell, 1997, p. 2)

• Statistics, model-fitting, ...

Designing a machine learning system

• Main design decisions:

– Training experience: How will the system access and use data?

– Target function: What exactly should be learnt?

– Hypothesis representation: How will we represent the concepts to be learnt?

– Inductive inference: What specific algorithm should be used to learn the target concepts?


Accessing and using data

• How will the system be exposed to its training experience? Some distinctions:

– Direct or indirect access:

∗ indirect access: record of past experiences, corpora

∗ direct access: situated agents → reinforcement learning

– Source of feedback (“teacher”):

∗ supervised learning

∗ unsupervised learning

∗ mixed: semi-supervised (“transductive”), active learning

Determining the target function

• The target function specifies the concept to be learnt.

• In supervised learning, the target function is assumed to be specified through annotation of training data or some form of feedback:

– a corpus of words annotated for word senses, e.g. f : W × S → {0, 1}

– a database of medical data

– user feedback in spam filtering

– assessment of outcomes of actions by a situated agent

Representing hypotheses and data

• The goal of the learning algorithm is to “induce” an approximation f̂ of a target function f

• The data used in the induction process needs to be represented uniformly

– E.g. representation of text as a “bag of words”, Boolean vectors, etc. (a small code sketch follows this list)

• The choice of representation often constrains the space of available hypotheses, hence the possible f̂’s. E.g.:

– the approximation to be learnt could, for instance, map conjunctions of Boolean literals to categories

– or it could assume that co-occurrence of words does not matter for categorisation

– etc
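As a rough illustration of the “bag of words” idea (the vocabulary and document below are invented, and the code is only a sketch, not part of the course materials):

import java.util.*;

// Sketch: representing a text as a Boolean term vector over a fixed vocabulary.
public class BagOfWordsSketch {
    static boolean[] toBooleanVector(String text, List<String> vocabulary) {
        Set<String> tokens = new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
        boolean[] vector = new boolean[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            vector[i] = tokens.contains(vocabulary.get(i)); // true iff the term occurs
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> vocab = Arrays.asList("shares", "stake", "wheat", "tonnes");
        String doc = "The group said it sold 110,000 Cyclops common shares.";
        System.out.println(Arrays.toString(toBooleanVector(doc, vocab))); // [true, false, false, false]
    }
}

Such a representation deliberately discards word order and co-occurrence structure, which is exactly the kind of representational constraint mentioned above.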


Deduction and Induction

• Deduction: from general premises to a conclusion. E.g.:

– A→ B,A ` B

• Induction: from instances to generalisations

• Machine learning algorithms produce models that generalise from instances presented to the algorithm

• But all (useful) learners have some form of inductive bias:

– In terms of representation, as mentioned above,

– But also in terms of their preferences in generalisation procedures. E.g.:

∗ prefer simpler hypotheses, or

∗ prefer shorter hypotheses, or

∗ incorporate domain (expert) knowledge, etc etc

Given a function f̂ : X → C trained on a set of instances Dc describing a concept c, we say that the inductive bias of f̂ is a minimal set of assertions B such that for any set of instances X:

∀x ∈ X (B ∧ Dc ∧ x ⊢ f̂(x))

Choosing an algorithm

• Induction task as search for a hypothesis (or model) that fits the data and the sample of the target function available to the learner, in a large space of hypotheses

• The choice of learning algorithm is conditioned by the choice of representation

• Since the target function is not completely accessible to the learner, the algorithm needs to operate under the inductive learning assumption that:

an approximation that performs well over a sufficiently large set of instances will perform well on unseen data.

• Note: Computational Learning Theory

Computational learning theory deals in a precise manner with the concepts highlighted above, namely, what it means for an approximation (learnt function) to perform well, and what counts as a sufficiently large set of instances. An influential framework is the probably approximately correct (PAC) learning framework, proposed by Valiant (1984). For an accessible introduction to several aspects of machine learning, see (Domingos, 2012). For some interesting implications see the “no-free-lunch” theorems and the Extended Bayesian Framework (Wolpert, 1996).


An Example: learning to play (Mitchell, 1997)

• Learning to play draughts (checkers):

• Task? (target function, data representation) Training experience? Performance measure?

A target function

• A target function for a draughts (checkers) player:

– f : Board → R

– if b is a final board state that is won, then f(b) = 100

– if b is a final board state that is lost, then f(b) = −100

– if b is a final board state that is drawn, then f(b) = 0

– if b is not a final state in the game, then f(b) = f(b′), where b′ is the best final board state that can be achieved starting from b and playing optimally until the end of the game.

• How feasible would it be to implement it?

• Not very feasible...

Representation

• collection of rules? neural network? polynomial function of board features? ...

• Approximation as a linear combination of features:

f̂(b) = w0 + w1 · bp(b) + w2 · rp(b) + w3 · bk(b) + w4 · rk(b) + w5 · bt(b) + w6 · rt(b)

• where:

– bp(b): number of black pieces on board b

– rp(b): number of red pieces on b

– bk(b): number of black kings on b

– rk(b): number of red kings on b

– bt(b): number of red pieces threatened by black (i.e., which can be taken on black’s next turn)

– rt(b): number of black pieces threatened by red


Training Experience

• Distinctions:

– f(b): the true target function

– f̂(b): the learnt function

– ftrain(b): the training value

– A training set containing instances and their corresponding training values

• Problem: How do we estimate training values?

• A simple rule for estimating training values:

– ftrain(b) ← f̂(Successor(b))

How do we learn the weights?

Algorithm 4.1: Least Mean Square

LMS(c: learning rate)
  for each training instance ⟨b, ftrain(b)⟩
  do
    compute error(b) for the current approximation (using current weights):
      error(b) = ftrain(b) − f̂(b)
    for each board feature ti ∈ {bp(b), rp(b), . . .}
    do
      update weight wi:
        wi ← wi + c × ti × error(b)
    done
  done

LMS minimises the squared error between the training data and the current approximation: E ≡ Σ⟨b,ftrain(b)⟩∈D (ftrain(b) − f̂(b))². Notice that if error(b) = 0 (i.e. target and approximation match) no weights change. Similarly, if ti = 0 (i.e. feature ti doesn’t occur) the corresponding weight does not get updated. This weight update rule can be shown to perform a gradient descent search for the minimal squared error (i.e. weight updates are proportional to −∇E, where ∇E = [∂E/∂w0, ∂E/∂w1, . . .]).

That the LMS weight update rule implements gradient descent can be seen by differentiating E with respect to each weight wi:

∂E/∂wi = ∂/∂wi Σ [ftrain(b) − f̂(b)]²
       = Σ ∂/∂wi [ftrain(b) − f̂(b)]²
       = Σ 2 × [ftrain(b) − f̂(b)] × ∂/∂wi [ftrain(b) − f̂(b)]
       = Σ 2 × [ftrain(b) − f̂(b)] × ∂/∂wi [ftrain(b) − Σj wj tj(b)]
       = −Σ 2 × error(b) × ti(b)

where the sums range over the training instances ⟨b, ftrain(b)⟩ ∈ D.
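For concreteness, a hedged Java rendering of Algorithm 4.1 is given below; the class name, the single invented training example and the learning rate are illustrative only:

// Sketch of the LMS weight-update rule for the linear board-evaluation function
// f_hat(b) = w0 + w1*t1(b) + ... + w6*t6(b).
public class LmsSketch {
    double[] w = new double[7];              // w0..w6, initialised to 0

    double predict(double[] t) {             // t[0] is fixed to 1 so that w[0] acts as w0
        double sum = 0.0;
        for (int i = 0; i < w.length; i++) sum += w[i] * t[i];
        return sum;
    }

    void lmsUpdate(double[] t, double fTrain, double c) {
        double error = fTrain - predict(t);  // error(b) = f_train(b) - f_hat(b)
        for (int i = 0; i < w.length; i++) {
            w[i] += c * t[i] * error;        // w_i <- w_i + c * t_i * error(b)
        }
    }

    public static void main(String[] args) {
        LmsSketch learner = new LmsSketch();
        // one invented training example: [bias, bp, rp, bk, rk, bt, rt] and a training value
        double[] features = {1, 12, 11, 1, 0, 2, 1};
        for (int step = 0; step < 100; step++) learner.lmsUpdate(features, 25.0, 0.001);
        System.out.println("prediction after training: " + learner.predict(features));
    }
}

Repeated updates on this single example drive error(b) towards zero, illustrating the gradient descent behaviour derived above.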


Learning agent architecture

[Figure: the learning agent architecture instantiated for the draughts player. The critic turns feedback into training instances ⟨b, ftrain(b)⟩ via ftrain(b) ← f̂(Successor(b)); the learning element induces a hypothesis f̂; the problem generator poses a new problem (e.g. an initial board); and the performance element uses f̂ to produce a solution (b1, ..., bn).]

Design choices: summary

[Figure: summary of the design choices (after Mitchell, 1997). Determine the type of training experience (games against self, games against experts, a table of correct moves, ...), the target function (Board → value, Board → move, ...), the representation of the learned function (linear function of six features, artificial neural network, polynomial, ...), and the learning algorithm (gradient descent, linear programming, ...), leading to the completed design.]


Other forms of target functions


• instance → label: Learning to automatically identify the categories of a dataset through external feedback

vs.

• set → powerset: grouping of instances (so that the categories will be intrinsically defined by characteristics of the groups)

– hierarchical clustering, partitional clustering

[Figure: a hierarchical clustering dendrogram over instances labelled fin... and riv..., plotted against dissimilarity (roughly 3.6 to 4.4).]

Mapping and structure

• Some target functions (especially in NLP) fit more naturally into a transducer pattern, and naturally have a signature

f: sequence over vocab Σ ⇒ sequence over (Σ× labels C)

• eg. POS-tagging

last week IBM bought Lotus ⇒ last/JJ week/NN IBM/NNP bought/VBD Lotus/NNP

Targeting Sequences and Trees

• other functions do not fit this pattern either, but instead have a signature

f: sequence over vocab Σ ⇒ tree over (Σ∪ labels C)

• e.g. parsing: last week IBM bought Lotus ⇒

(S (NP last week) (NP IBM) (VP (VBD bought) (NP Lotus)))

Issues in machine learning

• What algorithms can approximate functions well (and when)?

• How does number of training examples influence accuracy?

• How does complexity of hypothesis representation impact it?


• How does noisy data influence accuracy?

• What are the theoretical limits of learnability?

• How can prior knowledge of learner help?

• What clues can we get from biological learning systems?

• How can systems alter their own representations?

Some application examples we will see in some detail

• Applications of Supervised learning in NLP:

– Text categorisation

– POS tagging (briefly)

– Word-sense disambiguation (briefly)

• Unsupervised learning:

– Keyword selection, feature set reduction

– Word-sense disambiguation (revisited)


5 A Formal Definition of the TC Learning Problem

Learning and text categorisation

• Recall Mitchell’s definition of learning:

[An agent] is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with E.

(Mitchell, 1997)

• How do we recast TC as a learning problem?

– How do we describe/represent the task, T?

– What is the performance criterion, P?

– What is the (training) experience, E?

– What algorithms can be used?

A formal definition for the TC task

• Notation (based on (Sebastiani, 2002)):

– D: a set (domain) of documents, D = {d1, ..., d|D|}

– C: a set of categories, C = {c1, ..., c|C|}

• Text categorisation is the task of assigning a Boolean value to each pair ⟨di, cj⟩, s.t.:

⟨di, cj⟩ = T if di is filed under cj
⟨di, cj⟩ = F if di is NOT filed under cj                             (5.1)

We’ll see later that the range of f doesn’t always have to be binary. See ranking classifiers below.
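As a rough illustration of the signature in (5.1) (nothing here is prescribed by the course), the classifier function can be mirrored by a Java functional interface; the keyword-based instance is purely illustrative:

// Sketch: the classifier function f : D x C -> {T, F} as a Java functional interface.
public class ClassifierSignatureSketch {
    @FunctionalInterface
    interface BinaryTextClassifier {
        boolean classify(String documentText, String category); // f(d, c) in {T, F}
    }

    public static void main(String[] args) {
        // illustrative classifier: file a text under "acq" if it mentions "shares"
        BinaryTextClassifier f =
            (d, c) -> c.equals("acq") && d.toLowerCase().contains("shares");
        System.out.println(f.classify(
            "Investment firms cut Cyclops stake to 260,500 shares", "acq")); // true
    }
}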


Representing target and classifier functions

• The goal of TC is to approximate the (unknown) target function:

f : D × C → {T, F}

which classifies (as defined in (5.1)) each document correctly into one or more categories in C.

• Classification is (actually) done by a classifier function

f̂ : D × C → {T, F}

• For evaluation purposes, a target function could be assumed to describe, for instance, a set of hand-annotated documents

Performance: evaluating TC systems

• Evaluation compares actual performance (f̂) to ideal performance (f)

• The most commonly used metrics:

• Recall: how good the system is at “finding” relevant documents for a given category (ρ):

ρ = true positives / (true positives + false negatives)              (5.2)

• Precision: the “quality” of the classified data (π):

π = true positives / (true positives + false positives)              (5.3)

Other measures, such as accuracy and fallout, as well as measures of the trade-off between precision and recall, are sometimes used. We’ll examine them later in the course.

Precision and recall


[Figure: the set of all texts for one category, partitioned into the documents selected by the classifier and the target (relevant) documents, giving the four regions TP (selected and relevant), FP (selected but not relevant), FN (relevant but not selected) and TN (neither).]
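The following hedged Java sketch shows how ρ, π and accuracy could be computed from the four cells of such a contingency table; the counts themselves are invented for illustration:

// Sketch: computing recall (rho), precision (pi) and accuracy from TP, FP, FN, TN.
public class EvaluationMetricsSketch {
    public static void main(String[] args) {
        int tp = 40, fp = 10, fn = 20, tn = 130;                     // illustrative counts only
        double recall    = (double) tp / (tp + fn);                  // rho = TP / (TP + FN)
        double precision = (double) tp / (tp + fp);                  // pi  = TP / (TP + FP)
        double accuracy  = (double) (tp + tn) / (tp + fp + fn + tn);
        System.out.printf("recall=%.2f precision=%.2f accuracy=%.2f%n",
                          recall, precision, accuracy);
    }
}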

Learning experience: Basic assumptions

• In TC, the learning experience typically consists of labelled (pre-categorised) texts.

• We will make the following assumptions:

– No knowledge of C, other than its symbolic labels, will be available to the classifier

– No exogenous knowledge of the elements in D, such as metadata, location, etc., is available to the classifier

• Although these assumptions are often too strict for real-life applications, they make conceptualisation clearer.

Types of classifiers: number of labels

• Consider a classification function f : D → 2^C

• Multi-label text categorisation (MLTC):

– A document may be assigned any number of categories.

– For a given document d in D, one might have a subset of C, Cm = {c1, ..., ck}, such that k > 1 and

f(d) = {c1, ..., ck}

• The cardinality of Cm may vary from one categorisation task to another

• If |Cm| = 1 for all d ∈ D, we have what is called a single-label classifier (SLTC).

Controlled vocabulary keyword assignment and document classification into web hierarchies are examples of MLTC, whereas ambiguity resolution in NLP is an example of SLTC. Email spam filtering is an example of the binary case of SLTC.


Binary classifiers

• For all d ∈ D, Cm = {ci, c̄i}

• Each document is classified exactly as either a category ci or its complement c̄i, that is:

f(d, ci) ⊕ f(d, c̄i)

• Multi-label text categorisation can be reduced to independent problems of binary or single-label text categorisation if all c1, ..., cn ∈ C are stochastically independent.

– Example: as independent binary TC problems, each document would be classified under {ci, c̄i} for i = 1, ..., |C|

– Example: The MLTC in the previous slide could be represented as...

f(d, c1) = ... = f(d, ck) = T

Casting BTC as SLTC is trivial: simply define fbtc(d, c) := ¬fmltc(d, c). MLTC can be reduced to SLTC by assuming Cm = {c1, . . .}, where f(d, ci) = T.

Sorting by Categories or Documents

• Document-pivoted categorisation (DPC):

– given a di ∈ D, find all categories cj ∈ C under which it should be filed

• Category-pivoted categorisation (CPC):

– given a category ci ∈ C, find all documents dj ∈ D that should be filed under it

• The above is a purely pragmatic distinction.

• Which factors might affect whether you choose CPC or DPC?

– DPC: complete D isn’t available (e.g. email filtering)

– CPC: C changes after D has been categorized (e.g. documents in a digital library)

Ranking categorisation

• A categorisation function f : D × C → {T, F} performs what we call “hard” categorisation

• Another possibility is “soft” categorisation, i.e. to allow f to range over real values in the interval [0, 1] (i.e. f(d, c) = v, where 0 ≤ v ≤ 1)

• This is equivalent to a ranking of the categories of C according to their appropriateness to each document.

• Ranking CPC is also possible


Text categorisation data

• REUTERS-21578: a commonly used corpus

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" ID="96">
<DATE>26-FEB-1987 16:32:37.30</DATE>
<TOPICS><D>acq</D></TOPICS>
<PLACES><D>usa</D></PLACES>
<PEOPLE></PEOPLE><ORGS></ORGS><EXCHANGES></EXCHANGES><COMPANIES></COMPANIES>
<TEXT>
<TITLE>INVESTMENT FIRMS CUT CYCLOPS &lt;CYL> STAKE</TITLE>
<DATELINE> WASHINGTON, Feb 26 - </DATELINE>
<BODY>A group of affiliated New York investment firms said they
lowered their stake in Cyclops Corp to 260,500 shares, or 6.4 pct
of the total outstanding common stock, from 370,500 shares, or
9.2 pct. In a filing with the Securities and Exchange Commission,
the group, led by Mutual Shares Corp, said it sold 110,000 Cyclops
common shares on Feb 17 and 19 for 10.0 mln dlrs. Reuter
</BODY></TEXT></REUTERS>

Training, validation and test sets

• In supervised TC, the corpus available to the developers is annotated, that is, it is a set Ω = {d1, ..., d|Ω|} ⊂ D such that a target function f is known.

• In creating a classifier, Ω is normally divided into three sub-sets:

– a training set (Tr), based on which the initial classifier is inductively built

– a validation set (Tv), also known as a hold-out set, on which the initial classifier is optimised, and

– a test set (Te), on which the classifier is evaluated

Algorithms: Knowledge engineering vs. Machine Learning

• Knowledge engineering was employed in early rule-based systems such asCONSTRUE (Hayes and Weinstein, 1990)

• Those systems were coded as production rules of the form

IF <Condition in DNF> THEN Action

• But creation of such rules (by human “knowledge engineers”) leads to the knowledge acquisition bottleneck

• Machine Learning techniques automate the classifier’s construction process (induction).


Design decisions

For example, the TC system to be implemented in this course:

[Figure: design decisions for the TC system to be implemented in this course. Training experience: unannotated corpus, annotated corpus D, or a stream of documents; target function: f : D × C → [0, 1], f : D × C → {T, F}, f : D → C, or document browsing; representation: set of words, bag of words, probability vectors, feature sets, ...; induction algorithm: NB, NN, ID3, SVM, kNN, ...]


6 Probabilities and Information Theory

In these notes we review the basics of probability theory and introduce the information theoretic notions which are essential to many aspects of machine learning in general, and to the induction of text classifiers in particular. We will attempt to illustrate the abstract concepts introduced in these notes with examples from a text categorization task (e.g. spam filtering). The reader should note, however, that the techniques and results reported here apply to a much wider domain of applications. The connection with TC will become clearer next week, when we see feature selection and classifier induction.

Concise but useful introductions to probability theory can be found in (Russell and Norvig, 1995, chapter 14) and (Manning and Schutze, 1999, chapter 2). The latter also includes the basics of information theory, viewed mainly from a natural language processing perspective. A very good general introduction to probability theory is (Bertsekas and Tsitsiklis, 2002).

Why review Probability and Information Theory?

• Probability theory gives us a tool to model uncertainty

• Probabilistic approaches (e.g. naive Bayes) are used in TC.

• Information Theory plays an important role in various areas of machine learning, in particular:

– Feature selection uses the information theoretic notions of information gain and mutual information

– Learning algorithms, such as decision tree induction, use the information theoretic concept of entropy to decide how to partition the document space.

Information theory originated from Claude Shannon’s research on the capacity of noisy information channels. Information theory is concerned with maximising the information one can transmit over an imperfect communication channel. The central concept of Information Theory is that of entropy. Entropy (which we will define formally below) measures “the amount of uncertainty” in a probability distribution.


Reportedly, the term “entropy” was suggested to Shannon by John von Neumann: “You should call it entropy for two reasons: first, the function is already in use in thermodynamics under the same name; second, and more importantly, most people don’t know what entropy really is, and if you use the word entropy in an argument you will win every time” (Hamming, 1991).

Probability theory: notation

Notation   Set jargon                Probability jargon
Ω          collection of objects     sample space
ω          element of Ω              elementary event
D          subset of Ω               event that some outcome in D occurs
D̄          complement of D           event that no outcome in D occurs
D ∩ E      intersection              both D and E
D ∪ E      union                     D or E, or both
D \ E      difference                D but not E
D ⊆ E      inclusion                 if D then E
∅          empty set                 impossible event
Ω          whole space               certain event

A notational variant to the above which stresses the connection with logic would treat set intersection as conjunction, set union as disjunction, etc. This variant is summarised below:

Logic        Set theory
P(A ∨ B)     P(A ∪ B)
P(A ∧ B)     P(A ∩ B)
P(false)     P(∅)
P(true)      P(Ω)

Sample spaces

The set of all possible outcomes of an experiment is called the sample space

• in Text Categorisation, for instance, one could regard the set of documents being processed as the sample space:

Ω = D = {d1, ..., d|Ω|}

However, one could alternatively take sets of documents rather than sets (or multisets, lists etc.) of words to be elementary events. In this case, Ω = 2^D would be the sample space.

• An “experiment” could be performed to determine, for instance, which documents belong to category c (say, email that should be classified as spam). The outcome of that experiment would be a subset of Ω.


Different ways of characterising sample spaces in TC will be presented in chapter 12 of this reader.

Here is an even more mundane example: the combinations of heads (H) and tails (T) resulting from tossing a coin three times can be represented by the following sample space:

Ω = {HHH, HHT, HTH, HTT, THH, TTH, THT, TTT}

Discrete Uniform Probability Law

• Now that we have characterised the sample space, we would like to be able to quantify the likelihood of events.

• Discrete uniform probability law: if the sample space consists of n possible outcomes which are equally likely, then the probability of any event D is given by

P(D) = (No. of elements of D) / n

• One can think of the probability of occurrence of an event D as the proportion of times event D occurs in a large number of trials:

P(D) = (No. of occurrences of D) / (No. of trials)                   (6.1)

The view of probabilities adopted above has been termed “frequentist”. It relies on the empirical observation that the ratio between observed occurrences of an event and the number of trials appears to converge to a limit as the number of trials increases. The frequentist approach is inadequate in many ways, but a thorough discussion of its merits and limitations is beyond the scope of this revision. For a very readable discussion of the “philosophy” of probability theory, see (Hamming, 1991).

We may illustrate this approach by calculating the probability that a document of corpus Ω is filed under category c as follows:

P(c) = |D| / |Ω|,   where D = {ω ∈ Ω : f(ω, c) = T}

Visualising sample spaces

Sample spaces can be depicted in different ways.


For events described by two rolls of a die, the sample space can be depicted as a 6 × 6 grid (1st roll against 2nd roll). For example, with

E1 = {same result in both rolls}
E2 = {at least one roll is a 5}

we have P(E1) = 1/6 and P(E2) = 11/36.

In experiments of a sequential nature such as this, a tree representation (1st roll, then 2nd roll) is also informative.

(adapted from (Bertsekas and Tsitsiklis, 2002))

Sample space, in somewhat more formal terms

• Defining σ-field:

A collection F of subsets of Ω is called a σ-field if it satisfies the following conditions:

1. ∅ ∈ F

2. if D1, ..., Dn ∈ F then ⋃_{i=1}^{n} Di ∈ F

3. if D ∈ F then D̄ ∈ F

• Example:

– the smallest σ-field associated with Ω is the collection F = {∅, Ω}.

Probability spaces

• We continue to add structure to our original set of events by defining P as a probability measure:

A Probability Measure P on ⟨Ω, F⟩ is a function P : F → [0, 1] satisfying:

1. P(Ω) = 1

2. if D1, D2, . . . is a collection of disjoint members of F, in that Di ∩ Dj = ∅ for all i ≠ j, then

P(⋃_{i=1}^{∞} Di) = Σ_{i=1}^{∞} P(Di)

The triple ⟨Ω, F, P⟩ is called a Probability Space


In the definition of probability measure presented here, F is obviously to be understood as a σ-field.

A probability measure is a special case of what is called in probability theory simply a measure. A measure is a function µ : F → [0, ∞) satisfying (2) as above and µ(∅) = 0. Some weight assignment functions, such as the ones often used in decision trees, are measures, though they are not probability measures.

Properties of probability spaces

• The following hold:

P(D̄) = 1 − P(D)                                                      (6.2)

If E ⊇ D then P(E) = P(D) + P(E\D) ≥ P(D)                            (6.3)

P(A ∪ B ∪ C) = P(A) + P(Ā ∩ B) + P(Ā ∩ B̄ ∩ C)                        (6.4)

• Inclusion-exclusion principle:

P(D ∪ E) = P(D) + P(E) − P(D ∩ E)                                    (6.5)

or, more generally:

P(⋃_{i=1}^{n} Di) = Σ_i P(Di) − Σ_{i<j} P(Di ∩ Dj) + Σ_{i<j<k} P(Di ∩ Dj ∩ Dk) − . . . + (−1)^{n+1} P(D1 ∩ . . . ∩ Dn)

Proofs:

(6.2) D ∪ D̄ = Ω and D ∩ D̄ = ∅, so P(D ∪ D̄) = P(D) + P(D̄) = 1

(6.3) If E ⊇ D, then E = D ∪ (E\D), which is a union of disjoint sets. Therefore P(E) = P(D) + P(E\D)

(6.5) The rationale for the inclusion-exclusion principle is easy to visualise by drawing a Venn diagram of (possibly intersecting) sets D and E. Simply adding the probabilities of D and E is as if we counted the probability of the intersection twice, so the result needs to be re-adjusted by subtracting the intersection:

P(D ∪ E) = P((D\E) ∪ (D ∩ E) ∪ (E\D))                                (set theory)
         = P(D\E) + P(D ∩ E) + P(E\D)                                (disjoint sets)
         = P(D\E) + P(D ∩ E) + P(E\D) + P(D ∩ E) − P(D ∩ E)          (algebra)
         = P((D\E) ∪ (D ∩ E)) + P((E\D) ∪ (D ∩ E)) − P(D ∩ E)        (disjoint sets)
         = P(D) + P(E) − P(D ∩ E)                                    (set theory)


Conditional probability

• If P(E) > 0, then the conditional probability that D occurs given E is defined to be:

P(D|E) = P(D ∩ E) / P(E)

Conditional probability can be illustrated in terms of set size (picture a Venn diagram of D and E): if you see probability measures as frequencies/proportions of occurrence, then the conditional is given by |D ∩ E| / |E| = (|D ∩ E| / |Ω|) / (|E| / |Ω|) = P(D ∩ E) / P(E).

Properties of conditional probabilities

1. For any events D and E s.t. 0 < P (E) < 1,

P(D) = P(D|E)P(E) + P(D|Ē)P(Ē)

2. More generally, if E1, ..., En form a partition of Ω s.t. P(Ei) > 0, then

P(D) = ∑_{i=1}^{n} P(D|Ei)P(Ei)

3. Chain rule:

P (D1 ∩ ... ∩Dn) = P (D1)P (D2|D1)P (D3|D2 ∩D1) . . .

Proof:

1. D = (E ∩ D) ∪ (Ē ∩ D), which is a union of disjoint sets. Thus

P(D) = P(E ∩ D) + P(Ē ∩ D)

     = P(D|E)P(E) + P(D|Ē)P(Ē)


Bayes’ rule

• Sometimes, as in the case of naive Bayes TC, it is easier to estimate the conditional probability of D given E than the other way around.

• In such cases, Bayes' rule can be used to simplify computation:

P(E|D) = P(E ∩ D) / P(D)                (6.6)
       = P(D|E) P(E) / P(D)

Proof: It follows trivially from the definition of conditional probability (slide 81) and the chain rule (slide 82).

Independence

• In general, the occurrence of an event E changes the probability that another event D occurs. When this happens, the initial (prior) probability P(D) gets "updated" to P(D|E). If the probability remains unchanged, i.e. P(D) = P(D|E), then we call D and E independent:

Events D1, . . . , Dn are called independent if

P(⋂_{i∈S} Di) = ∏_{i∈S} P(Di)   for all S ⊆ {1, 2, . . . , n}

• E.g: Two fair coin tosses are independent.

• But note that, when we have more than two events, pairwise independence does not imply independence:

– from P(C|A) = P(C) and P(C|B) = P(C) you cannot conclude P(A ∩ B ∩ C) = P(A)P(B)P(C)

• Neither is the latter a sufficient condition for the independence of A, B and C

Examples (verify, as an exercise, that these claims hold):

1. Pairwise independence does not imply independence:
   A = the coin comes up heads on the first toss,
   B = the coin comes up heads on the second toss,
   C = the two tosses have different results.

2. P(A ∩ B ∩ C) = P(A)P(B)P(C) is not sufficient for independence:

• Consider two throws of a fair die and the following events:
   A = the first roll is 1, 2 or 3,
   B = the first roll is 3, 4 or 5,
   C = the sum of the two rolls is 9.

3. Similarly, for a set of random variables S = {X1, . . . , Xn}, having P(Xi | ⋂_{Xj ∈ S\{Xi}} Xj) = P(Xi) for all i does not imply independence for S:


• Again, consider two throws of a fair die and the following events:
   A = the first roll is 1, 3 or 4,
   B = the first roll is 1, 2 or 4,
   C = the sum of the two rolls is 4.
   (Show that P(A|B ∩ C) = P(A), P(B|A ∩ C) = P(B) and P(C|B ∩ A) = P(C), but P(A|B) ≠ P(A), etc.)

Conditional Independence

• Absolute independence (as described above) is a very strong requirement, which is seldom met.

• In practice, one often uses conditional independence:

P (A ∩B|C) = P (A|C)P (B|C) (6.7)

or, equivalently:

P(A|B ∩ C) = P(A|C) (6.8)

• E.g.: Let A and B be two biased coins such that the probability of heads for A is .99 and for B .01. Choose a coin randomly (with a .5 probability of choosing each) and toss it twice. The probability of heads in the 2nd toss is not independent of the probability of heads in the 1st, but they are independent given the choice of coin.
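As a quick check of this claim, here is a worked computation under the stated assumptions, with H1, H2 denoting heads on the 1st and 2nd toss and A the event that coin A was chosen:

P(H1) = P(H2) = 0.5 × 0.99 + 0.5 × 0.01 = 0.5
P(H1 ∩ H2) = 0.5 × 0.99² + 0.5 × 0.01² = 0.4901
P(H2|H1) = 0.4901 / 0.5 = 0.9802 ≠ 0.5 = P(H2)          (not independent)
but P(H1 ∩ H2 | A) = 0.99 × 0.99 = P(H1|A) P(H2|A)      (independent given the coin)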

Some exercises

1. The dangers of overtesting: the review of ML by Domingos (2012) used the following example to caution readers against overtesting:

... a mutual fund that beats the market ten years in a row looks very impressive, until you realize that, if there are 1000 funds and each has a 50% chance of beating the market on any given year, it's quite likely that one will succeed all ten times just by luck.

Question: What is the actual probability that at least one mutual fund will succeed all 10 times by luck?

2. Monty Hall (from Wikipedia):

Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, "Do you want to pick door No. 2?" Is it to your advantage to switch your choice?

Question: The best strategy (as Wikipedia will tell you) is to switch. Give an explanation of why that is the case based on conditional probabilities and Bayes' rule.

Random variables

• A random variable is a function X : Ω → R with the property that {ω ∈ Ω : X(ω) ≤ x} ∈ F, for each x ∈ R

• Random variables offer a convenient way of abstracting over event spaces.


• The notation P(X = x) is used to indicate the probability that a random variable X takes value x

• Another example: in text categorisation over a corpus, a category can be seen as a random variable defined in terms of the number of documents classified under a given category.

Example (ctd.): Assume that 5 documents, out of a 20-document corpus Ω, have been classified as spam. We are now regarding subsets of Ω (possibly the entire power set of Ω) defined by, for instance, the categories assigned to their elements as a σ-field, and the resulting triple ⟨Ω, F, P⟩ as a probability space. Events in such a probability space will be "things" that denote elements of F, such as "the event that documents have been filed under category C". Let's also assume that category spam denotes a set {d1, . . . , d5} ∈ F. The probability associated with "the event that documents have been filed under category C" is summarised in the random variable C. The probability associated with "the event that documents have been filed under category spam" is given by a specific value of that random variable (recall that what we are calling a "variable" here is actually a function): P(C = spam) = 5/20 = 0.25.

Discrete Random Variables

• A discrete random variable is a random variable whose range is finite (or countably infinite)

• A discrete random variable is associated with a probability mass function (PMF)

– A PMF maps each numerical value that a random variable can take to a probability

• A function of a discrete random variable defines another discrete random variable

– The PMF of this new random variable can be obtained from the PMF of the original one.

• Thus a random variable can be conditioned on another random variable (or on an event), and the notions of independence and conditional independence seen above also apply.


Probability mass functions

• The PMF of a random variable X is the function p : R → [0, 1] given by

p(x) = P(X = x)

• For a discrete random variable:

∑_{i} p(xi) = ∑_{i} P(A_{xi}) = P(Ω) = 1

where A_{xi} = {ω ∈ Ω : X(ω) = xi}

• So, to calculate the PMF of X we add, for each possible value x, the probabilities of the outcomes ω for which X(ω) = x, obtaining p(x).

• E.g.: If X is the number of heads obtained in two tosses of a fair coin, its PMF is:

p(x) = .25   if x = 0 or x = 2
       .5    if x = 1
       0     otherwise

So the probability of at least one head is P(X > 0) = .75

Continuous Random Variables

• A continuous random variable is a random variable whose range is continuous (e.g. velocity, time intervals etc)

• Variable X is called continuous if there is a function fX s.t., for every subset B of R:

P(X ∈ B) = ∫_B fX(x) dx

• E.g.: the probability that X falls within interval [a, b] is P(a ≤ X ≤ b) = ∫_a^b fX(x) dx

• fX is called the probability density function (PDF) of X provided that it is non-negative and has the normalisation property:

∫_{−∞}^{∞} fX(x) dx = P(−∞ < X < ∞) = 1

Note that for a single value v, P(X = v) = ∫_v^v fX(x) dx = 0, so the probability that X falls within interval [a, b] is the same as the probability that X falls within [a, b), (a, b] or (a, b) (i.e. it makes no difference whether the endpoints are included or not). The probability that X falls within interval [a, b] can be interpreted as the area under the PDF curve over that interval.

Cumulative Distribution Functions

• Cumulative Distribution Functions (CDF) subsume PDFs and PMFs under a single concept.


• The CDF FX of X gives the probability P(X ≤ x), so that for every x:

FX(x) = P(X ≤ x) = ∑_{k ≤ x} p(k)        if X is discrete
                 = ∫_{−∞}^{x} fX(t) dt    if X is continuous

• Since {X ≤ x} is always an event (and therefore has a well-defined probability), every random variable X associated with a given probability model has a CDF.

Moments, expectation, mean, variance

• The expected value of a discrete random variable X with PMF p is given by

E[X] = ∑_x x p(x)

• For a continuous variable with PDF f we have

E[X] = ∫_{−∞}^{∞} x f(x) dx

• This is AKA the expectation, mean or the "first moment" of X.

• In general, we define the nth moment as E[X^n]

• The variance of a random variable is defined as the expectation of the random variable (X − E[X])²:

var(X) = E[(X − E[X])²] = ∑_x p(x)(x − E[X])²             if X is discrete
                        = ∫_{−∞}^{∞} (x − E[X])² f(x) dx   if X is continuous

Some Discrete Random Variables

• Bernoulli (parameter p): success (or failure) in a single trial:

p(k) = p       if k = 1
       1 − p   if k = 0
E[X] = p,  var(X) = p(1 − p)

• Binomial (parameters p and n): number of successes in n independent Bernoulli trials:

p(k) = (n choose k) p^k (1 − p)^{n−k},  k = 0, 1, . . . , n
E[X] = np,  var(X) = np(1 − p)

• Geometric (parameter p): number of trials until the first success:

p(k) = (1 − p)^{k−1} p,  k = 1, 2, . . .
E[X] = 1/p,  var(X) = (1 − p)/p²

• Poisson (parameter λ): approximation of the binomial PMF when n is large, p is small and λ = np:

p(k) = e^{−λ} λ^k / k!,  k = 0, 1, . . .
E[X] = λ,  var(X) = λ
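These PMFs are easy to compute directly. Below is a small Java sketch (ours, purely illustrative; the class and method names are not part of the course code) implementing the four PMFs with the parameterisations given above:

import static java.lang.Math.*;

public class DiscretePmfSketch {

    // C(n, k), computed iteratively to avoid factorial overflow
    static double binomialCoeff(int n, int k) {
        double c = 1.0;
        for (int i = 1; i <= k; i++) c = c * (n - k + i) / i;
        return c;
    }

    static double bernoulli(int k, double p) { return k == 1 ? p : (k == 0 ? 1 - p : 0); }

    static double binomial(int k, int n, double p) {
        return binomialCoeff(n, k) * pow(p, k) * pow(1 - p, n - k);
    }

    static double geometric(int k, double p) { return k >= 1 ? pow(1 - p, k - 1) * p : 0; }

    static double poisson(int k, double lambda) {
        double logP = -lambda + k * log(lambda);      // log of e^-lambda * lambda^k
        for (int i = 2; i <= k; i++) logP -= log(i);  // minus log k!
        return exp(logP);
    }

    public static void main(String[] args) {
        // e.g. the probability of 10 successes in 10 fair Bernoulli trials:
        System.out.println(binomial(10, 10, 0.5));    // 2^-10, approx. 0.000977
    }
}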


Some Continuous Random Variables

• Uniform (over interval [a, b]):

f(x) = 1/(b − a)   if a ≤ x ≤ b
       0           otherwise

E[X] = (a + b)/2,  var(X) = (b − a)²/12

• Exponential (parameter λ): e.g. models the time until some event occurs:

f(x) = λ e^{−λx}   if x ≥ 0
       0           otherwise

E[X] = 1/λ,  var(X) = 1/λ²

(Figure: plot comparing a Geometric CDF and an Exponential CDF.)

• Normal or Gaussian (parameters µ and σ² > 0):

f(x) = (1 / (σ √(2π))) e^{−(x−µ)² / (2σ²)}

E[X] = µ,  var(X) = σ²

Entropy

• Entropy, AKA self-information, measures the average amount of uncertainty in a probability mass distribution. In other words, entropy is a measure of how much we learn when we observe an event occurring in accordance with that distribution.

• The entropy of a random variable measures the amount of information in that variable (we will always be using log base 2 unless stated otherwise):

H(X) = H(p) = −∑_{x∈X} p(x) log p(x)
            = ∑_{x∈X} p(x) log (1/p(x))

N.B.: We define 0 log(0) = 0


Example: suppose we have a set of documents D = {d1, ..., dn}, each classified according to whether or not it belongs to a certain category c, say, spam. First, suppose you know that all documents in D are filed under spam (we represent that as P(spam) = 1). How much information would we gain if someone told us that a certain document di drawn randomly from corpus D has been filed under "spam"? Answer: zero, since we already knew this from the start! Now suppose that you know 80% of D (your incoming email folder) is spam, and you randomly pick an email message from D and find out that it is labelled "spam". How much have you learned? Certainly more than before, although less than you would have learned if the proportion between spam and legitimate email were 50-50. In the former case there was less uncertainty involved than in the latter.

Information gain

• We may also quantify the reduction of uncertainty of a random variable due to knowing about another. This is known as Expected Mutual Information:

I(X;Y) = IG(X,Y) = ∑_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]   (6.9)

• Entropies of different probability functions may also be compared by calculating the so-called Information Gain. In decision tree learning, for instance:

G(D, F) = H(t) − ∑_{i=1}^{n} pi H(ti)   (6.10)

where t is the distribution of the mother node, ti the distribution of daughter node i, and pi the proportion of texts passed to node i if term F is used to split corpus D.

Information theory tries to quantify such uncertainties by assuming that the amount of information learned from an event grows with the inverse of the probability of its occurrence (through its logarithm). So, in the case where there is a 50-50 chance that di will be spam, the amount learned from di would be

i(P(C = spam)) = log (1 / P(C = spam)) = log (1 / 0.5) = 1

i(.), as defined above, measures the uncertainty for a single value of the random variable C. How would we measure uncertainty for all possible values of C (in this case, spam and its complement, spam̄)? The answer is: we calculate the entropy of its probability mass function:

H(C) = −(p(spam) log p(spam) + p(spam̄) log p(spam̄))
     = −(0.5 log 0.5 + 0.5 log 0.5) = 1

A more interesting corpus, where the probability of a document being labelled as "spam" is, say, 0.25, would have entropy of

H(C) = −(0.25 log 0.25 + 0.75 log 0.75) ≈ 0.811
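These entropy values can be reproduced with a few lines of code. The following is a small Java sketch of our own (not part of the course software) computing H over a probability mass distribution given as an array:

public class EntropySketch {

    // H(p) = - sum_x p(x) log2 p(x), with the convention 0 log 0 = 0
    public static double entropy(double[] p) {
        double h = 0.0;
        for (double pi : p)
            if (pi > 0)
                h -= pi * Math.log(pi) / Math.log(2);   // log base 2
        return h;
    }

    public static void main(String[] args) {
        System.out.println(entropy(new double[]{0.5, 0.5}));     // 1.0
        System.out.println(entropy(new double[]{0.25, 0.75}));   // approx. 0.811
        System.out.println(entropy(new double[]{1.0, 0.0}));     // 0.0 (no uncertainty)
    }
}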


N.B.: There is some confusion surrounding the related notions of Expected Mutual Information and Information Gain. The definition in (6.9) corresponds to what some call Information Gain (Sebastiani, 2002). For the purposes of choosing where to split the instances in decision trees, the definition of Information Gain used is the one in (16.1), as defined in (Manning and Schutze, 1999, ch. 2). We will reserve the term Expected Mutual Information I(X;Y) for what Sebastiani (2002) calls Information Gain, though we will sometimes write it IG(X,Y).

We will see details of how information gain scores are used in decision tree induction when we review the topic next week.


7 Practical: Parsing the REUTERS-21578 News Corpus

7.1 Goals

In this lab you will get acquainted with a corpus widely used by the Text Classification community: the REUTERS-21578 (Lewis, 1997), see how it is encoded (in XML), and implement a simple DOM parser to extract some information from the XML-encoded files. For this particular type of text classification task (based on supervised learning algorithms) it is important to be able to extract category information as well as the main body of text from a standard format (XML, in this case). This is the first step towards building a document representation that can be used by classifier induction algorithms. See the lecture notes (http://www.scss.tcd.ie/~koidlk/cs4062/) for a more detailed discussion of document representation issues.

7.2 Software and Data

For this exercise you will be provided with some Java code to represent news items, as well as a class BasicParser.java which illustrates the use of a DOM parser.1

The software and data provided for this exercise are available as a compressed tar archive at:

http://www.scss.tcd.ie/~koidlk/cs4062/sw/lab01.tgz

Please download and uncompress it2 in your user area. The resulting directory should contain several files and directories, in particular:

• Directory src: some Java source files and a structure which will eventually be filled with the Java classes of a probabilistic text classifier.

1There are two main standards for parsing XML files: SAX and DOM. The former is event-based, which allows parsing a large document progressively; the latter loads the complete document in memory, which makes it easier to manipulate. See https://docs.oracle.com/cd/B28359_01/appdev.111/b28394/adx_j_parser.htm.
2Using tar xfz lab01.tgz in a terminal, or any GUI tool which handles .tgz files.


• reut2-mini.xml: sample data from the REUTERS-21578 Text Categorization collection.

In Eclipse3 you can import the project as follows: select File → Import... → General → Existing projects into workspace; then select the directory in which you extracted the archive as root directory; project "TC" should appear in the list of projects; click Finish.

7.3 Tasks

7.3.1 Data structure

Start by opening the XML file reut2-mini.xml in the data directory. This file has been converted from Reuters' SGML source into XML in order to facilitate parsing.

Take a look at the files in the software area, especially README LEWIS.txt, to see what the tags mean and how the corpus is structured.

Draw a tree representing the structure of a REUTERS-21578 file. Which elements of this tree correspond to documents? How are documents identified in the corpus? Which elements indicate the categories a document belongs to?

7.3.2 Parsing the data

Now open tc/parser/BasicParser.java and inspect its contents. This program simply takes the name of an XML file (arg[0]) and calls the method parse to read its content and print some information. Observe how XML elements and attributes are obtained from the XML document in parse. Run the program with the file reut2-mini.xml as argument.
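For orientation, here is a minimal, self-contained DOM parsing sketch of our own (it is not the provided BasicParser.java): it loads an XML file and prints one attribute and one child element per news item. The element and attribute names used (REUTERS, NEWID, TOPICS) follow the REUTERS-21578 format; check them against the actual data and the README.

import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DomSketch {
    public static void main(String[] args) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new File(args[0]));       // DOM: whole document in memory
        NodeList items = doc.getElementsByTagName("REUTERS");  // one element per news item
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String id = item.getAttribute("NEWID");            // document identifier
            NodeList topics = item.getElementsByTagName("TOPICS");
            String cats = topics.getLength() > 0 ? topics.item(0).getTextContent() : "";
            System.out.println(id + " -> " + cats);
        }
    }
}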

The main data types to be used in the classifier will be stored in tc/dstruct. Currently, that directory contains two classes which will be used to store news items:

tc

|

|-- dstruct

| |-- ParsedText.java

| |-- ParsedNewsItem.java

...

Create a new class NewsParser.java based on BasicParser.java, but in which:

1. The method parse returns a ParsedText object which contains all the ParsedNewsItem objects populated from the input XML document.

2. The NewsParser main method prints the resulting ParsedText onto the screen at the end of parsing.

Once your parser works with reut2-mini.xml, download the Reuters corpus at:

3You are free to work with any programming environment you like (you can even use a basic text editor)


http://www.scss.tcd.ie/~koidlk/cs4062/sw/reuters21578-xml.tar.bz2

Uncompress the archive (tar xfj reuters21578-xml.tar.bz2) and test your parser on a few XML files from the dataset.

7.4 Delivering the exercises

Please submit your code by email to [email protected] and [email protected]

(to both addresses) before February 5th.


8 Machine Learning and TC

These notes further refine the formal presentation of TC started last week and extend it to recast TC into the framework of machine learning. We introduce the basics of document representation and indexing: a vectorial representation of documents, and ways of assigning weights to the features which make up a document vector.

Recapitulating...

• TC task: build a classifier f̂ : D × C → {T, F} approximating the target function f : D × C → {T, F}

• Precision (π) and recall (ρ)

• Multiple label, Single label, and binary categorisation

• Hard vs. soft categorisation

• Corpus (Ω), training set (Tr), validation set (Tv) and test set (Te)

As we have seen, precision is the ratio between the number of true positives and the number of documents selected by the classifier function. Recall is the ratio between the number of true positives and the size of the target set of documents.

Single-label and binary categorisation are inter-definable notions. If you define f(d, c̄) = T whenever f(d, c) = F, you get a single-label classifier (over {c, c̄}) out of a binary classifier. Similarly, multi-label classifiers can be re-defined as single-label classifiers provided one assumes the categories to be statistically independent, as seen in the introduction.

Training, validation and test sets are previously (hand) annotated. We will typically regard such annotation as corresponding to the target function.


TC life cycle

• Corpus indexing: all texts are tokenised, frequency tables and reverse indices are built

• Training: classifiers are inductively constructed

• Classifier validation (parameter tuning) and evaluation (testing)

(Diagram: the TC life cycle as a loop over corpus indexing, training and evaluation.)

The life cycle of a text classifier also typically includes a pre-processing phase, as we will see below. Pre-processing usually involves tokenisation, sometimes stemming, and (more rarely) collocation analysis.

Preliminaries

• Some corpus linguistics jargon:

– Type: each word (lexical unit) occurring in a text

– Token: each occurrence of a word in a document

• How many types and tokens do we have in the following sentence?

“One could, for instance, use a set to store all types in a document, anda list to store all tokens.”

• In a ML context, types will often be called “features”.

For the example above, one would have the following type-token table:

type        No. of tokens
one         1
could       1
for         1
instance    1
use         1
a           3
set         1
to          2
store       2
all         2
types       1
in          1
document    1
and         1
list        1
tokens      1


Development approaches

• Train-and-test:

1. the initial classifier is built to categorise the data in the training set

2. the parameters of the classifier are tuned by repeated tests against the validation set in order to avoid overfitting to the training data

3. effectiveness of the classifier is assessed by running it on the test set Te, comparing f̂ : Te × C → {T, F} to f : Te × C → {T, F}.

K-fold cross validation

• In this approach, k different classifiers f1, ..., fk are built by partitioning the corpus into k disjoint sets Ω1, ..., Ωk

• Then the train-and-test approach is applied iteratively on pairs ⟨Tri, Tei⟩

• The final effectiveness figure is obtained by computing the effectiveness of each classifier f1, ..., fk and averaging the results in some way.

• We will return to this point in future.

• In the meantime, see (Emms and Luz, 2007; Mitchell, 1997) for more details on the cross validation procedure (a small sketch of the fold-partitioning step is given below).
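The following Java sketch (ours, with hypothetical names; it is not the course's evaluation code) shows one simple way of producing the k folds from a list of document identifiers. Fold i then plays the role of Tei, with the remaining folds forming Tri.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class KFoldSketch {
    public static List<List<String>> kFolds(List<String> docIds, int k, long seed) {
        List<String> shuffled = new ArrayList<>(docIds);
        Collections.shuffle(shuffled, new Random(seed));    // randomise before splitting
        List<List<String>> folds = new ArrayList<>();
        for (int i = 0; i < k; i++) folds.add(new ArrayList<>());
        for (int i = 0; i < shuffled.size(); i++)
            folds.get(i % k).add(shuffled.get(i));          // round-robin assignment
        return folds;
    }
}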

Category generality

• The generality of a category ci in the context of a text classification system, given a corpus Ω, is defined as the proportion of documents that belong to ci:

gΩ(ci) = |{dj ∈ Ω s.t. f(dj, ci) = T}| / |Ω|   (8.1)

• Analogously, one can define category generality for

– training sets: gTr(ci)

– validation sets: gTv(ci)

– and test sets: gTe(ci)

Text representation in TC

• ML algorithms cannot cope with full-scale document representations

• Documents must be mapped into compact, uniform representations to be used across training, evaluation and testing

– Indexing

• The main task at this stage: finding out what to index

– “no semantics”,

– lexical semantics,

– compositional semantics, ...


Texts as feature vectors

• The form of representation most commonly adopted in text categorisation (and information retrieval) consists of:

– selecting a set of terms T in the corpus (also known as features),

– encoding all documents dj as vectors:

d⃗j = ⟨w1j, ..., w|T|j⟩

– wkj represents how much feature k "contributes" to the semantics of text dj

Possible implementations

• Variations on the vector representation:

– different ways of defining terms (features):

∗ compounds,

∗ semantic dependencies,

∗ ...

– different ways of computing weights:

∗ which words, phrases etc are the most relevant?

∗ how do we quantify relevance?

• One can eliminate large numbers of candidate features before any statistical processing is done...

Before indexing...

• Some pre-processing of texts may help reduce dimensionality even before any indexing is done or weights computed.

• One often removes function words as the first step of text processing. These include topic-neutral types in the following categories:

– prepositions, conjunctions, articles

• Stemming is also frequently employed. It involves clustering together types that share the same morphological root. For example:

– cluster, clustering, clustered, ...

Different ways of defining features

• Words or phrases?

– Should we represent "Saddam Hussein" as a single feature or as two distinct features?

– How about "White House", and "Text categorisation"?

• Syntactically versus semantically motivated phrases:

– a question of whether to group sequences of terms according to a grammar of the language (the syntactic approach) or relative frequency of n-grams (the stochastic approach).


Different ways of computing weights

• Alternative ways of computing weights for document vectors also influence the representation.

• Three approaches:

– sets of words: where binary weights indicate presence or absence of the feature in the document

– bags (or multi-sets) of words: where weights quantify feature occurrence

– relative frequency: term-frequency scores or probabilistic weights

• Computing weights by relative frequency:

– Weights are commonly assigned according to a term frequency - inverse document frequency (tfidf) score.

Term frequency, Inverse document frequency

• Key ideas behind tfidf:

– the more a feature occurs in a text the more representative of its content it is:

tf(tk, dj) = #(tk, dj) / ∑_{i}^{|T|} #(ti, dj)   (8.2)

– the more texts a feature occurs in the less discriminating that feature is:

idf(tk) = log ( |Tr| / #Tr(tk) )   (8.3)

where:

∗ #(tk, dj) is the number of occurrences of tk in dj and

∗ #Tr(tk) is the number of documents in Tr in which tk occurs

TF x IDF

• Combining tf and idf, we get a possible weighting score for the terms of our vector representation d⃗j:

tfidf(tk, dj) = tf(tk, dj) × idf(tk)   if #(tk, dj) ≥ 1
                0                      otherwise
                                                       (8.4)


Normalisation

• Practical issue: weights calculated via (8.4) cannot be compared directly.

• One often employs cosine normalisation to force the weights in d⃗j to fall within the interval [0, 1]:

wkj = tfidf(tk, dj) / √( ∑_{i=1}^{|T|} tfidf(ti, dj)² )   (8.5)

• See (Manning and Schutze, 1999, ch 15) for other ways of calculating feature vectors and weights.
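As an illustration of equations (8.2)-(8.5), here is a small Java sketch of our own (class and method names are hypothetical; this is not the lab code): given the raw term counts of each document, it returns the cosine-normalised tfidf weights of document j.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfidfSketch {
    // docs: one map per document, term -> raw count #(t_k, d_j)
    public static Map<String, Double> weights(List<Map<String, Integer>> docs, int j) {
        Map<String, Integer> counts = docs.get(j);
        double total = counts.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String t = e.getKey();
            long docFreq = docs.stream().filter(d -> d.containsKey(t)).count();  // #Tr(t)
            double tf = e.getValue() / total;                                    // eq. (8.2)
            double idf = Math.log((double) docs.size() / docFreq);               // eq. (8.3)
            w.put(t, tf * idf);                                                  // eq. (8.4)
        }
        double norm = Math.sqrt(w.values().stream().mapToDouble(x -> x * x).sum());
        if (norm > 0) w.replaceAll((t, x) -> x / norm);                          // eq. (8.5)
        return w;
    }
}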

Darmstadt Indexing Approach (DIA)

• Used in the AIR/X system (Fuhr et al., 1991).

• Instead of terms (words etc), use properties that define the dimensions of the learning space as features.

• Examples of properties used in DIA:

– properties of tk such as tfidf(tk, dj), or of tk w.r.t. ci, such as the association factor

z(tk, ci) = #Tr,ci(tk) / #Tr(tk)

– relationships between tk and dj: e.g. position of tk in dj

– properties of dj such as its length

– properties of categories ci such as their generality (as given by equation (8.1))

• (See (Sebastiani, 2002) for more references.)

AIR/X Indexing approach

Rule-based indexing, starting from different Forms of Occurrence (FOC), based on:

• Heuristic rules (H, in the diagram)

• Probabilistic rules (P)

• Thesaurus rules (T)

Rules may have human (domain expert) input. Representation is in Relevance Description Vectors (RDV).

(Fuhr et al., 1991)


9 Dimensionality Reduction

Dimensionality reduction in IR and TC

• A 3d term set: T = {football, politics, economy}

• IR: calculate distances between vectors (e.g. via cosine matching)

• TC: high dimensionality may be problematic

(Figure: two document vectors, d1 = ⟨0.5, 0.5, 0.3⟩ and d2 = ⟨0.5, 0.3, 0.3⟩, plotted in the 3-dimensional space whose axes are football, politics and economy.)

Cosine similarity between documents d and e is given by:

cos(d, e) = (d · e) / (‖d‖ × ‖e‖)

where ‖d‖ is the Euclidean norm of d. In the case of the example above (normalised vectors) the Euclidean distance can be used instead, as it gives the same rank order as cosine similarity:

dist(d, e) = √( ∑_{i=1}^{|T|} (di − ei)² )   (9.1)
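Both measures are straightforward to implement; the following small Java sketch (ours, purely illustrative) computes them for dense document vectors and prints the values for d1 and d2 above:

public class SimilaritySketch {

    public static double cosine(double[] d, double[] e) {
        double dot = 0, nd = 0, ne = 0;
        for (int i = 0; i < d.length; i++) {
            dot += d[i] * e[i];
            nd  += d[i] * d[i];
            ne  += e[i] * e[i];
        }
        return dot / (Math.sqrt(nd) * Math.sqrt(ne));
    }

    public static double euclidean(double[] d, double[] e) {   // eq. (9.1)
        double s = 0;
        for (int i = 0; i < d.length; i++) s += (d[i] - e[i]) * (d[i] - e[i]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[] d1 = {0.5, 0.5, 0.3}, d2 = {0.5, 0.3, 0.3};
        System.out.println(cosine(d1, d2) + " " + euclidean(d1, d2));
    }
}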

What is Dimensionality Reduction?

• DR: a processing step whose goal is to reduce the size of the vector space from |T| to |T′| ≪ |T|.

• T ′ is called Reduced Term Set


• Benefits of DR:

– Lower computational cost for ML

– Help avoid overfitting (training on constitutive features, rather than contingent ones)

• A rule-of-thumb: overfitting is avoided if the number of training examples is proportional to the size of T′ (for TC, experiments have suggested a ratio of 50-100 texts per feature).

Local vs. Global DR

• DR can be done for each category or for the whole set of categories:

– Local DR: for each category ci, a set T′i of terms (|T′i| ≪ |T|, typically 10 ≤ |T′i| ≤ 50) is chosen for classification under ci.

∗ Different term sets are used for different categories.

– Global DR: a set T′ of terms (|T′| ≪ |T|) is chosen for classification under all categories C = {c1, ..., c|C|}

• N.B.: Most feature selection techniques can be used for local and global DR alike.

DR by feature selection vs. DR by feature extraction

• DR by term selection or Term Space Reduction (TSR): T′ is a subset of T. Select a T′ from the original feature set which yields the highest effectiveness w.r.t. document indexing

• DR by term extraction: the terms in T′ are not of the same type as the terms in T, but are obtained by combinations or transformations of the original ones.

– E.g.: in DR by term extraction, if the terms in T are words, the terms in T′ may not be words at all.

Term Space Reduction

• There are two ways to reduce the term space:

– TSR by term wrapping: the ML algorithm itself is used to reduce term space dimensionality.

– TSR by term filtering: terms are ranked according to their "importance" for the TC task and the highest-scoring ones are chosen.

• Performance is measured in terms of aggressiveness: the ratio between the sizes of the original and reduced feature sets, |T| / |T′|

• Empirical comparisons of TSR techniques can be found in (Yang and Pedersen, 1997) and (Forman, 2003).


Filtering by document frequency

• The simplest TSR technique:

1. Remove stop-words, etc, (see pre-processing steps)

2. Order all features tk in T according to the number of documents in which they occur. Call this metric #Tr(tk)

3. Choose T′ = {t1, ..., tn} s.t. it contains the n highest scoring tk

• Advantages:

– Low computational cost

– DR up to a factor of 10 with small reduction in effectiveness (Yang and Pedersen, 1997)

Information theoretic TSR: preliminaries

• Probability distributions: probabilities on an event space of documents:

• P(t̄k, ci): for a random document x, term tk does not occur in x and x is classified under category ci.

• Similarly, we represent the probability that tk does occur in x and x is filed under ci by P(tk, ci).

• N.B.: This notation will be used as shorthand for instantiations of the appropriate random variables. That is, for multivariate Bernoulli models: P(Tk = 0, Ci = 1) and P(Tk = 1, Ci = 1), respectively.

Commonly used TSR functions

Function                      Notation           Mathematical definition

DIA factor                    z(Tk, Ci)          P(ci|tk)

Information Gain, AKA         IG(Tk, Ci) or      ∑_{c ∈ {ci, c̄i}} ∑_{t ∈ {tk, t̄k}} P(t, c) log [ P(t, c) / (P(t) P(c)) ]
Expected Mutual Information   I(Tk; Ci)

Mutual information            MI(Tk, Ci)         P(tk, ci) log [ P(tk, ci) / (P(tk) P(ci)) ]

Chi-square                    χ²(Tk, Ci)         |Tr| [P(tk, ci) P(t̄k, c̄i) − P(tk, c̄i) P(t̄k, ci)]² / [ P(tk) P(t̄k) P(ci) P(c̄i) ]

NGL coefficient               NGL(Tk, Ci)        √|Tr| [P(tk, ci) P(t̄k, c̄i) − P(tk, c̄i) P(t̄k, ci)] / √( P(tk) P(t̄k) P(ci) P(c̄i) )

Relevancy score               RS(Tk, Ci)         log [ (P(tk|ci) + d) / (P(t̄k|c̄i) + d) ]

Odds ratio                    OR(tk, ci)         P(tk|ci) [1 − P(tk|c̄i)] / ( [1 − P(tk|ci)] P(tk|c̄i) )

GSS coefficient               GSS(Tk, Ci)        P(tk, ci) P(t̄k, c̄i) − P(tk, c̄i) P(t̄k, ci)

from (Sebastiani, 2002)

The two more exotic acronyms, GSS and NGL, are the initials of the researchers who first proposed those metrics, namely the Galavotti-Sebastiani-Simi coefficient (GSS), proposed by (Galavotti et al., 2000), and the Ng-Goh-Low coefficient (NGL), proposed by (Ng et al., 1997).


Some functions in detail

• Basic intuition: the best features for a category are those distributed most differently on the sets of positive and negative instances of documents filed under that category

• Pointwise mutual information:

PMI(Ti, Cj) = log [ P(ti, cj) / (P(ti) P(cj)) ]   (9.2)

• Calculations to be performed: co-occurrence of terms and categories in the training corpus (Tr), and frequency of occurrence of words and categories in Tr.

Implementing I(.,.)

• Example: extracting “keywords” from paragraphs.

1. pointwisemi(d): wordpartable

2. pmitable = ()

3. parlist = split_as_paragraphs(d)

4. typelist = gettypelist(d)

5. foreach (par in parlist) do

6 ptypelist = gettypelist(par)

7. pindex = indexof(par, parlist)

8. foreach (word in ptypelist) do

9. i_w_p = log ( getwordprobability(word, par)

/ getwordprobability(word, d) )

10. addtotable(<word,pindex>, i_w_p, pmitable)

11. done

12. done

13 return pmitable

The keyword spotting examples in this chapter use a slightly different sample space model than the one we will be using in the TC application. The intention is to illustrate alternative ways of modelling linguistic data in a probabilistic framework, and the fact that the TSR metrics can be used in different contexts.

For the algorithm above, term occurrences are taken to be the elementary events. Term occurrences in the whole text generate the prior probabilities for terms, P(t). Term occurrences in certain paragraphs give conditional probabilities P(t|c) (i.e. occurrences of terms conditioned on the paragraph, taken in this case to be the "category"). Paragraphs are assumed to have a uniform prior P(c) (i.e. they are all equally likely to occur).

In the case of PMI(T, C) (pointwise mutual information of a term and a paragraph), we can simply work with priors and conditionals for words:

PMI(T, C) = log [ P(t, c) / (P(t) P(c)) ]
          = log [ P(t|c) P(c) / (P(t) P(c)) ]
          = log [ P(t|c) / P(t) ]                   (9.3)


The conditional P(t|c) can be calculated by dividing the number of times t occurs in documents of category c by the total number of tokens in those documents (the probability space for documents of category c). P(t) can be calculated by dividing the frequency of t in the training corpus by the total number of tokens in that corpus.

Normalised mutual information

MI(Ti, Cj) = P(ti, cj) log [ P(ti, cj) / (P(ti) P(cj)) ]   (9.4)

1. mi(d): wordpartable /* rank each word in ’d’ */

2. mitable = ()

3. parlist = split_as_paragraphs(d)

4. p_par = 1/sizeof(parlist)

5. typelist = gettypelist(d)

6. foreach (par in parlist) do

7 ptypelist = gettypelist(par)

8. pindex = indexof(par, parlist)

9. foreach (word in ptypelist) do

10. mi_w_p = getwordprobability(word, par) * p_par

* log ( getwordprobability(word, par)

/ getwordprobability(word, d) )

11. addtotable(<word,pindex>, mi_w_p, mitable)

12. done

13. done

14 return mitable

Similarly to (9.3) we can simplify the computation of MI(T, C) as follows:

MI(T, C) = P(t, c) log [ P(t, c) / (P(t) P(c)) ]
         = P(t|c) P(c) log [ P(t|c) / P(t) ]          (9.5)

Expected Mutual Information (Information gain)

• A formalisation of how much information about category cj one gains by knowing term ti (and vice-versa).

IG(Ti, Cj) = ∑_{t ∈ {ti, t̄i}} ∑_{c ∈ {cj, c̄j}} P(t, c) log [ P(t, c) / (P(t) P(c)) ]   (9.6)

• The computational cost of calculating IG(., .) is higher than that of estimating MI(., .)

IG: a simplified example


1. ig(d): wordpartable /* features = words; */

2. igtable = () /* categories = paragraphs */

3. parlist = split_as_paragraphs(d)

4. p_par = 1/sizeof(parlist)

5. typelist = gettypelist(d)

6. foreach (par in parlist) do

7. ptypelist = gettypelist(par)

8. pindex = indexof(par, parlist)

9. foreach (word in ptypelist) do

/* oversimplification: assuming T = word */

10. foreach (par in parlist) do

11. ig_w_p += getwordprobability(word, par) * p_par

* log ( getwordprobability(word, par)

/ getwordprobability(word, d) )

12. done

13. addtotable(<word,pindex>, ig_w_p, igtable)

14. done

15. done

16. return igtable

From local to global TSR

• A locally specified TSR function tsr(Tk, Ci), i.e. ranging over terms tk with respect to a specific category ci, can be made global by:

– Summing over the set of all categories:

tsr_sum(Tk) = ∑_{i=1}^{|C|} tsr(Tk, Ci)   (9.7)

– Taking a weighted average:

tsr_wavg(Tk) = ∑_{i=1}^{|C|} P(Ci) tsr(Tk, Ci)   (9.8)

– Picking the maximum:

tsr_max(Tk) = max_{i=1..|C|} tsr(Tk, Ci)   (9.9)

Comparing TSR techniques

• Effectiveness depends on the chosen task, domain etc

• Reduction factor of up to 100 with IGsum and χ2max

• Summary of empirical studies on the performance of different information-theoretic measures (Sebastiani, 2002), from better to worse:

OR_sum, NGL_sum, GSS_max
   ↓
IG_sum, χ²_max
   ↓
#_wavg, χ²_wavg


10 Implementation of Term Filters & DR by term extraction

Implementation examples

• The pseudocode in last lecture's notes implements term selection based on a probabilistic model in which the event space is comprised of terms.

• The following slides address term selection assuming a model usually employed in TC: i.e. one where the sample space is the set of documents (D).

• The algorithms are based on:

– Pointwise mutual information: PMI(Ti, Cj) = log [ P(ti, cj) / (P(ti) P(cj)) ],

– mutual information: MI(Tk, Ci) = P(tk, ci) log [ P(tk, ci) / (P(tk) P(ci)) ],

– and Expected Mutual Information (information gain):

IG(Tk, Ci) = ∑_{c ∈ {ci, c̄i}} ∑_{t ∈ {tk, t̄k}} P(t, c) log [ P(t, c) / (P(t) P(c)) ]

Implementing Pointwise MI TSR for TC

tsr_i(D, c, a) : T′   /* a : TSR aggressiveness */

1.  var Cl, Tt, Tl : list
2.  for each d ∈ D do
3.      if f(d, c) = true do append(d, Cl)
4.      for each t ∈ d do
5.          put(⟨t, d⟩, Tt)
6.          put(⟨t, 0⟩, Tl)
7.  P(c) = |Cl| / |D|
8.  for each t in Tl do
9.      Dt = {d | ⟨t, d⟩ ∈ Tt},  Dtc = {d | d ∈ Cl ∧ ⟨t, d⟩ ∈ Tt}
10.     P(t, c) = |Dtc| / |D|,  P(t) = |Dt| / |D|
11.     remove(⟨t, 0⟩, Tl)
12.     add(⟨t, log [ P(t, c) / (P(t) P(c)) ]⟩, Tl)
13. return the first |Tl| / a elements of Tl


TSR by mutual information

tsr_mi(D, c, a) : T′   /* a : TSR aggressiveness */

1.  var Cl, Tt, Tl : list
2.  for each d ∈ D do
3.      if f(d, c) = true do append(d, Cl)
4.      for each t ∈ d do
5.          put(⟨t, d⟩, Tt)
6.          put(⟨t, 0⟩, Tl)
7.  P(c) = |Cl| / |D|
8.  for each t in Tt do
9.      Dt = {d | ⟨t, d⟩ ∈ Tt},  Dtc = {d | d ∈ Cl ∧ ⟨t, d⟩ ∈ Tt}
10.     P(t, c) = |Dtc| / |D|,  P(t) = |Dt| / |D|
11.     remove(⟨t, 0⟩, Tl)
12.     add(⟨t, P(t, c) log [ P(t, c) / (P(t) P(c)) ]⟩, Tl)
13. return the first |Tl| / a elements of Tl

TSR by Expected Mutual Information

1   tsr_emi(D, c, a) : T′
2   var Cl, Tt, Tl : list
3   for each d ∈ D
4       if f(d, c) = true do append(d, Cl)
5       for each t ∈ d
6           put(⟨t, d⟩, Tt)
7           put(⟨t, 0⟩, Tl)
8   P(c) = |Cl| / |D|
9   for each t in Tt
10      Dtc ← {d | d ∈ Cl ∧ ⟨t, d⟩ ∈ Tt},   Dtc̄ ← {d | d ∉ Cl ∧ ⟨t, d⟩ ∈ Tt}
11      Dt̄c̄ ← {d | d ∉ Cl ∧ ⟨t, d⟩ ∉ Tt},   Dt̄c ← {d | d ∈ Cl ∧ ⟨t, d⟩ ∉ Tt}
12      P(t, c) ← |Dtc| / |D|,   P(t, c̄) ← |Dtc̄| / |D|
13      P(t̄, c) ← |Dt̄c| / |D|,   P(t̄, c̄) ← |Dt̄c̄| / |D|
14      P(t) ← (|Dtc| + |Dtc̄|) / |D|
15      remove(⟨t, 0⟩, Tl)
16      i ← P(t, c) log [ P(t, c) / (P(t) P(c)) ] + P(t̄, c) log [ P(t̄, c) / (P(t̄) P(c)) ] +
17          P(t, c̄) log [ P(t, c̄) / (P(t) P(c̄)) ] + P(t̄, c̄) log [ P(t̄, c̄) / (P(t̄) P(c̄)) ]
18      add(⟨t, i⟩, Tl)
19  sort Tl by expected mutual information scores
20  return the first |Tl| / a elements of Tl

The algorithm above is simply meant to illustrate how the estimation of probabilities works in very general terms. A practical implementation would not involve as many counting operations and would need to take into account the need to avoid zero probabilities in cases where terms do not co-occur with categories. More about the latter on page 78.

With respect to counting, for each term-category pair it would suffice to estimate P(c), P(t) and a single joint or conditional, say P(t|c) = |Dtc| / |Cl| (or P(t|c) = (|Dtc| + 1) / (|Cl| + |Tl|), using a Laplace estimator), and derive the remaining values


from it through straightforward applications of the properties of conditional probabilities:

P(t, c) = P(t|c) P(c)                  (10.1)

P(t̄, c) = (1 − P(t|c)) P(c)            (10.2)

P(t, c̄) = P(t) − P(t, c)               (10.3)

P(t̄, c̄) = 1 − P(t) − P(t̄, c)           (10.4)
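The following small Java sketch (ours, not part of the lab template) puts these estimates together: from simple document counts it computes P(c), P(t), a Laplace-smoothed P(t|c), and the four joint probabilities via (10.1)-(10.4).

public class JointEstimateSketch {

    // nDocs = |D|, nInCat = |Cl|, nWithTerm = |Dt|, nTermAndCat = |Dtc|, vocabSize = |Tl|
    public static double[] joints(int nDocs, int nInCat, int nWithTerm,
                                  int nTermAndCat, int vocabSize) {
        double pc = (double) nInCat / nDocs;
        double pt = (double) nWithTerm / nDocs;
        double ptGivenC = (nTermAndCat + 1.0) / (nInCat + vocabSize);  // Laplace estimator
        double ptc       = ptGivenC * pc;          // (10.1) P(t, c)
        double pNotT_c   = (1 - ptGivenC) * pc;    // (10.2) P(~t, c)
        double pt_NotC   = pt - ptc;               // (10.3) P(t, ~c)
        double pNotTNotC = 1 - pt - pNotT_c;       // (10.4) P(~t, ~c)
        return new double[] { ptc, pNotT_c, pt_NotC, pNotTNotC };
    }
}

Note that, because P(t|c) is smoothed while P(t) is estimated from raw counts, the derived joints are only approximate; a full implementation would smooth all estimates consistently.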

Sample ranking

• Top-ranked words for REUTERS-21578 category acq according to expected mutual information:

stake         0.046524273483876236
merger        0.03372606977021521
acquisition   0.027773960515907872
vs            0.025951563627849852
shares        0.021425776013160716
buy           0.019486040643726568
acquire       0.01772887289440033
qtr           0.017520597989855537
cts           0.016385470717669232
usair         0.016380931873857446
shareholders  0.014513788146891683
buys          0.014410168362922963

Dimensionality reduction by term extraction

• Potential problems with DR by term selection:

– synonymy, homonymy, polysemy, ...

• Term extraction attempts to generate a synthetic T′ (|T′| ≪ |T|) that maximises classifier effectiveness.

• DR by term extraction consists of two steps:

1. new (artificial) terms are extracted from the old ones

2. the old document representation is converted into a new one based on the newly extracted terms

Term clustering

• Term clustering consists in:

1. grouping words with a high degree of pairwise semantic relatedness, and

2. using the groups (frequently, their centroids) as dimensions in the vectorial representation of the document

• The centroid of a cluster C of term vectors ~x is given by:

µ⃗ = ( ∑_{~x ∈ C} ~x ) / |C|   (10.5)


A simple term clustering strategy

• Representation (data structure): a term co-occurrence matrix.

• E.g.: if the term "ballot" occurs 8 times in the corpus, and occurs in the same document as "polls" 5 times, with "seats" 3 times, with "profit" 4 times, with "holidays" never, etc., its vector representation will be:

ballot⃗ = ⟨8, 5, 3, 4, 0, . . .⟩   (10.6)

• Each row of the co-occurrence matrix representing the corpus consists of a term vector such as (10.6)

A term clustering algorithm

k-means(X = {~x1, . . . , ~xn} ⊆ R^m) : a set of clusters

1.  C : a set of clusters
2.  d : R^m × R^m → R            /* distance measure function */
3.  µ : computes the mean (centroid) of a cluster
4.  select C with k initial centres ~f1, . . . , ~fk
5.  while stopping criterion not true do
6.      for all clusters cj ∈ C do
7.          cj = {~xi | ∀ ~fl . d(~xi, ~fj) ≤ d(~xi, ~fl)}
8.      for all means ~fj do
9.          ~fj = µ(cj)
10. return C
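A concrete version of this loop is short. The sketch below (ours, purely illustrative) runs k-means over term co-occurrence rows represented as double arrays, using Euclidean distance and a fixed number of iterations as the stopping criterion:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KMeansSketch {

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double diff = a[i] - b[i]; s += diff * diff; }
        return Math.sqrt(s);
    }

    static double[] mean(List<double[]> cluster, int dim) {
        double[] m = new double[dim];
        for (double[] x : cluster)
            for (int i = 0; i < dim; i++) m[i] += x[i];
        for (int i = 0; i < dim; i++) m[i] /= Math.max(1, cluster.size());
        return m;
    }

    public static List<List<double[]>> kMeans(double[][] X, int k, int maxIter) {
        int dim = X[0].length;
        double[][] centres = Arrays.copyOfRange(X, 0, k);     // naive initialisation (needs k <= n)
        List<List<double[]>> clusters = new ArrayList<>();
        for (int iter = 0; iter < maxIter; iter++) {
            clusters = new ArrayList<>();
            for (int j = 0; j < k; j++) clusters.add(new ArrayList<>());
            for (double[] x : X) {                            // assign x to the nearest centre
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (dist(x, centres[j]) < dist(x, centres[best])) best = j;
                clusters.get(best).add(x);
            }
            for (int j = 0; j < k; j++)                       // recompute the means
                if (!clusters.get(j).isEmpty()) centres[j] = mean(clusters.get(j), dim);
        }
        return clusters;
    }
}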

Effectiveness of DR by term clustering

• Unsupervised clustering experiments using nearest neighbour clustering and co-occurrence clustering have achieved so far only marginal improvements.

• (Supervised) distributional clustering appears to be more promising:

• (Baker and McCallum, 1998) report a mere 2% effectiveness loss with an aggressiveness factor as large as 1000 for supervised distributional clustering

Latent semantic indexing

• LSI involves:

– deriving a mapping of dependences among the original terms in T from the training set, and

– mapping the original (dependent) dimensions into independent ones (extracted from T′)

• Input data structure: a term-by-document matrix

A|T|×|D| = [ a_{t1 d1}    · · ·  a_{t1 d|D|}
             ...                 ...
             a_{t|T| d1}  · · ·  a_{t|T| d|D|} ]

• Output: a reduced matrix of |T′| ≪ |T| rows.


A =
             d1  d2  d3  d4  d5  d6
cosmonaut     1   0   1   0   0   0
astronaut     0   1   0   0   0   0
moon          1   1   0   0   0   0
car           1   0   0   1   1   0
truck         0   0   0   1   0   1

Table 10.1: Sample term-document incidence matrix

An example of a term-by-document matrix

• Consider a corpus containing texts about space travel and traffic incidents.

• A term-by-document matrix can be used to record term occurrences in the corpus:

LSI from 5 dimensions onto 2 dimensions

• LSI uses Singular value decomposition to decompose the original matrix:

A_{t×d} = T_{t×n} S_{n×n} (D_{d×n})^T   (10.7)

where n ≤ min(d, t)

Note: LSI detects similarity between d1 and d2 even though these documents have no terms in common.

(Figure: the six documents of Table 10.1 plotted on two LSI dimensions; d1, d2 and d3 fall close together, as do d4, d5 and d6.)

Effectiveness of LSI

• Advantages:

– No stemming required, well understood formalism, robust (allows misspellings)

• Disadvantages:

– Computationally costly; relies on a continuous (normal distribution) model, not really appropriate for count data

• LSI works well when a great number of terms each contribute a small amount to the correct classification decision;

• It might cause drawbacks if a single term is particularly good at discriminating a document

• See (Manning and Schutze, 1999, ch. 15) for a discussion of LSI and (Manning et al., 2008) for a more complete exposition of the same topic.


Principal component analysis

• Basic idea: find the directions of maximal variance in the data

• Transform the original data's coordinates; the new coordinates are the principal components

• Transforming the original coordinates:

– Build a covariance matrix from the input data (feature vectors)

– Perform eigenvalue decomposition on the covariance matrix

– The resulting eigenvectors are the principal components

– The eigenvalues give the amount of variance for their respective principal components

The term co-occurrence matrix can be easily derived from a term-document incidence matrix (such as the one shown in Table 10.1) by multiplying it by its transpose (AA^T).
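As a quick illustration of this derivation (ours), the sketch below computes AA^T for an integer incidence matrix with terms as rows, such as the one in Table 10.1; entry (i, j) of the result is the number of documents in which terms i and j co-occur.

public class CooccurrenceSketch {

    // a: term-document incidence matrix (terms as rows); returns A * A^T
    public static int[][] cooccurrence(int[][] a) {
        int t = a.length, d = a[0].length;
        int[][] c = new int[t][t];
        for (int i = 0; i < t; i++)
            for (int j = 0; j < t; j++)
                for (int k = 0; k < d; k++)
                    c[i][j] += a[i][k] * a[j][k];   // (A A^T)_{ij}
        return c;
    }
}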

PCA in text categorisation

• Because PCA (like most other term extraction methods) is unsupervised, it can be readily used in settings where training data are scarce (Davy and Luz, 2007b)

• Similar to LSI (but applied to the covariance matrix rather than the term-document matrix)

• (Optionally) start by finding the centroid of the data:

µ = (1/ℓ) ∑_{i=1}^{ℓ} xi   (10.8)

PCA in TC (ctd)

• Build the covariance matrix C as the dot product of the "centered" data:

C = (1/ℓ) ∑_{i=1}^{ℓ} (xi − µ)(xi − µ)^T   (10.9)

• Perform eigenvalue decomposition (similar to SVD but only applicable to square matrices) to solve

Cv = λv   (10.10)

• The d largest eigenvalues are sorted (λ1 ≥ λ2 ≥ λ3 ≥ . . . ≥ λd) in descending order and their associated eigenvectors stacked to form the transformation matrix W = [v1, v2, v3, . . . , vd].

• An input feature vector x can be mapped to the PCs by simply computing y = W^T x


Testing PCA term extraction on newsgroups

• Performance of PCA dimensionality reduction on the 20 newsgroups corpus1 (Lang, 1995), versus the full feature set and a document frequency filter (Davy and Luz, 2007b):

(Figure: four plots, "Graphics vs X", "Windows vs hardware", "Atheism vs religion" and "Baseball vs Cryptography", each showing error against iterations of AL for the curves labelled PCA, Full and DFG.)

1http://qwone.com/~jason/20Newsgroups/


11 Practical: Dimensionality Reduction by Term Filtering

11.1 Goals

In this lab you will implement simple probabilistic term ranking algorithms for dimensionality reduction, or Term Space Reduction (TSR), which have been used in IR and TC systems.

These algorithms are described in the lecture notes on Dimensionality Reduction.

11.2 Software and Data

The implementation will build on lab01's NewsParser. First, download the templates for the main classes from the course website. The archive contains the following files:

|-- ...

|-- tc

|-- dstruct

| |-- BagOfWords.java

| |-- CorpusList.java

| |-- ProbabilityModel.java

| |-- StopWordList.java

| |-- WordFrequencyPair.java

| ‘-- WordScorePair.java

‘-- tsr

|-- DocumentFrequency.java

|-- InfoGain.java

|-- MakeReducedTermSet.java

‘-- TermFilter.java

It must be uncompressed into lab01, since many of the files used in that lab will also be used in lab02.

The main class implementing TSR is MakeReducedTermSet. It should be implemented so that the program accepts 5 arguments:


• a corpus list: the name of a file containing the locations of all files to be parsed and indexed (see samplecorpuslist.txt, distributed with lab01);

• a stop-word list: the name of a file containing the stop words to be removed before indexing proper begins. This file will be encapsulated by the StopWordList class;

• an aggressiveness parameter: as described in the lecture notes;

• the name of a term filtering method: one of the supported method names. In this assignment you will implement term filtering by document frequency (df) and term filtering by information gain (ig).

• a string specifying a target category or a global score generation method: a target category could be, for instance, acq; a global score method is one of MAX, SUM or WAVG, which stand respectively for fmax, fsum and fwavg, as described in the lecture notes.

The output (printed on the screen) will be a list of selected terms (a reduced term set), sorted in descending order of scores.

You are supposed to implement term ranking by subclassing TermFilter. Templates for two TSR metrics are provided: DocumentFrequency and InfoGain. In order to be able to implement these metrics, you will need to implement the relevant methods of ProbabilityModel. The methods you will need to implement or modify are marked

/** ******************** Lab 02: Exercise *********************

The comments following such tags contain further details about what needs to be done. You will modify the following files:

• MakeReducedTermSet.java,

• ProbabilityModel.java,

• TermFilter.java,

• InfoGain.java, and

• DocumentFrequency.java

Suggestion: start with tc/dstruct/ProbabilityModel.java, and before implementing the classes that handle TSR proper, modify tc/tsr/MakeReducedTermSet.java so that you can check that your tokenisation, word count, stop-word filtering etc. are working as expected.

11.3 Delivering the exercises

You will demonstrate the results (a working TSR program, or part of one) in next week's lab.

Please also submit your code through Blackboard.


12 Classifier induction: Bayesian Methods

Defining a CSV function

• Inductive construction of a text categorization module consists of defining a Categorization Status Value (CSV) function

• CSV for Ranking and Hard classifiers:

– Ranking classifiers: for each category ci ∈ C, define a function CSVi with the following signature:

CSVi : D → [0, 1]   (12.1)

– Hard classifiers: one can either define CSVi as above and define a threshold τi above which a document is said to belong to ci, or constrain CSVi to range over {T, F} directly.

Category membership thresholds

• A hard classifier status value, CSV^h_i : D → {T, F}, can then be defined as follows:

CSV^h_i(d) = T   if CSVi(d) ≥ τi,
             F   otherwise.
                                   (12.2)

• Thresholds can be determined analytically or experimentally.

• Analytically derived thresholds are typical of TC systems that output probability estimates of membership of documents to categories

• τi is then determined by decision-theoretic measures (e.g. utility)

Experimental thresholds

• CSV thresholding or SCut: SCut stands for optimal thresholding on the confidence scores of category candidates:

– Vary τi on Tv and choose the value that maximises effectiveness


• Proportional thresholding: choose τi s.t. the generality measure gTr(ci) is closest to gTv(ci).

• RCut or fixed thresholding: stipulate that a fixed number of categories are to be assigned to each document.

• See (Yang, 2001) for a recent survey of thresholding strategies.

ML methods for learning CSV functions

• Symbolic, numeric and meta-classification methods.

• Numeric methods implement classification indirectly: the classification function f outputs a numerical score, and hard classification is obtained via thresholding

– probabilistic classifiers, regression methods, ...

• Symbolic methods usually implement hard classification directly

– e.g.: decision trees, decision rules, ...

• Meta-classification methods combine results from independent classifiers

– e.g.: classifier ensembles, committees, ...

Probabilistic classifiers

• The CSV() of probabilistic classifiers produces an estimate of the conditional probability P(c|\vec{d}) = f(d, c) that an instance represented as \vec{d} should be classified as c.

• Components of \vec{d} regarded as random variables T_i (1 ≤ i ≤ |T|)

• Need to estimate probabilities for all possible representations, i.e. P(c|T_1, ..., T_n).

• Too costly in practice: for the discrete case with m possible nominal values that is O(m^{|T|})

• Independence assumptions help...

Notes on the notation:

• P(\vec{d}_j) = probability that a randomly picked text has \vec{d}_j as its representation

• P (ci) = probability that a randomly picked text belongs to ci

Conditional independence assumption

• Using Bayes’ rule we get

  P(c|\vec{d}_j) = \frac{P(c)P(\vec{d}_j|c)}{P(\vec{d}_j)}   (12.3)


• Naïve Bayes classifiers: assume T_1, ..., T_n are independent of each other given the target category:

  P(\vec{d}|c) = \prod_{k=1}^{|T|} P(t_k|c)   (12.4)

• maximum a posteriori hypothesis: choose c that maximises (12.3)

• maximum likelihood hypothesis: choose c that maximises P(\vec{d}_j|c) (i.e. assume all c's are equally likely)

Variants of Naive Bayes classifiers

• multi-variate Bernoulli models, in which features are modelled as Boolean random variables, and

• multinomial models, where the variables represent count data (McCallum and Nigam, 1998a)

• Continuous models, which use a numeric data representation: attributes represented by continuous probability distributions

– using Gaussian distributions, the conditionals can be estimated as

  P(T_i = t|c) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(t-\mu)^2}{2\sigma^2}}   (12.5)

– Non-parametric kernel density estimation has also been proposed (John and Langley, 1995)

Some Uses of NB in NLP

• Information retrieval (Robertson and Jones, 1988)

• Text categorisation (see (Sebastiani, 2002) for a survey)

• Spam filters

• Word sense disambiguation (Gale et al., 1992)

CSV for multi-variate Bernoulli models

• Starting from the independence assumption

  P(\vec{d}|c) = \prod_{k=1}^{|T|} P(t_k|c)

• and Bayes' rule

  P(c|\vec{d}_j) = \frac{P(c)P(\vec{d}_j|c)}{P(\vec{d}_j)}

• derive a monotonically increasing function of P(c|\vec{d}):

  f(d, c) = \sum_{i=1}^{|T|} t_i \log \frac{P(t_i|c)\,[1 - P(t_i|\bar{c})]}{P(t_i|\bar{c})\,[1 - P(t_i|c)]}   (12.6)


• Need to estimate 2|T| parameters, rather than 2^{|T|}.

Consider the Boolean document representation described in previous notes, where a \vec{d}_j = ⟨t_{1j}, t_{2j}, ...⟩ could be, for instance, ⟨for = 1, a = 1, typical = 1, summer = 1, winter = 0, ...⟩. This is the type of document representation we have in mind for this multi-variate Bernoulli implementation of the Naive Bayes TC.

Estimating the parameters

• For each term t_i ∈ T:

  – n_c ← the number of \vec{d} such that f(\vec{d}, c) = 1

  – n_i ← the number of \vec{d} for which t_i = 1 and f(\vec{d}, c) = 1

    P(t_i|c) ← \frac{n_i + 1}{n_c + 2}   (12.7)

    (the additions in numerator and denominator perform smoothing; see the next slides)

  – n_{\bar{c}} ← the number of \vec{d} such that f(\vec{d}, c) = 0

  – n_{\bar{i}} ← the number of \vec{d} for which t_i = 1 and f(\vec{d}, c) = 0

    P(t_i|\bar{c}) ← \frac{n_{\bar{i}} + 1}{n_{\bar{c}} + 2}   (12.8)
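The estimates in (12.7) and (12.8) map directly onto a few lines of code. The following is a minimal Java sketch, not part of the lab templates (the class name BernoulliEstimator and the array-based document representation are made up for illustration): it takes Boolean document vectors and their membership in a single target category c, and returns the smoothed estimates of P(t_i|c) and P(t_i|c̄).

/** Illustrative sketch (not from the lab code): smoothed multi-variate Bernoulli estimates. */
public class BernoulliEstimator {
    /**
     * docs[j][i] is true iff term i occurs in document j;
     * inClass[j] is true iff document j belongs to the target category c.
     * Returns a 2 x |T| array: row 0 holds P(t_i|c), row 1 holds P(t_i|notC),
     * both with add-one smoothing as in (12.7) and (12.8).
     */
    public static double[][] estimate(boolean[][] docs, boolean[] inClass) {
        int nTerms = docs[0].length;
        int nC = 0, nNotC = 0;                        // documents in c and in its complement
        int[] niC = new int[nTerms], niNotC = new int[nTerms];
        for (int j = 0; j < docs.length; j++) {
            if (inClass[j]) nC++; else nNotC++;
            for (int i = 0; i < nTerms; i++)
                if (docs[j][i]) { if (inClass[j]) niC[i]++; else niNotC[i]++; }
        }
        double[][] p = new double[2][nTerms];
        for (int i = 0; i < nTerms; i++) {
            p[0][i] = (niC[i] + 1.0) / (nC + 2.0);        // P(t_i | c)
            p[1][i] = (niNotC[i] + 1.0) / (nNotC + 2.0);  // P(t_i | not c)
        }
        return p;
    }
}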

An Alternative: multinomial models

• An alternative implementation of the Naïve Bayes Classifier is described in (Mitchell, 1997).

• In this approach, words appear as values rather than names of attributes

• A document representation for this slide would look like this:

~d = 〈a1 = ”an”, a2 = ”alternative”, a3 = ”implementation”,. . . 〉

• Problem: each attribute's value would range over the entire vocabulary. Many values would be missing for a typical document.

Dealing with missing values

• What if none of the training instances with target category c_j have attribute value a_i? Then P(a_i|c_j) = 0, and...

  P(c_j) \prod_i P(a_i|c_j) = 0

• What to do?


• Smoothing: make a Bayesian (m-)estimate of P(a_i|c_j):

  P(a_i|c_j) ← \frac{n_c + mp}{n + m}

  where:

  – n is the number of training examples for which C = c_j,

  – n_c is the number of examples for which C = c_j and A = a_i,

  – p is a prior estimate for P(a_i|c_j),

  – m is the weight given to the prior (i.e. the number of “virtual” examples).

Learning in multinomial models

NBLearn(Tr, C)
  /* collect all tokens that occur in Tr */
  T ← all distinct words and other tokens in Tr
  /* calculate P(c_j) and P(t_k|c_j) */
  for each target value c_j in C do
    Tr_j ← subset of Tr for which the target value is c_j
    P(c_j) ← |Tr_j| / |Tr|
    Text_j ← concatenation of all texts in Tr_j
    n ← total number of tokens in Text_j
    for each word t_k in T do
      n_k ← number of times word t_k occurs in Text_j
      P(t_k|c_j) ← (n_k + 1) / (n + |T|)
    done
  done

Note an additional assumption: position is irrelevant, i.e.:

P (ai = tk|cj) = P (am = tk|cj) ∀i,m

Sample Classification Algorithm

• Could calculate posterior probabilities for soft classification

  f(d) = P(c) \prod_{k=1}^{n} P(t_k|c)

  (where n is the number of tokens in d that occur in T) and use thresholding as before

• Or, for SLTC, implement hard categorisation directly:

  positions ← all word positions in d that contain tokens found in T
  Return c_{nb}, where
    c_{nb} = \arg\max_{c_i \in C} P(c_i) \prod_{k \in positions} P(t_k|c_i)
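The learning and classification pseudocode above can be sketched in Java as follows. This is an illustrative, self-contained sketch only: the class name MultinomialNB and the representation of documents as arrays of token strings are assumptions, not part of the lab code.

import java.util.*;

/** Illustrative sketch of multinomial Naive Bayes learning and hard classification. */
public class MultinomialNB {
    private final Map<String, Integer> vocab = new HashMap<>();   // token -> index
    private double[] logPrior;                                    // log P(c_j)
    private double[][] logCond;                                   // log P(t_k|c_j)

    /** docs[j] is the token sequence of document j; label[j] is in 0..nClasses-1. */
    public void learn(String[][] docs, int[] label, int nClasses) {
        for (String[] d : docs) for (String t : d) vocab.putIfAbsent(t, vocab.size());
        int V = vocab.size();
        logPrior = new double[nClasses];
        logCond = new double[nClasses][V];
        int[] docCount = new int[nClasses];
        int[][] tokenCount = new int[nClasses][V];
        int[] totalTokens = new int[nClasses];
        for (int j = 0; j < docs.length; j++) {
            int c = label[j];
            docCount[c]++;
            for (String t : docs[j]) { tokenCount[c][vocab.get(t)]++; totalTokens[c]++; }
        }
        for (int c = 0; c < nClasses; c++) {
            logPrior[c] = Math.log((double) docCount[c] / docs.length);
            for (int k = 0; k < V; k++)   // add-one smoothing: (n_k + 1) / (n + |T|)
                logCond[c][k] = Math.log((tokenCount[c][k] + 1.0) / (totalTokens[c] + V));
        }
    }

    /** Hard classification: arg max_c P(c) * prod_k P(t_k|c), computed in log space. */
    public int classify(String[] doc) {
        int best = 0; double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < logPrior.length; c++) {
            double score = logPrior[c];
            for (String t : doc) {
                Integer k = vocab.get(t);
                if (k != null) score += logCond[c][k];   // skip tokens unseen in training
            }
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;
    }
}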


Classification Performance

(Mitchell, 1997): Given 1000 training documents from each group, learn to classify new documents according to the newsgroup they came from:

  comp.graphics             misc.forsale
  comp.os.ms-windows.misc   rec.autos
  comp.sys.ibm.pc.hardware  rec.motorcycles
  comp.sys.mac.hardware     rec.sport.baseball
  comp.windows.x            rec.sport.hockey
  alt.atheism               sci.space
  soc.religion.christian    sci.crypt
  talk.religion.misc        sci.electronics
  talk.politics.mideast     sci.med
  talk.politics.misc
  talk.politics.guns

Naive Bayes: 89% classification accuracy.

Learning performance

• Learning Curve for 20 Newsgroups:

NB: TFIDF and PRTFIDF are non-Bayesian probabilistic methods we will see later in the course. See (Joachims, 1996) for details.

NB and continuous variables

• Another model: suppose we want our document vectors to represent, say, the TF-IDF scores of each term in the document:

~d = 〈a1 = tfidf(t1), . . . , an = tfidf(tn)〉 (12.9)

• How would we estimate P (c|~d)?


• A: assuming an underlying (e.g. normal) distribution:

• A: assuming an underlying (e.g. normal) distribution:

  P(c|\vec{d}) \propto \prod_{i=1}^{n} P(a_i|c), \quad\text{with}\quad P(a_i = x|c) = \frac{1}{\sigma_c\sqrt{2\pi}}\, e^{-\frac{(x-\mu_c)^2}{2\sigma_c^2}}   (12.10)

  where \mu_c and \sigma_c^2 are the mean and variance of the values taken by the attribute for positive instances.
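As a sketch of how such a continuous model is evaluated, the Gaussian conditional in (12.10) can be computed as below (illustrative code, not from the labs; in practice one would estimate a separate mean and variance per attribute and per class from the training data):

/** Illustrative sketch: Gaussian class-conditional densities for continuous attributes. */
public class GaussianNB {
    /** Density of value x under a normal distribution with mean mu and standard deviation sigma. */
    public static double density(double x, double mu, double sigma) {
        double z = (x - mu) / sigma;
        return Math.exp(-0.5 * z * z) / (sigma * Math.sqrt(2 * Math.PI));
    }

    /** Unnormalised score proportional to P(c|d): prior times the product of per-attribute densities. */
    public static double score(double[] d, double prior, double[] mu, double[] sigma) {
        double s = prior;
        for (int i = 0; i < d.length; i++) s *= density(d[i], mu[i], sigma[i]);
        return s;
    }
}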

Combining variables

• NB also allows you to combine different types of variables.

• The result would be a Bayesian network with continuous and discrete nodes. For instance:

  (Figure: a Bayesian network with class node C as parent of attribute nodes a_1, a_2, ..., a_k, ..., a_n.)

• See (Luz, 2012; Luz and Su, 2010) for examples of the use of such combinedmodels in a different categorisation task.

Naive but subtle

• Conditional independence assumption is clearly false

  P(a_1, a_2, \ldots, a_n|v_j) = \prod_i P(a_i|v_j)

• ...but NB works well anyway. Why?

• posteriors P(v_j|x) don't need to be correct; we need only that:

  \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i|v_j) = \arg\max_{v_j \in V} P(v_j) P(a_1, \ldots, a_n|v_j)

In other words, since the error in NB classification is a zero-one loss function, classification is often correct even if the posteriors are unrealistically close to 1 or 0 (Domingos and Pazzani, 1996).

Performance can be optimal if dependencies are evenly distributed over classes, or if they cancel each other out (Zhang, 2004).


Other Probabilistic Classifiers

• Alternative approaches to probabilistic classifiers attempt to improve effectiveness by:

  – adopting weighted document vectors, rather than binary-valued ones

  – introducing document length normalisation, in order to correct distortions in CSV_i introduced by long documents

  – relaxing the independence assumption (the least adopted variant, since it appears that the binary independence assumption seldom affects effectiveness)

– But see, for instance Hidden Naive Bayes (Zhang et al., 2005)...


13 Practical: Classifier induction

13.1 Goals

Implement the classifier induction module of a probabilistic (Naïve Bayes) text classifier.

13.2 Software and Data

The implementation will build on lab01's NewsParser and lab02's term set reduction algorithms. First, download the templates for the main classes from the course website. The archive contains the following files:

|-- ...

|-- tc

\-- induction

|

\-- MakeProbabilityModel.java

Most of the files used in lab01 and lab02 will also be used in this lab. MakeProbabilityModel will (using the classes implemented in previous labs) parse selected files from the Reuters corpus, store the result as a ParsedText object, generate a ProbabilityModel containing joint term-category probabilities, perform term filtering (thus reducing the number of terms in ProbabilityModel), and save this reduced ProbabilityModel as a serialised Java object. This serialised object will form the basis of the classifier to be used in validation and testing (lab04).

The program will accept 5 arguments:

1. a corpus list: the name of a file containing the locations of all files to be parsed and indexed (see samplecorpuslist.txt, distributed with lab01);

2. a stop-word list: the name of a file containing the stop words to be removed before indexing proper begins. This file will be encapsulated by the StopWordList class;


3. an aggressiveness parameter for TSR: as described in the lecture notes;

4. the name of a term filtering method, i.e. one of the metrics implemented in lab02: document frequency (df) or information gain (ig);

5. a string specifying a target category or a global score generation method, as in lab02; and

6. the name of a file to store the “serialised” version of ProbabilityModel (see the Java API documentation for details on object serialisation).

TIP: use tc.tsr.MakeReducedTermSet.java as a starting point.

13.3 Demonstrating your program

You should already start working on the assignment and demonstrate your results (a working program) in the lab after study (reading) week.


14 TC Evaluation Revisited

Text Classifier Evaluation

• Evaluation of TC systems is usually done experimentally rather than analytically

  – Analytical evaluation is difficult due to the subjective nature of the task

• Experimental evaluation aims at measuring classifier effectiveness, that is,

  – its ability to make correct classification decisions for the largest possible number of documents

• We have already seen two measures used in experimental evaluation: precision and recall; today we will characterise these measures more precisely and see other ways of evaluating TC.

Terminology

• A categorisation result can be: a true positive (TP), a false positive (FP), a true negative (TN) or a false negative (FN).

• Examples:

  Judgment                    Expert   System   Result
  “Brazil beat Venezuela”     T        F        FN
  “US defeated Afghanistan”   T        T        TP
  “Elections in Wicklow”      F        F        TN
  “Elections in Peru”         F        T        FP

Contingency tables

• Contingency tables (also known as confusion matrices) are also useful, especially for the evaluation of multi-label classification.

• Here is a contingency table for categories A, B, C, D:


                 System
            A    B    C    D
  Expert A  32   1    1    6
         B  3    25   10   2
         C  0    5    30   5
         D  4    10   6    20

E.g., for class A: the (A, A) cell counts TP, the remaining cells in column A count FP, the remaining cells in row A count FN, and all other cells count TN.

Probabilistic view of precision and recall

• Precision (π) with respect to category c_i may be defined as the following conditional probability:

  \pi_i = P(f(d_x, c_i) = T \mid \hat{f}(d_x, c_i) = T)   (14.1)

  the probability that, if a document d_x has been classified under c_i (where \hat{f} denotes the classifier's decision and f the expert's), this decision is correct.

• Analogously, recall (ρ) may be defined as follows:

  \rho_i = P(\hat{f}(d_x, c_i) = T \mid f(d_x, c_i) = T)   (14.2)

  the probability that, if a random document d_x is meant to be filed under c_i, it will be classified as such.

By looking at (14.1) and (14.2) we see that the formulae for precision and recall in slide 171 give us experimental estimates of these conditionals. We will ignore the distinction for the remainder of this lecture.

Estimating precision and recall

(Figure: the set of all texts, with the target set {d_x | f(d_x, c_i) = T} and the selected set {d_x | \hat{f}(d_x, c_i) = T} shown as overlapping regions; the overlap contains the TP_i, the rest of the selected set the FP_i, the rest of the target set the FN_i, and everything else the TN_i.)

  \pi_i = \frac{TP_i}{TP_i + FP_i} \qquad \rho_i = \frac{TP_i}{TP_i + FN_i}

To see why these definitions are equivalent to (14.1) and (14.2), define f_i = \{d \mid f(d, c_i) = T\} and \hat{f}_i = \{d \mid \hat{f}(d, c_i) = T\}.


Combining local into global measures

• Local estimates of the probabilities in (14.1) and (14.2) may be combined to yield estimates for the classifier as a whole.

• The contingency table below summarises precision and recall over a category set C = {c_1, c_2, ...}:

                        Expert judgment
                        YES                                NO
  TC system   YES   TP = \sum_{i=1}^{|C|} TP_i     FP = \sum_{i=1}^{|C|} FP_i
  judgment    NO    FN = \sum_{i=1}^{|C|} FN_i     TN = \sum_{i=1}^{|C|} TN_i

Effectiveness averaging

• Two different methods may be used to calculate global values for π and ρ: micro- and macro-averaging

• Micro-averaging:

  \pi^\mu = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FP_i)}   (14.3)

  \rho^\mu = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FN_i)}   (14.4)

Macroaveraging

• Precision macroaveraging is calculated as follows:

  \pi^M = \frac{\sum_{i=1}^{|C|} \pi_i}{|C|}   (14.5)

  where \pi_i and \rho_i are the local (per-category) scores.

• Recall macroaveraging is calculated as follows:

  \rho^M = \frac{\sum_{i=1}^{|C|} \rho_i}{|C|}   (14.6)

An example

  Category                        Sport      Politics    World
  Judgment: Expert and System     E    S     E    S       E    S
  “Brazil beat Venezuela”         T    F     F    F       F    T
  “US defeated Afghanistan”       F    T     T    T       T    F
  “Elections in Wicklow”          F    F     T    T       F    F
  “Elections in Peru”             F    F     F    T       T    T

  Precision (local):   π = 0      π = 0.67   π = 0.5
  Recall (local):      ρ = 0      ρ = 1      ρ = 0.5

  \pi^\mu = \frac{0 + 2 + 1}{0 + 2 + 1 + 1 + 1 + 1} = .5

  \pi^M = \frac{0 + .67 + .5}{3} = .39
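The micro- and macro-averages in the example can be reproduced with a short sketch like the following (illustrative class and method names; the per-category TP and FP counts are taken from the table above):

/** Illustrative sketch: micro- and macro-averaged precision from per-category counts. */
public class Averaging {
    public static double microPrecision(int[] tp, int[] fp) {
        int sumTp = 0, sumTpFp = 0;
        for (int i = 0; i < tp.length; i++) { sumTp += tp[i]; sumTpFp += tp[i] + fp[i]; }
        return (double) sumTp / sumTpFp;
    }
    public static double macroPrecision(int[] tp, int[] fp) {
        double sum = 0;
        for (int i = 0; i < tp.length; i++)
            sum += (tp[i] + fp[i]) == 0 ? 0 : (double) tp[i] / (tp[i] + fp[i]);
        return sum / tp.length;
    }
    public static void main(String[] args) {
        int[] tp = {0, 2, 1};   // Sport, Politics, World (from the example above)
        int[] fp = {1, 1, 1};
        System.out.println(microPrecision(tp, fp));  // 0.5
        System.out.println(macroPrecision(tp, fp));  // approx. 0.39
    }
}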


Other measures

• Since one knows (by experimentation) which documents fall into TP, FP, TN and FN, one may also estimate Accuracy (A) and Error (E):

  A = \frac{TP + TN}{TP + TN + FP + FN}   (14.7)

  E = \frac{FP + FN}{TP + TN + FP + FN} = 1 - A   (14.8)

• These measures, however, are not widely used in TC due to the fact that they are less sensitive to variations in the number of correct decisions than π and ρ.

Fallout and ROC curves

• A less frequently used measure is fallout (or false positive rate):

  Fallout_i = \frac{FP_i}{FP_i + TN_i}   (14.9)

• Fallout measures the proportion of non-targeted items that were mistakenly selected.

• In certain fields recall-fallout trade-offs are more common than precision-recall ones.

ROC Curves

• The receiver operating characteristic, or ROC curve, shows how different levels of fallout influence recall or sensitivity. Further explanation about ROC curves can be found in (Fawcett, 2006).

(Figure: ROC curves plotting recall (aka sensitivity) against fallout (false positive rate), both ranging from 0.0 to 1.0; the diagonal is the random-classification line, the top-left corner corresponds to perfect classification, and curves above the diagonal indicate better classification, below it worse.)

• Area under ROC curve measures categorisation performance (as τ varies).


Efficiency and utility

• Efficiency is often used as an additional criterion in TC evaluation. It may be measured with respect to:

– training or classification

• The utility criterion, from decision theory, is sometimes used:

– Assign a utility, or conversely a penalty (loss), to each possible categorisation outcome (TP, TN, FP, FN).

– Formally, define a loss function

  \lambda : \{F, T\} \times \{F, T\} \to [0, 1]

  such that \lambda(\hat{f}, f) returns the penalty to be applied to a possible categorisation decision \hat{f} with respect to the true categorisation f.

Loss functions and effectiveness

• For example, say we penalise miscategorisations equally and more than correct categorisations, i.e.

  \lambda(F, T) = \lambda(T, F) > \lambda(T, T) = \lambda(F, F)

  as in, say, zero-one loss: \lambda(F, F) = 0, \lambda(F, T) = 1.

• It can be shown (Lewis, 1995) that the threshold that minimises the expected loss under such assumptions is

  \tau > \frac{\lambda(T, F) - \lambda(F, F)}{(\lambda(F, T) - \lambda(T, T)) + (\lambda(T, F) - \lambda(F, F))}   (14.10)

• Therefore, for the “indifferent” evaluation described above, a τ > 0.5 meets the criterion.

  – An example of a situation where one might want to define loss differently is email spam filtering, where failing to discard spam (FN) is less serious than discarding a legitimate message (FP).

Combining precision and recall

• Neither π nor ρ makes much sense in isolation.

• Classifiers can be tuned to maximise one at the expense of the other.

• TC evaluation is done in terms of measures that combine π and ρ.

• We will examine two such measures:

– breakeven point and

– the F functions


Breakeven point

• The breakeven point is the value at which π equals ρ, as determined by the following process:

  – A plot of π as a function of ρ is computed by varying the thresholds τ_i for the CSV function from 1 to 0 (with the threshold set to 1, only those documents that the classifier is “totally sure” belong to the category will be selected, so π will tend to 1, and ρ to 0; as we decrease τ_i, precision will decrease, but ρ will increase monotonically)

  – The breakeven point is the value (of ρ or π) at which the plot intersects the ρ = π line.

The F functions

• The idea behind F measures (van Rijsbergen, 1979, ch. 7) is to assign a degree of importance to ρ and π.

• Let β be a factor (0 ≤ β ≤ ∞) quantifying such degree of importance. The F_β function can be defined as follows:

  F_\beta = \frac{(\beta^2 + 1)\pi\rho}{\beta^2\pi + \rho}   (14.11)

• A β value of 1 assigns equal importance to precision and recall

• The breakeven of a classifier is always less than or equal to its F_β for β = 1
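As a sketch, (14.11) in code (illustrative; F_1 is the β = 1 special case, computed here for the Politics column of the example above):

/** Illustrative sketch: the F_beta combination of precision and recall, eq. (14.11). */
public class FMeasure {
    public static double fBeta(double precision, double recall, double beta) {
        double b2 = beta * beta;
        double denom = b2 * precision + recall;
        return denom == 0 ? 0 : (b2 + 1) * precision * recall / denom;
    }
    public static void main(String[] args) {
        System.out.println(fBeta(0.67, 1.0, 1.0));   // F1 for pi = 0.67, rho = 1: approx. 0.8
    }
}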

Comparison of existing TC systems

  Type            Systems               Reported scores (Reuters corpus versions 1–5)
  non-learning    WORD                  .150, .310, .290
  probabilistic   PropBayes, Bim, Nb    .443–.650, .747–.795, .720–.815
  decision tree   C4.5, Ind             .670, .794–.884
  decision rules  Swap-1, Ripper        .683–.753, .738–.811, .820–.827
  regression      LLSF                  .855, .810, .849
  online linear   BWinnow               .747, .833, .822
  batch linear    Rocchio               .660, .748, .625–.776, .649–.799
  neural nets     Classi                .802, .820, .838
  example based   k-NN, Gis-W           .690, .852, .820, .820–.860, .823
  SVM             SVMLight              .859–.870, .864–.920
  ensemble        AdaBoost              .860, .878

Summary of results

• With respect to the TC collection on which those classifiers were tested (Sebastiani, 2002):


– SVM, boosting-based ensembles, example-based and regression methods appear to deliver the best performance (at least for the limited number of corpora they were tested on)

– Neural nets, Naive Bayes, and on-line classifiers perform slightly worse

– Rocchio performed poorly

– Although Decision Trees were not tested on a sufficient number of corpora, results seem to be encouraging


15 Practical: Completing and Evaluating the TC System

15.1 Goals

Implement the classification module of a probabilistic (Naïve Bayes) text classifier based on the categorisation status value function described in the notes on probabilistic classifiers:

  CSV_i(d_j) = \sum_{k=1}^{|T|} t_{kj} \log \frac{P(t_{kj} = 1|c_i)\,(1 - P(t_{kj} = 1|\bar{c}_i))}{P(t_{kj} = 1|\bar{c}_i)\,(1 - P(t_{kj} = 1|c_i))}   (15.1)

15.2 Software and Data

The implementation will build on lab01's NewsParser and lab02's term set reduction algorithms, and use the probability model created in lab03.

First, download the templates for the main classes from the course website. The archive contains the following files:

|-- ...

tc

|-- classify

| ‘-- BVBayes.java

‘-- evaluation

|-- CSVManipulation.class

|-- CSVTable.java

‘-- ThresholdStrategy.java

Most of the files used in lab01, lab02 and lab03 will also be used in this lab. The main class is tc.classify.BVBayes. It is so named because it implements a Bayes classifier for documents represented as Boolean vectors.

The program will accept 3 arguments:

1. a corpus list: the name of a file containing the locations of all files to be parsed and indexed (see samplecorpuslist.txt, distributed with lab01);


2. a string specifying a target category

3. the name of a file containing a “serialised” version of ProbabilityModel (see the Java API documentation for details on loading serialised objects and tc.util.IOUtil for a sample implementation)

Your task consists basically of completing tc.classify.BVBayes, and writing the code for tc.evaluation.CSVTable, a class that implements the tc.evaluation.CSVManipulation interface. For tc.classify.BVBayes, you will implement the computeCSV() method, which takes a string describing a category (e.g. acq) and a ParsedNewsItem, and computes and returns its CSV. This method will be used by the main() method, which should:

• load the probability model (created by tc.induction.MakeProbabilityModel),

• parse each file in the file list (clistfn),

• obtain a classification status value (CSV) for each news item in the resulting ParsedText by calling computeCSV(),

• store these CSVs along with each document's true classification into a CSVTable object,

• perform hard classification by applying a threshold (or thresholding strategy, e.g. proportional thresholding) to the CSV results, and

• print a summary of evaluation results (see println in BVBayes.java) for the entire test set at the end of processing.

Once lab04 is completed, you should have a functioning text classification system, including modules for pre-processing (lab01), term set reduction (lab02), classifier induction (lab02 and lab03) and classification and evaluation (lab04). Now, let's test it on the REUTERS corpus. Define a training and a test set and run a typical text classification training and testing cycle for category acq. Write a short (1 page, preferably text-only) informal report describing the procedure followed for training and testing (how was the corpus partitioned? was term space reduction performed before training? if so, how? how was the classifier trained? how was it tested? how did it perform?) Please append the output of tc.classify.BVBayes to your note.

15.3 Submitting

Archive/compress all files (including the ones used in previous labs and your report) into a single file, and submit it through Blackboard.


16 Symbolic Text Classifiers: Decision Trees

A symbolic method: decision trees

• Symbolic methods offer the advantage that their classification decisions are easily interpretable (by humans).

• Decision trees:

– data represented as vectors of discrete-valued (or discretised) attributes

– classification through binary tests on highly discriminative features

– test sequence encoded as a tree structure

(Figure: the “tennis weather” decision tree — the root tests outlook; the sunny branch tests humidity (high → no, normal → yes), the overcast branch predicts yes, and the rainy branch tests windy (true → no, false → yes).)

A Sample data set

outlook temperature humidity windy play

1 sunny hot high false no2 sunny hot high true no3 overcast hot high false yes4 rainy mild high false yes5 rainy cool normal false yes6 rainy cool normal true no7 overcast cool normal true yes8 sunny mild high false no9 sunny cool normal false yes

10 rainy mild normal false yes11 sunny mild normal true yes12 overcast mild high true yes13 overcast hot normal false yes14 rainy mild high true no

(Quinlan, 1986)


Divide-and-conquer learning strategy

• Choose the features that best divide the instance space.

• E.g. distribution of attributes for the “tennis weather” task:

  (Figure: for each attribute — outlook, temperature, humidity and windy — a bar chart showing how many play=yes and play=no instances take each of its values.)

Uses of decision trees in NLP

• Parsing (Haruno et al., 1999; Magerman, 1995)

• Text categorisation (Lewis and Ringuette, 1994)

• word-sense disambiguation,

• POS tagging,

• Speech recognition,

• Etc...

What’s a Decision Tree?

• A decision tree is a graph with:

– internal nodes labelled by terms

– edges labelled by tests (on the weight the term from which they depart has in the document)

– leaves labelled by categories

• Given a decision tree T, categorisation of a document d_j is done by recursively testing the weights of the internal nodes of T against those in \vec{d}_j until a leaf is reached.

• Simplest case: \vec{d}_j consists of Boolean (or binary) weights...


A Text Categorisation Example

(Figure: a decision tree for the Reuters category WHEAT; internal nodes test the presence (wheat) or absence (¬wheat) of terms such as wheat, farm, commodity, bushels, export, tonnes, winter, soft and agriculture, and leaves are labelled WHEAT or ¬WHEAT.)

The tree corresponds to the rule:

if (wheat ∧ farm) ∨ (wheat ∧ commodity) ∨ (bushels ∧ export) ∨ (wheat ∧ tonnes) ∨ (wheat ∧ agriculture) ∨ (wheat ∧ winter ∧ ¬soft)
then WHEAT else ¬WHEAT

Building Decision Trees

• Divide-and-conquer strategy

1. check whether all d_j have the same label

2. if not, select t_k, partition Tr into classes of documents with the same value for t_k, and place each class under a subtree

3. recur on each subtree until each leaf contains training examples assigned with the same category c_i

4. label each leaf with its respective c_i

• Some decision tree packages that have been used in TC:

– ID3, C4.5, C5

Decision tree algorithm

Algorithm 16.1: Decision tree learning

DTreeLearn(Tr: 2^D, T': 2^T, default: C): tree
  /* Tr is the training set, T' is the feature set */
  if isEmpty(Tr) then
    return default
  else if ∃ c_j s.t. \prod_{d_i ∈ Tr} f(d_i, c_j) = 1 then   /* all d_i have class c_j */
    return c_j
  else if isEmpty(T') then
    return MajorityCateg(Tr)
  else
    t_best ← ChooseFeature(T', Tr)
    tree ← new dtree with root = t_best
    for each v_k ∈ values(t_best) do
      Tr_k ← {d_l ∈ Tr | t_best has value v_k in d_l}
      sbt ← DTreeLearn(Tr_k, T' \ {t_best}, MajorityCateg(Tr_k))
      add a branch to tree with label v_k and subtree sbt
    done
    return tree


Important Issues

• Choosing the right feature (from T ) to partition the training set

– Choose feature with highest information gain

• Avoiding overfitting:

– Memorising all observations from the Tr

vs.

– Extracting patterns, extrapolating to unseen examples in Tv and Te

– Occam’s razor:

the most likely hypothesis is the simplest one which is consistent with all observations

• Inductive bias of DT learning ≈ shorter trees are preferred to larger trees.

How do we Implement ChooseFeature?

• Finding the right feature with which to partition the training set is essential.

• One can use the Information Gain (the difference between the entropy of the mother node and the weighted sum of the entropies of the child nodes) yielded by a candidate feature T:

  G(T, D) = H(D) - \sum_{t_i \in values(T)} p(t_i) H(D_{t_i})   (16.1)

  where H(D) is the information entropy of the category distribution on dataset D, that is, for the random variable C with values c_1, \ldots, c_{|C|} and PMF p(c), H(D) = -\sum_{j=1}^{|C|} p(c_j) \log p(c_j).

• The sum over the values of T is called expected entropy (Quinlan, 1986), and can be written as

  E(T) = \sum_{t_i \in values(T)} p(t_i) \times \sum_{j}^{|C|} -p(c_j|t_i) \log p(c_j|t_i)

Recall that entropy (the H(·) function above), AKA self-information, measures the amount of uncertainty w.r.t. a probability distribution. In other words, entropy is a measure of how much we learn when we observe an event occurring in accordance with this distribution.

Specifically, for Boolean-valued features

• For example, the Information Gain yielded by a candidate feature t_k which can take Boolean values (0 or 1), with respect to a binary categorisation task, is given by:

  G(T, D) = H(D) - \left[ \frac{|D_t|}{|D|} H(D_t) + \frac{|D_{\bar{t}}|}{|D|} H(D_{\bar{t}}) \right]   (16.2)

  where D_t and D_{\bar{t}} are the subsets of D containing the instances for which T has value 1 and 0, respectively.

• And

  H(D) = -\frac{|D_c|}{|D|} \log \frac{|D_c|}{|D|} - \frac{|D_{\bar{c}}|}{|D|} \log \frac{|D_{\bar{c}}|}{|D|}   (16.3)

  where |D_c| (|D_{\bar{c}}|) is the number of positive (negative) instances filed under category c in D.
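As a sketch, (16.2) and (16.3) can be computed from four counts — positive/negative documents with and without the term (illustrative class and method names, and made-up example counts):

/** Illustrative sketch: information gain of a Boolean feature for a binary category, eqs. (16.2)-(16.3). */
public class InfoGainExample {
    /** Binary entropy, in bits, of splitting n instances into (pos, n - pos). */
    static double entropy(double pos, double n) {
        if (n == 0 || pos == 0 || pos == n) return 0;
        double p = pos / n, q = 1 - p;
        return -p * Math.log(p) / Math.log(2) - q * Math.log(q) / Math.log(2);
    }
    /**
     * posWith/negWith: positive/negative documents containing the term;
     * posWithout/negWithout: positive/negative documents not containing it.
     */
    static double gain(double posWith, double negWith, double posWithout, double negWithout) {
        double nWith = posWith + negWith, nWithout = posWithout + negWithout;
        double n = nWith + nWithout;
        double hD = entropy(posWith + posWithout, n);
        return hD - (nWith / n) * entropy(posWith, nWith)
                  - (nWithout / n) * entropy(posWithout, nWithout);
    }
    public static void main(String[] args) {
        // e.g. a term present in 8 of 9 positive and 1 of 5 negative documents
        System.out.println(gain(8, 1, 1, 4));
    }
}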


Numeric vector representations

Task: classify REUTERS texts as belonging to category “earnings” (or not).

[...]<title> Cobanco Inc year net</title>

<body>Shr 34 cts vs 1.19 dlrs

Net 807,000 vx 2,858,000

Assets 510.2 mln vs 479 mln

Deposits 472 mln vs 440 mln

Loans 299.2 mln vs 327 mln

Note: 4th qtr not available. Year includes 1985

extraordinary gain from tax carry forward of 132,000 dlrs,

or five cts per shr</body>...

T = ⟨vs, mln, cts, ;, &, 000, loss, ', ", 3, profit, dlrs, 1, pct, is, s, that, net, lt, at⟩

\vec{d}_j = ⟨5, 5, 3, 3, 3, 4, 0, 0, 0, 4, 0, 3, 2, 0, 0, 0, 0, 3, 2, 0⟩

Creating the text vectors

• The feature set, T, can be selected (reduced) via one of the information-theoretic functions we have seen: document frequency, G, or χ², for example.

• We could assign a weight (w_{ij}) to each feature as follows (Manning and Schütze, 1999):

  w_{ij} = round\left( 10 \times \frac{1 + \log \#_j(t_i)}{1 + \log \sum_{l=1}^{|T|} \#_j(t_l)} \right)   (16.4)

• Features for partitioning Tr can be selected by discretising the w_{ij} values (see (Fayyad and Irani, 1993) for a commonly used method) and applying G as shown above.

A DT for the ”earnings” category

(Figure: a decision tree for the “earnings” category, after Manning and Schütze (1999). The root node n1 (P(c|n1) = 0.3) splits on the weight of cts (cts < 2 vs cts ≥ 2); one subtree, rooted at n2 (P(c|n2) = 0.116), splits on net (net < 1 vs net ≥ 1) into leaves n3 (P(c|n3) = 0.05) and n4 (P(c|n4) = 0.649); the other, rooted at n5 (P(c|n5) = 0.943), splits on vs (vs < 2 vs vs ≥ 2) into leaves n6 (P(c|n6) = 0.694) and n7 (P(c|n7) = 0.996). Each node also records the number of documents reaching it, and P(c|n) is the probability that a document at node n belongs to category c = “earnings”; the splits define decision boundaries in the term-weight space.)

Calculating node probabilities


• One can assign probabilities to a leaf node (i.e. the probability that a new document d belonging to that node should be filed under category c) as follows (using add-one smoothing):

  P(c|d_n) = \frac{|D_{cn}| + 1}{|D_{cn}| + |D_{\bar{c}n}| + 1 + 1}   (16.5)

  where

  – P(c|d_n) is the probability that a document d_n which ended up in node n belongs to category c,

  – |D_{cn}| (|D_{\bar{c}n}|) is the number of (training) documents in node n which have been assigned category c (\bar{c}).

Pruning

• Large trees tend to overfit the data.

• Pruning (i.e. removal of overspecific nodes) often helps produce better models.

• A commonly used approach:

1. Build a full decision tree

2. For each node of height 1

(a) test for statistical significance w.r.t. the leaf nodes

(b) remove the node if the expected class distribution (given the parent) is not significantly different from the observed class distribution

(c) accept node otherwise

3. until all nodes of height 1 have been tested

• The significance test could be \chi^2 = \sum_k \sum_i^{|C|} \frac{(O_{ki} - E_{ki})^2}{E_{ki}}, for example, where

  – O_{ki} is the number of observed instances of category i in partition k (i.e. those for which the test has value v_k; see algorithm 16.1), and

  – E_{ki} is the number of expected instances, e.g. E_{ki} = n_i \times \frac{\sum_j^{|C|} n_{kj}}{\sum_j^{|C|} n_j}

For instance, the number of expected instances of category c in a binary categorisation task in a partition k would be

  E_{kc} = n_c \times \frac{n_{kc} + n_{k\bar{c}}}{n_c + n_{\bar{c}}}

where n_c is the number of instances of category c in the parent node and n_{kc}, n_{k\bar{c}} the numbers of instances of category c and not category c, respectively, in leaf k.
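A rough sketch of this significance computation for one candidate split is given below (illustrative; observed[k][i] holds the number of instances of category i in partition k, and the resulting statistic would be compared against a χ² critical value to decide whether to prune):

/** Illustrative sketch: chi-squared statistic for a candidate split, as in the pruning test above. */
public class ChiSquarePruning {
    public static double chiSquared(int[][] observed) {
        int nPartitions = observed.length, nCats = observed[0].length;
        double total = 0;
        double[] catTotals = new double[nCats], partTotals = new double[nPartitions];
        for (int k = 0; k < nPartitions; k++)
            for (int i = 0; i < nCats; i++) {
                catTotals[i] += observed[k][i];
                partTotals[k] += observed[k][i];
                total += observed[k][i];
            }
        double chi2 = 0;
        for (int k = 0; k < nPartitions; k++)
            for (int i = 0; i < nCats; i++) {
                double expected = catTotals[i] * partTotals[k] / total;  // E_ki = n_i * (n_k / n)
                if (expected > 0) chi2 += Math.pow(observed[k][i] - expected, 2) / expected;
            }
        return chi2;
    }
}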


The Importance of Pruning

• Comparing full-size and pruned trees:

• Other pruning techniques include minimum description length (MDL) pruning (Mehta et al., 1995), wrapping (incrementally remove nodes, select the tree which gives peak performance on the validation set by cross-validation), etc. (Mingers, 1989).

A final example: WSD

• Consider the following occurrences of the word “bank”:

RIV y be? Then he ran down along the bank , toward a narrow , muddy path.

FIN four bundles of small notes the bank cashier got it into his head

RIV ross the bridge and on the other bank you only hear the stream , the

RIV beneath the house , where a steep bank of earth is compacted between

FIN op but is really the branch of a bank. As I set foot inside , despite

FIN raffic police also belong to the bank. More foolhardy than entering

FIN require a number. If you open a bank account , the teller identifies

RIV circular movement , skirting the bank of the River Jordan , then turn

• The WSD learning task is to learn to distinguish between the meanings “financial institution” (FIN) and “the land alongside a river” (RIV).

Task definition

• WSD can be described as a categorisation task where

– senses (FIN, RIV) are labels (C)

– the representation of instances (D) comes from the context surrounding the words to be disambiguated.

– E.g.: for T = {along, cashier, stream, muddy, ...}, we could have d_1 = ⟨along = 1, cashier = 0, stream = 0, muddy = 1, ...⟩ and

– f(d_1) = RIV

• Performance can be measured as in text categorisation (precision, recall,etc.)


A decision tree

Using the algorithm above (slide 193) we get a decision tree that tests, in turn, the context features on, river, when, from and money: the presence of on, river or when leads to RIV, and the remaining tests on from and money separate FIN from RIV. (Figure not reproduced.)

The tree was trained on a small training set with T = {small, money, on, to, river, from, in, his, accounts, when, by, other, estuary, some, with}.

Other topics

• Cost-sensitive classification; see (Lomax and Vadera, 2013) for a comprehensive survey

• Alternative attribute selection criteria: gain ratio, distance-based measures etc (Mitchell, 1997)

• Regression

• Missing features

• ...


17 Instance Based Learning

The importance of being Lazy

• Lazy learning: Generalising beyond training examples is postponed until a new instance must be classified

• “The importance of being lazy”: instead of estimating the target function once for the whole instance space, estimate it locally and differently for each new instance

• A “family” of related techniques:

– k-Nearest Neighbour

– Locally weighted regression

– Radial basis functions

– Case-based reasoning

• Lazy vs. Eager learning

Instance-Based Learning

Two classification approaches:

• Nearest neighbour:

– Given a query instance x_q, first locate the nearest training example x_n, then estimate \hat{f}(x_q) ← f(x_n)

• k-Nearest neighbour:

– Given x_q, take a vote among its k nearest neighbours, if the target function is discrete-valued:

  \hat{f}(x_q) ← \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))

– take the mean of the f values of the k nearest neighbours, if the target function is real-valued:

  \hat{f}(x_q) ← \frac{\sum_{i=1}^{k} f(x_i)}{k}


Representation

• All instances correspond to points in the n-dimensional space Rn

• As before, an instance x will be described by a feature vector:

〈a1(x), a2(x), . . . , an(x)〉

• Nearest neighbours can be defined in terms of the standard Euclidean distance (but other measures are possible):

  d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} (a_r(x_i) - a_r(x_j))^2}

The k-nearest neighbour algorithm

• Consider learning a discrete-valued function with signature f : R^n → V, for a finite V = {v_1, ..., v_n}

Training algorithm:
  For each training example ⟨x, f(x)⟩, add the example to tlist

Classification algorithm:
  Input: x_q, a query instance to be classified
  Let x_1, ..., x_k be the k nearest instances to x_q in tlist
  Return \hat{f}(x_q) ← \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))
  where δ(a, b) = 1 if a = b, and 0 otherwise
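A compact Java sketch of the classification step (illustrative class names; instances are double[] vectors with integer labels, and distances are Euclidean as defined above):

import java.util.*;

/** Illustrative sketch of k-nearest-neighbour classification with Euclidean distance. */
public class KNearestNeighbour {
    public static double distance(double[] a, double[] b) {
        double s = 0;
        for (int r = 0; r < a.length; r++) s += (a[r] - b[r]) * (a[r] - b[r]);
        return Math.sqrt(s);
    }

    /** Majority vote among the k training instances closest to the query xq. */
    public static int classify(double[][] train, int[] labels, double[] xq, int k) {
        Integer[] idx = new Integer[train.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> distance(train[i], xq)));
        Map<Integer, Integer> votes = new HashMap<>();
        for (int i = 0; i < k && i < idx.length; i++)
            votes.merge(labels[idx[i]], 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }
}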

Decision Surfaces

(Figure: a query point x_q surrounded by training instances labelled 0 and 1.)

• What would a nearest neighbour classify xq?

• What would a 5-nearest neighbour algorithm do?

• What does the decision surface (for the 1-NN classifier) look like?

– Voronoi Diagrams


Behavior in the Limit

Let p(x) define the probability that instance x will be labelled 1 (positive) versus 0 (negative).

Nearest neighbour:

• As the number of training examples → ∞, it approaches the Gibbs algorithm (i.e. choose a hypothesis according to the posterior probability distribution over X; with probability p(x) predict 1, else 0).

• Expected error at most twice the expected error of Bayes Optimal.

k-Nearest neighbour:

• As the number of training examples → ∞ and k gets large, approaches Bayes optimal

Bayes optimal: if p(x) > .5 then predict 1, else 0

Distance-Weighted kNN

One might want to weight nearer neighbours more heavily...

• For the discrete case:

  \hat{f}(x_q) ← \arg\max_{v \in V} \sum_{i=1}^{k} w_i \delta(v, f(x_i))   (17.1)

  where w_i ≡ \frac{1}{d(x_q, x_i)^2} and d(x_q, x_i) is the distance between x_q and x_i. If d(x_q, x_i) = 0, assign \hat{f}(x_q) := f(x_i).

• For real-valued target functions:

  \hat{f}(x_q) ← \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}   (17.2)

• Now we could use all training examples instead of just the k nearest: using only the nearest examples is a local method; using all of them is a global method, known as Shepard's method.

Curse of Dimensionality

Imagine instances described by 20 attributes, but only 2 are relevant to the target function. What do you think would happen?

Curse of dimensionality: nearest neighbour is easily misled when X is high-dimensional.

One approach to dealing with the curse:

• Stretch the jth axis by weight z_j, where z_1, ..., z_n are chosen to minimize prediction error

• Use cross-validation to automatically choose weights z1, . . . , zn

• Setting zj to zero eliminates this dimension altogether


When To Consider Using Nearest Neighbour

• Instances are represented by points in Rn

• Not too many attributes per instance

• Lots of training data

Advantages:

• “Training” is very fast

• Learn complex target functions (arbitrarily complex decision surfaces)

• Don’t lose information

Disadvantages:

• Slow at query time

• Easily fooled by irrelevant attributes

Instance-based methods in TC

• Basic idea: f(d, c) = true if a large enough proportion of d's neighbours are of category c

• Vector-space document representation might fall victim to the curse of dimensionality

• Use IR-based document similarity to define neighbours

• A k-NN CSV could be (Yang, 1994):

  CSV_i(d) = \sum_{d_x \in Tr_k(d)} RSV(d, d_x) \times \delta(f(d_x, c_i), true)   (17.3)

  where Tr_k(d) is the set of the k documents d_x which maximise a document similarity function RSV(d, d_x).

Locally Weighted Regression

• k-NN forms a local approximation to f for each query point x_q

• So why not form an explicit approximation \hat{f}(x) for the region surrounding x_q?

• Ways in which this could be done:

– Fit linear function to k nearest neighbours

– Fit quadratic, ...

– Produces “piecewise approximation” to f

( N.B.: Locally Weighted Regression:

– Local: based only on data near xq

– Weighted: contribution of each instance weighted by its distance to x_q

– Regression: approximates real-valued functions )


A global approximation

• Consider approximating f near xq by linear function

f(x) = w0 + w1a1(x) + . . .+ wnan(x)

• One could use gradient descent to find the coefficients that minimise the error in fitting \hat{f} to the training set D:

  E = \frac{1}{2} \sum_{x \in D} (f(x) - \hat{f}(x))^2

• The gradient descent training rule:

  \Delta w_j = \eta \sum_{x \in D} (f(x) - \hat{f}(x))\, a_j(x)

• (Recall the LMS algorithm from Lecture 3)

Other ways of minimising error

• Gradient descent isn’t the only way to find the coefficients for, say,

f(x) = w0 + w1a1(x) + . . .+ wnan(x)

• One could also use...

– a variety of search methods

– such as simulated annealing, genetic algorithms, etc

• But first... the global approximation given by gradient descent (or GA,etc) needs to be adapted...

Deriving a local approximation

• Several choices of error to minimize:

– Squared error over k nearest neighbours

  E_1(x_q) ≡ \frac{1}{2} \sum_{x \in\, k \text{ nearest neighbours of } x_q} (f(x) - \hat{f}(x))^2

– Distance-weighted squared error over all neighbours:

  E_2(x_q) ≡ \frac{1}{2} \sum_{x \in D} (f(x) - \hat{f}(x))^2\, K(d(x_q, x))

– . . .


Radial Basis Function Networks

• Global approximation to the target function, in terms of a linear combination of local approximations

• Commonly used for image classification (among other tasks)

• A different kind of neural network

• Closely related to distance-weighted regression, but “eager” instead of“lazy”

Radial Basis Function Networks

(Figure: a radial basis function network — the input attributes a_1(x), a_2(x), ..., a_n(x) feed a layer of kernel units, whose outputs are combined linearly with weights w_0, w_1, ..., w_k to produce f(x).)

where the a_i(x) are the attributes describing instance x, and

  f(x) = w_0 + \sum_{u=1}^{k} w_u K_u(d(x_u, x))

One common choice for K_u(d(x_u, x)) is

  K_u(d(x_u, x)) = e^{-\frac{1}{2\sigma_u^2} d^2(x_u, x)}
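A sketch of the forward computation of such a network (illustrative; the centres x_u, widths σ_u and weights w are assumed to have been trained already, as discussed on the next slide):

/** Illustrative sketch: output of a radial basis function network with Gaussian kernels. */
public class RbfNetwork {
    /** Squared Euclidean distance between two vectors. */
    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }
    /**
     * w[0] is the bias w_0; w[u] (u >= 1) weights the Gaussian kernel centred at
     * centres[u-1] with width sigma[u-1].
     */
    static double output(double[] x, double[][] centres, double[] sigma, double[] w) {
        double f = w[0];
        for (int u = 0; u < centres.length; u++)
            f += w[u + 1] * Math.exp(-sqDist(centres[u], x) / (2 * sigma[u] * sigma[u]));
        return f;
    }
}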

Training Radial Basis Function Networks

Q1: What xu to use for each kernel function Ku(d(xu, x))

• Scatter uniformly throughout instance space

• Or use training instances (reflects instance distribution)

Q2: How to train weights (assume here Gaussian Ku)

• First choose variance (and perhaps mean) for each Ku

– e.g., use the Expectation Maximisation (EM) algorithm

• Then hold Ku fixed, and train linear output layer

– efficient methods to fit linear function

Case-Based Reasoning

Can apply instance-based learning even when X ≠ R^n

• need a different “distance” metric

Case-Based Reasoning is instance-based learning applied to instances with symbolic logic descriptions


((user-complaint error53-on-shutdown)

(cpu-model PowerPC)

(operating-system Windows)

(network-connection PCIA)

(memory 48meg)

(installed-applications Excel Netscape VirusScan)

(disk 1gig)

(likely-cause ???))

A Case Study

• Case-Based Reasoning in CADET (Sycara et al., 1992)

• CADET: 75 stored examples of mechanical devices

– each training instance:

〈 qualitative function, mechanical structure〉

– new query: desired function,

– target value: mechanical structure for this function

• Distance metric: match qualitative function descriptions

Case-Based Reasoning in CADET

A stored case: a T-junction pipe, represented by its mechanical structure and by a qualitative function graph relating the waterflows (Q) and temperatures (T) at its ports.

A problem specification: a water faucet, given as the desired qualitative function relating the input flows and temperatures (Q1, T1 and Q2, T2) and the control signals (Ct, Cf) to the output flow and temperature (Q3, T3); the mechanical structure is what must be found. (Figures not reproduced.)

Case-Based Reasoning in CADET

• Instances represented by rich structural descriptions

• Multiple cases retrieved (and combined) to form solution to new problem

• Tight coupling between case retrieval and problem solving


Bottom line:

• Simple matching of cases is useful for tasks such as answering help-desk queries

• Area of ongoing research

Lazy and Eager Learning

Lazy: wait for the query before generalizing

• k-Nearest Neighbour, Case based reasoning

Eager: generalize before seeing query

• Radial basis function (RBF) networks, ID3, Backpropagation, Naive Bayes,. . .

Does it matter?

• Eager learner must create global approximation

• Lazy learner can create many local approximations

• if they use the same H, lazy learners can represent more complex functions (e.g., consider H = linear functions)


18 Unsupervised Learning

Supervised vs. unsupervised learning

• So far we have seen supervised learning (or classification):

– learning based on a training set where the labelling of instances represents the target (categorisation) function

– classifier implements an approximation of the target function

– outcome: a classification decision

Unsupervised learning:

• Learning based on unannotated instances;

• Outcome:

– a grouping of objects (instances and groups of instances)

– or a ranking/graph of objects w.r.t each other

Applications

• Exploratory data analysis (data mining): clustering can reveal patterns of association in the data

• Information visualisation: natural ways of displaying association patterns

– dendrograms, self-organising maps etc

• Information retrieval: keyword (Sparck Jones and Jackson, 1970) and document (van Rijsbergen, 1979) clustering.

• Improving language models

• Corpus analysis (homogeneity)

• Object and character recognition

• Dimensionality reduction by term extraction in text categorisation


             lecture  we  examined  clustering  groups  ...
lecture    = ⟨ 2,   2,   1,   2,   0, ... ⟩
we         = ⟨ 2,   2,   1,   2,   0, ... ⟩
examined   = ⟨ 1,   1,   1,   2,   0, ... ⟩
clustering = ⟨ 2,   2,   1,   3,   1, ... ⟩
groups     = ⟨ 0,   0,   0,   1,   1, ... ⟩
...

Figure 18.1: Co-occurrence vector representation for words

Data representation

• As before, vector-based representation is a popular choice. E.g.:

Types of unsupervised learning

• Clustering algorithms are the main technique for unsupervised learning;

• A taxonomy (Jain et al., 1999):

– Partitional clustering:

∗ k-means, Expectation Maximisation (EM), graph-theoretic, mode seeking

– hierarchical:

∗ single-link

∗ complete-link

∗ average-link

– Agglomerative vs. divisive

Distance and dissimilarity measures

• Given instances a, b and c represented as real-valued vectors, a distance between a and b is a function d(a, b) satisfying:

d(a, b) ≥ 0 (18.1)

d(a, a) = 0 (18.2)

d(a, b) = d(b, a) (18.3)

d(a, b) ≤ d(a, c) + d(b, c) (18.4)

• When (18.4) doesn’t hold, d is called a dissimilarity

• Euclidean distance, d(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{|\vec{x}|} (x_i - y_i)^2}, is commonly used.

Hierarchical clustering

• Input: objects represented as vectors

• Output: a hierarchy of associations represented as a “dendrogram”


(Figure: a scatter plot of instances a–g in the (x0, x1) plane, and the corresponding dendrogram with leaves f, g, d, e, a, b, c and dissimilarity on the vertical axis.)

A simple agglomerative clustering algorithm

Algorithm 18.1: Simple agglomerative hierarchical clustering

hclust(D: set of instances): tree
  var: C,   /* set of clusters */
       M    /* matrix containing distances between pairs of clusters */
  for each d ∈ D do
    make d a leaf node in C
  done
  for each pair c_i = {d_a}, c_j = {d_b} ∈ C do
    M_{i,j} ← d(d_a, d_b)
  done
  while (not all instances in one cluster) do
    find the most similar pair of clusters in M
    merge these two clusters into one cluster
    update M to reflect the merge operation
  done
  return C

Similarity

• Results vary depending on how you define similarity.

• The definition determines the type of clustering algorithm:

  – In single-link clustering, similarity is defined in terms of the minimum distance between any pair of instances:

    sim_s(c_1, c_2) = \frac{1}{1 + \min_{x_1 \in c_1, x_2 \in c_2} d(x_1, x_2)}   (18.5)

  – In complete-link, as the maximum distance between any pair of instances:

    sim_c(c_1, c_2) = \frac{1}{1 + \max_{x_1 \in c_1, x_2 \in c_2} d(x_1, x_2)}   (18.6)

  – and in average-link, as the mean distance:

    sim_a(c_1, c_2) = \frac{1}{1 + \frac{1}{|c_1||c_2|} \sum_{x_1 \in c_1} \sum_{x_2 \in c_2} d(x_1, x_2)}   (18.7)
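The three definitions can be sketched in Java as follows (illustrative; clusters are lists of real-valued vectors and d is the Euclidean distance defined earlier):

import java.util.List;

/** Illustrative sketch: single-, complete- and average-link similarities, eqs. (18.5)-(18.7). */
public class Linkage {
    static double d(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
    static double singleLink(List<double[]> c1, List<double[]> c2) {
        double min = Double.POSITIVE_INFINITY;
        for (double[] x1 : c1) for (double[] x2 : c2) min = Math.min(min, d(x1, x2));
        return 1.0 / (1.0 + min);
    }
    static double completeLink(List<double[]> c1, List<double[]> c2) {
        double max = 0;
        for (double[] x1 : c1) for (double[] x2 : c2) max = Math.max(max, d(x1, x2));
        return 1.0 / (1.0 + max);
    }
    static double averageLink(List<double[]> c1, List<double[]> c2) {
        double sum = 0;
        for (double[] x1 : c1) for (double[] x2 : c2) sum += d(x1, x2);
        return 1.0 / (1.0 + sum / (c1.size() * c2.size()));
    }
}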


How do the different definitions affect clustering?

(Figure: the same eight points d1–d8, arranged in two rows of four, clustered with single-link and with complete-link.)

• Single-link tends to produce “straggly” or elongated clusters whereas complete-link tends to produce more compact groups (Manning and Schutze, 1999):

Why are elongated clusters sometimes a bad thing?

• noisy data in the vicinity of clusters might lead to incorrect merging:

(Figure: two well-separated groups of A points and B points connected by a chain of noise points (*); single-link clustering would merge the two groups through the chain.)

Potential errors through complete-link

• But complete-link tends to be sensitive to outliers;

• Consider the following example:

(Figure: points at positions 1.2, 4, 5.2, 6 and 6.9 on a line; with complete-link, the outlier at 1.2 pulls the cluster boundary away from the intuitively better grouping.)

• Average-link tends to produce good results, in general.


Examples: clustering of RCV1 documents

• Dendrogram for single-link clustering of 30 RCV1 documents (Manning, Raghavan & Schütze, in press):

(Figure: single-link dendrogram whose leaves are story titles such as “Ag trade reform”, “Lloyd's CEO questioned”, “War hero Colin Powell”, “Oil prices slip”, “Lawsuit against tobacco companies”, “Most active stocks”, “NYSE closing averages” and “Fed holds interest rates steady”.)

Examples: clustering of RCV1 documents

• Dendrogram for complete-link clustering of 30 RCV1 documents:

(Figure: complete-link dendrogram of the same 30 RCV1 documents.)

Partitional clustering

An example of partitional clustering is k-means clustering:

Algorithm 18.2: K-means clustering

k-means(X = {\vec{d}_1, ..., \vec{d}_n} ⊆ R^m, k): 2^{R^m}
  C : 2^{R^m}           /* a set of clusters */
  d : R^m × R^m → R     /* distance function */
  µ : 2^{R^m} → R^m     /* µ computes the mean (centroid) of a cluster */
  select C with k initial centres \vec{f}_1, ..., \vec{f}_k
  while stopping criterion not true do
    for all clusters c_j ∈ C do
      c_j ← {\vec{d}_i | ∀ f_l : d(\vec{d}_i, \vec{f}_j) ≤ d(\vec{d}_i, \vec{f}_l)}
    done
    for all means \vec{f}_j do
      \vec{f}_j ← µ(c_j)
    done
  done
  return C
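A minimal Java sketch of the algorithm (illustrative; it initialises the centres to the first k points and runs for a fixed number of iterations rather than testing a convergence criterion):

import java.util.Arrays;

/** Illustrative sketch of k-means clustering on real-valued vectors. */
public class KMeans {
    /** Returns, for each point in X, the index of the cluster it is assigned to. */
    public static int[] cluster(double[][] X, int k, int iterations) {
        int m = X[0].length;
        double[][] centres = new double[k][];
        for (int j = 0; j < k; j++) centres[j] = Arrays.copyOf(X[j], m);  // naive initialisation
        int[] assign = new int[X.length];
        for (int it = 0; it < iterations; it++) {
            for (int i = 0; i < X.length; i++) {             // assignment step
                double best = Double.POSITIVE_INFINITY;
                for (int j = 0; j < k; j++) {
                    double dist = 0;
                    for (int r = 0; r < m; r++)
                        dist += (X[i][r] - centres[j][r]) * (X[i][r] - centres[j][r]);
                    if (dist < best) { best = dist; assign[i] = j; }
                }
            }
            double[][] sums = new double[k][m];               // update step: recompute means
            int[] counts = new int[k];
            for (int i = 0; i < X.length; i++) {
                counts[assign[i]]++;
                for (int r = 0; r < m; r++) sums[assign[i]][r] += X[i][r];
            }
            for (int j = 0; j < k; j++)
                if (counts[j] > 0)
                    for (int r = 0; r < m; r++) centres[j][r] = sums[j][r] / counts[j];
        }
        return assign;
    }
}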

k-means characteristics

• Need to select the number of clusters in advance

• Might converge to a local minimum

• But...

– it is more efficient (lower computational complexity) than hierarchical clustering

• K-means can be seen as a specialisation of the expectation maximisation(EM) algorithm

Another Example: term extraction for TC

Sample co-occurrence matrix for a subset of REUTERS-21578:

usair      20  2  0  1  0  4  1  0  0  0  1  0  2  0  3  0  0  0  1  0  1  1  0  0  0  0  0  0  0  2 14  1  3
voting      2 10  0  2  0  1  0  0  0  0  0  0  2  0  0  0  1  0  0  0  1  0  1  0  0  0  0  0  0  0  2  0  0
buyout      0  0  8  1  0  2  0  0  0  0  0  0  1  1  0  0  0  0  0  3  0  0  0  0  0  0  0  1  0  1  0  0  0
stake       1  2  1 62  0  0  0  1  1  0  0  1  2  0  0  0  0  2  0  1  0  0  1  0  1  2  1  0  0  0  1  0  0
santa       0  0  0  0  7  3  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  2  0  0  0  2  0  0  1  0
merger      4  1  2  0  3 48  0  1  3  0  2  0  1  2  4  1  0  0  1  1  2  0  0  0  2  0  0  1  0  5  4  3  1
ownership   1  0  0  0  0  0  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0
rospatch    0  0  0  1  0  1  0  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
rexnord     0  0  0  1  0  3  0  0  5  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0
designs     0  0  0  0  0  0  0  0  0  5  0  0  1  0  0  0  0  0  0  0  0  0  0  2  0  0  0  0  1  1  0  0  0
pie         1  0  0  0  0  2  0  0  0  0  5  0  0  0  4  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  2  0  1
...

Table 18.1: Word co-occurrence matrix

Term Extraction with k-means

K-means (k = 5) clustering of the words in the co-occurrence matrix (Table 18.1):


Cluster  Elements
1        stake
2        usair, merger, twa
3        acquisition
4        acquire
5        voting, buyout, santa, ownership, rospatch, rexnord, designs, pie, recommend, definitive, piedmont, consent, boards, dome, obtain, leveraged, comply, phoenix, core, manufactures, midnight, islands, axp, attractive, undisclosed, interested, trans
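A sketch of how such a clustering could be produced with scikit-learn, using the rows of a co-occurrence matrix as word vectors; the six-word subset of Table 18.1 and the choice k = 3 are for illustration only.

import numpy as np
from sklearn.cluster import KMeans

words = ["usair", "voting", "buyout", "stake", "santa", "merger"]   # subset of Table 18.1
M = np.array([[20,  2, 0,  1, 0,  4],
              [ 2, 10, 0,  2, 0,  1],
              [ 0,  0, 8,  1, 0,  2],
              [ 1,  2, 1, 62, 0,  0],
              [ 0,  0, 0,  0, 7,  3],
              [ 4,  1, 2,  0, 3, 48]], dtype=float)   # truncated co-occurrence counts

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(M)
for c in range(3):
    print(c, [w for w, lab in zip(words, km.labels_) if lab == c])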

Term extraction by hierarchical clustering

Hierarchical clustering (complete-link) of the words in the co-occurrence matrix (Table 18.1):

[Figure: complete-link dendrogram of the words stake, acquisition, merger, acquire, undisclosed, usair, twa, voting, pie, piedmont, definitive, buyout, leveraged, trans, ownership, boards, designs, manufactures, recommend, islands, rospatch, obtain, comply, phoenix, attractive, consent, axp, dome, core, santa, interested, rexnord and midnight; vertical axis: dissimilarity (0–80).]
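A corresponding sketch with SciPy, which produces a complete-link dendrogram of the kind described above (the same truncated co-occurrence vectors are used as word representations):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

words = ["usair", "voting", "buyout", "stake", "santa", "merger"]
M = np.array([[20,  2, 0,  1, 0,  4],
              [ 2, 10, 0,  2, 0,  1],
              [ 0,  0, 8,  1, 0,  2],
              [ 1,  2, 1, 62, 0,  0],
              [ 0,  0, 0,  0, 7,  3],
              [ 4,  1, 2,  0, 3, 48]], dtype=float)

Z = linkage(M, method="complete", metric="euclidean")   # complete-link merge history
dendrogram(Z, labels=words)
plt.ylabel("dissimilarity")
plt.show()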

Unsupervised word-sense disambiguation

Concordances for the word “bank”

...
RIV   ... y be? Then he ran down along the    bank   , toward a narrow, muddy path ...
FIN   ... four bundles of small notes the     bank   cashier got it into his head ...
RIV   ... ross the bridge and on the other    bank   you only hear the stream, the ...
RIV   ... beneath the house, where a steep    bank   of earth is compacted between ...
FIN   ... op but is really the branch of a    bank   . As I set foot inside, despite ...
FIN   ... raffic police also belong to the    bank   . More foolhardy than entering ...
FIN   ... require a number. If you open a     bank   account, the teller identifies ...
RIV   ... circular movement, skirting the     bank   of the River Jordan, then turn ...
...


Sample run of 2-means clustering

• k-means clusters the lines into the following groups:

Cluster 1: fin1, riv3, fin4, fin6, fin7, fin9, fin10, riv15, riv16, fin19, fin20, fin22, fin23, fin24, fin25, fin26, riv27, fin28, fin29, fin32, fin33, fin34

Cluster 2: riv2, riv5, riv8, riv11, riv12, riv13, riv14, riv17, riv18, riv21, riv30, riv31, riv35
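A sketch of how such an experiment might be set up: each occurrence of “bank” is represented by a bag-of-words vector of its context and the occurrences are clustered with 2-means. The context strings below are paraphrased placeholders rather than the actual corpus lines.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

contexts = [
    "he ran down along the bank toward a narrow muddy path",            # river sense
    "four bundles of small notes the bank cashier got into his head",   # financial sense
    "across the bridge and on the other bank you only hear the stream", # river sense
    "if you open a bank account the teller identifies you",             # financial sense
]

X = CountVectorizer().fit_transform(contexts)      # bag-of-words context vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # occurrences grouped into two (unlabelled) sense clusters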

Hierarchical clustering (single-link) of senses of the word “bank”

[Figure: single-link dendrogram of the 35 occurrences of “bank” (each leaf labelled fin or riv according to its true sense); vertical axis: dissimilarity (approximately 3.6–4.4). The leaf ordering shows financial and river occurrences partly interleaved.]

Further topics

• Efficient clustering algorithms

• Cluster labelling

• Cluster evaluation

• Expectation Maximisation (EM) clustering and applications

• Clustering and information visualisation:

– Self-organizing maps (SOMs)

– Artificial Neural Networks (ANNs)


19 Brief survey of other methods

Outline

• Survey of other popular methods used in Text Classification

– Rocchio

– Decision Rules

– Regression Methods

– Support Vector Machines

– Classifier Ensembles

Rocchio (a non-ML classification method)

• Basic idea:

– Divide the vector space (training data) into regions centred around “prototypes” (centroids)

– Classify based on similarity of the query document to each prototype (multi-class text categorisation)

– Resembles k-NN classification and k-means clustering (without the k’s)

– IR-inspired technique, connections with relevance feedback

Rocchio and the vector space

• Document representation: length normalised real-valued vectors

~d = ~d / ||~d|| = ~d / √( ∑_{i=1}^{|T|} t_i² )    (19.1)

where T is the feature set.

• ...DRAW DIAGRAM (non-linear decision surfaces)...


Rocchio decision surfaces

• Compute the prototype of each category (i.e. the centroid of the set of document vectors labelled with that category):

~µc = (1/|Dc|) ∑_{d∈Dc} ~d    (19.2)

• Hyperplanes separating two classes are defined by the sets of points located at equal distances from the two class prototypes

• ...DRAW DECISION BOUNDARIES ON 2-D SURFACE (linear Rocchio)...

Rocchio categorisation

• In practice one does not need to compute decision surfaces explicitly.

• Given a document d to be classified, assign it to the category c whose prototype is most similar to d:

f(d) = arg max_c sim(~d, ~µc)    (19.3)

where sim(~d, ~µc) could be defined in terms of Euclidean distance (e.g. 1/(1 + ||~d − ~µc||)) or cosine similarity.
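A minimal sketch of this scheme (one centroid per category, cosine similarity; the toy vectors and category names are invented):

import numpy as np

def train_rocchio(X, y):
    """X: (n_docs, n_terms) length-normalised document vectors; y: category labels.
    Returns a dict mapping each category to its prototype (centroid)."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify_rocchio(prototypes, d):
    """Assign d to the category whose prototype is most cosine-similar to it."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max(prototypes, key=lambda c: cos(prototypes[c], d))

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]])
y = np.array(["wheat", "wheat", "acq", "acq"])
protos = train_rocchio(X, y)
print(classify_rocchio(protos, np.array([0.8, 0.2])))    # -> "wheat"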

Remarks on Rocchio classifiers

• Categories in Rocchio must be approximate spheres with similar radii.

• This may lead to misclassification in cases of multimodal categories ...DRAW...

• Binary classification is also possible

f(d, c) = { true   if ||~µc − ~d|| < t
            false  otherwise            (19.4)

for a threshold value t.

Decision Rule Learners

• Goal: to automatically learn (from annotated data) rules such as the following:

if (wheat & farm) or
   (wheat & commodity) or
   (bushels & export) or
   (wheat & tonnes) or
   (wheat & winter & (¬ soft))
then WHEAT else (¬ WHEAT)

• As with Decision Trees, Decision Rules can encode any Boolean function...

• But more compactly.
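As a reading aid, the WHEAT rule above can be written as a Boolean function over the set of terms occurring in a document (a sketch; the function name is ours, not part of any rule-learning system):

def wheat_rule(terms):
    """Decision rule for the WHEAT category, applied to a document's term set."""
    return (("wheat" in terms and "farm" in terms) or
            ("wheat" in terms and "commodity" in terms) or
            ("bushels" in terms and "export" in terms) or
            ("wheat" in terms and "tonnes" in terms) or
            ("wheat" in terms and "winter" in terms and "soft" not in terms))

print(wheat_rule({"wheat", "winter", "harvest"}))   # True
print(wheat_rule({"wheat", "winter", "soft"}))      # False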


How rule sets are induced

• Initially all rules that fully classify the training set are gathered...

• Then, the “best” rules are chosen.

• Note the contrast between the top-down (“divide-and-conquer”) learning strategy adopted in decision tree learning and the bottom-up strategy adopted by decision rule learners.

• The most general rule set could contain, for each document di ∈ Tr, a rule of the form η1, . . . , ηn → c (or → ¬c, depending on whether di is a positive or negative example of c), where the ηj’s are the terms occurring in di

How rule sets are reduced

• A classifier containing rules such as these (η1, . . . , ηn → c) is already a DNF classifier for c

• “All” one needs to do now is generalise the rule set so that it won’t be overfitted to Tr (the tricky bit)

• The generalisation process aims at maximising the classifier’s compactness without necessarily reducing its coverage

• This is usually done by:

– transforming the clauses (e.g. by removing disjuncts, merging rules, etc)

Some rule learners in TC

• Rule learners vary in the way they implement the rule reduction step.

• The following rule learners use different strategies and have been appliedto TC tasks (see (Sebastiani, 2002, pp 26–27) for references):

– CHARADE

– DL-ESC

– RIPPER

– SCAR

– SWAP-1

• Inductive logic programming has also been applied, though it seems that using FOL doesn’t substantially improve classification

Regression methods for TC learning

• What is a Regression?

– The approximation of a real-valued function f : D × C → [0, 1] by means of a function f̂ that fits the training data

• Linear Least Squares Fit (LLSF) has been used in TC by (Yang and Chute, 1994)

• Each document di is associated with two weighted vectors:

– an input vector I(di) (with features from the reduced term set T ), and

– an output vector of categories in C: O(di)


Building a LLSF classifier

• Weights for O(dj) are binary for training documents and non-binary for test documents

• The CSV task is one of finding the best O(dj) for a given test document with input vector I(dj)

• Therefore, building the classifier comprises computing a |C| × |T | matrix M such that M I(dj) = O(dj)

• A linear least squares fit that minimises the error on Tr is used in the computation of M

LLSF Data representation

• Matrix representation for input documents, I, and (training) output categorisation, O:

• From (Yang and Chute, 1994). Note that, in our notation, I = A^T and O = B^T

Finding a linear least square fit to Tr

• A matrix M that minimises training set error can be computed by the following formula:

M = arg min_M ||MI − O||_F    (19.5)

where ||·||_F is the Frobenius norm, i.e., for a |C| × |T | matrix V :

||V ||_F =def √( ∑_{i=1}^{|C|} ∑_{j=1}^{|T |} v_ij² )    (19.6)

• A generic entry mkl represents the degree of association between a category ck and term tl


Formula (19.6) represents the so-called Frobenius norm of a |C| × |T | matrix, I is the |T | × |Tr| matrix whose columns are input vectors of the training documents, and O is the |C| × |Tr| matrix whose columns are output vectors of the training documents.

NB: In the (Yang and Chute, 1994) example given above, the matrices are transposed.

LLSF Association Matrix

• An example (also from (Yang and Chute, 1994)) of the matrix M :

• Once we have computed M , we can project an input document representation ~d = (t1, . . . , tn) into a target category representation ~c by computing:

~c = (M ~d^T)^T    (19.7)
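A sketch of how M could be estimated and applied with NumPy: minimising ||MI − O||_F is an ordinary least-squares problem, solvable with numpy.linalg.lstsq. The tiny I and O matrices below are invented for illustration.

import numpy as np

I = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.]])     # |T| x |Tr|: term vectors of 3 training documents
O = np.array([[1., 0., 1.],
              [0., 1., 0.]])     # |C| x |Tr|: category vectors of the same documents

# Minimise ||M I - O||_F by solving I^T M^T = O^T in the least-squares sense.
M = np.linalg.lstsq(I.T, O.T, rcond=None)[0].T      # shape |C| x |T|

d = np.array([1., 0., 1.])       # term vector of a new document
print(M @ d)                     # graded category scores, as in equation (19.7)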

TC by Support Vector Machines

• Advantages (with respect to previous methods):

– scalability to high dimensionalities

– and robustness to overfitting,

– which might make feature selection irrelevant

• Nice theoretical underpinnings (statistical learning theory)

– Capacity: ability of the algorithm to learn any training set without error

– Minimise generalisation loss rather than empirical loss

– Structural risk minimization

• A handy “trick” to deal with non-linearly separable categories

• See e.g. (Burges, 1998) for details

An example: linearly separable 2-d case

• SVM can be explained in geometrical terms as follows:

– Decision surfaces: planes σ1, . . . , σn in a |T |-dimensional space which separate positive and negative training examples

– Given σ1, σ2, . . ., find a σi which separates positive from negative examples by the widest possible margin


– Defined by the support vectors

• Assume positive (ci = +1) and negative (ci = −1) instances are linearly separable (decision surfaces are (|T | − 1)-dimensional hyperplanes):

[Figure: positive (+) and negative (o) instances in two dimensions, with the maximum-margin separating hyperplane σi highlighted as the “best” decision surface.]

Principle

• Remark: the two margin (“support”) hyperplanes are entirely determined by the nearest points (the support vectors)

• The two “support” hyperplanes can be defined as:

~w · ~x + b = 1
~w · ~x + b = −1    (19.8)

• The distance between the two planes is 2/||~w||, so maximising the margin amounts to minimising ||~w||.

• The points must not belong to the area between the two hyperplanes:

~w · ~xi + b ≥ 1    if yi = 1
~w · ~xi + b ≤ −1   if yi = −1    (19.9)

which can be rewritten as:

yi(~w · ~xi + b) ≥ 1  for all 1 ≤ i ≤ n    (19.10)

Solving

• The separator is found by solving:

α* = arg max_α ( ∑_i αi − (1/2) ∑_{i,j} αi αj yi yj (~xi · ~xj) )    (19.11)

where αi ≥ 0 and ∑_i αi yi = 0

• can be solved with quadratic programming

• training data only appears in the form of dot products

– Crucial property to generalize to the nonlinear case

• CSV for input vector ~x: sign(~w · ~x + b)
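A sketch of a linear SVM on a toy, linearly separable 2-D problem using scikit-learn (which solves the dual problem (19.11) internally); the data points and the large C value approximating a hard margin are arbitrary choices.

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],    # positive instances
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])   # negative instances
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)          # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("support vectors:", clf.support_vectors_)
print("prediction:", np.sign(w @ np.array([2.0, 1.5]) + b))   # CSV = sign(w.x + b)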


Non linearly separable cases

• SVMs can also be effective in cases where positive and negative training instances are not linearly separable

• The kernel trick:

– Non-linearly separable data almost always becomes separable if mapped into a high enough number of dimensions

– Replace the dot product in equation (19.11) by a kernel function, e.g. the polynomial kernel

K(~xi, ~xj) = (1 + ~xi · ~xj)n (19.12)

(which corresponds to a feature space with dimension exponential inn)

A non linearly-separable example

(a) A 2-D training set with positive examples shown as black circles and negative examples as white circles. The true decision boundary, x1² + x2² ≤ 1, is also shown (Russell and Norvig, 2003).

(b) The same data after mapping into the 3-D feature space (x1², x2², √2 x1x2), which corresponds to the kernel K(~xi, ~xj) = (~xi · ~xj)². The circular decision boundary in (a) becomes a linear decision boundary in three dimensions.

TC by Classifier ensembles

• Basic idea: to apply different classifiers f1, f2, . . . to the classification task and then combine the outputs appropriately

• Ensembles are characterised according to:

– the kinds of classifiers fi they employ: ideally, these classifiers should be as independent as possible

– the way they combine multiple classifier outputs. Examples:

∗ majority voting (for committees of binary classifiers), weighted linear combination (for probabilistic outputs), dynamic classifier selection, adaptive classifier selection, ...


Boosting

• The basic idea: all classifiers in the ensemble are obtained via the same learning method

• Classifiers are trained sequentially, rather than independently (i.e. the training of fi takes into account the performance of f1, . . . , fi−1)

• ADABOOST: each pair < dj , ci > is assigned an “importance weight” h^t_ij in ft, which represents how hard it is to get a correct decision for < dj , ci > in f1, . . . , ft−1

– ADABOOST.MH: maximises microaveraged effectiveness

– ADABOOST.MR: minimises ranking loss

AdaBoost Algorithm

Algorithm 19.1: AdaBoost

1  AdaBoost(Tr, learn, K): weighted majority hypothesis
2  /* Tr is the training set, learn is a learning
3     algorithm, K is the size of the ensemble */
4  ~w ← 1/|Tr|              /* ~w is a vector of size |Tr| */
5  for k = 1 to K do
6    fk ← learn(Tr, ~w)     /* hypotheses ~f = (f1, . . . , fK) */
7    error ← 0
8    for j = 1 to |Tr| do
9      if fk(dj) ≠ f(dj)
10       error ← error + wj
11   for j = 1 to |Tr| do
12     if fk(dj) = f(dj)
13       wj ← wj × error/(1 − error)
14   ~w ← normalise(~w)
15   /* update hypothesis weights ~z = (z1, . . . , zK) */
16   zk ← log((1 − error)/error)
17 return weighted_majority(~f, ~z)
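A sketch of Algorithm 19.1 in Python, using decision stumps from scikit-learn as the base learner and assuming labels in {−1, +1}; in practice a library implementation such as sklearn.ensemble.AdaBoostClassifier would normally be preferred.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, K=10):
    """y is assumed to contain labels -1/+1; returns (hypotheses, hypothesis weights)."""
    n = len(X)
    w = np.full(n, 1.0 / n)                        # instance weights
    hyps, zs = [], []
    for _ in range(K):
        f = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = f.predict(X)
        error = w[pred != y].sum()
        if error == 0 or error >= 0.5:             # degenerate round: stop early
            break
        w[pred == y] *= error / (1 - error)        # down-weight correctly classified items
        w /= w.sum()                               # normalise
        hyps.append(f)
        zs.append(np.log((1 - error) / error))     # hypothesis weight
    return hyps, np.array(zs)

def weighted_majority(hyps, zs, X):
    votes = sum(z * h.predict(X) for h, z in zip(hyps, zs))
    return np.sign(votes)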

Other ML algorithms used in TC

• Many other ML approaches have been used in TC which are not covered in these lectures. These include:

– Neural Networks

– Belief Networks (aka Bayesian inference networks etc)

– Genetic Algorithms

– Maximum Entropy Modelling (see (Manning and Schutze, 1999, chapter 15) for a good introduction)


Reported effectiveness of different TC approaches (the values in each row were reported on different versions of the Reuters corpus):

Type            Systems               Reported effectiveness
non-learning    WORD                  .150, .310, .290
probabilistic   PropBayes, Bim, Nb    .443–.650, .747–.795, .720–.815
decision tree   C4.5, Ind             .670, .794–.884
decision rules  Swap-1, Ripper        .683–.753, .738–.811, .820–.827
regression      LLSF                  .855, .810, .849
online linear   BWinnow               .747, .833, .822
batch linear    Rocchio               .660, .748, .625–.776, .649–.799
neural nets     Classi                .802, .820, .838
example-based   k-NN, Gis-W           .690, .852, .820, .820–.860, .823
SVM             SVMLight              .859–.870, .864–.920
ensemble        AdaBoost              .860, .878


20 Active Learning

Active Learning

• Passive vs Active Learning (aka optimal experiment design)

– Control over the training experience

– Cost of labelling

– The right choice of training instances can make learning more effec-tive

• Motivation for Active Learning

– Large amount of data

– But most of them are unlabelled

– Practical relevance:

∗ Text categorisation

∗ Speech recognition

∗ Sensor data

∗ ...

The promise...

The following example (Dasgupta, 2011) illustrates the potential advantages of active learning:

• Task: learn to separate two linearly separable classes of instances represented as real numbers in [0, 1]. E.g.:

[Figure: instances on the real line, with negative instances (−) to the left of a threshold w and positive instances (+) to the right.]

• For standard supervised learning, we would need n = 1/ε labelled instances to learn the boundary w with error at most ε (Mitchell, 1997, ch. 7)

• Now, place n unlabelled instances on the real line.

• Then perform a binary search:

– choose the midpoint instance and reveal its label


– if + (resp. -), choose the midpoint of the instances to the left (resp. right)

– repeat until convergence.

• Since the cost of binary search is O(log n), this toy active learner gets an exponential speedup over supervised learning.
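A sketch of this toy active learner as a binary search over a sorted pool of unlabelled points; the hidden threshold and the synthetic oracle below are made up for illustration.

import numpy as np

def binary_search_al(pool, oracle):
    """pool: sorted 1-D array of unlabelled instances; oracle(x) returns +1 or -1.
    Returns (index of the first positive instance, number of label queries)."""
    lo, hi, queries = 0, len(pool), 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(pool[mid]) == 1:      # boundary lies to the left of pool[mid]
            hi = mid
        else:                           # boundary lies to the right
            lo = mid + 1
    return lo, queries                  # O(log n) label queries

pool = np.sort(np.random.default_rng(0).uniform(0, 1, 1000))
w = 0.37                                # hidden threshold (illustrative)
boundary, q = binary_search_al(pool, lambda x: 1 if x >= w else -1)
print(boundary, q)                      # about log2(1000) ≈ 10 queries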

A typical scenario: pool-based AL

[Figure: the pool-based active learning cycle: a model is learned from the labelled training set L, queries are selected from the unlabelled pool U, and an oracle (e.g. a human annotator) labels them so they can be added to L.]

Figure 20.1: Pool-based active learning cycle (from (Settles, 2009))

Three sampling methods

Sampling: the choice of queries to be annotated (and added to the training set).

[Figure: three high-level sampling strategies: in membership query synthesis the model generates a query de novo; in stream-based selective sampling an instance is drawn from the input distribution and the model decides to query or discard it; in pool-based sampling the model selects the best query from a large pool of instances sampled from the instance space. In all cases the query is labelled by the oracle.]

Figure 20.2: High-level sampling methods (from (Settles, 2009))

Query synthesis

• Learning with membership queries (Angluin, 1988)

• Can select any instance in the instance space


• Learner generates instances rather than sample from existing unlabelledset

• Pros:

– Computationally tractable (for finite domains)

– Can be extended to regression tasks

– Promising approach in the automation of experiments that do not require a human annotator (see the interesting application to “automated scientific discovery” (King et al., 2009))

• Cons:

– Human annotators may have difficulty interpreting and labelling arbitrary instances

– E.g. unintelligible text in TC applications

Stream based active learning

• A selective sampling approach is adopted (Cohn et al., 1994):

– Sample an instance from a natural distribution (unlabelled set)

– Let the learner decide whether to request a label or discard the sample

• Different strategies for deciding whether to query the oracle:

– use an informativeness measure

– adopt different query strategies (e.g. Query By Committee (Freund et al., 1997))

• Pros:

– Has been tested in many NLP tasks (POS tagging, information retrieval, word-sense disambiguation etc)

– Can reduce annotation effort and improve efficiency of certain classifiers (e.g. by reducing the size of the database in k-NN classification)

• Cons:

– Can be computationally intractable

Pool based sampling

• In pool-based active learning (Lewis and Gale, 1994; McCallum and Nigam, 1998b), the learner is supplied with a set of unlabelled examples from which it can select queries.

• Pros:

– Probably the most widely used sampling method

– Applied to many real-world tasks

• Cons:

– Computationally intensive

– Needs to evaluate entire set at each iteration


Pool based sampling meta-algorithm

A meta-algorithm (Davy and Luz, 2007a):

Input: tr - training data

Input: ul - unlabelled examples

for i = 0 to stop criterion do

Ci = Induce(tr) // Induce classifier

s = QuerySelect ( ul, Ci ) // Select query example

l = Oracle(s) // Obtain label

ul ← ul \ s // remove example from ul

tr ← tr ∪ (s, l) // update training data

Output: C = Induce(tr)
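A direct Python rendering of this meta-algorithm (a sketch: induce, query_select and oracle stand for a learner, a query-selection strategy and an annotator respectively, and the label budget used as a stopping criterion is an assumption):

def pool_based_al(tr, ul, induce, query_select, oracle, budget=100):
    """tr: list of labelled (x, y) pairs; ul: list of unlabelled instances."""
    for _ in range(budget):             # stopping criterion: label budget exhausted
        if not ul:
            break
        clf = induce(tr)                # induce classifier on current training data
        s = query_select(ul, clf)       # select a query example from the pool
        label = oracle(s)               # obtain its label
        ul.remove(s)                    # remove the example from ul
        tr.append((s, label))           # update the training data
    return induce(tr)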

Query selection criteria

• Uncertainty Sampling (Lewis and Gale, 1994): select examples which the current classifier (fi) is most uncertain about.

• Uncertainty is defined in terms of the confidence the classifier has in a prediction:

– for a probabilistic classifier, a prediction close to 0.0 or 1.0 indicates a confident prediction;

– a prediction close to 0.5 indicates an uncertain prediction.

• So, the learner should choose an instance s for presentation to the oracle such that:

s = arg min_{x∈U} |fi(x) − 0.5|    (20.1)

where U is the set of unlabelled instances.
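Equation (20.1) as a query-selection function that could be plugged into the loop above, assuming a probabilistic binary classifier that exposes scikit-learn's predict_proba interface:

def uncertainty_select(ul, clf):
    """Pick the unlabelled instance whose positive-class probability is closest to 0.5."""
    def uncertainty(x):
        p = clf.predict_proba([x])[0][1]    # estimated P(positive | x)
        return abs(p - 0.5)
    return min(ul, key=uncertainty)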

Example: pooled sampling with task-specific criterion

• Task: find coreferent named entities from a list of N entities. Examples:

– “President Obama” = “Barack Obama”

– “Republic of Ireland” ≠ “Northern Ireland”

• Goal: build a classifier

• Annotated data??

– labeling N ×N pairs: too costly

– most pairs are negative matches

• Idea:

– extract pairs showing spelling similarities (more likely to be positive matches)

– label those in priority
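A sketch of this pre-filtering idea, ranking candidate entity pairs by a simple character-level similarity (difflib is just one possible choice of measure):

from difflib import SequenceMatcher
from itertools import combinations

entities = ["President Obama", "Barack Obama",
            "Republic of Ireland", "Northern Ireland"]

def sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Rank candidate pairs by spelling similarity; the most similar are labelled first.
pairs = sorted(combinations(entities, 2), key=lambda p: sim(*p), reverse=True)
for a, b in pairs[:3]:
    print(f"{sim(a, b):.2f}  {a}  <->  {b}")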


[MORE TO COME...]


Bibliography

Angluin, D. (1988). Queries and concept learning. Machine Learning, 2(4):319–342.

Baker, L. D. and McCallum, A. K. (1998). Distributional clustering of words for text classification. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 96–103.

Bertsekas, D. and Tsitsiklis, J. (2002). Introduction to Probability. Athena Scientific.

Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167.

Cohn, D., Atlas, L., and Ladner, R. (1994). Improving generalization with active learning. Machine Learning, 15:201–221. 10.1007/BF00993277.

Dasgupta, S. (2011). Two faces of active learning. Theoretical Computer Science, 412(19):1767–1781.

Davy, M. and Luz, S. (2007a). Active learning with history-based query selection for text categorisation. In Amati, G., Carpineto, C., and Romano, G., editors, Advances in Information Retrieval, Proceedings of the 29th European Conference on Information Retrieval Research, ECIR 2007, volume 4425 of Lecture Notes in Computer Science, pages 695–698, Rome, Italy. Springer.

Davy, M. and Luz, S. (2007b). Dimensionality reduction for active learning with nearest neighbour classifier in text categorisation problems. In Proceedings of the Sixth International Conference on Machine Learning and Applications (ICMLA 2007), pages 292–297, Los Alamitos, CA, USA. IEEE Computer Society.

Domingos, P. (2012). A few useful things to know about machine learning. Communications of the ACM, 55(10):78–87.


Domingos, P. and Pazzani, M. J. (1996). Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In International Conference on Machine Learning, pages 105–112.

Emms, M. and Luz, S. (2007). Machine Learning for Natural Language Processing. European Summer School of Logic, Language and Information, course reader, ESSLLI'07, Dublin.

Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874.

Fayyad, U. M. and Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pages 1022–1026. Morgan Kaufmann.

Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3:1289–1305.

Freund, Y., Seung, H., Shamir, E., and Tishby, N. (1997). Selective sampling using the query by committee algorithm. Machine Learning, 28:133–168.

Fuhr, N., Hartmann, S., Knorz, G., Lustig, G., Schwantner, M., and Tzeras, K. (1991). AIR/X - a rule-based multistage indexing system for large subject fields. In Proceedings of RIAO'91, Barcelona, Spain, April 2-5, 1991, pages 606–623.

Galavotti, L., Sebastiani, F., and Simi, M. (2000). Experiments on the use of feature selection and negative evidence in automated text categorization. Technical report, Paris, France.

Gale, W., Church, K., and Yarowsky, D. (1992). A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26:415–439.

Gliozzo, A. and Strapparava, C. (2009). Semantic Domains in Computational Linguistics. Springer-Verlag New York Inc.

Hamming, R. W. (1991). The Art of Probability for Scientists and Engineers. Addison-Wesley.

Haruno, M., Shirai, S., and Ooyama, Y. (1999). Using decision trees to construct a practical parser. Machine Learning, 34(1):131–149.

Hayes, P. J. and Weinstein, S. P. (1990). Construe-TIS: A system for content-based indexing of a database of news stories. In Rappaport, A. and Smith, R., editors, Proceedings of the IAAI-90 Conference on Innovative Applications of Artificial Intelligence, pages 49–66. MIT Press.

Jain, A. K., Murty, M. N., and Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys, 31(3):264–323.

Joachims, T. (1996). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Technical Report CMU-CS-96-118, CMU.


John, G. H. and Langley, P. (1995). Estimating continuous distributions in Bayesian classifiers. In Besnard, P. and Hanks, S., editors, Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI'95), pages 338–345, San Francisco, CA, USA. Morgan Kaufmann Publishers.

King, R. D., Rowland, J., Oliver, S. G., Young, M., Aubrey, W., Byrne, E., Liakata, M., Markham, M., Pir, P., Soldatova, L. N., Sparkes, A., Whelan, K. E., and Clare, A. (2009). The automation of science. Science, 324(5923):85–89.

Lang, K. (1995). NewsWeeder: Learning to filter netnews. In Proceedings of the 12th International Conference on Machine Learning, pages 331–339. Morgan Kaufmann Publishers, Inc.

Lewis, D. (1997). Reuters-21578 text categorisation corpus. Available from http://www.daviddlewis.com/resources/.

Lewis, D. and Gale, W. (1994). A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3–12.

Lewis, D. and Ringuette, M. (1994). A comparison of two learning algorithms for text categorization. In Third Annual Symposium on Document Analysis and Information Retrieval, pages 81–93.

Lewis, D. D. (1995). Evaluating and optimizing autonomous text classification systems. In SIGIR '95: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 246–254, New York, NY, USA. ACM Press.

Lomax, S. and Vadera, S. (2013). A survey of cost-sensitive decision tree induction algorithms. ACM Computing Surveys, 45(2):16:1–16:35.

Luz, S. (2012). The non-verbal structure of patient case discussions in multidisciplinary medical team meetings. ACM Transactions on Information Systems, 30(3):17:1–17:24.

Luz, S. and Su, J. (2010). Assessing the effectiveness of conversational features for dialogue segmentation in medical team meetings and in the AMI corpus. In Proceedings of the SIGDIAL 2010 Conference, pages 332–339, Tokyo. Association for Computational Linguistics.

Magerman, D. M. (1995). Statistical decision-tree models for parsing. In Meeting of the Association for Computational Linguistics, pages 276–283.

Manning, C. D., Raghavan, P., and Schutze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.

Manning, C. D. and Schutze, H. (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.

McCallum, A. and Nigam, K. (1998a). A comparison of event models for naive Bayes text classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization, pages 41–48. AAAI Press.


McCallum, A. and Nigam, K. (1998b). Employing EM in pool-based active learning for text classification. In Proceedings of the International Conference on Machine Learning (ICML), pages 359–367. Morgan Kaufmann.

Mehta, M., Rissanen, J., and Agrawal, R. (1995). MDL-based decision tree pruning. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD'95), pages 216–221.

Mingers, J. (1989). An empirical comparison of pruning methods for decision tree induction. Machine Learning, 4(2):227–243.

Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.

Ng, H. T., Goh, W. B., and Low, K. L. (1997). Feature selection, perceptron learning, and a usability case study for text categorization. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '97, pages 67–73, New York, NY, USA. ACM.

Pomerleau, D. A. (1994). Neural Network Perception for Mobile Robot Guidance. Kluwer, Dordrecht, Netherlands.

Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1:81–106. 10.1007/BF00116251.

Robertson, S. E. and Jones, K. S. (1988). Relevance weighting of search terms. In Document Retrieval Systems, pages 143–160. Taylor Graham Publishing, London.

Russell, S. J. and Norvig, P. (1995). Artificial Intelligence. A Modern Approach. Prentice-Hall, Englewood Cliffs.

Russell, S. J. and Norvig, P. (2003). Artificial Intelligence. A Modern Approach. Prentice-Hall, Englewood Cliffs, 2nd edition.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47.

Settles, B. (2009). Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison.

Shneiderman, B. and Maes, P. (1997). Direct manipulation vs interface agents. interactions, 4(6):42–61.

Sparck Jones, K. and Jackson, D. (1970). The use of automatically-obtained keyword classifications for information retrieval. Information Storage and Retrieval, 5:175–201.

Sycara, K., Navin Chandra, D., Guttal, R., Koning, J., and Narasimhan, S. (1992). CADET: a case-based synthesis tool for engineering design. International Journal for Expert Systems, 4(2):157–188.

Valiant, L. (1984). A theory of the learnable. Communications of the ACM, 27(11):1134–1142.


van Rijsbergen, C. J. (1979). Information Retrieval. Butterworths.

Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7):1341–1390.

Wooldridge, M. (2002). An Introduction to MultiAgent Systems. John Wiley & Sons.

Yang, Y. (1994). Expert network: effective and efficient learning from human decisions in text categorisation and retrieval. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval, pages 13–22, Dublin, Ireland. ACM Press.

Yang, Y. (2001). A study on thresholding strategies for text categorization. In Croft, W. B., Harper, D. J., Kraft, D. H., and Zobel, J., editors, Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR-01), pages 137–145, New York. ACM Press.

Yang, Y. and Chute, C. G. (1994). An example-based mapping method for text categorization and retrieval. ACM Transactions on Information Systems, 12(3):252–277.

Yang, Y. and Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Fisher, D. H., editor, Proceedings of ICML-97, 14th International Conference on Machine Learning, pages 412–420, Nashville. Morgan Kaufmann Publishers.

Zhang, H. (2004). The optimality of Naive Bayes. In Proceedings of the 7th International Florida Artificial Intelligence Research Society Conference. AAAI Press.

Zhang, H., Jiang, L., and Su, J. (2005). Hidden naive Bayes. In Proceedings of the National Conference on Artificial Intelligence, volume 20, page 919. AAAI Press / MIT Press.