79
Novel representations and methods in text classification Manuel Montes, Hugo Jair Escalante Instituto Nacional de Astrofísica, Óptica y Electrónica, México. http://ccc.inaoep.mx/~mmontesg/ http://ccc.inaoep.mx/~hugojair/ {mmontesg, hugojair}@inaoep.mx 7th Russian Summer School in Information Retrieval Kazan, Russia, September 2013

Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Novel representations and

methods in text classification

Manuel Montes, Hugo Jair Escalante

Instituto Nacional de Astrofísica, Óptica y Electrónica, México.

http://ccc.inaoep.mx/~mmontesg/

http://ccc.inaoep.mx/~hugojair/

{mmontesg, hugojair}@inaoep.mx

7th Russian Summer School in Information RetrievalKazan, Russia, September 2013

Page 2: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Overview of the course

• Day 1: Introduction to text classification

• Day 2: Bag of concepts representations

• Day 3: Representations that incorporate sequentialand syntactic informationand syntactic information

• Day 4: Methods for non-conventional classification

• Day 5: Automated construction of classificationmodels

2

Page 3: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

AUTOMATED CONSTRUCTION OF

CLASSIFICATION MODELS:

FULL MODEL SELECTION

Novel represensations and methods in text classification

3

Page 4: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Outline

• Pattern classification

• Model selection in a broad sense: FMS

• Related works

• PSO for full model selection

• Experimental results and applications

• Conclusions and future work directions

4

Page 5: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Pattern classification

1 (is a car)

Answers

Training

Trained

machine-1 (is not a car)

Queries

Learning machine

Training data

5

Page 6: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Applications

• Natural language processing,

• Computer vision,

• Robotics,

• Information technology,• Information technology,

• Medicine,

• Science,

• Entertainment,

• ….

6

Page 7: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Example: galaxy classification

from images

( ; )g θx

( ; )zf

f θx

1( ; )ff θx…

Instances

Features

Data

Preprocessing

Feature

selection

Classification

method

Training modelModel

evaluation ( ; )ff θx

( ; )gg θx( ; )hh θx

( ; )ff θx( )l x

ERROR [ ] ?

Data

7

Page 8: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Some issues with the cycle of design

• How to preprocess the data?

• What method to use?

Data

Preprocessing

Feature

selection

Classification

method

Training modelModel

evaluation

Data

8

Page 9: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Some issues with the cycle of design

• What feature selection method to use?

• How many features are enough?

• How to set the hyperparameters of the FS method?

Data

Preprocessing

Feature

selection

Classification

method

Training modelModel

evaluation

Data

9

Page 10: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Some issues with the cycle of design

• What learning machine to use?

• How to set the hyperparameters of the chosen method?

Data

Preprocessing

Feature

selection

Classification

method

Training modelModel

evaluation

Data

10

Page 11: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Some issues with the cycle of design

• What combination of methods works better?

• How to evaluate model’s performance

• How to avoid overfitting?

Data

Preprocessing

Feature

selection

Classification

method

Training modelModel

evaluation

Data

11

Page 12: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Some issues with the cycle of design

• The above issues are usually fixed manually:– Domain expert’s knowledge– Machine learning specialists’ knowledge– Trial and error

• The design/development of a pattern classificationsystem relies on the knowledge and biases of humans,system relies on the knowledge and biases of humans,which may be risky, expensive and time consuming

• Automated solutions are available but only for particularprocesses (e.g., either feature selection, or classifierselection but not both)

Is it possible to automate the whole process?

12

Page 13: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

( ; )f θxFull model selection

Data

Full model selection

H. J. Escalante. Towards a Particle Swarm Model Selection algorithm. Multi-level inference workshop and model selection game,

NIPS, Whistler, VA, BC, Canada, 2006.

H. J. Escalante, E. Sucar, M. Montes. Particle Swarm Model Selection, In Journal of Machine Learning Research, 10(Feb):405--440,

2009.

13

Page 14: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Full model selection

• Given a set of methods for data preprocessing,feature selection and classification select thecombination of methods (together with theirhyperparameters) that minimizes an estimate ofclassification performanceclassification performance

H. J. Escalante, E. Sucar, M. Montes. Particle Swarm Model Selection, In Journal of Machine Learning Research, 10(Feb):405--440,

2009.

14

Page 15: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Full model selection

• Full model: A model composed of methods for datapreprocessing, feature selection and classification

• Example:

chainchain{ chain

{1:normalize center=02:relief f_max=Inf w_min=-Inf k_num=43:neural units=10 shrinkage=1e-014 balance=0 maxiter=100

}

chain{1:standardize center=12:svcrfe svc kernel linear coef0=0 degree=1 gamma=0 shrinkage=0.001 f_max=Inf3:pc_extract f_max=20004:svm kernel linear C=Inf ridge=1e-013 balanced_ridge=0 nu=0 alpha_cutoff=-1 b0=0 nob=0

}

15

Page 16: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Full model selection

• Pros– The job of the data analyst is considerably reduced

– Neither knowledge on the application domain nor onmachine learning is required

– Different methods for preprocessing, feature selection andclassification are considered

– It can be used in any classification problem– It can be used in any classification problem

• Cons– It is real function + combinatoric optimization problem

– Computationally expensive

– Risk of overfitting

16

Page 17: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

OVERVIEW OF RELATED WORKSNovel represensations and methods in text classification

17

Page 18: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Model selection via heuristic optimization

• A single model is considered and theirhyperparameters are optimized via heuristicoptimization:

– Swarm optimization,

– Genetic algorithms,

– Pattern search,

– Genetic programming

– …

18

Page 19: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

GEMS

• GEMS (Gene Expression Model Selection) is a systemfor automated development and valuation of high-quality cancer diagnostic models and biomarkerdiscovery from microarray gene expression data

A. Statnikov, I. Tsamardinos, Y. Dosbayev, C.F. Aliferis. GEMS: A System for Automated Cancer Diagnosis and Biomarker Discovery from

Microarray Gene Expression Data. International Journal of Medical Informatics, 2005 Aug;74(7-8):491-503.

N. Fanananapazir, A. Statnikov, C.F. Aliferis. The Fast-AIMS Clinical Mass Spectrometry Analysis System. Advaces in Bioinformatics, 2009, Article ID 598241.

19

Page 20: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

GEMS

• The user specifies the models, and methods to beconsidered

• GEMS explores all of the combinations of methods, usinggrid search to optimize hyperparameters

20

Page 21: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

GEMS

• The user specifies the models, and methods to beconsidered

• GEMS explores all of the combinations of methods, usinggrid search to optimize hyperparameters

21

Page 22: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Model type selection for regression

• Genetic algorithms are used for the selection ofmodel type (learning method, feature selection,preprocessing) and parameter optimization forregression problems

D. Gorissen, T. Dhaene, F. de Turck. Evolutionary Model Type Selection for Global Surrogate Modeling. In Journal of Machine

Learning Research, 10(Jul):2039-2078, 2009

http://www.sumo.intec.ugent.be/

22

Page 23: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Meta-learning: learning to learn

• Evaluates and compares the application of learningalgorithms on (many) previous tasks/domains tosuggest learning algorithms (combinations, rankings)for new tasks

• Focuses on the relation between tasks/domains and• Focuses on the relation between tasks/domains andlearning algorithms

• Accumulating experience on the performance ofmultiple applications of learning methods

Brazdil P., Giraud-Carrier C., Soares C., Vilalta R. Metalearning: Applications to Data Mining. Springer Verlag. ISBN: 978-3-540-73262-4, 2008.

Brazdil P., Vilalta R, Giraud-Carrier C., Soares C.. Metalearning. Encyclopedia of Machine learning. Springer, 2010.

23

Page 24: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Meta-learning: learning to learn

Brazdil P., Giraud-Carrier C., Soares C., Vilalta R. Metalearning: Applications to Data Mining. Springer Verlag. ISBN: 978-3-540-73262-4, 2008.

Brazdil P., Vilalta R, Giraud-Carrier C., Soares C.. Metalearning. Encyclopedia of Machine learning. Springer, 2010.

24

Page 25: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Google prediction API

• “Machine learning as a service in the cloud”

• Upload your data, train a model and performqueries

• Nothing is for free!

https://developers.google.com/prediction/

25

Page 26: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Google prediction API

Your data become property of Google!26

Page 27: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

IBM’s SPSS modeler

27

Page 28: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

$$$$$ 28

Page 29: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Who cares?

29

Page 30: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

PSMS: OUR APPROACH TO FULL

MODEL SELECTION

Novel represensations and methods in text classification

30

Page 31: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

( ; )f θxFull model selection

Data

Full model selection

31

Page 32: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

PSMS: Our approach to

full model selection

• Particle swarm model selection: Use particle swarm

optimization for exploring the search space of full models in a particular ML-toolbox

Relief (feat.

H. J. Escalante, E. Sucar, M. Montes. Particle Swarm Model Selection, In Journal of Machine Learning Research, 10(Feb):405--440,

2009.

H. J. Escalante. Towards a Particle Swarm Model Selection algorithm. Multi-level inference workshop and model selection game,

NIPS, Whistler, VA, BC, Canada, 2006.

Normalize +

RBF-SVM

(γ = 0.01)

PCA + Neural-

Net

(10 h. units)

Relief (feat.

Sel.) +Naïve

Bayes

Normalize +

PolySVM

(d= 2)Neural Net

( 3 units)

s2n (feat. Sel.)

+K-ridge

32

Page 33: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Particle swarm optimization

• Population-based search heuristic

• Inspired on the behavior of biological communities that exhibit local and social behaviors

A. P. Engelbrecht. Fundamentals of Computational Swarm Intelligence. Wiley,2006

33

Page 34: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Particle swarm optimization

• Each individual (particle) i has:– A position in the search space (Xi

t), which represents a solution to the problem at hand,

– A velocity vector (Vit), which determines how a particle

explores the search space

• After random initialization, particles update their positions according to:

A. P. Engelbrecht. Fundamentals of Computational Swarm Intelligence. Wiley,2006

10 1 2( ) ( )t t t t

i i i i g iφ φ φ+ = × + × − + × −v v p x p x

1 1t t ti i i

+ += +x v x

34

Page 35: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Particle swarm optimization

1. Randomly initialize a population of particles (i.e., the swarm)

2. Repeat the following iterative process until stop criterion is meet:

a) Evaluate the fitness of each particle

b) Find personal best (pi) and global best (pg)

c) Update particles c) Update particles

d) Update best solution found (if needed)

3. Return the best particle (solution)

Fit

ness v

alu

e

...

Init t=1 t=2 t=max_it...

35

Page 36: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

PSMS : PSO for full model selection

• Set of methods (not restricted to this set)

Classification

Feature

selection

Preprocessing

http://clopinet.com/CLOP36

Page 37: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

PSMS : PSO for full model selection

• Codification of solutions as real valued vectors

Hyperparameters for the selected methods Preprocessingbefore featureselection?

i,pre i,1...Npre i,fs i,1...Nfs i,sel i,class i,1...Nclass, , , , , , i x y x y x x y=x

Choice of methods for preprocessing, feature selection and classification

37

Page 38: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

PSMS : PSO for full model selection

• Fitness function:

– K-fold cross-validation balanced error rate

– K-fold cross-validation area under the ROC curve

Training dataTraining dataTraining dataTraining data

TrainTrainTrainTrain

TestTestTestTest

Error fold 1Error fold 1Error fold 1Error fold 1

Error fold 2Error fold 2Error fold 2Error fold 2

Error fold 3Error fold 3Error fold 3Error fold 3

Error fold 4Error fold 4Error fold 4Error fold 4

Error fold 5Error fold 5Error fold 5Error fold 5

CV CV CV CV estimateestimateestimateestimate

38

Page 39: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

SOME EXPERIMENTAL RESULTS OF

PSMS

Novel represensations and methods in text classification

39

Page 40: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

PSMS in the ALvsPK challenge

• Five data sets for binary classification

• Goal: to obtain the best classification model for each data set

• Two tracks:

– Prior knowledge– Prior knowledge

– Agnostic learning

http://www.agnostic.inf.ethz.ch/40

Page 41: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

PSMS in the ALvsPK challenge

• Best configuration of PSMS:

Entry Description Ada Gina Hiva Nova Sylva Overall Rank

Interim-all-prior Best PK 17.0 2.33 27.1 4.71 0.59 10.35 1st

psmsx_jmlr_run_I PSMS 16.86 2.41 28.01 5.27 0.62 10.63 2nd

Logitboost-trees Best AL 16.6 3.53 30.1 4.69 0.78 11.15 8thLogitboost-trees Best AL 16.6 3.53 30.1 4.69 0.78 11.15 8th

Comparison of the performance of models selected with PSMS with that obtained by other techniques in the ALvsPK challenge

Models selected with PSMS for the different data sets

41http://www.agnostic.inf.ethz.ch/

Page 42: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

PSMS in the ALvsPK challenge

• Official ranking:

42http://www.agnostic.inf.ethz.ch/

Page 43: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Some results in benchmark data

• Comparison of PSMS and pattern search

43

Page 44: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Some results in benchmark data

• Comparison of PSMS and pattern search

44

Page 45: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Some results in benchmark data

• Comparison of PSMS and pattern search

45

Page 46: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

PSMS: Interactive demo

http://clopinet.com/CLOP

Isabelle Guyon, Amir Saffari, Hugo Jair Escalante, Gokan Bakir, and Gavin Cawley, CLOP: a Matlab Learning Object

Package. NIPS 2007 Demonstrations, Vancouver, British Columbia, Canada 2007.

46

Page 47: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

PSMS: APPLICATIONS AND

EXTENSIONS

Novel represensations and methods in text classification

47

Page 48: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Authorship Authorship verification

Page 49: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Authorship verification

• The task of deciding whether given textdocuments were or were not written by acertain author (i.e., a binary classification

problem)

• Applications include: fraud detection, spamfiltering, computer forensics and plagiarismdetection

49

Page 50: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Authorship verification

Dear Mary, I have beenbusy in the lastfew weeks, however, Iwould like to ….….

This paper describes annovel approach to regionlabeling by using a Markovrandom field model ….

Sample documents

from author “X”

Model for

Feature extraction Model construction

and testing

This document was

Prediction

… …

50

Hereby I am sending myapplication for the PhDposition offered in yourresearch group …

Hi Jane, how things aregoing on in Mexico; I hopeyou are not havingproblems because … Our paper introduces a

novel method for …

Unseen document

Model for

author “X”

This document was

not written by “X”

Today in Brazil an AirFrance plane crasheddown, during a storm….

Sample documents

from other authors

than “X”

… …

Page 51: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

PSMS for authorship verification

• Documents are represented by their (binary) bag-of-words

Iberoamerican congresson pattern recognition

Iber

oam

eric

an

Acc

om

mo

dat

ion

Co

ngr

ess

Patt

ern

Rec

ogn

itio

n

0 0 1 01 1 100 0 0

… … … … …

Zorr

o

• We apply straight PSMS for selecting the classification model for each author

• We report average performance over several authors

Hugo Jair Escalante, Manuel Montes, and Luis Villaseñor. Particle Swarm Model Selection for Authorship Verification. LNCS 5856, pp. 563--570, Springer, 2009.

51

Page 52: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Experimental results

• Two collections were considered for experiments:

Collection Training Testing Features Authors Language Domain

MX-PO 281 72 8,970 5 Spanish Poetry

CCAT 2,500 2,500 3,400 50 English News

• Evaluation measures: – Balanced error rate (BER)

– F1-measure

• We compare PSMS to results obtained with individual classification models

2

E EBER + −+=

1

2 R PF

R P

× ×=+

52

Page 53: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Experimental results

Col.

/ Class.

zarbi naïve lboost neural svc kridge rf lssvm FMS/1 FMS/0 FMS/-1

MX-PO 34.64 30.24 29.08 28.59 30.81 31.90 48.01 33.52 26.18 23.68 26.88

CCAT 14.24 26.21 15.12 41.50 29.18 27.69 47.01 36.64 35.34 13.54 16.39

Average (over the authors) of balanced error rate obtained by each technique in the test sets

53

Page 54: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Experimental results

• Full models selected for the MX-PO data set

Poet Preprocessing Feature

Selection

Classifier BER

Efrain Huerta standardize(1) - zarbi 10.28 %Efrain Huerta standardize(1) - zarbi 10.28 %

Jaime Sabines - - zarbi 26.79 %

Octavio Paz normalize(0) Zfilter(3070) zarbi 25.09 %

Rosario Castellanos

normalize(0) Zfilter(3400) Kridge(γ=0.44; s= 0.407)

33.04%

Ruben Bonifaz shift-scale (log) - neural (u=3;s=1.81;i=15;b=1); 35.71%

54

Page 55: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Experimental results

Classifiers Feature

selection

Preprocessing

zarbi naive neural svc kridge lssvm W WO W WO

� Statistics on the selection of methods when using PSMS (FMS/1) for the CCAT collection

zarbi naive neural svc kridge lssvm W WO W WO

Frequency of selection

68% 10% 10% 2% 2% 8% 76% 24% 88% 12%

Balanced error rate (BER)

14.22 9.88 23.42 3.82 44.01 33.51 14.14 22.28 15.64 16.50

55

Page 56: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Experimental results

• Per-author performance by models selected with FMS/1 for the CCA data set

96.91%

[normalize+standardize+shift-scale] + [Ftest (104)] + [naïve]

Most relevant features (words) for this author: china, chinesse, beijing, official diplomat

56

Page 57: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Experimental results

• Per-author performance by models selected with FMS/1 for the CCA data set

[normalize+standardize+shift-scale] + [Zfilter (234)] + [zarbi]

When using FMS/0 we got

F1= 45.71%

14.81%[normalize] + [] + [lssvm]

57

Page 58: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Lessons learned

• PSMS selected very effective models foreach author,

• Allowing the user to provide/fix a classifierresulted beneficialresulted beneficial

• We were not able to outperform baselinemethods in the multiclass classification taskof authorship attribution

58

Page 59: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

ENSEMBLE PSMSNovel represensations and methods in text classification

59

Page 60: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Ensemble PSMS

• Many models are evaluated during the searchprocess of PSMS; although a single model is selected

60

Page 61: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Ensemble PSMS

• Idea: taking advantage of the large number ofmodels that are evaluated during the search forbuilding ensemble classifiers

• Problem: How to select the partial solutions fromPSMS so that they are accurate and diverse to eachPSMS so that they are accurate and diverse to eachother

• Motivation: The success of ensemble classifiersdepends mainly in two key aspects of individualmodels: Accuracy and diversity

61

Page 62: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Ensemble PSMS�How to select potential models for building ensembles?

� BS: store the global best model in each iteration

� BI: the best model in each iteration

� SE: combine the outputs of the final swarm

�How to fuse the outputs of the selected models?

� Simple (un-weighted) voting� Simple (un-weighted) voting

Fit

ness v

alu

e

...

Init I=1 I=4 I=max_it...I=2 I=3

1

1( ) ( )

L

ll

g E fL =

= ∑ p

62

Page 63: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

�How to select potential models for building ensembles?

� BS: store the global best model in each iteration

� BI: the best model in each iteration

� SE: combine the outputs of the final swarm

�How to fuse the outputs of the selected models?

� Simple (un-weighted) voting

Ensemble PSMS

� Simple (un-weighted) voting

1

1( ) ( )

L

ll

g E fL =

= ∑ p

Fit

ness v

alu

e

...

Init I=1 I=4 I=max_it...I=2 I=3

63

Page 64: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

�How to select potential models for building ensembles?

� BS: store the global best model in each iteration

� BI: the best model in each iteration

� SE: combine the outputs of the final swarm

�How to fuse the outputs of the selected models?

� Simple (un-weighted) voting

Ensemble PSMS

� Simple (un-weighted) voting

1

1( ) ( )

L

ll

g E fL =

= ∑ p

Fit

ness v

alu

e

...

Init I=1 I=4 I=max_it...I=2 I=3

64

Page 65: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Experimental results

• Data:

– 9 Benchmark machine learning data sets (binary classification)

– 1 Object recognition data set (multiclass, 10 classes)

ID Data set Training Testing Features

1 Breast-cancer 200 77 91 Breast-cancer 200 77 9

2 Diabetes 468 300 8

3 Flare solar 666 400 9

4 German 700 300 20

5 Heart 170 100 13

6 Image 1300 1010 20

7 Splice 1000 2175 60

8 Thyroid 140 75 5

9 Titanic 150 2051 3

OR SCEF 2378 3300 50

Hugo Jair Escalante, Manuel Montes and Enrique Sucar. Ensemble Particle Swarm

Model Selection. Proceedings of the International Joint Conference on Neural Networks(IJCNN2010 – WCCI2010), pp. 1814--1821, IEEE,, 2010

Page 66: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Experimental results

• Evaluation:

– Average of area under the ROC curve (performance)

– Coincident failure diversity (ensemble diversity)

10

1

1 1

0

L

rr

L rp

p LCFD =

− − −=

∑If p0 < 1

If p0 = 1

W. Wang. Some fundamental issues in ensemble methods.

Proc. Of IJCNN07, pp. 2244–2251, IEEE, July 2007.66

Page 67: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Experimental results: performance

• Benchmark data sets: better performance is obtained by ensemble methods

ID PSMS EPSMS-BS EPSMS-SE EPSMS-BI

1 72.03±2.24 73.40±0.78 74.05±0.91 74.35±0.49

2 82.11±1.29 82.60±1.52 74.07±13.7 83.42±0.462 82.11±1.29 82.60±1.52 74.07±13.7 83.42±0.46

3 68.81±4.31 69.38±4.53 70.13±7.48 72.16±1.42

4 73.92±1.23 73.84±1.53 74.70±0.72 74.77±0.69

5 85.55±5.48 87.40±2.01 87.07±0.75 88.36±0.88

6 97.21±3.15 98.85±1.45 95.27±3.04 99.58±0.33

7 97.26±0.55 98.02±0.64 96.99±1.21 98.84±0.26

8 96.00±4.75 98.18±0.94 97.29±1.54 99.22±0.45

9 73.24±1.16 73.50±0.95 75.37±1.05 74.40±0.91

Avg. 82.90±2.68 83.91±1.59 82.77±3.38 85.01±0.65

Average accuracy over 10-trials of PSMS and EPSMS in benchmark data

67

Hugo Jair Escalante, Manuel Montes and Enrique Sucar. Ensemble Particle Swarm

Model Selection. Proceedings of the International Joint Conference on Neural Networks(IJCNN2010 – WCCI2010), pp. 1814--1821, IEEE,, 2010

Page 68: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Experimental results: Diversity of ensemble

• Diversity results

ID EPSMS-BS EPSMS-SE EPSMS-BI

1 0.2055±0.1498 0.5422±0.0550 0.5017±0.1149

2 0.3547±0.1711 0.6241±0.0619 0.5081±0.0728

3 0.1295±0.1704 0.4208±0.1357 0.4012±0.10713 0.1295±0.1704 0.4208±0.1357 0.4012±0.1071

4 0.3019±0.1732 0.5159±0.0596 0.4296±0.0490

5 0.2733±0.1714 0.5993±0.0925 0.5647±0.0655

6 0.7801±0.0818 0.7555±0.0524 0.8427±0.0408

7 0.5427±0.3230 0.7807±0.0585 0.8050±0.0294

8 0.6933±0.1558 0.8173±0.0626 0.8514±0.0403

9 0.7473±0.0089 0.7473±0.0089 0.7473±0.0089

Avg. 0.4476±0.1562 0.6448±0.0603 0.6280±0.0588

EPSMS-SE models are more diverse

68

Hugo Jair Escalante, Manuel Montes and Enrique Sucar. Ensemble Particle Swarm

Model Selection. Proceedings of the International Joint Conference on Neural Networks(IJCNN2010 – WCCI2010), pp. 1814--1821, IEEE,, 2010

Page 69: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Experimental results: region

labeling

ID PSMS EPSMS-BS EPSMS-SE EPSMS-BI

AUC 91.53±6.8 93.27±5.6 92.79±7.4 94.05±5.3

MCC 69.58% 76.59% 79.13% 81.49%

15

20

Diff

eren

ce in

AU

C p

erfo

rman

ce (

%)

EPSMS−BSEPSMS−SEEPSMS−BI

Per-concept improvement of EPSMS variants over straight PSMS

−10

−5

0

5

10

Label

Diff

eren

ce in

AU

C p

erfo

rman

ce (

%)

build

ing

folia

ge

mou

ntai

n

pers

on

road

saili

ng−

boat

sand se

a

sky

snow

69

Page 70: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Lessons learned

• Ensembles generated with EPSMSoutperformed individual classifiers;including those selected with PSMS

• Models evaluated by PSMS are diverse toeach other and accurate

• The extension of PSMS for generatingensembles is promising

70

Page 71: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Summary

• Full model selection is a broad view of the modelselection problem in supervised learning

• There is an increasing interest for this type of methods

• The search space is huge and multiple minima may exist• The search space is huge and multiple minima may exist

• There is no guarantee of avoiding overfitting

• Yet, we showed evidence that it is possible to attempt toautomate the cycle of design of pattern classificationsystems

71

Page 72: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Summary

• PSMS / EPSMS an automatic tool for the selection ofclassification models for any classification task

• PSMS / EPSMS has been successfully applied inseveral domains

• Disclaimer: We do not expect PSMS / EPSMS toperform well in every application it is tested,although we recommend it as a first option whenbuilding a classifier

72

Page 73: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Summary

• For multiclass classification straight OVA strategiesdid not work

• Alternative methods that do not use the output of• Alternative methods that do not use the output ofclassifiers are a good option to explore as futurework

73

Page 74: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Research opportunities

• Multi-Swarm PSMS for buildingensembles

• Multiclass PSMS/EPSMS:– Bi-level optimization (i.e., individual– Bi-level optimization (i.e., individual

and multiclass performance)

– Learning to fuse the outputs (e.g.,using genetic programming)

• Meta-ensembles

74

Page 75: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Research opportunities

• Other search strategies for full modelselection (e.g., Genetic programming,GRASP, Tabu search)

• Other toolboxes (e.g., weka)• Other toolboxes (e.g., weka)

• Meta-learning + PSMS

• Combination of different full modelselection techniques

75

Page 76: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Applications

• Any classification problem where specific classifiersare required for each class

76

Page 77: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

References• Isabelle Guyon. A Practical Guide to Model Selection. In Jeremie Marie, editor, Machine Learning Summer School

2008,Springer Texts in Statistics, 2011. (slide from I.Guyon’s)

• A. Statnikov, I. Tsamardinos, Y. Dosbayev, C.F. Aliferis. GEMS: A System for Automated Cancer Diagnosis andBiomarker Discovery from Microarray Gene Expression Data. International Journal of Medical Informatics, 2005Aug;74(7-8):491-503.

• D. Gorissen, T. Dhaene, F. de Turck. Evolutionary Model Type Selection for Global Surrogate Modeling. In Journalof Machine Learning Research, 10(Jul):2039-2078, 2009

• Brazdil P., Giraud-Carrier C., Soares C., Vilalta R. Metalearning: Applications to Data Mining. Springer Verlag. ISBN:978-3-540-73262-4, 2008.

• H. J. Escalante. Towards a Particle Swarm Model Selection algorithm. Multi-level inference workshop and model• H. J. Escalante. Towards a Particle Swarm Model Selection algorithm. Multi-level inference workshop and modelselection game, NIPS, Whistler, VA, BC, Canada, 2006.

• Isabelle Guyon, Amir Saffari, Hugo Jair Escalante, Gokan Bakir, and Gavin Cawley, CLOP: a Matlab Learning ObjectPackage. NIPS 2007 Demonstrations, Vancouver, British Columbia, Canada 2007.

• H. J. Escalante, E. Sucar, M. Montes. Particle Swarm Model Selection, In Journal of Machine Learning Research,10(Feb):405--440, 2009.

• Hugo Jair Escalante, Jesus A. Gonzalez, Manuel Montes-y-Gomez, Pilar Gómez, Carolina Reta, Carlos A. Reyes,Alejandro Rosales. Acute Leukemia Classiffcation with Ensemble Particle Swarm Model Selection, In press,Artificial Intelligence in Medicine, Available online 15 April 2012.

• Hugo Jair Escalante, Manuel Montes, and Luis Villaseñor. Particle Swarm Model Selection for AuthorshipVerification. LNCS 5856, pp. 563--570, Springer, 2009.

77

Page 78: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

Questions?Questions?

7th Russian Summer School in Information Retrieval Kazan, Russia, September 2013 78

Page 79: Novel representations and methods in text classificationhugojair/Courses/RUSSIR/5_AutomatedClassi… · Novel representations and methods in text classification Manuel Montes, Hugo

79