Mapping Between Taxonomies Elena Eneva 11 Dec 2001 Advanced IR Seminar

Page 1: Mapping Between Taxonomies

Mapping Between Taxonomies

Elena Eneva

11 Dec 2001

Advanced IR Seminar

Page 2: Mapping Between Taxonomies

Mapping Between Taxonomies

Formal systems of orderly classification of knowledge, which are designed for a specific purpose

Companies organizing information in various ways (e.g. one for marketing, another for product development)

Page 3: Mapping Between Taxonomies

Approach

By country: German, French

By industry: Textile, Automobile

Page 7: Mapping Between Taxonomies

Approach

By industry: Textile, Automobile

Page 8: Mapping Between Taxonomies

Approach

By industry: Textile, Automobile

[Figure: documents, drawn as "abc" placeholders]

Page 10: Mapping Between Taxonomies

Approach

By country: German, French

By industry: Textile, Automobile

[Figure: four documents, drawn as "abc" placeholders]

Page 12: Mapping Between Taxonomies

Approach

By country: German, French

By industry: Textile, Automobile

[Figure: two rows of four documents, drawn as "abc" placeholders]

Page 13: Mapping Between Taxonomies

Datasets

Two classification schemes:

Reuters 2001 (807,900 docs)
  Topics (127)
  Industry categories (871)
  Regions (376)

Hoovers-255 and Hoovers-28 (4,286 docs)
  Industry categories (28)
  Industry categories (255)

Page 14: Mapping Between Taxonomies

Learning

2 separate methods of learning for the documents:
  Old doc category -> new doc category
  Doc contents -> new category

Combined method:
  Weighted average based on confidence
  Final result determined by a decision tree
  One combined learner – used both old category and contents as features
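
The first method above (old doc category -> new doc category) can be illustrated with a simple frequency table. This is only a sketch of the idea, not the learners used in the talk: the function names and the toy training pairs are invented, and the "confidence" here is assumed to be the relative frequency of the winning new category.

```python
from collections import Counter, defaultdict

def fit_label_mapping(pairs):
    """Count how often each old category co-occurs with each new category."""
    counts = defaultdict(Counter)
    for old, new in pairs:
        counts[old][new] += 1
    return counts

def predict_with_confidence(counts, old):
    """Return (most frequent new category, its relative frequency)."""
    seen = counts.get(old)
    if not seen:
        # Old category never observed in training: no prediction.
        return None, 0.0
    new, n = seen.most_common(1)[0]
    return new, n / sum(seen.values())

# Toy (old category, new category) pairs; all values are invented.
train = [
    ("German", "Automobile"),
    ("German", "Automobile"),
    ("German", "Textile"),
    ("French", "Textile"),
]
mapping = fit_label_mapping(train)
german_prediction = predict_with_confidence(mapping, "German")
french_prediction = predict_with_confidence(mapping, "French")
unknown_prediction = predict_with_confidence(mapping, "Italian")
```

A learner of this kind never looks at the document text, so it cannot separate two documents that share the same old category.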

Page 15: Mapping Between Taxonomies

Simple Learners

Simple Decision Tree (C4.5) – learns probabilities of new categories based on 1 kind of feature:
  Old categories (doesn’t know about documents/words)
  Word-based classification (doesn’t know about old categories)

Naïve Bayes (rainbow)
  Old categories (doesn’t know about documents/words)
  Word-based classification (doesn’t know about old categories)

Support Vector Machine (SVM-Light)
  Word-based classification (doesn’t know about old categories), linear kernel [results will be reported in the final paper]
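
To make the word-based variant concrete, here is a minimal multinomial Naive Bayes sketch in plain Python. It is not the rainbow implementation named on the slide; the class name, the add-one smoothing choice, and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for words, label in zip(docs, labels):
            self.word_counts[label].update(words)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, words):
        """Return log P(class) + sum of log P(word | class) per class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for w in words:
                if w in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[w] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, words):
        scores = self.log_scores(words)
        return max(scores, key=scores.get)

# Toy tokenized documents with new-taxonomy labels; all invented.
docs = [
    ["engine", "car", "factory"],
    ["car", "dealer", "engine"],
    ["cotton", "fabric", "mill"],
    ["fabric", "weaving", "cotton"],
]
labels = ["Automobile", "Automobile", "Textile", "Textile"]
model = TinyNaiveBayes().fit(docs, labels)
car_label = model.predict(["car", "engine", "sales"])
fabric_label = model.predict(["cotton", "mill"])
```

The same class could serve as the category-only learner by passing each document's old category as its single "word".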

Page 16: Mapping Between Taxonomies

Learning

Using the document content: DT, NB, SVM

Using the document labels: DT, NB, SVM

Page 17: Mapping Between Taxonomies

Combined Learners

Weighted Average
  Voting scheme

Combination Decision Tree
  Takes the outputs and confidences of two of the simple learners, predicts new category
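
The weighted-average voting scheme can be sketched as follows. The slides do not give the exact weighting, so this assumes each simple learner outputs a confidence per new category and that learner weights are supplied by the caller; the function name and the numbers are hypothetical. The combination decision tree is not shown here.

```python
def weighted_vote(predictions, weights):
    """Combine per-learner category confidences by a weighted average.

    predictions: list of dicts mapping category -> confidence, one per learner
    weights: list of non-negative learner weights, same length as predictions
    """
    total_weight = sum(weights)
    combined = {}
    for dist, weight in zip(predictions, weights):
        for category, confidence in dist.items():
            share = weight * confidence / total_weight
            combined[category] = combined.get(category, 0.0) + share
    best = max(combined, key=combined.get)
    return best, combined

# Invented outputs: the label-based learner is unsure, the word-based one is not.
label_learner = {"Automobile": 0.55, "Textile": 0.45}
word_learner = {"Automobile": 0.10, "Textile": 0.90}
winner, scores = weighted_vote([label_learner, word_learner], weights=[1.0, 1.0])
```

With equal weights the more confident learner dominates, which is the intended effect of averaging confidences rather than counting votes.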

Page 18: Mapping Between Taxonomies

Learning

Using both the content and the label: DT

Combining the two outputs: DT, NB, SVM (content) and DT, NB, SVM (label), combined by voting or by a 3rd classifier

Page 19: Mapping Between Taxonomies

Results Words Only

5-fold cross validation

[Chart: "Words Only"; y-axis: % accuracy (0 to 60); x-axis: 28p255, 255p28; series: words only NB, words only DT]
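
For readers unfamiliar with the evaluation protocol, 5-fold cross validation can be sketched as below. This is a generic illustration, not the evaluation code behind the reported numbers: the helper names and the majority-label stand-in learner are hypothetical, and the folds are contiguous (real use would normally shuffle the documents first).

```python
def k_fold_indices(n_items, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size

def cross_validate(items, labels, fit, predict, k=5):
    """Return the mean accuracy over k folds."""
    accuracies = []
    for train, test in k_fold_indices(len(items), k):
        model = fit([items[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, items[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)

# Hypothetical stand-in learner: always predict the most common training label.
def fit_majority(train_items, train_labels):
    return max(set(train_labels), key=train_labels.count)

def predict_majority(model, item):
    return model

items = list(range(10))
labels = ["a"] * 8 + ["b"] * 2
folds = list(k_fold_indices(len(items), k=5))
mean_accuracy = cross_validate(items, labels, fit_majority, predict_majority, k=5)
```

Each document is used for testing exactly once, and the reported figure is the average accuracy over the five held-out folds.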

Page 20: Mapping Between Taxonomies

Results Categories Only

5-fold cross validation

[Chart: "Categories Only"; y-axis: % accuracy (0 to 120); x-axis: 28p255, 255p28; series: categs only NB, categs only DT]

Page 21: Mapping Between Taxonomies

Results Combination

5-fold cross validation

[Chart: "Combination"; y-axis: % accuracy (0 to 120); x-axis: 28p255, 255p28; series: Combination Vote, Combination Comb]

Page 22: Mapping Between Taxonomies

Results

Accuracy (%)

words only    NB      DT
28p255        21.14   7.9
255p28        53.2    17.5

categs only   NB      DT
28p255        26.19   26.19
255p28        100     100

Combination   Vote    Comb
28p255        28.05   30.26
255p28        100     100

Page 23: Mapping Between Taxonomies

Remarks

Hierarchy (old classes) usually ignored

Shown that it helps

Learners are not the issue

Better way of understanding

Old label (or hierarchy path) is meta data

Page 24: Mapping Between Taxonomies

Remaining Work

SVM results (running even as we speak)

Repeat experiments on Reuters-2001
  Internal hierarchies
  Missing labels
  Less correlated types of classes

Results in standard evaluation format

Page 25: Mapping Between Taxonomies

Future Work

Try with a web dataset (Google and Yahoo! hierarchies)

Hierarchies of more levels

Meta data (for non-text sources)

Page 26: Mapping Between Taxonomies

Related Literature

A Study of Approaches to Hypertext Categorization, Y. Yang, S. Slattery, R. Ghani, Journal of Intelligent Information Systems, Volume 18, Number 2, March 2002 (to appear).

Learning Mappings between Data Schemas, A. Doan, P. Domingos, and A. Levy, Proceedings of the AAAI-2000 Workshop on Learning Statistical Models from Relational Data, 2000, Austin, TX.

Page 27: Mapping Between Taxonomies

Questions and Suggestions

The end.

Page 28: Mapping Between Taxonomies

DT accuracy vs Vocabulary size

[Chart: y-axis: % accuracy (0 to 70); x-axis: vocabulary size (10, 100, 500, 1000, 2000); series: train accuracy, test accuracy]

Page 29: Mapping Between Taxonomies

Taxonomies

Formal systems of orderly classification of knowledge, which are designed for a specific purpose

Change of purpose, change of taxonomies

Businesses often need and keep the information in several structures

Important to be able to automatically map between taxonomies

Page 30: Mapping Between Taxonomies

Useful Mappings

Companies organizing information in various ways (e.g. one for marketing, another for product development)

Personal online bookmark classification

Search engines (e.g. Google <-> Yahoo)

EU Committee for Standardization: “detailed overview of the existing taxonomies officially used in the EU, in order to derive general concepts such as: information organisation, properties, multilinguality, keywords, etc. and, last but not least, the mapping between.”