Knowledge Discovery in Ontology Learning A survey


Outline

• Introduction

• OL Data Input

• OL Application Fields

• OL Methods

• OL Tools (practical session)

Introduction

• Ontology Engineering is a time-consuming task

• Ontology Learning (OL) is the semi-automatic process supporting ontology engineering

• OL is a bottom-up, data-driven process

• OL is an interdisciplinary field

OL Data Input

• Pure NL text

• Ontologies

• KB (DB) instances

• Schemata

– DB schemata

– Web schemata

• Log files

OL Application Fields

• OL can support Ontology Engineering (and management) in different phases.

– Ontology extraction: based on some input, the ontology engineer receives an ontology proposal.

– Ontology reuse: pruning existing domain ontologies for a specific application.

– Ontology interoperability (multiple ontology management): mapping discovery.

OL Methods (outline)

• Ontology Extraction (from text)

– Weak ontology notion: Document Ontology extraction

– Strong ontology notion: Association rules, Conceptual clustering

• Ontology Reuse

– Ontology Pruning

• Ontology Learning for interoperability

Document Ontology extraction (1)

• Extraction of concepts from a set of documents and identification of relationships between these concepts with different individual terms [3]

• No semantic relations extraction

• Only concepts extraction (aggregation of terms identified with the same concept)

• Use of statistical analysis over a set of documents

• Good for domain specific applications

Document Ontology extraction (2)

• Input (text documents)

• Pre-processing

• Normalization

• LSI (using SVD)

• Document Ontology Construction

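Document Ontology extraction (3)

Singular Value Decomposition:

$$A_{m \times n} = U_{m \times r} \, S_{r \times r} \, V^{T}_{r \times n}$$

where the rows of $A$ are terms and its columns are documents, and the rows of $U$ are terms and its columns are concepts.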
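As a minimal sketch of the LSI step, the decomposition can be truncated to r concepts with NumPy. The terms, the count matrix, the rank r = 2 and the rule that assigns each term to its strongest concept are illustrative assumptions, not details taken from [3].

```python
import numpy as np

# Hypothetical term-document count matrix A (m terms x n documents).
terms = ["car", "train", "vehicle", "oven", "pan"]
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

def truncated_svd(A, r):
    """Return U (m x r), S (r x r) and Vt (r x n) of the rank-r truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :r], np.diag(s[:r]), Vt[:r, :]

U, S, Vt = truncated_svd(A, r=2)

# Rank-r approximation of the original matrix.
A_approx = U @ S @ Vt

# Assumed grouping rule: a term belongs to the concept (column of U)
# on which it has the largest absolute loading.
term_concept = {t: int(np.argmax(np.abs(U[i]))) for i, t in enumerate(terms)}
```

In this toy matrix the transport terms and the kitchen terms never share a document, so they fall into two different concepts.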

Association Rules (1)

• Make use of shallow text processing techniques [6]

• No taxonomic relation

• Assumption: syntactic relations ⇒ semantic relations

Association Rules (2)• Preprocess the text documents

– Morphological analysis

– Recognition of named entities

– Retrieval of domain specific concepts (if available)

– Disambiguation using context information

• Determine the set of Concept Pairs (CP) using several heuristics (either general or domain-dependent)

– NP-PP heuristic

– Sentence heuristic

– Title heuristic

Association Rules (3)

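• Determine

$$T = \{\{a_{i,1}, \ldots, a_{i,n}\} \mid (a_{i,1}, a_{i,2}) \in CP \wedge \forall m > 2 \; ((a_{i,1}, a_{i,m}) \in H \vee (a_{i,2}, a_{i,m}) \in H)\}$$

• Determine support and confidence for all association rules $X_k \Rightarrow Y_k$, where $|X_k| = |Y_k| = 1$

$$\mathrm{support}(X_k \Rightarrow Y_k) = \frac{|\{t_i \mid X_k \cup Y_k \subseteq t_i\}|}{n}$$

$$\mathrm{confidence}(X_k \Rightarrow Y_k) = \frac{|\{t_i \mid X_k \cup Y_k \subseteq t_i\}|}{|\{t_i \mid X_k \subseteq t_i\}|}$$

• Propose to the user only the rules that exceed user-defined thresholds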
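The support and confidence of single-item rules can be sketched as follows. The transactions, the concept names and the threshold values are hypothetical, and "exceed" is read as a strict comparison.

```python
from itertools import permutations

# Hypothetical transactions: each is the set of concepts linked in one text unit.
transactions = [
    {"hotel", "room"},
    {"hotel", "room", "price"},
    {"hotel", "price"},
    {"room", "bed"},
]

def support(x, y, transactions):
    """Fraction of all transactions that contain both X and Y."""
    both = sum(1 for t in transactions if (x | y) <= t)
    return both / len(transactions)

def confidence(x, y, transactions):
    """Among the transactions that contain X, the fraction that also contain Y."""
    with_x = sum(1 for t in transactions if x <= t)
    both = sum(1 for t in transactions if (x | y) <= t)
    return both / with_x if with_x else 0.0

def propose_rules(transactions, min_support, min_confidence):
    """Return rules X => Y with |X| = |Y| = 1 that exceed both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        s = support({a}, {b}, transactions)
        c = confidence({a}, {b}, transactions)
        if s > min_support and c > min_confidence:
            rules.append((a, b, s, c))
    return rules

rules = propose_rules(transactions, min_support=0.4, min_confidence=0.6)
```

Support is symmetric in X and Y while confidence is not, which is why both directions of each concept pair are evaluated.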

Conceptual Clustering (1)

• Use of a conceptual clustering approach [2,5] to extract a hierarchy of concepts and to learn subcategorization frames

• In our case, the examples to cluster are sets of words, each associated with the frequency of the corresponding instantiated frame in the corpus

• A syntactic parser provides parsed sentences, including attachments of noun phrases to verbs and clauses:

<to travel> <subject: father> <by: car>

<to travel> <subject: neighbor> <by: train>

<to drive> <subject: friend> <by: car>

<to drive> <subject: colleague> <by: motor-bike>

<to drive> <subject: friend> <by: motor-bike>

• Unambiguously parsed sentences are not required; noise is taken into account

• The meaning of the concepts of the ontology is characterized by the subcategorization frames they appear in

Conceptual Clustering (2)

E.g., parsed frames:

<to travel> <subject: father> <by: car>

<to travel> <subject: neighbor> <by: train>

<to drive> <subject: friend> <by: car>

<to drive> <subject: colleague> <by: motor-bike>

<to drive> <subject: friend> <by: motor-bike>

Frames aggregated per verb, with word frequencies:

<to travel> <subject: [father(1), neighbor(1)]> <by: [car(1), train(1)]>

<to drive> <subject: [friend(2), colleague(1)]> <by: [car(1), motor-bike(2)]>

Generalized frames:

<to travel> <subject: human> <by: motorized vehicle>

<to drive> <subject: human> <by: motorized vehicle>
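The aggregation of parsed frames into word clusters with frequencies can be sketched as below. The tuple-and-dictionary representation of a frame is an assumption made for illustration. The final generalization to labels such as "human" is not covered here.

```python
from collections import Counter, defaultdict

# Parsed frames from the example: (verb, {role: head word}).
frames = [
    ("to travel", {"subject": "father", "by": "car"}),
    ("to travel", {"subject": "neighbor", "by": "train"}),
    ("to drive", {"subject": "friend", "by": "car"}),
    ("to drive", {"subject": "colleague", "by": "motor-bike"}),
    ("to drive", {"subject": "friend", "by": "motor-bike"}),
]

def aggregate_frames(frames):
    """Group head words by (verb, role) and count how often each occurs."""
    clusters = defaultdict(Counter)
    for verb, roles in frames:
        for role, word in roles.items():
            clusters[(verb, role)][word] += 1
    return clusters

clusters = aggregate_frames(frames)
```

Each `(verb, role)` key then holds one basic cluster, for example the words seen after "to drive by" together with their counts.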

Conceptual Clustering (3)

C1 (to cook in): oven (4), stew pan (12), frying pan (2)

C2 (to put in): oven (5), stew pan (3), wok (6), pan (2)

Clusters that have maximum overlap (that is, clusters that contain the same words with the same frequencies) have to be merged.

Ontology Pruning

• Ontology pruning is a data-driven means of reusing existing (general) ontologies by tailoring them to a certain domain [4]

• The approach uses data-oriented techniques that are based on word/concept frequencies

• The idea is to compare the frequencies of words/concepts in two different corpora, one domain-specific and one generic

• Words/concepts whose frequency in the domain-specific corpus exceeds their frequency in the generic corpus by a certain percentage are accepted; the others are rejected
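A minimal sketch of the frequency comparison follows. It assumes relative frequencies and a ratio threshold; the parameter `min_ratio`, the toy corpora and the decision to keep words unseen in the generic corpus are illustrative choices, not the exact criterion of [4].

```python
from collections import Counter

def relative_frequencies(tokens):
    """Map each word to its share of all tokens in the corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def prune(domain_tokens, generic_tokens, min_ratio=1.5):
    """Keep words whose domain relative frequency is at least min_ratio
    times their generic relative frequency.

    Assumption: words that never occur in the generic corpus are kept.
    """
    dom = relative_frequencies(domain_tokens)
    gen = relative_frequencies(generic_tokens)
    kept = set()
    for word, f_dom in dom.items():
        f_gen = gen.get(word, 0.0)
        if f_gen == 0.0 or f_dom / f_gen >= min_ratio:
            kept.add(word)
    return kept

# Hypothetical tokenized corpora.
domain = "hotel room hotel booking the the room price".split()
generic = "the the the a hotel of the and".split()
kept = prune(domain, generic)
```

Here "the" is rejected because it is relatively more frequent in the generic corpus, while "hotel" is accepted because it is twice as frequent in the domain corpus.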

OL for Interoperability (1)

• The key challenge here is to find semantic mappings between similar elements from two ontologies [1]

• First problem: how can we define a meaningful similarity measure?

• Second problem: how can we compute such a measure using the available data?

• The assumption here is that instances are available that can be used to learn concepts

OL for Interoperability (2)

• Similarity Measure

– Many definitions are possible (it is task dependent)

– Many similarity measures are based on the joint probability distribution: $P(A, B)$, $P(\neg A, B)$, $P(A, \neg B)$, $P(\neg A, \neg B)$

– Jaccard coefficient:

$$JC(A, B) = \frac{P(A \cap B)}{P(A \cup B)} = \frac{P(A, B)}{P(A, B) + P(\neg A, B) + P(A, \neg B)}$$
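Computed from the joint distribution, the Jaccard coefficient needs only three of the four cells. The sketch below uses hypothetical probability values, and returning 0 when both concepts are empty is an assumption.

```python
def jaccard(p_ab, p_not_a_b, p_a_not_b):
    """Jaccard coefficient from three cells of the joint distribution of A and B."""
    denom = p_ab + p_not_a_b + p_a_not_b
    # Assumed convention: similarity is 0 when neither concept has any mass.
    return p_ab / denom if denom > 0 else 0.0

# Hypothetical values for P(A, B), P(not A, B), P(A, not B).
jc = jaccard(0.2, 0.1, 0.1)
```

The value is 0 for disjoint concepts and 1 when A and B cover exactly the same instances.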

OL for Interoperability (3)

• Distribution estimator

– We assume a set of instances that is representative of the universe covered by the ontology

– $N(U_i^{A,B})$ is the number of instances of the $i$-th ontology that belong to both A and B

– Estimate:

$$P(A, B) = \frac{N(U_1^{A,B}) + N(U_2^{A,B})}{N(U_1) + N(U_2)}$$

– Problem: what if A and B do not belong to the same ontology? (This is our case!)

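OL for Interoperability (4)

[Figure: the instances of the first ontology (concepts R, A, C, D, E, F; instances t1 to t7) are split into $U_1^{A}$ and $U_1^{\neg A}$ and used to train a learner L. The trained learner L is then applied to the instances of the second ontology (concepts G, B, H, I, J; instances s1 to s6), which are already split into $U_2^{B}$ and $U_2^{\neg B}$, yielding the four sets $U_2^{A,B}$, $U_2^{A,\neg B}$, $U_2^{\neg A,B}$ and $U_2^{\neg A,\neg B}$.]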
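The cross-classification idea can be sketched as follows: a learner trained on one ontology labels the instances of the other, and the resulting counts give the joint distribution of A and B. The word-vote learner, the instance strings and all function names are hypothetical stand-ins; [1] uses its own learners.

```python
from collections import Counter

def train_learner(positive, negative):
    """Train a toy word-vote classifier.

    A word votes for the class in which it occurs more often.
    Returns a function mapping an instance (a string) to True or False.
    """
    pos = Counter(w for inst in positive for w in inst.split())
    neg = Counter(w for inst in negative for w in inst.split())

    def predict(instance):
        score = sum(pos[w] - neg[w] for w in instance.split())
        return score > 0

    return predict

def estimate_joint(u1_a, u1_not_a, u2_b, u2_not_b):
    """Estimate the joint distribution of A (ontology 1) and B (ontology 2).

    Keys of the result are (in_A, in_B) pairs of booleans.
    """
    is_a = train_learner(u1_a, u1_not_a)   # applied to ontology-2 instances
    is_b = train_learner(u2_b, u2_not_b)   # applied to ontology-1 instances
    counts = Counter()
    for inst in u1_a:
        counts[(True, is_b(inst))] += 1
    for inst in u1_not_a:
        counts[(False, is_b(inst))] += 1
    for inst in u2_b:
        counts[(is_a(inst), True)] += 1
    for inst in u2_not_b:
        counts[(is_a(inst), False)] += 1
    total = sum(counts.values())
    cells = [(True, True), (True, False), (False, True), (False, False)]
    return {cell: counts[cell] / total for cell in cells}

# Hypothetical instances of concept A (ontology 1) and concept B (ontology 2).
joint = estimate_joint(
    u1_a=["assistant professor", "associate professor"],
    u1_not_a=["graduate student", "admin staff"],
    u2_b=["full professor", "visiting professor", "senior lecturer"],
    u2_not_b=["undergraduate student", "technical staff"],
)
```

The entry `joint[(True, True)]` corresponds to the estimate of P(A, B) over the instances of both ontologies, and the other three entries give the remaining cells needed by a similarity measure such as the Jaccard coefficient.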

OL Tools (KAON)

• http://kaon.semanticweb.org

• Open Source

• Java based

• Implements a modular framework

• Text2Onto, module for OL from text (association rules, see Association Rules (1))

• Ontology Pruning implemented (simple filter on TF)

References

[1] A. Doan, J. Madhavan, P. Domingos, A. Halevy. Learning to map between ontologies on the Semantic Web. In Proceedings of the 11th International World Wide Web Conference (WWW 2002), Hawaii, USA, May 2002.

[2] D. Faure, C. Nedellec. A corpus-based conceptual clustering method for verb frames and ontology acquisition. In 1st International Conference on Language Resources and Evaluation, Workshop on Adapting Lexical and Corpus Resources to Sublanguages and Applications, Granada, Spain, pages 1-8, 1998.

[3] G. R. Maddi, C. S. Velvadapu, S. Srivastava, J. Gil de Lamadrid. Ontology Extraction from text documents by Singular Value Decomposition.

[4] A. Maedche, R. Volz, R. Studer, B. Lauser. Pruning-based identification of a domain in ontologies. In Proc. of I-KNOW'03, Graz, Austria, July 2003.

[5] A. Maedche, V. Zacharias. Ontology-based Instance Clustering. In Proc. of ECML/PKDD. Springer, 2002.

[6] A. Maedche, S. Staab. Discovering Conceptual Relations from Text. In Proc. of ECAI-2000.
