Recognizing Ontology-Applicable Multiple-Record Web Documents
David W. Embley, Dennis Ng, Li Xu
Brigham Young University

Page 1

Recognizing Ontology-Applicable Multiple-Record Web Documents

David W. Embley

Dennis Ng

Li Xu

Brigham Young University

Page 2

Problem: Recognizing Applicable Documents

Document 1: Car Ads

Document 2: Items for Sale or Rent

Page 3

A Conceptual Modeling Solution

Page 4

Car-Ads Ontology

Car [->object];

Car [0:0.975:1] has Year [1:*];

Car [0:0.925:1] has Make [1:*];

Car [0:0.908:1] has Model [1:*];

Car [0:0.45:1] has Mileage [1:*];

Car [0:2.1:*] has Feature [1:*];

Car [0:0.8:1] has Price [1:*];

PhoneNr [1:*] is for Car [1:1.15:*];

Year matches [4]

constant {extract "\d{2}";

context "([^\$\d]|^)[4-9]\d,[^\d]";

substitute "^" -> "19"; },

End;
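To make the Year rule concrete, here is a minimal Python sketch of how such a rule might be applied. The two regular expressions are copied from the slide; the way they combine (find each context match, extract the two-digit constant inside it, then prepend "19") is an assumed reading of the rule, and the function name `extract_years` is hypothetical.

```python
import re

# Patterns copied from the Year rule on the slide.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d,[^\d]")
EXTRACT = re.compile(r"\d{2}")


def extract_years(text):
    """Return four-digit years found by the two-digit Year rule.

    Assumed semantics: the constant is extracted from inside each
    context match, and substitute "^" -> "19" prepends the century.
    """
    years = []
    for context_match in CONTEXT.finditer(text):
        constant = EXTRACT.search(context_match.group(0))
        if constant:
            # Replacing the empty match at the start of the string
            # with "19" turns "95" into "1995".
            years.append(re.sub(r"^", "19", constant.group(0)))
    return years


print(extract_years("Honda Accord 95, white, $4,500"))  # ['1995']
print(extract_years("asking $45, obo"))                 # []
```

The second call shows why the context expression excludes a leading "$": a two-digit price is not mistaken for a year.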

Page 5

Recognition Heuristics

• H1: Density

• H2: Expected Values

• H3: Grouping

Page 6

Document 1: Car Ads

Document 2: Items for Sale or Rent

H1: Density

Page 7

H1: Density

• Car Ads
  – Number of Matched Characters: 626
  – Total Number of Characters: 2048
  – Density: 0.306

• Items for Sale or Rent
  – Number of Matched Characters: 196
  – Total Number of Characters: 2671
  – Density: 0.073
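The density figures are the ratio of matched characters to total characters. A small sketch of that arithmetic follows; the helper `matched_characters`, and its choice to count overlapping matches only once, are assumptions added for illustration and are not stated on the slide.

```python
def matched_characters(spans):
    """Count characters covered by (start, end) match spans.

    Assumption: characters covered by overlapping matches count once.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(spans):
        start = max(start, last_end)  # skip the part already counted
        if end > start:
            covered += end - start
            last_end = end
    return covered


def density(matched, total):
    """Fraction of the document's characters matched by the ontology."""
    return matched / total if total else 0.0


print(round(density(626, 2048), 3))  # 0.306 (Car Ads)
print(round(density(196, 2671), 3))  # 0.073 (Items for Sale or Rent)
```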

Page 8

H2: Expected Values

Document 1: Car Ads
  Year: 3
  Make: 2
  Model: 3
  Mileage: 1
  Price: 1
  Feature: 15
  PhoneNr: 3

Document 2: Items for Sale or Rent
  Year: 1
  Make: 0
  Model: 0
  Mileage: 1
  Price: 0
  Feature: 0
  PhoneNr: 4

Page 9

H2: Expected Values

Object Set   OV     D1   D2
Year         0.98   16    6
Make         0.93   10    0
Model        0.91   12    0
Mileage      0.45    6    2
Price        0.80   11    8
Feature      2.10   29    0
PhoneNr      1.15   15   11

D1: 0.996

D2: 0.567

[Figure: vectors OV, D1, and D2]
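The slide does not name the measure behind the 0.996 and 0.567 scores. Cosine similarity between the ontology vector OV and each document vector reproduces both values, so the sketch below assumes that measure; the function name is illustrative.

```python
import math

# Values taken from the table on this slide.
OV = [0.98, 0.93, 0.91, 0.45, 0.80, 2.10, 1.15]
D1 = [16, 10, 12, 6, 11, 29, 15]
D2 = [6, 0, 0, 2, 8, 0, 11]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


print(round(cosine_similarity(OV, D1), 3))  # 0.996
print(round(cosine_similarity(OV, D2), 3))  # 0.567
```

Because the cosine ignores vector length, a long document and a short one with the same proportions of object-set values score the same.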

Page 10

H3: Grouping (of 1-Max Object Sets)

Document 1: Car Ads
Year Make Model Price Year Model Year Make Model Mileage …

Document 2: Items for Sale or Rent
Year Mileage … Mileage Year Price Price …

Page 11

H3: Grouping

Car Ads (distinct 1-Max object sets per group)
  Year, Year, Make, Model: 3
  Price, Year, Model, Year: 3
  Make, Model, Mileage, Year: 4
  Model, Mileage, Price, Year: 4
  …
  Grouping: 0.865

Sale Items (distinct 1-Max object sets per group)
  Year, Year, Year, Mileage: 2
  Mileage, Year, Price, Price: 3
  Year, Price, Price, Year: 2
  Price, Price, Price, Price: 1
  …
  Grouping: 0.500

Expected Number in a Group = sum of Ave over the 1-Max object sets = 4 (for our example)

$$\text{Grouping} = \frac{\text{Sum of Distinct 1-Max in each Group}}{\text{Number of Groups} \times \text{Expected Number in a Group}}$$

For the four groups shown:

$$\text{Car Ads: } \frac{3+3+4+4}{4 \times 4} = 0.875 \qquad \text{Sale Items: } \frac{2+3+2+1}{4 \times 4} = 0.500$$
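The grouping arithmetic can be sketched in a few lines of Python. The sequences below are the four groups listed on this slide for each document; the function name `grouping` and the decision to drop a trailing incomplete group are assumptions, not details given in the deck.

```python
def grouping(values, group_size):
    """Average fraction of distinct 1-max object sets per group.

    Assumption: values are split into consecutive groups of
    group_size, and a trailing incomplete group is ignored.
    """
    groups = [
        values[i:i + group_size]
        for i in range(0, len(values) - group_size + 1, group_size)
    ]
    if not groups:
        return 0.0
    distinct_total = sum(len(set(group)) for group in groups)
    return distinct_total / (len(groups) * group_size)


CAR_ADS = ("Year Year Make Model Price Year Model Year "
           "Make Model Mileage Year Model Mileage Price Year").split()
SALE_ITEMS = ("Year Year Year Mileage Mileage Year Price Price "
              "Year Price Price Year Price Price Price Price").split()

print(grouping(CAR_ADS, 4))     # 0.875
print(grouping(SALE_ITEMS, 4))  # 0.5
```

A document whose records each mention Year, Make, Model, and so on exactly once scores near 1, while one that repeats a few object sets scores much lower.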

Page 12

Combining Heuristics

• Decision-Tree Learning Algorithm C4.5
  – (H1, H2, H3, Positive)
  – (H1, H2, H3, Negative)

• Training Set
  – 20 positive examples
  – 30 negative examples (some purposely similar, e.g. classified ads)

• Test Set
  – 10 positive examples
  – 20 negative examples
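To illustrate how a decision-tree learner picks a test over (H1, H2, H3, label) tuples, the sketch below chooses the single threshold with the highest information gain. This is not C4.5 itself (which uses gain ratio, builds a full tree, and prunes); it only shows the entropy-based split selection such learners rely on. The two rows are the H1 to H3 values reported in this deck for Document 1 and Document 2, and a real run would use the full training set. All function names are hypothetical.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    result = 0.0
    for label in set(labels):
        p = labels.count(label) / total
        result -= p * math.log2(p)
    return result


def best_split(examples):
    """Return (feature_index, threshold, gain) with the highest gain.

    examples: list of ((h1, h2, h3), label) pairs.
    """
    labels = [label for _, label in examples]
    base = entropy(labels)
    best = (None, None, 0.0)
    for i in range(len(examples[0][0])):
        values = sorted({features[i] for features, _ in examples})
        # Candidate thresholds lie midway between adjacent values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [l for f, l in examples if f[i] <= threshold]
            right = [l for f, l in examples if f[i] > threshold]
            remainder = (len(left) * entropy(left)
                         + len(right) * entropy(right)) / len(examples)
            gain = base - remainder
            if gain > best[2]:
                best = (i, threshold, gain)
    return best


EXAMPLES = [
    ((0.306, 0.996, 0.865), "Positive"),  # Document 1: Car Ads
    ((0.073, 0.567, 0.500), "Negative"),  # Document 2: Items for Sale or Rent
]

feature, threshold, gain = best_split(EXAMPLES)
print(feature, gain)
```

With only two rows every heuristic separates the classes perfectly, so the first feature wins the tie; the purposely similar negative examples in the training set are what force a learner to combine heuristics.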

Page 13

Car Ads: Rule & Results

• Precision: 100%
• Recall: 91%
• Accuracy: 97%
  – Harmonic Mean
  – 2/(1/Precision + 1/Recall)
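As a worked instance of the harmonic-mean formula on this slide, the sketch below plugs in the precision and recall reported for the car-ads rule. It only evaluates the formula as written; how that value relates to the 97% accuracy figure is not stated in the deck, and the function name is illustrative.

```python
def harmonic_mean(precision, recall):
    """Harmonic mean of precision and recall: 2/(1/P + 1/R)."""
    if precision == 0 or recall == 0:
        return 0.0
    return 2 / (1 / precision + 1 / recall)


# Car-ads rule: precision 100%, recall 91%.
print(round(harmonic_mean(1.00, 0.91), 3))  # 0.953
```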

Page 14

False Negative

Page 15

Obituaries

Page 16

Obituaries: Rule & Results

• Precision: 91%
• Recall: 100%
• Accuracy: 97%

Page 17

False Positive: Missing Person Report

Page 18

Universal Rule

• Precision: 84%
• Recall: 100%
• Accuracy: 93%

Page 19

Additional and Future Work

• Other Approaches
  – Naïve Bayes [McCallum96] (accuracy near 90%)
  – Logistic Regression [Wang01] (accuracy near 95%)
  – Multivariate Analysis with Continuous Random Vectors [Tang01] (accuracy near 100%)

• More Extensive Testing
  – Similar documents (motorcycles, wedding announcements, …)
  – Accuracy drops to near 87%
  – Naïve Bayes drops to near 77%
  – Others … ?

• Other Types of Documents
  – XML Documents
  – Forms and the Hidden Web
  – Tables

Page 20

Summary

• Objective: Automatically Recognize Document Applicability

• Approach:
  – Conceptual Modeling
  – Recognition Heuristics
    • Density
    • Expected Values
    • Grouping

• Result: Accuracy Near 95%

www.deg.byu.edu