An Automatic Approach to Semantic Annotation of Unstructured, Ungrammatical Sources: A First Look
Matthew Michelson & Craig A. Knoblock
University of Southern California / Information Sciences Institute
Unstructured, Ungrammatical Text

Example post: "02 M3 Convertible .. Absolute beauty!!!" (Car Year: "02", Car Model: "M3")

Semantic annotation:
<Make>BMW</Make>
<Model>M3</Model>
<Trim>2 Dr STD Convertible</Trim>
<Year>2002</Year>

"Understand" & query the posts: we can query on BMW even though "BMW" is not in the post; it is implied!
Note: This is not extraction! (We are not pulling values out of the post…)
Reference Sets
Annotation/extraction is hard:
- Can't rely on structure (wrappers)
- Can't rely on grammar (NLP)
Reference sets are the key (IJCAI 2005):
- Match posts to reference set tuples
- Clues to attributes in posts
- Provide normalized attribute values when matched
Reference Sets
- Collections of entities and their attributes: relational data!
- E.g., scrape make, model, trim, and year for all cars from 1990-2005…
Contributions
Previously                                                | Now
User supplies reference set                               | System selects reference sets from a repository
User trains record linkage between reference set & posts  | Unsupervised matching between reference set & posts
New unsupervised approach: two steps
1) Unsupervised reference set chooser: selects reference set(s) for the posts from a reference set repository, which grows over time, increasing coverage
2) Unsupervised record linkage: matches each post to records of the chosen reference set
Together, the two steps give unsupervised semantic annotation.
Choosing a Reference Set
Vector space model: the set of posts is one document; each reference set is one document.
Select the reference set most similar to the set of posts…

Example posts:
FORD Thunderbird - $4700
2001 White Toyota Corrolla CE Excellent Condition - $8200

Ref. Set    | Score | Percent Difference
Cars        | 0.7   | PD(Cars, Hotels) = 0.75 > T
Hotels      | 0.4   | PD(Hotels, Restaurants) = 0.33 < T
Restaurants | 0.3   |
Average     | 0.47  |

Selected: Cars
Choosing Reference Sets
- Similarity: Jensen-Shannon distance & TF-IDF, both used in the experiments in the paper
- Percent difference as the splitting criterion (sketched below):
  - A relative measure with a "reasonable" threshold; we use T = 0.6 throughout
  - Require score > average as well: small scores with small changes can produce a large percent difference, but such sets are not better, just relatively so…
- If two or more reference sets are selected, annotation runs iteratively
- If two reference sets have the same schema, use the one with the higher rank (eliminates redundant matching)
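A minimal sketch of the chooser, under stated assumptions: similarity is taken as 1 minus the base-2 Jensen-Shannon divergence between unigram distributions, and the split walks the ranked list until the percent difference exceeds T. Function names and tokenization are illustrative, not the paper's exact formulation.

import math
from collections import Counter

def js_similarity(tokens_a, tokens_b):
    # 1 - Jensen-Shannon divergence (base 2) of the two unigram distributions
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    na, nb = sum(ca.values()), sum(cb.values())
    vocab = sorted(set(ca) | set(cb))
    pa = [ca[w] / na for w in vocab]
    pb = [cb[w] / nb for w in vocab]
    mid = [(x + y) / 2 for x, y in zip(pa, pb)]
    kl = lambda p, q: sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi)
    return 1 - (kl(pa, mid) + kl(pb, mid)) / 2

def choose_reference_sets(post_tokens, ref_sets, T=0.6):
    # ref_sets: {name: all tokens of that reference set, treated as one document}
    ranked = sorted(((js_similarity(post_tokens, toks), name)
                     for name, toks in ref_sets.items()), reverse=True)
    avg = sum(score for score, _ in ranked) / len(ranked)
    chosen = []
    for (score, name), (nxt, _) in zip(ranked, ranked[1:]):
        if score <= avg:                       # require score > average as well
            break
        chosen.append(name)
        if nxt and (score - nxt) / nxt > T:    # split point: reject everything below
            return chosen
    return []   # no percent difference exceeded T: select nothing (the Boats case)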
Vector Space Matching for Semantic Annotation
Choosing reference sets: set of posts vs. whole reference set
Vector space matching: each post vs. each reference set record
Modified Dice similarity:
Dice(p, r) = 2 |p ∩ r| / (|p| + |r|)
Modification: if Jaro-Winkler similarity > 0.95, count the token pair in (p ∩ r); this captures spelling errors and abbreviations (see the sketch below).
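A runnable sketch of the modified Dice score with a hand-rolled Jaro-Winkler so it needs no external dependencies; tokenization and the exact bookkeeping of fuzzy intersections are assumptions, not the paper's implementation.

def jaro(s, t):
    # standard Jaro similarity between two strings
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(0, max(len(s), len(t)) // 2 - 1)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, c in enumerate(s):
        for j in range(max(0, i - window), min(i + window + 1, len(t))):
            if not t_hit[j] and t[j] == c:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    mismatches, k = 0, 0
    for i in range(len(s)):
        if s_hit[i]:
            while not t_hit[k]:
                k += 1
            if s[i] != t[k]:
                mismatches += 1
            k += 1
    m = matches
    return (m / len(s) + m / len(t) + (m - mismatches / 2) / m) / 3

def jaro_winkler(s, t, p=0.1):
    # Jaro boosted for a shared prefix of up to 4 characters
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def modified_dice(post_tokens, record_tokens, jw_threshold=0.95):
    # Dice(p, r) = 2|p ∩ r| / (|p| + |r|); a post token joins the intersection
    # if it equals a record token or is Jaro-Winkler similar above 0.95,
    # which catches misspellings like "convertable" vs. "convertible"
    p, r = set(post_tokens), set(record_tokens)
    if not p and not r:
        return 0.0
    inter = sum(1 for tp in p if any(tp == tr or jaro_winkler(tp, tr) > jw_threshold
                                     for tr in r))
    return 2 * inter / (len(p) + len(r))

print(modified_dice("02 m3 convertable".split(),
                    "bmw m3 2 dr std convertible 2002".split()))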
Why Dice?
- TF-IDF with cosine similarity: "City" is given more weight than "Ford" in the reference set.
  Post: "Near New Ford Expedition XLT 4WD with Brand New 22 Wheels!!! (Redwood City - Sale This Weekend !!!) $26850"
  TF-IDF match (score 0.20): {VOLKSWAGEN, JETTA, 4 Dr City Sedan, 1995}
- Jaccard similarity [(p ∩ r)/(p ∪ r)] discounts shorter strings (and many posts are short!)
- Dice correctly MATCHES the example post above to {FORD, EXPEDITION, 4 Dr XLT 4WD SUV, 2005}: Dice = 0.32, Jaccard = 0.19
- Dice boosts the numerator: if the intersection is small, the denominator of Dice is almost the same as Jaccard's, so the numerator matters more (worked comparison below)
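To make the Dice-vs-Jaccard argument concrete, a toy comparison on illustrative token sets (not the paper's exact tokenization, so the numbers differ from the 0.32/0.19 on the slide):

def dice(p, r):
    p, r = set(p), set(r)
    return 2 * len(p & r) / (len(p) + len(r))

def jaccard(p, r):
    p, r = set(p), set(r)
    return len(p & r) / len(p | r)

post = "near new ford expedition xlt 4wd brand 22 wheels redwood city sale".split()
record = "ford expedition 4 dr xlt 4wd suv 2005".split()
# intersection = {ford, expedition, xlt, 4wd}: dice = 2*4/(12+8) = 0.40,
# jaccard = 4/16 = 0.25; Dice penalizes the short record far less
print(dice(post, record), jaccard(post, record))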
Vector Space Matching for Semantic Annotation

Posts:
  new 2007 altima
  02 M3 Convertible .. Absolute beauty!!!
  Awesome car for sale! It's an accord, I think…

Candidate matches (modified Dice):
  {BMW, M3, 2 Dr STD Convertible, 2002}      0.50
  {NISSAN, ALTIMA, 4 Dr 3.5 SE Sedan, 2007}  0.36
  {NISSAN, ALTIMA, 4 Dr 2.5 S Sedan, 2007}   0.36
  {HONDA, ACCORD, 4 Dr LX, 2001}             0.13  < Avg. Dice = 0.33, so discarded

The average score splits matches from non-matches, eliminating false positives:
- The threshold for matches comes from the data itself
- Using the average assumes there are both good matches and bad ones (we see this in the data…)
A sketch of this split follows.
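A minimal sketch of the average-score split, assuming candidate scores were already computed by the modified Dice above; whether the average is taken over one post's candidates or across all posts' best matches is our reading of the slide, not a confirmed detail.

def split_by_average(candidates):
    # candidates: (record, score) pairs; keep only those scoring at or above
    # the average, which serves as an unsupervised match threshold
    avg = sum(score for _, score in candidates) / len(candidates)
    return [(rec, score) for rec, score in candidates if score >= avg]

candidates = [
    (("BMW", "M3", "2 Dr STD Convertible", "2002"), 0.50),
    (("NISSAN", "ALTIMA", "4 Dr 3.5 SE Sedan", "2007"), 0.36),
    (("NISSAN", "ALTIMA", "4 Dr 2.5 S Sedan", "2007"), 0.36),
    (("HONDA", "ACCORD", "4 Dr LX", "2001"), 0.13),
]
# avg = 0.3375, so the 0.13 Accord candidate is discarded as a non-match
print(split_by_average(candidates))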
Vector Space Matching for Semantic Annotation

Attributes in agreement:
- A set of top matches can be ambiguous in its differing attributes. Which is better? All have the maximum score as matches!
- We say none of them: throw away the differences. Union them? In the real world, not all posts have all attributes.
- E.g., for "new 2007 altima":
    {NISSAN, ALTIMA, 4 Dr 3.5 SE Sedan, 2007}  0.36
    {NISSAN, ALTIMA, 4 Dr 2.5 S Sedan, 2007}   0.36
  Make, model, and year agree; the trims differ, so only make, model, and year are annotated (see the sketch below).
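A short sketch of the attributes-in-agreement rule, with hypothetical attribute dicts standing in for reference set records:

def attributes_in_agreement(matches):
    # matches: attribute dicts for the tied top-scoring records; keep only
    # attributes whose values agree across all of them and drop the rest
    agreed = dict(matches[0])
    for m in matches[1:]:
        agreed = {a: v for a, v in agreed.items() if m.get(a) == v}
    return agreed

ties = [
    {"make": "NISSAN", "model": "ALTIMA", "trim": "4 Dr 3.5 SE Sedan", "year": "2007"},
    {"make": "NISSAN", "model": "ALTIMA", "trim": "4 Dr 2.5 S Sedan", "year": "2007"},
]
# prints {'make': 'NISSAN', 'model': 'ALTIMA', 'year': '2007'}: trim is thrown away
print(attributes_in_agreement(ties))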
Experimental Data Sets

Reference Sets:
Name    | Source                     | Attributes                    | Records
Fodors  | Fodors Travel Guide        | name, address, city, cuisine  | 534
Zagat   | Zagat Restaurant Guide     | name, address, city, cuisine  | 330
Comics  | Comics Price Guide         | title, issue, publisher       | 918
Hotels  | Bidding For Travel         | star rating, name, local area | 132
Cars    | Edmunds & Super Lamb Auto  | make, model, trim, year       | 27,006
KBBCars | Kelly Blue Book Car Prices | make, model, trim, year       | 2,777

Posts:
Name        | Source             | Reference Set Match      | Records
BFT         | Bidding For Travel | Hotels                   | 1,125
EBay        | EBay Comics        | Comics                   | 776
Craigs List | Craigs List Cars   | Cars, KBBCars (in order) | 2,568
Boats       | Craigs List Boats  | None                     | 1,099
Results: Choosing Reference Sets (Jensen-Shannon, T = 0.6)

BFT Posts:
Ref. Set | Score | % Diff.
Hotels   | 0.622 | 2.172
Fodors   | 0.196 | 0.05
Cars     | 0.187 | 0.248
KBBCars  | 0.15  | 0.101
Zagat    | 0.136 | 0.161
Comics   | 0.117 |
Average  | 0.234 |

Craig's List Posts:
Ref. Set | Score | % Diff.
Cars     | 0.52  | 0.161
KBBCars  | 0.447 | 1.193
Fodors   | 0.204 | 0.144
Zagat    | 0.178 | 0.365
Hotels   | 0.131 | 0.153
Comics   | 0.113 |
Average  | 0.266 |

EBay Posts:
Ref. Set | Score | % Diff.
Comics   | 0.579 | 2.351
Fodors   | 0.173 | 0.152
Cars     | 0.15  | 0.252
Zagat    | 0.12  | 0.186
Hotels   | 0.101 | 0.170
KBBCars  | 0.086 |
Average  | 0.201 |

Boat Posts:
Ref. Set | Score | % Diff.
Cars     | 0.251 | 0.513
Fodors   | 0.166 | 0.144
KBBCars  | 0.145 | 0.089
Comics   | 0.133 | 0.025
Zagat    | 0.13  | 0.544
Hotels   | 0.084 |
Average  | 0.152 |

(No percent difference exceeds T = 0.6 for the Boat posts, so no reference set is selected, as intended.)
Results: Semantic Annotation

BFT Posts:
Attribute   | Recall | Prec. | F-Measure | Phoebus F-Mes.
Hotel Name  | 88.23  | 89.36 | 88.79     | 92.68
Star Rating | 92.02  | 89.25 | 90.61     | 92.68
Local Area  | 93.77  | 90.52 | 92.17     | 92.68

EBay Posts:
Attribute | Recall | Prec. | F-Measure | Phoebus F-Mes.
Title     | 86.08  | 91.60 | 88.76     | 88.64
Issue     | 70.16  | 89.40 | 78.62     | 88.64
Publisher | 86.08  | 91.60 | 88.76     | 88.64

Craig's List Posts:
Attribute | Recall | Prec. | F-Measure | Phoebus F-Mes.
Make      | 93.96  | 86.35 | 89.99     | N/A
Model     | 82.62  | 81.35 | 81.98     | N/A
Trim      | 71.62  | 51.95 | 60.22     | N/A
Year      | 78.86  | 91.01 | 84.50     | N/A

Where the supervised machine learning system (Phoebus) wins, it benefits from a notion of matches/non-matches in its training data; the lower trim scores reflect the "attributes in agreement" issues above.
Related Work

Semantic annotation:
- Rule- and pattern-based methods assume structure repeats, which is what makes rules & patterns useful; in our case, unstructured data disallows such assumptions.
- SemTag (Dill et al. 2003) looks up tokens in a taxonomy and disambiguates them one token at a time. We disambiguate using all posts during reference set selection, so we don't face ambiguities such as "is jaguar a car or an animal?" (the reference set would tell us!)
- We don't require a carefully formed taxonomy, so we can easily exploit widely available reference sets.

Information extraction using reference sets:
- CRAM: unsupervised extraction, but it is given the reference set and labels all tokens (no junk allowed!)
- Cohen & Sarawagi 2004: supervised extraction; ours is unsupervised.

Resource selection in distributed IR ("Hidden Web") [survey: Craswell et al. 2000]:
- Probe queries are required to estimate coverage because those systems lack full access to the data. Since we have full access to reference sets, we don't use probe queries.
Conclusions
- Unsupervised semantic annotation: the system can accurately query noisy, unstructured sources without human intervention
  - E.g., aggregate queries (average Honda price?) without reading all the posts
- Unsupervised selection of reference sets: the repository grows over time, increasing coverage over time
- Unsupervised annotation is competitive with a machine learning approach, but without the burden of labeling matches
  - Necessary to exploit newly collected reference sets automatically
  - Allows large-scale annotation over time, without user intervention

Future Work
- Unsupervised extraction
- Collect reference sets and manage them with an information mediator