An Automatic Approach to Semantic Annotation of Unstructured, Ungrammatical Sources: A First Look
Matthew Michelson & Craig A. Knoblock
University of Southern California / Information Sciences Institute
Unstructured, Ungrammatical Text

Example post: "02 M3 Convertible .. Absolute beauty!!!" (Car Year: "02", Car Model: "M3")

Semantic annotation:
<Make>BMW</Make>
<Model>M3</Model>
<Trim>2 Dr STD Convertible</Trim>
<Year>2002</Year>

"Understand" & query the posts: we can query on BMW even though "BMW" is not in the post; it is implied!
Note: This is not extraction! (We are not pulling values out of the post…)
Reference Sets
Annotation/extraction is hard:
- Can't rely on structure (wrappers)
- Can't rely on grammar (NLP)
Reference sets are the key (IJCAI 2005):
- Match posts to reference set tuples
- Clues to attributes in posts
- Provide normalized attribute values when matched
Reference Sets
- Collections of entities and their attributes: relational data!
- E.g., scrape make, model, trim, and year for all cars from 1990-2005…
Contributions
Previously                                                | Now
User supplies reference set                               | System selects reference sets from a repository
User trains record linkage between reference set & posts  | Unsupervised matching between reference set & posts
New unsupervised approach: two steps
1) Unsupervised reference set chooser: selects reference set(s) for the posts from a reference set repository, which grows over time, increasing coverage
2) Unsupervised record linkage: matches each post to records of the chosen reference set
Together, the two steps give unsupervised semantic annotation.
Choosing a Reference Set
Vector space model: the set of posts is one document; each reference set is one document.
Select the reference set most similar to the set of posts…

Example posts:
FORD Thunderbird - $4700
2001 White Toyota Corrolla CE Excellent Condition - $8200

Ref. Set    | Score | Percent Difference
Cars        | 0.7   | PD(Cars, Hotels) = 0.75 > T
Hotels      | 0.4   | PD(Hotels, Restaurants) = 0.33 < T
Restaurants | 0.3   |
Average     | 0.47  |

Selected: Cars
Choosing Reference Sets
- Similarity: Jensen-Shannon distance & TF-IDF, both used in the experiments in the paper
- Percent difference as the splitting criterion (sketched below):
  - A relative measure with a "reasonable" threshold; we use T = 0.6 throughout
  - Require score > average as well: small scores with small changes can produce a large percent difference, but such sets are not better, just relatively so…
- If two or more reference sets are selected, annotation runs iteratively
- If two reference sets have the same schema, use the one with the higher rank (eliminates redundant matching)
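A minimal sketch of the chooser, under stated assumptions: similarity is taken as 1 minus the base-2 Jensen-Shannon divergence between unigram distributions, and the split walks the ranked list until the percent difference exceeds T. Function names and tokenization are illustrative, not the paper's exact formulation.

import math
from collections import Counter

def js_similarity(tokens_a, tokens_b):
    # 1 - Jensen-Shannon divergence (base 2) of the two unigram distributions
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    na, nb = sum(ca.values()), sum(cb.values())
    vocab = sorted(set(ca) | set(cb))
    pa = [ca[w] / na for w in vocab]
    pb = [cb[w] / nb for w in vocab]
    mid = [(x + y) / 2 for x, y in zip(pa, pb)]
    kl = lambda p, q: sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi)
    return 1 - (kl(pa, mid) + kl(pb, mid)) / 2

def choose_reference_sets(post_tokens, ref_sets, T=0.6):
    # ref_sets: {name: all tokens of that reference set, treated as one document}
    ranked = sorted(((js_similarity(post_tokens, toks), name)
                     for name, toks in ref_sets.items()), reverse=True)
    avg = sum(score for score, _ in ranked) / len(ranked)
    chosen = []
    for (score, name), (nxt, _) in zip(ranked, ranked[1:]):
        if score <= avg:                       # require score > average as well
            break
        chosen.append(name)
        if nxt and (score - nxt) / nxt > T:    # split point: reject everything below
            return chosen
    return []   # no percent difference exceeded T: select nothing (the Boats case)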
Vector Space Matching for Semantic Annotation
Choosing reference sets: set of posts vs. whole reference set
Vector space matching: each post vs. each reference set record
Modified Dice similarity:
Dice(p, r) = 2 |p ∩ r| / (|p| + |r|)
Modification: if Jaro-Winkler similarity > 0.95, count the token pair in (p ∩ r); this captures spelling errors and abbreviations (see the sketch below).
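A runnable sketch of the modified Dice score with a hand-rolled Jaro-Winkler so it needs no external dependencies; tokenization and the exact bookkeeping of fuzzy intersections are assumptions, not the paper's implementation.

def jaro(s, t):
    # standard Jaro similarity between two strings
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(0, max(len(s), len(t)) // 2 - 1)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, c in enumerate(s):
        for j in range(max(0, i - window), min(i + window + 1, len(t))):
            if not t_hit[j] and t[j] == c:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    mismatches, k = 0, 0
    for i in range(len(s)):
        if s_hit[i]:
            while not t_hit[k]:
                k += 1
            if s[i] != t[k]:
                mismatches += 1
            k += 1
    m = matches
    return (m / len(s) + m / len(t) + (m - mismatches / 2) / m) / 3

def jaro_winkler(s, t, p=0.1):
    # Jaro boosted for a shared prefix of up to 4 characters
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def modified_dice(post_tokens, record_tokens, jw_threshold=0.95):
    # Dice(p, r) = 2|p ∩ r| / (|p| + |r|); a post token joins the intersection
    # if it equals a record token or is Jaro-Winkler similar above 0.95,
    # which catches misspellings like "convertable" vs. "convertible"
    p, r = set(post_tokens), set(record_tokens)
    if not p and not r:
        return 0.0
    inter = sum(1 for tp in p if any(tp == tr or jaro_winkler(tp, tr) > jw_threshold
                                     for tr in r))
    return 2 * inter / (len(p) + len(r))

print(modified_dice("02 m3 convertable".split(),
                    "bmw m3 2 dr std convertible 2002".split()))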
Why Dice?
- TF-IDF with cosine similarity: "City" is given more weight than "Ford" in the reference set.
  Post: "Near New Ford Expedition XLT 4WD with Brand New 22 Wheels!!! (Redwood City - Sale This Weekend !!!) $26850"
  TF-IDF match (score 0.20): {VOLKSWAGEN, JETTA, 4 Dr City Sedan, 1995}
- Jaccard similarity [(p ∩ r)/(p ∪ r)] discounts shorter strings (and many posts are short!)
- Dice correctly MATCHES the example post above to {FORD, EXPEDITION, 4 Dr XLT 4WD SUV, 2005}: Dice = 0.32, Jaccard = 0.19
- Dice boosts the numerator: if the intersection is small, the denominator of Dice is almost the same as Jaccard's, so the numerator matters more (worked comparison below)
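To make the Dice-vs-Jaccard argument concrete, a toy comparison on illustrative token sets (not the paper's exact tokenization, so the numbers differ from the 0.32/0.19 on the slide):

def dice(p, r):
    p, r = set(p), set(r)
    return 2 * len(p & r) / (len(p) + len(r))

def jaccard(p, r):
    p, r = set(p), set(r)
    return len(p & r) / len(p | r)

post = "near new ford expedition xlt 4wd brand 22 wheels redwood city sale".split()
record = "ford expedition 4 dr xlt 4wd suv 2005".split()
# intersection = {ford, expedition, xlt, 4wd}: dice = 2*4/(12+8) = 0.40,
# jaccard = 4/16 = 0.25; Dice penalizes the short record far less
print(dice(post, record), jaccard(post, record))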
Vector Space Matching for Semantic Annotation

Posts:
  new 2007 altima
  02 M3 Convertible .. Absolute beauty!!!
  Awesome car for sale! It's an accord, I think…

Candidate matches (modified Dice):
  {BMW, M3, 2 Dr STD Convertible, 2002}      0.50
  {NISSAN, ALTIMA, 4 Dr 3.5 SE Sedan, 2007}  0.36
  {NISSAN, ALTIMA, 4 Dr 2.5 S Sedan, 2007}   0.36
  {HONDA, ACCORD, 4 Dr LX, 2001}             0.13  < Avg. Dice = 0.33, so discarded

The average score splits matches from non-matches, eliminating false positives:
- The threshold for matches comes from the data itself
- Using the average assumes there are both good matches and bad ones (we see this in the data…)
A sketch of this split follows.
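A minimal sketch of the average-score split, assuming candidate scores were already computed by the modified Dice above; whether the average is taken over one post's candidates or across all posts' best matches is our reading of the slide, not a confirmed detail.

def split_by_average(candidates):
    # candidates: (record, score) pairs; keep only those scoring at or above
    # the average, which serves as an unsupervised match threshold
    avg = sum(score for _, score in candidates) / len(candidates)
    return [(rec, score) for rec, score in candidates if score >= avg]

candidates = [
    (("BMW", "M3", "2 Dr STD Convertible", "2002"), 0.50),
    (("NISSAN", "ALTIMA", "4 Dr 3.5 SE Sedan", "2007"), 0.36),
    (("NISSAN", "ALTIMA", "4 Dr 2.5 S Sedan", "2007"), 0.36),
    (("HONDA", "ACCORD", "4 Dr LX", "2001"), 0.13),
]
# avg = 0.3375, so the 0.13 Accord candidate is discarded as a non-match
print(split_by_average(candidates))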
Vector Space Matching for Semantic Annotation

Attributes in agreement:
- A set of top matches can be ambiguous in its differing attributes. Which is better? All have the maximum score as matches!
- We say none of them: throw away the differences. Union them? In the real world, not all posts have all attributes.
- E.g., for "new 2007 altima":
    {NISSAN, ALTIMA, 4 Dr 3.5 SE Sedan, 2007}  0.36
    {NISSAN, ALTIMA, 4 Dr 2.5 S Sedan, 2007}   0.36
  Make, model, and year agree; the trims differ, so only make, model, and year are annotated (see the sketch below).
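A short sketch of the attributes-in-agreement rule, with hypothetical attribute dicts standing in for reference set records:

def attributes_in_agreement(matches):
    # matches: attribute dicts for the tied top-scoring records; keep only
    # attributes whose values agree across all of them and drop the rest
    agreed = dict(matches[0])
    for m in matches[1:]:
        agreed = {a: v for a, v in agreed.items() if m.get(a) == v}
    return agreed

ties = [
    {"make": "NISSAN", "model": "ALTIMA", "trim": "4 Dr 3.5 SE Sedan", "year": "2007"},
    {"make": "NISSAN", "model": "ALTIMA", "trim": "4 Dr 2.5 S Sedan", "year": "2007"},
]
# prints {'make': 'NISSAN', 'model': 'ALTIMA', 'year': '2007'}: trim is thrown away
print(attributes_in_agreement(ties))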
Experimental Data Sets

Reference Sets:
Name    | Source                     | Attributes                    | Records
Fodors  | Fodors Travel Guide        | name, address, city, cuisine  | 534
Zagat   | Zagat Restaurant Guide     | name, address, city, cuisine  | 330
Comics  | Comics Price Guide         | title, issue, publisher       | 918
Hotels  | Bidding For Travel         | star rating, name, local area | 132
Cars    | Edmunds & Super Lamb Auto  | make, model, trim, year       | 27,006
KBBCars | Kelly Blue Book Car Prices | make, model, trim, year       | 2,777

Posts:
Name        | Source             | Reference Set Match      | Records
BFT         | Bidding For Travel | Hotels                   | 1,125
EBay        | EBay Comics        | Comics                   | 776
Craigs List | Craigs List Cars   | Cars, KBBCars (in order) | 2,568
Boats       | Craigs List Boats  | None                     | 1,099
Results: Choosing Reference Sets (Jensen-Shannon, T = 0.6)

BFT Posts:
Ref. Set | Score | % Diff.
Hotels   | 0.622 | 2.172
Fodors   | 0.196 | 0.05
Cars     | 0.187 | 0.248
KBBCars  | 0.15  | 0.101
Zagat    | 0.136 | 0.161
Comics   | 0.117 |
Average  | 0.234 |

Craig's List Posts:
Ref. Set | Score | % Diff.
Cars     | 0.52  | 0.161
KBBCars  | 0.447 | 1.193
Fodors   | 0.204 | 0.144
Zagat    | 0.178 | 0.365
Hotels   | 0.131 | 0.153
Comics   | 0.113 |
Average  | 0.266 |

EBay Posts:
Ref. Set | Score | % Diff.
Comics   | 0.579 | 2.351
Fodors   | 0.173 | 0.152
Cars     | 0.15  | 0.252
Zagat    | 0.12  | 0.186
Hotels   | 0.101 | 0.170
KBBCars  | 0.086 |
Average  | 0.201 |

Boat Posts:
Ref. Set | Score | % Diff.
Cars     | 0.251 | 0.513
Fodors   | 0.166 | 0.144
KBBCars  | 0.145 | 0.089
Comics   | 0.133 | 0.025
Zagat    | 0.13  | 0.544
Hotels   | 0.084 |
Average  | 0.152 |

(No percent difference exceeds T = 0.6 for the Boat posts, so no reference set is selected, as intended.)
Results: Semantic Annotation

BFT Posts:
Attribute   | Recall | Prec. | F-Measure | Phoebus F-Mes.
Hotel Name  | 88.23  | 89.36 | 88.79     | 92.68
Star Rating | 92.02  | 89.25 | 90.61     | 92.68
Local Area  | 93.77  | 90.52 | 92.17     | 92.68

EBay Posts:
Attribute | Recall | Prec. | F-Measure | Phoebus F-Mes.
Title     | 86.08  | 91.60 | 88.76     | 88.64
Issue     | 70.16  | 89.40 | 78.62     | 88.64
Publisher | 86.08  | 91.60 | 88.76     | 88.64

Craig's List Posts:
Attribute | Recall | Prec. | F-Measure | Phoebus F-Mes.
Make      | 93.96  | 86.35 | 89.99     | N/A
Model     | 82.62  | 81.35 | 81.98     | N/A
Trim      | 71.62  | 51.95 | 60.22     | N/A
Year      | 78.86  | 91.01 | 84.50     | N/A

Where the supervised machine learning system (Phoebus) wins, it benefits from a notion of matches/non-matches in its training data; the lower trim scores reflect the "attributes in agreement" issues above.
Related Work

Semantic annotation:
- Rule- and pattern-based methods assume structure repeats, which is what makes rules & patterns useful; in our case, unstructured data disallows such assumptions.
- SemTag (Dill et al. 2003) looks up tokens in a taxonomy and disambiguates them one token at a time. We disambiguate using all posts during reference set selection, so we don't face ambiguities such as "is jaguar a car or an animal?" (the reference set would tell us!)
- We don't require a carefully formed taxonomy, so we can easily exploit widely available reference sets.

Information extraction using reference sets:
- CRAM: unsupervised extraction, but it is given the reference set and labels all tokens (no junk allowed!)
- Cohen & Sarawagi 2004: supervised extraction; ours is unsupervised.

Resource selection in distributed IR ("Hidden Web") [survey: Craswell et al. 2000]:
- Probe queries are required to estimate coverage because those systems lack full access to the data. Since we have full access to reference sets, we don't use probe queries.
Conclusions
- Unsupervised semantic annotation: the system can accurately query noisy, unstructured sources without human intervention
  - E.g., aggregate queries (average Honda price?) without reading all the posts
- Unsupervised selection of reference sets: the repository grows over time, increasing coverage over time
- Unsupervised annotation is competitive with a machine learning approach, but without the burden of labeling matches
  - Necessary to exploit newly collected reference sets automatically
  - Allows large-scale annotation over time, without user intervention

Future Work
- Unsupervised extraction
- Collect reference sets and manage them with an information mediator