Stop Word and Related Problems in Web Interface Integration

Eduard C. Dragut (speaker), Fang Fang, Clement Yu, Prasad Sistla (University of Illinois at Chicago)
Weiyi Meng (SUNY at Binghamton)

VLDB 2009, Lyon, France



Page 2: Stop Word and Related Problems in Web Interface Integration

Page 2E. Dragut et al -Stop Word and Related Problems in Web Interface Integration

Objectives

Address the problem of automatically identifying the set of stop words in a given application domain.
  "Stop words is the name given to words which are filtered out prior to, or after, processing of natural language data (text)" (wikipedia.org, answers.com).
  Hans Peter Luhn is credited with coining the phrase.

Establish semantic relationships between multi-word phrases beyond those in electronic dictionaries (e.g., WordNet).
  We focus on synonymy and hyponymy/hypernymy relationships.

Analyze the impact of words such as "and" and "or" when establishing semantic relationships.
  E.g., is "drop-off date and time" a hyponym of "date and time"?

Page 3: Stop Word and Related Problems in Web Interface Integration

A Motivating Scenario for Integration

Looking for the cheapest ticket Chicago – Paris, August 20th – August 29th

A user looking for the "best" price for a ticket:
  Has to explore multiple sources.
  It is tedious, frustrating and time-consuming.

[Figure: the user visits BritishAirline.com, united.com and AirFrance.com separately]

Page 4: Stop Word and Related Problems in Web Interface Integration

The Goal

Provide a unified way to query multiple sources in the same domain.

[Figure: the user formulates the query on a unified query interface, which reaches Lufthansa.com, nwa.com, delta.com, united.com and AirFrance.com over the Web]

Page 5: Stop Word and Related Problems in Web Interface Integration

Overview of Integrating Web Interfaces

[Figure: processing pipeline over the (Deep) Web]
  Extract query interfaces [He05, Zhang04, Dragut09]; various formats, e.g., ASCII files
  Cluster query interfaces [Barbosa07, He04, Peng04]; clusters shown: Auto, Car Rental, Books, Airfare
  Match query interfaces [B.He03, Dhamankar04, Doan02, Madhavan05, Wu04, 06]
  Integration of interfaces [H.He03, Dragut06]

Page 6: Stop Word and Related Problems in Web Interface Integration

Motivation for Stop Words

Automating the process of identifying the set of stop words.

Establishing semantic relationships between labels:
  Stop words express important semantic information, and their removal may lead to erroneous logical inferences.
  Stop word removal may leave some labels empty. Issue: no semantic relationships can be established using empty labels.

Page 7: Stop Word and Related Problems in Web Interface Integration

Motivation for Stop Words, cont'

Stop words are domain dependent, i.e., a stop word in one domain may not be a stop word in another domain.
  The word "where" is a stop word in the Credit Card domain, but not in the Airline domain.

Page 8: Stop Word and Related Problems in Web Interface Integration

Motivation for Semantic Enrichment Words

The labels of attributes may contain the words AND, OR and the characters "/", "&".

Questions:
  What are their semantics?
  Where are they used, in the labels of fields or in the labels of sections?
  How should they be handled when semantic relationships are established?
    Is "Pick-up Date & Time" a hyponym of "Dates & Times"?
    Is "Pick-up Date" a hyponym of "Pick-up Date & Time"?

Page 9: Stop Word and Related Problems in Web Interface Integration

Motivation for Semantic Relationships

Goal:
  Provide a systematic way to distinguish between synonymy and hyponymy relationships.

Usage:
  Schema matching.
  Naming the attributes of an integrated query interface [Dragut06], as part of Web interface integration. This is the main motivation.
  Integration of hierarchies: two synonym concepts from distinct hierarchies are collapsed into one concept in the integrated hierarchy.

Page 10: Stop Word and Related Problems in Web Interface Integration

The Stop Word Problem - Solution

The problem:
  Given a set of query interfaces in the same application domain (e.g., real estate), determine those words within the labels of the query interfaces that are stop words.

The input:
  A set of query interfaces in the same domain. E.g., Airline domain: Delta, AA, NWA, Orbitz, Travelocity.
  Each query interface is represented hierarchically [Wu04].

[Figure: hierarchical representation of a Vacations query interface. The internal node "Where and when do you want to travel?" groups the departure and destination fields (Leaving/Departing from, Going to) and the Departing and Returning dates (depDate, depTime, retDate, retTime); the internal node "How many people are going?" groups Adults, Seniors and Children.]

Page 11: Stop Word and Related Problems in Web Interface Integration

The Stop Words Problem - Solution

The main heuristic observation:
  The set of stop words from an Information Integration perspective is a subset of the set of stop words from an Information Retrieval perspective.
  E.g., the word "last" in the label "Last Name" is a stop word from an IR perspective, but it is not a stop word in the label.

The strategy:
  Take an arbitrary general purpose dictionary of stop words and find its largest subset satisfying constraints specific to the information integration problem.
  The general dictionary of stop words is obtained through a Google search, e.g., dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words.

Page 12: Stop Word and Related Problems in Web Interface Integration

The Stop Words Problem - Solution

The constraints

After the removal of incorrect stop words, the following situations may arise:
  Empty label - A non-empty label becomes empty after the removal. It cannot be used to derive any knowledge.
  Homonymy - Two sibling nodes in a hierarchy have synonym labels.
  Hyponymy - Two sibling nodes in a hierarchy have hyponym labels.

Example: [figure not preserved]

Page 13: Stop Word and Related Problems in Web Interface Integration

The Stop Words Problem - Solution

The Stop Word Problem is intractable: it is NP-complete.
  Worse, regardless of the subset of constraints chosen, the problem remains "equally" hard.

Common practice:
  Come up with an approximation algorithm. Not covered.

The proposed algorithm produces a maximal set of stop words with respect to the stop word constraints.
  The algorithm's performance will be discussed in the experimental part.
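
The talk does not show the algorithm itself, so the following is only a minimal sketch of the general idea of growing a maximal stop word set under the constraints, not the authors' method. It assumes a simplified reading of the constraints: a label must not become empty, and two sibling labels must not become equal or nested (one word set contained in the other) unless they already were before removal. The names `strip_label`, `violates` and `maximal_stop_words` and the sample labels are hypothetical.

```python
from itertools import combinations

def strip_label(label, stop_words):
    """Words of a label that survive stop word removal, as a frozenset."""
    return frozenset(w for w in label.lower().split() if w not in stop_words)

def nested(a, b):
    """True if one word set contains the other (equal sets included)."""
    return a <= b or b <= a

def violates(sibling_groups, stop_words):
    """Simplified constraint check (an assumption, not the paper's exact test)."""
    for group in sibling_groups:
        before = [strip_label(label, set()) for label in group]
        after = [strip_label(label, stop_words) for label in group]
        if any(not words for words in after):
            return True  # empty label constraint
        for i, j in combinations(range(len(group)), 2):
            # Only clashes introduced by the removal count as violations.
            if nested(after[i], after[j]) and not nested(before[i], before[j]):
                return True
    return False

def maximal_stop_words(sibling_groups, candidates):
    """Greedy pass: keep a candidate only if no constraint breaks.

    Removing more words can never repair a violation under this check,
    so a single pass yields a maximal (not necessarily maximum) set.
    """
    chosen = set()
    for word in sorted(candidates):
        if not violates(sibling_groups, chosen | {word}):
            chosen.add(word)
    return chosen

# Hypothetical sibling groups and a small general-purpose candidate list.
groups = [
    ["Departing from", "Going to"],
    ["First name", "Last name"],
    ["Where and when do you want to travel"],
]
general = {"from", "to", "first", "last", "and", "do", "you", "want", "where", "when"}
print(sorted(maximal_stop_words(groups, general)))
```

In this toy input, "first" and "last" are rejected because removing either one would make "First name" and "Last name" nested, which mirrors the Last Name example from the heuristic observation.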

Page 14: Stop Word and Related Problems in Web Interface Integration

Semantic Relationships Among Labels

The goal is to devise a methodology for establishing synonymy and hyponymy relationships between multi-word phrases.

Why is the problem of establishing semantic relationships between labels (names) difficult in practice?
  Is it because, in a given application domain, a content word occurs with multiple senses with respect to an (electronic) dictionary (e.g., WordNet [Fellbaum98])?
    E.g., "Select an area" vs. "Minimum floor area"
  Is it because of the context of usage of words?
    E.g., "Home address" vs. "Business address"
  Is it because of the occurrence of the semantic enrichment words?
    E.g., "Pick-up date and time" vs. "Pick-up date"
    E.g., "Date and time" vs. "Pick-up date and time"

Page 15: Stop Word and Related Problems in Web Interface Integration

The Sense of a Word in a Domain

To better see the number of meanings of content words:
  Create inverted lists of labels for each domain used in our experiments. 9 domains were used. There are 735 distinct words and 2,319 labels.
  Manually check the number of meanings of each word.

Finding: only one word (i.e., the word "area" in the Real Estate domain) out of 735 words has multiple senses in the same application domain.

Assumption: each word has a unique sense in a given domain.

Words     Labels                                                Domains
Area      Select an area, Minimum floor area                    Real estate
Type      Property type, Parcel type, Type of use               Real estate
Type      3rd party credit card type, Major credit card type    Credit Card
Address   Home address, Company address, Email address          Credit Card
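
The inverted lists mentioned above can be sketched as follows. This is an illustrative reconstruction under simple assumptions (whitespace tokenization, lower-casing); the function name `inverted_lists` is hypothetical and the sample labels are the ones shown in the table.

```python
from collections import defaultdict

def inverted_lists(labels_by_domain):
    """Map domain -> word -> list of labels in which the word occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for domain, labels in labels_by_domain.items():
        for label in labels:
            # set() avoids listing a label twice if a word repeats in it
            for word in set(label.lower().split()):
                index[domain][word].append(label)
    return index

labels_by_domain = {
    "Real estate": ["Select an area", "Minimum floor area",
                    "Property type", "Parcel type", "Type of use"],
    "Credit Card": ["Home address", "Company address", "Email address"],
}
index = inverted_lists(labels_by_domain)
print(index["Real estate"]["area"])
```

Listing all labels that share a word side by side is what makes the manual sense check described on the slide practical.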

Page 16: Stop Word and Related Problems in Web Interface Integration

Dictionary Senses versus Context of Use

An example:
  Consider the noun "Address" in the following labels: Home Address, Company Address, Relative's Address, Email Address.
  "Address" has the same meaning in all of them, according to WordNet: "the place where a person or organization can be found or communicated with".
  It will wrongly suggest that "Home Address" is a hyponym of "Address".

(Electronic) dictionaries are limited.
  The context of a label also needs to be taken into consideration.
  The context of a label of an internal node is the set of its descendant leaves.
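
As a small illustration of the definition of context, the sketch below models a query interface as a tree and collects the descendant leaves of an internal node. The `Node` class and the sample fields (Street, City, Zip, Email) are hypothetical; how the authors compare contexts is not stated on the slide.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

    def context(self):
        """Labels of all descendant leaves (empty for a leaf in this sketch)."""
        leaves = set()
        for child in self.children:
            if child.children:
                leaves |= child.context()
            else:
                leaves.add(child.label)
        return leaves

# Hypothetical internal nodes from a credit card application form.
home = Node("Home Address", [Node("Street"), Node("City"), Node("Zip")])
email = Node("Email Address", [Node("Email")])

print(home.context())
print(home.context().isdisjoint(email.context()))
```

Two labels that share the word "Address" can thus still be told apart when their contexts have nothing in common.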

Page 17: Stop Word and Related Problems in Web Interface Integration

Defining Semantic Relationships

Normalization [e.g., He03 et al, Madhavan01 et al, Rahm01 et al]
  E.g., "Adults (18-64)" becomes "adult".

A label is seen as a set of normalized content words.
  E.g., {area, study} corresponds to "Area of Study".
  E.g., {field, work} corresponds to "Field of Work".

Informally, a label A is a synonym of a label B if their sets of content words are "equal" (i.e., words may be synonymous).
  "Area of Study" is a synonym of "Field of Work":
    "Area" is a synonym of "Field" (by WordNet).
    "Study" is a synonym of "Work" (by WordNet).
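
A rough sketch of the normalization step, under stated assumptions: parenthesized remarks are dropped, words are lower-cased, a hypothetical stop word set is removed, and plurals are stripped with a crude rule rather than a real lemmatizer. The cited works use their own normalization pipelines, which may differ.

```python
import re

STOP_WORDS = {"of", "the", "a", "an"}  # hypothetical domain stop word set

def content_words(label):
    """Return the set of normalized content words of a label."""
    text = re.sub(r"\([^)]*\)", " ", label.lower())  # drop "(18-64)" and the like
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", text)  # keeps "pick-up" as one word
    result = set()
    for word in words:
        if word in STOP_WORDS:
            continue
        # Crude plural stripping; an assumption, not real lemmatization.
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]
        result.add(word)
    return result

print(content_words("Adults (18-64)"))
print(content_words("Area of Study"))
```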

Page 18: Stop Word and Related Problems in Web Interface Integration

Defining Semantic Relationships

Informally, a label A is a hypernym of a label B if the set of content words of A is a "subset" of that of B, meaning that the words of A may be mapped into those of B using either equality, synonymy or hypernymy relationships.
  The intuition is that additional words usually restrict the meaning of a phrase.

Example:
  "Financial Information" is a hypernym of "Household Financial Information".
  "Employment Information" is a hypernym of "Job Information":
    "Employment" is a hypernym of "Job" (by WordNet).

Page 19: Stop Word and Related Problems in Web Interface Integration

Computing Semantic Relationships

Between two sets A and B, with A and B having n and m elements (n ≤ m), respectively, there can be a factorial number of mappings.
  A brute force enumeration algorithm takes exponential time.

Solution sketch: convert the problem to bipartite matching problems.
  The vertices of the graph correspond to the content words of the labels.
  An edge corresponds to two words of the two labels being either equal, synonyms or hyponyms.
  The trick to distinguish a synonymy relationship from a hyponymy one is to assign a weight of 1 to edges denoting equality or synonymy relationships and a weight of 2 to edges denoting hyponymy relationships.
    When |A| = |B| (|A| = number of content words of A), a synonymy relationship corresponds to a maximum weighted bipartite matching whose weight is equal to |A|.
    When |A| = |B|, a hyponymy relationship corresponds to a maximum weighted bipartite matching whose weight is larger than |A|.
    When |A| < |B|, a hyponymy relationship corresponds to a maximum bipartite matching whose weight is equal to |A|.
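
The weighting trick can be illustrated with a small sketch. It is written under several assumptions: the dictionary is replaced by two toy lookup tables (`SYNONYMS`, `HYPONYM_OF`), every word of A is required to be matched, and the matching is found by a memoized search over subsets of B, which is only adequate for short labels. A production version would use a polynomial-time assignment algorithm instead; none of the names below come from the talk.

```python
from functools import lru_cache

# Toy stand-ins for dictionary lookups (hypothetical data, not WordNet output).
SYNONYMS = {frozenset({"area", "field"}), frozenset({"study", "work"})}
HYPONYM_OF = {("job", "employment")}  # (narrower word, broader word)

def edge_weight(a, b):
    """Weight of the edge between word a of label A and word b of label B."""
    if a == b or frozenset({a, b}) in SYNONYMS:
        return 1  # equality or synonymy
    if (b, a) in HYPONYM_OF:
        return 2  # b is a hyponym of a
    return 0      # no edge

def best_saturating_weight(A, B):
    """Max total weight of a matching that uses every word of A, else None."""
    A, B = sorted(A), sorted(B)

    @lru_cache(maxsize=None)
    def solve(i, used):
        if i == len(A):
            return 0
        best = None
        for j, b in enumerate(B):
            w = edge_weight(A[i], b)
            if w and not used & (1 << j):
                rest = solve(i + 1, used | (1 << j))
                if rest is not None and (best is None or w + rest > best):
                    best = w + rest
        return best

    return solve(0, 0)

def relationship(A, B):
    """Return 'synonym', 'hypernym' (A is broader than B), or None."""
    if len(A) > len(B):
        return None
    weight = best_saturating_weight(A, B)
    if weight is None:
        return None
    if len(A) == len(B):
        return "synonym" if weight == len(A) else "hypernym"
    return "hypernym"  # |A| < |B|: any matching covering A suffices

print(relationship({"area", "study"}, {"field", "work"}))
print(relationship({"employment", "information"}, {"job", "information"}))
print(relationship({"financial", "information"},
                   {"household", "financial", "information"}))
```

With equal-sized labels, a total weight above |A| means at least one matched pair is a hyponym edge, which is what separates the "Employment Information" case from the "Area of Study" case.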

Page 20: Stop Word and Related Problems in Web Interface Integration

Computing Semantic Relationships

Examples:
  Synonymy, as a perfect matching: {Area, Study} and {Field, Work}.
  Hyponymy, as a maximum weighted bipartite matching: {Employment, Information} and {Job, Information}.
  Hyponymy, as a maximum bipartite matching: {Financial, Information} and {Household, Financial, Information}.

[Figure: the three bipartite graphs; a distinct edge style denotes a hyponym edge]

Page 21: Stop Word and Related Problems in Web Interface Integration

Semantic Enrichment Words, briefly

In the presence of semantic enrichment words (i.e., "and" and "or"), the intuition that additional words restrict the meaning of a phrase is no longer true.

Examples:
  "Pick-up date" is a hyponym of "Pick-up date and time".
  "City or airport code" is a hyponym of "City, point of interest or airport code".

Some observations:
  AND appears frequently (91.3%) among the labels of the internal nodes.
  OR appears frequently (96%) among the labels of the leaf (field) nodes.
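
One naive way to reproduce the two examples above is to split a label at its enrichment words and compare the resulting parts as sets. This is only an illustrative assumption about how such labels could be handled, not the method of the paper; the helper names are hypothetical, and the split does not distribute modifiers (it will not relate "Date and time" to "Pick-up date and time").

```python
import re

def conjuncts(label):
    """Split a label on 'and', 'or', ',', '&' and '/' into lower-cased parts."""
    parts = re.split(r"\s*(?:,|&|/|\band\b|\bor\b)\s*", label.lower())
    return {part.strip() for part in parts if part.strip()}

def is_hyponym_by_enrichment(a, b):
    """True if the parts of label a form a proper subset of the parts of b."""
    return conjuncts(a) < conjuncts(b)

print(is_hyponym_by_enrichment("Pick-up date", "Pick-up date and time"))
print(is_hyponym_by_enrichment("City or airport code",
                               "City, point of interest or airport code"))
```

Here fewer parts mean a narrower label, which is the reverse of the subset intuition used for labels without enrichment words.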

Page 22: Stop Word and Related Problems in Web Interface Integration

Experiments

Goals:
  Evaluate the approximation algorithm for computing the dictionary of stop words.
  Assess the ability of the proposed methods to establish semantic relationships.
  Determine the impact of stop words on determining semantic relationships.

Page 23: Stop Word and Related Problems in Web Interface Integration

Experiments

Setup:
  9 real world domains from the Web.
  Parts of the data set were also used in Wu06 et al, Madhavan05 et al, Dragut06 et al.

Domain        # interfaces   Avg. # fields per interface   Avg. depth of interfaces   Avg. # internal nodes per interface
Airfare       20             10.7                          3.6                        5.1
Automobile    20             5.1                           2.4                        1.7
Book          20             5.4                           2.3                        1.3
Job           20             4.6                           2.1                        1.1
Car Rentals   20             10.4                          2.5                        2.4
Real Estate   20             6.5                           2.7                        2.4
Alliances     50             15.3                          3.58                       8.32
Credit Card   20             50.15                         3.6                        20.25
Hotels        30             7.6                           2.3                        2.4

Page 24: Stop Word and Related Problems in Web Interface Integration

Experiments: Gold Standard Stop Words

How was the gold standard created? Following the intuition:
  A word is not a stop word if there is a label whose meaning changes so "drastically" after the removal of the word from the label that the new label does not resemble in any way the original meaning of the label.

Examples:
  The word "yourself" in the Credit Card domain is not a stop word because of labels such as "Please tell us about yourself".
  The word "who" in the Airline domain is not a stop word because of labels such as "Who is going in this trip?".

Page 25: Stop Word and Related Problems in Web Interface Integration

Experiments: Evaluating Stop Words

[Figure: three charts, from left to right: Precision, Recall, F-score]
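
For reference, the three measures in the charts can be computed against a gold standard as in the sketch below. The function name and the toy word sets are hypothetical; they are not data from the experiments.

```python
def precision_recall_f(predicted, gold):
    """Precision, recall and F-score of a predicted set against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Toy example: three of four predicted stop words are in the gold standard.
predicted = {"the", "of", "a", "where"}
gold = {"the", "of", "a", "an"}
print(precision_recall_f(predicted, gold))
```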

Page 26: Stop Word and Related Problems in Web Interface Integration

Experiments: Discussion on Stop Words

Examples of non-stop words commonly regarded as stop words:

Domain        Found non-stop words                               Missed non-stop words
Airfare       first, last, from, to, when, and, or               where, who
Alliances     from, to, on, yourself, no, for, there, and, or    where, when, who, by
Auto          first, last, from, to, within, or                  -
Book          first, last, before, or                            after
Credit Card   first, last, per, and, or                          yourself
Car Rental    to, and, or                                        from, last
Real Estate   to, from, or                                       -

Why do we miss some of them?

Page 27: Stop Word and Related Problems in Web Interface Integration

Experiments: Semantic Relationships

The gold standard:
  Manually created for each of the 9 domains.
  Contains 7,544 relationships: 4,103 (54.4%) are synonymy relationships and 3,441 (45.6%) are hypernymy/hyponymy relationships.

Page 28: Stop Word and Related Problems in Web Interface Integration

Experiments: The Naïve Algorithm

It uses only the dictionary senses of individual words.

Why is the accuracy so poor and ranging over such a large interval (from 39% to 97.3%)?
  It compares labels without taking into consideration their contexts.
  It blindly establishes semantic relationships between labels that share some words.

Page 29: Stop Word and Related Problems in Web Interface Integration

Experiments: The Improved Algorithm

It combines the context of labels and semantic enrichment words.
  F-score ranges from 82.1% to 99.3%, with the mean at 92.6% and a standard deviation of 5.9%.
  The naive algorithm has a mean F-score of 74.9% and a standard deviation of 18.5%.

It improves the average precision to 95%, the average recall to 90.4% and the average F-score to 92.6%.

Page 30: Stop Word and Related Problems in Web Interface Integration

Experiments: Where Do the Problems Lie?

Words and phrases that are commonly perceived as synonyms but are not recorded in electronic dictionaries such as WordNet.
  E.g., "drop-off" and "return" are synonyms in the Car Rental domain but not by WordNet.

Many labels are complex sentences.
  E.g., "So, what do you do for a living?", "How flexible are you?".

Page 31: Stop Word and Related Problems in Web Interface Integration

Experiments: What Else Did We Try?

Other linguistic techniques were attempted:
  Normalized Google Distance (NGD) [Cilibrasi and Vitanyi 2007]
  The kernel function for measuring the semantic similarity between pairs of short text snippets [Sahami and Heilman 2006]

Label                              Relationship   Label                              Domain
Additional authorized user         Syn            2nd card holder                    Credit Card
Square feet                        Hyp            Size                               Real Estate
Employment Information             Syn            So, what do you do for a living?   Credit Card
Start                              Syn            Pick-up                            Car Rental
Drop-off date                      Syn            End                                Car Rental
Search one day before and after    Hyp            How flexible are you?              Airfare
Origin date                        Syn            Outbound                           Airfare
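
For readers unfamiliar with NGD, the measure of Cilibrasi and Vitanyi is computed from search engine hit counts: f(x) and f(y) for each term, f(x, y) for pages containing both, and N for the size of the index. The sketch below evaluates that formula on made-up counts; how the measure was applied to the labels in these experiments is not described on the slide.

```python
from math import log

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts.

    fx, fy: hits for each term; fxy: hits for both terms together
    (must be positive); n: number of pages indexed by the search engine.
    """
    lx, ly, lxy = log(fx), log(fy), log(fxy)
    return (max(lx, ly) - lxy) / (log(n) - min(lx, ly))

# Made-up counts: terms that always co-occur have distance 0.
print(ngd(1000, 1000, 1000, 10**9))
print(ngd(1000, 100, 10, 10**6))
```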

Page 32: Stop Word and Related Problems in Web Interface Integration

Experiments: Stop Words & Semantic Relationships

We ran the improved algorithm for computing semantic relationships with the following four possible sets of stop words:
  S1 is the set of stop words produced by our algorithm;
  S2 is the gold standard of stop words;
  S3 is the empty set;
  S4 is a domain independent stop word set used by a typical IR system; we used dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words.

The outcome:
  F-score using S1 is on average 17.6% better than that using S3. The largest difference is 43%.
  F-score using S1 is on average 8% better than that using S4. The largest difference is 33%.
  F-score using S1 is on average 0.03% better than that using S2. This is another way of validating our improved algorithm.

Page 33: Stop Word and Related Problems in Web Interface Integration

Related Work

Synonym and near-synonym relationships between short phrases have been recently studied [Bollegala et al. 2007, Sahami and Heilman 2006].

There is a great deal of work to represent the meaning of words (not phrases) in various areas of research: linguistics, computer science, cognitive psychology, etc.
  Manually created semantic networks: WordNet [Fellbaum 1998] and Cyc [Lenat et al. 1990].
  Generic methods to measure word similarity or word association:
    Using word frequencies in text corpora [Berland and Charniak 1990, Caraballo 1999, Hearst 1992, Jiang and Conrath 1998, Lin 1998].
    Using Web search engine counts (hits) to identify lexico-syntactic patterns [Bollegala et al. 2007, Cilibrasi and Vitanyi 2007, Cimiano and Staab 2004].

Page 34: Stop Word and Related Problems in Web Interface Integration

Related Work, Cont'

Schema Matching
  Surveys [Rahm and Bernstein 2001, Shvaiko and Euzenat 2005].
  Query interface matching [He and Chang 2003, He et al. 2004, Wang et al. 2004, Wu et al. 2004, 2006].
  A number of dictionary-based semantic matching techniques for relational/XML schema and ontology alignment [Beneventano et al. 2001, Giunchiglia et al. 2005, Kotis and Vouros 2004].

Page 35: Stop Word and Related Problems in Web Interface Integration

End

Please visit the project web site:

http://www.cs.uic.edu/~edragut/QIProject.html

Thank you for your time and patience!