Data and Information Systems Laboratory, University of Illinois Urbana-Champaign
Advanced Data Mining, May 4, 2010
Growing Parallel Paths for Entity-Page Retrieval
Tim Weninger, Cindy Xide Lin, and Jiawei Han
Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL
Work Submitted to VLDB'10
Problem: Entity Page Retrieval
Given: Reference page
…Can we find entity pages of the same type?
Definitions:
Defn 1: Root-to-link path:
◊ - hrefX contains
HTML-TABLE-TR1-TD-hrefX
Defn 2: Parallel links:
Share a root-to-link path, i.e., lists of links
Defn 3: Intra-page parallel paths:
◊ - hrefC ǁ ◊ - hrefB
◊ - hrefC ǁ ◊ - hrefX
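Definitions 1 to 3 can be illustrated with a minimal sketch that records the tag path from the document root to every link and groups links sharing a path. The names `LinkPathParser` and `parallel_links` are hypothetical, and the sketch assumes that sibling indices (the "1" in TR1) are ignored when comparing paths; the slides do not say how the authors handle them.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class LinkPathParser(HTMLParser):
    """Record the tag path from the document root to every <a href> link."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.links = []   # (root-to-link path, href) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        href = dict(attrs).get("href")
        if tag == "a" and href is not None:
            self.links.append(("-".join(self.stack), href))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def parallel_links(html):
    """Group hrefs by root-to-link path (assumption: indices ignored)."""
    parser = LinkPathParser()
    parser.feed(html)
    groups = defaultdict(list)
    for path, href in parser.links:
        groups[path].append(href)
    return dict(groups)


page = """
<html><body>
<table>
  <tr><td><a href="hrefB">B</a></td></tr>
  <tr><td><a href="hrefC">C</a></td></tr>
  <tr><td><a href="hrefX">X</a></td></tr>
</table>
<p><a href="about">About</a></p>
</body></html>
"""
groups = parallel_links(page)
for path, hrefs in groups.items():
    print(path, hrefs)
```

In this toy page the three table links share one path and are therefore parallel to each other, while the "about" link sits on a different path.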
Definitions:
Defn 4: Inter-page parallel paths:
◊ - hrefC in Page A ǁ ◊ - hrefW in Page B
Defn 5: Parallel Web site paths:
Share intra- or inter-page parallel paths across multiple pages
Properties of Parallel Paths
Prop. 1: Equal Path Length Property:
Parallel paths must contain the same number of pages.
Prop. 2: Parallel Page Property:
Testing whether two paths are parallel is equivalent to testing their respective pages pairwise.
Prop. 3: Equal Page Length Property:
Parallel paths must have the same number of nodes across pages.
Properties of Parallel Paths
Prop. 4: Divergent Path Property:
Parallel paths can extend through separate pages.
Prop. 5: Early Termination Property:
The test of two paths can be terminated at the first occurrence of a dissimilar node.
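The properties suggest a simple pairwise test, sketched below under assumptions. A Web site path is modelled as a list of pages, and each page as the tuple of nodes on the root-to-link path that is followed on it. This representation and the function name `paths_parallel` are hypothetical, and Prop. 3 is read here as "corresponding pages have the same number of nodes". Page URLs are not compared, which is what lets parallel paths run through separate pages (Prop. 4).

```python
def paths_parallel(path_a, path_b):
    """Return (is_parallel, node_comparisons) for two Web site paths."""
    comparisons = 0
    # Prop. 1: parallel paths contain the same number of pages.
    if len(path_a) != len(path_b):
        return False, comparisons
    # Prop. 2: the path test reduces to tests of respective pages.
    for page_a, page_b in zip(path_a, path_b):
        # Prop. 3 (as read here): same number of nodes on each page.
        if len(page_a) != len(page_b):
            return False, comparisons
        for node_a, node_b in zip(page_a, page_b):
            comparisons += 1
            # Prop. 5: stop at the first dissimilar node.
            if node_a != node_b:
                return False, comparisons
    return True, comparisons


example = [
    ("html", "body", "ul", "li", "a"),
    ("html", "body", "table", "tr", "td", "a"),
]
same_shape = [
    ("html", "body", "ul", "li", "a"),
    ("html", "body", "table", "tr", "td", "a"),
]
differs_early = [
    ("html", "body", "ol", "li", "a"),
    ("html", "body", "table", "tr", "td", "a"),
]
too_short = [("html", "body", "ul", "li", "a")]

print(paths_parallel(example, same_shape))     # (True, 11)
print(paths_parallel(example, differs_early))  # (False, 3)
print(paths_parallel(example, too_short))      # (False, 0)
```

The comparison counter is only there to make early termination visible: the mismatching candidate is rejected after three node comparisons instead of eleven.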
Finding Paths
Naive method: can be very costly
Growing parallel paths: first find an example path, then grow paths that are parallel to the example
Repeat with alternate paths: this makes magic happen
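The "find an example path, then grow" idea can be sketched on a toy site graph. Everything here is an assumption made for illustration: the site is a dictionary mapping each page to `(signature, target)` links, where the signature stands in for the root-to-link path, and the helper names `example_signatures` and `grow_parallel_paths` are hypothetical. It is not the authors' implementation.

```python
def example_signatures(site, example_path):
    """Read off the link signature followed at each step of the example."""
    signatures = []
    for page, next_page in zip(example_path, example_path[1:]):
        signature = next(sig for sig, target in site[page] if target == next_page)
        signatures.append(signature)
    return signatures


def grow_parallel_paths(site, start, signatures):
    """From `start`, follow every link whose signature matches each step."""
    frontier = [[start]]
    for signature in signatures:
        next_frontier = []
        for path in frontier:
            for link_signature, target in site.get(path[-1], []):
                if link_signature == signature and target not in path:
                    next_frontier.append(path + [target])
        frontier = next_frontier
    return frontier


# Toy site: page -> list of (link signature, target page).
site = {
    "home": [("nav/a", "people"), ("nav/a", "research")],
    "people": [("ul/li/a", "faculty")],
    "research": [("div/a", "data-mining")],
    "faculty": [
        ("table/tr/td/a", "han"),
        ("table/tr/td/a", "zhai"),
        ("table/tr/td/a", "chang"),
        ("p/a", "home"),
    ],
}

signatures = example_signatures(site, ["home", "people", "faculty", "han"])
grown = grow_parallel_paths(site, "home", signatures)
entities = [path[-1] for path in grown]
print(entities)
```

Only links that match the example's signature at each step are expanded, so the search avoids enumerating every path in the site, which is where the naive method becomes costly.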
Repeating with alternate paths
k-shortest paths: do a k-shortest path search and explore all of these paths
Removing links: after exploring a path, remove its edges from the graph
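The edge-removal variant can be sketched as follows: find a shortest path with breadth-first search, remove its edges, and search again until no path remains or k paths have been found. This is a minimal sketch on a hypothetical toy graph, not a full k-shortest-paths algorithm such as Yen's, and the function names are assumptions.

```python
from collections import deque


def shortest_path(graph, source, target):
    """Breadth-first search; returns a list of nodes, or None if unreachable."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def alternate_paths(graph, source, target, k):
    """Collect up to k paths, removing each path's edges after exploring it."""
    remaining = {node: list(targets) for node, targets in graph.items()}
    paths = []
    while len(paths) < k:
        path = shortest_path(remaining, source, target)
        if path is None:
            break
        paths.append(path)
        for a, b in zip(path, path[1:]):
            remaining[a].remove(b)
    return paths


graph = {
    "home": ["people", "research"],
    "people": ["faculty"],
    "faculty": ["han"],
    "research": ["data-mining"],
    "data-mining": ["han"],
}
paths = alternate_paths(graph, "home", "han", k=3)
for path in paths:
    print(" > ".join(path))
```

Because edges are removed after each search, the returned paths are edge-disjoint; here only two exist, so the loop stops before reaching k = 3.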
Interpreting the Output
Side effect of repeating with alternate paths
Given: Jiawei Han
Result:
Jiawei Han 40
Cheng Zhai 38
Kevin Chang 38
Dan Roth 32
Vikram Adve 4
Roy Campbell 3
…
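One reading of these numbers is that each entity's score counts how many alternate paths retrieved it, so entities reached by many paths rank highest. The sketch below tallies such counts; the function name `rank_entities` and the toy input are hypothetical, and the slides do not confirm that this is exactly how the scores are computed.

```python
from collections import Counter


def rank_entities(results_per_path):
    """Count, per entity, how many alternate paths retrieved it."""
    counts = Counter()
    for entities in results_per_path:
        counts.update(set(entities))  # one vote per path, even if repeated
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy input: the entity pages retrieved by three alternate paths.
results_per_path = [
    ["han", "zhai", "chang"],
    ["han", "zhai"],
    ["han", "adve"],
]
ranking = rank_entities(results_per_path)
for entity, count in ranking:
    print(entity, count)
```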
Interpreting the Output
Side effect of path finding: what do the link labels on the path tell us about the entity?
First path: People > Faculty > Jiawei Han > Personal Site
Second path: Research > Data Mining
Experiments
Top 25 CS departments in the US (according to US News): find all professors
United States Congress: find all senators, representatives, and committees
UIUC only: find all courses; find all research groups
Baseline: Google's "find similar" search (essentially TF-IDF-type ranking)
Results
Conclusions and Future Work
Given a reference page and an example entity type, we can retrieve all entity pages of the same type
Implications: we can use this for information integration; search and retrieval can be enhanced
Shortcomings: most errors are due to incorrect list finding
Questions?