Named Entity Mining From Click-Through Data
Using Weakly Supervised LDA
Gu Xu (1), Shuang-Hong Yang (1,2), Hang Li (1)
(1) Microsoft Research Asia, China
(2) College of Computing, Georgia Tech, USA
Talk Outline
• Named Entity Mining
– Exploiting click-through data
– Applying Latent Dirichlet Allocation
– Developing a weakly supervised learning approach
• Weakly Supervised LDA
• Experimental Results
• Summary
Named Entity Mining
• Named Entity Mining (NEM)
– To mine the information of named entities of a class from a large amount of data.
– Example: mine movie titles from a textual data collection
– Applications: Web search, etc.
• Three Challenges
– Suitable data source for NEM: Click-through Data
– Ambiguity in classes of named entities: LDA (Topic Model)
– Supervision from human knowledge: Weakly Supervised Learning
Click-through Data
• Query context
– [movie] trailer, [game] cheats
• Click context
– imdb.com for movies, gamespot.com for games
– Wisdom-of-crowds
  • Very large-scale data that keeps growing
  • Frequent updates with emerging named entities
• New data source for NEM
– Over 70% of queries contain named entities.
– Rich context for determining the classes of entities.
[Figure: Click-Through Data table. Each query (Query_1, ...) is listed with its clicked sites (Site_11, Site_12, ...) and their click frequencies (Freq_11, Freq_12, ...).]
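The query/site/frequency layout can be sketched as a nested mapping. This is a minimal illustration, not the authors' storage format; the records and the function name `build_click_through` are hypothetical.

```python
from collections import defaultdict

# Hypothetical records in (query, clicked site, click frequency) form.
RECORDS = [
    ("harry potter trailer", "imdb.com", 120),
    ("harry potter trailer", "movies.yahoo.com", 30),
    ("harry potter cheats", "cheats.ign.com", 45),
    ("harry potter trailer", "imdb.com", 5),  # repeated pair: counts are summed
]

def build_click_through(records):
    """Group click counts as query -> site -> total frequency."""
    table = defaultdict(lambda: defaultdict(int))
    for query, site, freq in records:
        table[query][site] += freq
    # Convert to plain dicts so missing keys raise instead of being created.
    return {query: dict(sites) for query, sites in table.items()}

click_through = build_click_through(RECORDS)
```

Keying by query first keeps each query's clicked sites together, which is the grouping the table on the slide shows.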
Latent Dirichlet Allocation
• Deal with ambiguity in classes of named entities
– Classes of named entities are ambiguous.
  • Harry Potter: Book, Movie and Game
– Topic models (LDA)
  • Classes of Named Entity as Topics
[Figure: Classes of Named Entity as Topics]
Movie
– Query context: # trailer, # dvd, # movie
– Click context: imdb.com, movies.yahoo.com, disney.go.com
Game
– Query context: # cheats, # walkthrough, # game
– Click context: gamespots.com, cheats.ign.com, gamefaqs.com
Harry Potter
– harry potter trailer, imdb.com
– harry potter dvd, movies.yahoo.com
– harry potter cheats, cheats.ign.com
– harry potter game, gamespots.com
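To make "classes as topics" concrete, the sketch below estimates how strongly one entity's observed contexts point to each class. It is a simplification, not LDA itself: the per-class distributions are fixed, hypothetical numbers, there is no Dirichlet prior, and only the mixture weights are fitted with a few EM steps.

```python
# Hypothetical per-class distributions over query contexts and clicked sites.
CLASS_WORD_PROBS = {
    "movie": {"# trailer": 0.3, "# dvd": 0.2, "imdb.com": 0.3, "movies.yahoo.com": 0.2},
    "game": {"# cheats": 0.3, "# game": 0.2, "cheats.ign.com": 0.3, "gamespots.com": 0.2},
}

def class_proportions(observed, class_word_probs, iterations=50, floor=1e-6):
    """Fit mixture weights over classes by EM, keeping word distributions fixed."""
    classes = list(class_word_probs)
    weights = {c: 1.0 / len(classes) for c in classes}
    for _ in range(iterations):
        totals = {c: 0.0 for c in classes}
        for word in observed:
            # E-step: responsibility of each class for this word.
            scores = {c: weights[c] * class_word_probs[c].get(word, floor) for c in classes}
            norm = sum(scores.values())
            for c in classes:
                totals[c] += scores[c] / norm
        # M-step: new weight is the average responsibility.
        weights = {c: totals[c] / len(observed) for c in classes}
    return weights

# Contexts and sites observed with "harry potter" (four movie-like, two game-like).
observed = ["# trailer", "imdb.com", "# dvd", "movies.yahoo.com", "# cheats", "cheats.ign.com"]
proportions = class_proportions(observed, CLASS_WORD_PROBS)
```

The entity ends up with weight on both classes rather than a single label, which is the behaviour the slide asks of a topic model.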
Weakly Supervised Learning
• Supervise LDA training with examples
– LDA is an unsupervised model.
  • Topics in LDA are latent and are not aligned with predefined semantic classes, like book, movie and game.
– Human labels are inaccurate and partial.
  • Binary indicator rather than proportion
  • Labels only indicate that a named entity belongs to certain classes, but do not exclude the possibility that it belongs to other classes.
– Weakly-supervised LDA
  • Supervise LDA training with partial labels
Weakly Supervised LDA
• Overview
[Figure: WS-LDA pipeline]
– Seeds: ..., Harry Potter, ...
– Click-through Data (entries matching the seed):
  • harry potter book, http://www.amazon.com
  • harry potter cheats, http://cheats.ign.com
  • harry potter trailer, http://www.imdb.com
  • ...
– Virtual Document (seed replaced by #):
  • # book, http://www.amazon.com
  • # cheats, http://cheats.ign.com
  • # trailer, http://www.imdb.com
  • ...
– Create a virtual document for each seed and train WS-LDA
– Output of training: Contexts and Websites
– Find new named entities as well as their classes by using obtained query contexts and clicked websites
– Newly Discovered Entities
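The virtual-document step can be sketched as follows: scan the click-through data for queries containing the seed, replace the seed with "#", and keep the remaining context together with the clicked site. The whole-word matching rule, the data, and the function name are assumptions made for illustration; the slides do not specify the matching details.

```python
def build_virtual_document(seed, click_through):
    """Return sorted (query context, clicked site, frequency) triples for one seed."""
    needle = f" {seed.lower()} "
    document = []
    for query, sites in click_through.items():
        padded = f" {query.lower()} "  # padding gives whole-word matching
        if needle not in padded:
            continue
        # Replace the seed with the placeholder "#", e.g. "# book".
        context = padded.replace(needle, " # ", 1).strip()
        for site, freq in sites.items():
            document.append((context, site, freq))
    return sorted(document)

# Hypothetical click-through data: query -> site -> click frequency.
click_through = {
    "harry potter book": {"http://www.amazon.com": 3},
    "harry potter cheats": {"http://cheats.ign.com": 2},
    "harry potter trailer": {"http://www.imdb.com": 5},
    "harry pottery class": {"http://example.com": 1},  # not a whole-word match
}
virtual_doc = build_virtual_document("Harry Potter", click_through)
```

Each triple contributes one query-context word and one click-context word to the seed's document, so a seed with many distinct queries yields a longer document.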
Weakly Supervised LDA (cont.)
• LDA with two types of virtual words
– w1: Query context
  • # book, # cheats, # trailer, ...
– w2: Click context
  • http://www.amazon.com, http://cheats.ign.com, http://www.imdb.com, ...
[Figure: Virtual Document split into query-context words (w1) and click-context words (w2)]
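One way to read "two types of virtual words" is as a generative process in which both word types share the document's class mixture while each type has its own per-class distribution. The slides do not spell this process out, so the sketch below is an assumption, and every distribution and name in it is hypothetical.

```python
import random

def sample_virtual_document(alpha, context_dists, site_dists, n_contexts, n_sites, rng):
    """Sample class mixture theta, then query-context (w1) and click-context (w2) words."""
    # theta ~ Dirichlet(alpha), drawn as normalized Gamma variates.
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    theta = [g / total for g in gammas]
    classes = list(range(len(alpha)))

    def draw(dists):
        # Pick a class from theta, then a word from that class's distribution.
        z = rng.choices(classes, weights=theta)[0]
        words = list(dists[z])
        probs = [dists[z][w] for w in words]
        return z, rng.choices(words, weights=probs)[0]

    w1 = [draw(context_dists) for _ in range(n_contexts)]
    w2 = [draw(site_dists) for _ in range(n_sites)]
    return theta, w1, w2

# Class 0 = movie, class 1 = game (hypothetical distributions).
CONTEXT_DISTS = [{"# trailer": 0.6, "# dvd": 0.4}, {"# cheats": 0.7, "# game": 0.3}]
SITE_DISTS = [{"imdb.com": 1.0}, {"cheats.ign.com": 0.5, "gamefaqs.com": 0.5}]

rng = random.Random(0)
theta, w1, w2 = sample_virtual_document([0.5, 0.5], CONTEXT_DISTS, SITE_DISTS, 4, 3, rng)
```

Sharing `theta` across both word types is what lets a query context such as "# trailer" and a site such as imdb.com reinforce the same class.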
Weakly Supervised LDA (cont.)
• Introduce Weak Supervision
– LDA log likelihood + soft constraints
$$L(w, y) = \log p(w) + \lambda \, C(y, \Theta)$$
  • $\log p(w)$: LDA Probability
  • $C(y, \Theta)$: Soft Constraints
– Soft Constraints
$$C(y, \Theta) = \sum_i y_i \bar{z}_i$$
  • $\bar{z}_i$: Document Probability on i-th Class
  • $y_i$: Document Binary Label on i-th Class
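The objective can be evaluated numerically as a sanity check. This sketch assumes a weight, here called `lam`, balances the log likelihood against the soft constraint, and that the constraint is the sum over classes of binary label times document probability; the numbers are made up for illustration.

```python
def soft_constraint(labels, class_probs):
    """Sum of the document's class probabilities over its labelled classes."""
    return sum(y * z for y, z in zip(labels, class_probs))

def ws_lda_objective(log_likelihood, labels, class_probs, lam):
    """LDA log likelihood plus the weighted soft constraint."""
    return log_likelihood + lam * soft_constraint(labels, class_probs)

# Classes in order: book, movie, game. The seed is labelled book and movie;
# a 0 for game means "not labelled", not "known to be excluded".
labels = [1, 1, 0]
class_probs = [0.5, 0.3, 0.2]  # hypothetical document probabilities per class

constraint = soft_constraint(labels, class_probs)
objective = ws_lda_objective(-10.0, labels, class_probs, lam=2.0)
```

Because unlabelled classes contribute zero rather than a penalty, the model is rewarded for putting mass on labelled classes but is not forced to remove mass from the others, which matches the partial nature of the labels.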
Experimental Results
• Dataset
– Seed named entities
  • About 1,000 seeds for each class, and 3,767 unique named entities in total
– Click-through data
  • 1.5 billion query-URL pairs, containing 240 million unique queries and 17 million unique URLs
Experimental Results (cont.)
• Top Contexts and Websites
[Table: columns Movie, Game, Book, Music; rows Contexts and Websites]
Experimental Results (cont.)
• Accuracy of Mined Entities
Summary
• Proposed to use click-through data as a new data source for NEM
• Employed topic model to deal with ambiguity in classes of named entities
• Devised weakly supervised LDA for modeling click-through data
– Two types of virtual words
– Introduce weakly supervised learning into LDA
• Experiments on large-scale data verified effectiveness of proposed approach
THANKS