Named Entity Mining From Click-Through Data Using Weakly Supervised LDA

Gu Xu¹, Shuang-Hong Yang¹,², Hang Li¹
¹Microsoft Research Asia, China
²College of Computing, Georgia Tech, USA


Page 1: Named Entity Mining From Click-Through Data Using Weakly Supervised LDA

Gu Xu¹, Shuang-Hong Yang¹,², Hang Li¹

¹Microsoft Research Asia, China
²College of Computing, Georgia Tech, USA

Page 2

Talk Outline

• Named Entity Mining
  – Exploiting click-through data
  – Applying Latent Dirichlet Allocation
  – Developing a weakly supervised learning approach
• Weakly Supervised LDA
• Experimental Results
• Summary

Page 3

Named Entity Mining

• Named Entity Mining (NEM)
  – Mine information about the named entities of a class from a large amount of data.
  – Example: mine movie titles from a textual data collection
  – Applications: Web search, etc.
• Three Challenges
  – Suitable data source for NEM → Click-through Data
  – Ambiguity in classes of named entities → LDA (Topic Model)
  – Supervision from human knowledge → Weakly Supervised Learning

Page 4

Click-through Data

• Query context
  – [movie] trailer, [game] cheats
• Click context
  – imdb.com for movies, gamespot.com for games
  – Wisdom of crowds
    • Very large-scale data that keeps growing
    • Frequently updated with emerging named entities
• New data source for NEM
  – Over 70% of queries contain named entities.
  – Rich context for determining the classes of entities.

Click-Through Data

Query_1   | Site_11 | Freq_11
          | Site_12 | Freq_12
          | …       | …
Query_... | …       | …
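The query/site/frequency layout above can be sketched in a few lines of Python. This is only an illustration: the log entries, the frequencies, and the helper names `build_click_table` and `query_context` are hypothetical and do not come from the talk.

```python
from collections import defaultdict

# Hypothetical toy click-through log: (query, clicked site, click frequency).
CLICK_LOG = [
    ("harry potter trailer", "imdb.com", 120),
    ("harry potter cheats", "cheats.ign.com", 45),
    ("halo 3 cheats", "cheats.ign.com", 60),
    ("harry potter trailer", "movies.yahoo.com", 30),
]


def build_click_table(log):
    """Group the log as query -> {site: frequency}, summing repeated pairs."""
    table = defaultdict(dict)
    for query, site, freq in log:
        table[query][site] = table[query].get(site, 0) + freq
    return dict(table)


def query_context(query, entity):
    """Replace the entity mention with '#' to get the query context.

    Returns None when the entity does not occur in the query.
    """
    tokens = query.split()
    ent = entity.split()
    for i in range(len(tokens) - len(ent) + 1):
        if tokens[i:i + len(ent)] == ent:
            return " ".join(tokens[:i] + ["#"] + tokens[i + len(ent):])
    return None


table = build_click_table(CLICK_LOG)
context = query_context("harry potter trailer", "harry potter")
```

Here `context` is the pattern `# trailer`, which matches the `[movie] trailer` style of query context listed on this slide.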

Page 5

Latent Dirichlet Allocation

• Deal with ambiguity in classes of named entities
  – Classes of named entities are ambiguous.
    • Harry Potter: Book, Movie and Game
  – Topic models (LDA)

Classes of Named Entity as Topics

Topic | Query Context                   | Click Context
Movie | # trailer, # dvd, # movie       | imdb.com, movies.yahoo.com, disney.go.com
Game  | # cheats, # walkthrough, # game | gamespots.com, cheats.ign.com, gamefaqs.com

Harry Potter

harry potter trailer | imdb.com
harry potter dvd     | movies.yahoo.com
harry potter cheats  | cheats.ign.com
harry potter game    | gamespots.com
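To make the "classes as topics" idea concrete, the sketch below estimates how much of an entity's evidence each class explains when the class distributions are held fixed. It is a simplified maximum-likelihood mixture fit (EM over the class proportions only), not the full LDA inference from the talk, and every probability in `TOPICS` is an invented placeholder.

```python
# Hypothetical per-class probabilities of seeing a query context or a site.
TOPICS = {
    "movie": {"# trailer": 0.4, "# dvd": 0.3, "imdb.com": 0.3},
    "game": {"# cheats": 0.4, "# game": 0.3, "cheats.ign.com": 0.3},
}
EPS = 1e-6  # small probability for words a class never emits


def class_proportions(words, topics, iters=50):
    """Estimate class proportions for one entity with the topics held fixed."""
    classes = list(topics)
    theta = {c: 1.0 / len(classes) for c in classes}
    for _ in range(iters):
        counts = {c: 0.0 for c in classes}
        for w in words:
            # E-step: responsibility of each class for this word.
            weights = {c: theta[c] * topics[c].get(w, EPS) for c in classes}
            total = sum(weights.values())
            for c in classes:
                counts[c] += weights[c] / total
        # M-step: proportions are the normalized expected counts.
        theta = {c: counts[c] / len(words) for c in classes}
    return theta


# Evidence collected for "harry potter": three movie-like items, one game-like.
doc = ["# trailer", "imdb.com", "# dvd", "# cheats"]
theta = class_proportions(doc, TOPICS)
```

With this toy evidence the entity comes out mostly "movie" with a smaller "game" share, which is the kind of mixed membership the Harry Potter example points at.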

Page 6

Weakly Supervised Learning

• Supervise LDA training with examples
  – LDA is an unsupervised model.
    • Topics in LDA are latent and do not align with predefined semantic classes, like book, movie and game.
  – Human labels are inaccurate and partial.
    • Binary indicator rather than proportion
    • Labels only indicate that a named entity belongs to certain classes; they do not exclude the possibility that it belongs to other classes.
  – Weakly supervised LDA
    • Supervise LDA training with partial labels
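A partial binary label of this kind can be sketched as a vector with one entry per class. The class list follows the four classes named in the experimental slides; the seed lists and the function name `label_vector` are hypothetical.

```python
CLASSES = ["book", "movie", "game", "music"]

# Hypothetical seed lists: each class names a few entities known to belong to it.
SEEDS = {
    "movie": ["harry potter", "kung fu panda"],
    "game": ["harry potter", "halo"],
}


def label_vector(entity, seeds, classes):
    """Return 1 where the entity is a known seed of the class, else 0.

    A 0 means "not labeled", not "does not belong": the label is partial.
    """
    return [1 if entity in seeds.get(c, []) else 0 for c in classes]


harry = label_vector("harry potter", SEEDS, CLASSES)
halo = label_vector("halo", SEEDS, CLASSES)
```

In this toy data "harry potter" is labeled movie and game but not book, even though it is also a book. That missing label is exactly the partial supervision the slide describes.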

Page 7

Weakly Supervised LDA

• Overview
  – Seeds: ……, Harry Potter, ……
  – Click-through Data
    • harry potter book http://www.amazon.com
    • harry potter cheats http://cheats.ign.com
    • harry potter trailer http://www.imdb.com
    • ……
  – Virtual Document
    • # book, http://www.amazon.com
    • # cheats, http://cheats.ign.com
    • # trailer, http://www.imdb.com
    • ……
  – Create a virtual document for each seed and train WS-LDA
  – Obtained: Contexts, Websites
  – Find new named entities as well as their classes by using obtained query contexts and clicked websites
  – Newly Discovered Entities
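The first step of the overview, creating one virtual document per seed, might look like the following sketch. The log entries and frequencies are invented, the helper names are hypothetical, and repeating each pair by its click frequency is an assumption rather than something the slide states.

```python
def context_of(query, entity):
    """Return the query with the entity replaced by '#', or None if absent."""
    padded = f" {query} "
    needle = f" {entity} "
    if needle not in padded:
        return None
    return padded.replace(needle, " # ", 1).strip()


def build_virtual_documents(seeds, click_log):
    """Collect (query context, clicked URL) pairs for every seed entity."""
    docs = {seed: [] for seed in seeds}
    for query, url, freq in click_log:
        for seed in seeds:
            context = context_of(query, seed)
            if context is not None:
                # Assumption: a pair clicked `freq` times appears `freq` times.
                docs[seed].extend([(context, url)] * freq)
    return docs


# Hypothetical toy log: (query, clicked URL, click frequency).
log = [
    ("harry potter book", "http://www.amazon.com", 2),
    ("harry potter cheats", "http://cheats.ign.com", 1),
    ("halo cheats", "http://cheats.ign.com", 3),
]
docs = build_virtual_documents(["harry potter"], log)
```

The resulting document for "harry potter" contains only contexts and URLs, so the entity string itself never reaches the topic model.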

Page 8

Weakly Supervised LDA (cont.)

• LDA with two types of virtual words
  – w1: Query context
  – w2: Click context

Virtual Document

w1 (query context): # book, # cheats, # trailer, ……
w2 (click context): http://www.amazon.com, http://cheats.ign.com, http://www.imdb.com, ……
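One way to read the two word types is that each class has its own distribution over query contexts and its own distribution over clicked sites. The sketch below scores a virtual document under that reading. It assumes that a context and its clicked site share a single class assignment, which the slide does not spell out, and all probabilities are invented placeholders.

```python
import math

# Hypothetical class-conditional distributions for the two word types.
BETA_CONTEXT = {
    "movie": {"# trailer": 0.7, "# cheats": 0.3},
    "game": {"# trailer": 0.2, "# cheats": 0.8},
}
BETA_SITE = {
    "movie": {"imdb.com": 0.9, "cheats.ign.com": 0.1},
    "game": {"imdb.com": 0.1, "cheats.ign.com": 0.9},
}


def pair_probability(context, site, theta):
    """p(w1, w2 | theta) = sum over classes of theta * p(w1 | c) * p(w2 | c)."""
    return sum(
        theta[c] * BETA_CONTEXT[c].get(context, 0.0) * BETA_SITE[c].get(site, 0.0)
        for c in theta
    )


def document_log_likelihood(pairs, theta):
    """Sum of log pair probabilities; every pair must have nonzero probability."""
    return sum(math.log(pair_probability(c, s, theta)) for c, s in pairs)


theta = {"movie": 0.5, "game": 0.5}
p = pair_probability("# trailer", "imdb.com", theta)
ll = document_log_likelihood(
    [("# trailer", "imdb.com"), ("# cheats", "cheats.ign.com")], theta
)
```

Keeping the two vocabularies separate lets a class be recognized from either signal: a movie-like context, a movie-like site, or both.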

Page 9

Weakly Supervised LDA (cont.)

• Introduce Weak Supervision
  – LDA log likelihood + soft constraints

    \[ L(w, y) = \log p(w \mid \Theta) + \lambda \, C(y, \Theta) \]

    • \log p(w \mid \Theta): LDA Probability
    • C(y, \Theta): Soft Constraints

  – Soft Constraints

    \[ C(y, \Theta) = \sum_i y_i \bar{z}_i \]

    • \bar{z}_i: Document Probability on i-th Class
    • y_i: Document Binary Label on i-th Class
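As a numerical illustration of "LDA log likelihood + soft constraints", the sketch below adds a constraint term to a given log likelihood. The constraint is the sum, over classes, of the document's binary label times the document's probability on that class. The trade-off weight `lam`, the function names, and all numbers are assumptions made for the example.

```python
def soft_constraint(labels, class_probs):
    """Sum of (binary label * document probability) over the classes."""
    return sum(y * z for y, z in zip(labels, class_probs))


def ws_lda_objective(log_likelihood, labels, class_probs, lam=1.0):
    """LDA log likelihood plus a weighted soft-constraint term.

    `lam` is a hypothetical weight balancing the two terms.
    """
    return log_likelihood + lam * soft_constraint(labels, class_probs)


# Classes in order: book, movie, game. The entity is labeled book and game.
labels = [1, 0, 1]
class_probs = [0.5, 0.3, 0.2]  # hypothetical document probabilities per class

constraint = soft_constraint(labels, class_probs)
objective = ws_lda_objective(-10.0, labels, class_probs, lam=2.0)
```

The constraint rewards probability mass on labeled classes but never penalizes mass on unlabeled ones directly, which fits labels that are partial rather than exclusive.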

Page 10

Experimental Results

• Dataset
  – Seed named entities
    • About 1,000 seeds for each class, and 3,767 unique named entities in total
  – Click-through data
    • 1.5 billion query-URL pairs, containing 240 million unique queries and 17 million unique URLs

Page 11

Experimental Results (cont.)

• Top contexts and websites
  – Contexts: Movie Contexts, Game Contexts, Book Contexts, Music Contexts
  – Websites: Movie Websites, Game Websites, Book Websites, Music Websites

Page 12

Experimental Results (cont.)

• Accuracy of Mined Entities

Page 13

Summary

• Proposed to use click-through data as a new data source for NEM
• Employed a topic model to deal with ambiguity in classes of named entities
• Devised weakly supervised LDA for modeling click-through data
  – Two types of virtual words
  – Introduce weakly supervised learning into LDA
• Experiments on large-scale data verified the effectiveness of the proposed approach

Page 14

THANKS