1
People in CALO’s World: Contact Info, Expertise, Groups & Roles
Information Extraction, Coreference, Group/Topic Models
Andrew McCallum, Aron Culotta, Xuerui Wang, Charles Sutton, Wei Li
UMass Amherst
4
DEX Example
To: “Andrew McCallum” [email protected]
Subject ...
First Name:
Andrew
Middle Name:
Kachites
Last Name:
McCallum
JobTitle: Associate Professor
Company: University of Massachusetts
Street Address:
140 Governor’s Dr.
City: Amherst
State: MA
Zip: 01003
Company Phone:
(413) 545-1323
Links: Fernando Pereira, Sam Roweis,…
Key Words:
Information extraction,
social network,…
Search for new people
6
Outline
Information Extraction
– Learning in the wild
– Transfer learning
Identity Uncertainty
Modeling Groups, Roles and Topics
7
Outline
Information Extraction
– Learning in the wild
– Transfer learning
Identity Uncertainty
Modeling Groups, Roles and Topics
9
User feedback “in the wild” as labeling
Labeling for Classification
Easy: Often found in user interfaces
e.g. CALO IRIS, Apple Mail
Seminar: How to Organize your Life
by Jane Smith, Stevenson & Smith
Mezzanine Level, Papadopoulos Sq
3:30 pm Thursday March 31
In this seminar we will learn how to use CALO to...
Seminar announcement
Todo request
Other
Labeling for Extraction
Painful: Difficult even for paid labelers
Complex tools
Seminar: How to Organize your Life
by Jane Smith, Stevenson & Smith
Mezzanine Level, Papadopoulos Sq
3:30 pm Thursday March 31
In this seminar we will learn how to use CALO to...
Click, drag, adjust, label, click, drag, adjust, label, ...
10
Multiple-choice Annotation for Learning Extractors “in the wild”
[Culotta, McCallum 2005]
Jane Smith , Stevenson & Smith , Mezzanine Level, Papadopoulos Sq.
Task: Information extraction.
Fields: NAME COMPANY ADDRESS (and others)
Jane Smith , Stevenson & Smith Mezzanine Level , Papadopoulos Sq.
Jane Smith , Stevenson & Smith Mezzanine Level , Papadopoulos Sq.
Jane Smith , Stevenson & Smith Mezzanine Level , Papadopoulos Sq.
user corrects labels, not segmentations
Interface presents top hypothesized segmentations
11
Multiple-choice Annotation for Learning Extractors “in the wild”
[Culotta, McCallum 2005]
Jane Smith , Stevenson & Smith , Mezzanine Level, Papadopoulos Sq.
Jane Smith , Stevenson & Smith Mezzanine Level , Papadopoulos Sq.
Jane Smith , Stevenson & Smith Mezzanine Level , Papadopoulos Sq.
Jane Smith , Stevenson & Smith Mezzanine Level , Papadopoulos Sq.
user corrects labels, not segmentations
Interface presents top hypothesized segmentations
Task: Information extraction.
Fields: NAME COMPANY ADDRESS (and others)
12
Multiple-choice Annotation for Learning Extractors “in the wild”
[Culotta, McCallum 2005]
Jane Smith , Stevenson & Smith , Mezzanine Level, Papadopoulos Sq.
Jane Smith , Stevenson & Smith Mezzanine Level , Papadopoulos Sq.
Jane Smith , Stevenson & Smith Mezzanine Level , Papadopoulos Sq.
Jane Smith , Stevenson & Smith Mezzanine Level , Papadopoulos Sq.
29% reduction in user actions needed to train
Interface presents top hypothesized segmentations
Task: Information extraction.
Fields: NAME COMPANY ADDRESS (and others)
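The multiple-choice idea on these slides (show the top hypothesized segmentations and let the user pick one) can be illustrated with a small sketch. This is not the extractor from [Culotta, McCallum 2005]: the scoring tables `EMIT` and `TRANS`, the helper names, and the brute-force enumeration are all assumptions made for illustration, and a real system would use N-best decoding over a trained CRF instead.

```python
from itertools import product

LABELS = ["NAME", "COMPANY", "ADDRESS"]

# Hypothetical toy scores; a trained extractor would supply these.
TOKENS = ["Jane", "Smith", "Stevenson", "Mezzanine"]
EMIT = {
    ("Jane", "NAME"): 2.0,
    ("Smith", "NAME"): 1.5,
    ("Smith", "COMPANY"): 1.0,
    ("Stevenson", "COMPANY"): 2.0,
    ("Mezzanine", "ADDRESS"): 2.0,
    ("Mezzanine", "COMPANY"): 1.2,
}
TRANS = {(lab, lab): 0.5 for lab in LABELS}  # small bonus for staying in a field


def score(tokens, labels, emit, trans):
    """Score one labeling: per-token scores plus label-transition scores."""
    total = sum(emit.get((tok, lab), 0.0) for tok, lab in zip(tokens, labels))
    total += sum(trans.get((a, b), 0.0) for a, b in zip(labels, labels[1:]))
    return total


def top_k_labelings(tokens, emit, trans, k=3):
    """Enumerate every labeling (only feasible for short records), keep the k best."""
    candidates = product(LABELS, repeat=len(tokens))
    ranked = sorted(candidates,
                    key=lambda labs: score(tokens, labs, emit, trans),
                    reverse=True)
    return ranked[:k]


def segments(tokens, labels):
    """Collapse adjacent tokens sharing a label into (label, text) fields."""
    fields = []
    for tok, lab in zip(tokens, labels):
        if fields and fields[-1][0] == lab:
            fields[-1] = (lab, fields[-1][1] + " " + tok)
        else:
            fields.append((lab, tok))
    return fields


if __name__ == "__main__":
    # Present the choices; the option the user picks becomes a training label.
    for rank, labs in enumerate(top_k_labelings(TOKENS, EMIT, TRANS), start=1):
        print(rank, segments(TOKENS, labs))
```

Choosing among a handful of ranked candidates replaces the click-drag-adjust loop with a single selection, which is where the reduction in user actions would come from.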
13
Outline
Information Extraction
– Learning in the wild
– Transfer learning
Identity Uncertainty
Modeling Groups, Roles and Topics
14
Piecewise Training in Factorial CRFs for Transfer Learning
Emailed seminar announcement entities
Email English words
[Sutton, McCallum, 2005]
Too little labeled training data.
60k words training.
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer Science Carnegie Mellon University
3:30 pm 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
15
Piecewise Training in Factorial CRFs for Transfer Learning
Newswire named entities
Newswire English words
[Sutton, McCallum, 2005]
Train on “related” task with more data.
200k words training.
CRICKET - MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22
South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.
16
Piecewise Training in Factorial CRFs for Transfer Learning
Newswire named entities
Email English words
[Sutton, McCallum, 2005]
At test time, label email with newswire NEs...
17
Piecewise Training in Factorial CRFs for Transfer Learning
Newswire named entities
Emailed seminar announcement entities
Email English words
[Sutton, McCallum, 2005]
…then use these labels as features for final task
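The cascaded step described on this slide (label the email with a newswire-trained tagger, then feed those labels to the final task as features) can be sketched as below. The gazetteer-based `newswire_tagger` is a hypothetical stand-in for a trained newswire model, and the feature names are invented for illustration; nothing here is taken from [Sutton, McCallum, 2005].

```python
# Hypothetical stand-in for a tagger trained on newswire named entities.
PERSON_WORDS = {"jaime", "carbonell"}
ORG_WORDS = {"carnegie", "mellon", "university"}


def newswire_tagger(tokens):
    """Return one coarse named-entity label per token (toy lookup, not a real model)."""
    labels = []
    for tok in tokens:
        low = tok.lower()
        if low in PERSON_WORDS:
            labels.append("PER")
        elif low in ORG_WORDS:
            labels.append("ORG")
        else:
            labels.append("O")
    return labels


def base_features(tokens, i):
    """Features the seminar-announcement tagger would use without any transfer."""
    tok = tokens[i]
    return {
        "word=" + tok.lower(): 1.0,
        "is_capitalized": float(tok[:1].isupper()),
    }


def cascaded_features(tokens, i, upstream_labels):
    """Add the newswire tagger's prediction for token i as one extra feature."""
    feats = base_features(tokens, i)
    feats["newswire_ne=" + upstream_labels[i]] = 1.0
    return feats


if __name__ == "__main__":
    tokens = ["Jaime", "Carbonell", "3:30", "pm"]
    upstream = newswire_tagger(tokens)
    for i in range(len(tokens)):
        print(tokens[i], cascaded_features(tokens, i, upstream))
```

In this cascade the upstream labels are fixed before the final tagger runs, so an upstream mistake cannot be revised later.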
18
Piecewise Training in Factorial CRFs for Transfer Learning
Newswire named entities
Seminar Announcement entities
English words
[Sutton, McCallum, 2005]
Use joint inference at test time.
An alternative to hierarchical Bayes.
Needn’t know anything about parameterization of subtask.
Accuracy: No transfer < Cascaded Transfer < Joint Inference Transfer
20
Outline
Information Extraction
– Learning in the wild
– Transfer learning
Identity Uncertainty
Modeling Groups, Roles and Topics
21
Y/N
Y/N
Y/N
Joint Co-reference Decisions, Discriminative Model
Stuart Russell
Stuart Russell
[Culotta & McCallum 2005]
S. Russel
People
22
Y/N
Y/N
Y/N
Y/N
Y/N
Y/N
Co-reference for Multiple Entity Types
Stuart Russell
Stuart Russell
University of California at Berkeley
[Culotta & McCallum 2005]
S. Russel
Berkeley
Berkeley
People Organizations
23
Y/N
Y/N
Y/N
Y/N
Y/N
Y/N
Joint Co-reference of Multiple Entity Types
Stuart Russell
Stuart Russell
University of California at Berkeley
[Culotta & McCallum 2005]
S. Russel
Berkeley
Berkeley
People Organizations
Reduces error by 22%
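To make the contrast between independent Y/N decisions and a joint decision concrete, here is a minimal sketch that scores whole partitions of the mentions instead of individual pairs. The pairwise weights are invented, the exhaustive enumeration only works for a handful of mentions, and the objective (sum of within-cluster pair weights) is an assumed simplification, not the model of [Culotta & McCallum 2005].

```python
from itertools import combinations

MENTIONS = ["Stuart Russell", "Stuart Russell", "S. Russel"]

# Hypothetical pairwise compatibility scores between mention indices:
# positive favors "same entity", negative favors "different entity".
WEIGHTS = {
    frozenset({0, 1}): 3.0,
    frozenset({0, 2}): 1.0,
    frozenset({1, 2}): -0.5,
}


def partitions(items):
    """Yield every way to split items into non-empty clusters."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing cluster in turn...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ...or give it a cluster of its own.
        yield [[first]] + part


def partition_score(part, weights):
    """Sum the pair weights of all mention pairs placed in the same cluster."""
    return sum(weights[frozenset(pair)]
               for cluster in part
               for pair in combinations(cluster, 2))


def best_partition(n_mentions, weights):
    """Pick the highest-scoring partition over all mentions jointly."""
    return max(partitions(list(range(n_mentions))),
               key=lambda part: partition_score(part, weights))


if __name__ == "__main__":
    print(best_partition(len(MENTIONS), WEIGHTS))
```

With these weights, thresholding each pair independently gives Y, Y, N, which is not a consistent clustering; scoring partitions jointly resolves the conflict by merging all three mentions.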
25
Outline
Information Extraction
– Learning in the wild
– Transfer learning
Identity Uncertainty
Modeling Groups, Roles and Topics
26
Social network from my email
30
From LDA to Author-Recipient-Topic (ART)
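The slide's diagram did not survive extraction, so as a rough reminder of how ART differs from LDA, the sketch below samples one message: each word picks one of the message's recipients uniformly, draws a topic from the mixture for that author-recipient pair, then draws a word from that topic. The names `theta` and `phi`, the toy distributions, and the people are illustrative assumptions, and the Dirichlet priors are omitted.

```python
import random


def sample_message(author, recipients, theta, phi, n_words, rng):
    """Sample (recipient, topic, word) triples for one message.

    theta maps (author, recipient) to a list of topic probabilities.
    phi is a list with one {word: probability} dict per topic.
    """
    triples = []
    for _ in range(n_words):
        recipient = rng.choice(recipients)          # uniform over this message's recipients
        topic_dist = theta[(author, recipient)]     # mixture for the author-recipient pair
        topic = rng.choices(range(len(topic_dist)), weights=topic_dist)[0]
        words = list(phi[topic])
        probs = [phi[topic][w] for w in words]
        word = rng.choices(words, weights=probs)[0]
        triples.append((recipient, topic, word))
    return triples


# Hypothetical toy parameters: two topics, two possible recipients.
THETA = {
    ("alice", "bob"): [0.9, 0.1],
    ("alice", "carol"): [0.2, 0.8],
}
PHI = [
    {"contract": 0.6, "draft": 0.4},
    {"dinner": 0.5, "weekend": 0.5},
]

if __name__ == "__main__":
    rng = random.Random(0)
    for triple in sample_message("alice", ["bob", "carol"], THETA, PHI, 5, rng):
        print(triple)
```

The difference from LDA is that the topic mixture is indexed by the author-recipient pair instead of by the document, which is what lets the model relate topics to who is writing to whom.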
32
Enron Email Corpus
250k email messages
23k people
Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: [email protected]
To: [email protected]
Subject: Enron/TransAlta Contract dated Jan 1, 2001
Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.
DP
Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas [email protected]
33
Topics, and prominent senders/receivers discovered by ART
Titles chosen by me
34
Topics, and prominent senders/receivers discovered by ART
Beck = “Chief Operations Officer”
Dasovich = “Government Relations Executive”
Shapiro = “Vice President of Regulatory Affairs”
Steffes = “Vice President of Government Affairs”
35
Comparing Role Discovery
connection strength (A,B) =
Traditional SNA: distribution over recipients
Author-Topic: distribution over authored topics
ART: distribution over authored topics
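Each method on this slide describes a person by a probability distribution, so comparing two people's roles amounts to comparing two distributions. The slide does not say which measure is used; the sketch below assumes Jensen-Shannon divergence as one reasonable choice, and the example distributions are made up.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: 0 for identical distributions, 1 for disjoint ones."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    # Hypothetical topic distributions for two people over three topics.
    person_a = [0.7, 0.2, 0.1]
    person_b = [0.1, 0.2, 0.7]
    print(js_divergence(person_a, person_b))  # larger value = more different roles
    print(js_divergence(person_a, person_a))  # identical distributions give 0.0
```

A low divergence would be read as "similar roles" and a high one as "different roles"; the same function works whether the distributions are over recipients or over topics.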
36
Comparing Role Discovery: Tracy Geaconne and Dan McCarty
Traditional SNA: Similar roles
Author-Topic: Different roles
ART: Different roles
Geaconne = “Secretary”
McCarty = “Vice President”
38
Comparing Role Discovery: Lynn Blair and Kimberly Watson
Traditional SNA: Different roles
Author-Topic: Very different
ART: Very similar
Blair = “Gas pipeline logistics”
Watson = “Pipeline facilities planning”
40
McCallum Email Corpus 2004
January - October 2004
23k email messages
825 people
From: [email protected]
Subject: NIPS and ....
Date: June 14, 2004 2:27:41 PM EDT
To: [email protected]
There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for:
NIPS registration receipt.
CALO registration receipt.
Thanks,
Kate
42
Four most prominent topics in discussions with ____?
44
Two most prominent topics in discussions with ____?
Words  Prob
love  0.030514
house  0.015402
[…]  0.013659
time  0.012351
great  0.011334
hope  0.011043
dinner  0.00959
saturday  0.009154
left  0.009154
ll  0.009009
[…]  0.008282
visit  0.008137
evening  0.008137
stay  0.007847
bring  0.007701
weekend  0.007411
road  0.00712
sunday  0.006829
kids  0.006539
flight  0.006539
47
49
Role-Author-Recipient-Topic Models
50
Year Three Plans: “People”
Extraction, for Expert-finding and Group/Role Analysis
• Make learning-in-the-wild practical for extraction.
• Transfer from noisy/incomplete databases to improve IE.
• Support questions about contact info, organizational affiliation, etc.
Identity Uncertainty
• Central problem for going from text to knowledge base.
• Many interacting entity types, relationships.
Group/Role/Topic Analysis
• Explicit “topic models” of groups, roles, expertise, tasks, and their interaction with extraction...
• Support Qs about topical expertise, forwarding messages, team building.
Etc.
• Continue to support and enhance MALLET toolkit, in collaboration with UPenn and others.