Construction and Querying of Large-scale Knowledge Bases
Xiang Ren¹, Yu Su², Xifeng Yan²
¹ University of Southern California
² University of California, Santa Barbara
Tutorial website: http://xren7.web.engr.illinois.edu/tutorial-cikm17.html
Slides, code, datasets, references
Turning Unstructured Text Data into Structures
Unstructured Text Data (account for ~80% of all data in organizations) (Chakraborty, 2016)
? Structures
Knowledge & Insights
Reading the reviews: From Text to Structured Facts
Entity types: Restaurant, Location, Organization, Event
This hotel is my favorite Hilton property in NYC! It is located right on 42nd street near Times Square, it is close to all subways, Broadways shows, and next to great restaurants like Junior’s Cheesecake, Virgil’s BBQ and many others.
-- TripAdvisor
[Figure: entity-relation graph linking hotel, Hilton property, NYC, Times Square, Junior’s Cheesecake, Virgil’s BBQ and Broadways shows through the relations “is a”, “located at”, “located near” and “close to”]
Structured Facts
1. “Typed” entities
2. “Typed” relationships
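To make "typed" entities and "typed" relationships concrete, here is a minimal Python sketch of the facts in the review above. The class names, the type given to each entity and the exact edges are assumptions read off the review text, not the tutorial's own data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    etype: str  # one of the slide's types, e.g. "Location" (assignment is assumed)


@dataclass(frozen=True)
class Fact:
    head: Entity
    relation: str  # typed relationship, e.g. "close to"
    tail: Entity


# Typed entities (type assignments are illustrative assumptions)
hotel = Entity("hotel", "Organization")
hilton = Entity("Hilton property", "Organization")
nyc = Entity("NYC", "Location")
times_square = Entity("Times Square", "Location")
shows = Entity("Broadways shows", "Event")
juniors = Entity("Junior's Cheesecake", "Restaurant")
virgils = Entity("Virgil's BBQ", "Restaurant")

# Typed relationships (edges are one plausible reading of the review)
facts = [
    Fact(hotel, "is a", hilton),
    Fact(hotel, "located at", nyc),
    Fact(hotel, "located near", times_square),
    Fact(hotel, "close to", shows),
    Fact(hotel, "close to", juniors),
    Fact(hotel, "close to", virgils),
]


def tails_by_type(facts, head_name, tail_type):
    """Return names of entities of a given type that are linked from head_name."""
    return [
        f.tail.name
        for f in facts
        if f.head.name == head_name and f.tail.etype == tail_type
    ]


nearby_restaurants = tails_by_type(facts, "hotel", "Restaurant")
```

Once the text is in this form, a question such as "which restaurants are mentioned around this hotel" becomes a simple filter over types rather than a text search.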
Why Text to Structures?
• Structured Search & Exploration
• Pattern / Association Rule Mining
• Structured Feature Generation
• Graph Mining & Network Analysis
A Product Use Case: Finding “Interesting Hotel Collections”
http://engineering.tripadvisor.com/using-nlp-to-find-interesting-collections-of-hotels/
Technology Transfer to TripAdvisor
Features for “Catch a Show” collection:
1. broadway shows
2. beacon theater
3. broadway dance center
4. broadway plays
5. david letterman show
6. radio city music hall
7. theatre shows
Features for “Near The High Line” collection:
1. high line park
2. chelsea market
3. highline walkway
4. elevated park
5. meatpacking district
6. west side
7. old railway
Grouping hotels based on structured facts extracted from the review text
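The grouping idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hotel names and their extracted phrases are hypothetical, and the overlap rule (`min_overlap`) is a stand-in, not the method TripAdvisor actually used.

```python
from collections import defaultdict

# Feature phrases per collection, taken from the two lists on the slide
COLLECTION_FEATURES = {
    "Catch a Show": {
        "broadway shows", "beacon theater", "broadway dance center",
        "broadway plays", "david letterman show", "radio city music hall",
        "theatre shows",
    },
    "Near The High Line": {
        "high line park", "chelsea market", "highline walkway",
        "elevated park", "meatpacking district", "west side", "old railway",
    },
}

# Hypothetical input: phrases extracted from each hotel's reviews
hotel_phrases = {
    "Hotel A": {"broadway shows", "radio city music hall", "subway"},
    "Hotel B": {"high line park", "chelsea market"},
    "Hotel C": {"rooftop bar"},
}


def group_hotels(hotel_phrases, collection_features, min_overlap=1):
    """Assign a hotel to every collection whose features it mentions often enough."""
    groups = defaultdict(list)
    for hotel, phrases in sorted(hotel_phrases.items()):
        for collection, features in collection_features.items():
            if len(phrases & features) >= min_overlap:
                groups[collection].append(hotel)
    return dict(groups)


groups = group_hotels(hotel_phrases, COLLECTION_FEATURES)
```

Raising `min_overlap` trades recall for precision: a hotel then needs several matching phrases before it joins a collection.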
Prior Art: Extracting Structures with Repeated Human Effort
This hotel is my favorite Hilton property in NYC! It is located right on 42nd street near Times Square, it is close to all subways, Broadways shows, and next to many great …
… The June 2013 Egyptian protest were mass protest event that occurred in Egypt on 30 June 2013, …
Human labeling
… We had a room facing Times Square and a room facing the Empire State Building, The location is close to everything and we love …
Extraction Rules
Machine-Learning Models
Broadways shows
NYC
Hilton property
Labeled data
Text Corpus
Stanford CoreNLP
CMU NELL
UW KnowItAll
USC AMR
IBM Alchemy APIs
Google Knowledge Graph
Microsoft Satori
…
Structured Facts
Times square
hotel
This Tutorial: Effort-Light StructMine
Text Corpus: News articles, PubMed papers
Knowledge Bases (KB)
Corpus-specific Models
Structures
• Enables quick development of applications over various corpora
• Extracts complex structures without introducing human error
Effort-Light StructMine: Where Are We?
A Review of Previous Efforts
[Figure: systems positioned by human labeling effort and feature engineering effort]
Hand-crafted systems: UCB Hearst Pattern, 1992; NYU Proteus, 1997
Supervised learning systems: Stanford CoreNLP, 2005 - present; UT Austin Dependency Kernel, 2005; IBM Watson Language APIs
Weakly-supervised learning systems: CMU NELL, 2009 - present; UW KnowItAll, Open IE, 2005 - present; Max-Planck YAGO, 2008 - present
Distantly-supervised learning systems: Stanford DeepDive, MIML-RE, 2012 - present; UW FIGER, MultiR, 2012; Effort-Light StructMine (WWW’15, KDD’15, KDD’16, EMNLP’16, WWW’17, …)
“Distant” Supervision: What Is It?
Text corpus
Knowledge Bases
“Matchable” structures: entity names, entity types, typed relationships ...
(Mintz et al., 2009), (Riedel et al., 2010), (Lin et al., 2012), (Ling et al., 2012), (Surdeanu et al., 2012), (Xu et al., 2013), (Nagesh et al., 2014), …
Freely available!
• Common knowledge
• Life sciences
• Art …
Rapidly growing!
Number of Wikipedia articles: https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
Human crowds
“Un-matchable”
Learning with Distant Supervision: Challenges
1. Sparsity of “Matchable”
• Incomplete knowledge bases
• Low-confidence matching
2. Accuracy of “Expansion”
• For “matchable”: Are all the labels assigned accurately?
• For “un-matchable”: How to perform inference accurately?
(Ren et al., KDD’15)
It is my favorite city in the United States
The United States needs a new strategy to meet this challenge
Government, Location
… next to restaurants like Junior’s Cheesecake ✗
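Both challenges show up even in a toy version of distant supervision. The Python sketch below is an assumption-laden illustration, not the labeling procedure of Ren et al.: the knowledge base contents are made up, candidate mentions are assumed to be given, and matching is plain string lookup.

```python
# Toy knowledge base: entity name -> all types it has in the KB (hypothetical contents)
KB = {
    "United States": {"Government", "Location"},
    "Times Square": {"Location"},
}


def distant_label(sentence, candidate_mentions, kb):
    """Label mentions by KB lookup; return (labeled, unmatched).

    Every matched mention receives ALL of its KB types, regardless of context.
    Mentions absent from the KB get no label at all.
    """
    labeled, unmatched = {}, []
    for mention in candidate_mentions:
        if mention not in sentence:
            continue  # the candidate does not occur in this sentence
        if mention in kb:
            labeled[mention] = set(kb[mention])
        else:
            unmatched.append(mention)
    return labeled, unmatched


s1 = "It is my favorite city in the United States"
s2 = "The United States needs a new strategy to meet this challenge"
s3 = "next to restaurants like Junior's Cheesecake"

labels1, _ = distant_label(s1, ["United States"], KB)
labels2, _ = distant_label(s2, ["United States"], KB)
_, unmatched3 = distant_label(s3, ["Junior's Cheesecake"], KB)
```

Here `labels1` and `labels2` are identical even though the two sentences use "United States" in different senses, which is the accuracy problem for "matchable" mentions. "Junior's Cheesecake" ends up in `unmatched3`, which is the sparsity problem: a model still has to infer its type from context.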
Effort-Light StructMine: Contributions
Challenge: Sparsity of “Matchable”
Solution idea: Effective expansion from “matchable” to “un-matchable”
Challenge: Accuracy of “Expansion”
Solution idea: Pick the “best” labels based on the context (for both “matchable” and “un-matchable”)
Harness the “data redundancy” using graph-based joint optimization
Text corpus:
It is my favorite city in the United States
The United States needs a new strategy to meet this challenge
Government, Location
Effort-Light StructMine: Methodology
Text corpus
Knowledge bases
Data-driven text segmentation (SIGMOD’15, WWW’16)
Entity names & context units
Partially-labeled corpus
Learning Corpus-specific Model (KDD’15, KDD’16, EMNLP’16, WWW’17)
Structures from the remaining unlabeled data
Transformation in Information Search
Desktop search, Mobile search
Lengthy Documents? Direct Answers!
“Which hotel has a roller coaster in Las Vegas?”
Answer: New York-New York hotel
Surge of mobile Internet use in China
Application: Facebook Entity Graph
People, Places, and Things
Facebook’s knowledge graph (entity graph) stores as entities the users, places, pages and other objects within Facebook.
Connecting
The connections between the entities indicate the type of relationship between them, such as friend, following, photo, check-in, etc.
Structured Query: RDF + SPARQL
Triples in an RDF graph
Subject            Predicate   Object
Barack_Obama       parentOf    Malia_Obama
Barack_Obama       parentOf    Natasha_Obama
Barack_Obama       spouse      Michelle_Obama
Barack_Obama_Sr.   parentOf    Barack_Obama
RDF graph (nodes): Barack_Obama_Sr., Barack_Obama, Malia_Obama, Natasha_Obama, Michelle_Obama
SPARQL query
SELECT ?x WHERE {
  Barack_Obama_Sr. parentOf ?y .
  ?y parentOf ?x .
}
Answer
<Malia_Obama> <Natasha_Obama>
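What the SPARQL query computes can be mirrored in plain Python over the same four triples. This is a minimal sketch for intuition only: the helper names are hypothetical, and a real RDF store would use IRIs and an actual SPARQL engine rather than list scans.

```python
# The four triples from the slide, as (subject, predicate, object)
triples = [
    ("Barack_Obama", "parentOf", "Malia_Obama"),
    ("Barack_Obama", "parentOf", "Natasha_Obama"),
    ("Barack_Obama", "spouse", "Michelle_Obama"),
    ("Barack_Obama_Sr.", "parentOf", "Barack_Obama"),
]


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def grandchildren(triples, person):
    """Two-hop pattern: person parentOf ?y . ?y parentOf ?x . Returns every ?x."""
    result = []
    for y in objects(triples, person, "parentOf"):       # bind ?y
        result.extend(objects(triples, y, "parentOf"))   # bind ?x
    return result


answer = grandchildren(triples, "Barack_Obama_Sr.")
# answer == ["Malia_Obama", "Natasha_Obama"]
```

The shared variable `?y` in the query corresponds to the loop variable `y`: each triple pattern is a lookup, and the join happens by reusing the binding.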
Why Structured Query Falls Short?
Knowledge Base            #Entities  #Triples  #Classes  #Relations
Freebase                  45M        3B        53K       35K
DBpedia                   6.6M       13B       760       2.8K
Google Knowledge Graph*   570M       18B       1.5K      35K
YAGO                      10M        120M      350K      100
Knowledge Vault           45M        1.6B      1.1K      4.5K
* as of 2014
• It’s more than large: High heterogeneity of KBs
• If it’s hard to write SQL on simple relational tables, it’s only harder to write SPARQL on large knowledge bases
• Even harder on automatically constructed KBs with a massive, loosely-defined schema
Schema-agnostic KB Querying
Keyword query: query like search engine
“Barack Obama Sr. grandchildren”
Graph query: add a little structure
Nodes: Barack Obama Sr., grandchildren
Natural language query: like asking a friend
“Who are Barack Obama Sr.’s grandchildren?”
Query by example: Just show me examples
<Barack Obama Sr., Malia Obama>