Wei Hu ([email protected])
Department of Computer Science and Technology
National Key Laboratory for Novel Software Technology
Nanjing University, China
Entity Linkage in the Linked Data: approaches and analysis
Université d'Artois, 2016.6.7
Websoft (http://ws.nju.edu.cn)
- Research topics
  - Semantic Web
  - Web science
  - Big data
- Academic records
  - Papers
    - WWW, IJCAI, AAAI, ISWC …
    - Best paper award & nominee
  - Grants: 863, NSFC …
- Collaborations
  - Stanford, VUA, KIT, Aberdeen …
  - IBM, Samsung, ZTE …
[Photos: Yuzhong Qu, Wei Hu, Gong Cheng]
Outline
- Introduction to Semantic Web and entity linkage
- A bootstrapping approach to entity linkage
- Link analysis of biomedical linked data
- (Two applications)
Semantic Web
- The Semantic Web was an idea from Tim Berners-Lee
- Give formal meanings to Web information: semantics
  - Web 1.0 (page) → Web 2.0 (social) → Web 3.0 (a web of data)
- The Semantic Web is about
  1. common formats for integration and combination of data drawn from diverse sources
  2. languages for recording how the data relates to real-world objects
RDF (Resource Description Framework)
RDF triple: <subject, predicate, object>
[Figure: example RDF graph connecting http://en.wikipedia.org/wiki/Alan_Kay and http://www.smalltalk.org/]
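To make the <subject, predicate, object> shape concrete, here is a minimal Python sketch that models one triple and serializes it in N-Triples style. The two URIs are the ones shown on the slide; the predicate URI is a hypothetical placeholder, because the slide's figure does not survive in this transcript, and `Triple` and `to_ntriples` are illustrative names rather than part of any RDF library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: <subject, predicate, object>."""
    subject: str
    predicate: str
    obj: str


def to_ntriples(t: Triple) -> str:
    """Serialize a triple whose three terms are all URIs, N-Triples style."""
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."


# Subject and object come from the slide; the predicate is a made-up
# placeholder (assumption), not a real vocabulary term.
triple = Triple(
    subject="http://en.wikipedia.org/wiki/Alan_Kay",
    predicate="http://example.org/vocab#relatedTo",
    obj="http://www.smalltalk.org/",
)

print(to_ntriples(triple))
```

A set of such triples forms a directed labeled graph: subjects and objects are nodes, predicates are edge labels.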
Linked data
- As a realization of the Semantic Web
  - Linked Data refers to a collection of interrelated datasets
    - Used for large-scale integration of, and reasoning on, data on the Web
- Linked data principles
  1. Use URIs to name things
  2. Use HTTP URIs (can be "dereferenced")
  3. Provide useful information using the open Web standards (e.g. RDF)
  4. Include links to other related things
Linking open data (LOD) cloud
[Figure: LOD cloud diagram, 1,014 datasets. Categories: publication, life science, social network, geography, government data, media, language, UGC, misc.]
Knowledge graph
- The Knowledge Graph is a knowledge base used by Google to enhance its search engine's search results with semantic search information gathered from a wide variety of sources
  - Nodes: entities or concepts
  - Edges: attributes or relations
Entity linkage
- Semantic Web data reach a scale of billions of entities
- Many different entities refer to the same real-world thing
  - Typically denoted by URIs, from distributed data sources
- e.g. Wei Hu
  - http://data.semanticweb.org/person/wei-hu
  - http://ws.nju.edu.cn/people/whu
  - http://ontoworld.org/wiki/Special:URIResolver/Wei_Hu
  - …
- Entity linkage: link different entities that refer to the same object
  - a.k.a. coreference resolution, entity matching …
  - Out of 31B RDF statements, fewer than 500M are links across sources
Background
- In LOD, millions of entities have already been linked
  - However, potential candidates are still numerous
- Current solutions
  1. Equivalence reasoning
    - owl:sameAs, inverse functional properties …
    - At present, probably misses many potential candidates
  2. Similarity computation (also in the database area)
    - Compare properties and values of entities
    - Inaccurate (heterogeneity), less scalable (pairwise comparison)
  3. To improve, machine learning
    - Time-consuming and labor-intensive to build a large-scale training set
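Equivalence reasoning over owl:sameAs can be sketched as computing connected groups of entities, since owl:sameAs is symmetric and transitive. The following is a minimal union-find sketch under that assumption only; it ignores inverse functional properties and other OWL constructs. The `ex:` identifiers are hypothetical, while the `dbpedia:Nanjing` / `geo:1799962` pair is taken from the slides.

```python
from collections import defaultdict


def same_as_closure(pairs):
    """Group entities connected by owl:sameAs statements (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    return list(groups.values())


same_as = [
    ("dbpedia:Nanjing", "geo:1799962"),  # from the slides
    ("ex:a", "ex:b"),                    # hypothetical
    ("ex:b", "ex:c"),                    # hypothetical
]
groups = same_as_closure(same_as)
for g in groups:
    print(sorted(g))
```

This only recovers links that are already asserted or implied, which is why the slide notes that equivalence reasoning probably misses many potential candidates.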
Definition
How to combine? Our solution: bootstrapping
- Query-driven entity linkage
- Use scenarios
  1. Search / browsing: a system knows "what to link" only at query time
  2. Analyze small portions of a very large dataset to answer on-demand queries
Definition 1. Let $U$ be the set of entities in a set $D$ of data sources. Given an entity $u \in U$, the entity linkage for $u$ is to query a subset of entities for which a relation $\varepsilon$ holds:
$$\{\, v \in U \mid (u, v) \in \varepsilon \,\}$$
where $\varepsilon$ links all the entities in $U$ that refer to the same object as $u$ does, i.e. are coreferent with $u$.
Our contribution
[Figure: bootstrapping workflow, operating on labeled entities and unresolved entities, with external knowledge]
Input: an entity
1. Build a kernel (initialize training set)
  - Automatically infer semantically coreferent entities based on OWL/SKOS semantics
2. Learn discriminative property-value pairs (bootstrapping)
  - Iterative process
  - Assumptions: (1) coreferent entities share some similar property-value pairs; (2) a few property-value pairs are more important for linking entities
3. Frequent property combinations
  - Some properties are more natural to use together
Output: a set of coreferent entities
Wei Hu, Cunxin Jia. A Bootstrapping Approach to Entity Linkage on the Semantic Web. J. Web Semantics, 2015
Running example
- dbpedia:Nanjing (DBpedia)
  - rdfs:label "Nanjing"
  - owl:sameAs geo:1799962
- geo:1799962 (GeoNames)
  - geo:lat "32 N"
  - geo:long "118 E"
  - geo:alternateName "Nanjing", "Nan-ching"
- fb:m.05gqy (Freebase)
  - rdfs:label "Nanjing"
  - geo:lat "32 N"
  - geo:long "118 E"
- ex:NationalCity
  - geo:long "117 W"
  - geo:lat "32 N"
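The running example can be used to sketch the iterative idea: start from a kernel of entities already known to be coreferent, then repeatedly add unresolved entities that share a discriminative property-value pair with the labeled set. This is a simplified illustration, not the method of the cited paper. In particular, the discriminability score used here (the inverse of the number of entities carrying a pair) and the threshold of 0.4 are assumptions chosen for this toy data, and frequent property combinations are not modeled. The default of 4 iterations mirrors the maximum iteration setting reported on the slides.

```python
def bootstrap_linkage(descriptions, kernel, max_iter=4, threshold=0.4):
    """Self-training sketch: grow a set of coreferent entities from a kernel.

    descriptions: dict mapping entity -> set of (property, value) pairs
    kernel: entities initially labeled as coreferent (e.g. via owl:sameAs)
    """
    # Count how many entities carry each property-value pair.
    carriers = {}
    for pairs in descriptions.values():
        for pv in pairs:
            carriers[pv] = carriers.get(pv, 0) + 1

    labeled = set(kernel)
    for _ in range(max_iter):
        # Assumed score: a pair is discriminative if few entities carry it.
        discriminative = {
            pv
            for e in labeled
            for pv in descriptions.get(e, ())
            if 1.0 / carriers[pv] >= threshold
        }
        # Label every unresolved entity that shares a discriminative pair.
        new = {
            e
            for e, pairs in descriptions.items()
            if e not in labeled and pairs & discriminative
        }
        if not new:
            break
        labeled |= new
    return labeled


# Property-value pairs from the running example.
descriptions = {
    "dbpedia:Nanjing": {("rdfs:label", "Nanjing")},
    "geo:1799962": {
        ("geo:lat", "32 N"),
        ("geo:long", "118 E"),
        ("geo:alternateName", "Nanjing"),
        ("geo:alternateName", "Nan-ching"),
    },
    "fb:m.05gqy": {
        ("rdfs:label", "Nanjing"),
        ("geo:lat", "32 N"),
        ("geo:long", "118 E"),
    },
    "ex:NationalCity": {("geo:long", "117 W"), ("geo:lat", "32 N")},
}

# Kernel from the owl:sameAs statement between DBpedia and GeoNames.
kernel = {"dbpedia:Nanjing", "geo:1799962"}
linked = bootstrap_linkage(descriptions, kernel)
print(sorted(linked))
```

On this data the latitude pair is carried by three entities and falls below the assumed threshold, so ex:NationalCity is not linked, while fb:m.05gqy joins the labeled set through its label and longitude.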
Experiment
- Dataset
  - Billion Triples Challenge (BTC) 2011
- Testing entities
  - Top-50 in 364 thousand query logs
- Evaluation procedure and metrics
  - 30 graduates, 2 judges + 1 arbitrator per link, Fleiss's κ = 0.8 (sufficient agreement)
  - Precision & relative recall (RR)
    - RR = correct links in one system / total correct unique links in all systems
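The two metrics can be written down directly from the definition above. The sketch below is an illustration with hypothetical system names and link identifiers; only the formulas follow the slide. Relative recall pools the correct links found by all systems, because the full set of true links in a billion-triple dataset is not known.

```python
def precision(found, correct):
    """Fraction of a system's links that were judged correct."""
    return len(found & correct) / len(found) if found else 0.0


def relative_recall(found_by_system, correct, system):
    """Correct links of one system / correct unique links of all systems."""
    pool = set()
    for links in found_by_system.values():
        pool |= links & correct
    if not pool:
        return 0.0
    return len(found_by_system[system] & correct) / len(pool)


# Hypothetical output of two systems and hypothetical judgments.
found_by_system = {
    "A": {"link1", "link2", "link3"},
    "B": {"link2", "link4"},
}
judged_correct = {"link1", "link2", "link4"}

for name, links in found_by_system.items():
    p = precision(links, judged_correct)
    rr = relative_recall(found_by_system, judged_correct, name)
    print(name, round(p, 3), round(rr, 3))
```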
Dataset statistics:
  Entities               > 100 million
  RDF stat.              > 2 billion
  Same-as stat.          3,446,029
  IFP stat.              1,799,976
  FP stat.               2,279,474
  Exact-match stat.      22,398
  Cardinality stat.      148
  Has-key stat.          2
  Different-from stat.   691
  All-different stat.    89
Testing entities by category:
  People             15
  Places             10
  Tech terms         8
  Music / movies     5
  Universities       4
  Companies          3
  Publications       2
  Others             3
Experiment
- Bootstrapping curve
  - Maximum iteration = 4
  - Discriminability threshold = 0.05
- Linkage accuracy
- Running time on 5,000 samples: avg. 11.3 links in 12.6 s
Background
- Network analysis has long been used to study link structures
  - Network medicine: cellular networks and implications
  - The "bow tie" structure of the Web
- Linked data for the life sciences
  - e.g. Bio2RDF, Chem2Bio2RDF, Neurocommons, W3C LODD
  - Millions of links over hundreds of overlapping datasets
- Network analysis can help
  - understand structures to express data
  - facilitate large-scale data integration
  - improve overall quality of biomedical data
No such analysis yet!
Preliminaries
- Graph: nodes and edges
  - (outgoing / incoming) degree
  - Sink, source, isolated node
- Power law distribution
  - $P(k) \propto k^{-\gamma}$, where $k$ is the degree and $\gamma$ the exponent
  - Scale-free
- Weakly connected component
  - Size: number of nodes
- Average distance
- Clustering coefficient → small-world phenomenon
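Most of these measures are easy to compute on a small directed graph. The sketch below is a plain-Python illustration on a hypothetical five-node graph. It assumes that a sink has no outgoing edges, a source has no incoming edges, and that the clustering coefficient is the average local coefficient on the undirected view of the graph; the slides do not state which variant was used. Average distance is left out to keep the example short.

```python
def graph_stats(nodes, edges):
    """Basic link-structure measures for a directed graph."""
    out_deg = {v: 0 for v in nodes}
    in_deg = {v: 0 for v in nodes}
    nbrs = {v: set() for v in nodes}  # undirected view of the graph
    for a, b in edges:
        out_deg[a] += 1
        in_deg[b] += 1
        if a != b:
            nbrs[a].add(b)
            nbrs[b].add(a)

    sources = {v for v in nodes if in_deg[v] == 0 and out_deg[v] > 0}
    sinks = {v for v in nodes if out_deg[v] == 0 and in_deg[v] > 0}
    isolated = {v for v in nodes if in_deg[v] == 0 and out_deg[v] == 0}

    # Weakly connected components: traverse edges ignoring direction.
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            x = stack.pop()
            comp.add(x)
            for y in nbrs[x] - seen:
                seen.add(y)
                stack.append(y)
        components.append(comp)

    def local_cc(v):
        """Share of neighbor pairs of v that are themselves connected."""
        ns = sorted(nbrs[v])
        k = len(ns)
        if k < 2:
            return 0.0
        closed = sum(
            1 for i in range(k) for j in range(i + 1, k) if ns[j] in nbrs[ns[i]]
        )
        return 2 * closed / (k * (k - 1))

    return {
        "sources": sources,
        "sinks": sinks,
        "isolated": isolated,
        "components": components,
        "avg_clustering": sum(local_cc(v) for v in nodes) / len(nodes),
    }


# Hypothetical graph: a triangle a-b-c, a tail c->d, and an isolated node e.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
stats = graph_stats(nodes, edges)
print(stats["sources"], stats["sinks"], stats["isolated"])
print([sorted(c) for c in stats["components"]])
print(round(stats["avg_clustering"], 3))
```

A short average distance together with a high clustering coefficient is what the slide refers to as the small-world phenomenon.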
Our contribution
- We conduct an empirical link analysis of Bio2RDF
  - Bio2RDF is an open source project that uses Semantic Web technologies to build and provide the largest network of life science Linked Data
    - Ensures the significance of our empirical study
  1. Dataset link analysis (using RDF data model)
  2. Entity link analysis (using a special kind of cross-references)
  3. Term link analysis (using ontology matching)
- For each perspective, we investigate the graph features of Bio2RDF vis-à-vis what has been previously reported
  - Symmetry and transitivity of entity links
  - Benchmark to evaluate entity matching approaches
Wei Hu, Honglei Qiu, Michel Dumontier. Link Analysis of Life Science Linked Data. ISWC, 2015
Dataset overview
What is the status of Bio2RDF?
- 35 datasets, 11B RDF triples
- 1B entities
- 2K classes
- 4K properties
Observations
1. Well linked
2. Average distance = 2.77, clustering coefficient = 0.22 → small-world phenomenon
3. Hubs and authorities
4. Good resilience
Entity link analysis
How well do entities link to each other?
- 76% of entity links come from a special kind of RDF triples
  - e.g. <kegg:D03455, kegg:x-drugbank, drugbank:DB00002>
  - x-relations have under-specified semantics
    - Refer to a related resource, e.g. article
    - Truly identical
- Degree distribution
  - Three types of entities in OMIM, NCBI, KEGG
  - Does not follow a power law
    - Exponent is too large (close to 5), p-value is too small (close to 0)
[Figure: degree distribution plots; label: BTC2010]
Symmetry and transitivity
- Symmetry
  - Borrow and reverse
  - Different classes
    - Phenotype vs. Disorder
- Transitivity
  - Weak intermediate
  - Modeling divergence
    - Hard even for humans
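Checking symmetry and transitivity on a set of directed entity links can be sketched as follows. The entity identifiers are hypothetical, and the functions only inspect the link structure; whether a symmetric or transitively implied link is semantically valid still needs the kind of manual inspection described on the slide.

```python
from collections import defaultdict


def symmetric_links(links):
    """Links (a, b) whose reverse (b, a) is also present."""
    return {(a, b) for (a, b) in links if (b, a) in links}


def missing_transitive_links(links):
    """Pairs (a, c) implied by a->b and b->c but absent from the link set."""
    successors = defaultdict(set)
    for a, b in links:
        successors[a].add(b)
    return {
        (a, c)
        for a, b in links
        for c in successors.get(b, ())
        if a != c and (a, c) not in links
    }


# Hypothetical entity links between three datasets x, y and z.
links = {
    ("x:1", "y:1"),
    ("y:1", "x:1"),
    ("y:1", "z:1"),
}
print(sorted(symmetric_links(links)))
print(sorted(missing_transitive_links(links)))
```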
Discussion of findings
- The entity link graph does not share the same characteristics as the Hypertext / Semantic Web
  - Degree distribution does not follow a power law
- A dominant part of the entities have been linked using x-relations, but their intended semantics differs
  - Classes are identical or equivalent → entity links represent logical equivalence
- Symmetric and transitive entity links exist, but their effectiveness is weakened due to their small number
  - Meanings of entity links may shift under transitivity
  - KEGG, DrugBank and OMIM are the most prominent knowledge bases
Applications: BioSearch
- Keyword search is the most popular paradigm for information retrieval
  - Keywords can be ambiguous and have multiple meanings
- Semantic search aims to improve search accuracy by understanding user intent and search context
  - Heterogeneity between local schemas
- Our solution
  1. Semantic query + faceted filtering
    - Not only plain keywords but also semantic tags
  2. Onto-based query answering
    - Rewrite queries from SIO to local schemas
  3. Entity browsing
  - Result: effectiveness +22.4%, usability +28.8%
http://ws.nju.edu.cn/biosearch/
[Figure: BioSearch architecture. Labels: browser side, server side, external resources, images, semantic query input, faceted filtering, entity browsing, semantic query, SPARQL queries, ontology matching, ontology mappings, ontology-based query answering, Semanticscience Integrated Ontology, Gene]
Applications: Clinga
- Chinese geographical data is small in scale, e.g. 4.6% in GeoNames
- Chinese linked geographical dataset (Clinga)
  1. Extract data from the largest Chinese wiki encyclopedia
  2. Design a geo-ontology to classify geographical entity types
  3. Automatic discovery of links to existing knowledge bases
  - Result: 624K entities, 230K links
- Use scenario
  - Major knowledge base for answering Chinese geographical questions in our National Higher Education Entrance Examination (called GaoKao)
http://ws.nju.edu.cn/clinga/
Conclusion
- Entity linkage links different entities that refer to the same real-world object
- Large scale and heterogeneity challenge existing entity linkage solutions
- Entity linkage approaches often involve knowledge representation, data mining, network analysis, crowdsourcing and many other techniques
Comments?
Contact: Wei Hu ([email protected])
Thank you for your invitation and time!