Using Wikidata properties to improve search in Dutch historical newspapers Theo van Veen, SEA, 18-11-2016
Content enrichment: purpose and approach
• making content better findable and usable, especially newspapers
• by enriching text or parts of text and names in the text with a.o. links to related information
• this related information is in most cases linked data (Wikipedia, Polygoon news reels)
• linked data is used to improve usability of content by adding related information to the presentation
• linked data is used as a means to improve disclosure of content by adding related information to the search index
• But … we want to hide SPARQL from the user
How will access and usability be improved?
1. Because “things” are identified we can make a better distinction between things (thesaurus function)
2. Because the identifiers are links to resource descriptions it is possible to present the content with context information about “things” in the content
3. Relevant context information can be indexed as part of a “thing” so it can be used for searching
4. By enriching the content with the identification of “things” semantic search is enabled using properties in external descriptions
1. Identification 2. Context 3. Indexing 4. Semantic search
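The four steps above can be sketched in a few lines of Python. Everything here is illustrative toy data, except Q937, which really is the Wikidata id for Albert Einstein:

```python
# 1. Identification: a recognized name is mapped to an identifier.
entities = {"Einstein": "Q937"}  # Q937 = Albert Einstein in Wikidata

# 2. Context: the identifier resolves to a description with properties.
context = {"Q937": {"label": "Albert Einstein",
                    "occupation": "physicist",
                    "birthplace": "Ulm"}}

# 3. Indexing: context properties are co-indexed with the article text.
index_entry = {"article": "article-1",
               "text": "Einstein bezoekt Leiden",
               "wikidata_ids": ["Q937"],
               "context_terms": ["physicist", "Ulm"]}

# 4. Semantic search: a property-based query matches via the context
#    terms even when the term never occurs in the article text itself.
def matches(entry, term):
    return term in entry["context_terms"] or term in entry["text"]

hit = matches(index_entry, "physicist")  # found via co-indexed context
```

The point of step 4 is that a search for "physicist" finds an article that only contains the word "Einstein".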
Additional motivation
• Libraries are more and more part of the outside world. Improving disclosure and usability requires intelligently connecting content with the outside world.
• Content contains “knowledge” that cannot easily be found by means of conventional search. This requires intelligent preprocessing.
• This knowledge should not have to be searched for first, but should be offered on request after alerting the user.
• Our software should have read and analyzed our content integrally before the user does!
How to identify names in text?
• By recognizing names (named entity recognition)
• Those names then have to be identified. How? By searching for them in Wikipedia/DBpedia and subsequently linking them to the Wikipedia/DBpedia descriptions
• But … those names are ambiguous: does Einstein link to Albert Einstein or to Alfred Einstein?
• So … we have to create software for improving the accuracy of links. Conventional “if then else” software isn’t fit for this job: we need machine learning techniques
• But … many false links and missing links still remain, and DBpedia does not contain everything
• So … we need user feedback for correction, for adding links for unrecognized names, and for additional training of the software
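The disambiguation step can be sketched as a simple candidate scorer. This is not the presentation's actual model: the hand-picked weights and the two features (surface-string similarity plus context-word overlap) merely stand in for the trained classifier:

```python
from difflib import SequenceMatcher

def score_candidate(mention, article_words, candidate):
    """Score one DBpedia/Wikidata candidate for an ambiguous mention.
    Features: surface-string similarity and context-word overlap."""
    string_sim = SequenceMatcher(None, mention.lower(),
                                 candidate["label"].lower()).ratio()
    overlap = (len(article_words & candidate["context"])
               / max(len(candidate["context"]), 1))
    # Hand-picked weights stand in for the trained classifier.
    return 0.5 * string_sim + 0.5 * overlap

# Toy candidates for the mention "Einstein".
candidates = [
    {"label": "Albert Einstein", "context": {"physics", "relativity", "nobel"}},
    {"label": "Alfred Einstein", "context": {"music", "musicology", "mozart"}},
]
article_words = {"theory", "relativity", "physics", "lecture"}
best = max(candidates,
           key=lambda c: score_candidate("Einstein", article_words, c))
```

With an article about a physics lecture, the context overlap pushes the score toward Albert rather than Alfred.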
Enrichment types
• Newspaper articles and radio bulletins linked to Polygoon newsreels
• Named entities linked to DBpedia (and VIAF, Wikidata etc.)
• Place-street combinations in newspaper articles linked to latitude and longitude
• Newspaper articles linked to images from Memory of the Netherlands
[Flattened overview diagram]
• Linked NEs: DBpedia, Wikidata, VIAF, Geonames, etc.
• Geodata: street/place combinations with latitude and longitude (via DBpedia and Wikidata)
• Links: web pages, video, images, sound
• Extracted features: classification, sentiment, relevance, interestingness
• User annotation: tags, stories
• Image enrichment: face recognition, emotion detection
(Now available)
Steps in machine learning
1. Matching Polygoon newsreels to articles on the basis of features like named-entity matching, string matching, date matching etc., using linear classification
2. Linking named entities in news articles to DBpedia titles using linear classification with an SVM
3. Classification using a neural network
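As a toy stand-in for the linear classification in steps 1 and 2, here is a minimal perceptron trained on invented match/no-match feature vectors (named-entity overlap, string similarity, date proximity). The real system uses an SVM; this sketch only illustrates learning a linear decision boundary:

```python
def train_perceptron(samples, epochs=20, lr=0.1):
    """Train a tiny linear classifier on (features, label) pairs.
    Labels are 1 (match) or 0 (no match)."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Invented feature vectors: (NE overlap, string similarity, date proximity)
samples = [
    ([0.9, 0.8, 1.0], 1), ([0.7, 0.9, 0.9], 1),   # matches
    ([0.1, 0.2, 0.1], 0), ([0.0, 0.3, 0.2], 0),   # non-matches
]
w, b = train_perceptron(samples)
```

A newsreel-article pair with high feature values then classifies as a match, one with low values as a non-match.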
Machine learning for matching newspaper articles and Polygoon news reels
Matching newspaper articles by means of title, description and date of Polygoon videos
Matching by means of different features
[Figure: article-newsreel pairs plotted by feature values, labelled “match” vs. “no match”]
3-D feature space
Machine learning for entity linking
Named Entity Linking
[Flattened pipeline diagram] Process article → named entity recognition → search each entity in the DBpedia Solr index → get a list of candidates (e.g. all the Einsteins) → find the best candidate (enrichment and training) → store article id + resource ids (DBpedia, VIAF, Wikidata, etc.) in the enrichment database.
Index and use of resource identifiers
[Flattened diagram] The newspaper index contains the article text plus VIAF ids, Wikidata ids, etc. During indexing, the text for article X and the enrichments for article X are fetched from the enrichment database. At search time, a semantic query first retrieves matching Wikidata ids from Wikidata and then searches the newspaper index with those ids.
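Searching the index by Wikidata id can be sketched with a small inverted index. The article ids are invented; Q937 (Albert Einstein) and Q42 (Douglas Adams) are real Wikidata ids:

```python
from collections import defaultdict

# Illustrative enrichment records: article id -> Wikidata ids in its text.
enrichments = {
    "art-1": ["Q937"],          # Q937 = Albert Einstein
    "art-2": ["Q937", "Q42"],   # Q42 = Douglas Adams
    "art-3": ["Q42"],
}

# Build an inverted index from Wikidata id to article ids.
inverted = defaultdict(set)
for article, qids in enrichments.items():
    for qid in qids:
        inverted[qid].add(article)

def search_by_ids(qids):
    """Semantic search: union of articles linked to any of the given ids."""
    hits = set()
    for qid in qids:
        hits |= inverted.get(qid, set())
    return sorted(hits)
```

A semantic query that resolves to a set of Wikidata ids then becomes a plain lookup against this index.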
Timeline for enriching the newspapers
[Figure: quality/confidence (0 to 100%) plotted against article number (1 to 108 million)]
Four phases:
• All DBpedia titles searched in news articles
• Named entities searched in DBpedia
• Speed-up by the processing capacity of SURFsara
• Using context and machine learning
                                  accuracy   link recall   link precision   link F-measure
conventional                        .76          .76            .65              .70
svm                                 .85          .76            .84              .80
svm (balanced)                      .83          .81            .76              .79
neural network                      .83          .75            .84              .79
features, features and features      ?            ?              ?                ?
crowdsourcing                        ?            ?              ?                ?
From conventional entity linking to deep learning and beyond
How to present enrichments in Delpher, the main portal to books, newspapers and serials?
• Links to Wikipedia?
• Adding images from Wikipedia to the text?
• Showing an abstract from Wikipedia on mouse-over?
• Letting the user decide?
How to present enrichments in Delpher?
• Links to Wikipedia?
• Adding images from Wikipedia to the text?
• Showing an abstract from Wikipedia on mouse-over?
• Letting the user decide?
For the time being we use a research portal (xportal) to show enriched search, and a browser extension to add enriched information to Delpher.
1. Identification 2. Context 3. Co-indexing 4. Semantic search
[member of The Beatles]
Hiding SPARQL from end users: the term between square brackets is expanded in several ways by querying Wikidata via SPARQL.
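The expansion step can be sketched as building a SPARQL query for the Wikidata endpoint. This assumes the bracketed phrase has already been mapped to Wikidata ids, which is the hard part in practice; P463 ("member of") and Q1299 (The Beatles) are the real ids for this example:

```python
def expand_bracketed_term(prop_id, value_id):
    """Build a SPARQL query resolving a bracketed term such as
    [member of The Beatles] into Wikidata item ids.
    prop_id/value_id, e.g. P463 ('member of') and Q1299 (The Beatles),
    are assumed to be resolved from the phrase beforehand."""
    return ("SELECT ?item WHERE { "
            f"?item wdt:{prop_id} wd:{value_id} . "
            "}")

query = expand_bracketed_term("P463", "Q1299")
```

Running this query against the Wikidata Query Service would return the members' Q-ids, which are then fed into the newspaper search index; the user never sees the SPARQL.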
“Heel Holland verrijkt” (“All of Holland enriches”), starting at the KB!
To improve the automatically generated enrichments and add new enrichments we need user feedback. This feedback can also be used for additional training of our disambiguation software.
Next steps
• Improving accuracy by changing from linear classification to a neural network
• Crowdsourcing by KB employees before broadening the audience
• Use of non-Wikidata identifiers when a resource is not in Wikidata
• Combining Solr with RDF and SPARQL to remove the limitation on the number of Wikidata identifiers in a Solr query
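The clause-limit problem in the last bullet can be sketched by batching a long id list into several Boolean-OR queries. The field name `wikidata_id` is hypothetical; Solr's default `maxBooleanClauses` limit is 1024 (the tiny `max_per_query` here is only for illustration):

```python
def batch_id_clauses(qids, max_per_query=3):
    """Split a long list of Wikidata ids into several Boolean-OR query
    clauses so that no single query exceeds the engine's clause limit."""
    batches = [qids[i:i + max_per_query]
               for i in range(0, len(qids), max_per_query)]
    # 'wikidata_id' is an assumed index field name, not the KB's actual schema.
    return ["wikidata_id:(" + " OR ".join(b) + ")" for b in batches]

clauses = batch_id_clauses(["Q1", "Q2", "Q3", "Q4", "Q5"])
```

The result sets of the batched queries are then merged, at the cost of extra round trips; moving the expansion into a combined Solr/RDF setup would avoid the batching altogether.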
The higher goal
Our software should have read and analyzed our content completely!
Any questions?