Using Wikidata properties to improve search in Dutch historical newspapers Theo van Veen, SEA, 18-11-2016
Content enrichment: purpose and approach
• making content better findable and usable, especially newspapers
• by enriching text or parts of text and names in the text with a.o. links to related information
• this related information is in most cases linked data (Wikipedia, Polygoon news reels)
• linked data is used to improve usability of content by adding related information to the presentation
• linked data is used as a means to improve disclosure of content by adding related information to the search index
• But … we want to hide SPARQL from the user
How will access and usability be improved?
1. Because “things” are identified we can make a better distinction between things (thesaurus function)
2. Because the identifiers are links to resource descriptions it is possible to present the content with context information about “things” in the content
3. Relevant context information can be indexed as part of a “thing” so it can be used for searching
4. By enriching the content with the identification of “things” semantic search is enabled using properties in external descriptions
1. Identification 2. Context 3. Indexing 4. Semantic search
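The four steps above can be sketched in a few lines of Python. Everything here is illustrative toy data, except Q937, which really is the Wikidata id for Albert Einstein:

```python
# 1. Identification: a recognized name is mapped to an identifier.
entities = {"Einstein": "Q937"}  # Q937 = Albert Einstein in Wikidata

# 2. Context: the identifier resolves to a description with properties.
context = {"Q937": {"label": "Albert Einstein",
                    "occupation": "physicist",
                    "birthplace": "Ulm"}}

# 3. Indexing: context properties are co-indexed with the article text.
index_entry = {"article": "article-1",
               "text": "Einstein bezoekt Leiden",
               "wikidata_ids": ["Q937"],
               "context_terms": ["physicist", "Ulm"]}

# 4. Semantic search: a property-based query matches via the context
#    terms even when the term never occurs in the article text itself.
def matches(entry, term):
    return term in entry["context_terms"] or term in entry["text"]

hit = matches(index_entry, "physicist")  # found via co-indexed context
```

The point of step 4 is that a search for "physicist" finds an article that only contains the word "Einstein".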
Additional motivation
• Libraries are more and more part of the outside world. Improving disclosure and usability requires intelligently connecting content with the outside world.
• Content contains “knowledge” that cannot easily be found by means of conventional search. This requires intelligent preprocessing.
• This knowledge should not have to be searched for first, but should be offered on request after alerting the user.
• Our software should have read and analyzed our content integrally before the user does!
How to identify names in text?
• By recognizing names (named entity recognition)
• Those names then have to be identified. How? By searching for them in Wikipedia/DBpedia and subsequently linking them to the Wikipedia/DBpedia descriptions
• But … those names are ambiguous: does Einstein link to Albert Einstein or to Alfred Einstein?
• So … we have to create software for improving the accuracy of links. Conventional “if then else” software isn’t fit for this job: we need machine learning techniques
• But … many false links and missing links still remain, and DBpedia does not contain everything
• So … we need user feedback for correction, for adding links for unrecognized names, and for additional training of the software
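The disambiguation step can be sketched as a simple candidate scorer. This is not the presentation's actual model: the hand-picked weights and the two features (surface-string similarity plus context-word overlap) merely stand in for the trained classifier:

```python
from difflib import SequenceMatcher

def score_candidate(mention, article_words, candidate):
    """Score one DBpedia/Wikidata candidate for an ambiguous mention.
    Features: surface-string similarity and context-word overlap."""
    string_sim = SequenceMatcher(None, mention.lower(),
                                 candidate["label"].lower()).ratio()
    overlap = (len(article_words & candidate["context"])
               / max(len(candidate["context"]), 1))
    # Hand-picked weights stand in for the trained classifier.
    return 0.5 * string_sim + 0.5 * overlap

# Toy candidates for the mention "Einstein".
candidates = [
    {"label": "Albert Einstein", "context": {"physics", "relativity", "nobel"}},
    {"label": "Alfred Einstein", "context": {"music", "musicology", "mozart"}},
]
article_words = {"theory", "relativity", "physics", "lecture"}
best = max(candidates,
           key=lambda c: score_candidate("Einstein", article_words, c))
```

With an article about a physics lecture, the context overlap pushes the score toward Albert rather than Alfred.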
Enrichment types
• Newspaper articles and radio bulletins linked to Polygoon newsreels
• Named entities linked to DBpedia (and VIAF, Wikidata etc.)
• Place-street combinations in newspaper articles linked to latitude and longitude
• Newspaper articles linked to images from Memory of the Netherlands
[Flattened overview diagram]
• Linked NEs: DBpedia, Wikidata, VIAF, Geonames, etc.
• Geodata: street/place combinations with latitude and longitude (via DBpedia and Wikidata)
• Links: web pages, video, images, sound
• Extracted features: classification, sentiment, relevance, interestingness
• User annotation: tags, stories
• Image enrichment: face recognition, emotion detection
(Now available)
Steps in machine learning
1. Matching Polygoon newsreels to articles on the basis of features like named-entity matching, string matching, date matching etc., using linear classification
2. Linking named entities in news articles to DBpedia titles using linear classification with an SVM
3. Classification using a neural network
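As a toy stand-in for the linear classification in steps 1 and 2, here is a minimal perceptron trained on invented match/no-match feature vectors (named-entity overlap, string similarity, date proximity). The real system uses an SVM; this sketch only illustrates learning a linear decision boundary:

```python
def train_perceptron(samples, epochs=20, lr=0.1):
    """Train a tiny linear classifier on (features, label) pairs.
    Labels are 1 (match) or 0 (no match)."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Invented feature vectors: (NE overlap, string similarity, date proximity)
samples = [
    ([0.9, 0.8, 1.0], 1), ([0.7, 0.9, 0.9], 1),   # matches
    ([0.1, 0.2, 0.1], 0), ([0.0, 0.3, 0.2], 0),   # non-matches
]
w, b = train_perceptron(samples)
```

A newsreel-article pair with high feature values then classifies as a match, one with low values as a non-match.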
Machine learning for matching newspaper articles and Polygoon news reels
Matching newspaper articles by means of title, description and date of Polygoon videos
Matching by means of different features
[Figure: article-newsreel pairs plotted by feature values, labelled “match” vs. “no match”]
3-D feature space
Machine learning for entity linking
Named Entity Linking
[Flattened pipeline diagram] Process article → named entity recognition → search each entity in the DBpedia Solr index → get a list of candidates (e.g. all the Einsteins) → find the best candidate (enrichment and training) → store article id + resource ids (DBpedia, VIAF, Wikidata, etc.) in the enrichment database.
Index and use of resource identifiers
[Flattened diagram] The newspaper index contains the article text plus VIAF ids, Wikidata ids, etc. During indexing, the text for article X and the enrichments for article X are fetched from the enrichment database. At search time, a semantic query first retrieves matching Wikidata ids from Wikidata and then searches the newspaper index with those ids.
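Searching the index by Wikidata id can be sketched with a small inverted index. The article ids are invented; Q937 (Albert Einstein) and Q42 (Douglas Adams) are real Wikidata ids:

```python
from collections import defaultdict

# Illustrative enrichment records: article id -> Wikidata ids in its text.
enrichments = {
    "art-1": ["Q937"],          # Q937 = Albert Einstein
    "art-2": ["Q937", "Q42"],   # Q42 = Douglas Adams
    "art-3": ["Q42"],
}

# Build an inverted index from Wikidata id to article ids.
inverted = defaultdict(set)
for article, qids in enrichments.items():
    for qid in qids:
        inverted[qid].add(article)

def search_by_ids(qids):
    """Semantic search: union of articles linked to any of the given ids."""
    hits = set()
    for qid in qids:
        hits |= inverted.get(qid, set())
    return sorted(hits)
```

A semantic query that resolves to a set of Wikidata ids then becomes a plain lookup against this index.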
Timeline for enriching the newspapers
[Figure: quality/confidence (0 to 100%) plotted against article number (1 to 108 million)]
Four phases:
• All DBpedia titles searched in news articles
• Named entities searched in DBpedia
• Speed-up by the processing capacity of SURFsara
• Using context and machine learning
                                  accuracy   link recall   link precision   link F-measure
conventional                        .76          .76            .65              .70
svm                                 .85          .76            .84              .80
svm (balanced)                      .83          .81            .76              .79
neural network                      .83          .75            .84              .79
features, features and features      ?            ?              ?                ?
crowdsourcing                        ?            ?              ?                ?
From conventional entity linking to deep learning and beyond
How to present enrichments in Delpher, the main portal to books, newspapers and serials?
• Links to Wikipedia?
• Adding images from Wikipedia to the text?
• Showing an abstract from Wikipedia on mouse-over?
• Letting the user decide?
How to present enrichments in Delpher?
• Links to Wikipedia?
• Adding images from Wikipedia to the text?
• Showing an abstract from Wikipedia on mouse-over?
• Letting the user decide?
For the time being we use a research portal (xportal) to show enriched search, and a browser extension to add enriched information to Delpher.
1. Identification 2. Context 3. Co-indexing 4. Semantic search
[member of The Beatles]
Hiding SPARQL from end users: the term between square brackets is expanded in several ways by querying Wikidata via SPARQL.
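The expansion step can be sketched as building a SPARQL query for the Wikidata endpoint. This assumes the bracketed phrase has already been mapped to Wikidata ids, which is the hard part in practice; P463 ("member of") and Q1299 (The Beatles) are the real ids for this example:

```python
def expand_bracketed_term(prop_id, value_id):
    """Build a SPARQL query resolving a bracketed term such as
    [member of The Beatles] into Wikidata item ids.
    prop_id/value_id, e.g. P463 ('member of') and Q1299 (The Beatles),
    are assumed to be resolved from the phrase beforehand."""
    return ("SELECT ?item WHERE { "
            f"?item wdt:{prop_id} wd:{value_id} . "
            "}")

query = expand_bracketed_term("P463", "Q1299")
```

Running this query against the Wikidata Query Service would return the members' Q-ids, which are then fed into the newspaper search index; the user never sees the SPARQL.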
“Heel Holland verrijkt” (“All of Holland enriches”), starting at the KB!
To improve the automatically generated enrichments and add new enrichments we need user feedback. This feedback can also be used for additional training of our disambiguation software.
Next steps
• Improving accuracy by changing from linear classification to a neural network
• Crowdsourcing by KB employees before broadening the audience
• Use of non-Wikidata identifiers when a resource is not in Wikidata
• Combining Solr with RDF and SPARQL to remove the limitation on the number of Wikidata identifiers in a Solr query
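The clause-limit problem in the last bullet can be sketched by batching a long id list into several Boolean-OR queries. The field name `wikidata_id` is hypothetical; Solr's default `maxBooleanClauses` limit is 1024 (the tiny `max_per_query` here is only for illustration):

```python
def batch_id_clauses(qids, max_per_query=3):
    """Split a long list of Wikidata ids into several Boolean-OR query
    clauses so that no single query exceeds the engine's clause limit."""
    batches = [qids[i:i + max_per_query]
               for i in range(0, len(qids), max_per_query)]
    # 'wikidata_id' is an assumed index field name, not the KB's actual schema.
    return ["wikidata_id:(" + " OR ".join(b) + ")" for b in batches]

clauses = batch_id_clauses(["Q1", "Q2", "Q3", "Q4", "Q5"])
```

The result sets of the batched queries are then merged, at the cost of extra round trips; moving the expansion into a combined Solr/RDF setup would avoid the batching altogether.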
The higher goal
Our software should have read and analyzed our content completely!
Any questions?