

The benefits of linking Thesauri for internal users
A Netherlands Institute for Sound & Vision case study

TIM DE BRUYN, Vrije Universiteit

ABSTRACT

In this research we look at the Netherlands Institute for Sound & Vision (NISV) as a case study to find out what the advantages and possibilities of linking its thesaurus to an online knowledge base are. NISV manages an audio-visual collection concerned with Dutch media cultural heritage. They have chosen Wikidata as the external dataset with which to enrich their collection, through links with an internal vocabulary, the GTAA. This research is part of a broader study into the advantages of using linked data at NISV. This part focuses on the internal users, while the other part focuses on the external users. We look into the existing data contained within both the thesaurus of NISV and Wikidata. A statistical analysis is conducted to answer questions on the completeness and richness of the data. In addition, several internal users from different departments of NISV are interviewed in an effort to extract use cases for an enhanced collection. Using the interviews, use cases are set up and (conceptual) prototypes are built to satisfy them. The (conceptual) prototypes are then evaluated for their usefulness and added functionality. A comparison between the internal and external users and their different wishes is drawn. This research shows the differences and similarities between the existing thesaurus and Wikidata. It brings forth three use cases and lays down a framework for future work in similar case studies.

Additional Key Words and Phrases: Linked Data, Wikidata, Digital thesauri, Cultural heritage

1 INTRODUCTION

NISV is an audiovisual archive located in Hilversum. It concerns itself with the Dutch audiovisual heritage of national documentaries, movies, music, radio and television programs. In partnership with other Dutch organizations that concern themselves with this heritage, it developed the Communal Thesaurus for Audiovisual Archives (GTAA). This thesaurus allows NISV to accurately characterize audio-visual material. The thesaurus is to be enriched with data from the Wikidata dataset.

Wikidata is a free, collaborative, multilingual, secondary database, collecting structured data to provide support for Wikipedia, Wikimedia Commons, the other wikis of the Wikimedia movement, and to anyone in the world¹. This data from Wikidata is to be manually linked to the GTAA. Wikidata contains more information per subject than the GTAA and covers persons, objects, geographical data, concepts, company names, etc. Since the data from Wikidata is richer than the data currently contained in the GTAA, linking the two datasets would allow NISV to set up more extensive use cases. NISV's goal is to enrich its thesaurus for internal use as well as for external audiences. To know what data is important for its own use cases, an analysis needs to be carried out. The data currently contained within the GTAA has to be analyzed to find out what it is lacking. After that we can see whether this data can be supplied by Wikidata. In this research, conducted with co-researcher John Brooks, we aim to answer questions on the subject of alignments between the current GTAA and Wikidata. Whereas my co-researcher focuses on use cases for external audiences, this research focuses on the internal users and their use cases. With the help of interviews with the internal users, concrete use cases can be set up. These use cases, in turn, lead to prototypes being built to satisfy them. This research brings insights on the analyzed data and brings forth concrete use cases and prototypes. The problem of what to do with the extra information after linking a thesaurus is not exclusive to NISV. Looking outside the frame of this particular case study, the setting up of use cases also provides a framework for future research in similar case studies. With its unique combination of a data driven and user driven approach, this research provides in-depth insight into how use cases for an enriched thesaurus are derived.

¹ https://www.wikidata.org/wiki/Wikidata:Main_Page

2 RELATED WORK

Previous research has been conducted on Wikidata and its use for enhancing existing thesauri. Research has also been conducted on the quality of the data contained in Wikidata. In research describing what Wikidata is and what it entails, Vrandecic & Krotzsch (2014)[1] state that Wikidata's goal is to allow its data to be used both internally and in external applications, such as the external thesaurus GTAA. They describe it as a community controlled database. Erxleben et al. (2014)[2] conducted research on Wikidata and its possibilities for being connected to the linked data web. They describe how Wikidata is already linked to multiple external datasets. These existing links range from the ISSN dataset, which identifies journals, magazines and similar publications, to more highly specialized databases, such as the database of North Atlantic hurricanes. Farber et al. (2016)[3] looked into the quality of the most noteworthy large knowledge databases, which they describe as knowledge graphs. For the knowledge graph this research is concerned with, namely Wikidata, their extensive research shows that multiple quality aspects of its data were the highest of any of the noteworthy knowledge graphs. Examples of quality aspects relate to accuracy, trustworthiness, completeness, interoperability, etc. This indicates that the data contained within Wikidata is indeed the most suitable data for enriching the GTAA. Thornton et al. (2017)[4] also conducted research on Wikidata's possible role as a repository for international organizations concerned with digital preservation. They concluded that Wikidata had the advantage of being structured, queryable and computable. Another advantage of Wikidata is that it is multilingual. Since Wikidata also has a Dutch version, it better supports alignments to the GTAA.

Debevere et al. (2011)[5] conducted research on linking thesauri via linked data in order to improve the metadata of thesauri. This improved metadata in turn allows better retrieval of information for search queries. They developed an algorithm that automatically linked categories of DBpedia to the keyword thesaurus used to annotate archived media items at the Flemish public service broadcaster (VRT). It returned acceptable results, as 91.42% of the category "Person" and 89.94% of the category "Location" were correctly linked. Tordai et al. (2007)[6] tried to provide a systematic approach to building a large semantic culture web. They did so by making clear to heritage institutions what they need to do to make their collections fit for becoming part of this semantic culture web. In short, they advise using the paradigm of open access, open data and open standards. They conclude that collection owners should be provided with the necessary support facilities. In a case study similar to this one, Kobilarov et al. (2009)[7] looked into the use of linked data at the BBC. They linked two separate platforms of the BBC to each other via DBpedia. They concluded that the extra links and data available to the end users benefited them. They did, however, agree that much more progress in the linking engine, and in DBpedia in particular, was needed to achieve full optimization. Caracciolo et al. (2011)[8] performed a case study on the use of linked data for the AGROVOC multilingual thesaurus maintained by the Food and Agriculture Organization of the United Nations. Their research was more explorative, as in the end they only advised on the changes that needed to be made to the current thesaurus for it to become accessible to a linked data web. De León et al. (2011)[9] performed a similar case study. They, however, focused on Spanish geographical data and how to rewrite it so that it benefits from the linked open data principles. Lastly, Neubert (2009)[10] also performed a case study similar to the one in this research. He describes the linking of a thesaurus for economics to DBpedia and other large thesauri using SKOS/RDF methods. In the end the linking to DBpedia returned a high number of unsuccessful matches. These unsuccessful matches are attributed to simple derivations in the compared strings and to the fact that a significant number of economic concepts did not yet exist in DBpedia. He also describes a method for checking for inconsistencies between linked datasets using SPARQL. This method could also be useful for this research.

What we learn from this related work is that most studies use a data driven approach. We also use this approach in this research, but combine it with a user driven approach, as we think this could better lead to relevant use cases. Most studies conclude with a statement on the usefulness of linked data, and of Wikidata in particular. This is a conclusion this research agrees with and tries to back up with statistics and use cases.

3 RESEARCH QUESTIONS

Based on the motivation in the introduction and the related work, we investigate the following research questions.

• What use cases can be derived from an enriched GTAA?

To best answer this question, research is conducted into the data currently contained within the GTAA. This provides a clear overview of what data has to be enriched to set up the use cases. After the use cases are set up, it can be determined what alignments between the datasets have to be made to satisfy them.

• How do the wishes of the internal and the external audiences differ?

As stated before, this research focuses on the wishes of the internal users at NISV, whereas John Brooks focuses on the wishes of the external audience. For both of us, the last research question is a combined effort. We used each other's data to be able to answer the respective research questions.

3.1 Methods

This research answers its research questions using the following methods.

3.1.1 Analysis of the data within internal sources, external sources and on the richness of the theoretical link.

The data currently contained within the GTAA is analyzed to see exactly what data it contains. This also allows us to see what data it lacks. The analysis is performed using existing tools currently available at NISV and via Wikidata. The data currently contained within the external sources is also analyzed. Together with the findings on the data lacking in the GTAA, we check how suitable the data from the external sources is for linking to the GTAA. The data that is lacking in the current GTAA is theoretically added from external sources. This allows us to create statistics on the richness of the theoretical new dataset. These analyses allow us to create a clear overview of the data in both datasets.

3.1.2 Interviews with the internal users and setting up use cases.

After the analysis of the data of both internal and external sources, interviews with the internal users are conducted. These interviews are conducted with at least five people from relevant departments. The interviews are divided into two parts. In the first part, all gathered statistics on the datasets are shown. The second part contains open questions about possible use cases. The goal of placing the open questions at the end of the interview is to spark new ideas by first walking through the possibilities of the data in the external sources. Combining the data from the analysis with the data gathered in the interviews, use cases can be set up. With the use cases, prototypes can be set up which in turn satisfy the users' needs. This allows us to answer the first research question.

3.1.3 Patterns in the use cases.

After research has been carried out on the internal audiences' wishes, the patterns found in the data satisfying these use cases are analyzed. The goal of this analysis is to create a framework of conceptual queries that could be used in future alignments between other similar datasets. It also aids future research on the subject.

3.1.4 Exploration and working out use cases in the form that best suits them.

Depending on the use cases retrieved earlier, the best way to enable each use case is worked out. We determine whether it is possible to translate the use case into a working prototype or whether it is best worked out in a conceptual manner.

3.1.5 Evaluation of the use cases.

After the use cases are set up, two evaluation rounds take place. The goal of the first evaluation round is to determine if the worked out use cases are what the internal users envisioned. After the first evaluation round the users can also express their desired changes to the worked out use cases. The second evaluation round serves as a medium to express the perceived usefulness of the finalized use cases.

3.1.6 Comparison of differences between the wishes of internal and external audiences.

In order to answer the last research question, the findings of this research are compared to those of my co-researcher John Brooks (2018). As stated before, my co-researcher focused on the external audiences while this research is concerned with the internal users. Analysis of both parts of the spectrum allows us to draw conclusions on the differences in the wishes between the two audiences.

4 ANALYSIS OF THE DATA

4.1 Analysis setup

For the analysis of the data currently contained within both the GTAA and Wikidata itself we used the Wikidata SPARQL endpoint tool². We set up the statistical analysis with three goals in mind. First, we wanted to create an overview of the data currently contained within the GTAA. Second, we wanted a more global overview of the data contained within Wikidata. Lastly, we draw a comparison between the different datasets to see if the data contained within the GTAA is in line with the global data.

The goal of this analysis is to see if the data that is already linked is useful for NISV. The total number of persons in the GTAA that are to be linked with Wikidata is 123,152. For the generated statistics on the data contained within the GTAA we looked at a subset of 10,350 persons. This subset was chosen because it consists of the persons that were already linked. Our subset of already linked data is therefore a sample of roughly 8.4% of the total. The minimum required sample size at a confidence level of 95% and a margin of error of 1% is 8,909 persons. The subset meets this requirement.
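As an illustration, the minimum sample size quoted above can be reproduced with a few lines of Python. The formula used for the reported figure is not stated; the sketch below assumes Cochran's formula with a finite population correction, z = 1.96 for the 95% confidence level and the most conservative proportion p = 0.5. The function name `required_sample_size` is introduced here for illustration only.

```python
def required_sample_size(population: int, z: float = 1.96,
                         margin: float = 0.01, p: float = 0.5) -> float:
    """Minimum sample size (assumed: Cochran's formula with finite population correction)."""
    # Sample size for an infinitely large population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Correct for the finite size of the actual population.
    return n0 / (1 + (n0 - 1) / population)


gtaa_persons = 123_152   # persons in the GTAA that are to be linked
linked_persons = 10_350  # persons that were already linked

needed = required_sample_size(gtaa_persons)
print(round(needed))             # rounds to the 8,909 persons reported above
print(linked_persons >= needed)  # the linked subset meets the requirement
```

Under these assumptions the result is a little above 8,909, so rounding up instead of rounding to the nearest integer would give one person more; the difference does not affect the conclusion.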

² https://query.wikidata.org/

The subset of persons and their properties was retrieved. Henceforth, this subset is called subset A. Not all properties of the retrieved persons were chosen for analysis. A selection of properties was made based on relevance for NISV. This selection was made in consultation with internal users at NISV, who, after discussing all the existing properties, considered these to be the most relevant. The conclusions of the statistical analysis are strictly bound to these properties, as other properties could return different statistics. The chosen properties are as follows:

• Name
• Gender
• Awards received
• Educated at
• Member of political party
• Nominated for
• Place of death
• Plays sport

To retrieve subset A and the selected relevant properties we used SPARQL queries to restrict all the persons contained within Wikidata to only those who already have a GTAA identifier as a property. Of those persons we retrieved their Wikidata ID, GTAA ID, name and the relevant properties in table format. This allowed us to look further into the subset.
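The queries themselves are not reproduced here. A minimal sketch of what such a query could look like is given below, built as a string in Python so that it can be pasted into the Wikidata SPARQL endpoint tool. The property identifiers (P1741 for the GTAA ID, P21, P166, P69, P102, P1411, P20 and P641 for the selected properties) are our assumption of the matching Wikidata properties and should be verified before use; the function name `build_subset_a_query` is hypothetical.

```python
# Assumed Wikidata property IDs; verify each against wikidata.org before use.
GTAA_ID = "P1741"  # assumed: "GTAA ID"
SELECTED_PROPERTIES = {
    "gender": "P21",
    "award": "P166",
    "educatedAt": "P69",
    "party": "P102",
    "nominatedFor": "P1411",
    "placeOfDeath": "P20",
    "sport": "P641",
}


def build_subset_a_query(limit=None) -> str:
    """Build a SPARQL query for humans that carry a GTAA identifier."""
    select_vars = " ".join(f"?{name}" for name in SELECTED_PROPERTIES)
    # Every selected property is optional, so persons lacking it are still returned.
    optionals = "\n".join(
        f"  OPTIONAL {{ ?person wdt:{pid} ?{name} . }}"
        for name, pid in SELECTED_PROPERTIES.items()
    )
    query = (
        f"SELECT ?person ?gtaaId ?personLabel {select_vars} WHERE {{\n"
        f"  ?person wdt:P31 wd:Q5 ;\n"
        f"          wdt:{GTAA_ID} ?gtaaId .\n"
        f"{optionals}\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en" . }\n'
        "}"
    )
    if limit is not None:
        query += f"\nLIMIT {limit}"
    return query


print(build_subset_a_query(limit=10))
```

Because a person can have several values for one property, a single query with many optional properties returns one row per combination of values; querying one property at a time keeps the counts easier to interpret.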

For the analysis of the data contained within Wikidata as a whole we looked at a subset of 100,000 persons. This subset was picked by random selection by the SPARQL endpoint tool. For this subset we retrieved the same properties, and we again look at the minimum required sample size at a confidence level of 95% and a margin of error of 1%. The total population in the most recent Wikidata data dump³ is 4,310,706. This means that our subset should contain at least 9,583 persons. The subset meets this requirement. The subset is named subset B.

We expected the outcome of our statistical analysis to show some relevant difference. This difference could show itself in nationality, place of birth and place of death. A difference in these properties was to be expected, as the GTAA is mainly concerned with Dutch persons while Wikidata as a whole is an internationally diverse set of persons.

³ https://denelezh.dicare.org/gender-gap.php


4.2 Statistics

For the statistics we focused on the twenty returned properties with the highest occurrence rate. The full statistics for subset A and subset B can be found on an online file sharing platform⁴.

For the gender distribution we cannot compare subset A to subset B. The reason is a limitation of the SPARQL endpoint tool: it could not generate statistics on the gender distribution in subset B, because the occurrence rate of the property "Gender" is too high. For this comparison we therefore once again look at the most recent data dump. This gives us the gender distribution for subset A in table 1 and for the whole population in table 2.

Table 1. Gender distribution for subset A

Gender   Count       Percentage   Rank
Male     3,060,702   81.77%       1
Female   681,697     18.21%       2
Other    517         0.01%        3
Total    3,742,916   100.00%

Table 2. Gender distribution of the whole population

Gender   Count       Percentage   Rank
Male     3,060,702   81.77%       1
Female   681,697     18.21%       2
Other    517         0.01%        3
Total    3,742,916   100.00%

Here we clearly see that subset A is very much in line with the whole population. Only a small difference of 3.3% in the share of male persons is observed.

When comparing the statistics of only subset A and B, several things immediately come forward. We see that subset A contains more entries referring to Dutch properties. This can best be seen in the statistics on "Awards received", "Educated at", "Member of political party" and "Place of death". We also see that subset A contains more media oriented properties. This can best be seen in the statistics on "Awards received". A noteworthy observation is that both subset A and subset B contain mainly media oriented properties for the statistics on "Nominated for".

⁴ https://figshare.com/articles/Statistics/6859595

In conclusion, we expected the data of the two datasets to differ in certain aspects. Our analysis confirmed this expectation. The GTAA is focused on Dutch persons who have a connection to the fields of television, movies, radio, theater and music. The global data features more fields than that. We did, however, learn that the gender distribution is very similar.

4.3 Richness of the data

In our statistical analysis we also looked into the richness of the data contained in Wikidata. More specifically, for subset A we looked at the properties in figure 1 and some other properties that proved more relevant for NISV. Using these properties, we produced the statistics shown in table 3.

In this table we see the occurrence rate both in exact values and as percentages of the total. In the same table we also see the count of multiple entries, which denote that a person has more than one value for a single property. For occupation or country of citizenship this is straightforward, as a person can easily have several, but for properties like date of birth and place of death it is less so. The explanation for multiple entries in these properties is that multiple values are given when it is uncertain which single value is correct. Each of these entries could be correct, but the one true entry remains uncertain.
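To make the three percentage columns of table 3 concrete, the short sketch below recomputes them for the "Birth name" row (839 occurrences, 12 multiple entries, 10,350 persons in subset A). The function name `coverage_columns` is ours and only serves as an illustration of how the columns relate to each other.

```python
def coverage_columns(occurrences: int, multiple_entries: int, total: int) -> dict:
    """Recompute the percentage columns of table 3 for one property."""
    return {
        # Share of all persons in the subset that have the property at all.
        "% of total": round(100 * occurrences / total, 2),
        # Share of persons with the property that have more than one value for it.
        "ME % of OC": round(100 * multiple_entries / occurrences, 2) if occurrences else 0.0,
        # Share of all persons in the subset that have more than one value.
        "ME % of total": round(100 * multiple_entries / total, 2),
    }


# "Birth name" row of table 3.
print(coverage_columns(occurrences=839, multiple_entries=12, total=10_350))
```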

These statistics on the richness of the data create a clear image of the usefulness of the data. For example, using the data to analyze different pseudonyms will probably not be very useful, as in this subset only 1.2% of the persons have a Pseudonym property.

All the produced statistics, both on the differences between the two subsets and on the richness of the data contained within subset A, are used in the interviews.

If we compare this to recent research by Klein et al. (2016)[11] we can determine the meaning of our results. As part of their research they looked into the property coverage of Wikidata over the years. Their results can be seen in table 4.

We see that the biggest differences lie in the properties "Date of birth", "Citizenship", "Place of birth" and "Occupation". The other properties are very similar in richness. What we can conclude from this is that subset A has a higher coverage of properties in terms of occurrence rate than the whole of Wikidata in 2016. This in turn tells us that the richness of our sample is rather high.

Table 3. Occurrences of properties aligned persons GTAA & Wikidata

Property                 Occurrences (OC)   % of total   Multiple entries (ME)   ME % of OC   ME % of total
Name                     10,350             100.00%      0                       0.00%        0.00%
Gender                   10,294             99.46%       0                       0.00%        0.00%
Birth name               839                8.11%        12                      1.43%        0.12%
Pseudonym                124                1.20%        17                      13.71%       0.16%
Date of birth            10,040             97.00%       63                      0.63%        0.61%
Place of birth           7,390              71.40%       41                      0.55%        0.40%
Date of death            3,644              35.21%       38                      1.04%        0.37%
Place of death           2,776              26.82%       17                      0.61%        0.16%
Occupation               9,988              96.50%       5,581                   55.88%       53.92%
Country of citizenship   9,976              96.39%       493                     4.94%        4.76%

Table 4. Property coverage of Wikidata

Property                 17-09-2014   03-01-2016
Gender                   95.3%        96.5%
Date of birth            57.6%        71.7%
Date of death            28.6%        36.1%
Citizenship              42.8%        58.2%
Place of birth           24.0%        30.5%
Ethnic group             0.3%         0.6%
Field of work            n/a          0.3%
Occupation               n/a          58.7%
At least one site link   99.6%        98.1%

INTERVIEWS

A series of interviews with internal users from different departments of NISV is conducted. In total, five internal users are interviewed. The interviews are conducted either one-on-one or with two users from the same department simultaneously. Each interview lasts an hour. The goal of these interviews is to arrive at use cases for the enriched dataset. The interviewee is shown the extra data contained within Wikidata and the results of the earlier analysis. Afterwards, the interviewee is asked whether, based on this data, he or she sees any concrete use cases for when more data gets linked. The departments of the users interviewed are as follows: Intake, Information Management and Knowledge & Innovation.

In all the interviews the wish for a copyright expiration alert on the work of persons in the GTAA came forward. Even without looking at the produced statistics this proved to be a long-standing wish. For the departments of Intake and Information Management this also proved to be the only use case. For the department of Knowledge & Innovation two other use cases came forward. Both had to do with the online story platform. This platform was due to be completely overhauled and new innovations were sought to enrich it. Here the extra data provided by linking the GTAA to Wikidata proved useful. The first use case to sprout from the insight into the statistics took the form of a provider of extra information for a story. It became apparent that the enriched data could be used both to display extra information on persons mentioned in the story and to show any video content NISV might own of said person. Another use of the enriched data for the new story platform was as a metadata backbone for an automated relevant story system. The properties belonging to a person mentioned in a story would then be saved as metadata for that story. When two stories have enough matching metadata, each could show up as a relevant story at the bottom of the other's page.
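The relevant story idea was only discussed conceptually in the interviews. Purely as an illustration, the sketch below shows one way "enough matching metadata" could be decided: each story carries a set of metadata values taken from the persons it mentions, and another story counts as relevant when at least a given number of values is shared. The story identifiers, the metadata values, the threshold and the function name `related_stories` are all hypothetical.

```python
from typing import Dict, List, Set


def related_stories(stories: Dict[str, Set[str]], story_id: str,
                    min_shared: int = 2) -> List[str]:
    """Return other stories sharing at least `min_shared` metadata values."""
    target = stories[story_id]
    scored = [
        (len(target & metadata), other)
        for other, metadata in stories.items()
        if other != story_id
    ]
    # Most shared values first; ties are ordered by story identifier.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [other for shared, other in scored if shared >= min_shared]


# Hypothetical stories with metadata derived from the persons they mention.
stories = {
    "story-1": {"occupation:politician", "party:example-party", "place:Hilversum"},
    "story-2": {"occupation:politician", "party:example-party"},
    "story-3": {"occupation:singer", "place:Hilversum"},
}
print(related_stories(stories, "story-1"))
```

With these example values only "story-2" shares two values with "story-1"; lowering `min_shared` to 1 would also return "story-3".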


5 USE CASES

The use cases extracted from the interviews and earlier discussions are listed in table 5 below:

Table 5. Use cases

Name   Description
UC-1   Receiving an alert when the copyright on a person's work expires.
UC-2   Using Wikidata data to provide more information on a person when said person appears in an online story.
UC-3   Using Wikidata data as metadata to show "Stories you might also like".

5.1 UC-1

The GTAA mainly contains only a person's name and possibly an extra label with information to distinguish said person from other persons with the same name. Wikidata contains more data than that, including a person's date of death. When alignments are made between the GTAA and Wikidata, the date of death contained within Wikidata could be used to enrich the GTAA and enable us to get an alert on the next January, 70 years after a person's death. Analysis of subset A showed that 35.21% of the persons in this subset have a date of death. The full results of this analysis can be seen in table 3 shown earlier.
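The alert rule can be written down in a few lines. The sketch below is our reading of "the next January, 70 years after a person's death", namely 1 January of the year after the 70th anniversary of the year of death; it is a simplification and does not account for exceptions in copyright law. The function name `copyright_alert_date` is introduced for illustration.

```python
from datetime import date

COPYRIGHT_TERM_YEARS = 70  # term mentioned in the text above


def copyright_alert_date(date_of_death: date) -> date:
    """Return 1 January following the 70th anniversary of the year of death."""
    return date(date_of_death.year + COPYRIGHT_TERM_YEARS + 1, 1, 1)


# A person who died at any point in 1959 yields an alert on 1 January 2030.
print(copyright_alert_date(date(1959, 6, 15)))
```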

Using these statistics, we can determine that we should expect at most 3,644 people to be returned for the alert. However, we determined that further focus was needed to ensure that only relevant persons would appear in our alert. This focus had to be adjusted to the views of NISV, meaning that it should be on people who worked in the fields of television, movies, radio, theater and music. To ensure that alerts are received only for these kinds of people, statistics on all the different occupations in subset A were generated. A sample of the results is shown in table 6. The full list can be found on the online file sharing platform⁵. In total we found 798 different occupations. We took all the occupations that had an occurrence rate higher than 0.1%, as this provided a large enough sample size. Also, as people can have multiple occupations listed, persons with a relevant occupation whose occurrence rate is lower than 0.1% would mostly still be contained within our resulting set of persons.

⁵ https://figshare.com/articles/Statistics/6859595
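The selection step can be sketched as follows. The first four occupation counts and the total of 22,685 occupation entries are taken from table 6; the rare occupation, the persons and the hand-picked set of media related occupations are invented for the example, and both function names are hypothetical.

```python
def frequent_occupations(counts, total, threshold_pct=0.1):
    """Return occupations whose share of all occupation entries exceeds the threshold."""
    return {occ for occ, n in counts.items() if 100 * n / total > threshold_pct}


def persons_with_relevant_occupation(person_occupations, relevant):
    """Keep persons that list at least one relevant occupation."""
    return sorted(p for p, occs in person_occupations.items() if occs & relevant)


# Counts for the first four occupations come from table 6; the last one is invented.
counts = {"politician": 1722, "actor": 1206, "singer": 780, "pianist": 226,
          "lighthouse keeper": 1}
frequent = frequent_occupations(counts, total=22_685)

# Hand-picked media related occupations (a hypothetical selection).
media_related = {"actor", "singer", "pianist"} & frequent

# Hypothetical persons with their listed occupations.
people = {
    "person A": {"politician"},
    "person B": {"politician", "actor"},
    "person C": {"lighthouse keeper", "singer"},
}
print(persons_with_relevant_occupation(people, media_related))
```

Person C illustrates the point made above: a person with a rare occupation is still selected when one of the other listed occupations is frequent and relevant.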

Table 6. Occupation distribution subset A

Occupation                    Count    Percentage   Rank
politician                    1,722    7.59%        1
writer                        1,344    5.93%        2
actor                         1,206    5.32%        3
journalist                    976      4.30%        4
singer                        780      3.44%        5
television presenter          759      3.35%        6
association football player   631      2.78%        7
film actor                    560      2.47%        8
university teacher            538      2.37%        9
composer                      516      2.28%        10
screenwriter                  487      2.15%        11
presenter                     485      2.14%        12
painter                       433      1.91%        13
film director                 394      1.74%        14
television actor              339      1.49%        15
poet                          319      1.41%        16
diplomat                      312      1.38%        17
lawyer                        267      1.18%        18
sport cyclist                 229      1.01%        19
pianist                       226      1.00%        20
Other                         10,162   44.80%
Total                         22,685   100.00%

From this large list of occupations and their occurrence rates, we picked all the occupations related to the fields of television, movies, radio, theater and music. The set of persons with at least one of these relevant occupations was then retrieved. This returned a set of 1,626 persons who had a link to the GTAA, had a date of death and were relevant for NISV. The obtained set can now be used to receive alerts on copyright expiration. NISV uses Google's suite of applications for its daily activities, which means that the best option for setting up such an alert system is a platform like Google Calendar. The data was shaped to adhere to the import format of Google Calendar. After the reshaped data fit the format laid out by Google, we were able to generate the views presented in Figures 1 and 2. For this example we look at January 1st, 2030.
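Reshaping the data for import could look like the sketch below. It assumes the CSV import layout that Google Calendar documents (a header row with columns such as "Subject", "Start Date", "All Day Event" and "Description", and dates written as MM/DD/YYYY); the function name, the event title and the description text are hypothetical and do not reproduce the prototype.

```python
import csv
import io
from datetime import date


def to_calendar_csv(alerts: list[tuple[str, date]]) -> str:
    """Render (person name, alert date) pairs as CSV text for calendar import."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    # Header names are assumed to match Google Calendar's CSV import layout.
    writer.writerow(["Subject", "Start Date", "All Day Event", "Description"])
    for name, alert_date in alerts:
        writer.writerow([
            f"Copyright expiration: {name}",
            alert_date.strftime("%m/%d/%Y"),
            "True",
            "Date of death taken from Wikidata via the GTAA alignment",
        ])
    return buffer.getvalue()


print(to_calendar_csv([("Example Person", date(2030, 1, 1))]))
```

Writing one all-day event per person keeps the import simple; grouping all persons of one year into a single event would be an equally valid design.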

Fig. 1. Example UC-1

Fig. 2. Example UC-1

This use case was worked out as a functional prototype in Google Calendar.

5.2 UC-2

The second use case that came forward during the interviews was the ability to provide extra information on persons mentioned in the new story platform. A version of the story platform is already active, but the platform is still to be completely redesigned, and more resources will be allocated to it in the future. Using the data available in Wikidata makes it possible to quickly get a small overview of the persons mentioned in a story. Currently, the author of the story is already displayed on the right. With the data from Wikidata we could expand that view to also show the persons mentioned in the story at the bottom of the page.

When someone clicks on one of the persons at the bottom of the story, he or she is redirected to a separate webpage containing extra information on that person, just like the author page that already exists. It also shows other stories in which the person is mentioned. The difference between these two pages is that the information on the persons mentioned in the stories is fetched and generated from Wikidata. Wikidata already has an automated description generator that works with the properties of a person. An example of the output of this automated service for the randomly chosen entry on Pim Fortuyn (Q311697) is as follows: "Pim Fortuyn was a Kingdom of the Netherlands writer, columnist, teacher, politician, sociologist, and university teacher. He was born on February 19, 1948 in Driehuis. He studied at Vrije Universiteit, Mendelcollege, and University of Groningen. He worked for Maastricht University, for Erasmus University Rotterdam, and for University of Groningen. He died of ballistic trauma by Volkert van der Graaf on May 6, 2002 in Hilversum. He was buried at San Giorgio della Richinvelda. ; Dutch politician". For this example, another well-known Dutch person was chosen, as Willem-Alexander is not actually linked to the GTAA yet. An example of how the page with extra information on Willem-Alexander would look, using the automatically generated description of Pim Fortuyn (Q311697), can be seen in Figure 3.

This provides the reader with extra information on that person. A second use of this page could be to show existing relevant video footage about the person that is available to the public. This could be shown below the automatically generated description.

Fig. 3. Example UC-2

In conclusion, this retrieval of data from Wikidata can be used by the author to provide extra information on the persons mentioned in the story. The authors themselves can decide whether they feel it necessary to display this extra information, and for which persons in the story an extra page should be created. The choice to place this extra functionality at the bottom of the page is based on the assumption that the only readers interested in the extra information or video material on the persons mentioned are the ones who actually finish the whole story. It also causes less distraction at the start of the story. The reason this would likely not work as a fully automatic tool is that not all the persons mentioned in a story are equally relevant, either to the story or in general. This use case was worked out as a concept with illustrations to support it.

5.3 UC-3

A third use case for the data from Wikidata for the story platform is using the available data as a metadata backbone. When persons are mentioned in a story, their data available on Wikidata could be used and stored as metadata for the story. With this metadata one could generate an automated "Stories you might also like" section at the bottom of a story. Take as an example the story "Hoe Cronkite de kijk op de Vietnamoorlog veranderde" (footnote 6). Walter Cronkite, William Westmoreland, Lyndon B. Johnson and Creighton Abrams are mentioned in this story. All properties of these persons can be retrieved from Wikidata. All the properties of Walter Cronkite (Q31073) are listed in Table 7.

If we look at these properties, it can be concluded that not every property is useful as metadata. If a property like "Place of death" were used as metadata, it would not return relevant stories, because the location of one's death does not necessarily define a person. The "Place of death" property is not alone in this: "Birth name", "Given name", "Manner of death" and "Cause of death" are likewise less relevant. To tackle this problem, we would have to rate all possible properties on a scale from "Not relevant" to "Relevant". Secondly, each property is also bound to a category, as this can also help with determining relevance. An overview of all these properties is shown in Appendix A.

This leaves us with a significantly smaller list of properties. Applying these relevancy ratings to the example case of "Hoe Cronkite de kijk op de Vietnamoorlog veranderde", we see that the relevant properties for Walter Cronkite (Q31073) would be: sex or gender, country of citizenship, award received, date of birth, child, religion and date of death. This data could be linked to metadata generated from other stories to support the author of the story in selecting relevant stories. For this particular example of Walter Cronkite (Q31073) it would trigger on other stories about people with similar properties. This in itself would not guarantee a truly relevant story. For example, someone's country of citizenship is relevant, but sharing it does not necessarily make one story related to another. The best way to handle this is to determine a threshold for matching properties. If a threshold is set for the percentage of matching properties in the metadata of stories, we can get a more relevant view. The author of the story can then look at the automatically generated relevant story and determine whether this automatically generated match is viable as a related story.

6 https://www.beeldengeluid.nl/verhalen/hoe-cronkite-de-kijk-op-de-vietnamoorlog-veranderde

Table 7. All properties of Walter Cronkite (Q31073)

Property                              Value
Instance of                           Human
Date of birth                         4 November 1916
Cause of death                        Stroke
Sex or gender                         Male
Place of birth                        St. Joseph
Child                                 Kathy Cronkite
Country of citizenship                United States of America
Date of death                         17 July 2009
Languages written, spoken or signed   English
Birth name                            Walter Leland Cronkite, Jr. (English)
Place of death                        New York City
Occupation                            Journalist, News presenter, Blogger
Given name                            Walter
Manner of death                       Natural causes
Member of                             American Academy of Arts and Sciences, American Philosophical Society
Award received                        Presidential Medal of Freedom
Educated at                           University of Texas at Austin
Religion                              Episcopal Church
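The threshold idea can be made concrete with a small sketch. It is an illustration only, since this use case was worked out conceptually: the function names, the story identifiers "story-b" and "story-c" and the 0.5 threshold are hypothetical, and the share of matching properties is assumed to be measured against the properties of the story the author is working on.

```python
Metadata = set[tuple[str, str]]  # (property, value) pairs stored for a story


def match_share(story: Metadata, other: Metadata) -> float:
    """Fraction of the story's (property, value) pairs also found in `other`."""
    if not story:
        return 0.0
    return len(story & other) / len(story)


def suggest_related(story_id: str, stories: dict[str, Metadata],
                    threshold: float = 0.5) -> list[str]:
    """Candidate related stories whose match share reaches the threshold."""
    base = stories[story_id]
    scored = [(match_share(base, meta), other_id)
              for other_id, meta in stories.items() if other_id != story_id]
    return [other_id for share, other_id in sorted(scored, reverse=True)
            if share >= threshold]


stories = {
    "cronkite": {("sex or gender", "male"),
                 ("country of citizenship", "United States of America"),
                 ("award received", "Presidential Medal of Freedom"),
                 ("religion", "Episcopal Church")},
    "story-b": {("sex or gender", "male"),
                ("country of citizenship", "United States of America"),
                ("award received", "Presidential Medal of Freedom"),
                ("religion", "Catholic Church")},
    "story-c": {("sex or gender", "male"),
                ("country of citizenship", "Kingdom of the Netherlands")},
}
print(suggest_related("cronkite", stories))  # ['story-b']
```

The returned list is meant as a set of candidates for the author to review, in line with the text above, not as a final selection.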

A second part of this use case is that the categories into which the properties are divided could be linked to the user profile. If a person is classified in the category "History", the properties belonging to that category could be weighted higher than the others. This information could then be used to better generate relevant stories for persons interested in those categories.
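One way to picture this weighting is sketched below. It is a hypothetical illustration: the function name, the boost factor of 2 and the example values are assumptions, while the category assignments follow Appendix A.

```python
def weighted_share(story: set[tuple[str, str]],
                   other: set[tuple[str, str]],
                   categories: dict[str, set[str]],
                   interests: set[str],
                   boost: float = 2.0) -> float:
    """Match share in which properties from the user's categories count more."""

    def weight(prop: str) -> float:
        # A property is boosted when one of its categories interests the user.
        return boost if categories.get(prop, set()) & interests else 1.0

    total = sum(weight(prop) for prop, _ in story)
    if total == 0:
        return 0.0
    matched = sum(weight(prop) for prop, _ in story & other)
    return matched / total


categories = {
    "position held": {"Journalism", "History", "Organization"},
    "country of citizenship": {"Generic"},
    "religion": {"Generic"},
}
story = {("position held", "President of the United States"),
         ("country of citizenship", "United States of America"),
         ("religion", "Episcopal Church")}
other = {("position held", "President of the United States"),
         ("country of citizenship", "United States of America"),
         ("religion", "Catholic Church")}

print(weighted_share(story, other, categories, interests={"History"}))  # 0.75
```

Without any interests the same pair of stories scores 2/3; for a reader classified under "History" the shared "position held" value counts double, which raises the score to 0.75.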

In conclusion, this tool can be used by authors to give them insight into automatically generated relevant stories. The predictions would be too random to use as a completely automatic tool. With this tool the authors themselves can decide whether one of the generated relevant stories is one that they had overlooked. This use case was worked out as a conceptual tool.

5.4 Patterns in use cases

When looking at the way the use cases were set up, we see that the most important step is collecting all the possibly relevant data. This allowed us to give a clear insight into what the dataset actually contains. UC-2 and UC-3 were impossible to set up without first allowing internal users to get an overview of the dataset.

That being said, we delve deeper into what allowed us to set up these use cases. In essence, all the use cases start with a focus on a specific aspect of the properties. UC-1 came forth from a focus on the specific property "Date of death". UC-2 came forward when focusing on properties with a high occurrence rate. UC-3 was made possible by focusing on a very specific set of properties. The pattern we can derive from this is that one can focus on a specific set of properties to find a new use case. The differences between the use cases lie in which properties are targeted. The similarity is simple to spot: the extra data brings forth new functionality.

If we looked further into this process of specific property targeting, we would be able to come up with extra use cases. For example, if we once again focused on properties with a high occurrence rate, we could set up an extra use case in which persons are divided into certain categories. Those categories would be based on the properties with an exceptionally high occurrence rate.


6 EVALUATION

After the use cases were set up and tested, an evaluation took place. This evaluation consisted of two rounds. In the first round, internal users of the department of Knowledge & Innovation were shown the worked-out use cases in a document. They were asked whether the worked-out use cases were what they had envisioned during the interview stages. They were also asked whether they would make any improvements or changes to the use cases as stated in the documentation. After the proposed changes were made to the use cases, there was a second round of evaluation. The final document with the updated use cases was discussed in a meeting between internal users at the department of Knowledge & Innovation. After this meeting, informal conversations were conducted with the people involved. These conversations allowed us to gain insight into the projected usefulness of the use cases.

After the first round we found a unanimous proposed change for UC-1: to focus further on people with a media background. In the earlier version, all persons with a date of death were included in the prototype. According to the evaluation this caused a lot of unnecessary alerts. After receiving this review, changes were made so that only persons with a media background were included in the alert. There were also proposed changes to UC-2. The first was to include possible archive material on the extra information page for a person mentioned in a story. The second was to move the extra information button from the top right to the bottom of the page, as it could prove too distracting. UC-3 did not receive any proposed changes after the first round.

After the second round of evaluation, informal conversations with different internal users took place. The updated UC-1 was evaluated as a useful tool for keeping track of copyright expirations. Further alignments between the GTAA and Wikidata are still needed to tackle the problem of incompleteness of the data, but the foundation of the use case was well received. UC-2 and UC-3 were evaluated as interesting assets for the new story platform. Their actual impact and usefulness were hard for the internal users to determine, as both were still conceptual use cases.

7 DISCUSSION & FUTURE WORK

7.1 Discussion

This research also had some limitations. First of all, we were limited in our use of the SPARQL endpoint provided by Wikidata. For the statistical analysis, we used this endpoint to retrieve all our information from Wikidata. It had its limitations, as queries that were too complex or too large would result in the service timing out. Something that could have been done better in this research is SPARQL query optimization. Due to time pressure we were not able to devote more attention to this matter. Another solution to the time-out problem would be to load a Wikidata RDF dump in a local virtual environment. This, however, was also something we had to abandon because of time pressure. The second option in particular can prove useful for future research into a similar subject, as it would give the researchers full control over the dataset.
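One common way to keep each request small is to split a large query into pages. The sketch below only builds the query strings and sends nothing. It is not the query used in this research: the helper name and the page size are hypothetical, and it assumes that P1741 is the Wikidata property for the GTAA identifier and P570 the one for date of death.

```python
# Assumed properties: wdt:P1741 (GTAA ID) and wdt:P570 (date of death).
QUERY_TEMPLATE = """
SELECT ?person ?gtaa ?dateOfDeath WHERE {{
  ?person wdt:P1741 ?gtaa ;
          wdt:P570 ?dateOfDeath .
}}
ORDER BY ?person
LIMIT {limit} OFFSET {offset}
"""


def paged_queries(total: int, page_size: int = 500) -> list[str]:
    """Build one SPARQL query string per page of `page_size` results."""
    return [QUERY_TEMPLATE.format(limit=page_size, offset=offset).strip()
            for offset in range(0, total, page_size)]


queries = paged_queries(total=1200, page_size=500)
print(len(queries))  # 3
print(queries[2])
```

The ORDER BY clause makes the pages stable across requests, but sorting has a cost of its own, so whether paging actually avoids the time-outs described above would have to be tested against the endpoint.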

When performing similar case studies, one thing that should be considered is the combination of a data-driven approach and a user-driven approach. In this research the combination of these approaches returned good results in the form of use cases. It also brought forth a framework for such use cases in the form of an analysis of the patterns. Without the combination of both approaches the use cases could not have been obtained in their current form.

7.2 Future work

To ensure a better future for the use cases that were set up, the production of alignments between the GTAA and Wikidata needs to continue. As mentioned before, only 8.4% of the possible alignments have been made. Therefore, the use cases cannot as of now be used as a reliable tool for getting the needed information. This is entirely attributable to the incompleteness of the dataset. Further coordination between the volunteers of Wikidata and the internal users at NISV is needed.

As far as the broader part of this research is concerned, this use case study proved an excellent starting point for further research into the practical use of linked datasets. A lot of research has already been conducted on the different online knowledge bases (e.g. Wikidata, DBpedia) and their usefulness. However, this research sheds some new light on the usefulness and use cases when linking big datasets to those online knowledge bases. The method of starting with a data-driven approach and continuing with a user-driven approach showed how a mix of the two approaches can be used to effectively determine the use cases in similar case studies.

In this sense, this research hopes to spark future research into other specific use cases for linked data.

8 CONCLUSION

This research aimed to find answers to the questions of what use cases can be formed with an enriched GTAA and what the differences in wishes between internal and external users are. By looking deep into the data contained in both the GTAA and Wikidata, it found some meaningful differences but also similarities. What can be concluded from the statistical analysis is that subset A is representative of dataset B in terms of gender distribution and some other generic properties. The meaningful differences lie in the more specific properties like "Award received". Here it became apparent that subset A was much more media oriented than subset B.

This research also showed that giving users insight into all the raw data and its properties sparks new ideas about the usefulness of that data. In this case study this manifested itself in the form of use cases. We conclude that a focus on different properties within the data leads to completely different and independent use cases. Each use case that came forward in this research was connected to a specific group of properties. For UC-1 the focus lay on the "date of death" property. For UC-2 the focus lay on properties with a high occurrence rate. For UC-3 the focus lay on very specific properties that were handpicked for relevancy.

In the evaluation the usefulness of all three use cases came forward. UC-1 is as of now a working tool for the internal users, who can opt to use it to better manage copyright expiration cases. UC-2 and UC-3 are still conceptual use cases, but were evaluated to be good in concept. UC-3 is considered so interesting as a concept by internal users of the department of Knowledge & Innovation that a future project to quantify its usefulness is now being set up.

For an answer to the second research question we have to look at the group of external users. Those external users are defined as external researchers who make use of the facilities at NISV to further their own research. A notable difference between the internal and external users lies in their goals, tools and ways of working. The internal users look at the usefulness of an enriched GTAA for the development of their own projects, whereas the external users would use an enriched GTAA only to further their research. External users also use tools that internal users do not. They mainly use the CLARIAH Mediasuite. In this tool the GTAA is an optional component. Internal users use the GTAA in a different way and also use different tools to manage it. These tools (like SKOS) are not open to external researchers. There are differences in the wishes of the two groups of users. Internal users mainly want to use an enriched GTAA to set up further projects or create new tools, whereas the external users mainly want to use an enriched GTAA to receive as much extra information as possible and to add some functionality to their search behaviour. The similarities lie in the wish for extra data and thus extra functionality for certain aspects of their workload.


A APPENDIX A - ALL PROPERTIES WIKIDATA

Property                              Data type            Category                  Relevancy
sex or gender                         item                 Generic                   Relevant
date of birth                         Point in time        Generic                   Relevant
birthday                              item                 Generic                   Relevant
place of birth                        item                 Generic                   Relevant
birth name                            Monolingual text     N/A                       Not relevant
date of death                         Point in time        Generic                   Could be relevant
place of death                        item                 N/A                       Not relevant
cause of death                        item                 N/A                       Not relevant
manner of death                       item                 N/A                       Not relevant
killed by                             item                 N/A                       Not relevant
place of burial                       item                 N/A                       Not relevant
image of grave                        Commons media file   N/A                       Not relevant
name in native language               Monolingual text     N/A                       Not relevant
ancestral home                        item                 N/A                       Not relevant
member of                             item                 Generic                   Relevant
ethnic group                          item                 Generic                   Relevant
native language                       item                 Generic                   Relevant
country of citizenship                item                 Generic                   Relevant
educated at                           item                 Generic                   Relevant
occupation                            item                 Generic                   Relevant
field of work                         item                 Generic                   Relevant
notable work                          item                 Journalism, Music,        Could be relevant
                                                           Documentary, Film,
                                                           Television, Entertainment,
                                                           Webvideo, Games,
                                                           Art & culture and History
employer                              item                 Generic                   Could be relevant
award received                        item                 Generic                   Relevant
position held                         item                 Journalism, History and   Relevant
                                                           Organization
member of political party             item                 Journalism and History    Relevant
residence                             item                 Generic                   Could be relevant
official residence                    item                 Generic                   Could be relevant
religion                              item                 Generic                   Relevant
sexual orientation                    item                 Generic                   Could be relevant
coat of arms image                    Commons media file   N/A                       Not relevant
coat of arms                          item                 N/A                       Not relevant
signature                             Commons media file   N/A                       Not relevant
doctoral advisor                      item                 N/A                       Not relevant
doctoral student                      item                 N/A                       Not relevant
student of                            item                 N/A                       Not relevant
student                               item                 N/A                       Not relevant
canonization status                   item                 N/A                       Not relevant
voice type                            item                 N/A                       Not relevant
shooting handedness                   item                 N/A                       Not relevant
astronaut mission                     item                 N/A                       Not relevant
dan/kyu rank                          item                 N/A                       Not relevant
Eight Banner register                 item                 N/A                       Not relevant
family                                item                 Generic                   Relevant
noble title                           item                 N/A                       Not relevant
honorific prefix                      item                 N/A                       Not relevant
academic degree                       item                 Generic                   Relevant
handedness                            item                 N/A                       Not relevant
website account on                    item                 N/A                       Not relevant
honorific suffix                      item                 N/A                       Not relevant
family name                           item                 Generic                   Relevant
given name                            item                 N/A                       Not relevant
pseudonym                             String               N/A                       Not relevant
feast day                             item                 N/A                       Not relevant
audio recording of the subject's      Commons media file   N/A                       Not relevant
spoken voice
manager/director                      item                 Journalism, Documentary,  Relevant
                                                           Film, Television,
                                                           Entertainment, Webvideo,
                                                           History and media-use
filmography                           item                 Journalism, Documentary,  Relevant
                                                           Film, Television,
                                                           Entertainment, Webvideo,
                                                           History and media-use
instrument                            item                 Music and Radio           Relevant
eye color                             item                 N/A                       Not relevant
participant of                        item                 Sport and History         Relevant
convicted of                          item                 N/A                       Not relevant
languages spoken, written or signed   item                 Generic                   Could be relevant
affiliation                           item                 Organization              Relevant
has pet                               item                 N/A                       Not relevant
Commons Creator page                  String               N/A                       Not relevant
Project Gutenberg author ID           External identifier  N/A                       Not relevant
father                                item                 Generic                   Relevant
mother                                item                 Generic                   Relevant
sibling                               item                 Generic                   Relevant
spouse                                item                 Generic                   Relevant
partner                               item                 Generic                   Relevant
child                                 item                 Generic                   Relevant
stepparent                            item                 Generic                   Relevant
relative                              item                 Generic                   Relevant
godparent                             item                 Generic                   Could be relevant
number of children                    Quantity             N/A                       Not relevant
member of sports team                 item                 Sport                     Relevant
Position played on team / specialty   item                 Sport                     Relevant
Doubles record                        String               N/A                       Not relevant
Singles record                        String               N/A                       Not relevant
ranking                               Quantity             N/A                       Not relevant
Country for sport                     item                 Sport                     Could be relevant
Swedish Olympic Committee athlete ID  External identifier  N/A                       Not relevant
Military branch                       item                 History                   Could be relevant
Military rank                         item                 History                   Could be relevant
Commander of                          item                 History                   Could be relevant
conflict                              item                 History                   Could be relevant