Amparo Elizabeth Cano Basave (1), Francesco Osborne (2), Angelo Salatino (2)
(1) Aston University, United Kingdom
(2) KMi, The Open University, United Kingdom
EKAW 2016
Ontology Forecasting in Scientific Literature: Semantic Concepts Prediction based on Innovation-Adoption Priors
Osborne, F., Motta, E. and Mulholland, P. Exploring scholarly data with Rexplore. International Semantic Web Conference 2013
technologies.kmi.open.ac.uk/rexplore/
The Computer Science Ontology (1)
Standard research area taxonomies, classifications and ontologies, such as ACM, are not suited to the task:
• Not fine-grained enough.
– E.g., only 2 topics are classified under Semantic Web.
• Static and manually defined, hence prone to becoming obsolete very quickly.
[Figure: ACM 2012]
The Computer Science Ontology (CSO) was automatically created and updated by applying the Klink-2 algorithm.
Osborne, F. and Motta, E.: Klink-2: integrating multiple web sources to generate semantic topic networks. In ISWC 2015. (2015)
The Computer Science Ontology (2)
• We automatically generated a version of CSO consisting of about 15,000 topics linked by about 70,000 semantic relationships.
• It includes very granular, low-level research areas and can be regularly updated by running Klink-2 on a new set of publications.
• We also have different versions of CSO, obtained by running Klink-2 on the set of documents up to a certain year.
The Computer Science Ontology (3)
[Figure: versions of the ontology labelled CSO 2012, CSO 2013, CSO 2014 and CSO 2015]
[…]
A shared conceptualization
“Ontologies are a formal, explicit specification of a shared conceptualization” (Studer et al., 1998)
“The conceptualization should express a shared view between several parties, a consensus rather than an individual view” (Guarino et al., 2009)
“Ontologies are us: inseparable from the context of the community in which they are created and used.” (Mika, 2005)
“Ontology Evolution is the timely adaptation of an ontology to the arisen changes and the consistent propagation of these changes to dependent artefacts.” (Stojanovic, 2004)
But what if we cannot wait for shared consensus?
These ontologies reflect the past, and can only contain concepts that are already popular enough to be selected by experts or automatic methods.
Hence, they hardly support tasks that require describing emerging concepts, e.g.:
• Exploring the forefront of research;
• Trend detection;
• Horizon scanning;
• Producing smart analytics to inform business decisions.
Ontology Forecasting
Given an ontology at time t, a team of experts and/or a software system considers a number of relevant knowledge sources and updates the ontology by also including new concepts on which there will (probably) be a shared consensus at time t+1.
For example, a forecasted ontology of research topics in 2000 may already include a new topic associated with the dynamics preceding the “Semantic Web” (new collaborations between Knowledge Base Systems, AI and WWW).
[Figure: timeline of ontology versions at times t-n, […], t-1, t and t+1]
Contributions: a first step towards ontology forecasting
1. We approach the novel task of ontology forecasting by predicting semantic concepts in the research domain.
2. We introduce metrics to analyse the linguistic and semantic progressiveness in scholarly data.
3. We propose Semantic Innovation Forecast (SIF), a novel weakly-supervised approach for forecasting emerging semantic concepts.
4. We evaluate our approach on a dataset of over 1 million documents in the Computer Science domain.
– The proposed framework offers competitive boosts in mean average precision at ten for forecasts over 5 years.
Scopus (Computer Science) - # of publications
[Figure: number of articles (y-axis, 0 to 250,000) by year (x-axis, ticks 1995 to 2007)]
Scopus (Computer Science) - vocabulary size
[Figure: vocabulary size (y-axis, 0 to 160,000) by year (x-axis, ticks 1995 to 2007)]
Klink-2 Computer Science Ontology - # of classes
Linguistic Progressiveness
Language innovation in a corpus refers to the introduction of novel patterns of language.
We generate one language model per year, using Katz back-off smoothing, and analyse the differences between consecutive years using the perplexity metric.
[Figure: perplexity (y-axis, 0 to 1.4E+11) by year (x-axis, ticks 1995 to 2007)]
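To make the perplexity comparison between consecutive years concrete, here is a minimal sketch. It deliberately swaps in an add-one smoothed unigram model in place of the Katz back-off model named on the slide, so that the example stays short; the function names and the toy corpora are assumptions made for illustration only.

```python
import math
from collections import Counter

def train_unigram(tokens, vocab):
    """Add-one smoothed unigram model over a fixed vocabulary.

    Simplification: the slides use Katz back-off smoothing instead.
    """
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def perplexity(model, tokens):
    """Exponential of the average negative log-probability per token."""
    log_prob = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))

# Hypothetical toy corpora for two consecutive years.
year_t = "semantic web ontology web".split()
year_t1 = "semantic web linked data".split()
vocab = set(year_t) | set(year_t1)

# Model trained on year t, evaluated on year t and on year t+1.
model_t = train_unigram(year_t, vocab)
same_year = perplexity(model_t, year_t)
next_year = perplexity(model_t, year_t1)
```

A higher perplexity on the following year's text than on the training year suggests that the language has moved away from what the model saw.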
Linguistic Progressiveness
We also perform a progressive analysis based on lexical innovation and lexical adoption.
A large number of new words appear each year, but only a few of them are adopted (i.e., still used in the following year).
[Figure: number of words (y-axis, 0 to 70,000) by year (x-axis, ticks 1997 to 2007); series: # of new words per year, # of adopted words per year]
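The notions of lexical innovation and lexical adoption can be sketched with plain set operations. The sketch assumes that a word is innovative in the first year it appears in the input and adopted if it is still used in the following year; the exact year indexing used by the authors is not stated on the slide, and the function name and toy vocabularies are hypothetical.

```python
def innovation_and_adoption(vocab_by_year):
    """Return ({year: LI}, {year: LA}) from a {year: set of words} mapping.

    LI[t]: words used in year t and in no earlier year of the input.
    LA[t]: words of LI[t] that are still used in year t + 1 (assumed indexing).
    """
    seen = set()
    li, la = {}, {}
    for year in sorted(vocab_by_year):
        li[year] = vocab_by_year[year] - seen
        seen |= vocab_by_year[year]
    for year, new_words in li.items():
        if year + 1 in vocab_by_year:
            la[year] = new_words & vocab_by_year[year + 1]
    return li, la

# Hypothetical yearly vocabularies.
vocab_by_year = {
    1995: {"web", "agent"},
    1996: {"web", "xml", "applet"},
    1997: {"web", "xml", "rdf"},
}
li, la = innovation_and_adoption(vocab_by_year)
```

Note that every word of the first year counts as new under this definition, so the first year of a series is best discarded when plotting.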
Measure Linguistic Progressiveness
[Figure: linguistic progressiveness (y-axis, 0 to 0.35) by year (x-axis, ticks 1997 to 2007)]
We introduce the linguistic progressiveness metric:
$$LP_t = \frac{LA_t}{LI_t}$$
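As a small worked example of the metric, the sketch below reads LA and LI as the numbers of adopted and innovative words for a year, which is an assumption based on the word-count chart rather than something the slide states. The counts are invented for illustration and are not taken from the Scopus data.

```python
def linguistic_progressiveness(n_adopted, n_new):
    """LP for one year as the ratio of adopted to innovative word counts."""
    if n_new == 0:
        return 0.0  # assumption: define LP as 0 when no new words appeared
    return n_adopted / n_new

# Hypothetical (adopted, new) word counts per year.
counts = {2001: (9000, 45000), 2002: (12000, 48000)}
lp = {year: linguistic_progressiveness(a, n) for year, (a, n) in counts.items()}
```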
Innovation-Adoption Priors
We assume that emerging topics will be associated with novel words, so we compute priors at time t by considering innovative words (LI) and adopted words (LA).
A word prior is a probability distribution that expresses how relevant a word is to, in this case, being characteristic of innovative topics.
We build the prior matrix by assigning a weight to each term in this vocabulary.
– 0.7 if w ∈ LI_{t−2} and 0.9 if w ∈ LA_{t−1}, because our analysis shows that recently adopted words (LA) are more often associated with emerging topics than new words (LI).
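The weighting rule can be sketched as a lookup over the vocabulary. The weights 0.7 and 0.9 come from the slide; the default weight for all other words and the choice to let the adopted-word weight win when a word is in both sets are assumptions, as are the function and parameter names.

```python
def build_word_priors(vocabulary, li_t_minus_2, la_t_minus_1,
                      li_weight=0.7, la_weight=0.9, default_weight=0.1):
    """Assign a prior weight to every word in the vocabulary.

    default_weight is a hypothetical value; the slides do not say what
    weight words outside LI and LA receive.
    """
    priors = {}
    for word in vocabulary:
        if word in la_t_minus_1:      # recently adopted words get the higher weight
            priors[word] = la_weight
        elif word in li_t_minus_2:    # innovative but not (yet) adopted
            priors[word] = li_weight
        else:
            priors[word] = default_weight
    return priors

# Hypothetical toy inputs.
vocabulary = {"web", "xml", "applet", "rdf"}
priors = build_word_priors(vocabulary,
                           li_t_minus_2={"xml", "applet"},
                           la_t_minus_1={"xml"})
```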
Semantic Innovation Forecast (SIF) model
SIF is a generative probabilistic topic model that takes as input a set of documents at year t and a set of historical priors, and forecasts topic-word distributions representing new concepts in the ontology O_{t+1}.
Semantic Innovation Forecast (SIF) model
We use collapsed Gibbs sampling to infer the model parameters and topic assignments for a corpus at year t+1 given observed documents at year t.
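The SIF model itself is not specified in enough detail on these slides to reproduce it. As a rough illustration of the inference technique only, the sketch below runs collapsed Gibbs sampling for a plain LDA model in which each word has its own prior weight, which is one simple way word priors can enter a topic model. It is not the authors' model; all names, hyperparameters and the toy corpus are assumptions.

```python
import random

def gibbs_lda(docs, vocab_size, n_topics, beta, alpha=0.1, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA with a per-word prior vector `beta`.

    docs: list of documents, each a list of word ids in [0, vocab_size).
    beta: list of positive weights, one per word; larger favours the word.
    Returns the estimated topic-word distributions (n_topics rows).
    """
    rng = random.Random(seed)
    beta_sum = sum(beta)
    doc_topic = [[0] * n_topics for _ in docs]                  # counts per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(n_topics)]    # counts per (topic, word)
    topic_total = [0] * n_topics                                # tokens per topic
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta[w])
                    / (topic_total[j] + beta_sum)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return [
        [(topic_word[k][w] + beta[w]) / (topic_total[k] + beta_sum)
         for w in range(vocab_size)]
        for k in range(n_topics)
    ]

# Hypothetical toy corpus over a 4-word vocabulary; words 2 and 3 get higher priors.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
phi = gibbs_lda(docs, vocab_size=4, n_topics=2, beta=[0.1, 0.1, 0.9, 0.9])
```

Each row of `phi` is a topic-word distribution of the kind that the slides compare against gold standard concepts.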
Evaluation
We perform this task by applying our framework to the Scopus dataset for Computer Science (>1M publications).
Each yearly collection of documents is randomly partitioned into three subsets: 20% is used to derive innovation priors, 40% is the training set and 40% is the testing set.
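The 20/40/40 partition of one year's documents can be sketched as a shuffle followed by slicing. The function name, the use of document ids and the fixed seed are assumptions; the slide only states the proportions.

```python
import random

def partition_year(doc_ids, seed=0):
    """Randomly split one year's documents into priors/train/test (20/40/40)."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    n_prior = len(ids) * 20 // 100
    n_train = len(ids) * 40 // 100
    prior_set = ids[:n_prior]
    train_set = ids[n_prior:n_prior + n_train]
    test_set = ids[n_prior + n_train:]   # remainder, about 40%
    return prior_set, train_set, test_set

# Hypothetical document ids for a single year.
prior_set, train_set, test_set = partition_year(range(100))
```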
We train a SIF model on year t using innovation priors computed for the two previous years (t-1 and t-2), and we use the SIF model to forecast semantic concepts at year t+1.
We then compute the cosine similarity between the predicted semantic concepts for t+1 and the gold standard (GS) concepts for that year. We consider a concept correctly forecast if its similarity with a GS concept is higher than 0.5.
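The matching rule can be sketched as follows, assuming that both predicted and gold standard concepts are represented as sparse word-weight vectors. That representation, the function names and the toy vectors are assumptions; only the cosine measure and the 0.5 threshold come from the slide.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {word: weight}."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def correctly_forecast(predicted, gold_concepts, threshold=0.5):
    """True if the predicted concept is similar enough to any gold concept."""
    return any(cosine(predicted, gold) > threshold for gold in gold_concepts)

# Hypothetical concept vectors.
predicted = {"semantic": 1.0, "web": 1.0}
gold = [{"semantic": 1.0}, {"database": 1.0}]
is_hit = correctly_forecast(predicted, gold)
```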
Evaluation - Baselines
We compare SIF against four baselines. For a year t, forecasting for year t+1:
1. LDA Topics (LDA): computes topics on the full training set. This setting makes no assumption about innovative/adopted lexicons.
2. LDA Innovative Topics (LDA-I): computes topics based on documents containing at least one word appearing in LI_t.
3. LDA Adopted Topics (LDA-A): computes topics based only on documents containing at least one word appearing in LA_t.
4. LDA Innovation/Adoption Topics (LDA-IA): computes topics based only on documents containing at least one word appearing in LI_t or LA_t.
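The document selection behind the three filtered baselines can be sketched as a single helper, assuming documents are tokenised into lists of words. The helper name and the toy data are hypothetical, and the LDA training step itself is left out.

```python
def select_documents(docs, lexicon):
    """Keep documents that contain at least one word from the lexicon."""
    return [doc for doc in docs if not lexicon.isdisjoint(doc)]

# Hypothetical tokenised training documents and lexicons for year t.
docs = [["semantic", "web"], ["relational", "database"], ["rdf", "graph"]]
li_t = {"rdf"}
la_t = {"semantic"}

lda_i_docs = select_documents(docs, li_t)           # input for LDA-I
lda_a_docs = select_documents(docs, la_t)           # input for LDA-A
lda_ia_docs = select_documents(docs, li_t | la_t)   # input for LDA-IA
```

The plain LDA baseline simply uses `docs` unfiltered.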
Evaluation - Mean Average Precision@10
Year   SIF    LDA    LDA-A   LDA-I   LDA-IA
2000   0.70   0.12   0.48    0       0.41
2002   0.87   0      0.82    0.64    0.75
2004   0.91   0      0.58    0.57    0.63
2006   0.87   0.31   0.78    0.84    0.69
2008   0.99   0.40   0.68    0.57    0.70
AVG    0.87   0.17   0.67    0.52    0.64
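For readers unfamiliar with the metric, here is a minimal sketch of average precision at 10 and its mean over several ranked lists. It uses one common variant, which normalises by the number of relevant items found in the top k; the slides do not say which variant or which grouping of rankings the authors used, so both are assumptions.

```python
def average_precision_at_k(relevance, k=10):
    """Average precision over the top-k items of one ranked list.

    relevance: list of booleans, True where the ranked item is correct.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision_at_k(runs, k=10):
    """Mean of the average precision over several ranked lists."""
    return sum(average_precision_at_k(r, k) for r in runs) / len(runs)

# Hypothetical ranked list: correct at ranks 1 and 3.
ap = average_precision_at_k([True, False, True])
```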
Conclusion
It is possible to reliably forecast emerging semantic concepts if the ontology is associated with a large collection of documents.
The next challenge is to forecast a new version of an ontology, that is, to produce an ontology that includes all concepts and relationships that will (probably) be included in the new version.
Future work
• Integration of explicit and latent semantics;
• Including graph-structure information in the model;
• Understanding how research topics are created and forecasting topic trends.
Salatino, A.A., Osborne, F., Motta, E. (2016) How are topics born? Understanding the research dynamics preceding the emergence of new areas. PeerJ Preprints
Francesco Osborne, Angelo Salatino, Amparo Cano Basave
Elizabeth Cano-Basave, A. E., Osborne, F., Salatino, A.A. (2016) Ontology Forecasting in Scientific Literature: Semantic Concepts Prediction based on Innovation-Adoption Priors. EKAW 2016, Bologna, Italy
Email: [email protected]: FraOsborne
Site: people.kmi.open.ac.uk/francesco