

Using Full-text Academic Articles and Wikipedia to Find Alternative Free Bioinformatics Software
Shutian Ma; Chengzhi Zhang
Department of Information Management, Nanjing University of Science and Technology, Nanjing, China, 210094

Contact Information
Shutian Ma: [email protected]
Chengzhi Zhang: [email protected]

Abstract

Taking bioinformatics software as a case study, this work aims to find free software that is similar to commercial software and has the potential to serve as an alternative. Content and network information are used to produce preference-oriented results that capture similarity in how people describe the software in Wikipedia and how they use it in research.

Data and Tools

❖ Scientific literature
114,510 papers in XML format - PLOS ONE
11,013 articles containing bioinformatics software

❖ Software list
143 specific bioinformatics software packages - Wikipedia
20 commercial software packages according to their licenses
Only 97 software packages have infobox information

❖ Tools
LSI and Doc2Vec - Gensim
Python package of LDA
Node2Vec - OpenNE
Diff2Vec and struc2vec - GitHub

Experimental Result

What do we want to do? Find Free Software!

Method

Represent Software Using Full-text and Wiki

Compute Similarity between Free and Commercial Software

Construct Ground Truth and Evaluate Recommendation Result

Categories whose names start with "List of" are excluded.

Find Bioinformatics Software Information that Tells us Which is Free/Commercial

Information in the Wikipedia infobox is converted into a “Knowledge Graph”.

Use the number of returned results to calculate normalized pointwise mutual information (NPMI).
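The NPMI step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the poster does not say which search service returns the result counts or how the total collection size is chosen, so the function name `npmi`, its parameters, and the `total` argument are all assumptions.

```python
import math


def npmi(hits_x, hits_y, hits_xy, total):
    """Normalized pointwise mutual information from result counts.

    hits_x, hits_y: results returned for each software name alone.
    hits_xy: results returned when both names are queried together.
    total: assumed size of the searched collection (not given in the poster).
    Returns a value in [-1, 1].
    """
    if hits_xy == 0:
        return -1.0  # the two names never co-occur
    p_x = hits_x / total
    p_y = hits_y / total
    p_xy = hits_xy / total
    if p_xy >= 1.0:
        return 1.0  # degenerate case: every document mentions both names
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)


# Toy counts, invented for illustration only.
always_together = npmi(100, 100, 100, 10_000)
independent = npmi(100, 100, 1, 10_000)
never_together = npmi(100, 100, 0, 10_000)
```

A pair that always co-occurs scores close to 1, an independent pair close to 0, and a pair that never co-occurs scores -1, which makes the value convenient as a graded ground-truth relatedness score.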

Paper-based Software Recommendation Performance

Wiki-based Software Recommendation Performance

☑ Use Wiki content  ☐ Use full-text paper content
☑ Wiki content  ☐ Wiki graph
☑ Combine Wiki content & graph  ☐ Wiki content  ☐ Wiki graph
☑ Graph embedding can help to improve paper-based recommendation.
☑ When all information is combined, recommendation performance does not improve as much as expected.

Conclusion

✓ Wikipedia content and infobox information can be combined for an efficient software recommendation technique.

✓ Graph-based information can help to enrich semantic information.

✓ This kind of full-text publication data set is not suitable for representing entities in research, such as software.

When doing linear combinations, all weights are set to one.
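A linear combination with unit weights might look like the sketch below. It assumes that each software package has one vector per information source and that cosine similarity is the per-source score; the poster states neither, and the names `cosine`, `combined_similarity`, `rank_alternatives` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity(views_a, views_b, weights=None):
    """Weighted sum of per-source similarities; every weight defaults to one."""
    shared = sorted(set(views_a) & set(views_b))
    if weights is None:
        weights = {name: 1.0 for name in shared}
    return sum(weights[name] * cosine(views_a[name], views_b[name])
               for name in shared)


def rank_alternatives(commercial, free_candidates):
    """Sort free candidates by combined similarity to one commercial package."""
    scores = {name: combined_similarity(commercial, views)
              for name, views in free_candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy two-dimensional vectors; real vectors would have 100 or 128 dimensions.
commercial = {"content": [1.0, 0.0], "graph": [0.0, 1.0]}
free_candidates = {
    "free_a": {"content": [1.0, 0.0], "graph": [0.0, 1.0]},
    "free_b": {"content": [0.0, 1.0], "graph": [0.0, 1.0]},
}
ranking = rank_alternatives(commercial, free_candidates)
```

With unit weights each source contributes equally, so a candidate that matches on both content and graph outranks one that matches on the graph alone.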

❖ Software Representation Generation
LSI, LDA and Doc2Vec – Represent software via vectors based on content in 100 dimensions
Node2Vec, Diff2Vec and struc2vec – Represent software via vectors based on graph in 128 dimensions

❖ Software Graph Construction

Node Type                                       Value
developed by what kind of team                  university, company and person
year of stable release                          14 different years
written in what kind of programming language    17 languages
operating system                                Linux, Unix, Windows and MacOS
applied platform                                6 kinds
available language                              English or cross-language
software type                                   44 types
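One way to turn infobox fields into such a graph is sketched below. It assumes that each software package and each (node type, value) pair becomes a node, with an undirected edge between a package and every value in its infobox. The poster does not describe the exact construction, and the names `build_graph`, `edge_list`, "ToolA" and "ToolB" are hypothetical.

```python
from collections import defaultdict


def build_graph(infoboxes):
    """Build an undirected adjacency map from infobox fields.

    infoboxes maps a software name to {node type: [values]}.
    Value nodes are (node type, value) tuples, so "Java" as a programming
    language stays distinct from "Java" as a platform.
    """
    graph = defaultdict(set)
    for software, fields in infoboxes.items():
        for node_type, values in fields.items():
            for value in values:
                value_node = (node_type, value)
                graph[software].add(value_node)
                graph[value_node].add(software)
    return dict(graph)


def edge_list(graph, software_names):
    """Flatten the graph into (software, value node) pairs."""
    return sorted((name, node)
                  for name in software_names
                  for node in graph.get(name, ()))


# Invented infobox entries, for illustration only.
toy_infoboxes = {
    "ToolA": {"written in": ["Java"], "operating system": ["Linux", "Windows"]},
    "ToolB": {"written in": ["Java"], "operating system": ["MacOS"]},
}
graph = build_graph(toy_infoboxes)
shared = graph["ToolA"] & graph["ToolB"]
edges = edge_list(graph, ["ToolA", "ToolB"])
```

Two packages that share a value node end up two hops apart, which is the kind of proximity a graph embedding method can pick up from an edge list like this one.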