Open software and knowledge for MIOSS

MIOSS 2016, EBI, UK, 2016-05-17

Open Content + Open Programs

Peter Murray-Rust [1,2]

[1] University of Cambridge, [2] TheContentMine; pm286 AT cam DOT ac DOT uk

Open

Software, Articles, Data, Infrastructure

Themes

• Open:
– Faster
– Better
– Agile
– Inclusive
– Re-usable

Pomerantz, J. and Peek, R. 2016. Fifty shades of open http://dx.doi.org/10.5210/fm.v21i5.6360

Open source. Open access. Open society. Open knowledge. Open government. Even open food. The word “open” has been applied to a wide variety of words to create new terms, some of which make sense, and some not so much. This essay disambiguates the many meanings of the word “open” as it is used in a wide range of contexts.

• Demos:
– OSCAR, OPSIN, etc.
– Content Mining and Annotation

Open Source Demos

• Centre for Molecular Informatics
– OSCAR (chemical entity recognition)
– OPSIN (name2structure)
– ChemicalTagger (chemical language parsing)
– [OSRIC] (chemical image interpretation)

• ContentMine
– getpapers, quickscrape, norma, ami, canary
– Mining the complete scholarly literature

• 10,000 articles per day
• > 1 million facts per day

Open: [state+private] investment

Rufus Pollock “The Open Information age” TBP 2016 [1]

CMI Software (OSCAR, OPSIN, ChemicalTagger, [OSRIC]): sponsors
– 2006 …
– Unilever
– Nature PG, RSC, Int Union of Cryst.
– EPSRC
– OMII
– NCI
– [Microsoft]
– JISC
– CambridgeIP
– Linguamatics
– … 2016

• Peter Corbett, Joe Townsend, Chris Waudby, Sam Adams, David Jessop, Lezan Hawizy, Nico Adams, Mark Williamson, Andy Howlett, Daniel Lowe…

[1] https://www.youtube.com/watch?v=D2oNxhn6POA

Community

Mat Todd (Sydney) and MANY collaborators

http://opensourcemalaria.org/ (Chrome for interactivity)

Mat Todd, Univ Sydney, runs an Open Notebook community to create new antimalarials.

Interactive OPEN chemical search tool from cheminfo.org

Interactive OPEN molecular display Jmol (Bob Hanson et al)

data is associated with the proposed scientific endeavour prior to or at the point of creation rather than by annotating the data with commentary after the experiment has taken place

University of Southampton

Natural Language Processing

Part of speech tagging (Wordnet, Brown Corpus, etc.)

http://chemicaltagger.ch.cam.ac.uk/

Typical chemical synthesis

Automatic semantic markup of chemistry

Could be used for analytical, crystallization, etc.

Parsing chemical sentences

This could be extended to much other scientific language
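As a rough illustration of what tagging a chemical sentence involves, the sketch below labels the tokens of a synthesis-style sentence using a tiny hand-made lexicon. It is not ChemicalTagger's method or API: the word lists, the tag names and the example sentence are all assumptions made up for this illustration.

```python
import re

# Hypothetical mini-lexicon, invented for this sketch only.
ACTION_VERBS = {"added", "stirred", "heated", "filtered", "washed", "dried"}
UNITS = {"g", "mg", "ml", "mmol", "h", "min"}
NUMBER = re.compile(r"\d+(?:\.\d+)?")
# A token is a number, a run of letters, or a single punctuation character.
TOKEN = re.compile(r"\d+(?:\.\d+)?|[A-Za-z]+|[^\sA-Za-z\d]")


def tag_token(token):
    """Assign one coarse tag to a token by simple lookup rules."""
    lower = token.lower()
    if NUMBER.fullmatch(token):
        return "QUANTITY"
    if lower in ACTION_VERBS:
        return "ACTION"
    if lower in UNITS:
        return "UNIT"
    if not token.isalnum():
        return "PUNCT"
    return "OTHER"


def tag_sentence(sentence):
    """Split a sentence into tokens and return (token, tag) pairs."""
    return [(tok, tag_token(tok)) for tok in TOKEN.findall(sentence)]


if __name__ == "__main__":
    for token, tag in tag_sentence("The mixture was stirred for 2 h and filtered."):
        print(f"{token}\t{tag}")
```

A real system would need chemical name recognition and a grammar over the tags; this lookup only shows the shape of the output, a sequence of tagged tokens that later stages can parse.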

Open Content Mining of FACTs

Machines can interpret chemical reactions

We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.

AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home

Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:

AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other.

CLICK HERE FOR ANIMATION

(may be browser dependent)

“OSRIC”

Is anyone interested in taking this further?

http://dx.doi.org/10.3390/metabo2010100

https://bytebucket.org/petermr/xhtml2stm/wiki/animation.svg

@Senficon (Julia Reda): Text & Data mining in times of #copyright maximalism:

"Elsevier stopped me doing my research" http://onsnetwork.org/chartgerink/2015/11/16/elsevier-stopped-me-doing-my-research/ …

#opencon #TDM

Elsevier stopped me doing my research (Chris Hartgerink)

I am a statistician interested in detecting potentially problematic research such as data fabrication, which results in unreliable findings and can harm policy-making, confound funding decisions, and hampers research progress.

To this end, I am content mining results reported in the psychology literature. Content mining the literature is a valuable avenue of investigating research questions with innovative methods. For example, our research group has written an automated program to mine research papers for errors in the reported results and found that 1/8 papers (of 30,000) contains at least one result that could directly influence the substantive conclusion [1].

In new research, I am trying to extract test results, figures, tables, and other information reported in papers throughout the majority of the psychology literature. As such, I need the research papers published in psychology that I can mine for these data. To this end, I started ‘bulk’ downloading research papers from, for instance, Sciencedirect. I was doing this for scholarly purposes and took into account potential server load by limiting the amount of papers I downloaded per minute to 9. I had no intention to redistribute the downloaded materials, had legal access to them because my university pays a subscription, and I only wanted to extract facts from these papers.

Full disclosure, I downloaded approximately 30GB of data from Sciencedirect in approximately 10 days. This boils down to a server load of 0.0021GB/[min], 0.125GB/h, 3GB/day.

Approximately two weeks after I started downloading psychology research papers, Elsevier notified my university that this was a violation of the access contract, that this could be considered stealing of content, and that they wanted it to stop. My librarian explicitly instructed me to stop downloading (which I did immediately), otherwise Elsevier would cut all access to Sciencedirect for my university.

I am now not able to mine a substantial part of the literature, and because of this Elsevier is directly hampering me in my research.

[1] Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 1–22. doi: 10.3758/s13428-015-0664-2

Chris Hartgerink’s blog post
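The server-load figures quoted in the blog post can be checked with a few lines of arithmetic. The sketch below takes only the numbers stated in the post (about 30 GB in about 10 days, at most 9 papers per minute); the function and variable names are made up for this illustration.

```python
def download_rates(total_gb, days):
    """Return average load as (GB/day, GB/hour, GB/minute)."""
    per_day = total_gb / days
    per_hour = per_day / 24
    per_minute = per_hour / 60
    return per_day, per_hour, per_minute


# Figures as stated in the blog post (both are approximate).
TOTAL_GB = 30
DAYS = 10
PAPERS_PER_MINUTE_CAP = 9

per_day, per_hour, per_minute = download_rates(TOTAL_GB, DAYS)

# Upper bound on papers per day if the 9-per-minute cap were hit all day.
max_papers_per_day = PAPERS_PER_MINUTE_CAP * 60 * 24

if __name__ == "__main__":
    print(f"{per_day:.1f} GB/day, {per_hour:.3f} GB/h, {per_minute:.4f} GB/min")
    print(f"at most {max_papers_per_day} papers/day")
```

The results agree with the post: 3 GB/day, 0.125 GB/h, and roughly 0.0021 GB/min after rounding.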

Julia Reda, Pirate MEP, running ContentMine software to liberate science 2016-04-16

The Right to Read is the Right to Mine

http://contentmine.org

What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

SECTIONS

MAPS

TABLES

CHEMISTRY

TEXT

MATH

contentmine.org tackles these

[Diagram: the ContentMine pipeline. Latest 20150908. Labels recovered from the figure:]

• Sources: EuPMC, arXiv, CORE, HAL, (univ repos); ToC services; publisher sites (PLoS ONE, BMC, PeerJ … Nature, IEEE, Elsevier …)
• Formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS, URLs, DOIs
• Tools: getpapers (query), quickscrape (crawl; scrapers, queries), norma (Normalizer, Structurer, Semantic Tagger; taggers), ami (search, lookup, plugins)
• Other components: catalogue, Daily Crawl
• Content types: Text, Data, Figures; abstract, methods, references, captioned figures (Fig. 1), HTML tables
• Semantic ScholarlyHTML: 30,000 pages/day
• Content mining plugins: Chem, Phylo, Trials, Crystal, Plants
• Outputs: Facts; Visualization and Analysis; Community


Mining for phytochemicals

• getpapers -q carvone -o carvone -x -k 100
Search for “carvone” (-q), output directory (-o), XML (-x), limit hits to 100 (-k)
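For scripting several such searches, the getpapers invocation on this slide can be assembled programmatically. The sketch below only builds the argument list using the four flags shown on the slide (-q, -o, -x, -k) and does not run anything; the helper name getpapers_command is hypothetical, and executing the result (for example with subprocess.run) assumes getpapers is installed and on the PATH.

```python
import shlex


def getpapers_command(query, outdir, limit, xml=True):
    """Build the argv list for a getpapers search (flags as on the slide)."""
    args = ["getpapers", "-q", query, "-o", outdir]
    if xml:
        args.append("-x")  # request XML output
    args += ["-k", str(limit)]  # limit the number of hits
    return args


if __name__ == "__main__":
    command = getpapers_command("carvone", "carvone", 100)
    # Print the shell-quoted form; run it yourself once getpapers is installed.
    print(shlex.join(command))
```

Passing the arguments as a list rather than a formatted string avoids quoting problems when a query contains spaces.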