
ContentMining at Cambridge


Page 1: ContentMining at Cambridge

Content Mining of Science in Cambridge

Peter Murray-Rust, Dept of Chemistry, University of Cambridge

libraries@cambridge, Cambridge, UK 2016-01-07

What is mining?

Why is it useful?

Open Access and UK “Hargreaves” legislation

How Cambridge can become a world leader

Page 2: ContentMining at Cambridge

The Right to Read is the Right to Mine*

*Peter Murray-Rust, 2011

http://contentmine.org

Page 3: ContentMining at Cambridge

Use Cases of ContentMining

• Epidemiology of obesity (Cambridge U)
• (OKF, OpenTrials) Mapping clinical trials repositories to reports in scientific literature
• Mining chemical reactions from patents
• Creating a bacterial supertree-of-life from 4500 papers

Page 4: ContentMining at Cambridge

Polly has 20 seconds to read this paper…

…and 10,000 more

Page 5: ContentMining at Cambridge

ContentMine software can do this in a few minutes

Polly: “there were 10,000 abstracts and due to time pressures, we split this between 6 researchers. It took about 2-3 days of work (working only on this) to get through ~1,600 papers each. So, at a minimum this equates to 12 days of full-time work (and would normally be done over several weeks under normal time pressures).”
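The arithmetic behind Polly's estimate can be checked with a short sketch. It uses only the figures quoted above (10,000 abstracts, 6 researchers, 2-3 days each); the function name is illustrative.

```python
# Figures quoted on the slide; the 2-3 day range is per researcher.
TOTAL_ABSTRACTS = 10_000
RESEARCHERS = 6
DAYS_PER_RESEARCHER = (2, 3)


def workload(total, people, days_range):
    """Return (abstracts per person, min person-days, max person-days)."""
    per_person = total / people
    low, high = (people * d for d in days_range)
    return per_person, low, high


per_person, low_days, high_days = workload(
    TOTAL_ABSTRACTS, RESEARCHERS, DAYS_PER_RESEARCHER
)
print(f"{per_person:.0f} abstracts each, {low_days}-{high_days} person-days")
```

The lower bound matches the "12 days of full-time work" in the quote; the upper bound of the same range is 18 person-days.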

Page 6: ContentMining at Cambridge

400,000 Clinical Trials in 10 government registries

Mapping trials => papers

http://www.trialsjournal.com/content/16/1/80

2009 => 2015. What’s happened in the last 6 years?

Search the whole scientific literature for “2009-0100068-41”
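As a rough illustration of mapping trials to papers, the sketch below scans text for registry identifiers. It is not ContentMine code. The identifier from the slide is matched as a literal string; the `NCT` pattern (ClinicalTrials.gov style, "NCT" plus 8 digits) is an added assumption, and the sample sentences and the placeholder `NCT00000000` are made up.

```python
import re

# Identifier quoted on the slide, searched for as a literal string.
SLIDE_ID = "2009-0100068-41"

# Assumed pattern for ClinicalTrials.gov identifiers ("NCT" + 8 digits);
# other registries use other formats and would need their own patterns.
NCT_PATTERN = re.compile(r"\bNCT\d{8}\b")


def find_trial_ids(text, literal_ids=(SLIDE_ID,)):
    """Return the sorted trial identifiers mentioned in `text`."""
    found = {i for i in literal_ids if i in text}
    found.update(NCT_PATTERN.findall(text))
    return sorted(found)


# Hypothetical sentences standing in for paper full text.
papers = {
    "paper-a": "The trial was registered as 2009-0100068-41 before enrolment.",
    "paper-b": "Registered at ClinicalTrials.gov (NCT00000000).",
    "paper-c": "No registry identifier is reported.",
}
hits = {paper_id: find_trial_ids(text) for paper_id, text in papers.items()}
```

Run over full texts rather than abstracts, the same loop gives a first-pass table of which papers mention which trial.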

Page 7: ContentMining at Cambridge

ContentMine-ing strategy

• Discover. Crawl the COMPLETE relevant literature. => bibliography
• Scrape (download). ALL papers
• Index papers => Facts
• Search/analyze papers => complex science
• Extract, Annotate, Aggregate (“Transformative”)
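The "index" and "search" steps in the strategy above can be sketched with a toy inverted index. This is a minimal illustration under stated assumptions, not the ContentMine implementation: the corpus, the identifiers and the function names are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical mini-corpus standing in for the scraped papers.
corpus = {
    "doi:example/1": "Obesity prevalence in adults was measured in Cambridge.",
    "doi:example/2": "A phylogenetic tree of bacteria, including Mycobacterium tuberculosis.",
    "doi:example/3": "Obesity and diet in children: a cohort study.",
}


def index_papers(papers):
    """Build an inverted index: lowercase word -> set of paper ids."""
    index = defaultdict(set)
    for paper_id, text in papers.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(paper_id)
    return index


def search(index, *terms):
    """Return the sorted ids of papers that contain every term."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return sorted(set.intersection(*sets)) if sets else []


index = index_papers(corpus)
print(search(index, "obesity"))
print(search(index, "obesity", "children"))
```

A real pipeline would index extracted facts (species, chemicals, identifiers) rather than raw words, but the shape of the lookup is the same.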

Page 8: ContentMining at Cambridge

What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

[Figure: example paper with labelled content types: SECTIONS, MAPS, TABLES, CHEMISTRY, TEXT, MATH]

contentmine.org tackles these

Page 9: ContentMining at Cambridge

CONTENTMINE: Complete OPEN Platform for Mining Scientific Literature

[Diagram: the ContentMine pipeline. Labels recovered from the figure:]

• Sources: EuPMC, arXiv, CORE, HAL, (UNIV repos); ToC services; Publisher Sites (PLoS ONE, BMC, PeerJ… Nature, IEEE, Elsevier…)
• Tools: catalogue, getpapers, query, DailyCrawl, crawl, quickscrape
• Formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS, URLs, DOIs
• norma: Normalizer, Structurer, Semantic Tagger => Text, Data, Figures
• Semantic ScholarlyHTML (30,000 pages/day): abstract, methods, references, Captioned Figures (Fig. 1), HTML tables
• ami: search, Lookup, plugins (Chem, Phylo, Trials, Crystal, Plants) => Facts
• Other labels: scrapers, queries, taggers, CONTENT MINING, COMMUNITY, Visualization and Analysis
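The "query" step against a source such as EuPMC can be illustrated by building a search URL for the public Europe PMC REST service. This is a sketch, not how getpapers works internally. The query string, including the `OPEN_ACCESS:y` filter, and the page size are example values chosen here, and the code only constructs the URL without sending a request.

```python
from urllib.parse import urlencode

# Public Europe PMC REST search endpoint.
EUPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(query, page_size=25):
    """Return a Europe PMC search URL asking for JSON results."""
    params = {"query": query, "format": "json", "pageSize": page_size}
    return f"{EUPMC_SEARCH}?{urlencode(params)}"


# Example query (an assumption): open-access papers mentioning "obesity".
url = build_search_url('"obesity" AND OPEN_ACCESS:y')
print(url)
```

Fetching the URL and paging through the results would give the bibliography that the later scrape and index steps work from.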

Page 10: ContentMining at Cambridge

http://chemicaltagger.ch.cam.ac.uk/

Typical chemical synthesis

Page 11: ContentMining at Cambridge

Open Content Mining of FACTs

Machines can interpret chemical reactions

We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.

Page 12: ContentMining at Cambridge

Facts in context: daily IUCN endangered species news

en.wikipedia.org CC BY-SA

Page 13: ContentMining at Cambridge

ContentMine Fact of The Day

• Fact of the day
• Endangered species in recent science
• Facts
• Bubbles

Page 14: ContentMining at Cambridge

https://en.wikipedia.org/wiki/Tree_of_life CC BY-SA

Page 15: ContentMining at Cambridge

“Root” 4500 papers each with 1 tree

Page 16: ContentMining at Cambridge

OCR (Tesseract)

Norma (image analysis)

(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9);

Semantic re-usable/computable output (ca 4 secs/image)
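The output above is a Newick-style tree string, which is what makes it computable. The sketch below is a minimal parser written for this illustration, not part of Norma or ami. It reads one complete sub-clade copied from the output, closed with ";" so it stands alone; the elided part of the full string is left out.

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts with name, length, children."""
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]  # empty for unnamed internal nodes
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1  # consume ":"
            start = pos
            while pos < len(text) and text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos:pos + 1] != ";":
        raise ValueError("Newick string must end with ';'")
    return root


def leaves(node):
    """Return leaf names in left-to-right order."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaves(child)]


# Sub-clade copied from the slide's output, closed with ";".
fragment = (
    "((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,"
    "Synergistes_jonesii:301);"
)
tree = parse_newick(fragment)
print(leaves(tree))
```

Once each figure is reduced to a structure like this, trees from many papers can be compared and merged programmatically.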

Page 17: ContentMining at Cambridge

Supertree for 924 species

Tree

Page 18: ContentMining at Cambridge

Supertree created from 4300 papers

Page 19: ContentMining at Cambridge

Copyright and Mining

• UK (“Hargreaves”) 2014 legislation:
– “personal” “non-commercial*” “research” “data analytics”
– legitimizes copying (?to disk), but not publishing

*teaching, textbooks, etc. may be “commercial”

Page 20: ContentMining at Cambridge

STM Publishers prevent Mining

• FUD & disinformation about legality (Elsevier)
• Monopolies on infrastructure (“API”s, CCC RightFind)
• Technical obstruction (Wiley CAPTCHA, Macmillan ReadCube)
• Restrictive contracts with libraries (ALL) [1]
• Wasting my/our time (ALL)

[1] [You may not] utilize the TDM Output to enhance … subject repositories in a way that would [… ] have the potential to substitute and/or replicate any other existing Elsevier products, services and/or solutions.

Page 21: ContentMining at Cambridge

WILEY … “new security feature… to prevent systematic download of content”

“[limit of] 100 papers per day”

“essential security feature … to protect both parties (sic)”

CAPTCHA: User has to type words

Page 22: ContentMining at Cambridge

ContentMine working with Libraries

• Cambridge: Library, Plant Sciences, Public Health, Chemistry

• Cochrane Collaboration on Systematic Reviews of Clinical Trials

• FutureTDM (H2020, LIBER)

• Running workshops and training

• We have dedicated servers running in chemistry

Page 23: ContentMining at Cambridge

My European Heroes

Young People (ContentMine)

NEELIE KROES