
ContentMining and Clinical Trials


Page 1: ContentMining and Clinical Trials

Content-Mining for Clinical Trials

Peter Murray-Rust

contentmine.org

Cochrane UK, Oxford, 2015-03-16

• OPEN Platform for Machines+humans to automatically “read” the trials literature

• Grow communities and give everyone the tools and know-how to mine trials

Page 2: ContentMining and Clinical Trials

• 09:30 - Introductions
• 10:00 - Overview of ContentMine
• 10:30 - Discussion: why might content mining clinical trials be useful?
• 11:00 - Tea/coffee break
• 11:15 - Discussion: current tools and what is needed
• 12:00 - Discussion: imagining the clinical trials mining pipeline
• 12:30 - Lunch
• 13:30 - Demo and introduction to software
• 14:30 - Technical session 1 (hands-on content mining)
• 15:30 - Tea/coffee break
• 15:45 - Technical session 2 (hands-on content mining)
• 17:00 - Event close

Page 3: ContentMining and Clinical Trials

Background for Today

• ContentMine aims to make large areas of scientific fact OPEN (100 million facts/year)
• We’re working with Wellcome Trust, Europe PubMed Central, etc.
• A politically “hot” area (Hargreaves legislation, EU activity)
• A week ago: Wellcome Trust workshop on TDM and Neuroscience; “rough consensus” on what was needed.
• In the last few days we’ve prototyped what we think is a good starting point…
• NOTE: The software is very “bleeding edge”! Please treat it in a spirit of adventure!!
• Vision/enthusiasm from Amy Price, Anna Noel-Storr, Emily Sena (E’burgh) and yourselves!

Page 4: ContentMining and Clinical Trials

Questions we could tackle

• How do we find (mentions of) clinical trials?
• Is a document a (clinical) trial?
• What is the subject of the trial?
• What is the methodology used?
• Do the design and practice conform to CONSORT?
• What are the outcomes?
• Can we extract specific re-usable information?
• Who is involved (researchers, sponsors, patients)?
• Has a proposed trial been completed and reported?
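One way to start on the first question (finding mentions of trials) is to look for trial registry identifiers in the text. The sketch below is illustrative only and is not part of the ContentMine software; it assumes two identifier shapes, ClinicalTrials.gov ("NCT" plus eight digits) and ISRCTN ("ISRCTN" plus eight digits), and the function name `find_trial_ids` and the sample sentence are made up for the example.

```python
import re

# Assumed identifier shapes: "NCT" + 8 digits (ClinicalTrials.gov)
# and "ISRCTN" + 8 digits (ISRCTN registry).
REGISTRY_ID = re.compile(r"\b(NCT\d{8}|ISRCTN\d{8})\b")


def find_trial_ids(text):
    """Return registry identifiers in order of first appearance, without duplicates."""
    seen = []
    for match in REGISTRY_ID.finditer(text):
        identifier = match.group(1)
        if identifier not in seen:
            seen.append(identifier)
    return seen


# Made-up sample sentence, for illustration only.
sample = (
    "The trial was registered as NCT01234567 and also as ISRCTN12345678. "
    "See NCT01234567 for the protocol."
)
print(find_trial_ids(sample))
```

Matching an identifier in a proposal and again in a later report would be one crude way to approach the last question on the list.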

Page 5: ContentMining and Clinical Trials

Afternoon session

• Work in groups; mixture of skills and experience

• Take different sections of CONSORT
• Scrape articles from trialsjournal.com
• Explore word frequency: create your own lists of frequent words
• Design regexes to extract CONSORT 8a->11
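As a starting point for the last two tasks, here is a minimal sketch of a word-frequency count and a few regexes keyed by CONSORT item. The patterns, the stopword list and the sample sentence are hypothetical and cover only some of items 8a to 11; each group would be expected to grow and refine its own lists.

```python
import re
from collections import Counter

# Hypothetical starter patterns, keyed by CONSORT 2010 item.
CONSORT_PATTERNS = {
    "8a sequence generation": re.compile(
        r"computer[- ]generated|random number table", re.I),
    "8b type of randomisation": re.compile(
        r"block(?:ed)? randomi[sz]ation|stratifi(?:ed|cation)", re.I),
    "9 allocation concealment": re.compile(
        r"sealed,? opaque envelopes?|allocation conceal\w*", re.I),
    "11a blinding": re.compile(
        r"(?:double|single|triple)[- ]blind(?:ed)?|\bmasked\b", re.I),
}

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "was", "were", "by", "with"}


def word_frequencies(text, top=5):
    """Count lower-cased words, ignoring stopwords, most frequent first."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS).most_common(top)


def consort_hits(text):
    """Map each CONSORT item to the phrases that matched its pattern."""
    hits = {}
    for item, pattern in CONSORT_PATTERNS.items():
        found = [m.group(0) for m in pattern.finditer(text)]
        if found:
            hits[item] = found
    return hits


# Made-up methods paragraph, for illustration only.
methods = (
    "Participants were randomised using a computer-generated list with "
    "stratified block randomisation. Allocation concealment was ensured "
    "with sealed opaque envelopes. The trial was double-blind."
)
print(word_frequencies(methods))
print(consort_hits(methods))
```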

Page 6: ContentMining and Clinical Trials

The Right to Read is the Right to Mine

http://contentmine.org

Page 7: ContentMining and Clinical Trials

https://en.wikipedia.org/wiki/Irrigation#mediaviewer/File:Pump-enabled_Riverside_Irrigation_in_Comilla,_Bangladesh,_25_April_2014.jpg CC BY-SA 3.0

Daily Stream of 100,000 Open Facts

Twitter?

Indexed by CAT

Page 8: ContentMining and Clinical Trials
Page 9: ContentMining and Clinical Trials

What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

SECTIONS

MAPS

TABLES

CHEMISTRY

TEXT

MATH

contentmine.org tackles these

Page 10: ContentMining and Clinical Trials

What is “Content”?

Page 11: ContentMining and Clinical Trials

Machine-Human symbioses

• Wikipedia
• OpenStreetMap
• Google

We aim to make it trivial for a human+machine to mine the scientific literature, by building communities.

Page 12: ContentMining and Clinical Trials

ContentMine Workshops and Hackdays

Open Science Brazil, 2014-08

Easily distributed software

Get started in 30 mins

Build application in a morning

Start simple: bagOfWords, Stemming, Regex, templates

Page 13: ContentMining and Clinical Trials

Oxford 2013

Berlin 2014

Delhi 2014

Jenny Molloy with mascot AMI

Page 14: ContentMining and Clinical Trials

Workshops (1-hour -> full day or more)

2014-May->Nov
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund AT
• OKFest DE
• Eur. Bioinformatics Institute
• Open Science Rio de Janeiro BR
• SciDataCon, Delhi IN
• Univ of Chicago US
• OpenCon 2014, Wash DC, US
• JISC, London

Upcoming
• LIBER
• Cochrane
• BL
• Wellcome Trust (April)
• WHO

Collaborators
• Wikimedia/Wikidata
• Mozilla
• Open Knowledge
• LIBER (European Research Libraries)
• British Library
• Wellcome Trust
• EBI (Eur. Bioinf. Inst.)
• JISC
• Open Access Button
• SPARC
• Creative Commons
• CORE
• EuropePubmedCentral

Page 15: ContentMining and Clinical Trials

• CRAWL the web for scientific documents (articles, grey literature, repositories)
• quickSCRAPE pages (text, graphics, images, data)
• NORMA-lize page to semantic form
• MINE pages with your methods and tools (AMI)
• CAT-alogue results in searchable index
• Automate daily process (CANARY)

…Open semantic science …

contentmine.org Infrastructure
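The stages above chain naturally: each consumes the previous stage's output. The sketch below only illustrates that shape with in-memory stand-ins; none of the functions are the real quickscrape, NORMA, AMI or CAT interfaces, and the page data and URLs are invented.

```python
import re


def scrape(raw_pages):
    # Stand-in for the scraping stage: keep only pages that contain some content.
    return [page for page in raw_pages if page["html"].strip()]


def normalise(page):
    # Stand-in for normalisation: strip tags and collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", page["html"])
    return {"url": page["url"], "text": " ".join(text.split())}


def mine(doc, pattern):
    # Stand-in for a mining plugin: every regex match counts as one "fact".
    return [{"url": doc["url"], "fact": m.group(0)} for m in pattern.finditer(doc["text"])]


def catalogue(facts):
    # Stand-in for the index: map each fact (lower-cased) to the URLs it came from.
    index = {}
    for fact in facts:
        index.setdefault(fact["fact"].lower(), []).append(fact["url"])
    return index


def run_pipeline(raw_pages, pattern):
    facts = []
    for page in scrape(raw_pages):
        facts.extend(mine(normalise(page), pattern))
    return catalogue(facts)


# Invented input standing in for crawled pages.
pages = [
    {"url": "http://example.org/a", "html": "<p>A <b>randomised</b> controlled trial</p>"},
    {"url": "http://example.org/b", "html": "   "},
    {"url": "http://example.org/c", "html": "<p>Randomised, double-blind study</p>"},
]
index = run_pipeline(pages, re.compile(r"randomi[sz]ed", re.I))
print(index)
```

Keeping each stage a plain function over simple records is a design choice for the sketch: any stage can be swapped without touching the others.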

Page 16: ContentMining and Clinical Trials

[Diagram: contentmine.org pipeline. Stages: Crawl / Feed, quickscrape, Norma, AMI, Index & Transform. Sources: scientific literature, repositories, URL, DOI. Formats: PDF, XML, DOC, CSV, bad HTML, OCR, diagrams, sHTML. Scrapers: XPath, per-journal. Taggers: per-journal. Plugins: regex, sequences, species, metadata, chemistry, phylogenetics, farming, bespoke. Outputs: Open NORMA-lized Scientific Literature + Facts; CANARY pipeline; CAT-alogue index.]

Page 17: ContentMining and Clinical Trials

https://commons.wikimedia.org/wiki/File:Flickr_-_DVIDSHUB_-_RSP_Warrior_Challenge_Prepares_Soldiers_Mentally,_Physically_%281%29.jpg

CRAWLing the Literature

NO Central Table of Contents

Massive technical, political, legal opposition

Little interest from Academia

Tedious

Few general tools

Page 18: ContentMining and Clinical Trials

The Right to Read is The Right To Mine

PMR in 2012: http://blog.okfn.org/2012/06/01/the-right-to-read-is-the-right-to-mine/

Page 19: ContentMining and Clinical Trials

SCRAPE

https://en.wikipedia.org/wiki/Gleaning#mediaviewer/File:Millet_Gleaners.jpg PublicDomain

PDF

HTML

XML quickscrape*

*Scrapers created by Richard Smith-Unna + Community

HTML, PDF, XML, PNG, SVG, CSV, DOC, LaTeX, CIF…

Non-standard per-publisher site

Page 20: ContentMining and Clinical Trials

https://en.wikipedia.org/wiki/W._Heath_Robinson#mediaviewer/File:Robinson%28WH%29-%28%27Uncle_Lubin%27%29.jpg PublicDomain

NORMA-lization of Scientific Literature

PDFs, broken HTML, PNGs for math, etc.

NORMA

• Unicode
• Diacritics
• Well-formed
• Sectioned
• Tagged
• SVG diagrams
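To make the Unicode and diacritics items concrete: the same visible word can be encoded in more than one way, which breaks naive string matching. The sketch below uses Python's standard `unicodedata` module to show the problem and two possible normalisations. It illustrates the idea only and makes no claim about how NORMA itself does it; the helper names are made up.

```python
import unicodedata


def to_nfc(text):
    """Compose characters so each accented letter is a single code point."""
    return unicodedata.normalize("NFC", text)


def strip_diacritics(text):
    """Decompose, then drop combining marks (useful for loose matching only)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


composed = "caf\u00e9"      # "é" as one code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

print(composed == decomposed)           # the raw strings differ
print(to_nfc(decomposed) == composed)   # equal once normalised
print(strip_diacritics(composed))
```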

Page 21: ContentMining and Clinical Trials

AMI-plugins

• BagOfWords, Stemming and Regular Expressions
• Species
• Biological Sequences
• Chemical compounds & reactions
• Farming * (Rory Aaronson)
• Crystallography * (Saulius Grazulis, COD)
• Clinical Trials * (Amy Price)
• Phylogenetics * (Ross Mounce)
• Phytochemistry * (Chris Steinbeck, PMR)

* subcommunities

Page 22: ContentMining and Clinical Trials

Text-based plugins

• Bag of words (https://en.wikipedia.org/wiki/Bag-of-words_model)
• Tf–idf: term frequency, inverse document frequency (https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
• Templates and regexes (regular expressions).
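A minimal sketch of the first two ideas, using only the standard library. It assumes one common tf–idf variant (relative term frequency multiplied by the natural log of N over document frequency); other weightings exist, and this is not the AMI plugin's code. The three short "documents" are invented.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lower-cased words, ignoring order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def tf_idf(documents):
    """Return one {word: score} dict per document."""
    bags = [bag_of_words(doc) for doc in documents]
    n_docs = len(bags)
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())  # count each word once per document
    scores = []
    for bag in bags:
        total = sum(bag.values())
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in bag.items()
        })
    return scores


# Invented mini-corpus, for illustration only.
docs = [
    "the trial was randomised",
    "the trial was blinded",
    "the patients were randomised",
]
for doc_scores in tf_idf(docs):
    print(sorted(doc_scores.items(), key=lambda kv: -kv[1]))
```

Words that occur in every document (here "the") score zero, so the words that distinguish a document rise to the top.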

Page 23: ContentMining and Clinical Trials

“Bag of Words”

Three fulltext articles from trialsjournal.com

Page 24: ContentMining and Clinical Trials

Facts Marked by “non-scientists” in ContentMine workshops

With Wikipedia everyone can be a scientist

Page 25: ContentMining and Clinical Trials

“nuggets” in a scientific paper

[Figure: annotated passage from a paper, with labels: quantity, units, value ranges, chemical, project, places]

Humans aren’t designed to mine this …
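Nuggets such as quantities, units and value ranges are regular enough for a machine to pull out. Here is a rough sketch with a single regex; the unit list is a small hypothetical sample, the sentence is invented, and compound units such as mg/kg are deliberately not handled.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"
# Hypothetical, deliberately short unit list; longer units come before their prefixes.
UNIT = r"(?:mg|kg|g|mL|L|mmHg|°C|%|h|min)"

QUANTITY = re.compile(
    rf"(?<![\w.])(?P<low>{NUMBER})"                 # a number not glued to a word
    rf"(?:\s*(?:-|–|to)\s*(?P<high>{NUMBER}))?"     # optional upper end of a range
    rf"\s*(?P<unit>{UNIT})(?![A-Za-z])"             # unit, not followed by more letters
)


def extract_quantities(text):
    """Return (low, high, unit) tuples; high is None for single values."""
    results = []
    for m in QUANTITY.finditer(text):
        high = float(m.group("high")) if m.group("high") else None
        results.append((float(m.group("low")), high, m.group("unit")))
    return results


# Invented sentence, for illustration only.
sentence = (
    "Samples were heated to 37 °C for 30 min. Participants received "
    "2.5 to 5 mg daily, and 40% reported relief."
)
print(extract_quantities(sentence))
```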

Page 26: ContentMining and Clinical Trials

Advanced Plugins

Page 27: ContentMining and Clinical Trials

http://chemicaltagger.ch.cam.ac.uk/

Typical chemical synthesis

Page 28: ContentMining and Clinical Trials

Open Content Mining of FACTs

Machines can interpret chemical reactions

We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.

Page 29: ContentMining and Clinical Trials
Page 30: ContentMining and Clinical Trials
Page 31: ContentMining and Clinical Trials

UNITS

TICKS

QUANTITY

SCALE

TITLES

DATA!! 2000+ points

VECTOR PDF

Page 32: ContentMining and Clinical Trials

Dumb PDF

CSV

Semantic Spectrum

2nd Derivative

Smoothing Gaussian Filter

Automatic extraction
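Once the data points are out of the PDF as numbers, operations like the ones named on this slide become a few lines of code. The sketch below shows Gaussian smoothing and a finite-difference second derivative in plain Python, then uses the most negative curvature to locate a peak. It is an assumption about what such processing might look like, not the code behind the slide, and the spectrum is synthetic.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalised Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, sigma=1.0, radius=3):
    """Convolve with a Gaussian kernel, clamping indices at the edges."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * values[j]
        out.append(acc)
    return out


def second_derivative(values, step=1.0):
    """Central finite difference; the result is two points shorter than the input."""
    return [
        (values[i - 1] - 2 * values[i] + values[i + 1]) / (step * step)
        for i in range(1, len(values) - 1)
    ]


def sharpest_peak(values, sigma=1.0, radius=3):
    """Index of the most negative curvature in the smoothed signal."""
    curvature = second_derivative(smooth(values, sigma, radius))
    return curvature.index(min(curvature)) + 1  # +1: derivative list starts at index 1


# Synthetic spectrum: a single Gaussian peak centred on index 10.
spectrum = [math.exp(-((i - 10) ** 2) / 8.0) for i in range(21)]
print(sharpest_peak(spectrum))
```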

Page 33: ContentMining and Clinical Trials

AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home

Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:

AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other.

CLICK HERE FOR ANIMATION

(may be browser dependent)

Page 35: ContentMining and Clinical Trials

Phytochemistry extraction

O. dayi

“volatile composition of “

A.sibeiri

A. judaica

Displayed by CAT (CottageLabs)

Page 36: ContentMining and Clinical Trials

contentmine.org proposed Services

• Workshops
• Repository indexing
• Funder Compliance
• Publication enhancement
• Extraction of scientific data

Page 37: ContentMining and Clinical Trials

contentmine.org team