Transcript
Page 1: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

PaperMaker: Validation of biomedical scientific publications

January 19th, 2011

Workshop: „BeyondThePdf“

Dietrich Rebholz-Schuhmann, MD, PhD

Group Leader, Rebholz Group

European Bioinformatics Institute

Page 2: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

Literature and Text Mining, BioCreative III, Rebholz

Publishing is about …

• ... Agreeing / disagreeing about current science

• Only peer review can judge current science

• ... Bringing new results

• Conceptual results are more difficult than new data

• ... Gaining new knowledge

• New data and new results can imply new knowledge of which even the author is still unaware

• ... Rewarding the scientist

• Count whatever you can count that could have an impact.

• Validating the scientist’s claim is the key reward.

• Any scientist can fool any system, but (hopefully) only in the short term

20.01.2011

Page 3: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

Future of biomedical text mining

Working towards ...

• ... Literature integration

• to make it a fully fledged part of bioinformatics data resources

• ... Cross-domain support

• to deliver the content to different scientific communities.

• ... Provenance

• to carry credit for findings into analytical biomedical research

• ... Inference & Reasoning

• to make use of the full semantic support in the scientific literature

Page 4: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

Literature content in the Semantic Web

Page 5: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

Terminologies vs. Ontologies

Database type resource building

• Terminologies, collection of terms

• Automatic generation

• Exploitation of terminological features

• Standardisation of TM solutions

• Interoperability with database resources

Ontological resources

• Explicit semantics

• Manual generation

• Consistency, inference, reasoning

• Interoperability with all semantic resources

• Working towards a reasoning infrastructure

Page 6: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

Efforts in the Rebholz group towards interoperability of literature with bioinformatics

• Whatizit infrastructure

• Biomedical NER as a public, large-scale service

• LexEBI / BioLexicon (collab. w. NaCTeM, Pisa-U)

• Biomedical terminological resource, standardisation of semantics

• IeXML (BioLink SIG 2006, Brazil)

• Put the annotations into the document (inline annotations)

• CALBC project

• Collaborative annotation of a large-scale biomedical corpus

• UKPMC: UK PubMed Central (collab. w. NaCTeM, BL)

• Use of Whatizit, BioLexicon, IeXML, CALBC alignments for the delivery of quality annotation services to the public

• SESL project

• Joint project with pharma & publishers, literature content in a triple store

• PaperMaker

• Validation of the scientific literature against the above

Page 7: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

1 Whatizit

Page 8: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

Integrating biomedical literature and data

Rebholz-Schuhmann, D., et al. Text Processing through Web Services: Calling Whatizit. Bioinformatics 24, no. 2 (2008): 296-98.

Page 9: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

2 BioLexicon

LexEBI

Page 10: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

LexEBI: content

Group       Resource          # Labels  # Variants      Total  Total / Labels  # Unique terms  Uniq. T. / Labels
Gene/Prot.  GP 7.0             516,113   4,005,040  4,521,153            8.76       1,726,853               3.35
Gene/Prot.  GP 6.0             488,577   3,389,316  3,877,893            7.94       1,564,436               3.20
Chemicals   Jochem             278,578   1,691,980  1,970,558            7.07       1,527,752               5.48
Chemicals   ChEBI               19,645      94,748    114,393            5.82         101,307               5.16
Chemicals   ChEBI (all)        549,838   1,187,322  1,737,160            3.16
Other       Enzymes              4,905       8,082     12,987            2.65          12,377               2.52
Other       Species            643,280     199,130    842,410            1.31         838,135               1.30
Other       Interpro            20,671           0     20,671            1.00          20,671               1.00
UMLS        Antineuro., Neo      4,718       6,488     11,206            2.38
UMLS        Bio. Act.           54,148      87,209    141,357            2.61
UMLS        Enzymes             26,065      56,332     82,397            3.16
UMLS        Lipid, Carb.        11,518       9,770     21,288            1.85
UMLS        Pharm. Act.        104,201     123,840    228,041            2.19
UMLS        Vit., Horm.          6,877      10,258     17,135            2.49
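As a worked check, the ratio columns of the LexEBI table can be reproduced from the counts: Total is # Labels plus # Variants, and each ratio divides by # Labels. The sketch below uses three rows copied from the table; the function and variable names are illustrative and not part of LexEBI.

```python
# Counts copied from the LexEBI table: (labels, variants, unique terms).
ROWS = {
    "GP 7.0": (516_113, 4_005_040, 1_726_853),
    "Jochem": (278_578, 1_691_980, 1_527_752),
    "Interpro": (20_671, 0, 20_671),
}


def ratios(labels, variants, unique_terms):
    """Return (total, total per label, unique terms per label)."""
    total = labels + variants
    return total, round(total / labels, 2), round(unique_terms / labels, 2)


for name, counts in ROWS.items():
    total, per_label, unique_per_label = ratios(*counts)
    print(f"{name}: total={total:,} total/label={per_label} "
          f"unique/label={unique_per_label}")
```

A ratio of 1.00, as for Interpro, means every label is its own only term; higher values indicate more variants per label.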

Page 11: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

3 IeXML

Page 12: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

IeXML: Annotating entities in text

• Inline annotations can be attached to any part of the document

• No hassle with character or byte counts or layout modifications to the document

• “Alignment” of annotated documents to

• Compare annotations

• Validate annotations

• Harmonise annotations (SESL project)
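The idea of inline annotation and alignment can be sketched as follows. This is a minimal illustration, not the IeXML schema: the `<e id="...">` element, the ids, and the function names are all assumptions made for the example. It shows that the underlying text is recovered by stripping the markup (no character offsets needed) and that two annotated copies of the same text can then be compared.

```python
import xml.etree.ElementTree as ET

# Two hypothetical annotators marking up the same sentence inline.
# The <e id="..."> element and the ids are illustrative, not the IeXML schema.
DOC_A = '<s>Mutations in <e id="gene:1">BRCA1</e> raise cancer risk.</s>'
DOC_B = ('<s>Mutations in <e id="gene:1">BRCA1</e> raise '
         '<e id="disease:7">cancer</e> risk.</s>')


def plain_text(xml_string):
    """Strip the inline markup and return the underlying text."""
    return "".join(ET.fromstring(xml_string).itertext())


def annotations(xml_string):
    """Collect (id, covered text) pairs from the inline <e> elements."""
    root = ET.fromstring(xml_string)
    return {(e.get("id"), "".join(e.itertext())) for e in root.iter("e")}


def align(doc_a, doc_b):
    """Compare two annotated copies of the same text."""
    if plain_text(doc_a) != plain_text(doc_b):
        raise ValueError("documents differ in their underlying text")
    a, b = annotations(doc_a), annotations(doc_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


result = align(DOC_A, DOC_B)
print(result)
```

Because the sketch collects annotations into sets, it ignores position and repeated mentions; a fuller comparison would walk both documents in parallel.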

Page 13: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

4 CALBC

Page 14: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

The challenge

150,000 documents or more ...

Test set for all systems

Assessment, benchmarking

Page 15: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

CALBC Challenge II

(1) 75,000 documents training data

(2) 175,000 testing data

(3) Additional 700,000 testing data

• September 13th 2010: Second harmonized corpus available for CALBC Challenge II

• December 15th, 2010: Challenge II closes

• March 2011: CALBC Workshop II

• June 30th, 2011: Final harmonized corpus available
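The slides do not say how a harmonized corpus is derived from the individual submissions. Purely as an illustration, the sketch below assumes a simple voting scheme: an annotation is kept when at least `min_votes` systems propose it. The system names, the tuple layout (document, start, end, type) and the threshold are hypothetical.

```python
from collections import Counter


def harmonise(system_annotations, min_votes=2):
    """Keep annotations proposed by at least `min_votes` systems."""
    votes = Counter()
    for annotations in system_annotations.values():
        votes.update(set(annotations))  # each system votes once per annotation
    return sorted(a for a, n in votes.items() if n >= min_votes)


# Hypothetical output of three annotation systems on one document.
SYSTEMS = {
    "system_1": [("doc1", 13, 18, "gene"), ("doc1", 25, 31, "disease")],
    "system_2": [("doc1", 13, 18, "gene")],
    "system_3": [("doc1", 13, 18, "gene"), ("doc1", 25, 31, "disease"),
                 ("doc1", 40, 44, "species")],
}

print(harmonise(SYSTEMS))
```

Raising `min_votes` trades recall for agreement: with `min_votes=3` only the annotation shared by all three systems survives.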

Page 16: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

5 UKPMC/Elixir

Page 17: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

Page 18: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

UKPMC

~10% of the size of PubMed

Page 19: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

6 SESL

Page 20: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

SESL Project: from publisher to pharma

[Figure: SESL architecture diagram. Labels: Content Suppliers; Multiple Consumers; Common Service Broker; Service Layer (RDF, Web 2.0); Std Public Vocabularies; Business Rules; Open Standards; Knowledge Applications; Disease Dossier; Assertions, SPARQL, Triple Store; Integration, Inference, Reasoning; Sharing of data]
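To make the notion of assertions in a triple store concrete, here is a minimal in-memory sketch. It is not SPARQL and not the SESL implementation: the identifiers, predicates and helper functions are invented for illustration, and a real deployment would use an RDF store queried through SPARQL.

```python
# A tiny in-memory triple store; None acts as a wildcard in a query pattern.
# All subjects, predicates and objects below are made-up placeholders.
TRIPLES = {
    ("doc:1", "mentions", "gene:A"),
    ("doc:1", "mentions", "disease:X"),
    ("doc:2", "mentions", "gene:A"),
    ("gene:A", "associatedWith", "disease:X"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return all triples that agree with the non-None parts of the pattern."""
    return sorted(
        t for t in store
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


def documents_for_disease(store, disease):
    """Join two patterns: genes tied to a disease, then documents mentioning them."""
    genes = {s for s, _, _ in match(store, predicate="associatedWith", obj=disease)}
    return sorted({s for s, _, o in match(store, predicate="mentions") if o in genes})


print(documents_for_disease(TRIPLES, "disease:X"))
```

The join in `documents_for_disease` finds doc:2 even though that document never mentions the disease itself, which is the kind of integration across sources the slide points at.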

Page 21: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

Literature content in the Semantic Web

Page 22: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

7 PaperMaker

Page 23: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

PaperMaker - Overview

• PaperMaker - a tool to support authors writing biomedical papers:

• Interactive feedback on the contents of papers (related work and concept annotations)

• Formal consistency criteria checking (spelling, terminology, acronyms, references)

30.03.2009

Page 24: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

Consistency parameters

Domain-independent

• General spelling and grammar

• General readability

• Appropriate use of references

• Finding and acknowledging related work

Page 25: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

Consistency parameters

Domain-specific

• The use of terminology:

• Should be consistent with domain-specific naming guidelines

• Should not be ambiguous

• Should conform to the conventional usage (possible clashes between naming guidelines and common-sense convention)

• Useful to resolve terminology to reference databases (e.g. UniProt for protein names, ChEBI for chemical entities, etc.)

• The special case of acronyms
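One way the acronym case could be checked is sketched below. It is a heuristic illustration, not PaperMaker's method: it assumes acronyms are introduced as "long form (ACRONYM)", guesses the long form as the N words before an N-letter acronym, and flags acronyms that are used without a definition or defined with more than one long form. The sample text is invented.

```python
import re
from collections import defaultdict

ACRONYM_IN_PARENS = re.compile(r"\(([A-Z][A-Z0-9]+)\)")
ACRONYM_TOKEN = re.compile(r"\b[A-Z][A-Z0-9]+\b")


def find_definitions(text):
    """Map each acronym to the set of long forms it is introduced with.

    Heuristic: take as many words before "(ABC)" as the acronym has letters.
    """
    definitions = defaultdict(set)
    for m in ACRONYM_IN_PARENS.finditer(text):
        acronym = m.group(1)
        words = text[:m.start()].split()
        long_form = " ".join(words[-len(acronym):]).lower()
        definitions[acronym].add(long_form)
    return definitions


def check_acronyms(text):
    """Report acronyms that are never defined or defined inconsistently."""
    definitions = find_definitions(text)
    used = set(ACRONYM_TOKEN.findall(text))
    return {
        "undefined": sorted(used - set(definitions)),
        "ambiguous": sorted(a for a, forms in definitions.items() if len(forms) > 1),
    }


SAMPLE = (
    "We use named entity recognition (NER) on abstracts. "
    "NER output is stored as XML. "
    "Later, normalised entity ranking (NER) is applied."
)
print(check_acronyms(SAMPLE))
```

On the sample, XML is reported as undefined and NER as ambiguous because it is introduced with two different long forms.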

Page 26: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

Content feedback

• Resolving the contents to literature repositories

• Finding related work (document retrieval)

• Finding related ideas (passage retrieval)

• Resolving the contents to ontological reference databases

• MeSH descriptors have been demonstrated to improve biomedical information retrieval. Can we suggest MeSH terms directly to the authors?

• Gene Ontology (GO) terms are increasingly used in information extraction systems.
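Suggesting descriptors to an author could, in its simplest form, be a dictionary lookup over the draft. The sketch below assumes exactly that; it is not how PaperMaker works. The mini-vocabulary is illustrative and has not been checked against MeSH, and the function name is hypothetical.

```python
import re

# Hypothetical mini-vocabulary: surface term -> descriptor label.
# A real system would load terms and descriptors from MeSH itself.
VOCABULARY = {
    "breast cancer": "Breast Neoplasms",
    "breast neoplasm": "Breast Neoplasms",
    "text mining": "Data Mining",
}


def suggest_descriptors(text, vocabulary=VOCABULARY):
    """Return descriptors whose surface terms occur in the text, most frequent first."""
    lowered = text.lower()
    counts = {}
    for term, descriptor in vocabulary.items():
        # Allow an optional plural "s" after the term.
        hits = len(re.findall(r"\b" + re.escape(term) + r"s?\b", lowered))
        if hits:
            counts[descriptor] = counts.get(descriptor, 0) + hits
    return sorted(counts, key=lambda d: (-counts[d], d))


DRAFT = (
    "We apply text mining to breast cancer abstracts. "
    "Breast neoplasms are frequent; breast cancer terms vary."
)
print(suggest_descriptors(DRAFT))
```

Ranking by frequency is only one possible choice; retrieval-based ranking against indexed literature would be another.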

Page 27: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

PaperMaker workflow

Page 28: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

Page 29: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

Page 30: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

Page 31: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

Page 32: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

Conclusions

• PaperMaker can help the author conform to the formal requirements of paper writing, with special emphasis on the domain

• It also provides feedback on the contents by relating them to reference resources and literature repositories

• It may improve the indexing of a paper in literature repositories (less ambiguous terminology)

• http://www.ebi.ac.uk/Rebholz-srv/PaperMaker

Work in progress

Page 33: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

8 Summary

Page 34: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011

Efforts in the Rebholz group towards interoperability of literature with bioinformatics

• Whatizit infrastructure

• Biomedical NER as a public, large-scale service

• LexEBI / BioLexicon (collab. w. NaCTeM, Pisa-U)

• Biomedical terminological resource, standardisation of semantics

• IeXML (BioLink SIG 2006, Brazil)

• Put the annotations into the document (inline annotations)

• CALBC project

• Collaborative annotation of a large-scale biomedical corpus

• UKPMC: UK PubMed Central (collab. w. NaCTeM, BL)

• Use of Whatizit, BioLexicon, IeXML, CALBC alignments for the delivery of quality annotation services to the public

• SESL project

• Joint project with pharma & publishers, literature content in a triple store

• PaperMaker

• Validation of the scientific literature against the above

Page 35: PaperMaker, BeyondThePdf, RebholzSchuhmann, 19Jan2011
