CBioC: Massive Collaborative Curation of Biomedical Literature

Chitta Baral, Hasan Davulcu, Anthony Gitter, Graciela Gonzalez, Geeta Joshi-Tope, Mutsumi Nakamura, Prabhdeep Singh, Luis Tari, and Lian Yu.


Page 1: CBioC: Massive Collaborative Curation of Biomedical Literature

CBioC: Massive Collaborative Curation of Biomedical Literature

Chitta Baral, Hasan Davulcu, Anthony Gitter, Graciela Gonzalez, Geeta Joshi-Tope, Mutsumi Nakamura, Prabhdeep Singh, Luis Tari, and Lian Yu.

Page 2: CBioC: Massive Collaborative Curation of Biomedical Literature

Premise – current status of curation from text

  • Our initial focus is on curation of “knowledge” nuggets from biomedical articles.
  • About 15 million abstracts in PubMed.
  • 3 million published by US and EU researchers during 1994-2004 (800 articles per day).
  • 300K articles published so far reporting protein-protein interactions in human, yeast and mouse.
  • BIND (in 7 yrs): 23K; DIP: 3K; MINT: 2.4K.

Page 3: CBioC: Massive Collaborative Curation of Biomedical Literature

Premise: High cost of human curation

  • The overwhelming cost of large curation efforts may be unsustainable for long periods.
  • BIND: bad news in Nov 2005.
    • Operated for 7 years
    • Listed over 100 curators and programmers
    • Received CAD $29 million in 2003, plus other funding
  • Curation efforts of AFCS have recently stopped.
  • Lack of funding for some genome annotation projects.

Page 4: CBioC: Massive Collaborative Curation of Biomedical Literature

Premise: summary

  • Human curation of text is expensive.
  • Human curation of text is not scalable.
  • Human curation of text is not sustainable.

Page 5: CBioC: Massive Collaborative Curation of Biomedical Literature

Why not resort to computers? – do automatic extraction

  • Lessons from the DARPA-funded MUCs (Message Understanding Conferences), run for a decade in the 90s at a cost of tens of millions of dollars:
    • Getting to 60% recall and precision is quick.
    • Then every 5% improvement is about a year's work.
    • Even when we get to 90% for an individual entity extraction, recognizing 4 related entities gives (0.9)^4 ≈ 0.66.
  • Lessons from biomedical text extraction:
    • No proper evaluation.
    • Recognized that recall and precision are not very good even in the “best” systems.
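The compounding effect can be checked with a short calculation. This is a minimal sketch that assumes every entity is recognized independently and with the same accuracy; the function name `joint_accuracy` is illustrative only.

```python
def joint_accuracy(per_entity_accuracy: float, n_entities: int) -> float:
    """Chance that all n entities of one relation are recognized correctly.

    Assumes errors on different entities are independent, which is a
    simplifying assumption of this sketch.
    """
    return per_entity_accuracy ** n_entities


if __name__ == "__main__":
    # 90% per entity, 4 related entities: about two relations in three survive.
    print(round(joint_accuracy(0.9, 4), 4))
```

Under the same assumption, even 95% per entity leaves only about 81% for a four-entity relation.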

Page 6: CBioC: Massive Collaborative Curation of Biomedical Literature

What do we do?

  • How do we curate not only the existing articles, but also the future articles?
  • Too important to give up!
  • Need to think of a new way to do it.
  • Faster computers, better sequencing technology and better algorithms came to the rescue of the Human Genome Project.
  • Hmm. What resources are we overlooking?

Page 7: CBioC: Massive Collaborative Curation of Biomedical Literature

Key Idea

If lots of articles are being written, then lots of people are writing them and lots of people are reading them.

If only we could make these people (the authors and the readers) contribute to the curation effort …

Especially the readers; the ones who need the curated data!

Page 8: CBioC: Massive Collaborative Curation of Biomedical Literature

Mass collaboration has worked in

  • Wikipedia
  • Project Gutenberg
  • Netflix rating
  • Amazon rating
  • Etc.

Page 9: CBioC: Massive Collaborative Curation of Biomedical Literature

Mass collaborative curation: initial hurdles

  • Russ Altman mentioned the challenges with respect to the authors:
    • Sticking to a format
    • Submitting data
  • An average reader:
    • (S)he is not normally interested in filling in a blank curation form.
    • We cannot make an average reader go through curation training.
  • So it has to be very different from just making the existing curation tools available to the masses and expecting them to contribute.

Page 10: CBioC: Massive Collaborative Curation of Biomedical Literature

Mass collaborative curation : key initial ideas

  • Make it very easy:
    • The user need not remember where (which database, which web page) to put the curated knowledge.
    • The curation opportunity should present itself seamlessly.
  • Curation should not be a burden to an average user:
    • Make the curated knowledge “thin”.
  • There should be immediate rewards.
  • Do not start with a blank slate.

Page 11: CBioC: Massive Collaborative Curation of Biomedical Literature

Realization of the key ideas

  • A biologist with a gene name:
    • Goes to PubMed, types the gene name, and clicks on one of the abstracts.
    • The curation panel presents itself automatically.
  • Our approach calls for researchers to contribute to the curation of facts as they read and research over the web.
  • But not with a blank slate:
    • No one wants to be the first one!
    • Automatic extraction jump-starts the process, and then researchers improve upon the extracted data, “ironing out” inconsistencies by subsequent edits on a massive scale.
  • Thin schemas:
    • Average users are turned off by traditional wide schemas.
    • Wide schemas need to be broken down.
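One way to picture a thin schema is a small core record plus optional details that can be added later, one field at a time. The sketch below is hypothetical: the class `ThinInteraction`, its field names, and the default list of wanted details are assumptions made for illustration, not the CBioC database layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ThinInteraction:
    """Core fact a reader can confirm in seconds (hypothetical layout)."""
    pmid: str
    agent: str
    relation: str
    target: str
    # Optional details live outside the core record, so they can be
    # filled in later, by anyone, one field at a time.
    details: Dict[str, str] = field(default_factory=dict)

    def add_detail(self, name: str, value: str) -> None:
        self.details[name] = value

    def missing(
        self, wanted: Tuple[str, ...] = ("organism", "experimental_method")
    ) -> List[str]:
        """List the wanted detail fields nobody has supplied yet."""
        return [name for name in wanted if name not in self.details]
```

A wide schema would require all of these fields up front; here a reader can stop after the four core fields and still leave a usable record.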

Page 12: CBioC: Massive Collaborative Curation of Biomedical Literature

Case Study with CBioC

  • When the abstract is displayed, all of the interactions reported in the abstract are shown.
  • The interactions are either automatically extracted in advance by our system or, for brand-new abstracts, extracted at display time.
  • Thus, data becomes immediately available.
  • Researchers then edit the extracted data, add new interactions, vote on the accuracy of the extraction, assign a confidence rating, and read comments from other researchers.
  • If one or more of them goes deep into obtaining related info, the effort is not wasted and the rest of the community benefits.
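The "extracted in advance, or at display time" behaviour can be sketched as a simple lookup with a fallback. This is only an illustration under assumptions: the in-memory dictionary `store`, the `extract` callable, and the function name are hypothetical stand-ins for whatever storage and extraction system is actually used.

```python
from typing import Callable, Dict, List, Tuple

Interaction = Tuple[str, str, str]  # (agent, relation, target)


def interactions_for_display(
    pmid: str,
    abstract_text: str,
    store: Dict[str, List[Interaction]],
    extract: Callable[[str], List[Interaction]],
) -> List[Interaction]:
    """Return stored interactions, extracting on first display if needed."""
    if pmid not in store:
        # Brand-new abstract: run extraction now and keep the result,
        # so later readers see (and can edit) the same records.
        store[pmid] = extract(abstract_text)
    return store[pmid]
```

Because the result is stored under the PMID, any edits made by one reader are what the next reader starts from.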

Page 14: CBioC: Massive Collaborative Curation of Biomedical Literature

Basic curation with CBioC

Interactions are corrected, incorrect extractions are “voted down”, and each interaction is rated on reliability based on the experimental evidence presented by the author.

It takes a few seconds to vote on the correctness of the extractions.

With little effort by each researcher, information is made available immediately to the whole community.

Page 17: CBioC: Massive Collaborative Curation of Biomedical Literature

with more effort…

  • Any researcher who wishes to do a bit more can:
    • add interactions missed by the extraction system
    • add interactions reported within the full article
    • fill in more fields in the database (such as organism, experimental method, location of the interaction, or supporting evidence).
  • Added interactions are subject to the community vote, just like the automatically extracted interactions.

Page 20: CBioC: Massive Collaborative Curation of Biomedical Literature

Case Study 2: Modifying

A researcher could also modify the reported interactions.

For example, consider the following statement in PMID 16297884:

PKCalpha but not PKCepsilon phosphorylated the catalytic subunit of the p110alpha/p85alpha PI3K

Page 21: CBioC: Massive Collaborative Curation of Biomedical Literature

Case Study 2: Modifying

  • The automatic extraction system extracted (PKCepsilon, phosphorylates, p110alpha/p85alpha PI3K), an error caused by the grammatical construction of the statement.
  • In this case, the researcher should:
    • vote “No” on the accuracy of the extraction (this one cannot really be modified; it will eventually be “voted down” by enough “No” votes),
    • and/or click “Modify”, edit the interaction, and then rate its reliability based on the evidence presented by the author.
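How "enough No votes" might retire an extraction can be sketched as follows. The class name, the minimum of three votes, and the majority rule are all assumptions chosen for illustration; the slides do not state the actual thresholds.

```python
from dataclasses import dataclass


@dataclass
class ExtractedInteraction:
    agent: str
    relation: str
    target: str
    yes_votes: int = 0
    no_votes: int = 0

    def vote(self, is_correct: bool) -> None:
        """Record one reader's Yes/No vote on the extraction."""
        if is_correct:
            self.yes_votes += 1
        else:
            self.no_votes += 1

    def is_voted_down(self, min_votes: int = 3, no_share: float = 0.5) -> bool:
        """True once enough readers voted and most of them said No.

        min_votes and no_share are hypothetical thresholds.
        """
        total = self.yes_votes + self.no_votes
        if total == 0 or total < min_votes:
            return False
        return self.no_votes / total > no_share
```

With these assumed thresholds, the erroneous (PKCepsilon, phosphorylates, p110alpha/p85alpha PI3K) record would stay visible after one or two “No” votes and be retired after the third.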

Page 27: CBioC: Massive Collaborative Curation of Biomedical Literature

Addressing challenges

  • Use ontologies and some automated tools to address consistency issues.
  • To enter data, a user must register.
  • Does each voter have equal weight?
    • Trust management
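If voters do not carry equal weight, one simple option is a trust-weighted share of “Yes” votes. The sketch below is an assumption about how that could look, not the scheme CBioC uses: the function name, the per-user `trust` table, and the default weight of 1.0 are all hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple


def weighted_yes_share(
    votes: Iterable[Tuple[str, bool]],
    trust: Dict[str, float],
    default_trust: float = 1.0,
) -> Optional[float]:
    """Share of trust-weighted votes that say the extraction is correct.

    votes is a sequence of (user_id, is_correct); users missing from
    the trust table get default_trust. Returns None if nothing was cast.
    """
    yes_weight = 0.0
    total_weight = 0.0
    for user_id, is_correct in votes:
        weight = trust.get(user_id, default_trust)
        total_weight += weight
        if is_correct:
            yes_weight += weight
    if total_weight == 0:
        return None
    return yes_weight / total_weight
```

Setting every weight to 1.0 reduces this to a plain vote count, so equal weighting remains a special case.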

Page 28: CBioC: Massive Collaborative Curation of Biomedical Literature

Summary so far

  • The information/curation window pops up automatically.
  • Automatic extraction is used as a bootstrap so that no user is working on a blank slate.
  • Users vote on correctness, make corrections, and add facts.
  • Suppose the automatic extraction system has 60% precision and recall:
    • A person will have an easier time discarding the 40% of wrongly extracted text than identifying the 60% of correct entries and entering them!
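The 60% example can be turned into rough counts. This is a back-of-the-envelope sketch: the function name and the returned keys are made up for illustration, and it ignores the differing effort of a click versus typing a full entry, which is the real point of the argument.

```python
def curation_workload(true_facts: int, precision: float, recall: float) -> dict:
    """Rough item counts for a reader, with and without bootstrapping.

    Assumes 0 < precision <= 1 and 0 <= recall <= 1.
    """
    if not 0 < precision <= 1 or not 0 <= recall <= 1:
        raise ValueError("precision must be in (0, 1], recall in [0, 1]")
    found = round(true_facts * recall)    # correct facts the extractor found
    extracted = round(found / precision)  # everything the extractor proposed
    return {
        "to_confirm": found,                  # quick Yes votes
        "to_discard": extracted - found,      # quick No votes
        "to_add": true_facts - found,         # facts the extractor missed
        "to_enter_from_scratch": true_facts,  # workload without bootstrapping
    }
```

For 100 true facts at 60% precision and recall, this gives 60 entries to confirm, 40 to discard and 40 to add, against 100 entries typed in from a blank slate.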

Page 29: CBioC: Massive Collaborative Curation of Biomedical Literature

Very useful byproducts

  • Avoids some problems with the existing human curation approach:
    • Curators’ bias
    • Curators miss things
    • Curators have disagreements
    • Slow access to newest findings
    • Researchers at large have little or no control over what gets curated and when
  • A large curated corpus of text gets created:
    • Very useful to evaluate and improve automated extraction systems.

Page 30: CBioC: Massive Collaborative Curation of Biomedical Literature

Other features

Other abstracts related to the specific interaction are accessible through the “More Articles” link.

We are in the process of integrating data from publicly available databases.

All data (raw and processed) will be publicly available.

We are working on an independent data access and querying engine.

Page 31: CBioC: Massive Collaborative Curation of Biomedical Literature

Issues and further challenges

  • Works well for certain kinds of knowledge curation (interactions, …), but what about others (genome annotation?)
  • Null values
  • Full papers versus abstracts
  • Are thin schemas enough?
  • Curating new kinds of knowledge

Page 32: CBioC: Massive Collaborative Curation of Biomedical Literature

Current status, current funding, call for collaboration

  • Funded by Arizona State University.
  • Second (basic) beta version released.
  • Proposals sent for a fully functional implementation.
  • Some collaborations with outside groups are in the works.

Page 33: CBioC: Massive Collaborative Curation of Biomedical Literature

Current publications

Collaborative Curation of Data from Bio-medical Texts and Abstracts and its integration. Chitta Baral, Hasan Davulcu, Mutsumi Nakamura, Prabhdeep Singh, Lian Yu and Luis Tari. Proceedings of the 2nd International Workshop on Data Integration in the Life Sciences (DILS'05), San Diego, July 20–22, 2005. Lecture Notes in Computer Science, Springer. An initial report.

Ready to be submitted to a journal.

Page 34: CBioC: Massive Collaborative Curation of Biomedical Literature

Group members and advisory board

  • Postdocs: Lian Yu and Graciela Gonzalez
  • Biomedical expertise: Geeta Joshi-Tope (curation), Mike Berens (signal transduction in oncology)
  • Students: Luis Tari, Prabhdeep Singh, Anthony Gitter, Amanda Ziegler
  • Advisory board: Gary Bader, Ken Fukuda, Shankar Subramanian.

Page 35: CBioC: Massive Collaborative Curation of Biomedical Literature

Thanks

Questions!