MITRE © 2001 The MITRE Corporation. ALL RIGHTS RESERVED. Mining the Biomedical Literature: Creating a Challenge Evaluation Lynette Hirschman Chief Scientist Information Technology Center The MITRE Corporation Bedford, MA USA

Mining the Biomedical Literature: Creating a Challenge Evaluation · 2015-07-28 · Literature Mining Overview Information Extraction: documents to entities, relations MEDLINE PIR



Outline

1. Overview: why mine the literature?
2. Where we are: technologies for mining text
3. Creating a challenge evaluation
4. Recommendations


Why Mine the Literature?

• Biologists need information contained in text
  - To integrate information across articles (e.g., in constructing metabolic pathways)
  - To refine sequence searches, e.g., literature search coupled to BLAST searches (Chang et al., 2001)
  - To research prior art (for patents)
  - To update databases
• Natural language processing offers the tools to make information in text accessible


How Good is Current Language Processing?

• Natural language processing (NLP) works!
• Automated NLP systems exist now that can:
  - Return documents relevant to a subject (information retrieval)
  - Identify entities (90-95% accuracy) or relations among entities (70-80% accuracy) in text (information extraction)
  - Answer factual questions using large document collections at 75-85% accuracy (question answering)
• But... these systems work on news, not biology
  - And we don’t have comparable performance metrics for biology


Text Mining for Biology

• Current situation:
  - There are increasing numbers of groups working on NLP for biology
  - Each group reports results for a particular task, on a specialized data set
• Right now, it is very difficult to compare results across the groups
• Lack of standards also makes it difficult to share
  - Data and knowledge resources
  - Software components

A common challenge evaluation can focus research and speed progress


Outline

1. Overview: why natural language?
2. Where we are: technologies for mining text
3. Creating a challenge evaluation
4. Recommendations


Literature Mining Overview

[Figure: text-mining technologies arranged by the granularity of the text they work with]

Collections (gigabytes): MEDLINE, PIR, Genbank
Documents (megabytes)
Lists, tables (kilobytes): e.g., a table of Ebola outbreak reports from PROMED (disease, source, country, city, date, cases, new cases, dead)
Phrases (bytes): e.g., “Protease-resistant prion protein interacts with...”

Information Retrieval: key words to documents
Information Extraction: documents to entities, relations
Question Answering: question to answer


Information Retrieval

• Input: query words; Output: ranked list of documents
• Approach
  - Speed, scalability, domain independence and robustness are critical for access to large collections of documents
• Technique
  - Shallow processing provides coarse-grained result (entire documents or passages)
  - Query is transformed to collection of words, but grammatical relations between words lost
  - Documents are indexed by word occurrences
  - Search matches query “probe” against indexed documents using Boolean combination of terms, or vector of word occurrences or language model
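The word-indexing technique described on this slide can be illustrated with a minimal sketch. It is not any particular retrieval system: the three toy documents, the function names, and the crude term-count score are all assumptions made for illustration. It shows documents indexed by word occurrence, a Boolean AND match, and a ranking by counts of query words.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split into words; word order and grammar are discarded."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each word to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def boolean_and(index, query):
    """Return ids of documents that contain every query word."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

def rank_by_term_count(docs, query):
    """Rank documents by total occurrences of query words (a crude word-vector score)."""
    query_words = set(tokenize(query))
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        score = sum(counts[w] for w in query_words)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical toy collection, invented for this example.
DOCS = {
    "d1": "Prion protein interacts with a chaperone protein.",
    "d2": "BLAST searches refine sequence analysis.",
    "d3": "The prion is an infectious agent.",
}

index = build_index(DOCS)
print(boolean_and(index, "prion protein"))        # documents containing both words
print(rank_by_term_count(DOCS, "prion protein"))  # ranked list of document ids
```

Because the query is reduced to a bag of words, "protein interacts with prion" and "prion interacts with protein" retrieve the same documents, which is the loss of grammatical relations the slide mentions.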


Evaluating Text Retrieval

• The Text Retrieval Conference (TREC) has been held annually, starting in 1992, run by NIST*
  - Successful, attracting 100s of international participants from industry, academia, government
• Goal: systematic evaluation of retrieval systems using a large (5 GB) common corpus
  - Given a set of queries, for each query
  - Systems return a ranked list of documents
  - Human judges provide relevance assessments for the ranked documents
  - Relevance judgements are used to compute average precision-recall plots for each system
    - Precision: % returned docs judged relevant
    - Recall: % of relevant documents found

*US National Institute of Standards and Technology
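As a small worked illustration of the two measures defined above, the sketch below computes precision and recall over the top k documents of a ranked list. The document ids and the relevance set are invented for the example, and the function name is an assumption; it is not TREC's scoring software.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall over the top-k documents of a ranked list.

    ranked   -- document ids in the order the system returned them
    relevant -- set of ids that human judges marked relevant
    """
    top = ranked[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgements: 20 relevant documents exist in the collection,
# and 4 of the system's top 5 are among them.
relevant = {"d1", "d2", "d3", "d4"} | {f"r{i}" for i in range(16)}
ranked = ["d1", "d2", "x1", "d3", "d4"]

precision, recall = precision_recall_at_k(ranked, relevant, k=5)
print(precision, recall)
```

Repeating the computation for increasing k traces out one precision-recall curve; averaging such curves over all queries gives the per-system plot.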


Sample TREC Topic

<num> Number: 409
<title> legal, Pan Am, 103

<desc> Description:
What legal actions have resulted from the destruction of Pan Am Flight 103 over Lockerbie, Scotland on December 21, 1988?

<narr> Narrative:
Documents describing any charges, claims or fines presented to or imposed by any court or tribunal are relevant, but documents that discuss charges made in diplomatic jousting are not relevant.

Title: Up to 3 words best describing the topic
Description: One-sentence description of the topic
Narrative: Description of what makes a document relevant or irrelevant


TREC9 Results for a High-Performing System

[Figure: precision (y-axis, 0 to 1) versus recall (x-axis, 0 to 1), with one curve for an automatically generated query and one for a manually generated query]

• 4 of top 5 documents relevant: precision = 80%; recall low!
• This is a representative high-performing system
• Manual (“expert”) choice of query words works better than automatically generated queries
• Note that if you need to find “all” the literature on a subject, you have to look through lots of junk!


Lessons from TREC

• The basic retrieval paradigm indexes words; adding syntax, semantics hasn’t helped (yet)
• For short queries, adding information (words) to the query helps
  - by hand (manual query creation)
  - by thesaurus (synonyms, semantic classes)
  - by feedback of relevant documents to cull more key words
• Need finer-grained retrieval -- documents too big!
  - This has led to question-answering evaluation


Information Extraction

• Information extraction is the identification of domain-specific classes of entities & relations among them
• Input for extraction: documents; Output: entities in documents, lists of relations
• Metrics:
  - F-measure: harmonic mean of precision, recall
• Systems need training data (annotated examples from text) to “learn” how to identify entities and relations
  - Data is typically generated by human experts
  - Systems need 1000’s of examples - the more data, the higher the performance


Information Extraction: Epidemiology Example

1. Extract entities from text (color coded via HTML)

2. Extract outbreak events into table

Disease  Source  Country  City_name  Date         Cases  New_cases  Dead
Ebola    PROMED  Uganda   Gula       26-Oct-2000  182    17         64
Ebola    PROMED  Uganda   Gula       5-Nov-2000   280    14         89
Ebola    PROMED  Uganda   Gulu       13-Oct-2000  42     9          30
Ebola    PROMED  Uganda   Gulu       15-Oct-2000  51     7          31
Ebola    PROMED  Uganda   Gulu       16-Oct-2000  63     12         33
Ebola    PROMED  Uganda   Gulu       17-Oct-2000  73     2          35
Ebola    PROMED  Uganda   Gulu       18-Oct-2000  94     21         39
Ebola    PROMED  Uganda   Gulu       19-Oct-2000  111    17         41

3. Display events...

[Figure: Total Cases; New Cases. Number of cases (y-axis, 0 to 400) over time (x-axis, 10/13/2000 to 11/24/2000), with series Cases, New_cases and Dead]
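The step from extracted table rows to a displayable time series can be sketched as follows. This is only an illustration of the idea, not the system shown on the slide: the function names are assumptions, the rows are copied from the table on this slide, and the date parsing assumes English month abbreviations (the default C locale).

```python
from datetime import datetime

# Rows as they appear in the extracted outbreak table (whitespace separated).
ROWS = """\
Ebola PROMED Uganda Gula 26-Oct-2000 182 17 64
Ebola PROMED Uganda Gula 5-Nov-2000 280 14 89
Ebola PROMED Uganda Gulu 13-Oct-2000 42 9 30
Ebola PROMED Uganda Gulu 15-Oct-2000 51 7 31
Ebola PROMED Uganda Gulu 16-Oct-2000 63 12 33
Ebola PROMED Uganda Gulu 17-Oct-2000 73 2 35
Ebola PROMED Uganda Gulu 18-Oct-2000 94 21 39
Ebola PROMED Uganda Gulu 19-Oct-2000 111 17 41
"""

def parse_row(line):
    """Turn one table row into a record with a typed date and integer counts."""
    disease, source, country, city, date, cases, new_cases, dead = line.split()
    return {
        "disease": disease,
        "source": source,
        "country": country,
        "city": city,
        "date": datetime.strptime(date, "%d-%b-%Y").date(),
        "cases": int(cases),
        "new_cases": int(new_cases),
        "dead": int(dead),
    }

def time_series(rows_text):
    """Parse all non-empty rows and order the events chronologically."""
    events = [parse_row(line) for line in rows_text.splitlines() if line.strip()]
    return sorted(events, key=lambda event: event["date"])

series = time_series(ROWS)
for event in series:
    print(event["date"].isoformat(), event["cases"], event["new_cases"], event["dead"])
```

Once the events are sorted by date, the three count columns can be handed to any plotting tool to draw a chart like the one on the slide.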


Information Extraction Evaluations For Newswire

[Figure: F-measure (accuracy), 0 to 100, by evaluation year (1991, 1992, 1993, 1995, 1998, 1999), with series Names: English, Names: Japanese, Names: Chinese, Relations, and Events]

• Name extraction > 90% in English, Japanese; improving in Chinese
• Relation extraction now at over 80%
• Event extraction less than 60%, improving slowly
• Commercial name taggers exist for news reports in multiple languages


Lessons Learned from Extraction

• Name extraction works well
  - For news (person, organization, location, time, money), results are over 90%
  - Local information is used to identify names, e.g., morphology, terminology lists, local context
• Relations require more information -- identification of 2 entities & their relationship
  - Predicted relation accuracy = Pr(E1) * Pr(E2) * Pr(R) ~ (.93) * (.93) * (.93) = .80
• Events are even harder
  - More slots to fill means lower performance
  - Events require more cross-sentence information
  - Complex syntax in abstracts is a problem (see examples from Park et al., PSB 2001)
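The relation-accuracy estimate above is a product of per-component accuracies, which implicitly treats the component errors as independent. The sketch below reproduces that arithmetic; the function name is an assumption, and the five-slot event is a hypothetical case added only to show why "more slots to fill means lower performance".

```python
import math

def predicted_accuracy(component_accuracies):
    """Product of component accuracies, assuming their errors are independent."""
    return math.prod(component_accuracies)

# Two entities and one relation, each recognized at 93% (figures from the slide).
relation = predicted_accuracy([0.93, 0.93, 0.93])
print(round(relation, 2))  # 0.8

# Hypothetical event with five slots, each filled at 93%.
event = predicted_accuracy([0.93] * 5)
print(round(event, 2))  # 0.7
```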


Question Answering (MITRE’s QANDA System)

Where did Dylan Thomas die?

✖ 1. Swansea: In “Dylan: the Nine Lives of Dylan Thomas,” Fryer makes a virtue of not coming from Swansea
✖ 2. Italy: Dylan Thomas’s widow Caitlin, who died last week in Italy aged 81,
  3. New York: Dylan Thomas died in New York 40 years ago next Tuesday

What diseases are caused by prions?

  1. Both CJD and BSE are caused by mysterious particles of infectious protein called prions
  2. Scientists trying to understand the epidemic face an unusual problem: BSE, scrapie, and CJD are caused by a bizarre infectious agent, the prion which does not follow the normal rules of microbiology.
✖ 3. These diseases are caused by a prion, an abnormal version of a naturally-occurring protein, but researchers have recognized different strains of prions that differ in incubation times, symptoms, and severity of illness. ...


Question Answering

• Stage 1: Question analysis
  - Find type of object that answers the question: “when” needs time, “which proteins” need protein
• Stage 2: Document retrieval
  - Using (augmented) question, retrieve set of possibly relevant documents via information retrieval
• Stage 3: Document processing
  - Search documents for entities of the desired type using information extraction
  - Search for entities in appropriate relations
• Stage 4: Rank answer candidates
• Stage 5: Present the answer (N bytes, or a phrase or a sentence or a summary)
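Stages 1 and 3 above can be illustrated with a toy sketch: map a question to the type of object that answers it, then keep only extracted entities of that type. This is not how QANDA or any TREC system is implemented. The cue table is an assumption (only the "when" and "which proteins" cues come from the slide), and the tagged entities are supplied by hand in place of a real information extraction step.

```python
import re

# Hypothetical cue table: question opening -> expected answer type.
ANSWER_TYPE_CUES = [
    (r"^when\b", "TIME"),
    (r"^which proteins?\b", "PROTEIN"),
    (r"^where\b", "LOCATION"),
    (r"^who\b", "PERSON"),
]

def expected_answer_type(question):
    """Stage 1: find the type of object that answers the question."""
    q = question.strip().lower()
    for pattern, answer_type in ANSWER_TYPE_CUES:
        if re.search(pattern, q):
            return answer_type
    return "UNKNOWN"

def candidates_of_type(question, tagged_entities):
    """Stage 3: keep entities whose type matches the expected answer type.

    tagged_entities -- list of (entity_text, entity_type) pairs, assumed to
    come from an information extraction component.
    """
    wanted = expected_answer_type(question)
    return [text for text, entity_type in tagged_entities if entity_type == wanted]

# Entities tagged by hand for illustration.
entities = [("Dylan Thomas", "PERSON"), ("New York", "LOCATION"), ("40 years ago", "TIME")]
print(expected_answer_type("Where did Dylan Thomas die?"))
print(candidates_of_type("Where did Dylan Thomas die?", entities))
```

A real system would still need to retrieve passages (stage 2) and rank the surviving candidates (stage 4), for example by how closely their context matches the question.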


TREC Q&A 2000 Results (250-byte)

[Figure: bar chart of scores (y-axis, 0.000 to 1.000) by system: SMU, U Waterloo, LIMSI, Imperial College, U Montreal, U Sheffield, U Mass, Nat Taiwan U, D.I. (Pisa), Seoul Nat U]

Harabagiu and Moldovan, Southern Methodist University:
• Mean Reciprocal Rank: 76%
• First Answer Correct: 69%
• Correct Answer in Top 5: 86%

Lessons: question answering works -- at least for simple factual questions


Outline

1. Overview: why natural language?
2. Where we are: technologies for mining text
3. Creating a challenge evaluation
4. Recommendations


What is a Challenge Evaluation?

• A challenge evaluation is … an evaluation party
• The host provides
  - The challenge problem
  - The data to feed (train) the systems
  - The evaluation metric to judge the systems
• The guests bring their systems
• The guests then compete and share results and insights, refereed by the host
• The CASP* evaluations are an example of a challenge evaluation in biology: predict 3D protein structure from linear sequence data

* Critical Assessment of techniques for protein Structure Prediction


Corpus-Based Evaluation Method

1. TASK: Define a useful task where people care about the results
2. GOLD STANDARD: Have human experts create a “gold standard” or answer key for a representative data sample
3. SCORING: Devise a method to score “correctness” of a result compared to the gold standard
4. TRAINING: Use the annotated data to “train” a system to emulate human performance
5. EVALUATION: Evaluate system performance against gold standard on unseen (blind) test data
6. ITERATION: Iterate and improve


Example of a Task*

*Table from Stephens et al, PSB 2001

Extraction of protein-protein relations from the literature via a thesaurus-based approach

Lessons:
• Pick real problems where possible
• Make sure people can do the task
• Choose intuitive evaluation metrics
• Share… data, tools, metrics


Tool from Genia* (Ohta et al, U Tokyo)

*Tools for Ontology-based Corpus Annotation, Tomoko OHTA, Yuka TATEISI, and Jun’ichi TSUJII, University of Tokyo, Tutorial at ISMB 01

[Screenshot of the annotation tool: Click “Insert” to insert a new tag]

• We are starting to see shared data, shared tools
• Now we need to share an evaluation


Outline

1. Overview: why mine the literature?
2. Where we are: technologies for mining text
3. Creating a challenge evaluation
4. Recommendations ☛


Text Mining for Bioinformatics: Recommendations

• Goal:
  - Enable rapid progress in text mining for biology
  - Transfer (or leapfrog over!) results from text processing for newswire
    - Is biology easier? A restricted domain with an ontology
    - Or is it harder - syntax is complex, new terms introduced constantly, confusion between gene vs. protein, ...
• Approach:
  - Create a challenge evaluation for text mining for biology


Steps for Creating a Challenge Evaluation

• Assess state of the art
  - Inventory current approaches & results
  - Inventory available annotated data, knowledge sources (ontologies), NLP tools, standards
• Identify participants
  - Researchers: biologists, NL researchers, bioinformatics researchers, …
  - Identify other stakeholders: biotech industry, pharmaceuticals, standards organizations, …
• Identify infrastructure needs
  - How much data, on what timetable?
  - How to define interesting problems with answers?


Who Do We Need?

• Users (biologists) to define a set of relevant problems with “right answers” & create data sets
• Researchers (NL researchers) with relevant technology, e.g., entity taggers, event extraction, retrieval systems
• Data providers who have relevant ontologies and data collections and standards
• Funders who will pay for
  - Preparation of data
  - Creation of evaluation tools
  - Running the evaluation
• Evaluator, who will coordinate the evaluation

Discussion of possible challenge evaluations will continue at PSB2002 in the Natural Language Session


References

“Automatic extraction of biological information from scientific text: Protein-protein interactions,” Christian Blaschke, Miguel A. Andrade, Christos Ouzounis, and Alfonso Valencia; International Conference on Intelligent Systems for Molecular Biology. Heidelberg, 1999.

“Including Biological Literature Improves Homology Search,” J.T. Chang, S. Raychaudhuri, and R.B. Altman; Pacific Symposium on Biocomputing 6:374-383 (2001).

“Tools for Ontology-based Corpus Annotation,” Tomoko Ohta, Yuka Tateisi, and Jun’ichi Tsujii; Tutorial at ISMB 01.

“Bidirectional Incremental Parsing for Automatic Pathway Identification with Combinatory Categorial Grammar,” J. C. Park, H. S. Kim, and J. J. Kim; Pacific Symposium on Biocomputing 6:396-407 (2001).

“Detecting Gene Relations from MEDLINE Abstracts,” M. Stephens, M. Palakal, S. Mukhopadhyay, R. Raje, and J. Mostafa; Pacific Symposium on Biocomputing 6:483-496 (2001).


Back-Ups


Text Mining for Bioinformatics: Recommendations

(1) Use ontologies to define objects of interest
  - Entity classes (proteins, genes, …)
  - Event types (protein-protein interaction, …)
(2) Create training data to train systems
  - 100s of documents, 1000’s of tagged entities
(3) Systems must handle language complexity
  - Journal abstracts are complex and filled with (new) terminology; this may require different syntactic & discourse processing than newswire
(4) Biologists must specify what output they want
  - E.g., extract database of relations, find answers to questions, seed further searches, ...


1. Task Definition

• Basic principle:
  - If people cannot do a task reliably and reproducibly, a program cannot do it either!
  - Verify by having several experts perform the task (e.g., mark up data with “right answers”)
  - For example, for the task of identifying proper names as Person or Organization or Location, people agree 98% of the time
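One simple way to check whether experts can do a task reliably is to have two of them label the same items and measure how often they agree. The sketch below computes raw percent agreement; the labels are invented for illustration and the function name is an assumption. Raw agreement does not correct for agreement expected by chance, so it is only a first check.

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators assigned the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        raise ValueError("no items to compare")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical labels for five proper names from two annotators.
annotator_1 = ["Person", "Organization", "Location", "Person", "Location"]
annotator_2 = ["Person", "Organization", "Location", "Organization", "Location"]

print(percent_agreement(annotator_1, annotator_2))  # 0.8
```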


2. Corpus Creation

• Define the things of interest for the task, e.g.,
  - Genes, DNA, RNA, proteins, and the relations among them
• Provide correctly annotated data
  - Tools to aid the human annotator are needed

Note: having an ontology and a list of terms is very useful at this stage


3. Define Automated Evaluation Method

• An ideal evaluation function is
  - Intuitive
  - Highly correlated with important functionality, e.g., for speech transcription, evaluation function is word error
  - For NLP, evaluation is often “accuracy” measured as precision and recall:
    - Precision: of things classified, what percent are correct?
    - Recall: of things that should have been in a class, what percent were returned?
    - F-measure: harmonic mean of P&R
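The three measures can be computed directly from a system's output and a gold-standard answer key. The following is a minimal sketch using the balanced F-measure, F = 2PR / (P + R); the function name and the example entity names are assumptions made for illustration.

```python
def precision_recall_f(predicted, gold):
    """Precision, recall and balanced F-measure of a set of predictions
    against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical example: the answer key lists four protein names,
# the system returned three names, two of them correct.
gold = {"p53", "BRCA1", "PrP", "insulin"}
predicted = {"p53", "PrP", "actin"}

p, r, f = precision_recall_f(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Because the harmonic mean is dominated by the smaller value, a system cannot score well on F by maximizing precision alone or recall alone.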


4. System Training

• Identify patterns in data, either by machine learning or by hand-crafted heuristics
• These patterns can recognize previously unseen entities from context

From data co-occurrences in text, we see that:
  - binding to PROTEIN occurs frequently
  - conversion of PROTEIN occurs frequently
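A minimal sketch of how such co-occurrence patterns might be collected from annotated text: count the words that immediately precede each tagged protein mention. The sentences are invented, the `PROTEIN` placeholder stands in for an annotated entity, and the two-word window is an arbitrary assumption; a real system would learn far richer patterns.

```python
from collections import Counter

def left_contexts(sentences, target="PROTEIN", width=2):
    """Count the `width` tokens that immediately precede each target token."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token == target and i >= width:
                counts[" ".join(tokens[i - width:i])] += 1
    return counts

# Hypothetical training sentences in which annotated protein names
# have been replaced by the placeholder PROTEIN.
SENTENCES = [
    "binding to PROTEIN was observed",
    "we measured binding to PROTEIN in vitro",
    "conversion of PROTEIN requires a cofactor",
    "PROTEIN is expressed in brain",
]

patterns = left_contexts(SENTENCES)
print(patterns.most_common())
```

Frequent left contexts such as "binding to" can then serve as evidence that an unfamiliar word appearing after them is also a protein name.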


5. Evaluation

• Test the system on blind (previously unseen) data
• Compare results across systems, to understand what techniques work, which ones don’t

6. Iteration

• Iterate and improve
  - Note: may want to improve the task definition, the scoring function, the amount or quality of training data, the rules used by the system, ...


Evaluating Question Answering Systems

• TREC-9 Q&A Evaluation:
  - For each of 700 factual short-answer questions
  - Each system must return a ranked list of 5 candidate answers (250-byte or 50-byte) based on the standard TREC document collection
  - Each question-answer pair is judged as correct or incorrect by a person (“assessor”)
  - System score is mean reciprocal rank of correct answers
• For TREC-8 and TREC-9, all questions had answers that consisted of a phrase
• Later TRECs will include questions without answers, and questions with lists for answers
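Mean reciprocal rank can be illustrated with a short sketch. For each question, the score is 1 divided by the rank of the first correct answer among the candidates returned, or 0 if none is correct; the system score is the mean over all questions. The judgements below are invented for the example and the function names are assumptions.

```python
def reciprocal_rank(judgements):
    """1/rank of the first correct answer in a ranked list, or 0.0 if none.

    judgements -- list of booleans, one per returned answer, in rank order
    """
    for rank, correct in enumerate(judgements, start=1):
        if correct:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(all_judgements):
    """Average reciprocal rank over all questions."""
    if not all_judgements:
        return 0.0
    return sum(reciprocal_rank(j) for j in all_judgements) / len(all_judgements)

# Hypothetical assessor judgements for three questions, five answers each.
judged = [
    [True, False, False, False, False],   # correct at rank 1 -> 1.0
    [False, True, False, False, False],   # correct at rank 2 -> 0.5
    [False, False, False, False, False],  # no correct answer -> 0.0
]

print(mean_reciprocal_rank(judged))  # 0.5
```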