Research Opportunities in Biomedical Text Mining

Kevin Bretonnel Cohen
Biomedical Text Mining Group Lead

Kevin.cohen@gmail.com
http://compbio.ucdenver.edu/Hunter_lab/Cohen

Information extraction

•Also known as “relation extraction”
•Limited to one or a small number of types of facts
– Contrast information retrieval or question-answering

Information extraction

Information extraction: relationships between things

BINDING_EVENT
  Binder:
  Bound:

Information extraction

Met28 binds to DNA.

BINDING_EVENT
  Binder: Met28
  Bound: DNA
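The frame above can be sketched in code. This is a minimal illustration, not the method of any particular system: the `BindingEvent` class, the `BINDS_TO` pattern and the `extract_binding` function are hypothetical names, and the pattern assumes both participants are single tokens.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class BindingEvent:
    """Frame with the two slots shown on the slide."""
    binder: str
    bound: str


# Hypothetical pattern: "<X> binds to <Y>", where X and Y are single tokens.
BINDS_TO = re.compile(r"(?P<binder>\w+) binds to (?P<bound>\w+)")


def extract_binding(sentence: str) -> Optional[BindingEvent]:
    """Return a filled BindingEvent frame, or None if the pattern does not match."""
    match = BINDS_TO.search(sentence)
    if match is None:
        return None
    return BindingEvent(binder=match.group("binder"), bound=match.group("bound"))


print(extract_binding("Met28 binds to DNA."))
```

A sentence that states the same fact in other words, such as "binding of Met28 to DNA", is not matched by this single pattern.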


Why text mining is difficult

•Variability

•Pervasive ambiguity at every level of analysis


Why text mining is difficult

Met28 binds to DNA
…binding of Met28 to DNA…
…Met28 and DNA bind…
…binding between Met28 and DNA…
…Met28 is sufficient to bind DNA…
…DNA bound by Met28…
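To make the cost of variability concrete, the sketch below writes one hypothetical regular expression per surface form listed above and counts how many of the six variants are covered by the first pattern alone versus by the whole set. The patterns are assumptions written for this illustration; for the symmetric form "Met28 and DNA bind" the assignment of roles is arbitrary.

```python
import re

# Hypothetical patterns, one per surface form. Group "a" is the binder,
# group "b" is the bound entity.
PATTERNS = [
    r"(?P<a>\w+) binds to (?P<b>\w+)",
    r"binding of (?P<a>\w+) to (?P<b>\w+)",
    r"(?P<a>\w+) and (?P<b>\w+) bind\b",
    r"binding between (?P<a>\w+) and (?P<b>\w+)",
    r"(?P<a>\w+) is sufficient to bind (?P<b>\w+)",
    r"(?P<b>\w+) bound by (?P<a>\w+)",
]

VARIANTS = [
    "Met28 binds to DNA",
    "binding of Met28 to DNA",
    "Met28 and DNA bind",
    "binding between Met28 and DNA",
    "Met28 is sufficient to bind DNA",
    "DNA bound by Met28",
]


def extract(text, patterns):
    """Return (binder, bound) from the first pattern that matches, else None."""
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group("a"), match.group("b")
    return None


single = [extract(v, PATTERNS[:1]) for v in VARIANTS]
full = [extract(v, PATTERNS) for v in VARIANTS]
print(sum(r is not None for r in single), "of", len(VARIANTS), "covered by one pattern")
print(sum(r is not None for r in full), "of", len(VARIANTS), "covered by six patterns")
```

Each new way of phrasing the fact needs its own pattern, which is why recall is the usual weak point of hand-written rule sets.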

Why text mining is difficult

…binding of Met28 to DNA…
…binding under unspecified conditions of Met28 to DNA…
…binding of this translational variant of Met28 to DNA…
…binding of Met28 to upstream regions of DNA…


Why text mining is difficult

…binding under unspecified conditions of this translational variant of Met28 to upstream regions of DNA…


Why text mining is difficult

NaCT is expressed in liver, testis and brain in rat and shows preference for citrate over dicarboxylates… (GeneRIF 266998:12177002)

NACT:
•neoadjuvant chemotherapy (PMID 8898170)
•N-acetyltransferase (PMID 10725313)
•Na+-coupled citrate transporter (PMID 12177002)

Why text mining is difficult

NaCT is expressed in liver, testis and brain in rat and shows preference for citrate over dicarboxylates… (GeneRIF 266998:12177002)

•(liver), (testis) and (brain in rat)
•liver, (testis and brain in rat)
•(liver, testis and brain in rat)
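The three bracketings correspond to the modifier "in rat" attaching to the last one, two or three conjuncts. A small sketch, with the hypothetical helper `modifier_scopes`, enumerates these readings; it only generates the candidates and makes no attempt to choose between them.

```python
def modifier_scopes(conjuncts, modifier):
    """List every reading in which `modifier` attaches to the last k conjuncts."""
    readings = []
    for k in range(1, len(conjuncts) + 1):
        unmodified = conjuncts[:-k]
        modified = [f"{c} {modifier}" for c in conjuncts[-k:]]
        readings.append(unmodified + modified)
    return readings


for reading in modifier_scopes(["liver", "testis", "brain"], "in rat"):
    print(reading)
```

The number of readings grows with the number of conjuncts, and each additional modifier multiplies it again.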

Why text mining is difficult

NaCT is expressed in liver, testis and brain in rat and shows preference for citrate over dicarboxylates… (GeneRIF 266998:12177002)

•shows preference for (citrate over dicarboxylates)
•shows preference (for citrate) (over dicarboxylates)

Why text mining is difficult

regulation of cell migration and proliferation (PMID …)
serine phosphorylation, translocation, and degradation of IRS-1 (PMID 16099428)

!proliferation and regulation of cell migration
!regulation of proliferation and cell migration
regulation of cell migration and regulation of cell proliferation

Why text mining is difficult

regulation of cell migration and proliferation (PMID …)
serine phosphorylation, translocation, and degradation of IRS-1 (PMID 16099428)

!degradation of IRS-1, translocation, and serine phosphorylation
!serine phosphorylation, serine translocation, and serine degradation (of IRS-1)

2.5 types of solutions

•Rule-based
– Patterns
– Grammars
•Statistical/machine learning
– Labelled training data
– Noisy training data
•Hybrid statistical/rule-based

Classic work in molecular biology information extraction: pattern-based

•Blaschke et al. (1999): The beginning of biologists working in BioNLP
– Gene names assumed to be known a priori
– Patterns assume two gene names and an “action word”: proteinA action_word proteinB
– Action words: acetylate, acetylates, acetylated, acetylation, etc.
– Not traditionally evaluated

Classic work in molecular biology information extraction: pattern-based

•Blaschke et al. (2002): Biologists begin to be aware of linguistics

•Proteins assumed to be known a priori

[proteins] (0-5) [verbs] (6-10) [proteins]

•(Why not 0-5 twice? Different weight of rule)

•P 0.45, R 0.40 (traditional evaluation)
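A gap-constrained pattern of this shape can be sketched as follows. This is an illustration only, not the implementation of Blaschke et al.: it assumes the parenthesised ranges are the number of intervening tokens allowed, and the function name, the sentence, and the protein names `ProtA` and `ProtB` are made up for the example.

```python
def find_interactions(tokens, proteins, verbs, gap1=(0, 5), gap2=(6, 10)):
    """Return (protein, verb, protein) triples whose token gaps fall in the given ranges."""
    hits = []
    for i, left in enumerate(tokens):
        if left not in proteins:
            continue
        for j in range(i + 1, len(tokens)):
            # j - i - 1 is the number of tokens between the protein and the verb
            if tokens[j] not in verbs or not gap1[0] <= j - i - 1 <= gap1[1]:
                continue
            for k in range(j + 1, len(tokens)):
                if tokens[k] in proteins and gap2[0] <= k - j - 1 <= gap2[1]:
                    hits.append((left, tokens[j], tokens[k]))
    return hits


tokens = "ProtA strongly binds to the C-terminal domain of ProtB".split()
proteins = {"ProtA", "ProtB"}
verbs = {"binds"}

print(find_interactions(tokens, proteins, verbs))                # gaps (0-5), (6-10)
print(find_interactions(tokens, proteins, verbs, gap2=(0, 5)))   # gaps (0-5), (0-5)
```

In this sentence five tokens separate the verb from the second protein, so only the variant with the shorter second gap fires. Treating each gap range as its own rule is what makes it possible to give the rules different weights.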


The Colorado solution: OpenDMAP

Classic DMAP

•Direct Memory Access Parser (Riesbeck, 1986; Martin, 1991; Fitzgerald, 1995)
– Belonging to the conceptual parser family
– Going as directly as possible from lexical input to concepts in memory
– Mostly toy prototype implementations with no real evaluation

Slide from Zhiyong Lu

New Features in OpenDMAP

•Open Source
– Implemented in Java
– Available at www.sourceforge.net
•OpenDMAP patterns are
– Richer (capable of using external information such as protein names and linguistic analyses rather than just strings and concepts)
– More flexible in terms of concept ordering
•First time in biomedical domain
– Well constructed ontologies
– Open Biomedical Ontologies (e.g. Gene Ontology)

Slide from Zhiyong Lu

Frame-based Representations

•Common representation for ontologies
•A unique name that refers to a concept
•A list of attributes (slots) with admissible values
•Frame slots describe logical relations between frames

Concept: Protein Transport
Slots:
  [transported entity]: protein or molecular complex
  [transporting entity]: protein or molecular complex
  [transport origin]: cellular component
  [transport destination]: cellular component
Phrasal-patterns:
  [transported entity] translocation to [transport destination]

Slide from Zhiyong Lu
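A frame with typed slots can be sketched with two small classes. This is a hypothetical data structure written for illustration, not OpenDMAP's or Protégé's own representation; the `Concept` and `Frame` classes and the `mitochondrion` example concept are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Concept:
    name: str
    parent: Optional["Concept"] = None  # link to a more general concept

    def is_a(self, other: "Concept") -> bool:
        """True if this concept is `other` or is subsumed by it."""
        node = self
        while node is not None:
            if node == other:
                return True
            node = node.parent
        return False


@dataclass
class Frame:
    name: str
    slots: Dict[str, Tuple[Concept, ...]]  # slot name -> admissible concepts
    fillers: Dict[str, str] = field(default_factory=dict)

    def fill(self, slot: str, text: str, concept: Concept) -> None:
        """Fill a slot, rejecting fillers whose concept is not admissible."""
        if not any(concept.is_a(allowed) for allowed in self.slots[slot]):
            raise ValueError(f"{concept.name} is not admissible for [{slot}]")
        self.fillers[slot] = text


protein = Concept("protein")
molecular_complex = Concept("molecular complex")
cellular_component = Concept("cellular component")
mitochondrion = Concept("mitochondrion", parent=cellular_component)

transport = Frame("Protein Transport", {
    "transported entity": (protein, molecular_complex),
    "transporting entity": (protein, molecular_complex),
    "transport origin": (cellular_component,),
    "transport destination": (cellular_component,),
})
transport.fill("transported entity", "Bax", protein)
transport.fill("transport destination", "mitochondria", mitochondrion)
print(transport.fillers)
```

Filling `[transport origin]` with a protein raises `ValueError`, which is the kind of constraint the admissible values of a slot express.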


Transport Frame in Protégé

Slide from Zhiyong Lu

Slots Define Logical Relations Between Concepts

Concept: Protein Transport
Slots:
  [transported entity]: protein or molecular complex
  [transporting entity]: protein or molecular complex
  [transport origin]: cellular component
  [transport destination]: cellular component
Phrasal-patterns:
  [transported entity] translocation to [transport destination]

Concept: protein
Phrasal-patterns: none

Concept: molecular complex
Phrasal-patterns: none

Concept: cellular component
Phrasal-patterns: none

Concept: nucleus
Phrasal-patterns: := nucleus := nuclei := nuclear

Concept: mitochondrion
Phrasal-patterns: := mitochondria := mitochondrion := mitochondrial

Diagram legend: Subsumption link, Relation link

Slide from Zhiyong Lu

Patterns for Cellular Locations

•Names and synonyms from Gene Ontology terms
•Linguistic variations

Concept: cellular component

Concept: mitochondrion
Phrasal-patterns: := mitochondrion := mitochondria := mitochondrial

Concept: nucleus
Phrasal-patterns: := nucleus := nuclei := nuclear

Slide from Zhiyong Lu

Pattern Matching Process

Input: Bax translocation to mitochondria

Concept: Protein Transport
Phrasal-patterns: [transported entity] translocation to [transport destination]

Concept: protein
Phrasal-patterns: none

Concept: molecular complex
Phrasal-patterns: none

Concept: cellular component
Phrasal-patterns: none

Concept: nucleus
Phrasal-patterns: := nucleus := nuclei := nuclear

Concept: mitochondrion
Phrasal-patterns: := mitochondria := mitochondrion := mitochondrial

Diagram legend: Subsumption link, Relation link
Pattern states: Awaiting, Recognized, Active

Slide from Zhiyong Lu
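The matching of "Bax translocation to mitochondria" against the Protein Transport pattern can be approximated in a few lines. This is a simplified sketch, not OpenDMAP's actual algorithm: it matches a fixed-length token sequence rather than tracking awaiting, recognized and active patterns, and `KNOWN_PROTEINS` is a hypothetical stand-in for an external protein-name recognizer.

```python
# Subsumption links: specific concept -> more general concept.
SUBSUMED_BY = {"nucleus": "cellular component", "mitochondrion": "cellular component"}

# Phrasal patterns for the cellular locations, as listed on the slides.
LEXICON = {
    "nucleus": "nucleus", "nuclei": "nucleus", "nuclear": "nucleus",
    "mitochondria": "mitochondrion", "mitochondrion": "mitochondrion",
    "mitochondrial": "mitochondrion",
}

KNOWN_PROTEINS = {"Bax"}  # assumption: supplied by an external tagger

SLOT_TYPES = {
    "transported entity": {"protein", "molecular complex"},
    "transport destination": {"cellular component"},
}

PATTERN = ["[transported entity]", "translocation", "to", "[transport destination]"]


def concepts_of(token):
    """Return the concept a token denotes plus every concept that subsumes it."""
    if token in KNOWN_PROTEINS:
        return {"protein"}
    found = set()
    concept = LEXICON.get(token.lower())
    while concept is not None:
        found.add(concept)
        concept = SUBSUMED_BY.get(concept)
    return found


def match(tokens):
    """Return the filled slots if the tokens fit PATTERN, else None."""
    if len(tokens) != len(PATTERN):
        return None
    frame = {}
    for element, token in zip(PATTERN, tokens):
        if element.startswith("["):
            slot = element[1:-1]
            if not concepts_of(token) & SLOT_TYPES[slot]:
                return None
            frame[slot] = token
        elif element != token:
            return None
    return frame


print(match("Bax translocation to mitochondria".split()))
```

The token "mitochondria" fills `[transport destination]` only because mitochondrion is subsumed by cellular component; the slot itself never mentions mitochondria.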

New Features in OpenDMAP Patterns

Data demonstrate that TFII-I, through a Src-dependent mechanism, translocates reversibly from the cytoplasm to the nucleus, leading to the transcription activation of growth-regulated genes.

[transported entity dep:x] _ [action c-action-transport head:x]
(by the? [transporting entity])? @ (to the? [transport destination])
@ (from the? [transport origin])

_ wildcard
dep:x/head:x placement of linguistic constraints
? optional concept match
@ optional concept ordering + match

Slide from Zhiyong Lu

Other Transport Patterns

Pattern: [transport destination] [action c-action-transport] _ (of the? [transported entity])? (by the? [transporting entity])?
GeneRIF: … nuclear translocation of the NF-kappaB (p65/p50) heterodimers

Pattern: [transported entity dep:x]? _ [transport destination] [action c-action-transport head:x] (by the? [transporting entity])?
GeneRIF: … is sufficient to degrade the AHR and that nuclear translocation

Pattern: [transported entity] (is|are|was|were) [action c-action-transport-passive] @ (by the? [transporting entity]) @ (from the? [transport origin]) @ (to the? [transport destination])
GeneRIF: the YY1 factor is translocated to the cytoplasm …

Slide from Zhiyong Lu

Evaluation of information extraction (and many other NLP tasks)

•Standard paradigm: “corpus”—body of texts with “gold standard” answers marked

•“Weakly annotated” data: publications with metadata only

•Test suites—see previous lectures

Evaluation of NLP systems

•Precision (aka positive predictive value) and recall (aka sensitivity). Tradeoffs between them.
•Against a “gold standard” of human-generated representations of texts
– Humans don’t always agree, therefore calculate inter-annotator agreement
•Post-hoc judgments (particularly of IR relevance)
•“Shared task” paradigm
– TREC Genomics (IR)
– BioCreative (IE)

Evaluation of NLP systems

•Precision:
– True positives / (True positives + False positives)
•Recall:
– True positives / (True positives + False negatives)
•F-measure: “harmonic mean” of precision and recall

Evaluation of NLP systems

•Formal definition:

$$F_\beta = \frac{(1 + \beta^2) \cdot \text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}}$$

•Typical definition: β = 1, so…

Evaluation of NLP systems

•Typical definition:

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

•…or just F: β is usually assumed to be 1

Evaluation of NLP systems

•β allows you to weight precision and recall differently
– Increasing β weights recall more highly
– Decreasing β weights precision more highly
•Rarely used, but designated by value of β, e.g. F0.5 or F2
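The definitions of precision, recall and F-measure translate directly into code. The sketch below uses hypothetical function names and illustrative values of precision 0.75 and recall 0.49; it follows the formula F_β = (1 + β²)·P·R / (β²·P + R).

```python
def precision(true_pos: int, false_pos: int) -> float:
    return true_pos / (true_pos + false_pos)


def recall(true_pos: int, false_neg: int) -> float:
    return true_pos / (true_pos + false_neg)


def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision p and recall r."""
    if p == 0 and r == 0:
        return 0.0  # avoid dividing by zero when nothing was found correctly
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


p, r = 0.75, 0.49
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g} = {f_beta(p, r, beta):.2f}")
```

With these values F0.5 is about 0.68, F1 about 0.59 and F2 about 0.53: as β grows the score moves toward recall, and as β shrinks it moves toward precision.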

OpenDMAP performance

•Performance of any rule-based information extraction system is a function of two things:
– Overall architecture and abilities of the system
– Quality of the rules

OpenDMAP performance

•Protein transport
– Complete frame filled: P 0.75, R 0.49, F 0.59
– Incomplete frames: P 0.75, R 0.67, F 0.71
– Gold standard gene names: complete frame P 0.77, R 0.67, F 0.72; incomplete frame P 0.75, R 0.85, F 0.81
•Cell-type-specific gene expression
– Without gold standard gene names: P 0.64, R 0.16, F 0.26
– With gold standard gene names: P 0.85, R 0.36, F 0.51

OpenDMAP performance

•Protein-protein interactions
– BioCreative II shared task: Placed 1st
•F 0.29 ten percent higher than #2 system, more than 3 standard deviations above the mean; similar recall to others, but precision of 0.39 more than 20% higher than #2 system
•BioCreative II.5:
– Another team placed 1st using OpenDMAP and a much larger set of (automatically learned) rules

OpenDMAP performance

•“Event” recognition
– E.g. phosphorylation, expression, binding, localization (weird definition of “event”)
– Ranked 19 out of 24 groups, but… had the highest precision (0.71-0.72)

Paths to improving OpenDMAP performance

•Increase recall
– Pattern learning? See Haibin’s lecture
•Increase precision
– Leverage what we know about biology
– Huge knowledge-base construction effort underway here over the course of the past two years

Adding even small amounts of knowledge to the system helps

•Livingston (2011): Gene activation task
•Original system: enzymes and substrates both allowed to be of type protein
•Enhancement:
– Gene Ontology annotations
– Potential enzymes must have annotation catalytic activity
– Potential substrates must have annotation receptor activity
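The enhancement amounts to a filter over candidate pairs. The sketch below is an assumption about how such a filter could look, not Livingston's implementation; the protein identifiers `P1` to `P3` and their annotations are invented placeholders, not real Gene Ontology data.

```python
def filter_candidates(candidates, annotations):
    """Keep (enzyme, substrate) pairs that satisfy both annotation constraints."""
    kept = []
    for enzyme, substrate in candidates:
        enzyme_ok = "catalytic activity" in annotations.get(enzyme, set())
        substrate_ok = "receptor activity" in annotations.get(substrate, set())
        if enzyme_ok and substrate_ok:
            kept.append((enzyme, substrate))
    return kept


# Placeholder annotations for illustration only.
annotations = {
    "P1": {"catalytic activity"},
    "P2": {"receptor activity"},
    "P3": set(),
}
candidates = [("P1", "P2"), ("P2", "P1"), ("P1", "P3")]

print(filter_candidates(candidates, annotations))
```

Only the pair whose first member is annotated as catalytic and whose second is annotated as a receptor survives. A filter like this can only remove candidates, so it can raise precision but can never raise recall.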

Adding even small amounts of knowledge to the system helps

            Original   Added knowledge   Difference
Precision   0.16       0.36              0.20
Recall      0.24       0.18              -0.06
F-measure   0.19       0.24              0.05

Nominalization

•Nominalization: noun derived from a verb
– Verbal nominalization: activation, inhibition, induction
– Argument nominalization: activator, inhibitor, inducer, mutant

Nominalizations are dominant in biomedical texts

Predicate       Nominalization   All verb forms
Express         2,909            1,233
Develop         1,408            597
Analyze         1,565            364
Observe         185              809
Differentiate   737              166
Describe        10               621
Compare         185              668
Lose            556              74
Perform         86               599
Form            533              511

Data from CRAFT corpus


Relevant points for text mining

•Nominalizations are an obvious route for scaling up recall

•Nominalizations are more difficult to handle than verbs…

•…but can yield higher precision (Cohen et al. 2008)

Alternations of nominalizations: positions of arguments

•Any combination of the set of positions for each argument of a nominalization
– Pre-nominal: phenobarbital induction, trkA expression
– Post-nominal: increases of oxygen
– No argument present: Induction followed a slower kinetic…
– Noun-phrase-external: this enzyme can undergo activation

Result 1: attested alternations are extraordinarily diverse

•Inhibition, a 3-argument predicate (Arguments 0 and 1 only shown)


Implications for system-building

•Distinction between absent and noun-phrase-external arguments is crucial and difficult, and finite state approaches will not suffice; merging data from different clauses and sentences may be useful

•Pre-nominal arguments are undergoers by a ratio of 2.5:1

•For predicates with agent and patient, post/post and pre/post patterns predominate, but others are common as well

What can be done?

•External arguments:
– semantic role labelling approach
  •…but, very important to recognize the absent/external distinction, especially with machine learning
– pattern-based approach
  •…but, approaches to external arguments (RLIMS-P) are so far very predicate-specific

What can be done?

•Pre-nominal arguments:
– apply heuristic that we have identified based on distributional characteristics
– for most frequent nominalizations, manual encoding may be tractable

So, how do you do text mining?

Two approaches that are not coexisting peacefully

Two approaches to NLP

•Knowledge-based
•Statistical/machine learning

First approach to NLP

•Rule-based
•AI, linguistics
•Ontologies
•Knowledge bases
•Patterns (regular, context-free…)
•Procedures

K-based: procedural

•Patterns (regular, context-free, …)

•Procedures

if (currentWordEndsWithIng && previousWordIsThe && nextWordIsOf) {
    // …
}
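The procedural rule above can be sketched as runnable Python. The slide does not show what the rule does once all three conditions hold. This sketch assumes it flags the -ing word as a likely nominalization, as in "the binding of". The function name is hypothetical.

```python
def looks_like_nominalization(tokens, i):
    """Return True if tokens[i] ends in -ing and sits between 'the' and 'of'.

    Assumption: the rule's purpose is to flag nominalizations; the slide
    only shows the three conditions, not the action.
    """
    if not tokens[i].lower().endswith("ing"):
        return False
    previous_is_the = i > 0 and tokens[i - 1].lower() == "the"
    next_is_of = i + 1 < len(tokens) and tokens[i + 1].lower() == "of"
    return previous_is_the and next_is_of


tokens = "the binding of Met28 to DNA".split()
print([t for i, t in enumerate(tokens) if looks_like_nominalization(tokens, i)])
# prints ['binding']
```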

K-based: regex

•Patterns (regular, context-free, …)

•Procedures

# Letters, an optional hyphen, then one or more digits (e.g. Met28)
my $geneName = qr/[A-Za-z]+-?[0-9]+/;
if ($input =~ /interaction of ($geneName) with ($geneName)/) {
    $interactionAssertion->setGene1($1);
    $interactionAssertion->setGene2($2);
}
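The pattern above can also be sketched in Python, used here only so the example is easy to run. The digit class is written with `+` so that multi-digit names such as Met28 match in full. The test sentence and the function name are made up for illustration.

```python
import re

# Assumption: gene names are letters, an optional hyphen, then digits (e.g. Met28).
GENE_NAME = r"[A-Za-z]+-?[0-9]+"
INTERACTION = re.compile(rf"interaction of ({GENE_NAME}) with ({GENE_NAME})")


def extract_interaction(text):
    """Return (gene1, gene2) for the first match in text, or None."""
    match = INTERACTION.search(text)
    return match.groups() if match else None


print(extract_interaction("We studied the interaction of Met28 with Cbf1."))
# prints ('Met28', 'Cbf1')
print(extract_interaction("binding of Met28 to DNA"))
# prints None
```

The second call shows the cost of a single pattern: any phrasing other than "interaction of X with Y" goes unmatched, so every variant needs its own pattern.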

K-based: CFGs

•Patterns (regular, context-free, …)

•Procedures

NounPhrase -> NounPhrase+ Conjunction NounPhrase

NounPhrase -> Predeterminer Determiner+ Adjective+ Noun
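A minimal Python sketch of a recognizer in the spirit of these rules, working over part-of-speech tags rather than words. It makes two simplifying assumptions that go beyond the slide: the modifiers before the noun are treated as optional, and the left-recursive coordination rule is rewritten as iteration (a simple noun phrase followed by any number of "Conjunction + simple noun phrase" pairs). The tag names follow the slide; the function names are hypothetical.

```python
def parse_simple_np(tags, i):
    """Consume Predeterminer? Determiner? Adjective* Noun starting at index i.

    Returns the index just past the noun, or None if no noun phrase starts here.
    """
    if i < len(tags) and tags[i] == "Predeterminer":
        i += 1
    if i < len(tags) and tags[i] == "Determiner":
        i += 1
    while i < len(tags) and tags[i] == "Adjective":
        i += 1
    if i < len(tags) and tags[i] == "Noun":
        return i + 1
    return None


def is_noun_phrase(tags):
    """True if the whole tag sequence is SimpleNP (Conjunction SimpleNP)*."""
    i = parse_simple_np(tags, 0)
    while i is not None and i < len(tags):
        if tags[i] != "Conjunction":
            return False
        i = parse_simple_np(tags, i + 1)
    return i is not None


print(is_noun_phrase(["Determiner", "Adjective", "Noun"]))           # prints True
print(is_noun_phrase(["Determiner", "Noun", "Conjunction", "Noun"]))  # prints True
print(is_noun_phrase(["Noun", "Conjunction"]))                        # prints False
```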

Knowledge-based approaches

Why they work

•Patterns are real
– Psychologically
– Formally adequate (mostly)

•Intuition works

•No need for training data

Knowledge-based approaches

Why they’re hard

•Knowledge takes time to get

•Process of developing large rule sets can be slow
– Consider English syntax…

Second approach to NLP

•Mosteller & Wallace

•Bayesian

•Other machine learning techniques

Statistical/ML approaches

•Frame the NLP task as a series of classification problems
– Which POS is this?
– Which word meaning?
– Which phrasal grouping?
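To make the classification framing concrete, here is a minimal sketch of the "Which POS is this?" question as a tiny naive Bayes classifier in plain Python. The features (a three-letter suffix and the previous word), the add-one smoothing, and the four training examples are assumptions chosen for illustration. The training data is invented and far too small for real use.

```python
import math
from collections import Counter, defaultdict


def features(word, previous):
    """Two simple features: the word's last three letters and the previous word."""
    return [f"suffix={word[-3:].lower()}", f"prev={previous.lower()}"]


def train(examples):
    """Count labels and features from (word, previous_word, label) triples."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for word, previous, label in examples:
        label_counts[label] += 1
        for feat in features(word, previous):
            feature_counts[label][feat] += 1
            vocabulary.add(feat)
    return label_counts, feature_counts, vocabulary


def classify(model, word, previous):
    """Pick the label with the highest add-one-smoothed log probability."""
    label_counts, feature_counts, vocabulary = model
    total = sum(label_counts.values())

    def score(label):
        denominator = sum(feature_counts[label].values()) + len(vocabulary)
        log_prob = math.log(label_counts[label] / total)
        for feat in features(word, previous):
            log_prob += math.log((feature_counts[label][feat] + 1) / denominator)
        return log_prob

    return max(label_counts, key=score)


# Toy training data, invented for illustration only.
TRAINING = [
    ("binding", "the", "NOUN"),
    ("signaling", "the", "NOUN"),
    ("binds", "Met28", "VERB"),
    ("activates", "protein", "VERB"),
]
model = train(TRAINING)
print(classify(model, "folding", "the"))  # prints NOUN
```

The word "folding" never appears in the training data. The classifier labels it from the statistics of its suffix and left context alone, with no hand-written rule involved.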

Statistical approaches

Why they work

•Statistics can be a proxy for knowledge

•Some interesting stuff is frequent enough to be tractable

Knowledge-based or statistical: what to do??

Knowledge-based vs. statistical approaches

•Pragmatic answer #1: if you must pick one...
– Is it cheaper to label more training data, or to put time into developing patterns?

Knowledge-based vs. statistical approaches

•Researcher’s answer:
– Use one as the baseline for the other
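Using one approach as the baseline for the other amounts to scoring both against the same gold standard. A minimal Python sketch, assuming each system's output and the gold standard can be represented as sets of item identifiers. The identifiers and both system outputs below are invented.

```python
def f_measure(predicted, gold):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented example: items 1-4 are correct; each system returns a different subset.
gold = {1, 2, 3, 4}
rule_based_output = {1, 2}          # precise but low recall
statistical_output = {1, 2, 3, 5}   # one miss, one false positive

print(round(f_measure(rule_based_output, gold), 3))   # prints 0.667
print(round(f_measure(statistical_output, gold), 3))  # prints 0.75
```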

Knowledge-based vs. statistical approaches

•Pragmatic answer #2: combine them
– Do both together/iteratively
– Statistical solution first, then rule-based post-processing

the 2.5th approach

“Natural language processing is never pure and rarely simple.”
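The "statistical solution first, then rule-based post-processing" option can be sketched as a two-stage pipeline in Python. The statistical stage here is a deliberately simple most-frequent-tag lookup, and the rule stage is a single hand-written correction. Both are stand-ins chosen for brevity, and the training counts are invented.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_words):
    """Statistical step: remember each word's most frequent tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def statistical_tags(model, tokens, default="NOUN"):
    """Tag each token with its most frequent tag, or a default if unseen."""
    return [model.get(token.lower(), default) for token in tokens]


def apply_rules(tokens, tags):
    """Rule step: an -ing word between 'the' and 'of' is retagged as a noun."""
    fixed = list(tags)
    for i in range(1, len(tokens) - 1):
        if (tokens[i].lower().endswith("ing")
                and tokens[i - 1].lower() == "the"
                and tokens[i + 1].lower() == "of"):
            fixed[i] = "NOUN"
    return fixed


# Toy training data, invented for illustration only.
model = train_most_frequent_tag([
    ("the", "DET"), ("of", "PREP"), ("Met28", "NOUN"),
    ("binding", "VERB"), ("binding", "VERB"), ("binding", "NOUN"),
])
tokens = "the binding of Met28".split()
tags = statistical_tags(model, tokens)
print(tags)                       # prints ['DET', 'VERB', 'PREP', 'NOUN']
print(apply_rules(tokens, tags))  # prints ['DET', 'NOUN', 'PREP', 'NOUN']
```

The statistical stage gets the frequent case right and the contextual case wrong. The rule then repairs the one context that the counts cannot see.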

Which works better?

Pestian et al. (2007)

A rapprochement

Conceptual features for information retrieval

•Task: retrieve sentences that contain mentions of mutations.

•Keyword approach: 1,092

•Recognize mutation mentions: additional 2,171

Conceptual features in document classification

Caporaso et al. (2005)

Conceptual features in document classification

Caporaso et al. (2005)

Untapped conceptual types

Malignancies (F = 0.84)

Jin et al. (2006)

Mouse strains

•CAST/EiJ

•C57BL

•SJL/J

•SEG

•C3H/He

•RIII

•DBA/1

Caporaso et al. (2005)

Mutations

•Ala64->Gly

•Ala64Gly

•A376G

Caporaso et al. (2007)
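The three surface forms listed above are regular enough for a pattern-based recognizer. The Python sketch below covers only those three forms. It is an illustration written for this example, not the pattern set of Caporaso et al. (2007), and the function and group names are hypothetical.

```python
import re

THREE_LETTER = (
    "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
    "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
)
ONE_LETTER = "ACDEFGHIKLMNPQRSTVWY"

# Either three-letter residues with an optional "->", or one-letter residues.
MUTATION = re.compile(
    rf"\b(?:(?P<wt3>{THREE_LETTER})(?P<pos3>[1-9][0-9]*)(?:->)?(?P<mut3>{THREE_LETTER})"
    rf"|(?P<wt1>[{ONE_LETTER}])(?P<pos1>[1-9][0-9]*)(?P<mut1>[{ONE_LETTER}]))\b"
)


def find_mutations(text):
    """Return (wild_type, position, mutant) triples found in text."""
    results = []
    for m in MUTATION.finditer(text):
        wild = m.group("wt3") or m.group("wt1")
        position = m.group("pos3") or m.group("pos1")
        mutant = m.group("mut3") or m.group("mut1")
        results.append((wild, int(position), mutant))
    return results


print(find_mutations("Ala64->Gly, Ala64Gly and A376G"))
# prints [('Ala', 64, 'Gly'), ('Ala', 64, 'Gly'), ('A', 376, 'G')]
```

The one-letter form is ambiguous: A and G are also nucleotide codes, so a string like A376G could describe a DNA substitution. This sketch does not attempt to tell the two apart.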

Point/Counterpoint

Contradictory findings

•TREC 2003: “...searching in the MeSH and substance name fields, along with filtering for species, accounted for the best performance” (Hersh and Bhupatiraju 2003, Caporaso et al. 2005)

•TREC 2004: “Approaches that attempted to map to controlled vocabulary terms did not fare as well” (Hersh et al. 2004)

Understanding the TREC 2004 results

• Poor choice of concepts
– MeSH terms only, which are known to have problems even when manually indexed

• “Conceptual” systems weren’t very good (or didn’t try very hard) at concept recognition
– Even synonymy not detected well (1 case)
– Methods not described, so presumably not a focus of the work (2 cases)

• Hersh et al. (2004) overstate role of concepts in these systems
– Synonym source only (1 case)
– Only one of several features (1 case)

I’m convinced in theory, but will it scale?

•Jin et al. (2006): for malignancy mentions, relatively small amount of training data sufficed

•Caporaso et al. (2007): mutation patterns were learnable with small person-hour investment

Conclusion

•Statistical and conceptual approaches to text mining can coëxist peacefully
– Statistical and rule-based concept recognizers can work well
– Concepts are good features for statistical systems