BioNLP for NLPeople - Computer Sciencemartin/Csci5832/Slides/S07/bionlp.pdf · growth cone motility in response to targeting signals linked by the SH2/SH3 adaptor protein Dock. Here,

BioNLP for NLPeople

CS5832/HLT-NAACL/RANLP

The weirdest job in the world

How I got here

• Voice Input Technologies
• Linguistix
• Nationwide Insurance
• MapQuest
• Berdy Medical Systems
• OneRealm [sic]

How I got here

• Perl hacker, SLM data preprocessing
• Linguist, corpus construction
• Senior Programmer/Analyst, Interactive Voice Response (yuck)
• Software test dept. manager; senior software engineer
• Consultant/Perl hacker
• Senior software engineer

What is BioNLP?

• Natural language processing applied to biomedical language
  – Publications
  – Medical records
  – Ontologies

Part 0

Why a field called BioNLP?

There is little reason for the data on which a linguist works to have the right to name that work.

Shuy 2002:8

(One lab’s) funding for NLP in computational biology

• INIA (Neuroinformatics of Alcoholism) ($5M, 5 years)
• Wyeth Genomics Institute ($200K, 2 years)
• National Library of Medicine ($4.2M, 3 years)
• National Library of Medicine ($XM, 3 years)

Why biologists care

• High-throughput data interpretation
• Literature search
• Annotation
• Database construction

But, I’m an NLPerson (computer scientist, mathematician, engineer…)

• Hard, but might be possible
• Might be harder in biomedical domain than in newswire text
• Might be more possible in biomedical domain than in newswire text

Resources
The big drawing point for NLPeople

• Data
  – Lexical resources
  – 500 * 16M words of text
  – Labelled training data
• Tools
  – NER, POS taggers, parsers, semantic normalizers....

$$$

Job market

• Academia: great
  – US, Europe
• Industry: not bad, but genomics-specific right now

Surely Shuy jests...

There is little reason for the data on which a linguist works to have the right to name that work.

It really is different on every level

• Tokenization
• Named entity recognition
• Corpus construction
• Semantic representation

NLP actually could make the world a better place....

An embarrassing truth about BioNLP...

www.chilibot.net

Part 1: Just enough biology

Cells and proteins

<illustration: cell, structures, proteins>

How biologists see the world

Wattarujeekrit et al. (2004)

The Central Dogma: from genes to proteins

http://www.swbic.org/products/clipart/images/dogmag.jpg

The Central Dogma: from genes to proteins

http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/images/central_dogma.gif

Higher-level structures

• Genotype, phenotype
• Tissue, organ, organism

Biological structures are complex

SNAP Receptor

Vesicle SNARE

V-SNARE

N-Ethylmaleimide-Sensitive Fusion Protein

Soluble NSF Attachment Protein

Maleic acid N-ethylimide

Vesicle Soluble Maleic acid N-ethylimide Sensitive Fusion Protein Attachment Protein Receptor

(Alex Morgan, MITRE)

Part 2: Why bioscientists fund and publish research in BioNLP

Two basic markets, multiple user types

• Medical
  – Clinicians
  – Consumers
  – “Informationists”
  – Administrators (billing, quality assurance, ...)
• “MolBio” (genomic)
  – High-throughput experimentalists
  – “Bench scientists”
  – Model organism database curators

Structured vocabulary

Free text (phenotypes)

122 references...

Medical

1997

<scanned picture of business card>

<happy-face photo>

One year later…

A sad story: physicians don’t buy a lot of NLP software

Another sad story: trying to sell “gisting” to physicians

Sold for $400K: 14.5 or 2.9¢ on the dollar…

Salesperson’s thought process

Physician’s thought process

Genomics

Why biologists care

• High-throughput data interpretation
• Literature search
• Annotation
• Database construction

Why biologists care

10 years ago...

Why biologists care
Today....

Double exponential growth in the literature

New entries in Medline with publication date in Jan-Aug 2005: 431,478 (avg. 1775/day)

Biological Nomenclature: “V-SNARE”

SNAP Receptor

Vesicle SNARE

V-SNARE

N-Ethylmaleimide-Sensitive Fusion Protein

Soluble NSF Attachment Protein

Maleic acid N-ethylimide

Vesicle Soluble Maleic acid N-ethylimide Sensitive Fusion Protein Attachment Protein Receptor

(Alex Morgan, MITRE)

Part 3

Some things that make BioNLP different

Named Entity Recognition

Genes have names??

Suzanna Lewis

• Fruitfly geneticist
• 5 kids
• Latte + 3 shots

Suzanna Lewis

It is the middle of the night (2:38 to be precise), I am away from friends and family, It has been this way for over 2 years, I can't sleep because of all the work there is yet to do, and there is no end in sight. So when do the magic little elves appear out of nowhere and get everything done?

p.s. I am serious.

Suzanna Lewis

pray for elves

D. melanogaster gene Pray For Elves, abbreviated as PFE, is reported here. It has also been known in FlyBase as CG15151. Similar sequences have been identified in Caenorhabditis elegans, Homo sapiens, Mus musculus, Rattus norvegicus and Saccharomyces cerevisiae.

(FlyBase report FBal0138651)

Named entity recognition

• Molecular biology entity identification problem:
  – large list of classes
  – some of them much harder
• Usual case-related cues don't help
• More variability of content
• Huge lexical ambiguity problem
• Common English
  – as posed, not useful

white

white

"wild-type" (not mutated)

white

"mutant"

white

white

Case is meaningful

white

White

Case is meaningful

white

Symbol: w

White

Symbol: W
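
A minimal sketch of why case folding hurts here, using only the two symbols shown above (w for white, W for White). The `GENE_SYMBOLS` table and the `lookup` helper are hypothetical names for illustration, not part of any real gene database API.

```python
# Hypothetical two-entry symbol table, taken from the slide: w = white, W = White.
GENE_SYMBOLS = {"w": "white", "W": "White"}


def lookup(symbol, fold_case=False):
    """Return the gene names matching a symbol, optionally ignoring case."""
    if fold_case:
        # Lowercasing collapses distinct symbols into one ambiguous key.
        folded = {}
        for sym, name in GENE_SYMBOLS.items():
            folded.setdefault(sym.lower(), []).append(name)
        return folded.get(symbol.lower(), [])
    name = GENE_SYMBOLS.get(symbol)
    return [name] if name else []
```

With case preserved, `lookup("W")` finds a single gene; with `fold_case=True` the same query matches both, which is the ambiguity the usual lowercasing step of a newswire pipeline would introduce.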

Yes, there are genes with the symbols I, a, R, p....

Case is meaningful

Misshapen (Msn) has been proposed to shut down Drosophila photoreceptor (R cell) growth cone motility in response to targeting signals linked by the SH2/SH3 adaptor protein Dock.

Case is meaningful

Misshapen (Msn) has been proposed to shut down Drosophila photoreceptor (R cell) growth cone motility in response to targeting signals linked by the SH2/SH3 adaptor protein Dock. (Ruan et al. 2002)

…even sentence-initially.

sunday driver (syd) was identified in a screen for novel axonal transport mutants in Drosophila. Syd is a ~137kDa protein that is broadly conserved in evolution with homologous proteins identified in C. elegans, mouse and human. (Bowman 2000)

Case is meaningful

Misshapen (Msn) has been proposed to shut down Drosophila photoreceptor (R cell) growth cone motility in response to targeting signals linked by the SH2/SH3 adaptor protein Dock. Here, we show that Bifocal (Bif), a putative cytoskeletal regulator, is a component of the Msn pathway for regulating R cell growth targeting. bif displays strong genetic interaction with msn.

Surely you could determine on a document-by-document basis…

Axonal traffic jams with a sunday driver: Identification of a broadly conserved transmembrane protein required for axonal transport in Drosophila. (Bowman 2000)

Evolution

• What it looks like
• What it acts like
• Metaphor
• …

Looks like…

• white
• swiss cheese
• clown
• dachshund
• dreadlocks

Acts like…

• ether a go-go
• lush
• agnostic
• amontillado

Metaphor/metonymy

• lot
• maggie
• scott of the antarctic
• always early -> british rail
• asp -> cleopatra
• tudor -> vasa -> gustavus
• nanos -> smaug

whimsy

• chablis, merlot, zinfandel, retsina, moonshine (16 zebrafish genes)
• milkah, murashka, zolotistyuy, zloday (32 Drosophila genes)

But, that’s not the only way of naming genes....

• Breast cancer 1 (BRCA1)
• p53
• Ribosomal protein S27
• Heat shock protein 110
• Mitogen activated protein kinase 15
• Mitogen activated protein kinase kinase kinase 5

• fuculokinase
• GABA
• Heat shock protein 60
• calmodulin
• dHAND
• suppressor of p53

• cheap date
• lush
• ken and barbie
• ring
• to
• the
• there
• a

Worst gene names

• sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A
• SEMA5A
• Tyrosine kinase with immunoglobulin and epidermal growth factor homology domains
• tie

• What doesn’t work
• What does (as of 2004)

“Gene mention” (NER)

Yeh et al. (2005)

Gene mention (NER)

Yeh et al. (2005)

Good systems?

• Handle multi-word names (heat shock protein 60) (base NP chunking, abbreviation definitions, post-processing)
• Use some form of machine learning (MaxEnt, HMM, CRF, SVM) (or a clever hack)
• Do some rule-based post-processing
• Don’t rely on dictionaries

The Jim Martin technique really works

Kinoshita et al. (2005)

...which isn’t to say that external knowledge is bad

• Markert/Nissim’s extensions of Poesio’s use of Google

Most feature sets include...

• Typo/orthographic features
  – Patterns like \w+-?\d+
  – Contains Greek letters
• Local/distant context
  – Next word is “protein”
  – Followed by “protein” somewhere else in document
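
A rough sketch of how the four features listed above might be computed for one token. The feature names, the `token_features` helper, and the short list of spelled-out Greek letters are assumptions for illustration; only the `\w+-?\d+` pattern and the "protein" cues come from the slide.

```python
import re

# Pattern quoted on the slide: word characters, optional hyphen, digits.
SYMBOL_PATTERN = re.compile(r"\w+-?\d+")
# Illustrative subset of spelled-out Greek letters (an assumption, not a full list).
GREEK_WORDS = ("alpha", "beta", "gamma", "delta", "kappa")


def contains_greek(token):
    """True if the token has a spelled-out or literal lowercase Greek letter."""
    lower = token.lower()
    spelled = any(word in lower for word in GREEK_WORDS)
    literal = any("\u03b1" <= ch <= "\u03c9" for ch in token)
    return spelled or literal


def token_features(tokens, i):
    """Return a feature dict for tokens[i], given all tokens of the document."""
    tok = tokens[i]
    next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    # Distant context: another occurrence of the same token is followed by "protein".
    follows_elsewhere = any(
        tokens[j] == tok and j != i and tokens[j + 1].lower() == "protein"
        for j in range(len(tokens) - 1)
    )
    return {
        "matches_symbol_pattern": SYMBOL_PATTERN.fullmatch(tok) is not None,
        "contains_greek": contains_greek(tok),
        "next_is_protein": next_tok == "protein",
        "protein_follows_elsewhere": follows_elsewhere,
    }
```

A dict like this would typically be handed to whichever learner is in use (MaxEnt, CRF, SVM); the substring test for Greek letters is deliberately crude and will misfire on words like "alphabet".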

Why not better?

• Length
• Case
• Tokenization
• Annotation issues
  – Inconsistency
  – Multiple correct answers
  – Inter-corpus differences in definition

Yeh et al. (2005)

Length effect (and why the Jim Martin technique works so well for this)

Kinoshita et al. (2005)

A great research project

• Build an NER system for...
  – Species
  – Laboratory techniques
  – Cell types
  – Cell lines
  – Tissues
  – ....

...and, NER isn’t what you need anyways

• GN task and results

Tokenization

• How to build a cheap base noun phrase chunker
  – Start from right, move left
    • If next token is not conjunction, preposition, comma, period, or right parenthesis, add it
    • Else start a new chunk
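
The recipe above can be sketched in a few lines. The word lists below are small illustrative stand-ins (the slide names the categories, not the members), and `cheap_chunks` is a hypothetical helper name. In this sketch the boundary tokens themselves are dropped.

```python
# Illustrative boundary lists; a real system would use fuller lists or a POS tagger.
CONJUNCTIONS = {"and", "or", "but"}
PREPOSITIONS = {"of", "in", "on", "by", "with", "for", "to", "from", "at"}
BOUNDARY_PUNCT = {",", ".", ")"}
BOUNDARIES = CONJUNCTIONS | PREPOSITIONS | BOUNDARY_PUNCT


def cheap_chunks(tokens):
    """Scan right to left; a boundary token closes the current chunk."""
    chunks = []
    current = []
    for tok in reversed(tokens):
        if tok.lower() in BOUNDARIES:
            if current:
                chunks.append(current[::-1])  # restore left-to-right order
            current = []
        else:
            current.append(tok)
    if current:
        chunks.append(current[::-1])
    return chunks[::-1]  # chunks in document order
```

On "heat shock protein 60 and calmodulin" this keeps the multi-word name together as one chunk. Because nothing here knows about verbs, a chunk can also swallow non-nominal material, which is the price of "cheap".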

Tokenization

• Commas
  – 2,6-diaminohexanoic acid
  – tricyclo(3.3.1.13,7)decanone

Four kinds of hyphens

• “Syntactic:”
  – Calcium-dependent
  – Hsp-60
• Knocked-out gene: lush-- flies
• Negation: -fever
• Electric charge: Cl-
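
To see what goes wrong, here is a deliberately naive tokenizer of the kind that works tolerably on newswire, applied to the examples above. The `naive_tokenize` name and its regular expression are assumptions for illustration, not a tool mentioned in the talk.

```python
import re


def naive_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


# The chemical name is shredded at every comma and hyphen.
print(naive_tokenize("2,6-diaminohexanoic acid"))
# The knocked-out-gene marker and the electric charge lose their meaning.
print(naive_tokenize("lush-- flies"))
print(naive_tokenize("Cl-"))
```

Each hyphen comes out as a separate, context-free token, so the four uses listed above become indistinguishable downstream.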

B-cell-CD4(+)-T-cell interactions

• PMID: 10516078

Special challenges in biomedical corpus construction

• How do you parse rat epithelial growth factor receptor 2?
• Don’t: pretag all named entities
• How do you tokenize tricyclo(3.3.1.13,7)decanone?
• Don’t: pretag all named entities

• How do you hire a linguistics graduate student to tag rat epithelial growth factor receptor 2?
• You can’t...
• How do you do PAS tagging when you don’t have syntactically tagged text?
• Sigh...

Some specific cases of word sense disambiguation

Abbreviation disambiguation

• Incidence of ambiguous abbreviations (Jeff Chang’s paper)
• Statistical approaches
  – Chang
• Rule-based
  – Schwartz and Hearst

Part 4: getting up to speed

(about) 10 papers and resources that will let you read most other papers in BioNLP

Named entity recognition 1: rule-based

• Fukuda et al. (1998): first NER paper
  – Find something that looks like a symbol for a yeast gene (ABC1)
  – Extend name to the left (yeast ABC1)
  – Extend name to the right (ABC1 protein)
• Results in 90s
  – Never replicated
  – Yeast is easy
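
A loose sketch of the three steps listed above, not a reconstruction of the Fukuda et al. system. The symbol pattern (three letters plus digits, as in ABC1), the two word lists, and the `find_names` helper are all assumptions chosen to match the slide's examples.

```python
import re

# Assumption: a yeast-style symbol is three letters followed by digits (e.g. ABC1).
SYMBOL = re.compile(r"[A-Za-z]{3}\d+")
LEFT_MODIFIERS = {"yeast"}          # hypothetical words allowed to extend a name leftward
RIGHT_HEADS = {"protein", "gene"}   # hypothetical words allowed to extend a name rightward


def find_names(tokens):
    """Return (start, end) token spans, end exclusive, for candidate gene names."""
    spans = []
    for i, tok in enumerate(tokens):
        # Step 1: find something that looks like a gene symbol.
        if not SYMBOL.fullmatch(tok):
            continue
        start, end = i, i + 1
        # Step 2: extend the name to the left.
        if start > 0 and tokens[start - 1].lower() in LEFT_MODIFIERS:
            start -= 1
        # Step 3: extend the name to the right.
        if end < len(tokens) and tokens[end].lower() in RIGHT_HEADS:
            end += 1
        spans.append((start, end))
    return spans
```

For the token list of "We cloned the yeast ABC1 protein", the single span returned covers "yeast ABC1 protein". The regular symbol shape is what makes yeast easy; nothing this simple would survive fly names like white or lush.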

Named entity recognition 2: machine learning

• Collier et al. (XXX)

NER 3: state of the art

Information extraction 1: rule-based

• Blaschke 1998

Information extraction 2: machine learning

• Craven and Kumlien 199X
• Identify entity pairs
  – Protein/protein
  – Protein/disease
  – Protein/?
• Use naïve Bayes to classify sentences as +/- positing a relation
  – Features: bag-of-words
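
A minimal sketch of the sentence classifier described above: multinomial naive Bayes over bag-of-words features, with add-one smoothing. The four training sentences are invented toy data, and replacing the entity pair with the placeholders PROT1/PROT2 is an assumption, not something the slide specifies.

```python
import math
from collections import Counter

# Invented toy data: (sentence, does it posit a relation between the pair?)
TOY_SENTENCES = [
    ("PROT1 interacts with PROT2", True),
    ("PROT1 binds PROT2 directly", True),
    ("PROT1 and PROT2 were purified", False),
    ("PROT1 and PROT2 were sequenced", False),
]


class NaiveBayes:
    """Bag-of-words naive Bayes for a +/- relation label."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, sentences):
        for text, label in sentences:
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.doc_counts[label] += 1
            self.vocab.update(words)

    def log_score(self, text, label):
        total = sum(self.word_counts[label].values())
        # Log prior for the class.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            count = self.word_counts[label][word] + 1
            score += math.log(count / (total + len(self.vocab)))
        return score

    def predict(self, text):
        return self.log_score(text, True) > self.log_score(text, False)
```

After `nb = NaiveBayes(); nb.train(TOY_SENTENCES)`, calling `nb.predict("PROT1 binds PROT2")` labels the sentence as positing a relation. Word order is ignored entirely, which is exactly what "bag-of-words" means.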

Information extraction 3: rules, linguistics, knowledge

• Friedman: MedLEE, BioMedLEE
• NER
• Syntax

Corpora: 1

• PubMed/MEDLINE
  – MEDLINE: database of 16M+ abstracts
  – PubMed: interface for searching MEDLINE
  – ASCII and free

NOT a corpus—not really even a “text collection”

Corpora: 2

• GENIA
  – Fully annotated corpus
  – 2,000 abstracts
  – X00,000 words
  – Now: POS, named entities, 25% treebanked
  – Coming: anaphora; events?; PAS?; dependency parses?

Lexical resources: 1

• Gene Ontology
  – Biological processes
  – Molecular functions
  – Cellular components
• Building blocks
  – Terms + definitions
  – Is-a, part-of

Lexical resources: 2

• Entrez Gene (formerly LocusLink)
  – Names
  – Symbols
  – Synonyms
  – Protein products
  – “Summary”
  – Gene References Into Function

Lexical resources: 3

• UMLS (Unified Medical Language System)
  – MetaThesaurus
  – Semantic Network

Tools overview

• Probably something available
• Might work decently
• Definitely improvable for your specific task

Tools: 1

• POS tagging:
  – GENIA
  – MEDPOST
  – LingPipe?

Tools: 2

• Named entity recognition
  – ABNER (Settles 200x)
  – KeX
  – AbGene
• LESSON: distribute a .jar file and the world will beat a path to your door

Part 6: Current hot topics

What’s the right model for semantic representation?

• So far: binary relations
• Arguments that that’s not good enough
  – Rzhetsky/GeneWays paper
  – Penn folks/IE paper
  – Native speaker intuitions (Juliane, etc.)

What’s the right model for semantic representation?

• Two ways forward
  – Differentiating binary relations
    • Marti HLT/EMNLP; Tsujii
  – PAS
    • PASBio/Wattarujeekrit et al.
    • Kogan et al.

Karin: how do these representational choices affect what a biologist would get out of the text?

The ontology wars

• Point:
  – Hunter; PASBio; Barry Smith; L&C....
  – GOA; MGI; EBI; ...
• Counterpoint:
  – Tsujii/Ananiadou; Pedersen/Pakhomov; Markert/Nissim...

True integration of NLP into laboratory data interpretation

• <Last chapter of Sophia and John’s book>

The embarrassing truth about BioNLP (take 2)...

References

• Shuy, Roger (2002) Linguistic battles in trademark disputes. Palgrave.

• Yeh, Alexander; Alexander Morgan; Marc Colosimo; and Lynette Hirschman (2005) BioCreative Task 1A: gene mention finding evaluation. BMC Bioinformatics 6(Suppl. 1):S2.