BioNLP for NLPeople
CS5832/HLT-NAACL/RANLP
The weirdest job in the world
How I got here
• Voice Input Technologies
• Linguistix
• Nationwide Insurance
• MapQuest
• Berdy Medical Systems
• OneRealm [sic]
How I got here
• Perl hacker, SLM data preprocessing
• Linguist, corpus construction
• Senior Programmer/Analyst, Interactive Voice Response (yuck)
• Software test dept. manager; senior software engineer
• Consultant/Perl hacker
• Senior software engineer
What is BioNLP?
• Natural language processing applied to biomedical language
  – Publications
  – Medical records
  – Ontologies
Part 0
Why a field called BioNLP?
There is little reason for the data on which a linguist works to have the right to name that work.
Shuy 2002:8
(One lab’s) funding for NLP in computational biology
• INIA (Neuroinformatics of Alcoholism) $5M, 5 years
• Wyeth Genomics Institute ($200K, 2 years)
• National Library of Medicine ($4.2M, 3 years)
• National Library of Medicine ($XM, 3 years)
Why biologists care
• High-throughput data interpretation
• Literature search
• Annotation
• Database construction
But, I’m an NLPerson (computer scientist, mathematician, engineer…)
• Hard, but might be possible
• Might be harder in biomedical domain than in newswire text
• Might be more possible in biomedical domain than in newswire text
Resources
The big drawing point for NLPeople
• Data
  – Lexical resources
  – 500 * 16M words of text
  – Labelled training data
• Tools
  – NER, POS taggers, parsers, semantic normalizers....
$$$
Job market
• Academia: great
  – US, Europe
• Industry: not bad, but genomics-specific right now
Surely Shuy jests...
There is little reason for the data on which a linguist works to have the right to name that work.
It really is different on every level
• Tokenization
• Named entity recognition
• Corpus construction
• Semantic representation
NLP actually could make the world a better place....
An embarrassing truth about BioNLP...
www.chilibot.net
Part 1: Just enough biology
Cells and proteins
<illustration: cell, structures, proteins>
How biologists see the world
Wattarujeekrit et al. (2004)
The Central Dogma: from genes to proteins
http://www.swbic.org/products/clipart/images/dogmag.jpg
http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/images/central_dogma.gif
Higher-level structures
• Genotype, phenotype
• Tissue, organ, organism
Biological structures are complex
SNAP Receptor
Vesicle SNARE
V-SNARE
N-Ethylmaleimide-Sensitive Fusion Protein
Soluble NSF Attachment Protein
Maleic acid N-ethylimide
Vesicle Soluble Maleic acid N-ethylimide Sensitive Fusion Protein Attachment Protein Receptor
(Alex Morgan, MITRE)
Part 2: Why bioscientists fund and publish research in BioNLP
Two basic markets, multiple user types
• Medical
  – Clinicians
  – Consumers
  – “Informationists”
  – Administrators (billing, quality assurance, ...)
• “MolBio” (genomic)
  – High-throughput experimentalists
  – “Bench scientists”
  – Model organism database curators
Structured vocabulary
Free text (phenotypes)
122 references...
Medical
1997
<scanned picture of business card>
<happy-face photo>
One year later…
A sad story: physicians don’t buy a lot of NLP software
Another sad story: trying to sell “gisting” to physicians
Sold for $400K: 14.5 or 2.9¢ on the dollar…
Salesperson’s thought process
Physician’s thought process
Genomics
Why biologists care
• High-throughput data interpretation
• Literature search
• Annotation
• Database construction
Why biologists care
10 years ago...
Why biologists care
Today....
Double exponential growth in the literature
New entries in Medline with publication date in Jan-Aug 2005: 431,478 (avg. 1775/day)
Biological Nomenclature: “V-SNARE”
SNAP Receptor
Vesicle SNARE
V-SNARE
N-Ethylmaleimide-Sensitive Fusion Protein
Soluble NSF Attachment Protein
Maleic acid N-ethylimide
Vesicle Soluble Maleic acid N-ethylimide Sensitive Fusion Protein Attachment Protein Receptor
(Alex Morgan, MITRE)
Part 3: Some things that make BioNLP different
Named Entity Recognition
Genes have names??
Suzanna Lewis
• Fruitfly geneticist
• 5 kids
• Latte + 3 shots
Suzanna Lewis
It is the middle of the night (2:38 to be precise), I am away from friends and family, It has been this way for over 2 years, I can't sleep because of all the work there is yet to do, and there is no end in sight. So when do the magic little elves appear out of nowhere and get everything done?
p.s. I am serious.
Suzanna Lewis
pray for elves
D. melanogaster gene Pray For Elves, abbreviated as PFE, is reported here. It has also been known in FlyBase as CG15151. Similar sequences have been identified in Caenorhabditis elegans, Homo sapiens, Mus musculus, Rattus norvegicus and Saccharomyces cerevisiae.
(FlyBase report FBal0138651)
Named entity recognition
• Molecular biology entity identification problem:
  – large list of classes
  – some of them much harder
• Usual case-related cues don't help
• More variability of content
• Huge lexical ambiguity problem
• Common English
  – as posed, not useful
white
"wild-type" (not mutated)
white
"mutant"
Case is meaningful
white
Symbol: w
White
Symbol: W
Yes, there are genes with the symbols I, a, R, p....
Case is meaningful
Misshapen (Msn) has been proposed to shut down Drosophila photoreceptor (R cell) growth cone motility in response to targeting signals linked by the SH2/SH3 adaptor protein Dock. (Ruan et al. 2002)
…even sentence-initially.
sunday driver (syd) was identified in a screen for novel axonal transport mutants in Drosophila. Syd is a ~137kDa protein that is broadly conserved in evolution with homologous proteins identified in C. elegans, mouse and human. (Bowman 2000)
Case is meaningful
Misshapen (Msn) has been proposed to shut down Drosophila photoreceptor (R cell) growth cone motility in response to targeting signals linked by the SH2/SH3 adaptor protein Dock. Here, we show that Bifocal (Bif), a putative cytoskeletal regulator, is a component of the Msn pathway for regulating R cell growth targeting. bif displays strong genetic interaction with msn.
Surely you could determine on a document-by-document basis…
Axonal traffic jams with a sunday driver: Identification of a broadly conserved transmembrane protein required for axonal transport in Drosophila. (Bowman 2000)
Evolution
• What it looks like
• What it acts like
• Metaphor
• …
Looks like…
• white
• swiss cheese
• clown
• dachshund
• dreadlocks
Acts like…
• ether a go-go
• lush
• agnostic
• amontillado
Metaphor/metonymy
• lot
• maggie
• scott of the antarctic
• always early -> british rail
• asp -> cleopatra
• tudor -> vasa -> gustavus
• nanos -> smaug
whimsy
• chablis, merlot, zinfandel, retsina, moonshine (16 zebrafish genes)
• milkah, murashka, zolotistyuy, zloday (32 Drosophila genes)
But, that’s not the only way of naming genes....
• Breast cancer 1 (BRCA1)
• p53
• Ribosomal protein S27
• Heat shock protein 110
• Mitogen activated protein kinase 15
• Mitogen activated protein kinase kinase kinase 5
• fuculokinase
• GABA
• Heat shock protein 60
• calmodulin
• dHAND
• suppressor of p53
• cheap date
• lush
• ken and barbie
• ring
• to
• the
• there
• a
Worst gene names
• sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A
• SEMA5A
• Tyrosine kinase with immunoglobulin and epidermal growth factor homology domains
• tie
“Gene mention” (NER)
• What doesn’t work
• What does (as of 2004)
Yeh et al. (2005)
Good systems?
• Handle multi-word names (heat shock protein 60) (base NP chunking, abbreviation definitions, post-processing)
• Use some form of machine learning (MaxEnt, HMM, CRF, SVM) (or a clever hack)
• Do some rule-based post-processing
• Don’t rely on dictionaries
The Jim Martin technique really works
Kinoshita et al. (2005)
...which isn’t to say that external knowledge is bad
• Markert/Nissim’s extensions of Poesio’s use of Google
Most feature sets include...
• Typographic/orthographic features
  – Patterns like \w+-?\d+
  – Contains Greek letters
• Local/distant context
  – Next word is “protein”
  – Followed by “protein” somewhere else in document
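A minimal sketch of how features like these might be computed for one token. The function name, the feature names, and the short list of spelled-out Greek letters are assumptions made for illustration; they are not taken from any of the systems cited here.

```python
import re

# The orthographic pattern quoted on the slide, anchored to the whole token.
SYMBOL_PATTERN = re.compile(r"^\w+-?\d+$")  # e.g. p53, Hsp-60, BRCA1

# Greek characters, plus a few spelled-out letter names (illustrative list only).
GREEK = re.compile(r"[\u0370-\u03ff]|\b(?:alpha|beta|gamma|delta|kappa)\b",
                   re.IGNORECASE)


def token_features(tokens, i):
    """Return a dict of binary features for tokens[i] within one document."""
    lowered = [t.lower() for t in tokens]
    # Distant context: the same token is followed by "protein" somewhere else.
    elsewhere = any(
        j != i and lowered[j] == lowered[i] and lowered[j + 1] == "protein"
        for j in range(len(tokens) - 1)
    )
    return {
        "matches_symbol_pattern": bool(SYMBOL_PATTERN.match(tokens[i])),
        "contains_greek": bool(GREEK.search(tokens[i])),
        "next_is_protein": i + 1 < len(tokens) and lowered[i + 1] == "protein",
        "protein_elsewhere": elsewhere,
    }


doc = ["Hsp-60", "protein", "binds", "p53", ";", "p53", "protein",
       "and", "TNF-alpha"]
print(token_features(doc, 3))
```

The dictionary can be fed to whichever learner is in use (MaxEnt, CRF, SVM); the distant-context feature is what lets the first, bare mention of p53 benefit from the later "p53 protein".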
Why not better?
• Length
• Case
• Tokenization
• Annotation issues
  – Inconsistency
  – Multiple correct answers
  – Inter-corpus differences in definition
Yeh et al. (2005)
Length effect (and why the Jim Martin technique works so well for this)
Kinoshita et al. (2005)
A great research project
• Build an NER system for...
  – Species
  – Laboratory techniques
  – Cell types
  – Cell lines
  – Tissues
  – ....
...and, NER isn’t what you need anyways
• GN task and results
Tokenization
• How to build a cheap base noun phrase chunker
  – Start from right, move left
    • If next token is not conjunction, preposition, comma, period, or right parenthesis, add it
    • Else start a new chunk
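The right-to-left rule above can be sketched in a few lines. The `BREAKERS` word list is an assumption (a handful of common conjunctions and prepositions plus the three punctuation marks named on the slide); a real list would be longer.

```python
# Tokens that end the current chunk: a small, illustrative set of
# conjunctions and prepositions, plus comma, period, and right parenthesis.
BREAKERS = {"and", "or", "but", "of", "in", "on", "by", "with", "for",
            "to", "from", "at", ",", ".", ")"}


def cheap_chunks(tokens):
    """Scan right to left, growing a chunk until a breaker token is hit."""
    chunks = []
    current = []
    for tok in reversed(tokens):
        if tok.lower() in BREAKERS:
            # Breaker: close the chunk built so far and start a new one.
            if current:
                chunks.append(current[::-1])
            current = []
        else:
            current.append(tok)
    if current:
        chunks.append(current[::-1])
    return chunks[::-1]  # restore left-to-right order


sentence = "expression of heat shock protein 60 and calmodulin in rat liver ."
print(cheap_chunks(sentence.split()))
# [['expression'], ['heat', 'shock', 'protein', '60'], ['calmodulin'], ['rat', 'liver']]
```

Nothing here knows about verbs or determiners, so the output is only as good as the breaker list; that is the sense in which the chunker is cheap.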
Tokenization
• Commas
  – 2,6-diaminohexanoic acid
  – tricyclo(3.3.1.13,7)decanone
Four kinds of hyphens
• “Syntactic:”
  – Calcium-dependent
  – Hsp-60
• Knocked-out gene: lush-- flies
• Negation: -fever
• Electric charge: Cl-
B-cell-CD4(+)-T-cell interactions
• PMID: 10516078
Special challenges in biomedical corpus construction
• How do you parse rat epithelial growth factor receptor 2?
• Don’t: pretag all named entities
• How do you tokenize tricyclo(3.3.1.13,7)decanone?
• Don’t: pretag all named entities
• How do you hire a linguistics graduate student to tag rat epithelial growth factor receptor 2?
• You can’t...
• How do you do PAS tagging when you don’t have syntactically tagged text?
• Sigh...
Some specific cases of word sense disambiguation
Abbreviation disambiguation
• Incidence of ambiguous abbreviations (Jeff Chang’s paper)
• Statistical approaches
  – Chang
• Rule-based
  – Schwartz and Hearst
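For a feel of the rule-based side, here is a simplified sketch in the spirit of Schwartz and Hearst's abbreviation-definition matcher: walk the short form right to left, find each character in the candidate long form, and require the first character to start a word. It assumes the short form and a candidate long form have already been pulled out of a "long form (SF)" pattern, and it leaves out the candidate-length and validity checks of the published method.

```python
def find_best_long_form(short_form, long_form):
    """Return the shortest suffix of long_form that matches short_form, or None."""
    s = len(short_form) - 1
    l = len(long_form) - 1
    while s >= 0:
        c = short_form[s].lower()
        if not c.isalnum():
            # Skip punctuation inside the short form (e.g. the hyphen in "V-SNARE").
            s -= 1
            continue
        # Move left until the character matches; the first character of the
        # short form must also sit at the start of a word in the long form.
        while (l >= 0 and long_form[l].lower() != c) or \
              (s == 0 and l > 0 and long_form[l - 1].isalnum()):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    # Back up to the beginning of the word where the match started.
    start = long_form.rfind(" ", 0, l + 1) + 1
    return long_form[start:]


print(find_best_long_form("HSP", "the heat shock protein"))
print(find_best_long_form("BRCA1", "Breast cancer 1"))
```

This finds where an abbreviation is defined; choosing among several known expansions of an ambiguous abbreviation is a separate step.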
Part 4: getting up to speed
(about) 10 papers and resources that will let you read most other papers in BioNLP
Named entity recognition 1: rule-based
• Fukuda et al. (1998): first NER paper
  – Find something that looks like a symbol for a yeast gene (ABC1)
  – Extend name to the left (yeast ABC1)
  – Extend name to the right (ABC1 protein)
• Results in 90s
  – Never replicated
  – Yeast is easy
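The three steps above can be sketched as follows. This is a toy illustration of the idea, not a reimplementation of Fukuda et al.: the symbol regex and the two context word lists are assumptions chosen to fit the ABC1 example.

```python
import re

SYMBOL = re.compile(r"^[A-Z]{3}\d+$")  # assumed shape of a yeast gene symbol
LEFT_WORDS = {"yeast"}                 # assumed words that extend a name leftward
RIGHT_WORDS = {"protein", "gene"}      # assumed words that extend a name rightward


def find_gene_names(tokens):
    """Find symbol-like tokens, then extend each one left and right by one word."""
    names = []
    for i, tok in enumerate(tokens):
        if not SYMBOL.match(tok):
            continue
        start, end = i, i + 1
        if start > 0 and tokens[start - 1].lower() in LEFT_WORDS:
            start -= 1  # e.g. "yeast ABC1"
        if end < len(tokens) and tokens[end].lower() in RIGHT_WORDS:
            end += 1    # e.g. "ABC1 protein"
        names.append(" ".join(tokens[start:end]))
    return names


print(find_gene_names("The yeast ABC1 protein interacts with XYZ2 .".split()))
# ['yeast ABC1 protein', 'XYZ2']
```

A pattern this tidy is exactly why yeast is easy; it would miss white, lush, or ken and barbie entirely.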
Named entity recognition 2: machine learning
• Collier et al. (XXX)
NER 3: state of the art
Information extraction 1: rule-based
• Blaschke 1998
Information extraction 2: machine learning
• Craven and Kumlien 199X
• Identify entity pairs
  – Protein/protein
  – Protein/disease
  – Protein/?
• Use naïve Bayes to classify sentences as +/- positing a relation
  – Features: bag-of-words
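A minimal sketch of the classification step: multinomial naïve Bayes over bag-of-words features, with add-one smoothing. The training sentences are made up, and replacing the entity pair with the placeholders PROT1/PROT2 is an assumption for the example, not something taken from the paper.

```python
import math
from collections import Counter


def train_naive_bayes(labeled_sentences):
    """Count words per class from (sentence, label) pairs."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in labeled_sentences:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log posterior for the sentence."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in class_counts.items():
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        score = math.log(n / total)
        for w in text.lower().split():
            if w in vocab:  # ignore words never seen in training
                score += math.log((counts[w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set: True = sentence posits a relation between the pair.
train = [
    ("PROT1 binds PROT2 in vitro", True),
    ("PROT1 interacts with PROT2", True),
    ("PROT1 phosphorylates PROT2", True),
    ("PROT1 and PROT2 were purified", False),
    ("PROT1 and PROT2 are expressed in liver", False),
    ("levels of PROT1 and PROT2 were measured", False),
]
model = train_naive_bayes(train)
print(classify(model, "PROT1 binds PROT2"))
```

With bag-of-words features the classifier sees only which words co-occur with the pair, not their order or syntax, which is what the rule-based and linguistically informed approaches try to add.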
Information extraction 3: rules, linguistics, knowledge
• Friedman: MedLEE, BioMedLEE
• NER
• Syntax
Corpora: 1
• PubMed/MEDLINE
  – MEDLINE: database of 16M+ abstracts
  – PubMed: interface for searching MEDLINE
  – ASCII and free
NOT a corpus, not really even a “text collection”
Corpora: 2
• GENIA
  – Fully annotated corpus
  – 2,000 abstracts
  – X00,000 words
  – Now: POS, named entities, 25% treebanked
  – Coming: anaphora; events?; PAS?; dependency parses?
Lexical resources: 1
• Gene Ontology
  – Biological processes
  – Molecular functions
  – Cell components
• Building blocks
  – Terms + definitions
  – Is-a, part-of
Lexical resources: 2
• Entrez Gene (formerly LocusLink)
  – Names
  – Symbols
  – Synonyms
  – Protein products
  – “Summary”
  – Gene References Into Function
Lexical resources: 3
• UMLS (Unified Medical Language System)
  – MetaThesaurus
  – Semantic Network
Tools overview
• Probably something available
• Might work decently
• Definitely improvable for your specific task
Tools: 1
• POS tagging:
  – GENIA
  – MEDPOST
  – LingPipe?
Tools: 2
• Named entity recognition
  – ABNER (Settles 200x)
  – KeX
  – AbGene
• LESSON: distribute a .jar file and the world will beat a path to your door
Part 6: Current hot topics
What’s the right model for semantic representation?
• So far: binary relations
• Arguments that that’s not good enough
  – Rzhetsky/GeneWays paper
  – Penn folks/IE paper
  – Native speaker intuitions (Juliane, etc.)
• Two ways forward
  – Differentiating binary relations
    • Marti HLT/EMNLP; Tsujii
  – PAS
    • PASBio/Wattarujeekrit et al.
    • Kogan et al.
Karin: how do these representational choices affect what a biologist would get out of the text?
The ontology wars
• Point:
  – Hunter; PASBio; Barry Smith; L&C....
  – GOA; MGI; EBI; ...
• Counterpoint:
  – Tsujii/Ananiadou; Pedersen/Pakhomov; Markert/Nissim...
True integration of NLP into laboratory data interpretation
• <Last chapter of Sophia and John’s book>
The embarrassing truth about BioNLP (take 2)...
References
• Shuy, Roger (2002) Linguistic battles in trademark disputes. Palgrave.
• Yeh, Alexander; Alexander Morgan; Marc Colosimo; and Lynette Hirschman (2005) BioCreative Task 1A: gene mention finding evaluation. BMC Bioinformatics 6(Suppl. 1):S2.