Upload
katrina-stewart
View
218
Download
0
Tags:
Embed Size (px)
Citation preview
TM and NLP for BiologyResearch Issues in HPSG Parsing
Junichi TSUJII
School of Computer ScienceNational Centre for Text Mining
University of Manchester, UK
Department of Computer ScienceSchool of Information Science and Technology
University of Tokyo, JAPAN
2
Increments
: accumulation
Increase in Medline
2002
2000
1998
199219941996
1990
1988
1980198219841986
1978
1970197219741976
1968
1966
1964
0
100,000
200,000
300,000
400,000
500,000
600,000
年
incr
emen
ts
0
2,000,000
4,000,000
6,000,000
8,000,000
10,000,000
12,000,000
14,000,000
acc
um
ula
tio
n
G-protein coupled receptor
Before 19889 papers
1992256 papers2005
14,000 papers
MEDLINE alone
More than 0.5 million per year More than 1.3 thousand per day
Articles added
Medline Access
1997: 0.163 M accesses/month2006: 82.027 M accesses/month
[D.L.Banville 2006]
500 times more
3
NaCTeMwww.nactem.ac.uk
• First such centre in the world • Funding: JISC, BBSRC, EPSRC• Consortium investment
• Chair in TM (Prof. J. Tsujii, Univ. Tokyo)
• Location: Manchester Interdisciplinary Biocentre (MIB) www.mib.ac.uk funded by the Wellcome Trust
• Initial focus: biomedical academic community• Extend services to industry• Extend focus to other domains (social
sciences)
4
Consortium
• Universities of Manchester, Liverpool• Service activity run by MIMAS (National
Centre for Dataset Services), within MC (Manchester Computing)
• Self-funded partners– San Diego Supercomputing Center – University of California, Berkeley – University of Geneva – University of Tokyo
• Strong industrial & academic support– IBM, AZ, EBI, Wellcome Trust, Sanger Institute,
Unilever, NowGEN, MerseyBio, …
5
6
7
8
9
10
11
NLP and TM
Text Mining
Text as a bag of words
Words as surface strings
Natural Language Processing
Language as a complex system linking surfacestrings of characters with their meanings Text and words as structured objects
NLP-based TM
Linking text with knowledge
12
Non-Trivial Mappings
Language Domain Knowledge Domain
Concepts and Relationships among Them
Linguistic expressions
Motivated Independently of language
TerminologyParsingParaphrasing
From surface diversities and ambiguities to
conceptual invariants
13
Example
14
Non-trivial Mapping
Language Domain Knowledge Domain
Independently motivated of Language
Same relationswith differentStructures
Full-strength Straufen protein lacking this insertion is able to assocaite with osker mRNA and activate its translation, but fails to …..
[A] protein activates [B] (Pathway extraction)
Since ……., we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene.
Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription.
[sentence] > ([arg1_activate] > [protein])Retrieval usingRegional Algebra
15
Predicate-argument structureParser based on Probabilistic HPSG (Enju)
S
p53 has been shown to directly activate the Bcl-2 protein
NP
VP
ADVP
S
VP
VP
VP
NP arg1arg2
arg2
arg3
16
述語 /項構造確率HPSG解析器 (Enju) の出力
The protein is activated by it
DT NN VBZ VBN IN PRP
dt np vp vp pp np
np pp
vp
vp
s
arg1arg2mod
Semantic Retrieval SystemUsing Deep Syntax
MEDIE
Passive
Passive and Infinitival Clause
17
18
19
20
21
22
23
24
25
26
Demos
•MEDIE
• Info-PubMed
27
Predicate-argument structureParser based on Probabilistic HPSG (Enju)
S
p53 has been shown to directly activate the Bcl-2 protein
NP
VP
ADVP
S
VP
VP
VP
NP arg1arg2
arg2
arg3
28
29
30
31
Penn Treebank GENIA
Coverage 99.7% 99.2%
F-Value (PArelations) 87.4% 86.4%
Sentence Precison 39.2% 31.8%
Processing Time 0.68sec 1.00sec
Performance of Semantic Parser
32
Scalability of TM Tools
The number of papers 14,792,890
The number of abstracts 7,434,879
The number of sentences 70,815,480
The number of words 1,418,949,650
Compressed data size 3.2GB
Uncompressed data size 10GB
Target Corpus: MEDLINE corpus
Suppose, for example, that it
takes one second for parsing one
sentence….70 million seconds, that is, about 2 years
33
TM and GRID
• Solution– The entire MEDLINE were parsed by
distributed PC clusters consisting of 340 CPUs
– Parallel processing was managed by grid platform GXP [Taura2004]
• Experiments– The entire MEDLINE was parsed in 8 days
• Output– Syntactic parse trees and predicate
argument structures in XML format– The data sizes of compressed/uncompressed
output were 42.5GB/260GB.
34
Efficient Parsing for HPSG
35
Background: HPSG• Head-Driven Phrase Structure Grammar (HPSG) [Pollard and Sag, 1994]
– Lexicalized and Constraints-based Grammar–A few Rule Schema General constraints on linguistic constructions
–Constraints embedded in Lexicon Word-Specific Constraints
–Constraints between phrase structures and semantic structures
36
I like it
Parsing by HPSG
37
HEAD nounSUBJ < >COMPS < >
I
HEAD nounSUBJ < >COMPS < >
it
HEAD verbSUBJ COMPS
like
<NP><NP>
Parsing by HPSG
Assignment of Lexical Entries
38
HEAD nounSUBJ < >COMPS < >
I
HEAD verbSUBJ COMPS
HEAD nounSUBJ < >COMPS < >
like it
1< >
2< >2
HEAD verb
SUBJ
COMPS < >
HEAD nounSUBJ < >COMPS < >
1< >
Head-Complement
Application of
Rule Schema
39
HEAD nounSUBJ < >COMPS < >
HEAD nounSUBJ < >COMPS < >
I
HEAD verbSUBJ COMPS
like it
1< >2< >
2
HEAD verbSUBJ < >COMPS < >
1< >HEAD verbSUBJ COMPS < >
1
Subject-Head
Application of
Rule Schema
40
Inefficiency of HPSG Parsing
• Complex DAG : Typed-feature structures– Abstract machine for Unification (LiLFeS)
• Unification: Expensive Operation (⇔ CFG Approximation: CFG Filtering )
• Assignment of Lexical Entries– High reduction of search space / Super
tagging
41
Filtering with CFG (1/5)• 2-phased parsing
– Approximate HPSG with CFG with keeping important constraints.
– Obtained CFG might over-generate, but can be used in filtering.
– Rewriting in CFG is far less expensive than that of application of rule schemata, principles and so on.
CompileHPSG CFGFeature
Structures
Input Sentences
Built-in CFG Parser
LiLFeS UnificationParsing
+
Output
Complete parse trees
47
System Overview
I like it
HEAD nounSUBJ < >
COMPS < >
HEAD nounSUBJ < >
COMPS < >
HEAD nounSUBJ < >
COMPS < >
HEAD nounSUBJ < >
COMPS < >
HEAD nounSUBJ < >
COMPS < >
HEAD nounSUBJ < >
COMPS < >
HEAD verbSUBJ <NP>
COMPS <NP>
HEAD verbSUBJ <NP>
COMPS <NP>
HEAD verbSUBJ <NP>
COMPS <NP>
HEAD verbSUBJ <NP>
COMPS <NP>
HEAD verbSUBJ <NP>
COMPS <NP>
HEAD verbSUBJ <NP>
COMPS <NP>
HEAD nounSUBJ < >
COMPS < >
HEAD nounSUBJ < >
COMPS < >
HEAD nounSUBJ < >
COMPS < >
HEAD nounSUBJ < >
COMPS < >
HEAD nounSUBJ < >
COMPS < >
HEAD nounSUBJ < >
COMPS < >
P High
Supertagger
I like itInputsentence
CFG Filtering
I like it
HEAD nounSUBJ < >
COMPS < >
HEAD verbSUBJ <NP>
COMPS <NP>
HEAD nounSUBJ < >
COMPS < >
I like it
HEAD nounSUBJ < >
COMPS < >
HEAD verbSUBJ <NP>
COMPS <NP>
HEAD nounSUBJ < >
COMPS < >
I like it
HEAD nounSUBJ < >
COMPS < >
HEAD verbSUBJ <NP>
COMPS <NP>
HEAD nounSUBJ < >
COMPS < >
...
Deterministic Shift/Reduce Parser
I like it
Experiment Results
LP(%) LR(%) F1(%) Avg. timeStaged/Deterministic model
86.93 86.47 86.70 30ms/snt
Previous method 1( Supertagger+ChartParser)
87.35 86.29 86.81 183ms/snt
Previous method 2( Unigram + ChartParser )
84.96 84.25 84.60 674ms/snt
6 times faster20 times faster than the initial model
49
Domain/Text Type Adaptation
50
F-score Training Time ( Sec )
Baseline ( PTB-trained, PTB-applied) 89.81 0
Baseline (PTB-trained, GENIA-applied) 86.39 0
Retraining ( GENIA ) 88.45 14,695
Retraining ( PTB+GENIA) ) 89.94 238,576
Structure with RefDist 88.18 21,833
Lexical with RefDist 89.04 12,957
Lex/Structure with RefDist 90.15 31,637
51
Adaptation with Reference Distribution
)(
)|()|(
,)|()|(1
)|(
w ww
ww
l
lw
Tt wsyniilex
wsyniilexE
i
i
tqwlpZ
tqwlpZ
tp
Lexical Assignment Syntactic Preference
Original model
j
jjs
stgZ
stpM )|(exp1
)|(
Feature function
Feature weight)|(0 stp
52
83
84
85
86
87
88
89
90
0 2000 4000 6000 8000
Number of Sentence of the GENIA Training Set
F-s
core
Baseline (PTB)
Simple Retraining ( GENIA)
Retraining (GENIA+PTB)Structure with Ref.DistLexical with RefDist
Lexical/Structure woth RefDist
53
83
84
85
86
87
88
89
90
0 10000 20000 30000
Training Time ( Sec )
F- s
core
Retrinaing(GENIA)
Structure with RefDistLexicon woth RefDist
Lex/Str with RefDist
54
F-score Training Time ( Sec )
Baseline ( PTB-trained, PTB-applied) 89.81 0
Baseline (PTB-trained, GENIA-applied) 86.39 0
Retraining ( GENIA ) 88.45 14,695
Retraining ( PTB+GENIA) ) 89.94 238,576
Structure with RefDist 88.18 21,833
Lexical with RefDist 89.04 12,957
Lex/Structure with RefDist 90.15 31,637
55
Tool1: POS Tagger
• General-Purpose POS taggers, trained by WSJ– Brill’s tagger, TnT tagger, MX POST, etc. – 97%
• General-Purpose POS taggers do not work well for MEDLINE abstracts
The peri-kappa B site mediates human immunodeficiency DT NN NN NN VBZ JJ NNvirus type 2 enhancer activation in monocytes … NN NN CD NN NN IN NNS
56
Errors seen in TnT tagger (Brants 2000)
A chromosomal translocation in … DT JJ NN IN… and membrane potential after mitogen binding. CC NN NN IN NN JJ… two factors, which bind to the same kappa B enhancers… CD NNS WDT NN TO DT JJ NN NN NNS … by analysing the Ag amino acid sequence. IN VBG DT VBG JJ NN NN… to contain more T-cell determinants than … TO VB RBR JJ NNS IN Stimulation of interferon beta gene transcription in vitro by NN IN JJ JJ NN NN IN NN IN
57
Performance of GENIA Tagger
Training corpus WSJ GENIA
WSJ 97.0 84.3
GENIA 75.2 98.1
WSJ+GENIA 96.9 98.1
Training corpus
WSJ GENIA
WSJ 96.7 84.3
GENIA 80.1 97.9
WSJ+GENIA 96.5 97.5
• GENIA tagger (Ref.) TnT tagger
No degradation of the taggertrained by the mixed corpus
Some degradations (0.2 ~ 0.4) were observed, compared withthe taggers trained by “pure” corpora
58
CRF-based POS + Active LearningGENIA
3,000 sentences : 98.420,000 sentences: 98.58
59
10,000 sentences: 96.76Best Performance: 97.18
CRF-based POS + Active LearningPTB
60
Applications
GENIA Event Annotation - example
LinkCauseLinkCause
– For an identified event in the given sentence,• classify the type of events and record the text span giving the clue of it (ClueType).• identify the theme of the events and record the text span linking the theme to the event
(LinkTheme).• identify the cause of the events and record the text span linking the cause to the event
(LinkCause).
• record the environment (location, time) of the events (ClueLoc, ClueTime).
LinkThemeLinkTheme
ClueLocClueLoc
ClueTypeClueType
ClueTypeClueType
Gene_expression• Theme patterns observed (2,958)
– Protein 2,308– DNA 591– RNA 25– Peptide 4– Protein Protein 2– Erroneous 27
• Keywords– coexpress, nonexpress, overexpress,
express, biosynthesis, product, synthesize, constitute, …
coexpression
Transcription
• Theme patterns observed (929)– DNA 449– RNA 272– Protein 167– Peptide 2– Erroneous22
• Keyword– Transcrib, transcript, synthesi, express,
…
Localization• Theme patterns observed
(730)– Protein 608– Lipid 31– Atom 29– Other_organic_compound
14– DNA 12– Virus 5– Carbohydrate 5– RNA 4– Inorganic 4– Peptide 3• Keywords
– Translocation, sectetion, release, localization, mobilization, uptake, secrete, import, transport, translocate, sequester, influx, mograte, localisation, move, delivery, export, …
• ClueLoc
– NONE 241
– nuclear 140
– to the nucleus 12
– into the nucleus 11
– Cytoplasmic 8
– in the cytoplasm 7
– macrophages 5
– nuclear … in t lymphocytes4
– monocytes 4
– in the nucleus 4
– in the cytosol 4
– in colostrum 4
– from the cytoplasm to the nucleus 4
Localization• Keywords and Locations
– translocation (166)• nuclear 108• NONE
38• …
– secretion (100)• NONE
57• name_of_cells 43
– release (80)• NONE
51• name_of_cells 19• …
– localization (30)• nuclear 25• intracellular 3
– uptake (24)• NONE 14• name_of_cells 20
• Keywords and Themes– translocation (166)
• Protein 161• Virus
4• RNA
1– secretion (100)
• Protein 98• Lipid
1• Peptide 1
– release (80)• Protein 67• Other_organic_compoun 6• Lipid
3– localization (30)
• Protein 30– uptake (24)
• Lipid15
• Carbohydrate 5• Protein 4
69
Future Plan
Kitano’s group, Kell’s group
70
71
72
Future Directions
• Domain Adaptation + Inter-operability– High performance can be obtained by using domain specific
characteristics and domain semantics– Differences among abstracts, full papers, comments in DBs
– Standardized Interfaces (API) of NLP tools
• Text Archives – Abstracts + Full Papers + Comments/Summary Descriptions
in DBs
• Combining NLP tools with Mining tools – Knowledge Discovery (Disease Gene Association)– Hypotheses Generation– Automatic Data Interpretation
73
Future Directions
• Domain Adaptation + Inter-operability– High performance can be obtained by using domain specific
characteristics and domain semantics– Differences among abstracts, full papers, comments in DBs
– Standardized Interfaces (API) of NLP tools
• Text Archives – Abstracts + Full Papers + Comments/Summary Descriptions
in DBs
• Combining NLP tools with Mining tools – Knowledge Discovery (Disease Gene Association)– Hypotheses Generation– Automatic Data Interpretation