Domain Adaptation for Biomedical Information Extraction
Jing Jiang
BeeSpace Seminar, Oct 17, 2007
10/17/07 2
Outline
- Why do we need domain adaptation?
- Solutions:
  - Intelligent learning methods
  - Knowledge bases
  - Expert supervision
- Connections with BeeSpace V4
Why do we need domain adaptation?
- Many biomedical information extraction problems are solved by supervised machine learning methods such as support vector machines (SVMs).
  - Entity recognition
  - Relation extraction
  - Sentence categorization
- In supervised machine learning, it is assumed that the training data and the test data have the same distribution.
Why do we need domain adaptation?
- Existing labeled training data is often limited to certain domains.
  - GENIA corpus: human, blood cells, transcription factors
  - PennBioIE: genetic variation in malignancy, cytochrome P450 inhibition
  - Training data for sentence categorization in gene summarizer: fly
- Even when the training data is diverse (containing multiple domains), it is still useful to customize the classifier for the particular target domain that we are working on.
Why do we need domain adaptation?
NER task | Train → Test | F1
to find PER, LOC, ORG from news text | NYT → NYT | 0.855
to find PER, LOC, ORG from news text | Reuters → NYT | 0.641
to find gene/protein from biomedical literature | mouse → mouse | 0.541
to find gene/protein from biomedical literature | fly → mouse | 0.281
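To make the size of the cross-domain drop concrete, the short Python sketch below computes the absolute and relative F1 loss for the two tasks. The function name `f1_drop` is hypothetical; only the four F1 values come from the slide.

```python
def f1_drop(in_domain, cross_domain):
    """Return (absolute drop, relative drop) between two F1 scores."""
    absolute = in_domain - cross_domain
    relative = absolute / in_domain
    return absolute, relative

# F1 values copied from the slide: (same-domain training, cross-domain training).
results = {
    "news (NYT -> NYT vs. Reuters -> NYT)": (0.855, 0.641),
    "gene/protein (mouse -> mouse vs. fly -> mouse)": (0.541, 0.281),
}

for task, (in_dom, cross_dom) in results.items():
    absolute, relative = f1_drop(in_dom, cross_dom)
    print(f"{task}: drop = {absolute:.3f} ({relative:.1%} relative)")
```

On these numbers the news tagger loses about a quarter of its F1 when trained on a different source, and the gene/protein tagger loses nearly half.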
Solutions to domain adaptation
- Intelligent learning methods
  - Instance weighting
  - Feature selection
- Knowledge bases
- Expert supervision
(Slide annotations: thesis research, future work, discussion)
Domain adaptive learning methods
- Two-stage approach
- Two frameworks
  - Instance weighting
  - Feature selection
- Use of unlabeled data
Intuition
[Figure sequence: Source Domain and Target Domain diagrams, one per step]
- Goal
- Start from the source domain
- Focus on the common part
- Pick up some part from the target domain
Formal formulation?
- How to formally formulate these ideas?
Instance weighting
[Figure: Source Domain and Target Domain in instance space; each point represents an example]
- Assign different weights to different instances in the objective function.
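As a minimal sketch of what weights in the objective function could look like, the Python below evaluates a weighted log-likelihood for a binary logistic model. The model form, the function names, and the toy numbers are assumptions made for illustration; none of them come from the slides.

```python
import math

def log_prob(y, x, theta):
    """log p(y | x; theta) for a binary logistic model, with y in {0, 1}."""
    score = sum(t * xi for t, xi in zip(theta, x))
    p1 = 1.0 / (1.0 + math.exp(-score))
    return math.log(p1 if y == 1 else 1.0 - p1)

def weighted_log_likelihood(data, weights, theta):
    """Sum of per-instance log-likelihoods, each scaled by its weight."""
    return sum(w * log_prob(y, x, theta) for (x, y), w in zip(data, weights))

# Toy instances: (feature vector, label). Purely illustrative.
data = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1)]
theta = [0.5, -0.5]

uniform = weighted_log_likelihood(data, [1.0, 1.0, 1.0], theta)
# A weight of 0 removes the last instance from the objective entirely.
without_last = weighted_log_likelihood(data, [1.0, 1.0, 0.0], theta)
print(uniform, without_last)
```

With all weights equal to 1 this is the ordinary log-likelihood; changing a weight changes how much that instance influences the fitted parameters.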
Instance weighting
Observation
[Figure: source domain vs. target domain]
Instance weighting
Analysis of domain difference
- p(x, y) = p(x) p(y | x)
- Labeling difference: p_s(y | x) ≠ p_t(y | x), which calls for labeling adaptation
- Instance difference: p_s(x) ≠ p_t(x), which calls for instance adaptation?
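The factorization on this slide also splits the ratio between the two joint distributions into an instance term and a labeling term. The identity below follows directly from p(x, y) = p(x) p(y | x) wherever p_s(x, y) > 0; reading the two factors as candidate instance weights is an interpretation, not something the slide states.

```latex
\frac{p_t(x, y)}{p_s(x, y)}
  = \underbrace{\frac{p_t(x)}{p_s(x)}}_{\text{instance difference}}
    \cdot
    \underbrace{\frac{p_t(y \mid x)}{p_s(y \mid x)}}_{\text{labeling difference}}
```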
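Instance weighting
Three sets of instances: D_s, D_{t,l}, D_{t,u}
$$\theta_t^* = \arg\max_{\theta} \int_{\mathcal{X}} p_t(x) \sum_{y \in \mathcal{Y}} p_t(y \mid x) \log p(y \mid x; \theta)\, dx$$
Approximate \mathcal{X} with D_s + D_{t,l} + D_{t,u}?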
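Instance weighting
Framework
$$
\begin{aligned}
\hat{\theta} = \arg\max_{\theta} \Big[\;
  & \lambda_s \frac{1}{C_s} \sum_{i=1}^{N_s} \alpha_i \beta_i \log p(y_i^s \mid x_i^s; \theta) \\
  & + \lambda_{t,l} \frac{1}{C_{t,l}} \sum_{j=1}^{N_{t,l}} \log p(y_j^{t,l} \mid x_j^{t,l}; \theta) \\
  & + \lambda_{t,u} \frac{1}{C_{t,u}} \sum_{k=1}^{N_{t,u}} \sum_{y \in \mathcal{Y}} \gamma_k(y) \log p(y \mid x_k^{t,u}; \theta) \\
  & + \log p(\theta) \;\Big]
\end{aligned}
$$
- First term: labeled source data
- Second term: labeled target data
- Third term: unlabeled target data
- \lambda_s + \lambda_{t,l} + \lambda_{t,u} = 1
- A flexible setup covering both standard methods and new domain adaptive methods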
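The three data terms named on this slide (labeled source, labeled target, unlabeled target) can be combined in a few lines of Python. This is a sketch under stated assumptions: the function name, the input layout, and the choice to normalize each term by its instance count are hypothetical simplifications, and the per-instance log-probabilities are assumed to be computed elsewhere.

```python
def combined_objective(src, tgt_lab, tgt_unlab, lam, log_prior=0.0):
    """Weighted sum of three normalized log-likelihood terms plus a log prior.

    src:       list of (weight, log_prob) for labeled source instances
    tgt_lab:   list of log_prob for labeled target instances
    tgt_unlab: list of dicts {label: (weight, log_prob)}, one dict per
               unlabeled target instance, one entry per candidate label
    lam:       (lam_s, lam_tl, lam_tu), assumed to sum to 1
    """
    lam_s, lam_tl, lam_tu = lam
    assert abs(lam_s + lam_tl + lam_tu - 1.0) < 1e-9, "coefficients must sum to 1"

    def mean(values):
        # Normalize by the number of instances; an empty set contributes 0.
        return sum(values) / len(values) if values else 0.0

    source_term = mean([w * lp for w, lp in src])
    target_labeled_term = mean(tgt_lab)
    target_unlabeled_term = mean(
        [sum(w * lp for w, lp in cand.values()) for cand in tgt_unlab]
    )
    return (lam_s * source_term + lam_tl * target_labeled_term
            + lam_tu * target_unlabeled_term + log_prior)

# Standard supervised learning on source data only: all mass on the source
# term and uniform instance weights.
standard = combined_objective([(1.0, -0.2), (1.0, -0.4)], [], [], (1.0, 0.0, 0.0))
print(standard)
```

Moving mass from the first coefficient to the other two, or making the source weights non-uniform, is what turns the standard setup into a domain adaptive one.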
Feature selection
[Figure: Source Domain and Target Domain in feature space; each point represents a feature]
- Identify features that behave similarly across domains.
Feature selection
Observation: domain-specific features
- Fly genes: wingless, daughterless, eyeless, apexless, …
- “suffix -less” is weighted high in the model trained from fly data.
- Useful for other organisms? In general, NO!
- May cause generalizable features to be downweighted.
Feature selection
Observation: generalizable features generalize well in all domains
- fly: “…decapentaplegic and wingless are expressed in analogous patterns in each…”
- mouse: “…that CD38 is expressed by both neurons and glial cells…”; “…that PABPC5 is expressed in fetal brain and in a range of adult tissues.”
- “w_{i+2} = expressed” is generalizable
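The two kinds of features discussed here can be extracted with a few lines of Python. This is a minimal sketch: the function name and feature strings are hypothetical, whitespace tokenization is an assumption, and the example sentences are shortened from the fragments quoted on the slide.

```python
def token_features(tokens, i):
    """Features for tokens[i]: a suffix feature and following-word features."""
    feats = set()
    word = tokens[i]
    if word.endswith("less"):
        feats.add("suffix=-less")
    # Context words at offsets +1 and +2, when they exist.
    for offset in (1, 2):
        j = i + offset
        if j < len(tokens):
            feats.add(f"w[i+{offset}]={tokens[j]}")
    return feats

fly = "decapentaplegic and wingless are expressed in analogous patterns".split()
mouse = "that CD38 is expressed by both neurons and glial cells".split()

fly_feats = token_features(fly, fly.index("wingless"))
mouse_feats = token_features(mouse, mouse.index("CD38"))
print(sorted(fly_feats))
print(sorted(mouse_feats))
```

Both gene mentions share the context feature `w[i+2]=expressed`, while only the fly gene fires the `suffix=-less` feature.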
Feature selection
Intuition for identification of generalizable features
[Figure: for each source domain (fly, mouse, D3, …, DK), features are ranked from 1 to 8; the positions of “expressed” and “-less” are marked in each ranking]
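One way to turn this intuition into code is to rank features separately in each source domain and keep those that rank high everywhere. The sketch below is an illustration under assumptions: the scoring rule (worst rank across domains), the function name, and the toy weights are hypothetical and are not the method presented in the talk.

```python
def generalizable_features(domain_weights, top_n):
    """Order features by their worst (largest) rank across domains.

    domain_weights: {domain: {feature: weight}}; a higher weight means the
    feature matters more in the model trained on that domain.
    """
    worst_rank = {}
    for weights in domain_weights.values():
        ranked = sorted(weights, key=weights.get, reverse=True)
        for rank, feat in enumerate(ranked, start=1):
            worst_rank[feat] = max(worst_rank.get(feat, 0), rank)
    # Keep only features that were seen in every domain.
    shared = [f for f in worst_rank
              if all(f in w for w in domain_weights.values())]
    return sorted(shared, key=lambda f: (worst_rank[f], f))[:top_n]

# Toy weights: "-less" is strong only in fly; "expressed" is strong in both.
weights = {
    "fly":   {"suffix=-less": 2.0, "w[i+2]=expressed": 1.5, "is_capitalized": 0.1},
    "mouse": {"suffix=-less": 0.0, "w[i+2]=expressed": 1.4, "is_capitalized": 0.9},
}
top = generalizable_features(weights, top_n=1)
print(top)
```

A feature that is ranked first in one domain but last in another is pushed down by this rule, which is the behavior the slide asks for with the “-less” suffix.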
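Feature selection
Framework: matrix A is for feature selection
$$(\hat{v}, \{\hat{u}^k\}) = \arg\min_{v, \{u^k\}} \Big[ \lambda \|v\|^2 + \lambda_s \sum_{k=1}^{K} \|u^k\|^2 - \frac{1}{K} \sum_{k=1}^{K} \frac{1}{N_k} \sum_{i=1}^{N_k} \log p(y_i^k \mid x_i^k; A^T v + u^k) \Big]$$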
Feature selection results on gene/protein recognition
New directions to explore
- Knowledge bases
- Expert supervision
Knowledge bases – entity recognition
- Well-documented nomenclatures
  - Fly, Mouse, Rat
  - Help filter out false positives?
  - Help select features?
- Dictionaries of entities
  - “Dictionary features”
- Automatic summarization of nomenclatures?
- Automatic identification of good features?
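A “dictionary feature” can be as simple as a flag saying whether a token is covered by a known entity name. The Python below is a minimal sketch: the function name, the longest-match strategy, and the tiny lexicon are assumptions for illustration, not a description of any existing BeeSpace component.

```python
def dictionary_features(tokens, lexicon, max_len=3):
    """Flag tokens covered by a lexicon entry, preferring the longest match."""
    flags = [False] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if " ".join(tokens[i:i + n]).lower() in lexicon:
                matched = n
                break
        if matched:
            for j in range(i, i + matched):
                flags[j] = True
            i += matched
        else:
            i += 1
    return flags

# Hypothetical lexicon; a real one could be built from an organism's nomenclature.
lexicon = {"wingless", "decapentaplegic", "cd38"}
tokens = "decapentaplegic and wingless are expressed".split()
flags = dictionary_features(tokens, lexicon)
print(list(zip(tokens, flags)))
```

The resulting flags can be fed to a tagger as one more feature per token, or used after tagging to filter predicted names that no dictionary supports.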
Knowledge bases – sentence categorization in gene summarizer
- For fly, the training sentences are automatically extracted from FlyBase.
- For other organisms, do we have similar resources?
Expert supervision – entity recognition
- Computer system selects ambiguous examples for human experts to judge.
- Computer system asks human experts other questions.
  - Similar organisms?
  - Typical surface features? (e.g. cis-regulatory elements, “-RE”)
- Computer system summarizes possible features from pseudo-labeled data, and asks human experts for confirmation.
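Selecting ambiguous examples for an expert can be sketched as picking the items whose predicted probability is closest to 0.5. The function name and the scores below are hypothetical; the slide does not say how ambiguity would be measured, so this criterion is an assumption.

```python
def most_ambiguous(examples, predict_proba, k):
    """Return the k examples whose predicted probability is closest to 0.5."""
    return sorted(examples, key=lambda ex: abs(predict_proba(ex) - 0.5))[:k]

# Hypothetical model scores: probability that a token is a gene name.
scores = {"wingless": 0.95, "apexless": 0.55, "expressed": 0.03, "CD38": 0.40}
to_label = most_ambiguous(list(scores), scores.get, k=2)
print(to_label)
```

Confident predictions, whether positive or negative, are skipped, so the expert's time goes to the cases where a judgment changes the model the most.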
Connections to BeeSpace V4
- A major challenge in BeeSpace V4 is extraction of new types of entities and relations.
- Exploiting knowledge bases and expert supervision is especially important.
- For new types, no labeled data is available even from other domains.
- Use of bootstrapping methods should be explored.
New entity types
- Recognition of many new types will be dictionary based: organism, anatomy, biological process, etc.
- Recognition of some new types will need some NER techniques: chemical, regulatory element
New relation types
- Bootstrapping (?)
  - Seed patterns from knowledge bases or human experts
  - Human inspection of newly discovered patterns?
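A bootstrapping loop of this kind could look roughly like the Python below. It is a toy sketch under several assumptions: patterns are plain strings with `{X}` and `{Y}` slots (X before Y), entities are single words, the gene names are placeholders, and the `approve` callback stands in for human inspection. None of these details come from the slides.

```python
import re

def pattern_to_regex(pattern):
    """Turn "{X} regulates {Y}" into a regex with one group per slot."""
    marked = pattern.replace("{X}", "XSLOT").replace("{Y}", "YSLOT")
    escaped = re.escape(marked)
    return escaped.replace("XSLOT", r"(\w+)").replace("YSLOT", r"(\w+)")

def bootstrap(sentences, seed_patterns, approve, rounds=2):
    """Alternate between extracting pairs and proposing new patterns."""
    patterns, pairs = set(seed_patterns), set()
    for _ in range(rounds):
        # Step 1: apply the current patterns to collect (X, Y) pairs.
        for p in patterns:
            regex = pattern_to_regex(p)
            for s in sentences:
                pairs.update(re.findall(regex, s))
        # Step 2: propose patterns from sentences that mention a known pair.
        for x, y in list(pairs):
            for s in sentences:
                if x in s and y in s and s.index(x) < s.index(y):
                    candidate = s.replace(x, "{X}", 1).replace(y, "{Y}", 1)
                    if candidate not in patterns and approve(candidate):
                        patterns.add(candidate)
    return patterns, pairs

# Placeholder sentences; geneA..geneD are not real gene names.
sentences = [
    "geneA regulates geneB",
    "geneA represses geneB",
    "geneC represses geneD",
]
patterns, pairs = bootstrap(sentences, ["{X} regulates {Y}"], approve=lambda c: True)
print(sorted(patterns))
print(sorted(pairs))
```

Here `approve` accepts every candidate; replacing it with a function that queues candidates for an expert is where the human inspection step would plug in.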
The end