BeeSpace Informatics Research
ChengXiang (“Cheng”) Zhai
Department of Computer Science
Institute for Genomic Biology
Statistics
Graduate School of Library & Information Science
University of Illinois at Urbana-Champaign
BeeSpace Workshop, May 22, 2009
Goal of Informatics Research
• Develop general and scalable computational methods to enable
  – Semantic integration of data and information
  – Effective information access and exploration
  – Knowledge discovery
  – Hypothesis formulation and testing
• Reinforcement of research in biology and computer science
  – CS research to automate manual tasks of biologists
  – Biology research to raise new challenges for CS
Overview of BeeSpace Technology
[Architecture diagram. Components: Literature Text; Search Engine; Natural Language Understanding (Words/Phrases, Entities, Relations); Text Miner; Relational Database; Meta Data; Function Annotator; Space/Region Manager, Navigation Support; Gene Summarizer; Users. Capabilities: Content Analysis; Information Access & Exploration; Knowledge Discovery & Hypothesis Testing; Question Answering.]
Informatics Research Accomplishments
Biomedical information retrieval [Jiang & Zhai 07], [Lu et al. 08]
Entity/Relation extraction [Jiang & Zhai 06], [Jiang & Zhai 07a], [Jiang & Zhai 07b]
Topic discovery and interpretation [Mei et al. 06a], [Mei et al. 07a], [Mei et al. 07b], [Chee & Schatz 08]
Entity/Gene Summarization [Ling et al. 06], [Ling et al. 07], [Ling et al. 08]
Automatic Function Annotation [He et al. 09/10]
Overview of BeeSpace Technology
Part 1. Information Extraction
Part 2. Navigation Support
Part 3. Entity Summarization
Part 4. Function Analysis
Part 1. Information Extraction
Natural Language Understanding
“…We have cloned and sequenced a cDNA encoding Apis mellifera ultraspiracle (AMUSP) and examined its responses to …”
[Figure: the sentence annotated with phrase chunks (NP, VP) and with two spans tagged Gene]
Entity & Relation Extraction
Genetic Interaction
  Gene X    Gene Y
  Bcd       hb
  …         …
Expression Location
  Gene X    Anatomy Y
  Bcd       embryo
  Hb        egg
  …         …
…
Lopes FJ et al., 2005 J. Theor. Biol.
General Approach: Machine Learning
• Computers learn from labeled examples to compute a function that predicts the labels of new examples
• Examples of predictions
  – Given a phrase, predict whether it is a gene name
  – Given a sentence that mentions two gene names, predict whether there is a genetic interaction relation between them
• Many learning methods are available, but training data isn’t always available
Extraction Example 1: Gene Name Recognition
… expression of terminal gap genes is mediated by the local activation of the Torso receptor tyrosine kinase (Tor). At the anterior, terminal gap genes are also activated by the Tor pathway but Bcd contributes to their activation.
[Figure: three candidate phrases in the passage are each marked “Gene?”]
Features for Recognizing Genes
• Syntactic clues:
  – Capitalization (especially acronyms)
  – Numbers (gene families)
  – Punctuation: -, /, :, etc.
• Contextual clues:
  – Local: surrounding words such as “gene”, “encoding”, “regulation”, “expressed”, etc.
  – Global: same noun phrase occurs several times in the same article
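The clues listed above can be turned into binary features for a candidate token. The following is only a minimal sketch: the function name `gene_features`, the feature names, the context window of two tokens, and the whitespace tokenization are all assumptions made for illustration, not the system's actual feature set.

```python
# Local context words taken from the slide; the set itself is illustrative.
CONTEXT_WORDS = {"gene", "encoding", "regulation", "expressed"}

def gene_features(tokens, i, window=2):
    """Return binary clue features for the candidate token at position i."""
    tok = tokens[i]
    feats = {
        # Syntactic clues
        "starts_with_capital": tok[:1].isupper(),
        "all_caps": len(tok) > 1 and tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "has_punct": any(c in "-/:" for c in tok),
    }
    # Local contextual clue: a clue word within `window` tokens of the candidate
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    neighbors = [t.lower() for j, t in enumerate(tokens[lo:hi], start=lo) if j != i]
    feats["near_context_word"] = any(t in CONTEXT_WORDS for t in neighbors)
    # Global contextual clue: the same token occurs more than once in the text
    feats["repeated_in_text"] = sum(t == tok for t in tokens) > 1
    return feats

tokens = "that CD38 is expressed by both neurons".split()
print(gene_features(tokens, 1))
```

Each feature is a yes/no observation, so the output can be fed directly to a classifier that learns one weight per feature.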
Maximum Entropy Model for Gene Tagging
• Given an observation (a token or a noun phrase), together with its context, denoted as x
• Predict y ∈ {gene, non-gene}
• Maximum entropy model:
  $P(y \mid x) = K \exp\big(\sum_i \lambda_i f_i(x, y)\big)$, where K is the normalization constant
• Typical f:
  – y = gene & candidate phrase starts with a capital letter
  – y = gene & candidate phrase contains digits
• Estimate $\lambda_i$ with training data
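To make the model concrete, here is a small sketch that evaluates P(y|x) for the two typical features named on the slide. The weights passed in are hypothetical placeholders (in practice they are estimated from training data), and `features` and `maxent_prob` are illustrative names.

```python
import math

LABELS = ("gene", "non-gene")

def features(x, y):
    """Binary feature functions f_i(x, y); x is the candidate phrase."""
    is_gene = (y == "gene")
    return [
        1.0 if is_gene and x[:1].isupper() else 0.0,              # starts with a capital
        1.0 if is_gene and any(c.isdigit() for c in x) else 0.0,  # contains digits
    ]

def maxent_prob(x, weights):
    """P(y|x) = K * exp(sum_i weights[i] * f_i(x, y)), with K = 1 / sum over labels."""
    scores = {
        y: math.exp(sum(w * f for w, f in zip(weights, features(x, y))))
        for y in LABELS
    }
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

weights = [1.5, 0.8]  # hypothetical values, not learned from data
print(maxent_prob("CD38", weights))
print(maxent_prob("expressed", weights))
```

A phrase that fires no feature gets equal probability for both labels, which is why contextual features are needed in addition to these surface clues.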
Special Challenges
• Gene name disambiguation
• Domain adaptation
Gene Name Disambiguation
• Gene names can be common English words: for (foraging), in (inturned), similar (sima), yellow (y), black (b)…
• Solution:
  – Disambiguate by looking at the context of the candidate word
  – Train a classifier
Discriminative Neighbor Words
Sample Disambiguation Results
Candidate words: “foraging”, “for”
... affect complex behaviors such as locomotion and foraging. The foraging (for) gene encodes a pkg in drosophila melanogaster here we demonstrate a function for the for gene in sensory responsiveness and …
Classifier scores shown for the candidate occurrences, in slide order: -1.468, +3.359, +5.497, -0.582, +5.980
Candidate word: “black”
the cuticular melanization phenotype of black flies is rescued by beta-alanine but beta-alanine production by aspartate decarboxylation was reported to be normal in assays of black mutants and although …
Classifier scores shown for the candidate occurrences, in slide order: -2.780, +9.759
Nov 27, 2007 17
Problem of Domain Overfitting
[Figure: a gene name recognizer scores 54.1% in the ideal setting but 28.1% in the realistic setting. Example fly gene names: wingless, daughterless, eyeless, apexless…]
Solution: Learn Generalizable Features
…decapentaplegic and wingless are expressed in analogous patterns in each primordium of…
…that CD38 is expressed by both neurons and glial cells…
…that PABPC5 is expressed in fetal brain and in a range of adult tissues.
Generalizable Feature: “w+2 = expressed”
Generalizability-Based Feature Ranking
[Figure: features such as “expressed” and “-less” are ranked separately (ranks 1–8) within each of several sets of training data, and the separate rankings are combined into a single ranking in which “expressed” precedes “-less”. Scores shown in the combined ranking: 0.125, 0.167.]
Effectiveness of Domain Adaptation
Fly + Mouse → Yeast, gene name recognizer:
  standard learning: 63.3%
  domain adaptive learning: 75.9%
More Results on Domain Adaptation
Exp     Method     Precision  Recall   F1
F+M→Y   Baseline   0.557      0.466    0.508
        Domain     0.575      0.516    0.544
        % Imprv.   +3.2%      +10.7%   +7.1%
F+Y→M   Baseline   0.571      0.335    0.422
        Domain     0.582      0.381    0.461
        % Imprv.   +1.9%      +13.7%   +9.2%
M+Y→F   Baseline   0.583      0.097    0.166
        Domain     0.591      0.139    0.225
        % Imprv.   +1.4%      +43.3%   +35.5%
• Text data from BioCreAtIvE (Medline)
• 3 organisms (Fly, Mouse, Yeast)
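As a quick check on how to read the table, the sketch below recomputes F1 and the relative improvement for the F+M→Y row. It assumes F1 is the usual harmonic mean of precision and recall and that "% Imprv." is the relative change over the baseline; the helper names are illustrative. Small differences in the third decimal of F1 are expected because the table's precision and recall are already rounded.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def pct_improvement(baseline, new):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (new - baseline) / baseline

# F+M -> Y row of the table
base_p, base_r = 0.557, 0.466
dom_p, dom_r = 0.575, 0.516

print(round(f1(base_p, base_r), 3), round(f1(dom_p, dom_r), 3))
print(round(pct_improvement(base_p, dom_p), 1))   # precision
print(round(pct_improvement(base_r, dom_r), 1))   # recall
print(round(pct_improvement(0.508, 0.544), 1))    # F1, from the tabulated values
```

Most of the gain in every row comes from recall, which is consistent with the idea that generalizable features help the recognizer find gene names it has not seen in the training organisms.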
Extraction Example 2: Genetic Interaction Relation
Is there a genetic interaction relation here?
Bcd regulates the expression of the maternal and zygotic gene hunchback (hb) that shows a step-like-function expression pattern, in the anterior half of the egg.
[Figure: two mentions in the sentence are tagged Gene]
Challenges
• No/little training data
• What features to use?
Solution: Pseudo Training Data
Gene: Bcd +
These results uncovered an antagonism between hunchback and bicoid at the anterior pole, whereas the two genes are known to act in concert for most anterior segmented development.
Pseudo Training Data Works Reasonably Well
[Figure: precision-recall plot]
Using all features works the best
Large-Scale Entity/Relation Extraction
• Entity annotation
  Entity Type   Resource            Method
  Gene          NCBI, FlyBase, …    Dictionary string search + machine learning
  Anatomy       FlyBase             Dictionary string search
  Chemical      MeSH, Biosis, …     Dictionary string search
  Behavior                          “x x behavior” pattern search
• Relation extraction
  Relation Type    Method
  Regulatory       Pre-defined pattern + machine learning
  Expressed In     Co-occurrence + relevant keywords
  Gene Behavior    Co-occurrence
  Gene Chemical    Co-occurrence
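The co-occurrence method in the relation table can be sketched as counting sentences in which a gene term and a behavior term both appear. This is a simplified illustration that assumes sentence-level co-occurrence, lowercase single-word dictionary entries, and whitespace tokenization; the function name and the sample sentences are hypothetical.

```python
from collections import Counter
from itertools import product

def cooccurrence_counts(sentences, genes, behaviors):
    """Count sentences in which a gene term and a behavior term both occur."""
    counts = Counter()
    for sentence in sentences:
        words = set(sentence.lower().split())
        for gene, behavior in product(genes, behaviors):
            if gene in words and behavior in words:
                counts[(gene, behavior)] += 1
    return counts

sentences = [
    "The for gene affects foraging behavior",
    "for mutants show altered foraging",
    "yellow affects pigmentation",
]
counts = cooccurrence_counts(sentences, genes={"for", "yellow"}, behaviors={"foraging"})
print(counts)
```

Pairs with higher counts are stronger candidates for a Gene Behavior relation; the dictionary lookup stands in for the entity annotation step.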
Part 2: Semantic Navigation
Space-Region Navigation
[Figure: Literature Spaces (Bee, Fly, Bird, Behavior, …) and Topic Regions (Bee Forager, Fly Rover, Bird Singing, …) linked by the operators MAP, EXTRACT, and SWITCHING. Spaces can be combined by Intersection, Union, …, and so can regions. Users keep “My Spaces” and “My Regions/Topics”.]
General Approach: Language Models
• Topic = word distribution
• Modeling text in a space with mixture models of multinomial distributions
• Text Mining = Parameter Estimation + Inferences
• Matching = Compute similarity between word distributions
• Users can “control” a model by specifying topic preferences
A Sample Topic & Corresponding Space
Word Distribution (language model)
  filaments    0.0410238
  muscle       0.0327107
  actin        0.0287701
  z            0.0221623
  filament     0.0169888
  myosin       0.0153909
  thick        0.00968766
  thin         0.00926895
  sections     0.00924286
  er           0.00890264
  band         0.00802833
  muscles      0.00789018
  antibodies   0.00736094
  myofibrils   0.00688588
  flight       0.00670859
  images       0.00649626
Meaningful labels
  actin filaments
  flight muscle
  flight muscles
Example documents
• actin filaments in honeybee-flight muscle move collectively
• arrangement of filaments and cross-links in the bee flight muscle z disk by image analysis of oblique sections
• identification of a connecting filament protein in insect fibrillar flight muscle
• the invertebrate myosin filament subfilament arrangement of the solid filaments of insect flight muscles
• structure of thick filaments from insect flight muscle
MAP: Topic/Region → Space
• MAP: Use the topic/region description as a query to search a given space
• Retrieval algorithm:
  – Query word distribution: p(w|Q)
  – Document word distribution: p(w|D)
  – Score a document based on similarity of Q and D:
    $\mathrm{score}(Q, D) = -D(Q \,\|\, D) = -\sum_{w \in \mathrm{Vocabulary}} p(w \mid Q) \log \frac{p(w \mid Q)}{p(w \mid D)}$
• Leverage existing retrieval toolkits: Lemur/Indri
EXTRACT: Space → Topic/Region
• Assume k topics, each being represented by a word distribution
• Use a k-component mixture model to fit the documents in a given space (EM algorithm)
• The estimated k component word distributions are taken as k topic regions
Likelihood:
$\log p(C \mid \Lambda) = \sum_{D \in C} \sum_{i=1}^{|D|} \log\Big[\lambda\, p(D_i \mid \theta_B) + (1 - \lambda) \sum_{j=1}^{k} \pi_j\, p(D_i \mid \theta_j)\Big]$
Maximum likelihood estimator: $\Lambda^* = \arg\max_{\Lambda} p(C \mid \Lambda)$
Bayesian estimator: $\Lambda^* = \arg\max_{\Lambda} p(\Lambda \mid C) = \arg\max_{\Lambda} p(C \mid \Lambda)\, p(\Lambda)$
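A compact EM sketch for fitting such a mixture is given below. It is an illustration under several assumptions rather than the BeeSpace implementation: the background model θ_B is fixed to the collection word distribution, `lam` is the fixed background weight, the mixing weights are kept per document (so that different documents can emphasize different topics), and initialization is random with a fixed seed. The names `extract_topics` and `normalize` are hypothetical.

```python
import random
from collections import Counter

def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values()) or 1.0
    return {w: v / total for w, v in weights.items()}

def extract_topics(docs, k, lam=0.5, iters=50, seed=0):
    """Fit k topic word distributions plus a fixed background model with EM."""
    counts = [Counter(d) for d in docs]
    total = Counter()
    for c in counts:
        total.update(c)
    vocab = sorted(total)
    background = normalize(dict(total))            # p(w | theta_B), assumed fixed
    rng = random.Random(seed)
    topics = [normalize({w: rng.random() + 0.01 for w in vocab}) for _ in range(k)]
    pi = [[1.0 / k] * k for _ in docs]             # per-document topic weights
    for _ in range(iters):
        topic_acc = [dict.fromkeys(vocab, 0.0) for _ in range(k)]
        pi_acc = [[0.0] * k for _ in docs]
        for d, c in enumerate(counts):
            for w, n in c.items():
                mix = [pi[d][j] * topics[j][w] for j in range(k)]
                s = sum(mix)
                if s == 0.0:
                    continue
                # E-step: probability that w was generated by some topic, not background
                p_topic = (1 - lam) * s / (lam * background[w] + (1 - lam) * s)
                for j in range(k):
                    r = n * p_topic * mix[j] / s   # expected count of w from topic j
                    topic_acc[j][w] += r
                    pi_acc[d][j] += r
        # M-step: re-estimate topic word distributions and mixing weights
        topics = [normalize(t) for t in topic_acc]
        pi = [[v / (sum(row) or 1.0) for v in row] for row in pi_acc]
    return topics, pi

docs = [
    ["actin", "filaments", "muscle"], ["muscle", "actin", "myosin"],
    ["foraging", "nectar", "food"], ["food", "foraging", "colony"],
]
topics, pi = extract_topics(docs, k=2)
for t in topics:
    print(sorted(t, key=t.get, reverse=True)[:3])
```

The background component absorbs words that are common across the whole space, which tends to leave the k topic distributions with more discriminative words.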
User-Controlled Exploration: Sample Topic 1
age 0.0672687, division 0.0551497, labor 0.052136, colony 0.038305, foraging 0.0357817, foragers 0.0236658, workers 0.0191248, task 0.0190672, behavioral 0.0189017, behavior 0.0168805, older 0.0143466, tasks 0.013823, old 0.011839, individual 0.0114329, ages 0.0102134, young 0.00985875, genotypic 0.00963096, social 0.00883439
Prior: labor 0.2, division 0.2
behavioral 0.110674, age 0.0789419, maturation 0.057956, task 0.0318285, division 0.0312101, labor 0.0293371, workers 0.0222682, colony 0.0199028, social 0.0188699, behavior 0.0171008, performance 0.0117176, foragers 0.0110682, genotypic 0.0106029, differences 0.0103761, polyethism 0.00904816, older 0.00808171, plasticity 0.00804363, changes 0.00794045
Prior: behavioral 0.2, maturation 0.2
User-Controlled Exploration: Sample Topic 2
foraging 0.290076, nectar 0.114508, food 0.106655, forage 0.0734919, colony 0.0660329, pollen 0.0427706, flower 0.0400582, sucrose 0.0334728, source 0.0319787, behavior 0.0283774, individual 0.028029, rate 0.0242806, recruitment 0.0200597, time 0.0197362, reward 0.0196271, task 0.0182461, sitter 0.00604067, rover 0.00582791, rovers 0.00306051
foraging 0.142473, foragers 0.0582921, forage 0.0557498, food 0.0393453, nectar 0.03217, colony 0.019416, source 0.0153349, hive 0.0151726, dance 0.013336, forager 0.0127668, information 0.0117961, feeder 0.010944, rate 0.0104752, recruitment 0.00870751, individual 0.0086414, reward 0.00810706, flower 0.00800705, dancing 0.00794827, behavior 0.00789228
Exploit Prior for Concept Switching
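One common way to let a user-specified prior steer a word distribution is to add the prior as pseudo-counts before normalizing. The sketch below shows that idea on a single multinomial; it is an assumption about how such a prior could be applied, not a description of the estimator behind the slides. The counts and the strength parameter `mu` are hypothetical; the prior words and weights are the ones shown for Sample Topic 1.

```python
def estimate_with_prior(counts, prior, mu):
    """Word distribution with the prior added as mu * prior[w] pseudo-counts."""
    vocab = set(counts) | set(prior)
    total = sum(counts.values()) + mu * sum(prior.values())
    return {w: (counts.get(w, 0) + mu * prior.get(w, 0.0)) / total for w in vocab}

counts = {"age": 6, "labor": 2, "division": 2}     # hypothetical observed counts
prior = {"labor": 0.2, "division": 0.2}            # prior from the slide
without_prior = estimate_with_prior(counts, prior, mu=0)
with_prior = estimate_with_prior(counts, prior, mu=10)
print(without_prior["labor"], with_prior["labor"])
```

Raising `mu` pulls the estimate toward the words the user named, while `mu = 0` recovers the plain maximum-likelihood estimate.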
Part 3: Entity Summarization
Multi-Aspect Gene Summary
• Gene product
• Expression
• Sequence
• Interactions
• Mutations
• General Functions
Automated Gene Summarization?
A Two-Stage Approach
Text Summary of Gene Abl
General Entity Summarizer
• Task: Given any entity and k aspects to summarize, generate a semi-structured summary
• Assumption: Training sentences available for each aspect
• Method: – Train a recognizer for each aspect – Given an entity, retrieve sentences relevant to the entity– Classify each sentence into one of the k aspects– Choose the best sentences in each category
40
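The retrieve, classify, and select steps of the method can be sketched as follows. To keep the example self-contained, a keyword-overlap score stands in for the trained aspect recognizers, which is a deliberate simplification; the aspect keyword sets, the sample sentences, and the function name `summarize` are all hypothetical.

```python
def summarize(entity, sentences, aspect_keywords, per_aspect=1):
    """Build a semi-structured summary: the best sentences for each aspect."""
    scored = {aspect: [] for aspect in aspect_keywords}
    for sentence in sentences:
        words = set(sentence.lower().split())
        if entity.lower() not in words:
            continue                      # retrieval: keep sentences mentioning the entity
        # Classification stand-in: keyword overlap instead of a trained recognizer
        overlap = {a: len(words & kws) for a, kws in aspect_keywords.items()}
        best = max(overlap, key=overlap.get)
        if overlap[best] > 0:
            scored[best].append((overlap[best], sentence))
    # Selection: the highest-scoring sentences in each aspect
    return {
        a: [s for _, s in sorted(items, key=lambda t: -t[0])[:per_aspect]]
        for a, items in scored.items()
    }

sentences = [
    "Abl is expressed in the embryo",
    "Abl mutations cause lethality",
    "Bcd is expressed in the egg",
]
aspects = {
    "Expression": {"expressed", "expression"},
    "Mutations": {"mutations", "mutant"},
}
print(summarize("Abl", sentences, aspects))
```

Swapping the overlap score for per-aspect classifiers trained on labeled sentences gives the two-stage structure described in the bullets.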
Further Generalizations
• Task: Given any entity and k pre-specified aspects to summarize, generate a semi-structured summary
• Assumption: Training sentences available for each aspect
• Method:
  – Train a recognizer for each aspect
  – Given an entity, retrieve sentences relevant to the entity
  – Classify each sentence into one of the k aspects
  – Choose the best sentences in each category
New method based on mixture model and regularized optimization
Part 4. Function Analysis
Annotating Gene Lists: GO Terms vs. Literature Mining
Limitations of GO annotations:
- Labor-intensive
- Limited coverage
Literature Mining:
- Automatic
- Flexible exploration in the entire literature space
Overview of Gene List Annotator
[Figure: Gene group (Bcd, Cad, …, Tll); Entrez Gene, …; for any gene: retrieve its relevant documents; document sets (Bcd, Cad, Tll); for any term: test its significance; enriched concepts; interactive analysis.]
Enriched concepts:
  Segmentation   56.0
  Pattern        34.2
  Cell_cycle     25.6
  Development    22.1
  Regulation     20.4
  …
Intuition for Literature-based Annotation
Gene TPI1 GPM1 PGK1 TDH3 TDH2
protein_kinase 0 0 2 0 0
decarboxylase 10 0 10 7 6
protein 39 26 65 44 33
stationary_phase 2 7 3 4 2
energy_metabolism 4 5 5 8 0
oscillation 0 0 0 0 1
Likelihood Ratio Test with 2-Poisson Mixture Model
Dataset distribution: Poisson(λ;d)
Reference distribution: Poisson(λ0;d)
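The idea of testing a term's significance can be illustrated with a simplified likelihood ratio test. The sketch below compares a single Poisson rate fitted on the gene-list documents against a fixed reference rate, with the expected count scaled by document length d; it does not implement the 2-Poisson mixture named in the slide title, and the function names and sample numbers are hypothetical.

```python
import math

def poisson_loglik(counts, lengths, rate):
    """Log-likelihood of term counts under Poisson(rate * document length)."""
    ll = 0.0
    for c, d in zip(counts, lengths):
        mean = rate * d
        ll += c * math.log(mean) - mean - math.lgamma(c + 1)
    return ll

def likelihood_ratio(counts, lengths, reference_rate):
    """2 * (loglik at the fitted rate - loglik at the reference rate), one-sided."""
    fitted_rate = sum(counts) / sum(lengths)
    if fitted_rate <= reference_rate:
        return 0.0                      # term is not over-represented
    return 2.0 * (poisson_loglik(counts, lengths, fitted_rate)
                  - poisson_loglik(counts, lengths, reference_rate))

# Hypothetical data: term counts in two gene-list documents of 100 words each,
# against a reference rate of 0.01 occurrences per word.
print(likelihood_ratio([5, 7], [100, 100], reference_rate=0.01))
print(likelihood_ratio([0, 1], [100, 100], reference_rate=0.01))
```

Larger values indicate terms that occur in the gene list's documents more often than the reference rate would predict, which is the basis for ranking enriched concepts.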
Agreement with GO-based Method
• Gene List: 93 genes up-regulated by the manganese treatment
GO Theme                Related Annotator terms
neurogenesis            axon guidance, growth cone, commissural axon, proneural gene
synaptic transmission   synaptic vesicle, neurotransmitter release, synaptic transmission, sodium channel
cytoskeletal protein    alpha tubulin, actin filament
cell communication      tight junction, heparan sulfate proteoglycan
Discovering Novel Themes
• Gene List: 69 genes up-regulated by the methoprene treatment
Theme                   Annotator terms
muscle                  flight muscle, muscle myosin, nonmuscle myosin, light chain, myosin ii, thick filament, thin filament, striated muscle
synaptic transmission   neurotransmitter release, synaptic transmission, synaptic vesicle
signaling pathway       notch signal
Summary
Part 1. Information Extraction
Part 2. Navigation Support
Part 3. Entity Summarization
Part 4. Function Analysis
Machine Learning + Language Models + Minimum Human Effort
General and scalable, but there’s room for deeper semantics
Looking Ahead…
• Knowledge integration, inferences
• Support for hypothesis formulation and testing
Exploring Knowledge Space
[Figure: a graph of genes (A1, A1’, A2, A3, A4, A4’, A5) and behaviors (B1, B2, B3, B4) linked by typed edges: isa, Co-occur-fly, Co-occur-mos, Co-occur-bee, Orth-mos, orth, Reg.]
1. X = NeighborOf(B4, Behavior, {co-occur, isa}) → {B1, B2, B3}
2. Y = NeighborOf(X, Gene, {co-occur, orth}) → {A1, A1’, A2, A3}
3. Y = Y + {A5, A6} → {A1, A1’, A2, A3, A5, A6}
4. Z = NeighborOf(Y, Gene, {reg}) → {A4, A4’}
P = PathBetween(Z, B4, {co-occur, reg, isa})
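The NeighborOf steps can be mimicked on a small typed graph. The sketch below is illustrative only: the edge list is a hypothetical toy graph chosen so that the four steps reproduce the sets shown on the slide, edges are treated as undirected, and `neighbor_of` is an assumed reading of the operator (nodes of a given type reachable in one step over the allowed relation types).

```python
# Hypothetical toy graph: node types and typed, undirected edges.
NODE_TYPE = {
    "B1": "Behavior", "B2": "Behavior", "B3": "Behavior", "B4": "Behavior",
    "A1": "Gene", "A1'": "Gene", "A2": "Gene", "A3": "Gene",
    "A4": "Gene", "A4'": "Gene", "A5": "Gene",
}
EDGES = [
    ("B4", "isa", "B2"), ("B4", "isa", "B3"), ("B4", "co-occur", "B1"),
    ("B1", "co-occur", "A1"), ("B1", "co-occur", "A1'"),
    ("B2", "co-occur", "A2"), ("B3", "co-occur", "A3"),
    ("A1", "orth", "A1'"),
    ("A1'", "reg", "A4'"), ("A5", "reg", "A4"),
]

def neighbor_of(sources, node_type, relations):
    """Nodes of node_type one edge away from any source, over the given relations."""
    if isinstance(sources, str):
        sources = {sources}
    found = set()
    for u, rel, v in EDGES:
        if rel not in relations:
            continue
        for a, b in ((u, v), (v, u)):          # edges are treated as undirected
            if a in sources and b not in sources and NODE_TYPE.get(b) == node_type:
                found.add(b)
    return found

X = neighbor_of("B4", "Behavior", {"co-occur", "isa"})
Y = neighbor_of(X, "Gene", {"co-occur", "orth"})
Y = Y | {"A5", "A6"}                           # user adds genes by hand
Z = neighbor_of(Y, "Gene", {"reg"})
print(sorted(X), sorted(Y), sorted(Z))
```

Because each step returns a plain set, a user can mix automatic expansion with manual additions, as in step 3, before continuing the exploration.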
Full-Fledged BeeSpace V5
• Biomedical Literature
  – Entities: Gene, Behavior, Anatomy, Chemical
  – Relations: Orthology, Regulatory interaction, …
• Experiment Data
  – Analysis
  – Additional entities and relations
• Expert knowledge
• Inferences
• Hypothesis Formulation & Testing
Thanks to
Xin He (UIUC)
Jing Jiang (SMU)
Yanen Li (UIUC)
Xu Ling (UIUC)
Yue Lu (UIUC)
Qiaozhu Mei (UIUC/Michigan)
& Bruce Schatz (PI, BeeSpace)
Thank You!