1
The BioText Project
SIMS Affiliates Meeting, Nov 14, 2003
Marti Hearst, Associate Professor
SIMS, UC Berkeley
Project sponsored by NSF DBI-0317510, ARDA AQUAINT, and a gift from Genentech
2
BioText Project Goals
• Provide fast, flexible, intelligent access to information for use in biosciences applications.
  – Better search results
  – Text mining
• Focus on
  – Textual information
  – Tightly integrated with other resources
    • Ontologies
    • Record-based databases
3
People
• Project Leaders:
  – PI: Marti Hearst
  – Co-PI: Adam Arkin
• Computational Linguistics
  – Barbara Rosario
  – Preslav Nakov
• Database Research
  – Ariel Schwartz
  – Gaurav Bhalotia (graduated)
• User Interface / Information Retrieval
  – Kevin Li
  – Dr. Emilia Stoica
• Bioscience
  – Dr. TingTing Zhang
4
Outline
• Main Goals
  – Text Mining Examples
  – System Architecture
  – Apoptosis problem statement
• Recent results in
  – Abbreviation definition recognition
  – Semantic relation recognition (from text)
  – Search User Interfaces
  – Hierarchical grouping of journals
5
Text Mining Example 1
• How to discover new information …
• … as opposed to discovering which statistical patterns characterize occurrence of known information.
• Method:
  – Use large text collections to gather evidence to support (or refute) hypotheses
  – Make connections
  – Gather evidence
6
Etiology Example
• Don Swanson example, 1991
• Goal: find cause of disease
  – Magnesium-migraine connection
• Given
  – medical titles and abstracts
  – a problem (incurable rare disease)
  – some medical expertise
• Find causal links among titles
  – symptoms
  – drugs
  – results
7
Gathering Evidence
[Figure: migraine linked to the intermediate concepts stress, CCB, PA, and SCD, each of which is in turn linked to magnesium]
8
Gathering Evidence
[Figure: migraine and magnesium connected through the shared intermediate concepts stress, CCB, PA, and SCD]
9
Swanson’s Linking Approach
• Two of his hypotheses have received some experimental verification.
• His technique
  – Only partially automated
  – Required medical expertise
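The linking idea on the preceding slides can be sketched in a few lines. This is a minimal illustration, not Swanson's actual procedure: it assumes a hand-picked concept vocabulary and treats two concepts as linked when they appear in the same title. The function name `linking_terms`, the toy titles, and the vocabulary are all made up for this example.

```python
from collections import defaultdict

# Hypothetical toy titles, written for illustration only.
TITLES = [
    "stress is associated with migraine",
    "stress can lead to loss of magnesium",
    "ccb drugs prevent migraine",
    "magnesium acts as a natural ccb",
    "epilepsy and migraine share symptoms",
]

# Hypothetical concept vocabulary; a real system would need medical expertise here.
VOCABULARY = {"migraine", "magnesium", "stress", "ccb", "epilepsy"}


def linking_terms(titles, start_term, target_term, vocabulary):
    """Return concepts that co-occur in a title with both the start and target terms."""
    cooccur = defaultdict(set)
    for title in titles:
        present = set(title.lower().split()) & vocabulary
        for concept in present:
            cooccur[concept] |= present - {concept}
    shared = cooccur[start_term] & cooccur[target_term]
    return sorted(shared - {start_term, target_term})


print(linking_terms(TITLES, "migraine", "magnesium", VOCABULARY))
```

The candidate links still need expert review, which matches the point above that the technique was only partially automated.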
10
Text Mining Example 2:
• How to find functions of genes?
  – Have the genetic sequence
  – Don’t know what it does
  – But …
    • Know which genes it coexpresses with
    • Some of these have known function
  – So … infer function based on function of co-expressed genes
• This problem was suggested by Michael Walker and others at Incyte Pharmaceuticals
11
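Gene Co-expression: Role in the genetic pathway
[Figure: possible positions of the unknown genes g? and h? relative to the co-expressed genes PSA, Kall., and PAP in the genetic pathway]
Other possibilities as well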
12
Make use of the literature
• Look up what is known about the other genes.
• Different articles in different collections
• Look for commonalities
  – Similar topics indicated by Subject Descriptors
  – Similar words in titles and abstracts
adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies ...
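Looking for commonalities can be sketched as counting which descriptors recur across the literature on the known co-expressed genes. This is only an illustration: the function name `shared_descriptors` is invented, and the assignment of descriptors to genes below is hypothetical, not data from the project.

```python
from collections import Counter

# Hypothetical descriptor assignments; the gene names come from the slides,
# but which descriptor belongs to which gene is made up for this example.
KNOWN_GENES = {
    "PSA": ["prostatic neoplasms", "tumor markers", "adenocarcinoma"],
    "Kall.": ["prostate", "tumor markers"],
    "PAP": ["prostatic neoplasms", "tumor markers", "antibodies"],
}


def shared_descriptors(gene_to_terms, min_genes=2):
    """Return (descriptor, gene count) pairs seen for at least min_genes genes."""
    counts = Counter()
    for terms in gene_to_terms.values():
        counts.update(set(terms))  # count each descriptor once per gene
    frequent = [(term, n) for term, n in counts.items() if n >= min_genes]
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))


print(shared_descriptors(KNOWN_GENES))
```

Descriptors shared by most of the known genes are the ones that suggest a hypothesis about the unknown gene.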
14
Formulate a Hypothesis
• Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer
• New tack: do some lab tests
  – See if mystery gene is similar in molecular structure to the others
  – If so, it might do some of the same things they do
15
Outline
• Main Goals
  – Text Mining Examples
  – System Architecture
  – Apoptosis problem statement
• Recent results in
  – Abbreviation definition recognition
  – Semantic relation recognition (from text)
  – Search User Interfaces
  – Hierarchical grouping of journals
16
BioText: Architecture
[Figure: architecture with three components: Sophisticated Text Analysis, Annotations in Database, Improved Search Interface]
17
Recent Result (Schwartz & Hearst 03)
• Fast, simple algorithm for recognizing abbreviation definitions.
  – Simpler and faster than the rest
  – Higher precision and recall
  – Idea: Work backwards from the end
• Examples:
  – In eukaryotes, the key to transcriptional regulation of the Heat Shock Response is the Heat Shock Transcription Factor (HSF).
  – Gcn5-related N-acetyltransferase (GNAT)
• Idea: use redundancy across abstracts to figure out abbreviation meaning even when definition is not present.
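The "work backwards from the end" idea can be sketched as follows. This is a simplified illustration and not the authors' released implementation: it scans the abbreviation and the preceding text right to left, matching each abbreviation character in order and requiring the first character to start a word. The function names `find_long_form` and `extract_definitions` are chosen for this example, and the candidate text is assumed to be everything before the parenthesis.

```python
import re


def find_long_form(short_form, candidate):
    """Match short_form characters right to left inside candidate; return the long form or None."""
    s_idx = len(short_form) - 1
    l_idx = len(candidate) - 1
    while s_idx >= 0:
        ch = short_form[s_idx].lower()
        if not ch.isalnum():  # skip punctuation inside the abbreviation
            s_idx -= 1
            continue
        # Move left until the character matches; the first abbreviation
        # character must additionally sit at the start of a word.
        while (l_idx >= 0 and candidate[l_idx].lower() != ch) or (
            s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum()
        ):
            l_idx -= 1
        if l_idx < 0:
            return None
        l_idx -= 1
        s_idx -= 1
    start = candidate.rfind(" ", 0, l_idx + 1) + 1
    return candidate[start:]


def extract_definitions(sentence):
    """Return (short form, long form) pairs for 'long form (SF)' patterns."""
    pairs = []
    for match in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = match.group(1).strip()
        long_form = find_long_form(short_form, sentence[: match.start()].rstrip())
        if long_form:
            pairs.append((short_form, long_form))
    return pairs


print(extract_definitions("Gcn5-related N-acetyltransferase (GNAT)"))
```

A fuller version would also bound how far back the candidate text may reach and filter implausible short forms; both are omitted here to keep the sketch short.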
18
BioText: A Two-Sided Approach
[Figure: two sides, Sophisticated Database Design & Algorithms and Empirical Computational Linguistics Algorithms, drawing on the resources SwissProt, Blast, MeSH, GO, WordNet, Medline, and Journal Full Text]
19
Apoptosis Network
[Figure: apoptosis signaling network. Elements shown: Death Receptors Signaling, Survival Factors Signaling, Ca++ Signaling, ER Stress, Genotoxic Stress, Loss of Attachment, Cell Cycle stress, etc.; P53 pathway, NFkB, Bcl-2 like, BH3 only, Bax, Bak, Mitochondria, Cytochrome c, Smac, AIF, Apaf 1, IAPs, Initiator Caspases (8, 10), Caspase 9, Caspase 12, Effector Caspases (3, 6, 7); Apoptosis]
Slide courtesy TingTing Zhang
20
The issues (courtesy TingTing Zhang):
• The network nodes are deduced from reading and processing of experimental knowledge by experts. Every month >1000 apoptosis papers are published.
• The supporting experimental data are gathered in different organs, tissues, cells using various techniques.
• There are various levels of uncertainty associated with different techniques used to answer certain questions.
• Depending on the expression patterns for the players in the network, the observation may or may not be extended to other contexts.
• We need to keep track of ALL the information in order to understand the system better.
21
Simple cases:
• Mouse Bim proteins (isoforms EL, L, S) bind to human Bcl-2 (bacteriophage screening using cDNA expression library from T-lymphoma cell line KO52DA20).
• Human BimEL protein is 89% identical to mouse BimEL; human BimL is 85% identical to mouse BimL (hybridization of mouse bim cDNA to human fetal spleen and peripheral blood cDNA library).
• Bim mRNA is detected in B and T lymphoid cells (Northern blot analysis of mouse KO52DA20, WEHI 703, WEHI 707, WEHI7.1, CH1, WEHI231, WEHI415, B6.23.16BW2 cell extracts).
• BimL protein interacts with Bcl-2 OR Bcl-XL OR Bcl-w proteins (immunoprecipitation (anti-Bcl-2 OR Bcl-XL OR Bcl-w) followed by Western blot (anti-EEtag) using extracts of human 293T cells co-transfected with EE-tagged BimL AND (bcl-2 OR bcl-XL OR bcl-w) plasmids).
• BimL deleted of the BH3 domain does not bind to Bcl-2 OR Bcl-XL OR Bcl-w proteins (under the experimental conditions mentioned above).
22
Computational Language Goals
• Recognizing and annotating entities within textual documents
• Identifying semantic relations among entities
• To (eventually) be used in tandem with semi-automated reasoning systems.
23
Main Ideas for NLP Approach
• Assign Semantics using
  – Statistics
  – Hierarchical Lexical Ontologies to generalize
  – Redundancy in the data
• Build up Layers of Representation
  – Syntactic and Semantic
  – Use these in a feedback loop
24
Computational Linguistics Goals
• Mark up text with semantic relations
25
Recent Result: Descent of Hierarchy
• Idea:
  – Use the top levels of a lexical hierarchy to identify semantic relations
• Hypothesis:
  – A particular semantic relation holds between all 2-word Noun Compounds that can be categorized by a MeSH pair.
26
Definition
• NC: Any sequence of nouns that itself functions as a noun
  – asthma hospitalizations
  – health care personnel hand wash
• Technical text is rich with NCs
  – Open-labeled long-term study of the subcutaneous sumatriptan efficacy and tolerability in acute migraine treatment.
27
NCs: Three tasks
• Identification
• Syntactic analysis (attachments)
  • [Baseline [headache frequency]]
  • [[Tension headache] patient]
• Our Goal: Semantic analysis
  • Headache treatment → treatment for headache
  • Corticosteroid treatment → treatment that uses corticosteroid
28
Main Idea:
• Top-level MeSH categories can be used to indicate which relations hold between noun compounds
• headache recurrence
  – C23.888.592.612.441 C23.550.291.937
• headache pain
  – C23.888.592.612.441 G11.561.796.444
• breast cancer cells
  – A01.236 C04 A11
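Reducing the tree numbers above to category pairs can be sketched as simple truncation. This assumes that "top level" means the first dot-separated component of a MeSH tree number (for example C23), and that descending one level keeps one more component. The helper names `truncate` and `category_pair` are made up for this example.

```python
def truncate(tree_number, depth=1):
    """Keep the first `depth` dot-separated components of a MeSH tree number."""
    return ".".join(tree_number.split(".")[:depth])


def category_pair(tree_numbers, depth=1):
    """Map the tree numbers of a noun compound's words to a tuple of categories."""
    return tuple(truncate(number, depth) for number in tree_numbers)


# Tree numbers taken from the slide.
headache_recurrence = ["C23.888.592.612.441", "C23.550.291.937"]
headache_pain = ["C23.888.592.612.441", "G11.561.796.444"]

print(category_pair(headache_recurrence))           # top-level pair
print(category_pair(headache_pain))                 # top-level pair
print(category_pair(headache_recurrence, depth=2))  # one level further down
```

At the top level the two compounds fall into different category pairs, (C23, C23) versus (C23, G11), which is what lets a category pair stand in for a semantic relation.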
29
Linguistic Motivation
Can cast NC into head-modifier relation, and assume head noun has an argument and qualia structure.
  – (used-in): kitchen knife
  – (made-of): steel knife
  – (instrument-for): carving knife
  – (used-on): putty knife
  – (used-by): butcher’s knife
30
Distribution of Frequent Category Pairs
31
How Far to Descend?
• Anatomy: 250 CPs
  – 187 (75%) remain first level
  – 56 (22%) descend one level
  – 7 (3%) descend two levels
• Natural Science (H01): 21 CPs
  – 1 (4%) remain first level
  – 8 (39%) descend one level
  – 12 (57%) descend two levels
• Neoplasm (C04): 3 CPs
  – 3 (100%) descend one level
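One way to picture the descent is as a recursive refinement: keep a category pair (CP) at its current level if all of its examples share one relation, otherwise descend one level and try again. The slides do not state the stopping criterion, so the homogeneity test below is an assumption, as are the function name `learn_rules`, the tree numbers, and the relation labels in the toy data.

```python
from collections import Counter, defaultdict


def truncate(tree_number, depth):
    """Keep the first `depth` dot-separated components of a MeSH tree number."""
    return ".".join(tree_number.split(".")[:depth])


def learn_rules(examples, depth=1, max_depth=3):
    """Map category pairs to relations, descending only where relations disagree."""
    groups = defaultdict(list)
    for (modifier, head), relation in examples:
        pair = (truncate(modifier, depth), truncate(head, depth))
        groups[pair].append(((modifier, head), relation))
    rules = {}
    for pair, members in groups.items():
        relations = Counter(relation for _, relation in members)
        if len(relations) == 1 or depth == max_depth:
            rules[pair] = relations.most_common(1)[0][0]  # majority relation
        else:
            rules.update(learn_rules(members, depth + 1, max_depth))
    return rules


# Hypothetical labeled noun compounds: ((modifier code, head code), relation).
EXAMPLES = [
    (("A01.236", "A11.251"), "part-of"),
    (("A01.456", "A11.329"), "part-of"),
    (("H01.158", "H01.770"), "studies"),
    (("H01.671", "H01.770"), "measures"),
]

print(learn_rules(EXAMPLES))
```

In this toy data the anatomy pair stays at the first level while the H01 pair has to descend one level, mirroring the pattern in the counts above.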
32
Evaluation
• Apply the rules to a test set
• Accuracy:
  – Anatomy: 91% accurate
  – Natural Science: 79%
  – Diseases: 100%
• Total:
  – 89.6% via intra-category averaging
  – 90.8% via extra-category averaging
33
Summary of NC Work
• Lexical hierarchy useful for inferring semantic relations
• Works because semantics are constrained and word sense ambiguity is not too much of a problem
• Can it be extended to other types of relations?
  – Preliminary results on one set of relations are promising.
34
Database Research Issues
• Efficiently and effectively combining
  – Relational databases & Text
  – Hierarchical Ontologies
  – Layers of Annotations
35
Interface Issues
• Create intuitive, appealing interfaces that are better than what’s currently out there.
• Start with existing assigned metadata
• As text analysis improves, incorporate the results into the interface.
40
Some Recent Work
• Organizing BioScience Journal Names
  – Currently there are > 3500
43
Some Recent Work
• Organizing BioScience Journal Names
  – Currently there are > 3500
• Idea:
  – Group them into faceted hierarchies semi-automatically
  – Using clustering of title terms, synonym similarity via WordNet, and other techniques
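Grouping by title terms can be sketched very simply: index each journal under every content word in its name, so one journal can sit under several facets. This illustrates only the title-term part; the WordNet synonym step and any real clustering are left out. The stopword list and the function names `title_terms` and `group_by_term` are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical stopword list for journal names.
STOPWORDS = {"journal", "of", "the", "and", "annals", "in", "international"}

JOURNALS = [
    "Journal of Cell Biology",
    "Cell",
    "Molecular Cell",
    "Journal of Molecular Biology",
    "Annals of Neurology",
]


def title_terms(name):
    """Return the content words of a journal name, lowercased."""
    return {word for word in name.lower().split() if word not in STOPWORDS}


def group_by_term(journals):
    """Map each title term shared by two or more journals to those journals."""
    groups = defaultdict(list)
    for name in journals:
        for term in title_terms(name):
            groups[term].append(name)
    return {term: sorted(names) for term, names in groups.items() if len(names) > 1}


for term, names in sorted(group_by_term(JOURNALS).items()):
    print(term, names)
```

The output still needs human review and arrangement into hierarchies, consistent with the "semi-automatically" wording above.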
44
46
Summary
• BioText aims to improve access to bioscience information via
  – Sophisticated language analysis
  – Integration of results into
    • Annotated database
    • Flexible user interface
• Eventual goal
  – Semi-automated mining and discovery
47
There’s lots to do!
For more information: biotext.berkeley.edu