The BioText Project, SIMS Affiliates Meeting, Nov 14, 2003. Marti Hearst, Associate Professor, SIMS, UC Berkeley. Project sponsored by NSF DBI-0317510, ARDA AQUAINT, and a gift from Genentech.


The BioText Project

SIMS Affiliates Meeting
Nov 14, 2003

Marti Hearst
Associate Professor

SIMS, UC Berkeley

Project sponsored by NSF DBI-0317510, ARDA AQUAINT, and a gift from Genentech


BioText Project Goals

• Provide fast, flexible, intelligent access to information for use in biosciences applications.
  – Better search results
  – Text mining
• Focus on
  – Textual information
  – Tightly integrated with other resources
    • Ontologies
    • Record-based databases


People

• Project Leaders:
  – PI: Marti Hearst
  – Co-PI: Adam Arkin
• Computational Linguistics
  – Barbara Rosario
  – Preslav Nakov
• Database Research
  – Ariel Schwartz
  – Gaurav Bhalotia (graduated)
• User Interface / Information Retrieval
  – Kevin Li
  – Dr. Emilia Stoica
• Bioscience
  – Dr. TingTing Zhang


Outline

• Main Goals
  – Text Mining Examples
  – System Architecture
  – Apoptosis problem statement
• Recent results in
  – Abbreviation definition recognition
  – Semantic relation recognition (from text)
  – Search User Interfaces
  – Hierarchical grouping of journals


Text Mining Example 1

• How to discover new information …
• … as opposed to discovering which statistical patterns characterize occurrence of known information.
• Method:
  – Use large text collections to gather evidence to support (or refute) hypotheses
  – Make Connections
  – Gather Evidence


Etiology Example

• Don Swanson example, 1991
• Goal: find cause of disease
  – Magnesium-migraine connection
• Given
  – medical titles and abstracts
  – a problem (incurable rare disease)
  – some medical expertise
• Find causal links among titles
  – symptoms
  – drugs
  – results
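The linking step above can be illustrated with a small sketch: look for terms that co-occur with the problem term in some titles and with a candidate cause in other titles. This is only an illustration of the idea, not Swanson's procedure; the function name `linking_terms`, the whitespace tokenization, and the toy titles are all assumptions made for the example.

```python
from collections import defaultdict

def linking_terms(titles, start_term, target_term, vocabulary):
    """Return vocabulary terms that co-occur (in some title) with both
    start_term and target_term. Hypothetical helper for illustration."""
    cooccur = defaultdict(set)
    for title in titles:
        words = set(title.lower().split())
        present = {t for t in vocabulary if t in words}
        # Record every pair of vocabulary terms seen in the same title.
        for term in present:
            cooccur[term] |= present - {term}
    return sorted(cooccur[start_term] & cooccur[target_term])

# Invented toy titles; real input would be titles and abstracts from the literature.
toy_titles = [
    "stress triggers migraine",
    "stress depletes magnesium",
    "migraine treatment review",
    "magnesium in diet",
]
links = linking_terms(toy_titles, "migraine", "magnesium",
                      {"stress", "migraine", "magnesium"})
print(links)
```

A term found this way is only a candidate connection; as the slides note, judging whether the link is causal still takes medical expertise.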


Gathering Evidence

[Diagram: migraine, with the intermediate terms stress, CCB, PA, and SCD, each paired with magnesium]


Gathering Evidence

[Diagram: migraine and magnesium, linked through stress, CCB, PA, and SCD]


Swanson’s Linking Approach

• Two of his hypotheses have received some experimental verification.

• His technique
  – Only partially automated
  – Required medical expertise


Text Mining Example 2:

• How to find functions of genes?
  – Have the genetic sequence
  – Don’t know what it does
  – But …
    • Know which genes it coexpresses with
    • Some of these have known function
  – So … infer function based on function of co-expressed genes
• This is a problem suggested by Michael Walker and others at Incyte Pharmaceuticals


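Gene Co-expression: Role in the genetic pathway

[Diagram: two candidate arrangements. In one, the unknown gene g? appears with PSA, Kall., and PAP; in the other, h? appears with PSA, Kall., PAP, and g?]

Other possibilities as well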


Make use of the literature

• Look up what is known about the other genes.

• Different articles in different collections

• Look for commonalities
  – Similar topics indicated by Subject Descriptors
  – Similar words in titles and abstracts

adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies ...
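One simple way to look for such commonalities is to count how many of the co-expressed genes share each subject descriptor. The sketch below assumes the descriptors have already been collected per gene; the function name `shared_descriptors` and the gene-to-descriptor assignments are invented for illustration and are not data from the project.

```python
from collections import Counter

def shared_descriptors(descriptors_by_gene, min_genes=2):
    """Return descriptors attached to articles about at least min_genes genes."""
    counts = Counter()
    for terms in descriptors_by_gene.values():
        counts.update(set(terms))  # count each descriptor once per gene
    return sorted(term for term, n in counts.items() if n >= min_genes)

# Hypothetical assignments, for illustration only.
toy = {
    "PSA": ["prostatic neoplasms", "tumor markers"],
    "PAP": ["prostatic neoplasms", "antibodies"],
    "Kall.": ["tumor markers", "prostatic neoplasms"],
}
common_all = shared_descriptors(toy, min_genes=3)
common_two = shared_descriptors(toy, min_genes=2)
print(common_all, common_two)
```

Descriptors shared by most of the known genes are the ones that suggest a hypothesis about the unknown gene.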


Formulate a Hypothesis

• Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer
• New tack: do some lab tests
  – See if mystery gene is similar in molecular structure to the others
  – If so, it might do some of the same things they do


Outline

• Main Goals
  – Text Mining Examples
  – System Architecture
  – Apoptosis problem statement
• Recent results in
  – Abbreviation definition recognition
  – Semantic relation recognition (from text)
  – Search User Interfaces
  – Hierarchical grouping of journals


BioText: Architecture

[Diagram with three components: Sophisticated Text Analysis; Annotations in Database; Improved Search Interface]


Recent Result (Schwartz & Hearst 03)

• Fast, simple algorithm for recognizing abbreviation definitions.
  – Simpler and faster than the rest
  – Higher precision and recall
  – Idea: Work backwards from the end
• Examples:
  – In eukaryotes, the key to transcriptional regulation of the Heat Shock Response is the Heat Shock Transcription Factor (HSF).
  – Gcn5-related N-acetyltransferase (GNAT)
• Idea: use redundancy across abstracts to figure out abbreviation meaning even when definition is not present.
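The "work backwards from the end" idea can be sketched as a right-to-left character match between the abbreviation and the text preceding it. This is a minimal sketch in that spirit, not a transcription of the published algorithm: the name `find_long_form` and the rule that the first abbreviation character must start a word are assumptions of this example, and candidate selection and filtering are left out.

```python
def find_long_form(short_form, candidate_text):
    """Match short_form against candidate_text from right to left.
    Returns the matched long form, or None if the characters cannot be aligned."""
    s = len(short_form) - 1
    l = len(candidate_text) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():  # skip punctuation inside the abbreviation
            s -= 1
            continue
        # Move left until the character matches; the first abbreviation
        # character must additionally sit at the start of a word.
        while l >= 0 and (
            candidate_text[l].lower() != ch
            or (s == 0 and l > 0 and candidate_text[l - 1].isalnum())
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    # Extend the match back to the beginning of the word it starts in.
    start = candidate_text.rfind(" ", 0, l + 1) + 1
    return candidate_text[start:]

hsf = find_long_form(
    "HSF",
    "the key to transcriptional regulation of the Heat Shock Response "
    "is the Heat Shock Transcription Factor",
)
gnat = find_long_form("GNAT", "Gcn5-related N-acetyltransferase")
print(hsf)
print(gnat)
```

Scanning from the right keeps the match as short as possible, which is why the first example stops at the second "Heat" rather than reaching back to "Heat Shock Response".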


BioText: A Two-Sided Approach

[Diagram. Resources: SwissProt, Blast, MeSH, GO, WordNet, Medline, Journal Full Text. The two sides: Sophisticated Database Design & Algorithms; Empirical Computational Linguistics Algorithms]


Apoptosis Network

[Diagram of the apoptosis network. Node labels: Death Receptors Signaling; Survival Factors Signaling; Ca++ Signaling; P53 pathway; ER Stress; Genotoxic Stress; Loss of Attachment, Cell Cycle stress, etc.; Initiator Caspases (8, 10); Caspase 12; Caspase 9; Effector Caspases (3, 6, 7); Apaf 1; IAPs; NFkB; Mitochondria; Cytochrome c; Bax, Bak; Bcl-2 like; BH3 only; Smac; AIF; Apoptosis]

Slide courtesy TingTing Zhang


The issues (courtesy TingTing Zhang):

• The network nodes are deduced from reading and processing of experimental knowledge by experts. Every month >1000 apoptosis papers are published.

• The supporting experimental data are gathered in different organs, tissues, cells using various techniques.

• There are various levels of uncertainty associated with different techniques used to answer certain questions.

• Depending on the expression patterns for the players in the network, the observation may or may not be extended to other contexts.

• We need to keep track of ALL the information in order to understand the system better.


Simple cases:

• Mouse Bim proteins (isoforms EL, L, S) bind to human Bcl-2 (bacteriophage screening using cDNA expression library from T-Lymphoma cell line KO52DA20).
• Human BimEL protein is 89% identical to mouse BimEL; human BimL is 85% identical to mouse BimL (hybridization of mouse bim cDNA to human fetal spleen and peripheral blood cDNA library).
• Bim mRNA is detected in B and T lymphoid cells (Northern blot analysis of mouse KO52DA20, WEHI 703, WEHI 707, WEHI7.1, CH1, WEHI231, WEHI415, B6.23.16BW2 cell extracts).
• BimL protein interacts with Bcl-2 OR Bcl-XL OR Bcl-w proteins (immunoprecipitation (anti-Bcl-2 OR Bcl-XL OR Bcl-w) followed by Western blot (anti-EEtag) using extracts of human 293T cells co-transfected with EE-tagged BimL AND (bcl-2 OR bcl-XL OR bcl-w) plasmids).
• BimL deleted of the BH3 domain does not bind to Bcl-2 OR Bcl-XL OR Bcl-w proteins (under the experimental conditions mentioned above).


Computational Language Goals

• Recognizing and annotating entities within textual documents

• Identifying semantic relations among entities

• To (eventually) be used in tandem with semi-automated reasoning systems.


Main Ideas for NLP Approach

• Assign Semantics using
  – Statistics
  – Hierarchical Lexical Ontologies to generalize
  – Redundancy in the data
• Build up Layers of Representation
  – Syntactic and Semantic
  – Use these in a feedback loop


Computational Linguistics Goals

• Mark up text with semantic relations


Recent Result: Descent of Hierarchy

• Idea:
  – Use the top levels of a lexical hierarchy to identify semantic relations
• Hypothesis:
  – A particular semantic relation holds between all 2-word Noun Compounds that can be categorized by a MeSH pair.


Definition

• NC (noun compound): Any sequence of nouns that itself functions as a noun
  – asthma hospitalizations
  – health care personnel hand wash
• Technical text is rich with NCs
  – Open-labeled long-term study of the subcutaneous sumatriptan efficacy and tolerability in acute migraine treatment.


NCs: Three tasks

• Identification
• Syntactic analysis (attachments)
  – [Baseline [headache frequency]]
  – [[Tension headache] patient]
• Our Goal: Semantic analysis
  – Headache treatment → treatment for headache
  – Corticosteroid treatment → treatment that uses corticosteroid


Main Idea:

• Top-level MeSH categories can be used to indicate which relations hold between noun compounds
• headache recurrence
  – C23.888.592.612.441 C23.550.291.937
• headache pain
  – C23.888.592.612.441 G11.561.796.444
• breast cancer cells
  – A01.236 C04 A11
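Mapping a noun compound to a category pair amounts to truncating each noun's MeSH tree number to its top levels. The sketch below shows only that truncation step, using the tree numbers from the examples above; the helper names `truncate` and `category_pair`, and the convention that a "level" is one dot-separated component, are assumptions of this example.

```python
def truncate(tree_number, levels):
    """Keep the first `levels` dot-separated components of a MeSH tree number."""
    return ".".join(tree_number.split(".")[:levels])

def category_pair(modifier_code, head_code, levels=(1, 1)):
    """Category pair for a two-word noun compound (modifier noun, head noun)."""
    return (truncate(modifier_code, levels[0]), truncate(head_code, levels[1]))

# "headache recurrence" and "headache pain", with the codes shown above.
recurrence = category_pair("C23.888.592.612.441", "C23.550.291.937")
pain = category_pair("C23.888.592.612.441", "G11.561.796.444")
# Descending one more level on the modifier side only.
pain_deeper = category_pair("C23.888.592.612.441", "G11.561.796.444", levels=(2, 1))
print(recurrence, pain, pain_deeper)
```

Compounds that land on the same pair are then hypothesized to share a semantic relation; descending further is needed only where a pair mixes several relations.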


Linguistic Motivation

Can cast NC into head-modifier relation, and assume head noun has an argument and qualia structure.
  – (used-in): kitchen knife
  – (made-of): steel knife
  – (instrument-for): carving knife
  – (used-on): putty knife
  – (used-by): butcher’s knife


Distribution of Frequent Category Pairs


How Far to Descend?

• Anatomy: 250 CPs
  – 187 (75%) remain first level
  – 56 (22%) descend one level
  – 7 (3%) descend two levels
• Natural Science (H01): 21 CPs
  – 1 (4%) remain first level
  – 8 (39%) descend one level
  – 12 (57%) descend two levels
• Neoplasm (C04): 3 CPs
  – 3 (100%) descend one level


Evaluation

• Apply the rules to a test set
• Accuracy:
  – Anatomy: 91% accurate
  – Natural Science: 79%
  – Diseases: 100%
• Total:
  – 89.6% via intra-category averaging
  – 90.8% via extra-category averaging


Summary of NC Work

• Lexical hierarchy useful for inferring semantic relations

• Works because semantics are constrained and word sense ambiguity is not too much of a problem

• Can it be extended to other types of relations?
  – Preliminary results on one set of relations are promising.


Database Research Issues

• Efficiently and effectively combining
  – Relational databases & Text
  – Hierarchical Ontologies
  – Layers of Annotations


Interface Issues

• Create intuitive, appealing interfaces that are better than what’s currently out there.

• Start with existing assigned metadata

• As text analysis improves, incorporate the results into the interface.



Some Recent Work

• Organizing BioScience Journal Names
  – Currently there are > 3500
• Idea:
  – Group them into faceted hierarchies semi-automatically
  – Using clustering of title terms, synonym similarity via WordNet, and other techniques
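As a rough illustration of grouping by title terms, the sketch below builds an index from each content word in a journal name to the journals that contain it. This is a much simpler stand-in for the clustering mentioned above, not the project's method, and it leaves out WordNet synonym similarity entirely; the stopword list, the function name `group_by_title_term`, and the three sample names are choices made for the example.

```python
from collections import defaultdict

# Assumed stopword list for journal titles; a real one would be larger.
STOPWORDS = {"journal", "of", "the", "and", "in"}

def group_by_title_term(journal_names, min_size=2):
    """Group journal names under each title term shared by at least min_size names."""
    groups = defaultdict(list)
    for name in journal_names:
        for term in sorted(set(name.lower().split()) - STOPWORDS):
            groups[term].append(name)
    return {term: names for term, names in groups.items() if len(names) >= min_size}

names = ["Journal of Cell Biology", "Molecular Cell", "Journal of Molecular Biology"]
groups = group_by_title_term(names)
for term in sorted(groups):
    print(term, groups[term])
```

A journal can fall under several terms at once, which is the property a faceted organization relies on.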


Summary

• BioText aims to improve access to bioscience information via
  – Sophisticated language analysis
  – Integration of results into
    • Annotated database
    • Flexible user interface
• Eventual goal
  – Semi-automated mining and discovery


There’s lots to do!

For more information: biotext.berkeley.edu