1 The BioText Project SIMS Affiliates Meeting Nov 14, 2003 Marti Hearst Associate Professor SIMS, UC...

The BioText Project

SIMS Affiliates Meeting
Nov 14, 2003

Marti Hearst
Associate Professor
SIMS, UC Berkeley

Project sponsored by NSF DBI-0317510, ARDA AQUAINT, and a gift from Genentech

BioText Project Goals

• Provide fast, flexible, intelligent access to information for use in biosciences applications.
  – Better search results
  – Text mining
• Focus on
  – Textual Information
  – Tightly integrated with other resources
    • Ontologies
    • Record-based databases

People

• Project Leaders:
  – PI: Marti Hearst
  – Co-PI: Adam Arkin
• Computational Linguistics
  – Barbara Rosario
  – Preslav Nakov
• Database Research
  – Ariel Schwartz
  – Gaurav Bhalotia (graduated)
• User Interface / Information Retrieval
  – Kevin Li
  – Dr. Emilia Stoica
• Bioscience
  – Dr. TingTing Zhang

Outline

• Main Goals
  – Text Mining Examples
  – System Architecture
  – Apoptosis problem statement
• Recent results in
  – Abbreviation definition recognition
  – Semantic relation recognition (from text)
  – Search User Interfaces
  – Hierarchical grouping of journals

Text Mining Example 1

• How to discover new information …
• … As opposed to discovering which statistical patterns characterize occurrence of known information.
• Method:
  – Use large text collections to gather evidence to support (or refute) hypotheses
  – Make Connections
  – Gather Evidence

Etiology Example

• Don Swanson example, 1991
• Goal: find cause of disease
  – Magnesium-migraine connection
• Given
  – medical titles and abstracts
  – a problem (incurable rare disease)
  – some medical expertise
• find causal links among titles
  – symptoms
  – drugs
  – results
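The idea of finding links among titles can be illustrated with a small Python sketch. It is not Swanson's actual procedure, which was only partially automated and relied on medical expertise; it only shows the mechanical part: collecting terms that appear with the disease in some titles and with the candidate cause in other titles. The function name `linking_terms`, the vocabulary, and the four titles are all invented for illustration.

```python
def linking_terms(titles, a_term, c_term, vocabulary):
    """Return vocabulary terms that co-occur with a_term in some titles
    and with c_term in other titles (titles naming both are skipped)."""
    a_links, c_links = set(), set()
    for title in titles:
        words = set(title.lower().split())
        present = words & vocabulary
        if a_term in words and c_term not in words:
            a_links |= present - {a_term}
        elif c_term in words and a_term not in words:
            c_links |= present - {c_term}
    # A linking term is one seen on both sides, never in the same title.
    return sorted(a_links & c_links)


# Hypothetical toy titles, NOT real Medline records.
toy_titles = [
    "Stress as a trigger of migraine attacks",
    "Magnesium depletion under chronic stress",
    "Migraine and platelet aggregation",
    "Dietary magnesium and bone density",
]
toy_vocabulary = {"stress", "migraine", "magnesium", "platelet", "bone"}

links = linking_terms(toy_titles, "migraine", "magnesium", toy_vocabulary)
print(links)
```

On this toy input the only term shared by both sides is "stress"; a person with medical expertise would still have to judge whether such a link is meaningful.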

Gathering Evidence

[Figure: diagram with nodes labeled stress, migraine, and magnesium]

Gathering Evidence

[Figure: diagram with nodes labeled migraine, magnesium, and stress]

Swanson’s Linking Approach

• Two of his hypotheses have received some experimental verification.

• His technique
  – Only partially automated
  – Required medical expertise

Text Mining Example 2:

• How to find functions of genes?
  – Have the genetic sequence
  – Don’t know what it does
  – But …
    • Know which genes it coexpresses with
    • Some of these have known function
  – So … infer function based on function of co-expressed genes
• This is a problem suggested by Michael Walker and others at Incyte Pharmaceuticals

Gene Co-expression: Role in the genetic pathway

Other possibilities as well

Make use of the literature

• Look up what is known about the other genes.

• Different articles in different collections

• Look for commonalities
  – Similar topics indicated by Subject Descriptors
  – Similar words in titles and abstracts

adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies ...
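Looking for commonalities among Subject Descriptors can be sketched as a simple count of how many co-expressed genes share each descriptor. This is only an illustration of the idea on the slide, not the procedure used at Incyte; the function name `common_descriptors`, the gene labels, and the assignment of descriptors to genes are invented (the descriptor strings themselves are taken from the list above).

```python
from collections import Counter


def common_descriptors(descriptors_by_gene, min_genes=2):
    """Return descriptors attached to articles about at least `min_genes`
    different genes, most widely shared first (ties in alphabetical order)."""
    counts = Counter()
    for descriptors in descriptors_by_gene.values():
        counts.update(set(descriptors))  # count each gene at most once
    shared = [(d, n) for d, n in counts.items() if n >= min_genes]
    shared.sort(key=lambda pair: (-pair[1], pair[0]))
    return [d for d, _ in shared]


# Hypothetical mapping from co-expressed genes to the descriptors of
# articles that mention them; the gene labels are placeholders.
toy_descriptors = {
    "gene_a": ["prostatic neoplasms", "tumor markers"],
    "gene_b": ["prostatic neoplasms", "antibodies"],
    "gene_c": ["adenocarcinoma", "prostatic neoplasms", "tumor markers"],
}

shared_topics = common_descriptors(toy_descriptors)
print(shared_topics)
```

Descriptors shared by several of the known genes are the ones a researcher would inspect first when forming a hypothesis about the unknown gene.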

Formulate a Hypothesis

• Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer

• New tack: do some lab tests
  – See if mystery gene is similar in molecular structure to the others
  – If so, it might do some of the same things they do

Outline

• Main Goals
  – Text Mining Examples
  – System Architecture
  – Apoptosis problem statement
• Recent results in
  – Abbreviation definition recognition
  – Semantic relation recognition (from text)
  – Search User Interfaces
  – Hierarchical grouping of journals

BioText: Architecture

Sophisticated Text Analysis

Annotations in Database

Improved Search Interface

Recent Result (Schwartz & Hearst 03)

• Fast, simple algorithm for recognizing abbreviation definitions.
  – Simpler and faster than the rest
  – Higher precision and recall
  – Idea: Work backwards from the end
• Examples:
  – In eukaryotes, the key to transcriptional regulation of the Heat Shock Response is the Heat Shock Transcription Factor (HSF).
  – Gcn5-related N-acetyltransferase (GNAT)

• Idea: use redundancy across abstracts to figure out abbreviation meaning even when definition is not present.
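The "work backwards from the end" idea can be sketched in Python as follows. This is a minimal sketch modeled on the matching procedure in the published description of the algorithm, not the authors' code. The function names `find_best_long_form` and `extract_abbreviations`, the regular expression for spotting a parenthesized short form, and the candidate window of min(|A| + 5, 2|A|) words are assumptions made for illustration.

```python
import re


def find_best_long_form(short_form, candidate):
    """Match short-form characters right to left against the candidate text.
    Returns the shortest matching long form, or None if no match exists."""
    l_index = len(candidate) - 1
    for s_index in range(len(short_form) - 1, -1, -1):
        ch = short_form[s_index].lower()
        if not ch.isalnum():
            continue  # ignore punctuation inside the short form
        # Move left until the character matches; the first character of the
        # short form must additionally match at the start of a word.
        while (l_index >= 0 and candidate[l_index].lower() != ch) or (
            s_index == 0 and l_index > 0 and candidate[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
    # Extend to the beginning of the word containing the first match.
    start = candidate.rfind(" ", 0, l_index + 1) + 1
    return candidate[start:]


def extract_abbreviations(sentence):
    """Find '(SHORT)' patterns and pair each with a long form found in the
    words that precede it. Pattern and window size are assumptions."""
    pairs = []
    for match in re.finditer(r"\(([A-Za-z0-9-]{2,10})\)", sentence):
        short_form = match.group(1)
        words = sentence[: match.start()].split()
        window = min(len(short_form) + 5, len(short_form) * 2)
        candidate = " ".join(words[-window:])
        long_form = find_best_long_form(short_form, candidate)
        if long_form:
            pairs.append((short_form, long_form))
    return pairs


example = ("In eukaryotes, the key to transcriptional regulation of the "
           "Heat Shock Response is the Heat Shock Transcription Factor (HSF).")
print(extract_abbreviations(example))
print(extract_abbreviations("Gcn5-related N-acetyltransferase (GNAT)"))
```

Scanning from the right is what lets the procedure skip the earlier "Heat Shock Response" and settle on the phrase immediately before the parenthesis. The sketch omits the filtering steps a production system would need, such as rejecting long forms that are implausibly long relative to the short form.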

BioText: A Two-Sided Approach

[Figure labels: SwissProt; GO; WordNet; Medline; Journal Full Text; Sophisticated Database Design & Algorithms; Empirical Computational Linguistics Algorithms]

Apoptosis Network

[Figure: apoptosis network diagram. Node labels: Death Receptors Signaling; Survival Factors Signaling; Ca++ Signaling; P53 pathway; ER Stress; Genotoxic Stress; Loss of Attachment; Cell Cycle stress, etc.; Initiator Caspases (8, 10); Caspase 12; Caspase 9; Effector Caspases (3, 6, 7); Apaf 1; IAPs; Mitochondria; Cytochrome c; Bax, Bak; Bcl-2 like; BH3 only; Apoptosis]

Slide courtesy TingTing Zhang

The issues (courtesy TingTing Zhang):

• The network nodes are deduced from reading and processing of experimental knowledge by experts. Every month >1000 apoptosis papers are published.

• The supporting experimental data are gathered in different organs, tissues, cells using various techniques.

• There are various levels of uncertainty associated with different techniques used to answer certain questions.

• Depending on the expression patterns for the players in the network, the observation may or may not be extended to other contexts.

• We need to keep track of ALL the information in order to understand the system better.

Simple cases:

• Mouse Bim proteins (isoforms EL, L, S) bind to human Bcl-2 (bacteriophage screening using cDNA expression library from T-lymphoma cell line KO52DA20).
• Human BimEL protein is 89% identical to mouse BimEL; human BimL is 85% identical to mouse BimL (hybridization of mouse bim cDNA to human fetal spleen and peripheral blood cDNA library).
• Bim mRNA is detected in B and T lymphoid cells (Northern blot analysis of mouse KO52DA20, WEHI 703, WEHI 707, WEHI7.1, CH1, WEHI231, WEHI415, B6.23.16BW2 cell extracts).
• BimL protein interacts with Bcl-2 OR Bcl-XL OR Bcl-w proteins (immunoprecipitation (anti-Bcl-2 OR Bcl-XL OR Bcl-w) followed by Western blot (anti-EEtag) using extracts of human 293T cells co-transfected with EE-tagged BimL AND (bcl-2 OR bcl-XL OR bcl-w) plasmids).
• BimL deleted of the BH3 domain does not bind to Bcl-2 OR Bcl-XL OR Bcl-w proteins (under experimental conditions mentioned above).
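Each of these simple cases pairs a claim with the experiment that supports it, which suggests a record structure along the following lines. This is a hypothetical sketch, not the BioText database schema: the class name `Finding`, its fields, and the helper `findings_about` are assumptions chosen for illustration, and only three of the cases above are encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    subject: str           # entity the claim is about
    relation: str          # e.g. "binds", "interacts with"
    objects: tuple         # alternatives joined by OR on the slide
    method: str            # experimental technique supporting the claim
    negated: bool = False  # True for "does not bind" style results


findings = [
    Finding("mouse Bim (isoforms EL, L, S)", "binds", ("human Bcl-2",),
            "bacteriophage screening of a cDNA expression library"),
    Finding("BimL", "interacts with", ("Bcl-2", "Bcl-XL", "Bcl-w"),
            "immunoprecipitation followed by Western blot"),
    Finding("BimL deleted of the BH3 domain", "binds",
            ("Bcl-2", "Bcl-XL", "Bcl-w"),
            "immunoprecipitation followed by Western blot", negated=True),
]


def findings_about(records, protein):
    """Return every record that lists `protein` among its objects."""
    return [r for r in records if protein in r.objects]


for record in findings_about(findings, "Bcl-XL"):
    verb = "does NOT: " if record.negated else ""
    print(f"{record.subject} {verb}{record.relation} Bcl-XL ({record.method})")
```

Keeping the method and the negation flag alongside each claim is what allows later queries to weigh evidence by technique, as the issues listed earlier call for. The lookup is an exact string match, so "human Bcl-2" and "Bcl-2" are treated as different objects; a fuller design would need species and entity normalization.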

Computational Language Goals

• Recognizing and annotating entities within textual documents

• Identifying semantic relations among entities

• To (eventually) be used in tandem with semi-automated reasoning systems.

Main Ideas for NLP Approach

• Assign Semantics using
  – Statistics
  – Hierarchical Lexical Ontologies to generalize
  – Redundancy in the data
• Build up Layers of Representation
  – Syntactic and Semantic
  – Use these in a feedback loop

Computational Linguistics Goals

• Mark up text with semantic relations
