Untangling Text Data Mining Marti Hearst UC Berkeley SIMS ACL’99 Plenary Talk June 23, 1999

Untangling Text Data Mining

Marti Hearst UC Berkeley SIMS

ACL’99 Plenary Talk, June 23, 1999

Outline

Untangling several different fields
– DM, CL, IA, TDM

TDM examples

TDM as Exploratory Data Analysis
– New Problems for Computational Linguistics
– Our current efforts

Classifying Application Types

                   Patterns                   Non-Novel Nuggets      Novel Nuggets
Non-textual data   Standard data mining       Database queries       ?
Textual data       Computational linguistics  Information retrieval  Real text data mining

What is Data Mining? (Fayyad & Uthurusamy 96, Fayyad 97)

Fitting models to or determining patterns from very large datasets.

A “regime” which enables people to interact effectively with massive data stores.

Deriving new information from data.

Why Data Mining?

Because the data is there.

Because
– larger disks
– faster CPUs
– high-powered visualization
– networked information
are becoming widely available.

The Knowledge Discovery from Data Process (KDD)

KDD: The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. (Fayyad, Piatetsky-Shapiro, & Smyth, CACM 96)

Note: data mining is just one step in the process

DM Touchstone Applications (CACM 39 (11) Special Issue)

Finding patterns across data sets:
– Reports on changes in retail sales
  » to improve sales
– Patterns of sizes of TV audiences
  » for marketing
– Patterns in NBA play
  » to alter, and so improve, performance
– Deviations in standard phone calling behavior
  » to detect fraud
  » for marketing

What is Data Mining?

Potential point of confusion:
– The extracting ore from rock metaphor does not really apply to the practice of data mining
– If it did, then standard database queries would fit under the rubric of data mining
– In practice, DM refers to:
  » finding patterns across large datasets
  » discovering heretofore unknown information

What is Text Data Mining?

Many people’s first thought:
– Make it easier to find things on the Web.
– But this is information retrieval!

Needles in Haystacks

The emphasis in IR is on finding documents that already contain answers to questions.

Information Retrieval: A restricted form of Information Access

The system has available only pre-existing, “canned” text passages.

Its response is limited to selecting from these passages and presenting them to the user.

It must select, say, 10 or 20 passages out of millions.

What is Text Data Mining?

The metaphor of extracting ore from rock:
– Does make sense for extracting documents of interest from a huge pile.
– But does not reflect notions of DM in practice:
  » finding patterns across large collections
  » discovering heretofore unknown information

Real Text DM

What would finding a pattern across a large text collection really look like?

From: “The Internet Diary of the man who cracked the Bible Code” Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil (William Gates, agitator, leader)

Bill Gates + MS-DOS in the Bible!

From: “The Internet Diary of the man who cracked the Bible Code” Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil

Real Text DM

The point:
– Discovering heretofore unknown information is not what we usually do with text.
– (If it weren’t known, it could not have been written by someone!)

However:
– There is a field whose goal is to learn about patterns in text for their own sake ...

Computational Linguistics!

Goal: automated language understanding
– this isn’t possible
– instead, go for subgoals, e.g.,
  » word sense disambiguation
  » phrase recognition
  » semantic associations

Common current approach:
– statistical analyses over very large text collections

Why CL Isn’t TDM

A linguist finds it interesting that “cloying” co-occurs significantly with “Jar Jar Binks” ...

… But this doesn’t really answer a question relevant to the world outside the text itself.

Why CL Isn’t TDM

We need to use the text indirectly to answer questions about the world

Direct:
– Analyze patent text; determine which word patterns indicate various subject categories.

Indirect:
– Analyze patent text; find out whether private or public funding leads to more inventions.

Why CL Isn’t TDM

Direct:
– Cluster newswire text; determine which terms are predominant

Indirect:
– Analyze newswire text; gather evidence about which countries/alliances are dominating which financial sectors

Nuggets vs. Patterns

TDM: we want to discover new information …

… As opposed to discovering which statistical patterns characterize occurrence of known information.

Example: WSD
– not TDM: computing statistics over a corpus to determine what patterns characterize Sense S.
– TDM: discovering the meaning of a new sense of a word.

Nuggets vs. Patterns

Nugget: a new, heretofore unknown item of information.

Pattern: distributions or rules that characterize the occurrence (or non-occurrence) of a known item of information.

Application of rules can create nuggets in some circumstances.

Example: Lexicon Augmentation

Application of a lexico-syntactic pattern:
    NP0 such as NP1, {NP2 …, (and | or) NPi}, i >= 1,
implies that
    forall NPi, i >= 1, hyponym(NPi, NP0)

Extracts out a new hypernym:
– “Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use.”
– implies hyponym(“Gelidium”, “red algae”)

However, this fact was already known to the author of the text.
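
The "such as" pattern can be sketched in a few lines of Python. This is only an illustration under a strong assumption: the sentence has already been split into chunks tagged "NP" (noun phrase) or "O" (anything else) by some upstream chunker that is not shown. The `Chunk` type and the function name `extract_hyponyms` are hypothetical names chosen for this sketch.

```python
from typing import List, Tuple

# (text, tag): tag is "NP" for a noun phrase and "O" for any other token.
Chunk = Tuple[str, str]


def extract_hyponyms(chunks: List[Chunk]) -> List[Tuple[str, str]]:
    """Return (hyponym, hypernym) pairs matched by
    'NP0 such as NP1, {NP2 ..., (and | or) NPi}'."""
    pairs = []
    n = len(chunks)
    for i, (text, tag) in enumerate(chunks):
        if tag != "NP":
            continue
        j = i + 1
        # An optional comma may separate NP0 from "such as".
        if j < n and chunks[j][0] == ",":
            j += 1
        if j + 1 >= n:
            continue
        if chunks[j][0].lower() != "such" or chunks[j + 1][0].lower() != "as":
            continue
        j += 2
        # Collect NP1 and any further members of the list.
        while j < n and chunks[j][1] == "NP":
            pairs.append((chunks[j][0], text))
            j += 1
            # Skip one list separator: ",", "and", "or", ", and", or ", or".
            k = j
            if k < n and chunks[k][0] == ",":
                k += 1
            if k < n and chunks[k][0].lower() in ("and", "or"):
                k += 1
            if k == j:
                break  # no separator follows, so the list has ended
            j = k
    return pairs


# The sentence from the slide, chunked by hand for this example.
AGAR = [
    ("Agar", "NP"), ("is", "O"), ("a substance", "NP"), ("prepared from", "O"),
    ("a mixture", "NP"), ("of", "O"), ("red algae", "NP"), (",", "O"),
    ("such", "O"), ("as", "O"), ("Gelidium", "NP"), (",", "O"),
    ("for", "O"), ("laboratory or industrial use", "NP"), (".", "O"),
]
```

Calling `extract_hyponyms(AGAR)` yields the single pair ("Gelidium", "red algae"). The list stops at "for" because the chunk after the comma is not a noun phrase.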

The Quandary

How do we use text to both
– Find new information not known to the author of the text
– Find information that is not about the text itself

Idea: Exploratory Data Analysis

Use large text collections to gather evidence to support (or refute) hypotheses
– Not known to author: links across many texts
– Not self-referential: work within the domain of discourse

Example: Etiology

Given
– medical titles and abstracts
– a problem (incurable rare disease)
– some medical expertise

find causal links among titles
– symptoms
– drugs
– results

Swanson Example (1991)

Problem: Migraine headaches (M)
– stress associated with M
– stress leads to loss of magnesium
– calcium channel blockers prevent some M
– magnesium is a natural calcium channel blocker
– spreading cortical depression (SCD) implicated in M
– high levels of magnesium inhibit SCD
– M patients have high platelet aggregability
– magnesium can suppress platelet aggregability

All extracted from medical journal titles
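
The chain of evidence has a simple shape: each intermediate concept is tied to migraine by one statement and to magnesium by another. A minimal sketch, assuming the statements have already been reduced by hand to concept pairs. The pair list and the name `linking_terms` are illustrative, and this is not a reconstruction of Swanson's own procedure, which was only partially automated.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hand-coded associations paraphrasing the slide. Extracting such pairs from
# real titles is the hard part and is not attempted here.
ASSOCIATIONS: List[Tuple[str, str]] = [
    ("stress", "migraine"),
    ("stress", "magnesium"),
    ("calcium channel blockers", "migraine"),
    ("calcium channel blockers", "magnesium"),
    ("spreading cortical depression", "migraine"),
    ("spreading cortical depression", "magnesium"),
    ("platelet aggregability", "migraine"),
    ("platelet aggregability", "magnesium"),
]


def linking_terms(pairs: Iterable[Tuple[str, str]], a: str, c: str) -> List[str]:
    """Return the concepts associated with both a and c, alphabetically."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for x, y in pairs:
        # Treat every association as undirected.
        neighbors[x].add(y)
        neighbors[y].add(x)
    return sorted(neighbors[a] & neighbors[c])
```

`linking_terms(ASSOCIATIONS, "migraine", "magnesium")` returns the four intermediate concepts. No single pair mentions migraine and magnesium together, which is why the link is absent from any one text.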

Gathering Evidence

[Diagram: migraine, the intermediate terms stress, CCB, PA, and SCD, and four occurrences of magnesium]

Gathering Evidence

[Diagram: migraine and magnesium, together with stress, CCB, PA, and SCD]

Swanson’s TDM

Two of his hypotheses have received some experimental verification.

His technique
– Only partially automated
– Required medical expertise

Few people are working on this.

How to Automate This?

Idea: mixed-initiative interaction
– User applies tools to help explore the hypothesis space
– System runs suites of algorithms to help explore the space, suggest directions

Our Proposed Approach

Three main parts
– UI for building/using strategies
– Backend for interfacing with various databases and translating different formats
– Content analysis/machine learning for figuring out good hypotheses/throwing out bad ones

How to find functions of genes?

Important problem in molecular biology
– Have the genetic sequence
– Don’t know what it does
– But …
  » Know which genes it coexpresses with
  » Some of these have known function
– So … Infer function based on function of co-expressed genes
  » This is new work by Michael Walker and others at Incyte Pharmaceuticals
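
One very simple reading of "infer function based on function of co-expressed genes" is a vote among the co-expressed genes whose function is known. The sketch below assumes exactly that and nothing more. It is not the method used in the work cited on the slide, and the gene names, function labels, and the name `rank_candidate_functions` are placeholders.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def rank_candidate_functions(
    coexpressed: Iterable[str],
    known_functions: Dict[str, str],
) -> List[Tuple[str, int]]:
    """Count how often each known function occurs among the co-expressed
    genes and return (function, votes) pairs, most frequent first."""
    votes = Counter(
        known_functions[gene] for gene in coexpressed if gene in known_functions
    )
    return votes.most_common()


# Placeholder data: geneD has no known function and therefore casts no vote.
KNOWN = {"geneA": "function-1", "geneB": "function-1", "geneC": "function-2"}
CANDIDATES = rank_candidate_functions(["geneA", "geneB", "geneC", "geneD"], KNOWN)
```

Here `CANDIDATES` ranks "function-1" first with two votes. A ranking like this only suggests a hypothesis; the slides that follow turn to the literature and to lab tests to check it.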

Gene Co-expression: Role in the genetic pathway

[Diagram: two pathway arrangements, labeled g?, PSA, Kall., PAP and h?, PSA, Kall., PAP, g?]

Other possibilities as well

Make use of the literature

Look up what is known about the other genes.

Different articles in different collections

Look for commonalities
– Similar topics indicated by Subject Descriptors
– Similar words in titles and abstracts
  adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies ...

Developing Strategies

Different strategies seem needed for different situations
– First: see what is known about Kallikrein.
– 7341 documents. Too many
– AND the result with “disease” category
  » If result is non-empty, this might be an interesting gene
– Now get 803 documents
– AND the result with PSA
  » Get 11 documents. Better!
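
The narrowing steps amount to successive set intersections over document identifiers. A minimal sketch, assuming each query term has already been resolved to a set of document IDs. The tiny index is made up for illustration and does not reproduce the 7341, 803, and 11 counts from the slide, and `narrow` is a hypothetical helper name.

```python
from typing import Dict, List, Optional, Set, Tuple


def narrow(
    index: Dict[str, Set[int]],
    terms: List[str],
) -> Tuple[Set[int], List[Tuple[str, int]]]:
    """AND the terms together one at a time, recording how many documents
    remain after each step."""
    result: Optional[Set[int]] = None
    trace: List[Tuple[str, int]] = []
    for term in terms:
        postings = index.get(term, set())
        # The first term seeds the result; later terms are intersected in.
        result = set(postings) if result is None else result & postings
        trace.append((term, len(result)))
    return (result if result is not None else set()), trace


# Made-up document IDs, purely for illustration.
TOY_INDEX = {
    "kallikrein": {1, 2, 3, 4, 5, 6},
    "disease": {2, 3, 4, 9},
    "PSA": {3, 4, 7},
}
DOCS, TRACE = narrow(TOY_INDEX, ["kallikrein", "disease", "PSA"])
```

The trace makes the strategy visible: the result shrinks from 6 to 3 to 2 documents, mirroring the "too many, AND again, better" progression in the bullets.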

Developing Strategies

Look for commonalities among these documents
– Manual scan through ~100 category labels
– Would have been better if
  » Automatically organized
  » Intersections of “important” categories scanned for first

Try a new tack

Researcher uses knowledge of field to realize these are related to prostate cancer and diagnostic tests

New tack: intersect search on all three known genes
– Hope they all talk about diagnostics and prostate cancer
– Fortunately, 7 documents returned
– Bingo! A relation to regulation of this cancer

Formulate a Hypothesis

Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer

New tack: do some lab tests
– See if mystery gene is similar in molecular structure to the others
– If so, it might do some of the same things they do

Strategies again

In hindsight, combining all three genes was a good strategy.
– Store this for later

Might not have worked
– Need a suite of strategies
– Build them up via experience and a good UI

The System

Doing the same query with slightly different values each time is time-consuming and tedious

Same goes for cutting and pasting results
– IR systems don’t support varying queries like this very well.
– Each situation is a bit different

Some automatic processing is needed in the background to eliminate/suggest hypotheses
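
One way to take the tedium out of re-running near-identical queries is to enumerate the variants mechanically and report only the hit counts. The sketch below assumes a toy in-memory index that maps terms to sets of document IDs. It is not a description of the system proposed in the talk, and `conjunctive_query_counts` is a hypothetical name.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def conjunctive_query_counts(
    index: Dict[str, Set[int]],
    terms: List[str],
) -> Dict[Tuple[str, ...], int]:
    """Run every AND-combination of the given terms and return the number
    of matching documents for each combination."""
    counts: Dict[Tuple[str, ...], int] = {}
    for size in range(1, len(terms) + 1):
        for combo in combinations(terms, size):
            # Intersect the document sets of all terms in this combination.
            docs = set.intersection(*(index.get(t, set()) for t in combo))
            counts[combo] = len(docs)
    return counts


# Made-up document IDs, purely for illustration.
QUERY_INDEX = {
    "kallikrein": {1, 2, 3, 4, 5, 6},
    "disease": {2, 3, 4, 9},
    "PSA": {3, 4, 7},
}
COUNTS = conjunctive_query_counts(QUERY_INDEX, ["kallikrein", "disease", "PSA"])
```

With three terms this runs seven queries in one call. The number of combinations doubles with each added term, so a real tool would need to prune or rank the variants rather than enumerate them all.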

The UI part

Need support for building strategies

Mixed-initiative system
– Trade off between user-initiated hypothesis exploration and system-initiated suggestions

Information visualization
– Another way to show lots of choices

Candidate Associations

Current Retrieval Results

Suggested Strategies

LINDI: Linking Information for Novel Discovery and Insight

Just starting up now (fall 98)

Initial work: Hao Chen, Ketan Mayer-Patel, Shankar Raman

Summary

The future: analyzing what the text is about
– We don’t know how; text is tough!
– Idea: bring the user into the loop.
– Build up piecewise evidence to support hypotheses
– Make use of partial domain models.

The Truth is Out There!

Summary

Text Data Mining:
– Extracting heretofore undiscovered information from large text collections

Information Access ≠ TDM
– IA: locating already known information that is currently of interest

Finding patterns across text is already done in CL
– Tells us about the behavior of language
– Helps build very useful tools!
