Text Data Mining: Introduction

Page 1: Text Data Mining:  Introduction

Text Data Mining: Introduction

Hao Chen

School of Information Systems

University of California at Berkeley

[email protected]

Page 2: Text Data Mining:  Introduction

The KDD Process for Extracting Useful Knowledge from Volumes of Data

Large databases have become ubiquitous
  grocery store checkout registers
  credit card authorization
Computer technology allows efficient and inexpensive data storage and access
But our ability to analyze and understand large datasets lags far behind.

Page 3: Text Data Mining:  Introduction

Manual Data Analysis Impractical

Slow, expensive, and highly subjective
Becomes impractical as data volumes grow
  N: number of records (10^9)
  D: number of fields (10^2 - 10^3)
Need computer technology to automate the bookkeeping.
First KDD Workshop in 1989

Page 4: Text Data Mining:  Introduction

Definitions of KDD

Knowledge Discovery from Data
The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

Page 5: Text Data Mining:  Introduction

KDD Process: Selection

Learning the application domain
Creating a target dataset

Page 6: Text Data Mining:  Introduction

KDD Process: Preprocessing

Data cleaning & preprocessing
  remove noise
  handle missing data fields
  time sequence information
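
As one possible illustration of the "handle missing data fields" step, the sketch below fills missing numeric values with the mean of the observed values. The record layout, the field names, and the choice of mean imputation are all assumptions made for illustration; the slides do not prescribe a particular method.

```python
from statistics import mean

# Hypothetical records; None marks a missing field.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 46, "income": None},
]

def impute_mean(rows, field):
    """Replace missing values of `field` with the mean of the observed values."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = mean(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

# Impute each field in turn; the input list is left unmodified.
cleaned = impute_mean(impute_mean(records, "age"), "income")
```

Mean imputation is only one option; dropping incomplete records or using a model-based fill may suit other datasets better.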

Page 7: Text Data Mining:  Introduction

KDD Process: Transformation

Data reduction & projection
  feature extraction
  dimensionality reduction
  invariant representation

Page 8: Text Data Mining:  Introduction

KDD Process: Data Mining

Choosing the function of data mining
Choosing data mining algorithms
Data mining: searching for patterns of interest

Page 9: Text Data Mining:  Introduction

KDD Process: Interpretation / Evaluation

Interpretation
Using discovered knowledge

Page 10: Text Data Mining:  Introduction

What is Data Mining?

Fitting models to or determining patterns from very large datasets.
A “regime” which enables people to interact effectively with massive data stores.
Deriving new information from data.
  finding patterns across large datasets
  discovering heretofore unknown information

Page 11: Text Data Mining:  Introduction

What is Data Mining?

Potential point of confusion:
  The "extracting ore from rock" metaphor does not really apply to the practice of data mining
  If it did, then standard database queries would fit under the rubric of data mining
    Find all employee records in which the employee earns $300/month less than their manager
In practice, DM refers to:
  finding patterns across large datasets
  discovering heretofore unknown information
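
To make the contrast concrete, here is a hedged sketch of the employee/manager example as an ordinary SQL query, run through Python's built-in sqlite3 module. The table name, the column names, the sample rows, and the reading of the condition as "at least $300 less" are assumptions; the slide gives only the English description.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT, salary INTEGER, manager_id INTEGER)"
)
# Hypothetical monthly salaries; Ana manages Ben and Cy.
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [(1, "Ana", 5000, None), (2, "Ben", 4700, 1), (3, "Cy", 4900, 1)],
)

# Self-join: pair each employee (e) with their manager (m).
rows = conn.execute(
    """
    SELECT e.name
    FROM employee AS e
    JOIN employee AS m ON e.manager_id = m.id
    WHERE m.salary - e.salary >= 300
    ORDER BY e.name
    """
).fetchall()
conn.close()
```

The query only retrieves rows that are already stored; it does not infer anything new, which is why it falls outside data mining as the slide uses the term.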

Page 12: Text Data Mining:  Introduction

Another Definition of DM

What SQL currently cannot do.
  A standard query does not infer new information
    It retrieves a subset of what is already present and known.
  SQL was originally intended for business apps
  DM requires sophisticated aggregate queries

Page 13: Text Data Mining:  Introduction

DM Touchstone Applications

Finding patterns across data sets:
  Reports on changes in retail sales
    to improve sales
  Patterns of sizes of TV audiences
    for marketing
  Patterns in NBA play
    to alter, and so improve, performance
  Deviations in standard phone calling behavior
    to detect fraud
    for marketing

Page 14: Text Data Mining:  Introduction

DM Touchstone Applications

Separating signal from noise:
  Classifying faint astronomical objects
  Finding genes within DNA sequences
  Discovering novel tectonic activity

Page 15: Text Data Mining:  Introduction

Components of Data Mining

The model
  function of the model
    classification
    clustering
  representational form of the model
    linear function of multiple variables
    Gaussian probability density function
The preference criterion
  goodness of fit
  avoiding overfitting
The search algorithm
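
The three components can be seen side by side in a very small sketch. It assumes a one-parameter linear model, sum of squared errors as the preference criterion, and an exhaustive grid as the search algorithm; the data points and the grid are hypothetical, and the sketch does nothing to guard against overfitting.

```python
# Model: y = w * x (a linear function of one variable).
def predict(w, x):
    return w * x

# Preference criterion: sum of squared errors (goodness of fit).
def sse(w, data):
    return sum((y - predict(w, x)) ** 2 for x, y in data)

# Search algorithm: exhaustive search over a grid of candidate parameters.
def grid_search(data, candidates):
    return min(candidates, key=lambda w: sse(w, data))

data = [(1, 2.1), (2, 3.9), (3, 6.2)]     # hypothetical (x, y) pairs
candidates = [i / 10 for i in range(41)]  # w = 0.0, 0.1, ..., 4.0
best_w = grid_search(data, candidates)
```

Swapping any one component (a different model form, a penalized criterion, a gradient-based search) leaves the other two unchanged, which is the point of treating them separately.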

Page 16: Text Data Mining:  Introduction

Model Function

Classification
Regression
Clustering
Summarization
Dependency modeling
Link analysis
Sequence analysis

Page 17: Text Data Mining:  Introduction

Model Representation

Decision tree
Linear model
Nonlinear model (e.g. Neural Network)
Example-based method (e.g. Nearest Neighbor)
Probabilistic graphical dependency model (e.g. Bayesian Network)
Relational attribute model

Page 18: Text Data Mining:  Introduction

Search Algorithm

Parameter search, given a model
Model search over model space
  predictive
  descriptive

Page 19: Text Data Mining:  Introduction

What’s New Here?

Sounds like statistical modeling or machine learning.
Main difference: scale and availability
  Datasets too large for classical analysis
  Increased opportunity for access
    end user is often not a statistician
New issues in sampling

Page 20: Text Data Mining:  Introduction

Statistician’s Viewpoint

What’s new about DM?
  Returns statisticians to their empirical roots
    exploration rather than modeling
  Hypothesis testing may be irrelevant
    given the large data sizes, everything is significant
  Data was collected for some other purpose than what it is being analyzed for now
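
The remark that "everything is significant" at large data sizes can be checked with a short calculation. The sketch assumes a one-sample z-test with known standard deviation and a hypothetical effect of 1% of a standard deviation; neither the test nor the numbers come from the slides.

```python
import math

def two_sided_p(effect, sd, n):
    """Two-sided p-value of a one-sample z-test for a mean shift of `effect`."""
    z = effect / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

tiny_effect = 0.01  # hypothetical shift: 1% of one standard deviation

p_small = two_sided_p(tiny_effect, 1.0, 100)            # n = 10^2
p_large = two_sided_p(tiny_effect, 1.0, 1_000_000_000)  # n = 10^9
```

With 100 records the shift is indistinguishable from noise, while with 10^9 records the same practically negligible shift yields a vanishingly small p-value, so statistical significance alone says little about whether a pattern matters.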

Page 21: Text Data Mining:  Introduction

The Statistician’s Viewpoint (David Hand 97)

Statistics vs. Machine Learning
  Statistics: conservative, rigorous, abstract, idealized
  Machine Learning: adventurous, engineering, practical, real solutions

Page 22: Text Data Mining:  Introduction

Research Challenges

Massive datasets & high dimensionality
User interaction & prior knowledge
Overfitting & assessing statistical significance
Missing data
Understandability of patterns
Managing changing data and knowledge
Integration
Nonstandard, multimedia, object-oriented data

Page 23: Text Data Mining:  Introduction

A Database Perspective on Knowledge Discovery

Concept of data mining as a querying process
First steps toward efficient development of knowledge discovery applications

Page 24: Text Data Mining:  Introduction

New Research Frontier

Short term: Efficient algorithms implementing machine learning tools on top of large databases
Long term: Building optimizing compilers for ad hoc queries and embedding queries in application programming interfaces

Page 25: Text Data Mining:  Introduction

KDDMS

KDD objects
  a rule
  a classifier
  a clustering
KDD queries
  a predicate returning a set of KDD or DB objects

Page 26: Text Data Mining:  Introduction

Examples of KDD Query

Generate a classifier
Generate the strongest rule
Generate all rules with consequent attribute values computed by SQL query
Find tuples that belong to the largest cluster
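
A query such as "generate the strongest rule" could be read in several ways. The sketch below assumes single-item association rules A -> B over market-basket data and takes "strongest" to mean highest confidence among rules with a minimum support count. The transactions, the function name, and that reading are illustrative assumptions, not a KDD query language.

```python
from itertools import permutations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def strongest_rule(baskets, min_support=2):
    """Return (antecedent, consequent, confidence) of the best rule A -> B."""
    items = sorted(set().union(*baskets))
    best = None
    for a, b in permutations(items, 2):
        n_a = sum(1 for t in baskets if a in t)
        n_ab = sum(1 for t in baskets if a in t and b in t)
        if n_ab < min_support:
            continue  # too few baskets contain both items
        confidence = n_ab / n_a
        if best is None or confidence > best[2]:
            best = (a, b, confidence)
    return best

rule = strongest_rule(transactions)
```

Here the result is itself a KDD object (a rule) rather than a set of stored tuples, which is the distinction the KDDMS slides draw.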

Page 27: Text Data Mining:  Introduction

Future Directions

KDD applications need development support
  query KDD objects
  data mining operations
    nearest neighbors
    clustering
Development of querying tools is a big challenge
Provide developers with the means to build applications using a KDD query language

Page 28: Text Data Mining:  Introduction

Text Data Mining

People’s first thought:
  Make it easier to find things on the Web.
  But this is information retrieval!
The metaphor of extracting ore from rock:
  Does make sense for extracting documents of interest from a huge pile.
  But does not reflect notions of DM in practice:
    finding patterns across large collections
    discovering heretofore unknown information

Page 29: Text Data Mining:  Introduction

Real Text DM

What would finding a pattern across a large text collection really look like?

Page 30: Text Data Mining:  Introduction

From: “The Internet Diary of the man who cracked the Bible Code” Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil (William Gates, agitator, leader)

Bill Gates + MS-DOS in the Bible!

Page 31: Text Data Mining:  Introduction

From: “The Internet Diary of the man who cracked the Bible Code” Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil

Page 32: Text Data Mining:  Introduction

Real Text DM

The point:
  Discovering heretofore unknown information is not what we usually do with text.
  (If it weren’t known, it could not have been written by someone!)
However:
  There is a field whose goal is to learn about patterns in text for its own sake ...

Page 33: Text Data Mining:  Introduction

Observation

Research that exploits patterns in text does so mainly in the service of computational linguistics, rather than for learning about and exploring text collections.

Page 34: Text Data Mining:  Introduction

TDM using Metadata (instead of Text)

Data:
  Reuters newswire (22,000 articles, late 1980s)
  Categories: commodities, time, countries, people, and topic
Goals:
  distributions of categories across time (trends)
  distributions of categories between collections
  category co-occurrence (e.g., topic|country)
Interactive Interface:
  lists, pie charts, 2D line plots
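
The category co-occurrence goal (e.g., topic|country) amounts to conditional frequencies over article metadata. The sketch below uses a handful of made-up articles with hypothetical topic and country labels; it is not the actual newswire data or the system described on the slide.

```python
from collections import Counter

# Hypothetical article metadata: one topic and one country label each.
articles = [
    {"topic": "wheat", "country": "usa"},
    {"topic": "wheat", "country": "canada"},
    {"topic": "crude", "country": "usa"},
    {"topic": "wheat", "country": "usa"},
]

pair_counts = Counter((a["topic"], a["country"]) for a in articles)
country_counts = Counter(a["country"] for a in articles)

def topic_given_country(topic, country):
    """Estimate P(topic | country) from raw co-occurrence counts."""
    return pair_counts[(topic, country)] / country_counts[country]

wheat_share_usa = topic_given_country("wheat", "usa")
```

Adding a date field to each record and grouping by period would give the trend distributions listed under Goals in the same way.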

Page 35: Text Data Mining:  Introduction

Combining Text with Metadata (images, hyperlinks)

Examples
  Text + Links to find “authority pages” (Kleinberg at Cornell, Page at Stanford)
  Usage + Time + Links to study evolution of web and information use (Pitkow et al. at PARC)
  Images + Text to improve image search
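
For the "authority pages" example, the link-analysis half can be sketched as a hubs-and-authorities iteration in the style of Kleinberg's HITS. This is a simplified version under stated assumptions: the link graph is hypothetical, the text-query step that selects the pages is left out, and the iteration count is arbitrary.

```python
def hits(links, iterations=20):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, []))
                for p in pages}
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize to unit length so the scores stay bounded.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

web = {"a": ["c"], "b": ["c"], "c": ["d"]}  # hypothetical link graph
hub, auth = hits(web)
top_authority = max(auth, key=auth.get)
```

Page "c" ends up as the top authority because two pages point to it, while "a" and "b" score as hubs.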

Page 36: Text Data Mining:  Introduction

True Text Data Mining: Don Swanson’s Medical Work

Given
  medical titles and abstracts
  a problem (incurable rare disease)
  some medical expertise
find causal links among titles
  symptoms
  drugs
  results

Page 37: Text Data Mining:  Introduction

Swanson Example (1991)

Problem: Migraine headaches (M)
  stress associated with M
  stress leads to loss of magnesium
  calcium channel blockers prevent some M
  magnesium is a natural calcium channel blocker
  spreading cortical depression (SCD) implicated in M
  high levels of magnesium inhibit SCD
  M patients have high platelet aggregability
  magnesium can suppress platelet aggregability
All extracted from medical journal titles
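
The linking step in this example can be sketched as a search for intermediate terms that connect two concepts with no direct link between them. The triples below paraphrase the statements on the slide; representing them as (subject, relation, object) tuples and ignoring link direction are assumptions, and the sketch does not reproduce how Swanson extracted the statements from titles.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples paraphrasing the slide.
statements = [
    ("stress", "associated with", "migraine"),
    ("stress", "leads to loss of", "magnesium"),
    ("calcium channel blockers", "prevent some", "migraine"),
    ("magnesium", "is a natural", "calcium channel blockers"),
    ("spreading cortical depression", "implicated in", "migraine"),
    ("magnesium", "inhibits", "spreading cortical depression"),
    ("platelet aggregability", "high in", "migraine"),
    ("magnesium", "suppresses", "platelet aggregability"),
]

def bridging_terms(a, c, facts):
    """Terms B linked to both A and C, ignoring direction and relation type."""
    neighbors = defaultdict(set)
    for subject, _, obj in facts:
        neighbors[subject].add(obj)
        neighbors[obj].add(subject)
    return sorted((neighbors[a] & neighbors[c]) - {a, c})

bridges = bridging_terms("magnesium", "migraine", statements)
```

No statement links magnesium to migraine directly; the four bridging terms are what suggest the hypothesis, which then still needs expert judgment and experimental testing.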

Page 38: Text Data Mining:  Introduction

Swanson’s TDM

Two of his hypotheses have received some experimental verification.
His technique
  Only partially automated
  Required medical expertise
Few people are working on this.

Page 39: Text Data Mining:  Introduction

Conclusions

Currently, what might be construed as Text Data Mining is really Computational Linguistics
  Text is tricky to process, but rich and abundant (now)
  There are many CL tools available
Data Mining directly from text
  tells us about language
  produces meta-information that may be useful for information access

Page 40: Text Data Mining:  Introduction

Conclusions

Information Access != Text Data Mining
  IA = finding needle in haystack
  TDM = finding patterns or new information
However, Information Access may potentially be served by Text Data Mining techniques:
  automated metadata assignment
  collection overviews
The synthesis of ideas from TDM and IA:
  Perhaps a new field of exploratory data analysis over text!

Page 41: Text Data Mining:  Introduction

Promising Research Directions

Text Data Mining Problems:
  Patterns within sets of documents:
    What is the latest in this field?
    How is this field related to that field?
  Chains of evidence embedded in text:
    What drugs have been tested for this symptom?
    What effects did this funding have on that field?
  Human use of information over time
    How does information diffuse across the web?

Page 42: Text Data Mining:  Introduction

Needed from Systems

Support for linking chains of associations
Support for combined structured and unstructured data
Support for combining disparate collections

Page 43: Text Data Mining:  Introduction

Statistical Themes & Lessons for Data Mining

Statistical themes
Statistical lessons
Cooperation between statistical and computational communities

Page 44: Text Data Mining:  Introduction

Overview of Statistical Science

Probability distributions
Estimation, consistency, uncertainty, assumptions, robustness, and model averaging
Hypothesis testing
Model scoring
Markov Chain Monte Carlo
Generalized model classes

Page 45: Text Data Mining:  Introduction

Overview of Statistical Sciences

Rational decision making and planning
Inference to causes
Prediction

Page 46: Text Data Mining:  Introduction

Important Themes of Statistics to Data Mining

Clarity about goals
Use of models that are reliable means to the goal, understandable and plausible to users
Sense of uncertainties of models and predictions

Page 47: Text Data Mining:  Introduction

Lessons

Data can lie
Sometimes it’s not what’s in the data that matters
Perversity of the pervasive P-value
Intervention and prediction