Database Research: Data Mining & Other Areas

Dr. Aparna Varde
Ph.D., Computer Science, WPI, MA

Assistant Professor, Computer Science, VSU, VA

Presentation at Montclair State University, NJ
May 2, 2008

Agenda

Database Systems
– Introduction to Databases and Research Areas

Data Mining
– Research Problem in Graphical Data Mining

Other Areas
– Data Warehousing
– Web Databases

Data in Various Forms

Human Mind (Too much data)

Documents (Processed)

Raw Data (Handwritten)

Flat Files (Unprocessed)

Images (Complex)

Simple Tables (Organized)

Need for Databases

Integration of data

Efficient storage

Fast retrieval

Ease of modification

Security of information

Recovery from failures

Database System Environment

Database

DBMS (Database Management System)

Application Programs/Queries

Users

Database System

Roles in the Database World

Database Administrator

Database Application Programmer

Database User

Database Researcher

Examples of Database Research Areas

Query Processing and Optimization

Privacy and Security

Storage and Indexing

Data Mining

Data Warehousing

Web Databases

Data Mining

Discovering knowledge from data
– Non-trivial process of finding novel and interesting patterns in large datasets to guide future decisions

Types of Data
– Numbers
– Graphs
– Images
– Text

Data Mining Techniques

Association Rule Mining
– Discovering relationships of the type A => B

Clustering
– Grouping objects based on similarity

Classification
– Predicting the class of a target
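As a small illustration of a rule of the type A => B, the sketch below computes support and confidence, two standard measures used to judge association rules. The basket data and function names are invented for this example and are not from the presentation.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule antecedent => consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical market-basket data, invented for illustration only.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]

# Rule under test: bread => butter
rule_support = support(baskets, {"bread", "butter"})          # 2 of 4 baskets
rule_confidence = confidence(baskets, {"bread"}, {"butter"})  # 2 of 3 bread baskets
```

The sketch assumes the antecedent occurs in at least one transaction; otherwise the confidence is undefined.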

Graphical Data Mining Problem

Experimental results in scientific domains plotted as graphs

Users pose queries for predictive analysis:
– Given input conditions, predict most likely graph
– Given desired graph, predict most likely conditions

Need for mining graphical data to discover knowledge

Proposed Approach: AutoDomainMine

AutoDomainMine: Prediction of Graph

AutoDomainMine: Prediction of Conditions

Main Tasks

Task 1: AutoDomainMine Learning Strategy of Integrating Clustering and Classification
[AAAI-06 Poster, ACM SIGART’s ICICIS-05]

Task 2: Learning Domain-Specific Distance Metrics for Graphs
[ACM KDD’s MDM-05, MTAP-06 Journal]

Task 3: Designing Semantics-Preserving Representatives for Clusters
[ACM SIGMOD’s IQIS-06, ACM CIKM-06]

Learning Distance Metrics for Graphs

Various distance metrics
• Absolute position of points
• Statistical observations
• Critical features

Issues
• Not known what metrics apply
• Multiple metrics may be relevant

Need for distance metric learning in graphs

Example of domain-specific problem

Proposed Distance Metric Learning Approach: LearnMet

Given
• Training set with actual clusters of graphs

Additional Input
• Components: distance metrics applicable to graphs

LearnMet Metric
• D = ∑ w_i D_i (weighted sum of the component distances D_i with weights w_i)
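The weighted sum D = ∑ w_i D_i can be sketched as follows. This is only an illustration: the three component distances, the weight values, and the representation of a graph as y-values sampled at common x positions are all assumptions made for the example, not the components or learned weights used in LearnMet.

```python
def position_distance(g1, g2):
    """Mean absolute difference between corresponding points (hypothetical component)."""
    return sum(abs(a - b) for a, b in zip(g1, g2)) / len(g1)


def mean_distance(g1, g2):
    """Absolute difference of the curves' mean values (hypothetical component)."""
    return abs(sum(g1) / len(g1) - sum(g2) / len(g2))


def peak_distance(g1, g2):
    """Absolute difference of the curves' peak values (hypothetical component)."""
    return abs(max(g1) - max(g2))


def combined_distance(g1, g2, components, weights):
    """D = sum over i of w_i * D_i(g1, g2)."""
    return sum(w * d(g1, g2) for w, d in zip(weights, components))


components = [position_distance, mean_distance, peak_distance]
weights = [1.0, 0.5, 2.0]  # illustrative values, not learned weights

# Each graph is assumed to be a list of y-values at the same x positions.
graph_a = [1.0, 2.0, 3.0, 4.0]
graph_b = [1.0, 3.0, 3.0, 6.0]
d = combined_distance(graph_a, graph_b, components, weights)
```

Here the components contribute 0.75, 0.75 and 2.0, so the combined distance is 1.0·0.75 + 0.5·0.75 + 2.0·2.0 = 5.125.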

Evaluate Accuracy

Use pairs of graphs

A pair (g_a, g_b) is
• TP - same predicted, same actual cluster: (g_1, g_2)
• TN - different predicted, different actual clusters: (g_2, g_3)
• FP - same predicted cluster, different actual clusters: (g_3, g_4)
• FN - different predicted, same actual clusters: (g_4, g_5)

Evaluate Accuracy (Contd.)

How do we compute error for whole set of graphs?
• For all pairs

Error Measure
• Failure Rate FR
• FR = (FP+FN) / (TP+TN+FP+FN)

Error Threshold (t)
• Extent of FR allowed
• If (FR < t) then clustering is accurate
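The pair-based evaluation can be sketched as below: every pair of graphs is classified as TP, TN, FP or FN by comparing predicted and actual cluster membership, and FR is computed over all pairs. The cluster labels and the threshold value are hypothetical, chosen so that the pairs (g1, g2), (g2, g3), (g3, g4) and (g4, g5) fall into the four categories in the order listed on the slide.

```python
from itertools import combinations


def pair_counts(predicted, actual):
    """Count TP, TN, FP, FN over all pairs of graphs.

    predicted and actual each map a graph id to a cluster label.
    """
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, b in combinations(sorted(actual), 2):
        same_pred = predicted[a] == predicted[b]
        same_actual = actual[a] == actual[b]
        if same_pred and same_actual:
            counts["TP"] += 1
        elif not same_pred and not same_actual:
            counts["TN"] += 1
        elif same_pred:
            counts["FP"] += 1   # same predicted, different actual
        else:
            counts["FN"] += 1   # different predicted, same actual
    return counts


def failure_rate(counts):
    """FR = (FP + FN) / (TP + TN + FP + FN)."""
    return (counts["FP"] + counts["FN"]) / sum(counts.values())


# Hypothetical cluster labels for five graphs.
actual = {"g1": "A", "g2": "A", "g3": "B", "g4": "C", "g5": "C"}
predicted = {"g1": "X", "g2": "X", "g3": "Y", "g4": "Y", "g5": "Z"}

counts = pair_counts(predicted, actual)
fr = failure_rate(counts)
t = 0.25                 # illustrative error threshold
is_accurate = fr < t
```

With these labels the ten pairs give TP = 1, TN = 7, FP = 1, FN = 1, so FR = 2/10 = 0.2, which is below the illustrative threshold.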

Adjust the Metric

Weight Adjustment Heuristic: for each D_i
• New w_i = w_i - sf_i (DFN_i/DFN + DFP_i/DFP) [KDD’s MDM-05]

Testing of LearnMet
Details: MTAP-06

Effect of pairs per epoch (ppe)
• G = number of graphs, e.g., = 25
• C(G, 2) = total number of pairs, e.g., = 300
• Select subset of C(G, 2) pairs per epoch

Observations
• Highest accuracy with middle range of ppe
• Learning efficiency best with low ppe
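The pair counts above can be checked with a short sketch. The slide does not say how the subset of pairs is chosen for each epoch; uniform random sampling without replacement and the ppe value of 50 are assumptions made here for illustration.

```python
import math
import random
from itertools import combinations

G = 25                                        # number of graphs, as in the example
all_pairs = list(combinations(range(G), 2))   # every unordered pair of graph indices
total_pairs = math.comb(G, 2)                 # C(25, 2)

ppe = 50                                      # illustrative pairs-per-epoch value
rng = random.Random(0)                        # fixed seed so the sketch is repeatable
epoch_pairs = rng.sample(all_pairs, ppe)      # assumed: sampling without replacement
```

For G = 25 this gives 25·24/2 = 300 pairs in total, matching the example.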

Accuracy of Learned Metrics over Test Set

Learning Efficiency over Training Set

User Surveys of the AutoDomainMine System

Formal user surveys in different applications

Evaluation Process
• Compare estimation with real data in test set
• If they match, estimation is accurate

Observations
• Estimation Accuracy around 90 to 95 %

Accuracy: Estimating Graphs

Accuracy: Estimating Conditions

Related Work

Similarity Search [HK-01, WF-00]
• Non-matching conditions could be significant

Mathematical Modeling [M-95, S-60]
• Existing models not applicable under certain situations

Case-based Reasoning [K-93, AP-03]
• Adaptation of cases not feasible with graphs

Learning nearest neighbor in high-dimensional spaces: [HAK-00]
• Focus is dimensionality reduction, do not deal with graphs

Distance metric learning given basic formula: [XNJR-03]
• Deal with position-based distances for points, no graphs involved

Similarity search in multimedia databases [KB-04]
• Use various metrics in different applications, do not learn a single metric

Image Rating: [HH-01]
• User intervention involved in manual rating

Semantic Fish Eye Views: [JP-04]
• Display multiple objects in small space, no representatives

PDA Displays in Levels of Detail: [BGMP-01]
• Do not evaluate different types of representatives

Data Warehousing

Data Warehouse
– Subject-oriented, integrated repository of relevant data from various information sources

[Figure: data warehouse (DW) with a view, a mediator, information sources IS1, IS2, IS3, and relations R11, R12, R21, R22, R23, R31]

Research Problem in Data Warehousing

View Maintenance (VM)
– Keeping warehouse view consistent with respect to change in sources

Incremental VM
– Update warehouse as the source data changes
– Propagate only the updates, not all data

Concurrency Conflicts
– Two or more sources / relations try to send updates at the same time

Problem
– Solve concurrency conflicts in view maintenance in multi-source multi-relation environments
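The idea of incremental view maintenance, propagating only the update rather than all data, can be sketched for a simple join view. The relations, tuple layouts, and function names are hypothetical. The sketch handles a single insert with no concurrent updates, so it does not show the concurrency conflicts or any compensation mechanism.

```python
def join(orders, customers):
    """Full computation: join orders (order_id, cust_id) with customers (cust_id, name)."""
    names = dict(customers)
    return {(oid, cid, names[cid]) for oid, cid in orders if cid in names}


def apply_order_insert(view, customers, new_order):
    """Incremental step: join only the inserted tuple and add the result to the view."""
    return view | join([new_order], customers)


# Hypothetical source relations.
customers = [(1, "Ann"), (2, "Bob")]
orders = [(100, 1), (101, 2)]

view = join(orders, customers)        # initial materialized view

new_order = (102, 1)                  # update reported by a source
orders.append(new_order)
view = apply_order_insert(view, customers, new_order)
```

After the incremental step the view equals what a full re-computation over the updated relations would produce, but only the new tuple was joined.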

Proposed Solution: MEDWRAP (MEDiator WRAPper compensation)

[Figure: data warehouse with view V over R11, R21, R22, R31; mediator (multi-source VM algorithm); one wrapper (single-source VM algorithm) for each of IS1, IS2, IS3; source relations R11, R21, R22, R23, R31, R32]

Advantages of MEDWRAP

Generic for any compensation based algorithms

Allows sources to be semi-autonomous
– Sources do not participate in maintenance beyond processing queries and reporting updates
– No locking needed

Low Storage Cost
– Additional views not stored at wrappers
– Copies of source relations not stored at warehouse

Efficient Processing Time
– No need to re-compute whole view

Details in DEXA-2002 paper

Related Work

RV: Re-computation of View (Traditional)
– Rewrite all tuples, not only affected ones
– Highly inefficient if done for every update

SM: Self Maintenance [Q-96, G-96]
– DW stores copies of source relations for maintenance
– Huge storage at warehouse

Version Control: [K-99, C-00]
– Versions of transactions / tuples stored at wrappers
– Latest version used to answer queries
– Huge storage at source wrappers

Web Databases

Management of Data on the Web

XML, the eXtensible Markup Language
– Widespread standard in storing and publishing data

Domain-specific markup languages designed with XML tag sets

Standardization bodies extend these to include additional semantics

Aspects such as domain knowledge and XML constraints are important

Domain-specific Markup Language

Medium of communication for potential users of the domain

Follows XML syntax

Encompasses the semantics of the domain

Examples
– MML: Medical Markup Language
– ChemML: Chemical Markup Language

Markup Language

Industries

Consumers

Universities Research Organizations

Publishers

Markup Language Development Steps

1. Acquisition of Domain Knowledge
   - Familiarity with related markups

2. Data Modeling
   - E.g., Entity Relationship models

3. Requirements Specification
   - E.g., Interviews with Domain Experts

4. Ontology Creation
   - Analogous to pilot version of software

5. Revision of Ontology
   - Alpha version

6. Schema Definition
   - Beta version

7. Reiteration of Schema until Standardization
   - Release Version

Snapshot of Final Schema with data storage

Desired Features of Markup Languages

Avoidance of Redundancy
– No duplicate information

Non-Ambiguous Presentation of Data
– Issues such as synonymy & polysemy

Easy Interpretability of Data
– E.g. in scientific domains, store experimental input conditions before results

Incorporation of Domain-Specific Requirements
– E.g. conflicts such as: in financial domains, a person can be either insolvent or asset-holder but not both

Extensibility of the Markup
– Users should be able to capture additional semantics

Application of XML Constraints

Sequence Constraint
– To control the order of tags

Choice Constraint
– To use either one tag or the other

Key Constraint
– To identify an attribute as a unique primary key

Occurrence Constraint
– To declare minimum and maximum occurrences
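To make the four constraints concrete, the sketch below checks each of them by hand on a small document. The tag names, attribute names, and occurrence bounds are invented for illustration. These are hand-written checks, not an XML Schema validator.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; tag and attribute names are invented for illustration.
DOC = """
<experiments>
  <experiment id="e1">
    <conditions>quenchant=oil</conditions>
    <result>graph-17</result>
  </experiment>
  <experiment id="e2">
    <conditions>quenchant=water</conditions>
    <failure>probe fault</failure>
  </experiment>
</experiments>
"""


def check_constraints(root):
    """Return a list of violation messages (empty if all four checks pass)."""
    problems = []
    experiments = root.findall("experiment")

    # Occurrence: between 1 and 10 <experiment> elements (illustrative bounds).
    if not 1 <= len(experiments) <= 10:
        problems.append("occurrence: expected 1 to 10 experiments")

    # Key: the id attribute must be present and unique.
    ids = [e.get("id") for e in experiments]
    if None in ids or len(ids) != len(set(ids)):
        problems.append("key: id missing or duplicated")

    for e in experiments:
        tags = [child.tag for child in e]
        # Sequence: <conditions> must come first.
        if not tags or tags[0] != "conditions":
            problems.append(f"sequence: {e.get('id')} must start with conditions")
        # Choice: exactly one of <result> or <failure>.
        if tags.count("result") + tags.count("failure") != 1:
            problems.append(f"choice: {e.get('id')} needs exactly one of result or failure")
    return problems


violations = check_constraints(ET.fromstring(DOC))
```

In XML Schema these constraints are declared rather than coded, using `xs:sequence`, `xs:choice`, `xs:key`, and the `minOccurs` / `maxOccurs` attributes.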

Convenient Access to Information

Data stored using XML based markup languages can be easily accessed using languages such as
– XQuery: XML Query Language
– XSLT: Extensible Stylesheet Language Transformations
– XPath: XML Path Language
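As a minimal sketch of path-based access, the example below runs an XPath-style query with Python's standard `xml.etree.ElementTree` module, which supports only a limited subset of XPath. The markup and its element and attribute names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; element and attribute names are invented for illustration.
DOC = """
<experiments>
  <experiment id="e1" quenchant="oil"><result>graph-17</result></experiment>
  <experiment id="e2" quenchant="water"><result>graph-42</result></experiment>
  <experiment id="e3" quenchant="oil"><result>graph-08</result></experiment>
</experiments>
"""

root = ET.fromstring(DOC)

# XPath-style query: results of all experiments whose quenchant attribute is "oil".
oil_results = [r.text for r in root.findall("./experiment[@quenchant='oil']/result")]
```

Full XPath, XQuery, or XSLT processing would need a dedicated processor rather than this module.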

Details on markup language development
– Chapter on “XML Based Markup Languages for Specific Domains” by Varde et al. in book “XML Based Support Systems”, Springer 2008

Related Work

Semantic Extensions of XML for Advanced Applications [YKB-2001]

Versions and Standards of HTML [B-95]

The Latest MML (Medical Markup Language) Version 2.3 - XML based Standard for Medical Data Exchange / Storage [GATSSTSNY-2003]

XQuery 1.0: An XML Query Language [BFFRS-2003]

Handbook of Modern Finance [SL-2004]

Propagating XML Constraints to Relations [DFHQ-2003]

Conclusions and Ongoing Work

Data Mining
– Graphical Data Mining Area, AutoDomainMine approach
– Ongoing Work
  • Feature Selection in Image Mining (with colleagues in VSU and WPI: NSF Grants involved)
  • Mining Genomic and Proteomic Data (with ISB: Institute of Systems Biology)

Data Warehousing
– View Maintenance Area, MEDWRAP approach
– Ongoing Work
  • Data Warehouse Maintenance in real time environments (with researchers at Microsoft Search Labs)

Web Databases
– Book Chapter on XML Based Markup Languages for Specific Domains
– Ongoing Work
  • Development of Domain-specific markups (with NIST: National Institute of Standards and Technology)