Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

Aparna Varde, Elke Rundensteiner, Carolina Ruiz, David Brown, Mohammed Maniruzzaman and Richard Sisson Jr.
Worcester Polytechnic Institute, Worcester, MA, USA
ACM CIKM 2006, Arlington, VA, USA


Page 1: Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

Aparna Varde, Elke Rundensteiner, Carolina Ruiz, David Brown, Mohammed Maniruzzaman and Richard Sisson Jr.

Worcester Polytechnic Institute

Worcester, MA, USA

ACM CIKM 2006, Arlington, VA, USA

Page 2: Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

Introduction

• Clustering often groups data with mixed attributes
  • Numeric
  • Categorical
  • Ordinal
• Examples: PDAs, Web pages, scientific experiments
• Cluster representatives: depictions of each cluster
• Randomly selected representatives are not enough for
  • Capturing cluster information
  • Providing ease of interpretation
  • Incorporating different user interests
• Hence the need for designing cluster representatives

Page 3: Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

Motivating Example

• Scientific experiments clustered based on results
• Clustering criteria learned based on input conditions
• Representative of conditions used to characterize a cluster
• Problem with randomly selected representative
  • Distinct combinations of conditions could lead to a given cluster

Figure: Decision tree learning the clustering criteria (Heat Treating of Materials)

Page 4: Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

Goals

• Need to design semantics-preserving cluster representatives that
  • Capture relevant information in the cluster
  • Avoid visual clutter and are easy to interpret
  • Take into account various user interests in targeted applications

Page 5: Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

Proposed Approach: DesCond

• Given: clusters of experiments, conditions leading to clusters
• Define notion of distance for conditions incorporating domain semantics
• Build candidate representatives with increasing levels of detail
• Compare candidates using MDL-based encoding capturing user interests
• Return candidate with lowest encoding as best for each cluster

Page 6: Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

Main Tasks in DesCond

• Defining a notion of distance for the input conditions
• Obtaining suitable candidate representatives for each cluster
• Proposing an encoding to compare candidates and find a winner

Page 7: Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

Notion of Distance

• Example: Heat Treating of Materials
  • Quenchant: cooling medium
  • Part: the material being treated
  • Probe: characterizes shape, dimension
  • Oxide: thickness of oxide on surface
  • Agitation: extent of agitation of cooling medium
  • Quenchant Temperature: starting temperature of cooling medium
• Define domain-specific distance metric for conditions incorporating
  • Data types of attributes
  • Distance between attribute values
  • Weights of the attributes

Figure labels: pneumatic cylinder, furnace, oil beaker, pneumatic on/off switch, K-type thermocouple, probe tip, connecting rod, computer with data acquisition card, thermocouple for oil temperature

Page 8: Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

Data Types of the Attributes

• Categorical
  • Characters or strings with descriptive information
  • E.g., Quenchant Name, Part Material, Probe Type
• Numerical
  • Integers or real numbers
  • E.g., Quenchant Temperature
• Ordinal
  • Where order matters
  • E.g., Oxide Layer, Agitation Level

Page 9: Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

Distance Between the Attribute Values

• Categorical
  • Different = 1
  • Same = 0
• Numerical
  • Absolute difference between values or mean values of ranges
• Ordinal
  • Map values to integers
    • E.g., Oxide Layer: none = 0, thin = 1, thick = 2
  • Absolute difference between mapped values
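The per-attribute distances on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are invented for this example, the quenchant names are placeholders, and representing a numerical range as a `(low, high)` tuple is an assumption. Only the Oxide Layer mapping comes from the slide.

```python
# Minimal sketch of the per-attribute distances; all names are illustrative.

# Ordinal mapping taken from the slide's Oxide Layer example.
OXIDE_LAYER = {"none": 0, "thin": 1, "thick": 2}


def categorical_distance(a, b):
    """Return 0 if the two categorical values are the same, 1 otherwise."""
    return 0 if a == b else 1


def _to_number(value):
    """Reduce a numerical value to a single number.

    Assumption: a range is passed as a (low, high) tuple and is
    represented by its mean value.
    """
    if isinstance(value, tuple):
        low, high = value
        return (low + high) / 2
    return float(value)


def numerical_distance(a, b):
    """Absolute difference between two values, or between means of ranges."""
    return abs(_to_number(a) - _to_number(b))


def ordinal_distance(a, b, scale):
    """Absolute difference between the integers that two ordinal values map to."""
    return abs(scale[a] - scale[b])


# Hypothetical values, for illustration only.
print(categorical_distance("oil_A", "oil_B"))
print(numerical_distance((60, 80), 90))  # mean of the range (70) against 90
print(ordinal_distance("none", "thick", OXIDE_LAYER))
```

Keeping one small function per data type makes it easy to plug in a different rule for a single attribute without touching the others.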

Page 10: Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

Weights of the Attributes

• An attribute has a higher weight if it
  • Is at a higher level in the tree
  • Belongs to a shorter path
  • Has more experiments in its corresponding cluster
• Decision Tree Weight Heuristic
  • $W_i = \frac{1}{P} \sum_{j=1}^{P} \frac{H_{i,j}}{H_j} \cdot G_j$
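The slide does not define the symbols in the weight heuristic, so the sketch below rests on one plausible reading that matches the three bullet points: P is the number of paths in the decision tree, H_{i,j} is the height of attribute i on path j (taken as 0 if the attribute does not occur on that path), H_j is the height of path j, and G_j is the number of experiments in the cluster that path j leads to. These meanings, the data layout, and all names and numbers are assumptions made for illustration.

```python
# Sketch of W_i = (1/P) * sum_j (H_ij / H_j) * G_j under assumed symbol meanings.


def attribute_weight(paths, attribute):
    """Compute the heuristic weight of one attribute over all tree paths.

    Each path is a dict with (hypothetical) keys:
      "heights":     attribute name -> height of that attribute on the path
      "path_height": height of the whole path (H_j)
      "experiments": experiments in the cluster the path leads to (G_j)
    """
    total = 0.0
    for path in paths:
        h_ij = path["heights"].get(attribute, 0)  # 0 if attribute is not on path
        total += (h_ij / path["path_height"]) * path["experiments"]
    return total / len(paths)


# Two hypothetical paths from root to leaf.
paths = [
    {"heights": {"quenchant": 2, "agitation": 1},
     "path_height": 2, "experiments": 10},
    {"heights": {"quenchant": 3, "part": 2, "agitation": 1},
     "path_height": 3, "experiments": 5},
]

for name in ("quenchant", "agitation", "part"):
    print(name, attribute_weight(paths, name))
```

In this toy tree the root attribute gets the largest weight, and an attribute that appears only low on a long path with few experiments gets the smallest, which is the ordering the bullet points describe.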

Page 11: Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

Candidate Representatives in Levels of Detail

• Level 1: Single Conditions Representative (SCR)
  • One set of conditions preserving cluster information
• Level 2: Multiple Conditions Representative (MCR)
  • Summary of information in cluster
• Level 3: All Conditions Representative (ACR)
  • All information in cluster abstracted suitably

Page 12: Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

Single Conditions Representative

• Return set of conditions closest to all others in cluster
• Notion of distance: domain-specific distance metric for conditions

Figure: Input conditions in Cluster A; SCR for Cluster A
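Picking the set of conditions closest to all others amounts to choosing a medoid of the cluster. The sketch below shows that idea under stated assumptions: the distance is taken to be a weighted sum of per-attribute distances (the slides say the metric combines data types, value distances and weights, but do not give the combining rule), and the attribute names, weights and data are hypothetical.

```python
# Sketch: SCR as the cluster medoid under an assumed weighted-sum distance.

ORDINAL_SCALES = {"oxide": {"none": 0, "thin": 1, "thick": 2}}
NUMERIC_ATTRIBUTES = {"quenchant_temp"}
# Hypothetical attribute weights; in DesCond these come from the decision tree.
WEIGHTS = {"quenchant": 3.0, "oxide": 1.0, "quenchant_temp": 0.1}


def condition_distance(a, b):
    """Assumed metric: weighted sum of per-attribute distances."""
    total = 0.0
    for attr, weight in WEIGHTS.items():
        if attr in NUMERIC_ATTRIBUTES:
            d = abs(a[attr] - b[attr])
        elif attr in ORDINAL_SCALES:
            scale = ORDINAL_SCALES[attr]
            d = abs(scale[a[attr]] - scale[b[attr]])
        else:  # categorical: same = 0, different = 1
            d = 0 if a[attr] == b[attr] else 1
        total += weight * d
    return total


def single_conditions_representative(cluster):
    """Return the set of conditions with the smallest total distance to the rest."""
    return min(
        cluster,
        key=lambda c: sum(condition_distance(c, other) for other in cluster),
    )


cluster_a = [
    {"quenchant": "oil_A", "oxide": "none", "quenchant_temp": 60},
    {"quenchant": "oil_A", "oxide": "thin", "quenchant_temp": 70},
    {"quenchant": "oil_A", "oxide": "thick", "quenchant_temp": 80},
    {"quenchant": "oil_B", "oxide": "thin", "quenchant_temp": 70},
]

scr = single_conditions_representative(cluster_a)
print(scr)
```

Because the representative is always an actual member of the cluster, it is a combination of conditions that really occurred, unlike an averaged centroid.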

Page 13: Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

Multiple Conditions Representative

• Build sub-clusters of conditions using domain knowledge
• Return nearest sub-cluster representatives
• Sort them

Figure: Cluster A; sub-clusters within Cluster A; MCR for Cluster A
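One way to read the MCR steps is sketched below. The slide does not say how domain knowledge forms the sub-clusters, so this example simply groups by a caller-supplied key (here the quenchant name), treats the "nearest" representative of each sub-cluster as its medoid, and sorts the result. The grouping rule, the toy distance, and all names and data are assumptions for illustration.

```python
# Sketch of an MCR: group, take one medoid per sub-cluster, then sort.
from collections import defaultdict


def medoid(items, distance):
    """Item with the smallest total distance to all items in the group."""
    return min(items, key=lambda c: sum(distance(c, other) for other in items))


def multiple_conditions_representative(cluster, group_key, distance, sort_key):
    """Build sub-clusters with group_key, return their sorted medoids."""
    sub_clusters = defaultdict(list)
    for conditions in cluster:
        sub_clusters[group_key(conditions)].append(conditions)
    representatives = [medoid(group, distance) for group in sub_clusters.values()]
    return sorted(representatives, key=sort_key)


def toy_distance(a, b):
    """Hypothetical distance: quenchant mismatch plus temperature difference."""
    mismatch = 0 if a["quenchant"] == b["quenchant"] else 1
    return mismatch + abs(a["temp"] - b["temp"])


cluster_a = [
    {"quenchant": "oil_B", "temp": 40},
    {"quenchant": "oil_A", "temp": 60},
    {"quenchant": "oil_B", "temp": 50},
    {"quenchant": "oil_A", "temp": 70},
    {"quenchant": "oil_B", "temp": 90},
    {"quenchant": "oil_A", "temp": 80},
]

mcr = multiple_conditions_representative(
    cluster_a,
    group_key=lambda c: c["quenchant"],
    distance=toy_distance,
    sort_key=lambda c: (c["quenchant"], c["temp"]),
)
print(mcr)
```

Passing the grouping rule in as a function keeps the domain knowledge separate from the selection logic, so a different sub-clustering criterion can be swapped in without changing the rest.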

Page 14: Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

All Conditions Representative

• Return all sets of conditions
• Sort them in ascending order

Figure: Cluster A; ACR for Cluster A

Page 15: Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

DesCond Encoding to Compare Candidates

• Analogous to Minimum Description Length (MDL)
  • Theory: representative; Examples: sets of conditions in cluster
• Complexity of representative (ease of interpretation)
  • $\text{Complexity} = \log_2 (AV)$
  • A = number of attributes, V = number of values for each attribute
• Distance of all items from representative (information loss)
  • $\text{Distance} = \log_2 \left( \frac{1}{s} \sum_{i=1}^{s} D(R, S_i) \right)$
  • D: domain-specific distance metric for conditions
  • s: total number of items (sets of conditions) in cluster
  • S_i: each individual item
  • R: representative set of conditions
• DesCond Encoding
  • $\text{Effectiveness} = \text{UBC} \cdot \text{Complexity} + \text{UBD} \cdot \text{Distance}$
  • UBC, UBD: user bias % weights for complexity and distance
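The encoding can be sketched numerically as follows. This is an illustration under assumptions, not the authors' code: "AV" is read as the product A × V, the logarithm in the distance term is applied to the average distance, UBC and UBD are passed as fractions, and the candidate with the lowest value is taken as the winner (as stated on the approach slide). The slide does not say how a zero average distance is handled, so the sketch raises an error in that case. All numbers are toy values.

```python
# Sketch of the DesCond encoding; names and numbers are illustrative.
import math


def complexity(num_attributes, num_values):
    """Complexity = log2(A * V); reading 'AV' as a product is an assumption."""
    return math.log2(num_attributes * num_values)


def distance_term(representative, items, distance):
    """Distance = log2 of the average distance from the representative."""
    average = sum(distance(representative, s) for s in items) / len(items)
    if average <= 0:
        # Not covered by the slide; a real implementation needs a rule here.
        raise ValueError("average distance must be positive to take log2")
    return math.log2(average)


def effectiveness(ubc, ubd, complexity_value, distance_value):
    """Effectiveness = UBC * Complexity + UBD * Distance (lower is better)."""
    return ubc * complexity_value + ubd * distance_value


def best_candidate(encodings):
    """Return the name of the candidate with the lowest encoding."""
    return min(encodings, key=encodings.get)


# Toy one-dimensional example: items are numbers, distance is |r - s|.
items = [2, 4, 0, 6]
representative = 2
c = complexity(num_attributes=4, num_values=1)
d = distance_term(representative, items, lambda r, s: abs(r - s))
score = effectiveness(0.5, 0.5, c, d)
print(c, d, score)

# Hypothetical encodings for three candidates of one cluster.
print(best_candidate({"SCR": 1.5, "MCR": 1.2, "ACR": 2.0}))
```

Raising UBC penalizes detailed representatives more heavily, while raising UBD penalizes information loss, so the same cluster can yield different winners for different users.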

Page 16: Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

Evaluation of DesCond with Domain Expert Interviews

• Evaluated with real data in Heat Treating
  • User bias weights in Encoding reflect interests in targeted applications
  • Different data sets and number of clusters
• For each data set, score calculated as follows
  • Consider winning candidate for each cluster, based on DesCond Encoding
  • Score: number of clusters in which candidate is winner
  • Example: data set of size 25 with 5 clusters
    • If SCR wins for 2 clusters, ACR for 3
    • Score: SCR = 2, ACR = 3
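The scoring rule is a simple tally of winners per cluster. A minimal sketch reproducing the slide's example of 5 clusters is given below; the cluster labels and the function name are made up for illustration.

```python
# Sketch of the scoring rule: count the clusters each candidate type wins.
from collections import Counter


def score_candidates(winner_per_cluster):
    """Map each candidate type to the number of clusters in which it wins."""
    return Counter(winner_per_cluster.values())


# Slide example: 5 clusters, SCR wins 2 and ACR wins 3 (labels are hypothetical).
winners = {"A": "SCR", "B": "ACR", "C": "ACR", "D": "SCR", "E": "ACR"}
scores = score_candidates(winners)
print(scores["SCR"], scores["ACR"], scores["MCR"])
```

A `Counter` returns 0 for a candidate type that never wins, so MCR can be looked up without special handling.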

Page 17: Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

Evaluation Results

• Details
  • Data set size = 400, number of clusters = 20
  • Experts provide UBC / UBD values in Encoding
• Observations
  • Overall winner is MCR
  • As weight for complexity increases, SCR wins
  • Designed representatives are better than random ones

Page 18: Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

Evaluation with Formal User Surveys

• DesCond used to design representatives for a trademarked estimation tool [ref CHTE: Center for Heat Treating Excellence]
• Formal user surveys conducted in different applications of the system
• Evaluation Process
  • Compare estimation with real data in test set
  • If they match, estimation is accurate

Page 19: Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

Evaluation Results

• Different winners in different applications
• Results of surveys tally with those of Encoding-based evaluation
• Estimation accuracy: 90 to 94% (better than earlier versions of tool)

Figure panels: Parameter Selection Applications, Simulation Tool Applications, Decision Support Applications, Intelligent Tutoring Applications

Page 20: Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

Related Work

• Image Rating: [HH-01]
  • User intervention involved in manual rating
• Semantic Fish Eye Views: [JP-04]
  • Display multiple objects in small space, no representatives
• PDA Displays in Levels of Detail: [BGMP-01]
  • Do not evaluate different types of representatives

Page 21: Designing Semantics-Preserving Cluster Representatives for Scientific Input Conditions

Conclusions

• Contributions of this work
  • Designing cluster representatives for scientific input conditions in levels of detail
  • Defining a domain-specific distance metric for conditions
  • Proposing an encoding to compare representatives
  • Conducting evaluation using encoding with real data from Heat Treating
  • Assessing use of representatives in applications of a CHTE trademarked estimation tool
• Results
  • Designed representatives better than random
  • Different designed representatives suit different applications
  • DesCond enhances accuracy of estimation tool