Page 1: Graph-based Learning and Discovery

Diane J. Cook
University of Texas at Arlington
cook@cse.uta.edu
http://www-cse.uta.edu/~cook

Page 2: Data Mining

“The nontrivial extraction of implicit, previously unknown, and potentially useful information from data” [Frawley et al., 92]

Increasing ability to generate data
Increasing ability to store data

Page 3: KDD Process

Page 4: Approaches to Data Mining

Pattern extraction
Prediction / classification
Clustering

[Figure: Loan / NoLoan examples plotted by Debt and Income; decision tree that tests Debt<50 (yes / no) and then Income (<50, 50-100, >100), with YES / NO leaves]

Page 5: Substructure Discovery

Most data mining algorithms deal with linear attribute-value data
Need to represent and learn relationships between attributes

Page 6: Subdue

Discovers repetitive substructure patterns in graph databases
Pattern extraction, classification, clustering
Serial and parallel / distributed versions
Applied to CAD circuits, telecom, DNA, and more
http://cygnus.uta.edu/subdue

Page 7: Graph Representation

Input is a labeled graph
A substructure is a connected subgraph
An instance of a substructure is a subgraph that is isomorphic to the substructure definition

[Figure: Input Database (objects R1, C1, T1-T4, S1-S4); Substructure S1 in graph form (vertices object, object, triangle, square; edges on, shape, shape); Compressed Database (R1, C1, and four S1 vertices)]
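
The definitions above can be made concrete with a minimal sketch. It assumes a simplified encoding (one vertex per shape, labeled directly with the shape name) rather than the object / shape encoding in the figure, and the name `find_instances` is hypothetical, not part of Subdue. It enumerates candidate mappings by brute force and only checks that every substructure edge is present, so it is suitable for tiny graphs only.

```python
from itertools import permutations

def find_instances(g_vertices, g_edges, s_vertices, s_edges):
    """Return every mapping {substructure vertex -> graph vertex} under which
    labels agree and each substructure edge is present in the graph."""
    edge_set = set(g_edges)
    s_ids = list(s_vertices)
    instances = []
    for combo in permutations(g_vertices, len(s_ids)):
        mapping = dict(zip(s_ids, combo))
        labels_ok = all(s_vertices[s] == g_vertices[mapping[s]] for s in s_ids)
        edges_ok = all((mapping[u], mapping[v], lab) in edge_set
                       for u, v, lab in s_edges)
        if labels_ok and edges_ok:
            instances.append(mapping)
    return instances

# Hypothetical input database: four triangles, each on a square, plus R1 and C1.
graph_vertices = {"R1": "rectangle", "C1": "circle"}
graph_edges = []
for i in range(1, 5):
    graph_vertices[f"T{i}"] = "triangle"
    graph_vertices[f"S{i}"] = "square"
    graph_edges.append((f"T{i}", f"S{i}", "on"))

# Substructure: a triangle on a square.
sub_vertices = {"a": "triangle", "b": "square"}
sub_edges = [("a", "b", "on")]

instances = find_instances(graph_vertices, graph_edges, sub_vertices, sub_edges)
```

Each returned mapping is one instance; replacing every instance with a single S1 vertex gives the compressed database.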

Page 8: MDL Principle

Best theory minimizes description length of data
Evaluate substructure based on its ability to compress the description length (DL) of the graph
Description length = DL(S) + DL(G|S)
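
A rough sketch of the evaluation, under a loud assumption: the real MDL measure counts the bits needed to encode vertices, labels, and edges, whereas this sketch uses size (vertices + edges) as a proxy for description length. All function names are hypothetical, and instances are assumed not to overlap.

```python
def graph_size(n_vertices, n_edges):
    # Size proxy for description length (assumption: not a bit count).
    return n_vertices + n_edges

def compressed_size(g_size, s_size, n_instances):
    # Proxy for DL(G|S): each non-overlapping instance collapses to one vertex.
    return g_size - n_instances * s_size + n_instances

def compression_value(g_size, s_size, n_instances):
    # Ratio > 1 means S plus G|S is cheaper to describe than G alone.
    return g_size / (s_size + compressed_size(g_size, s_size, n_instances))

g = graph_size(10, 4)   # hypothetical graph: 10 vertices, 4 edges
s = graph_size(2, 1)    # substructure: 2 vertices, 1 edge
value = compression_value(g, s, n_instances=4)
```

With these hypothetical numbers the graph shrinks from size 14 to size 6 after compression, and the substructure itself costs 3, so the value is 14 / 9.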

Page 9: Algorithm

1. Create substructure for each unique vertex label

[Figure: example graph with vertex labels circle, rectangle, triangle, square and edge labels on, left]

Substructures: triangle (4), square (4), circle (1), rectangle (1)

Page 10: Algorithm

2. Expand best substructure by an edge or edge + neighboring vertex

[Figure: same example graph; expanded substructures pair triangle and square (on), circle and square (left), rectangle and square (on), rectangle and triangle (on)]

Page 11: Algorithm

3. Keep only best substructures on queue (specified by beam width)
4. Terminate when queue is empty or #discovered substructures >= limit
5. Compress graph and repeat to generate hierarchical description

Note: polynomially constrained [IEEE Exp96]
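
Steps 1 to 4 can be sketched as a small beam search. This is an illustrative approximation, not the Subdue implementation: patterns are identified by their sorted labeled-edge triples (not a true isomorphism test), the value is instance count times pattern size rather than MDL, and step 5 (compress and repeat) is omitted. The function name, parameters, and example graph are hypothetical.

```python
from collections import defaultdict

def discover(vertices, edges, beam_width=4, limit=10):
    """vertices: {id: label}; edges: list of (u, v, label)."""

    def key(inst):
        # Approximate pattern identity (assumption): sorted edge triples.
        vs, es = inst
        if not es:
            return (("v", vertices[next(iter(vs))]),)
        return tuple(sorted((vertices[edges[i][0]], edges[i][2],
                             vertices[edges[i][1]]) for i in es))

    def value(instances):
        # Crude stand-in for MDL: instance count times pattern size.
        vs, es = next(iter(instances))
        return len(instances) * (len(vs) + len(es))

    def best(groups):
        ranked = sorted(groups.items(), key=lambda kv: value(kv[1]), reverse=True)
        return ranked[:beam_width]

    # Step 1: one substructure per unique vertex label.
    groups = defaultdict(set)
    for v in vertices:
        inst = (frozenset([v]), frozenset())
        groups[key(inst)].add(inst)
    parents = best(groups)

    discovered = []
    # Step 4: stop when the queue is empty or the limit is reached.
    while parents and len(discovered) < limit:
        children = defaultdict(set)
        for pattern, instances in parents:
            if len(discovered) >= limit:
                break
            discovered.append((pattern, instances))
            # Step 2: extend every instance by one adjacent edge.
            for vs, es in instances:
                for i, (u, v, _) in enumerate(edges):
                    if i not in es and (u in vs or v in vs):
                        child = (vs | {u, v}, es | {i})
                        children[key(child)].add(child)
        # Step 3: keep only the best children (beam width).
        parents = best(children)
    return sorted(discovered, key=lambda kv: value(kv[1]), reverse=True)

# Hypothetical example graph.
vertices = {1: "triangle", 2: "square", 3: "triangle", 4: "square",
            5: "triangle", 6: "square", 7: "circle", 8: "rectangle"}
edges = [(1, 2, "on"), (3, 4, "on"), (5, 6, "on"), (6, 8, "on"), (7, 8, "left")]
result = discover(vertices, edges)
top_pattern, top_instances = result[0]
```

On this graph the top-ranked pattern is triangle-on-square with three instances; a smaller `beam_width` prunes more candidates per round at the risk of missing good substructures.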

Page 12: Examples [Jair94]

Page 13: Inexact Graph Match [JIIS95]

Some variations may occur between instances
Want to abstract over minor differences
Difference = cost of transforming one graph to make it isomorphic to another
Match if cost/size < threshold
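
A minimal sketch of the match test, assuming unit costs for every transformation (vertex or edge insertion, deletion, and label substitution) and taking size to be the vertex-plus-edge count of the larger graph; both are assumptions, as are the function names and the example graphs. The search here is exhaustive over vertex mappings, which is only feasible for very small graphs.

```python
from itertools import permutations

def match_cost(v1, e1, v2, e2):
    """v*: {id: label}; e*: {(src, dst): label}. Returns (cost, mapping)."""
    n = max(len(v1), len(v2))
    ids1 = list(v1) + [None] * (n - len(v1))   # None = no counterpart
    ids2 = list(v2) + [None] * (n - len(v2))
    best_cost, best_map = None, None
    for perm in permutations(ids2):
        pairs = list(zip(ids1, perm))
        cost = 0
        for a, b in pairs:
            if a is None or b is None:
                cost += 1                       # vertex inserted or deleted
            elif v1[a] != v2[b]:
                cost += 1                       # vertex label substituted
        mapping = {a: b for a, b in pairs if a is not None}
        matched = set()
        for (u, v), lab in e1.items():
            target = (mapping[u], mapping[v])
            if target in e2:
                matched.add(target)
                if e2[target] != lab:
                    cost += 1                   # edge label substituted
            else:
                cost += 1                       # edge deleted
        cost += len(e2) - len(matched)          # edges inserted
        if best_cost is None or cost < best_cost:
            best_cost, best_map = cost, mapping
    return best_cost, best_map

def is_match(cost, size, threshold):
    return cost / size < threshold

# Hypothetical graphs: g1 has vertices 1 (A), 2 (B); g2 has 3 (B), 4 (A), 5 (B).
v1, e1 = {1: "A", 2: "B"}, {(1, 2): "a"}
v2, e2 = {3: "B", 4: "A", 5: "B"}, {(4, 3): "a", (4, 5): "b"}
cost, mapping = match_cost(v1, e1, v2, e2)
size = len(v2) + len(e2)
```

Here the cheapest mapping pairs vertex 1 with 4 and vertex 2 with 3 at cost 2 (one extra vertex and one extra edge), so the graphs match at threshold 0.5 but not at 0.2.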

Page 14: Inexact Graph Match

[Figure: graph 1 with vertices 1 (A), 2 (B); graph 2 with vertices 3 (B), 4 (A), 5 (B); edges labeled a, b]

Search tree of vertex mappings (cost shown after each mapping):
(1,3) 1: (2,4) 7, (2,5) 6, (2,) 10
(1,4) 0: (2,3) 3, (2,5) 6, (2,) 9
(1,5) 1: (2,3) 7, (2,4) 7, (2,) 10
(1,) 1: (2,3) 9, (2,4) 10, (2,5) 9, (2,) 11

Least-cost match is {(1,4), (2,3)}

Page 15: Background Knowledge [IEEE TKDE96]

Some substructures not relevant
Background knowledge can bias search
Two types
  Model knowledge
  Graph match rules

Page 17: Parallel/distributed Subdue [JPDC00]

Scalability issues
Three approaches
  Dynamic partitioning
  Functional parallel
  Static partitioning

Page 18: Static Partitioning

Divide graph into P partitions, distribute to P processors
Each processor performs serial Subdue on local partition
Broadcast best substructures, evaluate on other processors
Master processor stores best global substructures
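
The four steps above can be sketched in a single process. Everything here is a simplification: the P processors are simulated sequentially, the partition assignment `part_of` is supplied by the caller, and counting labeled-edge patterns stands in for a full serial Subdue run. All names and the example graph are hypothetical.

```python
from collections import Counter

def local_patterns(vertices, edges):
    # Stand-in for serial Subdue: count labeled-edge patterns.
    return Counter((vertices[u], lab, vertices[v]) for u, v, lab in edges)

def static_partitioning(vertices, edges, part_of, p):
    # 1. Divide the graph into p partitions; crossing edges are dropped.
    parts = [[] for _ in range(p)]
    for u, v, lab in edges:
        if part_of[u] == part_of[v]:
            parts[part_of[u]].append((u, v, lab))
    # 2. Each "processor" runs discovery on its own partition.
    counts = [local_patterns(vertices, part) for part in parts]
    # 3. Broadcast each local best and evaluate it on every partition.
    candidates = {c.most_common(1)[0][0] for c in counts if c}
    scores = {cand: sum(c[cand] for c in counts) for cand in candidates}
    # 4. The master keeps the globally best substructure.
    return max(scores.items(), key=lambda kv: kv[1], default=None)

# Hypothetical graph split into two partitions.
vertices = {1: "triangle", 2: "square", 3: "triangle", 4: "square",
            5: "triangle", 6: "square", 7: "circle", 8: "rectangle"}
edges = [(1, 2, "on"), (3, 4, "on"), (2, 3, "left"), (5, 6, "on"), (7, 8, "left")]
part_of = {1: 0, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1, 7: 1, 8: 1}
best = static_partitioning(vertices, edges, part_of, p=2)
```

Dropping the edges that cross partitions is what loses information when the graph is split.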

Page 19: Static Partitioning Results

Close to linear speedup
Continue until #processors > #vertices

Page 20: Issues

When partitioning graph, lose information
Metis graph partitioning system
Quality of resulting substructures?
Recapture by overlap, multiple partitions
Evaluating more substructures globally

Page 21: Compression Results

Page 22: AutoClass

Linear representation
Fit possible probabilistic models to data
Satellite data, DNA data, Landsat data

Page 23: SUBDUE/AutoClass Combined

[Figure: pipeline with boxes Data, Subdue, AutoClass, Classes and arrows labeled structural features, structural patterns, linear features; “+” = combination of linear data or addition of linear features]

Page 24: Example - 30 2-color squares

AutoClass
  Rep - tuple for each line (x1, y1, x2, y2, angle, length, color)
  Add structure (neighboring edge information)
Subdue
  Rep - each line is node in graph, edges between connecting lines
  Attributes from nodes
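
The two representations can be sketched side by side. This assumes angles in degrees, lines given as (x1, y1, x2, y2, color), and an edge label `connects` for lines that share an endpoint; the label and both function names are hypothetical.

```python
import math

def line_tuple(x1, y1, x2, y2, color):
    # AutoClass-style linear representation of one line.
    angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
    length = math.hypot(x2 - x1, y2 - y1)
    return (x1, y1, x2, y2, angle, length, color)

def lines_to_graph(lines):
    # Subdue-style representation: one node per line, carrying its
    # attributes, and an edge between lines that share an endpoint.
    vertices = {i: line_tuple(*ln) for i, ln in enumerate(lines)}
    edges = []
    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            ends_i = {lines[i][0:2], lines[i][2:4]}
            ends_j = {lines[j][0:2], lines[j][2:4]}
            if ends_i & ends_j:
                edges.append((i, j, "connects"))
    return vertices, edges

# One hypothetical 2-color unit square.
square = [(0, 0, 1, 0, "red"), (1, 0, 1, 1, "red"),
          (1, 1, 0, 1, "green"), (0, 1, 0, 0, "green")]
tuples = [line_tuple(*ln) for ln in square]
vertices, edges = lines_to_graph(square)
```

The tuple list has no notion of which lines touch; the graph adds exactly that neighboring-edge information.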

Page 25: Results

AutoClass (12 classes)
  Class 0 (20): Color=green, LineNo=Line1=Line2=98 +/- 10
  Class 1 (20): Color=red, LineNo=Line1=Line2=99 +/- 10
  …
  Class 11 (3): Line2=1 +/- 13, Color=green
Subdue (top substructure)

Page 26: Combined Results

Combine 4 entries for each square into one
30 tuples (one for each square)
Discover
  Class 0 (10): Color1=red, Color2=red, Color3=green, Color4=green
  Class 1 (10): Color1=green, Color2=green, Color3=blue, Color4=blue
  Class 2 (10): Color1=blue, Color2=blue, Color3=red, Color4=red

Page 27: More Results

Page 28: Supervised SUBDUE [IEEE IS00]

One graph stores positive examples
One graph stores negative examples
Find substructure that compresses positive graph but not negative graph
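
One plausible way to score a candidate, sketched under assumptions: size (vertices + edges) stands in for description length, instances do not overlap, and the cost formula below is an illustration rather than the measure defined in [IEEE IS00]. Lower cost is better; compressing the negative graph is penalized.

```python
def compressed_size(graph_size, sub_size, n_instances):
    # Each instance collapses to a single vertex (size proxy, an assumption).
    return graph_size - n_instances * sub_size + n_instances

def supervised_cost(pos_size, neg_size, sub_size, pos_instances, neg_instances):
    pos_after = compressed_size(pos_size, sub_size, pos_instances)
    neg_after = compressed_size(neg_size, sub_size, neg_instances)
    # Substructure cost + what remains of the positive graph
    # + how much of the negative graph was (undesirably) compressed.
    return sub_size + pos_after + (neg_size - neg_after)

# Hypothetical candidates of size 3 on graphs of size 20.
only_positive = supervised_cost(20, 20, 3, pos_instances=4, neg_instances=0)
both_graphs = supervised_cost(20, 20, 3, pos_instances=4, neg_instances=4)
```

The candidate that occurs only in the positive graph scores 15, while the one that also occurs in the negative graph scores 23, so the former is preferred.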

Page 29: Example

[Figure: example graph with three object vertices linked by on edges, and shape edges to triangle and square]

Page 30: Results

Chess endgames (19,257 examples), BK is (+) or is not (-) in check
99.8% FOIL, 99.77% C4.5, 99.21% Subdue

Page 31: More Results

Tic Tac Toe endgames
  + is win for X (958 examples)
  100% Subdue, 92.35% FOIL, 96.03% C4.5
Bach chorales
  Musical sequences (20 sequences)
  100% Subdue, 85.71% FOIL, 82.00% C4.5

Page 32: Clustering Using SUBDUE

Iterate Subdue until single vertex
Each cluster (substructure) inserted into a classification lattice
Early results similar to COBWEB [Fisher87]

[Figure: classification lattice with top node Root]

Page 33: Discovery Application Domains

Biochemical domains
  Protein data [PSB99, IDA99]
  Human Genome DNA data
  Toxicology (cancer) data
Spatial-temporal domains
  Earthquake data
  Aircraft Safety and Reporting System
Telecommunications data
Program source code

Page 34: Structured Web Search [AAAI-AIWS00]

Existing search engines use linear feature match
Subdue searches based on structure
Incorporation of WordNet allows for inexact feature match through synset path length
Technique
  Breadth-first search through domain to generate graph
  Nodes represent pages / documents
  Edges represent hyperlinks
  Additional nodes used to represent document keywords
  Pose query as graph
  Search for query match within domain graph
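
The graph-building technique can be sketched without any network access. The `fetch` callback, the function name, and the two-page site are hypothetical; the labels `_page_`, `_hyperlink_`, and `_word_` follow the query listing on the query slide. Keyword nodes are created per page here, which is an assumption.

```python
from collections import deque

def build_web_graph(start_url, fetch, max_pages=100):
    """fetch(url) -> (links, keywords); returns (vertices, edges, page_id)."""
    page_id = {start_url: 1}
    vertices = {1: "_page_"}
    edges = []
    queue = deque([start_url])          # breadth-first frontier
    while queue:
        url = queue.popleft()
        links, keywords = fetch(url)
        for word in keywords:           # extra nodes for document keywords
            wid = len(vertices) + 1
            vertices[wid] = word
            edges.append((page_id[url], wid, "_word_"))
        for link in links:              # edges represent hyperlinks
            if link not in page_id:
                if len(page_id) >= max_pages:
                    continue
                page_id[link] = len(vertices) + 1
                vertices[page_id[link]] = "_page_"
                queue.append(link)
            edges.append((page_id[url], page_id[link], "_hyperlink_"))
    return vertices, edges, page_id

# Hypothetical two-page site held in memory.
SITE = {
    "http://example.org/": (["http://example.org/projects.html"], ["home"]),
    "http://example.org/projects.html": (["http://example.org/"], ["subdue"]),
}
vertices, edges, page_id = build_web_graph("http://example.org/", lambda u: SITE[u])

# Query: pages that link to a page containing the term 'subdue'.
has_word = {u for u, v, lab in edges if lab == "_word_" and vertices[v] == "subdue"}
linking = {u for u, v, lab in edges if lab == "_hyperlink_" and v in has_word}
```

The final two lines hard-code one query shape; a general search would instead match an arbitrary query graph against the domain graph.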

Page 35: Sample Search

[Figure: sample query graph with nodes Instructor, Teaching, Research, Publication, Robotics (three times), http, and Postscript | PDF]

Page 36: Query: Find all pages which link to a page containing term ‘subdue’

Query graph:
/* Vertex ID Label */
s
v 1 _page_
v 2 _page_
v 3 subdue
/* Edge Vertex 1 Vertex 2 Label */
d 1 2 _hyperlink_
d 2 3 _word_

Subgraph vertices:
1 _page_ URL: http://cygnus.uta.edu
7 _page_ URL: http://cygnus.uta.edu/projects.html
8 Subdue
[1->7] hyperlink
[7->8] word

[Figure: page with a hyperlink edge to a page, which has a word edge to subdue]
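
A small reader for the query listing, as a sketch. It assumes only the two record types visible on the slide, `v <id> <label>` and `d <v1> <v2> <label>`, and skips comment lines and anything else it does not recognize (including the lone `s` line); the function name is hypothetical and other record types of the real file format are not handled.

```python
def parse_query_graph(text):
    """Parse 'v' (vertex) and 'd' (directed edge) records into a graph."""
    vertices, edges = {}, []
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("/*"):
            continue                      # blank line or comment
        if parts[0] == "v" and len(parts) >= 3:
            vertices[int(parts[1])] = parts[2]
        elif parts[0] == "d" and len(parts) >= 4:
            edges.append((int(parts[1]), int(parts[2]), parts[3]))
        # Any other record type is ignored in this sketch.
    return vertices, edges

QUERY = """\
/* Vertex ID Label */
s
v 1 _page_
v 2 _page_
v 3 subdue
/* Edge Vertex 1 Vertex 2 Label */
d 1 2 _hyperlink_
d 2 3 _word_
"""
query_vertices, query_edges = parse_query_graph(QUERY)
```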

Page 37: Search for Presentation Pages

Subdue
  22 instances
AltaVista
  Query “host:www-cse.uta.edu AND image:next_motif.gif AND image:up_motif.gif AND image:previous_motif.gif.”
  12 instances

[Figure: query graph of page vertices connected by hyperlink edges]

Page 38: Search for Reference Pages

Search for page with at least 35 in-links
5 pages in www-cse
AltaVista cannot perform this type of search
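
The in-link criterion is easy to state over a link graph. A minimal sketch, assuming edges are (source, target, label) tuples with the hypothetical label `_hyperlink_` and counting distinct linking pages:

```python
from collections import defaultdict

def reference_pages(edges, min_in_links=35):
    """Return pages linked to by at least min_in_links distinct other pages."""
    sources = defaultdict(set)
    for u, v, lab in edges:
        if lab == "_hyperlink_" and u != v:
            sources[v].add(u)
    return sorted(p for p, s in sources.items() if len(s) >= min_in_links)

# Hypothetical graph: pages 1..40 all link to page 0; page 1 has one in-link.
edges = [(i, 0, "_hyperlink_") for i in range(1, 41)] + [(2, 1, "_hyperlink_")]
hits = reference_pages(edges)
```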

[Figure: query graph of page vertices connected by hyperlink edges]

Page 39: Search for pages on ‘jobs in computer science’

Inexact match: allow one level of synonyms
Subdue found 33 matches
  Words include employment, work, job, problem, task
AltaVista found 2 matches

[Figure: query graph of a page vertex with word edges to jobs, computer, science]

Page 40: Search for ‘authority’ hub and authority pages

Subdue found 3 hub (and 3 authority) pages
AltaVista cannot perform this type of search
Inexact match applied with threshold = 0.2 (4.2 transformations allowed)
  Subdue found 13 matches

[Figure: HUBS page vertices with hyperlink edges to AUTHORITIES page vertices, each authority with a word edge to algorithms]

Page 41: Subdue Learning from Web Data

Distinguish professors’ and students’ web pages
  Learned concept (professors have “box” in address field)
Distinguish online stores and professors’ web pages
  Learned concept (stores have more levels in graph)

[Figure: page vertex with a word edge to box; multi-level graph of page vertices]

Page 42: To Learn More

cygnus.uta.edu/subdue
cook@cse.uta.edu
http://www-cse.uta.edu/~cook