Thèses de l'INSA de Lyon | Les Thèses de l'INSA de …theses.insa-lyon.fr/publication/2005isal0118/these.pdfDr. David Sol Martínez Dr. Jesús A. González Bernal Gracias French

2005 ISAL 00118

PhD

Jointly awarded at the Institut National des Sciences Appliquées de Lyon

(Ecole Doctorale Informatique et Information pour la Société) and the Universidad de las Américas, Puebla

(School of Engineering, Department of Computer Sciences)

By

Manuel Alfredo PECH PALACIO

On December 12th, 2005

"Spatial Data Modeling and Mining using a Graph-based Representation"

----------------------------------------------------

PhD Committee Chair, Pr. Eduardo Morales Manzanares, ITESM, Morelos

Tutors:

Pr. Robert Laurini, INSA of Lyon Dr. Anne Tchounikine, INSA of Lyon Dr. David Sol Martínez, UDLAP Dr. Jesús A. González Bernal, INAOE

Reviewers:

Pr. Hervé Martin, U. Joseph Fourier, Grenoble Dr Nicandro Cruz Ramírez, U. Veracruzana Examiner: Dr. François Fages, INRIA

I

ABSTRACT

Motivation

Several approaches have been developed for mining spatial data (i.e., generalization-based

methods, clustering, spatial associations, approximation and aggregation, mining in image

and raster databases, spatial classification and spatial trend detection). However, we argue

that these approaches do not consider all the elements found in a spatial database (spatial

data, non-spatial data and spatial relations among the spatial objects) in an extended way.

Some of them focus first on spatial data and then on non-spatial data or vice versa, and

others consider restricted combinations of these elements. We think that it is possible to

enhance the generated results of the data mining task by mining them as a whole and not as

separate elements (they are related elements). A graph representation provides the

flexibility to describe these elements together and this is the motivation to explore the area

of graph-based spatial knowledge discovery.

II

Proposal

Our idea is to create a unique graph-based model to represent spatial data, non-spatial data

and the spatial relations among spatial objects. We will generate datasets composed of

graphs with a set of these three elements. We consider that by mining a dataset with these

characteristics a graph-based mining tool can search patterns involving all these elements at

the same time improving the results of the spatial analysis task. A significant characteristic

of spatial data is that the attributes of the neighbors of an object may have an influence on

the object itself. So, we propose to include in the model three relationship types

(topological, orientation, and distance relations).

In the model the spatial data (i.e., spatial objects), non-spatial data (i.e., non-spatial

attributes), and spatial relations are represented as a collection of one or more directed

graphs. A directed graph contains a collection of vertices and edges representing all these

elements. Vertices represent either spatial objects, spatial relation types between two spatial

objects (binary relation), or non-spatial attributes describing the spatial objects. Edges

represent a link between two vertices of any type. According to the type of vertices that an

edge joins, it can represent either an attribute name or a spatial relation name. The attribute

name can refer to a spatial object or a non-spatial entity. We use directed edges to represent

directional information of relations among elements (i.e., object x covers object y) and to

describe attributes about objects (i.e., object x has attribute z).

We propose to adopt the Subdue system, a general graph-based data mining system

developed at the University of Texas at Arlington, as our mining tool. Subdue discovers

III

substructures using a graph-based representation of structural databases. The substructures

(a connected subgraph within the graphical representation) describe structural concepts in

the data (i.e., patterns). The discovery algorithm follows a computationally constrained

beam search. The algorithm begins with the substructure matching a single vertex in the

graph. On each iteration, the algorithm selects the best substructure and incrementally

expands the instances of the substructure. An instance of a substructure in the input graph is

a subgraph that matches (graph theoretically) that substructure.

A special feature named overlap has a primary role in the substructures discovery process

and consequently a direct impact over the generated results. However, it is currently

implemented in an orthodox way: all or nothing. If we set overlap to true, Subdue will

allow the overlap among all instances sharing at least one vertex. On the other hand, if

overlap is set to false, Subdue will not allow the overlap among instances sharing at least

one vertex. So, we propose a third approach: limited overlap, which gives the user the

capability to set over which vertices the overlap will be allowed (vertices representing

remarkable elements that refer, for instance, to a spatial object in a spatial database or to

some characteristic defining a particular topic of a dataset). We visualize directly three

motivations issues to propose the implementation of the new algorithm: search space

reduction, processing time reduction, and pattern oriented search.

IV

Contribution

The contribution to the discovery knowledge in the spatial data domain, described in this

dissertation, is the development of a new approach for spatial data modeling and mining

using a graph-based representation. This contribution includes the following results:

• A new graph-based data representation for spatial, non-spatial data and spatial

relations.

• A new algorithm to discover substructures using a limited overlap approach in the

Subdue system.

• A prototype system implementing the proposed model.

V

Acknowledgments

Thank you; you are my guide, motivation and inspiration:

Erandi, my darling wife

Paula Ireri, my wonderful daughter

Manuel André, my little son

Manuel and Elsy, my parents

I want to express my gratitude, respect and admiration my tutors for the time, advices and

support they granted me during the research:

Mexican tutors:

Dr. David Sol Martínez

Dr. Jesús A. González Bernal

Gracias

French tutors:

Pr. Robert Laurini

Dr. Anne Tchounikine

Merci

VI

This works was supported in part by the Excellent Graduate Scholarship from the

Fundación Universidad de las Américas, Puebla, 38257-H project (Habitar y vivir. Análisis

del espacio habitacional de la ciudad de Puebla 1690-1890) Scholarship from the Mexican

Council of Science and Technology, SIRPO project (Sistema de Información para los

Riesgos del Popocatépetl) research grand from the Laboratoire Franco-Mexicain

d'Informatique, the Institut National des Sciences Appliquées de Lyon, and Excellent

Scholarship from the Gobierno del Estado de Quintana Roo.

VII

Table of contents

ABSTRACT............................................................................................................................ I

Motivation........................................................................................................................... I

Proposal ............................................................................................................................. II

Contribution ..................................................................................................................... IV

Appendix A RESUMEN EN ESPAÑOL ................................................................................i

A.1. Introducción .................................................................................................................i

A.1.1 Métodos basados en generalización..................................................................... ii

A.1.2 Agrupamiento ..................................................................................................... iii

A.1.3 Asociaciones espaciales .......................................................................................iv

A.1.4 Aproximación y agregación.................................................................................iv

A.1.5 Minería de datos en imágenes...............................................................................v

A.1.6 Clasificación de datos espaciales ..........................................................................v

A.1.7 Detección de tendencias espaciales .....................................................................vi

A.2. Motivación .................................................................................................................vi

A.2.1 Relaciones espaciales......................................................................................... vii

A.3 Representaciones basadas en grafos ........................................................................ viii

VIII

A.4 Minando el grafo.......................................................................................................xix

A.5 Resultados ............................................................................................................... xxii

A.6 Conclusiones ............................................................................................................xxx

A.7 Contribución ........................................................................................................ xxxiii

Appendix B RÉSUMÉ EN FRANÇAIS...........................................................................xxxv

B.1. Introduction ...........................................................................................................xxxv

B.1.1 Méthodes basées sur la généralisation ...........................................................xxxvi

B.1.2 Regroupement .............................................................................................. xxxvii

B.1.3 Associations spatiales.................................................................................. xxxviii

B.1.4 Rapprochement et agrégation...................................................................... xxxviii

B.1.5 Fouille de données-images.............................................................................xxxix

B.1.6 Classification de données spatiales ................................................................xxxix

B.1.7 Détection de tendances spatiales ..........................................................................xl

B.2. Motivation ..................................................................................................................xl

B.2.1 Relations spatiales.............................................................................................. xli

B.3 Représentations basées sur des graphes ................................................................... xlii

B.4 Fouille du graphe....................................................................................................... liii

B.5 Résultats .................................................................................................................... lvi

B.6 Conclusions ............................................................................................................. lxiv

B.7 Contribution ........................................................................................................... lxvii

Chapter 1 INTRODUCTION..................................................................................................1

1.1 Motivation.....................................................................................................................2

1.2 Proposal ........................................................................................................................4

1.3 Contribution ..................................................................................................................6

IX

1.4 Organization of the thesis .............................................................................................7

Chapter 2 RELATED WORK ................................................................................................8

2.1 Geographic Information System (GIS).........................................................................8

2.2 Data Mining ................................................................................................................12

2.2.1 Spatial Data Mining .............................................................................................15

2.2.1.1 Generalization-based Method .......................................................................16

2.2.1.2 Clustering......................................................................................................17

2.2.1.3 Spatial Associations......................................................................................19

2.2.1.4 Approximation and Aggregation ..................................................................20

2.2.1.5 Mining an Image Database ...........................................................................22

2.2.1.6 Classification Learning .................................................................................23

2.2.1.7 Spatial Trend Detection ................................................................................24

2.3 Spatial relations...........................................................................................................24

2.3.1 Neighborhood Graphs, Neighborhood Paths and Neighborhood Indices............25

2.3.2 Topological Relations ..........................................................................................26

2.3.3 Distance Relations ...............................................................................................28

2.3.4 Direction Relations ..............................................................................................29

2.4 Conclusion ..................................................................................................................30

Chapter 3 GRAPH-BASED REPRESENTATIONS............................................................31

3.1 Generalities .................................................................................................................31

3.2 Methodology...............................................................................................................33

3.3 Spatial Graph-based Data Representations.................................................................36

3.4 Use-case ......................................................................................................................52

3.5 Conclusion ..................................................................................................................58

X

Chapter 4 MINING THE GRAPH........................................................................................59

4.1 Characteristics.............................................................................................................59

4.1.1 Main Functions ....................................................................................................63

4.2 Overlap........................................................................................................................66

4.3 Limited Overlap..........................................................................................................81

4.4 Conclusion ..................................................................................................................94

Chapter 5 PROTOTYPE.......................................................................................................96

5.1 Population Census from the year of 1777 in Puebla downtown.................................97

5.2 Modules ....................................................................................................................101

5.3 Conclusion ................................................................................................................115

Chapter 6 RESULTS ..........................................................................................................117

6.1 Population census from the year of 1777 in Puebla downtown................................118

6.1.1 Use-case: El Sagrario........................................................................................121

6.1.2 Use-case: People living along the borders of the river crossing Puebla downtown

....................................................................................................................................131

6.2 Popocatépetl volcano ................................................................................................136

6.2.2. Use-case: Popocatépetl .....................................................................................138

6.2.2.1 Model #1 - base model................................................................................140

6.2.2.2 Model #2 - single replication of relation types, complete information ......147

6.2.2.3 Model #3 - double replication of relation types, no complete information 151

6.2.2.4 Model #4 - single replication of relation types, no complete information .155

6.2.2.5 Model #5 - double replication of relation types, complete information .....159

6.3 Conclusion ................................................................................................................167

Chapter 7 CONCLUSIONS................................................................................................169

XI

BIBLIOGRAPHY...............................................................................................................174

Appendix C AGREEMENTS UDLAP/INSA DE LYON................................................. lxix

C.1 CONVENTION DE CO-TUTELLE DE THÈSE.....................................................lxx

C.2 AVENANT RELATIF A LA CONVENTION DE THÈSE EN CO-TUTELLE DE

MANUEL ALFREDO PECH PALACIO ................................................................... lxxiii

XII

List of figures

Figura A.1. Modelo basado en grafos para representar datos espaciales. .............................ix

Figura A.2. Base de datos de ejemplo para caracterizar los 3 modelos propuestos. .............xi

Figura A.3. Modelo #1 - modelo base. ................................................................................ xii

Figura A.4. Modelo #2 - replicación simple de tipos de relación, información completa. xiii

Figura A.5. Modelo #3 - doble replicación de tipos de relación, información no completa.xv

Figura A.6. Relaciones entre carreteras y ríos usando el modelo #1.................................xxiv

Figura A.7. Relaciones entre carreteras y poblaciones usando el modelo #1.....................xxv

Figura A.8. Relaciones entre ríos y poblaciones usando el modelo #1. .......................... xxvii

Figure B.1. Modèle basé sur des graphes pour représenter des données spatiales. ........... xliii

Figure B.2. Base d'exemple pour caractériser les 3 modèles proposés................................xlv

Figure B.3. Modèle n°1 - modèle base. ............................................................................. xlvi

Figure B.4. Modèle n°2 - réplication simple des types de relation, information complète.

.................................................................................................................................. xlvii

Figure B.5. Modèle n°3 - double réplication des types de relation, information non

complète..................................................................................................................... xlix

Figure B.6. Relations entre des routes et des rivières en utilisant le modèle n°1. ............. lviii

XIII

Figure B.7. Relations entre des routes et des villes en utilisant le modèle n°1. .................. lix

Figure B.8. Relations entre des rivières et des villes en utilisant le modèle n°1. ................ lxi

Figure 2.1. Geographic Information System. .......................................................................11

Figure 2.2. Knowledge Discovery in Databases...................................................................13

Figure 2.3. Data mining: integration of several fields. .........................................................14

Figure 2.4. Architecture for a KDD system..........................................................................15

Figure 2.5. Example of the interior, boundary and exterior of a circle. ...............................27

Figure 2.6. Nine intersection model......................................................................................27

Figure 2.7. Topological Relations.........................................................................................28

Figure 2.8. Distance Relations. .............................................................................................28

Figure 2.9. Direction Relations.............................................................................................30

Figure 3.1. General graph-based model to represent spatial data. ........................................35

Figure 3.2. Sample dataset. ...................................................................................................40

Figure 3.3. Model #1 - base model. ......................................................................................42

Figure 3.4. Model #2 - single replication of relation types, complete information. .............43

Figure 3.5. Model #3 - double replication of relation types, no complete information........45

Figure 3.6. Model #4 - single replication of relation types, no complete information. ........47

Figure 3.7. Model #5 - double replication of relation types, complete information.............48

Figure 3.8. Spatial database representing some objects of the world. ..................................53

Figure 3.9. Selection window. ..............................................................................................54

Figure 3.10. Querying a spatial database. .............................................................................55

Figure 3.11. Graph-based representation for spatial data. ....................................................56

Figure 4.1. Graph representation of the house domain.........................................................61

Figure 4.2. Substructure and instances discovered from the house domain by Subdue. ......62

XIV

Figure 4.3. Substructure replacement procedure in the house domain. ................................62

Figure 4.4. Graph representation of the house domain after substructure replacement. ......63

Figure 4.5. SGISO - input graph_1.......................................................................................67

Figure 4.6. SGISO - input graph_2.......................................................................................67

Figure 4.7. SGISO - no overlap. ...........................................................................................68

Figure 4.8. SGISO - no overlap, 1 instance in graph_2........................................................68

Figure 4.9. SGISO - overlap. ................................................................................................68

Figure 4.10. SGISO - overlap, 4 instances in graph_2. ........................................................68

Figure 4.11. MDL - input graph_1. ......................................................................................71

Figure 4.12. MDL example 1 - input graph_2......................................................................72

Figure 4.13. MDL example 1 - no overlap. ..........................................................................72


Figure 4.15. MDL example 2 - overlap. ...............................................................................73












XV




Figure 4.30. Limited overlap - input graphs PS_1 and PS_2. ..............................................82

Figure 4.31. Limited overlap - input graph_3.......................................................................82

Figure 4.32. No overlap - SGISO. ........................................................................................84

Figure 4.33. Overlap - SGISO. .............................................................................................84

Figure 4.34. Limited overlap PS_1 - SGISO. .......................................................................84

Figure 4.35. Limited overlap in the Subdue system. ............................................................87

Figure 4.36. No overlap - compressed graph........................................................................88

Figure 4.37. No overlap - discovered substructures. ............................................................90

Figure 4.38. Overlap - compressed Graph. ...........................................................................91

Figure 4.39. Overlap - discovered substructures. .................................................................92

Figure 4.40. Limited overlap PS_1 - compressed graph.......................................................93

Figure 4.41. Limited overlap - discovered substructures......................................................94

Figure 5.1. Representation of spatial concepts in the census from the year of 1777............98

Figure 5.2. First representation for the non-spatial data in the population census from the

year of 1777. ...............................................................................................................100

Figure 5.3. Second representation for the non-spatial data in the population census from the

year of 1777. ...............................................................................................................101

Figure 5.4. The query panel. ...............................................................................................103

Figure 5.5. The map panel. .................................................................................................106

Figure 5.6. The spatial graph panel.....................................................................................109

Figure 5.7. Graph representation of processed data............................................................111

XVI

Figure 5.8. The non-spatial graph panel. ............................................................................112

Figure 5.9. The Subdue panel. ............................................................................................113

Figure 5.10. Example of Subdue’s standard output............................................................114

Figure 5.11. Layout for reading the Subdue’s discovered substructures............................115

Figure 6.1. Population census from the year of 1777 in Puebla downtown. ......................118

Figure 6.2. Parishes in Puebla downtown in the 1777 year. ...............................................120

Figure 6.3. Blocks 150m. from representative church, parish “El Sagrario”. ....................122

Figure 6.4 Example of a generated graph in the use-case “El Sagrario”............................122

Figure 6.5. Examples of discovered patterns by using standard overlap in use-case “El

Sagrario” (1). ..............................................................................................................124

Figure 6.6. Examples of discovered patterns by using limited overlap in use-case “El

Sagrario” (1). ..............................................................................................................126

Figure 6.7. Processing time standard vs. limited overlap: use-case “El Sagrario” (1). ......127

Figure 6.8. Blocks 150m. North side from representative church, parish “El Sagrario”. ..128

Figure 6.9. Examples of discovered patterns in the use-case “El Sagrario” (2) .................130

Figure 6.10. Processing time standard vs. limited overlap: use-case “El Sagrario” (2). ....131

Figure 6.11. Blocks 50m. from river crossing Puebla downtown. .....................................132

Figure 6.12. Examples of discovered patterns in use-case “people around the river crossing

Puebla downtown”......................................................................................................133

Figure 6.13. Processing time standard vs. limited overlap: use-case “people around the river

crossing Puebla downtown”........................................................................................134

Figure 6.14. Popocatépetl volcano......................................................................................137

Figure 6.15. Popocatépetl volcano: study zone. .................................................................138

Figure 6.16. Relationships among roads and rivers by using model #1. ............................142

XVII

Figure 6.17. Relationships among roads and settlements by using model #1. ...................144

Figure 6.18. Relationships among rivers and settlements by using model #1....................146













XVIII

List of tables

Tabla A.1. Características de los modelos de representación basados en grafos. ...............xvi

Tabla A.2. Instancias/iteraciones por cada modelo basado en grafos: caso de uso

Popocatépetl............................................................................................................ xxviii

Tabla A.3. Max/Min de instancias descubiertas por “objeto-objeto”/característica overlap.

...................................................................................................................................xxix

Tabla A.4. Promedio de instancias descubiertas por modelo/“objeto-objeto”. .................xxix

Tabla A.5. Promedio de instancias descubiertas por modelo/característica overlap. .........xxx

Tabla A.6. Promedio de instancias descubiertas por modelo. ............................................xxx

Table B.1. Caractéristiques des modèles de représentation basés sur des graphes. ................l

Table B.2. Instances/itérations par chaque modèle basé sur des graphes : cas d'utilisation

Popocatépetl................................................................................................................ lxii

Table B.3. Max/Min d'instances découvertes par "objet-objet"/caractéristique

recouvrement. ............................................................................................................ lxiii

Table B.4. Moyenne d'instances découvertes par modèle/"objet-objet". .......................... lxiii

Table B.5. Moyenne d'instances découvertes par modèle/caractéristique recouvrement.. lxiv

Table B.6. Moyenne d'instances découvertes par modèle. ................................................ lxiv

XIX

Table 3.1. Characteristics of the graph-based representation models...................................49

Table 6.1. Instances/iterations by each graph-based model: Popocatépetl use-case. .........163

Table 6.2. The best model to discover complete patterns among the road and river spatial

objects (no overlap) ....................................................................................................164

Table 6.3. Max/Min of discovered instances by “object-object”/overlap feature. .............165

Table 6.4. Average of discovered instances by model/“object-object”. .............................166

Table 6.5. Average of discovered instances by model/overlap feature. .............................166

Table 6.6. Average of discovered instances by model. ......................................................167

i

Appendix A

RESUMEN EN ESPAÑOL

A.1. Introducción

En los últimos años hemos sido testigos del rápido crecimiento en el número, capacidad y

diseminación de aplicaciones informáticas dedicadas a la obtención, generación,

manipulación y almacenamiento de datos en diversos ámbitos de la vida humana. Esto ha

propiciado una gran cantidad de colecciones de datos cuyo análisis por medios manuales se

vuelve una tarea complicada. Recordemos que en muchas ocasiones los datos “crudos”

necesitan ser analizados e interpretados para convertirlos en información útil y provechosa.

Tal situación ha propiciado una creciente necesidad por técnicas/herramientas

computacionales que nos ayuden en estas tareas. Descubrimiento de conocimiento en bases

de datos (KDD, por sus siglas en el idioma Inglés) es definido como la extracción no trivial

de información implícita, previamente desconocida y potencialmente útil a partir de datos

[16]. Este es un proceso iterativo e interactivo que envuelve diferentes fases. El núcleo del

proceso es la fase de minado de datos, que se conceptualiza como la aplicación de

algoritmos de análisis de datos y de descubrimiento que bajo parámetros aceptables de

ii

eficiencia computacional producen/descubren una enumeración particular de patrones sobre

los datos mismos [12].

En este mismo contexto, pero enfocado al análisis y explotación de datos provenientes de

fenómenos generados en, sobre, y bajo la superficie de la tierra, llamados datos espaciales,

ha generado un nuevo dominio de investigación llamado Minería de datos Espaciales. De

tal forma, la Minería de datos espaciales se enfoca al descubrimiento de conocimiento

implícito, y previamente desconocido en datos espaciales [16]. Como resultado de esta

necesidad creciente diversos enfoques para el minado de datos espaciales han sido

desarrollados, entre los más representativos encontramos:

A.1.1 Métodos basados en generalización

La generalización ha demostrado ser uno de los métodos efectivos para descubrir

conocimiento. Fue introducido por la comunidad de aprendizaje máquina y se basa en el

aprendizaje a partir de ejemplos. El descubrimiento de conocimiento basado en

generalización requiere jerarquías de conceptos (dadas explícitamente por el experto ó

generadas automáticamente). En la caso de las bases de datos espaciales, pueden darse dos

tipos de jerarquías de conceptos: (1) Jerarquías temáticas, por ejemplo, generalizar tomates

y plátanos como frutas, las frutas y vegetales como alimentos de origen vegetal. (2)

Jerarquías espaciales, por ejemplo, generalizar una serie de puntos espaciales como una

región ó país.

iii

Lu et al. [35] extienden la técnica attribute-oriented induction a las bases de datos

espaciales. Esta técnica se basa en escalar la jerarquía de generalización e ir resumiendo las

relaciones entre los datos espaciales y no espaciales a un nivel de concepto más alto. Los

autores presentan dos algoritmos basados en generalización: (1) Enfoque de dominación de

datos no espaciales. Este método realiza en primera instancia inducción orientada al

atributo sobre los datos no-espaciales y posteriormente sobre los espaciales. (2) Enfoque de

dominación de datos espaciales. Dado la jerarquía de datos espaciales, la generalización se

realizado primero sobre estos datos y posteriormente sobre los datos no espaciales.

A.1.2 Agrupamiento

Agrupamiento (clustering) es el proceso de agrupar de manera física ó abstracta objetos en

clases de objetos similares. Este enfoque de minería de datos nos ayuda a construir

particiones “representativas” de un conjunto de objetos dada una medida de

similitud/distancia (Ej. distancia euclidiana). Esto es, el agrupamiento de datos identifica

grupos (clusters) ó regiones densamente pobladas de acuerdo a alguna medida de distancia

en un conjunto de datos multidimensionales. Podemos clasificar a los algoritmos de

agrupamiento en cuatro grupos principales: Algoritmos de particionamiento basados en los

enfoques k-means (centro de gravedad del cluster) y k-medoid (objeto representativo del

cluster), algoritmos jerárquicos, algoritmos basados en la ubicación de los objetos

(agrupamiento por densidad), y por último los basados en grids.

iv

A.1.3 Asociaciones espaciales

Una regla de asociación espacial es una regla que describe la implicación de uno o un

conjunto de objetos por otro conjunto de objetos en base de datos espaciales [29]. Un

ejemplo de una regla de asociación espacial podría ser “si la empresa se ubica cerca de la

Ciudad de México entonces es una empresa grande”. Una regla de asociación espacial es de

la forma X Y, donde X y Y son conjuntos de predicados espaciales o no espaciales. Existen

varios tipos de predicados espaciales que pudieran constituir una regla de asociación

espacial, por ejemplo, relaciones topológicas como son intersección, traslape y

orientaciones espaciales tales como Izquierda_de, y Oeste_de.

A.1.4 Aproximación y agregación

Los métodos basados en aproximación y agregación buscan analizar las características de

grupos de objetos (clusters) en base a objetos (features) cercanos a ellos. Proximidad

agregada es la medida de cercanía de un conjunto de puntos en un cluster a un feature. La

idea de encontrar relaciones de proximidad no es un problema simple como podría parecer,

existen tres razones para esta aseveración. Supongamos que tenemos un cluster de puntos y

queremos encontrar los k-features más cercanos a él:

• El tamaño y forma del cluster y los features puede ser muy variado.

• Podríamos tener una gran cantidad de features para examinar.

• Aún en el caso de encontrar una forma conocida (Ej. polígono) que describa la

forma del cluster, sería inadecuado reportar los features cuyos límites estén más

v

cerca a los límites de éste, porque la distribución de los puntos al interior del cluster

puede no ser uniforme.

A.1.5 Minería de datos en imágenes

Extracción de patrones a partir de imágenes es otro enfoque de minería de datos espaciales.

En la literatura existen diversos trabajos de minado de datos desarrollados bajo este

enfoque. Por ejemplo, Fayyad et al. [14] presentan un sistema para la identificación y

categorización de volcanes en la superficie de Venus a partir de imágenes transmitidas por

la sonda espacial Magellan. En otro trabajo [15] (Second Palomar Observatory Sky Survey)

se usaron árboles de decisión para la clasificación de galaxias, estrellas y otros objetos

estelares. Stolorz et al. [44] y Shek et al. [41] efectuaron estudios sobre minería de datos

espacio-temporal en conjuntos de datos geofísicos.

A.1.6 Clasificación de datos espaciales

La clasificación de datos espaciales tiene como objetivo encontrar reglas que dividan un

conjunto de objetos en un número de grupos, donde los objetos de cada grupo pertenecen a

una clase. Diversos tipos de información pueden ser usados para caracterizar los objetos

espaciales. Por ejemplo, atributos no espaciales de un objeto, predicados espaciales y

funciones espaciales. La idea es usar esta información para extraer ya sea atributos para la

etiquetación de clases (atributos que dividen los datos en clases) y atributos predictivos

(atributos cuyos valores son usados en un árbol de decisión para crear sus ramas).

vi

A.1.7 Detección de tendencias espaciales

Detección de tendencias espaciales describe los cambios regulares de uno ó más atributos

no espaciales de un objeto cuando éste se desplaza desde un punto de referencia inicial. Un

ejemplo de tendencia espacial sería “alejándose del centro histórico de la ciudad de Puebla

es precio de los terrenos decrece”. Las trayectorias de movimiento a partir de un punto x

son usadas para modelar dicho movimiento y análisis de regresión sobre los atributos de los

objetos son usados para describir patrones de cambio. Existen dos tipos de tendencias:

globales y locales.

A.2. Motivación

Aunque los enfoques antes mencionados realizan la tarea de minado de datos espaciales de

manera exitosa, nuestra percepción es que éstos no consideran todos los elementos

encontrados en una base de datos espaciales (datos espaciales, datos no espaciales y

relaciones espaciales entre los objetos espaciales) de una manera integral. Es decir, algunos

de ellos primero realizan minado de datos espaciales y posteriormente minado de datos no

espaciales ó viceversa, y otros permiten combinaciones de estos elementos pero de manera

restringida. Con base en lo anterior, proponemos el argumento siguiente: si somos capaces

de representar los datos espaciales, no espaciales y las relaciones entre objetos espaciales

como un solo conjunto de datos, y lo minamos como tal, podríamos generar/encontrar

patrones de conocimiento que describan/caractericen nuestro conjunto de datos conteniendo

estos tres elementos de manera conjunta. Para tal efecto en este trabajo se argumenta que

vii

una representación basada en grafos es lo suficientemente flexible y poderosa para

representar estos elementos de manera conjunta, fácilmente entendible y capaz de crear

diferentes representaciones del mismo dominio. El dominio es descrito usando grafos, los

grafos se convierten en datos de entrada para una herramienta de descubrimiento de

conocimiento basada en grafos la cual usa heurísticas para seleccionar subgrafos que son

considerados importantes (patrones).

Un grafo es definido como un par G = (V,E). V = {v1,…,vn} denota un conjunto finito de

elementos llamados vértices. E es un conjunto de arcos e satisfaciendo E ⊆ [V]2. Entonces,

cada arco e ∈ E es un par (vi,vj). Si (vi,vj) es un par ordenado para cualesquiera (vi,vj) ∈ E,

entonces se dice que G = (V,E) es un grafo dirigido. Un grafo etiquetado tiene etiquetas

asociadas a sus vértices y arcos.

Como comentamos anteriormente en nuestro modelo proponemos la representación de

relaciones espaciales entre los objetos espaciales. En la siguiente subsección detallamos los

tres tipos de relaciones espaciales que proponemos incluir.

A.2.1 Relaciones espaciales

La ubicación explícita de los objetos espaciales define relaciones implícitas de vecindad

(neighborhood) espacial entre ellos. De tal forma, la información sobre la vecindad de los

objetos espaciales constituye un elemento valioso que debe ser considerado en la tarea de

minado de datos espaciales. Martin Ester et al. [9][11] introducen el concepto de grafos de

viii

vecindad para representar explícitamente estas relaciones de vecindad implícitas. Los

grafos de vecindad pueden cubrir las relaciones de vecindad siguientes:

• Topológicas. Derivadas del modelo de nueve intersecciones [6][7][8], son

relaciones que permanecen invariantes bajo transformaciones lineales, es decir, si

ambos objetos se rotan, se trasladan ó se escalan simultáneamente las relaciones

entre ellos se preservan.

• Distancia. Compara la distancia entre dos objetos dada una constante usando

operadores aritméticos tales como <, >, =. La distancia entre dos objetos se definen

como la distancia mínima entre ellos (Ej. seleccionar todos los elementos dentro de

una radio de 50 kilómetros desde un punto x).

• Dirección. La relación espacial de dirección entre 2 objetos espaciales A y B (B R A)

se define usando un punto representativo del objeto A y todos los puntos del objeto

destino B. El punto representativo del objeto fuente A puede ser el centro del objeto

ó un punto sobre sus límites. Este punto representativo es usado como el origen de

un sistema de coordenadas virtuales y su cuadrante define la dirección.

Una vez comentada la motivación de nuestro trabajo de investigación se presenta a

continuación el modelo general basado en grafos para representar los datos espaciales.

A.3 Representaciones basadas en grafos

Como hemos comentado, nuestra propuesta se basa en crear un modelo basado en grafos

para representar conjuntamente datos espaciales, no espaciales y relaciones espaciales entre

ix

los objetos espaciales. La idea es usar los grafos generados como datos de entrada para un

algoritmo de minado de datos, de tal forma, que el algoritmo pueda encontrar patrones

involucrando estos elementos de manera conjunta y no como elementos separados. En

consecuencia, proponemos el modelo de representación mostrado (en notación UML) en la

Figura A.1.

Vertex

1..*

1..*

Distance DirectionTopological

Spatial relation

Spatial object

Non-spatial attribute

-From:-To:

Edge

Figura A.1. Modelo basado en grafos para representar datos espaciales.

En el modelo, los datos espaciales (Ej. objetos espaciales), datos no espaciales (Ej. atributos

descriptivos), y relaciones espaciales son representados como una colección de uno ó más

grafos dirigidos etiquetados. Los vértices pueden representar objetos espaciales, tipos de

relación espacial entre dos objetos (relación binaria), ó atributos no espaciales describiendo

los objetos espaciales. Los arcos representan una liga existente entre dos vértices de

cualquier tipo. Dependiendo del tipo de vértices que un arco une, éste puede representar el

nombre de un atributo descriptivo ó el nombre de una relación espacial. Los nombres de

atributos pueden referirse a descripciones de objetos espaciales y/ó a entidades no

espaciales. Se usan arcos dirigidos para representar la información direccional de relaciones

x

entre elementos (Ej. objeto x cubre objeto y) y para describir atributos pertenecientes a

objetos (Ej. objeto x tiene atributo z).

Actualmente se han desarrollado cinco representaciones a partir del modelo general

previamente descrito. Tres tópicos definen las características de los grafos creados en cada

modelo: (1) Representación de las relaciones espaciales equivalentes (Ej. tocar, traslapar).

(2) Representación de las relaciones espaciales simétricas (Ej. contiene-dentro_de,

Norte_de-Sur_de). (3) La manera de representar los objetos y sus relaciones en el modelo.

En consecuencia, los grafos creados se diferencian de manera cuantitativa y cualitativa. En

la parte cuantitativa tenemos diferencias tales como: número de vértices y arcos empleados

para representar los datos espaciales, no espaciales y las relaciones, creación de grafos

simples, manejo de arcos dirigidos y no dirigidos para representar relaciones espaciales y/o

atributos descriptivos. En la parte cualitativa hemos observado, a través de

experimentación, que ciertos modelos tienen una mayor expresividad para representar el

conjunto de datos, condicionando de manera directa la calidad de los resultados generados

en el proceso de minado. A continuación se presentan 3 de los 5 modelos propuestos

describiendo las métricas de evaluación creadas para caracterizar a cada uno de ellos.

Modelos

Con la finalidad de describir las características de cada modelo, usaremos como conjunto

de datos de ejemplo los mostrados en la Figura A.2. Como podemos observar, nuestro

conjunto de datos se compone de dos objetos espaciales, objeto A representado una casa y

objeto B representando un lago, y las tres relaciones espaciales siguientes: (1) Distancia,

xi

objeto A cerca de objeto B (relación equivalente). (2) Topológica, objeto A toca objeto B

(relación equivalente). (3) Dirección, objeto A Sur de objeto B (relación simétrica).

North

South

EastWest

Object B

Object A

Figura A.2. Base de datos de ejemplo para caracterizar los 3 modelos propuestos.

Modelo #1 - modelo base

La Figura A.3 muestra el primer modelo creado para representar datos espaciales bajo el

enfoque propuesto. Las características del modelo de acuerdo a las métricas creadas para su

caracterización son:

• Num. vértices: 2 vértices, cada uno representando un objeto espacial (objeto A y

objeto B).

• Num. arcos: 4 arcos, 3 arcos para representar las relaciones espaciales originales

existentes en nuestro conjunto de datos de ejemplo (“cerca”, “toca” y “Sur_de”) y

un arco para representar la relación “Norte_de” creada de la relación simétrica

original “Sur_de”. La relación “Norte_de” es en si misma una relación simétrica.

• Tamaño (vértices + arcos): 6

• % incremento: 0%, este es el modelo base.

• Grafo simple. No, Es un grafo complejo con 4 arcos uniendo 2 vértices.

xii

• Arco dirigido. Si, son usados para representar las relaciones simétricas “Sur_de” y

“Norte_de”. La dirección de los arcos va en concordancia con la lectura de las

relaciones entre los objetos.

• Arco no dirigido. Si, son usados para representar las relaciones equivalentes

“cerca” y “toca”.

• Información completa. Si, en el grafo es representada la relación simétrica

“Norte_de” creada a partir de la relación simétrica original “Sur_de”.

• Arco “Relation” redundante. No, en el modelo no se usan arcos “Relation”.

Spatialobject A

Spatialobject B

South_of

touch

close

North_of

Figura A.3. Modelo #1 - modelo base.

Modelo #2 - replicación simple de tipos de relación, información completa

En la Figura A.4 presentamos el segundo modelo creado para representar datos espaciales.

Las características del modelo de acuerdo a las métricas son:

• Num. vértices: 5 vértices, 2 vértices para representar los objetos espaciales y 3

vértices para representar los tipos de relaciones espaciales “topológica”, “distancia”

y “dirección”. Por cada tipo de relación especial existente entre dos objetos, se

añade un vértice etiquetado con el nombre del tipo de la relación espacial. En el

ejemplo existen una relación “topológica”, una relación de “distancia” y una

relación de “dirección”.

xiii

• Num. arcos: 6 arcos, 3 arcos para representar las relaciones originales, 2 arcos para

representar las relaciones equivalentes (“cerca” y “toca”) creadas a partir de las

relaciones originales y un arco para representar la relación simétrica (“Norte_de”)

creada también de las relaciones originales.


• % incremento: +83.33%

• Grafo simple. Si, existe a lo más un arco entre cualesquiera 2 vértices dados.

• Arco dirigido. Si, son usados para representar todas las relaciones. La dirección de

los arcos va de los vértices representando los objetos espaciales a los vértices

representado los tipos de relaciones espaciales.

• Arco no dirigido. No, en el modelo no se usan arcos no dirigidos.

• Información completa. Si, representamos las relaciones simétricas creadas a partir

de las relaciones originales.

• Arco “Relation” redundante. No, en el modelo no usamos arcos “Relation”.

South_of

touch

close close

touch

North_of

Distance

Direction

Spatialobject A Topological Spatial

object B

Figura A.4. Modelo #2 - replicación simple de tipos de relación, información completa.

xiv

Modelo #3 - doble replicación de tipos de relación, información no completa

En la Figura A.5 se presenta el tercer modelo creado para representar datos espaciales

usando un enfoque de grafos. Las características del modelo de acuerdo a las métricas son:

• Num. vértices: 8 vértices, 2 vértices para representar los objetos espaciales y 6

vértices para representar los tipos de relaciones espaciales (“distancia”, “topológica”

y “dirección”). Por cada tipo de relación espacial entre dos objetos espaciales, se

añaden 2 vértices etiquetados con el nombre del tipo de la relación espacial. Por

ejemplo, en nuestros datos de prueba existen tres tipos de relaciones: 1 relación

“topológica”, 1 relación “distancia” y 1 relación “dirección”, de tal forma, se

añaden 6 vértices, 2 por cada tipo de relación espacial.

• Num. arcos: 9 arcos, 6 arcos “Relation” para unir los vértices representando los

objetos espaciales con los vértices representando los tipos de relaciones espaciales

(desde cada vértice representando un objeto espacial nacen 3 arcos ya que existen 3

tipos de relación), y 3 arcos para presentar las relaciones espaciales originales. Estos

3 arcos son usados para unir los vértices representando los tipos de relaciones

espaciales.


• % incremento: +183.33%

• Grafo simple. Si, existe a lo más un arco entre cualesquiera 2 vértices dados.

• Arco dirigido. Si, son usados para representar las relaciones simétricas y los arcos

“Relation”. La dirección de los arcos “Relation” va desde los vértices representando

los objetos espaciales a los vértices representando los tipos de relaciones espaciales.

xv

La dirección de los restantes arcos va en concordancia a la lectura de las relaciones

espaciales entre los objetos.

• Arco no dirigido. Si, son usados para representar las relaciones equivalentes.

• Información completa. No, en el modelo no se representan las relaciones

simétricas que son creadas a partir de las relaciones espaciales originales.

• Arco “Relation” redundante. Si, en el modelo se usan arcos “Relation” para

representar explícitamente la existencia y tipo de una relación espacial entre 2

objetos espaciales. Adicionalmente, se emplean estos arcos para evitar la creación

de grafos complejos.

South_of

touch

close

Relation

Distance

Topological

Direction

Distance

Topological

Direction

Spatialobject A

Spatialobject B

Relation

R elation

Relation

RelationR

elation

Figura A.5. Modelo #3 - doble replicación de tipos de relación, información no completa.

La Tabla A.1 presenta los resultados de las nueve métricas desarrolladas para evaluar las

características de cada modelo propuesto (actualmente 5 modelos). El modelo #1 es

llamado el modelo base.

xvi

Modelo Num.

Vértices

Num.

Arcos

Tamaño

(v + e)

%

Incremento

Grafo

Simple

Arco

Dirigido

Arco no

Dirigido

Información

Completa

Arco

“Relation”

Redundante

(1) (2) (3) (4) (5) (6) (7) (8) (9)

#1 2 4 6 - No Yes Yes Yes No

#2 5 6 11 +83.33 Yes Yes No Yes No

#3 8 9 17 +183.33 Yes Yes Yes No Yes

#4 5 6 11 +83.33 Yes Yes Yes No Yes

#5 8 12 20 +233.33 Yes Yes No Yes Yes

Tabla A.1. Características de los modelos de representación basados en grafos.

Las métricas fueron propuestas basaron en el causas/efectos que cada uno de éstas tiene

tanto en el grafo creado como en el algoritmo de minado. Visualizamos cuatro tópicos

significativos relacionados directamente a estas métricas:

1. Espacio de búsqueda

El espacio de búsqueda en un algoritmo de minado de datos basado en grafos consiste de

todos los subgrafos que pueden ser derivados de su grafo de entrada, de tal forma, el

número de vértices (1) y arcos (2) del grafo creado (3) definen el tamaño del espacio de

búsqueda para el sistema del descubrimiento. Por lo tanto, el objetivo debe ser minimizar el

número de vértices y arcos usados para crear los grafos pero al mismo tiempo maximizar la

representatividad de los mismos. Como podemos ver en la Tabla A.1, el modelo usando el

número mínimo de vértices y arcos para representar el conjunto de datos de ejemplo es el

modelo #1 (2 vértices y 4 arcos) mientras que el modelo #5 es el caso opuesto (8 vértices y

12 arcos).

xvii

2. Tiempo de procesamiento

El tamaño del espacio de búsqueda juega un papel relevante con respecto al tiempo de

procesamiento usado para descubrir patrones. Si tenemos un espacio búsqueda “grande” se

requeriría más tiempo para evaluar todos los subgrafos posibles. Por lo tanto, una

comparativa de la métrica "porcentaje de incremento" (4) entre los modelos propuestos se

presenta en la Tabla A.1. Recordemos que esta métrica compara el tamaño de un modelo

dado con respecto al modelo #1 (modelo base). Por ejemplo, el modelo #5 tiene un

incremento de tamaño del grafo de 233.33% respecto al modelo #1. Es decir, el algoritmo

de minado requerirá la evaluación de 233.33% más vértices y/o arcos usando el modelo #5

en vez del modelo #1 para el mismo conjunto de datos.

3. Complejidad del grafo

En el capítulo 4 de la disertación se describe el sistema Subdue, nuestra herramienta de

minado de datos basada en grafos. Cómo ahí describimos, existe una mayor complejidad

para el algoritmo de minado trabajar con grafos complejos en vez de grafos simples (Ej. a

lo máximo un arco uniendo cualesquiera dos vértices dado y no ciclos). Por ejemplo, en el

proceso de macheo de grafos, la fase de expansión (Subdue emplea un enfoque de

"expansión" para descubrir patrones), y la etapa de compresión del grafo. Por lo tanto, el

objetivo fue proponer modelos basados en grafos que nos permitieran crear grafos simples.

Como podemos ver en la Tabla A.1, únicamente el modelo #1 no permite crear grafos

simples.

xviii

En consecuencia, como estrategia para romper la multiplicidad de arcos entre dos vértices

(por ejemplo, vértices representando dos objetos espaciales con dos ó más relaciones

espaciales entre ellos) se emplean los enfoques siguientes:

• Añadir un nuevo vértice etiquetado con el nombre del tipo de la relación (Ej.

topológica, distancia y dirección) por cada relación espacial entre los objetos

espaciales. Este enfoque es usado en el modelo #2.

• Agregar un nuevo vértice (modelo #4) o dos nuevos vértices (modelo #3 y modelo

#5) etiquetado como el nombre de tipo de la relación espacial (Ej. topológica,

distancia y dirección) por cada relación espacial entre los objetos espaciales y unir

este nuevo vértice (modelo #4) o nuevos vértices (modelo #3 y modelo #5) con los

vértices representando los objetos espaciales por medio de arcos etiquetados como

"Relation". El enfoque empleado para unir los vértices difiere para cada modelo tal

y como se ha comentado en la definición de cada uno de ellos. Esta nomenclatura se

utiliza para representar el hecho de que existe una relación espacial entre los objetos

espaciales. Estos arcos son conocidos como arcos “Relation” redundantes (9).

4. Representatividad de los datos

Las métricas arcos dirigidos (6), arcos no dirigidos (7), y información completa (8) son

usadas para maximizar la representatividad de los datos pero minimizando, tanto como sea

posible, el tamaño del grafo y su complejidad. Los arcos dirigidos son usados para

representar las relaciones espaciales simétricas (objeto A Norte_de objeto B, implica, B

Sur_de A), los arcos “Relation” redundantes, los atributos no-espaciales que describen a los

objetos espaciales. Arcos no dirigidos son usados para representar las relaciones espaciales

xix

equivalentes (la relación es representada por un no dirigido en vez de dos arcos dirigidos).

Finalmente, información completa significa que las relaciones espaciales simétricas entre

objetos espaciales también son representadas en el modelo.

A.4 Minando el grafo

La característica de overlap (traslape entre vértices pertenecientes a diferentes instancias de

una subestructura) desempeña un rol importante en el sistema de descubrimiento de

patrones (subestructuras) en nuestra herramienta de minado de datos basada en grafos, el

sistema Subdue. En consecuencia, los resultados generados están condicionados, en un alto

porcentaje, al funcionamiento de esta característica. Sin embargo, su implementación actual

es ortodoxa: se permite el overlap entre todas las instancias de una subestructura (sin

ninguna regla) ó no se permite el overlap entre ninguna instancia perteneciente a una

subestructura. Es decir, todo ó nada. En este contexto, proponemos un nuevo enfoque

llamado overlap limitado. Una de las ventajas principales de este nuevo enfoque es la

capacidad que se le da al usuario para especificar el conjunto de vértices donde el overlap

será permitido. Estos vértices podrían representar elementos significativos en el contexto de

trabajo. Visualizamos directamente tres motivaciones para proponer el nuevo algoritmo, los

cuales serán explicados en las subsecciones siguientes:

1. Reducción del espacio de búsqueda

En sistemas de descubrimiento de conocimiento basados en grafos, el algoritmo de minado

de datos usa grafos como su representación de conocimiento. El espacio de búsqueda de un

xx

algoritmo de este tipo consiste de todos los subgrafos que puedan ser derivados del grafo de

entrada. El proceso de descubrimiento de subestructuras en Subdue comienza con la

creación de subestructuras de un solo vértice a partir del grafo de entrada (una subestructura

por cada etiqueta de vértice que exista al menos 2 veces en el grafo). En cada iteración del

proceso de descubrimiento, el algoritmo selecciona las mejores subestructuras y expande

las instancias de esas subestructuras añadiendo un arco vecino (ó un arco y un nuevo

vértice) en todas las direcciones posibles.

Pero como parte del proceso para seleccionar las mejores subestructuras y entonces

expandirlas existe un proceso de filtrado. En este proceso, de acuerdo al valor del

parámetro overlap las instancias de una subestructura son evaluadas: si el overlap es

permito entonces las instancias compartiendo vértices son mantenidas, de lo contrario si el

overlap no es permito las instancias que comparten vértices son descartadas.

La mejor subestructura descubierta por Subdue (de acuerdo a sus métricas de evaluación),

en cada iteración, puede ser empleada para comprimir el grafo de entrada el cual entonces

puede convertirse en el nuevo grafo de entrada para una siguiente iteración. Después de

varias iteraciones, Subdue crea una descripción jerárquica de los datos, donde

subestructuras descubiertas en cierta iteración pueden estar definidas en base a

subestructuras descubiertas en iteraciones previas. De tal forma, el número de instancias de

una subestructura define el espacio de búsqueda (en cada iteración) en el proceso de

descubrimiento de subestructuras. Como podemos observar, a través de uso del overlap

limitado, se obtiene una reducción del espacio de búsqueda dado que el número de

xxi

instancias candidatas a ser expandidas se condiciona a los valores (vértices) donde el

overlap es permitido, siendo éstos establecidos por el usuario.

2. Reducción del tiempo de procesamiento

La reducción en el número de instancias candidatas a ser expandidas trae consigo una

reducción del espacio de búsqueda, y esto beneficia en una reducción del tiempo de

procesamiento para la búsqueda de subestructuras (patrones).

Permitiendo el overlap en Subdue, hace que éste sea considerablemente más lento (en

cuanto a tiempo de procesamiento) dado que el número de instancias candidatas a ser

expandidas, evaluadas, comparadas, y descubiertas se incrementa como hemos descrito.

Sin embargo, con la implementación del overlap limitado, el número de instancias a ser

procesadas en estas fases decrece resultando en una reducción del tiempo de procesamiento

en todo el proceso de descubrimiento de subestructuras.

3. Búsqueda orientada de patrones con overlap (traslape) selectivo

El overlap limitado da a usuario la capacidad para definir el conjunto de elementos donde el

overlap será permitido y que a su juicio considera relevantes para su contexto de trabajo

(Ej. un objeto espacial, un atributo descriptivo). Estos elementos son representados usando

vértices de acuerdo al modelo propuesto. En contra parte, el algoritmo descartará aquellos

elementos traslapados que el usuario no consideró significativos.

Por lo tanto, el overlap limitado proporciona al usuario un medio para implementar una

búsqueda orientada de patrones con overlap. Es decir, el usuario delimita el conjunto de

xxii

elementos que tendrán un papel preponderante en el proceso de descubrimiento de

subestructuras. Adicionalmente, esta característica ofrece la ventaja de que el proceso de

evaluación de patrones se simplifica dado que el conjunto de resultados generados es menor

porque éstos se centran sobre los requerimientos del usuario.

A.5 Resultados

Como parte de las pruebas para evaluar nuestra propuesta de modelado y minado de datos

espaciales usando un enfoque basado en grafos se desarrollaron tres casos de uso

ilustrativos. Los dos primeros casos fueron implementados usando un censo de población

del centro histórico de la ciudad de Puebla en el año de 1777. El tercer caso de uso fue

desarrollado usando una base de datos espaciales de la región del volcán Popocatépetl.

En esta sección presentamos ejemplos de resultados obtenidos con el caso de uso del

volcán Popocatépetl. Supongamos que deseamos conocer características comunes entre las

poblaciones (asentamientos poblacionales), carreteras y ríos en la zona que nos ayuden a

evaluar/implementar planes de evacuación en caso de una contingencia volcánica. Por

ejemplo, características de carreteras empezando en ó cruzando una población, material

usado para construir esas carreteras y su estado actual (Ej. pavimentadas, terracería),

características de carreteras y ríos que tengan alguna relación entre ellos (Ej. se crucen, se

tocan), ríos cerca de poblaciones que en caso de alta precipitación pluvial pudieran

representar peligros potenciales.

xxiii

Los experimentos fueron desarrollados usando los 5 modelos de representación

actualmente propuestos. En esta sección se presentan patrones encontrados con el modelo

#1 entre carreteras y ríos, carreteras y poblaciones, y por último ríos y poblaciones. La idea

de organizar la presentación de resultados de esta manera es mostrar los diversos patrones

que se pueden encontrar entre estos elementos. Al final de la sección se presentan tablas

comparando los resultados obtenidos con cada uno de los 5 modelos.

La Figura A.6 muestra el patrón más significativo descubierto entre carreteras y ríos usando

el modelo #1. El patrón describe una relación entre “carretera categoría terracería

traslapando un río categoría escurrimiento” en la zona. Este patrón puede considerarse

como un indicador del número de carreteras que necesitan ser supervisadas en caso de una

contingencia volcánica dado el tipo de material con el que están construidas, y porque ellos

atraviesan ríos (la lectura puede ser hecha en orden inverso) que en caso de altas

concentraciones pluviales pueden desbordarse e inutilizarlas. Subdue encontró con no

overlap 46 instancias del patrón en la segunda iteración; vía overlap estándar encontró 85

instancias en la primera iteración; y a través de overlap limitado también encontró 85

instancias en la segunda iteración. Como podemos observar overlap estándar y overlap

limitado encontraron el mismo número de instancias, pero overlap limitado necesitó dos

iteraciones para encontrar el mismo patrón. Sin embargo, esto no quiere decir que overlap

estándar es mejor que overlap limitado (respecto a tiempo de procesamiento) porque

analizando el tiempo global de procesamiento requerido por overlap limitado para finalizar

la fase de descubrimiento de subestructuras notamos que es menor que el requerido por

overlap estándar.

xxiv

a) b)

c)

Figura A.6. Relaciones entre carreteras y ríos usando el modelo #1.

El patrón más significativo, usando el modelo #1, encontrado entre carreteras y poblaciones

es presentado en la Figura A.7. Este describe una relación entre “carretera categoría

terracería tocando un asentamiento poblacional categoría construcción”. “Asentamiento

poblacional categoría construcción” representa en la capa de datos espaciales

“asentamientos” de la base de datos del volcán áreas habitacionales con alta población,

xxv

edificios y una gran cantidad de construcciones usadas para ofrecer servicios a los

habitantes. Si asumimos que la gente podría necesitar ser evacuada en caso de una erupción

y que las carreteras que serían usadas para ese propósito son de terracería, entonces, esta

situación podría convertirse en un problema (Ej. cuello de botella). En este experimento

Subdue encontró vía no overlap 6 instancias del patrón en la novena iteración; a través de

overlap estándar 9 instancias en la cuarta iteración; y usando overlap limitado 8 instancias

en la décima iteración.

a) b)

c)

Figura A.7. Relaciones entre carreteras y poblaciones usando el modelo #1.

xxvi

La Figura A.8 muestra el patrón más significativo encontrado entre ríos y poblaciones

usando el modelo #1. El patrón describe una relación entre “río categoría escurrimiento

cruzando un asentamiento poblacional categoría manzana o construcción" en la zona.

“asentamientos poblacionales categoría manzana" representa en la capa de datos espaciales

“asentamientos” de la base de datos del volcán áreas habitacionales con poca población, de

hecho con muchas áreas deshabitadas, escasos edificios y construcciones. El patrón puede

ser utilizado para identificar zonas potenciales de inundación, habitadas por personas, dada

la cercanía de ríos. A través de no overlap Subdue encontró 5 instancias del patrón en la

décima segunda iteración; usando overlap estándar encontró 5 instancia en la octava

iteración; y vía overlap limitado también encontró 5 instancias en octava iteración. Subdue

encontró el mismo patrón en los tres casos, sin embargo, usando overlap estándar y overlap

limitado la lectura del patrón es más simple.

xxvii

a) b)

c)

Figura A.8. Relaciones entre ríos y poblaciones usando el modelo #1.

La Tabla A.2 presenta una comparación, por modelo, entre el número de instancias

descubiertas/iteraciones necesitadas para descubrirlas y las tres implementaciones de

overlap. Por ejemplo, usando el modelo #1, Subdue encontró 46 instancias (en la segunda

iteración) de un patrón “completo” (nuestra definición para reportar un patrón “completo”

es que éste contengan al menos dos objetos espaciales y la relación espacial entre ellos)

xxviii

conteniendo los objetos espaciales carretera-río vía no overlap. Un valor más alto significa

un modelo que permite descubrir más instancias de una subestructura (patrones).

Recordemos que Subdue reporta como el mejor patrón (por iteración) la subestructura con

el número más alto de instancias descubiertas de esa subestructura. Esta comparación es

reportada por cada estructura “objeto-objeto” (Ej. carretera-río).

Nota: NO (no overlap), SO (overlap estándar), LO (overlap limitado).

Modelo #1 Modelo #2 Modelo #3 Modelo #4 Modelo #5

NO SO LO NO SO LO NO SO LO NO SO LO NO SO LO

Carretera-Río

Instancias 46 85 85 41 85 64 39 85 34 39 85 60 45 85 45

Iteraciones 2 1 2 2 3 2 2 2 9 2 1 2 5 1 5

Carretera-Población

Instancias 6 9 8 5 8 7 4 8 5 6 8 8 6 0 7

Iteraciones 9 4 10 14 6 10 15 6 13 12 10 7 7 0 7

Río-Población

Instancias 5 5 5 5 10 5 5 19 5 5 10 5 5 5 5

Iteraciones 12 8 8 16 7 14 12 4 10 6 6 11 13 6 10

Tabla A.2. Instancias/iteraciones por cada modelo basado en grafos: caso de uso Popocatépetl.

La Tabla A.3 presenta una comparación de máximo/mínimo instancias descubiertas por

cada implementación de overlap. Un modelo con el valor más alto es mejor porque permite

descubrir más instancias de una subestructura. La comparación es presentada por cada

estructura “objeto-objeto”. Por ejemplo en la estructura carretera-río el modelo #1 reportó

46 instancias descubiertas en la segunda iteración (el valor más alto).

xxix

Máximo Mínimo

Carretera-Río

No overlap modelo #1 (segunda iteración) modelos #3 y #4 (segunda iter.)

Overlap estándar modelos #1, #4 y #5 (primera iter.) modelo #2 (tercera iteración)

Overlap limitado modelo #1 (segunda iteración) modelo #3 (novena iteración)

Carretera-Población

No overlap modelo #5 (séptima iteración) modelo #3 (quinceava iteración)

Overlap estándar modelo #1 (cuarta iteración) modelo #5 (patrón no completo)

Overlap limitado modelo #4 (séptima iteración) modelo #3 (treceava iteración)

Río-Población

No overlap modelo #4 (sexta iteración) modelo #2 (dieciseisava iteración).

Overlap estándar modelo #3 (cuarta iteración) modelo #1 (octava iteración).

Overlap limitado modelo #1 (octava iteración) modelo #2 (catorceava iteración)

Tabla A.3. Max/Min de instancias descubiertas por “objeto-objeto”/característica overlap.

La Tabla A.4 presenta una comparación entre el promedio de instancias descubiertas por

modelo. Un valor más alto significa un modelo permitiendo descubrir más instancias de

una subestructura. Cada valor representa el promedio de subestructuras descubiertas usando

no overlap, overlap estándar y overlap limitado. La comparación es reportada por cada

estructura “objeto-objeto”.


Carretera-Río 72.0 63.3 52.7 61.3 58.3

Carretera-Población 7.7 6.7 5.7 7.3 4.3

Río-Población 5.0 6.7 9.7 6.7 5.0

Tabla A.4. Promedio de instancias descubiertas por modelo/“objeto-objeto”.

xxx

La Tabla A.5 presenta una comparación entre el promedio de instancias descubiertas por

modelo. Un valor más alto significa un modelo permitiendo descubrir más instancias de

una subestructura. La comparación es reportada por cada implementación de overlap.



19.0 33.0 32.7 17.0 34.3 25.3 16.0 37.3 14.7 16.7 34.3 24.3 18.7 30.0 19.0

Tabla A.5. Promedio de instancias descubiertas por modelo/característica overlap.

La Tabla A.6 presenta un comparativo final entre el promedio de instancias descubiertas

por modelo. Podemos ver en la tabla que el modelo #1 reporta el valor más alto de

instancias descubiertas (de acuerdo a nuestros parámetros para reportar instancias

completas) en este caso de uso ilustrativo. Los siguientes modelos son el modelo #2 y el

modelo #4 respectivamente.


28.2 25.6 22.7 25.1 22.6

Tabla A.6. Promedio de instancias descubiertas por modelo.

A.6 Conclusiones

La constante interacción entre los seres humanos y su hábitat natural, el planeta de la tierra,

genera día a día, nuevos requerimientos asociados al manejo y explotación de datos

espaciales. Por ejemplo, el análisis urbano, la prevención los riesgos naturales, la

exploración del espacio estelar, la contaminación en los océanos, y la reforestación de los

xxxi

suelos, por nombrar algunos de ellos. La minería de datos espaciales involucra la

integración de métodos y técnicas provenientes de diversos campos científicos los cuales

nos ayudan, por medio de algoritmos de análisis y de descubrimiento, a producir una

enumeración particular de patrones sobre los datos espaciales.

Nuestra argumentación en esta disertación doctoral se basa en la idea de que los enfoque de

minaría de datos espaciales descritos no consideran todos los elementos encontrados en una

base de datos espaciales (datos espaciales, datos no espaciales y relaciones espaciales entre

los objetos espaciales) de una manera integral. En consecuencia, se propuso emplear un

enfoque basado en grafos para representar estos elementos como un solo conjunto de datos,

minarlos como un todo, para de estar forma poder encontrar patrones conteniendo ambos

tipos de datos y relaciones espaciales (patrones más descriptivos).

En nuestro modelo las relaciones espaciales entre los objetos espaciales son incluidas

porque una característica significativa de los datos espaciales es la influencia que los

vecinos de un objeto pueden tener en el objeto mismo. En el modelo incluimos tres tipos de

relaciones espaciales. Derivado del modelo general se propusieron cinco modelos

operativos. Tres aspectos definen las características de un grafo creado con estos modelos:

(1) Representación de las relaciones espaciales equivalentes. (2) Representación de

relaciones espaciales simétricas. (3) La manera de representar los objetos y sus relaciones.

Como parte integrante de nuestra metodología para el minado de datos espaciales usando

un enfoque basado en grafos, usamos el sistema Subdue como nuestra herramienta de

minado. Se propuso un nuevo algoritmo llamado overlap limitado el cual le da al usuario la

xxxii

capacidad de especificar el conjunto de vértices sobre los cuales el overlap es permitido.

Visualizamos tres motivaciones para proponer este nuevo enfoque: (1) Reducción del

espacio de búsqueda. (2) Reducción del tiempo de procesamiento. (3) Búsqueda orientada

de patrones con overlap (traslape) selectivo.

Para demostrar la viabilidad, capacidad de minado y descubrimiento de patrones usando el

enfoque propuesto, se desarrolló un prototipo implementado nuestro modelo para crear los

conjuntos de datos basados en grafos, minar esos grafos (a través del llamado al sistema

Subdue) y visualización de patrones encontrados. Los resultados generados de los casos de

uso desarrollados nos dan un panorama respecto a cómo y qué podríamos obtener usando

este enfoque. Es importante comentar el hecho de que podemos utilizar esta metodología de

modelado y representación en cualquier dominio que pueda ser representado como grafo.

Una vez demostrada la viabilidad de nuestra propuesta, las perspectivas relacionadas a

mejorar nuestro trabajo (modelo de representación basado en grafos, algoritmo de minado

de datos y sistema prototipo) incluyen los puntos siguientes:

• Visualización de conocimiento descubierto. Por ejemplo, visualización de

resultados sobre las capas espaciales, a través del uso de gráficas, y navegación en

la jerarquía de patrones descubiertos usando un enfoque de hypergrafo.

• Mejoramiento de los algoritmos empleados para crear los conjuntos de datos

basados en grafos de acuerdo a los modelos propuestos. La validación de

relaciones espaciales entre objeto espaciales es una fase que en la mayoría de los

casos requiere gran cantidad de recursos de computacionales.

xxxiii

• Minado de los grafos. Se empleó el sistema Subdue como herramienta de minado

de datos. Así mismo, se propuso un nuevo algoritmo llamado overlap limitado. El

isomorfismo de grafos es un problema NP-completo, de tal forma, debemos ser

capaces de que nuestros tiempos de procesamiento para la búsqueda de patrones

cumpla parámetros aceptables de eficiencia.

• Manejo y representación de relaciones entre datos no espaciales. Las relaciones

implícitas y explicitas entre los atributos describiendo los objetos espaciales pueden

ser incluidos en el modelo con la finalidad de mejorar la representación de los datos.

A.7 Contribución

La contribución al descubrimiento de conocimiento en el dominio de los datos espaciales,

descrito en esta disertación, es el desarrollo de un nuevo enfoque para el modelado y

minado de datos espaciales usando una representación basada en grafos. Este enfoque

incluye los aspectos siguientes:

• Se propuso una nueva representación de datos basada en grafos para datos

espaciales. Se visualizaron dos objetivos para crear un modelo de datos con estas

características. El primero de ellos es crear un único conjunto de datos, basado en

grafos, representando estos elementos relacionados. El segundo es emplear este

conjunto de datos para alimentar a un sistema de minado de datos basado en grafos,

de tal forma que pudiéramos descubrir patrones conteniendo datos espaciales, no

espaciales y relaciones espaciales los cuales nos ayuden a describir/entender los

xxxiv

datos, basado en la premisa, de que estos son elementos relacionados en el mundo

real.

• Se propuso un nuevo algoritmo para descubrir subestructuras (patrones) usando un

enfoque de overlap limitado en el sistema Subdue, nuestra herramienta de minado

de datos. Visualizamos directamente tres motivaciones para proponer la

implementación del nuevo algoritmo: reducción del espacio de búsqueda, reducción

del tiempo de procesamiento y búsqueda orientada de patrones con overlap

selectivo (specialized overlapping pattern oriented search).

• Se diseñó e implementó un sistema prototipo implementado el modelo propuesto. El

prototipo ofrece una interfase de usuario amigable para el manejo de las capas

espaciales con las que se trabajará, para la creación de grafos espaciales y no

espaciales, para el minado de estos grafos (a través del llamado al sistema Subdue)

y para el despliegue de los resultados generados.

xxxv

Appendix B

RÉSUMÉ EN FRANÇAIS

B.1. Introduction

Durant ces dernières années nous avons été témoins de la rapide croissance du nombre, de

la capacité et de la dissémination des applications informatiques consacrées à l'obtention, la

génération, la manipulation, et le stockage de données dans divers milieux de la vie

humaine. Ces actions ont impliqué le recueil d'une grande quantité de données dont

l'analyse par des moyens manuels devient chaque jour plus compliquée. Rappelons que

dans plusieurs occasions les données “crues” ont besoin d'être analysées et interprétées

pour qu'elles soient transformées en informations utiles et profitables. Cette situation a été

résolue grâce à des techniques et des outils informatiques qui, de plus en plus nombreux,

nous aident pour atteindre ces objectifs. La découverte de connaissances dans bases de

données (KDD, sigle en langue anglaise) est définie comme l'extraction non banale

d'informations implicites, préalablement inconnues et potentiellement utiles à partir de

données [16]. Il s'agit d'un processus itératif et interactif qui comprend différentes phases.

Le noyau du processus est la phase du data mining (fouille de données), qui peut être

conceptualisée comme l'application d'algorithmes d'analyse de données et de fouille de

xxxvi

données qui grâce à des paramètres adéquats produisent et découvrent une énumération

particulière de patrons (régularités) sur des données elles-mêmes [12].

Dans ce contexte, l'analyse et l'exploitation de données provenant de phénomènes existant

en, sur, et sous la surface de la terre – appelés données spatiales –, ont engendré un nouveau

domaine d'investigation appelé fouille de données spatiales, de telle sorte que la fouille de

données spatiales s'envisage comme la découverte de connaissances implicites, et préala-

blement inconnues de données spatiales [16]. Diverses approches pour la fouille de données

spatiales ont été développées, dont ne seront examinées que les plus représentatives.

B.1.1 Méthodes basées sur la généralisation

La généralisation a démontré être une des méthodes effectives pour découvrir des

connaissances. Elle a été introduite par la communauté d'apprentissage automatique et se

base sur l'apprentissage à partir d'exemples. La découverte de connaissances basée sur la

généralisation requiert la construction de hiérarchies de concepts (donnés explicitement par

l'expert ou générés automatiquement). Dans le cas des bases de données spatiales, on peut

trouver deux types de hiérarchies de concepts: (1) Hiérarchies thématiques, par exemple,

généraliser des tomates et des bananes comme des fruits, les fruits et légumes comme des

aliments d'origine végétale, etc. (2) Hiérarchies spatiales, par exemple, généraliser une série

de points spatiaux comme une région ou pays.

Lu et al. [35] étendent une technique d'induction basée sur les attributs, aux bases de

données spatiales. Cette technique construit la hiérarchie de généralisation en résumant les

xxxvii

relations entre les données spatiales et non spatiales à un niveau de concept plus élevé. Les

auteurs présentent deux algorithmes basés sur la généralisation: (1) Approche de

domination de données non spatiales. Cette méthode réalise en première instance une

induction basée sur les attributs aux données non-spatiales et après cette étape, aux données

spatiales. (2) Approche de domination de données spatiales ; étant donné la hiérarchie de

données spatiales, la généralisation est effectuée d`abord sur ces données et après sur les

données non spatiales.

B.1.2 Regroupement

Le regroupement (clustering) est le processus permettant de regrouper de manière physique

ou abstraite des objets en classes d'objets semblables. Cette approche de fouille de données

nous aide à construire des groupes “représentatifs” d'un ensemble d'objets basé sur une

mesure de similitude (Ex. distance euclidienne). En d'autres termes, le regroupement de

données identifie des groupes (clusters) ou des régions densément peuplés en accord avec

une certaine mesure de distance dans un ensemble de données multidimensionnelles. On

peut classifier les algorithmes de regroupement en quatre groupes principaux : algorithmes

de regroupement basés sur les approches k-means (centre de gravité du cluster) et k-medoid

(objet représentatif du cluster), algorithmes hiérarchiques, algorithmes basés sur la

localisation des objets (regroupement par densité), et dernièrement ceux qui sont basés sur

des grids.

xxxviii

B.1.3 Associations spatiales

Une règle d'association spatiale est une règle qui décrit l'implication d'un ensemble d'objets

vers un autre ensemble d'objets d'une base de données spatiales [29]. Un exemple d'une

règle d'association spatiale peut être “ si la entreprise est près de Mexico City est alors une

grande entreprise”. Une règle d'association spatiale est de la forme X Y, où X et Y sont des

ensembles de prédicats spatiaux ou non spatiaux. Il existe plusieurs types de prédicats

spatiaux qui pourraient constituer une règle d'association spatiale, par exemple, des

relations topologiques comme intersection, recouvrement et des relations d'orientation

spatiale comme Gauche_de, ou Ouest_de.

B.1.4 Rapprochement et agrégation

Les méthodes basées sur le rapprochement et l'agrégation cherchent à analyser les

caractéristiques des groupes d'objets (clusters) pour former des groupes d'objets (features)

proches entre eux. La proximité devient la mesure permettant de regrouper un ensemble de

points dans un cluster en un feature. L'idée de trouver des relations de proximité n'est pas

un problème simple comme on pourrait le croire, car il existe trois raisons pour cette

affirmation. Supposons qu'on ait un cluster de points et que l'on veuille trouver les k-

features les plus proches de lui :

• La taille et forme du cluster et les features peuvent être très variés.,

• On pourrait avoir une grande quantité de features à examiner,

• Même pour trouver une forme connue (ex. polygone) qui décrit la forme du cluster,

il serait impropre de reporter les features où les limites soient plus proches aux

xxxix

limites de celui-ci, parce que la distribution des points à l'intérieur du cluster ne peut

pas être uniforme.

B.1.5 Fouille de données-images

L'extraction de patrons à partir d'images est autre approche de la fouille de données

spatiales. Dans la littérature il existe divers travaux de fouille de données développés selon

cette approche. Par exemple, Fayyad et al. [14] présentent un système pour l'identification

et la catégorisation des volcans sur la surface de Venus à partir d'images transmises par la

sonde stellaire Magellan. Dans un autre travail [15] (Second Palomar Observatory Sky

Survey) les auteurs ont utilisé des arbres de décision pour la classification des galaxies, des

étoiles et des autres objets stellaires. Stolorz et al. [44] d'une part, et Shek et al. [41] d'autre

part ont effectué des fouilles de données spatio-temporelles de données géophysiques.

B.1.6 Classification de données spatiales

La classification de données spatiales a comme objectif de trouver des règles qui divisent

un ensemble d'objets en un nombre de groupes dans lesquels les objets de chaque groupe

appartiennent à une même classe. Divers types d'information peuvent être utilisés pour

caractériser les objets spatiaux, comme par exemple, les attributs non spatiaux d'un objet,

les prédicats spatiaux et les fonctions spatiales. L'idée est d'utiliser cette information pour

en extraire les attributs pour l'étiquetage de classes (attributs qui divisent les données en

classes) et ceux dont les valeurs sont utilisés dans un arbre de décision pour créer de

nouvelles branches.

xl

B.1.7 Détection de tendances spatiales

La détection de tendances spatiales décrit les changements réguliers d'un ou plus attributs

non spatiaux d'un objet quand celui-ci se déplace d'un point de référence initiale. Un

exemple de tendance spatiale serait “si on s'éloigne du centre historique de la ville de

Puebla le prix des terrains décroit”. Les trajectoires de mouvement à partir d'un point x sont

utilisées pour modéliser ce mouvement, et l'analyse de régression sur les attributs des objets

peut être utilisée pour décrire les patrons de changements. Il existe deux types de tendances

: globales et locales.

B.2. Motivation

Bien que les approches précitées effectuent les fouilles de données spatiales avec succès,

notre perception est que celles-ci ne considèrent pas tous les éléments trouvés dans une

base de données spatiales (données spatiales, données non spatiales et relations spatiales

entre les objets spatiales) d'une manière exhaustive. C'est-à-dire que certaines d'entre elles

d'abord réalisent la fouille de données spatiales et ensuite celles non spatiales ou vice-versa,

et d'autres autorisent des combinaisons de ces éléments mais de manière restreinte. Au vu

des considérations précédentes, on propose l'argument suivant : si nous sommes capables

de représenter les données spatiales, non spatiales et les relations entre objets spatiales

comme un unique ensemble de données, et on les fouille ainsi, on pourrait générer ou

trouver des patrons de connaissances qui caractérisent notre ensemble de données contenant

ces trois éléments de manière conjointe. Pour un tel objectif, on émet l'hypothèse qu'une

xli

représentation basée sur graphes est suffisamment flexible et puissante pour représenter ces

éléments de manière conjointe, facilement compréhensible et capable de créer différentes

représentations du même domaine. Le domaine est décrit en utilisant des graphes, ces

graphes se transformant en données d'entrée pour un outil de découverte de connaissances

basé sur les graphes lequel utilise des heuristiques pour sélectionner des sous-graphes qui

sont considérés comme importants (patrons).

Un graphe est défini comme une paire G = (V,E). V = {v1,…,vn} dénote un ensemble fini

d'éléments appelés sommets. E est un ensemble d'arcs e satisfaisant E ⊆ [V]2. Donc,

chaque arc e ∈ E est une paire (vi,vj). Si (vi,vj) est une paire ordonnée pour n'importe quel

(vi,vj) ∈ E, on dit que G = (V,E) est un graphe orienté. Un graphe étiqueté possède des

étiquettes associées à leurs sommets et aussi aux arcs.

Après avoir proposé la représentation de relations spatiales entre les objets spatiaux, dans la

suivante section on détaillera les trois types de relations spatiales que l'on utilisera.

B.2.1 Relations spatiales

La position explicite des objets spatiaux définit des relations implicites de voisinage

(neighborhood) spatial entre eux. Ainsi, l'information sur le voisinage des objets spatiaux

constitue un élément de valeur qui doit être considéré comme le travail de fouille de

données spatiales. Martin Ester et al. [9][11] introduisent le concept de graphes de

voisinage pour représenter explicitement ces relations de voisinage implicite. Les graphes

de voisinage peuvent couvrir les relations de voisinage suivantes :

xlii

• Topologiques : dérivées du modèle à neuf intersections [6][7][8], ce sont des

relations que restent invariables sous des transformations linéaires, c'est-à-dire que

si les deux objets simultanément tournent, se déplacent ou changent d'échelle les

relations entre eux sont conservées.

• Distance : la distance entre deux objets compare à un seuil grâce à des opérateurs

arithmétiques tels que <, >, =. La distance entre deux objets se définit comme la

distance minimum entre eux (Ex. sélectionner tous les éléments à l'intérieur d'un

rayon de 50 kilomètres d'un point x).

• Direction : la relation spatiale R de direction entre deux objets spatiaux A et B (B R

A) est définie en utilisant un point représentatif de l'objet A et tous les points de

l'objet but B. Le point représentatif de l'objet source A peut être le centre de l'objet

ou un point sur ses limites. Ce point représentatif est utilisé comme l'origine d'un

système de coordonnées virtuelles et son quadrant définit la direction.

Une fois décrite la motivation de notre travail de recherche, nous présentons à la suite le

modèle général basée sur des graphes pour représenter les données spatiales.

B.3 Représentations basées sur des graphes

Comme dit auparavant, notre proposition s'appuie sur la création d'un modèle basé sur les

graphes pour représenter conjointement données spatiales, non spatiales, relations spatiales

entre les objets spatiaux. L'idée est d'utiliser les graphes générés comme données d'entrée

pour un algorithme de fouille de données, de telle sorte que l'algorithme puisse trouver des

xliii

patrons concernant ces éléments de manière conjointe et non comme des éléments séparés.

En conséquence, on propose le modèle de représentation donné en notation UML Figure

B.1.

Vertex

1..*

1..*


Spatial relation

Spatial object


-From:-To:

Edge

Figure B.1. Modèle basé sur des graphes pour représenter des données spatiales.

Dans ce modèle, les données spatiales (ex. objets spatiaux), données non spatiales (ex.

attributs descriptifs), et relations spatiales sont représentées comme une collection d'un ou

plusieurs graphes orientés étiquetés. Les sommets peuvent représenter des objets spatiaux,

types de relation spatiale entre deux objets (relation binaire), ou des attributs non spatiaux

décrivant les objets spatiaux. Les arcs représentent un lien existant entre deux sommets de

n'importe quel type. Dépendant du type de sommets qu'un arc unit, celui-ci peut représenter

le nom d'un attribut descriptif ou le nom d'une relation spatiale. Les noms des attributs

peuvent se rapporter à des descriptions d'objets spatiaux et/ou à entités non spatiales. On

utilise des arcs orientés pour représenter les informations directionnelles, de relations entre

deux éléments (Ex. objet x couvre objet y) et pour décrire des attributs appartenant à un

objet (Ex. objet x a attribut z).

xliv

Actuellement il existe cinq représentations à partir du modèle général décrit auparavant.

Trois aspects définissent les caractéristiques des graphes créés dans chaque modèle: (1)

Représentation des relations spatiales équivalentes (Ex. toucher, recouvrir). (2)

Représentation des relations spatiales symétriques (Ex. contient-de/dans_de,

Nord_de/Sud_de). (3) La manière de représenter les objets et leurs relations dans le modèle.

En conséquence, les graphes créés se différencient de manière quantitative et qualitative.

Dans la partie quantitative il y a des différences telles que : nombre de sommets et d'arcs

employés pour représenter les données spatiales, non spatiales et les relations, création de

graphes simples, l'usage des arcs orientés et non orientés pour représenter des relations

spatiales et/ou des attributs descriptifs. Dans la partie qualitative on a observé, à travers des

expériences, que certains modèles ont une plus grande expressivité pour représenter

l'ensemble de données, conditionnant de manière directe la qualité des résultats générés

dans le processus de fouille. Ensuite on présente trois des cinq modèles proposés décrivant

les métriques d'évaluation créées pour caractériser à chacun d'eux.

Modèles

Dans le but de décrire les caractéristiques de chaque modèle, on va utiliser à titre d'exemple

l'ensemble de données de la Figure B.2. Comme on peut l'observer, notre ensemble de

données se compose de deux objets spatiaux, l'objet A représentant une maison et l'objet B

représentant un lac, et les trois relations spatiales suivantes: (1) Distance, objet A près de

objet B (relation équivalente). (2) Topologique, objet A touche objet B (relation

équivalente). (3) Direction, objet A Sud de objet B (relation symétrique).

xlv

North

South

EastWest

Object B

Object A

Figure B.2. Base d'exemple pour caractériser les 3 modèles proposés.

Modèle n°1 - modèle de base

La Figure B.3 montre le premier modèle créé pour représenter des données spatiales dans

l'approche proposée. Les caractéristiques du modèle selon les métriques créées pour sa

caractérisation sont :

• Nom. Sommets : 2 sommets, chacun représentant un objet spatial (objet A et objet

B).

• Nom. Arcs : 4 arcs, 3 arcs pour représenter les relations spatiales originales

existantes dans notre ensemble de données d'exemple (“près”, “touche” et

“Sud_de”) et un arc pour représenter la relation “Nord_de”, créé à partir de la

relation symétrique original “Sud_de”.

• Taille (sommets + arcs) : 6

• % incrément : 0%, celui-ci est le modèle base.

• Graphe simple. Non. Car c'est un graphe complexe avec 4 arcs unissant 2 sommets.

• Arc orienté. Oui, car on utilise les relations symétriques “Sud_de” et “Nord_de”. La

direction des arcs se fait avec la lecture des relations entre les objets.

xlvi

• Arc non orienté. Oui, car on utilise les relations équivalentes “près” et “touche”.

• Information complète. Oui, car dans le graphe est représentée la relation

symétrique“Nord_de” créée à partir de la relation symétrique original “Sud_de”.

• Arc “Relation” redondant. Non, car dans le modèle on n'utilise pas les arcs

“Relation”.

Spatialobject A

Spatialobject B

South_of

touch

close

North_of

Figure B.3. Modèle n°1 - modèle base.

Modèle n°2 - réplication simple de types de relation, information complète

Dans la Figure B.4 on représente le deuxième modèle créé pour représenter les données

spatiales. Les caractéristiques du modèle selon les métriques sont les suivantes :

• Nom. Sommets : 5 sommets, 2 sommets pour représenter les objets spatiales et trois

sommets pour représenter les types de relations spatiales “topologique”, “distance”

et “direction”. Pour chaque type de relation spéciale existant entre deux objets, on

ajoute un sommet étiqueté avec le nom du type de la relation spatiale. Dans

l'exemple il existe une relation “topologique”, une relation de “distance” et une

relation de “direction”.

• Nom. Arcs : 6 arcs, 3 arcs pour représenter les relations originales, 2 arcs pour

représenter les relations équivalentes (“près” et “touche”) créées à partir des

xlvii

relations originales et un arc pour représenter la relation symétrique (“Nord_de”)

créée aussi à partir des relations originales.


• % incrément : +83.33%

• Graphe simple. Oui, car il existe un arc supplémentaire entre n'importe quel couple

de sommets donnés.

• Arc orienté. Oui, car on utilise toutes les relations. La direction des arcs va des

sommets représentant les objets spatiaux aux sommets représentant les types de

relations spatiales.

• Arc non orienté. Non, car dans le modèle on n'utilise pas d'arcs non orientés.

• Information complète. Oui, car on représente les relations symétriques créées à

partir des relations originales.

• Arc “Relation” redondant. Non, dans le modèle on n'utilise pas d'arcs “Relation”.

South_of

touch

close close

touch

North_of

Distance

Direction


object B

Figure B.4. Modèle n°2 - réplication simple des types de relation, information complète.

xlviii

Modèle n°3 - double réplication des types de relation, information non complète

Dans la Figure B.5 on présente le troisième modèle créé pour représenter des données

spatiales utilisant une approche de graphes. Les caractéristiques en accord avec les

métriques sont:

• Nom. Sommets : 8 sommets, 2 sommets pour représenter les objets spatiaux et 6

sommets pour représenter les types de relations spatiales (“distance”, “topologique”

et “direction”). Pour chaque type de relation spatiale entre deux objets spatiaux, on

ajoute deux sommets étiquetés avec le nom du type de la relation spatiale. Par

exemple, dans nos données d'essai il existe trois types de relations : 1 relation

“topologique”, 1 relation “distance” et 1 relation “direction” ; ainsi on ajoute 6

sommets, 2 pour chaque type de relation spatiale.

• Nom. Arcs : 9 arcs, 6 arcs “Relation” pour unir les sommets représentant les objets

spatiaux avec les sommets représentant les types de relations spatiales (de chaque

sommets représentant un objet spatial naissent 3 arcs puisqu'il existe 3 types de

relation), et 3 arcs pour présenter les relations spatiales originales. Ces 3 arcs sont

utilisés pour unir les sommets représentant les types de relations spatiales.


• % incrément : +183.33%

• Graphe simple. Oui, car il existe au moins un arc entre n'importe quel couple de

sommets données.

• Arc orienté. Oui, car on utilise les relations symétriques et les arcs “Relation”. La

direction des arcs “Relation” va des sommets représentant les objets spatiaux aux

xlix

sommets représentant les types de relations spatiales. La direction des arcs restant se

fait avec la lecture des relations spatiales entre les objets.

• Arc non orienté. Oui, car on les utilise pour représenter les relations équivalentes.

• Information complète. Non, car dans le modèle ne sont pas représentées les

relations symétriques qui sont créées à partir des relations spatiales originales.

• Arc “Relation” redondance Oui, car dans le modèle on utilise des arcs “Relation”

pour représenter explicitement l'existence et type d'une relation spatiale entre 2

objets spatiaux. De plus, on emploie ces arcs pour éviter la création de graphes

complexes.

South_of

touch

close

Relation

Distance

Topological

Direction

Distance

Topological

Direction

Spatialobject A

Spatialobject B

Relation

R elation

Relation

RelationR

elation

Figure B.5. Modèle n°3 - double réplication des types de relation, information non complète.

La Table B.1 présente les résultats des neuf métriques développées pour évaluer les

caractéristiques de chaque modèle proposé (actuellement 5 modèles). Le modèle n°1 est

appelé le modèle base.

l

Modèle Nom.

Sommet

Nom.

Arcs

Dimensions

(s + a)

%

Incrément

Graphe

Simple

Arc

Orienté

Arc non

Orienté

Information

Complète

Arc

“Relation”

Redondant

(1) (2) (3) (4) (5) (6) (7) (8) (9)

n°1 2 4 6 - Non Oui Oui Oui Non

n°2 5 6 11 +83.33 Oui Oui Non Oui Non

n°3 8 9 17 +183.33 Oui Oui Oui Non Oui

n°4 5 6 11 +83.33 Oui Oui Oui Non Oui

n°5 8 12 20 +233.33 Oui Oui Non Oui Oui

Table B.1. Caractéristiques des modèles de représentation basés sur des graphes.

Les métriques ont été proposées sur la base de causes à effets de chacune aussi bien dans le

graphe créé que dans l'algorithme de fouille. Nous apercevons quatre caractéristiques

significatives liées directement à ces métriques.

1. Espace de recherche

L'espace de recherche dans un algorithme de fouille de données basé sur des graphes

consiste en la liste des sous-graphes qui peuvent être dérivés du graphe initial, de telle

manière que les nombres de sommets (1) et arcs (2) du graphe créé (3) définissent la taille

de l'espace de recherche pour le système de découverte. Donc, l'objectif doit être de

minimiser le nombre de sommets et d'arcs utilisés pour créer les graphes mais en même

temps maximiser la représentativité des ceux-ci. Comme on peut le voir dans la Table B.1,

le modèle utilisant le nombre minimum de sommets et d'arcs pour représenter l'ensemble de

données d'exemple c'est le modèle n°1 (2 sommets et 4 arcs) alors que le modèle n°5 est le

cas opposé (8 sommets et 12 arcs).

li

2. Temps de traitement

La taille de l'espace de recherche joue un rôle important touchant au temps de traitement

utilisé pour découvrir des patrons. Si on dispose d'un grand espace de recherche, il faudra

plus de temps pour évaluer tous les sous-graphes possibles. Donc, une comparaison de la

métrique "pourcentage d'incrémentation" (4) entre les modèles proposés figure dans la

Table B.1. Rappelons que cette métrique compare la taille d'un modèle donné par rapport

au modèle n°1 (modèle base). Par exemple, le modèle n°5 augmente la taille du graphe de

233.33% par rapport au modèle n°1. C'est-à-dire que l'algorithme de fouille aura besoin

d'évaluer 233.33% plus de sommets et/ou d'arcs en utilisant le modèle n°5 au lieu du

modèle n°1 pour le même ensemble de données.

3. Complexité du graphe

Dans le chapitre 4 du mémoire de thèse, on décrit le système Subdue, notre outil de fouille

de données basées sur des graphes, outil provenant de l'Université du Texas à Arlington.

Comme dit précédemment, il existe une plus grande complexité pour l'algorithme de fouille

travaillant avec des graphes complexes au lieu de graphes simples (Ex. au maximum un arc

unissant n'importe quel couple de sommets donné). Par exemple, de plus, il faut tenir

compte dans le traitement de comparaison des graphes, de la phase d'extension (Subdue

emploie une approche d'"extension" pour découvrir des patrons), et de l'étape de

compréhension du graphe. Donc, l'objectif a été de proposer des modèles basés sur les

graphes qui nous permettent de créer des graphes plus simples. Comme on peut le voir dans

la Table B.1, seulement le modèle n°1 ne permet pas de créer des graphes simples.

lii

En conséquence, comme stratégie pour réduire la multiplicité du nombre d'arcs entre deux

sommets, on emploie les approches suivantes:

• Ajouter un nouveau sommet étiqueté avec le nom du type de la relation (Ex.

topologique, distance et direction) pour chaque relation spatiale entre les objets

spatiaux. Cette approche est utilisée dans le modèle n°2.

• Ajouter un nouveau sommet (modèle n°4) ou deux nouveaux sommets (modèle n°3

et modèle n°5) étiquetés comme le nom du type de la relation spatiale (Ex.

topologique, distance et direction) par chaque relation spatiale entre les objets

spatiaux et fusionner ce nouveau sommet (modèle n°4) ou de nouveaux sommets

(modèle n°3 et modèle n°5) avec les sommets représentant les objets spatiaux parmi

des arcs étiquetés comme "Relation". L'approche employée pour fusionner les

sommets diffère pour chaque modèle selon la définition de chacun d'eux. Cette

nomenclature est utilisée pour représenter le fait qu'il existe une relation spatiale

entre les objets spatiaux. Ces arcs sont connus comme arcs “Relation” redondants

(9).

4. Représentativité des données

Les métriques arcs orientés (6), arcs non orientés (7), et information complète (8) sont

utilisées pour maximiser la représentativité des données mais minimisant, aussi tant que

possible, la taille du graphe et sa complexité. Les arcs orientés sont utilisés pour représenter

les relations spatiales symétriques (objet A Nord_of objet B, implique, B Sud_of A), les arcs

“Relation” redondantes, les attributs non spatiaux qui décrivent les objets spatiaux. Les

arcs non orientés sont utilisés pour représenter les relations spatiales équivalentes (la

liii

relation est représentée par un arc non orienté au lieu de deux arcs orientés). Finalement,

l'information complète signifie que les relations spatiales symétriques entre objets spatiaux

aussi sont représentées dans le modèle.

B.4 Fouille du graphe

La caractéristique de recouvrement (partage de sommets appartenant à différentes instances

d'une sous-structure) accomplit un rôle important dans le système de découverte de patrons

(sous-structures) dans notre outil de fouille de données basés sur les graphes à savoir le

système Subdue. En conséquence, les résultats générés sont conditionnés par le

fonctionnement de cette caractéristique. Pourtant, son implémentation actuelle est régulière

: permettre le recouvrement entre toutes les instances d'une sous-structure (sans aucune

règle) ou ne pas autoriser le recouvrement entre aucune instance appartenant à une sous-

structure. C'est-à-dire, tout ou rien. Dans ce contexte, on propose une nouvelle approche

appelée recouvrement limité. Un des avantages principaux de cette nouvelle approche est la

capacité d'autoriser l'usager à spécifier l'ensemble de sommets où le recouvrement sera

permis. Ces sommets pourraient représenter les éléments significatifs dans le contexte de

travail. On donnera directement trois motivations pour proposer un nouvel algorithme,

lesquelles seront expliquées dans les sous-sections suivantes:

1. Réduction de l'espace de recherche

Dans les systèmes de découverte de connaissances basés sur des graphes, l'algorithme de

fouille de données utilise des graphes comme représentation de connaissance. L'espace de

liv

recherche d'un tel algorithme consiste en tous les sous-graphes qui peuvent être dérivés du

graphe initial. Le processus de découverte de sous-structures en Subdue commence par la

création de sous-structures d'un seul sommet à partir du graphe d'entrée (une sous-structure

par chaque étiquette de sommet qui existe au moins 2 fois dans le graphe). Dans chaque

itération du processus de découverte, l'algorithme sélectionnera les meilleures sous-

structures et étend les instances de ces sous-structures en ajoutant un arc voisin (ou un arc

et un nouveau sommet) dans toutes les directions possibles.

Mais comme partie du processus de sélection des meilleurs substructures et donc

d'extension, il existe aussi un processus de filtrage. Dans ce processus, en accord avec la

valeur du paramètre de recouvrement, les instances d'une sous-structure sont évaluées : si le

recouvrement est permis, les instances partageant des sommets sont maintenues ; au

contraire si le recouvrement n'est pas autorisé les instances qui partagent les sommets sont

écartés.

La meilleure sous-structure découverte par Subdue (en accord à les métriques d'évaluation),

dans chaque itération, peut être employée pour comprimer le graphe initial qui peut donc

se transformer en un nouveau graphe d'entrée pour une interaction suivante. Après

plusieurs itérations, Subdue crée une description hiérarchique des données, où les

substructures découvertes lors de certaines itérations peuvent être définies sur la base de

sous-structures découvertes lors des itérations préalables. Ainsi, le nombre d’instances

d'une substructure définit l'espace de recherche (dans chaque itération) dans le processus de

découverte de sous-structures. Comme on peut l'observer à travers de l'utilisation du

recouvrement limité, on obtient une réduction de l'espace de recherche puisque le nombre

lv

d'instances candidates à être étendues conditionne les valeurs (sommets) où le

recouvrement est permis.

2. Réduction du temps de traitement

La réduction du nombre d'instances candidates à être étendues apporte une réduction de

l'espace de recherche, et cela contribue à une réduction du temps de traitement pour la

recherche de sous-structures (patrons).

Permettre le recouvrement en Subdue, implique une évolution du temps de calcul

puisqu'augmente le nombre d'instances candidates à être étendues, évaluées, comparées, et

découvertes. Pourtant, avec l'implémentation du recouvrement limité, le nombre d'instances

à être traitées dans ces phases décroît contribuant à une réduction du temps de traitement

dans tout le calcul de découverte de sous-structures.

3. Recherche orientée de patrons avec recouvrement (partager) sélectif

Le recouvrement limité donne à l’usager la capacité de définir des ensembles d’éléments où

le recouvrement sera permis et qu'à son avis il considère comme pertinents pour son

contexte de travail (Ex. un objet spatial, un attribut descriptif). Ces éléments sont

représentés en utilisant des sommets en accord avec le modèle proposé. En contre partie,

l'algorithme écartera les éléments que l'utilisateur n'a pas considérés comme significatifs.

Par conséquent, le recouvrement limité fournit à l'utilisateur un moyen pour mettre en

œuvre une recherche orientée vers des patrons avec recouvrement. C'est-à-dire, l'utilisateur

délimite l'ensemble d'éléments qui auront un rôle prépondérant dans le processus de

lvi

découverte de sous-structures. De plus, cette caractéristique offre l'avantage que le

processus d'évaluation de patrons est simplifié puisque l'ensemble de résultats produits est

plus petit parce que ceux-ci se centrent sur les demandes de l'utilisateur.

B.5 Résultats

A partir des essais pour évaluer notre proposition de modélisation et de fouille de données

spatiales en utilisant un modèle basé sur les graphes, on a développé trois cas d'utilisation

exemplaires. Les deux premiers cas ont été mis en œuvre en utilisant un recensement de

population du centre historique de la ville de Puebla durant l'année de 1777. Le troisième

cas d'utilisation a été développé en utilisant une base de données spatiales de la région du

volcan Popocatépetl.

Dans cette section nous présentons des exemples de résultats obtenus avec le cas du volcan

Popocatépetl. Supposons que nous souhaitons connaître des caractéristiques communes

entre les habitats, routes et rivières dans la zone qui nous aident à évaluer ou mettre en

œuvre des plans d'évacuation en cas d'éventualité volcanique : par exemple, caractéristiques

de routes commençant dans ou croisant un habitat, un matériel utilisé pour construire ces

routes et leur état actuel (ex. pavées, non pavées), caractéristiques des routes et des rivières

qui ont une certaine relation entre elles (ex. croisement), rivières près d'habitats qui en cas

de haute précipitation pluviale pourraient présenter des dangers potentiels.

lvii

Les expériences ont été développées en utilisant les 5 modèles de représentation

actuellement proposés. Dans cette section on présente des patrons trouvés avec le modèle

n°1 entre routes et rivières, routes et habitats, et finalement rivières et habitats. L'idée

d'organiser la présentation des résultats de cette manière est de montrer les divers patrons

qui peuvent être découverts entre ces éléments. À la fin de la section on présente des

tableaux comparant les résultats obtenus avec chacun des 5 modèles.

La Figure B.6 montre le patron le plus significatif découvert entre des routes et des rivières

en utilisant le modèle n° 1. Le patron décrit une relation entre "route de catégorie non pavée

qui traverse une rivière catégorie écoulement" dans la zone. Ce patron peut être considéré

comme un indicateur du nombre de routes qui ont besoin d'être vérifiées en cas

d'éventualité volcanique vu le type de matériel avec lequel elles sont construites, et parce

qu'elles traversent des rivières (la lecture peut être faite en sens inverse) qui en cas de

hautes concentrations pluviales peuvent déborder et les mettre hors d'usage. Subdue a

trouvé avec non recouvrement 46 instances du patron dans la seconde itération ; par

l'intermédiaire du recouvrement standard il a trouvé 85 instances dans la première itération

; et à travers le recouvrement limité il a aussi trouvé 85 instances dans la seconde itération.

Comme nous pouvons observer les recouvrements standard et recouvrement limité ont

trouvé le même nombre d'instances, mais le recouvrement limité a eu besoin de deux

interactions pour trouver le même patron. Toutefois, ceci ne veut pas dire que le

recouvrement standard est meilleur que le recouvrement limité (en ce qui concerne temps

de traitement) parce qu'en analysant le temps global de traitement requis pour le

recouvrement limité pour achever la phase de découverte de sous-structures nous

remarquons qu'il est plus petit que celui requis par le recouvrement standard.

lviii

a) b)

c)

Figure B.6. Relations entre des routes et des rivières en utilisant le modèle n°1.

Le patron le plus significatif, utilisant le modèle n°1, trouvé entre des routes et des habitats

est présenté dans la Figure B.7. Il décrit une relation entre "route catégorie non pavée en

touchant un habitat catégorie construction". "habitat catégorie construction" représente dans

la couche de données spatiales "habitat" de la base de données du volcan, habitats avec

grosse population, bâtiments et une grande quantité de constructions utilisées pour offrir

lix

des services aux habitants. Si nous faisons l'hypothèse que les gens pourraient avoir besoin

d'être évacués en cas d'éruption et que les routes utilisées pour ce but sont non pavées,

alors, cette situation pourrait se transformer en un problème (Ex. embouteillage). Dans cette

expérience Subdue a trouvé par l'intermédiaire du non recouvrement 6 instances du patron

dans la neuvième itération ; par le biais du recouvrement standard 9 instances dans la

quatrième itération ; et en utilisant recouvrement limité 8 instances dans la dixième

itération.

a) b)

c)

Figure B.7. Relations entre des routes et des villes en utilisant le modèle n°1.

lx

La Figure B.8 montre le patron le plus significatif trouvé entre des rivières et des

populations en utilisant le modèle n° 1. Le patron décrit une relation entre "rivière catégorie

écoulement qui traverse un habitat catégorie "îlot" ou "bâtiment" dans la zone. "habitat

catégorie îlot" représente dans la couche de données spatiales "habitat" de la base de

données du volcan, des villages avec peu de population, en fait avec beaucoup de secteurs

dépeuplés, bâtiments et constructions précaires. Le patron peut être utilisé pour identifier

des zones potentielles d'inondation, habitées par des personnes isolées habitant aux

alentours de rivières. À travers le non recouvrement, Subdue a trouvé 5 instances du patron

dans la dixième seconde itération ; en utilisant le recouvrement standard a trouvé 5 instance

dans la huitième itération ; et par l'intermédiaire du recouvrement limité il a aussi trouvé 5

instances en huitième itération. Subdue a trouvé, toutefois, le même patron dans les trois

cas mais en utilisant le recouvrement standard et le recouvrement limité.

Le Tableau B.2 présente une comparaison, par modèle, entre le nombre d'instances

découvertes/itérations nécessaires pour les découvrir et les trois mises en œuvre du

recouvrement. Par exemple, en utilisant le modèle n°1, Subdue a trouvé 46 instances (dans

la seconde itération) d'un patron "complet" (notre définition pour décrire un patron

"complet" est que celui-ci contient au moins deux objets spatiaux et la relation spatiale

entre eux) en contenant les objets spatiales route-rivière par l'intermédiaire du non

recouvrement. Une valeur plus élevée signifie un modèle qui permet de découvrir plus

d'instances d'une sous-structure (patrons). Rappelons que Subdue décrit comme le meilleur

patron (par itération) la sous-structure avec le nombre plus élevé d'instances découvertes de

lxi

cette sous-structure. Cette comparaison est effectuée par chaque structure "objet-objet" (Ex.

route-rivière).

a) b)

c)

Figure B.8. Relations entre des rivières et des villes en utilisant le modèle n°1.

lxii

Annotation: NO (non recouvrement), SO (recouvrement standard), LO (recouvrement

limité).

Modèle n°1 Modèle n°2 Modèle n°3 Modèle n°4 Modèle n°5


Route-Rivière

Instances 46 85 85 41 85 64 39 85 34 39 85 60 45 85 45

Itération 2 1 2 2 3 2 2 2 9 2 1 2 5 1 5

Route-Ville

Instances 6 9 8 5 8 7 4 8 5 6 8 8 6 0 7

Itération 9 4 10 14 6 10 15 6 13 12 10 7 7 0 7

Rivière-Ville

Instances 5 5 5 5 10 5 5 19 5 5 10 5 5 5 5

Itération 12 8 8 16 7 14 12 4 10 6 6 11 13 6 10

Table B.2. Instances/itérations par chaque modèle basé sur des graphes : cas d'utilisation Popocatépetl.

Le Tableau B.3 présente une comparaison de maximum/minimum instances découvertes

par chaque mise en œuvre du recouvrement. Un modèle avec la valeur plus élevée est

meilleur parce qu'il permet de découvrir davantage d'instances d'une sous-structure. La

comparaison est présentée par chaque structure "objet-objet". Par exemple dans la structure

route-rivière le modèle n°1 a trouvé 46 instances découvertes dans la seconde itération (la

valeur plus élevée).

lxiii

Maximum Minimum

Route-Rivière

Non recouvrement modèle n°1 (deuxième itération) modèles n°3 et n°4 (deuxième itér.)

Recouvrement standard modèles n°1, n°4 et n°5 (première itér.) modèle n°2 (troisième itération)

Recouvrement limité modèle n°1 (deuxième itération) modèle n°3 (neuvième itération)

Route-Ville

Non recouvrement modèle n°5 (septième itération) modèle n°3 (quinzième itération)

Recouvrement standard modèle n°1 (quatrième itération) modèle n°5 (modèle non complet)

Recouvrement limité modèle n°4 (septième itération) modèle n°3 (treizième itération)

Rivière-Ville

Non recouvrement modèle n°4 (sixième itération) modèle n°2 (seizième itération).

Recouvrement standard modèle n°3 (quatrième itération) modèle n°1 (huitième itération).

Recouvrement limité modèle n°1 (huitième itération) modèle n°2 (quatorzième itération)

Table B.3. Max/Min d'instances découvertes par "objet-objet"/caractéristique recouvrement.

Le Tableau B.4 présente une comparaison entre la moyenne d'instances découvertes par

modèle. Une valeur plus élevée signifie un modèle permettant de découvrir plus d'instances

d'une sous-structure. Chaque valeur représente la moyenne de sous-structures découvertes

en utilisant le non recouvrement, le recouvrement standard et le recouvrement limité. La

comparaison est donnée par chaque structure "objet-objet".


Route-Rivière 72.0 63.3 52.7 61.3 58.3

Route-Ville 7.7 6.7 5.7 7.3 4.3

Rivière-Ville 5.0 6.7 9.7 6.7 5.0

Table B.4. Moyenne d'instances découvertes par modèle/"objet-objet".

lxiv

Le Tableau B.5 présente une comparaison entre la moyenne d'instances découvertes par

modèle. Une valeur élevée signifie un modèle permettant de découvrir plus d'instances

d'une sous-structure. La comparaison est donnée par chaque mise en œuvre du

recouvrement.



19.0 33.0 32.7 17.0 34.3 25.3 16.0 37.3 14.7 16.7 34.3 24.3 18.7 30.0 19.0

Table B.5. Moyenne d'instances découvertes par modèle/caractéristique recouvrement.

Le Tableau B.6 présente une fin comparative entre la moyenne d'instances découvertes par

modèle. Nous pouvons voir dans le tableau que le modèle n°1 donne la valeur plus élevée

d'instances découvertes (en accord avec nos paramètres des instances complètes) dans ce

cas d'utilisation exemplaire. Les modèles suivants sont le modèle n°2 et le modèle n°4

respectivement.


28.2 25.6 22.7 25.1 22.6

Table B.6. Moyenne d'instances découvertes par modèle.

B.6 Conclusions

L'interaction constante entre les êtres humains et leur habitat naturel, la planète terre,

produit jour après jour, de nouvelles demandes associées à la manipulation et l'exploitation

lxv

de données spatiales. Par exemple, l'analyse urbaine, la prévention les risques naturels,

l'exploration de l'espace stellaire, la pollution des océans, et le déboisement des sols, pour

nommer certains d'entre eux. La fouille de données spatiales intègre l'intégration de

méthodes et techniques provenant de divers domaines scientifiques lesquels nous aident, au

moyen d'algorithmes d'analyse et de découverte, à produire une énumération particulière de

patrons sur les données spatiales.

Notre argumentation dans ce mémoire de thèse se base sur l'idée que la fouille de données

spatiales ne considère pas tous les éléments trouvés dans une base de données spatiales

(données spatiales, données non spatiales et relations spatiales entre les objets spatiales)

d'une manière exhaustive. Par conséquent, on a proposé d'employer une analyse basée sur

les graphes pour représenter ces éléments comme un unique ensemble de données, les

fouiller comme un tout, de façon à pouvoir découvrir des patrons contenant les deux types

données et relations spatiales (patrons les plus descriptifs).

Dans notre modèle les relations spatiales entre les objets spatiaux sont incluses parce qu'une

caractéristique significative des données spatiales est l'influence que les voisins d'un objet

peuvent avoir avec l'objet lui-même. Dans notre modèle nous incluons trois types de

relations spatiales. A partir du modèle général on a proposé cinq modèles opérationnels.

Trois aspects définissent les caractéristiques d'un graphe créé avec ces modèles : (1)

Représentation des relations spatiales équivalentes. (2) Représentation de relations spatiales

symétriques. (3) La manière de représenter les objets et leurs relations. Comme partie

intégrante de notre méthodologie pour la fouille de données spatiales en utilisant une

analyse basée sur les graphes, nous utilisons le système Subdue comme outil de fouille.

lxvi

Nous avons proposé un nouvel algorithme appelé recouvrement limité lequel donne à

l'utilisateur la capacité de spécifier l'ensemble de sommets sur lesquels le recouvrement est

permis. Nous donnons trois motivations pour proposer cette nouvelle analyse: (1)

Réduction de l'espace de recherche. (2) Réduction du temps de processus. (3) Recherche

orientée de patrons avec recouvrement (partager) sélectif.

Pour démontrer la viabilité, capacité de fouille et de découverte de patrons en utilisant

l'analyse proposée, on a développé un prototype pour la mise en œuvre de notre modèle en

créant les ensembles de données basées sur les graphes, pour fouiller ces graphes (par

l'entremise du système Subdue) et pour visualiser les patrons découverts. Les résultats

produits des cas d'utilisation développés nous donnent un vaste panorama de ce que nous

pourrions obtenir en utilisant cette analyse. Il est important de généraliser le fait que nous

pouvons utiliser cette méthodologie de modélisation et de représentation dans tout domaine

qui peut être représenté pour un graphe.

Les perspectives d'amélioration de notre travail (modèle basé en graphes, algorithme de

fouille de données et système prototype) incluent les points suivants :

• Visualisation de connaissances découvertes. Par exemple, visualisation de

résultats sur les couches spatiales, à travers l'utilisation d'icones, et la navigation

dans la hiérarchie de patrons découverts.

• Amélioration des algorithmes employés pour créer les ensembles de données

basées sur des graphes en accord avec les modèles proposés. La validation des

lxvii

relations spatiales entre objets spatiaux est une phase qui dans la majorité des cas

requiert une grande quantité de ressources en calcul.

• Fouille de graphes. On a employé le système Subdue comme outil de fouille de

données. De plus, on a proposé un nouvel algorithme appelé recouvrement limité.

L'isomorphisme de graphes est un problème NP-complet et par conséquent, les

algorithmes devront être capables de réduire les temps de traitement pour la

recherche de patrons.

• Manipulation et représentation de relations entre des données non spatiales.

Les relations implicites et explicites entre les attributs en décrivant les objets

spatiaux peuvent être inclus dans le modèle afin d'améliorer la représentation des

données.

B.7 Contribution

La contribution à la découverte de connaissances dans le domaine des données spatiales,

décrite dans ce mémoire, est le fruit d'une nouvelle façon de modéliser et de fouiller les

données spatiales en utilisant une représentation basée sur des graphes. Cette approche

inclut les aspects suivants :

• Nous avons proposé une nouvelle représentation de données basée sur les graphes

pour traiter les données spatiales. On a atteint deux objectifs pour créer un modèle

de données avec ces caractéristiques. Le premier d'entre eux est de créer un seul

ensemble de données, basé sur des graphes, en représentant ces éléments en rapport

les uns avec les autres. Le deuxième est d'employer cet ensemble de données pour

lxviii

alimenter un système de fouille de données basé sur des graphes, de telle sorte que

puissions-nous découvrir des patrons contenant des données spatiales, non spatiales

et des relations spatiales lesquelles nous aident à décrire et comprendre les données,

tout ceci basé la prémisse qu'il s'agit d'éléments en rapport dans le monde réel.

• Nous avons proposé un nouvel algorithme pour découvrir des sous-structures

(patrons) en utilisant une analyse de recouvrement limité dans le système Subdue.

Nous donnons directement trois motivations pour proposer la mise en œuvre du

nouvel algorithme : réduction de l'espace recherche, réduction du temps de

processus et recherche orientée de patrons avec recouvrement sélectif (specialized

overlapping pattern oriented search).

• On a conçu et on a mis en œuvre un prototype pour le modèle proposé. Le prototype

offre une interface d'utilisateur convivial pour la manipulation des couches spatiales

avec lesquelles on travaillera, pour la création de graphes spatiaux et non spatiaux,

pour la fouille de ces graphes (à travers du système Subdue) et pour le déploiement

des résultats produits.

1

Chapter 1

INTRODUCTION

Due to the advances in data generation and recompilation we are facing a continuous

growth in data collections. The analysis and interpretation of this data by manual

techniques is sometimes a tough task; therefore, different methods have been proposed to

help us transform it into useful knowledge. Knowledge discovery in databases (KDD) is

defined as the non-trivial extraction of implicit, previously unknown, and potentially useful

information from data [16]. This is an interactive and iterative process that involves several

phases. The data mining phase is the nucleus of the process; it consists of the application of

data analysis and discovery algorithms that, under acceptable computational efficiency

limitations, produces a particular enumeration of patterns over the data [12]. Data mining

involves the integration of methods from different scientific fields like machine learning,

database technology and statistics. The first approaches developed focused on the discovery

of knowledge from relational data.

Nowadays, however, terms like geoprocessing and Geographic Information System are

used widely in daily life. This is a result of the improvement in the human capabilities to

create, manipulate, store and use data from phenomena on, above or below the earth’s

2

surface. This data is known as spatial data. Spatial data mining focuses on the discovery of

implicit and previously unknown knowledge in spatial databases [16]. Spatial data has

many features that distinguish it from relational data such as the relationships among the

objects, complexity, and the query language used to access them.

1.1 Motivation

As a result of the growth in the volume of spatial datasets and the necessity for tools to help

us transform them into useful information, several approaches have been developed for

knowledge discovery from spatial data: the principle in the Generalization-based methods

[21, 35] is that data and objects often contain detailed information at primitive concept

levels but sometimes it is desirable to summarize that information and present it at a higher

concept level. Clustering [26, 28, 37, 40, 46] is the process of grouping physical or abstract

objects into classes (clusters) of similar objects so that the members of a cluster are as

similar as possible whereas members of different clusters differ as much as possible from

each other. Spatial associations [13, 29] discover rules that associate one or more spatial

objects with other spatial objects. In Approximation and aggregation methods [27] the idea

is to analyze the characteristics of the clusters in terms of the features (objects) close to

them. Aggregate proximity is the measure of closeness of the set of points in the cluster to a

feature. Mining in image and raster databases [13, 14] can be viewed as another approach

of spatial data mining. Some applications of this approach (based on images) are automatic

recognition and categorization of astronomical objects; classification of stars, galaxies and

other stellar objects. Spatial Classification [31] is the task of assigning objects to a set of

3

classes based on their attribute values. Spatial Trend Detection [10] describes the regular

change of one or more non-spatial attributes of an object when moving away from a given

starting object.

However, we argue that these approaches do not consider all the elements found in a spatial

database (spatial data, non-spatial data and spatial relations among the spatial objects) in an

extended way. Some of them focus first on spatial data and then on non-spatial data or vice

versa, and others consider restricted combinations of these elements. We think that it is

possible to enhance the generated results of the data mining task by mining them as a whole

and not as separated elements (in the real world they are related). In this context, we

propose to use a graph-based representation since it provides the flexibility to describe

these elements together and this is the motivation to explore the area of graph-based spatial

knowledge discovery.

Our work is based on the hypothesis that if we create a graph-based model to represent

together spatial and non-spatial data and if we use this model for generating a dataset

composed of both type of data, then we can apply data mining techniques using this

knowledge representation to spatial and non-spatial data at the same time and get

descriptive patterns considering both kind of data about objects and the spatial relations

among them.

4

1.2 Proposal

Our proposal is to create a unique graph-based model to represent spatial data, non-spatial

data and the spatial relations among spatial objects. We will generate datasets composed of

graphs with a set of these three elements. We consider that by mining a dataset with these

characteristics a graph-based mining tool can search patterns involving all these elements at

the same time improving the results of the spatial analysis task. A significant characteristic

of spatial data is that the attributes of the neighbors of an object may have an influence on

the object itself, therefore, to enhance the data representativeness we propose to include in

the model three relationship types (topological, orientation, and distance relations).

Moreover, from a general point of view spatial database systems are relational databases

plus a concept of spatial location and spatial extension. So, most KDD algorithms for

spatial databases must make use of those neighborhood relationships because it is the main

difference between KDD in relational database system and spatial database system.

In the model the spatial data (i.e., spatial objects), non-spatial data (i.e., non-spatial

attributes), and spatial relations are represented as a collection of one or more directed

graphs. A directed graph contains a collection of vertices and edges representing all these

elements. Vertices represent either spatial objects, spatial relation types between two spatial

objects (binary relation), or non-spatial attributes describing the spatial objects. Edges

represent a link between two vertices of any type. According to the type of vertices that an

edge joins, it can represent either an attribute name or a spatial relation name. The attribute

name can refer to a spatial object or a non-spatial entity. We use directed edges to represent

5



We propose to adopt the Subdue system [24, 45], a general graph-based data mining system

developed at the University of Texas at Arlington, as our mining tool. Subdue discovers

substructures using a graph-based representation of structural databases. The substructures

(a connected subgraph within the graphical representation) describe structural concepts in

the data (i.e., patterns). The discovery algorithm follows a computationally constrained

beam search. The algorithm begins with the substructure matching a single vertex in the

graph. On each iteration, the algorithm selects the best substructure and incrementally

expands the instances of the substructure. An instance of a substructure in the input graph is

a subgraph that matches (graph theoretically) that substructure.

A special feature named overlap has a primary role in the substructures discovery process

and consequently a direct impact over the generated results. However, it is currently

implemented in an orthodox way: all or nothing. If we set overlap to true, Subdue will

allow the overlap among all instances sharing at least one vertex. On the other hand, if

overlap is set to false, Subdue will not allow the overlap among instances sharing at least

one vertex. We argue that a third option is needed: a limited overlap. With this option we

give the user the capability to set over which vertices, the overlap will be allowed (vertices

representing remarkable elements that refer, for instance, to a spatial object in a spatial

database or to some characteristic defining a particular topic of a dataset). We visualize

directly three motivations issues to propose the implementation of the new algorithm:

search space reduction, processing time reduction, and pattern oriented search.

6

1.3 Contribution

The contribution to the discovery knowledge in the spatial data domain, described in this

dissertation, is the development of a new approach for spatial data modeling and mining

using a graph-based representation. This approach includes the following issues:

• We proposed a new graph-based data representation for spatial data, non-spatial

data and spatial relations among the spatial objects. We visualize two objectives for

creating a data model with these characteristics. The first one is to create a unique

graph-based dataset representing these related elements. The second one is to use

this dataset to feed a graph-based mining system, so we can discover single patterns

(involving these elements) that will help us to describe/understand the data, based

on the premise that they are related elements in the real world.

• We proposed a new algorithm to discover substructures (patterns) using a limited

overlap approach in the Subdue system. We visualize directly three motivations

issues to propose the implementation of the new algorithm: search space reduction,

processing time reduction, and specialized overlapping pattern oriented search.

• We designed and developed a prototype system implementing the proposed model.

The prototype provides to the user a friendly graphical user interface for managing

the spatial layers to work with, for creating spatial and non-spatial graphs, for

mining those graphs (by calling the Subdue system) and for displaying the

generated results.

7

1.4 Organization of the thesis

The thesis is structured in the following way: Chapter 1 presents the motivation, proposal,

and contributions of the thesis. Related work is described in Chapter 2. In Chapter 3 we

detail our graph-based model to represent together spatial data, non-spatial data, and spatial

relations. Our graph-based data mining tool, the Subdue system, and the new limited

overlap algorithm are described in Chapter 4. A prototype system implementing our model

is presented in Chapter 5. Use-cases showing the applicability of our proposal are described

in Chapter 6. Finally, conclusions and final remarks are commented in Chapter 7.

8

Chapter 2

RELATED WORK

Knowledge Discovery in Databases, spatial data, non-spatial data, Data Mining, Spatial

Data Mining, Geographic Information System, geoprocessing, and Geomatics are terms

broadly used nowadays. This chapter presents an outline of these issues. The Geographic

Information Systems are presented in Section 2.1. In Section 2.2 we describe related work

about Knowledge Discovery in Databases and data mining. Finally, Section 2.3 describes

the three types of spatial relations we propose to incorporate in our model to represent

spatial data.

2.1 Geographic Information System (GIS)

A Geographic Information System is defined as a tool for the manipulation of geographic

data [2]. The GIS performs a great diversity of functions; some of them are the compilation,

verification, storage, retrieval, manipulation, update and visualization of geographic data.

Additionally, one of its more important features is the inclusion of data analysis modules.

9

All these functions are applied by a GIS to geographic data, stored generally in a

geographic database. The data processed and manipulated is georeferenced, that is, it is

assigned to a specific location on the Earth’s surface using a coordinate system.

The GIS can process data from several sources. For example, data collected from maps,

images and photography, statistical data from mathematical analyses, and data from CAD

systems (Computer-Assisted Design).

A GIS organizes and handles digital data stored generally in a geographic database. The

databases are important in the GIS technology because they store geographic data with a

structured form, allowing the data to be used for many tasks. Many GIS implement

additional functionalities when they use database management systems (DBMS) to store

and to handle all or some data in an independent subsystem.

The diversity of the uses of the GIS has generated the proliferation of a great variety of

definitions of GIS. A user generally defines a GIS according to what he uses it for and his

own experience and abilities. Some of these definitions are:

• A system for data processing designed for the production and/or visualization of

maps.

• An information system to respond to questions about Earth’s properties or soil

types.

• A system for decision making support in situations of natural phenomena.

• An electronic positioning system to be used by terrestrial or marine transportation.

10

The GIS frequently is named according to its application field [2]. For example, when they

are used to manage land’s registries they are generally called Land Information Systems

(LIS); in applications of municipal and natural resources they are important components of

the Urban Information Systems (UIS), and Natural Resources Information Systems (NRIS)

respectively. The term Automatic Mapping/Facility Management (AM/FM) is used by

public maintenance companies, transportation agencies, and local governments for systems

dedicated to the operation and maintenance of networks.

The GIS’s field (see Figure 2.1) involves many disciplines, applications, data types, and

end users, for example:

• Disciplines: Computer Science, Cartography, Spatial Analysis, Topography,

Hydrography, Statistic, Information Sciences, Planning, etc.

• Applications: Operation and maintenance of networks and other devices,

administration of natural resources, highway planning, map production, urban

analysis, planning analysis, etc.

• Data: Digital maps, digital images and photography, data satellites, video images,

etc.

• Users: Planners, topographers, vulcanologists, geographers, environmentalist,

engineers, etc.

11

DisciplinesDisciplinesUsersUsers

ApplicationsApplicationsDataData

Figure 2.1. Geographic Information System.

Geographic Information Systems involve the uses of systems and science. Their use arise

questions such as: How does a GIS user know that the results obtained are accurate? How

can user interfaces be made readily understandable by novice users? M. F. Goodchild

published in 1992 a paper where he argued that questions such as these and their systematic

study constituted a science.

Geoprocessing study the fundamental issues arising from geographic information (i.e.,

creation, handling, storage and use of the information). The term Geomatics, the fusion of

ideas from geosciences and informatics, is defined as the umbrella covering all fields that

are today important for understanding and further developing information systems in this

context [32, 33, 34].

12

A GIS offers its users greater capabilities to process datasets than those offered by manual

systems. In a GIS database the data is stored in a structured form, unlike the manual

systems where the data is stored in files, maps, and/or reports. The data can be recovered

from geographic databases and processed faster and more safely than in manual systems.

We can classify the GIS’s users into two groups. In the first group, we find professional

operators. They are people trained in some particular software and they know the

capabilities of this technology. Many times these people do not use the results of their

work, but they pass them on to the end users.

The second user group spends less time working with the GIS’s. They maintain geographic

information in order to have tools which help them in the decision making process. They

have few opportunities for extensive training in the GIS tools, and consequently the GIS

must be simple and easy to handle.

2.2 Data Mining

Knowledge Discovery in Databases (KDD) is defined as the non-trivial extraction of

implicit, previously unknown, and potentially useful information from data [16]. This is an

interactive and iterative process that involves several phases: data preparation, search for

patterns over the data, evaluation and interpretation of discovered patterns, and refinement

of the whole process as it is shown in Figure 2.2.

13

Cleandata

Formattedandstructured data

PatternsKnowledge

Datamining

Patternsevaluation

Integration

Data Integrateddata

Selecteddata

Cleanning Selection

Transformation

Figure 2.2. Knowledge Discovery in Databases.

The data mining phase (the search for patterns) is the nucleus of the process; it consists of

the application of data analysis and discovery algorithms that, under acceptable

computational efficiency limitations, produces a particular enumeration of patterns over the

data [4, 12].

Data mining involves the integration of methods from different scientific fields such as

machine learning, database technology, statistics, and visualization as shown in Figure 2.3.

The first approaches developed focused on the discovery of knowledge from relational

data.

14

MachineLearning

DatabaseTechnology

Statistics

Visualization

OtherDisciplines

InformationScience

Data Mining

Figure 2.3. Data mining: integration of several fields.

Several architectures have been proposed for data mining [22, 25]. Figure 2.4 presents an

architecture based on the proposal of Han et al. [23]. The user is the trigger of the entire

process and receptor of the discovered knowledge. Data may be fetched from several

sources such as files, databases, and data warehouses using a data server module. The data

mining engine may use one or more data mining techniques for searching patterns from

data. The significance, importance and interestingness of the found patterns are evaluated

by the pattern evaluation module. The data mining engine and pattern evaluation modules

may use background knowledge stored in a knowledge database. The role of the graphical

user interface is to receive the user requirements and to deliver the generated results. Since

this is an iterative and interactive process, the components may interact among themselves.

15

Data Server

Data Mining Engine

Pattern Evaluation

Graphical User Interface

Database

KnowledgeDatabase

DataWarehouseData

DataData

U S E R S

Figure 2.4. Architecture for a KDD system.

2.2.1 Spatial Data Mining

Climate change, natural risk prevention, human demography, deforestation, and natural

resources atlas are examples from a large variety of issues arisen as a result of the

interaction among people and their natural environment, the Planet Earth. Data generated

from those issues are known as spatial data. Spatial data mining methods focuses on the

discovery of implicit and previously unknown knowledge in spatial databases [16].

Spatial data have many features that distinguish them from relational data. For example, the

spatial objects may have topological, distance, and direction information, the complexity,

16

and the query language used to access them. Different approaches have been developed for

knowledge discovery from spatial data such as Generalization-based methods [21, 35],

Clustering [26, 28, 37, 40, 46], Spatial associations [13, 29], Approximation and

aggregation [27], Mining in image and raster databases [13, 14], Classification learning

[31], and Spatial trend detection [10]. In the following subsections we present a brief

description of these spatial data mining approaches.

2.2.1.1 Generalization-based Method

Generalization has been shown to be one of the effective methods of discovering

knowledge. It was introduced by the machine learning community and it is based on

learning from examples. Generalization-based knowledge discovery requires concept

hierarchies (given explicitly by experts or generated automatically). In the case of the

spatial databases, there can be two types of concept hierarchies:

• Thematic hierarchies. We can generalize tomatoes and bananas as fruits, fruits and

vegetables as cash crops.

• Spatial hierarchies. We can generalize some geographic points as a country or a

region.

The approach introduced by the machine learning community (tuple-oriented) cannot be

directly adopted for large spatial databases because it does not handle very well the noise

and inconsistent data, and the algorithms are exponential in the number of examples. Han et

al. [21] present a modified technique named attribute-oriented induction for mining

17

relational data. Lu et al. [35] extended the attribute-oriented induction technique to spatial

databases. Attributed-oriented induction is performed by climbing the generalization

hierarchies and summarizing the general relationships between spatial and non-spatial data

at a higher concept level. The authors present two generalization based algorithms:

• Non-spatial data dominant. This method performs attribute-oriented induction on

the non-spatial attributes, first generalizing them to a higher concept level and later

merges corresponding spatial attributes (using spatial merge and approximation).

• Spatial data dominant. Given the spatial data hierarchy, generalization can be

performed first on the spatial data and then generalizing their corresponding non-

spatial attributes.

Both algorithms assume that the rules to be mined are general data characteristics

(characteristics rules) and that the discovery process is initiated by the user who provides a

learning request. A disadvantage in this approach is the case where a hierarchy may not

exist or the hierarchy given by the experts may not be entirely appropriate in some cases.

The quality of mined characteristics is highly dependent on the structure of the hierarchy.

2.2.1.2 Clustering

Clustering is the process of grouping physical or abstract objects into classes of similar

objects. Clustering analysis helps construct meaningful partitioning of large set of objects.

It identifies clusters, or densely populated regions, according to some distance

measurement in a multidimensional dataset. Given a set of multidimensional data points,

the data space is usually not uniformly occupied by the data points. Data clustering

18

identifies the sparse and the crowded places, and hence discovers the overall distribution

patterns of the dataset. We can classify clustering algorithms in four main approaches:

Partitioning, Hierarchical, Locality-based and Grid-based algorithms.

• Partitioning algorithms partition a database of n objects into a set of k clusters

which are represented by the gravity of the cluster (k-means algorithms) or by one

representative object of the cluster (k-medoid algorithms). These algorithms use a

two-step procedure. First, they determine k representatives, next, assign each object

to the cluster with its representative closest to the considered object.

• Hierarchical clustering algorithms decompose the database into several levels of

partitioning which are usually represented by a dendrogram. The algorithm

iteratively splits the database into smaller subsets until some termination condition

is satisfied. The dendrogram can either be created top-down (divisive) or bottom-up

(agglomerative).

• Locality-based clustering algorithms group neighboring data elements into clusters

based on local conditions and therefore allow the clustering to be performed in one

scan of the database.

• Grid-based algorithms quantize the space into a finite number of cells and then do

all operations on the quantized space. They frequently use hierarchical

agglomeration as one of their processing phases.

Classification of clustering algorithms is neither straightforward, nor canonical. Some

algorithms perform clustering by combining techniques from these approaches. Important

issues in clustering algorithms include the following properties:

19

• The algorithm must be efficient (time complexity).

• Ability to handle noise (outliers).

• Non-sensibility to data input order.

• A Priori knowledge and parameters tuning.

• Ability to find clusters of arbitrary shape.

• Scalability to large databases.

• Ability to work with high dimensional data.

2.2.1.3 Spatial Associations

A spatial association rule is a rule which describes the implication of one or a set of

features by another set of features in spatial databases [29]. An example of a spatial

association rule is “if the company is close to México City then is a big company”.

A spatial association rule is of the form X Y, where X and Y are sets of spatial or non-

spatial predicates. There are various kinds of spatial predicates that could constitute a

spatial association rule. Some examples are topological relations such as intersects,

overlaps, disjoint; and spatial orientations such as left_of, west_of.

Koperski and Han [29] developed an algorithm for spatial associations rules in spatial

databases. They use the concepts of minimum support and minimum confidence introduced

by Agrawal et al. [1] to develop association rules from large transactional databases. The

support of a pattern A in a set of spatial objects S is the probability that a member of S

satisfies pattern A, and the confidence of A B is the probability that a pattern B occurs if

20

pattern A occurs. A user or an expert may assign thresholds to confine the rules to be

discovered to be strong ones.

Although many spatial association rules may exist in large databases, some of them may

occur rarely or may not hold in most cases. In addition, such rules are usually not 100%

accurate and may contain non trivial knowledge.

The authors employ a method which uses a top-down progressive deepening search

technique. The technique firstly searches at a high concept level for large patterns and

strong implications relationships among the large patterns at a coarse resolution scale. Then

only for those large patterns, it deepens the search to lower concept levels. Such deepening

search process continues until no large patterns can be found. The search employed for

large patterns at high concept levels is applied at a coarse resolution scale efficiently by

using approximate spatial computation algorithms such as R-trees [20] or plane-sweep [39]

techniques operating on minimum bounding rectangles (MBR). Only the candidate spatial

predicates, which are detailed reviewed, will be computed by refined spatial techniques.

Such multiple-level approach saves much computation because it is very expensive to

perform detailed spatial computation for all possible spatial association relationships.

2.2.1.4 Approximation and Aggregation

Clustering algorithms are effective and efficient methods for answering questions such as:

where are the clusters in the spatial database? In some cases it is also important to answer

the question why the clusters are there. We can rephrase the question as: what are the

21

characteristics of the clusters in terms of the features (objects) that are close to it? Knorr

and Ng [27] present a study based on this question.

The aggregate proximity is the measure of closeness of the set of points in the cluster to a

feature as opposed to the distance between a cluster boundary and the boundary of a

feature. Finding aggregate proximity relationships is not as simple as it may seem. There

are three reasons:

• The sizes and shapes of the cluster and the features may vary greatly.

• There may be a very large number of features to examine.

• Even if a suitable feature (i.e., polygon) is found to describe the shape of the cluster

of points, it is inappropriate to simply report those features whose boundaries are

closest to the cluster’s boundary, because the distribution of points in a cluster may

not be uniform.

The authors propose the use of computational geometry concepts to find out the

characteristics of a given cluster in terms of the features close to it. They present the

algorithm CRH (where C is for encompassing circle1, R for isothetic rectangle2, and H for

convex hull3) which uses concepts as filters to reduce the candidate features at multiple

levels. In general, they collect a large number of features from multiples sources (i.e.,

1 A circle that encloses a set of n points. Not necessarily minimum bounding.

2 A rectangle that is orthogonal to the coordinate axes.

3 This is the unique, minimum bounding convex shape enclosing a set of points.

22

maps) and feed them along with the cluster to the algorithm CRH and discover knowledge

about spatial relationships.

Approximation by circles and then by rectangles is used to eliminate features that have

large aggregate distance to the cluster. After these filters, the algorithm calculates the

aggregate proximity of points in the cluster to the convex boundary of each feature that

passed through the previous filters. In the last step, the algorithm reports the features with

the best aggregate proximities showing the minimum and maximum distances of points in

the cluster to the feature, average distance, and percentages of points located in the distance

less than specified threshold.

2.2.1.5 Mining an Image Database

Knowledge mining from image databases can be viewed as a special case of spatial data

mining. For example, Fayyad et al. present a system [14] to identify and categorize

volcanoes on the surface of Venus from images taken by the Magellan spacecraft. Three

basic components are implemented in the system: data focusing, feature extraction, and

classification learning. The first component increases the overall efficiency of the system

by first identifying the portion of the image being analyzed that is most likely to contain a

volcano. They compare the intensity of the central pixel of a region to the estimated mean

background intensity of its neighborhood pixels. The second component extracts interesting

features from the data. The final component uses training examples provided by the experts

to create a classifier that can discriminate between volcanoes and false alarms. For this task

the authors implement decision trees.

23

Other studies of mining in image and raster databases are: Second Palomar Observatory

Sky Survey [15], it uses decision trees for the classification of galaxies, stars and other

stellar objects. Stolorz and Dean [43] proposed a system for detecting earthquakes from

space. They combined methods of statistical interference, massively parallel computing,

and global optimization to build the system that analyze tectonic activities with sub-pixel

resolution over a large area. Stolorz et al. [44] and Shek et al. [41] carry out studies about

fast spatio-temporal data mining from geophysical datasets; they described CONQUEST, a

distributed parallel querying and analysis mining tool.

2.2.1.6 Classification Learning

Spatial classification has as objective to find rules that divide a set of objects into a number

of groups, where objects in each group belong mostly to one class. Many types of

information can be used to characterize spatial objects. We can classify such information

into non-spatial attributes of objects, spatially related attributes with non-spatial values,

spatial predicates and spatial functions.

Each of these categories may be used to extract both for class label attributes (attributes that

divide data into classes) and predicting attributes (attributes on whose values the decision

tree is branched). It is possible to use aggregate values for some of these attributes.

Koperski et al. [31] proposed and evaluated a method for classification of spatial objects.

The method enables classification of spatial objects based on aggregate values of non-

24

spatial attributes for neighboring regions. Spatial relations between objects on the map are

taken into a count, which may be represented into the form of predicates.

2.2.1.7 Spatial Trend Detection

Spatial trend detection is defined as a regular change of one or more non-spatial attributes

in the neighborhood of some object in a database [10]. An example of spatial trend is as

“moving away from downtown Puebla, the price of land decreases”.

Neighborhood paths starting from some point x are used to model the movement and a

regression analysis is performed on the respective attribute values for the objects of a

neighborhood path to describe regularity of change. In the regression analysis the distance

from x is the independent variable and the difference of the attribute values are the

dependent variable(s). There are two types of trends: global trends and local trends. In the

first one, the existence of a global trend for a start object x indicates that if considering all

objects on all paths starting from x the values for the specified attribute(s) in general tend to

increase or decrease with increasing distance. Local trends exist only in a particular

direction.

2.3 Spatial relations

The explicit location and extension of objects define implicit relations of spatial

neighborhood. Therefore, the information about the neighborhood of spatial objects

constitutes a valuable element that must be considered in the mining task. In the following

25

subsection we will present the Neighborhood Graphs, Neighborhood Paths and

Neighborhood indices concepts that introduce us to the three types of spatial relations we

propose to include in our model to represent together spatial data.

2.3.1 Neighborhood Graphs, Neighborhood Paths and Neighborhood

Indices

Martin Ester et al. [9, 11] introduce the concept of neighborhood graphs for explicitly

representing those implicit neighborhood relations relevant for the KDD tasks. He claimed

that attributes of the neighbors of some object of interest may have an influence on the

object and therefore have to be considered as well. He commented that the efficiency of

many KDD algorithms for spatial database systems depends heavily on an efficient

processing of these neighborhood relationships. Neighborhood graphs may cover one of the

following neighborhood relations:

• Topological.

• Distance.

• Direction.

These relations are called binary relations since we can determine spatial relations between

pairs of objects.

A neighborhood graph G for some spatial relation neighbor is a graph where nodes are

objects in the database and edges between nodes n1 and n2 represent the fact that the

relationship neighbor (n1, n2) holds. The neighborhood path in the neighborhood graph G is

26

a set of nodes directly connected through the edges of the graph. The neighborhood index is

a data structure that allows for an efficient execution of the operations for the construction

of a graph and for browsing and expansion of paths. It stores all neighbors for the objects in

a database.

2.3.2 Topological Relations

Topological relations are those relations which are invariant under linear transformations,

i.e., if both objects are rotated, translated or scaled simultaneously the relations are

preserved. They present a definition of topological relations derived from the nine

intersection model [6, 7, 8].

In the model the topological relations between two objects are defined in terms of the

intersections of the interiors, boundaries and exteriors of the objects (see Figure 2.5). The

interior of an object consists of points that are in the object but not on its boundary, and the

exterior consists of those points that are not in the object (its complement). The Figure 2.6

shows the nine intersection model of two objects A and B: object A’s interior (Aº), object

A’s boundary (∂A) and object A’s exterior (A¯) with object B’s interior (Bº), object B’s

boundary (∂B) and object B’s exterior (B¯).

27

Interior

Exterior

Boundary

Aº∩Bº Aº∩∂B Aº∩B¯∂A∩Bº ∂A∩∂B ∂A∩B¯A¯∩Bº A¯∩∂B A¯∩B¯

Figure 2.5. Example of the interior, boundary and

exterior of a circle.

Figure 2.6. Nine intersection model.

In Figure 2.7 we present the topological relations between two objects:

• Disjoint. The boundaries and interiors of the objects do not intersect.

• Contains. The interior and boundary of one object is completely contained in the

interior of the other object.

• Inside. The opposite of contains.

• Equal. The two objects have the same boundary and interior.

• Touch. The boundaries of the objects intersect but the interiors do not intersect.

• Covers. The interior of one object is completely contained in the interior of the

other object and their boundaries intersect.

• CoveredBy. The opposite of covers.

• Overlap. The boundaries and interiors of the two objects intersect.

28

Figure 2.7. Topological Relations.

2.3.3 Distance Relations

Distance relations compare the distance between two objects with a given constant using

arithmetic operators such as <, >, =. The distance between two objects is defined as the

minimum distance between them (i.e., select all elements inside a radio of 50 km from a

“x” point). Figure 2.8 shows two examples of this type of relation; in the figure we are

representing how close and how far two objects are each other.

Figure 2.8. Distance Relations.

29

We propose to use the Euclidian distance which is defined as the straight line distance

between two points. In a plane with p1 at (x1, y1) and p2 at (x2, y2), the Euclidian distance is

( ) ( )221

221 yyxx −+− .

2.3.4 Direction Relations

The direction relation R between two spatial objects A and B, (A R B) is defined using one

representative point of object A and all the points of the destination object B. It is feasible

to define several possibilities of direction relations depending on the number of points that

are considered in the source and destination objects. The representative point of a source

object may be the center of the object or a point on its boundary. This representative point

is used as the origin of a virtual coordinate system and its quadrant defines the direction.

The direction relations between two objects are North_of, South_of, East_of, West_of,

Northeast_of, Northwest_of, Southeast_of, and Southwest_of. In Figure 2.9 we show some

examples of direction relations, for instance, object D is South of object C and East of

object A.

30

A

B

C

D

B North of A

A West of D

D East of A

C Northeast of A

D South of C

Figure 2.9. Direction Relations.

2.4 Conclusion

Data mining is a younger and promissory research field. Many of the data mining

approaches developed for knowledge discovery in relational databases were extended to the

spatial databases domain. In this chapter we presented several approaches such as

generalization, clustering, spatial association, and spatial classifications. We have described

three types of spatial relations that will be included in our model to represent as a unique

graph-based dataset spatial data, non-spatial data and spatial relations.

In the next chapter we will describe the model. First, we will talk about the graph-based

knowledge discovery, next we will detail our methodology, and finally we will present an

example showing its applicability and generated results.

31

Chapter 3

GRAPH-BASED REPRESENTATIONS

This chapter describes our graph-based model to represent together spatial data, non-spatial

data and spatial relations among the spatial objects. We propose to include three types of

spatial relations: topological, distance and direction. Section 3.1 presents some issues

related to the graph-based knowledge discovery. Our methodology is detailed in Section

3.2. In Section 3.3 we describe five graph-based representations based on our model.

Finally, an example of the applicability of the proposal is presented in Section 3.4.

3.1 Generalities

In Chapter 2 we presented several approaches developed to search knowledge in spatial

databases. The differences between these approaches are based on the data representation

and the data mining algorithm used in the search task. Certainly, the data representation

used by a mining tool is very important, and it has to be powerful enough to represent

domains containing complex relations among their components (i.e., spatial data domain).

32

A graph-based representation has these characteristics [3, 5, 17, 36]. It has the benefits of

being easy to understand and flexible enough to create different representations of the same

domain. The domain (i.e., data and relationships) is described using graphs. These graphs

become the input to a graph-based discovery tool which uses a heuristic to choose the

subgraphs that are considered important (patterns).

A graph is defined as a pair G = (V,E). V = {v1,…,vn} denotes a finite set of elements called

vertices. E is a set of edges e satisfying E ⊆ [V]2. Then, each edge e ∈ E is a pair (vi,vj). If

(vi,vj) is an ordered pair for any (vi,vj) ∈ E, then G = (V,E) is said to be a directed graph. A

labeled graph has labels associated with its edges and vertices.

In knowledge discovery systems using a graph-based approach, the data mining algorithm

uses graphs to represent knowledge; this means that the data preparation phase includes the

transformation of the data to a graph format. The search space of a graph-based data

mining algorithm consists of all the subgraphs that can be derived from its input graph.

In the literature there exist several definitions about what a spatial data is. However, before

presenting our spatial model definition, we precise the following issues:

• When we speak about a geometric object we refer to an object describing a form

(i.e., point, line, and polygon).

• A spatial object refers to an object represented by a coordinate system (i.e.,

Cartesian coordinates).

33

• A geographic object may be considered a specialization of a spatial object because

it is represented by a coordinate system but related to earthly coordinates

(sometimes called Geodetic or Geographic coordinates).

A model is a simplification of the reality. It is not the reality, rather it represents the reality.

A model is used to explain or to understand the reality. A model can be, for instance, an

equation, a hypothesis or a structured idea. A spatial model is therefore an abstraction of

spatial data that generates useful information to help us understand, describe, and predict

how things work and/or solve problems in the real world. When we work with geographic

coordinates we can talk about a geographic model.

3.2 Methodology

The idea is to create a graph-based model to represent together spatial data, non-spatial data

and the spatial relations between spatial objects. We will generate datasets composed of

graphs with a set of these three elements. We argue that by mining a dataset with these

characteristics a data mining algorithm can search patterns involving all these elements, at

the same time, improving the results of the spatial analysis process.

For example, finding out interesting patterns of objects located at some distance from a

particular point; focusing on a current problem such as finding risk zones near the

Popocatépetl volcano (México). In addition, it would be important to know the

characteristics of the evacuation routes which would be used in situations of volcanic

34

activity, i.e., what are the characteristics of the evacuation routes; could they withstand the

atmospheric conditions and the passage of vehicles in an emergency situation?

A significant characteristic of spatial data is that the attributes of the neighbors of an object

may have an influence on the object itself. So, we propose to include in the model the three

types of relationship mentioned in Chapter 2: topological, distance, and direction relations.

As we mentioned, the basis is that in a spatial database there exist spatial objects, these

objects interact with other objects (spatial relations) and they may have several attributes

describing them (non-spatial data). Thus, we propose to create graphs that help us to

describe these interconnections between all these elements.

A simple way can be as follows: for each object we create a vertex representing the object

itself and join two vertices with a directed labeled edge if they have a spatial relation in

common (we add one edge for each spatial relation among vertices). Additionally, we can

also create vertices representing the value of their non-spatial attributes and directed labeled

edges joining each attribute to its object (one edge per attribute).

But there is a higher complexity level when working with general graphs (a graph with

multiple edges between vertices) than with simple graphs (a graph with at most one edge

between any given pair of vertices and with no loops). Thus, our idea is to work with

simple graphs. So, we propose to improve this representation to the one shown in Figure

3.1 (using UML notation).

35

Vertex

1..*

1..*


Spatial relation

Spatial object


-From:-To:

Edge

Figure 3.1. General graph-based model to represent spatial data.

In the model, the spatial data (i.e., spatial objects), non-spatial data (i.e., attributes), and

spatial relations are represented as a collection of one or more directed graphs. Therefore, a

directed graph contains a collection of vertices and edges representing all these elements.

Vertices represent either spatial objects, spatial relation types between two spatial objects

(binary relation), or non-spatial attributes describing the spatial objects. Edges represent a

link between two vertices of any type. According to the type of vertices that an edge joins,

an edge can represent either an attribute name or a spatial relation name. The attribute name

can refer to a spatial object or a non-spatial entity. We use directed edges to represent



36

This knowledge representation has the capability to describe a spatial dataset using graphs,

allowing a graph-based mining tool to mine it as a whole. The capabilities of the model to

represent the relation between these objects will be of great impact in the results of the data

mining processes, since the world is described by objects and the relation between these

objects, we can figure out the relations as the elements describing the interaction of the

objects with each other.

3.3 Spatial Graph-based Data Representations

In the construction of the graph there are issues such as the graph complexity and size that

have a direct impact over the data mining algorithm performance. The quality of results

refers to another important aspect. Testing several representations will allow us to produce

comparisons among obtained results, to evaluate them, and finally to make a decision for

selecting the one(s) which offers the better results according to our criteria for success. One

approach is to show that the model allows discovering known patterns. Another approach is

to have a domain expert saying that the discovered patterns are interesting. We have

worked with domain experts for developing these tasks.

Currently, we have developed five models to represent spatial data, non-spatial data and

spatial relationships among the spatial objects as a unique dataset from the general model.

Three issues define the characteristics (i.e., number of vertices and edges, simple graph,

etc.) of the graphs created from these models:

• Equivalent spatial relations.

37

• Symmetric spatial relations.

• The way to represent objects and their relations in the model.

In the following subsections we will explain how these three characteristics affect the

structure and composition of the graph. First, we will talk about the equivalent relations,

next the symmetric relations and finally the five created models.

Equivalent Spatial Relations

Suppose that our dataset is composed of an object A disjoint of an object B, this implies that

object B is disjoint of object A. In this case the two objects are disjoint each other (an

equivalent relation). When creating the graph, this relation can be represented by two

directed edges, one edge labeled as “DISJOINT” from object A to object B and vice versa.

However we can use the following principle:

2 directed edges, 1 edge e(vi,vj) and 1 edge e(vj,vi) equal to 1 undirected edge e = ij

By applying this principle we can replace the two directed edges by only one undirected

edge labeled as the equivalent relation without losing the representation of the spatial

relation among the objects.

The equivalent relations implemented in our research are the following:

• TOUCH

• DISJOINT

38

• OVERLAP

• EQUAL

• CLOSE

Symmetric Spatial Relations

Suppose that our dataset is composed of an object A South of an object B, when we create

the graph this relation is represented by a directed edge from object A to object B labeled as

“South_of”. But it implies the relation object B is North of object A (it is a symmetric

relation). This last relation in some models is not represented.

According to the model the representation of a symmetric relation implies one of the

following options:

• The addition of a directed edge labeled as the symmetric relation.

• The addition of a vertex labeled as the spatial relation (i.e., topological, direction)

and a directed edge labeled as the symmetric relation.

The symmetric relations implemented in our research are the following:

• CONTAINS INSIDE

• COVERS COVEREDBY

• NORTH_OF SOUTH_OF

• EAST_OF WEST_OF

• NORTHEAST_OF SOUTHWEST_OF

• SOUTHEAST_OF NORTHWEST_OF

39

Models

As we have commented, testing several representations will allow us to produce

comparisons among obtained results, to evaluate them, and finally to make a decision for

selecting the one(s) which offers the better results (in term of quality). The following five

models were developed to answer issues such as: in some models we obtain a reduction in

the number of vertices and edges, but do they give us the same representation of our data?

Are we gaining in the size reduction, but what are we loosing? There is a higher complexity

working with general graphs than with simple graphs, does it affect the generated results?

What about time processing?

In order to describe the characteristics of each model we will use the sample dataset shown

in Figure 3.2. Our dataset is composed of two spatial objects, object A representing a house

and object B representing a lake, and the following three spatial relations among them:

• Distance relation

o Object A close object B (equivalent relation).

• Topological relation

o Object A touch object B (equivalent relation).

• Direction relation

o Object A South of object B (symmetric relation).

40

North

South

EastWest

Object B

Object A

Figure 3.2. Sample dataset.

Additionally, to evaluate the characteristics of each model (the model is itself a graph) we

have developed the following nine evaluation metrics:

• Num. vertices. Total number of vertices in the graph.

• Num edges. Total number of edges in the graph.

• Size (vertices + edges). Total number of vertices plus total number of edges in the

graph.

• % increment. This item represents the percent of increment (in term of vertices

plus edges) of this graph with respect to the graph created by using the base model.

• Simple graph. To indicate if the graph is a simple one (graph with at most one edge

between any given pair of vertices and with no loops).

• Directed edge. To indicate if the graph has directed edges.

• Undirected edge. To indicate if the graph has undirected edges.

• Complete information. The item shows if the symmetric relations, created from

the original spatial relations, are also represented in the graph.

41

• Redundant “Relation” edge. We introduce this edge as strategy for avoiding

creating complex graphs. Remember that we consider simple graph a graph with at

most one edge between any given pair of vertices and with no loops. As we will see

in the description of those models, we use this special edge to join the vertices

representing spatial object with the vertices representing the spatial relation types

(i.e., topological, distance and direction). In models #3, #4 and #5 we add a

“Relation” edge for representing explicitly the fact that there is a relation between

two spatial objects. This metrics tells us if a redundant “Relation” edge is presented

in the graph.

Model #1 - base model

Figure 3.3 shows the first model created for representing spatial data as proposed. The

characteristics of the model according to the metrics are:

• Num. vertices: 2 vertices for representing each spatial object (i.e., object A, and

object B).

• Num. edges: 4 edges, 3 edges for representing the original spatial relations (i.e.,

“close”, “touch”, and “South_of” relations) and 1 edge for representing the

“North_of” relation created from the original “South_of” symmetric relation. The

“North_of” relation is itself a symmetric relation.

• Size (vertices + edges): 6

• % increment: 0%, it is the base model.

• Simple graph. No, it is a complex graphs with 4 edges linking 2 vertices.

42

• Directed edge. Yes, they are used for representing the “South_of” and “North_of”

symmetric relations. The direction of the edges is according to the lecture of the

relations among the objects.

• Undirected edge. Yes, they are used for representing the “close” and “touch”

equivalent relations.

• Complete information. Yes, in the graph is represented the “North_of” symmetric

relation created from the original “South_of” symmetric relation.

• Redundant “Relation” edge. No, in the model we do not use “Relation” edges.

Spatialobject A

Spatialobject B

South_of

touch

close

North_of

Figure 3.3. Model #1 - base model.

Model #2 - single replication of relation types, complete information

In Figure 3.4 we show our second model created for representing spatial data. The


• Num. vertices: 5 vertices, 2 vertices for representing the spatial objects and 3

vertices for representing the “topological”, “distance”, and “direction” spatial

relation types. For each spatial relation type among two spatial objects we add 1

vertex labeled as its name. In the example there exist 1 “topological” relation, 1

“distance” relation, and 1 “direction” relation.

43

• Num. edges: 6 edges, 3 edges for representing the original spatial relations, 2 edges

for representing the equivalent relations (i.e., “close” and “touch” relations) created

from the original ones, and 1 edge for representing the symmetric relation (i.e.,

“North_of” relation) created also from the original relations.


• % increment: +83.33%

• Simple graph. Yes, there exists at most 1 edge between any given pair of vertices.

• Directed edge. Yes, they are used for representing all relations. The direction of the

edges is from the vertices representing the spatial objects to the vertices

representing the spatial relation types.

• Undirected edge. No, in the model we do not use undirected edges.

• Complete information. Yes, we represent the symmetric relations created from the

original ones.

• Redundant “Relation” edge. No, in the model we do not use “Relation” edges.

South_of

touch

close close

touch

North_of

Distance

Direction


object B

Figure 3.4. Model #2 - single replication of relation types, complete information.

44

Model #3 - double replication of relation types, no complete information

In Figure 3.5 we show our third model created for representing spatial data. The





vertices labeled as its name. For instance, in the example there exist three relations:

1 “topological” relation, 1 “distance” relation, and 1 “direction” relation, so we add

6 vertices, 2 per each spatial relation type.

• Num. edges: 9 edges, 6 edges (the “Relation” edges) to link the vertices

representing the spatial objects to the vertices representing the spatial relation types

(from each vertex representing a spatial object start 3 edges since we have 3

relations), and 3 edges for representing the original spatial relations. These last 3

edges are used to join the vertices representing the spatial relation types.


• % increment: +183.33%


• Directed edge. Yes, they are used for representing the symmetric relations and the

“Relation” edges. The direction of the “Relation” edges is from the vertices

representing the spatial objects to the vertices representing the spatial relation types.

The direction of the other edges is according to the lecture of the relations among

the objects.

• Undirected edge. Yes, they are used for representing the equivalent relations.

45

• Complete information. No, we do not represent the symmetric relations created

from the original ones.

• Redundant “Relation” edge. Yes, we use in the model “Relation” edges for

representing explicitly the existence and type of a relation among 2 spatial objects.

Additionally, we use this edge for avoiding creating complex graphs.

South_of

touch

close

Relation

Distance

Topological

Direction

Distance

Topological

Direction

Spatialobject A

Spatialobject B

Relation

R elation

Relation

RelationR

elation

Figure 3.5. Model #3 - double replication of relation types, no complete information.

Model #4 - single replication of relation types, no complete information

In Figure 3.6 we show our fourth model created for representing spatial data. The





vertex labeled as its name. In the example there exist 1 “topological” relation, 1

“distance” relation, and 1 “direction” relation.

46

• Num. edges: 6 edges, 3 edges (the “Relation” edges) to link a vertex representing a

spatial object (in the example we have 2 vertices since there are 2 spatial objects) to

the vertices representing the spatial relation types (from this vertex representing a

spatial object start 3 edges since we have 3 relation types), and 3 edges for

representing the original spatial relations. These last 3 edges are used to link the

vertices representing the spatial relation types to the un-used vertex representing the

other spatial object.


• % increment: +183.33%


• Directed edge. Yes, they are used for representing the symmetric relations and

“Relation” edges. The direction of the “Relation” edges is from a vertex

representing a spatial object to the vertices representing the spatial relation types.

The direction of the other edges is from the vertices representing the spatial relation

types to the un-used vertex representing the other spatial object (it is according to

the lecture of the relations among the objects).

• Undirected edge. Yes, they are used for representing the equivalent relations.

• Complete information. No, we do not represent the symmetric relations created

from the original ones.


representing explicitly the existence and type of a relation among 2 spatial objects.


47

Distance

TopologicalSpatialobject A

Direction

Spatialobject BRelation

Relation South_of

touch

Relation close

Figure 3.6. Model #4 - single replication of relation types, no complete information.

Model #5 - double replication of relation types, complete information

In Figure 3.7 we show our fifth model created for representing spatial data. The



vertices for representing the “distance”, “topological”, and “direction” spatial

relation types. For each spatial relation type we add 2 vertices labeled as its name.

In the example we add 2 vertices for the “topological” relation, 2 vertices for the

“distance” relation, and 2 vertices for the “direction” relation.

• Num. edges: 12 edges, 6 edges (the “Relation” edges) to link the vertices

representing the spatial objects to the vertices representing the spatial relation types,

and 6 edges for representing the original spatial relations and those ones generated

from them (equivalent and symmetric relations). These last 6 edges are used to link

the vertices representing the spatial relation types to the vertices representing each

spatial object (3 edges for each spatial object).


48

• % increment: +233.33%


• Directed edge. Yes, they are used for representing all relations.

• Undirected edge. No, in the model we do not use undirected edges.

• Complete information. Yes, we represent the symmetric relations created from the

original ones.


representing explicitly the existence and type of a relation among 2 objects.


Direction

Distance

Topological

Spatialobject A

Direction

Distance

Topological

Spatialobject B

Relation

RelationRelation

Relation

touchtouch

closecloseSouth_ofNorth_of Relation

Relation

Figure 3.7. Model #5 - double replication of relation types, complete information.

Table 3.1 presents the results of the nine metrics developed to evaluate the characteristics

of each model. Model #1 is named the base model.

49

Model Num.

Vertices

Num.

Edges

Size

(v + e)

%

Increment

Simple

Graph

Directed

Edge

Undirected

Edge

Complete

Information

Redundant

“Relation”

Edge

(1) (2) (3) (4) (5) (6) (7) (8) (9)

#1 2 4 6 - No Yes Yes Yes No

#2 5 6 11 +83.33 Yes Yes No Yes No

#3 8 9 17 +183.33 Yes Yes Yes No Yes

#4 5 6 11 +83.33 Yes Yes Yes No Yes

#5 8 12 20 +233.33 Yes Yes No Yes Yes

Table 3.1. Characteristics of the graph-based representation models.

The metrics were proposed based on the causes/effects each of them has both into the

created graph and the mining algorithm. We visualize four significant issues related directly

to these metrics:

1. Search space

The search space of a graph-based data mining algorithm consists of all the subgraphs that

can be derived from its input graph, thus, the number of vertices (1) and edges (2) of the

created graph (3) are two important issues defining the search space size1 for the discovery

system. Therefore, the objective must be to minimize the number of vertices and edges

used to create the graphs but at the same time to maximize the representativeness of the

dataset. As we can see in Table 3.1, the model using the minimum number of vertices and

1 The edge type (directed or undirected) and graph type (simple or complex) are issues that also define the

search space of a graph-based data mining algorithm. In Chapter 4 we describe the Subdue system, our graph-

based data mining tool, and we show the way Subdue deals with these issues.

50

edges to represent our sample dataset is model #1 (2 vertices and 4 edges) whereas model

#5 is the opposite case (8 vertices and 12 edges).

2. Processing time

The search space size plays a relevant role regarding to the processing time used to

discover patterns. If we have a large search space the algorithm would require more time to

evaluate all the possible subgraphs. Therefore, a comparison among the “percentage of

increment” (4) metric of the proposed models is presented in Table 3.1. Remember this

metric compares the size of a given model with respect to model #1 (base model). For

instance, model #5 has a graph size increment of 233.33% with respect to model #1. In

other words, the mining algorithm will require the evaluation of 233.33% more vertices

and/or edges by using model #5 instead of model #1 for the same dataset.

3. Graph complexity

Next chapter describes the Subdue system, our graph-based data mining tool. As we will

see there exists a higher complexity for the mining algorithm to work with complex graphs

than with simple ones (i.e., at most an edge among any given pair of vertices and with no

loops). For instance, in the graph match process, in the expanding phase (Subdue uses an

“expanding” approach to discover patterns), and in the graph compression stage. Therefore,

the objective was to propose graph-based models that allow us to create simple graphs. As

we can see in Table 3.1, only model #1 does not create simple graphs.

51

Thus, as a strategy to break the multiplicity of edges among two vertices (for instance,

vertices representing two spatial objects meeting two or more spatial relations among them)

we use the following approaches:

• To add a new vertex labeled as the relationship type (i.e., topological, distance, and

direction) for each spatial relation among the spatial objects. This approach is used

in model #2.

• To add a new vertex (model #4) or two new vertices (model #3 and model #5)

labeled as the relationship type (i.e., topological, distance, and direction) by each

spatial relation among the spatial objects, and to link this new vertex (model #4) or

new vertices (model #3 and model #5) with the vertices representing the spatial

objects by using edges labeled as “Relation”. The approach used to link the vertices

is different in each model as we have mentioned in the definition of each of them.

This nomenclature is used to represent the fact that there exists a spatial relation

among these spatial objects. These edges are known as redundant relation edges (9).

4. Data representativeness

The directed edges (6), undirected edges (7), and complete information (8) metrics are used

to maximize the representativeness of the dataset and minimizing, as most as possible, the

graph size and its complexity. Directed edges are used to represent the symmetric spatial

relations (object A North_of object B, implies, B South_of A), the redundant relation edges,

and the non-spatial attributes describing the spatial objects. Undirected edges are used to

represent equivalent spatial relations (the relation is represented by an undirected edge

52

instead of two directed edges). Finally, complete information means that symmetric spatial

relations among spatial objects are also represented into the model.

3.4 Use-case

To show the applicability of our model, we will use a dataset from the Popocatépetl

volcano database [38, 42]. The database contains data related to several issues in the zone

such as settlements, rivers, and evacuation roads in the zone, just to mention a few of them.

Figure 3.8 shows a fragment of the layers “roads” and “settlements” of the Popocatépetl

volcano zone. The roads layer (shown in green color) represents the roads in the zone; it is

composed of spatial objects (i.e., lines) and non-spatial data describing the characteristics

of those roads (i.e., id, start point, end point, length, and type). The settlements layer

(shown in pink color) represents population areas in the zone. The layer is composed of

spatial objects (i.e., polygons) and non-spatial data describing the characteristics of the

settlements (i.e., id, area, perimeter, and type).

53

Figure 3.8. Spatial database representing some objects of the world.

Suppose we are interested in the identification of relations among the roads starting in, or

crossing a settlement (i.e., the type and characteristics of roads in relation to the number of

people living in a settlement that in case of a volcanic contingency are required to be

evacuated). So, we need to create a dataset involving these elements for mining it and

search for patterns that could help us to evaluate the characteristics of the roads and, may

be, to make decisions for improving them (or build new ones) for their utilization in case of

volcano activity. In this case we are working with different types of spatial objects and also

we are adding non-spatial data and relationships (as touch or overlap) to our dataset.

The construction of the dataset needs to satisfy some constraints. For example, if we

include all the elements existing in the data layers we might build a huge graph, and this

will have a direct impact in the data mining algorithm. So, we explore methodologies to

deal with topics such as complexity, the size of the graph, noise and quality of the data. A

solution for the problem of creating a huge graph is to limit the set of elements to be

54

included in it by using selection windows. The user can create these windows and only the

elements inside them are candidates to be included as objects in the graph. The idea is

shown in Figure 3.9. Suppose the user will work with n data layers, thus, for each layer we

will select only the spatial objects inside the area delimited by the selection window. We

can see this functionality such as a drill operation over all the spatial layers the user works

with.

Figure 3.9. Selection window.

Once we have defined the working area, as shown in Figure 3.8, we can build the graph.

The process consists of three phases. In the first phase, the user has to choose the spatial

relation(s) to evaluate among the spatial objects (in the example the touch or overlap

topological relations). The second phase involves the validation process, only the objects

covered by the relation(s) become elements to be included in the graph.

55

Figure 3.10. Querying a spatial database.

The last phase consists of building the graph using the results of the validation process.

Figure 3.10 presents an example of this functionality. This time only the area delimited by a

selection window is shown. In the figure, the spatial objects satisfying either the touch or

overlap relations are shown in blue color, the rest of the objects are shown in their original

colors. The circles show some examples where a road is starting or crossing a settlement.

56

Figure 3.11. Graph-based representation for spatial data.

57

Figure 3.11 shows the graph created using model #2 (we used the Graphviz System [19] to

draw the graph). The vertices in the graph represent either spatial objects (i.e., road,

settlement), spatial relations between two objects (i.e., topological), or attributes describing

the objects (i.e., ID value). In this example, we use a special vertex labeled as “object” for

expressing the interconnection between the spatial objects, their spatial relations, and non-

spatial attributes (i.e., object type road with ID 989 overlaps or touches with object…).

According to the type of vertices that an edge joins, the edge can be labeled as either a

spatial relation name (i.e., overlap, touch), or as the name of an attribute describing the

characteristics of an object (i.e., type, ID).

We may read the graph as follows: there are six roads and six settlements inside the

working area meeting either a touch or overlap relation in the form road settlement. We

suppose that each polygon object in the map represents a settlement. Some roads start from

a settlement and others cross a settlement. For example, the road with ID 981 crosses the

settlements with ID’s 3079, 3139, 3151, 3083 and 3138. This means that we need to be

careful with this road since it has interaction with several settlements and in case of a

contingency it will be widely used. In the graph we included only the object's ID non-

spatial attribute for each object.

This is a simple example; in the real world the spatial data layers may have hundreds or

thousands of objects, and each object may have dozens of attributes describing it. Joining

all these elements will allow creating large graphs representing the elements found in a

58

spatial database improving the results of the data mining task. Once we have created the

graph, it will be used as input to a graph-based mining system.

3.5 Conclusion

Our idea is to propose a graph-based model to represent together spatial data, non-spatial

data and the spatial relations between spatial objects. Based on the model we generate

datasets composed of graphs with a set of these three elements. Our argumentation is that

by mining a dataset with these characteristics a data mining algorithm can search patterns

involving all these elements at the same time improving the results of the spatial analysis

process.

We have presented an example of the applicability of the proposal using data from a

Popocatépetl volcano database. The created graph will be the input for a graph-based

mining system. We propose to use the Subdue system, which will be described in the next

chapter.

59

Chapter 4

MINING THE GRAPH

This chapter describes the characteristics of the Subdue system [24, 45], our data mining

tool. Subdue was developed at the University of Texas at Arlington and it can be applied to

any domain that can be represented as a graph. In the first part of the chapter we present the

general characteristics and main functions. In the second part we describe the overlap

feature and its importance for the pattern discovery system, and introduce a new algorithm

named Limited Overlap.

4.1 Characteristics

The Subdue system discovers substructures using a graph-based representation of structural

databases. Structural data involves the concept of units or parts and the interdependence

and relationships of those parts. The substructures (a connected subgraph within the

graphical representation) describe concepts in the data (i.e., patterns).

60

The substructure discovery system represents structural data as a labeled graph. The input

to Subdue is a graph where labeled vertices correspond to objects in the data, and directed

or undirected labeled edges map relationships between objects.

The discovery algorithm follows a computationally constrained beam search. There are

three different evaluation methods to guide the search towards more appropriate

substructures: Minimum Description Length (MDL), Size-based, and Set Cover. The

default evaluation method is MDL.

The algorithm begins with the substructure matching a single vertex in the graph. During

each iteration, the algorithm selects the best substructure and incrementally expands the

instances of the substructure. An instance of a substructure in the input graph is a subgraph

that matches (graph theoretically) that substructure.

The algorithm searches for the best substructure until all possible substructures have been

considered or the total amount of computation exceeds a given limit. Evaluation of each

substructure is determined by how well the substructure compresses the description length

of the input graph.

There might be slight variations of some substructures that can be considered as instances

of another substructure. Subdue uses an inexact graph match algorithm to identify this kind

of instances. In this inexact match approach, each distortion of a graph is assigned a cost

and if the total cost is lower than a given threshold, the substructure is considered an

instance of the other substructure.

61

A distortion is described in terms of transformations such as the deletion, insertion, and

substitution of vertices and edges. The best substructure found by Subdue can be used to

compress the input graph, which can then be input to another iteration of Subdue. After

several iterations, Subdue builds a hierarchical description of the input data where later

substructures are defined in terms of substructures discovered on previous iterations.

Figures 4.1, 4.2, 4.3 and 4.4 show an example of the system’s functionality. The example is

presented in terms of the house domain, where a house is defined as a triangle on a square.

T represents a triangle, S a square, C a circle and R a rectangle (see Figure 4.1). The

objects in the figure (i.e., T1, S1, R1) become labeled vertices in the graph, and the

relationships (i.e., on(S1, R1), shape(C1, circle)) become labeled edges.

T1

S1

object

R1

on

onon

Input GraphInput Database

S3

T3

S4

T4

S2

T2

C1

on

on

rectangleshape

object

object triangle

on

shape

squareshapeobject

object triangle

on

shape

squareshapeobject

object triangle

on

shape

squareshape

object

object triangle

on

shape

squareshapeobject

circleshape

Figure 4.1. Graph representation of the house domain.

62

The graph representation of the substructure discovered by Subdue from this data is shown

in Figure 4.2 where Subdue found four instances of “triangle on square”.

T1

S1

object

objecttriangle

square

S2

S3 S4

T2

T3 T4

shape

shape

on

Substructure Instance 1

Instance 3

Instance 2

Instance 4

Figure 4.2. Substructure and instances discovered from the house domain by Subdue.

After a substructure is discovered, each instance of the substructure in the input graph is

replaced by a single vertex representing the entire substructure. In Figure 4.3 the discovered

substructure (object shape triangle on object shape square) is labeled as SUB_1

(SUBstructure number 1).

object

objecttriangle

square

shape

shape

onSUB_1

Figure 4.3. Substructure replacement procedure in the house domain.

Finally, the substructure (labeled as SUB_1) is used to compress the original input graph,

which can then be input to another iteration of Subdue (see Figure 4.4).

63

SUB_1

object

SUB_1

rectangle

SUB_1 SUB_1

object circle

shape

shape

on

on

on

onon

Figure 4.4. Graph representation of the house domain after substructure replacement.

The Subdue system’s ability to perform discovery has been proved in several domains

including scene analysis, chemical analysis and CAD circuit analysis. In [18] the authors

present an example of the Subdue capabilities to perform data mining tasks, they work with

an earthquake database and show that Subdue is capable of finding not only the shared

characteristics of the earthquake events, but also space relations between them. In the case

of the identification of shared characteristics, they used the pattern containing the region

number specification to recognize the area being studied; in the case of space relations, they

found patterns that represent parts of the paths of the involved fault (i.e., subarea with a

high concentration of earthquakes).

4.1.1 Main Functions

In this subsection we present a briefly description of the main functions composing the core

of Subdue. Each of them is itself integrated by several subfunctions, but the idea is to

present them in a global perspective.

64

Compress

The compress function returns a new graph, which is the given graph compressed with the

given substructure instances. Vertices SUB replace each instance of the substructure, and

OVERLAP edges are added between vertices SUB of overlapping instances. Edges

connecting to overlapping vertices are duplicated, one per each instance involved in the

overlap.

Discover

This function plays the role of manager in the phase of discovering the best substructures in

an input graph. It is in charge of issues such as getting initial substructures, extending each

substructure, evaluating each extension, and adding to a final list the best discovered

substructures.

Evaluate

This function implements the different evaluation methods used to guide the search towards

more appropriate substructures: Minimum Description Length (MDL), Size-based, and Set

Cover. The value of a substructure s in a graph g is computed as:

size(g) / (size(s) + size(g|s))

The value of size() depends on whether we are using the MDL or Size-based evaluation

method. If MDL is used, then size(g) computes the description length in bits of g. In case

of Size-based, then size(g) is simply vertices(g) + edges(g). The size(g|s) is the size of

graph g after compressing it with substructure s. Compression involves replacing each

65

instance of s in g with a new single vertex and reconnecting edges external to the instance

to point to the new vertex.

Extend

This function returns a list of substructures representing extensions to the given

substructure. Extensions are constructed by adding an edge (or edge and a new vertex) to

each positive instance of the given substructure in all possible ways according to the graph.

Matching extended instances are collected into new extended substructures, and all such

extended substructures are returned.

Graphmatch

The Graphmatch function returns true if graph_1 and graph_2 match with cost less than the

given threshold. If so, the objective is to store the match cost in the variable pointed to by

matchCost and to store the mapping between graph_1 and graph_2 in the given array is

non null.

Subgraph Isomorphism

Set of functions implementing a subgraph isomorphism algorithm. The objective is to find

predefined substructures: searches for subgraphs of graph_2 that match graph_1 and

returns the list of such subgraphs as instances of graph_2. Returns an empty list if no

matches exist. This procedure mimics the “discover substructures” loop by repeatedly

expanding instances of subgraphs of graph_1 in graph_2 until matches are found.

66

4.2 Overlap

Now, we will describe the overlap feature in the Subdue system. Our goal is to present the

current implementation of this feature in Subdue and then, in the next section, to describe

our new proposal named limited overlap. As we will see, Subdue reports in the results all

the overlapping substructures that it can find. Sometimes this is very expensive in time and

in the number of substructures that the end user has to analyze. Therefore, the idea is that

Subdue can find overlapping substructures only over a set of predefined vertices that are

chosen by the user (the limited overlap). The overlap has a preponderant role in the

substructure discovery system; it is controlled by the overlap user’s parameter. If overlap is

false then overlap among instances is not allowed, otherwise, overlap is allowed.

To illustrate the cause/effect of the overlap in Subdue, we will present some examples

generated by using two standalone functions belonging to Subdue: the Subgraph

Isomorphism and Minimum Description Length functions. The generated results by these

functions are affected by the value of the overlap parameter.

Subgraph Isomorphism (SGISO)

As we have already mentioned, the subgraph isomorphism function searches for subgraphs

of graph_2 that match graph_1 and returns the list of such subgraphs as instances of

graph_2. The subgraph isomorphism function is embodied in the “FindInstances” function.

The procedure is optimized toward graph_1 being a small graph, and graph_2 being a large

graph.

67

The FindInstances function starts finding a list of single-vertex instances, one for each

vertex in graph_2 that matches the first vertex of graph_1. Next, the algorithm attempts to

extend each instance in the instance list by an edge (or edge and new vertex) from graph_2

that matches the attributes of the given edge in graph_1. Finally, the instances not matching

graph_1 are filtered, and an overlapping instances validation process is performed. If

overlap is false overlapping instances are discarded. The resulting list represents the

instances of graph_1 in graph_2.

For illustrative purpose suppose we have as input graphs those shown in Figure 4.5 and

Figure 4.6. Input graph_1 has 3 vertices and 2 edges, and graph_2 has 8 vertices and 7

edges.

A

B C

a b

2 3

1

F

CBE

DCA

B

a

g

c

b f

b

a

4 2 3

6 7

5

8 1

Figure 4.5. SGISO - input graph_1. Figure 4.6. SGISO - input graph_2.

Running SGISO with the overlap parameter set to false, it finds 1 instance of graph_1 (see

Figure 4.7) in graph_2. Figure 4.8 shows the discovered instance in graph_2.

68

A

B C

a b

2 6

1

Figure 4.7. SGISO - no overlap. Figure 4.8. SGISO - no overlap, 1 instance in graph_2.

Now, running SGISO with the overlap parameter set to true, the algorithm finds the 4

instances shown in Figure 4.9. We can see that each instance has the vertices labeled as A,

B, and C but they represent different vertices (they have different ID vertices). For

example, the first instance has the ID vertices 1, 5 and 3; the second instance has the ID

vertices 1, 5 and 6 respectively, and so on. Figure 4.10 shows the 4 discovered instances in

graph_2.

A

B C

a b

5 3

1 A

B C

a b

5 6

1

A

B C

a b

2 3

1 A

B C

a b

2 6

1

Instance 1 Instance 2


Figure 4.9. SGISO - overlap. Figure 4.10. SGISO - overlap, 4 instances in graph_2.

69

Minimum Description Length (MDL)

The MDL principle states that the best theory to describe a set of data is that theory which

minimizes the description length of the entire dataset. We define the MDL of a graph to be

the number of bits necessary to completely describe the graph. The minimal encoding of

the graph in terms of bits is computed as follows:

edgeBitsrowBitsvertexBitsMDL ++=

Where vertexBits represents the number of bits needed to encode the vertex labels of the

graph, rowBits represents the number of bits needed to encode the rows of the adjacency

matrix A (the matrix represents the graph connectivity), and edgeBits represents the number

of bits needed to encode the edges represented by the entries A[i,j] = 1 of the adjacency

matrix A.

The standalone MDL function computes the description length (dL) of graph_1, graph_2

and graph_2 compressed with graph_1 along with the final MDL compression measure:

)1_|2_()1_(/)2_( graphgraphdLgraphdLgraphdL +

dL(graph_2|graph_1) represents the value of graph_2 compressed with graph_1. The

overlap feature has a direct effect to compute this value because the number of instances for

compressing graph_2 is based in the instances list returned by the FindInstances function.

As we have commented, the Subdue system implements a compress graphs function named

CompressGraph. The inputs of the function are the graph to be compressed, and the

70

substructure instances used to compress the graph. The function returns a new graph, which

is the given graph compressed with the given substructure instances.

As part of the graph compressing phase, SUB vertices replace each instance of the

substructure, and OVERLAP edges are added between SUB vertices of overlapping

instances. Edges connecting to overlapping vertices are duplicated, one per each instance

involved in the overlap. The procedure to add OVERLAP edges evaluates two cases: first, if

two instances overlap at all, then an undirected OVERLAP edge is added between them.

Second, if an external edge points to a vertex shared between multiple instances, then

duplicate edges are added to all instances sharing the vertex. The procedure to add

duplicate edges is based in the follows pseudocode:

if edge connects SUB1 to external vertex then add duplicate edge connecting external vertex to SUB2 else if edge connects SUB1 to another (or same) vertex in SUB1 then add duplicate edge connecting SUB1 to SUB2 if other vertex is unmarked (overlapping and already processed) then add duplicate edge connecting SUB2 to SUB2 if edge connects SUB1 to the same vertex in SUB1 (self edge) then add duplicate edge connecting SUB2 to SUB2 if edge directed then add duplicate edge from SUB2 to SUB1

To show the role plays by overlap in the standalone MDL function (focusing in the

generated compressed graph) we will present, in the following figures, examples of the

different cases of validation that are implemented to face this feature. Figure 4.11 shows the

input graph_1, this graph will be used in all the examples as the first input graph. It has 3

vertices and 2 edges.

71

A

B C

a b

2 3

1

Figure 4.11. MDL - input graph_1.

In our first example we will use as graph_2 the graph shown in Figure 4.12. This graph has

8 vertices and 7 edges. As we can see our graph has a “remarked” edge, the directed edge

labeled as g between vertex 1 (labeled as A) and vertex 8 (labeled as F). This edge

represents the current case of validation (in the future it will be named the guide edge).

As part of the CompressGraph algorithm, there is a function named AddOverlapEdges that

adds edges to the compressed graph, describing overlapping instances of the substructure in

the given graph, based in two conditions. First, if two instances overlap, then an undirected

edge OVERLAP is added between them. Second, if an external edge points to a vertex

shared between multiple instances, then duplicate edges are added to all instances sharing

the vertex. If the second condition is true, the AddOverlapEdges function calls a new

function named AddDuplicateEdges. The purpose of this function is to add duplicate edges

based on overlapping vertex between substructures.

As we can see, the layout of the edge (that we call the guide edge) is important for the

global functionality of the compress graph algorithm. So, in the following examples we will

change the guide edge for illustrating the different validation cases.

72

Returning to our example, if we run the MDL program with the overlap parameter set to

false, our final compressed graph is shown in Figure 4.13. Since overlap is not allowed

between instances, the FindInstances function finds 1 instance of graph_1 that match

graph_2. In the graph this instance was replaced by a vertex SUB, so we have only 1 vertex

of this type. There are not edges OVERLAP, and no edges were duplicated.

F

CBE

DCA

B

a

g

c

b f

b

a

4 2 3

6 7

5

8 1

Figure 4.12. MDL example 1 - input graph_2. Figure 4.13. MDL example 1 - no overlap.

In the next example our graph_2 is shown in Figure 4.14. It is the same graph used in the

previous example, but this time we run MDL with the overlap parameter set to true. The

generated compressed graph is shown in Figure 4.15. The FindInstances function finds 4

instances since the overlap between instances is allowed. In the graph the 4 instances were

replaced by 4 vertices SUB. There are 5 edges OVERLAP, an edge OVERLAP between two

vertices SUB tells us that these substructures overlap. By “substructures overlap or

overlapping substructures” we refer that the instances they represent have common vertices.

In the example, there is a directed edge (edge g, our guide edge) connecting two vertices,

one of them belonging to a substructure (in exact term, to an instance of a substructure) and

the other one no belonging to the substructure (vertex F), is an external edge; the

73

discovered instances were 4 (overlap is allowed), so, in the graph there are 4 directed edges

starting from each vertex SUB (the direction of the edge is preserved) to the vertex F.

A similar procedure was implemented for edge f (connecting vertex C and vertex D,

vertices number 6 and 7 respectively) and edge c (connecting vertex B and vertex E,

vertices number 2 and 4, respectively). They were duplicated since these edges connect

vertices belonging to a substructure to vertices no belonging to the substructure (external

edges). In the figure we can see that vertex D has 2 edges f connecting it to 2 vertices SUB

(preserving the direction of the edge) and vertex E has 2 edges c connecting it to 2 vertices

SUB (remember that 4 instances were discovered, overlap is allowed).

Finally, our compressed graph has 7 vertices: 4 vertices SUB, 1 vertex F, 1 vertex D, and 1

vertex E; and 13 edges: 5 edges OVERLAP, 4 edges labeled as g, 2 edges labeled as f, and 2

edges labeled as c.

F

CBE

DCA

B

a

g

c

b f

b

a

4 2 3

6 7

5

8 1

Figure 4.14. MDL example 2 - input graph_2. Figure 4.15. MDL example 2 - overlap.

For the following examples the overlap parameter is always set to true.

74

Figure 4.16 shows our new graph_2, we only changed the direction of the guide edge. Now

it starts from vertex F (number 8) to vertex A (number 1). The generated compressed graph

is shown in figure 4.17. In the compressed graph we observe that, this time, the 4 edges g

start from vertex F to the 4 vertices SUB (the direction of the edge is preserved).

Everything else remains as the previous example.

F

CBE

DCA

B

a

g

c

b f

b

a

4 2 3

6 7

5

8 1


For the next example, we only change from a directed guide edge to an undirected one. Our

graph_2 is the graph shown in Figure 4.18. By running MDL we generate the compressed

graph presented in Figure 4.19. The only modification is that the edges between vertex F

and the 4 vertices SUB are undirected edges. Everything else remains unchanged.

75

F

CBE

DCA

B

a

g

c

b f

b

a

4 2 3

6 7

5

8 1

Figure 4.18. MDL example 4 - input

graph_2.

Figure 4.19. MDL example 4 - overlap.

For illustrating a new outcome of the overlap feature in the standalone MDL function, we

change our graph_2 as follows:

• Vertex number 8 labeled as F is deleted.

• Our guide edge labeled as g is deleted.

• A new edge labeled as g (our guide edge) is added between vertex number 1 labeled

as A and vertex number 3 labeled as C. For the current example we use a directed

edge.

Our graph_2 is shown in Figure 4.20. The graph has 7 vertices and 7 edges. The “new”

edge has the characteristic that it is an edge between two vertices belonging to the same

substructure and in some cases, as we will see, they belong to overlapping substructures.

Remember that overlap is allowed between instances, so the FindInstances function finds 4

instances.

76

The generated compressed graph is shown in Figure 4.21. We mentioned that our guide

edge joins 2 vertices belonging to a same substructure (vertex number 1 labeled as A and

vertex 3 labeled as C, and in some cases these vertices belong to overlapping substructures.

So, in the graph there are 6 edges labeled as g, 4 edges joining vertices SUB (telling us that

they overlap), and two self edges in 2 vertices SUB. These last 2 edges are our new case of

validation. The self edge tells us that inside the substructure, there is an edge joining two

vertices belonging to the same substructure. The direction of the edges does not matter

since this is an edge inside the substructure (an internal edge). The other duplicated edges

(i.e., edge c and edge f) were computed using the previous described procedure.

Finally, our compressed graph has 6 vertices: 4 vertices SUB, 1 vertex D, and 1 vertex E;

and 15 edges: 5 edges OVERLAP, 6 edges labeled as g, 2 edges labeled as f, and 2 edges

labeled as c.

CBE

DCA

B

a

c

b f

b

a

g

4 2 3

6 7

5

1


graph_2.


77

For the example shown in Figure 4.22, we only changed the direction of the guide edge.

Now it starts from vertex number 3 labeled as C to vertex number 1 labeled as A. The

generated compressed graph is presented in Figure 4.23. The generated compressed graph

is the same one that in the previous example. This is telling us that just changing the

direction of the guide edge does not have effect in the resulting compressed graph.

CBE

DCA

B

a

c

b f

b

a

g

4 2 3

6 7

5

1


For the next example, we only change from a directed guide edge to an undirected one. Our

graph_2 is the graph shown in Figure 4.24. The generated compressed graph is presented in

Figure 4.25. The single change is that all edges g are now undirected edges. Everything else

remains unchanged.

78

CBE

DCA

B

a

c

b f

b

a

g

4 2 3

6 7

5

1


Now, for illustrating a new outcome of the overlap feature in the standalone MDL function

(self edge in a vertex shared by instances of a substructure), we change our graph_2 as

follows:

• Edge labeled as g between vertex number 1 labeled as A and vertex 3 labeled as C is

deleted.

• A new edge labeled as g (our guide edge) is added between vertex number 1 labeled

as A and vertex number 1 labeled as A (a self edge). For the current example a

directed edge.

Our graph_2 is shown in Figure 4.26. The graph has 7 vertices and 7 edges. The “new”

edge has the characteristic that is a self edge; it starts and ends in the same vertex. This

vertex belongs to overlapping substructures, so we have 4 substructures sharing it.

The resulting compressed graph is shown in Figure 4.27. Since our guide edge is a self

edge the compressed graph has 10 edges labeled as g. There are 6 edges joining vertices

79

SUB (telling us that they are overlapping substructures), and 4 self edges in 4 vertices SUB

telling us that they have an edge which origin and destination vertices are the same.

We note in the compressed graph a special characteristic between some vertices SUB. We

refer that some pairs of vertices SUB are joined by 2 edges labeled as g. One of them starts

in the first vertex SUB and ends in the second one. The second edge has a vice versa

direction. This is because they are overlapping substructures with a common vertex and this

vertex is the source and destination of an internal edge.



labeled as c.

CBE

DCA

B

a

c

b f

b

ag

4 2 3

6 7

5

1


graph_2.


For the last example, we use as graph_2 the one shown in Figure 4.28. We only change our

guide edge from a directed edge to an undirected one. Figure 4.29 shows the generated

compressed graph. There are 2 differences between this compressed graph and the previous

80

one. First, the edges labeled as g are undirected edges. Second, the special characteristic for

some vertices SUB is implemented as follows: Currently, we note that those vertices SUB

joined by 2 edges labeled as g in the previous example, now, they are joined by just 1

undirected edge labeled as g. Since the guide edge is undirected we only need 1 edge for

representing them because they are overlapping substructures with a common vertex and

this vertex is the source and destination of an internal undirected edge. The 4 vertices SUB

have self edges, but they are undirected ones this time. Everything else remains unchanged.



labeled as c.

CBE

DCA

B

a

c

b f

b

ag

4 2 3

6 7

5

1


81

4.3 Limited Overlap

As we have described in the previous section, the overlap feature plays a preponderant role

in the Subdue’s substructures discovery system. As a consequence, the generated results are

also conditioned, in a high percentage, to this parameter. However, as we have seen, it is

implemented to allow overlap among any instances of a substructure or among all the

instances of a substructure. In this context, we propose a new approach named limited

overlap. The major feature in this approach is to give the user the means to specify the set

of vertices where an overlap is allowed. These vertices may represent significant elements

in the context we work with.

In the following subsections we will present the motivation, advantages, new algorithm,

and an example of the new overlap feature implemented in the Subdue system.

Motivation

As we have already commented, the current overlap feature in Subdue is implemented in an

orthodox way: all or nothing. It means that Subdue allows the overlap among all the

instances sharing at least one vertex or that Subdue does not allow (discard) the overlap

among instances sharing at least one vertex.

But we argue that a third option is needed, an option where the user has the capability to set

over on which vertices the overlap will be allowed, it is a limited overlap. We visualize

directly three motivations issues to propose the implementation of the new algorithm that

will be explained in the following subsections:

82

• Search space reduction.

• Processing time reduction.

• Specialized overlapping pattern oriented search.

In order to help us to describe the characteristics of the limited overlap we will use the

graphs show in Figure 4.30 and 4.31 as input graphs in the future examples. Input graph

PS_1 (Predefined Substructure number 1) has 2 vertices and 1 edge, input graph PS_2

(Predefined Substructure number 2) has also 2 vertices and 1 edge, and finally graph_3 has

9 vertices and 8 edges.

A

B

a

2

1 C

D

f

2

1

PS_1 PS_2

B

CBE

DCA

B

a

g

c

b f

b

a

4 2 3

67

5

8 1

Df

9

Figure 4.30. Limited overlap - input

graphs PS_1 and PS_2.

Figure 4.31. Limited overlap - input graph_3.

1) Search space reduction. In knowledge discovery systems using a graph-based

approach, the data mining algorithm uses graphs as a knowledge representation; the search

space of a graph-based data mining algorithm consists of all the subgraphs that can be

derived from its input graph.

The substructures discovery process in Subdue begins with the creation of the substructures

matching a single vertex in the graph (one for each of the different labels in the graph).

83

Each iteration through the algorithm selects the best substructures and expands the

instances of these substructures by one neighboring edge (or an edge and new vertex) in all

possible ways.

But as part of the process to select the best substructures and then to expand them there

exists a filter phase. In this phase according to the overlap parameter the instances of a

substructure are evaluated: if overlap is set to true then overlapping instances are kept,

otherwise if overlap is set to false then overlapping instances are discarded.

For example, suppose we want to search the instances of PS_1 and PS_2 in graph_3. With

the overlap set to false Subdue discovers 2 instances, one instance for each predefined

substructure as we can see en Figure 4.32. PS_SUB_X is the nomenclature used by Subdue

to identify it is an instance of Predefined Substructure SUB_X; where SUB_X means it is

the SUBstructure number X. But, if overlap is set to true, Subdue discovers 4 instances, 2

instances for each predefined substructure (see Figure 4.33). Finally, suppose the user

wants to search instances of PS_1 and PS_2 in graph_3 but this time he considers that

vertices A have a higher relevance (for example, a remarkable spatial object in a spatial

database) so he proposed to use a limited overlap, the overlap will be allowed just among

instances containing vertices A. We can see that PS_1 has a vertex A, thus this time Subdue

finds 3 instances, 2 instances of PS_1 and 1 instance of PS_2 as shown in Figure 4.34. In

the figure we can observe that instance 1 and instance 2 share the vertex number 1 labeled

as A. For PS_2 Subdue finds 1 instance since the other one (instance number 4 in Figure

4.33) has also the vertex number 6 labeled as C, but the overlap is just allowed among

vertices A.

84

C

D

a

9

6

PS_SUB_2

Instance 2

A

B

a

5

1

PS_SUB_1

Instance 1

Figure 4.32. No overlap - SGISO.

C

D

f

7

6 C

D

f

9

6

PS_SUB_2 PS_SUB_2

A

B

a

2

1 A

B

a

5

1

PS_SUB_1 PS_SUB_1

Instance 1 Instance 2 Instance 3 Instance 4

Figure 4.33. Overlap - SGISO.

A

B

a

2

1 A

B

a

5

1

PS_SUB_1 PS_SUB_1


C

D

f

9

6

PS_SUB_2

Instance 3

Figure 4.34. Limited overlap PS_1 - SGISO.

We have mentioned that the best discovered substructure by Subdue (by iteration) can be

used to compress the input graph, which can then be input to another iteration. After several

iterations, Subdue builds a hierarchical description of the input data where later

substructures may be defined in terms of substructures discovered on previous iterations.

We have also commented that each iteration through the algorithm selects the best

substructures and expands the instances of these substructures by one neighboring edge (or

an edge and new vertex) in all possible ways. So the number of instances of the

substructures defines the search space (by iteration) in the substructure discovery process.

As we can see in our example, by using the limited overlap we obtain a search space

reduction (with overlap set to true), since the number of instances becoming candidates to

be expanded is selected according to the allowed values given by the user.

85

2) Processing time reduction. The reduction in the number of instances becoming

candidates to be expanded results in a search space reduction, and this effect also has a new

outcome, a processing time reduction.

Allowing overlap slows Subdue considerably since the number of candidate instances to

expand, to evaluate, to match, and to discover increase as we have seen. However, by the

implementation of the limited overlap, the number of instances to be processed in these

phases decrease resulting also in a processing time reduction in the overall substructure

discovery process.

3) Specialized overlapping pattern oriented search. We have also commented that the

limited overlap gives the user the capabilities to define the set of interesting elements over

which the overlap will be allowed (the elements are represented as vertices in the graph

according to the proposed model) so the algorithm will discard the overlapping elements

that the user does not considerer significant.

In the example presented in Figure 4.34, the judgment of the user is that vertices A have a

higher relevance so he proposes to use a limited overlap, the overlap will be allowed just

among instances containing vertices A. This consideration is a personal decision of the user

according to the work context (in many cases with the support of a domain expert). For

instance, a remarkable element may refer to a spatial object in a spatial database or to some

characteristic defining a particular topic of a dataset.

86

Therefore, the limited overlap gives the user the mechanics to implement a pattern oriented

search. The user delimits the set of elements that will have a preponderant role in the

substructure discovery process. But this characteristic gives also a new advantage, the

patterns evaluation process is simplified since the set of generated results is smaller because

it is focused over the user requirements.

Algorithm

As we have described, the limited overlap gives the user the capabilities to define the set of

elements (vertices in the graph) where overlap among instances is allowed. The process

starts reading the set of vertices which are integrated to the Subdue’s parameters as the

limited overlap label list. During the substructure discovery process a filter phase is

performed. This phase evaluates (and possible discards) based on the overlap parameter the

list of discovered instances.

If the substructure discovered process is composed by several iterations, after each of them,

the overlap label list is updated to integrate the new vertices where overlap is also allowed.

Remember that at the end of each iteration, the best discovered substructure by Subdue is

used to compress the input graph, which can then be input to another iteration of Subdue.

After several iterations, Subdue builds a hierarchical description of the input data where

later substructures are defined in terms of substructures discovered on previous iterations.

Therefore, if the best discovered substructure, in each iteration, has a vertex where overlap

is allowed then that substructure is added to the limited overlap label list.

87

Limited Overlap (instance1, instance2) //Global process while ((elementsInstance1 < totalElementsInstance1) AND NOT endProcess){ //Instance1’s vertex vertex1 = vertex from instance1 //Instance2’s vertex while (elementsInstance2 < totalElementsInstance2) vertex2 = vertex from instance2 //Instances overlap if (vertex1 = vertex2) then{ intancesOverlap = true endProcess = true } //Instances overlap, check by limited overlap if (instancesOverlap AND overlapLabelList NOT EMPTY) then{ if (vertex1 in overlapLabelList) then //It is a limited overlap limitedOverlap = true else //Process continues until all vertex1 are validated endProcess = false } } return instancesOverlap, limitedOverlap

Figure 4.35. Limited overlap in the Subdue system.

In the validation process each vertex belonging to instance1 (vertex1) is validated against

all vertices belonging to instance2 (vertex2). If vertex1 and vertex2 are the same then they

are overlapping instances. If the instances overlap then vertex1 is validated against the list

containing the set of vertices allowed for overlapping (the overlapLabelList). If vertex1

exists in overlapLabelList then it is a limited overlap.

Example

To illustrate the functionality of the new algorithm, we will present some examples using

the new version of Subdue implementing the limited overlap feature. The input graphs for

88

the examples are those shown in Figure 4.30 and Figure 4.31; the idea is to perform a

pattern oriented search by discovering patterns in graph_3 but based on PS_1 and PS_2

(our predefined patterns).

Subdue implements a pattern oriented search by, initially, finding all instances of PS_1 and

PS_2 in graph_3. The next step is to compress graph_3 using the found instances of each

PS_1 and PS_2. Figure 4.36 shows the compressed graph with the overlap parameter set to

false. As we can see, Subdue found 1 instance of PS_1 (i.e., vertex PS_SUB_1) and 1

instance of PS_2 (i.e., vertex PS_SUB_2) since overlap among instances is not allowed (see

also Figure 4.32).

Figure 4.36. No overlap - compressed graph.

Finally, the compressed graph becomes the input graph to discover substructures. Figure

4.37 shows the best 3 substructures (default number of reported substructures) found by

Subdue according to its substructure discovery system (see Section 4.1 for details). Each

substructure has 1 instance; it means that there exists 1 repetition of each of them in the

input graph (in the example we use exact graph match, but Subdue allows also inexact

graph match). The first one (labeled as “a”), is composed by 2 vertices: 1 vertex B, and 1

89

vertex E; and 1 edge labeled as c. Subdue reports this substructure as the best discovered

substructure because it is the best that minimize the description of the input graph

according to the MDL principle. The second one (labeled as “b”) is composed by 4 vertices:

1 vertex PS_SUB_1, 1 vertex C, and 2 vertices B; and 3 edges labeled as a, b. g. The vertex

PS_SUB_1 is itself a substructure composed by two vertices: 1 vertex A, and 1 vertex B;

and 1 edge labeled as a. We have already commented that later substructures may be

defined in terms of previous discovered substructures, or in term of predefined

substructures as in this example. Finally, the third one (labeled as “c”) is composed by 4

vertices: 1 vertex PS_SUB_1, 1 vertex PS_SUB_2, 1 vertex C, and 1 vertex B; and 3 edges:

2 edges labeled as b, and 1 edge labeled as g. Once again vertices PS_SUB_1 and

PS_SUB_2 are themselves substructures.

90

a) b)

c)

Figure 4.37. No overlap - discovered substructures.

For the next example we set overlap to true. The generated compressed graph is shown in

Figure 4.38. Now, Subdue finds 2 instances for each PS_1 and PS_2 (see Figure 4.33 for

details).

91

Figure 4.38. Overlap - compressed Graph.

Figure 4.39 shows the best 3 substructures discovered by Subdue. The first and second ones

(labeled as “a” and “b”) are themselves the predefined substructure PS_SUB_2 and

PS_SUB_1 respectively with 2 instances each of them. The third one (labeled as “c”) is a

substructure composed by 2 vertices and 1 edge with 1 instance.

92

a) b)

c)

Figure 4.39. Overlap - discovered substructures.

Our next example is implemented by using the limited overlap feature. Suppose the user

wants to perform a specialized overlapping pattern oriented search: he only wants to allow

overlapping instances in vertices representing PS_1. The generated compressed graph is

shown in Figure 4.40. As a result of the restriction for overlapping instances Subdue finds 2

instances of PS_1 (they share the allowed vertex A) and just 1 instance of PS_2 (since they

share a not allowed vertex C). The found instances are shown in Figure 4.34.

93

Figure 4.40. Limited overlap PS_1 - compressed graph.

Figure 4.41 shows the best 3 substructures discovered by Subdue from this graph using the

limited overlap. In the figure we can see that Subdue reports as the best substructure

PS_SUB_1 with 2 instances (labeled as “a”). This is consequence of the integration of the

compressed graph since it has 2 vertices PS_SUB_1. The second reported substructure

(labeled ad “b”) is composed by 2 vertices: 1 vertex PS_SUB_1, and 1 vertex E; and 1 edge

labeled as c. The last one (labeled as “c”) is composed also by two vertices PS_SUB_1; and

1 edge labeled as PS_OVERLAP_1.

94

a) b)

c)

Figure 4.41. Limited overlap - discovered substructures.

4.4 Conclusion

In this chapter we have described the characteristics and functionality of our graph-based

data mining tool, the Subdue system. We introduced a new algorithm named limited

overlap. We presented some examples showing the functionally of the new algorithm using

an artificial dataset. Examples using real world data will be presented in Chapter 6. These

95

examples are developed by using two test domains: a Puebla downtown population census

from the year of 1777 and a Popocatépetl volcano database.

In the next chapter we introduce a prototype system implementing the proposed model for

representing spatial data, non-spatial data and spatial relations among the spatial objects as

a whole dataset using a graph-based representation.

96

Chapter 5

PROTOTYPE

A natural skill of people is the interpretation of visual data; therefore, this is a fact we must

consider to get advantage during the data mining process. In [30] the authors say that a

future direction for the KDD research field is the design and use of user interfaces: “one can

create a query language which may be used by non-database specialists in their work. Such

a query interface can be supported by a Graphical User Interface (GUI) which can make the

process of query creation much easier”.

The challenge consists of improving the capabilities for displaying the generated results

(i.e., from a query or the data mining process) in a graphical mode. The idea is that if we

are able to analyze the results in such a graphical way, we may give feedback to the user so

that he can refine the analysis process and/or guide the direction for further study. This is

the principle in relevance feedback (do an initial query, get feedback from the user, and

then incorporate information obtained from prior relevance judgments to redefine the

query).

97

We developed a prototype system that provides a graphical user interface to perform the

data mining phase (and data analysis) using our proposed graph-based representation for

spatial and non-spatial data. The analysis is implemented by using spatial and non-spatial

queries and the data mining process is performed with the Subdue system, a graph-based

data mining tool. In our research we used two test domains to evaluate our proposal. The

first one is a database storing data from the Popocatépetl volcano (see Section 3.4 for

details). The other one is a database containing data related to a population census from the

year of 1777 in Puebla downtown as described in Section 5.1.

5.1 Population Census from the year of 1777 in Puebla

downtown

As our test domain we have worked in the project “Habitar y vivir. Análisis del espacio

habitacional de la ciudad de Puebla 1690-1890”. This is a project directed by Dra. Rosalva

Loreto López, a researcher of Urban History. Our objective is to make use of data mining

techniques to find interesting relations and patterns between the population and habitation

spaces in Puebla downtown during that time period. We argue that our model can find

patterns involving non-spatial data (i.e., characteristics of people living in the zone), spatial

data (i.e., distribution of the space), and relations between them (i.e., characteristics of

houses based on people social status and/or number of people living in a house) in a single

pattern.

98

Figure 5.1 shows the spatial structure we implemented for representing the spatial concepts

in the census. The structure has 5 spatial aggregation levels. Parish (spatial concept for

representing a physical area) is the upper spatial component. A Parish is defined by one or

more Neighborhoods. A Neighborhood can involve several Blocks. A Block is defined by

four Streets and finally a Street gives the location to a House, where a family lives.

House

Parish

Block

Neig

hbor

hood

Neighborhood

Stre

et

Street

Block

Parish

Figure 5.1. Representation of spatial concepts in the census from the year of 1777.

After defining the spatial concepts involved in our domain, we need a graph-based data

representation to describe our data as a graph. We have implemented two graph-based

representations for the non-spatial data of the census.

These representations have three main components: (1) HOUSE involves the attributes

describing a House. It contains one or more Uh (atomic physical area where a family lives).

One or more families might inhabit in a House, but each family lives in an Uh. (2) UH

involves the attributes describing the living space. (3) MEMBER involves the attributes

describing a member of a family.

99

Figure 5.2 shows the first representation created for the population census. A description of

the representation is as follows: a Parish contains one or more Neighborhoods. A

Neighborhood contains one or more Blocks or Locations (spatial concept for identifying

with precision a physical area -i.e., north of-). Block and Location contain one or more

Streets. A Street contains one or more HOUSES. A HOUSE has two attributes (NCasa -Id

assigned to the House- and NHabCasa -number of Uh in a House-). A HOUSE contains

one or more UH. An UH has several attributes describing it (Uh -Id assigned to the Uh-,

Etnicidad -predominant ethnic group among the members of a family-, TFamilia -type of

family, i.e., family with children-, TUh -type of Uh- and NMiembrosFamilia -number of

members integrating a family-). An UH contains one or more MEMBERS. Finally a

MEMBER has several attributes describing it (JFamilia -head of family-, TitPersona -title

of the member, i.e., don, doña-, Nombre -first name-, Apellido -last name-, Sexo -sex-,

EdoCivil -marital status-, Parentesco -social relationship with respect to the head of family-

, GpoEtnico -ethnic group- and Edad -age-).

100

Location

Parish

Neighborhood

Block

Street

contain

contain contain

contain

contain

containcontain

value

value

HOUSE

UH

value

value

value

value

MEMBER

value

value

value

value

value

value

value

value

value

NCasa

NHabCasa

Etnicidad

TFamilia

TUh

NMiembrosFamilia

contain

JFamilia

TitPersona

Nombre

Apellido

EdoCivil

Parentesco

GpoEtnico

Edad

Sexo

value

Uh

Figure 5.2. First representation for the non-spatial data in the population census from the year of 1777.

The second representation is presented in Figure 5.3. The representation can be read as

follows: A HOUSE is the main item. A HOUSE has several attributes describing it

(Parish, Neighborhood, Block, Location, Street, NCasa and NHabCasa). A HOUSE

contains one or more UHS. An UH has several attributes describing it (Uh, TUh, Etnicidad,

TFamilia, NMiembrosFamilia). In a UH lives (contains) one or more MEMBERS, and

finally, a MEMBER has several attributes describing it (Jfamilia, TitPersona, Nombre,

Apellido, Sexo, EdoCivil, Parentesco, GpoEtnico and Edad).

101

value

value

value

value

value

value

HOUSE UH

Location

Neighborhood

Street

NCasa

NHabCasa

Parish value

value

value

value

MEMBER

Etnicidad

TFamilia

TUh

NMiembrosFamilia

contains value

value

value

value

value

JFamilia

Sexo

EdoCivil

Parentesco

GpoEtnico

value

value

value

TitPersona

Nombre

Edad

value

Block

value

Apellido

value

Uh

contains

Figure 5.3. Second representation for the non-spatial data in the population census from the year of 1777.

5.2 Modules

This section presents a prototype system developed for testing our graph-based data

representation model for spatial data mining as proposed. The prototype is implemented in

Java. We use Oracle DBMS version 9i for storing and processing our data. Oracle has a

module to manage spatial data named Oracle Spatial; we take advantage of the spatial

operators, geometry functions and spatial aggregate functions implemented in the module

for either identifying the objects in a region over a spatial layer or obtaining/validating the

spatial relations between two spatial objects. The prototype is divided into seven modules:

102

Data Preparation and Cleaning. The data preparation and cleaning phase is a step of the

Knowledge Discovery in Databases (KDD) process. Our module helps the user to validate

the data and creates the necessary structures to store it in the database. In this process, the

data mining expert and the user must work together to identify the activities to be

performed that are related to the process (i.e., to identify noise and remove it, treat missing

values, etc). Once these activities have been defined, the process is transparent to the user

when adding more data (belonging to the same database schema for the domain).

Query. We have implemented the interface shown in Figure 5.4 to allow the user to create

non-spatial queries. The goal is to allow the user, to make use of spatial and non-spatial

attributes, to query the database and to build graphs from the obtained results.

The Query panel allows the user to query the database using an SQL-like approach. The

queries are created in real time and then submitted to the database for answering the user

request. The results are presented to the user in two ways. The first uses a relational

approach and the second is represented over a map.

103

Figure 5.4. The query panel.

The interface is divided in 2 sections, the Control and the Results sections. The Control

section is divided in 4 subsections: Select, Where, OrderBy and GroupBy. The Results

section is divided in 2 subsections: Query visualization and Generated results.

In the Control section the user creates the query. The query is created by specifying the 5

most usual clauses in a SQL statement: Select–From-Where-OrderBy-GroupBy. The first

two elements are mandatory, the last three are optional.

104

The Select subsection presents the twenty-two fields that compose the core of the database.

Twenty one of these fields have four options implementing query functionalities as follows:

option1 – attribute name – option2 – options3 – option4. The first option is used to indicate

if the field will be included in the query. The second option contains the keywords used to

implement, currently, five grouping functions: “count", "count(distinct())", "distinct",

"max" and "min". These keywords are used to group the data by the field(s) selected in the

GroupBy subsection. Options three and four are used to indicate if the descriptive attributes

associated to the field will be included in the query. If the first option is not selected then

all the other options are ignored. Additionally, the Select subsection has an item containing

the name of the schema (i.e., 1777 census) to create the “from” clause of the query.

The Where subsection is used to indicate the set of conditions, by field, for restricting the

rows to be selected (the dataset returned by the query). A condition specifies a combination

of one or more expressions composed of attribute names, attribute values, and logical

operators. These conditions are entered manually by the user, so knowledge about the

domain is required.

The OrderBy subsection allows the user to indicate if he wants to sort the rows returned by

the query and which field(s) will be used to perform the operation. We can sort the rows

returned using the twenty-two fields composing the database and we can also implement

combinations of these elements (i.e., sorting by the Name and Sex fields).

The GroupBy subsection was implemented to allow the user to group the selected rows

based on the value(s) of each row and return a single row with a summary of the

105

information for each group. We can group the rows using twenty-one of the twenty-two

fields of the database and we can also implement combinations of these elements (i.e.,

grouping by the Parish and Sex fields). The grouping aggregation function (i.e., count or

sum) is chosen in the Select subsection. As we have already mentioned, we have

implemented five grouping functions.

Our prototype system creates queries by selecting the different fields and options contained

in the four previous sections. This query is created in real time, displayed in the Query

visualization area and then submitted to the DBMS for processing. The obtained result is

displayed in the Generated results area; it is presented by using a relational approach (rows

and columns).

SQL. Using this interface the user can create a query by typing it directly (in manual form).

The principle is the same as we described in the Query panel, we want to create queries and

to use the results of the queries to create graphs (based in our model). These graphs will be

the data source for our data mining algorithm so that we can find patterns that allow us to

understand and describe our data.

In the Query panel (see Figure 5.4) the user creates the query in real time by using a

graphical interface but if he wants to modify it, it is not possible (the query is displayed just

for visualization purpose). Each time the user creates a query in the Query panel, the

created query is copied to the Query visualization area in the SQL panel so if the user wants

to enhance or modify the query he is able to do it. The generated results are presented to the

106

user in the Generated results area. Both interfaces (Query and SQL panels) provide the user

with the capability to send the generated results to a text file or a printer.

Map. Another way to create graphs in the prototype is using the Map panel. In this case the

interface displays in a graphical way the spatial layers stored in the database (see Figure

5.5).

Figure 5.5. The map panel.

By using the interface, the user can delimit the set of spatial and non-spatial data that will

be used for creating the graphs. Some times the user wants to analyze only some regions in

a spatial layer so it is not necessary to include all the data in the graph. Additionally, we

107

have to take into account that if we have a huge database we will build huge graphs and this

feature has a direct impact over the data mining algorithm, so this is an important issue we

have to face. A solution for facing the problem of creating huge graphs is delimiting the set

of elements to be included in it by using selection windows as we mentioned in Chapter 3.

A second method for delimiting the set of elements to be considered while creating a graph

is by using the results of the non-spatial queries created and processed in the Query and

SQL panels. This is implemented by a process that identifies the spatial objects involved in

the results generated by a non-spatial query (i.e., a query computing the number of people,

grouped by Sex, living in each Parish in the census from the year of 1777 so we can select

and show the Blocks or the Streets belonging to each Parish over the map).

The interface is divided in four sections: Visualization area, Layers in database,

Operations control and Map control. The Layers in database section displays the name of

the spatial layers stored in the database. The component is also used for selecting and

identifying the spatial layers to work with.

We also include information about the spatial relations to be considered in the graph. The

Operations control section includes the operations (Topological, Distance and Direction)

implemented to validate the spatial relations among spatial objects. In the case of the

topological relations we have implemented the validations supported by the Oracle Spatial

module.

108

The Map control section has five buttons implementing the Zoom in, Zoom out, Show all,

Reset all and Save map operations. The Visualization area works in two operational modes:

Query mode for creating selection windows and Zoom mode for implementing the Zoom in

and Zoom out operations. The option “X Layer” is used to indicate if the validation of

spatial relations among spatial objects will just be among objects belonging to different

spatial layers (i.e., objects belonging to the Parish and Neighborhood layers) or among all

objects belonging to all layers. The last two options are used to control the characteristics

of the Selection window: a Selection window tool can be a Rectangle or a Circle, and we

can specify if we want to select just the elements inside the Selection window or the

elements inside and touching its border.

Once the user has selected the working area, the following steps identify the objects inside

the area and validate the spatial relation(s) among them. We have implemented the

validation of topological, distance and direction relations; only the objects meeting the

relation(s) chosen by the user will be candidates to become objects in the graph.

Spatial Graph. The next task is to create the graph. First, the user must select the non-

spatial attribute(s) describing the spatial objects in the graph (see Figure 5.6). Remember

that the spatial objects and spatial relations among the objects that will be included in the

graph were selected in the Map panel.

109

Figure 5.6. The spatial graph panel.

By using this interface, the user can select the non-spatial attributes, for each spatial layer

he works with, that will be related to the spatial objects in the graph. For instance, suppose

the user is working with the spatial layers A and B; the spatial layer A has five non-spatial

attributes and the spatial layer B has three non-spatial attributes. So, he may select from

layer A two attributes and from spatial layer B three attributes. The idea is to give the user

the capability to select the non-spatial attributes, for spatial layer, that he considers relevant

for the mining task.

110

From the Graph characteristics section the user defines the format, model and output

device characteristics for the graph. The graph generated by the system can be created

following the Subdue or the Graphviz [19] layouts. In the first case the graph is created for

feeding the Subdue system, and in the second case it is created for visualization purposes.

Currently, we have implemented five graph-based representation models in our prototype.

Each model expresses a representation proposal for creating graphs involving the three

basic elements found in a spatial database (spatial, non-spatial data, and spatial relations

among spatial objects). The resulting graph can be visualized on the screen or stored in a

text file. The last option was implemented in order to provide the Subdue and Graphviz

systems with their corresponding input files.

We can see at the right side of Figure 5.6 an example of a created graph using the Subdue

layout. This sample graph was generated from 2 spatial layers (i.e., chiglesia and chpuebla).

From each of them the user selected one or more attributes that were related to the spatial

objects. Figure 5.7 presents a fragment of the same graph but this time it is drawn using the

Graphviz system.

111

Figure 5.7. Graph representation of processed data.

Non-Spatial Graph. Focusing in the “Habitar y vivir. Análisis del espacio habitacional de

la ciudad de Puebla 1690-1890” project we have implemented an interface for allowing the

user the creation of graphs based on the two graph-based representations created for the

non-spatial data of population census (they are shown in Figure 5.2 and Figure 5.3). These

graphs are created using only non-spatial data. Our objective for implementing graphs with

this characteristic is to have metrics that allow us to compare and to evaluate the results

generated by the data mining algorithm when we use graphs containing spatial data, non-

spatial data, and spatial relations at the same time against graphs containing only non-

spatial data.

The interface implemented is shown in Figure 5.8. The user selects the non-spatial

attributes and defines the settings that will be used for creating a query (as in the spatial

graph). The result obtained from the query is used for creating the non-spatial graph.

112

Figure 5.8. The non-spatial graph panel.

Subdue. The Subdue panel contains the interface developed for calling the Subdue system

(see Figure 5.9). In order to run Subdue, the user must select the corresponding text file (a

file containing a graph) and define the parameters that will guide the Subdue’s substructure

discovery system for finding substructures.

113

Figure 5.9. The Subdue panel.

Figure 5.10 shows an example of a Subdue’s standard output (only for one substructure) for

displaying the discovered substructures (i.e., patterns) from an input graph. Since our

definition of instances and substructures as graphs (see Chapter 4.1 for details), the

Subdue’s output is also a graph, in our example, it can be read as follows:

• Substructure value = 1.01706. Represents the value of the substructure in the input

graph.

• Pos instances = 3865. This value tells us how many instances of the substructure

exist in the input graph.

114

• Graph (2v, 1e). Number of vertices (“v”) and edges (in our example directed edges

“d”) in the substructure.

• “v 1 SUB_4”. Substructure’s first vertex labeled as SUB_4.

• “v 2 X”. Substructure’s second vertex labeled as X.

• “d 1 2 ETNICIDAD”. A directed edge from vertex 1 to vertex 2 labeled as

ETNICIDAD in the substructure.

Substructure: value = 1.01706, pos instances = 3865, neg instances = 0 Graph(2v,1e): v 1 SUB_4 v 2 X d 1 2 ETNICIDAD

Figure 5.10. Example of Subdue’s standard output.

As we have already commented, Subdue is a system that finds substructures in a

hierarchical way, that is, a substructure found in a previous iteration can appear in a new

iteration. When this happens, those substructures are represented by a vertex labeled as

SUB_x, which represents the best substructure discovered at iteration “x”.

In our example (Figure 5.10), SUB_4 is itself a substructure defined by 2 vertices and 1

edge where its first vertex is labeled as SUB_2. This means that the definition of SUB_4 is

composed by the definition of SUB_2. Substructure SUB_2 is defined by 2 vertices and 1

edge where its first vertex is labeled as SUB_1. Again, this means that the definition of

SUB_2 is composed by the definition of the previously discovered substructure SUB_1.

Finally, SUB_1 is a substructure defined by 2 vertices and 1 edge.

115

As we can see, the lecture and interpretation of the substructures discovered by Subdue

may be a complicated task. We have implemented a parsing function for reading a text file

containing the results generated by Subdue, so we can create the necessary data structures

for presenting to the user the same results but using an easier way to read it as we show in

Figure 5.11 (the figure presents the example described in Figure 5.10). This layout is

created by using the Graphviz system and is saved as a JPEG image file but we can save it

in any output format supported by Graphviz. The goal is to improve the way the user can

read and interpret the results generated by Subdue.

Figure 5.11. Layout for reading the Subdue’s discovered substructures.

5.3 Conclusion

As test domain we have designed and built a spatial database to store both a population

census from the year of 1777 in Puebla downtown and a map representing the blocks in the

116

zone. This data is part of a project directed by Dra. Rosalva Loreto López, a researcher in

the urban history domain.

We have developed a prototype system implementing our model to represent together

spatial data, non-spatial data and the spatial relations among the spatial objects. The

prototype allows the user to select the spatial layers to work with, to create spatial and non-

spatial queries that will be used to select the spatial objects that will be included in the

graph. For each spatial layer the user work with, he has the capability to select the attributes

that will be related to the spatial objects in the graph.

We have also implemented a visualization tool which helps us to display in a graphical way

(by using the Graphviz system) the hierarchical discovered substructures by Subdue.

In the next chapter we present three cases showing the applicability of our methodology for

modeling and mining spatial data mining using a graph-based representation.

117

Chapter 6

RESULTS

This chapter presents three use-cases of our methodology for modeling and mining spatial

data using the proposed graph-based representations. For this purpose, we used two spatial

databases as our test domains. The first database contains data related to a population

census from the year of 1777 in Puebla downtown and the second one is a database storing

data from the Popocatépetl volcano.

The use-cases described were implemented based on the following premises: evaluating the

graph-based proposal for modeling and mining spatial data, evaluating the limited overlap

feature, and evaluating the discovered knowledge with the support of a domain expert.

Therefore, the three use-cases presented in this chapter were implemented based on the

following methodology:

1. Selection of the spatial layers to work with.

2. Selection of the spatial relations that will be validated among the spatial objects.

3. Selection of the non-spatial attributes that will be related to the spatial objects in the

graph.

118

4. Mining the graph using the no overlap, standard overlap and limited overlap

features.

5. Evaluation of discovered patterns.

6.1 Population census from the year of 1777 in Puebla downtown

As we have already mentioned, our first test domain is a spatial database containing data of

a population census from the year of 1777 (see Chapter 5 for more details). Figure 6.1

shows a fragment of the “chpuebla”, “chiglesia” and “chrio” spatial layers used in the use-

cases.

Figure 6.1. Population census from the year of 1777 in Puebla downtown.

119

The chpuebla spatial layer (shown in white color) represents blocks in Puebla downtown;

this layer is related to a population census from the year of 1777 as non-spatial data. The

chiglesia layer (shown in red color) contains representative churches for each parish in the

zone. The chrio layer (shown in green color) represents a river crossing Puebla downtown.

It is important to remark that a parish is a spatial object grouping several blocks in the zone.

Each parish has a church as its agglomerative element (people used to live around a

church). Figure 6.2 shows the six parishes (each shown in a different color) and their

representative church:

1. El Sagrario.

2. San José.

3. San Marcos.

4. San Sebastián.

5. Santa Cruz.

6. Santo Angel.

120

Figure 6.2. Parishes in Puebla downtown in the 1777 year.

All use-cases presented in this section were developed using all graph-based models

(currently five models). However, the description of the generated results in this first test

domain corresponds only to model #2 (the model is named single replication of relation

types, complete information). Some of the discovered patterns were used by the domain

expert to validate facts already known (i.e., distribution of the population in the census) and

other allows him to know unknown relationships among spatial objects and non-spatial data

(attributes) in the census (i.e., common characteristics of people living along the two

borders of the river crossing Puebla downtown). In Section 6.2 we present a comparison

among the generated results by each proposed model using a Popocatépetl volcano

database.

121

6.1.1 Use-case: El Sagrario

Suppose we want to know what people have in common in the spaces within a radius of

150 meters from the representative church in parish #1 (El Sagrario). Our experiment will

be focused to find regularities related to the following issues:

• Number of habitation spaces in a house.

• Members of a family.

• Type of family.

• Ethnic group of each family member.

The guideline to select a radius of 150 meters from the church is that this value allows us to

include in our sample dataset at least one block in all directions around the church as we

show in Figure 6.3. Thus, by using the prototype system we selected the chpuebla and

chiglesia spatial layers. Once we selected the spatial layers to work with, the following

steps are to select the spatial relation(s) to be validated among the spatial objects and the

non-spatial attributes that will be related to the spatial objects in the graph. The selected

parameters were the following:

• Spatial layers: chpuebla and chiglesia.

• Pivot: representative church in parish #1.

• Spatial relation: distance.

o Value: 150 meters within a radius from the representative church to the

blocks.

• Spatial graph-based model: model #2.

122

Figure 6.3. Blocks 150m. from representative church, parish “El Sagrario”.

The generated graph was composed of 24,167 vertices, and 24,166 edges according to the

proposed graph-based model. Figure 6.4 shows a fragment of the graph.

Figure 6.4 Example of a generated graph in the use-case “El Sagrario”.

123

This graph was used as the input to the Subdue system. For the experiment we used the

following Subdue’s parameters:

• Predefined substructure: yes (we used “UH CONTAIN MEMBER” since these

elements are grouping components in our graph-based representation for the non-

spatial data of the census). See Section 5.1 for more details.

• Overlap: yes.

• Limited overlap: no/yes (first, we used standard overlap; next, we used limited

overlap).

Once Subdue completed the mining process, the generated results (patterns) were evaluated

by the domain expert. Figure 6.5 shows three examples of discovered patterns using the

standard overlap option. The expert’s opinions were based on the following issues: the

patterns are based on the population distribution schema in the census from the year of

1777 (large population inhabits in parish #1). 65% of the population, in this area, did not

given its ethnic group, this can be interpreted from a demography history perspective, as a

possible dissolution of the racial element for grouping people (creation of groups or

classes) in benefit of alternative grouping parameters such as salary, family networks (how

they lived and whom they lived with), consumption levels, type of house. 16% said to be

“Spanish”. People lived based on the model “Jefe con Familiares y Agregados”.

124

a) b)

c)

Figure 6.5. Examples of discovered patterns by using standard overlap in use-case “El Sagrario” (1).

We can see that the patterns shown in the figure are composed of spatial data (i.e., a vertex

representing a house) and non-spatial data (i.e., a vertex and an edge telling us the major

ethnic groups in the census). Although those patterns do not contain any reference to spatial

relations we must to note that a spatial relation was used to define the dataset for the mining

phase (i.e., 150 meters within a radius from the representative church to the blocks). They

do not appear in the discovered patterns but they are part of the input graph. An explication

to this situation is that most common characteristics (repetitions in the input graph), in this

example, occur in the non-spatial data describing the spatial objects (i.e., the ethnic group

“X- Undefined” and “E- Spanish”). However, these facts represent the strength of our

proposal since our basic idea is to represent spatial data, non-spatial data and spatial

125

relations among the spatial objects as a unique dataset, mine it together, so we can discover

patterns involving these elements at the same time.

The next step in our experiment was to evaluate our limited overlap proposal in Subdue.

We stated three motivations to propose the new algorithm: a search space reduction, time

processing reduction and specialized overlapping pattern oriented search. Therefore, in

order to demonstrate that by using the new approach we obtain those benefits we

implemented the following example. For the limited overlap test, we allow overlap only in

vertices representing the ethnic group of the family members. These vertices are used to

guide the overlapping pattern oriented search.

In this example, the discovered patterns by using the standard and limited overlap features

were slightly different. Figure 6.6 presents three examples of discovered patterns using the

limited overlap. For example, both cases reported as the first substructure a pattern telling

us “Undefined” is the predominant ethnic group in the dataset. The second discovered

substructure, for both cases, tells us that “Spanish” is the next predominant ethnic group. In

the case of the third substructure, using standard overlap Subdue reported a pattern related

to the number of habitation spaces in a house, but through limited overlap Subdue found a

relationship among the “Undefined” ethnic group and the family type “Jefe con Familiares

y Agregados”. An interpretation of these results is that limited overlap reports in all

discovered substructures a vertex representing ethnic group because we oriented the search

over this type of vertices when we specified that vertices representing ethnic group were

allowed for overlapping.

126

a) b)

c)

Figure 6.6. Examples of discovered patterns by using limited overlap in use-case “El Sagrario” (1).

The next step in our experiment was to evaluate the objective of a processing time

reduction by using the limited overlap feature over the standard overlap. Figure 6.7 presents

the processing time comparison chart for this experiment. In the figure, the time taken by

using standard overlap is shown in pink color (193,426 seconds). On the other hand, the

processing time taken by using the limited overlap option is shown in yellow color (36,042

seconds). As we can see, we obtained a time reduction gain of 81.37% using our proposed

limited overlap approach.

127

Figure 6.7. Processing time standard vs. limited overlap: use-case “El Sagrario” (1).

Now, we are going to modify slightly our study zone to show the way we can work with

two or more spatial relations at the same time. By using the prototype system the user can

select one or more spatial relations that will be validated among the spatial objects. For

instance, it is possible to select two spatial relations belonging to the same spatial relation

type (i.e., topological) or to different spatial relations types (i.e., one topological and one

distance relation). Suppose we want to search for regularities about people and habitation

spaces in a radius of 150 meters from the same representative church in parish #1, but this

time, we only want to evaluate blocks located on the North side as shown in Figure 6.8.

Our experiment will be focused to know common regularities over the same issues as in the

previous test.

By using the prototype system we selected the following parameters:

• Spatial layers: chpuebla and chiglesia.

• Pivot: representative church in parish #1.


193,426

36,042

0 50000 100000 150000 200000

Seconds

1 Limited OverlapOverlap

128

o Value: 150 meters.

• Spatial relation: direction

o Value: North.



proposed graph-based model.

Figure 6.8. Blocks 150m. North side from representative church, parish “El Sagrario”.

As in the previous example, the graph was used as input to Subdue. For the experiment we

selected the following Subdue’s parameters:

129




• Overlap: yes.


overlap).

Once Subdue completed the mining phase, the generated results were evaluated by the

domain expert. The expert found the following facts: the predominant ethnic group in the

area remains as “Undefined” because of the proximity to the parish center and the

continuous racial and social interchange from the North side. As consequence, they share

the same family type structure. The “Spanish” population is the most important (15%) and

the next one is “Mestizos” (7.13%). The family nucleus which includes “other people living

with” (added people) employs “Mestizos” as subordinated workers (i.e., waiters and

salesmen). It is important to remark that they do not employ “Indígenas” (maybe because

they do not speak Spanish and their limited cultural level).

This is the domain expert evaluation but we also needed to evaluate how the new limited

overlap feature worked. Therefore, the same experiment was performed using the standard

and then the limited overlap. For limited overlap, the vertex allowed for overlap was the

same as in the previous test. The discovered patterns by using standard and limited overlap

were very similar to those obtained in the first test. In fact, the first and second reported

substructures were the same although the third one was different (see Figure 6.9). By using

130

standard overlap Subdue reported “Mestizos” as the third predominant ethnic group.

Limited overlap reported a relationship among the “Undefined” ethnic group and the family

type “Jefe con Familiares y Agregados”. This last pattern was the same one reported in the

previous test.

a) b)

Figure 6.9. Examples of discovered patterns in the use-case “El Sagrario” (2)

Figure 6.10 shows the processing time comparison chart by using the standard and limited

overlap features. In this figure, the time taken when using the standard overlap feature is

shown in pink color (30,532 seconds) while the processing time taken when using limited

overlap is shown in yellow color (2,471). It is important to note that by using our new

overlap approach we obtained a time reduction gain of 92% in our experiment.

131

Figure 6.10. Processing time standard vs. limited overlap: use-case “El Sagrario” (2).

6.1.2 Use-case: People living along the borders of the river crossing

Puebla downtown

As we mentioned at the beginning of the chapter, a river crosses Puebla downtown.

Suppose that we are interested in knowing characteristics about family types and ethnic

groups of people living along the borders of the river.

Therefore, we defined as our study area all blocks located at most 50 meters from the

borders of the river as shown in Figure 6.11. We selected this distance since it allows us to

select at least one block along the entire border of the river. We used the following

parameters in our use-case:

• Spatial layer: chpuebla and chrio.

• Pivot: the river.


o Value: 50 meters.

30,532

2,471

0 10000 20000 30000 40000

Seconds


132



proposed graph-based model.

Figure 6.11. Blocks 50m. from river crossing Puebla downtown.

The created graph was used to feed the Subdue system. For this test we selected the

following Subdue’s parameters:




• Overlap: yes.

133


overlap).

Figure 6.12 shows two examples of discovered substructures in this use-case. Our domain

expert evaluated the generated results focusing in the following issues: there is a

modification for the population agglomerative criterion on the East side (“San Francisco”,

“El Alto Xonaca”, and “Los Remedios” neighborhoods) and on the West side (“San José”,

“El Sagrario” and “El Carmen” neighborhoods) of the river. The “Mestizos” is the

predominant ethnic group (24.5%); the “Spanish” is the next one (almost has the same

percentage). This pattern may outline that the “Mestizos” ethnic group played the role of

intermediator between the “Spanish” and “Undefined” groups on the West side, and

between the “Spanish” and “Indígenas” on the East side.

a) b)

Figure 6.12. Examples of discovered patterns in use-case “people around the river crossing Puebla

downtown”

The previous results were generated using standard overlap, but we needed to compare

them with the results obtained using limited overlap, so we used the same parameters as in

134

the previous test. We proposed to allow overlap only in vertices representing the ethnic

group.

It is important to mention that, if our overlap label list has many elements, it could be more

efficient (concerning to processing time) to use standard overlap because of the overhead

that results from the validation process. Remember that when we use the limited overlap

feature, each time that an overlap among instances of a substructure is detected, the overlap

is evaluated in order to know if it is allowed or not.

In this experiment we noted that the patterns discovered using the standard and limited

overlap approaches were the same but the processing time taken by each of them for the

mining task was slightly different. Figure 6.13 presents the time comparison chart for the

experiment showing the time taken when using standard overlap in pink color (15,572

seconds) and the time taken when using limited overlap in yellow color (14,021 seconds).

Limited overlap required a lower time to find the same patterns than standard overlap.

Figure 6.13. Processing time standard vs. limited overlap: use-case “people around the river crossing Puebla

downtown”.

15,572

14,021

0 5000 10000 15000 20000

Seconds


135

The two use-cases presented in this section show the functionality of our proposal for

modeling and mining spatial data using a graph-based representation. The discovered

patterns in the population census from the year of 1777 were analyzed by a domain expert.

Some of these patterns allow the user to validate facts already known. For instance, the

predominant ethnic groups classified by parish, and population distribution according to the

social status in the zone. But other patterns allow him to know implications, previously

unknown, among spatial concepts and non-spatial attributes in the census. For instance,

racial and economic interchange among people in the parishes, which are the common

characteristics of population living along the borders of the river crossing downtown (on

the West and East sides), social structure according to the ethnic group, common

regularities among the family type and habitation space.

Processing time comparison among standard and limited overlap features was other topic

evaluated in these use-cases. We presented time processing comparison charts showing the

time reduction gain obtained by using the new approach. We also demonstrated that by

using limited overlap we can orient the search over substructures (patterns) containing

elements that in our domain may represent relevant issues. For instance, in the year of 1777

the ethnic group represented a significant element to know characteristics about a family

and their habitation space.

Next section presents a use-case using a Popocatépetl volcano database. The objective in

this illustrative use-case is to evaluate/compare the generated results by each proposed

graph-based model.

136

6.2 Popocatépetl volcano

In Section 4.3 we presented a preliminary use-case showing the applicability of our

methodology using a Popocatépetl volcano database. This section presents an extended use-

case employing the same database. First, we describe the study area, next the method used

to define the dataset and the parameters for the spatial data mining processes, and finally

the generated results.

As already mentioned, the database contains data related to several issues in the

Popocatépetl zone such as settlements, rivers, and evacuation roads. Figure 6.14 shows a

fragment of the Popocatépetl volcano database. For the experiments we will use three

spatial data layers. The first one is the roads layer (representing the roads in the area), the

second one is the rivers layer (representing the rivers in the area), and finally the third one

is the settlements layer (representing population areas). To illustrate the use-case we have

delimited a study zone as shown in the figure.

137

Figure 6.14. Popocatépetl volcano.

The experiments will focus on identifying relationships and characteristics shared among

settlements, roads and rivers in the study zone. Suppose we want to know characteristics

shared by these elements that can help us to implement/evaluate evacuation plans in case of

a volcanic contingence. For example, characteristics of roads starting in or crossing a

settlement, material used to build those roads and their current status (i.e., paved, unpaved),

characteristics of the roads and rivers meeting a relationship (i.e., they cross, touch) in the

zone, rivers near to settlements that in case of huge pluvial concentration might represent a

potential risk.

138

6.2.2. Use-case: Popocatépetl

To illustrate the capabilities of our model for modeling and mining spatial data we will use

as our study zone that shown in Figure 6.15 (Southwest side of the volcano crater). The

experiment is focused on discovering characteristics among roads, rivers, and settlements in

the zone. The presentation of the generated results will be structured in the following way:

We will present discovered patterns among roads and rivers, roads and settlements, and

rivers and settlements in the study zone. We chose this structure because we wanted to

illustrate the user’s capabilities to organize the data in order to create different sceneries to

represent and mine spatial data.

Figure 6.15. Popocatépetl volcano: study zone.

Therefore, we used our prototype system to select the river, settlement, and road spatial

layers. The next step was to select the spatial relationships to be validated among the spatial

139

objects under consideration. To develop our test, we took advantage of a special feature

implemented in the Spatial Oracle module (our SDBMS system): Oracle has the capability

to examine two geometry objects to determine their topological spatial relationship.

Moreover, it is possible to indicate that the same SDBMS determines and returns the

topological relationship that best matches the geometries.

So, to create our dataset, we first selected the spatial layers to work with and then we

evaluated the relationships among the spatial objects contained in the spatial layers. The

experiments were implemented based on the following parameters:

• Spatial layers: rivers, settlements and roads.

• Spatial relation: topological relations supported by Oracle Spatial.

• Spatial graph-based model: all models.

The experiment was performed using the five proposed graph-based models. The objective

was to evaluate the results generated by each of them. Additionally, we ran the test using

no overlap, standard overlap and limited overlap features. In the following figures we show

the generated results in the experiment. In the figures the generated result by using no

overlap is labeled as “a)”, via standard overlap is labeled as “b)”, and through limited

overlap is labeled as “c)”. In the case of limited overlap, we told Subdue to allow overlap

only for vertices representing roads in the zone because this element represents a primary

item in our study domain (evaluation and implementation of population evacuation plans).

140

Since our intention in the experiment was to compare the generated results using the

proposed graph-based models, we selected (and reported) as the most significant

discovered pattern (by using no overlap, standard overlap and limited overlap), the one

covering the following restrictions:

• Complete pattern. A pattern reporting at least two spatial objects (i.e., road and

river), the spatial relation among them (i.e., touch), and some non-spatial

attribute(s) (i.e., “road category unpaved” and “river category draining”).

• Maximum number of reported instances. A pattern with the highest score of

reported instances of a substructure.

The following subsections present the generated results using model 1 to 5. By each model

we describe the discovered patterns according to the proposed structure to mine data: road

and river, road and settlement, and finally river and settlement. At the end of Section 6.2.2,

we present comparison tables and conclusions of the generated results by each model.

6.2.2.1 Model #1 - base model

Road and River. Figure 6.16 shows the most significant discovered pattern between roads

and rivers. The pattern describes a relationship among “road category unpaved overlapping

a river category draining” in the zone. This pattern may be considered as an indicator of the

number of roads that need to be supervised in case of a volcanic contingence since the

material type they are built with, and because they cross rivers (the lecture may be done in

inverse order) that in case of huge pluvial concentration may overflow and make roads

useless. Subdue found by using no overlap 46 instances (in the second iteration) of the

141

pattern; via standard overlap it found 85 instances (in the first iteration), and through

limited overlap it also found 85 instances (in the second iteration). As we can see in the

figure, standard and limited overlap found the same number of instances of the pattern, but

limited overlap required two iterations to find the same pattern. However, this fact does not

mean that standard overlap is better than limited overlap because analyzing the overall

processing time required by limited overlap to finish the substructure discovery phase we

note that it was lower than the required by standard overlap.

142

a) b)

c)

Figure 6.16. Relationships among roads and rivers by using model #1.

Road and Settlement. The most significant discovered pattern found in this experiment

describes a relationship among “road category unpaved touching a settlement category

construction” in the zone as shown in Figure 6.17. “Settlement category construction”

represents in the Popocatépetl’s settlement spatial layer inhabit areas with huge population,

buildings and several constructions used to offer services to people. If we assume that

143

people may require to be evacuated in case of an eruption and that the roads that will be

used are unpaved then this situation may become a problem (i.e., a bottleneck, water and

soil may become mud and this may make roads useless). For this experiment Subdue found

via no overlap 6 instances (in the ninth iteration) of the pattern, through standard overlap it

found 9 instances (in the fourth iteration), and by using limited overlap it discovered 8

instances (in the tenth iteration). In all cases Subdue was able to discover the same pattern;

the difference was the number of computed iterations required to discover it.

144

a) b)

c)

Figure 6.17. Relationships among roads and settlements by using model #1.

River and Settlement. Figure 6.18 shows the most significant discovered pattern for these

spatial objects. It describes a relationship among “river category draining crossing a

settlement category either (a) block or (b)(c) construction” in the zone. “Settlement

category block” represents in the Popocatépetl´s settlement spatial layer inhabit areas but

with small population, in fact there exist several uninhabited areas, few buildings and

145

constructions. The pattern may be used to identify potential flooding zones because it

represents rivers close to (may be some of them crossing) areas inhabited by people.

Through no overlap Subdue discovered 5 instances (in the twelfth iteration) of the pattern,

by using standard overlap it also found 5 instances (in the eighth iteration), and finally, via

limited overlap it also found 5 instances (in the eighth iteration). For this experiment

Subdue discovered the same pattern in the three cases, however, by using standard overlap

and limited overlap the understanding of the pattern is simpler.

146

a) b)

c)

Figure 6.18. Relationships among rivers and settlements by using model #1.

The previous experiment was done using model 1, now we perform the same experiment

using model 2 to 5. We will be focused to compare the same discovered patterns for the

three “object-object” structures (i.e., road-river, road-settlement, and river-settlement). The

147

generated results are shown in the figures following the same report schema: those using no

overlap are labeled as “a)”, those using standard overlap are labeled as “b)”, and finally

those created with limited overlap are labeled as “c)”. The most significant differences

between the generated results are the number of reported instances and the number of

iterations needed to discover the pattern. More iterations means more processing time to

discover the pattern.

6.2.2.2 Model #2 - single replication of relation types, complete

information

Road and River. By means of model #2 Subdue discovered the pattern (among road and

river) shown in Figure 6.19. Subdue found by using no overlap 41 instances (in the second

iteration) of the pattern, via standard overlap it found 85 (in the third iteration), and through

limited overlap it discovered 64 (in the second iteration). All cases reported “road category

unpaved and river category draining” as the predominant spatial objects.

148

a) b)

c)


Road and Settlement. For these spatial objects Subdue was able to discover, by means of

model #2, the pattern shown in Figure 6.20. Via no overlap Subdue found 5 instances (in

149

the fourteenth iteration) of the pattern, through standard overlap it discovered 8 (in the sixth

iteration), and by using limited overlap it found 7 (in the tenth iteration). No overlap

reported “settlement category block” whereas standard and limited overlap reported more

instances of “settlement category construction”.

a) b)

c)


150

River and Settlement. The pattern discovered by means of model #2 among these objects

is shown in Figure 6.21. Through no overlap Subdue found 5 instances (in the sixteenth

iteration) of the pattern, by using standard overlap it discovered 10 (in the seventh

iteration), and via limited overlap it found 5 (in the fourteenth iteration). No overlap and

limited overlap reported “settlement category construction” whereas limited overlap

reported more instances of “settlement category block”.

a) b)

c)


151

6.2.2.3 Model #3 - double replication of relation types, no complete

information

Road and River. The pattern discovered, by means of model #3, for these objects is shown

in Figure 6.22. Through no overlap Subdue found 39 instances (in the second iteration) of

the pattern, by using standard overlap it found 85 (in the second iteration), and via limited

overlap it discovered 34 (in the ninth iteration). In all cases Subdue reported as the

predominant spatial objects “road category unpaved and river category draining”.

152

a) b)

c)


153

Road and Settlement. By means of model #3 Subdue discovered the pattern shown in

Figure 6.23. Subdue found by using no overlap 4 instances (in the fifteenth iteration) of the

pattern, via standard overlap it found 8 (in the sixth iteration), and through limited overlap

it discovered 5 (in the thirteenth iteration). No overlap and limited overlap reported

“settlement category block” as the predominant spatial object whereas standard overlap

reported “settlement category construction”.

a) b)

c)


154

River and Settlement. For these spatial objects Subdue was able to discover, by means of

model #3, the pattern shown in Figure 6.24. Via no overlap Subdue found 5 instances (in

the twelfth iteration) of the pattern, through standard overlap it found 19 (in the fourth

iteration), and by using limited overlap it found 5 (in the tenth iteration). No overlap and

limited overlap reported less instances of “settlement category construction” whereas

standard overlap reported more instances of “settlement category block”.

a) b)

c)


155

6.2.2.4 Model #4 - single replication of relation types, no complete

information

Road and River. For these spatial objects Subdue was able to discover, by means of model

#4, the pattern shown in Figure 6.25. Via no overlap Subdue found 39 instances (in the

second iteration) of the pattern, through standard overlap it discovered 85 (in the first

iteration), and by using limited overlap it found 60 (in the second iteration). All cases

reported as the predominant spatial objects “road category unpaved and river category

draining”.

156

a) b)

c)


Road and Settlement. The pattern discovered, by means of model #4, for these objects is

shown in Figure 6.26. Through no overlap Subdue discovered 6 instances (in the twelfth

iteration) of the pattern, by using standard overlap it found 8 (in the tenth iteration), and via

157

limited overlap it found 8 (in the seventh iteration). The representative spatial objects are

“road category unpaved and settlement category construction” in all cases.

a) b)

c)


River and Settlement. By means of model #4 Subdue discovered the pattern shown in

Figure 6.27. Subdue found by using no overlap 5 instances (in the sixth iteration) of the

158

pattern, via standard overlap it found 10 (in the sixth iteration), and through limited overlap

it found 5 (in the eleventh iteration). No overlap and limited overlap reported less instances

of “settlement category construction” whereas standard overlap reported more instances of

“settlement category block”.

a) b)

c)


159

6.2.2.5 Model #5 - double replication of relation types, complete

information

Road and River. The pattern discovered, by means of model #5, for these objects is shown

in Figure 6.28. Through no overlap Subdue discovered 45 instances (in the fifth iteration)

of the pattern, by using standard overlap it found 85 (in the first iteration), and via limited

overlap it found 45 (in the fifth iteration). Subdue reported in all cases as the representative

spatial objects “road category unpaved and river category draining”.

160

a) b)

c)


161

Road and Settlement. By means of model #5 Subdue discovered the pattern shown in

Figure 6.29. Subdue found by using no overlap 6 instances (in the seventh iteration) of the

pattern, via standard overlap Subdue could not find a complete pattern (in the figure the

category of the settlement is not reported), and through limited overlap it found 7 (in the

seventh iteration). No overlap and limited overlap reported as the representative spatial

objects “road category unpaved and settlement category construction”.

a) b)

c)


162

River and Settlement. For these spatial objects Subdue was able to discover, by means of

model #5, the pattern shown in Figure 6.30. Via no overlap Subdue discovered 5 instances

(in the thirteenth iteration) of the pattern, through standard overlap it found 5 (in the sixth

iteration), and by using limited overlap it also found 5 (in the tenth iteration). Subdue

reported in all cases “river category draining and settlement category construction” as the

predominant spatial objects.

a) b)

c)


163

Table 6.1 presents a comparison, by each model and overlap feature, among the number of

discovered instances (patterns) and the number of iterations needs to discover them. For

example, by using model #1, Subdue found via no overlap 46 instances (in the second

iteration) of a “complete” pattern (according to our definition for reporting a complete

pattern) involving the spatial objects road-river. A higher score means a model allowing us

to discover more instances of a substructure. Remember Subdue reported as the best pattern

(by iteration) a substructure with the highest score of discovered instances of that

substructure. This comparison is reported by each “object-object” structure (i.e., road-

river).

Note: NO (No overlap), SO (Standard overlap), LO (Limited overlap).

Model #1 Model #2 Model #3 Model #4 Model #5


Road-River

Instances 46 85 85 41 85 64 39 85 34 39 85 60 45 85 45

Iterations 2 1 2 2 3 2 2 2 9 2 1 2 5 1 5

Road-Settlement

Instances 6 9 8 5 8 7 4 8 5 6 8 8 6 0 7

Iterations 9 4 10 14 6 10 15 6 13 12 10 7 7 0 7

River-Settlement

Instances 5 5 5 5 10 5 5 19 5 5 10 5 5 5 5

Iterations 12 8 8 16 7 14 12 4 10 6 6 11 13 6 10

Table 6.1. Instances/iterations by each graph-based model: Popocatépetl use-case.

164

From the previous table we can see that Subdue (setting no overlap -NO- in all cases)

reported using model #1, 46 instances in the second iteration, of a complete pattern among

the road and river spatial objects. Using model #2, Subdue reported 41 instances of a

complete pattern in the second iteration. Using model #3, Subdue reported 30 instances of a

complete pattern in the second iteration and so on. Finally, we can identify that model #1

was the best model to find the highest number of a complete pattern among these spatial

objects (shown in pink color in Table 6.2).


NO NO NO NO NO

Road-River

Instances 46 41 39 39 45

Iterations 2 2 2 2 5

Table 6.2. The best model to discover complete patterns among the road and river spatial objects (no overlap)

From Table 6.1 we created Table 6.3. This table presents a comparison of

maximum/minimum discovered instances by each overlap feature. A model with the

highest score is better since it allows discovering more instances of a substructure

(patterns). The comparison is presented for each “object-object” structure (i.e., road-river).

For example, we have already mentioned that model #1 was the best model to find

complete patterns (setting no overlap) among the road and river spatial objects. Subdue

reported 46 discovered instances of a complete pattern in the second iteration (the highest

score). See Table 6.1 for details.

165

On the other hand, model #3 and model #4 were the worst models to find complete patterns

among the road and river spatial objects. For example, using model #3, Subdue only

reported 39 instances of a complete pattern in the second iteration.

Maximum Minimum

Road-River

No overlap model #1 (second iteration) models #3 and #4 (second iteration)

Standard overlap models #1, #4 and #5 (first iteration) model #2 (third iteration)

Limited overlap model #1 (second iteration) model #3 (ninth iteration)

Road-Settlement

No overlap model #5 (seventh iteration) model #3 (fifteenth iteration)

Standard overlap model #1 (fourth iteration) model #5 (no complete pattern)

Limited overlap model #4 (seventh iteration) model #3 (thirteenth iteration)

River-Settlement

No overlap model #4 (sixth iteration) model #2 (sixteenth iteration).

Standard overlap model #3 (fourth iteration) model #1 (eighth iteration).

Limited overlap model #1 (eighth iteration) model #2 (fourteenth iteration)

Table 6.3. Max/Min of discovered instances by “object-object”/overlap feature.

From Table 6.3 we can identify that model #1 was the best model to discover complete

patterns in most of the cases. The second best most was model #4. On the other hand, the

worst model to discover complete patterns was model #3.

Table 6.4 presents a comparison among the average of discovered instances by model.

Higher score means a model allowing us to discover more instances of a substructure

(patterns). Each value represents the average of discovered substructures by using no

166

overlap, standard overlap and limited overlap features. For example, the value 72.0 in

column “Model #1” and row “Road-River” is the average computed from the values 46, 85

and 85 obtained from Table 6.1. The comparison is reported by each “object-object”

structure (i.e., road-river). We can see in the table that model #1 was the best model to find

complete patterns. It was followed by model #2. The worst model was model #5.


Road-River 72.0 63.3 52.7 61.3 58.3

Road-Settlement 7.7 6.7 5.7 7.3 4.3

River-Settlement 5.0 6.7 9.7 6.7 5.0

Table 6.4. Average of discovered instances by model/“object-object”.

Table 6.5 presents a comparison among the average of discovered instances by model. For

example, the value 19.0 in column “Model #1 no overlap (NO)” is the average computed

from the values 46, 6 and 5 obtained from Table 6.1. Higher score means a model allowing

us to discover more instances of a substructure (patterns). The comparison is reported by

each overlap feature.



19.0 33.0 32.7 17.0 34.3 25.3 16.0 37.3 14.7 16.7 34.3 24.3 18.7 30.0 19.0

Table 6.5. Average of discovered instances by model/overlap feature.

Finally, Table 6.6 presents a comparison among the average of discovered instances by

model. For example, the value 28.2 in column “Model #1” is the average computed from

the values 19.0, 33.0 and 32.7 obtained from Table 6.5. We can see in the table that model

167

#1 reported the highest score of discovered instances (according to our parameters for

reporting complete instances) in this illustrative Popocatépetl domain. The following ones

were model #2 and model #4 respectively. The worst model to find complete patterns as

proposed was model #5.


28.2 25.6 22.7 25.1 22.6

Table 6.6. Average of discovered instances by model.

As part of our strategy for testing our graph-based methodology for modeling and mining

spatial data, we proposed five graph-based operative models derived from our general

schema. However, from the analysis and interpretation of the results presented in this

section we can conclude that model #1 and model #2 are (following that order) the two

graph-based models that produce the best results. This conclusion is based on the

quantitative and qualitative analysis implemented using this test domain. Additionally, this

appreciation is also supported by the results obtained from the discovered patterns in the

population census from the year of 1777 in Puebla downtown test domain described in the

previous section.

6.3 Conclusion

In this chapter we have presented three illustrative use-cases of our proposal for modeling

and mining spatial data. The test domains were two spatial databases, the first one related to

168

a population census from the year of 1777 in Puebla downtown, and the second one related

to the Popocatépetl volcano.

The use-cases developed using the population census database were focused on

exemplifying the graph-based model to represent together spatial data, non-spatial data and

spatial relations, the new limited overlap feature implemented in the Subdue system

(processing time and specialized overlapping pattern oriented search), and the generated

results by the mining phase. We have presented in each use-case evaluations performed by

the domain expert over the discovered patterns.

The use-case developed using the Popocatépetl database was focused on the

comparison/evaluation of the generated results by each proposed graph-based model. The

tests were implemented upon the supposition of evacuation plans implementation in case of

volcanic contingences. We presented comparison tables describing which model(s) allow(s)

to discover more instances of a substructure via no overlap, standard overlap, and limited

overlap features. Subdue reported as the best pattern (by iteration) a substructure with the

highest score of discovered instances of that substructure.

Next chapter presents conclusion about our research work and final remarks.

169

Chapter 7

CONCLUSIONS

The continuous interaction among people and their natural home, the Planet Earth,

generates everyday new requirements associated to spatial data. For example, urban

analysis, natural risks prevention, space exploration, contamination in oceans, and

reforestation of lands just to mention some of them. Spatial data mining involves the

integration of methods from different scientific fields which help us, by means of data

analysis and discovery algorithms, to produce a particular enumeration of patterns from

spatial data.

In Chapter 2 we presented several approaches developed for mining spatial data (i.e.,

generalization-based methods, clustering, spatial associations, approximation and

aggregation, mining in image databases, spatial classification, and spatial trend detection).

However, our argumentation about those approaches was that they do not consider all the

elements found in a spatial database (spatial data, non-spatial data and spatial relations

among the spatial objects) in an extended way. We proposed in this dissertation a new

approach based on graphs. Our feeling is that if we are able to represent those elements as a

unique dataset and if we are also able to mine them as a whole, then, we might be able to

170

get patterns that might contain both types of data and spatial relationships enhancing the

quality of the results since the generated pattern(s) would describe (a) spatial object(s)

meeting (a) spatial relation(s) with other spatial object(s) and which is(are) that (those)

relationship(s). We proposed to use a graph-based representation since it provides the

desirable flexibility to describe these elements and their relations together.

As mentioned, in our model spatial relations among spatial objects are included since a

significant characteristic of spatial data is the influence of the neighbors of an object may

have on the object itself. In the model we included the representation of three types of

spatial relations.

Derived from the general graph-based schema we have proposed five operative models.

Three aspects define the characteristics of a graph created from those models: first, the

representation of equivalent spatial relations (the relation A touch B can be represented by

two directed edges, A B and B A, or by one undirected edge, A──B; we used the second

approach). Second, the representation of symmetric spatial relations (the relation A

North_of B implies a relation B South_of A, some models represent only the first relation

and other both relations). The third aspect is the way to represent the objects and their

relationships. Our intention is to represent the spatial objects and their relation as much as

possible but also considering a balance among the representative of the data and the

complexity of the created graph. This last issue has a major importance since the tie among

the complexity of the created graph and the mining phase. For example, huge graphs may

require more computational resources than small graphs, but by creating small graphs we

may loss data representativeness and perhaps we may not to discover significant patterns.

171

As component of our methodology for mining spatial data using a graph-based approach,

we used the Subdue system as our mining tool. The overlap feature plays a relevant role in

the Subdue’s substructure discovering system. But, as we described, it is implemented in an

orthodox way: allows overlap among any instances of a substructure or allows overlap

among all instances of a substructure. The first option is better regarding to the processing

time, but the second one may discover more instances of a substructure (patterns). Both

cases do not give to the user the capability to specify the set of vertices allowed for

overlapping.

Therefore, we proposed a new overlap approach named limited overlap. The new approach

gives to the user the means to specify over which vertices the overlap will be allowed.

These vertices may represent significant elements in the context we work with. For

example, in the use-case presented in Section 6.2 (implementation of evacuation plans in

case of volcanic contingences in the Popocatépetl volcano zone) vertices representing roads

were allowed for overlapping since these spatial objects represent relevant elements in the

evacuation plans.

Moreover, we visualized three motivation issues to propose the implementation of the new

approach. First, we demonstrated that by using limited overlap we obtain a search space

reduction in the substructure discovering process since we allow overlap but it is restricted

to the set of elements chosen by the user. Second, as result of a search space reduction we

also get a processing time reduction. Remember that as part of the substructure discovery

process there exist a validation and discarding phase over the instances of a substructure.

172

Instances that are not discarded may become candidates to further expansion in order to

discover new substructures. Third, giving the user the capability to choose the set of

vertices allowed for overlapping, we are orienting the search over particular overlapping

instances and, at the same time, discarding irrelevant overlapping instances.

In order to show the feasibility, capacity to mine and to discover patterns by using a graph-

based approach as proposed, we developed a prototype system implementing our model to

create graph-based datasets, to mine those datasets (by calling the Subdue system), and to

visualize the discovered patterns.

In Chapter 6 we described three illustrative use-cases showing the applicability of our

proposal. We used two real world domains: a population census from the year of 1777 in

Puebla downtown, and a Popocatépetl volcano database. The generated results from those

test domains give us a panorama about how and what we can achieve using our approach. It

is important to remark the fact that we can use this approach in any domain that can be

represented as a graph.

In this context, we visualize perspectives related to enhance our work in issues such as the

graph-based model for creating the graphs, the data mining algorithm, and the prototype

system. They include some of the following:

Visualization/presentation of discovered knowledge. For example, visualization of

discovered knowledge over the spatial layers (patterns over the maps); to use charts for

173

showing comparison among patterns; give the user the capability to navigate through the

discovered patterns hierarchy using a hypergraph approach.

Enhancing the algorithms used to create the graph-based datasets according to the

proposed models. The validation of the spatial relations among the spatial object is phase

that in most of the cases requires a lot of computational resources. (i.e., creation and

manage of spatial indices, creation of bounding box for the spatial objects, manage of

spatial reference systems, etc.) Therefore, the algorithms must as most as possible to

execute efficiently this task.

Mining the graph. We proposed and used the Subdue system as our data mining tool,

moreover we implemented a new algorithm name limited overlap. Subgraph isomorphism

is an NP-complete problem, so we must be able to face this problem in order that our

processing times for discovering knowledge meet acceptable parameters of efficiency.

Relationships among non-spatial data describing spatial objects. Implicit relations

among non-spatial data (i.e., attributes) describing the spatial objects may be included in

the model in order to enhance the representativeness of the data.

Spatial data mining is a promising research field. Several approaches have been developed,

and without any doubt, new approaches will be proposed; it is a field in continuous

improvement. Our contribution to the spatial data mining goes in that direction.

174

BIBLIOGRAPHY

[1] AGRAWAL Rakesh, IMIELINSKI Tomasz, SWAMI Arun. Mining Association

Rules between Sets of Items in Large Databases. In : Proc. of the 1993 International

Conference on Management of Data, Washington D.C.. New York, NY : Association for

Computing Machinery Press, 1993, pp. 207-216. ISBN 0-89791-592-5

[2] BERNHARDSEN Tor. Geographic Information Systems: An Introduction. 2nd ed.

New York : John Wiley & Son Inc, 1999, 448 p. ISBN 0471419680

[3] CHEIN Michel, MUGNIER Marie-Laure. Conceptual Graphs: Fundamental

Notions. Revue d’intelligence artificielle, 1992, vol. 6, no 4, pp. 365-406.

[4] CHEN Ming-Syan, HAN Jiawei, YU Philip S. Data Mining: An Overview from a

Database Perspective. Institute of Electrical and Electronics Engineers, Transactions on

Knowledge and Data Engineering, 1996, vol. 8, no 6, pp. 866-883. ISSN 1041-4347

175

[5] DIESTEL Reinhard. Graph Theory [on line]. 3rd ed. New York : Springer-Verlag,

2005. Available on :<http://www.math.uni-hamburg.de/home/diestel/books/graph.theory>

(last visit 01.10.2005). ISBN 3-540-26182-6

[6] EGENHOFER Max J. A Model for Detailed Binary Topological Relationships.

Geomatica, 1993, vol. 47, no 3-4, pp. 261-273.

[7] EGENHOFER Max J., FRANZOSA Robert D. On the Equivalence of Topological

Relations. International Journal of Geographical Information Systems, 1995, vol. 9, no 2,

pp. 133-152.

[8] EGENHOFER Max J., HERRING John R. Categorizing Binary Topological

Relationships between Regions, Lines, and Points in Geographic Databases. Technical

Report. Department of Surveying Engineering, University of Maine, Orono, 1991, 28 p.

[9] ESTER Martin, FROMMELT Alexander, KRIEGEL Hans-Peter, SANDER Jörg.

Spatial Data Mining: Database Primitives, Algorithms and Efficient DBMS Support. Data

Mining and Knowledge Discovery, 2000, vol. 4, no 2-3, pp. 193-216. ISSN 1384-5810

[10] ESTER Martin, FROMMELT Alexander, KRIEGEL Hans-Peter, SANDER Jörg.

Algorithms for Characterization and Trend Detection in Spatial Databases. In : Proc. of the

4th International Conference on Knowledge Discovery and Data Mining, New York, NY,

1998, pp. 44-50.

176

[11] ESTER Martin, KRIEGEL Hans-Peter, SANDER Jörg. Spatial Data Mining: A

Database Approach. In : Proc. of the 5th International Symposium on Large Spatial

Databases, Berlin, Germany, July 1997, pp. 47-66.

[12] FAYYAD Usama M., PIATETSKY-SHAPIRO Gregory, SMYTH Padhraic.

Knowledge Discovery and Data Mining: Towards a Unifying Framework. In : SIMOUDIS

Evangelos, HAN Jiawei, FAYYAD Usama M. Eds. Proc. of the 2nd International

Conference on Knowledge Discovery and Data Mining, Portland, Oregon. Menlo Park, CA

: American Association for Artificial Intelligence Press, 1996, pp. 82-88.

[13] FAYYAD Usama M., PIATETSKY-SHAPIRO Gregory, SMYTH Padhraic,

UTHURUSAMY Ramasamy Eds. Advances in Knowledge Discovery and Data Mining.

Menlo Park, CA : American Association for Artificial Intelligence/Massachusetts Institute

of Technology Press, 1996, 625 p. ISBN 0-262-56097-6

[14] FAYYAD Usama M., SMYTH Padhraic. Image Database Exploration: Progress

and Challenges. In : Proc. of the 1993 Workshop on Knowledge Discovery in Databases.

Washington D.C., July 1993, pp. 14-27.

[15] FAYYAD Usama M., WEIR Nicholas, DJORGOVSKI S. SKICAT: A Machine

Learning System for Automated Cataloging of Large Scale Sky Survey. In : UTGOFF Paul

E. Eds. Proc. of the 10th International Conference on Machine Learning, 1993, University

of Massachusetts. San Mateo, CA : Morgan Kaufmann, 1993, pp. 112-199.

177

[16] FRAWLEY William J., PIATETSKY-SHAPIRO Gregory, MATHEUS Christopher

J. Knowledge Discovery in Databases: An Overview. In : PIATETSKY-SHAPIRO

Greogry, FRAWLEY William J. Knowledge Discovery in Databases. Menlo Park CA :

American Association for Artificial Intelligence/Massachusetts Institute of Technology

Press, 1991, pp. 1-27.

[17] GONZALEZ Jesus A. Empirical and Theoretical Analysis of Relational Concept

Learning Using a Graph-based Representation. PhD Thesis. Arlington : The University of

Texas at Arlington, 2001, 138 p.

[18] GONZALEZ Jesus A., HOLDER Lawrence B., COOK Diane J. Structural

Knowledge Discovery Used to Analyze Earthquake Activity. In : Proc. of the 13th Annual

Florida Artificial Intelligence Research Symposium. American Association for Artificial

Intelligence Press, 2000, pp. 86-90.

[19] GRAPHVIZ - AT&T RESEARCH. Graphviz - Graph Visualization Software [on

line]. Available on : <http://www.research.att.com/sw/tools/graphviz/> (last visit

01.10.2005).

[20] GUTTMAN Antonin. R-Trees: A Dynamic Index Structure for Spatial Searching.

In : Proc. of the 1984 International Conference on Management of Data, Boston,

Massachusset, 1984. New York, NY : Association for Computing Machinery Press, June

1984, pp. 47-57.

http://www.research.att.com/sw/tools/graphviz/

178

[21] HAN Jiawei, CAI Yandong, CERCONE Nick. Knowledge Discovery in Databases:

An Attribute-Oriented Approach. In : YUAN Li Yan, Proc. of the 18th International

Conference on Very Large Databases, Vancouver, British Columbia, Canada. San Mateo :

Morgan Kaufmann Publishers, 1992, pp. 547-559.

[22] HAN Jiawei, FU Yongjian. Exploration of the Power of Attribute-Oriented

Induction in Data Mining. In : [13].

[23] HAN Jiawei, KAMBER Micheline. Data Mining: Concepts and Techniques. San

Francisco : Morgan Kaufmann, 2000, 550 p. ISBN 1-55860-489-8

[24] HOLDER Lawrence B., COOK Diane J., GONZALEZ Jesus A., JONYER I.

Structural Pattern Recognition in Graphs. In : CHEN Dechang, CHENG Xiuzhen Eds.

Pattern Recognition and String Matching. Dordrecht : Kluwer Academic Publishers, 2002,

772 p. ISBN 1-4020-0953-4

[25] HOLSHEIMER Marcel, KERSTEN Martin L. Architectural Support for Data

Mining. CS-R9429. Amsterdam, The Netherlands : CWI (Centre for Mathematics and

Computer Science), 1994.

[26] KAUFMAN Leonard, ROUSSEEUW Peter J. Finding Groups in Data: An

Introduction to Cluster Analysis. New York : Wiley-Interscience, 1990, 368 p. ISBN

0471878766

179

[27] KNORR Edwin M., NG Raymond T. Finding Aggregate Proximity Relationships

and Commonalities in Spatial Data Mining. Institute of Electrical and Electronics

Engineers, Transactions on Knowledge and Data Engineering, 1996, vol. 8, no 6, pp. 884-

897.

[28] KOLATCH Erica. Clustering Algorithms for Spatial Databases: A Survey.

University of Maryland, College Park : Department of Computer Science, 2001, 22 p.

[29] KOPERSKI Krzysztof, HAN Jiawei. Discovery of Spatial Association Rules in

Geographic Information Databases. In : Proc. of the 4th International Symposium on Large

Spatial Databases, Portland, Maine, August 1995, pp. 47-66.

[30] KOPERSKI Krzysztof, HAN Jiawei, ADHIKARY Junas. Spatial Data Mining:

Progress and Challenges. In : Proc. of the Workshop on Research Issues on Data Mining

and Knowledge Discovery, Montreal, Canada, June 1996, pp. 55-70.

[31] KOPERSKI Krzysztof, HAN Jiawei, STEFANOVIC Nebojsa. An Efficient Two-

Step Method for Classification of Spatial Data. In : Proc. of the 8th Symposium on Spatial

Data Handling, Vancouver, Canada, 1998, pp. 45-54.

[32] LAURINI Robert. Information Systems for Urban Planning: A Hypermedia

Cooperative Approach. London : Taylor and Francis, 2001, 368 p. ISBN 0748409645

180

[33] LAURINI Robert, MILLERET-RAFFORT F. Les bases de données en géomatique.

Paris : HERMES, 1993, 330 p.

[34] LAURINI Robert, THOMPSON Derek. Fundamentals of Spatial Information

Systems. London : Academic Press, 1992, 680 p. (The A.P.I.C. series, no 37) ISBN 0-12-

438380-7

[35] LU Wei, HAN Jiawei, OOI Beng C. Discovery of General Knowledge in Large

Spatial Databases. In : Proc. of the Far East Workshop on Geographic Information

Systems, Singapore, June 1993, pp. 275-289.

[36] MUGNIER Marie-Laure. On Generalization/Specialization for Conceptual Graphs.

Journal of Experimental and Theoretical Artificial Intelligence, 1995, vol. 7 pp. 325-344.

[37] NG Raymond T., HAN Jiawei. Efficient and Effective Clustering Methods for

Spatial Data Mining. In : Proc. of the 20th Very Large Databases Conference, Santiago,

Chile, 1994. San Francisco, CA, 1994, pp. 144-155.

[38] PECH PALACIO Manuel, SOL David, GONZALEZ Jesus A. Adaptation and Use

of Spatial and non-Spatial Data Mining. Proc. of the International 2002 Workshop

Semantic Processing of Spatial Data, Centre for Computing Research, Instituto Politécnico

Nacional, México D.F., December 2002. ISBN 970-18-8521-X

181

[39] PREPARATA Franco P., SHAMOS Michael Ian. Computacional Geometry: An

Introduction. 1st ed. New York, NY : Springer-Verlag, 1993. 398 p. ISBN 0387961313

[40] SHEIKHOLESLAMI Gholamhosein, CHATTERJEE Surojit, ZHANG Aidong.

WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases.

In : GUPTA Ashish, SHMUELI Oded, WIDOM Jennifer Eds. Proc. of the 24th

International Conference on Very Large Databases, New York, NY, 1998. San Francisco,

CA : Morgan Kaufmann Publishers, 1998, pp. 428-439. ISBN 1-55860-566-5

[41] SHEK Eddie C., MUNTZ Richard R., MESROBIAN Edmond, NG Kenneth.

Scalable Exploratory Data Mining of Distributed Geoscientific Data. In : Proc. of the 2nd

International Conference on Knowledge Discovery and Data Mining, Portland, OR, 1996.

Menlo Park, CA : American Association for Artificial Intelligence Press, 1996, pp. 32-37.

[42] SOL David, LOYO E., RAZO Antonio. Risk Management in the Popocatépetl

Volcano Zone. In : Workshop on Advanced Techniques for the Assessment of Natural

Hazards in Mountain Areas, Innsbruck, Austria, June 2000, pp. 97-101.

[43] STOLORZ Paul E, DEAN Christopher. Quakefinder: A Scalable Data Mining

System for Detecting Earthquakes from Space. In : Proc. of the 2nd International

Conference on Data Mining, Portland, Oregon, 1996. Menlo Park, CA : American

Association for Artificial Intelligence Press, 1996, pp. 208-213.

182

[44] STOLORZ Paul E., NAKAMURA H., MESROBIAN Edmond, MUNTZ Richard

R., SHEK Eddie C., SANTOS J. R., YI J., Ng K., CHIEN S. Y., MECHOSO C. R.,

FARRARA J. D. Fast Spatio-Temporal Data Mining of Large Geophysical Datasets. In :

Proc. of the 1st International Conference on Data Mining, Montreal, Canada, 1995. Menlo

Park, CA : American Association for Artificial Intelligence Press, 1995, pp. 300-305.

[45] SUBDUE SYSTEM - THE UNIVERSITY OF TEXAS AT ARLINGTON.

SUBDUE - Graph Based Knowledge Discovery [on line]. Available on :

<http://ailab.uta.edu/subdue> (last visit 01.10.2005).

[46] ZHANG Tian, RAMAKRISHNAN Raghu, LIVNY Miron. BIRCH: An Efficient

Data Clustering Method for Very Large Databases. In : Proc. of the 1996 International

Conference on Management of Data, Montreal, Canada, 1996. New York, NY :

Association for Computing Machinery Press, June 1996, pp. 103-114. ISSN 0163-5808

http://ailab.uta.edu/subdue

lxix

Appendix C

AGREEMENTS UDLAP/INSA DE LYON

lxx

lxxi

lxxii

lxxiii

FOLIO ADMINISTRATIF

THESE SOUTENUE DEVANT L'INSTITUT NATIONAL DES SCIENCES APPLIQUEES DE LYON

NOM : PECH PALACIO DATE de SOUTENANCE : 12/12/2005 (avec précision du nom de jeune fille, le cas échéant) Prénoms : Manuel Alfredo TITRE : Spatial Data Modeling and Mining using a Graph-based Representation NATURE : Doctorat Numéro d'ordre : 05 ISAL Ecole doctorale : Informatique et Information pour la Société Spécialité : Documents Multimédia, Images et Systèmes d'Information Cote B.I.U. - Lyon : T 50/210/19 / et bis CLASSE : RESUME : We propose a unique graph-based model to represent spatial data, non-spatial data and the spatial relations among spatial objects. We will generate datasets composed of graphs with a set of these three elements. We consider that by mining a dataset with these characteristics a graph-based mining tool can search patterns involving all these elements at the same time improving the results of the spatial analysis task. A significant characteristic of spatial data is that the attributes of the neighbors of an object may have an influence on the object itself. So, we propose to include in the model three relationship types (topological, orientation, and distance relations). In the model the spatial data (i.e. spatial objects), non-spatial data (i.e. non-spatial attributes), and spatial relations are represented as a collection of one or more directed graphs. A directed graph contains a collection of vertices and edges representing all these elements. Vertices represent either spatial objects, spatial relations between two spatial objects (binary relation), or non-spatial attributes describing the spatial objects. Edges represent a link between two vertices of any type. According to the type of vertices that an edge joins, it can represent either an attribute name or a spatial relation name. The attribute name can refer to a spatial object or a non-spatial entity. We use directed edges to represent directional information of relations among elements (i.e. object x touches object y) and to describe attributes about objects (i.e. object x has attribute z). We propose to adopt the Subdue system, a general graph-based data mining system developed at the University of Texas at Arlington, as our mining tool. A special feature named overlap has a primary role in the substructures discovery process and consequently a direct impact over the generated results. However, it is currently implemented in an orthodox way: all or nothing. Therefore, we propose a third approach: limited overlap, which gives the user the capability to set over which vertices the overlap will be allowed. We visualize directly three motivations issues to propose the implementation of the new algorithm: search space reduction, processing time reduction, and specialized overlapping pattern oriented search. MOTS-CLES : KDD, Spatial data mining, Graph, Spatial data, Spatial relation, Geoprocessing Laboratoire (s) de recherche : LIRIS - Laboratoire d'Informatique en Images et Systèmes d'Information / CENTIA-UDLAP, Mexique Directeur de thèse: Robert LAURINI, Professeur, INSA de Lyon, France, Directeur Président de jury : Eduardo MORALES, Profesor Titular ITESM-Morelos, Mexique, Examinateur Composition du jury : Nicandro CRUZ Profesor Titular, Universidad Veracruzana, Mexique, Rapporteur François FAGES Directeur de Recherches, INRIA, Paris, France, Examinateur Jesús GONZALEZ Profesor Titular INAOE, Mexique, Co-Directeur Robert LAURINI Professeur, INSA de Lyon, France, Directeur Hervé MARTIN Professeur Université Joseph Fourier, Grenoble, France, Rapporteur Eduardo MORALES Profesor Titular ITESM-Morelos, Mexique, Examinateur David SOL Profesor Titular UDLAP, Mexique, Directeur Anne TCHOUNIKINE Maître de Conférences, INSA de Lyon, France, Co-Directeur