Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
2005 ISAL 00118
PhD
Jointly awarded at the Institut National des Sciences Appliquées de Lyon
(Ecole Doctorale Informatique et Information pour la Société) and the Universidad de las Américas, Puebla
(School of Engineering, Department of Computer Sciences)
By
Manuel Alfredo PECH PALACIO
On December 12th, 2005
"Spatial Data Modeling and Mining using a Graph-based Representation"
----------------------------------------------------
PhD Committee Chair, Pr. Eduardo Morales Manzanares, ITESM, Morelos
Tutors:
Pr. Robert Laurini, INSA of Lyon Dr. Anne Tchounikine, INSA of Lyon Dr. David Sol Martínez, UDLAP Dr. Jesús A. González Bernal, INAOE
Reviewers:
Pr. Hervé Martin, U. Joseph Fourier, Grenoble Dr Nicandro Cruz Ramírez, U. Veracruzana Examiner: Dr. François Fages, INRIA
I
ABSTRACT
Motivation
Several approaches have been developed for mining spatial data (i.e., generalization-based
methods, clustering, spatial associations, approximation and aggregation, mining in image
and raster databases, spatial classification and spatial trend detection). However, we argue
that these approaches do not consider all the elements found in a spatial database (spatial
data, non-spatial data and spatial relations among the spatial objects) in an extended way.
Some of them focus first on spatial data and then on non-spatial data or vice versa, and
others consider restricted combinations of these elements. We think that it is possible to
enhance the generated results of the data mining task by mining them as a whole and not as
separate elements (they are related elements). A graph representation provides the
flexibility to describe these elements together and this is the motivation to explore the area
of graph-based spatial knowledge discovery.
II
Proposal
Our idea is to create a unique graph-based model to represent spatial data, non-spatial data
and the spatial relations among spatial objects. We will generate datasets composed of
graphs with a set of these three elements. We consider that by mining a dataset with these
characteristics a graph-based mining tool can search patterns involving all these elements at
the same time improving the results of the spatial analysis task. A significant characteristic
of spatial data is that the attributes of the neighbors of an object may have an influence on
the object itself. So, we propose to include in the model three relationship types
(topological, orientation, and distance relations).
In the model the spatial data (i.e., spatial objects), non-spatial data (i.e., non-spatial
attributes), and spatial relations are represented as a collection of one or more directed
graphs. A directed graph contains a collection of vertices and edges representing all these
elements. Vertices represent either spatial objects, spatial relation types between two spatial
objects (binary relation), or non-spatial attributes describing the spatial objects. Edges
represent a link between two vertices of any type. According to the type of vertices that an
edge joins, it can represent either an attribute name or a spatial relation name. The attribute
name can refer to a spatial object or a non-spatial entity. We use directed edges to represent
directional information of relations among elements (i.e., object x covers object y) and to
describe attributes about objects (i.e., object x has attribute z).
We propose to adopt the Subdue system, a general graph-based data mining system
developed at the University of Texas at Arlington, as our mining tool. Subdue discovers
III
substructures using a graph-based representation of structural databases. The substructures
(a connected subgraph within the graphical representation) describe structural concepts in
the data (i.e., patterns). The discovery algorithm follows a computationally constrained
beam search. The algorithm begins with the substructure matching a single vertex in the
graph. On each iteration, the algorithm selects the best substructure and incrementally
expands the instances of the substructure. An instance of a substructure in the input graph is
a subgraph that matches (graph theoretically) that substructure.
A special feature named overlap has a primary role in the substructures discovery process
and consequently a direct impact over the generated results. However, it is currently
implemented in an orthodox way: all or nothing. If we set overlap to true, Subdue will
allow the overlap among all instances sharing at least one vertex. On the other hand, if
overlap is set to false, Subdue will not allow the overlap among instances sharing at least
one vertex. So, we propose a third approach: limited overlap, which gives the user the
capability to set over which vertices the overlap will be allowed (vertices representing
remarkable elements that refer, for instance, to a spatial object in a spatial database or to
some characteristic defining a particular topic of a dataset). We visualize directly three
motivations issues to propose the implementation of the new algorithm: search space
reduction, processing time reduction, and pattern oriented search.
IV
Contribution
The contribution to the discovery knowledge in the spatial data domain, described in this
dissertation, is the development of a new approach for spatial data modeling and mining
using a graph-based representation. This contribution includes the following results:
• A new graph-based data representation for spatial, non-spatial data and spatial
relations.
• A new algorithm to discover substructures using a limited overlap approach in the
Subdue system.
• A prototype system implementing the proposed model.
V
Acknowledgments
Thank you; you are my guide, motivation and inspiration:
Erandi, my darling wife
Paula Ireri, my wonderful daughter
Manuel André, my little son
Manuel and Elsy, my parents
I want to express my gratitude, respect and admiration my tutors for the time, advices and
support they granted me during the research:
Mexican tutors:
Dr. David Sol Martínez
Dr. Jesús A. González Bernal
Gracias
French tutors:
Pr. Robert Laurini
Dr. Anne Tchounikine
Merci
VI
This works was supported in part by the Excellent Graduate Scholarship from the
Fundación Universidad de las Américas, Puebla, 38257-H project (Habitar y vivir. Análisis
del espacio habitacional de la ciudad de Puebla 1690-1890) Scholarship from the Mexican
Council of Science and Technology, SIRPO project (Sistema de Información para los
Riesgos del Popocatépetl) research grand from the Laboratoire Franco-Mexicain
d'Informatique, the Institut National des Sciences Appliquées de Lyon, and Excellent
Scholarship from the Gobierno del Estado de Quintana Roo.
VII
Table of contents
ABSTRACT............................................................................................................................ I
Motivation........................................................................................................................... I
Proposal ............................................................................................................................. II
Contribution ..................................................................................................................... IV
Appendix A RESUMEN EN ESPAÑOL ................................................................................i
A.1. Introducción .................................................................................................................i
A.1.1 Métodos basados en generalización..................................................................... ii
A.1.2 Agrupamiento ..................................................................................................... iii
A.1.3 Asociaciones espaciales .......................................................................................iv
A.1.4 Aproximación y agregación.................................................................................iv
A.1.5 Minería de datos en imágenes...............................................................................v
A.1.6 Clasificación de datos espaciales ..........................................................................v
A.1.7 Detección de tendencias espaciales .....................................................................vi
A.2. Motivación .................................................................................................................vi
A.2.1 Relaciones espaciales......................................................................................... vii
A.3 Representaciones basadas en grafos ........................................................................ viii
VIII
A.4 Minando el grafo.......................................................................................................xix
A.5 Resultados ............................................................................................................... xxii
A.6 Conclusiones ............................................................................................................xxx
A.7 Contribución ........................................................................................................ xxxiii
Appendix B RÉSUMÉ EN FRANÇAIS...........................................................................xxxv
B.1. Introduction ...........................................................................................................xxxv
B.1.1 Méthodes basées sur la généralisation ...........................................................xxxvi
B.1.2 Regroupement .............................................................................................. xxxvii
B.1.3 Associations spatiales.................................................................................. xxxviii
B.1.4 Rapprochement et agrégation...................................................................... xxxviii
B.1.5 Fouille de données-images.............................................................................xxxix
B.1.6 Classification de données spatiales ................................................................xxxix
B.1.7 Détection de tendances spatiales ..........................................................................xl
B.2. Motivation ..................................................................................................................xl
B.2.1 Relations spatiales.............................................................................................. xli
B.3 Représentations basées sur des graphes ................................................................... xlii
B.4 Fouille du graphe....................................................................................................... liii
B.5 Résultats .................................................................................................................... lvi
B.6 Conclusions ............................................................................................................. lxiv
B.7 Contribution ........................................................................................................... lxvii
Chapter 1 INTRODUCTION..................................................................................................1
1.1 Motivation.....................................................................................................................2
1.2 Proposal ........................................................................................................................4
1.3 Contribution ..................................................................................................................6
IX
1.4 Organization of the thesis .............................................................................................7
Chapter 2 RELATED WORK ................................................................................................8
2.1 Geographic Information System (GIS).........................................................................8
2.2 Data Mining ................................................................................................................12
2.2.1 Spatial Data Mining .............................................................................................15
2.2.1.1 Generalization-based Method .......................................................................16
2.2.1.2 Clustering......................................................................................................17
2.2.1.3 Spatial Associations......................................................................................19
2.2.1.4 Approximation and Aggregation ..................................................................20
2.2.1.5 Mining an Image Database ...........................................................................22
2.2.1.6 Classification Learning .................................................................................23
2.2.1.7 Spatial Trend Detection ................................................................................24
2.3 Spatial relations...........................................................................................................24
2.3.1 Neighborhood Graphs, Neighborhood Paths and Neighborhood Indices............25
2.3.2 Topological Relations ..........................................................................................26
2.3.3 Distance Relations ...............................................................................................28
2.3.4 Direction Relations ..............................................................................................29
2.4 Conclusion ..................................................................................................................30
Chapter 3 GRAPH-BASED REPRESENTATIONS............................................................31
3.1 Generalities .................................................................................................................31
3.2 Methodology...............................................................................................................33
3.3 Spatial Graph-based Data Representations.................................................................36
3.4 Use-case ......................................................................................................................52
3.5 Conclusion ..................................................................................................................58
X
Chapter 4 MINING THE GRAPH........................................................................................59
4.1 Characteristics.............................................................................................................59
4.1.1 Main Functions ....................................................................................................63
4.2 Overlap........................................................................................................................66
4.3 Limited Overlap..........................................................................................................81
4.4 Conclusion ..................................................................................................................94
Chapter 5 PROTOTYPE.......................................................................................................96
5.1 Population Census from the year of 1777 in Puebla downtown.................................97
5.2 Modules ....................................................................................................................101
5.3 Conclusion ................................................................................................................115
Chapter 6 RESULTS ..........................................................................................................117
6.1 Population census from the year of 1777 in Puebla downtown................................118
6.1.1 Use-case: El Sagrario........................................................................................121
6.1.2 Use-case: People living along the borders of the river crossing Puebla downtown
....................................................................................................................................131
6.2 Popocatépetl volcano ................................................................................................136
6.2.2. Use-case: Popocatépetl .....................................................................................138
6.2.2.1 Model #1 - base model................................................................................140
6.2.2.2 Model #2 - single replication of relation types, complete information ......147
6.2.2.3 Model #3 - double replication of relation types, no complete information 151
6.2.2.4 Model #4 - single replication of relation types, no complete information .155
6.2.2.5 Model #5 - double replication of relation types, complete information .....159
6.3 Conclusion ................................................................................................................167
Chapter 7 CONCLUSIONS................................................................................................169
XI
BIBLIOGRAPHY...............................................................................................................174
Appendix C AGREEMENTS UDLAP/INSA DE LYON................................................. lxix
C.1 CONVENTION DE CO-TUTELLE DE THÈSE.....................................................lxx
C.2 AVENANT RELATIF A LA CONVENTION DE THÈSE EN CO-TUTELLE DE
MANUEL ALFREDO PECH PALACIO ................................................................... lxxiii
XII
List of figures
Figura A.1. Modelo basado en grafos para representar datos espaciales. .............................ix
Figura A.2. Base de datos de ejemplo para caracterizar los 3 modelos propuestos. .............xi
Figura A.3. Modelo #1 - modelo base. ................................................................................ xii
Figura A.4. Modelo #2 - replicación simple de tipos de relación, información completa. xiii
Figura A.5. Modelo #3 - doble replicación de tipos de relación, información no completa.xv
Figura A.6. Relaciones entre carreteras y ríos usando el modelo #1.................................xxiv
Figura A.7. Relaciones entre carreteras y poblaciones usando el modelo #1.....................xxv
Figura A.8. Relaciones entre ríos y poblaciones usando el modelo #1. .......................... xxvii
Figure B.1. Modèle basé sur des graphes pour représenter des données spatiales. ........... xliii
Figure B.2. Base d'exemple pour caractériser les 3 modèles proposés................................xlv
Figure B.3. Modèle n°1 - modèle base. ............................................................................. xlvi
Figure B.4. Modèle n°2 - réplication simple des types de relation, information complète.
.................................................................................................................................. xlvii
Figure B.5. Modèle n°3 - double réplication des types de relation, information non
complète..................................................................................................................... xlix
Figure B.6. Relations entre des routes et des rivières en utilisant le modèle n°1. ............. lviii
XIII
Figure B.7. Relations entre des routes et des villes en utilisant le modèle n°1. .................. lix
Figure B.8. Relations entre des rivières et des villes en utilisant le modèle n°1. ................ lxi
Figure 2.1. Geographic Information System. .......................................................................11
Figure 2.2. Knowledge Discovery in Databases...................................................................13
Figure 2.3. Data mining: integration of several fields. .........................................................14
Figure 2.4. Architecture for a KDD system..........................................................................15
Figure 2.5. Example of the interior, boundary and exterior of a circle. ...............................27
Figure 2.6. Nine intersection model......................................................................................27
Figure 2.7. Topological Relations.........................................................................................28
Figure 2.8. Distance Relations. .............................................................................................28
Figure 2.9. Direction Relations.............................................................................................30
Figure 3.1. General graph-based model to represent spatial data. ........................................35
Figure 3.2. Sample dataset. ...................................................................................................40
Figure 3.3. Model #1 - base model. ......................................................................................42
Figure 3.4. Model #2 - single replication of relation types, complete information. .............43
Figure 3.5. Model #3 - double replication of relation types, no complete information........45
Figure 3.6. Model #4 - single replication of relation types, no complete information. ........47
Figure 3.7. Model #5 - double replication of relation types, complete information.............48
Figure 3.8. Spatial database representing some objects of the world. ..................................53
Figure 3.9. Selection window. ..............................................................................................54
Figure 3.10. Querying a spatial database. .............................................................................55
Figure 3.11. Graph-based representation for spatial data. ....................................................56
Figure 4.1. Graph representation of the house domain.........................................................61
Figure 4.2. Substructure and instances discovered from the house domain by Subdue. ......62
XIV
Figure 4.3. Substructure replacement procedure in the house domain. ................................62
Figure 4.4. Graph representation of the house domain after substructure replacement. ......63
Figure 4.5. SGISO - input graph_1.......................................................................................67
Figure 4.6. SGISO - input graph_2.......................................................................................67
Figure 4.7. SGISO - no overlap. ...........................................................................................68
Figure 4.8. SGISO - no overlap, 1 instance in graph_2........................................................68
Figure 4.9. SGISO - overlap. ................................................................................................68
Figure 4.10. SGISO - overlap, 4 instances in graph_2. ........................................................68
Figure 4.11. MDL - input graph_1. ......................................................................................71
Figure 4.12. MDL example 1 - input graph_2......................................................................72
Figure 4.13. MDL example 1 - no overlap. ..........................................................................72
Figure 4.14. MDL example 2 - input graph_2......................................................................73
Figure 4.15. MDL example 2 - overlap. ...............................................................................73
Figure 4.16. MDL example 3 - input graph_2......................................................................74
Figure 4.17. MDL example 3 - overlap. ...............................................................................74
Figure 4.18. MDL example 4 - input graph_2......................................................................75
Figure 4.19. MDL example 4 - overlap. ...............................................................................75
Figure 4.20. MDL example 5 - input graph_2......................................................................76
Figure 4.21. MDL example 5 - overlap. ...............................................................................76
Figure 4.22. MDL example 6 - input graph_2......................................................................77
Figure 4.23. MDL example 6 - overlap. ...............................................................................77
Figure 4.24. MDL example 7 - input graph_2......................................................................78
Figure 4.25. MDL example 7 - overlap. ...............................................................................78
Figure 4.26. MDL example 8 - input graph_2......................................................................79
XV
Figure 4.27. MDL example 8 - overlap. ...............................................................................79
Figure 4.28. MDL example 9 - input graph_2......................................................................80
Figure 4.29. MDL example 9 - overlap. ...............................................................................80
Figure 4.30. Limited overlap - input graphs PS_1 and PS_2. ..............................................82
Figure 4.31. Limited overlap - input graph_3.......................................................................82
Figure 4.32. No overlap - SGISO. ........................................................................................84
Figure 4.33. Overlap - SGISO. .............................................................................................84
Figure 4.34. Limited overlap PS_1 - SGISO. .......................................................................84
Figure 4.35. Limited overlap in the Subdue system. ............................................................87
Figure 4.36. No overlap - compressed graph........................................................................88
Figure 4.37. No overlap - discovered substructures. ............................................................90
Figure 4.38. Overlap - compressed Graph. ...........................................................................91
Figure 4.39. Overlap - discovered substructures. .................................................................92
Figure 4.40. Limited overlap PS_1 - compressed graph.......................................................93
Figure 4.41. Limited overlap - discovered substructures......................................................94
Figure 5.1. Representation of spatial concepts in the census from the year of 1777............98
Figure 5.2. First representation for the non-spatial data in the population census from the
year of 1777. ...............................................................................................................100
Figure 5.3. Second representation for the non-spatial data in the population census from the
year of 1777. ...............................................................................................................101
Figure 5.4. The query panel. ...............................................................................................103
Figure 5.5. The map panel. .................................................................................................106
Figure 5.6. The spatial graph panel.....................................................................................109
Figure 5.7. Graph representation of processed data............................................................111
XVI
Figure 5.8. The non-spatial graph panel. ............................................................................112
Figure 5.9. The Subdue panel. ............................................................................................113
Figure 5.10. Example of Subdue’s standard output............................................................114
Figure 5.11. Layout for reading the Subdue’s discovered substructures............................115
Figure 6.1. Population census from the year of 1777 in Puebla downtown. ......................118
Figure 6.2. Parishes in Puebla downtown in the 1777 year. ...............................................120
Figure 6.3. Blocks 150m. from representative church, parish “El Sagrario”. ....................122
Figure 6.4 Example of a generated graph in the use-case “El Sagrario”............................122
Figure 6.5. Examples of discovered patterns by using standard overlap in use-case “El
Sagrario” (1). ..............................................................................................................124
Figure 6.6. Examples of discovered patterns by using limited overlap in use-case “El
Sagrario” (1). ..............................................................................................................126
Figure 6.7. Processing time standard vs. limited overlap: use-case “El Sagrario” (1). ......127
Figure 6.8. Blocks 150m. North side from representative church, parish “El Sagrario”. ..128
Figure 6.9. Examples of discovered patterns in the use-case “El Sagrario” (2) .................130
Figure 6.10. Processing time standard vs. limited overlap: use-case “El Sagrario” (2). ....131
Figure 6.11. Blocks 50m. from river crossing Puebla downtown. .....................................132
Figure 6.12. Examples of discovered patterns in use-case “people around the river crossing
Puebla downtown”......................................................................................................133
Figure 6.13. Processing time standard vs. limited overlap: use-case “people around the river
crossing Puebla downtown”........................................................................................134
Figure 6.14. Popocatépetl volcano......................................................................................137
Figure 6.15. Popocatépetl volcano: study zone. .................................................................138
Figure 6.16. Relationships among roads and rivers by using model #1. ............................142
XVII
Figure 6.17. Relationships among roads and settlements by using model #1. ...................144
Figure 6.18. Relationships among rivers and settlements by using model #1....................146
Figure 6.19. Relationships among roads and rivers by using model #2. ............................148
Figure 6.20. Relationships among roads and settlements by using model #2. ...................149
Figure 6.21. Relationships among rivers and settlements by using model #2....................150
Figure 6.22. Relationships among roads and rivers by using model #3. ............................152
Figure 6.23. Relationships among roads and settlements by using model #3. ...................153
Figure 6.24. Relationships among rivers and settlements by using model #3....................154
Figure 6.25. Relationships among roads and rivers by using model #4. ............................156
Figure 6.26. Relationships among roads and settlements by using model #4. ...................157
Figure 6.27. Relationships among rivers and settlements by using model #4....................158
Figure 6.28. Relationships among roads and rivers by using model #5. ............................160
Figure 6.29. Relationships among roads and settlements by using model #5. ...................161
Figure 6.30. Relationships among rivers and settlements by using model #5....................162
XVIII
List of tables
Tabla A.1. Características de los modelos de representación basados en grafos. ...............xvi
Tabla A.2. Instancias/iteraciones por cada modelo basado en grafos: caso de uso
Popocatépetl............................................................................................................ xxviii
Tabla A.3. Max/Min de instancias descubiertas por “objeto-objeto”/característica overlap.
...................................................................................................................................xxix
Tabla A.4. Promedio de instancias descubiertas por modelo/“objeto-objeto”. .................xxix
Tabla A.5. Promedio de instancias descubiertas por modelo/característica overlap. .........xxx
Tabla A.6. Promedio de instancias descubiertas por modelo. ............................................xxx
Table B.1. Caractéristiques des modèles de représentation basés sur des graphes. ................l
Table B.2. Instances/itérations par chaque modèle basé sur des graphes : cas d'utilisation
Popocatépetl................................................................................................................ lxii
Table B.3. Max/Min d'instances découvertes par "objet-objet"/caractéristique
recouvrement. ............................................................................................................ lxiii
Table B.4. Moyenne d'instances découvertes par modèle/"objet-objet". .......................... lxiii
Table B.5. Moyenne d'instances découvertes par modèle/caractéristique recouvrement.. lxiv
Table B.6. Moyenne d'instances découvertes par modèle. ................................................ lxiv
XIX
Table 3.1. Characteristics of the graph-based representation models...................................49
Table 6.1. Instances/iterations by each graph-based model: Popocatépetl use-case. .........163
Table 6.2. The best model to discover complete patterns among the road and river spatial
objects (no overlap) ....................................................................................................164
Table 6.3. Max/Min of discovered instances by “object-object”/overlap feature. .............165
Table 6.4. Average of discovered instances by model/“object-object”. .............................166
Table 6.5. Average of discovered instances by model/overlap feature. .............................166
Table 6.6. Average of discovered instances by model. ......................................................167
i
Appendix A
RESUMEN EN ESPAÑOL
A.1. Introducción
En los últimos años hemos sido testigos del rápido crecimiento en el número, capacidad y
diseminación de aplicaciones informáticas dedicadas a la obtención, generación,
manipulación y almacenamiento de datos en diversos ámbitos de la vida humana. Esto ha
propiciado una gran cantidad de colecciones de datos cuyo análisis por medios manuales se
vuelve una tarea complicada. Recordemos que en muchas ocasiones los datos “crudos”
necesitan ser analizados e interpretados para convertirlos en información útil y provechosa.
Tal situación ha propiciado una creciente necesidad por técnicas/herramientas
computacionales que nos ayuden en estas tareas. Descubrimiento de conocimiento en bases
de datos (KDD, por sus siglas en el idioma Inglés) es definido como la extracción no trivial
de información implícita, previamente desconocida y potencialmente útil a partir de datos
[16]. Este es un proceso iterativo e interactivo que envuelve diferentes fases. El núcleo del
proceso es la fase de minado de datos, que se conceptualiza como la aplicación de
algoritmos de análisis de datos y de descubrimiento que bajo parámetros aceptables de
ii
eficiencia computacional producen/descubren una enumeración particular de patrones sobre
los datos mismos [12].
En este mismo contexto, pero enfocado al análisis y explotación de datos provenientes de
fenómenos generados en, sobre, y bajo la superficie de la tierra, llamados datos espaciales,
ha generado un nuevo dominio de investigación llamado Minería de datos Espaciales. De
tal forma, la Minería de datos espaciales se enfoca al descubrimiento de conocimiento
implícito, y previamente desconocido en datos espaciales [16]. Como resultado de esta
necesidad creciente diversos enfoques para el minado de datos espaciales han sido
desarrollados, entre los más representativos encontramos:
A.1.1 Métodos basados en generalización
La generalización ha demostrado ser uno de los métodos efectivos para descubrir
conocimiento. Fue introducido por la comunidad de aprendizaje máquina y se basa en el
aprendizaje a partir de ejemplos. El descubrimiento de conocimiento basado en
generalización requiere jerarquías de conceptos (dadas explícitamente por el experto ó
generadas automáticamente). En la caso de las bases de datos espaciales, pueden darse dos
tipos de jerarquías de conceptos: (1) Jerarquías temáticas, por ejemplo, generalizar tomates
y plátanos como frutas, las frutas y vegetales como alimentos de origen vegetal. (2)
Jerarquías espaciales, por ejemplo, generalizar una serie de puntos espaciales como una
región ó país.
iii
Lu et al. [35] extienden la técnica attribute-oriented induction a las bases de datos
espaciales. Esta técnica se basa en escalar la jerarquía de generalización e ir resumiendo las
relaciones entre los datos espaciales y no espaciales a un nivel de concepto más alto. Los
autores presentan dos algoritmos basados en generalización: (1) Enfoque de dominación de
datos no espaciales. Este método realiza en primera instancia inducción orientada al
atributo sobre los datos no-espaciales y posteriormente sobre los espaciales. (2) Enfoque de
dominación de datos espaciales. Dado la jerarquía de datos espaciales, la generalización se
realizado primero sobre estos datos y posteriormente sobre los datos no espaciales.
A.1.2 Agrupamiento
Agrupamiento (clustering) es el proceso de agrupar de manera física ó abstracta objetos en
clases de objetos similares. Este enfoque de minería de datos nos ayuda a construir
particiones “representativas” de un conjunto de objetos dada una medida de
similitud/distancia (Ej. distancia euclidiana). Esto es, el agrupamiento de datos identifica
grupos (clusters) ó regiones densamente pobladas de acuerdo a alguna medida de distancia
en un conjunto de datos multidimensionales. Podemos clasificar a los algoritmos de
agrupamiento en cuatro grupos principales: Algoritmos de particionamiento basados en los
enfoques k-means (centro de gravedad del cluster) y k-medoid (objeto representativo del
cluster), algoritmos jerárquicos, algoritmos basados en la ubicación de los objetos
(agrupamiento por densidad), y por último los basados en grids.
iv
A.1.3 Asociaciones espaciales
Una regla de asociación espacial es una regla que describe la implicación de uno o un
conjunto de objetos por otro conjunto de objetos en base de datos espaciales [29]. Un
ejemplo de una regla de asociación espacial podría ser “si la empresa se ubica cerca de la
Ciudad de México entonces es una empresa grande”. Una regla de asociación espacial es de
la forma X Y, donde X y Y son conjuntos de predicados espaciales o no espaciales. Existen
varios tipos de predicados espaciales que pudieran constituir una regla de asociación
espacial, por ejemplo, relaciones topológicas como son intersección, traslape y
orientaciones espaciales tales como Izquierda_de, y Oeste_de.
A.1.4 Aproximación y agregación
Los métodos basados en aproximación y agregación buscan analizar las características de
grupos de objetos (clusters) en base a objetos (features) cercanos a ellos. Proximidad
agregada es la medida de cercanía de un conjunto de puntos en un cluster a un feature. La
idea de encontrar relaciones de proximidad no es un problema simple como podría parecer,
existen tres razones para esta aseveración. Supongamos que tenemos un cluster de puntos y
queremos encontrar los k-features más cercanos a él:
• El tamaño y forma del cluster y los features puede ser muy variado.
• Podríamos tener una gran cantidad de features para examinar.
• Aún en el caso de encontrar una forma conocida (Ej. polígono) que describa la
forma del cluster, sería inadecuado reportar los features cuyos límites estén más
v
cerca a los límites de éste, porque la distribución de los puntos al interior del cluster
puede no ser uniforme.
A.1.5 Minería de datos en imágenes
Extracción de patrones a partir de imágenes es otro enfoque de minería de datos espaciales.
En la literatura existen diversos trabajos de minado de datos desarrollados bajo este
enfoque. Por ejemplo, Fayyad et al. [14] presentan un sistema para la identificación y
categorización de volcanes en la superficie de Venus a partir de imágenes transmitidas por
la sonda espacial Magellan. En otro trabajo [15] (Second Palomar Observatory Sky Survey)
se usaron árboles de decisión para la clasificación de galaxias, estrellas y otros objetos
estelares. Stolorz et al. [44] y Shek et al. [41] efectuaron estudios sobre minería de datos
espacio-temporal en conjuntos de datos geofísicos.
A.1.6 Clasificación de datos espaciales
La clasificación de datos espaciales tiene como objetivo encontrar reglas que dividan un
conjunto de objetos en un número de grupos, donde los objetos de cada grupo pertenecen a
una clase. Diversos tipos de información pueden ser usados para caracterizar los objetos
espaciales. Por ejemplo, atributos no espaciales de un objeto, predicados espaciales y
funciones espaciales. La idea es usar esta información para extraer ya sea atributos para la
etiquetación de clases (atributos que dividen los datos en clases) y atributos predictivos
(atributos cuyos valores son usados en un árbol de decisión para crear sus ramas).
vi
A.1.7 Detección de tendencias espaciales
Detección de tendencias espaciales describe los cambios regulares de uno ó más atributos
no espaciales de un objeto cuando éste se desplaza desde un punto de referencia inicial. Un
ejemplo de tendencia espacial sería “alejándose del centro histórico de la ciudad de Puebla
es precio de los terrenos decrece”. Las trayectorias de movimiento a partir de un punto x
son usadas para modelar dicho movimiento y análisis de regresión sobre los atributos de los
objetos son usados para describir patrones de cambio. Existen dos tipos de tendencias:
globales y locales.
A.2. Motivación
Aunque los enfoques antes mencionados realizan la tarea de minado de datos espaciales de
manera exitosa, nuestra percepción es que éstos no consideran todos los elementos
encontrados en una base de datos espaciales (datos espaciales, datos no espaciales y
relaciones espaciales entre los objetos espaciales) de una manera integral. Es decir, algunos
de ellos primero realizan minado de datos espaciales y posteriormente minado de datos no
espaciales ó viceversa, y otros permiten combinaciones de estos elementos pero de manera
restringida. Con base en lo anterior, proponemos el argumento siguiente: si somos capaces
de representar los datos espaciales, no espaciales y las relaciones entre objetos espaciales
como un solo conjunto de datos, y lo minamos como tal, podríamos generar/encontrar
patrones de conocimiento que describan/caractericen nuestro conjunto de datos conteniendo
estos tres elementos de manera conjunta. Para tal efecto en este trabajo se argumenta que
vii
una representación basada en grafos es lo suficientemente flexible y poderosa para
representar estos elementos de manera conjunta, fácilmente entendible y capaz de crear
diferentes representaciones del mismo dominio. El dominio es descrito usando grafos, los
grafos se convierten en datos de entrada para una herramienta de descubrimiento de
conocimiento basada en grafos la cual usa heurísticas para seleccionar subgrafos que son
considerados importantes (patrones).
Un grafo es definido como un par G = (V,E). V = {v1,…,vn} denota un conjunto finito de
elementos llamados vértices. E es un conjunto de arcos e satisfaciendo E ⊆ [V]2. Entonces,
cada arco e ∈ E es un par (vi,vj). Si (vi,vj) es un par ordenado para cualesquiera (vi,vj) ∈ E,
entonces se dice que G = (V,E) es un grafo dirigido. Un grafo etiquetado tiene etiquetas
asociadas a sus vértices y arcos.
Como comentamos anteriormente en nuestro modelo proponemos la representación de
relaciones espaciales entre los objetos espaciales. En la siguiente subsección detallamos los
tres tipos de relaciones espaciales que proponemos incluir.
A.2.1 Relaciones espaciales
La ubicación explícita de los objetos espaciales define relaciones implícitas de vecindad
(neighborhood) espacial entre ellos. De tal forma, la información sobre la vecindad de los
objetos espaciales constituye un elemento valioso que debe ser considerado en la tarea de
minado de datos espaciales. Martin Ester et al. [9][11] introducen el concepto de grafos de
viii
vecindad para representar explícitamente estas relaciones de vecindad implícitas. Los
grafos de vecindad pueden cubrir las relaciones de vecindad siguientes:
• Topológicas. Derivadas del modelo de nueve intersecciones [6][7][8], son
relaciones que permanecen invariantes bajo transformaciones lineales, es decir, si
ambos objetos se rotan, se trasladan ó se escalan simultáneamente las relaciones
entre ellos se preservan.
• Distancia. Compara la distancia entre dos objetos dada una constante usando
operadores aritméticos tales como <, >, =. La distancia entre dos objetos se definen
como la distancia mínima entre ellos (Ej. seleccionar todos los elementos dentro de
una radio de 50 kilómetros desde un punto x).
• Dirección. La relación espacial de dirección entre 2 objetos espaciales A y B (B R A)
se define usando un punto representativo del objeto A y todos los puntos del objeto
destino B. El punto representativo del objeto fuente A puede ser el centro del objeto
ó un punto sobre sus límites. Este punto representativo es usado como el origen de
un sistema de coordenadas virtuales y su cuadrante define la dirección.
Una vez comentada la motivación de nuestro trabajo de investigación se presenta a
continuación el modelo general basado en grafos para representar los datos espaciales.
A.3 Representaciones basadas en grafos
Como hemos comentado, nuestra propuesta se basa en crear un modelo basado en grafos
para representar conjuntamente datos espaciales, no espaciales y relaciones espaciales entre
ix
los objetos espaciales. La idea es usar los grafos generados como datos de entrada para un
algoritmo de minado de datos, de tal forma, que el algoritmo pueda encontrar patrones
involucrando estos elementos de manera conjunta y no como elementos separados. En
consecuencia, proponemos el modelo de representación mostrado (en notación UML) en la
Figura A.1.
Vertex
1..*
1..*
Distance DirectionTopological
Spatial relation
Spatial object
Non-spatial attribute
-From:-To:
Edge
Figura A.1. Modelo basado en grafos para representar datos espaciales.
En el modelo, los datos espaciales (Ej. objetos espaciales), datos no espaciales (Ej. atributos
descriptivos), y relaciones espaciales son representados como una colección de uno ó más
grafos dirigidos etiquetados. Los vértices pueden representar objetos espaciales, tipos de
relación espacial entre dos objetos (relación binaria), ó atributos no espaciales describiendo
los objetos espaciales. Los arcos representan una liga existente entre dos vértices de
cualquier tipo. Dependiendo del tipo de vértices que un arco une, éste puede representar el
nombre de un atributo descriptivo ó el nombre de una relación espacial. Los nombres de
atributos pueden referirse a descripciones de objetos espaciales y/ó a entidades no
espaciales. Se usan arcos dirigidos para representar la información direccional de relaciones
x
entre elementos (Ej. objeto x cubre objeto y) y para describir atributos pertenecientes a
objetos (Ej. objeto x tiene atributo z).
Actualmente se han desarrollado cinco representaciones a partir del modelo general
previamente descrito. Tres tópicos definen las características de los grafos creados en cada
modelo: (1) Representación de las relaciones espaciales equivalentes (Ej. tocar, traslapar).
(2) Representación de las relaciones espaciales simétricas (Ej. contiene-dentro_de,
Norte_de-Sur_de). (3) La manera de representar los objetos y sus relaciones en el modelo.
En consecuencia, los grafos creados se diferencian de manera cuantitativa y cualitativa. En
la parte cuantitativa tenemos diferencias tales como: número de vértices y arcos empleados
para representar los datos espaciales, no espaciales y las relaciones, creación de grafos
simples, manejo de arcos dirigidos y no dirigidos para representar relaciones espaciales y/o
atributos descriptivos. En la parte cualitativa hemos observado, a través de
experimentación, que ciertos modelos tienen una mayor expresividad para representar el
conjunto de datos, condicionando de manera directa la calidad de los resultados generados
en el proceso de minado. A continuación se presentan 3 de los 5 modelos propuestos
describiendo las métricas de evaluación creadas para caracterizar a cada uno de ellos.
Modelos
Con la finalidad de describir las características de cada modelo, usaremos como conjunto
de datos de ejemplo los mostrados en la Figura A.2. Como podemos observar, nuestro
conjunto de datos se compone de dos objetos espaciales, objeto A representado una casa y
objeto B representando un lago, y las tres relaciones espaciales siguientes: (1) Distancia,
xi
objeto A cerca de objeto B (relación equivalente). (2) Topológica, objeto A toca objeto B
(relación equivalente). (3) Dirección, objeto A Sur de objeto B (relación simétrica).
North
South
EastWest
Object B
Object A
Figura A.2. Base de datos de ejemplo para caracterizar los 3 modelos propuestos.
Modelo #1 - modelo base
La Figura A.3 muestra el primer modelo creado para representar datos espaciales bajo el
enfoque propuesto. Las características del modelo de acuerdo a las métricas creadas para su
caracterización son:
• Num. vértices: 2 vértices, cada uno representando un objeto espacial (objeto A y
objeto B).
• Num. arcos: 4 arcos, 3 arcos para representar las relaciones espaciales originales
existentes en nuestro conjunto de datos de ejemplo (“cerca”, “toca” y “Sur_de”) y
un arco para representar la relación “Norte_de” creada de la relación simétrica
original “Sur_de”. La relación “Norte_de” es en si misma una relación simétrica.
• Tamaño (vértices + arcos): 6
• % incremento: 0%, este es el modelo base.
• Grafo simple. No, Es un grafo complejo con 4 arcos uniendo 2 vértices.
xii
• Arco dirigido. Si, son usados para representar las relaciones simétricas “Sur_de” y
“Norte_de”. La dirección de los arcos va en concordancia con la lectura de las
relaciones entre los objetos.
• Arco no dirigido. Si, son usados para representar las relaciones equivalentes
“cerca” y “toca”.
• Información completa. Si, en el grafo es representada la relación simétrica
“Norte_de” creada a partir de la relación simétrica original “Sur_de”.
• Arco “Relation” redundante. No, en el modelo no se usan arcos “Relation”.
Spatialobject A
Spatialobject B
South_of
touch
close
North_of
Figura A.3. Modelo #1 - modelo base.
Modelo #2 - replicación simple de tipos de relación, información completa
En la Figura A.4 presentamos el segundo modelo creado para representar datos espaciales.
Las características del modelo de acuerdo a las métricas son:
• Num. vértices: 5 vértices, 2 vértices para representar los objetos espaciales y 3
vértices para representar los tipos de relaciones espaciales “topológica”, “distancia”
y “dirección”. Por cada tipo de relación especial existente entre dos objetos, se
añade un vértice etiquetado con el nombre del tipo de la relación espacial. En el
ejemplo existen una relación “topológica”, una relación de “distancia” y una
relación de “dirección”.
xiii
• Num. arcos: 6 arcos, 3 arcos para representar las relaciones originales, 2 arcos para
representar las relaciones equivalentes (“cerca” y “toca”) creadas a partir de las
relaciones originales y un arco para representar la relación simétrica (“Norte_de”)
creada también de las relaciones originales.
• Tamaño (vértices + arcos): 11
• % incremento: +83.33%
• Grafo simple. Si, existe a lo más un arco entre cualesquiera 2 vértices dados.
• Arco dirigido. Si, son usados para representar todas las relaciones. La dirección de
los arcos va de los vértices representando los objetos espaciales a los vértices
representado los tipos de relaciones espaciales.
• Arco no dirigido. No, en el modelo no se usan arcos no dirigidos.
• Información completa. Si, representamos las relaciones simétricas creadas a partir
de las relaciones originales.
• Arco “Relation” redundante. No, en el modelo no usamos arcos “Relation”.
South_of
touch
close close
touch
North_of
Distance
Direction
Spatialobject A Topological Spatial
object B
Figura A.4. Modelo #2 - replicación simple de tipos de relación, información completa.
xiv
Modelo #3 - doble replicación de tipos de relación, información no completa
En la Figura A.5 se presenta el tercer modelo creado para representar datos espaciales
usando un enfoque de grafos. Las características del modelo de acuerdo a las métricas son:
• Num. vértices: 8 vértices, 2 vértices para representar los objetos espaciales y 6
vértices para representar los tipos de relaciones espaciales (“distancia”, “topológica”
y “dirección”). Por cada tipo de relación espacial entre dos objetos espaciales, se
añaden 2 vértices etiquetados con el nombre del tipo de la relación espacial. Por
ejemplo, en nuestros datos de prueba existen tres tipos de relaciones: 1 relación
“topológica”, 1 relación “distancia” y 1 relación “dirección”, de tal forma, se
añaden 6 vértices, 2 por cada tipo de relación espacial.
• Num. arcos: 9 arcos, 6 arcos “Relation” para unir los vértices representando los
objetos espaciales con los vértices representando los tipos de relaciones espaciales
(desde cada vértice representando un objeto espacial nacen 3 arcos ya que existen 3
tipos de relación), y 3 arcos para presentar las relaciones espaciales originales. Estos
3 arcos son usados para unir los vértices representando los tipos de relaciones
espaciales.
• Tamaño (vértices + arcos): 17
• % incremento: +183.33%
• Grafo simple. Si, existe a lo más un arco entre cualesquiera 2 vértices dados.
• Arco dirigido. Si, son usados para representar las relaciones simétricas y los arcos
“Relation”. La dirección de los arcos “Relation” va desde los vértices representando
los objetos espaciales a los vértices representando los tipos de relaciones espaciales.
xv
La dirección de los restantes arcos va en concordancia a la lectura de las relaciones
espaciales entre los objetos.
• Arco no dirigido. Si, son usados para representar las relaciones equivalentes.
• Información completa. No, en el modelo no se representan las relaciones
simétricas que son creadas a partir de las relaciones espaciales originales.
• Arco “Relation” redundante. Si, en el modelo se usan arcos “Relation” para
representar explícitamente la existencia y tipo de una relación espacial entre 2
objetos espaciales. Adicionalmente, se emplean estos arcos para evitar la creación
de grafos complejos.
South_of
touch
close
Relation
Distance
Topological
Direction
Distance
Topological
Direction
Spatialobject A
Spatialobject B
Relation
R elation
Relation
RelationR
elation
Figura A.5. Modelo #3 - doble replicación de tipos de relación, información no completa.
La Tabla A.1 presenta los resultados de las nueve métricas desarrolladas para evaluar las
características de cada modelo propuesto (actualmente 5 modelos). El modelo #1 es
llamado el modelo base.
xvi
Modelo Num.
Vértices
Num.
Arcos
Tamaño
(v + e)
%
Incremento
Grafo
Simple
Arco
Dirigido
Arco no
Dirigido
Información
Completa
Arco
“Relation”
Redundante
(1) (2) (3) (4) (5) (6) (7) (8) (9)
#1 2 4 6 - No Yes Yes Yes No
#2 5 6 11 +83.33 Yes Yes No Yes No
#3 8 9 17 +183.33 Yes Yes Yes No Yes
#4 5 6 11 +83.33 Yes Yes Yes No Yes
#5 8 12 20 +233.33 Yes Yes No Yes Yes
Tabla A.1. Características de los modelos de representación basados en grafos.
Las métricas fueron propuestas basaron en el causas/efectos que cada uno de éstas tiene
tanto en el grafo creado como en el algoritmo de minado. Visualizamos cuatro tópicos
significativos relacionados directamente a estas métricas:
1. Espacio de búsqueda
El espacio de búsqueda en un algoritmo de minado de datos basado en grafos consiste de
todos los subgrafos que pueden ser derivados de su grafo de entrada, de tal forma, el
número de vértices (1) y arcos (2) del grafo creado (3) definen el tamaño del espacio de
búsqueda para el sistema del descubrimiento. Por lo tanto, el objetivo debe ser minimizar el
número de vértices y arcos usados para crear los grafos pero al mismo tiempo maximizar la
representatividad de los mismos. Como podemos ver en la Tabla A.1, el modelo usando el
número mínimo de vértices y arcos para representar el conjunto de datos de ejemplo es el
modelo #1 (2 vértices y 4 arcos) mientras que el modelo #5 es el caso opuesto (8 vértices y
12 arcos).
xvii
2. Tiempo de procesamiento
El tamaño del espacio de búsqueda juega un papel relevante con respecto al tiempo de
procesamiento usado para descubrir patrones. Si tenemos un espacio búsqueda “grande” se
requeriría más tiempo para evaluar todos los subgrafos posibles. Por lo tanto, una
comparativa de la métrica "porcentaje de incremento" (4) entre los modelos propuestos se
presenta en la Tabla A.1. Recordemos que esta métrica compara el tamaño de un modelo
dado con respecto al modelo #1 (modelo base). Por ejemplo, el modelo #5 tiene un
incremento de tamaño del grafo de 233.33% respecto al modelo #1. Es decir, el algoritmo
de minado requerirá la evaluación de 233.33% más vértices y/o arcos usando el modelo #5
en vez del modelo #1 para el mismo conjunto de datos.
3. Complejidad del grafo
En el capítulo 4 de la disertación se describe el sistema Subdue, nuestra herramienta de
minado de datos basada en grafos. Cómo ahí describimos, existe una mayor complejidad
para el algoritmo de minado trabajar con grafos complejos en vez de grafos simples (Ej. a
lo máximo un arco uniendo cualesquiera dos vértices dado y no ciclos). Por ejemplo, en el
proceso de macheo de grafos, la fase de expansión (Subdue emplea un enfoque de
"expansión" para descubrir patrones), y la etapa de compresión del grafo. Por lo tanto, el
objetivo fue proponer modelos basados en grafos que nos permitieran crear grafos simples.
Como podemos ver en la Tabla A.1, únicamente el modelo #1 no permite crear grafos
simples.
xviii
En consecuencia, como estrategia para romper la multiplicidad de arcos entre dos vértices
(por ejemplo, vértices representando dos objetos espaciales con dos ó más relaciones
espaciales entre ellos) se emplean los enfoques siguientes:
• Añadir un nuevo vértice etiquetado con el nombre del tipo de la relación (Ej.
topológica, distancia y dirección) por cada relación espacial entre los objetos
espaciales. Este enfoque es usado en el modelo #2.
• Agregar un nuevo vértice (modelo #4) o dos nuevos vértices (modelo #3 y modelo
#5) etiquetado como el nombre de tipo de la relación espacial (Ej. topológica,
distancia y dirección) por cada relación espacial entre los objetos espaciales y unir
este nuevo vértice (modelo #4) o nuevos vértices (modelo #3 y modelo #5) con los
vértices representando los objetos espaciales por medio de arcos etiquetados como
"Relation". El enfoque empleado para unir los vértices difiere para cada modelo tal
y como se ha comentado en la definición de cada uno de ellos. Esta nomenclatura se
utiliza para representar el hecho de que existe una relación espacial entre los objetos
espaciales. Estos arcos son conocidos como arcos “Relation” redundantes (9).
4. Representatividad de los datos
Las métricas arcos dirigidos (6), arcos no dirigidos (7), y información completa (8) son
usadas para maximizar la representatividad de los datos pero minimizando, tanto como sea
posible, el tamaño del grafo y su complejidad. Los arcos dirigidos son usados para
representar las relaciones espaciales simétricas (objeto A Norte_de objeto B, implica, B
Sur_de A), los arcos “Relation” redundantes, los atributos no-espaciales que describen a los
objetos espaciales. Arcos no dirigidos son usados para representar las relaciones espaciales
xix
equivalentes (la relación es representada por un no dirigido en vez de dos arcos dirigidos).
Finalmente, información completa significa que las relaciones espaciales simétricas entre
objetos espaciales también son representadas en el modelo.
A.4 Minando el grafo
La característica de overlap (traslape entre vértices pertenecientes a diferentes instancias de
una subestructura) desempeña un rol importante en el sistema de descubrimiento de
patrones (subestructuras) en nuestra herramienta de minado de datos basada en grafos, el
sistema Subdue. En consecuencia, los resultados generados están condicionados, en un alto
porcentaje, al funcionamiento de esta característica. Sin embargo, su implementación actual
es ortodoxa: se permite el overlap entre todas las instancias de una subestructura (sin
ninguna regla) ó no se permite el overlap entre ninguna instancia perteneciente a una
subestructura. Es decir, todo ó nada. En este contexto, proponemos un nuevo enfoque
llamado overlap limitado. Una de las ventajas principales de este nuevo enfoque es la
capacidad que se le da al usuario para especificar el conjunto de vértices donde el overlap
será permitido. Estos vértices podrían representar elementos significativos en el contexto de
trabajo. Visualizamos directamente tres motivaciones para proponer el nuevo algoritmo, los
cuales serán explicados en las subsecciones siguientes:
1. Reducción del espacio de búsqueda
En sistemas de descubrimiento de conocimiento basados en grafos, el algoritmo de minado
de datos usa grafos como su representación de conocimiento. El espacio de búsqueda de un
xx
algoritmo de este tipo consiste de todos los subgrafos que puedan ser derivados del grafo de
entrada. El proceso de descubrimiento de subestructuras en Subdue comienza con la
creación de subestructuras de un solo vértice a partir del grafo de entrada (una subestructura
por cada etiqueta de vértice que exista al menos 2 veces en el grafo). En cada iteración del
proceso de descubrimiento, el algoritmo selecciona las mejores subestructuras y expande
las instancias de esas subestructuras añadiendo un arco vecino (ó un arco y un nuevo
vértice) en todas las direcciones posibles.
Pero como parte del proceso para seleccionar las mejores subestructuras y entonces
expandirlas existe un proceso de filtrado. En este proceso, de acuerdo al valor del
parámetro overlap las instancias de una subestructura son evaluadas: si el overlap es
permito entonces las instancias compartiendo vértices son mantenidas, de lo contrario si el
overlap no es permito las instancias que comparten vértices son descartadas.
La mejor subestructura descubierta por Subdue (de acuerdo a sus métricas de evaluación),
en cada iteración, puede ser empleada para comprimir el grafo de entrada el cual entonces
puede convertirse en el nuevo grafo de entrada para una siguiente iteración. Después de
varias iteraciones, Subdue crea una descripción jerárquica de los datos, donde
subestructuras descubiertas en cierta iteración pueden estar definidas en base a
subestructuras descubiertas en iteraciones previas. De tal forma, el número de instancias de
una subestructura define el espacio de búsqueda (en cada iteración) en el proceso de
descubrimiento de subestructuras. Como podemos observar, a través de uso del overlap
limitado, se obtiene una reducción del espacio de búsqueda dado que el número de
xxi
instancias candidatas a ser expandidas se condiciona a los valores (vértices) donde el
overlap es permitido, siendo éstos establecidos por el usuario.
2. Reducción del tiempo de procesamiento
La reducción en el número de instancias candidatas a ser expandidas trae consigo una
reducción del espacio de búsqueda, y esto beneficia en una reducción del tiempo de
procesamiento para la búsqueda de subestructuras (patrones).
Permitiendo el overlap en Subdue, hace que éste sea considerablemente más lento (en
cuanto a tiempo de procesamiento) dado que el número de instancias candidatas a ser
expandidas, evaluadas, comparadas, y descubiertas se incrementa como hemos descrito.
Sin embargo, con la implementación del overlap limitado, el número de instancias a ser
procesadas en estas fases decrece resultando en una reducción del tiempo de procesamiento
en todo el proceso de descubrimiento de subestructuras.
3. Búsqueda orientada de patrones con overlap (traslape) selectivo
El overlap limitado da a usuario la capacidad para definir el conjunto de elementos donde el
overlap será permitido y que a su juicio considera relevantes para su contexto de trabajo
(Ej. un objeto espacial, un atributo descriptivo). Estos elementos son representados usando
vértices de acuerdo al modelo propuesto. En contra parte, el algoritmo descartará aquellos
elementos traslapados que el usuario no consideró significativos.
Por lo tanto, el overlap limitado proporciona al usuario un medio para implementar una
búsqueda orientada de patrones con overlap. Es decir, el usuario delimita el conjunto de
xxii
elementos que tendrán un papel preponderante en el proceso de descubrimiento de
subestructuras. Adicionalmente, esta característica ofrece la ventaja de que el proceso de
evaluación de patrones se simplifica dado que el conjunto de resultados generados es menor
porque éstos se centran sobre los requerimientos del usuario.
A.5 Resultados
Como parte de las pruebas para evaluar nuestra propuesta de modelado y minado de datos
espaciales usando un enfoque basado en grafos se desarrollaron tres casos de uso
ilustrativos. Los dos primeros casos fueron implementados usando un censo de población
del centro histórico de la ciudad de Puebla en el año de 1777. El tercer caso de uso fue
desarrollado usando una base de datos espaciales de la región del volcán Popocatépetl.
En esta sección presentamos ejemplos de resultados obtenidos con el caso de uso del
volcán Popocatépetl. Supongamos que deseamos conocer características comunes entre las
poblaciones (asentamientos poblacionales), carreteras y ríos en la zona que nos ayuden a
evaluar/implementar planes de evacuación en caso de una contingencia volcánica. Por
ejemplo, características de carreteras empezando en ó cruzando una población, material
usado para construir esas carreteras y su estado actual (Ej. pavimentadas, terracería),
características de carreteras y ríos que tengan alguna relación entre ellos (Ej. se crucen, se
tocan), ríos cerca de poblaciones que en caso de alta precipitación pluvial pudieran
representar peligros potenciales.
xxiii
Los experimentos fueron desarrollados usando los 5 modelos de representación
actualmente propuestos. En esta sección se presentan patrones encontrados con el modelo
#1 entre carreteras y ríos, carreteras y poblaciones, y por último ríos y poblaciones. La idea
de organizar la presentación de resultados de esta manera es mostrar los diversos patrones
que se pueden encontrar entre estos elementos. Al final de la sección se presentan tablas
comparando los resultados obtenidos con cada uno de los 5 modelos.
La Figura A.6 muestra el patrón más significativo descubierto entre carreteras y ríos usando
el modelo #1. El patrón describe una relación entre “carretera categoría terracería
traslapando un río categoría escurrimiento” en la zona. Este patrón puede considerarse
como un indicador del número de carreteras que necesitan ser supervisadas en caso de una
contingencia volcánica dado el tipo de material con el que están construidas, y porque ellos
atraviesan ríos (la lectura puede ser hecha en orden inverso) que en caso de altas
concentraciones pluviales pueden desbordarse e inutilizarlas. Subdue encontró con no
overlap 46 instancias del patrón en la segunda iteración; vía overlap estándar encontró 85
instancias en la primera iteración; y a través de overlap limitado también encontró 85
instancias en la segunda iteración. Como podemos observar overlap estándar y overlap
limitado encontraron el mismo número de instancias, pero overlap limitado necesitó dos
iteraciones para encontrar el mismo patrón. Sin embargo, esto no quiere decir que overlap
estándar es mejor que overlap limitado (respecto a tiempo de procesamiento) porque
analizando el tiempo global de procesamiento requerido por overlap limitado para finalizar
la fase de descubrimiento de subestructuras notamos que es menor que el requerido por
overlap estándar.
xxiv
a) b)
c)
Figura A.6. Relaciones entre carreteras y ríos usando el modelo #1.
El patrón más significativo, usando el modelo #1, encontrado entre carreteras y poblaciones
es presentado en la Figura A.7. Este describe una relación entre “carretera categoría
terracería tocando un asentamiento poblacional categoría construcción”. “Asentamiento
poblacional categoría construcción” representa en la capa de datos espaciales
“asentamientos” de la base de datos del volcán áreas habitacionales con alta población,
xxv
edificios y una gran cantidad de construcciones usadas para ofrecer servicios a los
habitantes. Si asumimos que la gente podría necesitar ser evacuada en caso de una erupción
y que las carreteras que serían usadas para ese propósito son de terracería, entonces, esta
situación podría convertirse en un problema (Ej. cuello de botella). En este experimento
Subdue encontró vía no overlap 6 instancias del patrón en la novena iteración; a través de
overlap estándar 9 instancias en la cuarta iteración; y usando overlap limitado 8 instancias
en la décima iteración.
a) b)
c)
Figura A.7. Relaciones entre carreteras y poblaciones usando el modelo #1.
xxvi
La Figura A.8 muestra el patrón más significativo encontrado entre ríos y poblaciones
usando el modelo #1. El patrón describe una relación entre “río categoría escurrimiento
cruzando un asentamiento poblacional categoría manzana o construcción" en la zona.
“asentamientos poblacionales categoría manzana" representa en la capa de datos espaciales
“asentamientos” de la base de datos del volcán áreas habitacionales con poca población, de
hecho con muchas áreas deshabitadas, escasos edificios y construcciones. El patrón puede
ser utilizado para identificar zonas potenciales de inundación, habitadas por personas, dada
la cercanía de ríos. A través de no overlap Subdue encontró 5 instancias del patrón en la
décima segunda iteración; usando overlap estándar encontró 5 instancia en la octava
iteración; y vía overlap limitado también encontró 5 instancias en octava iteración. Subdue
encontró el mismo patrón en los tres casos, sin embargo, usando overlap estándar y overlap
limitado la lectura del patrón es más simple.
xxvii
a) b)
c)
Figura A.8. Relaciones entre ríos y poblaciones usando el modelo #1.
La Tabla A.2 presenta una comparación, por modelo, entre el número de instancias
descubiertas/iteraciones necesitadas para descubrirlas y las tres implementaciones de
overlap. Por ejemplo, usando el modelo #1, Subdue encontró 46 instancias (en la segunda
iteración) de un patrón “completo” (nuestra definición para reportar un patrón “completo”
es que éste contengan al menos dos objetos espaciales y la relación espacial entre ellos)
xxviii
conteniendo los objetos espaciales carretera-río vía no overlap. Un valor más alto significa
un modelo que permite descubrir más instancias de una subestructura (patrones).
Recordemos que Subdue reporta como el mejor patrón (por iteración) la subestructura con
el número más alto de instancias descubiertas de esa subestructura. Esta comparación es
reportada por cada estructura “objeto-objeto” (Ej. carretera-río).
Nota: NO (no overlap), SO (overlap estándar), LO (overlap limitado).
Modelo #1 Modelo #2 Modelo #3 Modelo #4 Modelo #5
NO SO LO NO SO LO NO SO LO NO SO LO NO SO LO
Carretera-Río
Instancias 46 85 85 41 85 64 39 85 34 39 85 60 45 85 45
Iteraciones 2 1 2 2 3 2 2 2 9 2 1 2 5 1 5
Carretera-Población
Instancias 6 9 8 5 8 7 4 8 5 6 8 8 6 0 7
Iteraciones 9 4 10 14 6 10 15 6 13 12 10 7 7 0 7
Río-Población
Instancias 5 5 5 5 10 5 5 19 5 5 10 5 5 5 5
Iteraciones 12 8 8 16 7 14 12 4 10 6 6 11 13 6 10
Tabla A.2. Instancias/iteraciones por cada modelo basado en grafos: caso de uso Popocatépetl.
La Tabla A.3 presenta una comparación de máximo/mínimo instancias descubiertas por
cada implementación de overlap. Un modelo con el valor más alto es mejor porque permite
descubrir más instancias de una subestructura. La comparación es presentada por cada
estructura “objeto-objeto”. Por ejemplo en la estructura carretera-río el modelo #1 reportó
46 instancias descubiertas en la segunda iteración (el valor más alto).
xxix
Máximo Mínimo
Carretera-Río
No overlap modelo #1 (segunda iteración) modelos #3 y #4 (segunda iter.)
Overlap estándar modelos #1, #4 y #5 (primera iter.) modelo #2 (tercera iteración)
Overlap limitado modelo #1 (segunda iteración) modelo #3 (novena iteración)
Carretera-Población
No overlap modelo #5 (séptima iteración) modelo #3 (quinceava iteración)
Overlap estándar modelo #1 (cuarta iteración) modelo #5 (patrón no completo)
Overlap limitado modelo #4 (séptima iteración) modelo #3 (treceava iteración)
Río-Población
No overlap modelo #4 (sexta iteración) modelo #2 (dieciseisava iteración).
Overlap estándar modelo #3 (cuarta iteración) modelo #1 (octava iteración).
Overlap limitado modelo #1 (octava iteración) modelo #2 (catorceava iteración)
Tabla A.3. Max/Min de instancias descubiertas por “objeto-objeto”/característica overlap.
La Tabla A.4 presenta una comparación entre el promedio de instancias descubiertas por
modelo. Un valor más alto significa un modelo permitiendo descubrir más instancias de
una subestructura. Cada valor representa el promedio de subestructuras descubiertas usando
no overlap, overlap estándar y overlap limitado. La comparación es reportada por cada
estructura “objeto-objeto”.
Modelo #1 Modelo #2 Modelo #3 Modelo #4 Modelo #5
Carretera-Río 72.0 63.3 52.7 61.3 58.3
Carretera-Población 7.7 6.7 5.7 7.3 4.3
Río-Población 5.0 6.7 9.7 6.7 5.0
Tabla A.4. Promedio de instancias descubiertas por modelo/“objeto-objeto”.
xxx
La Tabla A.5 presenta una comparación entre el promedio de instancias descubiertas por
modelo. Un valor más alto significa un modelo permitiendo descubrir más instancias de
una subestructura. La comparación es reportada por cada implementación de overlap.
Modelo #1 Modelo #2 Modelo #3 Modelo #4 Modelo #5
NO SO LO NO SO LO NO SO LO NO SO LO NO SO LO
19.0 33.0 32.7 17.0 34.3 25.3 16.0 37.3 14.7 16.7 34.3 24.3 18.7 30.0 19.0
Tabla A.5. Promedio de instancias descubiertas por modelo/característica overlap.
La Tabla A.6 presenta un comparativo final entre el promedio de instancias descubiertas
por modelo. Podemos ver en la tabla que el modelo #1 reporta el valor más alto de
instancias descubiertas (de acuerdo a nuestros parámetros para reportar instancias
completas) en este caso de uso ilustrativo. Los siguientes modelos son el modelo #2 y el
modelo #4 respectivamente.
Modelo #1 Modelo #2 Modelo #3 Modelo #4 Modelo #5
28.2 25.6 22.7 25.1 22.6
Tabla A.6. Promedio de instancias descubiertas por modelo.
A.6 Conclusiones
La constante interacción entre los seres humanos y su hábitat natural, el planeta de la tierra,
genera día a día, nuevos requerimientos asociados al manejo y explotación de datos
espaciales. Por ejemplo, el análisis urbano, la prevención los riesgos naturales, la
exploración del espacio estelar, la contaminación en los océanos, y la reforestación de los
xxxi
suelos, por nombrar algunos de ellos. La minería de datos espaciales involucra la
integración de métodos y técnicas provenientes de diversos campos científicos los cuales
nos ayudan, por medio de algoritmos de análisis y de descubrimiento, a producir una
enumeración particular de patrones sobre los datos espaciales.
Nuestra argumentación en esta disertación doctoral se basa en la idea de que los enfoque de
minaría de datos espaciales descritos no consideran todos los elementos encontrados en una
base de datos espaciales (datos espaciales, datos no espaciales y relaciones espaciales entre
los objetos espaciales) de una manera integral. En consecuencia, se propuso emplear un
enfoque basado en grafos para representar estos elementos como un solo conjunto de datos,
minarlos como un todo, para de estar forma poder encontrar patrones conteniendo ambos
tipos de datos y relaciones espaciales (patrones más descriptivos).
En nuestro modelo las relaciones espaciales entre los objetos espaciales son incluidas
porque una característica significativa de los datos espaciales es la influencia que los
vecinos de un objeto pueden tener en el objeto mismo. En el modelo incluimos tres tipos de
relaciones espaciales. Derivado del modelo general se propusieron cinco modelos
operativos. Tres aspectos definen las características de un grafo creado con estos modelos:
(1) Representación de las relaciones espaciales equivalentes. (2) Representación de
relaciones espaciales simétricas. (3) La manera de representar los objetos y sus relaciones.
Como parte integrante de nuestra metodología para el minado de datos espaciales usando
un enfoque basado en grafos, usamos el sistema Subdue como nuestra herramienta de
minado. Se propuso un nuevo algoritmo llamado overlap limitado el cual le da al usuario la
xxxii
capacidad de especificar el conjunto de vértices sobre los cuales el overlap es permitido.
Visualizamos tres motivaciones para proponer este nuevo enfoque: (1) Reducción del
espacio de búsqueda. (2) Reducción del tiempo de procesamiento. (3) Búsqueda orientada
de patrones con overlap (traslape) selectivo.
Para demostrar la viabilidad, capacidad de minado y descubrimiento de patrones usando el
enfoque propuesto, se desarrolló un prototipo implementado nuestro modelo para crear los
conjuntos de datos basados en grafos, minar esos grafos (a través del llamado al sistema
Subdue) y visualización de patrones encontrados. Los resultados generados de los casos de
uso desarrollados nos dan un panorama respecto a cómo y qué podríamos obtener usando
este enfoque. Es importante comentar el hecho de que podemos utilizar esta metodología de
modelado y representación en cualquier dominio que pueda ser representado como grafo.
Una vez demostrada la viabilidad de nuestra propuesta, las perspectivas relacionadas a
mejorar nuestro trabajo (modelo de representación basado en grafos, algoritmo de minado
de datos y sistema prototipo) incluyen los puntos siguientes:
• Visualización de conocimiento descubierto. Por ejemplo, visualización de
resultados sobre las capas espaciales, a través del uso de gráficas, y navegación en
la jerarquía de patrones descubiertos usando un enfoque de hypergrafo.
• Mejoramiento de los algoritmos empleados para crear los conjuntos de datos
basados en grafos de acuerdo a los modelos propuestos. La validación de
relaciones espaciales entre objeto espaciales es una fase que en la mayoría de los
casos requiere gran cantidad de recursos de computacionales.
xxxiii
• Minado de los grafos. Se empleó el sistema Subdue como herramienta de minado
de datos. Así mismo, se propuso un nuevo algoritmo llamado overlap limitado. El
isomorfismo de grafos es un problema NP-completo, de tal forma, debemos ser
capaces de que nuestros tiempos de procesamiento para la búsqueda de patrones
cumpla parámetros aceptables de eficiencia.
• Manejo y representación de relaciones entre datos no espaciales. Las relaciones
implícitas y explicitas entre los atributos describiendo los objetos espaciales pueden
ser incluidos en el modelo con la finalidad de mejorar la representación de los datos.
A.7 Contribución
La contribución al descubrimiento de conocimiento en el dominio de los datos espaciales,
descrito en esta disertación, es el desarrollo de un nuevo enfoque para el modelado y
minado de datos espaciales usando una representación basada en grafos. Este enfoque
incluye los aspectos siguientes:
• Se propuso una nueva representación de datos basada en grafos para datos
espaciales. Se visualizaron dos objetivos para crear un modelo de datos con estas
características. El primero de ellos es crear un único conjunto de datos, basado en
grafos, representando estos elementos relacionados. El segundo es emplear este
conjunto de datos para alimentar a un sistema de minado de datos basado en grafos,
de tal forma que pudiéramos descubrir patrones conteniendo datos espaciales, no
espaciales y relaciones espaciales los cuales nos ayuden a describir/entender los
xxxiv
datos, basado en la premisa, de que estos son elementos relacionados en el mundo
real.
• Se propuso un nuevo algoritmo para descubrir subestructuras (patrones) usando un
enfoque de overlap limitado en el sistema Subdue, nuestra herramienta de minado
de datos. Visualizamos directamente tres motivaciones para proponer la
implementación del nuevo algoritmo: reducción del espacio de búsqueda, reducción
del tiempo de procesamiento y búsqueda orientada de patrones con overlap
selectivo (specialized overlapping pattern oriented search).
• Se diseñó e implementó un sistema prototipo implementado el modelo propuesto. El
prototipo ofrece una interfase de usuario amigable para el manejo de las capas
espaciales con las que se trabajará, para la creación de grafos espaciales y no
espaciales, para el minado de estos grafos (a través del llamado al sistema Subdue)
y para el despliegue de los resultados generados.
xxxv
Appendix B
RÉSUMÉ EN FRANÇAIS
B.1. Introduction
Durant ces dernières années nous avons été témoins de la rapide croissance du nombre, de
la capacité et de la dissémination des applications informatiques consacrées à l'obtention, la
génération, la manipulation, et le stockage de données dans divers milieux de la vie
humaine. Ces actions ont impliqué le recueil d'une grande quantité de données dont
l'analyse par des moyens manuels devient chaque jour plus compliquée. Rappelons que
dans plusieurs occasions les données “crues” ont besoin d'être analysées et interprétées
pour qu'elles soient transformées en informations utiles et profitables. Cette situation a été
résolue grâce à des techniques et des outils informatiques qui, de plus en plus nombreux,
nous aident pour atteindre ces objectifs. La découverte de connaissances dans bases de
données (KDD, sigle en langue anglaise) est définie comme l'extraction non banale
d'informations implicites, préalablement inconnues et potentiellement utiles à partir de
données [16]. Il s'agit d'un processus itératif et interactif qui comprend différentes phases.
Le noyau du processus est la phase du data mining (fouille de données), qui peut être
conceptualisée comme l'application d'algorithmes d'analyse de données et de fouille de
xxxvi
données qui grâce à des paramètres adéquats produisent et découvrent une énumération
particulière de patrons (régularités) sur des données elles-mêmes [12].
Dans ce contexte, l'analyse et l'exploitation de données provenant de phénomènes existant
en, sur, et sous la surface de la terre – appelés données spatiales –, ont engendré un nouveau
domaine d'investigation appelé fouille de données spatiales, de telle sorte que la fouille de
données spatiales s'envisage comme la découverte de connaissances implicites, et préala-
blement inconnues de données spatiales [16]. Diverses approches pour la fouille de données
spatiales ont été développées, dont ne seront examinées que les plus représentatives.
B.1.1 Méthodes basées sur la généralisation
La généralisation a démontré être une des méthodes effectives pour découvrir des
connaissances. Elle a été introduite par la communauté d'apprentissage automatique et se
base sur l'apprentissage à partir d'exemples. La découverte de connaissances basée sur la
généralisation requiert la construction de hiérarchies de concepts (donnés explicitement par
l'expert ou générés automatiquement). Dans le cas des bases de données spatiales, on peut
trouver deux types de hiérarchies de concepts: (1) Hiérarchies thématiques, par exemple,
généraliser des tomates et des bananes comme des fruits, les fruits et légumes comme des
aliments d'origine végétale, etc. (2) Hiérarchies spatiales, par exemple, généraliser une série
de points spatiaux comme une région ou pays.
Lu et al. [35] étendent une technique d'induction basée sur les attributs, aux bases de
données spatiales. Cette technique construit la hiérarchie de généralisation en résumant les
xxxvii
relations entre les données spatiales et non spatiales à un niveau de concept plus élevé. Les
auteurs présentent deux algorithmes basés sur la généralisation: (1) Approche de
domination de données non spatiales. Cette méthode réalise en première instance une
induction basée sur les attributs aux données non-spatiales et après cette étape, aux données
spatiales. (2) Approche de domination de données spatiales ; étant donné la hiérarchie de
données spatiales, la généralisation est effectuée d`abord sur ces données et après sur les
données non spatiales.
B.1.2 Regroupement
Le regroupement (clustering) est le processus permettant de regrouper de manière physique
ou abstraite des objets en classes d'objets semblables. Cette approche de fouille de données
nous aide à construire des groupes “représentatifs” d'un ensemble d'objets basé sur une
mesure de similitude (Ex. distance euclidienne). En d'autres termes, le regroupement de
données identifie des groupes (clusters) ou des régions densément peuplés en accord avec
une certaine mesure de distance dans un ensemble de données multidimensionnelles. On
peut classifier les algorithmes de regroupement en quatre groupes principaux : algorithmes
de regroupement basés sur les approches k-means (centre de gravité du cluster) et k-medoid
(objet représentatif du cluster), algorithmes hiérarchiques, algorithmes basés sur la
localisation des objets (regroupement par densité), et dernièrement ceux qui sont basés sur
des grids.
xxxviii
B.1.3 Associations spatiales
Une règle d'association spatiale est une règle qui décrit l'implication d'un ensemble d'objets
vers un autre ensemble d'objets d'une base de données spatiales [29]. Un exemple d'une
règle d'association spatiale peut être “ si la entreprise est près de Mexico City est alors une
grande entreprise”. Une règle d'association spatiale est de la forme X Y, où X et Y sont des
ensembles de prédicats spatiaux ou non spatiaux. Il existe plusieurs types de prédicats
spatiaux qui pourraient constituer une règle d'association spatiale, par exemple, des
relations topologiques comme intersection, recouvrement et des relations d'orientation
spatiale comme Gauche_de, ou Ouest_de.
B.1.4 Rapprochement et agrégation
Les méthodes basées sur le rapprochement et l'agrégation cherchent à analyser les
caractéristiques des groupes d'objets (clusters) pour former des groupes d'objets (features)
proches entre eux. La proximité devient la mesure permettant de regrouper un ensemble de
points dans un cluster en un feature. L'idée de trouver des relations de proximité n'est pas
un problème simple comme on pourrait le croire, car il existe trois raisons pour cette
affirmation. Supposons qu'on ait un cluster de points et que l'on veuille trouver les k-
features les plus proches de lui :
• La taille et forme du cluster et les features peuvent être très variés.,
• On pourrait avoir une grande quantité de features à examiner,
• Même pour trouver une forme connue (ex. polygone) qui décrit la forme du cluster,
il serait impropre de reporter les features où les limites soient plus proches aux
xxxix
limites de celui-ci, parce que la distribution des points à l'intérieur du cluster ne peut
pas être uniforme.
B.1.5 Fouille de données-images
L'extraction de patrons à partir d'images est autre approche de la fouille de données
spatiales. Dans la littérature il existe divers travaux de fouille de données développés selon
cette approche. Par exemple, Fayyad et al. [14] présentent un système pour l'identification
et la catégorisation des volcans sur la surface de Venus à partir d'images transmises par la
sonde stellaire Magellan. Dans un autre travail [15] (Second Palomar Observatory Sky
Survey) les auteurs ont utilisé des arbres de décision pour la classification des galaxies, des
étoiles et des autres objets stellaires. Stolorz et al. [44] d'une part, et Shek et al. [41] d'autre
part ont effectué des fouilles de données spatio-temporelles de données géophysiques.
B.1.6 Classification de données spatiales
La classification de données spatiales a comme objectif de trouver des règles qui divisent
un ensemble d'objets en un nombre de groupes dans lesquels les objets de chaque groupe
appartiennent à une même classe. Divers types d'information peuvent être utilisés pour
caractériser les objets spatiaux, comme par exemple, les attributs non spatiaux d'un objet,
les prédicats spatiaux et les fonctions spatiales. L'idée est d'utiliser cette information pour
en extraire les attributs pour l'étiquetage de classes (attributs qui divisent les données en
classes) et ceux dont les valeurs sont utilisés dans un arbre de décision pour créer de
nouvelles branches.
xl
B.1.7 Détection de tendances spatiales
La détection de tendances spatiales décrit les changements réguliers d'un ou plus attributs
non spatiaux d'un objet quand celui-ci se déplace d'un point de référence initiale. Un
exemple de tendance spatiale serait “si on s'éloigne du centre historique de la ville de
Puebla le prix des terrains décroit”. Les trajectoires de mouvement à partir d'un point x sont
utilisées pour modéliser ce mouvement, et l'analyse de régression sur les attributs des objets
peut être utilisée pour décrire les patrons de changements. Il existe deux types de tendances
: globales et locales.
B.2. Motivation
Bien que les approches précitées effectuent les fouilles de données spatiales avec succès,
notre perception est que celles-ci ne considèrent pas tous les éléments trouvés dans une
base de données spatiales (données spatiales, données non spatiales et relations spatiales
entre les objets spatiales) d'une manière exhaustive. C'est-à-dire que certaines d'entre elles
d'abord réalisent la fouille de données spatiales et ensuite celles non spatiales ou vice-versa,
et d'autres autorisent des combinaisons de ces éléments mais de manière restreinte. Au vu
des considérations précédentes, on propose l'argument suivant : si nous sommes capables
de représenter les données spatiales, non spatiales et les relations entre objets spatiales
comme un unique ensemble de données, et on les fouille ainsi, on pourrait générer ou
trouver des patrons de connaissances qui caractérisent notre ensemble de données contenant
ces trois éléments de manière conjointe. Pour un tel objectif, on émet l'hypothèse qu'une
xli
représentation basée sur graphes est suffisamment flexible et puissante pour représenter ces
éléments de manière conjointe, facilement compréhensible et capable de créer différentes
représentations du même domaine. Le domaine est décrit en utilisant des graphes, ces
graphes se transformant en données d'entrée pour un outil de découverte de connaissances
basé sur les graphes lequel utilise des heuristiques pour sélectionner des sous-graphes qui
sont considérés comme importants (patrons).
Un graphe est défini comme une paire G = (V,E). V = {v1,…,vn} dénote un ensemble fini
d'éléments appelés sommets. E est un ensemble d'arcs e satisfaisant E ⊆ [V]2. Donc,
chaque arc e ∈ E est une paire (vi,vj). Si (vi,vj) est une paire ordonnée pour n'importe quel
(vi,vj) ∈ E, on dit que G = (V,E) est un graphe orienté. Un graphe étiqueté possède des
étiquettes associées à leurs sommets et aussi aux arcs.
Après avoir proposé la représentation de relations spatiales entre les objets spatiaux, dans la
suivante section on détaillera les trois types de relations spatiales que l'on utilisera.
B.2.1 Relations spatiales
La position explicite des objets spatiaux définit des relations implicites de voisinage
(neighborhood) spatial entre eux. Ainsi, l'information sur le voisinage des objets spatiaux
constitue un élément de valeur qui doit être considéré comme le travail de fouille de
données spatiales. Martin Ester et al. [9][11] introduisent le concept de graphes de
voisinage pour représenter explicitement ces relations de voisinage implicite. Les graphes
de voisinage peuvent couvrir les relations de voisinage suivantes :
xlii
• Topologiques : dérivées du modèle à neuf intersections [6][7][8], ce sont des
relations que restent invariables sous des transformations linéaires, c'est-à-dire que
si les deux objets simultanément tournent, se déplacent ou changent d'échelle les
relations entre eux sont conservées.
• Distance : la distance entre deux objets compare à un seuil grâce à des opérateurs
arithmétiques tels que <, >, =. La distance entre deux objets se définit comme la
distance minimum entre eux (Ex. sélectionner tous les éléments à l'intérieur d'un
rayon de 50 kilomètres d'un point x).
• Direction : la relation spatiale R de direction entre deux objets spatiaux A et B (B R
A) est définie en utilisant un point représentatif de l'objet A et tous les points de
l'objet but B. Le point représentatif de l'objet source A peut être le centre de l'objet
ou un point sur ses limites. Ce point représentatif est utilisé comme l'origine d'un
système de coordonnées virtuelles et son quadrant définit la direction.
Une fois décrite la motivation de notre travail de recherche, nous présentons à la suite le
modèle général basée sur des graphes pour représenter les données spatiales.
B.3 Représentations basées sur des graphes
Comme dit auparavant, notre proposition s'appuie sur la création d'un modèle basé sur les
graphes pour représenter conjointement données spatiales, non spatiales, relations spatiales
entre les objets spatiaux. L'idée est d'utiliser les graphes générés comme données d'entrée
pour un algorithme de fouille de données, de telle sorte que l'algorithme puisse trouver des
xliii
patrons concernant ces éléments de manière conjointe et non comme des éléments séparés.
En conséquence, on propose le modèle de représentation donné en notation UML Figure
B.1.
Vertex
1..*
1..*
Distance DirectionTopological
Spatial relation
Spatial object
Non-spatial attribute
-From:-To:
Edge
Figure B.1. Modèle basé sur des graphes pour représenter des données spatiales.
Dans ce modèle, les données spatiales (ex. objets spatiaux), données non spatiales (ex.
attributs descriptifs), et relations spatiales sont représentées comme une collection d'un ou
plusieurs graphes orientés étiquetés. Les sommets peuvent représenter des objets spatiaux,
types de relation spatiale entre deux objets (relation binaire), ou des attributs non spatiaux
décrivant les objets spatiaux. Les arcs représentent un lien existant entre deux sommets de
n'importe quel type. Dépendant du type de sommets qu'un arc unit, celui-ci peut représenter
le nom d'un attribut descriptif ou le nom d'une relation spatiale. Les noms des attributs
peuvent se rapporter à des descriptions d'objets spatiaux et/ou à entités non spatiales. On
utilise des arcs orientés pour représenter les informations directionnelles, de relations entre
deux éléments (Ex. objet x couvre objet y) et pour décrire des attributs appartenant à un
objet (Ex. objet x a attribut z).
xliv
Actuellement il existe cinq représentations à partir du modèle général décrit auparavant.
Trois aspects définissent les caractéristiques des graphes créés dans chaque modèle: (1)
Représentation des relations spatiales équivalentes (Ex. toucher, recouvrir). (2)
Représentation des relations spatiales symétriques (Ex. contient-de/dans_de,
Nord_de/Sud_de). (3) La manière de représenter les objets et leurs relations dans le modèle.
En conséquence, les graphes créés se différencient de manière quantitative et qualitative.
Dans la partie quantitative il y a des différences telles que : nombre de sommets et d'arcs
employés pour représenter les données spatiales, non spatiales et les relations, création de
graphes simples, l'usage des arcs orientés et non orientés pour représenter des relations
spatiales et/ou des attributs descriptifs. Dans la partie qualitative on a observé, à travers des
expériences, que certains modèles ont une plus grande expressivité pour représenter
l'ensemble de données, conditionnant de manière directe la qualité des résultats générés
dans le processus de fouille. Ensuite on présente trois des cinq modèles proposés décrivant
les métriques d'évaluation créées pour caractériser à chacun d'eux.
Modèles
Dans le but de décrire les caractéristiques de chaque modèle, on va utiliser à titre d'exemple
l'ensemble de données de la Figure B.2. Comme on peut l'observer, notre ensemble de
données se compose de deux objets spatiaux, l'objet A représentant une maison et l'objet B
représentant un lac, et les trois relations spatiales suivantes: (1) Distance, objet A près de
objet B (relation équivalente). (2) Topologique, objet A touche objet B (relation
équivalente). (3) Direction, objet A Sud de objet B (relation symétrique).
xlv
North
South
EastWest
Object B
Object A
Figure B.2. Base d'exemple pour caractériser les 3 modèles proposés.
Modèle n°1 - modèle de base
La Figure B.3 montre le premier modèle créé pour représenter des données spatiales dans
l'approche proposée. Les caractéristiques du modèle selon les métriques créées pour sa
caractérisation sont :
• Nom. Sommets : 2 sommets, chacun représentant un objet spatial (objet A et objet
B).
• Nom. Arcs : 4 arcs, 3 arcs pour représenter les relations spatiales originales
existantes dans notre ensemble de données d'exemple (“près”, “touche” et
“Sud_de”) et un arc pour représenter la relation “Nord_de”, créé à partir de la
relation symétrique original “Sud_de”.
• Taille (sommets + arcs) : 6
• % incrément : 0%, celui-ci est le modèle base.
• Graphe simple. Non. Car c'est un graphe complexe avec 4 arcs unissant 2 sommets.
• Arc orienté. Oui, car on utilise les relations symétriques “Sud_de” et “Nord_de”. La
direction des arcs se fait avec la lecture des relations entre les objets.
xlvi
• Arc non orienté. Oui, car on utilise les relations équivalentes “près” et “touche”.
• Information complète. Oui, car dans le graphe est représentée la relation
symétrique“Nord_de” créée à partir de la relation symétrique original “Sud_de”.
• Arc “Relation” redondant. Non, car dans le modèle on n'utilise pas les arcs
“Relation”.
Spatialobject A
Spatialobject B
South_of
touch
close
North_of
Figure B.3. Modèle n°1 - modèle base.
Modèle n°2 - réplication simple de types de relation, information complète
Dans la Figure B.4 on représente le deuxième modèle créé pour représenter les données
spatiales. Les caractéristiques du modèle selon les métriques sont les suivantes :
• Nom. Sommets : 5 sommets, 2 sommets pour représenter les objets spatiales et trois
sommets pour représenter les types de relations spatiales “topologique”, “distance”
et “direction”. Pour chaque type de relation spéciale existant entre deux objets, on
ajoute un sommet étiqueté avec le nom du type de la relation spatiale. Dans
l'exemple il existe une relation “topologique”, une relation de “distance” et une
relation de “direction”.
• Nom. Arcs : 6 arcs, 3 arcs pour représenter les relations originales, 2 arcs pour
représenter les relations équivalentes (“près” et “touche”) créées à partir des
xlvii
relations originales et un arc pour représenter la relation symétrique (“Nord_de”)
créée aussi à partir des relations originales.
• Taille (sommets + arcs) : 11
• % incrément : +83.33%
• Graphe simple. Oui, car il existe un arc supplémentaire entre n'importe quel couple
de sommets donnés.
• Arc orienté. Oui, car on utilise toutes les relations. La direction des arcs va des
sommets représentant les objets spatiaux aux sommets représentant les types de
relations spatiales.
• Arc non orienté. Non, car dans le modèle on n'utilise pas d'arcs non orientés.
• Information complète. Oui, car on représente les relations symétriques créées à
partir des relations originales.
• Arc “Relation” redondant. Non, dans le modèle on n'utilise pas d'arcs “Relation”.
South_of
touch
close close
touch
North_of
Distance
Direction
Spatialobject A Topological Spatial
object B
Figure B.4. Modèle n°2 - réplication simple des types de relation, information complète.
xlviii
Modèle n°3 - double réplication des types de relation, information non complète
Dans la Figure B.5 on présente le troisième modèle créé pour représenter des données
spatiales utilisant une approche de graphes. Les caractéristiques en accord avec les
métriques sont:
• Nom. Sommets : 8 sommets, 2 sommets pour représenter les objets spatiaux et 6
sommets pour représenter les types de relations spatiales (“distance”, “topologique”
et “direction”). Pour chaque type de relation spatiale entre deux objets spatiaux, on
ajoute deux sommets étiquetés avec le nom du type de la relation spatiale. Par
exemple, dans nos données d'essai il existe trois types de relations : 1 relation
“topologique”, 1 relation “distance” et 1 relation “direction” ; ainsi on ajoute 6
sommets, 2 pour chaque type de relation spatiale.
• Nom. Arcs : 9 arcs, 6 arcs “Relation” pour unir les sommets représentant les objets
spatiaux avec les sommets représentant les types de relations spatiales (de chaque
sommets représentant un objet spatial naissent 3 arcs puisqu'il existe 3 types de
relation), et 3 arcs pour présenter les relations spatiales originales. Ces 3 arcs sont
utilisés pour unir les sommets représentant les types de relations spatiales.
• Taille (sommets + arcs) : 17
• % incrément : +183.33%
• Graphe simple. Oui, car il existe au moins un arc entre n'importe quel couple de
sommets données.
• Arc orienté. Oui, car on utilise les relations symétriques et les arcs “Relation”. La
direction des arcs “Relation” va des sommets représentant les objets spatiaux aux
xlix
sommets représentant les types de relations spatiales. La direction des arcs restant se
fait avec la lecture des relations spatiales entre les objets.
• Arc non orienté. Oui, car on les utilise pour représenter les relations équivalentes.
• Information complète. Non, car dans le modèle ne sont pas représentées les
relations symétriques qui sont créées à partir des relations spatiales originales.
• Arc “Relation” redondance Oui, car dans le modèle on utilise des arcs “Relation”
pour représenter explicitement l'existence et type d'une relation spatiale entre 2
objets spatiaux. De plus, on emploie ces arcs pour éviter la création de graphes
complexes.
South_of
touch
close
Relation
Distance
Topological
Direction
Distance
Topological
Direction
Spatialobject A
Spatialobject B
Relation
R elation
Relation
RelationR
elation
Figure B.5. Modèle n°3 - double réplication des types de relation, information non complète.
La Table B.1 présente les résultats des neuf métriques développées pour évaluer les
caractéristiques de chaque modèle proposé (actuellement 5 modèles). Le modèle n°1 est
appelé le modèle base.
l
Modèle Nom.
Sommet
Nom.
Arcs
Dimensions
(s + a)
%
Incrément
Graphe
Simple
Arc
Orienté
Arc non
Orienté
Information
Complète
Arc
“Relation”
Redondant
(1) (2) (3) (4) (5) (6) (7) (8) (9)
n°1 2 4 6 - Non Oui Oui Oui Non
n°2 5 6 11 +83.33 Oui Oui Non Oui Non
n°3 8 9 17 +183.33 Oui Oui Oui Non Oui
n°4 5 6 11 +83.33 Oui Oui Oui Non Oui
n°5 8 12 20 +233.33 Oui Oui Non Oui Oui
Table B.1. Caractéristiques des modèles de représentation basés sur des graphes.
Les métriques ont été proposées sur la base de causes à effets de chacune aussi bien dans le
graphe créé que dans l'algorithme de fouille. Nous apercevons quatre caractéristiques
significatives liées directement à ces métriques.
1. Espace de recherche
L'espace de recherche dans un algorithme de fouille de données basé sur des graphes
consiste en la liste des sous-graphes qui peuvent être dérivés du graphe initial, de telle
manière que les nombres de sommets (1) et arcs (2) du graphe créé (3) définissent la taille
de l'espace de recherche pour le système de découverte. Donc, l'objectif doit être de
minimiser le nombre de sommets et d'arcs utilisés pour créer les graphes mais en même
temps maximiser la représentativité des ceux-ci. Comme on peut le voir dans la Table B.1,
le modèle utilisant le nombre minimum de sommets et d'arcs pour représenter l'ensemble de
données d'exemple c'est le modèle n°1 (2 sommets et 4 arcs) alors que le modèle n°5 est le
cas opposé (8 sommets et 12 arcs).
li
2. Temps de traitement
La taille de l'espace de recherche joue un rôle important touchant au temps de traitement
utilisé pour découvrir des patrons. Si on dispose d'un grand espace de recherche, il faudra
plus de temps pour évaluer tous les sous-graphes possibles. Donc, une comparaison de la
métrique "pourcentage d'incrémentation" (4) entre les modèles proposés figure dans la
Table B.1. Rappelons que cette métrique compare la taille d'un modèle donné par rapport
au modèle n°1 (modèle base). Par exemple, le modèle n°5 augmente la taille du graphe de
233.33% par rapport au modèle n°1. C'est-à-dire que l'algorithme de fouille aura besoin
d'évaluer 233.33% plus de sommets et/ou d'arcs en utilisant le modèle n°5 au lieu du
modèle n°1 pour le même ensemble de données.
3. Complexité du graphe
Dans le chapitre 4 du mémoire de thèse, on décrit le système Subdue, notre outil de fouille
de données basées sur des graphes, outil provenant de l'Université du Texas à Arlington.
Comme dit précédemment, il existe une plus grande complexité pour l'algorithme de fouille
travaillant avec des graphes complexes au lieu de graphes simples (Ex. au maximum un arc
unissant n'importe quel couple de sommets donné). Par exemple, de plus, il faut tenir
compte dans le traitement de comparaison des graphes, de la phase d'extension (Subdue
emploie une approche d'"extension" pour découvrir des patrons), et de l'étape de
compréhension du graphe. Donc, l'objectif a été de proposer des modèles basés sur les
graphes qui nous permettent de créer des graphes plus simples. Comme on peut le voir dans
la Table B.1, seulement le modèle n°1 ne permet pas de créer des graphes simples.
lii
En conséquence, comme stratégie pour réduire la multiplicité du nombre d'arcs entre deux
sommets, on emploie les approches suivantes:
• Ajouter un nouveau sommet étiqueté avec le nom du type de la relation (Ex.
topologique, distance et direction) pour chaque relation spatiale entre les objets
spatiaux. Cette approche est utilisée dans le modèle n°2.
• Ajouter un nouveau sommet (modèle n°4) ou deux nouveaux sommets (modèle n°3
et modèle n°5) étiquetés comme le nom du type de la relation spatiale (Ex.
topologique, distance et direction) par chaque relation spatiale entre les objets
spatiaux et fusionner ce nouveau sommet (modèle n°4) ou de nouveaux sommets
(modèle n°3 et modèle n°5) avec les sommets représentant les objets spatiaux parmi
des arcs étiquetés comme "Relation". L'approche employée pour fusionner les
sommets diffère pour chaque modèle selon la définition de chacun d'eux. Cette
nomenclature est utilisée pour représenter le fait qu'il existe une relation spatiale
entre les objets spatiaux. Ces arcs sont connus comme arcs “Relation” redondants
(9).
4. Représentativité des données
Les métriques arcs orientés (6), arcs non orientés (7), et information complète (8) sont
utilisées pour maximiser la représentativité des données mais minimisant, aussi tant que
possible, la taille du graphe et sa complexité. Les arcs orientés sont utilisés pour représenter
les relations spatiales symétriques (objet A Nord_of objet B, implique, B Sud_of A), les arcs
“Relation” redondantes, les attributs non spatiaux qui décrivent les objets spatiaux. Les
arcs non orientés sont utilisés pour représenter les relations spatiales équivalentes (la
liii
relation est représentée par un arc non orienté au lieu de deux arcs orientés). Finalement,
l'information complète signifie que les relations spatiales symétriques entre objets spatiaux
aussi sont représentées dans le modèle.
B.4 Fouille du graphe
La caractéristique de recouvrement (partage de sommets appartenant à différentes instances
d'une sous-structure) accomplit un rôle important dans le système de découverte de patrons
(sous-structures) dans notre outil de fouille de données basés sur les graphes à savoir le
système Subdue. En conséquence, les résultats générés sont conditionnés par le
fonctionnement de cette caractéristique. Pourtant, son implémentation actuelle est régulière
: permettre le recouvrement entre toutes les instances d'une sous-structure (sans aucune
règle) ou ne pas autoriser le recouvrement entre aucune instance appartenant à une sous-
structure. C'est-à-dire, tout ou rien. Dans ce contexte, on propose une nouvelle approche
appelée recouvrement limité. Un des avantages principaux de cette nouvelle approche est la
capacité d'autoriser l'usager à spécifier l'ensemble de sommets où le recouvrement sera
permis. Ces sommets pourraient représenter les éléments significatifs dans le contexte de
travail. On donnera directement trois motivations pour proposer un nouvel algorithme,
lesquelles seront expliquées dans les sous-sections suivantes:
1. Réduction de l'espace de recherche
Dans les systèmes de découverte de connaissances basés sur des graphes, l'algorithme de
fouille de données utilise des graphes comme représentation de connaissance. L'espace de
liv
recherche d'un tel algorithme consiste en tous les sous-graphes qui peuvent être dérivés du
graphe initial. Le processus de découverte de sous-structures en Subdue commence par la
création de sous-structures d'un seul sommet à partir du graphe d'entrée (une sous-structure
par chaque étiquette de sommet qui existe au moins 2 fois dans le graphe). Dans chaque
itération du processus de découverte, l'algorithme sélectionnera les meilleures sous-
structures et étend les instances de ces sous-structures en ajoutant un arc voisin (ou un arc
et un nouveau sommet) dans toutes les directions possibles.
Mais comme partie du processus de sélection des meilleurs substructures et donc
d'extension, il existe aussi un processus de filtrage. Dans ce processus, en accord avec la
valeur du paramètre de recouvrement, les instances d'une sous-structure sont évaluées : si le
recouvrement est permis, les instances partageant des sommets sont maintenues ; au
contraire si le recouvrement n'est pas autorisé les instances qui partagent les sommets sont
écartés.
La meilleure sous-structure découverte par Subdue (en accord à les métriques d'évaluation),
dans chaque itération, peut être employée pour comprimer le graphe initial qui peut donc
se transformer en un nouveau graphe d'entrée pour une interaction suivante. Après
plusieurs itérations, Subdue crée une description hiérarchique des données, où les
substructures découvertes lors de certaines itérations peuvent être définies sur la base de
sous-structures découvertes lors des itérations préalables. Ainsi, le nombre d’instances
d'une substructure définit l'espace de recherche (dans chaque itération) dans le processus de
découverte de sous-structures. Comme on peut l'observer à travers de l'utilisation du
recouvrement limité, on obtient une réduction de l'espace de recherche puisque le nombre
lv
d'instances candidates à être étendues conditionne les valeurs (sommets) où le
recouvrement est permis.
2. Réduction du temps de traitement
La réduction du nombre d'instances candidates à être étendues apporte une réduction de
l'espace de recherche, et cela contribue à une réduction du temps de traitement pour la
recherche de sous-structures (patrons).
Permettre le recouvrement en Subdue, implique une évolution du temps de calcul
puisqu'augmente le nombre d'instances candidates à être étendues, évaluées, comparées, et
découvertes. Pourtant, avec l'implémentation du recouvrement limité, le nombre d'instances
à être traitées dans ces phases décroît contribuant à une réduction du temps de traitement
dans tout le calcul de découverte de sous-structures.
3. Recherche orientée de patrons avec recouvrement (partager) sélectif
Le recouvrement limité donne à l’usager la capacité de définir des ensembles d’éléments où
le recouvrement sera permis et qu'à son avis il considère comme pertinents pour son
contexte de travail (Ex. un objet spatial, un attribut descriptif). Ces éléments sont
représentés en utilisant des sommets en accord avec le modèle proposé. En contre partie,
l'algorithme écartera les éléments que l'utilisateur n'a pas considérés comme significatifs.
Par conséquent, le recouvrement limité fournit à l'utilisateur un moyen pour mettre en
œuvre une recherche orientée vers des patrons avec recouvrement. C'est-à-dire, l'utilisateur
délimite l'ensemble d'éléments qui auront un rôle prépondérant dans le processus de
lvi
découverte de sous-structures. De plus, cette caractéristique offre l'avantage que le
processus d'évaluation de patrons est simplifié puisque l'ensemble de résultats produits est
plus petit parce que ceux-ci se centrent sur les demandes de l'utilisateur.
B.5 Résultats
A partir des essais pour évaluer notre proposition de modélisation et de fouille de données
spatiales en utilisant un modèle basé sur les graphes, on a développé trois cas d'utilisation
exemplaires. Les deux premiers cas ont été mis en œuvre en utilisant un recensement de
population du centre historique de la ville de Puebla durant l'année de 1777. Le troisième
cas d'utilisation a été développé en utilisant une base de données spatiales de la région du
volcan Popocatépetl.
Dans cette section nous présentons des exemples de résultats obtenus avec le cas du volcan
Popocatépetl. Supposons que nous souhaitons connaître des caractéristiques communes
entre les habitats, routes et rivières dans la zone qui nous aident à évaluer ou mettre en
œuvre des plans d'évacuation en cas d'éventualité volcanique : par exemple, caractéristiques
de routes commençant dans ou croisant un habitat, un matériel utilisé pour construire ces
routes et leur état actuel (ex. pavées, non pavées), caractéristiques des routes et des rivières
qui ont une certaine relation entre elles (ex. croisement), rivières près d'habitats qui en cas
de haute précipitation pluviale pourraient présenter des dangers potentiels.
lvii
Les expériences ont été développées en utilisant les 5 modèles de représentation
actuellement proposés. Dans cette section on présente des patrons trouvés avec le modèle
n°1 entre routes et rivières, routes et habitats, et finalement rivières et habitats. L'idée
d'organiser la présentation des résultats de cette manière est de montrer les divers patrons
qui peuvent être découverts entre ces éléments. À la fin de la section on présente des
tableaux comparant les résultats obtenus avec chacun des 5 modèles.
La Figure B.6 montre le patron le plus significatif découvert entre des routes et des rivières
en utilisant le modèle n° 1. Le patron décrit une relation entre "route de catégorie non pavée
qui traverse une rivière catégorie écoulement" dans la zone. Ce patron peut être considéré
comme un indicateur du nombre de routes qui ont besoin d'être vérifiées en cas
d'éventualité volcanique vu le type de matériel avec lequel elles sont construites, et parce
qu'elles traversent des rivières (la lecture peut être faite en sens inverse) qui en cas de
hautes concentrations pluviales peuvent déborder et les mettre hors d'usage. Subdue a
trouvé avec non recouvrement 46 instances du patron dans la seconde itération ; par
l'intermédiaire du recouvrement standard il a trouvé 85 instances dans la première itération
; et à travers le recouvrement limité il a aussi trouvé 85 instances dans la seconde itération.
Comme nous pouvons observer les recouvrements standard et recouvrement limité ont
trouvé le même nombre d'instances, mais le recouvrement limité a eu besoin de deux
interactions pour trouver le même patron. Toutefois, ceci ne veut pas dire que le
recouvrement standard est meilleur que le recouvrement limité (en ce qui concerne temps
de traitement) parce qu'en analysant le temps global de traitement requis pour le
recouvrement limité pour achever la phase de découverte de sous-structures nous
remarquons qu'il est plus petit que celui requis par le recouvrement standard.
lviii
a) b)
c)
Figure B.6. Relations entre des routes et des rivières en utilisant le modèle n°1.
Le patron le plus significatif, utilisant le modèle n°1, trouvé entre des routes et des habitats
est présenté dans la Figure B.7. Il décrit une relation entre "route catégorie non pavée en
touchant un habitat catégorie construction". "habitat catégorie construction" représente dans
la couche de données spatiales "habitat" de la base de données du volcan, habitats avec
grosse population, bâtiments et une grande quantité de constructions utilisées pour offrir
lix
des services aux habitants. Si nous faisons l'hypothèse que les gens pourraient avoir besoin
d'être évacués en cas d'éruption et que les routes utilisées pour ce but sont non pavées,
alors, cette situation pourrait se transformer en un problème (Ex. embouteillage). Dans cette
expérience Subdue a trouvé par l'intermédiaire du non recouvrement 6 instances du patron
dans la neuvième itération ; par le biais du recouvrement standard 9 instances dans la
quatrième itération ; et en utilisant recouvrement limité 8 instances dans la dixième
itération.
a) b)
c)
Figure B.7. Relations entre des routes et des villes en utilisant le modèle n°1.
lx
La Figure B.8 montre le patron le plus significatif trouvé entre des rivières et des
populations en utilisant le modèle n° 1. Le patron décrit une relation entre "rivière catégorie
écoulement qui traverse un habitat catégorie "îlot" ou "bâtiment" dans la zone. "habitat
catégorie îlot" représente dans la couche de données spatiales "habitat" de la base de
données du volcan, des villages avec peu de population, en fait avec beaucoup de secteurs
dépeuplés, bâtiments et constructions précaires. Le patron peut être utilisé pour identifier
des zones potentielles d'inondation, habitées par des personnes isolées habitant aux
alentours de rivières. À travers le non recouvrement, Subdue a trouvé 5 instances du patron
dans la dixième seconde itération ; en utilisant le recouvrement standard a trouvé 5 instance
dans la huitième itération ; et par l'intermédiaire du recouvrement limité il a aussi trouvé 5
instances en huitième itération. Subdue a trouvé, toutefois, le même patron dans les trois
cas mais en utilisant le recouvrement standard et le recouvrement limité.
Le Tableau B.2 présente une comparaison, par modèle, entre le nombre d'instances
découvertes/itérations nécessaires pour les découvrir et les trois mises en œuvre du
recouvrement. Par exemple, en utilisant le modèle n°1, Subdue a trouvé 46 instances (dans
la seconde itération) d'un patron "complet" (notre définition pour décrire un patron
"complet" est que celui-ci contient au moins deux objets spatiaux et la relation spatiale
entre eux) en contenant les objets spatiales route-rivière par l'intermédiaire du non
recouvrement. Une valeur plus élevée signifie un modèle qui permet de découvrir plus
d'instances d'une sous-structure (patrons). Rappelons que Subdue décrit comme le meilleur
patron (par itération) la sous-structure avec le nombre plus élevé d'instances découvertes de
lxi
cette sous-structure. Cette comparaison est effectuée par chaque structure "objet-objet" (Ex.
route-rivière).
a) b)
c)
Figure B.8. Relations entre des rivières et des villes en utilisant le modèle n°1.
lxii
Annotation: NO (non recouvrement), SO (recouvrement standard), LO (recouvrement
limité).
Modèle n°1 Modèle n°2 Modèle n°3 Modèle n°4 Modèle n°5
NO SO LO NO SO LO NO SO LO NO SO LO NO SO LO
Route-Rivière
Instances 46 85 85 41 85 64 39 85 34 39 85 60 45 85 45
Itération 2 1 2 2 3 2 2 2 9 2 1 2 5 1 5
Route-Ville
Instances 6 9 8 5 8 7 4 8 5 6 8 8 6 0 7
Itération 9 4 10 14 6 10 15 6 13 12 10 7 7 0 7
Rivière-Ville
Instances 5 5 5 5 10 5 5 19 5 5 10 5 5 5 5
Itération 12 8 8 16 7 14 12 4 10 6 6 11 13 6 10
Table B.2. Instances/itérations par chaque modèle basé sur des graphes : cas d'utilisation Popocatépetl.
Le Tableau B.3 présente une comparaison de maximum/minimum instances découvertes
par chaque mise en œuvre du recouvrement. Un modèle avec la valeur plus élevée est
meilleur parce qu'il permet de découvrir davantage d'instances d'une sous-structure. La
comparaison est présentée par chaque structure "objet-objet". Par exemple dans la structure
route-rivière le modèle n°1 a trouvé 46 instances découvertes dans la seconde itération (la
valeur plus élevée).
lxiii
Maximum Minimum
Route-Rivière
Non recouvrement modèle n°1 (deuxième itération) modèles n°3 et n°4 (deuxième itér.)
Recouvrement standard modèles n°1, n°4 et n°5 (première itér.) modèle n°2 (troisième itération)
Recouvrement limité modèle n°1 (deuxième itération) modèle n°3 (neuvième itération)
Route-Ville
Non recouvrement modèle n°5 (septième itération) modèle n°3 (quinzième itération)
Recouvrement standard modèle n°1 (quatrième itération) modèle n°5 (modèle non complet)
Recouvrement limité modèle n°4 (septième itération) modèle n°3 (treizième itération)
Rivière-Ville
Non recouvrement modèle n°4 (sixième itération) modèle n°2 (seizième itération).
Recouvrement standard modèle n°3 (quatrième itération) modèle n°1 (huitième itération).
Recouvrement limité modèle n°1 (huitième itération) modèle n°2 (quatorzième itération)
Table B.3. Max/Min d'instances découvertes par "objet-objet"/caractéristique recouvrement.
Le Tableau B.4 présente une comparaison entre la moyenne d'instances découvertes par
modèle. Une valeur plus élevée signifie un modèle permettant de découvrir plus d'instances
d'une sous-structure. Chaque valeur représente la moyenne de sous-structures découvertes
en utilisant le non recouvrement, le recouvrement standard et le recouvrement limité. La
comparaison est donnée par chaque structure "objet-objet".
Modèle n°1 Modèle n°2 Modèle n°3 Modèle n°4 Modèle n°5
Route-Rivière 72.0 63.3 52.7 61.3 58.3
Route-Ville 7.7 6.7 5.7 7.3 4.3
Rivière-Ville 5.0 6.7 9.7 6.7 5.0
Table B.4. Moyenne d'instances découvertes par modèle/"objet-objet".
lxiv
Le Tableau B.5 présente une comparaison entre la moyenne d'instances découvertes par
modèle. Une valeur élevée signifie un modèle permettant de découvrir plus d'instances
d'une sous-structure. La comparaison est donnée par chaque mise en œuvre du
recouvrement.
Modèle n°1 Modèle n°2 Modèle n°3 Modèle n°4 Modèle n°5
NO SO LO NO SO LO NO SO LO NO SO LO NO SO LO
19.0 33.0 32.7 17.0 34.3 25.3 16.0 37.3 14.7 16.7 34.3 24.3 18.7 30.0 19.0
Table B.5. Moyenne d'instances découvertes par modèle/caractéristique recouvrement.
Le Tableau B.6 présente une fin comparative entre la moyenne d'instances découvertes par
modèle. Nous pouvons voir dans le tableau que le modèle n°1 donne la valeur plus élevée
d'instances découvertes (en accord avec nos paramètres des instances complètes) dans ce
cas d'utilisation exemplaire. Les modèles suivants sont le modèle n°2 et le modèle n°4
respectivement.
Modèle n°1 Modèle n°2 Modèle n°3 Modèle n°4 Modèle n°5
28.2 25.6 22.7 25.1 22.6
Table B.6. Moyenne d'instances découvertes par modèle.
B.6 Conclusions
L'interaction constante entre les êtres humains et leur habitat naturel, la planète terre,
produit jour après jour, de nouvelles demandes associées à la manipulation et l'exploitation
lxv
de données spatiales. Par exemple, l'analyse urbaine, la prévention les risques naturels,
l'exploration de l'espace stellaire, la pollution des océans, et le déboisement des sols, pour
nommer certains d'entre eux. La fouille de données spatiales intègre l'intégration de
méthodes et techniques provenant de divers domaines scientifiques lesquels nous aident, au
moyen d'algorithmes d'analyse et de découverte, à produire une énumération particulière de
patrons sur les données spatiales.
Notre argumentation dans ce mémoire de thèse se base sur l'idée que la fouille de données
spatiales ne considère pas tous les éléments trouvés dans une base de données spatiales
(données spatiales, données non spatiales et relations spatiales entre les objets spatiales)
d'une manière exhaustive. Par conséquent, on a proposé d'employer une analyse basée sur
les graphes pour représenter ces éléments comme un unique ensemble de données, les
fouiller comme un tout, de façon à pouvoir découvrir des patrons contenant les deux types
données et relations spatiales (patrons les plus descriptifs).
Dans notre modèle les relations spatiales entre les objets spatiaux sont incluses parce qu'une
caractéristique significative des données spatiales est l'influence que les voisins d'un objet
peuvent avoir avec l'objet lui-même. Dans notre modèle nous incluons trois types de
relations spatiales. A partir du modèle général on a proposé cinq modèles opérationnels.
Trois aspects définissent les caractéristiques d'un graphe créé avec ces modèles : (1)
Représentation des relations spatiales équivalentes. (2) Représentation de relations spatiales
symétriques. (3) La manière de représenter les objets et leurs relations. Comme partie
intégrante de notre méthodologie pour la fouille de données spatiales en utilisant une
analyse basée sur les graphes, nous utilisons le système Subdue comme outil de fouille.
lxvi
Nous avons proposé un nouvel algorithme appelé recouvrement limité lequel donne à
l'utilisateur la capacité de spécifier l'ensemble de sommets sur lesquels le recouvrement est
permis. Nous donnons trois motivations pour proposer cette nouvelle analyse: (1)
Réduction de l'espace de recherche. (2) Réduction du temps de processus. (3) Recherche
orientée de patrons avec recouvrement (partager) sélectif.
Pour démontrer la viabilité, capacité de fouille et de découverte de patrons en utilisant
l'analyse proposée, on a développé un prototype pour la mise en œuvre de notre modèle en
créant les ensembles de données basées sur les graphes, pour fouiller ces graphes (par
l'entremise du système Subdue) et pour visualiser les patrons découverts. Les résultats
produits des cas d'utilisation développés nous donnent un vaste panorama de ce que nous
pourrions obtenir en utilisant cette analyse. Il est important de généraliser le fait que nous
pouvons utiliser cette méthodologie de modélisation et de représentation dans tout domaine
qui peut être représenté pour un graphe.
Les perspectives d'amélioration de notre travail (modèle basé en graphes, algorithme de
fouille de données et système prototype) incluent les points suivants :
• Visualisation de connaissances découvertes. Par exemple, visualisation de
résultats sur les couches spatiales, à travers l'utilisation d'icones, et la navigation
dans la hiérarchie de patrons découverts.
• Amélioration des algorithmes employés pour créer les ensembles de données
basées sur des graphes en accord avec les modèles proposés. La validation des
lxvii
relations spatiales entre objets spatiaux est une phase qui dans la majorité des cas
requiert une grande quantité de ressources en calcul.
• Fouille de graphes. On a employé le système Subdue comme outil de fouille de
données. De plus, on a proposé un nouvel algorithme appelé recouvrement limité.
L'isomorphisme de graphes est un problème NP-complet et par conséquent, les
algorithmes devront être capables de réduire les temps de traitement pour la
recherche de patrons.
• Manipulation et représentation de relations entre des données non spatiales.
Les relations implicites et explicites entre les attributs en décrivant les objets
spatiaux peuvent être inclus dans le modèle afin d'améliorer la représentation des
données.
B.7 Contribution
La contribution à la découverte de connaissances dans le domaine des données spatiales,
décrite dans ce mémoire, est le fruit d'une nouvelle façon de modéliser et de fouiller les
données spatiales en utilisant une représentation basée sur des graphes. Cette approche
inclut les aspects suivants :
• Nous avons proposé une nouvelle représentation de données basée sur les graphes
pour traiter les données spatiales. On a atteint deux objectifs pour créer un modèle
de données avec ces caractéristiques. Le premier d'entre eux est de créer un seul
ensemble de données, basé sur des graphes, en représentant ces éléments en rapport
les uns avec les autres. Le deuxième est d'employer cet ensemble de données pour
lxviii
alimenter un système de fouille de données basé sur des graphes, de telle sorte que
puissions-nous découvrir des patrons contenant des données spatiales, non spatiales
et des relations spatiales lesquelles nous aident à décrire et comprendre les données,
tout ceci basé la prémisse qu'il s'agit d'éléments en rapport dans le monde réel.
• Nous avons proposé un nouvel algorithme pour découvrir des sous-structures
(patrons) en utilisant une analyse de recouvrement limité dans le système Subdue.
Nous donnons directement trois motivations pour proposer la mise en œuvre du
nouvel algorithme : réduction de l'espace recherche, réduction du temps de
processus et recherche orientée de patrons avec recouvrement sélectif (specialized
overlapping pattern oriented search).
• On a conçu et on a mis en œuvre un prototype pour le modèle proposé. Le prototype
offre une interface d'utilisateur convivial pour la manipulation des couches spatiales
avec lesquelles on travaillera, pour la création de graphes spatiaux et non spatiaux,
pour la fouille de ces graphes (à travers du système Subdue) et pour le déploiement
des résultats produits.
1
Chapter 1
INTRODUCTION
Due to the advances in data generation and recompilation we are facing a continuous
growth in data collections. The analysis and interpretation of this data by manual
techniques is sometimes a tough task; therefore, different methods have been proposed to
help us transform it into useful knowledge. Knowledge discovery in databases (KDD) is
defined as the non-trivial extraction of implicit, previously unknown, and potentially useful
information from data [16]. This is an interactive and iterative process that involves several
phases. The data mining phase is the nucleus of the process; it consists of the application of
data analysis and discovery algorithms that, under acceptable computational efficiency
limitations, produces a particular enumeration of patterns over the data [12]. Data mining
involves the integration of methods from different scientific fields like machine learning,
database technology and statistics. The first approaches developed focused on the discovery
of knowledge from relational data.
Nowadays, however, terms like geoprocessing and Geographic Information System are
used widely in daily life. This is a result of the improvement in the human capabilities to
create, manipulate, store and use data from phenomena on, above or below the earth’s
2
surface. This data is known as spatial data. Spatial data mining focuses on the discovery of
implicit and previously unknown knowledge in spatial databases [16]. Spatial data has
many features that distinguish it from relational data such as the relationships among the
objects, complexity, and the query language used to access them.
1.1 Motivation
As a result of the growth in the volume of spatial datasets and the necessity for tools to help
us transform them into useful information, several approaches have been developed for
knowledge discovery from spatial data: the principle in the Generalization-based methods
[21, 35] is that data and objects often contain detailed information at primitive concept
levels but sometimes it is desirable to summarize that information and present it at a higher
concept level. Clustering [26, 28, 37, 40, 46] is the process of grouping physical or abstract
objects into classes (clusters) of similar objects so that the members of a cluster are as
similar as possible whereas members of different clusters differ as much as possible from
each other. Spatial associations [13, 29] discover rules that associate one or more spatial
objects with other spatial objects. In Approximation and aggregation methods [27] the idea
is to analyze the characteristics of the clusters in terms of the features (objects) close to
them. Aggregate proximity is the measure of closeness of the set of points in the cluster to a
feature. Mining in image and raster databases [13, 14] can be viewed as another approach
of spatial data mining. Some applications of this approach (based on images) are automatic
recognition and categorization of astronomical objects; classification of stars, galaxies and
other stellar objects. Spatial Classification [31] is the task of assigning objects to a set of
3
classes based on their attribute values. Spatial Trend Detection [10] describes the regular
change of one or more non-spatial attributes of an object when moving away from a given
starting object.
However, we argue that these approaches do not consider all the elements found in a spatial
database (spatial data, non-spatial data and spatial relations among the spatial objects) in an
extended way. Some of them focus first on spatial data and then on non-spatial data or vice
versa, and others consider restricted combinations of these elements. We think that it is
possible to enhance the generated results of the data mining task by mining them as a whole
and not as separated elements (in the real world they are related). In this context, we
propose to use a graph-based representation since it provides the flexibility to describe
these elements together and this is the motivation to explore the area of graph-based spatial
knowledge discovery.
Our work is based on the hypothesis that if we create a graph-based model to represent
together spatial and non-spatial data and if we use this model for generating a dataset
composed of both type of data, then we can apply data mining techniques using this
knowledge representation to spatial and non-spatial data at the same time and get
descriptive patterns considering both kind of data about objects and the spatial relations
among them.
4
1.2 Proposal
Our proposal is to create a unique graph-based model to represent spatial data, non-spatial
data and the spatial relations among spatial objects. We will generate datasets composed of
graphs with a set of these three elements. We consider that by mining a dataset with these
characteristics a graph-based mining tool can search patterns involving all these elements at
the same time improving the results of the spatial analysis task. A significant characteristic
of spatial data is that the attributes of the neighbors of an object may have an influence on
the object itself, therefore, to enhance the data representativeness we propose to include in
the model three relationship types (topological, orientation, and distance relations).
Moreover, from a general point of view spatial database systems are relational databases
plus a concept of spatial location and spatial extension. So, most KDD algorithms for
spatial databases must make use of those neighborhood relationships because it is the main
difference between KDD in relational database system and spatial database system.
In the model the spatial data (i.e., spatial objects), non-spatial data (i.e., non-spatial
attributes), and spatial relations are represented as a collection of one or more directed
graphs. A directed graph contains a collection of vertices and edges representing all these
elements. Vertices represent either spatial objects, spatial relation types between two spatial
objects (binary relation), or non-spatial attributes describing the spatial objects. Edges
represent a link between two vertices of any type. According to the type of vertices that an
edge joins, it can represent either an attribute name or a spatial relation name. The attribute
name can refer to a spatial object or a non-spatial entity. We use directed edges to represent
5
directional information of relations among elements (i.e., object x covers object y) and to
describe attributes about objects (i.e., object x has attribute z).
We propose to adopt the Subdue system [24, 45], a general graph-based data mining system
developed at the University of Texas at Arlington, as our mining tool. Subdue discovers
substructures using a graph-based representation of structural databases. The substructures
(a connected subgraph within the graphical representation) describe structural concepts in
the data (i.e., patterns). The discovery algorithm follows a computationally constrained
beam search. The algorithm begins with the substructure matching a single vertex in the
graph. On each iteration, the algorithm selects the best substructure and incrementally
expands the instances of the substructure. An instance of a substructure in the input graph is
a subgraph that matches (graph theoretically) that substructure.
A special feature named overlap has a primary role in the substructures discovery process
and consequently a direct impact over the generated results. However, it is currently
implemented in an orthodox way: all or nothing. If we set overlap to true, Subdue will
allow the overlap among all instances sharing at least one vertex. On the other hand, if
overlap is set to false, Subdue will not allow the overlap among instances sharing at least
one vertex. We argue that a third option is needed: a limited overlap. With this option we
give the user the capability to set over which vertices, the overlap will be allowed (vertices
representing remarkable elements that refer, for instance, to a spatial object in a spatial
database or to some characteristic defining a particular topic of a dataset). We visualize
directly three motivations issues to propose the implementation of the new algorithm:
search space reduction, processing time reduction, and pattern oriented search.
6
1.3 Contribution
The contribution to the discovery knowledge in the spatial data domain, described in this
dissertation, is the development of a new approach for spatial data modeling and mining
using a graph-based representation. This approach includes the following issues:
• We proposed a new graph-based data representation for spatial data, non-spatial
data and spatial relations among the spatial objects. We visualize two objectives for
creating a data model with these characteristics. The first one is to create a unique
graph-based dataset representing these related elements. The second one is to use
this dataset to feed a graph-based mining system, so we can discover single patterns
(involving these elements) that will help us to describe/understand the data, based
on the premise that they are related elements in the real world.
• We proposed a new algorithm to discover substructures (patterns) using a limited
overlap approach in the Subdue system. We visualize directly three motivations
issues to propose the implementation of the new algorithm: search space reduction,
processing time reduction, and specialized overlapping pattern oriented search.
• We designed and developed a prototype system implementing the proposed model.
The prototype provides to the user a friendly graphical user interface for managing
the spatial layers to work with, for creating spatial and non-spatial graphs, for
mining those graphs (by calling the Subdue system) and for displaying the
generated results.
7
1.4 Organization of the thesis
The thesis is structured in the following way: Chapter 1 presents the motivation, proposal,
and contributions of the thesis. Related work is described in Chapter 2. In Chapter 3 we
detail our graph-based model to represent together spatial data, non-spatial data, and spatial
relations. Our graph-based data mining tool, the Subdue system, and the new limited
overlap algorithm are described in Chapter 4. A prototype system implementing our model
is presented in Chapter 5. Use-cases showing the applicability of our proposal are described
in Chapter 6. Finally, conclusions and final remarks are commented in Chapter 7.
8
Chapter 2
RELATED WORK
Knowledge Discovery in Databases, spatial data, non-spatial data, Data Mining, Spatial
Data Mining, Geographic Information System, geoprocessing, and Geomatics are terms
broadly used nowadays. This chapter presents an outline of these issues. The Geographic
Information Systems are presented in Section 2.1. In Section 2.2 we describe related work
about Knowledge Discovery in Databases and data mining. Finally, Section 2.3 describes
the three types of spatial relations we propose to incorporate in our model to represent
spatial data.
2.1 Geographic Information System (GIS)
A Geographic Information System is defined as a tool for the manipulation of geographic
data [2]. The GIS performs a great diversity of functions; some of them are the compilation,
verification, storage, retrieval, manipulation, update and visualization of geographic data.
Additionally, one of its more important features is the inclusion of data analysis modules.
9
All these functions are applied by a GIS to geographic data, stored generally in a
geographic database. The data processed and manipulated is georeferenced, that is, it is
assigned to a specific location on the Earth’s surface using a coordinate system.
The GIS can process data from several sources. For example, data collected from maps,
images and photography, statistical data from mathematical analyses, and data from CAD
systems (Computer-Assisted Design).
A GIS organizes and handles digital data stored generally in a geographic database. The
databases are important in the GIS technology because they store geographic data with a
structured form, allowing the data to be used for many tasks. Many GIS implement
additional functionalities when they use database management systems (DBMS) to store
and to handle all or some data in an independent subsystem.
The diversity of the uses of the GIS has generated the proliferation of a great variety of
definitions of GIS. A user generally defines a GIS according to what he uses it for and his
own experience and abilities. Some of these definitions are:
• A system for data processing designed for the production and/or visualization of
maps.
• An information system to respond to questions about Earth’s properties or soil
types.
• A system for decision making support in situations of natural phenomena.
• An electronic positioning system to be used by terrestrial or marine transportation.
10
The GIS frequently is named according to its application field [2]. For example, when they
are used to manage land’s registries they are generally called Land Information Systems
(LIS); in applications of municipal and natural resources they are important components of
the Urban Information Systems (UIS), and Natural Resources Information Systems (NRIS)
respectively. The term Automatic Mapping/Facility Management (AM/FM) is used by
public maintenance companies, transportation agencies, and local governments for systems
dedicated to the operation and maintenance of networks.
The GIS’s field (see Figure 2.1) involves many disciplines, applications, data types, and
end users, for example:
• Disciplines: Computer Science, Cartography, Spatial Analysis, Topography,
Hydrography, Statistic, Information Sciences, Planning, etc.
• Applications: Operation and maintenance of networks and other devices,
administration of natural resources, highway planning, map production, urban
analysis, planning analysis, etc.
• Data: Digital maps, digital images and photography, data satellites, video images,
etc.
• Users: Planners, topographers, vulcanologists, geographers, environmentalist,
engineers, etc.
11
DisciplinesDisciplinesUsersUsers
ApplicationsApplicationsDataData
Figure 2.1. Geographic Information System.
Geographic Information Systems involve the uses of systems and science. Their use arise
questions such as: How does a GIS user know that the results obtained are accurate? How
can user interfaces be made readily understandable by novice users? M. F. Goodchild
published in 1992 a paper where he argued that questions such as these and their systematic
study constituted a science.
Geoprocessing study the fundamental issues arising from geographic information (i.e.,
creation, handling, storage and use of the information). The term Geomatics, the fusion of
ideas from geosciences and informatics, is defined as the umbrella covering all fields that
are today important for understanding and further developing information systems in this
context [32, 33, 34].
12
A GIS offers its users greater capabilities to process datasets than those offered by manual
systems. In a GIS database the data is stored in a structured form, unlike the manual
systems where the data is stored in files, maps, and/or reports. The data can be recovered
from geographic databases and processed faster and more safely than in manual systems.
We can classify the GIS’s users into two groups. In the first group, we find professional
operators. They are people trained in some particular software and they know the
capabilities of this technology. Many times these people do not use the results of their
work, but they pass them on to the end users.
The second user group spends less time working with the GIS’s. They maintain geographic
information in order to have tools which help them in the decision making process. They
have few opportunities for extensive training in the GIS tools, and consequently the GIS
must be simple and easy to handle.
2.2 Data Mining
Knowledge Discovery in Databases (KDD) is defined as the non-trivial extraction of
implicit, previously unknown, and potentially useful information from data [16]. This is an
interactive and iterative process that involves several phases: data preparation, search for
patterns over the data, evaluation and interpretation of discovered patterns, and refinement
of the whole process as it is shown in Figure 2.2.
13
Cleandata
Formattedandstructured data
PatternsKnowledge
Datamining
Patternsevaluation
Integration
Data Integrateddata
Selecteddata
Cleanning Selection
Transformation
Figure 2.2. Knowledge Discovery in Databases.
The data mining phase (the search for patterns) is the nucleus of the process; it consists of
the application of data analysis and discovery algorithms that, under acceptable
computational efficiency limitations, produces a particular enumeration of patterns over the
data [4, 12].
Data mining involves the integration of methods from different scientific fields such as
machine learning, database technology, statistics, and visualization as shown in Figure 2.3.
The first approaches developed focused on the discovery of knowledge from relational
data.
14
MachineLearning
DatabaseTechnology
Statistics
Visualization
OtherDisciplines
InformationScience
Data Mining
Figure 2.3. Data mining: integration of several fields.
Several architectures have been proposed for data mining [22, 25]. Figure 2.4 presents an
architecture based on the proposal of Han et al. [23]. The user is the trigger of the entire
process and receptor of the discovered knowledge. Data may be fetched from several
sources such as files, databases, and data warehouses using a data server module. The data
mining engine may use one or more data mining techniques for searching patterns from
data. The significance, importance and interestingness of the found patterns are evaluated
by the pattern evaluation module. The data mining engine and pattern evaluation modules
may use background knowledge stored in a knowledge database. The role of the graphical
user interface is to receive the user requirements and to deliver the generated results. Since
this is an iterative and interactive process, the components may interact among themselves.
15
Data Server
Data Mining Engine
Pattern Evaluation
Graphical User Interface
Database
KnowledgeDatabase
DataWarehouseData
DataData
U S E R S
Figure 2.4. Architecture for a KDD system.
2.2.1 Spatial Data Mining
Climate change, natural risk prevention, human demography, deforestation, and natural
resources atlas are examples from a large variety of issues arisen as a result of the
interaction among people and their natural environment, the Planet Earth. Data generated
from those issues are known as spatial data. Spatial data mining methods focuses on the
discovery of implicit and previously unknown knowledge in spatial databases [16].
Spatial data have many features that distinguish them from relational data. For example, the
spatial objects may have topological, distance, and direction information, the complexity,
16
and the query language used to access them. Different approaches have been developed for
knowledge discovery from spatial data such as Generalization-based methods [21, 35],
Clustering [26, 28, 37, 40, 46], Spatial associations [13, 29], Approximation and
aggregation [27], Mining in image and raster databases [13, 14], Classification learning
[31], and Spatial trend detection [10]. In the following subsections we present a brief
description of these spatial data mining approaches.
2.2.1.1 Generalization-based Method
Generalization has been shown to be one of the effective methods of discovering
knowledge. It was introduced by the machine learning community and it is based on
learning from examples. Generalization-based knowledge discovery requires concept
hierarchies (given explicitly by experts or generated automatically). In the case of the
spatial databases, there can be two types of concept hierarchies:
• Thematic hierarchies. We can generalize tomatoes and bananas as fruits, fruits and
vegetables as cash crops.
• Spatial hierarchies. We can generalize some geographic points as a country or a
region.
The approach introduced by the machine learning community (tuple-oriented) cannot be
directly adopted for large spatial databases because it does not handle very well the noise
and inconsistent data, and the algorithms are exponential in the number of examples. Han et
al. [21] present a modified technique named attribute-oriented induction for mining
17
relational data. Lu et al. [35] extended the attribute-oriented induction technique to spatial
databases. Attributed-oriented induction is performed by climbing the generalization
hierarchies and summarizing the general relationships between spatial and non-spatial data
at a higher concept level. The authors present two generalization based algorithms:
• Non-spatial data dominant. This method performs attribute-oriented induction on
the non-spatial attributes, first generalizing them to a higher concept level and later
merges corresponding spatial attributes (using spatial merge and approximation).
• Spatial data dominant. Given the spatial data hierarchy, generalization can be
performed first on the spatial data and then generalizing their corresponding non-
spatial attributes.
Both algorithms assume that the rules to be mined are general data characteristics
(characteristics rules) and that the discovery process is initiated by the user who provides a
learning request. A disadvantage in this approach is the case where a hierarchy may not
exist or the hierarchy given by the experts may not be entirely appropriate in some cases.
The quality of mined characteristics is highly dependent on the structure of the hierarchy.
2.2.1.2 Clustering
Clustering is the process of grouping physical or abstract objects into classes of similar
objects. Clustering analysis helps construct meaningful partitioning of large set of objects.
It identifies clusters, or densely populated regions, according to some distance
measurement in a multidimensional dataset. Given a set of multidimensional data points,
the data space is usually not uniformly occupied by the data points. Data clustering
18
identifies the sparse and the crowded places, and hence discovers the overall distribution
patterns of the dataset. We can classify clustering algorithms in four main approaches:
Partitioning, Hierarchical, Locality-based and Grid-based algorithms.
• Partitioning algorithms partition a database of n objects into a set of k clusters
which are represented by the gravity of the cluster (k-means algorithms) or by one
representative object of the cluster (k-medoid algorithms). These algorithms use a
two-step procedure. First, they determine k representatives, next, assign each object
to the cluster with its representative closest to the considered object.
• Hierarchical clustering algorithms decompose the database into several levels of
partitioning which are usually represented by a dendrogram. The algorithm
iteratively splits the database into smaller subsets until some termination condition
is satisfied. The dendrogram can either be created top-down (divisive) or bottom-up
(agglomerative).
• Locality-based clustering algorithms group neighboring data elements into clusters
based on local conditions and therefore allow the clustering to be performed in one
scan of the database.
• Grid-based algorithms quantize the space into a finite number of cells and then do
all operations on the quantized space. They frequently use hierarchical
agglomeration as one of their processing phases.
Classification of clustering algorithms is neither straightforward, nor canonical. Some
algorithms perform clustering by combining techniques from these approaches. Important
issues in clustering algorithms include the following properties:
19
• The algorithm must be efficient (time complexity).
• Ability to handle noise (outliers).
• Non-sensibility to data input order.
• A Priori knowledge and parameters tuning.
• Ability to find clusters of arbitrary shape.
• Scalability to large databases.
• Ability to work with high dimensional data.
2.2.1.3 Spatial Associations
A spatial association rule is a rule which describes the implication of one or a set of
features by another set of features in spatial databases [29]. An example of a spatial
association rule is “if the company is close to México City then is a big company”.
A spatial association rule is of the form X Y, where X and Y are sets of spatial or non-
spatial predicates. There are various kinds of spatial predicates that could constitute a
spatial association rule. Some examples are topological relations such as intersects,
overlaps, disjoint; and spatial orientations such as left_of, west_of.
Koperski and Han [29] developed an algorithm for spatial associations rules in spatial
databases. They use the concepts of minimum support and minimum confidence introduced
by Agrawal et al. [1] to develop association rules from large transactional databases. The
support of a pattern A in a set of spatial objects S is the probability that a member of S
satisfies pattern A, and the confidence of A B is the probability that a pattern B occurs if
20
pattern A occurs. A user or an expert may assign thresholds to confine the rules to be
discovered to be strong ones.
Although many spatial association rules may exist in large databases, some of them may
occur rarely or may not hold in most cases. In addition, such rules are usually not 100%
accurate and may contain non trivial knowledge.
The authors employ a method which uses a top-down progressive deepening search
technique. The technique firstly searches at a high concept level for large patterns and
strong implications relationships among the large patterns at a coarse resolution scale. Then
only for those large patterns, it deepens the search to lower concept levels. Such deepening
search process continues until no large patterns can be found. The search employed for
large patterns at high concept levels is applied at a coarse resolution scale efficiently by
using approximate spatial computation algorithms such as R-trees [20] or plane-sweep [39]
techniques operating on minimum bounding rectangles (MBR). Only the candidate spatial
predicates, which are detailed reviewed, will be computed by refined spatial techniques.
Such multiple-level approach saves much computation because it is very expensive to
perform detailed spatial computation for all possible spatial association relationships.
2.2.1.4 Approximation and Aggregation
Clustering algorithms are effective and efficient methods for answering questions such as:
where are the clusters in the spatial database? In some cases it is also important to answer
the question why the clusters are there. We can rephrase the question as: what are the
21
characteristics of the clusters in terms of the features (objects) that are close to it? Knorr
and Ng [27] present a study based on this question.
The aggregate proximity is the measure of closeness of the set of points in the cluster to a
feature as opposed to the distance between a cluster boundary and the boundary of a
feature. Finding aggregate proximity relationships is not as simple as it may seem. There
are three reasons:
• The sizes and shapes of the cluster and the features may vary greatly.
• There may be a very large number of features to examine.
• Even if a suitable feature (i.e., polygon) is found to describe the shape of the cluster
of points, it is inappropriate to simply report those features whose boundaries are
closest to the cluster’s boundary, because the distribution of points in a cluster may
not be uniform.
The authors propose the use of computational geometry concepts to find out the
characteristics of a given cluster in terms of the features close to it. They present the
algorithm CRH (where C is for encompassing circle1, R for isothetic rectangle2, and H for
convex hull3) which uses concepts as filters to reduce the candidate features at multiple
levels. In general, they collect a large number of features from multiples sources (i.e.,
1 A circle that encloses a set of n points. Not necessarily minimum bounding.
2 A rectangle that is orthogonal to the coordinate axes.
3 This is the unique, minimum bounding convex shape enclosing a set of points.
22
maps) and feed them along with the cluster to the algorithm CRH and discover knowledge
about spatial relationships.
Approximation by circles and then by rectangles is used to eliminate features that have
large aggregate distance to the cluster. After these filters, the algorithm calculates the
aggregate proximity of points in the cluster to the convex boundary of each feature that
passed through the previous filters. In the last step, the algorithm reports the features with
the best aggregate proximities showing the minimum and maximum distances of points in
the cluster to the feature, average distance, and percentages of points located in the distance
less than specified threshold.
2.2.1.5 Mining an Image Database
Knowledge mining from image databases can be viewed as a special case of spatial data
mining. For example, Fayyad et al. present a system [14] to identify and categorize
volcanoes on the surface of Venus from images taken by the Magellan spacecraft. Three
basic components are implemented in the system: data focusing, feature extraction, and
classification learning. The first component increases the overall efficiency of the system
by first identifying the portion of the image being analyzed that is most likely to contain a
volcano. They compare the intensity of the central pixel of a region to the estimated mean
background intensity of its neighborhood pixels. The second component extracts interesting
features from the data. The final component uses training examples provided by the experts
to create a classifier that can discriminate between volcanoes and false alarms. For this task
the authors implement decision trees.
23
Other studies of mining in image and raster databases are: Second Palomar Observatory
Sky Survey [15], it uses decision trees for the classification of galaxies, stars and other
stellar objects. Stolorz and Dean [43] proposed a system for detecting earthquakes from
space. They combined methods of statistical interference, massively parallel computing,
and global optimization to build the system that analyze tectonic activities with sub-pixel
resolution over a large area. Stolorz et al. [44] and Shek et al. [41] carry out studies about
fast spatio-temporal data mining from geophysical datasets; they described CONQUEST, a
distributed parallel querying and analysis mining tool.
2.2.1.6 Classification Learning
Spatial classification has as objective to find rules that divide a set of objects into a number
of groups, where objects in each group belong mostly to one class. Many types of
information can be used to characterize spatial objects. We can classify such information
into non-spatial attributes of objects, spatially related attributes with non-spatial values,
spatial predicates and spatial functions.
Each of these categories may be used to extract both for class label attributes (attributes that
divide data into classes) and predicting attributes (attributes on whose values the decision
tree is branched). It is possible to use aggregate values for some of these attributes.
Koperski et al. [31] proposed and evaluated a method for classification of spatial objects.
The method enables classification of spatial objects based on aggregate values of non-
24
spatial attributes for neighboring regions. Spatial relations between objects on the map are
taken into a count, which may be represented into the form of predicates.
2.2.1.7 Spatial Trend Detection
Spatial trend detection is defined as a regular change of one or more non-spatial attributes
in the neighborhood of some object in a database [10]. An example of spatial trend is as
“moving away from downtown Puebla, the price of land decreases”.
Neighborhood paths starting from some point x are used to model the movement and a
regression analysis is performed on the respective attribute values for the objects of a
neighborhood path to describe regularity of change. In the regression analysis the distance
from x is the independent variable and the difference of the attribute values are the
dependent variable(s). There are two types of trends: global trends and local trends. In the
first one, the existence of a global trend for a start object x indicates that if considering all
objects on all paths starting from x the values for the specified attribute(s) in general tend to
increase or decrease with increasing distance. Local trends exist only in a particular
direction.
2.3 Spatial relations
The explicit location and extension of objects define implicit relations of spatial
neighborhood. Therefore, the information about the neighborhood of spatial objects
constitutes a valuable element that must be considered in the mining task. In the following
25
subsection we will present the Neighborhood Graphs, Neighborhood Paths and
Neighborhood indices concepts that introduce us to the three types of spatial relations we
propose to include in our model to represent together spatial data.
2.3.1 Neighborhood Graphs, Neighborhood Paths and Neighborhood
Indices
Martin Ester et al. [9, 11] introduce the concept of neighborhood graphs for explicitly
representing those implicit neighborhood relations relevant for the KDD tasks. He claimed
that attributes of the neighbors of some object of interest may have an influence on the
object and therefore have to be considered as well. He commented that the efficiency of
many KDD algorithms for spatial database systems depends heavily on an efficient
processing of these neighborhood relationships. Neighborhood graphs may cover one of the
following neighborhood relations:
• Topological.
• Distance.
• Direction.
These relations are called binary relations since we can determine spatial relations between
pairs of objects.
A neighborhood graph G for some spatial relation neighbor is a graph where nodes are
objects in the database and edges between nodes n1 and n2 represent the fact that the
relationship neighbor (n1, n2) holds. The neighborhood path in the neighborhood graph G is
26
a set of nodes directly connected through the edges of the graph. The neighborhood index is
a data structure that allows for an efficient execution of the operations for the construction
of a graph and for browsing and expansion of paths. It stores all neighbors for the objects in
a database.
2.3.2 Topological Relations
Topological relations are those relations which are invariant under linear transformations,
i.e., if both objects are rotated, translated or scaled simultaneously the relations are
preserved. They present a definition of topological relations derived from the nine
intersection model [6, 7, 8].
In the model the topological relations between two objects are defined in terms of the
intersections of the interiors, boundaries and exteriors of the objects (see Figure 2.5). The
interior of an object consists of points that are in the object but not on its boundary, and the
exterior consists of those points that are not in the object (its complement). The Figure 2.6
shows the nine intersection model of two objects A and B: object A’s interior (Aº), object
A’s boundary (∂A) and object A’s exterior (A¯) with object B’s interior (Bº), object B’s
boundary (∂B) and object B’s exterior (B¯).
27
Interior
Exterior
Boundary
Aº∩Bº Aº∩∂B Aº∩B¯∂A∩Bº ∂A∩∂B ∂A∩B¯A¯∩Bº A¯∩∂B A¯∩B¯
Figure 2.5. Example of the interior, boundary and
exterior of a circle.
Figure 2.6. Nine intersection model.
In Figure 2.7 we present the topological relations between two objects:
• Disjoint. The boundaries and interiors of the objects do not intersect.
• Contains. The interior and boundary of one object is completely contained in the
interior of the other object.
• Inside. The opposite of contains.
• Equal. The two objects have the same boundary and interior.
• Touch. The boundaries of the objects intersect but the interiors do not intersect.
• Covers. The interior of one object is completely contained in the interior of the
other object and their boundaries intersect.
• CoveredBy. The opposite of covers.
• Overlap. The boundaries and interiors of the two objects intersect.
28
Figure 2.7. Topological Relations.
2.3.3 Distance Relations
Distance relations compare the distance between two objects with a given constant using
arithmetic operators such as <, >, =. The distance between two objects is defined as the
minimum distance between them (i.e., select all elements inside a radio of 50 km from a
“x” point). Figure 2.8 shows two examples of this type of relation; in the figure we are
representing how close and how far two objects are each other.
Figure 2.8. Distance Relations.
29
We propose to use the Euclidian distance which is defined as the straight line distance
between two points. In a plane with p1 at (x1, y1) and p2 at (x2, y2), the Euclidian distance is
( ) ( )221
221 yyxx −+− .
2.3.4 Direction Relations
The direction relation R between two spatial objects A and B, (A R B) is defined using one
representative point of object A and all the points of the destination object B. It is feasible
to define several possibilities of direction relations depending on the number of points that
are considered in the source and destination objects. The representative point of a source
object may be the center of the object or a point on its boundary. This representative point
is used as the origin of a virtual coordinate system and its quadrant defines the direction.
The direction relations between two objects are North_of, South_of, East_of, West_of,
Northeast_of, Northwest_of, Southeast_of, and Southwest_of. In Figure 2.9 we show some
examples of direction relations, for instance, object D is South of object C and East of
object A.
30
A
B
C
D
B North of A
A West of D
D East of A
C Northeast of A
D South of C
Figure 2.9. Direction Relations.
2.4 Conclusion
Data mining is a younger and promissory research field. Many of the data mining
approaches developed for knowledge discovery in relational databases were extended to the
spatial databases domain. In this chapter we presented several approaches such as
generalization, clustering, spatial association, and spatial classifications. We have described
three types of spatial relations that will be included in our model to represent as a unique
graph-based dataset spatial data, non-spatial data and spatial relations.
In the next chapter we will describe the model. First, we will talk about the graph-based
knowledge discovery, next we will detail our methodology, and finally we will present an
example showing its applicability and generated results.
31
Chapter 3
GRAPH-BASED REPRESENTATIONS
This chapter describes our graph-based model to represent together spatial data, non-spatial
data and spatial relations among the spatial objects. We propose to include three types of
spatial relations: topological, distance and direction. Section 3.1 presents some issues
related to the graph-based knowledge discovery. Our methodology is detailed in Section
3.2. In Section 3.3 we describe five graph-based representations based on our model.
Finally, an example of the applicability of the proposal is presented in Section 3.4.
3.1 Generalities
In Chapter 2 we presented several approaches developed to search knowledge in spatial
databases. The differences between these approaches are based on the data representation
and the data mining algorithm used in the search task. Certainly, the data representation
used by a mining tool is very important, and it has to be powerful enough to represent
domains containing complex relations among their components (i.e., spatial data domain).
32
A graph-based representation has these characteristics [3, 5, 17, 36]. It has the benefits of
being easy to understand and flexible enough to create different representations of the same
domain. The domain (i.e., data and relationships) is described using graphs. These graphs
become the input to a graph-based discovery tool which uses a heuristic to choose the
subgraphs that are considered important (patterns).
A graph is defined as a pair G = (V,E). V = {v1,…,vn} denotes a finite set of elements called
vertices. E is a set of edges e satisfying E ⊆ [V]2. Then, each edge e ∈ E is a pair (vi,vj). If
(vi,vj) is an ordered pair for any (vi,vj) ∈ E, then G = (V,E) is said to be a directed graph. A
labeled graph has labels associated with its edges and vertices.
In knowledge discovery systems using a graph-based approach, the data mining algorithm
uses graphs to represent knowledge; this means that the data preparation phase includes the
transformation of the data to a graph format. The search space of a graph-based data
mining algorithm consists of all the subgraphs that can be derived from its input graph.
In the literature there exist several definitions about what a spatial data is. However, before
presenting our spatial model definition, we precise the following issues:
• When we speak about a geometric object we refer to an object describing a form
(i.e., point, line, and polygon).
• A spatial object refers to an object represented by a coordinate system (i.e.,
Cartesian coordinates).
33
• A geographic object may be considered a specialization of a spatial object because
it is represented by a coordinate system but related to earthly coordinates
(sometimes called Geodetic or Geographic coordinates).
A model is a simplification of the reality. It is not the reality, rather it represents the reality.
A model is used to explain or to understand the reality. A model can be, for instance, an
equation, a hypothesis or a structured idea. A spatial model is therefore an abstraction of
spatial data that generates useful information to help us understand, describe, and predict
how things work and/or solve problems in the real world. When we work with geographic
coordinates we can talk about a geographic model.
3.2 Methodology
The idea is to create a graph-based model to represent together spatial data, non-spatial data
and the spatial relations between spatial objects. We will generate datasets composed of
graphs with a set of these three elements. We argue that by mining a dataset with these
characteristics a data mining algorithm can search patterns involving all these elements, at
the same time, improving the results of the spatial analysis process.
For example, finding out interesting patterns of objects located at some distance from a
particular point; focusing on a current problem such as finding risk zones near the
Popocatépetl volcano (México). In addition, it would be important to know the
characteristics of the evacuation routes which would be used in situations of volcanic
34
activity, i.e., what are the characteristics of the evacuation routes; could they withstand the
atmospheric conditions and the passage of vehicles in an emergency situation?
A significant characteristic of spatial data is that the attributes of the neighbors of an object
may have an influence on the object itself. So, we propose to include in the model the three
types of relationship mentioned in Chapter 2: topological, distance, and direction relations.
As we mentioned, the basis is that in a spatial database there exist spatial objects, these
objects interact with other objects (spatial relations) and they may have several attributes
describing them (non-spatial data). Thus, we propose to create graphs that help us to
describe these interconnections between all these elements.
A simple way can be as follows: for each object we create a vertex representing the object
itself and join two vertices with a directed labeled edge if they have a spatial relation in
common (we add one edge for each spatial relation among vertices). Additionally, we can
also create vertices representing the value of their non-spatial attributes and directed labeled
edges joining each attribute to its object (one edge per attribute).
But there is a higher complexity level when working with general graphs (a graph with
multiple edges between vertices) than with simple graphs (a graph with at most one edge
between any given pair of vertices and with no loops). Thus, our idea is to work with
simple graphs. So, we propose to improve this representation to the one shown in Figure
3.1 (using UML notation).
35
Vertex
1..*
1..*
Distance DirectionTopological
Spatial relation
Spatial object
Non-spatial attribute
-From:-To:
Edge
Figure 3.1. General graph-based model to represent spatial data.
In the model, the spatial data (i.e., spatial objects), non-spatial data (i.e., attributes), and
spatial relations are represented as a collection of one or more directed graphs. Therefore, a
directed graph contains a collection of vertices and edges representing all these elements.
Vertices represent either spatial objects, spatial relation types between two spatial objects
(binary relation), or non-spatial attributes describing the spatial objects. Edges represent a
link between two vertices of any type. According to the type of vertices that an edge joins,
an edge can represent either an attribute name or a spatial relation name. The attribute name
can refer to a spatial object or a non-spatial entity. We use directed edges to represent
directional information of relations among elements (i.e., object x covers object y) and to
describe attributes about objects (i.e., object x has attribute z).
36
This knowledge representation has the capability to describe a spatial dataset using graphs,
allowing a graph-based mining tool to mine it as a whole. The capabilities of the model to
represent the relation between these objects will be of great impact in the results of the data
mining processes, since the world is described by objects and the relation between these
objects, we can figure out the relations as the elements describing the interaction of the
objects with each other.
3.3 Spatial Graph-based Data Representations
In the construction of the graph there are issues such as the graph complexity and size that
have a direct impact over the data mining algorithm performance. The quality of results
refers to another important aspect. Testing several representations will allow us to produce
comparisons among obtained results, to evaluate them, and finally to make a decision for
selecting the one(s) which offers the better results according to our criteria for success. One
approach is to show that the model allows discovering known patterns. Another approach is
to have a domain expert saying that the discovered patterns are interesting. We have
worked with domain experts for developing these tasks.
Currently, we have developed five models to represent spatial data, non-spatial data and
spatial relationships among the spatial objects as a unique dataset from the general model.
Three issues define the characteristics (i.e., number of vertices and edges, simple graph,
etc.) of the graphs created from these models:
• Equivalent spatial relations.
37
• Symmetric spatial relations.
• The way to represent objects and their relations in the model.
In the following subsections we will explain how these three characteristics affect the
structure and composition of the graph. First, we will talk about the equivalent relations,
next the symmetric relations and finally the five created models.
Equivalent Spatial Relations
Suppose that our dataset is composed of an object A disjoint of an object B, this implies that
object B is disjoint of object A. In this case the two objects are disjoint each other (an
equivalent relation). When creating the graph, this relation can be represented by two
directed edges, one edge labeled as “DISJOINT” from object A to object B and vice versa.
However we can use the following principle:
2 directed edges, 1 edge e(vi,vj) and 1 edge e(vj,vi) equal to 1 undirected edge e = ij
By applying this principle we can replace the two directed edges by only one undirected
edge labeled as the equivalent relation without losing the representation of the spatial
relation among the objects.
The equivalent relations implemented in our research are the following:
• TOUCH
• DISJOINT
38
• OVERLAP
• EQUAL
• CLOSE
Symmetric Spatial Relations
Suppose that our dataset is composed of an object A South of an object B, when we create
the graph this relation is represented by a directed edge from object A to object B labeled as
“South_of”. But it implies the relation object B is North of object A (it is a symmetric
relation). This last relation in some models is not represented.
According to the model the representation of a symmetric relation implies one of the
following options:
• The addition of a directed edge labeled as the symmetric relation.
• The addition of a vertex labeled as the spatial relation (i.e., topological, direction)
and a directed edge labeled as the symmetric relation.
The symmetric relations implemented in our research are the following:
• CONTAINS INSIDE
• COVERS COVEREDBY
• NORTH_OF SOUTH_OF
• EAST_OF WEST_OF
• NORTHEAST_OF SOUTHWEST_OF
• SOUTHEAST_OF NORTHWEST_OF
39
Models
As we have commented, testing several representations will allow us to produce
comparisons among obtained results, to evaluate them, and finally to make a decision for
selecting the one(s) which offers the better results (in term of quality). The following five
models were developed to answer issues such as: in some models we obtain a reduction in
the number of vertices and edges, but do they give us the same representation of our data?
Are we gaining in the size reduction, but what are we loosing? There is a higher complexity
working with general graphs than with simple graphs, does it affect the generated results?
What about time processing?
In order to describe the characteristics of each model we will use the sample dataset shown
in Figure 3.2. Our dataset is composed of two spatial objects, object A representing a house
and object B representing a lake, and the following three spatial relations among them:
• Distance relation
o Object A close object B (equivalent relation).
• Topological relation
o Object A touch object B (equivalent relation).
• Direction relation
o Object A South of object B (symmetric relation).
40
North
South
EastWest
Object B
Object A
Figure 3.2. Sample dataset.
Additionally, to evaluate the characteristics of each model (the model is itself a graph) we
have developed the following nine evaluation metrics:
• Num. vertices. Total number of vertices in the graph.
• Num edges. Total number of edges in the graph.
• Size (vertices + edges). Total number of vertices plus total number of edges in the
graph.
• % increment. This item represents the percent of increment (in term of vertices
plus edges) of this graph with respect to the graph created by using the base model.
• Simple graph. To indicate if the graph is a simple one (graph with at most one edge
between any given pair of vertices and with no loops).
• Directed edge. To indicate if the graph has directed edges.
• Undirected edge. To indicate if the graph has undirected edges.
• Complete information. The item shows if the symmetric relations, created from
the original spatial relations, are also represented in the graph.
41
• Redundant “Relation” edge. We introduce this edge as strategy for avoiding
creating complex graphs. Remember that we consider simple graph a graph with at
most one edge between any given pair of vertices and with no loops. As we will see
in the description of those models, we use this special edge to join the vertices
representing spatial object with the vertices representing the spatial relation types
(i.e., topological, distance and direction). In models #3, #4 and #5 we add a
“Relation” edge for representing explicitly the fact that there is a relation between
two spatial objects. This metrics tells us if a redundant “Relation” edge is presented
in the graph.
Model #1 - base model
Figure 3.3 shows the first model created for representing spatial data as proposed. The
characteristics of the model according to the metrics are:
• Num. vertices: 2 vertices for representing each spatial object (i.e., object A, and
object B).
• Num. edges: 4 edges, 3 edges for representing the original spatial relations (i.e.,
“close”, “touch”, and “South_of” relations) and 1 edge for representing the
“North_of” relation created from the original “South_of” symmetric relation. The
“North_of” relation is itself a symmetric relation.
• Size (vertices + edges): 6
• % increment: 0%, it is the base model.
• Simple graph. No, it is a complex graphs with 4 edges linking 2 vertices.
42
• Directed edge. Yes, they are used for representing the “South_of” and “North_of”
symmetric relations. The direction of the edges is according to the lecture of the
relations among the objects.
• Undirected edge. Yes, they are used for representing the “close” and “touch”
equivalent relations.
• Complete information. Yes, in the graph is represented the “North_of” symmetric
relation created from the original “South_of” symmetric relation.
• Redundant “Relation” edge. No, in the model we do not use “Relation” edges.
Spatialobject A
Spatialobject B
South_of
touch
close
North_of
Figure 3.3. Model #1 - base model.
Model #2 - single replication of relation types, complete information
In Figure 3.4 we show our second model created for representing spatial data. The
characteristics of the model according to the metrics are:
• Num. vertices: 5 vertices, 2 vertices for representing the spatial objects and 3
vertices for representing the “topological”, “distance”, and “direction” spatial
relation types. For each spatial relation type among two spatial objects we add 1
vertex labeled as its name. In the example there exist 1 “topological” relation, 1
“distance” relation, and 1 “direction” relation.
43
• Num. edges: 6 edges, 3 edges for representing the original spatial relations, 2 edges
for representing the equivalent relations (i.e., “close” and “touch” relations) created
from the original ones, and 1 edge for representing the symmetric relation (i.e.,
“North_of” relation) created also from the original relations.
• Size (vertices + edges): 11
• % increment: +83.33%
• Simple graph. Yes, there exists at most 1 edge between any given pair of vertices.
• Directed edge. Yes, they are used for representing all relations. The direction of the
edges is from the vertices representing the spatial objects to the vertices
representing the spatial relation types.
• Undirected edge. No, in the model we do not use undirected edges.
• Complete information. Yes, we represent the symmetric relations created from the
original ones.
• Redundant “Relation” edge. No, in the model we do not use “Relation” edges.
South_of
touch
close close
touch
North_of
Distance
Direction
Spatialobject A Topological Spatial
object B
Figure 3.4. Model #2 - single replication of relation types, complete information.
44
Model #3 - double replication of relation types, no complete information
In Figure 3.5 we show our third model created for representing spatial data. The
characteristics of the model according to the metrics are:
• Num. vertices: 8 vertices, 2 vertices for representing the spatial objects and 6
vertices for representing the “topological”, “distance”, and “direction” spatial
relation types. For each spatial relation type among two spatial objects we add 2
vertices labeled as its name. For instance, in the example there exist three relations:
1 “topological” relation, 1 “distance” relation, and 1 “direction” relation, so we add
6 vertices, 2 per each spatial relation type.
• Num. edges: 9 edges, 6 edges (the “Relation” edges) to link the vertices
representing the spatial objects to the vertices representing the spatial relation types
(from each vertex representing a spatial object start 3 edges since we have 3
relations), and 3 edges for representing the original spatial relations. These last 3
edges are used to join the vertices representing the spatial relation types.
• Size (vertices + edges): 17
• % increment: +183.33%
• Simple graph. Yes, there exists at most 1 edge between any given pair of vertices.
• Directed edge. Yes, they are used for representing the symmetric relations and the
“Relation” edges. The direction of the “Relation” edges is from the vertices
representing the spatial objects to the vertices representing the spatial relation types.
The direction of the other edges is according to the lecture of the relations among
the objects.
• Undirected edge. Yes, they are used for representing the equivalent relations.
45
• Complete information. No, we do not represent the symmetric relations created
from the original ones.
• Redundant “Relation” edge. Yes, we use in the model “Relation” edges for
representing explicitly the existence and type of a relation among 2 spatial objects.
Additionally, we use this edge for avoiding creating complex graphs.
South_of
touch
close
Relation
Distance
Topological
Direction
Distance
Topological
Direction
Spatialobject A
Spatialobject B
Relation
R elation
Relation
RelationR
elation
Figure 3.5. Model #3 - double replication of relation types, no complete information.
Model #4 - single replication of relation types, no complete information
In Figure 3.6 we show our fourth model created for representing spatial data. The
characteristics of the model according to the metrics are:
• Num. vertices: 5 vertices, 2 vertices for representing the spatial objects and 3
vertices for representing the “topological”, “distance”, and “direction” spatial
relation types. For each spatial relation type among two spatial objects we add 1
vertex labeled as its name. In the example there exist 1 “topological” relation, 1
“distance” relation, and 1 “direction” relation.
46
• Num. edges: 6 edges, 3 edges (the “Relation” edges) to link a vertex representing a
spatial object (in the example we have 2 vertices since there are 2 spatial objects) to
the vertices representing the spatial relation types (from this vertex representing a
spatial object start 3 edges since we have 3 relation types), and 3 edges for
representing the original spatial relations. These last 3 edges are used to link the
vertices representing the spatial relation types to the un-used vertex representing the
other spatial object.
• Size (vertices + edges): 11
• % increment: +183.33%
• Simple graph. Yes, there exists at most 1 edge between any given pair of vertices.
• Directed edge. Yes, they are used for representing the symmetric relations and
“Relation” edges. The direction of the “Relation” edges is from a vertex
representing a spatial object to the vertices representing the spatial relation types.
The direction of the other edges is from the vertices representing the spatial relation
types to the un-used vertex representing the other spatial object (it is according to
the lecture of the relations among the objects).
• Undirected edge. Yes, they are used for representing the equivalent relations.
• Complete information. No, we do not represent the symmetric relations created
from the original ones.
• Redundant “Relation” edge. Yes, we use in the model “Relation” edges for
representing explicitly the existence and type of a relation among 2 spatial objects.
Additionally, we use this edge for avoiding creating complex graphs.
47
Distance
TopologicalSpatialobject A
Direction
Spatialobject BRelation
Relation South_of
touch
Relation close
Figure 3.6. Model #4 - single replication of relation types, no complete information.
Model #5 - double replication of relation types, complete information
In Figure 3.7 we show our fifth model created for representing spatial data. The
characteristics of the model according to the metrics are:
• Num. vertices: 8 vertices, 2 vertices for representing the spatial objects and 6
vertices for representing the “distance”, “topological”, and “direction” spatial
relation types. For each spatial relation type we add 2 vertices labeled as its name.
In the example we add 2 vertices for the “topological” relation, 2 vertices for the
“distance” relation, and 2 vertices for the “direction” relation.
• Num. edges: 12 edges, 6 edges (the “Relation” edges) to link the vertices
representing the spatial objects to the vertices representing the spatial relation types,
and 6 edges for representing the original spatial relations and those ones generated
from them (equivalent and symmetric relations). These last 6 edges are used to link
the vertices representing the spatial relation types to the vertices representing each
spatial object (3 edges for each spatial object).
• Size (vertices + edges): 20
48
• % increment: +233.33%
• Simple graph. Yes, there exists at most 1 edge between any given pair of vertices.
• Directed edge. Yes, they are used for representing all relations.
• Undirected edge. No, in the model we do not use undirected edges.
• Complete information. Yes, we represent the symmetric relations created from the
original ones.
• Redundant “Relation” edge. Yes, we use in the model “Relation” edges for
representing explicitly the existence and type of a relation among 2 objects.
Additionally, we use this edge for avoiding creating complex graphs.
Direction
Distance
Topological
Spatialobject A
Direction
Distance
Topological
Spatialobject B
Relation
RelationRelation
Relation
touchtouch
closecloseSouth_ofNorth_of Relation
Relation
Figure 3.7. Model #5 - double replication of relation types, complete information.
Table 3.1 presents the results of the nine metrics developed to evaluate the characteristics
of each model. Model #1 is named the base model.
49
Model Num.
Vertices
Num.
Edges
Size
(v + e)
%
Increment
Simple
Graph
Directed
Edge
Undirected
Edge
Complete
Information
Redundant
“Relation”
Edge
(1) (2) (3) (4) (5) (6) (7) (8) (9)
#1 2 4 6 - No Yes Yes Yes No
#2 5 6 11 +83.33 Yes Yes No Yes No
#3 8 9 17 +183.33 Yes Yes Yes No Yes
#4 5 6 11 +83.33 Yes Yes Yes No Yes
#5 8 12 20 +233.33 Yes Yes No Yes Yes
Table 3.1. Characteristics of the graph-based representation models.
The metrics were proposed based on the causes/effects each of them has both into the
created graph and the mining algorithm. We visualize four significant issues related directly
to these metrics:
1. Search space
The search space of a graph-based data mining algorithm consists of all the subgraphs that
can be derived from its input graph, thus, the number of vertices (1) and edges (2) of the
created graph (3) are two important issues defining the search space size1 for the discovery
system. Therefore, the objective must be to minimize the number of vertices and edges
used to create the graphs but at the same time to maximize the representativeness of the
dataset. As we can see in Table 3.1, the model using the minimum number of vertices and
1 The edge type (directed or undirected) and graph type (simple or complex) are issues that also define the
search space of a graph-based data mining algorithm. In Chapter 4 we describe the Subdue system, our graph-
based data mining tool, and we show the way Subdue deals with these issues.
50
edges to represent our sample dataset is model #1 (2 vertices and 4 edges) whereas model
#5 is the opposite case (8 vertices and 12 edges).
2. Processing time
The search space size plays a relevant role regarding to the processing time used to
discover patterns. If we have a large search space the algorithm would require more time to
evaluate all the possible subgraphs. Therefore, a comparison among the “percentage of
increment” (4) metric of the proposed models is presented in Table 3.1. Remember this
metric compares the size of a given model with respect to model #1 (base model). For
instance, model #5 has a graph size increment of 233.33% with respect to model #1. In
other words, the mining algorithm will require the evaluation of 233.33% more vertices
and/or edges by using model #5 instead of model #1 for the same dataset.
3. Graph complexity
Next chapter describes the Subdue system, our graph-based data mining tool. As we will
see there exists a higher complexity for the mining algorithm to work with complex graphs
than with simple ones (i.e., at most an edge among any given pair of vertices and with no
loops). For instance, in the graph match process, in the expanding phase (Subdue uses an
“expanding” approach to discover patterns), and in the graph compression stage. Therefore,
the objective was to propose graph-based models that allow us to create simple graphs. As
we can see in Table 3.1, only model #1 does not create simple graphs.
51
Thus, as a strategy to break the multiplicity of edges among two vertices (for instance,
vertices representing two spatial objects meeting two or more spatial relations among them)
we use the following approaches:
• To add a new vertex labeled as the relationship type (i.e., topological, distance, and
direction) for each spatial relation among the spatial objects. This approach is used
in model #2.
• To add a new vertex (model #4) or two new vertices (model #3 and model #5)
labeled as the relationship type (i.e., topological, distance, and direction) by each
spatial relation among the spatial objects, and to link this new vertex (model #4) or
new vertices (model #3 and model #5) with the vertices representing the spatial
objects by using edges labeled as “Relation”. The approach used to link the vertices
is different in each model as we have mentioned in the definition of each of them.
This nomenclature is used to represent the fact that there exists a spatial relation
among these spatial objects. These edges are known as redundant relation edges (9).
4. Data representativeness
The directed edges (6), undirected edges (7), and complete information (8) metrics are used
to maximize the representativeness of the dataset and minimizing, as most as possible, the
graph size and its complexity. Directed edges are used to represent the symmetric spatial
relations (object A North_of object B, implies, B South_of A), the redundant relation edges,
and the non-spatial attributes describing the spatial objects. Undirected edges are used to
represent equivalent spatial relations (the relation is represented by an undirected edge
52
instead of two directed edges). Finally, complete information means that symmetric spatial
relations among spatial objects are also represented into the model.
3.4 Use-case
To show the applicability of our model, we will use a dataset from the Popocatépetl
volcano database [38, 42]. The database contains data related to several issues in the zone
such as settlements, rivers, and evacuation roads in the zone, just to mention a few of them.
Figure 3.8 shows a fragment of the layers “roads” and “settlements” of the Popocatépetl
volcano zone. The roads layer (shown in green color) represents the roads in the zone; it is
composed of spatial objects (i.e., lines) and non-spatial data describing the characteristics
of those roads (i.e., id, start point, end point, length, and type). The settlements layer
(shown in pink color) represents population areas in the zone. The layer is composed of
spatial objects (i.e., polygons) and non-spatial data describing the characteristics of the
settlements (i.e., id, area, perimeter, and type).
53
Figure 3.8. Spatial database representing some objects of the world.
Suppose we are interested in the identification of relations among the roads starting in, or
crossing a settlement (i.e., the type and characteristics of roads in relation to the number of
people living in a settlement that in case of a volcanic contingency are required to be
evacuated). So, we need to create a dataset involving these elements for mining it and
search for patterns that could help us to evaluate the characteristics of the roads and, may
be, to make decisions for improving them (or build new ones) for their utilization in case of
volcano activity. In this case we are working with different types of spatial objects and also
we are adding non-spatial data and relationships (as touch or overlap) to our dataset.
The construction of the dataset needs to satisfy some constraints. For example, if we
include all the elements existing in the data layers we might build a huge graph, and this
will have a direct impact in the data mining algorithm. So, we explore methodologies to
deal with topics such as complexity, the size of the graph, noise and quality of the data. A
solution for the problem of creating a huge graph is to limit the set of elements to be
54
included in it by using selection windows. The user can create these windows and only the
elements inside them are candidates to be included as objects in the graph. The idea is
shown in Figure 3.9. Suppose the user will work with n data layers, thus, for each layer we
will select only the spatial objects inside the area delimited by the selection window. We
can see this functionality such as a drill operation over all the spatial layers the user works
with.
Figure 3.9. Selection window.
Once we have defined the working area, as shown in Figure 3.8, we can build the graph.
The process consists of three phases. In the first phase, the user has to choose the spatial
relation(s) to evaluate among the spatial objects (in the example the touch or overlap
topological relations). The second phase involves the validation process, only the objects
covered by the relation(s) become elements to be included in the graph.
55
Figure 3.10. Querying a spatial database.
The last phase consists of building the graph using the results of the validation process.
Figure 3.10 presents an example of this functionality. This time only the area delimited by a
selection window is shown. In the figure, the spatial objects satisfying either the touch or
overlap relations are shown in blue color, the rest of the objects are shown in their original
colors. The circles show some examples where a road is starting or crossing a settlement.
56
Figure 3.11. Graph-based representation for spatial data.
57
Figure 3.11 shows the graph created using model #2 (we used the Graphviz System [19] to
draw the graph). The vertices in the graph represent either spatial objects (i.e., road,
settlement), spatial relations between two objects (i.e., topological), or attributes describing
the objects (i.e., ID value). In this example, we use a special vertex labeled as “object” for
expressing the interconnection between the spatial objects, their spatial relations, and non-
spatial attributes (i.e., object type road with ID 989 overlaps or touches with object…).
According to the type of vertices that an edge joins, the edge can be labeled as either a
spatial relation name (i.e., overlap, touch), or as the name of an attribute describing the
characteristics of an object (i.e., type, ID).
We may read the graph as follows: there are six roads and six settlements inside the
working area meeting either a touch or overlap relation in the form road settlement. We
suppose that each polygon object in the map represents a settlement. Some roads start from
a settlement and others cross a settlement. For example, the road with ID 981 crosses the
settlements with ID’s 3079, 3139, 3151, 3083 and 3138. This means that we need to be
careful with this road since it has interaction with several settlements and in case of a
contingency it will be widely used. In the graph we included only the object's ID non-
spatial attribute for each object.
This is a simple example; in the real world the spatial data layers may have hundreds or
thousands of objects, and each object may have dozens of attributes describing it. Joining
all these elements will allow creating large graphs representing the elements found in a
58
spatial database improving the results of the data mining task. Once we have created the
graph, it will be used as input to a graph-based mining system.
3.5 Conclusion
Our idea is to propose a graph-based model to represent together spatial data, non-spatial
data and the spatial relations between spatial objects. Based on the model we generate
datasets composed of graphs with a set of these three elements. Our argumentation is that
by mining a dataset with these characteristics a data mining algorithm can search patterns
involving all these elements at the same time improving the results of the spatial analysis
process.
We have presented an example of the applicability of the proposal using data from a
Popocatépetl volcano database. The created graph will be the input for a graph-based
mining system. We propose to use the Subdue system, which will be described in the next
chapter.
59
Chapter 4
MINING THE GRAPH
This chapter describes the characteristics of the Subdue system [24, 45], our data mining
tool. Subdue was developed at the University of Texas at Arlington and it can be applied to
any domain that can be represented as a graph. In the first part of the chapter we present the
general characteristics and main functions. In the second part we describe the overlap
feature and its importance for the pattern discovery system, and introduce a new algorithm
named Limited Overlap.
4.1 Characteristics
The Subdue system discovers substructures using a graph-based representation of structural
databases. Structural data involves the concept of units or parts and the interdependence
and relationships of those parts. The substructures (a connected subgraph within the
graphical representation) describe concepts in the data (i.e., patterns).
60
The substructure discovery system represents structural data as a labeled graph. The input
to Subdue is a graph where labeled vertices correspond to objects in the data, and directed
or undirected labeled edges map relationships between objects.
The discovery algorithm follows a computationally constrained beam search. There are
three different evaluation methods to guide the search towards more appropriate
substructures: Minimum Description Length (MDL), Size-based, and Set Cover. The
default evaluation method is MDL.
The algorithm begins with the substructure matching a single vertex in the graph. During
each iteration, the algorithm selects the best substructure and incrementally expands the
instances of the substructure. An instance of a substructure in the input graph is a subgraph
that matches (graph theoretically) that substructure.
The algorithm searches for the best substructure until all possible substructures have been
considered or the total amount of computation exceeds a given limit. Evaluation of each
substructure is determined by how well the substructure compresses the description length
of the input graph.
There might be slight variations of some substructures that can be considered as instances
of another substructure. Subdue uses an inexact graph match algorithm to identify this kind
of instances. In this inexact match approach, each distortion of a graph is assigned a cost
and if the total cost is lower than a given threshold, the substructure is considered an
instance of the other substructure.
61
A distortion is described in terms of transformations such as the deletion, insertion, and
substitution of vertices and edges. The best substructure found by Subdue can be used to
compress the input graph, which can then be input to another iteration of Subdue. After
several iterations, Subdue builds a hierarchical description of the input data where later
substructures are defined in terms of substructures discovered on previous iterations.
Figures 4.1, 4.2, 4.3 and 4.4 show an example of the system’s functionality. The example is
presented in terms of the house domain, where a house is defined as a triangle on a square.
T represents a triangle, S a square, C a circle and R a rectangle (see Figure 4.1). The
objects in the figure (i.e., T1, S1, R1) become labeled vertices in the graph, and the
relationships (i.e., on(S1, R1), shape(C1, circle)) become labeled edges.
T1
S1
object
R1
on
onon
Input GraphInput Database
S3
T3
S4
T4
S2
T2
C1
on
on
rectangleshape
object
object triangle
on
shape
squareshapeobject
object triangle
on
shape
squareshapeobject
object triangle
on
shape
squareshape
object
object triangle
on
shape
squareshapeobject
circleshape
Figure 4.1. Graph representation of the house domain.
62
The graph representation of the substructure discovered by Subdue from this data is shown
in Figure 4.2 where Subdue found four instances of “triangle on square”.
T1
S1
object
objecttriangle
square
S2
S3 S4
T2
T3 T4
shape
shape
on
Substructure Instance 1
Instance 3
Instance 2
Instance 4
Figure 4.2. Substructure and instances discovered from the house domain by Subdue.
After a substructure is discovered, each instance of the substructure in the input graph is
replaced by a single vertex representing the entire substructure. In Figure 4.3 the discovered
substructure (object shape triangle on object shape square) is labeled as SUB_1
(SUBstructure number 1).
object
objecttriangle
square
shape
shape
onSUB_1
Figure 4.3. Substructure replacement procedure in the house domain.
Finally, the substructure (labeled as SUB_1) is used to compress the original input graph,
which can then be input to another iteration of Subdue (see Figure 4.4).
63
SUB_1
object
SUB_1
rectangle
SUB_1 SUB_1
object circle
shape
shape
on
on
on
onon
Figure 4.4. Graph representation of the house domain after substructure replacement.
The Subdue system’s ability to perform discovery has been proved in several domains
including scene analysis, chemical analysis and CAD circuit analysis. In [18] the authors
present an example of the Subdue capabilities to perform data mining tasks, they work with
an earthquake database and show that Subdue is capable of finding not only the shared
characteristics of the earthquake events, but also space relations between them. In the case
of the identification of shared characteristics, they used the pattern containing the region
number specification to recognize the area being studied; in the case of space relations, they
found patterns that represent parts of the paths of the involved fault (i.e., subarea with a
high concentration of earthquakes).
4.1.1 Main Functions
In this subsection we present a briefly description of the main functions composing the core
of Subdue. Each of them is itself integrated by several subfunctions, but the idea is to
present them in a global perspective.
64
Compress
The compress function returns a new graph, which is the given graph compressed with the
given substructure instances. Vertices SUB replace each instance of the substructure, and
OVERLAP edges are added between vertices SUB of overlapping instances. Edges
connecting to overlapping vertices are duplicated, one per each instance involved in the
overlap.
Discover
This function plays the role of manager in the phase of discovering the best substructures in
an input graph. It is in charge of issues such as getting initial substructures, extending each
substructure, evaluating each extension, and adding to a final list the best discovered
substructures.
Evaluate
This function implements the different evaluation methods used to guide the search towards
more appropriate substructures: Minimum Description Length (MDL), Size-based, and Set
Cover. The value of a substructure s in a graph g is computed as:
size(g) / (size(s) + size(g|s))
The value of size() depends on whether we are using the MDL or Size-based evaluation
method. If MDL is used, then size(g) computes the description length in bits of g. In case
of Size-based, then size(g) is simply vertices(g) + edges(g). The size(g|s) is the size of
graph g after compressing it with substructure s. Compression involves replacing each
65
instance of s in g with a new single vertex and reconnecting edges external to the instance
to point to the new vertex.
Extend
This function returns a list of substructures representing extensions to the given
substructure. Extensions are constructed by adding an edge (or edge and a new vertex) to
each positive instance of the given substructure in all possible ways according to the graph.
Matching extended instances are collected into new extended substructures, and all such
extended substructures are returned.
Graphmatch
The Graphmatch function returns true if graph_1 and graph_2 match with cost less than the
given threshold. If so, the objective is to store the match cost in the variable pointed to by
matchCost and to store the mapping between graph_1 and graph_2 in the given array is
non null.
Subgraph Isomorphism
Set of functions implementing a subgraph isomorphism algorithm. The objective is to find
predefined substructures: searches for subgraphs of graph_2 that match graph_1 and
returns the list of such subgraphs as instances of graph_2. Returns an empty list if no
matches exist. This procedure mimics the “discover substructures” loop by repeatedly
expanding instances of subgraphs of graph_1 in graph_2 until matches are found.
66
4.2 Overlap
Now, we will describe the overlap feature in the Subdue system. Our goal is to present the
current implementation of this feature in Subdue and then, in the next section, to describe
our new proposal named limited overlap. As we will see, Subdue reports in the results all
the overlapping substructures that it can find. Sometimes this is very expensive in time and
in the number of substructures that the end user has to analyze. Therefore, the idea is that
Subdue can find overlapping substructures only over a set of predefined vertices that are
chosen by the user (the limited overlap). The overlap has a preponderant role in the
substructure discovery system; it is controlled by the overlap user’s parameter. If overlap is
false then overlap among instances is not allowed, otherwise, overlap is allowed.
To illustrate the cause/effect of the overlap in Subdue, we will present some examples
generated by using two standalone functions belonging to Subdue: the Subgraph
Isomorphism and Minimum Description Length functions. The generated results by these
functions are affected by the value of the overlap parameter.
Subgraph Isomorphism (SGISO)
As we have already mentioned, the subgraph isomorphism function searches for subgraphs
of graph_2 that match graph_1 and returns the list of such subgraphs as instances of
graph_2. The subgraph isomorphism function is embodied in the “FindInstances” function.
The procedure is optimized toward graph_1 being a small graph, and graph_2 being a large
graph.
67
The FindInstances function starts finding a list of single-vertex instances, one for each
vertex in graph_2 that matches the first vertex of graph_1. Next, the algorithm attempts to
extend each instance in the instance list by an edge (or edge and new vertex) from graph_2
that matches the attributes of the given edge in graph_1. Finally, the instances not matching
graph_1 are filtered, and an overlapping instances validation process is performed. If
overlap is false overlapping instances are discarded. The resulting list represents the
instances of graph_1 in graph_2.
For illustrative purpose suppose we have as input graphs those shown in Figure 4.5 and
Figure 4.6. Input graph_1 has 3 vertices and 2 edges, and graph_2 has 8 vertices and 7
edges.
A
B C
a b
2 3
1
F
CBE
DCA
B
a
g
c
b f
b
a
4 2 3
6 7
5
8 1
Figure 4.5. SGISO - input graph_1. Figure 4.6. SGISO - input graph_2.
Running SGISO with the overlap parameter set to false, it finds 1 instance of graph_1 (see
Figure 4.7) in graph_2. Figure 4.8 shows the discovered instance in graph_2.
68
A
B C
a b
2 6
1
Figure 4.7. SGISO - no overlap. Figure 4.8. SGISO - no overlap, 1 instance in graph_2.
Now, running SGISO with the overlap parameter set to true, the algorithm finds the 4
instances shown in Figure 4.9. We can see that each instance has the vertices labeled as A,
B, and C but they represent different vertices (they have different ID vertices). For
example, the first instance has the ID vertices 1, 5 and 3; the second instance has the ID
vertices 1, 5 and 6 respectively, and so on. Figure 4.10 shows the 4 discovered instances in
graph_2.
A
B C
a b
5 3
1 A
B C
a b
5 6
1
A
B C
a b
2 3
1 A
B C
a b
2 6
1
Instance 1 Instance 2
Instance 3 Instance 4
Figure 4.9. SGISO - overlap. Figure 4.10. SGISO - overlap, 4 instances in graph_2.
69
Minimum Description Length (MDL)
The MDL principle states that the best theory to describe a set of data is that theory which
minimizes the description length of the entire dataset. We define the MDL of a graph to be
the number of bits necessary to completely describe the graph. The minimal encoding of
the graph in terms of bits is computed as follows:
edgeBitsrowBitsvertexBitsMDL ++=
Where vertexBits represents the number of bits needed to encode the vertex labels of the
graph, rowBits represents the number of bits needed to encode the rows of the adjacency
matrix A (the matrix represents the graph connectivity), and edgeBits represents the number
of bits needed to encode the edges represented by the entries A[i,j] = 1 of the adjacency
matrix A.
The standalone MDL function computes the description length (dL) of graph_1, graph_2
and graph_2 compressed with graph_1 along with the final MDL compression measure:
)1_|2_()1_(/)2_( graphgraphdLgraphdLgraphdL +
dL(graph_2|graph_1) represents the value of graph_2 compressed with graph_1. The
overlap feature has a direct effect to compute this value because the number of instances for
compressing graph_2 is based in the instances list returned by the FindInstances function.
As we have commented, the Subdue system implements a compress graphs function named
CompressGraph. The inputs of the function are the graph to be compressed, and the
70
substructure instances used to compress the graph. The function returns a new graph, which
is the given graph compressed with the given substructure instances.
As part of the graph compressing phase, SUB vertices replace each instance of the
substructure, and OVERLAP edges are added between SUB vertices of overlapping
instances. Edges connecting to overlapping vertices are duplicated, one per each instance
involved in the overlap. The procedure to add OVERLAP edges evaluates two cases: first, if
two instances overlap at all, then an undirected OVERLAP edge is added between them.
Second, if an external edge points to a vertex shared between multiple instances, then
duplicate edges are added to all instances sharing the vertex. The procedure to add
duplicate edges is based in the follows pseudocode:
if edge connects SUB1 to external vertex then add duplicate edge connecting external vertex to SUB2 else if edge connects SUB1 to another (or same) vertex in SUB1 then add duplicate edge connecting SUB1 to SUB2 if other vertex is unmarked (overlapping and already processed) then add duplicate edge connecting SUB2 to SUB2 if edge connects SUB1 to the same vertex in SUB1 (self edge) then add duplicate edge connecting SUB2 to SUB2 if edge directed then add duplicate edge from SUB2 to SUB1
To show the role plays by overlap in the standalone MDL function (focusing in the
generated compressed graph) we will present, in the following figures, examples of the
different cases of validation that are implemented to face this feature. Figure 4.11 shows the
input graph_1, this graph will be used in all the examples as the first input graph. It has 3
vertices and 2 edges.
71
A
B C
a b
2 3
1
Figure 4.11. MDL - input graph_1.
In our first example we will use as graph_2 the graph shown in Figure 4.12. This graph has
8 vertices and 7 edges. As we can see our graph has a “remarked” edge, the directed edge
labeled as g between vertex 1 (labeled as A) and vertex 8 (labeled as F). This edge
represents the current case of validation (in the future it will be named the guide edge).
As part of the CompressGraph algorithm, there is a function named AddOverlapEdges that
adds edges to the compressed graph, describing overlapping instances of the substructure in
the given graph, based in two conditions. First, if two instances overlap, then an undirected
edge OVERLAP is added between them. Second, if an external edge points to a vertex
shared between multiple instances, then duplicate edges are added to all instances sharing
the vertex. If the second condition is true, the AddOverlapEdges function calls a new
function named AddDuplicateEdges. The purpose of this function is to add duplicate edges
based on overlapping vertex between substructures.
As we can see, the layout of the edge (that we call the guide edge) is important for the
global functionality of the compress graph algorithm. So, in the following examples we will
change the guide edge for illustrating the different validation cases.
72
Returning to our example, if we run the MDL program with the overlap parameter set to
false, our final compressed graph is shown in Figure 4.13. Since overlap is not allowed
between instances, the FindInstances function finds 1 instance of graph_1 that match
graph_2. In the graph this instance was replaced by a vertex SUB, so we have only 1 vertex
of this type. There are not edges OVERLAP, and no edges were duplicated.
F
CBE
DCA
B
a
g
c
b f
b
a
4 2 3
6 7
5
8 1
Figure 4.12. MDL example 1 - input graph_2. Figure 4.13. MDL example 1 - no overlap.
In the next example our graph_2 is shown in Figure 4.14. It is the same graph used in the
previous example, but this time we run MDL with the overlap parameter set to true. The
generated compressed graph is shown in Figure 4.15. The FindInstances function finds 4
instances since the overlap between instances is allowed. In the graph the 4 instances were
replaced by 4 vertices SUB. There are 5 edges OVERLAP, an edge OVERLAP between two
vertices SUB tells us that these substructures overlap. By “substructures overlap or
overlapping substructures” we refer that the instances they represent have common vertices.
In the example, there is a directed edge (edge g, our guide edge) connecting two vertices,
one of them belonging to a substructure (in exact term, to an instance of a substructure) and
the other one no belonging to the substructure (vertex F), is an external edge; the
73
discovered instances were 4 (overlap is allowed), so, in the graph there are 4 directed edges
starting from each vertex SUB (the direction of the edge is preserved) to the vertex F.
A similar procedure was implemented for edge f (connecting vertex C and vertex D,
vertices number 6 and 7 respectively) and edge c (connecting vertex B and vertex E,
vertices number 2 and 4, respectively). They were duplicated since these edges connect
vertices belonging to a substructure to vertices no belonging to the substructure (external
edges). In the figure we can see that vertex D has 2 edges f connecting it to 2 vertices SUB
(preserving the direction of the edge) and vertex E has 2 edges c connecting it to 2 vertices
SUB (remember that 4 instances were discovered, overlap is allowed).
Finally, our compressed graph has 7 vertices: 4 vertices SUB, 1 vertex F, 1 vertex D, and 1
vertex E; and 13 edges: 5 edges OVERLAP, 4 edges labeled as g, 2 edges labeled as f, and 2
edges labeled as c.
F
CBE
DCA
B
a
g
c
b f
b
a
4 2 3
6 7
5
8 1
Figure 4.14. MDL example 2 - input graph_2. Figure 4.15. MDL example 2 - overlap.
For the following examples the overlap parameter is always set to true.
74
Figure 4.16 shows our new graph_2, we only changed the direction of the guide edge. Now
it starts from vertex F (number 8) to vertex A (number 1). The generated compressed graph
is shown in figure 4.17. In the compressed graph we observe that, this time, the 4 edges g
start from vertex F to the 4 vertices SUB (the direction of the edge is preserved).
Everything else remains as the previous example.
F
CBE
DCA
B
a
g
c
b f
b
a
4 2 3
6 7
5
8 1
Figure 4.16. MDL example 3 - input graph_2. Figure 4.17. MDL example 3 - overlap.
For the next example, we only change from a directed guide edge to an undirected one. Our
graph_2 is the graph shown in Figure 4.18. By running MDL we generate the compressed
graph presented in Figure 4.19. The only modification is that the edges between vertex F
and the 4 vertices SUB are undirected edges. Everything else remains unchanged.
75
F
CBE
DCA
B
a
g
c
b f
b
a
4 2 3
6 7
5
8 1
Figure 4.18. MDL example 4 - input
graph_2.
Figure 4.19. MDL example 4 - overlap.
For illustrating a new outcome of the overlap feature in the standalone MDL function, we
change our graph_2 as follows:
• Vertex number 8 labeled as F is deleted.
• Our guide edge labeled as g is deleted.
• A new edge labeled as g (our guide edge) is added between vertex number 1 labeled
as A and vertex number 3 labeled as C. For the current example we use a directed
edge.
Our graph_2 is shown in Figure 4.20. The graph has 7 vertices and 7 edges. The “new”
edge has the characteristic that it is an edge between two vertices belonging to the same
substructure and in some cases, as we will see, they belong to overlapping substructures.
Remember that overlap is allowed between instances, so the FindInstances function finds 4
instances.
76
The generated compressed graph is shown in Figure 4.21. We mentioned that our guide
edge joins 2 vertices belonging to a same substructure (vertex number 1 labeled as A and
vertex 3 labeled as C, and in some cases these vertices belong to overlapping substructures.
So, in the graph there are 6 edges labeled as g, 4 edges joining vertices SUB (telling us that
they overlap), and two self edges in 2 vertices SUB. These last 2 edges are our new case of
validation. The self edge tells us that inside the substructure, there is an edge joining two
vertices belonging to the same substructure. The direction of the edges does not matter
since this is an edge inside the substructure (an internal edge). The other duplicated edges
(i.e., edge c and edge f) were computed using the previous described procedure.
Finally, our compressed graph has 6 vertices: 4 vertices SUB, 1 vertex D, and 1 vertex E;
and 15 edges: 5 edges OVERLAP, 6 edges labeled as g, 2 edges labeled as f, and 2 edges
labeled as c.
CBE
DCA
B
a
c
b f
b
a
g
4 2 3
6 7
5
1
Figure 4.20. MDL example 5 - input
graph_2.
Figure 4.21. MDL example 5 - overlap.
77
For the example shown in Figure 4.22, we only changed the direction of the guide edge.
Now it starts from vertex number 3 labeled as C to vertex number 1 labeled as A. The
generated compressed graph is presented in Figure 4.23. The generated compressed graph
is the same one that in the previous example. This is telling us that just changing the
direction of the guide edge does not have effect in the resulting compressed graph.
CBE
DCA
B
a
c
b f
b
a
g
4 2 3
6 7
5
1
Figure 4.22. MDL example 6 - input graph_2. Figure 4.23. MDL example 6 - overlap.
For the next example, we only change from a directed guide edge to an undirected one. Our
graph_2 is the graph shown in Figure 4.24. The generated compressed graph is presented in
Figure 4.25. The single change is that all edges g are now undirected edges. Everything else
remains unchanged.
78
CBE
DCA
B
a
c
b f
b
a
g
4 2 3
6 7
5
1
Figure 4.24. MDL example 7 - input graph_2. Figure 4.25. MDL example 7 - overlap.
Now, for illustrating a new outcome of the overlap feature in the standalone MDL function
(self edge in a vertex shared by instances of a substructure), we change our graph_2 as
follows:
• Edge labeled as g between vertex number 1 labeled as A and vertex 3 labeled as C is
deleted.
• A new edge labeled as g (our guide edge) is added between vertex number 1 labeled
as A and vertex number 1 labeled as A (a self edge). For the current example a
directed edge.
Our graph_2 is shown in Figure 4.26. The graph has 7 vertices and 7 edges. The “new”
edge has the characteristic that is a self edge; it starts and ends in the same vertex. This
vertex belongs to overlapping substructures, so we have 4 substructures sharing it.
The resulting compressed graph is shown in Figure 4.27. Since our guide edge is a self
edge the compressed graph has 10 edges labeled as g. There are 6 edges joining vertices
79
SUB (telling us that they are overlapping substructures), and 4 self edges in 4 vertices SUB
telling us that they have an edge which origin and destination vertices are the same.
We note in the compressed graph a special characteristic between some vertices SUB. We
refer that some pairs of vertices SUB are joined by 2 edges labeled as g. One of them starts
in the first vertex SUB and ends in the second one. The second edge has a vice versa
direction. This is because they are overlapping substructures with a common vertex and this
vertex is the source and destination of an internal edge.
Finally, our compressed graph has 6 vertices: 4 vertices SUB, 1 vertex D, and 1 vertex E;
and 19 edges: 5 edges OVERLAP, 10 edges labeled as g, 2 edges labeled as f, and 2 edges
labeled as c.
CBE
DCA
B
a
c
b f
b
ag
4 2 3
6 7
5
1
Figure 4.26. MDL example 8 - input
graph_2.
Figure 4.27. MDL example 8 - overlap.
For the last example, we use as graph_2 the one shown in Figure 4.28. We only change our
guide edge from a directed edge to an undirected one. Figure 4.29 shows the generated
compressed graph. There are 2 differences between this compressed graph and the previous
80
one. First, the edges labeled as g are undirected edges. Second, the special characteristic for
some vertices SUB is implemented as follows: Currently, we note that those vertices SUB
joined by 2 edges labeled as g in the previous example, now, they are joined by just 1
undirected edge labeled as g. Since the guide edge is undirected we only need 1 edge for
representing them because they are overlapping substructures with a common vertex and
this vertex is the source and destination of an internal undirected edge. The 4 vertices SUB
have self edges, but they are undirected ones this time. Everything else remains unchanged.
Finally, our compressed graph has 6 vertices: 4 vertices SUB, 1 vertex D, and 1 vertex E;
and 16 edges: 5 edges OVERLAP, 7 edges labeled as g, 2 edges labeled as f, and 2 edges
labeled as c.
CBE
DCA
B
a
c
b f
b
ag
4 2 3
6 7
5
1
Figure 4.28. MDL example 9 - input graph_2. Figure 4.29. MDL example 9 - overlap.
81
4.3 Limited Overlap
As we have described in the previous section, the overlap feature plays a preponderant role
in the Subdue’s substructures discovery system. As a consequence, the generated results are
also conditioned, in a high percentage, to this parameter. However, as we have seen, it is
implemented to allow overlap among any instances of a substructure or among all the
instances of a substructure. In this context, we propose a new approach named limited
overlap. The major feature in this approach is to give the user the means to specify the set
of vertices where an overlap is allowed. These vertices may represent significant elements
in the context we work with.
In the following subsections we will present the motivation, advantages, new algorithm,
and an example of the new overlap feature implemented in the Subdue system.
Motivation
As we have already commented, the current overlap feature in Subdue is implemented in an
orthodox way: all or nothing. It means that Subdue allows the overlap among all the
instances sharing at least one vertex or that Subdue does not allow (discard) the overlap
among instances sharing at least one vertex.
But we argue that a third option is needed, an option where the user has the capability to set
over on which vertices the overlap will be allowed, it is a limited overlap. We visualize
directly three motivations issues to propose the implementation of the new algorithm that
will be explained in the following subsections:
82
• Search space reduction.
• Processing time reduction.
• Specialized overlapping pattern oriented search.
In order to help us to describe the characteristics of the limited overlap we will use the
graphs show in Figure 4.30 and 4.31 as input graphs in the future examples. Input graph
PS_1 (Predefined Substructure number 1) has 2 vertices and 1 edge, input graph PS_2
(Predefined Substructure number 2) has also 2 vertices and 1 edge, and finally graph_3 has
9 vertices and 8 edges.
A
B
a
2
1 C
D
f
2
1
PS_1 PS_2
B
CBE
DCA
B
a
g
c
b f
b
a
4 2 3
67
5
8 1
Df
9
Figure 4.30. Limited overlap - input
graphs PS_1 and PS_2.
Figure 4.31. Limited overlap - input graph_3.
1) Search space reduction. In knowledge discovery systems using a graph-based
approach, the data mining algorithm uses graphs as a knowledge representation; the search
space of a graph-based data mining algorithm consists of all the subgraphs that can be
derived from its input graph.
The substructures discovery process in Subdue begins with the creation of the substructures
matching a single vertex in the graph (one for each of the different labels in the graph).
83
Each iteration through the algorithm selects the best substructures and expands the
instances of these substructures by one neighboring edge (or an edge and new vertex) in all
possible ways.
But as part of the process to select the best substructures and then to expand them there
exists a filter phase. In this phase according to the overlap parameter the instances of a
substructure are evaluated: if overlap is set to true then overlapping instances are kept,
otherwise if overlap is set to false then overlapping instances are discarded.
For example, suppose we want to search the instances of PS_1 and PS_2 in graph_3. With
the overlap set to false Subdue discovers 2 instances, one instance for each predefined
substructure as we can see en Figure 4.32. PS_SUB_X is the nomenclature used by Subdue
to identify it is an instance of Predefined Substructure SUB_X; where SUB_X means it is
the SUBstructure number X. But, if overlap is set to true, Subdue discovers 4 instances, 2
instances for each predefined substructure (see Figure 4.33). Finally, suppose the user
wants to search instances of PS_1 and PS_2 in graph_3 but this time he considers that
vertices A have a higher relevance (for example, a remarkable spatial object in a spatial
database) so he proposed to use a limited overlap, the overlap will be allowed just among
instances containing vertices A. We can see that PS_1 has a vertex A, thus this time Subdue
finds 3 instances, 2 instances of PS_1 and 1 instance of PS_2 as shown in Figure 4.34. In
the figure we can observe that instance 1 and instance 2 share the vertex number 1 labeled
as A. For PS_2 Subdue finds 1 instance since the other one (instance number 4 in Figure
4.33) has also the vertex number 6 labeled as C, but the overlap is just allowed among
vertices A.
84
C
D
a
9
6
PS_SUB_2
Instance 2
A
B
a
5
1
PS_SUB_1
Instance 1
Figure 4.32. No overlap - SGISO.
C
D
f
7
6 C
D
f
9
6
PS_SUB_2 PS_SUB_2
A
B
a
2
1 A
B
a
5
1
PS_SUB_1 PS_SUB_1
Instance 1 Instance 2 Instance 3 Instance 4
Figure 4.33. Overlap - SGISO.
A
B
a
2
1 A
B
a
5
1
PS_SUB_1 PS_SUB_1
Instance 1 Instance 2
C
D
f
9
6
PS_SUB_2
Instance 3
Figure 4.34. Limited overlap PS_1 - SGISO.
We have mentioned that the best discovered substructure by Subdue (by iteration) can be
used to compress the input graph, which can then be input to another iteration. After several
iterations, Subdue builds a hierarchical description of the input data where later
substructures may be defined in terms of substructures discovered on previous iterations.
We have also commented that each iteration through the algorithm selects the best
substructures and expands the instances of these substructures by one neighboring edge (or
an edge and new vertex) in all possible ways. So the number of instances of the
substructures defines the search space (by iteration) in the substructure discovery process.
As we can see in our example, by using the limited overlap we obtain a search space
reduction (with overlap set to true), since the number of instances becoming candidates to
be expanded is selected according to the allowed values given by the user.
85
2) Processing time reduction. The reduction in the number of instances becoming
candidates to be expanded results in a search space reduction, and this effect also has a new
outcome, a processing time reduction.
Allowing overlap slows Subdue considerably since the number of candidate instances to
expand, to evaluate, to match, and to discover increase as we have seen. However, by the
implementation of the limited overlap, the number of instances to be processed in these
phases decrease resulting also in a processing time reduction in the overall substructure
discovery process.
3) Specialized overlapping pattern oriented search. We have also commented that the
limited overlap gives the user the capabilities to define the set of interesting elements over
which the overlap will be allowed (the elements are represented as vertices in the graph
according to the proposed model) so the algorithm will discard the overlapping elements
that the user does not considerer significant.
In the example presented in Figure 4.34, the judgment of the user is that vertices A have a
higher relevance so he proposes to use a limited overlap, the overlap will be allowed just
among instances containing vertices A. This consideration is a personal decision of the user
according to the work context (in many cases with the support of a domain expert). For
instance, a remarkable element may refer to a spatial object in a spatial database or to some
characteristic defining a particular topic of a dataset.
86
Therefore, the limited overlap gives the user the mechanics to implement a pattern oriented
search. The user delimits the set of elements that will have a preponderant role in the
substructure discovery process. But this characteristic gives also a new advantage, the
patterns evaluation process is simplified since the set of generated results is smaller because
it is focused over the user requirements.
Algorithm
As we have described, the limited overlap gives the user the capabilities to define the set of
elements (vertices in the graph) where overlap among instances is allowed. The process
starts reading the set of vertices which are integrated to the Subdue’s parameters as the
limited overlap label list. During the substructure discovery process a filter phase is
performed. This phase evaluates (and possible discards) based on the overlap parameter the
list of discovered instances.
If the substructure discovered process is composed by several iterations, after each of them,
the overlap label list is updated to integrate the new vertices where overlap is also allowed.
Remember that at the end of each iteration, the best discovered substructure by Subdue is
used to compress the input graph, which can then be input to another iteration of Subdue.
After several iterations, Subdue builds a hierarchical description of the input data where
later substructures are defined in terms of substructures discovered on previous iterations.
Therefore, if the best discovered substructure, in each iteration, has a vertex where overlap
is allowed then that substructure is added to the limited overlap label list.
87
Limited Overlap (instance1, instance2) //Global process while ((elementsInstance1 < totalElementsInstance1) AND NOT endProcess){ //Instance1’s vertex vertex1 = vertex from instance1 //Instance2’s vertex while (elementsInstance2 < totalElementsInstance2) vertex2 = vertex from instance2 //Instances overlap if (vertex1 = vertex2) then{ intancesOverlap = true endProcess = true } //Instances overlap, check by limited overlap if (instancesOverlap AND overlapLabelList NOT EMPTY) then{ if (vertex1 in overlapLabelList) then //It is a limited overlap limitedOverlap = true else //Process continues until all vertex1 are validated endProcess = false } } return instancesOverlap, limitedOverlap
Figure 4.35. Limited overlap in the Subdue system.
In the validation process each vertex belonging to instance1 (vertex1) is validated against
all vertices belonging to instance2 (vertex2). If vertex1 and vertex2 are the same then they
are overlapping instances. If the instances overlap then vertex1 is validated against the list
containing the set of vertices allowed for overlapping (the overlapLabelList). If vertex1
exists in overlapLabelList then it is a limited overlap.
Example
To illustrate the functionality of the new algorithm, we will present some examples using
the new version of Subdue implementing the limited overlap feature. The input graphs for
88
the examples are those shown in Figure 4.30 and Figure 4.31; the idea is to perform a
pattern oriented search by discovering patterns in graph_3 but based on PS_1 and PS_2
(our predefined patterns).
Subdue implements a pattern oriented search by, initially, finding all instances of PS_1 and
PS_2 in graph_3. The next step is to compress graph_3 using the found instances of each
PS_1 and PS_2. Figure 4.36 shows the compressed graph with the overlap parameter set to
false. As we can see, Subdue found 1 instance of PS_1 (i.e., vertex PS_SUB_1) and 1
instance of PS_2 (i.e., vertex PS_SUB_2) since overlap among instances is not allowed (see
also Figure 4.32).
Figure 4.36. No overlap - compressed graph.
Finally, the compressed graph becomes the input graph to discover substructures. Figure
4.37 shows the best 3 substructures (default number of reported substructures) found by
Subdue according to its substructure discovery system (see Section 4.1 for details). Each
substructure has 1 instance; it means that there exists 1 repetition of each of them in the
input graph (in the example we use exact graph match, but Subdue allows also inexact
graph match). The first one (labeled as “a”), is composed by 2 vertices: 1 vertex B, and 1
89
vertex E; and 1 edge labeled as c. Subdue reports this substructure as the best discovered
substructure because it is the best that minimize the description of the input graph
according to the MDL principle. The second one (labeled as “b”) is composed by 4 vertices:
1 vertex PS_SUB_1, 1 vertex C, and 2 vertices B; and 3 edges labeled as a, b. g. The vertex
PS_SUB_1 is itself a substructure composed by two vertices: 1 vertex A, and 1 vertex B;
and 1 edge labeled as a. We have already commented that later substructures may be
defined in terms of previous discovered substructures, or in term of predefined
substructures as in this example. Finally, the third one (labeled as “c”) is composed by 4
vertices: 1 vertex PS_SUB_1, 1 vertex PS_SUB_2, 1 vertex C, and 1 vertex B; and 3 edges:
2 edges labeled as b, and 1 edge labeled as g. Once again vertices PS_SUB_1 and
PS_SUB_2 are themselves substructures.
90
a) b)
c)
Figure 4.37. No overlap - discovered substructures.
For the next example we set overlap to true. The generated compressed graph is shown in
Figure 4.38. Now, Subdue finds 2 instances for each PS_1 and PS_2 (see Figure 4.33 for
details).
91
Figure 4.38. Overlap - compressed Graph.
Figure 4.39 shows the best 3 substructures discovered by Subdue. The first and second ones
(labeled as “a” and “b”) are themselves the predefined substructure PS_SUB_2 and
PS_SUB_1 respectively with 2 instances each of them. The third one (labeled as “c”) is a
substructure composed by 2 vertices and 1 edge with 1 instance.
92
a) b)
c)
Figure 4.39. Overlap - discovered substructures.
Our next example is implemented by using the limited overlap feature. Suppose the user
wants to perform a specialized overlapping pattern oriented search: he only wants to allow
overlapping instances in vertices representing PS_1. The generated compressed graph is
shown in Figure 4.40. As a result of the restriction for overlapping instances Subdue finds 2
instances of PS_1 (they share the allowed vertex A) and just 1 instance of PS_2 (since they
share a not allowed vertex C). The found instances are shown in Figure 4.34.
93
Figure 4.40. Limited overlap PS_1 - compressed graph.
Figure 4.41 shows the best 3 substructures discovered by Subdue from this graph using the
limited overlap. In the figure we can see that Subdue reports as the best substructure
PS_SUB_1 with 2 instances (labeled as “a”). This is consequence of the integration of the
compressed graph since it has 2 vertices PS_SUB_1. The second reported substructure
(labeled ad “b”) is composed by 2 vertices: 1 vertex PS_SUB_1, and 1 vertex E; and 1 edge
labeled as c. The last one (labeled as “c”) is composed also by two vertices PS_SUB_1; and
1 edge labeled as PS_OVERLAP_1.
94
a) b)
c)
Figure 4.41. Limited overlap - discovered substructures.
4.4 Conclusion
In this chapter we have described the characteristics and functionality of our graph-based
data mining tool, the Subdue system. We introduced a new algorithm named limited
overlap. We presented some examples showing the functionally of the new algorithm using
an artificial dataset. Examples using real world data will be presented in Chapter 6. These
95
examples are developed by using two test domains: a Puebla downtown population census
from the year of 1777 and a Popocatépetl volcano database.
In the next chapter we introduce a prototype system implementing the proposed model for
representing spatial data, non-spatial data and spatial relations among the spatial objects as
a whole dataset using a graph-based representation.
96
Chapter 5
PROTOTYPE
A natural skill of people is the interpretation of visual data; therefore, this is a fact we must
consider to get advantage during the data mining process. In [30] the authors say that a
future direction for the KDD research field is the design and use of user interfaces: “one can
create a query language which may be used by non-database specialists in their work. Such
a query interface can be supported by a Graphical User Interface (GUI) which can make the
process of query creation much easier”.
The challenge consists of improving the capabilities for displaying the generated results
(i.e., from a query or the data mining process) in a graphical mode. The idea is that if we
are able to analyze the results in such a graphical way, we may give feedback to the user so
that he can refine the analysis process and/or guide the direction for further study. This is
the principle in relevance feedback (do an initial query, get feedback from the user, and
then incorporate information obtained from prior relevance judgments to redefine the
query).
97
We developed a prototype system that provides a graphical user interface to perform the
data mining phase (and data analysis) using our proposed graph-based representation for
spatial and non-spatial data. The analysis is implemented by using spatial and non-spatial
queries and the data mining process is performed with the Subdue system, a graph-based
data mining tool. In our research we used two test domains to evaluate our proposal. The
first one is a database storing data from the Popocatépetl volcano (see Section 3.4 for
details). The other one is a database containing data related to a population census from the
year of 1777 in Puebla downtown as described in Section 5.1.
5.1 Population Census from the year of 1777 in Puebla
downtown
As our test domain we have worked in the project “Habitar y vivir. Análisis del espacio
habitacional de la ciudad de Puebla 1690-1890”. This is a project directed by Dra. Rosalva
Loreto López, a researcher of Urban History. Our objective is to make use of data mining
techniques to find interesting relations and patterns between the population and habitation
spaces in Puebla downtown during that time period. We argue that our model can find
patterns involving non-spatial data (i.e., characteristics of people living in the zone), spatial
data (i.e., distribution of the space), and relations between them (i.e., characteristics of
houses based on people social status and/or number of people living in a house) in a single
pattern.
98
Figure 5.1 shows the spatial structure we implemented for representing the spatial concepts
in the census. The structure has 5 spatial aggregation levels. Parish (spatial concept for
representing a physical area) is the upper spatial component. A Parish is defined by one or
more Neighborhoods. A Neighborhood can involve several Blocks. A Block is defined by
four Streets and finally a Street gives the location to a House, where a family lives.
House
Parish
Block
Neig
hbor
hood
Neighborhood
Stre
et
Street
Block
Parish
Figure 5.1. Representation of spatial concepts in the census from the year of 1777.
After defining the spatial concepts involved in our domain, we need a graph-based data
representation to describe our data as a graph. We have implemented two graph-based
representations for the non-spatial data of the census.
These representations have three main components: (1) HOUSE involves the attributes
describing a House. It contains one or more Uh (atomic physical area where a family lives).
One or more families might inhabit in a House, but each family lives in an Uh. (2) UH
involves the attributes describing the living space. (3) MEMBER involves the attributes
describing a member of a family.
99
Figure 5.2 shows the first representation created for the population census. A description of
the representation is as follows: a Parish contains one or more Neighborhoods. A
Neighborhood contains one or more Blocks or Locations (spatial concept for identifying
with precision a physical area -i.e., north of-). Block and Location contain one or more
Streets. A Street contains one or more HOUSES. A HOUSE has two attributes (NCasa -Id
assigned to the House- and NHabCasa -number of Uh in a House-). A HOUSE contains
one or more UH. An UH has several attributes describing it (Uh -Id assigned to the Uh-,
Etnicidad -predominant ethnic group among the members of a family-, TFamilia -type of
family, i.e., family with children-, TUh -type of Uh- and NMiembrosFamilia -number of
members integrating a family-). An UH contains one or more MEMBERS. Finally a
MEMBER has several attributes describing it (JFamilia -head of family-, TitPersona -title
of the member, i.e., don, doña-, Nombre -first name-, Apellido -last name-, Sexo -sex-,
EdoCivil -marital status-, Parentesco -social relationship with respect to the head of family-
, GpoEtnico -ethnic group- and Edad -age-).
100
Location
Parish
Neighborhood
Block
Street
contain
contain contain
contain
contain
containcontain
value
value
HOUSE
UH
value
value
value
value
MEMBER
value
value
value
value
value
value
value
value
value
NCasa
NHabCasa
Etnicidad
TFamilia
TUh
NMiembrosFamilia
contain
JFamilia
TitPersona
Nombre
Apellido
EdoCivil
Parentesco
GpoEtnico
Edad
Sexo
value
Uh
Figure 5.2. First representation for the non-spatial data in the population census from the year of 1777.
The second representation is presented in Figure 5.3. The representation can be read as
follows: A HOUSE is the main item. A HOUSE has several attributes describing it
(Parish, Neighborhood, Block, Location, Street, NCasa and NHabCasa). A HOUSE
contains one or more UHS. An UH has several attributes describing it (Uh, TUh, Etnicidad,
TFamilia, NMiembrosFamilia). In a UH lives (contains) one or more MEMBERS, and
finally, a MEMBER has several attributes describing it (Jfamilia, TitPersona, Nombre,
Apellido, Sexo, EdoCivil, Parentesco, GpoEtnico and Edad).
101
value
value
value
value
value
value
HOUSE UH
Location
Neighborhood
Street
NCasa
NHabCasa
Parish value
value
value
value
MEMBER
Etnicidad
TFamilia
TUh
NMiembrosFamilia
contains value
value
value
value
value
JFamilia
Sexo
EdoCivil
Parentesco
GpoEtnico
value
value
value
TitPersona
Nombre
Edad
value
Block
value
Apellido
value
Uh
contains
Figure 5.3. Second representation for the non-spatial data in the population census from the year of 1777.
5.2 Modules
This section presents a prototype system developed for testing our graph-based data
representation model for spatial data mining as proposed. The prototype is implemented in
Java. We use Oracle DBMS version 9i for storing and processing our data. Oracle has a
module to manage spatial data named Oracle Spatial; we take advantage of the spatial
operators, geometry functions and spatial aggregate functions implemented in the module
for either identifying the objects in a region over a spatial layer or obtaining/validating the
spatial relations between two spatial objects. The prototype is divided into seven modules:
102
Data Preparation and Cleaning. The data preparation and cleaning phase is a step of the
Knowledge Discovery in Databases (KDD) process. Our module helps the user to validate
the data and creates the necessary structures to store it in the database. In this process, the
data mining expert and the user must work together to identify the activities to be
performed that are related to the process (i.e., to identify noise and remove it, treat missing
values, etc). Once these activities have been defined, the process is transparent to the user
when adding more data (belonging to the same database schema for the domain).
Query. We have implemented the interface shown in Figure 5.4 to allow the user to create
non-spatial queries. The goal is to allow the user, to make use of spatial and non-spatial
attributes, to query the database and to build graphs from the obtained results.
The Query panel allows the user to query the database using an SQL-like approach. The
queries are created in real time and then submitted to the database for answering the user
request. The results are presented to the user in two ways. The first uses a relational
approach and the second is represented over a map.
103
Figure 5.4. The query panel.
The interface is divided in 2 sections, the Control and the Results sections. The Control
section is divided in 4 subsections: Select, Where, OrderBy and GroupBy. The Results
section is divided in 2 subsections: Query visualization and Generated results.
In the Control section the user creates the query. The query is created by specifying the 5
most usual clauses in a SQL statement: Select–From-Where-OrderBy-GroupBy. The first
two elements are mandatory, the last three are optional.
104
The Select subsection presents the twenty-two fields that compose the core of the database.
Twenty one of these fields have four options implementing query functionalities as follows:
option1 – attribute name – option2 – options3 – option4. The first option is used to indicate
if the field will be included in the query. The second option contains the keywords used to
implement, currently, five grouping functions: “count", "count(distinct())", "distinct",
"max" and "min". These keywords are used to group the data by the field(s) selected in the
GroupBy subsection. Options three and four are used to indicate if the descriptive attributes
associated to the field will be included in the query. If the first option is not selected then
all the other options are ignored. Additionally, the Select subsection has an item containing
the name of the schema (i.e., 1777 census) to create the “from” clause of the query.
The Where subsection is used to indicate the set of conditions, by field, for restricting the
rows to be selected (the dataset returned by the query). A condition specifies a combination
of one or more expressions composed of attribute names, attribute values, and logical
operators. These conditions are entered manually by the user, so knowledge about the
domain is required.
The OrderBy subsection allows the user to indicate if he wants to sort the rows returned by
the query and which field(s) will be used to perform the operation. We can sort the rows
returned using the twenty-two fields composing the database and we can also implement
combinations of these elements (i.e., sorting by the Name and Sex fields).
The GroupBy subsection was implemented to allow the user to group the selected rows
based on the value(s) of each row and return a single row with a summary of the
105
information for each group. We can group the rows using twenty-one of the twenty-two
fields of the database and we can also implement combinations of these elements (i.e.,
grouping by the Parish and Sex fields). The grouping aggregation function (i.e., count or
sum) is chosen in the Select subsection. As we have already mentioned, we have
implemented five grouping functions.
Our prototype system creates queries by selecting the different fields and options contained
in the four previous sections. This query is created in real time, displayed in the Query
visualization area and then submitted to the DBMS for processing. The obtained result is
displayed in the Generated results area; it is presented by using a relational approach (rows
and columns).
SQL. Using this interface the user can create a query by typing it directly (in manual form).
The principle is the same as we described in the Query panel, we want to create queries and
to use the results of the queries to create graphs (based in our model). These graphs will be
the data source for our data mining algorithm so that we can find patterns that allow us to
understand and describe our data.
In the Query panel (see Figure 5.4) the user creates the query in real time by using a
graphical interface but if he wants to modify it, it is not possible (the query is displayed just
for visualization purpose). Each time the user creates a query in the Query panel, the
created query is copied to the Query visualization area in the SQL panel so if the user wants
to enhance or modify the query he is able to do it. The generated results are presented to the
106
user in the Generated results area. Both interfaces (Query and SQL panels) provide the user
with the capability to send the generated results to a text file or a printer.
Map. Another way to create graphs in the prototype is using the Map panel. In this case the
interface displays in a graphical way the spatial layers stored in the database (see Figure
5.5).
Figure 5.5. The map panel.
By using the interface, the user can delimit the set of spatial and non-spatial data that will
be used for creating the graphs. Some times the user wants to analyze only some regions in
a spatial layer so it is not necessary to include all the data in the graph. Additionally, we
107
have to take into account that if we have a huge database we will build huge graphs and this
feature has a direct impact over the data mining algorithm, so this is an important issue we
have to face. A solution for facing the problem of creating huge graphs is delimiting the set
of elements to be included in it by using selection windows as we mentioned in Chapter 3.
A second method for delimiting the set of elements to be considered while creating a graph
is by using the results of the non-spatial queries created and processed in the Query and
SQL panels. This is implemented by a process that identifies the spatial objects involved in
the results generated by a non-spatial query (i.e., a query computing the number of people,
grouped by Sex, living in each Parish in the census from the year of 1777 so we can select
and show the Blocks or the Streets belonging to each Parish over the map).
The interface is divided in four sections: Visualization area, Layers in database,
Operations control and Map control. The Layers in database section displays the name of
the spatial layers stored in the database. The component is also used for selecting and
identifying the spatial layers to work with.
We also include information about the spatial relations to be considered in the graph. The
Operations control section includes the operations (Topological, Distance and Direction)
implemented to validate the spatial relations among spatial objects. In the case of the
topological relations we have implemented the validations supported by the Oracle Spatial
module.
108
The Map control section has five buttons implementing the Zoom in, Zoom out, Show all,
Reset all and Save map operations. The Visualization area works in two operational modes:
Query mode for creating selection windows and Zoom mode for implementing the Zoom in
and Zoom out operations. The option “X Layer” is used to indicate if the validation of
spatial relations among spatial objects will just be among objects belonging to different
spatial layers (i.e., objects belonging to the Parish and Neighborhood layers) or among all
objects belonging to all layers. The last two options are used to control the characteristics
of the Selection window: a Selection window tool can be a Rectangle or a Circle, and we
can specify if we want to select just the elements inside the Selection window or the
elements inside and touching its border.
Once the user has selected the working area, the following steps identify the objects inside
the area and validate the spatial relation(s) among them. We have implemented the
validation of topological, distance and direction relations; only the objects meeting the
relation(s) chosen by the user will be candidates to become objects in the graph.
Spatial Graph. The next task is to create the graph. First, the user must select the non-
spatial attribute(s) describing the spatial objects in the graph (see Figure 5.6). Remember
that the spatial objects and spatial relations among the objects that will be included in the
graph were selected in the Map panel.
109
Figure 5.6. The spatial graph panel.
By using this interface, the user can select the non-spatial attributes, for each spatial layer
he works with, that will be related to the spatial objects in the graph. For instance, suppose
the user is working with the spatial layers A and B; the spatial layer A has five non-spatial
attributes and the spatial layer B has three non-spatial attributes. So, he may select from
layer A two attributes and from spatial layer B three attributes. The idea is to give the user
the capability to select the non-spatial attributes, for spatial layer, that he considers relevant
for the mining task.
110
From the Graph characteristics section the user defines the format, model and output
device characteristics for the graph. The graph generated by the system can be created
following the Subdue or the Graphviz [19] layouts. In the first case the graph is created for
feeding the Subdue system, and in the second case it is created for visualization purposes.
Currently, we have implemented five graph-based representation models in our prototype.
Each model expresses a representation proposal for creating graphs involving the three
basic elements found in a spatial database (spatial, non-spatial data, and spatial relations
among spatial objects). The resulting graph can be visualized on the screen or stored in a
text file. The last option was implemented in order to provide the Subdue and Graphviz
systems with their corresponding input files.
We can see at the right side of Figure 5.6 an example of a created graph using the Subdue
layout. This sample graph was generated from 2 spatial layers (i.e., chiglesia and chpuebla).
From each of them the user selected one or more attributes that were related to the spatial
objects. Figure 5.7 presents a fragment of the same graph but this time it is drawn using the
Graphviz system.
111
Figure 5.7. Graph representation of processed data.
Non-Spatial Graph. Focusing in the “Habitar y vivir. Análisis del espacio habitacional de
la ciudad de Puebla 1690-1890” project we have implemented an interface for allowing the
user the creation of graphs based on the two graph-based representations created for the
non-spatial data of population census (they are shown in Figure 5.2 and Figure 5.3). These
graphs are created using only non-spatial data. Our objective for implementing graphs with
this characteristic is to have metrics that allow us to compare and to evaluate the results
generated by the data mining algorithm when we use graphs containing spatial data, non-
spatial data, and spatial relations at the same time against graphs containing only non-
spatial data.
The interface implemented is shown in Figure 5.8. The user selects the non-spatial
attributes and defines the settings that will be used for creating a query (as in the spatial
graph). The result obtained from the query is used for creating the non-spatial graph.
112
Figure 5.8. The non-spatial graph panel.
Subdue. The Subdue panel contains the interface developed for calling the Subdue system
(see Figure 5.9). In order to run Subdue, the user must select the corresponding text file (a
file containing a graph) and define the parameters that will guide the Subdue’s substructure
discovery system for finding substructures.
113
Figure 5.9. The Subdue panel.
Figure 5.10 shows an example of a Subdue’s standard output (only for one substructure) for
displaying the discovered substructures (i.e., patterns) from an input graph. Since our
definition of instances and substructures as graphs (see Chapter 4.1 for details), the
Subdue’s output is also a graph, in our example, it can be read as follows:
• Substructure value = 1.01706. Represents the value of the substructure in the input
graph.
• Pos instances = 3865. This value tells us how many instances of the substructure
exist in the input graph.
114
• Graph (2v, 1e). Number of vertices (“v”) and edges (in our example directed edges
“d”) in the substructure.
• “v 1 SUB_4”. Substructure’s first vertex labeled as SUB_4.
• “v 2 X”. Substructure’s second vertex labeled as X.
• “d 1 2 ETNICIDAD”. A directed edge from vertex 1 to vertex 2 labeled as
ETNICIDAD in the substructure.
Substructure: value = 1.01706, pos instances = 3865, neg instances = 0 Graph(2v,1e): v 1 SUB_4 v 2 X d 1 2 ETNICIDAD
Figure 5.10. Example of Subdue’s standard output.
As we have already commented, Subdue is a system that finds substructures in a
hierarchical way, that is, a substructure found in a previous iteration can appear in a new
iteration. When this happens, those substructures are represented by a vertex labeled as
SUB_x, which represents the best substructure discovered at iteration “x”.
In our example (Figure 5.10), SUB_4 is itself a substructure defined by 2 vertices and 1
edge where its first vertex is labeled as SUB_2. This means that the definition of SUB_4 is
composed by the definition of SUB_2. Substructure SUB_2 is defined by 2 vertices and 1
edge where its first vertex is labeled as SUB_1. Again, this means that the definition of
SUB_2 is composed by the definition of the previously discovered substructure SUB_1.
Finally, SUB_1 is a substructure defined by 2 vertices and 1 edge.
115
As we can see, the lecture and interpretation of the substructures discovered by Subdue
may be a complicated task. We have implemented a parsing function for reading a text file
containing the results generated by Subdue, so we can create the necessary data structures
for presenting to the user the same results but using an easier way to read it as we show in
Figure 5.11 (the figure presents the example described in Figure 5.10). This layout is
created by using the Graphviz system and is saved as a JPEG image file but we can save it
in any output format supported by Graphviz. The goal is to improve the way the user can
read and interpret the results generated by Subdue.
Figure 5.11. Layout for reading the Subdue’s discovered substructures.
5.3 Conclusion
As test domain we have designed and built a spatial database to store both a population
census from the year of 1777 in Puebla downtown and a map representing the blocks in the
116
zone. This data is part of a project directed by Dra. Rosalva Loreto López, a researcher in
the urban history domain.
We have developed a prototype system implementing our model to represent together
spatial data, non-spatial data and the spatial relations among the spatial objects. The
prototype allows the user to select the spatial layers to work with, to create spatial and non-
spatial queries that will be used to select the spatial objects that will be included in the
graph. For each spatial layer the user work with, he has the capability to select the attributes
that will be related to the spatial objects in the graph.
We have also implemented a visualization tool which helps us to display in a graphical way
(by using the Graphviz system) the hierarchical discovered substructures by Subdue.
In the next chapter we present three cases showing the applicability of our methodology for
modeling and mining spatial data mining using a graph-based representation.
117
Chapter 6
RESULTS
This chapter presents three use-cases of our methodology for modeling and mining spatial
data using the proposed graph-based representations. For this purpose, we used two spatial
databases as our test domains. The first database contains data related to a population
census from the year of 1777 in Puebla downtown and the second one is a database storing
data from the Popocatépetl volcano.
The use-cases described were implemented based on the following premises: evaluating the
graph-based proposal for modeling and mining spatial data, evaluating the limited overlap
feature, and evaluating the discovered knowledge with the support of a domain expert.
Therefore, the three use-cases presented in this chapter were implemented based on the
following methodology:
1. Selection of the spatial layers to work with.
2. Selection of the spatial relations that will be validated among the spatial objects.
3. Selection of the non-spatial attributes that will be related to the spatial objects in the
graph.
118
4. Mining the graph using the no overlap, standard overlap and limited overlap
features.
5. Evaluation of discovered patterns.
6.1 Population census from the year of 1777 in Puebla downtown
As we have already mentioned, our first test domain is a spatial database containing data of
a population census from the year of 1777 (see Chapter 5 for more details). Figure 6.1
shows a fragment of the “chpuebla”, “chiglesia” and “chrio” spatial layers used in the use-
cases.
Figure 6.1. Population census from the year of 1777 in Puebla downtown.
119
The chpuebla spatial layer (shown in white color) represents blocks in Puebla downtown;
this layer is related to a population census from the year of 1777 as non-spatial data. The
chiglesia layer (shown in red color) contains representative churches for each parish in the
zone. The chrio layer (shown in green color) represents a river crossing Puebla downtown.
It is important to remark that a parish is a spatial object grouping several blocks in the zone.
Each parish has a church as its agglomerative element (people used to live around a
church). Figure 6.2 shows the six parishes (each shown in a different color) and their
representative church:
1. El Sagrario.
2. San José.
3. San Marcos.
4. San Sebastián.
5. Santa Cruz.
6. Santo Angel.
120
Figure 6.2. Parishes in Puebla downtown in the 1777 year.
All use-cases presented in this section were developed using all graph-based models
(currently five models). However, the description of the generated results in this first test
domain corresponds only to model #2 (the model is named single replication of relation
types, complete information). Some of the discovered patterns were used by the domain
expert to validate facts already known (i.e., distribution of the population in the census) and
other allows him to know unknown relationships among spatial objects and non-spatial data
(attributes) in the census (i.e., common characteristics of people living along the two
borders of the river crossing Puebla downtown). In Section 6.2 we present a comparison
among the generated results by each proposed model using a Popocatépetl volcano
database.
121
6.1.1 Use-case: El Sagrario
Suppose we want to know what people have in common in the spaces within a radius of
150 meters from the representative church in parish #1 (El Sagrario). Our experiment will
be focused to find regularities related to the following issues:
• Number of habitation spaces in a house.
• Members of a family.
• Type of family.
• Ethnic group of each family member.
The guideline to select a radius of 150 meters from the church is that this value allows us to
include in our sample dataset at least one block in all directions around the church as we
show in Figure 6.3. Thus, by using the prototype system we selected the chpuebla and
chiglesia spatial layers. Once we selected the spatial layers to work with, the following
steps are to select the spatial relation(s) to be validated among the spatial objects and the
non-spatial attributes that will be related to the spatial objects in the graph. The selected
parameters were the following:
• Spatial layers: chpuebla and chiglesia.
• Pivot: representative church in parish #1.
• Spatial relation: distance.
o Value: 150 meters within a radius from the representative church to the
blocks.
• Spatial graph-based model: model #2.
122
Figure 6.3. Blocks 150m. from representative church, parish “El Sagrario”.
The generated graph was composed of 24,167 vertices, and 24,166 edges according to the
proposed graph-based model. Figure 6.4 shows a fragment of the graph.
Figure 6.4 Example of a generated graph in the use-case “El Sagrario”.
123
This graph was used as the input to the Subdue system. For the experiment we used the
following Subdue’s parameters:
• Predefined substructure: yes (we used “UH CONTAIN MEMBER” since these
elements are grouping components in our graph-based representation for the non-
spatial data of the census). See Section 5.1 for more details.
• Overlap: yes.
• Limited overlap: no/yes (first, we used standard overlap; next, we used limited
overlap).
Once Subdue completed the mining process, the generated results (patterns) were evaluated
by the domain expert. Figure 6.5 shows three examples of discovered patterns using the
standard overlap option. The expert’s opinions were based on the following issues: the
patterns are based on the population distribution schema in the census from the year of
1777 (large population inhabits in parish #1). 65% of the population, in this area, did not
given its ethnic group, this can be interpreted from a demography history perspective, as a
possible dissolution of the racial element for grouping people (creation of groups or
classes) in benefit of alternative grouping parameters such as salary, family networks (how
they lived and whom they lived with), consumption levels, type of house. 16% said to be
“Spanish”. People lived based on the model “Jefe con Familiares y Agregados”.
124
a) b)
c)
Figure 6.5. Examples of discovered patterns by using standard overlap in use-case “El Sagrario” (1).
We can see that the patterns shown in the figure are composed of spatial data (i.e., a vertex
representing a house) and non-spatial data (i.e., a vertex and an edge telling us the major
ethnic groups in the census). Although those patterns do not contain any reference to spatial
relations we must to note that a spatial relation was used to define the dataset for the mining
phase (i.e., 150 meters within a radius from the representative church to the blocks). They
do not appear in the discovered patterns but they are part of the input graph. An explication
to this situation is that most common characteristics (repetitions in the input graph), in this
example, occur in the non-spatial data describing the spatial objects (i.e., the ethnic group
“X- Undefined” and “E- Spanish”). However, these facts represent the strength of our
proposal since our basic idea is to represent spatial data, non-spatial data and spatial
125
relations among the spatial objects as a unique dataset, mine it together, so we can discover
patterns involving these elements at the same time.
The next step in our experiment was to evaluate our limited overlap proposal in Subdue.
We stated three motivations to propose the new algorithm: a search space reduction, time
processing reduction and specialized overlapping pattern oriented search. Therefore, in
order to demonstrate that by using the new approach we obtain those benefits we
implemented the following example. For the limited overlap test, we allow overlap only in
vertices representing the ethnic group of the family members. These vertices are used to
guide the overlapping pattern oriented search.
In this example, the discovered patterns by using the standard and limited overlap features
were slightly different. Figure 6.6 presents three examples of discovered patterns using the
limited overlap. For example, both cases reported as the first substructure a pattern telling
us “Undefined” is the predominant ethnic group in the dataset. The second discovered
substructure, for both cases, tells us that “Spanish” is the next predominant ethnic group. In
the case of the third substructure, using standard overlap Subdue reported a pattern related
to the number of habitation spaces in a house, but through limited overlap Subdue found a
relationship among the “Undefined” ethnic group and the family type “Jefe con Familiares
y Agregados”. An interpretation of these results is that limited overlap reports in all
discovered substructures a vertex representing ethnic group because we oriented the search
over this type of vertices when we specified that vertices representing ethnic group were
allowed for overlapping.
126
a) b)
c)
Figure 6.6. Examples of discovered patterns by using limited overlap in use-case “El Sagrario” (1).
The next step in our experiment was to evaluate the objective of a processing time
reduction by using the limited overlap feature over the standard overlap. Figure 6.7 presents
the processing time comparison chart for this experiment. In the figure, the time taken by
using standard overlap is shown in pink color (193,426 seconds). On the other hand, the
processing time taken by using the limited overlap option is shown in yellow color (36,042
seconds). As we can see, we obtained a time reduction gain of 81.37% using our proposed
limited overlap approach.
127
Figure 6.7. Processing time standard vs. limited overlap: use-case “El Sagrario” (1).
Now, we are going to modify slightly our study zone to show the way we can work with
two or more spatial relations at the same time. By using the prototype system the user can
select one or more spatial relations that will be validated among the spatial objects. For
instance, it is possible to select two spatial relations belonging to the same spatial relation
type (i.e., topological) or to different spatial relations types (i.e., one topological and one
distance relation). Suppose we want to search for regularities about people and habitation
spaces in a radius of 150 meters from the same representative church in parish #1, but this
time, we only want to evaluate blocks located on the North side as shown in Figure 6.8.
Our experiment will be focused to know common regularities over the same issues as in the
previous test.
By using the prototype system we selected the following parameters:
• Spatial layers: chpuebla and chiglesia.
• Pivot: representative church in parish #1.
• Spatial relation: distance.
193,426
36,042
0 50000 100000 150000 200000
Seconds
1 Limited OverlapOverlap
128
o Value: 150 meters.
• Spatial relation: direction
o Value: North.
• Spatial graph-based model: model #2.
The generated graph was composed of 12,021 vertices, and 12,026 edges according to the
proposed graph-based model.
Figure 6.8. Blocks 150m. North side from representative church, parish “El Sagrario”.
As in the previous example, the graph was used as input to Subdue. For the experiment we
selected the following Subdue’s parameters:
129
• Predefined substructure: yes (we used “UH CONTAIN MEMBER” since these
elements are grouping components in our graph-based representation for the non-
spatial data of the census). See Section 5.1 for more details.
• Overlap: yes.
• Limited overlap: no/yes (first, we used standard overlap; next, we used limited
overlap).
Once Subdue completed the mining phase, the generated results were evaluated by the
domain expert. The expert found the following facts: the predominant ethnic group in the
area remains as “Undefined” because of the proximity to the parish center and the
continuous racial and social interchange from the North side. As consequence, they share
the same family type structure. The “Spanish” population is the most important (15%) and
the next one is “Mestizos” (7.13%). The family nucleus which includes “other people living
with” (added people) employs “Mestizos” as subordinated workers (i.e., waiters and
salesmen). It is important to remark that they do not employ “Indígenas” (maybe because
they do not speak Spanish and their limited cultural level).
This is the domain expert evaluation but we also needed to evaluate how the new limited
overlap feature worked. Therefore, the same experiment was performed using the standard
and then the limited overlap. For limited overlap, the vertex allowed for overlap was the
same as in the previous test. The discovered patterns by using standard and limited overlap
were very similar to those obtained in the first test. In fact, the first and second reported
substructures were the same although the third one was different (see Figure 6.9). By using
130
standard overlap Subdue reported “Mestizos” as the third predominant ethnic group.
Limited overlap reported a relationship among the “Undefined” ethnic group and the family
type “Jefe con Familiares y Agregados”. This last pattern was the same one reported in the
previous test.
a) b)
Figure 6.9. Examples of discovered patterns in the use-case “El Sagrario” (2)
Figure 6.10 shows the processing time comparison chart by using the standard and limited
overlap features. In this figure, the time taken when using the standard overlap feature is
shown in pink color (30,532 seconds) while the processing time taken when using limited
overlap is shown in yellow color (2,471). It is important to note that by using our new
overlap approach we obtained a time reduction gain of 92% in our experiment.
131
Figure 6.10. Processing time standard vs. limited overlap: use-case “El Sagrario” (2).
6.1.2 Use-case: People living along the borders of the river crossing
Puebla downtown
As we mentioned at the beginning of the chapter, a river crosses Puebla downtown.
Suppose that we are interested in knowing characteristics about family types and ethnic
groups of people living along the borders of the river.
Therefore, we defined as our study area all blocks located at most 50 meters from the
borders of the river as shown in Figure 6.11. We selected this distance since it allows us to
select at least one block along the entire border of the river. We used the following
parameters in our use-case:
• Spatial layer: chpuebla and chrio.
• Pivot: the river.
• Spatial relation: distance.
o Value: 50 meters.
30,532
2,471
0 10000 20000 30000 40000
Seconds
1 Limited OverlapOverlap
132
• Spatial graph-based model: model #2.
The generated graph was composed of 16,597 vertices, and 16,596 edges according to the
proposed graph-based model.
Figure 6.11. Blocks 50m. from river crossing Puebla downtown.
The created graph was used to feed the Subdue system. For this test we selected the
following Subdue’s parameters:
• Predefined substructure: yes (we used “UH CONTAIN MEMBER” since these
elements are grouping components in our graph-based representation for the non-
spatial data of the census). See Section 5.1 for more details.
• Overlap: yes.
133
• Limited overlap: no/yes (first, we used standard overlap; next, we used limited
overlap).
Figure 6.12 shows two examples of discovered substructures in this use-case. Our domain
expert evaluated the generated results focusing in the following issues: there is a
modification for the population agglomerative criterion on the East side (“San Francisco”,
“El Alto Xonaca”, and “Los Remedios” neighborhoods) and on the West side (“San José”,
“El Sagrario” and “El Carmen” neighborhoods) of the river. The “Mestizos” is the
predominant ethnic group (24.5%); the “Spanish” is the next one (almost has the same
percentage). This pattern may outline that the “Mestizos” ethnic group played the role of
intermediator between the “Spanish” and “Undefined” groups on the West side, and
between the “Spanish” and “Indígenas” on the East side.
a) b)
Figure 6.12. Examples of discovered patterns in use-case “people around the river crossing Puebla
downtown”
The previous results were generated using standard overlap, but we needed to compare
them with the results obtained using limited overlap, so we used the same parameters as in
134
the previous test. We proposed to allow overlap only in vertices representing the ethnic
group.
It is important to mention that, if our overlap label list has many elements, it could be more
efficient (concerning to processing time) to use standard overlap because of the overhead
that results from the validation process. Remember that when we use the limited overlap
feature, each time that an overlap among instances of a substructure is detected, the overlap
is evaluated in order to know if it is allowed or not.
In this experiment we noted that the patterns discovered using the standard and limited
overlap approaches were the same but the processing time taken by each of them for the
mining task was slightly different. Figure 6.13 presents the time comparison chart for the
experiment showing the time taken when using standard overlap in pink color (15,572
seconds) and the time taken when using limited overlap in yellow color (14,021 seconds).
Limited overlap required a lower time to find the same patterns than standard overlap.
Figure 6.13. Processing time standard vs. limited overlap: use-case “people around the river crossing Puebla
downtown”.
15,572
14,021
0 5000 10000 15000 20000
Seconds
1 Limited OverlapOverlap
135
The two use-cases presented in this section show the functionality of our proposal for
modeling and mining spatial data using a graph-based representation. The discovered
patterns in the population census from the year of 1777 were analyzed by a domain expert.
Some of these patterns allow the user to validate facts already known. For instance, the
predominant ethnic groups classified by parish, and population distribution according to the
social status in the zone. But other patterns allow him to know implications, previously
unknown, among spatial concepts and non-spatial attributes in the census. For instance,
racial and economic interchange among people in the parishes, which are the common
characteristics of population living along the borders of the river crossing downtown (on
the West and East sides), social structure according to the ethnic group, common
regularities among the family type and habitation space.
Processing time comparison among standard and limited overlap features was other topic
evaluated in these use-cases. We presented time processing comparison charts showing the
time reduction gain obtained by using the new approach. We also demonstrated that by
using limited overlap we can orient the search over substructures (patterns) containing
elements that in our domain may represent relevant issues. For instance, in the year of 1777
the ethnic group represented a significant element to know characteristics about a family
and their habitation space.
Next section presents a use-case using a Popocatépetl volcano database. The objective in
this illustrative use-case is to evaluate/compare the generated results by each proposed
graph-based model.
136
6.2 Popocatépetl volcano
In Section 4.3 we presented a preliminary use-case showing the applicability of our
methodology using a Popocatépetl volcano database. This section presents an extended use-
case employing the same database. First, we describe the study area, next the method used
to define the dataset and the parameters for the spatial data mining processes, and finally
the generated results.
As already mentioned, the database contains data related to several issues in the
Popocatépetl zone such as settlements, rivers, and evacuation roads. Figure 6.14 shows a
fragment of the Popocatépetl volcano database. For the experiments we will use three
spatial data layers. The first one is the roads layer (representing the roads in the area), the
second one is the rivers layer (representing the rivers in the area), and finally the third one
is the settlements layer (representing population areas). To illustrate the use-case we have
delimited a study zone as shown in the figure.
137
Figure 6.14. Popocatépetl volcano.
The experiments will focus on identifying relationships and characteristics shared among
settlements, roads and rivers in the study zone. Suppose we want to know characteristics
shared by these elements that can help us to implement/evaluate evacuation plans in case of
a volcanic contingence. For example, characteristics of roads starting in or crossing a
settlement, material used to build those roads and their current status (i.e., paved, unpaved),
characteristics of the roads and rivers meeting a relationship (i.e., they cross, touch) in the
zone, rivers near to settlements that in case of huge pluvial concentration might represent a
potential risk.
138
6.2.2. Use-case: Popocatépetl
To illustrate the capabilities of our model for modeling and mining spatial data we will use
as our study zone that shown in Figure 6.15 (Southwest side of the volcano crater). The
experiment is focused on discovering characteristics among roads, rivers, and settlements in
the zone. The presentation of the generated results will be structured in the following way:
We will present discovered patterns among roads and rivers, roads and settlements, and
rivers and settlements in the study zone. We chose this structure because we wanted to
illustrate the user’s capabilities to organize the data in order to create different sceneries to
represent and mine spatial data.
Figure 6.15. Popocatépetl volcano: study zone.
Therefore, we used our prototype system to select the river, settlement, and road spatial
layers. The next step was to select the spatial relationships to be validated among the spatial
139
objects under consideration. To develop our test, we took advantage of a special feature
implemented in the Spatial Oracle module (our SDBMS system): Oracle has the capability
to examine two geometry objects to determine their topological spatial relationship.
Moreover, it is possible to indicate that the same SDBMS determines and returns the
topological relationship that best matches the geometries.
So, to create our dataset, we first selected the spatial layers to work with and then we
evaluated the relationships among the spatial objects contained in the spatial layers. The
experiments were implemented based on the following parameters:
• Spatial layers: rivers, settlements and roads.
• Spatial relation: topological relations supported by Oracle Spatial.
• Spatial graph-based model: all models.
The experiment was performed using the five proposed graph-based models. The objective
was to evaluate the results generated by each of them. Additionally, we ran the test using
no overlap, standard overlap and limited overlap features. In the following figures we show
the generated results in the experiment. In the figures the generated result by using no
overlap is labeled as “a)”, via standard overlap is labeled as “b)”, and through limited
overlap is labeled as “c)”. In the case of limited overlap, we told Subdue to allow overlap
only for vertices representing roads in the zone because this element represents a primary
item in our study domain (evaluation and implementation of population evacuation plans).
140
Since our intention in the experiment was to compare the generated results using the
proposed graph-based models, we selected (and reported) as the most significant
discovered pattern (by using no overlap, standard overlap and limited overlap), the one
covering the following restrictions:
• Complete pattern. A pattern reporting at least two spatial objects (i.e., road and
river), the spatial relation among them (i.e., touch), and some non-spatial
attribute(s) (i.e., “road category unpaved” and “river category draining”).
• Maximum number of reported instances. A pattern with the highest score of
reported instances of a substructure.
The following subsections present the generated results using model 1 to 5. By each model
we describe the discovered patterns according to the proposed structure to mine data: road
and river, road and settlement, and finally river and settlement. At the end of Section 6.2.2,
we present comparison tables and conclusions of the generated results by each model.
6.2.2.1 Model #1 - base model
Road and River. Figure 6.16 shows the most significant discovered pattern between roads
and rivers. The pattern describes a relationship among “road category unpaved overlapping
a river category draining” in the zone. This pattern may be considered as an indicator of the
number of roads that need to be supervised in case of a volcanic contingence since the
material type they are built with, and because they cross rivers (the lecture may be done in
inverse order) that in case of huge pluvial concentration may overflow and make roads
useless. Subdue found by using no overlap 46 instances (in the second iteration) of the
141
pattern; via standard overlap it found 85 instances (in the first iteration), and through
limited overlap it also found 85 instances (in the second iteration). As we can see in the
figure, standard and limited overlap found the same number of instances of the pattern, but
limited overlap required two iterations to find the same pattern. However, this fact does not
mean that standard overlap is better than limited overlap because analyzing the overall
processing time required by limited overlap to finish the substructure discovery phase we
note that it was lower than the required by standard overlap.
142
a) b)
c)
Figure 6.16. Relationships among roads and rivers by using model #1.
Road and Settlement. The most significant discovered pattern found in this experiment
describes a relationship among “road category unpaved touching a settlement category
construction” in the zone as shown in Figure 6.17. “Settlement category construction”
represents in the Popocatépetl’s settlement spatial layer inhabit areas with huge population,
buildings and several constructions used to offer services to people. If we assume that
143
people may require to be evacuated in case of an eruption and that the roads that will be
used are unpaved then this situation may become a problem (i.e., a bottleneck, water and
soil may become mud and this may make roads useless). For this experiment Subdue found
via no overlap 6 instances (in the ninth iteration) of the pattern, through standard overlap it
found 9 instances (in the fourth iteration), and by using limited overlap it discovered 8
instances (in the tenth iteration). In all cases Subdue was able to discover the same pattern;
the difference was the number of computed iterations required to discover it.
144
a) b)
c)
Figure 6.17. Relationships among roads and settlements by using model #1.
River and Settlement. Figure 6.18 shows the most significant discovered pattern for these
spatial objects. It describes a relationship among “river category draining crossing a
settlement category either (a) block or (b)(c) construction” in the zone. “Settlement
category block” represents in the Popocatépetl´s settlement spatial layer inhabit areas but
with small population, in fact there exist several uninhabited areas, few buildings and
145
constructions. The pattern may be used to identify potential flooding zones because it
represents rivers close to (may be some of them crossing) areas inhabited by people.
Through no overlap Subdue discovered 5 instances (in the twelfth iteration) of the pattern,
by using standard overlap it also found 5 instances (in the eighth iteration), and finally, via
limited overlap it also found 5 instances (in the eighth iteration). For this experiment
Subdue discovered the same pattern in the three cases, however, by using standard overlap
and limited overlap the understanding of the pattern is simpler.
146
a) b)
c)
Figure 6.18. Relationships among rivers and settlements by using model #1.
The previous experiment was done using model 1, now we perform the same experiment
using model 2 to 5. We will be focused to compare the same discovered patterns for the
three “object-object” structures (i.e., road-river, road-settlement, and river-settlement). The
147
generated results are shown in the figures following the same report schema: those using no
overlap are labeled as “a)”, those using standard overlap are labeled as “b)”, and finally
those created with limited overlap are labeled as “c)”. The most significant differences
between the generated results are the number of reported instances and the number of
iterations needed to discover the pattern. More iterations means more processing time to
discover the pattern.
6.2.2.2 Model #2 - single replication of relation types, complete
information
Road and River. By means of model #2 Subdue discovered the pattern (among road and
river) shown in Figure 6.19. Subdue found by using no overlap 41 instances (in the second
iteration) of the pattern, via standard overlap it found 85 (in the third iteration), and through
limited overlap it discovered 64 (in the second iteration). All cases reported “road category
unpaved and river category draining” as the predominant spatial objects.
148
a) b)
c)
Figure 6.19. Relationships among roads and rivers by using model #2.
Road and Settlement. For these spatial objects Subdue was able to discover, by means of
model #2, the pattern shown in Figure 6.20. Via no overlap Subdue found 5 instances (in
149
the fourteenth iteration) of the pattern, through standard overlap it discovered 8 (in the sixth
iteration), and by using limited overlap it found 7 (in the tenth iteration). No overlap
reported “settlement category block” whereas standard and limited overlap reported more
instances of “settlement category construction”.
a) b)
c)
Figure 6.20. Relationships among roads and settlements by using model #2.
150
River and Settlement. The pattern discovered by means of model #2 among these objects
is shown in Figure 6.21. Through no overlap Subdue found 5 instances (in the sixteenth
iteration) of the pattern, by using standard overlap it discovered 10 (in the seventh
iteration), and via limited overlap it found 5 (in the fourteenth iteration). No overlap and
limited overlap reported “settlement category construction” whereas limited overlap
reported more instances of “settlement category block”.
a) b)
c)
Figure 6.21. Relationships among rivers and settlements by using model #2.
151
6.2.2.3 Model #3 - double replication of relation types, no complete
information
Road and River. The pattern discovered, by means of model #3, for these objects is shown
in Figure 6.22. Through no overlap Subdue found 39 instances (in the second iteration) of
the pattern, by using standard overlap it found 85 (in the second iteration), and via limited
overlap it discovered 34 (in the ninth iteration). In all cases Subdue reported as the
predominant spatial objects “road category unpaved and river category draining”.
152
a) b)
c)
Figure 6.22. Relationships among roads and rivers by using model #3.
153
Road and Settlement. By means of model #3 Subdue discovered the pattern shown in
Figure 6.23. Subdue found by using no overlap 4 instances (in the fifteenth iteration) of the
pattern, via standard overlap it found 8 (in the sixth iteration), and through limited overlap
it discovered 5 (in the thirteenth iteration). No overlap and limited overlap reported
“settlement category block” as the predominant spatial object whereas standard overlap
reported “settlement category construction”.
a) b)
c)
Figure 6.23. Relationships among roads and settlements by using model #3.
154
River and Settlement. For these spatial objects Subdue was able to discover, by means of
model #3, the pattern shown in Figure 6.24. Via no overlap Subdue found 5 instances (in
the twelfth iteration) of the pattern, through standard overlap it found 19 (in the fourth
iteration), and by using limited overlap it found 5 (in the tenth iteration). No overlap and
limited overlap reported less instances of “settlement category construction” whereas
standard overlap reported more instances of “settlement category block”.
a) b)
c)
Figure 6.24. Relationships among rivers and settlements by using model #3.
155
6.2.2.4 Model #4 - single replication of relation types, no complete
information
Road and River. For these spatial objects Subdue was able to discover, by means of model
#4, the pattern shown in Figure 6.25. Via no overlap Subdue found 39 instances (in the
second iteration) of the pattern, through standard overlap it discovered 85 (in the first
iteration), and by using limited overlap it found 60 (in the second iteration). All cases
reported as the predominant spatial objects “road category unpaved and river category
draining”.
156
a) b)
c)
Figure 6.25. Relationships among roads and rivers by using model #4.
Road and Settlement. The pattern discovered, by means of model #4, for these objects is
shown in Figure 6.26. Through no overlap Subdue discovered 6 instances (in the twelfth
iteration) of the pattern, by using standard overlap it found 8 (in the tenth iteration), and via
157
limited overlap it found 8 (in the seventh iteration). The representative spatial objects are
“road category unpaved and settlement category construction” in all cases.
a) b)
c)
Figure 6.26. Relationships among roads and settlements by using model #4.
River and Settlement. By means of model #4 Subdue discovered the pattern shown in
Figure 6.27. Subdue found by using no overlap 5 instances (in the sixth iteration) of the
158
pattern, via standard overlap it found 10 (in the sixth iteration), and through limited overlap
it found 5 (in the eleventh iteration). No overlap and limited overlap reported less instances
of “settlement category construction” whereas standard overlap reported more instances of
“settlement category block”.
a) b)
c)
Figure 6.27. Relationships among rivers and settlements by using model #4.
159
6.2.2.5 Model #5 - double replication of relation types, complete
information
Road and River. The pattern discovered, by means of model #5, for these objects is shown
in Figure 6.28. Through no overlap Subdue discovered 45 instances (in the fifth iteration)
of the pattern, by using standard overlap it found 85 (in the first iteration), and via limited
overlap it found 45 (in the fifth iteration). Subdue reported in all cases as the representative
spatial objects “road category unpaved and river category draining”.
160
a) b)
c)
Figure 6.28. Relationships among roads and rivers by using model #5.
161
Road and Settlement. By means of model #5 Subdue discovered the pattern shown in
Figure 6.29. Subdue found by using no overlap 6 instances (in the seventh iteration) of the
pattern, via standard overlap Subdue could not find a complete pattern (in the figure the
category of the settlement is not reported), and through limited overlap it found 7 (in the
seventh iteration). No overlap and limited overlap reported as the representative spatial
objects “road category unpaved and settlement category construction”.
a) b)
c)
Figure 6.29. Relationships among roads and settlements by using model #5.
162
River and Settlement. For these spatial objects Subdue was able to discover, by means of
model #5, the pattern shown in Figure 6.30. Via no overlap Subdue discovered 5 instances
(in the thirteenth iteration) of the pattern, through standard overlap it found 5 (in the sixth
iteration), and by using limited overlap it also found 5 (in the tenth iteration). Subdue
reported in all cases “river category draining and settlement category construction” as the
predominant spatial objects.
a) b)
c)
Figure 6.30. Relationships among rivers and settlements by using model #5.
163
Table 6.1 presents a comparison, by each model and overlap feature, among the number of
discovered instances (patterns) and the number of iterations needs to discover them. For
example, by using model #1, Subdue found via no overlap 46 instances (in the second
iteration) of a “complete” pattern (according to our definition for reporting a complete
pattern) involving the spatial objects road-river. A higher score means a model allowing us
to discover more instances of a substructure. Remember Subdue reported as the best pattern
(by iteration) a substructure with the highest score of discovered instances of that
substructure. This comparison is reported by each “object-object” structure (i.e., road-
river).
Note: NO (No overlap), SO (Standard overlap), LO (Limited overlap).
Model #1 Model #2 Model #3 Model #4 Model #5
NO SO LO NO SO LO NO SO LO NO SO LO NO SO LO
Road-River
Instances 46 85 85 41 85 64 39 85 34 39 85 60 45 85 45
Iterations 2 1 2 2 3 2 2 2 9 2 1 2 5 1 5
Road-Settlement
Instances 6 9 8 5 8 7 4 8 5 6 8 8 6 0 7
Iterations 9 4 10 14 6 10 15 6 13 12 10 7 7 0 7
River-Settlement
Instances 5 5 5 5 10 5 5 19 5 5 10 5 5 5 5
Iterations 12 8 8 16 7 14 12 4 10 6 6 11 13 6 10
Table 6.1. Instances/iterations by each graph-based model: Popocatépetl use-case.
164
From the previous table we can see that Subdue (setting no overlap -NO- in all cases)
reported using model #1, 46 instances in the second iteration, of a complete pattern among
the road and river spatial objects. Using model #2, Subdue reported 41 instances of a
complete pattern in the second iteration. Using model #3, Subdue reported 30 instances of a
complete pattern in the second iteration and so on. Finally, we can identify that model #1
was the best model to find the highest number of a complete pattern among these spatial
objects (shown in pink color in Table 6.2).
Model #1 Model #2 Model #3 Model #4 Model #5
NO NO NO NO NO
Road-River
Instances 46 41 39 39 45
Iterations 2 2 2 2 5
Table 6.2. The best model to discover complete patterns among the road and river spatial objects (no overlap)
From Table 6.1 we created Table 6.3. This table presents a comparison of
maximum/minimum discovered instances by each overlap feature. A model with the
highest score is better since it allows discovering more instances of a substructure
(patterns). The comparison is presented for each “object-object” structure (i.e., road-river).
For example, we have already mentioned that model #1 was the best model to find
complete patterns (setting no overlap) among the road and river spatial objects. Subdue
reported 46 discovered instances of a complete pattern in the second iteration (the highest
score). See Table 6.1 for details.
165
On the other hand, model #3 and model #4 were the worst models to find complete patterns
among the road and river spatial objects. For example, using model #3, Subdue only
reported 39 instances of a complete pattern in the second iteration.
Maximum Minimum
Road-River
No overlap model #1 (second iteration) models #3 and #4 (second iteration)
Standard overlap models #1, #4 and #5 (first iteration) model #2 (third iteration)
Limited overlap model #1 (second iteration) model #3 (ninth iteration)
Road-Settlement
No overlap model #5 (seventh iteration) model #3 (fifteenth iteration)
Standard overlap model #1 (fourth iteration) model #5 (no complete pattern)
Limited overlap model #4 (seventh iteration) model #3 (thirteenth iteration)
River-Settlement
No overlap model #4 (sixth iteration) model #2 (sixteenth iteration).
Standard overlap model #3 (fourth iteration) model #1 (eighth iteration).
Limited overlap model #1 (eighth iteration) model #2 (fourteenth iteration)
Table 6.3. Max/Min of discovered instances by “object-object”/overlap feature.
From Table 6.3 we can identify that model #1 was the best model to discover complete
patterns in most of the cases. The second best most was model #4. On the other hand, the
worst model to discover complete patterns was model #3.
Table 6.4 presents a comparison among the average of discovered instances by model.
Higher score means a model allowing us to discover more instances of a substructure
(patterns). Each value represents the average of discovered substructures by using no
166
overlap, standard overlap and limited overlap features. For example, the value 72.0 in
column “Model #1” and row “Road-River” is the average computed from the values 46, 85
and 85 obtained from Table 6.1. The comparison is reported by each “object-object”
structure (i.e., road-river). We can see in the table that model #1 was the best model to find
complete patterns. It was followed by model #2. The worst model was model #5.
Model #1 Model #2 Model #3 Model #4 Model #5
Road-River 72.0 63.3 52.7 61.3 58.3
Road-Settlement 7.7 6.7 5.7 7.3 4.3
River-Settlement 5.0 6.7 9.7 6.7 5.0
Table 6.4. Average of discovered instances by model/“object-object”.
Table 6.5 presents a comparison among the average of discovered instances by model. For
example, the value 19.0 in column “Model #1 no overlap (NO)” is the average computed
from the values 46, 6 and 5 obtained from Table 6.1. Higher score means a model allowing
us to discover more instances of a substructure (patterns). The comparison is reported by
each overlap feature.
Model #1 Model #2 Model #3 Model #4 Model #5
NO SO LO NO SO LO NO SO LO NO SO LO NO SO LO
19.0 33.0 32.7 17.0 34.3 25.3 16.0 37.3 14.7 16.7 34.3 24.3 18.7 30.0 19.0
Table 6.5. Average of discovered instances by model/overlap feature.
Finally, Table 6.6 presents a comparison among the average of discovered instances by
model. For example, the value 28.2 in column “Model #1” is the average computed from
the values 19.0, 33.0 and 32.7 obtained from Table 6.5. We can see in the table that model
167
#1 reported the highest score of discovered instances (according to our parameters for
reporting complete instances) in this illustrative Popocatépetl domain. The following ones
were model #2 and model #4 respectively. The worst model to find complete patterns as
proposed was model #5.
Model #1 Model #2 Model #3 Model #4 Model #5
28.2 25.6 22.7 25.1 22.6
Table 6.6. Average of discovered instances by model.
As part of our strategy for testing our graph-based methodology for modeling and mining
spatial data, we proposed five graph-based operative models derived from our general
schema. However, from the analysis and interpretation of the results presented in this
section we can conclude that model #1 and model #2 are (following that order) the two
graph-based models that produce the best results. This conclusion is based on the
quantitative and qualitative analysis implemented using this test domain. Additionally, this
appreciation is also supported by the results obtained from the discovered patterns in the
population census from the year of 1777 in Puebla downtown test domain described in the
previous section.
6.3 Conclusion
In this chapter we have presented three illustrative use-cases of our proposal for modeling
and mining spatial data. The test domains were two spatial databases, the first one related to
168
a population census from the year of 1777 in Puebla downtown, and the second one related
to the Popocatépetl volcano.
The use-cases developed using the population census database were focused on
exemplifying the graph-based model to represent together spatial data, non-spatial data and
spatial relations, the new limited overlap feature implemented in the Subdue system
(processing time and specialized overlapping pattern oriented search), and the generated
results by the mining phase. We have presented in each use-case evaluations performed by
the domain expert over the discovered patterns.
The use-case developed using the Popocatépetl database was focused on the
comparison/evaluation of the generated results by each proposed graph-based model. The
tests were implemented upon the supposition of evacuation plans implementation in case of
volcanic contingences. We presented comparison tables describing which model(s) allow(s)
to discover more instances of a substructure via no overlap, standard overlap, and limited
overlap features. Subdue reported as the best pattern (by iteration) a substructure with the
highest score of discovered instances of that substructure.
Next chapter presents conclusion about our research work and final remarks.
169
Chapter 7
CONCLUSIONS
The continuous interaction among people and their natural home, the Planet Earth,
generates everyday new requirements associated to spatial data. For example, urban
analysis, natural risks prevention, space exploration, contamination in oceans, and
reforestation of lands just to mention some of them. Spatial data mining involves the
integration of methods from different scientific fields which help us, by means of data
analysis and discovery algorithms, to produce a particular enumeration of patterns from
spatial data.
In Chapter 2 we presented several approaches developed for mining spatial data (i.e.,
generalization-based methods, clustering, spatial associations, approximation and
aggregation, mining in image databases, spatial classification, and spatial trend detection).
However, our argumentation about those approaches was that they do not consider all the
elements found in a spatial database (spatial data, non-spatial data and spatial relations
among the spatial objects) in an extended way. We proposed in this dissertation a new
approach based on graphs. Our feeling is that if we are able to represent those elements as a
unique dataset and if we are also able to mine them as a whole, then, we might be able to
170
get patterns that might contain both types of data and spatial relationships enhancing the
quality of the results since the generated pattern(s) would describe (a) spatial object(s)
meeting (a) spatial relation(s) with other spatial object(s) and which is(are) that (those)
relationship(s). We proposed to use a graph-based representation since it provides the
desirable flexibility to describe these elements and their relations together.
As mentioned, in our model spatial relations among spatial objects are included since a
significant characteristic of spatial data is the influence of the neighbors of an object may
have on the object itself. In the model we included the representation of three types of
spatial relations.
Derived from the general graph-based schema we have proposed five operative models.
Three aspects define the characteristics of a graph created from those models: first, the
representation of equivalent spatial relations (the relation A touch B can be represented by
two directed edges, A B and B A, or by one undirected edge, A──B; we used the second
approach). Second, the representation of symmetric spatial relations (the relation A
North_of B implies a relation B South_of A, some models represent only the first relation
and other both relations). The third aspect is the way to represent the objects and their
relationships. Our intention is to represent the spatial objects and their relation as much as
possible but also considering a balance among the representative of the data and the
complexity of the created graph. This last issue has a major importance since the tie among
the complexity of the created graph and the mining phase. For example, huge graphs may
require more computational resources than small graphs, but by creating small graphs we
may loss data representativeness and perhaps we may not to discover significant patterns.
171
As component of our methodology for mining spatial data using a graph-based approach,
we used the Subdue system as our mining tool. The overlap feature plays a relevant role in
the Subdue’s substructure discovering system. But, as we described, it is implemented in an
orthodox way: allows overlap among any instances of a substructure or allows overlap
among all instances of a substructure. The first option is better regarding to the processing
time, but the second one may discover more instances of a substructure (patterns). Both
cases do not give to the user the capability to specify the set of vertices allowed for
overlapping.
Therefore, we proposed a new overlap approach named limited overlap. The new approach
gives to the user the means to specify over which vertices the overlap will be allowed.
These vertices may represent significant elements in the context we work with. For
example, in the use-case presented in Section 6.2 (implementation of evacuation plans in
case of volcanic contingences in the Popocatépetl volcano zone) vertices representing roads
were allowed for overlapping since these spatial objects represent relevant elements in the
evacuation plans.
Moreover, we visualized three motivation issues to propose the implementation of the new
approach. First, we demonstrated that by using limited overlap we obtain a search space
reduction in the substructure discovering process since we allow overlap but it is restricted
to the set of elements chosen by the user. Second, as result of a search space reduction we
also get a processing time reduction. Remember that as part of the substructure discovery
process there exist a validation and discarding phase over the instances of a substructure.
172
Instances that are not discarded may become candidates to further expansion in order to
discover new substructures. Third, giving the user the capability to choose the set of
vertices allowed for overlapping, we are orienting the search over particular overlapping
instances and, at the same time, discarding irrelevant overlapping instances.
In order to show the feasibility, capacity to mine and to discover patterns by using a graph-
based approach as proposed, we developed a prototype system implementing our model to
create graph-based datasets, to mine those datasets (by calling the Subdue system), and to
visualize the discovered patterns.
In Chapter 6 we described three illustrative use-cases showing the applicability of our
proposal. We used two real world domains: a population census from the year of 1777 in
Puebla downtown, and a Popocatépetl volcano database. The generated results from those
test domains give us a panorama about how and what we can achieve using our approach. It
is important to remark the fact that we can use this approach in any domain that can be
represented as a graph.
In this context, we visualize perspectives related to enhance our work in issues such as the
graph-based model for creating the graphs, the data mining algorithm, and the prototype
system. They include some of the following:
Visualization/presentation of discovered knowledge. For example, visualization of
discovered knowledge over the spatial layers (patterns over the maps); to use charts for
173
showing comparison among patterns; give the user the capability to navigate through the
discovered patterns hierarchy using a hypergraph approach.
Enhancing the algorithms used to create the graph-based datasets according to the
proposed models. The validation of the spatial relations among the spatial object is phase
that in most of the cases requires a lot of computational resources. (i.e., creation and
manage of spatial indices, creation of bounding box for the spatial objects, manage of
spatial reference systems, etc.) Therefore, the algorithms must as most as possible to
execute efficiently this task.
Mining the graph. We proposed and used the Subdue system as our data mining tool,
moreover we implemented a new algorithm name limited overlap. Subgraph isomorphism
is an NP-complete problem, so we must be able to face this problem in order that our
processing times for discovering knowledge meet acceptable parameters of efficiency.
Relationships among non-spatial data describing spatial objects. Implicit relations
among non-spatial data (i.e., attributes) describing the spatial objects may be included in
the model in order to enhance the representativeness of the data.
Spatial data mining is a promising research field. Several approaches have been developed,
and without any doubt, new approaches will be proposed; it is a field in continuous
improvement. Our contribution to the spatial data mining goes in that direction.
174
BIBLIOGRAPHY
[1] AGRAWAL Rakesh, IMIELINSKI Tomasz, SWAMI Arun. Mining Association
Rules between Sets of Items in Large Databases. In : Proc. of the 1993 International
Conference on Management of Data, Washington D.C.. New York, NY : Association for
Computing Machinery Press, 1993, pp. 207-216. ISBN 0-89791-592-5
[2] BERNHARDSEN Tor. Geographic Information Systems: An Introduction. 2nd ed.
New York : John Wiley & Son Inc, 1999, 448 p. ISBN 0471419680
[3] CHEIN Michel, MUGNIER Marie-Laure. Conceptual Graphs: Fundamental
Notions. Revue d’intelligence artificielle, 1992, vol. 6, no 4, pp. 365-406.
[4] CHEN Ming-Syan, HAN Jiawei, YU Philip S. Data Mining: An Overview from a
Database Perspective. Institute of Electrical and Electronics Engineers, Transactions on
Knowledge and Data Engineering, 1996, vol. 8, no 6, pp. 866-883. ISSN 1041-4347
175
[5] DIESTEL Reinhard. Graph Theory [on line]. 3rd ed. New York : Springer-Verlag,
2005. Available on :<http://www.math.uni-hamburg.de/home/diestel/books/graph.theory>
(last visit 01.10.2005). ISBN 3-540-26182-6
[6] EGENHOFER Max J. A Model for Detailed Binary Topological Relationships.
Geomatica, 1993, vol. 47, no 3-4, pp. 261-273.
[7] EGENHOFER Max J., FRANZOSA Robert D. On the Equivalence of Topological
Relations. International Journal of Geographical Information Systems, 1995, vol. 9, no 2,
pp. 133-152.
[8] EGENHOFER Max J., HERRING John R. Categorizing Binary Topological
Relationships between Regions, Lines, and Points in Geographic Databases. Technical
Report. Department of Surveying Engineering, University of Maine, Orono, 1991, 28 p.
[9] ESTER Martin, FROMMELT Alexander, KRIEGEL Hans-Peter, SANDER Jörg.
Spatial Data Mining: Database Primitives, Algorithms and Efficient DBMS Support. Data
Mining and Knowledge Discovery, 2000, vol. 4, no 2-3, pp. 193-216. ISSN 1384-5810
[10] ESTER Martin, FROMMELT Alexander, KRIEGEL Hans-Peter, SANDER Jörg.
Algorithms for Characterization and Trend Detection in Spatial Databases. In : Proc. of the
4th International Conference on Knowledge Discovery and Data Mining, New York, NY,
1998, pp. 44-50.
176
[11] ESTER Martin, KRIEGEL Hans-Peter, SANDER Jörg. Spatial Data Mining: A
Database Approach. In : Proc. of the 5th International Symposium on Large Spatial
Databases, Berlin, Germany, July 1997, pp. 47-66.
[12] FAYYAD Usama M., PIATETSKY-SHAPIRO Gregory, SMYTH Padhraic.
Knowledge Discovery and Data Mining: Towards a Unifying Framework. In : SIMOUDIS
Evangelos, HAN Jiawei, FAYYAD Usama M. Eds. Proc. of the 2nd International
Conference on Knowledge Discovery and Data Mining, Portland, Oregon. Menlo Park, CA
: American Association for Artificial Intelligence Press, 1996, pp. 82-88.
[13] FAYYAD Usama M., PIATETSKY-SHAPIRO Gregory, SMYTH Padhraic,
UTHURUSAMY Ramasamy Eds. Advances in Knowledge Discovery and Data Mining.
Menlo Park, CA : American Association for Artificial Intelligence/Massachusetts Institute
of Technology Press, 1996, 625 p. ISBN 0-262-56097-6
[14] FAYYAD Usama M., SMYTH Padhraic. Image Database Exploration: Progress
and Challenges. In : Proc. of the 1993 Workshop on Knowledge Discovery in Databases.
Washington D.C., July 1993, pp. 14-27.
[15] FAYYAD Usama M., WEIR Nicholas, DJORGOVSKI S. SKICAT: A Machine
Learning System for Automated Cataloging of Large Scale Sky Survey. In : UTGOFF Paul
E. Eds. Proc. of the 10th International Conference on Machine Learning, 1993, University
of Massachusetts. San Mateo, CA : Morgan Kaufmann, 1993, pp. 112-199.
177
[16] FRAWLEY William J., PIATETSKY-SHAPIRO Gregory, MATHEUS Christopher
J. Knowledge Discovery in Databases: An Overview. In : PIATETSKY-SHAPIRO
Greogry, FRAWLEY William J. Knowledge Discovery in Databases. Menlo Park CA :
American Association for Artificial Intelligence/Massachusetts Institute of Technology
Press, 1991, pp. 1-27.
[17] GONZALEZ Jesus A. Empirical and Theoretical Analysis of Relational Concept
Learning Using a Graph-based Representation. PhD Thesis. Arlington : The University of
Texas at Arlington, 2001, 138 p.
[18] GONZALEZ Jesus A., HOLDER Lawrence B., COOK Diane J. Structural
Knowledge Discovery Used to Analyze Earthquake Activity. In : Proc. of the 13th Annual
Florida Artificial Intelligence Research Symposium. American Association for Artificial
Intelligence Press, 2000, pp. 86-90.
[19] GRAPHVIZ - AT&T RESEARCH. Graphviz - Graph Visualization Software [on
line]. Available on : <http://www.research.att.com/sw/tools/graphviz/> (last visit
01.10.2005).
[20] GUTTMAN Antonin. R-Trees: A Dynamic Index Structure for Spatial Searching.
In : Proc. of the 1984 International Conference on Management of Data, Boston,
Massachusset, 1984. New York, NY : Association for Computing Machinery Press, June
1984, pp. 47-57.
178
[21] HAN Jiawei, CAI Yandong, CERCONE Nick. Knowledge Discovery in Databases:
An Attribute-Oriented Approach. In : YUAN Li Yan, Proc. of the 18th International
Conference on Very Large Databases, Vancouver, British Columbia, Canada. San Mateo :
Morgan Kaufmann Publishers, 1992, pp. 547-559.
[22] HAN Jiawei, FU Yongjian. Exploration of the Power of Attribute-Oriented
Induction in Data Mining. In : [13].
[23] HAN Jiawei, KAMBER Micheline. Data Mining: Concepts and Techniques. San
Francisco : Morgan Kaufmann, 2000, 550 p. ISBN 1-55860-489-8
[24] HOLDER Lawrence B., COOK Diane J., GONZALEZ Jesus A., JONYER I.
Structural Pattern Recognition in Graphs. In : CHEN Dechang, CHENG Xiuzhen Eds.
Pattern Recognition and String Matching. Dordrecht : Kluwer Academic Publishers, 2002,
772 p. ISBN 1-4020-0953-4
[25] HOLSHEIMER Marcel, KERSTEN Martin L. Architectural Support for Data
Mining. CS-R9429. Amsterdam, The Netherlands : CWI (Centre for Mathematics and
Computer Science), 1994.
[26] KAUFMAN Leonard, ROUSSEEUW Peter J. Finding Groups in Data: An
Introduction to Cluster Analysis. New York : Wiley-Interscience, 1990, 368 p. ISBN
0471878766
179
[27] KNORR Edwin M., NG Raymond T. Finding Aggregate Proximity Relationships
and Commonalities in Spatial Data Mining. Institute of Electrical and Electronics
Engineers, Transactions on Knowledge and Data Engineering, 1996, vol. 8, no 6, pp. 884-
897.
[28] KOLATCH Erica. Clustering Algorithms for Spatial Databases: A Survey.
University of Maryland, College Park : Department of Computer Science, 2001, 22 p.
[29] KOPERSKI Krzysztof, HAN Jiawei. Discovery of Spatial Association Rules in
Geographic Information Databases. In : Proc. of the 4th International Symposium on Large
Spatial Databases, Portland, Maine, August 1995, pp. 47-66.
[30] KOPERSKI Krzysztof, HAN Jiawei, ADHIKARY Junas. Spatial Data Mining:
Progress and Challenges. In : Proc. of the Workshop on Research Issues on Data Mining
and Knowledge Discovery, Montreal, Canada, June 1996, pp. 55-70.
[31] KOPERSKI Krzysztof, HAN Jiawei, STEFANOVIC Nebojsa. An Efficient Two-
Step Method for Classification of Spatial Data. In : Proc. of the 8th Symposium on Spatial
Data Handling, Vancouver, Canada, 1998, pp. 45-54.
[32] LAURINI Robert. Information Systems for Urban Planning: A Hypermedia
Cooperative Approach. London : Taylor and Francis, 2001, 368 p. ISBN 0748409645
180
[33] LAURINI Robert, MILLERET-RAFFORT F. Les bases de données en géomatique.
Paris : HERMES, 1993, 330 p.
[34] LAURINI Robert, THOMPSON Derek. Fundamentals of Spatial Information
Systems. London : Academic Press, 1992, 680 p. (The A.P.I.C. series, no 37) ISBN 0-12-
438380-7
[35] LU Wei, HAN Jiawei, OOI Beng C. Discovery of General Knowledge in Large
Spatial Databases. In : Proc. of the Far East Workshop on Geographic Information
Systems, Singapore, June 1993, pp. 275-289.
[36] MUGNIER Marie-Laure. On Generalization/Specialization for Conceptual Graphs.
Journal of Experimental and Theoretical Artificial Intelligence, 1995, vol. 7 pp. 325-344.
[37] NG Raymond T., HAN Jiawei. Efficient and Effective Clustering Methods for
Spatial Data Mining. In : Proc. of the 20th Very Large Databases Conference, Santiago,
Chile, 1994. San Francisco, CA, 1994, pp. 144-155.
[38] PECH PALACIO Manuel, SOL David, GONZALEZ Jesus A. Adaptation and Use
of Spatial and non-Spatial Data Mining. Proc. of the International 2002 Workshop
Semantic Processing of Spatial Data, Centre for Computing Research, Instituto Politécnico
Nacional, México D.F., December 2002. ISBN 970-18-8521-X
181
[39] PREPARATA Franco P., SHAMOS Michael Ian. Computacional Geometry: An
Introduction. 1st ed. New York, NY : Springer-Verlag, 1993. 398 p. ISBN 0387961313
[40] SHEIKHOLESLAMI Gholamhosein, CHATTERJEE Surojit, ZHANG Aidong.
WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases.
In : GUPTA Ashish, SHMUELI Oded, WIDOM Jennifer Eds. Proc. of the 24th
International Conference on Very Large Databases, New York, NY, 1998. San Francisco,
CA : Morgan Kaufmann Publishers, 1998, pp. 428-439. ISBN 1-55860-566-5
[41] SHEK Eddie C., MUNTZ Richard R., MESROBIAN Edmond, NG Kenneth.
Scalable Exploratory Data Mining of Distributed Geoscientific Data. In : Proc. of the 2nd
International Conference on Knowledge Discovery and Data Mining, Portland, OR, 1996.
Menlo Park, CA : American Association for Artificial Intelligence Press, 1996, pp. 32-37.
[42] SOL David, LOYO E., RAZO Antonio. Risk Management in the Popocatépetl
Volcano Zone. In : Workshop on Advanced Techniques for the Assessment of Natural
Hazards in Mountain Areas, Innsbruck, Austria, June 2000, pp. 97-101.
[43] STOLORZ Paul E, DEAN Christopher. Quakefinder: A Scalable Data Mining
System for Detecting Earthquakes from Space. In : Proc. of the 2nd International
Conference on Data Mining, Portland, Oregon, 1996. Menlo Park, CA : American
Association for Artificial Intelligence Press, 1996, pp. 208-213.
182
[44] STOLORZ Paul E., NAKAMURA H., MESROBIAN Edmond, MUNTZ Richard
R., SHEK Eddie C., SANTOS J. R., YI J., Ng K., CHIEN S. Y., MECHOSO C. R.,
FARRARA J. D. Fast Spatio-Temporal Data Mining of Large Geophysical Datasets. In :
Proc. of the 1st International Conference on Data Mining, Montreal, Canada, 1995. Menlo
Park, CA : American Association for Artificial Intelligence Press, 1995, pp. 300-305.
[45] SUBDUE SYSTEM - THE UNIVERSITY OF TEXAS AT ARLINGTON.
SUBDUE - Graph Based Knowledge Discovery [on line]. Available on :
<http://ailab.uta.edu/subdue> (last visit 01.10.2005).
[46] ZHANG Tian, RAMAKRISHNAN Raghu, LIVNY Miron. BIRCH: An Efficient
Data Clustering Method for Very Large Databases. In : Proc. of the 1996 International
Conference on Management of Data, Montreal, Canada, 1996. New York, NY :
Association for Computing Machinery Press, June 1996, pp. 103-114. ISSN 0163-5808
lxix
Appendix C
AGREEMENTS UDLAP/INSA DE LYON
lxx
lxxi
lxxii
lxxiii
FOLIO ADMINISTRATIF
THESE SOUTENUE DEVANT L'INSTITUT NATIONAL DES SCIENCES APPLIQUEES DE LYON
NOM : PECH PALACIO DATE de SOUTENANCE : 12/12/2005 (avec précision du nom de jeune fille, le cas échéant) Prénoms : Manuel Alfredo TITRE : Spatial Data Modeling and Mining using a Graph-based Representation NATURE : Doctorat Numéro d'ordre : 05 ISAL Ecole doctorale : Informatique et Information pour la Société Spécialité : Documents Multimédia, Images et Systèmes d'Information Cote B.I.U. - Lyon : T 50/210/19 / et bis CLASSE : RESUME : We propose a unique graph-based model to represent spatial data, non-spatial data and the spatial relations among spatial objects. We will generate datasets composed of graphs with a set of these three elements. We consider that by mining a dataset with these characteristics a graph-based mining tool can search patterns involving all these elements at the same time improving the results of the spatial analysis task. A significant characteristic of spatial data is that the attributes of the neighbors of an object may have an influence on the object itself. So, we propose to include in the model three relationship types (topological, orientation, and distance relations). In the model the spatial data (i.e. spatial objects), non-spatial data (i.e. non-spatial attributes), and spatial relations are represented as a collection of one or more directed graphs. A directed graph contains a collection of vertices and edges representing all these elements. Vertices represent either spatial objects, spatial relations between two spatial objects (binary relation), or non-spatial attributes describing the spatial objects. Edges represent a link between two vertices of any type. According to the type of vertices that an edge joins, it can represent either an attribute name or a spatial relation name. The attribute name can refer to a spatial object or a non-spatial entity. We use directed edges to represent directional information of relations among elements (i.e. object x touches object y) and to describe attributes about objects (i.e. object x has attribute z). We propose to adopt the Subdue system, a general graph-based data mining system developed at the University of Texas at Arlington, as our mining tool. A special feature named overlap has a primary role in the substructures discovery process and consequently a direct impact over the generated results. However, it is currently implemented in an orthodox way: all or nothing. Therefore, we propose a third approach: limited overlap, which gives the user the capability to set over which vertices the overlap will be allowed. We visualize directly three motivations issues to propose the implementation of the new algorithm: search space reduction, processing time reduction, and specialized overlapping pattern oriented search. MOTS-CLES : KDD, Spatial data mining, Graph, Spatial data, Spatial relation, Geoprocessing Laboratoire (s) de recherche : LIRIS - Laboratoire d'Informatique en Images et Systèmes d'Information / CENTIA-UDLAP, Mexique Directeur de thèse: Robert LAURINI, Professeur, INSA de Lyon, France, Directeur Président de jury : Eduardo MORALES, Profesor Titular ITESM-Morelos, Mexique, Examinateur Composition du jury : Nicandro CRUZ Profesor Titular, Universidad Veracruzana, Mexique, Rapporteur François FAGES Directeur de Recherches, INRIA, Paris, France, Examinateur Jesús GONZALEZ Profesor Titular INAOE, Mexique, Co-Directeur Robert LAURINI Professeur, INSA de Lyon, France, Directeur Hervé MARTIN Professeur Université Joseph Fourier, Grenoble, France, Rapporteur Eduardo MORALES Profesor Titular ITESM-Morelos, Mexique, Examinateur David SOL Profesor Titular UDLAP, Mexique, Directeur Anne TCHOUNIKINE Maître de Conférences, INSA de Lyon, France, Co-Directeur