Visualizing The Semantic Similarity of Geographic Features

Preview:

Citation preview

Introduction Methods Result Conclusion & Future Work References

Visualizing The Semantic Similarity of GeographicFeatures

Gengchen Mai 1 Krzysztof Janowicz 1 Sathya Prasad 2

Bo Yan 1

1STKO Lab, UC Santa Barbara, CA, USA

2Esri Inc, Redlands, CA, USA

April, 2018

Introduction Methods Result Conclusion & Future Work References

A Semantically Enriched Visualization

Maps: extensively used to visualize GI and spatialrelationships.

Difficult to directly express non-spatial relationships (semanticsimilarity) using such maps.

Introduction Methods Result Conclusion & Future Work References

A Semantically Enriched Visualization

A Semantically Enriched Visualization: An analogy of thematicmaps to visualize the distribution of geographic features in asemantic space.

Points: Geographic Coordinates → Locations in the SemanticSpace

Polygons: Administration Regions/Continents → SemanticContinents

Introduction Methods Result Conclusion & Future Work References

A Semantically Enriched Visualization

A Semantically Enriched Visualization:

Semantically similar entities are clusters within the sameregion;

The distance between geographic features represents howsimilar they are.

Introduction Methods Result Conclusion & Future Work References

A Semantically Enriched Visualization

In this work, a semantically enriched geospatial datavisualization and searching framework are presented.

We evaluate it using a subset of places from DBpedia.

Multiple techniques:

Paragraph VectorSpatial ClusteringConcave Hull ConstructionInformation Retrieval (IR) Model

Introduction Methods Result Conclusion & Future Work References

A Semantically Enriched Visualization

A semantically enriched visualization resembles cartographic layouts

Introduction Methods Result Conclusion & Future Work References

The Workflow

Introduction Methods Result Conclusion & Future Work References

Paragraph Vectors Computing

Paragraph Vector (or called Doc2Vec) is a representationlearning method proposed by Natural Language Processingcommunity.

Idea: Give a collection of documents, Doc2Vec learns ahigh-dimensional continuous vector (embedding) for eachdocument.

The cosine similarity between the learned document vectorsrepresents the semantic similarity between theircorresponding documents.

Introduction Methods Result Conclusion & Future Work References

Paragraph Vectors Computing

The two-layer neural network architectural of Doc2Vec

Introduction Methods Result Conclusion & Future Work References

Paragraph Vectors Computing

Outputs of Doc2Vec:

Embeddings of documents;

Embeddings of word tokens in the document corpus.

Introduction Methods Result Conclusion & Future Work References

Paragraph Vectors Computing

Data Source: All entities typed dbo:HistoricalPlace inDBpedia (21010 places)

Each historic place has an abstract, comments, images, andgeographic coordinates.

Method: Doc2Vec Model (PVDM [4])

Textual data collection: Treat each place as a documentwhose content is its abstract and commentsTextual data preprocessing: tokenization and lemmalizationParagraph vector training: embedding dimention K = 300;window size N = 10; learning rate α = 0.025

Introduction Methods Result Conclusion & Future Work References

Information Retrieval Model

Place Embeddings: the learned embedding of each historicplace from Doc2Vec.

Query Embeddings:Utilize the Doc2Vec.infer vector() function from gensim’sDoc2Vec packageThe TF-IDF score weighted embedding based on wordembeddings of query word tokensThe simple average of the query tokens’ embeddingsafter stop words removal

Semantic Similarity Score Function: the cosine similaritybetween the query embedding and place embeddings

An API 1 is provided for the semantic searching functionalityamong DBpedia historic places.

1http://stko-testing.geog.ucsb.edu:

3050/semantic/search?searchText=grave%20yard.

Introduction Methods Result Conclusion & Future Work References

Semantic Similarity Map Construction

Spatialization: how to construct an overview of the semanticdistribution of geographic entities such that it follows acartographic tradition (semantic continent).

Introduction Methods Result Conclusion & Future Work References

Semantic Similarity Map Construction

K-means clustering: group these place embeddings into differentclusters;

Try #(clusters) from 2 to 30 andcompute silhouette coefficient [5] of theclustering results;

#(clusters) = 16 gives the highestsilhouette coefficient;

The descriptions of places in each clusterare combined as one document;

Word clouds are produced from 10 wordwith highest TF-IDF score;

Each cluster is named according to the itstop 10 words.

The word cloud for Educationcluster

Introduction Methods Result Conclusion & Future Work References

Semantic Similarity Map Construction

Dimension reduction: to visualize the semantic distribution ofgeographic entities in a 2-dimensional space

Different dimension deduction methods including PCA andt-SNE are experimented;

t-SNE performs best and the clusters derived from k-meansare still well separated.

Introduction Methods Result Conclusion & Future Work References

Semantic Similarity Map Construction

DBSCAN:

Although t-SNE produces a gooddimension reduction result, somepoints are far away from theircluster centroids and scattered inthe 2D space.

We apply DBSCAN [3] to eachprojected k-means cluster toextract the “core” parts of them.

Visual interpretation are used toselect the parameter combinationfor DBSCAN. (Eps = 1.1 andMinPts = 6)

Introduction Methods Result Conclusion & Future Work References

Semantic Similarity Map Construction

Concave Hull Construction: Chi-shape algorithm [2]

It first constructs a Delaunay triangulation;

It erodes the boundary by deleting boundary’sedges until the longest edge less than athreshold.

A normalized length parameter λp ∈ [1, 100]controls this threshold;

To get optimal λp, a fitness score function [1] isused to balance the complexity and emptynessof the resulting concave hull.

φ(P,D) = Emptiness(P,D) + C ∗ Complexity(P)(1)

P: the derived simple polygon; D: the Delaunaytriangulation of the corresponding point cluster.

Introduction Methods Result Conclusion & Future Work References

Semantic Similarity Map Construction

Concave Hull Construction:

We iterate λp from 1 to 100 and compute the average fitnessscore of all point clusters produced by DBSCAN;

The optimal λp with the lowest average fitness score is 30.

The average fitness score for different λp among all DBSCAN clusters.

Introduction Methods Result Conclusion & Future Work References

Semantic Similarity Map Construction

Publishing the Semantic View Webmap in ArcGIS Online:

We publish this map as a webmap service in ArcGIS Online 2.

2http://www.arcgis.com/home/item.html?id=

7e15f98399ff4788a502fd04320bdafc

Introduction Methods Result Conclusion & Future Work References

Result

We have deployed a web-based user interface3 to showcase thefunctionality using the historical places dataset.

the search result of “grave yard” in the semantic space

3http://stko-testing.geog.ucsb.edu:3050/

Introduction Methods Result Conclusion & Future Work References

Result

the search result of “grave yard” in the geographic space

Introduction Methods Result Conclusion & Future Work References

Result

The pop-up window shows some basic information fordbo:Istre Cemetery Grave Houses.

Introduction Methods Result Conclusion & Future Work References

Conclusion & Future Work

In this work, we presented a semantically enriched geospatialdata visualization and search framework.

In the future, the proposed methods have to be calibrated,e.g., by setting the hyperparameters, based on results ofhuman participants testing.

Introduction Methods Result Conclusion & Future Work References

Paper

This work has been accepted to AGILE 2018.

Gengchen Mai, Krzysztof Janowicz, Sathya Prasad, Bo Yan.Visualizing The Semantic Similarity of Geographic Features, In:Proceedings of AGILE 2018, June 12 - 15, 2018, Lund, Sweden.

Introduction Methods Result Conclusion & Future Work References

References I

[1] Akdag, Fatih, Eick, Christoph F, & Chen, Guoning. 2014.Creating polygon models for spatial clusters. Pages 493–499 of:International Symposium on Methodologies for IntelligentSystems. Springer.

[2] Duckham, Matt, Kulik, Lars, Worboys, Mike, & Galton,Antony. 2008. Efficient generation of simple polygons forcharacterizing the shape of a set of points in the plane. PatternRecognition, 41(10), 3224–3236.

[3] Ester, Martin, Kriegel, Hans-Peter, Sander, Jorg, Xu, Xiaowei,et al. . 1996. A density-based algorithm for discovering clustersin large spatial databases with noise. Pages 226–231 of: Kdd,vol. 96.

Introduction Methods Result Conclusion & Future Work References

References II

[4] Le, Quoc, & Mikolov, Tomas. 2014. Distributedrepresentations of sentences and documents. Pages 1188–1196of: International Conference on Machine Learning.

[5] Rousseeuw, Peter J. 1987. Silhouettes: a graphical aid to theinterpretation and validation of cluster analysis. Journal ofcomputational and applied mathematics, 20, 53–65.

Recommended