Resource Classification as the Basis for a Visualization Pipeline in LOD Scenarios

Oscar Peña del Rio, Unai Aguilera and Diego López de Ipiña

DeustoTech, University of Deusto

Motivation

• The Semantic Web is still waiting for massive traction

• The focus is set on ontology development, provenance and the supporting technology stack

• Little is known about the SW outside the research community

• Potential benefits should be addressed to non-technical user profiles

Data overview

• Need to get the whole picture of a dataset before working with it

• Usually time/resource constraints and lack of expert knowledge are present

• Based on Ben Shneiderman’s Overview task (from his famous Visual Information Seeking Mantra)

• Diverse approaches perform basic statistics to fulfill this task (counts, averages, min/max, etc.)
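
As an illustration of the basic statistics mentioned above, a minimal sketch follows. The function name `overview` and the sample values are assumptions made for this example, not part of the presented system.

```python
from statistics import mean

def overview(values):
    """Return basic overview statistics for a list of numeric values."""
    return {
        "count": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }

# Hypothetical sample of numeric property values
summary = overview([3, 1, 4, 1, 5])
print(summary)
```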

Natural approaches to data exploration

• Takes some ideas from Tukey’s Exploratory Data Analysis (EDA)

• Proposes different approaches to get an overview of the data

• These techniques lack the rigor of more formal methodologies; they take a more data-driven perspective

• Data discovery is more natural this way, in line with the "follow your nose" principle

Visualization Pipeline

• Raw data: rdf, json-ld, …

• Analysis operators: statistical analysis, datatype inference…

• Visual transformations: how to encode data in visual elements

• Recommender engine: learned lessons, best practices & fit models

• End user visualizations: Web browser accessible visualization

Defending visualization as the means for a coherent, understandable Semantic Web beneficial for all actors

Extract resource features

• We focus on the data itself to infer its structure and relevance within the whole dataset

• The data is directly accessed through SPARQL queries

• Property usage: # unique class instances / # instance objects

• Completeness ratio: # values assigned to property / # instance objects

dc:title -> 1
foaf:nick -> 3.4
dc-terms:license -> 0.12
foaf:name -> 1
foaf:title -> 0.36
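
The two ratios can be sketched as follows. This is one possible reading of the definitions, not the authors' implementation: usage is taken as the share of class instances that use a property at least once, and completeness as the number of values assigned per instance (which can exceed 1). The slides access the data through SPARQL; this sketch assumes the triples are already available as in-memory tuples, and `property_features` and the sample data are hypothetical.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

def property_features(triples, cls):
    """Return {property: (usage, completeness)} for instances of `cls`.

    usage        = distinct instances using the property / instances
    completeness = values assigned to the property / instances
    (assumed interpretation of the slide's definitions)
    """
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == cls}
    if not instances:
        return {}
    subjects = defaultdict(set)  # property -> distinct instances using it
    values = defaultdict(int)    # property -> total number of values
    for s, p, o in triples:
        if s in instances and p != RDF_TYPE:
            subjects[p].add(s)
            values[p] += 1
    n = len(instances)
    return {p: (len(subjects[p]) / n, values[p] / n) for p in subjects}

# Hypothetical sample data
triples = [
    ("ex:a", "rdf:type", "foaf:Person"),
    ("ex:b", "rdf:type", "foaf:Person"),
    ("ex:a", "foaf:name", "Alice"),
    ("ex:b", "foaf:name", "Bob"),
    ("ex:a", "foaf:nick", "al"),
    ("ex:a", "foaf:nick", "ally"),
    ("ex:a", "foaf:nick", "ali"),
]
features = property_features(triples, "foaf:Person")
print(features)
```

In this sample, foaf:name is used once by every instance, while foaf:nick is used by only one instance but carries several values, so its completeness is above 1.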

Primitive datatype inference

• Required to understand how each property may be interpreted, the operations it allows, and how properties relate to each other

• We define the following classification categories:

• Integer

• Float

• Boolean

• IRI

• String

• Geographical component

• Datetime component

• Categorical data
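
A rough sketch of how such a classification could work on the lexical form of property values is shown below. It is not the inference algorithm evaluated in this work: the regular expressions, the majority vote, the `max_distinct_ratio` threshold for categorical data and the function names are all assumptions, and the geographical category is left out.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed pattern: scheme://rest, with no whitespace
IRI_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://\S+$")

def infer_value(value):
    """Guess a primitive category from the lexical form of one value."""
    text = value.strip()
    if text.lower() in ("true", "false"):
        return "Boolean"
    if re.fullmatch(r"[+-]?\d+", text):
        return "Integer"
    if re.fullmatch(r"[+-]?(\d+\.\d*|\.\d+)([eE][+-]?\d+)?", text):
        return "Float"
    if IRI_RE.match(text):
        return "IRI"
    try:
        datetime.fromisoformat(text)  # ISO 8601 dates and datetimes only
        return "Datetime"
    except ValueError:
        return "String"

def infer_property(values, max_distinct_ratio=0.5):
    """Majority vote over all values; repetitive strings become Categorical."""
    if not values:
        raise ValueError("at least one value is required")
    votes = Counter(infer_value(v) for v in values)
    label, _ = votes.most_common(1)[0]
    if label == "String" and len(set(values)) / len(values) <= max_distinct_ratio:
        return "Categorical"
    return label

print(infer_property(["red", "blue", "red", "red"]))
```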

Results

• 190 properties evaluated (149 unique)

• 5 datasets (approx. 10M triples)

• Diverse topics

• Inference algorithm evaluated against the agreement of 6 experts (>80% agreement, i.e. 5 out of 6)

Dataset            TP    TN    FP    FN    Cat    Correct
Air quality        17    160   2     10    5      93.65%
Restaurants        17    201   3     17    5      91.6%
Historical sites   14    165   4     13    3      91.33%
MORElab            56    399   15    13    12     94.2%
Teseo              22    162   4     1     3      97.35%
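
The "Correct" column is consistent with accuracy computed as (TP + TN) / (TP + TN + FP + FN). The slides do not state the formula, so this is an assumption, but it reproduces all five reported percentages. A short check:

```python
def accuracy(tp, tn, fp, fn):
    """Share of correct decisions: (TP + TN) / all decisions."""
    return (tp + tn) / (tp + tn + fp + fn)

# (TP, TN, FP, FN) as reported in the results table
results = {
    "Air quality": (17, 160, 2, 10),
    "Restaurants": (17, 201, 3, 17),
    "Historical sites": (14, 165, 4, 13),
    "MORElab": (56, 399, 15, 13),
    "Teseo": (22, 162, 4, 1),
}

percentages = {
    name: round(accuracy(*counts) * 100, 2) for name, counts in results.items()
}
for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```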

Conclusions & future steps

• Common pitfalls addressed

• Missing property datatypes & ranges

• Incorrect typing / usage

• Redundancy

• Most instances typed as plain literal strings

• Feed all the features to a classifier in order to create Entity Visualization Templates (work in progress)

• Recommend coherent visual representations for each template

Thank you

Oscar Peña del Rio

[email protected]
