Resource Classification as the Basis for a Visualization Pipeline
in LOD Scenarios
Oscar Peña del Rio, Unai Aguilera and Diego López de Ipiña
DeustoTech, University of Deusto
Motivation
• The Semantic Web is still waiting for massive traction
• The focus is set on ontology development, provenance and the supporting technology stack
• Little is known about the SW outside the research community
• Potential benefits should be conveyed to non-technical users
Data overview
• Need to get the whole picture of a dataset before working with it
• Time/resource constraints and a lack of expert knowledge are usually present
• Based on Ben Shneiderman’s Overview task (from his famous Visual Information Seeking Mantra)
• Diverse approaches perform basic statistics to fulfill this task (counts, averages, min/max, etc.)
Natural approaches to data exploration
• Takes ideas from Tukey’s Exploratory Data Analysis (EDA)
• Proposes different approaches to get an overview of the data
• Techniques lack the rigor of more formal methodologies, offering a more data-driven perspective instead
• Data discovery is more natural this way, in line with the "follow your nose" principle
Visualization Pipeline
• raw data: rdf, json-ld, …
• analysis operators: statistical analysis, datatype inference…
• visual transformations: how to encode data in visual elements
• recommender engine: learned lessons, best practices & fit models
• end user visualizations: Web browser accessible visualization
Defending visualization as the means for a coherent, understandable Semantic Web beneficial for all actors
Extract resource features
• We focus on the data itself to infer its structure and relevance within the whole dataset
• The data is directly accessed through SPARQL queries
• Property usage: # unique class instances / # instance objects
• Completeness ratio: # values assigned to property / # instance objects
dc:title -> 1
foaf:nick -> 3.4
dc-terms:license -> 0.12
foaf:name -> 1
foaf:title -> 0.36
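The two ratios above can be illustrated with a minimal sketch. It is not the authors' implementation: the slides state that the data is accessed through SPARQL queries, whereas this example works on an in-memory list of (subject, predicate, object) tuples. The function name `property_features` and the sample data are hypothetical, and "property usage" is read here as the share of class instances that carry the property at least once, which is an assumption about the definition.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # prefixed name used only for this illustration


def property_features(triples, cls):
    """Return {property: (usage, completeness)} for the instances of `cls`.

    usage        = instances using the property at least once / instances
                   (assumed reading of "property usage")
    completeness = values assigned to the property / instances
    """
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == cls}
    if not instances:
        return {}

    subjects_using = defaultdict(set)  # property -> instances that use it
    value_count = defaultdict(int)     # property -> number of assigned values
    for s, p, o in triples:
        if s in instances and p != RDF_TYPE:
            subjects_using[p].add(s)
            value_count[p] += 1

    n = len(instances)
    return {
        p: (len(subjects_using[p]) / n, value_count[p] / n)
        for p in value_count
    }


# Hypothetical sample data: two persons, one of them with three nicknames.
sample = [
    ("ex:a", RDF_TYPE, "foaf:Person"),
    ("ex:b", RDF_TYPE, "foaf:Person"),
    ("ex:a", "foaf:name", "Ann"),
    ("ex:b", "foaf:name", "Bob"),
    ("ex:a", "foaf:nick", "annie"),
    ("ex:a", "foaf:nick", "an"),
    ("ex:a", "foaf:nick", "a"),
]
features = property_features(sample, "foaf:Person")
```

In this sample, `foaf:name` is present once on every instance, while `foaf:nick` has more values than there are instances, which is how a ratio above 1 (such as the 3.4 shown for foaf:nick) can arise.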
Primitive datatype inference
• Required to understand how each property may be interpreted, the operations it allows, and how they relate to each other
• We define the following classification categories:
• Integer
• Float
• Boolean
• IRI
• String
• Geographical component
• Datetime component
• Categorical data
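As an illustration of what such a classification could look like, the following is a minimal heuristic sketch over the lexical values of a single property. It is not the inference algorithm evaluated in this work: the function name `infer_category`, the order of the checks, the coordinate pattern and the 0.5 distinct-value threshold for categorical data are all assumptions made for the example.

```python
import re
from datetime import datetime
from urllib.parse import urlparse

_INT = re.compile(r"[+-]?\d+")
_FLOAT = re.compile(r"[+-]?(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?")
# Assumed "lat,long" pair encoding; real datasets may encode geodata differently.
_COORD = re.compile(r"\s*-?\d{1,2}(\.\d+)?\s*,\s*-?\d{1,3}(\.\d+)?\s*")


def _is_datetime(value):
    try:
        datetime.fromisoformat(value)  # ISO 8601 dates and datetimes
        return True
    except ValueError:
        return False


def _is_iri(value):
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def infer_category(values, categorical_threshold=0.5):
    """Guess one category for a property from its string values."""
    if not values:
        raise ValueError("at least one value is required")
    values = [v.strip() for v in values]

    if all(v.lower() in ("true", "false") for v in values):
        return "Boolean"
    if all(_INT.fullmatch(v) for v in values):
        return "Integer"
    if all(_FLOAT.fullmatch(v) for v in values):
        return "Float"
    if all(_is_datetime(v) for v in values):
        return "Datetime"
    if all(_COORD.fullmatch(v) for v in values):
        return "Geographical"
    if all(_is_iri(v) for v in values):
        return "IRI"
    # Few distinct values relative to the total suggests categorical data.
    if len(set(values)) / len(values) <= categorical_threshold:
        return "Categorical"
    return "String"
```

Checks run from the most to the least restrictive, so a property is labelled String only when nothing more specific matches all of its values.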
Results
• 190 properties evaluated (149 unique)
• 5 datasets (approx. 10M triples)
• Diverse topics
• Inference algorithm tested against the classification agreed on by 6 experts (>80% agreement, 5 out of 6)
Dataset            TP    TN    FP    FN    Cat   Correct
Air quality        17    160   2     10    5     93.65%
Restaurants        17    201   3     17    5     91.6%
Historical sites   14    165   4     13    3     91.33%
MORElab            56    399   15    13    12    94.2%
Teseo              22    162   4     1     3     97.35%
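The slides do not state how the Correct column is computed, but its values are consistent with plain accuracy, (TP + TN) / (TP + TN + FP + FN). The short sketch below recomputes it from the counts in the table; the `accuracy` helper and the `results` dictionary are names introduced for this example only.

```python
def accuracy(tp, tn, fp, fn):
    """Share of correct decisions among all decisions."""
    return (tp + tn) / (tp + tn + fp + fn)


# (TP, TN, FP, FN) per dataset, copied from the results table.
results = {
    "Air quality": (17, 160, 2, 10),
    "Restaurants": (17, 201, 3, 17),
    "Historical sites": (14, 165, 4, 13),
    "MORElab": (56, 399, 15, 13),
    "Teseo": (22, 162, 4, 1),
}

percentages = {
    name: round(100 * accuracy(*counts), 2)
    for name, counts in results.items()
}

for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}%")
```

The Cat column does not enter this computation.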
Conclusions & future steps
• Common pitfalls addressed
• Missing property datatypes & ranges
• Incorrect typing / usage
• Redundancy
• Most instances typed as plain, literal strings
• Feed all the features to a classifier in order to create Entity Visualization Templates (work in progress)
• Recommend coherent visual representations for each template