Upload
besnik-fetahu
View
298
Download
0
Tags:
Embed Size (px)
DESCRIPTION
The increasing adoption of Linked Data principles has led to an abundance of datasets on the Web. However, take-up and reuse is hindered by the lack of descriptive information about the nature of the data, such as their topic coverage, dynamics or evolution. To address this issue, we propose an approach for creating linked dataset profiles. A profile consists of structured dataset metadata describing topics and their relevance. Profiles are generated through the configuration of techniques for resource sampling from datasets, topic extraction from reference datasets and their ranking based on graphical models. To enable a good trade-off between scalability and accuracy of generated profiles, appropriate parameters are determined experimentally. Our evaluation considers topic profiles for all accessible datasets from the Linked Open Data cloud. The results show that our approach generates accurate profiles even with comparably small sample sizes (10%) and outperforms established topic modelling approaches.
Citation preview
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
A Scalable Approach for Efficiently GeneratingStructured Dataset Topic Profiles
Besnik Fetahu1, Stefan Dietze1, Bernardo Pereira Nunes2, MarcoAntonio Casanova2, Davide Taibi3, Wolfgang Nejdl1
1L3S Research Center, Leibniz Universitat Hannover2Department of Informatics - PUC-Rio
3Institute for Educational Technologies, CNR
May 29, 2014
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
1 Introduction
2 Problem and Motivation
3 ApproachResource Instance and Type ExtractionResource Sampling ApproachesConstructing profiles: Dataset-topic graphTopic Ranking Approaches
4 Experimental SetupBaselines
5 Evaluation ResultsEfficiency of Dataset ProfilingScalability of Dataset Profiling
6 Conclusions
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
1 Introduction
2 Problem and Motivation
3 ApproachResource Instance and Type ExtractionResource Sampling ApproachesConstructing profiles: Dataset-topic graphTopic Ranking Approaches
4 Experimental SetupBaselines
5 Evaluation ResultsEfficiency of Dataset ProfilingScalability of Dataset Profiling
6 Conclusions
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Introduction
• Increasing amount of Web Data
• Data heterogeneity: representation, language, quality anddomains
• Sparsely connected datasets
• Lack of descriptive metadata about datasets
• Exhaustive techniques for data analysis
• Efficiency heavily dependent on information need
• Ease of access and representation of datasets
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Introduction
• Increasing amount of Web Data
• Data heterogeneity: representation, language, quality anddomains
• Sparsely connected datasets
• Lack of descriptive metadata about datasets
• Exhaustive techniques for data analysis
• Efficiency heavily dependent on information need
• Ease of access and representation of datasets
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Introduction
• Increasing amount of Web Data
• Data heterogeneity: representation, language, quality anddomains
• Sparsely connected datasets
• Lack of descriptive metadata about datasets
• Exhaustive techniques for data analysis
• Efficiency heavily dependent on information need
• Ease of access and representation of datasets
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Introduction
• Increasing amount of Web Data
• Data heterogeneity: representation, language, quality anddomains
• Sparsely connected datasets
• Lack of descriptive metadata about datasets
• Exhaustive techniques for data analysis
• Efficiency heavily dependent on information need
• Ease of access and representation of datasets
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Introduction
• Increasing amount of Web Data
• Data heterogeneity: representation, language, quality anddomains
• Sparsely connected datasets
• Lack of descriptive metadata about datasets
• Exhaustive techniques for data analysis
• Efficiency heavily dependent on information need
• Ease of access and representation of datasets
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Introduction
• Increasing amount of Web Data
• Data heterogeneity: representation, language, quality anddomains
• Sparsely connected datasets
• Lack of descriptive metadata about datasets
• Exhaustive techniques for data analysis
• Efficiency heavily dependent on information need
• Ease of access and representation of datasets
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Introduction
• Increasing amount of Web Data
• Data heterogeneity: representation, language, quality anddomains
• Sparsely connected datasets
• Lack of descriptive metadata about datasets
• Exhaustive techniques for data analysis
• Efficiency heavily dependent on information need
• Ease of access and representation of datasets
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
1 Introduction
2 Problem and Motivation
3 ApproachResource Instance and Type ExtractionResource Sampling ApproachesConstructing profiles: Dataset-topic graphTopic Ranking Approaches
4 Experimental SetupBaselines
5 Evaluation ResultsEfficiency of Dataset ProfilingScalability of Dataset Profiling
6 Conclusions
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Why dataset profiling?
• Growing number of datasets: 227 datasets
• Data represented as triples: 31 billion triples
• Multi-lingual content: 18 languages
• Broad set of topics covered
• Inter-dataset links
Domain # Data. TriplesMedia 25 1,841,852,061Geographic 31 6,145,532,484Government 49 13,315,009,400Publications 87 2,950,720,693Cross-domain 41 4,184,635,715Life sciences 41 3,036,336,004User-generated 20 134,127,413
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Why dataset profiling?
Find datasets covering the domain of “Renewable Energy”?
• Sparsity: Datasets that cover the topic?• 38 out of 228 datasets contain
topic coverage information.
• Scalability: Use SPARQL filter clause?• regex(*) filter clause needs
to check all triples that containa specific keyword.
• Disambiguity: What are all the possible forms ofrenewable energy?
• solar energy, wind energy, geothermal. . .
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Why dataset profiling?
Find datasets covering the domain of “Renewable Energy”?
• Sparsity: Datasets that cover the topic?• 38 out of 228 datasets contain
topic coverage information.
• Scalability: Use SPARQL filter clause?• regex(*) filter clause needs
to check all triples that containa specific keyword.
• Disambiguity: What are all the possible forms ofrenewable energy?
• solar energy, wind energy, geothermal. . .
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Why dataset profiling?
Find datasets covering the domain of “Renewable Energy”?
• Sparsity: Datasets that cover the topic?• 38 out of 228 datasets contain
topic coverage information.
• Scalability: Use SPARQL filter clause?• regex(*) filter clause needs
to check all triples that containa specific keyword.
• Disambiguity: What are all the possible forms ofrenewable energy?
• solar energy, wind energy, geothermal. . .
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
1 Introduction
2 Problem and Motivation
3 ApproachResource Instance and Type ExtractionResource Sampling ApproachesConstructing profiles: Dataset-topic graphTopic Ranking Approaches
4 Experimental SetupBaselines
5 Evaluation ResultsEfficiency of Dataset ProfilingScalability of Dataset Profiling
6 Conclusions
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Profiling Overview
1 Metadata extraction
2 Resource sampling
3 Entity/topic extraction
4 Profile graphs
5 Profiles representation
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Profiling Overview
1 Metadata extraction
2 Resource sampling
3 Entity/topic extraction
4 Profile graphs
5 Profiles representation
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Profiling Overview
1 Metadata extraction
2 Resource sampling
3 Entity/topic extraction
4 Profile graphs
5 Profiles representation
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Profiling Overview
1 Metadata extraction
2 Resource sampling
3 Entity/topic extraction
4 Profile graphs
5 Profiles representation
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Profiling Overview
1 Metadata extraction
2 Resource sampling
3 Entity/topic extraction
4 Profile graphs
5 Profiles representation
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Dataset Profiling Example
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Dataset Profiling Example
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Dataset Profiling Example
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Dataset Profiling Example
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Resource Instance and Type Extraction
• Simple SPARQL SELECT queries
• Avg. indexing time 10% (7min) vs. 100% (4hrs).
• Approximately ∼300 million resource instances
10
100
1000
10000
100000
uriburner
bluk-bnb
bio2rdf-kegg-pathway
nomenclator-asturias
b3kat
lobid-resources
twc-ieeevis
educationalprogramss isvu
farmbio-chem
bl
world-bank-linked-data
event-media
eeaeunishungarian-national-library-catalog
bio2rdf-pubmed
linked-user-feedback
oecd-linked-data
bio2rdf-goa
pscs-catalogue
bio2rdf-genbank
linkedmdb
bfs-linked-data
bio2rdf-reactome
british-museum-collection
bio2rdf-ncbigene
datos-bcn-cl
l3s-dblp
bio2rdf-sgd
hellenic-fire-brigade
Log-s
cale
indexin
g t
ime 100%
10%
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Resource Sampling ApproachesEntity and Topic Extraction
Resource Sampling
• random: randomly select a resource instance for analysis
• weighted: weigh a resource by the number of datatypeproperties used to describe it wk = |f (rk)|/max{|f (rj)|}
• centrality: weigh a resource by the number of typesused to describe it ck = |C ′
k |/|C |
Topic Extraction
• Resources as documents by combining all textual literals
• Perform NED1 and extract corresponding DBpedia entities
• Extract topics as DBpedia categories from entities viadcterms:subject
1DBpedia Spotlight
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Resource Sampling ApproachesEntity and Topic Extraction
Resource Sampling
• random: randomly select a resource instance for analysis
• weighted: weigh a resource by the number of datatypeproperties used to describe it wk = |f (rk)|/max{|f (rj)|}
• centrality: weigh a resource by the number of typesused to describe it ck = |C ′
k |/|C |
Topic Extraction
• Resources as documents by combining all textual literals
• Perform NED1 and extract corresponding DBpedia entities
• Extract topics as DBpedia categories from entities viadcterms:subject
1DBpedia Spotlight
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Resource Sampling ApproachesEntity and Topic Extraction
Resource Sampling
• random: randomly select a resource instance for analysis
• weighted: weigh a resource by the number of datatypeproperties used to describe it wk = |f (rk)|/max{|f (rj)|}
• centrality: weigh a resource by the number of typesused to describe it ck = |C ′
k |/|C |
Topic Extraction
• Resources as documents by combining all textual literals
• Perform NED1 and extract corresponding DBpedia entities
• Extract topics as DBpedia categories from entities viadcterms:subject
1DBpedia Spotlight
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Constructing profiles: Dataset-topic graph
1 Profile graph nodes: datasets,
resources, topics
2 Weighted graph edges: ∆〈D, t〉3 Edge weights: ∆〈Di , t〉 6= ∆〈Dj , t〉4 Compute ∆〈Di , t〉 by assessing the
importance of t given the resourcesof Di as prior knowledge
5 The given prior knowledge biasesthe importance of t in the profilegraph towards Di
2
6 Incrementally add datasets in theprofile graph, by simply computingthe weights ∆〈Dk , t〉
2Scott White and Padhraic Smyth. 2003. Algorithms for estimating relativeimportance in networks. In 9th ACM SIGKDD (KDD ’03).
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Topic Ranking ApproachesTopic filtering
Topic pre-filtering:
NTR(t,D) =Φ(·,D)
Φ(t,D)+
Φ(·, ·)Φ(t, ·)
• Filter noisy topics
• φ(·, ·) - number of entitiesassociated with topic t
• Closely related to the tf-idfweighting scheme
Topic Ranking
• PageRank with Priors (PRankP)
• HITS with Priors (HITSP)
• K-Step Markov (KStepM)
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
1 Introduction
2 Problem and Motivation
3 ApproachResource Instance and Type ExtractionResource Sampling ApproachesConstructing profiles: Dataset-topic graphTopic Ranking Approaches
4 Experimental SetupBaselines
5 Evaluation ResultsEfficiency of Dataset ProfilingScalability of Dataset Profiling
6 Conclusions
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Experimental Setup
Datasets and Ground-truth
• 129 dataset from lod-cloud3
• 6 ground-truth datasets with manually assigned topicindicators for their resources
Dataset Properties #Resources
yovistoskos:subject, dbp:{subject, class,
discipline, kategorie, tagline} 62879
oxpoints dcterms:subject,dc:subject 37258
socialsemweb-thesaurusskos:subject, tag:associatedTag,dcterms:subject
2243
semantic-web-dog-food dcterms:subject, dc:subject 20145lak-dataset dcterms:subject, dc:subject 1691
Evaluation Metrics
• NDCG@k (k=1, . . . , 1000)
• Compare the induced ranking by the graphical modelsagainst the ideal ranking
3At the time of experimentation only 129 dataset endpoints wereresponsive.
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Baselines
• tf-idf: Consider resources as documents. Extract foreach dataset the top {50, 100, 150, 200} terms.
• LDA: Consider dataset as documents4. Extract topweighted topic terms. For every dataset extract top {50,100, 150, 200} with a number of topics {10, 20, 30, 40,50}.
4In this case it does not matter if datasets are considered at theresource level or aggregated.
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
1 Introduction
2 Problem and Motivation
3 ApproachResource Instance and Type ExtractionResource Sampling ApproachesConstructing profiles: Dataset-topic graphTopic Ranking Approaches
4 Experimental SetupBaselines
5 Evaluation ResultsEfficiency of Dataset ProfilingScalability of Dataset Profiling
6 Conclusions
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Efficiency of Dataset Profiling
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
1 100 200 300 400 500 600 700 800 900 1000
ND
CG r
anki
ng s
core
NDCG rank
Profiling accuracy for all topic ranking approaches
K-Step Markov + NTRPageRank with priors + NTR
HITS with priors + NTR
LDAtf-idf
0.16
0.18
0.2
0.22
0.24
0.26
0.28
0.3
0.32
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100
Sample Size
K-Step Markov profiling accuracy (Centrality Sampling)
KStepM + NTRKStepM
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Scalability of Dataset Profiling
100
1000
10000
0 20 40 60 80 100 0
0.05
0.1
0.15
0.2
0.25
0.3
Log-
scal
e tim
e pe
rfor
man
ce
ND
CG r
anki
ng s
core
Sample Size
Time Performance vs. Profiling Accuracy
HITS with priors timeHITS with priors ranking
K-Step Markov time
K-Step Markov rankingPageRank with priors time
PageRank with priors ranking
• 5% and 10% already providestable profiling accuracy
• Avg. 7mins for indexing 10%of resources per dataset vs.4hrs per dataset
• 2mins for ranking datasetprofiles with 10% of resourcesvs. 45mins for 100%
• NED runtime 10% vs. 100%?
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Motivation Example Revisited!
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Conclusions and Future Work
• Structured dataset profiles
• Scalable approach through sampling
• Efficient profiling through topic filtering and ranking
• Incremental generation of dataset profiles
• Dataset profiles as a set of links (entity and topic links)
• Provenance information of links (e.g. resources fromwhich an entity is extracted)
• Profiles for dataset recommendation, search, etc.
Resources
• Profiles Endpoint:http://data-observatory.org/lod-profiles/sparql
• Profiles Webpage:http://data-observatory.org/lod-profiles/
Introduction
Problem andMotivation
Approach
ResourceInstance andType Extraction
ResourceSamplingApproaches
Constructingprofiles:Dataset-topicgraph
Topic RankingApproaches
ExperimentalSetup
Baselines
EvaluationResults
Efficiency ofDataset Profiling
Scalability ofDataset Profiling
Conclusions
Thank you! Questions?#eswc2014Fetahu