Towards semantic assessment of summarizability in self-service BI
BigNovelTI, September 24 2017
(co-located with ADBIS 2017, Cyprus)
Luis-Daniel Ibañez, Elena Simperl, University of Southampton (UK)
Jose Norberto Mazón, Universidad de Alicante (Spain), Twitter: @jnmazon, email: [email protected]
Data, data everywhere
• The promise of (big) data
• More data, more insights, better decisions
Business Intelligence & OLAP
• Business Intelligence (BI) used for decision support in organizations
• OLAP (On-Line Analytical Processing)
• Multidimensional modeling
  • Fact
    • Events of interest for analysis (e.g. sales, treatments of patients...)
    • Measures
  • Dimension
    • Specifies different ways the data can be viewed, aggregated and sorted (e.g. time, store, customer...)
    • Hierarchies
• Multidimensional model is implemented in cubes
• Example: cube sales data
• A product is sold in a supermarket at a specific date
[Figure: example cube. Fact: Product sales (measure: Quantity). Dimensions and their hierarchies: Time (day, month, year, ...), Supermarket (city, province, country), Product (product, type). Sample members: Madrid, Barcelona, Alicante; Drink, Food, ...; January, May.]
• OLAP Algebra for data analysis
• Roll-up and drill-down to navigate through hierarchies
• Aggregation functions applied to measures
• avg, sum, min, etc.
• Drill-through from one fact to another
Business Intelligence & OLAP
• Multidimensional (MD) query structure [Rafanelli et al 1996]
  • Phenomenon of interest, i.e. measure to be analyzed
  • Category attributes, i.e. context for analyzing the phenomenon of interest (dimensions)
  • Aggregation sets, i.e. subsets of the phenomenon of interest according to several category attributes
  • Aggregation functions, i.e. operators to apply on the aggregation sets to summarize their factual data
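The four parts of this query structure can be sketched in Python as follows. This is only an illustration: the `MDQuery` class, the `evaluate` function and the sample facts are hypothetical names and data, not part of the slides or of Rafanelli et al.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class MDQuery:
    measure: str                                # phenomenon of interest
    category_attributes: Sequence[str]          # context (dimension levels)
    aggregate: Callable[[List[float]], float]   # aggregation function


def evaluate(query: MDQuery, facts: List[dict]) -> Dict[Tuple, float]:
    """Build the aggregation sets, then summarize each one."""
    aggregation_sets = defaultdict(list)
    for fact in facts:
        # One aggregation set per combination of category attribute values.
        key = tuple(fact[attr] for attr in query.category_attributes)
        aggregation_sets[key].append(fact[query.measure])
    return {key: query.aggregate(values) for key, values in aggregation_sets.items()}


# Hypothetical fact rows, invented for illustration only.
facts = [
    {"city": "Alicante", "type": "Food", "quantity": 3},
    {"city": "Alicante", "type": "Food", "quantity": 5},
    {"city": "Madrid", "type": "Drink", "quantity": 2},
]

result = evaluate(MDQuery("quantity", ["city", "type"], sum), facts)
print(result)
```

Swapping `sum` for `max` or `min` changes only the aggregation function; the aggregation sets stay the same.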
• OLAP cubes are easy to use
• Multidimensional queries
• Fast and simple data aggregation
• Data analysis from different contexts
Sales, by product.type, for supermarket.region = “Valencia Region”

                         food               drink
                     Frozen   Fresh    Spirits   Alcohol
Alicante   Albatera     100     200        300       400
           Elche        500     600        700       800
Valencia   Burjasot     900    1000       1100      1200
           Cullera     1300    1400       1500      1600

Sales’ (rolled up to province and product type)

             Food   Drink
Alicante     1400    2200
Valencia     4600    5400
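The roll-up in this Sales example can be reproduced with a short Python sketch. The assignment of the sixteen values to cells is an assumption, because the slide's grid is flattened in this transcript. It was chosen so that the totals match the aggregated values 1400, 2200, 4600 and 5400. The names `roll_up`, `CITY_TO_PROVINCE` and `SUBTYPE_TO_TYPE` are illustrative.

```python
from collections import defaultdict

# Many-to-one roll-up maps: city -> province, product subtype -> product type.
CITY_TO_PROVINCE = {"Albatera": "Alicante", "Elche": "Alicante",
                    "Burjasot": "Valencia", "Cullera": "Valencia"}
SUBTYPE_TO_TYPE = {"Frozen": "Food", "Fresh": "Food",
                   "Spirits": "Drink", "Alcohol": "Drink"}

# Assumed cell layout: one row of four values per city.
CITIES = ["Albatera", "Elche", "Burjasot", "Cullera"]
SUBTYPES = ["Frozen", "Fresh", "Spirits", "Alcohol"]
ROWS = [[100, 200, 300, 400],
        [500, 600, 700, 800],
        [900, 1000, 1100, 1200],
        [1300, 1400, 1500, 1600]]
SALES = {(city, subtype): value
         for city, row in zip(CITIES, ROWS)
         for subtype, value in zip(SUBTYPES, row)}


def roll_up(cells):
    """Sum every cell into its (province, product type) parent cell."""
    totals = defaultdict(int)
    for (city, subtype), amount in cells.items():
        totals[(CITY_TO_PROVINCE[city], SUBTYPE_TO_TYPE[subtype])] += amount
    return dict(totals)


sales_rolled_up = roll_up(SALES)
print(sales_rolled_up)
```

Because every city maps to exactly one province and every subtype to exactly one type, each cell is counted once and the grand total is preserved.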
Business Intelligence & OLAP
• Data within “traditional” BI
• Known (internal) data sources
• MD design for specific data sources
• Data integration at design time
• Data sources owned by decision maker
• Only domain-expert users have access to data
Sources:
https://pixabay.com/static/uploads/photo/2015/11/03/09/30/away-1020435_960_720.jpg
https://upload.wikimedia.org/wikipedia/commons/a/ab/Open_Definition_logo.png
USING OPEN DATA FOR DECISION MAKING
• More data, more insights, better decisions
• Considering more data?
• Good news! High availability of open data
Self-service Business Intelligence
• New type of BI
• Unknown (external) data sources
• MD design for unseen data sources
• Data integration at runtime
• Open data as source
• Everybody can access data (including non-expert users)
• Situational BI, Exploratory BI, Live BI, Self-service BI, Open BI, etc.
Self-service Business Intelligence
• Self-service BI [Abelló et al 2013]
• Enabling non-expert users to make well-informed decisions by enriching the decision process with new data not owned and controlled by the decision maker
• Search, extraction, integration, and storage for reuse or sharing should be accomplished by non-expert decision makers without any intervention by designers or programmers
Self-service Business Intelligence
[Abelló et al 2013]
Some Self-service BI Challenges
• Multidimensional divide
• Open data unknown at design time
• Incorrect multidimensional elements
• Data divide
• Non-expert users (not enough skills for data analysis)
• Meaningless queries
Avoid summarizability problems in non-expert queries
Self-service BI Challenges
• Summarizability
• Multidimensional models must ensure that measures are aggregated correctly along dimensions [Lenz and Shoshani, 1997]
Self-service BI Challenges
• Summarizability
• At syntactic level
• Many-to-one relationship between dimension hierarchy levels [Mazón et al 2009]
[Figure: the Sales cube for supermarket.region = “Valencia Region”, with cities Albatera and Elche (Alicante) and Burjasot and Cullera (Valencia) and product subtypes Frozen and Fresh (food) and Spirits and Alcohol (drink), rolled up to Sales’ by province and product type. Alicante: Food 1400, Drink 2200. Valencia: Food 4600, Drink 5400.]
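A syntactic check of this kind can be sketched in Python: given the (child, parent) pairs that link two hierarchy levels, it reports every child member that rolls up to more than one parent. The function names and the violating pair are hypothetical, and the sketch checks only that each child has at most one parent, not that every child has one.

```python
from collections import defaultdict


def non_strict_members(child_parent_pairs):
    """Return the child members that roll up to more than one parent."""
    parents = defaultdict(set)
    for child, parent in child_parent_pairs:
        parents[child].add(parent)
    return {child: sorted(ps) for child, ps in parents.items() if len(ps) > 1}


def is_many_to_one(child_parent_pairs):
    """True when no child member has two or more parents."""
    return not non_strict_members(child_parent_pairs)


# City -> province links as in the Sales example.
strict = [("Albatera", "Alicante"), ("Elche", "Alicante"),
          ("Burjasot", "Valencia"), ("Cullera", "Valencia")]

# Hypothetical violation: the same city assigned to two provinces,
# so its sales would be counted twice when rolling up.
broken = strict + [("Elche", "Valencia")]

print(is_many_to_one(strict))
print(non_strict_members(broken))
```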
Self-service BI Challenges
• Summarizability
• At the semantic level [Niemi et al, 2014]
• Type compatibility
• Aggregation function
• Measure
• Dimension
[Figure: Stock level cube for supermarket.region = “Valencia Region”, by time (June: Day1, Day30; July: Day1, Day31) and supermarket (Alicante: Albatera, Elche; Valencia: Burjasot, Cullera), with cell values from 100 to 1600, and its roll-up Stock level’ by month and province, showing the values 1400 and 800.]
Self-service BI Challenges
• Type compatibility [Lenz and Shoshani, 1997]
• Flow: measure recorded at the end of a period
• Monthly number of births, annual income, etc.
• Stock: measure recorded at a particular point in time
• Inventory of cars, number of citizens, etc.
• Value per unit: same as stock but unit is not a ratio
• Item price, cost per unit manufactured, exchange rate, etc.
[Tables: type compatibility through the Time dimension; type compatibility through non-Time dimensions]
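The two compatibility tables are not legible in this transcript. The Python sketch below encodes the rules as they are commonly summarized from Lenz and Shoshani (1997): min, max and avg apply to all three measure types, sum applies to flows, sum applies to stocks only along non-temporal dimensions, and sum never applies to value-per-unit measures. Treat this encoding as an assumption to be checked against the paper. The function name and the stock figures are hypothetical.

```python
MEASURE_TYPES = {"flow", "stock", "value_per_unit"}


def is_type_compatible(measure_type, agg_function, temporal_dimension):
    """Assumed rule set: only SUM is restricted by the measure type."""
    if measure_type not in MEASURE_TYPES:
        raise ValueError(f"unknown measure type: {measure_type}")
    agg = agg_function.lower()
    if agg in {"min", "max", "avg"}:
        return True
    if agg != "sum":
        raise ValueError(f"unsupported aggregation function: {agg_function}")
    if measure_type == "flow":
        return True
    if measure_type == "stock":
        # Stocks can be added across places, but not across points in time.
        return not temporal_dimension
    return False  # value per unit is never summed


# Hypothetical province-level stock on two days of the same month.
june_stock = {"Day1": 600, "Day30": 800}
summed_over_time = sum(june_stock.values())  # counts the same goods twice
month_end_stock = june_stock["Day30"]        # one meaningful monthly value

print(is_type_compatible("stock", "sum", temporal_dimension=True))
print(summed_over_time, month_end_stock)
```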
Self-service BI Challenges
• Statistical linked open data as source for Self-service BI
• RDF Data Cube
• https://www.w3.org/TR/vocab-data-cube
• Vocabulary for publishing multidimensional data, such as statistics, on the Web
• Building upon Statistical Data and Metadata Exchange (SDMX)
• Collect, exchange, process, and disseminate aggregate statistics
• http://sdmx.org/docs/2_0/SDMX_2_0%20SECTION_02_InformationModel.pdf
Self-service BI Challenges
• Summarizability is not considered in current statistical open data tools
• World DataBank [http://databank.worldbank.org/]
Research goal
• Some works propose mechanisms for supporting users in using statistical open data
• RDF Data Cube extension QB4OLAP [Etcheverry et al 2014]
• Survey on exploratory BI & Semantic Web [Abelló et al 2015]
• OpenGovIntelligence - http://www.opengovintelligence.eu/
• OpenCube - http://opencube-project.eu/ & http://opencube-toolkit.eu/
Research goal
• Unfortunately, current research does not ensure summarizability
• There is always a manual step for ensuring type compatibility
• Inconsistent with a self-service BI scenario that reuses open data
Summarizability-aware querying of statistical open data considering type compatibility
Summarizability-aware querying
*NO means that there is not enough information to support the user, but there may be information on similar situations. HINT: new knowledge from the user to enrich the KB
Summarizability KB
• Extending RDF Data Cube and QB4OLAP to include type compatibility
• https://www.w3.org/TR/vocab-data-cube/
• Steps to automate the enrichment of QB data sets with specific QB4OLAP semantics [Varga et al 2016]
• Also, provenance information is included
Summarizability KB
• A measure can be aggregated with a given aggregation function along a given dimension
Summarizability KB
• Created from USEWOD
• http://usewod.org/
• Usage Analysis and the Web of Data
• DBpedia logs
• Reference data set for research on query logs of Linked Data endpoints
• [Luczak-Roesch et al 2016]
Summarizability KB
prefix dbpprop: <http://dbpedia.org/property/>
SELECT ?movies max(?runtime) min(?runtime) avg(?runtime)
WHERE
{
?movies <http://dbpedia.org/ontology/runtime> ?runtime .
?movies <http://dbpedia.org/ontology/starring>
<http://dbpedia.org/resource/Clint_Eastwood>.
}
group by ?movies
The max, min and average of the movie runtime can be calculated
< runtime , MAX , movie >
< runtime , MIN , movie >
< runtime , AVG , movie >
Proposed approach (1)
• STEP 0 Parse the SPARQL query (with ARQ)
• STEP 1 Look for aggregation functions in the SELECT clause
  • avg, sum, max, min (future work may consider UDFs)
• For each aggregation function
  • STEP 2 Get the variable within the aggregation function
  • STEP 3 Extract the basic graph pattern (triple pattern with variables) containing the measure
  • STEP 4 From the basic graph pattern, assume that the measure is the object and then extract the predicate (this predicate is the measure)
  • STEP 5 Determine the dimension
Proposed approach (2)
• STEP 5 Determine the dimension
  • If there is a “group by” then
    • Get the variable in the group by
    • Look for that variable among the subjects in the where clause
    • Get the predicates and look for the most specific domain (this is the dimension)
  • Else
    • Get the variable in the subject of the basic graph pattern containing the measure
    • Look for "a" or "rdf:type" to find the type. If not there, then look for the domain of the predicate (this is the dimension)
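Steps 1 to 5 can be sketched in Python for simple queries such as the Clint Eastwood example. This is a simplification under stated assumptions, not the authors' implementation. It uses regular expressions instead of the ARQ parser named in STEP 0 and handles only flat triple patterns. The `PREDICATE_DOMAINS` and `CLASS_DEPTH` tables are hypothetical stand-ins for an ontology lookup of rdfs:domain and class specificity.

```python
import re

AGG_CALL = re.compile(r"\b(avg|sum|max|min)\s*\(\s*(\?\w+)\s*\)", re.IGNORECASE)
TERM = r"(\?\w+|<[^>]+>|\w+:\w+|a\b)"
TRIPLE = re.compile(TERM + r"\s+" + TERM + r"\s+" + TERM)
GROUP_BY = re.compile(r"group\s+by\s+(\?\w+)", re.IGNORECASE)
RDF_TYPE = {"a", "rdf:type"}

# Hypothetical ontology knowledge; a real system would query the ontology.
PREDICATE_DOMAINS = {
    "<http://dbpedia.org/ontology/runtime>": "<http://dbpedia.org/ontology/Work>",
    "<http://dbpedia.org/ontology/starring>": "<http://dbpedia.org/ontology/Work>",
}
CLASS_DEPTH = {"<http://dbpedia.org/ontology/Work>": 1}


def most_specific_domain(predicates):
    domains = [PREDICATE_DOMAINS[p] for p in predicates if p in PREDICATE_DOMAINS]
    return max(domains, key=lambda c: CLASS_DEPTH.get(c, 0), default=None)


def extract_triples(query):
    """Return <measure, aggregation function, dimension> triples for a query."""
    where = re.search(r"\bwhere\b", query, re.IGNORECASE)
    select_part = query[:where.start()]
    body = query[query.index("{", where.end()) + 1:query.rindex("}")]
    tail = query[query.rindex("}") + 1:]
    patterns = TRIPLE.findall(body)           # (subject, predicate, object)
    group = GROUP_BY.search(tail)
    results = []
    for function, variable in AGG_CALL.findall(select_part):    # steps 1 and 2
        matching = [t for t in patterns if t[2] == variable]    # step 3
        if not matching:
            continue
        subject, measure, _ = matching[0]                       # step 4
        if group:                                               # step 5
            predicates = [p for s, p, _ in patterns if s == group.group(1)]
            dimension = most_specific_domain(predicates)
        else:
            types = [o for s, p, o in patterns if s == subject and p in RDF_TYPE]
            dimension = types[0] if types else PREDICATE_DOMAINS.get(measure)
        results.append((measure, function.upper(), dimension))
    return results


QUERY = """
SELECT ?movies max(?runtime) min(?runtime) avg(?runtime)
WHERE
{
?movies <http://dbpedia.org/ontology/runtime> ?runtime .
?movies <http://dbpedia.org/ontology/starring>
<http://dbpedia.org/resource/Clint_Eastwood>.
}
group by ?movies
"""

for triple in extract_triples(QUERY):
    print(triple)
```

A production version would rely on a real SPARQL parser, since regular expressions miss nested patterns, property paths, OPTIONAL blocks and literals.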
Structure of Type Compatibility KB
<http://dbpedia.org/ontology/runtime> TypeCompatibility :blank1
:blank1 dimension <http://dbpedia.org/ontology/Work>
:blank1 aggFunction qb4olap:avg
:blank1 query "...."
:blank1 provenance "dbpedia"
prefix dbpprop: <http://dbpedia.org/property/>
SELECT ?movies max(?runtime) min(?runtime) avg(?runtime)
WHERE
{
?movies <http://dbpedia.org/ontology/runtime> ?runtime .
?movies <http://dbpedia.org/ontology/starring>
<http://dbpedia.org/resource/Clint_Eastwood>.
}
group by ?movies
rdfs:domain <http://dbpedia.org/ontology/Work>
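A KB entry of this shape, and the YES/NO lookup it supports, can be sketched in Python. The class and method names are hypothetical. The property names (`TypeCompatibility`, `dimension`, `aggFunction`, `query`, `provenance`) are copied from the slide, and the in-memory list merely stands in for an RDF store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeCompatibilityEntry:
    measure: str
    dimension: str
    agg_function: str
    query: str
    provenance: str


class SummarizabilityKB:
    def __init__(self):
        self._entries = []

    def add(self, entry):
        self._entries.append(entry)

    def check(self, measure, agg_function, dimension):
        """YES if this combination was observed; NO means not enough information."""
        for e in self._entries:
            if (e.measure, e.agg_function, e.dimension) == (measure, agg_function, dimension):
                return "YES"
        return "NO"

    def similar(self, measure):
        """Entries about the same measure, usable as hints after a NO."""
        return [e for e in self._entries if e.measure == measure]


def to_triples(entry, node):
    """Serialize one entry as the five statements shown on the slide."""
    return [
        (entry.measure, "TypeCompatibility", node),
        (node, "dimension", entry.dimension),
        (node, "aggFunction", entry.agg_function),
        (node, "query", entry.query),
        (node, "provenance", entry.provenance),
    ]


RUNTIME = "<http://dbpedia.org/ontology/runtime>"
WORK = "<http://dbpedia.org/ontology/Work>"

kb = SummarizabilityKB()
entry = TypeCompatibilityEntry(RUNTIME, WORK, "qb4olap:avg", "....", "dbpedia")
kb.add(entry)

print(kb.check(RUNTIME, "qb4olap:avg", WORK))
print(kb.check(RUNTIME, "qb4olap:sum", WORK))
print(to_triples(entry, ":blank1"))
```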
Conclusions
• Self-service BI uses external (open) data for supporting decision making
• Decision makers are likely to make meaningless queries that lead to summarizability problems
• Framework for semantic assessment of summarizability in self-service BI
  • Based on DBpedia logs of the USEWOD dataset
  • Queries using aggregation functions (8946 out of 35M)
• Future work
  • Use other log sources with a higher percentage of queries using aggregation functions
Source: https://c1.staticflickr.com/7/6156/6164861516_05d435e7b4_b.jpg
Work in progress…
Let’s unleash statistical open data potential!