CONCEPTUALIZING ANALYTICS
Conceptual Modeling and Data Analytics – A Tutorial

Christoph G. Schuetz, Michael Schrefl

Thanks: Ilko Kovacic, Median Hilal, and Georg Grossmann (UniSA) for material that served as the basis for parts of this tutorial.
Table of Contents
Introduction and Background
Acquisition and Recording
Extraction, Cleaning, and Annotation
Integration, Aggregation, and Representation
Analysis and Modeling
Interpretation and Action
Open Issues
INTRODUCTION AND BACKGROUND
Scope of this Tutorial
How may conceptual modeling facilitate data analytics?
What is Data Analytics?

Definition from industry [32, p. 37]
“data-based applications of quantitative analysis methods”

Another definition from industry [29, p. 16]
“examination of information to uncover insights that give a business person the knowledge to make informed decisions”
“Analytics tools enable people to query and analyze information”

Definition from academia [31, p. 329]
“discovery and communication of meaningful patterns in data”
“Organizations apply analytics to their data in order to describe, predict, and improve organizational performance.”
Data Analytics

• Descriptive: What happened? Multidimensional analysis (OLAP), statistical analysis of the past. Dashboards, scorecards, key performance indicators.
• Predictive: Use of statistical methods in an attempt to predict what will happen in the future.
• Prescriptive: What actions should be taken? Alerts and actions triggered by analysis results. Active data warehouse.

Data analytics must be viewed in the broader context of business intelligence.
What is Business Intelligence?

[Diagram: Data Integration (Integrate & Cleanse Data from Multiple Sources) → Data Warehousing (Store Integrated Data) → Business Intelligence (Present & Analyze Information)]

Figure: The relationship between data integration, data warehousing, and business intelligence [29, p. 15]
What about big data analytics?

Compared to traditional business intelligence (BI), analysis of big data is really not so different from a conceptual point of view.

Acquisition/Recording → Extraction/Cleaning/Annotation → Integration/Aggregation/Representation → Analysis/Modeling → Interpretation/Action

Figure: The (big) data analysis pipeline (adapted from [3])

One may argue that business intelligence has always been about the analysis of what constituted “big” data at the time [32]. The specific technologies, however, may differ.
Scope of this Tutorial

How may conceptual modeling facilitate data analytics?

This tutorial follows the steps of the (big) data analysis pipeline and illustrates selected examples of conceptual modeling supporting each step.
Running Example: Precision Dairy Farming

From this ...

... to that!
The AgriProKnow Project

Joint research effort between various companies and research institutions on data analytics in dairy farming:

• Smartbow develops smart animal eartags to track activity.
• Wasserbauer develops automated feeding machines.
• The University of Veterinary Medicine Vienna provides the domain knowledge.
• Johannes Kepler University (JKU) Linz has statistical and business intelligence (BI) knowledge for data analysis.

Project goal: Building an active semantic data warehouse for precision dairy farming [28]
Further Reading
C. G. Schuetz, S. Schausberger, M. Schrefl. Building an active semantic data warehouse for precision dairy farming. Journal of Organizational Computing and Electronic Commerce, 28(2), 122-144, 2018.
ACQUISITION AND RECORDING
Acquisition and Recording

Interesting data originate from various sources such as operational databases, sensors, or the web.

Possible storage forms for (big) data with support for data analysis are:

• Data Warehouse: A clean and integrated database providing data of interest in a format fit for analysis
• Data Lake: Store the raw data as-is, possibly with additional metadata to help retrieve datasets. Data are transformed when needed for the analysis.
The Data Warehouse is Dead!

“Barely a company uses a data warehouse anymore because it’s too cumbersome to build it. (But the multidimensional model remains relevant.)”
— An actual colleague from a consulting firm

“You’re using a data warehouse to analyze sensor data? Really? But everyone uses data stream processing for that.”
— An actual attendee of EDOC 2016
Is The Data Warehouse Dead?

“What is a data warehouse? Our view is that the data is the warehouse, and our data just happens to be managed with a relational database today. Our data could be managed on a non-relational platform, and it would still be a warehouse. (...) The idea that Hadoop would replace a warehouse is misguided because the data and its platform are two non-equivalent layers of the data warehouse architecture. It’s more to the point to conjecture that Hadoop might replace an equivalent data platform, such as a relational database management system.” [26, p. 15]

⇒ Be flexible regarding implementation technology!
Is The Data Warehouse Dead?

“There have been numerous times when vendors proclaim that data warehousing is no longer needed. (...) There is no “silver bullet” that helps an enterprise avoid the hard work of data integration. Information that is clean, comprehensive, consistent, conformed, and current is not a happenstance; it requires thought and work.” [29, p. 12]

The data warehouse remains a relevant concept also in the age of big data, storing clean data in a format and granularity suitable for analysis.

⇒ Conceptual modeling to the rescue!
Data Lake

A data lake serves to store raw data for later analysis, once the analysts have figured out what to do with the data.

A data lake may complement a traditional data warehouse, especially in the presence of high-velocity data streams.
Sensor Data Warehousing (Dobson et al. [5])

[Architecture diagram with components: Data Lake, Stream Processing, Event Processing, Data Warehouse, Real-Time Analysis, Business Intelligence]
Conceptual Model: Sensor Measurements

• Measurement (receptionTimestamp, sensingTimestamp, value, accuracy): each measurement is taken by exactly one Agent; an agent takes many measurements
• MeasurementType (name, unit): each measurement has exactly one measurement type
• Transformation (name): optionally (0..1) associated with a measurement
• Location (latitude, longitude): optionally (0..1) associated with a measurement

Figure: A domain model for sensor measurements [5]
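The domain model above can be sketched as plain classes. This is an illustrative rendering only: class and attribute names are taken from the diagram, while the `name` attribute on `Agent` and the use of strings for timestamps are assumptions for the sketch.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MeasurementType:
    name: str
    unit: str

@dataclass
class Transformation:
    name: str

@dataclass
class Location:
    latitude: float
    longitude: float

@dataclass
class Agent:
    name: str  # hypothetical identifier; the diagram specializes Agent further

@dataclass
class Measurement:
    reception_timestamp: str
    sensing_timestamp: str
    value: float
    accuracy: float
    agent: Agent                        # exactly one agent per measurement
    measurement_type: MeasurementType   # exactly one type per measurement
    transformation: Optional[Transformation] = None  # 0..1 in the diagram
    location: Optional[Location] = None              # 0..1 in the diagram

# Example: a temperature reading taken by a stationary thermometer agent.
reading = Measurement(
    "2018-10-02T14:00", "2018-10-02T13:59", 22.2, 0.1,
    Agent("THE01"), MeasurementType("Temperature", "°C"),
)
```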
Conceptual Model: Agent Types

• Agent: specialized into Person, Process, and AssignedDevice
• Person (name, age, position)
• Process (id): each process has exactly one ProcessType (name)
• AssignedDevice (assignmentTimestamp): links a PhysicalDevice (id, nominalAccuracy) to a LogicalDevice (id)
• LogicalDevice: specialized into Stationary and Mobile; a stationary logical device is tied to a Location

Figure: A domain model for sensor agents [5]
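The agent hierarchy can likewise be sketched as classes. The specialization of Agent into Person, Process, and AssignedDevice follows the diagram; the exact class names `StationaryDevice`/`MobileDevice` and the string timestamps are illustrative choices for this sketch.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Location:
    latitude: float
    longitude: float

class Agent:  # abstract supertype from the diagram
    pass

@dataclass
class Person(Agent):
    name: str
    age: int
    position: str

@dataclass
class ProcessType:
    name: str

@dataclass
class Process(Agent):
    id: str
    process_type: ProcessType  # exactly one type per process

@dataclass
class LogicalDevice:
    id: str

@dataclass
class StationaryDevice(LogicalDevice):
    location: Optional[Location] = None  # stationary devices are tied to a location

@dataclass
class MobileDevice(LogicalDevice):
    pass

@dataclass
class PhysicalDevice:
    id: str
    nominal_accuracy: float

@dataclass
class AssignedDevice(Agent):
    assignment_timestamp: str
    physical: PhysicalDevice  # the concrete piece of hardware
    logical: LogicalDevice    # the logical role the hardware currently plays

# Example: a physical eartag assigned to the logical device "EAR23".
eartag = AssignedDevice("2018-10-01T08:00",
                        PhysicalDevice("EAR03143", 0.1),
                        MobileDevice("EAR23"))
```

Separating physical from logical devices lets a broken sensor be swapped out while the measurement history of the logical device remains continuous.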
Dimensional Fact Model: Measurements
Figure: Multidimensional model for measurements [5]
Example Sensor Readings

Measurement facts:

S. Time           Meas. Type  Agent  Transform.  Acc.  Value
2018/10/02 14:00  3           1      AVG10       0.1   22.2
2018/10/02 14:10  3           1      AVG10       0.1   22.4
2018/10/02 14:05  2           2      AVG5        0.1   61.3
2018/10/02 14:15  2           2      AVG5        0.2   60.9
2018/10/03 10:20  2           3                        62

Measurement types:

Meas. Type ID  Meas. Type           Unit
1              Milk yield           kg
2              Rumination activity  Chews/Cud
3              Temperature          °C

Agents:

Agent ID  Agent  Agent Type
1         THE01  Device
2         EAR23  Device
3         VET01  Person

Devices:

Phys. Dev.  Log. Dev.           Loc.          Dev. Type
THE01232    Temp. Feed Area #1  Feed Area #1  Thermo.
EAR03143                                      Earmark
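The star-schema lookup that an analysis tool performs on these tables can be sketched with plain Python dictionaries; the table contents are transcribed from the example above, and the `average` helper is a hypothetical query, not part of the tutorial's toolset.

```python
# Dimension tables as lookup dictionaries keyed by their IDs.
measurement_types = {
    1: ("Milk yield", "kg"),
    2: ("Rumination activity", "Chews/Cud"),
    3: ("Temperature", "°C"),
}
agents = {1: ("THE01", "Device"), 2: ("EAR23", "Device"), 3: ("VET01", "Person")}

# Fact table rows: (sensing time, measurement type ID, agent ID, value).
facts = [
    ("2018/10/02 14:00", 3, 1, 22.2),
    ("2018/10/02 14:10", 3, 1, 22.4),
    ("2018/10/02 14:05", 2, 2, 61.3),
    ("2018/10/02 14:15", 2, 2, 60.9),
    ("2018/10/03 10:20", 2, 3, 62.0),
]

def average(meas_type_name):
    """Join facts to the measurement-type dimension and average the values."""
    values = [v for _, mt, _, v in facts
              if measurement_types[mt][0] == meas_type_name]
    return sum(values) / len(values)

print(round(average("Temperature"), 1))          # 22.3
print(round(average("Rumination activity"), 2))  # 61.4
```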
The Need for Shared Conceptualization

In order to allow for comparison of results, a shared conceptualization is vital.

Example: Activity Tracking in AgriProKnow
Sensors track movement activity of animals within a farm. In order to allow for a comparison of movement activity data between farms, rather than the precise location, it is more important to capture the function area, e.g., feeding area, resting area, milking area. But first, common function areas across farms must be identified, and then captured during data acquisition and recording.
The Need for Shared Conceptualization

Furthermore, a shared conceptualization may help to reduce the load of data early on during data acquisition and recording. Later computations might be expensive or even unfeasible.

Example: Activity Tracking in AgriProKnow
Sensors may track position and activity every two seconds.
30 × 60 × 24 = 43,200 readings per day and animal
Large farm: 1,000 animals
⇒ 15,768,000,000 readings per year for one farm
The ultimate vision of AgriProKnow is to collect data from thousands of farms for inter-farm data analysis.
But: All those readings are often not needed. A more abstract level is more interesting.
The Need for Shared Conceptualization

Furthermore, a shared conceptualization may help to reduce the data load early on, during data acquisition and recording. Later computations might be expensive or even infeasible.

Example: Activity Tracking in AgriProKnow
Rather than storing thousands of location points and movement patterns for each animal, a higher level of abstraction is more useful. For each animal, the walking distance and duration as well as the lying and standing durations within each hour of the day are more important.
⇒ shared conceptualization of activity types, which should be known upon recording so that data can be reduced early on

24/131
Dimensional Fact Model: Movement Activity

For example, animal AT23464 on the Kremesberg farm site may have spent 0 minutes lying, 10 minutes standing, and 5 minutes walking in a feeding area on 10 October 2018 in the 13th hour of the day.
Data Lake

A data lake serves to store raw data for later analysis, once the analysts have figured out what to do with the data.

Of course, the data sets need to be organized such that the analysts can find them.

⇒ Structured data lake approach

26/131
Semantic Data Containers [28]

The semantic container approach organizes data sets along spatial, temporal, and other semantic dimensions (or facets). The dimensions/facets consist of concepts, which are hierarchically organized.

Example: ATM Information Cubes
Modern air traffic management (ATM) heavily relies on the timely exchange of accurate information. ATM stakeholders require information at various granularities and levels of detail. ATM information cubes are structured repositories of ATM messages, where each cell is a semantic container.

27/131
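As an illustration (not the authors' implementation), a semantic container repository can be sketched as a mapping from facet concepts to data sets, where hierarchically organized concepts enable roll-up-style lookups; all names and hierarchies below are hypothetical:

```python
# Hypothetical location facet hierarchy: child concept -> parent concept.
location_hierarchy = {"LOWW": "Austria", "LZIB": "Slovakia",
                      "Austria": "Europe", "Slovakia": "Europe"}

def ancestors(concept, hierarchy):
    """Return the concept and all of its ancestors, bottom-up."""
    result = [concept]
    while concept in hierarchy:
        concept = hierarchy[concept]
        result.append(concept)
    return result

# Each container cell is addressed by (location, importance) concepts.
containers = {
    ("LOWW", "Operational Restriction"): ["NOTAM-1"],
    ("LZIB", "Flight Critical"): ["NOTAM-2"],
}

def lookup(location, importance):
    """Collect data sets from all cells whose location rolls up to `location`."""
    hits = []
    for (loc, imp), data in containers.items():
        if location in ancestors(loc, location_hierarchy) and imp == importance:
            hits.extend(data)
    return hits

print(lookup("Austria", "Operational Restriction"))  # ['NOTAM-1']
```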
ATM Information Cube: Operations

[Figure: an ATM information cube whose cells, such as EDUU-01 and EDUU-02, are organized along the facets Operational Restriction, Flight Critical, and Essential Briefing Package]

• Merge: Change the granularity of the cube by merging the contents of the cells.
• Abstract: Replace entities inside a cell with more abstract entities.
ATM Information Cube: Example

[Figure: an example cube with the cells TS-LOWW-01, TS-LOWW-02, TS-LZIB-01, and TS-LZIB-02, organized along the facets Operational Restriction, Flight Critical, Potential Hazard, and Additional Information]
ATM Information Cube: Merge

[Figure: merging the cells for the locations LOWW and LZIB, in four steps, into an Essential Briefing Package and a Supplementary Briefing Package]
EXTRACTION, CLEANING, AND ANNOTATION
Process Modeling and ETL

Extract, transform, and load (ETL) processes feed the data from the sources into the data warehouse.

Traditionally, the implementation of ETL processes involves a lot of low-level programming.

Process modeling approaches with support for code generation may facilitate the implementation of ETL processes and also serve as documentation.

Besides proprietary modeling languages, the Business Process Model and Notation (BPMN) or UML activity diagrams may serve for ETL process modeling.

31/131
BPMN Models of ETL Processes (El Akkaoui et al. [7, 6])

Two perspectives on ETL processes:

• Control process (process orchestration): Handle branching and synchronization of the data flow
• Data process: Specify precisely how the input data transform into output data

32/131
Control Process: Example

Before animal movement data can be loaded into the AgriProKnow data warehouse, the animal dimension and the function areas at specific farms must be loaded.

[Figure: BPMN control process in the AgriProKnow DWH pool, with the tasks FarmFunctionArea Load and AnimalDim Load preceding AnimalMovement Load]
Data Process: Animal Movement

[Figure: BPMN data process, shown step by step over several slides, with the intermediate data at each stage]

1. Input Data: read the file EAR34-Movement.csv (type: CSV), e.g.

   NationalID,Lat,Long,Timestamp
   AT-12,5,10,1537348997000
   AT-12,6,10,1537348998000
   AT-23,7,15,1537348997000
   AT-23,7,15,1537348998000

2. Convert Column: Timestamp to Date (e.g., 1537348997000 becomes Sep 19, 2018 09:23:17).

3. Add Column: Coordinates, Expression: SDO_POINT_TYPE(Lat, Long, NULL).

4. Lookup: Retrieve FunctionAreaType from table FarmFunctionArea of database AgriProKnowDWH, Where the function area's Area Contains the Coordinates, e.g.

   FunctionArea    Area             FunctionAreaType
   1stFarmFeeding  [(0,0);(6,12)]   Feeding
   1stFarmResting  [(5,14);(10,20)] Resting

   NotFound branch: Insert Data into the text file BadCoordinates.txt.

5. Aggregate (Found branch): Group By EXTRACT(Hour FROM Date), EXTRACT(Year FROM Date), EXTRACT(Month FROM Date), EXTRACT(Day FROM Date), NationalID, FunctionAreaType; Columns: Duration = COUNT(*).

6. Insert Data: load the result into table Movement of database AgriProKnowDWH, e.g.

   NatlID  FAType   Hour  Year  Month  Day  Dur
   AT-12   Feeding  10    2018  9      19   2
   AT-23   Resting  10    2018  9      19   2
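The data process above can be sketched in plain Python (a simplified stand-in for the modeled ETL steps: bounding rectangles instead of the spatial SDO geometries, an in-memory counter instead of the warehouse table, and hours in UTC):

```python
from collections import Counter
from datetime import datetime, timezone

# Raw readings as in the sample input (timestamps in epoch milliseconds).
readings = [
    ("AT-12", 5, 10, 1537348997000),
    ("AT-12", 6, 10, 1537348998000),
    ("AT-23", 7, 15, 1537348997000),
    ("AT-23", 7, 15, 1537348998000),
]

# FarmFunctionArea lookup table: bounding rectangle -> function area type.
function_areas = [
    ((0, 0, 6, 12), "Feeding"),    # 1stFarmFeeding
    ((5, 14, 10, 20), "Resting"),  # 1stFarmResting
]

def area_type(lat, lng):
    """Lookup step: the type of the function area containing the point."""
    for (lat1, lng1, lat2, lng2), fa_type in function_areas:
        if lat1 <= lat <= lat2 and lng1 <= lng <= lng2:
            return fa_type
    return None  # NotFound branch: would be written to BadCoordinates.txt

# Convert Column + Lookup + Aggregate: count readings per animal,
# function area type, and hour (hours are in UTC here).
movement = Counter()
for national_id, lat, lng, ts_ms in readings:
    date = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    fa_type = area_type(lat, lng)
    if fa_type is not None:
        movement[(national_id, fa_type, date.year, date.month,
                  date.day, date.hour)] += 1

for key, duration in sorted(movement.items()):
    print(key, duration)
```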
ETL Patterns (Oliveira et al. [23, 24])

Identify conceptual models for a set of standard ETL processes such as change data capture, slowly changing dimensions, and surrogate key pipelining [23]. The goal is to foster code reusability.

Oliveira et al. [24] also extend the BPMN metamodel with concepts specific to ETL processes.

42/131
Mining ETL Patterns (Theodorou et al. [30])

ETL process modeling also facilitates a comprehensive analysis of ETL processes based on mining for ETL patterns, using the Workflow Patterns Initiative as a guide.

What are the use cases for such mined ETL patterns?

• Identify recurring patterns in existing ETL processes in order to subsequently redesign those processes.
• Apply quality metrics to ETL process models at a higher level of abstraction.
• Show a higher-level summary of the ETL process to foster understanding.

43/131
ETL Processes for Big Data (Bala et al. [4])

Massive distribution and parallelization is key to handling big data processing.

Employ distribution and parallelization for ETL processes with big data!

• Describe the ETL process in terms of core functionalities
• Distribute the processing of these core functionalities to multiple nodes

⇒ Conceptual modeling is key to effective optimization of ETL processes in the age of big data

44/131
ETL Processes for Big Data (Bala et al. [4])

An ETL library contains a list of ETL functionalities, which can be used to design ETL processes.

[Figure: description of the LookUp functionality with its Source, Target, Lookup Table, Errors, and Outputs connections, and parameters such as "Which data will be stored?" and "Cache mode"]

Figure: Example description of an ETL functionality in the ETL library [4]

45/131
Modeling Transformations for Data Mining (Ordonez et al. [25])

Data mining algorithms require the source data in a very specific format. The source data, however, are often scattered across multiple datasets/relations (even in a data warehouse).

Transformations include denormalizations and aggregations, where denormalization is a rather broad term that also includes applying complex expressions on attributes.

Modeling each transformation as a separate entity along with an SQL query makes it possible to track the lineage of the data.
Example: Source Schema

[Figure: ER diagram with the following relations]
• Animal(NationalID PK, Name, Sex, Breed)
• MilkYield(AnimalNationalID FK PK, Date PK, Time PK, MilkYield)
• Feeding(AnimalNationalID FK PK, Date PK, Time PK, Quantity, FeedMixID FK)
• FeedMix(FeedMixID PK, FeedMixType, PercentRoughage, PercentSilage)
Example: Transformation for Data Mining

Can the feed intake serve as a predictor for milk yield on the next day? A data mining algorithm may answer that question.

But first, we need to obtain a data set that contains each animal's milk yield on a particular date along with the feed intake from the day before.
Transformation Entity: Denormalization

[Figure: Denormalization T1, specified by an SQL query, joins Animal(NationalID PK, Name, Sex, Breed) with MilkYield(AnimalNationalID FK PK, Date PK, Time PK, MilkYield) into a relation (AnimalNationalID FK PK, Date FK PK, Time FK PK, MilkYield, AnimalBreed)]
Transformation Entity: Denormalization

[Figure: Denormalization T2, specified by an SQL query, joins Feeding(AnimalNationalID FK PK, Date PK, Time PK, Quantity, FeedMixID FK) with FeedMix(FeedMixID PK, FeedMixType, PercentRoughage, PercentSilage) into a relation (AnimalNationalID FK PK, Date FK PK, Time FK PK, FeedMixID FK, FeedMixType, QuantityRoughage, QuantitySilage)]
Transformation Entity: Aggregation

[Figure: Aggregations T3 and T4, each specified by an SQL query, aggregate the denormalized relations per animal and date: T3 yields (AnimalNationalID FK PK, Date PK, AnimalBreed PK, MilkYield); T4 yields (AnimalNationalID FK PK, Date PK, FeedMixID FK PK, FeedMixType PK, QuantityRoughage, QuantitySilage)]
Transformation Entity: Target

[Figure: Denormalization T5 joins the aggregated relations into the target relation (AnimalNationalID FK PK, MilkingDate PK, FeedingDate PK, FeedMixID FK PK, AnimalBreed PK, MilkYield, QuantityRoughage, QuantitySilage), pairing each milking date with the feeding date of the previous day]
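A minimal sketch of the T1-T5 pipeline in plain Python (toy in-memory rows instead of SQL over the warehouse relations; all sample values are invented for illustration):

```python
from datetime import date, timedelta

# Toy source relations (a fragment of the example schema).
animal = {"AT-12": {"Breed": "Holstein"}}
milk_yield = [  # (NationalID, Date, Time, MilkYield)
    ("AT-12", date(2018, 9, 19), "am", 14.0),
    ("AT-12", date(2018, 9, 19), "pm", 12.0),
]
feeding = [  # (NationalID, Date, Time, Quantity, FeedMixID)
    ("AT-12", date(2018, 9, 18), "am", 20.0, "FM1"),
]
feed_mix = {"FM1": {"Type": "Standard", "PctRoughage": 0.5, "PctSilage": 0.3}}

# T1 + T3: denormalize milk yield with the animal's breed and
# aggregate the milk yield per animal, breed, and date.
milk_per_day = {}
for nid, d, _, m in milk_yield:
    key = (nid, animal[nid]["Breed"], d)
    milk_per_day[key] = milk_per_day.get(key, 0.0) + m

# T2 + T4: denormalize feeding with the feed mix and aggregate the
# roughage and silage quantities per animal, feed mix, and date.
feed_per_day = {}
for nid, d, _, qty, fm in feeding:
    mix = feed_mix[fm]
    key = (nid, fm, mix["Type"], d)
    rough, silage = feed_per_day.get(key, (0.0, 0.0))
    feed_per_day[key] = (rough + qty * mix["PctRoughage"],
                         silage + qty * mix["PctSilage"])

# T5: join each milking date with the feeding date of the previous day.
target = []
for (nid, breed, milking_date), m in milk_per_day.items():
    feeding_date = milking_date - timedelta(days=1)
    for (nid2, fm, fm_type, d), (rough, silage) in feed_per_day.items():
        if nid2 == nid and d == feeding_date:
            target.append((nid, milking_date, feeding_date, fm,
                           breed, m, rough, silage))

print(target)
```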
Superimposed Multidimensional Schemas

In some cases, it may be impractical to extract the data from the source systems.
⇒ Volume/Velocity/Volatility

Rather, a multidimensional schema with mapping rules may be superimposed over the sources.

Further Reading
M. Hilal, C. G. Schuetz, M. Schrefl. Using superimposed multidimensional schemas and OLAP patterns for RDF data analysis. Open Computer Science, 8(1), 18-37, 2018.

53/131
Example: Superimposed Multidimensional Schemas for Linked Data Analysis [12]

Repositories of linked data such as Wikidata can be an important resource for data analysis.

• RDF data do not follow a structure suitable for OLAP-style data analysis.
• These data are not under the analyst's control.
• Exploiting these data for casual analysis is not an easy task and requires knowledge of SPARQL.

⇒ Superimposition of multidimensional schemas renders these data accessible for OLAP

54/131
Analytical SPARQL Query over Wikidata

[Figure: an analytical SPARQL query over Wikidata]

Film Cube over Wikidata

[Figure: a film cube superimposed over Wikidata]
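To illustrate the kind of query involved (a hypothetical film-cube query, not reproduced from the slides), an analytical SPARQL aggregation over Wikidata can be assembled and inspected in Python; P31/Q11424, P136, and P2047 are the Wikidata identifiers for instance-of/film, genre, and duration:

```python
# Hypothetical analytical query: average film duration per genre.
query = """
SELECT ?genre (AVG(?duration) AS ?avgDuration) WHERE {
  ?film wdt:P31 wd:Q11424 .    # instance of: film
  ?film wdt:P136 ?genre .      # genre
  ?film wdt:P2047 ?duration .  # duration
}
GROUP BY ?genre
"""

# The string could be sent to the Wikidata query service; here we
# only check that the roll-up structure of the query is in place.
print("AVG(?duration)" in query and "GROUP BY ?genre" in query)  # True
```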
INTEGRATION, AGGREGATION, AND REPRESENTATION
Data Integration

Most ETL processes integrate data from multiple sources. The presented techniques for conceptual ETL process modeling account for that fact.

With the emergence of the (semantic) web and social media, the data generated on web platforms have become a valuable resource for analysis.

57/131
Fusion Cubes (Abelló et al. [2])

The vision:

• Complement existing data cubes with fusion cubes that include external data from RDF and linked data sources.
• Provide a drill-beyond operator that allows the user to define how and where an existing cube should be extended with external data.
• Business intelligence should become truly self-service.

⇒ A uniform representation format might help

58/131
QB4OLAP: BI Vocabulary for Linked Data (Etcheverry et al. [8])

QB4OLAP extends the W3C's Data Cube (QB) vocabulary with concepts required for OLAP, e.g., hierarchies.
⇒ Representation of statistical linked data

In AgriProKnow, QB4OLAP serves for the semantic description of the data warehouse schema, where elements can be linked to domain ontologies and websites.

QB4OLAP may also serve as the vocabulary for superimposed multidimensional schemas [12].

59/131
Social Business Intelligence

Combines data from companies (e.g., sales) with data generated by users on social media.

Often, social business intelligence involves sentiment analysis of user content based on natural language processing. The results of such an analysis may be stored in cubes for further analysis [9].

Example query: What is the average sentiment towards smartphones?

60/131
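The example query can be illustrated with a toy sentiment cube (the scores are invented; a real system would derive them with an NLP pipeline before storing them as cube facts):

```python
# Toy facts: (product_category, brand, sentiment_score in [-1, 1]).
facts = [
    ("Smartphone", "A", 0.8),
    ("Smartphone", "B", -0.2),
    ("Smartphone", "A", 0.6),
    ("Tablet", "A", 0.1),
]

def avg_sentiment(category):
    """Roll up the sentiment measure over all facts of one category."""
    scores = [s for cat, _, s in facts if cat == category]
    return sum(scores) / len(scores)

# Averages the three smartphone scores, (0.8 - 0.2 + 0.6) / 3 ≈ 0.4.
print(avg_sentiment("Smartphone"))
```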
ANALYSIS AND MODELING
Business Intelligence Model (BIM) (Horkoff et al. [13])

Representation of business strategies:

• Goals, which are selected from the Balanced Scorecard dimensions (financial, customer, processes, learning), at the strategic, tactical, or operational level.
• Situations represent internal and external factors that influence goals positively or negatively.
• Processes aim to achieve the goals.
• Key Performance Indicators (→ later in this tutorial).

61/131
Example: BIM for AgriProKnow

[Figure: BIM diagram in which the Farmer desires the goal "Maximize milk yield", evaluated by the indicator "Milk yield"; the subgoals "Prevent animal illness" and "Optimize feed intake" (supported by the process "Automatic Feeding") jointly (AND) contribute positively (++); the strength "Well-fed animals" (indicator "Body Condition Score") and the threat "Antibiotics resistance" (indicator "# of known resistant germs") influence the goals]
Business-Driven Data Analytics (Nalchigar and Yu [20])

Requirements analysis and design of data analytics systems has multiple, complementary views.

• Business view: Starting from the business goals, the data analytics goals are defined.
• Data analytics design view: Explore different methods to achieve the data analytics goals by comparing their strengths and weaknesses.
• Data preparation view: Define what data sets and data preparation steps are required to perform the chosen analytics → similar to ETL models but at a higher level

63/131
Business View: Animal Illness

[Figure: the Farmer desires "Maximize milk yield" (evaluated by "Milk yield"), to which "Prevent animal illness" and "Optimize feed intake" jointly (AND) contribute positively (++); the decisions "Decision on feed change" and "Decision on calling veterinarian" depend on the questions "Which animals are at risk?" and "Which feed mix is best for animals?"; the former question is answered by the "Animals at risk" predictive model (type = Predictive model, input = movement and health data, output = Alert, usageFrequency = daily, updateFrequency = quarterly, learningPeriod = 12 months)]
Data Analytics Design View: Animal Illness

[Figure, built up over two slides: the analytics goal "Predict animal illness" is addressed by "Classification of animals", which generates the "Animals at risk" predictive model (type = Predictive model, input = movement and health data, output = Alert, usageFrequency = daily, updateFrequency = quarterly, learningPeriod = 12 months); the candidate algorithms Logistic Regression and Deep Learning are evaluated by Precision, Recall, and their tolerance to missing values (++ vs. -), with the achieved scores 0.85, 0.55, 0.75, and 0.65 attached in the diagram]

66/131
Reference Modeling: BIRD Approach [27]

The idea stems from a small industry project we had a couple of years ago.

Lightweight reference models for OLAP cubes, calculated measures, and business terms should be customizable for different small and medium-sized companies within an industry, or for large companies with multiple divisions.

Calculated measures and business terms are represented using snippets of SQL code.

Further Reading
C. G. Schuetz, B. Neumayr, M. Schrefl, and T. Neuböck. Reference Modeling for Data Analysis: The BIRD Approach. International Journal of Cooperative Information Systems, 25(2):1–46, 2016.
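As an illustration, a calculated measure such as /actualCostsPerUnit could be captured as a reusable SQL snippet that is spliced into queries upon use. The following sketch is not taken from the BIRD implementation; the table contents are invented and SQLite serves only for demonstration:

```python
import sqlite3

# Hypothetical registry of calculated measures as SQL snippets
# (/actualCostsPerUnit from the example reference model).
CALC_MEASURES = {
    "/actualCostsPerUnit": "SUM(actualCosts) / SUM(actualQuantity)",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MaterialUsedForProduct "
             "(material TEXT, actualQuantity REAL, actualCosts REAL)")
conn.executemany("INSERT INTO MaterialUsedForProduct VALUES (?, ?, ?)",
                 [("steel", 10, 50), ("steel", 30, 70)])

# Splice the snippet into a query template upon use:
query = (f"SELECT material, {CALC_MEASURES['/actualCostsPerUnit']} AS costsPerUnit "
         "FROM MaterialUsedForProduct GROUP BY material")
print(conn.execute(query).fetchall())  # [('steel', 3.0)]
```

Representing the derived measure as a snippet rather than a stored column keeps it valid at any aggregation level.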
Example: Reference Model

Figure: Reference model with the facts MaterialUsedForProduct (measures plannedQuantity, actualQuantity «mandatory», plannedCosts, actualCosts «mandatory», and derived measures /plannedCostsPerUnit, /actualCostsPerUnit «mandatory», /actualCostsYTD, /actualCostsToPreviousDay) and MaterialInSupplyOrder (measures costs «mandatory» and shippingCosts, derived measures /totalCosts, /totalCostsYTD, /totalCostsToPrevWeek). The dimensions are Product (productOrder, productCategory, with the subclasses DurableProduct (minLifeTime), ColdResistantProduct (minTemperature), and HeatResistantProduct (maxTemperature)), Time (day, week, month, quarter, year), Material (material, materialCategory, with the subclasses ColdResistantMaterial and HeatResistantMaterial), Factory (building, site, country), and Customer (customer, customerRegion, industry, consumerGroup); «mandatory» marks model elements that must be retained in every customization.
Example: Reference Model Customization

Figure: Customization of the reference model: the added elements (marked with +) comprise the measures orderedQuantity and deliveredQuantity in MaterialUsedForProduct, the derived measure /totalCostsPerUnit in MaterialInSupplyOrder, and a new Supplier dimension with the levels supplier and supplierRegion; the facts, measures, and dimensions of the reference model (Product, Time, Material, Factory, Customer) are taken over.
69/131
Tool Support [27]
Implementation using Indyco Builder as modeling tool, XML for the specification of business terms and calculated measures as well as customizations/redefinitions, and XQuery to apply the transformations to the reference model.
70/131
Applicability of BIRD to AgriProKnow

AgriProKnow aims to integrate data from multiple farms in order to generate new process knowledge, e.g., early indicators of, as well as influence factors for, animal illness.

The thus generated knowledge should be applied for rule-based process monitoring and control, e.g., by calling a veterinarian when danger of illness is detected.

Thus, once operational, the AgriProKnow data warehouse could consist of two parts:

• Inter-farm data warehouse: Integrates the data from various sources in order to generate new knowledge.
• Farm-specific data warehouses: A data warehouse for each farm, built through customization of a reference model; the analysis rules are executed over the farm-specific data warehouses.
AgriProKnow Reference Model [27]
72/131
AgriProKnow Reference Model [27]
73/131
AgriProKnow Reference ModelCustomization [27]
74/131
Fact Table [27]
A fact table is generated by Indyco Builder based on a customized reference model.
75/131
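A hedged sketch of what such a generated fact table could look like, with measure columns taken from the MaterialUsedForProduct fact of the example reference model; the key columns and types are invented, and SQLite is used only for demonstration:

```python
import sqlite3

# Hypothetical DDL for the MaterialUsedForProduct fact table; dimension
# keys reference the base levels of the Product, Time, Material,
# Factory, and Customer dimensions.
DDL = """
CREATE TABLE MaterialUsedForProduct (
    product_id      INTEGER NOT NULL,  -- rolls up to productCategory
    day             TEXT    NOT NULL,  -- rolls up to week/month/quarter/year
    material_id     INTEGER NOT NULL,  -- rolls up to materialCategory
    factory_id      INTEGER NOT NULL,
    customer_id     INTEGER NOT NULL,
    plannedQuantity REAL,
    actualQuantity  REAL NOT NULL,     -- «mandatory» in the reference model
    plannedCosts    REAL,
    actualCosts     REAL NOT NULL,     -- «mandatory» in the reference model
    PRIMARY KEY (product_id, day, material_id, factory_id, customer_id)
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(DDL)  # derived measures such as /actualCostsPerUnit are
                   # computed in queries rather than stored
```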
Query [27]
Queries are formulated using analysis situations (or OLAP patterns) and then automatically translated into SQL.

XQuery functions take the customized reference model and the SQL DDL statements to generate SQL queries for analysis situations.
76/131
OLAP Patterns: Basic Idea
77/131
OLAP Patterns: Basic Idea
78/131
OLAP Patterns: Basic Idea
79/131
OLAP Patterns: Definition
80/131
OLAP Patterns: Examples
81/131
Enhanced Dimensional Fact Model (eDFM)
82/131
eDFM in QB/QB4OLAP + Extension
83/131
OLAP Patterns: Description Form
84/131
OLAP Patterns: Description
85/131
OLAP Patterns: Framework
86/131
OLAP Patterns: RDF Definition
:HomogeneousIndependentSetComparison a pl:Pattern;
pl:name "Homogeneous independent-set comparison"@en;
pl:situation "Compare SI and SC with the same ..."@en;
pl:solution "The fact class, dimensions, grouping ..."@en;
pl:structure "SI: 1 fact class, 1..* selection ..."@en;
pl:example "Calculate the delta (comparative ..."@en;
pl:hasElement :base, :baseSlice, :measure, :dimensionLevel,
:dimension, :measureNotNull, :siSlice, :scSlice,
:compMeasure, :compHaving, :SetOfInterest,
:SetOfComparison;
pl:result :compMeasure, :dimensionLevel,
[pl:element :measure; pl:elementPrefix "SI_"],
[pl:element :measure; pl:elementPrefix "SC_"].
87/131
OLAP Patterns: Instantiation
:DeltaMilkYield a pl:QbPatternInstance;
pl:instanceOf :HomogeneousIndependentSetComparison;
:base agri:Milk;
:baseSlice :DateIn2017;
:measure :SumOfMilkYield;
:dimension agri:Animal, agri:FarmSite;
:dimensionLevel agri:Animal, agri:FarmSite;
:siSlice :today;
:scSlice :prior5days;
:compMeasure :DeltaMilkYield;
:compHaving :positiveDeltaMilkYield.
88/131
OLAP Patterns: Measures and Predicates
89/131
OLAP Patterns: Measures and Predicates
90/131
Pattern Expression

For each pattern, a generic query template in a target language is defined – the pattern expression.

That target language can be SQL but also another language such as SPARQL [12].

Upon pattern instantiation, predicate and measure expressions are inserted into the placeholders in the pattern expression.

91/131
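The mechanism can be sketched as simple string templating. All placeholder names and the milk-yield bindings below are invented for illustration; the actual framework derives the bindings from the pattern instance and the multidimensional model:

```python
# A generic SQL template (the "pattern expression") for a set comparison;
# measure and predicate expressions are inserted into the placeholders
# upon pattern instantiation.
SET_COMPARISON_TEMPLATE = """
SELECT {group_by},
       {si_measure} AS si_value,
       {sc_measure} AS sc_value,
       {si_measure} - {sc_measure} AS delta
FROM {fact_table}
WHERE {base_slice}
GROUP BY {group_by}
HAVING {comp_having}
""".strip()

def instantiate(template: str, **bindings: str) -> str:
    """Fill the placeholders of a pattern expression with concrete
    measure and predicate expressions."""
    return template.format(**bindings)

query = instantiate(
    SET_COMPARISON_TEMPLATE,
    fact_table="milk",
    group_by="animal, farm_site",
    si_measure="SUM(CASE WHEN day = DATE('now') THEN yield END)",
    sc_measure="SUM(CASE WHEN day < DATE('now') THEN yield END) / 5",
    base_slice="day >= DATE('now', '-5 days')",
    comp_having="delta > 0",
)
```

The same template could be expressed in SPARQL instead of SQL; only the pattern expression changes, not the instantiation mechanism.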
OLAP Patterns: Guided Instantiation
The RDF representation of pattern and multidimensional model elements, as well as the relationships among those elements, may serve to build a “wizard” for guided query instantiation.
A demonstration video can be found here:https://www.youtube.com/watch?v=BLt6heO7WKY
92/131
Analysis Graphs (Neuböck et al. [21, 27])

Analysis graphs explicitly represent knowledge about analysis processes.

Potential applications:

• Documentation of analysis processes
• Tool support for exploratory OLAP
• Automation of complex analysis processes
• Representation format for analysis process mining
93/131
Example: Analysis Graph (Bird’s-Eye View)

Figure: An unrefined analysis graph for analysis in the event of order cancellation [27], with analysis situations such as “Quantity and Expected Delivery Time of Undelivered Material”, “Orders from other Suppliers that Contain Undelivered Material”, “Products that Contain Undelivered Material”, “Ordered Material with Properties Similar to Undelivered Material”, and “List of Customer Orders Affected by Material Order Canceling”.
94/131
Example: Analysis Graph

Figure: Example navigation step between analysis situations [27]. The source analysis situation MonthlyCostsOfMaterialUse (factClass = MaterialUsedForProduct, measure = {SUM(actualCosts)}) has open material parameters (diceLevel = material, diceNode = ?mat) and time parameters (diceLevel = ?tmLevel, diceNode = ?tm, granularity = Time.month). The navigation step FocusOnProperty – addSliceCondition(Material, ?prop) and moveToNode(Product, Product.productCategory, ?prodCat) – leads to the target situation MonthlyCostsOfMaterialSupplyOrderWithProperty, which additionally carries a slice condition on material (sliceCondition = ?prop) and product parameters (diceLevel = Product.productCategory, diceNode = ?prodCat).
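The slot-and-navigation structure can be sketched as follows. This is a hypothetical simplification: the slot names follow the figure's vocabulary, while the concrete slice condition and dice node are invented:

```python
from dataclasses import dataclass, field, replace

# An analysis situation is a parameterized query specification; a
# navigation step derives a new situation by binding open parameters
# or adding slice conditions.
@dataclass(frozen=True)
class AnalysisSituation:
    fact_class: str
    measures: tuple[str, ...]
    granularity: str
    dice: dict = field(default_factory=dict)  # dimension -> (level, node)
    slices: tuple[str, ...] = ()              # slice conditions

def add_slice_condition(sit: AnalysisSituation, cond: str) -> AnalysisSituation:
    return replace(sit, slices=sit.slices + (cond,))

def move_to_node(sit: AnalysisSituation, dimension: str,
                 level: str, node: str) -> AnalysisSituation:
    return replace(sit, dice={**sit.dice, dimension: (level, node)})

monthly_costs = AnalysisSituation(
    fact_class="MaterialUsedForProduct",
    measures=("SUM(actualCosts)",),
    granularity="Time.month",
)

# The navigation step "FocusOnProperty" from the figure (bindings invented):
focused = move_to_node(
    add_slice_condition(monthly_costs, "material.minTemperature < 0"),
    "Product", "Product.productCategory", "DurableProduct",
)
```

Because situations are immutable, each navigation step yields a new node of the analysis graph while the source situation remains available for other navigation paths.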
OLAP Endpoints for Linked Data [11]
Linked data repositories could provide an endpoint for OLAP analysis based on superimposed multidimensional schemas and analysis graphs in order to facilitate exploration and analysis of the data repository.
OLAP Endpoints for Linked Data [11]: Video
A demonstration video of a preliminary version can be found here: https://youtu.be/ymhkqla8J1I

We have since improved the appearance and are currently preparing a user study.
97/131
INTERPRETATION ANDACTION
Summarizability

What is summarizability about? Correct interpretation.

Conceptual modeling may help to ensure summarizability or at least make issues with summarizability explicit:

• Concepts in the modeling language [22, 17, 15]
• Constraint-based approaches [16, 14, 1]

98/131
Conditions for Summarizability

Figure: Condition 1: Disjointness (Strict Hierarchies). The product DaVinciCode (Profit = 10) is assigned to both the Book and the Entertainment category; summing the category profits therefore yields 10 + 10 + 20 = 40 at the All level, although the correct total over the products DaVinciCode and HonoluluSkirt is 10 + 20 = 30.
99/131
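The double counting caused by a non-disjoint hierarchy can be reproduced in a few lines, using the data from the example figure:

```python
# DaVinciCode belongs to two categories, so summing per-category totals
# double-counts its profit (Condition 1: disjointness violated).
profits = {"DaVinciCode": 10, "HonoluluSkirt": 20}
category_of = {  # non-disjoint: DaVinciCode appears twice
    "Book": ["DaVinciCode"],
    "Entertainment": ["DaVinciCode"],
    "ClothingCategory": ["HonoluluSkirt"],
}

correct_total = sum(profits.values())  # 30
rollup_total = sum(profits[p]          # 40: DaVinciCode counted twice
                   for members in category_of.values() for p in members)

print(correct_total, rollup_total)  # 30 40
```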
Conditions for Summarizability

Figure: Condition 2: Completeness (Balanced Hierarchies). The Swiss cities Lausanne (profit = 10) and Montreux (profit = 5) roll up to the canton Vaud (profit = 15), but the Austrian cities Salzburg (profit = 15) and Vienna (profit = 40) have no canton; a roll-up from the city level via the canton level to the country level therefore misses the Austrian profits in the overall total of 70.
100/131
Attribute Groups in DFM
101/131
Generalized/Hetero-Homogeneous Hierarchies

Figure: Hetero-homogeneous dimension hierarchy: a general Sensor hierarchy with the levels <agent> and <agentType> is concretized into heterogeneous sub-hierarchies for Person (levels <agent> and <position>, attribute <age>), Process (levels <agent> and <processType>), and Device (levels <logicalDevice> and <deviceType>, attribute nominalAccuracy); Device is further concretized into MilkingParlor (levels <logicalDevice> and <milkingParlorType>, attribute measuredIngredients).
Key Performance Indicators (KPIs)

The base measures are typically combined into more comprehensive indicators of economic success.

Definition [31, p. 362]
“KPIs are complex measurements used to estimate the effectiveness of an organization in carrying out their activities and to monitor the performance of their processes and business strategies. KPIs are traditionally defined with respect to a business strategy and business objectives”

103/131
Goal Modeling and KPIs (Maté et al. [18])
The Business Intelligence Model (BIM) can be employed for the systematic derivation of KPIs that are in line with business strategy.
104/131
Example: BIM for AgriProKnow

Figure: BIM goal model for AgriProKnow: the farmer desires to maximize milk yield (evaluated by the Milk yield indicator), which is supported (AND, ++) by preventing animal illness and optimizing feed intake; further elements include the process Automatic Feeding, the strength “Well-fed animals” (evaluated by the Body Condition Score), and the threat “Antibiotics resistance” (evaluated by the number of known resistant germs).
Goal Modeling and KPIs (Maté et al. [18])

The Business Intelligence Model (BIM) can be employed for the systematic derivation of KPIs that are in line with the overall business strategy.

Using the Semantics of Business Vocabulary and Rules (SBVR), the KPIs are subsequently precisely specified in Structured English.

The KPIs defined in SBVR then translate into executable MDX queries over a multidimensional schema.
106/131
Goal-Based Selection of Visualizations (Golfarelli et al. [10])

Idea: The user specifies their analysis goals and other parameters, which are subsequently used to recommend (or recommend against) certain types of visualizations.

Users may hence declare:

• Goal: composition, order, cluster, distribution, etc.
• Interaction: overview, zoom, filter, details-on-demand
• Experience: lay or tech person
• Dimensionality: n-dimensional, tree, graph
• Cardinality: low, high
• Type of measure: nominal, ordinal, interval, ratio
107/131
Goal-Based Selection of Visualizations (Golfarelli et al. [10])

Given a single criterion, a visualization may be fit, acceptable, neutral, discouraged, or unfit.

For example, a pie chart is fit for composition whereas a heat map is unfit. A pie chart is fit for giving an overview whereas a bubble chart is acceptable.

Given selections for multiple criteria, the optimal visualization types can be calculated based on the qualitative suitability of each visualization for the different criteria.
108/131
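The combination of qualitative suitabilities can be sketched as a simple scoring scheme. The numeric scores and most suitability entries below are invented; only the pie chart, heat map, and bubble chart judgments for composition and overview come from the example in the text:

```python
# Map qualitative suitability to scores (fit > acceptable > neutral >
# discouraged > unfit); the numeric values are invented.
SCORE = {"fit": 2, "acceptable": 1, "neutral": 0, "discouraged": -1, "unfit": -2}

SUITABILITY = {  # chart type -> criterion -> qualitative suitability
    "pie chart":    {"goal:composition": "fit", "interaction:overview": "fit",
                     "cardinality:high": "unfit"},
    "heat map":     {"goal:composition": "unfit", "interaction:overview": "acceptable",
                     "cardinality:high": "fit"},
    "bubble chart": {"goal:composition": "neutral", "interaction:overview": "acceptable",
                     "cardinality:high": "acceptable"},
}

def recommend(criteria: list) -> str:
    """Return the chart type with the best combined score over the
    user's selected criteria (unknown criteria count as neutral)."""
    def total(chart: str) -> int:
        return sum(SCORE[SUITABILITY[chart].get(c, "neutral")] for c in criteria)
    return max(SUITABILITY, key=total)

print(recommend(["goal:composition", "interaction:overview"]))  # pie chart
```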
VizDSL (Morgan et al. [19])
Goals:

• Platform-independent and extensible modeling language
• Non-IT experts are able to quickly and easily describe, model, and create interactive visualizations

Figure: From structured source code via a VizDSL model to an interactive visualization
109/131
VizDSL (Morgan et al. [19])
Extension of the Interaction Flow Modeling Language (IFML) with modeling elements for interactive visualization of data.
Figure: The visual notation for VizDSL
110/131
VizDSL (Morgan et al. [19])
111/131
VizDSL (Morgan et al. [19])
112/131
VizDSL (Morgan et al. [19])
113/131
Analysis Rules
In AgriProKnow, we have implemented analysis rules based on the notion of OLAP patterns.

An action, e.g., calling a vet, can be triggered by noteworthy results of analyses that are periodically carried out.
The analyses are specified using OLAP patterns.
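Such a rule can be sketched as a periodic query plus a trigger condition. The table, data, and threshold below are invented, and AgriProKnow specifies the analyses via OLAP patterns rather than hand-written SQL:

```python
import sqlite3

# Illustrative analysis rule: compare each animal's latest milk yield
# against its average on earlier days; a noteworthy drop triggers an
# action such as calling the veterinarian.
def animals_at_risk(conn: sqlite3.Connection, threshold: float = 0.25) -> list:
    max_day = conn.execute("SELECT MAX(day) FROM milk").fetchone()[0]
    rows = conn.execute("""
        SELECT animal,
               AVG(CASE WHEN day < ? THEN yield END) AS prior,
               MAX(CASE WHEN day = ? THEN yield END) AS latest
        FROM milk GROUP BY animal""", (max_day, max_day)).fetchall()
    return [animal for animal, prior, latest in rows
            if prior and latest is not None
            and (prior - latest) / prior > threshold]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE milk (animal TEXT, day INTEGER, yield REAL)")
conn.executemany("INSERT INTO milk VALUES (?, ?, ?)",
                 [("A1", 1, 30), ("A1", 2, 30), ("A1", 3, 20),   # sharp drop
                  ("A2", 1, 25), ("A2", 2, 25), ("A2", 3, 24)])  # normal

for animal in animals_at_risk(conn):
    print(f"Alert: call the veterinarian for {animal}")
```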
114/131
OPEN ISSUES
Open Issues

• Integration of conceptual models (or knowledge graphs) with machine/deep learning
→ overlap between ER and Semantic Web communities
• ... any thoughts?

115/131
References I
[1] Combining objects with rules to represent aggregation knowledge in data warehouse and OLAP systems.

[2] A. Abelló, J. Darmont, L. Etcheverry, M. Golfarelli, J.-N. Mazón, F. Naumann, T. B. Pedersen, S. Rizzi, J. Trujillo, P. Vassiliadis, and G. Vossen. Fusion cubes: towards self-service business intelligence. International Journal of Data Warehousing and Mining.
117/131
References II
[3] D. Agrawal, P. Bernstein, E. Bertino, S. Davidson, U. Dayal, M. Franklin, and others. Challenges and opportunities with big data – a community white paper developed by leading researchers across the United States. Technical report, 2012. https://cra.org/ccc/resources/ccc-led-whitepapers/ (last access: 2 October 2018).
118/131
References III
[4] M. Bala, O. Boussaid, and Z. Alimazighi. A fine-grained distribution approach for ETL processes in big data environments. Data & Knowledge Engineering, 111:114–136, 2017.

[5] S. Dobson, M. Golfarelli, S. Graziani, and S. Rizzi. A reference architecture and model for sensor data warehousing. IEEE Sensors Journal, 18(18):7659–7670, 2018.
119/131
References IV
[6] Z. El Akkaoui, J. Mazón, A. A. Vaisman, and E. Zimányi. BPMN-based conceptual modeling of ETL processes. In A. Cuzzocrea and U. Dayal, editors, DaWaK 2012, volume 7448 of Lecture Notes in Computer Science, pages 1–14. Springer, 2012.

[7] Z. El Akkaoui and E. Zimányi. Defining ETL workflows using BPMN and BPEL. In Proceedings of the ACM 12th International Workshop on Data Warehousing and OLAP, pages 41–48, 2009.
120/131
References V
[8] L. Etcheverry, A. Vaisman, and E. Zimányi. Modeling and querying data warehouses on the semantic web using QB4OLAP. In L. Bellatreche and M. K. Mohania, editors, DaWaK 2014, volume 8646 of LNCS, pages 45–56. Springer, 2014.

[9] M. Golfarelli. Design issues in social business intelligence projects. In E. Zimányi and A. Abelló, editors, eBISS 2015, volume 253 of LNBIP, pages 62–86. Springer, 2016.
121/131
References VI
[10] M. Golfarelli, T. Pirini, and S. Rizzi. Goal-based selection of visual representations for big data analytics. In S. de Cesare and U. Frank, editors, ER 2017 Workshops, volume 10651 of LNCS, pages 47–57. Springer, 2017.

[11] M. Hilal, C. G. Schuetz, and M. Schrefl. An OLAP endpoint for RDF data analysis using analysis graphs. In Proceedings of the ISWC 2017 Posters & Demonstrations and Industry Tracks, 2017.
122/131
References VII
[12] M. Hilal, C. G. Schuetz, and M. Schrefl. Using superimposed multidimensional schemas and OLAP patterns for RDF data analysis. Open Computer Science, 8(1):18–37, 2018.

[13] J. Horkoff, D. Barone, L. Jiang, E. Yu, D. Amyot, A. Borgida, and J. Mylopoulos. Strategic business modeling: representation and reasoning. Software & Systems Modeling, 13(3):1015–1041, 2014.
123/131
References VIII
[14] C. Hurtado, C. Gutierrez, and A. Mendelzon. Capturing summarizability with integrity constraints in OLAP. ACM Transactions on Database Systems, 30:854–886, 2005.

[15] Indyco. Attribute groups, 2015. http://indyco.freshdesk.com/support/solutions/articles/1000212913-attribute-groups [Online; accessed 7-October-2018].
124/131
References IX
[16] J. Lechtenbörger and G. Vossen. Multidimensional normal forms for data warehouse design. Information Systems, 28:415–434, 2003.

[17] E. Malinowski and E. Zimányi. A conceptual model for temporal data warehouses and its transformation to the ER and the object-relational models. Data & Knowledge Engineering, 64(1):101–133, 2008.

[18] A. Maté, J. Trujillo, and J. Mylopoulos. Specification and derivation of key performance indicators for business analytics: A semantic approach. Data & Knowledge Engineering, 108:30–49, 2017.
125/131
References X
[19] R. Morgan, G. Grossmann, M. Schrefl, M. Stumptner, and T. Payne. VizDSL: A visual DSL for interactive information visualization. In J. Krogstie and H. A. Reijers, editors, CAiSE 2018, volume 10816 of LNCS, pages 440–455. Springer, 2018.

[20] S. Nalchigar and E. Yu. Business-driven data analytics: A conceptual modeling framework. Data & Knowledge Engineering, 117:359–372, 2018.
126/131
References XI
[21] T. Neuböck, B. Neumayr, M. Schrefl, and C. G. Schütz. Ontology-driven business intelligence for comparative data analysis. In E. Zimányi, editor, eBISS 2013, volume 172 of LNBIP, pages 77–120. Springer, 2014.

[22] B. Neumayr, M. Schrefl, and B. Thalheim. Hetero-homogeneous hierarchies in data warehouses. In S. Link and A. Ghose, editors, APCCM 2010, volume 110 of CRPIT, pages 61–70. Australian Computer Society, 2010.
127/131
References XII
[23] B. Oliveira and O. Belo. BPMN patterns for ETL conceptual modelling and validation, 2012.

[24] B. Oliveira, V. Santos, and O. Belo. Pattern-based ETL conceptual modelling, 2013.

[25] C. Ordonez, S. Maabout, D. S. Matusevich, and W. Cabrera. Extending ER models to capture database transformations to build data sets for data mining. Data & Knowledge Engineering, 89:38–54, 2014.
128/131
References XIII
[26] P. Russom. Hadoop for the enterprise. Technical report, TDWI, 2015. https://www.cloudera.com/content/dam/cloudera/Resources/PDF/Reports/TDWI-Best-Practices-Report_Hadoop-for-the-Enterprise.pdf (last access: 28 June 2016).

[27] C. G. Schuetz, B. Neumayr, M. Schrefl, and T. Neuböck. Reference modeling for data analysis: The BIRD approach. International Journal of Cooperative Information Systems, 25(2):1–46, 2016.
129/131
References XIV
[28] C. G. Schuetz, S. Schausberger, and M. Schrefl. Building an active semantic data warehouse for precision dairy farming. Journal of Organizational Computing and Electronic Commerce, 28(2):122–141, 2018.

[29] R. Sherman. Business Intelligence Guidebook. Morgan Kaufmann, 2015.

[30] V. Theodorou, A. Abelló, M. Thiele, and W. Lehner. Frequent patterns in ETL workflows: An empirical approach. Data & Knowledge Engineering, 112:1–16, 2017.
130/131
References XV
[31] A. Vaisman and E. Zimányi. Data Warehouse Systems – Design and Implementation. Springer, 2014.

[32] S. Williams. Business Intelligence Strategy and Big Data Analytics. Morgan Kaufmann, 2016.
131/131