www.adequate.at
Workshop on Quality Assessment and Improvements on Open Data (Portals)
opendata.ch conference, 14.6.2016, 12:45 - 14:00 CEST, Lausanne, Casino de Montbenon, Allée Ernest-Ansermet 3
Slides published CC-BY AT 3.0
Jürgen Umbrich Vienna University of Economics and Business [email protected]
Johann Höchtl Donau-Universität Krems [email protected]
Martin Kaltenböck Semantic Web Company [email protected]
Agenda
Time | Session | Remarks
20' incl. Q&A | Welcome & Introduction | Martin Kaltenböck (SWC)
● WS Objectives, Agenda & WS Team
● Participants
● The ADEQUATe project: basics, objectives, status & outlook
20' incl. Q&A | Results of Requirements Elicitation, DQ Metrics and Interaction Items | Johann Höchtl (DUK)
● What do the users want?
● What are the most “important” ones? What are metrics specifically targeting openness?
● Why data portal quality interaction items with end users, and what do we plan to do in ADEQUATe?
20' incl. Q&A | Best Practice & the ADEQUATe OD Framework | Jürgen Umbrich (WU)
● Data & CSV on the Web working group recommendations (W3C)
● ADEQUATe Framework: architecture & components
15' open discussion | Interactive & open discussion on DQ issues | Moderated by the WS Team
● Requirements for DQ in Open Data
● What is in place or planned for DQ
FFG Project: http://www.adequate.at
The “ADEQUATe” project is funded by the Austrian Federal Ministry for Transport, Innovation and Technology under the FTI programme “IKT der Zukunft” (“ICT of the Future”) and is administered by the Austrian Research Promotion Agency (FFG) [project number: 849982].
What is ADEQUATe?
ADEQUATe (Analytics & Data Enrichment to improve the QUAliTy of Open Data) builds on two observations:
● An increasing amount of Open Data is becoming available as an important resource for emerging businesses.
● Integrating such open, freely reusable data sources into organisations’ data warehouses and data management systems is seen as a key success factor for competitive advantage in a data-driven economy.
The project identifies two crucial issues that have to be tackled to fully exploit the value of open data and to integrate it efficiently with other data sources:
● the overall quality issues with metadata and the data itself
● the lack of interoperability between data sources
The project's approach is to address these points at an early stage, when the open data is freshly provided by governmental organisations or others.
What is ADEQUATe?
✓ 3 Partners:
1. Semantic Web Company
2. Danube University Krems
3. Vienna University of Economics and Business
✓ 30 months project duration, Oct. 2015 - March 2018
✓ 2 Use Case Partners: data.gv.at & opendataportal.at
✓ Objective: Improvement of Data Quality through:
○ Quality Assessment and Monitoring
○ Automatic Algorithms
○ Making use of Linked Data principles
○ Improvements of the data by the user (community)
Project Structure & Schedule
ADEQUATe: GOALS
WP1 - Requirements & Specification
WP2 - Quality Improvement & Monitoring Framework
WP3 - Algorithms & Tools for Quality Improvements
WP4 - Data Linkage
WP5 - Community driven Quality Improvements
WP6 - Use Case Integration
WP7 - Project Management & Dissemination
Outlook & Timing of Results
M9 (06/2016): Quality metrics, Requirements
M10 (07/2016): Architecture Blueprint
M15 (12/2016): Quality monitoring framework, Data linkage
M21 (06/2017): Quality improvements, Use case connection
M30 (03/2018): Evaluation, Refinements, Improvements
Concrete Outputs & Outlook
✓ End of June 2016: 3 Deliverables
○ State of the Art
○ Requirements Elicitation
○ Quality Metrics
✓ End of July 2016: 1 Deliverable
○ Architecture Blueprint
○ All components specified
✓ End of 2016: ADEQUATe Framework - 1st release
○ Assessment & Monitoring Framework
○ Data Quality Algorithms & Tools
○ Linked Data Mechanisms
○ 1st set of user driven Mechanisms
✓ Early 2017: Dock onto ODP & data.gv.at
Results of Requirements Elicitation
Contents and Formats
○ I would really prefer to have the data themselves consistent. [...] metadata does not match; standards regarding the representation of their content
○ It would be really great if we could shift somehow to UTF-8
○ meta data for CSV files were incomplete [...] header for CSV was missing
○ no static identifiers for objects in data sets. This in turn leads to problems if you want to track changes related to these objects over time
Results of Requirements Elicitation
Communication
○ central communication point for exchanging experiences and issues
○ Meta data should be written in English language
Reliability
○ Servers are restarted every day [...] hosted data becomes unavailable
DQ metrics (1)
Completeness
● Metadata completeness: How many (mandatory) metadata keys have values?
● Table completeness: How many (CSV) cells have non-null values?
Timeliness
● Tau of Data: How “outdated” are datasets based on the promised update frequency
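The completeness and timeliness ideas above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the notion of "empty" and the simple overdue check are hypothetical, and the sketch does not attempt to reproduce the "Tau of Data" formula or the metric definitions used in ADEQUATe.

```python
from datetime import date, timedelta


def metadata_completeness(metadata, mandatory_keys):
    """Share of mandatory metadata keys that carry a non-empty value."""
    if not mandatory_keys:
        return 1.0
    empty_values = (None, "", [], {})  # assumption: what counts as "no value"
    filled = sum(1 for key in mandatory_keys
                 if metadata.get(key) not in empty_values)
    return filled / len(mandatory_keys)


def table_completeness(rows):
    """Share of cells that are non-null (here: not None and not blank)."""
    cells = [cell for row in rows for cell in row]
    if not cells:
        return 1.0
    non_null = sum(1 for cell in cells
                   if cell is not None and str(cell).strip() != "")
    return non_null / len(cells)


def is_outdated(last_modified, update_interval_days, today):
    """True if the dataset has missed its promised update interval."""
    return today - last_modified > timedelta(days=update_interval_days)


# Hypothetical metadata record and table, for illustration only.
meta = {"title": "Exhibitions", "license": "", "publisher": None}
print(metadata_completeness(meta, ["title", "license", "publisher", "modified"]))
print(table_completeness([["1", "", "a"], ["2", None, "b"]]))
print(is_outdated(date(2016, 1, 1), 30, date(2016, 6, 14)))
```

Each function returns a plain ratio or boolean so that the values can be stored and monitored over time.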
DQ metrics (2)
Machine readability
● Regularity of CSV files (CSV-Lint), RDF, ...
● Structural consistency - variations in structure of CSV files
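One simple way to look at structural consistency is to count how many columns each row has. The sketch below is a simplified check written for illustration; it is not CSV-Lint, and the function names are hypothetical.

```python
import csv
import io
from collections import Counter


def column_count_profile(csv_text, delimiter=","):
    """Count how many non-blank rows have each number of columns."""
    reader = csv.reader(io.StringIO(csv_text, newline=""), delimiter=delimiter)
    return Counter(len(row) for row in reader if row)


def is_structurally_consistent(csv_text, delimiter=","):
    """True if every non-blank row has the same number of columns."""
    return len(column_count_profile(csv_text, delimiter)) <= 1


# Hypothetical sample: the last row has one column too many.
sample = "id,city\n1,Wien\n2,Krems,extra\n"
print(column_count_profile(sample))
print(is_structurally_consistent(sample))
```

The profile shows not only whether a file is irregular but also how many rows deviate, which is more useful for a quality report than a single yes/no flag.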
Openness
● Open formats - no well-defined definition of what constitutes an open format
● Open licenses - opendefinition.org seems to have them all covered
Persistence
Contributors to DQ Improvements (1/2)
● Providers
○ Correctness and Completeness of Data and Metadata
○ SLAs governing availability
○ Readiness for feedback, discussion and interaction
● Algorithms
○ Automated improvements
■ Availability checks and reporting
■ Missing information, outliers
■ Check of format (valid UTF-8?), size
■ Data format conversions: CSV → CSV on the Web specification
○ Semi-automated Improvements and Enhancements
■ Identification of related data sets
■ Mapping of (data) attributes, ...
● Interaction with the Data Community
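As an example of an automated check from the list above, the following sketch tests whether a downloaded file is valid UTF-8. The function name and return shape are assumptions made for illustration, not part of the ADEQUATe code base.

```python
def check_utf8(raw: bytes):
    """Return (is_valid, error_offset); error_offset is None for valid input."""
    try:
        raw.decode("utf-8")
        return True, None
    except UnicodeDecodeError as err:
        # err.start is the byte offset of the first undecodable byte.
        return False, err.start


# The same text encoded once as UTF-8 and once as Latin-1.
print(check_utf8("Höchtl".encode("utf-8")))
print(check_utf8("Höchtl".encode("latin-1")))
```

Reporting the byte offset of the first problem makes it easier for a provider to locate the offending record.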
Interaction: Data Community
● Check the results of automated enhancements
○ Interlinking
○ Format conversions
○ Encodings
● Correct and report mistakes
● Data enrichment and transformations
Interaction: Data Community
https://open.wien.gv.at/site/riesenbaum-in-wien-entdeckt/#more-87184
Interaction: Forking: Identify - Improve - Share
[Figure: three versions of a small two-row sample table, illustrating how a dataset is forked, improved and shared.]
Making results tangible
https://github.com/antontarasenko/gpq/blob/master/notebooks/contracts_intro.ipynb
Government Procurement Queries project: US Government contracts 2000 - 2016 (USAspending.gov)
The ADEQUATe Framework
● The ADEQUATe framework offers:
○ quality assessment and monitoring
○ a set of data quality improvement algorithms
○ a set of algorithms to create and maintain a knowledge graph and to “link” data into this graph
■ Think about shared identifiers for addresses, companies, departments, parties, ...
○ community involvement (e.g., data editors, feedback loops, forking & merging)
● Main objectives:
○ all developed components will be Open Source (see the ADEQUATe GitHub repo)
○ components should be usable as standalone components
■ Use only what you need
The ADEQUATe Framework
● Core Components
1. Data monitoring
2. Knowledge Vault
3. Quality Assessment
4. Quality Improvement
5. Data Linkage
6. Community Improvement
7. UI, API & User authentication
[Figure: architecture diagram with the labels Users, Clients, UI, Public API, Authentication / Load Balancing, Orchestration / API, (Meta)Data Monitor, Knowledge Vault, Quality Assessment, Quality Improvement, Linkage, Community Improvement, catalog, data.gv.at, ODP; legend: RESTful API, Component, Data.]
W3C CSV on the Web & ADEQUATe
One core feature of ADEQUATe will be the use of the CSV on the Web metadata standard, which makes it possible to:
➢ describe CSV files
○ dialect & encoding used
○ table & column descriptions (with language tags)
○ data types and value ranges for columns
➢ add semantics to them
○ primary & foreign keys, URIs, entity types, ...
➢ validate CSV files against a predefined schema
➢ specify the transformation
○ CSV -> JSON or RDF
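To make the CSV -> JSON step concrete, here is a minimal sketch that turns each CSV row into a JSON object keyed by the header names. It is an assumption-laden simplification: it does not implement the output format defined by the W3C CSV-to-JSON conversion, and the function name and sample data are hypothetical.

```python
import csv
import io
import json


def csv_to_json_rows(csv_text):
    """Turn CSV text with a header row into a list of plain dicts."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    return [dict(row) for row in reader]


# Hypothetical two-row sample using column names from the mumok example.
sample = "exhibition_id,city\n1,Wien\n2,Krems\n"
print(json.dumps(csv_to_json_rows(sample), ensure_ascii=False, indent=2))
```

All values stay strings here; applying the data types declared in a metadata file would be a separate step.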
W3C CSV on the Web: Example (JSON-LD) 1/3
{
"@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
"url": "http://data.mumok.at/exhibition.csv",
"dc:title": "Exhibitions for objects from the mumok collection",
"dcat:keyword": ["art", "museum", "exhibition"],
"dc:publisher": {
"schema:name": "mumok - museum moderner kunst stiftung ludwig wien",
"schema:url": {"@id": "http://www.mumok.at"}
},
"dc:license": {"@id": "https://creativecommons.org/licenses/by/3.0/at/legalcode"},
"dc:modified": {"@value": "2015-07-04", "@type": "xsd:date"},
….
W3C CSV on the Web: Example (JSON-LD) 2/3
"dialect": {
"encoding": "utf-8", "lineTerminators": ["\r\n", "\n"],
"quoteChar": "\"", "doubleQuote": true,
"skipRows": 0, "commentPrefix": "#",
"header": true, "headerRowCount": 1,
"delimiter": ",",
"skipColumns": 0,
"skipBlankRows": false,
"skipInitialSpace": false,
"trim": false
},
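A dialect description like the one above maps fairly directly onto the parameters of Python's csv module. The sketch below is a partial, hypothetical mapping: it honours only delimiter, quoteChar, doubleQuote, skipInitialSpace, skipRows and the header settings, and ignores the remaining properties (encoding, lineTerminators, commentPrefix, skipColumns, skipBlankRows, trim).

```python
import csv
import io

# Subset of the dialect shown above, as a Python dict.
dialect = {
    "delimiter": ",", "quoteChar": "\"", "doubleQuote": True,
    "skipInitialSpace": False, "skipRows": 0,
    "header": True, "headerRowCount": 1,
}


def read_with_dialect(csv_text, dialect):
    """Parse CSV text using a few CSVW dialect properties.

    Returns (header, data_rows); header is None if no header row is declared.
    """
    reader = csv.reader(
        io.StringIO(csv_text, newline=""),
        delimiter=dialect.get("delimiter", ","),
        quotechar=dialect.get("quoteChar", '"'),
        doublequote=dialect.get("doubleQuote", True),
        skipinitialspace=dialect.get("skipInitialSpace", False),
    )
    rows = list(reader)[dialect.get("skipRows", 0):]
    default_header_rows = 1 if dialect.get("header", True) else 0
    n_header = dialect.get("headerRowCount", default_header_rows)
    # Simplification: only the first header row is returned.
    header = rows[0] if n_header and rows else None
    return header, rows[n_header:]


header, data = read_with_dialect('exhibition_id,city\n1,"Wien ""Mitte"""\n', dialect)
print(header)
print(data)
```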
W3C CSV on the Web: Example (JSON-LD) 3/3
"tableSchema": {
"columns": [{
"name": "exhibition_id",
"titles": "Exhibition Identifier",
"dc:description": "A unique identifier for the exhibition.",
"datatype": "integer",
"required": true
}, {
"name": "city",
"titles": "City",
"dc:description": "The city in which the exhibition took place (no language defined, mostly in German).",
"datatype": "string"
}
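Column descriptions like these can drive a simple validation step. The following sketch checks only the "required" flag and the "integer" datatype for one row; the function name is hypothetical, and the integer pattern is a simplification rather than a full implementation of the datatypes in the CSV on the Web specifications.

```python
import re

# The two column descriptions from the example, reduced to the checked fields.
columns = [
    {"name": "exhibition_id", "datatype": "integer", "required": True},
    {"name": "city", "datatype": "string"},
]

INTEGER_PATTERN = re.compile(r"[+-]?[0-9]+")  # assumption: plain decimal integers


def validate_row(row, columns):
    """Return a list of human-readable problems for one row (dict of strings)."""
    problems = []
    for col in columns:
        name = col["name"]
        value = row.get(name, "")
        if value is None or value == "":
            if col.get("required", False):
                problems.append(f"{name}: required value is missing")
            continue
        if col.get("datatype") == "integer" and not INTEGER_PATTERN.fullmatch(value):
            problems.append(f"{name}: {value!r} is not an integer")
    return problems


print(validate_row({"exhibition_id": "12", "city": "Wien"}, columns))
print(validate_row({"exhibition_id": "12a", "city": "Wien"}, columns))
```

Returning a list of messages instead of raising on the first error lets a quality report show all problems of a row at once.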
W3C CSV on the Web: Discovery
● Registered content type: application/csvm+json
● 3 discovery mechanisms
○ File extension
■ http://data.mumok.at/exhibition.csv -> http://data.mumok.at/exhibition.csv-metadata.json
○ Well-known location
■ /.well-known/csvm
○ Link HTTP header
» curl -I http://data.mumok.at/exhibition.csv
HTTP/1.1 200 OK
Date: Thu, 26 Nov 2015 22:18:47 GMT
Server: Apache/2.2.22 (Debian)
….
Content-Length: 112723
Content-Type: text/csv; charset=utf-8; header=present
Link: </exhibition.csv-metadata.json>;rel=describedBy;type=application/csvm+json
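The three discovery mechanisms can be sketched as a small helper that lists where a client might look for metadata. This is an illustration under assumptions: the function name is hypothetical, the Link header parsing is naive (it splits on commas), and the order of the result is not meant to reflect the precedence defined by the specification.

```python
from urllib.parse import urljoin


def metadata_candidates(csv_url, link_header=None):
    """List (mechanism, url) pairs to try when looking for CSVW metadata."""
    candidates = []
    if link_header:
        for part in link_header.split(","):
            target, _, params = part.strip().partition(";")
            target = target.strip()
            normalized = params.replace(" ", "").replace('"', "").lower()
            if ("rel=describedby" in normalized
                    and target.startswith("<") and target.endswith(">")):
                # Resolve the (possibly relative) link target against the CSV URL.
                candidates.append(("link-header", urljoin(csv_url, target[1:-1])))
    # File extension mechanism: append "-metadata.json" to the CSV URL.
    candidates.append(("file-extension", csv_url + "-metadata.json"))
    # Well-known location on the same host.
    candidates.append(("well-known", urljoin(csv_url, "/.well-known/csvm")))
    return candidates


link = "</exhibition.csv-metadata.json>;rel=describedBy;type=application/csvm+json"
for mechanism, url in metadata_candidates("http://data.mumok.at/exhibition.csv", link):
    print(mechanism, url)
```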
CSV on the Web Summary
● Don’t publish CSV on the Web only for humans; publish it for machines as well
○ e.g., EXCEL exports
● RFC 4180
● Encoding
○ Use UTF-8, don’t mix encodings
● File extension: .csv
● Content-type: text/csv
Optional, but a big improvement!
● Ideally, publish CSV metadata along with your CSV file
● Avoid acronyms or encodings (e.g., sex=1,2,3)
CSV on the Web Summary
● CSV URLs
● CSVs link to other CSVs
● CSVs link to other resources
● RDF and JSON conversion
REFERENCES
● CSV on the Web Working Group
● CSV on the Web Community Group
● CSV on the Web Github Repository
● Tabular Data on the Web - A Introduction to CSV on the Web (Slides)
● Implementing CSV on the Web ( Gregg Kellogg)
Announcements & Pointers
@adequate_od
17-19 May 2017, Danube University Krems
30.8.-02.09.2016, Helsinki
Contact
Jürgen Umbrich, Vienna University of Economics and Business
Juergen.umbrich @ wu.ac.at
Short CV: https://www.wu.ac.at/en/infobiz/team/umbrich/

Johann Höchtl, Donau-Universität Krems
Johann.hoechtl @ donau-uni.ac.at
Short CV: https://at.linkedin.com/in/johannhoechtl

Martin Kaltenböck, Semantic Web Company
[email protected]
Short CV: https://www.linkedin.com/in/martinkaltenboeck

http://adequate.at/
http://vienna.theodi.org