8/9/2019 Slovak Public Procurement Announcements - Extraction, Transformation and Loading
[email protected] www.knowerce.sk
Slovak Public Procurement Announcements
Extraction, Transformation and Loading Process
July 2010
knowerce
Document information
Creator: Knowerce, s.r.o., Vavilovova 16, 851 01 Bratislava
[email protected], www.knowerce.sk
Author: Štefan Urbánek, [email protected]
Date of creation: 20.7.2010
Document revision: 2
1. Document Restrictions

Copyright (C) 2010 Knowerce, s.r.o., Stefan Urbanek

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".
2. Introduction
This document describes the extraction, transformation and loading (ETL) process for public procurement documents in Slovakia. The objective of the VVO project was to transform unstructured public procurement announcement documents into a structured form.
Source code: http://github.com/Stiivi/vvo-etl
Data source URL: http://www.e-vestnik.sk/
Application using the data: http://vestnik.transparency.sk
3. Overview
3.1. The Process

Public procurement announcement documents are processed in a chain of ETL jobs. The jobs are:
The reasons for creating several jobs instead of a single monolithic processing script are mainly: better maintainability, the ability to re-run a failed part of the chain, and the ability to plug other sources into the chain in the future.

If a part of the chain fails, it is not necessary to run the whole chain again, only the chain from the failed part onward. This lowers the processing load and the network load on the source servers. For example, if cleansing fails, it is not necessary to download the files again.
In addition to the processing jobs, there are three required but independent jobs:
Job                  Type     Description
Download             core     Download HTML documents from the source
Parse                core     Parse HTML documents into structured form
Load source          core     Load the structured form into a database table
Cleanse              core     Cleanse data, fix values, map corrections
Create cube          core     Create analytical structures: fact table and dimensions
Create search index  core     Create a search index for full-text searching with support for Slovak/ASCII searching
Regis extraction     support  Extract the list of all Slovak organisations
Geography loading    support  Load data from the Slovak post office about the regional break-down
CPV loading          support  Load CPV (Common Procurement Vocabulary) data
[Diagram: source → Download → Parse → Load source → Cleanse (extraction and transformation) → Create cube → Create search index (analytical transformation); Regis extraction, Geography loading and CPV loading feed into the chain.]
4. Jobs
4.1. Download
Inputs: HTML documents stored on the public procurements website
Outputs: HTML files stored locally
Configuration: public procurements web site root, path to bulletin index, document encoding
Options: incremental mode (default), full mode (download all announcements)
At the site root one can find a paginated list of bulletins:
http://www.e-vestnik.sk/#EVestnik/Vsetky_vydania
Following a bulletin link leads to a list of announcement types:
http://www.e-vestnik.sk/#EVestnik/Vestnik?date=2010-08-07&from=Vsetky_vydania
Clicking on a link with the desired public procurement type (procurement results) expands the list, yielding a list of all announcements within the bulletin:
http://www.e-vestnik.sk/#EVestnik/Vestnik?cat=7&date=2010-08-07
Situation:
- no data API provided by the website
- no single list of all public procurements, only paginated browsing of bulletins
- no proper HTML id attributes, nor unambiguous class attributes
- table-based layout
Process
1. Download and parse the document index at the specified site root; get the number of pages.
2. Download and parse all bulletin list pages; the output is the name and URL of each bulletin.
3. Compare the list of available bulletins with the list of already downloaded bulletins and generate the list of bulletins to be downloaded (all of them if a full download is requested).
4. Download all announcements found on each bulletin page and save them into the download directory.
5. Store the list of downloaded bulletins.
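Step 3 above can be sketched as a small ruby function; the function name and data shapes are illustrative, not the actual vvo-etl code:

```ruby
require 'set'

# Decide which bulletins still need to be downloaded, by comparing the
# names found on the site with the names already stored locally.
# In full mode everything is re-fetched.
def bulletins_to_download(available, downloaded, full = false)
  return available if full
  seen = Set.new(downloaded)              # bulletin names already stored locally
  available.reject { |name| seen.include?(name) }
end
```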
4.2. Parse
Inputs: HTML documents with announcements, stored locally
Outputs: YAML structured files with parsed fields, one YAML file per announcement
Configuration: none
Options: none
Situation:
- very messy HTML structure
- ambiguous class attributes, mis-use of class attributes
- no usable id attributes
- heavy table layout with nested tables; level 3 is common (table in table in table)
- sometimes broken layout, causing many parsing exceptions
- values not reliably indexable by referencing row number
- inconsistent table layout; tbody may or may not be present

Document example:
http://www.e-vestnik.sk/EVestnik/Detail/16563
Example of layout with emphasised contrast CSS for better layout visibility:
Example of broken layout, where the cyan values in the left column were supposed to be in the right column:
Example of element nesting within a document, 24 levels deep:

html > body > #page > #container > #main > #innerMain > div >
  > table > tbody > tr > td >
  > table > tbody > tr > td >
  > table > tbody > tr > td >
  > table > tbody > tr > td > span.hodnota
A situation like the one described above makes parsing of public procurement documents tricky. Rough document structure (as seen by a human reader):
- document title
- basic announcement information
- parts of the announcement
- each part of the announcement contains sections
- each section contains a list of information pieces (I would not call them key-value pairs, as they are not)
Process
The whole document is parsed as an HTML document tree.
Strategies used:
- Unicode regular expression matching
- element references by element index (unstable, but sufficient for most cases): instead of using proper id/class attributes (which were missing), we used the index of the element that we wanted to parse
- because the structure was not consistent, sometimes searching for elements was necessary instead of direct referencing by path, which made processing a little slower
1. read basic announcement information: date, announcement number, type
2. find the table with document parts and split off an HTML document subtree for each part
3. parse each part
Part parsing:

The main body of the document is a table containing cells which hold an optional part title and a part body in the form of a table. The part body table contains anonymous rows with section contents in two columns. The left column is used mostly for padding and might contain a section number. The right column contains the information to be extracted. The layout of parts and sections is depicted in the following picture:
[Diagram: a sequence of parts; each part is an optional part title row followed by one or more part body tables. Inside a part body, a row whose left column holds a number starts a section ("number | section title"); the following rows have an empty left column and a content cell in the right column.]
It was not possible to reliably find sections in parts by referencing rows directly. Each part was broken into a list of table rows and the rows were parsed sequentially, as on a tape:

1. prepare a section structure
2. get the next row
3. if the left column contains a value, it is the beginning of the next section:
   3.1. process the previous section, if there is any
   3.2. prepare a new section structure
   3.3. save the next section name into the section structure
4. if the left column is empty:
   4.1. add the right column to the list of section rows in the section structure
5. repeat from 2 until all rows are processed
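A minimal ruby sketch of this tape-like scan, assuming each row has already been reduced to a [left cell, right cell] pair of strings (the real job walks Nokogiri nodes):

```ruby
# Scan part rows sequentially: a non-empty left cell starts a new
# section; empty-left rows are content of the currently open section.
def split_into_sections(rows)
  sections = []
  current  = nil
  rows.each do |left, right|
    if left && !left.strip.empty?
      sections << current if current          # 3.1: store the previous section
      current = { number: left.strip, title: right, rows: [] }  # 3.2/3.3
    elsif current
      current[:rows] << right                 # 4.1: content row of open section
    end
  end
  sections << current if current              # flush the last open section
  sections
end
```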
Section parsing:

After parsing the parts, each section structure contains the section title, section number and a list of rows (cells from the right column of the part table). The rows are processed sequentially as well. Each set of section rows was parsed into field/value pairs using Unicode regexp matching. Because the naming of values was inconsistent, multiple values/matches or more complex regular expressions had to be used. The value keys had different wordings or used different words to describe the same value.

Examples of section rows:
[Figure: example section rows shown as a rendered document and the corresponding HTML]
Part V contained the list of contracts and required separate parsing.
No heavy data cleansing is performed; only numerical values are fixed and text strings are trimmed.
Issues:
- fields with currency amounts came in many forms: one amount (expected) or two amounts (expected and final); a single amount or a from-to range; with or without currency; with or without a VAT-included flag; with or without VAT rate
- there were no field name prefixes (such as name:, phone:) in all contacts; field order was used in that case (not 100% reliable)
- empty/bogus HTML nodes, sometimes preventing proper parsing
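The amount forms listed above suggest matching along the following lines; the patterns and Slovak keywords here are a simplified illustration, not the production expressions:

```ruby
# Numbers with thousand spaces and a decimal comma or dot, e.g. "2 500,50".
AMOUNT_RE = /(\d[\d\s]*(?:[.,]\d+)?)/u

def parse_amount_field(text)
  result = {}
  nums = text.scan(AMOUNT_RE).flatten.map { |n| n.gsub(/\s/, '').sub(',', '.').to_f }
  if text =~ /\bod\b.*\bdo\b/i                     # Slovak "od ... do" = from-to range
    result[:from], result[:to] = nums[0], nums[1]
  else
    result[:value] = nums[0]
  end
  result[:currency] = $1 if text =~ /(EUR|SKK|Sk)/
  result[:vat_included] = !!(text =~ /s\s+DPH/i)   # "s DPH" = VAT included
  result
end
```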
4.3. Load Source
Inputs: YAML structured files with parsed fields, one YAML file per announcement
Outputs: populated staging database table with contracts
Configuration: none
Options: default mode (just load data), create mode (create DB structures)
Process
Simple mapping of the structured files into a DB table: load each structured file and, for each contract in it, insert a contract record into the table.
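The mapping might be sketched as follows; the field names are illustrative, not the actual staging schema (the real job inserts via the sequel gem into PostgreSQL):

```ruby
require 'yaml'

# Turn one structured YAML announcement into one row per contract,
# denormalising the announcement number onto each row.
def contract_rows(yaml_text)
  doc = YAML.load(yaml_text)
  (doc['contracts'] || []).map do |c|
    { 'announcement' => doc['number'],
      'supplier'     => c['supplier'],
      'amount'       => c['amount'] }
  end
end
```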
The table contains mostly unprocessed raw text values; numeric types are used only for currency amounts. The content of the table mostly matches the information from the source documents.
4.4. Cleanse
Inputs: populated staging database table with contracts
Outputs: cleansed staging data with consolidated suppliers
Configuration: none
Options: default mode (just load data), create mode (create DB structures)
Process

The goal of this job is to cleanse the data taken from the source and consolidate it. More specifically:
- cleanse the organisation number (ICO) format (without validity checking)
- coalesce values of short enumerations
- consolidate date formats
- add procurer additions into the procurers table
- consolidate suppliers and add additions into the suppliers table
Suppliers Consolidation
Requirements:
- a table of suppliers that might contain more information than present in the REGIS database
- the possibility to automatically correct errors in source documents, such as invalid IDs
- collect all unknown IDs in a separate table for further correction
The presence and validity of the organisation identification number (ICO) in the source does not match the quality requirements. There are cases when the ICO does not match any organisation in the organisation database. For those cases a mapping table is created where one can specify a mapping of invalid company identifications to valid ones. There are two ways of corrective mapping:

map an organisation directly within a specific announcement:

[announcement, organisation ID] → [correct organisation ID]
map unknown organisations:
[country, organisation ID, organisation name] → [correct organisation ID]
The process is depicted in the following image:

1. Try to find unknown suppliers.
2. Coalesce the supplier: use the organisation ID from the suppliers table if found, otherwise resolve it through the mapping table.
3. Append newly found suppliers.

The reason for having a separate suppliers table is that it might be extended with more information than is provided by the organisations database REGIS.
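The coalescing step (2) can be sketched like this; table contents are represented as plain hashes and all names are illustrative:

```ruby
# Resolve a supplier's ICO against the suppliers table first, then via
# the corrections map; anything unresolved is collected as unknown for
# later manual fixing.
def coalesce_supplier(ico, suppliers, map, unknown)
  return suppliers[ico] if suppliers.key?(ico)     # known supplier
  return suppliers[map[ico]] if map.key?(ico)      # corrected via mapping table
  unknown << ico                                   # collect for further fixing
  nil
end
```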
[Diagram: records from sta_vvo_vysledky (Slovensko scope) are matched against sta_regis and sta_suppliers; suppliers that cannot be matched are looked up in map_suppliers; the coalesced result is stored in tmp_coalesced_suppliers_sk; new suppliers are appended to sta_suppliers and the remaining unknown suppliers are collected for correction.]
4.5. Create Cube
Inputs: cleansed staging data
Outputs: fact table, dimension tables, analytical model description
Configuration: none
Options: default mode (just load data), create mode (create DB structures)
This step creates and loads all structures for analytical processing:
- fact table (a fact is a contract)
- dimensions: supplier, procurer, process type, contract type, evaluation type, account sector, supplier geography
Process

1. create the dimension for suppliers
2. create the dimension for procurers
3. create the fact table (see below)
4. fix unknown dimension values: if there are values in the source data that are not found in the dimensions, mark them as unknown and add them into the dimension tables as new value additions
5. create a table with issues (for quality monitoring) and identify issues, such as empty or unknown fields
Create Fact Table
The fact table is created simply by transforming the cleansed data and joining it with the prepared dimension tables.
4.6. Create search index
Inputs: dimension tables
Outputs: Sphinx search index
Configuration: none
Options: none
This step creates an index of dimension values at searchable levels and indexes them with the Sphinx full-text search indexer. The index is created using a Slovak character mapping, to allow search queries in plain ASCII (without carons and accents).
The analytical model is a multidimensional cube in a star schema [1] with hierarchical dimensions that have multiple levels. It would not be sufficient to create a full-text search index for each table, as we need to know at which level the searched field was found. For this purpose a dimension index table is created.
The dimension index contains the fields:
- dimension
- dimension key (reference to the dimension row, the whole dimension point)
- level (for example: county, region or country in geography)
- level key (value of the level key attribute, for example: county code)
- indexed field name
- indexed field value

Sphinx indexes the dimension index table.
An example search query: Bystri*. There are several cities called Bystrica, such as Banska Bystrica; however, there is also a region called Banskobystricky that will match the same query, and we want to get both results: the higher level (region) and the detailed level (city).
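Filling the dimension index table could be sketched like this; the field and level names are illustrative:

```ruby
# For every record of a dimension and every searchable level, emit one
# index row that remembers the dimension, the level and the level key,
# so a Sphinx hit can be traced back to the exact hierarchy level.
def dimension_index_rows(dimension, levels, records)
  rows = []
  records.each do |rec|
    levels.each do |level, key_field, value_field|
      rows << {
        dimension: dimension,
        dim_key:   rec[:id],            # reference to the whole dimension point
        level:     level,               # e.g. :region or :county
        level_key: rec[key_field],
        value:     rec[value_field]     # the text Sphinx will index
      }
    end
  end
  rows
end
```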
4.7. Regis Download

Inputs: documents at the website of the Statistics Office of the Slovak Republic
Outputs: table with the list of organisations in Slovakia
Configuration: source URL, document ID range, number of concurrent processing threads
Options: incremental download (default), full reload
[1] Fact table joined with dimension tables with no deeper references. All tables are joined to the fact table directly; there are no chained joins FT - T1 - T2.
Process
Documents are downloaded sequentially by document ID from the source URL. The downloading is done in batches of 50k documents (configurable) and in 20 parallel threads (configurable).
In spite of the documents being labeled as HTML, they contain no valid HTML code and can be considered text documents with HTML tags. The downloaded documents are stripped of HTML tags and then parsed with regular expressions as plain-text documents.
The process of downloading and processing all documents takes 2 hours on average; therefore it is advised to run the process on a weekly basis.
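The batched, multi-threaded download might be sketched as below; the HTTP fetch is passed in as a block, since the real requests are specific to the Regis site, and the tag-stripping mirrors the plain-text handling described above:

```ruby
# Drain a queue of document IDs from thread_count worker threads,
# fetch each document via the supplied block, strip HTML tags and
# collect the plain text per ID.
def download_batch(ids, thread_count)
  results = {}
  mutex   = Mutex.new
  queue   = Queue.new
  ids.each { |id| queue << id }
  Array.new(thread_count) do
    Thread.new do
      while (id = (queue.pop(true) rescue nil))   # non-blocking pop, nil when drained
        raw  = yield(id)                          # fetch document by ID
        text = raw.gsub(/<[^>]+>/, ' ')           # strip HTML tags, keep the text
        mutex.synchronize { results[id] = text }
      end
    end
  end.each(&:join)
  results
end
```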
4.8. Geography Loading

Inputs: list of municipalities and counties from the Slovak Post Office
Outputs: single de-normalised table with hierarchical geography information about Slovakia
Configuration: none
Options: none
Process
Records are simply mapped, using mapping tables containing ISO 3166-2:SK division codes and region names, into a single de-normalised table.
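As an illustration of the mapping (with hypothetical record fields and only two of the eight regions shown):

```ruby
# ISO 3166-2:SK codes for Slovak regions (illustrative subset).
ISO_REGIONS = {
  'Bratislavsky kraj'    => 'SK-BL',
  'Banskobystricky kraj' => 'SK-BC'
}

# Attach the ISO division code to a municipality record.
def denormalise_geography(record)
  record.merge(region_code: ISO_REGIONS[record[:region]])
end
```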
4.9. CPV Loading

Inputs: multilingual wide CPV code table
Outputs: single de-normalised table with hierarchical CPV structure
Configuration: none
Options: none
Process
The Common Procurement Vocabulary (CPV) code table provided by EU institutions has a linear structure with tree-structure properties. This table is transformed into a de-normalised table with the tree hierarchy levels in multiple columns.
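A simplified sketch of deriving hierarchy columns from a CPV code: CPV encodes the tree in the leading digits (the first two identify the division, the first three the group), with a check digit after the dash. The function below is an illustration, not the full flattening job:

```ruby
# Derive the division and group codes for one CPV code by zero-padding
# its leading digits; the check digit (after the dash) is dropped.
def cpv_levels(code)
  digits = code[0, 8]                   # the 8-digit code without the check digit
  {
    division: digits[0, 2] + '0' * 6,   # first two digits: division
    group:    digits[0, 3] + '0' * 5,   # first three digits: group
    code:     digits
  }
end
```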
5. Data
There are three data stores:
- source mirror on a file system
- staging data: database schema
- datamart: database schema

More detailed view:
[Diagram: the source mirror holds the source documents, HTML files and YAML files (download, parse); the staging data schema holds source and staging contract data, staging data lists, mappings and temporary tables (load source, cleanse); the datamart holds the fact table, dimensions and the logical model (metadata), together forming the contracts cube (create cube).]
5.1. Source Mirror

The source mirror contains the downloaded original documents and the parsed, structured version of the documents in YAML format. If the source becomes unavailable and it is desired to parse the files again (more attributes gathered, different parsing method, bug fix), it can be done on the locally stored files.
Documents are not parsed directly into the database. Reasons:
- only YAML text file storage is required
- structured documents can be processed with other tools without any database server connection
5.2. Staging Data

Structured files are loaded into the staging data store in the database (preferably a separate schema). The files are loaded without any, or with only very minor, transformations. The table should be a 1:1 copy of the structured files.
The staging data store contains:
- lists/enumerations, for example ISO country region subdivisions
- copies of various sources or preprocessed datasets, such as geography from the SK post office and the registry of organisations (REGIS)
- staging data for procurers and suppliers, which might contain more information than provided by the registry of organisations (REGIS)
- maps for mapping source values to desired values, coalescing and unifying:
  - map of unknown organisations: maps unknown org. names and org. codes to existing organisations
  - map of region names: region naming in REGIS differs from the official post office region registry
  - map of reference codes: maps full-text values, such as names of procurement types, to short codes (identifiers) that will be used as keys; also unifies similar names into the same code
- temporary tables: tables used during the transformation process that are created only for the purpose of a single transformation run (for example: coalesced suppliers according to REGIS, mapped unknown organisations and existing registered organisations)
Some tables are appended with new data during the transformation process. New data are added into:
- the map of unknown organisations, for further fixing
- newly known organisations, for further updates with additional information
5.3. Datamart Datastore

The datamart datastore, a separate database schema, contains the final data ready for analysis and reporting. Structures in the schema are:
- logical model: metadata description of the OLAP cube for contracts (Brewery framework objects)
- dimension tables: tables with (hierarchical) dimension values
- fact table: cleansed table with procurement contracts, joinable with dimensions
The dimension tables together with the fact table in this schema form a snowflake schema [2].
Brewery OLAP uses the structures in the datamart datastore to denormalise the snowflake schema into a wide fact table suitable for analysis, aggregation and reporting. This means that the end-user, the analyst, does not have to know about the physical structures behind the procurement contracts. He has only one logical fact table where one row is one fact, that is, one contract. The logical metadata enables the analyst to perform analysis on the multidimensional hierarchical structure.
[2] http://en.wikipedia.org/wiki/Snowflake_schema
6. Search Index
One of the requirements for the public procurements portal was to be able to search through the data by many different fields. The nature of the final data is:
- many fields, described by metadata; we should not rely on a fixed data structure
- hierarchical structure; we need to know at what level the value that we are searching for can be found
Example of a search query: chemical. The word chemical might be contained in the subject type, but at different levels: division, category or subcategory. We have to know the exact level where the word appeared. If the word chemical is found at the division level, we want to report at the division level; if the word is found at the category level, we want to aggregate at the category level, etc.
The Sphinx search engine can create one index for a table for a known set of fields. While searching, we do not know in which field the value was found, only the document number (row). To make searching in multiple fields and through hierarchies possible, we had to pre-index the data with enough metadata. The final table that is indexed contains:
- string value of the indexed searchable field
- dimension of the field (cpv, organisation, region, ...)
- dimension level of the field (division/category/subcategory, region/county, ...)
- level key of the indexed field
- an index document id that will be returned by Sphinx
7. Installation
7.1. Software Requirements

- PostgreSQL database server
- ruby 1.9 (does not work with version 1.8)
- gems: sequel, data-mapper, nokogiri
- Sphinx
- Brewery from http://github.com/Stiivi/brewery/
7.2. Preparation
I. create a directory where working files, such as dumps and ETL files, will be stored, for example: /var/lib/vvo-files
II. initialize and configure Brewery (see Brewery installation instructions)
III. create two database schemas: vvo_staging for staging tables and vvo_data for analytical data
7.3. ETL Database initialisation

To initialize the ETL database schema, run the Brewery ETL tool:
etl initialize
This will create all necessary system tables. If you try to initialise a schema which already contains ETL system tables you will get an error message; this prevents you from overwriting existing data. To recreate the schema and start with empty tables, execute the initialize command with the --force flag:
etl --force initialize
8. Running ETL Jobs
8.1. Launching
Manual Launching
Jobs are run by simply launching the etl tool:
etl run job_name
To manually run all daily jobs, you might use the following script:
#!/bin/bash

DEBUG='--debug'

etl $DEBUG run vvo_download
etl $DEBUG run vvo_parse
etl $DEBUG run vvo_load_source
etl $DEBUG run vvo_cleanse
etl $DEBUG run vvo_create_cube
etl $DEBUG run vvo_search_index
If a job fails, you only have to run the failed job and the jobs after it.
To do a full download instead of an incremental one, run:

etl run vvo_download all