8/9/2019 Slovak Public Procurement Announcements - Extraction, Transformation and Loading
[email protected] www.knowerce.sk
Slovak Public Procurement Announcements
Extraction, Transformation and Loading Process
July 2010
knowerce
Document information
Creator: Knowerce, s.r.o., Vavilovova 16, 851 01 Bratislava
[email protected], www.knowerce.sk
Author: Štefan Urbánek, [email protected]
Date of creation: 20.7.2010
Document revision: 2
1. Document Restrictions

Copyright (C) 2010 Knowerce, s.r.o., Stefan Urbanek

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".
2. Introduction
This document describes the extraction, transformation and loading (ETL) process for public procurement documents in Slovakia. The objective of the VVO project was to transform unstructured public procurement announcement documents into a structured form.
Source code: http://github.com/Stiivi/vvo-etl
Data source URL: http://www.e-vestnik.sk/
Application using the data: http://vestnik.transparency.sk
3. Overview
3.1. The Process

Public procurement announcement documents are processed in a chain of ETL jobs. The jobs are:
The reasons for creating several jobs instead of a single monolithic processing script are mainly: better maintainability, the ability to re-run a failed part of the chain, and the ability to plug other sources into the chain in the future.

If a part of the chain fails, it is not necessary to run the whole chain again, only the chain from the failed part onward. This lowers the processing load and the network load on the source servers. For example, if cleansing fails, it is not necessary to download the files again.
In addition to the processing jobs, there are three required but independent jobs:
Job                  Type     Description
Download             core     Download HTML documents from the source
Parse                core     Parse HTML documents into structured form
Load source          core     Load the structured form into a database table
Cleanse              core     Cleanse data, fix values, map corrections
Create cube          core     Create analytical structures: fact table and dimensions
Create search index  core     Create a search index for full-text searching with support for Slovak/ASCII searching
Regis extraction     support  Extract the list of all Slovak organisations
Geography loading    support  Load data from the Slovak post office about the regional break-down
CPV loading          support  Load CPV (Common Procurement Vocabulary) data
[Diagram: source → Download → Parse → Load source → Cleanse (extraction and transformation) → Create cube → Create search index (analytical transformation); Regis extraction, Geography loading and CPV loading feed into the chain.]
4. Jobs
4.1. Download
Inputs: HTML documents stored on the public procurements website
Outputs: HTML files stored locally
Configuration: public procurements web site root, path to bulletin index, document encoding
Options: incremental mode (default), full mode (download all announcements)
At the site root one can find a paginated list of bulletins:
http://www.e-vestnik.sk/#EVestnik/Vsetky_vydania
Following a bulletin link leads to a list of announcement types:
http://www.e-vestnik.sk/#EVestnik/Vestnik?date=2010-08-07&from=Vsetky_vydania
Clicking on a link with the desired public procurement type (procurement results) expands the list, yielding a list of all announcements within the bulletin:
http://www.e-vestnik.sk/#EVestnik/Vestnik?cat=7&date=2010-08-07
Situation:
- no data API provided by the website
- no single list of all public procurements, only paginated browsing of bulletins
- no proper HTML id attributes, nor unambiguous class attributes
- table-based layout
Process
1. Download and parse the document index at the specified site root; get the number of pages.
2. Download and parse all bulletin list pages; the output is the name and URL of each bulletin.
3. Compare the list of available bulletins with the list of already downloaded bulletins and generate the list of bulletins to be downloaded (all of them if a full download is requested).
4. Download all announcements found on each bulletin page and save them into the download directory.
5. Store the list of downloaded bulletins.
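Step 3 above can be sketched as a small ruby function; the function name and data shapes are illustrative, not the actual vvo-etl code:

```ruby
require 'set'

# Decide which bulletins still need to be downloaded, by comparing the
# names found on the site with the names already stored locally.
# In full mode everything is re-fetched.
def bulletins_to_download(available, downloaded, full = false)
  return available if full
  seen = Set.new(downloaded)              # bulletin names already stored locally
  available.reject { |name| seen.include?(name) }
end
```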
4.2. Parse
Inputs: HTML documents with announcements, stored locally
Outputs: YAML structured files with parsed fields, one YAML file per announcement
Configuration: none
Options: none
Situation:
- very messy HTML structure
- ambiguous class attributes, mis-use of class attributes
- no usable id attributes
- heavy table layout with nested tables; level 3 is common (table in table in table)
- sometimes broken layout, causing many parsing exceptions
- values not reliably indexable by referencing row number
- inconsistent table layout; tbody may or may not be present

Document example:
http://www.e-vestnik.sk/EVestnik/Detail/16563
Example of layout with emphasised contrast CSS for better layout visibility:
Example of broken layout, where the cyan values in the left column were supposed to be in the right column:
Example of element nesting within a document, 24 levels deep:

html > body > #page > #container > #main > #innerMain > div >
  > table > tbody > tr > td >
  > table > tbody > tr > td >
  > table > tbody > tr > td >
  > table > tbody > tr > td > span.hodnota
A situation like the one described above makes parsing of public procurement documents tricky. Rough document structure (as seen by a human reader):
- document title
- basic announcement information
- parts of the announcement
- each part of the announcement contains sections
- each section contains a list of information pieces (I would not call them key-value pairs, as they are not)
Process
The whole document is parsed as an HTML document tree.
Strategies used:
- Unicode regular expression matching
- element references by element index (unstable, but sufficient for most cases): instead of using proper id/class attributes (which were missing), we used the index of the element that we wanted to parse
- because the structure was not consistent, sometimes searching for elements was necessary instead of direct referencing by path, which made processing a little slower
1. read basic announcement information: date, announcement number, type
2. find the table with document parts and split off an HTML document subtree for each part
3. parse each part
Part parsing:

The main body of the document is a table containing cells which hold an optional part title and a part body in the form of a table. The part body table contains anonymous rows with section contents in two columns. The left column is used mostly for padding and might contain a section number. The right column contains the information to be extracted. The layout of parts and sections is depicted in the following picture:
[Diagram: a sequence of parts; each part is an optional part title row followed by one or more part body tables. Inside a part body, a row whose left column holds a number starts a section ("number | section title"); the following rows have an empty left column and a content cell in the right column.]
It was not possible to reliably find sections in parts by referencing rows directly. Each part was broken into a list of table rows and the rows were parsed sequentially, as on a tape:

1. prepare a section structure
2. get the next row
3. if the left column contains a value, it is the beginning of the next section:
   3.1. process the previous section, if there is any
   3.2. prepare a new section structure
   3.3. save the next section name into the section structure
4. if the left column is empty:
   4.1. add the right column to the list of section rows in the section structure
5. repeat from 2 until all rows are processed
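A minimal ruby sketch of this tape-like scan, assuming each row has already been reduced to a [left cell, right cell] pair of strings (the real job walks Nokogiri nodes):

```ruby
# Scan part rows sequentially: a non-empty left cell starts a new
# section; empty-left rows are content of the currently open section.
def split_into_sections(rows)
  sections = []
  current  = nil
  rows.each do |left, right|
    if left && !left.strip.empty?
      sections << current if current          # 3.1: store the previous section
      current = { number: left.strip, title: right, rows: [] }  # 3.2/3.3
    elsif current
      current[:rows] << right                 # 4.1: content row of open section
    end
  end
  sections << current if current              # flush the last open section
  sections
end
```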
Section parsing:

After parsing the parts, each section structure contains the section title, section number and a list of rows (cells from the right column of the part table). The rows are processed sequentially as well. Each set of section rows was parsed into field/value pairs using Unicode regexp matching. Because the naming of values was inconsistent, multiple values/matches or more complex regular expressions had to be used. The value keys had different wordings or used different words to describe the same value.

Examples of section rows:
[Figure: example section rows shown as a rendered document and the corresponding HTML]
Part V contained the list of contracts and required separate parsing.
No heavy data cleansing is performed; only numerical values are fixed and text strings are trimmed.
Issues:
- fields with currency amounts came in many forms: one amount (expected) or two amounts (expected and final); a single amount or a from-to range; with or without currency; with or without a VAT-included flag; with or without VAT rate
- there were no field name prefixes (such as name:, phone:) in all contacts; field order was used in that case (not 100% reliable)
- empty/bogus HTML nodes, sometimes preventing proper parsing
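The amount forms listed above suggest matching along the following lines; the patterns and Slovak keywords here are a simplified illustration, not the production expressions:

```ruby
# Numbers with thousand spaces and a decimal comma or dot, e.g. "2 500,50".
AMOUNT_RE = /(\d[\d\s]*(?:[.,]\d+)?)/u

def parse_amount_field(text)
  result = {}
  nums = text.scan(AMOUNT_RE).flatten.map { |n| n.gsub(/\s/, '').sub(',', '.').to_f }
  if text =~ /\bod\b.*\bdo\b/i                     # Slovak "od ... do" = from-to range
    result[:from], result[:to] = nums[0], nums[1]
  else
    result[:value] = nums[0]
  end
  result[:currency] = $1 if text =~ /(EUR|SKK|Sk)/
  result[:vat_included] = !!(text =~ /s\s+DPH/i)   # "s DPH" = VAT included
  result
end
```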
4.3. Load Source
Inputs: YAML structured files with parsed fields, one YAML file per announcement
Outputs: populated staging database table with contracts
Configuration: none
Options: default mode (just load data), create mode (create DB structures)
Process
Simple mapping of the structured files into a DB table: load each structured file and, for each contract in it, insert a contract record into the table.
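The mapping might be sketched as follows; the field names are illustrative, not the actual staging schema (the real job inserts via the sequel gem into PostgreSQL):

```ruby
require 'yaml'

# Turn one structured YAML announcement into one row per contract,
# denormalising the announcement number onto each row.
def contract_rows(yaml_text)
  doc = YAML.load(yaml_text)
  (doc['contracts'] || []).map do |c|
    { 'announcement' => doc['number'],
      'supplier'     => c['supplier'],
      'amount'       => c['amount'] }
  end
end
```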
The table contains mostly unprocessed raw text values; numeric types are used only for currency amounts. The content of the table mostly matches the information from the source documents.
4.4. Cleanse
Inputs: populated staging database table with contracts
Outputs: cleansed staging data with consolidated suppliers
Configuration: none
Options: default mode (just load data), create mode (create DB structures)
Process

The goal of this job is to cleanse the data taken from the source and consolidate it. More specifically:
- cleanse the organisation number (ICO) format (without validity checking)
- coalesce values of short enumerations
- consolidate date formats
- add procurer additions into the procurers table
- consolidate suppliers and add additions into the suppliers table
Suppliers Consolidation
Requirements:
- a table of suppliers that might contain more information than present in the REGIS database
- the possibility to automatically correct errors in source documents, such as invalid IDs
- collect all unknown IDs in a separate table for further correction
The presence and validity of the organisation identification number (ICO) in the source does not match the quality requirements. There are cases when the ICO does not match any organisation in the organisation database. For those cases a mapping table is created where one can specify a mapping of invalid company identifications to valid ones. There are two ways of corrective mapping:

map an organisation directly within a specific announcement:

[announcement, organisation ID] → [correct organisation ID]
map unknown organisations:
[country, organisation ID, organisation name] → [correct organisation ID]
The process is depicted in the following image:

1. Try to find unknown suppliers.
2. Coalesce the supplier: use the organisation ID from the suppliers table if found, otherwise resolve it through the mapping table.
3. Append newly found suppliers.

The reason for having a separate suppliers table is that it might be extended with more information than is provided by the organisations database REGIS.
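The coalescing step (2) can be sketched like this; table contents are represented as plain hashes and all names are illustrative:

```ruby
# Resolve a supplier's ICO against the suppliers table first, then via
# the corrections map; anything unresolved is collected as unknown for
# later manual fixing.
def coalesce_supplier(ico, suppliers, map, unknown)
  return suppliers[ico] if suppliers.key?(ico)     # known supplier
  return suppliers[map[ico]] if map.key?(ico)      # corrected via mapping table
  unknown << ico                                   # collect for further fixing
  nil
end
```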
[Diagram: records from sta_vvo_vysledky (Slovensko scope) are matched against sta_regis and sta_suppliers; suppliers that cannot be matched are looked up in map_suppliers; the coalesced result is stored in tmp_coalesced_suppliers_sk; new suppliers are appended to sta_suppliers and the remaining unknown suppliers are collected for correction.]
4.5. Create Cube
Inputs: cleansed staging data
Outputs: fact table, dimension tables, analytical model description
Configuration: none
Options: default mode (just load data), create mode (create DB structures)
This step creates and loads all structures for analytical processing:
- fact table (a fact is a contract)
- dimensions: supplier, procurer, process type, contract type, evaluation type, account sector, supplier geography
Process

1. create the dimension for suppliers
2. create the dimension for procurers
3. create the fact table (see below)
4. fix unknown dimension values: if there are values in the source data that are not found in the dimensions, mark them as unknown and add them into the dimension tables as new value additions
5. create a table with issues (for quality monitoring) and identify issues, such as empty or unknown fields
Create Fact Table
The fact table is created simply by transforming the cleansed data and joining it with the prepared dimension tables.
4.6. Create search index
Inputs: dimension tables
Outputs: Sphinx search index
Configuration: none
Options: none
This step creates an index of dimension values at searchable levels and indexes them with the Sphinx full-text search indexer. The index is created using a Slovak character mapping, to allow search queries in plain ASCII (without carons and accents).
The analytical model is a multidimensional cube in a star schema [1] with hierarchical dimensions that have multiple levels. It would not be sufficient to create a full-text search index for each table, as we need to know at which level the searched field was found. For this purpose a dimension index table is created.
The dimension index contains the fields:
- dimension
- dimension key (reference to the dimension row, the whole dimension point)
- level (for example: county, region or country in geography)
- level key (value of the level key attribute, for example: county code)
- indexed field name
- indexed field value

Sphinx indexes the dimension index table.
An example search query: Bystri*. There are several cities called Bystrica, such as Banska Bystrica; however, there is also a region called Banskobystricky that will match the same query, and we want to get both results: the higher level (region) and the detailed level (city).
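Filling the dimension index table could be sketched like this; the field and level names are illustrative:

```ruby
# For every record of a dimension and every searchable level, emit one
# index row that remembers the dimension, the level and the level key,
# so a Sphinx hit can be traced back to the exact hierarchy level.
def dimension_index_rows(dimension, levels, records)
  rows = []
  records.each do |rec|
    levels.each do |level, key_field, value_field|
      rows << {
        dimension: dimension,
        dim_key:   rec[:id],            # reference to the whole dimension point
        level:     level,               # e.g. :region or :county
        level_key: rec[key_field],
        value:     rec[value_field]     # the text Sphinx will index
      }
    end
  end
  rows
end
```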
4.7. Regis Download

Inputs: documents at the website of the Statistics Office of the Slovak Republic
Outputs: table with the list of organisations in Slovakia
Configuration: source URL, document ID range, number of concurrent processing threads
Options: incremental download (default), full reload
[1] Fact table joined with dimension tables with no deeper references. All tables are joined to the fact table directly; there are no chained joins FT - T1 - T2.
Process
Documents are downloaded sequentially by document ID from the source URL. The downloading is done in batches of 50k documents (configurable) and in 20 parallel threads (configurable).
In spite of the documents being labeled as HTML, they contain no valid HTML code and can be considered text documents with HTML tags. The downloaded documents are stripped of HTML tags and then parsed with regular expressions as plain-text documents.
The process of downloading and processing all documents takes 2 hours on average; therefore it is advised to run the process on a weekly basis.
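The batched, multi-threaded download might be sketched as below; the HTTP fetch is passed in as a block, since the real requests are specific to the Regis site, and the tag-stripping mirrors the plain-text handling described above:

```ruby
# Drain a queue of document IDs from thread_count worker threads,
# fetch each document via the supplied block, strip HTML tags and
# collect the plain text per ID.
def download_batch(ids, thread_count)
  results = {}
  mutex   = Mutex.new
  queue   = Queue.new
  ids.each { |id| queue << id }
  Array.new(thread_count) do
    Thread.new do
      while (id = (queue.pop(true) rescue nil))   # non-blocking pop, nil when drained
        raw  = yield(id)                          # fetch document by ID
        text = raw.gsub(/<[^>]+>/, ' ')           # strip HTML tags, keep the text
        mutex.synchronize { results[id] = text }
      end
    end
  end.each(&:join)
  results
end
```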
4.8. Geography Loading

Inputs: list of municipalities and counties from the Slovak Post Office
Outputs: single de-normalised table with hierarchical geography information about Slovakia
Configuration: none
Options: none
Process
Records are simply mapped, using mapping tables containing ISO 3166-2:SK division codes and region names, into a single de-normalised table.
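As an illustration of the mapping (with hypothetical record fields and only two of the eight regions shown):

```ruby
# ISO 3166-2:SK codes for Slovak regions (illustrative subset).
ISO_REGIONS = {
  'Bratislavsky kraj'    => 'SK-BL',
  'Banskobystricky kraj' => 'SK-BC'
}

# Attach the ISO division code to a municipality record.
def denormalise_geography(record)
  record.merge(region_code: ISO_REGIONS[record[:region]])
end
```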
4.9. CPV Loading

Inputs: multilingual wide CPV code table
Outputs: single de-normalised table with hierarchical CPV structure
Configuration: none
Options: none
Process
The Common Procurement Vocabulary (CPV) code table provided by EU institutions has a linear structure with tree-structure properties. This table is transformed into a de-normalised table with the tree hierarchy levels in multiple columns.
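A simplified sketch of deriving hierarchy columns from a CPV code: CPV encodes the tree in the leading digits (the first two identify the division, the first three the group), with a check digit after the dash. The function below is an illustration, not the full flattening job:

```ruby
# Derive the division and group codes for one CPV code by zero-padding
# its leading digits; the check digit (after the dash) is dropped.
def cpv_levels(code)
  digits = code[0, 8]                   # the 8-digit code without the check digit
  {
    division: digits[0, 2] + '0' * 6,   # first two digits: division
    group:    digits[0, 3] + '0' * 5,   # first three digits: group
    code:     digits
  }
end
```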
5. Data
There are three data stores:
- source mirror on a file system
- staging data: database schema
- datamart: database schema

More detailed view:
[Diagram: the source mirror holds the source documents, HTML files and YAML files (download, parse); the staging data schema holds source and staging contract data, staging data lists, mappings and temporary tables (load source, cleanse); the datamart holds the fact table, dimensions and the logical model (metadata), together forming the contracts cube (create cube).]
5.1. Source Mirror

The source mirror contains the downloaded original documents and the parsed, structured version of the documents in YAML format. If the source becomes unavailable and it is desired to parse the files again (more attributes gathered, different parsing method, bug fix), it can be done on the locally stored files.
Documents are not parsed directly into the database. Reasons:
- only YAML text file storage is required
- structured documents can be processed with other tools without any database server connection
5.2. Staging Data

Structured files are loaded into the staging data store in the database (preferably a separate schema). The files are loaded without any, or with only very minor, transformations. The table should be a 1:1 copy of the structured files.
The staging data store contains:
- lists/enumerations, for example ISO country region subdivisions
- copies of various sources or preprocessed datasets, such as geography from the SK post office and the registry of organisations (REGIS)
- staging data for procurers and suppliers, which might contain more information than provided by the registry of organisations (REGIS)
- maps for mapping source values to desired values, coalescing and unifying:
  - map of unknown organisations: maps unknown org. names and org. codes to existing organisations
  - map of region names: region naming in REGIS differs from the official post office region registry
  - map of reference codes: maps full-text values, such as names of procurement types, to short codes (identifiers) that will be used as keys; also unifies similar names into the same code
- temporary tables: tables used during the transformation process that are created only for the purpose of a single transformation run (for example: coalesced suppliers according to REGIS, mapped unknown organisations and existing registered organisations)
Some tables are appended with new data during the transformation process. New data are added into:
- the map of unknown organisations, for further fixing
- newly known organisations, for further updates with additional information
5.3. Datamart Datastore

The datamart datastore, a separate database schema, contains the final data ready for analysis and reporting. Structures in the schema are:
- logical model: metadata description of the OLAP cube for contracts (Brewery framework objects)
- dimension tables: tables with (hierarchical) dimension values
- fact table: cleansed table with procurement contracts, joinable with dimensions
The dimension tables together with the fact table in this schema form a snowflake schema [2].
Brewery OLAP uses the structures in the datamart datastore to denormalise the snowflake schema into a wide fact table suitable for analysis, aggregation and reporting. This means that the end-user, the analyst, does not have to know about the physical structures behind the procurement contracts. He has only one logical fact table where one row is one fact, that is, one contract. The logical metadata enables the analyst to perform analysis on the multidimensional hierarchical structure.
[2] http://en.wikipedia.org/wiki/Snowflake_schema
6. Search Index
One of the requirements for the public procurements portal was to be able to search through the data by many different fields. The nature of the final data is:
- many fields, described by metadata; we should not rely on a fixed data structure
- hierarchical structure; we need to know at what level the value that we are searching for can be found
Example of a search query: chemical. The word chemical might be contained in the subject type, but at different levels: division, category or subcategory. We have to know the exact level where the word appeared. If the word chemical is found at the division level, we want to report at the division level; if the word is found at the category level, we want to aggregate at the category level, etc.
The Sphinx search engine can create one index for a table for a known set of fields. While searching, we do not know in which field the value was found, only the document number (row). To make searching in multiple fields and through hierarchies possible, we had to pre-index the data with enough metadata. The final table that is indexed contains:
- string value of the indexed searchable field
- dimension of the field (cpv, organisation, region, ...)
- dimension level of the field (division/category/subcategory, region/county, ...)
- level key of the indexed field
- an index document id that will be returned by Sphinx
7. Installation
7.1. Software Requirements

- PostgreSQL database server
- ruby 1.9 (does not work with version 1.8)
- gems: sequel, data-mapper, nokogiri
- Sphinx
- Brewery from http://github.com/Stiivi/brewery/
7.2. Preparation
I. create a directory where working files, such as dumps and ETL files, will be stored, for example: /var/lib/vvo-files
II. initialize and configure Brewery (see Brewery installation instructions)
III. create two database schemas: vvo_staging for staging tables and vvo_data for analytical data
7.3. ETL Database initialisation

To initialize the ETL database schema, run the Brewery ETL tool:
etl initialize
This will create all necessary system tables. If you try to initialise a schema which already contains ETL system tables you will get an error message; this prevents you from overwriting existing data. To recreate the schema and start with empty tables, execute the initialize command with the --force flag:
etl --force initialize
8. Running ETL Jobs
8.1. Launching
Manual Launching
Jobs are run by simply launching the etl tool:
etl run job_name
To manually run all daily jobs, you might use the following script:
#!/bin/bash

DEBUG='--debug'

etl $DEBUG run vvo_download
etl $DEBUG run vvo_parse
etl $DEBUG run vvo_load_source
etl $DEBUG run vvo_cleanse
etl $DEBUG run vvo_create_cube
etl $DEBUG run vvo_search_index
If a job fails, you only have to run the failed job and the jobs after it.
To do a full download instead of an incremental one, run:

etl run vvo_download all