A COURSEWARE ON ETL PROCESS
Nithin Vijayendra
B.E., Visveswaraiah Technological University, Karnataka, India, 2005
PROJECT
Submitted in partial satisfaction of the requirements for the degree of
MASTER OF SCIENCE
in
COMPUTER SCIENCE
at
CALIFORNIA STATE UNIVERSITY, SACRAMENTO
FALL 2010
A COURSEWARE ON ETL PROCESS
A Project
by
Nithin Vijayendra
Approved by:
__________________________________, Committee Chair
Dr. Meiliu Lu

__________________________________, Second Reader
Dr. Chung-E Wang

____________________________
Date
Student: Nithin Vijayendra
I certify that this student has met the requirements for format contained in the University format
manual, and that this project is suitable for shelving in the Library and credit is to be awarded for
the Project.
__________________________, Graduate Coordinator          ________________
Dr. Nik Faroughi                                           Date
Department of Computer Science
Abstract
of
A COURSEWARE ON ETL PROCESS
by
Nithin Vijayendra
Extract, Transform and Load (ETL) is a fundamental process used to populate a
data warehouse. It involves extracting data from various sources, transforming the data
according to business requirements and loading it into target data structures. Inside
the transform phase, a well-designed ETL system should also enforce data quality and data
consistency, and conform data so that data from various source systems can be
integrated. Once the data is loaded into target systems in a presentation-ready format, the
end users can run queries against it to generate reports which help them make better
business decisions. Even though the ETL process consumes roughly 70% of computing
resources, it is hardly visible to the end users [5].
The objective of this project is to create a website which contains courseware on
the ETL process and a web-based ETL tool. The website, containing the ETL courseware,
can be accessed by anyone with internet access. This will be helpful to a wide range of
audiences, from beginners to experienced users. The website is created using
HTML, PHP, Korn shell scripts and MySQL.
The ETL tool is web based and anyone with internet access can use this tool for
free. However, guests have limited access to this tool while registered users have complete
access. Using this tool, data can be extracted from text files and MySQL tables,
combined together and loaded into MySQL tables. Before loading the data into target
MySQL tables, various transformations, according to business requirements, can be
applied to it. This tool is developed using HTML, PHP, SQL and Korn shell scripts.
_______________________, Committee Chair
Dr. Meiliu Lu

_______________________
Date
ACKNOWLEDGMENTS
I am thankful to all the people who have helped and guided me through this
journey of completing my Master's Project.
My sincere thanks to Dr. Meiliu Lu for giving me an opportunity to work under
her guidance on my master's project. She has been very supportive, encouraging and has
guided me throughout the project. My heartfelt thanks to Prof. Chung-E Wang for being
my second reader.
My special thanks to my friend Sreenivasan Natarajan for his patience in reviewing
my project report. I would also like to thank all my friends who have been there for me
throughout my graduate program at California State University, Sacramento.
Last but not least, I would like to thank my parents, sister and relatives for
their unconditional love, support and motivation.
TABLE OF CONTENTS Page
Acknowledgements............................................................................................................vi
List of Tables.......................................................................................................................x
List of Figures.....................................................................................................................xi
Chapter
1. INTRODUCTION...........................................................................................................1
2. BACKGROUND.............................................................................................................4
2.1 Need for an ETL tool.................................................................................................5
2.2 Scope of the project...................................................................................................6
2.3 Technology related....................................................................................................7
3. ETL COURSEWARE...................................................................................................11
3.1 ETL components......................................................................................................14
3.2 Requirements...........................................................................................................14
3.2.1 Business requirements......................................................................................15
3.2.2 Data profiling....................................................................................................15
3.2.3 Data integration requirements...........................................................................16
3.2.4 Data latency requirements................................................................................16
3.2.5 Data archiving requirements.............................................................................17
3.3 Data profiling...........................................................................................................17
3.4 Data extraction..................................................................................................19
3.5 Data validation and integration.........................................................................23
3.6 Data cleansing...................................................................................................24
3.7 Data transformations.........................................................................................26
3.7.1 Surrogate key generator operation.............................................................27
3.7.2 Lookup operation.......................................................................................27
3.7.3 Merge operation.........................................................................................28
3.7.4 Aggregation operation......................................................................................29
3.7.5 Change capture operation.................................................................................29
3.7.6 Change apply operation....................................................................................30
3.7.7 Data type operation...........................................................................................31
3.8 Data load..................................................................................................................32
3.8.1 Historic load......................................................................................................32
3.8.2 Incremental load...............................................................................................33
3.8.3 Loading dimension tables.................................................................................34
3.8.3.1 Type 1 Slowly Changing Dimension.........................................................35
3.8.3.2 Type 2 Slowly Changing Dimension.........................................................36
3.8.4 Loading fact tables...........................................................................................40
3.9 Exception handling..................................................................................................43
4. ETL TOOL ARCHITECTURE.....................................................................................45
5. ETL TOOL IMPLEMENTATION................................................................................50
5.1 Using the tool..........................................................................................................50
5.2 Extraction phase......................................................................................................52
5.2.1 Text file as source............................................................................................53
5.2.2 MySQL table as source....................................................................................59
5.3 Transformation phase..............................................................................................61
5.3.1 Transformation for a single source...................................................................61
5.3.2 Transformation for multiple sources.................................................................67
5.4 Loading phase.........................................................................................................68
6. CONCLUSION...............................................................................................71
6.1 Future enhancements...............................................................................................72
Bibliography......................................................................................................................73
LIST OF TABLES
Page
Table 1 Before snapshot of Store Dimension Table for Type 1 SCD...............................36
Table 2 After snapshot of Store Dimension Table for Type 1 SCD..................................36
Table 3 Snapshot 1 of Store Dimension Table for Type 2 SCD (Method 1) ...................38
Table 4 Snapshot 2 of Store Dimension Table for Type 2 SCD (Method 1) ...................38
Table 5 Snapshot 3 of Store Dimension Table for Type 2 SCD (Method 1) ...................38
Table 6 Snapshot 1 of Store Dimension Table for Type 2 SCD (Method 2) ...................39
Table 7 Snapshot 2 of Store Dimension Table for Type 2 SCD (Method 2) ...................39
Table 8 Snapshot 3 of Store Dimension Table for Type 2 SCD (Method 2) ...................40
Table 9 Table structure to Store Usernames and Password...............................................51
Table 10 Type of Input Box Based on Data Type.............................................................56
Table 11 Structure of INFORMATION_SCHEMA.COLUMNS Table...........................63
LIST OF FIGURES Page
Figure 1 Overview of ETL Process.....................................................................................2
Figure 2 Screenshot of Transformations Page.....................................................................6
Figure 3 Screenshot of Transformations and Load Page of ETL Tool................................7
Figure 4 ETL Process........................................................................................................13
Figure 5 Components of ETL............................................................................................14
Figure 6 OLTP source for ETL Process............................................................................20
Figure 7 Delimited File Format.........................................................................................21
Figure 8 Fixed-width File Format......................................................................................21
Figure 9 Overview of Data Transformation......................................................................26
Figure 10 Lookup Operation.............................................................................................27
Figure 11 Merge Operation...............................................................................................28
Figure 12 ETL Tool Layers...............................................................................................45
Figure 13 Layers and Components of ETL Tool...............................................................47
Figure 14 Add or Delete Users..........................................................................................51
Figure 15 Source Selection................................................................................................52
Figure 16 Screenshot of Define Metadata Page................................................................55
Figure 17 Flow for Landing Data Using Text Source.......................................................58
Figure 18 Flow of Landing Data Using MySQL Table as Source....................................60
Figure 19 Screenshot of Database Details Webpage.........................................................60
Figure 20 Flow of Transformation Phase............................................................................61
Figure 21 Screenshot Showing Various Transformations.................................................65
Figure 22 Flow of Landing Phase......................................................................................68
Figure 23 Screenshot Showing Transformations for Multiple Sources.............................69
Chapter 1
INTRODUCTION
According to Bill Inmon [1], “A data warehouse is a historical, subject-oriented,
integrated, time-variant and non-volatile collection of data in support of management's
decision making process”. By Historical we mean the data is continuously collected from
various sources and loaded into the warehouse. The previously loaded data is not deleted
for long periods of time, so the warehouse contains historical data. By Subject
Oriented we mean data is grouped into specific business areas instead of the business as
a whole. By Integrated we mean collecting and merging data from various sources, and
these sources could be disparate in nature. By Time-variant we mean that all data in the
data warehouse is identified with a particular time period. By Non-volatile we mean that
once data is loaded in the warehouse it is never deleted or overwritten; hence it is not
expected to change over time.
Extract, Transform, Load (ETL) is the back-end process which involves collecting
data from various sources, preparing the data according to business requirements and
loading it in the data warehouse. Extraction is the process where data is extracted from
various source systems and temporarily stored in database tables or files. Source systems
could range from one to many in number, and similar or completely disparate in nature.
Once the extracted data is staged temporarily it should be checked for validity and
consistency using the data validation rules. Transformation is the process which involves
application of business rules to source data before it's loaded into the data warehouse.
Figure 1 Overview of ETL process
As can be seen in Figure 1, there can be several data sources that have
characteristics which differ from each other. These data sources could be in a different
geographic location; could be incompatible with the organization’s data store; could be
many in number; could be on different platforms like mainframes, UNIX or Windows;
the availability of data from each source system may also vary. In Extraction phase, data
needs to be extracted from various source systems and placed temporarily in databases or
flat files called the landing zone [6].
In the Transformation phase, the landed data is picked up, cleansed and transformed
based on the business requirements. There can be one or many transformation operations
applied to the datasets, which could lead to a change in data value, a change of data
type or a change of data structure through addition or deletion of data. The transformed
data is loaded into database tables and this area is called the staging area [6].
In the Loading phase, which is the last step in the ETL process, the validated, integrated,
cleansed, transformed and ready-to-load data from the staging area is loaded into the
warehouse dimension and fact tables.
This report is organized into several chapters. Chapter 1 gives a brief introduction
to data warehouses and the role of the ETL process in data warehousing projects. Chapter
2 provides background and a detailed introduction to the ETL process. It discusses the
need for an ETL tool, the scope of the project and the related technology used to build the
ETL web tool. Chapter 3 discusses the ETL courseware. The ETL courseware
contains material which a user new to ETL processes must know in order to implement
successful ETL projects. There are several components of ETL, like requirements, data
profiling, extraction, validation and integration, which this chapter explains in
detail. Chapter 4 gives an overview of the architecture of the ETL web tool created for this
project. Chapter 5 discusses the implementation of the ETL web tool in detail along with
snippets of important source code. Chapter 6 summarizes and concludes this report with a
glimpse into future enhancements to the courseware and the tool.
Chapter 2
BACKGROUND
The concept of the data warehouse dates back to the late 1980s, when IBM researchers
Barry Devlin and Paul Murphy developed the "business data warehouse". It was meant to
provide an architectural model which focused on the flow of data from operational
systems to decision support systems. The architecture consisted of an operational data layer,
a data access layer, a metadata layer and an informational access layer. The operational data layer is
the source for a data warehouse; the data access layer is the interface between the operational data
layer and the informational access layer; the metadata layer is the data dictionary; and the informational
access layer is the last layer, which is used by business analysts to analyze and generate
reports [8].
There are several approaches to populating a warehouse. The top-down approach,
by Bill Inmon [1], proposes populating the data warehouse first and then populating the
data marts. The bottom-up approach, by Ralph Kimball [3], proposes populating the data
marts first and then populating the data warehouse. There is also the hybrid approach,
which is a combination of the top-down and bottom-up approaches [5]. An ETL tool is used in
the data access layer to extract data from source systems and load it into the warehouse or
the mart, irrespective of which approach is used.
2.1 Need for an ETL tool
Interested users who would like to learn more about ETL tools may not have
access to one. This is because commercial ETL tools are expensive to buy for small or
medium sized projects. They need expensive hardware to run on and need specialists to
configure them before a normal user can start using them. Most of them are not open source
or web based. They also have short evaluation periods of 30 to 60 days. The ETL web
tool created in this project helps overcome the above challenges. It is web based,
accessible freely to anyone with internet access and very user-friendly. Beginning ETL
developers can use this tool to get a feel of what an ETL tool does before they dive into
understanding complex commercial ETL tools.
An ETL tool has many advantages over hand-coded ETL code. It helps in
simpler, faster and cheaper development of ETL code. Technical people with broad
business skills who are not professional programmers can use ETL tools effectively.
Many ETL tools generate metadata automatically at every step of the process, thus
enforcing consistent metadata throughout the process. They also have integrated metadata
repositories which can be synchronized with other source systems, target systems and
other Business Intelligence tools. They deliver good performance with very large datasets
by using concepts such as pipelining and parallelism [4]. They come with
built-in connectors for most source and target systems. Most of the ETL tools these
days have built-in schedulers to run the ETL code at scheduled times.
2.2 Scope of the project
This project mainly focuses on the creation of a website which contains courseware
on ETL and a web-based ETL tool.
The ETL courseware is available to everyone with an internet connection and is
intended for a wide range of audiences, from beginners to experienced professionals.
Initially it explains the basics of each phase - Extract, Transform and Load; later it
delves into the details of each phase. Below is a screenshot of a webpage from the ETL
courseware website.
Figure 2 Screenshot of ETL courseware page
The ETL tool is web based and accessible to anyone with internet access. It
allows users to select data from two types of source systems. They can be Comma
Separated Value (CSV) flat files, MySQL tables or a combination of both. Once data is
extracted, it can be merged together, data-type-based transformations applied to each
column and the result loaded into target MySQL tables. Below is a screenshot of a webpage
from the ETL web tool.
Figure 3 Screenshot of transformations and load page of ETL tool
2.3 Technology related
This section discusses the various technologies used in developing the ETL web tool.
The ETL tool is composed of three layers: the client layer, the processing layer and the
database layer, which are described below along with the technologies used:
Client Layer: Users use a web browser, like Microsoft Internet Explorer, to
access/control the ETL tool. They can specify the number of sources, the type of
source, the transformations to be applied to the extracted data and the target MySQL
table connection details. This layer is built using PHP and HTML.
Processing Layer: The processing layer collects the user input from the client
layer. It processes user requests in the background and displays success or error
messages to the user. This layer has MySQL and text file connectors which
connect to the sources and extract data, based on the user’s input. Once the data is
extracted, it is staged in temporary MySQL tables or flat files so that
transformations can be applied to it without disturbing the source content. This
layer is built using PHP, Korn shell scripts, MySQL statements and scripts.
Database Layer: The database layer has the target MySQL connector which
connects to the target MySQL table and inserts the transformed data. This layer is
built using Korn shell scripts, MySQL statements and scripts.
MySQL 5.1
MySQL is an open source SQL database management system, developed,
distributed, and supported by Oracle Corporation. It is used for mission-critical, heavy
load production systems and delivers a very fast, multi-user, multi-threaded and robust
SQL database server [10].
Key Features
1. High performance for variety of workloads.
2. Connectors for C, ODBC, Java, PHP, Perl, .NET, etc.
3. Wide range of supported platforms.
4. XML functions with XPath support
5. Partitioning
6. Row-based replication
7. Great documentation, community and commercial support [11].
PHP 4.3
PHP: Hypertext Preprocessor is a general-purpose scripting language that is
designed for web development to produce dynamic web pages. PHP code is embedded
into the HTML source document and this is interpreted by a web server installed with a
PHP processor module which generates the web page document.
Key Features
1. Persistent database connections
2. Good connection handling
3. Easy remote file handling
4. Better session handling
5. Good command line usage [12]
Korn Shell Scripts:
The Korn shell (ksh) is a Unix shell which was developed by David Korn. It is
backwards-compatible with the Bourne shell and includes many features of the C shell
[13].
Key Features
1. Supports associative arrays and built-in floating point arithmetic.
2. Supports pipes
3. Supports pattern matching
4. Exception handling
5. Multidimensional arrays
6. Sub-shells
7. Unicode support [14]
In addition to the above features, PHP and MySQL are extensively used on the California State
University, Sacramento campus.
In this chapter we discussed the background of ETL tools. We mentioned the
need for an ETL tool and its several advantages over hand-coded ETL. The scope of this project
was discussed along with the related technologies that were used to create a web-based
ETL tool. Now that the user is familiar with the overview and motivation for this project,
we discuss the ETL courseware in Chapter 3.
Chapter 3
ETL COURSEWARE
The ETL courseware can be read by anyone who is interested in learning about ETL
processes. This courseware is very useful to users who have no prior knowledge of ETL
processes or ETL tools. Professional ETL developers could also use it as a
reference. The ETL courseware is freely available to everyone who has internet access and
can be accessed at http://gaia.ecs.csus.edu/~web_etl/etl/.
The courseware starts with basics and then proceeds to advanced topics. Initially
it introduces the ETL process to readers. Then it discusses the various ETL
components. Each ETL component is explained in detail with an example and a figure for
easier understanding. Important topics like requirements, data profiling, data extraction,
data transformation and data loading are explained in depth.
After reading this courseware, readers should be able to define what ETL process
is and what its various components are. They’ll have a thorough understanding of what
each component does and they’ll be able to apply these concepts in ETL project
implementations.
Users can better understand the ETL courseware by using the ETL tool
implemented for this project. The ETL tool is simple and user-friendly, and users can
learn the basics of an ETL tool before trying their hands at commercially available complex
ETL tools.
Based on my prior work experience and by referring to the following books, I have
compiled this courseware.
W.H Inmon, "Building the Data Warehouse" Fourth Edition
Jack E. Olson, "Data Quality: The Accuracy Dimension"
Ralph Kimball, Margy Ross, "The Data Warehouse Toolkit: The
Complete Guide to Dimensional Modeling" Second Edition
Claudia Imhoff, Nicholas Galemmo, Jonathan G. Geiger, "Mastering Data
Warehouse Design: Relational and Dimensional Techniques"
Ralph Kimball, Joe Caserta, "The Data Warehouse ETL Toolkit: Practical
Techniques for Extracting, Cleaning, Conforming, and Delivering Data"
Larissa T. Moss, Shaku Atre, "Business Intelligence Roadmap: The
Complete Project Lifecycle for Decision-Support Applications"
Ralph Kimball, Margy Ross, "The Kimball Group Reader: Relentlessly
Practical Tools for Data Warehousing and Business Intelligence"
The following sections give a brief overview of the ETL process and its components. For more details
please refer to the website at http://gaia.ecs.csus.edu/~web_etl/etl/.
Extract, Transform and Load (ETL) is a fundamental process used to populate a
data warehouse. It involves extracting data from various sources, validating it for
accuracy, cleaning it and making it consistent, transforming the data according to business
requirements and loading it into the target data warehouse. Inside the Transform phase, a
well-designed ETL system should also enforce data quality and data consistency and
conform data so that data from various source systems can be integrated. Once the data
is loaded into target systems in a presentation-ready format, the end users can run queries
against it to make better business decisions. Even though the ETL process consumes
roughly 70% of the resources, it is hardly visible to the end users [5].
Figure 4 ETL process
3.1 ETL components
Below are the various components of an ETL process.
Figure 5 Components of ETL
This shows just one ETL flow, which loads one or more tables. Similarly, in the
background there can be multiple ETL flows loading tables in the same data warehouse.
3.2 Requirements
Just as designing any system requires understanding the requirements first, the
design of an ETL system should also start with requirements analysis. All the known
requirements and constraints affecting the ETL system have to be gathered in one place.
Based on the requirements, architectural decisions should be made at the
beginning of the ETL project. Construction of ETL code should start only after the
architectural decisions are baselined. A change in architecture at a later point of
implementation would result in re-implementing the entire system from the very beginning,
since such changes affect hardware, software, personnel and coding practices. Listed below are the
major requirements.
3.2.1 Business requirements
Business requirements are the requirements of the end users who use the data
warehouse. Based on the populated information content the end users can make better
informed decisions. Selection of source systems is directly dependent on the business
needs. Interviewing the end users to gather business requirements not only sets an
expectation as to what they can do with the data, but also creates the possibility of the
ETL team discovering additional capabilities in data sources that can expand the end users'
decision making capabilities.
3.2.2 Data profiling
Data profiling is the process of examining the available data present in the source
systems and collecting statistics about that data. The purpose of these statistics can be to:
Assess the risk involved in integrating data from various applications, including
the challenges of joins.
Assess whether metadata accurately describes the actual values present in source
systems.
Understand data challenges in the early stages of the project, which avoids delays and
cost.
Data profiling examines the quality, scope and context of the source data and enables
the ETL team to build an effective ETL system. If the source data is very clean and well
maintained before it arrives at the data warehouse, then minimal transformation is
required to load it into dimension and fact tables. If the source data is dirty, then most of
the ETL team’s effort will be in transforming, cleaning and conforming the data.
Sometimes the source data might be deeply flawed and unable to support business
objectives. In this case the data warehouse project should be cancelled.
Data profiling gives the ETL team a clear picture of how many data cleaning
processes should be in place to achieve the end users' requirements. This also results in
better estimates and timely completion of the project.
3.2.3 Data integration requirement
Data from transaction systems must be integrated before it arrives in the data
warehouse. In data warehousing this takes the form of conforming dimensions and
conforming facts.
Conformed dimensions contain common attributes from different databases so
that drill-across reports can be generated using these attributes. Conformed facts are
common measures, like Key Performance Indicators (KPIs), from across different
databases, so that these numbers can be compared mathematically.
In the ETL system, data integration is a separate step and it involves mandating
common names of attributes and facts and common units of measurement.
3.2.4 Data Latency requirement
The data latency requirement from the end users specifies how quickly the data has to
be delivered to them. This requirement has a huge effect on the architecture of the ETL
system. A batch-oriented ETL system can be sped up using efficient processing
algorithms, parallel processing and faster hardware, but sometimes the end users require
data on a real-time basis. This requires a conversion of the ETL system from batch-oriented
to real-time oriented.
3.2.5 Data archiving requirement
Archiving data after it’s loaded into the data warehouse is a safer approach when data
needs to be reprocessed or for auditing purposes.
3.3 Data profiling
Data profiling is the process of examining the available data present in the source systems
and collecting statistics about that data. The purpose of these statistics can be to:
Assess the risk involved in integrating data from various applications, including
the challenges of joins.
Assess whether metadata accurately describes the actual values present in source
systems.
Understand data challenges in the early stages of the project, which avoids delays and
cost.
According to Jack Olson [2], data profiling employs analytic methods of looking at
data for the purpose of developing a thorough understanding of the content, structure and
quality of the data. A good data profiling system can process very large amounts of data,
and with the skills of the analyst, uncover all sorts of issues that need to be addressed.
Data profiling examines the quality, scope and context of the source data and
enables the ETL team to build an effective ETL system. If the source data is very clean and
well maintained before it arrives at the data warehouse, then minimal transformation is
required to load it into dimension and fact tables. If the source data is dirty, then most of
the ETL team’s effort will be in transforming, cleaning and conforming the data.
Sometimes the source data might be deeply flawed and unable to support business
objectives. In this case the data warehouse project should be cancelled. Data profiling
can be achieved using commercial tools or hand-coded applications. The data profiling
process reads the source data and generates a comprehensive report on:
Data types of each field
Natural keys
Relationships between tables
Data statistics like maximum values, minimum values, most frequently occurring values,
number of occurrences of each value, etc.
Dates in non-date fields
Data anomalies like junk values, values outside a given range, missing values, etc.
Null values
Data profiling gives the ETL team a clear picture of how many data cleaning
processes should be in place to achieve the end users' requirements. This also results in
better estimates and timely completion of the project.
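As an illustration, a few of the statistics listed above can be gathered with plain SQL queries against a source table. This is only a minimal sketch; the table and column names (customer_src, cust_age) are hypothetical.

    -- Basic column statistics: row count, distinct values, NULLs, minimum and maximum
    SELECT COUNT(*)                 AS row_count,
           COUNT(DISTINCT cust_age) AS distinct_ages,
           SUM(cust_age IS NULL)    AS null_ages,
           MIN(cust_age)            AS min_age,
           MAX(cust_age)            AS max_age
    FROM   customer_src;

    -- Most frequently occurring values for a column
    SELECT   cust_age, COUNT(*) AS occurrences
    FROM     customer_src
    GROUP BY cust_age
    ORDER BY occurrences DESC
    LIMIT    10;

    -- Data anomalies: values outside an expected range
    SELECT *
    FROM   customer_src
    WHERE  cust_age < 0 OR cust_age > 120;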
3.4 Data extraction
This section focuses on extraction of data from the source systems. Data
extraction is the process of selecting, transporting and consolidating the data from source
systems to the ETL environment.
As organizations grow they add more and more data sources to their
existing data store. Each source system has characteristics which differ from the
others. These data sources could be in a different geographic location; could be
incompatible with the organization’s data store; could be many in number; could be on
different platforms like mainframes, UNIX or Windows; the periodicity (daily, weekly,
monthly, etc.) of feeding the data to the warehouse could vary; and the availability of
data from each source system may also vary. These vast differences in source system
characteristics make data integration challenging.
Data extracted from source systems is placed temporarily in databases or flat
files, and this area is called the landing zone. Described below are a few sources from which
organizations commonly extract data.
OLTP systems:
OLTP stands for Online Transaction Processing Systems. These are a class of
systems which facilitate and manage transaction-oriented applications which require
faster data insertions and retrievals. These systems store their daily transactional data in
relational databases. These databases are normalized for faster insertion and retrieval
queries.
To extract data from these systems, ODBC drivers or native database drivers are used.
The disadvantage of ODBC drivers over native database drivers is that they require more
steps, as shown in the diagram below, and take more time. It is a better approach to always
extract only the required data by using an appropriate WHERE clause in the SELECT query.
Figure 6 OLTP source for ETL process
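For example, rather than pulling an entire OLTP table, the extract query can restrict the pull to the rows needed for the current load window. A minimal SQL sketch, assuming a hypothetical orders table with a last_updated timestamp column:

    -- Extract only the rows changed during the current load window
    SELECT order_id,
           customer_id,
           order_amount,
           last_updated
    FROM   orders
    WHERE  last_updated >= '2010-10-01 00:00:00'
      AND  last_updated <  '2010-10-02 00:00:00';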
Flat files
A flat file is an operating system file which contains text or binary content, one
record per line. In ETL projects we usually come across two formats of flat files. Both are
described below.
Delimited file format
In this file format, data is stored in separate lines as shown below. Each line is
separated by a newline character and each field is separated by a delimiter. Field
delimiters can be a comma, pipe or another character, but have to remain the same for all
fields. Also, each record could have a record delimiter; in this example the record
delimiter is a semi-colon. There could also be a quote delimiter for each field, which is
a single quote or a double quote. The final delimiter, in this case an End-Of-File
character, denotes the end of the flat file.
Figure 7 Delimited file format
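In MySQL, a comma-delimited and optionally quoted file of this kind can be landed into a staging table with LOAD DATA INFILE. This is only a sketch; the file path, staging table and column names are illustrative.

    -- Land a comma-delimited, optionally quoted flat file into a staging table
    LOAD DATA LOCAL INFILE '/tmp/store_extract.csv'
    INTO TABLE stg_store
    FIELDS TERMINATED BY ','
           OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES                -- skip a header record, if the file has one
    (store_id, store_city, store_state, store_country);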
Fixed-width file format
In this file format, data is stored in separate lines similar to the delimited file format;
however, each field occupies a fixed width and each line is of the same width. Field values
which do not occupy the entire field length are padded with spaces. Each field can be
identified by its start and end position. In the example below, field 1 starts at position 1
and ends at position 4, field 2 starts at position 5 and ends at 11, and so on.
Figure 8 Fixed-width file format
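A fixed-width file can also be landed with LOAD DATA INFILE by reading each line into a user variable and slicing it by position. The sketch below uses the start and end positions of fields 1 and 2 from the example above; the remaining positions and all names are illustrative.

    -- Read each fixed-width line into @row and split it by start position and length
    LOAD DATA LOCAL INFILE '/tmp/store_extract.dat'
    INTO TABLE stg_store
    LINES TERMINATED BY '\n'
    (@row)
    SET store_id      = TRIM(SUBSTR(@row,  1, 4)),   -- positions 1-4
        store_city    = TRIM(SUBSTR(@row,  5, 7)),   -- positions 5-11
        store_state   = TRIM(SUBSTR(@row, 12, 2)),   -- positions 12-13 (illustrative)
        store_country = TRIM(SUBSTR(@row, 14, 3));   -- positions 14-16 (illustrative)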
Web log sources
Most internet companies have a log, called a web log, which stores visitor
information. These web logs record the information posted or retrieved on that particular
website by each user. There are several uses for this. One is to analyze users' click patterns
on the website and find out which webpage gets the most hits from which geographic
location. Based on this information the company can further analyze and improve its website to
increase user traffic. In order to analyze web logs, they have to be extracted from various
regions, transformed and stored in data warehouses.
ERP systems:
ERP stands for Enterprise Resource Planning. ERP systems were designed to provide
integrated enterprise solutions by integrating important business functions like inventory,
human resources, sales, financials, etc. Since ERP systems are massive and extremely
complex, it takes years to collect data in them according to business requirements.
This makes them a valuable source for ETL systems. Nowadays most ETL tools
provide ERP connectors to fetch data from ERP systems.
FTP
FTP stands for File Transfer Protocol and is a standard protocol used over TCP/IP
networks to transfer files between machines. When ETL tools don't have appropriate
connectors/adapters to connect to the source system, FTP pull/push is used to fetch data
into the ETL environment.
FTP pull takes place at the ETL end and is used when we are sure the source file
will be available at a pre-determined time. Here FTP is initiated by the ETL tool at a
scheduled time.
FTP push takes place at the source and is used when the availability of the source file
is unknown. Here FTP is initiated by the source system when the file is created.
3.5 Data validation and integration
This section focuses on integration and validation of extracted data. Data
extracted from the source systems must be validated and integrated before proceeding to
the next phase of the ETL process.
Data validation: Data that has been extracted from different source systems, and
landed in the landing zone, must be validated before being integrated. This phase makes
sure the source data has been completely transferred to the ETL environment. Also, it
makes sure the latest data in the source systems has been extracted. It is important that the
extraction process only extracts the latest business data; otherwise reloading old data would
lead to duplicates in the data warehouse.
There are several ways to check whether the complete source data has been extracted (a
minimal example follows the list):
If flat files are extracted, make sure the record counts in source and target match.
For flat files, make sure the source and target checksums match.
For data extracted into tables, make sure the record counts in source and target match.
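As a minimal example, a record-count check for data extracted into a table can compare the landed row count against a count supplied by the source, for instance in an audit or control table. The table names (stg_orders, etl_audit) and the batch_id column are hypothetical.

    -- Compare the landed row count with the count reported by the source system
    SELECT a.expected_count,
           (SELECT COUNT(*) FROM stg_orders) AS landed_count,
           CASE WHEN a.expected_count = (SELECT COUNT(*) FROM stg_orders)
                THEN 'VALID'
                ELSE 'COUNT MISMATCH'
           END AS validation_status
    FROM   etl_audit a
    WHERE  a.batch_id = 20101001;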
Data integration: Data that has been extracted from different source systems, and
landed in the landing zone, must be integrated after it has been validated. Also,
data that is similar should be integrated together. Care must be taken to start
the integration process only after all the required data sets are present in the landing zone,
since source systems can be located in different geographic locations and may have to
be extracted at different times.
3.6 Data cleansing
This section focuses on cleansing the validated data. Data cleansing is a process
that makes sure the quality of data in a warehouse is maintained. It is also defined as a
process which helps to maintain complete, consistent, correct, unambiguous and accurate
data [2]. These attributes define the quality of data and are explained below.
Complete: Complete means data is defined for each instance without any NULL
values and records are not lost in the information flow. For example, at the Social
Security Administration office, an individual's record with no SSN would be incomplete.
Consistent: The definition, value and format of the data must be the same throughout
the warehouse. For example, California State University, Sacramento is known by several
names: Sac State, CSUS, Cal Univ. Sacramento, etc. To make this consistent, only one
convention should be followed everywhere.
Correct: The value should be true and meaningful. For example, age cannot be
negative. Another example: if a pallet contains 4 items, then the same should be reflected in
the warehouse.
Unambiguous: The data can have only one meaning. For example, there are
several cities in the U.S. by the name New Hope, but there is only one city by that name in
Pennsylvania. This unambiguous data should be loaded in the warehouse for clarity.
Accurate: Accurate means that the data loaded in the warehouse should be
complete, consistent, correct and unambiguous, and must be derived or calculated with
precision.
There are several data-quality checks which can be enforced.
1) Column check: This check ensures that incoming data contains the expected values from
the source system's perspective. Some of the column property checks are: checking for
NULL values in non-nullable columns, checking fields for unexpected lengths, checking for
numeric values which fall outside a range, checking for fields which contain values other
than what is expected, etc. (an SQL sketch of such checks follows this list).
2) Structure check: Column checks focus on individual columns, but a structure check
focuses on the relationships between those columns. Structure checks can be enforced by
having proper primary keys and foreign keys so that they obey referential integrity.
Structure checks also enforce parent-child relationships.
3) Data and value check: Data and value checks can be simple or complex. An example of
a simple check is that a customer who flies business class gets double the number of
points added to his account compared to a customer who flies economy class. An example of a complex
check is that a customer cannot be in a limited partnership with firm A and a member of the board
of directors of firm B.
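Column checks of the kind listed in item 1 can be expressed as simple queries that count or isolate offending rows. A sketch with hypothetical table and column names (stg_customer, cust_ssn, cust_age):

    -- NULL values in a non-nullable column
    SELECT COUNT(*) AS null_ssn_count
    FROM   stg_customer
    WHERE  cust_ssn IS NULL;

    -- Fields with an unexpected length
    SELECT *
    FROM   stg_customer
    WHERE  LENGTH(cust_ssn) <> 9;

    -- Numeric values falling outside an allowed range
    SELECT *
    FROM   stg_customer
    WHERE  cust_age NOT BETWEEN 0 AND 120;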
3.7 Data transformations
This section focuses on transforming the cleansed data to load it into the
warehouse. These transformations are applied based on the business requirements, the data
warehouse loading approach (top down or bottom up) and the source-to-target mapping
document.
A transformation operation takes input data, modifies it by applying one or more
functions and returns the output data. This could lead to a change in data value, a change of
data type or a change of data structure by addition or deletion of data. When
multiple functions are applied, the intermediate data are called data sets.
After data undergoes transformation according to the requirements, it is a better
approach to save it temporarily in a database before finally loading it into the warehouse.
This temporary area of storage is called the staging zone. In the last step, the
data load phase, if the load from the staging area into the warehouse fails,
then only the load process needs to be restarted, avoiding a repeat of the cleansing and
transformation processes.
Figure 9 Overview of data transformation
Described below are a few transformation operations. For a comprehensive list of
transformation operations please refer to the online courseware.
3.7.1 Surrogate key generator operation:
A surrogate key is a number that uniquely identifies a record in a database table
and is different from a primary key. A surrogate key is not derived from the application
data, but the primary key is.
A surrogate key generator operation takes one input, adds a new column which
contains a surrogate key for each record and outputs the result dataset. For each input
record, the surrogate key is calculated based on 4 parameters: initial value, current value,
increment value and final value. If it is the first record in the dataset, then the surrogate key
generator assigns the initial value to it. If it is not the first record, then the surrogate key generator
adds the increment value to the current value and stores it in the record. Usually the current
value is stored in a flat file or in a database table.
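In MySQL much of this bookkeeping can be delegated to an AUTO_INCREMENT column, which keeps the current value internally and adds the increment for every inserted record. This is only a sketch of the idea; the table structure is illustrative and mirrors the Store dimension example used in the data load section.

    -- Surrogate key generated automatically as records are inserted
    CREATE TABLE store_dim (
        surr_key      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- surrogate key
        store_id      INT NOT NULL,                             -- natural key
        store_city    VARCHAR(50),
        store_state   VARCHAR(50),
        store_country VARCHAR(50),
        PRIMARY KEY (surr_key)
    ) AUTO_INCREMENT = 384729478;   -- initial value

    INSERT INTO store_dim (store_id, store_city, store_state, store_country)
    VALUES (37287, 'Sacramento', 'California', 'United States');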
3.7.2 Lookup operation
A lookup operation has an input dataset, a reference dataset and an output/final
dataset, as shown in Figure 10.
Figure 10 Lookup operation
The lookup operation fetches the fields specified by the user, looks for a match in the
reference dataset and returns the joined records if they match; otherwise it returns NULL values
in place of the reference columns.
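Expressed in SQL, the behaviour described above corresponds to a LEFT JOIN: matching reference columns are attached to each input record and NULLs are returned where no match exists. A sketch with hypothetical names:

    -- Attach reference columns to each input record; unmatched rows get NULLs
    SELECT i.order_id,
           i.store_id,
           r.store_city,    -- NULL when no matching reference record exists
           r.store_state
    FROM   stg_orders i
    LEFT JOIN store_dim r ON r.store_id = i.store_id;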
3.7.3 Merge operations
A merge operation can have one or more input datasets but only one output dataset. It
combines the data from the input datasets into the output. The criteria for using a merge
operation are that the number of fields in all input datasets must be the same and the data types of all
fields must match. The result dataset will have the same number of
fields and the same data types as its input datasets. The number of records in the output dataset
is the total of all input datasets together.
Figure 11 Merge operation
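Where the input datasets are staged in identically structured tables, the merge described above behaves like a UNION ALL, which keeps every record from every input. A sketch with hypothetical names:

    -- Combine two identically structured staging tables into one output dataset
    SELECT store_id, store_city, store_state, store_country FROM stg_store_west
    UNION ALL
    SELECT store_id, store_city, store_state, store_country FROM stg_store_east;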
3.7.4 Aggregation operation
The aggregation operation takes a single input and produces a single output. It
classifies records from the input dataset into groups and computes totals, minimums,
maximums or other aggregate functions for each group, writing them to the output
dataset. The fields to group by, and the fields used for aggregate function calculation,
must be specified by the user.
Below are a few examples of aggregate functions which commercial ETL tools provide
nowadays (an SQL sketch follows the list):
Maximum value
Minimum value
Mean
Percentage coefficient of variation
Standard deviation
Sum of weights
Sum
Missing values count
Range
Variance
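In SQL, the aggregation operation amounts to a GROUP BY over the grouping fields with aggregate functions computed per group. A sketch with hypothetical names:

    -- Group sales records by store and compute aggregates for each group
    SELECT   store_id,
             SUM(sale_amount) AS total_sales,
             MIN(sale_amount) AS min_sale,
             MAX(sale_amount) AS max_sale,
             AVG(sale_amount) AS mean_sale,
             COUNT(*)         AS sale_count
    FROM     stg_sales
    GROUP BY store_id;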
3.7.5 Change capture operation
The change capture operation takes two datasets as input, makes a record of the
differences and produces one output dataset. The input datasets are denoted as the before and
after datasets. The output dataset contains records which represent the changes made to the
before data set to obtain the after data set. The comparison is based on a set of key fields;
records from the two data sets are assumed to be copies of one another if they have the same
values in these key columns. The output dataset has an extra column which denotes
insert, delete, copy or edit. These terms are explained in the Change apply operation
section.
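In SQL terms, a change capture over a single key and value column can be sketched as a set of joins between the before and after datasets, each join producing one change code. MySQL has no FULL OUTER JOIN, so inserts and deletes are detected with separate LEFT JOINs. All table and column names are hypothetical.

    -- Records present only in the after dataset: inserts
    SELECT a.store_id, 'insert' AS change_code
    FROM   after_ds a LEFT JOIN before_ds b ON b.store_id = a.store_id
    WHERE  b.store_id IS NULL
    UNION ALL
    -- Records present only in the before dataset: deletes
    SELECT b.store_id, 'delete'
    FROM   before_ds b LEFT JOIN after_ds a ON a.store_id = b.store_id
    WHERE  a.store_id IS NULL
    UNION ALL
    -- Records present in both datasets but with different values: edits
    SELECT a.store_id, 'edit'
    FROM   after_ds a JOIN before_ds b ON b.store_id = a.store_id
    WHERE  a.store_city <> b.store_city
    UNION ALL
    -- Records present in both datasets with identical values: copies
    SELECT a.store_id, 'copy'
    FROM   after_ds a JOIN before_ds b ON b.store_id = a.store_id
    WHERE  a.store_city = b.store_city;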
3.7.6 Change apply operation
The change apply operation can be used only after a change capture operation. It takes
the change data set, which is the resultant dataset from the change capture operation, and applies
the encoded change operations to the before data set to compute an after data set. The
encoded change operations are described below:
Insert: The change record is copied to the output;
Delete: The value columns of the before and change records are compared. If the value
columns are the same or if the Check Value Columns on Delete is specified as False, the
change and before records are both discarded; no record is transferred to the output. If the
value columns are not the same, the before record is copied to the output.
Edit: The change record is copied to the output; the before record is discarded.
Copy: The change record is discarded. The before record is copied to the output.
3.7.7 Data type operation
Data type operations change the data type, precision or format of the input
dataset.
Data type conversions: Conversions from text to date, text to timestamp, text to number,
date to timestamp and decimal to integer are a few examples.
Precision conversions: Changing the numeric precision, say from the decimal value 3.12345
to the decimal value 3.12, is an example of this conversion.
Format conversions: Changing date and timestamp formats is one example of
this conversion. If the input is 28/01/90, based on the business requirement, this could be
changed to 1990-Jan-28.
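These conversions map onto ordinary SQL functions. The sketch below reproduces the date example from the text and a simple precision change; the literal values are illustrative.

    -- Text '28/01/90' converted to a DATE and re-formatted as '1990-Jan-28'
    SELECT DATE_FORMAT(STR_TO_DATE('28/01/90', '%d/%m/%y'), '%Y-%b-%d') AS formatted_date,
           CAST('123' AS UNSIGNED) AS text_to_number,    -- text to number conversion
           ROUND(3.12345, 2)       AS reduced_precision; -- 3.12345 becomes 3.12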
Compare operations: A compare operation takes two inputs and produces a single output.
It performs a column-by-column comparison of the records. This can be applied to
numeric and alpha-numeric fields. The output dataset contains three columns. The first
column is the result column, which contains a code giving the result of the compare
operation. The second column contains the columns of the first input link and the third
column contains the columns of the second input link. The result column usually has
numeric codes which denote whether both inputs are equal, the first is an empty value, the second is
an empty value, the first is greater or the first is lesser.
3.8 Data load
This section focuses on the last step of the ETL process, which is loading the validated,
integrated, cleansed and transformed data into the data warehouse.
As mentioned in the previous section, it is a better approach to stage the ready-to-
load data in temporary tables in a database. If the load process fails while loading data
from the staging area into the warehouse, then only the load process needs to be restarted,
avoiding a repeat of the cleansing and transformation processes.
Basically, there are two types of data load, namely the historic load and the incremental load.
3.8.1 Historic load
A data warehouse contains historical data. Based on user requirements, data in the
warehouse has to be retained for a particular duration of time. This duration could be
anywhere from a single year to several decades.
When a data warehouse is first created, i.e., when its tables are created, it contains no records.
This is because planning for the creation of a warehouse could take several months or years.
During this time there would be a lot of data in the OLTP systems which act as source systems
for the warehouse. Loading this initial historical data into the warehouse is the initial historic
load.
Sometimes it may so happen that, while data is being loaded regularly into the
warehouse, the ETL process breaks and fixing it takes several hours to several
days. During this fix time, data accumulates in the OLTP systems. Loading this data is also a
historic load.
3.8.2 Incremental load
The incremental load is the periodic load of data into the warehouse. This process loads
the most recent data from the OLTP systems. It runs periodically until the end of the
warehouse's life. Incremental loads could run daily, weekly, fortnightly, monthly,
quarterly, yearly or at a scheduled time. For every incremental load there is a load
window within which the ETL load process should start and finish loading into the target
warehouse. After the end of the load window, business users will usually start querying and
analyzing the data in the warehouse.
There are several operations involved in loading the warehouse. Based on the type of
table being loaded, fact or dimension, the appropriate operation is selected. A few of the data
load operations are described below:
Insert operation: This operation inserts data into the warehouse. If the data already exists in
the table, this operation will fail. Hence the target table should be checked ahead of time before
executing this operation.
Update operation: This operation updates the existing records in the warehouse. Unlike
the insert operation, this does not fail if the records to update are not found.
Upsert operation: This operation first executes an update operation and, if that fails, then it
inserts the records into the warehouse.
Insert update operation: This operation first executes an insert operation and, if that fails,
then it updates the existing records in the warehouse. The insert update operation is preferred to
the upsert operation since it is more efficient [7].
Delete insert operation: This operation first executes a delete operation and then inserts
the source records into the warehouse.
Bulk load operation: Bulk load is a utility provided by major ETL vendors these days
which is faster and more efficient in loading huge amounts (hundreds of millions of records) of data into
the warehouse [4][7].
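MySQL offers a built-in way of combining the insert and update operations described above: INSERT ... ON DUPLICATE KEY UPDATE inserts the record and switches to an update when a record with the same key already exists. This is only a sketch; the table and values mirror the Store dimension example in the next section.

    -- Insert the staged record; if the key already exists, update it instead
    INSERT INTO store_dim (surr_key, store_id, store_city, store_state, store_country)
    VALUES (384729478, 37287, 'Los Angeles', 'California', 'United States')
    ON DUPLICATE KEY UPDATE
        store_city    = VALUES(store_city),
        store_state   = VALUES(store_state),
        store_country = VALUES(store_country);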
3.8.3 Loading dimension tables
A table that stores business-related attributes and provides the context for a fact
table is called a dimension table. Dimension tables are in denormalized form.
A dimension table contains a surrogate key, which is a meaningless incrementing integer
value. Surrogate key values are generated and inserted by the ETL process along with the other
dimension attributes. The surrogate key is made the primary key for the dimension table and is used to join
with records in the fact table. By definition a surrogate key is supposed to be
meaningless. However, it can be made meaningful by creating the surrogate key value by
intelligently combining data from other attributes in the dimension table. But this would lead to
more ETL processing, maintenance and updates if the actual attributes, on which these
keys are based, change.
In addition to the surrogate key, a dimension table also contains a natural key. Unlike
the surrogate key, the natural key is derived from meaningful application data. The dimension table
also consists of other attributes.
A slowly changing dimension table is a dimension table which has updates
coming in for the existing records. According to user requirements, older values in
dimension tables can be historically maintained or discarded. Based on this there are
three types of slowly changing dimensions. Loading data into a dimension table differs
based on the type of dimension table. Below is a detailed explanation of each type and how
to load it.
3.8.3.1 Type 1 Slowly Changing Dimension
For an existing record in the dimension table, if there is an update on any or all
attributes from the source systems, the SCD Type 1 approach is to overwrite the existing record
without saving the old values. This approach is used when an already inserted record in the
dimension table is incorrect and needs to be corrected, or when business users don't see any
use in keeping a history of previous values.
If the record doesn't exist, then a new surrogate key value is generated, appended to the
dimension attributes and inserted into the table.
If the record does exist, then the surrogate key of the existing dimension record is fetched and
appended to the new source record, the old record is deleted and then the new record is
inserted into the dimension table.
Upsert, insert update or delete insert operations can be used in this scenario.
However, care has to be taken to save the existing surrogate key value when using the delete
insert operation.
If there are a large number of Type 1 changes, then the best way to implement them is
to prepare the new dimension records in a new table, then drop the existing records in the
dimension table and use a bulk load operation.
Surr_Key Store_id Store_city Store_state Store_country
384729478 37287 Sacramento California United States
Table 1 Before snapshot of Store dimension table for Type 1 SCD
Surr_Key Store_id Store_city Store_state Store_country
384729478 37287 Los Angeles California United States
Table 2 After snapshot of Store dimension table for Type 1 SCD
Note: In the new snapshot the store_city field is updated without changing the surrogate key
value.
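A Type 1 change like the one shown in Tables 1 and 2 can be applied with a plain UPDATE keyed on the natural key, leaving the surrogate key untouched. A minimal sketch, assuming a store_dim table with the columns shown above:

    -- Overwrite the changed attribute in place; the surrogate key is unchanged
    UPDATE store_dim
    SET    store_city = 'Los Angeles'
    WHERE  store_id   = 37287;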
3.8.3.2 Type 2 Slowly Changing Dimension
In this approach, if there are any changes to dimension attributes for existing
records in dimension table, the old values are preserved.
When exiting record needs to be changed, instead of over writing, a new record with a
new surrogate key is generated and inserted into dimension table. This new surrogate key
is used in fact table from that moment onwards. There is no need to change or update
existing records in fact or dimension table.
Type 2 SCD requires a good change capture system in place to detect changes in
source systems and then notify ETL system. Sometimes update notifications won't be
propagated from source to ETL system. In this case ETL code is supposed to download
the complete dimension and make a field by field, record by record comparison to detect
updates. If dimension table has millions of records and has over 100 fields then this
would be a very time consuming process and ETL code can't complete within the
37
specified load window. In order overcome this problem, CRC codes are associated with
each record. Entire record is given as input to CRC function which calculates a unique
long integer code. This integer code will change even if there is a change of a single
character in input record. When CRC codes are associated with each record then only
these codes are compared instead of field by field, record by record comparison.
To implement this in ETL code, there are two flows. One flow has the new record and the
other has the old record from the dimension table. For the new record, a new surrogate
key is generated along with the current-flag or start- and end-date values; this is used in
the insert operation. For the old record, the current-flag or end-date values are updated;
this is used in the update operation.
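The CRC comparison described above can be sketched in MySQL as follows. This is an illustrative sketch only: it assumes a hypothetical staging table stage_store holding the freshly extracted rows and a crc_value column stored on the dimension when each row was originally loaded.

-- Detect stores whose attributes have changed since the current dimension row was loaded.
SELECT s.Store_id
  FROM stage_store s
  JOIN store_dim d
    ON d.Store_id = s.Store_id
 WHERE d.Current_flag = 'Y'
   AND d.crc_value <> CRC32(CONCAT_WS('|', s.Store_city, s.Store_state, s.Store_country));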
There are two ways to implement SCD Type 2:
1) Method 1: In the first method, a new flag column is added to the dimension table
which indicates whether the record is current or not. In the example below, when a new
record with surrogate key 384729479 is added, its current flag is inserted as "Y" and the
current flag of the record with surrogate key 384729478 is set to "N". The same applies
to snapshot 3.
Snapshot 1 of a Store dimension table:
Surr_Key Store_id Store_city Store_state Store_country Current
flag
384729478 37287 Sacramento California United States Y
Table 3 Snapshot 1 of Store dimension table for Type 2 SCD (Method 1)
Snapshot 2:
Surr_Key Store_id Store_city Store_state Store_country Current
flag
384729478 37287 Sacramento California United States N
384729479 37287 Los Angeles California United States Y
Table 4 Snapshot 2 of Store dimension table for Type 2 SCD (Method 1)
Snapshot 3:
Surr_Key Store_id Store_city Store_state Store_country Current
flag
384729478 37287 Sacramento California United States N
384729479 37287 Los Angeles California United States N
384729521 37287 Arlington Texas United States Y
Table 5 Snapshot 3 of Store dimension table for Type 2 SCD (Method 1)
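A hedged MySQL sketch of Method 1, using the Store dimension above, could look like the following. In practice the new surrogate key would come from a sequence or auto-increment column rather than a literal value.

-- Expire the version that is current today.
UPDATE store_dim
   SET Current_flag = 'N'
 WHERE Store_id = 37287
   AND Current_flag = 'Y';

-- Insert the new version with a new surrogate key and the current flag set to 'Y'.
INSERT INTO store_dim (Surr_Key, Store_id, Store_city, Store_state, Store_country, Current_flag)
VALUES (384729479, 37287, 'Los Angeles', 'California', 'United States', 'Y');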
2) Method 2: In the second method, two columns are added to the dimension table, namely
start date and end date. The start date holds the date the record was inserted, and the end
date holds either the high date (12/31/9999) or a Current_date - 1 value. As the example
below shows, in snapshot 2, when a new record with a new surrogate key is inserted, its
end date is the high date and its start date is the current date. The previous record's end
date is updated with Current_date - 1. The same applies to snapshot 3. To find the latest
record using Method 2, a query is run against the dimension table where end_date =
'12/31/9999'.
Snapshot 1 of a Store dimension table:
Surr_Key Store_id Store_city Store_state Store_country Start date End date
384729478 37287 Sacramento California United States 10/1/2010 12/31/9999
Table 6 Snapshot 1 of Store dimension table for Type 2 SCD (Method 2)
Snapshot 2:
Surr_Key Store_id Store_city Store_state Store_country Start date End date
384729478 37287 Sacramento California United States 10/1/2010 10/20/2010
384729479 37287 Los Angeles California United States 10/21/2010 12/31/9999
Table 7 Snapshot 2 of Store dimension table for Type 2 SCD (Method 2)
Snapshot 3:
Surr_Key Store_id Store_city Store_state Store_country Start date End date
384729478 37287 Sacramento California United States 10/1/2010 10/20/2010
384729479 37287 Los Angeles California United States 10/21/2010 11/01/2010
384729521 37287 Arlington Texas United States 11/02/2010 12/31/9999
Table 8 Snapshot 3 of Store dimension table for Type 2 SCD (Method 2)
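A corresponding sketch of Method 2 in MySQL, again assuming the Store dimension above, closes the open row and opens a new one. The statements are illustrative; dates are handled with MySQL's date functions.

-- Close the currently open version as of yesterday.
UPDATE store_dim
   SET End_date = DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)
 WHERE Store_id = 37287
   AND End_date = '9999-12-31';

-- Insert the new version, effective today and open-ended (high date).
INSERT INTO store_dim (Surr_Key, Store_id, Store_city, Store_state, Store_country, Start_date, End_date)
VALUES (384729479, 37287, 'Los Angeles', 'California', 'United States', CURRENT_DATE, '9999-12-31');

-- The latest version of every store is then simply:
SELECT * FROM store_dim WHERE End_date = '9999-12-31';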
Please refer to the online courseware for loading a Type 3 SCD table.
3.8.4 Loading fact tables
Fact tables contain measurements or metrics of business processes of an
organization. According to Ralph Kimball [5], "measurement is an amount determined by
observation with an instrument or a scale".
Fact tables are defined by their grain. A grain represents the most atomic level by
which facts are defined. For example, in a sales fact table, the grain could be an
individual line item of a sales receipt.
Fact tables contain one or more measurements along with a set of foreign keys
which point to dimension tables. Dimension tables are built around fact tables to provide
context to the measurements present in fact tables. A fact table alone, without any
dimension tables surrounding it, makes no business sense.
Each fact table has a primary key, which is a chosen field or group of fields.
The primary key of a fact table should be defined carefully so that duplicates don't occur
during the load process. Duplicates can occur when insufficient attention is paid during
the fact table's design or when unexpected values start flowing in from source systems.
When this happens there is no way to tell those records apart. To avoid this, it is a good
approach to include a unique key sequence with each fact record insert.
To insert source records into a fact table, each natural dimension key should be
replaced by the latest surrogate key. The surrogate key for a natural key value can always
be found in the respective dimension table. Surrogate keys are looked up using the lookup
operation defined in the Transformations chapter.
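A minimal MySQL sketch of this lookup is shown below. The staged sales feed (stage_sales) and fact table (sales_fact) are illustrative names; the sketch assumes a Type 2 Store dimension carrying a current flag.

-- Replace the natural store key with the current surrogate key while inserting facts.
INSERT INTO sales_fact (Store_key, Sale_date, Sale_amount)
SELECT d.Surr_Key, s.Sale_date, s.Sale_amount
  FROM stage_sales s
  JOIN store_dim d
    ON d.Store_id = s.Store_id
   AND d.Current_flag = 'Y';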
Below are a few points, which can improve load performance, to keep in mind
when loading a fact table:
1) Insert and update records should be loaded separately. This can be done by writing the
temporary update and insert data into different datasets in the staging area and creating
two separate ETL flows to load these records into the fact table. Many vendor ETL tools
provide upsert and/or insert-update options; in this scenario, the insert-update option
works efficiently.
2) Avoid SQL insert statements and use a bulk load utility, if available, to insert a huge
number of records efficiently, thus improving the performance of the ETL code.
3) Load data in parallel if there are no dependencies between ETL flows. For example, if
two tables being loaded do not have a parent-child relationship, the ETL code to load
both can be started simultaneously. Many vendor ETL tools provide partitioning and
pipelining mechanisms to load data in parallel. In the partitioning mechanism, a huge
dataset is partitioned and several processes are created to work on the partitions in
parallel. In the pipelining mechanism, after a process finishes processing a part of a huge
dataset, it passes the processed chunk to the next stage in the ETL code. These two
mechanisms speed up ETL processing significantly.
4) Disable rollback logging for the databases which house data warehouse tables. Rollback
logging is best suited for OLTP applications, which require recovery from uncommitted
transaction failures, but for OLAP applications it consumes extra memory and CPU
cycles since all data is entered and managed by the ETL process [4].
5) Temporarily disabling indexes on fact tables while loading data and enabling them
when the load is complete is a great performance enhancer. Another option is to drop
unnecessary indexes and rebuild only the required ones.
6) Partitioning very big fact tables improves users' query performance. A table and its
index can be physically divided and stored on separate disks. By doing so, a query that
requires a month of data from a table having millions of records can fetch data directly
from that particular physical disk without scanning other data [4].
A few of the above steps can be thought through ahead of time and implemented early,
saving time on rework at a later stage of the project.
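Points 2 and 5 can be combined in MySQL roughly as shown below. This is a sketch: the table name and file path are illustrative, and DISABLE KEYS only suspends non-unique indexes on MyISAM tables.

-- Suspend index maintenance, bulk load, then rebuild indexes once.
ALTER TABLE sales_fact DISABLE KEYS;

LOAD DATA INFILE '/staging/sales_fact.dat'
INTO TABLE sales_fact
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

ALTER TABLE sales_fact ENABLE KEYS;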
3.9 Exception handling
This chapter discusses exception handling in the ETL process. An exception in
ETL is defined as any abnormal termination, unacceptable event or incorrect data which
stops the ETL flow and thus prevents data from reaching the data warehouse. Exception
handling is the process of handling these exceptions without corrupting any existing
committed data and terminating the ETL process gracefully. During the ETL process,
exceptions occur either due to incorrect data or due to infrastructure issues.
Data-related exceptions can be caused by incorrect data formats, incorrect values
or incomplete data from the source systems. These records need to be captured as rejects,
either in a file or in a database table, and should be corrected, reprocessed and inserted in
the next ETL run.
Infrastructure exceptions can be caused by hardware, network, database, operating
system or other software issues. In such scenarios, when ETL jobs fail, care must be taken
to make them restartable. Making the jobs restartable means that when ETL jobs are
restarted they don't insert duplicate data or abort because of existing data.
Exception handling should be built into the extraction, validation & integration,
cleansing, transformation and data load phases in order to have stable and efficient ETL
code in place.
In this chapter we discussed the courseware. Several components of the ETL
process which are important for every ETL project implementation were discussed. Any
user who wishes to participate in the implementation of ETL projects should know about
these components and should include appropriate exception handling mechanisms for
successful implementations. In the next chapter, the architecture of the ETL web tool
created for this project is discussed, along with its various components and how they
interact with each other.
Chapter 4
ETL TOOL ARCHITECTURE
This chapter describes the architecture of the tool implemented in this project.
Initially the various layers, their interaction with each other and how they integrate to
form a system are discussed. Then the components of the tool are discussed in detail.
The ETL tool is composed of three layers: the client layer, the processing layer
and the database layer.
Figure 12 ETL tool layers
The client layer is the one visible to the end user and is used by the user to interact
with the system. In the processing layer, the processing of the user input takes place. This
layer has source connectors for text files and MySQL databases. The database layer has
the target connector for the MySQL database and uses it to connect to the target MySQL
database in order to insert records.
Users use a web browser, such as Microsoft Internet Explorer, to control the ETL
tool. They can specify the number of sources, the type of each source, the transformations
to be applied to the extracted data and the target MySQL table connection details.
The processing layer collects the user input from the client layer. It processes user
requests in the background and displays success or error messages to the user. This layer
has MySQL and text file connectors which connect to the source, based on the user input,
to extract data. Once the data is extracted, it is staged in temporary MySQL tables or flat
files so that transformations can be applied to it without disturbing the source content.
The database layer has the target MySQL connector, which connects to the target
MySQL table and inserts the transformed data.
Figure 13 shows the various components of the tool and how they are connected with the
layers.
Figure 13 Layers and components of ETL tool
Users must use a web browser to access the tool and the courseware. There are two
types of users: guest and registered user. A guest user is one who is not a student, faculty
or staff of California State University, Sacramento. Guest users don't need a username or
password to access the tool or courseware but have limited access to the tool. They can
only select sample source text files or sample database tables and write only to sample
target database tables. The options to enter database and file details are disabled for them.
On the other hand, a registered user is one who is a student, faculty or staff of California
State University, Sacramento. Registered users require a username and password to log in
and get unrestricted access to the tool. They can specify custom absolute file paths and
custom MySQL tables as source. These tables can be in any database located within the
California State University, Sacramento campus and should be accessible via the campus
LAN. They can also specify a custom target MySQL table to load their source data into.
When users, guest or registered, open the homepage they have an option to go
either to the courseware or to the tool. When they click on the tool link, the first page
they see is the login page. If they don't have a username and password they can click on
the guest link to access the tool.
Once they are in the tool, they come across the extract, transform and load pages
in that order. On the extract page, they can select either a text file or a MySQL table as
the source and enter details such as the absolute path or the database details. When they
click on continue, the input details are passed on to the processing layer as shown in
Figure 13.
The processing layer reads the input details and uses source connectors to validate
the input data. If the user specifies text as the source, the text source connector checks
that the file exists and is readable before proceeding to the next step. If the user specifies
a MySQL table, the MySQL source connector connects to the source database and makes
sure it has table read privileges before proceeding further. Once the validation is
successful, the processing layer copies the source data over to the landing zone. A landing
zone is a temporary work area where the source data is landed so that it can be processed
by the tool. The reason for having a landing zone is that the source data remains
untouched while the tool has complete access to manage and modify the files in the
landing zone.
The next step is to define metadata for the selected source. If the selected source
is text, the processing layer displays a webpage to define metadata for the individual
columns. If the source is a MySQL table, the processing layer automatically fetches the
table's metadata from the database dictionary.
Once metadata for the source is defined, the user chooses the transformations which
need to be applied to the extracted source data. Applying the transformations is also taken
care of by the processing layer with the help of PHP and Korn shell scripts. The
transformed data is stored in temporary files or MySQL tables, and this zone is called the
staging zone.
Data from the staging zone is picked up by the loading layer and is loaded into
target MySQL tables using the target connectors.
This chapter gave an overview of the architecture, layers and components of the
tool, and discussed the system design. Familiarity with the design and architecture helps
the reader better understand the detailed implementation of the tool in Chapter 5.
Chapter 5
ETL TOOL IMPLEMENTATION
This chapter describes the implementation of the ETL web tool. It discusses the
guest and registered user login process. Based on the type of user, several features may or
may not be available. All options available in each phase are explained along with
screenshots and examples for each option.
5.1 Using the tool
Users must use a web browser to use the ETL tool. There are two types of users:
guest and registered user. A guest is anyone who is not a student, faculty or staff of
California State University, Sacramento. Guests don't need a username or password to
use the tool; they click on the hyperlink "Guest? Click here". When using the guest
credentials they have limited access to source files, source tables and target tables. Four
sample files along with their absolute paths and four sample tables are listed in the tool
online. Guests cannot enter source or target table details, as those fields are disabled in
the tool.
A registered user is one who is a student, faculty or staff of California State
University, Sacramento and has been supplied a username and password by the professor.
Registered users have unrestricted access, unlike guest users. They can enter the absolute
path of the source files which they want to load. They can enter credentials like server
name, database name, table name, username and password of different MySQL source
and target databases, which can be anywhere on campus and reachable via the campus LAN.
New users can be added or existing users deleted by the professor by logging in as
administrator. Usernames and passwords are stored in the Users MySQL table in the
web_etl database. This database can be accessed only by the administrator. The Users
MySQL table structure is given below.
Username Varchar(50) Primary Key
Password Varchar(50) Not null
Table 9 Table structure to store usernames and password
From Table 9, Username is of varchar datatype with length 50 and is the primary key
of the table. The Password field is of varchar datatype with length 50 and has a NOT NULL
constraint. Each username entered must be unique in this table.
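Expressed as DDL, the structure in Table 9 corresponds to roughly the following statement (the exact script used to create the table is not reproduced in this report):

CREATE TABLE Users (
    Username VARCHAR(50) PRIMARY KEY,
    Password VARCHAR(50) NOT NULL
);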
Figure 14 shows the flow the administrator has to follow in order to add or delete
users from the system.
Figure 14 Add or delete users
The administrator opens the homepage, clicks on the ETL tool link, then enters
administrator username and password. Once inside the system, the administrator can add
new users or delete existing users.
5.2 Extraction phase
Users must use a web browser to use the ETL tool. The first page displays the
extraction phase of the tool. Here the user can select the number of sources and the type
of each source. The minimum number of sources is one and the maximum is two. The
type of source can be a flat file and/or a MySQL table. Below is the flow of how users can
select a text file or MySQL table as source.
Figure 15 Source selection
Users open the homepage, choose the ETL tool, log in with their username and password
or choose the guest login, and then select the number of sources and their types. Both
options are explained in detail below.
5.2.1 Text file as source
If the user is a guest, then he/she has limited options and can only choose from
the list of sample files displayed on the webpage. A registered user, on the other hand,
can specify a custom absolute path name.
If the user chooses a text file as one of the sources, then he/she has to specify the
absolute path of the file present on Linux. The following points should be followed:
1. The file should exist and should have read permissions.
2. The file should be in ASCII format.
3. It should be comma delimited without a final delimiter.
4. It should not have any quotes or double quotes to separate the fields.
5. It can have a maximum of ten fields.
When the user enters the absolute path of the source file and clicks Next, the following are
checked in the validateSourceFile.ksh script:
1. Check if the user has entered the file path in the input box. If yes, proceed to step 2,
else raise an error.
2. Check if the entered path is a file or a directory. If it is a file, proceed to step 3, else
raise an error.
3. Check if the file has read permissions. If yes, proceed to step 4, else raise an error.
4. Check if the file size is equal to zero. If yes, the file is empty, so raise an error; else
proceed to the defineMetaData.php page to define the metadata of the source file.
Below is the code snippet of validateSourceFile.ksh
## Check for number of arguments to this program
## 1st argument is the filename
if [[ $# -ne 1 ]]; then
    echo "Error: Name of the source file not entered"
    exit 1;
fi

## Assign 1st argument to var sourceFile
sourceFile=$1

## Check if var sourceFile is empty
if [[ -z $sourceFile ]]; then
    echo "Error: Filename supplied is empty"
    exit 1;
fi

## Check if var sourceFile is a directory
if [[ -d $sourceFile ]]; then
    echo "Error: Source file path supplied is a directory. Please specify a file"
    exit 1;
fi

## Check for file permissions
if [[ ! ( -r $sourceFile ) ]]; then
    echo "Error: Source file path is incorrect or source file doesn't have read permissions"
    exit 1;
fi

## Check for empty file
if [[ ! ( -s $sourceFile ) ]]; then
    echo "Error: Source file path is incorrect or source file is empty"
    exit 1;
fi
Once the source file is validated by validateSourceFile.ksh, the next step is to
define the metadata in the defineMetaData.php page. The ETL tool allows up to ten fields
to be defined for a source text file. The metadata consists of the field name, its data type
and its length. The defineMetaData.php page displays a snapshot of 10 lines of the source
file so that the user can refer to it while defining the metadata. Below is a screenshot of
the web page which allows users to define metadata for the source file.
Figure 16 Screenshot of Define Metadata page
The first field is the name field. This is an input box where users have to type the field
name. The fields in the source have to be named in this page, and the source file shouldn't
contain field names as its first record.
The second field is the data type field. It is a drop down box and only one option can be
selected. It has the following values:
Varchar
Char
Integer
Date
Timestamp
Decimal
The third field is an input box where the length of the field has to be specified by the user.
It should contain only numeric values greater than zero. This field is dynamic and can
display length, precision or nothing based on the data type chosen by the user. The table
below shows the data type chosen and the corresponding display that appears on the
webpage.
Data type selected    Display
Varchar               Length input box
Char                  Length input box
Integer               Length input box
Date                  No display
Timestamp             No display
Decimal               Precision and Scale input boxes
Table 10 Type of input box based on data type
Below is a snippet of HTML and JavaScript which dynamically changes the length field
based on the data type selected.
<script type="text/javascript">
$(document).ready(function(){
    $("#bloc1").change (function() {
        if ($("#bloc1").val() == 'varchar' || $("#bloc1").val() == 'char' ||
            $("#bloc1").val() == 'integer' ) {
            $(".col1").show();
            $(".col1_d").hide();
            $(".col1_t").hide();
        } else if ($("#bloc1").val() == 'date' || $("#bloc1").val() == 'timestamp') {
            $(".col1").hide();
            $(".col1_d").hide();
            $(".col1_t").show();
        } else {
            $(".col1").hide();
            $(".col1_d").show();
            $(".col1_t").hide();
        }
    });
    $("#bloc1").change();
}
...
</script>
..
<tr>
    <td>Column1: Name <input type="text" name="col1_name">
    <select id="bloc1" name="col1_select">
        <option SELECTED value="0"></option>
        <option value="varchar">varchar</option>
        <option value="char">char</option>
        <option value="integer">integer</option>
        <option value="date">date</option>
        <option value="timestamp">timestamp</option>
        <option value="decimal">decimal</option>
    </select>
    <td class="col1">
        Length <input type="text" name="col1_length" >
    </td>
    <td class="col1_t"></td>
    <td class="col1_d">
        Precision <input type="text" name="col1_precision" >
        Scale <input type="text" name="col1_scale" >
    </td>
    </td>
</tr>
Once metadata is defined, temporary MySQL tables are created in web_etl database to
hold the source file. Figure 17 shows the source flow.
Figure 17 Flow for landing data using text source
Internally, in the PHP source code, a custom CREATE TABLE SQL statement is prepared
from the manually defined metadata and is run against the web_etl database. This creates
a table matching the structure of the source data. The source data is then loaded into the
temporary table using the load scripts.
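For illustration only: for a hypothetical three-field file defined through the metadata page as id (integer), name (varchar 50) and price (decimal 7,2), the generated statements would take roughly the following form. The temporary table name and file path are placeholders, not the tool's actual names.

CREATE TABLE web_etl.tmp_source1 (
    id    INT,
    name  VARCHAR(50),
    price DECIMAL(7,2)
);

LOAD DATA LOCAL INFILE '/landing/source1.txt'
INTO TABLE web_etl.tmp_source1
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';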
5.2.2 MySQL table as source
If the user is a guest, then he/she has limited options and can only choose from
the list of sample tables displayed on the webpage. A registered user can specify details
as described in the following sections.
If the user chooses a MySQL table as one of the sources, then the following details
must be specified:
1. Name of the remote server. The server should be within the college network and
accessible.
2. Name of the database in the remote server.
3. Name of the table.
4. Username
5. Password
The script source_mysql.php accepts the above input and validates the following before
proceeding further-
1. Checks the connection to the remote server
2. Validates the username and password
3. Checks if the database exists
4. Checks if the table exists.
If the above checks pass, the source table structure and data are captured by
createTempTableMySql.ksh. The user doesn't have to manually define the metadata.
Below is how the MySQL table source flow works:
Figure 18 Flow of landing data using MySQL table as source
After the MySQL source table details are entered, the PHP and Korn shell scripts extract
the metadata automatically from the source table by connecting to that database. They
also export the source data into temporary files in the landing zone. Then they create a
temporary table in the web_etl database and load the landing zone data into it.
Below is a screenshot of the web page which allows users to enter credentials for the
source MySQL table.
Figure 19 Screenshot of database details webpage
5.3 Transformation phase
Users must select transformations, based on their business requirements, after they have
selected the sources. There are several transformations available; they are described below
in detail. The flow is shown in Figure 20.
Figure 20 Flow of transformation phase
The transformations offered depend on the number of sources selected during the extract
phase. If multiple sources are selected, then merge with duplicates and merge without
duplicates are the two options available. If a single source is selected, then the
transformations are based on the data types of the fields in the input dataset. After the
data is transformed, it is written into the staging area, ready to be loaded into target
MySQL tables.
5.3.1 Transformation for a single source
When users choose a single source, they have several transformations available to them
based on the data type of each field. The data type of each field is fetched from the
database dictionary using the query:
SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = '$tablename';
The table below shows the structure of the COLUMNS table present in the
INFORMATION_SCHEMA database [9]. Of all the fields present in this table, only the
COLUMN_NAME and DATA_TYPE fields are required to display the available
transformations to the user.
Field                        Type                   Null
TABLE_CATALOG                varchar(512)           YES
TABLE_SCHEMA                 varchar(64)            NO
TABLE_NAME                   varchar(64)            NO
COLUMN_NAME                  varchar(64)            NO
ORDINAL_POSITION             bigint(21) unsigned    NO
COLUMN_DEFAULT               longtext               YES
IS_NULLABLE                  varchar(3)             NO
DATA_TYPE                    varchar(64)            NO
CHARACTER_MAXIMUM_LENGTH     bigint(21) unsigned    YES
CHARACTER_OCTET_LENGTH       bigint(21) unsigned    YES
NUMERIC_PRECISION            bigint(21) unsigned    YES
NUMERIC_SCALE                bigint(21) unsigned    YES
CHARACTER_SET_NAME           varchar(32)            YES
COLLATION_NAME               varchar(32)            YES
COLUMN_TYPE                  longtext               NO
COLUMN_KEY                   varchar(3)             NO
EXTRA                        varchar(27)            NO
PRIVILEGES                   varchar(80)            NO
COLUMN_COMMENT               varchar(255)           NO
Table 11 Structure of INFORMATION_SCHEMA.COLUMNS table
The different data types and the transformations available are described below.
Integer Data type
No transformations available for this data type
Varchar and Char Data type
Convert to lower case: This option converts the input string or characters to
lower case. Example: Input “ABC”; Output “abc”
Convert to upper case: This option converts the input string or characters to
upper case. Example: Input “Abc”; Output “ABC”
Remove leading spaces: This option removes leading spaces from the input
string or characters, if any. Example: Input “ abc ”; Output “abc ”
Remove trailing spaces: This option removes trailing spaces from the input
string or characters, if any. Example: Input “ abc ”; Output “ abc”
Remove leading and trailing spaces: This option removes leading and trailing
spaces from the input string or characters, if any. Example: Input “ abc ”;
Output “abc”
Decimal type
Round: This option rounds the input decimal to the nearest value (zero decimal
places by default). Example: Input 2.20; Output 2.00
Ceiling: This option returns the smallest integer value not less than the input.
Example: Input 2.20; Output 3
Floor: This option returns the largest integer value not greater than the input.
Example: Input 2.20; Output 2
Absolute value: This option converts the input decimal to its absolute value.
Example: Input -2.20; Output 2.20
Date and Timestamp data type
Get date: This option extracts the date part from date or timestamp input.
Example: Input “2010-01-04 14:09:02”; Output “2010-01-04”
Get day: This option extracts the day of the month from the date or timestamp
input. Example: Input "2010-01-04 14:09:02"; Output 4
Get day of the week: This option returns the weekday index from the date or
timestamp input. Returns 1 for Sunday, 2 for Monday…. 7 for Saturday. Example:
Input “2010-10-18”; Output 2
Get month: This option returns the month from the date or timestamp input.
Example: Input “2010-10-18”; Output 10
Get name of the month: This option returns the month name from the date or
timestamp input. Example: Input “2010-10-18”; Output “October”
Get quarter: This option returns the quarter from the date or timestamp input.
The returned value is between 1 and 4. Example: Input “2010-02-18”; Output 1
Get year: This option returns the year from the date or timestamp input.
Example: Input “2010-02-18”; Output 2010
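The option values used by the tool match the names of MySQL's built-in date and time functions, so the expected behaviour can be verified directly from the mysql prompt; the query below is only an illustration of those functions.

SELECT DATE('2010-01-04 14:09:02'),   -- 2010-01-04
       DAY('2010-01-04 14:09:02'),    -- 4
       DAYOFWEEK('2010-10-18'),       -- 2 (Monday)
       MONTH('2010-10-18'),           -- 10
       MONTHNAME('2010-10-18'),       -- October
       QUARTER('2010-02-18'),         -- 1
       YEAR('2010-02-18');            -- 2010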
Below is a screenshot of the web page showing the transformation page when a single
source is selected.
Figure 21 Screenshot showing various transformations
Below is the code from one_source.php which dynamically displays the transformations
available to the user based on the data type of the input fields.
echo "<table border='1'><tr>
<th>Column name</th><th>Datatype</th><th>Transformation</th>
</tr>";
66
$i=1;$j=0;while($row = mysql_fetch_array($result)){
echo "<tr>";echo "<td>" . $row[0] . "</td>";echo "<td>" . $row[1] . "</td>"; echo "<input type=hidden name=ROWS value=" . $num . " />";if ($row[1] == 'decimal'){
$j=$i+1;echo "<td> <select name=". $j .">"; echo"<option value=' '> </option>"; echo"<option value=round>Round</option>"; echo"<option value=ceil>Ceiling</option>"; echo"<option value=floor>Floor</option>"; echo"<option value=abs>Absolute Value</option>"; echo" </select> </td>";echo "<input type=hidden name=".$i." value=" . $row[0] . " />";
}else if ($row[1] == 'varchar' || $row[1] == 'char'){
$j=$i+1;echo "<td> <select name=".$j."> <option value=' '> </option>";echo "<option value=lower>Convert to lower case</option>";echo "<option value=upper>Convert to upper case</option>";echo "<option value=ltrim>Remove leading spaces</option>";echo "<option value=rtrim>Remove trailing spaces</option>";echo "<option value=trim>Remove leading & trailing
spaces</option>"; echo " </select> </td>";echo "<input type=hidden name=".$i." value=" . $row[0] . " />";
}
else if ($row[1] == 'timestamp' || $row[1] == 'date'){
$j=$i+1;echo "<td> <select name=".$j."> <option value=' '> </option>";echo "<option value=date>Get date</option>";echo "<option value=day>Get day</option>";echo "<option value=dayofweek>Get day of the week</option>";echo "<option value=month>Get month</option>";echo "<option value=monthname>Get name of the
month</option>";
67
echo "<option value=quarter>Get quarter</option>"; echo "<option value=year>Get year</option>"; echo " </select> </td>";echo "<input type=hidden name=".$i." value=" . $row[0] . " />";
}else{
$j=$i+1;echo "<input type=hidden name=".$i." value=" . $row[0] . " />";
}echo "</tr>"; $i=$i+2;
}echo "</table>";
5.3.2 Transformation for multiple sources
When users choose two sources, they have two options for transforming the input
data. One option is to merge the two with duplicates and the other is to merge them
without duplicates. However, users should note that when they choose multiple sources to
merge, the number of fields in both sources must be the same and the data types of
corresponding fields must also match.
Once they have selected the transformation, PHP and Korn shell scripts check whether
the two datasets are compatible with each other by comparing the number of columns and
the data types of corresponding columns. If they match, the merged data is temporarily
landed in the staging zone, which is a temporary MySQL table. If they don't match, an
error is displayed to the user.
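Conceptually, merging with duplicates behaves like a UNION ALL and merging without duplicates like a UNION. The sketch below uses illustrative temporary table names and is not the tool's exact code.

-- Keep duplicates (merge with duplicates); replace UNION ALL with UNION to drop them.
INSERT INTO web_etl.stage_merged
SELECT * FROM web_etl.tmp_source1
UNION ALL
SELECT * FROM web_etl.tmp_source2;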
5.4 Loading phase
After the user chooses the transformations to be applied to the source data, he/she has to
choose a MySQL table into which to load the data. The transformed data is loaded
directly from the staging zone to the target MySQL table as shown in Figure 22.
Figure 22 Flow of loading phase
If the target table is on the same server and database as the staging zone table, the data is
copied directly over to the target table. If the target table is located in a different database
than the staging zone table, data from the staging zone is exported to temporary files and
then loaded into the target table using the export scripts.
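In the same-server case this copy amounts to a single insert-select; the table names below are placeholders for the staging and target tables.

INSERT INTO target_db.target_table
SELECT * FROM web_etl.stage_merged;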
The target MySQL table should already be defined and exist in the target database.
The user has to specify the following details:
Name of the remote server. The server should be within the college network and
accessible.
Name of the database in the remote server.
Name of the table.
Username
Password
The script transform_source.php accepts the above input and validates the following
before proceeding further-
1. Checks the connection to the remote server
2. Validates the username and password
3. Checks if the database exists
4. Checks if the table exists.
5. Checks if the user has insert permissions.
If the above checks are satisfied, then transform_source.php fetches the data from the
staging zone and inserts the transformed data into the target table.
Below is a screenshot of the web page showing the transformation and load page when
multiple sources are selected.
Figure 23 Screenshot showing transformations for multiple sources
Below is the code snippet from transform_source.php which generates the custom SQL to
insert the data into the target table.
$ROWS = $_POST['ROWS'];
$ROWS = $ROWS * 2;
$i = 1;
while ( $i <= $ROWS )
{
    $a = $_POST[$i];
    $j = $i + 1;
    $b = $_POST[$j];

    if ($i == ($ROWS - 1))
    {
        $c = $c . $b . "(" . $a . ")";
    }
    else
    {
        $c = $c . $b . "(" . $a . "),";
    }
    $i = $i + 2;
}

$sql = "insert into $table select $c from table";
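As an illustration only: if the user picked "Convert to upper case" for a first field named store_city and "Round" for a second field named price, the generated statement would take a form similar to the following; the actual table names are filled in by the tool at run time.

INSERT INTO target_table
SELECT UPPER(store_city), ROUND(price) FROM stage_table;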
This chapter discussed the implementation of the ETL web tool. It covered the
options available in each phase in detail, along with screenshots, descriptions of how to
use them, an example for each option and important source code snippets to help the user
understand the internal working of the tool. Chapter 6 is the final chapter, which gives an
overall summary along with possible future enhancements to the courseware and tool.
Chapter 6
CONCLUSION
In this project the ETL courseware and the ETL tool implementation were discussed.
The ETL courseware covers important aspects, from initial requirements gathering to
final error handling, which ETL developers need to know in order to implement ETL
projects successfully. The ETL tool implemented in this project can extract from multiple
heterogeneous sources, combine them together, apply various transformations and load
the transformed data into target database tables. The ETL tool is implemented using PHP
4.3, Korn shell scripts and MySQL 5.1.4.
In conclusion, this project has accomplished its primary goals and objectives as
discussed in the scope section 2.2 of Chapter 2. The main objective of the ETL courseware
is to first introduce basic concepts to familiarize the interested audience with ETL and
then discuss advanced topics like cleansing, transformations, and dimension and fact table
loads. The main objective of the ETL tool is to provide free access to an interested
audience to learn what ETL is and how it works. The tool provides an interactive,
user-friendly graphical user interface. It can extract from heterogeneous sources, land
them in the landing zone, apply various types of transformations, stage them in the
staging zone and finally load the transformed data into database tables. The heterogeneous
sources can be flat files or MySQL tables. There are several transformations available to
apply to the landed source data. The source and target databases must be MySQL
databases and must be connected via LAN within the California State University,
Sacramento campus.
This project has helped me understand the basics of an ETL tool's internal working.
It also helped me learn new languages like PHP and Korn shell scripting, and I am
thankful that I got an opportunity to work on a MySQL database.
6.1 Future enhancements
There a few limitations in this project which can be worked upon in the future to
enhance the ETL tool and courseware. The first limitation is limited number of source
and target connectors. Currently the tool has flat file and MySQL table connectors.
Oracle database, MS SQL server database and XML file connectors could be added. The
second enhancement would be to add more transformations like SCD, change-capture
and change apply stage to the ETL tool.
BIBLIOGRAPHY
[1] W.H Inmon, "Building the Data Warehouse" Fourth Edition
[2] Jack E. Olson, "Data Quality: The Accuracy Dimension"
[3] Ralph Kimball, Margy Ross, "The Data Warehouse Toolkit: The Complete Guide to
Dimensional Modeling" Second Edition
[4] Claudia Imhoff, Nicholas Galemmo, Jonathan G. Geiger, "Mastering Data Warehouse
Design: Relational and Dimensional Techniques"
[5] Ralph Kimball, Joe Caserta, "The Data Warehouse ETL Toolkit: Practical Techniques
for Extracting, Cleaning, Conforming, and Delivering Data"
[6] Larissa T. Moss, Shaku Atre, "Business Intelligence Roadmap: The Complete Project
Lifecycle for Decision-Support Applications"
[7] Ralph Kimball, Margy Ross, "The Kimball Group Reader: Relentlessly Practical
Tools for Data Warehousing and Business Intelligence"
[8] Wikipedia, General Information about Data Warehouse, [Online].
Available: http://en.wikipedia.org/wiki/Data_warehouse
[9] MySQL, The INFORMATION_SCHEMA COLUMNS Table, [Online]
Available: http://dev.mysql.com/doc/refman/5.0/en/columns-table.html
[10] MySQL, Overview of MySQL, [Online]
Available: http://dev.mysql.com/doc/refman/5.1/en/what-is-mysql.html
[11] MySQL, What Is New in MySQL 5.1, [Online]
Available: http://dev.mysql.com/doc/refman/5.1/en/mysql-nutshell.html
[12] PHP, PHP features, [Online]
Available: http://php.net/manual/en/features.php
[13] Wikipedia, Korn Shell, [Online]
Available: http://en.wikipedia.org/wiki/Korn_shell
[14] Wikipedia, Comparison of Computer shells, [Online]
Available: http://en.wikipedia.org/wiki/Comparison_of_computer_shells