Data Warehouse and Business Intelligence - Recipe 1

Recipes of Data Warehouse and Business Intelligence

Load a Data Source File (with header, footer and fixed length columns) into a Staging Area table with a click

The Micro ETL Foundation

• The Micro ETL Foundation is a set of ideas and solutions for Data Warehouse and Business Intelligence projects in an Oracle environment.

• It doesn’t use expensive ETL tools, only your intelligence and your ability to think, configure, build and load data using the features and the programming language of your RDBMS.

• This recipe is an easy example. You can reproduce it by copying the content of the following slides into your editor and SQL interface utility.

The source data file

• Get the data file to load. In this recipe we use a data file with these features:
– Four initial header rows. The reference day of the data is in the first row, in the «dd/mm/yyyy» format.
– One tail row with the number of records of the data file.
– Columns of fixed size (we will configure them later).
• The next figure shows the content of the data file, which we call employees4.txt

BANKIN1431/12/20130000
BEGINHEADER
BANKIN1400 EMPLOYEES
ENDHEADER
100Steven King SKING 5.151.234.567 17/06/2003AD_PRES 24000 90
101Neena Kochhar NKOCHHAR 5.151.234.568 21/09/2005AD_VP 17000 100 90
102Lex De Haan LDEHAAN 5.151.234.569 13/01/2001AD_VP 17000 100 90
145John Russell JRUSSEL 011.44.1344.429268 01/10/2004SA_MAN 14000 0.04100 80
146Karen Partners KPARTNER 011.44.1344.467268 05/01/2005SA_MAN 13500 0.03100 80
147Alberto Errazuriz AERRAZUR 011.44.1344.429278 10/03/2005SA_MAN 12000 0.03100 80
148Gerald Cambrault GCAMBRAU 011.44.1344.619268 15/10/2007SA_MAN 11000 0.03100 80
149Eleni Zlotkey EZLOTKEY 011.44.1344.429018 29/01/2008SA_MAN 10500 0.02100 80
150Peter Tucker PTUCKER 011.44.1344.129268 30/01/2005SA_REP 10000 0.03145 80
BANKIN1431/12/201300000000009
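To see where the reference day and the record count sit in the file, the two values can be extracted from literal copies of the first header row and of the footer row. This is only a sketch: the positions used (column 9, length 10 for the day; column 19, length 13 for the count) are the ones configured later in this recipe, and the column aliases are illustrative.

```sql
-- Reference day: 10 characters starting at column 9 of the first header row
SELECT SUBSTR('BANKIN1431/12/20130000', 9, 10) AS REFERENCE_DAY
FROM DUAL;
-- returns 31/12/2013

-- Record count: up to 13 characters starting at column 19 of the footer row
SELECT TO_NUMBER(SUBSTR('BANKIN1431/12/201300000000009', 19, 13)) AS DECLARED_ROWS
FROM DUAL;
-- returns 9
```

The declared count of 9 agrees with the nine employee rows between the header and the footer.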

The definition file

• Build the definition file from your documentation.
• It has to be a «csv» file because it must be read by an external table.
• For this example we define the minimum set of information.
• COLUMN_COD will be the name of the column in the DWH.
• FXV_TXT contains the small transformations to apply.
• COLSIZE_NUM is the size of the column in the data file.
• The following is the content of the definition file, which we call employees4.csv (fields separated by «;»):

COLUMN_ID;HOST_COLUMN_COD;COLUMN_COD;TYPE_TXT;COLSIZE_NUM;FXV_TXT
1;EMPLOYEE_ID;EMPLOYEE_ID;NUMBER (6);6;to_number(EMPLOYEE_ID)
2;FIRST_NAME;FIRST_NAME;VARCHAR2(20);20;
3;LAST_NAME;LAST_NAME;VARCHAR2(25);25;
4;EMAIL;EMAIL;VARCHAR2(25);25;
5;PHONE_NUMBER;PHONE_NUMBER;VARCHAR2(20);20;replace(PHONE_NUMBER,'.','')
6;HIRE_DATE;HIRE_DATE;NUMBER;10;TO_NUMBER(to_char(to_date(HIRE_DATE,'dd/mm/yyyy'),'yyyymmdd'))
7;JOB_ID;JOB_ID;VARCHAR2(10);10;
8;SALARY;SALARY;NUMBER (8,2);9;to_number(SALARY)
9;COMMISSION_PCT;COMMISSION_PCT;NUMBER (2,2);4;to_number(COMMISSION_PCT,'99.99')
10;MANAGER_ID;MANAGER_ID;NUMBER (6);6;to_number(MANAGER_ID)
11;DEPARTMENT_ID;DEPARTMENT_ID;NUMBER (4);4;to_number(DEPARTMENT_ID)

The physical/logical environment

• Create two operating system folders: the first for the data file (c:\ios) and the second for the configuration file (c:\ios\cft).

• Create the Oracle directories needed for the external table definitions.

• Place the data file and the configuration file in their folders.

DROP DIRECTORY STA_BCK;
CREATE DIRECTORY STA_BCK AS 'c:\ios';

DROP DIRECTORY STA_LOG;
CREATE DIRECTORY STA_LOG AS 'c:\ios';

DROP DIRECTORY STA_RCV;
CREATE DIRECTORY STA_RCV AS 'c:\ios';

DROP DIRECTORY STA_CFT;
CREATE DIRECTORY STA_CFT AS 'c:\ios\cft';

DROP DIRECTORY STA_CFT_LOG;
CREATE DIRECTORY STA_CFT_LOG AS 'c:\ios\cft';
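Before going on, it can help to confirm that the directory objects exist and point to the expected folders. A minimal sketch, assuming the directories were created as above; the grantee dwh_user in the commented statement is a hypothetical schema name, needed only if the external tables are created by a different user.

```sql
-- List the directory objects created for this recipe
SELECT DIRECTORY_NAME, DIRECTORY_PATH
FROM ALL_DIRECTORIES
WHERE DIRECTORY_NAME IN ('STA_BCK', 'STA_LOG', 'STA_RCV', 'STA_CFT', 'STA_CFT_LOG')
ORDER BY DIRECTORY_NAME;

-- If another schema (hypothetical name dwh_user) owns the external tables,
-- it needs access to each directory, for example:
-- GRANT READ, WRITE ON DIRECTORY STA_RCV TO dwh_user;
```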

The source configuration table

• Create the configuration table for the data source shown on slide 3 («The source data file»).

• It contains the unique identifier of the data source (IO_COD).

• It contains the folder references (*_DIR).

• It contains the information about the format of different types of data source.

• Only some fields will be configured.

DROP TABLE STA_IO_CFT;
CREATE TABLE STA_IO_CFT (
  IO_COD VARCHAR2(12),
  RCV_DIR VARCHAR2(30),
  BCK_DIR VARCHAR2(30),
  LOG_DIR VARCHAR2(30),
  HEAD_CNT NUMBER,
  FOO_CNT NUMBER,
  SEP_TXT VARCHAR2(1),
  IDR_NUM NUMBER,
  IDC_NUM NUMBER,
  IDS_NUM NUMBER,
  IDF_TXT VARCHAR2(30),
  EDC_NUM NUMBER,
  EDS_NUM NUMBER,
  EDF_TXT VARCHAR2(30),
  RCR_NUM NUMBER,
  RCC_NUM NUMBER,
  RCS_NUM NUMBER,
  RCF_LIKE_TXT VARCHAR2(30),
  FILE_LIKE_TXT VARCHAR2(60)
);

The load of configuration table

• Load the previous table according to the features described on slide 3 («The source data file»):
– The folder references (rcv_dir, bck_dir, log_dir).
– The name of the source file (file_like_txt).
– The number of header rows (head_cnt) and footer rows (foo_cnt).
– The separator character (sep_txt). It is NULL because the source is not a csv file.
– The position, in the header, of the reference day of the source, and its format: row idr_num, column idc_num, size ids_num, format idf_txt.
– The offset from the tail, the position and the size, in the footer section, of the number of rows of the source (rcr_num, rcc_num, rcs_num).

DELETE STA_IO_CFT WHERE IO_COD = 'employees4';

INSERT INTO STA_IO_CFT (
  IO_COD, RCV_DIR, BCK_DIR, LOG_DIR, FILE_LIKE_TXT,
  HEAD_CNT, FOO_CNT, SEP_TXT,
  IDR_NUM, IDC_NUM, IDS_NUM, IDF_TXT,
  RCR_NUM, RCC_NUM, RCS_NUM)
VALUES (
  'employees4', 'STA_RCV', 'STA_BCK', 'STA_LOG', 'employees4.txt',
  4, 1, NULL,
  1, 9, 10, 'DD/MM/YYYY',
  0, 19, 13);

The configuration table of the definition file

• Create the configuration table for the data structure shown on slide 4 («The definition file»).

• It is a metadata table.

• You can add other information, such as the column description.

DROP TABLE STA_EMPLOYEES4_CXT;
CREATE TABLE STA_EMPLOYEES4_CXT (
  COLUMN_ID VARCHAR2(4),
  HOST_COLUMN_COD VARCHAR2(30),
  COLUMN_COD VARCHAR2(30),
  TYPE_TXT VARCHAR2(30),
  COLSIZE_NUM VARCHAR2(4),
  FXV_TXT VARCHAR2(200)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY STA_CFT
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    BADFILE STA_CFT:'EMPLOYEES4.BAD'
    DISCARDFILE STA_CFT:'EMPLOYEES4.DSC'
    LOGFILE STA_CFT:'EMPLOYEES4.LOG'
    SKIP 1
    FIELDS TERMINATED BY ';' LRTRIM
    MISSING FIELD VALUES ARE NULL
    REJECT ROWS WITH ALL NULL FIELDS (
      COLUMN_ID
     ,HOST_COLUMN_COD
     ,COLUMN_COD
     ,TYPE_TXT
     ,COLSIZE_NUM
     ,FXV_TXT
    )
  )
  LOCATION (STA_CFT:'EMPLOYEES4.CSV')
)
REJECT LIMIT UNLIMITED
NOPARALLEL
NOMONITORING;

The structure configuration view

• Create the structure configuration view based on the previous configuration table.
• In addition to the columns of that table, it calculates only the start and end positions (FROM_NUM, TO_NUM) of the fixed-size columns of the data file, using an analytic function.

CREATE OR REPLACE VIEW STA_EMPLOYEES4_CXV AS
SELECT
  COLUMN_ID
 ,HOST_COLUMN_COD
 ,COLUMN_COD
 ,TYPE_TXT
 ,COLSIZE_NUM
 ,FXV_TXT
 ,(SUM (COLSIZE_NUM) OVER (ORDER BY TO_NUMBER (COLUMN_ID))) - COLSIZE_NUM + 1 AS FROM_NUM
 ,SUM (COLSIZE_NUM) OVER (ORDER BY TO_NUMBER (COLUMN_ID)) AS TO_NUM
FROM STA_EMPLOYEES4_CXT
ORDER BY TO_NUMBER (COLUMN_ID);
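The running sum gives each column its end position, and subtracting the column size and adding 1 gives its start position. As a quick check, assuming the definition file holds the eleven columns and sizes listed earlier, the view can be queried as follows; the expected values in the comment are worked out by hand from those sizes.

```sql
-- Show the start/end position calculated for every fixed-size column
SELECT COLUMN_COD, COLSIZE_NUM, FROM_NUM, TO_NUM
FROM STA_EMPLOYEES4_CXV
ORDER BY TO_NUMBER(COLUMN_ID);

-- Expected for the sizes 6,20,25,25,20,10,10,9,4,6,4:
--   EMPLOYEE_ID       6     1     6
--   FIRST_NAME       20     7    26
--   LAST_NAME        25    27    51
--   EMAIL            25    52    76
--   PHONE_NUMBER     20    77    96
--   HIRE_DATE        10    97   106
--   JOB_ID           10   107   116
--   SALARY            9   117   125
--   COMMISSION_PCT    4   126   129
--   MANAGER_ID        6   130   135
--   DEPARTMENT_ID     4   136   139
```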

The source external table

• Create the external table linked to the source data file.

• The names and types of the columns have to be the same as in the configuration view.

• ROW_CNT is filled by RECNUM, a useful feature of Oracle external tables that numbers every row.

• ROW_TXT is the entire row, without restriction. It will be used in the following view.

DROP TABLE STA_EMPLOYEES4_FXT;
CREATE TABLE STA_EMPLOYEES4_FXT (
  EMPLOYEE_ID VARCHAR2(11)
 ,FIRST_NAME VARCHAR2(20)
 ,LAST_NAME VARCHAR2(25)
 ,EMAIL VARCHAR2(25)
 ,PHONE_NUMBER VARCHAR2(20)
 ,HIRE_DATE VARCHAR2(10)
 ,JOB_ID VARCHAR2(10)
 ,SALARY VARCHAR2(9)
 ,COMMISSION_PCT VARCHAR2(14)
 ,MANAGER_ID VARCHAR2(10)
 ,DEPARTMENT_ID VARCHAR2(13)
 ,ROW_CNT NUMBER
 ,ROW_TXT VARCHAR2(4000)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY STA_BCK
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    BADFILE STA_LOG:'EMPLOYEES2.BAD'
    DISCARDFILE STA_LOG:'EMPLOYEES2.DSC'
    LOGFILE STA_LOG:'EMPLOYEES2.LOG'
    FIELDS TERMINATED BY '' LRTRIM
    MISSING FIELD VALUES ARE NULL
    REJECT ROWS WITH ALL NULL FIELDS (
      EMPLOYEE_ID POSITION(1:6)
     ,FIRST_NAME POSITION(7:26)
     ,LAST_NAME POSITION(27:51)
     ,EMAIL POSITION(52:76)
     ,PHONE_NUMBER POSITION(77:96)
     ,HIRE_DATE POSITION(97:106)
     ,JOB_ID POSITION(107:116)
     ,SALARY POSITION(117:125)
     ,COMMISSION_PCT POSITION(126:129)
     ,MANAGER_ID POSITION(130:135)
     ,DEPARTMENT_ID POSITION(136:139)
     ,ROW_CNT RECNUM
     ,ROW_TXT POSITION(1:139)
    )
  )
  LOCATION (STA_BCK:'employees4.txt')
)
REJECT LIMIT UNLIMITED
NOPARALLEL
NOMONITORING;
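The POSITION entries above do not have to be typed by hand: they follow directly from FROM_NUM and TO_NUM in the structure configuration view. A minimal sketch of a query that produces them as text, ready to paste into the field list; the alias FIELD_TXT is illustrative and not part of the recipe.

```sql
-- Generate one "column POSITION(from:to)" line per column of the definition file
SELECT CASE WHEN TO_NUMBER(COLUMN_ID) = 1 THEN ' ' ELSE ',' END
       || COLUMN_COD
       || ' POSITION(' || FROM_NUM || ':' || TO_NUM || ')' AS FIELD_TXT
FROM STA_EMPLOYEES4_CXV
ORDER BY TO_NUMBER(COLUMN_ID);
```

The ROW_CNT and ROW_TXT entries are not described in the definition file, so they still have to be added manually after the generated lines.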

The source external view (1)

• The goal of the view is to prepare the data to load into the staging table.
• It uses the SQL «with» clause to build the information needed. In detail, the individual sub-query blocks are:
– T1 = gets the name of the source data file from the Oracle data dictionary.
– T2 = gets the reference day of the data, using the information in the source configuration table.
– T3 = gets the number of rows declared in the file footer. The final -0 means that there is no offset from the tail of the file.
– T4 = gets the total number of rows, using the row counter of the external table.
– T5 = gets the number of header rows and footer rows.
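The blocks T4 and T5 together define which rows are data rows: those after the header and before the footer. As a sketch, the same numbers can be looked at on their own before they are used in the view; the aliases TOTAL_ROWS, HEADER_ROWS, FOOTER_ROWS and DATA_ROWS are illustrative.

```sql
-- Data rows = total rows - header rows - footer rows
SELECT R         AS TOTAL_ROWS,
       X         AS HEADER_ROWS,
       Y         AS FOOTER_ROWS,
       R - X - Y AS DATA_ROWS
FROM (SELECT MAX(ROW_CNT) R FROM STA_EMPLOYEES4_FXT),
     (SELECT HEAD_CNT X, FOO_CNT Y FROM STA_IO_CFT WHERE IO_COD = 'employees4');
```

For the sample file, with 4 header rows, 9 employee rows and 1 footer row, DATA_ROWS would be expected to equal the 9 declared in the footer.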

The source external view (2)

• The complete SQL statement is:

CREATE OR REPLACE FORCE VIEW STA_EMPLOYEES4_FXV AS
WITH T1 AS (
  SELECT SUBSTR(LOCATION,1,80) SOURCE_COD
  FROM USER_EXTERNAL_LOCATIONS
  WHERE TABLE_NAME = 'STA_EMPLOYEES4_FXT'),
T2 AS (
  SELECT TO_NUMBER(TO_CHAR(TO_DATE(SUBSTR(ROW_TXT,9,10),'dd/mm/yyyy'),'yyyymmdd')) DAY_KEY
  FROM STA_EMPLOYEES4_FXT
  WHERE ROW_CNT = 1),
T3 AS (
  SELECT TO_NUMBER(SUBSTR(ROW_TXT,19,13)) ROWS_NUM
  FROM STA_EMPLOYEES4_FXT
  WHERE ROW_CNT = (SELECT MAX(ROW_CNT) FROM STA_EMPLOYEES4_FXT)-0),
T4 AS (
  SELECT MAX(ROW_CNT) R
  FROM STA_EMPLOYEES4_FXT),
T5 AS (
  SELECT HEAD_CNT X, FOO_CNT Y
  FROM STA_IO_CFT
  WHERE IO_COD = 'employees4')
SELECT
  TO_NUMBER(EMPLOYEE_ID) EMPLOYEE_ID
 ,FIRST_NAME FIRST_NAME
 ,LAST_NAME LAST_NAME
 ,EMAIL EMAIL
 ,REPLACE(PHONE_NUMBER,'.','') PHONE_NUMBER
 ,TO_NUMBER(TO_CHAR(TO_DATE(HIRE_DATE,'dd/mm/yyyy'),'yyyymmdd')) HIRE_DATE
 ,JOB_ID JOB_ID
 ,TO_NUMBER(SALARY) SALARY
 ,TO_NUMBER(COMMISSION_PCT,'99.99') COMMISSION_PCT
 ,TO_NUMBER(MANAGER_ID) MANAGER_ID
 ,TO_NUMBER(DEPARTMENT_ID) DEPARTMENT_ID
 ,SOURCE_COD
 ,DAY_KEY
 ,ROWS_NUM
FROM STA_EMPLOYEES4_FXT,T1,T2,T3,T4,T5
WHERE ROW_CNT > X AND ROW_CNT <= R-Y;
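The column expressions of this view match the FXV_TXT transformations of the definition file, falling back to the plain column name where no transformation is given. A minimal sketch of a query that produces that select list as text from the structure configuration view; the alias SELECT_ITEM_TXT is illustrative.

```sql
-- Generate one "expression alias" line per column of the definition file.
-- Where FXV_TXT is empty (NULL), the column is selected as it is.
SELECT CASE WHEN TO_NUMBER(COLUMN_ID) = 1 THEN ' ' ELSE ',' END
       || NVL(FXV_TXT, COLUMN_COD)
       || ' ' || COLUMN_COD AS SELECT_ITEM_TXT
FROM STA_EMPLOYEES4_CXV
ORDER BY TO_NUMBER(COLUMN_ID);
```

The three technical columns (SOURCE_COD, DAY_KEY, ROWS_NUM) are not in the definition file and are appended by hand.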

The Staging table

• The staging table will be loaded from the previous view.

• It has three technical fields that record the name of the source data file, the reference day, and the number of rows.

• The number of rows could be omitted (it is the same for all records), but it can be useful for statistical checks.

DROP TABLE STA_EMPLOYEES4_STT;
CREATE TABLE STA_EMPLOYEES4_STT (
  EMPLOYEE_ID NUMBER,
  FIRST_NAME VARCHAR2(20),
  LAST_NAME VARCHAR2(25),
  EMAIL VARCHAR2(25),
  PHONE_NUMBER VARCHAR2(20),
  HIRE_DATE NUMBER,
  JOB_ID VARCHAR2(10),
  SALARY NUMBER,
  COMMISSION_PCT NUMBER,
  MANAGER_ID NUMBER,
  DEPARTMENT_ID NUMBER,
  SOURCE_COD VARCHAR2(320),
  DAY_KEY VARCHAR2(8),
  ROWS_NUM NUMBER
);
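One example of such a statistical check, to run once the staging table has been loaded (the final step of this recipe), is to compare the rows actually loaded with the number declared in the file footer. This is a sketch, not part of the original recipe; the aliases LOADED_ROWS, DECLARED_ROWS and CHECK_TXT are illustrative.

```sql
-- Compare loaded rows with the declared count, per source file and reference day
SELECT SOURCE_COD,
       DAY_KEY,
       COUNT(*)      AS LOADED_ROWS,
       MIN(ROWS_NUM) AS DECLARED_ROWS,
       CASE WHEN COUNT(*) = MIN(ROWS_NUM) THEN 'OK' ELSE 'MISMATCH' END AS CHECK_TXT
FROM STA_EMPLOYEES4_STT
GROUP BY SOURCE_COD, DAY_KEY;
```

Grouping by SOURCE_COD and DAY_KEY keeps the check meaningful if the table holds more than one load, although loading the same file twice for the same day would still show up as a mismatch.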

The final load

• We are at the end of this recipe. Now we can do the final load with a simple SQL statement:

INSERT INTO STA_EMPLOYEES4_STT
SELECT * FROM STA_EMPLOYEES4_FXV;

• I underline the following features:
– Everything is done without an ETL tool
– The only physical structure created in the DWH is the final staging table
– Everything is controlled by logical structures (external tables and views)
– Everything is done without writing any code
– If you create a SQL script from this recipe, you will load the staging table with a click

Email - [email protected] (italian/english) - http://massimocenci.blogspot.it/