RETAIL-SALES DWBI IN-STAGE ETL SPECIFICATIONS
SOURCE TABLES
1.STG_CUST_ORD_ITM_DTLS
2.STG_CUST_ORD_DTLS
3.STG_PROD_DTLS
Overview:
The first source table contains detailed transactional data, the second
contains lookup data for orders related to customers, and the third contains
the products selected by customers during the transactions in each order.
The first source table has primary key ITEM_DETAIL_CODE, a
sequence of values generated for each transaction made by individual
customers (CUSTOMER_ID) across different orders. For an individual
customer, each transaction is identified by ITEM_ID, whose sequence of
values is unique within an order and repeats across multiple orders.
ORDERNO is an invoice number generated for each visit, whether online or
offline; ORDERNO is never repeated, even for repeat customers. Each item
selected by a customer is identified by PRODUCTNO.
The second source table contains order information used as a reference
to look up incoming source data, with the lookup condition on ORDERNO.
The third source table contains information about the products selected
by the customer in each transaction, referenced by key PRODUCTNO.
Data flow diagram for RET-SALES-ETL-IN-STAGE
Construct the staging layer for target table loading with the following steps
in the stage working database.
Step 1: Apply the filter condition ORDER_STATUS='Y' at the source side and
select the required columns from the source: ITEM_ID, ORDERNO,
CUSTOMER_ID and PRODUCTNO, along with key performance indicators
such as QTY, ITEM_PRICE, SALES_PRICE, ITEM_COST and VAT (if applicable).
Step 2: Derive the business measures for each transaction as follows:
ITEM_AMOUNT = QTY*ITEM_PRICE;
ITEM_DISC = QTY*(ITEM_PRICE-SALES_PRICE);
ITEM_FINAL_AMOUNT = QTY*SALES_PRICE*VAT;
Step 3: Look up the reference tables STG_CUST_ORD_DTLS and
STG_PROD_DTLS on ORDERNO and PRODUCTNO to pick up the latest
information for the incoming source rows.
Step 4: Select surrogate keys from all applicable dimensions with reference
to the source table and connect them to the target table, loading the detailed
data, with the derived measures, on a daily basis, using parameters and
variables to select the required source data.
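The steps above can be sketched in plain Python on in-memory rows. This is a minimal illustration, not the actual tool mapping: the column names follow the spec, while the row/dictionary shapes and the helper function itself are assumptions.

```python
# Sketch of Steps 1-3: filter on ORDER_STATUS, derive item measures,
# and enrich from the two lookup tables. Rows are plain dicts keyed by
# the spec's column names; sample shapes are illustrative only.

def stage_items(source_rows, order_lookup, product_lookup):
    staged = []
    for row in source_rows:
        # Step 1: filter condition ORDER_STATUS = 'Y' (drop cancelled orders).
        if row["ORDER_STATUS"] != "Y":
            continue
        qty = row["QTY"]
        price = row["ITEM_PRICE"]
        sales_price = row["SALES_PRICE"]
        # Step 2: business measures, exactly as the spec's formulas state.
        item = {
            "ITEM_ID": row["ITEM_ID"],
            "ORDERNO": row["ORDERNO"],
            "ITEM_AMOUNT": qty * price,
            "ITEM_DISC": qty * (price - sales_price),
            # Formula as written in the spec; real VAT handling may differ.
            "ITEM_FINAL_AMOUNT": qty * sales_price * row["VAT"],
        }
        # Step 3: lookups on ORDERNO and PRODUCTNO (dicts stand in for
        # STG_CUST_ORD_DTLS and STG_PROD_DTLS).
        item.update(order_lookup.get(row["ORDERNO"], {}))
        item.update(product_lookup.get(row["PRODUCTNO"], {}))
        staged.append(item)
    return staged
```

Step 4 (surrogate-key resolution against the dimensions) would follow the same lookup pattern, one dictionary per dimension.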
The fact table below is a transaction-grain fact table that maintains data
on a daily, incremental basis for ad-hoc and standard requirements.
TARGET TABLE 1
1. WH_ITEM_FACT
WH_CUSTOMER_ID        NUMBER (20)
WH_RETAILER_ID        NUMBER (20)
WH_TIME_ID            NUMBER (20)
WH_REGION_ID          NUMBER (20)
WH_MARKETPLC_ID       NUMBER (20)
WH_AGENT_ID           NUMBER (20)
WH_CURRENCY_ID        NUMBER (20)
WH_SALES_PERSON_ID    NUMBER (20)
WH_PRODUCT_ID         NUMBER (20)
WH_LEGAL_ENTITY_ID    NUMBER (20)
WH_SESSION_ID         NUMBER (20)
WH_ORDER_METHOD_ID    NUMBER (20)
WH_SESSION_TYPE_ID    NUMBER (20)
WH_STORE_ID           NUMBER (20)
ORDER_NO              NUMBER (20)
ORDER_DATE            DATE
ITEM_ID               NUMBER (20)
ITEM_QTY              NUMBER (20)
ITEM_PRICE            NUMBER (20,2)
ITEM_COST             NUMBER (20,2)
ITEM_SALES_PRICE      NUMBER (20,2)
ITEM_VAT              NUMBER (20,2)
ITEM_AMOUNT           NUMBER (20,2)
ITEM_DISC             NUMBER (20,2)
ITEM_FINAL_AMOUNT     NUMBER (20,2)
SHIPPED_DATE          DATE
DELIVARY_DATE         DATE
DELIVARY_STATUS       VARCHAR2 (1)
Data flow diagram for standard database (Working Database):
1. Business Logic Implementation at Item Level
The data flow diagram below shows the ETL logic for the RETAIL-
SALES Data Mart, loaded on a daily basis. The grain of the fact table is
ITEM_ID together with ORDERNO and ORDER_DATE. Use a filter to
eliminate unwanted data at the source itself; the filter condition
ORDER_STATUS='Y' eliminates orders cancelled by end-users in the OLTP systems.
After cleansing and scrubbing, arrange the source data in the required
format using the available lookup tables, joining on the common columns of
the lookup condition. Finally, control the transactional data with the required
condition: filter on delivery status equal to 'Y' coming from STG_CUST_ORD_DTLS.
Figure 1: Dataflow Diagram for Working Database at Item Level.
Dimension Modeling:
Design the dimension model with the available dimension tables and the fact table as a star schema. There are fourteen dimension tables and one fact table in the target system, listed below.
Dimension tables:
1. WH_CUSTOMER_DIM
2. WH_PRODUCT_DIM
3. WH_TIME_DIM
4. WH_ORDER_METHOD_DIM
5. WH_RETAILER_DIM
6. WH_AGENT_DIM
7. WH_MARKET_PLACE_DIM
8. WH_STORE_DIM
9. WH_SALES_PERSON_DIM
10. WH_SESSION_DIM
11. WH_SESSION_TYPE_DIM
12. WH_LEGAL_ENTITY_DIM
13. WH_SUPPLIER_DIM
14. WH_ORDER_DIM
Figure 2: Dimension modeling diagram for Sales Data Mart for Level 1 requirements.
Explanation:
For Level 1 requirements, the target table is loaded with very detailed
data, with the necessary conditions and data manipulation using the formulas
above, based on references from the dimension tables. It is semi-non-volatile
in nature; that is, it accepts modifications coming from the source systems
and is loaded daily in an incremental manner. It is mainly constructed for
standard customer-order-management requirements on a daily basis.
It provides data for each ORDERNO across all transactions (ITEM_ID)
generated by a CUSTOMER_ID, analyzable against various dimensions
such as PRODUCT_ID, ORDER_DATE, etc.
Before loading data to the target table, control the data based on location;
the records are then routed accordingly, taking references from the various
dimension tables and connecting them to the target definition.
Business Logic Implementation at Order Level:
This business logic maintains overall information for each order generated by the customer: transactional data is aggregated at order level, irrespective of the products selected by the customer, on a daily basis and loaded in an incremental manner.
The main metrics used for order-level requirements are shown below:
1. ORDER_AMOUNT = SUM(ITEM_AMOUNT)
2. ORDER_DISC = SUM(ITEM_DISC)
3. ORDER_FINAL_AMOUNT = SUM(ITEM_FINAL_AMOUNT)
4. ORDER_QTY = SUM(ITEM_QTY)
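The order-level rollup can be sketched as a simple grouped sum over the staged item rows. The metric names are the spec's; the dict-based row shape is an assumption.

```python
# Sketch: roll staged item rows up to order level, accumulating the four
# order metrics listed above per ORDERNO.
from collections import defaultdict

def aggregate_orders(item_rows):
    orders = defaultdict(lambda: {"ORDER_AMOUNT": 0.0, "ORDER_DISC": 0.0,
                                  "ORDER_FINAL_AMOUNT": 0.0, "ORDER_QTY": 0})
    for r in item_rows:
        o = orders[r["ORDERNO"]]
        o["ORDER_AMOUNT"] += r["ITEM_AMOUNT"]
        o["ORDER_DISC"] += r["ITEM_DISC"]
        o["ORDER_FINAL_AMOUNT"] += r["ITEM_FINAL_AMOUNT"]
        o["ORDER_QTY"] += r["ITEM_QTY"]
    return dict(orders)
```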
How is VAT added to ORDER_AMOUNT? Use the lookup flat file supplied by the business, which contains data in the following form:
MIN_ORDER_AMOUNT (in DOLLARS)   MAX_ORDER_AMOUNT (in DOLLARS)   VAT (in percentage)
100                             250                             0.9%
250                             500                             1.49%
500                             1000                            2.49%
1000                            2000                            4.49%
Define the lookup condition on the input order amount with multiple lookup ports as follows:
If ORDER_AMOUNT >= MIN_ORDER_AMOUNT AND
ORDER_AMOUNT <= MAX_ORDER_AMOUNT
then apply the associated VAT to ORDER_FINAL_AMOUNT using the formula:
ORDER_VAT_AMOUNT = ORDER_FINAL_AMOUNT*VAT/100
Take the output of the expression stage after cleansing and aggregate
the data with the necessary aggregate functions, grouped by ORDERNO, with
references to the associated dimension keys. Design the logic to calculate
VAT for ORDER_AMOUNT; if ORDER_AMOUNT is less than the specified
lookup values, the lookup stage returns a null value.
Working with null values: if the lookup condition is not satisfied, the stage
passes null values to the next level, and any calculation defined on a null
value yields null. Use the DECODE function to handle null values as follows:
DECODE (VAT, NULL, 0, VAT);
Define the output column as:
FINAL_ORDER_AMOUNT = ORDER_FINAL_AMOUNT +
ORDER_FINAL_AMOUNT*DECODE(VAT, NULL, 0, VAT)/100;
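The range lookup plus the DECODE-style null guard can be sketched as follows. The VAT slabs mirror the lookup flat file above; the function shapes are assumptions, not the actual mapping.

```python
# Sketch of the VAT range lookup and the DECODE(VAT, NULL, 0, VAT) guard.
# Slabs: (MIN_ORDER_AMOUNT, MAX_ORDER_AMOUNT, VAT %), from the flat file.
VAT_SLABS = [(100, 250, 0.9), (250, 500, 1.49),
             (500, 1000, 2.49), (1000, 2000, 4.49)]

def lookup_vat(order_amount):
    for lo, hi, vat in VAT_SLABS:
        if lo <= order_amount <= hi:
            return vat
    return None  # amount outside all slabs -> lookup returns null

def final_order_amount(order_amount, order_final_amount):
    vat = lookup_vat(order_amount)
    vat = 0 if vat is None else vat  # equivalent of DECODE(VAT, NULL, 0, VAT)
    return order_final_amount + order_final_amount * vat / 100
```

A usage example: an order amount of 200 dollars falls in the 100-250 slab, so 0.9% VAT is applied to the final amount.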
TARGET TABLE 2
WH_ORDER_FCT
WH_TIME_ID            NUMBER(20)
WH_CUSTOMER_ID        NUMBER(20)
WH_AGENT_ID           NUMBER(20)
WH_RETAILER_ID        NUMBER(20)
WH_STORE_ID           NUMBER(20)
WH_ORDER_METHOD_ID    NUMBER(20)
WH_REGION_ID          NUMBER(20)
WH_MARKET_PLACE_ID    NUMBER(20)
WH_CURRENCY_ID        NUMBER(20)
ORDER_DATE            DATE
ORDER_QTY             NUMBER(20)
ORDER_AMOUNT          NUMBER(20,2)
ORDER_DISC            NUMBER(20,2)
ORDER_VAT             NUMBER(20,2)
ORDER_FINAL_AMOUNT    NUMBER(20,2)
FINAL_ORDER_AMOUNT    NUMBER(20,2)
SHIPPED_DATE          DATE
DELIVERY_DATE         DATE
2. Business Logic Implementation at Product Sales Level:
The product sales fact table is used to analyze the business process on
a daily, incremental basis. The diagram below shows the RETAIL-
PRODUCTS-SALES Data Mart, loaded daily. It holds fully summarized
data for the following requirements, referenced against the applicable
dimensions:
a. Sales Revenue
b. Daily Net Profit
c. Daily Gross Profit
d. Total Quantity Sold
e. Total Orders, etc.
These metrics are calculated and loaded incrementally for daily
requirements. Arrange the data per retailer, based on location and market
place. At the end of the year the target should contain 365 days of
information. This makes it convenient for business users to make well-
informed decisions at any point in time. The grain of the fact table is
PRODUCT_ID with all applicable dimensions.
Define the filter condition DELIVERY_STATUS='Y', use the necessary
aggregate functions (three levels of aggregation are needed here), and load
the data to the target table with surrogate-key references to the required
dimensions.
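A sketch of the product-level daily rollup follows. Note that the spec lists Daily Net Profit and Gross Profit as requirements but does not give their formulas; the gross-profit calculation below is a plausible placeholder only, and the row shape is an assumption.

```python
# Sketch: per (product, day) rollup of delivered items. The gross-profit
# formula (revenue minus cost of quantity sold) is an ASSUMPTION, not
# taken from the spec.
from collections import defaultdict

def product_daily_rollup(item_rows):
    out = defaultdict(lambda: {"TOTAL_REVENUE": 0.0, "TOTAL_QTY": 0,
                               "TOTAL_ORDERS": set(),
                               "TOTAL_GROSS_PROFIT": 0.0})
    for r in item_rows:
        # Filter condition DELIVERY_STATUS = 'Y'.
        if r["DELIVERY_STATUS"] != "Y":
            continue
        agg = out[(r["WH_PRODUCT_ID"], r["ORDER_DATE"])]
        agg["TOTAL_REVENUE"] += r["ITEM_FINAL_AMOUNT"]
        agg["TOTAL_QTY"] += r["ITEM_QTY"]
        agg["TOTAL_ORDERS"].add(r["ORDER_NO"])  # distinct orders
        agg["TOTAL_GROSS_PROFIT"] += (r["ITEM_FINAL_AMOUNT"]
                                      - r["ITEM_COST"] * r["ITEM_QTY"])
    # Replace the distinct-order set with its count.
    return {k: {**v, "TOTAL_ORDERS": len(v["TOTAL_ORDERS"])}
            for k, v in out.items()}
```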
TARGET TABLE 3
3. WH_PRODUCT_SALES_FACT
WH_RETAILER_ID        NUMBER (20)
WH_TIME_ID            NUMBER (20)
WH_REGION_ID          NUMBER (20)
WH_MARKETPLC_ID       NUMBER (20)
WH_AGENT_ID           NUMBER (20)
WH_CURRENCY_ID        NUMBER (20)
WH_SALES_PERSON_ID    NUMBER (20)
WH_PRODUCT_ID         NUMBER (20)
WH_LEGAL_ENTITY_ID    NUMBER (20)
WH_ORDER_METHOD_ID    NUMBER (20)
WH_SESSION_TYPE_ID    NUMBER (20)
WH_STORE_ID           NUMBER (20)
ORDER_DATE            DATE
TOTAL_REVENUE         NUMBER (20,2)
TOTAL_QTY             NUMBER (20)
TOTAL_ORDERS          NUMBER (20)
TOTAL_NET_PROFIT      NUMBER (20,2)
TOTAL_GROSS_PROFIT    NUMBER (20,2)
TRANS_DATE            DATE
Dimension modeling for RETAIL-PRODUCT-SALES:
Design the dimension model for the retail product sales star schema with
the conformed dimensions already used in the retail item sales star schema;
together they appear as one integrated schema, also called a galaxy schema,
shown below.
Figure 3: Dimension modeling diagram for Sales Data Mart for Level 2 requirements.
Explanation:
For Level 2, use three aggregations to find product-level sales
information such as daily SALES_REVENUE, NET_PROFIT,
GROSS_PROFIT, etc., based on dimension-key references from the various
dimension tables, with the filter condition delivery status = 'Y' (not null).
From this data, business users can see which products are bought most
frequently by customers.
Summary:
As explained above, Level 1 is used for customer order management and
Level 2 for product sales analysis.
Data flow diagram for RET-SALES-ETL-POST-STAGE
In this stage, load non-volatile data from the working database to the DWH
periodically. Use periodic snapshot fact tables to load aggregated
information for analysis across different dimensions. This supports overall
Sales Analysis, Traffic Analysis and Market-Basket Analysis.
Market-Basket Analysis:
Market-basket analysis examines which combinations of products are
most frequently purchased together; that is, the same combinations of
products selected by different customers over a particular period of time.
For this requirement, design ETL logic to load aggregated data to the target
system per combination of CUSTOMER_ID and PRODUCT_ID, based on
the following requirements,
1. TOTAL_AMOUNT_SOLD
2. TOTAL_QTY_SOLD
over a given time period. This fact table maintains semi-summarized data
with the customer's last transaction date. The requirement also derives
CUSTOMER_TYPE based on the following condition:
If TOTAL_AMOUNT_SOLD >= 5000 dollars
OR TOTAL_QTY_SOLD >= 150 units
then CUSTOMER_TYPE = 'prime', otherwise 'non-prime'.
Source Table:
1. WH_ITEM_FACT
Step 1: Select the required columns from the source and define a filter
condition on the time id, e.g. WH_TIME_ID BETWEEN 1 AND 31 if
January's data is needed from the source.
Step 2: Define the necessary aggregate functions at the CUSTOMER_ID
and PRODUCT_ID level as follows, with the necessary group-by columns,
based on the last order date. For this requirement, do not enable the
group-by port for ORDER_DATE.
1. TOTAL_AMOUNT_SOLD = SUM(ITEM_FINAL_AMOUNT)
2. TOTAL_QTY_SOLD = SUM(ITEM_QTY)
Step 3: Define the condition that determines the customer type:
If TOTAL_AMOUNT_SOLD >= 5000 dollars
OR TOTAL_QTY_SOLD >= 150 units
then CUSTOMER_TYPE = 'prime', otherwise 'non-prime'.
Step 4: Rank each customer within its market place, and finally load the required data to the target table.
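Steps 3 and 4 can be sketched together: classify each aggregated customer row, then rank customers within each market place. The field names are the spec's; the ranking key (amount sold, descending) and the tie-break are assumptions.

```python
# Sketch of Steps 3-4: derive CUSTOMER_TYPE, then assign a RANK per
# market place ordered by TOTAL_AMOUNT_SOLD (descending - assumed key).
from collections import defaultdict

def classify(row):
    prime = (row["TOTAL_AMOUNT_SOLD"] >= 5000
             or row["TOTAL_QTY_SOLD"] >= 150)
    return "prime" if prime else "non-prime"

def rank_by_marketplace(rows):
    by_mp = defaultdict(list)
    for r in rows:
        r["CUSTOMER_TYPE"] = classify(r)
        by_mp[r["WH_MARKETPLACE_ID"]].append(r)
    for group in by_mp.values():
        group.sort(key=lambda r: r["TOTAL_AMOUNT_SOLD"], reverse=True)
        for i, r in enumerate(group, start=1):
            r["RANK"] = i
    return rows
```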
Target table 4
4. AGGR_CUST_SALES_FCT
WH_CUSTOMER_ID        NUMBER(20)
WH_PRODUCT_ID         NUMBER(20)
WH_AGENT_ID           NUMBER(20)
WH_RETAILER_ID        NUMBER(20)
WH_CURRENCY_ID        NUMBER(20)
WH_MARKETPLACE_ID     NUMBER(20)
WH_ORDER_METHOD_ID    NUMBER(20)
WH_REGION_ID          NUMBER(20)
LAST_ORDER_DATE       DATE
TOTAL_AMOUNT_SOLD     NUMBER(20,2)
TOTAL_QTY_SOLD        NUMBER(20,2)
CUSTOMER_TYPE         VARCHAR2(30)
TRANSDATE             DATE
RANK                  NUMBER(20)
Note: arrange the records by rank and limit the output to the top 1000 ranks at the rank transformation stage.
Dataflow diagram for Product Sales Analysis:
This dataflow designs ETL logic to analyze sales across different
categories of products in different market places, per retailer and agent
combined. For this requirement, read from the WH_PRODUCT_SALES_FCT
fact table and load a bulk amount of data to the target table periodically with
the necessary aggregate functions. The target is the periodic fact table
WH_AGGR_PRODUCT_SALES_FCT in the target system. Take references
from the different dimension tables and aggregate the incoming data as
shown below:
1. TOTAL_SALES_REVENUE = SUM(SALES_REVENUE)
2. TOTAL_NET_PROFIT = SUM(NET_PROFIT)
3. TOTAL_GROSS_PROFIT = SUM(GROSS_PROFIT)
4. TOTAL_QTY_SOLD = SUM(TOTAL_QTY)
Target Table 5
WH_AGGR_PRODUCT_SALES_FCT
WH_PRODUCT_ID         NUMBER(20)
WH_SALES_PERSON_ID    NUMBER(20)
WH_RETAILER_ID        NUMBER(20)
WH_AGENT_ID           NUMBER(20)
WH_MARKETPLACE_ID     NUMBER(20)
WH_CURRENCY_ID        NUMBER(20)
WH_REGION_ID          NUMBER(20)
WH_ORDER_METHOD_ID    NUMBER(20)
LAST_ORDER_DATE       DATE
TOTAL_SALES_REVENUE   NUMBER(20,2)
TOTAL_NET_PROFIT      NUMBER(20,2)
TOTAL_GROSS_PROFIT    NUMBER(20,2)
TOTAL_QTY             NUMBER(20)
RANK                  NUMBER(20)
Note: Do not use the dimension tables as references when loading data into
the aggregate fact tables, because the references were already resolved when
loading the working database.
Summary: Target table 4 is used for Market-Basket Analysis, whereas target
table 5 is used for Product Sales Analysis.
Divergence table for Traffic Analysis
Traffic Analysis finds the number of customers in different locations
and what they tend to select; from it we can see increases or decreases in
the number of customers, revenue, net profit, gross profit, etc., at a chosen
time-period granularity. There is no need to reference the dimension tables;
measures are calculated by week, month, quarter and year, i.e. the grain of
this fact table is the week. It is an independent table, also called a snapshot
table or materialized view.
Source tables:
1. WH_ITEM_FACT
2. WH_PRODUCT_SALES_FCT
Step 1: Define a mapping variable $$WEEK with an initial value of 1.
Step 2: Select the required columns from the source table WH_ITEM_FACT
and define a filter condition to select the required data, shown below:
TO_NUMBER(TO_CHAR(TO_DATE(WH_ITEM_FCT.ORDER_DATE, 'MM/DD/YYYY'), 'WW')) = $$WEEK
Similarly for the second source table: select the required columns from
WH_PRODUCT_SALES_FCT and define the filter condition shown below:
TO_NUMBER(TO_CHAR(TO_DATE(WH_PRODUCT_SALES_FCT.ORDER_DATE, 'MM/DD/YYYY'), 'WW')) = $$WEEK
Step 3: Define the variable expression $$WEEK = $$WEEK + 1.
Step 4: Design logic to derive WEEK, MONTH, QUARTER and YEAR from
the last ORDER_DATE.
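The $$WEEK filter above relies on Oracle's 'WW' week number (week 1 starts on January 1, each week is 7 days). A Python sketch of the same filter, with assumed row shapes:

```python
# Sketch of the $$WEEK filter: reproduce TO_CHAR(date, 'WW') semantics
# (week = ceil(day_of_year / 7)) and keep only rows in the given week.
from datetime import datetime

def week_of_year(order_date, fmt="%m/%d/%Y"):
    d = datetime.strptime(order_date, fmt)
    # Oracle 'WW': week 1 begins Jan 1; every week spans 7 days.
    return (d.timetuple().tm_yday - 1) // 7 + 1

def week_filter(rows, week):
    return [r for r in rows if week_of_year(r["ORDER_DATE"]) == week]
```

Incrementing `week` between runs plays the role of the $$WEEK = $$WEEK + 1 variable expression.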
Step 5: Aggregate the incoming data with the necessary aggregate functions
to produce the required results per LOCATION and time period, shown
below, using three aggregation stages:
1. TOTAL_CUSTOMERS = COUNT(DISTINCT WH_ITEM_FCT.WH_CUSTOMER_ID) per time period and location;
2. TOTAL_ORDERS = SUM(WH_PRODUCT_SALES_FCT.TOTAL_ORDERS) per time period and location;
3. TOTAL_REVENUE = SUM(WH_PRODUCT_SALES_FCT.SALES_REVENUE) per time period and location;
4. TOTAL_NET_PROFIT = SUM(WH_PRODUCT_SALES_FCT.NET_PROFIT) per time period and location;
5. TOTAL_GROSS_PROFIT = SUM(WH_PRODUCT_SALES_FCT.GROSS_PROFIT) per time period and location;
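The distinct-count plus summed metrics can be sketched as one grouped pass over both inputs. The (week, city) key and the input row shapes are assumptions; the spec's location grain may also include country and state.

```python
# Sketch of the traffic-analysis rollup: distinct customers from the item
# fact and summed order metrics from the product-sales fact, keyed by
# an assumed (WEEK, CITY) grain.
from collections import defaultdict

def traffic_rollup(item_rows, product_rows):
    out = defaultdict(lambda: {"customers": set(), "TOTAL_ORDERS": 0,
                               "TOTAL_REVENUE": 0.0})
    for r in item_rows:
        out[(r["WEEK"], r["CITY"])]["customers"].add(r["WH_CUSTOMER_ID"])
    for r in product_rows:
        agg = out[(r["WEEK"], r["CITY"])]
        agg["TOTAL_ORDERS"] += r["TOTAL_ORDERS"]
        agg["TOTAL_REVENUE"] += r["SALES_REVENUE"]
    # Replace each customer set with its distinct count.
    return {k: {"TOTAL_CUSTOMERS": len(v["customers"]),
                "TOTAL_ORDERS": v["TOTAL_ORDERS"],
                "TOTAL_REVENUE": v["TOTAL_REVENUE"]}
            for k, v in out.items()}
```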
Target table 6
WH_AGGR_SALES_FCT
YEAR                  NUMBER(20)
QTR                   NUMBER(20)
MONTH                 NUMBER(20)
WEEK                  NUMBER(20)
COUNTRY               VARCHAR2(30)
STATE                 VARCHAR2(30)
CITY                  VARCHAR2(30)
TOTAL_REVENUE         NUMBER(20,2)
TOTAL_NET_PROFIT      NUMBER(20,2)
TOTAL_GROSS_PROFIT    NUMBER(20,2)
TOTAL_ORDERS          NUMBER(20)
TOTAL_CUSTOMERS       NUMBER(20)
TRANSDATE             DATE
Summary: Transaction-grain fact tables are loaded daily to maintain very
detailed data in a specified order, whereas periodic fact tables are loaded
periodically and carry aggregated measures. Four star schemas are
constructed here and modeled as a galaxy schema, i.e. all integrated
together. This is the Sales Data Mart, used for customer order management,
Sales Analysis and Market-Basket Analysis.
Summary of the project:
1. Number of dimension tables: 14
2. Number of fact tables: 5
3. Number of divergence tables: 1
4. Number of lookup tables: 4
Use SCD Type-1 for following dimension tables,
a. WH_ORDER_DIM
b. WH_STORE_DIM
c. WH_REGION_DIM
d. WH_MARKET_PLACE_DIM
e. WH_ORDER_METHOD_DIM
f. WH_SUPPLIER_DIM
Use SCD Type-2 for following dimension tables,
a. WH_CUSTOMER_DIM
b. WH_RETAILER_DIM
c. WH_AGENT_DIM
d. WH_SALES_PERSON_DIM
e. WH_SESSION_TYPE_DIM
f. WH_LEGAL_ENTITY_DIM
The currency dimension WH_CURRENCY_DIM is a monster dimension;
there is no need to construct ETL logic for it, as it is used directly in the
OLAP systems for reporting. For the time dimension WH_TIME_DIM, do
not construct ETL logic either; just generate a sequence for it.
Use incremental, daily loading for the transaction-grain fact tables shown
below:
a. WH_ITEM_FCT
b. WH_ORDER_FCT
c. WH_PRODUCT_SALES_FCT
Use periodic loading for the periodic snapshot fact tables and the divergence
table shown below:
a. WH_AGGR_CUSTOMER_SALES_FCT
b. WH_AGGR_PRODUCT_SALES_FCT
c. WH_AGGR_SALES_FCT (divergence table)
Choose a parallel processing mechanism for constructing the staging
database. Always load fresh data into staging on every execution, replacing
the existing data in staging.