Data Warehouse Concepts

Data Warehouse

A data warehouse is a centrally managed and integrated database containing data from the operational sources in an organization (such as SAP, CRM or ERP systems). It may also gather manual inputs from users who define criteria and parameters for grouping or classifying records. The database contains structured data for query and analysis and can be accessed by business users. The data warehouse can be created or updated at any time, with minimal disruption to the operational systems; this is ensured by a strategy implemented in the ETL process. The source for the data warehouse is a data extract from the operational databases. The data is validated, cleansed, transformed and finally aggregated, at which point it is ready to be loaded into the data warehouse. A data warehouse is a dedicated database which contains detailed, stable, non-volatile and consistent data which can be analyzed over time (it is time-variant). Sometimes, where only a portion of the detailed data is required, it may be worth considering a data mart. A data mart is generated from the data warehouse and contains data focused on a given subject, as well as data that is frequently accessed or summarized.

Business Intelligence - Data Warehouse - ETL: Keeping the data warehouse filled with very detailed and inefficiently selected data may cause the database to grow to a huge size, which can make it difficult to manage and unusable. A good example of successful data management are the set-ups used by leaders in the field of telecoms such as O2 broadband. To significantly reduce the number of rows in the data warehouse, the data is aggregated, which leads to easier data maintenance and more efficient browsing and data analysis.

Key Data Warehouse systems and the most widely used database engines for storing and serving data for enterprise business intelligence and performance management:
- Teradata
- SAP BW - Business Information Warehouse
- Oracle
- Microsoft SQL Server
- IBM DB2
- SAS

Data Warehouse Architecture

The main difference between the database architecture of a standard, online transaction processing oriented system (usually an ERP or CRM system) and a data warehouse is that the system's relational model is usually de-normalized into dimension and fact tables, which are typical of a data warehouse database design. The differences in the database architectures are caused by the different purposes of their existence. In a typical OLTP system the database performance is crucial, as end-user interface responsiveness is one of the most important factors determining the usefulness of the application. That kind of database needs to handle inserting thousands of new records every hour. To achieve this, the database is usually optimized for the speed of inserts, updates and deletes and for holding as few records as possible, so from a technical point of view most of the SQL queries issued will be INSERT, UPDATE and DELETE statements. In contrast to OLTP systems, a data warehouse is a system that should give a response to almost any question regarding company performance measures. Usually the information delivered from a data warehouse is used by people who are in charge of making decisions, so the information should be accessible quickly and easily, but it does not need to be the most recent possible nor at the lowest level of detail.
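To illustrate the dimensional model described above, the sketch below shows a hypothetical, simplified star schema with one fact table and two dimension tables. All table and column names here are illustrative assumptions, not a design taken from any particular product.

-- Hypothetical customer dimension: one row per customer, de-normalized attributes
CREATE TABLE d_customer (
    cust_key      INTEGER     NOT NULL PRIMARY KEY,  -- surrogate key
    cust_id       VARCHAR(20) NOT NULL,              -- business (natural) key
    cust_name     VARCHAR(100),
    segment       VARCHAR(30),
    country       VARCHAR(50)
);

-- Hypothetical time dimension
CREATE TABLE d_date (
    date_key      INTEGER NOT NULL PRIMARY KEY,      -- e.g. 20240131
    calendar_date DATE    NOT NULL,
    month_no      INTEGER,
    year_no       INTEGER
);

-- Fact table: one row per invoice line, measures plus foreign keys to the dimensions
CREATE TABLE f_sales (
    date_key      INTEGER     NOT NULL REFERENCES d_date (date_key),
    cust_key      INTEGER     NOT NULL REFERENCES d_customer (cust_key),
    invoice_no    VARCHAR(20) NOT NULL,
    quantity      DECIMAL(12,2),
    sales_amount  DECIMAL(14,2)
);

Queries against such a schema typically join the fact table to the dimensions and aggregate the measures, which is exactly the access pattern a data warehouse is optimized for.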
Data mart

Data marts are designed to fulfill the role of strategic decision support for managers responsible for a specific business area. A data warehouse operates on an enterprise level and contains all data used for reporting and analysis, while a data mart is used by a specific business department and is focused on a specific subject (business area). A scheduled ETL process populates the data marts with the subject-specific information from the data warehouse. The typical approach for maintaining a data warehouse environment with data marts is to have one Enterprise Data Warehouse, which comprises divisional and regional data warehouse instances, together with a set of dependent data marts which derive their information directly from the data warehouse. It is crucial to keep the data marts consistent with the enterprise-wide data warehouse system, as this ensures that they are properly defined, constituted and managed. Otherwise the DW environment's mission of being "the single version of the truth" becomes a myth. However, in data warehouse systems there are cases where developing an independent data mart is the only way to get the required figures out of the DW environment. Developing independent data marts, which are not 100% reconciled with the data warehouse environment and in most cases include a supplementary source of data, must be clearly understood and all the associated risks must be identified.

Data marts are usually maintained and made available in the same environment as the data warehouse (systems like Oracle, Teradata, MS SQL Server, SAS) and are smaller in size than the enterprise data warehouse. There are also many cases where data marts are created and refreshed on a server and then distributed to the end users using shared drives or email and stored locally. This approach generates high maintenance costs, however it makes it possible to keep data marts available offline.

There are two approaches to organizing data in data marts:
- Database data mart tables, or their extracts represented by text files - a one-dimensional, non-aggregated data set; in most cases the data is processed and summarized many times by the reporting application.
- Multidimensional database (MDDB) - aggregated data organized in a multidimensional structure. The data is aggregated only once and is ready for business analysis right away.

In the next stage, the data from the data marts is usually gathered by a reporting or analytical processing (OLAP) tool, such as Hyperion, Cognos, Business Objects, Pentaho BI or Microsoft Excel, and made available for business analysis. Usually a company maintains multiple data marts serving the needs of finance, marketing, sales, operations, IT and other departments as required. Example uses of data marts in an organization: CRM reporting, customer migration analysis, production planning, monitoring of marketing campaigns, performance indicators, internal ratings and scoring, risk management, integration with other systems (systems which use the processed DW data) and other uses specific to the individual business.
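As a simple illustration of how a subject-focused, aggregated data mart can be derived from detailed warehouse data, the sketch below builds a hypothetical monthly sales mart from the fact and dimension tables introduced earlier. The names are illustrative assumptions, not a prescribed design, and CREATE TABLE ... AS SELECT is only one common form (some engines use SELECT INTO instead).

-- Hypothetical monthly sales data mart: aggregation reduces the row count
-- and keeps only the attributes the sales department actually reports on.
CREATE TABLE m_sales_monthly AS
SELECT
    d.year_no,
    d.month_no,
    c.segment,
    c.country,
    SUM(f.quantity)     AS total_quantity,
    SUM(f.sales_amount) AS total_sales_amount
FROM f_sales f
JOIN d_date     d ON d.date_key = f.date_key
JOIN d_customer c ON c.cust_key = f.cust_key
GROUP BY d.year_no, d.month_no, c.segment, c.country;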
Reporting

A successful reporting platform implementation in a business intelligence environment requires great attention from both the business end users and the IT professionals. The reporting layer is what business users actually perceive as the data warehouse system, and if they do not like it, they will not use it - even if it is a perfectly maintained data warehouse with high-quality data, stable and optimized ETL processes and faultless operation. It will simply be useless to them, and thus useless for the whole organization. The problem is that the report generation process is not particularly interesting from the IT point of view, as it does not involve heavy data processing and manipulation tasks. IT professionals tend not to pay great attention to this BI area, as they consider it 'look and feel' rather than the 'real heavy stuff'. On the other hand, the lack of technical exposure of the business users usually makes the report design process too complicated for them. The conclusion is that the key to success in reporting (and in the whole BI environment) is collaboration between the business and IT professionals.

Data mining

Data mining is the use of intelligent information management tools to discover knowledge and extract information that supports the decision making process in an organization. Data mining is an approach to discovering data behavior in large data sets by exploring the data, fitting different models and investigating different relationships in vast repositories. The information extracted with a data mining tool can be used in such areas as decision support, prediction, sales forecasting, financial and risk analysis, estimation and optimization.

Sample real-world business uses of data mining applications include:
- CRM - aids customer classification and retention campaigns
- Web site traffic analysis - guest behavior prediction or relevant content delivery
- Public sector organizations may use data mining to detect occurrences of fraud such as money laundering and tax evasion, match crime and terrorist patterns, etc.
- Genomics research - analysis of the vast data stores

The most widely known and encountered data mining techniques:
- Statistical modeling, which uses mathematical equations to do the analysis. The most popular statistical models are generalized linear models, discriminant analysis, linear regression and logistic regression.
- Decision list models and decision trees
- Neural networks
- Genetic algorithms
- Screening models

Data mining tools offer a number of data discovery techniques to provide insight into the data and to help identify the relevant set of attributes in the data:
- Data manipulation, which consists of constructing new data subsets derived from existing data sources.
- Browsing, auditing and visualization of the data, which helps identify non-typical, suspected relationships between variables in the data.
- Hypothesis testing

A group of the most significant data mining tools is represented by:
- SPSS Clementine
- SAS Enterprise Miner
- IBM DB2 Intelligent Miner
- STATISTICA Data Miner
- Pentaho Data Mining (WEKA)
- Isoft Alice

Data federation

Try to remember all the times when you had to collect data from many different sources. This probably reminds you of situations when you spent many hours searching for and processing data, because you had to check every source one by one. Afterwards it turned out that you had missed something, and you spent another day attaching the missing data. However, there is a solution which can simplify this work: it is called data federation.

What is this and how does it work?
Data federation is a kind of software that standardizes the integration of data from different (sometimes very dispersed) sources. It collects data from multiple sources, creates a strategic layer of data and optimizes the integration of dispersed views, providing standardized access to the integrated view of information within a single layer of data. The created layer ensures re-usability of the data.
Because of this, data federation is very significant when creating SOA-style software (Service Oriented Architecture - a way of building systems whose main aim is to create software that lives up to users' expectations).

The most important advantages of good data federation software
Data federation strong points:
- Large companies process huge amounts of data coming from different sources. In addition, as the company grows there are more and more data sources that employees have to deal with, and eventually it becomes nearly impossible to work without specific supporting software.
- Standardized and simplified access to data - thanks to this, even when we use data from different, dispersed sources that additionally have different formats, they can look as if they come from the same source.
- You can easily create integrated views of data and libraries of data stores that can be reused multiple times.
- Data is provided in real time, from the original sources, not from cumulative databases or duplicates.
- Efficient delivery of up-to-date data and protection of the databases at the same time.

Data federation weaknesses
There is no garden without weeds, so data federation also has some drawbacks. While using this software, parameters should be closely watched and optimized, because the aggregation logic takes place on the federation server, not in the database. Wrong parameters or other errors can affect the transmission or the correctness of the results. We should also remember that the data comes from a large number of sources and it is important to check their reliability. By using unverified and uncorrected data we can introduce errors into our work, and that can cause financial losses for the company. Software cannot always judge whether a source is reliable or not, so we have to make sure that there are dedicated information managers who watch over the correctness of the data and decide which sources the data federation application can use. As you can see, a data federation system is very important, especially for big companies. It makes working with large amounts of data from different sources much easier.

Top Data Federation tools
Below is a list of the most popular enterprise data integration tools providing data federation features:
- SAP BusinessObjects Data Federator
- Sybase Data Federation
- IBM InfoSphere Federation Server
- Oracle Data Service Integrator
- SAS Enterprise Data Integration Server - data federation features
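As a minimal illustration of the "integrated view" idea, the sketch below exposes customer data from two hypothetical source schemas through a single SQL view. This is only a local simplification for illustration; real data federation products (such as the tools listed above) build such views virtually, on the fly, across remote and heterogeneous sources, and crm_db, billing_db and all column names here are assumptions.

-- Simplified sketch: one integrated view over two hypothetical source schemas.
CREATE VIEW v_customer_federated AS
SELECT
    c.customer_id,
    c.customer_name,
    c.email,
    'CRM'     AS source_system
FROM crm_db.customers c
UNION ALL
SELECT
    b.cust_no   AS customer_id,
    b.cust_name AS customer_name,
    NULL        AS email,
    'BILLING'   AS source_system
FROM billing_db.cust_master b;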
Business Intelligence staffing

This study illustrates how global companies typically structure their business intelligence resources and what teams and departments are involved in the support and maintenance of an Enterprise Data Warehouse. The proposal may vary widely across organizations depending on the company requirements, the sector and the BI strategy, however it can be considered a template. A successful business intelligence strategy requires the involvement of people from various departments, mainly:
- Top Management (CIO, board of directors members, business managers)
- Finance
- Sales and Marketing
- IT - both business analysts and technical managers and specialists

Steering committee / Business owners
A typical enterprise data warehouse environment is under the control and direction of a steering committee. The steering committee sets the policy and strategic direction of the data warehouse and is responsible for the prioritization of DW initiatives. The steering committee is usually chaired by a DW business owner. The business owner acts as sponsor and champion of the whole BI environment in the organization.

Business Data Management group / Data governance team
This group forms the business stream of the BI staff and consists of a cross-functional team of representatives from the business and IT (mostly IT managers and analysts) with a shared vision to promote data standards and data quality, manage DW metadata and assure that the Data Warehouse is the Single Version of the Truth. Within the data governance group each major data branch has a business owner appointed to act as data steward. One of the key roles of the data steward is to approve access to the DW data. New DW projects and enhancements are shaped and approved by the Business Data Management group. The team also defines the Business Continuity Management and disaster recovery strategy.

Data Warehousing team
The IT stream of a Data Warehouse is usually the largest and the most relevant from the technical perspective. It is hard to name this kind of team precisely and easily, as it may differ significantly among organizations. This group might be referred to as the Business Intelligence team, Data Warehousing team, Information Analysis Group, Information Delivery team, MIS Support, BI development, Datawarehousing Delivery team, or BI solutions and services. Basically, the creativity of HR departments in naming teams and positions is much broader than you can imagine. Some organizations also group the staff into smaller teams, oriented around a specific function or topic. Typical positions occupied by members of the data warehousing team are:
- Data Warehouse analyst - people who know the OLTP and MIS systems and the dependencies between them in an organization; very often the analysts create functional specification documents, high-level solution outlines, etc.
- DW architect - a technical person who needs to understand the business strategy and implement that vision through technology
- Business Intelligence specialist, Data Warehouse specialist, DW developer - technical BI experts
- ETL modeler, ETL developer - data integration technical experts
- Reporting analyst, OLAP analyst, Information Delivery analyst, Report designer - people with analytical minds and some technical exposure
- Team leaders - most often each of the Data Warehousing team subgroups has its own team leader who reports to a Business Intelligence IT manager; these are very often experts who have been promoted
- Business Intelligence IT manager - a very important role, because the BI manager is the link between the steering committee, the data governance group and the experts

DW Support and administration
This team is focused on the support, maintenance and resolution of operational problems within the data warehouse environment. The people included may be:
- Database administrators (DBA)
- Administrators of other BI tools in the organization (database, reporting, ETL)
- Helpdesk support, customer support - supporting the reporting front ends, access to the DW, applications and other hardware and software problems

Sample DW team organogram
Enterprise Data Warehouse organizational chart with all the people involved and a sample staff count. The head count in our sample DW organization is 35 people.
Be aware that this organogram depends heavily on the company size and its mission, however it may be treated as a good illustration of the proportions of BI resources in a typical environment. Keep in mind that not all positions involve full-time resources.
- Steering committee, Business owners *
- Business Data Management group
- DW team: DW/ETL specialists and developers, report designers and analysts, DW technical analysts, testers* (an analyst may also play the role of a tester)
- Support and maintenance *

Data Warehouse metadata

The metadata in a data warehouse system describes the definitions, meaning, origin and rules of the data used in the Data Warehouse. There are two main types of metadata in a data warehouse system: business metadata and technical metadata. The two types represent the business and the technical points of view on the data. The Data Warehouse metadata is usually stored in a metadata repository which is accessible to a wide range of users.

Business metadata
Business metadata (data warehouse metadata, front room metadata, operational metadata) stores business definitions of the data; it contains high-level definitions of all fields present in the data warehouse, plus information about cubes, aggregates and data marts. Business metadata is mainly addressed to and used by the data warehouse users, report authors (for ad-hoc querying), cube creators, data managers, testers and analysts. Typically, the following information needs to be provided to describe business metadata:
- DW Table Name
- DW Column Name
- Business Name - short and descriptive header information
- Definition - extended description with a brief overview of the business rules for the field
- Field Type - a flag may indicate whether a given field stores a key or a discrete value, whether it is active or not, or what its data type is. The content of that field (or fields) may vary according to business needs.

Technical metadata
Technical metadata (ETL process metadata, back room metadata, transformation metadata) is a representation of the ETL process. It stores data mappings and transformations from the source systems to the data warehouse and is mostly used by data warehouse developers, specialists and ETL modelers. Most commercial ETL applications provide a metadata repository with an integrated metadata management system to manage the ETL process definition. The definition of technical metadata is usually more complex than the business metadata and it sometimes involves multiple dependencies. The technical metadata can be structured in the following way:
- Source Database - or system definition; it can be a source system database, another data warehouse, a file system, etc.
- Target Database - the Data Warehouse instance
- Source Tables - one or more tables which are input to calculate the value of the field
- Source Columns - one or more columns which are input to calculate the value of the field
- Target Table - the target DW table and column are always single in a metadata repository
- Target Column - the target DW column
- Transformation - the descriptive part of a metadata entry; it usually contains a lot of information, so it is important to use a common standard throughout the organisation to keep the data consistent.
A simplified sketch of such a repository table is shown after the tool list below.

Some tools dedicated to metadata management (many of them are bundled with ETL tools):
- Teradata Metadata Services
- Erwin Data Modeler
- Microsoft Repository
- IBM (Ascential) MetaStage
- Pentaho Metadata
- Ab Initio EME (Enterprise Metadata Environment)
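As a simplified sketch of the technical metadata structure described above, the table below records one mapping entry per target column. The table and column names are illustrative assumptions and are not taken from any particular metadata tool.

-- Hypothetical technical metadata repository table: one row per target column mapping
CREATE TABLE etl_metadata_mapping (
    mapping_id      INTEGER       NOT NULL PRIMARY KEY,
    source_database VARCHAR(100)  NOT NULL,   -- source system, file system, another DW, etc.
    target_database VARCHAR(100)  NOT NULL,   -- the data warehouse instance
    source_tables   VARCHAR(400),             -- one or more input tables
    source_columns  VARCHAR(400),             -- one or more input columns
    target_table    VARCHAR(100)  NOT NULL,   -- always a single target table
    target_column   VARCHAR(100)  NOT NULL,   -- always a single target column
    transformation  VARCHAR(2000)             -- descriptive transformation rule
);

-- Example entry for a single mapping
INSERT INTO etl_metadata_mapping
VALUES (1, 'CRM_SOURCE', 'ENTERPRISE_DW', 'CUSTOMERS', 'CUST_ID, CUST_NAME',
        'D_CUSTOMER', 'CUST_NAME', 'Trim spaces and convert to upper case');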
SCD - Slowly changing dimensions

Slowly changing dimensions (SCD) determine how historical changes in the dimension tables are handled. Implementing an SCD mechanism enables users to know to which category an item belonged on any given date. Types of Slowly Changing Dimensions in data warehouse architectures:
- Type 0 SCD is not used frequently, as it applies when no effort has been made to deal with the changing dimension issues. Some dimension data may be overwritten while other data stays unchanged over time, which can confuse end users.
- Type 1 SCD architecture applies when no history is kept in the database. The new, changed data simply overwrites the old entries. This approach is used quite often with data which changes over time due to the correction of data quality errors (misspellings, data consolidations, trimming spaces, language-specific characters). Type 1 SCD is easy to maintain and is used mainly when losing the ability to track the old history is not an issue.
- In the Type 2 SCD model the whole history is stored in the database. An additional dimension record is created, so separating the old record values from the new (current) value is easy and the history is clear. The fields 'effective date' and 'current indicator' are very often used in this type of dimension.
- Type 3 SCD - only the information about the previous value of a dimension is written into the database. An 'old' or 'previous' column is created which stores the immediately preceding attribute value. With Type 3 SCD users can see the change immediately and can report both forward and backward from the change. However, this model cannot track all historical changes, for example when a dimension changes twice or more. That would require creating additional columns to store the historical data and could make the whole data warehouse schema very complex.
- Type 4 SCD - the idea is to store all historical changes in a separate historical data table for each of the dimensions.
In order to manage Slowly Changing Dimensions properly and easily it is highly recommended to use surrogate keys in the Data Warehouse tables. A surrogate key is a technical key added to a fact table or a dimension table which is used instead of a business key (like product ID or customer ID). Surrogate keys are always numeric and unique at a table level, which makes it easy to distinguish and track values that change over time. In practice, in big production Data Warehouse environments, mostly Slowly Changing Dimensions Type 1, Type 2 and Type 3 are considered and used. It is common practice to apply different SCD models to different dimension tables (or even columns in the same table) depending on the business reporting needs of a given type of data.

ETL process and concepts

ETL stands for extraction, transformation and loading. ETL is a process that involves the following tasks:
- extracting data from source operational or archive systems which are the primary source of data for the data warehouse
- transforming the data - which may involve cleaning, filtering, validating and applying business rules
- loading the data into a data warehouse or any other database or application that houses data
The ETL process is also very often referred to as the data integration process, and an ETL tool as a data integration platform. The terms closely related to and managed by ETL processes are: data migration, data management, data cleansing, data synchronization and data consolidation. The main goal of maintaining an ETL process in an organization is to migrate and transform data from the source OLTP systems to feed a data warehouse and form data marts.
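As an illustration of how a single ETL load step can apply the Type 2 SCD policy described above, the sketch below expires the current customer row and inserts a new version. The column names (effective_date, end_date, current_flag) and the sample values are assumptions made for illustration; they are common conventions rather than a fixed standard.

-- Type 2 SCD sketch: expire the current row for customer 'C1001' and add a new version.
-- Assumes d_customer carries effective_date, end_date and current_flag columns.
UPDATE d_customer
SET end_date     = DATE '2024-01-31',
    current_flag = 'N'
WHERE cust_id = 'C1001'
  AND current_flag = 'Y';

INSERT INTO d_customer
    (cust_key, cust_id, cust_name, segment, country,
     effective_date, end_date, current_flag)
VALUES
    (10234, 'C1001', 'Acme Ltd', 'Corporate', 'Germany',
     DATE '2024-02-01', DATE '9999-12-31', 'Y');

Queries for the current state filter on current_flag = 'Y', while historical reports join on the effective_date/end_date interval.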
Data Warehousing ETL tutorial

The ETL and Data Warehousing tutorial is organized into lessons representing various business intelligence scenarios, each of which describes a typical data warehousing challenge. This guide can be considered an ETL process and Data Warehousing knowledge base with a series of examples illustrating how to manage and implement the ETL process in a data warehouse environment. The purpose of this tutorial is to outline and analyze the most widely encountered real-life data warehousing problems and challenges that need to be addressed during the design and architecture phases of a successful data warehouse project deployment. Going through the sample implementations of the business scenarios is also a good way to compare Business Intelligence and ETL tools and to get to know the different approaches to designing the data integration process. This also helps identify the strong and weak points of various ETL and data warehousing applications. This tutorial shows how to use the following BI, ETL and data warehousing tools: Datastage, SAS, Pentaho, Cognos and Teradata.

Data Warehousing & ETL tutorial lessons
- Surrogate key generation - an example which includes information on business keys and surrogate keys and shows how to design an ETL process to manage surrogate keys in a data warehouse environment. Sample design in Pentaho Data Integration.
- Header and trailer processing - considerations on processing files arranged in blocks consisting of a header record, body items and a trailer. This type of file usually comes from mainframes; it also applies to EDI and EPIC files. Solution examples in Datastage, SAS and Pentaho Data Integration.
- Loading customers - a data extract is placed on an FTP server. It is copied to an ETL server and loaded into the data warehouse. Sample loading in Teradata MultiLoad.
- Data allocation - an ETL process case study for allocating data. Examples in Pentaho Data Integration and Cognos PowerPlay.
- Data masking and scrambling - algorithms and ETL deployments. Sample Kettle implementation.
- Site traffic analysis - a guide to creating a data warehouse with data marts for website traffic analysis and reporting. Sample design in Pentaho Kettle.
- Data Quality - an ETL process design aimed at testing and cleansing data in a Data Warehouse. Sample outline in PDI.
- XML ETL processing

Generate surrogate key

Goal: Fill a data warehouse dimension table with data which comes from different source systems and assign a unique record identifier (surrogate key) to each record.

Scenario overview and details
To illustrate this example, we will use two made-up sources of information to provide data for the customers dimension. Each extract contains customer records with a business key (natural key) assigned. In order to isolate the data warehouse from the source systems, we will introduce a technical surrogate key instead of re-using the source system's natural (business) key. A unique and common surrogate key is a one-field numeric key which is shorter, easier to maintain and understand, and more independent from changes in the source system than a business key. Also, if the surrogate key generation process is implemented correctly, adding a new source system to the data warehouse processing will not require major effort.
The surrogate key generation mechanism may vary depending on the requirements, however the inputs and outputs usually fit into the design shown below.
Inputs:
- an input represented by an extract from the source system
- a data warehouse table reference for identifying the existing records
- the maximum key lookup
Outputs:
- an output table or file with newly assigned surrogate keys
- the new maximum key
- the updated reference table with new records

Proposed solution
Assumptions:
- The surrogate key field for our made-up example is WH_CUST_NO.
- To make the example clearer, we will use SCD Type 1 to handle changing dimensions. This means that new records overwrite the existing data.
The ETL process implementation requires several inputs and outputs.
Input data:
- customers_extract.csv - first source system extract
- customers2.txt - second source system extract
- CUST_REF - a lookup table which contains the mapping between natural keys and surrogate keys
- MAX_KEY - a sequence number which represents the last key assignment
Output data:
- D_CUSTOMER - table with new records and correctly associated surrogate keys
- CUST_REF - new mappings added
- MAX_KEY - sequence increased
The design of an ETL process for generating surrogate keys will be as follows:
- The loading process will be executed twice, once for each of the input files.
- Check that the lookup reference data is correct and available (the CUST_REF table and the MAX_KEY sequence).
- Read the extract and first check whether a record already exists. If it does, assign the existing surrogate key to it and update the descriptive data in the main dimension table.
- If it is a new record, generate a new surrogate key and assign it to the record. The new key is populated by incrementing the old maximum key by 1. Insert a new record into the dimension table, insert a new record into the mapping table (which stores the business and surrogate key mapping) and update the new maximum key.
A simplified SQL sketch of this key assignment logic is shown after the sample implementations list below.

Sample implementations
Surrogate key generation implemented in various ETL environments:
- PDI surrogate key - surrogate key generation example implemented in Pentaho Data Integration
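The following is a minimal SQL sketch, under the assumptions above (a CUST_REF mapping table, a MAX_KEY-style counter and a D_CUSTOMER target), of how a new surrogate key could be assigned to a single business key that is not yet mapped. The literal values are made up, and production ETL tools implement the same idea set-wise with dedicated lookup and sequence steps.

-- Sketch: assign a surrogate key to the new business key 'C2001' (assumed not yet mapped).
-- In a real ETL flow the existence check and the max-key increment are separate lookup steps.
INSERT INTO cust_ref (cust_id, wh_cust_no)
SELECT 'C2001', COALESCE(MAX(wh_cust_no), 0) + 1
FROM cust_ref;

-- Load the dimension record using the surrogate key taken from the mapping table.
INSERT INTO d_customer (wh_cust_no, cust_id, cust_name, segment, country)
SELECT r.wh_cust_no, 'C2001', 'New Customer Ltd', 'SME', 'Poland'
FROM cust_ref r
WHERE r.cust_id = 'C2001';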
Processing a header and trailer text file

Goal: Process a text file which contains records arranged in blocks consisting of a header record, details (items, body) and a trailer. The aim is to normalize the records and load them into a relational database structure.

Scenario overview and details
Typically, header and item processing needs to be implemented when processing files that originate from mainframe systems; it also applies to EDI transmission files, SWIFT schemas and EPIC files. The input file in our scenario is a stream of records representing invoices. Each invoice consists of:
- a header containing an invoice number, dates, a customer reference and other details
- one or more items representing the ordered products, including an item number, quantity, price and value
- a trailer which contains summary information for all the items
The records are distinguished by the first character in each line: H stands for headers, I for items and T for trailers. The lines in a given section of the input file are of fixed length (so, for example, all headers have the same number of characters, but headers differ in length from items).
Input data:
- A text file in a header-trailer format. The file presented in our example is divided into headers, items and trailers. A header starts with the letter H and then contains an invoice number, a customer number, a date and the invoice currency. Every item starts with the letter I, followed by a product ID, product age, quantity and net value. The trailer starts with a T and contains two values which act as a checksum: the total number of invoice lines and the total net value.
- Sample header and trailer input text file
Output data:
- INVC_HEADER and INVC_LINE relational tables, one with the invoice headers and the other with the invoice lines
- rejects.txt - a text file with loading errors (desired output)

FTP copy and load customers extract

Goal: Load customers data into a data warehouse according to the business requirements. The customer details are changed very often in the source system and we want to reflect those changes in the data warehouse. We also want to be able to keep track of those changes.

Scenario overview and details
The customers data is extracted from the source system on a monthly basis and placed on an FTP server. The ETL process needs to get the file locally and load it into the data warehouse. There are several possible scenarios which need to be handled by the process: a new customer is added, a customer already exists and needs to be updated, an existing customer remains unchanged, or a record may be invalid. Additionally, we want to keep track of the changes made to the customers data and be able to see all the changes to a given record over time. To handle the historical data we will use the Type 4 Slowly Changing Dimension, which means that the changed or deleted data is stored in a separate table. A timestamp and a change index will be added to handle records which change more than once.
Input data:
- D_CUSTOMER - data warehouse table with customers
- dwh_cust_extract_1008.txt - customers extract for the current month
Output data:
- D_CUSTOMER - updated table which stores only the current records
- D_CUSTOMER_HIST - table with historical data for the customers dimension; it stores deleted records and customers that have already been updated
- Cust_errors_100827.txt - a log file with loading errors

Proposed solution
The design of the ETL process flow for the customers loading will be as follows:
- A text file with the customer extract is generated by the source system and placed on an FTP server.
- The file is retrieved to an ETL server.
- The customer extract is loaded into a temporary table.
- The existing DW customers file is loaded into a temporary lookup file.
- Each record from the customers file is validated and looked up against the existing customers file. The transform needs to apply the following rules: (a) if a record is malformed and does not pass validation, it is redirected to a reject flow; (b) if the lookup does not match any record, this is a new customer and it needs to be loaded into the customers table; (c) if the lookup matches, we need to compare the non-key fields to check whether the customer details have changed. There are two options: all fields remain the same (then we leave the record as it is and proceed to the next one), or a field has changed - in that case the current record needs to be inserted into the historical table and replaced by a new one in the main customers table. A simplified SQL sketch of rule (c) is shown after the implementation list below.

Implementation
Loading customers - ETL process implementations in various environments:
- Teradata MultiLoad and an FTP shell script to load the customers extract
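Under the assumptions of this scenario (a D_CUSTOMER table of current records, a D_CUSTOMER_HIST history table with the same columns plus a change timestamp, and a staged extract in a temporary table here called stg_customer), the sketch below shows one way to express rule (c) in plain SQL for a single changed customer. Real implementations in Teradata MultiLoad or an ETL tool handle this set-wise with dedicated steps, so this is only an illustration.

-- Rule (c) sketch for one changed customer: move the current row to history,
-- then refresh it in the main table from the staged extract.
INSERT INTO d_customer_hist
SELECT c.*, CURRENT_TIMESTAMP AS change_ts
FROM d_customer c
WHERE c.cust_id = 'C1001';

UPDATE d_customer
SET cust_name = (SELECT s.cust_name FROM stg_customer s WHERE s.cust_id = 'C1001'),
    segment   = (SELECT s.segment   FROM stg_customer s WHERE s.cust_id = 'C1001'),
    country   = (SELECT s.country   FROM stg_customer s WHERE s.cust_id = 'C1001')
WHERE cust_id = 'C1001';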
Data allocation

Goal: Populate data for a daily sales report which indicates a profit margin for each invoice in the data warehouse. This makes it feasible to see how much revenue is generated by each invoice line.

Financial background
In absolute terms the profit margin can be illustrated with the following expressions:
Profit margin = sales amount - costs - sales deductions (discounts) - rebates
Net profit margin = profit margin - taxes
The sales amount is the gross total sales figure listed on an invoice and paid by a customer. Sales deductions are discounts given during a sales transaction (listed on an invoice). Costs include variable and fixed costs (provided on a monthly and yearly basis). Rebates and customer bonuses are usually given to a customer and calculated on a monthly, quarterly and yearly basis.

Data allocation concept
Data allocation (a technique also referred to as filling gaps) is useful when dealing with data which has different levels of detail (granularity) and there are gaps for some measures. In data warehousing systems the allocation technique is in many cases compulsory and is used widely in order to get a consistent and complete set of data. The concept of data allocation is closely related to the granularity of the data. In data warehousing, data granularity refers to the level of detail in a given fact table. The tables below illustrate different levels of granularity.

Coarse-grained data (low granularity):
Date   Value
2007   1000
2008   2000
2009   1500
...

Fine-grained data (high granularity):
Date        Value
2008-01-01  8
2008-01-02  15
2008-01-03  12
2008-01-07  14
2008-01-09  11
...

Sample measures that are very often allocated in a data warehouse are costs, operational forecasts, sales plans, customer rebates and bonuses, etc. There are two approaches to data allocation:
- Dynamic allocation (weighted or proportional allocation) - values are allocated using calculated subtotals of another value. The weighted type of allocation is often used in real-life data warehouse environments. Sample uses of dynamic allocation: designating portions of a budget pool, allocating manufacturing costs to products, etc.
- Fixed allocation - a constant value is assigned to all records included in the allocation group. Be aware that this approach may be risky and confusing, as those values cannot be summarized. Sample uses of fixed allocation are static values that do not change often (for example a credit card limit) or cannot be allocated dynamically.
It is also important to keep in mind that in some business cases allocation is unsuitable. Prior to using allocation it is necessary to analyze the data thoroughly and make sure it fits the business logic.

Scenario details
The company's data warehouse stores the sales data down to the invoice line level of detail, and the costs, which are calculated on a monthly (variable costs) and quarterly (fixed costs) basis.
- The variable costs total value is assigned per year, month and product group.
- The fixed costs figure is a grand total assigned per year and month.
The aim is to compare revenue to fixed and variable costs at all available time dimension levels.
The source data has the following table structure:
Date_id; invc_head; invc_line; prod_id; prod_grp; cust_id; quantity; price; sales_amount

Solution outline
The data allocation ETL process will be realized in a few steps:
1. Load the technical invoices table - the table contains all data related to the invoices, including gross sales and net sales.
2. Load the updated monthly and yearly costs into a separate costs table.
3. Create another technical table which assigns importance levels and the following figures, populated for groups of data records using the allocation mechanisms mentioned above: variable costs, fixed costs, sales invoice total.
4. Load the DW invoices table - with the cost figures allocated accordingly and the profit margin calculated.
A simplified SQL sketch of the weighted allocation is shown after the implementation notes below.

Solutions and sample implementations
- Data allocation in Pentaho Data Integration - sample ETL processing in PDI based on production/manufacturing data
- For further analysis please also refer to the Cognos measure allocation example. Cognos Business Intelligence applications provide an automated, built-in mechanism to implement the data allocation technique.
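As a minimal sketch of the dynamic (weighted) allocation described above, the query below spreads a month's variable cost total for a product group over its invoice lines in proportion to each line's share of that month's sales amount. The table and column names (invoice_lines, monthly_costs, year_no, month_no, variable_cost) are illustrative assumptions based on the scenario, not the exact tables of the solution.

-- Weighted allocation sketch: distribute a month's variable cost of a product group
-- over its invoice lines, proportionally to each line's share of that month's sales.
-- invoice_lines is assumed to carry year_no and month_no derived from date_id.
SELECT
    il.invc_head,
    il.invc_line,
    il.prod_grp,
    il.sales_amount,
    mc.variable_cost
      * il.sales_amount
      / SUM(il.sales_amount) OVER (PARTITION BY il.prod_grp, il.year_no, il.month_no)
      AS allocated_variable_cost
FROM invoice_lines il
JOIN monthly_costs mc
  ON  mc.prod_grp = il.prod_grp
  AND mc.year_no  = il.year_no
  AND mc.month_no = il.month_no;

Summing allocated_variable_cost over any grouping (product group, month, year) reproduces the original cost totals, which is the key property of a proportional allocation.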
