
Best Practices for High Volume Data Migrations

Chandrakant Shankrayya Hiremath


Executive Summary

Often part of large transformation projects, Data Migration projects usually suffer from a lack of careful planning and prioritization. As a result, almost every Data Migration project sees its original estimates of time and cost revised, technical challenges discovered late in the project cycle, and overall customer satisfaction far lower than expected.

Having been part of numerous Data Migration projects for over fifteen years, Tech Mahindra has seen it all. Through the rich experience gained over the years, we have identified certain best practices that have proved instrumental in the success of our Data Migration engagements. In this whitepaper, we present these best practices, which can be applied to various Data Migration scenarios to avoid some common pitfalls. While the whitepaper covers the most general scenarios for Data Migrations, prior familiarity with Data Migration and ETL concepts will help the reader understand the ideas better.

Introduction

This white paper discusses general approaches and best practices for migrating very large volumes of data from legacy systems to a new stack. While it also covers general methods to troubleshoot and optimize the performance of Extraction, Transformation and Loading (ETL) processes and EIM, it is imperative to remember that every implementation of a target stack is unique, and hence every data loading exercise is also unique. It is therefore in the best interest of all parties to test, continually monitor and tune the EIM/loader processes to achieve optimal throughput. It is assumed that the reader is familiar with ETL concepts.

Abstract

A fully integrated migration environment should address the following four key areas:

1. Understand. Comprehensive profiling and auditing of all data sources, using full-volume samples, can eliminate unexpected scenarios during migration.

2. Improve. Poor data quality in source systems should be addressed before or during the migration process. Modern data quality software can be used to restructure, standardize, cleanse, enrich, de-duplicate and reconcile the data. Instead of complex custom code or scripts, a simpler technology that is easy to use and aimed at the analyst or business user should be adopted.

3. Protect. Migrated data will naturally degrade over time until it becomes a problem again. Maintaining and improving the quality of this data is vital to increasing the value that can be derived from the information. Enterprise data needs to be protected from degradation due to errors, incompleteness or duplication. Implementing a data quality solution to monitor data feeds, in both batch and real time, is critical to maintaining the integrity, and therefore the value, of the application.

4. Govern. Regularly tracking and publishing data quality metrics to a dashboard enables senior executives and business users to monitor the progress of data migration projects and data quality initiatives.

This paper champions a platform-driven approach towards Data Migration. The recommended steps for implementing the solution and executing large-scale data transformation projects in a typical platform-driven project are described below.

Extraction

Data extraction is the stage where data is analyzed to identify patterns and retrieve information about the data sources. It is primarily performed on data in multiple formats such as flat files, database tables, dump files and XML files.
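As an illustration, a handful of profiling queries run against a full-volume sample can surface row counts, null densities and duplicate natural keys before any mapping is written; the table and column names below (src_customer, account_no, email, created_date) are hypothetical:

  -- Basic profile of a hypothetical source table
  SELECT COUNT(*)                    AS total_rows,
         COUNT(DISTINCT customer_id) AS distinct_customer_ids,
         COUNT(*) - COUNT(email)     AS null_email_count,
         MIN(created_date)           AS oldest_record,
         MAX(created_date)           AS newest_record
    FROM src_customer;

  -- Duplicate check on the assumed natural key
  SELECT account_no, COUNT(*) AS occurrences
    FROM src_customer
   GROUP BY account_no
  HAVING COUNT(*) > 1;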

Redefining Data for Migration

Regardless of structure, type or format, the source data intended for migration should be validated with respect to the key attributes listed below; an illustrative set of SQL checks follows the list.

– Relevance: Is it relevant to its intended purpose?

– Accuracy: Is it correct and objective, and can it be validated?

– Integrity: Does it have a coherent, logical structure?

– Consistency: Is it consistent and easy to understand?

– Completeness: Does it provide all the information required by the

business?

– Validity: Is it within acceptable parameters for the business?

– Availability: Is the data current and available whenever required?

– Accessibility: Can it be easily accessed and exported to the target

application?

– Compliance: Does it comply with regulatory & legal norms?
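Several of these attributes, completeness, validity and integrity in particular, can be checked directly with SQL before extraction; the tables, columns and status codes below are assumptions used purely for illustration:

  -- Completeness: mandatory attributes that are missing
  SELECT customer_id
    FROM src_customer
   WHERE account_no IS NULL
      OR status_code IS NULL;

  -- Validity: values outside the accepted business domain
  SELECT customer_id, status_code
    FROM src_customer
   WHERE status_code NOT IN ('ACTIVE', 'SUSPENDED', 'CLOSED');

  -- Integrity: child records without a corresponding parent
  SELECT o.order_id
    FROM src_order o
    LEFT JOIN src_customer c ON c.customer_id = o.customer_id
   WHERE c.customer_id IS NULL;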

Data Staging and Loading

Once data is extracted and analyzed, it is brought into the staging phase, where it is loaded into the staging databases that are eventually used to carry out the migration.

Creating Stage Area – Stage Tables

The recommended steps for creating the staging area are listed below; an illustrative DDL sketch follows the list.

– All stage tables should be created in NOLOGGING mode

– Stage history tables should be created to hold previous-run data

– History tables should be partitioned on a combination of business-required columns to handle high data volumes for performance and reporting
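The DDL below sketches these recommendations for a hypothetical customer entity, using run_id to distinguish migration runs and region_code as the business column; partition names and values are illustrative only:

  -- Stage table in NOLOGGING mode
  CREATE TABLE stg_customer (
    customer_id  NUMBER,
    account_no   VARCHAR2(30),
    region_code  VARCHAR2(10),
    run_id       NUMBER,
    load_date    DATE
  ) NOLOGGING;

  -- History table partitioned on a combination of business columns
  CREATE TABLE stg_customer_hist (
    customer_id  NUMBER,
    account_no   VARCHAR2(30),
    region_code  VARCHAR2(10),
    run_id       NUMBER,
    load_date    DATE
  ) NOLOGGING
  PARTITION BY RANGE (run_id)
    SUBPARTITION BY LIST (region_code)
    SUBPARTITION TEMPLATE (
      SUBPARTITION sp_north VALUES ('NORTH'),
      SUBPARTITION sp_south VALUES ('SOUTH'),
      SUBPARTITION sp_other VALUES (DEFAULT)
    )
  (
    PARTITION p_run_01  VALUES LESS THAN (2),
    PARTITION p_run_02  VALUES LESS THAN (3),
    PARTITION p_run_max VALUES LESS THAN (MAXVALUE)
  );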

Data Loading to Stage Tables

For dump files, use the Data Pump impdp utility and follow the Data Pump guidelines mentioned in the extraction section.

For flat files (an illustrative SQL*Loader sketch follows this list):

– Use the SQL*Loader utility to load data into the stage tables

– Use direct path method for best performance

– Avoid using any Oracle or user-defined functions while loading data into the stage tables

– Use parallel loading and configure the parallel sessions accordingly

– Disable all constraints and indexes before the load and re-enable them afterwards; rebuild indexes using a parallel degree

– Statistics should be gathered after every migration run
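A minimal sketch of such a flat-file load is shown below, assuming a hypothetical pipe-delimited customer extract; the control file, data file, table, index and schema names are illustrative, while direct=true and parallel=true are standard SQL*Loader options. For a parallel direct-path load, each session should be given its own data file and indexes should already be disabled or dropped:

  -- customer.ctl : control file for a pipe-delimited extract (names are hypothetical)
  LOAD DATA
  INFILE 'customer_01.dat'
  APPEND
  INTO TABLE stg_customer
  FIELDS TERMINATED BY '|'
  TRAILING NULLCOLS
  (
    customer_id,
    account_no,
    region_code,
    load_date DATE "YYYY-MM-DD"
  )

  # Invocation: direct path load, parallel-capable session
  sqlldr userid=stg_user/stg_pwd control=customer.ctl log=customer_01.log direct=true parallel=true

  -- After the load (from SQL*Plus): rebuild indexes in parallel and gather statistics
  ALTER INDEX stg_customer_ix1 REBUILD PARALLEL 8 NOLOGGING;
  EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'STG_USER', tabname => 'STG_CUSTOMER', degree => 8);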

The end-to-end flow of a platform-driven migration can be summarized in three phases:

01 Preparation
– Data extraction (Oracle and non-Oracle systems)
– Load to staging area
– Rule engine

02 Execution
– Transformation
– Data loading to target systems / output file generation
– EIM loading

03 Validation
– Catch-up load
– Reconciliation


Rule Engine

Rule engines are used to define the business rules that drive the Data Migration process. It is recommended to use a rule engine that is modular, lightweight, flexible and configurable.

A good rule engine should be designed to facilitate the following (a minimal rule-repository sketch follows the list):

– Provide a simple interface to create rule queries at runtime

– Provide flexibility to the user to create, modify, deactivate and execute

new rules whenever needed

– Ensure reusability of rules across multiple components of the framework and

reduce dependency on the software development team for rule development

– Ensure reduced time in terms of release cycle & deployment efforts to

code new rules

– Capture all data validation rules with respect to target-stack validation, and include all mandatory target fields for comparison with the source

– Execute all rules in parallel

– Fine-tune the rule queries, create indexes wherever required

– Update the fallout tables
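A minimal sketch of such a configurable rule repository is shown below. All object names (mig_rule, mig_fallout, run_migration_rules) are hypothetical, and each rule's SQL is assumed to return the keys of failing records in a column named record_key:

  -- Hypothetical rule repository: one row per configurable validation rule
  CREATE TABLE mig_rule (
    rule_id     NUMBER PRIMARY KEY,
    entity_name VARCHAR2(30),
    rule_desc   VARCHAR2(200),
    rule_sql    VARCHAR2(4000),   -- SELECT returning a record_key column for each failing record
    active_flag CHAR(1) DEFAULT 'Y'
  );

  CREATE TABLE mig_fallout (
    run_id      NUMBER,
    rule_id     NUMBER,
    entity_name VARCHAR2(30),
    record_key  VARCHAR2(100),
    logged_on   DATE DEFAULT SYSDATE
  );

  -- Driver: execute every active rule and capture failing records as fallout
  -- (each rule could equally be submitted as a separate scheduler job to run in parallel)
  CREATE OR REPLACE PROCEDURE run_migration_rules (p_run_id IN NUMBER) AS
  BEGIN
    FOR r IN (SELECT rule_id, entity_name, rule_sql
                FROM mig_rule
               WHERE active_flag = 'Y') LOOP
      EXECUTE IMMEDIATE
        'INSERT INTO mig_fallout (run_id, rule_id, entity_name, record_key) '
        || 'SELECT :run_id, :rule_id, :entity_name, record_key FROM (' || r.rule_sql || ')'
        USING p_run_id, r.rule_id, r.entity_name;
    END LOOP;
    COMMIT;
  END run_migration_rules;
  /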

Transformation

Transformation is the stage where a series of rules or functions is applied to the extracted data to derive the data to be loaded into the end target. Though various ETL (Extract, Transform, Load) tools are available in the market for this purpose, we outline best transformation practices both with ETL tools and without them.

Data Transformation with an ETL Tool

Using an ETL tool greatly reduces the effort and hassle involved in data transformation. However, the following steps should still be followed to ensure a successful transformation (a view-mapping sketch follows the list):

– Create views for all transformations and let the ETL tool map each view to its target table

– Fine-tune the queries

– Create a one-on-one mapping between views and target tables

– Capture exceptions

– Build orchestration using the ETL tool
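For example, a transformation view can encapsulate the mapping logic so that the ETL tool only moves rows from the view to its target table. The staging table, columns and target names below are illustrative assumptions:

  -- One view per target table; the ETL tool maps the view 1:1 to the target
  CREATE OR REPLACE VIEW v_tgt_account AS
  SELECT s.account_no                         AS account_number,
         UPPER(TRIM(s.customer_name))         AS account_name,
         NVL(s.region_code, 'UNKNOWN')        AS region,
         TO_DATE(s.activation_dt, 'YYYYMMDD') AS activation_date
    FROM stg_account s
   WHERE s.status_code = 'ACTIVE';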

Data Transformation without an ETL Tool

To carry out data transformation without the help of any ETL tool, the following steps should be taken to ensure a successful transformation (a PL/SQL sketch follows the list):

– Create a target schema in the framework that mirrors the target load schema

– Create all tables in NOLOGGING mode

– Create a separate procedure for each target table

– Use the BULK COLLECT option in all procedures

– Fine-tune the cursors in all procedures

– Capture exceptions

– Gather statistics on the target tables after every migration run
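The sketch below illustrates one such per-table procedure, using BULK COLLECT with a LIMIT clause, FORALL with SAVE EXCEPTIONS, and a hypothetical mig_exceptions table for fallout; all table and column names are assumptions:

  CREATE OR REPLACE PROCEDURE load_tgt_account (p_run_id IN NUMBER) AS
    CURSOR c_src IS
      SELECT account_no, customer_name, region_code
        FROM stg_account
       WHERE run_id = p_run_id;

    TYPE t_src IS TABLE OF c_src%ROWTYPE;
    l_rows      t_src;
    bulk_errors EXCEPTION;
    PRAGMA EXCEPTION_INIT(bulk_errors, -24381);
  BEGIN
    OPEN c_src;
    LOOP
      FETCH c_src BULK COLLECT INTO l_rows LIMIT 10000;   -- chunked fetch
      EXIT WHEN l_rows.COUNT = 0;

      BEGIN
        FORALL i IN 1 .. l_rows.COUNT SAVE EXCEPTIONS
          INSERT INTO tgt_account (account_number, account_name, region)
          VALUES (l_rows(i).account_no,
                  UPPER(TRIM(l_rows(i).customer_name)),
                  NVL(l_rows(i).region_code, 'UNKNOWN'));
      EXCEPTION
        WHEN bulk_errors THEN
          -- log failed rows to a hypothetical exception table and carry on
          FOR j IN 1 .. SQL%BULK_EXCEPTIONS.COUNT LOOP
            INSERT INTO mig_exceptions (run_id, err_index, err_code)
            VALUES (p_run_id,
                    SQL%BULK_EXCEPTIONS(j).ERROR_INDEX,
                    SQL%BULK_EXCEPTIONS(j).ERROR_CODE);
          END LOOP;
      END;

      COMMIT;
    END LOOP;
    CLOSE c_src;
  END load_tgt_account;
  /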

Output File Generation

Once transformation is complete, we need to define and generate the output files. To generate the output files, we should:

– Use Output File Generation module of the platform

– Configure file type, separator and parallel sessions

– Configure the degree of parallelism based on the server configuration

– Use the platform's archival mode for output file archival, in which the platform compresses and archives the output files

– Configure the file size based on requirements. Files can be created based on a configured size or based on the UoM (Unit of Migration). This helps create multiple files while letting the transfer and load processes run in parallel

Reconciliation and Migration Reports

Reconciliation information and reports should be produced at each stage of the migration, as follows:

Reconciliation at Load Staging

Reconciliation performed before and after pre-verification of data in the staging environment, and again after validations and transformation at staging.

Summary Reports

The following count reports should be produced (a count-comparison sketch follows this section):

– Report on counts for each entity: staging table vs. EIM table

– Report on counts for each entity: successful EIM records vs. staging records

Error Reports

Summary and detailed reports that capture the errors occurring at each stage of the migration.

Asset Quality Report

A special reconciliation report used to reconcile the transformed bill plans (all rate plans) table.
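A simple way to produce the staging-versus-EIM count report is a paired count query per entity. The table names below (stg_account, eim_account) are hypothetical, and the IF_ROW_STAT filter assumes a Siebel-style EIM status column:

  -- Counts per entity: staging vs. EIM interface table
  SELECT 'ACCOUNT'                                  AS entity_name,
         (SELECT COUNT(*) FROM stg_account)         AS staging_count,
         (SELECT COUNT(*) FROM eim_account)         AS eim_count,
         (SELECT COUNT(*) FROM eim_account
           WHERE if_row_stat = 'IMPORTED')          AS eim_success_count
    FROM dual;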

Data Loading to Target Systems

Data loading is the next step in the process: loading the transformed data into the target databases. For simplicity, the target databases are grouped into two categories, Oracle databases and non-Oracle databases.

Target as Oracle database

– The platform's target schema should exactly match the target table schema

– Use the Data Pump utility to export and import data into the target schema (see the sketch after this list)

– Use platform adaptors for all supported target data loads

– Compare the target schema's table structures with the platform's target schema before executing the loading scripts

– Use parallel sessions to load data into the target
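A minimal Data Pump sketch is shown below. The schema, directory and file names are hypothetical (dp_dir must exist as an Oracle directory object); SCHEMAS, PARALLEL, REMAP_SCHEMA and TABLE_EXISTS_ACTION are standard expdp/impdp parameters:

  # Export the platform's target-image schema (parallel, multiple dump files)
  expdp mig_user/mig_pwd DIRECTORY=dp_dir DUMPFILE=tgt_%U.dmp LOGFILE=exp_tgt.log SCHEMAS=mig_tgt PARALLEL=4

  # Import into the target schema, remapping the schema name
  impdp tgt_user/tgt_pwd DIRECTORY=dp_dir DUMPFILE=tgt_%U.dmp LOGFILE=imp_tgt.log REMAP_SCHEMA=mig_tgt:crm_owner TABLE_EXISTS_ACTION=APPEND PARALLEL=4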

Target as non-Oracle database

– If the target requires data as flat files, compress the files, then transfer and load them into the target

– Capture the response from the target system back into the platform

– If APIs are provided, trigger them from the platform and capture the responses back into the platform

Catch-up Load

After the data has been consolidated from all stage tables by entity and source system, the delta between target and source data should be identified for all attributes in the scope of the migration, not just for the elements present in the files. With this approach, the required catch-up load can easily be identified across all subject areas.
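For example, the delta for a given entity can be computed as a set difference over all scoped attributes; the consolidated staging and target table names below are hypothetical:

  -- Records (or attribute values) present in the consolidated source but not in the target
  SELECT account_no, status_code, region_code
    FROM stg_account_consolidated
  MINUS
  SELECT account_no, status_code, region_code
    FROM tgt_account;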



Summary - Best Practices for Successful Data Migration

To summarize, there are a few golden rules that must be followed in order to maximize the chances of success of a Data Migration project. These rules are:

– Clearly define the scope of the project

– Understand the current business & technical challenges

– Assess the current data quality levels, identify gaps

– Involve relevant stakeholders from the project start date

– Actively refine the scope of the project through targeted profiling and

auditing

– Minimize the amount of data to be migrated

– Profile and audit all source data in the scope before writing mapping

specifications

– Define a realistic project budget and timeline, based on knowledge of

data issues

– Clearly define the RACI (Responsible, Accountable, Consulted, Informed)

Matrix

– Secure sign-off on each stage from relevant stakeholders

– Prioritize with a top-down, target-driven approach

– Aim to volume-test all data in the scope as early as possible at the unit level

– Aim to run migration functional tests on a data sample that represents the source data

– Aim to test business flows on the migrated data on the target stack

– Allow time for volume testing and issue resolution

– Segment the project into manageable, incremental chunks

– Focus on the business objectives and cost/benefits

About the Author

Chandrakant Shankrayya Hiremath is a Principal Architect for Data Management Services at Tech Mahindra, with over 16 years of experience in

data migrations. He is part of the team that built Tech Mahindra’s highly

acclaimed and copyrighted Data Migration platforms, including the Unified Data

Management Framework (UDMF©). UDMF is a highly flexible, open-source-compliant framework that has successfully delivered migrations in diverse domains such as Telecom, BFSI and Manufacturing, addressing end-to-end Data Migrations, ERP Migrations, Data Archival and Data Quality.

To know more about our capabilities and how we can help your organization

successfully carry out Data Migrations, email us at

[email protected]
