Best Practices for High Volume Data Migrations
Chandrakant Shankrayya Hiremath
Executive Summary
Data Migration projects, often part of large transformation programs, usually suffer from a lack of careful planning and prioritization. As a result, many Data Migration projects see their original time and cost estimates revised, technical challenges discovered late in the project cycle, and overall customer satisfaction far lower than expected.
Having been part of numerous Data Migration projects for over fifteen years, Tech Mahindra has seen it all! Through this rich experience, we have identified certain best practices that have proved instrumental to the success of our Data Migration engagements. In this whitepaper, we present these Best Practices, which can be applied to a wide range of Data Migration scenarios to avoid common pitfalls. While the whitepaper covers the most common Data Migration scenarios, prior familiarity with Data Migration and ETL concepts will help the reader follow the ideas more easily.
Introduction
This whitepaper discusses general approaches and Best Practices for migrating very large volumes of data from legacy systems to a new stack. While it also covers general methods to troubleshoot and optimize the performance of Extract, Transform, Load (ETL) and EIM processes, it is imperative to remember that every implementation of a target stack is unique, and hence every data-loading exercise is also unique. It is therefore in the best interest of all parties to test, continually monitor and tune the EIM/loader processes to achieve optimal throughput. It is assumed that the reader is familiar with ETL concepts.
Abstract
A fully integrated migration environment should address the following four key areas:
1. Understand. Comprehensive profiling and auditing of all data sources from
full-volume samples can eliminate unexpected scenarios during migration
2. Improve. Poor data quality in source systems should be addressed before or
during the migration process. Modern data quality software can be used to
restructure, standardize, cleanse, enrich, de-duplicate, and reconcile the
data. Instead of complex custom code or scripts, adopt a simpler technology
that is easy to use and aimed at the analyst or business user
3. Protect. Migrated data will naturally degrade over time, until it becomes a
problem again. Maintaining and improving the quality of this data is vital to
increasing the value that can be derived from the information. Enterprise
data needs to be protected from degradation due to errors, incompleteness
or duplication. Implementing a data quality solution to monitor data feeds—in
both batch and real time—is critical to maintaining the integrity and
therefore the value of the application.
4. Govern. Regularly tracking and publishing data quality metrics to a
dashboard enables senior executives and business users to monitor the
progress of data migration projects or data quality initiatives
This paper champions a platform-driven approach to Data Migration. The recommended steps to execute large-scale data transformation projects in a typical platform-driven project are listed below.
Extraction
Data extraction is the stage where data is analyzed to identify patterns and retrieve information about data sources. It is primarily performed on data in multiple formats such as flat files, tables, dump files, XML files, etc.
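As a sketch of what this analysis can look like in practice, a simple SQL profiling query can surface row counts, fill rates and cardinality for key attributes before any mapping specification is written. The table and column names below are hypothetical:

```sql
-- Hypothetical profiling query for a source table: total rows, how many
-- rows have the attribute populated, how many distinct values exist,
-- and the null percentage -- all useful for spotting surprises early
SELECT COUNT(*)                                            AS total_rows,
       COUNT(cust_name)                                    AS cust_name_filled,
       COUNT(DISTINCT cust_name)                           AS cust_name_distinct,
       ROUND(100 * (1 - COUNT(cust_name) / COUNT(*)), 2)   AS cust_name_null_pct
FROM   src_customer;
```

Running a query of this shape for every in-scope attribute gives the full-volume profile that the Understand step calls for.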
Redefining Data for Migration
Regardless of structure, type or format, the source data intended for migration should be validated with respect to the key attributes mentioned below:
– Relevance: Is it relevant to its intended purpose?
– Accuracy: Is it correct and objective, and can it be validated?
– Integrity: Does it have a coherent, logical structure?
– Consistency: Is it consistent and easy to understand?
– Completeness: Does it provide all the information required by the
business?
– Validity: Is it within acceptable parameters for the business?
– Availability: Is the data current and available whenever required?
– Accessibility: Can it be easily accessed and exported to the target
application?
– Compliance: Does it comply with regulatory & legal norms?
Data Staging and Loading
Once data is extracted and analyzed, it enters the staging phase, where it is loaded into the staging databases that are eventually used to carry out the migration.
Creating the Stage Area – Stage Tables
The recommended steps for creating the staging area are listed below:
– All stage tables should be created in no-logging mode
– Stage history tables should be created to hold data from previous runs
– For high data volumes, history tables should be partitioned on a combination of business-required columns to aid performance and reporting
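The staging guidelines above can be sketched in Oracle DDL. The table names, columns and partition boundaries below are illustrative assumptions, not part of any specific platform:

```sql
-- Hypothetical stage table, created NOLOGGING to minimize redo during loads
CREATE TABLE stg_customer (
  src_id     NUMBER,
  cust_name  VARCHAR2(200),
  run_id     NUMBER,
  load_date  DATE
) NOLOGGING;

-- Matching history table, range-partitioned on a business column
-- (here the load date) so high-volume reporting stays fast
CREATE TABLE stg_customer_hist (
  src_id     NUMBER,
  cust_name  VARCHAR2(200),
  run_id     NUMBER,
  load_date  DATE
) NOLOGGING
PARTITION BY RANGE (load_date) (
  PARTITION p_2024_q1 VALUES LESS THAN (DATE '2024-04-01'),
  PARTITION p_2024_q2 VALUES LESS THAN (DATE '2024-07-01'),
  PARTITION p_max     VALUES LESS THAN (MAXVALUE)
);
```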
Data Loading to Stage Tables
For dump files, use the Data Pump impdp utility and follow the Data Pump guidelines mentioned in the Extraction section.
For flat files, use the SQL*Loader utility to load data into the stage tables:
– Use the direct path method for best performance
– Avoid using any Oracle or user-defined functions while loading data into stage tables
– Use parallel loading, and configure the number of parallel sessions
– Disable all constraints and indexes before the load and enable them after the load; recreate indexes with a parallel degree
– Gather statistics after every migration run
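As an illustrative sketch of the pre- and post-load steps around a direct path SQL*Loader run (the object names are hypothetical):

```sql
-- Before the load: disable constraints and mark indexes unusable
ALTER TABLE stg_customer DISABLE CONSTRAINT stg_customer_pk;
ALTER INDEX stg_customer_ix UNUSABLE;

-- Load with SQL*Loader in direct path mode, e.g. (run from the shell):
--   sqlldr userid=mig/secret control=stg_customer.ctl direct=true parallel=true

-- After the load: rebuild indexes with a parallel degree,
-- re-enable constraints, and refresh optimizer statistics
ALTER INDEX stg_customer_ix REBUILD PARALLEL 8 NOLOGGING;
ALTER TABLE stg_customer ENABLE CONSTRAINT stg_customer_pk;

-- Gather statistics after the run (SQL*Plus syntax shown)
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'STG_CUSTOMER');
```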
The platform-driven approach is organized into three phases:
01 Preparation – Data extraction (Oracle and non-Oracle systems), loading to the staging area, rule engine
02 Execution – Transformation, data loading to target systems / output file generation, EIM loading
03 Validation – Catch-up load, reconciliation
Rule Engine
Rule engines are used to define the business rules that drive the data migration process. It is recommended to use a rule engine that is modular, lightweight, flexible and configurable.
A good Rule Engine should be designed to facilitate the following:
– Provide a simple interface to create rule queries at runtime
– Provide flexibility to the user to create, modify, deactivate and execute
new rules whenever needed
– Ensure reusability of rules across multiple components of the framework and
reduce dependency on the software development team for rule development
– Ensure reduced time in terms of release cycle & deployment efforts to
code new rules
– Capture all data validation rules for target stack validation, including all mandatory target fields for comparison with the source
– Execute all rules in parallel
– Fine-tune the rule queries, create indexes wherever required
– Update the fallout tables
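One lightweight way to meet these requirements is to store rules as data rather than code, so that analysts can create, modify and deactivate rules without a release cycle, and a single driver executes whatever is active. The tables, sample rule and driver procedure below are a hypothetical sketch, not a specific product's design:

```sql
-- Rules are rows, not code: each rule is a query that selects the
-- keys of records failing the validation
CREATE TABLE mig_rules (
  rule_id     NUMBER PRIMARY KEY,
  rule_name   VARCHAR2(100),
  rule_sql    CLOB,                 -- query returning keys of failing rows
  active_flag CHAR(1) DEFAULT 'Y'
);

-- Example rule: a mandatory target field must be populated in the source
INSERT INTO mig_rules (rule_id, rule_name, rule_sql)
VALUES (1, 'CUST_NAME_MANDATORY',
        'SELECT src_id FROM stg_customer WHERE cust_name IS NULL');

-- Driver: run every active rule and record fallouts
CREATE OR REPLACE PROCEDURE run_rules AS
BEGIN
  FOR r IN (SELECT rule_id, rule_sql FROM mig_rules WHERE active_flag = 'Y')
  LOOP
    EXECUTE IMMEDIATE
      'INSERT INTO mig_fallout (rule_id, src_id) ' ||
      'SELECT ' || r.rule_id || ', src_id FROM (' || r.rule_sql || ')';
  END LOOP;
  COMMIT;
END;
/
```

Because each rule is an independent query, the driver can also be parallelized per rule, and tuning a slow rule is an ordinary SQL tuning exercise.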
Transformation
Transformation is the stage where a series of rules or functions is applied to the extracted data to derive the data to be loaded into the end target. Though various ETL (Extract, Transform, Load) tools are available in the market for this purpose, we outline best transformation practices both with and without ETL tools.
Data Transformation with an ETL Tool
Using an ETL tool greatly reduces the effort and hassle involved in data transformation. However, the following steps should be ensured for a successful transformation:
– Create views for all transformations and allow the ETL tool to
map the view and target table
– Fine-tune the queries
– Create a one-on-one mapping between views and target tables
– Capture exceptions
– Build orchestration using the ETL tool
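The view-per-target-table pattern above might look like the sketch below; the staging and target names are assumptions for illustration. Keeping all transformation logic in the view preserves the one-on-one mapping and makes tuning a pure SQL exercise:

```sql
-- Hypothetical one-to-one view for a target table: the ETL tool only
-- maps v_tgt_customer -> tgt_customer, while all derivation logic
-- (trimming, casing, defaulting) lives in the view
CREATE OR REPLACE VIEW v_tgt_customer AS
SELECT s.src_id                   AS cust_id,
       TRIM(UPPER(s.cust_name))   AS cust_name,
       NVL(s.load_date, SYSDATE)  AS created_dt
FROM   stg_customer s
WHERE  s.src_id IS NOT NULL;
```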
Data Transformation without an ETL Tool
To carry out data transformation without the help of an ETL tool, the following steps should be taken to ensure a successful transformation:
– Create target schema in the framework similar to target load schema
– Create all tables in “no logging” mode
– Create a separate procedure for each target table
– Use bulk collect option in all procedures
– Fine-tune the cursors in all procedures
– Capture exceptions
– Gather stats on target table after every migration run
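A minimal sketch of such a per-table procedure, assuming illustrative staging, target and error-log tables, could look as follows. BULK COLLECT with a LIMIT keeps memory bounded on high volumes, and FORALL performs the array insert:

```sql
CREATE OR REPLACE PROCEDURE transform_customer AS
  -- Cursor holds the transformation logic for this one target table
  CURSOR c_src IS
    SELECT src_id, TRIM(UPPER(cust_name)) AS cust_name
    FROM   stg_customer;
  TYPE t_rows IS TABLE OF c_src%ROWTYPE;
  l_rows t_rows;
BEGIN
  OPEN c_src;
  LOOP
    -- Fetch in bounded batches rather than row by row
    FETCH c_src BULK COLLECT INTO l_rows LIMIT 10000;
    EXIT WHEN l_rows.COUNT = 0;
    FORALL i IN 1 .. l_rows.COUNT
      INSERT INTO tgt_customer (cust_id, cust_name)
      VALUES (l_rows(i).src_id, l_rows(i).cust_name);
  END LOOP;
  CLOSE c_src;
  COMMIT;
EXCEPTION
  WHEN OTHERS THEN
    -- Capture the exception into a fallout log instead of failing silently
    INSERT INTO mig_error_log (proc_name, err_msg, err_time)
    VALUES ('TRANSFORM_CUSTOMER', SQLERRM, SYSDATE);
    COMMIT;
END transform_customer;
/
```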
Output File Generation
Once transformation is complete, we need to define and generate the output file. To generate the output file, we should:
– Use the Output File Generation module of the platform
– Configure the file type, separator and parallel sessions
– Configure the degree of parallelism based on the server configuration
– Use the platform's archival mode for output file archival, where the platform compresses and archives the output files
– Configure the file size based on the requirements. Files can be created based on a configured size or on the UoM (Unit of Migration). This helps in creating multiple files while the transfer and load processes run in parallel
Reconciliation at Load Staging
This is the reconciliation done before and after pre-verification of data in the staging environment.
Reconciliation after Validations & Transformation at Staging
Summary reports:
– Report on counts for each entity in the staging table vs. the EIM table
– Report on counts for each entity for successful EIM records vs. staging records
Error Reports
Summary and detailed reports that capture the errors occurring at each stage of the migration.
Asset Quality Report
This is a special reconciliation report used for reconciling the transformed bill plans (all rate plans) table.
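A count-reconciliation report of the kind described above can be produced with a simple query per entity. The staging and EIM table names, and the 'IMPORTED' status value, are assumptions for illustration and should be replaced with the actual target conventions:

```sql
-- Hypothetical staging vs. EIM count reconciliation for one entity
SELECT 'CUSTOMER'                                   AS entity,
       (SELECT COUNT(*) FROM stg_customer)          AS staging_cnt,
       (SELECT COUNT(*) FROM eim_contact)           AS eim_cnt,
       (SELECT COUNT(*) FROM eim_contact
         WHERE if_row_stat = 'IMPORTED')            AS eim_success_cnt
FROM dual;
```

Running this for every entity, and publishing the results after each run, gives the summary report that staging-vs-EIM reconciliation requires.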
Data Loading to Target Systems
Data loading is the next step in the process: loading the transformed data into the target databases. For simplicity, the target databases are divided into two categories: Oracle databases and non-Oracle databases.
Target as Oracle database
– The platform’s target schema should exactly match the target table schema
– Use “DATA PUMP” utility to export and import data to target schema
– Use platform adaptors for all supported target data loads
– Compare target schema table structure to the platform target
schema before executing loading scripts
– Use parallel sessions to load data onto target
Target as a non-Oracle database
– If target requires data in flat files, compress the file transfer and load
onto target
– Capture the response from target system back into the platform
– If APIs are provided, trigger them from the platform and capture
response back to the platform
Catch-up Load
After the data has been consolidated from all stage tables by entity and source system, the delta between target and source data should be identified for all attributes in the migration scope, not just for the elements in the files. With this approach, we can easily identify the catch-up load required across all subject areas.
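A simple way to surface this delta in SQL is a set difference over all scoped attributes, so that changed rows are caught as well as missing ones. The table and column names are illustrative:

```sql
-- Hypothetical catch-up delta: rows whose scoped attributes differ
-- between the consolidated stage data and the target. MINUS compares
-- every selected column, so an attribute change surfaces the row
-- just as a missing row does.
SELECT src_id, cust_name
FROM   stg_customer
MINUS
SELECT cust_id, cust_name
FROM   tgt_customer;
```

The result set is exactly the candidate catch-up load for this entity; repeating the check per subject area covers the full migration scope.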
Reconciliation and Migration Reports
Reconciliation information and reports, as described in the reconciliation sections above, should be produced at each stage of the migration.
Summary - Best Practices for Successful Data Migration
To summarize, there are a few golden rules that must be followed in order to maximize the chances of success of a Data Migration project. These rules are:
– Clearly define the scope of the project
– Understand the current business & technical challenges
– Assess the current data quality levels, identify gaps
– Involve relevant stakeholders from the project start date
– Actively refine the scope of the project through targeted profiling and
auditing
– Minimize the amount of data to be migrated
– Profile and audit all source data in the scope before writing mapping
specifications
– Define a realistic project budget and timeline, based on knowledge of
data issues
– Clearly define the RACI (Responsible, Accountable, Consulted, Informed)
Matrix
– Secure sign-off on each stage from relevant stakeholders
– Prioritize with a top-down, target-driven approach
– Aim to volume-test all data in the scope as early as possible at the unit level
– Aim to run migration functional tests on a data sample that is representative of the source data
– Aim to test business flows on migrated data on target stack
– Allow time for volume testing and issue resolution
– Segment the project into manageable, incremental chunks
– Focus on the business objectives and cost/benefits
About the Author
Chandrakant Shankrayya Hiremath is a Principal Architect for Data Management Services at Tech Mahindra, with over 16 years of experience in data migrations. He is part of the team that built Tech Mahindra’s highly acclaimed and copyrighted Data Migration platforms, including the Unified Data Management Framework (UDMF©). UDMF is a highly flexible, open-source-compliant framework that has successfully delivered migrations in domains as varied as Telecom, BFSI and Manufacturing, addressing end-to-end Data Migrations, ERP Migrations, Data Archival and Data Quality.
To know more about our capabilities and how we can help your organization successfully carry out Data Migrations, email us at DMS.