OWB Data Quality – Best Practices
Jean-Pierre Dijcks
December 2008
Agenda
• Building a data quality firewall
• The importance of data rules
• The difference between profiling and auditing
Basic system architecture
[Diagram: data sources (Siebel CRM, Oracle EBS, PeopleSoft, SAP/R3, other sources, message queues); staging data layer; operational data layer; performance data layer]
Building a data quality firewall
[Diagram: data sources (Siebel CRM, Oracle EBS, PeopleSoft, SAP/R3, other sources, message queues); staging data layer; operational data layer; performance data layer; annotations: data profiling, schema and data type correction, stage 2 data correction, data audits, data governance]
Building a data quality firewall
[Diagram: data sources (Siebel CRM, Oracle EBS, PeopleSoft, SAP/R3, other sources, message queues); staging data layer; operational data layer; performance data layer; profile workspace; caption: move sample data to profile workspace]
Schema and Data Type Correction
• Leverage data profiling for
  • Generating the staging area tables
  • Schema corrections
  • Data type corrections (enforce real data types)
[Diagram: Oracle EBS source; staging data layer; schema and data type correction; callouts: profile data, discuss with business users, untangle for lookups or recoding]
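The data type correction idea on this slide can be illustrated outside OWB. The following is a minimal Python sketch, not OWB's profiler: it assumes staged values arrive as strings and proposes the first candidate type that every non-empty value fits. The function name `infer_column_type`, the candidate type names, and the ISO date format are assumptions made for illustration.

```python
from datetime import datetime


def _is_int(value: str) -> bool:
    try:
        int(value)
        return True
    except ValueError:
        return False


def _is_decimal(value: str) -> bool:
    try:
        float(value)
        return True
    except ValueError:
        return False


def _is_date(value: str) -> bool:
    # Assumption: dates are staged as ISO strings (YYYY-MM-DD).
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def infer_column_type(values):
    """Propose a type for a column whose values were staged as strings.

    None and blank strings are treated as NULL and ignored.
    """
    present = [v.strip() for v in values if v is not None and v.strip()]
    if not present:
        return "UNKNOWN"
    candidates = (("INTEGER", _is_int), ("DECIMAL", _is_decimal), ("DATE", _is_date))
    for type_name, check in candidates:
        if all(check(v) for v in present):
            return type_name
    return "VARCHAR"
```

A proposed type is only a starting point; as the slide notes, the result still needs to be discussed with business users before the staging tables are generated.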
Anatomy of the operational data layer
Goal:
• Create lowest grain data for reporting
• Create a schema to service all applications with correct data
• Act as source for performance layer
Characteristics:
• De-normalized but still close to 3NF
• Relationships established and enforced
• Data corrected and de-duplicated
• Permanent data
[Diagram: operational data layer]
Loading the operational layer
• Leverage in-database architecture
  • Do all the hard work here!
  • Load between schemas, not databases
  • Huge performance gains through OWB architecture
• Embed data quality into the loads
  • Create a data quality firewall
• Strictly enforce all required rules
  • Document all erroneous data and correct if desired
• Do matching and merging to create uniqueness from many data flows
  • Create master data records
  • Re-code as necessary
  • Re-key as necessary
  • Keep cross references
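The match-and-merge bullets can be sketched in a few lines of Python. This is an illustration only, not OWB's Match-Merge operator: it assumes two records describe the same entity when their normalized email addresses are equal, and it lets the first record seen supply the master values. The record layout and the names `normalize` and `match_and_merge` are hypothetical.

```python
def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivially different spellings match."""
    return " ".join(text.lower().split())


def match_and_merge(records):
    """Build master records and a cross-reference from source keys to master keys.

    Each record is a dict with 'source', 'source_key', 'name' and 'email'.
    Assumption: equal normalized email means the same real-world entity.
    """
    masters = {}       # master_key -> master record
    by_match_key = {}  # normalized email -> master_key
    xref = {}          # (source, source_key) -> master_key
    for rec in records:
        match_key = normalize(rec["email"])
        master_key = by_match_key.get(match_key)
        if master_key is None:
            master_key = len(masters) + 1  # re-key with a new surrogate key
            by_match_key[match_key] = master_key
            masters[master_key] = {"name": rec["name"], "email": match_key}
        # Keep the cross reference so every source row can be traced.
        xref[(rec["source"], rec["source_key"])] = master_key
    return masters, xref


records = [
    {"source": "SIEBEL", "source_key": "C-17", "name": "Ann Lee", "email": "Ann.Lee@example.com"},
    {"source": "EBS", "source_key": "10045", "name": "LEE, ANN", "email": " ann.lee@example.com "},
    {"source": "EBS", "source_key": "10046", "name": "Bob Ray", "email": "bob.ray@example.com"},
]
masters, xref = match_and_merge(records)
```

Here the two "Ann Lee" rows collapse into one master record, while the cross reference still maps both source keys to it. A real implementation would use fuzzier match rules and explicit survivorship rules for choosing the master values.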
Data Quality Firewall
Cleanse:
• De-duplicate incoming data
• Fix data issues
  • Name and address
  • String comparisons
Protect:
• Enforce referential integrity
• Enforce data rules
• Enforce data types and conversions
Report:
• Data issues
• Quality levels
• Quality trends
[Diagram: operational data layer with Protect, Cleanse, and Report]
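The Protect and Report roles can be illustrated with a small Python sketch. It is not how OWB implements data rules; it only shows the idea of checking every incoming row against named rules, setting failing rows aside together with the rules they broke, and deriving a quality level. The function `apply_firewall`, the rule names, and the sample rows are hypothetical.

```python
def apply_firewall(rows, rules):
    """Split rows into accepted and rejected, and compute a quality level.

    rules maps a rule name to a predicate that returns True for a valid row.
    """
    accepted, rejected = [], []
    for row in rows:
        failed = [name for name, check in rules.items() if not check(row)]
        if failed:
            # Document the erroneous row together with the rules it broke.
            rejected.append({"row": row, "failed_rules": failed})
        else:
            accepted.append(row)
    quality = len(accepted) / len(rows) if rows else 1.0
    return accepted, rejected, quality


known_customers = {1, 2}
rules = {
    "amount_is_positive": lambda r: r["amount"] > 0,
    # A simple stand-in for a referential integrity check.
    "customer_exists": lambda r: r["customer_id"] in known_customers,
}
rows = [
    {"order_id": 100, "customer_id": 1, "amount": 25.0},
    {"order_id": 101, "customer_id": 9, "amount": 10.0},
    {"order_id": 102, "customer_id": 2, "amount": -5.0},
    {"order_id": 103, "customer_id": 2, "amount": 40.0},
]
accepted, rejected, quality = apply_firewall(rows, rules)
```

Only the accepted rows would be loaded into the operational data layer; the rejected list feeds the "data issues" report and the quality value feeds "quality levels".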
Feeding non-DW systems
• Always load from the operational layer
• Delivers flexibility and lowest grain to external systems
• Aggregate on the way out if required (not typical)
• Delivers clean data, with measured service levels for DQ
[Diagram: operational data layer with Protect, Cleanse, and Report]
Data Quality
The importance of data rules
[Diagram: steps 1) Profile, 2) Generate, 3) Operate, 4) Monitor, 5) Report; labels: know your data; data rules; correction mappings; data auditors; coherent data; audit results and trends; information: trust your data; ignorance: fear your data]
Data Quality
Data Profiling – Unique Capabilities
• Complete offering
• Two usage modes:
  • Use to investigate unknown data
  • Use to validate known business rules against real data
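The two usage modes can be made concrete with a small Python sketch. It does not reproduce OWB's profiler; `profile_column` illustrates investigating unknown data (null counts, distinct counts, and character patterns), and `rule_compliance` illustrates validating a known business rule against real values. All names and the sample ZIP code rule are assumptions.

```python
from collections import Counter


def value_pattern(value: str) -> str:
    """Map digits to '9' and letters to 'A'; keep every other character."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def profile_column(values):
    """Mode 1: describe a column you know nothing about."""
    present = [v for v in values if v not in (None, "")]
    return {
        "rows": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "patterns": Counter(value_pattern(v) for v in present),
    }


def rule_compliance(values, rule):
    """Mode 2: fraction of non-null values that satisfy a known rule."""
    present = [v for v in values if v not in (None, "")]
    if not present:
        return 1.0
    return sum(1 for v in present if rule(v)) / len(present)


zip_codes = ["94065", "94065", "9406", None, "AB123"]
profile = profile_column(zip_codes)
compliance = rule_compliance(zip_codes, lambda v: len(v) == 5 and v.isdigit())
```

In this sample the pattern counts expose the malformed values, and the five-digit rule holds for half of the non-null values.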
Data Profiling vs. Data Auditing
Data Profiling:
• Ad hoc when required
• Discovery in search of unknowns
• Time consuming
• Resource intensive
Data Auditing:
• Continuous processes
• Planned to be done repetitively
• Gathers information over time
• Small tasks
Both serve the same purpose through different means
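The "small task, repeated, gathering information over time" character of auditing can be sketched as follows. This is a hypothetical Python illustration, not an OWB data auditor: each run measures compliance with one rule, the result is appended to a history, and the trend is the change between the two most recent runs.

```python
from datetime import date


def audit(values, rule):
    """One small audit run: share of values that satisfy the rule."""
    if not values:
        return 1.0
    return sum(1 for v in values if rule(v)) / len(values)


def record_audit(history, run_date, values, rule):
    """Append the result of an audit run to the history (a list of tuples)."""
    history.append((run_date, audit(values, rule)))
    return history


def trend(history):
    """Change in compliance between the two most recent runs, or None."""
    if len(history) < 2:
        return None
    return history[-1][1] - history[-2][1]


def has_email(value):
    # Hypothetical rule: the email column must be populated.
    return value is not None and value != ""


history = []
record_audit(history, date(2008, 12, 1), ["a@x", "b@x", "c@x", "d@x"], has_email)
record_audit(history, date(2008, 12, 2), ["a@x", None, "c@x", None], has_email)
change = trend(history)
```

A negative change signals that quality dropped between loads, which is the kind of trend the audit history is meant to surface.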
Demonstration
Performance Tips for Data Profiling
• Data profiling is a highly processor- and I/O-intensive process
• Run large profiles (more than 10M rows in a table) on multi-processor machines
• Use parallel:
  • OWB uses /*+ PARALLEL(<TBL>) */ hints in DP queries
  • Default degree of parallelism is picked up from the database
• Balance your configuration
  • Stripe data across disks using ASM
  • Make sure I/O and CPU capacity are at least roughly in balance
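To show what a hinted profiling query might look like, here is a small Python sketch that builds one as a string. The SQL is a hypothetical example and not the statement OWB actually generates; only the `/*+ PARALLEL(<TBL>) */` hint form comes from the slide. The function `profiling_query` and the selected aggregates are assumptions.

```python
import re

# Accept only simple unquoted identifiers so names can be embedded safely.
_IDENT = re.compile(r"^[A-Za-z][A-Za-z0-9_$#]*$")


def profiling_query(table, column, degree=None):
    """Return a column-statistics query carrying a PARALLEL hint.

    With degree=None the hint names only the table, so the database
    chooses the degree of parallelism.
    """
    for name in (table, column):
        if not _IDENT.match(name):
            raise ValueError(f"not a simple identifier: {name!r}")
    if degree is None:
        hint = f"PARALLEL({table})"
    else:
        hint = f"PARALLEL({table}, {int(degree)})"
    return (
        f"SELECT /*+ {hint} */ COUNT(*) AS row_count, "
        f"COUNT({column}) AS non_null_count, "
        f"COUNT(DISTINCT {column}) AS distinct_count "
        f"FROM {table}"
    )
```

Calling `profiling_query("CUSTOMERS", "ZIP")` leaves the degree to the database default, which matches the behavior described on the slide; passing `degree=4` would pin it explicitly.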
Performance Tips for Data Profiling
• When loading the workspace you are moving lots of data, so optimize this:
  • Place the profile workspace in the same database as the source data
  • Enable the source tables for parallel reads
  • Consider moving the data with regular OWB maps first, or use Transportable Tablespaces or Data Pump
• Memory:
  • SGA should be no less than 500 MB, preferably around 2-3 GB for most profiles
  • Buffer cache hit ratio >95%
  • Library cache hit ratio >99%
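The two hit-ratio targets can be checked with simple arithmetic. The Python sketch below uses the commonly cited formulas: buffer cache hit ratio as 1 minus physical reads divided by logical reads, and library cache hit ratio as pin hits divided by pins. It assumes the counters were already sampled from the database (for example from `V$SYSSTAT` and `V$LIBRARYCACHE`); the function names and sample numbers are illustrative.

```python
def buffer_cache_hit_ratio(db_block_gets, consistent_gets, physical_reads):
    """1 - physical reads / logical reads; None when there were no reads."""
    logical_reads = db_block_gets + consistent_gets
    if logical_reads == 0:
        return None
    return 1 - physical_reads / logical_reads


def library_cache_hit_ratio(pins, pin_hits):
    """Share of library cache pins that were satisfied from the cache."""
    if pins == 0:
        return None
    return pin_hits / pins


def check_targets(buffer_ratio, library_ratio):
    """Compare measured ratios with the targets given on the slide."""
    return {
        "buffer_cache_ok": buffer_ratio > 0.95,
        "library_cache_ok": library_ratio > 0.99,
    }


buffer_ratio = buffer_cache_hit_ratio(2000, 8000, 300)
library_ratio = library_cache_hit_ratio(1000, 995)
status = check_targets(buffer_ratio, library_ratio)
```

With these sample counters the buffer cache ratio is about 97% and the library cache ratio 99.5%, so both targets are met.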
Further Reference Material
• http://blogs.oracle.com/warehousebuilder
  • Data Quality posts about:
    • Using data rules for Referential Integrity
    • Key Quality Indicators
    • Match and Merge
• Demonstrations on OTN
  • Data Profiling and Corrections
  • Fuzzy match and merging
  • Name and address cleansing
• Training
  • Extending your Knowledge (data profiling hands-on)
New Features for DQ
Beta Program for 11gR2
If you are interested in the beta please contact the OWB product management team:
• Michelle Bird ([email protected])
Or go directly to: http://otnbeta.oracle.com/bpo/prospects/index.htm
Make sure to mention Michelle as sponsor.
Questions
High-Level Data Integration Roadmap
Natural Upgrade Path for Existing Solutions
[Timeline: Spring 2009, Summer 2009, CY2010, CY2011; milestones: Unified Team, Unified Platform]
• OWB/ODI Investments are Fully Protected
  • No Forced Migrations
  • Natural Upgrade Path
  • Unified Platform aims to be a Superset of Existing Products – no regression