© Copyright IBM Corporation 2009. Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3
QualityStage 8 Essentials
DX741
Copyright, Disclaimer of Warranties and Limitation of Liability
© Copyright IBM Corporation February 2007
IBM Software Group
One Rogers Street
Cambridge, MA 02142
All rights reserved. Printed in the United States.
IBM and the IBM logo are registered trademarks of International Business Machines Corporation.
The following are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both:
AnswersOnLine DynamicServer, WorkgroupEdition RedBrick Decision ServerAIX Enterprise Storage Server RedBrickMineBuilderAPPN FFST/2 RedBrickDecisionscapeAS/400 Foundation.2000 RedBrickReadyBookMaster Illustra RedBrickSystemsC-ISAM Informix RelyonRedBrickClient SDK Informix4GL S/390Cloudscape InformixExtendedParallelServer SequentConnection Services InformixInternet Foundation.2000 SPDatabase Architecture Informix RedBrick Decision Server System ViewDataBlade J/Foundation TivoliDataJoiner MaxConnect TMEDataPropagator MVS UniDataDB2 MVS/ESA UniData&DesignDB2 Connect Net.Data UniversalDataWarehouseBlueprintDB2 Extenders NUMA-Q UniversalDatabaseComponentsDB2 Universal Database ON-Bar UniversalWebConnectDistributed Database OnLineDynamicServer UniVerseDistributed Relational OS/2 VirtualTableInterfaceDPI OS/2 WARP VisionaryDRDA OS/390 VisualAgeDynamicScalableArchitecture OS/400 WebIntegrationSuiteDynamicServer PTX WebSphereDynamicServer.2000 QBIC WebSphere DataStageDynamicServer with Advanced DecisionSupportOption QMFDynamicServer with Extended ParallelOption RAMACDynamicServer with UniversalDataOption RedBrickDesignDynamicServer with WebIntegrationOption RedBrickDataMine
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
Java, JDBC, and all Java-based trademarks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
All other product or brand names may be trademarks of their respective companies.
All information contained in this document has not been submitted to any formal IBM test and is distributed on an “as is” basis without any warranty either express or implied. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer’s ability to evaluate and integrate them into the customer’s operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will result elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk. The original repository material for this course has been certified as being Year 2000 compliant.
This document may not be reproduced in whole or in part without the prior written permission of IBM.
Note to U.S. Government Users – Documentation related to restricted rights – Use, duplication, or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corp.
Course Contents
●Data Quality Issues
●QualityStage 8 Architecture
●Developing with QualityStage
●Investigation
●Standardize
●Match
●Survivorship
●Special Topics
●Globalization (NLS)
●Address Verification Stage
Course contents
●Data quality issues
●Information Server purpose and architecture
●Introduction to DataStage and QualityStage
●Investigation
●Standardization
●Match
●Survivorship
●Special Topics
– Data quality methodology
– QualityStage Migration Tool
Data Quality Issues
Unit objectives
●After completing this unit, you should be able to:
– List the five common data quality contaminants
– Describe each of the following processes:
• Investigation
• Standardization
• Match
• Survivorship
Data quality challenges
●Different or inconsistent standards in structure, format or values
●Missing data, default values
●Spelling errors, data in wrong fields
●Buried information
●Data anomalies
Data quality – why do we care?
●Accurate reports
●Accurate information for support operations
●Support development of applications that go beyond the original scope for which the data was designed
– Master Data Management
– Data Warehouse
– Analytical applications
Example - Master Data Management
[Diagram: Source 1, Source 2, and Source 3 are aligned, harmonized, and consolidated into a consolidated customer view.]
Different or inconsistent standards
Records from three sources (Source 1, Source 2, Source 3):
Name Field             Location
MARC DILORENZO ESQ     BOSTON
MRS DENNIS MARIO       HARTFORD
MR & MRS T. ROBERTS    CHICAGO

DILORENZO, MARK        6793
MARIO, DENISE          0215
ROBERTS, TOM & MARY    8721

MARK DI LORENZO        MA93
DENIS E. MARIO         CT15
TOM & MARY ROBERTS     IL21
Missing data & default values
Do the field values match the meta data labels?
NAME column values:
Denise Mario DBA
Marc Di Lorenzo ETAL
Tom & Mary Roberts
First Natl Provident
Astorial Fedrl Savings
Kevin Cooke, Receiver
John Doe Trustee for K
SOC. SEC. # column values:
228-02-1975
999999999
025-37-1888
34-2671434
101010101
LN#12-756
18-7534216
111111111
TELEPHONE column values:
6173380300
3380321
415-392-2000
508-466-1200
212-235-1000
FAX 528-9825
5436
Buried information
Legacy Meta Desc.: NAME 1, ADDRESS 1, ADDRESS 2, ADDRESS 3, ADDRESS 4, ADDRESS 5
Legacy Record Values: Robert A. Jones TTE Robert Jones Jr. First Natl Provident FBO Elaine & Michael Lincoln UTADTD 3-30-89 59 Via Hermosa c/o Colleen Mailer Esq Seattle, WA 98101-2345
The anomalies nightmare
CUSNUM     NAME                    ADDRESS                          SALES $
90328574   IBM                     187 N.Pk. Str. Salem NH 01456     8,494.00
90328575   I.B.M. Inc.             187 N.Pk. St. Sarem NH 01456      3,432.00
90238495   International Bus. M.   187 No. Park St Salem NH 04156    2,243.00
90233479   Int. Bus. Machines      187 Park Ave Salem NH 04156       5,900.00
90233489   Inter-Nation Consults   15 Main St. Andover MA 02341      6,800.00
90234889   Int. Bus. Consultants   PO Box 9 Boston MA 02210         10,243.00
90345672   I.B. Manufacturing      Park Blvd. Boston MA 04106       15,999.00
Callouts: spelling errors, anomalies, lack of standards, no common key
Acct # Name Address City State Zip Note
5154155 Peter J. Lalonde 40 Beacon St. Melrose, Mass 02176 ODP
5152335 LaLonde, Peter 76 George 617-210-0824 Boston YES MA 02111
5146261 Lalonde, Sofie 40 Bacon Street Melrose MA CHK ID
87121 Pete & Soph Lalond 76 George Road Boston MASS FR Alert
87458 P. Lalonde FBO S.Lalonde 40 Becon Rd. Melrose MA 02 176
What data challenges do you face?
•No consistent naming convention
•Business terms and spillover text
•Missing values or data in the wrong fields
•Buried information
•Misspelling
•No unique key linking records together
Why investigate?
●Discover potential anomalies in the data
●Examine single domain and free-form fields
●Identify invalid and default values
●Reveal undocumented business rules
●Verify the reliability of the data in the fields to be used as matching criteria
●Gain complete understanding of data
Investigate – single domain report
[Report screenshot: single-domain investigation listing the field, frequency count, percent of total, and sample source data.]
Investigate – word pattern report
[Report screenshot: free-form text (word) investigation listing the field, pattern, frequency, percent of total, and sample source data.]
What is standardize?
●Applying business logic to data chaos.
– Pattern manipulation
●Enforcing business standards on data elements.
– Standards definition
●Transforming the input to an output which meets the business requirement.
– Field structuring
How to standardize
●Parse specific data fields into smaller, atomic data elements
– Atomic data elements are called tokens
– Categorize identified elements
• Separate Name, Address, and Area from freeform Name & Address lines
• Identification of Distinct Material Categories (e.g. Sutures vs. Orthopedic Equipment)
●Refine data elements
• Example 1
– Name = ‘DR PAUL E JONES’ becomes:
> Title = ‘DR’
> First Name = ‘PAUL’
> Middle Name = ‘E’
> Last Name = ‘JONES’
• Example 2
– Part Description = ‘BLK LATEX GLOVE’ becomes:
> Color = ‘BLACK’
> Type = ‘LATEX’
> Part = ‘GLOVE’
Why standardize?
●Normalize values in data fields to standard values
– Transform First Name = ‘MIKE’ → ‘MICHAEL’
– Transform Title = ‘Doctor’ → ‘Dr’
– Transform Address = ‘ST. Michael Street’ → ‘Saint Michael St.’
– Transform Color = ‘BLK’ → ‘BLACK’
●Apply phonetic coding to key words, which facilitates record linkage
– NYSIIS
– Soundex
– Typically applied to Name fields (first, last, street, city)
QualityStage standardize
●Uses a highly flexible pattern recognition language
●Can employ field or domain specific standardization (i.e. unique rules for names vs. addresses vs. dates, etc.)
●Contains customizable classification and standardization tables
●Utilizes results from data investigation
QualityStage standardize report example
[Report screenshot callouts: Ind./Org. flag, original data]
Match
“Conditioned data and QualityStage’s matching engine link the previously unlinkable.”
●Match Construction:
– Reliability of input data defines a match result.
●Statistical Analysis & Match Scoring:
– Linkage probability determined on a sliding scale by field level comparison.
●Report Generation:
– All business rules applied have easy to understand report structure.
What is match?
●Identifying all records on one file that correspond to similar records on another file
●Identifying duplicate records in one file
●Building relationships between records in multiple files
●Performing statistical and probabilistic matching
●Calculating a score based on the probability of a match
Why match?
●Identify duplicate entities within one or more files
●Perform householding
●Create consolidated view of customer
●Establish cross-reference linkage
How to match
●Single file (Unduplication) or two file (Reference)
●Different match comparisons for different types of data (e.g. exact character, uncertainty/fuzzy match, keystroke errors, multiple word comparison)
●Generation of composite weights from multiple fields
●Use of probabilistic or statistical algorithms
●Application of match cutoffs or thresholds to identify automatic and clerical match levels
●Incorporation of override weights to assess particular data conditions (e.g. default values, discriminatory elements)
QualityStage match
●A wide variety of match comparison algorithms providing a full spectrum of fuzzy matching functions
●Statistically-based method for determining matches (Probabilistic Record Linkage Theory)
●Field-by-field comparisons for agreement or disagreement
●Assignment of weights or penalties
●Overrides for unique data conditions
●Score results to determine the probability of matched records
●Thresholds for final match determination
●Ability to measure informational content of data
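The weighting and threshold ideas listed above can be illustrated with a minimal Python sketch in the style of probabilistic record linkage. It is not QualityStage code: the m and u probabilities, the cutoff values, the field names, and the use of exact equality as the only comparison are all hypothetical choices made for illustration.

```python
import math

def field_weight(agrees: bool, m: float, u: float) -> float:
    """Return the log2 weight for one field comparison.

    m: probability the field agrees when the records truly match.
    u: probability the field agrees when the records do not match.
    """
    if agrees:
        return math.log2(m / u)          # agreement weight (positive)
    return math.log2((1 - m) / (1 - u))  # disagreement weight (penalty)

def composite_weight(rec_a: dict, rec_b: dict, probabilities: dict) -> float:
    """Sum the field-by-field weights into one composite score."""
    total = 0.0
    for field, (m, u) in probabilities.items():
        agrees = rec_a[field] == rec_b[field]  # exact compare only (simplification)
        total += field_weight(agrees, m, u)
    return total

def classify(weight: float, match_cutoff: float, clerical_cutoff: float) -> str:
    """Apply cutoffs: automatic match, clerical review, or nonmatch."""
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "nonmatch"

# Hypothetical (m, u) values per field.
PROBS = {"last_name": (0.9, 0.01), "first_name": (0.9, 0.05), "zip": (0.9, 0.1)}

a = {"last_name": "LALONDE", "first_name": "PETER", "zip": "02176"}
b = {"last_name": "LALONDE", "first_name": "PETE", "zip": "02176"}

score = composite_weight(a, b, PROBS)
decision = classify(score, match_cutoff=8.0, clerical_cutoff=4.0)
print(round(score, 2), decision)
```

Fields that rarely agree by chance (a low u) contribute large agreement weights, which is one way to read "informational content of data". A real match specification would replace the exact comparison with fuzzy comparisons.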
QualityStage match examples
What is survive?
●Creation of best-of-breed “surviving” data based on record or field level information
●Development of cross-reference file of related keys
●Creating output formats:
– Relational table with primary and foreign keys
– Transactions to update databases
– Cross-reference files
Why survive?
●Provide consolidated view of data
●Provide consolidated view containing the “best-of-breed” data
●Resolve conflicting values and fill missing values
●Cross-populate best available data
●Implement business rules
●Create cross-reference keys
How to survive
●Highly flexible rules
●Record or field level survivorship decisions
●Rules can be based upon data frequency, data recency (i.e. date), data source, value presence or length
●Rules can incorporate multiple tests
●QualityStage features
– Point-and-click (GUI-based) creation of business rules to determine best-of-breed “surviving” data
– Performed at record or field level
QualityStage survive examples
Example 1: The longest populated Middle and Last Name
Matched:
First Name   Middle Name   Last Name
MARI                       LEMELSON-LAPPNER
MARI         S             LEMELSON
Survived:
First Name   Middle Name   Last Name
MARI         S             LEMELSON-LAPPNER

Example 2: The longest populated Middle Name, Date of Birth, and SSN
Matched:
First Name   Middle Name   Last Name   DOB        SSN
DENISE                     TRIANO      19580211   98524173
DENISE       F             TRIANO
Survived:
First Name   Middle Name   Last Name   DOB        SSN
DENISE       F             TRIANO      19580211   98524173
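A field-level "longest populated value" rule like the one in Example 1 can be sketched in a few lines of Python. This is an illustration only, not how the Survive stage is implemented; the function name `survive_longest` and the dictionary field names are hypothetical.

```python
def survive_longest(group: list, fields: list) -> dict:
    """Build a surviving record from a group of matched records.

    For each listed field, keep the longest populated value found in the
    group. Other fields are copied from the first record in the group.
    """
    survivor = dict(group[0])
    for field in fields:
        # Treat missing or None values as empty strings.
        candidates = [(rec.get(field) or "") for rec in group]
        survivor[field] = max(candidates, key=len)  # ties keep the earliest record
    return survivor

# The matched pair from Example 1.
matched = [
    {"first": "MARI", "middle": "", "last": "LEMELSON-LAPPNER"},
    {"first": "MARI", "middle": "S", "last": "LEMELSON"},
]

survived = survive_longest(matched, ["middle", "last"])
print(survived)
```

Rules based on recency, source, or frequency would follow the same shape with a different key function per field.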
Course lab project design
[Diagram: lab project flow. Boxes: Policy; Investigate (Assess Data Quality); Standardize Country; Select US Data for further processing; Condition Name, Address and Area; Investigate Conditioned Results; Apply User Overrides; Identify Duplicate Customer Records; Survive the Best Customer Record.]
Checkpoint
1. (T/F) Data quality investigation cleans the source data.
2. (T/F) Standardization modifies the source data so that it can be loaded into the target system.
3. (T/F) Survivorship data can be either record based or field based.
Checkpoint solutions
1. (T/F) Data quality investigation cleans the source data.
Answer: False
2. (T/F) Standardization modifies the source data so that it can be loaded into the target system.
Answer: False
3. (T/F) Survivorship data can be either record based or field based.
Answer: True
Unit summary
Having completed this unit, you should be able to:
●List the five common data quality contaminants
– Different standards
– Missing and default values
– Spillover and buried information
– Anomalies
– No consolidated view
●Describe each of the following processes:
– Investigation
– Standardization
– Match
– Survivorship
Lab 1: Review course project
●Course business case: WINN Insurance CRM project
●See QualityStage Essentials Exercises
Lab 2: Copy student files
●Copy student files to disk
– Use C: drive as root for folder
QualityStage 8 Architecture
Unit objectives
●After completing this unit, you should be able to:
– Describe the Data Quality architecture
– Identify server and client components
Information Server conceptual architecture
[Diagram: Information Server hosts DataStage, QualityStage, Information Director, and Information Analyzer on top of the Metadata Server & Repository, with Metadata Access Services and other services (client logon access, logging, security).]
QualityStage technical overview
●Uses DataStage (parallel version)
– DataStage design environment
– Parallel execution engine
– Stages are native enterprise operators
– Match designer is embedded in DataStage Designer Client
– Get DataStage data connectivity by default
• No need for meta brokers, plug-ins
• Common meta data
●Legacy (pre-version 8) QS job execution
– Migration utility available to aid conversion from QS 7.x to QS 8
– Converted jobs can be compiled and executed in the QS 8 environment
DataStage/QualityStage physical architecture
[Diagram: Windows clients (Designer, Director, Administrator) connect via TCP/IP to projects on the DataStage/QualityStage Information Server, which runs on UNIX or Windows.]
DataStage clients
●Administrator
– Add and delete projects
– Set project defaults
– Set project environment parameters
●Designer
– Maintain data definitions
– Add, modify, and delete jobs
– Add, modify, and delete match specifications
– Manage rule sets
– Compile jobs
– Run jobs
– Provision rule sets and match specifications
●Director
– Run jobs
– Review job log
– Schedule jobs
DataStage Administrator
●Administrator
– Create or delete projects
– Set project defaults
– Apply security
[Screenshot: project list]
Project property defaults
DataStage Designer
●Designer
– Client GUI for designing jobs
• Windows 2000+, XP
• Build meta data
• Build Jobs
• Modify Standardization Rules
• Build match specifications
– Designer Repository
• Database
[Screenshot: sample QualityStage job as viewed in Designer]
Designer canvas, repository, and palette
DataStage Director
●Director
– Client GUI for managing job execution
– Windows 2000+, XP
– Run jobs: set job options and parameters
– View job log
– Schedule job execution
Job log viewed in Director
Checkpoint
1. (T/F) DataStage Administrator executes jobs.
2. (T/F) DataStage Designer configures projects.
3. Which DataStage component displays objects in the designer database?
Checkpoint solutions
1. (T/F) DataStage Administrator executes jobs.
Answer: False
2. (T/F) DataStage Designer configures projects.
Answer: False
3. Which DataStage component displays objects in the designer database?
Answer: the repository view
Unit summary
Having completed this unit, you should be able to:
– Describe the Data Quality architecture
– Identify server and client components
Lab 3: Configure QualityStage project
●Create a project using Administrator (if necessary)
●Set project properties
– General defaults
– Environment variables
– Security groups and roles
Developing with QualityStage
Unit objectives
●After completing this unit, you should be able to:
– Import meta data
– Build DataStage/QualityStage Jobs
– Run jobs
– Review results
DataStage/QualityStage project
●Components
– Jobs
– Stages within jobs
– Table Definitions
– The Designer repository view shows project components
Job definition
●A job is an executable DataStage/QualityStage program
●Created by job compilation
●Jobs can be run in batch or in real time
Job development overview
●Designer client
– Import or enter file meta data defining your sources and targets
– Add stages and links defining the process
– Compile the job
– Run the job (Designer or Director)
– Review data results
●Server
– Runs the job
– View job log
Log onto project in Designer or Director
User name and Password controlled by Information Server
List of valid projects
Designer repository components
●Database which stores
– Data file definitions
– Job designs
– Standardization rules
– Data connection objects
Project structure
Repository view
In Designer
DataStage/QualityStage design environment
Stages
Data definitions
Data definitions
●Entered or loaded via DataStage import mechanisms
– Sequential file
– ODBC
– Native database connection
●New and redefined columns can be added on the data flow via Transformer stage
Data Quality folder
●Stages are the building blocks
●Focused in function
●All phases of data quality:
– Investigate
– Standardize
– Match Frequency
– Match
• Unduplicate Match
• Reference Match
– Survive
– International postal
• MNS
– Optional
• Address Verification
Standardization rule sets
●Pre-defined rules for parsing and standardizing:
– Name
– Address
– Area (City, State and Zip)
●Multi-national address processing
●Validate structure:
– Tax ID
– US Phone
– Date
– Email
●Append ISO country codes
●Rule sets are stored in the repository and provisioned to the job execution area
[Screenshot: rule set for USNAME]
Rule set components
●Can modify some rule set components
●Test rule sets
●Copy rule sets
Match Specifications in the DataStage Repository
●Created using the Match Designer
●Allows online testing of match criteria
Executing a job via Director
Director
– Click run button
– Set run options
– Execute job
– View job log
– View job monitor
Server
– Executes the job
Running a job in Director
●Director
– Client GUI for running jobs
• Windows 2000+, XP
• View job logs and monitor
• Job scheduling
[Screenshot: job status view]
Execution environment
Data Quality Job Log
Job Monitor statistics
Job development process
●Import meta data
●Define job
– Draw stages and links
– Set stage properties
– Compile
●Run the job
●Review results
Checkpoint
1. (T/F) The job monitor displays link statistics.
2. (T/F) The job log is viewed in DataStage Designer.
3. What protocol is used for communication between the DataStage clients and server?
Checkpoint solutions
1. (T/F) The job monitor displays link statistics.
Answer: True
2. (T/F) The job log is viewed in DataStage Designer.
Answer: False
3. What protocol is used for communication between the DataStage clients and server?
Answer: TCP/IP
Unit summary
Having completed this unit, you should be able to:
– Import meta data
– Build DataStage/QualityStage Jobs
– Run jobs
– Review results
Lab 4: Import meta data
●DataStage import mechanisms
– DataStage components
• Any object built in DataStage, such as jobs, table definitions, match specifications
Lab 5: Build and run DataStage job
●Read sequential file
– Must use format tab to handle nulls
Investigation
Unit objectives
●After completing this unit, you should be able to:
– Build Investigate jobs
– Use character discrete, concatenate, and word investigations to analyze data fields
– Review results
Investigation
●Verify the domain
– Review each field of interest and verify the data matches the meta data
●Identify data formats, missing and default values
●Identify data anomalies
– Format
– Structure
– Content
●Discover “unwritten” business rules
●Identify data preparation requirements
Investigate stage
●Features
– Analyze free-form and single domain fields
– Provide frequency distributions of distinct values and patterns
●Investigate methods
– Character Discrete
– Character Concatenate
– Word
Investigate methods
Method: Why
Character Discrete: Analyzing field values, formats, and domains
Character Concatenate: Cross-field correlation, checking logic relationships between fields
Word Investigation: Identifying free-form fields that may require parsing and discovery of key words for classification
Investigate terminology
Field Masks: Options that represent the data. Options: Character (C), Type (T), Skipped (X)
Tokens: Individual units of data
Character Mask: Usage
C: For viewing the actual character values of the data
T: For viewing the pattern of the data
X: For ignoring characters
Field mask examples
Token            Mask             Result
02116            CCCCC            02116
02116            CCCXX            021
01832-4480       TTTTTTTTTT       nnnnn-nnnn
XJ2 6EM          TTTTTTT          aanbnaa
(617) 338-0300   CCCCCCCCCCCCCC   (617) 338-0300
617-338-0300     TTTTTTTTTTTT     nnn-nnn-nnnn
6173380300       CCCXXXXXXXXX     617
(617)3380300     CCCXXXXXXXXX     (61
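The mask behaviour in the examples above can be reproduced with a short Python sketch. It is an illustration inferred from the example table, not the Investigate stage itself: the function name `apply_mask` is hypothetical, and the handling of special characters under a T mask (kept as-is) and of characters beyond the end of the mask (dropped) are assumptions.

```python
def apply_mask(value: str, mask: str) -> str:
    """Apply a per-character field mask (C, T, or X) to a value.

    C keeps the actual character, T replaces it with its type
    (n = numeric, a = alpha, b = blank, other characters unchanged),
    and X skips the character.
    """
    result = []
    for char, mask_char in zip(value, mask.upper()):
        if mask_char == "C":
            result.append(char)
        elif mask_char == "T":
            if char.isdigit():
                result.append("n")
            elif char.isalpha():
                result.append("a")
            elif char == " ":
                result.append("b")
            else:
                result.append(char)  # assumption: special characters pass through
        elif mask_char == "X":
            continue
        else:
            raise ValueError(f"unknown mask character: {mask_char!r}")
    return "".join(result)

for token, mask in [
    ("02116", "CCCXX"),
    ("01832-4480", "TTTTTTTTTT"),
    ("XJ2 6EM", "TTTTTTT"),
    ("(617)3380300", "CCCXXXXXXXXX"),
]:
    print(f"{token!r:18} {mask:14} {apply_mask(token, mask)!r}")
```

Counting the masked results (for example with `collections.Counter`) would give the kind of frequency distribution the investigation reports show.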
Character discrete: field mask (C)haracter
●Usage: Domain quality
– View the contents of each field to verify that the data values match the field labels
●Mechanism: Investigate stage
– Generates reports for frequency
Character discrete - character results
Character discrete: field mask (T)ype
●Usage: Data formats (patterns):
– View the format of fields that you suspect may follow or conform to a specific format, e.g., dates, PIN, Tax ID, account numbers.
●Generates reports for frequency
Investigation Implementation
QualityStage Investigation job – character
[Screenshot: Investigate stage on the job canvas; double-click to open it]
Investigation - Character
[Screenshots: select a column and click Add, then select a mask]
View investigation report
Character concatenate
●Identify Field Relationships
– Investigate one or more fields to uncover any relationship between the field values.
– Uses combinations of character masks
– Generates reports for frequency
Character concatenate results
DOB and DOD Fields
Word investigate
●Usage: Free-form field pattern analysis
– To view the pattern of the data within a freeform text field and parse it into individual tokens
●QualityStage process
– Apply rule sets to free-form fields
– Discover parsing requirements
– Discover patterns in data
– Generate reports for pattern frequency distributions and token report
Word investigation results
Pattern Report: how to use
●Look at most frequently occurring patterns.
●Use to estimate how much work is needed to modify a rule set for a customer.
Token Report: how to use
●Review tokens with SME to verify tokens are properly classified.
●Identify most frequently occurring unclassified tokens and add them to rule set.
Rule sets
●Rules for parsing, classifying, and organizing data
●Rule Set Domains
– Country processing
– Pre-processing
– Domain Processing
• Name: Business and Personal
• Street Address
• Area: Locality, City, State and Zip/Postal codes
– Multinational Address Processing
Parsing
●Parse free-form data with the SEPLIST and a STRIPLIST
– SEPLIST - Any character in the SEPLIST will separate tokens, and become a token itself
– STRIPLIST - Any character in the STRIPLIST will be ignored in the resulting pattern
●The SEPLIST is always applied first
Parsing example
Example: 120 Main St. N.W.

SEPLIST “¬.” STRIPLIST “¬”
Token1 Token2 Token3 Token4 Token5 Token6 Token7 Token8
120    Main   St     .      N      .      W      .

SEPLIST “¬.” STRIPLIST “¬.”
Token1 Token2 Token3 Token4 Token5
120    Main   St     N      W

SEPLIST “¬” STRIPLIST “¬.”
Token1 Token2 Token3 Token4
120    Main   St     NW
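The three tokenizations of "120 Main St. N.W." can be reproduced with a small Python sketch of the two-step rule: separate on SEPLIST characters first, then drop STRIPLIST characters. This is an illustration of the stated rule, not the QualityStage parser; the function name `tokenize` is hypothetical, and it assumes the ¬ symbol on the slide stands for a space.

```python
def tokenize(text: str, seplist: str, striplist: str) -> list:
    """Split text into tokens using a SEPLIST, then apply a STRIPLIST."""
    tokens = []
    current = ""
    # Step 1: every SEPLIST character ends the current token
    # and becomes a token of its own.
    for char in text:
        if char in seplist:
            if current:
                tokens.append(current)
                current = ""
            tokens.append(char)
        else:
            current += char
    if current:
        tokens.append(current)
    # Step 2: remove STRIPLIST characters; tokens left empty disappear.
    stripped = ("".join(c for c in tok if c not in striplist) for tok in tokens)
    return [tok for tok in stripped if tok]

address = "120 Main St. N.W."
print(tokenize(address, seplist=" .", striplist=" "))
print(tokenize(address, seplist=" .", striplist=" ."))
print(tokenize(address, seplist=" ", striplist=" ."))
```

When the period is stripped but not a separator, "N.W." stays one token and collapses to "NW", which is why the last configuration yields four tokens.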
Data typing: classifying tokens
●Identify and type the token in terms of its business meaning and value
PATTERN KEY (USADDR rule set):
^ – Numeric token
? – Unclassified alpha token
@, <, > – Mixed Token
T – Street Type
U – Unit Type
Example:
120   Main   Street   Apt   6C
^     ?      T        U     >
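A rough Python sketch of token classification follows, to show how a pattern such as "^ ? T U >" arises. The classification table here is a tiny hypothetical stand-in for the USADDR rule set, and the way the three mixed classes are told apart (">" for leading numeric, "<" for leading alpha, "@" for anything else) is an assumption; the slide only says that all three denote mixed tokens.

```python
# Hypothetical mini classification table: known word -> class tag.
CLASSIFICATION = {
    "STREET": "T", "ST": "T", "AVE": "T", "AVENUE": "T",
    "APT": "U", "APARTMENT": "U", "UNIT": "U", "SUITE": "U",
}

def classify_token(token: str) -> str:
    """Return the pattern symbol for a single non-empty token."""
    upper = token.upper()
    if upper in CLASSIFICATION:
        return CLASSIFICATION[upper]   # known word from the table
    if upper.isdigit():
        return "^"                     # numeric token
    if upper.isalpha():
        return "?"                     # unclassified alpha token
    if not upper.isalnum():
        return "@"                     # assumption: complex mixed token
    # Assumption: '>' is leading numeric, '<' is leading alpha.
    return ">" if upper[0].isdigit() else "<"

def pattern(text: str) -> str:
    """Classify each whitespace-separated token and join the symbols."""
    return " ".join(classify_token(tok) for tok in text.split())

print(pattern("120 Main Street Apt 6C"))
print(pattern("10 MAPLE STREET APARTMENT 222"))
```

Tokens that fall through to "?" are the unclassified words a token report would surface as candidates for the classification table.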
Example: word investigate
Input: 10 MAPLE STREET APARTMENT 222
1. Parse
2. Classify known words and assign default tags: ^ ? T U ^
3. Produce reports based on patterns and tokens: token report, pattern report
Investigation - Word
Link ordering
Investigation – define output files
Sort output (optional)
Review word reports – patterns and tokens
Data quality assessment process
●Review and analyze each field for the following information:
– How often is the field populated?
– What are the anomalies and out-of-range values? How often does each one occur?
– How many unique values were found?
– What is the distribution of the data or patterns?
●Use Investigate results to:
– Update project business requirements
– Define development plan and application design
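The assessment questions above map onto a simple per-field profile. The Python sketch below is one possible way to compute it outside the tool, offered as an illustration only; the function name `profile_field`, the report keys, and the sample values (loosely based on the SOC. SEC. # column shown earlier) are hypothetical.

```python
from collections import Counter

def profile_field(values: list, defaults: tuple = ()) -> dict:
    """Summarize how one field is populated.

    values: raw field values, one per record ("" or None means missing).
    defaults: values treated as suspected defaults, e.g. "999999999".
    """
    total = len(values)
    populated = [v for v in values if v is not None and v.strip() != ""]
    frequency = Counter(populated)
    return {
        "total": total,
        "populated_pct": 100.0 * len(populated) / total if total else 0.0,
        "unique": len(frequency),
        "default_count": sum(frequency[d] for d in defaults),
        "distribution": frequency.most_common(),  # most frequent first
    }

# Hypothetical sample column.
ssn_values = ["228-02-1975", "999999999", "025-37-1888", "", "999999999", "111111111"]
report = profile_field(ssn_values, defaults=("999999999", "111111111"))

for key, value in report.items():
    print(key, value)
```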
Checkpoint
1. (T/F) Character discrete investigation examines a single domain.
2. (T/F) Word investigation examines a single domain.
3. Name the three character masks.
Checkpoint solutions
1. (T/F) Character discrete investigation examines a single domain.
Answer: True
2. (T/F) Word investigation examines a single domain.
Answer: False
3. Name the three character masks.
Answer: C, T, and X
Unit summary
Having completed this unit, you should be able to:
– Build Investigate jobs
– Use character discrete, concatenate, and word investigations to analyze data fields
– Review results
Lab 6: Build investigate jobs
●Character with C mask
●Character with T mask
●Character concatenate
●Word
Standardize
Unit objectives
●After completing this unit, you should be able to:
– Describe the Standardize stage
– Identify rule sets
– Build jobs using the Standardize stage
– Interpret standardization results
– Investigate unhandled data and patterns
Standardize
●Transformation
– Parsing free form fields
– Comparison threshold for classifying like words
– Bucketing data tokens
●Standardization
– Applying standard values and standard formats
●Phonetic Coding for use in Matching
– NYSIIS
– Soundex
Standardize example
Input File (Address Line 1, Address Line 2):
1721 W ELFINDALE ST UNIT 20
1721 W ELFINDALE ST # 20
16200 VENTURA BOULEVARD SUITE 201
C/O JOSEPH C REIFF 12 WESTERN AVE
1705 W St PHILADELPHIA
1655 PONCE DE LEON AVENUE 15TH FLOOR

Result File (House #, Dir, Str. Name, Type, Unit Type, Unit Value, Floor Type, Floor Value):
1721 W ELFINDALE ST UNIT 20
1721 W ELFINDALE ST 20
16200 VENTURA BLVD STE 201
12 WESTERN AVE
1705 W ST
1655 PONCE DE LEON AVE FLOOR 15
Standardize process
Input: 10 MAPLE STREET APARTMENT 222
1. Parse
2. Classify and assign default tags: ^ ? T U ^
3. Process patterns and bucket data
Output File:
House Number: 10
Street Name: MAPLE
Street Type: ST
Unit Type: APT
Unit: 222
Key:
^ = Single numeric
? = One or more unknown alphas
T = Street type
U = Unit type
Standardize stage
●Standardize Stage
– Uses Rule sets for:
• Country processing
• Pre-domain processing: USPREP
• Domain processing: USADDR, USAREA, USNAME
• Multi-national Address
• WAVES
• Address Verification Interface (optional)
Types of rule sets
Preparatory steps (not always required):
– Country Identifier: COUNTRY
– Domain Pre-processor: USPREP
Domain Specific:
– USNAME
– USADDR
– USAREA
Example: country identifier
Input Record
100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
28 GROSVENOR STREET LONDON W1X 9FE
123 MAIN STREET
Output Record
US Y 100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
CA Y SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
GB Y 28 GROSVENOR STREET LONDON W1X 9FE
US N 123 MAIN STREET
Example: domain preprocessor
Input Record (mixed domain)
Field 1  JIM HARRIS (781) 322-2426
Field 2  92 DEVIR STREET MALDEN MA 02148
Output Record
Name Domain     JIM HARRIS
Address Domain  92 DEVIR STREET
Area Domain     MALDEN MA 02148
Other Domain    (781) 322-2426
Example: domain specific
Input Record
100 SUMMER STREET 15TH FLOOR
Output Record
House Number                    100
Street Name                     SUMMER
Street Suffix Type              ST
Floor Type                      FL
Floor Value                     15
Address Type                    S
NYSIIS of Street Name           SANAR
Reverse Soundex of Street Name  R520
Input Pattern                   ^+T>U
Rule sets
●Rule sets contain logic for:
– Parsing
– Classifying
– Processing data by pattern and bucketing data
●Three required files
– Classification Table
– Dictionary File
– Pattern Action File
●Optional files
– Lookup tables
– Override tables
Rule set files
Classification Table: Contains standard abbreviations that identify and classify key words.
Dictionary File: Defines the output file fields to store the parsed and conditioned data.
Pattern Action File: Contains a series of patterns and programming commands to condition the data.
Reference Tables: Optional conversion and lookup tables for converting and returning standardized values.
Override Tables: Tables for storing overrides entered into the Designer GUI.
Classification table
●Contains the words for classification, standardized versions of words, and data class
●Data class (data tag) is assigned to each data token
●Default classes are the same across all rule sets
●User-defined classes are assigned in the classification table
– Users may modify, add or delete these classes
– User-defined classes are a single letter
Default classes
Class  Description
^      A single numeric
+      A single unclassified alpha (word)
?      One or more consecutive unclassified alphas
@      Complex mixed token, e.g., C3PO
>      Leading numeric, e.g., 6A
<      Trailing numeric, e.g., A6
0      Null class (zero)
User-defined classes
Rule set  Class  Description
USNAME    G      Generational, e.g., Senior, I, II
USNAME    P      Prefix, e.g., Dr., Mr., Miss
USADDR    T      Street Type
USADDR    D      Directional
USADDR    B      Box Type
USAREA    S      State Abbreviation
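The way default and user-defined classes combine into an input pattern can be sketched in Python. This is an illustration only, not QualityStage code: the CLASSIFICATION dictionary, the function names, and the whitespace tokenizer are assumptions, and the table holds just a few hypothetical USADDR-style entries. The ? class (a run of unclassified words) and the null class are not modeled.

```python
import re

# Hypothetical classification entries: token -> (standard form, class).
# A real rule set reads these from its classification table.
CLASSIFICATION = {
    "ST": ("ST", "T"),
    "STREET": ("ST", "T"),
    "APT": ("APT", "U"),
    "APARTMENT": ("APT", "U"),
}


def default_class(token):
    """Return the default class for a token that is not in the table."""
    if token.isdigit():
        return "^"   # a single numeric
    if token.isalpha():
        return "+"   # a single unclassified alpha (word)
    if re.fullmatch(r"[0-9]+[A-Za-z]+", token):
        return ">"   # leading numeric, e.g. 6A
    if re.fullmatch(r"[A-Za-z]+[0-9]+", token):
        return "<"   # trailing numeric, e.g. A6
    return "@"       # complex mixed token, e.g. C3PO


def input_pattern(text):
    """Build the input pattern: one class character per token."""
    classes = []
    for token in text.upper().split():
        entry = CLASSIFICATION.get(token)
        classes.append(entry[1] if entry else default_class(token))
    return "".join(classes)
```

With these entries, the address 10 MAPLE STREET APARTMENT 222 produces the pattern ^+TU^: table lookups take precedence, and default classes are only a fallback.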
Classification table example
;-------------------------------------------------------------------------------
; USADDR Classification Table
;-------------------------------------------------------------------------------
; Classification Legend
;-------------------------------------------------------------------------------
; B - Box Types
; D - Directionals
; F - Floor Types
; H - Highway Modifiers
; R - Rural Route, Highway Contract, Star Route
; T - Street Types
; U - Unit Types
;-------------------------------------------------------------------------------
; Table Sort Order: 51-51 Ascending, 26-50 Ascending, 1-25 Ascending
;-------------------------------------------------------------------------------
DRAW "PO BOX" B
DRAWER "PO BOX" B
PO "PO BOX" B
POB "PO BOX" B
POBOX "PO BOX" B
POBX "PO BOX" B
PODRAWER "PO BOX" B
Columns, left to right: Token, Standard form, Classification
Comparison threshold
●May be used in the classification table
●Used to efficiently make entries into the classification table
●Helps overcome spelling and data entry errors
●Not required
●Threshold uses a logical string comparator
Threshold level  Meaning
900              Exact match
850              Almost certainly the same
800              Most likely equivalent
750              Most likely not the same
700              Almost certainly not the same
Classification table example with comparison threshold
; USADDR Classification Table
;-------------------------------------------------------------------------------
; Classification Legend
;-------------------------------------------------------------------------------
; B - Box Types
; D - Directionals
; F - Floor Types
; H - Highway Modifiers
; R - Rural Route, Highway Contract, Star Route
; T - Street Types
; U - Unit Types
;-------------------------------------------------------------------------------
; Table Sort Order: 51-51 Ascending, 26-50 Ascending, 1-25 Ascending
;-------------------------------------------------------------------------------
DRAW "PO BOX" B
DRAWER "PO BOX" B
…
NORTHEAST NE D 850
NORTHWEST NW D 850
NW NW D
S S D
SO S D
SOUTH S D
Dictionary file
●Defines the field definitions for the output file
●When data is moved to these output fields it is called “bucketing” the data
●The order that the fields are listed in the dictionary file defines the order the fields appear in the output file
●Dictionary file entries are similar to field definitions
Dictionary file example
;;QualityStage v8.0
\FORMAT\ SORT=N
;------------------------------------------------------------------------------
; USADDR Dictionary File
;------------------------------------------------------------------------------
; Total Dictionary Length = 411
;------------------------------------------------------------------------------
; Business Intelligence Fields
;------------------------------------------------------------------------------
HouseNumber C 10 S HouseNumber ;0001-0010
HouseNumberSuffix C 10 S HouseNumberSuffix ;0011-0020
StreetPrefixDirectional C 3 S StreetPrefixDirectional ;0021-0023
StreetPrefixType C 20 S StreetPrefixType ;0024-0043
StreetName C 25 S StreetName ;0044-0068
StreetSuffixType C 5 S StreetSuffixType ;0069-0073
StreetSuffixQualifier C 5 S StreetSuffixQualifier ;0074-0078
Pattern-Action file
●Contains the rules for standardization; that is, the actions to execute with a given pattern of tokens
●Records are processed from the top down
●Written in Pattern Action Language (PAL)
●Complex parsing can be coded in this file
Pattern Action file process
Street Address: 10 Hollow Oak Road
Pattern: ^ ? T
Pattern Action Language:
COPY [1] {HN}
COPY_S [2] {SN}
COPY_A [3] {ST}
Result: {HN} {SN} {ST} = 10 Hollow Oak Rd
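The effect of the three COPY actions on the pattern ^ ? T can be mimicked in Python. This is a minimal sketch under stated assumptions, not Pattern Action Language: the CLASSIFICATION entries, the function names, and the use of a regular expression over one class character per token are all illustrative choices.

```python
import re

# Hypothetical classification entries: token -> (standard form, class).
CLASSIFICATION = {"ROAD": ("Rd", "T"), "RD": ("Rd", "T")}


def token_class(token):
    """Class from the table if present, else a simplified default class."""
    entry = CLASSIFICATION.get(token.upper())
    if entry:
        return entry[1]
    if token.isdigit():
        return "^"
    return "+" if token.isalpha() else "@"


def standardize_street(address):
    """Apply the pattern ^ ? T and bucket the tokens into HN, SN and ST."""
    tokens = address.split()
    # One class character per token, so string offsets equal token indexes.
    classes = "".join(token_class(t) for t in tokens)
    match = re.fullmatch(r"(\^)(\++)(T)", classes)
    if match is None:
        return None  # the pattern does not apply; the data stays unhandled
    street_type = tokens[match.start(3)]
    return {
        # COPY [1] {HN}: copy the numeric token as is
        "HN": tokens[match.start(1)],
        # COPY_S [2] {SN}: copy the unknown words, keeping the spaces
        "SN": " ".join(tokens[match.start(2):match.end(2)]),
        # COPY_A [3] {ST}: copy the standard abbreviation from the table
        "ST": CLASSIFICATION[street_type.upper()][0],
    }
```

Called with 10 Hollow Oak Road, the function returns HN = 10, SN = Hollow Oak and ST = Rd; an address that does not fit the pattern returns None, which corresponds to data left for another pattern or reported as unhandled.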
Optional lookup tables
●Called from the Pattern Action File
●Rule sets may contain lookup tables such as:
– Common First Names and Enhanced First Names
• Barb & Barbara
• Ted & Edward
– Gender based on name
– State abbreviations
– Common city abbreviations
• NYC = New York City
• LA = Los Angeles
Standardize process
1. Parse the input: 10 MAPLE STREET APARTMENT 222
2. Classify and assign default tags (Classification Table): ^ ? T U ^
3. Process patterns and bucket data (Pattern Action File)
4. Write the bucketed fields (Dictionary File):
House Number  Street Name  Street Type  Unit Type  Unit
10            MAPLE        ST           APT        222
Standardizing international data
●Two methods
– Method 1: Use country pre-processor, domain pre-processor, and domain-specific rules
• Uses out-of-the-box, included functionality/rules
– Method 2: Use Multinational Standardize, WAVES, or AVI
• WAVES requires purchase of WAVES database
• AVI requires purchase of database for address validation
Course lab project design
Policy
1. Investigate: Assess Data Quality
2. Standardize Country
3. Select US Data for further processing
4. Condition Name, Address and Area
5. Investigate Conditioned Results
6. Apply User Overrides
7. Identify Duplicate Customer Records
8. Survive the Best Customer Record
Country rule set
●Country rule set appends the two-byte ISO country code
●Input to the country rule set includes:
– Default country code (designated by ZQ…default value…ZQ)
– Street Address
– City or locality
– State
– Zip or Postal code
– Country field (if it exists)
●Output:
– Two-byte ISO country code
– Flag identifying explicit or default decision
Standardization implementation
Standardization jobs
Country Rule Set
USPREP Rule Set
Domain-specific Rule Sets
Standardization – US Name, Address, Area
Standardization
Standardization – mapping columns
Selecting US data
●The DataStage Filter Stage provides the capability of selecting and/or rejecting records based on a set of values for a field
●Selecting or splitting data with compound or complex logic may require the Transformer stage
Domain pre-processor rule sets
●Pre-processor rule sets are designed to filter name, street address and area (city, state, zip) data
– For example, if the city, state and zip are found in ADDRESS LINE 2, the pre-processor rule set will attempt to recognize this data and move it into the area domain
●The pre-processor rule set prepares the data for processing by domain-specific rule sets
Domain rule sets
●Domain rule sets expect only data for that domain as the input
●Domain rule sets that come with QualityStage are:
– Name
– Street address
– Area (city, state and zip)
USNAME rule set
●The USNAME rule set works on both personal names and organization names for US data
●Data is parsed into name components
●Phonetic codes for the First Name and Primary Name are created for matching
USADDR rule set
●This rule set is applied to street address fields
●The “Address Type” flag identifies different types of addresses
– ‘S’ Street address
– ‘B’ Box address
– ‘R’ Rural route address
●Phonetic coding of the Street Name is created for matching
USAREA rule set
●This rule set is applied to city, state and postal code fields
●Data is parsed into city name, state abbreviation, zip code and zip plus four
●Phonetic coding of the city name is created for matching
Standardize results
●Business Intelligence fields
– Parsed from the original data, they may be used in matching and generally they are moved to the target system
●Matching fields
– Generally these fields are created to help during the match process and are dropped after successful matching
●Reporting fields
– Specifically created to help review the results of Standardize and recognize handled and unhandled data
Business intelligence fields
●Intelligent data parsed and bucketed from the input free-form field
USNAME Examples
• Title
• First Name
• Middle Name
• Primary Name
• Generational
USADDR Examples
• HouseNumber
• Directional
• Street Name
• Unit Types
• Box Types
• Unit Values
• Building Names
USAREA Examples
• City
• State
• Zip5
• Zip4
Standardize matching fields
●Phonetic coding
– NYSIIS
– Reverse NYSIIS
– Soundex
– Reverse Soundex
●Hash keys
– First 2 characters of the first five words
●Packed Keys
– Data concatenated, or packed
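Two of these matching fields are easy to illustrate in Python. The sketch below implements standard American Soundex, treats Reverse Soundex as Soundex of the reversed string, and builds a hash key from the first two characters of the first five words. The function names are hypothetical, and it is an assumption that QualityStage's own routines follow exactly these rules (for example, how short words in a hash key are padded). NYSIIS is omitted because its rule list is much longer.

```python
# Standard American Soundex digit groups.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the four-character Soundex code of a word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        code = _CODES.get(c, "")
        if code and code != prev:
            out.append(code)
        if c not in "HW":      # H and W do not separate repeated codes
            prev = code
    return ("".join(out) + "000")[:4]


def reverse_soundex(word):
    """Soundex of the reversed word, useful when errors occur at the start."""
    return soundex(word[::-1])


def hash_key(text, words=5, chars=2):
    """First `chars` characters of each of the first `words` words."""
    return "".join(w[:chars] for w in text.upper().split()[:words])
```

For the street name SUMMER, reverse_soundex gives R520, which agrees with the Reverse Soundex value in the domain-specific example earlier in this unit.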
Standardize reporting fields
Unhandled Pattern: The pattern generated for tokens not processed by the selected rule set.
Unhandled Data: The remaining tokens not processed by the selected rule set.
Input Pattern: The pattern generated for the stream of input tokens based on the parsing rules and token classifications.
Exception Data: The tokens not processed by the rule set because they represent a data exception.
User Override Flag: Flag indicating what kind of user overrides were applied to this record.
Investigate NAME unhandled patterns and data
●Identify the unhandled patterns for the NAME field. In the report include the unhandled data, input pattern, original data and the record key.
1. Build a Character Concatenate Investigation using the following fields
Field Description  Type
Unhandled Pattern  C
Unhandled Data     X
Input Pattern      X
Name domain data   X
2. Increase the number of samples to 5
Standard practice: investigate handled and unhandled data
●Review the business intelligence fields to ensure accurate bucketing of the data
●Build a Character Discrete Investigation for each field and review the contents and the format
●Build Investigation to review:
– Unhandled Patterns
– Unhandled Data
– Input Pattern
– Input Fields
Customizing rule sets
●A rule set may require modification if some input data is:
– Not processed
– Incorrectly processed
●QualityStage provides functions to:
– Test strings for classifications using the Rules Analyzer
– Apply user overrides
– Modify classification table
User overrides
●Provides the user with the ability to modify rule sets
●The following types of rule sets can be modified using user overrides
– Domain Pre-processor rule sets
– Domain rule sets
●There are five types of user overrides relating to: classifications, patterns, and text strings
●User overrides are GUI driven
●Rule set should be provisioned after modifications are applied
User classification override
●Recognized as a keyword and classified
– Additional words
• New abbreviation, variation
• Misspelling of a word
●User classifications may override or add:
– Original values (token values)
– Standard value
– Class
Apply classification override
Unhandled Data:
Input Pattern  Original Data
+,+            HOCHREITER , CAROLYNNE
Override: Add CAROLYNNE as a valid first name to the classification table
Token Value  Standard Value  Class
Carolynne    Carolynne       F
Corrected Pattern:
Input Pattern  Original Data
+,F            HOCHREITER , CAROLYNNE
Text overrides
●Allow the user to specify overrides based on an entire text string
●Use this override for special cases and specific handling of a string of text
●Input Text Overrides
– Applied to the original text string
●Unhandled Text Overrides
– Applied to the Unhandled Data field
Input text overrides
Input Text:
Unhandled Pattern  Input Text
++                 ZACHARIA GELLMAN
++                 TOMMOTHY CABBOTT
++                 REIFF FUNERAL
Override:
Input Text Override REIFF FUNERAL: Move text string to the Primary Name field
Results:
Input Pattern  Primary Name
+ +            REIFF FUNERAL
Pattern overrides
●Allow the user to specify overrides based on an entire pattern
●Use this override when most or all records should be processed with identical logic
●Input Pattern Overrides
– Applied to the original text string
●Unhandled Pattern Overrides
– Applied to the Unhandled Data field
Unhandled pattern overrides
Unhandled Pattern:
Unhandled Pattern  Input Text
+, +               HAYWARD, WINSLOW
+, +               ESHAGHIAN , JOUBI
+, +               BOULDER, CORONA
Override:
Unhandled Pattern Override +, +
– Move the first + to Primary Name (the comma provides context)
– Move the second + to First Name
Results:
Unhandled Pattern  First Name  Primary Name
+, +               WINSLOW     HAYWARD
+, +               JOUBI       ESHAGHIAN
+, +               CORONA      BOULDER
User override precedence
1. User Classification: Recognize words to classify
2. Input Text: Modify logic based on the input string
3. Input Pattern: Modify logic based on the input pattern
4. Unhandled Text: Modify logic based on the unhandled data string
5. Unhandled Pattern: Modify logic based on the unhandled pattern
Investigate address and area unhandled patterns
●Identify the unhandled patterns for the Address and AREA fields. In the report include the unhandled data, input pattern, original data and the record key.
1. Build a Character Concatenate Investigation using the following fields
Field Description  Type
Unhandled Pattern  C
Unhandled Data     X
Input Pattern      X
Address Domain     X
2. Increase the number of samples to 5
Overrides
●Purpose
– Correct problems found during standardization
●Rule set may require overrides because you have data
– Not processed
– Incorrectly processed
●Override types
– Classification
– Input pattern
– Input text
– Unhandled pattern
– Unhandled text
●Can be tested with Rules Analyzer
Overrides screen
Checkpoint
1. (T/F) WAVES can standardize name fields.
2. (T/F) Rule sets are used in standardization processing.
3. Name the components of rule sets.
Checkpoint solutions
1. (T/F) WAVES can standardize name fields.
Answer: False
2. (T/F) Rule sets are used in standardization processing.
Answer: True
3. Name the components of rule sets.
Answer: Classification table, dictionary, pattern action file, lookup tables
Unit summary
Having completed this unit, you should be able to:
– Describe the Standardize stage
– Identify rule sets
– Build jobs using the Standardize stage
– Interpret standardization results
– Investigate unhandled data and patterns
Lab 7: Standardize country
●Word investigation
– Uses COUNTRY rule set
●Rule set found in Other folder
●Adds ISO country code to records
Lab 8: Select US records
●Uses the Filter stage to select records with the US ISO code
●Could also use the Transformer stage
Lab 9: Standardize USPREP
●Word investigation
– Uses rule set
●Rule set found in US folder
Lab 10: Standardize USNAME, USADDR, USAREA
●Word investigation
– Uses rule sets
●Rule sets found in US folder
Lab 11: Investigate unhandled patterns
●Character concatenate investigation
●C mask used to produce histogram
●X mask used to display other fields of interest
Lab 12: Apply user overrides
●Classification
●Input pattern
●Unhandled pattern
Match
Unit objectives
●After completing this unit, you should be able to:
– Build a QualityStage job to identify matching records
– Apply multiple match passes to increase efficiency/efficacy
– Interpret and improve match results
Match stage
●Statistically-based method for determining matches
●25 match comparison algorithms providing a full spectrum of fuzzy matching functions
●Ability to measure informational content of data
●Identify duplicate entities within one or more files
●Match specification built with Match Designer
●Critical field settings
What constitutes a good match?
Which of the following record pairs is a match? And how do you know?

W HOLDEN 12 MAIN ST
W HOLDEN 12 MAINE ST

W HOLDEN 128 MAIN PL 02111 12/8/62
W HOLDEN 128 MAINE PL 02110 12/8/62

WM HOLDEN 128A MAIN SQ 02111 12/8/62 338-0824
WILL HOLDEN 128A MAINE SQ 02110 12/8/62 338-0824

●Do you compare all the shared or common fields?
●Do you give partial credit?
●Are some fields (or some values) more important to you than others? Why?
●Do more fields increase your confidence?
●By how much? What is enough?
The value of information content
●Information content measures the significance of one field over another (Discriminating Value)
– A Gender Code contributes less information than a Tax-Id Number
●Information content also measures the significance of one value in a field over another (Frequency)
– In a First-Name Field, JOHN contributes less information than DWEZEL
●Significance is determined by a value’s reliability and its ability to discriminate; both can be calculated from your data
Distribution of weights
[Figure: histogram of the number of record pairs (# of Pairs, 0 to 4000) against the Weight of Comparisons (-20 to 40). Non-Matches cluster at the low-weight (less confidence) end, Matches at the high-weight (more confidence) end, with a grey area between them.]
●The weighted score is a relative measure of the probability of a match
●Thresholds defined can be used for automated processing
Example record pair and field weights:
WILLIAM J HOLDEN 128 MAIN ST 02111 12/8/62
WILLAIM JOHN HOLDEN 128 MAINE AVE 02110 12/8/62
+1 +1 +17 +2 +4 -1 +7 +9 = 40
Weights
●Measures the information content of a data value
●Each field contributes to the confidence (probability) of a match
Types of weights
●If a field matches, the agreement weight is used
– Agreement weight is a positive value
●If a field doesn’t match, the disagreement weight is used
– Disagreement weight is a negative value
●Partial weight is assigned for non-exact or “fuzzy” matches
●Missing values have a default weight of zero
●Weights for all field comparisons are summed to form a composite weight
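The rules above can be written as a short Python sketch. The function names are hypothetical, only exact comparison is shown (partial weights for fuzzy matches are left out), and the weight values passed in are whatever the match has calculated for each field.

```python
def field_weight(value_a, value_b, agreement, disagreement):
    """Weight contributed by one field comparison (exact comparison only)."""
    if not value_a or not value_b:
        return 0.0            # missing values default to a weight of zero
    return agreement if value_a == value_b else disagreement


def composite_weight(weights):
    """Sum the per-field weights into the composite weight."""
    return sum(weights)
```

Summing the field weights from the HOLDEN record pair shown earlier, composite_weight([1, 1, 17, 2, 4, -1, 7, 9]) gives 40, the total on that slide.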
Matching terminology
Informational Content: Measures the significance of one field value over another
Weight: Measures the informational content of a data value
Composite Weight: Measures the confidence of a match
Match Cutoffs: Distinguish matches from non-matches
False Positives: Records with a score above the high cutoff that really aren’t a match
False Negatives: Records below the low cutoff that really are a match
Measuring the conditions of uncertainty
●Reliability of the data in a given field
– Estimated as the probability that the field agrees given the record pair is a match
●Probability of a random agreement of values
– Estimated as the probability the field agrees given the record pair is not a match
Reliability (m-probability)
●Approximated as 1 minus the error rate for the given field
●The higher the m-probability, the larger the disagreement weight (penalty) when the field does not match, since the data is considered reliable
Chance agreement (u-probability)
●The u-probability can be approximated as the probability that a field agrees at random (by chance)
●QualityStage uses a frequency analysis to determine the probability of chance agreement for all values
– Created by a Match Frequency stage
●Rare values bring more weight to a match
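How the m- and u-probabilities turn into weights can be sketched with the usual probabilistic record-linkage (Fellegi-Sunter) formulas: the agreement weight is log2(m/u) and the disagreement weight is log2((1-m)/(1-u)). The unit does not spell these formulas out, so treat the code below as an assumption about the calculation rather than a statement of QualityStage internals; the function names are hypothetical.

```python
import math


def agreement_weight(m, u):
    """Weight added when a field agrees: log2(m / u)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight added when a field disagrees: log2((1 - m) / (1 - u))."""
    return math.log2((1.0 - m) / (1.0 - u))
```

Under these formulas a rare value (small u) earns a larger agreement weight, and a more reliable field (m closer to 1) produces a more negative disagreement weight, which is the behavior described on the two slides above.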
Blocking
●Grouping together like records that have a high probability of producing matches
●Only “like” records are compared to each other, making the match more efficient and computationally feasible
●Records in a “block” match exactly on one to several blocking fields
Blocking example: sample data
Block on NYSIIS of Last Name
NYSIIS LNAME NAME ADDRESS ZIP
YANG YUNG , WAYNE D 9000 SHEPARD DRIVE 78753
GARAS GEROSA, FRAN X 29 AARONS CT 06877
YANG YOUNG , JONATHAN A 1767 TOBEY ROAD 30341
GARAS GERISA, FRANCIS 29 AARONS CT 06877
GARAS GEROSA, FRANCIS XAVIER 29 AARONS COURT 06877
MATAC MARCUS MATIC 100 SUMMER STREET 02111
GARAS GEROSA, MARY 29 AARONS CT 06877
JANCAN RENEE JENKINS 100 SUMMER STREET 02111
YANG YOUNG THERESA C 1767 TOBEY ROAD 30341
Blocking example – NYSIIS of Last Name
NYSIIS NAME ADDRESS ZIP
YANG YUNG , WAYNE D 9000 SHEPARD DRIVE 78753
YANG YOUNG , JONATHAN A 4220 BELLE PARK DR 77072
YANG YOUNG THERESA C 1767 TOBEY ROAD 30341

GARAS GEROSA, FRAN X 29 AARONS CT 06877
GARAS GEROSA, FRANCIS XAVIER 29 AARONS COURT 06877
GARAS GEROSA, MARY 29 AARONS CT 06877
GARAS GARISA, FRANCIS 29 AARONS CT 06877

MATAC MARCUS MATIC 100 SUMMER STREET 02111

JANCAN RENEE JENKINS 100 SUMMER STREET 02111

Blocks with only one record are considered residuals
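The grouping shown in this example can be reproduced with a few lines of Python. This is a sketch only: the records are cut down to the blocking key and the name, the NYSIIS codes are copied from the sample data rather than computed, and the function names are hypothetical.

```python
from collections import defaultdict

# (NYSIIS of last name, name) pairs copied from the sample data.
RECORDS = [
    ("YANG", "YUNG , WAYNE D"),
    ("GARAS", "GEROSA, FRAN X"),
    ("YANG", "YOUNG , JONATHAN A"),
    ("GARAS", "GERISA, FRANCIS"),
    ("GARAS", "GEROSA, FRANCIS XAVIER"),
    ("MATAC", "MARCUS MATIC"),
    ("GARAS", "GEROSA, MARY"),
    ("JANCAN", "RENEE JENKINS"),
    ("YANG", "YOUNG THERESA C"),
]


def build_blocks(records):
    """Group records by blocking key; single-record blocks are residuals."""
    blocks = defaultdict(list)
    for key, name in records:
        blocks[key].append(name)
    candidates = {k: v for k, v in blocks.items() if len(v) > 1}
    residuals = [v[0] for v in blocks.values() if len(v) == 1]
    return candidates, residuals


def pair_count(n):
    """Number of pairwise comparisons among n records."""
    return n * (n - 1) // 2
```

With this data the YANG block needs 3 comparisons and the GARAS block 6, so 9 in total, against 36 if all nine records were compared with each other. That saving is why blocking makes matching computationally feasible.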
Blocking strategy
●Choose fields with reliable data
●Choose fields with a good distribution of values
●Combinations of fields may be used
Examples of blocking strategies
●Zip code for matching addresses
●NYSIIS of last name for matching individuals
●Brand name for matching products
●Combination of zip code and NYSIIS of street name for matching addresses
●Combination of NYSIIS of last name and first letter of first name for matching individuals
Blocking summary
●Blocking groups together “like” records
●Matching is more efficient for small block sizes
– Blocks should have fewer than 1000 records (a guideline, not a hard and fast rule)
●Blocking fields must match exactly for a candidate set to be created/evaluated
●Beware of block overflow
– The match runs out of computational resources
– Comparisons are not completed
– Every record in the block becomes an automatic residual
Match types
●Unduplication
– Identifies duplicate candidates in one file
●Reference Match (Two File)
– One-to-one correspondence
• For every record on the stream link we expect to find a match to one record on the reference link
– Many-to-one correspondence
• More than one record on the stream link can match to the same record on the reference link
Comparing data values
●Different comparisons for different data
●17 comparison methods
●Most common
– CHAR (character comparison): compares character by character, left to right
– UNCERT (character uncertainty): tolerates phonetic errors, transpositions, random insertion, deletion, and replacement of characters
– CNT_DIFF: counts keying errors in numeric fields; you set a tolerance threshold
– NAME_UNCERT: can be used to compare character values; if the strings are different lengths, the shorter of the two lengths is used
Match Implementation
Tasks required in match process
●Standardize the data
●Add data columns needed for blocking
●Generate match frequency report using Match Frequency stage
●Build match specification in Match Designer
– Add pass
• Blocking columns
• Match commands
– Configure match test results environment
●Run pass
●Review results
●Tune the match
– Add cutoffs
– Set overrides
– Add more passes
●Repeat steps until match results are acceptable
Standardize columns and generate match frequency
Match frequency stage
Map fields
Match frequency generation
Lab 13: Match frequency
●Use Match Frequency stage in a match job
Match Designer
●Used to build a match specification that will be addressed in a match job
●Features
– Design control center
– Data-centric
– Graphical representation of statistics
– Independent of job design
– Iterative development
Match Design - Unduplicate
Match design – creating specification
How to create a new match specification
Right-click in non-root area of repository
Match design - unduplicate
The major components:
– Test results
– Histogram
– Holding Area
– Pass Composer
– Decision Rules
– Data Viewer
– Cutoff Tuning
Match design - unduplicate
Select match type (example: unduplicate)
Will initially get one pass called MyPass
Match design - unduplicate
Click table definition icon
Use the Load button to access the table definition of the standardized data set
Match design - unduplicate
Blocking
Match Commands
Select pass icon
Match design - unduplicate
Save passes and specification
Match design - unduplicate
Name and place passes and specification
Match design - unduplicate
Set up test results area
Questions:
Where is the standardized data?
Where is the frequency report?
What ODBC-accessed database will store test results?
Match design - unduplicate
Standardized sample data
Frequencies data set
Data Source Name
User Name
Password
Note: these must be data sets
Match design - unduplicate
Add Blocking Columns
Match design - unduplicate
Select Column
Business Name
Click Apply or OK
Match design - unduplicate
Add MATCH Column
Match design - unduplicate
Business Name
Match design - unduplicate
Compare Type
Match design - unduplicate
Data Column
Right-click to view data frequencies
Match design - unduplicate
Frequencies
Match design - unduplicate
Select
Parameter
Fully configured pass
Expanded view will show details of blocking and match commands
Click Test Pass to run the pass against the data
Match design – after test pass run
Match design - unduplicate
Grouping option:
– Match Sets: See all matches and duplicates together
– Match Pairs + Sort: See the master record repeated
Match design - unduplicate
Default Display (Grouped by Match Sets)
Grouped by Match Pairs and then sorted Ascending by Weight
Match design - unduplicate
Compare Weights:See how any two records score
Match design - unduplicate
Statistics Tab
Change What Shows
Match design - unduplicate
Change How Shows
Match design - unduplicate
TOTAL Statistics Tab
Change What Shows
Change How Shows
Lab 14: Configure test results database
●Build a DB2 database to contain match test results
●Build an ODBC source to connect the database to QualityStage
Lab 15: Match specification
●Use Match Designer to build specification for unduplicate job
●Configure test results area
Match improvement strategy
1. Set critical values for important fields
2. Review calculated weights
• Adjust weights using weight overrides
3. Set cutoffs
4. Add additional passes
Critical fields
●Used to identify fields that must agree in order for records to be linked
– Critical: Field values must agree exactly or the records cannot be linked (considered a match)
– Critical Missing OK: Field values must agree exactly on values not considered “missing values”
●QualityStage feature: Variable Special Handling
Variable special handling
Weight overrides
●Allows you to adjust both the agreement and/or disagreement weights for specific situations– Add to calculated weight– Replace weight
On Match Commands screen
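The difference between the two override options can be shown with a few lines of arithmetic. This is a sketch under stated assumptions: `apply_override`, its mode names, and the sample weights are hypothetical and are not taken from the Match Designer.

```python
def apply_override(calculated_weight, override_value, mode):
    """Apply a weight override in one of the two ways listed on the slide."""
    if mode == "add":
        # Add to calculated weight
        return calculated_weight + override_value
    if mode == "replace":
        # Replace weight
        return override_value
    raise ValueError(f"unknown override mode: {mode}")


# Hypothetical disagreement weight of -4.0 and an override value of -10.0
print(apply_override(-4.0, -10.0, "add"))
print(apply_override(-4.0, -10.0, "replace"))
```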
Weight override screen
Cutoffs
● There are two cutoffs
  – Match cutoff (high cutoff)
  – Clerical cutoff (low cutoff)
● Records with a weight equal to or above the match cutoff are considered matches
● Records with a weight below the clerical cutoff are not matches
● Records with a weight greater than or equal to the clerical cutoff and less than the match cutoff are considered clerical records for manual review
● Cutoffs can be set at the same value, eliminating clerical records
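The cutoff rules translate directly into a three-way classification. The sketch below is illustrative: the function name, the result labels, and the cutoff values 20.0 and 10.0 are assumptions, not product defaults.

```python
def classify(weight, match_cutoff, clerical_cutoff):
    """Classify a composite weight against the match and clerical cutoffs."""
    if clerical_cutoff > match_cutoff:
        raise ValueError("clerical cutoff must not exceed match cutoff")
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "nonmatch"


# Hypothetical cutoffs chosen only for illustration
MATCH_CUTOFF = 20.0
CLERICAL_CUTOFF = 10.0

for w in (38.65, 25.81, 14.45, 3.10):
    print(w, classify(w, MATCH_CUTOFF, CLERICAL_CUTOFF))
```

Setting both cutoffs to the same value leaves no weight that can fall into the clerical range, which is how clerical records are eliminated.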
Setting the match cut-off
Weights  Data fields
27.82    PO BOX 930202               Definite match
27.82    PO BOX 930202
27.82    PO BOX 930202
38.65    35 COLLIER RD NW STE 610    Definite match
38.65    35 COLLIER RD NW STE 610
25.81    928 S 1ST ST                Questionable match
14.45    S 1ST ST
Multiple match passes
● Additional passes are helpful in overcoming data errors and missing values in block fields
● You should always create at least two match passes
● Change blocking strategies for each pass
Example: multiple match passes
● Pass 1 blocked on street name
● Pass 2 found additional matched records in which the street name was different but the names were the same
Pass  Weights  Data fields
1     26.31    JASON BIRCH  1350 WALTON WAY  30901
1     26.31    JASON BIRSH  1350 WALTON WAY  30901
1     20.42    JOHN SMITH   2047 PRINCE AVE  30604
1     10.83    MARY SMITH   2047 PRINCE AVE  30604
1     RES A    JOHN SMITH   P.O. BOX 123     30604
2     20.42    JOHN SMITH   2047 PRINCE AVE  30604
2     10.19    JOHN SMITH   P.O. BOX 123     30604
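Why a second pass with a different blocking field finds pairs the first pass cannot is easy to see in a small sketch. Only records that share a block value are ever compared. The code below is a simplification: the record layout and `candidate_pairs` are hypothetical, and it ignores weights and the handling of residuals between passes.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, block_key):
    """Return the set of record-id pairs that share the same blocking value."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[block_key]].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are compared with each other
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = [
    {"id": 1, "name": "JOHN SMITH", "street": "PRINCE AVE"},
    {"id": 2, "name": "MARY SMITH", "street": "PRINCE AVE"},
    {"id": 3, "name": "JOHN SMITH", "street": "P.O. BOX 123"},
]

pass1 = candidate_pairs(records, "street")  # blocks on street name
pass2 = candidate_pairs(records, "name")    # blocks on name
new_in_pass2 = pass2 - pass1

print(pass1)
print(new_in_pass2)
```

Record 3 has a different street value, so it can only be paired with record 1 once the blocking field changes.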
Match implementation – Unduplicate job
Double Click
Unduplication implementation
Unduplication Implementation
Verify link order for both input and output
Map all output links
Checkpoint
1. (T/F) Match specifications are created using Designer.
2. (T/F) An unduplicate match can be used against two files.
3. Which match specification component determines the extent of the clerical review records?
Checkpoint solutions
1. (T/F) Match specifications are created using Designer.
   Answer: True
2. (T/F) An unduplicate match can be used against two files.
   Answer: False
3. Which match specification component determines the extent of the clerical review records?
   Answer: cutoff values
Unit summary
Having completed this unit, you should be able to:
  – Build a QualityStage job to identify matching records
  – Apply multiple match passes to increase efficiency/efficacy
  – Interpret and improve match results
Lab 16: Unduplicate job
●Build unduplicate job using the match specification
Survive
Unit objectives
● After completing this unit, you should be able to:
  – Identify Survive techniques
  – Describe implementation options
  – Define Survive rules
  – Build a Survive job
Survive stage
● Point-and-click creation of business rules to determine “surviving” data – user decides how to survive data
● Performed at record or field level – very flexible
● Creates a single, consolidated record containing the “best-of-breed” data
● Provides consolidated view of the data
Survive example
Survive Input (Match Output)
Group  Legacy  First    Middle  Last     No.   Dir.  Str. Name   Type  Unit No.
1      D150    Bob              Dixon    1500  SE    ROSS CLARK  CIR
1      A1367   Robert           Dickson  1500        ROSS CLARK  CIR
23     D689    William  A       Obrian   5901  SW    74TH        ST    STE 202
23     A436    Billy    Alex    O’Brian  5901  SW    74TH        ST
23     D352    William          Obrian   5901        74          ST    #202
Survived Consolidated Output
Group  Legacy  First    Middle  Last     No.   Dir.  Str. Name   Type  Unit No.
1      D150    Robert           Dickson  1500  SE    ROSS CLARK  CIR
23     D689    William  Alex    O’Brian  5901  SW    74TH        ST    STE 202
Cross-Reference File
Group  Legacy
1      D150
1      A1367
23     D689
23     A436
23     D352
Survive rules
● A rule contains a condition and a set of target fields
  – When the condition is met, the field becomes a candidate for the “best”
  – All records in a group are tested against the condition
  – The “best” populates the target fields
● Multiple targets are permitted for the same rule
Survive rules
● Custom Rule
  – Build your own logical expression
  – Comparison (=, !=, <, >, <=, >=)
  – Logical (and, or, not)
  – Indicate the current and best records with the following notation
    • c.field indicates the current record
    • b.field indicates the best record
  – Parentheses ( ) can be used for grouping complex conditions
  – String literals are enclosed in double quotation marks, such as "MARS"
  – A semicolon (;) terminates a rule
Building survive rules
● Survive Rules Definition screen lets you easily build, delete and manage survivor rules
Survive techniques
● Pre-defined techniques
  – Source
  – Recency
  – Frequency
  – Most complete (longest string)
● User-specified logic
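Two of the pre-defined techniques, Frequency and Most complete, can be sketched as small functions that pick one value per field from a match group. This is an approximation for illustration: the function names, the treatment of blanks, and the tie-breaking (first value wins) are assumptions. The sample values are the group 23 names from the Survive example in this unit.

```python
from collections import Counter


def longest(values):
    """Most complete: the longest non-blank value (first one wins a tie)."""
    non_blank = [v for v in values if v.strip()]
    return max(non_blank, key=len) if non_blank else ""


def most_frequent(values):
    """Frequency: the most common non-blank value (first seen wins a tie)."""
    non_blank = [v for v in values if v.strip()]
    return Counter(non_blank).most_common(1)[0][0] if non_blank else ""


# Field values for group 23 (first, middle, last name)
first_names = ["William", "Billy", "William"]
middle_names = ["A", "Alex", ""]
last_names = ["Obrian", "O'Brian", "Obrian"]

survivor = (most_frequent(first_names), longest(middle_names), longest(last_names))
print(survivor)
```

Applying a different technique per field is what lets the consolidated record combine values from several input records.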
Target fields
● Fields you want to write to the output file
● Populated based on meeting the conditions of the survivor rule(s)
● Fields not listed as targets are excluded from the output file
● May have multiple targets for each rule
Example: complex survive rule
● The following rule states that FIELD3 of the current record should be retained if the field contains five or more characters and FIELD1 has any contents.
● The prefix b. indicates the current “best” record
● The prefix c. indicates the current record being tested against the survivor rule
FIELD3: (SIZEOF (TRIM c.FIELD3) >= 5) AND (SIZEOF (TRIM c.FIELD1) > 0) ;
● In this rule, FIELD3 before the colon is the TARGET; the expression after the colon is the CONDITION
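The condition in this rule can be restated in Python to make the trimming and length tests explicit. This is a sketch, not the Survive stage's evaluation engine: `survive_field3` is a hypothetical name, records are plain dictionaries, and letting the first qualifying record win is an assumption, since the rule itself does not compare against b.

```python
def survive_field3(group):
    """Evaluate FIELD3: (SIZEOF (TRIM c.FIELD3) >= 5) AND (SIZEOF (TRIM c.FIELD1) > 0)

    Returns the FIELD3 value of the first record in the group that meets the
    condition, or None if no record qualifies. "First record wins" is an
    assumption made for this sketch.
    """
    for current in group:
        field3 = current.get("FIELD3", "").strip()  # TRIM c.FIELD3
        field1 = current.get("FIELD1", "").strip()  # TRIM c.FIELD1
        if len(field3) >= 5 and len(field1) > 0:
            return current["FIELD3"]
    return None


group = [
    {"FIELD1": "", "FIELD3": "ABCDEFG"},  # FIELD1 is empty
    {"FIELD1": "X", "FIELD3": "ABC"},     # FIELD3 is too short
    {"FIELD1": "X", "FIELD3": "ABCDE"},   # meets both conditions
]
print(survive_field3(group))
```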
Survive Implementation
Double Click
Survive QualityStage job
Survive stage properties
Output Column Technique
Survive stage properties
‘Complex’ available
Survive stage properties
Checkpoint
1. (T/F) Survivorship can allow more than one record to survive.
2. (T/F) Survivorship rules deal with the complete record only.
3. Name three survive rules.
Checkpoint solutions
1. (T/F) Survivorship can allow more than one record to survive.
   Answer: False
2. (T/F) Survivorship rules deal with the complete record only.
   Answer: False
3. Name three survive rules.
   Answer: most recent record, longest non-blank, most frequent non-blank
Unit summary
Having completed this unit, you should be able to:
● Identify Survive techniques
● Describe implementation options
● Define Survive rules
● Build a Survive job
Lab 17: Survivorship job
● Build survivorship job
Survive job with XREF file
Special Topics
Full Run
1. Double Click
Full run – single job
Full run using DataStage job sequencer
QualityStage Migration Tool
QualityStage Migration Tool – overview
● The QualityStage Migration Tool (QSMT) provides the ability to migrate QualityStage 7.5 jobs and Standardization Rule Sets to the QS8 environment.
● QSMT analyzes the QS 7.5 server project directory to construct “dsx” files, which can be imported into the QS8 common repository using the DS & QS8 Designer’s “import” facility
QualityStage Migration Tool – overview
● QSMT functionality offers three types of QS 7.5 object migrations:
  – QS 7.5 Standardization Rule Set
  – QS 7.5 job in combined mode
  – QS 7.5 job in expanded mode
● Two modes for migrating jobs to QS8:
  – Combined Mode
    • Use when you need to take a legacy process and just run it in QS8
    • Allows control before and after the legacy process
    • Will always run after importing without any manual tuning
  – Expanded Mode
    • Use when you need to add QS8 operators within a migrated process
    • May require some manual tuning to run
Rule set migration
● The QSMT has the ability to migrate Standardization Rule Sets in one of two ways:
  – Explicitly: you may specify the rule set you want to migrate
  – By job dependency: you may migrate all rule sets associated with a particular job
Note: Regardless of the migration mode, all migrated rules will have the new naming convention: QS-7.5-Ruleset-Name_QS-7.5-Project-Name
Combined mode migration
● Use this mode to get a legacy QS job up and running in QS8 with as little effort as possible. Jobs will import and run without modifications
● After importing, a migrated job will appear in the “Jobs” folder of the repository view in the QS/DS 8 Designer client
● Jobs are renamed by QSMT within the QS8 package to minimize name collisions
● The new job name has the following naming convention: QS-7.5-Job-Name_QS-7.5-Project-Name
QSMT – combined mode migration
● The job consists of a single instance of the QS 8 Legacy Job stage, together with some number of DS Sequential File stages, which are linked to the Legacy Job stage as inputs or outputs
QSMT – combined mode migration
● All the QS stages run under the control of the single Legacy Job stage in Combined Mode
● The list of operations can be seen by opening the Legacy stage
● File I/O to external files is performed by the Information Server Sequential File stages
QSMT – combined mode & running a QS8 job
● Once imported, Legacy jobs are run the same as any other QS8 job
  – Prior to compiling, be sure any required rule sets are provisioned to the server
  – Run as you would any other QS8 job
QSMT – expanded mode
● Use to re-implement the job in the QS8 environment
● After importing, a migrated job will appear in the “Jobs” folder in the same way as in Combined Mode
● The job consists of one or more stages for each 7.5 stage, plus DS PX Sequential File stages, linked to represent the 7.5 job flow. For complex jobs, stages may need to be reorganized to improve readability
QS stage migration reference table
QS 7.5 Stage Type          QS8 Stage Type   Conditions
Abbreviate                 Legacy Job       Always
Build                      Legacy Job       Always
Collapse                   Legacy Job       Always
FFC                        Copy             “Delimited text” used in 7.5 stage
FFC                        ODBC Enterprise  “ODBC” used in 7.5 stage
Investigate                Legacy Job       Always
Match*                     Legacy Job       Always
Multinational Standardize  MNS              Always
Parse                      Legacy Job       Always
Select                     Legacy Job       “Merge” used in 7.5 stage
Select                     Filter           “Split”, “Accept”, or “Reject” used in 7.5
* Currently working on converting Match specifications for GA
QS stage migration reference table
QS 7.5 Stage Type  QS8 Stage Type  Conditions
Sort               Sort            Always
Standardize        Standardize     Always
Survive            Legacy Job      If target columns overlap
Survive            Survive         If target columns do not overlap
Transfer           Legacy Job      Always
Unijoin            Legacy Job      Always
WAVES              WAVES           Always
QSMT – expanded mode & running a QS8 job
● Prior to compiling, be sure to complete the following:
  – Provision any required rules to the server
  – Add ODBC connection information to any ODBC read or write stages appearing in the job
● To complete the migration, perform the following for every Standardize, Survive, MNS and WAVES stage that appears on the canvas:
  – Open the stage editor for the stage (e.g. by double-clicking it)
  – Click OK
● Once the above tasks are completed, compile and run as you would any other job
Lab 18: QualityStage Migration Tool
● Migrate 7.5 QualityStage jobs to version 8
Globalization
Objectives
After completing this module you will be able to:
● Build jobs that read and write Japanese data
● Modify client settings to display Japanese data with correct characters
Terminology
● Character Set
  – An ordered list of characters used for text
    • Examples: Latin, Cyrillic, Unicode
● Character Encoding
  – How each character in a character set is represented as bits
    • Examples: UTF-8, UTF-16BE, GB18030 are encodings of Unicode
● Codepage
  – Microsoft Windows term for Encoding, often used in other contexts too
    • Examples:
      – 1252 is Windows Latin1, a superset of ISO8859/1
      – 932 is another name for Shift-JIS
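The two codepage examples can be tried out with Python's built-in codecs. This is an illustration only: the codec names `cp1252`, `latin-1`, and `cp932` are Python's names for these encodings and stand in for Windows code pages 1252, ISO 8859-1, and 932.

```python
# Code page 1252 agrees with ISO 8859-1 for e-acute (U+00E9) ...
e_acute = "\u00e9"
same_byte = e_acute.encode("cp1252") == e_acute.encode("latin-1")

# ... but it also assigns characters that ISO 8859-1 lacks, such as the euro sign (U+20AC).
euro_cp1252 = "\u20ac".encode("cp1252")
try:
    "\u20ac".encode("latin-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False

# Japanese text ("Tokyo", U+6771 U+4EAC) through code page 932: two bytes per character.
tokyo = "\u6771\u4eac"
tokyo_bytes = tokyo.encode("cp932")
round_trip = tokyo_bytes.decode("cp932")

print(same_byte, euro_cp1252, euro_in_latin1, len(tokyo_bytes), round_trip == tokyo)
```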
Character Sets
● Latin
  – Italian, Spanish, French, English alphabets
● Cyrillic alphabet
  – Subsets are used by six Slavic languages (Bulgarian, Russian, Belarusian, Serbian, Macedonian, Ukrainian) and some non-Slavic languages (Kazakh, Uzbek, Kyrgyz, Tajik, and Mongolian)
● ASCII
  – Represents 128 characters (7-bit); 8-bit extensions such as ISO 8859-1 represent 256
● Unicode
  – Defines a code space of 1,114,112 code points; the first 65,536 form the Basic Multilingual Plane
  – Standard for representing the characters of all languages
  – Includes Chinese, Japanese, and Korean
Character encoding
● Definition – A system that pairs a character from a character set to something else, such as a number
● Two common computer encodings for Unicode
  – UTF8
    • Variable length encoding for Unicode
    • Encodes each character to one to four bytes
  – UTF16
    • Variable length encoding for Unicode
    • Encodes each character to two or four bytes
    • Allows either endian representation; a byte order mark (BOM) at the start of the data indicates the byte order
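The variable-length behaviour of both encodings can be checked with a few sample characters. The sketch uses Python's standard codecs; `encoded_lengths` is a helper name introduced here for illustration.

```python
def encoded_lengths(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


# Latin letter A, e-acute, a kanji, and a character outside the first 65,536 code points
for ch in ("A", "\u00e9", "\u6771", "\U0001F600"):
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")

# Python's generic "utf-16" codec writes a byte order mark in front of the data.
with_bom = "A".encode("utf-16")
print(with_bom[:2] in (b"\xff\xfe", b"\xfe\xff"))
```

The explicit `utf-16-be` and `utf-16-le` codecs fix the byte order in the name and write no mark.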
NLS
● NLS – National Language Support
  – Globalization + Localization/Translation
● NLS map
  – What DataStage uses to convert between external and internal encodings
  – Internal encoding is UTF8 for Server engine, UTF16 for Parallel engine
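Conceptually, an NLS map is a transcoding step between an external encoding and the engine's internal one. The sketch below mimics that with Python codecs and is not DataStage code: `to_internal` is a hypothetical function, Python codec names stand in for NLS map names, and little-endian byte order for the Parallel engine's UTF-16 is an assumption made only so the example runs.

```python
def to_internal(raw, nls_map, engine="parallel"):
    """Decode external bytes with the given map, then re-encode for the engine."""
    text = raw.decode(nls_map)
    # Assumption: UTF-16 little-endian for the Parallel engine, UTF-8 for the Server engine
    internal = "utf-16-le" if engine == "parallel" else "utf-8"
    return text.encode(internal)


# "Tokyo" (U+6771 U+4EAC) as it would arrive from a Shift-JIS source file
external = "\u6771\u4eac".encode("shift_jis")

server_bytes = to_internal(external, "shift_jis", engine="server")
parallel_bytes = to_internal(external, "shift_jis")
print(len(external), len(server_bytes), len(parallel_bytes))
```

Choosing the wrong map does not always fail loudly; it can silently produce the wrong characters, which is why the map is set per project, job, stage, or column.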
Where DataStage NLS Mapping Happens
[Diagram. Server: Parallel Engine running job (Unicode, UTF-16); DataStage & QualityStage runtime objects (Unicode, UTF-8); Information Server Common Design Repository (XML, UTF-8); messages, logs, scripts, Job Monitor. Client: Windows code page. Maps are shown at each external character set and between server and client.]
Examples of DataStage NLS Maps
Parallel          Description
IBM367            Standard (US) ASCII 7-bit set
Big5              TAIWAN: "Big 5" standard
IBM1026           IBM EBCDIC variant 1026 (Turkish)
GB2312            CHINESE: EUC as per GB 2312
ISO_8859-1:1987   ISO Standard 8859 part 1: Latin-1
ISO_8859-5:1988   ISO Standard 8859 part 5: Latin-Cyrillic
KS_C_5601-1987    KOREAN: EUC as per KSC 5601
windows-1253      MS Windows codepage 1253 (Greek)
windows-1255      MS Windows codepage 1255 (Hebrew)
IBM865            PC DOS code page 865 (Nordic)
Shift_JIS         JAPANESE: Shift-JIS main map
TIS-620           THAILAND: Industrial Standard 620
Setting a Client/Server Map
● Admin Client (whole server): associates a server map with the current Windows code page
[Diagram: a map between the client and the DataStage & QualityStage runtime objects (Unicode, UTF-8)]
Setting Job-Level Maps
● Admin Client (per project): sets the default map name to use with all Parallel jobs in this project
● …unless you override it in the job properties dialog
Setting a Stage-Level Map
● Various stages have an NLS Map tab, e.g. Sequential File, External Source, External Target, File Set
  – Define character set mappings (between ustring and external file)
  – Applied at stage or individual field level
[Diagram: a map between the Parallel Engine running the job (Unicode, UTF-16) on the server and the external character set]
Setting a Column-Level Map
● For Sequential File-type stages, NChar, and Char with extended type, offer a drop-down list of map names in the NLS Map property
● Non-default NLS map (for relevant types)
● Char may be "extended" for Unicode
Converting string to ustring: manual control
● Transformer, Modify, etc.
  – Conversion between string and ustring happens automatically, taking the current map from context (job level or stage=operator level)
  – Fine control via explicit conversion functions
● Conversions may use a specific map name
NLS Implementation using Investigate stage
Job Design from Lab
Job-level NLS map
Investigation results for Japanese city column
Input data
Client machine with codepage set to JPN
Output report data
Client machine with codepage set to JPN
Unit summary
Having completed this unit, you should be able to:
● Build a QualityStage investigation job for non-English data
● View correctly-formatted results in the DataStage/QualityStage data viewer
Lab 19: NLS
● Build investigation job for city using Japanese data
Address Verification Interface Stage
Objectives
After completing this module you will be able to:
● Build jobs using the AV stage to parse and verify address data
AVI Stage
● Provides
  – Transliteration (e.g. Japanese to Latin)
  – Parsing
  – Address validation
● WAVES equivalent
● Does not provide postal certification discounts
  – Use CASS (US), DPID (Australia), or SERP (Canada) if certification is desired
● Supports real-time
Components
● AV stage
● Reference data
  – 16 Geographies
  – Purchased via Passport system
● API libraries
  – Address Doctor
Reference Data
● Required for validation function only
● Requires annual license agreement
● Location pointed to by AV stage
● Some databases are memory intensive
● Load options
  – Partial preload
    • Indexes loaded to memory
  – Full preload
    • Data loaded to memory
    • Fast access but must have adequate memory
  – No preload
    • Data accessed from disk, slowest method
Job components
[Diagram labels: Input address data, AVI stage, Optional error file]
Stage properties
Reference data location
Function
Navigation
Mapping input columns to address elements
Transliterate mode
● Map input columns to address elements
  – Multiple input columns can be mapped to one address element
● Options offer the choice to increase the number of address lines
Map columns to output link
Parsing mode
● Input sample
●Output sample
Validation mode
● Uses reference data from a database
● Map input columns to address elements
● Can activate error link
● Creates validation summary report
● Sample output (only two of the validation columns shown)
Validation mode statuses
● Part of output record
● Document actions taken by AV stage
● Short code
● Verbose code
● Example
Validation summary report sample (USPREP)
Validation Summary Report
Company Name:
List Identifier:
Processing Date (yyyy/mm/dd): 2009/02/25
Total Number Of Records Processed: 2843
Passed:           2843  100.00%
Failed:              0    0.00%
Validated:        2233   78.54%
Corrected:         415   14.60%
Has Suggestion:    195    6.86%
PostCode Failed:    70    2.46%
City Failed:        37    1.30%
Street Failed:     274    9.64%
Country Failed:      0    0.00%
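The percentages in the sample report are each count divided by the total number of records processed. The short check below recomputes them; the `counts` dictionary and `pct` helper are names introduced for this sketch, and the figures are copied from the sample report.

```python
TOTAL = 2843

counts = {
    "Passed": 2843,
    "Failed": 0,
    "Validated": 2233,
    "Corrected": 415,
    "Has Suggestion": 195,
    "PostCode Failed": 70,
    "City Failed": 37,
    "Street Failed": 274,
    "Country Failed": 0,
}


def pct(count, total):
    """Format a count as a percentage of the total, to two decimal places."""
    return f"{100 * count / total:.2f}%"


for label, count in counts.items():
    print(f"{label}: {count} {pct(count, TOTAL)}")
```

In this sample, Validated, Corrected, and Has Suggestion add up to the Passed count.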
Unit summary
Having completed this unit, you should be able to:
● Build jobs using the AV stage to parse and verify address data
Lab 20: AV Stage
● Build AV job to parse Japanese address data
● Review prebuilt job that validated USPREP data from earlier lab