
QS Essentials


Page 1: QS Essentials

© Copyright IBM Corporation 2009 Course materials may not be reproduced in whole or in part without the prior written permission of IBM. 4.0.3

QualityStage 8 Essentials

DX741

Page 2: QS Essentials


Copyright, Disclaimer of Warranties and Limitation of Liability

© Copyright IBM Corporation February 2007

IBM Software Group
One Rogers Street
Cambridge, MA 02142

All rights reserved. Printed in the United States.

IBM and the IBM logo are registered trademarks of International Business Machines Corporation.

The following are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both:

AnswersOnLine DynamicServer, WorkgroupEdition RedBrick Decision ServerAIX Enterprise Storage Server RedBrickMineBuilderAPPN FFST/2 RedBrickDecisionscapeAS/400 Foundation.2000 RedBrickReadyBookMaster Illustra RedBrickSystemsC-ISAM Informix RelyonRedBrickClient SDK Informix4GL S/390Cloudscape InformixExtendedParallelServer SequentConnection Services InformixInternet Foundation.2000 SPDatabase Architecture Informix RedBrick Decision Server System ViewDataBlade J/Foundation TivoliDataJoiner MaxConnect TMEDataPropagator MVS UniDataDB2 MVS/ESA UniData&DesignDB2 Connect Net.Data UniversalDataWarehouseBlueprintDB2 Extenders NUMA-Q UniversalDatabaseComponentsDB2 Universal Database ON-Bar UniversalWebConnectDistributed Database OnLineDynamicServer UniVerseDistributed Relational OS/2 VirtualTableInterfaceDPI OS/2 WARP VisionaryDRDA OS/390 VisualAgeDynamicScalableArchitecture OS/400 WebIntegrationSuiteDynamicServer PTX WebSphereDynamicServer.2000 QBIC WebSphere DataStageDynamicServer with Advanced DecisionSupportOption QMFDynamicServer with Extended ParallelOption RAMACDynamicServer with UniversalDataOption RedBrickDesignDynamicServer with WebIntegrationOption RedBrickDataMine

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Java, JDBC, and all Java-based trademarks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

All other product or brand names may be trademarks of their respective companies.

All information contained in this document has not been submitted to any formal IBM test and is distributed on an “as is” basis without any warranty either express or implied. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer’s ability to evaluate and integrate them into the customer’s operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will result elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk. The original repository material for this course has been certified as being Year 2000 compliant.

This document may not be reproduced in whole or in part without the prior written permission of IBM.

Note to U.S. Government Users – Documentation related to restricted rights – Use, duplication, or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corp.

Page 3: QS Essentials


Course Contents

Topic                           Page
Data Quality Issues             5
QualityStage 8 Architecture     39
Developing with QualityStage    55
Investigation                   79
Standardize                     117
Match                           185
Survivorship                    258
Special Topics                  278
Globalization (NLS)             295
Address Verification Stage      312

Page 4: QS Essentials


Course contents

● Data quality issues
● Information Server purpose and architecture
● Introduction to DataStage and QualityStage
● Investigation
● Standardization
● Match
● Survivorship
● Special Topics
  – Data quality methodology
  – QualityStage Migration Tool

Page 5: QS Essentials


Data Quality Issues

Page 6: QS Essentials


Unit objectives

● After completing this unit, you should be able to:
  – List the five common data quality contaminants
  – Describe each of the following processes:
    • Investigation
    • Standardization
    • Match
    • Survivorship

Page 7: QS Essentials


Data quality challenges

● Different or inconsistent standards in structure, format or values
● Missing data, default values
● Spelling errors, data in wrong fields
● Buried information
● Data anomalies

Page 8: QS Essentials


Data quality – why do we care?

● Accurate reports
● Accurate information for support operations
● Support development of applications that go beyond original scope for which data was designed
  – Master Data Management
  – Data Warehouse
  – Analytical applications

Page 9: QS Essentials


Example - Master Data Management

[Diagram: Source 1, Source 2, Source 3 → Align, Harmonize, Consolidate → Consolidated customer view]

Page 10: QS Essentials


Different or inconsistent standards

Sources: Source 1, Source 2, Source 3

Name Field              Location
MARC DILORENZO ESQ      BOSTON
MRS DENNIS MARIO        HARTFORD
MR & MRS T. ROBERTS     CHICAGO

DILORENZO, MARK         6793
MARIO, DENISE           0215
ROBERTS, TOM & MARY     8721

MARK DI LORENZO         MA93
DENIS E. MARIO          CT15
TOM & MARY ROBERTS      IL21

Page 11: QS Essentials


Missing data & default values

Do the field values match the meta data labels?

Sample column values:

NAME: Denise Mario DBA; Marc Di Lorenzo ETAL; Tom & Mary Roberts; First Natl Provident; Astorial Fedrl Savings; Kevin Cooke, Receiver; John Doe Trustee for K

SOC. SEC. #: 228-02-1975; 999999999; 025-37-1888; 34-2671434; 101010101; LN#12-756; 18-7534216; 111111111

TELEPHONE: 6173380300; 3380321; 415-392-2000; 508-466-1200; 212-235-1000; FAX 528-9825; 5436

Page 12: QS Essentials


Buried information

Legacy Meta Desc.: NAME 1, ADDRESS 1, ADDRESS 2, ADDRESS 3, ADDRESS 4, ADDRESS 5

Legacy Record Values: Robert A. Jones TTE Robert Jones Jr. First Natl Provident FBO Elaine & Michael Lincoln UTADTD 3-30-89 59 Via Hermosa c/o Colleen Mailer Esq Seattle, WA 98101-2345

Page 13: QS Essentials


The anomalies nightmare

CUSNUM     NAME                    ADDRESS                          SALES $
90328574   IBM                     187 N.Pk. Str. Salem NH 01456    8,494.00
90328575   I.B.M. Inc.             187 N.Pk. St. Sarem NH 01456     3,432.00
90238495   International Bus. M.   187 No. Park St Salem NH 04156   2,243.00
90233479   Int. Bus. Machines      187 Park Ave Salem NH 04156      5,900.00
90233489   Inter-Nation Consults   15 Main St. Andover MA 02341     6,800.00
90234889   Int. Bus. Consultants   PO Box 9 Boston MA 02210         10,243.00
90345672   I.B. Manufacturing      Park Blvd. Boston MA 04106       15,999.00

Callouts: Spelling Errors, Anomalies, Lack of Standards, No common key

Page 14: QS Essentials


Acct # Name Address City State Zip Note

5154155 Peter J. Lalonde 40 Beacon St. Melrose, Mass 02176 ODP

5152335 LaLonde, Peter 76 George 617-210-0824 Boston YES MA 02111

5146261 Lalonde, Sofie 40 Bacon Street Melrose MA CHK ID

87121 Pete & Soph Lalond 76 George Road Boston MASS FR Alert

87458 P. Lalonde FBO S.Lalonde 40 Becon Rd. Melrose MA 02 176

What data challenges do you face?

•No consistent naming convention

•Business terms and spillover text

•Missing values or data in the wrong fields

•Buried information

•Misspelling

•No unique key linking records together

Page 15: QS Essentials


Why investigate?

● Discover potential anomalies in the data
● Examine single domain and free-form fields
● Identify invalid and default values
● Reveal undocumented business rules
● Verify the reliability of the data in the fields to be used as matching criteria
● Gain complete understanding of data

Page 16: QS Essentials


Investigate – single domain report

• Single domain

[Report screenshot; callouts: Field, % of Total, Freq. Count, Sample source data, Frequency]

Page 17: QS Essentials


Investigate – word pattern report

• Freeform text (Word)

[Report screenshot; callouts: Field, % of Total, Pattern, Sample source data, Frequency]

Page 18: QS Essentials


What is standardize?

● Applying business logic to data chaos.
  – Pattern manipulation
● Enforcing business standards on data elements.
  – Standards definition
● Transforming the input to an output which meets the business requirement.
  – Field structuring

Page 19: QS Essentials


How to standardize

● Parse specific data fields into smaller, atomic data elements
  – Atomic data elements are called tokens
  – Categorize identified elements
    • Separate Name, Address, and Area from freeform Name & Address lines
    • Identification of Distinct Material Categories (e.g. Sutures vs. Orthopedic Equipment)
● Refine data elements
    • Example 1
      – Name = ‘DR PAUL E JONES’ becomes:
        > Title = ‘DR’
        > First Name = ‘PAUL’
        > Middle Name = ‘E’
        > Last Name = ‘JONES’
    • Example 2
      – Part Description = ‘BLK LATEX GLOVE’ becomes:
        > Color = ‘BLACK’
        > Type = ‘LATEX’
        > Part = ‘GLOVE’

Page 20: QS Essentials


Why standardize?

● Normalize values in data fields to standard values
  – Transform First Name = ‘MIKE’ → ‘MICHAEL’
  – Transform Title = ‘Doctor’ → ‘Dr’
  – Transform Address = ‘ST. Michael Street’ → ‘Saint Michael St.’
  – Transform Color = ‘BLK’ → ‘BLACK’
● Apply phonetic coding to key words to facilitate record linkage
  – NYSIIS
  – Soundex
  – Typically applied to Name fields (first, last, street, city)
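
The two ideas on this slide, value normalization and phonetic coding, can be sketched in a few lines of Python. This is an illustration only, not QualityStage's implementation: the lookup table `STANDARD_VALUES` and the function names are hypothetical, and the phonetic code shown is the standard American Soundex.

```python
# Hypothetical lookup table; a real rule set would be far larger.
STANDARD_VALUES = {
    "MIKE": "MICHAEL",
    "DOCTOR": "DR",
    "BLK": "BLACK",
}

# Standard American Soundex digit assignments.
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def normalize(value: str) -> str:
    """Upper-case a value and map it to its standard form if one is known."""
    key = value.strip().upper()
    return STANDARD_VALUES.get(key, key)


def soundex(word: str) -> str:
    """Return the four-character Soundex code of a word ('' if no letters)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # vowels reset prev, so a repeated code is kept
    return (code + "000")[:4]
```

Two spellings of the same surname, such as ‘Lalonde’ and ‘Lalond’, receive the same code, which is why phonetic codes are useful when grouping candidate records for matching.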

Page 21: QS Essentials


QualityStage standardize

● Uses a highly flexible pattern recognition language
● Can employ field or domain specific standardization (i.e. unique rules for names vs. addresses vs. dates, etc.)
● Contains customizable classification and standardization tables
● Utilizes results from data investigation

Page 22: QS Essentials


QualityStage standardize report example

[Report screenshot; callouts: Ind./Org. flag, Original data]

Page 23: QS Essentials


Match

“Conditioned data and QualityStage’s matching engine link the previously unlinkable.”

● Match Construction:
  – Reliability of input data defines a match result.
● Statistical Analysis & Match Scoring:
  – Linkage probability determined on a sliding scale by field level comparison.
● Report Generation:
  – All business rules applied have easy to understand report structure.

Page 24: QS Essentials


What is match?

● Identifying all records on one file that correspond to similar records on another file
● Identifying duplicate records in one file
● Building relationships between records in multiple files
● Performing statistical and probabilistic matching
● Calculating a score based on the probability of a match

Page 25: QS Essentials


Why match?

● Identify duplicate entities within one or more files
● Perform householding
● Create consolidated view of customer
● Establish cross-reference linkage

Page 26: QS Essentials


How to match

● Single file (Unduplication) or two file (Reference)
● Different match comparisons for different types of data (e.g. exact character, uncertainty/fuzzy match, keystroke errors, multiple word comparison)
● Generation of composite weights from multiple fields
● Use of probabilistic or statistical algorithms
● Application of match cutoffs or thresholds to identify automatic and clerical match levels
● Incorporation of override weights to assess particular data conditions (e.g. default values, discriminatory elements)
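
Composite weights and cutoffs can be illustrated with the textbook form of probabilistic record linkage, sketched below in Python. The slide does not give formulas, so everything here is an assumption: the m and u probabilities in `FIELDS`, the exact-equality comparison, the choice to let missing values contribute nothing, and the cutoff values used in the usage note.

```python
import math

# Hypothetical per-field probabilities:
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
FIELDS = {
    "last_name": (0.95, 0.01),
    "zip": (0.90, 0.05),
    "birth_year": (0.85, 0.02),
}


def agreement_weight(m: float, u: float) -> float:
    """Positive weight added when a field agrees."""
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """Negative weight (penalty) added when a field disagrees."""
    return math.log2((1 - m) / (1 - u))


def composite_weight(rec_a: dict, rec_b: dict) -> float:
    """Sum the field-level weights for a pair of records."""
    total = 0.0
    for field, (m, u) in FIELDS.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if not a or not b:
            continue  # assumption: a missing value contributes nothing
        if a == b:
            total += agreement_weight(m, u)
        else:
            total += disagreement_weight(m, u)
    return total


def classify(weight: float, match_cutoff: float, clerical_cutoff: float) -> str:
    """Apply the two cutoffs: automatic match, clerical review, or nonmatch."""
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "nonmatch"
```

With a match cutoff of 10 and a clerical cutoff of 0, a pair that agrees on all three fields is an automatic match, while a pair that agrees on name and ZIP but not birth year falls into the clerical review range.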

Page 27: QS Essentials


QualityStage match

● A wide variety of match comparison algorithms providing a full spectrum of fuzzy matching functions
● Statistically-based method for determining matches (Probabilistic Record Linkage Theory)
● Field-by-field comparisons for agreement or disagreement
● Assignment of weights or penalties
● Overrides for unique data conditions
● Score results to determine the probability of matched records
● Thresholds for final match determination
● Ability to measure informational content of data

Page 28: QS Essentials


QualityStage match examples

Page 29: QS Essentials


What is survive?

● Creation of best-of-breed “surviving” data based on record or field level information
● Development of cross-reference file of related keys
● Creating output formats:
  – Relational table with primary and foreign keys
  – Transactions to update databases
  – Cross-reference files

Page 30: QS Essentials


Why survive?

● Provide consolidated view of data
● Provide consolidated view containing the “best-of-breed” data
● Resolve conflicting values and fill missing values
● Cross-populate best available data
● Implement business rules
● Create cross-reference keys

Page 31: QS Essentials


How to survive

● Highly flexible rules
● Record or field level survivorship decisions
● Rules can be based upon data frequency, data recency (i.e. date), data source, value presence or length
● Rules can incorporate multiple tests
● QualityStage features
  – Point-and-click (GUI-based) creation of business rules to determine best-of-breed “surviving” data
  – Performed at record or field level
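
Field-level survivorship rules of the kind listed above can be sketched as small functions over a group of matched records. The Python below is a minimal illustration, not the Survive stage itself; the function names, the dictionary record layout, and the YYYYMMDD date strings are assumptions.

```python
def survive_longest(records: list, field: str) -> str:
    """Length rule: keep the longest populated value of a field."""
    values = [r.get(field) or "" for r in records]
    return max(values, key=len)


def survive_most_recent(records: list, field: str, date_field: str) -> str:
    """Recency rule: keep the populated value from the newest record.

    Dates are assumed to be YYYYMMDD strings, which sort chronologically.
    """
    candidates = [r for r in records if r.get(field)]
    if not candidates:
        return ""
    newest = max(candidates, key=lambda r: r.get(date_field) or "")
    return newest[field]


def survive_record(records: list, fields: list) -> dict:
    """Build one surviving record by applying the length rule per field."""
    return {f: survive_longest(records, f) for f in fields}
```

Applied to the matched pair MARI / (blank) / LEMELSON-LAPPNER and MARI / S / LEMELSON, the length rule yields MARI / S / LEMELSON-LAPPNER, because each field is decided independently of the record it came from.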

Page 32: QS Essentials


QualityStage survive examples

Example 1: The longest populated Middle and Last Name

Matched
First Name   Middle Name   Last Name
MARI                       LEMELSON-LAPPNER
MARI         S             LEMELSON

Survived
First Name   Middle Name   Last Name
MARI         S             LEMELSON-LAPPNER

Example 2: The longest populated Middle Name, Date of Birth, and SSN

Matched
First Name   Middle Name   Last Name   DOB        SSN
DENISE                     TRIANO      19580211   98524173
DENISE       F             TRIANO

Survived
First Name   Middle Name   Last Name   DOB        SSN
DENISE       F             TRIANO      19580211   98524173

Page 33: QS Essentials


Course lab project design

Policy

Investigate: Assess Data Quality

Standardize Country

Investigate Conditioned Results

Apply User Overrides

Identify Duplicate Customer Records

Survive the Best Customer Record

Condition Name, Address and Area

Select US Data for further processing

Page 34: QS Essentials


Checkpoint

1. (T/F) Data quality investigation cleans the source data.
2. (T/F) Standardization modifies the source data so that it can be loaded into the target system.
3. (T/F) Survivorship data can be either record based or field based.

Page 35: QS Essentials


Checkpoint solutions

1. (T/F) Data quality investigation cleans the source data.
   Answer: False

2. (T/F) Standardization modifies the source data so that it can be loaded into the target system.
   Answer: False

3. (T/F) Survivorship data can be either record based or field based.
   Answer: True

Page 36: QS Essentials


Unit summary

Having completed this unit, you should be able to:
● List the five common data quality contaminants
  – Different standards
  – Missing and default values
  – Spillover and buried information
  – Anomalies
  – No consolidated view
● Describe each of the following processes:
  – Investigation
  – Standardization
  – Match
  – Survivorship

Page 37: QS Essentials


Lab 1: Review course project

● Course business case: WINN Insurance CRM project
● See QualityStage Essentials Exercises

Page 38: QS Essentials


Lab 2: Copy student files

● Copy student files to disk
  – Use C: drive as root for folder

Page 39: QS Essentials


QualityStage 8 Architecture

Page 40: QS Essentials


Unit objectives

● After completing this unit, you should be able to:
  – Describe the Data Quality architecture
  – Identify server and client components

Page 41: QS Essentials


Information Server conceptual architecture

[Diagram components: Information Server; DataStage; QualityStage; Information Director; Information Analyzer; Metadata Access Services; Metadata Server & Repository; Other Services (client logon access, logging, security)]

Page 42: QS Essentials


QualityStage technical overview

● Uses DataStage (parallel version)
  – DataStage design environment
  – Parallel execution engine
  – Stages are native enterprise operators
  – Match designer is embedded in DataStage Designer Client
  – Get DataStage data connectivity by default
    • No need for meta brokers, plug-ins
    • Common meta data
● Legacy (pre-version 8) QS job execution
  – Migration utility available to aid conversion from QS 7.x to QS 8
  – Converted jobs can be compiled and executed in the QS 8 environment

Page 43: QS Essentials


DataStage/QualityStage physical architecture

[Diagram: Clients (Windows): Designer, Director, Administrator; they connect to projects via TCP/IP. Information Server (UNIX or Windows): DataStage/QualityStage, Projects.]

Page 44: QS Essentials


DataStage clients

● Administrator
  – Add and delete projects
  – Set project defaults
  – Set project environment parameters
● Designer
  – Maintain data definitions
  – Add, modify, and delete jobs
  – Add, modify, and delete match specifications
  – Manage rule sets
  – Compile jobs
  – Run jobs
  – Provision rule sets and match specifications
● Director
  – Run jobs
  – Review job log
  – Schedule jobs

Page 45: QS Essentials


DataStage Administrator

● Administrator
  – Create or delete projects
  – Set project defaults
  – Apply security

Project list

Page 46: QS Essentials


Project property defaults

Page 47: QS Essentials


DataStage Designer

● Designer
  – Client GUI for designing jobs
    • Windows 2000+, XP
    • Build meta data
    • Build Jobs
    • Modify Standardization Rules
    • Build match specifications
  – Designer Repository
    • Database

Sample QualityStage job as viewed in Designer

Page 48: QS Essentials


Designer canvas, repository, and palette

Page 49: QS Essentials


DataStage Director

● Director
  – Client GUI for managing job execution
  – Windows 2000+, XP
  – Run jobs – set job options and parameters
  – View job log
  – Schedule job execution

Page 50: QS Essentials


Job log viewed in Director

Page 51: QS Essentials


Checkpoint

1. (T/F) DataStage Administrator executes jobs.
2. (T/F) DataStage Designer configures projects.
3. Which DataStage component displays objects in the designer database?

Page 52: QS Essentials


Checkpoint solutions

1. (T/F) DataStage Administrator executes jobs.
   Answer: False

2. (T/F) DataStage Designer configures projects.
   Answer: False

3. Which DataStage component displays objects in the designer database?
   Answer: the repository view

Page 53: QS Essentials


Unit summary

Having completed this unit, you should be able to:
  – Describe the Data Quality architecture
  – Identify server and client components

Page 54: QS Essentials


Lab 3: Configure QualityStage project

● Create a project using Administrator (if necessary)
● Set project properties
  – General defaults
  – Environment variables
  – Security groups and roles

Page 55: QS Essentials


Developing with QualityStage

Page 56: QS Essentials


Unit objectives

● After completing this unit, you should be able to:
  – Import meta data
  – Build DataStage/QualityStage Jobs
  – Run jobs
  – Review results

Page 57: QS Essentials


DataStage/QualityStage project

● Components
  – Jobs
  – Stages within jobs
  – Table Definitions
  – The Designer repository view shows project components

Page 58: QS Essentials


Job definition

● A job is an executable DataStage/QualityStage program
● Created by job compilation
● Jobs can be run in batch or in real time

Page 59: QS Essentials


Job development overview

● Designer client
  – Import or enter file meta data defining your sources and targets
  – Add stages and links defining the process
  – Compile the job
  – Run the job (Designer or Director)
  – Review data results
● Server
  – Runs the job
  – View job log

Page 60: QS Essentials


Log onto project in Designer or Director

User name and Password controlled by Information Server

List of valid projects

Page 61: QS Essentials


Designer repository components

● Database which stores
  – Data file definitions
  – Job designs
  – Standardization rules
  – Data connection objects

Page 62: QS Essentials


Project structure

Repository view

In Designer

Page 63: QS Essentials


DataStage/QualityStage design environment

Stages

Data definitions

Page 64: QS Essentials


Data definitions

● Entered or loaded via DataStage import mechanisms
  – Sequential file
  – ODBC
  – Native database connection
● New and redefined columns can be added on the data flow via Transformer stage

Page 65: QS Essentials


Data Quality folder

● Stages are the building blocks
● Focused in function
● All phases of data quality:
  – Investigate
  – Standardize
  – Match Frequency
  – Match
    • Unduplicate Match
    • Reference Match
  – Survive
  – International postal
    • MNS
  – Optional
    • Address Verification

Page 66: QS Essentials


Standardization rule sets

● Pre-defined rules for parsing and standardizing:
  – Name
  – Address
  – Area (City, State and Zip)
● Multi-national address processing
● Validate structure:
  – Tax ID
  – US Phone
  – Date
  – Email
● Append ISO country codes
● Rule sets are stored in the repository and provisioned to the job execution area

Rule set for USNAME

Page 67: QS Essentials


Rule set components

● Can modify some rule set components

● Test rule sets
● Copy rule sets

Page 68: QS Essentials


Match Specifications in the DataStage Repository

●Created using the Match Designer

●Allows online testing of match criteria

Page 69: QS Essentials


Executing a job via Director

Director

Server

Executes the job

Click run button

Set run options

Execute job

View job log

View job monitor

Page 70: QS Essentials


Running a job in Director

● Director
  – Client GUI for running jobs
    • Windows 2000+, XP
    • View job logs and monitor
    • Job scheduling

Job status view

Page 71: QS Essentials


Execution environment

Data Quality Job Log

Page 72: QS Essentials


Job Monitor statistics

Page 73: QS Essentials


Job development process

● Import meta data
● Define job
  – Draw stages and links
  – Set stage properties
  – Compile
● Run the job
● Review results

Page 74: QS Essentials


Checkpoint

1. (T/F) The job monitor displays link statistics.
2. (T/F) The job log is viewed in DataStage Designer.
3. What protocol is used for communication between the DataStage clients and server?

Page 75: QS Essentials


Checkpoint solutions

1. (T/F) The job monitor displays link statistics.
   Answer: True

2. (T/F) The job log is viewed in DataStage Designer.
   Answer: False

3. What protocol is used for communication between the DataStage clients and server?
   Answer: TCP/IP

Page 76: QS Essentials


Unit summary

Having completed this unit, you should be able to:
  – Import meta data
  – Build DataStage/QualityStage Jobs
  – Run jobs
  – Review results

Page 77: QS Essentials


Lab 4: Import meta data

● DataStage import mechanisms
  – DataStage components
    • Any object built in DataStage, such as jobs, table definitions, match specifications

Page 78: QS Essentials


Lab 5: Build and run DataStage job

● Read sequential file
  – Must use format tab to handle nulls

Page 79: QS Essentials


Investigation

Page 80: QS Essentials


Unit objectives

● After completing this unit, you should be able to:
  – Build Investigate jobs
  – Use character discrete, concatenate, and word investigations to analyze data fields
  – Review results

Page 81: QS Essentials


Investigation

● Verify the domain
  – Review each field of interest and verify the data matches the meta data
● Identify data formats, missing and default values
● Identify data anomalies
  – Format
  – Structure
  – Content
● Discover “unwritten” business rules
● Identify data preparation requirements

Page 82: QS Essentials


Investigate stage

● Features
  – Analyze free-form and single domain fields
  – Provide frequency distributions of distinct values and patterns
● Investigate methods
  – Character Discrete
  – Character Concatenate
  – Word

Page 83: QS Essentials


Investigate methods

Method                  Why
Character Discrete      Analyzing field values, formats, and domains
Character Concatenate   Cross-field correlation, checking logic relationships between fields
Word Investigation      Identifying free-form fields that may require parsing and discovery of key words for classification

Page 84: QS Essentials


Investigate terminology

Field Masks: Options that represent the data. Options: Character (C), Type (T), Skipped (X)

Tokens: Individual units of data

Character Mask   Usage
C                For viewing the actual character values of the data
T                For viewing the pattern of the data
X                For ignoring characters

Page 85: QS Essentials


Field mask examples

Token            Mask             Result
02116            CCCCC            02116
02116            CCCXX            021
01832-4480       TTTTTTTTTT       nnnnn-nnnn
XJ2 6EM          TTTTTTT          aanbnaa
(617) 338-0300   CCCCCCCCCCCCCC   (617) 338-0300
617-338-0300     TTTTTTTTTTTT     nnn-nnn-nnnn
6173380300       CCCXXXXXXXXX     617
(617)3380300     CCCXXXXXXXXX     (61
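
The mask behavior shown in these examples can be reproduced with a short Python function. This is a sketch inferred from the examples on the slide (digits become n, letters become a, blanks become b, other characters pass through under the T mask), not the Investigate stage's own code; the name `apply_mask` and the decision to ignore positions beyond the shorter of value and mask are assumptions.

```python
def apply_mask(value: str, mask: str) -> str:
    """Apply a character mask position by position.

    C keeps the character, T replaces it with its type symbol, and X skips
    it. Positions beyond the shorter of value and mask are ignored
    (an assumption made for this sketch).
    """
    out = []
    for ch, m in zip(value, mask):
        if m == "C":
            out.append(ch)
        elif m == "T":
            if ch.isdigit():
                out.append("n")
            elif ch.isalpha():
                out.append("a")
            elif ch == " ":
                out.append("b")
            else:
                out.append(ch)  # punctuation is shown as itself
        elif m == "X":
            continue
        else:
            raise ValueError(f"unknown mask character: {m!r}")
    return "".join(out)
```

Counting the results of `apply_mask` over a column gives the frequency report: a C mask groups records by actual value, while a T mask groups them by format.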

Page 86: QS Essentials


Character discrete: field mask (C)haracter

● Usage: Domain quality
  – View the contents of each field to verify that the data values match the field labels
● Mechanism: Investigate stage
  – Generates reports for frequency

Page 87: QS Essentials


Character discrete - character results

Page 88: QS Essentials


Character discrete: field mask (T)ype

● Usage: Data formats (patterns):
  – View the format of fields that you suspect may follow or conform to a specific format, e.g., dates, PIN, Tax ID, account numbers.
● Generates reports for frequency

Page 89: QS Essentials


Investigation Implementation

Page 90: QS Essentials


QualityStage Investigation job – character

Double Click

Page 91: QS Essentials


Investigation - Character

Select Column Add

Page 92: QS Essentials


Investigation - Character

Select mask

Page 93: QS Essentials


Investigation - Character

Page 94: QS Essentials


Investigation - Character

Page 95: QS Essentials


Investigation - Character

Page 96: QS Essentials


View investigation report

Page 97: QS Essentials


Character concatenate

● Identify Field Relationships
  – Investigate one or more fields to uncover any relationship between the field values.
  – Uses combinations of character masks
  – Generates reports for frequency
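
A character concatenate investigation can be approximated by building a type pattern for each chosen field, joining the patterns, and counting the combinations. The Python sketch below assumes dictionary records, hypothetical field names `dob` and `dod`, a T mask on every position, and a `|` separator chosen here only for readability.

```python
from collections import Counter


def type_pattern(value: str) -> str:
    """T-mask every character: n = digit, a = letter, b = blank."""
    return "".join(
        "n" if c.isdigit() else "a" if c.isalpha() else "b" if c == " " else c
        for c in value
    )


def concatenate_patterns(records: list, fields: list) -> Counter:
    """Count the combined patterns of several fields across all records."""
    return Counter(
        "|".join(type_pattern(r.get(f) or "") for f in fields)
        for r in records
    )
```

For date of birth and date of death fields, a combined pattern with an empty second part identifies records with no date of death, and an unexpected combination (for example a populated date of death with an empty date of birth) stands out in the frequency counts.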

Page 98: QS Essentials


Character concatenate results

DOB and DOD Fields

Page 99: QS Essentials


Word investigate

● Usage: Free-form field pattern analysis
  – To view the pattern of the data within a freeform text field and parse it into individual tokens
● QualityStage process
  – Apply rule sets to free-form fields
  – Discover parsing requirements
  – Discover patterns in data
  – Generate reports for pattern frequency distributions and token report

Page 100: QS Essentials


Word investigation results

Pattern Report
How to use:
  – Look at most frequently occurring patterns.
  – Use to estimate how much work to modify a rule set for a customer.

Token Report
How to use:
  – Review tokens with SME to verify tokens are properly classified.
  – Identify most frequently occurring unclassified tokens and add them to rule set.

Page 101: QS Essentials


Rule sets

● Rules for parsing, classifying, and organizing data
● Rule Set Domains
  – Country processing
  – Pre-processing
  – Domain Processing
    • Name: Business and Personal
    • Street Address
    • Area: Locality, City, State and Zip/Postal codes
  – Multinational Address Processing

Page 102: QS Essentials


Parsing

● Parse free-form data with the SEPLIST and a STRIPLIST
  – SEPLIST - Any character in the SEPLIST will separate tokens, and become a token itself
  – STRIPLIST - Any character in the STRIPLIST will be ignored in the resulting pattern
● The SEPLIST is always applied first

Page 103: QS Essentials


Parsing example

Example: 120 Main St. N.W.

SEPLIST “¬.” STRIPLIST “¬”
Token1  Token2  Token3  Token4  Token5  Token6  Token7  Token8
120     Main    St      .       N       .       W       .

SEPLIST “¬.” STRIPLIST “¬.”
Token1  Token2  Token3  Token4  Token5
120     Main    St      N       W

SEPLIST “¬” STRIPLIST “¬.”
Token1  Token2  Token3  Token4
120     Main    St      NW
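
The interaction of the two lists can be sketched as a small Python tokenizer. It assumes that “¬” on the slide stands for a space, and it follows the stated rules: a SEPLIST character ends the current token and becomes a token itself unless it is also in the STRIPLIST, while a character that is only in the STRIPLIST is dropped without splitting. The function name `tokenize` is hypothetical.

```python
def tokenize(text: str, seplist: str, striplist: str) -> list:
    """Split text into tokens using separator and strip character lists."""
    tokens = []
    current = []
    for ch in text:
        if ch in seplist:
            # A separator always ends the token being built.
            if current:
                tokens.append("".join(current))
                current = []
            # It survives as its own token unless it is also stripped.
            if ch not in striplist:
                tokens.append(ch)
        elif ch in striplist:
            continue  # stripped but not a separator: dropped, no split
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens
```

Running it on `120 Main St. N.W.` with the three SEPLIST and STRIPLIST combinations from the slide yields eight, five, and four tokens respectively; in the last case the period is stripped without separating, so N and W join into NW.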

Page 104: QS Essentials


Data typing: classifying tokens

● Identify and type the token in terms of its business meaning and value

PATTERN KEY(USADDR rule set):

^ – Numeric token

? – Unclassified alpha token

@, <, > – Mixed Token

T – Street Type

U – Unit Type

120 Main Street Apt 6C

^ ? T U >
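
Token classification can be sketched as a table lookup followed by default tags. In the Python below, the `CLASSIFICATIONS` table is a tiny hypothetical stand-in for a rule set's classification table, and the split of the mixed classes (`>` for leading-numeric, `<` for leading-alpha, `@` for anything more complex) is an assumption; the slide only says that all three denote mixed tokens.

```python
import re

# Hypothetical classification table: known word -> class.
CLASSIFICATIONS = {
    "STREET": "T",
    "ST": "T",
    "ROAD": "T",
    "APT": "U",
    "APARTMENT": "U",
}


def classify_token(token: str) -> str:
    """Return the one-character class of a token."""
    upper = token.upper()
    if upper in CLASSIFICATIONS:
        return CLASSIFICATIONS[upper]
    if upper.isdigit():
        return "^"  # numeric token
    if upper.isalpha():
        return "?"  # unclassified alpha token
    if re.fullmatch(r"\d+[A-Z]+", upper):
        return ">"  # assumed: mixed token with leading digits, e.g. 6C
    if re.fullmatch(r"[A-Z]+\d+", upper):
        return "<"  # assumed: mixed token with leading letters, e.g. A6
    return "@"      # assumed: any other mixed token


def pattern(text: str) -> str:
    """Classify each whitespace-separated token and join the classes."""
    return "".join(classify_token(t) for t in text.split())
```

With this table, `120 Main Street Apt 6C` produces the pattern `^?TU>`, matching the example above.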

Page 105: QS Essentials


Example: word investigate

Input: 10 MAPLE STREET APARTMENT 222

1. Parse
2. Classify known words and assign default tags

   10   MAPLE   STREET   APARTMENT   222
   ^    ?       T        U           ^

3. Produce Reports based on Patterns & Tokens: Token report, Pattern report

Page 106: QS Essentials


Investigation - Word

Page 107: QS Essentials


Investigation - Word

Page 108: QS Essentials


Link ordering

Page 109: QS Essentials


Investigation – define output files

Page 110: QS Essentials


Sort output (optional)

Page 111: QS Essentials


Review word reports – patterns and tokens

Page 112: QS Essentials


Data quality assessment process

●Review and analyze each field for the following information:
– How often is the field populated?
– What are the anomalies and out-of-range values? How often does each one occur?
– How many unique values were found?
– What is the distribution of the data or patterns?
●Use Investigate results to:
– Update project business requirements
– Define development plan and application design
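The per-field questions above (population rate, unique values, distribution) can be sketched outside QualityStage in a few lines. This is only an illustration of the assessment, not the Investigate stage itself; `profile_field` and the sample values are hypothetical.

```python
from collections import Counter


def profile_field(values):
    """Summarize how often a field is populated and how its values are distributed."""
    total = len(values)
    populated = [v for v in values if v is not None and v.strip() != ""]
    counts = Counter(populated)
    return {
        "populated_pct": 100.0 * len(populated) / total if total else 0.0,
        "unique_values": len(counts),
        "distribution": counts.most_common(),  # most frequent values first
    }


# Hypothetical state-code column with a blank, a null, and a lowercase anomaly.
sample = ["MA", "MA", "", "NY", None, "ma"]
report = profile_field(sample)
```

Rare entries at the tail of the distribution, such as the lowercase "ma" here, are the anomalies worth reviewing.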

Page 113: QS Essentials


Checkpoint

1. (T/F) Character discrete investigation examines a single domain.
2. (T/F) Word investigation examines a single domain.
3. Name the three character masks.

Page 114: QS Essentials


Checkpoint solutions

1. (T/F) Character discrete investigation examines a single domain.
Answer: True

2. (T/F) Word investigation examines a single domain.
Answer: False

3. Name the three character masks.
Answer: C, T, and X

Page 115: QS Essentials


Unit summary

Having completed this unit, you should be able to:
– Build Investigate jobs
– Use character discrete, concatenate, and word investigations to analyze data fields
– Review results

Page 116: QS Essentials


Lab 6: Build investigate jobs

●Character with C mask
●Character with T mask
●Character concatenate
●Word

Page 117: QS Essentials


Standardize

Page 118: QS Essentials


Unit objectives

●After completing this unit, you should be able to:
– Describe the Standardize stage
– Identify rule sets
– Build jobs using the Standardize stage
– Interpret standardization results
– Investigate unhandled data and patterns

Page 119: QS Essentials


Standardize

●Transformation
– Parsing free form fields
– Comparison threshold for classifying like words
– Bucketing data tokens
●Standardization
– Applying standard values and standard formats
●Phonetic Coding for use in Matching
– NYSIIS
– Soundex

Page 120: QS Essentials


Standardize example

Input File (Address Line 1, Address Line 2):
1721 W ELFINDALE ST UNIT 20
1721 W ELFINDALE ST # 20
16200 VENTURA BOULEVARD SUITE 201
C/O JOSEPH C REIFF 12 WESTERN AVE
1705 W St PHILADELPHIA
1655 PONCE DE LEON AVENUE 15TH FLOOR

Result File (House #, Dir, Str. Name, Type, Unit Type, Unit Value, Floor Type, Floor Value):
1721 W ELFINDALE ST UNIT 20
1721 W ELFINDALE ST 20
16200 VENTURA BLVD STE 201
12 WESTERN AVE
1705 W ST
1655 PONCE DE LEON AVE FLOOR 15

Page 121: QS Essentials


Standardize process

[Figure: 10 MAPLE STREET APARTMENT 222 → Parse → Classify & assign default tags (pattern ^ ? T U ^) → Process Patterns and Bucket Data → Output File]

Output File:
House Number: 10
Street Name: MAPLE
Street Type: ST
Unit Type: APT
Unit: 222

Key:
^ = Single numeric
? = One or more unknown alphas
T = Street type
U = Unit type

Page 122: QS Essentials


Standardize stage

●Standardize Stage
– Uses Rule sets for:
• Country processing
• Pre-domain processing: USPREP
• Domain processing: USADDR, USAREA, USNAME
• Multi-national Address
• WAVES
• Address Verification Interface (optional)

Page 123: QS Essentials


Types of rule sets

Country Identifier

COUNTRY

Domain Pre-processor

USPREP

Domain Specific: USNAME

Domain Specific: USADDR

Domain Specific: USAREA

Preparatory steps

Not always required

Page 124: QS Essentials


Example: country identifier

Input Record

100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
28 GROSVENOR STREET LONDON W1X 9FE
123 MAIN STREET

Output Record

US Y 100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
CA Y SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
GB Y 28 GROSVENOR STREET LONDON W1X 9FE
US N 123 MAIN STREET

Page 125: QS Essentials


Example: domain preprocessor

Input Record

Field 1: JIM HARRIS (781) 322-2426
Field 2: 92 DEVIR STREET MALDEN MA 02148

Output Record

Name Domain: JIM HARRIS
Address Domain: 92 DEVIR STREET
Area Domain: MALDEN MA 02148
Other Domain: (781) 322-2426

Mixed domain

Page 126: QS Essentials


Example: domain specific

Input Record

100 SUMMER STREET 15TH FLOOR

Output Record

House Number: 100
Street Name: SUMMER
Street Suffix Type: ST
Floor Type: FL
Floor Value: 15
Address Type: S
NYSIIS of Street Name: SANAR
Reverse Soundex of Street Name: R520
Input Pattern: ^+T>U

Page 127: QS Essentials


Rule sets

●Rule sets contain logic for:
– Parsing
– Classifying
– Processing data by pattern and bucketing data
●Three required files
– Classification Table
– Dictionary File
– Pattern Action File
●Optional files
– Lookup tables
– Override tables

Page 128: QS Essentials


Rule set files

Classification Table – Contains standard abbreviations that identify and classify key words
Dictionary File – Defines the output file fields to store the parsed and conditioned data
Pattern Action File – Contains a series of patterns and programming commands to condition the data
Reference Tables – Optional conversion and lookup tables for converting and returning standardized values
Override Tables – Tables for storing overrides entered into the Designer GUI

Page 129: QS Essentials


Classification table

●Contains the words for classification, standardized versions of words, and data class
●Data class (data tag) is assigned to each data token
●Default classes are the same across all rule sets
●User-defined classes are assigned in the classification table
– Users may modify, add or delete these classes
– User-defined classes are a single letter

Page 130: QS Essentials


Default classes

Class – Description
^ – A single numeric
+ – A single unclassified alpha (word)
? – One or more consecutive unclassified alphas
@ – Complex mixed token, e.g., C3PO
> – Leading numeric, e.g., 6A
< – Trailing numeric, e.g., A6
Zero (0) – Null class
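A rough sketch of how the default classes could be assigned to tokens that have no classification-table entry. It is an approximation for illustration, not QualityStage's classifier: the null class and the ? class are not modelled, and treating punctuation as standing for itself is an assumption drawn from patterns such as +,+ on later slides. The function names are hypothetical.

```python
import re


def default_class(token):
    """Return the default class symbol for a token with no classification-table entry."""
    if token.isdigit():
        return "^"    # a single numeric
    if token.isalpha():
        return "+"    # a single unclassified alpha (word)
    if not token.isalnum():
        return token  # assumption: punctuation stands for itself, as in "+,+"
    if re.fullmatch(r"\d+[A-Za-z]+", token):
        return ">"    # leading numeric, e.g. 6A
    if re.fullmatch(r"[A-Za-z]+\d+", token):
        return "<"    # trailing numeric, e.g. A6
    return "@"        # complex mixed token, e.g. C3PO


def default_pattern(tokens):
    """Concatenate the default class of each token into a pattern string."""
    return "".join(default_class(t) for t in tokens)


name_pattern = default_pattern(["HOCHREITER", ",", "CAROLYNNE"])
```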

Page 131: QS Essentials


User-defined classes

Class – Description

USNAME
G – Generational, e.g., Senior, I, II
P – Prefix, e.g., Dr., Mr., Miss

USADDR
T – Street Type
D – Directional
B – Box Type

USAREA
S – State Abbreviation

Page 132: QS Essentials


Classification table example

;-------------------------------------------------------------------------------
; USADDR Classification Table
;-------------------------------------------------------------------------------
; Classification Legend
;-------------------------------------------------------------------------------
; B - Box Types
; D - Directionals
; F - Floor Types
; H - Highway Modifiers
; R - Rural Route, Highway Contract, Star Route
; T - Street Types
; U - Unit Types
;-------------------------------------------------------------------------------
; Table Sort Order: 51-51 Ascending, 26-50 Ascending, 1-25 Ascending
;-------------------------------------------------------------------------------
DRAW "PO BOX" B
DRAWER "PO BOX" B
PO "PO BOX" B
POB "PO BOX" B
POBOX "PO BOX" B
POBX "PO BOX" B
PODRAWER "PO BOX" B

Columns: Token, Standard form, Classification
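To make the three columns concrete, the sketch below reads entries in the layout shown above into a lookup keyed by token. It is a reading aid only, not how QualityStage loads its tables; it assumes whitespace-separated columns with quoted multi-word standard forms, and the names `load_classifications` and `TABLE_TEXT` are hypothetical.

```python
import shlex

# Excerpt in the same layout as the USADDR classification table.
TABLE_TEXT = '''
; USADDR Classification Table (excerpt)
DRAW "PO BOX" B
DRAWER "PO BOX" B
PO "PO BOX" B
POB "PO BOX" B
'''


def load_classifications(text):
    """Map each token to its (standard form, class) pair, skipping ';' comment lines."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        fields = shlex.split(line)  # keeps "PO BOX" together as one field
        token, standard, cls = fields[0], fields[1], fields[2]
        table[token] = (standard, cls)
    return table


classifications = load_classifications(TABLE_TEXT)
standard_form, token_class = classifications["POB"]
```

Every variant in the excerpt resolves to the same standard form and class, which is what lets the pattern show a single B for any of these spellings.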

Page 133: QS Essentials


Comparison threshold

● May be used in the Classification table
● Used to efficiently make entries into the classification table
● Helps overcome spelling and data entry errors
● Not required
● Threshold uses a logical string comparator

Threshold level:
900 – Exact match
850 – Almost certainly the same
800 – Most likely equivalent
750 – Most likely not the same
700 – Almost certainly not the same
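The idea of a threshold on a classification entry can be sketched with a stand-in comparator. The slides do not describe QualityStage's string comparator, so this example swaps in Python's `difflib.SequenceMatcher` and rescales its ratio to a 0 to 900 range; the scores will not match the product's. The table contents and function names are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical entries: word -> (standard form, class, threshold or None).
DIRECTIONALS = {
    "NORTHEAST": ("NE", "D", 850),
    "NORTHWEST": ("NW", "D", 850),
    "NW": ("NW", "D", None),
}


def similarity_score(a, b):
    """Stand-in comparator: difflib ratio rescaled so an exact match scores 900."""
    ratio = SequenceMatcher(None, a.upper(), b.upper()).ratio()
    return round(900 * ratio)


def classify_with_threshold(token, table):
    """Return (standard form, class) for an exact or close-enough entry, else None."""
    token = token.upper()
    if token in table:
        return table[token][:2]
    best = None
    for word, (standard, cls, threshold) in table.items():
        if threshold is None:
            continue  # entries without a threshold require an exact match
        score = similarity_score(token, word)
        if score >= threshold and (best is None or score > best[0]):
            best = (score, standard, cls)
    return best[1:] if best else None


misspelled = classify_with_threshold("NORTHWESST", DIRECTIONALS)
unrelated = classify_with_threshold("ELM", DIRECTIONALS)
```

One thresholded entry can stand in for many misspellings, which is the efficiency the slide refers to.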

Page 134: QS Essentials


Classification table example with comparison threshold

;-------------------------------------------------------------------------------
; USADDR Classification Table
;-------------------------------------------------------------------------------
; Classification Legend
;-------------------------------------------------------------------------------
; B - Box Types
; D - Directionals
; F - Floor Types
; H - Highway Modifiers
; R - Rural Route, Highway Contract, Star Route
; T - Street Types
; U - Unit Types
;-------------------------------------------------------------------------------
; Table Sort Order: 51-51 Ascending, 26-50 Ascending, 1-25 Ascending
;-------------------------------------------------------------------------------
DRAW "PO BOX" B
DRAWER "PO BOX" B
…
NORTHEAST NE D 850
NORTHWEST NW D 850
NW NW D
S S D
SO S D
SOUTH S D

Page 135: QS Essentials


Dictionary file

●Defines the field definitions for the output file
●When data is moved to these output fields it is called “bucketing” the data
●The order that the fields are listed in the dictionary file defines the order the fields appear in the output file
●Dictionary file entries are similar to field definitions

Page 136: QS Essentials


Dictionary file example

;;QualityStage v8.0
\FORMAT\ SORT=N
;------------------------------------------------------------------------------
; USADDR Dictionary File
;------------------------------------------------------------------------------
; Total Dictionary Length = 411
;------------------------------------------------------------------------------
; Business Intelligence Fields
;------------------------------------------------------------------------------
HouseNumber C 10 S HouseNumber ;0001-0010
HouseNumberSuffix C 10 S HouseNumberSuffix ;0011-0020
StreetPrefixDirectional C 3 S StreetPrefixDirectional ;0021-0023
StreetPrefixType C 20 S StreetPrefixType ;0024-0043
StreetName C 25 S StreetName ;0044-0068
StreetSuffixType C 5 S StreetSuffixType ;0069-0073
StreetSuffixQualifier C 5 S StreetSuffixQualifier ;0074-0078
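The position comments in the dictionary entries (;0001-0010 and so on) follow from the field order and lengths. The sketch below recomputes them, assuming the third column of each entry is the field length; that reading is inferred from the comments in the example rather than taken from the product documentation, and `field_layout` is a hypothetical helper.

```python
DICTIONARY_TEXT = """
HouseNumber C 10 S HouseNumber ;0001-0010
HouseNumberSuffix C 10 S HouseNumberSuffix ;0011-0020
StreetPrefixDirectional C 3 S StreetPrefixDirectional ;0021-0023
StreetPrefixType C 20 S StreetPrefixType ;0024-0043
StreetName C 25 S StreetName ;0044-0068
StreetSuffixType C 5 S StreetSuffixType ;0069-0073
StreetSuffixQualifier C 5 S StreetSuffixQualifier ;0074-0078
"""


def field_layout(text):
    """Return (name, start, end) for each field, in dictionary order, 1-based."""
    layout = []
    start = 1
    for line in text.splitlines():
        entry = line.split(";", 1)[0].split()  # drop the trailing comment
        if len(entry) < 3:
            continue
        name, length = entry[0], int(entry[2])
        end = start + length - 1
        layout.append((name, start, end))
        start = end + 1
    return layout


layout = field_layout(DICTIONARY_TEXT)
```

Because positions are cumulative, reordering entries in the dictionary file changes where every later field lands in the output.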

Page 137: QS Essentials


Pattern-Action file

●Contains the rules for standardization; that is, the actions to execute with a given pattern of tokens
●Records are processed from the top down
●Written in Pattern Action Language (PAL)
●Complex parsing can be coded in this file

Page 138: QS Essentials


Pattern Action file process

Street Address: 10 Hollow Oak Road
Pattern: ^ ? T

Pattern Action Language:
COPY [1] {HN}
COPY_S [2] {SN}
COPY_A [3] {ST}

Result:
{HN} – 10
{SN} – Hollow Oak
{ST} – Rd
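The three actions can be mimicked in a few lines to show what each one contributes. This is a sketch of the behavior the slide implies, not a PAL interpreter: it assumes COPY concatenates the words of an operand, COPY_S keeps the spaces between them, and COPY_A substitutes the standard abbreviation from the classification table. `STANDARD_FORMS` and `run_actions` are hypothetical.

```python
# Hypothetical standard abbreviations, standing in for the classification table.
STANDARD_FORMS = {"ROAD": "Rd", "STREET": "St"}


def run_actions(operands, actions):
    """Apply (command, operand number, field) actions to the matched operands."""
    fields = {}
    for command, position, field in actions:
        words = operands[position - 1]  # operands are numbered from 1
        if command == "COPY":
            fields[field] = "".join(words)
        elif command == "COPY_S":
            fields[field] = " ".join(words)  # preserve spaces between words
        elif command == "COPY_A":
            fields[field] = " ".join(STANDARD_FORMS.get(w.upper(), w) for w in words)
        else:
            raise ValueError(f"unsupported command: {command}")
    return fields


# "10 Hollow Oak Road" matched against the pattern ^ ? T:
# the ? operand covers both unclassified words.
operands = [["10"], ["Hollow", "Oak"], ["Road"]]
actions = [("COPY", 1, "HN"), ("COPY_S", 2, "SN"), ("COPY_A", 3, "ST")]
result = run_actions(operands, actions)
```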

Page 139: QS Essentials


Optional lookup tables

●Called from the Pattern Action File
●Rule sets may contain lookup tables such as:
– Common First Names and Enhanced First Names
• Barb & Barbara
• Ted & Edward
– Gender based on name
– State abbreviations
– Common city abbreviations
• NYC = New York City
• LA = Los Angeles

Page 140: QS Essentials


Standardize process

[Figure: four numbered steps. 10 MAPLE STREET APARTMENT 222 → Parse → Classify & assign default tags (pattern ^ ? T U ^) → Process Patterns and Bucket Data → output fields. The Classification Table, Pattern Action File, and Dictionary File are shown alongside the flow.]

Output:
House Number: 10
Street Name: MAPLE
Street Type: ST
Unit Type: APT
Unit: 222

Page 141: QS Essentials


Standardizing international data

●Two methods
– Method 1: Use country pre-processor, domain pre-processor, and domain-specific rules
• Uses out-of-the-box, included functionality/rules
– Method 2: Use Multinational Standardize, WAVES, or AVI
• WAVES requires purchase of WAVES database
• AVI requires purchase of database for address validation

Page 142: QS Essentials


Course lab project design

[Figure: project flow for the Policy data: Investigate (Assess Data Quality) → Standardize Country → Select US Data for further processing → Condition Name, Address and Area → Investigate Conditioned Results → Apply User Overrides → Identify Duplicate Customer Records → Survive the Best Customer Record]

Page 143: QS Essentials


Country rule set

●Country Rule set appends the two-byte ISO country code
●Input to the country rule set includes:
– Default country code (designated by ZQ…default value…ZQ)
– Street Address
– City or locality
– State
– Zip or Postal code
– Country field (if it exists)
●Output:
– Two-byte ISO country code
– Flag identifying explicit or default decision

Page 144: QS Essentials


Standardization implementation

Page 145: QS Essentials


Standardization jobs

[Figure: Country Rule Set → USPREP Rule Set → Domain-specific Rule Sets]

Page 146: QS Essentials


Standardization – US Name, Address, Area

Page 147: QS Essentials


Standardization

Page 148: QS Essentials


Standardization – mapping columns

Page 149: QS Essentials


Course lab project design

[Figure: project flow for the Policy data: Investigate (Assess Data Quality) → Standardize Country → Select US Data for further processing → Condition Name, Address and Area → Investigate Conditioned Results → Apply User Overrides → Identify Duplicate Customer Records → Survive the Best Customer Record]

Page 150: QS Essentials


Selecting US data

●The DataStage Filter Stage provides the capability of selecting and/or rejecting records based on a set of values for a field

●Selecting or splitting data with compound or complex logic may require a Transformer stage

Page 151: QS Essentials


Course exercise project design

[Figure: project flow for the Policy data: Investigate (Assess Data Quality) → Standardize Country → Select US Data for further processing → Condition Name, Address and Area → Investigate Conditioned Results → Apply User Overrides → Identify Duplicate Customer Records → Survive the Best Customer Record]

Page 152: QS Essentials


Domain pre-processor rule sets

●Pre-processor rule sets are designed to filter name, street address and area (city, state, zip) data
– For example, if the city, state and zip is found in ADDRESS LINE 2, the pre-processor rule set will attempt to recognize this data and move it into the area domain
●The pre-processor rule set prepares the data for processing by domain specific rule sets

Page 153: QS Essentials


Domain rule sets

●Domain rule sets expect only data for that domain as the input
●Domain rule sets that come with QualityStage are:
– Name
– Street address
– Area (city, state and zip)

Page 154: QS Essentials


USNAME rule set

●The USNAME rule set works on both personal names and organization names for US data

●Data is parsed into name components
●Phonetic codes for the First Name and Primary Name are created for matching

Page 155: QS Essentials


USADDR rule set

●This rule set is applied to street address fields
●The “Address Type” flag identifies different types of addresses
– ‘S’ Street address
– ‘B’ Box address
– ‘R’ Rural route address
●Phonetic coding of the Street Name is created for matching

Page 156: QS Essentials


USAREA rule set

●This rule set is applied to city, state and postal code fields
●Data is parsed into city name, state abbreviation, zip code and zip plus four
●Phonetic coding of the city name is created for matching

Page 157: QS Essentials


Standardize results

●Business Intelligence fields
– Parsed from the original data; they may be used in matching and are generally moved to the target system
●Matching Fields
– Generally these fields are created to help during the match process and are dropped after successful matching
●Reporting fields
– Specifically created to help review results of Standardize and recognize handled and unhandled data

Page 158: QS Essentials


Business intelligence fields

● Intelligent data parsed and bucketed from the input free-form field

USNAME Examples

• Title

• First Name

• Middle Name

• Primary Name

• Generational

USADDR Examples

HouseNumber

Directional

Street Name

Unit Types

Box Types

Unit Values

Building Names

USAREA Examples

•City

•State

•Zip5

•Zip4

Page 159: QS Essentials


Standardize matching fields

●Phonetic coding
– NYSIIS
– Reverse NYSIIS
– Soundex
– Reverse Soundex
●Hash keys
– First 2 characters of the first five words
●Packed Keys
– Data concatenated, or packed
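Two of these matching fields are easy to sketch: Soundex (and its reverse) and the hash key. The code uses the common American Soundex rules and assumes Reverse Soundex is simply Soundex applied to the reversed string, which is consistent with the R520 shown for SUMMER in the domain-specific example; the variant QualityStage ships may differ in edge cases. NYSIIS is not implemented here, and all function names are hypothetical.

```python
# Letter groups of American Soundex; vowels, H, W and Y carry no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the four-character Soundex code of a word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # vowels reset prev, so a repeated code is kept after a vowel
    return (encoded + "000")[:4]


def reverse_soundex(word):
    """Assumption: Soundex of the reversed string."""
    return soundex(word[::-1])


def hash_key(text):
    """First 2 characters of the first five words, concatenated."""
    return "".join(w[:2] for w in text.upper().split()[:5])


street_code = soundex("SUMMER")
street_reverse_code = reverse_soundex("SUMMER")
address_key = hash_key("100 SUMMER STREET 15TH FLOOR")
```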

Page 160: QS Essentials


Standardize reporting fields

Unhandled Pattern – The pattern generated for tokens not processed by the selected rule set.
Unhandled Data – The remaining tokens not processed by the selected rule set.
Input Pattern – The pattern generated for the stream of input tokens based on the parsing rules and token classifications.
Exception Data – The tokens not processed by the rule set because they represent a data exception.
User override flag – Flag indicating what kind of user overrides were applied to this record

Page 161: QS Essentials


Investigate NAME unhandled patterns and data

● Identify the unhandled patterns for the NAME field. In the report include the unhandled data, input pattern, original data and the record key.

1. Build a Character Concatenate Investigation using the following fields
2. Increase the number of samples to 5

Field Description | Type
Unhandled Pattern | C
Unhandled Data | X
Input Pattern | X
Name domain data | X

Page 162: QS Essentials


Standard practice: investigate handled and unhandled data

●Review the business intelligence fields to ensure accurate bucketing of the data

●Build a Character Discrete Investigation for each field and review the contents and the format

●Build Investigation to review:
– Unhandled Patterns
– Unhandled Data
– Input Pattern
– Input Fields

Page 163: QS Essentials


Customizing rule sets

●A rule set may require modification if some input data is:
– Not processed
– Incorrectly processed
●QualityStage provides functions to:
– Test strings for classifications using the Rules Analyzer
– Apply user Overrides
– Modify classification table

Page 164: QS Essentials


User overrides

●Provides the user with the ability to modify rule sets
●The following types of rule sets can be modified using User Overrides
– Domain Pre-processor rule sets
– Domain rule sets
●There are five types of user overrides relating to: classifications, patterns, and text strings
●User overrides are
– GUI Driven
●Rule set should be provisioned after modifications are applied

Page 165: QS Essentials


User classification override

●Recognized as a keyword and classified
– Additional words
• New abbreviation, variation
• Misspelling of a word
●User Classifications may override or add:
– Original values (Token values)
– Standard value
– Class

Page 166: QS Essentials


Apply classification override

Unhandled Data:
Input Pattern | Original Data
+,+ | HOCHREITER , CAROLYNNE

Override (add CAROLYNNE as a valid first name to the classification table):
Token Value | Standard Value | Class
Carolynne | Carolynne | F

Corrected Pattern:
Input Pattern | Original Data
+,F | HOCHREITER , CAROLYNNE

Page 167: QS Essentials


Text overrides

●Allow the user to specify overrides based on an entire text string

●Use this override for special cases and specific handling of a string of text

●Input Text Overrides
– Applied to the original text string
●Unhandled Text Overrides
– Applied to the Unhandled Data field

Page 168: QS Essentials


Input text overrides

Input Text:
Unhandled Pattern | Input Text
++ | ZACHARIA GELLMAN
++ | TOMMOTHY CABBOTT
++ | REIFF FUNERAL

Override:
Input Text Override: REIFF FUNERAL
– Move text string to the Primary name field

Results:
Input Pattern | Primary Name
+ + | REIFF FUNERAL

Page 169: QS Essentials


Pattern overrides

●Allow the user to specify overrides based on an entire pattern
●Use this override when most or all records should be processed with identical logic
●Input Pattern Overrides
– Applied to the original text string
●Unhandled Pattern Overrides
– Applied to the Unhandled Data field

Page 170: QS Essentials


Unhandled pattern overrides

Unhandled Pattern:
Unhandled Pattern | Input Text
+, + | HAYWARD, WINSLOW
+, + | ESHAGHIAN , JOUBI
+, + | BOULDER, CORONA

Override:
Unhandled Pattern Override: +, +
– Move the first + to Primary Name
– Move the second + to First Name
– Comma provides context

Results:
Unhandled Pattern | First Name | Primary Name
+, + | WINSLOW | HAYWARD
+, + | JOUBI | ESHAGHIAN
+, + | CORONA | BOULDER

Page 171: QS Essentials


User override precedence

User Classification – Recognize words to classify
Input Text – Modify logic based on the input string
Input Pattern – Modify logic based on the input pattern
Unhandled Text – Modify logic based on the Unhandled data string
Unhandled Pattern – Modify logic based on the unhandled pattern

Page 172: QS Essentials


Investigate address and area unhandled patterns

● Identify the unhandled patterns for the Address and AREA fields. In the report include the unhandled data, input pattern, original data and the record key.

1. Build a Character Concatenate Investigation using the following fields
2. Increase the number of samples to 5

Field Description | Type
Unhandled Pattern | C
Unhandled Data | X
Input Pattern | X
Address Domain | X

Page 173: QS Essentials


Overrides

●Purpose
– Correct problems found during standardization
●Rule set may require overrides because you have data
– Not processed
– Incorrectly processed
●Override types
– Classification
– Input pattern
– Input text
– Unhandled pattern
– Unhandled text
●Can be tested with rules analyzer

Page 174: QS Essentials


Course exercise project design

[Figure: project flow for the Policy data: Investigate (Assess Data Quality) → Standardize Country → Select US Data for further processing → Condition Name, Address and Area → Investigate Conditioned Results → Apply User Overrides → Identify Duplicate Customer Records → Survive the Best Customer Record]

Page 175: QS Essentials


Overrides screen

Page 176: QS Essentials


Checkpoint

1. (T/F) WAVES can standardize name fields.
2. (T/F) Rule sets are used in standardization processing.
3. Name the components of rule sets.

Page 177: QS Essentials


Checkpoint solutions

1. (T/F) WAVES can standardize name fields.
Answer: False

2. (T/F) Rule sets are used in standardization processing.
Answer: True

3. Name the components of rule sets.
Answer: Classification table, dictionary, pattern action file, lookup tables

Page 178: QS Essentials


Unit summary

Having completed this unit, you should be able to:
– Describe the Standardize stage
– Identify rule sets
– Build jobs using the Standardize stage
– Interpret standardization results
– Investigate unhandled data and patterns

Page 179: QS Essentials


Lab 7: Standardize country

●Word investigation
– Uses COUNTRY rule set
●Rule set found in Other folder
●Adds ISO country code to records

Page 180: QS Essentials


Lab 8: Select US records

●Uses Select stage to separate records with US ISO code
●Could also use Transformer stage

Page 181: QS Essentials


Lab 9: Standardize USPREP

●Word investigation
– Uses rule set
●Rule set found in US folder

Page 182: QS Essentials


Lab 10: Standardize USNAME, USADDR, USAREA

●Word investigation
– Uses rule sets
●Rule sets found in US folder

Page 183: QS Essentials


Lab 11: Investigate unhandled patterns

●Character concatenate investigation
●C mask used to produce histogram
●X mask used to display other fields of interest

Page 184: QS Essentials


Lab 12: Apply user overrides

●Classification
●Input pattern
●Unhandled pattern

Page 185: QS Essentials


Match

Page 186: QS Essentials


Unit objectives

●After completing this unit, you should be able to:
– Build a QualityStage job to identify matching records
– Apply multiple match passes to increase efficiency/efficacy
– Interpret and improve match results

Page 187: QS Essentials


Match stage

●Statistically-based method for determining matches
●25 match comparison algorithms providing a full spectrum of fuzzy matching functions
●Ability to measure informational content of data
●Identify duplicate entities within one or more files
●Match specification built with Match Designer
●Critical field settings

Page 188: QS Essentials


What constitutes a good match?

Which of the following record pairs is a match? And how do you know?

W HOLDEN 12 MAIN ST
W HOLDEN 12 MAINE ST

W HOLDEN 128 MAIN PL 02111 12/8/62
W HOLDEN 128 MAINE PL 02110 12/8/62

WM HOLDEN 128A MAIN SQ 02111 12/8/62 338-0824
WILL HOLDEN 128A MAINE SQ 02110 12/8/62 338-0824

Do you compare all the shared or common fields?
Do you give partial credit?
Are some fields (or some values) more important to you than others? Why?
Do more fields increase your confidence?
By how much? What is enough?

Page 189: QS Essentials


The value of information content

●Information content measures the significance of one field over another (Discriminating Value)
– A Gender Code contributes less information than a Tax-Id Number
●Information content also measures the significance of one value in a field over another (Frequency)
– In a First-Name Field, JOHN contributes less information than DWEZEL
●Significance is determined by a value’s reliability and its ability to discriminate; both can be calculated from your data

Page 190: QS Essentials


The weighted score is a relative measure of the probability of a match

Thresholds defined can be used for automated processing

[Figure: Distribution of weights. Histogram of # of Pairs (0 to 4000) against Weight of Comparisons (-20 to 40), running from less confidence to more confidence; the Non-Matches and Matches distributions overlap in a grey area.]

WILLIAM J HOLDEN 128 MAIN ST 02111 12/8/62
WILLAIM JOHN HOLDEN 128 MAINE AVE 02110 12/8/62
+1 +1 +17 +2 +4 -1 +7 +9 = 40

Page 191: QS Essentials


Weights

●Measures the information content of a data value
●Each field contributes to the confidence (probability) of a match

Page 192: QS Essentials


Types of weights

●If a field matches, the agreement weight is used
– Agreement weight is a positive value
●If a field doesn’t match, the disagreement weight is used
– Disagreement weight is a negative value
●Partial weight is assigned for non-exact or “fuzzy” matches
●Missing values have a default weight of zero
●Weights for all field comparisons are summed to form a composite weight
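The rules above can be sketched as a small scoring function. It covers only exact agreement, disagreement, and missing values; partial weights for fuzzy comparisons are left out. The field names and weight values are hypothetical, chosen only to illustrate the summation.

```python
def field_weight(value_a, value_b, agreement, disagreement):
    """Weight contributed by one field comparison."""
    if not value_a or not value_b:
        return 0.0  # missing values have a default weight of zero
    return agreement if value_a == value_b else disagreement


def composite_weight(rec_a, rec_b, weights):
    """Sum the weights of all field comparisons."""
    return sum(
        field_weight(rec_a.get(field), rec_b.get(field), agree, disagree)
        for field, (agree, disagree) in weights.items()
    )


# Hypothetical (agreement, disagreement) weights per field.
WEIGHTS = {"last": (17, -5), "zip": (4, -1), "dob": (9, -6), "phone": (7, -3)}

rec_a = {"last": "HOLDEN", "zip": "02111", "dob": "12/8/62", "phone": None}
rec_b = {"last": "HOLDEN", "zip": "02110", "dob": "12/8/62", "phone": "338-0824"}

score = composite_weight(rec_a, rec_b, WEIGHTS)  # 17 - 1 + 9 + 0
```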

Page 193: QS Essentials


Matching terminology

Informational Content – Measures the significance of one field value over another
Weight – Measures the informational content of a data value
Composite Weight – Measures the confidence of a match
Match Cutoffs – Distinguish matches from non-matches
False Positives – Records with a score above the High cutoff that really aren’t a match
False Negatives – Records below the low cutoff that really are a match
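How the two cutoffs partition composite weights can be shown in a few lines. The cutoff values and the label used for the grey area between them are hypothetical; in practice the cutoffs are tuned by reviewing the weight distribution.

```python
def classify_pair(composite_weight, high_cutoff, low_cutoff):
    """Place a record pair relative to the high and low match cutoffs."""
    if composite_weight >= high_cutoff:
        return "match"
    if composite_weight >= low_cutoff:
        return "review"  # grey area between the two cutoffs
    return "nonmatch"


# Hypothetical cutoffs on the same scale as the composite weights.
HIGH_CUTOFF, LOW_CUTOFF = 20, 10

labels = [classify_pair(w, HIGH_CUTOFF, LOW_CUTOFF) for w in (40, 15, -5)]
```

Raising the high cutoff reduces false positives at the cost of more pairs falling into the grey area; lowering the low cutoff does the same for false negatives.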

Page 194: QS Essentials


Measuring the conditions of uncertainty

●Reliability of the data in a given field
– Estimated as the probability that the field agrees given the record pair is a match
●Probability of a random agreement of values
– Estimated as the probability the field agrees given the record pair is not a match

Page 195: QS Essentials


Reliability (m-probability)

●Approximated as 1 minus the error rate for the given field
●The higher the m-probability, the larger the disagreement penalty when the field does not match, since the data is considered reliable

Page 196: QS Essentials


Chance agreement (u-probability)

●The u-probability can be approximated as the probability that a field agrees at random (by chance)

●QualityStage uses a frequency analysis to determine the probability of chance agreement for all values
– Created by a Match Frequency stage

●Rare values bring more weight to a match
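The slides do not give the weight formula. A common way to turn m- and u-probabilities into weights is the log-odds form from probabilistic record linkage, sketched below as an assumption rather than as QualityStage's exact computation. The value frequencies are likewise a rough stand-in for the output of a frequency analysis, and all names and numbers are hypothetical.

```python
import math
from collections import Counter


def agreement_weight(m, u):
    """Log-odds weight when the field agrees (assumed form: log2(m / u))."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Log-odds weight when the field disagrees (assumed form: log2((1 - m) / (1 - u)))."""
    return math.log2((1 - m) / (1 - u))


def value_u_probabilities(values):
    """Approximate chance agreement per value by its relative frequency."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


# Hypothetical first-name column: JOHN is common, DWEZEL is rare.
names = ["JOHN"] * 50 + ["MARY"] * 49 + ["DWEZEL"]
u_by_value = value_u_probabilities(names)

M_FIRST_NAME = 0.9  # hypothetical reliability (1 minus the error rate)
john_weight = agreement_weight(M_FIRST_NAME, u_by_value["JOHN"])
dwezel_weight = agreement_weight(M_FIRST_NAME, u_by_value["DWEZEL"])
```

Under this form a rare value such as DWEZEL earns a larger agreement weight than JOHN, and raising m makes the disagreement weight more negative.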

Page 197: QS Essentials


Blocking

●Grouping together like records that have a high probability of producing matches

●Only “like” records are compared to each other making the match more efficient and computationally feasible

●Records in a “block” match exactly on one to several blocking fields

Page 198: QS Essentials


Blocking example: sample data

Block on NYSIIS of Last Name

NYSIIS LNAME | NAME | ADDRESS | ZIP
YANG | YUNG , WAYNE D | 9000 SHEPARD DRIVE | 78753
GARAS | GEROSA, FRAN X | 29 AARONS CT | 06877
YANG | YOUNG , JONATHAN A | 1767 TOBEY ROAD | 30341
GARAS | GERISA, FRANCIS | 29 AARONS CT | 06877
GARAS | GEROSA, FRANCIS XAVIER | 29 AARONS COURT | 06877
MATAC | MARCUS MATIC | 100 SUMMER STREET | 02111
GARAS | GEROSA, MARY | 29 AARONS CT | 06877
JANCAN | RENEE JENKINS | 100 SUMMER STREET | 02111
YANG | YOUNG THERESA C | 1767 TOBEY ROAD | 30341

Page 199: QS Essentials


Blocking example – NYSIIS of Last Name

NYSIIS | NAME | ADDRESS | ZIP
YANG | YUNG , WAYNE D | 9000 SHEPARD DRIVE | 78753
YANG | YOUNG , JONATHAN A | 4220 BELLE PARK DR | 77072
YANG | YOUNG THERESA C | 1767 TOBEY ROAD | 30341
GARAS | GEROSA, FRAN X | 29 AARONS CT | 06877
GARAS | GEROSA, FRANCIS XAVIER | 29 AARONS COURT | 06877
GARAS | GEROSA, MARY | 29 AARONS CT | 06877
GARAS | GARISA, FRANCIS | 29 AARONS CT | 06877
MATAC | MARCUS MATIC | 100 SUMMER STREET | 02111
JANCAN | RENEE JENKINS | 100 SUMMER STREET | 02111

Blocks with only one record are considered residuals
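The grouping step can be sketched with the NYSIIS codes and names from the sample data. The codes are taken from the slide rather than computed, and the helper name `build_blocks` is hypothetical.

```python
from collections import defaultdict

# (NYSIIS of last name, name) pairs from the sample data.
RECORDS = [
    ("YANG", "YUNG , WAYNE D"),
    ("GARAS", "GEROSA, FRAN X"),
    ("YANG", "YOUNG , JONATHAN A"),
    ("GARAS", "GERISA, FRANCIS"),
    ("GARAS", "GEROSA, FRANCIS XAVIER"),
    ("MATAC", "MARCUS MATIC"),
    ("GARAS", "GEROSA, MARY"),
    ("JANCAN", "RENEE JENKINS"),
    ("YANG", "YOUNG THERESA C"),
]


def build_blocks(records):
    """Group records whose blocking key matches exactly."""
    blocks = defaultdict(list)
    for key, name in records:
        blocks[key].append(name)
    return dict(blocks)


blocks = build_blocks(RECORDS)
# Only blocks with two or more records produce pairs to compare.
candidates = {key: names for key, names in blocks.items() if len(names) > 1}
residuals = {key: names for key, names in blocks.items() if len(names) == 1}
```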

Page 200: QS Essentials


Blocking strategy

●Choose fields with reliable data
●Choose fields with a good distribution of values
●Combinations of fields may be used

Page 201: QS Essentials


Examples of blocking strategies

●Zip code for matching addresses
●NYSIIS of last name for matching individuals
●Brand name for matching products
●Combination of zip code and NYSIIS of street name for matching addresses
●Combination of NYSIIS of last name and first letter of first name for matching individuals

Page 202: QS Essentials


Blocking summary

●Blocking groups together “like” records
●Matching is more efficient for small block sizes
  – Blocks should have fewer than 1,000 records (a guideline, not a hard and fast rule)
●Blocking fields must match exactly for a candidate set to be created and evaluated
●Beware of block overflow
  – The match runs out of computational resources
  – Comparisons are not completed
  – Every record in the block becomes an automatic residual
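A rough way to see why small blocks are cheaper: if every record in a block is compared with every other record in that block (an assumption made here for illustration, not a statement about QualityStage internals), a block of n records needs n(n - 1)/2 comparisons. The helper name `pair_count` is hypothetical.

```python
def pair_count(block_size: int) -> int:
    """Number of distinct record pairs in one block: n * (n - 1) / 2."""
    return block_size * (block_size - 1) // 2


# The same 10,000 records, blocked two different ways.
one_large_block = pair_count(10_000)
ten_small_blocks = 10 * pair_count(1_000)

print(one_large_block)   # 49995000
print(ten_small_blocks)  # 4995000
```

Splitting the same records into ten equal blocks cuts the number of pair comparisons by roughly a factor of ten.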

Page 203: QS Essentials


Match types

●Unduplication
  – Identifies duplicate candidates in one file
●Reference Match (Two File)
  – One-to-one correspondence
    • For every record on the stream link we expect to find a match to one record on the reference link
  – Many-to-one correspondence
    • More than one record on the stream link can match the same record on the reference link

Page 204: QS Essentials


Comparing data values

●Different comparisons for different data
●17 comparison methods
●Most common
  – CHAR (character comparison): compares character by character, left to right
  – UNCERT (character uncertainty): tolerates phonetic errors, transpositions, and random insertion, deletion, and replacement of characters
  – CNT_DIFF: counts keying errors in numeric fields; you set a tolerance threshold
  – NAME_UNCERT: can be used to compare character values; if the strings are of different lengths, the shorter of the two lengths is used
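The difference between an exact comparison and a tolerant one can be illustrated with Python's standard library. This is a loose analogy only: the course does not describe how UNCERT or CNT_DIFF are computed, so `difflib.SequenceMatcher` and the position-by-position count below are stand-ins, and all three function names are hypothetical.

```python
from difflib import SequenceMatcher


def char_compare(a: str, b: str) -> bool:
    """Exact comparison: every character must agree."""
    return a == b


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; a stand-in for an uncertainty-style comparison."""
    return SequenceMatcher(None, a, b).ratio()


def count_differences(a: str, b: str) -> int:
    """Count differing positions in two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


print(char_compare("GEROSA", "GERISA"))       # False: one character differs
print(similarity("GEROSA", "GERISA") > 0.8)   # True: still very similar
print(count_differences("06877", "06871"))    # 1
```

An exact comparison rejects GEROSA and GERISA outright, while a tolerant comparison still scores them as close, which is the behavior the slide attributes to the uncertainty comparisons.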

Page 205: QS Essentials


Match Implementation

Page 206: QS Essentials


Tasks required in match process

●Standardize the data
●Add data columns needed for blocking
●Generate match frequency report using Match Frequency stage
●Build match specification in Match Designer
  – Add pass
    • Blocking columns
    • Match commands
  – Configure match test results environment
●Run pass
●Review results
●Tune the match
  – Add cutoffs
  – Set overrides
  – Add more passes
●Repeat steps until match results are acceptable

Page 207: QS Essentials


Standardize columns and generate match frequency

Page 208: QS Essentials


Match frequency stage

Map fields

Page 209: QS Essentials


Match frequency generation

Page 210: QS Essentials


Lab 13: Match frequency

●Use Match Frequency stage in a match job

Page 211: QS Essentials


Match Designer

●Used to build a match specification that will be referenced in a match job
●Features
  – Design control center
  – Data-centric
  – Graphical representation of statistics
  – Independent of job design
  – Iterative development

Page 212: QS Essentials


Match Design - Unduplicate

Page 213: QS Essentials


Match design – creating specification

How to create a new match specification

Right-click in non-root area of repository

Page 214: QS Essentials


Match design - unduplicate

The Major Components Test resultsHistogram Holding Area

Pass Composer

Decision Rules

Data Viewer

Cutoff Tuning

Page 215: QS Essentials


Match design - unduplicate

Select match type –example unduplicate

Will initially get one pass called MyPass

Page 216: QS Essentials


Match design - unduplicate

Click table definition icon

Use load button to access table definition of

standardized data set

Page 217: QS Essentials


Match design - unduplicate

Blocking

Match Commands

Select pass icon

Page 218: QS Essentials


Match design - unduplicate

Save passes and specification

Page 219: QS Essentials


Match design - unduplicate

Name and place passes and specification

Page 220: QS Essentials


Match design - unduplicate

Set up test results area

Questions:

Where is the standardized data?

Where is the frequency report?

What ODBC-accessed database will store test results?

Page 221: QS Essentials


Match design - unduplicate

Standardized sample data

Frequencies data set

Data Source Name

User Name

Password

Note: these must be data sets

Page 222: QS Essentials


Match design - unduplicate

Add Blocking Columns

Page 223: QS Essentials


Match design - unduplicate

Select Column

Business Name

Click Apply or OK

Page 224: QS Essentials


Match design - unduplicate

Add MATCH Column

Page 225: QS Essentials


Match design - unduplicate

Business Name

Page 226: QS Essentials


Match design - unduplicate

Compare Type

Page 227: QS Essentials


Match design - unduplicate

Data Column

Right-click to view data frequencies

Page 228: QS Essentials


Match design - unduplicate

Frequencies

Page 229: QS Essentials


Match design - unduplicate

Select

Parameter

Page 230: QS Essentials


Fully configured pass

Expanded view will show details of

blocking and match commands

Click test pass to run the pass against the

data

Page 231: QS Essentials


Match design – after test pass run

Page 232: QS Essentials


Match design - unduplicate

Grouping option:
  – Match Sets: See all matches and duplicates together
  – Match Pairs + Sort: See the master record repeated

Page 233: QS Essentials


Match design - unduplicate

Default Display (Grouped by Match Sets)

Grouped by Match Pairs and then sorted Ascending by Weight

Page 234: QS Essentials


Match design - unduplicate

Compare Weights:See how any two records score

Page 235: QS Essentials


Match design - unduplicate

Statistics Tab

Change What Shows

Page 236: QS Essentials


Match design - unduplicate

Change How Shows

Page 237: QS Essentials


Match design - unduplicate

TOTAL Statistics Tab

Change What Shows

Change How Shows

Page 238: QS Essentials


Lab 14: Configure test results database

●Build a DB2 database to contain match test results
●Build an ODBC source to connect the database to QualityStage

Page 239: QS Essentials


Lab 15: Match specification

●Use Match Designer to build specification for unduplicate job
●Configure test results area

Page 240: QS Essentials


Match improvement strategy

1. Set critical values for important fields
2. Review calculated weights
   • Adjust weights using weight overrides
3. Set cutoffs
4. Add additional passes

Page 241: QS Essentials


Critical fields

●Used to identify fields that must agree in order for records to be linked
  – Critical – Field values must agree exactly or the records cannot be linked (considered a match)
  – Critical Missing OK – Field values must agree exactly unless one of them is considered a “missing value”
●QualityStage feature: Variable Special Handling
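The two critical-field settings can be sketched as a single check. This is a simplified illustration, not QualityStage's implementation: the function name `critical_ok`, the `missing_ok` flag, and the choice of which values count as missing are all assumptions.

```python
# Assumption: blank strings and None are the "missing values".
MISSING_VALUES = {"", None}


def critical_ok(a, b, missing_ok=False):
    """Return True if this critical field still allows the pair to be linked.

    missing_ok=False models "Critical": the values must agree exactly.
    missing_ok=True models "Critical Missing OK": a missing value on
    either side does not block the link.
    """
    if missing_ok and (a in MISSING_VALUES or b in MISSING_VALUES):
        return True
    return a == b


print(critical_ok("06877", "06877"))               # True
print(critical_ok("06877", ""))                    # False
print(critical_ok("06877", "", missing_ok=True))   # True
print(critical_ok("06877", "02111", missing_ok=True))  # False
```

A pair that fails this check would be rejected regardless of how well its other fields agree.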

Page 242: QS Essentials


Variable special handling

Page 243: QS Essentials


Weight overrides

●Allow you to adjust the agreement weight, the disagreement weight, or both for specific situations
  – Add to calculated weight
  – Replace weight

On Match Commands screen

Page 244: QS Essentials


Weight override screen

Page 245: QS Essentials


Cutoffs

●There are two cutoffs
  – Match cutoff (high cutoff)
  – Clerical cutoff (low cutoff)
●Records with a weight equal to or above the match cutoff are considered matches
●Records with a weight below the clerical cutoff are not matches
●Records with a weight greater than or equal to the clerical cutoff and less than the match cutoff are considered clerical records for manual review
●Cutoffs can be set at the same value, eliminating clerical records
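The cutoff rules above amount to a three-way classification of a pair's composite weight. A minimal sketch follows; the function name `classify` and the cutoff values used in the example are hypothetical, not values from the course.

```python
def classify(weight: float, match_cutoff: float, clerical_cutoff: float) -> str:
    """Classify a composite weight using the two cutoffs."""
    if clerical_cutoff > match_cutoff:
        raise ValueError("clerical cutoff must not exceed match cutoff")
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "nonmatch"


# Hypothetical cutoffs: match at 20, clerical at 10.
for weight in (27.82, 14.45, 5.0):
    print(weight, classify(weight, match_cutoff=20.0, clerical_cutoff=10.0))

# With both cutoffs at the same value no weight can land in the clerical range.
print(classify(14.45, match_cutoff=15.0, clerical_cutoff=15.0))  # nonmatch
```

Because the clerical range is "at or above the low cutoff and below the high cutoff", setting both to the same value leaves that range empty.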

Page 246: QS Essentials


Setting the match cut-off

Weights  Data fields
27.82    PO BOX 930202
27.82    PO BOX 930202
27.82    PO BOX 930202
         Definite match

38.65    35 COLLIER RD NW STE 610
38.65    35 COLLIER RD NW STE 610
         Definite match

25.81    928 S 1ST ST
14.45    S 1ST ST
         Questionable match

Page 247: QS Essentials


Multiple match passes

●Additional passes are helpful in overcoming data errors and missing values in block fields
●You should always create at least two match passes
●Change blocking strategies for each pass

Page 248: QS Essentials


Example: multiple match passes

Pass  Weights  Data fields
1     26.31    JASON BIRCH 1350 WALTON WAY 30901
1     26.31    JASON BIRSH 1350 WALTON WAY 30901

1     20.42    JOHN SMITH 2047 PRINCE AVE 30604
1     10.83    MARY SMITH 2047 PRINCE AVE 30604

1     RES A    JOHN SMITH P.O. BOX 123 30604

2     20.42    JOHN SMITH 2047 PRINCE AVE 30604
2     10.19    JOHN SMITH P.O. BOX 123 30604

Pass 1 blocked on street name
Pass 2 found additional matched records in which the street name was different but the names were the same
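The effect of a second pass with a different blocking key can be sketched as follows. This is a deliberately simplified model with several assumptions: every block with more than one record is treated as a match group (the weight scoring is skipped), the whole address line is used as the pass 1 blocking key, the first record of each group is carried forward as its master, and the helper name `run_pass` is hypothetical.

```python
from collections import defaultdict

RECORDS = [
    {"name": "JASON BIRCH", "street": "1350 WALTON WAY", "zip": "30901"},
    {"name": "JASON BIRSH", "street": "1350 WALTON WAY", "zip": "30901"},
    {"name": "JOHN SMITH", "street": "2047 PRINCE AVE", "zip": "30604"},
    {"name": "MARY SMITH", "street": "2047 PRINCE AVE", "zip": "30604"},
    {"name": "JOHN SMITH", "street": "P.O. BOX 123", "zip": "30604"},
]


def run_pass(records, block_field):
    """Block on one field; return (groups, records left unmatched)."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[block_field]].append(record)
    groups = [rows for rows in blocks.values() if len(rows) > 1]
    unmatched = [rows[0] for rows in blocks.values() if len(rows) == 1]
    return groups, unmatched


# Pass 1 blocks on the street; the P.O. box record finds no partner.
groups1, residuals1 = run_pass(RECORDS, "street")

# Pass 2 blocks on the name, using pass 1 masters plus pass 1 residuals.
masters1 = [group[0] for group in groups1]
groups2, residuals2 = run_pass(masters1 + residuals1, "name")

print(len(groups1), "groups in pass 1;", len(residuals1), "residual")
print([r["street"] for r in groups2[0]])
```

The JOHN SMITH record with the P.O. box is a residual after pass 1 and is picked up in pass 2, where the name rather than the street decides which records are compared.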

Page 249: QS Essentials


Match Implementation – Unduplicate job

Page 250: QS Essentials


Double Click

Unduplication implementation

Page 251: QS Essentials


Unduplication Implementation

Page 252: QS Essentials


Verify link order for both input and output

Page 253: QS Essentials


Map all output links

Page 254: QS Essentials


Checkpoint

1. (T/F) Match specifications are created using Designer.
2. (T/F) An unduplicate match can be used against two files.
3. Which match specification component determines the extent of the clerical review records?

Page 255: QS Essentials


Checkpoint solutions

1. (T/F) Match specifications are created using Designer.
   Answer: True

2. (T/F) An unduplicate match can be used against two files.
   Answer: False

3. Which match specification component determines the extent of the clerical review records?
   Answer: cutoff values

Page 256: QS Essentials


Unit summary

Having completed this unit, you should be able to:
  – Build a QualityStage job to identify matching records
  – Apply multiple match passes to increase efficiency/efficacy
  – Interpret and improve match results

Page 257: QS Essentials


Lab 16: Unduplicate job

●Build unduplicate job using the match specification

Page 258: QS Essentials


Survive

Page 259: QS Essentials


Unit objectives

●After completing this unit, you should be able to:
  – Identify Survive techniques
  – Describe implementation options
  – Define Survive rules
  – Build Survive job

Page 260: QS Essentials


Survive stage

●Point-and-click creation of business rules to determine “surviving” data – user decides how to survive data
●Performed at record or field level – very flexible
●Creates a single, consolidated record containing the “best-of-breed” data
●Provides consolidated view of the data

Page 261: QS Essentials


Survive example

Survive Input (Match Output)

Group  Legacy  First    Middle  Last     No.   Dir.  Str. Name   Type  Unit No.
1      D150    Bob              Dixon    1500  SE    ROSS CLARK  CIR
1      A1367   Robert           Dickson  1500        ROSS CLARK  CIR
23     D689    William  A       Obrian   5901  SW    74TH        ST    STE 202
23     A436    Billy    Alex    O’Brian  5901  SW    74TH        ST
23     D352    William          Obrian   5901        74          ST    #202

Survived Consolidated Output

Group  Legacy  First    Middle  Last     No.   Dir.  Str. Name   Type  Unit No.
1      D150    Robert           Dickson  1500  SE    ROSS CLARK  CIR
23     D689    William  Alex    O’Brian  5901  SW    74TH        ST    STE 202

Cross-Reference File

Group  Legacy
1      D150
1      A1367
23     D689
23     A436
23     D352

Page 262: QS Essentials


Survive rules

●A rule contains a condition and a set of target fields
  – When the condition is met, the field becomes a candidate for the “best”
  – All records in a group are tested against the condition
  – The “best” populates the target fields
●Multiple targets are permitted for the same rule

Page 263: QS Essentials


Survive rules

●Custom Rule
  – Build your own logical expression
  – Comparison (=, !=, <, >, <=, >=)
  – Logical (and, or, not)
  – Indicate the current and best records with the following notation
    • c.field indicates the current
    • b.field indicates the best
  – Parentheses ( ) can be used for grouping complex conditions
  – String literals are enclosed in double quotation marks, such as "MARS".
  – A semicolon (;) terminates a rule.

Page 264: QS Essentials


Building survive rules

●Survive Rules Definition screen lets you easily build, delete and manage survivor rules

Page 265: QS Essentials


Survive techniques

●Pre-defined Techniques
  – Source
  – Recency
  – Frequency
  – Most complete (longest string)
●User-specified logic
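Three of the pre-defined techniques are easy to picture as small functions over the values in one match group. The sketch below is an illustration only: the function names, the tie-breaking (first value wins), and the `updated` date field are assumptions, not QualityStage behavior.

```python
from collections import Counter


def _non_blank(values):
    """Strip whitespace and drop empty or missing values."""
    return [v.strip() for v in values if v and v.strip()]


def longest(values):
    """Most complete: the longest non-blank string (first wins on ties)."""
    candidates = _non_blank(values)
    return max(candidates, key=len) if candidates else ""


def most_frequent(values):
    """Frequency: the most common non-blank value."""
    candidates = _non_blank(values)
    return Counter(candidates).most_common(1)[0][0] if candidates else ""


def most_recent(records, field, date_field):
    """Recency: the field value from the record with the latest date."""
    return max(records, key=lambda r: r[date_field])[field]


# Values taken from group 23 of the Survive example.
print(longest(["A", "Alex", ""]))                        # Alex
print(most_frequent(["William", "Billy", "William"]))    # William

# Hypothetical dates in ISO format, which sort correctly as strings.
dated = [
    {"last": "Obrian", "updated": "2007-03-01"},
    {"last": "O'Brian", "updated": "2008-11-15"},
]
print(most_recent(dated, "last", "updated"))             # O'Brian
```

Each technique can pick its value from a different record, which is how a survived record ends up combining fields from several inputs.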

Page 266: QS Essentials


Target fields

●Fields you want to write to the output file
●Populated based on meeting the conditions of the survivor rule(s)
●Fields not listed as targets are excluded from the output file
●May have multiple targets for each rule

Page 267: QS Essentials


Example: complex survive rule

●The following rule states that FIELD3 of the current record should be retained if the field contains five or more characters and FIELD1 has any contents.

FIELD3: (SIZEOF (TRIM c.FIELD3) >= 5) AND (SIZEOF (TRIM c.FIELD1) > 0) ;

TARGET: FIELD3 (the name before the colon)
CONDITION: the expression after the colon

●The prefix b. indicates the current “best” record
●The prefix c. indicates the current record being tested against the survivor rule
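The condition in this rule can be mirrored in Python to make its logic concrete. This is a sketch under stated assumptions: `SIZEOF` and `TRIM` are modeled with `len` and `str.strip`, the function names are hypothetical, and the choice to keep the first record that meets the condition is an assumption, since the slide does not say how ties between qualifying records are resolved.

```python
def meets_condition(current: dict) -> bool:
    """(SIZEOF (TRIM c.FIELD3) >= 5) AND (SIZEOF (TRIM c.FIELD1) > 0)"""
    return (
        len(current["FIELD3"].strip()) >= 5
        and len(current["FIELD1"].strip()) > 0
    )


def survive_field3(group: list) -> str:
    """Return FIELD3 from the first record in the group that qualifies."""
    for current in group:
        if meets_condition(current):
            return current["FIELD3"]
    return ""


group = [
    {"FIELD1": "x", "FIELD3": "abc"},       # FIELD3 too short
    {"FIELD1": " ", "FIELD3": "abcdef"},    # FIELD1 blank after trimming
    {"FIELD1": "y", "FIELD3": "abcde"},     # qualifies
]
print(survive_field3(group))  # abcde
```

Both parts of the condition must hold for the same record: a long FIELD3 on a record whose FIELD1 is blank does not qualify.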

Page 268: QS Essentials


Survive Implementation

Page 269: QS Essentials


Double Click

Survive QualityStage job

Page 270: QS Essentials


Survive stage properties

Page 271: QS Essentials


Output Column Technique

Survive stage properties

Page 272: QS Essentials


‘Complex’ available

Survive stage properties

Page 273: QS Essentials


Checkpoint

1. (T/F) Survivorship can allow more than one record to survive.
2. (T/F) Survivorship rules deal with the complete record only.
3. Name three survive rules.

Page 274: QS Essentials


Checkpoint solutions

1. (T/F) Survivorship can allow more than one record to survive.
   Answer: False

2. (T/F) Survivorship rules deal with the complete record only.
   Answer: False

3. Name three survive rules.
   Answer: most recent record, longest non-blank, most frequent non-blank

Page 275: QS Essentials


Unit summary

Having completed this unit, you should be able to:
●Identify Survive techniques
●Describe implementation options
●Define Survive rules
●Build Survive job

Page 276: QS Essentials


Lab 17: Survivorship job

●Build survivorship job

Page 277: QS Essentials


Survive job with XREF file

Page 278: QS Essentials


Special Topics

Page 279: QS Essentials


Full Run

Page 280: QS Essentials


1. Double Click

Full run – single job

Page 281: QS Essentials


Full run using DataStage job sequencer

Page 282: QS Essentials


QualityStage Migration Tool

Page 283: QS Essentials


QualityStage Migration Tool – Overview

●The QualityStage Migration Tool (QSMT) provides the ability to migrate QualityStage 7.5 jobs and Standardization Rule Sets to the QS8 environment.
●QSMT analyzes the QS 7.5 server project directory to construct “dsx” files, which can be imported into the QS8 common repository using the DS & QS8 Designer’s “import” facility

Page 284: QS Essentials


QualityStage migration tool – overview

●QSMT functionality offers three types of QS 7.5 object migrations:
  – QS 7.5 Standardization Rule Set
  – QS 7.5 job in combined mode
  – QS 7.5 job in expanded mode
●Two modes for migrating jobs to QS8:
  – Combined Mode
    • Use when you need to take a legacy process and just run it in QS8
    • Allows control before and after the legacy process
    • Will always run after importing without any manual tuning
  – Expanded Mode
    • Use when you need to add QS8 operators within a migrated process
    • May require some manual tuning to run

Page 285: QS Essentials


Rule set migration

●The QSMT has the ability to migrate Standardization Rule Sets in one of two ways:
  – Explicitly: you may specify the rule set you want to migrate
  – By job dependency: you may migrate all rules associated with a particular job

Note: Regardless of the migration mode, all migrated rules will have the new naming convention: QS-7.5-Ruleset-Name_QS-7.5-Project-Name

Page 286: QS Essentials


Combined mode migration

●Use this mode to get a legacy QS job up and running in QS8 with as little effort as possible. Jobs will import and run without modifications
●After importing, a migrated job will appear in the “Jobs” folder of the repository view in the QS/DS 8 Designer client
●Jobs are renamed by QSMT within the QS8 package to minimize name collision
●The new job name has the following naming convention: QS-7.5-Job-Name_QS-7.5-Project-Name

Page 287: QS Essentials


The job consists of a single instance of the QS 8 Legacy Job stage, together with some number of DS Sequential File stages, which are linked to the Legacy Job stage as inputs or outputs

QSMT – combined mode migration

Page 288: QS Essentials


● All the QS stages run under the control of the single Legacy Job stage in Combined Mode

● The list of operations can be seen by opening the Legacy stage

File IO to external files is performed by the Information Server Sequential File stages

QSMT – combined mode migration

Page 289: QS Essentials


QSMT – combined mode & running a QS8 job

●Once imported, Legacy jobs are run the same as any other QS8 job
  – Prior to compiling, be sure any required rule sets are provisioned to the server
  – Run as you would any other QS8 job

Page 290: QS Essentials


QSMT – expanded mode

●Use to re-implement the job in the QS8 environment
●After importing, a migrated job will appear in the “Jobs” folder in the same way as in Combined Mode
●The job consists of one or more stages for each 7.5 stage, plus DS PX Sequential File stages, linked to represent the 7.5 job flow. For complex jobs, stages may need to be reorganized to improve readability

Page 291: QS Essentials


QS stage migration reference table

QS 7.5 Stage Type          QS8 Stage Type   Conditions
Abbreviate                 Legacy Job       Always
Build                      Legacy Job       Always
Collapse                   Legacy Job       Always
FFC                        Copy             “Delimited text” used in 7.5 stage
FFC                        ODBC Enterprise  “ODBC” used in 7.5 stage
Investigate                Legacy Job       Always
Match*                     Legacy Job       Always
Multinational Standardize  MNS              Always
Parse                      Legacy Job       Always
Select                     Legacy Job       “Merge” used in 7.5 stage
Select                     Filter           “Split”, “Accept”, or “Reject” used in 7.5

* Currently working on converting Match specifications for GA

Page 292: QS Essentials


QS stage migration reference table

QS 7.5 Stage Type  QS8 Stage Type  Conditions
Sort               Sort            Always
Standardize        Standardize     Always
Survive            Legacy Job      If target columns overlap
Survive            Survive         If target columns do not overlap
Transfer           Legacy Job      Always
Unijoin            Legacy Job      Always
WAVES              WAVES           Always

Page 293: QS Essentials


QSMT – expanded mode & running a QS8 job

●Prior to compiling, be sure to complete the following:
  – Provision any required rules to the server
  – Add ODBC connection information to any ODBC read or write stages appearing in the job
●To complete the migration, perform the following for every Standardize, Survive, MNS and WAVES stage that appears on the canvas:
  – Open the stage editor for the stage (e.g. by double-clicking it)
  – Click OK
●Once the above tasks are completed, compile and run as you would any other job

Page 294: QS Essentials


Lab 18: QualityStage Migration Tool

●Migrate 7.5 QualityStage jobs to version 8

Page 295: QS Essentials


Globalization

Page 296: QS Essentials


Objectives

After completing this module you will be able to:
●Build jobs that read and write Japanese data
●Modify client settings to display Japanese data with correct characters

Page 297: QS Essentials


Terminology

●Character Set
  – An ordered list of characters used for text
    • Examples: Latin, Cyrillic, Unicode
●Character Encoding
  – How each character in a character set is represented as bits
    • Examples: UTF-8, UTF-16BE, GB18030 are encodings of Unicode
●Codepage
  – Microsoft Windows term for Encoding, often used in other contexts too
    • Examples:
      – 1252 is Windows Latin1, a superset of ISO8859/1
      – 932 is another name for Shift-JIS

Page 298: QS Essentials


Character Sets

●Latin
  – Italian, Spanish, French, English alphabets
●Cyrillic alphabet
  – Subsets are used by six Slavic languages (Bulgarian, Russian, Belarusian, Serbian, Macedonian, Ukrainian) and some non-Slavic languages (Kazakh, Uzbek, Kyrgyz, Tajik, and Mongolian)
●ASCII
  – Represents 128 characters (7-bit); 8-bit extensions such as ISO 8859-1 represent 256
●Unicode
  – Defines 1,114,112 code points; the first 65,536 form the Basic Multilingual Plane
  – Standard for representing the characters of all languages
  – Includes Chinese, Japanese, and Korean

Page 299: QS Essentials


Character encoding

●Definition – A system that pairs each character in a character set with something else, such as a number
●Two common computer encodings for Unicode
  – UTF-8
    • Variable-length encoding for Unicode
    • Encodes each character as one to four bytes
  – UTF-16
    • Variable-length encoding for Unicode (two or four bytes per character)
    • Allows either byte order; a byte order mark (BOM) at the start of the data indicates which one is used
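The variable-length behavior of both encodings, and the role of the byte order mark, can be checked with Python's built-in codecs. This is a small illustration using the standard library; the helper names `encoded_lengths` and `detect_utf16_order` are hypothetical.

```python
import codecs

# One character from each UTF-8 length class: ASCII, accented Latin,
# a Japanese kanji, and a character outside the Basic Multilingual Plane.
SAMPLES = ["A", "\u00e9", "\u6771", "\U0001F600"]


def encoded_lengths(text: str) -> dict:
    """Byte counts for one string in UTF-8 and in UTF-16 (without a BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),
    }


def detect_utf16_order(data: bytes) -> str:
    """Inspect the first two bytes for a UTF-16 byte order mark."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little-endian"
    return "unknown (no BOM)"


for sample in SAMPLES:
    print(repr(sample), encoded_lengths(sample))

marked = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")
print(detect_utf16_order(marked))  # little-endian
```

The four samples take 1, 2, 3, and 4 bytes in UTF-8, while in UTF-16 the first three take 2 bytes each and the last takes 4.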

Page 300: QS Essentials


NLS

●NLS – National Language Support
  – Globalization + Localization/Translation
●NLS map
  – What DataStage uses to convert between external and internal encodings
  – Internal encoding is UTF-8 for the Server engine, UTF-16 for the Parallel engine
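What a map does, conceptually, is decode bytes from an external encoding into the engine's internal Unicode form and encode them again on the way out. The sketch below imitates that with Python codecs; it is an analogy only. Python codec names such as `shift_jis` stand in for DataStage NLS map names, a Python `str` stands in for the internal representation, and the function names are hypothetical.

```python
def to_internal(raw: bytes, external_map: str) -> str:
    """Decode bytes from an external encoding into a Unicode string."""
    return raw.decode(external_map)


def to_external(text: str, external_map: str) -> bytes:
    """Encode a Unicode string into an external encoding."""
    return text.encode(external_map)


city = "\u6771\u4eac"  # "Tokyo" written in kanji

# Write with a Shift-JIS map, then read back with the same map.
sjis_bytes = to_external(city, "shift_jis")
round_trip = to_internal(sjis_bytes, "shift_jis")
print(round_trip == city)  # True

# Reading the same bytes with the wrong map silently produces garbage.
wrong = to_internal(sjis_bytes, "latin-1")
print(wrong == city)       # False
```

The second case is the practical reason the map setting matters: a mismatched map often raises no error and simply yields unreadable characters.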

Page 301: QS Essentials


Information Server Common Design Repository

Where DataStage NLS Mapping Happens

Parallel Engine running jobUnicode (UTF-16)

DataStage & QualityStageRuntime ObjectsUnicode (UTF-8)

External character setExternal character set

Messages

XML(UTF-8)

Map Map

Scripts, etc. Job MonitorUnicode (UTF-16)

Windows code page

Map

Client

Server

Logs

Page 302: QS Essentials


Examples of DataStage NLS Maps

Parallel          Description
IBM367            Standard (US) ASCII 7-bit set
Big5              TAIWAN: "Big 5" standard
IBM1026           IBM EBCDIC variant 1026 (Turkish)
GB2312            CHINESE: EUC as per GB 2312
ISO_8859-1:1987   ISO Standard 8859 part 1: Latin-1
ISO_8859-5:1988   ISO Standard 8859 part 5: Latin-Cyrillic
KS_C_5601-1987    KOREAN: EUC as per KSC 5861
windows-1253      MS Windows codepage 1253 (Greek)
windows-1255      MS Windows codepage 1255 (Hebrew)
IBM865            PC DOS code page 865 (Nordic)
Shift_JIS         JAPANESE: Shift-JIS main map
TIS-620           THAILAND: Industrial Standard 620

Page 303: QS Essentials


DataStage & QualityStageRuntime ObjectsUnicode (UTF-8)Map

Client

Admin Client (whole server)

Associates a server map with the current Windows

code page

Setting a Client/Server Map

Page 304: QS Essentials


Sets the default map name to use with all Parallel jobs

in this project

Admin Client (per project)

…unless you override it in the job properties dialog

Setting Job-Level Maps

Page 305: QS Essentials


Parallel Engine running jobUnicode (UTF-16)

External character set

Map

Server

Various stages have an NLS Map tab, e.g. Sequential File, External Source, External Target, File Set
  – Define character set mappings (ustring external file)
  – Applied at stage or individual field level

Setting a Stage-Level Map

Page 306: QS Essentials


Setting a Column-Level Map

For Sequential File-type stages, NChar columns, and Char columns with the extended type, offer a drop-down list of map names in the NLS Map property

Non-default NLS map (for relevant types)

Char may be "extended" for Unicode

Page 307: QS Essentials


Converting string to ustring: manual control

●Transformer, Modify, etc.
  – string/ustring conversion will happen automatically, taking the current map from context (job level or stage = operator level)
  – Fine control via explicit conversion functions

Conversions may use a specific map name

Page 308: QS Essentials


NLS Implementation using Investigate stage

Job Design from Lab

Job-level NLS map

Page 309: QS Essentials


Investigation results for Japanese city column

Input data

Client machine with codepage set to JPN

Output report data

Client machine with codepage set to JPN

Page 310: QS Essentials


Unit summary

Having completed this unit, you should be able to:
●Build a QualityStage investigation job for non-English data
●View correctly-formatted results in DataStage/QualityStage data viewer

Page 311: QS Essentials


Lab 19: NLS

●Build investigation job for city using Japanese data

Page 312: QS Essentials


Address Verification Interface Stage

Page 313: QS Essentials


Objectives

After completing this module you will be able to:
●Build jobs using the AV stage to parse and verify address data

Page 314: QS Essentials


AVI Stage

●Provides
  – Transliteration (e.g. Japanese to Latin)
  – Parsing
  – Address validation
●WAVES equivalent
●Does not provide postal certification discounts
  – Use CASS (US), DPID (Australia), or SERP (Canada) if certification is desired
●Supports real-time

Page 315: QS Essentials


Components

●AV stage
●Reference data
  – 16 Geographies
  – Purchased via Passport system
●API libraries
  – Address Doctor

Page 316: QS Essentials


Reference Data

●Required for validation function only
●Requires annual license agreement
●Location pointed to by AV stage
●Some databases are memory intensive
●Load options
  – Partial preload
    • Indexes loaded to memory
  – Full preload
    • Data loaded to memory
    • Fast access but must have adequate memory
  – No preload
    • Data accessed from disk, slowest method

Page 317: QS Essentials


Job components

Optional error file

AVI stageInput address data

Page 318: QS Essentials


Stage properties

Reference data location

Function

Navigation

Page 319: QS Essentials


Mapping input columns to address elements

Page 320: QS Essentials


Transliterate mode

●Map input columns to address elements
  – Multiple input columns can be mapped to one address element
●Options offer the choice to increase the number of address lines

Page 321: QS Essentials


Map columns to output link

Page 322: QS Essentials


Parsing mode

● Input sample

●Output sample

Page 323: QS Essentials


Validation mode

●Uses reference data from a database
●Map input columns to address elements
●Can activate error link
●Creates validation summary report

●Sample output (only two of the validation columns shown)

Page 324: QS Essentials


Validation mode statuses

●Part of output record
●Document actions taken by AV stage
●Short code
●Verbose code
●Example

Page 325: QS Essentials


Validation summary report sample (USPREP)

Validation Summary Report
Company Name:
List Identifier:
Processing Date (yyyy/mm/dd): 2009/02/25
Total Number Of Records Processed: 2843

Passed:           2843  100.00%
Failed:              0    0.00%
Validated:        2233   78.54%
Corrected:         415   14.60%
Has Suggestion:    195    6.86%
PostCode Failed:    70    2.46%
City Failed:        37    1.30%
Street Failed:     274    9.64%
Country Failed:      0    0.00%
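The percentages in this sample report are each count divided by the total number of records processed. The following check recomputes them from the counts on the slide; the names `TOTAL`, `COUNTS`, and `percent` are hypothetical, and the code says nothing about how the AV stage itself produces the report.

```python
TOTAL = 2843  # Total Number Of Records Processed

COUNTS = {
    "Validated": 2233,
    "Corrected": 415,
    "Has Suggestion": 195,
    "PostCode Failed": 70,
    "City Failed": 37,
    "Street Failed": 274,
}


def percent(count: int, total: int) -> float:
    """Share of the total, as a percentage rounded to two decimals."""
    return round(100 * count / total, 2)


for label, count in COUNTS.items():
    print(f"{label}: {count} {percent(count, TOTAL):.2f}%")

# In this sample the three outcome rows account for every passed record.
outcomes = COUNTS["Validated"] + COUNTS["Corrected"] + COUNTS["Has Suggestion"]
print(outcomes == TOTAL)  # True
```

The PostCode, City, and Street failure rows are counted separately from the three outcome rows, so their percentages are not expected to add up to 100.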

Page 326: QS Essentials


Unit summary

Having completed this unit, you should be able to:
●Build jobs using the AV stage to parse and verify address data

Page 327: QS Essentials


Lab 20: AV Stage

●Build AV job to parse Japanese address data
●Review prebuilt job that validated USPREP data from earlier lab