QS Essentials

© Copyright IBM Corporation 2009
Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
4.0.3

QualityStage 8 Essentials

DX741


Copyright, Disclaimer of Warranties and Limitation of Liability

© Copyright IBM Corporation February 2007

IBM Software Group
One Rogers Street
Cambridge, MA 02142

All rights reserved. Printed in the United States.

IBM and the IBM logo are registered trademarks of International Business Machines Corporation.

The following are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both:

AnswersOnLine DynamicServer, WorkgroupEdition RedBrick Decision ServerAIX Enterprise Storage Server RedBrickMineBuilderAPPN FFST/2 RedBrickDecisionscapeAS/400 Foundation.2000 RedBrickReadyBookMaster Illustra RedBrickSystemsC-ISAM Informix RelyonRedBrickClient SDK Informix4GL S/390Cloudscape InformixExtendedParallelServer SequentConnection Services InformixInternet Foundation.2000 SPDatabase Architecture Informix RedBrick Decision Server System ViewDataBlade J/Foundation TivoliDataJoiner MaxConnect TMEDataPropagator MVS UniDataDB2 MVS/ESA UniData&DesignDB2 Connect Net.Data UniversalDataWarehouseBlueprintDB2 Extenders NUMA-Q UniversalDatabaseComponentsDB2 Universal Database ON-Bar UniversalWebConnectDistributed Database OnLineDynamicServer UniVerseDistributed Relational OS/2 VirtualTableInterfaceDPI OS/2 WARP VisionaryDRDA OS/390 VisualAgeDynamicScalableArchitecture OS/400 WebIntegrationSuiteDynamicServer PTX WebSphereDynamicServer.2000 QBIC WebSphere DataStageDynamicServer with Advanced DecisionSupportOption QMFDynamicServer with Extended ParallelOption RAMACDynamicServer with UniversalDataOption RedBrickDesignDynamicServer with WebIntegrationOption RedBrickDataMine

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Java, JDBC, and all Java-based trademarks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

All other product or brand names may be trademarks of their respective companies.

All information contained in this document has not been submitted to any formal IBM test and is distributed on an “as is” basis without any warranty either express or implied. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer’s ability to evaluate and integrate them into the customer’s operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will result elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk. The original repository material for this course has been certified as being Year 2000 compliant.

This document may not be reproduced in whole or in part without the prior written permission of IBM.

Note to U.S. Government Users – Documentation related to restricted rights – Use, duplication, or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corp.


Course Contents

Topic

Data Quality Issues
QualityStage 8 Architecture
Developing with QualityStage
Investigation
Standardize
Match
Survivorship
Special Topics
Globalization (NLS)
Address Verification Stage


Course contents

● Data quality issues
● Information Server purpose and architecture
● Introduction to DataStage and QualityStage
● Investigation
● Standardization
● Match
● Survivorship
● Special Topics
  – Data quality methodology
  – QualityStage Migration Tool


Data Quality Issues


Unit objectives

● After completing this unit, you should be able to:
  – List the five common data quality contaminants
  – Describe each of the following processes:
    • Investigation
    • Standardization
    • Match
    • Survivorship


Data quality challenges

● Different or inconsistent standards in structure, format or values
● Missing data, default values
● Spelling errors, data in wrong fields
● Buried information
● Data anomalies


Data quality – why do we care?

● Accurate reports
● Accurate information for support operations
● Support development of applications that go beyond the original scope for which the data was designed
  – Master Data Management
  – Data Warehouse
  – Analytical applications


Example - Master Data Management

[Diagram: Source 1, Source 2, and Source 3 pass through Align, Harmonize, and Consolidate steps to produce a consolidated customer view.]


Different or inconsistent standards

Name Field               Location

MARC DILORENZO ESQ       BOSTON
MRS DENNIS MARIO         HARTFORD
MR & MRS T. ROBERTS      CHICAGO

DILORENZO, MARK          6793
MARIO, DENISE            0215
ROBERTS, TOM & MARY      8721

MARK DI LORENZO          MA93
DENIS E. MARIO           CT15
TOM & MARY ROBERTS       IL21

Sources: Source 1, Source 2, Source 3 (one group of records per source)


Missing data & default values

Do the field values match the meta data labels?

NAME SOC. SEC. # TELEPHONE

Denise Mario DBA

Marc Di Lorenzo ETAL

Tom & Mary Roberts

First Natl Provident

Astorial Fedrl Savings

Kevin Cooke, Receiver

John Doe Trustee for K

228-02-1975

999999999

025-37-1888

34-2671434

101010101

LN#12-756

18-7534216

111111111

6173380300

3380321

415-392-2000

508-466-1200

212-235-1000

FAX 528-9825

5436


Robert A. Jones TTE Robert Jones Jr. First Natl Provident FBO Elaine & Michael Lincoln UTADTD 3-30-89 59 Via Hermosa c/o Colleen Mailer Esq Seattle, WA 98101-2345

Legacy Meta Desc. Legacy Record Values

Buried information

NAME 1

ADDRESS 1

ADDRESS 2

ADDRESS 3

ADDRESS 4

ADDRESS 5


The anomalies nightmare

CUSNUM    NAME                    ADDRESS                          SALES $
90328574  IBM                     187 N.Pk. Str. Salem NH 01456     8,494.00
90328575  I.B.M. Inc.             187 N.Pk. St. Sarem NH 01456      3,432.00
90238495  International Bus. M.   187 No. Park St Salem NH 04156    2,243.00
90233479  Int. Bus. Machines      187 Park Ave Salem NH 04156       5,900.00
90233489  Inter-Nation Consults   15 Main St. Andover MA 02341      6,800.00
90234889  Int. Bus. Consultants   PO Box 9 Boston MA 02210         10,243.00
90345672  I.B. Manufacturing      Park Blvd. Boston MA 04106       15,999.00

Callouts: Spelling errors, Anomalies, Lack of standards, No common key


Acct # Name Address City State Zip Note

5154155 Peter J. Lalonde 40 Beacon St. Melrose, Mass 02176 ODP

5152335 LaLonde, Peter 76 George 617-210-0824 Boston YES MA 02111

5146261 Lalonde, Sofie 40 Bacon Street Melrose MA CHK ID

87121 Pete & Soph Lalond 76 George Road Boston MASS FR Alert

87458 P. Lalonde FBO S.Lalonde40 Becon Rd. Melrose MA 02 176

What data challenges do you face?

•No consistent naming convention

•Business terms and spillover text

•Missing values or data in the wrong fields

•Buried information

•Misspelling

•No unique key linking records together


Why investigate?

● Discover potential anomalies in the data
● Examine single domain and free-form fields
● Identify invalid and default values
● Reveal undocumented business rules
● Verify the reliability of the data in the fields to be used as matching criteria
● Gain complete understanding of data


Investigate – single domain report

• Single domainField % of Total

Freq. Count

Sample source data

Frequency


Investigate – word pattern report

• Freeform text (Word)Field

% of Total

Pattern

Sample source data

Frequency


What is standardize?

● Applying business logic to data chaos.
  – Pattern manipulation
● Enforcing business standards on data elements.
  – Standards definition
● Transforming the input to an output which meets the business requirement.
  – Field structuring


How to standardize

● Parse specific data fields into smaller, atomic data elements
  – Atomic data elements are called tokens
  – Categorize identified elements
    • Separate Name, Address, and Area from freeform Name & Address lines
    • Identification of Distinct Material Categories (e.g. Sutures vs. Orthopedic Equipment)
● Refine data elements
    • Example 1
      – Name = ‘DR PAUL E JONES’ becomes:
        > Title = ‘DR’
        > First Name = ‘PAUL’
        > Middle Name = ‘E’
        > Last Name = ‘JONES’
    • Example 2
      – Part Description = ‘BLK LATEX GLOVE’ becomes:
        > Color = ‘BLACK’
        > Type = ‘LATEX’
        > Part = ‘GLOVE’


Why standardize?

● Normalize values in data fields to standard values
  – Transform First Name = ‘MIKE’ to ‘MICHAEL’
  – Transform Title = ‘Doctor’ to ‘Dr’
  – Transform Address = ‘ST. Michael Street’ to ‘Saint Michael St.’
  – Transform Color = ‘BLK’ to ‘BLACK’
● Apply phonetic coding to key words to facilitate record linkage
  – NYSIIS
  – Soundex
  – Typically applied to Name fields (first, last, street, city)
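
The two ideas on this slide can be sketched in a few lines of Python. This is only an illustration, not QualityStage's implementation: the lookup entries mirror the slide's examples, the function names are hypothetical, and the phonetic code shown is standard American Soundex (NYSIIS is omitted).

```python
# Hypothetical lookup table built from the slide's normalization examples.
STANDARD_VALUES = {"MIKE": "MICHAEL", "BLK": "BLACK"}


def normalize(value: str) -> str:
    """Return the standard form of a value, or the upper-cased value if unknown."""
    key = value.strip().upper()
    return STANDARD_VALUES.get(key, key)


def soundex(name: str) -> str:
    """Compute the four-character American Soundex code of a name."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break a run of identical codes
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets the run, so a repeated code counts again
    return (result + "000")[:4]


if __name__ == "__main__":
    print(normalize("Mike"), soundex("Robert"), soundex("Rupert"))
```

Names that sound alike, such as ROBERT and RUPERT, receive the same code, which is why phonetic keys are useful when grouping candidate records for matching.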


QualityStage standardize

● Uses a highly flexible pattern recognition language
● Can employ field or domain specific standardization (i.e. unique rules for names vs. addresses vs. dates, etc.)
● Contains customizable classification and standardization tables
● Utilizes results from data investigation


QualityStage standardize report example

[Report screenshot; callouts: Ind./Org. flag, Original data]


Match

“Conditioned data and QualityStage’s matching engine link the previously unlinkable.”

● Match Construction:
  – Reliability of input data defines a match result.
● Statistical Analysis & Match Scoring:
  – Linkage probability determined on a sliding scale by field level comparison.
● Report Generation:
  – All business rules applied have easy to understand report structure.


What is match?

● Identifying all records on one file that correspond to similar records on another file
● Identifying duplicate records in one file
● Building relationships between records in multiple files
● Performing statistical and probabilistic matching
● Calculating a score based on the probability of a match


Why match?

● Identify duplicate entities within one or more files
● Perform householding
● Create consolidated view of customer
● Establish cross-reference linkage


How to match

● Single file (Unduplication) or two file (Reference)
● Different match comparisons for different types of data (e.g. exact character, uncertainty/fuzzy match, keystroke errors, multiple word comparison)
● Generation of composite weights from multiple fields
● Use of probabilistic or statistical algorithms
● Application of match cutoffs or thresholds to identify automatic and clerical match levels
● Incorporation of override weights to assess particular data conditions (e.g. default values, discriminatory elements)


QualityStage match

● A wide variety of match comparison algorithms providing a full spectrum of fuzzy matching functions
● Statistically-based method for determining matches (Probabilistic Record Linkage Theory)
● Field-by-field comparisons for agreement or disagreement
● Assignment of weights or penalties
● Overrides for unique data conditions
● Score results to determine the probability of matched records
● Thresholds for final match determination
● Ability to measure informational content of data
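
To make the weight-and-threshold idea concrete, here is a minimal Python sketch of textbook probabilistic record linkage scoring. It is an assumption-laden illustration, not QualityStage's matching engine: the m and u probabilities, the cutoff values, the exact-equality comparison, and all function names are hypothetical.

```python
import math


def field_weight(agrees: bool, m: float, u: float) -> float:
    """Agreement weight log2(m/u) or disagreement penalty log2((1-m)/(1-u)).

    m = probability the field agrees when the records truly match,
    u = probability the field agrees when they do not.
    """
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def composite_weight(rec_a: dict, rec_b: dict, probabilities: dict) -> float:
    """Sum the field weights; here a field 'agrees' only on exact equality."""
    total = 0.0
    for field, (m, u) in probabilities.items():
        total += field_weight(rec_a.get(field) == rec_b.get(field), m, u)
    return total


def classify(weight: float, match_cutoff: float, clerical_cutoff: float) -> str:
    """Apply the two cutoffs: automatic match, clerical review, or non-match."""
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "nonmatch"


if __name__ == "__main__":
    # Hypothetical (m, u) pairs per field.
    probs = {"last_name": (0.9, 0.1), "zip": (0.9, 0.1)}
    a = {"last_name": "LALONDE", "zip": "02176"}
    b = {"last_name": "LALONDE", "zip": "02111"}
    score = composite_weight(a, b, probs)
    print(score, classify(score, match_cutoff=5.0, clerical_cutoff=-1.0))
```

A field that rarely agrees by chance (small u) earns a large agreement weight, which is one way to read "informational content of data".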


QualityStage match examples


What is survive?

● Creation of best-of-breed “surviving” data based on record or field level information
● Development of cross-reference file of related keys
● Creating output formats:
  – Relational table with primary and foreign keys
  – Transactions to update databases
  – Cross-reference files


Why survive?

● Provide consolidated view of data
● Provide consolidated view containing the “best-of-breed” data
● Resolve conflicting values and fill missing values
● Cross-populate best available data
● Implement business rules
● Create cross-reference keys


How to survive

● Highly flexible rules
● Record or field level survivorship decisions
● Rules can be based upon data frequency, data recency (i.e. date), data source, value presence or length
● Rules can incorporate multiple tests
● QualityStage features
  – Point-and-click (GUI-based) creation of business rules to determine best-of-breed “surviving” data
  – Performed at record or field level


QualityStage survive examples

Example 1: The longest populated Middle and Last Name

            First Name   Middle Name   Last Name
Matched     MARI                       LEMELSON-LAPPNER
Matched     MARI         S             LEMELSON
Survived    MARI         S             LEMELSON-LAPPNER

Example 2: The longest populated Middle Name, Date of Birth, and SSN

            First Name   Middle Name   Last Name   DOB        SSN
Matched     DENISE                     TRIANO      19580211   98524173
Matched     DENISE       F             TRIANO
Survived    DENISE       F             TRIANO      19580211   98524173
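
The "longest populated value" rule in these examples can be sketched as a field-level survivorship function. This is a hypothetical Python illustration of the rule only; in QualityStage such rules are defined in the Survive stage, and the function and field names below are assumptions.

```python
def survive_longest(records: list, fields: list) -> dict:
    """Build one surviving record by taking, per field, the longest populated value.

    Missing or blank values count as length zero; ties keep the first value seen.
    """
    survived = {}
    for field in fields:
        values = [(record.get(field) or "").strip() for record in records]
        survived[field] = max(values, key=len, default="")
    return survived


if __name__ == "__main__":
    matched = [
        {"first": "MARI", "middle": "", "last": "LEMELSON-LAPPNER"},
        {"first": "MARI", "middle": "S", "last": "LEMELSON"},
    ]
    print(survive_longest(matched, ["first", "middle", "last"]))
```

Because each field is decided independently, the surviving record can combine values that never appeared together in any single matched record.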


Course lab project design

Policy

Investigate
Assess Data Quality

Standardize Country

Investigate Conditioned Results

Apply User Overrides

Identify Duplicate Customer Records

Survive the Best Customer Record

Condition Name, Address and Area

Select US Data for further processing


Checkpoint

1. (T/F) Data quality investigation cleans the source data.
2. (T/F) Standardization modifies the source data so that it can be loaded into the target system.
3. (T/F) Survivorship data can be either record based or field based.


Checkpoint solutions

1. (T/F) Data quality investigation cleans the source data.
   Answer: False

2. (T/F) Standardization modifies the source data so that it can be loaded into the target system.
   Answer: False

3. (T/F) Survivorship data can be either record based or field based.
   Answer: True


Unit summary

Having completed this unit, you should be able to:
● List the five common data quality contaminants
  – Different standards
  – Missing and default values
  – Spillover and buried information
  – Anomalies
  – No consolidated view
● Describe each of the following processes:
  – Investigation
  – Standardization
  – Match
  – Survivorship


Lab 1: Review course project

● Course business case: WINN Insurance CRM project
● See QualityStage Essentials Exercises


Lab 2: Copy student files

● Copy student files to disk
  – Use C: drive as root for folder


QualityStage 8 Architecture


Unit objectives

● After completing this unit, you should be able to:
  – Describe the Data Quality architecture
  – Identify server and client components


Information Server conceptual architecture

Metadata Server & Repository

DataStage
QualityStage

Information Director
Information Analyzer

Information Server

Metadata Access Services

Other Services
Client logon access
Logging
Security


QualityStage technical overview

● Uses DataStage (parallel version)
  – DataStage design environment
  – Parallel execution engine
  – Stages are native enterprise operators
  – Match designer is embedded in DataStage Designer Client
  – Get DataStage data connectivity by default
    • No need for meta brokers, plug-ins
    • Common meta data
● Legacy (pre-version 8) QS job execution
  – Migration utility available to aid conversion from QS 7.x to QS 8
  – Converted jobs can be compiled and executed in the QS 8 environment


DataStage/QualityStage physical architecture

Clients

DataStage/QualityStage

UNIX

Windows

Via TCP/IP

Information Server

Windows

Designer
Director
Administrator

Connect to projects

Projects


DataStage clients

● Administrator
  – Add and delete projects
  – Set project defaults
  – Set project environment parameters
● Designer
  – Maintain data definitions
  – Add, modify, and delete jobs
  – Add, modify, and delete match specifications
  – Manage rule sets
  – Compile jobs
  – Run jobs
  – Provision rule sets and match specifications
● Director
  – Run jobs
  – Review job log
  – Schedule jobs


DataStage Administrator

● Administrator
  – Create or delete projects
  – Set project defaults
  – Apply security

Project list


Project property defaults


DataStage Designer

● Designer
  – Client GUI for designing jobs
    • Windows 2000+, XP
    • Build meta data
    • Build Jobs
    • Modify Standardization Rules
    • Build match specifications
  – Designer Repository
    • Database

Sample QualityStage job as viewed in Designer


Designer canvas, repository, and palette


DataStage Director

● Director
  – Client GUI for managing job execution
  – Windows 2000+, XP
  – Run jobs: set job options and parameters
  – View job log
  – Schedule job execution


Job log viewed in Director


Checkpoint

1. (T/F) DataStage Administrator executes jobs.
2. (T/F) DataStage Designer configures projects.
3. Which DataStage component displays objects in the designer database?



Checkpoint solutions

1. (T/F) DataStage Administrator executes jobs.
   Answer: False

2. (T/F) DataStage Designer configures projects.
   Answer: False

3. Which DataStage component displays objects in the designer database?
   Answer: The repository view


Unit summary

Having completed this unit, you should be able to:
  – Describe the Data Quality architecture
  – Identify server and client components


Lab 3: Configure QualityStage project

● Create a project using Administrator (if necessary)
● Set project properties
  – General defaults
  – Environment variables
  – Security groups and roles


Developing with QualityStage


Unit objectives

● After completing this unit, you should be able to:
  – Import meta data
  – Build DataStage/QualityStage Jobs
  – Run jobs
  – Review results


DataStage/QualityStage project

● Components
  – Jobs
  – Stages within jobs
  – Table Definitions
  – The Designer repository view shows project components


Job definition

● A job is an executable DataStage/QualityStage program
● Created by job compilation
● Jobs can be run in batch or in real time


Job development overview

● Designer client
  – Import or enter file meta data defining your sources and targets
  – Add stages and links defining the process
  – Compile the job
  – Run the job (Designer or Director)
  – Review data results
● Server
  – Runs the job
  – View job log


Log onto project in Designer or Director

User name and Password controlled by Information Server

List of valid projects


Designer repository components

● Database which stores
  – Data file definitions
  – Job designs
  – Standardization rules
  – Data connection objects


Project structure

Repository view

In Designer


DataStage/QualityStage design environment

Stages

Data definitions


Data definitions

● Entered or loaded via DataStage import mechanisms
  – Sequential file
  – ODBC
  – Native database connection
● New and redefined columns can be added on the data flow via Transformer stage


Data Quality folder

● Stages are the building blocks
● Focused in function
● All phases of data quality:
  – Investigate
  – Standardize
  – Match Frequency
  – Match
    • Unduplicate Match
    • Reference Match
  – Survive
  – International postal
    • MNS
  – Optional
    • Address Verification


Standardization rule sets

● Pre-defined rules for parsing and standardizing:
  – Name
  – Address
  – Area (City, State and Zip)
● Multi-national address processing
● Validate structure:
  – Tax ID
  – US Phone
  – Date
  – Email
● Append ISO country codes
● Rule sets are stored in the repository and provisioned to the job execution area

Rule set for USNAME


Rule set components

● Can modify some rule set components
● Test rule sets
● Copy rule sets


Match Specifications in the DataStage Repository

●Created using the Match Designer

●Allows online testing of match criteria


Executing a job via Director

Director

Server

Executes the job

Click run button

Set run options

Execute job

View job log

View job monitor


Running a job in Director

● Director
  – Client GUI for running jobs
    • Windows 2000+, XP
    • View job logs and monitor
    • Job scheduling

Job status view


Execution environment

Data Quality Job Log


Job Monitor statistics


Job development process

● Import meta data
● Define job
  – Draw stages and links
  – Set stage properties
  – Compile
● Run the job
● Review results


Checkpoint

1. (T/F) The job monitor displays link statistics.
2. (T/F) The job log is viewed in DataStage Designer.
3. What protocol is used for communication between the DataStage clients and server?



Checkpoint solutions

1. (T/F) The job monitor displays link statistics.
   Answer: True

2. (T/F) The job log is viewed in DataStage Designer.
   Answer: False

3. What protocol is used for communication between the DataStage clients and server?
   Answer: TCP/IP


Unit summary

Having completed this unit, you should be able to:
  – Import meta data
  – Build DataStage/QualityStage Jobs
  – Run jobs
  – Review results


Lab 4: Import meta data

● DataStage import mechanisms
  – DataStage components
    • Any object built in DataStage, such as jobs, table definitions, match specifications


Lab 5: Build and run DataStage job

● Read sequential file
  – Must use format tab to handle nulls


Investigation


Unit objectives

● After completing this unit, you should be able to:
  – Build Investigate jobs
  – Use character discrete, concatenate, and word investigations to analyze data fields
  – Review results


Investigation

● Verify the domain
  – Review each field of interest and verify the data matches the meta data
● Identify data formats, missing and default values
● Identify data anomalies
  – Format
  – Structure
  – Content
● Discover “unwritten” business rules
● Identify data preparation requirements


Investigate stage

● Features
  – Analyze free-form and single domain fields
  – Provide frequency distributions of distinct values and patterns
● Investigate methods
  – Character Discrete
  – Character Concatenate
  – Word


Investigate methods

Method                  Why
Character Discrete      Analyzing field values, formats, and domains
Character Concatenate   Cross-field correlation, checking logic relationships between fields
Word Investigation      Identifying free-form fields that may require parsing and discovery of key words for classification


Investigate terminology

Field Masks   Options that represent the data. Options: Character (C), Type (T), Skipped (X)
Tokens        Individual units of data

Character Mask   Usage
C                For viewing the actual character values of the data
T                For viewing the pattern of the data
X                For ignoring characters


Field mask examples

Token            Mask             Result
02116            CCCCC            02116
02116            CCCXX            021
01832-4480       TTTTTTTTTT       nnnnn-nnnn
XJ2 6EM          TTTTTTT          aanbnaa
(617) 338-0300   CCCCCCCCCCCCCC   (617) 338-0300
617-338-0300     TTTTTTTTTTTT     nnn-nnn-nnnn
6173380300       CCCXXXXXXXXX     617
(617)3380300     CCCXXXXXXXXX     (61
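
The mask behavior in these examples can be reproduced with a short Python sketch. The mapping used for the T mask (digit to n, letter to a, blank to b, any other character kept as is) is inferred from the example results rather than from product documentation, and the function name is hypothetical.

```python
def apply_mask(token: str, mask: str) -> str:
    """Apply a character mask position by position.

    C keeps the character, T replaces it with its type symbol, X skips it.
    Positions beyond the shorter of token and mask are ignored.
    """
    out = []
    for ch, m in zip(token, mask):
        if m == "C":
            out.append(ch)
        elif m == "T":
            if ch.isdigit():
                out.append("n")   # numeric
            elif ch.isalpha():
                out.append("a")   # alphabetic
            elif ch == " ":
                out.append("b")   # blank
            else:
                out.append(ch)    # special characters appear as themselves
        elif m == "X":
            continue
        else:
            raise ValueError(f"unknown mask character: {m!r}")
    return "".join(out)


if __name__ == "__main__":
    for token, mask in [("02116", "CCCXX"), ("XJ2 6EM", "TTTTTTT"),
                        ("617-338-0300", "TTTTTTTTTTTT")]:
        print(token, mask, apply_mask(token, mask))
```

Counting the distinct results of `apply_mask` over a whole column gives the kind of frequency distribution a character investigation reports.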


Character discrete: field mask (C)haracter

● Usage: Domain quality
  – View the contents of each field to verify that the data values match the field labels
● Mechanism: Investigate stage
  – Generates reports for frequency


Character discrete - character results


Character discrete: field mask (T)ype

● Usage: Data formats (patterns):
  – View the format of fields that you suspect may conform to a specific format, e.g., dates, PIN, Tax ID, account numbers.
● Generates reports for frequency


Investigation Implementation


QualityStage Investigation job – character

Double Click


Investigation - Character

Select Column Add



Select mask








View investigation report


Character concatenate

● Identify Field Relationships
  – Investigate one or more fields to uncover any relationship between the field values.
  – Uses combinations of character masks
  – Generates reports for frequency


Character concatenate results

DOB and DOD Fields


Word investigate

● Usage: Free-form field pattern analysis
  – To view the pattern of the data within a freeform text field and parse it into individual tokens
● QualityStage process
  – Apply rule sets to free-form fields
  – Discover parsing requirements
  – Discover patterns in data
  – Generate reports for pattern frequency distributions and token report


Word investigation results

Pattern Report: how to use
  – Look at most frequently occurring patterns.
  – Use to estimate how much work is needed to modify a rule set for a customer.

Token Report: how to use
  – Review tokens with SME to verify tokens are properly classified.
  – Identify most frequently occurring unclassified tokens and add them to rule set.


Rule sets

● Rules for parsing, classifying, and organizing data
● Rule Set Domains
  – Country processing
  – Pre-processing
  – Domain Processing
    • Name: Business and Personal
    • Street Address
    • Area: Locality, City, State and Zip/Postal codes
  – Multinational Address Processing


Parsing

● Parse free-form data with the SEPLIST and a STRIPLIST
  – SEPLIST - Any character in the SEPLIST will separate tokens, and become a token itself
  – STRIPLIST - Any character in the STRIPLIST will be ignored in the resulting pattern
● The SEPLIST is always applied first


Parsing example

Example: 120 Main St. N.W.

SEPLIST “¬.” STRIPLIST “¬”
Token1 Token2 Token3 Token4 Token5 Token6 Token7 Token8
120    Main   St     .      N      .      W      .

SEPLIST “¬.” STRIPLIST “¬.”
Token1 Token2 Token3 Token4 Token5
120    Main   St     N      W

SEPLIST “¬” STRIPLIST “¬.”
Token1 Token2 Token3 Token4
120    Main   St     NW
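
The interaction of the two lists can be sketched in Python as follows. This is a simplified model of the behavior described on the parsing slide, not the product's parser; it assumes the “¬” symbol in the slide stands for a space, and the function name is hypothetical.

```python
def parse(text: str, seplist: str, striplist: str) -> list:
    """Split text into tokens using SEPLIST first, then drop STRIPLIST characters."""
    tokens = []
    current = ""
    # Step 1: every SEPLIST character ends the current token and is a token itself.
    for ch in text:
        if ch in seplist:
            if current:
                tokens.append(current)
                current = ""
            tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    # Step 2: remove STRIPLIST characters; tokens left empty disappear.
    stripped = ("".join(c for c in tok if c not in striplist) for tok in tokens)
    return [tok for tok in stripped if tok]


if __name__ == "__main__":
    address = "120 Main St. N.W."
    print(parse(address, seplist=" .", striplist=" "))
    print(parse(address, seplist=" .", striplist=" ."))
    print(parse(address, seplist=" ", striplist=" ."))
```

Leaving the period out of the STRIPLIST keeps it as a token, while leaving it out of the SEPLIST lets N.W. collapse into the single token NW.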


Data typing: classifying tokens

● Identify and type the token in terms of its business meaning and value

PATTERN KEY(USADDR rule set):

^ – Numeric token

? – Unclassified alpha token

@, <, > – Mixed Token

T – Street Type

U – Unit Type

120 Main Street Apt 6C

^ ? T U >
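
A small Python sketch shows how a pattern such as the one above could be derived from the pattern key. The classification entries are a hypothetical four-word stand-in for a real classification table, the rules for mixed tokens are an assumption based on the key, and the function names are illustrative only.

```python
import re

# Hypothetical classification entries: word -> class (T = street type, U = unit type).
CLASSIFICATION = {"STREET": "T", "ST": "T", "APT": "U", "APARTMENT": "U"}


def token_class(token: str) -> str:
    """Return the class of one token using the pattern key."""
    t = token.upper()
    if t in CLASSIFICATION:
        return CLASSIFICATION[t]
    if t.isdigit():
        return "^"                      # numeric token
    if t.isalpha():
        return "?"                      # unclassified alpha token
    if re.fullmatch(r"\d+[A-Z]+", t):
        return ">"                      # mixed token with leading numeric
    if re.fullmatch(r"[A-Z]+\d+", t):
        return "<"                      # mixed token with trailing numeric
    return "@"                          # any other mixed token


def pattern(text: str) -> str:
    """Classify each whitespace-separated token; merge runs of unclassified alphas."""
    classes = []
    for tok in text.split():
        cls = token_class(tok)
        if cls == "?" and classes and classes[-1] == "?":
            continue  # '?' stands for one or more consecutive unknown words
        classes.append(cls)
    return " ".join(classes)


if __name__ == "__main__":
    print(pattern("120 Main Street Apt 6C"))
```

Tokens that fall through to `?` in a frequency report are the candidates to review and add to the classification table.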


Example: word investigate

Input: 10 MAPLE STREET APARTMENT 222

1. Parse
2. Classify known words and assign default tags: ^ ? T U ^
3. Produce Reports based on Patterns & Tokens
   – Token report
   – Pattern report


Investigation - Word


Investigation - Word


Link ordering


Investigation – define output files


Sort output (optional)


Review word reports – patterns and tokens


Data quality assessment process

● Review and analyze each field for the following information:
  – How often is the field populated?
  – What are the anomalies and out-of-range values? How often does each one occur?
  – How many unique values were found?
  – What is the distribution of the data or patterns?
● Use Investigate results to:
  – Update project business requirements
  – Define development plan and application design
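
The review questions above map onto a few simple statistics per field. The following Python sketch computes them for one column of values; it is an illustration of the assessment, not output from the Investigate stage, and the function name and sample values are hypothetical.

```python
from collections import Counter


def profile_field(values: list) -> dict:
    """Summarize one field: population rate, unique values, frequency distribution."""
    total = len(values)
    populated = [v for v in values if v is not None and str(v).strip()]
    frequency = Counter(populated)
    return {
        "total": total,
        "populated_pct": 100.0 * len(populated) / total if total else 0.0,
        "unique": len(frequency),
        # Most frequent first; rare values at the end are candidates for anomalies.
        "frequency": frequency.most_common(),
    }


if __name__ == "__main__":
    states = ["MA", "MA", "NH", "", None, "Mass"]
    report = profile_field(states)
    print(report["populated_pct"], report["unique"], report["frequency"])
```

In this sample the value Mass stands out as a low-frequency variant of MA, the kind of finding that feeds the standardization requirements.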


Checkpoint

1. (T/F) Character discrete investigation examines a single domain.
2. (T/F) Word investigation examines a single domain.
3. Name the three character masks.



Checkpoint solutions

1. (T/F) Character discrete investigation examines a single domain.
   Answer: True

2. (T/F) Word investigation examines a single domain.
   Answer: False

3. Name the three character masks.
   Answer: C, T, and X


Unit summary

Having completed this unit, you should be able to:
  – Build Investigate jobs
  – Use character discrete, concatenate, and word investigations to analyze data fields
  – Review results


Lab 6: Build investigate jobs

● Character with C mask
● Character with T mask
● Character concatenate
● Word


Standardize


Unit objectives

● After completing this unit, you should be able to:
  – Describe the Standardize stage
  – Identify rule sets
  – Build jobs using the Standardize stage
  – Interpret standardization results
  – Investigate unhandled data and patterns


Standardize

● Transformation
  – Parsing free form fields
  – Comparison threshold for classifying like words
  – Bucketing data tokens
● Standardization
  – Applying standard values and standard formats
● Phonetic Coding for use in Matching
  – NYSIIS
  – Soundex


Standardize example

Input File:
Address Line 1               Address Line 2
1721 W ELFINDALE ST          UNIT 20
1721 W ELFINDALE ST          # 20
16200 VENTURA BOULEVARD      SUITE 201
C/O JOSEPH C REIFF           12 WESTERN AVE
1705 W St                    PHILADELPHIA
1655 PONCE DE LEON AVENUE    15TH FLOOR

Result File:
Columns: House #, Dir, Str. Name, Type, Unit Type, Unit Value, Floor Type, Floor Value
1721 W ELFINDALE ST UNIT 20
1721 W ELFINDALE ST 20
16200 VENTURA BLVD STE 201
12 WESTERN AVE
1705 W ST
1655 PONCE DE LEON AVE FLOOR 15


Standardize process

1. Parse
2. Classify & assign default tags: ^ ? T U ^
3. Process Patterns and Bucket Data

Output File:
House Number   Street Name   Street Type   Unit Type   Unit
10             MAPLE         ST            APT         222

Key:
^ = Single numeric
? = One or more unknown alphas
T = Street type
U = Unit type



Standardize stage

● Standardize Stage
  – Uses Rule sets for:
    • Country processing
    • Pre-domain processing
      – USPREP
    • Domain processing
      – USADDR
      – USAREA
      – USNAME
    • Multi-national Address
    • WAVES
    • Address Verification Interface (optional)


Types of rule sets

Country Identifier

COUNTRY

Domain Pre-processor

USPREP

Domain Specific: USNAME

Domain Specific: USADDR

Domain Specific: USAREA

Preparatory steps

Not always required


Example: country identifier

Input Record

100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
28 GROSVENOR STREET LONDON W1X 9FE
123 MAIN STREET

Output Record

US Y 100 SUMMER STREET 15TH FLOOR BOSTON, MA 02111
CA Y SITE 6 COMP 10 RR 8 STN MAIN MILLARVILLE AB T0L 1K0
GB Y 28 GROSVENOR STREET LONDON W1X 9FE
US N 123 MAIN STREET


Example: domain preprocessor

Input Record (mixed domain)

Field 1   JIM HARRIS (781) 322-2426
Field 2   92 DEVIR STREET MALDEN MA 02148

Output Record

Name Domain      JIM HARRIS
Address Domain   92 DEVIR STREET
Area Domain      MALDEN MA 02148
Other Domain     (781) 322-2426


Example: domain specific

Input Record

100 SUMMER STREET 15TH FLOOR

Output Record

House Number                     100
Street Name                      SUMMER
Street Suffix Type               ST
Floor Type                       FL
Floor Value                      15
Address Type                     S
NYSIIS of Street Name            SANAR
Reverse Soundex of Street Name   R520
Input Pattern                    ^+T>U


Rule sets

● Rule sets contain logic for:
  – Parsing
  – Classifying
  – Processing data by pattern and bucketing data
● Three required files
  – Classification Table
  – Dictionary File
  – Pattern Action File
● Optional files
  – Lookup tables
  – Override tables


Rule set files

Classification Table   Contains standard abbreviations that identify and classify key words.
Dictionary File        Defines the output file fields to store the parsed and conditioned data
Pattern Action File    Contains a series of patterns and programming commands to condition the data
Reference Tables       Optional conversion and lookup tables for converting and returning standardized values
Override Tables        Tables for storing overrides entered into the Designer GUI


Classification table

● Contains the words for classification, standardized versions of words, and data class
● Data class (data tag) is assigned to each data token
● Default classes are the same across all rule sets
● User-defined classes are assigned in the classification table
  – Users may modify, add or delete these classes
  – User-defined classes are a single letter


Default classes

Class      Description
^          A single numeric
+          A single unclassified alpha (word)
?          One or more consecutive unclassified alphas
@          Complex mixed token, e.g., C3PO
>          Leading numeric, e.g., 6A
<          Trailing numeric, e.g., A6
Zero (0)   Null class


User-defined classes

Rule set   Class   Description
USNAME     G       Generational, e.g., Senior, I, II
USNAME     P       Prefix, e.g. Dr., Mr., Miss
USADDR     T       Street Type
USADDR     D       Directional
USADDR     B       Box Type
USAREA     S       State Abbreviation


Classification table example

;-------------------------------------------------------------------------------
; USADDR Classification Table
;-------------------------------------------------------------------------------
; Classification Legend
;-------------------------------------------------------------------------------
; B - Box Types
; D - Directionals
; F - Floor Types
; H - Highway Modifiers
; R - Rural Route, Highway Contract, Star Route
; T - Street Types
; U - Unit Types
;-------------------------------------------------------------------------------
; Table Sort Order: 51-51 Ascending, 26-50 Ascending, 1-25 Ascending
;-------------------------------------------------------------------------------
DRAW "PO BOX" B
DRAWER "PO BOX" B
PO "PO BOX" B
POB "PO BOX" B
POBOX "PO BOX" B
POBX "PO BOX" B
PODRAWER "PO BOX" B

Columns, left to right: Token, Standard form, Classification


Comparison threshold

● May be used in the Classification table
● Used to efficiently make entries into the classification table
● Helps overcome spelling and data entry errors
● Not required
● Threshold uses a logical string comparator

Threshold level
900   Exact match
850   Almost certainly the same
800   Most likely equivalent
750   Most likely not the same
700   Almost certainly not the same


Classification table example with comparison threshold

;-------------------------------------------------------------------------------
; USADDR Classification Table
;-------------------------------------------------------------------------------
; Classification Legend
;-------------------------------------------------------------------------------
; B - Box Types
; D - Directionals
; F - Floor Types
; H - Highway Modifiers
; R - Rural Route, Highway Contract, Star Route
; T - Street Types
; U - Unit Types
;-------------------------------------------------------------------------------
; Table Sort Order: 51-51 Ascending, 26-50 Ascending, 1-25 Ascending
;-------------------------------------------------------------------------------
DRAW "PO BOX" B
DRAWER "PO BOX" B
…
NORTHEAST NE D 850
NORTHWEST NW D 850
NW NW D
S S D
SO S D
SOUTH S D
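A rough Python sketch of how a threshold entry can catch a misspelled token. The comparator QualityStage uses is not described here, so `difflib.SequenceMatcher` from the Python standard library is swapped in as a stand-in and scaled to 0-1000; its scores will not equal the product's. The names `TABLE`, `similarity` and `classify` are illustrative only.

```python
from difflib import SequenceMatcher

# (token, standard form, class, threshold or None), taken from the example table
TABLE = [
    ("NORTHEAST", "NE", "D", 850),
    ("NORTHWEST", "NW", "D", 850),
    ("NW", "NW", "D", None),
    ("SOUTH", "S", "D", None),
]

def similarity(a: str, b: str) -> int:
    """Stand-in string comparator scaled to 0..1000 (not the product's)."""
    return round(1000 * SequenceMatcher(None, a, b).ratio())

def classify(token: str):
    """Return (standard form, class) for a token, or None if unclassified."""
    token = token.upper()
    for entry, standard, cls, _ in TABLE:
        if token == entry:                    # exact entries always win
            return standard, cls
    best = None
    for entry, standard, cls, threshold in TABLE:
        if threshold is None:                 # entry has no threshold: exact only
            continue
        score = similarity(token, entry)
        if score >= threshold and (best is None or score > best[0]):
            best = (score, standard, cls)
    return (best[1], best[2]) if best else None

print(classify("NORTHEAZT"))   # misspelling close to NORTHEAST
print(classify("XYZ"))         # nothing close enough
```

One table entry with a threshold therefore stands in for many misspelled variants that would otherwise need their own rows.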


Dictionary file

●Defines the field definitions for the output file
●When data is moved to these output fields it is called “bucketing” the data
●The order that the fields are listed in the dictionary file defines the order the fields appear in the output file
●Dictionary file entries are similar to field definitions


Dictionary file example

;;QualityStage v8.0
\FORMAT\ SORT=N
;------------------------------------------------------------------------------
; USADDR Dictionary File
;------------------------------------------------------------------------------
; Total Dictionary Length = 411
;------------------------------------------------------------------------------
; Business Intelligence Fields
;------------------------------------------------------------------------------
HouseNumber C 10 S HouseNumber ;0001-0010
HouseNumberSuffix C 10 S HouseNumberSuffix ;0011-0020
StreetPrefixDirectional C 3 S StreetPrefixDirectional ;0021-0023
StreetPrefixType C 20 S StreetPrefixType ;0024-0043
StreetName C 25 S StreetName ;0044-0068
StreetSuffixType C 5 S StreetSuffixType ;0069-0073
StreetSuffixQualifier C 5 S StreetSuffixQualifier ;0074-0078
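Because field order defines the output layout, the positions shown in the trailing comments of the example (0001-0010, 0011-0020, and so on) follow directly from the field lengths. A minimal Python sketch, assuming each entry begins with a name, a type and a length as in the example; `layout` is an illustrative helper, not a QualityStage function.

```python
ENTRIES = """\
HouseNumber C 10 S HouseNumber
HouseNumberSuffix C 10 S HouseNumberSuffix
StreetPrefixDirectional C 3 S StreetPrefixDirectional
StreetPrefixType C 20 S StreetPrefixType
StreetName C 25 S StreetName
StreetSuffixType C 5 S StreetSuffixType
StreetSuffixQualifier C 5 S StreetSuffixQualifier
"""

def layout(text: str):
    """Return (field name, start, end) with 1-based inclusive positions."""
    fields, start = [], 1
    for line in text.splitlines():
        line = line.split(";", 1)[0].strip()   # drop trailing comments
        if not line:
            continue
        name, _type, length = line.split()[:3]
        end = start + int(length) - 1
        fields.append((name, start, end))
        start = end + 1                        # next field follows immediately
    return fields

for name, start, end in layout(ENTRIES):
    print(f"{name:<25} {start:04d}-{end:04d}")
```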


Pattern-Action file

●Contains the rules for standardization; that is, the actions to execute with a given pattern of tokens
●Records are processed from the top down
●Written in Pattern Action Language (PAL)
●Complex parsing can be coded in this file


Pattern Action file process

Street Address: 10 Hollow Oak Road
Pattern: ^ ? T

Pattern Action Language:
COPY [1] {HN}
COPY_S [2] {SN}
COPY_A [3] {ST}

Result: {HN} {SN} {ST} = 10 Hollow Oak Rd
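The example above can be mimicked in Python to show what one pattern and its three actions accomplish. This is a sketch of the idea rather than PAL itself: the `STREET_TYPES` table and the `standardize_street` function are hypothetical, only the single pattern ^ ? T is handled, and the comments map each output field to the action it imitates.

```python
# Hypothetical mini classification table: token -> standard abbreviation (class T)
STREET_TYPES = {"ROAD": "Rd", "RD": "Rd", "STREET": "St", "ST": "St"}

def standardize_street(address: str):
    """Apply the single pattern ^ ? T; return None if the input does not fit."""
    words = address.split()
    if len(words) < 3:
        return None
    house, name_words, street_type = words[0], words[1:-1], words[-1]
    fits = (
        house.isdigit()                                        # ^
        and all(w.isalpha() and w.upper() not in STREET_TYPES
                for w in name_words)                           # ?
        and street_type.upper() in STREET_TYPES                # T
    )
    if not fits:
        return None
    return {
        "HN": house,                              # COPY   [1] {HN}
        "SN": " ".join(name_words),               # COPY_S [2] {SN}, spaces kept
        "ST": STREET_TYPES[street_type.upper()],  # COPY_A [3] {ST}, abbreviation
    }

result = standardize_street("10 Hollow Oak Road")
print(result)
```

An input that does not fit the pattern returns None here; in a real rule set it would fall through to later patterns or end up as unhandled data.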


Optional lookup tables

●Called from the Pattern Action File
●Rule sets may contain lookup tables such as:
– Common First Names and Enhanced First Names
• Barb & Barbara
• Ted & Edward
– Gender based on name
– State abbreviations
– Common city abbreviations
• NYC = New York City
• LA = Los Angeles


Standardize process

Input: 10 MAPLE ST APT 222
●Parse
●Classify and assign default tags (Classification Table): ^ ? T U ^
●Process patterns and bucket data (Pattern Action File, Dictionary File)
Output fields: House Number, Street Name, Street Type, Unit Type, Unit Number


Standardizing international data

●Two methods
– Method 1: Use country pre-processor, domain pre-processor, and domain-specific rules
• Uses out-of-the-box, included functionality/rules
– Method 2: Use Multinational Standardize, WAVES, or AVI
• WAVES requires purchase of WAVES database
• AVI requires purchase of database for address validation












Country rule set

●Country rule set appends the two-byte ISO country code
●Input to the country rule set includes:
– Default country code (designated by ZQ…default value…ZQ)
– Street Address
– City or locality
– State
– Zip or Postal code
– Country field (if it exists)
●Output:
– Two-byte ISO country code
– Flag identifying explicit or default decision


Standardization implementation


Standardization jobs

Country Rule Set, then USPREP Rule Set, then Domain-specific Rule Sets


Standardization – US Name, Address, Area


Standardization


Standardization – mapping columns












Selecting US data

●The DataStage Filter Stage provides the capability of selecting and/or rejecting records based on a set of values for a field

●Selecting or splitting data requiring compound or complex logic may require Transformer stage


Course exercise project design

Policy









Domain pre-processor rule sets

●Pre-processor rule sets are designed to filter name, street address and area (city, state, zip) data
– For example, if the city, state and zip are found in ADDRESS LINE 2, the pre-processor rule set will attempt to recognize this data and move it into the area domain
●The pre-processor rule set prepares the data for processing by domain-specific rule sets


Domain rule sets

●Domain rule sets expect only data for that domain as the input
●Domain rule sets that come with QualityStage are:
– Name
– Street address
– Area (city, state and zip)


USNAME rule set

●The USNAME rule set works on both personal names and organization names for US data

●Data is parsed into name components
●Phonetic codings of the First Name and Primary Name are created for matching


USADDR rule set

●This rule set is applied to street address fields
●The “Address Type” flag identifies different types of addresses
– ‘S’ Street address
– ‘B’ Box address
– ‘R’ Rural route address
●Phonetic coding of the Street Name is created for matching


USAREA rule set

●This rule set is applied to city, state and postal code fields
●Data is parsed into city name, state abbreviation, zip code and zip plus four
●Phonetic coding of the city name is created for matching


Standardize results

●Business Intelligence fields
– Parsed from the original data, they may be used in matching and generally they are moved to the target system
●Matching fields
– Generally these fields are created to help during the match process and are dropped after successful matching
●Reporting fields
– Specifically created to help review the results of Standardize and recognize handled and unhandled data


Business intelligence fields

● Intelligent data parsed and bucketed from the input free-form field

USNAME Examples

• Title

• First Name

• Middle Name

• Primary Name

• Generational

USADDR Examples

HouseNumber

Directional

Street Name

Unit Types

Box Types

Unit Values

Building Names

USAREA Examples

•City

•State

•Zip5

•Zip4


Standardize matching fields

●Phonetic coding
– NYSIIS
– Reverse NYSIIS
– Soundex
– Reverse Soundex
●Hash keys
– First 2 characters of the first five words
●Packed keys
– Data concatenated, or packed
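The two non-phonetic key types are simple enough to sketch in Python. The hash key follows the description above (first 2 characters of the first five words); the packed key is an interpretation, assuming "packed" means values concatenated with spaces removed. Both function names are illustrative, and upper-casing is an added assumption.

```python
def hash_key(text: str, words: int = 5, chars: int = 2) -> str:
    """First `chars` characters of each of the first `words` words."""
    return "".join(w[:chars] for w in text.upper().split()[:words])

def packed_key(*values: str) -> str:
    """Concatenate values with spaces removed (assumed meaning of 'packed')."""
    return "".join(v.replace(" ", "").upper() for v in values)

print(hash_key("International Business Machines Corporation"))
print(packed_key("29 Aarons Ct", "06877"))
```

Keys like these give matching a compact value to block or compare on when the full free-form text is too variable.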


Standardize reporting fields

Field                Description
Unhandled Pattern    The pattern generated for tokens not processed by the selected rule set.
Unhandled Data       The remaining tokens not processed by the selected rule set.
Input Pattern        The pattern generated for the stream of input tokens based on the parsing rules and token classifications.
Exception Data       The tokens not processed by the rule set because they represent a data exception.
User override flag   Flag indicating what kind of user overrides were applied to this record.


Investigate NAME unhandled patterns and data

●Identify the unhandled patterns for the NAME field. In the report include the unhandled data, input pattern, original data and the record key.

1. Build a Character Concatenate Investigation using the following fields

Field Description    Type
Unhandled Pattern    C
Unhandled Data       X
Input Pattern        X
Name domain data     X

2. Increase the number of samples to 5


Standard practice: investigate handled and unhandled data

●Review the business intelligence fields to ensure accurate bucketing of the data

●Build a Character Discrete Investigation for each field and review the contents and the format

●Build Investigation to review:
– Unhandled Patterns
– Unhandled Data
– Input Pattern
– Input Fields


Customizing rule sets

●A rule set may require modification if some input data is:
– Not processed
– Incorrectly processed
●QualityStage provides functions to:
– Test strings for classifications using the Rules Analyzer
– Apply user overrides
– Modify the classification table


User overrides

●Provides the user with the ability to modify rule sets
●The following types of rule sets can be modified using User Overrides
– Domain Pre-processor rule sets
– Domain rule sets
●There are five types of user overrides relating to: classifications, patterns, and text strings
●User overrides are GUI-driven
●Rule set should be provisioned after modifications are applied


User classification override

●Recognized as a keyword and classified
– Additional words
• New abbreviation, variation
• Misspelling of a word
●User Classifications may override or add:
– Original values (Token values)
– Standard value
– Class


Apply classification override

Unhandled Data
Input Pattern   Original Data
+,+             HOCHREITER , CAROLYNNE

Override: add CAROLYNNE as a valid first name to the classification table
Token Value   Standard Value   Class
Carolynne     Carolynne        F

Corrected Pattern
Input Pattern   Original Data
+,F             HOCHREITER , CAROLYNNE


Text overrides

●Allow the user to specify overrides based on an entire text string

●Use this override for special cases and specific handling of a string of text

●Input Text Overrides
– Applied to the original text string
●Unhandled Text Overrides
– Applied to the Unhandled Data field


Input text overrides

Input Text
Unhandled Pattern   Input Text
++                  ZACHARIA GELLMAN
++                  TOMMOTHY CABBOTT
++                  REIFF FUNERAL

Override
Input Text Override   Action
REIFF FUNERAL         Move text string to the Primary name field

Results
Input Pattern   Primary Name
+ +             REIFF FUNERAL


Pattern overrides

●Allow the user to specify overrides based on an entire pattern
●Use this override when most or all records should be processed with identical logic
●Input Pattern Overrides
– Applied to the original text string
●Unhandled Pattern Overrides
– Applied to the Unhandled Data field


Unhandled pattern overrides

Unhandled Pattern
Unhandled Pattern   Input Text
+, +                HAYWARD, WINSLOW
+, +                ESHAGHIAN , JOUBI
+, +                BOULDER, CORONA

Override (the comma provides context)
Unhandled Pattern Override: +, +
– Move the first + to Primary Name
– Move the second + to First Name

Results
Unhandled Pattern   First Name   Primary Name
+, +                WINSLOW      HAYWARD
+, +                JOUBI        ESHAGHIAN
+, +                CORONA       BOULDER
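The override in this example amounts to a lookup from pattern to a list of target fields. A hedged Python sketch of that idea follows; `PATTERN_OVERRIDES`, `tokenize`, `pattern_of` and `apply_override` are invented for illustration, and the pattern builder deliberately treats every non-comma token as `+`, which is a simplification of real classification.

```python
# Hypothetical override table: unhandled pattern -> target field per token
# (None means the token, here the comma, is not bucketed anywhere).
PATTERN_OVERRIDES = {"+,+": ["PrimaryName", None, "FirstName"]}

def tokenize(text: str):
    """Split on whitespace, keeping each comma as its own token."""
    return text.replace(",", " , ").split()

def pattern_of(tokens):
    """Simplified pattern: ',' stays ',', every other token becomes '+'."""
    return "".join("," if t == "," else "+" for t in tokens)

def apply_override(text: str):
    """Bucket tokens by the override for their pattern, or return None."""
    tokens = tokenize(text)
    targets = PATTERN_OVERRIDES.get(pattern_of(tokens))
    if targets is None:
        return None                      # no override for this pattern
    return {field: tok for field, tok in zip(targets, tokens) if field}

for name in ["HAYWARD, WINSLOW", "ESHAGHIAN , JOUBI", "BOULDER, CORONA"]:
    print(apply_override(name))
```

Because the override keys on the pattern, every record with that pattern gets the same treatment, which is why BOULDER, CORONA is bucketed as a person's name even though it may not be one.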


User override precedence

Override              Purpose
User Classification   Recognize words to classify
Input Text            Modify logic based on the input string
Input Pattern         Modify logic based on the input pattern
Unhandled Text        Modify logic based on the unhandled data string
Unhandled Pattern     Modify logic based on the unhandled pattern


Investigate address and area unhandled patterns

●Identify the unhandled patterns for the Address and AREA fields. In the report include the unhandled data, input pattern, original data and the record key.

1. Build a Character Concatenate Investigation using the following fields

Field Description    Type
Unhandled Pattern    C
Unhandled Data       X
Input Pattern        X
Address Domain       X

2. Increase the number of samples to 5


Overrides

●Purpose
– Correct problems found during standardization
●Rule set may require overrides because you have data
– Not processed
– Incorrectly processed
●Override types
– Classification
– Input pattern
– Input text
– Unhandled pattern
– Unhandled text

●Can be tested with rules analyzer


Course exercise project design

Policy









Overrides screen


Checkpoint

1. (T/F) WAVES can standardize name fields.
2. (T/F) Rule sets are used in standardization processing.
3. Name the components of rule sets.



Checkpoint solutions

1. (T/F) WAVES can standardize name fields.
Answer: False

2. (T/F) Rule sets are used in standardization processing.
Answer: True

3. Name the components of rule sets.
Answer: Classification table, dictionary, pattern action file, lookup tables


Unit summary

Having completed this unit, you should be able to:
– Describe the Standardize stage
– Identify rule sets
– Build jobs using the Standardize stage
– Interpret standardization results
– Investigate unhandled data and patterns


Lab 7: Standardize country

●Word investigation
– Uses COUNTRY rule set
●Rule set found in Other folder
●Adds ISO country code to records


Lab 8: Select US records

●Uses Select stage to separate records with US ISO code
●Could also use Transformer stage


Lab 9: Standardize USPREP

●Word investigation
– Uses rule set
●Rule set found in US folder


Lab 10: Standardize USNAME, USADDR, USAREA

●Word investigation
– Uses rule sets
●Rule sets found in US folder


Lab 11: Investigate unhandled patterns

●Character concatenate investigation
●C mask used to produce histogram
●X mask used to display other fields of interest


Lab 12: Apply user overrides

●Classification
●Input pattern
●Unhandled pattern


Match


Unit objectives

●After completing this unit, you should be able to:
– Build a QualityStage job to identify matching records
– Apply multiple match passes to increase efficiency/efficacy
– Interpret and improve match results


Match stage

●Statistically-based method for determining matches
●25 match comparison algorithms providing a full spectrum of fuzzy matching functions
●Ability to measure informational content of data
●Identify duplicate entities within one or more files
●Match specification built with Match Designer
●Critical field settings


What constitutes a good match?

Which of the following record pairs is a match? And how do you know?

W HOLDEN 12 MAIN ST
W HOLDEN 12 MAINE ST

W HOLDEN 128 MAIN PL 02111 12/8/62
W HOLDEN 128 MAINE PL 02110 12/8/62

WM HOLDEN 128A MAIN SQ 02111 12/8/62 338-0824
WILL HOLDEN 128A MAINE SQ 02110 12/8/62 338-0824

●Do you compare all the shared or common fields?
●Do you give partial credit?
●Are some fields (or some values) more important to you than others? Why?
●Do more fields increase your confidence? By how much? What is enough?


The value of information content

●Information content measures the significance of one field over another (Discriminating Value)
– A Gender Code contributes less information than a Tax-Id Number
●Information content also measures the significance of one value in a field over another (Frequency)
– In a First-Name field, JOHN contributes less information than DWEZEL
●Significance is determined by a value’s reliability and its ability to discriminate; both can be calculated from your data


The weighted score is a relative measure of the probability of a match

Thresholds defined can be used for automated processing

[Figure: Distribution of weights. Histogram of the number of pairs (y-axis, 0 to 4000) by weight of comparisons (x-axis, -20 to 40). Non-matches sit at the low-weight, less-confidence end and matches at the high-weight, more-confidence end, with a grey area between them.]

Example weighted comparison:
WILLIAM J HOLDEN 128 MAIN ST 02111 12/8/62
WILLAIM JOHN HOLDEN 128 MAINE AVE 02110 12/8/62
+1 +1 +17 +2 +4 -1 +7 +9 = 40


Weights

●Measures the information content of a data value
●Each field contributes to the confidence (probability) of a match


Types of weights

●If a field matches, the agreement weight is used
– Agreement weight is a positive value
●If a field doesn’t match, the disagreement weight is used
– Disagreement weight is a negative value
●Partial weight is assigned for non-exact or “fuzzy” matches
●Missing values have a default weight of zero
●Weights for all field comparisons are summed to form a composite weight
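These rules can be summarized in a short Python sketch. The per-field weights below are made-up numbers for illustration, partial ("fuzzy") weights are not modeled, and `composite_weight` is a hypothetical helper rather than anything exposed by the Match stage.

```python
def composite_weight(comparisons) -> float:
    """Sum field weights: agreement, disagreement, or zero when a value is missing.

    Each comparison is (value_a, value_b, agreement_weight, disagreement_weight).
    """
    total = 0.0
    for a, b, agree_w, disagree_w in comparisons:
        if not a or not b:
            continue                    # missing value contributes zero
        total += agree_w if a == b else disagree_w
    return total

pair = [
    ("HOLDEN", "HOLDEN", 17.0, -5.0),   # agrees: +17
    ("02111", "02110", 4.0, -1.0),      # disagrees: -1
    ("", "338-0824", 9.0, -9.0),        # missing on one side: 0
]
print(composite_weight(pair))
```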


Matching terminology

Term                    Description
Informational Content   Measures the significance of one field value over another
Weight                  Measures the informational content of a data value
Composite Weight        Measures the confidence of a match
Match Cutoffs           Distinguish matches from non-matches
False Positives         Records with a score above the high cutoff that really aren’t a match
False Negatives         Records below the low cutoff that really are a match


Measuring the conditions of uncertainty

●Reliability of the data in a given field
– Estimated as the probability that the field agrees given the record pair is a match
●Probability of a random agreement of values
– Estimated as the probability the field agrees given the record pair is not a match


Reliability (m-probability)

●Approximated as 1 minus the error rate for the given field
●The higher the m-probability, the higher the disagreement weight will be when the field does not match, since the data is considered reliable


Chance agreement (u-probability)

●The u-probability can be approximated as the probability that a field agrees at random (by chance)

●QualityStage uses a frequency analysis to determine the probability of chance agreement for all values
– Created by a Match Frequency stage

●Rare values bring more weight to a match
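The slides do not give the weight formula. A common formulation in probabilistic record linkage (Fellegi-Sunter) derives both weights from the m- and u-probabilities using base-2 logarithms, and the sketch below assumes that formulation; the probabilities used are invented for illustration.

```python
import math

def agreement_weight(m: float, u: float) -> float:
    """log2(m / u): positive when agreement is likelier among true matches."""
    return math.log2(m / u)

def disagreement_weight(m: float, u: float) -> float:
    """log2((1 - m) / (1 - u)): negative when the field is reliable."""
    return math.log2((1 - m) / (1 - u))

# Hypothetical fields: a gender code agrees by chance about half the time,
# while a tax id almost never agrees by chance.
for name, m, u in [("gender", 0.95, 0.5), ("tax id", 0.95, 0.0001)]:
    print(name, round(agreement_weight(m, u), 2), round(disagreement_weight(m, u), 2))
```

Under this assumption a smaller u (a rarer value) raises the agreement weight, and a larger m (more reliable data) makes the disagreement weight more strongly negative, which is consistent with the statements above.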


Blocking

●Grouping together like records that have a high-probability of producing matches

●Only “like” records are compared to each other making the match more efficient and computationally feasible

●Records in a “block” match exactly on one to several blocking fields


Blocking example: sample data

Block on NYSIIS of Last Name

NYSIIS LNAME   NAME                       ADDRESS               ZIP
YANG           YUNG , WAYNE D             9000 SHEPARD DRIVE    78753
GARAS          GEROSA, FRAN X             29 AARONS CT          06877
YANG           YOUNG , JONATHAN A         1767 TOBEY ROAD       30341
GARAS          GERISA, FRANCIS            29 AARONS CT          06877
GARAS          GEROSA, FRANCIS XAVIER     29 AARONS COURT       06877
MATAC          MARCUS MATIC               100 SUMMER STREET     02111
GARAS          GEROSA, MARY               29 AARONS CT          06877
JANCAN         RENEE JENKINS              100 SUMMER STREET     02111
YANG           YOUNG THERESA C            1767 TOBEY ROAD       30341


Blocking example – NYSIIS of Last Name

NYSIIS   NAME                       ADDRESS               ZIP
YANG     YUNG , WAYNE D             9000 SHEPARD DRIVE    78753
YANG     YOUNG , JONATHAN A         4220 BELLE PARK DR    77072
YANG     YOUNG THERESA C            1767 TOBEY ROAD       30341

GARAS    GEROSA, FRAN X             29 AARONS CT          06877
GARAS    GEROSA, FRANCIS XAVIER     29 AARONS COURT       06877
GARAS    GEROSA, MARY               29 AARONS CT          06877
GARAS    GARISA, FRANCIS            29 AARONS CT          06877

MATAC    MARCUS MATIC               100 SUMMER STREET     02111

JANCAN   RENEE JENKINS              100 SUMMER STREET     02111

Blocks with only one record are considered residuals
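The grouping shown above can be reproduced with a few lines of Python, which also makes the efficiency gain concrete by counting candidate pairs. The NYSIIS codes are copied from the example rather than computed; the variable names are illustrative.

```python
from collections import defaultdict

# (NYSIIS of last name, name) pairs taken from the sample data
records = [
    ("YANG", "YUNG , WAYNE D"), ("GARAS", "GEROSA, FRAN X"),
    ("YANG", "YOUNG , JONATHAN A"), ("GARAS", "GERISA, FRANCIS"),
    ("GARAS", "GEROSA, FRANCIS XAVIER"), ("MATAC", "MARCUS MATIC"),
    ("GARAS", "GEROSA, MARY"), ("JANCAN", "RENEE JENKINS"),
    ("YANG", "YOUNG THERESA C"),
]

blocks = defaultdict(list)
for key, name in records:
    blocks[key].append(name)            # records sharing a key form a block

residuals = sorted(k for k, names in blocks.items() if len(names) == 1)
pairs_in_blocks = sum(len(n) * (len(n) - 1) // 2 for n in blocks.values())
pairs_without_blocking = len(records) * (len(records) - 1) // 2

print(residuals)
print(pairs_in_blocks, "comparisons instead of", pairs_without_blocking)
```

Only pairs inside the YANG and GARAS blocks are compared; the single-record blocks have nothing to compare against, which is why they become residuals.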


Blocking strategy

●Choose fields with reliable data
●Choose fields with a good distribution of values
●Combinations of fields may be used


Examples of blocking strategies

●Zip code for matching addresses
●NYSIIS of last name for matching individuals
●Brand name for matching products
●Combination of zip code and NYSIIS of street name for matching addresses
●Combination of NYSIIS of last name and first letter of first name for matching individuals


Blocking summary

●Blocking groups together “like” records
●Matching is more efficient for small block sizes
– Blocks should have fewer than 1000 records (guideline, not a hard and fast rule)
●Blocking fields must match exactly for a candidate set to be created/evaluated
●Beware of block overflow
– Computational resources run out
– Comparisons are not completed
– Every record in the block becomes an automatic residual


Match types

●Unduplication
– Identifies duplicate candidates in one file
●Reference Match (Two File)
– One-to-one correspondence
• For every record on the stream link we expect to find a match to one record on the reference link
– Many-to-one correspondence
• More than one record on the stream link can match to the same record on the reference link


Comparing data values

●Different comparisons for different data
●17 comparison methods
●Most common
– CHAR (character comparison): compares character by character, left to right
– UNCERT (character uncertainty): tolerates phonetic errors, transpositions, random insertion, deletion, and replacement of characters
– CNT_DIFF: counts keying errors in numeric fields; you set a tolerance threshold
– NAME_UNCERT: can be used to compare character values; if the strings are different lengths, the shorter of the two lengths is used


Match Implementation


Tasks required in match process

●Standardize the data
●Add data columns needed for blocking
●Generate match frequency report using Match Frequency stage
●Build match specification in Match Designer
– Add pass
• Blocking columns
• Match commands
– Configure match test results environment
●Run pass
●Review results
●Tune the match
– Add cutoffs
– Set overrides
– Add more passes
●Repeat steps until match results are acceptable


Standardize columns and generate match frequency


Match frequency stage

Map fields


Match frequency generation


Lab 13: Match frequency

●Use Match Frequency stage in a match job


Match Designer

●Used to build a match specification that will be addressed in a match job

Features
– Design control center
– Data-centric
– Graphical representation of statistics
– Independent of job design
– Iterative development


Match Design - Unduplicate


Match design – creating specification

How to create a new match specification

Right-click in non-root area of repository


Match design - unduplicate

The Major Components

Test results

Histogram

Holding Area

Pass Composer

Decision Rules

Data Viewer

Cutoff Tuning



Select match type –example unduplicate

Will initially get one pass called MyPass



Click table definition icon

Use load button to access table definition of

standardized data set



Blocking

Match Commands

Select pass icon



Save passes and specification



Name and place passes and specification



Set up test results area

Questions:

Where is the standardized data?

Where is the frequency report?

What ODBC-accessed database will store test results?



Standardized sample data

Frequencies data set

Data Source Name, User Name, Password

Note: these must be data sets



Add Blocking Columns



Select Column

Business Name

Click Apply or OK



Add MATCH Column



Business Name



Compare Type



Data Column: right-click to view data frequencies



Frequencies



Select

Parameter


Fully configured pass

Expanded view will show details of

blocking and match commands

Click test pass to run the pass against the

data


Match design – after test pass run



Grouping option:
– Match Sets: See all matches and duplicates together
– Match Pairs + Sort: See the master record repeated



Default Display (Grouped by Match Sets)

Grouped by Match Pairs and then sorted Ascending by Weight



Compare Weights:See how any two records score



Statistics Tab

Change What Shows



Change How Shows


Match design - unduplicate

TOTAL Statistics Tab

Change What Shows

Change How Shows


Lab 14: Configure test results database

●Build a DB2 database to contain match test results
●Build an ODBC source to connect the database to QualityStage


Lab 15: Match specification

●Use Match Designer to build specification for unduplicate job
●Configure test results area


Match improvement strategy

1. Set critical values for important fields
2. Review calculated weights
• Adjust weights using weight overrides
3. Set cutoffs
4. Add additional passes


Critical fields

●Used to identify fields that must agree in order for records to be linked
– Critical: field values must agree exactly or the records cannot be linked (considered a match)
– Critical Missing OK: field values must agree exactly on values not considered “missing values”
●QualityStage feature: Variable Special Handling


Variable special handling


Weight overrides

●Allows you to adjust the agreement weight, the disagreement weight, or both for specific situations
– Add to calculated weight
– Replace weight
●Set on the Match Commands screen


Weight override screen


Cutoffs

●There are two cutoffs
– Match cutoff (high cutoff)
– Clerical cutoff (low cutoff)
●Records with a weight equal to or above the Match cutoff are considered matches
●Records with a weight below the low cutoff are not matches
●Records with a weight greater than or equal to the low cutoff and less than the high cutoff are considered clerical records for manual review
●Cutoffs can be set at the same value, eliminating clerical records
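The cutoff rules translate directly into a three-way decision. A minimal Python sketch, with the function name `decide` and the cutoff values 20 and 10 chosen purely for illustration:

```python
def decide(weight: float, match_cutoff: float, clerical_cutoff: float) -> str:
    """Classify a composite weight as match, clerical, or non-match."""
    if clerical_cutoff > match_cutoff:
        raise ValueError("clerical (low) cutoff cannot exceed match (high) cutoff")
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"               # held for manual review
    return "non-match"

for w in (27.82, 14.45, 5.0):
    print(w, decide(w, match_cutoff=20, clerical_cutoff=10))

# With both cutoffs at the same value the clerical range is empty.
print(decide(14.9, 15, 15), decide(15, 15, 15))
```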


Setting the match cut-off

Weight   Data fields                  Result
27.82    PO BOX 930202                Definite match
27.82    PO BOX 930202
27.82    PO BOX 930202

38.65    35 COLLIER RD NW STE 610     Definite match
38.65    35 COLLIER RD NW STE 610

25.81    928 S 1ST ST                 Questionable match
14.45    S 1ST ST


Multiple match passes

●Additional passes are helpful in overcoming data errors and missing values in block fields
●You should always create at least two match passes
●Change blocking strategies for each pass


Example: multiple match passes

Pass   Weight   Data fields
1      26.31    JASON BIRCH 1350 WALTON WAY 30901
1      26.31    JASON BIRSH 1350 WALTON WAY 30901

1      20.42    JOHN SMITH 2047 PRINCE AVE 30604
1      10.83    MARY SMITH 2047 PRINCE AVE 30604

1      RES A    JOHN SMITH P.O. BOX 123 30604

2      20.42    JOHN SMITH 2047 PRINCE AVE 30604
2      10.19    JOHN SMITH P.O. BOX 123 30604

Pass 1 blocked on street name. Pass 2 found additional matched records in which the street name was different but the names were the same.


Match Implementation –Unduplicate job


Double Click

Unduplication implementation


Unduplication Implementation


Verify link order for both input and output


Map all output links


Checkpoint

1. (T/F) Match specifications are created using Designer.
2. (T/F) An unduplicate match can be used against two files.
3. Which match specification component determines the extent of the clerical review records?



Checkpoint solutions

1. (T/F) Match specifications are created using Designer.
Answer: True

2. (T/F) An unduplicate match can be used against two files.
Answer: False

3. Which match specification component determines the extent of the clerical review records?
Answer: cutoff values


Unit summary

Having completed this unit, you should be able to:
– Build a QualityStage job to identify matching records
– Apply multiple match passes to increase efficiency/efficacy
– Interpret and improve match results


Lab 16: Unduplicate job

●Build unduplicate job using the match specification


Survive


Unit objectives

●After completing this unit, you should be able to:
– Identify Survive techniques
– Describe implementation options
– Define Survive rules
– Build Survive job


Survive stage

●Point-and-click creation of business rules to determine “surviving” data; the user decides how to survive data
●Performed at record or field level, so very flexible
●Creates a single, consolidated record containing the “best-of-breed” data
●Provides consolidated view of the data


Survive example

Survive Input (Match Output)
Group  Legacy  First    Middle  Last     No.   Dir.  Str. Name   Type  Unit No.
1      D150    Bob              Dixon    1500  SE    ROSS CLARK  CIR
1      A1367   Robert           Dickson  1500        ROSS CLARK  CIR
23     D689    William  A       Obrian   5901  SW    74TH        ST    STE 202
23     A436    Billy    Alex    O’Brian  5901  SW    74TH        ST
23     D352    William          Obrian   5901        74          ST    #202

Survived Consolidated Output
Group  Legacy  First    Middle  Last     No.   Dir.  Str. Name   Type  Unit No.
1      D150    Robert           Dickson  1500  SE    ROSS CLARK  CIR
23     D689    William  Alex    O’Brian  5901  SW    74TH        ST    STE 202

Cross-Reference File
Group  Legacy
1      D150
1      A1367
23     D689
23     A436
23     D352


Survive rules

●A rule contains a condition and a set of target fields
– When the condition is met the field becomes a candidate for the “best”
– All records in a group are tested against the condition
– The “best” populates the target fields

●Multiple targets are permitted for the same rule


Survive rules

●Custom Rule
– Build your own logical expression
– Comparison (=, !=, <, >, <=, >=)
– Logical (and, or, not)
– Indicate the current and best records with the following notation
• c.field indicates the current
• b.field indicates the best
– Parentheses ( ) can be used for grouping complex conditions
– String literals are enclosed in double quotation marks, such as “MARS”
– A semicolon (;) terminates a rule


Building survive rules

●Survive Rules Definition screen lets you easily build, delete and manage survivor rules


Survive techniques

●Pre-defined Techniques
– Source
– Recency
– Frequency
– Most complete (longest string)

●User-specified logic


Target fields

●Fields you want to write to the output file
●Populated based on meeting the conditions of the survivor rule(s)
●Fields not listed as targets are excluded from the output file
●May have multiple targets for each rule


Example: complex survive rule

●The following rule states that FIELD3 of the current record should be retained if the field contains five or more characters and FIELD1 has any contents.
●The prefix b. indicates the current “best” record
●The prefix c. indicates the current record being tested against the survivor rule

FIELD3: (SIZEOF (TRIM c.FIELD3) >= 5) AND (SIZEOF (TRIM c.FIELD1) > 0) ;

In this rule, FIELD3 (before the colon) is the target and the expression after the colon is the condition.
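The rule above can be read as a predicate on the current record, applied to every record in a match group. The Python sketch below mirrors that reading; it is not the Survive stage's evaluator. How the stage chooses among several qualifying records is not stated here, so the sketch simply lets a later qualifying record replace an earlier one, which is an assumption. The sample group is invented.

```python
def field3_condition(c: dict) -> bool:
    """(SIZEOF (TRIM c.FIELD3) >= 5) AND (SIZEOF (TRIM c.FIELD1) > 0)"""
    return len(c["FIELD3"].strip()) >= 5 and len(c["FIELD1"].strip()) > 0

def survive(group, target: str, condition):
    """Return the target value from the record that satisfies the condition."""
    best = None
    for current in group:               # every record in the group is tested
        if condition(current):
            best = current[target]      # assumption: later qualifier replaces earlier
    return best

group = [
    {"FIELD1": "A", "FIELD3": "ABC"},         # FIELD3 too short
    {"FIELD1": "X", "FIELD3": "ABCDEF"},      # qualifies
    {"FIELD1": " ", "FIELD3": "LONGVALUE"},   # FIELD1 is blank after trimming
]
print(survive(group, "FIELD3", field3_condition))
```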


Survive Implementation


Double Click

Survive QualityStage job


Survive stage properties


Output Column Technique



‘Complex’ available



Checkpoint

1. (T/F) Survivorship can allow more than one record to survive.
2. (T/F) Survivorship rules deal with the complete record only.
3. Name three survive rules.



Checkpoint solutions

1. (T/F) Survivorship can allow more than one record to survive.
Answer: False

2. (T/F) Survivorship rules deal with the complete record only.
Answer: False

3. Name three survive rules.
Answer: most recent record, longest non-blank, most frequent non-blank


Unit summary

Having completed this unit, you should be able to:
●Identify Survive techniques
●Describe implementation options
●Define Survive rules
●Build Survive job


Lab 17: Survivorship job

●Build survivorship job


Survive job with XREF file


Special Topics


Full Run


1. Double Click

Full run – single job


Full run using DataStage job sequencer


QualityStage Migration Tool


QualityStage Migration Tool – overview

●The QualityStage Migration Tool (QSMT) provides the ability to migrate QualityStage 7.5 jobs and Standardization Rule Sets to the QS8 environment.
●QSMT analyzes the QS 7.5 server project directory to construct “dsx” files which can be imported into the QS8 common repository using the DS & QS8 Designer’s “import” facility


QualityStage migration tool – overview

●QSMT functionality offers three types of QS 7.5 object migrations:
– QS 7.5 Standardization Rule Set
– QS 7.5 job in combined mode
– QS 7.5 job in expanded mode
●Two modes for migrating jobs to QS8:
– Combined Mode
• Use when you need to take a legacy process and just run it in QS8
• Allows control before and after the legacy process
• Will always run after importing without any manual tuning
– Expanded Mode
• Use when you need to add QS8 operators within a migrated process
• May require some manual tuning to run


Rule set migration

●The QSMT has the ability to migrate Standardization Rule Sets in one of two ways:
– Explicitly: you may specify the rule set you want to migrate
– By job dependency: you may migrate all rules associated with a particular job

Note: Regardless of the migration mode, all migrated rules will have the new naming convention of:
QS-7.5-Ruleset-Name_QS-7.5-Project-Name


Combined mode migration

●Use this mode to get a legacy QS job up and running in QS8 with as little effort as possible. Jobs will import and run without modifications
●After importing, a migrated job will appear in the “Jobs” folder of the repository view in the QS/DS 8 Designer client
●Jobs are renamed by QSMT within the QS8 package to minimize name collision
●The new job name has the following naming convention:
QS-7.5-Job-Name_QS-7.5-Project-Name


QSMT – combined mode migration

The job consists of a single instance of the QS 8 Legacy Job stage, together with some number of DS Sequential File stages, which are linked to the Legacy Job stage as inputs or outputs


QSMT – combined mode migration

●All the QS stages run under the control of the single Legacy Job stage in Combined Mode
●The list of operations can be seen by opening the Legacy stage
●File IO to external files is performed by the Information Server Sequential File stages


QSMT – combined mode & running a QS8 job

●Once imported, Legacy jobs are run the same as any other QS8 job
– Prior to compiling, be sure any required rule sets are provisioned to the server
– Run as you would any other QS8 job


QSMT – expanded mode

●Use to re-implement the job in the QS8 environment
●After importing, a migrated job will appear in the “Jobs” folder in the same way as in Combined Mode
●The job consists of one or more stages for each 7.5 stage, plus DS PX Sequential File stages, linked to represent the 7.5 job flow. For complex jobs, stages may need to be reorganized to improve readability


QS stage migration reference table

QS 7.5 Stage Type         | QS8 Stage Type  | Conditions
Abbreviate                | Legacy Job      | Always
Build                     | Legacy Job      | Always
Collapse                  | Legacy Job      | Always
FFC                       | Copy            | “Delimited text” used in 7.5 stage
FFC                       | ODBC Enterprise | “ODBC” used in 7.5 stage
Investigate               | Legacy Job      | Always
Match*                    | Legacy Job      | Always
Multinational Standardize | MNS             | Always
Parse                     | Legacy Job      | Always
Select                    | Legacy Job      | “Merge” used in 7.5 stage
Select                    | Filter          | “Split”, “Accept”, or “Reject” used in 7.5

* Currently working on converting Match specifications for GA


QS stage migration reference table (continued)

QS 7.5 Stage Type | QS8 Stage Type | Conditions
Sort              | Sort           | Always
Standardize       | Standardize    | Always
Survive           | Legacy Job     | If target columns overlap
Survive           | Survive        | If target columns do not overlap
Transfer          | Legacy Job     | Always
Unijoin           | Legacy Job     | Always
WAVES             | WAVES          | Always
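The two reference tables above amount to a lookup from a 7.5 stage type and a condition to a QS8 stage type. A minimal sketch of that lookup follows. The condition keys ("odbc", "targets_overlap", and so on) are hypothetical labels invented for this example; they are not QSMT settings.

```python
ALWAYS = "always"

# (QS 7.5 stage type, condition) -> QS8 stage type, transcribed from the tables.
STAGE_MAP = {
    ("Abbreviate", ALWAYS): "Legacy Job",
    ("Build", ALWAYS): "Legacy Job",
    ("Collapse", ALWAYS): "Legacy Job",
    ("FFC", "delimited_text"): "Copy",
    ("FFC", "odbc"): "ODBC Enterprise",
    ("Investigate", ALWAYS): "Legacy Job",
    ("Match", ALWAYS): "Legacy Job",
    ("Multinational Standardize", ALWAYS): "MNS",
    ("Parse", ALWAYS): "Legacy Job",
    ("Select", "merge"): "Legacy Job",
    ("Select", "split_accept_reject"): "Filter",
    ("Sort", ALWAYS): "Sort",
    ("Standardize", ALWAYS): "Standardize",
    ("Survive", "targets_overlap"): "Legacy Job",
    ("Survive", "targets_disjoint"): "Survive",
    ("Transfer", ALWAYS): "Legacy Job",
    ("Unijoin", ALWAYS): "Legacy Job",
    ("WAVES", ALWAYS): "WAVES",
}


def qs8_stage(stage_75: str, condition: str = ALWAYS) -> str:
    """Return the QS8 stage type for a 7.5 stage in expanded mode."""
    try:
        return STAGE_MAP[(stage_75, condition)]
    except KeyError:
        raise ValueError(f"no mapping for {stage_75!r} with {condition!r}") from None


print(qs8_stage("Parse"))                        # Legacy Job
print(qs8_stage("FFC", "odbc"))                  # ODBC Enterprise
print(qs8_stage("Survive", "targets_disjoint"))  # Survive
```

Stages whose mapping depends on how the 7.5 stage was used (FFC, Select, Survive) need the condition argument; all others use the default.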


QSMT – expanded mode & running a QS8 job

• Prior to compiling, be sure to complete the following:
  • Provision any required rules to the server
  • Add ODBC connection information to any ODBC read or write stages appearing in the job
• To complete the migration, perform the following for every Standardize, Survive, MNS, and WAVES stage that appears on the canvas:
  • Open the stage editor for the stage (e.g. by double-clicking it)
  • Click OK
• Once the above tasks are completed, compile and run as you would any other job


Lab 18: QualityStage Migration Tool
● Migrate 7.5 QualityStage jobs to version 8


Globalization


Objectives

After completing this module you will be able to:
● Build jobs that read and write Japanese data
● Modify client settings to display Japanese data with correct characters


Terminology

● Character Set
  – An ordered list of characters used for text
    • Examples: Latin, Cyrillic, Unicode
● Character Encoding
  – How each character in a character set is represented as bits
    • Examples: UTF-8, UTF-16BE, GB18030 are encodings of Unicode
● Codepage
  – Microsoft Windows term for Encoding, often used in other contexts too
    • Examples:
      – 1252 is Windows Latin1, a superset of ISO8859/1
      – 932 is another name for Shift-JIS


Character Sets

● Latin
  – Italian, Spanish, French, English alphabets
● Cyrillic alphabet
  – Subsets are used by six Slavic languages (Bulgarian, Russian, Belarusian, Serbian, Macedonian, Ukrainian) and some non-Slavic languages (Kazakh, Uzbek, Kyrgyz, Tajik, and Mongolian)
● ASCII
  – Represents 128 characters (7-bit); 8-bit extensions such as ISO 8859-1 represent 256
● Unicode
  – Defines 1,114,112 code points; the first 65,536 form the Basic Multilingual Plane
  – Standard for representing the characters of all languages
  – Includes Chinese, Japanese, and Korean


Character encoding

● Definition
  – A system that pairs each character from a character set with something else, such as a number
● Two common computer encodings for Unicode
  – UTF-8
    • Variable-length encoding for Unicode
    • Encodes each character in one to four bytes
  – UTF-16
    • Variable-length encoding for Unicode
    • Encodes each character in two or four bytes
    • Allows either endian representation; the byte order is normally indicated by a byte order mark (BOM), or fixed by the encoding name, as in UTF-16BE and UTF-16LE
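The variable-length behavior of both encodings can be checked directly. The sketch below uses Python's built-in codecs, not anything from DataStage; the helper name and sample characters are chosen only for illustration.

```python
import codecs
from typing import Tuple


def encoded_lengths(ch: str) -> Tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count without BOM) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


samples = {
    "Latin A": "A",
    "Latin e with acute": "\u00e9",
    "CJK ideograph": "\u6771",
    "Emoji outside the BMP": "\U0001F600",
}
for label, ch in samples.items():
    print(label, encoded_lengths(ch))

# The generic "utf-16" codec prepends a byte order mark in the platform's byte order.
bom = "A".encode("utf-16")[:2]
print(bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
```

The four samples take 1, 2, 3, and 4 bytes in UTF-8 and 2, 2, 2, and 4 bytes in UTF-16, so neither encoding is fixed width.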


NLS

● NLS – National Language Support
  – Globalization + Localization/Translation
● NLS map
  – What DataStage uses to convert between external and internal encodings
  – Internal encoding is UTF-8 for the Server engine, UTF-16 for the Parallel engine
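Conceptually, an NLS map is a transcoding step between the bytes in an external file and the engine's internal Unicode form. The following Python sketch imitates that round trip with standard codecs. It is an analogy, not DataStage code: the function names are hypothetical, Python codec names stand in for map names, and "utf-16-le" is an arbitrary choice of byte order for the internal form.

```python
def to_internal(external_bytes: bytes, map_name: str, internal: str = "utf-16-le") -> bytes:
    """Decode external bytes with the given map, then re-encode in the internal form."""
    return external_bytes.decode(map_name).encode(internal)


def to_external(internal_bytes: bytes, map_name: str, internal: str = "utf-16-le") -> bytes:
    """Reverse step: decode the internal form and encode with the external map."""
    return internal_bytes.decode(internal).encode(map_name)


city = "\u6771\u4eac"                    # "Tokyo" in Japanese
external = city.encode("shift_jis")      # bytes as they might sit in a source file
internal = to_internal(external, "shift_jis")

print(len(external), len(internal))      # byte counts before and after mapping
print(to_external(internal, "shift_jis") == external)  # True: round trip is lossless
```

If the wrong map name is supplied on input, the decode step either fails or silently produces the wrong characters, which is why the map must match the real encoding of the external data.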


Where DataStage NLS Mapping Happens

[Figure: diagram of NLS mapping points across client and server. Labeled elements: Parallel Engine running job, Unicode (UTF-16); DataStage & QualityStage Runtime Objects, Unicode (UTF-8); Information Server Common Design Repository; XML (UTF-8); Job Monitor, Unicode (UTF-16); Messages; Scripts, etc.; Logs. Maps are shown next to each external character set and next to the client's Windows code page.]


Examples of DataStage NLS Maps

Parallel        | Description
IBM367          | Standard (US) ASCII 7-bit set
Big5            | TAIWAN: "Big 5" standard
IBM1026         | IBM EBCDIC variant 1026 (Turkish)
GB2312          | CHINESE: EUC as per GB 2312
ISO_8859-1:1987 | ISO Standard 8859 part 1: Latin-1
ISO_8859-5:1988 | ISO Standard 8859 part 5: Latin-Cyrillic
KS_C_5601-1987  | KOREAN: EUC as per KSC 5861
windows-1253    | MS Windows codepage 1253 (Greek)
windows-1255    | MS Windows codepage 1255 (Hebrew)
IBM865          | PC DOS code page 865 (Nordic)
Shift_JIS       | JAPANESE: Shift-JIS main map
TIS-620         | THAILAND: Industrial Standard 620


Setting a Client/Server Map

● Admin Client (whole server)
  – Associates a server map with the current Windows code page

[Figure: a map sits between the DataStage & QualityStage Runtime Objects, Unicode (UTF-8), and the client]


Setting Job-Level Maps

● Admin Client (per project)
  – Sets the default map name to use with all Parallel jobs in this project
  – …unless you override it in the job properties dialog


Setting a Stage-Level Map

● Various stages have an NLS Map tab, e.g. Sequential File, External Source, External Target, File Set
  – Define character set mappings (ustring ↔ external file)
  – Applied at stage or individual field level

[Figure: on the server, a map sits between the Parallel Engine running the job, Unicode (UTF-16), and the external character set]


Setting a Column-Level Map

● For Sequential File-type stages, NChar, and Char with the Extended type, offer a drop-down list of map names in the NLS Map property

[Screenshot callouts: non-default NLS map (for relevant types); Char may be "extended" for Unicode]
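Maps can be set at project, job, stage, and column level. One way to picture how these settings combine is a "most specific setting wins" lookup, sketched below. The slides state only that a job-level map overrides the project default; extending the same precedence to stage and column level is an assumption of this sketch, and the function and parameter names are hypothetical.

```python
from typing import Optional


def effective_map(project_default: str,
                  job_map: Optional[str] = None,
                  stage_map: Optional[str] = None,
                  column_map: Optional[str] = None) -> str:
    """Return the most specific map that is set (assumed precedence, see lead-in)."""
    for candidate in (column_map, stage_map, job_map):
        if candidate:
            return candidate
    return project_default


print(effective_map("ISO_8859-1:1987"))                       # project default applies
print(effective_map("ISO_8859-1:1987", job_map="Shift_JIS"))  # job override applies
print(effective_map("ISO_8859-1:1987", job_map="Shift_JIS", column_map="Big5"))
```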


Converting string to ustring: manual control

● Transformer, modify, etc.
  – string ↔ ustring conversion will happen automatically, taking the current map from context (job level or stage/operator level)
  – Fine control via explicit conversion functions
  – Conversions may use a specific map name


NLS Implementation using Investigate stage

Job Design from Lab

Job-level NLS map


Investigation results for Japanese city column

Input data

Client machine with codepage set to JPN

Output report data

Client machine with codepage set to JPN
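The reason the client code page matters can be shown with a small experiment: the same bytes read as Japanese text under a Japanese code page and as unrelated symbols under a Western one. This is a Python illustration using standard codecs (cp932 for Japanese Windows, cp1252 for Western Windows); the helper name is hypothetical and nothing here calls the DataStage client.

```python
def view_with_code_page(raw: bytes, code_page: str) -> str:
    """Interpret raw bytes the way a viewer using the given code page would."""
    return raw.decode(code_page, errors="replace")


city = "\u6771\u4eac"                 # "Tokyo" in Japanese
raw = city.encode("cp932")            # bytes as stored for a Japanese client

as_japanese = view_with_code_page(raw, "cp932")
as_western = view_with_code_page(raw, "cp1252")

print(as_japanese == city)   # True: characters display correctly
print(as_western == city)    # False: same bytes, wrong characters
```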


Unit summary

Having completed this unit, you should be able to:
● Build a QualityStage investigation job for non-English data
● View correctly formatted results in the DataStage/QualityStage data viewer


Lab 19: NLS
● Build investigation job for city using Japanese data


Address Verification Interface Stage


Objectives

After completing this module you will be able to:
● Build jobs using the AV stage to parse and verify address data


AVI Stage

● Provides
  – Transliteration (e.g. Japanese to Latin)
  – Parsing
  – Address validation
● WAVES equivalent
● Does not provide postal certification discounts
  – Use CASS (US), DPID (Australia), or SERP (Canada) if certification is desired
● Supports real-time


Components

● AV stage
● Reference data
  – 16 Geographies
  – Purchased via Passport system
● API libraries
  – Address Doctor


Reference Data

● Required for validation function only
● Requires annual license agreement
● Location pointed to by AV stage
● Some databases are memory intensive
● Load options
  – Partial preload
    • Indexes loaded to memory
  – Full preload
    • Data loaded to memory
    • Fast access but must have adequate memory
  – No preload
    • Data accessed from disk, slowest method


Job components

[Figure: job design showing input address data, the AVI stage, and an optional error file]


Stage properties

Reference data location

Function

Navigation


Mapping input columns to address elements


Transliterate mode

● Map input columns to address elements
  – Multiple input columns can be mapped to one address element
● Options offer the choice to increase the number of address lines


Map columns to output link


Parsing mode

● Input sample

●Output sample


Validation mode

● Uses reference data from a database
● Map input columns to address elements
● Can activate error link
● Creates validation summary report
● Sample output (only two of the validation columns shown)


Validation mode statuses

● Part of output record
● Document actions taken by AV stage
● Short code
● Verbose code
● Example


Validation summary report sample (USPREP)

Validation Summary Report
Company Name:
List Identifier:
Processing Date (yyyy/mm/dd): 2009/02/25
Total Number Of Records Processed: 2843

Passed: 2843 100.00%
Failed: 0 0.00%
Validated: 2233 78.54%
Corrected: 415 14.60%
Has Suggestion: 195 6.86%
PostCode Failed: 70 2.46%
City Failed: 37 1.30%
Street Failed: 274 9.64%
Country Failed: 0 0.00%
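The percentages in the sample report are each count divided by the total number of records processed. The short check below reproduces them from the counts shown. It is a reader's verification aid, not part of the AV stage, and the names are hypothetical. In this sample, Validated, Corrected, and Has Suggestion also add up to Passed; that is an observation about these figures, not a documented rule.

```python
TOTAL = 2843

# Counts copied from the sample report above.
counts = {
    "Passed": 2843,
    "Failed": 0,
    "Validated": 2233,
    "Corrected": 415,
    "Has Suggestion": 195,
    "PostCode Failed": 70,
    "City Failed": 37,
    "Street Failed": 274,
    "Country Failed": 0,
}


def pct(count: int, total: int) -> str:
    """Format count as a percentage of total with two decimals."""
    return f"{100 * count / total:.2f}%"


for name, count in counts.items():
    print(f"{name}: {count} {pct(count, TOTAL)}")

# Observation for this sample only: the three outcome counts sum to Passed.
print(counts["Validated"] + counts["Corrected"] + counts["Has Suggestion"] == counts["Passed"])
```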


Unit summary

Having completed this unit, you should be able to:
● Build jobs using the AV stage to parse and verify address data


Lab 20: AV Stage
● Build AV job to parse Japanese address data
● Review prebuilt job that validated USPREP data from earlier lab
