44
Data Quality Class 6

Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Embed Size (px)

Citation preview

Page 1: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Data Quality

Class 6

Page 2: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

This Week

• Review for Exam

• Project Questions

• Data Standardization

Page 3: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Data Standardization

• What is a standard?

• Benefits of Standardization

• Defining Data standards

• Testing for standard form

• Transforming into standard form

Page 4: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

What is a Standard?

• a standard is something set up and established by authority, custom, or general consent as a model or example

• a model to which all objects of the same class must conform.

Page 5: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

What is a Standard? 2

• conforms to a predefined expected format which may be defined by:– an organization with some official authority (e.g.,

government)

– some recognized authoritative board (such as a standards committee)

– negotiated agreement (such as electronic data interchange (EDI) agreements)

– de facto convention (e.g., telephone number formats)

Page 6: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Benefits of Standardization

• conformity for comparison (as well as aggregation and analysis purposes)

• an audit trail for data error accountability

• a streamlined means for the transfer and sharing of information

Page 7: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Defining Standards

• Find representative body• Identify a simple set of rules that

completely specify the valid structure and meaning of a correct data value

• Present the standard to the committee (or even the community as a whole) for comments

• Document and publish standard

Page 8: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Testing for Standard Form

• If there is a standard, there should be a way to test to see if data is in standard form

• Example: US Telephone numbers– Defined by Industry Numbering Committee (INC)– NPA: Numbering Plan Area code– NXX: Central Office Code

• Test for format conformance (e.g., 1-XXX-YYY-ZZZZ for telephone numbers)

• Test for validity (e.g., is XXX a valid NPA, is YYY a valid NXX for the NPA XXX)

Page 9: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Transforming into Standard Form

• Given a good standard, it should be straightforward to transform data into that form

• Must be able to recognize data components to be able to place them in proper locations

Page 10: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Error Paradigms

• How are errors introduced into data?– Attribute Granularity– Finger Flubs– Format Conformance– Semi-structured form– Transcription Errors– Transformation Flubs– Misfielded Data– Floating Data– Overloaded Attributes

Page 11: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Attribute Granularity

• Data granularity is not at the proper level– Example: “name” vs. last name, first name

• Creates confusion when more than one entity can be represented in the same attribute

Page 12: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Finger Flubs

• This happens whem the incorrect letter is typed on the keybpard

• Also, sometimes mnore than one letter is hit by mistake

• Also, a leter might be missing

Page 13: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Format Conformance

• When the format is too restrictive, the user may not be able to properly enter the data– Example: First name, middle initial, last name

• Some people go by their middle name

Page 14: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Semi-structured form

• There may be multiple “valid” formats that appear in free-form– Example: corporate structure laid out at web

sites– Example:

• (first name) (middle initial) (last name) or

• (last name), (first name)

Page 15: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Transcription Errors

• Data is collected through “fuzzy” media and is not properly transcribed– Mispronounced data– Incorrect spellings

Page 16: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Transformation Flubs

• Automated processing may introduce errors– We’ve already seen this example:

• a database of names was found to have an inordinately large number of high-frequency word fragments, such as “INCORP,” “ATIONAL,” “COMPA.”

• Text spanned multiple fields, which were not concatenated properly on extraction

Page 17: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Misfielded Data

• Data that is placed in the wrong field

• Example:street addresses– Fields may not be big enough– Text spills over to next field

Page 18: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Floating Data

• Information that belongs in one field is contained in different fields in different records in the database

• See examples in housing authority database

Page 19: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Overloaded Attributes

• More than one entity shows up in data

• We’ve already seen this example:– John and Mary Smith, TTES, Smith Foundation

Page 20: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Record Parsing

• Tokenizing data elements within an attribute

• Assign meaning to tokens– Domain membership– Patterns– Context

Page 21: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Record Parsing 2

• In order to do this, we need:– The names and types of the data components expected

to be found in the field

– The set of valid values for each data component type

– The acceptable forms that the data may take

– A means for tagging records that have unidentified data components

• We can do this with domains, mappings, and rules!

Page 22: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Data Correction

• If we can automatically recognize data as not conforming to a standard, can we automate its correction?

• If we have translation rules or mappings from incorrect values to correct values

• This is how many data cleansing applications work– example: InternatinalInternational

Page 23: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Data Correction 2

• Correction by consolidation

• Makes use of record linkage– Find a pivot attribute across which to link– The pivot should be unique (such as social

security number)– Link records together and consolidate “correct”

name based on other factors, such as data source, timestamp, etc.

Page 24: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Data Standardization

• Use standard form as a pivot for linkage and consolidation

• Example– Elizabeth R. Johnson, 123 Main St – Beth R. Johnson, 123 Main St

• It’s a good hunch that these records represent the same person

• We can standardize components based on nicknames, abbreviations, etc.

Page 25: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Data Standardization 2

• Examples:– Robert, Rob, Bob, Robby, Bobby

– Elizabeth, Elisabeth, Liz, Lizzie, Beth

– International, Intl, Int’l, Intrntnl

• Make use of a standard form, even if it is not necessarily correct– In other words, “change” all Roberts, Robs, Bobs,

Robbys, and Bobbys to Robert

– Use standard form for linkage

Page 26: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Data Standardization 3

• Again, this concept sounds familiar– Many to one mapping– Maintain the standardization mapping as

metadata– Apply mapping to get standard form

Page 27: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Abbreviation Expansion

• Rule/mapping oriented

• Translates common abbreviations to a standard form

• Types:– Shortenings (INC for INCORPORATED)– Compression (INTL for INTERNATIONAL)– Acronyms (IBM for you know what)

Page 28: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Transformation Rules

• Standardization is a process of transforming nonconforming forms to conforming forms

• Use mappings/transformation rules

• Create a rule engine instance and integrate the rules

• Engine becomes a filter

Page 29: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Example:Address Standardization

• United States Postal Service (USPS) has done a very good job of presenting their addressing standard

• Their goal: increase readability of mail to increase deliverability

• Benefits are given to postal customers when data is in correct form

Page 30: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

USPS Address Standard

• Multiple address lines– Recipient line– Delivery Address line– Last line

• Standard Address Block

Page 31: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Recipient Line

• Person or entity to whom mail is to be delivered

• First line of standard address block

Page 32: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Delivery Address Line

• Contains location information• Includes street address• Broken down into:

– Primary address number– Predirectional and/or Postdirectional– Street name– Suffix (RD, ST, etc.)– Secondary address designator

Page 33: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Last Line

• City

• State

• ZIP+4 code

Page 34: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Standard Abbreviations

• USPS expects addresses to be represented in a reduced form, using standard abbreviations

• This can be represented using a mapping

• See example (pub. 28)

Page 35: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

ZIP+4

• Encoding of geographical data

• Actually, the ZIP code is an overloaded data value

• It contains state information as well as delivery location focus

Page 36: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Address Standardization

• First: Is the address already in standard form?– This can be checked by making sure that the

address conforms to the address block layout– Some special cases need addressing (East West

Hwy)– Are real city names used, or vanity names?– Is correct ZIP+4 used?

Page 37: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Address Standardzation 2

• More…– Identify all addressing elements– Make sure placement is correct; if not, correct it– Is the street specified a valid street name?

(USPS provides database)– Is the address number valid within the street

address ranges?

Page 38: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Address Standardization 3

• Next: Correct if necessary– Identify all address elements

– Look up proper city name

– Look up correct ZIP+4• If the right one cannot be used, use the ZIP+4 centroid

– Move elements to proper location in address block

– Transform elements into standard abbreviated form

– Generate bar code (if needed)

Page 39: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Business Data Elements

• USPS standard is a nice source for business rules

• Elements are broken down into element classes:

Page 40: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Business Elements

• Secondary unit indicator• Secondary number• Company name• PO box number• City• State• ZIP/ZIP+4• Carrier Route code• Operational Endorsement• Key line code• POSTNET barcode• POSTNET address barcode

• Name Prefix (Mr., Mrs.)• First name• Middle name or initial• Surname• Suffix title (e.g., JR, PHD)• Professional title (PROJECT

MANAGER)• Division/Department• Mailstop code• Street number• Predirectional• Street name• Street suffix

Page 41: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

CASS

• Acronym for Coding Accuracy Support System– Provides a platform to measure the quality of

address matching and standardization software– Addresses are CASS certified if they pass

USPS provided tests (I.e., they are standardized)

– Only mail that is CASS certified can qualify for postage savings

Page 42: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

NCOA

• ~20% of population changes addresses each year

• NCOA: National Change of Address

Page 43: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Other Standards

• Telephone industry

• Financial industry (SWIFT, FIX)

• HTML

• SIC codes

• GIS standards

Page 44: Data Quality Class 6. This Week Review for Exam Project Questions Data Standardization

Next Week

• Midterm

• Following week: – Data cleansing– Record linkage– Similarity and distance