Data Quality
Class 4
Goals
• Discuss Project
• Midterm
• Statistical Process Control
• Data Quality Rules
Project
• Information is now on the web site
• Final version is due on July 26
• Data will be available by end of the week
• We will spend some time discussing goals today
Midterm
• Written exam on July 5th
• Will cover:
– Cost of low data quality
– Dimensions of data quality
– Domains and mappings
– SPC
– Data Quality Rules
Statistical Process Control
• Developed by Shewhart at Bell Labs from the 1920s through the 1950s
• Notions of Variation vs. Control
• Important in its original context of both equipment manufacture and service quality
Variation
• Natural variations
• Defects
• Errors
• Mistakes
• Some variations are meaningful, some are not
Causes of Variation
• Common, or Chance causes
– minor fluctuations or differences
– not necessarily important to correct
– observed to form a normal distribution
• Assignable, or Special causes
– (self explanatory)
• We expect to see the normal variations, but assignable cause variations are interesting
Example
• Measure railroad on-time performance
– Trains are typically on time or a few minutes late
– One night, the trains are all 1 hour late due to electrical problems, a special cause
Statistical Control
• State in which variations observed can be attributed to common causes that do not change with time
Pareto Principle
• In a population that contributes to a common effect, relatively few of the contributors account for the bulk of the effect
• Example: code performance analysis
• Can be used to direct analysis
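A minimal sketch of a Pareto analysis, to make the principle concrete. The 80% threshold, the function name `pareto_contributors`, and the profile figures are all assumptions chosen for illustration; the slide only says that a few contributors account for the bulk of the effect.

```python
def pareto_contributors(effects, share=0.8):
    """Return the top contributors whose combined effect first reaches `share` of the total."""
    total = sum(effects.values())
    ranked = sorted(effects.items(), key=lambda kv: kv[1], reverse=True)
    chosen, running = [], 0
    for name, amount in ranked:
        if running >= share * total:
            break
        chosen.append(name)
        running += amount
    return chosen

# Hypothetical code profile (seconds spent per function), invented for illustration.
profile = {"parse": 6, "load": 12, "match": 70, "report": 4, "log": 3, "sort": 5}
hot_spots = pareto_contributors(profile)
```

With this made-up profile, two of the six functions cover the chosen share of the run time, so analysis effort would start there.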
Control Chart
[Figure: control chart showing the upper control limit (UCL), the center line, and the lower control limit (LCL)]
Control Chart 2
• Used to look for distinct variations from the mean
• Goal: predictable behavior
• Plot series of data over time
• Variations are represented as distance from the mean
Control Chart 3
• Center Line: can be computed as mean of variable points
• Upper Control Limit: three standard deviations above center line
• Lower Control Limit: three standard deviations below center line
Control Chart 4
• As long as all points are between UCL and LCL, the variations are due to common causes, and the process is said to be in control, or stable
• Points above UCL or below LCL are indicative of abnormal variation, and are due to special causes – the process is not in control
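The center line, control limits, and the in-control test can be sketched in a few lines. This is an illustration under assumptions: the limits are computed from a baseline sample using the population standard deviation (`statistics.pstdev`), and the delay figures are invented to echo the railroad example.

```python
from statistics import mean, pstdev

def control_limits(baseline):
    """Return (lcl, center, ucl): the mean and the mean -/+ 3 standard deviations."""
    center = mean(baseline)
    sigma = pstdev(baseline)  # assumption: population standard deviation
    return center - 3 * sigma, center, center + 3 * sigma

def out_of_control(points, lcl, ucl):
    """Return (index, value) pairs that fall outside the control limits."""
    return [(i, x) for i, x in enumerate(points) if x < lcl or x > ucl]

# Hypothetical train delays in minutes, invented for illustration.
baseline = [2, 3, 1, 4, 2, 3, 2, 5, 3, 2]
lcl, center, ucl = control_limits(baseline)

new_nights = [3, 2, 60, 4]  # one night with a one-hour delay
flagged = out_of_control(new_nights, lcl, ucl)
```

Any entry in `flagged` is a candidate special cause worth investigating. An empty list means the new points are consistent with common-cause variation.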
Control Chart 5
• Select variables chart or attributes chart
• Use data quality dimensions as guideline
• Select meaningful variables to measure (i.e., ones that will point to a diagnosable problem)
Interpreting the Control Chart
• Lack of stability indicates potential problem
• Look for:
– points outside of control limits
– zone testing (clusters of points within certain standard deviation limits)
– potential to split out data points into different logical data sets
• Look for cycles
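Zone testing can take several forms, and the slide does not say which one is meant. The sketch below assumes one commonly used rule: flag any run of three consecutive points in which at least two lie more than two standard deviations from the center line on the same side. The function name and sample values are hypothetical.

```python
def zone_test_2_of_3(points, center, sigma):
    """Return end indices of 3-point windows with >= 2 points beyond 2 sigma on one side."""
    hits = []
    for end in range(2, len(points)):
        window = points[end - 2:end + 1]
        above = sum(1 for x in window if x > center + 2 * sigma)
        below = sum(1 for x in window if x < center - 2 * sigma)
        if above >= 2 or below >= 2:
            hits.append(end)
    return hits

# Hypothetical measurements with center line 10 and standard deviation 1.
samples = [10, 10.5, 12.5, 10, 12.7, 9.8]
clusters = zone_test_2_of_3(samples, center=10, sigma=1)
```

None of these sample points is outside the three-sigma limits, yet the cluster near the upper limit is still flagged. That is the kind of instability a zone test is meant to catch.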
SPC and Data Quality
• “The Information Factory”
• Use data quality dimensions as guideline for investigation
• Analyze the state of data as it passes through the information chain
• Probing can be automated with data quality rules
Inserting the Probes
• Find a location in the information chain that is:
– nondisruptive
– easy to access
– easy to retool
Data Quality Rules
• Definitions
• Proscriptive Assertions
• Prescriptive Assertions
• Conditional Assertions
• Operational Assertions
Definitions
• Nulls
• Domains
• Mappings
Proscriptive Assertions
• Describe what is not allowed
• Used to figure out what is wrong with data
• Used for validation
Prescriptive Assertions
• Describe what is supposed to happen with data
• Can be used for data population, extraction, transformation
• Can also be used for validation
Conditional Assertions
• Define an assertion that must be true if a condition is true
Operational Assertions
• Define an action that must be taken if a condition is true
9 Classes of Rules
• 1) Null value rules
• 2) Value rules
• 3) Domain membership rules
• 4) Domain Mappings
• 5) Relation rules
• 6) Table, Cross-table, and Cross-message assertions
• 7) In-Process directives
• 8) Operational Directives
• 9) Other rules
Null Value Rules
• Null value specification
– Define GETDATE for unavailable as “fill in date”
• Null values allowed
– Attribute A allowed nulls {GETDATE, U, X}
• Null values not allowed
– Attribute B nulls not allowed
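The three null value rules above could be checked roughly as follows. This is a sketch under assumptions: records are plain dictionaries, a bare `None` is treated as an undeclared null, and the names `NULL_REPRESENTATIONS`, `ALLOWED_NULLS`, and `null_violations` are hypothetical.

```python
# Null representations named on the slide; the meanings of U and X are not given there.
NULL_REPRESENTATIONS = {"GETDATE", "U", "X"}

# Attribute A allows all three representations; attribute B allows none.
ALLOWED_NULLS = {"A": {"GETDATE", "U", "X"}, "B": set()}

def null_violations(record):
    """Return attributes holding a null that is not allowed for that attribute."""
    bad = []
    for attr, value in record.items():
        is_null = value is None or value in NULL_REPRESENTATIONS
        # Assumption: attributes without an entry in ALLOWED_NULLS allow no nulls.
        if is_null and value not in ALLOWED_NULLS.get(attr, set()):
            bad.append(attr)
    return bad

violations = null_violations({"A": "GETDATE", "B": "U"})
```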
Value Rules
• Value restriction rule
Restrict GRADE: value >= 'A' AND value <= 'F' AND value != 'E'
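The GRADE restriction translates almost directly into a predicate. A minimal sketch; the single-character check is an added assumption, since a plain string comparison would also accept values such as "AB".

```python
def grade_is_valid(value):
    """Apply the GRADE restriction: between 'A' and 'F' inclusive, excluding 'E'."""
    # The length check is an assumption beyond the rule as written on the slide.
    return len(value) == 1 and "A" <= value <= "F" and value != "E"

valid_grades = [g for g in "ABCDEFG" if grade_is_valid(g)]
```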
Domain Rules
• Domain Definition
• Domain Membership
• Domain Nonmembership
• Domain Assignment
Mapping Rules
• Mapping definition
• Mapping membership
• Mapping nonmembership
Relation Rules
• Completeness
• Exemption
• Consistency
• Derivation
Completeness
• Defines when a record is complete (i.e., what fields must be present)
IF (Orders.Total > 0.0), Complete With
{Orders.Billing_Street,
Orders.Billing_City,
Orders.Billing_State,
Orders.Billing_ZIP}
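A sketch of how the completeness rule might be tested against one order record. It assumes orders are dictionaries and that `None` or an empty string counts as a missing field; both choices, and the function name, are illustrative only.

```python
BILLING_FIELDS = ("Billing_Street", "Billing_City", "Billing_State", "Billing_ZIP")

def missing_for_completeness(order):
    """Return the billing fields that are missing when Total > 0.0."""
    if order.get("Total", 0.0) > 0.0:
        return [f for f in BILLING_FIELDS if order.get(f) in (None, "")]
    return []  # the rule does not apply when the condition is false

# Hypothetical order, invented for illustration.
order = {"Total": 25.0, "Billing_Street": "1 Main St",
         "Billing_City": "Springfield", "Billing_State": "IL"}
missing = missing_for_completeness(order)
```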
Exemption
• Defines which fields may be missing
IF (Orders.Item_Class != “CLOTHING”) Exempt
{Orders.Color,
Orders.Size}
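The exemption rule can be sketched as the counterpart of a completeness check. The sketch assumes that Color and Size are otherwise expected to be present, which the slide implies but does not state; the function name is hypothetical.

```python
EXEMPTIBLE_FIELDS = ("Color", "Size")

def missing_unexempted(order):
    """Return Color/Size fields that are missing and not covered by the exemption."""
    if order.get("Item_Class") != "CLOTHING":
        return []  # exemption applies: these fields may be missing
    # Assumption: without the exemption, both fields are required.
    return [f for f in EXEMPTIBLE_FIELDS if order.get(f) in (None, "")]

book = {"Item_Class": "BOOK"}
shirt = {"Item_Class": "CLOTHING", "Color": "blue"}
```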
Consistency
• Define a relationship between attributes based on field content
– IF (Employees.title == “Staff Member”) Then (Employees.Salary >= 20000 AND Employees.Salary < 30000)
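The consistency rule reads as a conditional check: it constrains the salary only when the title matches. A minimal sketch, assuming employee records are dictionaries keyed by the attribute names in the rule.

```python
def salary_consistent(employee):
    """Check the Staff Member salary band; other titles are not constrained by this rule."""
    if employee["title"] == "Staff Member":
        return 20000 <= employee["Salary"] < 30000
    return True

# Hypothetical employees, invented for illustration.
ok = salary_consistent({"title": "Staff Member", "Salary": 25000})
too_high = salary_consistent({"title": "Staff Member", "Salary": 30000})
```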
Derivation
• Prescriptive form of consistency rule
• Details how one attribute’s value is determined based on other attributes
IF (Orders.NumberOrdered > 0) Then {
Orders.Total = (Orders.NumberOrdered * Orders.Price) * 1.05
}
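Where the consistency rule checks a value, the derivation rule computes it. A sketch, assuming dictionary records and returning a copy rather than modifying the input; the function name is hypothetical, and the 1.05 factor is taken from the rule as written.

```python
def derive_total(order):
    """Return a copy of the order with Total derived when NumberOrdered > 0."""
    derived = dict(order)
    if derived["NumberOrdered"] > 0:
        derived["Total"] = (derived["NumberOrdered"] * derived["Price"]) * 1.05
    return derived

# Hypothetical order, invented for illustration.
priced = derive_total({"NumberOrdered": 2, "Price": 10.0})
```

The same function doubles as a validator: compare a stored Total against the derived one and report any difference beyond rounding.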
Table and Cross-Table Rules
• Functional Dependence
• Primary Key Assertion
• Foreign Key Assertion (=referential integrity)
Functional Dependence
• Functional Dependence between columns X and Y:
– For any two records R1 and R2 in a table, if field X of record R1 contains value x and field X of record R2 contains the same value x, then if field Y of record R1 contains the value y, field Y of record R2 must also contain the value y.
• In other words, attribute Y is said to be determined by attribute X.
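A functional dependence can be tested by remembering the first Y value seen for each X value and reporting any later disagreement. This is a sketch, assuming the table is a list of dictionaries; the column names and rows are invented.

```python
def fd_violations(rows, x, y):
    """Return the sorted X values that appear with more than one Y value."""
    first_seen = {}
    violations = set()
    for row in rows:
        if row[x] in first_seen and first_seen[row[x]] != row[y]:
            violations.add(row[x])
        first_seen.setdefault(row[x], row[y])
    return sorted(violations)

# Hypothetical rows, invented for illustration.
rows = [
    {"ZIP": "10001", "City": "New York"},
    {"ZIP": "60601", "City": "Chicago"},
    {"ZIP": "10001", "City": "Newark"},
]
bad = fd_violations(rows, "ZIP", "City")
```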
Primary Key Assertion
• A set of attributes defined as a primary key must uniquely identify a record
• Enforcement = testing for duplicates across defined key set
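Testing for duplicates across a key set amounts to counting key tuples. A minimal sketch, assuming rows are dictionaries; `duplicate_keys` and the sample rows are hypothetical.

```python
from collections import Counter

def duplicate_keys(rows, key_fields):
    """Return the key tuples that occur in more than one row."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]

# Hypothetical order lines, invented for illustration.
lines = [
    {"order_id": 1, "line": 1},
    {"order_id": 1, "line": 2},
    {"order_id": 1, "line": 1},
]
dupes = duplicate_keys(lines, ("order_id", "line"))
```

An empty result means the chosen attributes uniquely identify every record in the sample.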
Foreign Key Assertion
• When the values in field f of table T are chosen from the key values in field g of table S, field T.f is said to be a foreign key referencing field S.g
• If f is a foreign key, the key must exist in table S, column g (=referential integrity)
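Referential integrity can be checked by collecting the key values of S.g and listing every value of T.f that is not among them. A sketch under the assumption that both tables are lists of dictionaries; the table and column names are invented.

```python
def orphaned_values(t_rows, f, s_rows, g):
    """Return values of T.f that do not appear as a key value in S.g."""
    keys = {row[g] for row in s_rows}
    return [row[f] for row in t_rows if row[f] not in keys]

# Hypothetical tables, invented for illustration.
customers = [{"id": 1}, {"id": 2}]                                 # table S, column g = "id"
orders = [{"customer_id": 1}, {"customer_id": 7}]                  # table T, column f = "customer_id"
orphans = orphaned_values(orders, "customer_id", customers, "id")
```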
In-process Directives
• Definition directives (labeling information chain members)
• Measurement directives
• Trigger directives
Operational Directives
• Transformation
• Update
Other Rules
• Approximate Searching rules
• Approximate Matching rules