Data Warehousing: Naveed Iqbal, Assistant Professor, NUCES, Islamabad (Lecture Slides Week # 13)

DWH Fall2010 Lecture Slides Week13


Page 1: DWH Fall2010 Lecture Slides Week13

Data Warehousing

Naveed Iqbal, Assistant Professor
NUCES, Islamabad

(Lecture Slides Week # 13)

Page 2: DWH Fall2010 Lecture Slides Week13

Data Duplication Elimination & BSN Method

Page 3: DWH Fall2010 Lecture Slides Week13

Data Duplication

Why is data duplicated?
• A data warehouse is created from heterogeneous sources, with heterogeneous databases (different schema / representation) of the same entity.
• Data coming from outside the organization that owns the DWH can be of even lower quality, i.e. different representations for the same entity, transcription or typographical errors.

Problems due to data duplication
Data duplication can result in costly errors, such as:
• False frequency distributions
• Incorrect aggregates due to double counting
• Difficulty with catching fabricated identities by credit card companies
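The double-counting problem can be illustrated with a minimal sketch. The records, the amounts, and the rule of matching on (name, phone) are all hypothetical, chosen only to show how a duplicate customer number distorts an aggregate.

```python
# Hypothetical records: the same person was loaded twice under different customer numbers.
orders = [
    {"cust_no": 780701, "name": "M. Ismail Siddiqi", "phone": "021.666.1244", "amount": 100},
    {"cust_no": 780203, "name": "M. Ismail Siddiqi", "phone": "021.666.1244", "amount": 250},
    {"cust_no": 555001, "name": "Khan Muhammad", "phone": "051.111.2222", "amount": 400},
]

# Counting by customer number treats the duplicate as two separate customers.
customers_by_number = {o["cust_no"] for o in orders}

# Counting by (name, phone) collapses the duplicate (an assumed matching rule).
customers_by_identity = {(o["name"], o["phone"]) for o in orders}

total = sum(o["amount"] for o in orders)
avg_spend_naive = total / len(customers_by_number)      # understated per-customer spend
avg_spend_deduped = total / len(customers_by_identity)  # spend per real customer
```

With the duplicate counted twice, the average spend per customer comes out lower than it really is.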

Page 4: DWH Fall2010 Lecture Slides Week13

Data Duplication: Non-Unique PK

• Multiple Customer Numbers

Name               Phone Number   Cust. No.
M. Ismail Siddiqi  021.666.1244   780701
M. Ismail Siddiqi  021.666.1244   780203
M. Ismail Siddiqi  021.666.1244   780009

• Multiple Employee Numbers

Bonus Date  Name           Department  Emp. No.
Jan. 2000   Khan Muhammad  213 (MKT)   5353536
Dec. 2001   Khan Muhammad  567 (SLS)   4577833
Mar. 2002   Khan Muhammad  349 (HR)    3457642

Unable to determine customer relationships (CRM)
Unable to analyze employee benefits trends

Page 5: DWH Fall2010 Lecture Slides Week13

Data Duplication: House Holding

Group together all records that belong to the same household.

………  S. Ahad       440, Munir Road, Lahore
………  ………….…        ………………………………
………  Shiekh Ahad   No. 440, Munir Rd, Lhr
………  Shiekh Ahed   House # 440, Munir Road, Lahore
………  ………….…        ………………………………

Why bother?
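One possible way to group the three sample records into a household is to normalize the address before grouping. This is only a sketch: the abbreviation table and the list of filler words are assumptions made for this example, and a real system would rely on a postal reference file.

```python
import re
from collections import defaultdict

# Assumed abbreviation table and filler words, for illustration only.
ABBREVIATIONS = {"rd": "road", "lhr": "lahore"}
NOISE_TOKENS = {"no", "house"}

def normalize_address(address: str) -> str:
    """Lowercase, drop punctuation and filler words, expand abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens if t not in NOISE_TOKENS]
    return " ".join(tokens)

records = [
    ("S. Ahad", "440, Munir Road, Lahore"),
    ("Shiekh Ahad", "No. 440, Munir Rd, Lhr"),
    ("Shiekh Ahed", "House # 440, Munir Road, Lahore"),
]

# Group names by normalized address: one key per household.
households = defaultdict(list)
for name, address in records:
    households[normalize_address(address)].append(name)
```

All three address variants normalize to the same string, so the three records land in a single household.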

Page 6: DWH Fall2010 Lecture Slides Week13

Data Duplication: Individualization

Identify multiple records in each household which represent the same individual.

………  M. Ahad    440, Munir Road, Lahore
………  ………….…     ………………………………
………  Maj Ahad   440, Munir Road, Lahore

Address field is standardized. By coincidence?

Page 7: DWH Fall2010 Lecture Slides Week13

Overview of the Basic Concept

In its simplest form, there is an identifying attribute (or combination) per record for identification.

Records can be from a single source or from multiple sources sharing the same PK or other common unique attributes.

Sorting is performed on the identifying attributes and neighboring records are checked.

What if there are no common attributes, or the data is dirty?
The degree of similarity is measured numerically; different attributes may contribute differently.
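A numerical similarity in which attributes contribute differently can be sketched as a weighted sum of per-field scores. The weights and the use of `difflib.SequenceMatcher` as the string measure are assumptions for illustration; the slides do not prescribe a specific measure.

```python
from difflib import SequenceMatcher

# Assumed weights: how much each attribute contributes to the overall score.
WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}

def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] from difflib's ratio (one of many possible measures)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_similarity(r1: dict, r2: dict) -> float:
    """Weighted sum of per-attribute similarities."""
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in WEIGHTS.items())

a = {"name": "Shiekh Ahad", "address": "440 Munir Road Lahore", "phone": "021.666.1244"}
b = {"name": "Shiekh Ahed", "address": "440 Munir Road Lahore", "phone": "021.666.1244"}
c = {"name": "Saiam Noor", "address": "5 Afshan Colony Lahore", "phone": "042.555.0000"}

score_ab = record_similarity(a, b)  # near-duplicate: one letter differs in the name
score_ac = record_similarity(a, c)  # unrelated record
```

A threshold on the score then decides whether two records are treated as candidates for merging; choosing that threshold is a tuning decision.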

Page 8: DWH Fall2010 Lecture Slides Week13

Basic Sorted Neighborhood (BSN) Method

Concatenate data into one sequential list of N records.

Step 1: Create Keys
• Compute a key for each record in the list by extracting relevant fields or portions of fields
• Effectiveness of this method highly depends on a properly chosen key

Step 2: Sort Data
• Sort the records in the data list using the key of step 1

Step 3: Merge
• Move a fixed size window through the sequential list of records, limiting the comparisons for matching records to those records in the window
• If the size of the window is w records, then every new record entering the window is compared with the previous w-1 records.
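The three steps can be sketched as follows. The key recipe (`make_key`) and the matching rule (`is_match`, a string-similarity threshold) are placeholders assumed for this example; only the create-keys, sort, and sliding-window structure comes from the slide.

```python
from difflib import SequenceMatcher

def make_key(record: dict) -> str:
    """Step 1: build a sort key from portions of fields (an assumed recipe)."""
    return (record["last"][:3] + record["first"][:3]).upper()

def is_match(r1: dict, r2: dict, threshold: float = 0.85) -> bool:
    """Stand-in matcher: string similarity of the full names (assumed rule)."""
    n1 = f'{r1["first"]} {r1["last"]}'.lower()
    n2 = f'{r2["first"]} {r2["last"]}'.lower()
    return SequenceMatcher(None, n1, n2).ratio() >= threshold

def sorted_neighborhood(records: list, w: int) -> tuple:
    """Return the sorted list and the index pairs judged to be duplicates."""
    ordered = sorted(records, key=make_key)           # Step 2: sort by key
    pairs = []
    for i, rec in enumerate(ordered):                 # Step 3: slide the window
        # Compare the record entering the window with the previous w-1 records.
        for j in range(max(0, i - (w - 1)), i):
            if is_match(ordered[j], rec):
                pairs.append((j, i))
    return ordered, pairs

records = [
    {"first": "Noman", "last": "Abdullah"},
    {"first": "Saiam", "last": "Noor"},
    {"first": "Noma", "last": "Abdullah"},
    {"first": "Khan", "last": "Muhammad"},
]
ordered, pairs = sorted_neighborhood(records, w=2)
```

Each record is compared with at most w-1 predecessors, so the merge step makes on the order of N·w comparisons instead of comparing all pairs. The price is that duplicates whose keys sort far apart are never compared.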

Page 9: DWH Fall2010 Lecture Slides Week13

BSN Method: Sliding Window

[Figure: a window sliding over the sorted list of records, showing the current window of w records and the next window of w records.]

Page 10: DWH Fall2010 Lecture Slides Week13

BSN Method: Selection of Keys

Selection of Keys
• Effectiveness is highly dependent on the key selected to sort the records, e.g. middle name vs. family name
• A key is a sequence of a subset of attributes, or sub-strings within the attributes, chosen from the record
• The keys are used for sorting the entire dataset with the intention that matched candidates will appear close to each other

First     Middle  Address           NID       Key
Muhammed  Ahmad   440 Munir Road    34535322  AHM440MUN345
Muhammad  Ahmad   440 Munir Road    34535322  AHM440MUN345
Muhammed  Ahmed   440 Munir Road    34535322  AHM440MUN345
Muhammad  Ahmar   440 Munawar Road  34535334  AHM440MUN345
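The keys in the table are consistent with taking the first three letters of the middle name, the house number, the first three letters of the street name, and the first three digits of the NID. That recipe is inferred from the sample rows and is not stated on the slide; the sketch below simply reproduces it.

```python
def make_key(first: str, middle: str, address: str, nid: str) -> str:
    """Build a sort key from sub-strings of the attributes.

    Recipe inferred from the sample rows (an assumption): 3 letters of the
    middle name + house number + 3 letters of the street + 3 digits of the NID.
    The first name is accepted but not used, as in the table.
    """
    house_no, street = address.split()[:2]
    return (middle[:3] + house_no + street[:3] + nid[:3]).upper()

rows = [
    ("Muhammed", "Ahmad", "440 Munir Road", "34535322"),
    ("Muhammad", "Ahmad", "440 Munir Road", "34535322"),
    ("Muhammed", "Ahmed", "440 Munir Road", "34535322"),
    ("Muhammad", "Ahmar", "440 Munawar Road", "34535334"),
]
keys = [make_key(*row) for row in rows]
```

Using only prefixes makes the key tolerant of spelling variation (Ahmad / Ahmed), but it also gives the fourth row, a different person on a different street, the same key. The key only brings candidates together; the matching step still has to decide.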

Page 11: DWH Fall2010 Lecture Slides Week13

BSN Method: Problem with keys

Since the data is dirty, the keys WILL also be dirty, and matching records will not come together.

Data becomes dirty due to data entry errors or the use of abbreviations. Some real examples are as follows:

• Technology
• Tech.
• Techno.
• Tchnlgy

The solution is to use external standard source files to validate the data and resolve any data conflicts.
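Validation against a standard source can be sketched as a lookup of known variants. The in-memory table below stands in for an external reference file and is an assumption of this example.

```python
# Assumed reference table; in practice this would be loaded from an
# external standard source file maintained outside the warehouse.
STANDARD_FORMS = {
    "technology": "Technology",
    "tech.": "Technology",
    "techno.": "Technology",
    "tchnlgy": "Technology",
}

def standardize(value: str) -> str:
    """Map a known variant to its standard form; leave unknown values unchanged."""
    return STANDARD_FORMS.get(value.strip().lower(), value)

variants = ["Technology", "Tech.", "Techno.", "Tchnlgy", "Finance"]
cleaned = [standardize(v) for v in variants]
```

Standardizing before the keys are built means all four variants contribute the same characters to the sort key.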

Page 12: DWH Fall2010 Lecture Slides Week13

BSN Method: Problem with keys

If the contents of fields are not properly ordered, similar records will NOT fall in the same window.

Example: Records 1 and 2 are similar but will occur far apart.

No  Name             Address                                      Gender
1   N. Jaffri, Syed  No. 420, Street 15, Chaklala 4, Rawalpindi   M
2   S. Noman         420, Scheme 4, Rwp                           M
3   Saiam Noor       Flat 5, Afshan Colony, Saidpur Road, Lahore  F

The solution is to TOKENize the fields, i.e. break them further. Use the tokens in different fields for sorting to fix the error.

Example: Using either the name or the address field, records 1 and 2 will fall close.

No  Name           Address                                   Gender
1   Syed N Jaffri  420 15 4 Chaklala No Rawalpindi Street    M
2   Syed Noman     420 4 Rwp Scheme                          M
3   Saiam Noor     5 Afshan Colony Flat Lahore Road Saidpur  F
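The tokenized addresses in the second table are consistent with a simple rule: strip punctuation, keep numeric tokens first in their original order, then list the remaining words alphabetically. That rule is inferred from the example rows rather than stated on the slide; a minimal sketch of it:

```python
import re

def tokenize_address(address: str) -> str:
    """Split into tokens, then put numbers first and words in alphabetical order."""
    tokens = re.findall(r"[A-Za-z0-9]+", address)
    numbers = [t for t in tokens if t.isdigit()]             # original order kept
    words = sorted((t for t in tokens if not t.isdigit()), key=str.lower)
    return " ".join(numbers + words)

addresses = [
    "No. 420, Street 15, Chaklala 4, Rawalpindi",
    "420, Scheme 4, Rwp",
    "Flat 5, Afshan Colony, Saidpur Road, Lahore",
]
tokenized = [tokenize_address(a) for a in addresses]
```

Sorting on the tokenized address places records 1 and 2 next to each other, because both now start with the house number 420.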

Page 13: DWH Fall2010 Lecture Slides Week13

BSN Method: Matching Candidates

Merging of records is a complex inferential process.

Example-1: Two persons with names spelled nearly but not identically have the exact same address. We infer they are the same person, i.e. Noma Abdullah and Noman Abdullah.

Example-2: Two persons have the same National ID number but their names and addresses are completely different. We infer either the same person who changed his name and moved, or that the records represent different persons and the NID is incorrect for one of them.

Use of further information such as age, gender etc. can alter the decision.

Example-3: Noma-F and Noman-M: we could perhaps infer that Noma and Noman are siblings, i.e. brother and sister. Noma-30 and Noman-5: i.e. mother and son.
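The three examples can be read as an ordered set of rules. The sketch below encodes them under stated assumptions: the 0.9 name-similarity threshold, the decision labels, and the field names are all hypothetical, and the rule set is deliberately much simpler than a real matcher.

```python
from difflib import SequenceMatcher

def names_nearly_equal(a: str, b: str, threshold: float = 0.9) -> bool:
    """Assumed test for 'spelled nearly but not identically'."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def classify(r1: dict, r2: dict) -> str:
    """Apply the rules in order and return a decision label."""
    same_address = r1["address"] == r2["address"]
    similar_name = names_nearly_equal(r1["name"], r2["name"])
    if similar_name and same_address:
        # Example-3: a conflicting gender overrides the name/address evidence.
        if r1.get("gender") and r2.get("gender") and r1["gender"] != r2["gender"]:
            return "different persons, same household"
        return "same person"                      # Example-1
    if r1["nid"] == r2["nid"]:
        return "needs review"                     # Example-2: ambiguous
    return "different persons"

noma = {"name": "Noma Abdullah", "address": "440 Munir Road", "nid": "34535322"}
noman = {"name": "Noman Abdullah", "address": "440 Munir Road", "nid": "34535323"}
other = {"name": "Khan Muhammad", "address": "5 Afshan Colony", "nid": "34535322"}

decision_1 = classify(noma, noman)
decision_2 = classify(noma, other)
decision_3 = classify({**noma, "gender": "F"}, {**noman, "gender": "M"})
```

Adding one extra attribute (gender) changes the outcome for the same pair of names and addresses, which is the point of Example-3.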

Page 14: DWH Fall2010 Lecture Slides Week13

Introduction to Data Quality Management (DQM)

Page 15: DWH Fall2010 Lecture Slides Week13

What is Quality?

Informally
• Some things are better than others, i.e. they are of higher quality. How much “better” is better?
• Is the right item the best item to purchase? How about after the purchase?
• What is quality of service? The bank example

Formally
• “Quality is conformance to requirements” / “Degree of excellence”

Example:
• Quality means meeting customer’s needs, not necessarily exceeding them.
• Quality means improving things customers care about, because that makes their lives easier and more comfortable.

Page 16: DWH Fall2010 Lecture Slides Week13

What is Data Quality?

What is Data?

[Figure: a person, Muhammad Khan, described by the data items Emp_ID = 440, Gender = Male, Age = 35 yrs, Height = 5’8”, Weight = 160 lbs.]

All data is an abstraction of something real.

Intrinsic Data Quality
• Electronic reproduction of reality.

Realistic Data Quality
• Degree of utility or value of data to business.

Page 17: DWH Fall2010 Lecture Slides Week13

Data Quality & Organizations

Intelligent Learning Organization:
High-quality data is an open, shared resource with value-adding processes.

The Dysfunctional Learning Organization:
Low-quality data is a proprietary resource with cost-adding processes.

Page 18: DWH Fall2010 Lecture Slides Week13

Orr’s Laws of Data Quality

Law #1 - “Data that is not used cannot be correct!”

Law #2 - “Data quality is a function of its use, not its collection!”

Law #3 - “Data will be no better than its most stringent use!”

Law #4 - “Data quality problems increase with the age of the system!”

Law #5 - “The less likely something is to occur, the more traumatic it will be when it happens!”

Page 19: DWH Fall2010 Lecture Slides Week13

Total Quality Control / Management (TQM)

Philosophy of involving all concepts for systematic and continuous improvement.

It is customer oriented. Why?

TQM incorporates the concept of product quality, process control, quality assurance, and quality improvement.

Quality assurance is NOT Quality improvement.

Page 20: DWH Fall2010 Lecture Slides Week13

Cost of Fixing Data Quality

[Figure: cost of achieving quality plotted against quality level, from lowest quality to highest quality, showing an exponential rise in cost.]

Defect minimization is economical.
Defect elimination is very, very expensive.

Page 21: DWH Fall2010 Lecture Slides Week13

Cost of Data Quality Defects

Controllable Costs
• Recurring costs for analyzing, correcting, and preventing data errors

Resultant Costs
• Internal and external failure costs of business / opportunities missed

Equipment & Training Costs

Page 22: DWH Fall2010 Lecture Slides Week13

Characteristics or Dimensions of Data Quality

Data Quality Characteristic  Definition
Accuracy          Qualitatively assessing lack of error, high accuracy corresponding to small error.
Completeness      The degree to which values are present in the attributes that require them.
Consistency       A measure of the degree to which a set of data satisfies a set of constraints.
Timeliness        A measure of how current or up-to-date the data is.
Uniqueness        The state of being only one of its kind or being without an equal or parallel.
Interpretability  The extent to which data is in appropriate languages, symbols, and units, and the definitions are clear.
Accessibility     The extent to which data is available, or easily and quickly retrievable.
Objectivity       The extent to which data is unbiased, unprejudiced, and impartial.
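Some of these dimensions can be turned into simple ratios. The sketch below measures completeness, uniqueness, and consistency on a few made-up employee rows; the 0 to 120 age range is an assumed constraint, and the exact formulas are one reasonable reading of the definitions, not a standard.

```python
records = [
    {"emp_id": 440, "name": "Muhammad Khan", "age": 35, "gender": "M"},
    {"emp_id": 441, "name": "Saiam Noor", "age": None, "gender": "F"},     # missing age
    {"emp_id": 440, "name": "Muhammad Khan", "age": 35, "gender": "M"},    # duplicate
    {"emp_id": 442, "name": "Noman Abdullah", "age": 230, "gender": "M"},  # out of range
]

def completeness(rows, field):
    """Share of rows in which a required field has a value."""
    return sum(r[field] is not None for r in rows) / len(rows)

def uniqueness(rows, field):
    """Number of distinct values of the field divided by the number of rows."""
    return len({r[field] for r in rows}) / len(rows)

def consistency(rows, constraint):
    """Share of rows satisfying a constraint (a predicate on the row)."""
    return sum(constraint(r) for r in rows) / len(rows)

def age_in_range(row):
    """Assumed business constraint: age, when present, lies between 0 and 120."""
    return row["age"] is None or 0 <= row["age"] <= 120

age_completeness = completeness(records, "age")
id_uniqueness = uniqueness(records, "emp_id")
age_consistency = consistency(records, age_in_range)
```

Each metric flags a different row here: the missing age lowers completeness, the repeated Emp_ID lowers uniqueness, and the impossible age lowers consistency.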

Page 23: DWH Fall2010 Lecture Slides Week13

Completeness vs. Accuracy

95% accurate and 100% complete
OR
100% accurate and 95% complete

Which is better?

Depends on (i) data quality tolerances, (ii) the corresponding application and (iii) the cost of achieving that data quality vs. (iv) the business value.

Page 24: DWH Fall2010 Lecture Slides Week13

Data Quality Management Process

Establish TDQM Environment

Scope Data Quality Projects & Develop Implementation Plans

Evaluate Data Quality Management Methods

Implement Data Quality Projects (Define, Measure, Analyze, Improve)

Page 25: DWH Fall2010 Lecture Slides Week13

Data Quality Management Process

Establish Data Quality Management Environment
• Information System Project Managers
• Development Professionals
• Functional users of legacy information systems with domain knowledge
• IS developers know solutions but don’t know how and where to modify

Page 26: DWH Fall2010 Lecture Slides Week13

Data Quality Management Process

Scope Data Quality Projects & Develop Implementation Plans

• Task Summary: Project goals, scope, and potential benefits
• Task Description: Describe data quality analysis tasks
• Project Approach: Summarize tasks and tools used to provide a baseline of existing data quality
• Schedule: Identify task start, completion dates, and project milestones
• Resources: Include costs connected with tools acquisition, labor hours (by labor category), training, travel, and other direct and indirect costs

Page 27: DWH Fall2010 Lecture Slides Week13

Data Quality Management Process

Implement Data Quality Projects (Define, Measure, Analyze, Improve)

• Plan / Define: Identify functional user DQ requirements and establish DQ metrics
• Do / Measure: Measure conformance to current business rules and develop exception reports
• Check / Analyze: Verify, validate, and assess poor DQ causes. Define improvement opportunities
• Act / Improve: Select/prioritize DQ improvement opportunities, i.e. data entry procedures, updating data validation rules, and/or company data standards.
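The Define and Measure steps can be sketched as a set of business rules applied to each record, producing an exception report and a conformance metric. The rules, field names, and sample rows below are hypothetical.

```python
# Assumed business rules: each maps a rule name to a predicate on a record.
RULES = {
    "emp_id is present": lambda r: r.get("emp_id") is not None,
    "age between 18 and 65": lambda r: r.get("age") is not None and 18 <= r["age"] <= 65,
    "gender is M or F": lambda r: r.get("gender") in {"M", "F"},
}

def exception_report(rows):
    """Measure: list (row index, rule name) for every rule a row violates."""
    return [(i, name) for i, r in enumerate(rows)
            for name, check in RULES.items() if not check(r)]

def conformance(rows):
    """DQ metric: share of rows that violate no rule."""
    bad_rows = {i for i, _ in exception_report(rows)}
    return 1 - len(bad_rows) / len(rows)

rows = [
    {"emp_id": 440, "age": 35, "gender": "M"},
    {"emp_id": None, "age": 17, "gender": "F"},
    {"emp_id": 442, "age": 40, "gender": "X"},
    {"emp_id": 443, "age": 52, "gender": "F"},
]
report = exception_report(rows)
score = conformance(rows)
```

The report feeds the Analyze step (why do these rows fail?), and the conformance score can be tracked over time to check whether the Improve step had an effect.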

Page 28: DWH Fall2010 Lecture Slides Week13

Data Quality Management Process

Evaluate Data Quality Management Methods
• Modifying existing methods of DQ management
• Determining whether DQ projects have helped to achieve demonstrable goals and benefits
• Evaluating and assessing DQ work, as it is not a program but a new way of doing business

Page 29: DWH Fall2010 Lecture Slides Week13

How to improve Data Quality?

The four categories of Data Quality Improvement

• Process
• System
• Policy & Procedure
• Data Design

Page 30: DWH Fall2010 Lecture Slides Week13

Quality Management Maturity Grid

CMM Level-1: Uncertainty

CMM Level-2: Awakening

CMM Level-3: Enlightenment

CMM Level-4: Wisdom

CMM Level-5: Certainty

Page 31: DWH Fall2010 Lecture Slides Week13

Misconceptions on Data Quality

You Can Fix Data
• The problem is NOT in the data, but in how it was used.
• It is NOT a one-time process.
• Buying a cleansing tool is NOT the solution.
• Some live with the problem because they can’t afford the tool.

Data Quality is an IT Problem
• It is the company’s problem.
• Define the metrics of quality.
• Business has to strike a balance between quality and ROI.
• Joint business and IT effort.

Page 32: DWH Fall2010 Lecture Slides Week13

Misconceptions on Data Quality

(All) Problem is in the Data Sources or Data Entry
• NOT the only problem.
• Systems could be responsible, but actually it is the metrics.
• Two divisions using different codes for the same entity.
• Need to track, trace, and check data from creation to usage.

The Data Warehouse will provide a single source of truth
• In an ideal world it is indeed true.
• In the real world there may be multiple data warehouses, data marts, and external sources, i.e. silos of data resulting in multiple sources of “truth”.
• Even with a single source of truth, it is an issue if transformations and interpretations are different.